📄 lexer.otx

📁 inger小型c编译器源码
💻 OTX
📖 第 1 页 / 共 3 页
字号:
12 3 下一页
% lexer.tex% Practical Compiler Construction% Chapter: Lexical Analyzer\chapter{Lexical Analyzer}        \section{Introduction}    %Introduce the concept of a lexer.		The first step in the compiling process involves reading source 		code, so that the compiler can check that source code for errors 		before translating it to, for example, assembly language. 		All programming languages provide an array of keywords, like 		\verb|IF|, \verb|WHILE|, \verb|SWITCH| and so on. A compiler 		is not usually interested in the individual characters that 		make up these keywords; these keywords are said to be atomic. 		However, in some cases the compiler does care about the individual 		characters that make up a word: an integer number 		(e.g. \verb|12345|), a string (e.g. \verb|"hello, world"|) and a 		floating point number (e.g. \verb|12e-09|) are all considered 		to be words, but the individual characters that make them up 		are significant.			This distinction requires special processing of the input text, 		and this special processing is usually moved out of the parser 		and placed in a module called the \emph{lexical analyzer}, or 		\emph{lexer} or \emph{scanner} for short. It is the lexer's 		responsibility to divide the input stream into \emph{tokens}		(atomic words). The parser (the module that deals with groups 		of tokens, checking that their order is valid) requests a token 		from the lexer, which reads characters from the input stream 		until it has accumulated enough characters to form a complete 		token, which it returns to the parser.		\begin{example} \label{example:tokenizing}			Given the input 			\begin{quote}				\verb|the quick brown fox jumps over the lazy dog|			\end{quote}			a lexer will split this into the tokens 			\begin{quote}				\verb|the|, \verb|quick|, \verb|brown|, 				\verb|fox|, \verb|jumps|, \verb|over|, 				\verb|the|, \verb|lazy| and \verb|dog|			\end{quote}		\end{example}				The process of splitting input into tokens is called 		\emph{tokenizing}. Apart from tokenizing a sentence, a lexer 		can also divide tokens up into classes. Consider the following 		example:			\begin{example} \label{example:tokenclasses}			Given the input			\begin{quote}				\verb|the sum of 2 + 2 = 4.|			\end{quote}			a lexer will split this into the following tokens, with classes:			\begin{quote}\begin{verbatim}				Word:   the				Word:   sum				Word:   of				Number: 2				Plus:   +				Number: 2				Equals: =				Number: 4				Dot:    .			\end{verbatim}\end{quote}		\end{example}			Some token classes are very narrow (containing only one token), 		while others are broad. For example, the token class \emph{Word}		is used to represent \verb|the|, \verb|sum| and \verb|of|, while 		the token class \emph{Dot} can only be used when a \verb|.| is 		read. Incidentally, the lexical analyzer must know how to 		separate individual tokens.  In program source text, keywords 		are usually separated by whitespace (spaces, tabs and line feeds). 		However, this is not always the case. Consider the following input:				\begin{example} \label{example:tokenseparation} 			Given the input			\begin{quote}				\verb|sum=(2+2)*3;|			\end{quote}			a lexer will split this into the following tokens:			\begin{quote}				\verb|sum|, \verb|=|, \verb|(|, \verb|2|, \verb|+|,				\verb|2|, \verb|)|, \verb|*|, \verb|3| and \verb|;|			\end{quote}		\end{example}				Those familiar with popular programming languages like C 		or Pascal may know that mathematical tokens like numbers, 		\verb|=|, \verb|+| and \verb|*| are not required to be separated 		from each other by whitespace. The lexer must have some way to 		know when a token ends and the next token (if any) begins. In 		the next section, we will discuss the theory of regular languages 		to further clarify this point and how lexers deal with it.			Lexers have an additional interesting property: they can be used 		to filter out input that is not important to the parser, so that 		the parser has less different tokens to deal with. Block comments 		and line comments are examples of uninteresting input.		\section{Regular Language Theory}	%This section explains all theory for regular languages in order	%to introduce the lexer regexps.		The lexical analyzer is a submodule of the parser. While the 		parser deals with \emph{context-free grammars} (a higher level of 		abstraction), the lexer deals with individual characters which 		form tokens (words). Some tokens are simple (\verb|IF| or 		\verb|WHILE|) while others are complex. All integer numbers, 		for example, are represented using the same token (\verb|INTEGER|), 		which covers many cases (\verb|1|, \verb|100|, \verb|5845|). 		This requires some notation to match all integer numbers, 		so they are treated the same.			The answer lies in realizing that the collection of integer 		numbers is really a small language, with very strict rules 		(it is a so-called \emph{regular language}). Before we can show		what regular languages are, we must must discuss some preliminary		definitions first.		    % Referentie naar hoofdstuknummer van hoofdstuk 'Grammars'		In the chapter on Grammars, we introduced the concept of an		\emph{alphabet}, denoted as $\Sigma$. An alphabet is rather like		the set of tokens a lexer can produce. We then defined the set		of strings $\Sigma^{*}$ that can be generated from an alphabet,		and said that a language is a subset of $\Sigma^{*}$. 				We now define, without proof, several operations that may be 		performed on languages. The first operation on languages that		we present is the binary \emph{concatenation} operation.				\begin{definition}[Concatenation operation] \label{def:languageconcatenation} \ \\			Let $X$ and $Y$ be two languages. Then $XY$ is the concatenation			of these languages with			\begin{center}				$XY = \{uv\ |\ u \in X\ \land\ v \in Y\}$.			\end{center}		\end{definition}				Concatenation of a language with itself is also possible and		is denoted $X^{2}$. Concatenation of a string can also be 		performed multiple times, e.g. $X^{7}$.			We will illustrate the definition of concatenation with an example.				\begin{example}[Concatenation operation] \label{example:languageconcatenation} \ \\			  Let $\Sigma$ be the alphabet $\{a, b, c\}$.			\\Let $X$ be the language over $\Sigma$ with $X = \{aa, bb\}$.			\\Let $Y$ be the language over $\Sigma$ with $Y = \{ca, b\}$.			\\Then $XY$ is the language $\{aaca,aab,bbca,bbb\}$.		\end{example}				The second operation that will need to define regular languages		is the binary \emph{union} operation.				\begin{definition}[Union operation] \label{def:languageunion} \ \\			Let $X$ and $Y$ be two languages. Then $X \cup Y$ is the 			union of these languages with			\begin{center}				$X \cup Y = \{u\ |\ u \in X\ \lor\ u \in Y\}$.			\end{center}				\end{definition}				Note that the priority of concatenation is higher than the priority		of union. Here is an example that shows how the union operation 		works:			\begin{example}[Union operation] \label{example:languageunion} \ \\			  Let $\Sigma$ be the alphabet $\{a, b, c\}$.		\\Let $X$ be the language over $\Sigma \{aa, bb\}$.		\\Let $Y$ be the language over $\Sigma \{ca, b\}$.		\\Then $X \cup Y$ is the language over $\Sigma \{aa, bb, ca, b\}$.		\end{example}				The final operation that we need to define is the unary 		\emph{Kleene star} operation.		% Footnote for Kleene			\begin{definition}[Kleene star] \label{def:languagekleenestar} \ \\			Let $X$ be a language. Then			\begin{equation}				X^{*} = \bigcup_{i=0}^{\infty} X^{i}			\end{equation}		\end{definition}				Or, in words: $X^{*}$ means that you can take 0 or 		more sentences from $X$ and concatenate them. The Kleene star 		operation is best clarified with an example.				\begin{example}[Kleene star] \label{example:languagekleenestar} \ \\			  Let $\Sigma$ be the alphabet $\{a, b\}$.		\\Let $X$ be the language over $\Sigma \{aa, bb\}$.		\\Then $X^{*}$ is the language $\{\lambda, aa, bb, aaaa, 		  aabb, bbaa, bbbb, \ldots\}$. 		\end{example}				There is also an extension to the Kleene star. $XX^{*}$ may 		be written $X^{+}$, meaning that at least one string from $X$		must be taken (whereas $X^{*}$ allows the empty string $\lambda$).				With these definitions, we can now give a definition for a 		regular language.					\begin{definition}[Regular languages] \label{def:regularlanguage}		    %Required space to stop enumerate from starting 		    %right after definition:			\ 			\begin{enumerate}				\item Basis: $\emptyset$, $\{\lambda\}$ and $\{a\}$ 				      are regular languages.				\item Recursive step: Let X and Y be regular languages. 				      Then				\begin{quote}					$X \cup Y$ is a regular language					\\$XY$ is a regular language					\\$X^{*}$ is a regular language				\end{quote}			\end{enumerate}		\end{definition}				Now that we have established what regular languages are, it is		important to note that lexical analyzer generators (software tools		that will be discussed below) use \emph{regular expressions}		to denote regular languages. Regular expressions are merely		another way of writing down regular languages. In regular 		expressions, it is customary to write the language consisting 		of a single string composed	of one word, $\{a\}$, as \mbox{\boldmath $a$}.				\begin{definition}[Regular expressions] \label{def:regularexpressions} \ \\		    Using recursion, we can define regular expressions as follows:			\begin{enumerate}				\item Basis: \mbox{\boldmath $\emptyset$}, \mbox{\boldmath $\lambda$} and					  \mbox{\boldmath $a$} are regular expressions.				\item Recursive step: Let X and Y be regular expressions. 				      Then				\begin{quote}					$X \cup Y$ is a regular expression					\\$XY$ is a regular expression					\\$X^{*}$ is a regular expression				\end{quote}			\end{enumerate}		\end{definition}				As you can see, the definition of regular expressions differs		from the definition of regular languages only by a notational		convenience (less braces to write).			So any language that can be composed of other regular languages or		expressions using concatenation, union, and the Kleene star, is also 		a regular language or expression. Note that the priority of the 		concatenation, Kleene Star and union operations are listed here		from highest to lowest priority.				\begin{example}[Regular expression] \label{example:sample_regular_expression} \ \\ 	    	Let \mbox{\boldmath $a$} and  \mbox{\boldmath $b$} be 	    	regular expressions by definition 	    	\ref{def:regularexpressions}(1). Then 	    	\mbox{\boldmath $ab$} is a regular expression by definition 	    	\ref{def:regularexpressions}(2) through concatenation. 	    	$($\mbox{\boldmath $ab$} $\cup$ \mbox{\boldmath $b$}$)$ is a 	    	regular expression by definition \ref{def:regularexpressions}(2) 	    	through union. 	    	$($\mbox{\boldmath $ab$} $\cup$ \mbox{\boldmath $b$}$)^{*}$ is a 	    	regular expression by definition \ref{def:regularexpressions}(2)	    	through union.			The sentences that can be generated by 			\mbox{$($\mbox{\boldmath $ab$} $\cup$ \mbox{\boldmath $b$}$)^{*}$} are 			$\{\lambda, ab, b, abb, bab, 			babab, \ldots\}$.		\end{example}			While context-free grammars are normally denoted using production		rules, for regular languages it is sufficient to use the easy		to read regular expressions. 				%\includegraphics{syntax.eps}		\section{Sample Regular Expressions}	%This section gives some sample regular expressions (non-UNIX).			In this section, we present a number of sample regular expressions		to illustrate the theory presented in the previous section. From now		on, we will now longer use bold \bold{$a$} to denote $\{a\}$, 		since we will soon move to UNIX regular expressions which do not		use bold either.				\begin{quote}			\begin{tabular}{lp{6cm}}				Regular expression	& Sentences generated\\				\hline				$q$					& the sentence $q$\\				$qqq$				& the sentence $qqq$\\				$q^{*}$				& all sentences of 0 or more $q$'s\\				$q^{+}$				& all sentences of 1 or more $q$'s\\				$q \cup \lambda$    & the empty sentence or $q$. Often denoted as $q?$ (see UNIX regular expressions).\\				$b^{*}((b^{+}a \cup \lambda)b^{*}$ & the collection of sentences that begin with 0 or more $b$'s, followed by either one or more $b$'s followed by an $a$, or nothing, followed by 0 or more $b$'s.\\			\end{tabular}		\end{quote}				These examples show that through repeated application of 		definition \ref{def:regularexpressions}, complex sequences can		be defined. This feature of regular expressions is used for		constructing lexical analyzers.	\section{UNIX Regular Expressions}	%This section explains the syntax for UNIX regexps.		Under UNIX, several extensions to regular expressions have been 		implemented that we can use. A UNIX regular expression\cite{regex}		is commonly called a \emph{regex} (multiple: \emph{regexes}).				There is no union operator on UNIX.	Instead, we supply a list of 		alternatives contained within square brackets.	    	    \begin{quote}			\verb|[abc]|	$\equiv (a \cup b \cup c)$		\end{quote}
12 3 下一页
⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -