To avoid having to type in all the individual letters when we
want to match all lowercase letters, the following syntax is
allowed:

\begin{quote}
    \verb|[a-z]| $\equiv$ \verb|[abcdefghijklmnopqrstuvwxyz]|
\end{quote}

UNIX does not have a $\lambda$ either. Here is the alternative
syntax:

\begin{quote}
    \verb|a?| $\equiv$ \verb|a| $\cup\ \lambda$
\end{quote}

Lexical analyzer \emph{generators} allow the user to directly specify
these regular expressions in order to identify lexical tokens
(atomic words that string together to make sentences). We will
discuss such a generator program shortly.

\section{States} \label{sec:states}
%This section discusses the lexer as a state machine.

With the theory of regular languages, we can now find out how a
lexical analyzer works. More specifically, we can see how the
scanner can divide the input \verb|(34+12)| into separate tokens.

Suppose the programming language for which we wish to write a
scanner consists only of sentences of the form
\verb|(|\emph{number}\verb|+|\emph{number}\verb|)|. Then we require the following
regular expressions to define the tokens.

\begin{quote}
    \begin{tabular}{ll}
        Token & Regular expression\\
        \hline
        \verb|(|        & \verb|(|\\
        \verb|)|        & \verb|)|\\
        \verb|+|        & \verb|+|\\
        \emph{number}   & \verb|[0-9]+|\\
    \end{tabular}
\end{quote}

A lexer uses states to determine which characters it can expect,
and which may not occur in a certain situation. For simple tokens
(\verb|(|, \verb|)| and \verb|+|) this is easy: either one of
these characters is read or it is not. For the \emph{number} token,
states are required.

As soon as the first digit of a number is read, the lexer enters
a state in which it expects more digits, and nothing else.
If another digit is read, the lexer remains in this state and
adds the digit to the token read so far. If something else (not
a digit) is read, the lexer knows the \emph{number} token is
finished and leaves the \emph{number} state, returning the token
to the caller (usually the parser). After that, it tries to
match the unexpected character (maybe a \verb|+|) to another token.

\begin{example}[States] \label{example:states} \ \\
    Let the input be \verb|(34+12)|. The lexer starts out in
    the \emph{base} state. For every character
    read from the input, the following table shows the state
    that the lexer is currently in and the action it performs.
\end{example}

\begin{quote}
    \begin{tabular}{llp{6cm}}
        Token read  & State & Action taken\\
        \hline
        \verb|(|    & \emph{base}   & Return \verb|(| to caller\\
        \verb|3|    & \emph{base}   & Save \verb|3|, enter \emph{number} state\\
        \verb|4|    & \emph{number} & Save \verb|4|\\
        \verb|+|    & \emph{number} & \verb|+| not expected. Leave \emph{number} state and return \verb|34| to caller\\
        \verb|+|    & \emph{base}   & Return \verb|+| to caller\\
        \verb|1|    & \emph{base}   & Save \verb|1|, enter \emph{number} state\\
        \verb|2|    & \emph{number} & Save \verb|2|\\
        \verb|)|    & \emph{number} & \verb|)| unexpected. Leave \emph{number} state and return \verb|12| to caller\\
        \verb|)|    & \emph{base}   & Return \verb|)| to caller\\
    \end{tabular}
\end{quote}
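To make the trace concrete, the following is a minimal sketch in plain C
of the two-state lexer described above. The token names, the global
variables and the \verb|next_token| interface are assumptions made for
this sketch; they are not part of the actual \emph{\langname} scanner.

\begin{verbatim}
    /* Sketch of the base/number state machine traced above.
     * Token names and the next_token interface are illustrative only.
     */
    #include <ctype.h>
    #include <stdio.h>

    enum token { LPAREN, RPAREN, PLUS, NUMBER, END };

    static const char *input_g = "(34+12)";
    static int pos_g = 0;
    static int value_g;              /* value of the last NUMBER read */

    enum token next_token( void )
    {
        char c = input_g[pos_g];

        /* base state: decide which token starts here */
        if( c == '\0' ) return( END );
        if( c == '(' ) { pos_g++; return( LPAREN ); }
        if( c == ')' ) { pos_g++; return( RPAREN ); }
        if( c == '+' ) { pos_g++; return( PLUS ); }

        if( isdigit( c ) )
        {
            /* number state: keep reading digits, nothing else */
            value_g = 0;
            while( isdigit( input_g[pos_g] ) )
                value_g = value_g * 10 + ( input_g[pos_g++] - '0' );
            return( NUMBER );  /* an unexpected character ends the token */
        }

        pos_g++;                     /* skip anything unrecognized */
        return( next_token() );
    }

    int main( void )
    {
        enum token t;
        while( ( t = next_token() ) != END )
        {
            if( t == NUMBER ) printf( "number(%d)\n", value_g );
            else              printf( "token %d\n", t );
        }
        return( 0 );
    }
\end{verbatim}

Each call to \verb|next_token| starts in the \emph{base} state; reading a
digit effectively moves the sketch into the \emph{number} state (the inner
loop), which it leaves again as soon as a non-digit appears.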
This example did not include whitespace (spaces, line feeds and tabs)
on purpose, since it tends to be confusing. Most scanners ignore
whitespace by matching it with a special regular expression and doing
nothing.

There is another rule of thumb used by lexical analyzer generators
(see the discussion of this software below): they always try to
return the longest token possible.

\begin{example} \label{example:tokenlength}\ \\
    \verb|=| and \verb|==| are both tokens. Now if \verb|=| was read
    and the next character is also \verb|=|, then \verb|==| will be
    returned instead of two separate \verb|=| tokens.
\end{example}

In summary, a lexer determines which characters are valid in the
input at any given time through a set of states, one of which is
the active state. Different states have different valid characters
in the input stream. Some characters cause the lexer to shift
from its current state into another state.

\section{Common Regular Expressions}
%This section shows some common regexps for integers,
%floating point numbers, strings and comments.

This section discusses some commonly used regular expressions for
interesting tokens, such as strings and comments.

\subsubsection{Integer numbers}
An integer number consists of only digits. It ends when a
non-digit character is encountered. The scanner must watch
out for an overflow, e.g. \verb|12345678901234| does not fit
in most programming languages' type systems and should cause
the scanner to generate an overflow error.

The regular expression for integer numbers is

\begin{quote}
    \verb|[0-9]+|
\end{quote}

This regular expression generates the collection of strings
containing at least one digit, and nothing but digits.

\begin{advice} \label{advice:lexeroverflow}
If the scanner generates an overflow or similar error,
parsing of the source code can continue (but no target code
can be generated). The scanner can just replace the faulty
value with a correct one, e.g. ``\verb|1|''.
\end{advice}

\subsubsection{Floating point numbers}
Floating point numbers have a slightly more complex syntax
than integer numbers. Here are some examples of floating
point numbers:

\begin{quote}
    \verb|1.0|, \verb|.001|, \verb|1e-09|, \verb|.2e+5|
\end{quote}

The regular expression for floating point numbers is:

\begin{quote}
    \verb|[0-9]* . [0-9]+ ( e [+-] [0-9]+ )?|
\end{quote}

Spaces were added for readability. These are not part of the
generated strings. The scanner should check each of the
subparts of the regular expression containing digits for
possible overflow.

\begin{advice} \label{advice:longregexps}
If a regular expression becomes too long or too complex, it is
possible to split it up into multiple regular expressions.
The lexical analyzer's internal state machine will still work.
\end{advice}
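As a small illustration of the overflow advice above, the following C
helper converts a matched digit string and substitutes the value
\verb|1| when the literal is out of range, so that parsing can continue.
The function name and the error message are assumptions made for this
sketch, not the actual \emph{\langname} code.

\begin{verbatim}
    /* Sketch: convert a matched integer lexeme, detecting overflow.
     * The function name and error reporting are illustrative only.
     */
    #include <errno.h>
    #include <limits.h>
    #include <stdio.h>
    #include <stdlib.h>

    int scan_integer( const char *lexeme )
    {
        char *end;
        long  result;

        errno  = 0;
        result = strtol( lexeme, &end, 10 );
        if( errno == ERANGE || result > INT_MAX || result < INT_MIN )
        {
            fprintf( stderr, "integer constant out of range: %s\n",
                     lexeme );
            result = 1;   /* substitute a correct value; parsing goes on */
        }
        return( (int) result );
    }
\end{verbatim}

In a flex action this helper would be called with \verb|yytext|, so an
oversized literal produces one error message and a usable dummy value
instead of aborting the scan.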
\subsubsection{Strings}
Strings are a token type that requires some special processing
by the lexer. This should become clear when we consider the
following sample input:

\begin{quote}
    \verb|"3+4"|
\end{quote}

Even though this input consists of numbers and the \verb|+|
operator, which may have regular expressions of their own,
the entire expression should be returned to the caller
since it is contained within double quotes. The trick to do
this is to introduce another state to the lexical analyzer,
called an \emph{exclusive state}. When in this state, the
lexer will process only regular expressions marked with this state.
The resulting regular expressions are these:

\begin{quote}
    \begin{tabular}{lp{6cm}}
        Regular expression & Action\\
        \hline
        \verb|"|                & Enter \emph{string} state\\
        \emph{string} \verb|.|  & Store character. A dot (\verb|.|) means anything. This regular expression is only considered when the lexer is in the \emph{string} state.\\
        \emph{string} \verb|"|  & Return to previous state. Return string contents to caller. This regular expression is only considered when the lexer is in the \emph{string} state.\\
    \end{tabular}
\end{quote}

\begin{advice} \label{advice:exlusivestates}
You can write code for exclusive states yourself (when writing
a lexical analyzer from scratch), but AT\&T lex and GNU flex
can do it for you.
\end{advice}

The regular expressions proposed above for strings do not heed
line feeds. You may want to disallow line feeds within strings,
though. Then you must add another regular expression that
matches the line feed character ($\backslash$n in some languages)
and generates an error when it is encountered within a string.

The lexer writer must also be wary of a buffer overflow: if
the program source code consists of a \verb|"| and hundreds of
thousands of letters (and no closing \verb|"|), a compiler
that does not check for buffer overflow conditions will
eventually crash for lack of memory. Note that you could
match strings using a single regular expression:

\begin{quote}
    \verb|"(.)*"|
\end{quote}

but the state approach makes it much easier to check for buffer
overflow conditions, since you can decide at any time whether the
current character must be stored or not.

\begin{advice} \label{advice:stringlimit}
To avoid a buffer overflow, limit the string length to about
\mbox{64 KB} and generate an error if more characters are read.
Skip all the offending characters until another \verb|"| is read
(or end of file).
\end{advice}

\subsubsection{Comments}
Most compilers delegate the job of filtering comments out of
the source code to the lexical analyzer. We can therefore
create some regular expressions that do just that. This once
again requires the use of an exclusive state. In
programming languages, the beginning and end of comments
are usually clearly marked:

\begin{quote}
    \begin{tabular}{ll}
        Language        & Comment style\\
        \hline
        C               & \verb|/* comment */|\\
        C++             & \verb|// comment (line feed)|\\
        Pascal          & \verb|{ comment }|\\
        BASIC           & \verb|REM comment :|\\
    \end{tabular}
\end{quote}

We can build our regular expressions around these delimiters.
Let's build sample expressions using the C comment delimiters:

\begin{quote}
    \begin{tabular}{lp{6cm}}
        Regular expression & Action\\
        \hline
        \verb|/*|                 & Enter \emph{comment} state\\
        \emph{comment} \verb|.|   & Ignore character. A dot (\verb|.|) means anything. This regular expression is only considered when the lexer is in the \emph{comment} state.\\
        \emph{comment} \verb|*/|  & Return to previous state. Do not return to caller but read the next token, effectively ignoring the comment. This regular expression is only considered when the lexer is in the \emph{comment} state.\\
    \end{tabular}
\end{quote}

Using a minor modification, we can also allow nested comments. To
do this, we must have the lexer keep track of the comment
nesting level. Only when the nesting level reaches 0 after
leaving the final comment should the lexer leave the \emph{comment}
state, as the sketch below illustrates.
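The following flex fragment sketches how such a nesting scheme could
look. The \verb|COMMENT| state name and the \verb|commentLevel_g|
counter are assumptions made for this sketch rather than the actual
\emph{\langname} lexer, and note that standard C comments do not nest.

\begin{verbatim}
    %{
        /* Sketch: nested comments via an exclusive state and a counter.
           Names are illustrative only. */
        static int commentLevel_g = 0;
    %}
    %option main
    %x COMMENT

    %%
    "/*"              { commentLevel_g = 1; BEGIN( COMMENT ); }
    <COMMENT>"/*"     { commentLevel_g++; }
    <COMMENT>"*/"     { if( --commentLevel_g == 0 ) BEGIN( INITIAL ); }
    <COMMENT>.|\n     { /* ignore comment contents */ }
    %%
\end{verbatim}

Run through flex and compiled, this scanner copies its input to its
output with (possibly nested) comments removed, since characters not
matched by any rule are echoed by default.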
Note that you could handle comments using a single regular
expression:

\begin{quote}
    \verb|/* (.)* */|
\end{quote}

But this approach does not support nested comments. The treatment of
line comments is slightly easier. Only one regular expression is
needed:

\begin{quote}
    \verb|//(.)*\n|
\end{quote}

\section{Lexical Analyzer Generators}
%This section discusses lexer generators, and introduces
%Flex.

Although it is certainly possible to write a lexical analyzer
by hand, this task becomes increasingly complex as your
input language gets richer. It is therefore more practical to
use a lexical analyzer generator. The code generated by such
a generator program is usually faster and more efficient than
any code you might write by hand\cite{lex_yacc}.

Here are several candidates you could use:

\begin{quote}
    \begin{tabular}{lp{6cm}}
        AT\&T lex      & Not free, ancient, UNIX and Linux implementations\\
        GNU flex       & Free, modern, Linux implementation\\
        Bumblebee lex  & Free, modern, Windows implementation\\
    \end{tabular}
\end{quote}

The \emph{\langname} compiler was constructed using GNU flex; in the
next sections we will briefly discuss its syntax (since flex takes
lexical analyzer specifications as its input) and how to use the
output flex generates.

\subsubsection{Flex syntax}
The layout of a flex input file (extension \verb|.l|) is, in
pseudocode:

\begin{verbatim}
            %{
                Any preliminary C code (inclusions, defines) that
                will be pasted in the resulting .C file
            %}
            Any flex definitions
            %%
            Regular expressions
            %%
            Any C code that will be appended to
            the resulting .C file
\end{verbatim}

When a regular expression matches some input text, the
lexical analyzer must execute an action. This usually involves
informing the caller (the parser) of the token class found.
With an action included, the regular expressions take the
following form:

\begin{verbatim}
            [0-9]+  {
                        intValue_g = atoi( yytext );
\end{verbatim}
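For illustration only, a minimal but complete specification in the same
spirit could look as follows. The \verb|INTEGER| token code, the
\verb|intValue_g| global and the driver in the user code section are
assumptions made for this sketch, not the actual \emph{\langname} lexer.

\begin{verbatim}
    %{
        /* Minimal complete flex sketch; names are illustrative only. */
        #include <stdio.h>
        #include <stdlib.h>

        #define INTEGER 257          /* assumed token code */

        int intValue_g;              /* value of the last integer read */
    %}
    %option noyywrap

    %%
    [0-9]+      {
                    intValue_g = atoi( yytext );
                    return( INTEGER );
                }
    [ \t\n]+    { /* ignore whitespace */ }
    .           { return( yytext[0] ); }
    %%

    int main( void )
    {
        int token;
        while( ( token = yylex() ) != 0 )
        {
            if( token == INTEGER )
                printf( "integer: %d\n", intValue_g );
            else
                printf( "token: %c\n", token );
        }
        return( 0 );
    }
\end{verbatim}

This fragment shows all three parts of a flex file working together:
the C preamble, the rules with their actions, and the user code that is
appended to the generated scanner.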
