                        return( INTEGER );
                    }
\end{verbatim}

Using \verb|return( INTEGER )|, the lexer informs the caller
(the parser) that it has found an integer. It can only return one
item (the token class), so the actual value of the integer is
passed to the parser through the global variable \verb|intValue_g|.
Flex automatically stores the characters that make up the current
token in the global string \verb|yytext|.

\subsubsection{Sample flex input file}

Here is a sample flex input file for the language that consists
of sentences of the form
\verb|(|\emph{number}\verb|+|\emph{number}\verb|)|,
and that allows spacing anywhere (except within tokens).

\begin{verbatim}
        %{
            #include <stdio.h>
            #include <stdlib.h>
            #define NUMBER 1000
            int intValue_g;
        %}
        %option noyywrap
        %%
        "("         { return( '(' ); }
        ")"         { return( ')' ); }
        "+"         { return( '+' ); }
        [0-9]+      {
                        intValue_g = atoi( yytext );
                        return( NUMBER );
                    }
        [ \t\n]+    { /* skip whitespace between tokens */ }
        %%
        int main()
        {
            int result;
            while( ( result = yylex() ) != 0 )
            {
                printf( "Token class found: %d\n", result );
            }
            return( 0 );
        }
\end{verbatim}

For many more examples, consult J. Levine's \emph{Lex and yacc}
\cite{lex_yacc}.

\section{\langname{} Lexical Analyzer Specification}
%This section contains a listing of token categories,
%regular expressions for complex tokens, and listings of
%all tokens in Inger.

As a practical example, we will now discuss the token
categories in the \emph{\langname} language, and all
regular expressions used for complex tokens. The full
source for the \emph{\langname} lexer is included in
appendix \ref{appendix:lexersource}.

\langname{} discerns several token categories: keywords (\verb|if|,
\verb|while| and so on), operators (\verb|+|, \verb|%|
and more), complex tokens (integer numbers, floating point
numbers, and strings), delimiters (parentheses, brackets)
and whitespace.

We will list the tokens in each category and show which
regular expression is used to match them.

\subsubsection{Keywords}

\langname{} expects all keywords (sometimes called \emph{reserved
words}) to be written in lowercase, so the literal text of a
keyword can serve as the regular expression that matches it. The
following table illustrates this:

\ \\\begin{tabular}{lll}
    Token            & Regular Expression       & Token identifier\\
    \hline
    break            & \verb|break|             & \verb|KW_BREAK|\\
    case             & \verb|case|              & \verb|KW_CASE|\\
    continue         & \verb|continue|          & \verb|KW_CONTINUE|\\
    default          & \verb|default|           & \verb|KW_DEFAULT|\\
    do               & \verb|do|                & \verb|KW_DO|\\
    else             & \verb|else|              & \verb|KW_ELSE|\\
    false            & \verb|false|             & \verb|KW_FALSE|\\
    goto\_considered & \verb|goto_considered|   & \\
    \ \_harmful      & \ \verb|_harmful|        & \verb|KW_GOTO|\\
    if               & \verb|if|                & \verb|KW_IF|\\
    label            & \verb|label|             & \verb|KW_LABEL|\\
    module           & \verb|module|            & \verb|KW_MODULE|\\
    return           & \verb|return|            & \verb|KW_RETURN|\\
    start            & \verb|start|             & \verb|KW_START|\\
    switch           & \verb|switch|            & \verb|KW_SWITCH|\\
    true             & \verb|true|              & \verb|KW_TRUE|\\
    while            & \verb|while|             & \verb|KW_WHILE|\\
\end{tabular}

\subsubsection{Types}

Type names are also tokens. They are invariable and
can therefore be matched using their full name.

\ \\\begin{tabular}{lll}
    Token       & Regular Expression & Token identifier\\
    \hline
    bool        & \verb|bool|        & \verb|KW_BOOL|\\
    char        & \verb|char|        & \verb|KW_CHAR|\\
    float       & \verb|float|       & \verb|KW_FLOAT|\\
    int         & \verb|int|         & \verb|KW_INT|\\
    pointer     & \verb|pointer|     & \verb|KW_POINTER|\\
    string      & \verb|string|      & \verb|KW_STRING|\\
\end{tabular}

\ \\Note that the \emph{pointer} type is equivalent to
\verb|void *| in the C language; it is a polymorphic
pointer type. Zero or more reference symbols (\verb|*|)
may be added after the \verb|pointer| keyword. For instance,
the declaration

\begin{quote}
    \verb|pointer ** a;|
\end{quote}

declares $a$ to be a triple polymorphic pointer
(the counterpart of \verb|void ***| in C).

\subsubsection{Complex tokens}
\langname{}'s complex tokens are variable identifiers,
integer literals, floating point literals and character
literals.

\ \\\begin{tabular}{lll}
    Token           & Regular Expression                        & Token identifier\\
    \hline
    integer literal & \verb|[0-9]+|                             & \verb|INT|\\
    identifier      & \verb|[_A-Za-z][_A-Za-z0-9]*|             & \verb|IDENTIFIER|\\
    float           & \verb|[0-9]*\.[0-9]+([eE][\+-][0-9]+)?|   & \verb|FLOAT|\\
    char            & \verb|\'.\'|                              & \verb|CHAR|\\
\end{tabular}

\subsubsection{Strings}
In \langname{}, strings cannot span multiple lines. Strings
are read using an exclusive lexer \emph{string} state. This
is best illustrated by some \verb|flex| code:

\begin{verbatim}
        \"                  { BEGIN STATE_STRING; }
        <STATE_STRING>\"    { BEGIN 0; return( STRING ); }
        <STATE_STRING>\n    { ERROR( "unterminated string" ); }
        <STATE_STRING>.     { /* store a character */ }
        <STATE_STRING>\\\"  { /* add " to string */   }
\end{verbatim}

If a linefeed is encountered while reading a string,
the lexer displays an error message, since strings may
not span lines.
Every character that is read while in
the string state is added to the string, except \verb|"|,
which terminates a string and causes the lexer to leave
the exclusive \emph{string} state. Using the \verb|\"|
control code, the programmer can actually add the \verb|"|
(double quotes) character to a string.

\subsubsection{Comments}
\langname{} supports two types of comments: line comments
(which are terminated by a linefeed) and block comments
(which must be explicitly terminated). Line comments can
be read (and subsequently skipped) using a single regular
expression:

\begin{quote}
    \verb|"//"[^\n]*|
\end{quote}

whereas block comments need an exclusive lexer state (since
they can also be nested). We illustrate this again using
some \verb|flex| code:

\begin{verbatim}
        "/*"                  { BEGIN STATE_COMMENTS;
                                ++commentlevel; }
        <STATE_COMMENTS>"/*"  { ++commentlevel; }
        <STATE_COMMENTS>.     { }
        <STATE_COMMENTS>\n    { }
        <STATE_COMMENTS>"*/"  { if( --commentlevel == 0 )
                                    BEGIN 0; }
\end{verbatim}

Once a comment is started using \verb|/*|, the lexer sets the
comment level to 1 and enters the comment state. The comment
level is increased every time a \verb|/*| is encountered, and
decreased every time a \verb|*/| is read. While in comment
state, all characters but the comment start and end
delimiters are discarded. The lexer leaves the comment
state after the last comment block terminates.

\subsubsection{Operators}
\langname{} provides a large selection of operators, of varying
priority. They are listed here in alphabetic order of
the token identifiers. This list includes only atomic
operators, not operators that delimit their argument on
both sides, like function application

\begin{quote}
    \emph{funcname} \verb|(| \emph{expr[,expr...]} \verb|)|
\end{quote}

or array indexing

\begin{quote}
    \emph{arrayname} \verb|[| \emph{index} \verb|]|.
\end{quote}

In the next section, we will present a list of all
operators (including function application and array
indexing) sorted by priority.

Some operators consist of multiple characters. The lexer
can tell a single-character operator from a multi-character
operator that starts with the same character by looking one
character ahead in the input stream and switching states
(as explained in section \ref{sec:states}).

\ \\\begin{tabular}{lll}
    Token               & Regular Expression & Token identifier\\
    \hline
    addition            & \verb|+|           & \verb|OP_ADD|\\
    assignment          & \verb|=|           & \verb|OP_ASSIGN|\\
    bitwise and         & \verb|&|           & \verb|OP_BITWISE_AND|\\
    bitwise complement  & \verb|~|           & \verb|OP_BITWISE_COMPLEMENT|\\
    bitwise left shift  & \verb|<<|          & \verb|OP_BITWISE_LSHIFT|\\
    bitwise or          & \verb+|+           & \verb|OP_BITWISE_OR|\\
    bitwise right shift & \verb|>>|          & \verb|OP_BITWISE_RSHIFT|\\
    bitwise xor         & \verb|^|           & \verb|OP_BITWISE_XOR|\\
    division            & \verb|/|           & \verb|OP_DIVIDE|\\
    equality            & \verb|==|          & \verb|OP_EQUAL|\\
    greater than        & \verb|>|           & \verb|OP_GREATER|\\
    greater or equal    & \verb|>=|          & \verb|OP_GREATEREQUAL|\\
    less than           & \verb|<|           & \verb|OP_LESS|\\
    less or equal       & \verb|<=|          & \verb|OP_LESSEQUAL|\\
    logical and         & \verb|&&|          & \verb|OP_LOGICAL_AND|\\
    logical or          & \verb+||+          & \verb|OP_LOGICAL_OR|\\
    modulus             & \verb|%|           & \verb|OP_MODULUS|\\
    multiplication      & \verb|*|           & \verb|OP_MULTIPLY|\\
    logical negation    & \verb|!|           & \verb|OP_NOT|\\
    inequality          & \verb|!=|          & \verb|OP_NOTEQUAL|\\
    subtraction         & \verb|-|           & \verb|OP_SUBTRACT|\\
    ternary if          & \verb|?|           & \verb|OP_TERNARY_IF|\\
\end{tabular}

\ \\Note that the \verb|*| operator is also used for indirection
(dereferencing, in unary form) besides multiplication, and the
\verb|&| operator is also used for referencing (in unary form)
besides bitwise and.

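The one-character lookahead described above can be sketched in plain C.
This is only an illustration under stated assumptions, not the
\langname{} implementation: a flex-generated lexer gets the same effect
from its longest-match rule. The function name \verb|scan_operator|,
the \verb|OP_NONE| code and the numeric values of the enumeration are
hypothetical; the remaining token names are taken from the operator
table above, and only a few operators are handled.

\begin{verbatim}
#include <assert.h>
#include <stddef.h>

/* Token codes named after the operator table; OP_NONE and the
   numeric values are assumptions made for this sketch. */
enum op_token {
    OP_NONE = 0,
    OP_LESS, OP_LESSEQUAL, OP_BITWISE_LSHIFT,
    OP_GREATER, OP_GREATEREQUAL, OP_BITWISE_RSHIFT,
    OP_ASSIGN, OP_EQUAL,
    OP_NOT, OP_NOTEQUAL
};

/* Scan one operator at the start of src. Stores the number of
   characters consumed in *len and returns the token code, or
   OP_NONE (with *len set to 0) if src does not start with one
   of the operators handled here. */
enum op_token scan_operator( const char *src, size_t *len )
{
    char next;

    assert( src != NULL && len != NULL );

    /* One character of lookahead; never read past the terminator. */
    next = ( src[0] != '\0' ) ? src[1] : '\0';

    *len = 1;
    switch( src[0] )
    {
    case '<':
        if( next == '=' ) { *len = 2; return( OP_LESSEQUAL ); }
        if( next == '<' ) { *len = 2; return( OP_BITWISE_LSHIFT ); }
        return( OP_LESS );
    case '>':
        if( next == '=' ) { *len = 2; return( OP_GREATEREQUAL ); }
        if( next == '>' ) { *len = 2; return( OP_BITWISE_RSHIFT ); }
        return( OP_GREATER );
    case '=':
        if( next == '=' ) { *len = 2; return( OP_EQUAL ); }
        return( OP_ASSIGN );
    case '!':
        if( next == '=' ) { *len = 2; return( OP_NOTEQUAL ); }
        return( OP_NOT );
    default:
        *len = 0;
        return( OP_NONE );
    }
}
\end{verbatim}

A caller would advance its input pointer by \verb|*len| after each
successful call. The longer operator is always tried first, so
\verb|<=| is never split into \verb|<| followed by \verb|=|.
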
\subsubsection{Delimiters}

\langname{} has a number of delimiters. They are listed here
by their function.

\ \\\begin{tabular}{lll}
    Token                                     & Regular Expression & Token identifier\\
    \hline
    precedes function return type             & \verb|->|          & \verb|ARROW|\\
    start code block                          & \verb|{|           & \verb|LBRACE|\\
    end code block                            & \verb|}|           & \verb|RBRACE|\\
    begin array index                         & \verb|[|           & \verb|LBRACKET|\\
    end array index                           & \verb|]|           & \verb|RBRACKET|\\
    start function parameter list             & \verb|:|           & \verb|COLON|\\
    function argument separation              & \verb|,|           & \verb|COMMA|\\
    expression priority, function application & \verb|(|           & \verb|LPAREN|\\
    expression priority, function application & \verb|)|           & \verb|RPAREN|\\
    statement terminator                      & \verb|;|           & \verb|SEMICOLON|\\
\end{tabular}

The full source of the \langname{} lexical analyzer is included
in appendix \ref{appendix:lexersource}.

\section{Operator Priorities}
The final section of this chapter discusses the priorities of
operators in \langname{}.

\begin{tabular}{llll}
    Operator  & Priority & Associativity & Description\\
    \hline
    \verb|()| & 1        & L & function application\\
    \verb|[]| & 1        & L & array indexing\\
    \verb|!|  & 2        & R & logical negation\\
    \verb|-|  & 2        & R & unary minus\\
    \verb|+|  & 2        & R & unary plus\\
    \verb|~|  & 3        & R & bitwise complement\\
    \verb|*|  & 3        & R & indirection\\
    \verb|&|  & 3        & R & referencing\\
    \verb|*|  & 4        & L & multiplication\\
    \verb|/|  & 4        & L & division\\
    \verb|%|  & 4        & L & modulus\\
    \verb|+|  & 5        & L & addition\\
    \verb|-|  & 5        & L & subtraction\\
    \verb|>>| & 6        & L & bitwise shift right\\
    \verb|<<| & 6        & L & bitwise shift left\\
    \verb|<|  & 7        & L & less than\\
    \verb|<=| & 7        & L & less than or equal\\
    \verb|>|  & 7        & L & greater than\\
    \verb|>=| & 7        & L & greater than or equal\\
    \verb|==| & 8        & L & equality\\
    \verb|!=| & 8        & L & inequality\\
    \verb|&|  & 9        & L & bitwise and\\
    \verb|^|  & 10       & L & bitwise xor\\
    \verb+|+  & 11       & L & bitwise or\\
    \verb|&&| & 12       & L & logical and\\
    \verb+||+ & 12       & L & logical or\\
    \verb|?:| & 13       & R & ternary if\\
    \verb|=|  & 14       & R & assignment\\
\end{tabular}

\begin{thebibliography}{99}
    \bibitem{regex} H. Spencer: \emph{POSIX 1003.2 regular
    expressions}, UNIX man page regex(7), 1994.
    \bibitem{lex_yacc} J. Levine: \emph{Lex and Yacc},
    O'Reilly \& Associates, 2000.
\end{thebibliography}