return( INTEGER ); }
\end{verbatim}

Using \verb|return( INTEGER )|, the lexer informs the caller (the parser) that it has found an integer. It can only return one item (the token class), so the actual value of the integer is passed to the parser through the global variable \verb|intValue_g|. Flex automatically stores the characters that make up the current token in the global string \verb|yytext|.

\subsubsection{Sample flex input file}

Here is a sample flex input file for the language that consists of sentences of the form \verb|(|\emph{number}\verb|+|\emph{number}\verb|)|, and that allows spacing anywhere (except within tokens).

\begin{verbatim}
%{
#include <stdio.h>   /* printf */
#include <stdlib.h>  /* atoi */

#define NUMBER 1000
int intValue_g;
%}

%%

"("      { return( '(' ); }
")"      { return( ')' ); }
"+"      { return( '+' ); }
[0-9]+   {
           /* Pass the value of the number via a global. */
           intValue_g = atoi( yytext );
           return( NUMBER );
         }
[ \t\n]+ { /* skip whitespace between tokens */ }

%%

int main()
{
    int result;
    /* yylex() returns 0 at end of input. */
    while( ( result = yylex() ) != 0 )
    {
        printf( "Token class found: %d\n", result );
    }
    return( 0 );
}
\end{verbatim}

For many more examples, consult J. Levine's \emph{Lex and Yacc} \cite{lex_yacc}.

\section{\langname{} Lexical Analyzer Specification}
%This section contains a listing of token categories,
%regular expressions for complex tokens, and listings of
%all tokens in Inger.
As a practical example, we will now discuss the token categories in the \emph{\langname} language, and all regular expressions used for complex tokens. The full source for the \emph{\langname} lexer is included in appendix \ref{appendix:lexersource}.

\langname{} discerns several token categories: keywords (\verb|if|, \verb|while| and so on), operators (\verb|+|, \verb|%| and more), complex tokens (integer numbers, floating point numbers, and strings), delimiters (parentheses, brackets) and whitespace. We will list the tokens in each category and show which regular expression is used to match them.

\subsubsection{Keywords}

\langname{} expects all keywords (sometimes called \emph{reserved words}) to be written in lowercase, allowing the literal keyword to be used to match the keyword itself. The following table illustrates this:

\ \\\begin{tabular}{lll}
Token & Regular Expression & Token identifier\\
\hline
break & \verb|break| & \verb|KW_BREAK|\\
case & \verb|case| & \verb|KW_CASE|\\
continue & \verb|continue| & \verb|KW_CONTINUE|\\
default & \verb|default| & \verb|KW_DEFAULT|\\
do & \verb|do| & \verb|KW_DO|\\
else & \verb|else| & \verb|KW_ELSE|\\
false & \verb|false| & \verb|KW_FALSE|\\
goto\_considered & \verb|goto_considered| & \\
\ \_harmful & \ \verb|_harmful| & \verb|KW_GOTO|\\
if & \verb|if| & \verb|KW_IF|\\
label & \verb|label| & \verb|KW_LABEL|\\
module & \verb|module| & \verb|KW_MODULE|\\
return & \verb|return| & \verb|KW_RETURN|\\
start & \verb|start| & \verb|KW_START|\\
switch & \verb|switch| & \verb|KW_SWITCH|\\
true & \verb|true| & \verb|KW_TRUE|\\
while & \verb|while| & \verb|KW_WHILE|\\
\end{tabular}

\subsubsection{Types}

Type names are also tokens. They are invariable and can therefore be matched using their full name.

\ \\\begin{tabular}{lll}
Token & Regular Expression & Token identifier\\
\hline
bool & \verb|bool| & \verb|KW_BOOL|\\
char & \verb|char| & \verb|KW_CHAR|\\
float & \verb|float| & \verb|KW_FLOAT|\\
int & \verb|int| & \verb|KW_INT|\\
pointer & \verb|pointer| & \verb|KW_POINTER|\\
string & \verb|string| & \verb|KW_STRING|\\
\end{tabular}

\ \\Note that the \emph{pointer} type is equivalent to \verb|void *| in the C language; it is a polymorphic pointer type. Zero or more reference symbols (\verb|*|) may be added after the \verb|pointer| keyword. For instance, the declaration
\begin{quote}
\verb|pointer ** a;|
\end{quote}
declares $a$ to be a triple polymorphic pointer.

\subsubsection{Complex tokens}

\langname{}'s complex tokens are variable identifiers, integer literals, floating point literals and character literals.
\ \\\begin{tabular}{lll}
Token & Regular Expression & Token identifier\\
\hline
integer literal & \verb|[0-9]+| & \verb|INT|\\
identifier & \verb|[_A-Za-z][_A-Za-z0-9]*| & \verb|IDENTIFIER|\\
float & \verb|[0-9]*\.[0-9]+([eE][\+-][0-9]+)?| & \verb|FLOAT|\\
char & \verb|\'.\'| & \verb|CHAR|\\
\end{tabular}

\subsubsection{Strings}

In \langname{}, strings cannot span multiple lines. Strings are read using an exclusive lexer \emph{string} state. This is best illustrated by some \verb|flex| code:

\begin{verbatim}
\"                  { BEGIN STATE_STRING; }
<STATE_STRING>\"    { BEGIN 0; return( STRING ); }
<STATE_STRING>\n    { ERROR( "unterminated string" ); }
<STATE_STRING>.     { /* store the character */ }
<STATE_STRING>\\\"  { /* add a double quote to the string */ }
\end{verbatim}

If a linefeed is encountered while reading a string, the lexer displays an error message, since strings may not span lines. Every character that is read while in the string state is added to the string, except \verb|"|, which terminates a string and causes the lexer to leave the exclusive \emph{string} state. Using the \verb|\"| control code, the programmer can still add the \verb|"| (double quote) character to a string. Because flex prefers the longest match, the two-character rule for \verb|\"| takes precedence over the single-character rules.

\subsubsection{Comments}

\langname{} supports two types of comments: line comments (which are terminated by a line feed) and block comments (which must be explicitly terminated). Line comments can be read (and subsequently skipped) using a single regular expression:
\begin{quote}
\verb|"//"[^\n]*|
\end{quote}
whereas block comments need an exclusive lexer state (since they can also be nested). We illustrate this again using some \verb|flex| code:

\begin{verbatim}
"/*"                  { BEGIN STATE_COMMENTS; ++commentlevel; }
<STATE_COMMENTS>"/*"  { ++commentlevel; }
<STATE_COMMENTS>.     { }
<STATE_COMMENTS>\n    { }
<STATE_COMMENTS>"*/"  { if( --commentlevel == 0 ) BEGIN 0; }
\end{verbatim}

Once a comment is started using \verb|/*|, the lexer sets the comment level to 1 and enters the comment state.
The comment level is increased every time a \verb|/*| is encountered, and decreased every time a \verb|*/| is read. While in comment state, all characters but the comment start and end delimiters are discarded. The lexer leaves the comment state after the last comment block terminates.

\subsubsection{Operators}

\langname{} provides a large selection of operators, of varying priority. They are listed here in alphabetic order of the token identifiers. This list includes only atomic operators, not operators that delimit their argument on both sides, like function application
\begin{quote}
\emph{funcname} \verb|(| \emph{expr[,expr...]} \verb|)|
\end{quote}
or array indexing
\begin{quote}
\emph{arrayname} \verb|[| \emph{index} \verb|]|.
\end{quote}
In the next section, we will present a list of all operators (including function application and array indexing) sorted by priority.

Some operators consist of multiple characters. The lexer can tell a single-character operator from a multi-character operator that starts with the same character (for example \verb|<| and \verb|<=|) by looking one character ahead in the input stream and switching states (as explained in section \ref{sec:states}).
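
The one-character lookahead described above can be sketched in plain C. This is a hand-written illustration, not the \langname{} lexer and not the code that flex generates (flex resolves such cases through its longest-match rule). The function name \verb|scanOperator| and the numeric token values are assumptions made for this sketch; only the token identifiers come from the operator table.

```c
#include <stddef.h>

/* Token identifiers as named in the operator table. The numeric
   values are arbitrary and chosen for this sketch only. */
enum
{
    OP_UNKNOWN = 0,
    OP_ASSIGN, OP_EQUAL,
    OP_LESS, OP_LESSEQUAL, OP_BITWISE_LSHIFT,
    OP_GREATER, OP_GREATEREQUAL, OP_BITWISE_RSHIFT
};

/* Hypothetical helper: read one operator from input at *pos,
   advance *pos past it and return its token identifier.
   *pos is left unchanged when no known operator starts there. */
int scanOperator( const char *input, size_t *pos )
{
    char first = input[*pos];
    char next;

    if( first == '\0' )
    {
        return( OP_UNKNOWN );        /* end of input, nothing to read */
    }
    next = input[*pos + 1];          /* lookahead; '\0' at end of input */

    switch( first )
    {
    case '=':
        if( next == '=' ) { *pos += 2; return( OP_EQUAL ); }
        *pos += 1;
        return( OP_ASSIGN );
    case '<':
        if( next == '=' ) { *pos += 2; return( OP_LESSEQUAL ); }
        if( next == '<' ) { *pos += 2; return( OP_BITWISE_LSHIFT ); }
        *pos += 1;
        return( OP_LESS );
    case '>':
        if( next == '=' ) { *pos += 2; return( OP_GREATEREQUAL ); }
        if( next == '>' ) { *pos += 2; return( OP_BITWISE_RSHIFT ); }
        *pos += 1;
        return( OP_GREATER );
    default:
        return( OP_UNKNOWN );
    }
}
```

Called repeatedly on the input \verb|>>=| with a position that starts at 0, this sketch first consumes \verb|>>| as \verb|OP_BITWISE_RSHIFT| and then \verb|=| as \verb|OP_ASSIGN|: the lookahead only decides between the operators that share a first character.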

\ \\\begin{tabular}{lll}
Token & Regular Expression & Token identifier\\
\hline
addition & \verb|+| & \verb|OP_ADD|\\
assignment & \verb|=| & \verb|OP_ASSIGN|\\
bitwise and & \verb|&| & \verb|OP_BITWISE_AND|\\
bitwise complement & \verb|~| & \verb|OP_BITWISE_COMPLEMENT|\\
bitwise left shift & \verb|<<| & \verb|OP_BITWISE_LSHIFT|\\
bitwise or & \verb+|+ & \verb|OP_BITWISE_OR|\\
bitwise right shift & \verb|>>| & \verb|OP_BITWISE_RSHIFT|\\
bitwise xor & \verb|^| & \verb|OP_BITWISE_XOR|\\
division & \verb|/| & \verb|OP_DIVIDE|\\
equality & \verb|==| & \verb|OP_EQUAL|\\
greater than & \verb|>| & \verb|OP_GREATER|\\
greater or equal & \verb|>=| & \verb|OP_GREATEREQUAL|\\
less than & \verb|<| & \verb|OP_LESS|\\
less or equal & \verb|<=| & \verb|OP_LESSEQUAL|\\
logical and & \verb|&&| & \verb|OP_LOGICAL_AND|\\
logical or & \verb+||+ & \verb|OP_LOGICAL_OR|\\
modulus & \verb|%| & \verb|OP_MODULUS|\\
multiplication & \verb|*| & \verb|OP_MULTIPLY|\\
logical negation & \verb|!| & \verb|OP_NOT|\\
inequality & \verb|!=| & \verb|OP_NOTEQUAL|\\
subtraction & \verb|-| & \verb|OP_SUBTRACT|\\
ternary if & \verb|?| & \verb|OP_TERNARY_IF|\\
\end{tabular}

\ \\Note that, besides multiplication, the \verb|*| operator is also used (in unary form) for indirection, that is, dereferencing a pointer. Likewise, besides bitwise and, the \verb|&| operator is also used for referencing.

\subsubsection{Delimiters}

\langname{} has a number of delimiters. They are listed here by their function description.

\ \\\begin{tabular}{lll}
Token & Regular Expression & Token identifier\\
\hline
precedes function return type & \verb|->| & \verb|ARROW|\\
start code block & \verb|{| & \verb|LBRACE|\\
end code block & \verb|}| & \verb|RBRACE|\\
begin array index & \verb|[| & \verb|LBRACKET|\\
end array index & \verb|]| & \verb|RBRACKET|\\
start function parameter list & \verb|:| & \verb|COLON|\\
function argument separation & \verb|,| & \verb|COMMA|\\
expression priority, function application & \verb|(| & \verb|LPAREN|\\
expression priority, function application & \verb|)| & \verb|RPAREN|\\
statement terminator & \verb|;| & \verb|SEMICOLON|\\
\end{tabular}

The full source to the \langname{} lexical analyzer is included in appendix \ref{appendix:lexersource}.

\section{Operator Priorities}

The final section of this chapter discusses the priorities of operators in \langname{}. A lower priority number means that the operator binds more tightly; L and R denote left and right associativity.

\begin{tabular}{llll}
Operator & Priority & Associativity & Description\\
\hline
\verb|()| & 1 & L & function application\\
\verb|[]| & 1 & L & array indexing\\
\verb|!| & 2 & R & logical negation\\
\verb|-| & 2 & R & unary minus\\
\verb|+| & 2 & R & unary plus\\
\verb|~| & 3 & R & bitwise complement\\
\verb|*| & 3 & R & indirection\\
\verb|&| & 3 & R & referencing\\
\verb|*| & 4 & L & multiplication\\
\verb|/| & 4 & L & division\\
\verb|%| & 4 & L & modulus\\
\verb|+| & 5 & L & addition\\
\verb|-| & 5 & L & subtraction\\
\verb|>>| & 6 & L & bitwise shift right\\
\verb|<<| & 6 & L & bitwise shift left\\
\verb|<| & 7 & L & less than\\
\verb|<=| & 7 & L & less than or equal\\
\verb|>| & 7 & L & greater than\\
\verb|>=| & 7 & L & greater than or equal\\
\verb|==| & 8 & L & equality\\
\verb|!=| & 8 & L & inequality\\
\verb|&| & 9 & L & bitwise and\\
\verb|^| & 10 & L & bitwise xor\\
\verb+|+ & 11 & L & bitwise or\\
\verb|&&| & 12 & L & logical and\\
\verb+||+ & 12 & L & logical or\\
\verb|?:| & 13 & R & ternary if\\
\verb|=| & 14 & R & assignment\\
\end{tabular}

\begin{thebibliography}{99}
\bibitem{regex}H. Spencer: \emph{POSIX 1003.2 regular expressions}, UNIX man page regex(7), 1994
\bibitem{lex_yacc}J. Levine: \emph{Lex and Yacc}, O'Reilly \& Associates, 2000
\end{thebibliography}