📄 lex.html

📁 IEEE 1003.1-2003, Single Unix Specification v3
💻 HTML
start-condition <tt>'&lt;'</tt> and <tt>'&gt;'</tt> operators shall be special only in a start condition at the beginning of a regular expression; elsewhere in the regular expression they shall be treated as ordinary characters.</p><h5><a name="tag_04_73_13_05"></a>Actions in lex</h5><p>The action to be taken when an ERE is matched can be a C program fragment or the special actions described below; the program fragment can contain one or more C statements, and can also include special actions. The empty C statement <tt>';'</tt> shall be a valid action; any string in the <b>lex.yy.c</b> input that matches the pattern portion of such a rule is effectively ignored or skipped. However, the absence of an action shall not be valid, and the action <i>lex</i> takes in such a condition is undefined.</p><p>The specification for an action, including C statements and special actions, can extend across several lines if enclosed in braces:</p><pre><i>ERE</i> <tt>&lt;</tt><i>one or more blanks</i><tt>&gt; {</tt> <i>program statement
                           program statement</i> <tt>}</tt></pre><p>The default action when a string in the input to a <b>lex.yy.c</b> program is not matched by any expression shall be to copy the string to the output. Because the default behavior of a program generated by <i>lex</i> is to read the input and copy it to the output, a minimal <i>lex</i> source program that has just <tt>"%%"</tt> shall generate a C program that simply copies the input to the output unchanged.</p><p>Four special actions shall be available:</p><pre><tt>|   ECHO;   REJECT;   BEGIN</tt></pre><dl compact><dt><tt>|</tt></dt><dd>The action <tt>'|'</tt> means that the action for the next rule is the action for this rule. 
Unlike the other three actions, <tt>'|'</tt> cannot be enclosed in braces or be semicolon-terminated; the application shall ensure that it is specified alone, with no other actions.</dd><dt><b>ECHO;</b></dt><dd>Write the contents of the string <i>yytext</i> on the output.</dd><dt><b>REJECT;</b></dt><dd>Usually only a single expression is matched by a given string in the input. <b>REJECT</b> means &quot;continue to the next expression that matches the current input&quot;, and shall cause whatever rule was the second choice after the current rule to be executed for the same input. Thus, multiple rules can be matched and executed for one input string or overlapping input strings. For example, given the regular expressions <tt>"xyz"</tt> and <tt>"xy"</tt> and the input <tt>"xyz"</tt>, usually only the regular expression <tt>"xyz"</tt> would match. The next attempted match would start after <b>z</b>. If the last action in the <tt>"xyz"</tt> rule is <b>REJECT</b>, both this rule and the <tt>"xy"</tt> rule would be executed. The <b>REJECT</b> action may be implemented in such a fashion that flow of control does not continue after it, as if it were equivalent to a <b>goto</b> to another part of <i>yylex</i>(). The use of <b>REJECT</b> may result in somewhat larger and slower scanners.</dd><dt><b>BEGIN</b></dt><dd>The action: <pre><tt>BEGIN</tt> <i>newstate</i><tt>;</tt></pre><p>switches the state (start condition) to <i>newstate</i>. If the string <i>newstate</i> has not been declared previously as a start condition in the <i>Definitions</i> section, the results are unspecified. The initial state is indicated by the digit <tt>'0'</tt> or the token <b>INITIAL</b>.</p></dd></dl><p>The functions or macros described below are accessible to user code included in the <i>lex</i> input. 
It is unspecified whether they appear in the C code output of <i>lex</i>, or are accessible only through the <b>-l&nbsp;l</b> operand to <a href="../utilities/c99.html"><i>c99</i></a> (the <i>lex</i> library).</p><dl compact><dt><b>int&nbsp;</b> <i>yylex</i>(<b>void</b>)</dt><dd><br>Performs lexical analysis on the input; this is the primary function generated by the <i>lex</i> utility. The function shall return zero when the end of input is reached; otherwise, it shall return non-zero values (tokens) determined by the actions that are selected.</dd><dt><b>int&nbsp;</b> <i>yymore</i>(<b>void</b>)</dt><dd><br>When called, indicates that when the next input string is recognized, it is to be appended to the current value of <i>yytext</i> rather than replacing it; the value in <i>yyleng</i> shall be adjusted accordingly.</dd><dt><b>int&nbsp;</b> <i>yyless</i>(<b>int&nbsp;</b> <i>n</i>)</dt><dd><br>Retains <i>n</i> initial characters in <i>yytext</i>, NUL-terminated, and treats the remaining characters as if they had not been read; the value in <i>yyleng</i> shall be adjusted accordingly.</dd><dt><b>int&nbsp;</b> <i>input</i>(<b>void</b>)</dt><dd><br>Returns the next character from the input, or zero on end-of-file. It shall obtain input from the stream pointer <i>yyin</i>, although possibly via an intermediate buffer. Thus, once scanning has begun, the effect of altering the value of <i>yyin</i> is undefined. The character read shall be removed from the input stream of the scanner without any processing by the scanner.</dd><dt><b>int&nbsp;</b> <i>unput</i>(<b>int&nbsp;</b> <i>c</i>)</dt><dd><br>Returns the character <tt>'c'</tt> to the input; <i>yytext</i> and <i>yyleng</i> are undefined until the next expression is matched. 
The result of using <i>unput</i>() for more characters than have been input is unspecified.</dd></dl><p>The following functions shall appear only in the <i>lex</i> library accessible through the <b>-l&nbsp;l</b> operand; they can therefore be redefined by a conforming application:</p><dl compact><dt><b>int&nbsp;</b> <i>yywrap</i>(<b>void</b>)</dt><dd><br>Called by <i>yylex</i>() at end-of-file; the default <i>yywrap</i>() shall always return 1. If the application requires <i>yylex</i>() to continue processing with another source of input, then the application can include a function <i>yywrap</i>(), which associates another file with the external variable <b>FILE *</b> <i>yyin</i> and shall return a value of zero.</dd><dt><b>int&nbsp;</b> <i>main</i>(<b>int&nbsp;</b> <i>argc</i>, <b>char *</b><i>argv</i>[])</dt><dd><br>Calls <i>yylex</i>() to perform lexical analysis, then exits. The user code can contain <i>main</i>() to perform application-specific operations, calling <i>yylex</i>() as applicable.</dd></dl><p>Except for <i>input</i>(), <i>unput</i>(), and <i>main</i>(), all external and static names generated by <i>lex</i> shall begin with the prefix <b>yy</b> or <b>YY</b>.</p></blockquote><h4><a name="tag_04_73_14"></a>EXIT STATUS</h4><blockquote><p>The following exit values shall be returned:</p><dl compact><dt>&nbsp;0</dt><dd>Successful completion.</dd><dt>&gt;0</dt><dd>An error occurred.</dd></dl></blockquote><h4><a name="tag_04_73_15"></a>CONSEQUENCES OF ERRORS</h4><blockquote><p>Default.</p></blockquote><hr><div class="box"><em>The following sections are informative.</em></div><h4><a name="tag_04_73_16"></a>APPLICATION USAGE</h4><blockquote><p>Conforming applications are warned that in the <i>Rules</i> section, an ERE without an action is not acceptable, but need not be detected as erroneous by <i>lex</i>. 
This may result in compilation or runtime errors.</p><p>The purpose of <i>input</i>() is to take characters off the input stream and discard them as far as the lexical analysis is concerned. A common use is to discard the body of a comment once the beginning of a comment is recognized.</p><p>The <i>lex</i> utility is not fully internationalized in its treatment of regular expressions in the <i>lex</i> source code or generated lexical analyzer. It would seem desirable to have the lexical analyzer interpret the regular expressions given in the <i>lex</i> source according to the environment specified when the lexical analyzer is executed, but this is not possible with the current <i>lex</i> technology. Furthermore, the very nature of the lexical analyzers produced by <i>lex</i> must be closely tied to the lexical requirements of the input language being described, which is frequently locale-specific anyway. (For example, writing an analyzer that is used for French text is not automatically useful for processing other languages.)</p></blockquote><h4><a name="tag_04_73_17"></a>EXAMPLES</h4><blockquote><p>The following is an example of a <i>lex</i> program that implements a rudimentary scanner for a Pascal-like syntax:</p><pre><tt>%{
/* Need this for the calls to atoi() and atof() below. */
#include &lt;stdlib.h&gt;
/* Need this for printf(), fopen(), and stdin below. */
#include &lt;stdio.h&gt;
%}<br>
DIGIT    [0-9]
ID       [a-z][a-z0-9]*<br>
%%<br>
{DIGIT}+ {
    printf("An integer: %s (%d)\n", yytext,
        atoi(yytext));
    }<br>
{DIGIT}+"."{DIGIT}*        {
    printf("A float: %s (%g)\n", yytext,
        atof(yytext));
    }<br>
if|then|begin|end|procedure|function        {
    printf("A keyword: %s\n", yytext);
    }<br>
{ID}    printf("An identifier: %s\n", yytext);<br>
"+"|"-"|"*"|"/"        printf("An operator: %s\n", yytext);<br>
"{"[^}\n]*"}"    /* Eat up one-line comments. */<br>
[ \t\n]+        /* Eat up white space. */<br>
.  printf("Unrecognized character: %s\n", yytext);<br>
%%<br>
int main(int argc, char *argv[])
{
    ++argv, --argc;  /* Skip over program name. */
    if (argc &gt; 0)
        yyin = fopen(argv[0], "r");
    else
        yyin = stdin;<br>
    yylex();
}</tt></pre></blockquote><h4><a name="tag_04_73_18"></a>RATIONALE</h4><blockquote><p>Even though the <b>-c</b> option and references to the C language are retained in this description, <i>lex</i> may be generalized to other languages, as was done at one time for EFL, the Extended FORTRAN Language. Since the <i>lex</i> input specification is essentially language-independent, versions of this utility could be written to produce Ada, Modula-2, or Pascal code, and there are known historical implementations that do so.</p><p>The current description of <i>lex</i> bypasses the issue of dealing with internationalized EREs in the <i>lex</i> source code or generated lexical analyzer. If it follows the model used by <a href="../utilities/awk.html"><i>awk</i></a> (the source code is assumed to be presented in the POSIX locale, but input and output are in the locale specified by the environment variables), then the tables in the lexical analyzer produced by <i>lex</i> would interpret EREs specified in the <i>lex</i> source in terms of the environment variables specified when <i>lex</i> was executed. The desired effect would be to have the lexical analyzer interpret the EREs given in the <i>lex</i> source according to the environment specified when the lexical analyzer is executed, but this is not possible with the current <i>lex</i> technology.</p><p>The description of octal and hexadecimal-digit escape sequences agrees with the ISO&nbsp;C standard use of escape sequences. 
See the RATIONALE for <a href="ed.html"><i>ed</i></a> for a discussion of bytes larger than 9 bits being represented by octal values. Hexadecimal values can represent larger bytes and multi-byte characters directly, using as many digits as required.</p><p>There is no detailed output format specification. The observed behavior of <i>lex</i> under four different historical implementations was that none of these implementations consistently reported the line numbers for error and warning messages. Furthermore, there was a desire that <i>lex</i> be allowed to output additional diagnostic messages. Leaving message formats unspecified avoids these formatting questions and problems with internationalization.</p><p>Although the <tt>%x</tt> specifier for <i>exclusive</i> start conditions is not historical practice, it is believed to be a minor change to historical implementations and greatly enhances the usability of <i>lex</i> programs since it permits an application to obtain the expected functionality with fewer statements.</p><p>The <b>%array</b> and <b>%pointer</b> declarations were added as a compromise between historical systems. The System V-based <i>lex</i> copies the matched text to a <i>yytext</i> array. The <i>flex</i> program, supported in BSD and GNU systems, uses a pointer. In the latter case, significant performance improvements are available for some scanners. Most historical programs should require no change in porting from one system to another because the string being referenced is null-terminated in both cases. (The method used by <i>flex</i> in its case is to null-terminate the token in place by remembering the character that used to come right after the token and replacing it before continuing on to the next scan.) 
Multi-file programs with external references to <i>yytext</i> outside the scanner source file should continue to operate on their historical systems, but would require one of the new declarations to be considered strictly portable.</p><p>The description of EREs avoids unnecessary duplication of ERE details because their meanings within a <i>lex</i> ERE are the same as that for the ERE in this volume of IEEE&nbsp;Std&nbsp;1003.1-2001.</p><p>The reason for the undefined condition associated with text beginning with a &lt;blank&gt; or within <tt>"%{"</tt> and <tt>"%}"</tt> delimiter lines appearing in the <i>Rules</i> section is historical practice. Both the BSD and System V <i>lex</i> copy the indented (or enclosed) input in the <i>Rules</i> section (except at the beginning) to unreachable areas of the <i>yylex</i>() function (the code is written directly after a <a href="../utilities/break.html"><i>break</i></a> statement). In some cases, the System V <i>lex</i> generates an error message or a syntax error, depending on the form of indented input.</p><p>The intention in breaking the list of functions into those that may appear in <b>lex.yy.c</b> <i>versus</i> those that only appear in <b>libl.a</b> is that only those functions in <b>libl.a</b> can be reliably redefined by a conforming application.</p><p>The descriptions of standard output and standard error are somewhat complicated because historical <i>lex</i> implementations chose to issue diagnostic messages to standard output (unless <b>-t</b> was given). IEEE&nbsp;Std&nbsp;1003.1-2001 allows this behavior, but leaves an opening for the more expected behavior of using standard error for diagnostics. Also, the System V behavior of writing the statistics when any table sizes are given is allowed, while BSD-derived systems can avoid it. 
The programmer can always precisely obtain the desired results by using either the <b>-t</b> or <b>-n</b> options.</p><p>The OPERANDS section does not mention the use of <b>-</b> as a synonym for standard input; not all historical implementations support such usage for any of the <i>file</i> operands.</p><p>A description of the <i>translation table</i> was deleted from early proposals because of its relatively low usage in historical applications.</p><p>The change to the definition of the <i>input</i>() function that allows buffering of input presents the opportunity for major performance gains in some applications.</p><p>The following examples clarify the differences between <i>lex</i> regular expressions and regular expressions appearing elsewhere in this volume of IEEE&nbsp;Std&nbsp;1003.1-2001. For regular expressions of the form <tt>"r/x"</tt>, the string matching <i>r</i> is always returned; confusion may arise when the beginning of <i>x</i> matches the trailing portion of <i>r</i>. For example, given the regular expression <tt>"a*b/cc"</tt> and the input <tt>"aaabcc"</tt>, <i>yytext</i> would contain the string <tt>"aaab"</tt> on this match. But given the regular expression <tt>"x*/xy"</tt> and the input <tt>"xxxy"</tt>, the token <b>xxx</b>, not <b>xx</b>, is returned by some implementations because <b>xxx</b> matches <tt>"x*"</tt>.</p><p>In the rule <tt>"ab*/bc"</tt>, the <tt>"b*"</tt> at the end of <i>r</i> extends <i>r</i>'s match into the beginning of the trailing context, so the result is unspecified. If this rule were <tt>"ab/bc"</tt>, however, the rule matches the text <tt>"ab"</tt> when it is followed by the text <tt>"bc"</tt>. 
In this latter case, the matching of <i>r</i> cannot extend into the beginning of <i>x</i>, so the result is specified.</p></blockquote><h4><a name="tag_04_73_19"></a>FUTURE DIRECTIONS</h4><blockquote><p>None.</p></blockquote><h4><a name="tag_04_73_20"></a>SEE ALSO</h4><blockquote><p><a href="c99.html"><i>c99</i></a>, <a href="ed.html"><i>ed</i></a>, <a href="yacc.html"><i>yacc</i></a></p></blockquote><h4><a name="tag_04_73_21"></a>CHANGE HISTORY</h4><blockquote><p>First released in Issue 2.</p></blockquote><h4><a name="tag_04_73_22"></a>Issue 6</h4><blockquote><p>This utility is marked as part of the C-Language Development Utilities option.</p><p>The obsolescent <b>-c</b> option is withdrawn in this issue.</p><p>The normative text is reworded to avoid use of the term &quot;must&quot; for application requirements.</p></blockquote><div class="box"><em>End of informative text.</em></div><hr><hr size="2" noshade><center><font size="2"><!--footer start-->UNIX &reg; is a registered Trademark of The Open Group.<br>POSIX &reg; is a registered Trademark of The IEEE.<br>[ <a href="../mindex.html">Main Index</a> | <a href="../basedefs/contents.html">XBD</a> | <a href="../utilities/contents.html">XCU</a> | <a href="../functions/contents.html">XSH</a> | <a href="../xrat/contents.html">XRAT</a>]</font></center><!--footer end--><hr size="2" noshade></body></html>