<td> the set of all acceptable (7 bit ASCII) characters

</table>


<P>Simple sets may then be combined by the union (+) and difference (-)
operators.</P>

<P><I>Examples</I> </P>

<PRE>
  CHARACTERS
    digit = &quot;0123456789&quot; .          /* the set of all digits */
    hexdigit = digit + &quot;ABCDEF&quot; .   /* the set of all hexadecimal digits */
    eol = CHR(13) .                 /* the end-of-line character */
    noDigit = ANY - digit .         /* any character that is not a digit */
    ctrlChars = CHR(1) .. CHR(31) . /* ascii control characters */
</PRE>
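<P>The set operations above can be pictured with a small Java sketch. This is
not how Coco/R itself represents character sets; the class <I>CharSets</I> and
all of its method names are hypothetical, and ANY is assumed here to cover the
codes 0 to 127.</P>

```java
import java.util.BitSet;

/** Hypothetical helper: models Cocol character sets as 7-bit ASCII bit sets. */
final class CharSets {
    static final int ASCII_SIZE = 128;

    /** Set containing every character of the given string, e.g. "0123456789". */
    static BitSet of(String chars) {
        BitSet s = new BitSet(ASCII_SIZE);
        for (int i = 0; i < chars.length(); i++) {
            s.set(chars.charAt(i));
        }
        return s;
    }

    /** Set containing the inclusive range lo .. hi, as in CHR(1) .. CHR(31). */
    static BitSet range(int lo, int hi) {
        BitSet s = new BitSet(ASCII_SIZE);
        s.set(lo, hi + 1);           // BitSet.set(from, to) excludes 'to'
        return s;
    }

    /** ANY, assumed here to be every code from 0 to 127. */
    static BitSet any() {
        return range(0, ASCII_SIZE - 1);
    }

    /** The "+" operator: characters in either set. */
    static BitSet union(BitSet a, BitSet b) {
        BitSet r = (BitSet) a.clone();
        r.or(b);
        return r;
    }

    /** The "-" operator: characters in a that are not in b. */
    static BitSet difference(BitSet a, BitSet b) {
        BitSet r = (BitSet) a.clone();
        r.andNot(b);
        return r;
    }
}
```

<P>With these helpers, <I>hexdigit</I> corresponds to
<I>union(of("0123456789"), of("ABCDEF"))</I> and <I>noDigit</I> to
<I>difference(any(), of("0123456789"))</I>.</P>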

<P><B>Tokens</B>. A token is a terminal symbol for the parser but a syntactically structured symbol for the
scanner. The token structure has to be described by a regular expression in EBNF:</P>

<PRE>   Tokens      = &quot;TOKENS&quot; {Token} .
   Token       = TokenSymbol [&quot;=&quot; TokenExpr &quot;.&quot;] .
   TokenExpr   = TokenTerm {&quot;|&quot; TokenTerm} .
   TokenTerm   = TokenFactor {TokenFactor} [&quot;CONTEXT&quot; &quot;(&quot; TokenExpr &quot;)&quot;] .
   TokenFactor =   SetIdent | string
                 | &quot;(&quot; TokenExpr &quot;)&quot; |  &quot;[&quot; TokenExpr &quot;]&quot; | &quot;{&quot; TokenExpr &quot;}&quot; .
   TokenSymbol = TokenIdent | string .
   TokenIdent  = ident .
</PRE>

<P>Tokens may be declared in any order.  A token declaration defines a
<I>TokenSymbol</I> together with its structure. Usually the symbol on the left-hand
side of the declaration is an identifier, which is then used in other parts of
the grammar to denote the structure described on the right-hand side of the
declaration by a regular expression (expressed in EBNF).  This
<I>TokenExpr</I> may contain literals denoting themselves (for example
&quot;END&quot;), or the names of character sets (for example letter),
denoting an arbitrary character from such sets.  The restriction to regular
expressions means that it may not contain the names of any other tokens.</P>

<P>While token specification is usually straightforward, there are a number of
subtleties that may need emphasizing.  For example, since spaces are deemed to
be irrelevant when they come between tokens in the input for most languages,
one should not attempt to declare literal tokens that have spaces within
them.</P>

<P>The grammar for tokens allows for empty right-hand sides.  This may seem
strange, especially as no scanner is generated if the right-hand side of a
declaration is missing.  This facility is used if the user wishes to supply a
hand-crafted scanner, rather than the one generated by Coco/R (see Section 4).
In this case, the symbol on the left-hand side of a token declaration may
also simply be specified by a string, with no right-hand side. Tokens
specified without right-hand sides are numbered consecutively starting from 0,
and the hand-crafted scanner has to return token codes according to this
numbering scheme.</P>

<P>There is one predeclared token EOF that can be used in productions where
it is necessary to check explicitly that the end of the source has been
reached.  When the scanner detects that the end of the source has been
reached, further attempts to obtain a token return only this one.</P>

<P>The CONTEXT phrase in a <I>TokenTerm</I> means that the term is only
recognized when its right hand context in the input stream (i.e. the
characters following in the input stream) matches the <I>TokenExpr</I>
specified in brackets. Note that the context phrase is not part of the
token.</P>

<P><I>Examples</I> </P>

<PRE>   TOKENS
     ident  = letter {letter | digit} .
     real   = digit {digit} &quot;.&quot; {digit} [&quot;E&quot; [&quot;+&quot;|&quot;-&quot;] digit {digit}] .
     number = digit {digit} | digit {digit} CONTEXT (&quot;..&quot;) .
</PRE>

<P>The CONTEXT phrase in the above example allows the scanner to distinguish
between reals (e.g., 1.23) and range constructs (e.g., 1..2) that could
otherwise not be scanned with a single character lookahead. After reading
a &quot;1&quot; and a &quot;.&quot;, the scanner still works on both alternatives.
If the next character is again a &quot;.&quot; the &quot;..&quot; phrase
is pushed back to the input stream and a number is returned to the parser.
If the next character is not a &quot;.&quot;, the scanner continues with
the recognition of a real number. </P>
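<P>The decision described above can be sketched by hand in Java. This is an
illustration only, not the code that Coco/R generates: the class
<I>NumberScanner</I> and its members are hypothetical, and instead of reading
the dots and pushing them back, the sketch simply peeks two characters
ahead.</P>

```java
/** Hypothetical hand-written equivalent of the number/real rules; not Coco/R output. */
final class NumberScanner {
    private final String src;
    int pos = 0;                     // index of the next unread character

    NumberScanner(String src) {
        this.src = src;
    }

    /** Character at pos + offset, or '\0' past the end of the input. */
    private char peek(int offset) {
        int i = pos + offset;
        return i < src.length() ? src.charAt(i) : '\0';
    }

    private static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    private void digits() {
        while (isDigit(peek(0))) {
            pos++;
        }
    }

    /** Scans one token that starts with a digit; returns "number:text" or "real:text". */
    String next() {
        int start = pos;
        digits();
        // digit {digit} CONTEXT (".."): the two dots are context, not part of the token.
        if (peek(0) == '.' && peek(1) == '.') {
            return "number:" + src.substring(start, pos);
        }
        if (peek(0) == '.') {
            pos++;                   // consume the decimal point
            digits();
            if (peek(0) == 'E') {
                int sign = (peek(1) == '+' || peek(1) == '-') ? 1 : 0;
                if (isDigit(peek(1 + sign))) {
                    pos += 1 + sign; // consume "E" and the optional sign
                    digits();
                }
            }
            return "real:" + src.substring(start, pos);
        }
        return "number:" + src.substring(start, pos);
    }
}
```

<P>For the input 1..2 the sketch stops in front of the first dot, so the range
operator is still available to whatever scans the next token.</P>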

<P><B>Comments and ignorable characters.</B>
Usually spaces within the source text of a program are irrelevant, and in
scanning for the start of a token, a Coco/R generated scanner will simply
ignore them.  Other separators like tabs, line ends, and form feeds may also
be declared irrelevant, and some applications may prefer to ignore the
distinction between upper and lower case input.</P>

<P>Comments are difficult to specify with the regular expressions used to
denote tokens - indeed, nested comments may not be specified at all in this
way. Since comments are usually discarded by a parsing process, and may
typically appear in arbitrary places in source code, it makes sense to have a
special construct to express their structure.</P>

<P>Ignorable aspects of the scanning process are defined in Cocol by</P>

<PRE>   Comments  = &quot;COMMENTS&quot; &quot;FROM&quot; TokenExpr &quot;TO&quot; TokenExpr [ &quot;NESTED&quot; ] .
   Ignorable = &quot;IGNORE&quot; ( &quot;CASE&quot; | CharacterSet ) .
</PRE>

<P>where the optional keyword NESTED indicates that comments of that kind may be nested.  A
practical restriction is that comment brackets must not be longer than 2
characters.  It is possible to declare several kinds of comments within a
single grammar, for example:</P>

<PRE>   COMMENTS FROM &quot;/*&quot; TO &quot;*/&quot; NESTED
   COMMENTS FROM &quot;//&quot; TO eol
   IGNORE CHR(9) .. CHR(13)
</PRE>

<P>The set of ignorable characters in this example covers the standard white
space separators in ASCII files: tab, line feed, vertical tab, form feed and
carriage return.  The null character CHR(0) should not be included in any
ignorable set, because Coco/R uses it internally to mark the end of the input
file.</P>
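<P>To make the effect of NESTED concrete, the following Java sketch skips one
comment and keeps a depth counter when nesting is enabled. It is a simplified
illustration under assumed names (<I>CommentSkipper</I> and <I>skip</I> are
hypothetical), not the routine that Coco/R emits.</P>

```java
/** Hypothetical comment skipper; the names are illustrative, not Coco/R's. */
final class CommentSkipper {
    /**
     * Returns the index just past the comment that starts at 'start', or -1
     * if no comment starts there or the comment is never closed.
     */
    static int skip(String src, int start, String open, String close, boolean nested) {
        if (!src.startsWith(open, start)) {
            return -1;
        }
        int depth = 1;               // number of comment brackets currently open
        int i = start + open.length();
        while (i < src.length()) {
            if (src.startsWith(close, i)) {
                i += close.length();
                if (--depth == 0) {
                    return i;
                }
            } else if (nested && src.startsWith(open, i)) {
                depth++;
                i += open.length();
            } else {
                i++;
            }
        }
        return -1;                   // end of input reached inside the comment
    }
}
```

<P>Without NESTED the first closing bracket ends the comment; with NESTED the
comment ends only when every opening bracket has been matched.</P>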

<P><B>Pragmas</B>. Pragmas, like comments, are tokens that may occur anywhere
in the input stream, but, unlike comments, cannot be ignored. Pragmas
are often used to allow programmers to select compiler switches dynamically.
Since it becomes impractical to modify the phrase structure grammar to handle
this, a special mechanism is provided for the recognition and treatment of
pragmas. In Cocol they are declared like tokens, but may have an associated
semantic action that is executed whenever they are recognized by the
scanner.</P>

<PRE>   Pragmas  = &quot;PRAGMAS&quot; {Pragma} .
   Pragma   = Token [Action] .
   Action   = &quot;(.&quot; ArbitraryJavaCode &quot;.)&quot; .
</PRE>

<P><I>Example</I> </P>

<PRE>   PRAGMAS
     option = &quot;$&quot; {letter} .  (. int i = 1;
                                 while (i &lt; t.val.length()) {
                                   if (t.val.charAt(i) == 'A') ...;
                                   else if (t.val.charAt(i) == 'B') ...;
                                   i++;
                                 } .)
</PRE>

<P>Note: The next token to be delivered to the parser is available in the
field <I>t</I> of the generated parser (see also Section 3.1). It holds the
token code (<I>t.kind</I>), the token text (lexeme) (<I>t.val</I>) and the
token position (<I>t.pos</I>, <I>t.line</I>, <I>t.col</I>). Thus, <I>t.val</I>
is the string of the recognized pragma. </P>
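<P>One way to keep such a pragma action short is to move the switch handling
into a helper class. The sketch below assumes a hypothetical class
<I>Options</I> with two invented switches A and B; none of these names come
from Coco/R.</P>

```java
/** Hypothetical switch holder driven by a pragma text such as "$AB". */
final class Options {
    boolean optionA = false;
    boolean optionB = false;

    /** Applies each letter that follows the leading '$' of the pragma text. */
    void apply(String pragmaText) {
        for (int i = 1; i < pragmaText.length(); i++) {
            switch (pragmaText.charAt(i)) {
                case 'A':
                    optionA = true;
                    break;
                case 'B':
                    optionB = true;
                    break;
                default:
                    break;           // unknown switches are ignored in this sketch
            }
        }
    }
}
```

<P>If an <I>Options</I> object were declared as a field named <I>options</I>
(again an assumption), the semantic action of the pragma could shrink to
<I>(. options.apply(t.val); .)</I>.</P>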

<P><B>User names.</B>
Normally the generated scanner and parser use integer literals to denote the
symbols and tokens.  This makes for unreadable parsers, in some estimations.
If a NAMES part is added to the scanner specification, Coco/R will
generate code that uses names for the symbols.  By default these names have a
rather stereotyped form, but preferred user-defined names may be specified - this
may sometimes be needed to help resolve name clashes (for example, between the
default names that would be chosen for "point" and ".").</P>

<PRE>   UserNames  = &quot;NAMES&quot; { UserName } .
   UserName   = TokenIdent  &quot;=&quot; ( identifier | string ) &quot;.&quot; .
</PRE>

<P><I>Example</I> </P>

<PRE>   NAMES
     period   = &quot;.&quot; .
     ellipsis = &quot;...&quot; .
</PRE>

<P>The ability to use names is an extension over the original
implementation.</P>

<H3>2.4 Parser specification</H3>

<P>The parser specification is the main part of the compiler description. It
contains the productions of an attributed grammar specifying the syntax of the
language to be recognized, as well as the action to be taken as each phrase or
token is recognized. The form of the parser specification may itself be
described in EBNF as follows:</P>

<PRE>   ParserSpecification = &quot;PRODUCTIONS&quot; {Production} .
   Production          = NonTerminal [FormalAttributes] [LocalDeclarations]
                         '=' Expression &quot;.&quot; .
   FormalAttributes    =   '&lt;' ['^'] ArbitraryJavaParameterDeclarations '&gt;'
                         | '&lt;.' ['^'] ArbitraryJavaParameterDeclarations '.&gt;' .
   LocalDeclarations   = &quot;(.&quot; ArbitraryJavaDeclarations &quot;.)&quot; .
   NonTerminal         = ident .
</PRE>

<P>Any name appearing in a <I>Production</I> that has not earlier been declared as a
terminal token is considered to be the name of a <I>NonTerminal</I>. There must be
exactly one production for every <I>NonTerminal</I> (this production may, of course,
specify a list of alternative right-hand sides).</P>

<P><B>Productions</B>. A production is considered as a "procedure" that parses
a nonterminal. It has its own scope for attributes and local names, and is
made up of a left-hand side and a right-hand side which are separated by an
equal sign. The left-hand side specifies the name of the nonterminal, together
with its formal attributes and local declarations. The right-hand side
consists of an EBNF expression that specifies the structure of the nonterminal
as well as its translation. </P>

<P>As in the case of tokens, some subtleties in the specification of productions
should be emphasized.  Firstly, the productions may be given in any order, but
a production must be given for a <I>GoalIdentifier</I> that matches the name used
for the grammar.</P>

<P>The <I>formal attributes</I> enclosed in angle brackets &quot;&lt;&quot; and &quot;&gt;&quot;
or &quot;&lt;.&quot; and &quot;.&gt;&quot; essentially consist of parameter declarations in Java. Since
grammar symbols may have output attributes, while Java does not have reference
parameters, the first attribute in an attribute declaration is considered to
be an output attribute if it is preceded by the character '^'. An output
attribute is translated into a function return value. For example, the
declaration </P>

<PRE>   SomeSymbol &lt;^int n, String name&gt; = ... .
</PRE>

<P>is translated into </P>

<PRE>   private int SomeSymbol (String name) {
     int n;
     ...
     return n;
   }
</PRE>

<P>The <I>local declarations</I> are arbitrary Java declarations enclosed
in &quot;(.&quot; and &quot;.)&quot;. A production constitutes a scope
for its formal attributes and its locally declared names. Terminals and
nonterminals, globally declared fields and methods, as well as imported
classes, are visible in any production. </P>

<P>The syntax of attributes and local declarations is not checked by Coco/R;
this is left to the responsibility of the Java compiler that will actually
compile the generated application.</P>

<P>The goal symbol may not have any <I>FormalAttributes</I>.  Any information
that the parser is required to pass back to the calling driver program must be
handled in other ways.  At times this may prove slightly awkward.</P>

<P>An identifier chosen as the name of a <I>NonTerminal</I>
may clash with one of the internal names used in the rest of the system.  Such
clashes will only become apparent when the application is compiled and linked,
and may require the user to redefine the grammar to use other identifiers.</P>

<P><B>Expressions</B>. An EBNF expression defines the context-free structure
of some part of the source language, together with the attributes and semantic
actions that specify how the parser must react to the recognition of each
component.</P>

<PRE>   Expression = Term {'|' Term} .
   Term       = Factor {Factor} .
   Factor     =   [&quot;WEAK&quot;] TokenSymbol
                | NonTerminal [Attributes]
                | SemAction
                | &quot;ANY&quot;
                | &quot;SYNC&quot;
                | '(' Expression ')'
                | '[' Expression ']'
                | '{' Expression '}' .
   Attributes =   '&lt;' ['^'] ArbitraryJavaParameters '&gt;'
                | '&lt;.' ['^'] ArbitraryJavaParameters '.&gt;'.
   SemAction  = &quot;(.&quot; ArbitraryJavaStatements &quot;.)&quot; .
   Symbol     = ident | string .
</PRE>

<P>The <I>Attributes</I> enclosed in angle brackets that may follow a
<I>NonTerminal</I> in the context of a <I>Factor</I> effectively denote
the actual parameters that will be used in calling the corresponding routine.
If a <I>NonTerminal</I> has been defined on the left-hand side of a
<I>Production</I> to have <I>FormalAttributes</I>, then every occurrence of that
<I>NonTerminal</I> in the right-hand side of a <I>Production</I> must have a
list of actual attributes that correspond to the <I>FormalAttributes</I>
according to the parameter compatibility rules of Java. However, the
conformance is only checked when the generated parser class is compiled.</P>

<P>If the attributes contain the symbol "&gt;" they must be enclosed in
brackets of the form "&lt;." and ".&gt;", for example:</P>

<PRE>  &lt;. a&gt;b .&gt;
</PRE>

<P>If the first attribute is preceded by a '^' it is considered an <I>output
attribute</I>, and the corresponding parsing method will be called as a
function that assigns to this attribute, rather than as a void function.
Otherwise, attributes are <I>input attributes</I>.</P>
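<P>How an output attribute becomes a return value at the call site can be
sketched with a tiny hand-written parser. The productions Sum and Number, the
class <I>SumParser</I> and its pre-split token array are all invented for this
illustration; the code imitates the translation scheme described above but is
not actual Coco/R output, and it performs no error handling.</P>

```java
/** Hypothetical hand-written parser that imitates the translation of output attributes. */
final class SumParser {
    private final String[] tokens;   // pre-split input, e.g. {"1", "+", "2"}
    private int next = 0;            // index of the next unread token

    SumParser(String... tokens) {
        this.tokens = tokens;
    }

    private boolean at(String text) {
        return next < tokens.length && tokens[next].equals(text);
    }

    // Number<^int value> = number (. value = Integer.parseInt(...); .) .
    private int number() {
        int value;
        value = Integer.parseInt(tokens[next++]);
        return value;                // the output attribute is returned
    }

    // Sum<^int total> (. int n; .)
    // = Number<^total> { "+" Number<^n> (. total += n; .) } .
    int sum() {
        int total;
        int n;
        total = number();            // call "as a function that assigns to this attribute"
        while (at("+")) {
            next++;                  // consume "+"
            n = number();
            total += n;
        }
        return total;
    }
}
```

<P>Each occurrence of Number with an output attribute on a right-hand side
turns into an assignment from a method call, which is why the attribute has to
come first in the list.</P>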

<P>A <I>semantic action</I> is an arbitrary sequence of Java statements
enclosed in &quot;(.&quot; and &quot;.)&quot;. These are simply incorporated
into the generated parser; their syntax is only checked when the generated
parser is compiled.  The digraph &quot;(.&quot; is not allowed within this
text, nor are incomplete strings, as these are symptomatic of mismatched
text.</P>

<P>The symbol ANY denotes any terminal that cannot follow ANY in that context.
It can conveniently be used to parse structures that contain arbitrary
text. For example, the translation of a Cocol/R attribute list incorporates
actions that look as follows: </P>

<PRE>   Attributes &lt;^int len&gt;
             (. int pos; .)
   =
     '&lt;'     (. pos = token.pos + 1; .)
     {ANY}
     '&gt;'     (. len = token.pos - pos; .) .
</PRE>

<P>In this example the closing angle bracket is an implicit alternative
of the ANY symbol in curly brackets. The meaning is that ANY matches any
terminal except '&gt;'. <I>token.pos</I> is the source text position of
the most recently recognized token (<I>token</I> is a field of the generated
parser; see Section 3.1). </P>
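<P>The effect of {ANY} followed by a closing bracket can be pictured as a loop
that consumes tokens until the bracket (or the end of the input) is seen. The
sketch below works on a plain list of token texts; <I>AnySkipper</I> and
<I>skipAnyUntil</I> are hypothetical names, and the real generated parser
works on token codes rather than strings.</P>

```java
import java.util.List;

/** Hypothetical token-skipping loop for {ANY} followed by a closing bracket. */
final class AnySkipper {
    /**
     * Starting at index 'from' (just after the opening bracket), advances over
     * every token that differs from 'closer'. Returns the index of 'closer',
     * or tokens.size() if the input ends first.
     */
    static int skipAnyUntil(List<String> tokens, int from, String closer) {
        int i = from;
        while (i < tokens.size() && !tokens.get(i).equals(closer)) {
            i++;                     // this token is matched by ANY
        }
        return i;
    }
}
```

<P>Because the closer is excluded from ANY, the loop always stops at the first
closing bracket, which matches the implicit-alternative rule stated
above.</P>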

<P>Note that it is possible to write a production in which an action appears
to be associated with an alternative for an expression that contains no
terminals or nonterminals. This feature is often useful. For example, we
might have</P>