<html lang="en">

<head>

<title>The C Preprocessor</title>

<meta http-equiv="Content-Type" content="text/html">

<meta name="description" content="The C Preprocessor">

<meta name="generator" content="makeinfo 4.3">

<link href="http://www.gnu.org/software/texinfo/" rel="generator-home">

<!--

Copyright &copy; 1987, 1989, 1991, 1992, 1993, 1994, 1995, 1996,

1997, 1998, 1999, 2000, 2001, 2002, 2003

Free Software Foundation, Inc.



   <p>Permission is granted to copy, distribute and/or modify this document

under the terms of the GNU Free Documentation License, Version 1.1 or

any later version published by the Free Software Foundation.  A copy of

the license is included in the

section entitled "GNU Free Documentation License".



   <p>This manual contains no Invariant Sections.  The Front-Cover Texts are

(a) (see below), and the Back-Cover Texts are (b) (see below).



   <p>(a) The FSF's Front-Cover Text is:



   <p>A GNU Manual



   <p>(b) The FSF's Back-Cover Text is:



   <p>You have freedom to copy and modify this GNU Manual, like GNU

     software.  Copies published by the Free Software Foundation raise

     funds for GNU development. 

-->

</head>

<body>

<div class="node">

<p>

Node:<a name="Tokenization">Tokenization</a>,

Next:<a rel="next" accesskey="n" href="The-preprocessing-language.html#The%20preprocessing%20language">The preprocessing language</a>,

Previous:<a rel="previous" accesskey="p" href="Initial-processing.html#Initial%20processing">Initial processing</a>,

Up:<a rel="up" accesskey="u" href="Overview.html#Overview">Overview</a>

<hr><br>

</div>



<h3 class="section">Tokenization</h3>



   <p>After the textual transformations are finished, the input file is

converted into a sequence of <dfn>preprocessing tokens</dfn>.  These mostly

correspond to the syntactic tokens used by the C compiler, but there are

a few differences.  White space separates tokens; it is not itself a

token of any kind.  Tokens do not have to be separated by white space,

but white space is often necessary to avoid ambiguities.



   <p>When faced with a sequence of characters that has more than one possible

tokenization, the preprocessor is greedy.  It always makes each token,

starting from the left, as big as possible before moving on to the next

token.  For instance, <code>a+++++b</code> is interpreted as

<code>a&nbsp;++&nbsp;++&nbsp;+&nbsp;b</code>, not as <code>a&nbsp;++&nbsp;+&nbsp;++&nbsp;b</code>, even though the

latter tokenization could be part of a valid C program and the former

could not.
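
   <p>As a minimal sketch of the same rule on a shorter expression that does
compile, consider <code>a+++b</code>.  The function name <code>greedy_demo</code> and
the <code>out</code> array are hypothetical, chosen only for illustration.

```c
/* Illustrative sketch: greedy tokenization of a+++b. */
void greedy_demo(int out[3])
{
    int a = 1;
    int b = 10;

    /* Tokenized as  a ++ + b,  that is (a++) + b, never  a + (++b). */
    out[0] = a+++b;
    out[1] = a;   /* a was post-incremented */
    out[2] = b;   /* b is untouched */
}
```

   <p>Under this reading the sum uses the old value of <code>a</code>, and only
<code>a</code> is incremented.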



   <p>Once the input file is broken into tokens, the token boundaries never

change, except when the <code>##</code> preprocessing operator is used to paste

tokens together.  See <a href="Concatenation.html#Concatenation">Concatenation</a>.  For example,



<pre class="example">     #define foo() bar

     foo()baz

          ==&gt; bar baz

     <em>not</em>

          ==&gt; barbaz

     </pre>



   <p>The compiler does not re-tokenize the preprocessor's output.  Each

preprocessing token becomes one compiler token.
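
   <p>One consequence can be sketched as follows, assuming a hypothetical
macro <code>MINUS</code> and function <code>boundary_demo</code>.  A minus sign
written in the source and a minus sign produced by macro expansion remain
two separate tokens, so they never combine into a decrement operator.

```c
/* Illustrative sketch: tokens produced by expansion do not merge. */
#define MINUS -

int boundary_demo(void)
{
    int x = 5;

    /* Expands to the three tokens  - - x,  that is -(-x).
       If the two minus signs fused into --, the result would be 4. */
    return -MINUS x;
}
```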



   <p>Preprocessing tokens fall into five broad classes: identifiers,

preprocessing numbers, string literals, punctuators, and other.  An

<dfn>identifier</dfn> is the same as an identifier in C: any sequence of

letters, digits, or underscores, which begins with a letter or

underscore.  Keywords of C have no significance to the preprocessor;

they are ordinary identifiers.  You can define a macro whose name is a

keyword, for instance.  The only identifier which can be considered a

preprocessing keyword is <code>defined</code>.  See <a href="Defined.html#Defined">Defined</a>.



   <p>The same is mostly true of other languages which use the C preprocessor. 

However, a few of the keywords of C++ are significant even in the

preprocessor.  See <a href="C---Named-Operators.html#C++%20Named%20Operators">C++ Named Operators</a>.



   <p>In the 1999 C standard, identifiers may contain letters which are not

part of the "basic source character set," at the implementation's

discretion (such as accented Latin letters, Greek letters, or Chinese

ideograms).  This may be done with an extended character set, or the

<code>\u</code> and <code>\U</code> escape sequences.  GCC does not presently

implement either feature in the preprocessor or the compiler.



   <p>As an extension, GCC treats <code>$</code> as a letter.  This is for

compatibility with some systems, such as VMS, where <code>$</code> is commonly

used in system-defined function and object names.  <code>$</code> is not a

letter in strictly conforming mode, or if you specify the <code>-$</code>

option.  See <a href="Invocation.html#Invocation">Invocation</a>.



   <p>A <dfn>preprocessing number</dfn> has a rather bizarre definition.  The

category includes all the normal integer and floating point constants

one expects of C, but also a number of other things one might not

initially recognize as a number.  Formally, preprocessing numbers begin

with an optional period, a required decimal digit, and then continue

with any sequence of letters, digits, underscores, periods, and

exponents.  Exponents are the two-character sequences <code>e+</code>,

<code>e-</code>, <code>E+</code>, <code>E-</code>, <code>p+</code>, <code>p-</code>, <code>P+</code>, and

<code>P-</code>.  (The exponents that begin with <code>p</code> or <code>P</code> are new

to C99.  They are used for hexadecimal floating-point constants.)



   <p>The purpose of this unusual definition is to isolate the preprocessor

from the full complexity of numeric constants.  It does not have to

distinguish between lexically valid and invalid floating-point numbers,

which is complicated.  The definition also permits you to split an

identifier at any position and get exactly two tokens, which can then be

pasted back together with the <code>##</code> operator.



   <p>It's possible for preprocessing numbers to cause programs to be

misinterpreted.  For example, <code>0xE+12</code> is a preprocessing number

which does not translate to any valid numeric constant, and is therefore a

syntax error.  It does not mean <code>0xE&nbsp;+&nbsp;12</code>, which is what you

might have intended.
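
   <p>The following sketch illustrates both points under stated assumptions:
the macro <code>CAT</code>, the variable <code>x1y</code>, and the two function
names are hypothetical.  The first function pastes an identifier and a
preprocessing number back into one identifier; the second uses white space
so that the hexadecimal constant and the addition are separate tokens.

```c
/* Illustrative sketch: preprocessing numbers and token pasting. */
#define CAT(a, b) a ## b

int paste_demo(void)
{
    int x1y = 7;

    /* x is an identifier and 1y is a preprocessing number;
       pasting the two tokens yields the single identifier x1y. */
    return CAT(x, 1y);
}

int spaced_hex_sum(void)
{
    /* Three tokens: 0xE, +, 12.  Written as 0xE+12 this would be
       one invalid preprocessing number instead. */
    return 0xE + 12;
}
```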



   <p><dfn>String literals</dfn> are string constants, character constants, and

header file names (the argument of <code>#include</code>).<a rel="footnote" href="#fn-1"><sup>1</sup></a>  String constants and character

constants are straightforward: <tt>"<small class="dots">...</small>"</tt> or <tt>'<small class="dots">...</small>'</tt>.  In

either case embedded quotes should be escaped with a backslash:

<tt>'\''</tt> is the character constant for <code>'</code>.  There is no limit on

the length of a character constant, but the value of a character

constant that contains more than one character is

implementation-defined.  See <a href="Implementation-Details.html#Implementation%20Details">Implementation Details</a>.



   <p>Header file names either look like string constants, <tt>"<small class="dots">...</small>"</tt>, or are

written with angle brackets instead, <tt>&lt;<small class="dots">...</small>&gt;</tt>.  In either case,

backslash is an ordinary character.  There is no way to escape the

closing quote or angle bracket.  The preprocessor looks for the header

file in different places depending on which form you use.  See <a href="Include-Operation.html#Include%20Operation">Include Operation</a>.



   <p>No string literal may extend past the end of a line.  Older versions

of GCC accepted multi-line string constants.  You may use continued

lines instead, or string constant concatenation.  See <a href="Differences-from-previous-versions.html#Differences%20from%20previous%20versions">Differences from previous versions</a>.
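
   <p>Both alternatives can be sketched as follows.  The names
<code>concatenated</code>, <code>continued</code>, and <code>same_text</code> are
hypothetical; the sketch only assumes that a backslash-newline is removed
before tokenization and that adjacent string constants are concatenated.

```c
#include <string.h>

/* Adjacent string constants are joined into one string. */
static const char concatenated[] =
    "first line\n"
    "second line\n";

/* A backslash-newline continues the line.  The continuation starts in
   column one so that no extra blanks end up inside the string. */
static const char continued[] = "first line\n\
second line\n";

/* Returns 1 when both spellings produce the same text. */
int same_text(void)
{
    return strcmp(concatenated, continued) == 0;
}
```

   <p>Concatenation is usually easier to read, because the continued form
cannot be indented without changing the contents of the string.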



   <p><dfn>Punctuators</dfn> are all the usual bits of punctuation which are

meaningful to C and C++.  All but three of the punctuation characters in

ASCII are C punctuators.  The exceptions are <code>@</code>, <code>$</code>, and

<code>`</code>.  In addition, all the two- and three-character operators are

punctuators.  There are also six <dfn>digraphs</dfn>, which the C++ standard

calls <dfn>alternative tokens</dfn>; they are merely alternate ways to spell

other punctuators.  This is a second attempt to work around missing

punctuation in obsolete systems.  It has no negative side effects,

unlike trigraphs, but does not cover as much ground.  The digraphs and

their corresponding normal punctuators are:



<pre class="example">     Digraph:        &lt;%  %&gt;  &lt;:  :&gt;  %:  %:%:

     Punctuator:      {   }   [   ]   #    ##

     </pre>
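
   <p>As a rough sketch, a function written entirely with digraphs might look
like this.  The names <code>COUNT</code> and <code>digraph_sum</code> are
hypothetical, and the example assumes a compiler mode in which digraphs are
enabled.

```c
/* Illustrative sketch: every brace, bracket and # is spelled as a digraph. */
%:define COUNT 3

int digraph_sum(void)
<%
    int v<:COUNT:> = <% 1, 2, 3 %>;

    /* Same as v[0] + v[1] + v[2]. */
    return v<:0:> + v<:1:> + v<:2:>;
%>
```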



   <p>Any other single character is considered "other." It is passed on to

the preprocessor's output unmolested.  The C compiler will almost

certainly reject source code containing "other" tokens.  In ASCII, the

only other characters are <code>@</code>, <code>$</code>, <code>`</code>, and control

characters other than NUL (all bits zero).  (Note that <code>$</code> is

normally considered a letter.)  All characters with the high bit set

(numeric range 0x80-0xFF) are also "other" in the present

implementation.  This will change when proper support for international

character sets is added to GCC.



   <p>NUL is a special case because of the high probability that its

appearance is accidental, and because it may be invisible to the user

(many terminals do not display NUL at all).  Within comments, NULs are

silently ignored, just as any other character would be.  In running

text, NUL is considered white space.  For example, these two directives

have the same meaning.



<pre class="example">     #define X^@1

     #define X 1

     </pre>



<p>(where <code>^@</code> is ASCII NUL).  Within string or character constants,

NULs are preserved.  In the latter two cases the preprocessor emits a

warning message.



   <div class="footnote">

<hr>

<h4>Footnotes</h4>

<ol type="1">

<li><a name="fn-1"></a>

<p>The C

standard uses the term <dfn>string literal</dfn> to refer only to what we are

calling <dfn>string constants</dfn>.</p>



   </ol><hr></div>



   </body></html>