\input texinfo
@setfilename cppinternals.info
@settitle The GNU C Preprocessor Internals

@include gcc-common.texi

@ifinfo
@dircategory Software development
@direntry
* Cpplib: (cppinternals).      Cpplib internals.
@end direntry
@end ifinfo

@c @smallbook
@c @cropmarks
@c @finalout
@setchapternewpage odd
@ifinfo
This file documents the internals of the GNU C Preprocessor.

Copyright 2000, 2001, 2002, 2004, 2005, 2006, 2007 Free Software
Foundation, Inc.

Permission is granted to make and distribute verbatim copies of
this manual provided the copyright notice and this permission notice
are preserved on all copies.

@ignore
Permission is granted to process this file through Tex and print the
results, provided the printed document carries copying permission
notice identical to this one except for the removal of this paragraph
(this paragraph not being relevant to the printed manual).

@end ignore
Permission is granted to copy and distribute modified versions of this
manual under the conditions for verbatim copying, provided also that
the entire resulting derived work is distributed under the terms of a
permission notice identical to this one.

Permission is granted to copy and distribute translations of this manual
into another language, under the above conditions for modified versions.
@end ifinfo

@titlepage
@title Cpplib Internals
@versionsubtitle
@author Neil Booth
@page
@vskip 0pt plus 1filll
@c man begin COPYRIGHT
Copyright @copyright{} 2000, 2001, 2002, 2004, 2005
Free Software Foundation, Inc.

Permission is granted to make and distribute verbatim copies of
this manual provided the copyright notice and this permission notice
are preserved on all copies.

Permission is granted to copy and distribute modified versions of this
manual under the conditions for verbatim copying, provided also that
the entire resulting derived work is distributed under the terms of a
permission notice identical to this one.

Permission is granted to copy and distribute translations of this manual
into another language, under the above conditions
for modified versions.
@c man end
@end titlepage
@contents
@page

@node Top
@top
@chapter Cpplib---the GNU C Preprocessor

The GNU C preprocessor is
implemented as a library, @dfn{cpplib}, so it can be easily shared between
a stand-alone preprocessor, and a preprocessor integrated with the C,
C++ and Objective-C front ends.  It is also available for use by other
programs, though this is not recommended as its exposed interface has
not yet reached a point of reasonable stability.

The library has been written to be re-entrant, so that it can be used
to preprocess many files simultaneously if necessary.  It has also been
written with the preprocessing token as the fundamental unit; the
preprocessor in previous versions of GCC would operate on text strings
as the fundamental unit.

This brief manual documents the internals of cpplib, and explains some
of the tricky issues.  It is intended that, along with the comments in
the source code, a reasonably competent C programmer should be able to
figure out what the code is doing, and why things have been implemented
the way they have.

@menu
* Conventions::         Conventions used in the code.
* Lexer::               The combined C, C++ and Objective-C Lexer.
* Hash Nodes::          All identifiers are entered into a hash table.
* Macro Expansion::     Macro expansion algorithm.
* Token Spacing::       Spacing and paste avoidance issues.
* Line Numbering::      Tracking location within files.
* Guard Macros::        Optimizing header files with guard macros.
* Files::               File handling.
* Concept Index::       Index.
@end menu

@node Conventions
@unnumbered Conventions
@cindex interface
@cindex header files

cpplib has two interfaces---one is exposed internally only, and the
other is for both internal and external use.

The convention is that functions and types that are exposed to multiple
files internally are prefixed with @samp{_cpp_}, and are to be found in
the file @file{internal.h}.  Functions and types exposed to external
clients are in @file{cpplib.h}, and prefixed with @samp{cpp_}.
For historical reasons this is no longer quite true, but we should
strive to stick to it.

We are striving to reduce the information exposed in @file{cpplib.h} to the
bare minimum necessary, and then to keep it there.  This makes clear
exactly what external clients are entitled to assume, and allows us to
change internals in the future without worrying whether library clients
are perhaps relying on some kind of undocumented implementation-specific
behavior.

@node Lexer
@unnumbered The Lexer
@cindex lexer
@cindex newlines
@cindex escaped newlines

@section Overview
The lexer is contained in the file @file{lex.c}.  It is a hand-coded
lexer, and not implemented as a state machine.  It can understand C, C++
and Objective-C source code, and has been extended to allow reasonably
successful preprocessing of assembly language.  The lexer does not make
an initial pass to strip out trigraphs and escaped newlines, but handles
them as they are encountered in a single pass of the input file.  It
returns preprocessing tokens individually, not a line at a time.

It is mostly transparent to users of the library, since the library's
interface for obtaining the next token, @code{cpp_get_token}, takes care
of lexing new tokens, handling directives, and expanding macros as
necessary.  However, the lexer does expose some functionality so that
clients of the library can easily spell a given token, such as
@code{cpp_spell_token} and @code{cpp_token_len}.  These functions are
useful when generating diagnostics, and for emitting the preprocessed
output.

@section Lexing a token
Lexing of an individual token is handled by @code{_cpp_lex_direct} and
its subroutines.  In its current form the code is quite complicated,
with read ahead characters and such-like, since it strives to not step
back in the character stream in preparation for handling non-ASCII file
encodings.  The current plan is to convert any such files to UTF-8
before processing them.
This complexity is therefore unnecessary and
will be removed, so I'll not discuss it further here.

The job of @code{_cpp_lex_direct} is simply to lex a token.  It is not
responsible for issues like directive handling, returning lookahead
tokens directly, multiple-include optimization, or conditional block
skipping.  It necessarily has a minor r@^ole to play in memory
management of lexed lines.  I discuss these issues in a separate section
(@pxref{Lexing a line}).

The lexer places the token it lexes into storage pointed to by the
variable @code{cur_token}, and then increments it.  This variable is
important for correct diagnostic positioning.  Unless a specific line
and column are passed to the diagnostic routines, they will examine the
@code{line} and @code{col} values of the token just before the location
that @code{cur_token} points to, and use that location to report the
diagnostic.

The lexer does not consider whitespace to be a token in its own right.
If whitespace (other than a new line) precedes a token, it sets the
@code{PREV_WHITE} bit in the token's flags.  Each token has its
@code{line} and @code{col} variables set to the line and column of the
first character of the token.  This line number is the line number in
the translation unit, and can be converted to a source (file, line) pair
using the line map code.

The first token on a logical, i.e.@: unescaped, line has the flag
@code{BOL} set for beginning-of-line.  This flag is intended for
internal use, both to distinguish a @samp{#} that begins a directive
from one that doesn't, and to generate a call-back to clients that want
to be notified about the start of every non-directive line with tokens
on it.  Clients cannot reliably determine this for themselves: the first
token might be a macro, and the tokens of a macro expansion do not have
the @code{BOL} flag set.
The macro expansion may even be empty, and the
next token on the line certainly won't have the @code{BOL} flag set.

New lines are treated specially; exactly how the lexer handles them is
context-dependent.  The C standard mandates that directives are
terminated by the first unescaped newline character, even if it appears
in the middle of a macro expansion.  Therefore, if the state variable
@code{in_directive} is set, the lexer returns a @code{CPP_EOF} token,
which is normally used to indicate end-of-file, to indicate
end-of-directive.  In a directive a @code{CPP_EOF} token never means
end-of-file.  Conveniently, if the caller was @code{collect_args}, it
already handles @code{CPP_EOF} as if it were end-of-file, and reports an
error about an unterminated macro argument list.

The C standard also specifies that a new line in the middle of the
arguments to a macro is treated as whitespace.  This white space is
important in case the macro argument is stringified.  The state variable
@code{parsing_args} is nonzero when the preprocessor is collecting the
arguments to a macro call.  It is set to 1 when looking for the opening
parenthesis to a function-like macro, and 2 when collecting the actual
arguments up to the closing parenthesis, since these two cases need to
be distinguished sometimes.  One such time is here: the lexer sets the
@code{PREV_WHITE} flag of a token if it meets a new line when
@code{parsing_args} is set to 2.  It doesn't set it if it meets a new
line when @code{parsing_args} is 1, since then code like

@smallexample
#define foo() bar
foo
baz
@end smallexample

@noindent would be output with an erroneous space before @samp{baz}:

@smallexample
foo
 baz
@end smallexample

This is a good example of the subtlety of getting token spacing correct
in the preprocessor; there are plenty of tests in the testsuite for
corner cases like this.

The lexer is written to treat each of @samp{\r}, @samp{\n}, @samp{\r\n}
and @samp{\n\r} as a single new line indicator.
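The newline rule just described can be sketched in a few lines of C.
This is only an illustration under stated assumptions, not the code in
@file{lex.c}: the helper names @code{newline_length} and
@code{count_newlines} are hypothetical, and the sketch assumes the whole
input already sits in a memory buffer.

```c
#include <stddef.h>

/* Hypothetical helper, not cpplib's handle_newline: if S[I] starts a
   newline indicator, return its length in bytes (1 or 2); otherwise
   return 0.  */
static size_t
newline_length (const char *s, size_t len, size_t i)
{
  if (i >= len || (s[i] != '\n' && s[i] != '\r'))
    return 0;

  /* A different newline character immediately after the first pairs up
     with it, so "\r\n" and "\n\r" each count as one indicator.  Two
     identical characters, such as "\n\n", remain two newlines.  */
  if (i + 1 < len
      && (s[i + 1] == '\n' || s[i + 1] == '\r')
      && s[i + 1] != s[i])
    return 2;

  return 1;
}

/* Count the newline indicators in the LEN bytes starting at S.  */
size_t
count_newlines (const char *s, size_t len)
{
  size_t i = 0, count = 0;

  while (i < len)
    {
      size_t n = newline_length (s, len, i);

      if (n == 0)
        i++;                    /* Ordinary character.  */
      else
        {
          count++;              /* One logical newline, 1 or 2 bytes.  */
          i += n;
        }
    }

  return count;
}
```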
Treating all four sequences alike allows the lexer to
transparently preprocess MS-DOS, Macintosh and Unix files without their
needing to pass through a special filter beforehand.

We also decided to treat a backslash, either @samp{\} or the trigraph
@samp{??/}, separated from one of the above newline indicators by
non-comment whitespace only, as intending to escape the newline.  It
tends to be a typing mistake, and cannot reasonably be mistaken for
anything else in any of the C-family grammars.  Since handling it this
way is not strictly conforming to the ISO standard, the library issues a
warning wherever it encounters it.

Handling newlines like this is made simpler by doing it in one place
only.  The function @code{handle_newline} takes care of all newline
characters, and @code{skip_escaped_newlines} takes care of arbitrarily
long sequences of escaped newlines, deferring to @code{handle_newline}
to handle the newlines themselves.

The most painful aspect of lexing ISO-standard C and C++ is handling
trigraphs and backslash-escaped newlines.  Trigraphs are processed before
any interpretation of the meaning of a character is made, and unfortunately
there is a trigraph representation for a backslash, so it is possible for
the trigraph @samp{??/} to introduce an escaped newline.

Escaped newlines are tedious because theoretically they can occur
anywhere---between the @samp{+} and @samp{=} of the @samp{+=} token,
within the characters of an identifier, and even between the @samp{*}
and @samp{/} that terminate a comment.  Moreover, you cannot be sure
there is just one---there might be an arbitrarily long sequence of them.

So, for example, the routine that lexes a number, @code{parse_number},
cannot assume that it can scan forwards until the first non-number
character and be done with it, because this could be the @samp{\}
introducing an escaped newline, or the @samp{?} introducing the trigraph