@c cppinternals.texi
sequence that represents the @samp{\} of an escaped newline.  If it
encounters a @samp{?} or @samp{\}, it calls @code{skip_escaped_newlines}
to skip over any potential escaped newlines before checking whether the
number has been finished.

Similarly code in the main body of @code{_cpp_lex_direct} cannot simply
check for a @samp{=} after a @samp{+} character to determine whether it
has a @samp{+=} token; it needs to be prepared for an escaped newline of
some sort.  Such cases use the function @code{get_effective_char}, which
returns the first character after any intervening escaped newlines.

The lexer needs to keep track of the correct column position, including
counting tabs as specified by the @option{-ftabstop=} option.  This
should be done even within C-style comments; they can appear in the
middle of a line, and we want to report diagnostics in the correct
position for text appearing after the end of the comment.

@anchor{Invalid identifiers}
Some identifiers, such as @code{__VA_ARGS__} and poisoned identifiers,
may be invalid and require a diagnostic.  However, if they appear in a
macro expansion we don't want to complain with each use of the macro.
It is therefore best to catch them during the lexing stage, in
@code{parse_identifier}.  In both cases, whether a diagnostic is needed
or not is dependent upon the lexer's state.  For example, we don't want
to issue a diagnostic for re-poisoning a poisoned identifier, or for
using @code{__VA_ARGS__} in the expansion of a variable-argument macro.
Therefore @code{parse_identifier} makes use of state flags to determine
whether a diagnostic is appropriate.  Since we change state on a
per-token basis, and don't lex whole lines at a time, this is not a
problem.

Another place where state flags are used to change behavior is whilst
lexing header names.  Normally, a @samp{<} would be lexed as a single
token.  After a @code{#include} directive, though, it should be lexed as
a single token as far as the nearest @samp{>} character.  Note that we
don't allow the terminators of header names to be escaped; the first
@samp{"} or @samp{>} terminates the header name.

Interpretation of some character sequences depends upon whether we are
lexing C, C++ or Objective-C, and on the revision of the standard in
force.  For example, @samp{::} is a single token in C++, but in C it is
two separate @samp{:} tokens and almost certainly a syntax error.  Such
cases are handled by @code{_cpp_lex_direct} based upon command-line
flags stored in the @code{cpp_options} structure.
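As a purely illustrative sketch of how these two concerns, skipping
escaped newlines and lexing @samp{::} as a single token only for C++,
might combine, consider the following.  It operates on a bare string
rather than cpplib's buffers, ignores the trigraph spelling of
@samp{\}, and merely borrows the names of the real routines:

@smallexample
#include <stdbool.h>

/* Look past any backslash-newline sequences starting at P.  (The
   real lexer must also cope with the trigraph spelling of the
   backslash; that is ignored here.)  */
static const char *
skip_escaped_newlines (const char *p)
@{
  while (p[0] == '\\' && p[1] == '\n')
    p += 2;
  return p;
@}

/* Return the first character at or after *PP that is not hidden
   behind an escaped newline, advancing *PP to it.  */
static char
get_effective_char (const char **pp)
@{
  *pp = skip_escaped_newlines (*pp);
  return **pp;
@}

enum toktype @{ TOK_COLON, TOK_SCOPE @};

/* Lex a token starting with ':'; "::" is a single token only in C++.  */
static enum toktype
lex_colon (const char **cur, bool cplusplus)
@{
  ++*cur;                            /* Consume the first ':'.  */
  if (cplusplus && get_effective_char (cur) == ':')
    @{
      ++*cur;                        /* Consume the second ':'.  */
      return TOK_SCOPE;
    @}
  return TOK_COLON;
@}
@end smallexample

cpplib's own routines of these names work on its internal buffers
rather than a plain string, but the idea they implement is the one
described above.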
Once a token has been lexed, it leads an independent existence.  The
spelling of numbers, identifiers and strings is copied to permanent
storage from the original input buffer, so a token remains valid and
correct even if its source buffer is freed with @code{_cpp_pop_buffer}.
The storage holding the spellings of such tokens remains until the
client program calls @code{cpp_destroy}, probably at the end of the
translation unit.

@anchor{Lexing a line}
@section Lexing a line
@cindex token run

When the preprocessor was changed to return pointers to tokens, one
feature I wanted was some sort of guarantee regarding how long a
returned pointer remains valid.  This is important to the stand-alone
preprocessor, the future direction of the C family front ends, and even
to cpplib itself internally.

Occasionally the preprocessor wants to be able to peek ahead in the
token stream.  For example, after the name of a function-like macro, it
wants to check the next token to see if it is an opening parenthesis.
Another example is that, after reading the first few tokens of a
@code{#pragma} directive and not recognizing it as a registered pragma,
it wants to backtrack and allow the user-defined handler for unknown
pragmas to access the full @code{#pragma} token stream.  The stand-alone
preprocessor wants to be able to test the current token with the
previous one to see if a space needs to be inserted to preserve their
separate tokenization upon re-lexing (paste avoidance), so it needs to
be sure the pointer to the previous token is still valid.  The
recursive-descent C++ parser wants to be able to perform tentative
parsing arbitrarily far ahead in the token stream, and then to be able
to jump back to a prior position in that stream if necessary.

The rule I chose, which is fairly natural, is to arrange that the
preprocessor lex all tokens on a line consecutively into a token buffer,
which I call a @dfn{token run}, and when meeting an unescaped new line
(newlines within comments do not count either), to start lexing back at
the beginning of the run.  Note that we do @emph{not} lex a line of
tokens at once; if we did that @code{parse_identifier} would not have
state flags available to warn about invalid identifiers (@pxref{Invalid
identifiers}).

In other words, accessing tokens that appeared earlier in the current
line is valid, but since each logical line overwrites the tokens of the
previous line, tokens from prior lines are unavailable.  In particular,
since a directive only occupies a single logical line, this means that
the directive handlers like the @code{#pragma} handler can jump around
in the directive's tokens if necessary.

Two issues remain: what about tokens that arise from macro expansions,
and what happens when we have a long line that overflows the token run?

Since we promise clients that we preserve the validity of pointers that
we have already returned for tokens that appeared earlier in the line,
we cannot reallocate the run.  Instead, on overflow it is expanded by
chaining a new token run on to the end of the existing one.
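A minimal sketch of that arrangement, using a toy token type and
invented names rather than cpplib's @code{cpp_token} and its internal
run structure, might look like this:

@smallexample
#include <stdlib.h>

/* Toy stand-in for cpplib's cpp_token.  */
struct toy_token @{ int type; /* spelling, flags, ...  */ @};

#define RUN_SIZE 256     /* Token slots per run (illustrative).  */

struct token_run
@{
  struct toy_token *base, *limit;   /* Slots [base, limit).  */
  struct token_run *next, *prev;    /* Chain of runs.  */
@};

static struct token_run *
new_token_run (struct token_run *prev)
@{
  struct token_run *run = calloc (1, sizeof *run);
  run->base = calloc (RUN_SIZE, sizeof *run->base);
  run->limit = run->base + RUN_SIZE;
  run->prev = prev;
  return run;
@}

/* Return the slot for the next token.  On overflow, chain a fresh
   run rather than reallocating, so that pointers to tokens already
   handed out remain valid.  */
static struct toy_token *
next_token_slot (struct token_run **run, struct toy_token *cur)
@{
  if (cur == (*run)->limit)
    @{
      if ((*run)->next == NULL)
        (*run)->next = new_token_run (*run);
      *run = (*run)->next;
      cur = (*run)->base;
    @}
  return cur;
@}
@end smallexample

Because a full run is never reallocated, pointers into earlier runs stay
valid; rewinding to the start of the line is simply a matter of
resetting the current position to the first run's @code{base}.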
The tokens forming a macro's replacement list are collected by the
@code{#define} handler, and placed in storage that is only freed by
@code{cpp_destroy}.  So if a macro is expanded in the line of tokens,
the pointers to the tokens of its expansion that are returned will
always remain valid.  However, macros are a little trickier than that,
since they give rise to three sources of fresh tokens.  They are the
built-in macros like @code{__LINE__}, and the @samp{#} and @samp{##}
operators for stringification and token pasting.  I handled this by
allocating space for these tokens from the lexer's token run chain.
This means they automatically receive the same lifetime guarantees as
lexed tokens, and we don't need to concern ourselves with freeing them.

Lexing into a line of tokens solves some of the token memory management
issues, but not all.  The opening parenthesis after a function-like
macro name might lie on a different line, and the front ends definitely
want the ability to look ahead past the end of the current line.  So
cpplib only moves back to the start of the token run at the end of a
line if the variable @code{keep_tokens} is zero.  Line-buffering is
quite natural for the preprocessor, and as a result the only time cpplib
needs to increment this variable is whilst looking for the opening
parenthesis to, and reading the arguments of, a function-like macro.  In
the near future cpplib will export an interface to increment and
decrement this variable, so that clients can share full control over the
lifetime of token pointers too.

The routine @code{_cpp_lex_token} handles moving to new token runs,
calling @code{_cpp_lex_direct} to lex new tokens, or returning
previously-lexed tokens if we stepped back in the token stream.  It also
checks each token for the @code{BOL} flag, which might indicate a
directive that needs to be handled, or require a start-of-line call-back
to be made.  @code{_cpp_lex_token} also handles skipping over tokens in
failed conditional blocks, and invalidates the control macro of the
multiple-include optimization if a token was successfully lexed outside
a directive.  In other words, its callers do not need to concern
themselves with such issues.

@node Hash Nodes
@unnumbered Hash Nodes
@cindex hash table
@cindex identifiers
@cindex macros
@cindex assertions
@cindex named operators

When cpplib encounters an ``identifier'', it generates a hash code for
it and stores it in the hash table.  By ``identifier'' we mean tokens
with type @code{CPP_NAME}; this includes identifiers in the usual C
sense, as well as keywords, directive names, macro names and so on.  For
example, all of @code{pragma}, @code{int}, @code{foo} and
@code{__GNUC__} are identifiers and hashed when lexed.

Each node in the hash table contains various information about the
identifier it represents; for example, its length and type.  At any one
time, each identifier falls into exactly one of three categories:

@itemize @bullet
@item Macros

These have been declared to be macros, either on the command line or
with @code{#define}.  A few, such as @code{__TIME__}, are built-ins
entered in the hash table during initialization.  The hash node for a
normal macro points to a structure with more information about the
macro, such as whether it is function-like, how many arguments it takes,
and its expansion.  Built-in macros are flagged as special, and instead
contain an enum indicating which of the various built-in macros it is.

@item Assertions

Assertions are in a separate namespace to macros.  To enforce this, cpp
actually prepends a @code{#} character before hashing and entering it in
the hash table.  An assertion's node points to a chain of answers to
that assertion.

@item Void

Everything else falls into this category---an identifier that is not
currently a macro, or a macro that has since been undefined with
@code{#undef}.

When preprocessing C++, this category also includes the named operators,
such as @code{xor}.  In expressions these behave like the operators they
represent, but in contexts where the spelling of a token matters they
are spelt differently.  This spelling distinction is relevant when they
are operands of the stringizing and pasting macro operators @code{#} and
@code{##}.  Named operator hash nodes are flagged, both to catch the
spelling distinction and to prevent them from being defined as macros.
@end itemize

All uses of the same identifier share a single hash node.  Since each
identifier token, after lexing, contains a pointer to its hash node,
this is used to provide rapid lookup of various information.  For
example, when parsing a @code{#define} statement, CPP flags each
argument's identifier hash node with the index of that argument.  This
makes duplicated argument checking an O(1) operation for each argument.
Similarly, for each identifier in the macro's expansion, lookup to see
if it is an argument, and which argument it is, is also an O(1)
operation.  Further, each directive name, such as @code{endif}, has an
associated directive enum stored in its hash node, so that directive
lookup is also O(1).
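A hypothetical sketch of the argument-index trick, with field and
function names invented for the illustration rather than taken from
cpplib, might be:

@smallexample
/* While collecting the parameters of "#define m(a, b, a)", stamp each
   parameter's hash node with its 1-based index for the macro being
   defined.  A node that is already stamped is a duplicate parameter,
   so both the check and the recording are O(1).  */
struct toy_hash_node
@{
  const char *name;
  unsigned arg_index;       /* 0: not a parameter of this macro.  */
  /* type, flags, definition, ...  */
@};

/* Return nonzero if NODE is already a parameter; otherwise record it.  */
static int
record_parameter (struct toy_hash_node *node, unsigned *n_params)
@{
  if (node->arg_index != 0)
    return 1;               /* Duplicate parameter name.  */
  node->arg_index = ++*n_params;
  return 0;
@}
@end smallexample

Identifiers in the replacement list point at the very same hash nodes,
so testing whether one is an argument, and which argument it is, is the
same single field access.  A real implementation must of course clear
these stamps once the definition has been parsed, so that a later
@code{#define} starts afresh.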
Further,each directive name, such as @code{endif}, has an associated directiveenum stored in its hash node, so that directive lookup is also O(1).@node Macro Expansion@unnumbered Macro Expansion Algorithm@cindex macro expansionMacro expansion is a tricky operation, fraught with nasty corner casesand situations that render what you thought was a nifty way tooptimize the preprocessor's expansion algorithm wrong in quite subtleways.I strongly recommend you have a good grasp of how the C and C++standards require macros to be expanded before diving into thissection, let alone the code!. If you don't have a clear mentalpicture of how things like nested macro expansion, stringification andtoken pasting are supposed to work, damage to your sanity can quicklyresult.@section Internal representation of macros@cindex macro representation (internal)The preprocessor stores macro expansions in tokenized form. Thissaves repeated lexing passes during expansion, at the cost of a smallincrease in memory consumption on average. The tokens are storedcontiguously in memory, so a pointer to the first one and a tokencount is all you need to get the replacement list of a macro.If the macro is a function-like macro the preprocessor also stores itsparameters, in the form of an ordered list of pointers to the hashtable entry of each parameter's identifier. Further, in the macro'sstored expansion each occurrence of a parameter is replaced with aspecial token of type @code{CPP_MACRO_ARG}. Each such token holds theindex of the parameter it represents in the parameter list, whichallows rapid replacement of parameters with their arguments duringexpansion. Despite this optimization it is still necessary to storethe original parameters to the macro, both for dumping with e.g.,@option{-dD}, and to warn about non-trivial macro redefinitions whenthe parameter names have changed.@section Macro expansion overviewThe preprocessor maintains a @dfn{context stack}, implemented as alinked list of @code{cpp_context} structures, which together representthe macro expansion state at any one time. The @code{structcpp_reader} member variable @code{context} points to the current topof this stack. The top normally holds the unexpanded replacement listof the innermost macro under expansion, except when cpplib is about topre-expand an argument, in which case it holds that argument'sunexpanded tokens.When there are no macros under expansion, cpplib is in @dfn{basecontext}. All contexts other than the base context contain acontiguous list of tokens delimited by a starting and ending token.When not in base context, cpplib obtains the next token from the listof the top context. If there are no tokens left in the list, it popsthat context off the stack, and subsequent ones if necessary, until anunexhausted context is found or it returns to base context. In basecontext, cpplib reads tokens directly from the lexer.If it encounters an identifier that is both a macro and enabled forexpansion, cpplib prepares to push a new context for that macro on thestack by calling the routine @code{enter_macro_context}. When thisroutine returns, the new context will contain the unexpanded tokens ofthe replacement list of that macro. In the case of function-likemacros, @code{enter_macro_context} also replaces any parameters in thereplacement list, stored as @code{CPP_MACRO_ARG} tokens, with theappropriate macro argument. If the standard requires that theparameter be replaced with its expanded argument, the argument willhave been fully macro expanded first.