<html>
<head>
<title>Scanner</title>
<meta name="description" content="Scanner implementation in C++">
<meta name="keywords" content="scanner, lexer, lexical analyzer, pointer, token, symbol">
<link rel="stylesheet" href="../../rs.css">
</head>
<body background="../../images/margin.gif" bgcolor="#FFFFDC">
<!-- Main Table -->
<table cellpadding="6">
<tr>
<td width="78">
<td>
<h3>Scanner</h3>
<p class=topics>Pointer to pointer type, passing simple types by reference, C++ identifiers, passing buffers.
<p>The list of tokens was enlarged to include four arithmetic operators, the assignment operator, parentheses and a token representing an identifier. An identifier is a symbolic name, like<i> pi</i>, <i>sin</i>, <i>x</i>, etc.
<tr>
<td class=margin valign=top>
<br>
<a href="source/calc2.zip">
<img src="Images/brace.gif" width=16 height=16 border=1 alt="Download!"><br>source</a>
<td>
<!-- Code --><table width="100%" cellspacing=10><tr> <td class=codetable>
<pre>enum EToken
{
tEnd,
tError,
tNumber, // literal number
tPlus, // +
tMult, // *
tMinus, // -
tDivide, // /
tLParen, // (
tRParen, // )
tAssign, // =
tIdent // identifier (symbolic name)
};
</pre></table><!-- End Code --><p>The <var>Accept</var> method was expanded to recognize the additional arithmetic symbols as well as floating point numbers and identifiers. The decimal point was added to the list of digits in the scanner's switch statement, so that numbers like .5 that start with a decimal point are recognized. The library function <var>strtod</var> (string to double) not only converts a string to a floating point number, but also sets a pointer to the first character that cannot be part of the number. This is very useful, since it lets us easily calculate the new value of <var>_iLook</var> after scanning the number.
<!-- Code --><table width="100%" cellspacing=10><tr>
<td class=codetable><pre>
case '0': case '1': case '2': case '3': case '4':
case '5': case '6': case '7': case '8': case '9':
case '.':
{
_token = tNumber;
char * p;
_number = strtod (&_buf [_iLook], &p);
_iLook = p - _buf; // pointer subtraction
break;
}
</pre></table><!-- End Code --><p>The function <var>strtod</var> has two outputs: the value of the number that it has recognized and the pointer to the first unrecognized character.
<!-- Code --><table width="100%" cellspacing=10><tr> <td class=codetable>
<pre>double strtod (char const * str, char ** ppEnd);
</pre></table><!-- End Code --><p>How can a function have more than one output? The trick is to pass an argument that is a reference or a pointer to the value to be modified by the function. In our case the additional output is a pointer to char. We have to pass a reference or a pointer to this pointer. (Since <var>strtod</var> is a function from the standard C library it uses pointers rather than references. )
<p>Let's see what happens, step by step. We first define the variable that is to be modified by <var>strtod</var>. This variable is a pointer to a <var>char</var>
<!--Code --><table width="100%" cellspacing=10><tr>
<td class=codetable><pre>
char * p;
</pre></table><!-- End Code -->
<p>Notice that we don't have to initialize it to anything. It will be overwritten in the subsequent call anyway. Next, we pass the address of this variable to <var>strtod
</var><!-- Code --><table width="100%" cellspacing=10><tr> <td class=codetable>
<pre>_number = strtod (&_buf [_iLook], <var>&p</var>);
</pre></table><!-- End Code -->
<p>The function expects a <b><i>pointer to a pointer</i></b> to a <var>char</var>
<!-- Code --><table width="100%" cellspacing=10><tr> <td class=codetable>
<pre>double strtod (char const * str, <var>char ** ppEnd</var>);
</pre></table><!-- End Code -->
<p>By dereferencing this pointer to pointer, <var>strtod</var> can overwrite the value of the pointer. For instance, it could do this:
<!-- Code --><table width="100%" cellspacing=10><tr> <td class=codetable>
<pre>*ppEnd = pCurrent;
</pre></table><!-- End Code -->
<p>This would make the original <var>p</var> point to whatever <var>pCurrent</var> was pointing to.
<p align="CENTER"><img src="images/Image41.gif" width=240 height=126>
<p class=caption align="CENTER">Figure 2-6
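<p>To make the mechanism concrete, here is a minimal sketch of a function that has two outputs in the same way <var>strtod</var> does. The function <var>ParseDigits</var> is hypothetical and exists only for this illustration; it handles unsigned decimal integers and ignores overflow.

```cpp
// Hypothetical function, for illustration only: converts a run of decimal
// digits to an int and reports, through ppEnd, where the run stopped.
// Overflow is not checked.
int ParseDigits (char const * str, char const ** ppEnd)
{
    char const * pCurrent = str;
    int value = 0;
    while (*pCurrent >= '0' && *pCurrent <= '9')
    {
        value = 10 * value + (*pCurrent - '0');
        ++pCurrent;
    }
    *ppEnd = pCurrent;  // overwrite the caller's pointer
    return value;
}
```

<p>A caller defines <var>char const * p;</var> and passes <var>&p</var>. After the call, <var>p</var> points to the first character that is not a digit.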
<p>In C++ we could have passed a reference to a pointer instead (not that it's much more readable).
<!-- Code --><table width="100%" cellspacing=10><tr> <td class=codetable>
<pre>char *& pEnd
</pre></table><!-- End Code --><p>It is not clear that passing simple types like <var>char*</var> or <var>int</var> by reference leads to more readable code. Consider this
<!-- Code --><table width="100%" cellspacing=10><tr> <td class=codetable>
<pre>char * p;
_number = StrToDouble (&_buf [_iLook], <var>p</var>);
</pre></table><!-- End Code --><p>It looks like passing an uninitialized variable to a function. Only by looking up the declaration of <var>StrToDouble</var> would you know that <var>p</var> is passed by reference
<!-- Code --><table width="100%" cellspacing=10><tr> <td class=codetable>
<pre>double StrToDouble (char const * str, <var>char *& rpEnd</var>)
{
...
<var>rpEnd = pCurrent</var>;
...
}
</pre></table><!-- End Code --><p>Although it definitely is a good programming practice to look up at least the declaration of the function you are about to call, one might argue that it shouldn't be necessary to look it up when you are reading somebody else's code. Then again, how can you understand the code if you don't know what <var>StrToDouble</var> is doing? And how about a comment that will immediately explain what is going on?
<!-- Code --><table width="100%" cellspacing=10><tr> <td class=codetable>
<pre>char * p; // <b>p will be initialized by StrToDouble</b>
_number = StrToDouble (&_buf [_iLook], p);
</pre></table><!-- End Code --><p>You should definitely put a comment whenever you define a variable without immediately initializing it. Otherwise the reader of your code will suspect a bug.
<p>Taking all that into account, my recommendation would be to go ahead and use C++ references for passing both simple and user-defined types by reference.
<p>Of course, if <var>strtod</var> were not written by a human optimizing compiler, the code would probably look more like this
<!-- Code --><table width="100%" cellspacing=10><tr> <td class=codetable>
<pre>case '0': case '1': case '2': case '3': case '4':
case '5': case '6': case '7': case '8': case '9':
case '.':
{
_token = tNumber;
_number = StrToDouble (_buf, _iLook); // updates _iLook
break;
}
</pre></table><!-- End Code --><p>with <var>StrToDouble</var> declared as follows
<!-- Code --><table width="100%" cellspacing=10><tr> <td class=codetable>
<pre>double StrToDouble (char const * pBuf, int& iCurrent);
</pre></table><!-- End Code --><p>It would start converting the string to a double starting from<var> pBuf [iCurrent]</var> advancing <var>iCurrent</var> past the end of the number.
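<p>The text gives only the declaration of <var>StrToDouble</var>. One possible implementation, offered here as an assumption rather than as the book's code, simply forwards to <var>strtod</var> and converts the end pointer back into an index.

```cpp
#include <cstdlib>  // std::strtod

// A possible implementation, assumed for illustration: forwards to strtod
// and converts the end pointer back to an index into pBuf.
double StrToDouble (char const * pBuf, int & iCurrent)
{
    char * pEnd = nullptr;  // set by strtod
    double value = std::strtod (pBuf + iCurrent, &pEnd);
    iCurrent = static_cast<int> (pEnd - pBuf);  // pointer subtraction
    return value;
}
```

<p>If no number can be recognized, <var>strtod</var> sets the end pointer to the start of its input, so <var>iCurrent</var> is left unchanged.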
<p>Back to <var>Scanner::Accept()</var>. Identifiers are recognized in the default clause of the big switch. The idea is that if the character is not a digit, not a decimal point, and not an operator, then it must either start an identifier or be an error. We require an identifier to start with an uppercase or lowercase letter, or with an underscore. By the way, this is exactly the same requirement that C++ identifiers must fulfill. We use the <var>isalpha()</var> function (really a macro) to check for the letters of the alphabet. Inside the identifier we (and C++) allow digits as well. The macro <var>isalnum()</var> checks if the character is alphanumeric. Examples of identifiers are thus <var>i</var>, <var>pEnd</var>, <var>_token</var>, <var>__iscsymf</var>, <var>Istop4digits</var>, <var>SERIOUS_ERROR_1</var>, etc.
<!-- Code --><table width="100%" cellspacing=10><tr> <td class=codetable>
<pre>default:
if (isalpha (_buf [_iLook]) || _buf [_iLook] == '_')
{
_token = tIdent;
_iSymbol = _iLook;
int cLook; // initialized in the do loop
do {
++_iLook;
cLook = _buf [_iLook];
} while (isalnum (cLook) || cLook == '_');
_lenSymbol = _iLook - _iSymbol;
if (_lenSymbol > maxSymLen)
_lenSymbol = maxSymLen;
}
else
_token = tError;
break;
</pre></table><!-- End Code --><p>To simplify our lives as programmers, we chose to limit the size of symbols to <var>maxSymLen</var>. Remember, we are still weekend programmers!
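<p>The identifier rule can be tried in isolation. The free function below is a hypothetical sketch that mirrors the default clause without the <var>Scanner</var> members or the <var>maxSymLen</var> truncation. It also casts each character to <var>unsigned char</var>, which the character classification functions require for characters with negative values.

```cpp
#include <cctype>  // std::isalpha, std::isalnum

// Hypothetical free function mirroring the default clause of the switch.
// Returns the length of the identifier starting at buf [iLook] (0 if
// there is none) and advances iLook past it.
int ScanIdentifier (char const * buf, int & iLook)
{
    unsigned char c = static_cast<unsigned char> (buf [iLook]);
    if (!std::isalpha (c) && c != '_')
        return 0;  // not the start of an identifier
    int iStart = iLook;
    do {
        ++iLook;
        c = static_cast<unsigned char> (buf [iLook]);
    } while (std::isalnum (c) || c == '_');
    return iLook - iStart;
}
```

<p>The loop stops at the terminating null at the latest, since the null character is neither alphanumeric nor an underscore.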
<p>Once the <var>Scanner</var> recognizes an identifier, it should be able to provide its name for use by other parts of the program. To retrieve a symbol name, we call the following method
<!-- Code --><table width="100%" cellspacing=10><tr> <td class=codetable>
<pre>void Scanner::SymbolName (char * strOut, int & len)
{
assert (len >= maxSymLen);
assert (_lenSymbol <= maxSymLen);
strncpy (strOut, &_buf [_iSymbol], _lenSymbol);
strOut [_lenSymbol] = 0;
len = _lenSymbol;
}
</pre></table><!-- End Code --><p>Notice that we <i>have to</i> make a copy of the string, since the original in the buffer is not null terminated. We copy the string to the caller's buffer <var>strOut</var>, whose capacity is given by <var>len</var>. We do it by calling the function <var>strncpy</var> (string-n-copy, where n means that there is a maximum count of characters to be copied). The length is an in/out parameter. It should be initialized by the caller to the number of characters the buffer <var>strOut</var> can hold, not counting the terminating null. After <var>SymbolName</var> returns, its value reflects the actual length of the string copied.
<p>How do we know that the buffer is big enough? We make it part of the contract: see the assertions.
<p>The method <var>SymbolName</var> is an example of a more general pattern of passing buffers of data between objects. There are three main schemes: caller's fixed buffer, caller-allocated buffer and callee-allocated buffer. In our case the buffer is passed by the caller and its size is fixed. This allows the caller to use a local fixed buffer, so there is no need to allocate or re-allocate it every time the function is called. Here's the example of the <var>Parser</var> code that makes this call; the buffer <var>strSymbol</var> is a local array
<!-- Code --><table width="100%" cellspacing=10><tr> <td class=codetable>
<pre>char strSymbol [maxSymLen + 1];
int lenSym = maxSymLen;
_scanner.SymbolName (strSymbol, lenSym);
</pre></table><!-- End Code --><p>Notice that this method can only be used when there is a well-defined and reasonable maximum size for the buffer, or when the data can be retrieved incrementally in multiple calls. Here, we were clever enough to always truncate the size of our identifiers to <var>maxSymLen</var>.
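<p>The incremental case mentioned above could look like the following sketch. The function <var>ReadChunk</var> and its parameters are hypothetical; the point is only that a small fixed buffer suffices when the callee remembers, through <var>pos</var>, how far the caller has read.

```cpp
#include <cstring>  // std::strlen, std::memcpy

// Hypothetical callee: copies up to cap characters of src, starting at
// pos, into the caller's fixed buffer. Advances pos and returns the number
// of characters copied; 0 means there is nothing left. The output is not
// null terminated.
int ReadChunk (char const * src, int & pos, char * out, int cap)
{
    int remaining = static_cast<int> (std::strlen (src)) - pos;
    if (remaining < 0)
        remaining = 0;  // pos is already past the end
    int n = remaining < cap ? remaining : cap;
    std::memcpy (out, src + pos, n);
    pos += n;
    return n;
}
```

<p>The caller keeps calling with the same local array until the function returns 0.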
<p>If the size of the data to be passed in the buffer is not limited, we have to be able to allocate the buffer on demand. In the case of caller-allocated buffer we have two options. Optimally, the caller should be able to first ask for the size of data, allocate the appropriate buffer and call the method to fill the buffer. There is a variation of the scheme, the caller re-allocated buffer, where the caller allocates the buffer of some arbitrary size that covers, say, 99% of the cases. When the data does not fit into the buffer, the callee returns the appropriate failure code and lets the caller allocate a bigger buffer.
<!-- Code --><table width="100%" cellspacing=10><tr> <td class=codetable>
<pre>char * pBuf = new char [goodSize];
int len = goodSize;
if (FillBuffer (pBuf, len) == errOverflow)
{
// rarely necessary
delete [] pBuf;
pBuf = new char [len]; // len updated by FillBuffer
FillBuffer (pBuf, len);
}
</pre></table><!-- End Code --><p>This may seem like a strange optimization until you encounter situations where the call to ask for the size of data is really expensive. For instance, you might be calling across the network, or require disk access to find the size, etc.
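<p>The callee's side of this protocol is not shown in the text. A minimal sketch, assuming that <var>FillBuffer</var> always reports the required size through <var>len</var>, might look like this. The names <var>Status</var>, <var>statusOk</var> and <var>theData</var> are made up for the example.

```cpp
#include <cstring>  // std::strlen, std::memcpy

enum Status { statusOk, errOverflow };

// Hypothetical data source standing in for an expensive lookup.
char const * const theData = "a fairly long piece of data";

// Copies the data, including the terminating null, into pBuf.
// On entry len is the size of pBuf; on return it is the size required.
Status FillBuffer (char * pBuf, int & len)
{
    int needed = static_cast<int> (std::strlen (theData)) + 1;
    Status status = errOverflow;
    if (needed <= len)
    {
        std::memcpy (pBuf, theData, needed);
        status = statusOk;
    }
    len = needed;  // tell the caller how much room is needed
    return status;
}
```

<p>With this contract the caller learns the required size from the failed call itself, so no separate size query is ever made.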
<p>The callee-allocated buffer seems a simple enough scheme. The most likely complication is a memory leak when the caller forgets to deallocate the buffer (which, we should remember, hasn't been explicitly allocated by the caller). We'll see how to protect ourselves from such problems using smart pointers (see the chapter on managing resources). Other complications arise when the callee uses a different memory allocator than the caller, or when the call is remoted using, for instance, remote procedure call (RPC). Usually we let the callee allocate memory when dealing with functions that have to return dynamic data structures (lists, trees, etc.). Here's a simple code example of callee-allocated buffer
<!-- Code --><table width="100%" cellspacing=10><tr> <td class=codetable>
<pre>char * pBuf = AcquireData ();
// use pBuf
delete [] pBuf; // assuming AcquireData allocated it with new []
</pre></table><!-- End Code --><p>The following decision tree summarizes various methods of passing data to the caller
<!-- Code --><table width="100%" cellspacing=10><tr> <td class=codetable>
<pre>if (max data size well defined)
{
use caller's fixed buffer
}
else if (it's cheap to ask for size)
{
use caller-allocated buffer
}
else if ((caller trusted to free memory
&& caller uses the same allocator
&& no problems with remoting)
|| returning dynamic data structures)
{
use callee-allocated buffer
}
else
{
use caller-re-allocated buffer
}
</pre></table><!-- End Code -->
<p>In the second part of the book we'll talk about some interesting ways of making the callee-allocated buffer a much more attractive and convenient mechanism.
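<p>As a preview of the general idea, and not necessarily the technique the book develops later, here is a sketch that uses the standard library's <var>std::unique_ptr</var>. The body of <var>AcquireData</var> is invented for the example; what matters is that the return type carries the obligation to free the buffer.

```cpp
#include <cstring>  // std::strlen, std::memcpy
#include <memory>   // std::unique_ptr

// Hypothetical callee: allocates the buffer and hands ownership back.
// The array form of unique_ptr calls delete [] automatically.
std::unique_ptr<char []> AcquireData ()
{
    char const * source = "data owned by the caller";
    std::size_t size = std::strlen (source) + 1;  // include the null
    std::unique_ptr<char []> pBuf (new char [size]);
    std::memcpy (pBuf.get (), source, size);
    return pBuf;
}
```

<p>The caller writes <var>std::unique_ptr&lt;char []&gt; pBuf = AcquireData ();</var> and never calls <var>delete</var>; the buffer is released when <var>pBuf</var> goes out of scope.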
<br><a href="6symtab.html">Next</a>
</table>
<!-- End Main Table -->
</body>
</html>