📄 lexer.otx
字号:
% lexer.tex% Practical Compiler Construction% Chapter: Lexical Analyzer\chapter{Lexical Analyzer} \section{Introduction} %Introduce the concept of a lexer. The first step in the compiling process involves reading source code, so that the compiler can check that source code for errors before translating it to, for example, assembly language. All programming languages provide an array of keywords, like \verb|IF|, \verb|WHILE|, \verb|SWITCH| and so on. A compiler is not usually interested in the individual characters that make up these keywords; these keywords are said to be atomic. However, in some cases the compiler does care about the individual characters that make up a word: an integer number (e.g. \verb|12345|), a string (e.g. \verb|"hello, world"|) and a floating point number (e.g. \verb|12e-09|) are all considered to be words, but the individual characters that make them up are significant. This distinction requires special processing of the input text, and this special processing is usually moved out of the parser and placed in a module called the \emph{lexical analyzer}, or \emph{lexer} or \emph{scanner} for short. It is the lexer's responsibility to divide the input stream into \emph{tokens} (atomic words). The parser (the module that deals with groups of tokens, checking that their order is valid) requests a token from the lexer, which reads characters from the input stream until it has accumulated enough characters to form a complete token, which it returns to the parser. \begin{example} \label{example:tokenizing} Given the input \begin{quote} \verb|the quick brown fox jumps over the lazy dog| \end{quote} a lexer will split this into the tokens \begin{quote} \verb|the|, \verb|quick|, \verb|brown|, \verb|fox|, \verb|jumps|, \verb|over|, \verb|the|, \verb|lazy| and \verb|dog| \end{quote} \end{example} The process of splitting input into tokens is called \emph{tokenizing}. Apart from tokenizing a sentence, a lexer can also divide tokens up into classes. Consider the following example: \begin{example} \label{example:tokenclasses} Given the input \begin{quote} \verb|the sum of 2 + 2 = 4.| \end{quote} a lexer will split this into the following tokens, with classes: \begin{quote}\begin{verbatim} Word: the Word: sum Word: of Number: 2 Plus: + Number: 2 Equals: = Number: 4 Dot: . \end{verbatim}\end{quote} \end{example} Some token classes are very narrow (containing only one token), while others are broad. For example, the token class \emph{Word} is used to represent \verb|the|, \verb|sum| and \verb|of|, while the token class \emph{Dot} can only be used when a \verb|.| is read. Incidentally, the lexical analyzer must know how to separate individual tokens. In program source text, keywords are usually separated by whitespace (spaces, tabs and line feeds). However, this is not always the case. Consider the following input: \begin{example} \label{example:tokenseparation} Given the input \begin{quote} \verb|sum=(2+2)*3;| \end{quote} a lexer will split this into the following tokens: \begin{quote} \verb|sum|, \verb|=|, \verb|(|, \verb|2|, \verb|+|, \verb|2|, \verb|)|, \verb|*|, \verb|3| and \verb|;| \end{quote} \end{example} Those familiar with popular programming languages like C or Pascal may know that mathematical tokens like numbers, \verb|=|, \verb|+| and \verb|*| are not required to be separated from each other by whitespace. The lexer must have some way to know when a token ends and the next token (if any) begins. In the next section, we will discuss the theory of regular languages to further clarify this point and how lexers deal with it. Lexers have an additional interesting property: they can be used to filter out input that is not important to the parser, so that the parser has less different tokens to deal with. Block comments and line comments are examples of uninteresting input. \section{Regular Language Theory} %This section explains all theory for regular languages in order %to introduce the lexer regexps. The lexical analyzer is a submodule of the parser. While the parser deals with \emph{context-free grammars} (a higher level of abstraction), the lexer deals with individual characters which form tokens (words). Some tokens are simple (\verb|IF| or \verb|WHILE|) while others are complex. All integer numbers, for example, are represented using the same token (\verb|INTEGER|), which covers many cases (\verb|1|, \verb|100|, \verb|5845|). This requires some notation to match all integer numbers, so they are treated the same. The answer lies in realizing that the collection of integer numbers is really a small language, with very strict rules (it is a so-called \emph{regular language}). Before we can show what regular languages are, we must must discuss some preliminary definitions first. % Referentie naar hoofdstuknummer van hoofdstuk 'Grammars' In the chapter on Grammars, we introduced the concept of an \emph{alphabet}, denoted as $\Sigma$. An alphabet is rather like the set of tokens a lexer can produce. We then defined the set of strings $\Sigma^{*}$ that can be generated from an alphabet, and said that a language is a subset of $\Sigma^{*}$. We now define, without proof, several operations that may be performed on languages. The first operation on languages that we present is the binary \emph{concatenation} operation. \begin{definition}[Concatenation operation] \label{def:languageconcatenation} \ \\ Let $X$ and $Y$ be two languages. Then $XY$ is the concatenation of these languages with \begin{center} $XY = \{uv\ |\ u \in X\ \land\ v \in Y\}$. \end{center} \end{definition} Concatenation of a language with itself is also possible and is denoted $X^{2}$. Concatenation of a string can also be performed multiple times, e.g. $X^{7}$. We will illustrate the definition of concatenation with an example. \begin{example}[Concatenation operation] \label{example:languageconcatenation} \ \\ Let $\Sigma$ be the alphabet $\{a, b, c\}$. \\Let $X$ be the language over $\Sigma$ with $X = \{aa, bb\}$. \\Let $Y$ be the language over $\Sigma$ with $Y = \{ca, b\}$. \\Then $XY$ is the language $\{aaca,aab,bbca,bbb\}$. \end{example} The second operation that will need to define regular languages is the binary \emph{union} operation. \begin{definition}[Union operation] \label{def:languageunion} \ \\ Let $X$ and $Y$ be two languages. Then $X \cup Y$ is the union of these languages with \begin{center} $X \cup Y = \{u\ |\ u \in X\ \lor\ u \in Y\}$. \end{center} \end{definition} Note that the priority of concatenation is higher than the priority of union. Here is an example that shows how the union operation works: \begin{example}[Union operation] \label{example:languageunion} \ \\ Let $\Sigma$ be the alphabet $\{a, b, c\}$. \\Let $X$ be the language over $\Sigma \{aa, bb\}$. \\Let $Y$ be the language over $\Sigma \{ca, b\}$. \\Then $X \cup Y$ is the language over $\Sigma \{aa, bb, ca, b\}$. \end{example} The final operation that we need to define is the unary \emph{Kleene star} operation. % Footnote for Kleene \begin{definition}[Kleene star] \label{def:languagekleenestar} \ \\ Let $X$ be a language. Then \begin{equation} X^{*} = \bigcup_{i=0}^{\infty} X^{i} \end{equation} \end{definition} Or, in words: $X^{*}$ means that you can take 0 or more sentences from $X$ and concatenate them. The Kleene star operation is best clarified with an example. \begin{example}[Kleene star] \label{example:languagekleenestar} \ \\ Let $\Sigma$ be the alphabet $\{a, b\}$. \\Let $X$ be the language over $\Sigma \{aa, bb\}$. \\Then $X^{*}$ is the language $\{\lambda, aa, bb, aaaa, aabb, bbaa, bbbb, \ldots\}$. \end{example} There is also an extension to the Kleene star. $XX^{*}$ may be written $X^{+}$, meaning that at least one string from $X$ must be taken (whereas $X^{*}$ allows the empty string $\lambda$). With these definitions, we can now give a definition for a regular language. \begin{definition}[Regular languages] \label{def:regularlanguage} %Required space to stop enumerate from starting %right after definition: \ \begin{enumerate} \item Basis: $\emptyset$, $\{\lambda\}$ and $\{a\}$ are regular languages. \item Recursive step: Let X and Y be regular languages. Then \begin{quote} $X \cup Y$ is a regular language \\$XY$ is a regular language \\$X^{*}$ is a regular language \end{quote} \end{enumerate} \end{definition} Now that we have established what regular languages are, it is important to note that lexical analyzer generators (software tools that will be discussed below) use \emph{regular expressions} to denote regular languages. Regular expressions are merely another way of writing down regular languages. In regular expressions, it is customary to write the language consisting of a single string composed of one word, $\{a\}$, as \mbox{\boldmath $a$}. \begin{definition}[Regular expressions] \label{def:regularexpressions} \ \\ Using recursion, we can define regular expressions as follows: \begin{enumerate} \item Basis: \mbox{\boldmath $\emptyset$}, \mbox{\boldmath $\lambda$} and \mbox{\boldmath $a$} are regular expressions. \item Recursive step: Let X and Y be regular expressions. Then \begin{quote} $X \cup Y$ is a regular expression \\$XY$ is a regular expression \\$X^{*}$ is a regular expression \end{quote} \end{enumerate} \end{definition} As you can see, the definition of regular expressions differs from the definition of regular languages only by a notational convenience (less braces to write). So any language that can be composed of other regular languages or expressions using concatenation, union, and the Kleene star, is also a regular language or expression. Note that the priority of the concatenation, Kleene Star and union operations are listed here from highest to lowest priority. \begin{example}[Regular expression] \label{example:sample_regular_expression} \ \\ Let \mbox{\boldmath $a$} and \mbox{\boldmath $b$} be regular expressions by definition \ref{def:regularexpressions}(1). Then \mbox{\boldmath $ab$} is a regular expression by definition \ref{def:regularexpressions}(2) through concatenation. $($\mbox{\boldmath $ab$} $\cup$ \mbox{\boldmath $b$}$)$ is a regular expression by definition \ref{def:regularexpressions}(2) through union. $($\mbox{\boldmath $ab$} $\cup$ \mbox{\boldmath $b$}$)^{*}$ is a regular expression by definition \ref{def:regularexpressions}(2) through union. The sentences that can be generated by \mbox{$($\mbox{\boldmath $ab$} $\cup$ \mbox{\boldmath $b$}$)^{*}$} are $\{\lambda, ab, b, abb, bab, babab, \ldots\}$. \end{example} While context-free grammars are normally denoted using production rules, for regular languages it is sufficient to use the easy to read regular expressions. %\includegraphics{syntax.eps} \section{Sample Regular Expressions} %This section gives some sample regular expressions (non-UNIX). In this section, we present a number of sample regular expressions to illustrate the theory presented in the previous section. From now on, we will now longer use bold \bold{$a$} to denote $\{a\}$, since we will soon move to UNIX regular expressions which do not use bold either. \begin{quote} \begin{tabular}{lp{6cm}} Regular expression & Sentences generated\\ \hline $q$ & the sentence $q$\\ $qqq$ & the sentence $qqq$\\ $q^{*}$ & all sentences of 0 or more $q$'s\\ $q^{+}$ & all sentences of 1 or more $q$'s\\ $q \cup \lambda$ & the empty sentence or $q$. Often denoted as $q?$ (see UNIX regular expressions).\\ $b^{*}((b^{+}a \cup \lambda)b^{*}$ & the collection of sentences that begin with 0 or more $b$'s, followed by either one or more $b$'s followed by an $a$, or nothing, followed by 0 or more $b$'s.\\ \end{tabular} \end{quote} These examples show that through repeated application of definition \ref{def:regularexpressions}, complex sequences can be defined. This feature of regular expressions is used for constructing lexical analyzers. \section{UNIX Regular Expressions} %This section explains the syntax for UNIX regexps. Under UNIX, several extensions to regular expressions have been implemented that we can use. A UNIX regular expression\cite{regex} is commonly called a \emph{regex} (multiple: \emph{regexes}). There is no union operator on UNIX. Instead, we supply a list of alternatives contained within square brackets. \begin{quote} \verb|[abc]| $\equiv (a \cup b \cup c)$ \end{quote}
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -