<html>
<head>
<title>The Java Language Specification Lexical Structure</title>
</head>
<body BGCOLOR=#eeeeff text=#000000 LINK=#0000ff VLINK=#000077 ALINK=#ff0000>
 
<a href="index.html">Contents</a> | <a href="2.doc.html">Prev</a> | <a href="4.doc.html">Next</a> | <a href="j.index.doc1.html">Index</a>
<hr><br>
 
<a name="48198"></a>
<p><strong>
CHAPTER 3 </strong></p>
<a name="44591"></a>
<h1>Lexical Structure</h1>
<hr><p>
<a name="230314"></a>
This chapter specifies the lexical structure of Java.
<p><a name="230426"></a>
Java programs are written in Unicode <a href="3.doc.html#95413">(&#167;3.1)</a>, but lexical translations are provided <a href="3.doc.html#95504">(&#167;3.2)</a> so that Unicode escapes <a href="3.doc.html#100850">(&#167;3.3)</a> can be used to include any Unicode character using only ASCII characters. Line terminators are defined <a href="3.doc.html#231571">(&#167;3.4)</a> to support the different conventions of existing host systems while maintaining consistent line numbers.<p>
<a name="229936"></a>
The Unicode characters resulting from the lexical translations are reduced to a sequence of input elements <a href="3.doc.html#25687">(&#167;3.5)</a>, which are white space <a href="3.doc.html#95710">(&#167;3.6)</a>, comments <a href="3.doc.html#48125">(&#167;3.7)</a>, and tokens. The tokens are the identifiers <a href="3.doc.html#40625">(&#167;3.8)</a>, keywords <a href="3.doc.html#229308">(&#167;3.9)</a>, literals <a href="3.doc.html#48272">(&#167;3.10)</a>, separators <a href="3.doc.html#230752">(&#167;3.11)</a>, and operators <a href="3.doc.html#230663">(&#167;3.12)</a> of the Java syntactic grammar.<p>
<a name="95413"></a>
<h2>3.1    Unicode</h2>
<a name="230444"></a>
Java programs are written using the Unicode character set, version 2.0. Information
about this encoding may be found at:
<p><pre><a name="230446"></a><code>http://www.unicode.org </code>and<code> ftp://unicode.org
</code></pre><a name="230450"></a>
Versions of Java prior to 1.1 used Unicode version 1.1.5 (see <em>The Unicode Standard:
Worldwide Character Encoding </em><a href="1.doc.html#11506">(&#167;1.2)</a> and updates). See <a href="javalang.doc4.html#14345">&#167;20.5</a> for a discussion
of the differences between Unicode version 1.1.5 and Unicode version 2.0.
<p><a name="99446"></a>
Except for comments <a href="3.doc.html#48125">(&#167;3.7)</a>, identifiers, and the contents of character and string literals (<a href="3.doc.html#100960">&#167;3.10.4</a>, <a href="3.doc.html#101083">&#167;3.10.5</a>), all input elements <a href="3.doc.html#25687">(&#167;3.5)</a> in a Java program are formed only from ASCII characters (or Unicode escapes <a href="3.doc.html#100850">(&#167;3.3)</a> which result in ASCII characters). ASCII (ANSI X3.4) is the American Standard Code for Information Interchange. The first 128 characters of the Unicode character encoding are the ASCII characters.<p>
<a name="95504"></a>
<h2>3.2    Lexical Translations</h2>
<a name="48080"></a>
A raw Unicode character stream is translated into a sequence of Java tokens, using 
the following three lexical translation steps, which are applied in turn:
<p><ol>
<a name="48081"></a>
<li>A translation of Unicode escapes <a href="3.doc.html#100850">(&#167;3.3)</a> in the raw stream of Unicode characters to the corresponding Unicode character. A Unicode escape of the form <code>\u</code><i>xxxx</i>, where <i>xxxx</i> is a hexadecimal value, represents the Unicode character whose encoding is <i>xxxx</i>. This translation step allows any Java program to be expressed using only ASCII characters.
<a name="48082"></a>
<li>A translation of the Unicode stream resulting from step 1 into a stream of input characters and line terminators <a href="3.doc.html#231571">(&#167;3.4)</a>.
<a name="95812"></a>
<li>A translation of the stream of input characters and line terminators resulting from step 2 into a sequence of Java input elements <a href="3.doc.html#25687">(&#167;3.5)</a> which, after white space <a href="3.doc.html#95710">(&#167;3.6)</a> and comments <a href="3.doc.html#48125">(&#167;3.7)</a> are discarded, comprise the tokens <a href="3.doc.html#25687">(&#167;3.5)</a> that are the terminal symbols of the syntactic grammar <a href="2.doc.html#140845">(&#167;2.3)</a> for Java.
</ol>
<a name="100835"></a>
Java always uses the longest possible translation at each step, even if the result does not ultimately make a correct Java program and another lexical translation would. Thus the input characters <code>a--b</code> are tokenized <a href="3.doc.html#25687">(&#167;3.5)</a> as <code>a</code>, <code>--</code>, <code>b</code>, which is not part of any grammatically correct Java program, even though the tokenization <code>a</code>, <code>-</code>, <code>-</code>, <code>b</code> could be part of a grammatically correct Java program.<p>
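<p>The longest-match rule can be sketched with a toy scanner. This is only an illustration under stated assumptions: the class name <code>LongestMatchDemo</code>, the method <code>tokenize</code>, and the small operator table are hypothetical, and the scanner handles only identifiers, white space, and a few operators.</p>

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a toy scanner, not a real Java lexer.
final class LongestMatchDemo {

    // Hypothetical operator subset chosen for the example; the real list is much longer.
    private static final String[] OPERATORS = {"--", "-=", "-", "++", "+=", "+", "="};

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++; // white space separates tokens and is then discarded
                continue;
            }
            if (Character.isJavaIdentifierStart(c)) {
                int start = i;
                while (i < input.length() && Character.isJavaIdentifierPart(input.charAt(i))) {
                    i++;
                }
                tokens.add(input.substring(start, i));
                continue;
            }
            // Among all operators that match at position i, keep the longest one.
            String best = null;
            for (String op : OPERATORS) {
                if (input.startsWith(op, i) && (best == null || op.length() > best.length())) {
                    best = op;
                }
            }
            if (best == null) {
                throw new IllegalArgumentException("Unexpected character at index " + i);
            }
            tokens.add(best);
            i += best.length();
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("a--b"));   // [a, --, b]
        System.out.println(tokenize("a - -b")); // [a, -, -, b]
    }
}
```

<p>Only the second input, where white space separates the two <code>-</code> characters, yields the tokenization that could be part of a correct program.</p>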
<a name="100850"></a>
<h2>3.3    Unicode Escapes</h2>
<a name="48089"></a>
Java implementations first recognize <i>Unicode escapes</i> in their input, translating 
the ASCII characters <code>\u</code> followed by four hexadecimal digits to the Unicode character
with the indicated hexadecimal value, and passing all other characters 
unchanged. This translation step results in a sequence of Unicode input characters:

<p><ul><pre>
<i>UnicodeInputCharacter:<br>
</i>	<i>UnicodeEscape<br>
</i>	<i>RawInputCharacter
</i>
<i>UnicodeEscape:<br>
</i><code>	\ </code><i>UnicodeMarker</i><code> </code><i>HexDigit</i><code> </code><i>HexDigit</i><code> </code><i>HexDigit</i><code> </code><i>HexDigit
</i>
<i>UnicodeMarker:<br>
</i>	<code>u<br>
</code>	<i>UnicodeMarker</i><code> u
</code>
<i>RawInputCharacter:<br>
</i>	any Unicode character

<i>HexDigit:</i> <i>one</i> <i>of<br>
</i><code>	0&#32;1&#32;2&#32;3&#32;4&#32;5&#32;6&#32;7&#32;8&#32;9&#32;a&#32;b&#32;c&#32;d&#32;e&#32;f&#32;A&#32;B&#32;C&#32;D&#32;E&#32;F
</code></pre></ul><a name="229834"></a>
The <code>\</code>, <code>u</code>, and hexadecimal digits here are all ASCII characters.<p>
<a name="231557"></a>
In addition to the processing implied by the grammar, for each raw input character that is a backslash <code>\</code>, input processing must consider how many other <code>\</code> characters contiguously precede it, separating it from a non-<code>\</code> character or the start of the input stream. If this number is even, then the <code>\</code> is eligible to begin a Unicode escape; if the number is odd, then the <code>\</code> is not eligible to begin a Unicode escape. For example, the raw input <code>"\\u2297=\u2297"</code> results in the eleven characters <code>"</code> <code>\</code> <code>\</code> <code>u</code> <code>2</code> <code>2</code> <code>9</code> <code>7</code> <code>=</code> <img src="chars/circmult.gif"> <code>"</code> (<code>\u2297</code> is the Unicode encoding of the character "<img src="chars/circmult.gif">").<p>
<a name="229835"></a>
If an eligible <code>\</code> is not followed by <code>u</code>, then it is treated as a <i>RawInputCharacter</i> and remains part of the escaped Unicode stream. If an eligible <code>\</code> is followed by <code>u</code>, or more than one <code>u</code>, and the last <code>u</code> is not followed by four hexadecimal digits, then a compile-time error occurs.<p>
<a name="48098"></a>
The character produced by a Unicode escape does not participate in further Unicode escapes. For example, the raw input <code>\u005cu005a</code> results in the six characters <code>\</code> <code>u</code> <code>0</code> <code>0</code> <code>5</code> <code>a</code>, because <code>005c</code> is the Unicode value for <code>\</code>. It does not result in the character <code>Z</code>, which is Unicode character <code>005a</code>, because the <code>\</code> that resulted from the <code>\u005c</code> is not interpreted as the start of a further Unicode escape.<p>
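<p>The escape rules of this section (backslash eligibility, one or more <code>u</code> markers, exactly four hexadecimal digits, and no re-scanning of the produced character) can be sketched as follows. This is a minimal illustration, not the implementation used by any compiler; the names <code>UnicodeEscapeDemo</code> and <code>translate</code> are hypothetical, and an <code>IllegalArgumentException</code> stands in for the compile-time error.</p>

```java
// Illustrative sketch of the Unicode-escape translation step.
final class UnicodeEscapeDemo {

    // Value of an ASCII hexadecimal digit, or -1 if c is not one.
    private static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    static String translate(String raw) {
        StringBuilder out = new StringBuilder();
        int precedingBackslashes = 0; // contiguous raw backslashes just before position i
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i);
            if (c != '\\') {
                out.append(c);
                precedingBackslashes = 0;
                i++;
                continue;
            }
            boolean eligible = precedingBackslashes % 2 == 0;
            boolean hasMarker = i + 1 < raw.length() && raw.charAt(i + 1) == 'u';
            if (!(eligible && hasMarker)) {
                // An ordinary raw input character: pass it through unchanged.
                out.append(c);
                precedingBackslashes++;
                i++;
                continue;
            }
            int j = i + 1;
            while (j < raw.length() && raw.charAt(j) == 'u') {
                j++; // skip one or more marker characters
            }
            if (j + 4 > raw.length()) {
                throw new IllegalArgumentException("Malformed Unicode escape at index " + i);
            }
            int value = 0;
            for (int k = j; k < j + 4; k++) {
                int digit = hexValue(raw.charAt(k));
                if (digit < 0) {
                    throw new IllegalArgumentException("Malformed Unicode escape at index " + i);
                }
                value = value * 16 + digit;
            }
            // The produced character is appended directly and never re-scanned.
            out.append((char) value);
            precedingBackslashes = 0;
            i = j + 4;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // In a Java string literal, two backslashes denote one backslash character,
        // so this raw input is: backslash, u005c, u005a (eleven characters).
        String raw = "\\u005cu005a";
        System.out.println(translate(raw).length()); // 6
    }
}
```

<p>The example strings double every backslash because Java source text is itself subject to this translation; a single backslash followed by <code>u</code> inside the example would be translated before the example ever ran.</p>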
<a name="228824"></a>
Java specifies a standard way of transforming a Unicode Java program into ASCII, producing a form that can be processed by ASCII-based tools. The transformation converts every Unicode escape in the source text of the program by adding an extra <code>u</code> (for example, <code>\u</code><i>xxxx</i> becomes <code>\uu</code><i>xxxx</i>), while simultaneously converting each non-ASCII character in the source text to a <code>\u</code><i>xxxx</i> escape containing a single <code>u</code>. This transformed version is equally acceptable to a Java compiler and represents exactly the same program. The exact Unicode source can later be restored from this ASCII form by converting each escape sequence where multiple <code>u</code>'s are present to a sequence of Unicode characters with one fewer <code>u</code>, while simultaneously converting each escape sequence with a single <code>u</code> to the corresponding single Unicode character.<p>
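<p>The ASCII transformation and its inverse might look like the sketch below. The class <code>AsciiTransformDemo</code> and the methods <code>toAscii</code> and <code>restore</code> are hypothetical names. The sketch assumes well-formed escapes in its input and does not handle a non-ASCII character that directly follows an odd run of backslashes.</p>

```java
// Illustrative sketch of the Unicode-to-ASCII transformation and its inverse.
final class AsciiTransformDemo {

    // Adds one extra 'u' to every existing escape and escapes every non-ASCII character.
    static String toAscii(String source) {
        StringBuilder out = new StringBuilder();
        int precedingBackslashes = 0;
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            if (c == '\\') {
                out.append(c);
                boolean eligible = precedingBackslashes % 2 == 0;
                if (eligible && i + 1 < source.length() && source.charAt(i + 1) == 'u') {
                    out.append('u'); // extra marker; the rest of the escape is copied as-is
                    precedingBackslashes = 0;
                } else {
                    precedingBackslashes++;
                }
            } else {
                if (c > 0x7f) {
                    out.append(String.format("\\u%04x", (int) c)); // escape with a single marker
                } else {
                    out.append(c);
                }
                precedingBackslashes = 0;
            }
        }
        return out.toString();
    }

    // Removes one 'u' from multi-marker escapes and decodes single-marker escapes.
    static String restore(String ascii) {
        StringBuilder out = new StringBuilder();
        int precedingBackslashes = 0;
        int i = 0;
        while (i < ascii.length()) {
            char c = ascii.charAt(i);
            boolean startsEscape = c == '\\' && precedingBackslashes % 2 == 0
                    && i + 1 < ascii.length() && ascii.charAt(i + 1) == 'u';
            if (!startsEscape) {
                out.append(c);
                precedingBackslashes = (c == '\\') ? precedingBackslashes + 1 : 0;
                i++;
                continue;
            }
            int j = i + 1;
            while (j < ascii.length() && ascii.charAt(j) == 'u') {
                j++;
            }
            int markers = j - (i + 1);
            if (j + 4 > ascii.length()) {
                throw new IllegalArgumentException("Truncated escape at index " + i);
            }
            String hex = ascii.substring(j, j + 4); // assumed to be four hexadecimal digits
            if (markers > 1) {
                out.append('\\');
                for (int k = 0; k < markers - 1; k++) {
                    out.append('u');
                }
                out.append(hex);
            } else {
                out.append((char) Integer.parseInt(hex, 16));
            }
            precedingBackslashes = 0;
            i = j + 4;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String source = "caf" + (char) 0xe9;               // four characters, the last one non-ASCII
        String ascii = toAscii(source);
        System.out.println(ascii.length());                // 9: "caf" plus a six-character escape
        System.out.println(restore(ascii).equals(source)); // true
    }
}
```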
<a name="231569"></a>
Java systems should use the <code>\u</code><i>xxxx</i> notation as an output format to display Unicode characters when a suitable font is not available.<p>
<a name="231571"></a>
<h2>3.4    Line Terminators</h2>
<a name="22634"></a>
Java implementations next divide the sequence of Unicode input characters into 
lines by recognizing <i>line terminators</i>. This definition of lines determines the line 
numbers produced by a Java compiler or other Java system component. It also 
specifies the termination of the <code>//</code> form of a comment <a href="3.doc.html#48125">(&#167;3.7)</a>.
<p><ul><pre>
<i>LineTerminator:<br>
</i><code>	</code>the ASCII LF character, also known as "newline"<br>
	the ASCII CR character, also known as "return"<br>
<code>	</code>the ASCII CR character followed by the ASCII LF character

<i>InputCharacter:<br>
</i>	<i>UnicodeInputCharacter</i> but not CR or LF
</pre></ul><a name="48107"></a>
Lines are terminated by the ASCII characters CR, or LF, or CR LF. The two characters CR immediately followed by LF are counted as one line terminator, not two. The result is a sequence of line terminators and input characters, which are the terminal symbols for the third step in the tokenization process.<p>
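<p>A minimal sketch of line recognition, treating CR, LF, and CR LF each as a single terminator. The names <code>LineTerminatorDemo</code> and <code>splitLines</code> are hypothetical; the method returns the input characters of each line with the terminators removed.</p>

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of line-terminator recognition.
final class LineTerminatorDemo {

    static List<String> splitLines(String input) {
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '\r' || c == '\n') {
                // CR immediately followed by LF counts as one terminator, not two.
                if (c == '\r' && i + 1 < input.length() && input.charAt(i + 1) == '\n') {
                    i++;
                }
                lines.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        lines.add(current.toString()); // text after the last terminator (possibly empty)
        return lines;
    }

    public static void main(String[] args) {
        System.out.println(splitLines("a\r\nb\rc\nd"));  // [a, b, c, d]
        System.out.println(splitLines("x\n\ry").size()); // 3: LF then CR are two terminators
    }
}
```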
<a name="25687"></a>
<h2>3.5    Input Elements and Tokens</h2>
<a name="25688"></a>
The input characters and line terminators that result from escape processing <a href="3.doc.html#100850">(&#167;3.3)</a> 
and then input line recognition <a href="3.doc.html#231571">(&#167;3.4)</a> are reduced to a sequence of <i>input elements</i>. 
Those input elements that are not white space <a href="3.doc.html#95710">(&#167;3.6)</a> or comments <a href="3.doc.html#48125">(&#167;3.7)</a> are 
<i>tokens</i>. The tokens are the terminal symbols of the Java syntactic grammar <a href="2.doc.html#140845">(&#167;2.3)</a>.
<p><a name="95675"></a>
This process is specified by the following productions:<p>
<ul><pre>
<i>Input:<br>
</i>	<i>InputElements</i><sub><i>opt</i></sub><code> </code><i>Sub</i><sub><i>opt
</i></sub>
<i>InputElements:<br>
</i>	<i>InputElement<br>
</i>	<i>InputElements</i><code> </code><i>InputElement
</i>
<i>InputElement:<br>
</i>	<i>WhiteSpace<br>
</i>	<i>Comment<br>
</i>	<i>Token
</i>
<i>Token:<br>
</i>	<i>Identifier<br>
</i>	<i>Keyword<br>
</i>	<i>Literal<br>
</i>	<i>Separator<br>
</i>	<i>Operator
</i>
<i>Sub:<br>
</i>	the ASCII SUB character, also known as "control-Z"
</pre></ul><a name="95707"></a>
White space <a href="3.doc.html#95710">(&#167;3.6)</a> and comments <a href="3.doc.html#48125">(&#167;3.7)</a> can serve to separate tokens that, if adjacent, might be tokenized in another manner. For example, the ASCII characters <code>-</code> and <code>=</code> in the input can form the operator token <code>-=</code> <a href="3.doc.html#230663">(&#167;3.12)</a> only if there is no intervening white space or comment.<p>
<a name="25733"></a>
As a special concession for compatibility with certain operating systems, the ASCII SUB character (<code>\u001a</code>, or control-Z) is ignored if it is the last character in the escaped input stream.<p>
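<p>The SUB concession amounts to dropping at most one trailing character. A small sketch, assuming the string passed in is the escaped input stream (that is, Unicode escapes have already been translated); <code>TrailingSubDemo</code> and <code>dropTrailingSub</code> are hypothetical names.</p>

```java
// Illustrative sketch: ignore an ASCII SUB (control-Z) only when it is the last character.
final class TrailingSubDemo {

    private static final char SUB = 0x1a;

    static String dropTrailingSub(String escapedInput) {
        int n = escapedInput.length();
        if (n > 0 && escapedInput.charAt(n - 1) == SUB) {
            return escapedInput.substring(0, n - 1);
        }
        return escapedInput; // a SUB anywhere else is left in place
    }

    public static void main(String[] args) {
        String input = "class Empty {}" + SUB;
        System.out.println(dropTrailingSub(input)); // class Empty {}
    }
}
```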
<a name="230834"></a>
Consider two tokens <i>x</i> and <i>y</i> in the resulting input stream. If <i>x</i> precedes <i>y</i>, then we say that <i>x</i> is <i>to the left of</i> <i>y</i> and that <i>y</i> is <i>to the right of</i> <i>x</i>. For example, in this simple piece of Java code:<p>
<pre><a name="230839"></a>
class Empty {
<a name="230840"></a>}
</pre><a name="230846"></a>
we say that the <code>}</code> token is to the right of the <code>{</code> token, even though it appears, in this 
two-dimensional representation on paper, downward and to the left of the <code>{</code> token. 
This convention about the use of the words left and right allows us to speak, for 
example, of the right-hand operand of a binary operator or of the left-hand side of 
an assignment.
<p><a name="95710"></a>
<h2>3.6    White Space</h2>
<a name="48121"></a>
<i>White space</i> is defined as the ASCII space, horizontal tab, and form feed characters,
as well as line terminators <a href="3.doc.html#231571">(&#167;3.4)</a>.
<p><ul><pre>
<i>WhiteSpace:<br>
</i>	the ASCII SP character, also known as "space"<br>
<code>	</code>the ASCII HT character, also known as "horizontal tab"<br>
