Enter your regex: .*?foo  // reluctant quantifier
Enter input string to search: xfooxxxxxxfoo
I found the text "xfoo" starting at index 0 and ending at index 4.
I found the text "xxxxxxfoo" starting at index 4 and ending at index 13.

Enter your regex: .*+foo // possessive quantifier
Enter input string to search: xfooxxxxxxfoo
No match found.</pre>

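　　The first example uses the greedy quantifier <code>.*</code> to find "anything", zero or more times, followed by the letters "f", "o", "o". Because the quantifier is greedy, the <code>.*</code> portion of the expression first eats the entire input string. At this point the overall expression cannot succeed, because the last three letters ("f", "o", "o") have already been consumed. The matcher therefore backs off one letter at a time until the rightmost occurrence of "foo" has been given back, at which point the match succeeds and the search ends.<br/>
　　The second example, however, is reluctant, so it starts by consuming "nothing". Because "foo" does not appear at the beginning of the string, the matcher is forced to swallow the first letter (an "x"), which triggers the first match at 0 and 4. The test harness continues the process until the input string is exhausted, finding another match at 4 and 13.<br/>
　　The third example fails to find a match because the quantifier is possessive. In this case the entire input string is consumed by <code>.*+</code>, leaving nothing to satisfy the "foo" at the end of the expression.<br/>
　　Use a possessive quantifier when you want to seize all of something without ever backing off. It will outperform the equivalent greedy quantifier in cases where the match is not immediately found.<br/>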
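　　The three behaviours described above can be reproduced without the interactive test harness. The following is a minimal sketch; the class name QuantifierDemo and the helper findAll are illustrative and not part of the original tutorial.<br/>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Collects every match as "text[start,end)", mirroring the harness output.
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group() + "[" + m.start() + "," + m.end() + ")");
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";
        System.out.println(findAll(".*foo", input));   // greedy:     [xfooxxxxxxfoo[0,13)]
        System.out.println(findAll(".*?foo", input));  // reluctant:  [xfoo[0,4), xxxxxxfoo[4,13)]
        System.out.println(findAll(".*+foo", input));  // possessive: []
    }
}
```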

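<div id="h2"><a name="reg6"></a>6　Capturing Groups<span class="returnContents"><a href="#contents">Back to contents</a></span></div>

　　The previous section showed how quantifiers attach to one character, character class, or capturing group at a time. Until now, however, the concept of a capturing group has not been discussed in any detail.<br/>
　　A <em>capturing group</em> is a way to treat multiple characters as a single unit. One is created by placing the characters to be grouped inside a pair of parentheses. For example, the regular expression <code>(dog)</code> creates a single group containing the letters "d", "o", and "g". The portion of the input string that matches the capturing group is saved in memory for later recall via a backreference (backreferences are discussed in <a href="#reg6_2">section 6.2</a>).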

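<div id="h3"><a name="reg6_1"></a>6.1　Numbering<span class="returnContents"><a href="#contents">Back to contents</a></span></div>
　　As described in the Pattern API, capturing groups are numbered by counting their opening parentheses from left to right. In the expression <code>((A)(B(C)))</code>, for example, there are four such groups:<br/>
　　1. <code>((A)(B(C)))</code><br/>
　　2. <code>(A)</code><br/>
　　3. <code>(B(C))</code><br/>
　　4. <code>(C)</code><br/>
　　To find out how many groups an expression contains, call the groupCount method on a Matcher object. It returns an int giving the number of capturing groups in the matcher's pattern. For the expression above, groupCount returns 4, because the pattern contains 4 capturing groups.<br/>
　　There is also a special group, group 0, which always represents the entire expression. This group is not included in the total reported by groupCount. Groups beginning with <code>(?</code> are pure <em>non-capturing groups</em>: they do not capture text and do not count towards the group total. (Named groups of the form <code>(?&lt;name&gt;...)</code>, available since Java 7, are the exception: they do capture and are counted.) An example of a non-capturing group appears in <a href="#reg8">section 8, Methods of the Pattern Class</a>.<br/>
　　It is important to understand how groups are numbered, because some Matcher methods accept an int specifying a particular group number as a parameter:<br/>
　　<label>public int start(int group)</label>: returns the start index of the subsequence captured by the given group during the previous match operation.<br/>
　　<label>public int end(int group)</label>: returns the index of the last character, plus one, of the subsequence captured by the given group during the previous match operation.<br/>
　　<label>public String group(int group)</label>: returns the input subsequence captured by the given group during the previous match operation.<br/>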
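　　As a rough illustration of the numbering rule and of the three methods listed above, the sketch below prints every group of <code>((A)(B(C)))</code> for the input "ABC". The class name GroupNumberingDemo and the helper describe are made up for this example.<br/>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupNumberingDemo {
    // Returns one entry per group, formatted as "number:text@start-end".
    // Group 0 is the whole match; groups 1..groupCount() are the capturing groups.
    static List<String> describe(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        if (m.matches()) {
            for (int i = 0; i <= m.groupCount(); i++) {
                out.add(i + ":" + m.group(i) + "@" + m.start(i) + "-" + m.end(i));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints [0:ABC@0-3, 1:ABC@0-3, 2:A@0-1, 3:BC@1-3, 4:C@2-3]
        System.out.println(describe("((A)(B(C)))", "ABC"));
        // A non-capturing group is not counted: prints 1
        System.out.println(Pattern.compile("(?:A)(B)").matcher("AB").groupCount());
    }
}
```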

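<div id="h3"><a name="reg6_2"></a>6.2　Backreferences<span class="returnContents"><a href="#contents">Back to contents</a></span></div>
　　The portion of the input string that matches a capturing group is saved in memory for later recall via <em>backreferences</em>. In a regular expression, a backreference is written as a backslash (<code>\</code>) followed by a digit giving the number of the group to be recalled. For example, the expression <code>(\d\d)</code> defines one capturing group matching two digits in a row, which can be recalled later in the expression via the backreference <code>\1</code>.<br/>
　　To match any two digits followed by exactly the same two digits, use <code>(\d\d)\1</code> as the regular expression:<br/>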

<pre id="console">Enter your regex: (\d\d)\1
Enter input string to search: 1212
I found the text "1212" starting at index 0 and ending at index 4.</pre>

　　If you change the last two digits, the match fails:<br/>
 
<pre id="console">Enter your regex: (\d\d)\1
Enter input string to search: 1234
No match found.</pre>

　　For nested capturing groups, backreferencing works in exactly the same way: write a backslash followed by the number of the group to be recalled.<br/>
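　　The same checks can be written directly in Java, where each backslash in the regex must be doubled inside a string literal. This is only a sketch; the class name BackreferenceDemo and the nested pattern <code>((\d)\2)\1</code> are examples chosen here, not taken from the tutorial.<br/>

```java
import java.util.regex.Pattern;

class BackreferenceDemo {
    // True when the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("(\\d\\d)\\1", "1212")); // true: "12" is repeated
        System.out.println(matches("(\\d\\d)\\1", "1234")); // false: "34" differs from "12"

        // Nested groups: group 2 is the inner (\d), group 1 is the outer pair.
        // \2 repeats the digit, then \1 repeats the resulting two-digit pair.
        System.out.println(matches("((\\d)\\2)\\1", "1111")); // true
        System.out.println(matches("((\\d)\\2)\\1", "1122")); // false
    }
}
```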

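<div id="h2"><a name="reg7"></a>7　Boundary Matchers<span class="returnContents"><a href="#contents">Back to contents</a></span></div>
　　Until now, we have only been interested in whether a match is found somewhere within a particular input string; we have not cared about where in the string the match takes place.<br/>
　　You can make pattern matches more precise by specifying such information with <em>boundary matchers</em>. For example, you may be interested in a particular word, but only if it appears at the beginning or end of a line. Or you may want to know whether the match takes place on a word boundary, or at the end of the previous match.<br/>
　　The following table lists and explains all the boundary matchers.<br/>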

<table border="0" cellpadding="0" cellspacing="0" class="regTab" align="center">
  <caption>Boundary matchers</caption>
  <tr>
    <td class="regCenter"><code>^</code></td>
    <td>The beginning of a line</td>
  </tr>
  <tr>
    <td class="regCenter"><code>$</code></td>
    <td>The end of a line</td>
  </tr>
  <tr>
    <td class="regCenter"><code>\b</code></td>
    <td>A word boundary</td>
  </tr>
  <tr>
    <td class="regCenter"><code>\B</code></td>
    <td>A non-word boundary</td>
  </tr>
  <tr>
    <td class="regCenter"><code>\A</code></td>
    <td>The beginning of the input</td>
  </tr>
  <tr>
    <td class="regCenter"><code>\G</code></td>
    <td>The end of the previous match</td>
  </tr>
  <tr>
    <td class="regCenter"><code>\Z</code></td>
    <td>The end of the input, except for the final line terminator, if any</td>
  </tr>
  <tr>
    <td class="regCenter"><code>\z</code></td>
    <td>The end of the input</td>
  </tr>
</table>
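
　　The difference between <code>\Z</code> and <code>\z</code> in the table is easiest to see with an input that ends in a line terminator. The sketch below is illustrative only; the class name EndAnchorDemo and the helper found are assumptions of this example.<br/>

```java
import java.util.regex.Pattern;

class EndAnchorDemo {
    // True when the regex matches somewhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String input = "dog\n"; // ends with a line terminator
        System.out.println(found("dog\\Z", input)); // true: \Z ignores the final terminator
        System.out.println(found("dog\\z", input)); // false: \z demands the very end of input
        System.out.println(found("dog\\z", "dog")); // true: no terminator in the way
        System.out.println(found("\\Adog", input)); // true: "dog" starts the input
    }
}
```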

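　　The following examples demonstrate the boundary matchers <code>^</code> and <code>$</code>. As noted in the table above, <code>^</code> matches the beginning of a line and <code>$</code> matches the end.<br/>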

<pre id="console">Enter your regex: ^dog$
Enter input string to search: dog
I found the text "dog" starting at index 0 and ending at index 3.

Enter your regex: ^dog$
Enter input string to search:       dog
No match found.

Enter your regex: \s*dog$
Enter input string to search:             dog
I found the text "            dog" starting at index 0 and ending at index 15.

Enter your regex: ^dog\w*
Enter input string to search: dogblahblah
I found the text "dogblahblah" starting at index 0 and ending at index 11.</pre>

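　　The first example succeeds because the pattern occupies the entire input string. The second fails because the input string contains extra whitespace at the beginning. The third specifies an expression that allows unlimited whitespace followed by "dog" at the end of the line. The fourth requires "dog" to be at the beginning of a line, followed by an unlimited number of word characters.<br/>
　　To check whether a pattern begins and ends on a word boundary (as opposed to being a substring within a longer string), put <code>\b</code> on either side, for example <code>\bdog\b</code>.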

<pre id="console">Enter your regex: \bdog\b
Enter input string to search: The dog plays in the yard.
I found the text "dog" starting at index 4 and ending at index 7.

Enter your regex: \bdog\b
Enter input string to search: The doggie plays in the yard.
No match found.</pre>

　　To match the expression on a non-word boundary, use <code>\B</code> instead:<br/>
 
<pre id="console">Enter your regex: \bdog\B
Enter input string to search: The dog plays in the yard.
No match found.

Enter your regex: \bdog\B
Enter input string to search: The doggie plays in the yard.
I found the text "dog" starting at index 4 and ending at index 7.</pre>
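
　　The four harness runs above can be condensed into a short sketch. The class name WordBoundaryDemo and the helper firstIndex are invented for this example; firstIndex returns -1 when nothing matches.<br/>

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordBoundaryDemo {
    // Start index of the first match, or -1 if the regex does not match.
    static int firstIndex(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        String dog = "The dog plays in the yard.";
        String doggie = "The doggie plays in the yard.";
        System.out.println(firstIndex("\\bdog\\b", dog));    // 4: "dog" is a whole word
        System.out.println(firstIndex("\\bdog\\b", doggie)); // -1: "dog" runs into "gie"
        System.out.println(firstIndex("\\bdog\\B", dog));    // -1: a boundary follows "dog"
        System.out.println(firstIndex("\\bdog\\B", doggie)); // 4: no boundary after "dog"
    }
}
```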

　　To require that a match occur only at the end of the previous match, use <code>\G</code>:<br/>
 
<pre id="console">Enter your regex: dog
Enter input string to search: dog dog
I found the text "dog" starting at index 0 and ending at index 3.
I found the text "dog" starting at index 4 and ending at index 7.

Enter your regex: \Gdog
Enter input string to search: dog dog
I found the text "dog" starting at index 0 and ending at index 3.</pre>

　　Here the second example finds only one match, because the second occurrence of "dog" does not start at the end of the previous match.<a name="note_07"></a><sup><a href="#note07">[7]</a></sup><br/>
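　　To make the role of <code>\G</code> concrete, the sketch below counts matches for inputs with and without a gap between the two occurrences. The class name PreviousMatchDemo, the helper count, and the extra input "dogdog" are assumptions added for illustration.<br/>

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PreviousMatchDemo {
    // Number of successive find() hits for the regex in the input.
    static int count(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(count("dog", "dog dog"));    // 2: plain search finds both
        System.out.println(count("\\Gdog", "dog dog")); // 1: the space breaks the chain
        System.out.println(count("\\Gdog", "dogdog"));  // 2: second "dog" starts where the first ended
    }
}
```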

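<div id="h2"><a name="reg8"></a>8　Methods of the Pattern Class<span class="returnContents"><a href="#contents">Back to contents</a></span></div>

　　Until now, we have only used the test harness to create Pattern objects in their most basic form. This section explores advanced techniques such as creating patterns with flags and using embedded flag expressions. It also covers some additional useful methods that have not yet been discussed.<br/>

<div id="h3"><a name="reg8_1"></a>8.1　Creating a Pattern with Flags<span class="returnContents"><a href="#contents">Back to contents</a></span></div>

　　The Pattern class defines an alternate compile method that accepts a set of flags affecting the way the pattern is matched. The flags parameter is a bit mask that may include any of the following public static fields:<br/>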

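<div id="h4">Pattern.CANON_EQ</div>
　　Enables canonical equivalence. When this flag is specified, two characters are considered to match if, and only if, their full canonical decompositions match. The expression <code>a\u030A</code><a name="note_08"></a><sup><a href="#note08">[8]</a></sup>, for example, will match the string "\u00E5" (the character <span style="font-family: Courier New; font-size: 14pt;">&#229;</span>) when this flag is specified. By default, matching does not take canonical equivalence into account. Specifying this flag may impose a performance penalty.<br/>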

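<div id="h4">Pattern.CASE_INSENSITIVE</div>
　　Enables case-insensitive matching. By default, case-insensitive matching assumes that only characters in the US-ASCII charset are being matched. Unicode-aware case-insensitive matching can be enabled by specifying the UNICODE_CASE flag in conjunction with this flag. Case-insensitive matching can also be enabled via the embedded flag expression <code>(?i)</code>. Specifying this flag may impose a slight performance penalty.<br/>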

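<div id="h4">Pattern.COMMENTS</div>
　　Permits whitespace and comments in the pattern. In this mode, whitespace is ignored, and embedded comments starting with <code>#</code> are ignored until the end of a line. Comments mode can also be enabled via the embedded flag expression <code>(?x)</code>.<br/>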
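　　A small sketch of comments mode follows. The pattern, the class name CommentsFlagDemo, and the input "555-1234" are invented for illustration; the same regex string is compiled with and without the flag to show the difference.<br/>

```java
import java.util.regex.Pattern;

class CommentsFlagDemo {
    // Whitespace and #-comments in this regex are ignored only under COMMENTS.
    static final String REGEX =
            "(\\d{3})  # three-digit prefix\n"
          + "-        # literal hyphen\n"
          + "(\\d{4})  # four-digit suffix";

    // True when the whole input matches REGEX compiled with the given flags.
    static boolean matches(int flags, String input) {
        return Pattern.compile(REGEX, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches(Pattern.COMMENTS, "555-1234")); // true
        System.out.println(matches(0, "555-1234"));                // false: spaces and '#' are literal
    }
}
```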

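<div id="h4">Pattern.DOTALL</div>
　　Enables dotall mode. In dotall mode, the expression <code>.</code> matches any character, including a line terminator. By default this expression does not match line terminators. Dotall mode can also be enabled via the embedded flag expression <code>(?s)</code>. (The s is a mnemonic for "single-line" mode, which is what this is called in Perl.)<br/>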
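　　The effect of dotall mode can be checked with a two-line input. This is a sketch under the assumption of a hypothetical helper class DotallDemo; the pattern <code>a.b</code> is chosen only for illustration.<br/>

```java
import java.util.regex.Pattern;

class DotallDemo {
    // True when the whole input matches the regex compiled with the given flags.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String input = "a\nb";
        System.out.println(matches("a.b", 0, input));              // false: '.' skips "\n"
        System.out.println(matches("a.b", Pattern.DOTALL, input)); // true
        System.out.println(matches("(?s)a.b", 0, input));          // true: embedded flag
    }
}
```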

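<div id="h4">Pattern.LITERAL</div>
　　Enables literal parsing of the pattern. When this flag is specified, the input string that specifies the pattern is treated as a sequence of literal characters. Metacharacters and escape sequences in the input sequence are given no special meaning. The CASE_INSENSITIVE and UNICODE_CASE flags retain their effect on matching when used in conjunction with this flag; the other flags become superfluous. There is no embedded flag expression for enabling literal parsing.<br/>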
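　　As a quick illustration of literal parsing, the sketch below compiles the string "a.b" with and without the flag. The class name LiteralFlagDemo and the sample inputs are assumptions of this example.<br/>

```java
import java.util.regex.Pattern;

class LiteralFlagDemo {
    // True when the whole input matches the regex compiled with the given flags.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("a.b", 0, "axb"));               // true: '.' is a metacharacter
        System.out.println(matches("a.b", Pattern.LITERAL, "axb")); // false: '.' is just a dot
        System.out.println(matches("a.b", Pattern.LITERAL, "a.b")); // true
        // CASE_INSENSITIVE still applies on top of LITERAL.
        int flags = Pattern.LITERAL | Pattern.CASE_INSENSITIVE;
        System.out.println(matches("a.b", flags, "A.B"));           // true
    }
}
```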

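<div id="h4">Pattern.MULTILINE</div>
　　Enables multiline mode. In multiline mode, the expressions <code>^</code> and <code>$</code> match just after or just before, respectively, a line terminator or the end of the input sequence. By default these expressions match only at the beginning and the end of the entire input sequence. Multiline mode can also be enabled via the embedded flag expression <code>(?m)</code>.<br/>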
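　　The following sketch counts whole-line matches in a three-line input to show what multiline mode changes. The class name MultilineDemo, the helper count, and the sample text are illustrative assumptions.<br/>

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultilineDemo {
    // Number of matches of the regex (compiled with the given flags) in the input.
    static int count(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String input = "cat\ndog\nbird";
        System.out.println(count("^\\w+$", 0, input));                 // 0: no single word spans the input
        System.out.println(count("^\\w+$", Pattern.MULTILINE, input)); // 3: one match per line
        System.out.println(count("(?m)^\\w+$", 0, input));             // 3: embedded flag
    }
}
```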

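<div id="h4">Pattern.UNICODE_CASE</div>
　　Enables Unicode-aware case folding. When this flag is specified, case-insensitive matching, once enabled by the CASE_INSENSITIVE flag, is done in a manner consistent with the Unicode Standard. By default, case-insensitive matching assumes that only characters in the US-ASCII charset are being matched. Unicode-aware case folding can also be enabled via the embedded flag expression <code>(?u)</code>. Specifying this flag may impose a performance penalty.<br/>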
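　　A non-ASCII letter makes the difference visible. The sketch below uses the letter pair \u00E9 and \u00C9 (lowercase and uppercase e with acute accent), chosen here purely as an example; the class name UnicodeCaseDemo is likewise hypothetical.<br/>

```java
import java.util.regex.Pattern;

class UnicodeCaseDemo {
    // True when the whole input matches the regex compiled with the given flags.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String lower = "\u00E9"; // lowercase e with acute accent
        String upper = "\u00C9"; // uppercase E with acute accent
        // ASCII-only case folding does not relate the two letters.
        System.out.println(matches(lower, Pattern.CASE_INSENSITIVE, upper)); // false
        // Adding UNICODE_CASE makes the comparison Unicode-aware.
        int flags = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;
        System.out.println(matches(lower, flags, upper));                    // true
        System.out.println(matches("(?iu)" + lower, 0, upper));              // true: embedded flags
    }
}
```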

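<div id="h4">Pattern.UNIX_LINES</div>
　　Enables Unix lines mode. In this mode, only the "\n" line terminator is recognized in the behavior of <code>.</code>, <code>^</code>, and <code>$</code>. Unix lines mode can also be enabled via the embedded flag expression <code>(?d)</code>.<br/>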
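　　One observable consequence is that a carriage return stops being a line terminator, so <code>.</code> can match it. A minimal sketch, with the class name UnixLinesDemo assumed for illustration:<br/>

```java
import java.util.regex.Pattern;

class UnixLinesDemo {
    // True when a single '.' (compiled with the given flags) matches the whole input.
    static boolean dotMatches(int flags, String input) {
        return Pattern.compile(".", flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(dotMatches(0, "\r"));                  // false: "\r" is a line terminator
        System.out.println(dotMatches(Pattern.UNIX_LINES, "\r")); // true: only "\n" terminates lines
        System.out.println(dotMatches(Pattern.UNIX_LINES, "\n")); // false: "\n" is still a terminator
    }
}
```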
　　Next, we will modify the test harness <a href="src/RegexTestHarness.java">RegexTestHarness.java</a> to create a pattern with case-insensitive matching.<br/>
　　First, modify the code to invoke the alternate version of compile:<br/>

<pre name="java" id="java">// Compile the user's regex with case-insensitive matching enabled.
Pattern pattern = Pattern.compile(
        console.readLine("%nEnter your regex: "),
        Pattern.CASE_INSENSITIVE
    );</pre>

　　Compile and run the test harness to get the following results:<br/>
 
<pre id="console">Enter your regex: dog
Enter input string to search: DoGDOg
I found the text "DoG" starting at index 0 and ending at index 3.
I found the text "DOg" starting at index 3 and ending at index 6.</pre>

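　　As you can see, the string literal "dog" matches both occurrences, regardless of case. To compile a pattern with multiple flags, separate the flags with the bitwise OR operator "|". For clarity, the following code samples hardcode the regular expression instead of reading it from the console:<br/>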
 
<pre name="java" id="java">pattern = Pattern.compile("[az]$", Pattern.MULTILINE | Pattern.UNIX_LINES);</pre>

　　You could also specify an int variable instead:<br/>

<pre name="java" id="java">final int flags = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;
Pattern pattern = Pattern.compile("aa", flags);</pre>
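
　　Because the flags form a bit mask, individual flags can be tested with a bitwise AND against the value returned by Pattern's flags() method. The sketch below wraps the snippet above in a runnable class; the class name CombinedFlagsDemo and the helper hasFlag are assumptions of this example.<br/>

```java
import java.util.regex.Pattern;

class CombinedFlagsDemo {
    // True when the given flag bit is set on the compiled pattern.
    static boolean hasFlag(Pattern pattern, int flag) {
        return (pattern.flags() & flag) != 0;
    }

    public static void main(String[] args) {
        final int flags = Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;
        Pattern pattern = Pattern.compile("aa", flags);

        System.out.println(pattern.matcher("AA").matches());         // true
        System.out.println(hasFlag(pattern, Pattern.UNICODE_CASE));  // true
        System.out.println(hasFlag(pattern, Pattern.MULTILINE));     // false
    }
}
```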

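<div id="h3"><a name="reg8_2"></a>8.2　Embedded Flag Expressions<span class="returnContents"><a href="#contents">Back to contents</a></span></div>
　　Flags can also be enabled with <em>embedded flag expressions</em>. Embedded flag expressions are an alternative to the two-argument version of compile, and are specified within the regular expression itself. The following example uses the original test harness (<a href="src/RegexTestHarness.java">RegexTestHarness.java</a>) with the embedded flag expression <code>(?i)</code> to enable case-insensitive matching.<br/>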
<pre id="console">Enter your regex: (?i)foo
Enter input string to search: FOOfooFoOfoO
I found the text "FOO" starting at index 0 and ending at index 3.
I found the text "foo" starting at index 3 and ending at index 6.
I found the text "FoO" starting at index 6 and ending at index 9.
I found the text "foO" starting at index 9 and ending at index 12.</pre>
　　Once again, all matches succeed regardless of case.<br/>
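　　In code, the embedded expression and the flag argument give the same result for this input. A minimal sketch, with the class name EmbeddedFlagDemo and the helper count assumed for illustration:<br/>

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmbeddedFlagDemo {
    // Number of matches the compiled pattern finds in the input.
    static int count(Pattern pattern, String input) {
        Matcher m = pattern.matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String input = "FOOfooFoOfoO";
        System.out.println(count(Pattern.compile("foo"), input));                           // 1
        System.out.println(count(Pattern.compile("(?i)foo"), input));                       // 4
        System.out.println(count(Pattern.compile("foo", Pattern.CASE_INSENSITIVE), input)); // 4
    }
}
```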
　　The embedded flag expressions that correspond to Pattern's publicly accessible fields are presented in the following table:<br/>

<table border="0" c