      <td>Possessive</td>
    </tr>
  </thead>
  <tr>
    <td class="regCenter"><code>X?</code></td>
    <td class="regCenter"><code>X??</code></td>
    <td class="regCenter"><code>X?+</code></td>
    <td>X, once or not at all</td>
  </tr>
  <tr>
    <td class="regCenter"><code>X*</code></td>
    <td class="regCenter"><code>X*?</code></td>
    <td class="regCenter"><code>X*+</code></td>
    <td>X, zero or more times</td>
  </tr>
  <tr>
    <td class="regCenter"><code>X+</code></td>
    <td class="regCenter"><code>X+?</code></td>
    <td class="regCenter"><code>X++</code></td>
    <td>X, one or more times</td>
  </tr>
  <tr>
    <td class="regCenter"><code>X{n}</code></td>
    <td class="regCenter"><code>X{n}?</code></td>
    <td class="regCenter"><code>X{n}+</code></td>
    <td>X, exactly n times</td>
  </tr>
  <tr>
    <td class="regCenter"><code>X{n,}</code></td>
    <td class="regCenter"><code>X{n,}?</code></td>
    <td class="regCenter"><code>X{n,}+</code></td>
    <td>X, at least n times</td>
  </tr>
  <tr>
    <td class="regCenter"><code>X{n,m}</code></td>
    <td class="regCenter"><code>X{n,m}?</code></td>
    <td class="regCenter"><code>X{n,m}+</code></td>
    <td>X, at least n but not more than m times</td>
  </tr>
</table>
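　　Let's start with the greedy quantifiers and build three different regular expressions: the letter <code>a</code> followed by <code>?</code>, <code>*</code>, or <code>+</code>. Here is what happens when each expression is tested against an empty input string:<br/>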

<pre id="console">Enter your regex: a?
Enter input string to search: 
I found the text "" starting at index 0 and ending at index 0.

Enter your regex: a*
Enter input string to search: 
I found the text "" starting at index 0 and ending at index 0.

Enter your regex: a+
Enter input string to search: 
No match found.</pre>

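<div id="h3"><a name="reg5_1"></a>5.1　Zero-Length Matches<span class="returnContents"><a href="#contents">Back to contents</a></span></div>
　　In the example above, the first two matches succeed because the expressions <code>a?</code> and <code>a*</code> both allow the letter “a” to occur zero times. You may also have noticed that the start and end indices are both 0, which is unlike any of the earlier examples. The empty input string has no length, so the test simply matches nothing at index 0. Matches of this kind are called <em>zero-length matches</em>. A zero-length match can occur in an empty input string, at the beginning of an input string, after the last character of an input string, or between any two characters of an input string. Zero-length matches are easy to spot because they always start and end at the same index.<br/>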
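　　The “same start and end index” rule can be checked directly in code. The following is a minimal sketch, not part of the original test harness: the class name <code>ZeroLengthDemo</code> and the helper <code>zeroLengthStarts</code> are hypothetical names chosen for illustration. It collects the index of every match whose <code>start()</code> equals its <code>end()</code>.<br/>

```java
import java.util.StringJoiner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ZeroLengthDemo {

    // Hypothetical helper: returns the indices of all zero-length matches,
    // joined by commas (an empty string means there were none).
    static String zeroLengthStarts(String regex, String input) {
        StringJoiner positions = new StringJoiner(",");
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            // A zero-length match starts and ends at the same index.
            if (m.start() == m.end()) {
                positions.add(String.valueOf(m.start()));
            }
        }
        return positions.toString();
    }

    public static void main(String[] args) {
        System.out.println("a? on empty input: [" + zeroLengthStarts("a?", "") + "]");
        System.out.println("a* on empty input: [" + zeroLengthStarts("a*", "") + "]");
        System.out.println("a+ on empty input: [" + zeroLengthStarts("a+", "") + "]");
    }
}
```

　　Returning a plain string keeps the sketch short; a real program would more likely return a list of indices.<br/>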

　　Let's look at a few more examples of zero-length matches. Change the input string to the single letter “a” and you will notice something interesting:<br/>

<pre id="console">Enter your regex: a?
Enter input string to search: a
I found the text "a" starting at index 0 and ending at index 1.
I found the text "" starting at index 1 and ending at index 1.

Enter your regex: a*
Enter input string to search: a
I found the text "a" starting at index 0 and ending at index 1.
I found the text "" starting at index 1 and ending at index 1.

Enter your regex: a+
Enter input string to search: a
I found the text "a" starting at index 0 and ending at index 1.</pre>

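　　All three quantifiers found the letter “a”, but the first two also found a zero-length match at index 1, that is, after the last character of the input string. Remember that the matcher sees the character “a” as sitting in the cell between index 0 and index 1, and that the test harness loops until it can find no more matches. Depending on the quantifier used, the presence of “nothing” at the index after the last character may or may not trigger a match.<br/>

　　Now change the input string to five “a”s in a row, and you get the following:<br/>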

<pre id="console">Enter your regex: a?
Enter input string to search: aaaaa
I found the text "a" starting at index 0 and ending at index 1.
I found the text "a" starting at index 1 and ending at index 2.
I found the text "a" starting at index 2 and ending at index 3.
I found the text "a" starting at index 3 and ending at index 4.
I found the text "a" starting at index 4 and ending at index 5.
I found the text "" starting at index 5 and ending at index 5.

Enter your regex: a*
Enter input string to search: aaaaa
I found the text "aaaaa" starting at index 0 and ending at index 5.
I found the text "" starting at index 5 and ending at index 5.

Enter your regex: a+
Enter input string to search: aaaaa
I found the text "aaaaa" starting at index 0 and ending at index 5.</pre>

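　　The expression <code>a?</code> finds a separate match for each character, because it matches “a” zero times or once. The expression <code>a*</code> finds two separate matches: first all of the “a”s together, then the zero-length match after the last character, at index 5. Finally, <code>a+</code> matches all occurrences of the letter “a” and ignores the presence of “nothing” at the last index.<br/>

　　At this point you may be wondering what the first two quantifiers do when they encounter a letter other than “a”. For example, what happens when they meet the letter “b” in “ababaaaab”?<br/>

　　Let's find out:<br/>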

<pre id="console">Enter your regex: a?
Enter input string to search: ababaaaab
I found the text "a" starting at index 0 and ending at index 1.
I found the text "" starting at index 1 and ending at index 1.
I found the text "a" starting at index 2 and ending at index 3.
I found the text "" starting at index 3 and ending at index 3.
I found the text "a" starting at index 4 and ending at index 5.
I found the text "a" starting at index 5 and ending at index 6.
I found the text "a" starting at index 6 and ending at index 7.
I found the text "a" starting at index 7 and ending at index 8.
I found the text "" starting at index 8 and ending at index 8.
I found the text "" starting at index 9 and ending at index 9.

Enter your regex: a*
Enter input string to search: ababaaaab
I found the text "a" starting at index 0 and ending at index 1.
I found the text "" starting at index 1 and ending at index 1.
I found the text "a" starting at index 2 and ending at index 3.
I found the text "" starting at index 3 and ending at index 3.
I found the text "aaaa" starting at index 4 and ending at index 8.
I found the text "" starting at index 8 and ending at index 8.
I found the text "" starting at index 9 and ending at index 9.

Enter your regex: a+
Enter input string to search: ababaaaab
I found the text "a" starting at index 0 and ending at index 1.
I found the text "a" starting at index 2 and ending at index 3.
I found the text "aaaa" starting at index 4 and ending at index 8.</pre>

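　　Even though the letter “b” appears in cells 1, 3, and 8, the output reports a zero-length match at those positions. The regular expression <code>a?</code> is not specifically looking for the letter “b”; it is only looking for the presence or absence of the letter “a”. If the quantifier allows “a” to match zero times, any input character that is not an “a” shows up as a zero-length match. The remaining “a”s are matched according to the rules discussed in the previous examples.<br/>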
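　　The console sessions above can be approximated with a small loop around <code>Matcher.find()</code>. This is only a sketch of what such a test harness might look like: the class name <code>QuantifierScan</code>, the helper <code>scan</code>, and the compact output format are assumptions made for illustration, not the harness used to produce the output shown.<br/>

```java
import java.util.StringJoiner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuantifierScan {

    // Hypothetical helper: one line per match, in the form "text" start-end.
    static String scan(String regex, String input) {
        StringJoiner lines = new StringJoiner("\n");
        Matcher m = Pattern.compile(regex).matcher(input);
        // find() keeps advancing through the input until no match is left.
        while (m.find()) {
            lines.add("\"" + m.group() + "\" " + m.start() + "-" + m.end());
        }
        return lines.toString();
    }

    public static void main(String[] args) {
        String input = "ababaaaab";
        for (String regex : new String[] {"a?", "a*", "a+"}) {
            System.out.println("regex " + regex + " on " + input + ":");
            System.out.println(scan(regex, input));
            System.out.println();
        }
    }
}
```

　　Running it with <code>a?</code> and <code>a*</code> should list the zero-length matches at the “b” positions, while <code>a+</code> should report only the runs of “a”.<br/>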

　　To match a pattern exactly n times, simply specify the number inside a pair of braces:<br/>

<pre id="console">Enter your regex: a{3}
Enter input string to search: aa
No match found.

Enter your regex: a{3}
Enter input string to search: aaa
I found the text "aaa" starting at index 0 and ending at index 3.

Enter your regex: a{3}
Enter input string to search: aaaa
I found the text "aaa" starting at index 0 and ending at index 3.</pre>

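　　Here the regular expression <code>a{3}</code> searches for three occurrences of the letter “a” in a row. The first test fails because the input string does not contain enough “a”s to match. The second test's input string contains exactly three “a”s, which triggers a match. The third test also triggers a match, because the input string begins with exactly three “a”s. Anything after that is irrelevant to the first match. If the pattern occurs again after that point, it triggers further matches: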

<pre id="console">Enter your regex: a{3}
Enter input string to search: aaaaaaaaa
I found the text "aaa" starting at index 0 and ending at index 3.
I found the text "aaa" starting at index 3 and ending at index 6.
I found the text "aaa" starting at index 6 and ending at index 9.</pre>

　　To require a pattern to appear at least n times, add a comma (<code>,</code>) after the number:

<pre id="console">Enter your regex: a{3,}
Enter input string to search: aaaaaaaaa
I found the text "aaaaaaaaa" starting at index 0 and ending at index 9.</pre>

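　　With the same input string, this test finds only one match, because the nine “a”s in a row satisfy the requirement of “at least” three “a”s.<br/>

　　Finally, to set an upper limit on the number of occurrences, add a second number inside the braces:<br/>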
 
<pre id="console">Enter your regex: a{3,6} // find at least 3 (but no more than 6) a's in a row
Enter input string to search: aaaaaaaaa
I found the text "aaaaaa" starting at index 0 and ending at index 6.
I found the text "aaa" starting at index 6 and ending at index 9.</pre>

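　　Here the first match is forced to stop at the upper limit of 6 characters. The second match takes the three remaining a's, which happens to be the minimum number of characters this expression allows. If the input string were one character shorter, there would be no second match, because only two a's would remain.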
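　　The three brace forms can be compared side by side in code. The sketch below uses a hypothetical class <code>BoundedQuantifiers</code> with a helper <code>matches</code> (both names are illustrative assumptions) that joins the matched substrings with commas.<br/>

```java
import java.util.StringJoiner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BoundedQuantifiers {

    // Hypothetical helper: returns every matched substring, comma-separated.
    static String matches(String regex, String input) {
        StringJoiner found = new StringJoiner(",");
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found.toString();
    }

    public static void main(String[] args) {
        String nine = "aaaaaaaaa";   // nine a's, as in the console sessions
        String eight = "aaaaaaaa";   // one character shorter

        System.out.println(matches("a{3}", nine));    // exactly three at a time
        System.out.println(matches("a{3,}", nine));   // at least three
        System.out.println(matches("a{3,6}", nine));  // between three and six
        System.out.println(matches("a{3,6}", eight)); // leftover of two is too short
    }
}
```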

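<div id="h3"><a name="reg5_2"></a>5.2　Capturing Groups and Character Classes with Quantifiers<span class="returnContents"><a href="#contents">Back to contents</a></span></div>
　　Until now, we have only applied quantifiers to a single character. In fact, a quantifier attaches only to the one character that precedes it, so the regular expression <code>abc+</code> means “a, followed by b, followed by c one or more times”. It does not mean <code>abc</code> one or more times. However, quantifiers can also attach to character classes and capturing groups: <code>[abc]+</code> means a or b or c, one or more times, and <code>(abc)+</code> means the group “abc”, one or more times.<br/>
　　Let's illustrate by requiring the group <code>(dog)</code> three times in a row.<br/>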

<pre id="console">Enter your regex: (dog){3}
Enter input string to search: dogdogdogdogdogdog
I found the text "dogdogdog" starting at index 0 and ending at index 9.
I found the text "dogdogdog" starting at index 9 and ending at index 18.

Enter your regex: dog{3}
Enter input string to search: dogdogdogdogdogdog
No match found.</pre>

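　　The first example above finds two matches, each made up of three consecutive “dog”s, because the quantifier applies to the entire capturing group. Remove the parentheses, however, and the quantifier <code>{3}</code> applies only to the letter “g”, so the match fails.<br/>
　　Similarly, a quantifier can be applied to an entire character class:<br/>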

<pre id="console">Enter your regex: [abc]{3}
Enter input string to search: abccabaaaccbbbc
I found the text "abc" starting at index 0 and ending at index 3.
I found the text "cab" starting at index 3 and ending at index 6.
I found the text "aaa" starting at index 6 and ending at index 9.
I found the text "ccb" starting at index 9 and ending at index 12.
I found the text "bbc" starting at index 12 and ending at index 15.

Enter your regex: abc{3}
Enter input string to search: abccabaaaccbbbc
No match found.</pre>

　　In the first example above, the quantifier <code>{3}</code> applies to the entire character class, but in the second example it applies only to the letter “c”.
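　　To see what a quantifier is actually attached to, it can help to count matches with and without the grouping. The following is a small sketch; <code>QuantifierScope</code> and <code>countMatches</code> are hypothetical names, and the inputs “doggg” and “abccc” are added here only to show what the ungrouped expressions do match.<br/>

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuantifierScope {

    // Hypothetical helper: counts how many times regex matches within input.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // {3} applies to the whole group (dog).
        System.out.println(countMatches("(dog){3}", "dogdogdogdogdogdog"));
        // {3} applies only to the letter g, so "do" must be followed by "ggg".
        System.out.println(countMatches("dog{3}", "dogdogdogdogdogdog"));
        System.out.println(countMatches("dog{3}", "doggg"));

        // {3} applies to the whole character class [abc].
        System.out.println(countMatches("[abc]{3}", "abccabaaaccbbbc"));
        // {3} applies only to the letter c, so "ab" must be followed by "ccc".
        System.out.println(countMatches("abc{3}", "abccabaaaccbbbc"));
        System.out.println(countMatches("abc{3}", "abccc"));
    }
}
```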

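<div id="h3"><a name="reg5_3"></a>5.3　Differences Among Greedy, Reluctant, and Possessive Quantifiers<span class="returnContents"><a href="#contents">Back to contents</a></span></div>
　　There are subtle differences among greedy, reluctant, and possessive quantifiers.<br/>
　　Greedy quantifiers are called “greedy” because they force the matcher to read in (or “eat”) the entire input string before attempting the first match. If that first attempt, against the entire input string, fails, the matcher backs off by one character and tries again, repeating the process until a match is found or there are no more characters left to back off from. Depending on the quantifier used in the expression, the last thing it tries to match against is 1 or 0 characters.<br/>
　　Reluctant quantifiers take the opposite approach: they start at the beginning of the input string and reluctantly eat one character at a time while looking for a match. The last thing they try is the entire input string.<br/>
　　Finally, possessive quantifiers always eat the entire input string and try once (and only once) for a match. Unlike greedy quantifiers, possessive quantifiers never back off, even if doing so would allow the overall match to succeed.<br/>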
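　　As a small-scale check of these three behaviours, the sketch below runs a greedy, a reluctant, and a possessive variant of the same expression against the short input “aaa”. The class <code>QuantifierModes</code>, the helper <code>matches</code>, and the choice of input are assumptions made for illustration only.<br/>

```java
import java.util.StringJoiner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class QuantifierModes {

    // Hypothetical helper: returns every matched substring, comma-separated.
    static String matches(String regex, String input) {
        StringJoiner found = new StringJoiner(",");
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found.toString();
    }

    public static void main(String[] args) {
        String input = "aaa";

        // Greedy: a* eats all three a's, then backs off one so the final a can match.
        System.out.println("greedy     a*a  -> [" + matches("a*a", input) + "]");

        // Reluctant: a*? takes as little as possible, so each match is a single a.
        System.out.println("reluctant  a*?a -> [" + matches("a*?a", input) + "]");

        // Possessive: a*+ eats every remaining a and never backs off,
        // so nothing is left for the final a and no match is found.
        System.out.println("possessive a*+a -> [" + matches("a*+a", input) + "]");
    }
}
```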
　　To illustrate, consider the input string xfooxxxxxxfoo.<br/>

<pre id="console">Enter your regex: .*foo  // greedy quantifier
Enter input string to search: xfooxxxxxxfoo
I found the text "xfooxxxxxxfoo" starting at index 0 and ending at index 13.