            if (!found) {
                System.out.printf("No match found.%n");
            }
        }
    }
}</pre>

Test harness for JDK 1.4 (<a href="src/RegexTestHarnessV4.java">RegexTestHarnessV4.java</a>):<br/>
<pre name="java" id="java">import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTestHarnessV4 {

    public static void main(String[] args) throws IOException {
        BufferedReader br = new BufferedReader(
                new InputStreamReader(new BufferedInputStream(System.in))
            );
        while (true) {
            System.out.print("\nEnter your regex: ");
            String regex = br.readLine();
            if (regex == null) {
                break; // end of input: readLine() returns null
            }
            Pattern pattern = Pattern.compile(regex);
            System.out.print("Enter input string to search: ");
            String input = br.readLine();
            if (input == null) {
                break; // end of input
            }
            Matcher matcher = pattern.matcher(input);
            boolean found = false;
            // find() scans forward for the next subsequence that matches
            while (matcher.find()) {
                System.out.println("I found the text \"" + matcher.group() +
                        "\" starting at index " + matcher.start() +
                        " and ending at index " + matcher.end() +
                        ".");
                found = true;
            }
            if (!found) {
                System.out.println("No match found.");
            }
        }
    }
}</pre>

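<div id="h2"><a name="reg2"></a>2 String Literals<span class="returnContents"><a href="#contents">Back to contents</a></span></div>
The most basic form of pattern matching supported by the API is matching a string literal. If the regular expression is <code>foo</code> and the input string is also foo, the match succeeds because the two strings are identical. Try it with the test harness: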
<pre id="console">Enter your regex: foo
Enter input string to search: foo
I found the text "foo" starting at index 0 and ending at index 3.</pre>
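The match succeeds. Note that although the input string is 3 characters long, the start index is 0 and the end index is 3. By convention, a range includes its start index and excludes its end index, as shown in the following figure:<br/>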

<div class="picSpec">
<img src="resource/regex3-1.gif" align="center"><br/>
Figure 1 The string “foo” with numbered cells and index values<a name="note_04"></a><sup><a href="#note04">[4]</a></sup></div>

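Each character in the string resides in its own <em>cell</em>, and the index positions point between the cells. The string “foo” starts at index 0 and ends at index 3, even though the characters themselves occupy only cells 0, 1, and 2.<br/>

With subsequent matches you will notice some overlap: the start index of the next match is the same as the end index of the previous match:<br/>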

<pre id="console">Enter your regex: foo
Enter input string to search: foofoofoo
I found the text "foo" starting at index 0 and ending at index 3.
I found the text "foo" starting at index 3 and ending at index 6.
I found the text "foo" starting at index 6 and ending at index 9.</pre>
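
The same index convention can be checked directly in code. The following is a minimal sketch, not part of the tutorial's source: the class name <code>MatchIndexDemo</code> and the helper <code>describeMatches</code> are made up for illustration. It relies only on <code>Matcher.find()</code>, <code>start()</code>, <code>end()</code> and <code>String.substring()</code>.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchIndexDemo {

    // Hypothetical helper: lists every match as "text[start,end)".
    static String describeMatches(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        StringBuilder sb = new StringBuilder();
        while (matcher.find()) {
            // start() is inclusive and end() is exclusive, so this
            // substring is the same text that matcher.group() returns.
            String viaSubstring = input.substring(matcher.start(), matcher.end());
            sb.append(viaSubstring)
              .append('[').append(matcher.start())
              .append(',').append(matcher.end()).append(") ");
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // prints: foo[0,3) foo[3,6) foo[6,9)
        System.out.println(describeMatches("foo", "foofoofoo"));
    }
}
```

Because the end index is exclusive, the end of one match can serve directly as the start of the next, which is why the indices 3 and 6 each appear twice.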

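<div id="h3"><a name="reg2_1"></a>2.1 Metacharacters<span class="returnContents"><a href="#contents">Back to contents</a></span></div>
The API also supports a number of special characters that affect how a pattern is matched. Change the regular expression to <code>cat.</code> and the input string to “cats”. The output is as follows: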

<pre id="console">Enter your regex: cat.
Enter input string to search: cats
I found the text "cats" starting at index 0 and ending at index 4.</pre>

The match succeeds even though the input string contains no dot (.). This is because the dot (<code>.</code>) is a <em>metacharacter</em>, a character that the matcher interprets with a special meaning. The match succeeds here because the metacharacter <code>.</code> means “any character”.<br/>
The metacharacters supported by the API are: <code>(</code><code>[</code><code>{</code><code>\</code><code>^</code><code>-</code><code>$</code><code>|</code><code>}</code><code>]</code><code>)</code><code>?</code><code>*</code><code>+</code><code>.</code>

<p id="tip">
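Note: As you learn more about how regular expressions are constructed, you will encounter situations in which the special characters listed above are not treated as metacharacters. You can still use this list to check whether a particular character can ever be treated as a metacharacter. For example, the characters @ and # never carry a special meaning.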
</p>

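There are two ways to force a metacharacter to be treated as an ordinary character:<br/>
1. Precede the metacharacter with a backslash (<code>\</code>);<br/>
2. Enclose it between <code>\Q</code> (which starts the quote) and <code>\E</code> (which ends it)<a name="note_05"></a><sup><a href="#note05">[5]</a></sup>. With this technique, <code>\Q</code> and <code>\E</code> can be placed anywhere within the expression (provided that <code>\Q</code> comes first<a name="note_06"></a><sup><a href="#note06">[6]</a></sup>).<br/>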
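
Both escaping techniques can be sketched in code as follows. The class name <code>EscapeDemo</code> and the helper <code>found</code> are illustrative names, not part of the tutorial. Keep in mind that a backslash inside a Java string literal must itself be doubled, so the regex <code>\.</code> is written <code>"\\."</code> in source code, whereas in the test harness you type a single backslash. The last line uses the standard <code>Pattern.quote</code> method, which is not covered in the text above but produces a quoted pattern for an arbitrary literal string.

```java
import java.util.regex.Pattern;

class EscapeDemo {

    // Hypothetical helper: true if the regex matches somewhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // Unescaped: the dot is a metacharacter and matches any character.
        System.out.println(found("cat.", "cats"));                 // true
        // Backslash escape: "\\." in Java source is the regex \. (a literal dot).
        System.out.println(found("cat\\.", "cats"));               // false
        System.out.println(found("cat\\.", "cat."));               // true
        // \Q...\E quotes everything in between.
        System.out.println(found("\\Qcat.\\E", "cats"));           // false
        System.out.println(found("\\Qcat.\\E", "cat."));           // true
        // Pattern.quote builds a quoted pattern from a literal string.
        System.out.println(found(Pattern.quote("cat."), "cat."));  // true
    }
}
```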

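<div id="h2"><a name="reg3"></a>3 Character Classes<span class="returnContents"><a href="#contents">Back to contents</a></span></div>
If you browse the specification of the <a href="http://java.sun.com/javase/6/docs/api/java/util/regex/Pattern.html" target="_blank">Pattern</a> class, you will see tables summarizing the supported regular expression constructs. Among them are the following character class constructs:<br/>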
<a name="fig1"></a>
<table border="0" cellpadding="0" cellspacing="0" class="regTab" align="center">
  <caption>Character Classes</caption>
  <tr>
    <td class="regCenter"><code>[abc]</code></td>
    <td>a, b, or c (simple class)</td>
  </tr>
  <tr>
    <td class="regCenter"><code>[^abc]</code></td>
    <td>Any character except a, b, or c (negation)</td>
  </tr>
  <tr>
    <td class="regCenter"><code>[a-zA-Z]</code></td>
    <td>a through z, or A through Z, inclusive (range)</td>
  </tr>
  <tr>
    <td class="regCenter"><code>[a-d[m-p]]</code></td>
    <td>a through d, or m through p: <code>[a-dm-p]</code> (union)</td>
  </tr>
  <tr>
    <td class="regCenter"><code>[a-z&&[def]]</code></td>
    <td>d, e, or f (intersection)</td>
  </tr>
  <tr>
    <td class="regCenter"><code>[a-z&&[^bc]]</code></td>
    <td>a through z, except for b and c: <code>[ad-z]</code> (subtraction)</td>
  </tr>
  <tr>
    <td class="regCenter"><code>[a-z&&[^m-p]]</code></td>
    <td>a through z, and not m through p: <code>[a-lq-z]</code> (subtraction)</td>
  </tr>
  </tr>
</table>

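The left-hand column lists the regular expression constructs, and the right-hand column describes the conditions under which each construct matches.<br/>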

<p id="tip">
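Note: The word “class” in the phrase “character class” does not refer to a .class file. In the context of regular expressions, a character class is a set of characters enclosed in square brackets. It specifies the characters that will successfully match a single character of a given input string.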
</p>
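
One way to check each row of the table is to filter a sample string through the class and see which characters survive. This is only a sketch: the class name <code>CharClassTableDemo</code>, the helper <code>keep</code>, and the sample input are chosen for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharClassTableDemo {

    // Hypothetical helper: concatenates every character of the input
    // that the given single-character class matches.
    static String keep(String charClass, String input) {
        Matcher matcher = Pattern.compile(charClass).matcher(input);
        StringBuilder sb = new StringBuilder();
        while (matcher.find()) {
            sb.append(matcher.group());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = "abcdefmnopxyz";
        System.out.println(keep("[abc]", input));          // abc
        System.out.println(keep("[^abc]", input));         // defmnopxyz
        System.out.println(keep("[a-d[m-p]]", input));     // abcdmnop
        System.out.println(keep("[a-z&&[def]]", input));   // def
        System.out.println(keep("[a-z&&[^bc]]", input));   // adefmnopxyz
        System.out.println(keep("[a-z&&[^m-p]]", input));  // abcdefxyz
    }
}
```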

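<div id="h3"><a name="reg3_1"></a>3.1 Simple Classes<span class="returnContents"><a href="#contents">Back to contents</a></span></div>
The most basic form of a character class is a set of characters placed side by side within square brackets. For example, the regular expression <code>[bcr]at</code> matches “bat”, “cat”, or “rat”, because it defines a character class (accepting “b”, “c”, or “r”) as its first character.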

<pre id="console">Enter your regex: [bcr]at
Enter input string to search: bat
I found the text "bat" starting at index 0 and ending at index 3.

Enter your regex: [bcr]at
Enter input string to search: cat
I found the text "cat" starting at index 0 and ending at index 3.

Enter your regex: [bcr]at
Enter input string to search: rat
I found the text "rat" starting at index 0 and ending at index 3.

Enter your regex: [bcr]at
Enter input string to search: hat
No match found.</pre>

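In the examples above, the overall match succeeds only when the first character matches one of the characters defined by the character class.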
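
The four harness runs above can be reproduced in a few lines of code. This is a sketch with illustrative names (<code>SimpleClassDemo</code>, <code>found</code>), not code from the tutorial.

```java
import java.util.regex.Pattern;

class SimpleClassDemo {

    // Hypothetical helper: true if the regex matches somewhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String[] words = {"bat", "cat", "rat", "hat"};
        for (String word : words) {
            // [bcr] accepts exactly one character: b, c, or r.
            System.out.println(word + ": " + found("[bcr]at", word));
        }
        // prints:
        // bat: true
        // cat: true
        // rat: true
        // hat: false
    }
}
```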

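<div id="h4"><a name="reg3_1_1"></a>3.1.1 Negation<span class="returnContents"><a href="#contents">Back to contents</a></span></div>
To match all characters except those listed, insert the <code>^</code> metacharacter at the beginning of the character class. This technique is known as <em>negation</em>.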

<pre id="console">Enter your regex: [^bcr]at
Enter input string to search: bat
No match found.

Enter your regex: [^bcr]at
Enter input string to search: cat
No match found.

Enter your regex: [^bcr]at
Enter input string to search: rat
No match found.

Enter your regex: [^bcr]at
Enter input string to search: hat
I found the text "hat" starting at index 0 and ending at index 3.</pre>

The match succeeds only when the first character of the input string is not one of the characters defined by the character class.
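
A point that is easy to miss is that a negated class still has to consume one character: it means “some character that is not in the list”, not “nothing”. The sketch below illustrates this with names made up for the example (<code>NegationDemo</code>, <code>found</code>).

```java
import java.util.regex.Pattern;

class NegationDemo {

    // Hypothetical helper: true if the regex matches somewhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found("[^bcr]at", "hat"));  // true
        System.out.println(found("[^bcr]at", "bat"));  // false
        // The negated class must match one character,
        // so "at" on its own does not match.
        System.out.println(found("[^bcr]at", "at"));   // false
        // Any character outside the list qualifies, including a digit or a space.
        System.out.println(found("[^bcr]at", "9at"));  // true
        System.out.println(found("[^bcr]at", " at"));  // true
    }
}
```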

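<div id="h4"><a name="reg3_1_2"></a>3.1.2 Ranges<span class="returnContents"><a href="#contents">Back to contents</a></span></div>
Sometimes you will want to define a character class that includes a range of values, such as the letters “a through h” or the digits “1 through 5”. To specify a range, insert the <code>-</code> metacharacter between the first and last characters to be matched, as in <code>[1-5]</code> or <code>[a-h]</code>. You can also place different ranges side by side within the class to widen the set of possible matches. For example, <code>[a-zA-Z]</code> matches any single character from a to z (lowercase) or from A to Z (uppercase).<br/>

Here are some examples of ranges and negation:<br/>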

<pre id="console">Enter your regex: [a-c]
Enter input string to search: a
I found the text "a" starting at index 0 and ending at index 1.

Enter your regex: [a-c]
Enter input string to search: b
I found the text "b" starting at index 0 and ending at index 1.

Enter your regex: [a-c]
Enter input string to search: c
I found the text "c" starting at index 0 and ending at index 1.

Enter your regex: [a-c]
Enter input string to search: d
No match found.

Enter your regex: foo[1-5]
Enter input string to search: foo1
I found the text "foo1" starting at index 0 and ending at index 4.

Enter your regex: foo[1-5]
Enter input string to search: foo5
I found the text "foo5" starting at index 0 and ending at index 4.

Enter your regex: foo[1-5]
Enter input string to search: foo6
No match found.

Enter your regex: foo[^1-5]
Enter input string to search: foo1
No match found.

Enter your regex: foo[^1-5]
Enter input string to search: foo6
I found the text "foo6" starting at index 0 and ending at index 4.</pre>
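
Ranges can also be exercised by counting how many matches a class produces in a longer input. The following sketch uses illustrative names (<code>RangeDemo</code>, <code>count</code>) and sample inputs that do not come from the tutorial.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RangeDemo {

    // Hypothetical helper: counts the matches of regex found by
    // successive calls to find() on the input.
    static int count(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (matcher.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(count("[a-zA-Z]", "a1B2c3"));           // 3 (a, B, c)
        System.out.println(count("[1-5]", "0123456789"));          // 5 (1 through 5)
        System.out.println(count("foo[1-5]", "foo1 foo5 foo6"));   // 2 (foo1, foo5)
        System.out.println(count("foo[^1-5]", "foo1 foo5 foo6"));  // 1 (foo6)
    }
}
```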

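<div id="h4"><a name="reg3_1_3"></a>3.1.3 Unions<span class="returnContents"><a href="#contents">Back to contents</a></span></div>
You can use a <em>union</em> to create a single character class made up of two or more separate character classes. To create a union, nest one class inside the other, as in <code>[0-4[6-8]]</code>. This particular union creates a single character class that matches the digits 0, 1, 2, 3, 4, 6, 7, and 8.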

<pre id="console">Enter your regex: [0-4[6-8]]
Enter input string to search: 0
I found the text "0" starting at index 0 and ending at index 1.

Enter your regex: [0-4[6-8]]