<HTML><HEAD><TITLE>Recipe 6.18. Matching Multiple-Byte Characters (Perl Cookbook)</TITLE>
<META NAME="DC.title" CONTENT="Perl Cookbook">
<META NAME="DC.creator" CONTENT="Tom Christiansen &amp; Nathan Torkington">
<META NAME="DC.publisher" CONTENT="O'Reilly &amp; Associates, Inc.">
<META NAME="DC.date" CONTENT="1999-07-02T01:34:58Z">
<META NAME="DC.type" CONTENT="Text.Monograph">
<META NAME="DC.format" CONTENT="text/html" SCHEME="MIME">
<META NAME="DC.source" CONTENT="1-56592-243-3" SCHEME="ISBN">
<META NAME="DC.language" CONTENT="en-US">
<META NAME="generator" CONTENT="Jade 1.1/O'Reilly DocBook 3.0 to HTML 4.0">
<LINK REV="made" HREF="mailto:online-books@oreilly.com" TITLE="Online Books Comments">
<LINK REL="up" HREF="ch06_01.htm" TITLE="6. Pattern Matching">
<LINK REL="prev" HREF="ch06_18.htm" TITLE="6.17. Expressing AND, OR, and NOT in a Single Pattern">
<LINK REL="next" HREF="ch06_20.htm" TITLE="6.19. Matching a Valid Mail Address">
</HEAD>
<BODY BGCOLOR="#FFFFFF">
<img alt="Book Home" border="0" src="gifs/smbanner.gif" usemap="#banner-map" />
<map name="banner-map"><area shape="rect" coords="1,-2,616,66" href="index.htm" alt="Perl Cookbook"><area shape="rect" coords="629,-11,726,25" href="jobjects/fsearch.htm" alt="Search this book" /></map>
<div class="navbar"><p><TABLE WIDTH="684" BORDER="0" CELLSPACING="0" CELLPADDING="0"><TR>
<TD ALIGN="LEFT" VALIGN="TOP" WIDTH="228"><A CLASS="sect1" HREF="ch06_18.htm" TITLE="6.17. Expressing AND, OR, and NOT in a Single Pattern"><IMG SRC="../gifs/txtpreva.gif" ALT="Previous: 6.17. Expressing AND, OR, and NOT in a Single Pattern" BORDER="0"></A></TD>
<TD ALIGN="CENTER" VALIGN="TOP" WIDTH="228"><B><FONT FACE="ARIEL,HELVETICA,HELV,SANSERIF" SIZE="-1"><A CLASS="chapter" REL="up" HREF="ch06_01.htm" TITLE="6. Pattern Matching"></A></FONT></B></TD>
<TD ALIGN="RIGHT" VALIGN="TOP" WIDTH="228"><A CLASS="sect1" HREF="ch06_20.htm" TITLE="6.19. Matching a Valid Mail Address"><IMG SRC="../gifs/txtnexta.gif" ALT="Next: 6.19. 
Matching a Valid Mail Address" BORDER="0"></A></TD></TR></TABLE></DIV>
<DIV CLASS="sect1"><H2 CLASS="sect1"><A CLASS="title" NAME="ch06-chap06_matching_5">6.18. Matching Multiple-Byte Characters</A></H2>
<DIV CLASS="sect2"><H3 CLASS="sect2"><A CLASS="title" NAME="ch06-pgfId-1000009936">Problem</A></H3>
<P CLASS="para"><A CLASS="indexterm" NAME="ch06-idx-1000010766-0"></A><A CLASS="indexterm" NAME="ch06-idx-1000010766-1"></A><A CLASS="indexterm" NAME="ch06-idx-1000010766-2"></A><A CLASS="indexterm" NAME="ch06-idx-1000010766-3"></A><A CLASS="indexterm" NAME="ch06-idx-1000010766-4"></A>You need to perform regular-expression searches against multiple-byte characters.</P>
<P CLASS="para">A <EM CLASS="emphasis">character encoding</EM> is a set of mappings from characters and symbols to digital representations. ASCII is an encoding where each character is represented as exactly one byte, but complex writing systems, such as those for Chinese, Japanese, and Korean, have so many characters that their encodings need to use multiple bytes to represent characters.</P>
<P CLASS="para">Perl works on the principle that each byte represents a single character, which works well in ASCII but makes regular expression matches on strings containing multiple-byte characters tricky, to say the least. The regular expression engine does not understand the character boundaries in your string of bytes, and so can return "matches" from the middle of one character to the middle of another.</P></DIV>
<DIV CLASS="sect2"><H3 CLASS="sect2"><A CLASS="title" NAME="ch06-pgfId-1000010128">Solution</A></H3>
<P CLASS="para">Exploit the encoding by tailoring the pattern to the sequences of bytes that constitute characters.
The basic approach is to build a pattern that matches a single (multiple byte) character in the encoding, and then use that "any character" pattern in larger patterns.</P></DIV>
<DIV CLASS="sect2"><H3 CLASS="sect2"><A CLASS="title" NAME="ch06-chap06_examples_0">Discussion</A></H3>
<P CLASS="para">As an example, we'll examine one of the encodings for Japanese, called <EM CLASS="emphasis">EUC-JP</EM>, and then show how we use this in solving a number of multiple-byte encoding issues. EUC-JP can represent thousands of characters, but it's basically a superset of ASCII. Bytes with values ranging from 0 to 127 (<CODE CLASS="literal">0x00</CODE> to <CODE CLASS="literal">0x7F</CODE>) are almost exactly their ASCII counterparts, so those bytes represent one-byte characters. Some characters are represented by a pair of bytes, the first with value <CODE CLASS="literal">0x8E</CODE> and the second with a value in the range <CODE CLASS="literal">0xA0-0xDF</CODE>. Some others are represented by three bytes, the first with the value <CODE CLASS="literal">0x8F</CODE> and the others in the range <CODE CLASS="literal">0xA1-0xFE</CODE>, while others still are represented by two bytes, each in the <CODE CLASS="literal">0xA1-0xFE</CODE> range.</P>
<P CLASS="para">We can convey this information&nbsp;- what bytes can make up characters in this encoding&nbsp;- as a regular expression.
For ease of use later, here we'll define a string, <CODE CLASS="literal">$eucjp</CODE>, that holds the regular expression to match a single EUC-JP character:</P>
<PRE CLASS="programlisting">my $eucjp = q{                 # EUC-JP encoding subcomponents:
    [\x00-\x7F]                # ASCII/JIS-Roman (one-byte/character)
  | \x8E[\xA0-\xDF]            # half-width katakana (two bytes/char)
  | \x8F[\xA1-\xFE][\xA1-\xFE] # JIS X 0212-1990 (three bytes/char)
  | [\xA1-\xFE][\xA1-\xFE]     # JIS X 0208:1997 (two bytes/char)
};</PRE>
<P CLASS="para">(Because we've inserted comments and whitespace for pretty-printing, we'll have to use the <CODE CLASS="literal">/x</CODE> modifier when we use this in a match or substitution.)</P>
<P CLASS="para">With this template in hand, the following sections show how to:</P>
<UL CLASS="itemizedlist">
<LI CLASS="listitem"><P CLASS="para"><A CLASS="listitem" NAME="ch06-pgfId-1000009979"></A>Perform a normal match without any "false" matches</P></LI>
<LI CLASS="listitem"><P CLASS="para"><A CLASS="listitem" NAME="ch06-pgfId-1000009981"></A>Count, convert (to another encoding), and/or filter characters</P></LI>
<LI CLASS="listitem"><P CLASS="para"><A CLASS="listitem" NAME="ch06-pgfId-1000009983"></A>Verify whether the target text is valid according to an encoding</P></LI>
<LI CLASS="listitem"><P CLASS="para"><A CLASS="listitem" NAME="ch06-pgfId-1000009985"></A>Detect which encoding the target text uses</P></LI>
</UL>
<P CLASS="para">All the examples are shown using EUC-JP as the encoding of interest, but they will work with any of the many multiple-byte encodings commonly used for text processing, such as Unicode, Big-5, etc.</P>
<DIV CLASS="sect3"><H4 CLASS="sect3"><A CLASS="title" NAME="ch06-pgfId-1000009989">Avoiding false matches</A></H4>
<P CLASS="para">A false match is where the regular expression engine finds a match that begins in the middle of a multiple-byte character sequence.
We can get around the problem by carefully controlling the match, ensuring that the pattern matching engine stays synchronized with the character boundaries at all times.</P>
<P CLASS="para">This can be done by anchoring the match to the start of the string, then manually bypassing characters ourselves when the real match can't happen at the current location. With the EUC-JP example, the "bypassing characters" part is <CODE CLASS="literal">/(?:</CODE> <CODE CLASS="literal">$eucjp</CODE> <CODE CLASS="literal">)*?/</CODE>. <CODE CLASS="literal">$eucjp</CODE> is our template to match any valid character, and because it is applied via the non-greedy <CODE CLASS="literal">*?</CODE>, it can match a character only when whatever follows (presumably the desired real match) can't match. Here's a real example:</P>
<PRE CLASS="programlisting">/^ (?: $eucjp )*?  \xC5\xEC\xB5\xFE/ox # Trying to find Tokyo</PRE>
<P CLASS="para">In the EUC-JP encoding, the Japanese word for Tokyo is written with two characters, the first encoded by the two bytes <CODE CLASS="literal">\xC5\xEC</CODE>, the second encoded by the two bytes <CODE CLASS="literal">\xB5\xFE</CODE>. As far as Perl is concerned, we're looking merely for the four-byte sequence <CODE CLASS="literal">\xC5\xEC\xB5\xFE</CODE>, but because we use <CODE CLASS="literal">(?:</CODE> <CODE CLASS="literal">$eucjp</CODE> <CODE CLASS="literal">)*?</CODE> to move along the string only by characters of our target encoding, we know we'll stay in synch.</P>
<P CLASS="para">Don't forget to use the <CODE CLASS="literal"
Source: ch06_19.htm from the Perl Cookbook, by Tom Christiansen and Nathan Torkington, ISBN 1-56592-243-3, First Edition, published August 1998.