ch06_19.htm

From "By Tom Christiansen and Nathan Torkingto" · HTM code · 636 lines · page 1 of 2
>/ox</CODE> modifiers. The <CODE CLASS="literal">/x</CODE> modifier is especially crucial because of the whitespace used in the encoding template <CODE CLASS="literal">$eucjp</CODE>. The <CODE CLASS="literal">/o</CODE> modifier is for efficiency, since we know <CODE CLASS="literal">$eucjp</CODE> won't change from use to use.</P>
<P CLASS="para">Use in a replacement is similar, but since the text leading up to the real match is also part of the overall match, we must capture it with parentheses and be sure to include it in the replacement text. Assuming that <CODE CLASS="literal">$Tokyo</CODE> and <CODE CLASS="literal">$Osaka</CODE> have been set to the byte sequences for their respective words in the EUC-JP encoding, we could use the following to replace Tokyo with Osaka:</P>
<PRE CLASS="programlisting">s/^ (  (?:$eucjp)*? ) $Tokyo/$1$Osaka/ox</PRE>
<P CLASS="para">If used with <CODE CLASS="literal">/g</CODE>, we want to anchor the match to the end of the previous match, rather than to the start of the string. That's as simple as changing <CODE CLASS="literal">^</CODE> to <CODE CLASS="literal">\G</CODE>:</P>
<PRE CLASS="programlisting">s/\G (  (?:$eucjp)*? ) $Tokyo/$1$Osaka/gox</PRE></DIV>
<DIV CLASS="sect3"><H4 CLASS="sect3"><A CLASS="title" NAME="ch06-pgfId-1000010009">Splitting multiple-byte strings</A></H4>
<P CLASS="para">Another common task is to split an input string into its individual characters. With a one-byte-per-character encoding, you can simply split <CODE CLASS="literal">//</CODE>, but with a multiple-byte encoding, we need something like:</P>
<PRE CLASS="programlisting">@chars = /$eucjp/gox; # One character per list element</PRE>
<P CLASS="para">Now, <CODE CLASS="literal">@chars</CODE> contains one character per element.
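</P>
<P CLASS="para">As a minimal, self-contained sketch of the split, the following defines its own <CODE CLASS="literal">$eucjp</CODE> template. That definition is an assumption based on the EUC-JP byte ranges and is not taken from this page, so substitute the recipe's own template where it differs. The helper name <CODE CLASS="literal">split_eucjp</CODE> and the sample bytes are likewise only illustrative:</P>
<PRE CLASS="programlisting">use strict;
use warnings;

# Assumed EUC-JP template (not taken from this page): one alternative
# per character width, laid out with whitespace for use under /x.
my $eucjp = q{
    [\x00-\x7F]
  | \x8E[\xA1-\xDF]
  | \x8F[\xA1-\xFE][\xA1-\xFE]
  | [\xA1-\xFE][\xA1-\xFE]
};

# Hypothetical helper: split a byte string into EUC-JP characters.
# Call it in list context to get one character per list element.
sub split_eucjp {
    my ($bytes) = @_;
    return $bytes =~ /$eucjp/gox;
}

# "A", then the two-byte sequence A4 A2, then "B".
my @chars = split_eucjp("A\xA4\xA2B");
print scalar(@chars), " characters\n";
printf "%d byte(s)\n", length($_) for @chars;</PRE>
<P CLASS="para">For this four-byte input the list has three elements, of one, two, and one bytes respectively.</P>
<P CLASS="para">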
The following snippet shows how you might use this to write a filter of some sort:</P>
<PRE CLASS="programlisting">while (&lt;&gt;) {
  my @chars = /$eucjp/gox; # One character per list element
  for my $char (@chars) {
    if (length($char) == 1) {
      # Do something interesting with this one-byte character
    } else {
      # Do something interesting with this multiple-byte character
    }
  }
  my $line = join(&quot;&quot;,@chars); # Glue list back together
  print $line;
}</PRE>
<P CLASS="para">In the two "do something interesting" parts, any change to <CODE CLASS="literal">$char</CODE> will be reflected in the output when <CODE CLASS="literal">@chars</CODE> is glued back together, because the loop variable is an alias for each array element.</P></DIV>
<DIV CLASS="sect3"><H4 CLASS="sect3"><A CLASS="title" NAME="ch06-pgfId-1000010032">Validating multiple-byte strings</A></H4>
<P CLASS="para">The use of <CODE CLASS="literal">/$eucjp/gox</CODE> in this kind of technique relies strongly on the input string indeed being properly formatted in our target encoding, EUC-JP. If it's not, the template <CODE CLASS="literal">/$eucjp/</CODE> won't be able to match at the malformed position, and the engine will silently skip bytes until it finds the next match.</P>
<P CLASS="para">One way to address this is to use <CODE CLASS="literal">/\G$eucjp/gox</CODE> instead (grouped as <CODE CLASS="literal">\G(?:$eucjp)</CODE> if the template has alternation at its top level, so that the anchor applies to every alternative). This prohibits the pattern matching engine from skipping bytes in order to find a match, since the use of <CODE CLASS="literal">\G</CODE> indicates that any match must immediately follow the previous match.
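</P>
<P CLASS="para">The difference can be seen in a small sketch. The <CODE CLASS="literal">$eucjp</CODE> definition below is an assumption based on the EUC-JP byte ranges, not the recipe's own template, and the helper names <CODE CLASS="literal">chars_unanchored</CODE> and <CODE CLASS="literal">chars_anchored</CODE> are hypothetical. Because this assumed template has alternation at its top level, it is wrapped in <CODE CLASS="literal">(?:...)</CODE> after <CODE CLASS="literal">\G</CODE>:</P>
<PRE CLASS="programlisting">use strict;
use warnings;

# Assumed EUC-JP template (not taken from this page).
my $eucjp = q{
    [\x00-\x7F]
  | \x8E[\xA1-\xDF]
  | \x8F[\xA1-\xFE][\xA1-\xFE]
  | [\xA1-\xFE][\xA1-\xFE]
};

# Without \G the engine may step over a byte that cannot start a character.
sub chars_unanchored {
    my ($bytes) = @_;
    return $bytes =~ /$eucjp/gox;
}

# With \G every match must begin exactly where the previous one ended.
sub chars_anchored {
    my ($bytes) = @_;
    return $bytes =~ /\G(?:$eucjp)/gox;
}

# The byte A4 is followed by "B", which is not a valid second byte.
my $input  = "A\xA4B";
my @loose  = chars_unanchored($input);
my @strict = chars_anchored($input);
printf "unanchored: %d characters\n", scalar @loose;
printf "anchored:   %d characters\n", scalar @strict;
print "input fully consumed\n"
    if length(join "", @strict) == length $input;</PRE>
<P CLASS="para">With this input the unanchored form returns two characters, having skipped the stray byte, while the anchored form returns one character and then stops.</P>
<P CLASS="para">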
This is still not a perfect approach, since it will simply stop matching on ill-formatted input data.</P>
<P CLASS="para">A better way to confirm that a string is valid with respect to an encoding is to use something like:</P>
<PRE CLASS="programlisting">$is_eucjp = m/^(?:$eucjp)*$/xo;</PRE>
<P CLASS="para">If a string has only valid characters from start to end, you know the string as a whole is valid.</P>
<P CLASS="para">There is one potential problem, and it's due to how the end-of-string metacharacter <CODE CLASS="literal">$</CODE> works: it can be true at the end of the string (as we want), and also just before a newline at the end of the string. That means you can still match successfully even if the newline is not a valid character in the encoding. To get around this problem, you could use the more-complicated <CODE CLASS="literal">(?!\n)$</CODE> instead of <CODE CLASS="literal">$</CODE>.</P>
<P CLASS="para">You can use the basic validation technique to detect which encoding is being used. For example, Japanese is commonly encoded with either EUC-JP or another encoding called Shift-JIS. If you've set up the templates, as with <CODE CLASS="literal">$eucjp</CODE>, you can do something like:</P>
<PRE CLASS="programlisting">$is_eucjp = m/^(?:$eucjp)*$/xo;
$is_sjis  = m/^(?:$sjis)*$/xo;</PRE>
<P CLASS="para">If both are true, the text is likely ASCII (since, essentially, ASCII is a sub-component of both encodings). It's not quite foolproof, though, since some strings with multiple-byte characters might appear to be valid in both encodings. In such a case, automatic detection becomes impossible, although one might use character-frequency data to make an educated guess.</P></DIV>
<DIV CLASS="sect3"><H4 CLASS="sect3"><A CLASS="title" NAME="ch06-pgfId-1000010053">Converting between encodings</A></H4>
<P CLASS="para">Converting from one encoding to another can be as simple as an extension of the process-each-character routine above.
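</P>
<P CLASS="para">To give a sense of how small such a per-character step can be, here is a hedged sketch. A plain two-byte EUC-JP character carries its JIS X 0208 code with the high bit of each byte set, so subtracting 0x80 from each byte yields the 7-bit form. The helper name <CODE CLASS="literal">euc_char_to_7bit</CODE> is hypothetical, and the sketch leaves out the escape sequences that a complete ISO-2022-JP stream would also need:</P>
<PRE CLASS="programlisting">use strict;
use warnings;

# Hypothetical helper: map one EUC-JP character to its 7-bit JIS form.
sub euc_char_to_7bit {
    my ($char) = @_;
    # This sketch handles only plain two-byte characters;
    # anything else is passed through unchanged.
    return $char unless $char =~ /\A[\xA1-\xFE][\xA1-\xFE]\z/;
    # Clear the high bit of each byte by subtracting 0x80.
    return join "", map { chr(ord($_) - 0x80) } split //, $char;
}

my $seven = euc_char_to_7bit("\xA4\xA2");
print unpack("H*", $seven), "\n";   # hex dump of the converted bytes</PRE>
<P CLASS="para">For the sample bytes A4 A2 the hex dump reads 2422. A call like this would sit at one of the "do something interesting" points in the routine.</P>
<P CLASS="para">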
Conversions for some closely related encodings can be done by a simple mathematical computation on the bytes, while others might require huge mapping tables. In either case, you insert the code at the "do something interesting" points in the routine.</P>
<P CLASS="para">Here's an example to convert from EUC-JP to Unicode, using a <CODE CLASS="literal">%euc2uni</CODE> hash as a mapping table:</P>
<PRE CLASS="programlisting">while (&lt;&gt;) {
  my @chars = /$eucjp/gox; # One character per list element
  for my $euc (@chars) {
    my $uni = $euc2uni{$euc};
    if (defined $uni) {
        $euc = $uni;
    } else {
        ## deal with unknown EUC-&gt;Unicode mapping here.
    }
  }
  my $line = join(&quot;&quot;,@chars);
  print $line;
}</PRE>
<P CLASS="para">The topic of multiple-byte matching and processing is of particular importance when dealing with Unicode, which has a variety of possible representations. UCS-2 and UCS-4 are fixed-length encodings. UTF-8 is a variable-length encoding, originally defined as one through six bytes per character and since restricted to one through four. UTF-16, widely used as an in-memory representation of Unicode, is a variable-length encoding built from 16-bit units.<A CLASS="indexterm" NAME="ch06-idx-1000010159-0"></A><A CLASS="indexterm" NAME="ch06-idx-1000010159-1"></A><A CLASS="indexterm" NAME="ch06-idx-1000010159-2"></A><A CLASS="indexterm" NAME="ch06-idx-1000010159-3"></A><A CLASS="indexterm" NAME="ch06-idx-1000010159-4"></A></P></DIV></DIV>
<DIV CLASS="sect2"><H3 CLASS="sect2"><A CLASS="title" NAME="ch06-pgfId-1000010163">See Also</A></H3>
<P CLASS="para">Jeffrey Friedl's article in Issue 5 of <CITE CLASS="citetitle">The Perl Journal</CITE>; <CITE CLASS="citetitle">CJKV Information Processing</CITE> by Ken Lunde, O'Reilly &amp; Associates (due 1999)</P></DIV></DIV>
<p><a href="copyrght.htm">Copyright &copy; 2002</a> O'Reilly &amp; Associates. All rights reserved.</p>
</BODY></HTML>
ch06_19.htm - Source description

This page shows the source file ch06_19.htm from "By Tom Christiansen and Nathan Torkington ISBN 1-56592-243-3 First Edition, published August 1998". It is written in HTM and has 636 lines in total.