>/ox</CODE> modifiers. The <CODE CLASS="literal">/x</CODE> modifier is essential because the encoding template <CODE CLASS="literal">$eucjp</CODE> contains whitespace that must be ignored. The <CODE CLASS="literal">/o</CODE> modifier is for efficiency, since we know <CODE CLASS="literal">$eucjp</CODE> won't change from use to use.</P>
<P CLASS="para">Use in a substitution is similar, but since the text leading up to the real match is also part of the overall match, we must capture it with parentheses and include it in the replacement text. Assuming that <CODE CLASS="literal">$Tokyo</CODE> and <CODE CLASS="literal">$Osaka</CODE> have been set to the byte sequences for their respective words in the EUC-JP encoding, we could use the following to replace Tokyo with Osaka:</P>
<PRE CLASS="programlisting">s/^ ( (?:$eucjp)*? ) $Tokyo/$1$Osaka/ox</PRE>
<P CLASS="para">If used with <CODE CLASS="literal">/g</CODE>, we want to anchor each match to the end of the previous match, rather than to the start of the string. That's as simple as changing <CODE CLASS="literal">^</CODE> to <CODE CLASS="literal">\G</CODE>:</P>
<PRE CLASS="programlisting">s/\G ( (?:$eucjp)*? ) $Tokyo/$1$Osaka/gox</PRE></DIV>
<DIV CLASS="sect3"><H4 CLASS="sect3"><A CLASS="title" NAME="ch06-pgfId-1000010009">Splitting multiple-byte strings</A></H4>
<P CLASS="para">Another common task is to split an input string into its individual characters. With a one-byte-per-character encoding, you can simply split <CODE CLASS="literal">//</CODE>, but with a multiple-byte encoding, we need something like:</P>
<PRE CLASS="programlisting">@chars = /$eucjp/gox; # One character per list element</PRE>
<P CLASS="para">Now, <CODE CLASS="literal">@chars</CODE> contains one character per element.
The following snippet shows how you might use this to write a filter of some sort:</P>
<PRE CLASS="programlisting">while (&lt;&gt;) {
    my @chars = /$eucjp/gox;        # One character per list element
    for my $char (@chars) {
        if (length($char) == 1) {
            # Do something interesting with this one-byte character
        } else {
            # Do something interesting with this multiple-byte character
        }
    }
    my $line = join("", @chars);    # Glue list back together
    print $line;
}</PRE>
<P CLASS="para">In the two "do something interesting" parts, any change to <CODE CLASS="literal">$char</CODE> will be reflected in the output when <CODE CLASS="literal">@chars</CODE> is glued back together, because the <CODE CLASS="literal">for</CODE> loop aliases <CODE CLASS="literal">$char</CODE> to each element of the list.</P></DIV>
<DIV CLASS="sect3"><H4 CLASS="sect3"><A CLASS="title" NAME="ch06-pgfId-1000010032">Validating multiple-byte strings</A></H4>
<P CLASS="para">The use of <CODE CLASS="literal">/$eucjp/gox</CODE> in this kind of technique relies on the input string being properly formatted in our target encoding, EUC-JP. If it's not, the template <CODE CLASS="literal">/$eucjp/</CODE> won't be able to match at the malformed position, and the engine will advance past the offending bytes to find its next match, silently dropping them from the result.</P>
<P CLASS="para">One way to address this is to use <CODE CLASS="literal">/\G$eucjp/gox</CODE> instead. This prohibits the pattern-matching engine from skipping bytes in order to find a match (since <CODE CLASS="literal">\G</CODE> requires that each match begin exactly where the previous one ended).
This is still not a perfect approach, since it simply stops matching at the first ill-formatted byte and ignores the rest of the input.</P>
<P CLASS="para">A better way to confirm that a string is valid with respect to an encoding is to use something like:</P>
<PRE CLASS="programlisting">$is_eucjp = m/^(?:$eucjp)*$/xo;</PRE>
<P CLASS="para">If a string has only valid characters from start to end, you know the string as a whole is valid.</P>
<P CLASS="para">There is one potential problem, and it's due to how the end-of-string metacharacter <CODE CLASS="literal">$</CODE> works: it can be true at the end of the string (as we want), and also just before a newline at the end of the string. That means you can still match successfully even if the newline is not a valid character in the encoding. To get around this problem, you could use the more complicated <CODE CLASS="literal">(?!\n)$</CODE> instead of <CODE CLASS="literal">$</CODE>. (Perl's <CODE CLASS="literal">\z</CODE> anchor, which matches only at the very end of the string, is a simpler alternative where it is available.)</P>
<P CLASS="para">You can use the basic validation technique to detect which encoding is being used. For example, Japanese is commonly encoded with either EUC-JP or another encoding called Shift-JIS. If you've set up the templates, as with <CODE CLASS="literal">$eucjp</CODE>, you can do something like:</P>
<PRE CLASS="programlisting">$is_eucjp = m/^(?:$eucjp)*$/xo;
$is_sjis  = m/^(?:$sjis)*$/xo;</PRE>
<P CLASS="para">If both are true, the text is likely ASCII (since, essentially, ASCII is a sub-component of both encodings). (It's not quite foolproof, though, since some strings with multiple-byte characters might appear to be valid in both encodings. In such a case, automatic detection becomes impossible, although one might use character-frequency data to make an educated guess.)</P></DIV>
<DIV CLASS="sect3"><H4 CLASS="sect3"><A CLASS="title" NAME="ch06-pgfId-1000010053">Converting between encodings</A></H4>
<P CLASS="para">Converting from one encoding to another can be as simple as an extension of the process-each-character routine above.
Conversions for some closely related encodings can be done by a simple mathematical computation on the bytes, while others might require huge mapping tables. In either case, you insert the code at the "do something interesting" points in the routine.</P>
<P CLASS="para">Here's an example to convert from EUC-JP to Unicode, using a <CODE CLASS="literal">%euc2uni</CODE> hash as a mapping table:</P>
<PRE CLASS="programlisting">while (&lt;&gt;) {
    my @chars = /$eucjp/gox;        # One character per list element
    for my $euc (@chars) {
        my $uni = $euc2uni{$euc};   # Look up this character's Unicode equivalent
        if (defined $uni) {
            $euc = $uni;            # Replace the list element in place
        } else {
            ## deal with unknown EUC-&gt;Unicode mapping here.
        }
    }
    my $line = join("", @chars);
    print $line;
}</PRE>
<P CLASS="para">The topic of multiple-byte matching and processing is of particular importance when dealing with Unicode, which has a variety of possible representations. UCS-2 and UCS-4 are fixed-length encodings. UTF-8 is a mixed-length encoding of one to six bytes per character (later revisions of the standard restrict it to four). UTF-16, which represents the most common instance of Unicode encoding, is a variable-length encoding built from one or two 16-bit units per character.<A CLASS="indexterm" NAME="ch06-idx-1000010159-0"></A><A CLASS="indexterm" NAME="ch06-idx-1000010159-1"></A><A CLASS="indexterm" NAME="ch06-idx-1000010159-2"></A><A CLASS="indexterm" NAME="ch06-idx-1000010159-3"></A><A CLASS="indexterm" NAME="ch06-idx-1000010159-4"></A></P></DIV></DIV>
<DIV CLASS="sect2"><H3 CLASS="sect2"><A CLASS="title" NAME="ch06-pgfId-1000010163">See Also</A></H3>
<P CLASS="para">Jeffrey Friedl's article in Issue 5 of <CITE CLASS="citetitle">The Perl Journal</CITE>; <CITE CLASS="citetitle">CJKV Information Processing</CITE>, by Ken Lunde (O'Reilly &amp; Associates, due 1999)</P></DIV></DIV>
<p><a href="copyrght.htm">Copyright © 2002</a> O'Reilly &amp; Associates. All rights reserved.</p>
</BODY></HTML>