<?xml version="1.0" ?>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<!-- saved from url=(0017)http://localhost/ -->
<script language="JavaScript" src="../../displayToc.js"></script>
<script language="JavaScript" src="../../tocParas.js"></script>
<script language="JavaScript" src="../../tocTab.js"></script>
<link rel="stylesheet" type="text/css" href="../../scineplex.css">
<title>Unicode::Collate - Unicode Collation Algorithm</title>
<link rel="stylesheet" href="../../Active.css" type="text/css" />
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
<link rev="made" href="mailto:" />
</head>

<body>

<script>writelinks('__top__',2);</script>
<h1><a>Unicode::Collate - Unicode Collation Algorithm</a></h1>
<p><a name="__index__"></a></p>

<!-- INDEX BEGIN -->

<ul>

	<li><a href="#name">NAME</a></li>
	<li><a href="#synopsis">SYNOPSIS</a></li>
	<li><a href="#description">DESCRIPTION</a></li>
	<ul>

		<li><a href="#constructor_and_tailoring">Constructor and Tailoring</a></li>
		<li><a href="#methods_for_collation">Methods for Collation</a></li>
		<li><a href="#methods_for_searching">Methods for Searching</a></li>
		<li><a href="#other_methods">Other Methods</a></li>
	</ul>

	<li><a href="#export">EXPORT</a></li>
	<li><a href="#install">INSTALL</a></li>
	<li><a href="#caveats">CAVEATS</a></li>
	<li><a href="#author__copyright_and_license">AUTHOR, COPYRIGHT AND LICENSE</a></li>
	<li><a href="#see_also">SEE ALSO</a></li>
</ul>
<!-- INDEX END -->

<hr />
<p>
</p>
<h1><a name="name">NAME</a></h1>
<p>Unicode::Collate - Unicode Collation Algorithm</p>
<p>
</p>
<hr />
<h1><a name="synopsis">SYNOPSIS</a></h1>
<pre>
  <span class="keyword">use</span> <span class="variable">Unicode::Collate</span><span class="operator">;</span>
</pre>
<pre>
  <span class="comment">#construct</span>
  <span class="variable">$Collator</span> <span class="operator">=</span> <span class="variable">Unicode::Collate</span><span class="operator">-&gt;</span><span class="variable">new</span><span class="operator">(</span><span class="variable">%tailoring</span><span class="operator">);</span>
</pre>
<pre>
  <span class="comment">#sort</span>
  <span class="variable">@sorted</span> <span class="operator">=</span> <span class="variable">$Collator</span><span class="operator">-&gt;</span><span class="keyword">sort</span><span class="operator">(</span><span class="variable">@not_sorted</span><span class="operator">);</span>
</pre>
<pre>
  <span class="comment">#compare</span>
  <span class="variable">$result</span> <span class="operator">=</span> <span class="variable">$Collator</span><span class="operator">-&gt;</span><span class="keyword">cmp</span><span class="operator">(</span><span class="variable">$a</span><span class="operator">,</span> <span class="variable">$b</span><span class="operator">);</span> <span class="comment"># returns 1, 0, or -1.</span>
</pre>
<pre>
  <span class="comment"># If %tailoring is empty,</span>
  <span class="comment"># $Collator performs the default collation.</span>
</pre>
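<p>As a minimal runnable sketch of the calls above (the word list and the
variable names are illustrative, and the default collation element table
shipped with the module is assumed to be installed):</p>
<pre>
  use strict;
  use warnings;
  use Unicode::Collate;

  # No tailoring: the default collation is used.
  my $Collator = Unicode::Collate-&gt;new();

  # Case is only a tertiary difference, so "Banana" sorts between
  # "apple" and "cherry" rather than before both, as a plain sort would put it.
  my @sorted = $Collator-&gt;sort(qw(cherry Banana apple));
  print "@sorted\n";                                # apple Banana cherry

  print $Collator-&gt;cmp("apple", "Banana"), "\n";    # -1
</pre>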
<p>
</p>
<hr />
<h1><a name="description">DESCRIPTION</a></h1>
<p>This module is an implementation of Unicode Technical Standard #10
(a.k.a. UTS #10) - Unicode Collation Algorithm (a.k.a. UCA).</p>
<p>
</p>
<h2><a name="constructor_and_tailoring">Constructor and Tailoring</a></h2>
<p>The <code>new</code> method returns a collator object.</p>
<pre>
   <span class="variable">$Collator</span> <span class="operator">=</span> <span class="variable">Unicode::Collate</span><span class="operator">-&gt;</span><span class="variable">new</span><span class="operator">(</span>
      <span class="string">UCA_Version</span> <span class="operator">=&gt;</span> <span class="variable">$UCA_Version</span><span class="operator">,</span>
      <span class="string">alternate</span> <span class="operator">=&gt;</span> <span class="variable">$alternate</span><span class="operator">,</span> <span class="comment"># deprecated: use of 'variable' is recommended.</span>
      <span class="string">backwards</span> <span class="operator">=&gt;</span> <span class="variable">$levelNumber</span><span class="operator">,</span> <span class="comment"># or \@levelNumbers</span>
      <span class="string">entry</span> <span class="operator">=&gt;</span> <span class="variable">$element</span><span class="operator">,</span>
      <span class="string">hangul_terminator</span> <span class="operator">=&gt;</span> <span class="variable">$term_primary_weight</span><span class="operator">,</span>
      <span class="string">ignoreName</span> <span class="operator">=&gt;</span> <span class="string">qr/$ignoreName/</span><span class="operator">,</span>
      <span class="string">ignoreChar</span> <span class="operator">=&gt;</span> <span class="string">qr/$ignoreChar/</span><span class="operator">,</span>
      <span class="string">katakana_before_hiragana</span> <span class="operator">=&gt;</span> <span class="variable">$bool</span><span class="operator">,</span>
      <span class="string">level</span> <span class="operator">=&gt;</span> <span class="variable">$collationLevel</span><span class="operator">,</span>
      <span class="string">normalization</span>  <span class="operator">=&gt;</span> <span class="variable">$normalization_form</span><span class="operator">,</span>
      <span class="string">overrideCJK</span> <span class="operator">=&gt;</span> <span class="operator">\&amp;</span><span class="variable">overrideCJK</span><span class="operator">,</span>
      <span class="string">overrideHangul</span> <span class="operator">=&gt;</span> <span class="operator">\&amp;</span><span class="variable">overrideHangul</span><span class="operator">,</span>
      <span class="string">preprocess</span> <span class="operator">=&gt;</span> <span class="operator">\&amp;</span><span class="variable">preprocess</span><span class="operator">,</span>
      <span class="string">rearrange</span> <span class="operator">=&gt;</span> <span class="operator">\</span><span class="variable">@charList</span><span class="operator">,</span>
      <span class="string">table</span> <span class="operator">=&gt;</span> <span class="variable">$filename</span><span class="operator">,</span>
      <span class="string">undefName</span> <span class="operator">=&gt;</span> <span class="string">qr/$undefName/</span><span class="operator">,</span>
      <span class="string">undefChar</span> <span class="operator">=&gt;</span> <span class="string">qr/$undefChar/</span><span class="operator">,</span>
      <span class="string">upper_before_lower</span> <span class="operator">=&gt;</span> <span class="variable">$bool</span><span class="operator">,</span>
      <span class="string">variable</span> <span class="operator">=&gt;</span> <span class="variable">$variable</span><span class="operator">,</span>
   <span class="operator">);</span>
</pre>
<dl>
<dt><strong><a name="item_uca_version">UCA_Version</a></strong>

<dd>
<p>If a tracking version number of UCA is given,
the behavior of that tracking version is emulated when collating.
If omitted, the return value of <a href="#item_uca_version"><code>UCA_Version()</code></a> is used;
<a href="#item_uca_version"><code>UCA_Version()</code></a> should return the latest tracking version supported.</p>
</dd>
<dd>
<p>Supported tracking versions: 8, 9, 11, and 14.</p>
</dd>
<dd>
<pre>
     UCA       Unicode Standard         DUCET (@version)
     ---------------------------------------------------
      8              3.1                3.0.1 (3.0.1d9)
      9     3.1 with Corrigendum 3      3.1.1 (3.1.1)
     11              4.0                4.0.0 (4.0.0)
     14             4.1.0               4.1.0 (4.1.0)</pre>
</dd>
<dd>
<p>Note: Recent revisions of UTS #10 rename &quot;Tracking Version&quot; to &quot;Revision&quot;.</p>
</dd>
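<dd>
<p>A short sketch of selecting a tracking version, assuming the installed
module supports version 14 (as listed above); the sample strings are
illustrative only:</p>
</dd>
<dd>
<pre>
  use strict;
  use warnings;
  use Unicode::Collate;

  # Emulate tracking version 14 rather than the latest one.
  my $Collator = Unicode::Collate-&gt;new(UCA_Version =&gt; 14);
  my @sorted = $Collator-&gt;sort(qw(b a c));

  # Called as a class method, UCA_Version() reports the latest tracking
  # version supported, which new() falls back to when the key is omitted.
  my $latest = Unicode::Collate-&gt;UCA_Version;
  print "latest supported: $latest\n";
</pre>
</dd>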
</li>
<dt><strong><a name="item_alternate">alternate</a></strong>

<dd>
<p>-- see 3.2.2 Alternate Weighting, version 8 of UTS #10</p>
</dd>
<dd>
<p>For backward compatibility, <a href="#item_alternate"><code>alternate</code></a> (old name) can be used
as an alias for <a href="#item_variable"><code>variable</code></a>.</p>
</dd>
</li>
<dt><strong><a name="item_backwards">backwards</a></strong>

<dd>
<p>-- see 3.1.2 French Accents, UTS #10.</p>
</dd>
<dd>
<pre>
     backwards =&gt; $levelNumber or \@levelNumbers</pre>
</dd>
<dd>
<p>Weights at the given level(s) are compared in reverse order;
e.g. level 2 (diacritic ordering) in French.
If omitted, all levels are compared forwards.</p>
</dd>
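<dd>
<p>The effect can be sketched with the classic French example below.
The word list is illustrative and the default collation element table is
assumed; accented letters are written as escapes so the code does not
depend on the source file encoding.</p>
</dd>
<dd>
<pre>
  use strict;
  use warnings;
  use Unicode::Collate;

  # \x{F4} is o-circumflex, \x{E9} is e-acute.
  my @words = ("cot\x{E9}", "c\x{F4}te", "cote", "c\x{F4}t\x{E9}");

  my $forwards = Unicode::Collate-&gt;new();
  my $french   = Unicode::Collate-&gt;new(backwards =&gt; 2);

  # Forwards: the first accent difference from the left decides.
  my @en = $forwards-&gt;sort(@words);   # cote, coté, côte, côté

  # Backwards at level 2: the last accent difference decides.
  my @fr = $french-&gt;sort(@words);     # cote, côte, coté, côté
</pre>
</dd>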
</li>
<dt><strong><a name="item_entry">entry</a></strong>

<dd>
<p>-- see 3.1 Linguistic Features; 3.2.1 File Format, UTS #10.</p>
</dd>
<dd>
<p>If the same character (or sequence of characters) already exists
in the collation element table loaded through <a href="#item_table"><code>table</code></a>,
its mapping to collation elements is overridden.
If it does not exist, the mapping is added.</p>
</dd>
<dd>
<pre>
    <span class="string">entry</span> <span class="operator">=&gt;</span> <span class="operator">&lt;&lt;</span><span class="default">'ENTRY'</span><span class="operator">,</span> <span class="comment"># for DUCET v4.0.0 (allkeys-4.0.0.txt)</span><span class="string">
    0063 0068 ; [.0E6A.0020.0002.0063] # ch
    0043 0068 ; [.0E6A.0020.0007.0043] # Ch
    0043 0048 ; [.0E6A.0020.0008.0043] # CH
    006C 006C ; [.0F4C.0020.0002.006C] # ll
    004C 006C ; [.0F4C.0020.0007.004C] # Ll
    004C 004C ; [.0F4C.0020.0008.004C] # LL
    00F1      ; [.0F7B.0020.0002.00F1] # n-tilde
    006E 0303 ; [.0F7B.0020.0002.00F1] # n-tilde
    00D1      ; [.0F7B.0020.0008.00D1] # N-tilde
    004E 0303 ; [.0F7B.0020.0008.00D1] # N-tilde
    </span><span class="default">ENTRY</span>
</pre>
</dd>
<dd>
<pre>
    <span class="string">entry</span> <span class="operator">=&gt;</span> <span class="operator">&lt;&lt;</span><span class="default">'ENTRY'</span><span class="operator">,</span> <span class="comment"># for DUCET v4.0.0 (allkeys-4.0.0.txt)</span><span class="string">
    00E6 ; [.0E33.0020.0002.00E6][.0E8B.0020.0002.00E6] # ae ligature as &lt;a&gt;&lt;e&gt;
    00C6 ; [.0E33.0020.0008.00C6][.0E8B.0020.0008.00C6] # AE ligature as &lt;A&gt;&lt;E&gt;
    </span><span class="default">ENTRY</span>
</pre>
</dd>
<dd>
<p><strong>NOTE:</strong> The code point in the UCA file format (before <code>';'</code>)
<strong>must</strong> be a Unicode code point (written in hexadecimal),
not a native code point.
So <code>0063</code> must always denote <code>U+0063</code>,
not the native character <code>&quot;\x63&quot;</code>.</p>
</dd>
<dd>
<p>Weighting may vary depending on the collation element table.
So ensure that the weights defined in <a href="#item_entry"><code>entry</code></a> are consistent with
those in the collation element table loaded via <a href="#item_table"><code>table</code></a>.</p>
</dd>
<dd>
<p>In DUCET v4.0.0, the primary weight of <code>C</code> is <code>0E60</code>
and that of <code>D</code> is <code>0E6D</code>. So setting the primary weight of <code>CH</code> to <code>0E6A</code>
(a value between <code>0E60</code> and <code>0E6D</code>)
yields the ordering <code>C &lt; CH &lt; D</code>.
Strictly speaking, DUCET already has some characters between <code>C</code> and <code>D</code>:
<code>small capital C</code> (<code>U+1D04</code>) with primary weight <code>0E64</code>,
<code>c-hook/C-hook</code> (<code>U+0188/U+0187</code>) with <code>0E65</code>,
and <code>c-curl</code> (<code>U+0255</code>) with <code>0E69</code>.
Thus the primary weight <code>0E6A</code> for <code>CH</code> places <code>CH</code>
between <code>c-curl</code> and <code>D</code>.</p>
</dd>
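<dd>
<p>The idea of placing a contraction between two letters can be sketched
without depending on any particular DUCET version. The example below
assumes that passing <code>table =&gt; undef</code> makes the constructor
read no collation element table, so that only the entries given here are
defined; the primary weights (<code>0010</code> to <code>0040</code>) are
made up for illustration and are not DUCET values.</p>
</dd>
<dd>
<pre>
  use strict;
  use warnings;
  use Unicode::Collate;

  # Assumption: with table =&gt; undef no table file is read, so the
  # hypothetical weights below cannot clash with real DUCET weights.
  my $Collator = Unicode::Collate-&gt;new(
      table =&gt; undef,
      entry =&gt; join("\n",
          '0063      ; [.0010.0020.0002.0063] # c',
          '0063 0068 ; [.0015.0020.0002.0063] # ch, between c and d',
          '0064      ; [.0020.0020.0002.0064] # d',
          '0068      ; [.0030.0020.0002.0068] # h',
          '007A      ; [.0040.0020.0002.007A] # z',
      ),
  );

  # "ch" is weighted as one element after every other "c..." string.
  my @sorted = $Collator-&gt;sort(qw(d ch cz c));
  print "@sorted\n";   # c cz ch d
</pre>
</dd>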
</li>
<dt><strong><a name="item_hangul_terminator">hangul_terminator</a></strong>

<dd>
<p>-- see 7.1.4 Trailing Weights, UTS #10.</p>
</dd>
<dd>
<p>If a true value is given (non-zero, and it should be positive),
it is added as a terminator primary weight to the end of
every standard Hangul syllable. The secondary and any higher weights
of the terminator are set to zero.
If the value is false or the <a href="#item_hangul_terminator"><code>hangul_terminator</code></a> key does not exist,
no terminator weights are inserted.</p>
</dd>
<dd>
<p>Boundaries of Hangul syllables are determined
according to conjoining Jamo behavior in <em>the Unicode Standard</em>
and <em>HangulSyllableType.txt</em>.</p>
</dd>
<dd>
<p><strong>Implementation Note:</strong>
(1) For an expansion mapping (a Unicode character mapped
to a sequence of collation elements), a terminator is not added
between the collation elements, even if a Hangul syllable boundary exists there.
A terminator is added only at the position following
the last collation element.</p>
</dd>
<dd>
<p>(2) Non-conjoining Hangul letters
(Compatibility Jamo, halfwidth Jamo, and enclosed letters) are not
automatically terminated with a terminator primary weight.
These characters may need a terminator included in the collation element
table beforehand.</p>
</dd>
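<dd>
<p>A rough sketch of how the terminator shows up in a sort key. The value
<code>16</code> is an arbitrary positive weight chosen for illustration,
and the default collation element table is assumed; the exact weights
printed depend on the table in use.</p>
</dd>
<dd>
<pre>
  use strict;
  use warnings;
  use Unicode::Collate;

  my $plain      = Unicode::Collate-&gt;new();
  my $terminated = Unicode::Collate-&gt;new(hangul_terminator =&gt; 16);

  my $syllable = "\x{AC00}";   # HANGUL SYLLABLE GA

  # viewSortKey() renders a sort key in readable form. With a terminator,
  # an extra primary weight is expected after the syllable's own weights.
  print $plain-&gt;viewSortKey($syllable), "\n";
  print $terminated-&gt;viewSortKey($syllable), "\n";
</pre>
</dd>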
</li>
<dt><strong><a name="item_ignorechar">ignoreChar</a></strong>

<dt><strong><a name="item_ignorename">ignoreName</a></strong>
