/*
 * @(#)RuleBasedCollator.java 1.36 03/01/23
 *
 * Copyright 2003 Sun Microsystems, Inc. All rights reserved.
 * SUN PROPRIETARY/CONFIDENTIAL. Use is subject to license terms.
 */

/*
 * (C) Copyright Taligent, Inc. 1996, 1997 - All Rights Reserved
 * (C) Copyright IBM Corp. 1996-1998 - All Rights Reserved
 *
 * The original version of this source code and documentation is copyrighted
 * and owned by Taligent, Inc., a wholly-owned subsidiary of IBM. These
 * materials are provided under terms of a License Agreement between Taligent
 * and Sun. This technology is protected by multiple US and International
 * patents. This notice and attribution to Taligent may not be removed.
 * Taligent is a registered trademark of Taligent, Inc.
 *
 */

package java.text;

import java.util.Vector;
import java.util.Locale;

import sun.text.Normalizer;
import sun.text.NormalizerUtilities;

/**
 * The <code>RuleBasedCollator</code> class is a concrete subclass of
 * <code>Collator</code> that provides a simple, data-driven, table
 * collator. With this class you can create a customized table-based
 * <code>Collator</code>. <code>RuleBasedCollator</code> maps
 * characters to sort keys.
 *
 * <p>
 * <code>RuleBasedCollator</code> has the following restrictions
 * for efficiency (other subclasses may be used for more complex languages) :
 * <ol>
 * <li>If a special collation rule controlled by a <modifier> is
 *     specified, it applies to the whole collator object.
 * <li>All non-mentioned characters are at the end of the
 *     collation order.
 * </ol>
 *
 * <p>
 * The collation table is composed of a list of collation rules, where each
 * rule is of one of three forms:
 * <pre>
 *    <modifier>
 *    <relation> <text-argument>
 *    <reset> <text-argument>
 * </pre>
 * The definitions of the rule elements are as follows:
 * <UL Type=disc>
 *    <LI><strong>Text-Argument</strong>: A text-argument is any sequence of
 *        characters, excluding special characters (that is, common
 *        whitespace characters [0009-000D, 0020] and rule syntax characters
 *        [0021-002F, 003A-0040, 005B-0060, 007B-007E]). If those
 *        characters are desired, you can put them in single quotes
 *        (e.g. ampersand => '&'). Note that unquoted white space characters
 *        are ignored; e.g. <code>b c</code> is treated as <code>bc</code>.
 *    <LI><strong>Modifier</strong>: There are currently two modifiers that
 *        turn on special collation rules.
 *        <UL Type=square>
 *            <LI>'@' : Turns on backwards sorting of accents (secondary
 *                      differences), as in French.
 *            <LI>'!' : Turns on Thai/Lao vowel-consonant swapping. If this
 *                      rule is in force when a Thai vowel of the range
 *                      \U0E40-\U0E44 precedes a Thai consonant of the range
 *                      \U0E01-\U0E2E OR a Lao vowel of the range \U0EC0-\U0EC4
 *                      precedes a Lao consonant of the range \U0E81-\U0EAE then
 *                      the vowel is placed after the consonant for collation
 *                      purposes.
 *        </UL>
 *        <p>'@' : Indicates that accents are sorted backwards, as in French.
 *    <LI><strong>Relation</strong>: The relations are the following:
 *        <UL Type=square>
 *            <LI>'<' : Greater, as a letter difference (primary)
 *            <LI>';' : Greater, as an accent difference (secondary)
 *            <LI>',' : Greater, as a case difference (tertiary)
 *            <LI>'=' : Equal
 *        </UL>
 *    <LI><strong>Reset</strong>: There is a single reset
 *        which is used primarily for contractions and expansions, but which
 *        can also be used to add a modification at the end of a set of rules.
 *        <p>'&' : Indicates that the next rule follows the position to where
 *            the reset text-argument would be sorted.
 * </UL>
 *
 * <p>
 * This sounds more complicated than it is in practice. For example, the
 * following are equivalent ways of expressing the same thing:
 * <blockquote>
 * <pre>
 * a < b < c
 * a < b & b < c
 * a < c & a < b
 * </pre>
 * </blockquote>
 * Notice that the order is important, as the subsequent item goes immediately
 * after the text-argument. The following are not equivalent:
 * <blockquote>
 * <pre>
 * a < b & a < c
 * a < c & a < b
 * </pre>
 * </blockquote>
 * Either the text-argument must already be present in the sequence, or some
 * initial substring of the text-argument must be present. (e.g. "a < b & ae <
 * e" is valid since "a" is present in the sequence before "ae" is reset). In
 * this latter case, "ae" is not entered and treated as a single character;
 * instead, "e" is sorted as if it were expanded to two characters: "a"
 * followed by an "e". This difference appears in natural languages: in
 * traditional Spanish "ch" is treated as though it contracts to a single
 * character (expressed as "c < ch < d"), while in traditional German
 * a-umlaut is treated as though it expanded to two characters
 * (expressed as "a,A < b,B ... & ae; \u00e4 & AE; \u00c4").
 * [\u00e4 and \u00c4 are, of course, the escape sequences for a-umlaut.]
 * <p>
 * <strong>Ignorable Characters</strong>
 * <p>
 * For ignorable characters, the first rule must start with a relation (the
 * examples we have used above are really fragments; "a < b" really should be
 * "< a < b"). If, however, the first relation is not "<", then all the
 * text-arguments up to the first "<" are ignorable. For example, ", - < a < b"
 * makes "-" an ignorable character, as we saw earlier in the word
 * "black-birds". In the samples for different languages, you see that most
 * accents are ignorable.
 *
 * <p><strong>Normalization and Accents</strong>
 * <p>
 * <code>RuleBasedCollator</code> automatically processes its rule table to
 * include both pre-composed and combining-character versions of
 * accented characters.
 * Even if the provided rule string contains only
 * base characters and separate combining accent characters, the pre-composed
 * accented characters matching all canonical combinations of characters from
 * the rule string will be entered in the table.
 * <p>
 * This allows you to use a RuleBasedCollator to compare accented strings
 * even when the collator is set to NO_DECOMPOSITION. There are two caveats,
 * however. First, if the strings to be collated contain combining
 * sequences that may not be in canonical order, you should set the collator to
 * CANONICAL_DECOMPOSITION or FULL_DECOMPOSITION to enable sorting of
 * combining sequences. Second, if the strings contain characters with
 * compatibility decompositions (such as full-width and half-width forms),
 * you must use FULL_DECOMPOSITION, since the rule tables only include
 * canonical mappings.
 *
 * <p><strong>Errors</strong>
 * <p>
 * The following are errors:
 * <UL Type=disc>
 *     <LI>A text-argument contains unquoted punctuation symbols
 *         (e.g. "a < b-c < d").
 *     <LI>A relation or reset character not followed by a text-argument
 *         (e.g. "a < ,b").
 *     <LI>A reset where the text-argument (or an initial substring of the
 *         text-argument) is not already in the sequence.
 *         (e.g. "a < b & e < f")
 * </UL>
 * If you produce one of these errors, a <code>RuleBasedCollator</code> throws
 * a <code>ParseException</code>.
 *
 * <p><strong>Examples</strong>
 * <p>Simple:     "< a < b < c < d"
 * <p>Norwegian:  "< a,A< b,B< c,C< d,D< e,E< f,F< g,G< h,H< i,I< j,J
 *                 < k,K< l,L< m,M< n,N< o,O< p,P< q,Q< r,R< s,S< t,T
 *                 < u,U< v,V< w,W< x,X< y,Y< z,Z
 *                 < \u00E5=a\u030A,\u00C5=A\u030A
 *                 ;aa,AA< \u00E6,\u00C6< \u00F8,\u00D8"
 *
 * <p>
 * Normally, to create a rule-based Collator object, you will use
 * <code>Collator</code>'s factory method <code>getInstance</code>.
 * However, to create a rule-based Collator object with specialized
 * rules tailored to your needs, you construct the <code>RuleBasedCollator</code>
 * with the rules contained in a <code>String</code> object. For example:
 * <blockquote>
 * <pre>
 * String Simple = "< a< b< c< d";
 * RuleBasedCollator mySimple = new RuleBasedCollator(Simple);
 * </pre>
 * </blockquote>
 * Or:
 * <blockquote>
 * <pre>
 * String Norwegian = "< a,A< b,B< c,C< d,D< e,E< f,F< g,G< h,H< i,I< j,J" +
 *                 "< k,K< l,L< m,M< n,N< o,O< p,P< q,Q< r,R< s,S< t,T" +
 *                 "< u,U< v,V< w,W< x,X< y,Y< z,Z" +
 *                 "< \u00E5=a\u030A,\u00C5=A\u030A" +
 *                 ";aa,AA< \u00E6,\u00C6< \u00F8,\u00D8";
 * RuleBasedCollator myNorwegian = new RuleBasedCollator(Norwegian);
 * </pre>
 * </blockquote>
 *
 * <p>
 * Combining <code>Collator</code>s is as simple as concatenating strings.
 * Here's an example that combines two <code>Collator</code>s from two
 * different locales:
 * <blockquote>
 * <pre>
 * // Create an en_US Collator object
 * RuleBasedCollator en_USCollator = (RuleBasedCollator)
 *     Collator.getInstance(new Locale("en", "US", ""));
 * // Create a da_DK Collator object
 * RuleBasedCollator da_DKCollator = (RuleBasedCollator)
 *     Collator.getInstance(new Locale("da", "DK", ""));
 * // Combine the two
 * // First, get the collation rules from en_USCollator
 * String en_USRules = en_USCollator.getRules();
 * // Second, get the collation rules from da_DKCollator
 * String da_DKRules = da_DKCollator.getRules();
 * RuleBasedCollator newCollator =
 *     new RuleBasedCollator(en_USRules + da_DKRules);
 * // newCollator has the combined rules
 * </pre>
 * </blockquote>
 *
 * <p>
 * Another more interesting example would be to make changes on an existing
 * table to create a new <code>Collator</code> object.
 * For example, add
 * "&C< ch, cH, Ch, CH" to the <code>en_USCollator</code> object to create
 * your own:
 * <blockquote>
 * <pre>
 * // Create a new Collator object with additional rules
 * String addRules = "&C< ch, cH, Ch, CH";
 * RuleBasedCollator myCollator =
 *     new RuleBasedCollator(en_USCollator.getRules() + addRules);
 * // myCollator contains the new rules
 * </pre>
 * </blockquote>
 *
 * <p>
 * The following example demonstrates how to change the order of
 * non-spacing accents:
 * <blockquote>
 * <pre>
 * // old rule
 * String oldRules = "=\u0301;\u0300;\u0302;\u0308"    // main accents
 *                 + ";\u0327;\u0303;\u0304;\u0305"    // main accents
 *                 + ";\u0306;\u0307;\u0309;\u030A"    // main accents
 *                 + ";\u030B;\u030C;\u030D;\u030E"    // main accents
 *                 + ";\u030F;\u0310;\u0311;\u0312"    // main accents
 *                 + "< a , A ; ae, AE ; \u00e6 , \u00c6"
 *                 + "< b , B < c, C < e, E & C < d, D";
 * // change the order of accent characters
 * String addOn = "& \u0300 ; \u0308 ; \u0302";
 * RuleBasedCollator myCollator = new RuleBasedCollator(oldRules + addOn);
 * </pre>
 * </blockquote>
 *
 * <p>
 * The last example shows how to put new primary ordering in before the
 * default setting. For example, in Japanese <code>Collator</code>, you
 * can either sort English characters before or after Japanese characters:
 * <blockquote>
 * <pre>