/*
 * @(#)WordBreakData.java 1.19 03/01/23
 *
 * Copyright 2003 Sun Microsystems, Inc. All rights reserved.
 * SUN PROPRIETARY/CONFIDENTIAL. Use is subject to license terms.
 */

/*
 * (C) Copyright Taligent, Inc. 1996, 1997 - All Rights Reserved
 * (C) Copyright IBM Corp. 1996 - 1998 - All Rights Reserved
 *
 * The original version of this source code and documentation
 * is copyrighted and owned by Taligent, Inc., a wholly-owned
 * subsidiary of IBM. These materials are provided under terms
 * of a License Agreement between Taligent and Sun. This technology
 * is protected by multiple US and International patents.
 *
 * This notice and attribution to Taligent may not be removed.
 * Taligent is a registered trademark of Taligent, Inc.
 *
 */

package java.text;

/**
 * The WordBreakData contains data used by SimpleTextBoundary
 * to determine word breaks.
 * @see BreakIterator
 */
final class WordBreakData extends TextBoundaryData
{
    // THEORY OF OPERATION: This class contains all the tables necessary to do
    // word-break iteration. This class descends from TextBoundaryData, which
    // is abstract. This class doesn't define any non-static members; it inherits the
    // non-static members from TextBoundaryData and fills them in with pointers to
    // the static members defined here.
    //
    // There are two main parts to a TextBoundaryData object: the state-transition
    // tables and the character-mapping tables. The forward state table defines the
    // transitions for a deterministic finite state machine that locates word
    // boundaries. The rows are the states and the columns are character categories.
    // The cell values consist of two parts: The first is the row number of the next
    // state to transition to, or a "stop" value (0). (Because 0 is the stop value
    // rather than a valid state number, row 0 of the array isn't ever looked at; we
    // fill it with STOP values by convention.) The second part is a flag indicating
    // whether the iterator should update its break position on this transition.
    // When the flag is set, the sign bit of the value is turned on. (SI is used to
    // represent the flag bit being turned on. We do it this way rather than just using
    // negative numbers because we still need to see the SI flag when the value of the
    // transition is STOP; SI_STOP is used to denote this.) The starting state in all
    // state tables is 1.
    //
    // The backward state table works the same way as the forward state table, but is
    // usually simplified. The iterator uses the backward state table only to find a
    // "safe place" to start iterating forward. It then seeks forward from the "safe
    // place" to the actual break position using the forward table. A "safe place" is
    // a spot in the text that is guaranteed to be a break position.
    //
    // The character-category mapping tables are split into several pieces, one for
    // each stage of the category-mapping process:
    // 1) kRawMapping maps generic Unicode character categories to the character
    //    categories used by this break iterator. The index of the array is the Unicode
    //    category number as returned by Character.getType().
    // 2) The kExceptionFlags table is a table of Boolean values indicating whether
    //    the Unicode category contains exceptions, that is, characters that do not
    //    have the raw-mapping value. The rows correspond to the rows of the
    //    raw-mapping table. If an entry is true, then we find the right category
    //    using the kExceptionChar table.
    // 3) The kExceptionChar table is a sorted list of SpecialMapping objects. Each
    //    entry defines a range of contiguous characters that share the same category,
    //    together with that category number. This list is binary-searched to find an
    //    entry corresponding to the character being mapped. Only characters whose
    //    breaking category is different from the raw-mapping value (the breaking
    //    category for their Unicode category) are listed in this table.
    // 4) The kAsciiValues table is a fast-path table for characters in the Latin1 range.
    //    This table maps straight from a character value to a category number,
    //    bypassing all the other tables.
    //
    // The programmer must take care that all of the different category-mapping tables
    // are consistent. In the current implementation, all of these tables are created
    // and maintained by hand, not using a tool.

    private static final byte BREAK = 0;     // characters not listed in any other category
    private static final byte letter = 1;    // letters
    private static final byte number = 2;    // digits
    private static final byte midLetter = 3; // punctuation that can occur within a word
    private static final byte midLetNum = 4; // punctuation that can occur inside a word or a number
    private static final byte preNum = 5;    // characters that may serve as a prefix to a number
    private static final byte postNum = 6;   // characters that may serve as a suffix to a number
    private static final byte midNum = 7;    // punctuation that can occur inside a number
    private static final byte preMidNum = 8; // punctuation that can occur either at the beginning
                                             // of or inside a number
    private static final byte blank = 9;     // white space (other than always-break characters)
    private static final byte lf = 10;       // the ASCII LF character
    private static final byte kata = 11;     // Katakana
    private static final byte hira = 12;     // Hiragana
    private static final byte kanji = 13;    // all CJK ideographs
    private static final byte diacrit = 14;  // CJK diacriticals
    private static final byte cr = 15;       // the ASCII CR character
    private static final byte nsm = 16;      // Unicode non-spacing marks
    private static final byte EOS = 17;      // end of string
    private static final int COL_COUNT = 18; // number of categories

    private static final byte SI = (byte)0x80;
    private static final byte STOP = (byte) 0;
    private static final byte SI_STOP = (byte)(SI + STOP);

    public WordBreakData() {
        super(kWordForward, kWordBackward, kWordMap);
    }

    // This table locates word boundaries, as this is defined for "find whole words"
    // searches and often for
    // double-click selection. In this case, "words" are kept separate from whitespace
    // and punctuation.
    //
    // The rules implemented here are as follows:
    // 1) Unless mentioned below, all characters are treated as "words" unto themselves
    //    and have break positions on both sides. (state 14)
    // 2) A "word" is kept together, and consists of a sequence of letters. Certain
    //    punctuation marks, such as apostrophes and hyphens, are allowed inside a "word"
    //    without causing a break, but only if they're flanked on both sides by letters.
    //    (states 2 and 7)
    // 3) A "number" is kept together, and consists of an optional prefix character (such
    //    as a minus, decimal point, or currency symbol), followed by a sequence of digits,
    //    followed by an optional suffix character (such as a percent sign). The sequence
    //    of digits may contain certain punctuation characters (such as commas and periods),
    //    but only if they're flanked on both sides by digits. (states 3, 5, 8, and 14)
    // 4) If a "number" and "word" occur in succession without any intervening characters,
    //    they are kept together. This allows sequences like "$30F3" or "ascii2ebcdic" to
    //    be treated as single units. (transitions between states 2 and 3)
    // 5) Sequences of whitespace are kept together. (state 6)
    // 6) The CR-LF sequence is kept together. (states 4 and 13)
    // 7) A sequence of Kanji is kept together. (state 12)
    // 8) Sequences of Hiragana and Katakana are kept together, and may include their
    //    common diacritical marks. (states 10 and 11)
    //    [The logic for Kanji and Kana characters is an approximation. There is no way
    //    to detect real Japanese word boundaries without a dictionary.]
    // 9) Unicode non-spacing marks are completely transparent to the algorithm.
    //    (see the "nsm" column)

    private static final byte kWordForwardData[] =
    {
        // brk          let             num             mLe             mLN
        // prN          poN             mNu             pMN             blk
        // lf           kat             hir             kan             dia
        // cr           nsm             EOS

        // 0 - dummy state
        STOP,           STOP,           STOP,           STOP,           STOP,
        STOP,           STOP,           STOP,           STOP,           STOP,
        STOP,           STOP,           STOP,           STOP,           STOP,
        STOP,           STOP,           STOP,

        // 1 - main dispatch state
        (byte)(SI+14),  (byte)(SI+2),   (byte)(SI+3),   (byte)(SI+14),  (byte)(SI+14),
        (byte)(SI+5),   (byte)(SI+14),  (byte)(SI+14),  (byte)(SI+5),   (byte)(SI+6),
        (byte)(SI+4),   (byte)(SI+10),  (byte)(SI+11),  (byte)(SI+12),  (byte)(SI+9),
        (byte)(SI+13),  (byte)(1),      SI_STOP,

        // 2 - This state eats letters, advances to state 3 for numbers, and
        // goes to state 7 for mid-word punctuation.
        SI_STOP,        (byte)(SI+2),   (byte)(SI+3),   (byte)(SI+7),   (byte)(SI+7),
        SI_STOP,        SI_STOP,        SI_STOP,        (byte)(SI+7),   SI_STOP,
        SI_STOP,        SI_STOP,        SI_STOP,        SI_STOP,        SI_STOP,
        SI_STOP,        (byte)(2),      SI_STOP,

        // 3 - This state eats digits, advances to state 2 for letters, uses
        // state 8 to handle mid-number punctuation, and goes to state 14 for
        // number-suffix characters.
        SI_STOP,        (byte)(SI+2),   (byte)(SI+3),   SI_STOP,        (byte)(SI+8),
        SI_STOP,        (byte)(SI+14),  (byte)(SI+8),   (byte)(SI+8),   SI_STOP,
        SI_STOP,        SI_STOP,        SI_STOP,        SI_STOP,        SI_STOP,
        SI_STOP,        (byte)(3),      SI_STOP,

        // 4 - This state handles LFs by eating the LF and stopping.
        SI_STOP,        SI_STOP,        SI_STOP,        SI_STOP,        SI_STOP,
        SI_STOP,        SI_STOP,        SI_STOP,        SI_STOP,        SI_STOP,
        SI_STOP,        SI_STOP,        SI_STOP,        SI_STOP,        SI_STOP,
        SI_STOP,        SI_STOP,        SI_STOP,

        // 5 - This state handles number-prefix characters. If the next character
        // is a digit, it goes to state 3; otherwise, it stops (the character is
        // a "word" by itself).
        SI_STOP,        SI_STOP,        (byte)(SI+3),   SI_STOP,        SI_STOP,
        SI_STOP,        SI_STOP,        SI_STOP,        SI_STOP,        SI_STOP,
        SI_STOP,        SI_STOP,        SI_STOP,        SI_STOP,        SI_STOP,
        SI_STOP,        (byte)(5),      SI_STOP,

        // 6 - This state eats whitespace and stops on everything else.
        // (Except for CRs and LFs, which are kept together with the whitespace.)
        SI_STOP,        SI_STOP,        SI_STOP,        SI_STOP,        SI_STOP,
        SI_STOP,        SI_STOP,        SI_STOP,        SI_STOP,        (byte)(SI+6),
        (byte)(SI+4),   SI_STOP,        SI_STOP,        SI_STOP,        SI_STOP,
        (byte)(SI+13),  (byte)(6),      SI_STOP,

        // 7 - This state handles mid-word punctuation: If the next character is a
        // letter, we're still in the word and we keep going. Otherwise, we stop,
        // and the break position is actually before this character.
        STOP,           (byte)(SI+2),   STOP,           STOP,           STOP,
        STOP,           STOP,           STOP,           STOP,           STOP,
        STOP,           STOP,           STOP,           STOP,           STOP,
        STOP,           (byte)(7),      STOP,

        // 8 - This state handles mid-number punctuation: If the next character is a
        // digit, we're still in the number and we keep going. Otherwise, we stop,
        // and the break position is actually before this character.
        STOP,           STOP,           (byte)(SI+3),   STOP,           STOP,
        STOP,           STOP,           STOP,           STOP,           STOP,
        STOP,           STOP,           STOP,           STOP,           STOP,
        STOP,           (byte)(8),      STOP,

        // 9 - This state handles CJK diacritics. It'll keep going if the next
        // character is CJK; otherwise, it stops.
        SI_STOP,        SI_STOP,        SI_STOP,        SI_STOP,        SI_STOP,
        SI_STOP,        SI_STOP,        SI_STOP,        SI_STOP,        SI_STOP,
        SI_STOP,        (byte)(SI+10),  (byte)(SI+11),  SI_STOP,        (byte)(SI+9),
        SI_STOP,        (byte)(9),      SI_STOP,

        // 10 - This state eats Katakana and CJK diacritics.
        SI_STOP,        SI_STOP,        SI_STOP,        SI_STOP,        SI_STOP,
        SI_STOP,        SI_STOP,        SI_STOP,        SI_STOP,        SI_STOP,
        SI_STOP,        (byte)(SI+10),  SI_STOP,        SI_STOP,        (byte)(SI+10),
        SI_STOP,        (byte)(10),     SI_STOP,

        // 11 - This state eats Hiragana and CJK diacritics.
        SI_STOP,        SI_STOP,        SI_STOP,        SI_STOP,        SI_STOP,
        SI_STOP,        SI_STOP,        SI_STOP,        SI_STOP,        SI_STOP,
        SI_STOP,        SI_STOP,        (byte)(SI+11),  SI_STOP,        (byte)(SI+11),
        SI_STOP,        (byte)(11),     SI_STOP,

        // 12 - This state eats Kanji.
        SI_STOP,        SI_STOP,        SI_STOP,        SI_STOP,        SI_STOP,
        SI_STOP,        SI_STOP,        SI_STOP,        SI_STOP,        SI_STOP,
        SI_STOP,        SI_STOP,        SI_STOP,        (byte)(SI+12),  SI_STOP,
        SI_STOP,        (byte)(12),     SI_STOP,

        // 13 - This state handles CRs, which are "words" unto themselves (or
        // with preceding whitespace) unless followed by an LF.
        SI_STOP,        SI_STOP,        SI_STOP,        SI_STOP,        SI_STOP,
        SI_STOP,        SI_STOP,        SI_STOP,        SI_STOP,        SI_STOP,
        (byte)(SI+4),   SI_STOP,        SI_STOP,        SI_STOP,        SI_STOP,
        SI_STOP,        SI_STOP,        SI_STOP,

        // 14 - This state handles characters that are "words" unto themselves and
        // number-suffix characters (when they actually end a number) by eating the
        // character and stopping.
        SI_STOP,        SI_STOP,        SI_STOP,        SI_STOP,        SI_STOP,
        SI_STOP,        SI_STOP,        SI_STOP,        SI_STOP,        SI_STOP,
        SI_STOP,        SI_STOP,        SI_STOP,        SI_STOP,        SI_STOP,
        SI_STOP,        (byte)(14),     SI_STOP
    };

    private static final WordBreakTable kWordForward
        = new WordBreakTable(COL_COUNT, kWordForwardData);

    // This table is a completely-reversed version of the forward table.
    private static final byte kWordBackwardData[] =
    {
        // brk          let             num             mLe             mLN
        // prN          poN             mNu             pMN             blk
        // lf           kat             hir             kan             dia
        // cr           nsm             EOS

        // 0
        STOP,           STOP,           STOP,           STOP,           STOP,
        STOP,           STOP,           STOP,           STOP,           STOP,
        STOP,           STOP,           STOP,           STOP,           STOP,
        STOP,           STOP,           STOP,
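/*
Illustrative sketch, not part of the original source. The cell encoding described
in the theory-of-operation comment (next state in the low bits, SI in the sign bit,
starting state 1) can be exercised with a much smaller table. The class
StateTableSketch, its four categories, its five-state table, and the category()
helper are all invented for this example. The loop is an assumption about how an
iterator might consume such a table, inferred from the comments in this file; it is
not the actual SimpleTextBoundary code.

```java
// Hypothetical, self-contained sketch; none of these names exist in java.text.
final class StateTableSketch {
    // Same cell encoding as the tables in this file: the low 7 bits hold the next
    // state (0 means stop) and the sign bit (SI) means "update the break position".
    static final byte SI = (byte) 0x80;
    static final byte STOP = (byte) 0;
    static final byte SI_STOP = (byte) (SI + STOP);

    // A reduced category set, invented for this sketch.
    static final int OTHER = 0, LETTER = 1, MID_LETTER = 2, EOS = 3;
    static final int COL_COUNT = 4;

    static final byte[] FORWARD = {
        // other          letter            midLetter         EOS
        // 0 - dummy state, never looked at
        STOP,             STOP,             STOP,             STOP,
        // 1 - main dispatch state
        (byte) (SI + 4),  (byte) (SI + 2),  (byte) (SI + 4),  SI_STOP,
        // 2 - eats letters; goes to state 3 on mid-word punctuation
        SI_STOP,          (byte) (SI + 2),  (byte) (SI + 3),  SI_STOP,
        // 3 - mid-word punctuation: keep going only if a letter follows
        STOP,             (byte) (SI + 2),  STOP,             STOP,
        // 4 - a single character that is a "word" unto itself
        SI_STOP,          SI_STOP,          SI_STOP,          SI_STOP
    };

    // Simplified stand-in for the category-mapping tables.
    static int category(char ch) {
        if (Character.isLetter(ch)) return LETTER;
        if (ch == '\'' || ch == '-') return MID_LETTER;
        return OTHER;
    }

    // Returns the first break position after start, or -1 if start is out of range.
    static int nextBreak(String text, int start) {
        if (start < 0 || start >= text.length()) return -1;
        int state = 1;        // the starting state is 1
        int breakPos = start;
        for (int i = start; ; i++) {
            int col = i < text.length() ? category(text.charAt(i)) : EOS;
            byte cell = FORWARD[state * COL_COUNT + col];
            if ((cell & 0x80) != 0) breakPos = i;  // SI set: break may fall before text[i]
            state = cell & 0x7F;                   // strip the flag to get the next state
            if (state == STOP) return breakPos;    // the EOS column always stops
        }
    }

    public static void main(String[] args) {
        String text = "don't stop";
        int pos = 0;
        while (pos < text.length()) {
            int next = nextBreak(text, pos);
            System.out.println("[" + text.substring(pos, next) + "]");
            pos = next;
        }
        // prints [don't], then [ ], then [stop], one per line
    }
}
```

Because state 3 stops without SI, a trailing apostrophe is left out of the word:
nextBreak("rock' n", 0) returns 4, the position before the apostrophe.
*/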