/*
 * @(#)SentenceBreakData.java 1.23 03/01/23
 *
 * Copyright 2003 Sun Microsystems, Inc. All rights reserved.
 * SUN PROPRIETARY/CONFIDENTIAL. Use is subject to license terms.
 */

/*
 * (C) Copyright Taligent, Inc. 1996, 1997 - All Rights Reserved
 * (C) Copyright IBM Corp. 1996 - 1998 - All Rights Reserved
 *
 * The original version of this source code and documentation
 * is copyrighted and owned by Taligent, Inc., a wholly-owned
 * subsidiary of IBM. These materials are provided under terms
 * of a License Agreement between Taligent and Sun. This technology
 * is protected by multiple US and International patents.
 *
 * This notice and attribution to Taligent may not be removed.
 * Taligent is a registered trademark of Taligent, Inc.
 *
 */

package java.text;

/**
 * The SentenceBreakData contains data used by SimpleTextBoundary
 * to determine sentence breaks.
 * @see BreakIterator
 */
final class SentenceBreakData extends TextBoundaryData
{
    // THEORY OF OPERATION: This class contains all the tables necessary to do
    // sentence-break iteration. This class descends from TextBoundaryData, which
    // is abstract. This class doesn't define any non-static members; it inherits the
    // non-static members from TextBoundaryData and fills them in with pointers to
    // the static members defined here.

    // There are two main parts to a TextBoundaryData object: the state-transition
    // tables and the character-mapping tables. The forward state table defines the
    // transitions for a deterministic finite state machine that locates sentence
    // boundaries. The rows are the states and the columns are character categories.
    // The cell values consist of two parts: The first is the row number of the next
    // state to transition to, or a "stop" value (0). (Because 0 is the stop value
    // rather than a valid state number, row 0 of the array isn't ever looked at; we
    // fill it with STOP values by convention.)
    // The second part is a flag indicating
    // whether the iterator should update its break position on this transition. When
    // the flag is set, the sign bit of the value is turned on (SI is used to represent
    // the flag bit being turned on-- we do it this way rather than just using negative
    // numbers because we still need to see the SI flag when the value of the transition
    // is STOP. SI_STOP is used to denote this.) The starting state in all state tables
    // is 1.

    // The backward state table works the same way as the forward state table, but is
    // usually simplified. The iterator uses the backward state table only to find a
    // "safe place" to start iterating forward. It then seeks forward from the "safe
    // place" to the actual break position using the forward table. A "safe place" is
    // a spot in the text that is guaranteed to be a break position.

    // The character-category mapping tables are split into several pieces, one for
    // each stage of the category-mapping process:
    // 1) kRawMapping maps generic Unicode character categories to the character
    //    categories used by this break iterator. The index of the array is the
    //    Unicode category number as returned by Character.getType().
    // 2) The kExceptionFlags table is a table of Boolean values indicating whether
    //    all the characters in the Unicode category have the raw-mapping value. The
    //    rows correspond to the rows of the raw-mapping table. If an entry is true,
    //    then we find the right category using...
    // 3) The kExceptionChar table. This table is a sorted list of SpecialMapping
    //    objects. Each entry defines a range of contiguous characters that share
    //    the same category and the category number. This list is binary-searched to
    //    find an entry corresponding to the character being mapped. Only characters
    //    whose breaking category is different from the raw-mapping value (the
    //    breaking category for their Unicode category) are listed in this table.
    // 4) The kAsciiValues table is a fast-path table for characters in the Latin1
    //    range. This table maps straight from a character value to a category
    //    number, bypassing all the other tables.
    // The programmer must take care that all of the different category-mapping
    // tables are consistent.

    // In the current implementation, all of these tables are created and maintained
    // by hand, not using a tool.

    private static final byte other = 0;          // characters not otherwise mentioned
    private static final byte space = 1;          // whitespace
    private static final byte terminator = 2;     // characters that always mark the end of a
                                                  // sentence (? ! etc.)
    private static final byte ambiguosTerm = 3;   // characters that may mark the end of a
                                                  // sentence (periods)
    private static final byte openBracket = 4;    // Opening punctuation that may occur before
                                                  // the beginning of a sentence
    private static final byte closeBracket = 5;   // Closing punctuation that may occur after
                                                  // the end of a sentence
    private static final byte cjk = 6;            // Characters where the previous sentence
                                                  // does not have a space after a terminator.
                                                  // Common in Japanese, Chinese, and Korean
    private static final byte paragraphBreak = 7; // the Unicode paragraph-break character
    private static final byte lowerCase = 8;      // lower-case letters
    private static final byte upperCase = 9;      // upper-case letters
    private static final byte number = 10;        // digits
    private static final byte quote = 11;         // the ASCII quote mark, which may be
                                                  // either opening or closing punctuation
    private static final byte nsm = 12;           // Unicode non-spacing marks
    private static final byte EOS = 13;           // end of string
    private static final int COL_COUNT = 14;      // number of categories

    private static final byte SI = (byte)0x80;
    private static final byte STOP = (byte) 0;
    private static final byte SI_STOP = (byte)SI + STOP;

    public SentenceBreakData() {
        super(kSentenceForward, kSentenceBackward, kSentenceMap);
    }

    // This table implements a relatively simple heuristic for locating sentence
    // boundaries. It doesn't always work right (one common case is "Mr. Smith",
    // where it'll break between "Mr." and "Smith"), but is a pretty close
    // approximation.
    // The table implements these rules:
    // 1) Unless otherwise mentioned, don't break between characters. (state 1)
    // 2) If you see an unambiguous sentence terminator, continue seeking past more
    //    terminators (if there are any), closing punctuation (if any), whitespace
    //    (if any), and one paragraph separator (if any), in that order. The first
    //    time you see an unexpected character, that's where the break goes.
    //    (states 2 and 3)
    // 3) If you see a period followed by a Kanji character, there's a sentence break
    //    after the period. If you see a period followed by whitespace or opening
    //    punctuation, there's a break after the whitespace or before the opening
    //    punctuation unless the next character is a lower-case letter,
    //    a digit, closing punctuation, or a paragraph separator.
    //    If you see a
    //    period followed by whitespace, followed by opening punctuation, there's a
    //    break after the whitespace if the first character after the opening punctuation
    //    is a capital letter, and a break after the opening punctuation if the next
    //    character is anything other than a lower-case letter. (states 5, 6, and 7)
    // 4) There is ALWAYS a sentence break after a paragraph separator. (state 4)
    // 5) Non-spacing marks are transparent to the algorithm. (the nsm column)
    private static final byte kSentenceForwardData[] =
    {
        // other       space          terminator     ambTerm
        // open        close          CJK            PB
        // lower       upper          digit          Quote
        // nsm         EOS

        // 0 - dummy state
        STOP,          STOP,          STOP,          STOP,
        STOP,          STOP,          STOP,          STOP,
        STOP,          STOP,          STOP,          STOP,
        STOP,          STOP,

        // 1 - this is the main state, which just eats characters
        // until it sees a paragraph break or a sentence-terminating
        // character (all states loop back to here if they
        // don't see the right sequence of things that denotes the
        // end of a sentence).
        (byte)(SI+1),  (byte)(SI+1),  (byte)(SI+2),  (byte)(SI+5),
        (byte)(SI+1),  (byte)(SI+1),  (byte)(SI+1),  (byte)(SI+4),
        (byte)(SI+1),  (byte)(SI+1),  (byte)(SI+1),  (byte)(SI+1),
        (byte)(SI+1),  SI_STOP,

        // 2 - This state is triggered when we pass an unambiguous
        // sentence terminator. It eats terminating characters
        // and closing punctuation, passes whitespace and paragraph
        // separators, switches to state 5 on periods, and stops
        // on everything else.
        SI_STOP,       (byte)(SI+3),  (byte)(SI+2),  (byte)(SI+5),
        SI_STOP,       (byte)(SI+2),  SI_STOP,       (byte)(SI+4),
        SI_STOP,       SI_STOP,       SI_STOP,       (byte)(SI+2),
        (byte)(SI+2),  SI_STOP,

        // 3 - This state eats trailing whitespace after a sentence.
        // It passes paragraph separators, but stops on anything else.
        SI_STOP,       (byte)(SI+3),  SI_STOP,       SI_STOP,
        SI_STOP,       SI_STOP,       SI_STOP,       (byte)(SI+4),
        SI_STOP,       SI_STOP,       SI_STOP,       SI_STOP,
        (byte)(SI+3),  SI_STOP,

        // 4 - This state handles paragraph separators by eating them
        // and then stopping.
        SI_STOP,       SI_STOP,       SI_STOP,       SI_STOP,
        SI_STOP,       SI_STOP,       SI_STOP,       SI_STOP,
        SI_STOP,       SI_STOP,       SI_STOP,       SI_STOP,
        SI_STOP,       SI_STOP,

        // 5 - This state handles periods and other ambiguous sentence
        // terminators. It'll go back to state 2 on an unambiguous
        // terminator. It'll eat trailing punctuation and additional
        // periods. It stops on Kanji (a sentence in Kanji doesn't
        // have to be followed by whitespace), advances to state 6
        // on whitespace, and loops back to the starting state
        // on anything else (i.e., this wasn't actually the end
        // of a sentence).
        (byte)(SI+1),  (byte)(SI+6),  (byte)(SI+2),  (byte)(SI+5),
        (byte)(SI+7),  (byte)(SI+5),  SI_STOP,       (byte)(SI+4),
        (byte)(SI+1),  (byte)(SI+1),  (byte)(SI+1),  (byte)(SI+5),
        (byte)(SI+5),  SI_STOP,

        // 6 - This state handles whitespace after a period. It eats
        // any additional whitespace and passes paragraph breaks.
        // It'll loop back on lower-case letters and digits (not the
        // end of a sentence) and stop (yes the end of a sentence)
        // on most other characters. Opening punctuation requires
        // more lookahead and transitions to state 7.
        SI_STOP,       (byte)(SI+6),  SI_STOP,       SI_STOP,
        (byte)(SI+7),  (byte)(SI+1),  SI_STOP,       (byte)(SI+4),
        (byte)(SI+1),  SI_STOP,       (byte)(SI+1),  SI_STOP,
        (byte)(SI+6),  SI_STOP,

        // 7 - This state handles opening punctuation after whitespace
        // after a period. It stops unless the next character is a
        // lower-case letter (it rewinds back to before the sequence
        // of opening punctuation and THEN stops if the character is an
        // upper-case letter). It loops (without advancing the break
        // position) while eating additional opening punctuation.
        SI_STOP,       SI_STOP,       SI_STOP,       SI_STOP,
        (byte)(7),     SI_STOP,       SI_STOP,       SI_STOP,
        (byte)(SI+1),  STOP,          SI_STOP,       SI_STOP,
        (byte)(SI+7),  SI_STOP
    };

    private static final WordBreakTable kSentenceForward
        = new WordBreakTable(COL_COUNT, kSentenceForwardData);

    // This table locates a safe place for backward or random-access iterator
    // to turn around and seek forward.
    // 1) There is never a safe place to turn around before a non-spacing
    //    mark. (state 1)
    // 2) There is always a sentence break after a paragraph separator.
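
The cell encoding and the forward rules described in the comments can be exercised in isolation. The following is a minimal sketch, not the JDK's iterator: the class name `SentenceTableSketch`, the methods `nextBreak` and `classify`, and the single-letter row shorthands are all hypothetical. The table values are copied from `kSentenceForwardData` above (widened to `int`); `classify` is a crude stand-in for the real category-mapping tables and never returns the CJK category. It also assumes that a flagged transition records the index of the character that triggered it as the break position, which is consistent with the state comments but is not confirmed by this listing.

```java
class SentenceTableSketch {
    // Category numbers, same order as the columns of kSentenceForwardData.
    static final int OTHER = 0, SPACE = 1, TERMINATOR = 2, AMB_TERM = 3,
            OPEN = 4, CLOSE = 5, CJK = 6, PB = 7, LOWER = 8, UPPER = 9,
            DIGIT = 10, QUOTE = 11, NSM = 12, EOS = 13;
    static final int COL_COUNT = 14;
    static final int SI = 0x80;   // flag bit: update the break position
    static final int STOP = 0;    // next-state value that halts the machine

    // Shorthands (hypothetical): X = SI_STOP, A..G = SI+1..SI+7.
    static final int X = SI, A = SI + 1, B = SI + 2, C = SI + 3,
            D = SI + 4, E = SI + 5, F = SI + 6, G = SI + 7;

    // Values copied from the forward table in the listing, one row per state.
    static final int[] FORWARD = {
        STOP, STOP, STOP, STOP, STOP, STOP, STOP, STOP, STOP, STOP, STOP, STOP, STOP, STOP, // 0
        A, A, B, E, A, A, A, D, A, A, A, A, A, X,                                           // 1
        X, C, B, E, X, B, X, D, X, X, X, B, B, X,                                           // 2
        X, C, X, X, X, X, X, D, X, X, X, X, C, X,                                           // 3
        X, X, X, X, X, X, X, X, X, X, X, X, X, X,                                           // 4
        A, F, B, E, G, E, X, D, A, A, A, E, E, X,                                           // 5
        X, F, X, X, G, A, X, D, A, X, A, X, F, X,                                           // 6
        X, X, X, X, 7, X, X, X, A, STOP, X, X, G, X                                         // 7
    };

    // Simplified classifier (an assumption, not the kSentenceMap tables).
    static int classify(char c) {
        if (c == '\u2029') return PB;
        if (c == '.') return AMB_TERM;
        if (c == '?' || c == '!') return TERMINATOR;
        if (c == '"' || c == '\'') return QUOTE;
        if (c == '(' || c == '[' || c == '{') return OPEN;
        if (c == ')' || c == ']' || c == '}') return CLOSE;
        if (Character.isWhitespace(c)) return SPACE;
        if (Character.isLowerCase(c)) return LOWER;
        if (Character.isUpperCase(c)) return UPPER;
        if (Character.isDigit(c)) return DIGIT;
        if (Character.getType(c) == Character.NON_SPACING_MARK) return NSM;
        return OTHER;
    }

    // Runs the forward machine from 'start' and returns the next break offset.
    static int nextBreak(String text, int start) {
        int state = 1;            // the starting state in all state tables
        int breakPos = start;
        for (int i = start; i < text.length(); i++) {
            int cell = FORWARD[state * COL_COUNT + classify(text.charAt(i))];
            if ((cell & SI) != 0) {
                breakPos = i;     // flagged transition: break goes before this char
            }
            state = cell & 0x7F;  // low bits hold the next state
            if (state == STOP) {
                return breakPos;
            }
        }
        // The EOS column is SI_STOP in every row, so end of text is a break.
        return text.length();
    }

    public static void main(String[] args) {
        System.out.println(nextBreak("Hi! Bye.", 0));          // 4
        System.out.println(nextBreak("Mr. Smith", 0));         // 4
        System.out.println(nextBreak("He left. (Then) x", 0)); // 9
    }
}
```

The second call reproduces the "Mr. Smith" limitation noted in the comments, and the third shows state 7 leaving the break before the opening punctuation when an upper-case letter follows, because the `(byte)(7)` and `STOP` cells carry no SI flag.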