📄 sentencebreakdata.java

/*
 * @(#)SentenceBreakData.java 1.23 03/01/23
 *
 * Copyright 2003 Sun Microsystems, Inc. All rights reserved.
 * SUN PROPRIETARY/CONFIDENTIAL. Use is subject to license terms.
 */

/*
 * (C) Copyright Taligent, Inc. 1996, 1997 - All Rights Reserved
 * (C) Copyright IBM Corp. 1996 - 1998 - All Rights Reserved
 *
 * The original version of this source code and documentation
 * is copyrighted and owned by Taligent, Inc., a wholly-owned
 * subsidiary of IBM. These materials are provided under terms
 * of a License Agreement between Taligent and Sun. This technology
 * is protected by multiple US and International patents.
 *
 * This notice and attribution to Taligent may not be removed.
 * Taligent is a registered trademark of Taligent, Inc.
 *
 */

package java.text;

/**
 * The SentenceBreakData contains data used by SimpleTextBoundary
 * to determine sentence breaks.
 * @see BreakIterator
 */
final class SentenceBreakData extends TextBoundaryData
{
    // THEORY OF OPERATION:  This class contains all the tables necessary to do
    // sentence-break iteration.  This class descends from TextBoundaryData, which
    // is abstract.  This class doesn't define any non-static members; it inherits the
    // non-static members from TextBoundaryData and fills them in with pointers to
    // the static members defined here.
    //   There are two main parts to a TextBoundaryData object: the state-transition
    // tables and the character-mapping tables.  The forward state table defines the
    // transitions for a deterministic finite state machine that locates sentence
    // boundaries.  The rows are the states and the columns are character categories.
    // The cell values consist of two parts: The first is the row number of the next
    // state to transition to, or a "stop" value (0).  (Because 0 is the stop value
    // rather than a valid state number, row 0 of the array isn't ever looked at; we
    // fill it with STOP values by convention.)  The second part is a flag indicating
    // whether the iterator should update its break position on this transition.  When
    // the flag is set, the sign bit of the value is turned on (SI is used to represent
    // the flag bit being turned on-- we do it this way rather than just using negative
    // numbers because we still need to see the SI flag when the value of the transition
    // is STOP.  SI_STOP is used to denote this.)  The starting state in all state tables
    // is 1.
    //   The backward state table works the same way as the forward state table, but is
    // usually simplified.  The iterator uses the backward state table only to find a
    // "safe place" to start iterating forward.  It then seeks forward from the "safe
    // place" to the actual break position using the forward table.  A "safe place" is
    // a spot in the text that is guaranteed to be a break position.
    //   The character-category mapping tables are split into several pieces, one for
    // each stage of the category-mapping process: 1) kRawMapping maps generic Unicode
    // character categories to the character categories used by this break iterator.
    // The index of the array is the Unicode category number as returned by
    // Character.getType().  2) The kExceptionFlags table is a table of Boolean values
    // indicating whether all the characters in the Unicode category have the
    // raw-mapping value.  The rows correspond to the rows of the raw-mapping table.  If
    // an entry is true, then we find the right category using...  3) The kExceptionChar
    // table.  This table is a sorted list of SpecialMapping objects.  Each entry defines
    // a range of contiguous characters that share the same category and the category
    // number.  This list is binary-searched to find an entry corresponding to the
    // character being mapped.  Only characters whose breaking category is different from
    // the raw-mapping value (the breaking category for their Unicode category) are
    // listed in this table.  4) The kAsciiValues table is a fast-path table for characters
    // in the Latin1 range.  This table maps straight from a character value to a
    // category number, bypassing all the other tables.  The programmer must take care
    // that all of the different category-mapping tables are consistent.
    //   In the current implementation, all of these tables are created and maintained
    // by hand, not using a tool.

    private static final byte other = 0;        // characters not otherwise mentioned
    private static final byte space = 1;        // whitespace
    private static final byte terminator = 2;   // characters that always mark the end of a
                                                //  sentence (? ! etc.)
    private static final byte ambiguosTerm = 3; // characters that may mark the end of a
                                                //  sentence (periods)
    private static final byte openBracket = 4;  // Opening punctuation that may occur before
                                                //  the beginning of a sentence
    private static final byte closeBracket = 5; // Closing punctuation that may occur after
                                                //  the end of a sentence
    private static final byte cjk = 6;          // Characters where the previous sentence
                                                //  does not have a space after a terminator.
                                                //  Common in Japanese, Chinese, and Korean
    private static final byte paragraphBreak = 7;
                                                // the Unicode paragraph-break character
    private static final byte lowerCase = 8;    // lower-case letters
    private static final byte upperCase = 9;    // upper-case letters
    private static final byte number = 10;      // digits
    private static final byte quote = 11;       // the ASCII quote mark, which may be
                                                //  either opening or closing punctuation
    private static final byte nsm = 12;         // Unicode non-spacing marks
    private static final byte EOS = 13;         // end of string
    private static final int COL_COUNT = 14;    // number of categories

    private static final byte SI = (byte)0x80;
    private static final byte STOP = (byte) 0;
    private static final byte SI_STOP = (byte)SI + STOP;

    public SentenceBreakData() {
        super(kSentenceForward, kSentenceBackward, kSentenceMap);
    }

    // This table implements a relatively simple heuristic for locating sentence
    // boundaries.  It doesn't always work right (one common case is "Mr. Smith",
    // where it'll break between "Mr." and "Smith"), but is a pretty close
    // approximation.
    // The table implements these rules:
    // 1) Unless otherwise mentioned, don't break between characters. (state 1)
    // 2) If you see an unambiguous sentence terminator, continue seeking past more
    //    terminators (if there are any), closing punctuation (if any), whitespace
    //    (if any), and one paragraph separator (if any), in that order.  The first
    //    time you see an unexpected character, that's where the break goes.
    //    (states 2 and 3)
    // 3) If you see a period followed by a Kanji character, there's a sentence break
    //    after the period.  If you see a period followed by whitespace or opening
    //    punctuation, there's a break after the whitespace or before the opening
    //    punctuation unless the next character is a lower-case letter,
    //    a digit, closing punctuation, or a paragraph separator.  If you see a
    //    period followed by whitespace, followed by opening punctuation, there's a
    //    break after the whitespace if the first character after the opening punctuation
    //    is a capital letter, and a break after the opening punctuation if the next
    //    character is anything other than a lower-case letter.  (states 5, 6, and 7)
    // 4) There is ALWAYS a sentence break after a paragraph separator. (state 4)
    // 5) Non-spacing marks are transparent to the algorithm.  (the nsm column)
    private static final byte kSentenceForwardData[] =
    {
        // other       space          terminator     ambTerm
        // open        close          CJK            PB
        // lower       upper          digit          Quote
        // nsm         EOS

        // 0 - dummy state
        STOP,          STOP,          STOP,          STOP,
        STOP,          STOP,          STOP,          STOP,
        STOP,          STOP,          STOP,          STOP,
        STOP,          STOP,

        // 1 - this is the main state, which just eats characters
        // until it sees a paragraph break or a sentence-terminating
        // character  (all states loop back to here if they
        // don't see the right sequence of things that denotes the
        // end of a sentence).
        (byte)(SI+1),  (byte)(SI+1),  (byte)(SI+2),  (byte)(SI+5),
        (byte)(SI+1),  (byte)(SI+1),  (byte)(SI+1),  (byte)(SI+4),
        (byte)(SI+1),  (byte)(SI+1),  (byte)(SI+1),  (byte)(SI+1),
        (byte)(SI+1),  SI_STOP,

        // 2 - This state is triggered when we pass an unambiguous
        // sentence terminator.  It eats terminating characters
        // and closing punctuation, passes whitespace and paragraph
        // separators, switches to state 5 on periods, and stops
        // on everything else.
        SI_STOP,       (byte)(SI+3),  (byte)(SI+2),  (byte)(SI+5),
        SI_STOP,       (byte)(SI+2),  SI_STOP,       (byte)(SI+4),
        SI_STOP,       SI_STOP,       SI_STOP,       (byte)(SI+2),
        (byte)(SI+2),  SI_STOP,

        // 3 - This state eats trailing whitespace after a sentence.
        // It passes paragraph separators, but stops on anything else.
        SI_STOP,       (byte)(SI+3),  SI_STOP,       SI_STOP,
        SI_STOP,       SI_STOP,       SI_STOP,       (byte)(SI+4),
        SI_STOP,       SI_STOP,       SI_STOP,       SI_STOP,
        (byte)(SI+3),  SI_STOP,

        // 4 - This state handles paragraph separators by eating them
        // and then stopping.
        SI_STOP,       SI_STOP,       SI_STOP,       SI_STOP,
        SI_STOP,       SI_STOP,       SI_STOP,       SI_STOP,
        SI_STOP,       SI_STOP,       SI_STOP,       SI_STOP,
        SI_STOP,       SI_STOP,

        // 5 - This state handles periods and other ambiguous sentence
        // terminators.  It'll go back to state 2 on an unambiguous
        // terminator.  It'll eat trailing punctuation and additional
        // periods.  It stops on Kanji (a sentence in Kanji doesn't
        // have to be followed by whitespace), advances to state 6
        // on whitespace, and loops back to the starting state
        // on anything else (i.e., this wasn't actually the end
        // of a sentence).
        (byte)(SI+1),  (byte)(SI+6),  (byte)(SI+2),  (byte)(SI+5),
        (byte)(SI+7),  (byte)(SI+5),  SI_STOP,       (byte)(SI+4),
        (byte)(SI+1),  (byte)(SI+1),  (byte)(SI+1),  (byte)(SI+5),
        (byte)(SI+5),  SI_STOP,

        // 6 - This state handles whitespace after a period.  It eats
        // any additional whitespace and passes paragraph breaks.
        // It'll loop back on lower-case letters and digits (not the
        // end of a sentence) and stop (yes the end of a sentence)
        // on most other characters.  Opening punctuation requires
        // more lookahead and transitions to state 7.
        SI_STOP,       (byte)(SI+6),  SI_STOP,       SI_STOP,
        (byte)(SI+7),  (byte)(SI+1),  SI_STOP,       (byte)(SI+4),
        (byte)(SI+1),  SI_STOP,       (byte)(SI+1),  SI_STOP,
        (byte)(SI+6),  SI_STOP,

        // 7 - This state handles opening punctuation after whitespace
        // after a period.  It stops unless the next character is a
        // lower-case letter (it rewinds back to before the sequence
        // of opening punctuation and THEN stops if the character is an
        // upper-case letter).  It loops (without advancing the break
        // position) while eating additional opening punctuation.
        SI_STOP,       SI_STOP,       SI_STOP,       SI_STOP,
        (byte)(7),     SI_STOP,       SI_STOP,       SI_STOP,
        (byte)(SI+1),  STOP,          SI_STOP,       SI_STOP,
        (byte)(SI+7),  SI_STOP
    };

    private static final WordBreakTable kSentenceForward
        = new WordBreakTable(COL_COUNT, kSentenceForwardData);

    // This table locates a safe place for backward or random-access iterator
    // to turn around and seek forward.
    // 1) There is never a safe place to turn around before a non-spacing
    //    mark. (state 1)
    // 2) There is always a sentence break after a paragraph separator.
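
The cell encoding described in the theory-of-operation comment (low bits hold the next state, the sign bit asks the iterator to move its break position) can be illustrated with a minimal standalone sketch. Everything in it is an assumption made for illustration: the class `StateTableSketch`, its helper methods, and its toy four-category table are hypothetical and much smaller than the 14-column table above, and the loop is not taken from SimpleTextBoundary or WordBreakTable.

```java
// Hypothetical illustration only: a toy table using the same cell encoding
// (next state in the low 7 bits, "update break position" flag in the sign bit).
final class StateTableSketch {
    // Toy character categories (assumed; far fewer than the real table's 14).
    static final int OTHER = 0, SPACE = 1, TERM = 2, EOS = 3, COLS = 4;

    static final byte SI = (byte) 0x80;   // flag bit: update the break position
    static final byte SI_STOP = SI;       // flag set, next state 0 (stop)

    // Builds a cell that sets the flag and moves to the given state.
    static byte si(int state) { return (byte) (SI + state); }

    static final byte[] TABLE = {
        // other   space    term     EOS
        0,         0,       0,       0,        // 0 - dummy row, never consulted
        si(1),     si(1),   si(2),   SI_STOP,  // 1 - main state
        SI_STOP,   si(3),   si(2),   SI_STOP,  // 2 - just passed a terminator
        SI_STOP,   si(3),   SI_STOP, SI_STOP,  // 3 - trailing whitespace
    };

    // Maps the character at index i to a toy category; past the end is EOS.
    static int category(String text, int i) {
        if (i >= text.length()) return EOS;
        char c = text.charAt(i);
        if (c == '!' || c == '?') return TERM;
        if (Character.isWhitespace(c)) return SPACE;
        return OTHER;
    }

    // Walks the table from 'start' and returns the offset of the next break.
    static int nextBreak(String text, int start) {
        int state = 1;                        // state tables start in state 1
        int breakPos = start;
        for (int i = start; state != 0; i++) {
            byte cell = TABLE[state * COLS + category(text, i)];
            if ((cell & 0x80) != 0) {
                breakPos = i;                 // break goes before this character
            }
            state = cell & 0x7F;              // 0 means stop
        }
        return breakPos;
    }

    public static void main(String[] args) {
        System.out.println(nextBreak("Hi! Yo", 0));
        System.out.println(nextBreak("Hi! Yo", 4));
    }
}
```

In this sketch the break position is recorded before the character that triggered the flagged transition, which matches the description of state 7 above: a cell without the flag (such as the plain `STOP` on an upper-case letter) leaves the break at the last flagged position.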