📄 linebreakdata.java

📁 java源代码请看看啊提点宝贵的意见
💻 JAVA
📖 第 1 页 / 共 2 页
字号:
12 下一页
/* * @(#)LineBreakData.java	1.20 03/01/23 * * Copyright 2003 Sun Microsystems, Inc. All rights reserved. * SUN PROPRIETARY/CONFIDENTIAL. Use is subject to license terms. *//* * (C) Copyright Taligent, Inc. 1996, 1997 - All Rights Reserved * (C) Copyright IBM Corp. 1996 - 1998 - All Rights Reserved * * The original version of this source code and documentation * is copyrighted and owned by Taligent, Inc., a wholly-owned * subsidiary of IBM. These materials are provided under terms * of a License Agreement between Taligent and Sun. This technology * is protected by multiple US and International patents. * * This notice and attribution to Taligent may not be removed. * Taligent is a registered trademark of Taligent, Inc. * */package java.text;/** * The LineBreakData contains data used by SimpleTextBoundary * to determine line breaks. * @see #BreakIterator */final class LineBreakData extends TextBoundaryData{    // THEORY OF OPERATION:  This class contains all the tables necessary to do    // character-break iteration.  This class descends from TextBoundaryData, which    // is abstract.  This class doesn't define any non-static members; it inherits the    // non-static members from TextBoundaryData and fills them in with pointers to    // the static members defined here.    //   There are two main parts to a TextBoundaryData object: the state-transition    // tables and the character-mapping tables.  The forward state table defines the    // transitions for a deterministic finite state machine that locates character    // boundaries.  The rows are the states and the columns are character categories.    // The cell values consist of two parts: The first is the row number of the next    // state to transition to, or a "stop" value (0).  (Because 0 is the stop value    // rather than a valid state number, row 0 of the array isn't ever looked at; we    // fill it with STOP values by convention.)  The second part is a flag indicating    // whether the iterator should update its break position on this transition.  When    // the flag is set, the sign bit of the value is turned on (SI is used to represent    // the flag bit being turned on-- we do it this way rather than just using negative    // numbers because we still need to see the SI flag when the value of the transition    // is STOP.  SI_STOP is used to denote this.)  The starting state in all state tables    // is 1.    //   The backward state table works the same way as the forward state table, but is    // usually simplified.  The iterator uses the backward state table only to find a    // "safe place" to start iterating forward.  It then seeks forward from the "safe    // place" to the actual break position using the forward table.  A "safe place" is    // a spot in the text that is guaranteed to be a break position.    //   The character-category mapping tables are split into several pieces, one for    // each stage of the category-mapping process: 1) kRawMapping maps generic Unicode    // character categories to the character categories used by this break iterator.    // The index of the array is the Unicode category number as returned by    // Character.getType().  2) The kExceptionFlags table is a table of Boolean values    // indicating whether all the characters in the Unicode category have the    // raw-mapping value.  The rows correspond to the rows of the raw-mapping table.  If    // an entry is true, then we find the right category using...  3) The kExceptionChar    // table.  This table is a sorted list of SpecialMapping objects.  Each entry defines    // a range of contiguous characters that share the same category and the category    // number.  This list is binary-searched to find an entry corresponding to the     // charactre being mapped.  Only characters whose breaking category is different from    // the raw-mapping value (the breaking category for their Unicode category) are    // listed in this table.  4) The kAsciiValues table is a fast-path table for characters    // in the Latin1 range.  This table maps straight from a character value to a    // category number, bypassing all the other tables.  The programmer must take care    // that all of the different category-mapping tables are consistent.    //   In the current implementation, all of these tables are created and maintained    // by hand, not using a tool.        private static final byte BREAK                 = 0;    //always breaks (must be present as first item)    private static final byte blank                 = 1;    //spaces, tabs, nulls.    private static final byte cr                    = 2;    //carriage return    private static final byte nonBlank              = 3;    //everything not included elsewhere    private static final byte op                    = 4;    //hyphens....    private static final byte jwrd                  = 5;    //hiragana, katakana, and kanji    private static final byte preJwrd               = 6;    //characters that bind to the beginning of a Japanese word    private static final byte postJwrd              = 7;    //characters that bind to the end of a Japanese word    private static final byte digit                 = 8;    //digits    private static final byte numPunct              = 9;    //punctuation that can appear within a number    private static final byte currency              = 10;    //currency symbols that can precede a number    private static final byte quote                 = 11;    // the ASCII quotation mark    private static final byte nsm                   = 12;    // non-spacing marks    private static final byte nbsp                  = 13;    // non-breaking characters    private static final byte EOS                   = 14;    private static final int COL_COUNT = 15;    private static final byte SI = (byte)0x80;    private static final byte STOP = (byte) 0;    private static final byte SI_STOP = (byte)SI + STOP;    public LineBreakData() {        super(kLineForward, kLineBackward, kLineMap);    }    // This table locates legal line-break positions.  i.e., a process that word-wraps a line of    // text can use this version of the BreakIterator to tell it where the legal places for    // breaking a line are.    // The rules implemented here are as follows:    // 1) There is always a legal break position after a line or paragraph separator, but    //    one can occur before only when the preceding character is also a line or paragraph    //    separator.  (The CR-LF sequence is also kept together.)  (states 4 and 7)    // 2) There is never a break before a non-spacing mark, unless it's preceded by a line    //    or paragraph separator.  (the nsm column)    // 3) There is never a break on either side of a non-breaking space (or other non-breaking    //    chartacters).  (the nbsp column, and state 1)    // 4) There is always a break before and after Kanji and Kana characters, except for certain    //    punctuation that must be kept with the following character and certain punctuation    //    and diacritic marks that must be kept with the preceding character.  (states 5 and 8)    // 5) There is always a legal break position following a dash, except when it is followed    //    by a digit, a line/paragraph separator, or whitespace. (state 6)    // 6) There is never a break before a whitespace character.  There is a break after a    //    whitespace character, except when it's followed by a line/paragraph separator.    //    (state 2)    // 7) Breaks don't occur anywhere else.  (state 1)    private static final byte kLineForwardData[] =    {        // brk         bl             cr             nBl        // op          kan            prJ            poJ        // dgt         np             curr           quote        // nsm         nbsp           EOS        // 00 - dummy state        STOP,          STOP,          STOP,          STOP,        STOP,          STOP,          STOP,          STOP,        STOP,          STOP,          STOP,          STOP,        STOP,          STOP,          STOP,        // 01 - main dispatch state.  This state eats pre-Kanji punctuation,        // non-breaking spaces, and non-spacing diacritics without transitioning        // to other states.        (byte)(SI+4),  (byte)(SI+2),  (byte)(SI+7),  (byte)(SI+3),        (byte)(SI+6),  (byte)(SI+5),  (byte)(SI+1),  (byte)(SI+8),        (byte)(SI+9),  (byte)(SI+8),  (byte)(SI+1),  (byte)(SI+3),        (byte)(SI+1),  (byte)(SI+1),  SI_STOP,        // 02 - This state eats whitespce and stops on almost anything else        // (the exceptions are non-breaking spaces, which go back to 1,        // and CRs and LFs)        (byte)(SI+4),  (byte)(SI+2),  (byte)(SI+7),  SI_STOP,        SI_STOP,       SI_STOP,       SI_STOP,       SI_STOP,        SI_STOP,       SI_STOP,       SI_STOP,       SI_STOP,        (byte)(SI+2),  (byte)(SI+1),  SI_STOP,        // 03 - This state eats non-whitespace characters that aren't        // otherwise accounted for.  The only difference between        // this and state 1 is that it stops on Kanji (you can break        // between any two Kanji characters)        (byte)(SI+4),  (byte)(SI+2),  (byte)(SI+7),  (byte)(SI+3),        (byte)(SI+6),  SI_STOP,       (byte)(SI+1),  (byte)(SI+8),        (byte)(SI+9),  (byte)(SI+8),  (byte)(SI+1),  (byte)(SI+3),        (byte)(SI+3),  (byte)(SI+1),  SI_STOP,        // 04 - this is the state you go to when you see a hard line-        // breaking character.  It eats that character and stops.        SI_STOP,       SI_STOP,       SI_STOP,       SI_STOP,        SI_STOP,       SI_STOP,       SI_STOP,       SI_STOP,        SI_STOP,       SI_STOP,       SI_STOP,       SI_STOP,        SI_STOP,       SI_STOP,       SI_STOP,        // 05 - this is the state that handles Kanji.  It handles        // post-Kanji punctuation, whitespace, non-breaking spaces,        // and line terminators, but stops on everything else        // (including more Kanji)        (byte)(SI+4),  (byte)(SI+2),  (byte)(SI+7),  SI_STOP,        SI_STOP,       SI_STOP,       SI_STOP,       (byte)(SI+8),        SI_STOP,       (byte)(SI+8),  SI_STOP,       SI_STOP,        (byte)(SI+5),  (byte)(SI+1),  SI_STOP,        // 06 - This state handles dashes.  It'll continue on        // whitespace, more dashes, line terminators, and digits        // (the dash is a minus sign), but stops on everything else        // (unless there's an nbsp, a dash is always a legal        // break position).        (byte)(SI+4),  SI_STOP,       (byte)(SI+7),  SI_STOP,        SI_STOP,       SI_STOP,       SI_STOP,       SI_STOP,        (byte)(SI+9),  SI_STOP,       (byte)(SI+11), SI_STOP,        (byte)(SI+6),  (byte)(SI+1),  SI_STOP,        // 07 - This state handles CRs.  A CR is a line terminator        // when it appears alone, and considered "half" a line        // terminator when it occurs right before any other line        // terminator (except another CR).        (byte)(SI+4),  SI_STOP,       SI_STOP,       SI_STOP,        SI_STOP,       SI_STOP,       SI_STOP,       SI_STOP,        SI_STOP,       SI_STOP,       SI_STOP,       SI_STOP,        SI_STOP,       SI_STOP,       SI_STOP,        // 08 - This state eats post-Kanji punctuation, and passes        // whitespace, non-breaking characters, dashes, line terminators,        // etc.  It stops on almost everything else.        (byte)(SI+4),  (byte)(SI+2),  (byte)(SI+7),  SI_STOP,        SI_STOP,       SI_STOP,       SI_STOP,       (byte)(SI+8),        SI_STOP,       (byte)(SI+8),  SI_STOP,       (byte)(SI+3),        (byte)(SI+8),  (byte)(SI+1),  SI_STOP,        // 09 - This state is the main "number" state.  It eats        // digits.        (byte)(SI+4),  (byte)(SI+2),  (byte)(SI+7),  (byte)(SI+3),        (byte)(SI+6),  SI_STOP,       SI_STOP,       (byte)(SI+8),        (byte)(SI+9),  (byte)(SI+10), (byte)(SI+10), (byte)(SI+3),        (byte)(SI+9),  (byte)(SI+1),  SI_STOP,        // 10 - This state is the secondary "number" state.  It        // easts punctuation that can occur inside a number.        (byte)(SI+4),  (byte)(SI+2),  (byte)(SI+7),  SI_STOP,        SI_STOP,       SI_STOP,       SI_STOP,       (byte)(SI+8),        (byte)(SI+9),  (byte)(SI+8),  SI_STOP,       SI_STOP,        (byte)(SI+10), (byte)(SI+1),  SI_STOP,        // 11 - This state is here to allow a dash to go before a        // currency symbol and still be treated as a minus sign        // (if the character after the currency symbol is a digit).        STOP,          STOP,          STOP,          STOP,        STOP,          STOP,          STOP,          STOP,        (byte)(SI+9),  STOP,          STOP,          STOP,        (byte)(11),    (byte)(SI+1),  STOP    };    private static final WordBreakTable kLineForward        = new WordBreakTable(COL_COUNT, kLineForwardData);    // This table locates unambiguous break positions when iterating backward.    // It implements the following rules:    // 1) For most characters, there is a break before them if they're preceded    //    by whitespace, Kanji, or a line/paragraph separator. (CR-LF is kept together)    // 2) There is a break before a Kanji character, except when it's preceded by    //    a Kanji-prefix character.  (state 4)    // 3) There is NOT a break before a Kanji-suffix character, except when preceded    //    by whitespace, a line/paragraph separator, or a dash. (state 3)    // 4) There is never a break on either side of a non-break character.  (the nbsp column)    // 5) There is never a break before a non-spacing mark (the nsm column)    // [In this set of rules, "break" means "unambiguous break position".  There may sometimes    // be actual breaks in positions this table always skips.]    private static final byte kLineBackwardData[] =    {        // brk         bl             cr             nBl        // op          kan            prJ            poJ        // dgt         np             curr           quote        // nsm         nbsp           EOS        /*00*/        STOP,          STOP,          STOP,          STOP,        STOP,          STOP,          STOP,          STOP,        STOP,          STOP,          STOP,          STOP,        STOP,          STOP,          STOP,        /*01*/        (byte)(SI+1),  (byte)(SI+1),  (byte)(SI+1),  (byte)(SI+2),        (byte)(SI+2),  (byte)(SI+4),  (byte)(SI+2),  (byte)(SI+3),
12 下一页
💿 文件大小 245 K
👤 上传用户 liu2000dz
📂 所属分类 Java编程
🏷️ 相关标签

#java #源代码
⌨️ 快捷键说明

复制代码 Ctrl + C
搜索代码 Ctrl + F
全屏模式 F11
切换主题 Ctrl + Shift + D
显示快捷键 ?
增大字号 Ctrl + =
减小字号 Ctrl + -