#
# $Id: format.txt,v 1.2 2001/01/02 18:46:20 mleisher Exp $
#

CHARACTER DATA
==============

This package generates some data files that contain character properties useful
for text processing.

CHARACTER PROPERTIES
====================

The first data file is called "ctype.dat" and contains a compressed form of
the character properties found in the Unicode Character Database (UCDB).
Additional properties can be specified in limited UCDB format in another file
to avoid modifying the original UCDB.

The following is a property name and code table to be used with the character
data:

NAME CODE DESCRIPTION
---------------------
Mn   0    Mark, Non-Spacing
Mc   1    Mark, Spacing Combining
Me   2    Mark, Enclosing
Nd   3    Number, Decimal Digit
Nl   4    Number, Letter
No   5    Number, Other
Zs   6    Separator, Space
Zl   7    Separator, Line
Zp   8    Separator, Paragraph
Cc   9    Other, Control
Cf   10   Other, Format
Cs   11   Other, Surrogate
Co   12   Other, Private Use
Cn   13   Other, Not Assigned
Lu   14   Letter, Uppercase
Ll   15   Letter, Lowercase
Lt   16   Letter, Titlecase
Lm   17   Letter, Modifier
Lo   18   Letter, Other
Pc   19   Punctuation, Connector
Pd   20   Punctuation, Dash
Ps   21   Punctuation, Open
Pe   22   Punctuation, Close
Po   23   Punctuation, Other
Sm   24   Symbol, Math
Sc   25   Symbol, Currency
Sk   26   Symbol, Modifier
So   27   Symbol, Other
L    28   Left-To-Right
R    29   Right-To-Left
EN   30   European Number
ES   31   European Number Separator
ET   32   European Number Terminator
AN   33   Arabic Number
CS   34   Common Number Separator
B    35   Block Separator
S    36   Segment Separator
WS   37   Whitespace
ON   38   Other Neutrals
Pi   47   Punctuation, Initial
Pf   48   Punctuation, Final
#
# Implementation specific properties.
#
Cm   39   Composite
Nb   40   Non-Breaking
Sy   41   Symmetric (characters which are part of open/close pairs)
Hd   42   Hex Digit
Qm   43   Quote Mark
Mr   44   Mirroring
Ss   45   Space, Other (controls viewed as spaces in ctype isspace())
Cp   46   Defined character

The actual binary data is formatted as follows:

  Assumptions: unsigned short is at least 16-bits in size and unsigned long
               is at least 32-bits in size.

    unsigned short ByteOrderMark
    unsigned short OffsetArraySize
    unsigned long  Bytes
    unsigned short Offsets[OffsetArraySize + 1]
    unsigned long  Ranges[N], N = value of Offsets[OffsetArraySize]

  The Bytes field provides the total byte count used for the Offsets[] and
  Ranges[] arrays.  The Offsets[] array is aligned on a 4-byte boundary and
  there is always one extra node on the end to hold the final index of the
  Ranges[] array.  The Ranges[] array contains pairs of 4-byte values
  representing a range of Unicode characters.  The pairs are arranged in
  increasing order by the first character code in the range.

  Determining if a particular character is in the property list requires a
  simple binary search to determine if a character is in any of the ranges
  for the property.

  If the ByteOrderMark is equal to 0xFFFE, then the data was generated on a
  machine with a different endian order and the values must be byte-swapped.

  To swap a 16-bit value:

     c = (c >> 8) | ((c & 0xff) << 8)

  To swap a 32-bit value:

     c = ((c & 0xff) << 24) | (((c >> 8) & 0xff) << 16) |
         (((c >> 16) & 0xff) << 8) | (c >> 24)

CASE MAPPINGS
=============

The next data file is called "case.dat" and contains three case mapping tables
in the following order: upper, lower, and title case.  Each table is in
increasing order by character code and each mapping contains 3 unsigned longs
which represent the possible mappings.

The format for the binary form of these tables is:

  unsigned short ByteOrderMark
  unsigned short NumMappingNodes, count of all mapping nodes
  unsigned short CaseTableSizes[2], upper and lower mapping node counts
  unsigned long  CaseTables[NumMappingNodes]

  The starting indexes of the case tables are calculated as follows:

    UpperIndex = 0;
    LowerIndex = CaseTableSizes[0] * 3;
    TitleIndex = LowerIndex + CaseTableSizes[1] * 3;

  The order of the fields for the three tables is:

    Upper case
    ----------
    unsigned long upper;
    unsigned long lower;
    unsigned long title;

    Lower case
    ----------
    unsigned long lower;
    unsigned long upper;
    unsigned long title;

    Title case
    ----------
    unsigned long title;
    unsigned long upper;
    unsigned long lower;

  If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
  same way as described in the CHARACTER PROPERTIES section.

  Because the tables are in increasing order by character code, locating a
  mapping requires a simple binary search on one of the 3 codes that make up
  each node.

  It is important to note that there can only be 65536 mapping nodes which
  divided into 3 portions allows 21845 nodes for each case mapping table.  The
  distribution of mappings may be more or less than 21845 per table, but only
  65536 are allowed.

COMPOSITIONS
============

This data file is called "comp.dat" and contains data that tracks character
pairs that have a single Unicode value representing the combination of the two
characters.

The format for the binary form of this table is:

  unsigned short ByteOrderMark
  unsigned short NumCompositionNodes, count of composition nodes
  unsigned long  Bytes, total number of bytes used for composition nodes
  unsigned long  CompositionNodes[NumCompositionNodes * 4]

  If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
  same way as described in the CHARACTER PROPERTIES section.

  The CompositionNodes[] array consists of groups of 4 unsigned longs.  The
  first of these is the character code representing the combination of two
  other character codes, the second records the number of character codes that
  make up the composition (not currently used), and the last two are the pair
  of character codes whose combination is represented by the character code in
  the first field.

DECOMPOSITIONS
==============

The next data file is called "decomp.dat" and contains the decomposition data
for all characters with decompositions containing more than one character and
are *not* compatibility decompositions.  Compatibility decompositions are
signaled in the UCDB format by the use of the <compat> tag in the
decomposition field.  Each list of character codes represents a full
decomposition of a composite character.  The nodes are arranged in increasing
order by character code.

The format for the binary form of this table is:

  unsigned short ByteOrderMark
  unsigned short NumDecompNodes, count of all decomposition nodes
  unsigned long  Bytes
  unsigned long  DecompNodes[(NumDecompNodes * 2) + 1]
  unsigned long  Decomp[N], N = sum of all counts in DecompNodes[]

  If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
  same way as described in the CHARACTER PROPERTIES section.

  The DecompNodes[] array consists of pairs of unsigned longs, the first of
  which is the character code and the second is the initial index of the list
  of character codes representing the decomposition.

  Locating the decomposition of a composite character requires a binary search
  for a character code in the DecompNodes[] array and using its index to
  locate the start of the decomposition.  The length of the decomposition list
  is the index stored in the following element of DecompNodes[] minus the
  current index.

COMBINING CLASSES
=================

The next data file is called "cmbcl.dat" and contains the characters with
non-zero combining classes.

The format for the binary form of this table is:

  unsigned short ByteOrderMark
  unsigned short NumCCLNodes
  unsigned long  Bytes
  unsigned long  CCLNodes[NumCCLNodes * 3]

  If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
  same way as described in the CHARACTER PROPERTIES section.

  The CCLNodes[] array consists of groups of three unsigned longs.  The first
  and second are the beginning and ending of a range and the third is the
  combining class of that range.

  If a character is not found in this table, then the combining class is
  assumed to be 0.

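
  The combining class lookup described above can be sketched as follows.
  This is only an illustration, not the package's own code: the function
  name cclass_lookup is hypothetical, and it assumes the CCLNodes[] array has
  already been loaded (and byte-swapped if needed) and that its ranges are
  sorted in increasing order and do not overlap.

```c
/*
 * Hypothetical helper: return the combining class of `code`, or 0 when the
 * character is not covered by any range.  `nodes` holds `count` groups of
 * three values: range start, range end, combining class.
 * Assumes the groups are sorted by range start and do not overlap.
 */
static unsigned long cclass_lookup(const unsigned long *nodes,
                                   unsigned long count, unsigned long code)
{
    long lo = 0, hi = (long)count - 1;

    while (lo <= hi) {
        long mid = lo + (hi - lo) / 2;
        const unsigned long *node = nodes + 3 * mid;

        if (code < node[0])
            hi = mid - 1;          /* before this range */
        else if (code > node[1])
            lo = mid + 1;          /* after this range */
        else
            return node[2];        /* inside the range */
    }
    return 0;                      /* not found: class is assumed to be 0 */
}
```

  For example, with the three groups { 0x300, 0x314, 230 },
  { 0x315, 0x315, 232 } and { 0x316, 0x319, 220 }, a lookup of 0x301 yields
  230 and a lookup of 0x41 yields 0.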
  It is important to note that only 65536 distinct ranges plus combining class
  can be specified because the NumCCLNodes is usually a 16-bit number.

NUMBER TABLE
============

The final data file is called "num.dat" and contains the characters that have
a numeric value associated with them.

The format for the binary form of the table is:

  unsigned short ByteOrderMark
  unsigned short NumNumberNodes
  unsigned long  Bytes
  unsigned long  NumberNodes[NumNumberNodes]
  unsigned short ValueNodes[(Bytes - (NumNumberNodes * sizeof(unsigned long)))
                            / sizeof(short)]

  If the ByteOrderMark is equal to 0xFFFE, endian swapping is required in the
  same way as described in the CHARACTER PROPERTIES section.

  The NumberNodes array contains pairs of values, the first of which is the
  character code and the second an index into the ValueNodes array.  The
  ValueNodes array contains pairs of integers which represent the numerator
  and denominator of the numeric value of the character.  If the character
  happens to map to an integer, both the values in ValueNodes will be the
  same.
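
  A number lookup along the lines described above might look like the
  following sketch.  The function name number_lookup is hypothetical.  The
  sketch assumes the arrays are already loaded and byte-swapped, that the
  pairs in NumberNodes are sorted by character code, that `npairs` is the
  number of (code, index) pairs, and that the stored index is the position of
  the numerator within ValueNodes; none of these details is spelled out in the
  text above.

```c
/*
 * Hypothetical helper: find the numeric value of `code`.
 * `nodes` holds `npairs` pairs of (character code, index into `values`).
 * `values` holds (numerator, denominator) pairs.
 * Returns 1 and fills *num and *den when found, 0 otherwise.
 * Per the format description, an integer value is stored with both
 * entries equal, so a caller can test *num == *den to detect it.
 */
static int number_lookup(const unsigned long *nodes, unsigned long npairs,
                         const unsigned short *values, unsigned long code,
                         unsigned short *num, unsigned short *den)
{
    long lo = 0, hi = (long)npairs - 1;

    while (lo <= hi) {
        long mid = lo + (hi - lo) / 2;

        if (code < nodes[2 * mid])
            hi = mid - 1;
        else if (code > nodes[2 * mid])
            lo = mid + 1;
        else {
            /* Assumed: the index addresses the numerator directly. */
            const unsigned short *vp = values + nodes[2 * mid + 1];
            *num = vp[0];
            *den = vp[1];
            return 1;
        }
    }
    return 0;
}
```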
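
  Returning to the CHARACTER PROPERTIES section, the byte swaps and the range
  search can be sketched as below.  The function names are hypothetical.  The
  sketch assumes that Offsets[p] is the index in Ranges[] of the first value
  for property code p and that Offsets[p + 1] marks the end of that property's
  ranges, which is how the "one extra node on the end" is read here.  It does
  not handle any special marker an implementation might use for a property
  with no ranges.

```c
/* The 16-bit swap from the text, written as a function. */
static unsigned short swap16(unsigned short c)
{
    return (unsigned short)(((c >> 8) & 0xff) | ((c & 0xff) << 8));
}

/*
 * The 32-bit swap from the text.  The extra mask on the last term keeps
 * the result correct when unsigned long is wider than 32 bits.
 */
static unsigned long swap32(unsigned long c)
{
    return ((c & 0xff) << 24) | (((c >> 8) & 0xff) << 16) |
           (((c >> 16) & 0xff) << 8) | ((c >> 24) & 0xff);
}

/*
 * Hypothetical helper: return 1 if `code` lies in one of the ranges of
 * property `prop`, 0 otherwise.  Ranges are (start, end) pairs, so the
 * search works on pair numbers rather than on raw array indexes.
 */
static int prop_lookup(const unsigned short *offsets,
                       const unsigned long *ranges,
                       unsigned int prop, unsigned long code)
{
    long lo = (long)offsets[prop] / 2;          /* first pair of prop */
    long hi = (long)offsets[prop + 1] / 2 - 1;  /* last pair of prop  */

    while (lo <= hi) {
        long mid = lo + (hi - lo) / 2;

        if (code < ranges[2 * mid])
            hi = mid - 1;
        else if (code > ranges[2 * mid + 1])
            lo = mid + 1;
        else
            return 1;
    }
    return 0;
}
```

  A loader would apply swap16() and swap32() to every field when the
  ByteOrderMark reads as 0xFFFE, and leave the values alone otherwise.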
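
  For the CASE MAPPINGS section, a lookup could be sketched as follows.  The
  names case_find, to_lower and to_upper are hypothetical.  The sketch
  assumes the tables are already loaded and byte-swapped, and it only consults
  the upper and lower tables; a fuller version would also search the title
  table starting at TitleIndex.

```c
#include <stddef.h>

/*
 * Hypothetical helper: binary search `count` nodes of 3 values each,
 * starting at index `first` of the CaseTables array, comparing `code`
 * with the first field of each node.  Returns the node or NULL.
 */
static const unsigned long *case_find(const unsigned long *tables,
                                      unsigned long first,
                                      unsigned long count,
                                      unsigned long code)
{
    long lo = 0, hi = (long)count - 1;

    while (lo <= hi) {
        long mid = lo + (hi - lo) / 2;
        const unsigned long *node = tables + first + 3 * mid;

        if (code < node[0])
            hi = mid - 1;
        else if (code > node[0])
            lo = mid + 1;
        else
            return node;
    }
    return NULL;
}

/* Upper table nodes are (upper, lower, title): field 1 is the lower case. */
static unsigned long to_lower(const unsigned long *tables,
                              const unsigned short sizes[2],
                              unsigned long code)
{
    const unsigned long *node = case_find(tables, 0, sizes[0], code);
    return node ? node[1] : code;
}

/* Lower table nodes are (lower, upper, title): field 1 is the upper case. */
static unsigned long to_upper(const unsigned long *tables,
                              const unsigned short sizes[2],
                              unsigned long code)
{
    unsigned long lower_index = (unsigned long)sizes[0] * 3;
    const unsigned long *node = case_find(tables, lower_index, sizes[1], code);
    return node ? node[1] : code;
}
```

  Characters that are not found in the searched table are returned unchanged,
  which is a choice made for this sketch rather than something the format
  requires.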
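
  For the DECOMPOSITIONS section, the search and length calculation could be
  sketched as follows.  The function name decomp_lookup is hypothetical.  The
  sketch assumes the arrays are already loaded and byte-swapped, and that the
  one extra element at the end of DecompNodes[] holds the end index of the
  last decomposition list; the text only implies this through the "+ 1" in the
  array size.

```c
#include <stddef.h>

/*
 * Hypothetical helper: locate the decomposition of `code`.
 * `nodes` holds `n` pairs of (character code, start index into `decomp`)
 * followed by one extra element, assumed here to be the end index of the
 * last list.  Returns a pointer to the first code of the decomposition and
 * stores its length in *len, or returns NULL when `code` has no entry.
 */
static const unsigned long *decomp_lookup(const unsigned long *nodes,
                                          unsigned long n,
                                          const unsigned long *decomp,
                                          unsigned long code,
                                          unsigned long *len)
{
    long lo = 0, hi = (long)n - 1;

    while (lo <= hi) {
        long mid = lo + (hi - lo) / 2;

        if (code < nodes[2 * mid])
            hi = mid - 1;
        else if (code > nodes[2 * mid])
            lo = mid + 1;
        else {
            unsigned long start = nodes[2 * mid + 1];
            /* The next node's index, or the trailing element for the last node. */
            unsigned long end = ((unsigned long)mid + 1 < n)
                                    ? nodes[2 * mid + 3]
                                    : nodes[2 * n];
            *len = end - start;
            return decomp + start;
        }
    }
    return NULL;
}
```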