\prog{file}. Handles all the overhead of sorting the primary and secondary
keys, and of maintaining the association of secondary keys with primary
keys during and after the sort. Returns 0 on success, non-zero on
error. Error codes:\\
\begin{tabular}{ll}
\prog{SSI\_ERR\_NOFILE} & an fopen() failed\\
\prog{SSI\_ERR\_FWRITE} & an fwrite() failed\\
\prog{SSI\_ERR\_MALLOC} & a malloc() failed\\
\end{tabular}

\item[void SSIFreeIndex(SSIINDEX *g)]
Free an index structure.
\end{sreapi}

\subsubsection{Other SSI functions:}

\begin{sreapi}
\item[char *SSIErrorString(int n)]
Returns a pointer to an internal string corresponding to error
\prog{n}, a return code from any of the functions in the API that
return non-zero on error.
\end{sreapi}

\subsection{Detailed specification of SSI binary format}

There are four sections to the SSI file:
\begin{sreitems}{\textbf{Secondary keys}}
\item[\textbf{Header}]
Contains a magic number indicating the SSI version number, and various
information about the number and sizes of things in the index.
\item[\textbf{Files}]
Contains one or more \emph{file records}, one per sequence file that's
indexed. These contain information about the individual files.
\item[\textbf{Primary keys}]
Contains one or more \emph{primary key records}, one per primary key.
\item[\textbf{Secondary keys}]
Contains one or more \emph{secondary key records}, one per secondary key.
\end{sreitems}

All numeric quantities are stored as unsigned integers of known size
in network (big-endian) order, for maximum cross-platform portability of
the index files. \prog{sqd\_uint16}, \prog{sqd\_uint32}, and
\prog{sqd\_uint64} are typically typedef'd as \prog{unsigned short},
\prog{unsigned int}, and \prog{unsigned long long} or \prog{unsigned
long} at SQUID compile-time. Values may need to be cast to signed
quantities, so only half of their dynamic range is valid
(e.g. 0..32,767 for values of type \prog{sqd\_uint16};
0..2,147,483,647 (2 billion) for \prog{sqd\_uint32}; and 0..9.22e18 (9
million trillion) for \prog{sqd\_uint64}).
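As a check on those limits: each is the signed maximum $2^{n-1}-1$ of
the corresponding $n$-bit type, assuming the usual 16-, 32- and 64-bit
widths for the three typedef'd types:
\[
2^{15}-1 = 32{,}767, \qquad
2^{31}-1 = 2{,}147{,}483{,}647, \qquad
2^{63}-1 \approx 9.22 \times 10^{18}.
\]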
These typedefs are
handled automatically by the \prog{./configure} script (see
\prog{squidconf.h.in} before configuration, \prog{squidconf.h} after
configuration). If necessary, \prog{./configure}'s guess can be
overridden in \prog{squidconf.h} after configuration.

\subsubsection{Header section}

The header section contains:

\vspace{1em}
\begin{tabular}{llrr}
Variable & Description & Bytes & Type \\\hline
\prog{magic} & SSI version magic number. & 4 & \prog{sqd\_uint32}\\
\prog{flags} & Optional behavior flags (see below) & 4 & \prog{sqd\_uint32}\\
\prog{nfiles} & Number of files in file section. & 2 & \prog{sqd\_uint16}\\
\prog{nprimary} & Number of primary keys. & 4 & \prog{sqd\_uint32}\\
\prog{nsecondary} & Number of secondary keys. & 4 & \prog{sqd\_uint32}\\
\prog{flen} & Length of filenames (incl. '\verb+\0+') & 4 & \prog{sqd\_uint32}\\
\prog{plen} & Length of primary key names (incl. '\verb+\0+') & 4 & \prog{sqd\_uint32}\\
\prog{slen} & Length of sec. key names (incl. '\verb+\0+') & 4 & \prog{sqd\_uint32}\\
\prog{frecsize} & \# of bytes in a file record & 4 & \prog{sqd\_uint32}\\
\prog{precsize} & \# of bytes in a primary key record & 4 & \prog{sqd\_uint32}\\
\prog{srecsize} & \# of bytes in a sec. key record & 4 & \prog{sqd\_uint32}\\
\prog{foffset} & disk offset, start of file records & \dag & \dag\\
\prog{poffset} & disk offset, start of primary key recs & \dag & \dag\\
\prog{soffset} & disk offset, start of sec. key records & \dag & \dag\\
\end{tabular}
\vspace{1em}

The optional behavior flags are:

\vspace{1em}
\begin{tabular}{lll}
Flag & Value & Note\\ \hline
\prog{SSI\_USE64} & $1 \ll 0$ & Large sequence files; all key offsets 64 bit.\\
\prog{SSI\_USE64\_INDEX} & $1 \ll 1$ & Large index; SSI file itself uses 64-bit offsets.\\
\hline
\end{tabular}
\vspace{1em}

The optional behavior flags define whether the SSI file uses large
file (64-bit) offsets. This issue is discussed in greater detail
below (see ``Large sequence files and large indices'').
Briefly: if
\prog{SSI\_USE64} is set, the sequence file is large, and all sequence
file offsets are 64-bit integers. If \prog{SSI\_USE64\_INDEX} is
set, the index file itself is large, and \prog{foffset},
\prog{poffset}, and \prog{soffset} (that is, all offsets within the
index file itself, indicated as \dag\ in the above table) are 64-bit
integers.\footnote{In the current API it is not expected that
\prog{SSI\_USE64\_INDEX} would ever be set. The current index-writing
API keeps the entire index in RAM (it has to sort the keys), and would
presumably have to be modified or replaced to be able to generate very
large indices.}

The reason to explicitly record various record sizes (\prog{frecsize},
\prog{precsize}, \prog{srecsize}) and index file positions
(\prog{foffset}, \prog{poffset}, \prog{soffset}) is to allow future
extensibility. More fields might be added without breaking older SSI
parsers. The format is meant to be both forwards- and
backwards-compatible.

\subsubsection{File section}

The file section consists of \prog{nfiles} file records. Each record
is \prog{frecsize} bytes long, and contains:

\vspace{1em}
\begin{tabular}{llrr}
Variable & Description & Bytes & Type \\\hline
\prog{filename} & Name of file (possibly including full path) & \prog{flen} & char *\\
\prog{format} & Format code for file; see squid.h for definitions & 4 & \prog{sqd\_uint32} \\
\prog{flags} & Optional behavior flags & 4 & \prog{sqd\_uint32} \\
\prog{bpl} & Bytes per sequence data line & 4 & \prog{sqd\_uint32} \\
\prog{rpl} & Residues per sequence data line & 4 & \prog{sqd\_uint32} \\
\hline
\end{tabular}
\vspace{1em}

When an SSI file is written, \prog{frecsize} is equal to the sum of
the sizes above. When an SSI file is read by a parser, it is possible
that \prog{frecsize} is larger than the parser expects, if the parser
is expecting an older version of the SSI format: additional fields
may be present, which increases \prog{frecsize}.
The parser will only
try to understand the data up to the \prog{frecsize} it expected to
see, but it still knows the actual \prog{frecsize} for
purposes of skipping around in the index file.

Normally the SSI index resides in the same directory as the sequence
data file(s), so \prog{filename} is relative to the location of the
SSI index. In the event this is not true, \prog{filename} can contain
a full path.

\prog{format} is a SQUID sequence file format code; e.g. something like
\prog{SQFILE\_FASTA} or \prog{MSAFILE\_STOCKHOLM}. These constants are defined
in \prog{squid.h}.

Only one possible optional behavior flag is defined:

\vspace{1em}
\begin{tabular}{lll}
Flag & Value & Note\\ \hline
\prog{SSI\_FAST\_SUBSEQ} & $1 \ll 0$ & Fast subseq retrieval is possible for this file.\\
\hline
\end{tabular}
\vspace{1em}

When \prog{SSI\_FAST\_SUBSEQ} is set, \prog{bpl} and \prog{rpl} are
nonzero. They can be used to calculate the offset of subsequence
positions in the data file. This is described in the optional behavior
section below.

\subsubsection{Primary key section}

The primary key section consists of \prog{nprimary} records. Each
record is \prog{precsize} bytes long, and contains:

\vspace{1em}
\begin{tabular}{llrr}
Variable & Description & Bytes & Type \\\hline
\prog{key} & Key name (seq name, identifier, accession) & \prog{plen} & char *\\
\prog{fnum} & File number (0..nfiles-1) & 2 & \prog{sqd\_uint16}\\
\prog{offset1} & Offset to start of record & \ddag & \ddag \\
\prog{offset2} & Offset to start of sequence data & \ddag & \ddag \\
\prog{len} & Length of data (e.g. seq length, residues) & 4 & \prog{sqd\_uint32} \\
\hline
\end{tabular}
\vspace{1em}

The offsets are sequence file offsets (indicated by \ddag).
They are
4 bytes of type \prog{sqd\_uint32} normally, 8 bytes of type
\prog{sqd\_uint64} if \prog{SSI\_USE64} is set, and
\prog{sizeof(fpos\_t)} bytes of type \prog{fpos\_t} if \prog{SSI\_FPOS\_T} is set.

\prog{offset2} and \prog{len} are only meaningful if \prog{SSI\_FAST\_SUBSEQ}
is set on this key's file. \prog{offset2} gives the absolute disk
position of line 0 in the sequence data. \prog{len} is necessary for
bounds checking in a subsequence retrieval, to be sure we don't try to
reposition the disk outside the valid data.

\subsubsection{Secondary key section}

The secondary key section consists of \prog{nsecondary} records. Each
record is \prog{srecsize} bytes long, and contains:

\vspace{1em}
\begin{tabular}{llrr}
Variable & Description & Bytes & Type \\\hline
\prog{key} & Key name (seq name, identifier, accession) & \prog{slen} & char *\\
\prog{pkey} & Primary key & \prog{plen} & char *\\
\hline
\end{tabular}
\vspace{1em}

All data are kept with the primary key records. A secondary key is
simply translated to its primary key, and then the primary key is
looked up.

\subsection{Optional behaviors}

\subsubsection{Large sequence files and large indices: 64-bit operation}

Normally an SSI index file can be no larger than 2 GB, and can index
sequence files that are no larger than 2 GB each. This is due to
limitations in the ANSI C/POSIX standards, which were developed for
32-bit operating systems and filesystems. Most modern operating
systems allow larger 64-bit file sizes, but as far as I'm aware (Dec
2000), there are no standard interfaces yet for working with positions
(offsets) in large files. On many platforms, SSI can extend to full
64-bit capabilities, but on some platforms, it cannot. To understand
the limitations (of SSI, and possibly of my understanding) you need
to understand some details about what's happening behind the SSI API
and how I understand C APIs to modern 64-bit OS's and hardware.

First, some information on ANSI C APIs for file positioning.
ANSI C
provides the portable functions \prog{fseek()} and \prog{ftell()} for
manipulating simple offsets in a file. They store the offset in a
\prog{long} (which ranges up to 2 GB when \prog{long} is 32 bits). The
Standard says we're allowed
to do arithmetic on this value if the file is binary. ANSI C also
provides \prog{fgetpos()} and \prog{fsetpos()}, which store file
positions in an opaque data type called \prog{fpos\_t}. Modern
operating systems with large file support define \prog{fpos\_t} in a
way that permits files $>$2 GB. However, \prog{fpos\_t} is an opaque
type. It has two disadvantages compared to a simple arithmetic type
like \prog{long}: first, we're not allowed to do arithmetic on it, and
second, we can't store it in a binary file in an
architecture-independent manner. We need both features for SSI,
unfortunately.\footnote{Surely the professional C community has the
same problem; does \emph{everyone} hack around \prog{fpos\_t}?}

Therefore we have to rely on system-dependent features. Most operating
systems provide a non-ANSI library call that returns an
arithmetic offset. Fully 64-bit systems typically give us a 64-bit
\prog{off\_t} and functions \prog{ftello()}/\prog{fseeko()} that work
with that offset. Many systems provide a ``transitional interface''
where all normally named functions are 32-bit, but specially named
64-bit varieties are available: e.g. \prog{off\_t} is 32 bits, but
\prog{off64\_t} is 64 bits and we have functions \prog{ftello64()} and
\prog{fseeko64()}. Some systems provide a \prog{ftell64()} and
\prog{fseek64()} that work on offsets of type \prog{long long}. Many
popular systems may even provide more than one of these models,
depending on compiler flags.

And, unfortunately, some systems provide none of these models (FreeBSD
for example).
There, we will exploit the fact that most systems
(including FreeBSD) do in fact implement \prog{fpos\_t} as a simple
arithmetic type, such as an \prog{off\_t}, so we can misuse it.

At compile time, SQUID's \prog{./configure} script tests for the
system's capabilities for 64-bit file offsets, and configures a
section in the \prog{squidconf.h} file. (The configuration includes a
custom autoconf macro, \prog{SQ\_ARITHMETIC\_FPOS\_T()}, to test
whether \prog{fpos\_t} is an arithmetic type and define
\prog{ARITHMETIC\_FPOS\_T} if it is.) Four
possible 64-bit models are tested in the following order; if one of
them is possible, it will be used, and the constant
\prog{HAS\_64BIT\_FILE\_OFFSETS} is set.

\begin{enumerate}
\item has \prog{ftello()}, \prog{fseeko()}; sizeof(\prog{off\_t}) $= 8$.
\item has \prog{ftello64()}, \prog{fseeko64()}; sizeof(\prog{off64\_t}) $= 8$.
\item has \prog{ftell64()}, \prog{fseek64()}
\item \prog{fpos\_t} is an arithmetic 64-bit type; (mis)use
\prog{fgetpos()}, \prog{fsetpos()}.
\end{enumerate}

\subsubsection{Fast subsequence retrieval}

In some files (notably vertebrate chromosome contigs) the size of each
sequence is large. It may be slow to extract a subsequence by first
reading the whole sequence into memory, or even prohibitive if the
sequence is so large that it can't be stored in memory.

If the sequence data file is very consistently formatted so that each
line in each record (except the last one) is of the same length, in
both bytes and residues, we can determine a disk offset of the start
of any subsequence by direct calculation.

For example, a simple well-formatted FASTA
file with 50 residues per line would have 51 bytes per sequence line
(counting the '\verb+\n+') (\prog{bpl}=51, \prog{rpl}=50). Position $i$ in a sequence
$1..L$ will be on line $l = (i-1)/\mbox{\prog{rpl}}$ (integer division,
with lines numbered from 0), and line $l$ starts at
disk offset $l * \mbox{\prog{bpl}}$ relative to the start of the sequence
data.
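As a worked instance of that line calculation, assuming integer
division and an arbitrarily chosen residue $i = 123$ in the
50-residue-per-line example above (\prog{rpl}=50, \prog{bpl}=51):
\[
l = \lfloor (123-1)/50 \rfloor = 2, \qquad
l * \mbox{\prog{bpl}} = 2 * 51 = 102,
\]
so residue 123 lies on line 2, which begins 102 bytes past the start of
the sequence data.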
If there are no nonsequence characters in the data line except
the terminal '\verb+\n+' (which is true iff \prog{bpl} = \prog{rpl}+1 and 1 residue = 1
byte), position $i$ can be precisely found:

\[
\mbox{relative offset of residue $i$} =
\left((i-1)/\mbox{\prog{rpl}}\right)*\mbox{\prog{bpl}} + (i-1) \% \mbox{\prog{rpl}}
\]

Even for sequence data lines with extra characters (e.g. spaces,
coordinates, whatever), fast subsequence retrieval is possible; a
parser can be positioned at the beginning of the appropriate line $l$,
which starts at residue $(l*\mbox{\prog{rpl}}) + 1$, and it can start reading
from there (i.e. the line that $i$ is on) rather than the beginning of
the whole sequence record.

The program that creates the index is responsible for determining if
\prog{bpl} and \prog{rpl} are consistent throughout a file; if so, it
may set the \prog{SSI\_FAST\_SUBSEQ} flag for the file. Then any record
whose primary key carries the optional data offset (\prog{offset2})
and sequence length data is available for subsequence position
calculations by \prog{SSIGetSubseqOffset()}.

\end{document}