📄 ssi-format.tex
% SRE, Mon Dec 25 13:00:46 2000

\documentclass[12pt]{report}
\usepackage{fullpage}
\usepackage{times}
\usepackage{epsfig}
%\usepackage{html}               % From the LaTeX2html translator
\usepackage{apalike}
\setcounter{secnumdepth}{2}

\input{macros}

\begin{document}
\bibliographystyle{apalike}

\section{SSI format}

SSI format (Sequence/Subsequence Index format) indexes flatfile
databases by names and/or accessions, enabling fast retrieval.

An SSI index is a binary file that stores sequence names or accessions
as \emph{keys} that it can look up rapidly. It differentiates between
\emph{primary keys} and \emph{secondary keys}.  There is one and only
one primary key per sequence. There can be more than one secondary key
per sequence. Both primary and secondary keys must be unique
identifiers (no two records have the same key). A program (like
HMMER's distributed PVM implementation) that needs to step through
each sequence one at a time can refer to the list of primary keys. A
program solely concerned with flexible sequence retrieval (such as
SQUID's \prog{sfetch}) might consult an SSI index with accessions as
primary keys, and names as secondary keys.

A single SSI file can index multiple sequence data files. This allows
indexing multifile databases (e.g. GenBank flatfile distributions).

The SSI format is relatively simple and may prove useful for other
indexing tasks besides sequence names. HMMER uses SSI format to index
HMM files.

\subsection{Special features of SSI}

SSI superseded 1994's GSI format after human genome sequence files
started exceeding 2 GB filesystem limitations, and after problems in
the HMMER PVM implementation had to be hacked around. SSI has the
following additional features compared to GSI.

\begin{description}
\item[Separate primary key section]
Primary keys are set apart in a separate section, enabling programs to
step through a guaranteed one-to-one mapping of keys to sequences.
A secondary key section adds many-to-one mapping of keys to sequences.

\item[Arbitrary filename and key lengths]
File name lengths and key name lengths are effectively unlimited.

\item[64-bit indexing]
For sequence files exceeding 2 GB, on architectures that support 64-bit
filesystems (such as IRIX, Solaris, Tru64 UNIX, FreeBSD...), SSI
supports 64-bit indexing; depending on the system, file sizes may
theoretically be allowed to range up to millions of terabytes.

\item[Fast subsequence extraction]
SSI can be used to greatly accelerate \emph{subsequence} extraction
from very long sequences (example: human chromosome contigs). The
sequence file must meet certain formatting conditions for this to
work; see below for details.
\end{description}

\subsection{SSI API in SQUID}

\subsubsection{Functions for using an SSI index file:}

\begin{sreapi}
\item[int SSIOpen(char *filename, SSIFILE **ret\_sfp)]

Opens the SSI index file \prog{filename} and returns a \prog{SSIFILE
*} stream through \prog{ret\_sfp}. Returns 0 on success, nonzero on
failure. The caller must eventually close this stream using
\prog{SSIClose()}.  More than one index can be open at once.

Error codes:\\
\begin{tabular}{ll}
\prog{SSI\_ERR\_NOFILE}   & failed to open file; doesn't exist or not readable\\
\prog{SSI\_ERR\_BADMAGIC} & not an SSI file \\
\prog{SSI\_ERR\_NO64BIT}  & it has 64-bit offsets, and we can't support that\\
\prog{SSI\_ERR\_FORMAT}   & file appears to be corrupted\\
\prog{SSI\_ERR\_MALLOC}   & malloc failed \\
\end{tabular}

\item[int SSIGetOffsetByName(SSIFILE *sfp, char *key, int *ret\_fh, SSIOFFSET *ret\_offset)]

Looks up the string \prog{key} in the open index \prog{sfp}.
\prog{key} can be either a primary or secondary key. If \prog{key} is
found, \prog{*ret\_fh} contains a unique handle on the file
that contains \prog{key} (suitable for an \prog{SSIFileInfo()} call, or for
comparison to the handle of the last file that was opened for
retrieval), and \prog{ret\_offset} is filled in with the offset in that
file.
Returns 0 on success, non-zero on error.

Error codes:\\
\begin{tabular}{ll}
\prog{SSI\_ERR\_NO\_SUCH\_KEY} & key not found \\
\prog{SSI\_ERR\_NODATA}        & fread() failed, file appears to be corrupted\\
\end{tabular}

\item[int SSIGetOffsetByNumber(SSIFILE *sfp, int nkey, int *ret\_fh, SSIOFFSET *offset)]

Retrieves information for primary key number \prog{nkey}.  \prog{nkey}
ranges from 0..\prog{nprimary-1}. When the key is found,
\prog{*ret\_fh} contains a unique handle on the file that
contains the key (suitable for an \prog{SSIFileInfo()} call, or for comparison
to the handle of the last file that was opened for retrieval), and
\prog{offset} is filled in with the offset in that file. Returns 0 on
success, non-zero on error.

Error codes:\\
\begin{tabular}{ll}
\prog{SSI\_ERR\_SEEK\_FAILED}  & failed to reposition in index file\\
\prog{SSI\_ERR\_NO\_SUCH\_KEY} & key not found \\
\prog{SSI\_ERR\_NODATA}        & fread() failed, file appears to be corrupted\\
\end{tabular}

\item[int SSIGetSubseqOffset(SSIFILE *sfp, char *key, int requested\_start, int *ret\_fh, SSIOFFSET *record\_offset, SSIOFFSET *data\_offset, int *ret\_actual\_start)]

Implements \prog{SSI\_FAST\_SUBSEQ}.

Looks up the string \prog{key} in the open index \prog{sfp}, and
asks for the nearest offset to a subsequence starting at position
\prog{requested\_start} in the sequence (numbering the sequence 1..L).
\prog{key} can be either a primary or secondary key. If \prog{key} is
found, \prog{*ret\_fh} contains a unique handle on the file that
contains \prog{key} (suitable for an \prog{SSIFileInfo()} call, or for comparison
to the handle of the last file that was opened for retrieval);
\prog{record\_offset} contains the disk offset to the start of the
record; \prog{data\_offset} contains the disk offset either exactly at
the requested residue, or at the start of the line containing the
requested residue; \prog{ret\_actual\_start} contains the coordinate
(1..L) of the first valid residue at or after
\prog{data\_offset}.
\prog{ret\_actual\_start} is $\leq$ \prog{requested\_start}.
Returns 0 on success, non-zero on failure.

Error codes:\\
\begin{tabular}{ll}
\prog{SSI\_ERR\_NO\_SUBSEQS}   & this file or key doesn't allow subseq lookup\\
\prog{SSI\_ERR\_NO\_SUCH\_KEY} & key not found \\
\prog{SSI\_ERR\_RANGE}         & the requested\_start is out of bounds\\
\prog{SSI\_ERR\_NODATA}        & fread() failed, file appears to be corrupted\\
\end{tabular}

\item[int SSISetFilePosition(FILE *fp, SSIOFFSET *offset)]

Uses \prog{offset} to set the file position for \prog{fp} (usually an
open sequence file) relative to the start of the file.  Hides the
details of system-dependent shenanigans necessary for file positioning
in large ($>2$ GB) files. Behaves just like \prog{fseek(fp, offset,
SEEK\_SET)} for 32 bit offsets and $<2$ GB files. Returns 0 on
success, nonzero on error.

Error codes:\\
\begin{tabular}{ll}
\prog{SSI\_ERR\_SEEK\_FAILED}  & failed to reposition the file\\
\end{tabular}

\item[int SSIFileInfo(SSIFILE *sfp, int fh, char **ret\_filename, int *ret\_format)]

Given a file handle \prog{fh} in an open index file \prog{sfp},
retrieves the file name \prog{ret\_filename} and the file format
\prog{ret\_format}.  \prog{ret\_filename} is a pointer to a string
maintained internally by \prog{sfp}.
It should not be freed;
\prog{SSIClose(sfp)} will take care of it.

Error codes:\\
\begin{tabular}{ll}
\prog{SSI\_ERR\_BADARG}  & no such file n\\
\end{tabular}

\item[void SSIClose(SSIFILE *sfp)]

Close an open \prog{SSIFILE *}.

\end{sreapi}

\subsubsection{Skeleton example code for using an SSI index file:}

\small\begin{verbatim}
    SSIFILE   *sfp;
    FILE      *fp;
    int        fh;
    char      *seqfile;
    int        fmt;
    SSIOFFSET  offset;

    SSIOpen("foo.gsi", &sfp);
    /* Finding an entry by name
     * (by number, with SSIGetOffsetByNumber(), is analogous)
     */
    SSIGetOffsetByName(sfp, "important_key", &fh, &offset);
    SSIFileInfo(sfp, fh, &seqfile, &fmt);
    fp = fopen(seqfile, "r"); /* more usually SeqfileOpen(), using fmt */
    SSISetFilePosition(fp, &offset);
    /* read the entry from there, do whatever... */
    fclose(fp);
    SSIClose(sfp);  /* seqfile belongs to sfp; do not free() it yourself */
\end{verbatim}\normalsize

\subsubsection{Functions for creating an SSI index file:}

\begin{sreapi}
\item[int SSIRecommendMode(char *file)]

Examines the file and determines whether it should be indexed with
large file support or not; returns \prog{SSI\_OFFSET\_I32} for most
files, \prog{SSI\_OFFSET\_I64} for large files, or -1 on failure.

\item[SSIINDEX *SSICreateIndex(int mode)]

Creates and initializes an SSI index structure. Sequence file offset
type to be used is specified by \prog{mode}, which may be either
\prog{SSI\_OFFSET\_I32} or \prog{SSI\_OFFSET\_I64}.  Returns a
pointer to the new structure, or NULL on failure. The caller must free
this structure with \prog{SSIFreeIndex()} when done.

\item[int SSIGetFilePosition(FILE *fp, int mode, SSIOFFSET *ret\_offset)]

Fills \prog{ret\_offset} with the current disk offset of \prog{fp},
relative to the start of the file.  \prog{mode} is the type of offset to
use; it must be either \prog{SSI\_OFFSET\_I32} or
\prog{SSI\_OFFSET\_I64}.
Returns 0 on success, non-zero on error.

Error codes:\\
\begin{tabular}{ll}
\prog{SSI\_ERR\_NO64BIT}       & 64-bit mode unsupported on this system\\
\prog{SSI\_ERR\_TELL\_FAILED}  & failed to determine position in file\\
\end{tabular}

\item[int SSIAddFileToIndex(SSIINDEX *g, char *filename, int fmt, int *ret\_fh)]

Adds the sequence file \prog{filename}, which is known to be in format
\prog{fmt}, to the index \prog{g}. Creates and returns a unique
filehandle \prog{ret\_fh} for associating primary keys with this file
using \prog{SSIAddPrimaryKeyToIndex()}. Returns 0 on success, non-zero
on failure.

Error codes:\\
\begin{tabular}{ll}
\prog{SSI\_ERR\_TOOMANY\_FILES}  & exceeded file number limit\\
\prog{SSI\_ERR\_MALLOC}          & a malloc() failed\\
\end{tabular}

\item[int SSISetFileForSubseq(SSIINDEX *g, int fh, int bpl, int rpl)]

Set \prog{SSI\_FAST\_SUBSEQ} for the file indicated by filehandle
\prog{fh} in the index \prog{g}, setting parameters \prog{bpl} and
\prog{rpl} to the values given. \prog{bpl} is the number of bytes per
sequence data line.  \prog{rpl} is the number of residues per sequence
data line.  Caller must be sure that \prog{bpl} and \prog{rpl} do not
change on any line of any sequence record in the file (except for the
last data line of each record). If this is not the case in this file,
\prog{SSI\_FAST\_SUBSEQ} will not work, and this routine should not be
called. Returns 0 on success, non-zero on failure.

\item[int SSIAddPrimaryKeyToIndex(SSIINDEX *g, char *key, int fh, SSIOFFSET *r\_off, SSIOFFSET *d\_off, int L)]

Puts a primary key \prog{key} in the index \prog{g}, while telling the
index that this primary key is in the file associated with filehandle
\prog{fh} and its record starts at position \prog{r\_off} in that
file.

\prog{d\_off} and \prog{L} are optional; they may be left unset by
passing NULL and 0, respectively. (If one is provided, both must be
provided.)
If they are provided, \prog{d\_off} gives the position of
the first line of sequence data in the record, and \prog{L} gives
the length of the sequence in residues. They are used when
\prog{SSI\_FAST\_SUBSEQ} is set for the sequence file. If
\prog{SSI\_FAST\_SUBSEQ} is not set for the file, \prog{d\_off} and
\prog{L} will be ignored even if they are available, so it doesn't
hurt for the indexing program to provide them; typically it won't know
whether it's safe to set \prog{SSI\_FAST\_SUBSEQ} for the whole file
until the whole file has been read and every key has already been
added to the index.

The primary key string \prog{key} itself serves as the ``handle'' for
this record: any subsequent calls to
\prog{SSIAddSecondaryKeyToIndex()} pass it to associate one or more
secondary keys with this primary key.

Returns 0 on success, non-zero on error.

Error codes:\\
\begin{tabular}{ll}
\prog{SSI\_ERR\_TOOMANY\_KEYS}  & exceeded primary key limit\\
\prog{SSI\_ERR\_TOOMANY\_FILES} & filenum exceeds file limit\\
\prog{SSI\_ERR\_MALLOC}         & a malloc() failed\\
\end{tabular}

\item[int SSIAddSecondaryKeyToIndex(SSIINDEX *g, char *key, char *pkey)]

Puts a secondary key \prog{key} in the index \prog{g}, associating it
with a primary key \prog{pkey} that has already been added to the index
by \prog{SSIAddPrimaryKeyToIndex()}.

Returns 0 on success, non-zero on error.

Error codes:\\
\begin{tabular}{ll}
\prog{SSI\_ERR\_TOOMANY\_KEYS}  & exceeded secondary key limit\\
\prog{SSI\_ERR\_MALLOC}         & a malloc() failed\\
\end{tabular}

\item[int SSIWriteIndex(char *file, SSIINDEX *g)]

Writes complete index \prog{g} in SSI format to a binary file
