% --------------------------------------------------------------
% squid:formats.tex
% SRE, Wed Jul 14 17:54:59 1999
% $CVS Id$
% --------------------------------------------------------------

\chapter {Sequence file formats}

\section{Summary}

The software can handle a number of different file formats. By
default, it autodetects the file format, so you don't have to worry
about converting formats. Most common file formats are recognized,
including FASTA, Genbank, EMBL, Swissprot, PIR, and GCG for
unaligned sequences, and GCG MSF, Clustal, Phylip, and Stockholm
format for multiple sequence alignments. Some parts of the source code
call the autodetector the ``Babelfish''.

The Babelfish has three drawbacks. First, it takes a small amount of
time to do the autodetection. Second, the Babelfish is aggressive, and
it makes mistakes when a file isn't one of the known formats -- in
particular, it can mistake plain text files for SELEX alignments,
because the SELEX format is so free-form. Third, because the Babelfish
works by reading the first part of the file then rewinding it before
starting to process it, you can't use the Babelfish on a nonrewindable
stream: e.g. when you're taking sequence input from a UNIX pipe
instead of a file, or when the file is gzipped and has to be
decompressed before processing. In normal use, when you're using the
software interactively from the command line on sequence files that
you're familiar with, the Babelfish is very convenient and
(relatively) safe.

However, you'll find that there are times when you want to override
the Babelfish -- particularly in high-throughput analysis, when you
know the format your files are supposed to be in, and you'd rather
increase robustness and sacrifice interactive flexibility. All the
programs have an \verb+--informat <format>+ option that lets you
specify the format and shut off the Babelfish. You \emph{must} use
\verb+--informat+ to use compressed files, or to read sequences from a
UNIX pipe...
see below for more details on these tricks.

\section{Formats recognized by the Babelfish}

Recognized unaligned sequence file formats:

\begin{tabular}{ll}\hline
Format name   & Note \\ \hline
fasta         & BLAST flatfile databases, etc.\\
genbank       & NCBI Genbank flat file format.\\
embl          & Includes both EMBL (DNA) and SWISSPROT (protein) databases.\\
pir           & Protein Information Resource database (NBRF/Georgetown)\\
gcg           & Wisconsin Genetics Computer Group; only allows one sequence per file.\\
gcgdata       & I think this GCG database format is obsolete now.\\ \hline
\end{tabular}

Recognized multiple sequence alignment file formats:

\begin{tabular}{ll}\hline
Format name   & Note \\ \hline
stockholm     & Pfam format. Allows databases of more than one alignment per file\\
selex         & Old NeXagen RNA alignment format, adopted by early HMMER releases.\\
msf           & GCG's alignment format.\\
clustal       & ClustalV, ClustalW, and friends.\\
a2m           & Aligned FASTA format; see comment below.\\
phylip        & Format used by Felsenstein's PHYLIP phylogenetic inference software\\ \hline
\end{tabular}

Aligned FASTA format (here called ``A2M'', though I believe that what
Haussler's group at UCSC started calling A2M is yet another variant of
aligned FASTA that's incompatible with this A2M) is only autodetected
when an alignment file is expected. Otherwise an A2M file will be
recognized as unaligned FASTA, and its gap characters (if any) will be
parsed as sequence characters -- often not what you want.

Alignment files may be used when unaligned files are expected -- the
sequences will silently be de-aligned and read sequentially. The
converse is not true; you can't give an unaligned sequence format when
an alignment is expected (makes sense, right?).

There is no provision for enforcing that single unaligned sequence
formats really do contain just a single sequence.
An attempt to
convert a multisequence file to GCG format will silently ``succeed'',
and the file may look ok to your eye, but that multisequence ``GCG''
file is illegal. The data will be corrupted if you try to read that
file back in, possibly without generating any error messages.

It turns out that other formats work too, but they're undocumented,
not subjected to any quality control testing at software release time,
and prone to change without notice at my slightest whim. (In other
words, even less supported than the software already is.) The brave,
curious, or desperate are invited to peruse
\prog{seqio.c} and \prog{squid.h}.

\section{Special tricks}

\subsection{Reading from standard input (probably UNIX-only)}

If you give ``-'' as a sequence filename, the software will read the
sequences from standard input rather than from a file. You will need
to specify the format of the incoming data using the
\verb+--informat+ option.

Any format except SELEX can be read from standard input. This lets you
use any program downstream in a standard UNIX pipe.

There is one limitation: you can't use ``-'' more than once on a
command line, for obvious reasons. (How is it supposed to read more
than one file from one standard input stream?) If you do, behavior of
the software is undefined -- in other words, the software doesn't check
for whether you're making this mistake, so God help you if you do.

\subsection{Reading from gzip'ed files (probably UNIX-only)}

A sequence file in any format except SELEX can be compressed by gzip,
and read in its compressed form. The software looks for the suffix
\prog{.gz} to detect gzip'ed files. This allows you to save disk space
by keeping sequence files gzip'ed, if you like.
gzip is not built in;
the software needs to find a gzip executable in your current PATH.

If for some reason you name a file with a \prog{.gz} suffix and it's
\emph{not} a gzip-compressed file, the software will still try to
decompress it, and peculiar things may happen.

\section{FASTA format, the recommended unaligned format}

FASTA is probably the simplest of formats for unaligned sequences.
FASTA files are easily created in a text editor. Each sequence is
preceded by a line starting with \verb+>+. The first word on this line
is the name of the sequence. The rest of the line is a description of
the sequence (free format). The remaining lines contain the sequence
itself. You can put as many letters on a sequence line as you want.

\textbf{Example of a simple FASTA file:}

\begin{verbatim}
>seq1 This is the description of my first sequence.
AGTACGTAGTAGCTGCTGCTACGTGCGCTAGCTAGTACGTCA
CGACGTAGATGCTAGCTGACTCGATGC
>seq2 This is a description of my second sequence.
CGATCGATCGTACGTCGACTGATCGTAGCTACGTCGTACGTAG
CATCGTCAGTTACTGCATGCTCG
\end{verbatim}

For better or worse, FASTA is not a documented standard. Minor (and
major) variants are in widespread use in the bioinformatics community,
all of which are called ``FASTA format''. My software attempts to
cater to all of them, and is tolerant of common deviations in FASTA
format. Certainly anything that is accepted by the database formatting
programs in NCBI BLAST or WU-BLAST (e.g. setdb, pressdb, xdformat)
will also be accepted by my software. Blank lines in a FASTA file are
ignored, and so are spaces or other gap symbols (dashes, underscores,
periods) in a sequence. Other non-amino or non-nucleic acid symbols in
the sequence are also silently ignored, mostly because some people
seem to think that ``*'' or ``.'' should be added to protein sequences
to (redundantly) indicate the end of the sequence.
The parser will
also accept unlimited line lengths, which allows it to accommodate the
enormous description lines in the NCBI NR databases.

On the other hand, any FASTA files \emph{generated} by my software
adhere closely to community standards, and should be usable by other
software packages (BLAST, FASTA, etc.) that are more picky about
parsing their input files. That means you can run a sloppy FASTA file
through \prog{sreformat} to clean it up.

Partly because of this tolerance, the software may have a difficult
time dealing with files that are \textit{not} in FASTA format,
especially if you're relying on the Babelfish to do format
autodetection. Some (now mercifully uncommon) file formats are so
similar to FASTA format that they may be erroneously called FASTA by the
Babelfish and then quietly but lethally misparsed. An example is the
old NBRF file format. If you're using \verb+--informat+, things will
be more robust, and the software should simply refuse to accept a
non-FASTA file -- but you shouldn't count on this, because files
perversely similar to FASTA will still confuse the parser. (The gist
of these caveats applies to all formats, not just FASTA.)

\section{SELEX, the quick and dirty alignment format}

An example of a simple SELEX alignment file:

\begin{verbatim}
# Example selex file

seq1     ACGACGACGACG.
seq2     ..GGGAAAGG.GA
seq3     UUU..AAAUUU.A

seq1  ..ACG
seq2  AAGGG
seq3  AA...UUU
\end{verbatim}

SELEX is an interleaved multiple alignment format that arose as a
simple, intuitive format that was easy to write and manipulate
manually in a text editor. It is usually easy to convert other
alignment formats into SELEX format, even with a couple of lines of
Perl, but it can be harder to go the other way, since SELEX is more
free-format than other alignment formats. For instance, GCG's MSF
format and the output of the CLUSTALV multiple alignment program are
similar interleaved formats that can be converted to SELEX just by
stripping a small number of non-sequence lines out.
Because SELEX
evolved to accommodate different user input styles, it is very tolerant
of various inconsistencies such as different gap symbols, varying line
lengths, etc.

Each line contains a name, followed by the aligned sequence. A space,
dash, underscore, or period denotes a gap. If the alignment is too
long to fit on one line, the alignment is split into multiple blocks,
separated by blank lines. The number of sequences, their order, and
their names must be the same in every block (even if a sequence has no
residues in a given block!) Other blank lines are ignored. You can add
comments to the file on lines starting with a \verb+#+.

SELEX stands for ``Systematic Evolution of Ligands by Exponential
Enrichment'' -- it refers to the Tuerk and Gold technology for
evolving families of small RNAs for particular functions
\cite{Tuerk90b}. SELEX files were what we used to keep track of
alignments of these small RNA families, at a company then called
NeXagen, in Boulder. It's an interesting piece of historical baggage.

With the development of HMMER and more need for annotated alignments
in Pfam, SELEX format later evolved into ``extended SELEX'', with a
reserved comment style that allowed structural markup and other
annotations, but that became unwieldy. We now use Stockholm format
(see below) for highly annotated alignments. (Extended SELEX is
deprecated and undocumented.) Still, the basic SELEX format remains a
useful ``lowest common denominator'' alignment format, and has been
retained.

\subsubsection {Detailed specification of a SELEX file}

\begin{enumerate}
\item
Any line beginning with a \verb+#=+ as the first two characters is a
parsed machine comment in extended SELEX, and is now deprecated.

\item
All other lines beginning with a \verb+%+ or \verb+#+ as the first
character are user comments. User comments are ignored by all
software. Anything may appear on these lines.
Any number of comments
may be included in a SELEX file, and at any point.

\item
Lines of data consist of a name followed by a sequence. The total
length of the line must be smaller than 4096 characters.

\item
Names must be a single word: any non-whitespace characters are
accepted, but no spaces are tolerated. Names must be less than 32
characters long.

\item In the sequence, any of the characters \verb+-_.+ or a space are
recognized as gaps. Any other characters are interpreted as sequence.
Sequence is case-sensitive. There is a common assumption by my
software that upper-case symbols are used for consensus (match)
positions and lower-case symbols are used for inserts. This language
of ``match'' versus ``insert'' comes from the hidden Markov model
formalism \cite{Krogh94}. To almost all of my software, this isn't
important, and it immediately converts the sequence to all upper-case
after it's read.

\item
Multiple different sequences are grouped in a block of data lines.
Blocks are separated by blank lines. No blank lines are tolerated