between the sequence lines in a block. Each block in a multi-block
file of a long alignment must have its sequences in the same order in
each block. The names are checked to verify that this is the case; if
not, only a warning is generated. (In manually constructed files, some
users may wish to use shorthand names in subsequent blocks after an
initial block with full names -- but this isn't recommended.)
\end{enumerate}

\section{Stockholm, the recommended multiple sequence alignment format}

While we recommend a community standard format (FASTA) for unaligned
sequence files, the recommended multiple alignment file format is not
a community standard. The Pfam Consortium developed a format (based
on extended SELEX) called ``Stockholm format''. The reasons for this
are two-fold. First, there really is no standard accepted format for
multiple sequence alignment files, so we don't feel guilty about
inventing a new one. Second, the formats of popular multiple alignment
software (e.g. CLUSTAL, GCG MSF, PHYLIP) do not support rich
documentation and markup of the alignment. Stockholm format was
developed to support extensible markup of multiple sequence
alignments, and we use this capability extensively in both RNA work
(with structural markup) and the Pfam database (with extensive use of
both annotation and markup).

\subsection{A minimal Stockholm file}

\begin{verbatim}
# STOCKHOLM 1.0

seq1 ACDEF...GHIKL
seq2 ACDEF...GHIKL
seq3 ...EFMNRGHIKL

seq1 MNPQTVWY
seq2 MNPQTVWY
seq3 MNPQT...
\end{verbatim}

The simplest Stockholm file is pretty intuitive, easily generated in a
text editor. It is usually easy to convert alignment formats into a
``least common denominator'' Stockholm format. For instance, SELEX,
GCG's MSF format, and the output of the CLUSTALV multiple alignment
program are all similar interleaved formats.

The first line in the file must be \verb+# STOCKHOLM 1.x+, where
\verb+x+ is a minor version number for the format specification
(and which currently has no effect on my parsers).
This line allows a
parser to instantly identify the file format.

In the alignment, each line contains a name, followed by the aligned
sequence. A dash or period denotes a gap. If the alignment is too long
to fit on one line, the alignment may be split into multiple blocks,
with blocks separated by blank lines. The number of sequences, their
order, and their names must be the same in every block. Within a given
block, each (sub)sequence (and any associated \verb+#=GR+ and
\verb+#=GC+ markup, see below) is of equal length, called the
\textit{block length}. Block lengths may differ from block to block;
the block length must be at least one residue, and there is no
maximum.

Other blank lines are ignored. You can add comments to the file on
lines starting with a \verb+#+.

All other annotation is added using a tag/value comment style. The
tag/value format is inherently extensible, and readily made
backwards-compatible; unrecognized tags will simply be ignored. Extra
annotation includes consensus and individual RNA or protein secondary
structure, sequence weights, a reference coordinate system for the
columns, and database source information including name, accession
number, and coordinates (for subsequences extracted from a longer
source sequence). See below for details.

\subsection{Syntax of Stockholm markup}

There are four types of Stockholm markup annotation, for per-file,
per-sequence, per-column, and per-residue annotation:

\begin{wideitem}
\item {\emprog{\#=GF <tag> <s>}}
  Per-file annotation. \prog{<s>} is a free format text line
  of annotation type \prog{<tag>}. For example, \prog{\#=GF DATE
  April 1, 2000}. Can occur anywhere in the file, but usually
  all the \prog{\#=GF} markups occur in a header.

\item {\emprog{\#=GS <seqname> <tag> <s>}}
  Per-sequence annotation. \prog{<s>} is a free format text line
  of annotation type \prog{<tag>} associated with the sequence
  named \prog{<seqname>}. For example, \prog{\#=GS seq1
  SPECIES\_SOURCE Caenorhabditis elegans}.
  Can occur anywhere
  in the file, but in single-block formats (e.g. the Pfam distribution)
  will typically follow on the line after the sequence itself, and
  in multi-block formats (e.g. HMMER output), will typically occur
  in the header preceding the alignment but following the \prog{\#=GF}
  annotation.

\item {\emprog{\#=GC <tag> <...s...>}}
  Per-column annotation. \prog{<...s...>} is an aligned text line
  of annotation type \prog{<tag>}.
  \verb+#=GC+ lines are associated with a sequence alignment block;
  \prog{<...s...>} is aligned to the residues in the alignment block,
  and has the same length as the rest of the block.
  Typically \verb+#=GC+ lines are placed at the end of each block.

\item {\emprog{\#=GR <seqname> <tag> <.....s.....>}}
  Per-residue annotation. \prog{<.....s.....>} is an aligned text line
  of annotation type \prog{<tag>}, associated with the sequence
  named \prog{<seqname>}.
  \verb+#=GR+ lines are associated with one sequence in a sequence
  alignment block; \prog{<.....s.....>} is aligned to the residues in
  that sequence, and has the same length as the rest of the block.
  Typically \verb+#=GR+ lines are placed immediately following the
  aligned sequence they annotate.
\end{wideitem}

\subsection{Semantics of Stockholm markup}

Any Stockholm parser will accept syntactically correct files, but is
not obligated to do anything with the markup lines. It is up to the
application whether it will attempt to interpret the meaning (the
semantics) of the markup in a useful way. At the two extremes are the
Belvu alignment viewer and the HMMER profile hidden Markov model
software package.

Belvu simply reads Stockholm markup and displays it, without trying to
interpret it at all. The tag types (\prog{\#=GF}, etc.) are sufficient
to tell Belvu how to display the markup: whether it is attached to the
whole file, sequences, columns, or residues.

HMMER uses Stockholm markup to pick up a variety of information from
the Pfam multiple alignment database.
The Pfam consortium therefore
agrees on additional syntax for certain tag types, so HMMER can parse
some markups for useful information. This additional syntax is imposed
by Pfam, HMMER, and other software of mine, not by Stockholm format
per se. You can think of Stockholm as akin to XML, and what my
software reads as akin to an XML DTD, if you're into that sort of
structured data format lingo.

The Stockholm markup tags that are parsed semantically by my software
are as follows:

\subsubsection{Recognized \#=GF annotations}

\begin{wideitem}
\item [\emprog{ID <s>}]
  Identifier. \emprog{<s>} is a name for the alignment;
  e.g. ``rrm''. One word. Unique in file.

\item [\emprog{AC <s>}]
  Accession. \emprog{<s>} is a unique accession number for the
  alignment; e.g. ``PF00001''. Used by the Pfam database, for instance.
  Often an alphabetical prefix indicating the database
  (e.g. ``PF'') followed by a unique numerical accession.
  One word. Unique in file.

\item [\emprog{DE <s>}]
  Description. \emprog{<s>} is a free format line giving
  a description of the alignment; e.g.
  ``RNA recognition motif proteins''. One line. Unique in file.

\item [\emprog{AU <s>}]
  Author. \emprog{<s>} is a free format line listing the
  authors responsible for an alignment; e.g.
  ``Bateman A''. One line. Unique in file.

\item [\emprog{GA <f> <f>}]
  Gathering thresholds. Two real numbers giving HMMER bit score
  per-sequence and per-domain cutoffs used in gathering the
  members of Pfam full alignments. See Pfam and HMMER
  documentation for more detail.

\item [\emprog{NC <f> <f>}]
  Noise cutoffs. Two real numbers giving HMMER bit score
  per-sequence and per-domain cutoffs, set according to the
  highest scores seen for unrelated sequences when gathering
  members of Pfam full alignments. See Pfam and HMMER
  documentation for more detail.

\item [\emprog{TC <f> <f>}]
  Trusted cutoffs.
  Two real numbers giving HMMER bit score
  per-sequence and per-domain cutoffs, set according to the
  lowest scores seen for true homologous sequences that
  were above the GA gathering thresholds, when gathering
  members of Pfam full alignments. See Pfam and HMMER
  documentation for more detail.
\end{wideitem}

\subsubsection{Recognized \#=GS annotations}

\begin{wideitem}
\item [\emprog{WT <f>}]
  Weight. \emprog{<f>} is a positive real number giving the
  relative weight for a sequence, usually used to compensate
  for biased representation by downweighting similar sequences.
  Usually the weights average 1.0 (i.e. the weights sum to
  the number of sequences in the alignment) but this is not
  required. Either every sequence must have a weight annotated,
  or none of them can.

\item [\emprog{AC <s>}]
  Accession. \emprog{<s>} is a database accession number for
  this sequence. (Compare the \prog{\#=GF AC} markup, which gives
  an accession for the whole alignment.) One word.

\item [\emprog{DE <s>}]
  Description. \emprog{<s>} is one line giving a description for
  this sequence. (Compare the \prog{\#=GF DE} markup, which gives
  a description for the whole alignment.)
\end{wideitem}

\subsubsection{Recognized \#=GC annotations}

\begin{wideitem}
\item [\emprog{RF}]
  Reference line. Any character is accepted as a markup for a
  column. The intent is to allow labeling the columns with some
  sort of mark.

\item [\emprog{SS\_cons}]
  Secondary structure consensus. For protein alignments,
  DSSP codes or gaps are accepted as markup: [HGIEBTSCX.-\_],
  where H is alpha helix, G is 3/10-helix, I is $\pi$-helix,
  E is extended strand, B is a residue in an isolated $\beta$-bridge,
  T is a turn, S is a bend, C is a random coil or loop, and
  X is unknown (for instance, a residue that was not resolved in
  a crystal structure). For RNA alignments the symbols \verb+>+
  and \verb+<+ are used for base pairs (pairs point at each other).
  \verb-+- indicates definitely single-stranded positions, and any
  gap symbol indicates unassigned bases or single-stranded positions.
  This description roughly follows \cite{Konings89}.
  RNA pseudoknots are represented by alphabetic characters, with
  upper case letters representing the 5' side of the helix and
  lower case letters representing the 3' side. Note that this
  limits the annotation to a maximum of 26 pseudoknots per sequence.

\item [\emprog{SA\_cons}]
  Surface accessibility consensus. 0-9, gap symbols, or X are
  accepted as markup. 0 means $<$10\% accessible residue surface
  area, 1 means $<$20\%, 9 means $<$100\%, etc. X means unknown
  structure.
\end{wideitem}

\subsubsection{Recognized \#=GR annotations}

\begin{wideitem}
\item [\emprog{SS}]
  Per-residue secondary structure for the named sequence. Uses the
  same symbols as \prog{\#=GC SS\_cons} above.

\item [\emprog{SA}]
  Per-residue surface accessibility for the named sequence. Uses the
  same symbols as \prog{\#=GC SA\_cons} above.
\end{wideitem}
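
As an illustration of the syntax rules above, the following is a
minimal Python sketch of a Stockholm reader. It is not the parser used
by HMMER or Pfam. The function name \verb+parse_stockholm+ and the
layout of the dictionary it returns are hypothetical choices made for
this example. The sketch also assumes that a line containing only
\verb+//+ ends the alignment, a convention that is not described in
this section, and it checks only that all aligned lines have the same
total length rather than checking the length of each block separately.

```python
def parse_stockholm(text):
    """Parse one Stockholm alignment from a string (illustrative sketch only)."""
    lines = text.splitlines()
    if not lines or not lines[0].startswith("# STOCKHOLM 1."):
        raise ValueError("first line must be '# STOCKHOLM 1.x'")

    # Hypothetical result layout: one dictionary per markup type.
    aln = {"gf": {}, "gs": {}, "gc": {}, "gr": {}, "seqs": {}}
    first_block = None  # sequence names of the first block, in order
    block = []          # sequence names seen so far in the current block

    def close_block():
        # Names and their order must be identical in every block.
        nonlocal first_block, block
        if block:
            if first_block is None:
                first_block = block
            elif block != first_block:
                raise ValueError("sequence names or order differ between blocks")
        block = []

    for raw in lines[1:]:
        line = raw.strip()
        if line == "//":  # assumed end-of-alignment marker
            break
        if not line:      # a blank line separates blocks
            close_block()
            continue
        fields = line.split()
        kind = fields[0]
        if kind == "#=GF" and len(fields) >= 2:
            value = line.split(None, 2)[2] if len(fields) > 2 else ""
            aln["gf"].setdefault(fields[1], []).append(value)
        elif kind == "#=GS" and len(fields) >= 3:
            value = line.split(None, 3)[3] if len(fields) > 3 else ""
            per_seq = aln["gs"].setdefault(fields[1], {})
            per_seq.setdefault(fields[2], []).append(value)
        elif kind == "#=GC" and len(fields) == 3:
            # Per-column markup is concatenated across blocks.
            aln["gc"][fields[1]] = aln["gc"].get(fields[1], "") + fields[2]
        elif kind == "#=GR" and len(fields) == 4:
            key = (fields[1], fields[2])  # (sequence name, tag)
            aln["gr"][key] = aln["gr"].get(key, "") + fields[3]
        elif line.startswith("#"):
            continue  # a comment, or markup this sketch does not understand
        elif len(fields) == 2:
            name, seq = fields
            block.append(name)
            aln["seqs"][name] = aln["seqs"].get(name, "") + seq
        else:
            raise ValueError("cannot parse line: " + line)
    close_block()

    # Every aligned line must end up with the same total length.
    lengths = {len(s) for s in aln["seqs"].values()}
    lengths |= {len(s) for s in aln["gc"].values()}
    lengths |= {len(s) for s in aln["gr"].values()}
    if len(lengths) > 1:
        raise ValueError("aligned lines do not all have the same total length")
    return aln
```

Free-text values for \verb+#=GF+ and \verb+#=GS+ are kept in lists,
because a tag may appear on more than one line. Aligned markup from
\verb+#=GC+ and \verb+#=GR+ lines is concatenated block by block in the
same way as the sequences, so that column indices agree across all of
them. Unrecognized tags are stored but not interpreted, in keeping with
the separation between syntax and semantics described above.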