\section{SELEX alignment file format}

\subsection{Example of a simple SELEX format file}

\begin{verbatim}
# Example selex file

seq1     ACGACGACGACG.
seq2     ..GGGAAAGG.GA
seq3     UUU..AAAUUU.A

seq1  ..ACG
seq2  AAGGG
seq3  AA...UUU
\end{verbatim}

SELEX is an interleaved multiple alignment format that evolved as an
intuitive format.  SELEX files are easy to write and manipulate
manually with a text editor.  It is usually easy to convert other
alignment formats into SELEX format; the output of the CLUSTALV
multiple alignment program and GCG's MSF format are similar
interleaved formats. Because it evolved to accommodate different user
input styles, it is very tolerant of various inconsistencies such as
different gap symbols, varying line lengths, etc.

As the format evolved, more features have been added. To maintain
compatibility with past alignment files, the new features are added
using a reserved comment style. These extra features are usually
maintained by automated SELEX-generating software, such as the {\tt
koala} sequence alignment editor or my {\tt cove} and {\tt hmm} sequence
analysis packages. This extra information includes consensus and
individual RNA or protein secondary structure, per-sequence weights, a
reference coordinate system for the columns, and database source
information including name, accession number, and coordinates (for
subsequences extracted from a longer source sequence).

\subsection{Specification of a SELEX file}

\begin{enumerate}
\item
Any line beginning with \verb+#=+ as its first two characters is a
machine ``comment''.  \verb+#=+ comments are reserved for additional
data about the alignment. Usually these features are maintained by
software such as the {\tt koala} editor, not by hand.

\item
Any other line beginning with \verb+%+ or \verb+#+ as its first
character is a user comment.  User comments are ignored by all
software. Any number of comments may be included.

\item
Lines of data consist of a name followed by a sequence.
The total length of the line must be less than 1024 characters.

\item
Names must be a single word. Any non-whitespace characters are
accepted.  No spaces are tolerated in names: names MUST be a
single word.

\item
In the sequence, any of the characters \verb+-_.+ or a space are
recognized as gaps. Gaps are converted to \verb+.+. Any other characters
are interpreted as sequence.  Sequence is case-sensitive. There is a
common assumption by my software that upper-case symbols are used for
consensus (match) positions and lower-case symbols are used for
inserts. This language of ``match'' versus ``insert'' comes from the
hidden Markov model formalism \cite{Krogh94}. To almost all of my
software, this distinction isn't important, and it immediately converts
the sequence to all upper-case after reading it.

\item
Multiple different sequences are grouped in a block of data lines.
Blocks are separated by blank lines. No blank lines are tolerated
between the sequence lines in a block. In a multi-block file of a long
alignment, every block must list its sequences in the same order.
The names are checked to verify that this is the case; if
not, only a warning is generated. (In manually constructed files, some
users may wish to use shorthand names after the first block with full
names, but this isn't recommended.)
\end{enumerate}

\subsection{Special comments}

\subsubsection{Secondary structure}

I use one-letter codes to indicate secondary structures. Secondary
structure strings are aligned to sequence blocks just like additional
sequences.

For RNA secondary structure, the symbols \verb+>+ and \verb+<+ are
used for base pairs (pairs point at each other).  \verb-+- indicates
other single-stranded positions, and {\tt .} indicates unassigned bases.
This description follows \cite{Konings89}.
For protein secondary
structure, I use {\tt E} to indicate residues in $\beta$-sheet, {\tt
H} for those in $\alpha$-helix, {\tt L} for those in loops, and {\tt
.} for unassigned residues.

RNA pseudoknots are represented by alphabetic characters, with upper
case letters representing the 5' side of the helix and lower case
letters representing the 3' side. Note that this restricts the
annotation to a maximum of 26 pseudoknots per sequence.

Lines beginning with \verb+#=SS+ or \verb+#=CS+ are individual or
consensus secondary structure data, respectively.  \verb+#=SS+
individual secondary structure lines must immediately follow the
sequence they are associated with.  There can only be one \verb+#=SS+
per sequence. \verb+#=CS+ consensus secondary structure predictions
precede all the sequences in each block. There can only be one
\verb+#=CS+ per file.

\subsubsection{Reference coordinate system}

Alignments are usually numbered by some reference coordinate system,
often a canonical molecule. For instance, tRNA positions are numbered
by reference to the positions of yeast tRNA-Phe.

A line beginning with \verb+#=RF+ preceding the sequences in a block
gives a reference coordinate system. Any non-gap symbol in the
\verb+#=RF+ line indicates that sequence positions in its columns are
numbered. For instance, the \verb+#=RF+ lines for a tRNA alignment
would have 76 non-gap symbols for the canonical numbered columns; they
might be the aligned tRNA-Phe sequence itself, or they might be just
X's.

\subsubsection{Sequence header}

Additional per-sequence information can be placed in a header before
any blocks appear.
These lines, one per sequence and in exactly the
same order as the sequences appear in the alignment, are formatted
like
\verb+#=SQ <seqname> <weight> <database source name> <database accession> <source coordinates as start..stop::original length> <description>+.

This information includes a sequence weight (for compensating for
biased representation of subfamilies of sequences in the alignment);
source information, if the sequence came from a database, consisting
of identifier, accession number, and source coordinates; and a
description of the sequence.

If a \verb+#=SQ+ line is present, all the fields must be present.  If
no information is available for a field, use \verb+-+ for all the fields
except the source coordinates, which would be given as \verb+0+.

\subsubsection{Author}

The first non-comment, non-blank line of the file may be a \verb+#=AU+
``author'' line. There is a programmatic interface for
alignment-generating programs to record a short comment like
\verb+11 November 1993, by Feng-Doolittle v. 2.1.1+, and this comment will be
recorded on the \verb+#=AU+ line by \verb+WriteSELEX()+.
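
\subsection{Example of an annotated SELEX file}

The special comments described above might be combined as in the
following sketch. It is a hand-built illustration, not output from
real software: the sequences, the database name {\tt MYDB}, the
accession numbers, the source coordinates, the weights, and the author
text are all invented. The relative order of the \verb+#=RF+ and
\verb+#=CS+ lines, and the blank line between the header and the
block, are assumptions; the specification above only requires that
both lines precede the sequences in a block and that the header
precede any blocks.

\begin{verbatim}
# Hypothetical annotated SELEX file
#=AU  hand-built example
#=SQ  seq1 1.0 MYDB ACC00001 1..12::76 invented hairpin fragment
#=SQ  seq2 0.5 MYDB ACC00002 5..15::80 invented hairpin fragment
#=SQ  seq3 0.5 - - 0 -

#=RF  XXXXXXXXXXXX
#=CS  >>>>++++<<<<
seq1  GGCAUUCGUGCC
#=SS  >>>>++++<<<<
seq2  GGCA.UCGUGCC
seq3  GCCAUUC.UGGC
\end{verbatim}

In this sketch the \verb+#=SS+ line applies to {\tt seq1}, the sequence
immediately above it. The \verb+#=RF+ line marks all twelve columns as
numbered, and {\tt seq3} has no database source, so its source fields
are \verb+-+ and its coordinates are \verb+0+.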