edialign

Function

Local multiple alignment of sequences

Description

edialign is an EMBOSS version of the program DIALIGN2 by B. Morgenstern. It takes nucleic acid or protein sequences as input and produces a multiple sequence alignment as output. The sequences need not be similar over their complete length, since the program constructs alignments from gap-free pairs of similar segments of the sequences. Such segment pairs are referred to as "diagonals". If (possibly) coding nucleic acid sequences are to be aligned, edialign can optionally translate the compared "nucleic acid segments" to "peptide segments", or even perform comparisons at both the nucleic acid and protein levels, to increase the sensitivity of the comparison.

Algorithm

For a complete explanation of the algorithm, see the references. In short:

As described in our papers, the program DIALIGN constructs alignments from gap-free pairs of similar segments of the sequences. Such segment pairs are referred to as "diagonals". Every possible diagonal is given a so-called weight reflecting the degree of similarity between the two segments involved. The overall score of an alignment is then defined as the sum of the weights of the diagonals it consists of, and the program tries to find an alignment with maximum score; in other words, the program tries to find a consistent collection of diagonals with maximum sum of weights. This novel scoring scheme for alignments is the basic difference between DIALIGN and other global or local alignment methods. Note that DIALIGN does not employ any kind of gap penalty.

It is possible to use a threshold T for the quality of the diagonals. In this case, a diagonal is considered for alignment only if its weight exceeds this threshold, and regions of lower similarity are ignored. In the first version of the program (DIALIGN 1), this threshold was in many situations absolutely necessary to obtain meaningful alignments.
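
The scoring idea described so far (weighted diagonals, an optional threshold T, and a consistent set with maximum sum of weights) can be illustrated for two sequences with a small chaining sketch. This is not the DIALIGN implementation: the `Diagonal` class, the function names and the weights are hypothetical, the weights are plain illustrative numbers rather than values computed by the program, and "consistent" is simplified here to "the diagonals neither overlap nor cross" in a pairwise comparison.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Diagonal:
    """Gap-free segment pair: seq1[i:i+length] paired with seq2[j:j+length]."""
    i: int
    j: int
    length: int
    weight: float  # illustrative value, not a real DIALIGN weight


def apply_threshold(diagonals: List[Diagonal], t: float = 0.0) -> List[Diagonal]:
    """Keep only diagonals whose weight exceeds the threshold T."""
    return [d for d in diagonals if d.weight > t]


def best_consistent_set(diagonals: List[Diagonal]) -> Tuple[float, List[Diagonal]]:
    """Maximum-weight chain of diagonals that neither overlap nor cross."""
    if not diagonals:
        return 0.0, []
    # A predecessor always ends earlier in seq1, so sorting by end position
    # guarantees that every possible predecessor is processed first.
    order = sorted(diagonals, key=lambda d: (d.i + d.length, d.j + d.length))
    best = [d.weight for d in order]  # best score of a chain ending at order[k]
    prev = [-1] * len(order)          # predecessor index within that chain
    for k, d in enumerate(order):
        for m in range(k):
            p = order[m]
            fits_before = p.i + p.length <= d.i and p.j + p.length <= d.j
            if fits_before and best[m] + d.weight > best[k]:
                best[k] = best[m] + d.weight
                prev[k] = m
    k = max(range(len(order)), key=lambda n: best[n])
    score, chain = best[k], []
    while k != -1:  # walk back through the predecessors
        chain.append(order[k])
        k = prev[k]
    return score, chain[::-1]


# Hypothetical diagonals between two sequences (positions and weights invented).
diagonals = [
    Diagonal(i=0, j=0, length=5, weight=4.0),
    Diagonal(i=3, j=6, length=4, weight=3.0),    # overlaps the first in seq1
    Diagonal(i=6, j=7, length=5, weight=5.0),
    Diagonal(i=12, j=3, length=3, weight=2.5),   # crosses the third
    Diagonal(i=12, j=13, length=3, weight=0.5),  # weak, removed when T = 1.0
]

score_t0, chain_t0 = best_consistent_set(apply_threshold(diagonals, 0.0))
score_t1, chain_t1 = best_consistent_set(apply_threshold(diagonals, 1.0))
print(score_t0, len(chain_t0))
print(score_t1, len(chain_t1))
```

With these invented values the sketch selects three diagonals with a total score of 9.5 at T = 0, and two diagonals with a total score of 9.0 at T = 1.0, because the weak diagonal no longer passes the threshold. Aligning more than two sequences requires a stricter notion of consistency across all sequences, which this pairwise sketch does not attempt.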

In contrast to DIALIGN 1, DIALIGN 2 should produce reasonable alignments without a threshold, i.e. with T = 0. This is the most important difference between DIALIGN 2 and the first version of the program. Nevertheless, it is still possible to use a positive threshold T to filter out regions of lower significance and to include only high-scoring diagonals in the alignment.

The use of overlap weights improves the sensitivity of the program if multiple sequences are aligned, but it also increases the running time, especially if large numbers of sequences are aligned. By default, "overlap weights" are used if up to 35 sequences are aligned but switched off for larger data sets.

If (possibly) coding nucleic acid sequences are to be aligned, DIALIGN optionally translates the compared "nucleic acid segments" to "peptide segments" according to the genetic code, without presupposing any of the three possible reading frames, so all combinations of reading frames are checked for significant similarity. If this option is used, the similarity among segments is assessed on the "peptide level" rather than on the "nucleic acid level".

For the levels of sequence similarity, release 2.2 of DIALIGN has two additional options:

* It can measure the similarity among segment pairs at both levels of similarity (nucleotide-level and peptide-level similarity). The score of a fragment is based on whichever similarity is stronger. As a result, the program can now produce mixed alignments that contain both types of fragments. Fragments with stronger similarity at the "nucleotide level" are referred to as N-fragments, whereas fragments with stronger similarity at the peptide level are called P-fragments.
* If the translation or mixed alignment option is used, it is possible to consider the reverse complements of segments, too. In this case, both the original segments and their reverse complements are translated and both pairs of implied "peptide segments" are compared.
  This option is useful if DNA sequences contain coding regions not only on the "Watson strand" but also on the "Crick strand".

The score that DIALIGN assigns to a fragment is based on the probability of finding a fragment of the same length and number of matches (or BLOSUM values, if the translation option is used) in random sequences of the same length as the input sequences. If long genomic sequences are aligned, an iterative procedure can be applied where the program first looks for fragments with strong similarity. In subsequent steps, regions between these fragments are realigned. Here, the score of a fragment is based on random occurrence in these regions between the previously aligned segment pairs.

Usage

Here is a sample session with edialign

   % edialign
   Local multiple alignment of sequences
   Input sequence set: vtest.seq
   Output file [vtest.edialign]:
   (gapped) output sequence(s) [vtest.fasta]:

Command line arguments

   Standard (Mandatory) qualifiers:
  [-sequences]         seqset     Sequence set filename and optional format,
                                  or reference (input USA)
  [-outfile]           outfile    [*.edialign] Output file name
  [-outseq]            seqoutall  [.] (Aligned) sequence set(s) filename and
                                  optional format (output USA)

   Additional (Optional) qualifiers (* if not always prompted):
*  -nucmode            menu       [n] Nucleic acid sequence alignment mode
                                  (simple, translated or mixed) (Values: n
                                  (simple); nt (translation); ma (mixed
                                  alignments))
*  -revcomp            boolean    [N] Also consider the reverse complement
   -overlapw           selection  [default (when Nseq =< 35)] By default
                                  overlap weights are used when Nseq =<35 but
                                  you can set this to 'yes' or 'no'
   -linkage            menu       [UPGMA] Clustering method to construct
                                  sequence tree (UPGMA, minimum linkage or
                                  maximum linkage) (Values: UPGMA (UPGMA); max
                                  (maximum linkage); min (minimum linkage))
   -maxfragl           integer    [40] Maximum fragment length (Integer 0 or
                                  more)
*  -fragmat            boolean    [N] Consider only N-fragment pairs that
                                  start with two matches
*  -fragsim            integer    [4] Consider only P-fragment pairs if first
                                  amino acid or codon pair has similarity
                                  score of at least n (Integer 0 or more)
   -itscore            boolean    [N] Use iterative score
   -threshold          float      [0.0] Threshold for considering diagonal for
                                  alignment (Number 0.000 or more)

   Advanced (Unprompted) qualifiers:
   -mask               boolean    [N] Replace unaligned characters by stars
                                  '*' rather than putting them in lowercase
   -dostars            boolean    [N] Activate writing of stars instead of
                                  numbers
   -starnum            integer    [4] Put up to n stars '*' instead of digits
                                  0-9 to indicate level of conservation
                                  (Integer 0 or more)

   Associated qualifiers:

   "-sequences" associated qualifiers
   -sbegin1            integer    Start of each sequence to be used
   -send1              integer    End of each sequence to be used
   -sreverse1          boolean    Reverse (if DNA)
   -sask1              boolean    Ask for begin/end/reverse
   -snucleotide1       boolean    Sequence is nucleotide
   -sprotein1          boolean    Sequence is protein
   -slower1            boolean    Make lower case
   -supper1            boolean    Make upper case
   -sformat1           string     Input sequence format
   -sdbname1           string     Database name
   -sid1               string     Entryname
   -ufo1               string     UFO features
   -fformat1           string     Features format
   -fopenfile1         string     Features file name

   "-outfile" associated qualifiers
   -odirectory2        string     Output directory

   "-outseq" associated qualifiers
   -osformat3          string     Output seq format
   -osextension3       string     File name extension
   -osname3            string     Base file name
   -osdirectory3       string     Output directory
   -osdbname3          string     Database name to add
   -ossingle3          boolean    Separate file for each entry
   -oufo3              string     UFO features
   -offormat3          string     Features format
   -ofname3            string     Features file name
   -ofdirectory3       string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write standard output
   -filter             boolean    Read standard input, write standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages

Input file format

edialign reads any normal sequence USA. The input must contain at least two sequences. Both protein and nucleic acid sequences are accepted, but the two types cannot be mixed in one run.

Input files for usage example
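
The two input requirements stated under "Input file format" (at least two sequences, and no mixing of protein and nucleic acid) can be checked before a run. The following is a minimal sketch under stated assumptions: the input is taken to be FASTA text, the sequences are invented for illustration and are not the contents of vtest.seq, and the nucleotide test is a simple alphabet heuristic that may differ from how EMBOSS itself decides the sequence type. All function names are hypothetical.

```python
from typing import Dict, List

# Assumption: a simplified nucleotide alphabet, not the full IUPAC code.
NUCLEIC_LETTERS = set("ACGTUN")


def parse_fasta(text: str) -> Dict[str, str]:
    """Read FASTA text into a mapping of entry name to upper-case sequence."""
    records: Dict[str, str] = {}
    name = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else f"seq{len(records) + 1}"
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records


def looks_nucleic(seq: str) -> bool:
    """Heuristic: non-empty and built only from the simplified alphabet."""
    return bool(seq) and set(seq) <= NUCLEIC_LETTERS


def check_input(records: Dict[str, str]) -> List[str]:
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    if len(records) < 2:
        problems.append("fewer than two sequences")
    kinds = {looks_nucleic(seq) for seq in records.values()}
    if len(kinds) > 1:
        problems.append("protein and nucleic acid sequences are mixed")
    return problems


# Invented example inputs.
nucleic_set = ">a\nACGTACGTGA\n>b\nACGTTCGTGA\n"
mixed_set = ">a\nACGTACGTGA\n>p\nMKTAYIAKQR\n"
single_set = ">a\nACGT\n"

for label, text in [("nucleic", nucleic_set), ("mixed", mixed_set), ("single", single_set)]:
    print(label, check_input(parse_fasta(text)))
```

A short protein made up only of the letters A, C, G, T and N would be misread as nucleic acid by this heuristic; in such cases the -snucleotide1 and -sprotein1 qualifiers listed above state the type explicitly.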