one has a high quality reference alignment and wishes to keep it fixed while adding new sequences automatically.

Terminal Gaps

In the original Clustal V program, terminal gaps were penalised the same as all other gaps. This caused some ugly side effects, e.g.

acgtacgtacgtacgt                            acgtacgtacgtacgt
a----cgtacgtacgt   gets the same score as   ----acgtacgtacgt

NOW, terminal gaps are free. This is better on average and stops silly effects like single residues jumping to the edge of the alignment. However, it is not perfect. It does mean that if there should be a gap near the end of the alignment, the program may be reluctant to insert it, i.e.

cccccgggccccc                                                cccccgggccccc
ccccc---ccccc   may be considered worse (lower score) than   cccccccccc---

In the right hand case above, the terminal gap is free and may score higher than the left hand alignment. This can be prevented by lowering the gap opening and extension penalties. It is difficult to get this right all the time. Please watch the ends of your alignments.

Speed of the initial (pairwise) alignments (fast approximate/slow accurate)

By default, the initial pairwise alignments are now carried out using a full dynamic programming algorithm. This is more accurate than the older hash/k-tuple based alignments (Wilbur and Lipman) but is MUCH slower. On a fast workstation you may not notice but on a slow box, the difference is extreme. You can easily set the alignment method from the menus to the older, faster method.

Delaying alignment of distant sequences

The user can set a cut off to delay the alignment of the most divergent sequences in a data set until all other sequences have been aligned. By default, this is set to 40% which means that if a sequence is less than 40% identical to any other sequence, its alignment will be delayed.

Iterative realignment/Reset gaps between alignments

By default, if you align a set of sequences a second time (e.g. with changed gap penalties), the gaps from the first alignment are discarded.
You can set this from the menus so that older gaps will be kept between alignments. Keeping the gaps (not resetting them) and doing the full multiple alignment a second time can sometimes give better alignments. Sometimes, the alignment will converge on a better solution; sometimes the new alignment will be the same as the first. There can be a strange side effect: you can get columns of nothing but gaps introduced.

Any gaps that are read in from the input file are always kept, regardless of the setting of this switch. If you read in a full multiple alignment, the "reset gaps" switch has no effect. The old gaps will remain and if you carry out a multiple alignment, any new gaps will be added in. If you wish to carry out a full new alignment of a set of sequences that are already aligned in a file you must input the sequences without gaps.

Profile alignment

By profile alignment, we simply mean the alignment of old alignments/sequences. In this context, a profile is just an existing alignment (or even a set of unaligned sequences; see below). This allows you to read in an old alignment (in any of the allowed input formats) and align one or more new sequences to it. From the profile alignment menu, you are allowed to read in 2 profiles. Either profile can be a full alignment OR a single sequence. In the simplest mode, you simply align the two profiles to each other. This is useful if you want to gradually build up a full multiple alignment.

A second option is to align the sequences from the second profile, one at a time, to the first profile. This is done taking the underlying tree between the sequences into account.
This is useful if you have a set of new sequences (not aligned) and you wish to add them all to an older alignment.

Changes to the phylogenetic tree calculations and some hints

Improved distance calculations for protein trees

The phylogenetic trees in Clustal W (the real trees that you calculate AFTER alignment; not the guide trees used to decide the branching order for multiple alignment) use the Neighbor-Joining method of Saitou and Nei based on a matrix of "distances" between all sequences. These distances can be corrected for "multiple hits". This is normal practice when accurate trees are needed. This correction stretches distances (especially large ones) to try to correct for the fact that OBSERVED distances (mean number of differences per site) greatly underestimate the actual number that happened during evolution.

In Clustal V we used a simple formula to convert an observed distance to one that is corrected for multiple hits. The observed distance is the mean number of differences per site in an alignment (ignoring sites with a gap) and is therefore always between 0.0 (for identical sequences) and 1.0 (no residues the same at any site). These distances can be multiplied by 100 to give percent difference values. 100 minus percent difference gives percent identity. The formula we use to correct for multiple hits is from Motoo Kimura (Kimura, M. The Neutral Theory of Molecular Evolution, Camb. Univ. Press, 1983, page 75) and is:

    K = -ln(1 - D - (D*D)/5)

where D is the observed distance and K is the corrected distance. This formula gives the mean number of estimated substitutions per site and, in contrast to D (the observed number), can be greater than 1, i.e. more than one substitution per site, on average. For example, if you observe 0.8 differences per site (80% difference; 20% identity), then the above formula predicts that there have been about 2.6 substitutions per site over the course of evolution since the 2 sequences diverged.
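As an illustration only, the correction formula above can be evaluated with a few lines of Python. This is a minimal sketch, not the code used inside Clustal W: the function name kimura_corrected_distance is hypothetical, and the sketch simply refuses to return a value where the argument of the logarithm is no longer positive.

```python
import math


def kimura_corrected_distance(d):
    """Apply Kimura's correction K = -ln(1 - D - D*D/5) to an observed distance.

    d is the observed distance: the fraction of differing sites, 0.0 to 1.0.
    The function name is hypothetical and chosen for this sketch only.
    """
    if not 0.0 <= d <= 1.0:
        raise ValueError("observed distance must lie between 0.0 and 1.0")
    inner = 1.0 - d - (d * d) / 5.0
    if inner <= 0.0:
        # The logarithm is undefined here, so the formula cannot be applied.
        raise ValueError("formula is undefined for this observed distance")
    return -math.log(inner)


if __name__ == "__main__":
    # Print observed distance, percent identity and corrected distance.
    for d in (0.1, 0.3, 0.5, 0.7, 0.8):
        k = kimura_corrected_distance(d)
        identity = 100.0 * (1.0 - d)
        print(f"D = {d:.2f}  identity = {identity:.0f}%  K = {k:.3f}")
```

Running the script prints one line per observed distance and shows how the corrected value K grows much faster than D as the sequences become more divergent.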
The corrected distance can also be expressed in PAM units by multiplying by 100 (mean number of substitutions per 100 residues). The PAM scale of evolution and its derivation/calculation comes from the work of Margaret Dayhoff and co-workers (the famous Dayhoff PAM series of weight matrices also came from this work). Dayhoff et al. constructed an elaborate model of protein evolution based on observed frequencies of substitution between very closely related proteins. Using this model, they derived a table relating observed distances to predicted PAM distances. Kimura's formula, above, is just a "curve fitting" approximation to this table. It is very accurate in the range 0.75 > D > 0.0 but becomes increasingly inaccurate at high D (>0.75) and fails completely at around D = 0.85.

To circumvent this problem, we calculated all the values for K corresponding to D above 0.75 directly using the Dayhoff model and store these in an internal table, used by Clustal W. This table is declared in the file dayhoff.h and gives values of K for all D between 0.75 and 0.93 in intervals of 0.001, i.e. for D = 0.750, 0.751, 0.752 ...... 0.929, 0.930. For any observed D higher than 0.930, we arbitrarily set K to 10.0. This sounds drastic but with real sequences, distances of 0.93 (less than 7% identity) are rare. If your data set includes sequences with this degree of divergence, you will have great difficulty getting accurate trees by ANY method; the alignment itself will be very difficult (to construct and to evaluate).

There are some important things to note. Firstly, this formula works well if your sequences are of average amino acid composition and if the amino acids substitute according to the original Dayhoff model. In other cases, it may be misleading. Secondly, it is based only on observed percent distance, i.e. it does not DIRECTLY take conservative substitutions into account. Thirdly, the error on the estimated PAM distances may be VERY great for high distances; at very high distance (e.g.
over 85%) it may give largely arbitrary corrected distances. In most cases, however, the correction is still worth using; the trees will be more accurate and the branch lengths will be more realistic.

A far more sophisticated distance correction based on a full Dayhoff model, which DOES take conservative substitutions and actual amino acid composition into account, may be found in the PROTDIST program of the PHYLIP package. For serious tree makers, this program is highly recommended.

TWO NOTES ON BOOTSTRAPPING...

When you use the BOOTSTRAP in Clustal W to estimate the reliability of parts of a tree, many of the uncorrected distances may randomly exceed the arbitrary cut off of 0.93 (sequences only 7% identical) if the sequences are distantly related. This will happen randomly, i.e. even if none of the pairs of sequences are less than 7% identical, the bootstrap samples may contain pairs of sequences that do exceed this cut off. If this happens, you will be warned. In practice, this can happen with many data sets. It is not a serious problem if it happens rarely. If it does happen (you are warned when it happens and told how often the problem occurs), you should consider removing the most distantly related sequences and/or using the PHYLIP package instead.

A further problem arises in almost exactly the opposite situation: when you bootstrap a data set which contains 3 or more sequences that are identical or almost identical. Here, the sets of identical sequences should be shown as a multifurcation (several sequences joining at the same part of the tree). Because the Neighbor-Joining method only gives strictly dichotomous trees (never more than 2 sequences join at one time), this cannot be exactly represented. In practice, this is NOT a problem as there will be some internal branches of zero length separating the sequences. If you display the tree with all branch lengths, you will still see a multifurcation.
However, when you bootstrap the tree, only the branching orders are stored and counted. In the case of multifurcations, the exact branching order is arbitrary but the program will always get the same branching order, depending only on the input order of the sequences. In practice, this is only a problem in situations where you have a set of sequences where all of them are VERY similar. In this case, you can find very high support for some groupings which will disappear if you run the analysis with a different input order. Again, the PHYLIP package deals with this by offering a JUMBLE option to shuffle the input order of your sequences between each bootstrap sample.

Usage

Here is a sample session with emma

% emma
Multiple alignment program - interface to ClustalW program
Input (gapped) sequence(s): globins.fasta
output sequence set [hbb_human.aln]:
Dendrogram (tree file) from clustalw output file [hbb_human.dnd]:

 CLUSTAL W (1.83) Multiple Sequence Alignments

Sequence type explicitly set to Protein
Sequence format is Pearson
Sequence 1: HBB_HUMAN 146 aa
Sequence 2: HBB_HORSE 146 aa
Sequence 3: HBA_HUMAN 141 aa
Sequence 4: HBA_HORSE 141 aa
Sequence 5: MYG_PHYCA 153 aa
Sequence 6: GLB5_PETMA 149 aa
Sequence 7: LGB2_LUPLU 153 aa
Start of Pairwise alignments
Aligning...
Sequences (1:2) Aligned. Score: 83
Sequences (1:3) Aligned. Score: 43
Sequences (1:4) Aligned. Score: 42
Sequences (1:5) Aligned. Score: 24
Sequences (1:6) Aligned. Score: 21
Sequences (1:7) Aligned. Score: 14
Sequences (2:3) Aligned. Score: 41
Sequences (2:4) Aligned. Score: 43
Sequences (2:5) Aligned. Score: 24
Sequences (2:6) Aligned. Score: 19
Sequences (2:7) Aligned. Score: 15
Sequences (3:4) Aligned. Score: 87
Sequences (3:5) Aligned. Score: 26
Sequences (3:6) Aligned. Score: 29
Sequences (3:7) Aligned. Score: 16
Sequences (4:5) Aligned. Score: 26
Sequences (4:6) Aligned. Score: 27
Sequences (4:7) Aligned. Score: 12
Sequences (5:6) Aligned. Score: 21
Sequences (5:7) Aligned. Score: 7
Sequences (6:7) Aligned. Score: 11
Guide tree file created: [12345678A]
Start of Multiple Alignment
There are 6 groups
Aligning...
Group 1: Sequences: 2 Score:2194
Group 2: Sequences: 2 Score:2165
Group 3: Sequences: 4 Score:960
Group 4: Delayed
Group 5: Delayed
Group 6: Delayed
Sequence:5 Score:865
Sequence:6 Score:797
Sequence:7 Score:1044
Alignment Score 4164
GCG-Alignment file created [12345678A]

Command line arguments

   Standard (Mandatory) qualifiers:
  [-sequence]          seqall     (Gapped) sequence(s) filename and optional
                                  format, or reference (input USA)
  [-outseq]            seqoutset  [.] Sequence set filename and optional
                                  format (output USA)
  [-dendoutfile]       outfile    [*.emma] Dendrogram (tree file) from
                                  clustalw output file

   Additional (Optional) qualifiers (* if not always prompted):
   -onlydend           toggle     [N] Only produce dendrogram file
*  -dend               toggle     [N] Do alignment using an old dendrogram
*  -dendfile           infile     Dendrogram (tree file) from clustalw file
                                  (optional)
*  -pwmatrix           menu       [b] The scoring table which describes the
                                  similarity of each amino acid to each other.
                                  There are three 'in-built' series of weight
                                  matrices offered. Each consists of several
                                  matrices which work differently at different
                                  evolutionary distances. To see the exact
                                  details, read the documentation. Crudely, we
                                  store several matrices in memory, spanning
                                  the full range of amino acid distance (from
                                  almost identical sequences to highly
                                  divergent ones). For very similar sequences,
                                  it is best to use a strict weight matrix
                                  which only gives a high score to identities
                                  and the most favoured conservative
                                  substitutions. For more divergent sequences,
                                  it is appropriate to use 'softer' matrices
                                  which give a high score to many other
                                  frequent substitutions.
                                  1) BLOSUM (Henikoff). These matrices appear
                                  to be the best available for carrying out
                                  data base similarity (homology searches).
                                  The matrices used are: Blosum80, 62, 45 and
                                  30.
                                  2) PAM (Dayhoff). These have been extremely
                                  widely used since the late '70s. We use the
                                  PAM 120, 160, 250 and 350 matrices.
                                  3) GONNET. These matrices were derived using
                                  almost the same procedure as the