📄 clustalv.doc
Same as the previous example, EXCEPT THAT AN ALIGNMENT IS DONE FIRST.

$ clustalv proteins.seq/align/boot=500/seed=99/tossgaps/type=p

Take the sequences in proteins.seq; they are explicitly set to be protein; align them; bootstrap a tree using 500 samples and a seed number of 99, ignoring all positions with a gap in any sequence (tossgaps).

*******************************************************************

5. Algorithms and references.

In this section, we will try to BRIEFLY describe the algorithms used in ClustalV and give references. The topics covered are:

 -Multiple alignments
 -Profile alignments
 -Protein weight matrices
 -Phylogenetic trees
 -distances
 -NJ method
 -Bootstrapping
 -Phylip
 -References

MULTIPLE ALIGNMENTS.

The approach used in ClustalV is a modified version of the method of Feng and Doolittle (1987) who aligned the sequences in larger and larger groups according to the branching order in an initial phylogenetic tree. This approach allows a very useful combination of computational tractability and sensitivity.

The positions of gaps that are generated in early alignments remain through later stages. This can be justified because gaps that arise from the comparison of closely related sequences should not be moved because of later alignment with more distantly related sequences. At each alignment stage, you align two groups of already aligned sequences. This is done using a dynamic programming algorithm where one allows the residues that occur in every sequence at each alignment position to contribute to the alignment score. A Dayhoff (1978) PAM matrix is used in protein comparisons.

The details of the algorithm used in ClustalV have been published in Higgins and Sharp (1989). This was an improved version of an earlier algorithm published in Higgins and Sharp (1988). First, you calculate a crude similarity measure between every pair of sequences. This is done using the fast, approximate alignment algorithm of Wilbur and Lipman (1983).
Then, these scores are used to calculate a "guide tree" or dendrogram, which tells the multiple alignment stage in which order to align the sequences for the final multiple alignment. This "guide tree" is calculated using the UPGMA method of Sneath and Sokal (1973). UPGMA is a fancy name for one type of average linkage cluster analysis, invented by Sokal and Michener (1958).

Having calculated the dendrogram, the sequences are aligned in larger and larger groups. At each alignment stage, we use the algorithm of Myers and Miller (1988) for the optimal alignments. This algorithm is a very memory efficient variation of Gotoh's algorithm (Gotoh, 1982). It is because of this algorithm that ClustalV can work on microcomputers. Each of these alignments consists of aligning 2 alignments, using what we call "profile alignments".

PROFILE ALIGNMENTS.

We use the term "profile alignment" to describe the alignment of 2 alignments. We use this term because the method is a simple extension of the profile method of Gribskov, et al. (1987) for aligning 1 sequence with an alignment. Normally, with a 2 sequence alignment, you use a weight matrix (e.g. a PAM 250 matrix) to give a score between the pairs of aligned residues. The alignment is considered "optimal" if it gives the best total score for aligned residues minus penalties for any gaps (insertions or deletions) that must be introduced.

Profile alignments are a simple extension of 2 sequence alignments in that you can treat each of the two input alignments as single sequences but you calculate the score at aligned positions as the average weight matrix score of all the residues in one alignment versus all those in the other. For example, if you have 2 alignments with I and J sequences respectively, the score at any position is the average of all the I times J scores of the residues compared separately. Any gaps that are introduced are placed in all of the sequences of an alignment at the same position.
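The averaging rule for one pair of aligned positions can be sketched as follows. This is only an illustration, not code from ClustalV: the function names are hypothetical, the scoring is a plain match/mismatch stand-in (10 for identical residues, 0 otherwise) rather than one of the built-in matrices, and the columns are assumed to contain no gap characters.

```python
from itertools import product


def match_score(x, y):
    """Hypothetical stand-in for a weight matrix: 10 for a match, else 0."""
    return 10 if x == y else 0


def profile_column_score(col_a, col_b, score=match_score):
    """Average weight matrix score over all I times J residue pairs.

    col_a and col_b hold the residues found at one position of each
    input alignment. Gap characters are assumed absent (an assumption
    of this sketch, not a statement about ClustalV).
    """
    pairs = list(product(col_a, col_b))
    return sum(score(x, y) for x, y in pairs) / len(pairs)


# Alignment 1 has I = 3 sequences, alignment 2 has J = 2 sequences.
# Pair scores: L-L 10, L-V 0, L-L 10, L-V 0, V-L 0, V-V 10.
print(profile_column_score("LLV", "LV"))  # 30 / 6 = 5.0
```

A real weight matrix such as PAM 250 could be passed in through the score argument in place of match_score.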
The profile alignments offered in the "profile alignment menu" are also calculated in this way.

PROTEIN WEIGHT MATRICES.

There are 3 built-in weight matrices used by ClustalV. These are the PAM 100 and PAM 250 matrices of Dayhoff (1978) and an identity matrix. Each matrix is given as the bottom left half, including the diagonal, of a 20 by 20 matrix. The order of the rows and columns is CSTPAGNDEQHRKMILVFYW.

PAM 250

C 12
S  0  2
T -2  1  3
P -3  1  0  6
A -2  1  1  1  2
G -3  1  0 -1  1  5
N -4  1  0 -1  0  0  2
D -5  0  0 -1  0  1  2  4
E -5  0  0 -1  0  0  1  3  4
Q -5 -1 -1  0  0 -1  1  2  2  4
H -3 -1 -1  0 -1 -2  2  1  1  3  6
R -4  0 -1  0 -2 -3  0 -1 -1  1  2  6
K -5  0  0 -1 -1 -2  1  0  0  1  0  3  5
M -5 -2 -1 -2 -1 -3 -2 -3 -2 -1 -2  0  0  6
I -2 -1  0 -2 -1 -3 -2 -2 -2 -2 -2 -2 -2  2  5
L -6 -3 -2 -3 -2 -4 -3 -4 -3 -2 -2 -3 -3  4  2  6
V -2 -1  0 -1  0 -1 -2 -2 -2 -2 -2 -2 -2  2  4  2  4
F -4 -3 -3 -5 -4 -5 -4 -6 -5 -5 -2 -4 -5  0  1  2 -1  9
Y  0 -3 -3 -5 -3 -5 -2 -4 -4 -4  0 -4 -4 -2 -1 -1 -2  7 10
W -8 -2 -5 -6 -6 -7 -4 -7 -7 -5 -3  2 -3 -4 -5 -2 -6  0  0 17
----------------------------------------------------------------
   C  S  T  P  A  G  N  D  E  Q  H  R  K  M  I  L  V  F  Y  W

IDENTITY MATRIX

 10
  0 10
  0  0 10
  0  0  0 10
  0  0  0  0 10
  0  0  0  0  1 10
  0  0  0  0  0  0 10
  0  0  0  0  0  0  0 10
  0  0  0  0  0  0  0  0 10
  0  0  0  0  0  0  0  0  0 10
  0  0  0  0  0  0  0  0  0  0 10
  0  0  0  0  0  0  0  0  0  0  0 10
  0  0  0  0  0  0  0  0  0  0  0  0 10
  0  0  0  0  0  0  0  0  0  0  0  0  0 10
  0  0  0  0  0  0  0  0  0  0  0  0  0  0 10
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 10
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 10
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 10
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 10
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 10

PAM 100

  14
  -1   6
  -5   2   7
  -6   1  -1  10
  -5   2   2   1   6
  -8   1  -3  -3   1   8
  -8   2   0  -3  -1  -1   7
 -11  -1  -2  -4  -1  -1   4   8
 -11  -2  -3  -3   0  -2   1   5   8
 -11  -3  -3  -1  -2  -5  -1   1   4   9
  -6  -4  -5  -2  -5  -7   2  -1  -2   4  11
  -6  -1  -4  -2  -5  -8  -3  -6  -5   1   1  10
 -11  -2  -1  -4  -4  -5   1  -2  -2  -1  -3   3   8
 -11  -4  -2  -6  -3  -8  -5  -8  -6  -2  -7  -2   1  13
  -5  -4  -1  -6  -3  -7  -4  -6  -5  -5  -7  -4   4   2   9
 -12  -7  -5  -5  -5  -8  -6  -9  -7  -3  -5  -7  -6   4   2   9
  -4  -4  -1  -4   0  -4  -5  -6  -5  -5  -6  -6  -6   1   5   1   8
 -10  -5  -6  -9  -7  -8  -6 -11 -11 -10  -4  -7 -11  -2   0   0  -5  12
  -2  -6  -6 -11  -6 -11  -3  -9  -7  -9  -1 -10 -10  -8  -4  -5  -6   6  13
 -13  -4 -10 -11 -11 -13  -8 -13 -14 -11  -7   1  -9 -11 -12  -7 -14  -2  -2  19

PHYLOGENETIC TREES.

There are two COMMONLY used approaches for inferring phylogenetic trees from sequence data: parsimony and distance methods. There are other approaches which are probably superior in theory but which are yet to be used widely. This does not mean that they are of no use; we (the authors of this program at any rate) simply do not know enough about them yet. You should see the documentation accompanying the Phylip package and some of the references there for an explanation of the different methods and what assumptions are implied when you use them.

There is a constant debate in the literature as to the merits of different methods but unfortunately, a lot of what is said is incomprehensible or inaccurate. It is also a field that is prone to having highly opinionated schools of thought. This is a pity because it prevents rational discussion of the pros and cons of the different methods. The approach adopted in ClustalV is to supply just one method and to produce alignments in a format that can be used by Phylip. In simple cases, the trees produced will be as "good" (reliable, robust) as those from ANY other method. In more complicated cases, there is no single magic recipe that we can supply that will work well in even most situations.

The method we provide is the Neighbor Joining method (NJ) of Saitou and Nei (1987) which is a distance method. We use this for three reasons: it is conceptually and computationally simple; it is fast; it gives "good" trees in simple cases. It is difficult to prove that one tree is "better" than another if you do not know the true phylogeny; the few systematic surveys of methods show it to work more or less as well as any other method ON AVERAGE.
Another reason for using the NJ method is that it is very commonly used; THIS IS A BAD REASON SCIENTIFICALLY but at least you will not feel lonely if you use it.

The NJ method works on a matrix of distances (the distance matrix) between all pairs of sequences to be analysed. These distances are related to the degree of divergence between the sequences. It is normal to calculate the distances from the sequences after they are multiply aligned. If you calculate them from separate alignments (as done for the dendrograms in another part of this program), you may increase the error considerably.

DISTANCES.

The simplest measure of distance between sequences is percent divergence (100% minus percent identity). For two sequences, you count how many positions differ between them (ignoring all positions with a gap or an unknown residue) and divide by the number of positions considered. It is common practice to also ignore all positions in the alignment where there is a GAP in ANY of the sequences (Tossgaps ? option in the menu). Usually, you express the percent distance divided by 100 (gives distances between 0.0 and 1.0).

This measure of distance is perfectly adequate (with some further modification described below) for rRNA sequences. However, it treats all residues identically, e.g. all amino acid substitutions are equally weighted. It also treats all positions identically, e.g. it does not take account of different rates of substitution in different positions of different codons in protein coding DNA sequences; see Li et al (1985) for a distance measure that does. Despite these shortcomings, these percent identity distances do work well in practice in a wide variety of situations.

In a simple world, you would like a distance to be proportional to the time since the sequences diverged. If this were EXACTLY true, then the calculation of the tree would be a simple matter of algebra (UPGMA does this for you) and the branch lengths would be nice and meaningful (times).
In practice this OBVIOUSLY depends on the existence and quality of the "molecular clock", a subject of on-going debate. However, even if there is a good clock, there is a further problem with estimating divergences. As sequences diverge, they become "saturated" with mutations. Sites can have substitutions more than once. Calculated distances will underestimate actual divergence times; the greater the divergence, the greater the discrepancy. There are various methods for dealing with this and we provide two commonly used ones, both due to Motoo Kimura; one for proteins and one for DNA.

For distance K (percent divergence /100) ...

Correction for Protein distances: (Kimura, 1983).

 Corrected K = -ln(1.0 - K - (K * K/5.0))

Correction for nucleotide distances: Kimura's 2-parameter method (Kimura, 1980).

 Corrected K = 0.5*ln(a) + 0.25*ln(b)

 where a = 1/(1 - 2*P - Q)
 and b = 1/(1 - 2*Q)

P and Q are the proportions of transitions (A<-->G, C<-->T) and transversions occurring between the sequences.

One paradoxical effect of these corrections is that distances can be corrected to have more than 100% divergence. That is because, for very highly diverged sequences of length N, you can estimate that more than N substitutions have occurred by correcting the observed distance in the above ways. Don't panic!

NEIGHBOR JOINING TREES.

VERY briefly, the NJ method works as follows. You start by placing the sequences in a star topology (no internal branches). You then find that internal branch (take 2 sequences; join them; connect them to the rest by the internal branch) which when added to the tree will minimise the total branch length. The two joined sequences (neighbours) are merged into a single sequence and the process is repeated. For an unrooted tree with N sequences, there are N-3 internal branches. The above process is repeated N-3 times to give the final tree.
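The joining loop described above can be sketched as follows. This is a minimal illustration and not the ClustalV source: it selects each pair with the commonly used Q-criterion form of the NJ selection step, which is generally described as equivalent to choosing the pair that minimises total branch length, and the function name, data layout and toy distances are all hypothetical.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Sketch of NJ on a distance matrix (not ClustalV's own code).

    names: list of sequence labels.
    dist:  dict mapping frozenset({x, y}) to the distance between x and y.
    Returns (joins, remaining); each join is (i, j, len_i, len_j), where
    len_i and len_j are the branch lengths from i and j to their new node.
    """
    nodes = list(names)
    d = dict(dist)
    joins = []
    while len(nodes) > 3:          # N-3 joins give the N-3 internal branches
        n = len(nodes)
        # r[x]: sum of distances from x to every other active node
        r = {x: sum(d[frozenset((x, y))] for y in nodes if y != x)
             for x in nodes}
        # Q-criterion: pick the pair with the smallest Q value
        i, j = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * d[frozenset(p)] - r[p[0]] - r[p[1]])
        dij = d[frozenset((i, j))]
        len_i = dij / 2 + (r[i] - r[j]) / (2 * (n - 2))
        len_j = dij - len_i
        u = f"({i},{j})"           # label of the merged node
        for k in nodes:
            if k not in (i, j):
                d[frozenset((u, k))] = (d[frozenset((i, k))]
                                        + d[frozenset((j, k))] - dij) / 2
        nodes = [k for k in nodes if k not in (i, j)] + [u]
        joins.append((i, j, len_i, len_j))
    return joins, nodes


# Toy distances for N = 5 sequences (hypothetical values).
toy = {("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
       ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
       ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3}
toy_dist = {frozenset(pair): float(v) for pair, v in toy.items()}
joins, remaining = neighbor_joining(["a", "b", "c", "d", "e"], toy_dist)
print(joins[0])  # first join: a and b, with branch lengths 2.0 and 3.0
```

With 5 sequences the loop runs N-3 = 2 times and stops when 3 nodes remain, which are then all connected to one final internal node. When two candidate pairs tie, this sketch simply takes the first one it encounters.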
The full details are given in Saitou and Nei (1987).

As explained elsewhere in the documentation, you can only root the tree by one of two methods:

1) assume a degree of constancy in the molecular clock and place the root along the longest branch (internal or external). Methods that appear to produce rooted trees automatically are often just doing this without letting you know; this is true of UPGMA.

2) root the tree on biological grounds. The usual method is to include an "outgroup", a sequence that you are certain will branch to the outside of the tree.

BOOTSTRAPPING.

Bootstrapping is a general purpose technique that can be used for placing confidence limits on statistics that you estimate without any knowledge of the underlying distribution (e.g. a normal or Poisson distribution). In the case of phylogenetic trees, there are several analytical methods for placing confidence limits on groupings (actually on the internal branches) but these are either restricted to particular tree drawing methods or only work on small trees of 4 or 5 sequences. Felsenstein (1985) showed how to use bootstrapping to calculate confidence limits on trees. His approach is completely general and can be applied to any tree drawing method. The main assumption of the method i