Read whatever sequences are in the file "proteins.seq" and do a full
multiple alignment; output will go to the files: "proteins.dnd"
(dendrogram) and "proteins.aln" (alignment).
$ clustalv proteins.seq/ktup=2/matrix=pam100/output=pir
Same as last example but use K-Tuple size of 2; use a PAM 100
protein weight matrix; write the alignment out in NBRF/PIR format
(goes to a file called "proteins.pir").
$ clustalv /profile1=proteins.seq/profile2=more.seq/type=p/fixed=11
Take the alignment in "proteins.seq" and align it with "more.seq"
using default values for everything except the fixed gap penalty
which is set to 11. The sequence type is explicitly set to
PROTEIN.
$ clustalv proteins.pir/tree/kimura
Take the sequences in proteins.pir (they MUST BE ALIGNED ALREADY)
and calculate a phylogenetic tree using Kimura's correction for
distances.
$ clustalv proteins.pir/align/tree/kimura
Same as the previous example, EXCEPT THAT AN ALIGNMENT IS DONE
FIRST.
$ clustalv proteins.seq/align/boot=500/seed=99/tossgaps/type=p
Take the sequences in proteins.seq; they are explicitly set to be
protein; align them; bootstrap a tree using 500 samples and a seed
number of 99.
*******************************************************************
5. Algorithms and references.
In this section, we will try to BRIEFLY describe the algorithms used
in ClustalV and give references. The topics covered are:
-Multiple alignments
-Profile alignments
-Protein weight matrices
-Phylogenetic trees
   -distances
   -NJ method
   -Bootstrapping
-Phylip
-References
MULTIPLE ALIGNMENTS.
The approach used in ClustalV is a modified version of the method of
Feng and Doolittle (1987) who aligned the sequences in larger and
larger groups according to the branching order in an initial
phylogenetic tree. This approach allows a very useful combination
of computational tractability and sensitivity.
The positions of gaps that are generated in early alignments remain
through later stages. This can be justified because gaps that arise
from the comparison of closely related sequences should not be moved
because of later alignment with more distantly related sequences.
At each alignment stage, you align two groups of already aligned
sequences. This is done using a dynamic programming algorithm where
one allows the residues that occur in every sequence at each
alignment position to contribute to the alignment score. A Dayhoff
(1978) PAM matrix is used in protein comparisons.
The details of the algorithm used in ClustalV have been published in
Higgins and Sharp (1989). This was an improved version of an
earlier algorithm published in Higgins and Sharp (1988). First, you
calculate a crude similarity measure between every pair of sequences.
This is done using the fast, approximate alignment algorithm of
Wilbur and Lipman (1983). Then, these scores are used to calculate
a "guide tree" or dendrogram, which will tell the multiple alignment
stage in which order to align the sequences for the final multiple
alignment. This "guide tree" is calculated using the UPGMA method
of Sneath and Sokal (1973). UPGMA is a fancy name for one type of
average linkage cluster analysis, invented by Sokal and Michener
(1958).
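
The guide tree step can be illustrated with a minimal sketch of average
linkage clustering. This is not the ClustalV source code: the function
name upgma, the use of a square matrix of distances (for example,
1 minus a similarity score) and the nested-tuple output are all
assumptions made for illustration.

```python
from itertools import combinations

def upgma(names, dist):
    """Average linkage (UPGMA) clustering; returns a nested tuple.

    names: list of sequence names (hypothetical input format)
    dist:  square matrix, dist[i][j] = distance between names[i], names[j]
    """
    # nodes[id] = (subtree, number of leaves in that subtree)
    nodes = {i: (name, 1) for i, name in enumerate(names)}
    d = {frozenset((i, j)): dist[i][j]
         for i, j in combinations(range(len(names)), 2)}
    next_id = len(names)
    while len(nodes) > 1:
        # Join the two closest clusters.
        i, j = min(combinations(sorted(nodes), 2),
                   key=lambda pair: d[frozenset(pair)])
        (tree_i, n_i), (tree_j, n_j) = nodes.pop(i), nodes.pop(j)
        # Distance from the new cluster to each remaining cluster is the
        # size-weighted average of the two old distances.
        for k in nodes:
            d[frozenset((next_id, k))] = (
                n_i * d[frozenset((i, k))] + n_j * d[frozenset((j, k))]
            ) / (n_i + n_j)
        nodes[next_id] = ((tree_i, tree_j), n_i + n_j)
        next_id += 1
    (tree, _), = nodes.values()
    return tree

example = [
    [0.0, 0.1, 0.6, 0.6],
    [0.1, 0.0, 0.6, 0.6],
    [0.6, 0.6, 0.0, 0.2],
    [0.6, 0.6, 0.2, 0.0],
]
guide_tree = upgma(["A", "B", "C", "D"], example)
```

Read from the innermost pairs outwards, the nested tuple gives an order
in which groups could be aligned: A with B, C with D, then the two
pairs with each other.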
Having calculated the dendrogram, the sequences are aligned in
larger and larger groups. At each alignment stage, we use the
algorithm of Myers and Miller (1988) for the optimal alignments.
This algorithm is a very memory efficient variation of Gotoh's
algorithm (Gotoh, 1982). It is because of this algorithm that
ClustalV can work on microcomputers. Each of these alignments
consists of aligning 2 alignments, using what we call "profile
alignments".
PROFILE ALIGNMENTS.
We use the term "profile alignment" to describe the alignment of 2
alignments. We use this term because the method is a simple
extension of the profile method of Gribskov, et al. (1987) for
aligning 1 sequence with an alignment. Normally, with a 2 sequence
alignment, you use a weight matrix (e.g. a PAM 250 matrix) to give a
score between the pairs of aligned residues. The alignment is
considered "optimal" if it gives the best total score for aligned
residues minus penalties for any gaps (insertions or deletions) that
must be introduced.
Profile alignments are a simple extension of 2 sequence alignments
in that you can treat each of the two input alignments as single
sequences but you calculate the score at aligned positions as the
average weight matrix score of all the residues in one alignment
versus all those in the other e.g. if you have 2 alignments with I
and J sequences respectively; the score at any position is the
average of all the I times J scores of the residues compared
separately. Any gaps that are introduced are placed in all of the
sequences of an alignment at the same position. The profile
alignments offered in the "profile alignment menu" are also
calculated in this way.
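
The averaging rule described above can be sketched as follows. This is
an illustration only: the function names are hypothetical, the toy
scoring function simply mimics an identity matrix (10 for a match, 0
otherwise), and the columns are assumed to contain residues only, since
the text does not say how gap characters inside a column are scored.

```python
def identity_score(x, y):
    """Toy weight matrix: 10 for identical residues, 0 otherwise."""
    return 10 if x == y else 0

def profile_column_score(col_a, col_b, score=identity_score):
    """Average of the I times J pairwise scores between two columns.

    col_a: residues at one position of the first alignment (I sequences)
    col_b: residues at one position of the second alignment (J sequences)
    """
    total = sum(score(x, y) for x in col_a for y in col_b)
    return total / (len(col_a) * len(col_b))

# I = 3 sequences in one alignment, J = 2 in the other: 6 comparisons,
# of which 3 are matches (A-A, A-A, G-G).
column_score = profile_column_score("AAG", "AG")
```

With this column score in place of a single matrix entry, the same
dynamic programming recurrence as for 2 sequences can be applied.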
PROTEIN WEIGHT MATRICES.
There are 3 built-in weight matrices used by clustalV. These are
the PAM 100 and PAM 250 matrices of Dayhoff (1978) and an identity
matrix. Each matrix is given as the bottom left half, including the
diagonal of a 20 by 20 matrix. The order of the rows and columns is
CSTPAGNDEQHRKMILVFYW.
PAM 250
C 12
S 0 2
T -2 1 3
P -3 1 0 6
A -2 1 1 1 2
G -3 1 0 -1 1 5
N -4 1 0 -1 0 0 2
D -5 0 0 -1 0 1 2 4
E -5 0 0 -1 0 0 1 3 4
Q -5 -1 -1 0 0 -1 1 2 2 4
H -3 -1 -1 0 -1 -2 2 1 1 3 6
R -4 0 -1 0 -2 -3 0 -1 -1 1 2 6
K -5 0 0 -1 -1 -2 1 0 0 1 0 3 5
M -5 -2 -1 -2 -1 -3 -2 -3 -2 -1 -2 0 0 6
I -2 -1 0 -2 -1 -3 -2 -2 -2 -2 -2 -2 -2 2 5
L -6 -3 -2 -3 -2 -4 -3 -4 -3 -2 -2 -3 -3 4 2 6
V -2 -1 0 -1 0 -1 -2 -2 -2 -2 -2 -2 -2 2 4 2 4
F -4 -3 -3 -5 -4 -5 -4 -6 -5 -5 -2 -4 -5 0 1 2 -1 9
Y 0 -3 -3 -5 -3 -5 -2 -4 -4 -4 0 -4 -4 -2 -1 -1 -2 7 10
W -8 -2 -5 -6 -6 -7 -4 -7 -7 -5 -3 2 -3 -4 -5 -2 -6 0 0 17
----------------------------------------------------------------
C S T P A G N D E Q H R K M I L V F Y W
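
As a sketch of how a matrix stored in this bottom-left form can be
looked up, the following uses the first five rows of the PAM 250 table
above (C, S, T, P, A). The names ORDER, PAM250_ROWS and pam250 are
hypothetical, and the remaining rows would have to be filled in from
the table before other residues could be scored.

```python
ORDER = "CSTPAGNDEQHRKMILVFYW"

# First five rows of the PAM 250 lower triangle, copied from the table.
PAM250_ROWS = [
    [12],                # C
    [0, 2],              # S
    [-2, 1, 3],          # T
    [-3, 1, 0, 6],       # P
    [-2, 1, 1, 1, 2],    # A
]

def pam250(a, b):
    """Score for residues a and b; the matrix is symmetric, so the
    larger index selects the row and the smaller one the column."""
    i, j = ORDER.index(a), ORDER.index(b)
    if i < j:
        i, j = j, i
    return PAM250_ROWS[i][j]
```

For example, pam250("A", "C") and pam250("C", "A") both read the entry
in row A, column C.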
IDENTITY MATRIX
10
0 10
0 0 10
0 0 0 10
0 0 0 0 10
0 0 0 0 0 10
0 0 0 0 0 0 10
0 0 0 0 0 0 0 10
0 0 0 0 0 0 0 0 10
0 0 0 0 0 0 0 0 0 10
0 0 0 0 0 0 0 0 0 0 10
0 0 0 0 0 0 0 0 0 0 0 10
0 0 0 0 0 0 0 0 0 0 0 0 10
0 0 0 0 0 0 0 0 0 0 0 0 0 10
0 0 0 0 0 0 0 0 0 0 0 0 0 0 10
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10
PAM 100
14
-1 6
-5 2 7
-6 1 -1 10
-5 2 2 1 6
-8 1 -3 -3 1 8
-8 2 0 -3 -1 -1 7
-11 -1 -2 -4 -1 -1 4 8
-11 -2 -3 -3 0 -2 1 5 8
-11 -3 -3 -1 -2 -5 -1 1 4 9
-6 -4 -5 -2 -5 -7 2 -1 -2 4 11
-6 -1 -4 -2 -5 -8 -3 -6 -5 1 1 10
-11 -2 -1 -4 -4 -5 1 -2 -2 -1 -3 3 8
-11 -4 -2 -6 -3 -8 -5 -8 -6 -2 -7 -2 1 13
-5 -4 -1 -6 -3 -7 -4 -6 -5 -5 -7 -4 4 2 9
-12 -7 -5 -5 -5 -8 -6 -9 -7 -3 -5 -7 -6 4 2 9
-4 -4 -1 -4 0 -4 -5 -6 -5 -5 -6 -6 -6 1 5 1 8
-10 -5 -6 -9 -7 -8 -6 -11 -11 -10 -4 -7 -11 -2 0 0 -5 12
-2 -6 -6 -11 -6 -11 -3 -9 -7 -9 -1 -10 -10 -8 -4 -5 -6 6 13
-13 -4 -10 -11 -11 -13 -8 -13 -14 -11 -7 1 -9 -11 -12 -7 -14 -2 -2 19
PHYLOGENETIC TREES.
There are two COMMONLY used approaches for inferring phylogenetic
trees from sequence data: parsimony and distance methods. There are
other approaches which are probably superior in theory but which are
yet to be used widely. This does not mean that they are no use; we
(the authors of this program at any rate) simply do not know enough
about them yet. You should see the documentation accompanying the
Phylip package and some of the references there for an explanation
of the different methods and what assumptions are implied when you
use them.
There is a constant debate in the literature as to the merits of
different methods but unfortunately, a lot of what is said is
incomprehensible or inaccurate. It is also a field that is prone to
having highly opinionated schools of thought. This is a pity
because it prevents rational discussion of the pros and cons of
the different methods. The approach adopted in ClustalV is to
supply just one method and to produce alignments in a format that
can be used by Phylip. In simple cases, the trees produced will be
as "good" (reliable, robust) as those from ANY other method. In
more complicated cases, there is no single magic recipe that we can
supply that will work well in even most situations.
The method we provide is the Neighbor Joining method (NJ) of Saitou
and Nei (1987) which is a distance method. We use this for three
reasons: it is conceptually and computationally simple; it is fast;
it gives "good" trees in simple cases. It is difficult to prove that
one tree is "better" than another if you do not know the true
phylogeny; the few systematic surveys of methods show it to work
more or less as well as any other method ON AVERAGE. Another reason
for using the NJ method is that it is very commonly used; THIS IS A
BAD REASON SCIENTIFICALLY but at least you will not feel lonely if
you use it.
The NJ method works on a matrix of distances (the distance matrix)
between all pairs of sequences to be analysed. These distances are
related to the degree of divergence between the sequences. It is
normal to calculate the distances from the sequences after they are
multiply aligned. If you calculate them from separate alignments
(as done for the dendrograms in another part of this program), you
may increase the error considerably.
DISTANCES
The simplest measure of distance between sequences is percent
divergence (100% minus percent identity). For two sequences, you
count how many positions differ between them (ignoring all positions
with a gap or an unknown residue) and divide by the number of
positions considered. It is common practice to also ignore all
positions in the alignment where there is a GAP in ANY of the
sequences (Tossgaps ? option in the menu). Usually, you express the
percent distance divided by 100 (gives distances between 0.0 and
1.0).
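
A minimal sketch of this calculation is given below. It is not the
ClustalV code: the function names are hypothetical, and the characters
treated as a gap ("-") and as an unknown residue ("X") are assumptions
that would differ for DNA.

```python
def toss_gaps(alignment, gap="-"):
    """Drop every column that has a gap in ANY sequence."""
    keep = [c for c in range(len(alignment[0]))
            if all(seq[c] != gap for seq in alignment)]
    return ["".join(seq[c] for c in keep) for seq in alignment]

def divergence(seq_a, seq_b, skip="-X"):
    """Fraction of differing positions (0.0 to 1.0), ignoring positions
    where either sequence has a gap or an unknown residue."""
    pairs = [(x, y) for x, y in zip(seq_a, seq_b)
             if x not in skip and y not in skip]
    if not pairs:
        raise ValueError("no comparable positions")
    return sum(x != y for x, y in pairs) / len(pairs)

aln = ["ACGT-A",
       "ACCTGA",
       "A-GTGA"]
d_pairwise = divergence(aln[0], aln[1])        # 1 difference in 5 positions
tossed = toss_gaps(aln)                        # columns 2 and 5 removed
d_tossed = divergence(tossed[0], tossed[1])    # 1 difference in 4 positions
```

The example shows that tossing gaps can change a distance: the column
dropped because of a gap in the third sequence was a matching position
for the first two.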
This measure of distance is perfectly adequate (with some further
modification described below) for rRNA sequences. However it treats
all residues identically e.g. all amino acid substitutions are
equally weighted. It also treats all positions identically e.g. it
does not take account of different rates of substitution in
different positions of different codons in protein coding DNA
sequences; see Li et al (1985) for a distance measure that does.
Despite these shortcomings, these percent identity distances do work
well in practice in a wide variety of situations.
In a simple world, you would like a distance to be proportional to
the time since the sequences diverged. If this were EXACTLY true,
then the calculation of the tree would be a simple matter of algebra
(UPGMA does this for you) and the branch lengths would be nice and
meaningful (times). In practice this OBVIOUSLY depends on the
existence and quality of the "molecular clock", a subject of
ongoing debate. However, even if there is a good clock, there is a
further problem with estimating divergences. As sequences diverge,
they become "saturated" with mutations. Sites can have
substitutions more than once. Calculated distances will
underestimate actual divergence times; the greater the divergence,
the greater the discrepancy. There are various methods for dealing
with this and we provide two commonly used ones, both due to Motoo
Kimura; one for proteins and one for DNA.
For distance K (percent divergence /100 ) ...
Correction for Protein distances: (Kimura, 1983).
Corrected K = -ln(1.0 - K - (K * K/5.0))
Correction for nucleotide distances: Kimura's 2-parameter method
(Kimura, 1980).
Corrected K = 0.5*ln(a) + 0.25*ln(b)
where a = 1/(1 - 2*P - Q)
and b = 1/(1 - 2*Q)
P and Q are the proportions of transitions (A<-->G, C<-->T)
and transversions occurring between the sequences.
One paradoxical effect of these corrections is that distances can
be corrected to have more than 100% divergence. That is because,
for very highly diverged sequences of length N, you can estimate
that more than N substitutions have occurred by correcting the
observed distance in the above ways. Don't panic!
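
The two corrections can be written out as a short sketch, reading the
protein formula as -ln(1 - K - K*K/5). The function names are
hypothetical and this is not the ClustalV source. Note that math.log
raises ValueError when the observed divergence is so high that the
argument of the logarithm is zero or negative.

```python
import math

def kimura_protein(k):
    """Kimura (1983) correction; k is the observed divergence (0.0-1.0)."""
    return -math.log(1.0 - k - (k * k / 5.0))

def kimura_2_parameter(p, q):
    """Kimura (1980) 2-parameter correction for nucleotide distances.

    p: proportion of transitions, q: proportion of transversions.
    """
    a = 1.0 / (1.0 - 2.0 * p - q)
    b = 1.0 / (1.0 - 2.0 * q)
    return 0.5 * math.log(a) + 0.25 * math.log(b)

# An observed protein divergence of 0.7 is corrected to more than 1.0,
# i.e. more than one estimated substitution per site.
corrected_protein = kimura_protein(0.7)

# 10% transitions and 5% transversions (observed total 0.15).
corrected_dna = kimura_2_parameter(0.10, 0.05)
```

In both cases the corrected value is larger than the observed one, and
the gap between them grows as the sequences become more diverged.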
NEIGHBOR JOINING TREES.
VERY briefly, the NJ method works as follows. You start by placing
the sequences in a star topology (no internal branches). You then
find that