Read whatever sequences are in the file "proteins.seq" and do a full 
multiple alignment; output will go to the files: "proteins.dnd" 
(dendrogram) and "proteins.aln" (alignment).


$ clustalv proteins.seq/ktup=2/matrix=pam100/output=pir

Same as last example but use K-Tuple size of 2; use a PAM 100 
protein weight matrix; write the alignment out in NBRF/PIR format 
(goes to a file called "proteins.pir").


$ clustalv /profile1=proteins.seq/profile2=more.seq/type=p/fixed=11

Take the alignment in "proteins.seq" and align it with "more.seq" 
using default values for everything except the fixed gap penalty 
which is set to 11.  The sequence type is explicitly set to 
PROTEIN.


$ clustalv proteins.pir/tree/kimura

Take the sequences in proteins.pir (they MUST BE ALIGNED ALREADY) 
and calculate a phylogenetic tree using Kimura's correction for 
distances.  


$ clustalv proteins.pir/align/tree/kimura

Same as the previous example, EXCEPT THAT AN ALIGNMENT IS DONE 
FIRST.


$ clustalv proteins.seq/align/boot=500/seed=99/tossgaps/type=p

Take the sequences in proteins.seq; they are explicitly set to be 
protein; align them; bootstrap a tree using 500 samples and a seed 
number of 99.


*******************************************************************

		5.  Algorithms and references.



In this section, we will try to BRIEFLY describe the algorithms used 
in ClustalV and give references.  The topics covered are:


	-Multiple alignments

	-Profile alignments

	-Protein weight matrices

	-Phylogenetic trees

		-distances

		-NJ method

		-Bootstrapping

		-Phylip

	-References






MULTIPLE ALIGNMENTS.

The approach used in ClustalV is a modified version of the method of 
Feng and Doolittle (1987) who aligned the sequences in larger and 
larger groups according to the branching order in an initial 
phylogenetic tree.  This approach allows a very useful combination 
of computational tractability and sensitivity.  

The positions of gaps that are generated in early alignments remain 
through later stages.  This can be justified because gaps that arise 
from the comparison of closely related sequences should not be moved 
because of later alignment with more distantly related sequences.  
At each alignment stage, you align two groups of already aligned 
sequences.  This is done using a dynamic programming algorithm where 
one allows the residues that occur in every sequence at each 
alignment position to contribute to the alignment score.  A Dayhoff 
(1978) PAM matrix is used in protein comparisons.

The details of the algorithm used in ClustalV have been published in 
Higgins and Sharp (1989).  This was an improved version of an 
earlier algorithm published in Higgins and Sharp (1988).  First, you 
calculate a crude similarity measure between every pair of sequences.  
This is done using the fast, approximate alignment algorithm of 
Wilbur and Lipman (1983).  Then, these scores are used to calculate 
a "guide tree" or dendrogram, which will tell the multiple alignment 
stage in which order to align the sequences for the final multiple 
alignment.  This "guide tree" is calculated using the UPGMA method 
of Sneath and Sokal (1973).  UPGMA is a fancy name for one type of 
average linkage cluster analysis, invented by Sokal and Michener 
(1958).  
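
The guide tree step can be illustrated with a minimal sketch of UPGMA 
clustering.  This is NOT the ClustalV source code; it assumes the 
pairwise similarity scores have already been converted to distances 
(for example 1 minus a similarity fraction), and the function name and 
the toy distances are made up for illustration.

```python
from itertools import combinations


def upgma_merge_order(names, dist):
    """Return the clusters formed by UPGMA, in the order they are joined.

    names: list of sequence labels.
    dist:  dict mapping (label_a, label_b) -> distance for every pair.
    Clusters are represented as nested tuples of labels.
    """
    # Each cluster is (label, number_of_sequences_in_it).
    clusters = [(name, 1) for name in names]
    d = {frozenset(pair): value for pair, value in dist.items()}
    merges = []
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        (la, na), (lb, nb) = min(
            combinations(clusters, 2),
            key=lambda pair: d[frozenset((pair[0][0], pair[1][0]))],
        )
        merged = (la, lb)
        merges.append(merged)
        # Average linkage: the new distance is the size-weighted mean.
        for lc, _ in clusters:
            if lc in (la, lb):
                continue
            dac = d[frozenset((la, lc))]
            dbc = d[frozenset((lb, lc))]
            d[frozenset((merged, lc))] = (na * dac + nb * dbc) / (na + nb)
        clusters = [c for c in clusters if c[0] not in (la, lb)]
        clusters.append((merged, na + nb))
    return merges


toy = {
    ("A", "B"): 0.1, ("A", "C"): 0.4, ("A", "D"): 0.6,
    ("B", "C"): 0.4, ("B", "D"): 0.6, ("C", "D"): 0.5,
}
order = upgma_merge_order(["A", "B", "C", "D"], toy)
```

In this toy case A and B are joined first, then C is added to that 
pair, then D; the multiple alignment stage would align the sequences 
in that order.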

Having calculated the dendrogram, the sequences are aligned in 
larger and larger groups.  At each alignment stage, we use the 
algorithm of Myers and Miller (1988) for the optimal alignments.  
This algorithm is a very memory efficient variation of Gotoh's 
algorithm (Gotoh, 1982).  It is because of this algorithm that 
ClustalV can work on microcomputers.   Each of these alignments 
consists of aligning 2 alignments, using what we call "profile 
alignments".
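
As a rough illustration of the kind of recurrence involved, the sketch 
below computes the optimal global alignment SCORE of two sequences 
with affine gap penalties, in the style of Gotoh (1982), keeping only 
one row of the table in memory.  It does not recover the alignment 
itself; doing that in linear space is the contribution of Myers and 
Miller (1988) and is not shown.  The gap cost convention used here (a 
gap of length n costs gap_open + n * gap_extend) and all names are 
assumptions for illustration, not ClustalV's own parameters.

```python
NEG_INF = float("-inf")


def gotoh_score(a, b, score, gap_open, gap_extend):
    """Best global alignment score of a and b with affine gap costs."""
    m = len(b)
    # best[j]: best score for the prefixes seen so far, ending at column j.
    best = [0] + [-(gap_open + j * gap_extend) for j in range(1, m + 1)]
    # vert[j]: best score ending with a gap in b (a residue over a gap).
    vert = [NEG_INF] * (m + 1)
    for i in range(1, len(a) + 1):
        diag = best[0]                      # value from the previous row
        best[0] = -(gap_open + i * gap_extend)
        horiz = NEG_INF                     # best score ending with a gap in a
        for j in range(1, m + 1):
            # Extend an existing gap or open a new one.
            horiz = max(horiz - gap_extend,
                        best[j - 1] - gap_open - gap_extend)
            vert[j] = max(vert[j] - gap_extend,
                          best[j] - gap_open - gap_extend)
            match = diag + score(a[i - 1], b[j - 1])
            diag = best[j]                  # save before overwriting
            best[j] = max(match, horiz, vert[j])
    return best[m]


def simple_score(x, y):
    # Toy scoring: +1 for identical residues, -1 otherwise.
    return 1 if x == y else -1


same = gotoh_score("ACGT", "ACGT", simple_score, 2, 1)
one_gap = gotoh_score("ACGT", "AGT", simple_score, 2, 1)
```

With these toy values, aligning a sequence with itself scores one 
point per residue, and deleting one residue costs one gap of length 1.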


PROFILE ALIGNMENTS.

We use the term "profile alignment" to describe the alignment of 2 
alignments.  We use this term because the method is a simple 
extension of the profile method of Gribskov, et al. (1987) for 
aligning 1 sequence with an alignment.  Normally, with a 2 sequence 
alignment, you use a weight matrix (e.g. a PAM 250 matrix) to give a 
score between the pairs of aligned residues.  The alignment is 
considered "optimal" if it gives the best total score for aligned 
residues minus penalties for any gaps (insertions or deletions) that 
must be introduced.  

Profile alignments are a simple extension of 2 sequence alignments 
in that you can treat each of the two input alignments as single 
sequences but you calculate the score at aligned positions as the 
average weight matrix score of all the residues in one alignment 
versus all those in the other e.g. if you have 2 alignments with I 
and J sequences respectively; the score at any position is the 
average of all the I times J scores of the residues compared 
separately.  Any gaps that are introduced are placed in all of the 
sequences of an alignment at the same position.  The profile 
alignments offered in the "profile alignment menu" are also 
calculated in this way.
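
The column score described above can be sketched as follows.  This is 
a minimal illustration, not the program's code: it assumes each column 
is given as a string of residues with no gap characters (how gaps 
within a column are scored is not covered here), and it uses a toy 
identity-style scoring function.

```python
def profile_column_score(column_a, column_b, score):
    """Average score over all I x J residue pairs of two alignment columns."""
    total = sum(score(x, y) for x in column_a for y in column_b)
    return total / (len(column_a) * len(column_b))


def identity_score(x, y):
    # Toy weight matrix: 10 for identical residues, 0 otherwise.
    return 10 if x == y else 0


# One column from an alignment of 3 sequences against one column
# from an alignment of 2 sequences: 6 residue pairs are averaged.
column_score = profile_column_score("AAG", "AG", identity_score)
```

Here 3 of the 6 pairs are identical, so the column scores 30 / 6 = 5.0.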


PROTEIN WEIGHT MATRICES.

There are 3 built-in weight matrices used by clustalV.  These are 
the PAM 100 and PAM 250 matrices of Dayhoff (1978) and an identity 
matrix.  Each matrix is given as the bottom left half, including the 
diagonal of a 20 by 20 matrix.  The order of the rows and columns is 
CSTPAGNDEQHRKMILVFYW.


PAM 250

C  12 
S   0  2 
T  -2  1  3 
P  -3  1  0  6 
A  -2  1  1  1  2 
G  -3  1  0 -1  1  5 
N  -4  1  0 -1  0  0  2 
D  -5  0  0 -1  0  1  2  4 
E  -5  0  0 -1  0  0  1  3  4 
Q  -5 -1 -1  0  0 -1  1  2  2  4 
H  -3 -1 -1  0 -1 -2  2  1  1  3  6 
R  -4  0 -1  0 -2 -3  0 -1 -1  1  2  6 
K  -5  0  0 -1 -1 -2  1  0  0  1  0  3  5 
M  -5 -2 -1 -2 -1 -3 -2 -3 -2 -1 -2  0  0  6 
I  -2 -1  0 -2 -1 -3 -2 -2 -2 -2 -2 -2 -2  2  5 
L  -6 -3 -2 -3 -2 -4 -3 -4 -3 -2 -2 -3 -3  4  2  6 
V  -2 -1  0 -1  0 -1 -2 -2 -2 -2 -2 -2 -2  2  4  2  4 
F  -4 -3 -3 -5 -4 -5 -4 -6 -5 -5 -2 -4 -5  0  1  2 -1  9 
Y   0 -3 -3 -5 -3 -5 -2 -4 -4 -4  0 -4 -4 -2 -1 -1 -2  7 10 
W  -8 -2 -5 -6 -6 -7 -4 -7 -7 -5 -3  2 -3 -4 -5 -2 -6  0  0 17 
---------------------------------------------------------------- 
    C  S  T  P  A  G  N  D  E  Q  H  R  K  M  I  L  V  F  Y  W
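
Because only the bottom left half is listed, a program has to mirror 
it to look up a score for any pair of residues.  The sketch below 
shows one way to do this; the function name is hypothetical and only 
the first four rows of the PAM 250 table above (C, S, T, P) are used 
to keep the example short.

```python
ORDER = "CSTPAGNDEQHRKMILVFYW"


def expand_lower_triangle(rows, order):
    """Turn lower-triangle rows (including the diagonal) into a symmetric lookup."""
    table = {}
    for i, row in enumerate(rows):
        if len(row) != i + 1:
            raise ValueError("row %d should have %d entries" % (i, i + 1))
        for j, value in enumerate(row):
            table[order[i], order[j]] = value
            table[order[j], order[i]] = value   # mirror across the diagonal
    return table


# Top-left corner of the PAM 250 table, rows C, S, T, P.
corner_rows = [
    [12],
    [0, 2],
    [-2, 1, 3],
    [-3, 1, 0, 6],
]
pam250_corner = expand_lower_triangle(corner_rows, ORDER[:4])
```

After expansion, pam250_corner["C", "P"] and pam250_corner["P", "C"] 
both give -3.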


IDENTITY MATRIX

10 
 0  10 
 0  0  10 
 0  0  0  10 
 0  0  0  0  10 
 0  0  0  0  0  10 
 0  0  0  0  0  0  10 
 0  0  0  0  0  0  0  10 
 0  0  0  0  0  0  0  0  10 
 0  0  0  0  0  0  0  0  0  10 
 0  0  0  0  0  0  0  0  0  0  10 
 0  0  0  0  0  0  0  0  0  0  0  10 
 0  0  0  0  0  0  0  0  0  0  0  0  10 
 0  0  0  0  0  0  0  0  0  0  0  0  0  10 
 0  0  0  0  0  0  0  0  0  0  0  0  0  0  10 
 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  10 
 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  10 
 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  10 
 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 10 
 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 10





PAM 100

  14
  -1   6
  -5   2   7
  -6   1  -1  10
  -5   2   2   1   6
  -8   1  -3  -3   1   8
  -8   2   0  -3  -1  -1   7
 -11  -1  -2  -4  -1  -1   4   8
 -11  -2  -3  -3   0  -2   1   5   8
 -11  -3  -3  -1  -2  -5  -1   1   4   9
  -6  -4  -5  -2  -5  -7   2  -1  -2   4  11
  -6  -1  -4  -2  -5  -8  -3  -6  -5   1   1  10
 -11  -2  -1  -4  -4  -5   1  -2  -2  -1  -3   3   8
 -11  -4  -2  -6  -3  -8  -5  -8  -6  -2  -7  -2   1  13
  -5  -4  -1  -6  -3  -7  -4  -6  -5  -5  -7  -4   4   2   9
 -12  -7  -5  -5  -5  -8  -6  -9  -7  -3  -5  -7  -6   4   2   9
  -4  -4  -1  -4   0  -4  -5  -6  -5  -5  -6  -6  -6   1   5   1   8
 -10  -5  -6  -9  -7  -8  -6 -11 -11 -10  -4  -7 -11  -2   0   0  -5  12
  -2  -6  -6 -11  -6 -11  -3  -9  -7  -9  -1 -10 -10  -8  -4  -5  -6   6  13
 -13  -4 -10 -11 -11 -13  -8 -13 -14 -11  -7   1  -9 -11 -12  -7 -14  -2  -2  19




PHYLOGENETIC TREES.

There are two COMMONLY used approaches for inferring phylogenetic 
trees from sequence data: parsimony and distance methods. There are 
other approaches which are probably superior in theory but which are 
yet to be used widely. This does not mean that they are of no use; we 
(the authors of this program at any rate) simply do not know enough 
about them yet.  You should see the documentation accompanying the 
Phylip package and some of the references there for an explanation 
of the different methods and what assumptions are implied when you 
use them.   

There is a constant debate in the literature as to the merits of 
different methods but unfortunately, a lot of what is said is 
incomprehensible or inaccurate.  It is also a field that is prone to 
having highly opinionated schools of thought.  This is a pity 
because it prevents rational discussion of the pros and cons of 
the different methods.  The approach adopted in ClustalV is to 
supply just one method and to produce alignments in a format that 
can be used by Phylip.  In simple cases, the trees produced will be 
as "good" (reliable, robust) as those from ANY other method.  In 
more complicated cases, there is no single magic recipe that we can 
supply that will work well in even most situations.

The method we provide is the Neighbor Joining method (NJ) of Saitou 
and Nei (1987) which is a distance method.  We use this for three 
reasons:  it is conceptually and computationally simple; it is fast; 
it gives "good" trees in simple cases. It is difficult to prove that 
one tree is "better" than another if you do not know the true 
phylogeny; the few systematic surveys of methods show it to work 
more or less as well as any other method ON AVERAGE.  Another reason 
for using the NJ method is that it is very commonly used; THIS IS A 
BAD REASON SCIENTIFICALLY but at least you will not feel lonely if 
you use it.

The NJ method works on a matrix of distances (the distance matrix) 
between all pairs of sequences to be analysed.  These distances are 
related to the degree of divergence between the sequences.  It is 
normal to calculate the distances from the sequences after they are 
multiply aligned.  If you calculate them from separate alignments 
(as done for the dendrograms in another part of this program), you 
may increase the error considerably.  


DISTANCES

The simplest measure of distance between sequences is percent 
divergence (100% minus percent identity).  For two sequences, you 
count how many positions differ between them (ignoring all positions 
with a gap or an unknown residue) and divide by the number of 
positions considered.  It is common practice to also ignore all 
positions in the alignment where there is a GAP in ANY of the 
sequences (Tossgaps ? option in the menu).  Usually, you express the 
percent distance divided by 100 (gives distances between 0.0 and 
1.0).
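
The pairwise calculation can be sketched as below.  This is a minimal 
illustration under stated assumptions: the two sequences are rows of 
the same alignment (equal length), "-" marks a gap and "X" marks an 
unknown residue; these symbols and the function name are chosen for 
the example, not taken from the program.  Removing every column that 
has a gap in ANY sequence (Tossgaps) would be done on the whole 
alignment before calling this function and is not shown.

```python
def fractional_distance(seq_a, seq_b, gap="-", unknown="X"):
    """Fraction of compared positions that differ (0.0 to 1.0)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    compared = 0
    differing = 0
    for a, b in zip(seq_a, seq_b):
        # Skip positions with a gap or an unknown residue in either sequence.
        if a in (gap, unknown) or b in (gap, unknown):
            continue
        compared += 1
        if a != b:
            differing += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return differing / compared


# Five positions are compared (the last has a gap); one of them differs.
k = fractional_distance("ACDEF-", "ACDQFG")
```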

This measure of distance is perfectly adequate (with some further 
modification described below) for rRNA sequences. However it treats 
all residues identically e.g. all amino acid substitutions are 
equally weighted. It also treats all positions identically e.g. it 
does not take account of different rates of substitution in 
different positions of different codons in protein coding DNA 
sequences; see Li et al (1985) for a distance measure that does.  
Despite these shortcomings, these percent identity distances do work 
well in practice in a wide variety of situations.  

In a simple world, you would like a distance to be proportional to 
the time since the sequences diverged.  If this were EXACTLY true, 
then the calculation of the tree would be a simple matter of algebra 
(UPGMA does this for you) and the branch lengths will be nice and 
meaningful (times).  In practice this OBVIOUSLY depends on the 
existence and quality of the "molecular clock", a subject of on-
going debate.  However, even if there is a good clock, there is a 
further problem with estimating divergences.  As sequences diverge, 
they become "saturated" with mutations.  Sites can have 
substitutions more than once.  Calculated distances will 
underestimate actual divergence times; the greater the divergence, 
the greater the discrepancy.  There are various methods for dealing 
with this and we provide two commonly used ones, both due to Motoo 
Kimura; one for proteins and one for DNA. 


For distance K (percent divergence /100 ) ...

Correction for Protein distances:  (Kimura, 1983).

       Corrected K = -ln(1.0 - K - (K * K/5.0))
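
As a small worked sketch of the protein correction (reading the 
formula as -ln(1 - K - K*K/5); the function name is made up for this 
example), note that the logarithm is only defined while 
1 - K - K*K/5 stays positive, so very large observed distances 
cannot be corrected:

```python
import math


def kimura_protein_distance(k):
    """Kimura (1983) correction for an observed fractional distance k."""
    arg = 1.0 - k - (k * k) / 5.0
    if arg <= 0.0:
        raise ValueError("observed distance too large to correct")
    return -math.log(arg)


corrected_small = kimura_protein_distance(0.1)   # slightly above 0.1
corrected_large = kimura_protein_distance(0.7)   # exceeds 1.0
```

The corrected value is always at least as large as the observed one, 
and the gap between them grows as the observed distance increases.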



Correction for nucleotide distances: Kimura's 2-parameter method 
(Kimura, 1980).

       Corrected K = 0.5*ln(a) + 0.25*ln(b)

       where     a = 1/(1 - 2*P - Q)
       and       b = 1/(1 - 2*Q)

       P and Q are the proportions of transitions (A<-->G, C<-->T)
       and transversions occurring between the sequences.  
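
The two-parameter correction can be sketched in the same way.  The 
helper that counts P and Q assumes aligned DNA sequences written with 
the letters A, C, G and T, and skips any position where either 
sequence has something else (gap, ambiguity code); both function names 
are hypothetical.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def transition_transversion_fractions(seq_a, seq_b):
    """Return (P, Q): fractions of transitions and transversions."""
    compared = transitions = transversions = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue                      # skip gaps and ambiguity codes
        compared += 1
        if x == y:
            continue
        # A<->G and C<->T stay within one chemical class: transitions.
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return transitions / compared, transversions / compared


def kimura_two_parameter(p, q):
    """Kimura (1980) corrected distance from P (transitions), Q (transversions)."""
    arg_a = 1.0 - 2.0 * p - q
    arg_b = 1.0 - 2.0 * q
    if arg_a <= 0.0 or arg_b <= 0.0:
        raise ValueError("sequences too divergent to correct")
    a = 1.0 / arg_a
    b = 1.0 / arg_b
    return 0.5 * math.log(a) + 0.25 * math.log(b)


p_q = transition_transversion_fractions("AGCTA", "GGTAA")
corrected = kimura_two_parameter(0.10, 0.05)
```

With P = 0.10 and Q = 0.05 the observed distance is 0.15 and the 
corrected distance is a little larger, about 0.17.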


One paradoxical effect of these corrections is that distances can 
be corrected to have more than 100% divergence.  That is because, 
for very highly diverged sequences of length N, you can estimate 
that more than N substitutions have occurred by correcting the 
observed distance in the above ways.  Don't panic!



NEIGHBOR JOINING TREES.

VERY briefly, the NJ method works as follows.  You start by placing 
the sequences in a star topology (no internal branches).  You then 
find that
