Cycle 2 = SEQ: 1 ( 0.28142) joins Node: 5 ( 0.33462)
Cycle 3 = SEQ: 2 ( 0.05879) joins SEQ: 3 ( 0.03086)
Cycle 4 (Last cycle, trichotomy):
Node: 1 ( 0.20798) joins
Node: 2 ( 0.02341) joins
SEQ: 4 ( 0.04915)
The output file first shows the percent divergence (distance)
figures between each pair of sequences. Then a description of a NJ
tree is given. This description shows which sequences (SEQ:) or
which groups of sequences (NODE: , a node is numbered using the
lowest sequence that belongs to it) join at each level of the tree.
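If you want to pull the branch lengths out of such a listing yourself, the following Python sketch may help. It is not part of the program; it simply assumes the "SEQ: n ( x)" and "Node: n ( x)" layout seen in the listing above and converts each length to a percentage, which is how the figures appear in the diagrams in this section. The name parse_items is made up for this illustration.

```python
import re

# Pattern for one joined item as it appears in the listing above,
# e.g. "SEQ: 1 ( 0.28142)" or "Node: 5 ( 0.33462)".
ITEM = re.compile(r"(SEQ|Node):\s*(\d+)\s*\(\s*([0-9.]+)\s*\)", re.IGNORECASE)

def parse_items(line):
    """Return (kind, number, percent) for every item found on one line."""
    items = []
    for kind, number, length in ITEM.findall(line):
        # Lengths are printed as fractions; the diagrams show percentages.
        items.append((kind.upper(), int(number), round(float(length) * 100, 1)))
    return items

listing = [
    "Cycle 2 = SEQ: 1 ( 0.28142) joins Node: 5 ( 0.33462)",
    "Cycle 3 = SEQ: 2 ( 0.05879) joins SEQ: 3 ( 0.03086)",
]
for line in listing:
    print(parse_items(line))
```

The first line yields 28.1 for SEQ 1 and 33.5 for Node 5, the same figures that label those branches in the diagram of the tree.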
This is an unrooted tree!! This means that the direction of
evolution through the tree is not shown. This can only be inferred
in one of two ways:
1) assume a degree of constancy in the molecular clock and place the
root (bottom of the tree; the point where all the sequences radiate
from) half way along the longest branch. **OR**
2) use an "outgroup", a sequence from an organism that you "know"
must be outside of the rest of the sequences i.e. root the tree
manually, on biological grounds.
The above tree can be represented diagrammatically as follows:
                         SEQ 1       SEQ 4
                           I           I
         13.6              I 28.1      I 4.9          5.9
SEQ 6 ----------I          I           I          I--------- SEQ 2
                I          I           I          I
                I----------I-----------I----------I
         13.4   I   33.5       20.8        2.3    I   3.1
SEQ 5 ----------I                                 I--------- SEQ 3
The figures along each branch are percent divergences along that
branch. If you root the tree by placing the root along the longest
branch (33.5%) then you can draw it again as follows, this time
rooted:
                        13.6
               I-------------------- SEQ 6
     I---------I        13.4
     I         I-------------------- SEQ 5
     I  33.5
-----I                  28.1
     I         I-------------------- SEQ 1
     I         I
     I---------I             4.9
               I 20.8   I----------- SEQ 4
               I--------I
                        I       5.9
                        I 2.3 I----- SEQ 2
                        I-----I 3.1
                              I----- SEQ 3
The longest branch (33.5% between 5,6 and 1,2,3,4) is split between
the 2 bottom branches of the tree. As it happens in this particular
case, sequences 5 and 6 are myoglobins while sequences 1,2,3 and 4
are alpha and beta globins, so you could also justify the above
rooting on biological grounds. If you do not have any particular
need or evidence for the position of the root, then LEAVE THE TREE
UNROOTED. Unrooted trees do not look as pretty as rooted ones but
it is usual to leave them unrooted if you do not have any evidence
for the position of the root.
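The first rooting method (root half way along the longest branch) can be sketched in Python as follows. This is only an illustration using the branch lengths from the example tree; the internal node labels A to D and the function name are made up for the sketch and do not appear in the program output.

```python
# Branches of the unrooted example tree, lengths in percent divergence.
# Internal nodes A-D are hypothetical labels introduced for this sketch.
branches = {
    ("SEQ6", "A"): 13.6, ("SEQ5", "A"): 13.4,
    ("A", "B"): 33.5, ("SEQ1", "B"): 28.1,
    ("B", "C"): 20.8, ("SEQ4", "C"): 4.9,
    ("C", "D"): 2.3, ("SEQ2", "D"): 5.9, ("SEQ3", "D"): 3.1,
}

def root_on_longest_branch(branches):
    """Return the longest branch and the two halves created by rooting it."""
    (left, right), length = max(branches.items(), key=lambda kv: kv[1])
    half = length / 2  # the root sits half way along the branch
    return (left, right), (half, half)

branch, halves = root_on_longest_branch(branches)
print(branch, halves)  # ('A', 'B') (16.75, 16.75)
```

Here the 33.5% branch between the 5,6 group and the 1,2,3,4 group is chosen, and the root divides it into two parts of 16.75% each.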
BOOTSTRAPPING: Different sets of sequences and different tree
drawing methods may give different topologies (branching orders) for
parts of a tree that are weakly supported by the data. It is useful
to have an indication of the degree of error in the tree. There are
several ways of doing this, some of them rather technical. We
provide one general purpose method in this program, which makes use
of a technique called bootstrapping (see Felsenstein, 1985).
In the case of sequence alignments, bootstrapping involves taking
random samples of positions from the alignment. If the alignment
has N positions, each bootstrap sample consists of a random sample
of N positions, taken WITH REPLACEMENT i.e. in any given sample,
some sites may be sampled several times, others not at all. Then,
with each sample of sites, you calculate a distance matrix as usual
and draw a tree. If the data very strongly support just one tree
then the sample trees will be very similar to each other and to the
original tree, drawn without bootstrapping. However, if parts of
the tree are not well supported, then the sample trees will vary
considerably in how they represent these parts.
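The sampling step can be sketched in a few lines of Python. This is a minimal illustration of sampling positions with replacement, not the code used by the program; the toy alignment is invented, and Python's random number generator is not the program's, so a given seed will not reproduce the program's own samples.

```python
import random

def bootstrap_sample(alignment, rng):
    """Resample alignment columns with replacement.

    alignment: list of equal-length aligned sequences (strings).
    Returns a new alignment with the same number of positions.
    """
    n = len(alignment[0])
    # Draw N column indices WITH replacement: some repeat, some are missed.
    columns = [rng.randrange(n) for _ in range(n)]
    return ["".join(seq[c] for c in columns) for seq in alignment]

# Toy alignment, invented for illustration only.
alignment = ["ACGTACGT", "ACGTTCGT", "ACCTACGA"]
rng = random.Random(99)  # fixed seed so the run is repeatable
sample = bootstrap_sample(alignment, rng)
print(sample)
```

Each sample would then be passed to the usual distance and tree calculation. The same column indices are applied to every sequence, so each sampled position is still a complete column of the original alignment.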
In practice, you should use a very large number of bootstrap
replicates (1000 is recommended, even if it means running the
program for an hour on a slow microcomputer; on a workstation it
will be MUCH faster). For each grouping on the tree, you record the
number of times this grouping occurs in the sample trees. For a
group to be considered "significant" at the 95% level (or P <= 0.05
in statistical terms) you expect the grouping to show up in >= 95%
of the sample trees. If this happens, then you can say that the
grouping is significant, given the data set and the method used to
draw the tree.
So, when you use the bootstrap option, a NJ tree is drawn as before
and then you are asked to say how many bootstrap samples you want
(1000 is the default) and you are asked to give a seed number for
the random number generator. If you give the same seed number in
future, you will get the same results (we hope). Remember to give
different seed numbers if you wish to carry out genuinely different
bootstrap sampling experiments. Below is the output file from using
the same data for the 6 globin sequences as used before. The output
file has the same name as the input file with the extension ".njb".
//
STUFF DELETED .... same as for the ordinary NJ output
//
Bootstrap Confidence Limits
Random number generator seed = 99
Number of bootstrap trials = 1000
Diagrammatic representation of the above tree:
Each row represents 1 tree cycle; defining 2 groups.
Each column is 1 sequence; the stars in each line show 1 group;
the dots show the other
Numbers show occurences in bootstrap samples.
****.. 1000
.***.. 1000 <- This is the answer!!
*..*** 812
122311
For an unrooted tree with N sequences, there are actually only N-3
genuinely different groupings that we can test (this is the number
of "internal branches"; each internal branch splits the sequences
into 2 groups). In this example, we have 6 sequences with 3
internal branches in the reference tree. In the bootstrap
resampling, we count how often each of these internal branches
occurs. Here, we find that the branch which splits 1,2,3 and 4
versus 5 and 6 occurs in all 1000 samples; the branch which splits
2,3 and 4 versus 1,5 and 6 also occurs in all 1000; the branch which
splits 2 and 3 versus 1,4,5 and 6 occurs in 812/1000 samples. We can put
these figures on to the diagrammatic representation we made earlier
of our unrooted NJ tree as follows:
                         SEQ 1       SEQ 4
                           I           I
                           I           I
SEQ 6 ----------I          I           I          I--------- SEQ 2
                I   1000   I   1000    I   812    I
                I----------I-----------I----------I
                I                                 I
SEQ 5 ----------I                                 I--------- SEQ 3
You can equally put these confidence figures on the rooted tree (in
fact the interpretation is simpler with rooted trees). With the
unrooted tree, the grouping of sequence 5 with 6 is significant (as
is the grouping of sequences 1,2,3 and 4). Equally the grouping of
sequences 1,5 and 6 is significant (the same as saying that 2,3 and
4 group significantly). However, the grouping of 2 and 3 is not
significant, although it is relatively strongly supported.
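The reading of the star/dot rows can be sketched in Python as follows. The patterns and counts are copied from the example output; the function name describe and the constants are made up for this illustration, and the 95% figure is used only as the cut off described in the text.

```python
TRIALS = 1000
THRESHOLD = 0.95  # the 95% level discussed above

# Star/dot patterns and counts copied from the example output above.
counts = {"****..": 1000, ".***..": 1000, "*..***": 812}

def describe(pattern, count, trials=TRIALS):
    """Return the two groups of a split and whether it reaches the threshold."""
    stars = [i + 1 for i, ch in enumerate(pattern) if ch == "*"]
    dots = [i + 1 for i, ch in enumerate(pattern) if ch == "."]
    return stars, dots, count / trials >= THRESHOLD

n_sequences = 6
assert len(counts) == n_sequences - 3  # N-3 internal branches

for pattern, count in counts.items():
    print(pattern, describe(pattern, count))
```

The first two splits reach the threshold (1000/1000 each); the split of 2 and 3 versus 1,4,5 and 6 does not (812/1000 = 81.2%).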
Unfortunately, there is a small complication in the interpretation
of these results. In statistical hypothesis testing, it is not
valid to make multiple simultaneous tests and to treat the result of
each test completely independently. In the above case, if you have
one particular test (grouping) that you wish to make in advance, it
is valid to test IT ALONE and to simply show the other bootstrap
figures for reference. If you do not have any particular test in
mind before you do the bootstrapping, you can just show all of the
figures and use the 95% level as an ARBITRARY cut off to show those
groups that are very strongly supported; but not mention anything
about SIGNIFICANCE testing. In the literature, it is common
practice to simply show the figures with a tree; they frequently
speak for themselves.
*******************************************************************
4. Command Line Interface.
You can do almost everything that can be done from the menus, using
a command line interface. In this mode, the program will take all of
its instructions as "switches" when you activate it; no questions
will be asked; if there are no errors, the program just does an
analysis and stops. This mode is less convenient on the MAC but is
still possible. To get you started we will show you the 2 simplest
uses of the command line as it looks on VAX/VMS. On all other
machines (except the MAC) it works in the same way.
$ clustalv /help **OR** $ clustalv /check
Both of the above switches give you a one page summary of the
command line on the screen and then the program stops.
$ clustalv proteins.seq **OR** $ clustalv /infile=proteins.seq
This will read the sequences from the file 'proteins.seq' and do a
complete multiple alignment. Default parameters will be used, the
program will try to tell whether or not the sequences are DNA or
protein and the output will go to a file called 'proteins.aln' . A
dendrogram file called 'proteins.dnd' will also be created. Thus
the default action for the program, when it successfully reads in an
input file is to do a full multiple alignment. Some further
examples of command line usage will be given later.
Command line switches can be abbreviated but MAKE SURE YOU DO NOT
MAKE THEM AMBIGUOUS. No attempt will be made to detect ambiguity.
Use enough characters to distinguish each switch uniquely.
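Since the program does not detect ambiguity, you may wish to check an abbreviation yourself before using it. The Python sketch below does this against the switch names listed in this section. It assumes that an abbreviation is simply the leading characters of a switch name; how the program itself resolves abbreviations is not described here, so treat this only as a helper for the user.

```python
# Switch names as listed in this section of the document.
SWITCHES = [
    "INFILE", "PROFILE1", "PROFILE2", "HELP", "CHECK", "ALIGN", "TREE",
    "BOOTSTRAP", "KTUP", "TOPDIAGS", "WINDOW", "PAIRGAP", "FIXEDGAP",
    "FLOATGAP", "MATRIX", "TYPE", "OUTPUT", "TRANSIT", "KIMURA",
    "TOSSGAPS", "SEED",
]

def matches(abbrev):
    """Return every switch name that starts with the given abbreviation."""
    abbrev = abbrev.upper().lstrip("/")
    return [name for name in SWITCHES if name.startswith(abbrev)]

def is_unambiguous(abbrev):
    """True when exactly one switch name starts with the abbreviation."""
    return len(matches(abbrev)) == 1

print(matches("/tr"))          # ['TREE', 'TRANSIT']
print(is_unambiguous("/tre"))  # True
```

For example, /TR could mean /TREE or /TRANSIT, so at least /TRE or /TRA is needed to tell them apart.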
The full list of allowed switches is given below:
DATA (sequences)
/INFILE=file.ext :input sequences. If you give an input file and
nothing else as a switch, the default action is
to do a complete multiple alignment. The input
file can also be specified by giving it as the
first command line parameter with no "/" in
front of it e.g $ clustalv file.ext .
/PROFILE1=file.ext :You use these two switches to give the names of
/PROFILE2=file.ext two profiles. The default action is to align
the two. You must give the names of both profile
files.
VERBS (do things)
/HELP :list the command line parameters on the screen.
/CHECK :same as /HELP.
/ALIGN :do full multiple alignment. This is the default
action if no other switches except for input files
are given.
/TREE :calculate NJ tree. If this is the only action
specified (e.g. $ clustalv proteins.seq/tree ) it IS
ASSUMED THAT THE SEQUENCES ARE ALREADY ALIGNED. If
the sequences are not already aligned, you should
also give the /ALIGN switch. This will align the
sequences first, output an alignment file and
calculate the tree in memory.
/BOOTSTRAP(=n) :bootstrap a NJ tree (n= number of bootstraps;
default = 1000). If this is the only action
specified (e.g. $ clustalv proteins.seq/bootstrap )
it IS ASSUMED THAT THE SEQUENCES ARE ALREADY ALIGNED.
If the sequences are not already aligned, you should
also give the /ALIGN switch. This will align the
sequences first, output an alignment file and
calculate the bootstraps in memory. You can set the
number of bootstrap trials here (e.g./bootstrap=500).
You can set the seed number for the random number
generator with /seed=n.
PARAMETERS (set things)
***Pairwise alignments:***
/KTUP=n :word size
/TOPDIAGS=n :number of best diagonals
/WINDOW=n :window around best diagonals
/PAIRGAP=n :gap penalty
***Multiple alignments:***
/FIXEDGAP=n :fixed length gap pen.
/FLOATGAP=n :variable length gap pen.
/MATRIX= :PAM100 or ID or file name. The default weight matrix
for proteins is PAM 250.
/TYPE=p or d :type is protein or DNA. This allows you to
explicitly override the program's attempt at guessing
the type of the sequence. It is only useful if you
are using sequences with a VERY strange composition.
/OUTPUT= :GCG or PHYLIP or PIR. The default output is
Clustal format.
/TRANSIT :transitions not weighted. The default is to weight
transitions as more favourable than other mismatches
in DNA alignments. This switch makes all nucleotide
mismatches equally weighted.
***Trees:***
/KIMURA :use Kimura's correction on distances.
/TOSSGAPS :ignore positions with a gap in ANY sequence.
/SEED=n :seed number for bootstraps.
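If you run the program from a script, a small helper can put a command line together from the switches above. The Python sketch below is an illustration only: the function build_command and its arguments are made up, only switches listed in this section are used, and separating the switches with spaces is an assumption that you should check on your own machine.

```python
def build_command(infile, align=False, tree=False, bootstrap=None, seed=None):
    """Compose a clustalv command line from switches documented above."""
    parts = ["clustalv", f"/infile={infile}"]
    if align:
        parts.append("/align")       # align first if sequences are unaligned
    if tree:
        parts.append("/tree")
    if bootstrap is not None:
        parts.append(f"/bootstrap={bootstrap}")
    if seed is not None:
        parts.append(f"/seed={seed}")
    return " ".join(parts)

print(build_command("proteins.seq", align=True, bootstrap=500, seed=99))
# clustalv /infile=proteins.seq /align /bootstrap=500 /seed=99
```

The /align switch is included in the example because /TREE and /BOOTSTRAP on their own assume that the sequences are already aligned.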
EXAMPLES:
These examples use the VAX/VMS $ prompt; otherwise, command-line
usage is the same on all machines except the Macintosh.
$ clustalv proteins.seq OR $ clustalv /infile=proteins.seq