            I   I------ 3
       I----I
       I    I------------- 4
    I--I
    I  I------------------ 1
----I
    I       I------------- 5
    I-------I
            I------------- 6
Diagram of the sequence similarity relationships shown in the above
dendrogram file (branch lengths are not to scale).
MULTIPLE ALIGNMENT PARAMETERS:
Having calculated a dendrogram between a set of sequences, the final
multiple alignment is carried out by a series of alignments of
larger and larger groups of sequences. The order is determined by
the dendrogram so that the most similar sequences get aligned first.
Any gaps that are introduced in the early alignments are fixed.
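As a rough illustration of how an alignment order can be read off a
dendrogram, here is a minimal Python sketch. The nested-tuple tree and
the function name merge_order are assumptions made for this example;
they are not Clustal V's internal data structures. The tree is a
six-sequence example like the one in the diagram above.

```python
def merge_order(tree):
    """Return the group merges implied by a dendrogram, children first.

    `tree` is either a sequence number (a leaf) or a 2-tuple of subtrees.
    Each merge is reported as (left_members, right_members).
    """
    steps = []

    def visit(node):
        # A leaf is a group containing just one sequence.
        if not isinstance(node, tuple):
            return [node]
        left = visit(node[0])
        right = visit(node[1])
        # Both subgroups are complete, so they can now be aligned together.
        steps.append((left, right))
        return left + right

    visit(tree)
    return steps


# Hypothetical tree: 2 and 3 are most similar, then 4, then 1; 5 pairs with 6.
tree = ((((2, 3), 4), 1), (5, 6))
for left, right in merge_order(tree):
    print(left, "+", right)
```

Each printed step aligns two groups that are already internally
aligned, so gaps placed in an earlier step are carried forward
unchanged.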
When two groups of sequences are aligned against each other, a full
protein weight matrix (such as a Dayhoff PAM 250) is used. Two gap
penalties are offered: a "FIXED" penalty for opening up a gap and a
"FLOATING" penalty for extending a gap.
********* MULTIPLE ALIGNMENT PARAMETERS *********
1. Fixed Gap Penalty :10
2. Floating Gap Penalty :10
3. Toggle Transitions (DNA):Weighted
4. Protein weight matrix :PAM 250
H. HELP
Enter number (or [RETURN] to exit):
FIXED GAP PENALTY: Reduce this to encourage gaps of all sizes;
increase it to discourage them. Terminal gaps are penalised the same
as all others. BEWARE of making this too small (approx 5 or so); if
the penalty is too small, the program may prefer to align each
sequence opposite one long gap.
FLOATING GAP PENALTY: Reduce this to encourage longer gaps;
increase it to shorten them. Terminal gaps are penalised the same as
all others. BEWARE of making this too small (approx 5 or so); if
the penalty is too small, the program may prefer to align each
sequence opposite one long gap.
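The interplay of the two penalties can be sketched in a few lines of
Python. The formula used here (fixed penalty plus floating penalty
times gap length) is an assumption for illustration; this section does
not state the exact expression Clustal V uses, and the function name
gap_cost is hypothetical.

```python
def gap_cost(length, fixed=10, floating=10):
    """Penalty for one gap of `length` positions.

    Assumed scheme: a fixed charge for opening the gap plus a floating
    charge for every position the gap covers.
    """
    if length <= 0:
        return 0
    return fixed + floating * length


# One gap of 4 positions versus two separate gaps of 2 positions each.
print(gap_cost(4), 2 * gap_cost(2))

# Lowering the floating penalty makes long gaps cheap compared with
# opening additional gaps.
print(gap_cost(4, floating=2), 2 * gap_cost(2, floating=2))
```

Under this assumed scheme, the fixed penalty is paid once per gap no
matter how long the gap is, which is why raising it discourages gaps of
every size while the floating penalty mainly controls gap length.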
DNA TRANSITIONS = WEIGHTED or UNWEIGHTED: By default, transitions
(A versus G; C versus T) are weighted more strongly than
transversions (an A aligned with a G will be preferred to an A
aligned with a C or a T). You can make all pairs of nucleotides
equally weighted with this option.
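The distinction between transitions and transversions can be sketched
as follows. The numeric scores (2, 1, 0) and the function names are
placeholders chosen for illustration; they are not the weights Clustal
V actually uses.

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def is_transition(x, y):
    """True for a purine/purine or pyrimidine/pyrimidine mismatch."""
    pair = {x.upper(), y.upper()}
    return len(pair) == 2 and (pair <= PURINES or pair <= PYRIMIDINES)


def dna_score(x, y, weighted=True, match=2, transition=1, transversion=0):
    """Score one aligned nucleotide pair (illustrative values only)."""
    x, y = x.upper(), y.upper()
    if x == y:
        return match
    if weighted and is_transition(x, y):
        return transition
    # With weighted=False every mismatch gets the same score.
    return transversion


print(dna_score("A", "G"), dna_score("A", "C"))
print(dna_score("A", "G", weighted=False), dna_score("A", "C", weighted=False))
```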
PROTEIN WEIGHT MATRIX: For protein comparisons, a weight matrix is
used to differentially weight different pairs of aligned amino
acids. The default is the well known Dayhoff PAM 250 matrix. We
also offer a PAM 100 matrix, an identity matrix (all weights are the
same for exact matches) or allow you to give the name of a file with
your own matrix. The weight matrices used by Clustal V are shown in
full in the Algorithms and References section of this documentation.
If you input a matrix from a file, it must be in the following
format. Use a 20x20 matrix only (entries for the 20 "normal" amino
acids only; no ambiguity codes etc.). Input the lower left triangle
of the matrix, INCLUDING the diagonal. The order of the amino acids
(rows and columns) must be: CSTPAGNDEQHRKMILVFYW. The values can be
in free format separated by spaces (not commas). The PAM 250 matrix
is shown below in this format.
 12
  0  2
 -2  1  3
 -3  1  0  6
 -2  1  1  1  2
 -3  1  0 -1  1  5
 -4  1  0 -1  0  0  2
 -5  0  0 -1  0  1  2  4
 -5  0  0 -1  0  0  1  3  4
 -5 -1 -1  0  0 -1  1  2  2  4
 -3 -1 -1  0 -1 -2  2  1  1  3  6
 -4  0 -1  0 -2 -3  0 -1 -1  1  2  6
 -5  0  0 -1 -1 -2  1  0  0  1  0  3  5
 -5 -2 -1 -2 -1 -3 -2 -3 -2 -1 -2  0  0  6
 -2 -1  0 -2 -1 -3 -2 -2 -2 -2 -2 -2 -2  2  5
 -6 -3 -2 -3 -2 -4 -3 -4 -3 -2 -2 -3 -3  4  2  6
 -2 -1  0 -1  0 -1 -2 -2 -2 -2 -2 -2 -2  2  4  2  4
 -4 -3 -3 -5 -4 -5 -4 -6 -5 -5 -2 -4 -5  0  1  2 -1  9
  0 -3 -3 -5 -3 -5 -2 -4 -4 -4  0 -4 -4 -2 -1 -1 -2  7 10
 -8 -2 -5 -6 -6 -7 -4 -7 -7 -5 -3  2 -3 -4 -5 -2 -6  0  0 17
Values must be integers and can be all positive or positive and
negative as above. These are SIMILARITY values.
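To check a matrix file of your own before giving it to the program, a
reader for this lower-triangle layout might look like the Python sketch
below. The function name read_lower_triangle is hypothetical, and the
sketch assumes that "free format" means the values are simply read in
row order regardless of line breaks.

```python
ORDER = "CSTPAGNDEQHRKMILVFYW"


def read_lower_triangle(text, order=ORDER):
    """Parse a lower-left triangle (diagonal included) into a symmetric table.

    Returns a dict mapping (residue, residue) to an integer similarity.
    int() rejects commas and non-integer values with a ValueError.
    """
    values = [int(token) for token in text.split()]
    n = len(order)
    expected = n * (n + 1) // 2
    if len(values) != expected:
        raise ValueError(f"expected {expected} values, found {len(values)}")

    scores = {}
    it = iter(values)
    for i in range(n):
        for j in range(i + 1):      # row i holds columns 0..i
            v = next(it)
            scores[order[i], order[j]] = v
            scores[order[j], order[i]] = v   # mirror to the upper triangle
    return scores


# Small demonstration using only the first three rows (C, S, T).
demo = read_lower_triangle("12\n0 2\n-2 1 3\n", order="CST")
print(demo["T", "S"], demo["S", "T"], demo["C", "C"])
```

Called with the full 20-row PAM 250 text shown above and the default
order, the same function expects 210 values.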
ALIGNMENT OUTPUT OPTIONS:
By default, the alignment goes to a file in a self-explanatory
"blocked" alignment format. This format is fine for displaying the
results but requires heavy editing if you wish to use the alignment
with other software. To help, we provide 3 other formats which can
be turned on or off. If you have a sequence data set or alignment
in memory, you can also ask for output files in whatever formats are
turned on, NOW. The menu you use to choose format is shown below.
***
We draw your attention to NBRF/PIR format in particular. This
format is EXACTLY the same as one of the input formats. Therefore,
alignments written in this format can be used again as input (to the
profile alignments or phylogenetic trees).
***
********* Format of Alignment Output *********
1. Toggle CLUSTAL format output = ON
2. Toggle NBRF/PIR format output = OFF
3. Toggle GCG format output = OFF
4. Toggle PHYLIP format output = OFF
5. Create alignment output file(s) now?
H. HELP
Enter number (or [RETURN] to exit):
CLUSTAL FORMAT: This is a self-explanatory alignment. The
alignment is written out in blocks. Identities are highlighted and
(if you use a PAM 250 matrix) positions in the alignment where all
of the residues are "similar" to each other (PAM 250 score of 8 or
more) are indicated.
NBRF/PIR FORMAT: This is the usual NBRF/PIR format with gaps
indicated by hyphens ("-"). As we have stressed before, this format
is EXACTLY compatible with the sequence input format. Therefore you
can read in these alignments again for profile alignments or for
calculating phylogenetic trees.
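For readers who have not met NBRF/PIR files, the Python sketch below
writes an alignment in the general NBRF/PIR layout: a ">P1;" (protein)
or ">DL;" (DNA) line carrying the sequence name, a free-text
description line, then the sequence ending in "*". The function name
to_pir and the 60-column line width are assumptions for this example,
and the output has not been checked against Clustal V's own reader.

```python
def to_pir(alignment, is_protein=True, width=60):
    """Format (name, aligned_sequence) pairs as NBRF/PIR text.

    Gaps are kept as hyphens; each sequence is terminated by '*'.
    """
    code = "P1" if is_protein else "DL"
    lines = []
    for name, seq in alignment:
        lines.append(f">{code};{name}")
        lines.append(name)               # description line (free text)
        body = seq + "*"
        for start in range(0, len(body), width):
            lines.append(body[start:start + width])
    return "\n".join(lines) + "\n"


print(to_pir([("hba", "VLSP-ADK"), ("hbb", "VHLTPEEK")]), end="")
```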
GCG FORMAT: In version 7 of the Wisconsin GCG package, a new
multiple sequence format was introduced: MSF (Multiple Sequence
Format). It can be used as input to the GCG
sequence editor or any of the GCG programs that make use of multiple
alignments. THIS FORMAT IS ONLY SUPPORTED IN VERSION 7 OF THE GCG
PACKAGE OR LATER.
PHYLIP FORMAT: This format can be used by the Phylip package of
Joe Felsenstein (see the references/algorithms section for details
of how to get it). Phylip allows you to do a huge range of
phylogenetic analyses (we just offer one method in this program) and
is probably the most widely used set of programs for drawing trees.
It also works on just about every computer you can think of,
providing you have a decent Pascal compiler.
******************************
* PROFILE ALIGNMENT MENU. *
******************************
This menu is for taking two old alignments (or single sequences) and
aligning them with each other. The result is one bigger alignment.
The menu is very similar to the multiple alignment menu except that
there is no mention of dendrograms here (they are not needed) and
you need to input two sets of sequences. The menu looks like this:
******Profile*Alignment*Menu******
1. Input 1st. profile/sequence
2. Input 2nd. profile/sequence
3. Do alignment now
4. Alignment parameters
5. Output format options
S. Execute a system command
H. HELP
or press [RETURN] to go back to main menu
Your choice:
You must input profile number 1 first. When both profiles are
loaded, use item 3 (Do alignment now) and the 2 profiles will be
aligned. Items 4 and 5 (parameters and output options) are
identical to the equivalent options on the multiple alignment menu.
The same input routines that are used for general input are used
here i.e. sequences must be in NBRF/PIR, EMBL/SwissProt or FASTA
format, with gaps indicated by hyphens ("-"). This is why we have
continually drawn your attention to the NBRF/PIR format as a useful
output format.
Either profile can consist of just one sequence. Therefore, if you
have a favourite alignment of sequences that you are working on and
wish to add a new sequence, you can use this menu, provided the
alignment is in the correct format.
The total number of sequences in the two profiles must be less than
or equal to the MAXN parameter set in the clustalv.h header
file.
******************************
* PHYLOGENETIC TREE MENU. *
******************************
This menu allows you to input an alignment and calculate a
phylogenetic tree. You can also calculate a tree if you have just
carried out a multiple alignment and the alignment is still in
memory. THE SEQUENCES MUST BE ALIGNED ALREADY!!!!!! The tree will
look strange if the sequences are not already aligned. You can also
"BOOTSTRAP" the tree to show confidence levels for groupings. This
is SLOW on microcomputers but works fine on workstations or
mainframes.
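Bootstrapping, in general, means rebuilding the tree many times from
alignments whose columns have been resampled with replacement, then
counting how often each grouping reappears. The Python sketch below
shows only the resampling step. The function name bootstrap_sample is
hypothetical, and the number of trials and the random number generator
Clustal V uses are not described in this section.

```python
import random


def bootstrap_sample(alignment, rng):
    """Resample the columns of an alignment with replacement.

    `alignment` is a list of equal-length aligned sequences; the result
    has the same number of sequences and the same number of columns.
    """
    length = len(alignment[0])
    columns = [rng.randrange(length) for _ in range(length)]
    return ["".join(seq[c] for c in columns) for seq in alignment]


rng = random.Random(1)          # fixed seed so the demonstration repeats
for row in bootstrap_sample(["ACGTAC", "AC-TAC", "ACGTT-"], rng):
    print(row)
```

Every resampled alignment needs its own distance calculation and tree,
which is why the procedure is slow on small machines.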
******Phylogenetic*tree*Menu******
1. Input an alignment
2. Exclude positions with gaps? = OFF
3. Correct for multiple substitutions? = OFF
4. Draw tree now
5. Bootstrap tree
S. Execute a system command
H. HELP
or press [RETURN] to go back to main menu
Your choice:
The same input routine that is used for general input is used here
i.e. sequences must be in NBRF/PIR, EMBL/SwissProt or FASTA format,
with gaps indicated by hyphens ("-"). This is why we have
continually drawn your attention to the NBRF/PIR format as a useful
output format.
If you have input an alignment, then just use item 4 to draw a tree.
The method used is the Neighbor Joining method of Saitou and Nei
(1987). This is a "distance method". First, percent divergence
figures are calculated between all pairs of sequences. These
divergence figures are then used by the NJ method to give the tree.
Example trees will be shown below.
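To make the first step of the method concrete, the Python sketch below
picks the first pair to join using the commonly taught "Q matrix"
formulation of neighbor joining. Whether Clustal V computes the
criterion in exactly this form is an assumption; the function name
nj_first_join is hypothetical. The distances are the six-globin DIST
values from the example output later in this section.

```python
def nj_first_join(dist, n):
    """Pick the first pair joined by neighbor joining.

    `dist` maps (i, j) with i < j to a distance; taxa are numbered 1..n.
    Returns (i, j, branch_length_i, branch_length_j).
    """
    def d(i, j):
        return 0.0 if i == j else dist[(min(i, j), max(i, j))]

    # Sum of distances from each taxon to all the others.
    total = {i: sum(d(i, k) for k in range(1, n + 1)) for i in range(1, n + 1)}

    best = None
    for i in range(1, n + 1):
        for j in range(i + 1, n + 1):
            q = (n - 2) * d(i, j) - total[i] - total[j]
            if best is None or q < best[0]:
                best = (q, i, j)

    _, i, j = best
    len_i = d(i, j) / 2 + (total[i] - total[j]) / (2 * (n - 2))
    len_j = d(i, j) - len_i
    return i, j, len_i, len_j


DIST = {
    (1, 2): 0.5683, (1, 3): 0.5540, (1, 4): 0.5315, (1, 5): 0.7447,
    (1, 6): 0.7571, (2, 3): 0.0897, (2, 4): 0.1391, (2, 5): 0.7517,
    (2, 6): 0.7431, (3, 4): 0.0957, (3, 5): 0.7379, (3, 6): 0.7361,
    (4, 5): 0.7304, (4, 6): 0.7368, (5, 6): 0.2697,
}
i, j, len_i, len_j = nj_first_join(DIST, 6)
print(f"SEQ: {i} ({len_i:.4f}) joins SEQ: {j} ({len_j:.4f})")
```

With these rounded distances the sketch joins sequences 5 and 6 with
branch lengths of about 0.1338 and 0.1359, which agrees with Cycle 1 of
the sample output to the precision of the printed distances.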
There are two options which can be used to control the way the
distances are calculated. These are set by options 2 and 3 in the
menu.
EXCLUDE POSITIONS WITH GAPS? This option allows you to ignore all
alignment positions (columns) where there is a gap in ANY sequence.
This guarantees that "like" is compared with "like" in all distances
i.e. the same positions are used to calculate all distances. It
also means that the distances will be "metric". The disadvantage of
using this option is that you throw away much of the data if there
are many gaps. If the total number of gaps is small, it has little
effect.
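The effect of this option on a single distance can be sketched in
Python as follows. The sketch assumes that, when the option is OFF,
positions with a gap in either of the two sequences being compared are
skipped for that pair only; this would explain why the "length" values
in the sample output differ from pair to pair, but it is an assumption
about the program's behaviour. The function names are hypothetical.

```python
def gapped_columns(alignment):
    """Positions where at least one sequence has a gap."""
    return frozenset(
        pos for pos, column in enumerate(zip(*alignment)) if "-" in column
    )


def divergence(a, b, skip_columns=frozenset()):
    """Return (fraction of differing sites, number of sites compared)."""
    diffs = sites = 0
    for pos, (x, y) in enumerate(zip(a, b)):
        if pos in skip_columns or x == "-" or y == "-":
            continue
        sites += 1
        if x != y:
            diffs += 1
    return (diffs / sites if sites else 0.0), sites


alignment = ["ACGTAC", "AG-TAC", "ACGTT-"]

# Option OFF: only this pair's own gap positions are skipped.
print(divergence(alignment[0], alignment[1]))

# Option ON: every column with a gap in ANY sequence is skipped.
print(divergence(alignment[0], alignment[1], gapped_columns(alignment)))
```

With the option ON, every pair is measured over the same set of
columns, at the cost of discarding column 6 here even though neither
of the two compared sequences has a gap there.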
CORRECT FOR MULTIPLE SUBSTITUTIONS? As sequences diverge,
substitutions accumulate. It becomes increasingly likely that more
than one substitution (as a result of a mutation) will have happened
at a site where you observe just one difference now. This option
allows you to use formulae developed by Motoo Kimura to correct for
this effect. It has the effect of stretching long branches in trees
while leaving short ones relatively untouched. The desired effect
is to try and make distances proportional to time since divergence.
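As an illustration of the stretching effect, the Python sketch below
implements two well-known corrections due to Kimura: the empirical
formula for protein distances, K = -ln(1 - D - D*D/5), and the
two-parameter formula for DNA. Which formulae Clustal V applies, and
any special handling of very large distances, are not stated in this
section, so treat the choice here as an assumption.

```python
import math


def kimura_protein(d):
    """Kimura's empirical correction for an observed protein distance.

    `d` is the fraction of differing sites (0 <= d < about 0.85).
    """
    x = 1.0 - d - d * d / 5.0
    if x <= 0.0:
        raise ValueError("distance too large to correct")
    return -math.log(x)


def kimura_dna(p, q):
    """Kimura two-parameter distance.

    `p` is the fraction of sites differing by a transition,
    `q` the fraction differing by a transversion.
    """
    a = 1.0 - 2.0 * p - q
    b = 1.0 - 2.0 * q
    if a <= 0.0 or b <= 0.0:
        raise ValueError("distance too large to correct")
    return -0.5 * math.log(a * math.sqrt(b))


# A short and a long observed distance: the long one is stretched far more.
for d in (0.0897, 0.7447):
    print(f"{d:.4f} -> {kimura_protein(d):.4f}")
```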
The tree is sent to a file called BLAH.NJ, where BLAH.SEQ is the
name of the input alignment file. An example is shown below for 6
globin sequences.
DIST = percentage divergence (/100)
Length = number of sites used in comparison
1 vs. 2 DIST = 0.5683; length = 139
1 vs. 3 DIST = 0.5540; length = 139
1 vs. 4 DIST = 0.5315; length = 111
1 vs. 5 DIST = 0.7447; length = 141
1 vs. 6 DIST = 0.7571; length = 140
2 vs. 3 DIST = 0.0897; length = 145
2 vs. 4 DIST = 0.1391; length = 115
2 vs. 5 DIST = 0.7517; length = 145
2 vs. 6 DIST = 0.7431; length = 144
3 vs. 4 DIST = 0.0957; length = 115
3 vs. 5 DIST = 0.7379; length = 145
3 vs. 6 DIST = 0.7361; length = 144
4 vs. 5 DIST = 0.7304; length = 115
4 vs. 6 DIST = 0.7368; length = 114
5 vs. 6 DIST = 0.2697; length = 152
Neighbor-joining Method
Saitou, N. and Nei, M. (1987) The Neighbor-joining Method:
A New Method for Reconstructing Phylogenetic Trees.
Mol. Biol. Evol., 4(4), 406-425
This is an UNROOTED tree
Numbers in parentheses are branch lengths
Cycle 1 = SEQ: 5 ( 0.13382) joins SEQ: 6 ( 0.13592)