3. Profile Alignments
4. Phylogenetic trees
S. Execute a system command
H. HELP
X. EXIT (leave program)
Your choice:
The options S and H appear on all the main menus. H will provide
help and if you type S you will be asked to enter a command, such as
DIR or LS, which will be sent to the system (this does not work on
Macs). Before carrying out an alignment, you must use option 1
(sequence input); the format for sequences is explained below.
Under menu item 2 you will be able to automatically align your
sequences to each other. Menu item 3 allows you to do profile
alignments. These are alignments of existing alignments, which allow
you to build up a multiple alignment in stages or add a new sequence
to an existing alignment. You can calculate phylogenetic trees from
alignments using menu item 4.
******************************
* SEQUENCE INPUT. *
******************************
All sequences should be in 1 file. Three formats are automatically
recognised and used: NBRF/PIR, EMBL/SwissProt and FASTA (Pearson and
Lipman (1988) format).
***
Users of the Wisconsin GCG package should use the command TONBRF
(recently changed to TOPIR) to reformat their sequences before use.
***
Sequences can be in upper or lower case. For proteins, the only
symbols recognised are: A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y and
for DNA/RNA use: A,C,G and T (or U). Any other letters of the
alphabet will be treated as X (proteins) or N (DNA/RNA) for unknown.
All other symbols (blanks, digits etc.) will be ignored EXCEPT for
the hyphen "-" which can be used to specify a gap. This last point
is especially useful for 2 reasons: 1) you can fix the positions of
some gaps in advance; 2) the alignment output from this program can
be written out in NBRF format using "-"'s to specify gaps; these
alignments can be used again as input, either for profile alignments
or for phylogenetic trees.
If you are using an editor to create sequence files, use the FASTA
format as it is by far the simplest (see below). If you have access
to utility programs for generating/converting the NBRF/PIR format
then use it in preference.
FASTA (PEARSON AND LIPMAN, 1988) FORMAT: Each sequence begins
with an angle bracket ">" in column 1. The text immediately after
the ">" is used as a title. Everything on the following lines, up
to the next ">" or the end of the file, is one sequence.
e.g.
> RABSTOUT rabbit Guinness receptor
LKMHLMGHLKMGLKMGLKGMHLMHLKHMHLMTYTYTTYRRWPLWMWLPDFGHAS
ADSCVCAHGFAVCACFAHFDVCFGAVCFHAVCFAHVCFAAAVCFAVCAC
> MUSNOSE mouse nose drying factor
mhkmmhkgmkhmhgmhmhglhmkmhlkmgkhmgkmkytytytryrwtqtqwtwyt
fdgfdsgafdagfdgfsagdfavdfdvgavfsvfgvdfsvdgvagvfdv
> HSHEAVEN human Guinness receptor repeat
mhkmmhkgmkhmhgmhmhg lhmkmhlkmgkhmgkmk ytytytryrwtqtqwtwyt
fdgfdsgafdagfdgfsag dfavdfdvgavfsvfgv dfsvdgvagvfdv
mhkmmhkgmkhmhgmhmhg lhmkmhlkmgkhmgkmk ytytytryrwtqtqwtwyt
fdgfdsgafdagfdgfsag dfavdfdvgavfsvfgv dfsvdgvagvfdv
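The FASTA rules described above can be sketched in a few lines of
Python. This is an illustration only, not the program's own input
routine; the function name parse_fasta is hypothetical. It keeps
letters and the gap symbol "-" and drops blanks, digits and other
symbols, as the symbol rules given earlier require.

```python
def parse_fasta(text):
    """Return a list of (title, sequence) pairs from FASTA-formatted text."""
    records = []
    title, chunks = None, []
    for line in text.splitlines():
        if line.startswith(">"):          # ">" in column 1 starts a new entry
            if title is not None:
                records.append((title, "".join(chunks)))
            title, chunks = line[1:].strip(), []
        elif title is not None:
            # Keep letters and "-"; ignore blanks, digits and other symbols.
            chunks.append("".join(c for c in line if c.isalpha() or c == "-"))
    if title is not None:
        records.append((title, "".join(chunks)))
    return records


example = """> HSHEAVEN human Guinness receptor repeat
mhkmmhkgmkhmhgmhmhg lhmkmhlkmgkhmgkmk ytytytryrwtqtqwtwyt
> MUSNOSE mouse nose drying factor
fdgfd-sgafd 123
"""
for title, seq in parse_fasta(example):
    print(title, len(seq))
```

Blanks inside a sequence line (as in the HSHEAVEN entry) are simply
skipped, so grouping residues in blocks is harmless.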
NBRF/PIR FORMAT is similar to FASTA format but immediately
after the ">", you find the characters "P1;" if the sequences are
protein or "DL;" if they are nucleic acid. Clustalv looks for the
";" character as the third character after the ">". If it finds
one, it assumes that the format is NBRF; if not, FASTA format is
assumed. The text after the ";" is treated as a sequence name while
the entire next line is treated as a title. The sequence is
terminated by a star "*" and the next sequence can then begin (with
a ">P1;" etc.). This is just the basic format description (there
are other variations and rules).
ANY files/sequences in GCG format can be converted to this format
using the TONBRF command (now TOPIR) of the Wisconsin GCG package.
e.g.
>P1;RABSTOUT
rabbit Guinness receptor
LKMHLMGHLKMGLKMGLKGMHLMHLKHMHLMTYTYTTYRRWPLWMWLPDFGHAS
ADSCVCAHGFAVCACFAHFDVCFGAVCFHAVCFAHVCFAAAVCFAVCAC*
>P1;MUSNOSE
mouse nose drying factor
mhkmmhkgmkhmhgmhmhglhmkmhlkmgkhmgkmkytytytryrwtqtqwtwyt
fdgfdsgafdagfdgfsagdfavdfdvgavfsvfgvdfsvdgvagvfd
*
>P1;HSHEAVEN
human Guinness receptor repeat protein.
mhkmmhkgmkhmhgmhmhg lhmkmhlkmgkhmgkmk ytytytryrwtqtqwtwyt
fdgfdsgafdagfdgfsag dfavdfdvgavfsvfgv dfsvdgvagvfdv
mhkmmhkgmkhmhgmhmhg lhmkmhlkmgkhmgkmk ytytytryrwtqtqwtwyt
fdgfdsgafdagfdgfsag dfavdfdvgavfsvfgv dfsvdgvagvfdv*
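As a rough sketch of the basic NBRF/PIR rules (name after the ";",
title on the next line, sequence up to the "*"), the following Python
reads entries like those in the example. The function names is_nbrf
and parse_nbrf are hypothetical, and the other variations of the
format are not handled.

```python
def is_nbrf(header):
    """True if ';' is the third character after '>' (e.g. '>P1;')."""
    return header.startswith(">") and len(header) > 3 and header[3] == ";"


def parse_nbrf(text):
    """Return a list of (name, title, sequence) triples."""
    records = []
    lines = iter(text.splitlines())
    for line in lines:
        if not is_nbrf(line):
            continue
        name = line[4:].strip()            # text after the ";"
        title = next(lines, "").strip()    # the entire next line
        chunks = []
        for seq_line in lines:             # continues from the same iterator
            finished = "*" in seq_line     # "*" terminates the sequence
            residues = seq_line.split("*")[0]
            chunks.append("".join(c for c in residues
                                  if c.isalpha() or c == "-"))
            if finished:
                break
        records.append((name, title, "".join(chunks)))
    return records


sample = """>P1;RABSTOUT
rabbit Guinness receptor
LKMHL
ADSCV*
>P1;MUSNOSE
mouse nose drying factor
mhkmm
fdgfd
*
"""
for name, title, seq in parse_nbrf(sample):
    print(name, "|", title, "|", seq)
```

The "*" may sit at the end of the last sequence line or on a line of
its own; both cases appear in the example entries.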
EMBL/SWISSPROT FORMAT: Do not try to create files with this
format unless you have utilities to help. If you are just using an
editor, use one of the above formats. If you do use this format,
the program will ignore everything between the ID line (line
beginning with the characters "ID") and the SQ line. The sequence
is then read from between the SQ line and the "//" characters.
It is critically important for the program to know whether or not it
is aligning DNA or protein sequences. The input routines attempt to
guess which type of sequence is being used by counting the number of
A,C,G,T or U's in the sequences. If the total is more than 85% of
the sequence length then DNA is assumed. If you use very unusual
sequences (proteins with really strange amino acid compositions or
DNA sequences with many ambiguity codes) you might confuse the
program. This is difficult to do by accident, but be careful.
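The 85% rule can be illustrated as follows. This is a sketch, not the
program's actual routine: the function name looks_like_dna is
hypothetical, and it assumes that gap characters are left out of the
sequence length, which the text above does not specify.

```python
def looks_like_dna(seq, threshold=0.85):
    """Guess DNA/RNA if more than 85% of the residues are A, C, G, T or U."""
    residues = [c for c in seq.upper() if c.isalpha()]  # assumption: gaps excluded
    if not residues:
        return False
    nucleotides = sum(c in "ACGTU" for c in residues)
    return nucleotides / len(residues) > threshold


print(looks_like_dna("ACGTACGTAN"))   # 9 of 10 letters are nucleotides
print(looks_like_dna("ACGTACGTNN"))   # only 8 of 10
```

A DNA sequence in which more than 15% of the positions are ambiguity
codes would therefore be treated as protein under this rule.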
******************************
* MULTIPLE ALIGNMENT MENU. *
******************************
The multiple alignment menu is shown below. Before explaining how
to use it, you must be introduced briefly to the alignment strategy.
If you do not follow this, try using option 1 anyway; the entire
process will be carried out automatically.
To do a complete multiple alignment, we need to know the approximate
relationships of the sequences to each other (which ones are most
similar to each other). We do this by calculating a crude
phylogenetic tree which we call a dendrogram (to distinguish it from
the more sensitive trees available under the phylogenetic tree
menu). This dendrogram is used as a guide to align bigger and
bigger groups of sequences during the multiple alignment. The
dendrogram is calculated in 2 stages: 1) all pairs of sequences are
compared using the fast/approximate method of Wilbur and Lipman
(1983); the result of each comparison is a similarity score. 2) the
similarity scores are used to construct the dendrogram using the
UPGMA cluster analysis method of Sneath and Sokal (1973).
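Stage 2 can be sketched as follows, assuming the similarity scores
from stage 1 are already available as a symmetric matrix. This is a
generic average-linkage (UPGMA) illustration, not the program's own
code; the function name upgma_joins and the 4-sequence matrix are
made up for the example, and sequences are numbered from 0 here.

```python
from itertools import combinations


def upgma_joins(sim):
    """Return the joins, in order, as (score, members_a, members_b).

    sim is a symmetric list-of-lists of pairwise similarity scores.
    At each step the two clusters with the highest average similarity
    (mean over all cross pairs) are merged.
    """
    clusters = [(i,) for i in range(len(sim))]
    joins = []

    def average(a, b):
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2), key=lambda pair: average(*pair))
        joins.append((average(a, b), a, b))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return joins


# Hypothetical percentage scores for 4 sequences.
scores = [
    [100, 90, 40, 30],
    [90, 100, 50, 20],
    [40, 50, 100, 70],
    [30, 20, 70, 100],
]
for score, a, b in upgma_joins(scores):
    print(score, a, "joins", b)
```

For N sequences this always yields N-1 joins, with the most similar
groups joined first.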
The construction of the dendrogram can be very time consuming if you
wish to align many sequences (e.g. for 100 sequences you need to
carry out 100x99/2 sequence comparisons = 4950). During every
multiple alignment, a dendrogram is constructed and saved to a file
(something.dnd). These can be reused later.
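The number of pairwise comparisons grows roughly with the square of
the number of sequences, as this small Python illustration shows (the
function name is hypothetical):

```python
def pairwise_comparisons(n_sequences):
    """Number of distinct sequence pairs: N x (N - 1) / 2."""
    return n_sequences * (n_sequences - 1) // 2


for n in (10, 50, 100, 500):
    print(n, pairwise_comparisons(n))
# prints: 10 45, 50 1225, 100 4950, 500 124750 (one pair per line)
```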
******Multiple*Alignment*Menu******
1. Do complete multiple alignment now
2. Produce dendrogram file only
3. Use old dendrogram file
4. Pairwise alignment parameters
5. Multiple alignment parameters
6. Output format options
S. Execute a system command
H. HELP
or press [RETURN] to go back to main menu
Your choice:
So, if in doubt, and you have already loaded some sequences from the
main menu, just try option 1 and press the <Return> key in response
to any questions. You will be prompted for 2 file names e.g. if the
sequence input file was called DRINK.PEP, you will be offered
DRINK.ALN as the file to contain the alignment and DRINK.DND for the
dendrogram.
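The naming pattern in this example can be mimicked as below. It is
only an illustration of the DRINK.PEP case; the function name is
hypothetical, and the upper-case extensions are an assumption taken
from the example (the program may use a different case on some
systems).

```python
import os


def default_output_names(input_name):
    """Swap the input file's extension for .ALN and .DND."""
    stem, _extension = os.path.splitext(input_name)
    return stem + ".ALN", stem + ".DND"


print(default_output_names("DRINK.PEP"))
```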
If you wish to repeat a multiple alignment (e.g. to experiment with
different gap penalties) but do not wish to make a dendrogram all
over again use menu item 3 (providing you are using the same
sequences). Similarly, menu item 2 allows you to produce the
dendrogram file only.
PAIRWISE ALIGNMENT PARAMETERS:
The parameters that control the initial fast/approximate comparisons
can be set from menu item 4 which looks like:
********* WILBUR/LIPMAN PAIRWISE ALIGNMENT PARAMETERS *********
1. Toggle Scoring Method :Percentage
2. Gap Penalty :3
3. K-tuple :1
4. No. of top diagonals :5
5. Window size :5
H. HELP
Enter number (or [RETURN] to exit):
The similarity scores are calculated from fast alignments generated
by the method of Wilbur and Lipman (1983). These are 'hash' or
'word' or 'k-tuple' alignments carried out in 3 stages.
First you mark the positions of every fragment of sequence, K-tuple
long (for proteins, the default length is 1 residue, for DNA it is 2
bases) in both sequences. Then you locate all k-tuple matches
between the 2 sequences. At this stage you have to imagine a dot-
matrix plot between the 2 sequences with each k-tuple match as a
dot. You find those diagonals in the plot with most matches (you
take the "No. of top diagonals" best ones) and mark all diagonals
within "Window size" of each top diagonal. This process will define
diagonal bands in the plot where you hope the most likely regions of
similarity will lie.
The final alignment stage is to find that head to tail arrangement
of k-tuple matches from these diagonal regions that will give the
highest score. The score is calculated as the number of exactly
matching residues in this alignment minus a "gap penalty" for every
gap that was introduced. When you toggle "Scoring method" you
choose between expressing these similarity scores as raw scores or
as percentages of the shorter sequence length.
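The first two stages (marking k-tuples, then finding the best
diagonals and the bands around them) can be sketched in Python as
follows. This is an illustration under simplifying assumptions, not
the program's implementation: the function name top_diagonal_bands
is hypothetical, a diagonal is identified by the offset j - i of a
match, and the third stage (choosing the best-scoring chain of
matches inside the bands) is not shown.

```python
from collections import Counter, defaultdict


def top_diagonal_bands(seq_a, seq_b, ktuple=1, top_diagonals=5, window=5):
    """Return (best diagonals, set of all diagonals inside the bands)."""
    # Stage 1: mark the position of every k-tuple in seq_a.
    positions = defaultdict(list)
    for i in range(len(seq_a) - ktuple + 1):
        positions[seq_a[i:i + ktuple]].append(i)

    # Stage 2: every k-tuple match (i, j) is a dot on diagonal j - i.
    hits = Counter()
    for j in range(len(seq_b) - ktuple + 1):
        for i in positions.get(seq_b[j:j + ktuple], ()):
            hits[j - i] += 1

    # Keep the diagonals with most matches, then widen each by the window.
    best = [diag for diag, _count in hits.most_common(top_diagonals)]
    band = set()
    for diag in best:
        band.update(range(diag - window, diag + window + 1))
    return best, band


# seq_b is seq_a with two extra bases in front, so diagonal 2 wins.
best, band = top_diagonal_bands("ACGTACGT", "TTACGTACGT",
                                ktuple=2, top_diagonals=1, window=1)
print(best, sorted(band))
```

A larger k-tuple produces fewer dots and a smaller window produces
narrower bands, which is why both settings trade sensitivity for
speed.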
K-TUPLE SIZE: Can be 1 or 2 for proteins; 1 to 4 for DNA.
Increase this to increase speed; decrease to improve sensitivity.
GAP PENALTY: The number of matching residues that must be found
in order to introduce a gap. This should be larger than K-Tuple
Size. This has little effect on speed or sensitivity.
NO. OF TOP DIAGONALS: The number of best diagonals in the
imaginary dot-matrix plot that are considered. Decrease (must be
greater than zero) to increase speed; increase to improve
sensitivity.
WINDOW SIZE: The number of diagonals around each "top" diagonal
that are considered. Decrease for speed; increase for greater
sensitivity.
SCORING METHOD: The similarity scores may be expressed as raw scores
(number of identical residues minus a "gap penalty" for each gap) or
as percentage scores. If the sequences are of very different
lengths, percentage scores make more sense.
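The two ways of expressing a score can be written out as follows.
This is a sketch with hypothetical names; it assumes the percentage
is taken from the gap-penalised raw score divided by the length of
the shorter sequence, which is how the description above reads.

```python
def similarity_score(matches, gaps, len_a, len_b,
                     gap_penalty=3, percentage=True):
    """Raw score = identical residues minus a penalty per gap."""
    raw = matches - gap_penalty * gaps
    if not percentage:
        return raw
    # Percentage of the shorter sequence length.
    return 100.0 * raw / min(len_a, len_b)


# 80 identities and 2 gaps between sequences of length 100 and 250.
print(similarity_score(80, 2, 100, 250, percentage=False))  # 74
print(similarity_score(80, 2, 100, 250))                    # 74.0
```

With sequences of very different lengths, the raw score is capped by
the shorter sequence, so the percentage form keeps scores from short
and long pairs comparable.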
CHANGING THE PAIRWISE ALIGNMENT PARAMETERS
The main reason for wanting to change the above parameters is SPEED
(especially on microcomputers), NOT SENSITIVITY. The dendrograms
that are produced can only show the relationships between the
sequences APPROXIMATELY because the similarity scores are calculated
from separate pairwise alignments, not from a multiple alignment
(that is what we eventually hope to produce). If the groupings of
the sequences are "obvious", the above method should work well; if
the relationships are obscure or weakly represented by the data, it
will not make much difference playing with the parameters. The main
factor influencing speed is the K-TUPLE SIZE followed by the WINDOW
SIZE.
The alignments are carried out in a small amount of memory.
Occasionally (it is hard to predict), you will run out of memory
while doing these alignments; when this happens, it will say on the
screen: "Sequences (a,b) partially aligned" (instead of "Sequences
(a,b) aligned"). This means that the alignment score for these
sequences will be approximate; it is not a problem unless many of
the alignments do this. It can be fixed by using less sensitive
parameters or by increasing the parameter FSIZE in clustalv.h.
THE DENDROGRAM ITSELF
The similarity scores generated by the fast comparison of all the
sequences are used to construct a dendrogram by the UPGMA method of
Sneath and Sokal (1973). This is a form of cluster analysis and the
end result produces something that looks like a tree. It represents
the similarity of the sequences as a hierarchy. The dendrogram is
written to a file in a machine readable format and is shown below
for an example with 6 sequences.
91.0   0   0   2   012000   ! seq 2 joins seq 3 at 91% ID.
72.0   1   0   3   011200   ! seq 4 joins seqs 2,3 at 72%
71.1   0   0   2   000012   ! seq 5 joins seq 6 at 71%
35.5   0   2   4   122200   ! seq 1 joins seqs 2,3,4
21.7   4   3   6   111122   ! seqs 1,2,3,4 join seqs 5,6
This LOOKS complicated but you do not normally need to care what is
in here. Anyway, each row represents the joining together of 2 or
more sequences. You progress from the top down, joining more and
more sequences until all are joined together; for N sequences you
have N-1 groupings hence there are 5 rows in the above file (there
were 6 sequences). In each row, the first number is the similarity
score of this grouping; ignore the next three columns for the
moment; the last 6 digits in the line show which sequences are
grouped; there is one digit for each sequence (the first digit is
for the first sequence). The rule is: in each row, all of the "1"s
join all of the "2"s; the zeros do nothing.
Hence, in the first row, sequence 2 joins sequence 3 at a similarity
level of 91% identity; next, sequence 4 joins the previous grouping
of 2 plus 3 at a level of 72% etc. This is shown diagrammatically
below. To complete the description of the dendrogram format, the
other 3 columns of numbers are: a pointer to the row in which the
"1" sequences were last joined (or zero if there is only one of
them); a pointer to the row in which the "2"s were last joined; and
the total number of sequences joined in this line.
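For readers who want to inspect a dendrogram file, the row layout
described above can be decoded with a short Python sketch. The
function name parse_dendrogram is hypothetical, and the handling of
"!" comments is an assumption based on the listing shown earlier; a
real .dnd file may differ in detail.

```python
def parse_dendrogram(text):
    """Return rows as (score, ptr1, ptr2, total, group1, group2).

    group1 and group2 hold the 1-based numbers of the sequences
    marked "1" and "2" in the digit string of each row.
    """
    rows = []
    for line in text.splitlines():
        fields = line.split("!")[0].split()   # drop any trailing comment
        if len(fields) != 5:
            continue
        score, ptr1, ptr2, total, digits = fields
        group1 = [i + 1 for i, d in enumerate(digits) if d == "1"]
        group2 = [i + 1 for i, d in enumerate(digits) if d == "2"]
        rows.append((float(score), int(ptr1), int(ptr2), int(total),
                     group1, group2))
    return rows


listing = """91.0 0 0 2 012000 ! seq 2 joins seq 3 at 91% ID.
72.0 1 0 3 011200 ! seq 4 joins seqs 2,3 at 72%
71.1 0 0 2 000012 ! seq 5 joins seq 6 at 71%
35.5 0 2 4 122200 ! seq 1 joins seqs 2,3,4
21.7 4 3 6 111122 ! seqs 1,2,3,4 join seqs 5,6
"""
for score, ptr1, ptr2, total, group1, group2 in parse_dendrogram(listing):
    print(score, group1, "join", group2)
```

In every row the sizes of the two groups add up to the total in the
fourth column, which gives a quick consistency check on a file.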
I------ 2
I------I