     3. Profile Alignments
     4. Phylogenetic trees

     S. Execute a system command
     H. HELP
     X. EXIT (leave program)


Your choice: 



The options S and H appear on all the main menus.  H will provide 
help and if you type S you will be asked to enter a command, such as 
DIR or LS, which will be sent to the system (does not work on 
Macs).  Before carrying out an alignment, you must use option 1 
(sequence input); the format for sequences is explained below.  
Under menu item 2 you will be able to automatically align your 
sequences to each other.  Menu item 3 allows you to do profile 
alignments.  These are alignments of old alignments.  This allows 
you to build up a multiple alignment in stages or add a new sequence 
to an old alignment.   You can calculate phylogenetic trees from 
alignments using menu item 4.




      ******************************
      *       SEQUENCE INPUT.      *
      ******************************


All sequences should be in 1 file.  Three formats are automatically 
recognised and used: NBRF/PIR, EMBL/SwissProt and FASTA (Pearson and 
Lipman (1988) format).   

***
Users of the Wisconsin GCG package should use the command TONBRF 
(recently changed to TOPIR) to reformat their sequences before use. 
*** 

Sequences can be in upper or lower case.  For proteins, the only 
symbols recognised are:  A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y and 
for DNA/RNA use: A,C,G and T (or U).  Any other letters of the 
alphabet will be treated as X (proteins) or N (DNA/RNA) for unknown.  
All other symbols (blanks, digits etc.) will be ignored EXCEPT for 
the hyphen "-" which can be used to specify a gap.  This last point 
is especially useful for 2 reasons: 1) you can fix the positions of 
some gaps in advance; 2) the alignment output from this program can 
be written out in NBRF format using "-"'s to specify gaps; these 
alignments can be used again as input, either for profile alignments 
or for phylogenetic trees.
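
The symbol rules above can be sketched in a few lines of Python.  
This is only an illustration, not the program's own code: the names 
clean_sequence, PROTEIN_SYMBOLS and DNA_SYMBOLS are hypothetical, and 
converting everything to upper case is an assumption made here for 
simplicity.

```python
PROTEIN_SYMBOLS = set("ACDEFGHIKLMNPQRSTVWY")
DNA_SYMBOLS = set("ACGTU")


def clean_sequence(raw, is_dna):
    """Apply the symbol rules described above to one raw sequence string."""
    allowed = DNA_SYMBOLS if is_dna else PROTEIN_SYMBOLS
    unknown = "N" if is_dna else "X"
    out = []
    for ch in raw.upper():  # upper-casing is an assumption of this sketch
        if ch == "-":
            out.append("-")        # a hyphen is kept as a gap
        elif ch in allowed:
            out.append(ch)         # recognised residue
        elif "A" <= ch <= "Z":
            out.append(unknown)    # any other letter means "unknown"
        # blanks, digits and all other symbols are ignored
    return "".join(out)
```

For example, clean_sequence("ac-g t1b", True) keeps the gap, drops 
the blank and the digit, and turns the "b" into N.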

If you are using an editor to create sequence files, use the FASTA 
format as it is by far the simplest (see below).  If you have access 
to utility programs for generating/converting the NBRF/PIR format 
then use it in preference.



FASTA (PEARSON AND LIPMAN, 1988) FORMAT:     The sequences are 
delimited by an angle bracket ">" in column 1.  The text immediately 
after the ">" is used as a title.  Everything on the following lines 
until the next ">" or the end of the file is one sequence.

e.g.

> RABSTOUT   rabbit Guinness receptor
   LKMHLMGHLKMGLKMGLKGMHLMHLKHMHLMTYTYTTYRRWPLWMWLPDFGHAS
   ADSCVCAHGFAVCACFAHFDVCFGAVCFHAVCFAHVCFAAAVCFAVCAC
> MUSNOSE   mouse nose drying factor
    mhkmmhkgmkhmhgmhmhglhmkmhlkmgkhmgkmkytytytryrwtqtqwtwyt
    fdgfdsgafdagfdgfsagdfavdfdvgavfsvfgvdfsvdgvagvfdv
> HSHEAVEN    human Guinness receptor repeat
 mhkmmhkgmkhmhgmhmhg   lhmkmhlkmgkhmgkmk  ytytytryrwtqtqwtwyt
 fdgfdsgafdagfdgfsag   dfavdfdvgavfsvfgv  dfsvdgvagvfdv
 mhkmmhkgmkhmhgmhmhg   lhmkmhlkmgkhmgkmk  ytytytryrwtqtqwtwyt
 fdgfdsgafdagfdgfsag   dfavdfdvgavfsvfgv  dfsvdgvagvfdv
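
A minimal sketch of reading this format, assuming the whole file has 
already been read into a string.  The function name read_fasta is 
hypothetical; blanks inside sequence lines are dropped, as in the 
HSHEAVEN example above.

```python
def read_fasta(text):
    """Return a list of (title, sequence) pairs from FASTA-format text."""
    records = []
    title, chunks = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            # A ">" in column 1 closes the previous record and opens a new one.
            if title is not None:
                records.append((title, "".join(chunks)))
            title, chunks = line[1:].strip(), []
        elif title is not None:
            # Sequence line: remove blanks and keep the rest.
            chunks.append("".join(line.split()))
    if title is not None:
        records.append((title, "".join(chunks)))
    return records
```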



NBRF/PIR FORMAT         is similar to FASTA format but immediately 
after the ">", you find the characters "P1;" if the sequences are 
protein or "DL;" if they are nucleic acid.  Clustalv looks for the 
";" character as the third character after the ">".  If it finds 
one, it assumes that the format is NBRF; if not, FASTA format is 
assumed.  The text after the ";" is treated as a sequence name while 
the entire next line is treated as a title.  The sequence is 
terminated by a star "*" and the next sequence can then begin (with 
a ">P1;" etc.).  This is just the basic format description (there 
are other variations and rules).

ANY files/sequences in GCG format can be converted to this format 
using the TONBRF command (now TOPIR) of the Wisconsin GCG package.


e.g.

>P1;RABSTOUT
rabbit Guinness receptor
LKMHLMGHLKMGLKMGLKGMHLMHLKHMHLMTYTYTTYRRWPLWMWLPDFGHAS
ADSCVCAHGFAVCACFAHFDVCFGAVCFHAVCFAHVCFAAAVCFAVCAC*
>P1;MUSNOSE   
mouse nose drying factor
mhkmmhkgmkhmhgmhmhglhmkmhlkmgkhmgkmkytytytryrwtqtqwtwyt
fdgfdsgafdagfdgfsagdfavdfdvgavfsvfgvdfsvdgvagvfd
*
>P1;HSHEAVEN    
human Guinness receptor repeat protein.
mhkmmhkgmkhmhgmhmhg   lhmkmhlkmgkhmgkmk  ytytytryrwtqtqwtwyt
fdgfdsgafdagfdgfsag   dfavdfdvgavfsvfgv  dfsvdgvagvfdv
mhkmmhkgmkhmhgmhmhg   lhmkmhlkmgkhmgkmk  ytytytryrwtqtqwtwyt
fdgfdsgafdagfdgfsag   dfavdfdvgavfsvfgv  dfsvdgvagvfdv*
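
The basic NBRF/PIR rules can be sketched as follows.  This covers 
only the simple layout described above, not the other variations of 
the format; the names is_nbrf_header and read_nbrf are hypothetical.

```python
def is_nbrf_header(line):
    """True if ';' is the third character after the '>' (e.g. '>P1;NAME')."""
    return line.startswith(">") and len(line) > 3 and line[3] == ";"


def read_nbrf(text):
    """Return a list of (name, title, sequence) triples from NBRF/PIR text."""
    records = []
    lines = iter(text.splitlines())
    for line in lines:
        if not is_nbrf_header(line):
            continue
        name = line[4:].strip()           # text after the ';'
        title = next(lines, "").strip()   # the entire next line
        chunks = []
        for seq_line in lines:
            body, star, _ = seq_line.partition("*")
            chunks.append("".join(body.split()))
            if star:                      # the '*' terminates the sequence
                break
        records.append((name, title, "".join(chunks)))
    return records
```

A header such as "> RABSTOUT" fails the is_nbrf_header test, so under 
the rule above it would be read as FASTA instead.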


  

EMBL/SWISSPROT FORMAT:       Do not try to create files with this 
format unless you have utilities to help.  If you are just using an 
editor, use one of the above formats.  If you do use this format, 
the program will ignore everything between the ID line (line 
beginning with the characters "ID") and the SQ line.  The sequence 
is then read from between the SQ line and the "//" characters.
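
A rough sketch of that reading rule is given below.  The function 
name read_embl is hypothetical, and taking the entry name from the 
second field of the ID line is an assumption of this sketch; the 
description above does not say where the name comes from.

```python
def read_embl(text):
    """Return a list of (name, sequence) pairs from EMBL/SwissProt-style text."""
    records = []
    name, chunks, in_seq = None, [], False
    for line in text.splitlines():
        if line.startswith("ID"):
            fields = line.split()
            name = fields[1] if len(fields) > 1 else ""  # assumed name field
            chunks, in_seq = [], False
        elif line.startswith("SQ") and name is not None:
            in_seq = True                 # sequence starts after the SQ line
        elif line.startswith("//"):
            if in_seq:
                records.append((name, "".join(chunks)))
            name, in_seq = None, False
        elif in_seq:
            # Keep letters and gap hyphens; blanks and digits are dropped.
            chunks.append("".join(c for c in line if c.isalpha() or c == "-"))
        # everything between the ID line and the SQ line is ignored
    return records
```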



It is critically important for the program to know whether it is 
aligning DNA or protein sequences.  The input routines attempt to 
guess which type of sequence is being used by counting the number of 
A,C,G,T or U's in the sequences.  If the total is more than 85% of 
the sequence length then DNA is assumed.  If you use very bizarre 
sequences (proteins with really strange aa compositions or DNA 
sequences with loads of strange ambiguity codes) you might confuse 
the program.  It is difficult to do but be careful.
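
The 85% rule can be illustrated with a short sketch.  The function 
name looks_like_dna is hypothetical, and applying the test to a 
single sequence (rather than to all sequences together) is an 
assumption; the text above does not specify which is done.

```python
def looks_like_dna(sequence):
    """Guess DNA if more than 85% of the letters are A, C, G, T or U."""
    residues = [ch for ch in sequence.upper() if "A" <= ch <= "Z"]
    if not residues:
        return False
    nucleotides = sum(ch in "ACGTU" for ch in residues)
    return nucleotides > 0.85 * len(residues)
```

A DNA sequence with many ambiguity codes, e.g. 8 plain bases and 2 
N's (80%), would fall below the threshold and be taken as protein, 
which is the kind of confusion described above.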





      ******************************
      *  MULTIPLE ALIGNMENT MENU.  *
      ******************************

The multiple alignment menu is shown below.  Before explaining how 
to use it, you must be introduced briefly to the alignment strategy. 
If you do not follow this, try using option 1 anyway; the entire 
process will be carried out automatically.

To do a complete multiple alignment, we need to know the approximate 
relationships of the sequences to each other (which ones are most 
similar to each other).  We do this by calculating a crude 
phylogenetic tree which we call a dendrogram (to distinguish it from 
the more sensitive trees available under the phylogenetic tree 
menu).   This dendrogram is used as a guide to align bigger and 
bigger groups of sequences during the multiple alignment.  The 
dendrogram is calculated in 2 stages: 1) all pairs of sequences are 
compared using the fast/approximate method of Wilbur and Lipman 
(1983); the result of each comparison is a similarity score. 2) the 
similarity scores are used to construct the dendrogram using the 
UPGMA cluster analysis method of Sneath and Sokal (1973).  
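
Stage 2 can be pictured with a textbook average-linkage (UPGMA) 
sketch working on a matrix of similarity scores.  This is not the 
program's own code: the function name upgma_joins is hypothetical, 
sequences are numbered from 0, and any scores passed in are whatever 
the caller supplies.

```python
from itertools import combinations


def upgma_joins(sim):
    """Cluster by average similarity; return (score, group_a, group_b) joins.

    sim is a symmetric list of lists with sim[i][j] the similarity
    score of sequences i and j.
    """
    clusters = [(i,) for i in range(len(sim))]
    joins = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            # Average of all pairwise scores between the two groups.
            score = sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))
            if best is None or score > best[0]:
                best = (score, a, b)
        score, a, b = best
        joins.append((score, a, b))
        # Replace the two joined groups by their union.
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(tuple(sorted(a + b)))
    return joins
```

For N sequences the function returns N-1 joins, most similar first.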

The construction of the dendrogram can be very time consuming if you 
wish to align many sequences (e.g. for 100 sequences you need to 
carry out 100x99/2 sequence comparisons = 4950). During every 
multiple alignment, a dendrogram is constructed and saved to a file 
(something.dnd).  These can be reused later.
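
The count of comparisons is simply the number of unordered pairs, 
N(N-1)/2.  A small check of the arithmetic (the function name 
pairwise_comparisons is hypothetical):

```python
from itertools import combinations


def pairwise_comparisons(n_sequences):
    """Number of sequence pairs that must be compared: N(N-1)/2."""
    return n_sequences * (n_sequences - 1) // 2


# Enumerating the pairs for sequences numbered 1..100 gives the same count.
pairs = list(combinations(range(1, 101), 2))
```

Because the count grows with the square of N, doubling the number of 
sequences roughly quadruples the work for this stage.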








******Multiple*Alignment*Menu******


    1.  Do complete multiple alignment now
    2.  Produce dendrogram file only
    3.  Use old dendrogram file
    4.  Pairwise alignment parameters
    5.  Multiple alignment parameters
    6.  Output format options

    S.  Execute a system command
    H.  HELP
    or press [RETURN] to go back to main menu


Your choice: 


So, if in doubt, and you have already loaded some sequences from the 
main menu, just try option 1 and press the <Return> key in response 
to any questions.  You will be prompted for 2 file names e.g. if the 
sequence input file was called DRINK.PEP, you will be offered 
DRINK.ALN as the file to contain the alignment and DRINK.DND for the 
dendrogram.  

If you wish to repeat a multiple alignment (e.g. to experiment with 
different gap penalties) but do not wish to make a dendrogram all 
over again use menu item 3 (providing you are using the same 
sequences).  Similarly, menu item 2 allows you to produce the 
dendrogram file only.




PAIRWISE ALIGNMENT PARAMETERS:     

The parameters that control the initial fast/approximate comparisons 
can be set from menu item 4 which looks like:


 ********* WILBUR/LIPMAN PAIRWISE ALIGNMENT PARAMETERS *********


     1. Toggle Scoring Method  :Percentage
     2. Gap Penalty            :3
     3. K-tuple                :1
     4. No. of top diagonals   :5
     5. Window size            :5

     H. HELP


Enter number (or [RETURN] to exit): 



The similarity scores are calculated from fast alignments generated 
by the method of Wilbur and Lipman (1983).  These are 'hash' or 
'word' or 'k-tuple' alignments carried out in 3 stages.  

First you mark the positions of every fragment of sequence, K-tuple 
long (for proteins, the default length is 1 residue, for DNA it is 2 
bases) in both sequences.  Then you locate all k-tuple matches 
between the 2 sequences.   At this stage you have to imagine a dot-
matrix plot between the 2 sequences with each k-tuple match as a 
dot.   You find those diagonals in the plot with most matches (you 
take the "No. of top diagonals" best ones) and mark all diagonals 
within "Window size" of each top diagonal.  This process will define 
diagonal bands in the plot where you hope the most likely regions of 
similarity will lie.  
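
The first two stages can be sketched as below.  This is a simplified 
illustration of the idea, not the Wilbur and Lipman implementation 
used by the program; the function name top_diagonal_band is 
hypothetical, and a diagonal is identified here by the difference 
between the positions of a match in the two sequences.

```python
from collections import Counter, defaultdict


def top_diagonal_band(seq_a, seq_b, ktuple=1, top_diagonals=5, window=5):
    """Return (best diagonals, set of all diagonals inside the bands)."""
    # Stage 1: mark the position of every k-tuple in seq_b.
    positions = defaultdict(list)
    for j in range(len(seq_b) - ktuple + 1):
        positions[seq_b[j:j + ktuple]].append(j)

    # Stage 2: every k-tuple match is a "dot" on diagonal i - j.
    diagonal_hits = Counter()
    for i in range(len(seq_a) - ktuple + 1):
        for j in positions.get(seq_a[i:i + ktuple], ()):
            diagonal_hits[i - j] += 1

    # Keep the diagonals with most matches, then widen each by the window.
    best = [d for d, _ in diagonal_hits.most_common(top_diagonals)]
    band = set()
    for d in best:
        band.update(range(d - window, d + window + 1))
    return best, band
```

A larger ktuple gives fewer dots to count, and a smaller window or 
fewer top diagonals gives narrower bands to search afterwards.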

The final alignment stage is to find that head to tail arrangement 
of k-tuple matches from these diagonal regions that will give the 
highest score.  The score is calculated as the number of exactly 
matching residues in this alignment minus a "gap penalty" for every 
gap that was introduced.  When you toggle "Scoring method" you 
choose between expressing these similarity scores as raw scores or 
as a percentage of the shorter sequence length.  
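
The scoring rule can be written out as a short sketch.  The function 
name similarity_score and its argument names are hypothetical; the 
counts of matches and gaps are assumed to come from the fast 
alignment described above.

```python
def similarity_score(matches, gaps, len_a, len_b,
                     gap_penalty=3, as_percentage=True):
    """Matches minus a penalty per gap, optionally as a percentage
    of the shorter sequence length."""
    raw = matches - gap_penalty * gaps
    if not as_percentage:
        return raw
    return 100.0 * raw / min(len_a, len_b)
```

With 40 matching residues and 2 gaps between sequences of length 50 
and 80, the raw score is 40 - 2x3 = 34 and the percentage score is 
34/50 = 68%.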

K-TUPLE SIZE:   Can be 1 or 2 for proteins; 1 to 4 for DNA.  
Increase this to increase speed; decrease to improve sensitivity.

GAP PENALTY:    The number of matching residues that must be found 
in order to introduce a gap.  This should be larger than K-Tuple 
Size.  This has little effect on speed or sensitivity.

NO. OF TOP DIAGONALS:    The number of best diagonals in the 
imaginary dot-matrix plot that are considered.  Decrease (must be 
greater than zero) to increase speed; increase to improve 
sensitivity.

WINDOW SIZE:    The number of diagonals around each "top" diagonal 
that are considered.   Decrease for speed; increase for greater 
sensitivity.

SCORING METHOD: The similarity scores may be expressed as raw scores 
(number of identical residues minus a "gap penalty" for each gap) or 
as percentage scores.  If the sequences are of very different 
lengths, percentage scores make more sense.



CHANGING THE PAIRWISE ALIGNMENT PARAMETERS

The main reason for wanting to change the above parameters is SPEED 
(especially on microcomputers), NOT SENSITIVITY.   The dendrograms 
that are produced can only show the relationships between the 
sequences APPROXIMATELY because the similarity scores are calculated 
from separate pairwise alignments; not from a multiple alignment 
(that is what we eventually hope to produce).  If the groupings of 
the sequences are "obvious", the above method should work well; if 
the relationships are obscure or weakly represented by the data, it 
will not make much difference playing with the parameters.  The main 
factor influencing speed is the K-TUPLE SIZE followed by the WINDOW 
SIZE.  

The alignments are carried out in a small amount of memory.  
Occasionally (it is hard to predict), you will run out of memory 
while doing these alignments; when this happens, it will say on the 
screen: "Sequences (a,b) partially aligned" (instead of "Sequences 
(a,b) aligned").  This means that the alignment score for these 
sequences will be approximate;  it is not a problem unless many of 
the alignments do this.  It can be fixed by using less sensitive 
parameters or increasing parameter FSIZE in clustalv.h .


THE DENDROGRAM ITSELF

The similarity scores generated by the fast comparison of all the 
sequences are used to construct a dendrogram by the UPGMA method of 
Sneath and Sokal (1973).  This is a form of cluster analysis and the 
end result produces something that looks like a tree.  It represents 
the similarity of the sequences as a hierarchy.  The dendrogram is 
written to a file in a machine readable format and is shown below 
for an example with 6 sequences.


    91.0   0   0   2   012000         ! seq 2 joins seq 3 at 91% ID.
    72.0   1   0   3   011200         ! seq 4 joins seqs 2,3 at 72%
    71.1   0   0   2   000012         ! seq 5 joins seq 6 at 71%
    35.5   0   2   4   122200         ! seq 1 joins seqs 2,3,4
    21.7   4   3   6   111122         ! seqs 1,2,3,4 join seqs 5,6

This LOOKS complicated but you do not normally need to care what is 
in here.  Anyway, each row represents the joining together of 2 or 
more sequences.  You progress from the top down, joining more and 
more sequences until all are joined together; for N sequences you 
have N-1 groupings hence there are 5 rows in the above file (there 
were 6 sequences).  In each row, the first number is the similarity 
score of this grouping; ignore the next three columns for the 
moment; the last 6 digits in the line show which sequences are 
grouped; there is one digit for each sequence (the first digit is 
for the first sequence).  The rule is:  in each row, all of the "1"s 
join all of the "2"s; the zeros do nothing.   

Hence, in the first row, sequence 2 joins sequence 3 at a similarity 
level of 91% identity; next, sequence 4 joins the previous grouping 
of 2 plus 3 at a level of 72% etc.   This is shown diagrammatically 
below.  Before leaving the dendrogram format, the other 3 columns of 
numbers are: a pointer to the row from which the "1" sequences were 
last joined (or zero if only one of them); a pointer to the row in 
which the "2"s were last joined; the total number of sequences 
joined in this line.
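
One row of this format can be decoded as in the sketch below.  The 
function name parse_dendrogram_row is hypothetical, and discarding 
anything after a "!" is an assumption made so that the annotated 
example rows above can be used as input.

```python
def parse_dendrogram_row(line):
    """Decode one dendrogram row.

    Returns (score, row of last "1" join, row of last "2" join,
    number of sequences joined, "1" sequences, "2" sequences),
    with sequences numbered from 1.
    """
    fields = line.split("!")[0].split()   # drop any trailing annotation
    score = float(fields[0])
    ptr_ones, ptr_twos, total = (int(f) for f in fields[1:4])
    flags = fields[4]                     # one digit per sequence
    ones = [i + 1 for i, f in enumerate(flags) if f == "1"]
    twos = [i + 1 for i, f in enumerate(flags) if f == "2"]
    return score, ptr_ones, ptr_twos, total, ones, twos
```

Applied to the second example row, this gives a score of 72.0, 
pointers 1 and 0, a total of 3 sequences, "1" sequences 2 and 3, and 
"2" sequence 4.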




                      I------ 2
               I------I