Your choice:

The options S and H appear on all the main menus. H will provide help and
if you type S you will be asked to enter a command, such as DIR or LS,
which will be sent to the system (does not work on Macs). Before carrying
out an alignment, you must use option 1 (sequence input); the format for
sequences is explained below. Under menu item 2 you will be able to
automatically align your sequences to each other. Menu item 3 allows you
to do profile alignments. These are alignments of old alignments. This
allows you to build up a multiple alignment in stages or add a new
sequence to an old alignment. You can calculate phylogenetic trees from
alignments using menu item 4.


                  ******************************
                  *      SEQUENCE INPUT.       *
                  ******************************

All sequences should be in 1 file. Three formats are automatically
recognised and used: NBRF/PIR, EMBL/SwissProt and FASTA (Pearson and
Lipman (1988) format).

***Users of the Wisconsin GCG package should use the command TONBRF
(recently changed to TOPIR) to reformat their sequences before use. ***

Sequences can be in upper or lower case. For proteins, the only symbols
recognised are: A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y and for DNA/RNA
use: A,C,G and T (or U). Any other letters of the alphabet will be
treated as X (proteins) or N (DNA/RNA) for unknown. All other symbols
(blanks, digits etc.) will be ignored EXCEPT for the hyphen "-" which can
be used to specify a gap. This last point is especially useful for 2
reasons:
1) you can fix the positions of some gaps in advance;
2) the alignment output from this program can be written out in NBRF
   format using "-"'s to specify gaps; these alignments can be used again
   as input, either for profile alignments or for phylogenetic trees.

If you are using an editor to create sequence files, use the FASTA format
as it is by far the simplest (see below).
If you have access to utility programs for generating/converting the
NBRF/PIR format then use it in preference.

FASTA (PEARSON AND LIPMAN, 1988) FORMAT: The sequences are delimited by
an angle bracket ">" in column 1. The text immediately after the ">" is
used as a title. Everything on the following lines until the next ">" or
the end of the file is one sequence.

e.g.

> RABSTOUT   rabbit Guinness receptor
LKMHLMGHLKMGLKMGLKGMHLMHLKHMHLMTYTYTTYRRWPLWMWLPDFGHAS
ADSCVCAHGFAVCACFAHFDVCFGAVCFHAVCFAHVCFAAAVCFAVCAC
> MUSNOSE   mouse nose drying factor
mhkmmhkgmkhmhgmhmhglhmkmhlkmgkhmgkmkytytytryrwtqtqwtwyt
fdgfdsgafdagfdgfsagdfavdfdvgavfsvfgvdfsvdgvagvfdv
> HSHEAVEN    human Guinness receptor repeat
mhkmmhkgmkhmhgmhmhg lhmkmhlkmgkhmgkmk ytytytryrwtqtqwtwyt
fdgfdsgafdagfdgfsag dfavdfdvgavfsvfgv dfsvdgvagvfdv
mhkmmhkgmkhmhgmhmhg lhmkmhlkmgkhmgkmk ytytytryrwtqtqwtwyt
fdgfdsgafdagfdgfsag dfavdfdvgavfsvfgv dfsvdgvagvfdv

NBRF/PIR FORMAT is similar to FASTA format but immediately after the
">", you find the characters "P1;" if the sequences are protein or "DL;"
if they are nucleic acid. Clustalv looks for the ";" character as the
third character after the ">". If it finds one it assumes that the format
is NBRF; if not, FASTA format is assumed. The text after the ";" is
treated as a sequence name while the entire next line is treated as a
title. The sequence is terminated by a star "*" and the next sequence can
then begin (with a >P1; etc ).
This is just the basic format description (there are other variations
and rules).

ANY files/sequences in GCG format can be converted to this format using
the TONBRF command (now TOPIR) of the Wisconsin GCG package.

e.g.

>P1;RABSTOUT
rabbit Guinness receptor
LKMHLMGHLKMGLKMGLKGMHLMHLKHMHLMTYTYTTYRRWPLWMWLPDFGHAS
ADSCVCAHGFAVCACFAHFDVCFGAVCFHAVCFAHVCFAAAVCFAVCAC*
>P1;MUSNOSE
mouse nose drying factor
mhkmmhkgmkhmhgmhmhglhmkmhlkmgkhmgkmkytytytryrwtqtqwtwyt
fdgfdsgafdagfdgfsagdfavdfdvgavfsvfgvdfsvdgvagvfd*
>P1;HSHEAVEN
human Guinness receptor repeat protein.
mhkmmhkgmkhmhgmhmhg lhmkmhlkmgkhmgkmk ytytytryrwtqtqwtwyt
fdgfdsgafdagfdgfsag dfavdfdvgavfsvfgv dfsvdgvagvfdv
mhkmmhkgmkhmhgmhmhg lhmkmhlkmgkhmgkmk ytytytryrwtqtqwtwyt
fdgfdsgafdagfdgfsag dfavdfdvgavfsvfgv dfsvdgvagvfdv*

EMBL/SWISSPROT FORMAT: Do not try to create files with this format
unless you have utilities to help. If you are just using an editor, use
one of the above formats. If you do use this format, the program will
ignore everything between the ID line (line beginning with the characters
"ID") and the SQ line. The sequence is then read from between the SQ line
and the "//" characters.

It is critically important for the program to know whether or not it is
aligning DNA or protein sequences. The input routines attempt to guess
which type of sequence is being used by counting the number of A,C,G,T or
U's in the sequences. If the total is more than 85% of the sequence
length then DNA is assumed. If you use very bizarre sequences (proteins
with really strange aa compositions or DNA sequences with loads of
strange ambiguity codes) you might confuse the program. It is difficult
to do but be careful.


                  ******************************
                  *  MULTIPLE ALIGNMENT MENU.  *
                  ******************************

The multiple alignment menu is shown below. Before explaining how to use
it, you must be introduced briefly to the alignment strategy.
If you do not follow this, try using option 1 anyway; the entire process
will be carried out automatically.

To do a complete multiple alignment, we need to know the approximate
relationships of the sequences to each other (which ones are most similar
to each other). We do this by calculating a crude phylogenetic tree which
we call a dendrogram (to distinguish it from the more sensitive trees
available under the phylogenetic tree menu). This dendrogram is used as a
guide to align bigger and bigger groups of sequences during the multiple
alignment. The dendrogram is calculated in 2 stages:
1) all pairs of sequences are compared using the fast/approximate method
   of Wilbur and Lipman (1983); the result of each comparison is a
   similarity score.
2) the similarity scores are used to construct the dendrogram using the
   UPGMA cluster analysis method of Sneath and Sokal (1973).

The construction of the dendrogram can be very time consuming if you wish
to align many sequences (e.g. for 100 sequences you need to carry out
100x99/2 sequence comparisons = 4950). During every multiple alignment,
a dendrogram is constructed and saved to a file (something.dnd). These
can be reused later.

 ******Multiple*Alignment*Menu******

    1.  Do complete multiple alignment now
    2.  Produce dendrogram file only
    3.  Use old dendrogram file
    4.  Pairwise alignment parameters
    5.  Multiple alignment parameters
    6.  Output format options
    S.  Execute a system command
    H.  HELP
    or press [RETURN] to go back to main menu

Your choice:

So, if in doubt, and you have already loaded some sequences from the main
menu, just try option 1 and press the <Return> key in response to any
questions. You will be prompted for 2 file names e.g. if the sequence
input file was called DRINK.PEP, you will be offered DRINK.ALN as the
file to contain the alignment and DRINK.DND for the dendrogram.

If you wish to repeat a multiple alignment (e.g.
to experiment with different gap penalties) but do not wish to make a
dendrogram all over again, use menu item 3 (providing you are using the
same sequences). Similarly, menu item 2 allows you to produce the
dendrogram file only.

PAIRWISE ALIGNMENT PARAMETERS:

The parameters that control the initial fast/approximate comparisons can
be set from menu item 4 which looks like:

 ********* WILBUR/LIPMAN PAIRWISE ALIGNMENT PARAMETERS *********

    1.  Toggle Scoring Method  :Percentage
    2.  Gap Penalty            :3
    3.  K-tuple                :1
    4.  No. of top diagonals   :5
    5.  Window size            :5
    H.  HELP

Enter number (or [RETURN] to exit):

The similarity scores are calculated from fast alignments generated by
the method of Wilbur and Lipman (1983). These are 'hash' or 'word' or
'k-tuple' alignments carried out in 3 stages.

First you mark the positions of every fragment of sequence, K-tuple long
(for proteins, the default length is 1 residue, for DNA it is 2 bases) in
both sequences. Then you locate all k-tuple matches between the 2
sequences. At this stage you have to imagine a dot-matrix plot between
the 2 sequences with each k-tuple match as a dot.

You find those diagonals in the plot with most matches (you take the
"No. of top diagonals" best ones) and mark all diagonals within "Window
size" of each top diagonal. This process will define diagonal bands in
the plot where you hope the most likely regions of similarity will lie.

The final alignment stage is to find that head to tail arrangement of
k-tuple matches from these diagonal regions that will give the highest
score. The score is calculated as the number of exactly matching residues
in this alignment minus a "gap penalty" for every gap that was
introduced. When you toggle "Scoring method" you choose between
expressing these similarity scores as raw scores or as a percentage of
the shorter sequence length.

K-TUPLE SIZE: Can be 1 or 2 for proteins; 1 to 4 for DNA.
Increase this to increase speed; decrease to improve sensitivity.

GAP PENALTY: The number of matching residues that must be found in order
to introduce a gap. This should be larger than K-Tuple Size. This has
little effect on speed or sensitivity.

NO. OF TOP DIAGONALS: The number of best diagonals in the imaginary
dot-matrix plot that are considered. Decrease (must be greater than zero)
to increase speed; increase to improve sensitivity.

WINDOW SIZE: The number of diagonals around each "top" diagonal that are
considered. Decrease for speed; increase for greater sensitivity.

SCORING METHOD: The similarity scores may be expressed as raw scores
(number of identical residues minus a "gap penalty" for each gap) or as
percentage scores. If the sequences are of very different lengths,
percentage scores make more sense.

CHANGING THE PAIRWISE ALIGNMENT PARAMETERS

The main reason for wanting to change the above parameters is SPEED
(especially on microcomputers), NOT SENSITIVITY. The dendrograms that are
produced can only show the relationships between the sequences
APPROXIMATELY because the similarity scores are calculated from separate
pairwise alignments; not from a multiple alignment (that is what we
eventually hope to produce). If the groupings of the sequences are
"obvious", the above method should work well; if the relationships are
obscure or weakly represented by the data, it will not make much
difference playing with the parameters. The main factor influencing speed
is the K-TUPLE SIZE followed by the WINDOW SIZE.

The alignments are carried out in a small amount of memory. Occasionally
(it is hard to predict), you will run out of memory while doing these
alignments; when this happens, it will say on the screen: "Sequences
(a,b) partially aligned" (instead of "Sequences (a,b) aligned"). This
means that the alignment score for these sequences will be approximate;
it is not a problem unless many of the alignments do this.
It can be fixed by using less sensitive parameters or increasing
parameter FSIZE in clustalv.h.

THE DENDROGRAM ITSELF

The similarity scores generated by the fast comparison of all the
sequences are used to construct a dendrogram by the UPGMA method of
Sneath and Sokal (1973). This is a form of cluster analysis and the end
result produces something that looks like a tree. It represents the
similarity of the sequences as a hierarchy. The dendrogram is written to
a file in a machine readable format and is shown below for an example
with 6 sequences.

 91.0   0   0   2  012000     ! seq 2 joins seq 3 at 91% ID.
 72.0   1   0   3  011200     ! seq 4 joins seqs 2,3 at 72%
 71.1   0   0   2  000012     ! seq 5 joins seq 6 at 71%
 35.5   0   2   4  122200     ! seq 1 joins seqs 2,3,4
 21.7   4   3   6  111122     ! seqs 1,2,3,4 join seqs 5,6

This LOOKS complicated but you do not normally need to care what is in
here. Anyway, each row represents the joining together of 2 or more
sequences. You progress from the top down, joining more and more
sequences until all are joined together; for N sequences you have N-1
groupings hence there are 5 rows in the above file (there were 6
sequences).

In each row, the first number is the similarity score of this grouping;
ignore the next three columns for the moment; the last 6 digits in the
line show which sequences are grouped; there is one digit for each
sequence (the first digit is for the first sequence). The rule is: in
each row, all of the "1"s join all of the "2"s; the zeros do nothing.
Hence, in the first row, sequence 2 joins sequence 3 at a similarity
level of 91% identity; next, sequence 4 joins the previous grouping of 2
plus 3 at a level of 72% etc. This is shown diagrammatically below.

Before leaving the dendrogram format, the other 3 columns of numbers are:
a pointer to the row in which the "1" sequences were last joined (or
zero if there is only one of them); a pointer to the row in which the
"2"s were last joined; the total number of sequences joined in this
line.

                       I------ 2
                I------I
                I      I------ 3      Diagram of the sequence similarity
           I----I
           I    I------------- 4      relationships shown in the above
        I--I
        I  I------------------ 1      dendrogram file (branch lengths are
    ----I
        I       I------------- 5      not to scale).
        I-------I
                I------------- 6
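
The row format of the dendrogram file described above can be read
mechanically. The following is a minimal Python sketch and is not part of
Clustal V; the names Join and parse_dnd_row are hypothetical. It assumes
that the fields are separated by whitespace and that anything after a "!"
is an annotation to be ignored, as in the example rows.

```python
from typing import List, NamedTuple


class Join(NamedTuple):
    """One row of the dendrogram file (hypothetical container)."""
    score: float      # similarity score of this grouping
    ones_row: int     # row in which the "1" sequences were last joined, or 0
    twos_row: int     # row in which the "2" sequences were last joined, or 0
    n_joined: int     # total number of sequences joined in this row
    ones: List[int]   # 1-based numbers of the sequences marked "1"
    twos: List[int]   # 1-based numbers of the sequences marked "2"


def parse_dnd_row(line: str) -> Join:
    # Assumption: text after "!" is an annotation, as in the example rows.
    fields = line.split("!", 1)[0].split()
    score = float(fields[0])
    ones_row, twos_row, n_joined = (int(f) for f in fields[1:4])
    mask = fields[4]  # one digit per sequence: 0, 1 or 2
    ones = [i for i, c in enumerate(mask, start=1) if c == "1"]
    twos = [i for i, c in enumerate(mask, start=1) if c == "2"]
    # The fourth column should equal the number of non-zero digits.
    if n_joined != len(ones) + len(twos):
        raise ValueError("sequence count does not match mask: " + line)
    return Join(score, ones_row, twos_row, n_joined, ones, twos)


# The 6-sequence example from the text.
EXAMPLE = """\
91.0 0 0 2 012000 ! seq 2 joins seq 3 at 91% ID.
72.0 1 0 3 011200 ! seq 4 joins seqs 2,3 at 72%
71.1 0 0 2 000012 ! seq 5 joins seq 6 at 71%
35.5 0 2 4 122200 ! seq 1 joins seqs 2,3,4
21.7 4 3 6 111122 ! seqs 1,2,3,4 join seqs 5,6
"""

# All of the "1"s join all of the "2"s; zeros are not involved.
for row in map(parse_dnd_row, EXAMPLE.splitlines()):
    print(f"{row.score:5.1f}: seqs {row.ones} join seqs {row.twos}")
```

Reading the rows from the top down reproduces the order of joins shown in
the diagram: N-1 rows for N sequences, with the last row joining all of
them.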