in one of two ways:

1) assume a degree of constancy in the molecular clock and place the
root (bottom of the tree; the point where all the sequences radiate
from) half way along the longest branch.

     **OR**

2) use an "outgroup", a sequence from an organism that you "know" must
be outside of the rest of the sequences i.e. root the tree manually, on
biological grounds.

The above tree can be represented diagrammatically as follows:

                          SEQ 1       SEQ 4
                           I           I
          13.6             I 28.1      I 4.9          5.9
  SEQ 6 ----------I        I           I          I--------- SEQ 2
                  I        I           I          I
                  I--------I-----------I----------I
          13.4    I  33.5      20.8         2.3   I   3.1
  SEQ 5 ----------I                               I--------- SEQ 3

The figures along each branch are percent divergences along that
branch.  If you root the tree by placing the root along the longest
branch (33.5%) then you can draw it again as follows, this time rooted:

                        13.6
                I-------------------- SEQ 6
      I---------I       13.4
      I         I-------------------- SEQ 5
      I 33.5
 -----I                 28.1
      I         I-------------------- SEQ 1
      I         I
      I---------I                4.9
                I  20.8  I----------- SEQ 4
                I--------I
                         I       5.9
                         I 2.3 I----- SEQ 2
                         I-----I 3.1
                               I----- SEQ 3

The longest branch (33.5% between 5,6 and 1,2,3,4) is split between the
2 bottom branches of the tree.  As it happens in this particular case,
sequences 5 and 6 are myoglobins while sequences 1,2,3 and 4 are alpha
and beta globins, so you could also justify the above rooting on
biological grounds.  If you do not have any particular need or evidence
for the position of the root, then LEAVE THE TREE UNROOTED.
Unrooted trees do not look as pretty as rooted ones but it is usual to
leave them unrooted if you do not have any evidence for the position of
the root.


BOOTSTRAPPING:

Different sets of sequences and different tree drawing methods may give
different topologies (branching orders) for parts of a tree that are
weakly supported by the data.  It is useful to have an indication of the
degree of error in the tree.  There are several ways of doing this, some
of them rather technical.  We provide one general purpose method in this
program, which makes use of a technique called bootstrapping (see
Felsenstein, 1985).

In the case of sequence alignments, bootstrapping involves taking random
samples of positions from the alignment.  If the alignment has N
positions, each bootstrap sample consists of a random sample of N
positions, taken WITH REPLACEMENT i.e. in any given sample, some sites
may be sampled several times, others not at all.  Then, with each sample
of sites, you calculate a distance matrix as usual and draw a tree.  If
the data very strongly support just one tree then the sample trees will
be very similar to each other and to the original tree, drawn without
bootstrapping.  However, if parts of the tree are not well supported,
then the sample trees will vary considerably in how they represent these
parts.

In practice, you should use a very large number of bootstrap replicates
(1000 is recommended, even if it means running the program for an hour
on a slow microcomputer; on a workstation it will be MUCH faster).  For
each grouping on the tree, you record the number of times this grouping
occurs in the sample trees.  For a group to be considered "significant"
at the 95% level (or P <= 0.05 in statistical terms) you expect the
grouping to show up in >= 95% of the sample trees.  If this happens,
then you can say that the grouping is significant, given the data set
and the method used to draw the tree.
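
The resampling step described above can be sketched in Python as
follows.  This is only an illustration, not the program's own code: the
function names, the toy alignment and the simple percent-difference
distance are all assumptions made for the example, and the tree drawing
step is left out.

```python
import random


def bootstrap_columns(n_positions, rng):
    """Draw N column indices WITH REPLACEMENT from an alignment of N positions."""
    return [rng.randrange(n_positions) for _ in range(n_positions)]


def resample_alignment(alignment, columns):
    """Build one bootstrap sample by copying the chosen columns in order."""
    return ["".join(seq[c] for c in columns) for seq in alignment]


def percent_difference(seq_a, seq_b):
    """Percent of aligned positions at which two sequences differ (toy distance)."""
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return 100.0 * mismatches / len(seq_a)


def distance_matrix(alignment):
    """Pairwise distances for every pair of sequences in the alignment."""
    return [[percent_difference(a, b) for b in alignment] for a in alignment]


# Hypothetical toy alignment: 3 sequences, 8 positions.
alignment = ["ACGTACGT", "ACGTTCGT", "ACCTTCGA"]

rng = random.Random(99)          # the same seed gives the same samples
for trial in range(3):           # a real run would use many more trials
    columns = bootstrap_columns(len(alignment[0]), rng)
    sample = resample_alignment(alignment, columns)
    matrix = distance_matrix(sample)
    # A tree would be drawn from `matrix` here and its groupings recorded.
    print(trial, columns, matrix[0][1])
```

Because the columns are drawn with replacement, a given sample normally
repeats some positions and misses others, which is what makes the sample
trees differ from one another.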
So, when you use the bootstrap option, a NJ tree is drawn as before and
then you are asked to say how many bootstrap samples you want (1000 is
the default) and you are asked to give a seed number for the random
number generator.  If you give the same seed number in future, you will
get the same results (we hope).  Remember to give different seed numbers
if you wish to carry out genuinely different bootstrap sampling
experiments.  Below is the output file from using the same data for the
6 globin sequences as used before.  The output file has the same name as
the input file with the extension ".njb".


//STUFF DELETED  .... same as for the ordinary NJ output//

                        Bootstrap Confidence Limits

 Random number generator seed =      99

 Number of bootstrap trials   =    1000

 Diagrammatic representation of the above tree:

 Each row represents 1 tree cycle; defining 2 groups.

 Each column is 1 sequence; the stars in each line show 1 group;
 the dots show the other
 Numbers show occurences in bootstrap samples.

****..   1000
.***..   1000                <- This is the answer!!
*..***    812
122311


For an unrooted tree with N sequences, there are actually only N-3
genuinely different groupings that we can test (this is the number of
"internal branches"; each internal branch splits the sequences into 2
groups).  In this example, we have 6 sequences with 3 internal branches
in the reference tree.  In the bootstrap resampling, we count how often
each of these internal branches occurs.  Here, we find that the branch
which splits 1,2,3 and 4 versus 5 and 6 occurs in all 1000 samples; the
branch which splits 2,3 and 4 versus 1,5 and 6 also occurs in all 1000;
the branch which splits 2 and 3 versus 1,4,5 and 6 occurs in 812 of the
1000 samples.
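
The counting described above can be sketched in Python, using the same
star/dot notation as the output file.  This is an illustration under
stated assumptions rather than the program's own method: the function
names are invented for the example and the three sample trees are
hypothetical input, not program output.  Only the three reference rows
and the figures 1000 and 812 come from the listing above.

```python
from collections import Counter


def normalise(split):
    """A split and its mirror image are the same branch; give sequence 1 a star."""
    if split[0] == "*":
        return split
    return "".join("*" if c == "." else "." for c in split)


def internal_branches(n_sequences):
    """Number of testable groupings in an unrooted tree of N sequences."""
    return n_sequences - 3


def count_support(reference_splits, sample_trees):
    """For each reference split, count the sample trees that contain it."""
    counts = Counter({ref: 0 for ref in reference_splits})
    for tree in sample_trees:
        present = {normalise(s) for s in tree}
        for ref in reference_splits:
            if normalise(ref) in present:
                counts[ref] += 1
    return counts


def strongly_supported(count, trials, level=0.95):
    """True if the grouping appears in at least `level` of the sample trees."""
    return count / trials >= level


# The 3 rows of the output file shown above (6 sequences, 3 internal branches).
reference = ["****..", ".***..", "*..***"]

# Hypothetical sample trees, each given as its list of internal splits.
samples = [
    ["****..", ".***..", "*..***"],
    ["....**", "*...**", ".**..."],   # same tree, stars and dots swapped
    ["****..", ".***..", "*.*.**"],   # groups 2 with 4 instead of 2 with 3
]

support = count_support(reference, samples)
for ref in reference:
    print(ref, support[ref])
```

With the figures from the output file, strongly_supported(1000, 1000) is
true while strongly_supported(812, 1000) is false, which matches the
reading of the 95% level given in the text.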
We can put these figures on to the diagrammatic representation we made
earlier of our unrooted NJ tree as follows:

                          SEQ 1       SEQ 4
                           I           I
                           I           I
  SEQ 6 ----------I        I           I          I--------- SEQ 2
                  I  1000  I   1000    I   812    I
                  I--------I-----------I----------I
                  I                               I
  SEQ 5 ----------I                               I--------- SEQ 3

You can equally put these confidence figures on the rooted tree (in fact
the interpretation is simpler with rooted trees).  With the unrooted
tree, the grouping of sequence 5 with 6 is significant (as is the
grouping of sequences 1,2,3 and 4).  Equally the grouping of sequences
1,5 and 6 is significant (the same as saying that 2,3 and 4 group
significantly).  However, the grouping of 2 and 3 is not significant,
although it is relatively strongly supported.

Unfortunately, there is a small complication in the interpretation of
these results.  In statistical hypothesis testing, it is not valid to
make multiple simultaneous tests and to treat the result of each test
completely independently.  In the above case, if you have one particular
test (grouping) that you wish to make in advance, it is valid to test IT
ALONE and to simply show the other bootstrap figures for reference.  If
you do not have any particular test in mind before you do the
bootstrapping, you can just show all of the figures and use the 95%
level as an ARBITRARY cut off to show those groups that are very
strongly supported; but not mention anything about SIGNIFICANCE testing.
In the literature, it is common practice to simply show the figures with
a tree; they frequently speak for themselves.


*******************************************************************

                4.  Command Line Interface.


You can do almost everything that can be done from the menus, using a
command line interface.  In this mode, the program will take all of its
instructions as "switches" when you activate it; no questions will be
asked; if there are no errors, the program just does an analysis and
stops.  It does not work so well on the MAC but is still possible.  To
get you started we will show you the 2 simplest uses of the command line
as it looks on VAX/VMS.  On all other machines (except the MAC) it works
in the same way.

$ clustalv /help           **OR**   $ clustalv /check

Both of the above switches give you a one page summary of the command
line on the screen and then the program stops.

$ clustalv proteins.seq    **OR**   $ clustalv /infile=proteins.seq

This will read the sequences from the file 'proteins.seq' and do a
complete multiple alignment.  Default parameters will be used, the
program will try to tell whether the sequences are DNA or protein, and
the output will go to a file called 'proteins.aln'.  A dendrogram file
called 'proteins.dnd' will also be created.  Thus the default action for
the program, when it successfully reads in an input file, is to do a
full multiple alignment.  Some further examples of command line usage
will be given later.

Command line switches can be abbreviated but MAKE SURE YOU DO NOT MAKE
THEM AMBIGUOUS.  No attempt will be made to detect ambiguity.  Use
enough characters to distinguish each switch uniquely.

The full list of allowed switches is given below:


                DATA (sequences)

/INFILE=file.ext    :input sequences.  If you give an input file and
                     nothing else as a switch, the default action is
                     to do a complete multiple alignment.  The input
                     file can also be specified by giving it as the
                     first command line parameter with no "/" in
                     front of it e.g. $ clustalv file.ext .

/PROFILE1=file.ext  :You use these two switches to give the names of
/PROFILE2=file.ext   two profiles.
                     The default action is to align the two.  You must
                     give the names of both profile files.


                VERBS (do things)

/HELP               :list the command line parameters on the screen.
/CHECK

/ALIGN              :do full multiple alignment.  This is the default
                     action if no other switches except for input files
                     are given.

/TREE               :calculate NJ tree.  If this is the only action
                     specified (e.g. $ clustalv proteins.seq/tree ) it IS
                     ASSUMED THAT THE SEQUENCES ARE ALREADY ALIGNED.  If
                     the sequences are not already aligned, you should
                     also give the /ALIGN switch.  This will align the
                     sequences first, output an alignment file and
                     calculate the tree in memory.

/BOOTSTRAP(=n)      :bootstrap a NJ tree (n= number of bootstraps;
                     default = 1000).  If this is the only action
                     specified (e.g. $ clustalv proteins.seq/bootstrap )
                     it IS ASSUMED THAT THE SEQUENCES ARE ALREADY ALIGNED.
                     If the sequences are not already aligned, you should
                     also give the /ALIGN switch.  This will align the
                     sequences first, output an alignment file and
                     calculate the bootstraps in memory.  You can set the
                     number of bootstrap trials here (e.g. /bootstrap=500).
                     You can set the seed number for the random number
                     generator with /seed=n.


                PARAMETERS (set things)

***Pairwise alignments:***

/KTUP=n             :word size
/TOPDIAGS=n         :number of best diagonals
/WINDOW=n           :window around best diagonals
/PAIRGAP=n          :gap penalty

***Multiple alignments:***

/FIXEDGAP=n         :fixed length gap pen.
/FLOATGAP=n         :variable length gap pen.

/MATRIX=            :PAM100 or ID or file name.  The default weight matrix
                     for proteins is PAM 250.

/TYPE=p or d        :type is protein or DNA.  This allows you to
                     explicitly override the program's attempt at guessing
                     the type of the sequence.  It is only useful if you
                     are using sequences with a VERY strange composition.

/OUTPUT=            :GCG or PHYLIP or PIR.  The default output is
                     Clustal format.

/TRANSIT            :transitions not weighted.  The default is to weight
                     transitions as more favourable than other mismatches
                     in DNA alignments.  This switch makes all nucleotide
                     mismatches equally weighted.

***Trees:***

/KIMURA             :use Kimura's correction on distances.
/TOSSGAPS           :ignore positions with a gap in ANY sequence.
/SEED=n             :seed number for bootstraps.


EXAMPLES:

These examples use the VAX/VMS $ prompt; otherwise, command-line usage
is the same on all machines except the Macintosh.

$ clustalv proteins.seq      OR     $ clustalv /infile=proteins.seq

Read whatever sequences are in the file "proteins.seq" and do a full
multiple alignment; output will go to the files: "proteins.dnd"
(dendrogram) and "proteins.aln" (alignment).

$ clustalv proteins.seq/ktup=2/matrix=pam100/output=pir

Same as last example but use K-Tuple size of 2; use a PAM 100 protein
weight matrix; write the alignment out in NBRF/PIR format (goes to a
file called "proteins.pir").

$ clustalv /profile1=proteins.seq/profile2=more.seq/type=p/fixed=11

Take the alignment in "proteins.seq" and align it with "more.seq" using
default values for everything except the fixed gap penalty which is set
to 11.  The sequence type is explicitly set to PROTEIN.

$ clustalv proteins.pir/tree/kimura

Take the sequences in proteins.pir (they MUST BE ALIGNED ALREADY) and
calculate a phylogenetic tree using Kimura's correction for distances.

$ clustalv proteins.pir/align/tree/kimura