 Cycle   2     =  SEQ:   1 (  0.28142) joins Node:   5 (  0.33462)

 Cycle   3     =  SEQ:   2 (  0.05879) joins  SEQ:   3 (  0.03086)

 Cycle   4 (Last cycle, trichotomy):

		 Node:   1 (  0.20798) joins
		 Node:   2 (  0.02341) joins
		  SEQ:   4 (  0.04915) 



The output file first shows the percent divergence (distance) 
figures between each pair of sequences.  Then a description of a NJ 
tree is given.  This description shows which sequences (SEQ:) or 
which groups of sequences (NODE: , a node is numbered using the 
lowest sequence that belongs to it) join at each level of the tree.  
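
As a rough illustration of how one line of this description can be 
read, here is a minimal Python sketch.  It is not part of the 
program; the function name parse_members is made up for this 
example.  It relies only on the layout shown above, and converts 
each figure to a percentage as used in the diagrams in this section 
(0.28142 becomes 28.1).

```python
import re

# One member of a join, e.g. "SEQ:   1 (  0.28142)" or "Node:   5 (  0.33462)".
MEMBER = re.compile(r"(SEQ|Node):\s*(\d+)\s*\(\s*([0-9.]+)\s*\)", re.IGNORECASE)


def parse_members(line):
    """Return (kind, number, percent divergence) for each member on a line."""
    return [
        (kind.upper(), int(number), round(float(distance) * 100, 1))
        for kind, number, distance in MEMBER.findall(line)
    ]


line = " Cycle   2     =  SEQ:   1 (  0.28142) joins Node:   5 (  0.33462)"
members = parse_members(line)
print(members)
```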

This is an unrooted tree!! This means that the direction of 
evolution through the tree is not shown.  This can only be inferred 
in one of two ways:  
1) assume a degree of constancy in the molecular clock and place the 
root (bottom of the tree; the point where all the sequences radiate 
from) half way along the longest branch.     **OR**
2) use an "outgroup", a sequence from an organism that you "know" 
must be outside of the rest of the sequences i.e. root the tree 
manually, on biological grounds.

The above tree can be represented diagrammatically as follows:


                          SEQ 1       SEQ 4
                           I           I
          13.6             I 28.1      I 4.9          5.9
  SEQ 6 ----------I        I           I          I--------- SEQ 2
                  I        I           I          I
                  I--------I-----------I----------I
          13.4    I  33.5      20.8         2.3   I   3.1
  SEQ 5 ----------I                               I--------- SEQ 3


The figures along each branch are percent divergences along that 
branch.  If you root the tree by placing the root along the longest 
branch (33.5%) then you can draw it again as follows, this time 
rooted:



                        13.6
                I-------------------- SEQ 6
      I---------I       13.4
      I         I-------------------- SEQ 5
      I 33.5 
 -----I                 28.1
      I         I-------------------- SEQ 1
      I         I
      I---------I                4.9
                I  20.8  I----------- SEQ 4
                I--------I  
                         I       5.9
                         I 2.3 I----- SEQ 2
                         I-----I 3.1
                               I----- SEQ 3



The longest branch (33.5% between 5,6 and 1,2,3,4) is split between 
the 2 bottom branches of the tree.  As it happens in this particular 
case, sequences 5 and 6 are myoglobins while sequences 1,2,3 and 4 
are alpha and beta globins, so you could also justify the above 
rooting on biological grounds.  If you do not have any particular 
need or evidence for the position of the root, then LEAVE THE TREE 
UNROOTED.  Unrooted trees do not look as pretty as rooted ones but 
it is usual to leave them unrooted if you do not have any evidence 
for the position of the root.
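
The rooting rule used above (place the root half way along the 
longest branch) can be sketched in a few lines of Python.  This is 
only an illustration, not the program's code.  The branch lengths 
are those of the example tree; the internal node labels A, B, C, D 
and the name ROOT are invented for the sketch.

```python
# Branches of the unrooted example tree: (one end, other end, % divergence).
# A, B, C and D are hypothetical labels for the four internal nodes.
branches = [
    ("SEQ6", "A", 13.6), ("SEQ5", "A", 13.4),
    ("A", "B", 33.5),
    ("SEQ1", "B", 28.1),
    ("B", "C", 20.8),
    ("SEQ4", "C", 4.9),
    ("C", "D", 2.3),
    ("SEQ2", "D", 5.9), ("SEQ3", "D", 3.1),
]


def root_on_longest_branch(branches):
    """Split the longest branch in half and hang both halves off a new root."""
    longest = max(branches, key=lambda branch: branch[2])
    end_a, end_b, length = longest
    rooted = [branch for branch in branches if branch is not longest]
    rooted += [("ROOT", end_a, length / 2), ("ROOT", end_b, length / 2)]
    return longest, rooted


longest, rooted = root_on_longest_branch(branches)
print(longest)      # the branch that receives the root
print(rooted[-2:])  # the two halves created by the split
```

Here the 33.5% branch is chosen, so each of the 2 bottom branches of 
the rooted tree carries half of it.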


BOOTSTRAPPING:   Different sets of sequences and different tree 
drawing methods may give different topologies (branching orders) for 
parts of a tree that are weakly supported by the data.  It is useful 
to have an indication of the degree of error in the tree.  There are 
several ways of doing this, some of them rather technical.  We 
provide one general purpose method in this program, which makes use 
of a technique called bootstrapping (see Felsenstein, 1985).

In the case of sequence alignments, bootstrapping involves taking 
random samples of positions from the alignment.  If the alignment 
has N positions, each bootstrap sample consists of a random sample 
of N positions, taken WITH REPLACEMENT i.e. in any given sample, 
some sites may be sampled several times, others not at all.  Then, 
with each sample of sites, you calculate a distance matrix as usual 
and draw a tree.  If the data very strongly support just one tree 
then the sample trees will be very similar to each other and to the 
original tree, drawn without bootstrapping.  However, if parts of 
the tree are not well supported, then the sample trees will vary 
considerably in how they represent these parts.
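
The resampling step can be sketched as follows.  This is a minimal 
Python illustration under stated assumptions, not the program's own 
code: the toy alignment is invented, and Python's random number 
generator will not reproduce the program's random numbers for a 
given seed.

```python
import random


def bootstrap_sample(alignment, rng):
    """Return a new alignment built from N columns drawn WITH REPLACEMENT."""
    n = len(alignment[0])
    columns = [rng.randrange(n) for _ in range(n)]
    return ["".join(seq[c] for c in columns) for seq in alignment]


# Toy aligned sequences (hypothetical data; all rows have the same length).
alignment = ["ACGTACGT", "ACGTTCGT", "ACCTACGA"]

rng = random.Random(99)  # a fixed seed makes the sample repeatable
sample = bootstrap_sample(alignment, rng)
for row in sample:
    print(row)
```

Each sample has the same number of positions as the original; some 
original columns appear several times and others are missing.  A 
distance matrix and tree would then be computed from each sample.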

In practice, you should use a very large number of bootstrap 
replicates (1000 is recommended, even if it means running the 
program for an hour on a slow microcomputer; on a workstation it 
will be MUCH faster).  For each grouping on the tree, you record the 
number of times this grouping occurs in the sample trees.  For a 
group to be considered "significant" at the 95% level (or P <= 0.05 
in statistical terms) you expect the grouping to show up in >= 95% 
of the sample trees.  If this happens, then you can say that the 
grouping is significant, given the data set and the method used to 
draw the tree.  
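
The counting step can be sketched like this.  It is a hedged Python 
illustration, not the program's implementation: each tree is 
represented simply as a list of groupings, the function names are 
invented, and the 20 sample trees are made-up data (a real run 
would use 1000).

```python
def canonical(group, all_seqs):
    """Describe a split by the side that excludes the lowest sequence."""
    side = frozenset(group)
    if min(all_seqs) in side:
        side = frozenset(all_seqs) - side
    return side


def count_support(reference, sample_trees, all_seqs):
    """Count the sample trees in which each reference grouping occurs."""
    counts = {}
    for group in reference:
        key = canonical(group, all_seqs)
        counts[key] = sum(
            key in {canonical(g, all_seqs) for g in tree}
            for tree in sample_trees
        )
    return counts


all_seqs = {1, 2, 3, 4, 5, 6}
reference = [{5, 6}, {2, 3, 4}, {2, 3}]

# Hypothetical sample trees: 16 agree with the reference, 4 pair 3 with 4.
alternative = [{5, 6}, {2, 3, 4}, {3, 4}]
sample_trees = [reference] * 16 + [alternative] * 4

counts = count_support(reference, sample_trees, all_seqs)
strong = [sorted(g) for g, n in counts.items()
          if n / len(sample_trees) >= 0.95]
print(strong)
```

With this made-up data, the groupings 5,6 and 2,3,4 reach the 95% 
level, while 2,3 (16 of 20 trees) does not.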

So, when you use the bootstrap option, a NJ tree is drawn as before 
and then you are asked to say how many bootstrap samples you want 
(1000 is the default) and you are asked to give a seed number for 
the random number generator.  If you give the same seed number in 
future, you will get the same results (we hope).  Remember to give 
different seed numbers if you wish to carry out genuinely different 
bootstrap sampling experiments.  Below is the output file from using 
the same data for the 6 globin sequences as used before.  The output 
file has the same name as the input file with the extension ".njb".

//
STUFF DELETED  .... same as for the ordinary NJ output
//
			Bootstrap Confidence Limits


 Random number generator seed =      99

 Number of bootstrap trials   =    1000


 Diagrammatic representation of the above tree: 

 Each row represents 1 tree cycle; defining 2 groups.

 Each column is 1 sequence; the stars in each line show 1 group; 
 the dots show the other

 Numbers show occurences in bootstrap samples.
 
****..   1000              
.***..   1000                <- This is the answer!!
*..***    812 
122311


For an unrooted tree with N sequences, there are actually only N-3 
genuinely different groupings that we can test (this is the number 
of "internal branches"; each internal branch splits the sequences 
into 2 groups).  In this example, we have 6 sequences with 3 
internal branches in the reference tree.  In the bootstrap 
resampling, we count how often each of these internal branches 
occurs.  Here, we find that the branch which splits 1,2,3 and 4 
versus 5 and 6 occurs in all 1000 samples; the branch which splits 
2,3 and 4 versus 1,5 and 6 occurs in 1000; the branch which splits 2 
and 3 versus 1,4,5 and 6 occurs in 812/1000 samples.  We can put 
these figures on to the diagrammatic representation we made earlier 
of our unrooted NJ tree as follows:



                          SEQ 1       SEQ 4
                           I           I
                           I           I            
  SEQ 6 ----------I        I           I          I--------- SEQ 2
                  I  1000  I   1000    I   812    I
                  I--------I-----------I----------I
                  I                               I    
  SEQ 5 ----------I                               I--------- SEQ 3



You can equally put these confidence figures on the rooted tree (in 
fact the interpretation is simpler with rooted trees).  With the 
unrooted tree, the grouping of sequence 5 with 6 is significant (as 
is the grouping of sequences 1,2,3 and 4).  Equally the grouping of 
sequences 1,5 and 6 is significant (the same as saying that 2,3 and 
4 group significantly).  However, the grouping of 2 and 3 is not 
significant, although it is relatively strongly supported.  
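
To make the reading of the star and dot rows concrete, here is a 
small Python sketch.  The rows and counts are copied from the 
bootstrap output in this section; the function name groups is 
invented, and the sketch only restates the figures, it does not 
compute them.

```python
# Rows and counts as printed in the bootstrap output for the 6 globins.
rows = [("****..", 1000), (".***..", 1000), ("*..***", 812)]
trials = 1000


def groups(row):
    """Return the sequence numbers marked with stars and with dots."""
    stars = [i + 1 for i, ch in enumerate(row) if ch == "*"]
    dots = [i + 1 for i, ch in enumerate(row) if ch == "."]
    return stars, dots


results = []
for row, count in rows:
    stars, dots = groups(row)
    percent = 100 * count / trials
    results.append((stars, dots, percent, percent >= 95))
    print(stars, "versus", dots, ":", percent, "%")
```

The last row gives 2 and 3 versus 1,4,5 and 6 at 81.2%, which is 
below the 95% level.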

Unfortunately, there is a small complication in the interpretation 
of these results.  In statistical hypothesis testing, it is not 
valid to make multiple simultaneous tests and to treat the result of 
each test completely independently.  In the above case, if you have 
one particular test (grouping) that you wish to make in advance, it 
is valid to test IT ALONE and to simply show the other bootstrap 
figures for reference.  If you do not have any particular test in 
mind before you do the bootstrapping, you can just show all of the 
figures and use the 95% level as an ARBITRARY cut off to show those 
groups that are very strongly supported; but not mention anything 
about SIGNIFICANCE testing.  In the literature, it is common 
practice to simply show the figures with a tree; they frequently 
speak for themselves.  



*******************************************************************

		4.  Command Line Interface.



You can do almost everything that can be done from the menus, using 
a command line interface. In this mode, the program will take all of 
its instructions as "switches" when you activate it; no questions 
will be asked; if there are no errors, the program just does an 
analysis and stops.  This mode works less well on the MAC but can 
still be used there.  To get you started we will show you the 2 
simplest uses of the command line as it looks on VAX/VMS.  On all 
other machines (except the MAC) it works in the same way.

$ clustalv /help           **OR**   $ clustalv /check

Both of the above switches give you a one page summary of the 
command line on the screen and then the program stops. 


$ clustalv proteins.seq    **OR**   $ clustalv /infile=proteins.seq    

This will read the sequences from the file 'proteins.seq' and do a 
complete multiple alignment.  Default parameters will be used, the 
program will try to tell whether or not the sequences are DNA or 
protein and the output will go to a file called 'proteins.aln' . A 
dendrogram file called 'proteins.dnd' will also be created.  Thus 
the default action for the program, when it successfully reads in an 
input file is to do a full multiple alignment.  Some further 
examples of command line usage will be given later.
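
The output file names follow from the input file name.  The Python 
sketch below assumes that the extension of the input file is simply 
replaced, which matches the 'proteins.seq' example and the ".njb" 
bootstrap file mentioned earlier; the function name output_names is 
invented for the illustration.

```python
from pathlib import PurePath


def output_names(infile):
    """Guess the output file names by swapping the input file's extension."""
    path = PurePath(infile)
    return {
        "alignment": str(path.with_suffix(".aln")),
        "dendrogram": str(path.with_suffix(".dnd")),
        "bootstrap": str(path.with_suffix(".njb")),
    }


names = output_names("proteins.seq")
print(names)
```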

Command line switches can be abbreviated but MAKE SURE YOU DO NOT 
MAKE THEM AMBIGUOUS.  No attempt will be made to detect ambiguity.  
Use enough characters to distinguish each switch uniquely.
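
Since the program does not check for ambiguity, you may want to 
check an abbreviation yourself.  The Python sketch below is one way 
to do so.  It assumes that an abbreviation is a leading part of the 
switch name and that case does not matter; the function name 
candidates is invented, and the switch names are those in the list 
of allowed switches in this section.

```python
# Switch names from the list of allowed switches.
SWITCHES = [
    "INFILE", "PROFILE1", "PROFILE2", "HELP", "CHECK", "ALIGN", "TREE",
    "BOOTSTRAP", "KTUP", "TOPDIAGS", "WINDOW", "PAIRGAP", "FIXEDGAP",
    "FLOATGAP", "MATRIX", "TYPE", "OUTPUT", "TRANSIT", "KIMURA",
    "TOSSGAPS", "SEED",
]


def candidates(abbrev):
    """Return every switch that the abbreviation could stand for."""
    prefix = abbrev.lstrip("/").split("=")[0].upper()
    return [name for name in SWITCHES if name.startswith(prefix)]


for abbrev in ("/t", "/tr", "/tre", "/boot=500"):
    found = candidates(abbrev)
    status = "ok" if len(found) == 1 else "AMBIGUOUS"
    print(abbrev, found, status)
```

Under these assumptions /tr could mean /TREE or /TRANSIT, while 
/tre can only mean /TREE.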







The full list of allowed switches is given below:


                DATA (sequences)

/INFILE=file.ext    :input sequences.  If you give an input file and 
				nothing else as a switch, the default action is 
				to do a complete multiple alignment.  The input 
				file can also be specified by giving it as the 
				first command line parameter with no "/" in 	
				front of it e.g. $ clustalv file.ext  .

/PROFILE1=file.ext	:You use these two switches to give the names of  
/PROFILE2=file.ext	two profiles.  The default action is to align 
			the two. You must give the names of both profile 
				files. 



                VERBS (do things)

/HELP  		:list the command line parameters on the screen.
/CHECK           
                
/ALIGN        	:do full multiple alignment.  This is the default 	
			action if no other switches except for input files 
			are given.

/TREE      	:calculate NJ tree.  If this is the only action 	
			specified (e.g. $ clustalv proteins.seq/tree ) it IS 
			ASSUMED THAT THE SEQUENCES ARE ALREADY ALIGNED.  If 
			the sequences are not already aligned, you should 	
			also give the /ALIGN switch.  This will align the 	
			sequences first, output an alignment file and 	
			calculate the tree in memory. 

/BOOTSTRAP(=n)	:bootstrap a NJ tree (n= number of bootstraps; 	
			default = 1000).  If this is the only action 		
			specified (e.g. $ clustalv proteins.seq/bootstrap ) 
			it IS ASSUMED THAT THE SEQUENCES ARE ALREADY ALIGNED.  
			If the sequences are not already aligned, you should 
			also give the /ALIGN switch.  This will align the 	
			sequences first, output an alignment file and 	
			calculate the bootstraps in memory.  You can set the 
			number of bootstrap trials here (e.g. /bootstrap=500).  
			You can set the seed number for the random number 	
			generator with /seed=n.



                PARAMETERS (set things)

***Pairwise alignments:***

/KTUP=n      	:word size              
    
/TOPDIAGS=n  	:number of best diagonals

/WINDOW=n    	:window around best diagonals 
 
/PAIRGAP=n   	:gap penalty



***Multiple alignments:***

/FIXEDGAP=n  	:fixed length gap pen.  
    
/FLOATGAP=n  	:variable length gap pen.

/MATRIX=     	:PAM100 or ID or file name. The default weight matrix 
			for proteins is PAM 250.

/TYPE=p or d 	:type is protein or DNA.   This allows you to 	
			explicitly override the program's attempt at guessing 
			the type of the sequence.  It is only useful if you 
			are using sequences with a VERY strange composition.

/OUTPUT=     	:GCG or PHYLIP or PIR.  The default output is 	
			Clustal format.
    
/TRANSIT     	:transitions not weighted.  The default is to weight 
			transitions as more favourable than other mismatches 
			in DNA alignments.  This switch makes all nucleotide 
			mismatches equally weighted.


***Trees:***                             

/KIMURA      	:use Kimura's correction on distances.   

/TOSSGAPS    	:ignore positions with a gap in ANY sequence.

/SEED=n      	:seed number for bootstraps.
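
If you drive the program from a script, the command line can be put 
together from the switches above.  The Python sketch below is only 
an illustration: the function name command_line is invented, it 
only builds the text of the command and does not run anything, and 
it assumes that several switches can be attached to the file name 
one after another in the same way as the single /tree switch in 
"$ clustalv proteins.seq/tree".

```python
def command_line(infile, **switches):
    """Build e.g. 'clustalv proteins.seq/tree' from keyword arguments.

    A value of True gives a bare switch (/align); any other value gives
    the /name=value form (/bootstrap=500).
    """
    text = "clustalv " + infile
    for name, value in switches.items():
        text += "/" + name if value is True else "/{}={}".format(name, value)
    return text


simple = command_line("proteins.seq", tree=True)
full = command_line("proteins.seq", align=True, bootstrap=500, seed=99)
print(simple)
print(full)
```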




EXAMPLES:

These examples use the VAX/VMS $ prompt; otherwise, command-line 
usage is the same on all machines except the Macintosh.

 
$ clustalv proteins.seq      OR     $ clustalv /infile=proteins.seq