📄 clustalv.doc

MULTIPLE ALIGNMENT PARAMETERS:

Having calculated a dendrogram between a set of sequences, the final
multiple alignment is carried out by a series of alignments of larger
and larger groups of sequences.  The order is determined by the
dendrogram so that the most similar sequences get aligned first.  Any
gaps that are introduced in the early alignments are fixed.  When two
groups of sequences are aligned against each other, a full protein
weight matrix (such as a Dayhoff PAM 250) is used.  Two gap penalties
are offered: a "FIXED" penalty for opening up a gap and a "FLOATING"
penalty for extending a gap.

 ********* MULTIPLE ALIGNMENT PARAMETERS *********

     1. Fixed Gap Penalty       :10
     2. Floating Gap Penalty    :10
     3. Toggle Transitions (DNA):Weighted
     4. Protein weight matrix   :PAM 250
     H. HELP

Enter number (or [RETURN] to exit):

FIXED GAP PENALTY:   Reduce this to encourage gaps of all sizes;
increase it to discourage them.  Terminal gaps are penalised the same
as all others.  BEWARE of making this too small (approx 5 or so); if
the penalty is too small, the program may prefer to align each
sequence opposite one long gap.

FLOATING GAP PENALTY:   Reduce this to encourage longer gaps; increase
it to shorten them.  Terminal gaps are penalised the same as all
others.  BEWARE of making this too small (approx 5 or so); if the
penalty is too small, the program may prefer to align each sequence
opposite one long gap.

DNA TRANSITIONS = WEIGHTED or UNWEIGHTED:   By default, transitions
(A versus G; C versus T) are weighted more strongly than transversions
(an A aligned with a G will be preferred to an A aligned with a C or
a T).  You can make all pairs of nucleotides equally weighted with
this option.

PROTEIN WEIGHT MATRIX:  For protein comparisons, a weight matrix is
used to differentially weight different pairs of aligned amino acids.
The default is the well known Dayhoff PAM 250 matrix.
We also offer a PAM 100 matrix, an identity matrix (all weights are
the same for exact matches) or allow you to give the name of a file
with your own matrix.  The weight matrices used by Clustal V are shown
in full in the Algorithms and References section of this
documentation.

If you input a matrix from a file, it must be in the following format.
Use a 20x20 matrix only (entries for the 20 "normal" amino acids only;
no ambiguity codes etc.).  Input the lower left triangle of the
matrix, INCLUDING the diagonal.  The order of the amino acids (rows
and columns) must be: CSTPAGNDEQHRKMILVFYW.  The values can be in free
format separated by spaces (not commas).  The PAM 250 matrix is shown
below in this format.

 12
  0  2
 -2  1  3
 -3  1  0  6
 -2  1  1  1  2
 -3  1  0 -1  1  5
 -4  1  0 -1  0  0  2
 -5  0  0 -1  0  1  2  4
 -5  0  0 -1  0  0  1  3  4
 -5 -1 -1  0  0 -1  1  2  2  4
 -3 -1 -1  0 -1 -2  2  1  1  3  6
 -4  0 -1  0 -2 -3  0 -1 -1  1  2  6
 -5  0  0 -1 -1 -2  1  0  0  1  0  3  5
 -5 -2 -1 -2 -1 -3 -2 -3 -2 -1 -2  0  0  6
 -2 -1  0 -2 -1 -3 -2 -2 -2 -2 -2 -2 -2  2  5
 -6 -3 -2 -3 -2 -4 -3 -4 -3 -2 -2 -3 -3  4  2  6
 -2 -1  0 -1  0 -1 -2 -2 -2 -2 -2 -2 -2  2  4  2  4
 -4 -3 -3 -5 -4 -5 -4 -6 -5 -5 -2 -4 -5  0  1  2 -1  9
  0 -3 -3 -5 -3 -5 -2 -4 -4 -4  0 -4 -4 -2 -1 -1 -2  7 10
 -8 -2 -5 -6 -6 -7 -4 -7 -7 -5 -3  2 -3 -4 -5 -2 -6  0  0 17

Values must be integers and can be all positive or positive and
negative as above.  These are SIMILARITY values.

ALIGNMENT OUTPUT OPTIONS:

By default, the alignment goes to a file in a self-explanatory
"blocked" alignment format.  This format is fine for displaying the
results but requires heavy editing if you wish to use the alignment
with other software.  To help, we provide 3 other formats which can be
turned on or off.  If you have a sequence data set or alignment in
memory, you can also ask for output files in whatever formats are
turned on, NOW.
The menu you use to choose format is shown below.

*** We draw your attention to NBRF/PIR format in particular.  This
format is EXACTLY the same as one of the input formats.  Therefore,
alignments written in this format can be used again as input (to the
profile alignments or phylogenetic trees). ***

 ********* Format of Alignment Output *********

     1. Toggle CLUSTAL format output   =  ON
     2. Toggle NBRF/PIR format output  =  OFF
     3. Toggle GCG format output       =  OFF
     4. Toggle PHYLIP format output    =  OFF
     5. Create alignment output file(s) now?
     H. HELP

Enter number (or [RETURN] to exit):

CLUSTAL FORMAT:     This is a self-explanatory alignment.  The
alignment is written out in blocks.  Identities are highlighted and
(if you use a PAM 250 matrix) positions in the alignment where all of
the residues are "similar" to each other (PAM 250 score of 8 or more)
are indicated.

NBRF/PIR FORMAT:    This is the usual NBRF/PIR format with gaps
indicated by hyphens ("-").  As we have stressed before, this format
is EXACTLY compatible with the sequence input format.  Therefore you
can read in these alignments again for profile alignments or for
calculating phylogenetic trees.

GCG FORMAT:         In version 7 of the Wisconsin GCG package, a new
multiple sequence format was introduced.  This is the MSF (Multiple
Sequence Format) format.  It can be used as input to the GCG sequence
editor or any of the GCG programs that make use of multiple
alignments.  THIS FORMAT IS ONLY SUPPORTED IN VERSION 7 OF THE GCG
PACKAGE OR LATER.

PHYLIP FORMAT:      This format can be used by the Phylip package of
Joe Felsenstein (see the references/algorithms section for details of
how to get it).  Phylip allows you to do a huge range of phylogenetic
analyses (we just offer one method in this program) and is probably
the most widely used set of programs for drawing trees.  It also works
on just about every computer you can think of, providing you have a
decent Pascal compiler.
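
As an illustration of the user-supplied weight matrix format described
under PROTEIN WEIGHT MATRIX above (lower left triangle INCLUDING the
diagonal, amino acid order CSTPAGNDEQHRKMILVFYW, integers in free
format separated by spaces), the following Python sketch shows one way
such a file could be read into a symmetric lookup table.  It is not
taken from the Clustal V source; the names AMINO_ACIDS and
parse_lower_triangle are hypothetical and exist only for this example.

```python
# Hypothetical helper, not part of Clustal V: reads a lower-left
# triangle of integer similarity values into a symmetric lookup table.

AMINO_ACIDS = "CSTPAGNDEQHRKMILVFYW"  # row/column order given in the text


def parse_lower_triangle(text, order=AMINO_ACIDS):
    """Return a dict mapping (residue, residue) pairs to integer scores."""
    # Free format: any whitespace (spaces or line breaks) separates values.
    # A comma-separated value makes int() raise ValueError.
    values = [int(token) for token in text.split()]

    n = len(order)
    expected = n * (n + 1) // 2  # triangle including the diagonal
    if len(values) != expected:
        raise ValueError(f"expected {expected} values, found {len(values)}")

    scores = {}
    position = 0
    for row in range(n):
        for col in range(row + 1):
            score = values[position]
            position += 1
            # Store both orientations so lookups are symmetric.
            scores[order[row], order[col]] = score
            scores[order[col], order[row]] = score
    return scores


# Small demonstration using only the first three rows of the PAM 250
# triangle shown in the text (residues C, S and T).
demo = parse_lower_triangle("12  0 2  -2 1 3", order="CST")
print(demo["T", "S"])  # 1
```

With the default order, the function expects exactly 210 values
(20 x 21 / 2), which matches a 20x20 matrix given as a lower left
triangle with its diagonal.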

      ******************************
      *   PROFILE ALIGNMENT MENU.  *
      ******************************

This menu is for taking two old alignments (or single sequences) and
aligning them with each other.  The result is one bigger alignment.
The menu is very similar to the multiple alignment menu except that
there is no mention of dendrograms here (they are not needed) and you
need to input two sets of sequences.  The menu looks like this:

 ******Profile*Alignment*Menu******

    1.  Input 1st. profile/sequence
    2.  Input 2nd. profile/sequence
    3.  Do alignment now
    4.  Alignment parameters
    5.  Output format options
    S.  Execute a system command
    H.  HELP
    or press [RETURN] to go back to main menu

Your choice:

You must input profile number 1 first.  When both profiles are loaded,
use item 3 (Do alignment now) and the 2 profiles will be aligned.
Items 4 and 5 (parameters and output options) are identical to the
equivalent options on the multiple alignment menu.

The same input routines that are used for general input are used here
i.e. sequences must be in NBRF/PIR, EMBL/SwissProt or FASTA format,
with gaps indicated by hyphens ("-").  This is why we have continually
drawn your attention to the NBRF/PIR format as a useful output format.

Either profile can consist of just one sequence.  Therefore, if you
have a favourite alignment of sequences that you are working on and
wish to add a new sequence, you can use this menu, provided the
alignment is in the correct format.

The total number of sequences in the two profiles must be less than or
equal to the MAXN parameter set in the clustalv.h header file.


      ******************************
      *   PHYLOGENETIC TREE MENU.  *
      ******************************

This menu allows you to input an alignment and calculate a
phylogenetic tree.  You can also calculate a tree if you have just
carried out a multiple alignment and the alignment is still in memory.
THE SEQUENCES MUST BE ALIGNED ALREADY!!!!!!
The tree will look strange if the sequences are not already aligned.
You can also "BOOTSTRAP" the tree to show confidence levels for
groupings.  This is SLOW on microcomputers but works fine on
workstations or mainframes.

 ******Phylogenetic*tree*Menu******

    1.  Input an alignment
    2.  Exclude positions with gaps?        = OFF
    3.  Correct for multiple substitutions? = OFF
    4.  Draw tree now
    5.  Bootstrap tree
    S.  Execute a system command
    H.  HELP
    or press [RETURN] to go back to main menu

Your choice:

The same input routine that is used for general input is used here
i.e. sequences must be in NBRF/PIR, EMBL/SwissProt or FASTA format,
with gaps indicated by hyphens ("-").  This is why we have continually
drawn your attention to the NBRF/PIR format as a useful output format.

If you have input an alignment, then just use item 4 to draw a tree.
The method used is the Neighbor Joining method of Saitou and Nei
(1987).  This is a "distance method".  First, percent divergence
figures are calculated between all pairs of sequences.  These
divergence figures are then used by the NJ method to give the tree.
Example trees will be shown below.

There are two options which can be used to control the way the
distances are calculated.  These are set by options 2 and 3 in the
menu.

EXCLUDE POSITIONS WITH GAPS?   This option allows you to ignore all
alignment positions (columns) where there is a gap in ANY sequence.
This guarantees that "like" is compared with "like" in all distances
i.e. the same positions are used to calculate all distances.  It also
means that the distances will be "metric".  The disadvantage of using
this option is that you throw away much of the data if there are many
gaps.  If the total number of gaps is small, it has little effect.

CORRECT FOR MULTIPLE SUBSTITUTIONS?    As sequences diverge,
substitutions accumulate.
It becomes increasingly likely that more than one substitution (as a
result of a mutation) will have happened at a site where you observe
just one difference now.  This option allows you to use formulae
developed by Motoo Kimura to correct for this effect.  It has the
effect of stretching long branches in trees while leaving short ones
relatively untouched.  The desired effect is to try and make distances
proportional to time since divergence.

The tree is sent to a file called BLAH.NJ, where BLAH.SEQ is the name
of the input alignment file.  An example is shown below for 6 globin
sequences.

 DIST   = percentage divergence (/100)
 Length = number of sites used in comparison

   1 vs.   2  DIST = 0.5683;  length =    139
   1 vs.   3  DIST = 0.5540;  length =    139
   1 vs.   4  DIST = 0.5315;  length =    111
   1 vs.   5  DIST = 0.7447;  length =    141
   1 vs.   6  DIST = 0.7571;  length =    140
   2 vs.   3  DIST = 0.0897;  length =    145
   2 vs.   4  DIST = 0.1391;  length =    115
   2 vs.   5  DIST = 0.7517;  length =    145
   2 vs.   6  DIST = 0.7431;  length =    144
   3 vs.   4  DIST = 0.0957;  length =    115
   3 vs.   5  DIST = 0.7379;  length =    145
   3 vs.   6  DIST = 0.7361;  length =    144
   4 vs.   5  DIST = 0.7304;  length =    115
   4 vs.   6  DIST = 0.7368;  length =    114
   5 vs.   6  DIST = 0.2697;  length =    152

                        Neighbor-joining Method

 Saitou, N. and Nei, M. (1987) The Neighbor-joining Method:
 A New Method for Reconstructing Phylogenetic Trees.
 Mol. Biol. Evol., 4(4), 406-425

 This is an UNROOTED tree

 Numbers in parentheses are branch lengths

 Cycle   1     =  SEQ:   5 (  0.13382) joins  SEQ:   6 (  0.13592)
 Cycle   2     =  SEQ:   1 (  0.28142) joins Node:   5 (  0.33462)
 Cycle   3     =  SEQ:   2 (  0.05879) joins  SEQ:   3 (  0.03086)
 Cycle   4 (Last cycle, trichotomy):

                 Node:   1 (  0.20798) joins
                 Node:   2 (  0.02341) joins
                  SEQ:   4 (  0.04915)

The output file first shows the percent divergence (distance) figures
between each pair of sequences.  Then a description of a NJ tree is
given.  This description shows which sequences (SEQ:) or which groups
of sequences (NODE: , a node is numbered using the lowest sequence
that belongs to it) join at each level of the tree.

This is an unrooted tree!!  This means that the direction of evolution
through the tree is not shown.  This can only be inferred