clustalw.doc

The New Hampshire format is only useful if you have software to display or
manipulate the trees.  The PHYLIP package is highly recommended if you intend
to do much work with trees and includes programs for doing this.  If you do
not have such software, request the trees in the older clustal format
and see the documentation for Clustal V (clustalv.doc).  WE DO NOT PROVIDE
ANY DIRECT MEANS FOR VIEWING TREES GRAPHICALLY.

-------------------------------------------------------------------------

4) THE ALIGNMENT ALGORITHMS

The basic algorithm is the same as for Clustal V and is described in some
detail in clustalv.doc.  The new modifications are described in detail in
clustalw.ms.  Here we just list some notes to help answer some of the most
obvious questions.

Terminal Gaps

In the original Clustal V program, terminal gaps were penalised the same
as all other gaps.  This caused some ugly side effects e.g.

acgtacgtacgtacgt                              acgtacgtacgtacgt
a----cgtacgtacgt  gets the same score as      ----acgtacgtacgt

NOW, terminal gaps are free.  This is better on average and stops silly
effects like single residues jumping to the edge of the alignment.  However,
it is not perfect.  It does mean that if there should be a gap near the end
of the alignment, the program may be reluctant to insert it i.e.

cccccgggccccc                                              cccccgggccccc
ccccc---ccccc  may be considered worse (lower score) than  cccccccccc---

In the right hand case above, the terminal gap is free, so that alignment
may score higher than the left hand alignment.  This can be prevented by
lowering the gap opening and extension penalties.  It is difficult to get
this right all the time.  Please watch the ends of your alignments.

Speed of the initial (pairwise) alignments (fast approximate/slow accurate)

By default, the initial pairwise alignments are now carried out using a full
dynamic programming algorithm.
This is more accurate than the older hash/k-tuple based alignments (Wilbur
and Lipman) but is MUCH slower.  On a fast workstation you may not notice
but on a slow box, the difference is extreme.  You can easily switch the
alignment method from the menus to the older, faster method.

Delaying alignment of distant sequences

The user can set a cut off to delay the alignment of the most divergent
sequences in a data set until all other sequences have been aligned.  By
default, this is set to 40% which means that if a sequence is less than 40%
identical to any other sequence, its alignment will be delayed.

Iterative realignment/Reset gaps between alignments

By default, if you align a set of sequences a second time (e.g. with changed
gap penalties), the gaps from the first alignment are discarded.  You can
set this from the menus so that older gaps will be kept between alignments.
Keeping the gaps (not resetting them) and doing the full multiple alignment
a second time can sometimes give better alignments.  Sometimes, the
alignment will converge on a better solution; sometimes the new alignment
will be the same as the first.  There can be a strange side effect: you can
get columns of nothing but gaps introduced.

Any gaps that are read in from the input file are always kept, regardless
of the setting of this switch.  If you read in a full multiple alignment,
the "resetgaps" switch has no effect.  The old gaps will remain and if you
carry out a multiple alignment, any new gaps will be added in.  If you wish
to carry out a full new alignment of a set of sequences that are already
aligned in a file, you must input the sequences without gaps.

Profile alignment

By profile alignment, we simply mean the alignment of old
alignments/sequences.  In this context, a profile is just an existing
alignment (or even a set of unaligned sequences; see below).  This allows
you to read in an old alignment (in any of the allowed input formats) and
align one or more new sequences to it.
From the profile alignment menu, you are allowed to read in 2 profiles.
Either profile can be a full alignment OR a single sequence.  In the
simplest mode, you simply align the two profiles to each other.  This is
useful if you want to gradually build up a full multiple alignment.

A second option is to align the sequences from the second profile, one at
a time, to the first profile.  This is done taking the underlying tree
between the sequences into account.  This is useful if you have a set of
new sequences (not aligned) and you wish to add them all to an older
alignment.

----------------------------------------------------------------------------

5) CHANGES TO THE PHYLOGENETIC TREE CALCULATIONS AND SOME HINTS.

IMPROVED DISTANCE CALCULATIONS FOR PROTEIN TREES

The phylogenetic trees in Clustal W (the real trees that you calculate
AFTER alignment; not the guide trees used to decide the branching order
for multiple alignment) use the Neighbor-Joining method of Saitou and
Nei based on a matrix of "distances" between all sequences.  These distances
can be corrected for "multiple hits".  This is normal practice when accurate
trees are needed.  This correction stretches distances (especially large
ones) to try to correct for the fact that OBSERVED distances (mean number of
differences per site) greatly underestimate the actual number of changes
that happened during evolution.

In Clustal V we used a simple formula to convert an observed distance to one
that is corrected for multiple hits.  The observed distance is the mean
number of differences per site in an alignment (ignoring sites with a gap)
and is therefore always between 0.0 (for identical sequences) and 1.0 (no
residues the same at any site).  These distances can be multiplied by 100 to
give percent difference values.  100 minus percent difference gives percent
identity.

The formula we use to correct for multiple hits is from Motoo Kimura
(Kimura, M.
The Neutral Theory of Molecular Evolution, Camb. Univ. Press, 1983,
page 75) and is:

K = -ln(1 - D - (D*D)/5)  where D is the observed distance and K is
                          the corrected distance.

This formula gives the mean number of estimated substitutions per site and,
in contrast to D (the observed number), can be greater than 1 i.e. more than
one substitution per site, on average.  For example, if you observe 0.8
differences per site (80% difference; 20% identity), then the above formula
predicts that there have been about 2.6 substitutions per site over the
course of evolution since the 2 sequences diverged.  This can also be
expressed in PAM units by multiplying by 100 (mean number of substitutions
per 100 residues).

The PAM scale of evolution and its derivation/calculation comes from the
work of Margaret Dayhoff and co-workers (the famous Dayhoff PAM series
of weight matrices also came from this work).  Dayhoff et al. constructed
an elaborate model of protein evolution based on observed frequencies
of substitution between very closely related proteins.  Using this model,
they derived a table relating observed distances to predicted PAM distances.
Kimura's formula, above, is just a "curve fitting" approximation to this
table.  It is very accurate in the range 0.75 > D > 0.0 but becomes
increasingly inaccurate at high D (>0.75) and fails completely at around
D = 0.85.

To circumvent this problem, we calculated all the values for K corresponding
to D above 0.75 directly using the Dayhoff model and store these in an
internal table, used by Clustal W.  This table is declared in the file
dayhoff.h and gives values of K for all D between 0.75 and 0.93 in intervals
of 0.001 i.e. for D = 0.750, 0.751, 0.752 ...... 0.929, 0.930.  For any
observed D higher than 0.930, we arbitrarily set K to 10.0.
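
The distance correction described above can be sketched in a few lines of
Python.  This is only an illustration of the rules stated in the text, not
the Clustal W source: the function names are made up for the example, and
dayhoff_lookup is a hypothetical stand-in for the table in dayhoff.h, whose
values are not reproduced here.

```python
import math

KIMURA_LIMIT = 0.75  # at or above this D the text switches to the Dayhoff table
TABLE_LIMIT = 0.93   # above this D the text sets K to 10.0
MAX_K = 10.0


def observed_distance(seq_a, seq_b, gap="-"):
    """Mean number of differences per site, ignoring sites with a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]
    if not pairs:
        raise ValueError("no ungapped sites to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


def kimura_k(d):
    """Kimura's curve fit: K = -ln(1 - D - D*D/5)."""
    return -math.log(1.0 - d - d * d / 5.0)


def corrected_distance(d, dayhoff_lookup=None):
    """Choose the correction for an observed distance d, as the text describes.

    dayhoff_lookup is a hypothetical callable standing in for the table
    declared in dayhoff.h; it is an assumption of this sketch.
    """
    if not 0.0 <= d <= 1.0:
        raise ValueError("observed distance must lie between 0.0 and 1.0")
    if d > TABLE_LIMIT:
        return MAX_K
    if d >= KIMURA_LIMIT:
        if dayhoff_lookup is None:
            raise ValueError("a Dayhoff table lookup is needed for D >= 0.75")
        return dayhoff_lookup(d)
    return kimura_k(d)


d = observed_distance("ACDE-FGH", "ACDQWFGY")  # 2 differences over 7 sites
print(d, corrected_distance(d))
print(kimura_k(0.8))  # about 2.63
```

The argument of the logarithm reaches zero near D = 0.854, which is why the
formula on its own cannot be used at very high observed distances.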
Capping K at 10.0 sounds drastic, but with real sequences, distances of 0.93
(less than 7% identity) are rare.  If your data set includes sequences with
this degree of divergence, you will have great difficulty getting accurate
trees by ANY method; the alignment itself will be very difficult (to
construct and to evaluate).

There are some important things to note.  Firstly, this formula works well
if your sequences are of average amino acid composition and if the amino
acids substitute according to the original Dayhoff model.  In other cases,
it may be misleading.  Secondly, it is based only on observed percent
distance i.e. it does not DIRECTLY take conservative substitutions into
account.  Thirdly, the error on the estimated PAM distances may be VERY
great for high distances; at very high distance (e.g. over 85%) it may give
largely arbitrary corrected distances.  In most cases, however, the
correction is still worth using; the trees will be more accurate and the
branch lengths will be more realistic.

A far more sophisticated distance correction based on a full Dayhoff
model, which DOES take conservative substitutions and actual amino acid
composition into account, may be found in the PROTDIST program of the
PHYLIP package.  For serious tree makers, this program is highly recommended.

TWO NOTES ON BOOTSTRAPPING...

When you use the BOOTSTRAP in Clustal W to estimate the reliability of parts
of a tree, many of the uncorrected distances may randomly exceed the
arbitrary cutoff of 0.93 (sequences only 7% identical) if the sequences are
distantly related.  This will happen randomly i.e. even if none of the pairs
of sequences are less than 7% identical, the bootstrap samples may contain
pairs of sequences that do exceed this cut off.  If this happens, you will
be warned.  In practice, this can happen with many data sets.
It is not a serious problem if it happens rarely.  If it does happen (you
are warned when it happens and told how often the problem occurs), you
should consider removing the most distantly related sequences and/or using
the PHYLIP package instead.

A further problem arises in almost exactly the opposite situation: when
you bootstrap a data set which contains 3 or more sequences that are
identical or almost identical.  Here, the sets of identical sequences should
be shown as a multifurcation (several sequences joining at the same part of
the tree).  Because the Neighbor-Joining method only gives strictly
dichotomous trees (never more than 2 sequences join at one time), this
cannot be exactly represented.  In practice, this is NOT a problem as there
will be some internal branches of zero length separating the sequences.  If
you display the tree with all branch lengths, you will still see a
multifurcation.  However, when you bootstrap the tree, only the branching
orders are stored and counted.  In the case of multifurcations, the exact
branching order is arbitrary but the program will always get the same
branching order, depending only on the input order of the sequences.  In
practice, this is only a problem in situations where you have a set of
sequences where all of them are VERY similar.  In this case, you can find
very high support for some groupings which will disappear if you run the
analysis with a different input order.  Again, the PHYLIP package deals with
this by offering a JUMBLE option to shuffle the input order of your
sequences between each bootstrap sample.

----------------------------------------------------------------------------

6) SUMMARY OF THE COMMAND LINE USAGE

Clustal W is designed to be run interactively.  However, there are many
situations where it is convenient to run it from the command line,
especially if you wish to run it from another piece of software (e.g. SeqApp
or GDE).  All parameters can be set from the command line by giving options
after the clustalw command.
On UNIX, options should be preceded by '-'; all other systems use the '/'
character.

If anything is put on the command line, the program will (attempt to) carry
out whatever is requested and will exit.  If you wish to use the command
line to set some parameters and then go into interactive mode, use the
command line switch: interactive .... e.g.

clustalw -quicktree -interactive    on UNIX
or
clustalw /quicktree /interactive    on VMS,MAC and PC

will set the default initial alignment mode to fast/approximate and will
then go to the main menu.

To see a list of all the command line parameters, type:

clustalw -options           on UNIX
or
clustalw /options           on VMS,MAC and PC

and you will see a list with no explanation.

To get (VERY BRIEF) help on command line usage, use the /HELP or /CHECK
(-help or -check on UNIX systems) options.  Otherwise, the command line
usage is self explanatory or is explained in clustalv.doc.  The defaults
for all parameters are set in the file param.h which can be changed easily
(remember to recompile the program afterwards :-).

------------------------------------------------------------------------------
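
If you drive Clustal W from another program, as section 6 suggests, the
platform-dependent option prefix can be handled in one place.  The following
Python sketch is only an illustration: build_command is a hypothetical
helper, and it assumes the executable is called clustalw.  Only options
named in this document are used.

```python
def build_command(options, unix=True, program="clustalw"):
    """Return an argument list using '-' on UNIX and '/' elsewhere."""
    prefix = "-" if unix else "/"
    # Strip any prefix the caller already supplied, then add the right one.
    return [program] + [prefix + opt.lstrip("-/") for opt in options]


cmd = build_command(["quicktree", "interactive"])
print(" ".join(cmd))  # clustalw -quicktree -interactive

# To actually run it (assumes a clustalw executable is on the PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing unix=False gives the '/quicktree /interactive' form used on the
other systems.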