3) Progressive alignment

The basic procedure at this stage is to use a series of pairwise alignments to align larger and larger groups of sequences, following the branching order in the guide tree. You proceed from the tips of the rooted tree towards the root. In the globin example in figure 1 you align the sequences in the following order: human vs. horse beta globin; human vs. horse alpha globin; the 2 alpha globins vs. the 2 beta globins; the myoglobin vs. the haemoglobins; the cyanohaemoglobin vs. the haemoglobins plus myoglobin; the leghaemoglobin vs. all the rest. At each stage a full dynamic programming (26,27) algorithm is used with a residue weight matrix and penalties for opening and extending gaps. Each step consists of aligning two existing alignments or sequences. Gaps that are present in older alignments remain fixed. In the basic algorithm, new gaps that are introduced at each stage get full gap opening and extension penalties, even if they are introduced inside old gap positions (see the section on gap penalties below for modifications to this rule).

In order to calculate the score between a position from one sequence or alignment and one from another, the average of all the pairwise weight matrix scores from the amino acids in the two sets of sequences is used, i.e. if you align 2 alignments with 2 and 4 sequences respectively, the score at each position is the average of 8 (2x4) comparisons. This is illustrated in figure 2. If either set of sequences contains one or more gaps in one of the positions being considered, each gap versus a residue is scored as zero. The default amino acid weight matrices we use are rescored to have only positive values. This treatment therefore gives a residue aligned against a gap the worst possible score.
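The position score described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the CLUSTAL W source: the function name `column_score`, the `TOY_MATRIX` values and the two-letter alphabet are hypothetical, and scoring a gap against a gap as zero is an assumption (the text only specifies gap versus residue).

```python
from itertools import product

GAP = "-"

# Hypothetical all-positive toy matrix (not PAM or BLOSUM values).
TOY_MATRIX = {("A", "A"): 4, ("A", "G"): 2, ("G", "G"): 6}


def pair_score(a, b, matrix):
    """Look up a symmetric matrix entry regardless of residue order."""
    return matrix[tuple(sorted((a, b)))]


def column_score(col_a, col_b, matrix):
    """Average of all pairwise scores between two alignment columns.

    Any comparison involving a gap contributes zero, which is the worst
    possible score when every matrix entry is positive.
    """
    total = 0.0
    for a, b in product(col_a, col_b):
        if a == GAP or b == GAP:
            continue  # gap versus residue scores zero
        total += pair_score(a, b, matrix)
    return total / (len(col_a) * len(col_b))


# One column from a 2-sequence alignment against one from a 4-sequence
# alignment: 2 x 4 = 8 comparisons are averaged.
score = column_score(["A", "A"], ["A", "G", "G", "-"], TOY_MATRIX)
```

Note that the gapped comparisons still count in the divisor, so a column with many gaps scores lower than an ungapped one.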
When sequences are weighted (see improvements to progressive alignment, below), each weight matrix value is multiplied by the weights from the 2 sequences, as illustrated in figure 2.

Improvements to progressive alignment

All of the remaining modifications apply only to the final progressive alignment stage. Sequence weighting is relatively straightforward and is already widely used in profile searches (15,16). The treatment of gap penalties is more complicated. Initial gap penalties are calculated depending on the weight matrix, the similarity of the sequences, and the length of the sequences. Then, an attempt is made to derive sensible local gap opening penalties at every position in each pre-aligned group of sequences that will vary as new sequences are added. The use of different weight matrices as the alignment progresses is novel and largely by-passes the problem of initial choice of weight matrix. The final modification allows us to delay the addition of very divergent sequences until the end of the alignment process when all of the more closely related sequences have already been aligned.

Sequence weighting

Sequence weights are calculated directly from the guide tree. The weights are normalised such that the biggest one is set to 1.0 and the rest are all less than one. Groups of closely related sequences receive lowered weights because they contain much duplicated information. Highly divergent sequences without any close relatives receive high weights. These weights are used as simple multiplication factors for scoring positions from different sequences or prealigned groups of sequences. The method is illustrated in figure 2. In the globin example in figure 1, the two alpha globins get downweighted because they are almost duplicate sequences (as do the two beta globins); they receive a combined weight of only slightly more than if a single alpha globin was used.
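A minimal sketch of how such weights might be normalised and applied follows. The raw weights are taken as given, since their derivation from the guide tree is not detailed in this section. The names `normalise_weights` and `weighted_column_score` and the toy matrix are hypothetical, and dividing by the number of comparisons is assumed to carry over from the unweighted average.

```python
from itertools import product

GAP = "-"

# Hypothetical all-positive toy matrix (not PAM or BLOSUM values).
TOY_MATRIX = {("A", "A"): 4, ("A", "G"): 2, ("G", "G"): 6}


def normalise_weights(raw):
    """Scale weights so that the biggest one is 1.0."""
    biggest = max(raw.values())
    return {name: w / biggest for name, w in raw.items()}


def weighted_column_score(col_a, w_a, col_b, w_b, matrix):
    """Average pairwise score, each term multiplied by both sequence weights."""
    total = 0.0
    for (a, wa), (b, wb) in product(zip(col_a, w_a), zip(col_b, w_b)):
        if a == GAP or b == GAP:
            continue  # comparisons involving a gap score zero
        total += matrix[tuple(sorted((a, b)))] * wa * wb
    return total / (len(col_a) * len(col_b))


# Two near-duplicate sequences share a low raw weight; the divergent one
# ends up with weight 1.0 after normalisation (raw values are made up).
weights = normalise_weights({"dup1": 0.2, "dup2": 0.2, "divergent": 0.4})
score = weighted_column_score(["A"], [1.0], ["A", "G"], [0.5, 1.0], TOY_MATRIX)
```

With these toy numbers the down-weighted `A` match contributes 4 x 1.0 x 0.5 and the fully weighted `A`/`G` pair contributes 2 x 1.0 x 1.0, so near-duplicates no longer dominate the column score.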
Initial gap penalties

Initially, two gap penalties are used: a gap opening penalty (GOP) which gives the cost of opening a new gap of any length and a gap extension penalty (GEP) which gives the cost of every item in a gap. Initial values can be set by the user from a menu. The software then automatically attempts to choose appropriate gap penalties for each sequence alignment, depending on the following factors.

1) Dependence on the weight matrix

It has been shown (16,28) that varying the gap penalties used with different weight matrices can improve the accuracy of sequence alignments. Here, we use the average score for two mismatched residues (i.e. off-diagonal values in the matrix) as a scaling factor for the GOP.

2) Dependence on the similarity of the sequences

The percent identity of the two (groups of) sequences to be aligned is used to increase the GOP for closely related sequences and decrease it for more divergent sequences on a linear scale.

3) Dependence on the lengths of the sequences

The scores for both true and false sequence alignments grow with the length of the sequences. We use the logarithm of the length of the shorter sequence to increase the GOP with sequence length.

Using these three modifications, the initial GOP calculated by the program is:

    GOP -> (GOP + log(MIN(N,M))) * (average residue mismatch score) * (percent identity scaling factor)

where N, M are the lengths of the two sequences.

4) Dependence on the difference in the lengths of the sequences

The GEP is modified depending on the difference between the lengths of the two sequences to be aligned.
If one sequence is much shorter than the other, the GEP is increased to inhibit too many long gaps in the shorter sequence.

The initial GEP calculated by the program is:

    GEP -> GEP * (1.0 + |log(N/M)|)

where N, M are the lengths of the two sequences.

Position-specific gap penalties

In most dynamic programming applications, the initial gap opening and extension penalties are applied equally at every position in the sequence, regardless of the location of a gap, except for terminal gaps which are usually allowed at no cost. In CLUSTAL W, before any pair of sequences or prealigned groups of sequences are aligned, we generate a table of gap opening penalties for every position in the two (sets of) sequences. An example is shown in figure 3. We manipulate the initial gap opening penalty in a position specific manner, in order to make gaps more or less likely at different positions.

The local gap penalty modification rules are applied in a hierarchical manner. The exact details of each rule are given below. Firstly, if there is a gap at a position, the gap opening and gap extension penalties are lowered; the other rules do not apply. This makes gaps more likely at positions where there are already gaps. If there is no gap at a position, then the gap opening penalty is increased if the position is within 8 residues of an existing gap. This discourages gaps that are too close together. Finally, at any position within a run of hydrophilic residues, the penalty is decreased. These runs usually indicate loop regions in protein structures. If there is no run of hydrophilic residues, the penalty is modified using a table of residue specific gap propensities (12). These propensities were derived by counting the frequency of each residue at either end of gaps in alignments of proteins of known structure. An illustration of the application of these rules from one part of the globin example, in figure 1, is given in figure 3.
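The initial penalty formulas and the rule hierarchy just described can be sketched as below, using the multipliers from the numbered rules that follow. Several points are assumptions rather than statements about the program:

- the logarithm is taken to be natural;
- `identity_scale` stands in for the percent identity scaling factor, whose exact form is not given here;
- "reduced by one third" is read as multiplication by 2/3;
- the near-gap increase is assumed to combine with the hydrophilic or residue-specific adjustment;
- `residue_factor` is a placeholder for the averaged entry from table 1, which is not reproduced in this section.

All function names are hypothetical.

```python
import math


def initial_gop(gop, n, m, avg_mismatch, identity_scale):
    """GOP -> (GOP + log(MIN(N,M))) * mismatch score * identity scaling."""
    return (gop + math.log(min(n, m))) * avg_mismatch * identity_scale


def initial_gep(gep, n, m):
    """GEP -> GEP * (1.0 + |log(N/M)|)."""
    return gep * (1.0 + abs(math.log(n / m)))


def local_gop(gop, column, dist_to_gap=None, hydrophilic_run=False,
              residue_factor=1.0):
    """Position-specific GOP for one alignment column ('-' marks a gap)."""
    n_seqs = len(column)
    n_gapped = column.count("-")
    if n_gapped:
        # Rule 1: existing gap at this position; no other rule applies.
        return gop * 0.3 * (n_seqs - n_gapped) / n_seqs
    if dist_to_gap is not None and dist_to_gap <= 8:
        # Rule 2: within 8 residues of an existing gap.
        gop *= 2 + ((8 - dist_to_gap) * 2) / 8
    if hydrophilic_run:
        # Rule 3: hydrophilic stretch, GOP reduced by one third (assumed x 2/3).
        gop *= 2 / 3
    else:
        # Rule 4: residue-specific factor (placeholder for table 1).
        gop *= residue_factor
    return gop


equal_length_gep = initial_gep(1.0, 100, 100)   # no length difference
gapped_column_gop = local_gop(10.0, "A-AA")     # 3 of 4 sequences ungapped
near_gap_gop = local_gop(10.0, "AAAA", dist_to_gap=4)
```

When the two sequences have equal length, `log(N/M)` is zero and the GEP is left unchanged; the further the lengths diverge, the larger the extension penalty becomes.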
1) Lowered gap penalties at existing gaps

If there are already gaps at a position, then the GOP is reduced in proportion to the number of sequences with a gap at this position and the GEP is lowered by a half. The new gap opening penalty is calculated as:

    GOP -> GOP * 0.3 * (no. of sequences without a gap / no. of sequences)

2) Increased gap penalties near existing gaps

If a position does not have any gaps but is within 8 residues of an existing gap, the GOP is increased by:

    GOP -> GOP * (2 + ((8 - distance from gap) * 2) / 8)

3) Reduced gap penalties in hydrophilic stretches

Any run of 5 hydrophilic residues is considered to be a hydrophilic stretch. The residues that are to be considered hydrophilic may be set by the user but are conservatively set to D, E, G, K, N, Q, P, R or S by default. If, at any position, there are no gaps and any of the sequences has such a stretch, the GOP is reduced by one third.

4) Residue specific penalties

If there is no hydrophilic stretch and the position does not contain any gaps, then the GOP is multiplied by one of the 20 numbers in table 1, depending on the residue. If there is a mixture of residues at a position, the multiplication factor is the average of all the contributions from each sequence.

Weight matrices

Two main series of weight matrices are offered to the user: the Dayhoff PAM series (3) and the BLOSUM series (4). The default is the BLOSUM series. In each case, there is a choice of matrix ranging from strict ones, useful for comparing very closely related sequences, to very "soft" ones that are useful for comparing very distantly related sequences. Depending on the distance between the two sequences or groups of sequences to be compared, we switch between 4 different matrices. The distances are measured directly from the guide tree. The ranges of distances and the matrices used with the PAM series are: 80-100%: PAM20, 60-80%: PAM60, 40-60%: PAM120, 0-40%: PAM350.
The range used with the BLOSUM series is: 80-100%: BLOSUM80, 60-80%: BLOSUM62, 30-60%: BLOSUM45, 0-30%: BLOSUM30.

Divergent sequences

The most divergent sequences (most different, on average, from all of the other sequences) are usually the most difficult to align correctly. It is sometimes better to delay the incorporation of these sequences until all of the more easily aligned sequences are merged first. This may give a better chance of correctly placing the gaps and matching weakly conserved positions against the rest of the sequences. A choice is offered to set a cut off (default is 40% identity or less with any other sequence) that will delay the alignment of the divergent sequences until all of the rest have been aligned.

Software and Algorithms

Dynamic Programming

The most demanding part of the multiple alignment strategy, in terms of computer processing and memory usage, is the alignment of two (groups of) sequences at each step in the final progressive alignment. To make it possible to align very long sequences (e.g. dynein heavy chains at ~5,000 residues) in a reasonable amount of memory, we use the memory efficient dynamic programming algorithm of Myers and Miller (26). This sacrifices some processing time but makes very large alignments practical in very little memory. One disadvantage of this algorithm is that it does not allow different gap opening and extension penalties at each position. We have modified the algorithm so as to allow this and the details are described in a separate paper (27).

Menus/file formats

Six different sequence input formats are detected automatically and read by the program: EMBL/Swiss Prot, NBRF/PIR, Pearson/FASTA (29), GCG/MSF (30), GDE (Steven Smith, Harvard University Genome Center) and CLUSTAL format alignments. The last three formats allow users to read in complete alignments (e.g. for calculating phylogenetic trees or for addition of new sequences to an existing alignment).
Alignment output may be requested in standard CLUSTAL format (self-explanatory blocked alignments) or in formats compatible with the GDE, PHYLIP (31) or GCG (30) packages. The program offers the user the ability to calculate Neighbour-Joining phylogenetic trees from existing alignments with options to correct for multiple hits (32,33) and to estimate confidence levels using a bootstrap resampling procedure (34). The trees may be output in the "New Hampshire" format that is compatible with the PHYLIP package (31).

Alignment to an alignment

Profile alignment is used to align two existing alignments (either of which may consist of just one sequence) or to add a series of new sequences to an existing alignment. This is useful because one may wish to build up a multiple alignment gradually, choosing different parameters manually, or correcting intermediate errors as the alignment proceeds. Often, just a few sequences cause misalignments in the progressive algorithm and these can be removed from the process and then added at the end by profile alignment. A second use is where one has a high quality reference alignment and wishes to keep it fixed while adding new sequences automatically.

Portability/Availability

The full source code of the package is provided free to academic users. The program will run on any machine with a full ANSI conforming C compiler. It has been tested on the following hardware/software combinations: Decstation/Ultrix, Vax or ALPHA/VMS, Silicon Graphics/IRIX. The source code and documentation are available by E-mail from the EMBL file server (send the words HELP and HELP SOFTWARE on two lines to the internet address: Netserv@EMBL-Heidelberg.DE) or by anonymous FTP from FTP.EMBL-Heidelberg.DE.
Queries may be addressed by E-mail to Des.Higgins@EBI.AC.UK or Gibson@EMBL-Heidelberg.DE.

RESULTS AND DISCUSSION

Alignment of SH3 Domains

The ~60 residue SH3 domain was chosen to illustrate the performance of CLUSTAL W, as there is a reference manual alignment (23) and the fold is known (24). SH3 domains, with a minimum similarity below 12% identity, are poorly aligned by progressive alignment programs such as CLUSTAL V