This is just an ASCII text version of the manuscript describing Clustal W, without the figures.  It was published:

Nucleic Acids Research, 22(22):4673-4680.

CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties and weight matrix choice.

Julie D. Thompson, Desmond G. Higgins1 and Toby J. Gibson*

European Molecular Biology Laboratory
Postfach 102209
Meyerhofstrasse 1
D-69012 Heidelberg
Germany

Phone:    +49-6221-387398
Fax:      +49-6221-387306
E-mail:   Gibson@EMBL-Heidelberg.DE
          Des.Higgins@EBI.AC.UK
          Thompson@EMBL-Heidelberg.DE

Keywords: Multiple alignment, phylogenetic tree, weight matrix, gap penalty, dynamic programming, sequence weighting.

1 Current address: European Bioinformatics Institute
Hinxton Hall
Hinxton
Cambridge CB10 1RQ
UK.

* To whom correspondence should be addressed

ABSTRACT

The sensitivity of the commonly used progressive multiple sequence alignment method has been greatly improved for the alignment of divergent protein sequences.  Firstly, individual weights are assigned to each sequence in a partial alignment in order to downweight near-duplicate sequences and upweight the most divergent ones.  Secondly, amino acid substitution matrices are varied at different alignment stages according to the divergence of the sequences to be aligned.  Thirdly, residue specific gap penalties and locally reduced gap penalties in hydrophilic regions encourage new gaps in potential loop regions rather than regular secondary structure.  Fourthly, positions in early alignments where gaps have been opened receive locally reduced gap penalties to encourage the opening up of new gaps at these positions.  These modifications are incorporated into a new program, CLUSTAL W which is freely available.

INTRODUCTION

The simultaneous alignment of many nucleotide or amino acid sequences is now an essential tool in molecular biology.
Multiple alignments are used to find diagnostic patterns to characterise protein families; to detect or demonstrate homology between new sequences and existing families of sequences; to help predict the secondary and tertiary structures of new sequences; to suggest oligonucleotide primers for PCR; and as an essential prelude to molecular evolutionary analysis.  The rate of appearance of new sequence data is steadily increasing, and the development of efficient and accurate automatic methods for multiple alignment is therefore of major importance.  The majority of automatic multiple alignments are now carried out using the "progressive" approach of Feng and Doolittle (1).  In this paper, we describe a number of improvements to the progressive multiple alignment method which greatly improve the sensitivity without sacrificing any of the speed and efficiency which make this approach so practical.  The new methods are made available in a program called CLUSTAL W, which is freely available and portable to a wide variety of computers and operating systems.

In order to align just two sequences, it is standard practice to use dynamic programming (2).  This guarantees a mathematically optimal alignment, given a table of scores for matches and mismatches between all amino acids or nucleotides (e.g. the PAM250 matrix (3) or BLOSUM62 matrix (4)) and penalties for insertions or deletions of different lengths.  Attempts at generalising dynamic programming to multiple alignments are limited to small numbers of short sequences (5).  For much more than eight or so proteins of average length, the problem is computationally intractable given current computer power.  Therefore, all of the methods capable of handling larger problems in practical timescales make use of heuristics.  Currently, the most widely used approach is to exploit the fact that homologous sequences are evolutionarily related.
One can build up a multiple alignment progressively by a series of pairwise alignments, following the branching order in a phylogenetic tree (1).  One first aligns the most closely related sequences, gradually adding in the more distant ones.  This approach is sufficiently fast to allow alignments of virtually any size.  Further, in simple cases, the quality of the alignments is excellent, as judged by the ability to correctly align corresponding domains from sequences of known secondary or tertiary structure (6).  In more difficult cases, the alignments give good starting points for further automatic or manual refinement.

This approach works well when the data set consists of sequences of different degrees of divergence.  Pairwise alignment of very closely related sequences can be carried out very accurately.  The correct answer may often be obtained using a wide range of parameter values (gap penalties and weight matrix).  By the time the most distantly related sequences are aligned, one already has a sample of aligned sequences which gives important information about the variability at each position.  The positions of the gaps that were introduced during the early alignments of the closely related sequences are not changed as new sequences are added.  This is justified because the placement of gaps in alignments between closely related sequences is much more accurate than between distantly related ones.  When all of the sequences are highly divergent (e.g. less than approximately 25-30% identity between any pair of sequences), this progressive approach becomes much less reliable.

There are two major problems with the progressive approach:  the local minimum problem and the choice of alignment parameters.  The local minimum problem stems from the "greedy" nature of the alignment strategy.  The algorithm greedily adds sequences together, following the initial tree.
There is no guarantee that the global optimal solution, as defined by some overall measure of multiple alignment quality (7,8), or anything close to it, will be found.  More specifically, any mistakes (misaligned regions) made early in the alignment process cannot be corrected later as new information from other sequences is added.  This problem is frequently thought of as mainly resulting from an incorrect branching order in the initial tree.  The initial trees are derived from a matrix of distances between separately aligned pairs of sequences and are much less reliable than trees from complete multiple alignments.  In our experience, however, the real problem is caused simply by errors in the initial alignments.  Even if the topology of the guide tree is correct, each alignment step in the multiple alignment process may have some percentage of the residues misaligned.  This percentage will be very low on average for very closely related sequences but will increase as sequences diverge.  It is these misalignments, carried through from the early alignment steps, that cause the local minimum problem.  The only way to correct this is to use an iterative or stochastic sampling procedure (e.g. 7,9,10).  We do not directly address this problem in this paper.

The alignment parameter choice problem is, in our view, at least as serious as the local minimum problem.  Stochastic or iterative algorithms will be just as badly affected as progressive ones if the parameters are inappropriate: they will arrive at a false global minimum.  Traditionally, one chooses one weight matrix and two gap penalties (one for opening a new gap and one for extending an existing gap) and hopes that these will work well over all parts of all the sequences in the data set.  When the sequences are all closely related, this works.  The first reason is that virtually all residue weight matrices give most weight to identities.
When identities dominate an alignment, almost any weight matrix will find approximately the correct solution.  With very divergent sequences, however, the scores given to non-identical residues will become critically important; there will be more mismatches than identities.  Different weight matrices will be optimal at different evolutionary distances or for different classes of proteins.  The second reason is that the range of gap penalty values that will find the correct or best possible solution can be very broad for highly similar sequences (11).  As more and more divergent sequences are used, however, the exact values of the gap penalties become important for success.  In each case, there may be a very narrow range of values which will deliver the best alignment.  Further, in protein alignments, gaps do not occur randomly (i.e. with equal probability at all positions).  They occur far more often between the major secondary structural elements of alpha helices and beta strands than within (12).

The major improvements described in this paper attempt to address the alignment parameter choice problem.  We dynamically vary the gap penalties in a position and residue specific manner.  The observed relative frequencies of gaps adjacent to each of the 20 amino acids (12) are used to locally adjust the gap opening penalty after each residue.  Short stretches of hydrophilic residues (e.g. 5 or more) usually indicate loop or random coil regions and the gap opening penalties are locally reduced in these stretches.  In addition, the locations of the gaps found in the early alignments are given reduced gap opening penalties.  It has been observed in alignments between sequences of known structure that gaps tend not to be closer than roughly eight residues on average (12).  We therefore increase the gap opening penalty within eight residues of existing gaps.
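The three position rules described above (reduced penalties at existing gaps, increased penalties within eight residues of an existing gap, reduced penalties in hydrophilic stretches) can be sketched as follows.  This is only an illustrative sketch: the function name, the base penalty of 10.0, the multipliers 0.5 and 2.0, the hydrophilic residue set and the order of precedence among the rules are all assumptions made for the example and are not values given in this section.  The residue specific adjustment is left out because the observed gap frequencies (12) are not listed here.

```python
# Assumed set of hydrophilic residues; this section does not list them.
HYDROPHILIC = set("DEGKNQPRS")


def position_gap_open_penalties(residues, gap_positions, base_gop=10.0,
                                reduce_factor=0.5, increase_factor=2.0,
                                min_run=5, window=8):
    """Return one gap opening penalty per position (hypothetical scaling).

    residues      -- string of residues, one per alignment position
    gap_positions -- set of positions where earlier alignments opened gaps
    """
    n = len(residues)

    # Mark every position that lies in a run of >= min_run hydrophilic residues.
    in_hydrophilic_run = [False] * n
    i = 0
    while i < n:
        j = i
        while j < n and residues[j] in HYDROPHILIC:
            j += 1
        if j - i >= min_run:
            for k in range(i, j):
                in_hydrophilic_run[k] = True
        i = max(j, i + 1)

    penalties = []
    for pos in range(n):
        if pos in gap_positions:
            # Existing gap: make it cheaper to open a gap here again.
            penalties.append(base_gop * reduce_factor)
        elif any(abs(pos - g) <= window for g in gap_positions):
            # Close to an existing gap: discourage a second gap nearby.
            penalties.append(base_gop * increase_factor)
        elif in_hydrophilic_run[pos]:
            # Likely loop or coil region: encourage gaps.
            penalties.append(base_gop * reduce_factor)
        else:
            penalties.append(base_gop)
    return penalties


# Toy input: a 5-residue hydrophilic run at positions 4-8, an old gap at 25.
toy_residues = "VVVV" + "KDEKS" + "V" * 21
toy_penalties = position_gap_open_penalties(toy_residues, {25})
```

In this toy case positions 4 to 8 and position 25 get the reduced penalty, positions 17 to 29 other than 25 get the increased penalty, and all remaining positions keep the base value.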
The two main series of amino acid weight matrices that are used today are the PAM series (3) and the BLOSUM series (4).  In each case, there is a range of matrices to choose from.  Some matrices are appropriate for aligning very closely related sequences where most weight by far is given to identities, with only the most frequent conservative substitutions receiving high scores.  Other matrices work better at greater evolutionary distances where less importance is attached to identities (13).  We choose different weight matrices, as the alignment proceeds, depending on the estimated divergence of the sequences to be aligned at each stage.

Sequences are weighted to correct for unequal sampling across all evolutionary distances in the data set (14).  This downweights sequences that are very similar to other sequences in the data set and upweights the most divergent ones.  The weights are calculated directly from the branch lengths in the initial guide tree (15).  Sequence weighting has already been shown to be effective in improving the sensitivity of profile searches (15,16).

In the original CLUSTAL programs (17-19), the initial guide trees, used to guide the multiple alignment, were calculated using the UPGMA method (20).  We now use the Neighbour-Joining method (21) which is more robust against the effects of unequal evolutionary rates in different lineages and which gives better estimates of individual branch lengths.  This is useful because it is these branch lengths which are used to derive the sequence weights.  We also allow users to choose between fast approximate alignments (22) and full dynamic programming for the distance calculations used to make the guide tree.

Together, these changes dramatically improve the sensitivity of the progressive alignment method for difficult alignments involving highly diverged sequences.
We show one very demanding test case of over 60 SH3 domains (23) which includes sequence pairs with as little as 12% identity and where there is only one exactly conserved residue across all of the sequences.  Using default parameters, we can achieve an alignment that is almost exactly correct, according to available structural information (24).  Using the program in a wide variety of situations, we find that it will normally find the correct alignment, in all but the most difficult and pathological of cases.

MATERIAL AND METHODS

The basic alignment method

The basic multiple alignment algorithm consists of three main stages: 1) all pairs of sequences are aligned separately in order to calculate a distance matrix giving the divergence of each pair of sequences; 2) a guide tree is calculated from the distance matrix; 3) the sequences are progressively aligned according to the branching order in the guide tree.  An example using 7 globin sequences of known tertiary structure (25) is given in figure 1.

1) The distance matrix/pairwise alignments

In the original CLUSTAL programs, the pairwise distances were calculated using a fast approximate method (22).  This allows very large numbers of sequences to be aligned, even on a microcomputer.  The scores are calculated as the number of k-tuple matches (runs of identical residues, typically 1 or 2 long for proteins or 2 to 4 long for nucleotide sequences) in the best alignment between two sequences minus a fixed penalty for every gap.  We now offer a choice between this method and the slower but more accurate scores from full dynamic programming alignments using two gap penalties (for opening or extending gaps) and a full amino acid weight matrix.  These scores are calculated as the number of identities in the best alignment divided by the number of residues compared (gap positions are excluded).
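As a rough illustration of the identity score just described, the sketch below counts identical residues in a finished pairwise alignment and divides by the number of positions compared, skipping any position with a gap in either sequence.  The function name, the "-" gap symbol and the toy sequences are assumptions made for the example; the alignment itself is taken as given rather than computed.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over the positions where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    compared = 0
    identical = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # gap positions are excluded from the comparison
        compared += 1
        if x == y:
            identical += 1

    if compared == 0:
        return 0.0  # nothing to compare; treated as no identity (assumption)
    return 100.0 * identical / compared


# Toy alignment: 6 positions are compared and 5 of them are identical.
toy_identity = percent_identity("HEAGAWGHE-E", "--P-AW-HEAE")
print(f"{toy_identity:.1f}% identity")  # prints "83.3% identity"
```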
Both of these scores are initially calculated as percent identity scores and are converted to distances by dividing by 100 and subtracting from 1.0 to give number of differences per site.  We do not correct for multiple substitutions in these initial distances.  In figure 1 we give the 7x7 distance matrix between the 7 globin sequences calculated using the full dynamic programming method.

2) The guide tree

The trees used to guide the final multiple alignment process are calculated from the distance matrix of step 1 using the Neighbour-Joining method (21).  This produces unrooted trees with branch lengths proportional to estimated divergence along each branch.  The root is placed by a "mid-point" method (15) at a position where the means of the branch lengths on either side of the root are equal.  These trees are also used to derive a weight for each sequence (15).  The weights are dependent upon the distance from the root of the tree but sequences which have a common branch with other sequences share the weight derived from the shared branch.  In the example in figure 1, the leghaemoglobin (Lgb2_Luplu) gets a weight of 0.442 which is equal to the length of the branch from the root to it.  The Human beta globin (Hbb_Human) gets a weight consisting of the length of the branch leading to it that is not shared with any other sequences (0.081) plus half the length of the branch shared with the horse beta globin (0.226/2) plus one quarter the length of the branch shared by all four haemoglobins (0.061/4) plus one fifth the branch shared between the haemoglobins and the myoglobin (0.015/5) plus one sixth the branch leading to all the vertebrate globins (0.062/6).  This sums to a total of 0.221.  By contrast, in the normal progressive alignment algorithm, all sequences would be equally weighted.  The rooted tree with branch lengths and sequence weights for the 7 globins is given in figure 1.
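The weight calculation for the two globins mentioned above can be reproduced with a few lines of arithmetic.  The sketch below sums, along the path from the root to a sequence, each branch length divided by the number of sequences sharing that branch.  The function and variable names are assumptions made for the example, and the last Hbb_Human term is taken as one sixth of 0.062, as the wording describes.  With the rounded branch lengths quoted in the text the Hbb_Human sum comes to about 0.223, slightly above the stated 0.221; the small difference is presumably due to rounding of the printed branch lengths, although that is an assumption.

```python
def sequence_weight(path):
    """Weight of one sequence from its root-to-leaf path in the guide tree.

    path -- list of (branch_length, number_of_sequences_sharing_the_branch)
    """
    return sum(length / shared for length, shared in path)


# Branch lengths as quoted in the text (rounded values).
hbb_human_path = [
    (0.081, 1),  # branch leading to Hbb_Human alone
    (0.226, 2),  # shared with the horse beta globin
    (0.061, 4),  # shared by all four haemoglobins
    (0.015, 5),  # shared by the haemoglobins and the myoglobin
    (0.062, 6),  # leading to all the vertebrate globins
]
lgb2_luplu_path = [(0.442, 1)]  # single branch from the root

hbb_weight = sequence_weight(hbb_human_path)
lgb2_weight = sequence_weight(lgb2_luplu_path)

print(f"Hbb_Human  {hbb_weight:.3f}")   # prints "Hbb_Human  0.223"
print(f"Lgb2_Luplu {lgb2_weight:.3f}")  # prints "Lgb2_Luplu 0.442"
```

Because each shared branch is divided among the sequences below it, a sequence with many close relatives, such as Hbb_Human, ends up with a much smaller weight than an isolated one such as Lgb2_Luplu.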