and PILEUP: neither program can generate the correct blocks corresponding to the secondary structure elements. Figure 4 shows an alignment of the example set of SH3 domains generated by CLUSTAL W. The alignment was generated in two steps. After progressive alignment, five blocks were produced, corresponding to structural elements, with gaps inserted exclusively in the known loop regions. The beta strands in blocks 1, 4 and 5 were all correctly superposed. However, four sequences in block 2 and one sequence in block 3 were misaligned by 1-2 residues (underlined in figure 4). A second progressive alignment of the aligned sequences, including the gaps, improved this alignment: a single misaligned sequence, H_P55, remains in block 2 (boxed in figure 4), while block 3 is now completely aligned. This alignment corrects several errors (e.g. P85A, P85B and FUS1) in the manual alignment (23).

The SH3 alignment illustrates several features of CLUSTAL W usage. Firstly, in a practical application involving divergent sequences, the initial progressive alignment is likely to be a good but not perfect approximation to the correct alignment. The alignment quality can be improved in a number of ways. If the block structure of the alignment appears to be correct, realigning the alignment will usually improve most of the misaligned blocks: the existing gaps allow the blocks to "float" cheaply to a locally optimal position without disturbing the rest of the alignment. Remaining sequences which are doubtfully aligned can then be individually tested by profile alignment to the remainder: the misaligned H_P55 SH3 domain can be correctly aligned by profile (with GOP <= 8). The indel regions in the final alignment can then be manually cleaned up: usually the exact alignment in the loop regions is not determinable, and may have no meaning in structural terms. It is then desirable to have a single gap per structural loop.
CLUSTAL W achieved this for two of the four SH3 loop regions (figure 4).

If the block structure of the alignment appears suspect, greater intervention by the user may be required. The most divergent sequences, especially if they have large insertions (which can be discerned with the aid of dot matrix plots), should be left out of the progressive alignment. If there are sets of closely related sequences that are deeply diverged from other sets, these can be separately aligned and then merged by profile alignment. Incorrectly determined sequences containing frameshifts can also confound regions of an alignment. These can be hard to detect, but sometimes they will have been grouped with the excluded divergent sequences; they may then be revealed, when individually compared to the alignment, as having apparently nonsense segments with respect to the other sequences.

Finding the best alignment

In cases where all of the sequences in a data set are very similar (e.g. no pair less than 35% identical), CLUSTAL W will find an alignment which is difficult to improve by eye. In this sense, the alignment is optimal with regard to the alternative of manual alignment. Mathematically, this is vague and can only be put on a more systematic footing by finding an objective function (a measure of multiple alignment quality) that exactly mirrors the information used by an "expert" to evaluate an alignment. Nonetheless, if an alignment is impossible to improve by eye, then the program has achieved a very useful result. In more difficult cases, as more divergent sequences are included, it becomes increasingly difficult to find good alignments and to evaluate them. What we find with CLUSTAL W is that the basic block-like structure of the alignment (corresponding to the major secondary structure elements) is usually recovered, with some of the most divergent sequences misaligned in small regions.
This is a very useful starting point for manual refinement as it helps define the major blocks of similarity. The problem sequences can be removed from the analysis and realigned to the rest of the sequences automatically or with different parameter settings. An examination of the tree used to guide the alignment will usually show which sequences will be most unreliably placed (those that branch off closest to the root and/or those that align to other single sequences at a very low level of sequence identity rather than align to a group of pre-aligned sequences). Finally, one can simply iterate the multiple alignment process by feeding an output alignment back into CLUSTAL W and repeating the multiple alignment process (using the same or different parameters). The SH3 domain alignment in figure 4 was derived in this way by 2 passes using default parameters. In the second pass, the local gap penalties are dominated by the placement of the initial major gap positions. The alignment will either remain unchanged or will converge rapidly (after 1 or 2 extra passes) on a better solution. If the placement of the initial gaps is approximately correct but some of the sequences are locally misaligned, this works well.

Comparison with other methods

Recently, several papers have addressed the problem of position specific parameters for multiple alignment. In one case (35), local gap penalties are increased in alpha helical and beta strand regions, when the 3-D structures of one or more of the sequences are known. In a second case (36), a hidden Markov model was used to estimate position specific gap penalties and residue substitution weight matrices when large numbers of examples of a protein domain were known. With CLUSTAL W, we attempt to derive the same information purely from the set of sequences to be aligned. Therefore, we can apply the method to any set of sequences.
The success of this approach will depend on the number of available sequences and their evolutionary relationships. It will also depend on the decision making process during multiple alignment (e.g. when to change weight matrix) and the accuracy and appropriateness of our parameterisation. In the long term, this can only be evaluated by exhaustive testing of sets of sequences where the correct alignment (or parts of it) is known from structural information. What is clear, however, is that the modifications described here significantly improve the sensitivity of the progressive multiple alignment approach. This is achieved with almost no sacrifice in speed and efficiency.

There are several areas where further improvements in sensitivity and accuracy can be made. Firstly, the residue weight matrices and gap settings can be made more accurate as more and more data accumulate, while matrices for specific sequence types can be derived (e.g. for transmembrane regions (37)). Secondly, stochastic or iterative optimisation methods can be used to refine initial alignments (7,9,10). CLUSTAL W could be run with several sets of starting parameters and, in each case, the alignments refined according to an objective function. The search for a good objective function that takes into account the sequence and position specific information used in CLUSTAL W is a key area of research. Finally, the average number of examples of each protein domain or family is growing steadily. It is important not only that programs can cope with the large volumes of data that are being generated, but also that they can exploit the new information to make the alignments more and more accurate. Globally optimal alignments (according to an objective function) may not always be possible but the problem may be avoided if sufficiently large volumes of data become available.
CLUSTAL W is a step in this direction.

ACKNOWLEDGEMENTS

Numerous people have offered advice and suggestions for improvements to earlier versions of the CLUSTAL programs. D.H. wishes to apologise to all of the irate CLUSTAL V users who had to live with the bugs and lack of facilities for getting trees in the New Hampshire format. We wish to specifically thank Jeroen Coppieters who suggested using a series of weight matrices and Steven Henikoff for advice on using the BLOSUM matrices. We are grateful to Rein Aasland, Peer Bork, Ariel Blocker and Bertrand Seraphin for providing challenging alignment problems. T.G. and J.T. thank Kevin Leonard for support and encouragement. Finally, we thank all of the people who were involved with various CLUSTAL programs over the years, namely: Paul Sharp, Rainer Fuchs and Alan Bleasby.

REFERENCES

1. Feng, D.-F. and Doolittle, R.F. (1987). J. Mol. Evol. 25, 351-360.
2. Needleman, S.B. and Wunsch, C.D. (1970). J. Mol. Biol. 48, 443-453.
3. Dayhoff, M.O., Schwartz, R.M. and Orcutt, B.C. (1978). In Atlas of Protein Sequence and Structure, vol. 5, suppl. 3 (Dayhoff, M.O., ed.), pp. 345-352, NBRF, Washington.
4. Henikoff, S. and Henikoff, J.G. (1992). Proc. Natl. Acad. Sci. USA 89, 10915-10919.
5. Lipman, D.J., Altschul, S.F. and Kececioglu, J.D. (1989). Proc. Natl. Acad. Sci. USA 86, 4412-4415.
6. Barton, G.J. and Sternberg, M.J.E. (1987). J. Mol. Biol. 198, 327-337.
7. Gotoh, O. (1993). CABIOS 9, 361-370.
8. Altschul, S.F. (1989). J. Theor. Biol. 138, 297-309.
9. Lukashin, A.V., Engelbrecht, J. and Brunak, S. (1992). Nucleic Acids Res. 20, 2511-2516.
10. Lawrence, C.E., Altschul, S.F., Boguski, M.S., Liu, J.S., Neuwald, A.F. and Wootton, J.C. (1993). Science 262, 208-214.
11. Vingron, M. and Waterman, M.S. (1993). J. Mol. Biol. 234, 1-12.
12. Pascarella, S. and Argos, P. (1992). J. Mol. Biol. 224, 461-471.
13. Collins, J.F. and Coulson, A.F.W. (1987). In Nucleic acid and protein sequence analysis: a practical approach, Bishop, M.J. and Rawlings, C.J., eds., chapter 13, pp. 323-358.
14. Vingron, M. and Sibbald, P.R. (1993). Proc. Natl. Acad. Sci. USA 90, 8777-8781.
15. Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994). CABIOS 10, 19-29.
16. Lüthy, R., Xenarios, I. and Bucher, P. (1994). Protein Science 3, 139-146.
17. Higgins, D.G. and Sharp, P.M. (1988). Gene 73, 237-244.
18. Higgins, D.G. and Sharp, P.M. (1989). CABIOS 5, 151-153.
19. Higgins, D.G., Bleasby, A.J. and Fuchs, R. (1992). CABIOS 8, 189-191.
20. Sneath, P.H.A. and Sokal, R.R. (1973). Numerical Taxonomy, W.H. Freeman, San Francisco.
21. Saitou, N. and Nei, M. (1987). Mol. Biol. Evol. 4, 406-425.
22. Wilbur, W.J. and Lipman, D.J. (1983). Proc. Natl. Acad. Sci. USA 80, 726-730.
23. Musacchio, A., Gibson, T., Lehto, V.-P. and Saraste, M. (1992). FEBS Lett. 307, 55-61.
24. Musacchio, A., Noble, M., Pauptit, R., Wierenga, R. and Saraste, M. (1992). Nature 359, 851-855.
25. Bashford, D., Chothia, C. and Lesk, A.M. (1987). J. Mol. Biol. 196, 199-216.
26. Myers, E.W. and Miller, W. (1988). CABIOS 4, 11-17.
27. Thompson, J.D. (1994). CABIOS (submitted).
28. Smith, T.F., Waterman, M.S. and Fitch, W.M. (1981). J. Mol. Evol. 18, 38-46.
29. Pearson, W.R. and Lipman, D.J. (1988). Proc. Natl. Acad. Sci. USA 85, 2444-2448.
30. Devereux, J., Haeberli, P. and Smithies, O. (1984). Nucleic Acids Res. 12, 387-395.
31. Felsenstein, J. (1989). Cladistics 5, 164-166.
32. Kimura, M. (1980). J. Mol. Evol. 16, 111-120.
33. Kimura, M. (1983). The Neutral Theory of Molecular Evolution. Cambridge University Press, Cambridge.
34. Felsenstein, J. (1985). Evolution 39, 783-791.
35. Smith, R.F. and Smith, T.F. (1992). Protein Engineering 5, 35-41.
36. Krogh, A., Brown, M., Mian, S., Sjölander, K. and Haussler, D. (1994). J. Mol. Biol. 235, 1501-1531.
37. Jones, D.T., Taylor, W.R. and Thornton, J.M. (1994). FEBS Lett. 339, 269-275.
38. Bairoch, A. and Boeckmann, B. (1992). Nucleic Acids Res. 20, 2019-2022.
39. Noble, M.E.M., Musacchio, A., Saraste, M., Courtneidge, S.A. and Wierenga, R.K. (1993). EMBO J. 12, 2617-2624.
40. Kabsch, W. and Sander, C. (1983). Biopolymers 22, 2577-2637.

FIGURE LEGENDS

Figure 1. The basic progressive alignment procedure, illustrated using a set of 7 globins of known tertiary structure. The sequence names are from Swiss Prot (38): Hba_Horse: horse alpha globin; Hba_Human: human alpha globin; Hbb_Horse: horse beta globin; Hbb_Human: human beta globin; Myg_Phyca: sperm whale myoglobin; Glb5_Petma: lamprey cyanohaemoglobin; Lgb2_Luplu: lupin leghaemoglobin. In the distance matrix, the mean number of differences per residue is given. The unrooted tree shows all branch lengths drawn to scale. In the rooted tree, all branch lengths (mean number of differences per residue along each branch) are given as well as weights for each sequence. In the multiple alignment, the approximate positions of the 7 alpha helices common to all 7 proteins are shown. This alignment was derived using CLUSTAL W with default parameters and the PAM (3) series of weight matrices.

Figure 2. The scoring scheme for comparing two positions from two alignments. Two sections of alignment with 4 and 2 sequences respectively are shown. The score of the position with amino acids T,L,K,K versus the position with amino acids V and I is given with and without sequence weights. M(X,Y) is the weight matrix entry for amino acid X versus amino acid Y. Wn is the weight for sequence n.

Figure 3. The variation in local gap opening penalty is plotted for a section of alignment. The initial gap opening penalty is indicated by a dotted line. Two hydrophilic stretches are underlined. The lowest penalties correspond to the ends of the alignment, the hydrophilic stretches and the two positions with gaps. The highest values are within 8 residues of the two gap positions. The rest of the variation is caused by the residue specific gap penalties (12).

Figure 4. CLUSTAL W alignment of a set of SH3 domains taken from (23). Secondary structure assignments for the solved Spectrin (24) and Fyn (39) domains are according to DSSP (40). The alignment was generated in two steps using default parameters. After full multiple alignment, the aligned sequences were realigned. Segments which were correctly aligned in the second pass are underlined. The single misaligned segment in H_P55 and the misaligned residue in H_NCK/2 are boxed. The sequences are coloured to illustrate significant features. All G (orange) and P (yellow) are coloured. Other residues matching a frequent occurrence of a property in a column are coloured: hydrophobic = blue; hydrophobic tendency = light blue; basic = red; acidic = purple; hydrophilic = green; white = unconserved. The alignment figure was prepared with the GDE sequence editor (S. Smith, Harvard University) and COLORMASK (J. Thompson, EMBL).

Table 1. Pascarella and Argos residue specific gap modification factors.

----------------------------
A   1.13        M   1.29
C   1.13        N   0.63
D   0.96        P   0.74
E   1.31        Q   1.07
F   1.20        R   0.72
G   0.61        S   0.76
H   1.00        T   0.89
I   1.32        V   1.25
K   0.96        Y   1.00
L   1.21        W   1.23
----------------------------

The values are normalised around a mean value of 1.0 for H. The lower the value, the greater the chance of having an adjacent gap. These are derived from the original table of relative frequencies of gaps adjacent to each residue (12) by subtraction from 2.0.
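As an illustration of how the Table 1 factors might be applied, the following Python sketch scales a base gap opening penalty by the mean factor of the residues in one alignment column. The function names, the averaging over a column and the treatment of gap characters are assumptions made for this example and are not taken from the CLUSTAL W source code; only the factor values and the "subtraction from 2.0" rule come from Table 1.

```python
# Residue specific gap modification factors, copied from Table 1.
GAP_FACTORS = {
    "A": 1.13, "C": 1.13, "D": 0.96, "E": 1.31, "F": 1.20,
    "G": 0.61, "H": 1.00, "I": 1.32, "K": 0.96, "L": 1.21,
    "M": 1.29, "N": 0.63, "P": 0.74, "Q": 1.07, "R": 0.72,
    "S": 0.76, "T": 0.89, "V": 1.25, "Y": 1.00, "W": 1.23,
}


def factor_from_frequency(relative_gap_frequency):
    """Table 1 rule: factor = 2.0 - relative frequency of adjacent gaps."""
    return 2.0 - relative_gap_frequency


def column_gap_factor(column):
    """Mean Table 1 factor over the residues in one alignment column.

    Assumption for this sketch: characters that are not one of the 20
    amino acids (e.g. '-' for a gap) are ignored, and a column with no
    residues leaves the penalty unchanged (factor 1.0).
    """
    residues = [r for r in column.upper() if r in GAP_FACTORS]
    if not residues:
        return 1.0
    return sum(GAP_FACTORS[r] for r in residues) / len(residues)


def local_gap_opening_penalty(base_gop, column):
    """Scale a base gap opening penalty by the column's mean factor."""
    return base_gop * column_gap_factor(column)
```

With these assumptions, a column consisting only of glycine lowers the penalty to 0.61 of its base value, making a gap more likely there, whereas a column of isoleucine raises it to 1.32 of the base value.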