                                   emma

Function

   Multiple alignment program - interface to ClustalW program

Description

   EMMA calculates the multiple alignment of nucleic acid or protein
   sequences according to the method of Thompson, J.D., Higgins, D.G. and
   Gibson, T.J. (1994).

   This is an interface to the ClustalW distribution.

  The basic alignment method

   The basic multiple alignment algorithm consists of three main stages:

   1) all pairs of sequences are aligned separately in order to calculate
      a distance matrix giving the divergence of each pair of sequences;
   2) a guide tree is calculated from the distance matrix;
   3) the sequences are progressively aligned according to the branching
      order in the guide tree.

   An example using 7 globin sequences of known tertiary structure (25)
   is given in figure 1.

    1) The distance matrix/pairwise alignments

   In the original CLUSTAL programs, the pairwise distances were
   calculated using a fast approximate method (22). This allows very
   large numbers of sequences to be aligned, even on a microcomputer. The
   scores are calculated as the number of k-tuple matches (runs of
   identical residues, typically 1 or 2 long for proteins or 2 to 4 long
   for nucleotide sequences) in the best alignment between two sequences
   minus a fixed penalty for every gap. We now offer a choice between
   this method and the slower but more accurate scores from full dynamic
   programming alignments using two gap penalties (for opening or
   extending gaps) and a full amino acid weight matrix. These scores are
   calculated as the number of identities in the best alignment divided
   by the number of residues compared (gap positions are excluded). Both
   of these scores are initially calculated as percent identity scores
   and are converted to distances by dividing by 100 and subtracting from
   1.0 to give number of differences per site. We do not correct for
   multiple substitutions in these initial distances.
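
   The conversion from percent identity to distance described above can
   be sketched in a few lines of Python. This is a minimal illustration,
   not the ClustalW code: the function names and the use of "-" as the
   gap character are assumptions made for the example.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over aligned positions where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == gap or res_b == gap:
            continue  # gap positions are excluded from the count
        compared += 1
        if res_a == res_b:
            identical += 1
    if compared == 0:
        return 0.0  # assumption: no comparable positions counts as 0% identity
    return 100.0 * identical / compared


def identity_to_distance(pct_identity):
    """Divide by 100 and subtract from 1.0: uncorrected differences per site."""
    return 1.0 - pct_identity / 100.0
```

   For instance, two aligned sequences with 4 identities over 5 compared
   positions have 80% identity and therefore a distance of 0.2.
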
   In figure 1 we give the 7x7 distance matrix between the 7 globin
   sequences calculated using the full dynamic programming method.

    2) The guide tree

   The trees used to guide the final multiple alignment process are
   calculated from the distance matrix of step 1 using the
   Neighbour-Joining method (21). This produces unrooted trees with
   branch lengths proportional to estimated divergence along each branch.
   The root is placed by a "mid-point" method (15) at a position where
   the means of the branch lengths on either side of the root are equal.

   These trees are also used to derive a weight for each sequence (15).
   The weights are dependent upon the distance from the root of the tree
   but sequences which have a common branch with other sequences share
   the weight derived from the shared branch. In the example in figure 1,
   the leghaemoglobin (Lgb2_Luplu) gets a weight of 0.442 which is equal
   to the length of the branch from the root to it. The Human beta globin
   (Hbb_Human) gets a weight consisting of the length of the branch
   leading to it that is not shared with any other sequences (0.081) plus
   half the length of the branch shared with the horse beta globin
   (0.226/2) plus one quarter the length of the branch shared by all four
   haemoglobins (0.061/4) plus one fifth the branch shared between the
   haemoglobins and the myoglobin (0.015/5) plus one sixth the branch
   leading to all the vertebrate globins (0.062/6). This sums to a total
   of 0.221. By contrast, in the normal progressive alignment algorithm,
   all sequences would be equally weighted. The rooted tree with branch
   lengths and sequence weights for the 7 globins is given in figure 1.

    3) Progressive alignment

   The basic procedure at this stage is to use a series of pairwise
   alignments to align larger and larger groups of sequences, following
   the branching order in the guide tree. You proceed from the tips of
   the rooted tree towards the root.
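
   The tips-to-root order can be illustrated with a post-order walk over
   a rooted binary tree. The sketch below is only an illustration: the
   nested-tuple tree encoding, the function name and the sequence labels
   are hypothetical, chosen to mirror the branching order of the globin
   example rather than taken from ClustalW itself.

```python
def merge_order(tree):
    """List the (left_group, right_group) alignments, tips first, root last.

    A tree is either a sequence name (str) or a pair (left, right) of subtrees.
    """
    steps = []

    def visit(node):
        if isinstance(node, str):
            return (node,)  # a tip is a group containing one sequence
        left, right = node
        left_members = visit(left)
        right_members = visit(right)
        # Both subtrees are fully aligned before they are aligned to each other.
        steps.append((left_members, right_members))
        return left_members + right_members

    visit(tree)
    return steps


# Hypothetical encoding of the globin guide tree topology.
globin_tree = (
    (
        ((("human_beta", "horse_beta"), ("human_alpha", "horse_alpha")),
         "myoglobin"),
        "cyanohaemoglobin",
    ),
    "leghaemoglobin",
)

for left_group, right_group in merge_order(globin_tree):
    print(" + ".join(left_group), "vs.", " + ".join(right_group))
```

   Each step aligns two groups that were themselves completed in earlier
   steps, so the last step always joins the two subtrees on either side
   of the root.
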
   In the globin example in figure 1 you align the sequences in the
   following order: human vs. horse beta globin; human vs. horse alpha
   globin; the 2 alpha globins vs. the 2 beta globins; the myoglobin vs.
   the haemoglobins; the cyanohaemoglobin vs. the haemoglobins plus
   myoglobin; the leghaemoglobin vs. all the rest. At each stage a full
   dynamic programming (26,27) algorithm is used with a residue weight
   matrix and penalties for opening and extending gaps. Each step
   consists of aligning two existing alignments or sequences. Gaps that
   are present in older alignments remain fixed. In the basic algorithm,
   new gaps that are introduced at each stage get full gap opening and
   extension penalties, even if they are introduced inside old gap
   positions (see the section on gap penalties below for modifications to
   this rule).

   In order to calculate the score between a position from one sequence
   or alignment and one from another, the average of all the pairwise
   weight matrix scores from the amino acids in the two sets of sequences
   is used, i.e. if you align 2 alignments with 2 and 4 sequences
   respectively, the score at each position is the average of 8 (2x4)
   comparisons. This is illustrated in figure 2. If either set of
   sequences contains one or more gaps in one of the positions being
   considered, each gap versus a residue is scored as zero. The default
   amino acid weight matrices we use are rescored to have only positive
   values. Therefore, this treatment of gaps treats the score of a
   residue versus a gap as having the worst possible score. When
   sequences are weighted (see improvements to progressive alignment,
   below), each weight matrix value is multiplied by the weights from the
   2 sequences, as illustrated in figure 2.

    Improvements to progressive alignment

   All of the remaining modifications apply only to the final progressive
   alignment stage.
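
   As a reminder of what the basic progressive alignment step computes
   before any of these modifications, the position score described above
   (the average of all pairwise weight matrix scores, with a gap versus
   a residue scored as zero and an optional multiplication by the two
   sequence weights) can be sketched as follows. The function names and
   the tiny three-entry matrix are made up for the example, and counting
   gap pairs in the denominator is one reading of the text, not a
   statement about the ClustalW source.

```python
def lookup(matrix, res_a, res_b):
    """Symmetric lookup in a dict keyed by residue pairs."""
    if (res_a, res_b) in matrix:
        return matrix[(res_a, res_b)]
    return matrix[(res_b, res_a)]


def column_score(column_a, column_b, matrix,
                 weights_a=None, weights_b=None, gap="-"):
    """Average pairwise score between one column from each of two alignments."""
    if weights_a is None:
        weights_a = [1.0] * len(column_a)
    if weights_b is None:
        weights_b = [1.0] * len(column_b)
    total = 0.0
    for res_a, w_a in zip(column_a, weights_a):
        for res_b, w_b in zip(column_b, weights_b):
            if res_a == gap or res_b == gap:
                continue  # a pair involving a gap contributes zero
            total += lookup(matrix, res_a, res_b) * w_a * w_b
    # Assumption: the average runs over all n x m comparisons, gaps included.
    return total / (len(column_a) * len(column_b))


# Toy matrix with positive values only (hypothetical scores).
TOY_MATRIX = {("A", "A"): 4, ("A", "G"): 1, ("G", "G"): 4}

# 2 sequences against 4 sequences: the average of 8 comparisons.
print(column_score(["A", "A"], ["A", "G", "-", "A"], TOY_MATRIX))
```
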
   Sequence weighting is relatively straightforward and is already widely
   used in profile searches (15,16). The treatment of gap penalties is
   more complicated. Initial gap penalties are calculated depending on
   the weight matrix, the similarity of the sequences, and the length of
   the sequences. Then, an attempt is made to derive sensible local gap
   opening penalties at every position in each pre-aligned group of
   sequences that will vary as new sequences are added. The use of
   different weight matrices as the alignment progresses is novel and
   largely by-passes the problem of initial choice of weight matrix. The
   final modification allows us to delay the addition of very divergent
   sequences until the end of the alignment process when all of the more
   closely related sequences have already been aligned.

  Sequence weighting

   Sequence weights are calculated directly from the guide tree. The
   weights are normalised such that the biggest one is set to 1.0 and the
   rest are all less than one. Groups of closely related sequences
   receive lowered weights because they contain much duplicated
   information. Highly divergent sequences without any close relatives
   receive high weights. These weights are used as simple multiplication
   factors for scoring positions from different sequences or prealigned
   groups of sequences. The method is illustrated in figure 2. In the
   globin example in figure 1, the two alpha globins get downweighted
   because they are almost duplicate sequences (as do the two beta
   globins); they receive a combined weight of only slightly more than if
   a single alpha globin was used.

  Initial gap penalties

   Initially, two gap penalties are used: a gap opening penalty (GOP)
   which gives the cost of opening a new gap of any length and a gap
   extension penalty (GEP) which gives the cost of every item in a gap.
   Initial values can be set by the user from a menu.
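
   Read literally, these two penalties give a gap a cost that grows
   linearly with its length. A minimal sketch, assuming the opening
   penalty is charged once and the extension penalty once per gap
   position (the function name and the example values are illustrative
   only):

```python
def gap_cost(length, gop, gep):
    """Cost of one gap: GOP once, plus GEP for every position in the gap."""
    if length < 1:
        raise ValueError("a gap must span at least one position")
    return gop + gep * length


# A gap of 3 positions with GOP = 10 and GEP = 0.5 (arbitrary values).
print(gap_cost(3, 10.0, 0.5))
```
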
   The software then automatically attempts to choose appropriate gap
   penalties for each sequence alignment, depending on the following
   factors.

    1) Dependence on the weight matrix

   It has been shown (16,28) that varying the gap penalties used with
   different weight matrices can improve the accuracy of sequence
   alignments. Here, we use the average score for two mismatched residues
   (i.e. off-diagonal values in the matrix) as a scaling factor for the
   GOP.

    2) Dependence on the similarity of the sequences

   The percent identity of the two (groups of) sequences to be aligned is
   used to increase the GOP for closely related sequences and decrease it
   for more divergent sequences on a linear scale.

    3) Dependence on the lengths of the sequences

   The scores for both true and false sequence alignments grow with the
   length of the sequences. We use the logarithm of the length of the
   shorter sequence to increase the GOP with sequence length.

   Using these three modifications, the initial GOP calculated by the
   program is:

      GOP -> (GOP + log(MIN(N,M))) * (average residue mismatch score)
             * (percent identity scaling factor)

   where N, M are the lengths of the two sequences.

    4) Dependence on the difference in the lengths of the sequences

   The GEP is modified depending on the difference between the lengths of
   the two sequences to be aligned. If one sequence is much shorter than
   the other, the GEP is increased to inhibit too many long gaps in the
   shorter sequence. The initial GEP calculated by the program is:

      GEP -> GEP * (1.0 + |log(N/M)|)

   where N, M are the lengths of the two sequences.

  Position-specific gap penalties

   In most dynamic programming applications, the initial gap opening and
   extension penalties are applied equally at every position in the
   sequence, regardless of the location of a gap, except for terminal
   gaps which are usually allowed at no cost.
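
   For reference, the two initial penalty formulas given above for the
   GOP and the GEP can be written out as code. This is a sketch under
   stated assumptions: the text does not give the base of the logarithm
   (the natural logarithm is assumed here), and it does not say how the
   average mismatch score or the percent identity scaling factor are
   computed, so both are passed in as ready-made numbers. All function
   and parameter names are hypothetical.

```python
import math


def initial_gop(gop, n, m, avg_mismatch_score, identity_scale):
    """GOP -> (GOP + log(MIN(N,M))) * mismatch score * identity scaling factor."""
    return (gop + math.log(min(n, m))) * avg_mismatch_score * identity_scale


def initial_gep(gep, n, m):
    """GEP -> GEP * (1.0 + |log(N/M)|); unchanged when both lengths are equal."""
    return gep * (1.0 + abs(math.log(n / m)))


# Equal lengths leave the GEP untouched; unequal lengths raise it,
# and the increase does not depend on which sequence is the shorter one.
print(initial_gep(0.1, 200, 200))
print(initial_gep(0.1, 100, 400), initial_gep(0.1, 400, 100))
```
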
   In CLUSTAL W, before any pair of sequences or prealigned groups of
   sequences are aligned, we generate a table of gap opening penalties
   for every position in the two (sets of) sequences. An example is shown
   in figure 3. We manipulate the initial gap opening penalty in a
   position specific manner, in order to make gaps more or less likely at
   different positions.

   The local gap penalty modification rules are applied in a hierarchical
   manner. The exact details of each rule are given below. Firstly, if
   there is a gap at a position, the gap opening and gap extension
   penalties are lowered; the other rules do not apply. This makes gaps
   more likely at positions where there are already gaps. If there is no
   gap at a position, then the gap opening penalty is increased if the
   position is within 8 residues of an existing gap. This discourages
   gaps that are too close together. Finally, at any position within a
   run of hydrophilic residues, the penalty is decreased. These runs
   usually indicate loop regions in protein structures. If there is no
   run of hydrophilic residues, the penalty is modified using a table of
   residue specific gap propensities (12). These propensities were
   derived by counting the frequency of each residue at either end of
   gaps in alignments of proteins of known structure. An illustration of
   the application of these rules from one part of the globin example, in
   figure 1, is given in figure 3.

    1) Lowered gap penalties at existing gaps

   If there are already gaps at a position, then the GOP is reduced in
   proportion to the number of sequences with a gap at this position and
   the GEP is lowered by a half. The new gap opening penalty is
   calculated as:

      GOP -> GOP*0.3*(no. of sequences without a gap/no. of sequences).
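
   As a small worked illustration of this rule, the sketch below applies
   it to a single alignment column. The function name, the column
   representation (one character per sequence) and the "-" gap character
   are assumptions for the example; positions without any gap are
   returned unchanged here because the other rules would handle them.

```python
def penalties_at_existing_gap(gop, gep, column, gap="-"):
    """Apply the 'existing gap' rule to one alignment column.

    Returns (new_gop, new_gep). Columns with no gap are left unchanged.
    """
    n_sequences = len(column)
    n_without_gap = sum(1 for residue in column if residue != gap)
    if n_without_gap == n_sequences:
        return gop, gep  # no gap here: this rule does not apply
    new_gop = gop * 0.3 * (n_without_gap / n_sequences)
    new_gep = gep * 0.5  # the GEP is lowered by a half
    return new_gop, new_gep


# Four sequences, two of which already have a gap at this position.
print(penalties_at_existing_gap(10.0, 0.2, "A-A-"))
```

   With half of the sequences gapped, the opening penalty drops to 15%
   of its initial value (0.3 x 2/4), so a new gap is much cheaper at this
   position than at an ungapped one.
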
    2) Increased gap penalties near existing gaps

   If a position does not have any gaps but is within 8 residues of an
   existing gap, the GOP is increased by:

      GOP -> GOP*(2+((8-distance from gap)*2)/8)

    3) Reduced gap penalties in hydrophilic stretches

   Any run of 5 hydrophilic residues is considered to be a hydrophilic
   stretch. The residues that are to be considered hydrophilic may be set
   by the user but are conservatively set to D, E, G, K, N, Q, P, R or S
   by default. If, at any position, there are no gaps and any of the
   sequences has such a stretch, the GOP is reduced by one third.

    4) Residue specific penalties

   If there is no hydrophilic stretch and the position does not contain
   any gaps, then the GOP is multiplied by one of the 20 numbers in table
   1, depending on the residue. If there is a mixture of residues at a
   position, the multiplication factor is the average of all the
   contributions from each sequence.

  Weight matrices

   Two main series of weight matrices are offered to the user: the
   Dayhoff PAM series (3) and the BLOSUM series (4). The default is the
   BLOSUM series. In each case, there is a choice of matrix ranging from
   strict ones, useful for comparing very closely related sequences, to
   very "soft" ones that are useful for comparing very distantly related
   sequences. Depending on the distance between the two sequences or
   groups of sequences to be compared, we switch between 4 different
   matrices. The distances are measured directly from the guide tree.

   The ranges of distances and tables used with the PAM series of
   matrices are:

      80-100%: PAM20
      60-80%:  PAM60
      40-60%:  PAM120
      0-40%:   PAM350

   The ranges used with the BLOSUM series are:

      80-100%: BLOSUM80
      60-80%:  BLOSUM62
      30-60%:  BLOSUM45
      0-30%:   BLOSUM30

  Divergent sequences

   The most divergent sequences (most different, on average from all of
   the other sequences) are usually the most difficult to align
   correctly.
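
   The parenthetical definition of "most divergent" can be made concrete
   by ranking sequences by their mean distance to all the others. The
   sketch below only illustrates that definition; the function name, the
   list-of-lists distance matrix and the toy values are invented for the
   example and are not taken from ClustalW.

```python
def rank_by_divergence(names, distances):
    """Sort names from most to least divergent (mean distance to all others)."""
    n = len(names)
    if n < 2:
        raise ValueError("at least two sequences are needed")
    mean_distance = {}
    for i, name in enumerate(names):
        others = [distances[i][j] for j in range(n) if j != i]
        mean_distance[name] = sum(others) / len(others)
    return sorted(names, key=lambda name: mean_distance[name], reverse=True)


# Toy symmetric distance matrix: seq_c is far from both other sequences.
toy_names = ["seq_a", "seq_b", "seq_c"]
toy_distances = [
    [0.0, 0.1, 0.8],
    [0.1, 0.0, 0.7],
    [0.8, 0.7, 0.0],
]
print(rank_by_divergence(toy_names, toy_distances))
```
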
   It is sometimes better to delay the incorporation of these sequences
   until all of the more easily aligned sequences are merged first. This
   may give a better chance of correctly placing the gaps and matching
   weakly conserved positions against the rest of the sequences. A choice
   is offered to set a cut off (default is 40% identity or less with any
   other sequence) that will delay the alignment of the divergent
   sequences until all of the rest have been aligned.

Software and Algorithms

  Dynamic Programming

   The most demanding part of the multiple alignment strategy, in terms
   of computer processing and memory usage, is the alignment of two
   (groups of) sequences at each step in the final progressive alignment.
   To make it possible to align very long sequences (e.g. dynein heavy
   chains at ~5,000 residues) in a reasonable amount of memory, we use
   the memory efficient dynamic programming algorithm of Myers and Miller
   (26). This sacrifices some processing time but makes very large
   alignments practical in very little memory. One disadvantage of this
   algorithm is that it does not allow different gap opening and
   extension penalties at each position. We have modified the algorithm
   so as to allow this and the details are described in a separate paper
   (27).

  Alignment to an alignment

   Profile alignment is used to align two existing alignments (either of
   which may consist of just one sequence) or to add a series of new
   sequences to an existing alignment. This is useful because one may
   wish to build up a multiple alignment gradually, choosing different
   parameters manually, or correcting intermediate errors as the
   alignment proceeds. Often, just a few sequences cause misalignments in
   the progressive algorithm and these can be removed from the process
   and then added at the end by profile alignment. A second use is where