weight should be near to zero; for closely related sequences it can be useful
to assign a higher score. The default is set to 0.5.
</P>
<P>
</P>
<P>
The PROTEIN WEIGHT MATRIX option allows you to choose a series of weight
matrices. For protein alignments, you use a weight matrix to determine the
similarity of non-identical amino acids. For example, Tyr aligned with Phe is
usually judged to be 'better' than Tyr aligned with Pro.
</P>
<P>
There are three 'in-built' series of weight matrices offered. Each consists of
several matrices which work differently at different evolutionary distances. To
see the exact details, read the documentation. Crudely, we store several
matrices in memory, spanning the full range of amino acid distance (from almost
identical sequences to highly divergent ones). For very similar sequences, it
is best to use a strict weight matrix which only gives a high score to
identities and the most favoured conservative substitutions. For more divergent
sequences, it is appropriate to use "softer" matrices which give a high score
to many other frequent substitutions.
</P>
<P>
1) BLOSUM (Henikoff). These matrices appear to be the best available for
carrying out database similarity (homology) searches. The matrices currently
used are: Blosum 80, 62, 45 and 30. BLOSUM was the default in earlier Clustal X
versions.
</P>
<P>
2) PAM (Dayhoff). These have been extremely widely used since the late '70s. We
currently use the PAM 20, 60, 120, 350 matrices.
</P>
<P>
3) GONNET. These matrices were derived using almost the same procedure as the
Dayhoff one (above) but are much more up to date and are based on a far larger
data set. They appear to be more sensitive than the Dayhoff series. We
currently use the GONNET 80, 120, 160, 250 and 350 matrices. This series is the
default for Clustal X version 1.8.
</P>
<P>
We also supply an identity matrix which gives a score of 10 to two identical 
amino acids and a score of zero otherwise. This matrix is not very useful.
</P>
<P>
Load protein matrix: allows you to read in a comparison matrix from a file.
This can be either a single matrix or a series of matrices (see below for
format). 
</P>
<P>
</P>
<P>
The DNA WEIGHT MATRIX option allows you to select a single matrix (not a
series) used for aligning nucleic acid sequences. Two hard-coded matrices are
available:
</P>
<P>
1) IUB. This is the default scoring matrix used by BESTFIT for the comparison
of nucleic acid sequences. X's and N's are treated as matches to any IUB
ambiguity symbol. All matches score 1.9; all mismatches for IUB symbols score 0.
</P>
<P>
2) CLUSTALW(1.6). A previous system used by ClustalW, in which matches score
1.0 and mismatches score 0. All matches for IUB symbols also score 0.
</P>
<P>
Load DNA matrix: allows you to read in a nucleic acid comparison matrix from a
file (just one matrix, not a series).
</P>
<P>
</P>
<P>
SINGLE MATRIX INPUT FORMAT
The format used for a single matrix is the same as that used by the BLAST
program. The scores in the new weight matrix should be similarities. You can
use negative as well as positive values if you wish, although the matrix will
be automatically adjusted to all positive scores, unless the NEGATIVE MATRIX
option is selected. Any lines beginning with a # character are assumed to be
comments. The first non-comment line should contain a list of amino acids in
any order, using the one-letter code, followed by a * character. This should be
followed by a square matrix of scores, with one row and one column for each
amino acid. The last row and column of the matrix (corresponding to the *
character) contain the minimum score over the whole matrix.
</P>
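<P>
As a rough illustration of the single matrix format described above, the
following Python sketch reads such a file into a lookup table. It is not part
of Clustal X: the function name parse_blast_matrix and the toy three-residue
matrix are invented for this example, and the tolerance for a leading residue
letter on each row is an assumption about how some BLAST matrix files are laid
out.
</P>

```python
def parse_blast_matrix(text):
    """Parse a BLAST-style similarity matrix into {(row, col): score}."""
    # Drop blank lines and '#' comment lines, then split on whitespace.
    rows = [line.split() for line in text.splitlines()
            if line.strip() and not line.lstrip().startswith("#")]
    symbols, body = rows[0], rows[1:]
    if len(body) != len(symbols):
        raise ValueError("expected one row per symbol in the header")
    scores = {}
    for row_symbol, row in zip(symbols, body):
        # Assumption: tolerate files that repeat the residue letter at the
        # start of each row, as some distributed BLAST matrices do.
        values = row[1:] if len(row) == len(symbols) + 1 else row
        if len(values) != len(symbols):
            raise ValueError("expected one column per symbol in the header")
        for col_symbol, value in zip(symbols, values):
            scores[row_symbol, col_symbol] = float(value)
    return symbols, scores


# Toy matrix; the scores are invented purely for illustration.
EXAMPLE = """\
# comment lines start with '#'
A  R  N  *
 4 -1 -2 -2
-1  5  0 -2
-2  0  6 -2
-2 -2 -2 -2
"""

symbols, scores = parse_blast_matrix(EXAMPLE)
print(symbols)
print(scores["A", "R"], scores["N", "N"], scores["*", "*"])
```

<P>
In the toy matrix the * row and column hold the smallest score found anywhere
in the matrix, matching the rule stated above.
</P>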
<P>
MATRIX SERIES INPUT FORMAT
ClustalX uses different matrices depending on the mean percent identity of the
sequences to be aligned. You can specify a series of matrices and the range of
the percent identity for each matrix in a matrix series file. The file is
automatically recognised by the word CLUSTAL_SERIES at the beginning of the
file. Each matrix in the series is then specified on one line which should
start with the word MATRIX. This is followed by the lower and upper limits of
the sequence percent identities for which you want to apply the matrix. The
final entry on the matrix line is the filename of a Blast format matrix file
(see above for details of the single matrix file format).
</P>
<P>
Example.
</P>
<P>
<PRE>
CLUSTAL_SERIES

MATRIX 81 100 /us1/user/julie/matrices/blosum80
MATRIX 61 80 /us1/user/julie/matrices/blosum62
MATRIX 31 60 /us1/user/julie/matrices/blosum45
MATRIX 0 30 /us1/user/julie/matrices/blosum30
</PRE>
</P>
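<P>
The series file format can be sketched in a few lines of Python. This is only
an illustration of the format as described here, not the Clustal X reader: the
names parse_matrix_series and select_matrix are hypothetical, and rounding the
mean percent identity to a whole number before looking it up is an assumption
made so that values such as 80.4 fall into one of the integer ranges.
</P>

```python
def parse_matrix_series(text):
    """Return a list of (low, high, filename) tuples from a series file."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or lines[0] != "CLUSTAL_SERIES":
        raise ValueError("file does not start with CLUSTAL_SERIES")
    series = []
    for line in lines[1:]:
        keyword, low, high, filename = line.split(None, 3)
        if keyword != "MATRIX":
            raise ValueError("unexpected line: " + line)
        series.append((int(low), int(high), filename))
    return series


def select_matrix(series, percent_identity):
    """Pick the matrix file whose identity range covers the given value."""
    pid = round(percent_identity)  # assumption: compare whole percentages
    for low, high, filename in series:
        if low <= pid <= high:
            return filename
    return None  # no range covers this identity


SERIES_FILE = """\
CLUSTAL_SERIES

MATRIX 81 100 /us1/user/julie/matrices/blosum80
MATRIX 61 80 /us1/user/julie/matrices/blosum62
MATRIX 31 60 /us1/user/julie/matrices/blosum45
MATRIX 0 30 /us1/user/julie/matrices/blosum30
"""

series = parse_matrix_series(SERIES_FILE)
print(select_matrix(series, 72.0))
print(select_matrix(series, 25))
```

<P>
With the example file, a mean identity of 72% selects the blosum62 file and 25%
selects the blosum30 file.
</P>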
<P>
</P>
<P>
<STRONG>
PROTEIN GAP PARAMETERS
</STRONG>
</P>
<P>
RESIDUE SPECIFIC PENALTIES are amino acid specific gap penalties that reduce or
increase the gap opening penalties at each position in the alignment or 
sequence. See the documentation for details. As an example, positions that are
rich in glycine are more likely to have an adjacent gap than positions that are
rich in valine.
</P>
<P>
HYDROPHILIC GAP PENALTIES are used to increase the chances of a gap within a
run (5 or more residues) of hydrophilic amino acids; these are likely to be
loop or random coil regions where gaps are more common. The residues that are
"considered" to be hydrophilic can be entered in HYDROPHILIC RESIDUES.
</P>
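<P>
To make the idea of a hydrophilic run concrete, the Python sketch below locates
stretches of 5 or more consecutive hydrophilic residues in a sequence. It only
finds the runs; how far the gap penalty is lowered inside them is not covered
here. The function name hydrophilic_runs is hypothetical, and the residue set
GPSNDQEKR is an assumed default that should be checked against the HYDROPHILIC
RESIDUES setting.
</P>

```python
def hydrophilic_runs(sequence, hydrophilic="GPSNDQEKR", min_length=5):
    """Return (start, end) index pairs, end exclusive, of hydrophilic runs."""
    runs = []
    start = None
    # The trailing sentinel character closes a run that reaches the end.
    for i, residue in enumerate(sequence + "\0"):
        if residue in hydrophilic:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_length:
                runs.append((start, i))
            start = None
    return runs


print(hydrophilic_runs("MVLAGSNDKEPLLVIA"))  # one run: G S N D K E P
print(hydrophilic_runs("MVLAGSNDLLVIA"))     # run of 4 is too short
```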
<P>
GAP SEPARATION DISTANCE tries to decrease the chances of gaps being too close
to each other. Gaps that are less than this distance apart are penalised more
than other gaps. This does not prevent close gaps; it makes them less frequent,
promoting a block-like appearance of the alignment.
</P>
<P>
END GAP SEPARATION treats end gaps just like internal gaps for the purposes of
avoiding gaps that are too close (set by GAP SEPARATION DISTANCE above). If you
turn this off, end gaps will be ignored for this purpose. This is useful when
you wish to align fragments where the end gaps are not biologically meaningful.
</P>
<P>
</P>
<P>
</P>
<A HREF="#INDEX"> <EM>Back to Index</EM> </A>
<CENTER><H2><A NAME="P">                   Profile and Structure Alignments
</A></H2></CENTER>
<P>
</P>
<P>   
By PROFILE ALIGNMENT, we mean alignment using existing alignments. Profile
alignments allow you to store alignments of your favourite sequences and add
new sequences to them a few at a time. A profile is simply an alignment of one
or more sequences (e.g. an alignment output file from Clustal X), so each input
can even be a single sequence. One or both sets of input sequences may include
secondary structure assignments or gap penalty masks to guide the alignment.
</P>
<P>
Make sure PROFILE ALIGNMENT MODE is selected, using the switch directly above
the sequence display area. Then, use the ALIGNMENT menu to do profile and
secondary structure alignments.
</P>
<P>
The profiles can be in any of the allowed input formats with "-" characters
used to specify gaps (except for GCG/MSF where "." is used).
</P>
<P>
You have to load the two profiles by choosing FILE, LOAD PROFILE 1 and LOAD
PROFILE 2. Then ALIGNMENT, ALIGN PROFILE 2 TO PROFILE 1 will align the two
profiles to each other. Secondary structure masks in either profile can be used
to guide the alignment. This option compares all the sequences in profile 1
with all the sequences in profile 2 in order to build guide trees which will be
used to calculate sequence weights, and select appropriate alignment parameters
for the final profile alignment.
</P>
<P>
You can skip the first stage (pairwise alignments; guide trees) by using old
guide tree files (ALIGN PROFILES FROM GUIDE TREES). 
</P>
<P>
The ALIGN SEQUENCES TO PROFILE 1 option will take the sequences in the second
profile and align them to the first profile, one at a time. This is useful to
add some new sequences to an existing alignment, or to align a set of sequences
to a known structure. In this case, the second profile set need not be
pre-aligned.
</P>
<P>
You can skip the first stage (pairwise alignments; guide tree) by using an old
guide tree file (ALIGN SEQUENCES TO PROFILE 1 FROM TREE). 
</P>
<P>
SAVE LOG FILE will write the alignment calculation scores to a file. The log
filename is the same as the input sequence filename, with an extension .log
appended.
</P>
<P>
The alignment parameters can be set using the ALIGNMENT PARAMETERS menu,
Pairwise Parameters, Multiple Parameters and Protein Gap Parameters options.
These are EXACTLY the same parameters as used by the general, automatic
multiple alignment procedure. The general multiple alignment procedure is
simply a series of profile alignments. Carrying out a series of profile
alignments on larger and larger groups of sequences allows you to build up a
complete alignment manually, editing intermediate alignments if necessary.
</P>
<P>
<STRONG>
SECONDARY STRUCTURE PARAMETERS
</STRONG>
</P>
<P>
Use this menu to set secondary structure options. If a solved structure is
known, it can be used to guide the alignment by raising gap penalties within
secondary structure elements, so that gaps will preferentially be inserted into
unstructured surface loop regions. Alternatively, a user-specified gap penalty
mask can be supplied for a similar purpose.
</P>
<P>
A gap penalty mask is a series of numbers between 1 and 9, one per position in 
the alignment. Each number specifies how much the gap opening penalty is to be 
raised at that position (raised by multiplying the basic gap opening penalty
by the number) i.e. a mask figure of 1 at a position means no change
in gap opening penalty; a figure of 4 means that the gap opening penalty is
four times greater at that position, making gaps 4 times harder to open.
</P>
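<P>
The multiplication described above can be written out as a short Python sketch.
The function name apply_gap_mask and the base penalty of 10.0 are illustrative
choices, not values taken from the program.
</P>

```python
def apply_gap_mask(base_gap_open, mask):
    """Return the gap opening penalty at each position of the mask."""
    if any(digit not in "123456789" for digit in mask):
        raise ValueError("mask must contain only the digits 1 to 9")
    # Each mask digit multiplies the basic gap opening penalty.
    return [base_gap_open * int(digit) for digit in mask]


print(apply_gap_mask(10.0, "1124"))  # -> [10.0, 10.0, 20.0, 40.0]
```

<P>
A mask digit of 1 leaves the penalty unchanged, while a digit of 4 makes a gap
four times as expensive to open at that position.
</P>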
<P>
The format for gap penalty masks and secondary structure masks is explained in
a separate help section.
</P>
<P>
</P>
<A HREF="#INDEX"> <EM>Back to Index</EM> </A>
<CENTER><H2><A NAME="B">            Secondary Structure / Gap Penalty Masks
</A></H2></CENTER>
<P>
</P>
<P>
The use of secondary structure-based penalties has been shown to improve the
accuracy of sequence alignment. Clustal X now allows secondary structure / gap
penalty masks to be supplied with the input sequences used during profile
alignment. (NB. The secondary structure information is NOT used during multiple
sequence alignment.) The masks work by raising gap penalties in specified
regions (typically secondary structure elements) so that gaps are
preferentially opened in the less well conserved regions (typically surface
loops).
</P>
<P>
The USE PROFILE 1(2) SECONDARY STRUCTURE / GAP PENALTY MASK options control
whether the input 2D-structure information or gap penalty masks will be used
during the profile alignment.
</P>
<P>
The OUTPUT options control whether the secondary structure and gap penalty
masks should be included in the Clustal X output alignments. Showing both is
useful for understanding how the masks work. The 2D-structure information is
itself useful in judging the alignment quality and in seeing how residue
conservation patterns vary with secondary structure. 
</P>
<P>
The HELIX and STRAND GAP PENALTY options provide the value for raising the gap
penalty at core Alpha Helical (A) and Beta Strand (B) residues. In CLUSTAL
format, capital residues denote the A and B core structure notation. Basic gap
penalties are multiplied by the amount specified.
</P>
<P>
The LOOP GAP PENALTY option provides the value for the gap penalty in Loops.
By default this penalty is not raised. In CLUSTAL format, loops are specified
by "." in the secondary structure notation.
</P>
<P>
The SECONDARY STRUCTURE TERMINAL PENALTY provides the value for setting the gap
penalty at the ends of secondary structures. The ends of secondary structures
are known to grow or shrink when related structures are compared. Therefore, by
default, they are given intermediate values, lower than the core penalties. All
secondary structure read in as lower case in CLUSTAL format gets the reduced
terminal penalty.
</P>
<P>
The HELIX and STRAND TERMINAL POSITIONS options specify the range of structure
termini for the intermediate penalties. In the alignment output, these are
indicated as lower case. For Alpha Helices, by default, the range spans the 
end-helical turn (3 residues). For Beta Strands, the default range spans the
end residue and the adjacent loop residue, since sequence conservation often
extends beyond the actual H-bonded Beta Strand.
</P>
<P>
Clustal X can read the masks from SWISS-PROT, CLUSTAL or GDE format input
files. For many 3-D protein structures, secondary structure information is
recorded in the feature tables of SWISS-PROT database entries. You should
always check that the assignments are correct - some are quite inaccurate.
Clustal X looks for SWISS-PROT HELIX and STRAND assignments e.g.
</P>
<P>
</P>
<P>
<PRE>
FT   HELIX       100    115
FT   STRAND      118    119
</PRE>
</P>
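<P>
As an illustration only, the following Python sketch pulls HELIX and STRAND
ranges out of feature table lines like those above and paints them onto a
per-residue string, using A for helix, B for strand and "." elsewhere. The
function names are hypothetical, and the sketch does not mark structure termini
in lower case as Clustal X does.
</P>

```python
def parse_ft_structure(lines):
    """Collect (kind, start, end) from SWISS-PROT FT HELIX/STRAND lines."""
    elements = []
    for line in lines:
        fields = line.split()
        if (len(fields) >= 4 and fields[0] == "FT"
                and fields[1] in ("HELIX", "STRAND")):
            elements.append((fields[1], int(fields[2]), int(fields[3])))
    return elements


def structure_string(elements, length):
    """Build a per-residue string: A = helix, B = strand, '.' = other."""
    codes = {"HELIX": "A", "STRAND": "B"}
    mask = ["."] * length
    for kind, start, end in elements:
        # Feature table positions are 1-based and inclusive.
        for pos in range(start, min(end, length) + 1):
            mask[pos - 1] = codes[kind]
    return "".join(mask)


ft_lines = [
    "FT   HELIX       100    115",
    "FT   STRAND      118    119",
]
elements = parse_ft_structure(ft_lines)
print(elements)
print(structure_string(elements, 120)[95:])
```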
<P>
The structure and penalty masks can also be read from CLUSTAL alignment format 
as comment lines beginning "!SS_" or "!GM_" e.g.
</P>
<P>
<PRE>
!SS_HBA_HUMA    ..aaaAAAAAAAAAAaaa.aaaAAAAAAAAAAaaaaaaAaaa.........aaaAAAAAA
!GM_HBA_HUMA    112224444444444222122244444444442222224222111111111222444444
HBA_HUMA        VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
</PRE>
</P>
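<P>
In the example above, the !GM_ line appears to follow from the !SS_ line by a
simple substitution: "." becomes 1, a lower case terminal residue becomes 2 and
an upper case core residue becomes 4. The Python sketch below applies that
mapping. The numeric values are read off this one example rather than taken
from the program, and giving strand codes (b, B) the same values as helix codes
is an assumption.
</P>

```python
# Multipliers inferred from the HBA_HUMA example; strand values are assumed.
PENALTY_BY_CODE = {".": 1, "a": 2, "b": 2, "A": 4, "B": 4}


def structure_to_gap_mask(structure, penalties=PENALTY_BY_CODE):
    """Translate a secondary structure mask into a gap penalty mask."""
    return "".join(str(penalties[code]) for code in structure)


SS = "..aaaAAAAAAAAAAaaa.aaaAAAAAAAAAAaaaaaaAaaa.........aaaAAAAAA"
GM = "112224444444444222122244444444442222224222111111111222444444"

print(structure_to_gap_mask(SS))
print(structure_to_gap_mask(SS) == GM)  # compare with the !GM_ line above
```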
<P>
Note that the mask itself is a set of numbers between 1 and 9 each of which is 
assigned to the residue(s) in the same column below. 
</P>
<P>
In GDE flat file format, the masks are specified as text and the names must
begin with "SS_ or "GM_ (the leading " is the GDE marker for a text entry).
</P>
<P>
Either a structure or penalty mask or both may be used. If both are included
in an alignment, the user will be asked which is to be used.
</P>
<P>
</P>
<P>
</P>
<A HREF="#INDEX"> <EM>Back to Index</EM> </A>
<CENTER><H2><A NAME="T">                            Phylogenetic Trees
</A></H2></CENTER>
<P>
</P>
<P>
Before calculating a tree, you must have an ALIGNMENT in memory. You can either
load one using the FILE menu, LOAD SEQUENCES option, or carry out a full
multiple alignment so that the result is still in memory.
Remember YOU MUST ALIGN THE SEQUENCES FIRST!!!!
</P>
<P>
The method used is the NJ (Neighbour Joining) method of Saitou and Nei. First
you calculate distances (percent divergence) between all pairs of sequences
from a multiple alignment; second you apply the NJ method to the distance
matrix.
</P>
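<P>
The first step, calculating percent divergence for every pair of aligned
sequences, can be sketched in Python as follows. This is not the Clustal X
code: the function names are hypothetical, and skipping any column in which
either sequence has a gap is an assumption about how gaps are treated. The
Neighbour Joining step itself is not shown.
</P>

```python
def percent_divergence(seq_a, seq_b, gap_chars="-."):
    """Percentage of compared columns in which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a, seq_b):
        if a in gap_chars or b in gap_chars:
            continue  # assumption: columns with a gap are not counted
        compared += 1
        if a.upper() != b.upper():
            differing += 1
    return 100.0 * differing / compared if compared else 0.0


def distance_matrix(alignment):
    """Return {(name1, name2): percent divergence} for all ordered pairs."""
    names = list(alignment)
    return {(p, q): percent_divergence(alignment[p], alignment[q])
            for p in names for q in names}


# Tiny made-up alignment, for illustration only.
alignment = {"s1": "ACGT-A", "s2": "ACCTTA", "s3": "TCGTTA"}
dist = distance_matrix(alignment)
print(dist["s1", "s2"], dist["s1", "s3"], round(dist["s2", "s3"], 2))
```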
<P>
To calculate a tree, use the DRAW N-J TREE option. This gives an UNROOTED tree
and all branch lengths. The root of the tree can only be inferred by using an
outgroup (a sequence that you are certain, on biological grounds, branches at
the outside of the tree) OR, if you assume a degree of constancy in the
'molecular clock', by placing the root in the 'middle' of the tree (roughly
equidistant from all tips).
</P>
<P>
BOOTSTRAP N-J TREE uses a method for deriving confidence values for the 
groupings in a tree (first adapted for trees by Joe Felsenstein). It involves
making N random samples of sites from the alignment (N should be LARGE, e.g.
500 - 1000); drawing N trees (1 from each sample) and counting how many times
each grouping from the original tree occurs in the sample trees. You can set N
using the NUMBER OF BOOTSTRAP TRIALS option in the BOOTSTRAP TREE window. In
practice, you should use a large number of bootstrap replicates (1000 is
recommended, even if it means running the program for an hour on a slow 
computer). You can also supply a seed number for the random number generator
here. Different runs with the same seed will give the same answer. See the
documentation for more details.
</P>
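<P>
The resampling step of the bootstrap can be illustrated with a short Python
sketch: each trial draws alignment columns at random, with replacement, until
the sample is as long as the original alignment. The function names are
hypothetical and Python's own random module stands in for the program's random
number generator, so the samples will not match those of Clustal X for the same
seed. Building a tree from each sample and counting groupings is not shown.
</P>

```python
import random


def bootstrap_sample(alignment, rng):
    """Resample the alignment columns with replacement."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    # The same column choices are applied to every sequence, so each
    # sampled column is a whole column of the original alignment.
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}


def bootstrap_samples(alignment, trials, seed):
    """Return a list of resampled alignments, reproducible for a given seed."""
    rng = random.Random(seed)
    return [bootstrap_sample(alignment, rng) for _ in range(trials)]


# Tiny made-up alignment, for illustration only.
alignment = {"s1": "ACGTTA", "s2": "ACCTTA", "s3": "TCGTTA"}
samples = bootstrap_samples(alignment, trials=3, seed=111)
for sample in samples:
    print(sample)
```

<P>
Calling bootstrap_samples again with the same seed returns the same samples,
which mirrors the behaviour described above for repeated runs.
</P>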
<P>