The format used for a single matrix is the same as that used by BLAST. The
scores in the new weight matrix should be similarities. You can use negative as
well as positive values if you wish, although the matrix will be automatically
adjusted to all positive scores, unless the NEGATIVE MATRIX option is selected.
Any lines beginning with a # character are assumed to be comments. The first
non-comment line should contain a list of amino acids in any order, using the 1
letter code, followed by a * character. This should be followed by a square
matrix of scores, with one row and one column for each amino acid. The last row
and column of the matrix (corresponding to the * character) contain the minimum
score over the whole matrix.
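
As an illustration of the layout described above, here is a minimal Python
sketch of a reader for a single matrix file. It is not ClustalX's own parser.
The function name parse_matrix is hypothetical, integer scores are assumed,
and the optional leading row label (as found in BLAST matrix files) is
tolerated as an assumption.

```python
def parse_matrix(text):
    """Parse a single BLAST-style weight matrix into (residues, scores)."""
    # Drop blank lines and comment lines (those beginning with '#').
    lines = [ln for ln in text.splitlines()
             if ln.strip() and not ln.lstrip().startswith("#")]
    if not lines:
        raise ValueError("empty matrix file")
    # First non-comment line: residues in any order, terminated by '*'.
    residues = lines[0].split()
    if residues[-1] != "*":
        raise ValueError("residue list must end with '*'")
    if len(lines) - 1 != len(residues):
        raise ValueError("matrix must have one row per residue")
    scores = {}
    for row_res, line in zip(residues, lines[1:]):
        fields = line.split()
        # Assumption: tolerate an optional leading row label.
        if len(fields) == len(residues) + 1 and fields[0] == row_res:
            fields = fields[1:]
        if len(fields) != len(residues):
            raise ValueError("wrong number of scores in row %r" % row_res)
        for col_res, value in zip(residues, fields):
            scores[row_res, col_res] = int(value)
    return residues, scores
```

The returned dictionary is keyed by (row residue, column residue), so a
lookup such as scores["A", "R"] gives the similarity score for that pair.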

MATRIX SERIES INPUT FORMAT
ClustalX uses different matrices depending on the mean percent identity of the
sequences to be aligned. You can specify a series of matrices and the range of
the percent identity for each matrix in a matrix series file. The file is
automatically recognised by the word CLUSTAL_SERIES at the beginning of the
file. Each matrix in the series is then specified on one line which should
start with the word MATRIX. This is followed by the lower and upper limits of
the sequence percent identities for which you want to apply the matrix. The
final entry on the matrix line is the filename of a Blast format matrix file
(see above for details of the single matrix file format).

Example.

CLUSTAL_SERIES
 
MATRIX 81 100 /us1/user/julie/matrices/blosum80
MATRIX 61 80 /us1/user/julie/matrices/blosum62
MATRIX 31 60 /us1/user/julie/matrices/blosum45
MATRIX 0 30 /us1/user/julie/matrices/blosum30
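
The selection rule in the example above can be sketched in Python as follows.
This is only an illustration: parse_series and choose_matrix are hypothetical
names, and how ClustalX treats a percent identity that falls between two
integer ranges (e.g. 60.5) is not stated here, so the sketch returns None in
that case.

```python
def parse_series(text):
    """Return a list of (low, high, filename) from a matrix series file."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or not lines[0].startswith("CLUSTAL_SERIES"):
        raise ValueError("file must begin with CLUSTAL_SERIES")
    series = []
    for ln in lines[1:]:
        fields = ln.split(None, 3)
        # Each matrix line: MATRIX <lower limit> <upper limit> <filename>
        if len(fields) == 4 and fields[0] == "MATRIX":
            series.append((int(fields[1]), int(fields[2]), fields[3]))
    return series


def choose_matrix(series, percent_identity):
    """Return the filename whose range contains percent_identity, else None."""
    for low, high, filename in series:
        if low <= percent_identity <= high:
            return filename
    return None
```

With the example file, choose_matrix(series, 75) would pick the blosum62 file.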


<STRONG>
PROTEIN GAP PARAMETERS
</STRONG>

RESIDUE SPECIFIC PENALTIES are amino acid specific gap penalties that reduce or
increase the gap opening penalties at each position in the alignment or 
sequence. See the documentation for details. As an example, positions that are
rich in glycine are more likely to have an adjacent gap than positions that are
rich in valine.

HYDROPHILIC GAP PENALTIES are used to increase the chances of a gap within a
run (5 or more residues) of hydrophilic amino acids; these are likely to be
loop or random coil regions where gaps are more common. The residues that are
"considered" to be hydrophilic can be entered in HYDROPHILIC RESIDUES.
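
To make the idea of a hydrophilic run concrete, the Python sketch below finds
stretches of 5 or more consecutive hydrophilic residues in one sequence. It
only locates the runs; how much the penalty is lowered inside them is not
described here and is not modelled. The function name and the residue set
"GPSNDQEKR" are assumptions for illustration; use whatever is entered in
HYDROPHILIC RESIDUES.

```python
def hydrophilic_runs(sequence, hydrophilic="GPSNDQEKR", min_run=5):
    """Return (start, end) half-open index pairs of hydrophilic runs."""
    allowed = set(hydrophilic)
    runs = []
    start = None
    # A trailing sentinel closes a run that reaches the end of the sequence.
    for i, res in enumerate(sequence.upper() + "\0"):
        if res in allowed:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_run:
                runs.append((start, i))
            start = None
    return runs
```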

GAP SEPARATION DISTANCE tries to decrease the chances of gaps being too close
to each other. Gaps that are less than this distance apart are penalised more
than other gaps. This does not prevent close gaps; it makes them less frequent,
promoting a block-like appearance of the alignment.

END GAP SEPARATION treats end gaps just like internal gaps for the purposes of
avoiding gaps that are too close (set by GAP SEPARATION DISTANCE above). If you
turn this off, end gaps will be ignored for this purpose. This is useful when
you wish to align fragments where the end gaps are not biologically meaningful.


>>HELP P <<
                   Profile and Structure Alignments
   
By PROFILE ALIGNMENT, we mean alignment using existing alignments. Profile 
alignments allow you to store alignments of your favourite sequences and add
new sequences to them in small bunches at a time. A profile is simply an
alignment of one or more sequences (e.g. an alignment output file from Clustal
X). Each input can be a single sequence. One or both sets of input sequences
may include secondary structure assignments or gap penalty masks to guide the
alignment. 

Make sure PROFILE ALIGNMENT MODE is selected, using the switch directly above
the sequence display area. Then, use the ALIGNMENT menu to do profile and
secondary structure alignments.

The profiles can be in any of the allowed input formats with "-" characters
used to specify gaps (except for GCG/MSF where "." is used).

You have to load the 2 profiles by choosing FILE, LOAD PROFILE 1 and LOAD
PROFILE 2. Then ALIGNMENT, ALIGN PROFILE 2 TO PROFILE 1 will align the 2
profiles to each other. Secondary structure masks in either profile can be used
to guide the alignment. This option compares all the sequences in profile 1
with all the sequences in profile 2 in order to build guide trees which will be
used to calculate sequence weights, and select appropriate alignment parameters
for the final profile alignment.

You can skip the first stage (pairwise alignments; guide trees) by using old
guide tree files (ALIGN PROFILES FROM GUIDE TREES). 

The ALIGN SEQUENCES TO PROFILE 1 option will take the sequences in the second
profile and align them to the first profile, 1 at a time.  This is useful to
add some new sequences to an existing alignment, or to align a set of sequences
to a known structure. In this case, the second profile set need not be
pre-aligned.

You can skip the first stage (pairwise alignments; guide tree) by using an old
guide tree file (ALIGN SEQUENCES TO PROFILE 1 FROM TREE). 

SAVE LOG FILE will write the alignment calculation scores to a file. The log
filename is the same as the input sequence filename, with an extension .log
appended.

The alignment parameters can be set using the ALIGNMENT PARAMETERS menu,
Pairwise Parameters, Multiple Parameters and Protein Gap Parameters options.
These are EXACTLY the same parameters as used by the general, automatic
multiple alignment procedure. The general multiple alignment procedure is
simply a series of profile alignments. Carrying out a series of profile
alignments on larger and larger groups of sequences allows you to manually
build up a complete alignment, editing intermediate alignments if necessary.

<STRONG>
SECONDARY STRUCTURE PARAMETERS
</STRONG>

Use this menu to set secondary structure options. If a solved structure is
known, it can be used to guide the alignment by raising gap penalties within
secondary structure elements, so that gaps will preferentially be inserted into
unstructured surface loop regions. Alternatively, a user-specified gap penalty
mask can be supplied for a similar purpose.

A gap penalty mask is a series of numbers between 1 and 9, one per position in 
the alignment. Each number specifies how much the gap opening penalty is to be 
raised at that position (raised by multiplying the basic gap opening penalty
by the number) i.e. a mask figure of 1 at a position means no change
in gap opening penalty; a figure of 4 means that the gap opening penalty is
four times greater at that position, making gaps 4 times harder to open.
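
The multiplication described above can be written out as a short Python
sketch. The function name apply_gap_mask is hypothetical, and the basic gap
opening penalty passed in is just an example value.

```python
def apply_gap_mask(mask, gap_open):
    """Return the per-position gap opening penalty for a mask string."""
    if not mask or any(c not in "123456789" for c in mask):
        raise ValueError("mask must consist of digits 1 to 9")
    # Each mask figure multiplies the basic gap opening penalty.
    return [gap_open * int(c) for c in mask]
```

For example, apply_gap_mask("1142", 10.0) leaves the first two positions at
10.0 and raises the next two to 40.0 and 20.0.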

The format for gap penalty masks and secondary structure masks is explained in
a separate help section.

>>HELP B << 
            Secondary Structure / Gap Penalty Masks

The use of secondary structure-based penalties has been shown to improve the
accuracy of sequence alignment. Clustal X now allows secondary structure / gap
penalty masks to be supplied with the input sequences used during profile
alignment. (NB. The secondary structure information is NOT used during multiple
sequence alignment). The masks work by raising gap penalties in specified
regions (typically secondary structure elements) so that gaps are
preferentially opened in the less well conserved regions (typically surface
loops).

The USE PROFILE 1(2) SECONDARY STRUCTURE / GAP PENALTY MASK options control
whether the input 2D-structure information or gap penalty masks will be used
during the profile alignment.

The OUTPUT options control whether the secondary structure and gap penalty
masks should be included in the Clustal X output alignments. Showing both is
useful for understanding how the masks work. The 2D-structure information is
itself useful in judging the alignment quality and in seeing how residue
conservation patterns vary with secondary structure. 

The HELIX and STRAND GAP PENALTY options provide the value for raising the gap
penalty at core Alpha Helical (A) and Beta Strand (B) residues. In CLUSTAL
format, capital residues denote the A and B core structure notation. Basic gap
penalties are multiplied by the amount specified.

The LOOP GAP PENALTY option provides the value for the gap penalty in Loops.
By default this penalty is not raised. In CLUSTAL format, loops are specified
by "." in the secondary structure notation.

The SECONDARY STRUCTURE TERMINAL PENALTY provides the value for setting the gap
penalty at the ends of secondary structures. Ends of secondary structures are
known to grow or shrink when related structures are compared. Therefore by default
these are given intermediate values, lower than the core penalties. All
secondary structure read in as lower case in CLUSTAL format gets the reduced
terminal penalty.

The HELIX and STRAND TERMINAL POSITIONS options specify the range of structure
termini for the intermediate penalties. In the alignment output, these are
indicated as lower case. For Alpha Helices, by default, the range spans the 
end-helical turn (3 residues). For Beta Strands, the default range spans the
end residue and the adjacent loop residue, since sequence conservation often
extends beyond the actual H-bonded Beta Strand.

Clustal X can read the masks from SWISS-PROT, CLUSTAL or GDE format input
files. For many 3-D protein structures, secondary structure information is
recorded in the feature tables of SWISS-PROT database entries. You should
always check that the assignments are correct - some are quite inaccurate.
Clustal X looks for SWISS-PROT HELIX and STRAND assignments e.g.


<PRE>
FT   HELIX       100    115
FT   STRAND      118    119
</PRE>
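
A minimal Python sketch of picking out such assignments is shown below. It
assumes the whitespace-separated feature table layout shown in the example
(key, start, end) and is not Clustal X's own reader; the function name is
hypothetical.

```python
def parse_ft_structure(lines):
    """Return (kind, start, end) for each HELIX or STRAND feature line."""
    elements = []
    for ln in lines:
        fields = ln.split()
        if (len(fields) >= 4 and fields[0] == "FT"
                and fields[1] in ("HELIX", "STRAND")):
            # Positions are residue numbers in the sequence, 1-based.
            elements.append((fields[1], int(fields[2]), int(fields[3])))
    return elements
```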

The structure and penalty masks can also be read from CLUSTAL alignment format 
as comment lines beginning "!SS_" or "!GM_" e.g.

<PRE>
!SS_HBA_HUMA    ..aaaAAAAAAAAAAaaa.aaaAAAAAAAAAAaaaaaaAaaa.........aaaAAAAAA
!GM_HBA_HUMA    112224444444444222122244444444442222224222111111111222444444
HBA_HUMA        VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
</PRE>

Note that the mask itself is a set of numbers between 1 and 9 each of which is 
assigned to the residue(s) in the same column below. 
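
The relationship between the "!SS_" and "!GM_" lines in the example can be
sketched in Python. The figures for "A" (4), "a" (2) and "." (1) are read off
the example above; using the same figures for strands ("B", "b") is an
assumption, as is the function name. In Clustal X the actual values come from
the HELIX, STRAND, LOOP and TERMINAL penalty options.

```python
# Penalty figure per structure symbol; B/b values are assumed, see above.
DEFAULT_PENALTIES = {"A": 4, "a": 2, "B": 4, "b": 2, ".": 1}


def mask_from_structure(structure, penalties=DEFAULT_PENALTIES):
    """Convert a secondary structure string into a gap penalty mask string."""
    try:
        return "".join(str(penalties[symbol]) for symbol in structure)
    except KeyError as err:
        raise ValueError("unknown structure symbol %s" % err)
```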

In GDE flat file format, the masks are specified as text and the names must
begin with "SS_ or "GM_ (the leading " is the GDE marker for a text entry).

Either a structure or penalty mask or both may be used. If both are included
in an alignment, the user will be asked which is to be used.


>>HELP T <<
                            Phylogenetic Trees

Before calculating a tree, you must have an ALIGNMENT in memory. Either load
one using the FILE menu, LOAD SEQUENCES option, or carry out a full multiple
alignment so that the alignment is still in memory.
Remember YOU MUST ALIGN THE SEQUENCES FIRST!!!!

The method used is the NJ (Neighbour Joining) method of Saitou and Nei. First
you calculate distances (percent divergence) between all pairs of sequences from
a multiple alignment; second you apply the NJ method to the distance matrix.
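
The first step can be sketched in Python as below. This is an illustration
only: the function names are hypothetical, and skipping positions where
either sequence of the pair has a gap is an assumption about how the
comparison is counted. The NJ step itself is not shown.

```python
def percent_divergence(seq_a, seq_b, gap="-"):
    """Percent of compared alignment positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = differences = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x == gap or y == gap:
            continue  # assumption: gapped positions are not compared
        compared += 1
        if x != y:
            differences += 1
    if compared == 0:
        raise ValueError("no positions to compare")
    return 100.0 * differences / compared


def distance_matrix(alignment):
    """Return {(name1, name2): percent divergence} for all pairs."""
    return {(p, q): percent_divergence(alignment[p], alignment[q])
            for p in alignment for q in alignment}
```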

To calculate a tree, use the DRAW N-J TREE option. This gives an UNROOTED tree
and all branch lengths. The root of the tree can only be inferred by using an
outgroup (a sequence that you are certain, on biological grounds, branches at
the outside of the tree) OR, if you assume a degree of constancy in the
'molecular clock', you can place the root in the 'middle' of the tree
(roughly equidistant from all tips).

BOOTSTRAP N-J TREE uses a method for deriving confidence values for the 
groupings in a tree (first adapted for trees by Joe Felsenstein). It involves
making N random samples of sites from the alignment (N should be LARGE, e.g.
500 - 1000); drawing N trees (1 from each sample) and counting how many times
each grouping from the original tree occurs in the sample trees. You can set N
using the NUMBER OF BOOTSTRAP TRIALS option in the BOOTSTRAP TREE window. In
practice, you should use a large number of bootstrap replicates (1000 is
recommended, even if it means running the program for an hour on a slow 
computer). You can also supply a seed number for the random number generator
here. Different runs with the same seed will give the same answer. See the
documentation for more details.
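
The resampling step can be sketched in Python as follows. It shows only how N
samples of sites are drawn with replacement and how a fixed seed makes runs
repeatable; tree drawing and the counting of groupings are not shown. The
function name is hypothetical, and Python's random module stands in for
whatever random number generator Clustal X uses.

```python
import random


def bootstrap_alignments(alignment, n_trials, seed):
    """Yield n_trials alignments built by sampling columns with replacement."""
    rng = random.Random(seed)  # same seed, same sequence of samples
    length = len(next(iter(alignment.values())))
    for _ in range(n_trials):
        # Draw as many sites as the alignment has, with replacement.
        columns = [rng.randrange(length) for _ in range(length)]
        yield {name: "".join(seq[c] for c in columns)
               for name, seq in alignment.items()}
```

Each sampled alignment would then be passed through the same distance and NJ
calculation as the original.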

EXCLUDE POSITIONS WITH GAPS? With this option, any alignment positions where
ANY of the sequences have a gap will be ignored. This means that 'like' will
be compared to 'like' in all distances, which is highly desirable. It also
automatically throws away the most ambiguous parts of the alignment, which are
concentrated around gaps (usually). The disadvantage is that you may throw away
much of the data if there are many gaps (which is why it is difficult for us to
make it the default).  
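
In Python terms, the option amounts to the column filter sketched below. The
function name is hypothetical, and treating both "-" and "." as gap
characters is an assumption based on the input formats mentioned elsewhere in
this help.

```python
def drop_gap_columns(alignment, gap_chars="-."):
    """Remove every alignment column in which any sequence has a gap."""
    seqs = list(alignment.values())
    keep = [i for i in range(len(seqs[0]))
            if all(seq[i] not in gap_chars for seq in seqs)]
    return {name: "".join(seq[i] for i in keep)
            for name, seq in alignment.items()}
```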

CORRECT FOR MULTIPLE SUBSTITUTIONS? For small divergence (say <10%) this option
makes no difference. For greater divergence, this option corrects for the fact
that observed distances underestimate actual evolutionary distances. This is
because, as sequences diverge, more than one substitution will happen at many
sites. However, you only see one difference when you look at the present day
sequences. Therefore, this option has the effect of stretching branch lengths
in trees (especially long branches). The corrections used here (for DNA or
proteins) are both due to Motoo Kimura. See the documentation for details.  

Where possible, this option should be used. However, for VERY divergent
sequences, the distances cannot be reliably corrected. You will be warned if
this happens. Even if none of the distances in a data set exceed the reliable
threshold, if you bootstrap the data, some of the bootstrap distances may
randomly exceed the safe limit.  
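
As a sketch of what such a correction looks like, the Python below uses two
well-known formulas of Kimura: the empirical protein formula
K = -ln(1 - D - D*D/5) and the two-parameter DNA formula
K = -(1/2) ln((1 - 2P - Q) * sqrt(1 - 2Q)), where D is the observed fraction
of differing sites, P the fraction of transitions and Q the fraction of
transversions. That these are exactly the variants Clustal X applies is an
assumption; see the documentation. The function names are hypothetical.

```python
import math


def kimura_protein(d):
    """Corrected distance from an observed fraction d of differing sites."""
    arg = 1.0 - d - 0.2 * d * d
    if arg <= 0.0:
        raise ValueError("sequences too divergent to correct reliably")
    return -math.log(arg)


def kimura_dna(p, q):
    """Corrected distance from transition (p) and transversion (q) fractions."""
    a = 1.0 - 2.0 * p - q
    b = 1.0 - 2.0 * q
    if a <= 0.0 or b <= 0.0:
        raise ValueError("sequences too divergent to correct reliably")
    return -0.5 * math.log(a * math.sqrt(b))
```

For small d the corrected value is close to d itself, and the stretching
grows as d increases, in line with the description above.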

SAVE LOG FILE will write the tree calculation scores to a file. The log
filename is the same as the input sequence filename, with an extension .log
appended.

<H4>
OUTPUT FORMAT OPTIONS
</H4>

Four different formats are allowed. None of these displays the tree visually.
You can display the tree using the NJPLOT program distributed with Clustal X
OR get the PHYLIP package and use the tree drawing facilities there. 
 
1) CLUSTAL FORMAT TREE. This format is verbose and lists all of the distances
between the sequences and the number of alignment positions used for each. The
tree is described at the end of the file. It lists the sequences that are
joined at each alignment step and the branch lengths. After two sequences are
joined, they are referred to later as a NODE. The number of a NODE is the number
of the lowest sequence in that NODE.   

2) PHYLIP FORMAT TREE. This format is the New Hampshire format, used by many
phylogenetic analysis packages. It consists of a series of nested parentheses,
describing the branching order, with the sequence names and branch lengths. It
can be read by the NJPLOT program distributed with ClustalX. It can also be
used by the RETREE, DRAWGRAM and DRAWTREE programs of the PHYLIP package to see
the trees graphically. This is the same format used during multiple alignment
for the guide trees. Some other packages that can read and display New
Hampshire format are TreeTool, TreeView, and Phylowin.

3) PHYLIP DISTANCE MATRIX. This format just outputs a matrix of all the
pairwise distances in a format that can be used by the PHYLIP package. It used
to be useful when one could not produce distances from protein sequences in the
Phylip package but is now redundant (PROTDIST of Phylip 3.5 now does this).

4) NEXUS FORMAT TREE. This format is used by several popular phylogeny programs,
including PAUP and MacClade. The format is described fully in:
Maddison, D. R., D. L. Swofford and W. P. Maddison.  1997.
NEXUS: an extensible file format for systematic information.
Systematic Biology 46:590-621.
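
To illustrate the nested-parentheses layout of the New Hampshire (PHYLIP)
format described in item 2 above, here is a small Python sketch that writes a
tree held as nested tuples. The data layout, the function names and the
five-decimal branch lengths are all choices made for this example, not a
description of Clustal X's own output routine.

```python
def _format_node(node):
    """node is (label, length); label is a name or a list of child nodes."""
    label, length = node
    if isinstance(label, str):
        text = label
    else:
        text = "(" + ",".join(_format_node(child) for child in label) + ")"
    # The root has no branch above it, so its length is None.
    return text if length is None else "%s:%.5f" % (text, length)


def to_newick(root):
    """Return the tree as a New Hampshire format string ending in ';'."""
    return _format_node(root) + ";"
```

An unrooted tree of four sequences could be written with a three-way split
at the top level, for example
to_newick(([("A", 0.1), ([("B", 0.2), ("C", 0.3)], 0.05), ("D", 0.4)], None)).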

BOOTSTRAP LABELS ON: By default, the bootstrap values are correctly placed on
the tree branches of the phylip format output tree. The toggle allows them to
be placed on the nodes, which is incorrect, but some display packages (e.g.
TreeTool, TreeView and Phylowin) only support node labelling but not branch
labelling. Care should be taken to note which branches and labels go together. 


>>HELP C <<