EXCLUDE POSITIONS WITH GAPS? With this option, any alignment positions where
ANY of the sequences have a gap will be ignored. This means that 'like' will
be compared to 'like' in all distances, which is highly desirable. It also
automatically throws away the most ambiguous parts of the alignment, which are
concentrated around gaps (usually). The disadvantage is that you may throw away
much of the data if there are many gaps (which is why it is difficult for us to
make it the default).  
</P>
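<P>
The effect of excluding gapped positions can be illustrated with a short Python
sketch. It is not taken from the Clustal X source; the function name
drop_gapped_columns and the choice of '-' and '.' as gap characters are
assumptions made for this illustration only.
</P>

```python
def drop_gapped_columns(alignment, gap_chars="-."):
    """Return the alignment with every column that contains a gap removed.

    alignment is a list of equal-length strings (hypothetical input format).
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    # Keep a column only if NONE of the sequences has a gap there.
    keep = [i for i in range(length)
            if not any(seq[i] in gap_chars for seq in alignment)]
    return ["".join(seq[i] for i in keep) for seq in alignment]


aligned = ["ACG-TA", "AC-TTA", "ACGTTA"]
print(drop_gapped_columns(aligned))
```

<P>
In this small example two of the six columns are discarded, which shows how
quickly data can be lost when gaps are frequent.
</P>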
<P>
CORRECT FOR MULTIPLE SUBSTITUTIONS? For small divergence (say, less than 10%)
this option makes no difference. For greater divergence, it corrects for the
fact that observed distances underestimate actual evolutionary distances. This
is because, as sequences diverge, more than one substitution will happen at
many sites, yet you only see one difference per site when you compare the
present-day sequences. This option therefore has the effect of stretching
branch lengths in trees (especially long branches). The corrections used here
(for DNA or proteins) are both due to Motoo Kimura. See the documentation for
details.
</P>
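<P>
As a rough illustration of how such a correction stretches distances, the
Python sketch below implements two well-known formulas published by Kimura:
the empirical protein correction K = -ln(1 - D - D*D/5) and the two-parameter
DNA distance. Whether Clustal X applies exactly these formulas in every case
is an assumption here; the function names are hypothetical, and the program
documentation remains the reference.
</P>

```python
import math


def kimura_protein_distance(observed):
    """Kimura's empirical correction for the observed protein distance D.

    observed is the fraction of differing sites. The formula has no real
    value once 1 - D - D*D/5 reaches zero (D of roughly 0.85).
    """
    inner = 1.0 - observed - observed * observed / 5.0
    if inner <= 0.0:
        raise ValueError("observed distance is too large to be corrected")
    return -math.log(inner)


def kimura_2p_distance(transitions, transversions):
    """Kimura two-parameter DNA distance.

    transitions (P) and transversions (Q) are fractions of compared sites.
    """
    a = 1.0 - 2.0 * transitions - transversions
    b = 1.0 - 2.0 * transversions
    if a <= 0.0 or b <= 0.0:
        raise ValueError("observed distance is too large to be corrected")
    return -0.5 * math.log(a * math.sqrt(b))


for d in (0.05, 0.30, 0.60):
    print(f"observed {d:.2f} -> corrected {kimura_protein_distance(d):.3f}")
```

<P>
The corrected value stays close to the observed one for small distances and
grows much faster for large ones, which is why long branches are stretched
the most.
</P>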
<P>
Where possible, this option should be used. However, for VERY divergent
sequences, the distances cannot be reliably corrected. You will be warned if
this happens. Even if none of the distances in a data set exceed the reliable
threshold, if you bootstrap the data, some of the bootstrap distances may
randomly exceed the safe limit.  
</P>
<P>
SAVE LOG FILE will write the tree calculation scores to a file. The log
filename is the same as the input sequence filename, with an extension .log
appended.
</P>
<P>
<H4>
OUTPUT FORMAT OPTIONS
</H4>
</P>
<P>
Four different formats are allowed. None of these displays the tree visually.
You can display the tree using the NJPLOT program distributed with Clustal X,
or get the PHYLIP package and use the tree drawing facilities there.
</P>
<P> 
1) CLUSTAL FORMAT TREE. This format is verbose and lists all of the distances
between the sequences and the number of alignment positions used for each. The
tree is described at the end of the file. It lists the sequences that are
joined at each alignment step and the branch lengths. After two sequences are
joined, the resulting group is referred to later as a NODE. The number of a
NODE is the number of the lowest-numbered sequence in that NODE.
</P>
<P>
2) PHYLIP FORMAT TREE. This format is the New Hampshire format, used by many
phylogenetic analysis packages. It consists of a series of nested parentheses,
describing the branching order, with the sequence names and branch lengths. It
can be read by the NJPLOT program distributed with Clustal X. It can also be
used by the RETREE, DRAWGRAM and DRAWTREE programs of the PHYLIP package to see
the trees graphically. This is the same format used during multiple alignment
for the guide trees. Some other packages that can read and display New
Hampshire format are TreeTool, TreeView, and Phylowin.
</P>
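<P>
To make the nested-parentheses layout concrete, here is a minimal Python
sketch that writes a small tree in New Hampshire format. The tuple-based tree
representation, the function name to_newick and the five-decimal branch
lengths are assumptions made for this example; they are not the data
structures used by Clustal X.
</P>

```python
def to_newick(node):
    """Serialise a tree to New Hampshire (Newick) text, without the final ';'.

    A leaf is (name, branch_length); an internal node is
    ([child, child, ...], branch_length). Use None where no length applies.
    """
    label, length = node
    if isinstance(label, list):
        # Internal node: children go inside one pair of parentheses.
        text = "(" + ",".join(to_newick(child) for child in label) + ")"
    else:
        text = label
    if length is not None:
        text += f":{length:.5f}"
    return text


tree = ([("seq1", 0.1), ("seq2", 0.2),
         ([("seq3", 0.05), ("seq4", 0.07)], 0.3)], None)
print(to_newick(tree) + ";")
# prints (seq1:0.10000,seq2:0.20000,(seq3:0.05000,seq4:0.07000):0.30000);
```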
<P>
3) PHYLIP DISTANCE MATRIX. This format just outputs a matrix of all the
pairwise distances in a format that can be used by the PHYLIP package. It used
to be useful when one could not produce distances from protein sequences in the
PHYLIP package but is now redundant (PROTDIST of PHYLIP 3.5 now does this).
</P>
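<P>
The general shape of such a file can be sketched in Python as follows. The
layout assumed here (a first line with the number of sequences, then one row
per sequence with the name padded to ten characters) follows the classic
PHYLIP convention; the exact column widths written by Clustal X may differ,
and the function name is hypothetical.
</P>

```python
def phylip_distance_matrix(names, matrix):
    """Format a square distance matrix in a PHYLIP-style text layout."""
    if len(names) != len(matrix) or any(len(row) != len(names) for row in matrix):
        raise ValueError("matrix must be square and match the number of names")
    lines = [f"{len(names):5d}"]
    for name, row in zip(names, matrix):
        label = name[:10].ljust(10)  # names truncated or padded to 10 characters
        cells = " ".join(f"{value:.4f}" for value in row)
        lines.append(f"{label} {cells}")
    return "\n".join(lines)


names = ["seq1", "seq2", "seq3"]
matrix = [[0.0, 0.12, 0.30],
          [0.12, 0.0, 0.25],
          [0.30, 0.25, 0.0]]
print(phylip_distance_matrix(names, matrix))
```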
<P>
4) NEXUS FORMAT TREE. This format is used by several popular phylogeny programs,
including PAUP and MacClade. The format is described fully in:
Maddison, D. R., D. L. Swofford and W. P. Maddison.  1997.
NEXUS: an extensible file format for systematic information.
Systematic Biology 46:590-621.
</P>
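<P>
For orientation only, a minimal NEXUS file holding a single tree can be
produced as in the Python sketch below. It wraps a New Hampshire string in a
TREES block; the tree name tree_1 and the function name are placeholders, and
the file written by Clustal X may contain additional content.
</P>

```python
def nexus_tree_block(newick, tree_name="tree_1"):
    """Wrap a New Hampshire string in a minimal NEXUS file with one TREES block."""
    newick = newick.strip()
    if not newick.endswith(";"):
        newick += ";"
    return "\n".join([
        "#NEXUS",
        "",
        "BEGIN TREES;",
        f"    TREE {tree_name} = {newick}",
        "END;",
    ])


print(nexus_tree_block("(seq1:0.1,seq2:0.2,seq3:0.3)"))
```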
<P>
BOOTSTRAP LABELS ON: By default, the bootstrap values are correctly placed on
the tree branches of the PHYLIP format output tree. The toggle allows them to
be placed on the nodes, which is incorrect, but some display packages (e.g.
TreeTool, TreeView and Phylowin) support only node labelling, not branch
labelling. Care should be taken to note which branches and labels go together.
</P>
<A HREF="#INDEX"> <EM>Back to Index</EM> </A>
<CENTER><H2><A NAME="C">                               Colors
</A></H2></CENTER>
<P>
Clustal X provides a versatile coloring scheme for the sequence alignment
display. The sequences (or profiles) are colored automatically when they are
loaded. Sequences can be colored either by assigning a color to specific
residues, or on the basis of an alignment consensus. In the latter case, the
alignment consensus is calculated automatically, and the residues in each
column are colored according to the consensus character assigned to that
column. In this way, you can choose to highlight, for example, conserved
hydrophilic or hydrophobic positions in the alignment.
</P>
<P>
The 'rules' used to color the alignment are specified in a COLOR PARAMETER
FILE. Clustal X automatically looks for a file called 'colprot.par' for protein
sequences or 'coldna.par' for DNA, in the current directory. (If you are
running under UNIX, it then looks in your home directory, and finally in the
directories in your PATH environment variable.)
</P>
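<P>
The search order described above can be mimicked with a few lines of Python.
This is only a sketch of the lookup order (current directory, home directory,
then each PATH directory); the function name find_color_file and its
parameters are hypothetical and do not correspond to anything in Clustal X.
</P>

```python
import os


def find_color_file(filename, cwd=None, home=None, path=None):
    """Return the first existing match for filename, or None if none is found.

    Search order: current directory, home directory, directories in PATH.
    The keyword arguments exist only so the search can be tried out safely.
    """
    cwd = os.getcwd() if cwd is None else cwd
    home = os.path.expanduser("~") if home is None else home
    path = os.environ.get("PATH", "") if path is None else path
    candidates = [cwd, home] + [d for d in path.split(os.pathsep) if d]
    for directory in candidates:
        full = os.path.join(directory, filename)
        if os.path.isfile(full):
            return full
    return None


print(find_color_file("colprot.par"))
```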
<P>
By default, if no color parameter file is found, protein sequences are colored
by residue as follows:
</P>
<P>
<PRE>
	Color			Residue Code

	ORANGE			GPST
	RED			HKR
	BLUE			FWY
	GREEN			ILMV
</PRE>
</P>
<P>
In the case of DNA sequences, the default colors are as follows:
</P>
<P>
<PRE>
	Color			Residue Code

	ORANGE			A
	RED			C
	BLUE			T
	GREEN			G
</PRE>
</P>
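<P>
The two default tables above amount to a simple lookup from residue to color.
A Python sketch of that lookup is given here for illustration; the dictionary
and function names are made up for this example, and residues not listed in a
table are assumed to stay uncolored.
</P>

```python
# Default tables copied from the text above: color name -> residue codes.
DEFAULT_PROTEIN_COLORS = {
    "ORANGE": "GPST",
    "RED": "HKR",
    "BLUE": "FWY",
    "GREEN": "ILMV",
}
DEFAULT_DNA_COLORS = {"ORANGE": "A", "RED": "C", "BLUE": "T", "GREEN": "G"}


def residue_color(residue, table):
    """Return the color name for a single residue, or None if it is not listed."""
    residue = residue.upper()
    for color, residues in table.items():
        if residue in residues:
            return color
    return None


print(residue_color("K", DEFAULT_PROTEIN_COLORS))  # prints RED
print(residue_color("T", DEFAULT_DNA_COLORS))      # prints BLUE
```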
<P>
The default BACKGROUND COLORING option shows the sequence residues using a
black character on a colored background. It can be switched off to show
residues as a colored character on a white background. 
</P>
<P>
Either the BLACK AND WHITE or the DEFAULT COLOR option can be selected. The
DEFAULT COLOR option looks first for the color parameter file (as described
above) and, if no file is found, uses the default residue-specific colors.
</P>
<P>
You can specify your own coloring scheme by using the LOAD COLOR PARAMETER FILE
option. The format of the color parameter file is described below.
</P>
<P>
<H4>
COLOR PARAMETER FILE
</H4>
</P>
<P>
This file is divided into 3 sections:
</P>
<P>
1) the names and rgb values of the colors
2) the rules for calculating the consensus
3) the rules for assigning colors to the residues
</P>
<P> 
An example file is given here.
</P>
<P>
<PRE>
 --------------------------------------------------------------------
@rgbindex
RED          0.9 0.1 0.1
BLUE         0.1 0.1 0.9
GREEN        0.1 0.9 0.1
YELLOW       0.9 0.9 0.0

@consensus
% = 60% w:l:v:i:m:a:f:c:y:h:p
# = 80% w:l:v:i:m:a:f:c:y:h:p
- = 50% e:d
+ = 60% k:r
q = 50% q:e
p = 50% p
n = 50% n
t = 50% t:s

@color
g = RED
p = YELLOW
t = GREEN if t:%:#
n = GREEN if n
w = BLUE if %:#:p
k = RED if +
 --------------------------------------------------------------------
</PRE>
</P>
<P>
The first section is optional and is identified by the header @rgbindex. If
this section exists, each color used in the file must be named and the rgb
values specified (on a scale from 0 to 1). If the rgb index section is not
found, the following set of hard-coded colors will be used.
</P>
<P>
<PRE>
RED          0.9 0.1 0.1
BLUE         0.1 0.1 0.9
GREEN        0.1 0.9 0.1
ORANGE       0.9 0.7 0.3
CYAN         0.1 0.9 0.9
PINK         0.9 0.5 0.5
MAGENTA      0.9 0.1 0.9
YELLOW       0.9 0.9 0.0
</PRE>
</P>
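<P>
The fallback behaviour described above can be sketched in Python as follows.
The parser is a simplified illustration, not the Clustal X implementation: it
assumes one color per line in the form NAME R G B, and the names
HARD_CODED_COLORS and read_rgb_index are invented for this example.
</P>

```python
# Hard-coded set listed in the text above.
HARD_CODED_COLORS = {
    "RED": (0.9, 0.1, 0.1), "BLUE": (0.1, 0.1, 0.9),
    "GREEN": (0.1, 0.9, 0.1), "ORANGE": (0.9, 0.7, 0.3),
    "CYAN": (0.1, 0.9, 0.9), "PINK": (0.9, 0.5, 0.5),
    "MAGENTA": (0.9, 0.1, 0.9), "YELLOW": (0.9, 0.9, 0.0),
}


def read_rgb_index(lines):
    """Collect the @rgbindex section; use the hard-coded set if it is absent."""
    colors = {}
    in_section = False
    found = False
    for raw in lines:
        line = raw.strip()
        if line.startswith("@"):
            # Any header line starts a new section.
            in_section = line.lower() == "@rgbindex"
            found = found or in_section
            continue
        if in_section and line:
            name, r, g, b = line.split()
            colors[name.upper()] = (float(r), float(g), float(b))
    return colors if found else dict(HARD_CODED_COLORS)


example = ["@rgbindex", "RED 0.9 0.1 0.1", "YELLOW 0.9 0.9 0.0", "",
           "@color", "g = RED"]
print(read_rgb_index(example))
print(sorted(read_rgb_index(["@color", "g = RED"])))
```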
<P>
The second section is optional and is identified by the header @consensus. It
defines how the consensus is calculated.
</P>
<P> 
The format of each consensus parameter is:
</P>
<P> 
<PRE>
c = n% residue_list

        where
              c             is a character used to identify the parameter.
              n             is an integer value used as the percentage cutoff
                            point.
              residue_list  is a list of residues denoted by a single
                            character, delimited by a colon (:).
</PRE>
</P>
<P> 
For example:   # = 60% w:l:v:i
</P>
<P>
will assign the consensus character # to any column in the alignment in which
more than 60% of the residues are w, l, v or i.
</P>
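<P>
A consensus rule of this form can be evaluated with the short Python sketch
below. It is an illustration under stated assumptions: the percentage is taken
over all sequences in the column (gaps included in the total), and the
function names are hypothetical rather than taken from Clustal X.
</P>

```python
def parse_consensus_rule(line):
    """Parse 'c = n% residue_list' into (character, cutoff, residue set)."""
    char, rest = line.split("=", 1)
    cutoff, residues = rest.split("%", 1)
    return char.strip(), int(cutoff), set(residues.strip().lower().split(":"))


def consensus_characters(column, rules):
    """Return the consensus characters whose cutoff the column exceeds.

    column is a string holding one alignment column, one character per
    sequence. Assumption: the cutoff must be strictly exceeded.
    """
    total = len(column)
    hits = []
    for char, cutoff, members in rules:
        count = sum(1 for residue in column.lower() if residue in members)
        if total and 100.0 * count / total > cutoff:
            hits.append(char)
    return hits


rules = [parse_consensus_rule("# = 60% w:l:v:i")]
print(consensus_characters("WLVIA", rules))  # 80% of residues match: ['#']
print(consensus_characters("WLAAA", rules))  # 40% of residues match: []
```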
<P> 
The third section is identified by the header @color, and defines how colors
are assigned to each residue in the alignment.
</P>
<P> 
The color parameters can take one of two formats:
</P>
<P>
<PRE>
1) r = color
2) r = color if consensus_list

        where
              r               is a character used to denote a residue.
              color           is one of the colors in the GDE color lookup table.
              consensus_list  is a list of consensus characters, each denoted by a
                              single character, delimited by a colon (:).
</PRE>
</P>
<P> 
Examples:
1) g = ORANGE
</P>
<P>
will color all glycines ORANGE, regardless of the consensus.
</P>
<P>
2) w = BLUE if w:%:#
</P>
<P>
will color BLUE any tryptophan which is found in a column with a consensus of
w, % or #.
</P>
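<P>
The two rule formats can be applied as in the following Python sketch. It
assumes that the first matching rule wins and that a residue with no matching
rule stays uncolored; both choices, like the function names, are assumptions
made for this illustration and are not confirmed by the text above.
</P>

```python
def parse_color_rule(line):
    """Parse 'r = color' or 'r = color if consensus_list'.

    Returns (residue, color, consensus set or None for unconditional rules).
    """
    residue, rest = line.split("=", 1)
    color, _, condition = rest.partition(" if ")
    condition = condition.strip()
    consensus = set(condition.split(":")) if condition else None
    return residue.strip().lower(), color.strip().upper(), consensus


def color_for(residue, consensus_chars, rules):
    """Return the color of the first rule matching the residue, or None.

    consensus_chars holds the consensus characters assigned to the column.
    """
    for rule_residue, color, consensus in rules:
        if rule_residue != residue.lower():
            continue
        if consensus is None or consensus & set(consensus_chars):
            return color
    return None


rules = [parse_color_rule("g = ORANGE"), parse_color_rule("w = BLUE if w:%:#")]
print(color_for("G", [], rules))     # prints ORANGE
print(color_for("W", ["#"], rules))  # prints BLUE
print(color_for("W", ["+"], rules))  # prints None
```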
<A HREF="#INDEX"> <EM>Back to Index</EM> </A>
<CENTER><H2><A NAME="Q">                       Alignment Quality Analysis
</A></H2></CENTER>
<P>
<H3>
QUALITY SCORES
</H3>
</P>
<P>
Clustal X provides an indication of the quality of an alignment by plotting
a 'conservation score' for each column of the alignment. A high score indicates
a well-conserved column; a low score indicates low conservation. The quality
curve is drawn below the alignment.
</P>
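<P>
The idea of one score per column can be illustrated with a deliberately simple
stand-in measure: the percentage of sequences that share the most common
residue in the column. This is NOT the conservation score computed by
Clustal X; it is a simplified substitute chosen for this sketch, and the
function name is hypothetical.
</P>

```python
from collections import Counter


def column_conservation(alignment, gap_chars="-."):
    """Score each column by the share of sequences with its most common residue.

    Returns one value between 0 and 100 per column. Gaps never count as the
    common residue but do count toward the number of sequences.
    """
    scores = []
    for column in zip(*alignment):
        residues = [r.upper() for r in column if r not in gap_chars]
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(100.0 * top_count / len(column))
    return scores


alignment = ["ACGT", "ACGA", "AC-A"]
for position, score in enumerate(column_conservation(alignment), start=1):
    print(f"column {position}: {score:.1f}")
```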
<P>
Two methods are also provided to indicate single residues or sequence segments
which score badly in the alignment.
</P>
<P> 
Low-scoring residues are expected to occur at a moderate frequency in all the
sequences because of their steady divergence due to the natural processes of
evolution. The most divergent sequences are likely to have the most outliers.
However, the highlighted residues are especially useful in pointing to
sequence misalignments. Note that clustering of highlighted residues is a
strong indication of misalignment. This can arise for various reasons, for
example:
</P>
<P> 
        1. Partial or total misalignments caused by a failure in the
        alignment algorithm. This usually happens only in difficult
        alignment cases.
</P>
<P> 
        2. Partial or total misalignments because at least one of the
        sequences in the given set is partly or completely unrelated to the
        other sequences. It is up to the user to check that the set of
        sequences is alignable.
</P>
<P>
        3. Frameshift translation errors in a protein sequence causing local
        mismatched regions to be heavily highlighted. These are surprisingly
        common in database entries. If suspected, a 3-frame translation of
        the source DNA needs to be examined.
</P>
<P> 
Occasionally, highlighted residues may point to regions of some biological
significance. This might happen for example if a protein alignment contains a
sequence which has acquired new functions relative to the main sequence set. It
is important to exclude other explanations, such as error or the natural
divergence of sequences, before invoking a biological explanation.
</P>
<P>
<H3>
LOW-SCORING SEGMENTS
</H3>
</P>
<P>
Unreliable regions in the alignment can be highlighted using the Low-Scoring
Segments option. A sequence-weighted profile is used to indicate any segments
in the sequences which score badly. Because the profile calculation may take
some time, an option is provided to calculate LOW-SCORING SEGMENTS. The 
segment display can then be toggled on or off without having to repeat the
time-consuming calculations.
</P>
<P>
For details of the low-scoring segment calculation, see the CALCULATION section
below.
</P>
<P>
<H4>
LOW-SCORING SEGMENT PARAMETERS
</H4>
</P>
<P>
MINIMUM LENGTH OF SEGMENTS: short segments (or even single residues) can be
hidden by increasing the minimum length of segments which will be displayed.
</P>
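<P>
The effect of the minimum length parameter can be pictured with the Python
sketch below. It assumes that each residue of a sequence already carries a
low-scoring flag (how that flag is computed is not shown here) and keeps only
the runs that are long enough; the function name and the list-of-booleans
input are assumptions made for this example.
</P>

```python
def low_scoring_segments(flags, min_length=1):
    """Return (start, end) pairs, end exclusive, for runs of flagged residues.

    Runs shorter than min_length are dropped, i.e. hidden from the display.
    """
    segments = []
    start = None
    # A trailing False closes a run that reaches the end of the sequence.
    for i, flagged in enumerate(list(flags) + [False]):
        if flagged and start is None:
            start = i
        elif not flagged and start is not None:
            if i - start >= min_length:
                segments.append((start, i))
            start = None
    return segments


flags = [False, True, False, True, True, True, False, True, True]
print(low_scoring_segments(flags, min_length=1))  # [(1, 2), (3, 6), (7, 9)]
print(low_scoring_segments(flags, min_length=2))  # [(3, 6), (7, 9)]
```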