number of sequences. If the block length is set to 0, the alignment will not
be divided into blocks, but printed across a number of pages.
</P>
<P>
</P>
<A HREF="#INDEX"> <EM>Back to Index</EM> </A>
<CENTER><H2><A NAME="E">Editing Alignments
</A></H2></CENTER>
<P>
</P>
<P>
Clustal X allows you to change the order of the sequences in the alignment, by
cutting-and-pasting the sequence names.
</P>
<P>
To select a group of sequences to be moved, click on a sequence name and drag
the cursor until all the required sequences are highlighted. Holding down the
Shift key when clicking on the first name will add new sequences to those
already selected.
</P>
<P>
(Options are provided to Select All Sequences, Select Profile 1 or Select 
Profile 2.)
</P>
<P>
The selected sequences can be removed from the alignment by using the EDIT
menu, CUT option.
</P>
<P>
To add the cut sequences back into an alignment, select a sequence by clicking
on the sequence name, then choose the EDIT menu, PASTE option. The cut sequences
will be added to the alignment immediately following the selected sequence.
</P>
<P>
To add the cut sequences to an empty alignment (e.g. when cutting sequences from
Profile 1 and pasting them to Profile 2), click on the empty sequence name
display area, and select the EDIT menu, PASTE option as before.
</P>
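<P>
The cut-and-paste behaviour described above can be pictured as simple list
operations on the sequence names. The following is only a minimal sketch in
Python, not Clustal X's own code: the function names cut_sequences and
paste_sequences and the sample sequence names are hypothetical, and each
alignment is assumed to be an ordered list of names.
</P>
<PRE>
```python
def cut_sequences(order, selected):
    """Split the name list into (remaining, clipboard), keeping original order."""
    remaining = [name for name in order if name not in selected]
    clipboard = [name for name in order if name in selected]
    return remaining, clipboard


def paste_sequences(order, clipboard, after=None):
    """Insert the clipboard right after the name `after`.

    With after=None (assumed here to model pasting into an empty
    alignment) the clipboard names are simply appended.
    """
    if after is None:
        return list(order) + list(clipboard)
    position = order.index(after) + 1
    return order[:position] + clipboard + order[position:]


profile1 = ["seq1", "seq2", "seq3", "seq4", "seq5"]
profile1, clipboard = cut_sequences(profile1, {"seq2", "seq3"})
reordered = paste_sequences(profile1, clipboard, after="seq5")
profile2 = paste_sequences([], clipboard)
print(reordered)   # names of Profile 1 after pasting behind seq5
print(profile2)    # names pasted into an empty Profile 2
```
</PRE>
<P>
In this sketch the cut sequences keep their relative order, which matches the
description above of them being added as a group after the selected sequence.
</P>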
<P>
The sequence selection and sequence range selection can be cleared using the
EDIT menu, CLEAR SEQUENCE SELECTION and CLEAR RANGE SELECTION options
respectively.
</P>
<P>
To search for a string of residues in the sequences, select the sequences to be
searched by clicking on the sequence names. You can then enter the string to
search for by selecting the SEARCH FOR STRING option. If the string is found in
any of the sequences selected, the sequence name and column number are printed
below the sequence display.
</P>
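<P>
The search described above can be approximated as follows. This is a minimal
sketch under stated assumptions, not the program's actual routine: the
alignment is held as a dictionary of name to aligned row, the search is
case-insensitive, the string is matched against the row exactly as displayed
(gaps included), and columns are counted from 1. The function name
search_for_string is hypothetical.
</P>
<PRE>
```python
def search_for_string(alignment, selected, query):
    """Return (name, column) for the first hit in each selected sequence."""
    hits = []
    query = query.upper()
    for name in selected:
        position = alignment[name].upper().find(query)
        if position != -1:
            hits.append((name, position + 1))  # report a 1-based column
    return hits


alignment = {
    "seq1": "MK-TAYIAKQR",
    "seq2": "MKLTAYI-KQR",
    "seq3": "MK-SAYIAKQR",
}
print(search_for_string(alignment, ["seq1", "seq2"], "TAYI"))
```
</PRE>
<P>
Only the sequences named in the selection are examined, mirroring the need to
select sequences before searching.
</P>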
<P>
In PROFILE ALIGNMENT MODE, the two profiles can be merged (normally done after
alignment) by selecting ADD PROFILE 2 TO PROFILE 1. The sequences currently
displayed as Profile 2 will be appended to Profile 1. 
</P>
<P>
The REMOVE ALL GAPS option will remove all gaps from the sequences currently
selected.
WARNING: This option removes ALL gaps, not only those introduced by ClustalX,
but also those that were read from the input alignment file. Any secondary
structure information associated with the alignment will NOT be automatically
realigned.
</P>
<P>
The REMOVE GAP-ONLY COLUMNS option will remove those positions in the alignment which
contain gaps in all sequences. This can occur as a result of removing divergent
sequences from an alignment, or if an alignment has been realigned.
</P>
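<P>
The difference between the two gap-removal options can be shown with a short
Python sketch. It assumes that "-" is the only gap character and that all rows
of the alignment have the same length; the function names are hypothetical and
the code is an illustration rather than the program's own implementation.
</P>
<PRE>
```python
def remove_all_gaps(row, gap="-"):
    """Strip every gap from one sequence, whatever its origin."""
    return "".join(residue for residue in row if residue != gap)


def remove_gap_only_columns(rows, gap="-"):
    """Drop only the columns in which every sequence has a gap."""
    if not rows:
        return []
    keep = [col for col in range(len(rows[0]))
            if any(row[col] != gap for row in rows)]
    return ["".join(row[col] for col in keep) for row in rows]


rows = ["AC--GT", "A---GT", "AC--G-"]
print(remove_all_gaps(rows[1]))
print(remove_gap_only_columns(rows))
```
</PRE>
<P>
Note that remove_all_gaps leaves the sequences unaligned, whereas
remove_gap_only_columns keeps the remaining columns in register.
</P>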
<P>
</P>
<A HREF="#INDEX"> <EM>Back to Index</EM> </A>
<CENTER><H2><A NAME="M">Multiple Alignments
</A></H2></CENTER>
<P>
</P>
<P>
Make sure MULTIPLE ALIGNMENT MODE is selected, using the switch directly above
the sequence display area. Then, use the ALIGNMENT menu to do multiple
alignments.
</P>
<P>
Multiple alignments are carried out in 3 stages:
</P>
<P> 
1) all sequences are compared to each other (pairwise alignments);
</P>
<P> 
2) a dendrogram (like a phylogenetic tree) is constructed, describing the
approximate groupings of the sequences by similarity (stored in a file);
</P>
<P> 
3) the final multiple alignment is carried out, using the dendrogram as a guide.
</P>
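<P>
The first two stages can be illustrated with a toy Python sketch. It is not
the method used by the program: the similarity from Python's
difflib.SequenceMatcher stands in for real pairwise alignment scores, and a
simple average-linkage clustering stands in for the tree-building step. All
function names and the sample sequences are hypothetical. Stage 3 is not
implemented; the sketch only prints the order in which it would join
sequences and groups.
</P>
<PRE>
```python
from difflib import SequenceMatcher
from itertools import combinations


def pairwise_distances(seqs):
    """Stage 1 stand-in: one crude distance per pair of sequences."""
    dist = {}
    for a, b in combinations(sorted(seqs), 2):
        similarity = SequenceMatcher(None, seqs[a], seqs[b], autojunk=False).ratio()
        dist[frozenset((a, b))] = 1.0 - similarity
    return dist


def average_distance(c1, c2, dist):
    """Mean distance between the members of two clusters."""
    total = sum(dist[frozenset((x, y))] for x in c1 for y in c2)
    return total / (len(c1) * len(c2))


def guide_tree_merges(seqs, dist):
    """Stage 2 stand-in: join the two closest clusters until one is left."""
    clusters = [(name,) for name in sorted(seqs)]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: average_distance(pair[0], pair[1], dist))
        clusters = [c for c in clusters if c not in (c1, c2)]
        clusters.append(c1 + c2)
        merges.append((c1, c2))
    return merges


seqs = {"a": "ACGTACGT", "b": "ACGTACGA", "c": "TTTTGGGG"}
merges = guide_tree_merges(seqs, pairwise_distances(seqs))
# Stage 3 would align the pairs in this order, closest first.
for left, right in merges:
    print(left, "+", right)
```
</PRE>
<P>
The list of merges plays the role of the dendrogram: the most similar
sequences are joined first and the most divergent one is added last.
</P>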
<P>
The 3 stages are carried out automatically by the DO COMPLETE ALIGNMENT option.
You can skip the first stages (pairwise alignments; guide tree) by using an old
guide tree file (DO ALIGNMENT FROM GUIDE TREE); or you can just produce the
guide tree with no final multiple alignment (PRODUCE GUIDE TREE ONLY).
</P>
<P>
</P>
<P>
REALIGN SELECTED SEQUENCES is used to realign badly aligned sequences in the
alignment. Sequences can be selected by clicking on the sequence names - see
Editing Alignments for more details. The unselected sequences are then 'fixed'
and a profile is made including only the unselected sequences. Each of the
selected sequences in turn is then realigned to this profile. The realigned
sequences will be displayed as a group at the end of the alignment.
</P>
<P>
</P>
<P>
REALIGN SELECTED SEQUENCE RANGE is used to realign a small region of the 
alignment. A residue range can be selected by clicking on the sequence display
area. A multiple alignment is then performed, following the 3 stages described
above, but only using the selected residue range. Finally the new alignment of
the range is pasted back into the full sequence alignment.
</P>
<P>
By default, gap penalties are used at each end of the subrange in order to 
penalise terminal gaps. If the REALIGN SEGMENT END GAP PENALTIES option is
switched off, gaps can be introduced at the ends of the residue range at no
cost.
</P>
<P>
</P>
<P>
ALIGNMENT PARAMETERS displays a sub-menu with the following options:
</P>
<P>
RESET NEW GAPS BEFORE ALIGNMENT will remove any new gaps introduced into the
sequences during multiple alignment if you wish to change the parameters and
try again. This only takes effect just before you do a second multiple
alignment. You can make phylogenetic trees after alignment whether or not this
is ON. If you turn this OFF, the new gaps are kept even if you do a second
multiple alignment. This allows you to iterate the alignment gradually.
Sometimes, the alignment is improved by a second or third pass.
</P>
<P>
RESET ALL GAPS BEFORE ALIGNMENT will remove all gaps in the sequences including
gaps which were read in from the sequence input file. This only takes effect
just before you do a second multiple alignment.  You can make phylogenetic
trees after alignment whether or not this is ON.  If you turn this OFF, all
gaps are kept even if you do a second multiple alignment. This allows you to
iterate the alignment gradually.  Sometimes, the alignment is improved by a
second or third pass.
</P>
<P>
</P>
<P>
PAIRWISE ALIGNMENT PARAMETERS control the speed/sensitivity of the initial
alignments.
</P>
<P>
MULTIPLE ALIGNMENT PARAMETERS control the gaps in the final multiple
alignments.
</P>
<P>
PROTEIN GAP PARAMETERS displays a temporary window which allows you to set
various parameters only used in the alignment of protein sequences.
</P>
<P>
(SECONDARY STRUCTURE PARAMETERS, for use with the Profile Alignment Mode only,
allows you to set various parameters only used with gap penalty masks.)
</P>
<P>
SAVE LOG FILE will write the alignment calculation scores to a file. The log
filename is the same as the input sequence filename, with an extension .log
appended.
</P>
<P>
</P>
<P>
<H4>
OUTPUT FORMAT OPTIONS
</H4>
</P>
<P>
You can choose from 6 different alignment formats (CLUSTAL, GCG, NBRF/PIR,
PHYLIP, GDE and NEXUS).  You can choose more than one (or all 6 if you wish).  
</P>
<P>
CLUSTAL format output is a self-explanatory alignment format. It shows the
sequences aligned in blocks. It can be read in again at a later date to (for
example) calculate a phylogenetic tree or add in new sequences by profile
alignment.
</P>
<P>
GCG output can be used by any of the GCG programs that can work on multiple
alignments (e.g. PRETTY, PROFILEMAKE, PLOTALIGN). It is the same as the GCG
.msf (multiple sequence file) format, which was new in version 7 of GCG.
</P>
<P>
NEXUS format is used by several phylogeny programs, including PAUP and
MacClade.
</P>
<P>
PHYLIP format output can be used for input to the PHYLIP package of Joe 
Felsenstein.  This is a very widely used package for doing every imaginable
form of phylogenetic analysis (MUCH more than the modest introduction
offered by this program).
</P>
<P>
NBRF/PIR: this is the same as the standard PIR format with ONE ADDITION. Gap
characters "-" are used to indicate the positions of gaps in the multiple 
alignment. These files can be re-used as input in any part of clustal that
allows sequences (or alignments or profiles) to be read in.  
</P>
<P>
GDE:  this format is used by the GDE package of Steven Smith and is understood
by SEQLAB in GCG 9 or later.
</P>
<P>
GDE OUTPUT CASE: sequences in GDE format may be written in either upper or
lower case.
</P>
<P> 
CLUSTALW SEQUENCE NUMBERS: residue numbers may be added to the end of the
alignment lines in clustalw format.
</P>
<P>
OUTPUT ORDER is used to control the order of the sequences in the output
alignments. By default, it uses the order in which the sequences were aligned
(from the guide tree/dendrogram), thus automatically grouping closely related
sequences. It can be switched to be the same as the original input order.
</P>
<P>
PARAMETER OUTPUT: This option will save all your parameter settings in a
parameter file (suffix .par) during alignment. The file can be subsequently
used to rerun ClustalW using the same parameters.
</P>
<P>
</P>
<P>
<H3>
ALIGNMENT PARAMETERS
</H3>
</P>
<P>
<STRONG>
PAIRWISE ALIGNMENT PARAMETERS
</STRONG>
</P>
<P>
A distance is calculated between every pair of sequences and these distances
are used to construct the phylogenetic tree which guides the final multiple
alignment. The
scores are calculated from separate pairwise alignments. These can be
calculated using 2 methods: dynamic programming (slow but accurate) or by the
method of Wilbur and Lipman (extremely fast but approximate).   
</P>
<P>
You can choose between the 2 alignment methods using the PAIRWISE ALIGNMENTS
option. The slow/accurate method is fast enough for short sequences but will be
VERY SLOW for many (e.g. >100) long (e.g. >1000 residue) sequences.   
</P>
<P>
</P>
<P>
<STRONG>
SLOW-ACCURATE alignment parameters:
</STRONG>
</P>
<P>
These parameters do not have any effect on the speed of the alignments. They
are used to give initial alignments which are then rescored to give percent
identity scores. These % scores are the ones which are displayed on the 
screen. The scores are converted to distances for the trees.
</P>
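<P>
As an illustration of rescoring an alignment to a percent identity and turning
it into a distance, here is a minimal Python sketch. The exact formula used by
the program is not given in this section, so two assumptions are made: identity
is counted only over columns where both sequences have a residue, and the
distance is taken as 1 minus the identity fraction. The function names are
hypothetical.
</P>
<PRE>
```python
def percent_identity(row_a, row_b, gap="-"):
    """Percent identical residues over columns where neither row has a gap."""
    pairs = [(x, y) for x, y in zip(row_a, row_b) if x != gap and y != gap]
    if not pairs:
        return 0.0
    identical = sum(1 for x, y in pairs if x == y)
    return 100.0 * identical / len(pairs)


def identity_to_distance(pid):
    """Assumed conversion: 100% identity gives 0.0, 0% identity gives 1.0."""
    return 1.0 - pid / 100.0


row_a = "ACGTAC-GTA"
row_b = "ACCTACTG-T"
pid = percent_identity(row_a, row_b)
print(pid, identity_to_distance(pid))
```
</PRE>
<P>
In the example, 8 columns contain a residue in both rows and 6 of them are
identical, giving 75% identity and a distance of 0.25.
</P>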
<P>
Gap Open Penalty:      the penalty for opening a gap in the alignment.
</P>
<P>
Gap Extension Penalty: the penalty for extending a gap by 1 residue.
</P>
<P>
Protein Weight Matrix: the scoring table which describes the similarity of 
each amino acid to each other.
</P>
<P>
Load protein matrix: allows you to read in a comparison table from a file.
</P>
<P>
DNA weight matrix: the scores assigned to matches and mismatches (including
IUB ambiguity codes).
</P>
<P>
Load DNA matrix: allows you to read in a comparison table from a file.
</P>
<P>
See the Multiple alignment parameters, MATRIX option below for details of the
matrix input format.
</P>
<P>
</P>
<P>
<STRONG>
FAST-APPROXIMATE alignment parameters:
</STRONG>
</P>
<P>
These similarity scores are calculated from fast, approximate, global
alignments, which are controlled by 4 parameters. 2 techniques are used to make
these alignments very fast: 1) only exactly matching fragments (k-tuples) are
considered; 2) only the 'best' diagonals (the ones with most k-tuple matches)
are used.
</P>
<P>
GAP PENALTY:   This is a penalty for each gap in the fast alignments. It has
little effect on the speed or sensitivity except for extreme values.
</P>
<P>
K-TUPLE SIZE:  This is the size of exactly matching fragment that is used. 
INCREASE for speed (max= 2 for proteins; 4 for DNA), DECREASE for sensitivity.
For longer sequences (e.g. >1000 residues) you may wish to increase the
default.
</P>
<P>
TOP DIAGONALS: The number of k-tuple matches on each diagonal (in an imaginary
dot-matrix plot) is calculated. Only the best ones (with most matches) are used
in the alignment. This parameter specifies how many. Decrease for speed;
increase for sensitivity.
</P>
<P>
WINDOW SIZE:  This is the number of diagonals around each of the 'best' 
diagonals that will be used. Decrease for speed; increase for sensitivity.
</P>
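<P>
The roles of K-TUPLE SIZE, TOP DIAGONALS and WINDOW SIZE can be seen in a small
Python sketch. It only counts exact k-tuple matches per diagonal and picks the
best diagonals plus a window around them; it does not produce an alignment and
is not the program's own code. The function names and the sample sequences are
hypothetical, and a diagonal is labelled here by i - j for a match between
position i of the first sequence and position j of the second.
</P>
<PRE>
```python
from collections import Counter, defaultdict


def diagonal_ktuple_counts(seq_a, seq_b, k):
    """Count exactly matching k-tuples on each diagonal of the dot-matrix."""
    index = defaultdict(list)
    for j in range(len(seq_b) - k + 1):
        index[seq_b[j:j + k]].append(j)
    counts = Counter()
    for i in range(len(seq_a) - k + 1):
        for j in index.get(seq_a[i:i + k], ()):
            counts[i - j] += 1
    return counts


def top_diagonals(counts, how_many):
    """Keep only the diagonals with the most k-tuple matches."""
    return [diagonal for diagonal, _ in counts.most_common(how_many)]


def diagonals_in_window(best, window):
    """All diagonals lying within `window` of one of the best diagonals."""
    return sorted({d + offset for d in best
                   for offset in range(-window, window + 1)})


counts = diagonal_ktuple_counts("ACGTACGT", "TTACGTAC", k=2)
best = top_diagonals(counts, 1)
print(dict(counts))
print(best, diagonals_in_window(best, 1))
```
</PRE>
<P>
A larger k leaves fewer tuples to index and match, fewer top diagonals and a
narrower window leave less of the dot-matrix to examine, which is why these
settings trade sensitivity for speed.
</P>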
<P>
</P>
<P>
<STRONG>
MULTIPLE ALIGNMENT PARAMETERS
</STRONG>
</P>
<P>
These parameters control the final multiple alignment. This is the core of the
program and the details are complicated. To fully understand the use of the
parameters and the scoring system, you will have to refer to the documentation.
</P>
<P>
Each step in the final multiple alignment consists of aligning two alignments 
or sequences. This is done progressively, following the branching order in the
GUIDE TREE. The basic parameters to control this are two gap penalties and the
scores for various identical/non-identical residues. 
</P>
<P>
The GAP OPENING and EXTENSION PENALTIES can be set here. These control the 
cost of opening up every new gap and the cost of every item in a gap.  
Increasing the gap opening penalty will make gaps less frequent. Increasing 
the gap extension penalty will make gaps shorter. Terminal gaps are not 
penalised.
</P>
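<P>
The combined effect of the two penalties can be sketched in Python as follows.
This is a simplified illustration: each internal gap is charged one opening
penalty plus one extension penalty per gap position, terminal gaps are free,
and any further adjustments the program may make to the penalties are ignored.
The function name and the penalty values 10 and 0.5 are purely illustrative.
</P>
<PRE>
```python
import re


def total_gap_penalty(row, gap_open, gap_ext, gap="-"):
    """Sum open + ext * length over internal gap runs; terminal runs cost nothing."""
    core = row.strip(gap)  # drop leading and trailing (terminal) gaps
    total = 0.0
    for run in re.finditer(re.escape(gap) + "+", core):
        total += gap_open + gap_ext * len(run.group())
    return total


row = "--AC---GT-A--"
print(total_gap_penalty(row, gap_open=10.0, gap_ext=0.5))
```
</PRE>
<P>
Here the row has two internal gaps, of length 3 and 1, so the total is
(10 + 1.5) + (10 + 0.5) = 22. Raising gap_open penalises the number of gaps,
while raising gap_ext penalises their length.
</P>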
<P>
The DELAY DIVERGENT SEQUENCES switch delays the alignment of the most distantly
related sequences until after the most closely related sequences have  been
aligned. The setting shows the percent identity level required to delay the
addition of a sequence; sequences that are less identical than this level to
any other sequences will be aligned later.
</P>
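<P>
One possible reading of this rule is sketched below in Python: a sequence is
delayed when its percent identity to every other sequence is below the chosen
level. This interpretation, the function name, and the sample identities and
cut-off of 30 are assumptions made for illustration; the program's exact test
is not spelled out in this section.
</P>
<PRE>
```python
def split_by_divergence(names, pid, cutoff):
    """Return (aligned_first, delayed) according to best identity to any other sequence."""
    aligned_first, delayed = [], []
    for name in names:
        best = max(pid[frozenset((name, other))]
                   for other in names if other != name)
        if best >= cutoff:
            aligned_first.append(name)
        else:
            delayed.append(name)
    return aligned_first, delayed


names = ["a", "b", "c"]
pid = {
    frozenset(("a", "b")): 90.0,
    frozenset(("a", "c")): 25.0,
    frozenset(("b", "c")): 28.0,
}
print(split_by_divergence(names, pid, cutoff=30.0))
```
</PRE>
<P>
Sequence c is at most 28% identical to any other sequence, so with a cut-off
of 30 it would be held back until a and b have been aligned.
</P>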
<P>
The TRANSITION WEIGHT gives transitions (A<-->G or C<-->T i.e. purine-purine or
pyrimidine-pyrimidine substitutions) a weight between 0 and 1; a weight of zero
means that the transitions are scored as mismatches, while a weight of 1 gives
the transitions the match score. For distantly related DNA sequences, the