This is the on-line help file for CLUSTAL W (version 1.83).
It should be named or defined as: clustalw_help
except with MSDOS in which case it should be named CLUSTALW.HLP
For full details of usage and algorithms, please read the CLUSTALW.DOC file.
Toby Gibson EMBL, Heidelberg, Germany.
Des Higgins UCC, Cork, Ireland.
Julie Thompson IGBMC, Strasbourg, France.
>>NEW <<
Fasta output
===========
Write/Read sequence with range specified. The command line syntax
for range specification is flexible. You can use any one of the following
forms.
-range=n:m
-range=n-m
-range="n m"
where n is the starting position and m is the length of the sequence.
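The three spellings carry the same two numbers, so they can be reduced to one
form. The Python sketch below is illustrative only and is not the code used by
CLUSTAL W. The function name parse_range is hypothetical, and the sketch
assumes the first number is the starting position and the second is the length.

```python
import re

def parse_range(value):
    """Parse 'n:m', 'n-m' or 'n m' into a (start, length) tuple of ints.

    Hypothetical helper: assumes the first number is the start and the
    second is the length, separated by a colon, a hyphen or whitespace.
    """
    match = re.fullmatch(r"\s*(\d+)\s*[:\-\s]\s*(\d+)\s*", value)
    if match is None:
        raise ValueError(f"unrecognised range: {value!r}")
    return int(match.group(1)), int(match.group(2))
```

For example, parse_range("10:50"), parse_range("10-50") and
parse_range("10 50") all return the same pair.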
Range and range numbers.
=======================
Include range numbers in the output.
-seqno_range=on/off
The sequence range will be appended to the names of the sequences.
PIM: Percentage Identity Matrix
===============================
>>HELP 1 << General help for CLUSTAL W (1.81)
Clustal W is a general purpose multiple alignment program for DNA or proteins.
SEQUENCE INPUT: all sequences must be in 1 file, one after another.
7 formats are automatically recognised: NBRF-PIR, EMBL-SWISSPROT,
Pearson (Fasta), Clustal (*.aln), GCG-MSF (Pileup), GCG9-RSF and GDE flat file.
All non-alphabetic characters (spaces, digits, punctuation marks) are ignored
except "-" which is used to indicate a GAP ("." in MSF-RSF).
To do a MULTIPLE ALIGNMENT on a set of sequences, use item 1 from this menu to
INPUT them; go to menu item 2 to do the multiple alignment.
PROFILE ALIGNMENTS (menu item 3) are used to align 2 alignments. Use this to
add a new sequence to an old alignment, or to use secondary structure to guide
the alignment process. GAPS in the old alignments are indicated using the "-"
character. PROFILES can be input in ANY of the allowed formats; just
use "-" (or "." for MSF-RSF) for each gap position.
PHYLOGENETIC TREES (menu item 4) can be calculated from old alignments (read in
with "-" characters to indicate gaps) OR after a multiple alignment while the
alignment is still in memory.
The program tries to automatically recognise the different file formats used
and to guess whether the sequences are amino acid or nucleotide. This is not
always foolproof.
FASTA and NBRF-PIR formats are recognised by having a ">" as the first
character in the file.
EMBL-Swiss Prot formats are recognised by the letters
ID at the start of the file (the token for the entry name field).
CLUSTAL format is recognised by the word CLUSTAL at the beginning of the file.
GCG-MSF format is recognised by one of the following:
- the word PileUp at the start of the file.
- the word !!AA_MULTIPLE_ALIGNMENT or !!NA_MULTIPLE_ALIGNMENT
at the start of the file.
- the word MSF on the first line of the file, and the characters ..
at the end of this line.
GCG-RSF format is recognised by the word !!RICH_SEQUENCE at the beginning of
the file.
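The recognition rules above can be sketched as a simple chain of tests. This
is a rough Python illustration, not the program's own code. The function name
guess_format is hypothetical, skipping leading white space is an assumption,
and GDE flat files are left out because no rule for them is given here.

```python
def guess_format(text):
    """Guess the input format from the start of the file (illustrative only)."""
    head = text.lstrip()  # assumption: leading white space is skipped
    first_line = head.splitlines()[0] if head else ""
    if head.startswith(">"):
        # The rules above do not separate these two formats any further.
        return "FASTA or NBRF-PIR"
    if head.startswith("ID"):
        return "EMBL-SWISSPROT"
    if head.startswith("CLUSTAL"):
        return "CLUSTAL"
    if head.startswith("!!RICH_SEQUENCE"):
        return "GCG-RSF"
    if (head.startswith("PileUp")
            or head.startswith(("!!AA_MULTIPLE_ALIGNMENT",
                                "!!NA_MULTIPLE_ALIGNMENT"))
            or ("MSF" in first_line and first_line.rstrip().endswith(".."))):
        return "GCG-MSF"
    return "unknown"
```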
If 85% or more of the characters in the sequence are from A,C,G,T,U or N, the
sequence will be assumed to be nucleotide. This works in 97.3% of cases
but watch out!
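The 85% rule can be written down in a few lines. The Python sketch below is
illustrative only. The function name looks_like_nucleotide is hypothetical,
and it assumes that non-alphabetic characters (gaps, digits) are left out of
the count.

```python
def looks_like_nucleotide(sequence, threshold=0.85):
    """Return True if at least `threshold` of the letters are A, C, G, T, U or N."""
    residues = [c for c in sequence.upper() if c.isalpha()]
    if not residues:
        return False  # nothing to judge
    hits = sum(c in "ACGTUN" for c in residues)
    return hits / len(residues) >= threshold
```

A short peptide that happens to be rich in A, C, G and T would be misread as
nucleotide by this test, which is why the guess should be checked.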
>>HELP 2 << Help for multiple alignments
If you have already loaded sequences, use menu item 1 to do the complete
multiple alignment. You will be prompted for 2 output files: 1 for the
alignment itself; another to store a dendrogram that describes the similarity
of the sequences to each other.
Multiple alignments are carried out in 3 stages (automatically done from menu
item 1 ...Do complete multiple alignments now):
1) all sequences are compared to each other (pairwise alignments);
2) a dendrogram (like a phylogenetic tree) is constructed, describing the
approximate groupings of the sequences by similarity (stored in a file).
3) the final multiple alignment is carried out, using the dendrogram as a guide.
PAIRWISE ALIGNMENT parameters control the speed-sensitivity of the initial
alignments.
MULTIPLE ALIGNMENT parameters control the gaps in the final multiple alignments.
RESET GAPS (menu item 7) will remove any new gaps introduced into the sequences
during multiple alignment if you wish to change the parameters and try again.
This only takes effect just before you do a second multiple alignment. You
can make phylogenetic trees after alignment whether or not this is ON.
If you turn this OFF, the new gaps are kept even if you do a second multiple
alignment. This allows you to iterate the alignment gradually. Sometimes, the
alignment is improved by a second or third pass.
SCREEN DISPLAY (menu item 8) can be used to send the output alignments to the
screen as well as to the output file.
You can skip the first stages (pairwise alignments; dendrogram) by using an
old dendrogram file (menu item 3); or you can just produce the dendrogram
with no final multiple alignment (menu item 2).
OUTPUT FORMAT: Menu item 9 (format options) allows you to choose from 7
different alignment formats (CLUSTAL, GCG, NBRF-PIR, PHYLIP, GDE, NEXUS, and FASTA).
>>HELP 3 << Help for pairwise alignment parameters
A distance is calculated between every pair of sequences and these are used to
construct the dendrogram which guides the final multiple alignment. The scores
are calculated from separate pairwise alignments. These can be calculated using
2 methods: dynamic programming (slow but accurate) or by the method of Wilbur
and Lipman (extremely fast but approximate).
You can choose between the 2 alignment methods using menu option 8. The
slow-accurate method is fine for short sequences but will be VERY SLOW for
many (e.g. >100) long (e.g. >1000 residue) sequences.
SLOW-ACCURATE alignment parameters:
These parameters do not have any effect on the speed of the alignments.
They are used to give initial alignments which are then rescored to give percent
identity scores. These % scores are the ones which are displayed on the
screen. The scores are converted to distances for the trees.
1) Gap Open Penalty: the penalty for opening a gap in the alignment.
2) Gap extension penalty: the penalty for extending a gap by 1 residue.
3) Protein weight matrix: the scoring table which describes the similarity
of each amino acid to each other.
4) DNA weight matrix: the scores assigned to matches and mismatches
(including IUB ambiguity codes).
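The rescoring step described above (alignment, then percent identity, then
distance) can be illustrated as follows. This Python sketch makes two
assumptions that are not stated in this help file: columns with a gap in
either sequence are left out of the count, and the distance is taken as
1 - (percent identity / 100). Both function names are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(a, b) for a, b in zip(aligned_a.upper(), aligned_b.upper())
             if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)

def identity_to_distance(pid):
    """Convert a percent identity score to a distance (assumed formula)."""
    return 1.0 - pid / 100.0
```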
FAST-APPROXIMATE alignment parameters:
These similarity scores are calculated from fast, approximate, global
alignments, which are controlled by 4 parameters. 2 techniques are used to make
these alignments very fast: 1) only exactly matching fragments (k-tuples) are
considered; 2) only the 'best' diagonals (the ones with most k-tuple matches)
are used.
K-TUPLE SIZE: This is the size of exactly matching fragment that is used.
INCREASE for speed (max= 2 for proteins; 4 for DNA), DECREASE for sensitivity.
For longer sequences (e.g. >1000 residues) you may need to increase the default.
GAP PENALTY: This is a penalty for each gap in the fast alignments. It has
little effect on the speed or sensitivity except for extreme values.
TOP DIAGONALS: The number of k-tuple matches on each diagonal (in an imaginary
dot-matrix plot) is calculated. Only the best ones (with most matches) are
used in the alignment. This parameter specifies how many. Decrease for speed;
increase for sensitivity.
WINDOW SIZE: This is the number of diagonals around each of the 'best'
diagonals that will be used. Decrease for speed; increase for sensitivity.
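The roles of K-TUPLE SIZE, TOP DIAGONALS and WINDOW SIZE can be seen in a
small sketch. The Python code below only counts k-tuple matches per diagonal
and picks the best diagonals; it is not the Wilbur and Lipman method itself,
and the function names and default values are hypothetical. A diagonal is
identified here by i - j, the offset between the positions in the two
sequences.

```python
from collections import Counter

def ktuple_diagonal_counts(seq_a, seq_b, k=2):
    """Count exactly matching k-tuples on each diagonal (i - j)."""
    positions = {}
    for i in range(len(seq_a) - k + 1):
        positions.setdefault(seq_a[i:i + k], []).append(i)
    counts = Counter()
    for j in range(len(seq_b) - k + 1):
        for i in positions.get(seq_b[j:j + k], []):
            counts[i - j] += 1
    return counts

def best_diagonals(seq_a, seq_b, k=2, top=5, window=0):
    """Keep the `top` diagonals with most matches, plus `window` on each side."""
    counts = ktuple_diagonal_counts(seq_a, seq_b, k)
    kept = set()
    for diagonal, _ in counts.most_common(top):
        kept.update(range(diagonal - window, diagonal + window + 1))
    return sorted(kept)
```

A larger k gives fewer chance matches and less work; a larger top or window
keeps more of the dot-matrix in play, which costs time but can recover weaker
similarities.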
>>HELP 4 << Help for multiple alignment parameters
These parameters control the final multiple alignment. This is the core of the
program and the details are complicated. To fully understand the use of the
parameters and the scoring system, you will have to refer to the documentation.
Each step in the final multiple alignment consists of aligning two alignments
or sequences. This is done progressively, following the branching order in
the GUIDE TREE. The basic parameters to control this are two gap penalties and
the scores for various identical-non-identical residues.
1) and 2) The GAP PENALTIES are set by menu items 1 and 2. These control the
cost of opening up every new gap and the cost of every item in a gap.
Increasing the gap opening penalty will make gaps less frequent. Increasing
the gap extension penalty will make gaps shorter. Terminal gaps are not
penalised.
3) The DELAY DIVERGENT SEQUENCES switch delays the alignment of the most
distantly related sequences until after the most closely related sequences have
been aligned. The setting shows the percent identity level required to delay
the addition of a sequence; sequences that are less identical than this level
to any other sequences will be aligned later.
4) The TRANSITION WEIGHT gives transitions (A <--> G or C <--> T
i.e. purine-purine or pyrimidine-pyrimidine substitutions) a weight between 0
and 1; a weight of zero means that the transitions are scored as mismatches,
while a weight of 1 gives the transitions the match score. For distantly related
DNA sequences, the weight should be near to zero; for closely related sequences
it can be useful to assign a higher score.
5) PROTEIN WEIGHT MATRIX leads to a new menu where you are offered a choice of
weight matrices. The default for proteins in version 1.8 is the PAM series
derived by Gonnet and colleagues. Note, a series is used! The actual matrix
that is used depends on how similar the sequences being aligned at this
alignment step are. Different matrices work differently at each evolutionary
distance.
6) DNA WEIGHT MATRIX leads to a new menu where a single matrix (not a series)
can be selected. The default is the matrix used by BESTFIT for comparison of
nucleic acid sequences.
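The effect of the gap penalties (items 1 and 2) and of the transition weight
(item 4) can be illustrated with a simplified scoring sketch in Python. It is
not the scoring code of the program, which is described in the documentation.
The function names are hypothetical, the match and mismatch scores of 1 and 0
are placeholders, and scaling the transition score linearly between the
mismatch and match scores for weights between 0 and 1 is an assumption.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def gap_cost(length, gap_open, gap_extend, terminal=False):
    """Cost of one gap: opening penalty plus extension penalty per gap position."""
    if length <= 0 or terminal:
        return 0.0  # terminal gaps are not penalised
    return gap_open + gap_extend * length

def dna_pair_score(a, b, transition_weight, match=1.0, mismatch=0.0):
    """Score a pair of bases; transitions get a share of the match score."""
    a, b = a.upper(), b.upper()
    if a == b:
        return match
    is_transition = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    if is_transition:
        # weight 0 gives the mismatch score, weight 1 gives the match score
        return mismatch + transition_weight * (match - mismatch)
    return mismatch
```

With gap_open=10 and gap_extend=0.5, one gap of length 4 costs 12.0 while two
gaps of length 2 cost 22.0 in total, which shows why raising the opening
penalty makes gaps less frequent.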