This is the on-line help file for CLUSTAL W (version 1.83).

It should be named or defined as: clustalw_help
except with MSDOS in which case it should be named CLUSTALW.HLP

For full details of usage and algorithms, please read the CLUSTALW.DOC file.

Toby Gibson        EMBL, Heidelberg, Germany.
Des Higgins        UCC, Cork, Ireland.
Julie Thompson     IGBMC, Strasbourg, France.

>>NEW <<

Fasta output
============

Write/Read sequence with range specified. The command line syntax for range
specification is flexible. You can use any one of the following forms:

    -range=n:m
    -range=n-m
    -range="n m"

where n is the starting position and m is the length of the sequence.

Range and range numbers
=======================

Include range numbers in the output:

    -seqno_range=on/off

The sequence range will be appended to the names of the sequences.

PIM: Percentage Identity Matrix
===============================

>>HELP 1 <<             General help for CLUSTAL W (1.81)

Clustal W is a general purpose multiple alignment program for DNA or proteins.

SEQUENCE INPUT:  all sequences must be in 1 file, one after another.
7 formats are automatically recognised: NBRF-PIR, EMBL-SWISSPROT,
Pearson (Fasta), Clustal (*.aln), GCG-MSF (Pileup), GCG9-RSF and GDE flat file.
All non-alphabetic characters (spaces, digits, punctuation marks) are ignored
except "-" which is used to indicate a GAP ("." in MSF-RSF).

To do a MULTIPLE ALIGNMENT on a set of sequences, use item 1 from this menu to
INPUT them; go to menu item 2 to do the multiple alignment.

PROFILE ALIGNMENTS (menu item 3) are used to align 2 alignments. Use this to
add a new sequence to an old alignment, or to use secondary structure to guide
the alignment process. GAPS in the old alignments are indicated using the "-"
character. PROFILES can be input in ANY of the allowed formats; just
use "-" (or "." for MSF-RSF) for each gap position.

PHYLOGENETIC TREES (menu item 4) can be calculated from old alignments (read in
with "-" characters to indicate gaps) OR after a multiple alignment while the
alignment is still in memory.

The program tries to automatically recognise the different file formats used
and to guess whether the sequences are amino acid or nucleotide. This is not
always foolproof.

FASTA and NBRF-PIR formats are recognised by having a ">" as the first
character in the file.

EMBL-SwissProt formats are recognised by the letters ID at the start of the
file (the token for the entry name field).

CLUSTAL format is recognised by the word CLUSTAL at the beginning of the file.

GCG-MSF format is recognised by one of the following:
  - the word PileUp at the start of the file.
  - the word !!AA_MULTIPLE_ALIGNMENT or !!NA_MULTIPLE_ALIGNMENT
    at the start of the file.
  - the word MSF on the first line of the file, and the characters ..
    at the end of this line.

GCG-RSF format is recognised by the word !!RICH_SEQUENCE at the beginning of
the file.

If 85% or more of the characters in the sequence are from A,C,G,T,U or N, the
sequence will be assumed to be nucleotide. This works in 97.3% of cases
but watch out!

>>HELP 2 <<      Help for multiple alignments

If you have already loaded sequences, use menu item 1 to do the complete
multiple alignment.
You will be prompted for 2 output files: 1 for the alignment itself;
another to store a dendrogram that describes the similarity of the sequences
to each other.

Multiple alignments are carried out in 3 stages (automatically done from menu
item 1 ...Do complete multiple alignments now):

1) all sequences are compared to each other (pairwise alignments);

2) a dendrogram (like a phylogenetic tree) is constructed, describing the
approximate groupings of the sequences by similarity (stored in a file);

3) the final multiple alignment is carried out, using the dendrogram as a guide.

PAIRWISE ALIGNMENT parameters control the speed-sensitivity of the initial
alignments.

MULTIPLE ALIGNMENT parameters control the gaps in the final multiple alignments.

RESET GAPS (menu item 7) will remove any new gaps introduced into the sequences
during multiple alignment if you wish to change the parameters and try again.
This only takes effect just before you do a second multiple alignment. You
can make phylogenetic trees after alignment whether or not this is ON.
If you turn this OFF, the new gaps are kept even if you do a second multiple
alignment. This allows you to iterate the alignment gradually. Sometimes, the
alignment is improved by a second or third pass.

SCREEN DISPLAY (menu item 8) can be used to send the output alignments to the
screen as well as to the output file.

You can skip the first stages (pairwise alignments; dendrogram) by using an
old dendrogram file (menu item 3); or you can just produce the dendrogram
with no final multiple alignment (menu item 2).

OUTPUT FORMAT: Menu item 9 (format options) allows you to choose from 7
different alignment formats (CLUSTAL, GCG, NBRF-PIR, PHYLIP, GDE, NEXUS, and
FASTA).

>>HELP 3 <<      Help for pairwise alignment parameters

A distance is calculated between every pair of sequences and these are used to
construct the dendrogram which guides the final multiple alignment. The scores
are calculated from separate pairwise alignments.
These can be calculated using
2 methods: dynamic programming (slow but accurate) or by the method of Wilbur
and Lipman (extremely fast but approximate).

You can choose between the 2 alignment methods using menu option 8. The
slow-accurate method is fine for short sequences but will be VERY SLOW for
many (e.g. >100) long (e.g. >1000 residue) sequences.

SLOW-ACCURATE alignment parameters:

These parameters do not have any effect on the speed of the alignments.
They are used to give initial alignments which are then rescored to give percent
identity scores. These % scores are the ones which are displayed on the screen.
The scores are converted to distances for the trees.

1) Gap Open Penalty:      the penalty for opening a gap in the alignment.
2) Gap Extension Penalty: the penalty for extending a gap by 1 residue.
3) Protein weight matrix: the scoring table which describes the similarity
                          of each amino acid to each other.
4) DNA weight matrix:     the scores assigned to matches and mismatches
                          (including IUB ambiguity codes).

FAST-APPROXIMATE alignment parameters:

These similarity scores are calculated from fast, approximate, global
alignments, which are controlled by 4 parameters. 2 techniques are used to make
these alignments very fast: 1) only exactly matching fragments (k-tuples) are
considered; 2) only the 'best' diagonals (the ones with most k-tuple matches)
are used.

K-TUPLE SIZE: This is the size of exactly matching fragment that is used.
INCREASE for speed (max= 2 for proteins; 4 for DNA), DECREASE for sensitivity.
For longer sequences (e.g. >1000 residues) you may need to increase the default.

GAP PENALTY: This is a penalty for each gap in the fast alignments. It has
little effect on the speed or sensitivity except for extreme values.

TOP DIAGONALS: The number of k-tuple matches on each diagonal (in an imaginary
dot-matrix plot) is calculated. Only the best ones (with most matches) are
used in the alignment. This parameter specifies how many.
Decrease for speed;
increase for sensitivity.

WINDOW SIZE: This is the number of diagonals around each of the 'best'
diagonals that will be used. Decrease for speed; increase for sensitivity.

>>HELP 4 <<      Help for multiple alignment parameters

These parameters control the final multiple alignment. This is the core of the
program and the details are complicated. To fully understand the use of the
parameters and the scoring system, you will have to refer to the documentation.

Each step in the final multiple alignment consists of aligning two alignments
or sequences. This is done progressively, following the branching order in
the GUIDE TREE. The basic parameters to control this are two gap penalties and
the scores for various identical-non-identical residues.

1) and 2) The GAP PENALTIES are set by menu items 1 and 2. These control the
cost of opening up every new gap and the cost of every item in a gap.
Increasing the gap opening penalty will make gaps less frequent. Increasing
the gap extension penalty will make gaps shorter. Terminal gaps are not
penalised.

3) The DELAY DIVERGENT SEQUENCES switch delays the alignment of the most
distantly related sequences until after the most closely related sequences have
been aligned. The setting shows the percent identity level required to delay
the addition of a sequence; sequences that are less identical than this level
to any other sequences will be aligned later.

4) The TRANSITION WEIGHT gives transitions (A <--> G or C <--> T
i.e. purine-purine or pyrimidine-pyrimidine substitutions) a weight between 0
and 1; a weight of zero means that the transitions are scored as mismatches,
while a weight of 1 gives the transitions the match score. For distantly related
DNA sequences, the weight should be near to zero; for closely related sequences
it can be useful to assign a higher score.

5) PROTEIN WEIGHT MATRIX leads to a new menu where you are offered a choice of
weight matrices.
The default for proteins in version 1.8 is the PAM series
derived by Gonnet and colleagues. Note, a series is used! The actual matrix
that is used depends on how similar the sequences to be aligned at this
alignment step are. Different matrices work differently at each evolutionary
distance.

6) DNA WEIGHT MATRIX leads to a new menu where a single matrix (not a series)
can be selected. The default is the matrix used by BESTFIT for comparison of
nucleic acid sequences.
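
The effect of the TRANSITION WEIGHT (item 4) can be illustrated with a small
Python sketch. This is not the CLUSTAL W source code: the function name
dna_pair_score, the default match and mismatch scores, and the linear
interpolation between them are all assumptions chosen only to reproduce the two
end points stated in the help text (weight 0 scores a transition as a mismatch,
weight 1 gives it the match score).

```python
# Illustrative sketch only; names and default scores are hypothetical,
# not taken from the CLUSTAL W implementation.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def dna_pair_score(a, b, match=1.0, mismatch=0.0, transition_weight=0.5):
    """Score one aligned pair of DNA bases (A, C, G, T only).

    Transitions (purine-purine or pyrimidine-pyrimidine substitutions)
    are scored between the mismatch and match scores, assuming a simple
    linear interpolation controlled by transition_weight (0 to 1).
    """
    if not 0.0 <= transition_weight <= 1.0:
        raise ValueError("transition_weight must be between 0 and 1")
    a, b = a.upper(), b.upper()
    if a == b:
        return match
    pair = {a, b}
    if pair <= PURINES or pair <= PYRIMIDINES:
        # Transition: weight 0 -> mismatch score, weight 1 -> match score.
        return mismatch + transition_weight * (match - mismatch)
    # Transversion (purine <-> pyrimidine) is always a plain mismatch.
    return mismatch
```

With these assumed defaults, dna_pair_score("A", "G", transition_weight=0.0)
scores the transition as a mismatch, while transition_weight=1.0 scores it as a
match; a transversion such as A/C is unaffected by the weight.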