To search for a string of residues in the sequences, select the sequences to be searched by clicking on the sequence names. You can then enter the string to search for by selecting the SEARCH FOR STRING option. If the string is found in any of the sequences selected, the sequence name and column number are printed below the sequence display.

In PROFILE ALIGNMENT MODE, the two profiles can be merged (normally done after alignment) by selecting ADD PROFILE 2 TO PROFILE 1. The sequences currently displayed as Profile 2 will be appended to Profile 1.

The REMOVE ALL GAPS option will remove all gaps from the sequences currently selected.

WARNING: This option removes ALL gaps, not only those introduced by ClustalX, but also those that were read from the input alignment file. Any secondary structure information associated with the alignment will NOT be automatically realigned.

The REMOVE GAP-ONLY COLUMNS option will remove those positions in the alignment which contain gaps in all sequences. This can occur as a result of removing divergent sequences from an alignment, or if an alignment has been realigned.

>>HELP M <<                          Multiple Alignments

Make sure MULTIPLE ALIGNMENT MODE is selected, using the switch directly above the sequence display area. Then, use the ALIGNMENT menu to do multiple alignments.

Multiple alignments are carried out in 3 stages:

1) all sequences are compared to each other (pairwise alignments);
2) a dendrogram (like a phylogenetic tree) is constructed, describing the approximate groupings of the sequences by similarity (stored in a file).
3) the final multiple alignment is carried out, using the dendrogram as a guide.

The 3 stages are carried out automatically by the DO COMPLETE ALIGNMENT option. You can skip the first stages (pairwise alignments; guide tree) by using an old guide tree file (DO ALIGNMENT FROM GUIDE TREE); or you can just produce the guide tree with no final multiple alignment (PRODUCE GUIDE TREE ONLY).

REALIGN SELECTED SEQUENCES is used to realign badly aligned sequences in the alignment. Sequences can be selected by clicking on the sequence names - see Editing Alignments for more details. The unselected sequences are then 'fixed' and a profile is made including only the unselected sequences. Each of the selected sequences in turn is then realigned to this profile. The realigned sequences will be displayed as a group at the end of the alignment.

REALIGN SELECTED SEQUENCE RANGE is used to realign a small region of the alignment. A residue range can be selected by clicking on the sequence display area. A multiple alignment is then performed, following the 3 stages described above, but only using the selected residue range. Finally the new alignment of the range is pasted back into the full sequence alignment.

By default, gap penalties are used at each end of the subrange in order to penalise terminal gaps. If the REALIGN SEGMENT END GAP PENALTIES option is switched off, gaps can be introduced at the ends of the residue range at no cost.

ALIGNMENT PARAMETERS displays a sub-menu with the following options:

RESET NEW GAPS BEFORE ALIGNMENT will remove any new gaps introduced into the sequences during multiple alignment if you wish to change the parameters and try again. This only takes effect just before you do a second multiple alignment. You can make phylogenetic trees after alignment whether or not this is ON. If you turn this OFF, the new gaps are kept even if you do a second multiple alignment.
This allows you to iterate the alignment gradually. Sometimes, the alignment is improved by a second or third pass.

RESET ALL GAPS BEFORE ALIGNMENT will remove all gaps in the sequences including gaps which were read in from the sequence input file. This only takes effect just before you do a second multiple alignment. You can make phylogenetic trees after alignment whether or not this is ON. If you turn this OFF, all gaps are kept even if you do a second multiple alignment. This allows you to iterate the alignment gradually. Sometimes, the alignment is improved by a second or third pass.

PAIRWISE ALIGNMENT PARAMETERS control the speed/sensitivity of the initial alignments.

MULTIPLE ALIGNMENT PARAMETERS control the gaps in the final multiple alignments.

PROTEIN GAP PARAMETERS displays a temporary window which allows you to set various parameters only used in the alignment of protein sequences.

(SECONDARY STRUCTURE PARAMETERS, for use with the Profile Alignment Mode only, allows you to set various parameters only used with gap penalty masks.)

SAVE LOG FILE will write the alignment calculation scores to a file. The log filename is the same as the input sequence filename, with an extension .log appended.

<H4>OUTPUT FORMAT OPTIONS</H4>

You can choose from 7 different alignment formats (CLUSTAL, GCG, NBRF/PIR, PHYLIP, GDE, NEXUS, FASTA). You can choose more than one (or all 7 if you wish).

CLUSTAL format output is a self explanatory alignment format. It shows the sequences aligned in blocks. It can be read in again at a later date to (for example) calculate a phylogenetic tree or add in new sequences by profile alignment.

GCG output can be used by any of the GCG programs that can work on multiple alignments (e.g. PRETTY, PROFILEMAKE, PLOTALIGN).
It is the same as the GCG .msf format files (multiple sequence file); new in version 7 of GCG.

NEXUS format is used by several phylogeny programs, including PAUP and MacClade.

PHYLIP format output can be used for input to the PHYLIP package of Joe Felsenstein. This is a very widely used package for doing every imaginable form of phylogenetic analysis (MUCH more than the modest introduction offered by this program).

NBRF/PIR: this is the same as the standard PIR format with ONE ADDITION. Gap characters "-" are used to indicate the positions of gaps in the multiple alignment. These files can be re-used as input in any part of clustal that allows sequences (or alignments or profiles) to be read in.

FASTA: this is included for compatibility with numerous sequence analysis programs.

GDE: this format is used by the GDE package of Steven Smith and is understood by SEQLAB in GCG 9 or later.

GDE OUTPUT CASE: sequences in GDE format may be written in either upper or lower case.

CLUSTALW SEQUENCE NUMBERS: residue numbers may be added to the end of the alignment lines in clustalw format.

OUTPUT ORDER is used to control the order of the sequences in the output alignments. By default, it uses the order in which the sequences were aligned (from the guide tree/dendrogram), thus automatically grouping closely related sequences. It can be switched to be the same as the original input order.

PARAMETER OUTPUT: This option will save all your parameter settings in a parameter file (suffix .par) during alignment. The file can be subsequently used to rerun ClustalW using the same parameters.

<H3>ALIGNMENT PARAMETERS</H3>
--------------------

<STRONG>PAIRWISE ALIGNMENT PARAMETERS</STRONG>

A distance is calculated between every pair of sequences and these are used to construct the phylogenetic tree which guides the final multiple alignment. The scores are calculated from separate pairwise alignments.
These can be calculated using 2 methods: dynamic programming (slow but accurate) or by the method of Wilbur and Lipman (extremely fast but approximate).

You can choose between the 2 alignment methods using the PAIRWISE ALIGNMENTS option. The slow/accurate method is fast enough for short sequences but will be VERY SLOW for many (e.g. >100) long (e.g. >1000 residue) sequences.

<STRONG>SLOW-ACCURATE alignment parameters:</STRONG>

These parameters do not have any effect on the speed of the alignments. They are used to give initial alignments which are then rescored to give percent identity scores. These % scores are the ones which are displayed on the screen. The scores are converted to distances for the trees.

Gap Open Penalty:      the penalty for opening a gap in the alignment.
Gap Extension Penalty: the penalty for extending a gap by 1 residue.
Protein Weight Matrix: the scoring table which describes the similarity of each amino acid to each other.
Load protein matrix:   allows you to read in a comparison table from a file.
DNA weight matrix:     the scores assigned to matches and mismatches (including IUB ambiguity codes).
Load DNA matrix:       allows you to read in a comparison table from a file.

See the Multiple alignment parameters, MATRIX option below for details of the matrix input format.

<STRONG>FAST-APPROXIMATE alignment parameters:</STRONG>

These similarity scores are calculated from fast, approximate, global alignments, which are controlled by 4 parameters. 2 techniques are used to make these alignments very fast: 1) only exactly matching fragments (k-tuples) are considered; 2) only the 'best' diagonals (the ones with most k-tuple matches) are used.

GAP PENALTY:   This is a penalty for each gap in the fast alignments. It has little effect on the speed or sensitivity except for extreme values.

K-TUPLE SIZE:  This is the size of exactly matching fragment that is used. INCREASE for speed (max= 2 for proteins; 4 for DNA), DECREASE for sensitivity. For longer sequences (e.g.
>1000 residues) you may wish to increase the default.

TOP DIAGONALS: The number of k-tuple matches on each diagonal (in an imaginary dot-matrix plot) is calculated. Only the best ones (with most matches) are used in the alignment. This parameter specifies how many. Decrease for speed; increase for sensitivity.

WINDOW SIZE:   This is the number of diagonals around each of the 'best' diagonals that will be used. Decrease for speed; increase for sensitivity.

<STRONG>MULTIPLE ALIGNMENT PARAMETERS</STRONG>

These parameters control the final multiple alignment. This is the core of the program and the details are complicated. To fully understand the use of the parameters and the scoring system, you will have to refer to the documentation.

Each step in the final multiple alignment consists of aligning two alignments or sequences. This is done progressively, following the branching order in the GUIDE TREE. The basic parameters to control this are two gap penalties and the scores for various identical/non-identical residues.

The GAP OPENING and EXTENSION PENALTIES can be set here. These control the cost of opening up every new gap and the cost of every item in a gap. Increasing the gap opening penalty will make gaps less frequent. Increasing the gap extension penalty will make gaps shorter. Terminal gaps are not penalised.

The DELAY DIVERGENT SEQUENCES switch delays the alignment of the most distantly related sequences until after the most closely related sequences have been aligned. The setting shows the percent identity level required to delay the addition of a sequence; sequences that are less identical than this level to any other sequences will be aligned later.

The TRANSITION WEIGHT gives transitions (A<-->G or C<-->T i.e. purine-purine or pyrimidine-pyrimidine substitutions) a weight between 0 and 1; a weight of zero means that the transitions are scored as mismatches, while a weight of 1 gives the transitions the match score.
For distantly related DNA sequences, the weight should be near to zero; for closely related sequences it can be useful to assign a higher score. The default is set to 0.5.

The PROTEIN WEIGHT MATRIX option allows you to choose a series of weight matrices. For protein alignments, you use a weight matrix to determine the similarity of non-identical amino acids. For example, Tyr aligned with Phe is usually judged to be 'better' than Tyr aligned with Pro.

There are three 'in-built' series of weight matrices offered. Each consists of several matrices which work differently at different evolutionary distances. To see the exact details, read the documentation. Crudely, we store several matrices in memory, spanning the full range of amino acid distance (from almost identical sequences to highly divergent ones). For very similar sequences, it is best to use a strict weight matrix which only gives a high score to identities and the most favoured conservative substitutions. For more divergent sequences, it is appropriate to use "softer" matrices which give a high score to many other frequent substitutions.

1) BLOSUM (Henikoff). These matrices appear to be the best available for carrying out database similarity (homology) searches. The matrices currently used are: Blosum 80, 62, 45 and 30. BLOSUM was the default in earlier Clustal X versions.

2) PAM (Dayhoff). These have been extremely widely used since the late '70s. We currently use the PAM 20, 60, 120, 350 matrices.

3) GONNET. These matrices were derived using almost the same procedure as the Dayhoff one (above) but are much more up to date and are based on a far larger data set. They appear to be more sensitive than the Dayhoff series. We currently use the GONNET 80, 120, 160, 250 and 350 matrices. This series is the default for Clustal X version 1.8.

We also supply an identity matrix which gives a score of 10 to two identical amino acids and a score of zero otherwise.
This matrix is not very useful.

Load protein matrix: allows you to read in a comparison matrix from a file. This can be either a single matrix or a series of matrices (see below for format).

The DNA WEIGHT MATRIX option allows you to select a single matrix (not a series) used for aligning nucleic acid sequences. Two hard-coded matrices are available:

1) IUB. This is the default scoring matrix used by BESTFIT for the comparison of nucleic acid sequences. X's and N's are treated as matches to any IUB ambiguity symbol. All matches score 1.9; all mismatches for IUB symbols score 0.

2) CLUSTALW(1.6). A previous system used by ClustalW, in which matches score 1.0 and mismatches score 0. All matches for IUB symbols also score 0.

Load DNA matrix: allows you to read in a nucleic acid comparison matrix from a file (just one matrix, not a series).
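
The two hard-coded DNA matrices and the TRANSITION WEIGHT described above can be illustrated with a small Python sketch. This is not ClustalX source code: the function and table names (dna_score, SCHEMES, is_transition) are hypothetical, and two points are assumptions rather than documented behaviour: that the transition weight scales linearly between the mismatch and match scores, and that N or X paired with any base counts as a match under the IUB matrix. Ambiguity codes other than N and X are left out of the IUB case.

```python
# Illustrative sketch only; names and the interpolation rule are assumptions.
BASES = set("ACGT")
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
WILDCARDS = {"N", "X"}

# Match / mismatch scores quoted in the help text for the two matrices.
SCHEMES = {
    "IUB": {"match": 1.9, "mismatch": 0.0},
    "CLUSTALW": {"match": 1.0, "mismatch": 0.0},
}


def is_transition(a, b):
    """True for A<-->G or C<-->T substitutions."""
    pair = {a, b}
    return a != b and (pair <= PURINES or pair <= PYRIMIDINES)


def dna_score(a, b, scheme="IUB", transition_weight=0.5):
    """Score one pair of nucleotides under the chosen scheme."""
    if not 0.0 <= transition_weight <= 1.0:
        raise ValueError("transition_weight must lie between 0 and 1")
    a, b = a.upper(), b.upper()
    match = SCHEMES[scheme]["match"]
    mismatch = SCHEMES[scheme]["mismatch"]

    if a not in BASES or b not in BASES:
        if scheme == "CLUSTALW":
            # Help text: matches for IUB symbols also score 0 in this scheme.
            return 0.0
        if a in WILDCARDS or b in WILDCARDS:
            # Assumption: N and X match anything under the IUB matrix.
            return match
        raise ValueError("other ambiguity codes are outside this sketch")

    if a == b:
        return match
    if is_transition(a, b):
        # Assumption: weight 0 gives the mismatch score, weight 1 the match
        # score, with linear interpolation in between.
        return mismatch + transition_weight * (match - mismatch)
    return mismatch
```

For example, dna_score("A", "G", transition_weight=1.0) returns the IUB match score of 1.9, while dna_score("A", "C") returns the mismatch score of 0.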