clustalx_help

SINGLE MATRIX INPUT FORMAT

The format used for a single matrix is the same as that used by the BLAST program. The scores in the new weight matrix should be similarities. You can use negative as well as positive values if you wish, although the matrix will be automatically adjusted to all positive scores, unless the NEGATIVE MATRIX option is selected. Any lines beginning with a # character are assumed to be comments. The first non-comment line should contain a list of amino acids in any order, using the 1 letter code, followed by a * character. This should be followed by a square matrix of scores, with one row and one column for each amino acid. The last row and column of the matrix (corresponding to the * character) contain the minimum score over the whole matrix.

MATRIX SERIES INPUT FORMAT

ClustalX uses different matrices depending on the mean percent identity of the sequences to be aligned. You can specify a series of matrices and the range of the percent identity for each matrix in a matrix series file. The file is automatically recognised by the word CLUSTAL_SERIES at the beginning of the file. Each matrix in the series is then specified on one line which should start with the word MATRIX. This is followed by the lower and upper limits of the sequence percent identities for which you want to apply the matrix. The final entry on the matrix line is the filename of a BLAST format matrix file (see above for details of the single matrix file format).

Example.

CLUSTAL_SERIES

MATRIX 81 100 /us1/user/julie/matrices/blosum80
MATRIX 61 80 /us1/user/julie/matrices/blosum62
MATRIX 31 60 /us1/user/julie/matrices/blosum45
MATRIX 0 30 /us1/user/julie/matrices/blosum30

<STRONG>PROTEIN GAP PARAMETERS</STRONG>

RESIDUE SPECIFIC PENALTIES are amino acid specific gap penalties that reduce or increase the gap opening penalties at each position in the alignment or sequence. See the documentation for details.
As an example, positions that are rich in glycine are more likely to have an adjacent gap than positions that are rich in valine.

HYDROPHILIC GAP PENALTIES are used to increase the chances of a gap within a run (5 or more residues) of hydrophilic amino acids; these are likely to be loop or random coil regions where gaps are more common. The residues that are "considered" to be hydrophilic can be entered in HYDROPHILIC RESIDUES.

GAP SEPARATION DISTANCE tries to decrease the chances of gaps being too close to each other. Gaps that are less than this distance apart are penalised more than other gaps. This does not prevent close gaps; it makes them less frequent, promoting a block-like appearance of the alignment.

END GAP SEPARATION treats end gaps just like internal gaps for the purposes of avoiding gaps that are too close (set by GAP SEPARATION DISTANCE above). If you turn this off, end gaps will be ignored for this purpose. This is useful when you wish to align fragments where the end gaps are not biologically meaningful.

>>HELP P <<                   Profile and Structure Alignments

By PROFILE ALIGNMENT, we mean alignment using existing alignments. Profile alignments allow you to store alignments of your favourite sequences and add new sequences to them in small bunches at a time. A profile is simply an alignment of one or more sequences (e.g. an alignment output file from ClustalX). Each input can be a single sequence. One or both sets of input sequences may include secondary structure assignments or gap penalty masks to guide the alignment.

Make sure PROFILE ALIGNMENT MODE is selected, using the switch directly above the sequence display area. Then, use the ALIGNMENT menu to do profile and secondary structure alignments.

The profiles can be in any of the allowed input formats with "-" characters used to specify gaps (except for GCG/MSF where "." is used).

You have to load the 2 profiles by choosing FILE, LOAD PROFILE 1 and LOAD PROFILE 2.
Then ALIGNMENT, ALIGN PROFILE 2 TO PROFILE 1 will align the 2 profiles to each other. Secondary structure masks in either profile can be used to guide the alignment. This option compares all the sequences in profile 1 with all the sequences in profile 2 in order to build guide trees which will be used to calculate sequence weights, and select appropriate alignment parameters for the final profile alignment.

You can skip the first stage (pairwise alignments; guide trees) by using old guide tree files (ALIGN PROFILES FROM GUIDE TREES).

The ALIGN SEQUENCES TO PROFILE 1 option will take the sequences in the second profile and align them to the first profile, 1 at a time. This is useful to add some new sequences to an existing alignment, or to align a set of sequences to a known structure. In this case, the second profile set need not be pre-aligned.

You can skip the first stage (pairwise alignments; guide tree) by using an old guide tree file (ALIGN SEQUENCES TO PROFILE 1 FROM TREE).

SAVE LOG FILE will write the alignment calculation scores to a file. The log filename is the same as the input sequence filename, with an extension .log appended.

The alignment parameters can be set using the ALIGNMENT PARAMETERS menu, Pairwise Parameters, Multiple Parameters and Protein Gap Parameters options. These are EXACTLY the same parameters as used by the general, automatic multiple alignment procedure. The general multiple alignment procedure is simply a series of profile alignments. Carrying out a series of profile alignments on larger and larger groups of sequences allows you to manually build up a complete alignment, if necessary editing intermediate alignments.

<STRONG>SECONDARY STRUCTURE PARAMETERS</STRONG>

Use this menu to set secondary structure options. If a solved structure is known, it can be used to guide the alignment by raising gap penalties within secondary structure elements, so that gaps will preferentially be inserted into unstructured surface loop regions.
Alternatively, a user-specified gap penalty mask can be supplied for a similar purpose.

A gap penalty mask is a series of numbers between 1 and 9, one per position in the alignment. Each number specifies how much the gap opening penalty is to be raised at that position (raised by multiplying the basic gap opening penalty by the number), i.e. a mask figure of 1 at a position means no change in gap opening penalty; a figure of 4 means that the gap opening penalty is four times greater at that position, making gaps 4 times harder to open.

The format for gap penalty masks and secondary structure masks is explained in a separate help section.

>>HELP B <<             Secondary Structure / Gap Penalty Masks

The use of secondary structure-based penalties has been shown to improve the accuracy of sequence alignment. Clustal X now allows secondary structure / gap penalty masks to be supplied with the input sequences used during profile alignment. (NB. The secondary structure information is NOT used during multiple sequence alignment). The masks work by raising gap penalties in specified regions (typically secondary structure elements) so that gaps are preferentially opened in the less well conserved regions (typically surface loops).

The USE PROFILE 1(2) SECONDARY STRUCTURE / GAP PENALTY MASK options control whether the input 2D-structure information or gap penalty masks will be used during the profile alignment.

The OUTPUT options control whether the secondary structure and gap penalty masks should be included in the Clustal X output alignments. Showing both is useful for understanding how the masks work. The 2D-structure information is itself useful in judging the alignment quality and in seeing how residue conservation patterns vary with secondary structure.

The HELIX and STRAND GAP PENALTY options provide the value for raising the gap penalty at core Alpha Helical (A) and Beta Strand (B) residues. In CLUSTAL format, capital residues denote the A and B core structure notation.
Basic gap penalties are multiplied by the amount specified.

The LOOP GAP PENALTY option provides the value for the gap penalty in Loops. By default this penalty is not raised. In CLUSTAL format, loops are specified by "." in the secondary structure notation.

The SECONDARY STRUCTURE TERMINAL PENALTY provides the value for setting the gap penalty at the ends of secondary structures. Ends of secondary structures are known to grow or shrink, comparing related structures. Therefore by default these are given intermediate values, lower than the core penalties. All secondary structure read in as lower case in CLUSTAL format gets the reduced terminal penalty.

The HELIX and STRAND TERMINAL POSITIONS options specify the range of structure termini for the intermediate penalties. In the alignment output, these are indicated as lower case. For Alpha Helices, by default, the range spans the end-helical turn (3 residues). For Beta Strands, the default range spans the end residue and the adjacent loop residue, since sequence conservation often extends beyond the actual H-bonded Beta Strand.

Clustal X can read the masks from SWISS-PROT, CLUSTAL or GDE format input files. For many 3-D protein structures, secondary structure information is recorded in the feature tables of SWISS-PROT database entries. You should always check that the assignments are correct - some are quite inaccurate. Clustal X looks for SWISS-PROT HELIX and STRAND assignments e.g.

<PRE>
FT   HELIX       100    115
FT   STRAND      118    119
</PRE>

The structure and penalty masks can also be read from CLUSTAL alignment format as comment lines beginning "!SS_" or "!GM_" e.g.

<PRE>
!SS_HBA_HUMA    ..aaaAAAAAAAAAAaaa.aaaAAAAAAAAAAaaaaaaAaaa.........aaaAAAAAA
!GM_HBA_HUMA    112224444444444222122244444444442222224222111111111222444444
HBA_HUMA        VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
</PRE>

Note that the mask itself is a set of numbers between 1 and 9 each of which is assigned to the residue(s) in the same column below.
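
The relationship between a secondary structure line and a gap penalty mask can be sketched in a few lines of Python. This is an illustration only, not Clustal X code: the function names are hypothetical, and the penalty values (4 for core residues, 2 for terminal residues, 1 for loops) are simply read off the HBA_HUMA example above rather than taken from the program's settings. A single core penalty is assumed for both helix and strand.

```python
# Hypothetical helpers, not part of Clustal X.
# Assumed values: core = 4, terminal = 2, loop = 1 (as in the example above).

def structure_to_mask(ss, core_penalty=4, terminal_penalty=2, loop_penalty=1):
    """Turn a CLUSTAL-style structure string into a gap penalty mask string."""
    for value in (core_penalty, terminal_penalty, loop_penalty):
        if not 1 <= value <= 9:
            raise ValueError("mask figures must lie between 1 and 9")
    digits = []
    for ch in ss:
        if ch == ".":            # loop position: penalty not raised by default
            digits.append(loop_penalty)
        elif ch in "AB":         # core helix (A) or strand (B) residue
            digits.append(core_penalty)
        elif ch in "ab":         # lower case: terminal positions of an element
            digits.append(terminal_penalty)
        else:
            raise ValueError(f"unexpected structure character: {ch!r}")
    return "".join(str(d) for d in digits)


def position_penalties(mask, gap_open):
    """Multiply the basic gap opening penalty by the mask figure per column."""
    return [gap_open * int(figure) for figure in mask]


mask = structure_to_mask("..aaaAAAAaaa.")
print(mask)
print(position_penalties(mask, 10.0))
```

With a basic gap opening penalty of 10, a column carrying mask figure 4 ends up with a penalty of 40, which is the multiplication described in the SECONDARY STRUCTURE PARAMETERS section.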
In GDE flat file format, the masks are specified as text and the names must begin with "SS_ or "GM_.

Either a structure or penalty mask or both may be used. If both are included in an alignment, the user will be asked which is to be used.

>>HELP T <<                            Phylogenetic Trees

Before calculating a tree, you must have an ALIGNMENT in memory. This can be input using the FILE menu, LOAD SEQUENCES option or you should have just carried out a full multiple alignment and the alignment is still in memory. Remember YOU MUST ALIGN THE SEQUENCES FIRST!!!!

The method used is the NJ (Neighbour Joining) method of Saitou and Nei. First you calculate distances (percent divergence) between all pairs of sequences from a multiple alignment; second you apply the NJ method to the distance matrix.

To calculate a tree, use the DRAW N-J TREE option. This gives an UNROOTED tree and all branch lengths. The root of the tree can only be inferred by using an outgroup (a sequence that you are certain branches at the outside of the tree .... certain on biological grounds) OR if you assume a degree of constancy in the 'molecular clock', you can place the root in the 'middle' of the tree (roughly equidistant from all tips).

BOOTSTRAP N-J TREE uses a method for deriving confidence values for the groupings in a tree (first adapted for trees by Joe Felsenstein). It involves making N random samples of sites from the alignment (N should be LARGE, e.g. 500 - 1000); drawing N trees (1 from each sample) and counting how many times each grouping from the original tree occurs in the sample trees. You can set N using the NUMBER OF BOOTSTRAP TRIALS option in the BOOTSTRAP TREE window. In practice, you should use a large number of bootstrap replicates (1000 is recommended, even if it means running the program for an hour on a slow computer). You can also supply a seed number for the random number generator here. Different runs with the same seed will give the same answer.
See the documentation for more details.

EXCLUDE POSITIONS WITH GAPS? With this option, any alignment positions where ANY of the sequences have a gap will be ignored. This means that 'like' will be compared to 'like' in all distances, which is highly desirable. It also automatically throws away the most ambiguous parts of the alignment, which are concentrated around gaps (usually). The disadvantage is that you may throw away much of the data if there are many gaps (which is why it is difficult for us to make it the default).

CORRECT FOR MULTIPLE SUBSTITUTIONS? For small divergence (say <10%) this option makes no difference. For greater divergence, this option corrects for the fact that observed distances underestimate actual evolutionary distances. This is because, as sequences diverge, more than one substitution will happen at many sites. However, you only see one difference when you look at the present day sequences. Therefore, this option has the effect of stretching branch lengths in trees (especially long branches). The corrections used here (for DNA or proteins) are both due to Motoo Kimura. See the documentation for details.

Where possible, this option should be used. However, for VERY divergent sequences, the distances cannot be reliably corrected. You will be warned if this happens. Even if none of the distances in a data set exceed the reliable threshold, if you bootstrap the data, some of the bootstrap distances may randomly exceed the safe limit.

SAVE LOG FILE will write the tree calculation scores to a file. The log filename is the same as the input sequence filename, with an extension .log appended.

<H4>OUTPUT FORMAT OPTIONS</H4>

Four different formats are allowed. None of these displays the tree visually. You can display the tree using the NJPLOT program distributed with Clustal X OR get the PHYLIP package and use the tree drawing facilities there.

1) CLUSTAL FORMAT TREE.
This format is verbose and lists all of the distances between the sequences and the number of alignment positions used for each. The tree is described at the end of the file. It lists the sequences that are joined at each alignment step and the branch lengths. After two sequences are joined, the group is referred to later as a NODE. The number of a NODE is the number of the lowest sequence in that NODE.

2) PHYLIP FORMAT TREE. This format is the New Hampshire format, used by many phylogenetic analysis packages. It consists of a series of nested parentheses, describing the branching order, with the sequence names and branch lengths. It can be read by the NJPLOT program distributed with ClustalX. It can also be used by the RETREE, DRAWGRAM and DRAWTREE programs of the PHYLIP package to see the trees graphically. This is the same format used during multiple alignment for the guide trees. Some other packages that can read and display New Hampshire format are TreeTool, TreeView, and Phylowin.

3) PHYLIP DISTANCE MATRIX. This format just outputs a matrix of all the pairwise distances in a format that can be used by the PHYLIP package. It used to be useful when one could not produce distances from protein sequences in the Phylip package but is now redundant (PROTDIST of Phylip 3.5 now does this).

4) NEXUS FORMAT TREE. This format is used by several popular phylogeny programs, including PAUP and MacClade. The format is described fully in:

Maddison, D. R., D. L. Swofford and W. P. Maddison. 1997. NEXUS: an extensible file format for systematic information. Systematic Biology 46:590-621.

BOOTSTRAP LABELS ON: By default, the bootstrap values are correctly placed on the tree branches of the phylip format output tree. The toggle allows them to be placed on the nodes, which is incorrect, but some display packages (e.g. TreeTool, TreeView and Phylowin) only support node labelling but not branch labelling. Care should be taken to note which branches and labels go together.
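
The nested-parenthesis layout of the PHYLIP (New Hampshire) format described above may be easier to picture with a small example. The Python sketch below is not taken from Clustal X: the tree representation, the function name to_newick and the sequence names are all made up for illustration, and bootstrap labels are left out because this help text does not specify how they are written.

```python
# Hypothetical sketch, not Clustal X code.
# A leaf is (name, branch_length); an internal node is (children, branch_length).
# The top-level node carries no branch length (None), as in an unrooted tree.

def to_newick(node):
    """Return the New Hampshire text for a node and everything below it."""
    first, length = node
    if isinstance(first, str):
        label = first                                   # leaf: sequence name
    else:
        inner = ",".join(to_newick(child) for child in first)
        label = "(" + inner + ")"                       # group: nested parentheses
    if length is None:
        return label
    return f"{label}:{length:.3f}"                      # name or group, then length


tree = (
    [
        ("seqA", 0.1),
        ("seqB", 0.2),
        ([("seqC", 0.05), ("seqD", 0.07)], 0.3),
    ],
    None,
)

print(to_newick(tree) + ";")
# prints (seqA:0.100,seqB:0.200,(seqC:0.050,seqD:0.070):0.300);
```

The three-way split at the top level reflects an unrooted tree such as the one produced by DRAW N-J TREE; each inner pair of parentheses corresponds to one grouping.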