README for Clustal W version 1.7 June 1997

                Clustal W version 1.7 Documentation

This file provides some notes on the latest changes, installation and usage
of the Clustal W multiple sequence alignment program.

Julie Thompson (Thompson@EMBL-Heidelberg.DE)
Toby Gibson    (Gibson@EMBL-Heidelberg.DE)

European Molecular Biology Laboratory
Meyerhofstrasse 1
D 69117 Heidelberg
Germany

Des Higgins (Higgins@ucc.ie)
University College Cork
Cork
Ireland

Please e-mail bug reports/complaints/suggestions (polite if possible)
to Toby Gibson or Des Higgins.

Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994)
CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment
through sequence weighting, position-specific gap penalties and weight matrix
choice. Nucleic Acids Research, 22:4673-4680.

--------------------------------------------------------------

What's New (June 1997) in Version 1.7 (since version 1.6).

1. The static arrays used by clustalw for storing the alignment data have been
replaced by dynamically allocated memory. There is now no limit on the number
or length of sequences which can be input.

2. The alignment of DNA sequences now offers a new hard-coded matrix, as well
as the identity matrix used previously. The new matrix is the default scoring
matrix used by the BESTFIT program of the GCG package for the comparison of
nucleic acid sequences. X's and N's are treated as matches to any IUB ambiguity
symbol. All matches score 1.9; all mismatches for IUB symbols score 0.0.

3. The transition weight option for aligning nucleotide sequences has been
changed from an on/off toggle to a weight between 0 and 1. A weight of zero
means that the transitions are scored as mismatches; a weight of 1 gives
transitions the full match score. For distantly related DNA sequences, the
weight should be near to zero; for closely related sequences it can be useful
to assign a higher score.

4. The RSF sequence alignment file format used by GCG Version 9 can now be
read.

5. The clustal sequence alignment file format has been changed to allow
sequence names longer than 10 characters. The maximum length allowed is set in
clustalw.h by the statement:

    #define MAXNAMES 10

For the fasta format, the name is taken as the first string after the '>'
character, stopping at the first white space. (Previously, the first 10
characters were taken, replacing blanks by underscores.)

6. The bootstrap values written in the phylip tree file format can be assigned
either to branches or nodes. The default is to write the values on the nodes,
as this can be read by several commonly-used tree display programs. But note
that this can lead to confusion if the tree is rooted, and the bootstraps may
be better attached to the internal branches: software developers should ensure
they can read the branch label format.

7. The sequence weighting used during sequence to profile alignments has been
changed. The tree weight is now multiplied by the percent identity of the
new sequence compared with the most closely related sequence in the profile.

8. The sequence weighting used during profile to profile alignments has been
changed. A guide tree is now built for each profile separately and the
sequence weights calculated from the two trees. The weights for each
sequence are then multiplied by the percent identity of the sequence compared
with the most closely related sequence in the opposite profile.

9. The adjustment of the Gap Opening and Gap Extension Penalties for sequences
of unequal length has been improved.

10. The default order of the sequences in the output alignment file has been
changed. Previously the default was to output the sequences in the same order
as the input file. Now the default is to use the order in which the sequences
were aligned (from the guide tree/dendrogram), thus automatically grouping
closely related sequences.

11. The option to 'Reset Gaps between alignments' has been switched off by
default.

12. The conservation line output in the clustal format alignment file has been
changed. Three characters are now used:

'*' indicates positions which have a single, fully conserved residue

':' indicates that one of the following 'strong' groups is fully conserved:-
        STA  NEQK  NHQK  NDEQ  QHRK  MILV  MILF  HY  FYW

'.' indicates that one of the following 'weaker' groups is fully conserved:-
        CSA  ATV  SAG  STNK  STPA  SGND  SNDEQK  NDEQHK  NEQHRK  FVLIM  HFY

These are all the positively scoring groups that occur in the Gonnet Pam250
matrix. The strong and weak groups are defined as strong score >0.5 and weak
score <=0.5 respectively.

13. A bug in the modification of the Myers and Miller alignment algorithm
for residue-specific gap penalties has been fixed. This occasionally caused
new gaps to be opened a few residues away from the optimal position.

14. The GCG/MSF input format no longer needs the word PILEUP on the first
line. Several versions can now be recognised:-

    1. The word PILEUP as the first word in the file
    2. The word !!AA_MULTIPLE_ALIGNMENT or !!NA_MULTIPLE_ALIGNMENT
       as the first word in the file
    3. The characters MSF on the first line of the file, and the
       characters .. at the end of the line.

15. The standard command line separator for UNIX systems has been changed from
'/' to '-'. i.e. to give options on the command line, you now type

    clustalw input.aln -gapopen=8.0

instead of

    clustalw input.aln /gapopen=8.0


                ATTENTION SOFTWARE DEVELOPERS!!
                -------------------------------

The CLUSTAL sequence alignment output format has been modified:

1. Names longer than 10 chars are now allowed. (The maximum is specified in
clustalw.h by '#define MAXNAMES'.)

2. The consensus line now consists of three characters: '*', ':' and '.'. (Only
the '*' and '.' were previously used.)

3. An option (not the default) has been added, allowing the user to print out
sequence numbers at the end of each line of the alignment output.

4. Both RNA bases (U) and base ambiguities are now supported in nucleic acid
sequences. In the past, all characters (upper or lower case) other than
a, c, g, t or u were converted to N. Now the following characters are
recognised and retained in the alignment output:
ABCDGHKMNRSTUVWXY (upper or lower case).

5. A blank line inadvertently added in the version 1.6 header has been taken
out again.

--------------------------------------------------------------

What's New (March 1996) in Version 1.6 (since version 1.5).

1) Improved handling of sequences of unequal length. Previously, we
increased the gap extension penalties for both sequences if the two sequences
(or groups of previously aligned sequences) were of different lengths.
Now, we increase the gap opening and extension penalties for the shorter
sequence only. This helps prevent short sequences being stretched out
along longer ones.

2) Added the "Gonnet" series of weight matrices (from Gaston Gonnet and
co-workers at the ETH in Zurich). Fixed a bug in the matrix
choice menu; PAM matrices can now be selected correctly.

3) Added secondary structure/gap penalty masks. These allow you to
include, in an alignment, a position specific set of gap penalties.
You can either set a gap opening penalty at each position or specify
the secondary structure (if protein; alpha helix, beta strand or loop)
and have gap penalties set automatically. This is used to
make gaps harder to open inside helices or strands.

These masks are only used in the "profile alignment" menu. They may be read in
as part of an alignment in a special format (see the on-line help for
details) or associated with each sequence, if the sequences are in Swiss Prot
format and secondary structure information is given. All of the mask
parameters can be set from the profile alignment menu. The
mask is made up of a series of numbers between 1 and 9, one per position.
The gap opening penalty at a position is calculated as the starting penalty
multiplied by the mask value at that site.

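As an illustration of the mask arithmetic described in item 3, here is a
minimal sketch in Python. It is not part of Clustal W. The function name
gap_open_penalties and the representation of the mask as a string of digits
are assumptions made for this example only; the real mask file format is
described in the on-line help.

```python
def gap_open_penalties(base_penalty, mask):
    """Return one gap opening penalty per alignment position.

    base_penalty: the starting gap opening penalty.
    mask: a string of digits '1'..'9', one per position. This in-memory
          form is hypothetical, chosen only for illustration.
    """
    penalties = []
    for position, symbol in enumerate(mask, start=1):
        # The README states that mask values lie between 1 and 9.
        if symbol not in "123456789":
            raise ValueError(
                f"mask value {symbol!r} at position {position} is not in 1-9"
            )
        # Penalty at this site = starting penalty * mask value.
        penalties.append(base_penalty * int(symbol))
    return penalties


# Example: low mask values in loops, higher values inside a helix,
# so that gaps are harder to open within the helix.
example = gap_open_penalties(10.0, "1144441")
```

With a starting penalty of 10.0, positions masked with 4 get a gap opening
penalty of 40.0, while positions masked with 1 keep the starting penalty.
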
4) Added command line options /profile and /sequences.

These allow users to choose between normal profile alignment, where the
two profiles (pre-existing alignments specified in the files
/profile1= and /profile2=) are merged/aligned with each other (/profile),
and the case where the individual sequences in /profile2 are aligned
sequentially with the alignment in /profile1 (/sequences).

5) Fixed bug in modified Myers and Miller algorithm - gap penalty score
was not always calculated properly for type 2 midpoints. This is the core
alignment algorithm.

6) Only one output file format can now be selected from the command line,
i.e. multiple output alignment files are not allowed.

7) Fixed 'bad calls to ckfree' error during calculation of phylip distance
matrix.

8) Fixed command line options /gapopen /gapext /type=protein /negative.

9) Allowed user to change command line separator on UNIX from '/' to '-'.
This allows UNIX users to use the more conventional '-' symbol
for separating command line options. '/' can then be used in UNIX
file names on the command line. The symbol that is used
is specified in the file clustalw.h, which must be edited if you
wish to change it (and the program must then be recompiled). Find the block
of code in clustalw.h that corresponds to the operating system you
are using. These blocks are started by one of the following:

    #ifdef VMS
    #elif MAC
    #elif MSDOS
    #elif UNIX

On the next line after each is the line:

    #define COMMANDSEP '/'

Change this in the appropriate block of code (e.g. the UNIX block) to

    #define COMMANDSEP '-'

if you wish to use the '-' character as command separator.

--------------------------------------------------------------

What's New (April 1995) in Version 1.5 (since version 1.3).