
fasta20.doc

                        COPYRIGHT NOTICE

Copyright 1988, 1991, 1992, 1994, 1995, 1996 by William R.
Pearson and the University of Virginia.  All rights reserved. The
FASTA program and documentation may not be sold or incorporated
into a commercial product, in whole or in part, without written
consent of William R. Pearson and the University of Virginia.
For further information regarding permission for use or
reproduction, please contact: David Hudson, Assistant Provost for
Research, University of Virginia, P.O. Box 9025, Charlottesville,
VA 22906-9025, (434) 924-6853

The FASTA program package

Introduction

     This documentation describes the version 2.0x of the FASTA
program package (see W. R. Pearson and D. J. Lipman (1988),
"Improved Tools for Biological Sequence Analysis", PNAS 85:
2444-2448, and W. R.  Pearson (1990) "Rapid and Sensitive Sequence
Comparison with FASTP and FASTA" Methods in Enzymology 183:
63-98). Version 2.0 modifies version 1.8 to include explicit
statistical estimates for similarity scores based on the extreme
value distribution.  In addition, FASTA protein alignments now
use the Smith-Waterman algorithm with no limitation on gap size.
FASTA and SSEARCH now use the BLOSUM50 matrix by default, with
options to change gap penalties on the command line. Version 1.7
replaces rdf2 and rss with prdf and prss, which use the extreme-
value distribution to calculate accurate probability estimates.

Although there are a large number of programs in this package,
they belong to four groups:

    Library search programs: FASTA, FASTX, TFASTA, TFASTX, SSEARCH
    Local homology programs: LFASTA, PLFASTA, LALIGN, PLALIGN, FLALIGN
    Statistical significance: PRDF, RELATE, PRSS, RANDSEQ
    Global alignment: ALIGN

In addition, I have included several programs for protein
sequence analysis, including a Kyte-Doolittle hydropathicity
plotting program (GREASE, TGREASE), and a secondary structure
prediction package (GARNIER).
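
As a rough illustration of the extreme-value statistics mentioned in the introduction, the Python sketch below converts a similarity score into a probability and an expected number of chance matches, using the standard extreme value (Gumbel) form P(S >= x) = 1 - exp(-exp(-lam * (x - mu))). The function names and the parameters mu, lam, and n_library are assumptions made for this example only; how FASTA itself fits and normalizes the distribution is not described in this section.

```python
import math


def evd_pvalue(score, mu, lam):
    """Probability that a chance score is >= `score` under an extreme
    value (Gumbel) distribution with location `mu` and scale parameter
    `lam`.  Both parameters are hypothetical inputs for this sketch."""
    # 1 - exp(-y) written with expm1 to stay accurate for very small y.
    return -math.expm1(-math.exp(-lam * (score - mu)))


def expectation(score, mu, lam, n_library):
    """Expected number of library sequences that would reach `score`
    by chance in a search of `n_library` sequences."""
    return n_library * evd_pvalue(score, mu, lam)


if __name__ == "__main__":
    # Illustrative numbers only; they are not taken from a FASTA run.
    for s in (60, 80, 100, 120):
        print(s, expectation(s, mu=50.0, lam=0.1, n_library=10000))
```

The point of the sketch is only that the expectation falls off roughly exponentially as the score rises above the bulk of the distribution, which is what makes a fixed expectation threshold a usable cutoff.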
     The FASTA sequence comparison programs on this disk are
improved versions of the FASTP program, originally described in
Science (Lipman and Pearson, (1985) Science 227:1435-1441).  We
have made several improvements.  First, the library search
programs use a more sensitive method for the initial comparison
of two sequences which allows the scores of several similar
regions to be combined.  As a result, the results of a library
search are now given with three scores, initn (the new initial
score which may include several similar regions), init1 (the old
fastp initial score from the best initial region), and opt (the
old fastp optimized score allowing gaps in a 32 residue wide
band).

     These programs have also been modified to become "universal"
(hence FAST-A, for FASTA-All, as opposed to FAST-P (protein) or
FAST-N (nucleotides)); by changing the environment variable
SMATRIX, the programs can be used to search protein sequences,
DNA sequences, or whatever you like.  By default, FASTA, LFASTA,
and the PRDF programs automatically recognize protein and DNA
sequences.  Sequences are first read as amino acids, and then
converted to nucleotides if the sequence is greater than 85%
A,C,G,T (the '-n' option can be used to indicate DNA sequences).
TFASTA compares protein sequences to a translated DNA sequence.
Alternative scoring matrices can also be used.  In addition to
the BLOSUM50 matrix for proteins, the PAM250 matrix or matrices
based on simple identities or the genetic code can also be used
for sequence comparisons or evaluation of significance.  Several
different protein sequence matrices have been included;
instructions for constructing your own scoring matrix are
included in the file FORMAT.DOC.

The remainder of this document is divided into three sections:
(1) a brief history of the changes to the FASTA package; (2) A
guide to installing the programs and databases; (3) A guide to
using the FASTA programs.
The programs are very easy to use, so
if you are using them on a machine that is administered by
someone else, you may want to skip to section (3) to learn how to
use the programs, and then read section (1) to look at some of
the more recent changes.  If you are installing the programs on
your own machine, you will need to read section (2) carefully.

1.  Revision History

1.1.  Changes with version 2.0u

     Version 2.0u provides several major improvements over
previous versions of FASTA (and SSEARCH).  The most important is
the incorporation of explicit statistical estimates and
appropriate normalization of similarity scores. This improvement
is discussed in more detail below in the section entitled
Statistical Significance.  In addition, all of the protein
comparison programs now use the BLOSUM50 matrix, with gap
penalties of -12, -2, by default.  BLOSUM50 performs
significantly better than the older PAM250 matrix.  PAM250 can
still be used with the command line option: -s 250.  (DNA
sequence comparisons use a more stringent gap penalty of -16, -4,
which produces excellent statistical estimates when optimized
scores are used. TFASTA uses -16, -4 as well.)

     The quality of the fit of the extreme value distribution to
the actual distribution of similarity scores is summarized with
the Kolmogorov-Smirnov statistic.  The acceptance limits for this
statistic can be found in many statistics books.  In general,
values < 0.10 (N=30) indicate excellent agreement between the
actual and theoretical distributions.  If this statistic is
> 0.2, consider using a higher (more stringent) gap penalty, e.g.
-16, -4 rather than -12, -2.  The default scoring matrix for DNA
has been changed to score +5 for an identity and -4 for a
mismatch.  These are the same scores used by BLASTN.

     With explicit expectation calculations, the program now
shows all scores and alignments with expectations less than 10.0
(with optimized scores, 2.0 without optimization) when the "-Q"
(quiet) mode is used.
The expectation threshold can be changed
with the "-E" option.

     Finally, the algorithm used to produce the final alignments
of protein sequences is now a full Smith-Waterman, with unlimited
gaps.  (The older band-limited alignments are used for DNA
sequences and TFASTA by default, because Smith-Waterman
alignments are very slow for long sequences.)  Both the optimized
and Smith-Waterman scores are reported; if the Smith-Waterman
score is higher, then additional gaps allowed a better alignment
and similarity score to be calculated.

     FASTA searches now optimize similarity scores by default
(this slows searches about 2-fold (worst case) for ktup=2). Thus,
the meaning of the "-o" option has been reversed; "-o" now turns
off optimization and reports results sorted by "initn" scores.
Optimization significantly improves the sensitivity of FASTA, so
that it almost matches Smith-Waterman.  With version 2.0, the
default band width used for optimized calculations can be varied
with the "-y" option.  For proteins with ktup=2, a width of 16
(-y 16) is used; 16 is also used for DNA sequences.  For proteins
and ktup=1, a width of 32 is used. Searches that disable
optimization with the "-o" option will work fine for sequences
that share 25% or more identity in general, but to detect
evolutionary relationships with 20% - 25% identity, the more
sensitive default optimization is often required.  Optimization
is required for accurate statistical estimates with either
protein or DNA sequences.

     The FASTA package now includes FASTX, a program that
compares a DNA sequence to a protein sequence database by
translating the DNA sequence in three frames (the reverse frames
are selected with the -i option) and aligning the three-frame
translation with the sequences in the protein database.
Alignment scores allow frameshifts so that a cDNA or EST sequence
with insertion/deletion errors can be aligned with its homologues
from beginning to end.
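
To make the three-frame translation step concrete, here is a minimal Python sketch of translating a DNA sequence in the three forward frames, or in the three frames of its reverse complement. It assumes the standard genetic code and marks any codon it cannot look up as X; the function names are hypothetical. It does not attempt the frameshift-aware alignment that FASTX performs, and whether FASTX builds its translations exactly this way is not stated in this document.

```python
from itertools import product

# Standard genetic code, listed in TCAG order (an assumption of this
# sketch; the table FASTX uses internally is not given in the text).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate_frame(dna, offset):
    """Translate `dna` starting at `offset` (0, 1 or 2).
    Codons not in the table (e.g. containing N) become 'X';
    stop codons are written as '*'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(offset, len(dna) - 2, 3)
    )


def three_frames(dna, reverse=False):
    """Translate the three frames of one strand; set reverse=True for
    the three frames of the reverse complement."""
    seq = reverse_complement(dna) if reverse else dna
    return [translate_frame(seq, k) for k in range(3)]


if __name__ == "__main__":
    print(three_frames("ATGGCCTAA"))                # forward frames
    print(three_frames("ATGGCCTAA", reverse=True))  # reverse frames
```

An insertion or deletion in a cDNA or EST read moves the true protein from one of these frames to another partway through, which is why alignment scores that allow frameshifts can follow a homologue from beginning to end where a single-frame comparison would break off.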
     With release 20u6, there is also a TFASTX program, which is
a replacement for TFASTA.  TFASTA treats each of the six reading
frames of a DNA library sequence as a different sequence; TFASTX
compares a protein sequence against only two sequences from each
DNA sequence - the forward and reverse orientation.  For a given
orientation, TFASTX calculates a similarity score for alignments
that allow frameshifts, thus considering all possible reading
frames.

     Another new program is included - randseq - which will
produce a randomly shuffled sequence (uniform or local shuffle)
from an input sequence.  This randomly shuffled sequence can be
used to evaluate the statistical estimates produced by FASTA,
SSEARCH, or BLAST.

1.2.  Changes with version 1.7

Version 1.7 has been released to provide the PRDF and PRSS
programs for shuffling sequences and estimating accurately the
probabilities of the unshuffled-sequence scores.

PRDF      a version of RDF2 that calculates the probability
          of a similarity score more accurately by using a fit to
          an extreme value distribution.  Code to fit the extreme
          value distribution parameters and the impetus to update
          RDF2 was provided by Phil Green, U. of Washington.

PRSS      a version of PRDF that uses a rigorous Smith-Waterman
          calculation to score similarities

1.3.  Changes with version 1.6

     FASTA version 1.6 uses a new method for calculating optimal
scores in a band (the optimization or last step in the FASTA
algorithm). In addition, it uses a linear-space method for
calculating the actual alignments.
The FASTA v1.6 package includes
several new programs:

SSEARCH   a program to search a sequence database using the
          rigorous Smith-Waterman algorithm (this program is
          about 100-fold slower than FASTA with ktup=2 for
          proteins).

LALIGN    a rigorous local sequence alignment program that will
          display the N-best local alignments (N=10 by default).

PLALIGN   a version of lalign that plots the local alignments to
          a Tektronix display.

FLALIGN   a version of lalign that plots the local alignments to
          a GCG Figure file.

     The LALIGN/PLALIGN/FLALIGN programs incorporate the "sim"
algorithm described by Huang and Miller (1991) Adv. Appl. Math.
12:337-357.  The SSEARCH and PRSS programs incorporate algorithms
described by Huang, Hardison, and Miller (1990) CABIOS 6:373-381.

     LFASTA and PLFASTA now calculate a different number of local
similarities; they now behave more like LALIGN/PLALIGN.  Since
local alignments of identical sequences produce "mirror-image"
alignments, lalign and lfasta consider only one-half of the
potential alignments between sequences from identical file names.
Thus

    lfasta mchu.aa mchu.aa

displays only two alignments; with earlier versions of the
program, it would have displayed five, including the identity
alignment.  PLFASTA does display five alignments; when two
identical filenames are given, it draws the identity alignment,
calculates the two unique local alignments, draws them, and draws
their mirror images. LFASTA/PLFASTA and LALIGN/PLALIGN use the
filenames, rather than the actual sequences, to determine whether
sequences are identical; you can "trick" the programs into
behaving the old way by putting the same sequence in two
different files.

1.4.  Changes with version 1.5

     FASTA version 1.5 includes a number of substantial revisions
to improve the performance and sensitivity of the program.  It is
now possible to tell the program to optimize all of the initn
scores greater than a threshold.
The threshold is set at the
same value as the old FASTA cutoff score.  Alternatively, you can
tell FASTA to sort the results by the init1, rather than the
initn, score by using the -1 option.  FASTA -1 ... will report
the results the way the older FASTP program did.

     A new method has been provided for selecting libraries. In
the past, one could enter the name of a sequence file to be
searched or a single letter that would specify a library from the
list included in the $FASTLIBS file. Now, you can specify a set
of library files with a string of letters preceded by a '%'.
Thus, if the FASTLIBS file has the lines:

    Genbank 70 primates$1P/seqlib/gbpri.seq 1
    Genbank 70 rodents$1R/seqlib/gbrod.seq 1
    Genbank 70 other mammals$1M/seqlib/gbmam.seq 1
    Genbank 70 vertebrates $1B/seqlib/gbvrt.seq 1

Then the string: "%PRMB" would tell FASTA to search the four
libraries listed above.  The %PRMB string can be entered either
on the command line or when the program asks for a filename or
library letter.

     FASTA1.5 also provides additional flexibility for specifying
the number of results and alignments to be displayed with the -Q
(quiet) option.  The -b number option allows you to specify the
number of sequence scores to show when the search is finished.
Thus

    FASTA -b 100 ...

tells the program to display the top 100 sequence scores. In the
past, if you displayed 100 scores (in -Q mode), you would also
have to store 100 alignments. The -d option allows you to limit
the number of alignments shown.  FASTA -b 100 -d 20 would show 100
scores and 20 alignments.

     Finally, FASTA can provide a complete list of all of the
sequences and scores calculated to a file with the -r (results)
