have set LIBTYPE).

Support for the old compressed GenBank files, which have not been distributed for more than four years, has been removed from programs in the FASTA package.

Test the setup by running FASTA. Enter the sequence file 'MUSPLFM.AA' when the program requests it (this file is included with the programs). The program should then ask you to select a protein sequence library. Alternatively, if you run the TFASTA program and use the MUSPLFM.AA query sequence, the program should show you a selection of DNA sequence libraries.

Once the fastgbs file has been set up correctly, you can set FASTLIBS=fastgbs in your AUTOEXEC.BAT file, and you will not need to remember where the libraries are kept or how they are named.

FASTA and TFASTA must open a large number of files when searching and reporting the results of a GENBANK floppy disk format library search. You may have problems with the large number of files under DOS on IBM-PC's (Unix and VMS users will not have these problems). If you are going to search the GENBANK floppy disk format DNA sequence library under DOS, you should add the line:

    FILES=16

to your CONFIG.SYS file. (Typically this is already done for programs like Windows or WordPerfect.)

3. Using the FASTA Package

3.1. Overview

The FASTA sequence comparison programs all require similar information: the name of a query sequence file, a library file, and the ktup parameter. All of the programs can accept arguments on the command line, or they will prompt for the file names and ktup value.

To use FASTA, simply type:

    FASTA

and you will be prompted for:

    the name of the test sequence file
    the name of the library file
    and whether you want ktup = 1 or 2.
    (or 1 to 6 for DNA sequences)

ktup of 2 is about 5 times faster than ktup = 1.
For a 200 aa sequence against a 10,000,000 aa library, the program takes about 30 min with ktup = 2, 150 min with ktup = 1, on a 12 MHz 286 IBM-PC.

The program can also be run by typing

    FASTA test.aa /lib/bigfile.lib ktup (1 or 2)

Included with the package are the test files MUSPLFM.AA, LCBO.AA, MCHU.AA and BOVPRL.SEQ. To make certain that everything is working, you can try:

    fasta musplfm.aa lcbo.aa
and
    tfasta musplfm.aa bovprl.seq

To test the local similarity programs LFASTA and PLFASTA, try:

    lfasta mchu.aa mchu.aa
and
    plfasta mchu.aa mchu.aa
    (use this only on an IBM-PC with graphics or on a Tektronix terminal under UNIX or VMS)

MCHU (calmodulin) has four duplicated calcium binding sites that are clearly detected by LFASTA. For a more complicated example, try MWRTC1.aa, myosin heavy chain.

3.2. Sequence files

The FASTA programs know about three kinds of sequence files (four under VMS): (1) plain sequence files, which can only be used as query sequences or for LFASTA, PRDF, and ALIGN; (2) standard library files, which are the same as plain sequence files except that each sequence is preceded by a comment line with a '>' in the first column; (3) distributed sequence libraries (a broad class that includes the NBRF/PIR VMS and blocked ascii formats, Genbank flat-file format, EMBL flat-file format, and Intelligenetics format). All of the files that you create should be of type (1) or (2). Type (2) files (ones with a '>' comment line before each sequence) can be used as query or library sequence files by all of the programs.

I have included several sample test files, *.AA. The first line may begin with a '>' or ';' followed by a comment. The text after ';' in other lines will be ignored. Spaces and tabs (and anything else that is not an amino-acid code) are ignored. Library files should have the form:

    >Sequence name and identifier
    A F A S Y T .... actual sequence.
    F S S .... second line of sequence.
    >Next sequence name and identifier

This is often referred to as "FASTA" or "Pearson" format.
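
The library layout described above can be read with a few lines of code. The following Python sketch is only an illustration and is not part of the FASTA package; the function name read_fasta and the second example sequence are made up here. It assumes that every alphabetic character is a residue code, which is a simplification of "anything that is not an amino-acid code is ignored".

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (comment, sequence) pairs from lines in '>'-delimited library format.

    Hypothetical helper for illustration only; not code from the FASTA package.
    """
    header: Optional[str] = None
    residues: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            # A new '>' line closes the previous sequence, if there was one.
            if header is not None:
                yield header, "".join(residues)
            header = line[1:].strip()
            residues = []
        elif header is not None:
            # Text after ';' is a comment and is dropped.
            line = line.split(";", 1)[0]
            # Assumption: keep letters only, so spaces, tabs and digits vanish.
            residues.append("".join(ch for ch in line if ch.isalpha()).upper())
    if header is not None:
        yield header, "".join(residues)


# Two-entry library; the second sequence is invented for the example.
example = """>Sequence name and identifier
A F A S Y T
F S S
>Next sequence name and identifier
M K V L
"""

records = list(read_fasta(example.splitlines()))
for name, seq in records:
    print(name, len(seq), seq)
```

Because each sequence carries its own '>' line, several such files can be joined end to end and read back with the same loop.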
You can build your own library by concatenating several sequence files. Just be sure that each sequence is preceded by a line beginning with a '>' with a sequence name.

The test file should not have lines longer than 120 characters, and sequences entered with word processors should use a document mode, with normal carriage returns at the end of lines.

Program Summary

3.3. Sequence search programs

FASTA     Universal sequence comparison. Defaults to comparing protein sequences; if the sequences are > 85% A+C+G+T or the -n option is used, a DNA sequence is assumed.

FASTX     Search a protein sequence library using amino acid sequence comparison to the forward three frames of a translated DNA query sequence. (The reverse frames are specified with the -i option.) Alignment scores allow frameshifts; the final alignment uses a Smith-Waterman type alignment routine (no limit on gaps) that allows frameshifts.

TFASTA    Search DNA library for a protein sequence by translating the DNA sequence to protein in all six frames (three forward frames with the -3 command line option). TFASTA with ktup=2 is about as fast as a DNA FASTA with ktup=4, and is substantially more sensitive. (Also reads the GENBANK library.)

TFASTX    Search DNA library for a protein sequence by translating the DNA sequence to protein in all six frames (three forward frames with the -3 command line option), calculating similarity scores that allow frameshifts. TFASTX produces an optimal Smith-Waterman alignment of the query and translated-library sequence.

SSEARCH   Universal sequence comparison using the Smith-Waterman algorithm (T. F. Smith and M. S. Waterman (1981) J. Mol. Biol. 147:195-197). This program uses code developed by Huang and Miller (X. Huang, R. C. Hardison, W. Miller (1990) CABIOS 6:373-381) for calculating the local similarity score and code from the ALIGN program (see below) for calculating the local alignment.
SSEARCH is about 50 times slower than FASTA with ktup=2 (for proteins).

ALIGN     Optimal global alignment of two sequences with no short-cuts. This program is a slightly modified version of one taken from E. Myers and W. Miller. The algorithm is described in E. Myers and W. Miller, "Optimal Alignments in Linear Space" (CABIOS (1988) 4:11-17).

3.4. Local similarity programs

LFASTA    Local similarity searches showing local alignments. The algorithm used to calculate the local alignment in a band has been improved (Chao, Pearson, and Miller, submitted).

PLFASTA   Local similarity searches with plot output (on the IBM, this program requires that the environment variable BGIDIR be set).

PCLFASTA  (Unix only) Local similarity searches with plot output using pic commands.

LALIGN    Calculates the N-best local alignments using a rigorous algorithm (N=10 by default). The algorithm was developed by Huang and Miller (X. Huang and W. Miller (1991) Adv. Appl. Math. 12:337-357); it is a linear-space version of an algorithm described by M. S. Waterman and M. Eggert (J. Mol. Biol. 197:723-728). Like SSEARCH, LALIGN is rigorous, but also very slow.

PLALIGN   A version of LALIGN that plots its output to a screen or to a Tektronix terminal emulator.

3.5. Statistical Significance

With version 2.0 of the FASTA program distribution, FASTA, TFASTA, and SSEARCH now provide estimates of statistical significance for library searches. Work by Altschul, Arratia, Karlin, Mott, Waterman, and others (see Altschul et al. (1994) Nature Genetics 6:119 for an excellent review) suggests that local sequence similarity scores follow the extreme value distribution, so that P(s > x) = 1 - exp(-exp(-lambda(x-u))), where u = ln(Kmn)/lambda and m,n are the lengths of the query and library sequence. This formula can be rewritten as 1 - exp(-Kmn exp(-lambda x)), which shows that the average score for an unrelated library sequence increases with the logarithm of the length of the library sequence.
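
The two forms of the extreme value formula above can be checked numerically. The Python sketch below is only an illustration: the function names are invented here, and the values chosen for K and lambda are arbitrary placeholders, not parameters used by FASTA or SSEARCH.

```python
import math


def evd_p_location_form(x: float, m: int, n: int, K: float, lam: float) -> float:
    """P(s > x) = 1 - exp(-exp(-lambda (x - u))), with u = ln(K m n) / lambda."""
    u = math.log(K * m * n) / lam
    # -expm1(-t) equals 1 - exp(-t) but stays accurate when t is tiny.
    return -math.expm1(-math.exp(-lam * (x - u)))


def evd_p_rewritten(x: float, m: int, n: int, K: float, lam: float) -> float:
    """P(s > x) = 1 - exp(-K m n exp(-lambda x)), the rewritten form."""
    return -math.expm1(-K * m * n * math.exp(-lam * x))


def location(m: int, n: int, K: float, lam: float) -> float:
    """u = ln(K m n) / lambda, the characteristic score for lengths m and n."""
    return math.log(K * m * n) / lam


# Placeholder parameters, chosen only for illustration.
K, lam = 0.1, 0.25
m = 200           # query length
score = 60.0

p_short = evd_p_rewritten(score, m, 300, K, lam)
p_long = evd_p_rewritten(score, m, 3000, K, lam)
same = math.isclose(p_short, evd_p_location_form(score, m, 300, K, lam),
                    rel_tol=1e-9)

# u grows by ln(10)/lambda when the library sequence is ten times longer.
shift = location(m, 3000, K, lam) - location(m, 300, K, lam)

print("forms agree:", same)
print("P for n=300: ", p_short)
print("P for n=3000:", p_long)
print("shift in u:  ", shift)
```

The same score is less surprising against the longer library sequence, and the shift in u depends only on the ratio of the two lengths, which is the logarithmic length dependence that the regression described next corrects for.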
FASTA and SSEARCH use simple linear regression against the log of the library sequence length to calculate a normalized "z-score" with mean 50, regardless of library sequence length, and variance 10. These z-scores can then be used with the extreme value distribution and the Poisson distribution (to account for the fact that each library sequence comparison is an independent test) to calculate the number of library sequences expected to obtain a score greater than or equal to the score obtained in the search. The original idea and routines to do the linear regression on library sequence length were provided by Phil Green, U. Washington. This version of FASTA and SSEARCH uses a slightly different strategy for fitting the data than those originally provided by Dr. Green.

The expected number of sequences is plotted in the histogram using an "*". Since the parameters for the extreme value distribution are not calculated directly from the distribution of similarity scores, the pattern of "*'s" in the histogram gives a qualitative view of how well the statistical theory fits the similarity scores calculated by FASTA and SSEARCH. For FASTA, if optimized scores are calculated for each sequence in the database (the default), the agreement between the actual distribution of "z-scores" and the expected distribution based on the length dependence of the score and the extreme value distribution is usually very good. Likewise, the distribution of SSEARCH Smith-Waterman scores typically agrees closely with the actual distribution of "z-scores." The agreement with unoptimized scores, ktup=2, is often not very good, with too many high scoring sequences and too few low scoring sequences compared with the predicted relationship between sequence length and similarity score. In those cases, the expectation values may be overestimates.

The statistical routines assume that the library contains a large sample of unrelated sequences. If this is not the case, then the expectation values are meaningless.
Likewise, if there are fewer than 20 sequences in the library, the statistical calculations are not done.

For protein searches, library sequences with E() values < 0.01 for searches of a 10,000 entry protein database are almost always homologous. Frequently sequences with E()-values from 1 - 10 are related as well. Remember, however, that these E() values also reflect differences between the amino acid composition of the query sequence and that of the "average" library sequence. Thus, when searches are done with query sequences with "biased" amino-acid composition, unrelated sequences may have "significant" scores because of sequence bias. The programs below, PRDF and PRSS, can address this problem by calculating similarity scores for random sequences with the same length and amino acid composition.

If optimization is not used ("-o"), E-values for DNA sequences overestimate the significance of the scores that are obtained, and unrelated sequences frequently have E()-values < 0.0005. With optimization, the reliability of the E()-values compares favorably with that of protein sequence comparison. This is in part due to the use of more stringent gap penalties for DNA sequence comparison, -16, -4 rather than -12, -2. With the latter penalties, many unrelated sequences appear to have significant similarity. Nevertheless, since protein sequence comparison is much more sensitive, DNA sequence comparison should not be used to identify sequences that encode protein. Even with ktup=6, optimization rarely increases run-times more than 50% with mRNA-size query sequences. Optimization should be used whenever possible.

Similar comments apply to TFASTA, where higher gap penalties (-16, -4) are required for accurate statistical estimates. Because TFASTA produces so many artificial "coding" sequences with atypical amino acid compositions, the statistical estimates with TFASTA are often overestimates. With optimized scores, ktup=1, and gap penalties of -16, -4, unrelated sequences will sometimes have E() values of 0.1.
If initn scores are used, unrelated sequences may have E() values < 0.01.

PRDF      Improved version of the RDF program that includes accurate probability estimates for all three scoring methods (includes local or window shuffle routine).

PRSS      A version of PRDF that uses the rigorous Smith-Waterman calculation used by SSEARCH.

RANDSEQ   Produces a randomly shuffled sequence from a query sequence.
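
The idea behind shuffling, a random sequence with the same length and amino acid composition as the original, can be sketched in Python as follows. This is an illustration only and is not the code used by PRDF, PRSS, or RANDSEQ. The function names, the peptide, and the window size are made up, and the window shuffle shown (shuffling residues within consecutive fixed-size blocks) is an assumption about what a local shuffle does.

```python
import random
from collections import Counter
from typing import Optional


def shuffle_sequence(sequence: str, rng: Optional[random.Random] = None) -> str:
    """Return a uniformly shuffled copy: same length, same composition."""
    if rng is None:
        rng = random.Random()
    residues = list(sequence)
    rng.shuffle(residues)          # in-place Fisher-Yates shuffle
    return "".join(residues)


def window_shuffle(sequence: str, window: int,
                   rng: Optional[random.Random] = None) -> str:
    """Shuffle residues only within consecutive blocks of `window` residues.

    Assumed behaviour, for illustration: local composition is preserved
    block by block, which a whole-sequence shuffle does not guarantee.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    if rng is None:
        rng = random.Random()
    pieces = []
    for start in range(0, len(sequence), window):
        block = list(sequence[start:start + window])
        rng.shuffle(block)
        pieces.append("".join(block))
    return "".join(pieces)


# Arbitrary peptide, used only as example input.
query = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
rng = random.Random(1)             # fixed seed so runs are repeatable

uniform = shuffle_sequence(query, rng)
local = window_shuffle(query, 10, rng)

print(query)
print(uniform)
print(local)
print(Counter(query) == Counter(uniform) == Counter(local))
```

Comparing the query against many such shuffled sequences gives a distribution of scores for unrelated sequences that share the query's compositional bias.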