.nr pp 11
.nr sp 11
.nr tp 11
.nr fp 10
.nr fi 0n
.sz 11
.if t \{
.po 1i
.he 'FASTA3.DOC''Release 3.4, Fall, 2003'
.fo ''- % -''\}
.if n \{
.po 0
.na
.nh\}
.ll 6.5i
.ce
\fBCOPYRIGHT NOTICE\fP
.lp
Copyright 1988, 1991, 1992, 1994, 1995, 1996, 1999 by William
R. Pearson and the University of Virginia.  All rights reserved.  The
FASTA program and documentation may not be sold or incorporated into a
commercial product, in whole or in part, without written consent of
William R. Pearson and the University of Virginia.  For further
information regarding permission for use or reproduction, please
contact: David Hudson, Assistant Provost for Research, University of
Virginia, P.O. Box 9025, Charlottesville, VA 22906-9025,
(434) 924-6853.
.uh "\s+2The FASTA program package\s0"
.uh "Introduction"
.pp
This documentation describes version 3 of the FASTA program
package (see W. R. Pearson and D. J. Lipman (1988), "Improved Tools
for Biological Sequence Analysis", PNAS 85:2444-2448 [.wrp881.]; W. R.
Pearson (1996) "Effective protein sequence comparison"
Meth. Enzymol. 266:227-258;[.wrp960.] Pearson et al. (1997) Genomics
46:24-36;[.wrp971.] Pearson, (1999) Meth. in Molecular Biology
132:185-219).[.wrp000.]  Version 3 of the FASTA package contains many
programs for searching DNA and protein databases and one program
(prss3) for evaluating statistical significance from randomly shuffled
sequences.  Several additional analysis programs, including programs
that produce local alignments, are available as part of version 2 of
the FASTA package, which is still available.
.pp
This document is divided into three sections: (1) A summary overview of
the programs in the FASTA3 package; (2) A guide to installing the
programs and databases; (3) A guide to using the FASTA programs.  The
revision history of the programs can be found in the
\fCreadme.v30..v34\fP files.
The programs are easy to use, so if
you are using them on a machine that is administered by someone else,
you can skip section (2) and focus on (1) and (3) to learn how to use
the programs.  If you are installing the programs on your own
machine, you will need to read section (2) carefully.
.sh 1 "An overview of the \f(CBFASTA\fP programs"
.pp
Although this package contains a large number of programs, they
belong to three groups: (1) "Conventional" Library search programs:
FASTA3, FASTX3, FASTY3, TFASTA3, TFASTX3, TFASTY3, SSEARCH3;
(2) Programs for searching with short fragments:
FASTS3, FASTF3, TFASTS3, TFASTF3;
(3) Statistical significance: PRSS3.
Programs that start with \f(CBfast\fP search protein
databases, while \f(CBtfast\fP programs search translated DNA databases.
Table I gives a brief description of the programs.
.lp
.(z
.TS
center;
c s
c s
= =
l lw(5.5i).
\d\fBTable I.  Comparison programs in the FASTA3 package\fP\u

\fCfasta3\fP	T{
Compare a protein sequence to a protein sequence database or a DNA
sequence to a DNA sequence database using the FASTA
algorithm.[.wrp881,wrp960.]  Search speed and selectivity are controlled
with the \fIktup\fP (wordsize) parameter.  For protein comparisons,
\fIktup\fP = 2 by default; \fIktup\fP = 1 is more sensitive but slower.
For DNA comparisons, \fIktup\fP = 6 by default; \fIktup\fP = 3 or
\fIktup\fP = 4 provides higher sensitivity; \fIktup\fP = 1 should be used
for oligonucleotides (DNA query lengths < 20).
T}
\fCssearch3\fP	T{
Compare a protein sequence to a protein sequence database or a DNA
sequence to a DNA sequence database using the Smith-Waterman
algorithm.[.wat815.]  \fCssearch3\fP is about 10 times slower than
\fCfasta3\fP, but is more sensitive for full-length protein sequence
comparison.
T}
\fCfastx3\fP/ \fCfasty3\fP	T{
Compare a DNA sequence to a protein
sequence database, by comparing the translated DNA sequence in three
frames and allowing gaps and frameshifts.
\fCfastx3\fP uses a
simpler, faster algorithm for alignments that allows frameshifts only
between codons; \fCfasty3\fP is slower but produces better alignments
with poor quality sequences because frameshifts are allowed within
codons.
T}
\fCtfastx3\fP/ \fCtfasty3\fP	T{
Compare a protein sequence to a DNA sequence
database, calculating similarities with frameshifts to the forward and
reverse orientations.
T}
\fCtfasta3\fP	T{
Compare a protein sequence to a DNA sequence database, calculating
similarities (without frameshifts) to the three forward and three
reverse reading frames.  \fCtfastx3\fP and \fCtfasty3\fP are preferred
because they calculate similarity over frameshifts.
T}
\fCfastf3/tfastf3\fP	T{
Compares an ordered peptide mixture, as would be obtained by
Edman degradation of a CNBr cleavage of a protein, against a protein
(\fCfastf\fP) or DNA (\fCtfastf\fP) database.
T}
\fCfasts3/tfasts3\fP	T{
Compares a set of short peptide fragments, as would be obtained
from mass-spec. analysis of a protein, against a
protein (\fCfasts\fP) or DNA (\fCtfasts\fP) database.
T}
=	=
.TE
.)z
.sh 1 "Installing FASTA and the sequence databases"
.sh 2 "Obtaining the libraries"
.pp
The FASTA program package does not include any protein or DNA sequence
libraries.  Protein databases are available on CD-ROM from the PIR and
EMBL (see below), or via anonymous FTP from many different sources.
As this document is updated in the fall of 1999, no DNA databases are
available on CD-ROM from the major sequence databases: Genbank at the
National Center for Biotechnology Information
(\fCwww.ncbi.nlm.nih.gov\fP and
\fCftp://ncbi.nlm.nih.gov\fP) and EMBL at the European Bioinformatics
Institute (\fCwww.ebi.ac.uk\fP).  However, the databases are available
via anonymous FTP from both sites.
.sh 3 "The GENBANK DNA sequence library"
.pp
Because of the large size of DNA databases, you will probably want to
keep DNA databases in only one, or possibly two, formats.
The FASTA3
programs that search DNA databases - \fCfasta3\fP, \fCtfastx/y3\fP,
and \fCtfasta3\fP - can read DNA databases in Genbank flatfile (not
ASN.1), FASTA, GCG/compressed-binary, BLAST1.4 (\fCpressdb\fP), and
BLAST2.0 (\fCformatdb\fP) formats, as well as EMBL format.  If you are
also running the GCG suite of sequence analysis programs, you should
use GCG/compressed-binary format or BLAST2.0 format for your
\fCfasta3\fP searches.  If not, BLAST2.0 is a good choice.  These
files are considerably more compact than Genbank flat files, and are
preferred.  The NCBI does not provide software for converting from
Genbank flat files to BLAST2.0 DNA databases, but you can use the
BLAST \fCformatdb\fP program to convert ASN.1 formatted Genbank files,
which are available from the NCBI \fCftp\fP site.
.pp
The NCBI also provides the \fCnr\fP, \fCswissprot\fP, and several EST
databases that are used by BLAST in FASTA format from:
\fCftp://ncbi.nlm.nih.gov/blast/db\fP.  These databases are updated
nightly.
.sh 3 "The NBRF protein sequence library"
.pp
You can obtain the PIR protein sequence database[.pir980.] from:
.(l
National Biomedical Research Foundation
Georgetown University Medical Center
3900 Reservoir Rd, N.W.
Washington, D.C. 20007
.)l
or via ftp from \fCnbrf.georgetown.edu\fP or from the NCBI
(\fCncbi.nlm.nih.gov/repository/PIR\fP).  The data in the \fCascii\fP
directory is in PIR Codata format, which is not widely used.  I
recommend the PIR/VMS format data (libtype=5) in the \fCvms\fP
directory.
.sh 3 "The EBI/EMBL CD-ROM libraries"
.pp
The European Bioinformatics Institute (EBI) distributes both the EMBL
DNA database and the SwissProt database on CD-ROM,[.apw961.]
and they
are available from:
.(l
EMBL-Outstation European Bioinformatics Institute
Wellcome Trust Genome Campus,
Hinxton Hall
Hinxton,
Cambridge CB10 1SD
United Kingdom
Tel: +44 (0)1223 494444
Fax: +44 (0)1223 494468
Email: DATALIB@ebi.ac.uk
.)l
In addition, the SWISS-PROT protein sequence database is available via
anonymous FTP from \fCftp://ftp.expasy.ch/databases/swiss-prot/\fP
(also see \fCwww.expasy.ch\fP).
.sh 2 "Finding the libraries: FASTLIBS"
.pp
The major problem that most new users of the FASTA package have is in
setting up the program to find the databases and their library type.
In general, if you cannot get \fCfasta3\fP to read a sequence
database, it is likely that something is wrong with the \fCFASTLIBS\fP
file.  A common problem is that the database file is found, but either
no sequences are read, or an incorrect number of entries is read.
This is almost always because the library format (\fClibtype\fP) is
incorrect.  Note that a type 5 file (PIR/VMS format) can be read
as a type 0 (default FASTA) format file, and the number of entries
will be correct, but the sequence lengths will not.
.pp
All the search programs in the FASTA3 package use the environment
variable \fCFASTLIBS\fP to find the protein and DNA sequence libraries.  The
\fCFASTLIBS\fP variable contains the name of a file that has the actual
filenames of the libraries.  The \fCfastlibs\fP file included with the
distribution is an example of a file that can be referred to by
FASTLIBS.  To use the \fCfastlibs\fP file, type:
.(l
\fCsetenv FASTLIBS /usr/lib/fasta/fastgbs\fP (BSD UNIX/csh)
or
\fCexport FASTLIBS=/usr/lib/fasta/fastgbs\fP (SysV UNIX/ksh)
.)l
Then edit the \fCfastlibs\fP file to indicate where the protein and DNA
sequence libraries can be found.
If you have a hard disk and your
protein sequence library is kept in the file
\fC/usr/lib/seq/aabank.lib\fP and
your Genbank DNA sequence library is kept in the directory:
\fC/usr/lib/genbank\fP, then \fCfastgbs\fP might contain:
.ne 8
.(l
.ft C
NBRF Protein$0P/usr/lib/seq/aabank.lib 0
SWISS PROT 10$0S/usr/lib/vmspir/swiss.seq 5
GB Primate$1P@/usr/lib/genbank/gpri.nam
GB Rodent$1R@/usr/lib/genbank/grod.nam
GB Mammal$1M@/usr/lib/genbank/gmammal.nam
^    1      ^^^^          4          ^ ^
             23                       (5)
.ft R
.)l
The first line of this file says that a copy of the NBRF
protein sequence database (which is a protein database) is kept
in the file \fC/usr/lib/seq/aabank.lib\fP and can be
selected by typing "P" on the command line or when the database menu
is presented.
.pp
Note that there are 4 or 5 fields in the lines in \fCfastgbs\fP.  The first
field is the description of the library which will be displayed by
FASTA; it ends with a '$'.  The second field (1 character) is a 0 if
the library is a protein library and 1 if it is a DNA library.  The
third field (1 character) is the character to be typed to select the
library.
.pp
The fourth field is the name of the library file.  In the example
above, the \fC/usr/lib/seq/aabank.lib\fP file contains the entire
protein sequence library.  However, the DNA library file names are
preceded by a '@', because these files (\fCgpri.nam, grod.nam,
gmammal.nam\fP) do not contain the sequences; instead they contain the names
of the files which contain the sequences.  This is done because the
GENBANK DNA database is broken down into a large number of smaller
files.  In order to search the entire primate database, you must
search more than a dozen files.
.pp
In addition, an optional fifth field can be used to specify the format
of the library file.  Alternatively, you can specify the library
format in a file of file names (a file preceded by an '@').  This
field must be separated from the file name by a space character ('\ ').
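.pp
As an illustration only (this sketch is not part of the FASTA
distribution, and the names \fCLibEntry\fP and \fCparse_fastlibs_line\fP
are hypothetical), the field layout described above could be split
apart in Python roughly as follows.  It assumes that the description
contains no '$' and that the optional format number follows the file
name after one space; the sample entries repeat the \fCfastgbs\fP
example above.

```python
from typing import NamedTuple, Optional


class LibEntry(NamedTuple):
    description: str        # field 1: text shown in the library menu
    is_dna: bool            # field 2: "0" = protein, "1" = DNA
    select_char: str        # field 3: character typed to pick the library
    path: str               # field 4: library file, or a file of file names
    is_file_list: bool      # True when field 4 was preceded by "@"
    libtype: Optional[int]  # field 5: optional library format number


def parse_fastlibs_line(line: str) -> LibEntry:
    """Split one FASTLIBS entry into the 4 or 5 fields described above."""
    # Assumption: the first "$" ends the description (field 1).
    description, sep, rest = line.rstrip().partition("$")
    if not sep or len(rest) < 3 or rest[0] not in "01":
        raise ValueError("not a FASTLIBS entry: " + repr(line))
    is_dna = rest[0] == "1"
    select_char = rest[1]
    # Field 5, when present, is separated from the file name by a space.
    path, _, fmt = rest[2:].partition(" ")
    is_file_list = path.startswith("@")
    if is_file_list:
        path = path[1:]
    libtype = int(fmt) if fmt.strip() else None
    return LibEntry(description, is_dna, select_char, path,
                    is_file_list, libtype)


if __name__ == "__main__":
    sample = [
        "NBRF Protein$0P/usr/lib/seq/aabank.lib 0",
        "SWISS PROT 10$0S/usr/lib/vmspir/swiss.seq 5",
        "GB Primate$1P@/usr/lib/genbank/gpri.nam",
    ]
    for entry in sample:
        print(parse_fastlibs_line(entry))
```

For the first sample entry the sketch reports a protein library that is
selected with "P", stored in \fC/usr/lib/seq/aabank.lib\fP, with library
type 0; the third entry is reported as a DNA library whose file is a
list of file names and whose library type is not given.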
In the example above, the \fCaabank.lib\fP file is
in Pearson/FASTA format, while the \fCswiss.seq\fP file is in PIR/VMS format
(from the EMBL CD-ROM).  Currently, FASTA can read the following formats:
.(l I
.ft C
0   Pearson/FASTA (>SEQID - comment/sequence)
1   Uncompressed Genbank (LOCUS/DEFINITION/ORIGIN)
2   NBRF CODATA (ENTRY/SEQUENCE)
3   EMBL/SWISS-PROT (ID/DE/SQ)
4   Intelligenetics (;comment/SEQID/sequence)
5   NBRF/PIR VMS (>P1;SEQID/comment/sequence)
6   GCG (version 8.0) Unix Protein and DNA (compressed)
11  NCBI Blast1.3.2 format (unix only)
12  NCBI Blast2.0 format (unix only, fasta32t08 or later)
.ft R
.)l
In particular, this version will work with the EMBL and PIR VMS
formats that are distributed on the EMBL CD-ROM.  The latter format
(PIR VMS) is much faster to search than EMBL format.  This release
also works with the protein and DNA database formats created for the
BLASTP and BLASTN programs by SETDB and PRESSDB and with the new NCBI
search format.  If a library format is not specified, for example,
because you are just comparing two sequences, Pearson/FASTA (format 0)
is used by default.  To specify a library type on the command line,
add it to the library filename and surround the filename and library
type in quotes:
.(l
.ft C
fasta3 query.file "/seqdb/genbank/gbpri1.seq 1"
.ft P
.)l
.pp
You can specify a group of library files by putting a '@' symbol
before a file that contains a list of file names to be searched.  For
example, if @gmam.nam is in the fastgbs file, the file "gmam.nam"
might contain the lines: