(Updated December, 2003)

                        COPYRIGHT NOTICE

Copyright 1988, 1991, 1992, 1994, 1995, 1996, 1999 by William R.
Pearson and the University of Virginia.  All rights reserved. The
FASTA program and documentation may not be sold or incorporated
into a commercial product, in whole or in part, without written
consent of William R. Pearson and the University of Virginia.
For further information regarding permission for use or
reproduction, please contact: David Hudson, Assistant Provost for
Research, University of Virginia, P.O. Box 9025, Charlottesville,
VA 22906-9025, (434) 924-6853

The FASTA program package

Introduction

     This documentation describes version 3 of the FASTA
program package (see W. R. Pearson and D. J. Lipman (1988),
"Improved Tools for Biological Sequence Analysis", PNAS
85:2444-2448 (Pearson and Lipman, 1988); W. R. Pearson (1996)
"Effective protein sequence comparison" Meth. Enzymol.
266:227-258 (Pearson, 1996); Pearson et al. (1997) Genomics
46:24-36 (Zhang et al., 1997); Pearson, (1999) Meth. in
Molecular Biology 132:185-219 (Pearson, 2000)).  Version 3 of the
FASTA package contains many programs for searching DNA and
protein databases and one program (prss3) for evaluating
statistical significance from randomly shuffled sequences.
Several additional analysis programs, including programs that
produce local alignments, are available as part of version 2 of
the FASTA package, which is still available.

     This document is divided into three sections: (1) A summary
overview of the programs in the FASTA3 package; (2) A guide to
installing the programs and databases; (3) A guide to using the
FASTA programs. The revision history of the programs can be found
in the readme.v30..v34 files.
The programs are easy to use, so
if you are using them on a machine that is administered by
someone else, you can skip section (2) and focus on (1) and (3)
to learn how to use the programs.  If you are installing the
programs on your own machine, you will need to read section (2)
carefully.

1.  An overview of the FASTA programs

     Although there are a large number of programs in this
package, they belong to three groups: (1) "Conventional" Library
search programs: FASTA3, FASTX3, FASTY3, TFASTA3, TFASTX3,
TFASTY3, SSEARCH3; (2) Programs for searching with short
fragments: FASTS3, FASTF3, TFASTS3, TFASTF3; (3) Statistical
significance: PRSS3.  Programs whose names start with "fast"
search protein databases, while "tfast" programs search
translated DNA databases.  Table I gives a brief description of
the programs.

            Table I. Comparison programs in the FASTA3 package
---------------------------------------------------------------------------
fasta3             Compare a protein sequence to a protein sequence
                   database or a DNA sequence to a DNA sequence
                   database using the FASTA algorithm (Pearson and
                   Lipman, 1988, Pearson, 1996).  Search speed and
                   selectivity are controlled with the ktup (word
                   size) parameter.  For protein comparisons,
                   ktup=2 by default; ktup=1 is more sensitive but
                   slower.  For DNA comparisons, ktup=6 by default;
                   ktup=3 or ktup=4 provides higher sensitivity;
                   ktup=1 should be used for oligonucleotides (DNA
                   query lengths < 20).

ssearch3           Compare a protein sequence to a protein sequence
                   database or a DNA sequence to a DNA sequence
                   database using the Smith-Waterman algorithm
                   (Smith and Waterman, 1981).  ssearch3 is about
                   10 times slower than FASTA3, but is more
                   sensitive for full-length protein sequence
                   comparison.

fastx3/ fasty3     Compare a DNA sequence to a protein sequence
                   database, by comparing the translated DNA
                   sequence in three frames and allowing gaps and
                   frameshifts.  fastx3 uses a simpler, faster
                   algorithm for alignments that allows frameshifts
                   only between codons; fasty3 is slower but
                   produces better alignments with poor quality
                   sequences because frameshifts are allowed within
                   codons.

tfastx3/ tfasty3   Compare a protein sequence to a DNA sequence
                   database, calculating similarities with
                   frameshifts to the forward and reverse
                   orientations.

tfasta3            Compare a protein sequence to a DNA sequence
                   database, calculating similarities (without
                   frameshifts) to the three forward and three
                   reverse reading frames.  tfastx3 and tfasty3 are
                   preferred because they calculate similarity over
                   frameshifts.

fastf3/tfastf3     Compares an ordered peptide mixture, as would be
                   obtained by Edman degradation of a CNBr cleavage
                   of a protein, against a protein (fastf) or DNA
                   (tfastf) database.

fasts3/tfasts3     Compares a set of short peptide fragments, as
                   would be obtained from mass-spec. analysis of a
                   protein, against a protein (fasts) or DNA
                   (tfasts) database.
---------------------------------------------------------------------------

2.  Installing FASTA and the sequence databases

2.1.  Obtaining the libraries

     The FASTA program package does not include any protein or
DNA sequence libraries.
Protein databases are available on CD-ROM from the PIR and EMBL
(see below), or via anonymous FTP from many different sources.
As this document is updated in the fall of 1999, no DNA databases
are available on CD-ROM from the major sequence databases:
Genbank at the National Center for Biotechnology Information
(www.ncbi.nlm.nih.gov and ftp://ncbi.nlm.nih.gov) and EMBL at the
European Bioinformatics Institute (www.ebi.ac.uk).  However, the
databases are available via anonymous FTP from both sites.

2.1.1.  The GENBANK DNA sequence library

     Because of the large size of DNA databases, you will
probably want to keep DNA databases in only one, or possibly two,
formats.  The FASTA3 programs that search DNA databases - fasta3,
tfastx3/tfasty3, and tfasta3 - can read DNA databases in Genbank
flatfile (not ASN.1), FASTA, GCG/compressed-binary, BLAST1.4
(pressdb), and BLAST2.0 (formatdb) formats, as well as EMBL
format.  If you are also running the GCG suite of sequence
analysis programs, you should use GCG/compressed-binary format or
BLAST2.0 format for your fasta3 searches.  If not, BLAST2.0 is a
good choice.  These files are considerably more compact than
Genbank flat files, and are preferred.  The NCBI does not provide
software for converting from Genbank flat files to Blast2.0 DNA
databases, but you can use the Blast formatdb program to convert
ASN.1 formatted Genbank files, which are available from the NCBI
ftp site.

     The NCBI also provides the nr, swissprot, and several EST
databases that are used by BLAST in FASTA format from:
ftp://ncbi.nlm.nih.gov/blast/db.  These databases are updated
nightly.

2.1.2.  The NBRF protein sequence library

     You can obtain the PIR protein sequence database (Barker et
al., 1998) from:

    National Biomedical Research Foundation
    Georgetown University Medical Center
    3900 Reservoir Rd, N.W.
    Washington, D.C. 20007

or via ftp from nbrf.georgetown.edu or from the NCBI
(ncbi.nlm.nih.gov/repository/PIR).
The data in the ascii directory is in PIR Codata format, which is
not widely used.  I recommend the PIR/VMS format data (libtype=5)
in the vms directory.

2.1.3.  The EBI/EMBL CD-ROM libraries

     The European Bioinformatics Institute (EBI) distributes both
the EMBL DNA database and the SwissProt database on CD-ROM
(Bairoch and Apweiler, 1996), and they are available from:

    EMBL-Outstation European Bioinformatics Institute
    Wellcome Trust Genome Campus,
    Hinxton Hall
    Hinxton,
    Cambridge CB10 1SD
    United Kingdom
    Tel: +44 (0)1223 494444
    Fax: +44 (0)1223 494468
    Email: DATALIB@ebi.ac.uk

In addition, the SWISS-PROT protein sequence database is
available via anonymous FTP from
ftp://ftp.expasy.ch/databases/swiss-prot/ (also see
www.expasy.ch).

2.2.  Finding the libraries: FASTLIBS

     The major problem that most new users of the FASTA package
have is in setting up the program to find the databases and their
library type.  In general, if you cannot get fasta3 to read a
sequence database, it is likely that something is wrong with the
FASTLIBS file.  A common problem is that the database file is
found, but either no sequences are read, or an incorrect number
of entries is read.  This is almost always because the library
format (libtype) is incorrect.  Note that a type 5 file (PIR/VMS
format) can be read as a type 0 (default FASTA) format file, and
the number of entries will be correct, but the sequence lengths
will not.

     All the search programs in the FASTA3 package use the
environment variable FASTLIBS to find the protein and DNA
sequence libraries.  The FASTLIBS variable contains the name of a
file that has the actual filenames of the libraries.  The
fastlibs file included with the distribution is an example of
a file that can be referred to by FASTLIBS.
To use the fastlibs file, type:

    setenv FASTLIBS /usr/lib/fasta/fastgbs (BSD UNIX/csh)
    or
    export FASTLIBS=/usr/lib/fasta/fastgbs (SysV UNIX/ksh)

Then edit the fastlibs file (named fastgbs in this example) to
indicate where the protein and DNA sequence libraries can be
found.  If you have a hard disk and your protein sequence library
is kept in the file /usr/lib/seq/aabank.lib and your Genbank DNA
sequence library is kept in the directory /usr/lib/genbank, then
fastgbs might contain:

    NBRF Protein$0P/usr/lib/seq/aabank.lib 0
    SWISS PROT 10$0S/usr/lib/vmspir/swiss.seq 5
    GB Primate$1P@/usr/lib/genbank/gpri.nam
    GB Rodent$1R@/usr/lib/genbank/grod.nam
    GB Mammal$1M@/usr/lib/genbank/gmammal.nam
    ^   1    ^^^^       4                   ^     ^
              23                             (5)

The first line of this file says that a copy of the NBRF protein
sequence database (which is a protein database) is kept in the
file /usr/lib/seq/aabank.lib, and that it can be selected by
typing "P" on the command line or when the database menu is
presented.

     Note that there are 4 or 5 fields in the lines in fastgbs.
The first field is the description of the library which will be
displayed by FASTA; it ends with a '$'.  The second field (1
character) is a 0 if the library is a protein library and 1 if
it is a DNA library.  The third field (1 character) is the
character to be typed to select the library.

     The fourth field is the name of the library file.  In the
example above, the /usr/lib/seq/aabank.lib file contains the
entire protein sequence library.  However, the DNA library file
names are preceded by a '@', because these files (gpri.nam,
grod.nam, gmammal.nam) do not contain the sequences; instead they
contain the names of the files which contain the sequences.  This
is done because the GENBANK DNA database is broken down into a
large number of smaller files.  In order to search the entire
primate database, you must search more than a dozen files.
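
The field layout described above can be illustrated with a small
Python sketch.  This is not part of the FASTA package and does
not reproduce how the FASTA programs read the file; the names
LibEntry and parse_fastlibs_line are hypothetical, and the sketch
assumes that the description never contains a '$' and that the
optional trailing number seen in the example follows the file
name after a single space.

```python
from typing import NamedTuple, Optional


class LibEntry(NamedTuple):
    """One line of a fastlibs-style file (hypothetical helper type)."""
    description: str        # field 1: text shown in the menu, ends at '$'
    is_dna: bool            # field 2: '0' = protein library, '1' = DNA library
    select_char: str        # field 3: character typed to select the library
    path: str               # field 4: library file (without any leading '@')
    indirect: bool          # True if field 4 was preceded by '@'
    libtype: Optional[int]  # optional trailing number, None if absent


def parse_fastlibs_line(line: str) -> LibEntry:
    """Split one fastlibs line into the fields described in the text."""
    description, sep, rest = line.strip().partition("$")
    if not sep or len(rest) < 3:
        raise ValueError(f"not a fastlibs line: {line!r}")
    kind, select_char, spec = rest[0], rest[1], rest[2:]
    if kind not in ("0", "1"):
        raise ValueError(f"second field must be 0 or 1, got {kind!r}")
    indirect = spec.startswith("@")
    if indirect:
        spec = spec[1:]
    # Assumption: a space separates the file name from the optional number.
    path, _, fmt = spec.partition(" ")
    libtype = int(fmt) if fmt.strip() else None
    return LibEntry(description, kind == "1", select_char, path,
                    indirect, libtype)


if __name__ == "__main__":
    for text in ("NBRF Protein$0P/usr/lib/seq/aabank.lib 0",
                 "GB Primate$1P@/usr/lib/genbank/gpri.nam"):
        print(parse_fastlibs_line(text))
```

A parser like this can be handy for checking a fastlibs file
before a search, for example to confirm that every '@' entry is
flagged as indirect and that DNA and protein entries carry the
intended second field.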

     In addition, an optional fifth field can be used to specify
the format of the library file.  This field must be separated
from the file name by a space character (' ').  Alternatively,
you can specify the library format in a file of file names (a
file preceded by an '@').  In the example above, the aabank.lib
file is in Pearson/FASTA format, while the swiss.seq file is in
PIR/VMS format (from the EMBL CD-ROM).  Currently, FASTA can read
the following formats:

    0 Pearson/FASTA (>SEQID - comment/sequence)
    1 Uncompressed Genbank (LOCUS/DEFINITION/ORIGIN)
    2 NBRF CODATA (ENTRY/SEQUENCE)
    3 EMBL/SWISS-PROT (ID/DE/SQ)
    4 Intelligenetics (;comment/SEQID/sequence)
    5 NBRF/PIR VMS (>P1;SEQID/comment/sequence)
    6 GCG (version 8.0) Unix Protein and DNA (compressed)
    11 NCBI Blast1.3.2 format  (unix only)
    12 NCBI Blast2.0 format  (unix only, fasta32t08 or later)

In particular, this version will work with the EMBL and PIR VMS
formats that are distributed on the EMBL CD-ROM.  The latter
format (PIR VMS) is much faster to search than EMBL format.  This
release also works with the protein and DNA database formats
created for the BLASTP and BLASTN programs by SETDB and PRESSDB
and with the new NCBI search format.  If a library format is not
specified, for example, because you are just comparing two
sequences, Pearson/FASTA (format 0) is used by default.  To
specify a library type on the command line, add it to the library
filename and surround the filename and library type in quotes:

    fasta3 query.file "/seqdb/genbank/gbpri1.seq 1"

     You can specify a group of library files by putting a '@'
symbol before a file that contains a list of file names to be
searched.  For example, if @gmam.nam is in the fastgbs file, the
file "gmam.nam" might contain the lines:

    </seqdb/genbank
    gbpri1.seq 1
    gbpri2.seq 1
    gbpri3.seq 1
    gbpri4.seq 1
    gbrod.seq 1
    gbmam.seq 1

In this case, the line beginning with a '<' indicates the
directory the files will be found in.
The remaining lines name the actual sequence files.  So the first
sequence file to be searched would be:

    /seqdb/genbank/gbpri1.seq

The notation "<PIRNAQ:" might be used under the VAX/VMS operating
system.  Under UNIX, the trailing '/' is left off, so the library
directory might be written as "</usr/seqlib".

     The FASTA programs can search a database composed of
different files in different sequence formats.  For example, you
may wish to search the Genbank files (in GenBank flat file
format) and the EMBL DNA sequence database on CD-ROM.  To do
this, you simply list the names and filetypes of the files to be
searched in a file of filenames.  For example, to search the
mammalian portion of Genbank, the unannotated portion of Genbank,
and the unannotated portion of the EMBL library, you could use
the file:

    </usr/lib/DNA
    gbpri.seq 1
    #  (this '#' causes the program to display the size of the library)
    gbrod.seq 1
    ...
    gbmam.seq 1
    ...
    gbuna.seq 1
    ...
    unanno.seq 5
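
The way a file of filenames resolves to actual library files can
be sketched in Python.  This is only an illustration under stated
assumptions, not the code used by the FASTA programs: the
function name expand_name_file is hypothetical, the sketch
handles UNIX-style directories only (not VMS notation such as
"<PIRNAQ:"), it simply skips lines beginning with '#', and it
assumes that an entry without a format number is treated as
format 0 (Pearson/FASTA).

```python
import posixpath
from typing import Iterable, List, Tuple


def expand_name_file(lines: Iterable[str],
                     default_type: int = 0) -> List[Tuple[str, int]]:
    """Return (full_path, libtype) pairs for a file of filenames.

    A line starting with '<' sets the directory for the names that
    follow it; every other non-empty line is "name [libtype]".
    """
    directory = ""
    entries: List[Tuple[str, int]] = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            # '#' lines are directives to the search program; this
            # sketch does not act on them.
            continue
        if line.startswith("<"):
            directory = line[1:]
            continue
        name, _, fmt = line.partition(" ")
        # Assumption: a missing format number means format 0.
        libtype = int(fmt) if fmt.strip() else default_type
        path = posixpath.join(directory, name) if directory else name
        entries.append((path, libtype))
    return entries


if __name__ == "__main__":
    example = ["</seqdb/genbank", "gbpri1.seq 1", "gbpri2.seq 1"]
    for path, libtype in expand_name_file(example):
        print(path, libtype)
```

Because each entry carries its own format number, a single list
can mix files of different library types, which matches the
mixed-format search described above.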
