option.  FASTA -r results.out ... creates a file with a list of
scores for every sequence in the library.  The list is not
sorted, and only includes those scores calculated during the
initial scan of the library.

2.  Installing the FASTA package

2.1.  Installing the programs

2.1.1.  Unix version

     The FASTA distribution comes with several makefiles that
can be used to compile the FASTA programs.  Over the years, as
AT&T Unix System V and BSD unix have converged, these files have
become very similar.  To begin with, I recommend using the
standard Makefile.  There are two values in the makefile that
should be checked against the values used on your system.  The
first is the HZ value, which is the frequency in ticks per second
used by the times() system call; this value can usually be found
by running:

    grep HZ /usr/include/sys/*

The second is the set of functions available to return random
numbers.  If you have a rand48() function that returns a 32-bit
random number, use it and use the lines:

    NRAND=nrand48
    RANFLG= -DRAND32

If not, you will need to use the rand() function call and
determine whether it returns a 16-bit or a 32-bit value.  These
functions are used by PRDF and PRSS.  If you have problems
compiling the programs, you may want to examine the makefile.unx
and makefile.sun files to look for differences.  I have tried to
use very standard unix functions in these programs, and they have
been successfully compiled, with very small changes to the
Makefile, on Suns (Sun OS 4.1), IBM RS/6000s (AIX), and MIPS
machines (under the BSD environment).

2.1.2.  IBM-PC/DOS version

     For the IBM-PC/DOS version, the FASTA source code disk
contains the complete source code to all of the programs on the
other disks.  The programs were compiled with Borland's Turbo
'C++', using Borland's MAKE utility.  The graphics programs
(PLFASTA, TGREASE) use the graphics device drivers supplied with
the Turbo 'C' V2.0 package.  Also included are the documentation
files PROGRAMS.DOC and FORMAT.DOC.
You do not need any of the files on the source code disk to run
the programs.  The files on this disk are identical to the UNIX
and VMS versions that run on larger machines.  Also included is
the code to compile ALIGN0.EXE.  ALIGN0 is the same as ALIGN, but
does not penalize for end-gaps.

     If you have the DOS or Macintosh version of the FASTA
package, to install the programs you should:

 (1)   Make a new directory (folder) for the FASTA programs.
       This need not be the same as the directory for your
       sequence databases.

 (2)   Copy the files from the FASTA source disk to the new
       directory.

 (3)   (DOS only) Edit your AUTOEXEC.BAT file to (a) modify your
       PATH command to include the FASTA directory and (b) add
       the line:

           set FASTLIBS=c:\yourfastadirectory\fastgbs

       On the Macintosh, you may need to edit the "environment"
       file and change the line that reads:

           FASTLIBS=fastgbs

       to indicate the full directory path for the fastgbs file,
       for example:

           FASTLIBS=Q105:FASTA:fastgbs

 (4)   Finally, you will need to edit the fastgbs file.  This is
       usually the most confusing part of the installation.  An
       example of this file is shown below; to customize this
       file for your machine, you will need to change the file
       names from those provided in the fastgbs file to ones that
       reflect the directory names and file names you use on your
       machine.  This is explained in more detail below.  In
       addition, some entries in the fastgbs file refer to other
       files of file names.  These files of file names (as
       opposed to actual database files) may also need to be
       edited.

2.2.  Installing the libraries

2.2.1.  The NBRF protein sequence library

     The FASTA program package does not include any protein or
DNA sequence libraries.
You can obtain the PIR protein sequence database from:

    National Biomedical Research Foundation
    Georgetown University Medical Center
    3900 Reservoir Rd, N.W.
    Washington, D.C. 20007

In addition, this database is available via anonymous ftp from
the host "ftp.bchs.uh.edu".  It is available in two formats, VMS
and CODATA.  The "VMS" format (library type 5 below) can be
searched much faster, can be easily reformatted for use by the
"BLAST" rapid searching program, and is compatible with the
Genetics Computer Group package of programs.  The CODATA format
(library type 2) is used by the EUGENE/MBIR computing package
from Baylor.

2.2.2.  The GENBANK DNA sequence library

     FASTA and TFASTA search sequences from the GENBANK DNA
sequence library in the "flatfile" (not ASN.1) format distributed
by the National Center for Biotechnology Information, and in the
PIR format used by EBI/EMBL.  CD-ROMs can be obtained from:

    Genbank
    National Center for Biotechnology Information
    National Library of Medicine
    National Institutes of Health
    8600 Rockville Pike
    Bethesda, MD  20894

     The GenBank DNA sequence library is also available via
anonymous FTP from ncbi.nlm.nih.gov.

2.2.3.  The EBI/EMBL CD-ROM libraries

     The European Bioinformatics Institute (EBI) is now
distributing the EMBL CD-ROM that contains both the complete EMBL
DNA sequence database (which should be essentially identical to
the GenBank DNA sequence database) and the SWISS-PROT protein
sequence database.  SWISS-PROT is derived from the NBRF Protein
sequence database with additions from the EBI/EMBL DNA sequence
database.  This CD-ROM is a "best-buy," since it provides both
DNA and protein sequence libraries.
It is available from:

    European Bioinformatics Institute
    Hinxton Genome Campus, Hinxton Hall
    Hinxton, Cambridge CB10 1RQ,
    United Kingdom
    Tel: +44 1223 4944
    Fax: +44 1223 494468
    Email: DATALIB@ebi.ac.uk

     In addition, the SWISS-PROT protein sequence database is
available via anonymous FTP from ncbi.nlm.nih.gov.

2.3.  Finding the libraries: FASTLIBS

     FASTA and TFASTA use the environment variable FASTLIBS to
find the protein and DNA sequence libraries.  The FASTLIBS
variable contains the name of a file that has the actual
filenames of the libraries.  The FASTGBS file is an example of
a file that can be referred to by FASTLIBS.  To use the FASTGBS
file, type:

    setenv FASTLIBS /usr/lib/fasta/fastgbs (BSD UNIX/csh)
    or
    export FASTLIBS=/usr/lib/fasta/fastgbs (SysV UNIX/ksh)

Then edit the FASTGBS file to indicate where the protein and DNA
sequence libraries can be found.  If you have a hard disk and
your protein sequence library is kept in the file
/usr/lib/seq/aabank.lib and your Genbank DNA sequence library is
kept in the directory /usr/lib/genbank, then fastgbs might
contain:

    NBRF Protein$0P/usr/lib/seq/aabank.lib 0
    SWISS PROT 10$0S/usr/lib/vmspir/swiss.seq 5
    GB Primate$1P@/usr/lib/genbank/gpri.nam
    GB Rodent$1R@/usr/lib/genbank/grod.nam
    GB Mammal$1M@/usr/lib/genbank/gmammal.nam
    ^   1    ^^^^       4                   ^     ^
              23                             (5)

The first line of this file says that there is a copy of the NBRF
protein sequence database (which is a protein database) in the
file /usr/lib/seq/aabank.lib, and that it can be selected by
typing "P" on the command line or when the database menu is
presented.

     Note that there are 4 or 5 fields in the lines in fastgbs.
The first field is the description of the library which will be
displayed by FASTA; it ends with a '$'.  The second field (1
character) is 0 if the library is a protein library and 1 if
it is a DNA library.
The third field (1 character) is the character to be typed to
select the library.

     The fourth field is the name of the library file.  In the
example above, the /usr/lib/seq/aabank.lib file contains the
entire protein sequence library.  However, the DNA library file
names are preceded by a '@', because these files (gpri.nam,
grod.nam, gmammal.nam) do not contain the sequences; instead they
contain the names of the files which contain the sequences.  This
is done because the GENBANK DNA database is broken down into a
large number of smaller files.  In order to search the entire
primate database, you must search more than a dozen files.

     In addition, an optional fifth field can be used to specify
the format of the library file.  This field must be separated
from the file name by a space character (' ').  Alternatively,
you can specify the library format in a file of file names (a
file preceded by an '@').  In the example above, the aabank.lib
file is in Pearson/FASTA format, while the swiss.seq file is in
PIR/VMS format (from the EMBL CD-ROM).  Currently, FASTA can read
the following formats:

    0 Pearson/FASTA (>SEQID - comment/sequence)
    1 Uncompressed Genbank (LOCUS/DEFINITION/ORIGIN)
    2 NBRF CODATA (ENTRY/SEQUENCE)
    3 EMBL/SWISS-PROT (ID/DE/SQ)
    4 Intelligenetics (;comment/SEQID/sequence)
    5 NBRF/PIR VMS (>P1;SEQID/comment/sequence)
    6 GCG (version 8.0) Unix Protein and DNA (compressed)
    11 NCBI Blast1.3.2 format  (unix only)

In particular, this version will work with the EMBL and PIR VMS
formats that are distributed on the EMBL CD-ROM.  The latter
format (PIR VMS) is much faster to search than EMBL format.  This
release also works with the protein and DNA database formats
created for the BLASTP and BLASTN programs by SETDB and PRESSDB
and with the new NCBI search format.  If a library format is not
specified, for example, because you are just comparing two
sequences, Pearson/FASTA (format 0) is used by default.
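
     The field layout of a fastgbs line described above can be
sketched as a small parser.  This is only an illustration in
Python and is not part of the FASTA package; the names
LIBRARY_FORMATS, LibraryEntry and parse_fastgbs_line are
hypothetical, and the sketch assumes that library file names
contain no spaces.

```python
from typing import NamedTuple, Optional

# Library format numbers as listed in the text above.
LIBRARY_FORMATS = {
    0: "Pearson/FASTA",
    1: "Uncompressed Genbank",
    2: "NBRF CODATA",
    3: "EMBL/SWISS-PROT",
    4: "Intelligenetics",
    5: "NBRF/PIR VMS",
    6: "GCG (version 8.0)",
    11: "NCBI Blast1.3.2",
}


class LibraryEntry(NamedTuple):
    """One line of a fastgbs file (hypothetical container)."""
    description: str            # field 1: text shown to the user, ends at '$'
    is_dna: bool                # field 2: '0' = protein, '1' = DNA
    select_char: str            # field 3: character typed to pick the library
    path: str                   # field 4: library file, or file of file names
    is_file_of_names: bool      # True when field 4 was preceded by '@'
    lib_format: Optional[int]   # field 5: format number, None if absent


def parse_fastgbs_line(line: str) -> LibraryEntry:
    """Split a fastgbs line into its 4 or 5 fields."""
    description, dollar, rest = line.strip().partition("$")
    if not dollar or len(rest) < 3:
        raise ValueError(f"not a fastgbs entry: {line!r}")
    kind, select_char, location = rest[0], rest[1], rest[2:]
    if kind not in ("0", "1"):
        raise ValueError(f"library type must be 0 or 1: {line!r}")
    # The optional format number is separated from the name by a space.
    path, _, fmt = location.partition(" ")
    fmt = fmt.strip()
    lib_format = int(fmt) if fmt else None
    if lib_format is not None and lib_format not in LIBRARY_FORMATS:
        raise ValueError(f"unknown library format {lib_format}: {line!r}")
    is_file_of_names = path.startswith("@")
    if is_file_of_names:
        path = path[1:]
    return LibraryEntry(description, kind == "1", select_char,
                        path, is_file_of_names, lib_format)


if __name__ == "__main__":
    for text in ("NBRF Protein$0P/usr/lib/seq/aabank.lib 0",
                 "GB Primate$1P@/usr/lib/genbank/gpri.nam"):
        print(parse_fastgbs_line(text))
```

An entry whose lib_format is None carries no format number of its
own, so the default library format, Pearson/FASTA (format 0),
applies to it.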
To change this default, you may set the LIBTYPE environment
variable to a number.  For example,

    setenv LIBTYPE 1

would cause the program to use the GenBank LOCUS format by
default for libraries (or the second sequence file), but the
Pearson/FASTA format would still be used for the query sequence.

     You can specify a group of library files by putting a '@'
symbol before a file that contains a list of file names to be
searched.  For example, if @gmam.nam is in the fastgbs file, the
file "gmam.nam" might contain the lines:

    </usr/lib/genbank
    gbpri.seq 1
    gbrod.seq 1
    gbmam.seq 1

In this case, the line beginning with a '<' indicates the
directory the files will be found in.  The remaining lines name
the actual sequence files.  So the first sequence file to be
searched would be:

    /usr/lib/genbank/gbpri.seq

The notation "<PIRNAQ:" might be used under the VAX/VMS operating
system.  Under UNIX, the trailing '/' is left off, so the library
directory might be written as "</usr/seqlib".

     With version 1.4 of the FASTA package, the FASTA and TFASTA
programs can search a library composed of different files in
different sequence formats.  For example, you may wish to search
the Genbank files (in GenBank flat file format) and the EMBL DNA
sequence database on CD-ROM.  To do this, you simply list the
names and filetypes of the files to be searched in a file of
filenames.  For example, to search the mammalian portion of
Genbank, the unannotated portion of Genbank, and the unannotated
portion of the EMBL library, you could use the file:

    </usr/lib/DNA
    gbpri.seq 1
    #  (this '#' causes the program to display the size of the library)
    gbrod.seq 1
    gbmam.seq 1
    gbuna.seq 1
    unanno.seq 5
    #

     You do not need to include library format numbers if you
only use the Pearson/FASTA version of the PIR protein sequence
library.  If no library type is specified, the program assumes
that type 0 is being used (unless you