fasta20.doc
option. FASTA -r results.out ... creates a file with a list of scores for every sequence in the library. The list is not sorted, and only includes those scores calculated during the initial scan of the library.

2. Installing the FASTA package

2.1. Installing the programs

2.1.1. Unix version

     The FASTA distribution comes with several makefiles that can be used to compile the FASTA programs. Over the years, as AT&T Unix System V and BSD Unix have converged, these files have become very similar. To begin with, I recommend using the standard Makefile. There are two values in the makefile that should be checked against the values used on your system: the HZ value, which is the frequency in ticks per second used by the times() system call, and the functions available to return random numbers. The HZ value can usually be found by running:

     grep HZ /usr/include/sys/*

If you have a rand48() function that returns a 32-bit random number, use it and use the lines:

     NRAND=nrand48
     RANFLG= -DRAND32

If not, you will need to use the rand() function call and determine whether it returns a 16-bit or a 32-bit value. These functions are used by PRDF and PRSS. If you have problems compiling the programs, you may want to examine the makefile.unx and makefile.sun files to look for differences. I have tried to use very standard Unix functions in these programs, and they have been successfully compiled, with very small changes to the Makefile, on Suns (SunOS 4.1), IBM RS/6000s (AIX), and MIPS machines (under the BSD environment).

2.1.2. IBM-PC/DOS version

     For the IBM-PC/DOS version, the FASTA source code disk contains the complete source code to all of the programs on the other disks. The programs were compiled with Borland's Turbo 'C++', using Borland's MAKE utility. The graphics programs (PLFASTA, TGREASE) use the graphics device drivers supplied with the Turbo 'C' V2.0 package. Also included are the documentation files PROGRAMS.DOC and FORMAT.DOC. You do not need any of the files on the source code disk to run the programs.
The files on this disk are identical to the UNIX and VMS versions that run on larger machines. Also included is the code to compile ALIGN0.EXE. ALIGN0 is the same as ALIGN, but does not penalize for end-gaps.

     If you have the DOS or Macintosh version of the FASTA package, to install the programs you should:

     (1)  Make a new directory (folder) for the FASTA programs. This need not be the same as the directory for your sequence databases.

     (2)  Copy the files from the FASTA source disk to the new directory.

     (3)  (DOS only) Edit your AUTOEXEC.BAT file to (a) modify your PATH command to include the FASTA directory and (b) add the line:

               set FASTLIBS=c:\yourfastadirectory\fastgbs

          On the Macintosh, you may need to edit the "environment" file and change the line that reads:

               FASTLIBS=fastgbs

          to indicate the full directory path for the fastgbs file, for example:

               FASTLIBS=Q105:FASTA:fastgbs

     (4)  Finally, you will need to edit the fastgbs file. This is usually the most confusing part of the installation. An example of this file is shown below; to customize this file for your machine, you will need to change the file names from those provided in the fastgbs file to ones that reflect the directory names and file names you use on your machine. This is explained in more detail below. In addition, some entries in the fastgbs file refer to other files of file names. These files of file names (as opposed to actual database files) may also need to be edited.

2.2. Installing the libraries

2.2.1. The NBRF protein sequence library

     The FASTA program package does not include any protein or DNA sequence libraries. You can obtain the PIR protein sequence database from:

     National Biomedical Research Foundation
     Georgetown University Medical Center
     3900 Reservoir Rd, N.W.
     Washington, D.C. 20007

In addition, this database is available via anonymous ftp from the host "ftp.bchs.uh.edu". It is available in two formats, VMS and CODATA format.
The "VMS" format (library type 5 below) can be searched much faster, can be easily reformatted for use by the "BLAST" rapid searching program, and is compatible with the Genetics Computer Group package of programs. The CODATA format (library type 2) is used by the EUGENE/MBIR computing package from Baylor.

2.2.2. The GENBANK DNA sequence library

     FASTA and TFASTA search the GENBANK DNA sequence library in the flat-file (not ASN.1) format distributed by the National Center for Biotechnology Information, and in the PIR format used by EBI/EMBL. CD-ROMs can be obtained from:

     Genbank
     National Center for Biotechnology Information
     National Library of Medicine
     National Institutes of Health
     8600 Rockville Pike
     Bethesda, MD 20894

     The GenBank DNA sequence library is also available via anonymous FTP from ncbi.nlm.nih.gov.

2.2.3. The EBI/EMBL CD-ROM libraries

     The European Bioinformatics Institute (EBI) is now distributing the EMBL CD-ROM that contains both the complete EMBL DNA sequence database (which should be essentially identical to the GenBank DNA sequence database) and the SWISS-PROT protein sequence database. SWISS-PROT is derived from the NBRF Protein sequence database with additions from the EBI/EMBL DNA sequence database. This CD-ROM is a "best-buy," since it provides both DNA and protein sequence libraries. It is available from:

     European Bioinformatics Institute
     Hinxton Genome Campus, Hinxton Hall
     Hinxton, Cambridge CB10 1RQ, United Kingdom
     Tel: +44 1223 4944
     Fax: +44 1223 494468
     Email: DATALIB@ebi.ac.uk

     In addition, the SWISS-PROT protein sequence database is available via anonymous FTP from ncbi.nlm.nih.gov.

2.3. Finding the libraries: FASTLIBS

     FASTA and TFASTA use the environment variable FASTLIBS to find the protein and DNA sequence libraries. The FASTLIBS variable contains the name of a file that has the actual filenames of the libraries. The FASTGBS file is an example of a file that can be referred to by FASTLIBS.
To use the FASTGBS file, type:

     setenv FASTLIBS /usr/lib/fasta/fastgbs          (BSD UNIX/csh)
or
     export FASTLIBS=/usr/lib/fasta/fastgbs          (SysV UNIX/ksh)

Then edit the FASTGBS file to indicate where the protein and DNA sequence libraries can be found. If you have a hard disk and your protein sequence library is kept in the file /usr/lib/seq/aabank.lib and your Genbank DNA sequence library is kept in the directory /usr/lib/genbank, then fastgbs might contain:

     NBRF Protein$0P/usr/lib/seq/aabank.lib 0
     SWISS PROT 10$0S/usr/lib/vmspir/swiss.seq 5
     GB Primate$1P@/usr/lib/genbank/gpri.nam
     GB Rodent$1R@/usr/lib/genbank/grod.nam
     GB Mammal$1M@/usr/lib/genbank/gmammal.nam
     ^   1    ^^^^             4             ^ ^
               23                             (5)

The first line of this file says that there is a copy of the NBRF protein sequence database (which is a protein database) in the file /usr/lib/seq/aabank.lib, and that it can be selected by typing "P" on the command line or when the database menu is presented.

     Note that there are 4 or 5 fields in the lines in fastgbs. The first field is the description of the library which will be displayed by FASTA; it ends with a '$'. The second field (1 character) is a 0 if the library is a protein library and 1 if it is a DNA library. The third field (1 character) is the character to be typed to select the library.

     The fourth field is the name of the library file. In the example above, the /usr/lib/seq/aabank.lib file contains the entire protein sequence library. However, the DNA library file names are preceded by a '@', because these files (gpri.nam, grod.nam, gmammal.nam) do not contain the sequences; instead they contain the names of the files which contain the sequences. This is done because the GENBANK DNA database is broken down into a large number of smaller files. In order to search the entire primate database, you must search more than a dozen files.

     In addition, an optional fifth field can be used to specify the format of the library file.
Alternatively, you can specify the library format in a file of file names (a file preceded by an '@'). This field must be separated from the file name by a space character (' '). In the example above, the aabank.lib file is in Pearson/FASTA format, while the swiss.seq file is in PIR/VMS format (from the EMBL CD-ROM). Currently, FASTA can read the following formats:

     0   Pearson/FASTA (>SEQID - comment/sequence)
     1   Uncompressed Genbank (LOCUS/DEFINITION/ORIGIN)
     2   NBRF CODATA (ENTRY/SEQUENCE)
     3   EMBL/SWISS-PROT (ID/DE/SQ)
     4   Intelligenetics (;comment/SEQID/sequence)
     5   NBRF/PIR VMS (>P1;SEQID/comment/sequence)
     6   GCG (version 8.0) Unix Protein and DNA (compressed)
     11  NCBI Blast1.3.2 format (unix only)

In particular, this version will work with the EMBL and PIR VMS formats that are distributed on the EMBL CD-ROM. The latter format (PIR VMS) is much faster to search than EMBL format. This release also works with the protein and DNA database formats created for the BLASTP and BLASTN programs by SETDB and PRESSDB and with the new NCBI search format. If a library format is not specified, for example, because you are just comparing two sequences, Pearson/FASTA (format 0) is used by default. To change this default, you may set the LIBTYPE environment variable to a number. For example,

     setenv LIBTYPE 1

would cause the program to use the GenBank LOCUS format by default for libraries (or the second sequence file), but the Pearson/FASTA format would still be used for the query sequence.

     You can specify a group of library files by putting a '@' symbol before a file that contains a list of file names to be searched. For example, if @gmam.nam is in the fastgbs file, the file "gmam.nam" might contain the lines:

     </usr/lib/genbank
     gbpri.seq 1
     gbrod.seq 1
     gbmam.seq 1

In this case, the line beginning with a '<' indicates the directory the files will be found in. The remaining lines name the actual sequence files.
So the first sequence file to be searched would be:

     /usr/lib/genbank/gbpri.seq

The notation "<PIRNAQ:" might be used under the VAX/VMS operating system. Under UNIX, the trailing '/' is left off, so the library directory might be written as "</usr/seqlib".

     With version 1.4 of the FASTA package, the FASTA and TFASTA programs can search a library composed of different files in different sequence formats. For example, you may wish to search the Genbank files (in GenBank flat file format) and the EMBL DNA sequence database on CD-ROM. To do this, you simply list the names and filetypes of the files to be searched in a file of filenames. For example, to search the mammalian portion of Genbank, the unannotated portion of Genbank, and the unannotated portion of the EMBL library, you could use the file:

     </usr/lib/DNA
     gbpri.seq 1 #       (this '#' causes the program to display the size of the library)
     gbrod.seq 1
     gbmam.seq 1
     gbuna.seq 1
     unanno.seq 5 #

     You do not need to include library format numbers if you only use the Pearson/FASTA version of the PIR protein sequence library. If no library type is specified, the program assumes that type 0 is being used (unless you
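
     As an illustration of the fastgbs entry layout and the file-of-file-names convention described in this section, the following Python sketch parses both. It is not part of the FASTA package and does not reproduce the C code that FASTA itself uses. The names LibraryEntry, parse_fastgbs_line, and expand_name_file are hypothetical, and the sketch assumes Unix-style paths, a description that contains no '$', and that a missing format number should be reported as None so that the caller can apply whatever default is in effect.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class LibraryEntry:
    """One line of a fastgbs file (hypothetical container, for illustration)."""
    description: str            # field 1: text shown in the menu, ends at '$'
    is_dna: bool                # field 2: '0' = protein, '1' = DNA
    select_char: str            # field 3: character typed to pick the library
    path: str                   # field 4: library file, or file of file names
    is_file_of_names: bool      # True when field 4 was preceded by '@'
    lib_format: Optional[int]   # optional field 5: library format number


def _format_number(fields: List[str]) -> Optional[int]:
    # A format number, when present, is the first token after the file name.
    return int(fields[0]) if fields and fields[0].isdigit() else None


def parse_fastgbs_line(line: str) -> LibraryEntry:
    """Split one fastgbs line into its 4 or 5 fields."""
    description, dollar, rest = line.rstrip("\n").partition("$")
    tokens = rest[2:].split()
    if not dollar or len(rest) < 3 or not tokens:
        raise ValueError("not a fastgbs entry: %r" % line)
    kind, select_char = rest[0], rest[1]
    if kind not in ("0", "1"):
        raise ValueError("library type must be 0 or 1: %r" % line)
    path = tokens[0]
    is_list = path.startswith("@")
    if is_list:
        path = path[1:]
    return LibraryEntry(description.strip(), kind == "1", select_char,
                        path, is_list, _format_number(tokens[1:]))


def expand_name_file(lines: Iterable[str]) -> List[Tuple[str, Optional[int]]]:
    """Resolve a file of file names into (full path, format number) pairs.

    A line starting with '<' sets the directory; under Unix the trailing
    '/' is left off in the file, so it is added here (an assumption of this
    sketch; VMS-style directories ending in ':' are joined without it).
    """
    directory = ""
    resolved = []
    for raw in lines:
        fields = raw.split()
        if not fields:
            continue
        if fields[0].startswith("<"):
            directory = fields[0][1:]
            continue
        joiner = "" if not directory or directory.endswith(("/", ":")) else "/"
        # A trailing '#' (display library size) is ignored by this sketch.
        resolved.append((directory + joiner + fields[0],
                         _format_number(fields[1:])))
    return resolved
```

     For example, parse_fastgbs_line("GB Primate$1P@/usr/lib/genbank/gpri.nam") yields a DNA entry selected by "P" whose path names a file of file names; reading that file and passing its lines to expand_name_file gives the individual sequence files to search.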