📄 fasta3x.me

.(l
.ft C
</seqdb/genbank
gbpri1.seq 1
gbpri2.seq 1
gbpri3.seq 1
gbpri4.seq 1
gbrod.seq 1
gbmam.seq 1
.ft R
.)l
In this case, the line beginning with a '<' indicates the directory
in which the files will be found.  The remaining lines name the actual
sequence files.  So the first sequence file to be searched would be:
.(l
.ft C
/seqdb/genbank/gbpri1.seq
.ft R
.)l
The notation "\fC<PIRNAQ:\fP" might be used under the VAX/VMS operating
system. Under UNIX, the trailing '/' is left off, so the library
directory might be written as "\fC</usr/seqlib\fP".
.pp
The FASTA programs can search a database composed of different files
in different sequence formats.  For example, you may wish to search
the GenBank files (in GenBank flat file format) and the EMBL DNA
sequence database on CD-ROM.  To do this, you simply list the names
and filetypes of the files to be searched in a file of filenames.  For
example, to search the mammalian portion of GenBank, the unannotated
portion of GenBank, and the unannotated portion of the EMBL library,
you could use the file:
.(l I
.ft C
</usr/lib/DNA
gbpri.seq 1\&#  (this '#' causes the program to display the size of the library)
gbrod.seq 1\&...
gbmam.seq 1\&...
gbuna.seq 1\&...
unanno.seq 5\&#
.ft R
.)l
.(l I F
You do not need to include library format numbers if you only use the
Pearson/FASTA version of the PIR protein sequence library.  If no
library type is specified, the program assumes that type 0 is being
used.
.)l
.pp
Test the setup by running FASTA.  Enter the sequence
file '\fCmgstm1.aa\fP' when the program requests it (this file is
included with the programs).  The program should then ask you to
select a protein sequence library.
Alternatively, if you run the
TFASTA program and use the mgstm1.aa query sequence, the program
should show you a selection of DNA sequence libraries.
Once the fastgbs file has been set up correctly, you can
set FASTLIBS=fastgbs in your AUTOEXEC.BAT file, and you will not need to
remember where the libraries are kept or how they are named.
.ne 8
.sh 1 "Using the FASTA Package"
.sh 2 "Overview"
.pp
The FASTA sequence comparison programs all require similar
information: the name of a query sequence file, a library file, and
the \fIktup\fP parameter.  All of the programs can accept arguments
on the command line, or they will prompt for the file names and
\fIktup\fP value.
.lp
To use FASTA, simply type:
.(l
.ft C
\f(CBFASTA\fP
and you will be prompted for:
.in +0.5i
the name of the test sequence file
the name of the library file
and whether you want ktup = 1 or 2 (or 1 to 6 for DNA sequences)
(ktup of 2 is about 5 times faster than ktup = 1)
.ft R
.)l
The program can also be run by typing
.(l
.ft C
FASTA test.aa /lib/bigfile.lib \fIktup\fP (1 or 2)
.ft R
.)l
.lp
Included with the package are several test files.
To make certain that everything is working, you can try:
.(l
.ft C
fasta musplfm.aa prot_test.lib
and
tfastx mgstm1.aa gst.nlib
.ft R
.)l
.sh 2 "Sequence files"
.pp
The \fCfasta3\fP programs know about three kinds of sequence files:
(1) plain sequence files - files that contain nothing but
sequence residues - can only be used as query sequences. (2) FASTA
format files.  These are the same as plain sequence files, except that
each sequence is preceded by a comment line with a '>' in the first
column. (3) distributed sequence libraries (this is a broad class that
includes the NBRF/PIR VMS and blocked ascii formats, GenBank flat-file
format, EMBL flat-file format, and Intelligenetics format).  All of the
files that you create should be of type (1) or (2).
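.pp
As a rough illustration of a type (2) file, the following Python sketch
splits FASTA format text into comment/sequence pairs.  It is not part of
the FASTA package; the function name \fCread_fasta\fP and the sample
records are hypothetical, and keeping only alphabetic characters is a
simplifying assumption about which characters count as residues.
.(l
.ft C
```python
def read_fasta(lines):
    """Yield (comment, residues) pairs from FASTA-format text lines."""
    name, parts = None, []
    for line in lines:
        line = line.rstrip()
        if line.startswith(">"):
            # a '>' in the first column starts a new sequence
            if name is not None:
                yield name, "".join(parts)
            name, parts = line[1:], []
        elif name is not None:
            # ASSUMPTION: keep letters only; spaces, tabs and other
            # characters are dropped
            parts.append("".join(ch for ch in line if ch.isalpha()))
    if name is not None:
        yield name, "".join(parts)


# Hypothetical records, for illustration only
sample = [
    ">seq1 hypothetical example",
    "MWRT CGPPYT",
    "MWKS",
    ">seq2",
    "ACGT",
]
records = list(read_fasta(sample))
for comment, residues in records:
    print(comment, len(residues))
```
.ft R
.)l
This sketch skips any text that appears before the first '>' line, so a
plain sequence file of type (1) would yield no records.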
FASTA format
files (ones with a '>' and comment before the sequence) are preferred,
because they can be used as query or library sequence files by all of
the programs.
.pp
I have included several sample test files, \fC*.aa\fP and \fC*.seq\fP,
as well as two small sequence libraries, \fCprot_test.lib\fP and
\fCgst.nlib\fP.  The first line of a sequence file may begin with a '>'
followed by a comment.
Spaces and tabs (and anything else that is not an amino-acid code) are
ignored.
.pp
Library files should have the form:
.(l
.ft C
>Sequence name and identifier
A F A S Y T .... actual sequence.
F S S       .... second line of sequence.
>Next sequence name and identifier
.ft R
.)l
This is often referred to as "FASTA" or "Pearson" format.  You can
build your own library by concatenating several sequence files.  Just
be sure that each sequence is preceded by a line beginning with a '>'
with a sequence name.
.pp
The test file should not have lines longer than 120 characters, and
sequences entered with word processors should use a document
mode, with normal carriage returns at the end of lines.
.pp
A different format is required to specify the ordered peptide mixture for
\fCfastf3/tfastf3\fP. For example:
.(l I
.ft C
>mgstm1
MGCEN,
MIDYP,
MLLAY,
MLLGY
.ft P
.)l
indicates \fCM\fP in the first position of all four peptides (as
from CNBr), \fCG, I, L\fP (twice) in the second position (first cycle),
\fCC,D,L\fP (twice) in the third position, etc.  The commas (\fC,\fP)
are required to indicate the number of fragments in the mixture, but
there should be no comma after the last residue.
.pp
For the \fCfasts3/tfasts3\fP program, the format is the same, except that
there is no requirement for the peptides to be the same length.
.sh 1 "Statistical Significance"
.pp
All the programs in the FASTA3 package attempt to calculate accurate
estimates of the statistical significance of a match. For
\fCfasta3\fP, \fCssearch3\fP, and \fCfastx3/y3\fP, these estimates are
very accurate [.wrp971,wrp981.].  Altschul et al. [.alt940.]
provide
an excellent review of the statistics of local similarity scores.
Local sequence similarity scores follow the extreme value
distribution, so that P(s > x) = 1 - exp(-exp(-lambda(x-u))), where u =
ln(Kmn)/lambda and m and n are the lengths of the query and library
sequences. This formula can be rewritten as: 1 - exp(-Kmn exp(-lambda x)),
which shows that the average score for an unrelated library
sequence increases with the logarithm of the length of the library
sequence.  The \fCfasta3\fP programs use simple linear regression
against the log of the library sequence length to calculate a
normalized "z-score" with mean 50, regardless of library sequence
length, and variance 10. (Several other estimation methods are
available with the \fC\-z\fP option.) These z-scores can then be used
with the extreme value distribution and the Poisson distribution (to
account for the fact that each library sequence comparison is an
independent test) to calculate the number of library sequences expected
to obtain a score greater than or equal to the score obtained in the
search. The original idea and routines to do the linear regression on
library sequence length were provided by Phil Green, U. Washington.  This
version uses a slightly different strategy for fitting the data than
those originally provided by Dr. Green.
.pp
The expected number of sequences is plotted in the histogram using an
"*". Since the parameters for the extreme value distribution are not
calculated directly from the distribution of similarity scores, the
pattern of "*'s" in the histogram gives a qualitative view of how well
the statistical theory fits the similarity scores calculated by the
programs.  For \fCfasta3\fP, if optimized scores are calculated for
each sequence in the database (the default), the agreement between the
actual distribution of "z-scores" and the expected distribution based
on the length dependence of the score and the extreme value
distribution is usually very good.
Likewise, the distribution of
\fCssearch3\fP Smith-Waterman scores typically agrees closely with the
actual distribution of "z-scores."  The agreement with unoptimized
scores, \fIktup=2\fP, is often not very good, with too many high
scoring sequences and too few low scoring sequences compared with the
predicted relationship between sequence length and similarity score.
In those cases, the expectation values may be overestimates.
.pp
With version 33t01, all the FASTA programs also report a "bit" score,
which is equivalent to the bit score reported by BLAST2.  The
FASTA33/BLAST2 bit score is calculated as: (lambda*S - ln K)/ln 2,
where S is the raw similarity score, and lambda and K are statistical
parameters estimated from the distribution of unrelated sequence
similarity scores.  The statistical significance of a given bit score
depends on the lengths of the query and library sequences and the size
of the library, but a 1-bit increase in score corresponds to a 2-fold
reduction in expectation; a 10-bit increase implies roughly 1000-fold
lower expectation, etc.
.pp
The statistical routines assume that the library contains a large
sample of unrelated sequences.  If this is not true, then statistical
parameters can be estimated by using the \fC\-z 11\-15\fP options.
\fC\-z\fP options greater than 10 calculate a shuffled similarity score
for each library sequence, in addition to the unshuffled score, and
estimate the statistical parameters from the scores of the shuffled
sequences.  If there are fewer than 20 sequences in the library, the
statistical calculations are not done.
.pp
For protein searches, library sequences with E() values < 0.01 for
searches of a 10,000 entry protein database are almost always
homologous. Frequently sequences with E()-values from 1 - 10 are
related as well, but unrelated sequences ( 1 \- 10 per search) will
have scores in this range as well.
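.pp
As a rough illustration of the bit-score formula given above, the following
Python sketch converts a raw score to bits and to an expectation.  The
function names and the values used for lambda, K, the sequence lengths,
and the library size are hypothetical, chosen only for the example; they
are not taken from the FASTA source code.  Multiplying the per-sequence
probability by the number of library entries is likewise an assumption
about how the expectation is formed.
.(l
.ft C
```python
import math


def bit_score(raw_score, lam, k):
    # (lambda*S - ln K)/ln 2, as given in the text
    return (lam * raw_score - math.log(k)) / math.log(2)


def p_value(raw_score, lam, k, m, n):
    # P(s > x) = 1 - exp(-K*m*n*exp(-lambda*x)); expm1 keeps precision
    # when the probability is very small
    return -math.expm1(-k * m * n * math.exp(-lam * raw_score))


def expectation(raw_score, lam, k, m, n, entries):
    # ASSUMPTION: expected number of library sequences = entries * P
    return entries * p_value(raw_score, lam, k, m, n)


# Hypothetical parameter values, for illustration only
LAM, K = 0.27, 0.04
M, N, ENTRIES = 200, 300, 10000

s1 = 100.0
s2 = s1 + math.log(2) / LAM  # raw score that is 1 bit higher than s1

bit_gain = bit_score(s2, LAM, K) - bit_score(s1, LAM, K)
fold = expectation(s1, LAM, K, M, N, ENTRIES) / expectation(s2, LAM, K, M, N, ENTRIES)
print(bit_gain, fold)
```
.ft R
.)l
With these values the two printed numbers should come out very close to 1
and 2, in line with the statement that a 1-bit increase halves the
expectation when the probability is small.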
Remember, however, that these E()
values also reflect differences between the amino acid composition of
the query sequence and that of the "average" library sequence.  Thus,
when searches are done with query sequences with "biased" amino-acid
composition, unrelated sequences may have "significant" scores because
of sequence bias.  \fCPRSS3\fP can address this problem by calculating
similarity scores for random sequences with the same length and amino
acid composition.
.sh 1 "Options"
.pp
Command line options are available to change the scoring parameters
and output display. \fBCommand line options must precede other program
arguments, such as the query and library file names.\fP
.sh 2 "Command line options"
.ip "-a"
(fasta3, ssearch3 only) Show both sequences in their entirety.
.ip "-A"
Force Smith-Waterman alignments for fasta3 DNA sequences.  By default,
only fasta3 protein sequence comparisons use Smith-Waterman alignments.
.ip "-B"
Show normalized score as a z-score, rather than a bit-score in the list
of best scores.
.ip "-b #"
Number of sequence scores to be shown on output.  In the absence of
this option, fasta (and tfasta and ssearch) display all library
sequences obtaining similarity scores with expectations less than
10.0 if optimized scores are used, or 2.0 if they are not. The -b
option can limit the display further, but it will not cause additional
sequences to be displayed.
.ip "-c #"
Threshold score for optimization (OPTCUT).  Set "-c 1" to
optimize every sequence in a database.
.ip "-E #"
Limit the number of scores and alignments shown based on the
expected number of scores.  Used to override the expectation value of 10.0
used by default.  When used with -Q, -E 2.0 will show all library sequences
with scores with an expectation value <= 2.0.
.ip "-d #"
Maximum number of alignments to be displayed.
Ignored if "-Q" is not
used.
.ip "-f"
Penalty for the first residue in a gap (-12 by default for proteins,
-16 for DNA, -15 for FAST[XY]/TFAST[XY]).
.ip "-F #"
Limit the number of scores and alignments shown based on the expected
number of scores. "-E #" sets the highest E()-value shown; "-F #" sets
the lowest E()-value. Thus, "-F 0.0001" will not show any matches or
alignments with E() < 0.0001.  This allows one to skip over close
relationships in searches for more distant relationships.
.ip "-g"
Penalty for additional residues in a gap (-2 by default for proteins,
-4 for DNA, -3 for FAST[XY]/TFAST[XY]).
.ip "-h"
Penalty for frameshift (fastx3/y3, tfastx3/y3 only).
.ip "-H"
Omit histogram.
.ip "-i"
Invert (reverse complement) the query sequence if it is DNA.  For
tfasta3/x3/y3, search the reverse complement of the library sequence
only.
.ip "-j #"
Penalty for frameshift within a codon (fasty3/tfasty3 only).
.ip "-l file"
Location of library menu file (FASTLIBS).
.ip "-L"
Display more information about the library sequence in the alignment.
.ip "-M low-high"
Range of amino acid sequence lengths to be included in the search.
.ip "-m #"
Specify alignment type: 0, 1, 2, 3, 4, 5, 6, 9, 10
.(l I
.ft C
    \-m 0        \-m 1          \-m 2          \-m 3        \-m 4
.ft C
MWRTCGPPYT   MWRTCGPPYT    MWRTCGPPYT                 MWRTCGPPYT
::..:: :::     xx  X       ..KS..Y...    MWKSCGYPYT   ----------
MWKSCGYPYT   MWKSCGYPYT
.ft P
.)l