📄 fasta3x.doc

     You do not need to include library format numbers if you
     only use the Pearson/FASTA version of the PIR protein
     sequence library.  If no library type is specified, the
     program assumes that type 0 is being used.

     Test the setup by running FASTA.  Enter the sequence file
'mgstm1.aa' when the program requests it (this file is included
with the programs).  The program should then ask you to select a
protein sequence library.  Alternatively, if you run the TFASTA
program and use the mgstm1.aa query sequence, the program should
show you a selection of DNA sequence libraries.  Once the fastgbs
file has been set up correctly, you can set FASTLIBS=fastgbs in
your AUTOEXEC.BAT file, and you will not need to remember where
the libraries are kept or how they are named.

3.  Using the FASTA Package

3.1.  Overview

     The FASTA sequence comparison programs all require similar
information: the name of a query sequence file, a library file,
and the ktup parameter.  All of the programs can accept arguments
on the command line, or they will prompt for the file names and
ktup value.

To use FASTA, simply type:

    FASTA

and you will be prompted for:

         the name of the test sequence file
         the name of the library file
         and whether you want ktup = 1 or 2 (or 1 to 6 for DNA
         sequences)
         (ktup of 2 is about 5 times faster than ktup = 1)

The program can also be run by typing

    FASTA test.aa /lib/bigfile.lib ktup (1 or 2)

Included with the package are several test files.  To make
certain that everything is working, you can try:

    fasta musplfm.aa prot_test.lib

and

    tfastx mgstm1.aa gst.nlib

3.2.  Sequence files

     The fasta3 programs know about three kinds of sequence
files: (1) plain sequence files - files that contain nothing but
sequence residues - can only be used as query sequences. (2)
FASTA format files.  These are the same as plain sequence files,
except that each sequence is preceded by a comment line with a
'>' in the first column.
(3) distributed sequence libraries (this is a broad
class that includes the NBRF/PIR VMS and blocked ascii formats,
Genbank flat-file format, EMBL flat-file format, and
Intelligenetics format).  All of the files that you create should
be of type (1) or (2).  FASTA format files (ones with a '>' and
comment before the sequence) are preferred, because they can be
used as query or library sequence files by all of the programs.

     I have included several sample test files, *.aa and *.seq,
as well as two small sequence libraries, prot_test.lib and
gst.nlib.  The first line may begin with a '>' followed by a
comment.  Spaces and tabs (and anything else that is not an
amino-acid code) are ignored.

     Library files should have the form:

    >Sequence name and identifier
    A F A S Y T .... actual sequence.
    F S S       .... second line of sequence.
    >Next sequence name and identifier

This is often referred to as "FASTA" format.  You can build
your own library by concatenating several sequence files.  Just
be sure that each sequence is preceded by a line beginning with a
'>' with a sequence name.

     The test file should not have lines longer than 120
characters, and sequences entered with word processors should use
a document mode, with normal carriage returns at the end of
lines.

     A different format is required to specify the ordered
peptide mixture for fastf3/tfastf3. For example:

    >mgstm1
    MGCEN,
    MIDYP,
    MLLAY,
    MLLGY

indicates M in the first position of all four peptides (as from
CNBr); G, I, L (twice) in the second position (first cycle);
C, D, L (twice) in the third position; etc.  The commas (,) are
required to indicate the number of fragments in the mixture, but
there should be no comma after the last residue.

     For the fasts3/tfasts3 program, the format is the same,
except that there is no requirement for the peptides to be the
same length.

4.  Statistical Significance

     All the programs in the FASTA3 package attempt to calculate
accurate estimates of the statistical significance of a match.
For fasta3, ssearch3, and fastx3/y3, these estimates are very
accurate (Pearson, 1998, Zhang et al., 1997).  Altschul et al.
(Altschul et al., 1994) provide an excellent review of the
statistics of local similarity scores.  Local sequence similarity
scores follow the extreme value distribution, so that P(s > x) =
1 - exp(-exp(-lambda(x-u))), where u = ln(Kmn)/lambda and m,n are
the lengths of the query and library sequence. This formula can
be rewritten as: 1 - exp(-Kmn exp(-lambda x)), which shows that
the average score for an unrelated library sequence increases
with the logarithm of the length of the library sequence.  The
fasta3 programs use simple linear regression against the log
of the library sequence length to calculate a normalized
"z-score" with mean 50, regardless of library sequence length,
and variance 10. (Several other estimation methods are available
with the -z option.) These z-scores can then be used with the
extreme value distribution and the Poisson distribution (to
account for the fact that each library sequence comparison is an
independent test) to calculate the number of library sequences
expected to obtain a score greater than or equal to the score
obtained in the search.  The original idea and routines to do the
linear regression on library sequence length were provided by
Phil Green, U. Washington.  This version uses a slightly
different strategy for fitting the data than those originally
provided by Dr. Green.

     The expected number of sequences is plotted in the histogram
using an "*". Since the parameters for the extreme value
distribution are not calculated directly from the distribution of
similarity scores, the pattern of "*'s" in the histogram gives a
qualitative view of how well the statistical theory fits the
similarity scores calculated by the programs.
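
     As a rough illustration of the extreme value formula above,
the following Python sketch evaluates P(s > x) in both of its
forms and converts it to an expected number of library sequences.
It is not code from the FASTA package: fasta3 works from
length-normalized z-scores rather than from lambda and K
directly, and the function names and the values of lambda, K, m,
n and the library size used here are placeholders chosen only for
the example.

```python
import math


def evd_pvalue(x, lam, K, m, n):
    """P(s > x) = 1 - exp(-K*m*n*exp(-lam*x)); m, n are query and library sequence lengths."""
    # -expm1(-t) equals 1 - exp(-t) but keeps precision when t is tiny.
    return -math.expm1(-K * m * n * math.exp(-lam * x))


def evd_pvalue_location_form(x, lam, K, m, n):
    """Same probability written with the location parameter u = ln(K*m*n)/lam."""
    u = math.log(K * m * n) / lam
    return -math.expm1(-math.exp(-lam * (x - u)))


def expected_sequences(p, library_size):
    """Expected number of library sequences scoring >= x over library_size independent comparisons."""
    return library_size * p


def prob_at_least_one(expectation):
    """Poisson probability of seeing one or more such sequences by chance."""
    return -math.expm1(-expectation)


# Placeholder parameters, assumed for illustration only.
LAM, K = 0.27, 0.04
p = evd_pvalue(100, LAM, K, m=200, n=300)
e = expected_sequences(p, library_size=10000)
print(f"P = {p:.3g}, E() = {e:.3g}, P(>=1 hit) = {prob_at_least_one(e):.3g}")
```

Doubling the library sequence length n roughly doubles P for a
fixed score, which is the length dependence that the regression
on log length is meant to remove.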

     For fasta3, if optimized scores are calculated for each
sequence in the database (the default), the agreement between the
actual distribution of "z-scores" and the expected distribution
based on the length dependence of the score and the extreme value
distribution is usually very good.  Likewise, the distribution of
ssearch3 Smith-Waterman scores typically agrees closely with the
actual distribution of "z-scores."  The agreement with
unoptimized scores, ktup=2, is often not very good, with too many
high scoring sequences and too few low scoring sequences compared
with the predicted relationship between sequence length and
similarity score.  In those cases, the expectation values may be
overestimates.

     With version 33t01, all the FASTA programs also report a
"bit" score, which is equivalent to the bit score reported by
BLAST2.  The FASTA33/BLAST2 bit score is calculated as:
(lambda*S - ln K)/ln 2, where S is the raw similarity score, and
lambda and K are statistical parameters estimated from the
distribution of unrelated sequence similarity scores.  The
statistical significance of a given bit score depends on the
lengths of the query and library sequences and the size of the
library, but a 1-bit increase in score corresponds to a 2-fold
reduction in expectation; a 10-bit increase implies about
1000-fold lower expectation, etc.

     The statistical routines assume that the library contains a
large sample of unrelated sequences.  If this is not true, then
statistical parameters can be estimated by using the -z 11-15
options.  -z options greater than 10 calculate a shuffled
similarity score for each library sequence, in addition to the
unshuffled score, and estimate the statistical parameters from
the scores of the shuffled sequences.  If there are fewer than 20
sequences in the library, the statistical calculations are not
done.

     For protein searches, library sequences with E() values <
0.01 for searches of a 10,000 entry protein database are almost
always homologous.
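
     The bit score formula quoted above is simple enough to check
by hand.  The Python sketch below is only an illustration, not
the package's own code; the function names and the lambda and K
values are placeholders.  It also shows where the rule of thumb
comes from: each additional bit halves the expectation, so 10
bits give a factor of 2^10 = 1024, or roughly 1000.

```python
import math


def bit_score(raw_score, lam, K):
    """Bit score = (lambda*S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def expectation_fold_reduction(extra_bits):
    """Factor by which the expectation drops when the score rises by extra_bits."""
    return 2.0 ** extra_bits


# Placeholder statistical parameters, assumed for illustration only.
LAM, K = 0.27, 0.04
print(f"raw score 100 -> {bit_score(100, LAM, K):.1f} bits")
print(f"10 extra bits -> {expectation_fold_reduction(10):.0f}-fold lower expectation")
```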

     Frequently, sequences with E()-values from 1 - 10 are
related as well, but unrelated sequences (1 - 10 per search) will
have scores in this range as well. Remember, however, that these
E() values also reflect differences between the amino acid
composition of the query sequence and that of the "average"
library sequence.  Thus, when searches are done with query
sequences with "biased" amino-acid composition, unrelated
sequences may have "significant" scores because of sequence bias.
PRSS3 can address this problem by calculating similarity scores
for random sequences with the same length and amino acid
composition.

5.  Options

     Command line options are available to change the scoring
parameters and output display. Command line options must precede
other program arguments, such as the query and library file
names.

5.1.  Command line options

-a   (fasta3, ssearch3 only) show both sequences in their
     entirety.

-A   force Smith-Waterman alignments for fasta3 DNA sequences.
     By default, only fasta3 protein sequence comparisons use
     Smith-Waterman alignments.

-B   Show normalized score as a z-score, rather than a bit-score
     in the list of best scores.

-b # Number of sequence scores to be shown on output.  In the
     absence of this option, fasta (and tfasta and ssearch)
     display all library sequences obtaining similarity scores
     with expectations less than 10.0 if optimized scores are
     used, or 2.0 if they are not. The -b option can limit the
     display further, but it will not cause additional sequences
     to be displayed.

-c # Threshold score for optimization (OPTCUT).  Set "-c 1" to
     optimize every sequence in a database.

-E # Limit the number of scores and alignments shown based on the
     expected number of scores.  Used to override the expectation
     value of 10.0 used by default.  When used with -Q, -E 2.0
     will show all library sequences with scores with an
     expectation value <= 2.0.

-d # Maximum number of alignments to be displayed.
     Ignored if "-Q" is not used.

-f   Penalty for the first residue in a gap (-12 by default for
     proteins, -16 for DNA, -15 for FAST[XY]/TFAST[XY]).

-F # Limit the number of scores and alignments shown based on the
     expected number of scores. "-E #" sets the highest E()-value
     shown; "-F #" sets the lowest E()-value. Thus, "-F 0.0001"
     will not show any matches or alignments with E() < 0.0001.
     This allows one to skip over close relationships in searches
     for more distant relationships.

-g   Penalty for additional residues in a gap (-2 by default for
     proteins, -4 for DNA, -3 for FAST[XY]/TFAST[XY]).

-h   Penalty for frameshift (fastx3/y3, tfastx3/y3 only).

-H   Omit histogram.

-i   Invert (reverse complement) the query sequence if it is DNA.
     For tfasta3/x3/y3, search the reverse complement of the
     library sequence only.

-j # Penalty for frameshift within a codon (fasty3/tfasty3 only).

-l file
     Location of library menu file (FASTLIBS).

-L   Display more information about the library sequence in the
     alignment.

-M low-high
     Range of amino acid sequence lengths to be included in the
     search.

-m # Specify alignment type: 0, 1, 2, 3, 4, 5, 6, 9, 10

             -m 0        -m 1          -m 2          -m 3        -m 4
         MWRTCGPPYT   MWRTCGPPYT    MWRTCGPPYT                 MWRTCGPPYT
         ::..:: :::     xx  X       ..KS..Y...    MWKSCGYPYT   ----------
         MWKSCGYPYT   MWKSCGYPYT

     -m 5 provides a combination of -m 4 and -m 0. -m 6 provides
     -m 5 plus HTML formatting.

     -m 9 provides coordinates and scores with the best score
     information.
     A simple "-m 9" extends the normal best score information:

         The best scores are:                                      opt bits E(14548)
         XURTG4 glutathione transferase (EC 2.5.1.18) 4 -   ( 219) 1248 291.7 1.1e-79

     to include the additional information (on the same line,
     separated by a <tab>):

         %_id  %_gid   sw  alen  an0  ax0  pn0  px0  an1  ax1 pn1 px1 gapq gapl  fs
         0.771 0.771 1248  218    1  218    1  218    1  218    1  219   0   0   0

     -m 9c provides additional information: an encoded alignment
     string.  Thus:

                10        20        30        40        50          60         70
         GT8.7  NVRGLTHPIRMLLEYTDSSYDEKRYTMGDAPDFDRSQWLNEKFKL--GLDFPNLPYL-IDGSHKITQ
                :.::  . :: ::  .   .:::         : .:    ::.:   .: : ..:.. :::  :..:
         XURTG  NARGRMECIRWLLAAAGVEFDEK---------FIQSPEDLEKLKKDGNLMFDQVPMVEIDG-MKLAQ
                        20        30                 40        50        60

     would be encoded:

         =23+9=13-2=10-1=3+1=5

     The alignment encoding is with respect to the alignment, not
     the sequences.  The coordinate of the alignment is given
     earlier in the "-m 9c" line.

-m 10
     -m 10 is a new, parseable format for use with other
     programs.  See the file "readme.v20u4" for a more complete
     description.

     As of version "fa34t23b2", it has become possible to combine
     independent "-m" options.  Thus, one can use "-m 1 -m 6 -m
     9".

-M low-high
     Include library sequences (proteins only) with lengths
     between low and high.

-n   Force the query sequence to be treated as a DNA sequence.
     This is particularly useful for query sequences that contain
     a large number of ambiguous residues, e.g. transcription
     factor binding sites.

-O   Send copy of results to "filename."  Helpful for