     environments without STDOUT (mostly for the Macintosh).

-o   Turn off default optimization of all scores greater than
     OPTCUT. Sort results by "initn" scores (reduces the accuracy
     of statistical estimates).

-p   Force query to be treated as protein sequence.

-Q,-q
     Quiet - does not prompt for any input.  Writes scores and
     alignments to the terminal or standard output file.

-r   Specify match/mismatch scores for DNA comparisons.  The
     default is "+5/-4". "+3/-2" can perform better in some
     cases.

-R file
     Save a results summary line for every sequence in the
     sequence library.  The summary line includes the sequence
     identifier, superfamily number (if available), position in
     the library, and the similarity scores calculated.  This
     option can be used to evaluate the sensitivity and
     selectivity of different search strategies (Pearson, 1995;
     Pearson, 1998).

-s file
     Specify the scoring matrix file.  fasta3 uses the same
     scoring matrices as Blast1.4/2.0.  Several scoring matrix
     files are included in the standard distribution.  For
     protein sequences: codaa.mat - based on minimum mutation
     matrix; idnaa.mat - identity matrix; pam250.mat - the PAM250
     matrix developed by Dayhoff et al. (Dayhoff et al., 1978);
     pam120.mat - a PAM120 matrix.  The default scoring matrix is
     BLOSUM50 ("-s BL50"). Other matrices available from within
     the program are: PAM250/"-s P250", PAM120/"-s P120",
     PAM40/"-s P40", PAM20/"-s P20", MDM10 - MDM40/"-s M10 - M40"
     (MDM are modern PAM matrices from Jones et al. (Jones et
     al., 1992)), BLOSUM50, 62, and 80/"-s BL50", "-s BL62", "-s
     BL80".

-S   Treat lower-case characters in the query or library
     sequences as "low-complexity" ("seg"-ed) residues.
     Traditionally, the "seg" program (Wootton and
     Federhen, 1993) is used to remove low complexity regions in
     protein sequences by replacing the residues with an "X".
     When the "-S" option is used, the FASTA33 programs provide a
     potentially more informative approach.  With "-S", lower
     case characters in the query or database sequences are
     treated as "X"'s during the initial scan, but are treated as
     normal residues during the final alignment display.  Since
     statistical significance is calculated from the similarity
     score calculated during the library search, when the lower
     case residues are "X"'s, low complexity regions will not
     produce statistically significant matches.  However, if a
     significant alignment contains low complexity regions, their
     alignment is shown.  With "-S", lower case characters may be
     included in the alignment to indicate low complexity
     regions, and the final alignment score may be higher than
     the score obtained during the search.

     The pseg program can be used to produce databases (or query
     sequences) with lower case residues indicating low
     complexity regions using the command:

         pseg database.fasta -z 1 -q  > database.lc_seg

     (seg can also be used with some post processing, see
     readme.v33tx.)

-U   Treat the query sequence as an RNA sequence.  In addition to
     selecting a DNA/RNA alphabet, this option causes changes to
     the scoring matrix so that 'G:A', 'T:C' or 'U:C' are scored
     as 'G:G'.

-V str
     It is now possible to specify some annotation characters
     that can be included (and will be ignored) in the query
     sequence file.  Thus, one might have a file with:
     "ACVS*ITRLFT?", where "*" and "?" are used to indicate
     phosphorylation.  By giving the option -V '*?', those
     characters in the query will be moved to an "annotation
     string", and alignments that include the annotated residues
     will be highlighted with the appropriate character above the
     sequence (on the number line).

-w # Line length (width) = number (<200)

-W # Context length (default is 1/2 of line width -w) for
     alignment programs, like fasta and ssearch, that provide
     additional sequence context.

-x # Specify the penalty for a match to an 'X', independently of
     the PAM matrix.  Particularly useful for fastx3/fasty3,
     where termination codons are encoded as 'X'.

-X   Specifies offsets for the beginning of the query and library
     sequence.  For example, if you are comparing upstream
     regions for two genes, and the first sequence contains 500
     nt of upstream sequence while the second contains 300 nt of
     upstream sequence, you might try:

         fasta -X "-500 -300" seq1.nt seq2.nt

     If the -X option is not used, FASTA assumes numbering starts
     with 1.  (You should double check to be certain the negative
     numbering works properly.)

-y   Set the width of the band used for calculating "optimized"
     scores.  For proteins and ktup=2, the width is 16.  For
     proteins with ktup=1, the width is 32 by default.  For DNA
     the width is 16.

-z -1,0,1,2,3,4,5
     -z -1 turns off statistical calculations. -z 0 estimates the
     significance of the match from the mean and standard
     deviation of the library scores, without correcting for
     library sequence length.  -z 1 (the default) uses a weighted
     regression of average score vs library sequence length; -z 2
     uses maximum likelihood estimates of Lambda and K; -z 3 uses
     Altschul-Gish parameters (Altschul and Gish, 1996); -z 4 and
     -z 5 use two variations on the -z 1 strategy. -z 1 and -z 2
     are the best methods, in general.

-z 11,12,14,15
     Estimate the statistical parameters from shuffled copies of
     each library sequence.  This doubles the time required for a
     search, but allows accurate statistics to be estimated for
     libraries comprised of a single protein family.

-Z db_size
     Set the apparent size of the database to be used when
     calculating expectation E() values.  If you searched a
     database with 1,000 sequences, but would like to have the
     E()-values calculated in the context of a 100,000 sequence
     database, use '-Z 100000'.

-1   Sort output by init1 score (for compatibility with FASTP -
     do not use).

-3   Translate only three forward frames.

For example:

    fasta -w 80 -a seq1.aa seq2.aa

would compare the sequence in seq1.aa to that in seq2.aa and
display the results with 80 residues on an output line, showing
all of the residues in both sequences.  Be sure to enter the
options before entering the file names, or just enter the options
on the command line, and the program will prompt for the file
names.

     (November, 1997) In addition, it is now possible to provide
the fasta programs with the query sequence (fasta, fasty,
ssearch, tfastx), or two sequences (prss, lalign, plalign) from
the unix "stdin" stream.  This makes it much easier to set up
FASTA or PRSS WWW pages.  To specify that stdin be used, rather
than a file, the file name should be specified as '-' or '@' (the
latter file name makes it possible to specify a subset of the
sequence).  Thus:

    cat query.aa | fasta -q @:25-75 s

would take residues 25-75 from query.aa and search the 's'
library (see the discussion of FASTLIBS).

5.2.  Environment variables

     Because the current version of the program allows the user
to set virtually every option on the command line (except the
ktup, which must be set as the third command line argument), only
the FASTLIBS environment variable is routinely used.

FASTLIBS
     specifies the location of the file which contains the list
     of library descriptions, locations, and library types (see
     section on finding library files).

6.  Frequently Asked Questions

(1)   Which program should I use? See Table I.

(2)   How do I search with both DNA strands with fasta3 and
      fastx3? With version 32 of the FASTA program package, all
      searches that use DNA queries (e.g. fasta3, fastx3/y3)
      examine both strands. To revert to earlier FASTA behavior
      - only looking at the forward or reverse strand - use -3
      to search only the forward strand and -i -3 to search only
      the reverse strand.

(3)   When I search Genbank, the program reports: 0 residues in
      0 sequences.  This typically happens because the program
      does not know that you are searching a Genbank flatfile
      database and is looking for a FASTA format database.  Be
      certain to specify the library type ("1" for Genbank
      flatfile) with the database name.

(4)   What is the difference between fastx3 and fasty3 (or
      tfastx3 and tfasty3)?  [t]fastx3 uses a simpler codon
      based model for alignments that does not allow frameshifts
      in some codon positions (see ref. (Zhang et al., 1997)).
      tfastx3 is about 30% faster, but tfasty3 can produce
      higher quality alignments in some cases.

(5)   When I run fasta3 -q, I don't see any (or very little)
      output, but I get lots of scores when I run interactively.
      With the -Q option, the number of high scores displayed is
      limited by the -E # cutoff, which is 10.0 for protein
      comparisons, 2.0 for DNA comparisons, and 5.0 for
      translated DNA:protein comparisons.  In interactive mode
      (without -Q), by default you see 20 high scores,
      regardless of E() value.

(6)   What is ktup? All of the programs with fast in their name
      use a computer science method called a lookup table to
      speed the search.  For proteins with ktup=2, this means
      that the program does not look at any sequence alignment
      that does not involve matching two identical residues in
      both sequences.  Likewise with DNA and ktup = 6, the
      initial alignment of the sequences looks for 6 identical
      adjacent nucleotides in both sequences.  Because it is
      less likely that two identical amino-acids will line up by
      chance in two unrelated proteins, this speeds up the
      comparison.  But very distantly related sequences may
      never have two identical residues in a row but will have
      single aligned identities.  In this case, ktup = 1 may
      find alignments that ktup=2 misses.

(7)   Sometimes, in the list of best scores, the same sequence
      is shown twice with exactly the same score.  Sometimes,
      the sequence is there twice, but the scores are slightly
      different. When any of the fasta3 programs searches a long
      sequence, it breaks the sequence up into overlapping
      pieces.  The length of the piece depends on the length of
      the query and the particular program being used (it can
      also be controlled with the -N #### option).  Since the
      pieces overlap by the length of the query sequence (or
      3*query_length for fastx/y3 and tfasta/x/y3), if the
      highest scoring alignment is at the end of one piece, it
      will be scored again at the beginning of the next piece.
      If the alignment is not completely included in the
      overlap region, one of the pieces will give a higher score
      than the other.  These duplications can be detected by
      looking at the coordinates of the alignment.  If either
      the beginning or end coordinate is identical in two
      alignments, the alignments are at least partially
      duplicates.

As always, please inform me of bugs as soon as possible.

William R. Pearson
Department of Biochemistry
Jordan Hall Box 800733
U. of Virginia
Charlottesville, VA
wrp@virginia.EDU

7.  References

Altschul, S. F., Boguski, M. S., Gish, W., and Wootton, J. C.
(1994). Issues in searching molecular sequence databases. Nature
Genet. 6,119-129.

Altschul, S. F. and Gish, W. (1996). Local alignment statistics.
Methods Enzymol. 266,460-480.

Bairoch, A. and Apweiler, R. (1996). The Swiss-Prot protein
sequence data bank and its new supplement TrEMBL. Nucleic Acids
Res. 24,21-25.

Barker, W. C., Garavelli, J. S., Haft, D. H., Hunt, L. T.,
Marzec, C. R., Orcutt, B. C., Srinivasarao, G. Y., Yeh, L. S. L.,
Ledley, R. S., Mewes, H. W., Pfeiffer, F., and Tsugita, A.
(1998). The PIR-International Protein Sequence Database. Nucleic
Acids Res. 26,27-32.

Dayhoff, M., Schwartz, R. M., and Orcutt, B. C. (1978). A model
of evolutionary change in proteins. In Atlas of Protein Sequence
and Structure, vol. 5, supplement 3. M. Dayhoff, ed. (Silver
Spring, MD: National Biomedical Research Foundation), pp.
345-352.

Jones, D. T., Taylor, W. R., and Thornton, J. M. (1992). The
rapid generation of mutation data matrices from protein
sequences. Comp. Appl. Biosci. 8,275-282.

Pearson, W. R. (2000). Flexible similarity searching with the
FASTA3 program package. In Bioinformatics Methods and Protocols,
S. Misener and S. A. Krawetz, ed. (Totowa, NJ: Humana Press), pp.
185-219.

Pearson, W. R. and Lipman, D. J. (1988). Improved tools for
biological sequence comparison. Proc. Natl. Acad. Sci. USA
85,2444-2448.

Pearson, W. R. (1995). Comparison of methods for searching
protein sequence databases. Prot. Sci. 4,1145-1160.

Pearson, W. R. (1996). Effective protein sequence comparison.
Methods Enzymol. 266,227-258.

Pearson, W. R. (1998). Empirical statistical estimates for
sequence similarity searches. J. Mol. Biol. 276,71-84.

Smith, T. F. and Waterman, M. S. (1981). Identification of common
molecular subsequences. J. Mol. Biol. 147,195-197.

Wootton, J. C. and Federhen, S. (1993). Statistics of local
complexity in amino acid sequences and sequence databases.
Comput. Chem. 17,149-163.

Zhang, Z., Pearson, W. R., and Miller, W. (1997). Aligning a DNA
sequence with a protein sequence. J. Computational Biology
4,339-349.