📄 fasta3x.doc
environments without STDOUT (mostly for the Macintosh).

-o      Turn off default optimization of all scores greater than OPTCUT. Sort results by "initn" scores (reduces the accuracy of statistical estimates).

-p      Force query to be treated as protein sequence.

-Q,-q   Quiet - does not prompt for any input. Writes scores and alignments to the terminal or standard output file.

-r      Specify match/mismatch scores for DNA comparisons. The default is "+5/-4". "+3/-2" can perform better in some cases.

-R file Save a results summary line for every sequence in the sequence library. The summary line includes the sequence identifier, superfamily number (if available), position in the library, and the similarity scores calculated. This option can be used to evaluate the sensitivity and selectivity of different search strategies (Pearson, 1995; Pearson, 1998).

-s file Specify the scoring matrix file. fasta3 uses the same scoring matrices as Blast1.4/2.0. Several scoring matrix files are included in the standard distribution. For protein sequences:

        codaa.mat  - based on minimum mutation matrix
        idnaa.mat  - identity matrix
        pam250.mat - the PAM250 matrix developed by Dayhoff et al. (Dayhoff et al., 1978)
        pam120.mat - a PAM120 matrix

        The default scoring matrix is BLOSUM50 ("-s BL50"). Other matrices available from within the program are: PAM250/"-s P250", PAM120/"-s P120", PAM40/"-s P40", PAM20/"-s P20", MDM10 - MDM40/"-s M10 - M40" (MDM are modern PAM matrices from Jones et al. (Jones et al., 1992)), and BLOSUM50, 62, and 80/"-s BL50", "-s BL62", "-s BL80".

-S      Treat lower-case characters in the query or library sequences as "low-complexity" ("seg"-ed) residues. Traditionally, the "seg" program (Wootton and Federhen, 1993) is used to remove low complexity regions in protein sequences by replacing the residues with an "X". When the "-S" option is used, the FASTA33 programs provide a potentially more informative approach.
With "-S", lower case characters in the query or database sequences are treated as "X"'s during the initial scan, but are treated as normal residues during the final alignment display. Statistical significance is calculated from the similarity score obtained during the library search, when the lower case residues are "X"'s, so low complexity regions will not produce statistically significant matches. However, if a significant alignment contains low complexity regions, their alignment is shown. With "-S", lower case characters may be included in the alignment to indicate low complexity regions, and the final alignment score may be higher than the score obtained during the search.

        The pseg program can be used to produce databases (or query sequences) with lower case residues indicating low complexity regions using the command:

            pseg database.fasta -z 1 -q > database.lc_seg

        (seg can also be used with some post processing, see readme.v33tx.)

-U      Treat the query sequence as an RNA sequence. In addition to selecting a DNA/RNA alphabet, this option causes changes to the scoring matrix so that 'G:A', 'T:C' or 'U:C' are scored as 'G:G'.

-V str  It is now possible to specify some annotation characters that can be included (and will be ignored) in the query sequence file. Thus, one might have a file with "ACVS*ITRLFT?", where "*" and "?" are used to indicate phosphorylation. By giving the option -V '*?', those characters in the query will be moved to an "annotation string", and alignments that include the annotated residues will be highlighted with the appropriate character above the sequence (on the number line).

-w #    Line length (width) = number (<200)

-W #    Context length (default is 1/2 of line width -w) for alignment programs, like fasta and ssearch, that provide additional sequence context.

-x #    Specify the penalty for a match to an 'X', independently of the PAM matrix.
Particularly useful for fastx3/fasty3, where termination codons are encoded as 'X'.

-X      Specifies offsets for the beginning of the query and library sequence. For example, if you are comparing upstream regions for two genes, and the first sequence contains 500 nt of upstream sequence while the second contains 300 nt of upstream sequence, you might try:

            fasta -X "-500 -300" seq1.nt seq2.nt

        If the -X option is not used, FASTA assumes numbering starts with 1. (You should double check to be certain the negative numbering works properly.)

-y      Set the width of the band used for calculating "optimized" scores. For proteins and ktup=2, the width is 16. For proteins with ktup=1, the width is 32 by default. For DNA the width is 16.

-z -1,0,1,2,3,4,5
        -z -1 turns off statistical calculations.
        -z 0 estimates the significance of the match from the mean and standard deviation of the library scores, without correcting for library sequence length.
        -z 1 (the default) uses a weighted regression of average score vs library sequence length.
        -z 2 uses maximum likelihood estimates of Lambda and K.
        -z 3 uses Altschul-Gish parameters (Altschul and Gish, 1996).
        -z 4 and -z 5 use two variations on the -z 1 strategy.
        -z 1 and -z 2 are the best methods, in general.

-z 11,12,14,15
        Estimate the statistical parameters from shuffled copies of each library sequence. This doubles the time required for a search, but allows accurate statistics to be estimated for libraries composed of a single protein family.

-Z db_size
        Set the apparent size of the database to be used when calculating expectation E() values.
If you searched a database with 1,000 sequences, but would like to have the E()-values calculated in the context of a 100,000 sequence database, use '-Z 100000'.

-1      Sort output by init1 score (for compatibility with FASTP - do not use).

-3      Translate only three forward frames.

For example:

    fasta -w 80 -a seq1.aa seq2.aa

would compare the sequence in seq1.aa to that in seq2.aa and display the results with 80 residues on an output line, showing all of the residues in both sequences. Be sure to enter the options before entering the file names, or just enter the options on the command line, and the program will prompt for the file names.

(November, 1997) In addition, it is now possible to provide the fasta programs with the query sequence (fasta, fasty, ssearch, tfastx), or two sequences (prss, lalign, plalign), from the unix "stdin" stream. This makes it much easier to set up FASTA or PRSS WWW pages. To specify that stdin be used, rather than a file, the file name should be specified as '-' or '@' (the latter file name makes it possible to specify a subset of the sequence). Thus:

    cat query.aa | fasta -q @:25-75 s

would take residues 25-75 from query.aa and search the 's' library (see the discussion of FASTLIBS).

5.2. Environment variables

Because the current version of the program allows the user to set virtually every option on the command line (except the ktup, which must be set as the third command line argument), only the FASTLIBS environment variable is routinely used.

FASTLIBS specifies the location of the file which contains the list of library descriptions, locations, and library types (see section on finding library files).

6. Frequently Asked Questions

(1) Which program should I use?

See Table I.

(2) How do I search with both DNA strands with fasta3 and fastx3?

With version 32 of the FASTA program package, all searches that use DNA queries (e.g. fasta3, fastx3/y3) examine both strands.
To revert to earlier FASTA behavior - only looking at the forward or reverse strand - use -3 to search only the forward strand and -i -3 to search only the reverse strand.

(3) When I search Genbank, the program reports: 0 residues in 0 sequences.

This typically happens because the program does not know that you are searching a Genbank flatfile database and is looking for a FASTA format database. Be certain to specify the library type ("1" for Genbank flatfile) with the database name.

(4) What is the difference between fastx3 and fasty3 (or tfastx3 and tfasty3)?

[t]fastx3 uses a simpler codon based model for alignments that does not allow frameshifts in some codon positions (see ref. (Zhang et al., 1997)). tfastx3 is about 30% faster, but tfasty3 can produce higher quality alignments in some cases.

(5) When I run fasta3 -q, I don't see any (or very little) output, but I get lots of scores when I run interactively.

With the -Q option, the number of high scores displayed is limited by the -E # cutoff, which is 10.0 for protein comparisons, 2.0 for DNA comparisons, and 5.0 for translated DNA:protein comparisons. In interactive mode (without -Q), by default you see 20 high scores, regardless of E() value.

(6) What is ktup?

All of the programs with fast in their name use a computer science method called a lookup table to speed the search. For proteins with ktup=2, this means that the program does not look at any sequence alignment that does not involve matching two identical adjacent residues in both sequences. Likewise with DNA and ktup = 6, the initial alignment of the sequences looks for 6 identical adjacent nucleotides in both sequences. Because it is less likely that two identical amino-acids will line up by chance in two unrelated proteins, this speeds up the comparison. But very distantly related sequences may never have two identical residues in a row but will have single aligned identities. In this case, ktup = 1 may find alignments that ktup=2 misses.
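The effect of ktup described in (6) can be sketched in a few lines of Python. This is only an illustration of the lookup-table idea, not the code used by the FASTA programs: the function names build_lookup and diagonal_hits and the two toy sequences are assumptions made up for the example.

```python
from collections import defaultdict


def build_lookup(query, ktup):
    """Map each ktup-length word in the query to the positions where it starts."""
    table = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        table[query[i:i + ktup]].append(i)
    return table


def diagonal_hits(query, library, ktup):
    """Count identical ktup-words shared by the two sequences, per diagonal.

    A diagonal is (library position - query position); many hits on one
    diagonal suggest a region worth examining more closely.
    """
    table = build_lookup(query, ktup)
    hits = defaultdict(int)
    for j in range(len(library) - ktup + 1):
        word = library[j:j + ktup]
        for i in table.get(word, ()):
            hits[j - i] += 1
    return dict(hits)


# Hypothetical sequences: identities at every other position,
# so no two identical residues are ever adjacent.
query = "ACDEFGH"
library = "AWDYFQH"

print(diagonal_hits(query, library, 2))  # no shared 2-residue words
print(diagonal_hits(query, library, 1))  # single identities are found
```

With these toy sequences, ktup=2 finds nothing to extend, while ktup=1 finds four single-residue matches on the same diagonal, which is the situation described above for very distantly related sequences.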
(7) Sometimes, in the list of best scores, the same sequence is shown twice with exactly the same score. Sometimes, the sequence is there twice, but the scores are slightly different.

When any of the fasta3 programs searches a long sequence, it breaks the sequence up into overlapping pieces. The length of the piece depends on the length of the query and the particular program being used (it can also be controlled with the -N #### option). Since the pieces overlap by the length of the query sequence (or 3*query_length for fastx/y3 and tfasta/x/y3), if the highest scoring alignment is at the end of one piece, it will be scored again at the beginning of the next piece. If the alignment is not completely included in the overlap region, one of the pieces will give a higher score than the other. These duplications can be detected by looking at the coordinates of the alignment. If either the beginning or end coordinate is identical in two alignments, the alignments are at least partially duplicates.

As always, please inform me of bugs as soon as possible.

William R. Pearson
Department of Biochemistry
Jordan Hall Box 800733
U. of Virginia
Charlottesville, VA
wrp@virginia.EDU

7. References

Altschul, S. F., Boguski, M. S., Gish, W., and Wootton, J. C. (1994). Issues in searching molecular sequence databases. Nature Genet. 6, 119-129.

Altschul, S. F. and Gish, W. (1996). Local alignment statistics. Methods Enzymol. 266, 460-480.

Bairoch, A. and Apweiler, R. (1996). The Swiss-Prot protein sequence data bank and its new supplement TrEMBL. Nucleic Acids Res. 24, 21-25.

Barker, W. C., Garavelli, J. S., Haft, D. H., Hunt, L. T., Marzec, C. R., Orcutt, B. C., Srinivasarao, G. Y., Yeh, L. S. L., Ledley, R. S., Mewes, H. W., Pfeiffer, F., and Tsugita, A. (1998). The PIR-International Protein Sequence Database. Nucleic Acids Res. 26, 27-32.

Dayhoff, M., Schwartz, R. M., and Orcutt, B. C. (1978). A model of evolutionary change in proteins. In Atlas of Protein Sequence and Structure, vol. 5, supplement 3. M. Dayhoff, ed. (Silver Spring, MD: National Biomedical Research Foundation), pp. 345-352.

Jones, D. T., Taylor, W. R., and Thornton, J. M. (1992). The rapid generation of mutation data matrices from protein sequences. Comp. Appl. Biosci. 8, 275-282.

Pearson, W. R. (2000). Flexible similarity searching with the FASTA3 program package. In Bioinformatics Methods and Protocols, S. Misener and S. A. Krawetz, ed. (Totowa, NJ: Humana Press), pp. 185-219.

Pearson, W. R. and Lipman, D. J. (1988). Improved tools for biological sequence comparison. Proc. Natl. Acad. Sci. USA 85, 2444-2448.

Pearson, W. R. (1995). Comparison of methods for searching protein sequence databases. Prot. Sci. 4, 1145-1160.

Pearson, W. R. (1996). Effective protein sequence comparison. Methods Enzymol. 266, 227-258.

Pearson, W. R. (1998). Empirical statistical estimates for sequence similarity searches. J. Mol. Biol. 276, 71-84.

Smith, T. F. and Waterman, M. S. (1981). Identification of common molecular subsequences. J. Mol. Biol. 147, 195-197.

Wootton, J. C. and Federhen, S. (1993). Statistics of local complexity in amino acid sequences and sequence databases. Comput. Chem. 17, 149-163.

Zhang, Z., Pearson, W. R., and Miller, W. (1997). Aligning a DNA sequence with a protein sequence. J. Computational Biology 4, 339-349.