.ip
\fC\-m 5\fP provides a combination of \fC\-m 4\fP and
\fC\-m 0\fP. \fC\-m 6\fP provides \fC\-m 5\fP plus HTML formatting.
.ip "-m 9"
provides coordinates and scores with the best score information.
A simple "\fC\-m 9\fP" extends the normal best score information:
.(l
.ft C
The best scores are:                                      opt bits E(14548)
XURTG4 glutathione transferase (EC 2.5.1.18) 4 -   ( 219) 1248 291.7 1.1e-79
.ft P
.)l
to include the additional information (on the same line, separated by
a <tab>):
.(l
.ft C
%_id  %_gid   sw  alen  an0  ax0  pn0  px0  an1  ax1 pn1 px1 gapq gapl  fs
0.771 0.771 1248  218    1  218    1  218    1  218    1  219   0   0   0
.ft P
.)l
\fC\-m 9c\fP provides additional information: an encoded alignment string.  Thus:
.(l I
.ft C
       10        20        30        40        50          60         70  
GT8.7  NVRGLTHPIRMLLEYTDSSYDEKRYTMGDAPDFDRSQWLNEKFKL--GLDFPNLPYL-IDGSHKITQ
       :.::  . :: ::  .   .:::         : .:    ::.:   .: : ..:.. :::  :..:
XURTG  NARGRMECIRWLLAAAGVEFDEK---------FIQSPEDLEKLKKDGNLMFDQVPMVEIDG-MKLAQ
               20        30                 40        50        60        
.ft P
.)l
would be encoded:
.(l
.ft C
=23+9=13-2=10-1=3+1=5
.ft P
.)l
The alignment encoding is with respect to the alignment, not the
sequences.  The coordinates of the alignment are given earlier in the
"\fC\-m 9c\fP" line.
.ip "-m 10"
\fC\-m 10\fP is a new, parseable format for use
with other programs.  See the file "readme.v20u4" for a more complete
description.
.ip
As of version "fa34t23b2", it has become possible to combine independent
"\fC\-m\fP" options.  Thus, one can use "\fC\-m 1 \-m 6 \-m 9\fP".
.ip "-M low\-high"
Include library sequences (proteins only) with lengths between low and
high.
.ip "-n"
Force the query sequence to be treated as a DNA sequence.  This is
particularly useful for query sequences that contain a large number of
ambiguous residues, e.g. transcription factor binding sites.
.ip "-O"
Send a copy of the results to "filename."
Helpful for environments without
STDOUT (mostly for the Macintosh).
.ip "-o"
Turn off default optimization of all scores greater than OPTCUT. Sort
results by "initn" scores (reduces the accuracy of statistical
estimates).
.ip "-p"
Force the query to be treated as a protein sequence.
.ip "-Q,-q"
Quiet - does not prompt for any input.  Writes scores and alignments
to the terminal or standard output file.
.ip "-r"
Specify match/mismatch scores for DNA comparisons.  The default is
"+5/-4". "+3/-2" can perform better in some cases.
.ip "-R file"
Save a results summary line for every sequence in the sequence
library.  The summary line includes the sequence identifier,
superfamily number (if available), position in the library, and the
similarity scores calculated.  This option can
be used to evaluate the sensitivity and selectivity of different
search strategies.[.wrp951,wrp981.]
.ip "-s file"
Specify the scoring matrix file.  \fCfasta3\fP uses the same scoring
matrices as Blast1.4/2.0.  Several scoring matrix files are included
in the standard distribution.  For protein sequences: \fCcodaa.mat\fP
- based on the minimum mutation matrix; \fCidnaa.mat\fP - identity matrix;
\fCpam250.mat\fP - the PAM250 matrix developed by Dayhoff et
al.;[.day787.]  \fCpam120.mat\fP - a PAM120 matrix.  The default
scoring matrix is BLOSUM50 ("-s BL50"). Other matrices available from
within the program are: PAM250/"-s P250", PAM120/"-s P120", PAM40/"-s
P40", PAM20/"-s P20", MDM10 - MDM40/"-s M10 \- M40" (MDM are modern
PAM matrices from Jones et al.,[.tay925.]), BLOSUM50, 62, and 80/"-s
BL50", "-s BL62", "-s BL80".
.ip "-S"
Treat lower-case characters in the query or library sequences as
"low-complexity" ("seg"-ed) residues.  Traditionally, the "seg"
program [.woo935.] is used to remove low complexity regions in protein
sequences by replacing the residues with an "X".  When the "-S" option
is used, the FASTA33 (and later) programs provide a potentially more
informative approach.
With "-S", lower case characters in the query
or database sequences are treated as "X"'s during the initial scan,
but are treated as normal residues during the final alignment display.
Since statistical significance is calculated from the similarity score
calculated during the library search, when the lower case residues are
"X"'s, low complexity regions will not produce statistically
significant matches.  However, if a significant alignment contains low
complexity regions, their alignment is shown.  With "-S", lower case
characters may be included in the alignment to indicate low complexity
regions, and the final alignment score may be higher than the score
obtained during the search.
.ip
The \fCpseg\fP program can be used to produce databases (or query
sequences) with lower case residues indicating low complexity regions
using the command:
.(l I
\fCpseg database.fasta -z 1 -q  > database.lc_seg\fP
.)l
(\fCseg\fP can also be used with some post processing, see readme.v33tx.)
.ip
The \fC\-S\fP option should always be used with \fCFASTX/Y\fP and
\fCTFASTX/Y\fP because out of frame translations often generate
low-complexity protein sequences.  However, only lower case characters
in the protein sequence (or protein database) are masked; lower case
DNA sequences are translated into upper case protein sequences, and
not treated as low complexity by the translated alignment programs.
.ip "-t #"
Translation table - tfasta3, fastx3, tfastx3, fasty3, and
tfasty3 now support the BLAST translation tables.  See
\fChttp://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi\fP.
.ip
In addition, "\-t t" or "\-t t#" turns on the addition of an implicit
termination codon to a protein:translated DNA match.  That is, each protein
sequence implicitly ends with "*", which matches the termination codons
for the appropriate genetic code.  "\-t t#" sets implicit termination
and a different genetic code.
.ip "-U"
Treat the query sequence as an RNA sequence.
In addition to selecting a
DNA/RNA alphabet, this option causes changes to the scoring matrix so
that 'G:A', 'T:C' or 'U:C' are scored as 'G:G'.
.ip "-V str"
It is now possible to specify some annotation characters that can be
included (and will be ignored) in the query sequence file.  Thus, one
might have a file with: \fC"ACVS*ITRLFT?"\fP, where "*" and "?" are
used to indicate phosphorylation.  By giving the option \fC\-V '*?'\fP,
those characters in the query will be moved to an "annotation string",
and alignments that include the annotated residues will be highlighted
with the appropriate character above the sequence (on the number line).
.ip "-w #"
Line length (width) = number (<200).
.ip "-W #"
Context length (default is 1/2 of line width -w) for alignment programs,
like fasta and ssearch, that provide additional sequence context.
.ip "-x #match,#mismatch"
Specify the penalty for a match to an 'X', and mismatch to 'X',
independently of the PAM matrix.  Particularly useful for
\fCfastx3/fasty3\fP, where termination codons are encoded as 'X'.
.ip "-X \"off1 off2\""
Specifies offsets for the beginning of the query and library sequence.
For example, if you are comparing upstream regions for two genes, and
the first sequence contains 500 nt of upstream sequence while the
second contains 300 nt of upstream sequence, you might try:
.(l I
\fCfasta -X "-500 -300" seq1.nt seq2.nt\fP
.)l
If the -X option is not used, FASTA assumes numbering starts with 1.
(You should double check to be certain the negative numbering works
properly.)
.ip "-y"
Set the width of the band used for calculating "optimized" scores.
For proteins and ktup=2, the width is 16.  For proteins with ktup=1,
the width is 32 by default.  For DNA the width is 16.
.ip "-z -1,0,1,2,3,4,5"
\fC\-z -1\fP turns off statistical calculations.
\fC\-z 0\fP estimates
the significance of the match from the mean and standard deviation of
the library scores, without correcting for library sequence length.
\fC\-z 1\fP (the default) uses a weighted regression of average score
vs library sequence length; \fC\-z 2\fP uses maximum likelihood
estimates of
.if t \(*l
.if n Lambda
and \fIK\fP; \fC\-z 3\fP uses Altschul-Gish
parameters;[.alt960.] \fC\-z 4 \- 5\fP use two variations on the
\fC\-z 1\fP strategy. \fC\-z 1\fP and \fC\-z 2\fP are the best methods,
in general.
.ip "-z 11,12,14,15"
Estimate the statistical parameters from shuffled copies of each
library sequence.  This doubles the time required for a search, but
allows accurate statistics to be estimated for libraries composed of
a single protein family.
.ip "-Z db_size"
Set the apparent size of the database to be used when calculating
expectation E() values.  If you searched a database with 1,000
sequences, but would like to have the E()-values calculated in the
context of a 100,000 sequence database, use '-Z 100000'.
.ip "-1"
Sort output by init1 score (for compatibility with FASTP - do not
use).
.ip "-3"
Translate only the three forward frames.
.sp
.lp
For example:
.(l
\fCfasta -w 80 -a seq1.aa seq2.aa\fP
.)l
would compare the sequence in seq1.aa to that in seq2.aa and display the
results with 80 residues on an output line, showing all of the residues
in both sequences.  Be sure to enter the options before entering the file
names, or just enter the options on the command line, and the program will
prompt for the file names.
.sp
.pp
(November, 1997) In addition, it is now possible to provide the fasta
programs with the query sequence (fasta, fasty, ssearch, tfastx), or
two sequences (prss, lalign, plalign) from the unix "stdin" stream.  This
makes it much easier to set up FASTA or PRSS WWW pages.
To specify
that stdin be used, rather than a file, the file name should be
specified as '-' or '@' (the latter file name makes it possible to
specify a subset of the sequence).
Thus:
.(l
\fCcat query.aa | fasta -q @:25-75 s\fP
.)l
would take residues 25-75 from query.aa and search the 's' library
(see the discussion of FASTLIBS).
.sh 2 "Environment variables"
.pp
Because the current version of the program allows the user to set
virtually every option on the command line (except the \fIktup\fP,
which must be set as the third command line argument), only the
\fCFASTLIBS\fP environment variable is routinely used.
.ip "FASTLIBS"
specifies the location of the file which contains the list of library
descriptions, locations, and library types (see section on finding
library files).
.sh 1 "Frequently Asked Questions (FAQs)"
.np
\fIWhich program should I use?\fP See Table I.
.np
\fIHow do I search both DNA strands with\fP \fCfasta3\fP \fIand\fP
\fCfastx3\fP? With version 32 of the FASTA program package, all
searches that use DNA queries (e.g. \fCfasta3\fP, \fCfastx3/y3\fP)
examine both strands. To revert to earlier FASTA behavior - only
looking at the forward or reverse strand - use \fC\-3\fP to search only
the forward strand and \fC\-i \-3\fP to search only the reverse strand.
.np
\fIWhen I search Genbank - the program reports:\fP \fC0 residues in 0
sequences\fP.  This typically happens because the program does not
know that you are searching a Genbank flatfile database and is looking
for a FASTA format database.  Be certain to specify the library type
("1" for Genbank flatfile) with the database name.
.np
\fIWhat is the difference between\fP \fCfastx3\fP \fIand\fP \fCfasty3\fP
\fI(or\fP \fCtfastx3\fP \fIand\fP \fCtfasty3\fP\fI)?\fP
\fC[t]fastx3\fP uses a simpler
codon based model for alignments that does not allow frameshifts in
some codon positions (see ref. [.wrp971.]).
\fCtfastx3\fP is about
30% faster, but \fCtfasty3\fP can produce higher quality alignments in
some cases.
.np
\fIWhen I run\fP \fCfasta3 -q\fP, \fII don't see any (or very little)
output, but I get lots of scores when I run interactively.\fP With the
\fC\-Q\fP option, the number of high scores displayed is limited by the
\fC\-E #\fP cutoff, which is 10.0 for protein comparisons, 2.0 for DNA
comparisons, and 5.0 for translated DNA:protein comparisons.  In
interactive mode (without \fC\-Q\fP), by default you see 20 high
scores, regardless of \fCE()\fP value.
.np
\fIWhat is ktup?\fP All of the programs with \fCfast\fP in their
name use a computer science method called a lookup table to speed the
search.  For proteins with \fIktup\fP=2, this means that the program
does not look at any sequence alignment that does not involve matching
two adjacent identical residues in both sequences.  Likewise with DNA and
\fIktup\fP=6, the initial alignment of the sequences looks for 6
identical adjacent nucleotides in both sequences.  Because it is less
likely that two adjacent identical amino-acids will line up by chance in two
unrelated proteins, this speeds up the comparison.  But very distantly
related sequences may never have two identical residues in a row but
will have single aligned identities.  In this case, \fIktup\fP=1 may
find alignments that \fIktup\fP=2 misses.
.np
\fISometimes, in the list of best scores, the same sequence is shown
twice with exactly the same score.  Sometimes, the sequence is there
twice, but the scores are slightly different.\fP When any of the
\fCfasta3\fP programs searches a long sequence, it breaks the sequence
up into \fIoverlapping\fP pieces.  The length of the piece depends on
the length of the query and the particular program being used (it can
also be controlled with the -N #### option).
Since the pieces overlap
by the length of the query sequence (or 3*query_length for fastx/y3
and tfasta/x/y3), if the highest scoring alignment is at the end of
one piece, it will be scored again at the beginning of the next piece.
If the alignment is not completely included in the overlap region,
one of the pieces will give a higher score than the other.  These
duplications can be detected by looking at the coordinates of the
alignment.  If either the beginning or end coordinate is identical in
two alignments, the alignments are at least partially duplicates.
.lp
As always, please inform me of bugs as soon as possible.
.sp
.nf
William R. Pearson
Department of Biochemistry
Jordan Hall Box 800733
U. of Virginia
Charlottesville, VA
wrp@virginia.EDU
.sh 1 "References"
.[]