RELATE    significance program described by Dayhoff (Atlas of
          Protein Sequence and Structure, Vol. 5, Supplement 3).
          Each chunk of 25 residues in one sequence is compared
          to every 25 residue fragment of the second sequence.
          Sequences which are genuinely related will have a large
          number of scores greater than 3 standard deviations
          above the mean score of all of the comparisons.

3.6.  Other analysis programs

AACOMP    calculate the amino acid composition and molecular
          weight of a sequence.

BESTSCOR  calculate the best self-comparison score.

GREASE    Kyte-Doolittle hydropathicity profile

TGREASE   graphic plot of Kyte-Doolittle profile

FROMGB    convert from GenBank LOCUS format (also used by the
          IBI-Pustell programs) to Pearson/FASTA format.

GARNIER   A secondary structure prediction program using the
          method of Garnier, Osguthorpe, and Robson, J. Mol.
          Biol., (1978) 120:97-120.

3.7.  Options

     These programs have a number of output options, which are
invoked by the environment variables LINLEN, SHOWALL, and MARKX.
Alternatively, these values can be controlled by command line
options.  The number of sequence residues per output line is
adjustable by setting the environment variable LINLEN, or the
command line option -w.  LINLEN is normally 60; to change it, set
LINLEN=80 before running the program or add -w 80 to the command
line.  LINLEN can be set up to 200.  SHOWALL (-a) determines
whether all, or just a portion, of the aligned sequences are
displayed.  Previously, FASTP would show the entire length of
both sequences in an alignment while FASTN would only show the
portions of the two sequences that overlapped.  Now the default is
to show only the overlap between the two sequences; to show
complete sequences, set SHOWALL=1, or use the -a option on the
command line.

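     As an illustration of how the LINLEN setting and the -w
option interact, here is a minimal Python sketch.  It is not taken
from the FASTA source code.  The function names resolve_line_length
and wrap_residues are hypothetical, and clamping out-of-range
values to 200 is an assumption; the real programs may treat such
values differently.  The sketch relies only on what this document
states: the default is 60, the maximum is 200, and command line
parameters override environment variable settings.

```python
import os

DEFAULT_LINLEN = 60   # "LINLEN is normally 60"
MAX_LINLEN = 200      # "LINLEN can be set up to 200"


def resolve_line_length(argv, environ=None):
    """Return residues per output line from -w, else LINLEN, else 60."""
    environ = os.environ if environ is None else environ
    value = environ.get("LINLEN")
    # A "-w N" pair on the command line overrides the environment variable.
    for i, arg in enumerate(argv[:-1]):
        if arg == "-w":
            value = argv[i + 1]
    if value is None:
        return DEFAULT_LINLEN
    # Assumption: out-of-range values are clamped rather than rejected.
    return max(1, min(int(value), MAX_LINLEN))


def wrap_residues(sequence, line_length):
    """Split a sequence string into lines of at most line_length residues."""
    return [sequence[i:i + line_length]
            for i in range(0, len(sequence), line_length)]


if __name__ == "__main__":
    width = resolve_line_length(["-w", "80", "seq1.aa", "seq2.aa"])
    for line in wrap_residues("MWRTCGPPYT" * 20, width):
        print(line)
```

     Passing the environment as an argument keeps the lookup easy
to check without changing the real process environment.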
     The differences between the two aligned sequences can be
highlighted in several different ways by changing the environment
variable MARKX or the -m option.  Normally (MARKX=0) the program
uses ':' to denote identities and '.' to denote conservative
replacements.  If MARKX=1, the program will not mark identities;
instead conservative replacements are denoted by an 'x' and
non-conservative substitutions by an 'X'.  If MARKX=2, the residues
in the second sequence are only shown if they are different from
the first.  MARKX=3 displays the aligned library sequences without
the query sequence; these can be used to build a primitive multiple
alignment.  MARKX=4 provides a graphical display of the
boundaries of the alignments.  Thus the five options are:

     MARKX=0      MARKX=1       MARKX=2       MARKX=3      MARKX=4

    MWRTCGPPYT   MWRTCGPPYT    MWRTCGPPYT                 MWRTCGPPYT
    ::..:: :::     xx  X       ..KS..Y...    MWKSCGYPYT   ----------
    MWKSCGYPYT   MWKSCGYPYT

(fasta20u4, Feb. 1996) In addition, MARKX=10 is a new, parseable
format for use with other programs.  See the file "readme.v20u4"
for a more complete description.

3.8.  Command line options

     It is now possible to specify several options on the
command line, instead of using environment variables.  The
command line options are preceded by a dash; the following
options are available:

-a        same as SHOWALL=1

-A        force Smith-Waterman alignments for DNA sequences and
          TFASTA.  By default, only FASTA protein sequence
          comparisons use Smith-Waterman alignments.

-b #      Number of sequence scores to be shown on output.  In
          the absence of this option, fasta (and tfasta and
          ssearch) display all library sequences obtaining
          similarity scores with expectations less than 10.0 if
          optimized scores are used, or 2.0 if they are not.
          The -b option can limit the display further, but it will
          not cause additional sequences to be displayed.

-c #      Threshold score for optimization (OPTCUT).  Set "-c 1"
          to optimize every sequence in a database.  (This slows
          the program down about 5-fold.)

-E #      Limit the number of scores and alignments shown based
          on the expected number of scores.  Used to override the
          expectation value of 10.0 used by default.  When used
          with -Q, -E 2.0 will show all library sequences with
          scores with an expectation value <= 2.0.

-d #      Number of alignments to be reported by default.  (Used
          in conjunction with -Q.)  No longer necessary, see "-b"
          above.

-f        Penalty for the first residue in a gap (-12 by default
          for proteins, -16 for DNA or for TFASTA).

-g        Penalty for additional residues in a gap (-2 by default
          for proteins, -4 for DNA and TFASTA).

-h        Penalty for frameshift (FASTX, TFASTX only).

-H        Omit histogram.

-i        Invert (reverse complement) the query sequence if it is
          DNA.  For TFASTX, search the reverse complement of the
          library sequence only.

-k #      Threshold for joining init1 segments to build an initn
          score (GAPCUT).

-l file   Location of library menu file (FASTLIBS).

-L        Display more information about the library sequence in
          the alignment.

-m #      MARKX = # (0, 1, 2, 3, 4, 10)

-n        Force the query sequence to be treated as a DNA
          sequence.  This is particularly useful for query
          sequences that contain a large number of ambiguous
          residues, e.g. transcription factor binding sites.

-O        Send copy of results to "filename."  Helpful for
          environments without STDOUT.

-o        Turn off default optimization of all scores greater
          than OPTCUT.  Sort results by "initn" scores.

-Q,-q     Quiet - does not prompt for any input.
          Writes scores and alignments to the terminal or
          standard output file.

-r file   Save a results summary line for every sequence in the
          sequence library.  The summary line includes the
          sequence identifier, superfamily number (if available),
          position in the library, and the similarity scores
          calculated.  This option can be used to evaluate the
          sensitivity and selectivity of different search
          strategies (see W. R. Pearson (1991) Genomics
          11:635-650).

-s file   SMATRIX is read from file.  Several SMATRIX files are
          provided with the standard distribution.  For protein
          sequences: codaa.mat - based on minimum mutation
          matrix; idnaa.mat - identity matrix; pam250.mat - the
          PAM250 matrix developed by Dayhoff et al. (Atlas of
          Protein Sequence and Structure, vol. 5, suppl. 3,
          1978); pam120.mat - a PAM120 matrix.  The default
          scoring matrix is BLOSUM50; PAM250 is available with
          "-s 250", and BLOSUM62 ("-s BL62") is also available.

-v        (LINEVAL) values used for line styles in plfasta

-w #      Line length (width) = number (<200)

-x        Specifies offsets for the beginning of the query and
          library sequence.  For example, if you are comparing
          upstream regions for two genes, and the first sequence
          contains 500 nt of upstream sequence while the second
          contains 300 nt of upstream sequence, you might try:

              fasta -x "-500 -300" seq1.nt seq2.nt

          If the -x option is not used, FASTA assumes numbering
          starts with 1.  This option will not work properly with
          the translated library sequence with tfasta.  (You
          should double check to be certain the negative
          numbering works properly.)

-y        Set the width of the band used for calculating
          "optimized" scores.  For proteins and ktup=2, the width
          is 16.
          For proteins with ktup=1, the width is 32 by
          default.  For DNA the width is 16.

-z        Turn off statistical calculations.

-1        sort output by init1 score (as FASTP used to do).

-3        (TFASTA, TFASTX only) translate only three forward
          frames

For example:

    fasta -w 80 -a seq1.aa seq2.aa

would compare the sequence in seq1.aa to that in seq2.aa and
display the results with 80 residues on an output line, showing
all of the residues in both sequences.  Be sure to enter the
options before entering the file names, or just enter the options
on the command line, and the program will prompt for the file
names.

     Not all of these options are appropriate for all of the
programs.  The options above are used by FASTA and TFASTA.  RELATE
uses the -s option, ALIGN uses the -w, -m, and -s options, and
the PRDF program uses -c, -f, -k, and -s.

4.  Environment variable summary

     Environment variables allow you to set search parameters
that will be used frequently when you run a program; for example,
if you prefer to use the PAM250 scoring matrix, you might "set
SMATRIX=250."  Command line parameters, if used, always override
environment variable settings.  The following environment
variables are used by this program:

AABANK    the file name of the default sequence library.

FASTLIBS  the location of the file which contains the list of
          library files to be searched.

GAPCUT    threshold used for joining init1 regions in the second
          step of FASTA.  Normally set based on sequence length
          and ktup.

LIBTYPE   used to specify the format of the library sequence for
          FASTA and TFASTA.

LINLEN    output line length - can go up to 200

LINEVAL   used by plfasta to determine the relationship between
          line style and similarity score (-v).  This should be a
          string of three numbers, e.g. "200 100 50"

MARKX     symbol for denoting matches, mismatches.
          Note that this symbol is only used across the optimized
          local region; sequences that are outside this region
          are not marked.

OPTCUT    Set the threshold to be used for optimization in a band
          around the best initial region.  Normally the OPTCUT
          value is calculated from the length of the sequence and
          the ktup value (for a 200 residue sequence, it is about
          28).  If OPTCUT=1, every sequence in the database will
          be optimized.  This is the most sensitive option.

PAMFACT   This version of fasta uses a more sensitive method for
          identifying initial regions.  Instead of using a
          constant factor (fact) for each match in a ktup, it
          uses the scoring matrix (PAM) scores.  While this works
          well for protein sequences, it has not been as
          carefully tested for DNA sequences, so by default, this
          modification is used for proteins but not for DNA.
          Setting the PAMFACT environment variable to 1 forces
          the option on; PAMFACT=0 turns it off.

SHOWALL   on output, show the complete sequence instead of just
          the overlap of the two aligned sequences.

SMATRIX   alternative scoring matrix file.

TEKPLOT   (IBM-PC only; Unix and VMS versions generate Tektronix
          graphics by default) Generate Tektronix output.
          Normally, PLFASTA and TGREASE plot graphs using the
          Turbo C graphics library.  Unfortunately, often these
          plots cannot be printed out without special programs.
          However, if you set TEKPLOT=1, Tektronix graphics
          commands will be used.  Tektronix commands can be used
          together with the PLOTDEV program, available from
          Microplot Systems.  They no longer sell this program,
          but it can be downloaded from
          http://iquest.com/~microplt/index1.html.  PLOTDEV also
          allows you to print out graphics on the screen.

As always, please inform me of bugs as soon as possible.

William R. Pearson
Department of Biochemistry
Box 440, Jordan Hall
U. of Virginia
Charlottesville, VA

wrp@virginia.EDU