RELATE    significance program described by Dayhoff (Atlas of
          Protein Sequence and Structure, Vol. 5, Supplement 3).
          Each chunk of 25 residues in one sequence is compared
          to every 25-residue fragment of the second sequence.
          Sequences that are genuinely related will have a large
          number of scores greater than 3 standard deviations
          above the mean score of all of the comparisons.

3.6. Other analysis programs

AACOMP    calculate the amino acid composition and molecular
          weight of a sequence.

BESTSCOR  calculate the best self-comparison score.

GREASE    Kyte-Doolittle hydropathicity profile

TGREASE   graphic plot of Kyte-Doolittle profile

FROMGB    convert from GenBank LOCUS format (also used by the
          IBI-Pustell programs) to Pearson/FASTA format.

GARNIER   a secondary structure prediction program using the
          method of Garnier, Osguthorpe, and Robson, J. Mol.
          Biol. (1978) 120:97-120.

3.7. Options

These programs have a number of output options, which are
invoked by the environment variables LINLEN, SHOWALL, and MARKX.
Alternatively, these values can be controlled by command line
options.

The number of sequence residues per output line is adjustable by
setting the environment variable LINLEN, or the command line
option -w.  LINLEN is normally 60; to change it, set LINLEN=80
before running the program or add -w 80 to the command line.
LINLEN can be set up to 200.

SHOWALL (-a) determines whether all, or just a portion, of the
aligned sequences are displayed.  Previously, FASTP would show
the entire length of both sequences in an alignment while FASTN
would only show the portions of the two sequences that
overlapped.  Now the default is to show only the overlap between
the two sequences; to show complete sequences, set SHOWALL=1, or
use the -a option on the command line.

The differences between the two aligned sequences can be
highlighted in three different ways by changing the environment
variable MARKX or the -m option.  Normally (MARKX=0) the program
uses ':' to denote identities and '.'
to denote conservative replacements.  If MARKX=1, the program
will not mark identities; instead conservative replacements are
denoted by an 'x' and non-conservative substitutions by an 'X'.
If MARKX=2, the residues in the second sequence are only shown if
they are different from the first.  MARKX=3 displays the aligned
library sequences without the query sequence; these can be used
to build a primitive multiple alignment.  MARKX=4 provides a
graphical display of the boundaries of the alignments.  Thus the
five options are:

     MARKX=0     MARKX=1     MARKX=2     MARKX=3     MARKX=4
     MWRTCGPPYT  MWRTCGPPYT  MWRTCGPPYT              MWRTCGPPYT
     ::..:: :::    xx  X     ..KS..Y...  MWKSCGYPYT  ----------
     MWKSCGYPYT  MWKSCGYPYT

(fasta20u4, Feb. 1996) In addition, MARKX=10 is a new, parseable
format for use with other programs.  See the file "readme.v20u4"
for a more complete description.

3.8. Command line options

It is now possible to specify several options on the command
line, instead of using environment variables.  The command line
options are preceded by a dash; the following options are
available:

-a        same as SHOWALL=1

-A        force Smith-Waterman alignments for DNA sequences and
          TFASTA.  By default, only FASTA protein sequence
          comparisons use Smith-Waterman alignments.

-b #      Number of sequence scores to be shown on output.  In
          the absence of this option, fasta (and tfasta and
          ssearch) display all library sequences obtaining
          similarity scores with expectations less than 10.0 if
          optimized scores are used, or 2.0 if they are not.  The
          -b option can limit the display further, but it will
          not cause additional sequences to be displayed.

-c #      Threshold score for optimization (OPTCUT).  Set "-c 1"
          to optimize every sequence in a database.  (This slows
          the program down about 5-fold.)

-E #      Limit the number of scores and alignments shown based
          on the expected number of scores.  Used to override the
          expectation value of 10.0 used by default.
          When used with -Q, -E 2.0 will show all library
          sequences with scores with an expectation value <= 2.0.

-d #      Number of alignments to be reported by default.  (Used
          in conjunction with -Q).  No longer necessary, see "-b"
          above.

-f        Penalty for the first residue in a gap (-12 by default
          for proteins, -16 for DNA or for TFASTA).

-g        Penalty for additional residues in a gap (-2 by default
          for proteins, -4 for DNA and TFASTA).

-h        Penalty for frameshift (FASTX, TFASTX only).

-H        Omit histogram.

-i        Invert (reverse complement) the query sequence if it is
          DNA.  For TFASTX, search the reverse complement of the
          library sequence only.

-k #      Threshold for joining init1 segments to build an initn
          score (GAPCUT).

-l file   Location of library menu file (FASTLIBS).

-L        Display more information about the library sequence in
          the alignment.

-m #      MARKX = # (0, 1, 2, 3, 4, 10)

-n        Force the query sequence to be treated as a DNA
          sequence.  This is particularly useful for query
          sequences that contain a large number of ambiguous
          residues, e.g. transcription factor binding sites.

-O        Send copy of results to "filename."  Helpful for
          environments without STDOUT.

-o        Turn off default optimization of all scores greater
          than OPTCUT.  Sort results by "initn" scores.

-Q,-q     Quiet - does not prompt for any input.  Writes scores
          and alignments to the terminal or standard output file.

-r file   Save a results summary line for every sequence in the
          sequence library.  The summary line includes the
          sequence identifier, superfamily number (if available)
          position in the library, and the similarity scores
          calculated.  This option can be used to evaluate the
          sensitivity and selectivity of different search
          strategies (see W. R. Pearson (1991) Genomics
          11:635-650.)

-s file   SMATRIX is read from file.  Several SMATRIX files are
          provided with the standard distribution.
          For protein sequences:
            codaa.mat  - based on minimum mutation matrix;
            idnaa.mat  - identity matrix;
            pam250.mat - the PAM250 matrix developed by Dayhoff
                         et al. (Atlas of Protein Sequence and
                         Structure, vol. 5, suppl. 3, 1978);
            pam120.mat - a PAM120 matrix.
          The default scoring matrix is BLOSUM50; PAM250 is
          available with "-s 250", and BLOSUM62 ("-s BL62") is
          also available.

-v        (LINEVAL) values used for line styles in plfasta

-w #      Line length (width) = number (<200)

-x        Specifies offsets for the beginning of the query and
          library sequence.  For example, if you are comparing
          upstream regions for two genes, and the first sequence
          contains 500 nt of upstream sequence while the second
          contains 300 nt of upstream sequence, you might try:

               fasta -x "-500 -300" seq1.nt seq2.nt

          If the -x option is not used, FASTA assumes numbering
          starts with 1.  This option will not work properly with
          the translated library sequence with tfasta.  (You
          should double check to be certain the negative
          numbering works properly.)

-y        Set the width of the band used for calculating
          "optimized" scores.  For proteins and ktup=2, the width
          is 16.  For proteins with ktup=1, the width is 32 by
          default.  For DNA the width is 16.

-z        Turn off statistical calculations.

-1        sort output by init1 score (as FASTP used to do).

-3        (TFASTA, TFASTX only) translate only three forward
          frames

For example:

     fasta -w 80 -a seq1.aa seq2.aa

would compare the sequence in seq1.aa to that in seq2.aa and
display the results with 80 residues on an output line, showing
all of the residues in both sequences.  Be sure to enter the
options before entering the file names, or just enter the options
on the command line, and the program will prompt for the file
names.

Not all of these options are appropriate for all of the programs.
The options above are used by FASTA and TFASTA.  RELATE uses the
-s option, ALIGN uses the -w, -m, and -s options, and the PRDF
program uses -c, -f, -k, and -s.

4. Environment variable summary

Environment variables allow you to set search parameters that
will be used frequently when you run a program; for example, if
you prefer to use the PAM250 scoring matrix, you might "set
SMATRIX=250."  Command line parameters, if used, always override
environment variable settings.  The following environment
variables are used by this program:

AABANK    the file name of the default sequence library.

FASTLIBS  the location of the file which contains the list of
          library files to be searched.

GAPCUT    threshold used for joining init1 regions in the second
          step of FASTA.  Normally set based on sequence length
          and ktup.

LIBTYPE   used to specify the format of the library sequence for
          FASTA and TFASTA.

LINLEN    output line length - can go up to 200

LINEVAL   used by plfasta to determine the relationship between
          line style and similarity score (-v).  This should be a
          string of three numbers, e.g. "200 100 50"

MARKX     symbol for denoting matches, mismatches.  Note that
          this symbol is only used across the optimized local
          region; sequences that are outside this region are not
          marked.

OPTCUT    Set the threshold to be used for optimization in a band
          around the best initial region.  Normally the OPTCUT
          value is calculated from the length of the sequence and
          the ktup value (for a 200 residue sequence, it is about
          28).  If OPTCUT=1, every sequence in the database will
          be optimized.  This is the most sensitive option.

PAMFACT   This version of fasta uses a more sensitive method for
          identifying initial regions.  Instead of using a
          constant factor (fact) for each match in a ktup, it
          uses the scoring matrix (PAM) scores.  While this works
          well for protein sequences, it has not been as
          carefully tested for DNA sequences, so by default, this
          modification is used for proteins but not for DNA.
          Setting the PAMFACT environment variable to 1 forces
          the option on; PAMFACT=0 turns it off.

SHOWALL   on output, show the complete sequence instead of just
          the overlap of the two aligned sequences.

SMATRIX   alternative scoring matrix file.

TEKPLOT   (IBM-PC only; Unix and VMS versions generate Tektronix
          graphics by default) Generate Tektronix output.
          Normally, PLFASTA and TGREASE plot graphs using the
          Turbo C graphics library.  Unfortunately, often these
          plots cannot be printed out without special programs.
          However, if you set TEKPLOT=1, Tektronix graphics
          commands will be used.  Tektronix commands can be used
          together with the PLOTDEV program, available from
          Microplot Systems.  They no longer sell this program,
          but it can be downloaded from
          http://iquest.com/~microplt/index1.html.  PLOTDEV also
          allows you to print out graphics on the screen.

As always, please inform me of bugs as soon as possible.

William R. Pearson
Department of Biochemistry
Box 440, Jordan Hall
U. of Virginia
Charlottesville, VA

wrp@virginia.EDU
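
A note on the precedence rule in section 4 (command line
parameters override environment variables): the sketch below
illustrates that rule for the output line length only.  It is not
taken from the FASTA sources.  The function name resolve_linlen is
hypothetical, the default of 60 and the limit of 200 come from
section 3.7, and clamping an oversized value to 200 is an
assumption of this sketch, since the text does not say how the
programs treat larger values.

```python
import os

DEFAULT_LINLEN = 60   # "LINLEN is normally 60" (section 3.7)
MAX_LINLEN = 200      # "LINLEN can be set up to 200" (section 3.7)


def resolve_linlen(argv, environ=None):
    """Pick the output line length from -w (argv) or LINLEN (environ).

    Hypothetical helper for illustration; not part of the FASTA package.
    """
    environ = os.environ if environ is None else environ

    # Start from the documented default, then apply the environment.
    value = DEFAULT_LINLEN
    if "LINLEN" in environ:
        value = int(environ["LINLEN"])

    # A command line option, if present, overrides the environment.
    for i, arg in enumerate(argv[:-1]):
        if arg == "-w":
            value = int(argv[i + 1])

    # Assumption of this sketch: clamp to the documented maximum.
    return min(value, MAX_LINLEN)


if __name__ == "__main__":
    # Mirrors "fasta -w 80 -a seq1.aa seq2.aa" run with LINLEN=100 set.
    args = ["-w", "80", "-a", "seq1.aa", "seq2.aa"]
    print(resolve_linlen(args, {"LINLEN": "100"}))
```

With LINLEN=100 in the environment and -w 80 on the command line,
the sketch selects 80; with no -w option it falls back to LINLEN,
and with neither it uses 60.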