readme.v33t0

allow scoring matrices with unusual scaling to be used.  In v33t05,
there was a line that excluded all scores > 300 from the statistical
estimation procedure.  While 300 is a high score with any "normal"
scoring matrix, some investigators were using matrices scaled 10X, so
that a score of 300 was really a score of 30 with a conventional
matrix, and should not be excluded.  Unfortunately, removing the test
to exclude scores > 300 meant that when a rRNA sequence was used to
search the bacterial division, tens of thousands of high scoring
related sequences were treated as if they were unrelated.  As a
result, the variance estimates were much too high, so high real
scores had low z-scores and were not statistically significant.
(There appear to be more than 20,000 rRNA sequences in the bacterial
division of Genbank, almost 25% of all sequences).

The solution to the problem is a substantial enhancement in the
strategies used to exclude high-scoring, related sequences, the -z 1,
4, and 5 parameter estimation strategies.  The programs now estimate
the expected high scoring sequence by calculating an ungapped Lambda
and K, and then use a relatively conservative threshold for excluding
scores that are higher than would be expected 0.01 times by chance.
By calculating Lambda and K, we can scale the cutoff thresholds to
allow scoring matrices with unusual scales.  For "normal" searches,
there should be little change, but there should be an improvement for
searches with large numbers of related sequences in the database.

As a result of testing for this change, a bug in the karlin() function
used with -z 6 was found and corrected.

=======
>>Sept. 9, 2000

Changes to manshowbest.c to include correct display coordinates.
Significant changes to structs.h, param.h, p2_complib.c,
p2_workcomp.c, to store and use a reliable a_struct for alignment
coordinates.

Other cosmetic changes.

>>Sept. 7, 2000

Minor changes to complib.c, showrss.c, so that prss33 -q uses 200
shuffles and prss33 provides bit scores, rather than z-scores.
(no version number change).

Modifications to p2_complib.c to include superfamily numbers for
ps4comp* ms4comp*.

>>Aug 22, 2000

Changes to mmgetaa.c, ncbl2_mlib.c, dropfs.c to accommodate AIX.

00README.1st updated to reflect the current version and correct
outdated information on threads.

>>Aug. 3, 2000

Modifications to initpam2() in initsw.c to correct a problem with pam_x
when the -S option is used.

Modifications to compacc.c, scaleswn.c to ensure that residue numbers
are calculated properly when more than 2 Gb of sequence is searched.

>>July 12, 2000

Modifications to dropnfa.c so that DNA matches to 'N' will be included
in the "ungapped %identity".  Thus, a sequence that is 100% identical
for 100 nt on either side of a 100 nt region that has been masked to
'NNNNN' will be reported as: "67% identical (100% ungapped)".  This
has been added to deal with masked BAC-end databases.  It would be
better if masking changed the letters to lowercase, but the mouse
BAC-end sequences at TIGR use 'NNNNN'.  This is currently available
only for the fasta function, not [t]fast[x/y], etc, and only for DNA
sequences.

mk_n_pam() in apam.c modified to ensure that mismatch scores of -1
remain -1.

>>June 25, 2000

Modification to nxgetaa.c, nmgetaa.c, mmgetaa.c to return Genbank
Accession number as part of the descriptive string.

>>June 11, 2000

(no version change - not yet released)

Modifications to calcons(), calc_id(), showbest(), p_workcomp.c to
provide ngap_q (number of alignment gaps in query), ngap_l (number
of gaps in library) information for -m 9 output.

>>June 6, 2000

(no version change - not yet released)

Modified scaleswn.c to provide better support for unconventional
scoring matrices, in particular, scoring matrices where every
value is 50-times higher.
Previous versions of the MLE estimator (-z
2) started with lambda = 0.2, which is too high for a scoring matrix
going from -500:+1500.  The initial estimate for lambda is now
calculated using the formula: lambda = pi/sqrt(6*variance).  For the
default -z 1, a restriction to limit scores to a maximum of 300 for
the statistical analysis was removed.

>>June 3, 2000

Modified alignment output, and -m 9 and -m 10, to report an "ungapped"
identity as well as the traditional "gapped" identity.  The
traditional "gapped" identity reports the number of identities divided
by the overall length of the alignment, including gaps.  The
"ungapped" identity does not include gaps in the length of the
alignment.  This new value is included for alignments that include
introns; thus, a tfastx33 search might find the 100% identical genomic
sequence but, if a short intron were included in the alignment (the
alignment probably would not span a long exon), report the gapped
percent identity as 66%.  The "ungapped" identity would remain 100%.
The ungapped identity value is also shown in the "-m 9" output line
after the "gapped" fraction identical.

>>June 1, 2000

Modified -m 9 output to provide fraction identical, alignment boundary
information with the initial list of high scoring sequences, just as
the pv3comp and mp_comp versions do.  The -m 9 option now shows the
same alignment display as -m 0, but the width of the alignment is
increased by 40.
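The "gapped" and "ungapped" identity definitions in the June 3, 2000
entry can be illustrated with a minimal Python sketch.  This is not the
calc_id() code from the FASTA sources; the function name
percent_identity() and the use of '-' as the gap character are
assumptions made for illustration only.

```python
def percent_identity(query_aln: str, library_aln: str, gap: str = "-"):
    """Return (gapped, ungapped) percent identity for two aligned strings.

    Hypothetical helper, not part of FASTA: both strings must already be
    aligned (equal length), with `gap` marking alignment gaps.
    """
    if len(query_aln) != len(library_aln):
        raise ValueError("aligned sequences must have equal length")
    identities = 0
    gap_columns = 0
    for q, l in zip(query_aln, library_aln):
        if q == gap or l == gap:
            gap_columns += 1          # column contains a gap
        elif q.upper() == l.upper():
            identities += 1           # identical residues
    aligned_len = len(query_aln)                 # includes gap columns
    ungapped_len = aligned_len - gap_columns     # excludes gap columns
    gapped = 100.0 * identities / aligned_len if aligned_len else 0.0
    ungapped = 100.0 * identities / ungapped_len if ungapped_len else 0.0
    return gapped, ungapped


# 100 identical columns, a 100-column gap (for example an intron),
# then 100 more identical columns.
query = "A" * 100 + "-" * 100 + "A" * 100
library = "A" * 300
gapped, ungapped = percent_identity(query, library)
print(f"{gapped:.0f}% identical ({ungapped:.0f}% ungapped)")
# prints: 67% identical (100% ungapped)
```

The only difference between the two values is the denominator: the
gapped identity divides by every alignment column, the ungapped
identity only by columns where both sequences have a residue.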
Thus, by default, -m 9 will show the list of best
hits, with percent identity, Smith-Waterman score, and alignment
boundaries initially, and then show standard (-m 0) alignments with
100 residues/line.

>>May 29, 2000

Correct some problems with reading data files with <CR>'s under unix.

nmgetaa.c/nxgetaa.c/mmgetaa.c have been modified to convert <TAB>
('\t') to <SPC> (' ') in descriptive lines.

=======
>>May 3, 2000

Corrected problem with very low mean_var in fit_llen() in scaleswn.c.

>>May 2, 2000  (no version number change - previous version not released)

Merged fasta33t05d2 with fasta33t06.  Also removed restriction on
"-M size-range" to proteins - the size range now can be applied to DNA
as well.

>>May 1, 2000 (changes to v33t05d merged into v33t06)

Introduced changes to include '*' as a valid sequence character, which
indicates termination.  Thus, 'TGA', 'TAG', and 'TAA' are now
translated to '*' rather than 'X', and the protein PAM matrices have
been modified to provide a match score of approximately 1/2 the max
identity score for a '*:*' match.  Otherwise, '*' is the same as 'X'.
This change only affects query sequences that include a '*' to
indicate an end of sequence; the '*' is not there by default.

The inclusion of '*' broke some things in tfasts33, tfastf33, fasty33,
and tfasty33, which were fixed today.

>>March 28, 2000/April 24, 2000 --> v33t06

(a) -z 6 statistics that factor in composition
(b) -smatrix-offset pam-offset parameter

(a) This release provides a new statistics option, -z 6, which
provides a more sophisticated model that accounts for sequence
composition.  When -z 6 is used (only for fasta33(_t) and
ssearch33(_t)), the program calculates a composition parameter
comp=1/lambda using a modified version of the Karlin-Altschul karlin()
function.  As a result, every sequence in the database has an
associated length (n1) and composition (comp).

The length n1 and composition comp are used in the maximum likelihood
estimation described by Mott (1992) Bull. Math. Biol. 54:59-75.  Four
parameters are estimated, a0, a1, a2, and b1, and the probability of
obtaining a score is then:

  p(s >= x) = 1-exp(-exp(-( a0 + a1*comp + a2*comp*log(n0*n1) + x)/(b1*comp)))

The maximum likelihood estimates of a0, a1, a2, and b1 are calculated
using the Nelder-Mead simplex search strategy.

The average Lambda is reported for the search using Lambda =
1/(b1*ave_comp), where ave_comp is the geometric mean of the comp
values calculated during the statistical estimates.

The "lambda/comp" calculation can fail for sequences with very biased
amino acid composition.  When this occurs, 'comp' is set to -1.0 (as
is 'H', the information content parameter) and the 'ave_comp' value is
used to calculate statistical significance.  (But obviously 'ave_comp'
is not really appropriate, since if the sequence had an average 'comp'
value, it would have been calculated.)  When -z 6 is used, the
alignment display shows the 'comp' and 'H' values for that library
sequence.

(b) Scoring matrix offsets - The main reason that the "lambda/comp"
calculation fails is that, for the particular query/library sequence
pair, the expected score is not < 0; instead, Sum {p_ij S_ij} >= 0.0.
This problem is reported to 'stderr' when it occurs.  The simplest
solution to the problem is to provide an offset to the scoring matrix;
for example, to use Blosum62 - 1, which ranges from +10 to -5, rather
than the standard +11 to -4.  This option used to be available with
the -S offset option, but -S is now used to specify a lower-case
seg-ed database.  The offset can now be specified as part of the
scoring matrix name.  Thus, "-s BL62-1" uses Blosum62 reduced by 1 at
each entry.
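The matrix-name offset convention in this entry ("BL62-1", and the
"+1" and "--1" forms) can be sketched in Python as follows.  This is a
hedged illustration, not the parsing code in the FASTA sources: the
names split_matrix_offset() and apply_offset() are hypothetical, and
the two-entry matrix is a toy stand-in for the largest and smallest
Blosum62 values.

```python
import re


def split_matrix_offset(name: str):
    """Split a -s argument such as 'BL62-1' into (matrix_name, offset).

    Assumption for this sketch: the first '+' or '-' ends the matrix
    name, which is why matrix file names must not contain '-'.
    """
    m = re.fullmatch(r"([^+-]+)(?:([+-])(-?\d+))?", name)
    if m is None:
        raise ValueError(f"cannot parse matrix name: {name!r}")
    base, sign, digits = m.groups()
    if sign is None:
        return base, 0                     # no offset given
    value = int(digits)
    return base, value if sign == "+" else -value


def apply_offset(matrix: dict, offset: int) -> dict:
    """Add the same offset to every entry of a scoring matrix."""
    return {pair: score + offset for pair, score in matrix.items()}


# Toy entries standing in for the Blosum62 maximum (+11) and minimum (-4).
toy = {("W", "W"): 11, ("C", "E"): -4}

for arg in ("BL62-1", "BL80+1", "BL80--1", "BL50"):
    print(arg, "->", split_matrix_offset(arg))

name, offset = split_matrix_offset("BL62-1")
shifted = apply_offset(toy, offset)
print(max(shifted.values()), min(shifted.values()))  # prints: 10 -5
```

Under these assumptions "BL80--1" parses as subtracting -1, which adds
one to each value, matching the behaviour described for "-s BL80--1".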
The '-' character is used to indicate an offset, so
scoring matrix files must not have a '-' in their name.
Alternatively, "-s BL80+1" or "-s BL80--1" would add one to each value.

nxgetaa.c, nmgetaa.c, and mmgetaa.c have been edited to avoid string
run-off problems after strncpy().

Fixed problem where positive gap extension penalties in ssearch33
were not converted to negative values.

>>April 8, 2000

Fixed problem in calculating corrected sequence lengths for
Altschul-Gish probabilities.

>>March 30, 2000  (no version change, date updated to March 30, 2000)

Corrected problem with -m 9 option.

The '*' character is now available to allow translated alignments to
extend through the termination codon.  Thus, if a protein sequence ends
with a '*', and matches a translated termination codon, the
score will be increased.  The *:* match score is set to 1/2 the max
positive score for the matrix (see upam.h).  This strategy can also be
used to upweight a match that extends all the way to the end of a
full-length sequence by putting '*' at the end of both the query and
library protein sequences.  Recognition of '*' will probably become a
command line option.

>>March 21, 2000  (no version change, previous version not distributed)

Changes to map_db.c, list_db.c, and mmgetaa.c to accommodate large
sequence files.  Long (64-bit on some systems) variables are now used
to specify file and memory position for the memory mapped functions.
As a result, there are now two *.xin (memory mapped index) file
formats: MP0, which uses 32-bit longs, and MP1, which uses 64-bit
longs.  On 64-bit machines, MP0 32-bit indices are read properly, but
limit the database size to 2 or 4 Gb; MP1 64-bit indices allow very
large databases.  Blast2.0 formatdb databases are still limited to
4Gb.  To compile map_db.c to generate 64-bit index files, include the
compile time option -DBIG_LIB64 in the Makefile.
(Currently this
option has been tested only on the DEC Alpha and SGI platforms, and
will work only with Unix versions that provide 64-bit longs and 64-bit
ftell()'s.)

The -R results file now uses sfn_cmp() to report a matching
superfamily number, if one exists, and '0' otherwise.

>>March 12, 2000  (no version change, previous version not distributed)

Provide new strategy for specifying library abbreviations.  In
addition to:

        fasta33 query.aa %anr

one can also specify:

        fasta33 query.aa %pir1+sp+nr
or
        fasta33 query.aa +pir1+sp+nr
or
        fasta33 query.aa %+pir1+sp+nr

where the + anywhere in the library name string indicates that
variable length library names, separated by '+', are being used (the
last '+' is optional).  The FASTLIBS file then becomes:

================
PIR1 Annotated Protein Database (rel 56)$0+pir1+/slib2/blast/pir1.lseg
NBRF Protein database (complete)$0+nbrf+@/seqlib/lib/NBRF.nam
NRL_3d structure database$0D/seqlib/lib/nrl_3d.seq 5
NCBI/Blast non-redundant proteins$0+nr+/slib2/blast/nr.lseg
NCBI/Blast Swissprot$0+sp+/slib2/blast/swissprot.lseg
================

The two abbreviation types, single letter and +word+, cannot be
intermixed, and at least initially, +word+ specifiers are
case-sensitive (single letter abbreviations are not) and will not be
available interactively, only on the command line.

Removed 'K' estimate for Expectation_n, Expectation_i fits to the
distribution of unrelated similarity scores.  'K' cannot be calculated
from the data available.  'Lambda' can be calculated, it is
1.28255/sqrt(mean_var), and is still available.

>>March 3, 2000  (no version change)

Changed Makefile33.common, Makefile.common, to incorporate $(NRAND)
rather than "rand48".
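The constant in the March 12 entry, 1.28255, is pi/sqrt(6), so
"1.28255/sqrt(mean_var)" is the same relation as the June 6 initial
estimate lambda = pi/sqrt(6*variance).  Both follow from the variance
of an extreme value (Gumbel) distribution, pi^2/(6*lambda^2).  A
minimal Python sketch, assuming a plain list of unrelated-sequence
scores (the scores below are made up, and lambda_from_variance() is a
hypothetical name, not a function in scaleswn.c):

```python
import math
from statistics import pvariance


def lambda_from_variance(variance: float) -> float:
    """Estimate lambda from the score variance: pi / sqrt(6 * variance)."""
    if variance <= 0:
        raise ValueError("variance must be positive")
    return math.pi / math.sqrt(6.0 * variance)


# Made-up scores, for illustration only.
scores = [31, 35, 28, 40, 33, 37, 29, 45, 34, 32]
lam = lambda_from_variance(pvariance(scores))

# The same scores from a matrix scaled 10X: the variance grows 100-fold,
# so the estimated lambda shrinks 10-fold.
scaled = [10 * s for s in scores]
lam_scaled = lambda_from_variance(pvariance(scaled))

print(round(math.pi / math.sqrt(6.0), 5))  # prints: 1.28255
print(round(lam / lam_scaled, 6))          # prints: 10.0
```

Because the estimate scales inversely with the scores, it adapts to
matrices with unusual scaling, unlike a fixed starting value such as
lambda = 0.2.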
Provide nrandom.c which uses random(), as
replacement for nrand.c, which uses rand48().

>>February 8, 2000  --> v33t05

Fixes to scaleswn.c (proc_hist_ml) to set num_db_entries properly.
Scaleswn.c also provides Lambda estimates for -z 1/11 (Expectation_n),
and -z 1/14 (Expectation_i) statistical estimates.

Modifications to calc_id() to correct bug in counting identities.
Modified showalign() to use calc_id() with -m 9, for simpler
debugging.

Additional modifications to dropfa*.c files to deal properly with 'n's