readme.v33t0

and 'x's.

Added new option: -x #, which allows one to override the penalty for a
match against 'x' (or 'N') provided by the scoring matrix.  This
option is particularly useful in fast[x/y] searches, where out of
frame low complexity regions can generate high scores.

The old function of '-x' - to specify an alternate coordinate system -
is now available as '-X # #'.

Updated scaleswn.c to provide window shuffle information for -z 12.

Updated compacc.c, workacc.c, to fix a serious bug in wshuffle()
that destroyed aa1[n1]=0.

>>January 25, 2000  --> v33t04

A serious bug in all of the fasta related programs has been
corrected.  The new code in fasta33 which ignores certain residues
failed to initialize one of the arrays properly.  As a result, in
pathological situations, a very strong match could be missed.

Corrected minor bug in initsw.c that caused a misplaced "ktup" command
line argument, which should be ignored by ssearch, to be read as -d
ktup.  Improved error message for 0 length query sequence.

>>January 17, 2000  --> no external version number change

Modified mmgetaa.c, map_db.c, and nmgetaa.c to provide memory mapping
of genbank flatfile (format=1) files.  This format could be read much
more efficiently, however.

>>January 12, 2000  --> no external version number change

Changed the behavior of the options that set the number of high scores
(-b) and alignments (-d) that are displayed.  Previously, fasta33 -E
10.0 -d 10 would show 50 best scores, rather than all the scores with
E() < 10.0.  To make the -E threshold the limiting factor, -E 10.0
-b 10000 -d 10 was required. This is now fixed.
Setting "-d 10" no longer affects the number of best scores shown.

Minor change in mw.h to remove unused defines.

fasta3x.me (fasta3x.doc) updated.

>>January 6, 2000  --> v33t03

Corrected bug in memory mapped reads of gcg_binary format files
that potentially caused the last 63 residues to be read improperly.

Changes to comp_thr.c, pthr_subs.c, uthr_subs.c, ibm_pthr_subs.c to
ensure that each thread has its own work_info structure. This solves
some minor race conditions that sometimes caused some parameters
not to be reported properly.

Changes to most of the drop*.c files to correct some minor problems
with sequence alphabets. Code in mmgetaa.c (memory mapped code for
FASTA, GCG compressed files) reordered to prevent files from being
memory mapped if appropriate index files are not available.

See readme.pvm_3.3 for updates to the pvm programs.

>>December 10, 1999  (no version change - modifications largely affect ps3comp*)

Modifications to showsum.c to deal with 2 scores/sequence.  Modifications
to mmgetaa.c for superfamily numbers.

>>December 7, 1999 (no version change, previous version not released)

Corrected problem in mmgetaa.c that caused searches on a memory mapped
single long sequence (e.g. Chr22) to fail.  Corrected bug in map_db.c
that caused it to crash on some architectures if a filename was not
specified.  Corrected off-by-three error in fasty/tfasty.  Corrected
indexing error in dropfz2.c.

>>December 5, 1999 --> v33t02

Corrected some bugs in initfa.c/initsw.c/doinit.c that caused
abbreviated function names to be lost.

Modified showbest.c, showalign.c to include information on position in
library sequence (bbp->cont) to distinguish subsegments of very long
sequences.  Currently, the new label is available only with -m 6.

>>November 29, 1999

[t]fastz33 uses v33t02 of fasty function.

Replaced dropfz.c with dropfz2.c.  Dropfz2.c interprets any codon
that includes the nucleotide 'N' as the amino acid 'X'. Previously,
'N' was treated as 'A', so 'NNN' ended up 'K'.
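The difference between the old and new handling of 'N' can be illustrated with a small sketch. This is not the FASTA C code (dropfz2.c, saatran(), aatran()); the function names and the table layout below are assumptions made for illustration, using the standard genetic code.

```python
# Illustrative sketch only; names are hypothetical, not taken from FASTA.
# Standard genetic code, indexed by base order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"


def translate_codon(codon):
    """New behavior: any codon containing 'N' translates to 'X'."""
    codon = codon.upper()
    if len(codon) != 3:
        raise ValueError("a codon must have exactly three bases")
    if "N" in codon:
        return "X"
    # str.index raises ValueError for characters outside T, C, A, G.
    i, j, k = (BASES.index(base) for base in codon)
    return AMINO[16 * i + 4 * j + k]


def translate_codon_legacy(codon):
    """Old behavior: 'N' is read as 'A' before translation."""
    return translate_codon(codon.upper().replace("N", "A"))
```

With this sketch, 'NNN' becomes 'AAA' (lysine, 'K') under the old rule and 'X' under the new one, which is why runs of 'N' could previously score against lysine-rich regions.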
This modification, together
with the -S option and lower-case pseg'ed databases, should ensure
that DNA queries with large numbers of 'N's do not match low
complexity regions.

>>November 20, 1999 (no version change, previous version not released)

Modify initfa.c to display initn, init1 scores for [t]fast[fs].
Include "-B" option to show previous z-scores.

>>November 17, 1999 (no version change, previous version not released)

Modify dropfx.c to use saatran(), rather than aatran().  saatran()
translates any 'N' containing codon as 'X'.  aatran() treats 'N' as
an 'A'.  Although more steps are required for translation, the program
appears to run just as fast.

>>November 7, 1999 --> v33t01

Substantial changes to the output format in showbest.c (the list of
high scoring sequences) and showalign.c (the alignments).  The classic
list of best scores:

The best scores are:                             initn init1 opt z-sc E(82014)
gi|121716|sp|P10649|GTM1_MOUSE GLUTATHIO  ( 218) 1497 1497 1497 1761.1 2.3e-91
gi|121717|sp|P04905|GTM1_RAT GLUTATHIONE  ( 218) 1413 1413 1413 1662.9 6.7e-86

has been replaced by:

The best scores are:                                       opt bits E(82138)
gi|121716|sp|P10649|GTM1_MOUSE GLUTATHIONE S-TRAN  ( 218) 1497 354 7.6e-98
gi|121717|sp|P04905|GTM1_RAT GLUTATHIONE S-TRANSF  ( 218) 1413 335 5.3e-92

This display provides more information and removes the outdated initn
and init1 scores, which are no longer used. The "bit" score is
comparable to the blast2 bit score.  It is calculated as: (lambda*S -
ln K)/ln 2, where S is the raw similarity score, and lambda and K are
statistical parameters estimated from the distribution of unrelated
sequence similarity scores.
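The bit score formula above can be written out as a short sketch. The function name and argument names are assumptions for illustration. The values of lambda and K used by the programs are estimated per search and are not given here, so no attempt is made to reproduce the numbers in the example output.

```python
import math


def bit_score(raw_score, lam, k):
    """Bit score as described above: (lambda*S - ln K) / ln 2.

    raw_score: raw similarity score S
    lam, k:    statistical parameters lambda and K (hypothetical inputs;
               the programs estimate them from unrelated-sequence scores)
    """
    if lam <= 0 or k <= 0:
        raise ValueError("lambda and K must be positive")
    return (lam * raw_score - math.log(k)) / math.log(2)
```

Because the raw score is rescaled by lambda and K, bit scores from searches with different scoring matrices become comparable, which is the point of replacing the raw initn/init1 columns.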
All of the similarity scores, including
init1, initn, and z-scores, are reported with the alignment data.
Z-scores are displayed instead of bit scores in the list of high
scores if the command line option "-B" is specified.

In addition, the alignment score line has changed from:

>>gi|2506495|sp|P20136|GTM2_CHICK GLUTATHIONE S-TRANSFER  (220 aa)
 initn: 954 init1: 954 opt: 958 Z-score: 1130.9 expect() 1.1e-56
Smith-Waterman score: 958;  61.927% identity in 218 aa overlap (1-218:1-218)

to:

>>gi|2506495|sp|P20136|GTM2_CHICK GLUTATHIONE S-TRANSFER  (220 aa)
 initn: 954 init1: 954 opt: 958  Z-score: 1130.9  bits: 216.4 E(): 2.8e-56
Smith-Waterman score: 958;  61.927% identity in 218 aa overlap (1-218:1-218)

Besides the addition of the "bits:" score, the "expect()" label
has changed to "E()" to save some space.

>>November 4,12, 1999 (no version change)

Fixed serious bug in -z 2 lambda/K calculation in scaleswn.c.

Fixed bugs in llgetaa.c (openlib()) and definition of superfamily
numbers.

>>October 21, 1999 (no version change)

Began using CVS for version control. Corrected faulty error message in
dropfs.c.  Corrected bad "goto loopl;" in dropfz.c.  Corrected prss3.rsp
for Makefile.tc (Win32 version).

>>October 18, 1999 --> v33t0

Corrected some serious bugs with the various fasta/x/y programs when
the -DALLOCN0 option was used to save memory.  Improvements to
fasta3x.me/.doc documentation.

>>October 12, 1999 --> v33tx

For this initial release of version 33 of the FASTA programs, the
Makefiles have been modified to make "fasta33(_t)", "fastx33(_t)",
etc, so that you can test fasta33 while retaining fasta3 (from release
v32t08).  The FASTA33 programs are somewhat slower than previous
releases, but I believe the ability to handle low complexity regions
without 'X'ing them out outweighs the slowdown.  By (temporarily)
changing the names of the programs slightly, it will be easier for you
to judge the relative cost and benefit.
To "make" the programs as
"fasta3(_t)", etc, simply replace "Makefile33.common" with
"Makefile.common" in the "Makefile" that you use.

>>September 30, 1999

ssearch3/fasta3/fastx3/fasty3 have been modified to search databases
containing both upper and lower case letters, where lower case letters
indicate low-complexity regions.  With the modified programs, lower
case letters are treated as 'X's in the initial scan, but are then
treated normally in the final alignment.  In addition, alignments can
contain lower case letters.  Lower case letters are treated as
low-complexity regions during the search phase of the program, but as
"conventional" residues during the alignment phase, with the "-S"
option.  Currently, lower case letters are mapped to 'X's during the
scan of the entire library.  In the future, alternate weights will be
available.

This is a substantial improvement for very large scale
comparison, where one seeks both accurate statistical estimates and
accurate %identities and alignments, and for translated DNA:protein
comparisons, like "fastx3" and "fasty3", where out-of-frame
translations tend to match low complexity regions (see Pearson et
al. (1997) Genomics 46:24-36).

Protein databases (and query sequences) can be generated in the
appropriate format using John Wootton's "pseg" program, available from
ftp://ncbi.nlm.nih.gov/pub/seg/pseg.  Once you have compiled the "pseg"
program, use the command:

	pseg database.fasta -z 1 -q  > database.lc_seg

Once you have database.lc_seg, run the command "map_db" to generate
a ".xin" file that can be used to efficiently memory map the database.

You can then search database.lc_seg with or without the "-S" option.
Without "-S", the database is treated as any other FASTA format file -
all the residues are present.
With "-S", lower case residues will be
treated as 'x's during the initial scan but as normal residues when
final alignments are displayed.

When the -S option is used, the matrix information line is changed
from: "BL50 matrix (15:-5)" to "BL50 matrix (15:-5)xS".  The "-S"
option is no longer available to provide a scoring matrix offset.

Unfortunately, Blast2.0 format files cannot contain lower case
letters.  We have addressed this problem by providing efficient memory
mapped access to FASTA, GCG/PIR, and GCG/compressed-binary files in
the last release of fasta32t08. The memory mapped file I/O
improvements are provided in fasta33 as well.

================ readme.v32 ================

FASTX/Y and FASTA (DNA) are now half as fast, because the programs now
search both the forward and reverse strands by default.

The documentation in fasta3x.me/fasta3x.doc has been substantially
revised.

>>October 20, 1999 (no version change)

Modify nxgetaa.c/nmgetaa.c to recognize 'N' as a possible DNA character.

>>October 9, 1999 --> v32t08 (no version number change)

Added "-M low-high" option, where low and high are inclusion limits
for library sequences.  If a library sequence is shorter than "low" or
longer than "high", it will not be considered in the search.  Thus,
"-M 200-250" limits the database search to proteins between 200 and
250 residues in length.  This should be particularly useful for fasts3
and fastf3.  -M -500 searches library sequences < 500; -M 200 -
searches sequences > 200.  This limit applies only to protein
sequences.

Modified scaleswn.c to fall back to maximum likelihood estimates of
lambda, K rather than mean/variance estimates. (This allows MLE
estimation to be used instead of proc_hist_n when a limited range of
scores is examined.)

>>October 2, 1999 --> v32t08

Many changes:

(1) memory mapped (mmap()ed) database reading - other database reading fixes
(2) BLAST2 databases supported
(3) true maximum likelihood estimates for Lambda, K
(4) Misc. minor fixes

(1) (Sept. 26 - Oct.
2, 1999) Memory mapped database access.

It is now possible to use mmap()ed access to FASTA format databases,
if the "map_db" program has been used to produce an ".xin" file.  If
USE_MMAP is defined at compile time and a ".xin" file is present, the
".xin" will be used to access sequences directly after the file is
mmap()ed.  On my 4-processor Alpha, this can reduce elapsed time by
50%. It is not quite as efficient as BLAST2 format, but it is close.

Currently, memory mapping is supported for type 0 (FASTA), 5
(PIR/GCG ascii), and 6 (GCG binary).  Memory mapping is used if a
".xin" file is present. ".xin" files are created by the new program
"map_db".  The syntax for "map_db" is:

	map_db [-n] "/dir/database.fa"

which creates the file /dir/database.fa.xin.  Library types can be
included in the filename; thus:

	map_db -n "/gcggenbank/gb_om.seq 6"

would be used for a type 6 GCG binary file.

The ".xin" file must be updated each time the database file changes.
map_db writes the size of the database file into the ".xin" file, so
that if the database file changes, making the ".xin" offset
information invalid, the ".xin" file is not used. "list_db" is
provided to print out the offset information in the ".xin" file.

(Oct 2, 1999) The memory mapping routines have been changed to
allow several files to be memory mapped simultaneously. Indeed, once a
database has been memory mapped, it will not be munmap()ed until the
program finishes.  This fixes a problem under Digital Unix, and should
make re-access to mmap()ed files (as when displaying high scores and
alignments) much more efficient.  If no more memory is available for
mmap()ing, the file will be read using conventional fread/fgets.

(Oct 2, 1999) The names of the database reading functions have been