est2genome

Function

   Align EST and genomic DNA sequences

Description

   est2genome is a software tool to aid the prediction of genes by
   sequence homology. The program aligns a set of spliced nucleotide
   sequences (ESTs, cDNAs or mRNAs) to an unspliced genomic DNA sequence,
   inserting introns of arbitrary length when needed. In addition, where
   feasible, introns start and stop at the splice consensus dinucleotides
   GT and AG.

   Unless instructed otherwise, the program makes three alignments. First
   it compares both strands of the spliced sequence against the forward
   strand of the genomic sequence, assuming the splice consensus GT/AG
   (i.e. in the forward gene direction). The maximum-scoring orientation
   is then realigned assuming the splice consensus CT/AC (i.e. in the
   reversed gene direction). Only the overall maximum-scoring alignment is
   reported.

   The program outputs a list of the exons and introns it has found. The
   format is like that of MSPcrunch, i.e. a list of matching segments,
   and is easy to parse in other software. The program also indicates,
   based on the splice site information, the gene's predicted direction
   of transcription. Optionally the full sequence alignment is printed as
   well (see the example).

Algorithm

   The program uses a linear-space divide-and-conquer strategy (Myers and
   Miller, 1988; Huang, 1994) to limit memory use:

   1.  A first-pass Smith-Waterman local alignment scan is done to find
       the start and end of the maximally scoring segments.
   2.  Subsequences corresponding to these segments are extracted.
   3a. If the product of the subsequences' lengths is less than a
       user-defined threshold (i.e. they will fit in memory), the
       segments are realigned using the Needleman-Wunsch global alignment
       algorithm, which gives the same result as Smith-Waterman since
       the subsequences are guaranteed to align end-to-end.
   3b. If the product of the lengths exceeds the threshold (a full
       alignment will not fit in memory), the alignment is made
       recursively by splitting the spliced (EST) sequence in half and
       finding the genome sequence position which aligns with the
       mid-point. The process is repeated until the product of the
       lengths is less than the threshold. The divided sequences are
       aligned separately and then merged.
   4.  The genome sequence is searched against the forward and reverse
       strands of the spliced (EST) sequence, assuming a forward gene
       splicing direction (i.e. GT/AG consensus).
   5.  Then the best-scoring orientation is realigned assuming reverse
       splicing (CT/AC consensus). The overall best alignment is
       reported.

Usage

   Here is a sample session with est2genome

% est2genome
Align EST and genomic DNA sequences
Spliced EST nucleotide sequence(s): tembl:hs989235
Unspliced genomic nucleotide sequence: tembl:hsnfg9
Output file [hs989235.est2genome]:

Command line arguments

   Standard (Mandatory) qualifiers:
  [-estsequence]       seqall     Spliced EST nucleotide sequence(s)
  [-genomesequence]    sequence   Unspliced genomic nucleotide sequence
  [-outfile]           outfile    [*.est2genome] Output file name

   Additional (Optional) qualifiers:
   -match              integer    [1] Score for matching two bases (Any
                                  integer value)
   -mismatch           integer    [1] Cost for mismatching two bases (Any
                                  integer value)
   -gappenalty         integer    [2] Cost for deleting a single base in
                                  either sequence, excluding introns (Any
                                  integer value)
   -intronpenalty      integer    [40] Cost for an intron, independent of
                                  length. (Any integer value)
   -splicepenalty      integer    [20] Cost for an intron, independent of
                                  length and starting/ending on
                                  donor-acceptor sites (Any integer value)
   -minscore           integer    [30] Exclude alignments with scores below
                                  this threshold score. (Any integer value)

   Advanced (Unprompted) qualifiers:
   -reverse            boolean    Reverse the orientation of the EST
                                  sequence
   -[no]splice         boolean    [Y] Use donor and acceptor splice sites.
                                  If you want to ignore donor-acceptor
                                  sites then set this to be false.
   -mode               menu       [both] This determines the comparison
                                  mode. The default value is 'both', in
                                  which case both strands of the EST are
                                  compared assuming a forward gene
                                  direction (i.e. GT/AG splice sites), and
                                  the best comparison is redone assuming a
                                  reversed (CT/AC) gene splicing direction.
                                  The other allowed modes are 'forward',
                                  when just the forward strand is searched,
                                  and 'reverse', ditto for the reverse
                                  strand. (Values: both (Both strands);
                                  forward (Forward strand only); reverse
                                  (Reverse strand only))
   -[no]best           boolean    [Y] You can print out all comparisons
                                  instead of just the best one by setting
                                  this to be false.
   -space              float      [10.0] For linear-space recursion. If the
                                  product of the sequence lengths divided
                                  by 4 exceeds this then a
                                  divide-and-conquer strategy is used to
                                  control the memory requirements. In this
                                  way very long sequences can be aligned.
                                  If you have a machine with plenty of
                                  memory you can raise this parameter (but
                                  do not exceed the machine's physical RAM)
                                  (Any numeric value)
   -shuffle            integer    [0] Shuffle (Any integer value)
   -seed               integer    [20825] Random number seed (Any integer
                                  value)
   -align              boolean    Show the alignment. The alignment
                                  includes the first and last 5 bases of
                                  each intron, together with the intron
                                  width. The direction of splicing is
                                  indicated by angle brackets (forward or
                                  reverse) or ???? (unknown).
   -width              integer    [50] Alignment width (Any integer value)

   Associated qualifiers:

   "-estsequence" associated qualifiers
   -sbegin1            integer    Start of each sequence to be used
   -send1              integer    End of each sequence to be used
   -sreverse1          boolean    Reverse (if DNA)
   -sask1              boolean    Ask for begin/end/reverse
   -snucleotide1       boolean    Sequence is nucleotide
   -sprotein1          boolean    Sequence is protein
   -slower1            boolean    Make lower case
   -supper1            boolean    Make upper case
   -sformat1           string     Input sequence format
   -sdbname1           string     Database name
   -sid1               string     Entryname
   -ufo1               string     UFO features
   -fformat1           string     Features format
   -fopenfile1         string     Features file name

   "-genomesequence" associated qualifiers
   -sbegin2            integer    Start of the sequence to be used
   -send2              integer    End of the sequence to be used
   -sreverse2          boolean    Reverse (if DNA)
   -sask2              boolean    Ask for begin/end/reverse
   -snucleotide2       boolean    Sequence is nucleotide
   -sprotein2          boolean    Sequence is protein
   -slower2            boolean    Make lower case
   -supper2            boolean    Make upper case
   -sformat2           string     Input sequence format
   -sdbname2           string     Database name
   -sid2               string     Entryname
   -ufo2               string     UFO features
   -fformat2           string     Features format
   -fopenfile2         string     Features file name

   "-outfile" associated qualifiers
   -odirectory3        string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write standard output
   -filter             boolean    Read standard input, write standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages

Input file format

   est2genome reads two nucleotide sequences. The first is an EST
   sequence (a single read or a finished cDNA).
   The second is a genomic finished sequence.

  Input files for usage example

   'tembl:hs989235' is a sequence entry in the example nucleic acid
   database 'tembl'

  Database entry: tembl:hs989235

ID   HS989235   standard; RNA; EST; 495 BP.
XX
AC   H45989;
XX
SV   H45989.1
XX
DT   18-NOV-1995 (Rel. 45, Created)
DT   04-MAR-2000 (Rel. 63, Last updated, Version 2)
XX
DE   yo13c02.s1 Soares adult brain N2b5HB55Y Homo sapiens cDNA clone
DE   IMAGE:177794 3', mRNA sequence.
XX
KW   EST.
XX
OS   Homo sapiens (human)
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
OC   Eutheria; Primates; Catarrhini; Hominidae; Homo.
XX
RN   [1]
RP   1-495
RA   Hillier L., Clark N., Dubuque T., Elliston K., Hawkins M., Holman M.,
RA   Hultman M., Kucaba T., Le M., Lennon G., Marra M., Parsons J., Rifkin L.,
RA   Rohlfing T., Soares M., Tan F., Trevaskis E., Waterston R., Williamson A.,
RA   Wohldmann P., Wilson R.;
RT   "The WashU-Merck EST Project";
RL   Unpublished.
XX
DR   RZPD; IMAGp998F03326; IMAGp998F03326.
XX
CC   On May 8, 1995 this sequence version replaced gi:800819.
CC   Contact: Wilson RK
CC   Washington University School of Medicine
CC   4444 Forest Park Parkway, Box 8501, St. Louis, MO 63108
CC   Tel: 314 286 1800
CC   Fax: 314 286 1810
CC   Email: est@watson.wustl.edu
CC   Insert Size: 544
CC   High quality sequence stops: 265
CC   Source: IMAGE Consortium, LLNL
CC   This clone is available royalty-free through LLNL ; contact the
CC   IMAGE Consortium (info@image.llnl.gov) for further information.
CC   Possible reversed clone: polyT not found
CC   Insert Length: 544 Std Error: 0.00
CC   Seq primer: SP6
CC   High quality sequence stop: 265.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..495
FT                   /db_xref="taxon:9606"
FT                   /db_xref="ESTLIB:300"
FT                   /db_xref="RZPD:IMAGp998F03326"
FT                   /note="Organ: brain; Vector: pT7T3D (Pharmacia) with a
FT                   modified polylinker; Site_1: Not I; Site_2: Eco RI; 1st
FT                   strand cDNA was primed with a Not I - oligo(dT) primer [5'
FT                   TGTTACCAATCTGAAGTGGGAGCGGCCGCGCTTTTTTTTTTTTTTTTTTT 3'],
FT                   double-stranded cDNA was size selected, ligated to Eco RI
FT                   adapters (Pharmacia), digested with Not I and cloned into
FT                   the Not I and Eco RI sites of a modified pT7T3 vector
FT                   (Pharmacia). Library went through one round of
FT                   normalization to a Cot = 53. Library constructed by Bento
FT                   Soares and M.Fatima Bonaldo. The adult brain RNA was
FT                   provided by Dr. Donald H. Gilden. Tissue was acquired 17-18
FT                   hours after death which occurred in consequence of a
FT                   ruptured aortic aneurysm. RNA was prepared from a pool of
FT                   tissues representing the following areas of the brain:
FT                   frontal, parietal, temporal and occipital cortex from the
FT                   left and right hemispheres, subcortical white matter, basal
FT                   ganglia, thalamus, cerebellum, midbrain, pons and medulla."
FT                   /sex="Male"
FT                   /organism="Homo sapiens"
FT                   /clone="IMAGE:177794"
FT                   /clone_lib="Soares adult brain N2b5HB55Y"
FT                   /dev_stage="55-year old"
FT                   /lab_host="DH10B (ampicillin resistant)"
XX
SQ   Sequence 495 BP; 73 A; 135 C; 169 G; 104 T; 14 other;
     ccggnaagct cancttggac caccgactct cgantgnntc gccgcgggag ccggntggan        60
     aacctgagcg ggactggnag aaggagcaga gggaggcagc acccggcgtg acggnagtgt       120
     gtggggcact caggccttcc gcagtgtcat ctgccacacg gaaggcacgg ccacgggcag       180
     gggggtctat gatcttctgc atgcccagct ggcatggccc cacgtagagt ggnntggcgt       240
     ctcggtgctg gtcagcgaca cgttgtcctg gctgggcagg tccagctccc ggaggacctg       300
     gggcttcagc ttcccgtagc gctggctgca gtgacggatg ctcttgcgct gccatttctg       360
     ggtgctgtca ctgtccttgc tcactccaaa ccagttcggc ggtccccctg cggatggtct       420
     gtgttgatgg acgtttgggc tttgcagcac cggccgccga gttcatggtn gggtnaagag       480
     atttgggttt tttcn                                                        495
//

  Database entry: tembl:hsnfg9

ID   HSNFG9     standard; DNA; HUM; 33760 BP.
XX
AC   Z69719;
XX
SV   Z69719.1
XX
DT   26-FEB-1996 (Rel. 46, Created)
DT   22-NOV-1999 (Rel. 61, Last updated, Version 3)
XX
DE   Human DNA sequence from cosmid NFG9 from a contig from the tip of the short
DE   arm of chromosome 16, spanning 2Mb of 16p13.3. Contains Interleukin 9
DE   Receptor Pseudogene, repeat polymorphism, ESTs, CpG islands and endogenous
DE   retroviral DNA.
XX
KW   16p13.3; CpG island; Interleukin 9 Receptor Pseudogene;
KW   repeat polymorphism.
XX
OS   Homo sapiens (human)
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
OC   Eutheria; Primates; Catarrhini; Hominidae; Homo.
XX
RN   [1]
RP   1-33760
RA   Kershaw J.;
RT   ;
RL   Submitted (22-FEB-1996) to the EMBL/GenBank/DDBJ databases.
RL   Sanger Centre, Hinxton, Cambridgeshire, CB10 1RQ, England. E-mail enquires:
RL   humquery@sanger.ac.uk
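
  Note on the GT/AG and CT/AC splice consensus

   The Description and the -mode qualifier pair the GT/AG consensus with
   the forward gene direction and CT/AC with the reversed direction. The
   short Python sketch below illustrates why these two pairs go together:
   reverse-complementing an intron that begins GT and ends AG gives one
   that begins CT and ends AC. It is an illustration only and is not taken
   from the est2genome source; the function names and the example intron
   are hypothetical.

```python
# Illustrative sketch only; not est2genome's implementation.
# Function names and the example intron are hypothetical.

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def splice_direction(intron: str) -> str:
    """Classify an intron by its first and last two bases.

    GT...AG -> "forward", CT...AC -> "reverse", anything else -> "unknown".
    """
    start, end = intron[:2].upper(), intron[-2:].upper()
    if (start, end) == ("GT", "AG"):
        return "forward"
    if (start, end) == ("CT", "AC"):
        return "reverse"
    return "unknown"


forward_intron = "GTAAGTCCCTTTTCAG"  # made-up intron with GT/AG ends
flipped_intron = reverse_complement(forward_intron)

print(splice_direction(forward_intron))  # forward
print(splice_direction(flipped_intron))  # reverse
```

   The "unknown" case corresponds to the situation that the -align output
   marks with ????, where the intron ends match neither consensus pair.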