tcode

Function

Fickett TESTCODE statistic to identify protein-coding DNA

Description

tcode tests DNA sequences for protein-coding regions using an algorithm that looks for simple and universal differences between protein-coding and noncoding DNA. The original paper reports that the test was thoroughly tested on 400,000 bases of sequence data: it misclassifies 5% of the regions tested and gives an answer of "No opinion" one fifth of the time.

The program slides a window of user-selectable size over the DNA sequence and applies the TESTCODE statistic to each window. The results can be output as a text report or displayed graphically.

The text output reports each window as "Coding", "Noncoding" or "No opinion". Entries marked "No opinion" have a TESTCODE value that falls between the maximum value for reporting a region as noncoding and the minimum value for reporting it as coding.

In the graphical plot, all points above a green horizontal line are determined to be coding regions, and those below a red line are determined to be noncoding. Points between the red and green lines are "No opinion" ones.

Biological Relevance

The statistic reflects the fact that codons are used with unequal frequency and that oligonucleotides and nucleotides tend to be repeated with a periodicity of three. This application can assist in determining the probability of a region of nucleic acid sequence encoding a functional protein.

Algorithm

The Fickett (1982) algorithm is used (1). A window of at least 200 bases is moved over the sequence in steps of 3 bases.

Let:

   A1 = Number of A's in positions 1,4,7 ...
   A2 = Number of A's in positions 2,5,8 ...
   A3 = Number of A's in positions 3,6,9 ...

A position value is determined that reflects the degree to which each base is favoured in one codon position over another, i.e.

   Apos = MAX(A1,A2,A3) / (MIN(A1,A2,A3) + 1)

This is done for all 4 bases. The percentage composition of each base is also determined.
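The position and composition values described above can be sketched in Python as follows. This is a minimal illustration, not the EMBOSS implementation: the function name position_and_composition is hypothetical, composition is expressed as a fraction of the window length rather than as a percentage, and characters other than A, C, G and T are simply not counted.

```python
BASES = "ACGT"


def position_and_composition(window):
    """Return (position, composition) dicts keyed by base for one window.

    Hypothetical helper for illustration; not taken from the EMBOSS source.
    """
    seq = window.upper()
    if not seq:
        raise ValueError("window must not be empty")

    position = {}
    composition = {}
    for base in BASES:
        # Count this base at positions 1,4,7..., 2,5,8... and 3,6,9...
        counts = [seq[offset::3].count(base) for offset in range(3)]
        # Position value: MAX / (MIN + 1), as defined for Apos above.
        position[base] = max(counts) / (min(counts) + 1)
        # Composition as a fraction of the window length (an assumption;
        # the text speaks of a percentage).
        composition[base] = sum(counts) / len(seq)
    return position, composition


if __name__ == "__main__":
    pos, comp = position_and_composition("ATGATGATG")
    print(pos)
    print(comp)
```

For the nine-base toy window ATGATGATG, every A falls in codon position 1, so Apos = 3 / (0 + 1) = 3.0, while C never occurs and its position value is 0 / (0 + 1) = 0.0. Real windows are at least 200 bases long, so the toy input only serves to make the arithmetic easy to check by hand.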

Eight values are therefore determined: four position values and four composition values. These are then converted to probabilities (p) of coding using a look-up table provided as the data file for the program. The values in this look-up table have been determined experimentally using known coding and noncoding sequences.

Each of the probabilities is multiplied by a weight (w) value (again from the look-up table) for the respective base. The weight value reflects the percentage of the time that each parameter alone successfully predicted coding or noncoding function for the sequences of known function.

The TESTCODE statistic is then:

   p1w1 + p2w2 + p3w3 + p4w4 + p5w5 + p6w6 + p7w7 + p8w8

A result of less than 0.74 is probably a noncoding region. A result equal to or greater than 0.95 is probably a coding region. Anything in between these two values is uncertain.

Usage

Here is a sample session with tcode

% tcode
Fickett TESTCODE statistic to identify protein-coding DNA
Input nucleotide sequence(s): tembl:hsfau1
Length of sliding window [200]:
Output report [hsfau1.tcode]:

Example 2

Produce a graphical plot

% tcode -plot -graph cps
Fickett TESTCODE statistic to identify protein-coding DNA
Input nucleotide sequence(s): tembl:hsfau1
Length of sliding window [200]:
Created tcode.ps

Command line arguments

   Standard (Mandatory) qualifiers (* if not always prompted):
  [-sequence]          seqall     Nucleotide sequence(s) filename and optional
                                  format, or reference (input USA)
   -window             integer    [200] This is the number of nucleotide bases
                                  over which the TESTCODE statistic will be
                                  performed each time. The window will then
                                  slide along the sequence, covering the same
                                  number of bases each time.
                                  (Integer 200 or more)
*  -outfile            report     [*.tcode] Output report file name
*  -graph              xygraph    [$EMBOSS_GRAPHICS value, or x11] Graph type
                                  (ps, hpgl, hp7470, hp7580, meta, cps, x11,
                                  tekt, tek, none, data, xterm, png)

   Additional (Optional) qualifiers: (none)

   Advanced (Unprompted) qualifiers:
   -datafile           datafile   [Etcode.dat] The default data file is
                                  Etcode.dat and contains coding probabilities
                                  for each base. The probabilities are for
                                  both positional and compositional
                                  information.
   -step               integer    [3] The selected window will, by default,
                                  slide along the nucleotide sequence by three
                                  bases at a time, retaining the frame
                                  (although the algorithm is not frame
                                  sensitive). This may be altered to increase
                                  or decrease the increment of the slide.
                                  (Integer 1 or more)
   -plot               toggle     [N] On selection a graph of the sequence (X
                                  axis) plotted against the coding score (Y
                                  axis) will be displayed. Sequence above the
                                  green line is coding, that below the red
                                  line is non-coding.

   Associated qualifiers:

   "-sequence" associated qualifiers
   -sbegin1            integer    Start of each sequence to be used
   -send1              integer    End of each sequence to be used
   -sreverse1          boolean    Reverse (if DNA)
   -sask1              boolean    Ask for begin/end/reverse
   -snucleotide1       boolean    Sequence is nucleotide
   -sprotein1          boolean    Sequence is protein
   -slower1            boolean    Make lower case
   -supper1            boolean    Make upper case
   -sformat1           string     Input sequence format
   -sdbname1           string     Database name
   -sid1               string     Entryname
   -ufo1               string     UFO features
   -fformat1           string     Features format
   -fopenfile1         string     Features file name

   "-outfile" associated qualifiers
   -rformat            string     Report format
   -rname              string     Base file name
   -rextension         string     File name extension
   -rdirectory         string     Output directory
   -raccshow           boolean    Show accession number in the report
   -rdesshow           boolean    Show description in the report
   -rscoreshow         boolean    Show the score in the report
   -rusashow           boolean    Show the full USA in the report
   -rmaxall            integer    Maximum total hits to report
   -rmaxseq            integer    Maximum hits to report for one
                                  sequence

   "-graph" associated qualifiers
   -gprompt            boolean    Graph prompting
   -gdesc              string     Graph description
   -gtitle             string     Graph title
   -gsubtitle          string     Graph subtitle
   -gxtitle            string     Graph x axis title
   -gytitle            string     Graph y axis title
   -goutfile           string     Output file for non interactive displays
   -gdirectory         string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write standard output
   -filter             boolean    Read standard input, write standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages

Input file format

tcode reads any normal sequence USAs.

The program will ignore ambiguity codes in the nucleic acid sequence and just accept the four common bases. This is a function of the algorithm, and the data tables.

Input files for usage example

'tembl:hsfau1' is a sequence entry in the example nucleic acid database 'tembl'

Database entry: tembl:hsfau1

ID   HSFAU1     standard; DNA; HUM; 2016 BP.
XX
AC   X65921; S45242;
XX
SV   X65921.1
XX
DT   13-MAY-1992 (Rel. 31, Created)
DT   21-JUL-1993 (Rel. 36, Last updated, Version 5)
XX
DE   H.sapiens fau 1 gene
XX
KW   fau 1 gene.
XX
OS   Homo sapiens (human)
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
OC   Eutheria; Primates; Catarrhini; Hominidae; Homo.
XX
RN   [1]
RP   1-2016
RA   Kas K.;
RT   ;
RL   Submitted (29-APR-1992) to the EMBL/GenBank/DDBJ databases.
RL   K. Kas, University of Antwerp, Dept of Biochemistry T3.22,
RL   Universiteitsplein 1, 2610 Wilrijk, BELGIUM
XX
RN   [2]
RP   1-2016
RX   MEDLINE; 92412144.
RA   Kas K., Michiels L., Merregaert J.;
RT   "Genomic structure and expression of the human fau gene: encoding the
RT   ribosomal protein S30 fused to a ubiquitin-like protein.";
RL   Biochem. Biophys. Res. Commun. 187:927-933(1992).
XX
DR   SWISS-PROT; P35544; UBIM_HUMAN.
DR   SWISS-PROT; Q05472; RS30_HUMAN.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..2016
FT                   /db_xref="taxon:9606"
FT                   /organism="Homo sapiens"
FT                   /clone_lib="CML cosmid"
FT                   /clone="15.1"
FT   mRNA            join(408..504,774..856,951..1095,1557..1612,1787..>1912)
FT                   /gene="fau 1"
FT   exon            408..504
FT                   /number=1
FT   intron          505..773
FT                   /number=1
FT   exon            774..856

  [Part of this file has been deleted for brevity]

FT                   RAKRRMQYNRRFVNVVPTFGKKKGPNANS"
FT   intron          857..950
FT                   /number=2
FT   exon            951..1095
FT                   /number=3
FT   intron          1096..1556
FT                   /number=3
FT   exon            1557..1612
FT                   /number=4
FT   intron          1613..1786
FT                   /number=4
FT   exon            1787..>1912
FT                   /number=5
FT   polyA_signal    1938..1943
XX
SQ   Sequence 2016 BP; 421 A; 562 C; 538 G; 495 T; 0 other;
     ctaccatttt ccctctcgat tctatatgta cactcgggac aagttctcct gatcgaaaac        60
     ggcaaaacta aggccccaag taggaatgcc ttagttttcg gggttaacaa tgattaacac       120
     tgagcctcac acccacgcga tgccctcagc tcctcgctca gcgctctcac caacagccgt       180
     agcccgcagc cccgctggac accggttctc catccccgca gcgtagcccg gaacatggta       240
     gctgccatct ttacctgcta cgccagcctt ctgtgcgcgc aactgtctgg tcccgccccg       300
     tcctgcgcga gctgctgccc aggcaggttc gccggtgcga gcgtaaaggg gcggagctag       360
     gactgccttg ggcggtacaa atagcaggga accgcgcggt cgctcagcag tgacgtgaca       420
     cgcagcccac ggtctgtact gacgcgccct cgcttcttcc tctttctcga ctccatcttc       480
     gcggtagctg ggaccgccgt tcaggtaaga atggggcctt ggctggatcc gaagggcttg       540
     tagcaggttg gctgcggggt cagaaggcgc ggggggaacc gaagaacggg gcctgctccg       600
     tggccctgct ccagtcccta tccgaactcc ttgggaggca ctggccttcc gcacgtgagc       660
     cgccgcgacc accatcccgt cgcgatcgtt tctggaccgc tttccactcc caaatctcct       720
     ttatcccaga gcatttcttg gcttctctta caagccgtct tttctttact cagtcgccaa       780
     tatgcagctc tttgtccgcg cccaggagct acacaccttc gaggtgaccg gccaggaaac       840
     ggtcgcccag atcaaggtaa ggctgcttgg tgcgccctgg gttccatttt cttgtgctct       900
     tcactctcgc ggcccgaggg aacgcttacg agccttatct ttccctgtag gctcatgtag       960
     cctcactgga gggcattgcc ccggaagatc aagtcgtgct cctggcaggc gcgcccctgg      1020
     aggatgaggc cactctgggc cagtgcgggg tggaggccct gactaccctg gaagtagcag      1080
     gccgcatgct tggaggtgag tgagagagga atgttctttg aagtaccggt aagcgtctag      1140
     tgagtgtggg gtgcatagtc ctgacagctg agtgtcacac ctatggtaat agagtacttc      1200
     tcactgtctt cagttcagag tgattcttcc tgtttacatc cctcatgttg aacacagacg      1260
     tccatgggag actgagccag agtgtagttg tatttcagtc acatcacgag atcctagtct      1320
     ggttatcagc ttccacacta aaaattaggt cagaccaggc cccaaagtgc tctataaatt      1380
     agaagctgga agatcctgaa atgaaactta agatttcaag gtcaaatatc tgcaactttg      1440
     ttctcattac ctattgggcg cagcttctct ttaaaggctt gaattgagaa aagaggggtt      1500