                                 cpgreport

Function

   Reports all CpG rich regions

Description

   cpgreport scans a nucleotide sequence for regions with higher than
   expected frequencies of the dinucleotide CG.

   CpG refers to a C nucleotide immediately followed by a G. The 'p' in
   'CpG' refers to the phosphate group linking the two bases.

   Detection of regions of genomic sequences that are rich in the CpG
   pattern is important because such regions are resistant to methylation
   and tend to be associated with genes which are frequently switched on.
   Regions rich in the CpG pattern are known as CpG islands.

   This program does not find CpG islands as normally defined: "a region
   of greater than 200 bp with a %GC of greater than 50% and
   observed/expected CpG > 0.6".

   cpgreport instead uses a running sum rather than a window to create
   the score as follows: if not CpG at position i, then decrement
   running-Sum counter, but if CpG then running-Sum counter is
   incremented by the CPGSCORE. Spans greater than the threshold are
   searched for recursively.

   This method overpredicts islands but finds the smaller ones around
   primary exons.

Usage

   Here is a sample session with cpgreport

% cpgreport tembl:rnu68037
Reports all CpG rich regions
CpG score [17]:
Output file [rnu68037.cpgreport]:
Features output [rnu68037.gff]:

Command line arguments

   Standard (Mandatory) qualifiers:
  [-sequence]          seqall     Nucleotide sequence(s) filename and optional
                                  format, or reference (input USA)
   -score              integer    [17] This sets the score for each CG
                                  sequence found. A value of 17 is more
                                  sensitive, but 28 has also been used with
                                  some success.
                                  (Integer from 1 to 200)
  [-outfile]           outfile    [*.cpgreport] Output file name
  [-outfeat]           featout    [unknown.gff] File for output features

   Additional (Optional) qualifiers: (none)
   Advanced (Unprompted) qualifiers: (none)

   Associated qualifiers:

   "-sequence" associated qualifiers
   -sbegin1            integer    Start of each sequence to be used
   -send1              integer    End of each sequence to be used
   -sreverse1          boolean    Reverse (if DNA)
   -sask1              boolean    Ask for begin/end/reverse
   -snucleotide1       boolean    Sequence is nucleotide
   -sprotein1          boolean    Sequence is protein
   -slower1            boolean    Make lower case
   -supper1            boolean    Make upper case
   -sformat1           string     Input sequence format
   -sdbname1           string     Database name
   -sid1               string     Entryname
   -ufo1               string     UFO features
   -fformat1           string     Features format
   -fopenfile1         string     Features file name

   "-outfile" associated qualifiers
   -odirectory2        string     Output directory

   "-outfeat" associated qualifiers
   -offormat3          string     Output feature format
   -ofopenfile3        string     Features file name
   -ofextension3       string     File name extension
   -ofdirectory3       string     Output directory
   -ofname3            string     Base file name
   -ofsingle3          boolean    Separate file for each entry

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write standard output
   -filter             boolean    Read standard input, write standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages

Input file format

   Any DNA sequence USA.

  Input files for usage example

   'tembl:rnu68037' is a sequence entry in the example nucleic acid
   database 'tembl'

  Database entry: tembl:rnu68037

ID   RNU68037   standard; RNA; ROD; 1218 BP.
XX
AC   U68037;
XX
SV   U68037.1
XX
DT   23-SEP-1996 (Rel. 49, Created)
DT   04-MAR-2000 (Rel. 63, Last updated, Version 2)
XX
DE   Rattus norvegicus EP1 prostanoid receptor mRNA, complete cds.
XX
KW   .
XX
OS   Rattus norvegicus (Norway rat)
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
OC   Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Rattus.
XX
RN   [1]
RP   1-1218
RA   Abramovitz M., Boie Y.;
RT   "Cloning of the rat EP1 prostanoid receptor";
RL   Unpublished.
XX
RN   [2]
RP   1-1218
RA   Abramovitz M., Boie Y.;
RT   ;
RL   Submitted (26-AUG-1996) to the EMBL/GenBank/DDBJ databases.
RL   Biochemistry & Molecular Biology, Merck Frosst Center for Therapeutic
RL   Research, P. O. Box 1005, Pointe Claire - Dorval, Quebec H9R 4P8, Canada
XX
DR   SWISS-PROT; P70597; PE21_RAT.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..1218
FT                   /db_xref="taxon:10116"
FT                   /organism="Rattus norvegicus"
FT                   /strain="Sprague-Dawley"
FT   CDS             1..1218
FT                   /codon_start=1
FT                   /db_xref="SWISS-PROT:P70597"
FT                   /note="family 1 G-protein coupled receptor"
FT                   /product="EP1 prostanoid receptor"
FT                   /protein_id="AAB07735.1"
FT                   /translation="MSPYGLNLSLVDEATTCVTPRVPNTSVVLPTGGNGTSPALPIFSM
FT                   TLGAVSNVLALALLAQVAGRLRRRRSTATFLLFVASLLAIDLAGHVIPGALVLRLYTAG
FT                   RAPAGGACHFLGGCMVFFGLCPLLLGCGMAVERCVGVTQPLIHAARVSVARARLALALL
FT                   AAMALAVALLPLVHVGHYELQYPGTWCFISLGPPGGWRQALLAGLFAGLGLAALLAALV
FT                   CNTLSGLALLRARWRRRRSRRFRENAGPDDRRRWGSRGLRLASASSASSITSTTAALRS
FT                   SRGGGSARRVHAHDVEMVGQLVGIMVVSCICWSPLLVLVVLAIGGWNSNSLQRPLFLAV
FT                   RLASWNQILDPWVYILLRQAMLRQLLRLLPLRVSAKGGPTELSLTKSAWEASSLRSSRH
FT                   SGFSHL"
XX
SQ   Sequence 1218 BP; 162 A; 397 C; 387 G; 272 T; 0 other;
     atgagcccct acgggcttaa cctgagccta gtggatgagg caacaacgtg tgtaacaccc        60
     agggtcccca atacatctgt ggtgctgcca acaggcggta acggcacatc accagcgctg       120
     cctatcttct ccatgacgct gggtgctgtg tccaacgtgc tggcgctggc gctgctggcc       180
     caggttgcag gcagactgcg gcgccgccgc tcgactgcca ccttcctgtt gttcgtcgcc       240
     agcctgcttg ccatcgacct agcaggccat gtgatcccgg gcgccttggt gcttcgcctg       300
     tatactgcag gacgtgcgcc cgctggcggg gcctgtcatt tcctgggcgg ctgtatggtc       360
     ttctttggcc tgtgcccact tttgcttggc tgtggcatgg ccgtggagcg ctgcgtgggt       420
     gtcacgcagc cgctgatcca cgcggcgcgc gtgtccgtag cccgcgcacg cctggcacta       480
     gccctgctgg ccgccatggc tttggcagtg gcgctgctgc cactagtgca cgtgggtcac       540
     tacgagctac agtaccctgg cacttggtgt ttcattagcc ttgggcctcc tggaggttgg       600
     cgccaggcgt tgcttgcggg cctcttcgcc ggccttggcc tggctgcgct ccttgccgca       660
     ctagtgtgta atacgctcag cggcctggcg ctccttcgtg cccgctggag gcggcgtcgc       720
     tctcgacgtt tccgagagaa cgcaggtccc gatgatcgcc ggcgctgggg gtcccgtgga       780
     ctccgcttgg cctccgcctc gtctgcgtca tccatcactt caaccacagc tgccctccgc       840
     agctctcggg gaggcggctc cgcgcgcagg gttcacgcac acgacgtgga aatggtgggc       900
     cagctcgtgg gcatcatggt ggtgtcgtgc atctgctgga gccccctgct ggtattggtg       960
     gtgttggcca tcgggggctg gaactctaac tccctgcagc ggccgctctt tctggctgta      1020
     cgcctcgcgt cgtggaacca gatcctggac ccatgggtgt acatcctgct gcgccaggct      1080
     atgctgcgcc aacttcttcg cctcctaccc ctgagggtta gtgccaaggg tggtccaacg      1140
     gagctgagcc taaccaagag tgcctgggag gccagttcac tgcgtagctc ccggcacagt      1200
     ggcttcagcc acttgtga                                                    1218
//

Output file format

  Output files for usage example

  File: rnu68037.cpgreport

CPGREPORT of RNU68037 from 1 to 1218

Sequence     Begin    End  Score    CpG    %CG  CG/GC
RNU68037        12     13     17      1  100.0      -
RNU68037        47     48     17      1  100.0      -
RNU68037        96   1032    630     87   66.1   0.65
RNU68037      1072   1100     26      3   62.1   0.00
RNU68037      1139   1140     17      1  100.0      -
RNU68037      1183   1193     26      2   72.7   2.00

  File: rnu68037.gff

##gff-version 2.0
##date 2006-07-15
##Type DNA RNU68037
RNU68037  cpgreport  misc_feature    12    13   17.000  +  .  Sequence "RNU68037.1"
RNU68037  cpgreport  misc_feature    47    48   17.000  +  .  Sequence "RNU68037.2"
RNU68037  cpgreport  misc_feature    96  1032  630.000  +  .  Sequence "RNU68037.3"
RNU68037  cpgreport  misc_feature  1072  1100   26.000  +  .  Sequence "RNU68037.4"
RNU68037  cpgreport  misc_feature  1139  1140   17.000  +  .  Sequence "RNU68037.5"
RNU68037  cpgreport  misc_feature  1183  1193   26.000  +  .  Sequence "RNU68037.6"

   The first non-blank line of the output file 'rnu68037.cpgreport' is
   the title line giving the program name, the name of the sequence being
   analysed and the start and end positions of the sequence.
   The second non-blank line contains the headings of the columns.

   Subsequent lines contain columns with the following information:

     * The name of the sequence.
     * The begin position and the end position of the CpG-rich region.
     * The score of the CpG-rich region.
     * The number of CpGs in the CpG-rich region.
     * The %(G+C) in the CpG-rich region.
     * The ratio of CpG to GpC in the CpG-rich region.

   If the count of GpC in the region is zero, then the ratio of CG/GC is
   reported as '-'.

Data files

   None.

Notes

   This program does not find CpG islands as normally defined (see
   cpgplot).

References

   None.

Warnings

   None.

Diagnostic Error Messages

   None.

Exit status

   0 if successful.

Known bugs

   None.

See also

   Program name   Description
   cpgplot        Plot CpG rich areas
   geecee         Calculates fractional GC content of nucleic acid sequences
   newcpgreport   Report CpG rich areas
   newcpgseek     Reports CpG rich regions

   As there is no official definition of what a CpG island is, and worse,
   of where one begins and ends, we have to live with two definitions and
   thus two methods. These are:

   1. newcpgseek and cpgreport - both declare a putative island if the
   score is higher than a threshold (17 at the moment). They now also
   display the actual CpG count, the % CG and the observed/expected ratio
   in the region where the score is above the threshold. This scoring
   method based on sum/frequencies overpredicts islands but finds the
   smaller ones around primary exons. newcpgseek uses the same method as
   cpgreport but the output is different and more readable.

   2. newcpgreport and cpgplot use a sliding window within which the
   Obs/Exp ratio of CpG is calculated. The important thing to note in
   this method is that an island, in order to be reported, is defined as
   a region that satisfies the following constraints:

      Obs/Exp ratio > 0.6
      % C + % G > 50%
      Length > 200

   For all practical purposes you should probably use newcpgreport.
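
   The running-sum scoring shared by cpgreport and newcpgseek (method 1)
   can be illustrated with a short Python sketch. This is not the EMBOSS
   source code. The function name running_sum_regions is hypothetical,
   and several details are assumptions made for illustration: a
   decrement of exactly 1 per non-CpG position, a region ending at the G
   of the CpG where the running sum peaked, a reset when the sum falls
   to zero, and a plain rescan after each peak standing in for the
   recursive search. No threshold filtering is applied.

```python
def running_sum_regions(seq, cpg_score=17):
    """Return (begin, end, score, n_cpg) tuples with 1-based, inclusive
    coordinates. Simplified illustration, not the EMBOSS implementation."""
    seq = seq.lower()
    n = len(seq)
    regions = []
    i = 0
    while i < n - 1:
        if seq[i:i + 2] != "cg":
            i += 1
            continue
        # A CpG opens a candidate region. Follow the running sum until it
        # falls back to zero or the sequence ends.
        begin = i
        total = 0
        best = 0
        best_end = i + 1          # 0-based index of the G at the peak
        j = i
        while j < n - 1:
            if seq[j:j + 2] == "cg":
                total += cpg_score        # CpG: add the score
                if total > best:
                    best, best_end = total, j + 1
            else:
                total -= 1                # assumed unit decrement
                if total <= 0:
                    break
            j += 1
        n_cpg = seq.count("cg", begin, best_end + 1)
        regions.append((begin + 1, best_end + 1, best, n_cpg))
        # Assumed stand-in for the recursive search: rescan after the peak.
        i = best_end + 1
    return regions


# Bases 1181..1218 of the example entry RNU68037.
fragment = "tgcgtagctcccggcacagtggcttcagccacttgtga"
print(running_sum_regions(fragment))
```

   For this fragment the sketch returns [(3, 13, 26, 2)]. Adding the
   offset of 1180 gives the span 1183 to 1193 with score 26 and 2 CpGs,
   which agrees with the last row of the sample output. Agreement on
   other regions, in particular long ones that depend on the recursive
   search, is not guaranteed.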
   newcpgreport is actually used to produce the human cpgisland database
   you can find on the EBI's ftp server as well as on the EBI's SRS
   server.

   geecee measures CG content in the entire input sequence and is not to
   be used to detect CpG islands. It can be useful for detecting
   sequences that MIGHT contain an island.

Author(s)

   This program was originally written by Gos Micklem (gos