                                newcpgseek

Function

   Reports CpG rich regions

Description

   newcpgseek reports CpG rich regions of a sequence as candidate CpG
   islands.

   CpG refers to a C nucleotide immediately followed by a G. The 'p' in
   'CpG' refers to the phosphate group linking the two bases.

   Detection of regions of genomic sequences that are rich in the CpG
   pattern is important because such regions are resistant to methylation
   and tend to be associated with genes which are frequently switched on.
   Regions rich in the CpG pattern are known as CpG islands.

   It has been estimated that about half of all mammalian genes have a
   CpG-rich region around their 5' end. It is said that all mammalian
   house-keeping genes have a CpG island!

   Non-mammalian vertebrates have some CpG islands that are associated
   with genes, but the association becomes more equivocal in more
   distant taxonomic groups.

   Finding a CpG island upstream of predicted exons or genes is good
   contributory evidence for the prediction.

   CpG islands are usually defined as "length over 200bp with %GC over
   50% and observed/expected CpG more than 0.6". However, this program
   uses a running sum rather than a window to produce a score: if there
   is no CpG at position i, the runSum counter is decremented; if there
   is a CpG, then runSum += CPGSCORE. Spans above the threshold are
   searched recursively. If the score of a span is higher than the
   threshold (17 at the moment), a putative island is declared.
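
   The running-sum scoring described above can be sketched as follows.
   This is a simplified illustration, not the program's actual source:
   the function name, the floor of the sum at zero, the decrement of 1,
   the increment of 17 for CPGSCORE and the way a span's end is chosen
   (the last CpG at which the sum peaks) are all assumptions, and the
   recursive search of spans is left out.

```python
def cpg_running_sum_islands(seq, cpg_score=17, threshold=17):
    """Return (begin, end, score) tuples, 1-based and inclusive.

    Hypothetical sketch of a running-sum CpG scan. Assumed behaviour:
    each CpG adds cpg_score, every other position subtracts 1, and the
    sum never drops below zero. No recursive re-search of spans.
    """
    seq = seq.upper()
    islands = []
    run_sum = 0        # current running sum
    start = None       # index of the C that opened the current span
    best = 0           # highest sum seen in the current span
    best_end = None    # index of the G at which that highest sum occurred

    for i in range(len(seq) - 1):
        if seq[i] == "C" and seq[i + 1] == "G":
            if run_sum == 0:
                # A new span opens at this CpG.
                start, best = i, 0
            run_sum += cpg_score
            if run_sum > best:
                best, best_end = run_sum, i + 1
        elif run_sum > 0:
            run_sum -= 1
            if run_sum == 0:
                # The span has decayed away; report it if it scored enough.
                if best > threshold:
                    islands.append((start + 1, best_end + 1, best))
                start = None

    # Report a span that is still open at the end of the sequence.
    if start is not None and best > threshold:
        islands.append((start + 1, best_end + 1, best))
    return islands


if __name__ == "__main__":
    # Bases 1181 to 1200 of the example entry rnu68037.
    print(cpg_running_sum_islands("tgcgtagctcccggcacagt"))
```

   On bases 1181 to 1200 of the example entry this prints
   [(3, 13, 26)], which corresponds to positions 1183 to 1193 with a
   score of 26 (17 - 8 + 17), in line with the last row of the sample
   output. Longer regions, where the real program searches spans
   recursively, should not be expected to match this sketch.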

   This program reads in one or more sequences and finds regions where
   there is a high absolute frequency of CpG dimers as well as a high
   proportion of CpG compared to GpC.

Usage

   Here is a sample session with newcpgseek

% newcpgseek
Reports CpG rich regions
Input nucleotide sequence(s): tembl:rnu68037
CpG score [17]:
Output file [rnu68037.newcpgseek]:

Command line arguments

   Standard (Mandatory) qualifiers:
  [-sequence]          seqall     Nucleotide sequence(s) filename and optional
                                  format, or reference (input USA)
   -score              integer    [17] CpG score (Integer from 1 to 200)
  [-outfile]           outfile    [*.newcpgseek] Output file name

   Additional (Optional) qualifiers: (none)
   Advanced (Unprompted) qualifiers: (none)
   Associated qualifiers:

   "-sequence" associated qualifiers
   -sbegin1            integer    Start of each sequence to be used
   -send1              integer    End of each sequence to be used
   -sreverse1          boolean    Reverse (if DNA)
   -sask1              boolean    Ask for begin/end/reverse
   -snucleotide1       boolean    Sequence is nucleotide
   -sprotein1          boolean    Sequence is protein
   -slower1            boolean    Make lower case
   -supper1            boolean    Make upper case
   -sformat1           string     Input sequence format
   -sdbname1           string     Database name
   -sid1               string     Entryname
   -ufo1               string     UFO features
   -fformat1           string     Features format
   -fopenfile1         string     Features file name

   "-outfile" associated qualifiers
   -odirectory2        string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write standard output
   -filter             boolean    Read standard input, write standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages

Input file format

   newcpgseek reads a nucleic acid sequence USA.

  Input files for usage example

   'tembl:rnu68037' is a sequence entry in the example nucleic acid
   database 'tembl'

  Database entry: tembl:rnu68037

ID   RNU68037   standard; RNA; ROD; 1218 BP.
XX
AC   U68037;
XX
SV   U68037.1
XX
DT   23-SEP-1996 (Rel. 49, Created)
DT   04-MAR-2000 (Rel. 63, Last updated, Version 2)
XX
DE   Rattus norvegicus EP1 prostanoid receptor mRNA, complete cds.
XX
KW   .
XX
OS   Rattus norvegicus (Norway rat)
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
OC   Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Rattus.
XX
RN   [1]
RP   1-1218
RA   Abramovitz M., Boie Y.;
RT   "Cloning of the rat EP1 prostanoid receptor";
RL   Unpublished.
XX
RN   [2]
RP   1-1218
RA   Abramovitz M., Boie Y.;
RT   ;
RL   Submitted (26-AUG-1996) to the EMBL/GenBank/DDBJ databases.
RL   Biochemistry & Molecular Biology, Merck Frosst Center for Therapeutic
RL   Research, P. O. Box 1005, Pointe Claire - Dorval, Quebec H9R 4P8, Canada
XX
DR   SWISS-PROT; P70597; PE21_RAT.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..1218
FT                   /db_xref="taxon:10116"
FT                   /organism="Rattus norvegicus"
FT                   /strain="Sprague-Dawley"
FT   CDS             1..1218
FT                   /codon_start=1
FT                   /db_xref="SWISS-PROT:P70597"
FT                   /note="family 1 G-protein coupled receptor"
FT                   /product="EP1 prostanoid receptor"
FT                   /protein_id="AAB07735.1"
FT                   /translation="MSPYGLNLSLVDEATTCVTPRVPNTSVVLPTGGNGTSPALPIFSM
FT                   TLGAVSNVLALALLAQVAGRLRRRRSTATFLLFVASLLAIDLAGHVIPGALVLRLYTAG
FT                   RAPAGGACHFLGGCMVFFGLCPLLLGCGMAVERCVGVTQPLIHAARVSVARARLALALL
FT                   AAMALAVALLPLVHVGHYELQYPGTWCFISLGPPGGWRQALLAGLFAGLGLAALLAALV
FT                   CNTLSGLALLRARWRRRRSRRFRENAGPDDRRRWGSRGLRLASASSASSITSTTAALRS
FT                   SRGGGSARRVHAHDVEMVGQLVGIMVVSCICWSPLLVLVVLAIGGWNSNSLQRPLFLAV
FT                   RLASWNQILDPWVYILLRQAMLRQLLRLLPLRVSAKGGPTELSLTKSAWEASSLRSSRH
FT                   SGFSHL"
XX
SQ   Sequence 1218 BP; 162 A; 397 C; 387 G; 272 T; 0 other;
     atgagcccct acgggcttaa cctgagccta gtggatgagg caacaacgtg tgtaacaccc        60
     agggtcccca atacatctgt ggtgctgcca acaggcggta acggcacatc accagcgctg       120
     cctatcttct ccatgacgct gggtgctgtg tccaacgtgc tggcgctggc gctgctggcc       180
     caggttgcag gcagactgcg gcgccgccgc tcgactgcca ccttcctgtt gttcgtcgcc       240
     agcctgcttg ccatcgacct agcaggccat gtgatcccgg gcgccttggt gcttcgcctg       300
     tatactgcag gacgtgcgcc cgctggcggg gcctgtcatt tcctgggcgg ctgtatggtc       360
     ttctttggcc tgtgcccact tttgcttggc tgtggcatgg ccgtggagcg ctgcgtgggt       420
     gtcacgcagc cgctgatcca cgcggcgcgc gtgtccgtag cccgcgcacg cctggcacta       480
     gccctgctgg ccgccatggc tttggcagtg gcgctgctgc cactagtgca cgtgggtcac       540
     tacgagctac agtaccctgg cacttggtgt ttcattagcc ttgggcctcc tggaggttgg       600
     cgccaggcgt tgcttgcggg cctcttcgcc ggccttggcc tggctgcgct ccttgccgca       660
     ctagtgtgta atacgctcag cggcctggcg ctccttcgtg cccgctggag gcggcgtcgc       720
     tctcgacgtt tccgagagaa cgcaggtccc gatgatcgcc ggcgctgggg gtcccgtgga       780
     ctccgcttgg cctccgcctc gtctgcgtca tccatcactt caaccacagc tgccctccgc       840
     agctctcggg gaggcggctc cgcgcgcagg gttcacgcac acgacgtgga aatggtgggc       900
     cagctcgtgg gcatcatggt ggtgtcgtgc atctgctgga gccccctgct ggtattggtg       960
     gtgttggcca tcgggggctg gaactctaac tccctgcagc ggccgctctt tctggctgta      1020
     cgcctcgcgt cgtggaacca gatcctggac ccatgggtgt acatcctgct gcgccaggct      1080
     atgctgcgcc aacttcttcg cctcctaccc ctgagggtta gtgccaaggg tggtccaacg      1140
     gagctgagcc taaccaagag tgcctgggag gccagttcac tgcgtagctc ccggcacagt      1200
     ggcttcagcc acttgtga                                                    1218
//

Output file format

  Output files for usage example

  File: rnu68037.newcpgseek

NEWCPGSEEK of RNU68037 from 1 to 1218
with score > 17

 Begin    End  Score        CpG  %CG  CG/GC
*    96   1032   630         87  66.1   0.65
  1072   1100    26          3  62.1   0.00
  1183   1193    26          2  72.7   2.00
-------------------------------------------

Data files

   None.

Notes

   None.

References

   None.

Warnings

   None.

Diagnostic Error Messages

   None.

Exit status

   It always exits with a status of 0.

Known bugs

   None.

See also

   Program name                        Description
   cpgplot      Plot CpG rich areas
   cpgreport    Reports all CpG rich regions
   geecee       Calculates fractional GC content of nucleic acid sequences
   newcpgreport Report CpG rich areas

   As there is no official definition of what a CpG island is, and,
   worse, of where islands begin and end, we have to live with two
   definitions and thus two methods. These are:

   1. newcpgseek and cpgreport - both declare a putative island if the
   score is higher than a threshold (17 at the moment). They now also
   display the actual CpG count, the % CG and the observed/expected
   ratio in the region where the score is above the threshold. This
   scoring method, based on sum/frequencies, overpredicts islands but
   finds the smaller ones around primary exons. newcpgseek uses the same
   method as cpgreport but the output is different and more readable.

   2. newcpgreport and cpgplot use a sliding window within which the
   Obs/Exp ratio of CpG is calculated. The important thing to note in
   this method is that an island, in order to be reported, is defined as
   a region that satisfies the following constraints:

   Obs/Exp ratio > 0.6
   % C + % G > 50%
   Length > 200.

   For all practical purposes you should probably use newcpgreport. It is
   actually used to produce the human cpgisland database you can find on
   the EBI's ftp server as well as on the EBI's SRS server.

   geecee measures CG content in the entire input sequence and is not to
   be used to detect CpG islands. It can be useful for detecting
   sequences that MIGHT contain an island.

Author(s)

   Rodrigo Lopez (rls