preg.txt

   When a parenthesized subpattern is quantified with a minimum repeat
   count that is greater than 1 or with a limited maximum, more memory is
   required for the compiled pattern, in proportion to the size of the
   minimum or maximum.

   If a pattern starts with .* or .{0,} and the PCRE_DOTALL option
   (equivalent to Perl's /s) is set, thus allowing the . to match
   newlines, the pattern is implicitly anchored, because whatever follows
   will be tried against every character position in the subject string,
   so there is no point in retrying the overall match at any position
   after the first. PCRE normally treats such a pattern as though it were
   preceded by \A. In cases where it is known that the subject string
   contains no newlines, it is worth setting PCRE_DOTALL in order to
   obtain this optimization, or alternatively using ^ to indicate
   anchoring explicitly.

   However, there is one situation where the optimization cannot be used.
   When .* is inside capturing parentheses that are the subject of a
   backreference elsewhere in the pattern, a match at the start may fail,
   and a later one succeed. Consider, for example:

     (.*)abc\1

   If the subject is "xyz123abc123" the match point is the fourth
   character. For this reason, such a pattern is not implicitly anchored.

   When a capturing subpattern is repeated, the value captured is the
   substring that matched the final iteration. For example, after

     (tweedle[dume]{3}\s*)+

   has matched "tweedledum tweedledee" the value of the captured
   substring is "tweedledee". However, if there are nested capturing
   subpatterns, the corresponding captured values may have been set in
   previous iterations. For example, after

     /(a|(b))+/

   matches "aba" the value of the second captured substring is "b".

PCRE PERFORMANCE

   Certain items that may appear in regular expression patterns are more
   efficient than others. It is more efficient to use a character class
   like [aeiou] than a set of alternatives such as (a|e|i|o|u).
   In general, the simplest construction that provides the required
   behaviour is usually the most efficient. Jeffrey Friedl's book
   contains a lot of discussion about optimizing regular expressions for
   efficient performance.

   When a pattern begins with .* not in parentheses, or in parentheses
   that are not the subject of a backreference, and the PCRE_DOTALL
   option is set, the pattern is implicitly anchored by PCRE, since it
   can match only at the start of a subject string. However, if
   PCRE_DOTALL is not set, PCRE cannot make this optimization, because
   the . meta-character does not then match a newline, and if the subject
   string contains newlines, the pattern may match from the character
   immediately following one of them instead of from the very start. For
   example, the pattern

     .*second

   matches the subject "first\nand second" (where \n stands for a newline
   character), with the match starting at the seventh character. In order
   to do this, PCRE has to retry the match starting after every newline
   in the subject.

   If you are using such a pattern with subject strings that do not
   contain newlines, the best performance is obtained by setting
   PCRE_DOTALL, or starting the pattern with ^.* to indicate explicit
   anchoring. That saves PCRE from having to scan along the subject
   looking for a newline to restart at.

   Beware of patterns that contain nested indefinite repeats. These can
   take a long time to run when applied to a string that does not match.
   Consider the pattern fragment

     (a+)*

   This can match "aaaa" in 16 different ways, and this number increases
   very rapidly as the string gets longer. (The * repeat can match 0, 1,
   2, 3, or 4 times, and for each of those cases other than 0, the +
   repeats can match different numbers of times.) When the remainder of
   the pattern is such that the entire match is going to fail, PCRE has
   in principle to try every possible variation, and this can take an
   extremely long time.
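
   As a rough illustration of the two behaviours described above, the
   following sketch uses Python's re module as a stand-in for PCRE. That
   substitution is an assumption: re is also a backtracking matcher and
   re.DOTALL plays the role of PCRE_DOTALL, but its internal
   optimizations are not PCRE's. The helper names match_start,
   count_splits and count_ways are made up for this example. The first
   helper reports where .*second starts matching, with and without the
   dot matching newlines. The others count, by enumeration, the ways
   (a+)* can consume a prefix of a run of "a" characters, where each way
   is a sequence of non-empty runs.

```python
import re
from functools import lru_cache


def match_start(pattern, subject, flags=0):
    """Return the 1-based position where the first match starts, or None."""
    m = re.search(pattern, subject, flags)
    return None if m is None else m.start() + 1


@lru_cache(maxsize=None)
def count_splits(k):
    """Count the ways to cut k characters into successive non-empty runs."""
    if k == 0:
        return 1  # the empty split: the outer * repeat matches zero times
    # The first run takes j characters; the remainder is split recursively.
    return sum(count_splits(k - j) for j in range(1, k + 1))


def count_ways(n):
    """Count the ways (a+)* can match some prefix of a string of n 'a's."""
    return sum(count_splits(k) for k in range(n + 1))


if __name__ == "__main__":
    subject = "first\nand second"
    print(match_start(r".*second", subject))             # dot stops at newline
    print(match_start(r".*second", subject, re.DOTALL))  # dot matches newline
    for n in range(1, 11):
        print(n, count_ways(n))
```

   Without the flag the match starts just after the newline, at the
   seventh character; with re.DOTALL it starts at the first. The count
   doubles with each extra character, which is why a match that is
   going to fail can take so long to give up.
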

   An optimization catches some of the simpler cases such as

     (a+)*b

   where a literal character follows. Before embarking on the standard
   matching procedure, PCRE checks that there is a "b" later in the
   subject string, and if there is not, it fails the match immediately.
   However, when there is no following literal this optimization cannot
   be used. You can see the difference by comparing the behaviour of

     (a+)*\d

   with the pattern above. When applied to a whole line of "a"
   characters, (a+)*b fails almost instantly, whereas (a+)*\d takes an
   appreciable time with strings longer than about 20 characters.

Usage

   Here is a sample session with preg

% preg
Regular expression search of a protein sequence
Input protein sequence(s): tsw:*_rat
Regular expression pattern: IA[QWF]A
Output report [100k_rat.preg]:

Command line arguments

   Standard (Mandatory) qualifiers:
  [-sequence]          seqall     Protein sequence(s) filename and optional
                                  format, or reference (input USA)
  [-pattern]           regexp     Any regular expression pattern is accepted
  [-outfile]           report     [*.preg] Output report file name

   Additional (Optional) qualifiers: (none)
   Advanced (Unprompted) qualifiers: (none)
   Associated qualifiers:

   "-sequence" associated qualifiers
   -sbegin1            integer    Start of each sequence to be used
   -send1              integer    End of each sequence to be used
   -sreverse1          boolean    Reverse (if DNA)
   -sask1              boolean    Ask for begin/end/reverse
   -snucleotide1       boolean    Sequence is nucleotide
   -sprotein1          boolean    Sequence is protein
   -slower1            boolean    Make lower case
   -supper1            boolean    Make upper case
   -sformat1           string     Input sequence format
   -sdbname1           string     Database name
   -sid1               string     Entryname
   -ufo1               string     UFO features
   -fformat1           string     Features format
   -fopenfile1         string     Features file name

   "-pattern" associated qualifiers
   -pformat2           string     File format
   -pname2             string     Pattern base name

   "-outfile" associated qualifiers
   -rformat3           string     Report format
   -rname3             string     Base file name
   -rextension3        string     File name extension
   -rdirectory3        string     Output directory
   -raccshow3          boolean    Show accession number in the report
   -rdesshow3          boolean    Show description in the report
   -rscoreshow3        boolean    Show the score in the report
   -rusashow3          boolean    Show the full USA in the report
   -rmaxall3           integer    Maximum total hits to report
   -rmaxseq3           integer    Maximum hits to report for one sequence

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write standard output
   -filter             boolean    Read standard input, write standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages

Input file format

   preg reads any protein sequence USA (Uniform Sequence Address).

  Input files for usage example

   'tsw:*_rat' is a sequence entry in the example protein database 'tsw'

Output file format

  Output files for usage example

  File: 100k_rat.preg

########################################
# Program: preg
# Rundate: Sat 15 Jul 2006 12:00:00
# Commandline: preg
#    -sequence "tsw:*_rat"
#    -pattern "IA[QWF]A"
# Report_format: seqtable
# Report_file: 100k_rat.preg
########################################

#=======================================
#
# Sequence: 100K_RAT     from: 1   to: 889
# HitCount: 1
#
# Pattern: IA[QWF]A
#
#=======================================

  Start     End Pattern_name Sequence
    390     393 regex1       IAQA

#---------------------------------------
#---------------------------------------

#---------------------------------------
# Total_sequences: 1
# Total_hitcount: 1
#---------------------------------------

Data files

   None.

Notes

   None.

References

   None.

Warnings

   Regular expressions are case-sensitive. The pattern 'AAAA' will not
   match the sequence 'aaaa'. For this reason, both your pattern and the
   input sequences are converted to upper-case.

Diagnostic Error Messages

   None.

Exit status

   It always exits with a status of 0.
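
   The case handling described under Warnings and the Start/End columns
   of the sample report can be sketched in a few lines of Python. This
   is only an illustration, not the preg source: the function name scan
   and the test sequence are invented, Python's re module stands in for
   the regular expression library preg uses, and positions are assumed
   to be 1-based and inclusive, as the sample report suggests (a
   four-residue hit running from 390 to 393). Upper-casing the pattern
   is only safe for plain residue patterns such as IA[QWF]A; it would
   corrupt escapes such as \d.

```python
import re


def scan(sequence, pattern):
    """Return (start, end, text) for each hit, 1-based and inclusive."""
    # Assumption: both inputs are upper-cased before matching, as the
    # Warnings section describes. Only safe for patterns without escapes.
    regex = re.compile(pattern.upper())
    upper = sequence.upper()
    return [(m.start() + 1, m.end(), m.group()) for m in regex.finditer(upper)]


if __name__ == "__main__":
    # Made-up sequence, not taken from any database.
    hits = scan("mkiaqalliawag", "ia[qwf]a")
    print("  Start     End Sequence")
    for start, end, text in hits:
        print(f"{start:7d} {end:7d} {text}")
```

   Because both strings are upper-cased first, a lower-case pattern
   still finds hits in a lower-case sequence, and the hits are reported
   in upper case.
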

Known bugs

   None.

See also

   Program name   Description
   antigenic      Finds antigenic sites in proteins
   digest         Protein proteolytic enzyme or reagent cleavage digest
   epestfind      Finds PEST motifs as potential proteolytic cleavage sites
   fuzzpro        Protein pattern search
   fuzztran       Protein pattern search after translation
   helixturnhelix Report nucleic acid binding motifs
   oddcomp        Find protein sequence regions with a biased composition
   patmatdb       Search a protein sequence with a motif
   patmatmotifs   Search a PROSITE motif database with a protein sequence
   pepcoil        Predicts coiled coil regions
   pscan          Scans proteins using PRINTS
   sigcleave      Reports protein signal cleavage sites

   Other EMBOSS programs allow you to search for simple patterns and may
   be easier for the user who has never used regular expressions before:

     * fuzznuc - Nucleic acid pattern search
     * fuzzpro - Protein pattern search
     * fuzztran - Protein pattern search after translation

Author(s)

   Peter Rice (pmr