When a parenthesized subpattern is quantified with a minimum repeat count that is greater than 1 or with a limited maximum, more store is required for the compiled pattern, in proportion to the size of the minimum or maximum.

If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equivalent to Perl's /s) is set, thus allowing the . to match newlines, the pattern is implicitly anchored, because whatever follows will be tried against every character position in the subject string, so there is no point in retrying the overall match at any position after the first. PCRE normally treats such a pattern as though it were preceded by \A. In cases where it is known that the subject string contains no newlines, it is worth setting PCRE_DOTALL in order to obtain this optimization, or alternatively using ^ to indicate anchoring explicitly.

However, there is one situation where the optimization cannot be used. When .* is inside capturing parentheses that are the subject of a backreference elsewhere in the pattern, a match at the start may fail, and a later one succeed. Consider, for example:

  (.*)abc\1

If the subject is "xyz123abc123" the match point is the fourth character. For this reason, such a pattern is not implicitly anchored.

When a capturing subpattern is repeated, the value captured is the substring that matched the final iteration. For example, after

  (tweedle[dume]{3}\s*)+

has matched "tweedledum tweedledee" the value of the captured substring is "tweedledee". However, if there are nested capturing subpatterns, the corresponding captured values may have been set in previous iterations. For example, after /(a|(b))+/

PCRE PERFORMANCE

Certain items that may appear in regular expression patterns are more efficient than others. It is more efficient to use a character class like [aeiou] than a set of alternatives such as (a|e|i|o|u). In general, the simplest construction that provides the required behaviour is usually the most efficient.
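
The two capture behaviours described before the PCRE PERFORMANCE heading can be checked with a short script. This is only a minimal sketch: it uses Python's built-in re module as a stand-in, on the assumption that it agrees with PCRE for these two patterns. It is not PCRE itself, and the variable names are illustrative.

```python
import re

# Assumption: Python's re module stands in for PCRE here. The two engines
# agree on these particular patterns, but they are not the same library.

# A backreference to (.*) prevents implicit anchoring: the attempt at the
# start of the subject fails and a later start position succeeds.
backref = re.search(r"(.*)abc\1", "xyz123abc123", re.DOTALL)
print(backref.start() + 1)   # 1-based match point: 4
print(backref.group(1))      # text captured by (.*): 123

# A repeated capturing group keeps only the text of its final iteration.
repeated = re.match(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
print(repeated.group(1))     # tweedledee
```

Note that Python reports start positions counting from 0, so the script adds 1 to obtain the "fourth character" quoted in the text.
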
Jeffrey Friedl's book contains a lot of discussion about optimizing regular expressions for efficient performance.

When a pattern begins with .* not in parentheses, or in parentheses that are not the subject of a backreference, and the PCRE_DOTALL option is set, the pattern is implicitly anchored by PCRE, since it can match only at the start of a subject string. However, if PCRE_DOTALL is not set, PCRE cannot make this optimization, because the . meta-character does not then match a newline, and if the subject string contains newlines, the pattern may match from the character immediately following one of them instead of from the very start. For example, the pattern

  .*second

matches the subject "first\nand second" (where \n stands for a newline character), with the match starting at the seventh character. In order to do this, PCRE has to retry the match starting after every newline in the subject.

If you are using such a pattern with subject strings that do not contain newlines, the best performance is obtained by setting PCRE_DOTALL, or starting the pattern with ^.* to indicate explicit anchoring. That saves PCRE from having to scan along the subject looking for a newline to restart at.

Beware of patterns that contain nested indefinite repeats. These can take a long time to run when applied to a string that does not match. Consider the pattern fragment

  (a+)*

This can match "aaaa" in 33 different ways, and this number increases very rapidly as the string gets longer. (The * repeat can match 0, 1, 2, 3, or 4 times, and for each of those cases other than 0, the + repeats can match different numbers of times.) When the remainder of the pattern is such that the entire match is going to fail, PCRE has in principle to try every possible variation, and this can take an extremely long time.

An optimization catches some of the more simple cases such as

  (a+)*b

where a literal character follows.
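
One way to picture an optimization of this kind is a cheap test for the required literal before the full match is attempted. The sketch below is written in Python rather than PCRE's C, and the helper name search_with_required_literal is hypothetical. It illustrates the general idea only, under the simplifying assumption that the caller names the required literal by hand; it is not PCRE's actual implementation.

```python
import re

def search_with_required_literal(pattern, required, subject):
    """Search subject for pattern, but give up at once if a literal that
    every match must contain is missing from the subject.

    Hypothetical helper: the caller supplies the required literal by hand,
    which is a simplification made for this sketch.
    """
    if required not in subject:
        # No match is possible, so skip the backtracking search entirely.
        return None
    return re.search(pattern, subject)

# A whole line of "a" characters contains no "b", so this returns at once
# instead of trying every way of dividing the run between a+ and *.
miss = search_with_required_literal(r"(a+)*b", "b", "a" * 40)
print(miss)            # None

# When the literal is present the normal matcher runs as usual.
hit = search_with_required_literal(r"(a+)*b", "b", "aaab")
print(hit.group(0))    # aaab
```

The pre-check costs a single scan of the subject, which is why it pays off only when the alternative is an exhaustive backtracking search.
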
Before embarking on the standard matching procedure, PCRE checks that there is a "b" later in the subject string, and if there is not, it fails the match immediately. However, when there is no following literal this optimization cannot be used. You can see the difference by comparing the behaviour of

  (a+)*\d

with the pattern above. The pattern above, (a+)*b, gives a failure almost instantly when applied to a whole line of "a" characters, whereas (a+)*\d takes an appreciable time with strings longer than about 20 characters.

Usage

Here is a sample session with preg

  % preg
  Regular expression search of a protein sequence
  Input protein sequence(s): tsw:*_rat
  Regular expression pattern: IA[QWF]A
  Output report [100k_rat.preg]:

Command line arguments

   Standard (Mandatory) qualifiers:
  [-sequence]          seqall     Protein sequence(s) filename and optional
                                  format, or reference (input USA)
  [-pattern]           regexp     Any regular expression pattern is accepted
  [-outfile]           report     [*.preg] Output report file name

   Additional (Optional) qualifiers: (none)
   Advanced (Unprompted) qualifiers: (none)
   Associated qualifiers:

   "-sequence" associated qualifiers
   -sbegin1            integer    Start of each sequence to be used
   -send1              integer    End of each sequence to be used
   -sreverse1          boolean    Reverse (if DNA)
   -sask1              boolean    Ask for begin/end/reverse
   -snucleotide1       boolean    Sequence is nucleotide
   -sprotein1          boolean    Sequence is protein
   -slower1            boolean    Make lower case
   -supper1            boolean    Make upper case
   -sformat1           string     Input sequence format
   -sdbname1           string     Database name
   -sid1               string     Entryname
   -ufo1               string     UFO features
   -fformat1           string     Features format
   -fopenfile1         string     Features file name

   "-pattern" associated qualifiers
   -pformat2           string     File format
   -pname2             string     Pattern base name

   "-outfile" associated qualifiers
   -rformat3           string     Report format
   -rname3             string     Base file name
   -rextension3        string     File name extension
   -rdirectory3        string     Output directory
   -raccshow3          boolean    Show accession number in the report
   -rdesshow3          boolean    Show description in the report
   -rscoreshow3        boolean    Show the score in the report
   -rusashow3          boolean    Show the full USA in the report
   -rmaxall3           integer    Maximum total hits to report
   -rmaxseq3           integer    Maximum hits to report for one sequence

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write standard output
   -filter             boolean    Read standard input, write standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages

Input file format

preg reads any protein sequence USA.

Input files for usage example

'tsw:*_rat' is a sequence entry in the example protein database 'tsw'

Output file format

Output files for usage example

File: 100k_rat.preg

########################################
# Program: preg
# Rundate: Sat 15 Jul 2006 12:00:00
# Commandline: preg
#    -sequence "tsw:*_rat"
#    -pattern "IA[QWF]A"
# Report_format: seqtable
# Report_file: 100k_rat.preg
########################################

#=======================================
#
# Sequence: 100K_RAT     from: 1   to: 889
# HitCount: 1
#
# Pattern: IA[QWF]A
#
#=======================================

  Start     End Pattern_name Sequence
    390     393 regex1       IAQA

#---------------------------------------
#---------------------------------------

#---------------------------------------
# Total_sequences: 1
# Total_hitcount: 1
#---------------------------------------

Data files

None.

Notes

None.

References

None.

Warnings

Regular expressions are case-sensitive. The pattern 'AAAA' will not match the sequence 'aaaa'.
For this reason, both your pattern and the input sequences are converted to upper-case.

Diagnostic Error Messages

None.

Exit status

It always exits with a status of 0.

Known bugs

None.

See also

Program name     Description
antigenic        Finds antigenic sites in proteins
digest           Protein proteolytic enzyme or reagent cleavage digest
epestfind        Finds PEST motifs as potential proteolytic cleavage sites
fuzzpro          Protein pattern search
fuzztran         Protein pattern search after translation
helixturnhelix   Report nucleic acid binding motifs
oddcomp          Find protein sequence regions with a biased composition
patmatdb         Search a protein sequence with a motif
patmatmotifs     Search a PROSITE motif database with a protein sequence
pepcoil          Predicts coiled coil regions
pscan            Scans proteins using PRINTS
sigcleave        Reports protein signal cleavage sites

Other EMBOSS programs allow you to search for simple patterns and may be easier for the user who has never used regular expressions before:

  * fuzznuc - Nucleic acid pattern search
  * fuzzpro - Protein pattern search
  * fuzztran - Protein pattern search after translation

Author(s)

Peter Rice (pmr