AGREPY: PYTHON PORT OF AGREP
Version 1.2 September 2002

INTRODUCTION

These files contain a port to Python of the inexact string
matching functionality of agrep.

Agrep, written by Sun Wu and Udi Manber (described in "Fast
Text Searching Allowing Errors", CACM, 35(10), 1992), is a suite
of C functions which together perform various string matching
operations under UNIX (i.e. specified on the command line). The
most recent version is 2.04, and is available from
http://glimpse.cs.arizona.edu/software.html . The suite
contains functions for exact string matching, matching allowing
a small maximum number of insertions, deletions or substitutions,
and functions for matching patterns containing metacharacters
(i.e. regular expressions). Inexact matching of patterns involving
regular expressions is also allowed, and another function performs
matching on a file of input strings, rather than just a single
string specified on the command line. Finally, while the default
text unit is the \n (or \r) terminated line, multiline matching is
available through the specification (again on the command line) of a
"record" separator. Given an input pattern and a file, agrep prints
on standard output all records that match the pattern (so a single
match suffices to have a record printed).

AGREPY

This port takes agrep from its current, user-level setting and
makes it available as a Python module. However, in recent Python
implementations, much of the functionality described above is already
covered: exact matching is already found in the string module and
regular expression matching is in the regsub and re modules. Furthermore,
in the context of embedded functions (rather than the command line), IO and
the definition of a text string are handled by the surrounding application
and are therefore no longer relevant. Therefore, what this port implements
are solely those functions relating to inexact matching of text strings
which contain no metacharacters.
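
To make "matching with at most k errors" concrete, here is a minimal
pure-Python sketch. It is NOT the Wu-Manber algorithm used by agrep and
agrepy, and it is far slower; it is the textbook dynamic-programming
approach (usually attributed to Sellers), swapped in only because it is
short. The name approx_match_ends and its arguments are made up for this
illustration and are not part of the agrepy module.

```python
def approx_match_ends(pattern, text, max_errors):
    """Return every end index (exclusive) at which some substring of text
    matches pattern with at most max_errors insertions, deletions or
    substitutions.  Illustrative only; not part of agrepy."""
    m = len(pattern)
    # column[i] = fewest edits needed to turn pattern[:i] into some
    # substring of text that ends at the current text position.
    column = list(range(m + 1))
    ends = []
    for j, ch in enumerate(text):
        diag = column[0]  # value for (i-1, j-1); row 0 is always 0
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            best = min(diag + cost,        # match or substitution
                       column[i] + 1,      # extra character in the text
                       column[i - 1] + 1)  # pattern character missing
            diag = column[i]
            column[i] = best
        if column[m] <= max_errors:
            ends.append(j + 1)  # exclusive end, usable as a slice bound
    return ends


print(approx_match_ends("abc", "xxabcxx", 0))  # [5]
print(approx_match_ends("abc", "xxabcxx", 1))  # [4, 5, 6]
```

With one error allowed, the single occurrence of "abc" is reported at
three adjacent end positions. A routine that reports where matches lie,
rather than merely whether one exists, has to choose among such
candidates.
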
(Inexact matching of regular expressions
has been ignored because the semantics are not clear, at least to me;
concurrent matching of multiple input patterns is deferred to another day.)

On the other hand, agrepy extends agrep in the following sense:
given a pattern and a text string, agrepy returns a list of all
non-overlapping pairs of text indexes such that the start index is that of
the first text character that matches the earliest pattern character
exactly, and the end is that of the last text character that matches
exactly. The end index of each match is 1 place greater than the actual
index so it can be immediately used to construct a slice. (agrep itself is
content with recognizing that the input line contains a match,
but does not say where or differentiate multiple matches.)

Specifically, AGREPY Pythonizes two functions from file sgrep.c
of agrep version 2.04: "agrep", now called sagrep, which deals with
"short" pattern strings (the limit is settable by a header constant,
currently 24), and "a_monkey", now called lagrep, which deals with longer
strings. Each of these is supported by functions which set up the data
structures used during matching. SWIG, http://www.swig.org/, is used to
create the interface module. As far as possible, the original code of agrep
has been preserved, except where required by the new circumstances (e.g.
to support determination of match end-points) or forced by SWIG
(e.g. move from K&R headers to ANSI prototypes, elimination of
function-like C preprocessor macros). I also indulged in a little
tidying up of the code to make it easier to maintain.

MODULE CONTENTS

agrepy contains two functions:

compile(<pattern string>, <pattern length>, <max number of errors>)

Must be called for each new pattern or altered maximum number of errors.
The pattern must be no longer than 256 characters and the specified
maximum number of errors must be between 1 and 8. (You are far better off
using one of the conventional string search functions such as `find' if
you are expecting 0 errors!)
compile returns a C object which is passed to agrepy.

agrepy(<pattern string>, <pattern length>, <text string>, <text length>,
       <Go-To-Ends>, <pattern compilation object>)

Performs inexact matching on the text string, returning a list
(possibly empty) of pairs of match end-points. The fifth argument,
<Go-To-Ends>, is an integer (0 or 1) interpreted as a boolean which, when
true, forces matching to extend to the end of the pattern, even if it has
been satisfied already given the number of errors permitted. This gives a
better indication of the scope of the match in the text string.

MANIFEST

Makefile
README
agrep.c             Main program
lagrep.c            Long patterns matching and compilation functions
sagrep.c            Short patterns matching and compilation functions
agrep.h
agrep.i             SWIG input file (generates agrep_wrap.c)
agrep_wrap.c        Shouldn't need to regenerate this
runtests            A shell test application
testagrepy.py       Called from runtests, runs various tests
xxxy1, out1         Data file for test program and sample output
xxxy2, out2         Data file for test program and sample output
xxxy3, out3         Data file for test program and sample output
errs.errs1 out.errs1                Error check output files
long_pattern errs.errs2 out.errs2   Error check output files
long_string out.long                Very long test string
COPYRIGHT.agrep     Copyright notice for original agrep implementation

COPYRIGHT

For the copyright on the original algorithms see the file COPYRIGHT.agrep.
Other portions have been written by Michael J. Wise and are copyright
under the terms of Open Source Definition
(http://www.opensource.org/osd.html).

Systems Tested

PC/Linux
SGI/IRIX

Version History

Version 1.0  July 1999
    First posting

Version 1.1  September 1999
    The original version had an end-of-text bug (sagrepy.c), and both
    versions had problems finding the correct ends of a match. There also
    can be a genuine ambiguity. The methodology now is that sagrepy/lagrepy
    find the ends of matches fairly accurately, and a separate, recursive
    function firms up the end position and finds the start position.
    Todo
    Faced with small, more or less evenly distributed alphabets, e.g.
    DNA, lagrepy's two-character hashing is not particularly effective
    when the number of mismatches is above 1. It works, but the efficiency
    suffers. The solution employed by agrep is to treat DNA as a special
    case after first traversing the pattern to ascertain that it just
    contains members of the alphabet a, c, t, g or n. A better solution
    would be to recognize that the pattern is from a limited alphabet and
    change the hashing to cover a larger number of characters (currently
    just 2).

Version 1.2  Sep 2002
    A number of small changes, most visibly to the arguments to agrepy.
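
As a footnote to the Todo item under Version 1.1, the pattern check it
describes, and the reason two-character keys are weak on a small alphabet,
can be sketched in a few lines of Python. This is only an illustration:
is_dna_pattern and distinct_keys are hypothetical names, folding upper
case to lower case is an assumption, and nothing here reflects how agrep
or lagrepy actually build their hash tables.

```python
DNA_ALPHABET = frozenset("acgtn")


def is_dna_pattern(pattern):
    """True if pattern is non-empty and uses only a, c, g, t or n.
    Folding upper case to lower case is an assumption of this sketch."""
    return bool(pattern) and set(pattern.lower()) <= DNA_ALPHABET


def distinct_keys(alphabet_size, chars_per_key):
    """Upper bound on the number of distinct hash keys that can be formed
    from chars_per_key consecutive characters."""
    return alphabet_size ** chars_per_key


print(is_dna_pattern("gattaca"))   # True
print(is_dna_pattern("gattaca!"))  # False
print(distinct_keys(5, 2))         # 25
print(distinct_keys(5, 4))         # 625
```

With only five symbols, a two-character key can take at most 25 values,
so many text positions share a key; widening the key to four characters
raises the bound to 625.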