DBACL - digramic Bayesian classifier

PURPOSE

dbacl is a command line program which can be used to categorize
several types of text documents. Each document category is
constructed as a maximum entropy language model, with respect to
a reference measure based on digrams (character pairs).

Before recognition can take place, a number of text corpora must be
"learned". For example, an English category could be based on a text
file containing the collected works of Shakespeare. The Gutenberg
project (http://promo.net/pg/) makes freely available many public
domain works in electronic form.

After learning, any number of text files can be compared, in terms of
Bayesian posterior probabilities, with up to 128 learned categories.
The actual number of categories is limited only by available memory.

dbacl is bundled with a few other utilities:

- bayesol is a postprocessor which takes the dbacl output and computes
  an optimal decision based on costs of misclassification. Together
  with dbacl, this allows the construction of sophisticated,
  multilingual, classification scripts, if you're not afraid of some
  shell scripting.

- mailcross performs email classification cross validation. It can be
  used to assess the performance of custom email classification
  scripts based on dbacl and bayesol.

- mailinspect reads an mbox style mail folder and displays the emails
  in sorted order, based on similarity to any given category.

DOCUMENTATION

See the bundled manpage. Generic instructions can be found in the
file INSTALL. A tutorial is to be found in the file tutorial.html, and
an exposition of the algorithms is in dbacl.ps.

LICENSE

DBACL is distributed under the terms of the GNU General Public License
(GPL) which can be found in the file COPYING. The hash function code
used in the file jenkins.c is public domain, by Bob Jenkins.

BUILDING

There are several configuration options you can change in the file
dbacl.h, if you want to increase the maximum number of categories or
optimize hash table overhead.

To build and install the program, you can execute the following steps
from within the source DBACL directory:

    ./configure
    make
    make install

The last part should be executed with superuser privileges for system
wide installation. Alternatively

    ./configure --prefix=/home/xyzzy
    make
    make install

builds and installs in user xyzzy's home directory, without the need
for root privileges. In this case, the following environment variables
should be set permanently (e.g. in the file .profile):

    PATH=$PATH:/home/xyzzy/bin
    MANPATH=$MANPATH:/home/xyzzy/man

INTERNATIONALIZATION

dbacl uses the current locale for processing. 8-bit clean multibyte
character sets (such as UTF-8) are supported in the default mode, and
arbitrary multibyte character sets require the -i command line option.

If you intend to use the -i option together with regular expressions,
you must build with a wide character POSIX regex library: ensure that
the BOOST library is present on the system and type

    ./configure WIDE_REGEX=1
    make
    make install

Warning: there is a large performance penalty if you build dbacl this
way, which shows up whenever you use regular expressions. Only build
this way if you need correct regular expressions in a multibyte
environment which isn't 8-bit clean.

OTHER DEPENDENCIES

The main filter programs dbacl and bayesol have no special
dependencies, and can always be compiled.

mailinspect uses the readline and slang libraries for screen
management in interactive mode. The configure script will check for
these libraries and if it can't find them, mailinspect will be
compiled without interactive support.

mailcross is a bash shell script which calls awk and formail at
various points. It will test for the existence of these programs in
your path and refuse to run if it can't find them.

RUNNING

There is a tutorial which you can read with any web browser: point it
to the file tutorial.html.

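To give a feel for the kind of decision bayesol makes from dbacl's
output (see PURPOSE), here is a minimal Python sketch of a minimum
expected cost decision. It is an illustration only, not bayesol's
actual implementation or input format: the function name
bayes_decision, the category names and all the numbers are made up
for the example.

```python
def bayes_decision(posteriors, costs):
    """Pick the category with the lowest expected misclassification cost.

    posteriors: dict mapping category -> posterior probability
    costs: dict mapping (true_category, chosen_category) -> cost;
           missing pairs (including correct decisions) cost 0.
    Both structures are hypothetical, chosen for this illustration.
    """
    expected = {}
    for chosen in posteriors:
        # Expected cost of deciding `chosen`, averaged over what the
        # true category might be.
        expected[chosen] = sum(
            prob * costs.get((true, chosen), 0.0)
            for true, prob in posteriors.items()
        )
    best = min(expected, key=expected.get)
    return best, expected


# Made-up posteriors: the document looks 70% likely to be spam.
posteriors = {"spam": 0.7, "notspam": 0.3}

# Made-up costs: discarding a good email as spam is ten times worse
# than letting a spam email through.
costs = {("notspam", "spam"): 10.0, ("spam", "notspam"): 1.0}

best, expected = bayes_decision(posteriors, costs)
print(best, expected)
```

With these numbers the cheaper decision is "notspam", even though
"spam" has the higher posterior probability, because the two kinds of
mistake are not equally costly.
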
For command line options and examples of possible use, type after
installation:

    man dbacl
    man bayesol
    man mailcross
    man mailinspect

You can also find a technical description of the algorithms and
statistics in the postscript file dbacl.ps.

TUTORIAL SAMPLES

The tutorial.html document comes with several sample text files:

- sample1.txt and sample4.txt are extracts from Mark Twain,
  Huckleberry Finn

- sample2.txt, sample3.txt and sample5.txt are extracts from Douglas
  Adams, The Hitchhiker's Guide to the Galaxy

AUTHOR

Laird A. Breyer <laird@lbreyer.com>