<html><title>Language models, classification and dbacl</title><body>
<h1><center>Language models, classification and dbacl</center></h1>
<center><p>Laird A. Breyer</p></center>

<h2>Introduction</h2>

<p>This is a non-mathematical tutorial on how to use the dbacl Bayesian text
classifier. The mathematical details can be read <a href="dbacl.ps">here</a>.

<p><a href="http://www.lbreyer.com/gpl.html">dbacl</a> is a UNIX command line
tool, so you will need to work at the shell prompt (here written %). The program
comes with five sample text documents and a few scripts. Look for them in the
same directory as this tutorial, or you can use any other plain text documents
instead. Make sure the sample documents you will use are in the current working
directory.

<h2>For the impatient</h2>

<p>dbacl has two major modes of operation. The first is learning mode, where one
or more text documents are analysed to find out what makes them look the way
they do. At the shell prompt, type (without the leading %)

<pre>
% dbacl -l one sample1.txt
% dbacl -l two sample2.txt
</pre>

<p>This creates two files named <i>one</i> and <i>two</i>, which contain the
important features of each sample document.

<p>The second major mode is classification mode. Let's say that you want to see
if <i>sample3.txt</i> is closer to <i>sample1.txt</i> or <i>sample2.txt</i>; type

<pre>
% dbacl -c one -c two sample3.txt -v
two
</pre>

<p>and dbacl should tell you it thinks <i>sample3.txt</i> is more like <i>two</i>
(which is the category learned from <i>sample2.txt</i>) and not as much like
<i>one</i> (which is the category learned from <i>sample1.txt</i>). That's it.

<p>You can create as many categories as you want, <i>one</i>, <i>two</i>,
<i>three</i>, <i>good</i>, <i>bad</i>, <i>important</i>, <i>jokes</i>, but
remember that each one must be learned from a representative collection of plain
text documents.

<p>dbacl is designed to be easy to use within a script, so you can make it part
of your own projects, perhaps a spam detection script, or an agent which
automatically downloads the latest newspaper articles on your favourite topic...
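<p>As a taste of how dbacl fits into a script, here is a minimal sketch of a
spam detection script built around the learning and classification commands
shown above. The category names <i>spam</i> and <i>notspam</i> and the mail
file names are made up for this illustration; only the -l, -c and -v switches
described in this tutorial are used.

<pre>
#!/bin/sh
# learn one category from each existing mail archive (plain text files)
dbacl -l spam spam_archive.txt
dbacl -l notspam good_archive.txt

# classify an incoming message; with -v, dbacl prints the best category name
best=`dbacl -c spam -c notspam message.txt -v`

if [ "$best" = "spam" ]; then
    echo "message.txt looks like spam"
else
    echo "message.txt looks like regular mail"
fi
</pre>

<p>The same pattern works for any number of categories: learn each one once,
then list them all with repeated -c switches when classifying.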
<h2>Language models</h2>

<p>dbacl works by scanning the text it learns for features, which can be nearly
anything you like. For example, unless you tell it otherwise, the standard
features are all alphabetic single words in the document. dbacl builds a
statistical model, i.e. a probability distribution, based only on those features,
so anything that is not a feature will be ignored both during learning and
during classification.

<p>This dependence on features is a double-edged sword, because it helps dbacl
focus on the things that matter (single alphabetic words by default), but if
something else matters more, it is ignored unless you tell dbacl it's a feature.
This is the hard part, and it's up to you.

<p>When telling dbacl what kind of features to look out for, you must use the
language of regular expressions. For example, if you think the only interesting
features are words which contain the letter 'q', then you would type

<pre>
% dbacl -l justq -g '^([a-zA-Z]*q[a-zA-Z]*)' \
        -g '[^a-zA-Z]([a-zA-Z]*q[a-zA-Z]*)' sample2.txt
</pre>

<p>The rule is that dbacl always takes as a feature whatever it finds within
round brackets. Reading this can be painful if you don't know regular
expressions, however.

<p>In English, the first expression after the -g option above reads: take as a
feature any string which looks like: <b>"start of the line"</b> (written ^),
followed by <b>"zero or more characters within the range a-z or A-Z"</b> (written
[a-zA-Z]*), followed by <b>"the character q"</b> (written q), followed by
<b>"zero or more characters within the range a-z or A-Z"</b> (written
[a-zA-Z]*). The second expression is nearly identical: <b>"a single character
which is not in the range a-zA-Z"</b> (written [^a-zA-Z]), followed by <b>"zero
or more characters within the range a-z or A-Z"</b> (can you guess?), followed
by <b>"the character q"</b>, followed by <b>"zero or more characters within the
range a-z or A-Z"</b>. The single quote marks are used to keep the whole
expression together.

<p>A regular expression is a simultaneous superposition of many text strings.
Just like a word, you read and write it one character at a time.

<p><table border="1" rules="all">
  <tr><td><b>Symbol</b></td><td><b>What it means</b></td></tr>
  <tr><td>.</td><td>any character except newline</td></tr>
  <tr><td>*</td><td>zero or more copies of preceding character or parenthesized expression</td></tr>
  <tr><td>+</td><td>one or more copies of preceding character or parenthesized expression</td></tr>
  <tr><td>?</td><td>zero or one copies of preceding character or parenthesized expression</td></tr>
  <tr><td>^</td><td>beginning of line</td></tr>
  <tr><td>$</td><td>end of line</td></tr>
  <tr><td>a|b</td><td>a or b</td></tr>
  <tr><td>[abc]</td><td>one character equal to a, b or c</td></tr>
  <tr><td>[^abc]</td><td>one character not equal to a, b or c</td></tr>
  <tr><td>\*, \?, or \.</td><td>the actual character *, ? or .</td></tr>
</table>

<p>To get a feel for the kinds of features taken into account by dbacl in the
example above, you can use the -D option. Retype the above in the slightly
changed form

<pre>
% dbacl -l justq -g '^([a-zA-Z]*q[a-zA-Z]*)' \
        -g '[^a-zA-Z]([a-zA-Z]*q[a-zA-Z]*)' sample2.txt -D | head -10
</pre>

<p>This lists the first few matches, one per line, which exist in the
<i>sample2.txt</i> document. Obviously, only taking into account features which
consist of words with the letter 'q' in them makes a poor model.
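<p>Since the matches are printed one per line, you can pipe the -D output
through standard UNIX tools to see which features occur most often in the
corpus. This is only a sketch building on the command above; sort, uniq and
head are ordinary shell utilities, not part of dbacl.

<pre>
% dbacl -l justq -g '^([a-zA-Z]*q[a-zA-Z]*)' \
        -g '[^a-zA-Z]([a-zA-Z]*q[a-zA-Z]*)' sample2.txt -D \
  | sort | uniq -c | sort -rn | head -5
</pre>

<p>The counts in the first column give you a quick idea of which features the
<i>justq</i> category will mostly be built on, and whether your regular
expressions are picking up what you intended.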
<p>Sometimes, it's convenient to use parentheses which you want to throw away.
dbacl understands the special notation ||xyz, which you can place at the end of
a regular expression, where x, y, z should be digits corresponding to the
parentheses you want to keep. Here is an example for mixed Japanese and English
documents, which matches alphabetic words and single ideograms:

<pre>
% LANG=ja_JP dbacl -D -l konichiwa japanese.txt -i \
  -g '(^|[^a-zA-Z0-9])([a-zA-Z0-9]+|[[:alpha:]])||2'
</pre>

<p>In the table below, you will find a list of some simple regular expressions
to get you started:

<p><table border="1" rules="all">
  <tr>
    <td><b>If you want to match...</b></td>
    <td><b>Then you need this expression...</b></td>
    <td><b>Examples</b></td>
  </tr>
  <tr>
    <td>alphabetic words</td>
    <td>(^|[^[:alpha:]])([[:alpha:]]+)||2</td>
    <td>hello, kitty</td>
  </tr>
  <tr>
    <td>words in capitals</td>
    <td>(^|[^A-Z])([A-Z]+)||2</td>
    <td>MAKE, MONEY, FAST</td>
  </tr>
  <tr>
    <td>strings of characters separated by spaces</td>
    <td>(^|[ ])([^ ]+)||2</td>
    <td>w$%&tf9(, amazing!, :-)</td>
  </tr>
  <tr>
    <td>time of day</td>
    <td>(^|[^0-9])([0-9]?[0-9]:[0-9][0-9](am|pm))||2</td>
    <td>9:17am, 12:30pm</td>
  </tr>
  <tr>
    <td>words which end in a number</td>
    <td>(^|[^a-zA-Z0-9])([a-zA-Z]+[0-9]+)[^a-zA-Z]||2</td>
    <td>borg17234, A1</td>
  </tr>
  <tr>
    <td>alphanumeric word pairs</td>
    <td>(^|[^[:alnum:]])([[:alnum:]]+)[^[:alnum:]]+([[:alnum:]]+)||23</td>
    <td>good morning, how are</td>
  </tr>
</table>

<p>The last entry in the table above shows how to take word pairs as features.
Such models are called bigram models, as opposed to the unigram models whose
features are only single words, and they are used to capture extra information.

<p>For example, in a unigram model the pair of words "well done" and "done well"
have the same probability. A bigram model can learn that "well done" is more
common in food related documents (provided this combination of words was
actually found within the learning corpus).

<p>However, there is a big statistical problem: because there exist many more
meaningful bigrams than unigrams, you'll need a much bigger corpus to obtain
meaningful statistics. One way around this is a technique called smoothing,
which predicts unseen bigrams from already seen unigrams. To obtain such a
combined unigram/bigram alphabetic word model, type

<pre>
% dbacl -l smooth -g '(^|[^a-zA-Z])([a-zA-Z]+)||2' \
        -g '(^|[^a-zA-Z])([a-zA-Z]+)[^a-zA-Z]+([a-zA-Z]+)||23' sample1.txt
</pre>

<p>If all you want are alphabetic bigrams, trigrams, etc, there is a special
switch -w you can use. The command

<pre>
% dbacl -l slick -w 2 sample1.txt
</pre>

<p>produces a model <i>slick</i> which is nearly identical to <i>smooth</i> (the
difference is that a regular expression cannot straddle newlines, but -w ngrams
can).

<p>Obviously, all this typing is getting tedious, and you will eventually want
to automate the learning stage in a shell script (a minimal sketch follows
below). Use regular expressions sparingly, as they can quickly degrade the
performance (speed and memory) of dbacl. See <a href="#appendix">Appendix A</a>
for ways around this.
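<p>Here is a minimal sketch of such an automation script. It learns one bigram
category per corpus file, naming each category after the file; the corpus file
names are made up for this illustration, and only the -l and -w switches shown
above are used.

<pre>
#!/bin/sh
# learn a combined unigram/bigram category for every .txt corpus in the directory
for corpus in *.txt; do
    category=`basename "$corpus" .txt`
    dbacl -l "$category" -w 2 "$corpus"
done
</pre>

<p>Running this in a directory containing, say, spam.txt and jokes.txt would
create the category files <i>spam</i> and <i>jokes</i>, ready to be used with
repeated -c switches.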
<h2>Evaluating the models</h2>

<p>Now that you have a grasp of the variety of language models which dbacl can
generate, the important question is: what set of features should you use?

<p>There is no easy answer to this problem. Intuitively, a larger set of
features always seems preferable, since it takes more information into account.
However, there is a tradeoff. Comparing more features requires extra memory,
but much more importantly, too many features can <i>overfit</i> the data. This
results in a model which is so good at predicting the learned documents that
virtually no other documents are considered even remotely similar.

<p>It is beyond the scope of this tutorial to describe the variety of
statistical methods which can help decide what features are meaningful.
However, to get a rough idea of the quality of the model, we can look at the
cross entropy reported by dbacl.

<p>The cross entropy is measured in bits and has the following meaning: if we
use our probabilistic model to construct an optimal compression algorithm, then
the cross entropy of a text string is the predicted number of bits which is
needed on average, after compression, for each separate feature. This rough
description isn't complete, since the cross entropy doesn't measure the amount
of space also needed for the probability model itself.

<p>To compute the cross entropy of category <i>one</i>, type

<pre>
% dbacl -c one sample1.txt -vn
cross_entropy 7.60 bits complexity 678
</pre>

<p>The cross entropy is the first value returned. The second value essentially
measures how many features describe the document. Now suppose we try other
models trained on the same document:

<pre>
% dbacl -c slick sample1.txt -vn
cross_entropy 4.74 bits complexity 677
% dbacl -c smooth sample1.txt -vn
cross_entropy 5.27 bits complexity 603
</pre>

<p>According to these estimates, both bigram models fit <i>sample1.txt</i>
better. This is easy to see for <i>slick</i>, since the complexity (essentially
the number of features) is nearly the same as for <i>one</i>. But <i>smooth</i>
looks at fewer features, and actually compresses them just slightly better in
this case. Let's ask dbacl which category fits better:

<pre>
% dbacl -c one -c slick -c smooth sample1.txt -v
smooth
</pre>

<p><b>WARNING: dbacl doesn't yet cope well with widely different feature sets.
Don't try to compare categories built on completely different feature
specifications unless you fully understand the statistical implications.</b>
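<p>As an aside, if you want to recompute the cross entropy figures above for
several categories without retyping, a small loop at the prompt will do. This
is just a sketch which repeats the -vn command for each of the category files
created earlier in this tutorial:

<pre>
% for cat in one slick smooth; do printf '%s: ' "$cat"; dbacl -c "$cat" sample1.txt -vn; done
</pre>

<p>Each line of output then starts with the category name, followed by the
cross entropy and complexity reported for <i>sample1.txt</i>.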
<h2>Decision Theory</h2>

<p>If you've read this far, then you probably intend to use dbacl to
automatically classify text documents, and possibly execute certain actions
depending on the outcome. The bad news is that dbacl isn't designed for this.
The good news is that there is a companion program, bayesol, which is. To use
it, you just need to learn some Bayesian Decision Theory.

<p>We'll suppose that the document <i>sample4.txt</i> must be classified in one
of the categories <i>one</i>, <i>two</i> and <i>three</i>. To make optimal
decisions, you'll need three ingredients: a <b>prior distribution</b>, a set of
<b>conditional probabilities</b> and a <b>measure of risk</b>. We'll get to
these in turn.

<p>The <b>prior distribution</b> is a set of weights, which you must choose
yourself, representing your beforehand beliefs. You choose this before you even
look at <i>sample4.txt</i>. For example, you might know from experience that
category <i>one</i> is twice as likely as <i>two</i> and <i>three</i>, which you
could express with the weights <i>one</i>:2, <i>two</i>:1, <i>three</i>:1. If
you have no idea what to choose, give each an equal weight (<i>one</i>:1,
<i>two</i>:1, <i>three</i>:1).

<p>Next, we need <b>conditional probabilities</b>. This is what dbacl is for.
Type

<pre>
% dbacl -c one -c two -c three sample4.txt -N
one 100.00% two  0.00% three  0.00%
</pre>

<p>As you can see, dbacl is 100% sure that <i>sample4.txt</i> resembles category
<i>one</i>. Such accurate answers are typical with the kinds of models used by
dbacl. In reality, the probabilities for <i>two</i> and <i>three</i> are very,
very small and the probability for <i>one</i> is really close to, but not equal
to, 1. See <a href="#appendix2">Appendix B</a> for a rough explanation.

<p>We combine the prior (which represents your own beliefs and experiences) with
the conditionals (which represent what dbacl thinks about <i>sample4.txt</i>) to
obtain a set of <b>posterior probabilities</b>. In our example,

<ul>
<li>Posterior probability that <i>sample4.txt</i> resembles <i>one</i>:
2*100% / (2*100% + 1*0% + 1*0%) = 100%
<li>Posterior probability that <i>sample4.txt</i> resembles <i>two</i>:
1*0% / (2*100% + 1*0% + 1*0%) = 0%
<li>Posterior probability that <i>sample4.txt</i> resembles <i>three</i>:
1*0% / (2*100% + 1*0% + 1*0%) = 0%
</ul>

Okay, so here the prior doesn't have much of an effect. But it's there if you
need it.

<p>Now comes the tedious part. What you really want to do is take these
posterior distributions under advisement, and make an informed decision.

<p>To decide which category best suits your own plans, you need to work out the
<b>costs of misclassifications</b>. Only you can decide these numbers, and there
are many. But at the end, you've worked out your risk. Here's an example (how
these costs combine with the posterior probabilities is sketched after the
list):

<ul>
<li>If <i>sample4.txt</i> is like <i>one</i> but it ends up marked like <i>one</i>, then the cost is <b>0</b>
<li>If <i>sample4.txt</i> is like <i>one</i> but it ends up marked like <i>two</i>, then the cost is <b>1</b>
<li>If <i>sample4.txt</i> is like <i>one</i> but it ends up marked like <i>three</i>, then the cost is <b>2</b>
<li>If <i>sample4.txt</i> is like <i>two</i> but it ends up marked like <i>one</i>, then the cost is <b>3</b>
<li>If <i>sample4.txt</i> is like <i>two</i> but it ends up marked like <i>two</i>, then the cost is <b>0</b>
<li>If <i>sample4.txt</i> is like <i>two</i> but it ends up marked like <i>three</i>, then the cost is <b>5</b>
<li>If <i>sample4.txt</i> is like <i>three</i> but it ends up marked like <i>one</i>, then the cost is <b>1</b>
<li>If <i>sample4.txt</i> is like <i>three</i> but it ends up marked like <i>two</i>, then the cost is <b>1</b>
<li>If <i>sample4.txt</i> is like <i>three</i> but it ends up marked like <i>three</i>, then the cost is <b>0</b>
</ul>
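<p>To see how these numbers work together, we can combine them with the
posterior probabilities by hand. The expected cost of a decision is the sum,
over the possible true categories, of the posterior probability times the cost
of that particular mistake. With the posteriors <i>one</i>:100%, <i>two</i>:0%,
<i>three</i>:0% computed earlier and the costs above:

<ul>
<li>Expected cost of marking <i>sample4.txt</i> like <i>one</i>: 100%*0 + 0%*3 + 0%*1 = 0
<li>Expected cost of marking <i>sample4.txt</i> like <i>two</i>: 100%*1 + 0%*0 + 0%*1 = 1
<li>Expected cost of marking <i>sample4.txt</i> like <i>three</i>: 100%*2 + 0%*5 + 0%*0 = 2
</ul>

<p>Marking the document like <i>one</i> carries the smallest expected cost, so
it is the optimal decision in this example. This is essentially the calculation
that bayesol is meant to automate.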
