These numbers are often placed in a table called the loss matrix (this
way, you can't forget a case), like so:
<p>
<table border="1" rules="all">
  <tr>
    <td rowspan="2"><b>correct category</b></td>
    <td colspan="3"><b>misclassified as</b></td>
  </tr>
  <tr>
    <td><i>one</i></td>
    <td><i>two</i></td>
    <td><i>three</i></td>
  </tr>
  <tr>
    <td><i>one</i></td>
    <td><b>0</b></td>
    <td><b>1</b></td>
    <td><b>2</b></td>
  </tr>
  <tr>
    <td><i>two</i></td>
    <td><b>3</b></td>
    <td><b>0</b></td>
    <td><b>5</b></td>
  </tr>
  <tr>
    <td><i>three</i></td>
    <td><b>1</b></td>
    <td><b>1</b></td>
    <td><b>0</b></td>
  </tr>
</table>
<p>
We are now ready to combine all these numbers to obtain the True Bayesian Decision.
For every possible category, we simply weigh the risk with the posterior
probabilities of obtaining each of the possible misclassifications. Then we choose
the category with the least expected posterior risk.
<p>
<ul>
<li>For category <i>one</i>, the expected risk is <b>0</b>*100% + <b>3</b>*0% + <b>1</b>*0% = <b>0</b> &lt;-- smallest
<li>For category <i>two</i>, the expected risk is <b>1</b>*100% + <b>0</b>*0% + <b>1</b>*0% = <b>1</b>
<li>For category <i>three</i>, the expected risk is <b>2</b>*100% + <b>5</b>*0% + <b>0</b>*0% = <b>2</b>
</ul>
<p>
The lowest expected risk is for category <i>one</i>, so that's the category we choose
to represent <i>sample4.txt</i>. Done!
<p>
Of course, the loss matrix above doesn't really have an effect on the
probability calculations, because the conditional probabilities strongly point to
category <i>one</i> anyway. But now you understand how the calculation works.
Below, we'll look at a more realistic example.
<p>
One last point: you may wonder how dbacl itself decides which category to
display when classifying with the -v switch. The simple answer is that dbacl always
displays the category with maximal conditional probability (often called the MAP
estimate). This is mathematically equivalent to the special case of decision theory
where the prior has equal weights and the loss matrix takes the value 1 everywhere
except on the diagonal (i.e. correct classifications have no cost, everything else
costs 1).
<h2>Using bayesol</h2>
<p>
bayesol is a companion program for dbacl which makes the decision calculations
easier. The bad news is that you still have to write down a prior and loss matrix
yourself. Eventually, someone, somewhere may write a graphical interface.
<p>
bayesol reads a risk specification file, which is a text file containing
information about the categories required, the prior distribution and the cost of
misclassifications. For the toy example discussed earlier, the file <i>toy.risk</i>
looks like this:
<pre>
categories {
    one, two, three
}
prior {
    2, 1, 1
}
loss_matrix {
"" one   [ 0, 1, 2 ]
"" two   [ 3, 0, 5 ]
"" three [ 1, 1, 0 ]
}
</pre>
<p>
Let's see if our calculation was correct:
<pre>
% dbacl -c one -c two -c three sample4.txt -vna | bayesol -c toy.risk -v
one
</pre>
<p>
Good! However, as discussed above, the misclassification costs need
improvement. This is completely up to you, but here are some possible
suggestions to get you started.
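<p>
Before moving on, it may help to see the decision rule spelled out in code. The
short Perl sketch below is purely illustrative (it is not part of dbacl or
bayesol); it hard-codes the posterior probabilities reported above and the toy
loss matrix, and prints the expected posterior risk of choosing each category.
<pre>
#!/usr/bin/perl
# risk_demo.pl - illustration only, not part of dbacl or bayesol.
# Reproduces the toy expected risk calculation by hand.
use strict; use warnings;

my @cat = ("one", "two", "three");
my %posterior = (one => 1.00, two => 0.00, three => 0.00);  # from dbacl -vna
my %loss = (       # $loss{correct}{chosen}, copied from the loss matrix above
    one   => { one => 0, two => 1, three => 2 },
    two   => { one => 3, two => 0, three => 5 },
    three => { one => 1, two => 1, three => 0 },
);

foreach my $choice (@cat) {
    my $risk = 0;
    $risk += $posterior{$_} * $loss{$_}{$choice} foreach @cat;
    printf "choosing %-5s  expected risk %.2f\n", $choice, $risk;
}
# The category with the smallest expected risk (here "one") is the answer.
</pre>
<p>
Running it prints an expected risk of 0, 1 and 2 for <i>one</i>, <i>two</i> and
<i>three</i> respectively, matching the hand calculation above; bayesol automates
this kind of weighing for you, with the prior folded in.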
<p>
To devise effective loss matrices, it pays to think about the way that dbacl
computes the probabilities. <a href="#appendix2">Appendix B</a> gives some
details, but we don't need to go that far.
Recall that the language models are based on features (which are usually kinds of
words). Every feature counts towards the final probabilities, and a big document
will have more features, hence more opportunities to steer the
probabilities one way or another. So a feature is like an information
bearing unit of text.
<p>
When we read a text document which doesn't accord with our expectations, we
grow progressively more annoyed as we read further into the text. This is like
an annoyance interest rate which compounds on information units within the text.
For dbacl, the number of information bearing units is reported as the complexity
of the text. This suggests that the cost of reading a misclassified document could
have the form (1 + interest)^complexity. Here's an example loss matrix which uses
this idea:
<pre>
loss_matrix {
"" one   [ 0,               (1.1)^complexity,  (1.1)^complexity ]
"" two   [(1.1)^complexity, 0,                 (1.7)^complexity ]
"" three [(1.5)^complexity, (1.01)^complexity, 0 ]
}
</pre>
<p>
Remember, these aren't monetary interest rates, they are value judgements.
You can see this loss matrix in action by typing
<pre>
% dbacl -c one -c two -c three sample5.txt -vna | bayesol -c example1.risk -v
three
</pre>
<p>
Now if we increase the cost of misclassifying <i>two</i> as <i>three</i> from
1.7 to 2.0, the optimal category becomes
<pre>
% dbacl -c one -c two -c three sample5.txt -vna | bayesol -c example2.risk -v
two
</pre>
<p>
bayesol can also handle infinite costs. Just write "inf" where you need it.
This is particularly useful with regular expressions. If you look at each
row of loss_matrix above, you see an empty string "" before each category.
This indicates that this row is to be used by default in the actual loss matrix.
But sometimes, the losses can depend on seeing a particular string in the
document we want to classify.
<p>
Suppose you normally like to use the loss matrix above, but if the document
contains the word "Trillian", then the cost of misclassification should be
infinite. Here is an updated loss_matrix:
<pre>
loss_matrix {
""          one   [ 0,               (1.1)^complexity,  (1.1)^complexity ]
"Trillian"  two   [ inf,             0,                 inf ]
""          two   [(1.1)^complexity, 0,                 (2.0)^complexity ]
""          three [(1.5)^complexity, (1.01)^complexity, 0 ]
}
</pre>
<p>
bayesol looks in its input for the regular expression "Trillian", and if it
is found, then for misclassifications away from <i>two</i>,
it uses the row with the infinite values, otherwise it uses the default
row, which starts with "". If you have several rows with regular expressions,
bayesol always uses the first one from the top which matches within the input.
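<p>
That selection rule is easy to mimic. The fragment below is an illustration only
(the rows are hypothetical and this is not bayesol's actual code); for a single
category it scans the rows top to bottom and keeps the first one whose pattern is
found in the input, with the empty pattern acting as the catch-all default.
<pre>
#!/usr/bin/perl
# pickrow.pl - illustration of the "first matching row wins" rule, not bayesol code.
# Usage (illustrative): perl pickrow.pl &lt; sample5.txt
use strict; use warnings;

my $document = join "", &lt;STDIN&gt;;

# Hypothetical rows for category "two", listed top to bottom as in loss_matrix.
my @rows_for_two = (
    { pattern => "Trillian", losses => [ "inf", "0", "inf" ] },
    { pattern => "",         losses => [ "(1.1)^complexity", "0", "(2.0)^complexity" ] },
);

my $chosen;
foreach my $row (@rows_for_two) {
    # an empty pattern always matches; otherwise require the regexp in the input
    if ($row->{pattern} eq "" || $document =~ /$row->{pattern}/) {
        $chosen = $row;
        last;                     # first match from the top wins
    }
}
printf "row used for category two: [ %s ]\n", join(", ", @{ $chosen->{losses} });
</pre>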
<p>
The regular expression facility can also be used to perform more complicated
document dependent loss calculations. Suppose you like to count the number
of lines of the input document which start with the character '>', as a
proportion of the total number of lines in the document.
The following perl script transcribes its input and appends the calculated
proportion.
<pre>
#!/usr/bin/perl
# this is file prop.pl
$special = $normal = 0;
while(&lt;STDIN&gt;) {
    $special++ if /^ >/;
    $normal++;
    print;
}
$prop = $special/$normal;
print "proportion: $prop\n";
</pre>
<p>
If we used this script, then we could take the output of dbacl, append the
proportion of lines containing '>', and pass the result as input to bayesol.
For example, the following line is included in the <i>example2.risk</i>
specification
<pre>
"^proportion: ([0-9.]+)" one [ 0, (1+$1)^complexity, (1.2)^complexity ]
</pre>
<p>
and through this, bayesol reads, if present, the line containing the proportion
we calculated and takes it into account when it constructs the loss matrix.
You can try this like so:
<pre>
% dbacl -T email -c one -c two -c three sample6.txt -nav \
  | perl prop.pl | bayesol -c example2.risk -v
</pre>
<p>
Note that in the loss_matrix specification above, $1 refers to the <i>numerical</i>
value of the quantity inside the parentheses. Also, it is useful to remember that
when using the -a switch, dbacl outputs all the original lines
from <i>unknown.txt</i> with an extra space in front of them. If another
instance of dbacl needs to read this output again (e.g. in a pipeline),
then the latter should be invoked with the -A switch.
<h2>Miscellaneous</h2>
<p>
Be careful when classifying very small strings.
Except for the multinomial models (which include the default model),
the dbacl calculations are optimized for large strings
with more than 20 or 30 features.
For small text lines, the complex models give only approximate scores.
In those cases, stick with unigram models, which are always exact.
<p>
In the UNIX philosophy, programs are small and do one thing well. Following this
philosophy, dbacl essentially only reads plain text documents. If you have
non-textual documents (word, html, postscript) which you want to learn from, you
will need to use specialized tools to first convert these into plain text. There
are many free tools available for this.
<p>
dbacl has limited support for reading mbox files (UNIX email) and can filter out
html tags in a quick and dirty way; however, this is only intended as a
convenience, and should not be relied upon to be fully accurate.
<h2><a name="appendix">Appendix A: memory requirements</a></h2>
<p>
When experimenting with complicated models, dbacl will quickly fill up its hash
tables. dbacl is designed to use a predictable amount of memory (to prevent nasty
surprises on some systems). The default hash table size in version 1.1 is 15,
which is enough for 32,000 unique features and produces a 512K category file on
my system. You can use the -h switch to select the hash table size, in powers of
two. Beware that learning takes much more memory than classifying. Use the -V
switch to find out the cost per feature. On my system, each feature costs 6 bytes
for classifying but 17 bytes for learning.
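<p>
Those figures make the arithmetic easy to do yourself: a table of size -h <i>n</i>
has room for roughly 2^<i>n</i> features, and multiplying by the per-feature cost
gives a ballpark memory figure. The snippet below is an illustration only; the 17
bytes per feature learning cost is simply the value quoted above for my system and
will differ on yours.
<pre>
#!/usr/bin/perl
# hashsize.pl - back of the envelope memory estimate, illustration only.
# Assumes the 17 bytes per feature learning cost quoted above (system dependent).
use strict; use warnings;

my $bytes_per_feature = 17;
foreach my $h (15, 16, 20, 22) {
    my $slots = 2 ** $h;
    printf "-h %-2d  up to %9d features  about %6.1f MB while learning\n",
           $h, $slots, $slots * $bytes_per_feature / (1024 * 1024);
}
</pre>
<p>
For instance, -h 15 gives about 32,000 feature slots and roughly half a megabyte
while learning, which is consistent with the default figures quoted above.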
<p>
For testing, I use the collected works of Mark Twain, which is a 19MB pure text
file. Timings are on a 500MHz Pentium III.
<p>
<table border="1" rules="all">
  <tr>
    <td><b>command</b></td>
    <td><b>Unique features</b></td>
    <td><b>Category size</b></td>
    <td><b>Learning time</b></td>
  </tr>
  <tr>
    <td>dbacl -l twain1 Twain-Collected_Works.txt -w 1 -h 16</td>
    <td align="right">49,251</td>
    <td align="right">512K</td>
    <td>0m9.240s</td>
  </tr>
  <tr>
    <td>dbacl -l twain2 Twain-Collected_Works.txt -w 2 -h 20</td>
    <td align="right">909,400</td>
    <td align="right">6.1M</td>
    <td>1m1.100s</td>
  </tr>
  <tr>
    <td>dbacl -l twain3 Twain-Collected_Works.txt -w 3 -h 22</td>
    <td align="right">3,151,718</td>
    <td align="right">24M</td>
    <td>3m42.240s</td>
  </tr>
</table>
<p>
As can be seen from this table, including bigrams and trigrams has a noticeable
memory and performance effect during learning. Luckily, classification speed
is only affected by the number of features found in the unknown document.
<p>
<table border="1" rules="all">
  <tr>
    <td><b>command</b></td>
    <td><b>features</b></td>
    <td><b>Classification time</b></td>
  </tr>
  <tr>
    <td>dbacl -c twain1 Twain-Collected_Works.txt</td>
    <td>unigrams</td>
    <td align="right">0m4.860s</td>
  </tr>
  <tr>
    <td>dbacl -c twain2 Twain-Collected_Works.txt</td>
    <td>unigrams and bigrams</td>
    <td align="right">0m8.930s</td>
  </tr>
  <tr>
    <td>dbacl -c twain3 Twain-Collected_Works.txt</td>
    <td>unigrams, bigrams and trigrams</td>
    <td align="right">0m12.750s</td>
  </tr>
</table>
<p>
The heavy memory requirements during learning of complicated models can be
reduced at the expense of the model itself. dbacl has a feature decimation switch
which slows down the hash table filling rate by simply ignoring many of the
features found in the input.
<h2><a name="appendix2">Appendix B: Extreme probabilities</a></h2>
<p>
Why is the result of a dbacl probability calculation nearly always so extreme?
<pre>
% dbacl -c one -c two -c three sample4.txt -N
one 100.00% two  0.00% three  0.00%
</pre>
<p>
The reason for this has to do with the type of model which dbacl uses. Let's
look at some scores:
<pre>
% dbacl -c one -c two -c three sample4.txt -n
one 9465.93 two 10252.89 three 10198.90
% dbacl -c one -c two -c three sample4.txt -nv
one 14.70 * 644 two 15.92 * 644 three 15.84 * 644
</pre>
<p>
The first set of numbers are minus the logarithm (base 2) of each category's
probability of producing the full document sample4.txt. This represents the
evidence away from each category, and is measured in bits.
<i>two</i> and <i>three</i> are fairly even, but <i>one</i> has by far
the lowest score and hence the highest probability (in other words, the model
for <i>one</i> is the least bad at predicting <i>sample4.txt</i>, so if there are
only three possible choices, it's the best). To understand these numbers, it's
best to split each of them up into a product of cross entropy (base 2)
and complexity, as is done in the second line.
<p>
Remember that dbacl calculates probabilities about resemblance
by weighing the evidence for all the features found in the input document.
There are 644 features in <i>sample4.txt</i>, and each feature contributes on
average 14.70 bits of evidence against category <i>one</i>, 15.92 bits against
category <i>two</i> and 15.84 bits against category <i>three</i>.
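<p>
The decomposition is just a product, so it is easy to check by hand. The snippet
below is only an illustration; it multiplies the per-feature figures from the -nv
output by the complexity and recovers, up to rounding, the -n scores shown above.
<pre>
#!/usr/bin/perl
# decompose.pl - illustration only: score = cross entropy * complexity.
use strict; use warnings;

my %cross_entropy = (one => 14.70, two => 15.92, three => 15.84);  # from -nv above
my $complexity = 644;                                              # features in sample4.txt

foreach my $cat ("one", "two", "three") {
    printf "%-5s %9.2f bits\n", $cat, $cross_entropy{$cat} * $complexity;
}
</pre>
<p>
The products come out to roughly 9467, 10252 and 10201 bits, in agreement with the
-n output; the small differences between the per-feature averages are what add up,
feature by feature, to the large gap between the total scores.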
<p>
Let's look at what happens if we only look at the first 25 lines of <i>sample4.txt</i>:
<pre>
% head -25 sample4.txt | dbacl -c one -c two -c three -nv
one 13.44 * 107 two 14.83 * 107 three 14.70 * 107
</pre>
<p>
There are fewer features in the first 25 lines of <i>sample4.txt</i> than in the full
text file, but the picture is substantially unchanged.
<pre>
% head -25 sample4.txt | dbacl -c one -c two -c three -N
one 100.00% two  0.00% three  0.00%
</pre>
<p>
dbacl is still very sure, because it has looked at many features and found
small differences which add up to quite different scores.
Now let's look at only the first two lines of <i>sample4.txt</i>:
<pre>
% head -2 sample4.txt | dbacl -c one -c two -c three -N
one 99.93% two  0.00% three  0.07%
% head -2 sample4.txt | dbacl -c one -c two -c three -nv
one 11.96 * 8 two 15.86 * 8 three 13.27 * 8
</pre>
<p>
Now there are only eight features to look at, and dbacl is getting unsure.
In this example, the features are sufficiently different that dbacl is only
slightly unsure, but even so, eight words (in this model a feature is a word)
is not much to go on.
<p>
So the interpretation of the probabilities is clear. dbacl weighs the
evidence from each feature it finds, and reports the best fit among the choices
it is offered. Whether these features are the right features to look at for best
classification is another matter entirely, and it's entirely up to you to decide.
</body></html>
