<p>The first column is the word of each token, the second column is its part-of-speech tag, the third column lists the surrounding part-of-speech tags within a window of size 5, and the last column is the label.</p>
<li>Template file format</li>
<p>Each line in the template file defines one template. A template has the format: <br>
<b>w0</b>%x[<b>i1</b>,<b>j1</b>]<b>w1</b>%x[<b>i2</b>,<b>j2</b>] ... <b>w(m-1)</b>%x[<b>im</b>,<b>jm</b>]<b>wm</b>%y[<b>k1</b>]%y[<b>k2</b>]...%y[0]<br>
The bold parts are customizable parameters. The first index of each x (i1,...,im) specifies the position relative to the current token, while the second index (j1,...,jm) specifies the absolute column position. w0,...,wm are arbitrary strings. The index of y specifies the label of the token at that relative position. Note the following:<br></p>
<ul>
<li>The y indexes in each template (k1,...,kn) must be arranged in ascending order, with kn = 0. Any template with kn unequal to 0 can be regularized to this form by subtracting kn from the first index of each x (i1-kn, i2-kn, ..., im-kn) and from each index of y (k1-kn, ..., kn-kn).</li>
</ul>
<p>Here is an example: <br>Training data</p>
<pre>
He        PRP  PRP VBZ DT         B
reckons   VBZ  PRP VBZ DT JJ      O
the       DT   PRP VBZ DT JJ NN   B
current   JJ   VBZ DT JJ NN NN    I
account   NN   DT JJ NN NN MD     I
deficit   NN   JJ NN NN MD VB     I   &lt;= current token
will      MD   NN NN MD VB TO     O
narrow    VB   NN MD VB TO RB     O
to        TO   MD VB TO RB #      O
only      RB   VB TO RB # CD      B
#         #    TO RB # CD CD      I
1.8       CD   RB # CD CD IN      I
billion   CD   # CD CD IN NNP     I
in        IN   CD CD IN NNP .     O
September NNP  CD IN NNP .        B
.         .    IN NNP .           O
</pre>
<pre>
templates       generated features

%x[-1,0]%y[0]   if previous word is "account", then current label is "B"
                if previous word is "account", then current label is "I"
                if previous word is "account", then current label is "O"

%x[0,1]%y[0]    if current pos is "NN", then current label is "B"
                if current pos is "NN", then current label is "I"
                if current pos is "NN", then current label is "O"

%x[0,2]%y[0]    if the surrounding part-of-speech tags contain "JJ", then current label is "B"
                if the surrounding part-of-speech tags contain "JJ", then current label is "I"
                if the surrounding part-of-speech tags contain "JJ", then current label is "O"
                if the surrounding part-of-speech tags contain "NN", then current label is "B"
                  (since "NN" appears twice in this cell, the value of this feature function is 2)
                if the surrounding part-of-speech tags contain "NN", then current label is "I"
                if the surrounding part-of-speech tags contain "NN", then current label is "O"
...
</pre>
<p>Another example:</p>
<pre>
illegal: %x[-1,0]%x[1,0]%y[0]%y[1]
legal:   %x[-2,0]%x[0,0]%y[-1]%y[0]
</pre>
<p>Here the first template is illegal because it ends with y[1]; regularizing it (subtracting 1 from every x and y index) yields the legal template below it.</p>
<li>Null features</li>
<p>In some cases a feature can be null. For example, in Chinese part-of-speech tagging, if you choose the second Chinese character of a word as a feature, then for single-character words this feature is null. In such cases, set the cell to "" (the null string).</p>
<li>Options in crf_learn command</li>
<p>Several training options can be passed to Pocket CRF. As a more complex case, type the command:<br>
<b>crf_learn -i 100 template train model</b><br>
Here "-i 100" tells Pocket CRF to train for no more than 100 iterations. By default, this parameter is 10000.<br>
All the training options are given below:</p>
<table border=1>
<tr><td>option</td><td>type</td><td>default</td><td>meaning</td></tr>
<tr><td>-h</td><td></td><td></td><td>Print help message.</td></tr>
<tr><td>-c</td><td>double</td><td>1</td><td>Gaussian smoothing factor. With a low value the CRF tends to underfit the training sample; with a high value it tends to overfit.</td></tr>
<tr><td>-f</td><td>int</td><td>0</td><td>Frequency threshold. Features occurring fewer times than the threshold are eliminated. Usually, set "-f 1" to use all features; when there are too many features to store in memory, set a higher value to eliminate rare features.</td></tr>
<tr><td>-p</td><td>int</td><td>1</td><td>Thread number for multi-threaded training.</td></tr>
<tr><td>-l</td><td>double</td><td>0</td><td>L1-norm regularizer for fast feature selection. With a higher l1 value, the CRF selects more features. If you don't need feature selection, leave l1 = 0. After feature selection, the CRF performs a normal training pass (training with a Gaussian prior).</td></tr>
<tr><td>-i</td><td>int</td><td>10000</td><td>Max iteration number.</td></tr>
<tr><td>-e</td><td>double</td><td>0.0001</td><td>Controls the training precision.</td></tr>
<tr><td>-d</td><td>int</td><td>5</td><td>Iteration depth in LBFGS. With a higher value, the CRF converges in fewer iterations at the cost of more hard-disk space.</td></tr>
<tr><td>-a</td><td>int</td><td>0</td><td>Training algorithm. 0: CRF, 1: averaged perceptron, 2: passive-aggressive algorithm.</td></tr>
<tr><td>-m</td><td>int</td><td>0</td><td>Efficiency of CRF training. 0: keep all data in memory for fast training, 1: save some data on disk to reduce the memory requirement.</td></tr>
</table>
<p>So you could try several other examples:<br>
<b>crf_learn -c 5 -l 10000 template train model</b><br>
<b>crf_learn -p 2 -e 0.0000001 template train model</b><br>
</p>
<li>Testing</li>
<p>To use Pocket CRF for testing, type the command "<b>crf_test model key result</b>". Here 3 file names must be given in order: the model file, the key file, and the result file. In this example, the 3 names are "model", "key", "result".
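<p>Returning to the template file format above, the %x macro expansion can be sketched in Python. This is an illustration only: the three-column sentence rows and the <b>expand</b> helper are invented for this sketch, not Pocket CRF code, and the %y parts are left alone because the trainer fills them in with each candidate label.</p>

```python
import re

# Toy sentence rows mirroring the training-file layout, simplified to
# three columns: word, part-of-speech tag, label.
SENT = [
    ["He",      "PRP", "B"],
    ["reckons", "VBZ", "O"],
    ["the",     "DT",  "B"],
]

def expand(template, rows, t):
    """Expand the %x[i,j] macros of one template at token position t.

    i is the position relative to the current token and j is the
    absolute column index, as described in the template-format section.
    Positions outside the sentence yield the null string "".
    """
    def repl(m):
        i, j = int(m.group(1)), int(m.group(2))
        if 0 <= t + i < len(rows):
            return rows[t + i][j]
        return ""  # null feature: the window runs past the sentence edge

    return re.sub(r"%x\[(-?\d+),(-?\d+)\]", repl, template)

print(expand("w:%x[-1,0]/%x[0,1]", SENT, 1))  # -> w:He/VBZ
print(expand("w:%x[-1,0]/%x[0,1]", SENT, 0))  # -> w:/PRP  (null feature)
```

<p>The same expansion is applied at every token position, so one template line yields one feature string per token.</p>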
The first 2 files must be prepared in advance; the last file is generated by Pocket CRF.</p>
<li>Key file format</li>
<p>The key file format is exactly the same as the training file format.</p>
<li>Result file format</li>
<p>For the simplest case "<b>crf_test model key result</b>", the result file adds one column to the key file: the label predicted by Pocket CRF.</p>
<li>Options in crf_test command</li>
<p>All the testing options are given below:</p>
<table border=1>
<tr><td>option</td><td>type</td><td>default</td><td>meaning</td></tr>
<tr><td>-h</td><td></td><td></td><td>Print help message.</td></tr>
<tr><td>-m</td><td>int</td><td>0</td><td>0 or 1; with "-m 1", the CRF calculates the marginal probability of each label.</td></tr>
<tr><td>-n</td><td>int</td><td>1</td><td>Output the n best label sequences.</td></tr>
</table>
<li>Complex result file format</li>
<p>When you use option "-m" or "-n", the format of the result file is a little more complex. Here is an example:<br>
Type the command "<b>crf_test -m 1 -n 2 model key result</b>"; the result file then looks like:</p>
<pre>
0.605596 0.0658866
Rockwell NNP B B O 0.888325 0.0162721 0.0954027
said VBD O O O 0.00208226 0.0188027 0.979115
the DT B B B 0.986093 0.00939122 0.0045155
agreement NN I I I 0.00569465 0.992155 0.0021503
calls VBZ O O O 0.00145051 0.00889529 0.989654
for IN O O O 0.00188102 4.84368e-005 0.998071
it PRP B B B 0.935978 0.00099749 0.0630243
to TO O O O 0.00216055 0.0209176 0.976922
supply VB O O O 0.0300674 0.00531862 0.964614
200 CD B B B 0.918897 0.0393702 0.0417323
additional JJ I I I 0.062074 0.861668 0.0762578
so-called JJ I I I 0.0887792 0.897094 0.0141267
shipsets NNS I I I 0.00810292 0.978231 0.0136665
for IN O O O 0.000130022 0.000924022 0.998946
the DT B B B 0.997464 0.000451147 0.00208493
planes NNS I I I 0.00289372 0.995456 0.00165012
. . O O O 0.000495323 0.00158809 0.997917
</pre>
<p>The 2 double values in the first line are the joint probabilities of the top 2 label sequences.
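<p>As an illustration of reading such a result file back, here is a small Python sketch. The <b>parse_result</b> helper is hypothetical (written for this README, not shipped with Pocket CRF) and assumes the layout shown above: one line of joint probabilities, then per-token lines ending in the n best predicted labels followed by one marginal probability per label.</p>

```python
def parse_result(text, n_best=2, n_labels=3):
    """Parse a Pocket CRF result file produced with -m 1 -n <n_best>.

    Returns the joint probabilities of the n_best label sequences and,
    per token, (key columns, predicted labels, marginal probabilities).
    """
    lines = [line.split() for line in text.strip().splitlines()]
    joint = [float(p) for p in lines[0]]      # first line: joint probs
    tokens = []
    for fields in lines[1:]:
        marginals = [float(x) for x in fields[-n_labels:]]
        labels = fields[-n_labels - n_best:-n_labels]
        key_cols = fields[:-n_labels - n_best]
        tokens.append((key_cols, labels, marginals))
    return joint, tokens

# First three lines of the example result file above.
sample = """0.605596 0.0658866
Rockwell NNP B B O 0.888325 0.0162721 0.0954027
said VBD O O O 0.00208226 0.0188027 0.979115"""

joint, tokens = parse_result(sample)
print(joint[0])       # -> 0.605596 (probability of the best sequence)
print(tokens[0][1])   # -> ['B', 'O'] (top-2 labels for "Rockwell")
```

<p>The column meanings assumed by this sketch are spelled out in the paragraph below.</p>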
From the second line to the last, the 4th and 5th columns hold the top-first and top-second label sequences, respectively, and the 6th, 7th, and 8th columns hold the marginal probabilities of the labels in alphabetic order, i.e. "B", "I", "O".</p>
</ul>
<h2><a name="reference">Reference</a></h2>
<ul>
<li>J. Lafferty, A. McCallum, and F. Pereira. <a href="http://www.cis.upenn.edu/~pereira/papers/crf.pdf">Conditional random fields: Probabilistic models for segmenting and labeling sequence data</a>. In Proc. of ICML, pp. 282-289, 2001.</li>
<li>Taku Kudo. <a href="http://sourceforge.net/projects/crfpp/">CRF++: Yet Another CRF toolkit</a>.</li>
<li>Mark Schmidt, Glenn Fung, Romer Rosales. <a href="http://pages.cs.wisc.edu/~gfung/GeneralL1/FastGeneralL1.pdf">Fast Optimization Methods for L1 Regularization: A Comparative Study and Two New Approaches</a>.</li>
</ul>
<h2><a name="todo">To do</a></h2>
<ul>
<li>High-dimensional CRF</li>
</ul>
Contact: <i>qianxian@fudan.edu.cn</i>
</body></html>