<p>The first column is the word of each token, the second column is its part-of-speech tag, the third column lists the surrounding part-of-speech tags within a window of size 5, and the last column is the label.</p>
<li>Template file format</li>
<p>Each line in the template file defines one template. A template has the format: <br>
<b>w0</b>%x[<b>i1</b>,<b>j1</b>]<b>w1</b>%x[<b>i2</b>,<b>j2</b>] ... <b>w(m-1)</b>%x[<b>im</b>,<b>jm</b>]<b>wm</b>%y[<b>k1</b>]%y[<b>k2</b>]...%y[0]<br>
The bold parts are customizable parameters. The first index of each x (i1,...,im) specifies the position relative to the current token, while the second index (j1,...,jm) specifies the absolute column position. w0,...,wm are arbitrary strings. The index of y specifies the label of the token at that relative position. Note the following:<br></p>
<ul>
<li>The y indexes in each template (k1,...,kn) must be arranged in ascending order, with kn = 0. Any template with kn unequal to 0 can be regularized to this form by subtracting kn from the first index of each x (i1-kn, i2-kn, ..., im-kn) and from each index of y (k1-kn, ..., kn-kn).</li>
</ul>
<p>Here is an example: <br>Training data</p>
<pre>
He        PRP  PRP VBZ DT         B
reckons   VBZ  PRP VBZ DT JJ      O
the       DT   PRP VBZ DT JJ NN   B
current   JJ   VBZ DT JJ NN NN    I
account   NN   DT JJ NN NN MD     I
deficit   NN   JJ NN NN MD VB     I   &lt;= current token
will      MD   NN NN MD VB TO     O
narrow    VB   NN MD VB TO RB     O
to        TO   MD VB TO RB #      O
only      RB   VB TO RB # CD      B
#         #    TO RB # CD CD      I
1.8       CD   RB # CD CD IN      I
billion   CD   # CD CD IN NNP     I
in        IN   CD CD IN NNP .     O
September NNP  CD IN NNP .        B
.         .    IN NNP .           O
</pre>
<pre>
templates       generated features

%x[-1,0]%y[0]   if previous word is "account", then current label is "B"
                if previous word is "account", then current label is "I"
                if previous word is "account", then current label is "O"

%x[0,1]%y[0]    if current pos is "NN", then current label is "B"
                if current pos is "NN", then current label is "I"
                if current pos is "NN", then current label is "O"

%x[0,2]%y[0]    if the surrounding part-of-speech tags contain "JJ", then current label is "B"
                if the surrounding part-of-speech tags contain "JJ", then current label is "I"
                if the surrounding part-of-speech tags contain "JJ", then current label is "O"
                if the surrounding part-of-speech tags contain "NN", then current label is "B"
                  (since "NN" appears twice in this cell, the value of this feature function is 2)
                if the surrounding part-of-speech tags contain "NN", then current label is "I"
                if the surrounding part-of-speech tags contain "NN", then current label is "O"
...
</pre>
<p>Another example:</p>
<pre>
illegal: %x[-1,0]%x[1,0]%y[0]%y[1]
legal:   %x[-2,0]%x[0,0]%y[-1]%y[0]
</pre>
<p>Here the first template is illegal because it ends with y[1]; regularizing it (subtracting 1 from every x and y index) yields the legal template below it.</p>
<li>Null features</li>
<p>In some cases a feature can be null. For example, in Chinese part-of-speech tagging, if you choose the second Chinese character of a word as a feature, then for single-character words this feature is null. In such cases, set the cell to "" (the null string).</p>
<li>Options in crf_learn command</li>
<p>Several training options can be passed to Pocket CRF. As a more complex case, type the command:<br>
<b>crf_learn -i 100 template train model</b><br>
Here "-i 100" tells Pocket CRF to train for no more than 100 iterations. By default, this parameter is 10000.<br>
All the training options are given below:</p>
<table border=1>
<tr><td>option</td><td>type</td><td>default</td><td>meaning</td></tr>
<tr><td>-h</td><td></td><td></td><td>Print help message.</td></tr>
<tr><td>-c</td><td>double</td><td>1</td><td>Gaussian smoothing factor. With a low value the CRF tends to underfit the training sample; with a high value it tends to overfit.</td></tr>
<tr><td>-f</td><td>int</td><td>0</td><td>Frequency threshold. Features occurring fewer times than the threshold are eliminated. Usually, set "-f 1" to use all features; when there are too many features to store in memory, set a higher value to eliminate rare features.</td></tr>
<tr><td>-p</td><td>int</td><td>1</td><td>Thread number for multi-threaded training.</td></tr>
<tr><td>-l</td><td>double</td><td>0</td><td>L1-norm regularizer for fast feature selection. With a higher l1 value, the CRF selects more features. If you don't need feature selection, leave l1 = 0. After feature selection, the CRF performs a normal training pass (training with a Gaussian prior).</td></tr>
<tr><td>-i</td><td>int</td><td>10000</td><td>Max iteration number.</td></tr>
<tr><td>-e</td><td>double</td><td>0.0001</td><td>Controls the training precision.</td></tr>
<tr><td>-d</td><td>int</td><td>5</td><td>Iteration depth in LBFGS. With a higher value, the CRF converges in fewer iterations at the cost of more hard-disk space.</td></tr>
<tr><td>-a</td><td>int</td><td>0</td><td>Training algorithm. 0: CRF, 1: averaged perceptron, 2: passive-aggressive algorithm.</td></tr>
<tr><td>-m</td><td>int</td><td>0</td><td>Efficiency of CRF training. 0: keep all data in memory for fast training, 1: save some data on disk to reduce the memory requirement.</td></tr>
</table>
<p>So you could try several other examples:<br>
<b>crf_learn -c 5 -l 10000 template train model</b><br>
<b>crf_learn -p 2 -e 0.0000001 template train model</b><br>
</p>
<li>Testing</li>
<p>To use Pocket CRF for testing, type the command "<b>crf_test model key result</b>". Here 3 file names must be given in order: the model file, the key file, and the result file. In this example, the 3 names are "model", "key", "result".
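<p>Returning to the template file format above, the %x macro expansion can be sketched in Python. This is an illustration only: the three-column sentence rows and the <b>expand</b> helper are invented for this sketch, not Pocket CRF code, and the %y parts are left alone because the trainer fills them in with each candidate label.</p>

```python
import re

# Toy sentence rows mirroring the training-file layout, simplified to
# three columns: word, part-of-speech tag, label.
SENT = [
    ["He",      "PRP", "B"],
    ["reckons", "VBZ", "O"],
    ["the",     "DT",  "B"],
]

def expand(template, rows, t):
    """Expand the %x[i,j] macros of one template at token position t.

    i is the position relative to the current token and j is the
    absolute column index, as described in the template-format section.
    Positions outside the sentence yield the null string "".
    """
    def repl(m):
        i, j = int(m.group(1)), int(m.group(2))
        if 0 <= t + i < len(rows):
            return rows[t + i][j]
        return ""  # null feature: the window runs past the sentence edge

    return re.sub(r"%x\[(-?\d+),(-?\d+)\]", repl, template)

print(expand("w:%x[-1,0]/%x[0,1]", SENT, 1))  # -> w:He/VBZ
print(expand("w:%x[-1,0]/%x[0,1]", SENT, 0))  # -> w:/PRP  (null feature)
```

<p>The same expansion is applied at every token position, so one template line yields one feature string per token.</p>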
The first 2 files must be prepared in advance; the last file is generated by Pocket CRF.</p>
<li>Key file format</li>
<p>The key file format is exactly the same as the training file format.</p>
<li>Result file format</li>
<p>For the simplest case "<b>crf_test model key result</b>", the result file adds one column to the key file: the label predicted by Pocket CRF.</p>
<li>Options in crf_test command</li>
<p>All the testing options are given below:</p>
<table border=1>
<tr><td>option</td><td>type</td><td>default</td><td>meaning</td></tr>
<tr><td>-h</td><td></td><td></td><td>Print help message.</td></tr>
<tr><td>-m</td><td>int</td><td>0</td><td>0 or 1; with "-m 1", the CRF calculates the marginal probability of each label.</td></tr>
<tr><td>-n</td><td>int</td><td>1</td><td>Output the n best label sequences.</td></tr>
</table>
<li>Complex result file format</li>
<p>When you use option "-m" or "-n", the format of the result file is a little more complex. Here is an example:<br>
Type the command "<b>crf_test -m 1 -n 2 model key result</b>"; the result file then looks like:</p>
<pre>
0.605596 0.0658866
Rockwell NNP B B O 0.888325 0.0162721 0.0954027
said VBD O O O 0.00208226 0.0188027 0.979115
the DT B B B 0.986093 0.00939122 0.0045155
agreement NN I I I 0.00569465 0.992155 0.0021503
calls VBZ O O O 0.00145051 0.00889529 0.989654
for IN O O O 0.00188102 4.84368e-005 0.998071
it PRP B B B 0.935978 0.00099749 0.0630243
to TO O O O 0.00216055 0.0209176 0.976922
supply VB O O O 0.0300674 0.00531862 0.964614
200 CD B B B 0.918897 0.0393702 0.0417323
additional JJ I I I 0.062074 0.861668 0.0762578
so-called JJ I I I 0.0887792 0.897094 0.0141267
shipsets NNS I I I 0.00810292 0.978231 0.0136665
for IN O O O 0.000130022 0.000924022 0.998946
the DT B B B 0.997464 0.000451147 0.00208493
planes NNS I I I 0.00289372 0.995456 0.00165012
. . O O O 0.000495323 0.00158809 0.997917
</pre>
<p>The 2 double values in the first line are the joint probabilities of the top 2 label sequences.
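<p>As an illustration of reading such a result file back, here is a small Python sketch. The <b>parse_result</b> helper is hypothetical (written for this README, not shipped with Pocket CRF) and assumes the layout shown above: one line of joint probabilities, then per-token lines ending in the n best predicted labels followed by one marginal probability per label.</p>

```python
def parse_result(text, n_best=2, n_labels=3):
    """Parse a Pocket CRF result file produced with -m 1 -n <n_best>.

    Returns the joint probabilities of the n_best label sequences and,
    per token, (key columns, predicted labels, marginal probabilities).
    """
    lines = [line.split() for line in text.strip().splitlines()]
    joint = [float(p) for p in lines[0]]      # first line: joint probs
    tokens = []
    for fields in lines[1:]:
        marginals = [float(x) for x in fields[-n_labels:]]
        labels = fields[-n_labels - n_best:-n_labels]
        key_cols = fields[:-n_labels - n_best]
        tokens.append((key_cols, labels, marginals))
    return joint, tokens

# First three lines of the example result file above.
sample = """0.605596 0.0658866
Rockwell NNP B B O 0.888325 0.0162721 0.0954027
said VBD O O O 0.00208226 0.0188027 0.979115"""

joint, tokens = parse_result(sample)
print(joint[0])       # -> 0.605596 (probability of the best sequence)
print(tokens[0][1])   # -> ['B', 'O'] (top-2 labels for "Rockwell")
```

<p>The column meanings assumed by this sketch are spelled out in the paragraph below.</p>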
From the second line to the last, the 4th and 5th columns hold the top-first and top-second label sequences, respectively, and the 6th, 7th, and 8th columns hold the marginal probabilities of the labels in alphabetic order, i.e. "B", "I", "O".</p>
</ul>
<h2><a name="reference">Reference</a></h2>
<ul>
<li>J. Lafferty, A. McCallum, and F. Pereira. <a href="http://www.cis.upenn.edu/~pereira/papers/crf.pdf">Conditional random fields: Probabilistic models for segmenting and labeling sequence data</a>. In Proc. of ICML, pp. 282-289, 2001.</li>
<li>Taku Kudo. <a href="http://sourceforge.net/projects/crfpp/">CRF++: Yet Another CRF toolkit</a>.</li>
<li>Mark Schmidt, Glenn Fung, Romer Rosales. <a href="http://pages.cs.wisc.edu/~gfung/GeneralL1/FastGeneralL1.pdf">Fast Optimization Methods for L1 Regularization: A Comparative Study and Two New Approaches</a>.</li>
</ul>
<h2><a name="todo">To do</a></h2>
<ul>
<li>High-dimensional CRF</li>
</ul>
Contact: <i>qianxian@fudan.edu.cn</i>
</body></html>