<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<link rev="made" href="mailto:taku@chasen.org">
<title>CRF++: Yet Another CRF toolkit</title>
<link type="text/css" rel="stylesheet" href="default.css">
</head>
<body>
<h1>CRF++: Yet Another CRF toolkit</h1>
<h2>Introduction</h2>
<p><b>CRF++</b> is a simple, customizable, and open source implementation of <a href="http://www.cis.upenn.edu/~pereira/papers/crf.pdf">Conditional Random Fields (CRFs)</a> for segmenting/labeling sequential data. CRF++ is designed for general-purpose use and can be applied to a variety of NLP tasks, such as Named Entity Recognition, Information Extraction, and Text Chunking.</p>
<h2>Table of contents</h2>
<ul>
<li><a href="#features">Features</a></li>
<li><a href="#news">News</a></li>
<li><a href="#download">Download</a>
<ul>
<li><a href="#source">Source</a></li>
<li><a href="#windows">Binary package for MS-Windows</a></li>
</ul>
</li>
<li><a href="#install">Installation</a></li>
<li><a href="#usage">Usage</a>
<ul>
<li><a href="#format">Training and Test file formats</a></li>
<li><a href="#templ">Preparing feature templates</a></li>
<li><a href="#training">Training (encoding)</a></li>
<li><a href="#testing">Testing (decoding)</a></li>
</ul>
</li>
<li><a href="#tips">Case studies</a></li>
<li><a href="#tips">Useful Tips</a></li>
<li><a href="#todo">To do</a></li>
<li><a href="#links">Links</a></li>
</ul>
<h2><a name="features">Features</a></h2>
<ul>
<li>Can redefine feature sets</li>
<li>Written in C++ with STL</li>
<li>Fast training based on <a href="http://www-fp.mcs.anl.gov/otc/Guide/SoftwareGuide/Blurbs/lbfgs.html">LBFGS</a>, a quasi-Newton algorithm for large-scale numerical optimization problems</li>
<li>Low memory usage in both training and testing</li>
<li>Encoding/decoding in practical time</li>
<li>Can perform n-best outputs</li>
<li>Can perform single-best MIRA training</li>
<li>Can output marginal probabilities for all 
candidates</li>
<li>Available as open source software</li>
</ul>
<h2><a name="news">News</a></h2>
<ul>
<strong>2007-03-07</strong>: <a href="#download">CRF++ 0.47</a> Released<br>
<ul>
<li>Fixed a bug in MIRA training</li>
</ul>
<strong>2007-02-12</strong>: CRF++ 0.46 Released<br>
<ul>
<li>Changed the license from LGPL to an LGPL/BSD dual license</li>
<li>Added Perl/Ruby/Python/Java bindings (see the perl/ruby/python/java directories respectively)</li>
<li>Code refactoring</li>
</ul>
<strong>2006-11-26</strong>: CRF++ 0.45<br>
<ul>
<li>Added support for 1-best MIRA training (use the -a MIRA option)</li>
</ul>
<strong>2006-08-18</strong>: CRF++ 0.44<br>
<ul>
<li>Fixed a bug in feature extraction</li>
<li>Allowed redundant spaces in training/test files</li>
<li>Determined the real column size by looking at the template</li>
<li>Added sample code for the API (sdk/example.cpp)</li>
<li>Documented the usage of each API function (crfpp.h)</li>
</ul>
<strong>2006-08-07</strong>: CRF++ 0.43<br>
<ul>
<li>Implemented several API functions to get lattice information</li>
<li>Added the -c option to control the cost factor</li>
</ul>
<strong>2006-03-31</strong>: CRF++ 0.42<br>
<ul>
<li>Fixed a bug in feature extraction</li>
</ul>
<strong>2006-03-30</strong>: CRF++ 0.41<br>
<ul>
<li>Added support for parallel training</li>
</ul>
<strong>2006-03-21</strong>: CRF++ 0.40<br>
<ul>
<li>Fixed a fatal memory leak bug</li>
<li>Created the CRF++ API</li>
</ul>
<strong>2005-10-29</strong>: CRF++ 0.3<br>
<ul>
<li>Added the -t option, which outputs a text model in addition to the binary model</li>
<li>Added the -C option for converting a text model to a binary model</li>
</ul>
<strong>2005-07-04</strong>: CRF++ 0.2 Released<br>
<ul>
<li>Fixed several bugs</li>
</ul>
<strong>2005-05-28</strong>: CRF++ 0.1 Released<br>
<ul>
<li>Initial Release</li>
</ul>
</ul>
<h2><a name="download">Download</a></h2>
<ul>
<li><b>CRF++</b> is free software; you can redistribute it and/or modify it under the terms of the <a href="http://www.gnu.org/copyleft/lesser.html">GNU Lesser General Public License</a> or 
<a href="http://www.opensource.org/licenses/bsd-license.php">new BSD License</a></li>
<li>Please let <a href="mailto:taku@chasen.org">me</a> know if you use <b>CRF++</b> for research purposes or find any research publications where <b>CRF++</b> is applied.
<h3><a name="source">Source</a></h3>
<ul>
<li>CRF++-0.47.tar.gz: <a href="./src/CRF++-0.47.tar.gz">HTTP</a></li>
</ul>
<h3><a name="windows">Binary package for MS-Windows</a></h3>
<ul>
<li><a href="./win/">HTTP</a><br></li>
</ul>
</li>
</ul>
<h2><a name="install">Installation</a></h2>
<ul>
<li>Requirements
<ul>
<li>C++ compiler (gcc 3.0 or higher)</li>
</ul>
</li>
<li>How to make
<pre>
% ./configure
% make
% su
# make install
</pre>
You can change the default install path with the --prefix option of the configure script.<br>
Run configure with the --help option to see the other options.
</li>
</ul>
<h2><a name="usage">Usage</a></h2>
<h3><a name="format">Training and Test file formats</a></h3>
<p>Both the training file and the test file need to be in a particular format for <b>CRF++</b> to work properly. Generally speaking, training and test files must consist of multiple <b>tokens</b>. Each <b>token</b> in turn consists of a fixed number of columns. The definition of a token depends on the task; in most typical cases, tokens simply correspond to <b>words</b>. Each token must be written on one line, with the columns separated by white space (spaces or tab characters). A sequence of tokens forms a <b>sentence</b>. An empty line marks the boundary between sentences.</p>
<p>You can give as many columns as you like; however, the number of columns must be the same for all tokens. Furthermore, the columns carry some kind of "semantics". 
For example, the first column is the 'word', the second column is the 'POS tag', the third column is the 'sub-category of POS', and so on.</p>
<p>The last column holds the true answer tag that the CRF is trained to predict.</p>
<p>Here's an example of such a file: (data for CoNLL shared task)</p>
<pre>
He PRP B-NP
reckons VBZ B-VP
the DT B-NP
current JJ I-NP
account NN I-NP
deficit NN I-NP
will MD B-VP
narrow VB I-VP
to TO B-PP
only RB B-NP
# # I-NP
1.8 CD I-NP
billion CD I-NP
in IN B-PP
September NNP B-NP
. . O

He PRP B-NP
reckons VBZ B-VP
..
</pre>
<p>There are 3 columns for each token.</p>
<ul>
<li>The word itself (e.g. reckons);</li>
<li>The part-of-speech associated with the word (e.g. VBZ);</li>
<li>The chunk (answer) tag represented in IOB2 format;</li>
</ul>
<p>The following data is invalid, since the second and third lines have only 2 columns. (They have no POS column.) The number of columns must be the same on every line.</p>
<pre>
He PRP B-NP
reckons B-VP
the B-NP
current JJ I-NP
account NN I-NP
..
</pre>
<h3><a name="templ">Preparing feature templates</a></h3>
<p>As CRF++ is designed as a general-purpose tool, you have to specify the feature templates in advance. The template file describes which features are used in training and testing.</p>
<ul>
<li>Template basics and macros</li>
<p>Each line in the template file denotes one <i>template</i>. In each template, the special macro <i>%x[row,col]</i> is used to specify a token in the input data. <i>row</i> specifies the position relative to the current token in focus, and <i>col</i> specifies the absolute position of the column. 
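</p>
<p>The macro expansion described above can be sketched in a few lines of Python. This is only an illustration, not CRF++'s actual implementation: the function name <i>expand_template</i>, the list-of-lists sentence representation, and the <i>_OUT_OF_RANGE_</i> placeholder for positions outside the sentence are all assumptions made for this example.</p>

```python
import re

# Matches one %x[row,col] macro; row may be negative, col is a column index.
MACRO = re.compile(r"%x\[(-?\d+),(\d+)\]")


def expand_template(template, sentence, pos):
    """Expand every %x[row,col] macro in `template` for the token at `pos`.

    `sentence` is a list of tokens; each token is a list of column strings.
    """
    def substitute(match):
        row = pos + int(match.group(1))  # row is relative to the current token
        col = int(match.group(2))        # col is an absolute column index
        if row not in range(len(sentence)):
            # Hypothetical placeholder for positions outside the sentence;
            # this name is an assumption, not taken from the text above.
            return "_OUT_OF_RANGE_"
        return sentence[row][col]

    return MACRO.sub(substitute, template)


sentence = [
    ["He", "PRP", "B-NP"],
    ["reckons", "VBZ", "B-VP"],
    ["the", "DT", "B-NP"],  # current token (pos = 2)
    ["current", "JJ", "I-NP"],
    ["account", "NN", "I-NP"],
]

for template in ["%x[0,0]", "%x[-1,0]", "%x[-2,1]", "%x[0,0]/%x[0,1]"]:
    print(template, "=", expand_template(template, sentence, 2))
```

<p>Called with position 2 (the third token), each template is expanded against that token and its neighbours: a negative <i>row</i> reaches back to earlier tokens, while <i>col</i> always selects the same column.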
</p><p>Here are some examples of the replacements:</p>
<pre>
Input: Data
He PRP B-NP
reckons VBZ B-VP
the DT B-NP << CURRENT TOKEN
current JJ I-NP
account NN I-NP
</pre>
<p><table border>
<tr><td>template</td><td>expanded feature</td></tr>
<tr><td><b>%x[0,0]</b></td><td>the</td></tr>
<tr><td><b>%x[0,1]</b></td><td>DT</td></tr>
<tr><td><b>%x[-1,0]</b></td><td>reckons</td></tr>
<tr><td><b>%x[-2,1]</b></td><td>PRP</td></tr>
<tr><td><b>%x[0,0]/%x[0,1]</b></td><td>the/DT</td></tr>
<tr><td><b>ABC%x[0,1]123</b></td><td>ABCDT123</td></tr>
</table></p><br>
<li>Template type</li>
<p>Note also that there are two types of templates. The type is specified by the first character of the template.</p>
<ul>
<li>Unigram template: first character, <b>'U'</b></li>
<p>This is a template to describe unigram features. When you give a template "U01:%x[0,1]", CRF++ automatically generates a set of feature functions (func1 ... funcN) like:</p>
<pre>
func1 = if (output = B-NP and feature="U01:DT") return 1 else return 0
func2 = if (output = I-NP and feature="U01:DT") return 1 else return 0
func3 = if (output = O and feature="U01:DT") return 1 else return 0
....
funcXX = if (output = B-NP and feature="U01:NN") return 1 else return 0
funcXY = if (output = O and feature="U01:NN") return 1 else return 0
...
</pre>
<p>The number of feature functions generated by a template amounts to (L * N), where L is the number of output classes and N is the number of unique strings expanded from the given template.</p>
<li>Bigram template: first character, <b>'B'</b></li>
<p>This is a template to describe bigram features. With this template, a combination of the current output token and the previous output token (bigram) is automatically generated. Note that this type of template generates a total of (L * L * N) distinct features, where L is the number of output classes and N is the number of unique features generated by the template. 
When the number of classes is large, this type of template can produce a huge number of distinct features, which makes both training and testing inefficient.</p>
<li>What is the difference between unigram and bigram features?</li>
<p>The terms unigram/bigram are confusing, since a macro for unigram features does allow you to write a word-level bigram like %x[-1,0]%x[0,0]. Here, unigram and bigram features refer to uni/bigrams of output tags.</p>
<ul>
<li>unigram: |output tag| x |all possible strings expanded with a macro|</li>
<li>bigram: |output tag| x |output tag| x |all possible strings expanded with a macro|</li>
</ul>
<p></p></ul><li>Identifiers for distinguishing relative positions</li><p>You also need to put an identifier in templates when relative positions of