<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
  <head>
    <link rev="made" href="mailto:taku@chasen.org">
    <title>CRF++: Yet Another CRF toolkit</title>
    <link type="text/css" rel="stylesheet" href="default.css">
  </head>
  <body>
    <h1>CRF++: Yet Another CRF toolkit</h1>
    <h2>Introduction</h2>
      <p><b>CRF++</b> is a simple, customizable, and open source
      implementation of <a href="http://www.cis.upenn.edu/~pereira/papers/crf.pdf">Conditional Random Fields (CRFs)</a>
      for segmenting/labeling sequential data. CRF++ is designed for general-purpose use and can be applied to a variety of NLP tasks, such as
      Named Entity Recognition, Information Extraction and Text Chunking.</p>
    <h2>Table of contents</h2>
    <ul>
      <li><a href="#features">Features</a></li>
      <li><a href="#news">News</a></li>
      <li><a href="#download">Download</a>
        <ul>
          <li><a href="#source">Source</a></li>
          <li><a href="#windows">Binary package for MS-Windows</a></li>
        </ul>
      </li>
      <li><a href="#install">Installation</a></li>
      <li>
        <a href="#usage">Usage</a>
        <ul>
          <li><a href="#format">Training and Test file formats</a></li>
          <li><a href="#templ">Preparing feature templates</a></li>
          <li><a href="#training">Training (encoding)</a></li>
          <li><a href="#testing">Testing (decoding)</a></li>
        </ul>
      </li>
      <li><a href="#tips">Case studies</a></li>
      <li><a href="#tips">Useful Tips</a></li>
      <li><a href="#todo">To do</a></li>
      <li><a href="#links">Links</a></li>
    </ul>
    <h2><a name="features">Features</a></h2>
    <ul>
     <li>Can redefine feature sets</li>
     <li>Written in C++ with STL</li>
     <li>Fast training based on <a href="http://www-fp.mcs.anl.gov/otc/Guide/SoftwareGuide/Blurbs/lbfgs.html">LBFGS</a>, a quasi-Newton algorithm
         for large-scale numerical optimization problems</li>
     <li>Less memory usage both in
training and testing</li>
     <li>Encoding/decoding in practical time</li>
     <li>Can perform n-best outputs</li>
     <li>Can perform single-best MIRA training</li>
     <li>Can output marginal probabilities for all candidates</li>
     <li>Available as open source software</li>
    </ul>
    <h2><a name="news">News</a></h2>
    <ul>
    <strong>2007-03-07</strong>: <a href="#download">CRF++ 0.47</a> Released<br>
    <ul>
     <li>Fixed a bug in MIRA training</li>
    </ul>
    <strong>2007-02-12</strong>: CRF++ 0.46 Released<br>
    <ul>
     <li>Changed the license from LGPL to an LGPL/BSD dual
         license</li>
     <li>Added Perl/Ruby/Python/Java bindings (see the
         perl, ruby, python, and java directories, respectively)</li>
     <li>Code refactoring</li>
    </ul>
    <strong>2006-11-26</strong>: CRF++ 0.45<br>
    <ul>
      <li>Added support for 1-best MIRA training (use the -a MIRA option)</li>
    </ul>
    <strong>2006-08-18</strong>: CRF++ 0.44<br>
    <ul>
      <li>Fixed a bug in feature extraction</li>
      <li>Allowed redundant spaces in training/test files</li>
      <li>Determined the real column size by looking at the template</li>
      <li>Added sample code for the API (sdk/example.cpp)</li>
      <li>Described the usage of each API function (crfpp.h)</li>
    </ul>
    <strong>2006-08-07</strong>: CRF++ 0.43<br>
     <ul>
      <li>Implemented several API functions to get lattice
          information</li>
      <li>Added -c option to control the cost factor</li>
     </ul>
    <strong>2006-03-31</strong>: CRF++ 0.42<br>
     <ul>
      <li>Fixed a bug in feature extraction</li>
     </ul>
    <strong>2006-03-30</strong>: CRF++ 0.41<br>
     <ul>
      <li>Added support for parallel training</li>
     </ul>
    <strong>2006-03-21</strong>: CRF++ 0.40<br>
     <ul>
      <li>Fixed a fatal memory leak bug</li>
      <li>Created the CRF++ API</li>
     </ul>
    <strong>2005-10-29</strong>: CRF++ 0.3
     <ul>
      <li>Added -t option, which outputs a text model in addition
      to the binary model</li>
      <li>Added -C option for converting a text model
to a binary model</li>
      </ul>
    <strong>2005-07-04</strong>: CRF++ 0.2 Released<br>
       <ul>
        <li>Fixed several bugs</li>
       </ul>
    <strong>2005-05-28</strong>: CRF++ 0.1 Released<br>
       <ul>
        <li>Initial Release</li>
       </ul>
    </ul>
    <h2><a name="download">Download</a></h2>
    <ul>
      <li><b>CRF++</b> is free software; you can redistribute it
      and/or modify it under the terms of the <a href="http://www.gnu.org/copyleft/lesser.html">GNU Lesser General
      Public License</a> or <a href="http://www.opensource.org/licenses/bsd-license.php">new BSD License</a></li>
      <li>
        Please let <a href="mailto:taku@chasen.org">me</a> know if you use
        <b>CRF++</b> for research purposes or find any research
        publications where <b>CRF++</b> is applied.
        <h3><a name="source">Source</a></h3>
        <ul>
          <li>CRF++-0.47.tar.gz: <a href="./src/CRF++-0.47.tar.gz">HTTP</a></li>
        </ul>
        <h3><a name="windows">Binary package for MS-Windows</a></h3>
        <ul>
          <li><a href="./win/">HTTP</a><br></li>
        </ul>
      </li>
    </ul>
    <h2><a name="install">Installation</a></h2>
    <ul>
      <li>
        Requirements
        <ul>
          <li>C++ compiler (gcc 3.0 or higher)</li>
        </ul>
      </li>
      <li>
        How to make
<pre>
% ./configure
% make
% su
# make install
</pre>
        You can change the default install path with the --prefix
        option of the configure script.<br>
        Try the --help option to find out about other options.
      </li>
    </ul>
  <h2><a name="usage">Usage</a></h2>
  <h3><a name="format">Training and Test file formats</a></h3>
        <p>Both the training file and the test file need to be in a
        particular format for <b>CRF++</b> to work properly.
        Generally speaking, the training and test files must consist of
        multiple <b>tokens</b>. In addition, a <b>token</b>
        consists of multiple (but a fixed number of) columns.
The
        definition of tokens depends on the task; however, in
        most typical cases, they simply correspond to
        <b>words</b>. Each token must be represented in one line,
        with the columns separated by white space (spaces or
        tab characters). A sequence of tokens becomes a
        <b>sentence</b>. An empty line marks the boundary between
        sentences.</p>
        <p>You can give as many columns as you like; however, the
        number of columns must be the same for all tokens.
        Furthermore, there are some kinds of "semantics" among the
        columns. For example, the first column is 'word', the second column
        is 'POS tag', the third column is 'sub-category of POS', and so
        on.</p>
        <p>The last column represents the true answer tag, which is
        what the CRF is trained to predict.</p>
        <p>Here's an example of such a file (data for the CoNLL shared
        task):</p>
<pre>
He        PRP  B-NP
reckons   VBZ  B-VP
the       DT   B-NP
current   JJ   I-NP
account   NN   I-NP
deficit   NN   I-NP
will      MD   B-VP
narrow    VB   I-VP
to        TO   B-PP
only      RB   B-NP
#         #    I-NP
1.8       CD   I-NP
billion   CD   I-NP
in        IN   B-PP
September NNP  B-NP
.         .    O

He        PRP  B-NP
reckons   VBZ  B-VP
..
</pre>
  <p>There are 3 columns for each token.</p>
  <ul>
   <li>The word itself (e.g. reckons);</li>
   <li>The part-of-speech tag associated with the word (e.g. VBZ);</li>
   <li>The chunk (answer) tag, represented in IOB2 format;</li>
  </ul>
  <p>The following data is invalid, since the second and third
  lines have only 2 columns. (They have no POS
  column.) The number of columns must be fixed.</p>
<pre>
He        PRP  B-NP
reckons   B-VP
the       B-NP
current   JJ   I-NP
account   NN   I-NP
..
</pre>
  <h3><a name="templ">Preparing feature templates</a></h3>
  <p>
  As CRF++ is designed as a general-purpose tool, you have to
  specify the feature templates in advance. The template file describes
  which features are used in training and testing.
</p>
  <ul>
  <li>Template basics and macros</li>
  <p>
  Each line in the template file denotes one <i>template</i>.
  In each template, the special macro <i>%x[row,col]</i> is
  used to specify a token in the input data. <i>row</i> specifies the
  position relative to the current token in focus,
  and <i>col</i> specifies the absolute position of the column.
  </p>
<p>Here are some examples of the replacements:</p>
<pre>
Input: Data
He        PRP  B-NP
reckons   VBZ  B-VP
the       DT   B-NP &lt;&lt; CURRENT TOKEN
current   JJ   I-NP
account   NN   I-NP
</pre>
<p>
<table border>
<tr><td>template</td><td>expanded feature</td></tr>
<tr><td><b>%x[0,0]</b></td><td>the</td></tr>
<tr><td><b>%x[0,1]</b></td><td>DT</td></tr>
<tr><td><b>%x[-1,0]</b></td><td>reckons</td></tr>
<tr><td><b>%x[-2,1]</b></td><td>PRP</td></tr>
<tr><td><b>%x[0,0]/%x[0,1]</b></td><td>the/DT</td></tr>
<tr><td><b>ABC%x[0,1]123</b></td><td>ABCDT123</td></tr>
</table>
</p>
<br>
<li>Template type</li>
<p>Note also that there are two types of templates.
  The type is specified by the first character of the template.</p>
  <ul>
   <li>Unigram template: first character, <b>'U'</b></li>
       <p>
       This is a template to describe unigram features.
       When you give a template "U01:%x[0,1]", CRF++ automatically
       generates a set of feature functions (func1 ... funcN) like:
       </p>
<pre>
func1 = if (output = B-NP and feature="U01:DT") return 1 else return 0
func2 = if (output = I-NP and feature="U01:DT") return 1 else return 0
func3 = if (output = O and feature="U01:DT") return 1 else return 0
....
funcXX = if (output = B-NP and feature="U01:NN") return 1 else return 0
funcXY = if (output = O and feature="U01:NN") return 1 else return 0
...
</pre>
       <p>
       The number of feature functions generated by a template amounts to
       (L * N), where L is the number of output classes and N is the
       number of unique strings expanded from the given template.
</p>
   <li>Bigram template: first character, <b>'B'</b></li>
       <p>
       This is a template to describe bigram features.
       With this template, a combination of the current output token and the previous output token
       (bigram) is automatically generated. Note that this type of template generates a total of
       (L * L * N) distinct features, where L is the
       number of output classes and N is the number
       of unique features generated by the templates.
       When the number of classes is large, this type of template produces
       a very large number of distinct features, which makes both
       training and testing inefficient.
       </p>
   <li>What is the difference between unigram and bigram features?</li>
     <p>
     The words unigram/bigram are confusing, since a macro for unigram features
     does allow you to write a word-level bigram like %x[-1,0]%x[0,0]. Here,
     unigram and bigram features mean uni/bigrams of output tags.</p>
     <ul>
     <li>unigram: |output tag| x |all possible strings expanded with a macro|</li>
     <li>bigram: |output tag| x |output tag| x |all possible strings expanded with a macro|</li>
     </ul>
     <p></p></ul>
<li>Identifiers for distinguishing relative positions</li>
<p>You also need to put an identifier in templates when relative positions of
