correct (unique) stem. Note: quite often correct stems were also
correct lemmas.</li>
  <li><b>lemma OK:</b> the number of cases in which the produced output was a
correct lemma.</li>
  <li><b>missing:</b> the number of cases in which the stemmer was unable to
provide any output.</li>
  <li><b>stem bad:</b> the number of cases in which the produced output was a
stem that was already in use, identifying a different set.</li>
  <li><b>lemma bad:</b> the number of cases in which the produced output was an
incorrect lemma. Note: quite often in such cases the output was a
correct stem.</li>
  <li><b>table size:</b> the size in bytes of the stemmer table.</li>
</ul>
<div align="center">
<table border="1" cellpadding="2" cellspacing="0">
  <tbody>
    <tr bgcolor="#a0b0c0">
      <th>Training sets</th>
      <th>Testing forms</th>
      <th>Stem OK</th>
      <th>Lemma OK</th>
      <th>Missing</th>
      <th>Stem Bad</th>
      <th>Lemma Bad</th>
      <th>Table size [B]</th>
    </tr>
    <tr align="right">
      <td>100</td>
      <td>1022985</td>
      <td>842209</td>
      <td>593632</td>
      <td>172711</td>
      <td>22331</td>
      <td>256642</td>
      <td>28438</td>
    </tr>
    <tr align="right">
      <td>200</td>
      <td>1022985</td>
      <td>862789</td>
      <td>646488</td>
      <td>153288</td>
      <td>16306</td>
      <td>223209</td>
      <td>48660</td>
    </tr>
    <tr align="right">
      <td>500</td>
      <td>1022985</td>
      <td>885786</td>
      <td>685009</td>
      <td>130772</td>
      <td>14856</td>
      <td>207204</td>
      <td>108798</td>
    </tr>
    <tr align="right">
      <td>700</td>
      <td>1022985</td>
      <td>909031</td>
      <td>704609</td>
      <td>107084</td>
      <td>15442</td>
      <td>211292</td>
      <td>139291</td>
    </tr>
    <tr align="right">
      <td>1000</td>
      <td>1022985</td>
      <td>926079</td>
      <td>725720</td>
      <td>90117</td>
      <td>14941</td>
      <td>207148</td>
      <td>183677</td>
    </tr>
    <tr align="right">
      <td>2000</td>
      <td>1022985</td>
      <td>942886</td>
      <td>746641</td>
      <td>73429</td>
      <td>14903</td>
      <td>202915</td>
      <td>313516</td>
    </tr>
    <tr align="right">
      <td>5000</td>
      <td>1022985</td>
      <td>954721</td>
      <td>759930</td>
      <td>61476</td>
      <td>14817</td>
      <td>201579</td>
      <td>640969</td>
    </tr>
    <tr align="right">
      <td>7000</td>
      <td>1022985</td>
      <td>956165</td>
      <td>764033</td>
      <td>60364</td>
      <td>14620</td>
      <td>198588</td>
      <td>839347</td>
    </tr>
    <tr align="right">
      <td>10000</td>
      <td>1022985</td>
      <td>965427</td>
      <td>775507</td>
      <td>50797</td>
      <td>14662</td>
      <td>196681</td>
      <td>1144537</td>
    </tr>
    <tr align="right">
      <td>12000</td>
      <td>1022985</td>
      <td>967664</td>
      <td>782143</td>
      <td>48722</td>
      <td>14284</td>
      <td>192120</td>
      <td>1313508</td>
    </tr>
    <tr align="right">
      <td>15000</td>
      <td>1022985</td>
      <td>973188</td>
      <td>788867</td>
      <td>43247</td>
      <td>14349</td>
      <td>190871</td>
      <td>1567902</td>
    </tr>
    <tr align="right">
      <td>17000</td>
      <td>1022985</td>
      <td>974203</td>
      <td>791804</td>
      <td>42319</td>
      <td>14333</td>
      <td>188862</td>
      <td>1733957</td>
    </tr>
    <tr align="right">
      <td>20000</td>
      <td>1022985</td>
      <td>976234</td>
      <td>791554</td>
      <td>40058</td>
      <td>14601</td>
      <td>191373</td>
      <td>1977615</td>
    </tr>
  </tbody>
</table>
</div>
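<p>To compare rows, it can help to express the counts as percentages of
the tested forms. The sketch below does this for three rows copied from
the table. The names <code>StemRates</code>, <code>percent</code> and
<code>lemmaColumnsAddUp</code> are made up for this illustration and are
not part of the package. It also checks an observation about these rows:
Lemma OK, Missing and Lemma Bad add up to the number of testing forms.</p>

```java
import java.util.Locale;

/** Illustrative helper (not part of Stempel): turns table counts into percentages. */
final class StemRates {
    /** Share of count in total, expressed as a percentage. */
    static double percent(long count, long total) {
        if (total <= 0) {
            throw new IllegalArgumentException("total must be positive");
        }
        return 100.0 * count / total;
    }

    /** True if lemma OK + missing + lemma bad equals the number of testing forms. */
    static boolean lemmaColumnsAddUp(long lemmaOk, long missing, long lemmaBad, long total) {
        return lemmaOk + missing + lemmaBad == total;
    }

    public static void main(String[] args) {
        long testingForms = 1022985L; // identical in every row of the table
        // Columns: training sets, stem OK, lemma OK, missing, lemma bad
        long[][] rows = {
            {100, 842209, 593632, 172711, 256642},
            {2000, 942886, 746641, 73429, 202915},
            {20000, 976234, 791554, 40058, 191373},
        };
        for (long[] r : rows) {
            System.out.printf(Locale.ROOT,
                "%6d sets: stem OK %.2f%%, lemma OK %.2f%%, missing %.2f%%, lemma columns add up: %b%n",
                r[0],
                percent(r[1], testingForms),
                percent(r[2], testingForms),
                percent(r[3], testingForms),
                lemmaColumnsAddUp(r[2], r[3], r[4], testingForms));
        }
    }
}
```

<p>For these rows the stem OK rate grows from roughly 82% with 100
training sets to roughly 95% with 20000 training sets.</p>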
<p>I also measured the time needed to produce a stem (which involves
traversing a trie,
retrieving a patch command and applying the patch command to the input
string).
On a machine running Windows XP (Pentium 4, 1.7 GHz, JDK 1.4.2_03
HotSpot),
for tables ranging in size from 1,000 to 20,000 cells, the time to
produce a
single stem varies between 5 and 10 microseconds.
</p>
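<p>The three steps named above (walk a trie, fetch a patch command, apply
it to the input) can be sketched as follows. This is a simplified
illustration, not the table format or patch encoding used by the package:
the class <code>PatchTrieSketch</code>, the choice to read the word from
its end, and the patch syntax <code>"drop:append"</code> are all
assumptions made for this example.</p>

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch only: a suffix trie whose nodes may carry a patch command. */
final class PatchTrieSketch {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        String patch; // hypothetical format "N:suffix": drop N trailing chars, append suffix
    }

    private final Node root = new Node();

    /** Registers a patch for words ending in the given suffix. */
    void add(String suffix, String patch) {
        Node node = root;
        for (int i = suffix.length() - 1; i >= 0; i--) {
            node = node.next.computeIfAbsent(suffix.charAt(i), c -> new Node());
        }
        node.patch = patch;
    }

    /** Step 1 and 2: walk the trie from the end of the word, keep the deepest patch seen. */
    String lookup(String word) {
        Node node = root;
        String best = null;
        for (int i = word.length() - 1; i >= 0; i--) {
            node = node.next.get(word.charAt(i));
            if (node == null) {
                break; // no longer suffix is known
            }
            if (node.patch != null) {
                best = node.patch;
            }
        }
        return best;
    }

    /** Step 3: apply a "N:suffix" patch to the word. */
    static String apply(String word, String patch) {
        int sep = patch.indexOf(':');
        int drop = Integer.parseInt(patch.substring(0, sep));
        if (drop > word.length()) {
            return word; // patch does not fit; leave the word unchanged
        }
        return word.substring(0, word.length() - drop) + patch.substring(sep + 1);
    }

    /** Returns the patched word, or null when no patch is found. */
    String stem(String word) {
        String patch = lookup(word);
        return patch == null ? null : apply(word, patch);
    }

    public static void main(String[] args) {
        PatchTrieSketch trie = new PatchTrieSketch();
        trie.add("s", "1:");    // "cats" -> "cat"
        trie.add("ies", "3:y"); // "ponies" -> "pony"
        System.out.println(trie.stem("cats"));
        System.out.println(trie.stem("ponies"));
        System.out.println(trie.stem("dog")); // no matching suffix, so null
    }
}
```

<p>In this sketch each lookup touches at most one trie node per character
of the word, and a word with no matching entry yields no output at all,
which is roughly the situation the Missing column counts.</p>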
<p>At 5 microseconds per stem, the stemmer can process up to <span
 style="font-weight: bold;">200,000 words per second</span>, an
outstanding result when compared to other stemmers (Morfeusz: ~2,000
w/s; FormAN, the MS Word analyzer: ~1,000 w/s).
</p>
<p>The package contains a class <code>org.getopt.stempel.Benchmark</code>,
which you can use to produce reports
like the one below:
</p>
<pre>--------- Stemmer benchmark report: -----------
Stemmer table:  /res/tables/stemmer_2000.out
Input file:     ../test3.txt
Number of runs: 3

 RUN NUMBER:            1       2       3
 Total input words      1378176 1378176 1378176
 Missed output words    112     112     112
 Time elapsed [ms]      6989    6940    6640
 Hit rate percent       99.99%  99.99%  99.99%
 Miss rate percent      00.01%  00.01%  00.01%
 Words per second       197192  198584  207557
 Time per word [us]     5.07    5.04    4.82
</pre>
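<p>The derived rows of the report follow from the word count and the
elapsed time. The report itself does not say how they are computed; the
sketch below assumes plain arithmetic with rounding to the nearest value,
which matches the figures shown. The class <code>BenchmarkFigures</code>
and its methods are hypothetical helpers, not part of
<code>org.getopt.stempel.Benchmark</code>.</p>

```java
import java.util.Locale;

/** Illustrative helper (not part of Stempel): recomputes the report's derived rows. */
final class BenchmarkFigures {
    /** Words processed per second, rounded to the nearest whole word. */
    static long wordsPerSecond(long words, long elapsedMs) {
        return Math.round(words * 1000.0 / elapsedMs);
    }

    /** Average time per word in microseconds. */
    static double microsPerWord(long words, long elapsedMs) {
        return elapsedMs * 1000.0 / words;
    }

    /** Percentage of input words for which no output was produced. */
    static double missRatePercent(long missed, long words) {
        return 100.0 * missed / words;
    }

    public static void main(String[] args) {
        long words = 1378176L;                 // "Total input words"
        long missed = 112L;                    // "Missed output words"
        long[] elapsedMs = {6989L, 6940L, 6640L}; // "Time elapsed [ms]" for runs 1 to 3
        for (int run = 0; run < elapsedMs.length; run++) {
            System.out.printf(Locale.ROOT,
                "run %d: %d words/s, %.2f us/word, miss rate %.2f%%%n",
                run + 1,
                wordsPerSecond(words, elapsedMs[run]),
                microsPerWord(words, elapsedMs[run]),
                missRatePercent(missed, words));
        }
    }
}
```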
<h2>Summary</h2>
<p>The results of these tests are very encouraging. It seems that
combining the training corpus with the stemming algorithm described
above yields a high-quality stemmer that is useful for most
applications. Moreover, it can also
be used as a better-than-average lemmatizer.</p>
<p>Both the author of the implementation
(Leo Galambos, &lt;leo.galambos AT egothor DOT org&gt;) and the author
of this
compilation (Andrzej Bialecki &lt;ab AT getopt DOT org&gt;) would
appreciate any
feedback and suggestions for further improvements.</p>
<h2><a name="distrib">Download</a></h2>
<p>You can download the full source code here:</p>
<ul>
  <li><a href="stempel-1.0.zip">stempel-src-1.0.zip</a></li>
  <li><a href="stempel-1.0.tgz">stempel-src-1.0.tgz</a></li>
</ul>
<p><i>NOTE: due to licensing restrictions I am unable to provide
downloads for the
original corpora.</i></p>
<p>You will need <a href="http://jakarta.apache.org/ant">Jakarta Ant</a>
to
build the JAR file. You can also download a pre-compiled JAR here
(includes
stemming tables):
</p>
<ul>
  <li><a href="stempel-1.0.jar">stempel-1.0.jar</a></li>
</ul>
<p>JavaDoc API documentation can be found <a href="api/index.html">here</a>.</p>
<h3>License</h3>
<p>Most of the code is covered by the <a href="http://www.egothor.org">Egothor
Open Source License</a>,
an Apache-style license. The rest of the code is covered by the Apache
License 2.0.</p>
<p>The Open Source distribution contains pre-built stemmer tables
trained with
a random, well-balanced sample of up to 2000 sets. Other tables are
available
commercially for a modest price; please contact Andrzej Bialecki
&lt;ab AT getopt DOT
org&gt; for more details.
</p>
<h2>Bibliography</h2>
<ol>
  <li>Galambos, L.: Multilingual Stemmer in Web Environment, PhD
Thesis,
Faculty of Mathematics and Physics, Charles University in Prague, in
press.</li>
  <li>Galambos, L.: Semi-automatic Stemmer Evaluation. International
Intelligent Information Processing and Web Mining Conference, 2004,
Zakopane, Poland.</li>
  <li>Galambos, L.: Lemmatizer for Document Information Retrieval
Systems in JAVA. <a
 class="moz-txt-link-rfc2396E"
 href="http://www.informatik.uni-trier.de/%7Eley/db/conf/sofsem/sofsem2001.html#Galambos01">&lt;http://www.informatik.uni-trier.de/%7Eley/db/conf/sofsem/sofsem2001.html#Galambos01&gt;</a>
SOFSEM 2001, Piestany, Slovakia.
  </li>
</ol>
<br>
<br>
</body>
</html>