correct (unique) stem. Note: quite often correct stems were also
correct lemmas.</li>
<li><b>lemma OK:</b> the number of cases in which the produced output
was a correct lemma.</li>
<li><b>missing:</b> the number of cases in which the stemmer was unable
to provide any output.</li>
<li><b>stem bad:</b> the number of cases in which the produced output
was a stem that was already in use to identify a different set.</li>
<li><b>lemma bad:</b> the number of cases in which the produced output
was an incorrect lemma. Note: quite often in such cases the output was
a correct stem.</li>
<li><b>table size:</b> the size in bytes of the stemmer table.</li>
</ul>
<div align="center">
<table border="1" cellpadding="2" cellspacing="0">
<tbody>
<tr bgcolor="#a0b0c0">
<th>Training sets</th>
<th>Testing forms</th>
<th>Stem OK</th>
<th>Lemma OK</th>
<th>Missing</th>
<th>Stem Bad</th>
<th>Lemma Bad</th>
<th>Table size [B]</th>
</tr>
<tr align="right">
<td>100</td>
<td>1022985</td>
<td>842209</td>
<td>593632</td>
<td>172711</td>
<td>22331</td>
<td>256642</td>
<td>28438</td>
</tr>
<tr align="right">
<td>200</td>
<td>1022985</td>
<td>862789</td>
<td>646488</td>
<td>153288</td>
<td>16306</td>
<td>223209</td>
<td>48660</td>
</tr>
<tr align="right">
<td>500</td>
<td>1022985</td>
<td>885786</td>
<td>685009</td>
<td>130772</td>
<td>14856</td>
<td>207204</td>
<td>108798</td>
</tr>
<tr align="right">
<td>700</td>
<td>1022985</td>
<td>909031</td>
<td>704609</td>
<td>107084</td>
<td>15442</td>
<td>211292</td>
<td>139291</td>
</tr>
<tr align="right">
<td>1000</td>
<td>1022985</td>
<td>926079</td>
<td>725720</td>
<td>90117</td>
<td>14941</td>
<td>207148</td>
<td>183677</td>
</tr>
<tr align="right">
<td>2000</td>
<td>1022985</td>
<td>942886</td>
<td>746641</td>
<td>73429</td>
<td>14903</td>
<td>202915</td>
<td>313516</td>
</tr>
<tr align="right">
<td>5000</td>
<td>1022985</td>
<td>954721</td>
<td>759930</td>
<td>61476</td>
<td>14817</td>
<td>201579</td>
<td>640969</td>
</tr>
<tr align="right">
<td>7000</td>
<td>1022985</td>
<td>956165</td>
<td>764033</td>
<td>60364</td>
<td>14620</td>
<td>198588</td>
<td>839347</td>
</tr>
<tr align="right">
<td>10000</td>
<td>1022985</td>
<td>965427</td>
<td>775507</td>
<td>50797</td>
<td>14662</td>
<td>196681</td>
<td>1144537</td>
</tr>
<tr align="right">
<td>12000</td>
<td>1022985</td>
<td>967664</td>
<td>782143</td>
<td>48722</td>
<td>14284</td>
<td>192120</td>
<td>1313508</td>
</tr>
<tr align="right">
<td>15000</td>
<td>1022985</td>
<td>973188</td>
<td>788867</td>
<td>43247</td>
<td>14349</td>
<td>190871</td>
<td>1567902</td>
</tr>
<tr align="right">
<td>17000</td>
<td>1022985</td>
<td>974203</td>
<td>791804</td>
<td>42319</td>
<td>14333</td>
<td>188862</td>
<td>1733957</td>
</tr>
<tr align="right">
<td>20000</td>
<td>1022985</td>
<td>976234</td>
<td>791554</td>
<td>40058</td>
<td>14601</td>
<td>191373</td>
<td>1977615</td>
</tr>
</tbody>
</table>
</div>
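<p>The counts in the table are easier to compare as shares of the
1,022,985 testing forms. The following is a minimal Java sketch, not
part of the Stempel package: the class name <code>TableRates</code> and
its helper methods are made up for illustration, and only three rows
are copied from the table. For these three rows, Lemma OK, Lemma Bad
and Missing add up to the number of testing forms, which the sketch
also checks.</p>

```java
import java.util.Locale;

// Hypothetical helper class, not part of Stempel: converts table counts to percentages.
public class TableRates {
    // Number of testing forms, identical in every row of the table.
    static final long TESTING_FORMS = 1022985L;

    // Share of the testing forms, in percent.
    static double percent(long count) {
        return 100.0 * count / TESTING_FORMS;
    }

    // True if the three lemma-related columns account for every testing form.
    static boolean lemmaColumnsAddUp(long lemmaOk, long lemmaBad, long missing) {
        return lemmaOk + lemmaBad + missing == TESTING_FORMS;
    }

    public static void main(String[] args) {
        // Columns: training sets, stem OK, lemma OK, missing, stem bad, lemma bad.
        long[][] rows = {
            {100, 842209, 593632, 172711, 22331, 256642},
            {2000, 942886, 746641, 73429, 14903, 202915},
            {20000, 976234, 791554, 40058, 14601, 191373},
        };
        for (long[] r : rows) {
            System.out.printf(Locale.ROOT,
                "%5d sets: stem OK %.2f%%, lemma OK %.2f%%, missing %.2f%%, lemma columns add up: %b%n",
                r[0], percent(r[1]), percent(r[2]), percent(r[3]),
                lemmaColumnsAddUp(r[2], r[5], r[3]));
        }
    }
}
```

<p>Read this way, going from 100 to 20,000 training sets raises the
Stem OK share from roughly 82% to roughly 95%, while the table grows
from 28,438 to 1,977,615 bytes.</p>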
<p>I also measured the time needed to produce a stem, which involves
traversing a trie, retrieving a patch command and applying that patch
command to the input string.
On a machine running Windows XP (Pentium 4, 1.7 GHz, JDK 1.4.2_03
HotSpot),
for tables ranging in size from 1,000 to 20,000 cells, the time to
produce a single stem varies between 5 and 10 microseconds.
</p>
<p>At 5 microseconds per word, this means that the stemmer can process
up to <span
style="font-weight: bold;">200,000 words per second</span>, an
outstanding result when compared to other stemmers (Morfeusz: ~2,000
w/s; FormAN, the MS Word analyzer: ~1,000 w/s).
</p>
<p>The package contains a class <code>org.getopt.stempel.Benchmark</code>,
which you can use to produce reports
like the one below:<br>
</p>
<pre>--------- Stemmer benchmark report: -----------<br>Stemmer table: /res/tables/stemmer_2000.out<br>Input file: ../test3.txt<br>Number of runs: 3<br><br> RUN NUMBER: 1 2 3<br> Total input words 1378176 1378176 1378176<br> Missed output words 112 112 112<br> Time elapsed [ms] 6989 6940 6640<br> Hit rate percent 99.99% 99.99% 99.99%<br> Miss rate percent 00.01% 00.01% 00.01%<br> Words per second 197192 198584 207557<br> Time per word [us] 5.07 5.04 4.82<br></pre>
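<p>The derived rows of such a report follow from the raw counts. The
sketch below is an illustration only and is not the code of
<code>org.getopt.stempel.Benchmark</code>: the class name
<code>ReportFigures</code> and its methods are hypothetical, and
rounding to the nearest whole number is an assumption that happens to
match the figures printed above.</p>

```java
import java.util.Locale;

// Hypothetical helper class: recomputes the derived rows of the sample report.
public class ReportFigures {
    // Throughput in words per second, rounded to the nearest whole number (assumed rounding).
    static long wordsPerSecond(long words, long elapsedMs) {
        return Math.round(words * 1000.0 / elapsedMs);
    }

    // Average time per word in microseconds.
    static double microsPerWord(long words, long elapsedMs) {
        return elapsedMs * 1000.0 / words;
    }

    // Percentage of input words for which some output was produced.
    static double hitRatePercent(long words, long missed) {
        return 100.0 * (words - missed) / words;
    }

    public static void main(String[] args) {
        long words = 1378176L;                  // "Total input words" in the sample report
        long missed = 112L;                     // "Missed output words"
        long[] elapsedMs = {6989L, 6940L, 6640L}; // "Time elapsed [ms]" for runs 1 to 3

        for (int run = 0; run < elapsedMs.length; run++) {
            System.out.printf(Locale.ROOT,
                "run %d: %d words/s, %.2f us/word, hit rate %.2f%%%n",
                run + 1,
                wordsPerSecond(words, elapsedMs[run]),
                microsPerWord(words, elapsedMs[run]),
                hitRatePercent(words, missed));
        }
    }
}
```

<p>With the numbers from the sample report, run 1 works out to 197,192
words per second and about 5.07 microseconds per word, in line with the
5 to 10 microseconds per stem quoted earlier.</p>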
<h2>Summary</h2>
<p>The results of these tests are very encouraging. It seems that using
the training corpus and the stemming algorithm described above results
in a high-quality stemmer that is useful for most applications.
Moreover, it can also be used as a better-than-average lemmatizer.</p>
<p>Both the author of the implementation
(Leo Galambos, &lt;leo.galambos AT egothor DOT org&gt;) and the author
of this
compilation (Andrzej Bialecki, &lt;ab AT getopt DOT org&gt;) would
appreciate any
feedback and suggestions for further improvements.</p>
<a name="distrib">
<h2>Download</h2>
</a>
<p>You can download the full source code here:</p>
<ul>
<li><a href="stempel-1.0.zip">stempel-src-1.0.zip</a></li>
<li><a href="stempel-1.0.tgz">stempel-src-1.0.tgz</a></li>
</ul>
<p><i>NOTE: due to licensing restrictions I am unable to provide
downloads for the
original corpora.</i></p>
<p>You will need <a href="http://jakarta.apache.org/ant">Jakarta Ant</a>
to
build the JAR file. You can also download a pre-compiled JAR here
(includes
stemming tables):
</p>
<ul>
<li><a href="stempel-1.0.jar">stempel-1.0.jar</a></li>
</ul>
<p>JavaDoc API documentation can be found <a href="api/index.html">here</a>.</p>
<h3>License</h3>
<p>Most of the code is covered by the <a href="http://www.egothor.org">Egothor
Open Source License</a>,
an Apache-style license. The rest of the code is covered by the Apache
License 2.0.</p>
<p>The Open Source distribution contains pre-built stemmer tables
trained with
a random, well-balanced sample of up to 2000 sets. Other tables are
available
commercially for a modest price. Please contact Andrzej Bialecki
&lt;ab AT getopt DOT org&gt; for more details.
</p>
<h2>Bibliography</h2>
<ol>
<li>Galambos, L.: Multilingual Stemmer in Web Environment, PhD
Thesis,
Faculty of Mathematics and Physics, Charles University in Prague, in
press.</li>
<li>Galambos, L.: Semi-automatic Stemmer Evaluation. International
Intelligent Information Processing and Web Mining Conference, 2004,
Zakopane, Poland.</li>
<li>Galambos, L.: Lemmatizer for Document Information Retrieval
Systems in JAVA. SOFSEM 2001, Piestany, Slovakia. <a
class="moz-txt-link-rfc2396E"
href="http://www.informatik.uni-trier.de/%7Eley/db/conf/sofsem/sofsem2001.html#Galambos01">&lt;http://www.informatik.uni-trier.de/%7Eley/db/conf/sofsem/sofsem2001.html#Galambos01&gt;</a>
</li>
</ol>
<br>
<br>
</body>
</html>