📄 testfile
字号:
One technique for improving IR performance is to provide searchers withways of finding morphological variants of search terms. If, for example, a searcher enters the term stemming as part of a query, it is likely that s/he will also be interested in such variants as stemmed and stem. We use the term conflation, meaning the act of fusing or combining, as the general term for the process of matching morphological term variants. Conflation can be either manual--using some kind of regular expressions--or automatic, via programs called stemmers. Stemming is also used in IR to reduce the size of index files. Since a single stem typically corresponds to several full terms, by storing stems instead of terms, compression factors of over fifty percent can be achieved.As can be seen in Figure 1.2 in chapter 1, terms can be stemmed at indexing time or at search time. The advantage of stemming at indexing time is efficiency and index file compression--since index terms are alreadystemmed, this operation requires no resources at search time, and the index file will be compressed as described above. The disadvantage of indexing time stemming is that information about the full terms will be lost, or additional storage will be required to store both the stemmed and unstemmed forms.Figure 8.1 shows a taxonomy for stemming algorithms. There are four automatic approaches. Affix removal algorithms remove suffixes and/or prefixes from terms leaving a stem. These algorithms sometimes also transform the resultant stem. The name stemmer derives from this method, which is the most common. Successor variety stemmers use the frequencies of letter sequences in a body of text as the basis of stemming. The n-gram method conflates terms based on the number of digrams or n-grams they share.Terms and their corresponding stems can also be stored in a table. Stemmingis then done via lookups in the table. These methods are described below.
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -