<chapter id="selectd"><title>selectd</title>
<para><application>selectd</application> is the <application>Select</application> daemon.</para>
<para>See also the <link linkend="ref_selectd">reference page</link> for <application>selectd</application> for more information on how to use it.</para>
<sect1><title>Language handling</title>
<para>To be written.</para>
<sect2><title>Character sets</title></sect2>
<sect2><title>Stopwords</title></sect2>
<sect2><title>Stemming</title></sect2>
</sect1>
<sect1><title>Text transformations</title>
<para>When classifying text with a vector classifier, the text first has to be transformed into a vector. This is accomplished in two main steps: tokenization and vectorization.</para>
<sect2><title>Tokenization</title>
<para>Tokenization is the process of transforming a text into text tokens. For example, the text "I ordered you pancakes" might be transformed into the tokens "I", "ordered", "you" and "pancakes".</para>
<para>The following list shows the different tokenizers available in <application>Select</application>. Tokenizers whose names end with ".byte" operate on bytes rather than characters, while the others (try to) interpret characters according to some specified character set encoding. The latter type outputs all tokens in UTF-8 encoding.<variablelist>
<varlistentry><term>alpha</term><listitem><para>Alpha character tokenizer. Creates tokens from all (successive) sequences of alpha characters.<!--Alpha characters are here defined as those ... unicode--></para></listitem></varlistentry>
<varlistentry><term>ngram.byte</term><listitem><para>N-gram tokenizer.</para></listitem></varlistentry>
<varlistentry><term>null</term><listitem><para>Null tokenizer. Every text is transformed into zero tokens. This was created especially for the <link linkend="trivial">Trivial classifier</link>.</para></listitem></varlistentry>
</variablelist></para>
<sect3><title>Token selection</title>
<para>Not all tokens generated by the tokenizer are worth using for classification. To be expanded.</para>
<para>It is possible to set a minimum and maximum length for the allowed tokens.</para>
<para>Uppercase letters can be forced to lower case.</para>
<para>Stopwords.</para>
<para>Stemming.</para>
</sect3>
</sect2>
<sect2><title>Vectorization</title>
<para>Vectorization is the process of transforming a sequence of text tokens into a vector.</para>
</sect2>
</sect1>
</chapter>
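<!-- The tokenization and vectorization steps described in the chapter can be sketched roughly as
     follows. This is an illustrative Python sketch, not Select's actual implementation: the
     function names (alpha_tokenize, ngram_byte_tokenize, vectorize) are invented for the example,
     and the count-vector representation is only one plausible vectorization scheme. -->

```python
import re
from collections import Counter

def alpha_tokenize(text):
    # Alpha-style tokenization: emit every maximal run of alphabetic
    # characters (digits and underscores excluded), as the "alpha"
    # tokenizer entry describes.
    return re.findall(r"[^\W\d_]+", text)

def ngram_byte_tokenize(text, n=3):
    # Byte n-gram tokenization: operate on the raw bytes rather than on
    # characters, as a ".byte" tokenizer would, sliding a window of n bytes.
    data = text.encode("utf-8")
    return [data[i:i + n] for i in range(len(data) - n + 1)]

def vectorize(tokens, min_len=1, max_len=40):
    # Vectorization with simple token selection: drop tokens outside the
    # allowed length range, force lower case, then count occurrences to
    # build a sparse token-frequency vector.
    selected = (t.lower() for t in tokens if min_len <= len(t) <= max_len)
    return Counter(selected)

tokens = alpha_tokenize("I ordered you pancakes")
print(tokens)                         # ['I', 'ordered', 'you', 'pancakes']
print(vectorize(tokens))
print(ngram_byte_tokenize("abcd", 2))  # [b'ab', b'bc', b'cd']
```

A real classifier would map each token to a fixed dimension index rather than keep a token-keyed mapping, but the counting step is the same.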