stringkernel.java
package weka.classifiers.functions.supportVector;

import weka.core.Capabilities;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.Option;
import weka.core.SelectedTag;
import weka.core.Tag;
import weka.core.TechnicalInformation;
import weka.core.Utils;
import weka.core.Capabilities.Capability;
import weka.core.TechnicalInformation.Type;
import weka.core.TechnicalInformation.Field;
import weka.core.TechnicalInformationHandler;

import java.util.Enumeration;
import java.util.Vector;

/**
 <!-- globalinfo-start -->
 * Implementation of the subsequence kernel (SSK) as described in [1] and of
 * the subsequence kernel with lambda pruning (SSK-LP) as described in [2].<br/>
 * <br/>
 * For more information, see<br/>
 * <br/>
 * Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini,
 * Christopher J. C. H. Watkins (2002). Text Classification using String
 * Kernels. Journal of Machine Learning Research. 2:419-444.<br/>
 * <br/>
 * F. Kleedorfer, A. Seewald (2005). Implementation of a String Kernel for
 * WEKA. Wien, Austria.
 * <p/>
 <!-- globalinfo-end -->
 *
 <!-- technical-bibtex-start -->
 * BibTeX:
 * <pre>
 * &#64;article{Lodhi2002,
 *    author = {Huma Lodhi and Craig Saunders and John Shawe-Taylor and Nello Cristianini and Christopher J. C. H. Watkins},
 *    journal = {Journal of Machine Learning Research},
 *    pages = {419-444},
 *    title = {Text Classification using String Kernels},
 *    volume = {2},
 *    year = {2002},
 *    HTTP = {http://www.jmlr.org/papers/v2/lodhi02a.html}
 * }
 *
 * &#64;techreport{Kleedorfer2005,
 *    address = {Wien, Austria},
 *    author = {F. Kleedorfer and A. Seewald},
 *    institution = {Oesterreichisches Forschungsinstitut fuer Artificial Intelligence},
 *    number = {TR-2005-13},
 *    title = {Implementation of a String Kernel for WEKA},
 *    year = {2005}
 * }
 * </pre>
 * <p/>
 <!-- technical-bibtex-end -->
 *
 <!-- options-start -->
 * Valid options are: <p/>
 *
 * <pre> -D
 *  Enables debugging output (if available) to be printed.
 *  (default: off)</pre>
 *
 * <pre> -no-checks
 *  Turns off all checks - use with caution!
 *  (default: checks on)</pre>
 *
 * <pre> -P &lt;0|1&gt;
 *  The pruning method to use:
 *  0 = No pruning
 *  1 = Lambda pruning
 *  (default: 0)</pre>
 *
 * <pre> -C &lt;num&gt;
 *  The size of the cache (a prime number).
 *  (default: 250007)</pre>
 *
 * <pre> -IC &lt;num&gt;
 *  The size of the internal cache (a prime number).
 *  (default: 200003)</pre>
 *
 * <pre> -L &lt;num&gt;
 *  The lambda constant. Penalizes non-continuous subsequence
 *  matches. Must be in (0,1).
 *  (default: 0.5)</pre>
 *
 * <pre> -ssl &lt;num&gt;
 *  The length of the subsequence.
 *  (default: 3)</pre>
 *
 * <pre> -ssl-max &lt;num&gt;
 *  The maximum length of the subsequence.
 *  (default: 9)</pre>
 *
 * <pre> -N
 *  Use normalization.
 *  (default: no)</pre>
 <!-- options-end -->
 *
 * <h1>Theory</h1>
 * <h2>Overview</h2>
 * The algorithm computes a measure of similarity between two texts based on
 * the number and form of their common subsequences, which need not be
 * contiguous. The method can be parametrized by specifying the subsequence
 * length k, the penalty factor lambda, which penalizes non-contiguous matches,
 * and optional 'lambda pruning', which takes maxLambdaExponent,
 * <code>m</code>, as a parameter. Lambda pruning causes very 'stretched'
 * substring matches not to be counted, thus speeding up the computation. The
 * functionality of SSK and SSK-LP is explained below using simple examples.
 *
 * <h2>Explanation &amp; Examples</h2>
 * For all of the following examples, we assume these parameter values:
 * <pre>
 * k=2
 * lambda=0.5
 * m=8 (for SSK-LP examples)
 * </pre>
 *
 * <h3>SSK</h3>
 *
 * <h4>Example 1</h4>
 * <pre>
 * SSK(2,"ab","axb") = 0.5^5 = 0.03125
 * </pre>
 * There is one subsequence of length 2 that both strings have in
 * common, "ab".
 * The result of SSK is computed by raising lambda to the power of L, where L
 * is the length of the subsequence match in the one string plus the length of
 * the subsequence match in the other; in our case:
 * <pre>
 *     ab   axb
 * L =  2 +  3  = 5
 * </pre>
 * Hence, the kernel yields 0.5^5 = 0.03125.
 *
 * <h4>Example 2</h4>
 * <pre>
 * SSK(2,"ab","abb") = 0.5^5 + 0.5^4 = 0.09375
 * </pre>
 * Here, we also have one subsequence of length 2 that both strings have in
 * common, "ab". The result of SSK is computed by summing over the values
 * computed for each occurrence of a common subsequence match. In this
 * example, there are two possible cases:
 * <pre>
 * ab   abb
 * --   --    L=4
 * --   - -   L=5
 * </pre>
 * We have two matches, one of length 2+2=4 and one of length 2+3=5, so we get
 * the result 0.5^5 + 0.5^4 = 0.09375.
 *
 * <h3>SSK-LP</h3>
 * Without lambda pruning, the string kernel finds *all* common subsequences
 * of the given length, whereas with lambda pruning, common subsequence
 * matches that are stretched too far in both strings are not taken into
 * account. The argument is that the value yielded for such a common
 * subsequence, <code>lambda^(length[match_in_s] + length[match_in_t])</code>,
 * is too low to matter. Tests have shown that a tremendous speedup can be
 * achieved using this technique with very little loss of quality. <br>
 * Lambda pruning is parametrized by the maximum lambda exponent. As a rule of
 * thumb, it is a good idea to choose that value to be about 3 or 4 times the
 * subsequence length. YMMV.
 *
 * <h4>Example 3</h4>
 * Without lambda pruning, one common subsequence of length 2, "AB", would be
 * found in the following two strings:
 * <pre>
 * SSK(2,"AxxxxxxxxxB","AyB") = 0.5^14 = 0.00006103515625
 * </pre>
 * Lambda pruning allows for control of the match length. So, if m (the
 * maximum lambda exponent) is e.g.
 * 8, these two strings would yield a kernel value of 0:
 * <pre>
 * with lambda pruning:    SSK-LP(2,8,"AxxxxxxxxxB","AyB") = 0
 * without lambda pruning: SSK(2,"AxxxxxxxxxB","AyB")      = 0.5^14 = 0.00006103515625
 * </pre>
 * This is because the exponent for lambda (= the length of the subsequence
 * match) would be 14, which is &gt; 8. In contrast, the next result is
 * &gt; 0:
 * <pre>
 * m=8
 * SSK-LP(2,8,"AxxB","AyyB") = 0.5^8 = 0.00390625
 * </pre>
 * because the lambda exponent would be 8, which is just accepted by lambda
 * pruning.
 *
 * <h3>Normalization</h3>
 * When the string kernel is used for its main purpose, as the kernel of a
 * support vector machine, it is not normalized. The normalized kernel can be
 * switched on via SMO's feature space normalization switch (-F), but it is
 * much slower. Like most unnormalized kernels, K(x,x) is not a fixed value,
 * as the next example shows.
 *
 * <h4>Example 4</h4>
 * <pre>
 * SSK(2,"ab","ab") = 0.5^4 = 0.0625
 * SSK(2,"AxxxxxxxxxB","AxxxxxxxxxB") = 12.761724710464478
 * </pre>
 * SSK is evaluated twice, each time on two identical strings. A good measure
 * of similarity would produce the same value in both cases, indicating the
 * same level of similarity. The value of the normalized SSK would be 1.0 in
 * both cases. So for the purpose of computing string similarity the
 * normalized kernel should be used. For SVMs the unnormalized kernel is
 * usually sufficient.
 *
 * <h2>Complexity of SSK and SSK-LP</h2>
 * The time complexity of this method (without lambda pruning and with an
 * infinitely large cache) is<br>
 * <pre>O(k*|s|*|t|)</pre>
 * Lambda pruning has a complexity (without caching) of<br>
 * <pre>O(m*binom(m,k)^2*(|s|+n)*|t|)</pre><br>
 * <pre>
 * k...          subsequence length (ssl)
 * s,t...        strings
 * |s|...        length of string s
 * binom(x,y)... binomial coefficient (x!/[(x-y)!y!])
 * m...          maxLambdaExponent (ssl-max)
 * </pre>
 *
 * Keep in mind that execution time can increase quickly for long strings
 * and large values of k, especially if you don't use lambda pruning.
 * With lambda pruning, computation is usually so fast that switching on the
 * cache leads to slower computation because of the setup costs. Therefore,
 * caching is switched off for lambda pruning.
 * <br>
 * <br>
 * For details and qualitative experiments about SSK, see [1].<br>
 * For details about lambda pruning and a performance comparison of SSK and
 * SSK-LP (SSK with lambda pruning), see [2]. Note that the complexity
 * estimation in [2] assumes no caching of intermediate results, which has
 * been implemented in the meantime and greatly improves the speed of the SSK
 * without lambda pruning.
 * <br>
 *
 * <h1>Notes for usage within Weka</h1>
 * Only instances of the following form can be processed using string kernels:
 * <pre>
 * +----------+-------------+---------------+
 * |attribute#|      0      |       1       |
 * +----------+-------------+---------------+
 * | content  | [text data] | [class label] |
 * +----------------------------------------+
 * ... or ...
 * +----------+---------------+-------------+
 * |attribute#|       0       |      1      |
 * +----------+---------------+-------------+
 * | content  | [class label] | [text data] |
 * +----------------------------------------+
 * </pre>
 *
 * @author Florian Kleedorfer (kleedorfer@austria.fm)
 * @author Alexander K. Seewald (alex@seewald.at)
 * @version $Revision: 1.2 $
 */
public class StringKernel
  extends Kernel
  implements TechnicalInformationHandler {

  /** for serialization */
  private static final long serialVersionUID = -4902954211202690123L;

  /** The size of the cache (a prime number) */
  private int m_cacheSize = 250007;

  /** The size of the internal cache for intermediate results (a prime number) */
  private int m_internalCacheSize = 200003;

  /** The attribute number of the string attribute */
  private int m_strAttr;

  /** Kernel cache (i.e., cache for kernel evaluations) */
  private double[] m_storage;
  private long[] m_keys;

  /** Counts the number of kernel evaluations.
   */
  private int m_kernelEvals = 0;

  /** The number of instances in the dataset */
  private int m_numInsts;

  /** Pruning method: No Pruning */
  public final static int PRUNING_NONE = 0;
  /** Pruning method: Lambda pruning. See [2] for details. */
  public final static int PRUNING_LAMBDA = 1;
  /** Pruning methods */
  public static final Tag[] TAGS_PRUNING = {
    new Tag(PRUNING_NONE, "No pruning"),
    new Tag(PRUNING_LAMBDA, "Lambda pruning"),
  };

  /** the pruning method */
  protected int m_PruningMethod = PRUNING_NONE;

  /** the decay factor that penalizes non-continuous substring matches.
   * See [1] for details. */
  protected double m_lambda = 0.5;

  /** The substring length */
  private int m_subsequenceLength = 3;

  /** The maximum substring length for lambda pruning */
  private int m_maxSubsequenceLength = 3;

  /** powers of lambda are prepared prior to kernel evaluations.
   * all powers between 0 and this value are precalculated */
  protected static final int MAX_POWER_OF_LAMBDA = 10000;

  /** the precalculated powers of lambda */
  protected double[] m_powersOflambda = null;

  /** flag for switching normalization on or off. This defaults to false and
   * can be turned on by the switch for feature space normalization in SMO */
  private boolean m_normalize = false;

  /** private cache for intermediate results */
  private int maxCache;   // is set in unnormalizedKernel(s1,s2)
  private double[] cachekh;
  private int[] cachekhK;
  private double[] cachekh2;
  private int[] cachekh2K;
  /** cached indexes for private cache */
  private int m_multX;
  private int m_multY;
  private int m_multZ;
  private int m_multZZ;

  private boolean m_useRecursionCache = true;

  /**
   * default constructor
   */
  public StringKernel() {
    super();
  }

  /**
   * creates a new StringKernel object. Initializes the kernel cache and the
   * 'lambda cache', i.e.
   * the precalculated powers of lambda from lambda^2 to
   * lambda^MAX_POWER_OF_LAMBDA.
   *
   * @param data              the dataset to use
   * @param cacheSize         the size of the cache
   * @param subsequenceLength the subsequence length
   * @param lambda            the lambda value
   * @param debug             whether to output debug information
   * @throws Exception if something goes wrong
   */
  public StringKernel(Instances data, int cacheSize, int subsequenceLength,
      double lambda, boolean debug) throws Exception {

    setDebug(debug);
    setCacheSize(cacheSize);
    setInternalCacheSize(200003);
    setSubsequenceLength(subsequenceLength);
    setMaxSubsequenceLength(-1);
    setLambda(lambda);

    buildKernel(data);
  }

  /**
   * Returns a string describing the kernel
   *
   * @return a description suitable for displaying in the
   *         explorer/experimenter gui
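The worked examples in the Javadoc above can be reproduced with a small, self-contained sketch of the uncached SSK recursion from Lodhi et al. (2002). The class name `NaiveSSK`, its method names, and the `normalizedKernel` helper are illustrative assumptions, not part of Weka's API; Weka's actual `StringKernel` adds the caching and lambda pruning described above.

```java
// NaiveSSK: a tiny, uncached reference implementation of the subsequence
// kernel recursion from Lodhi et al. (2002). Illustrative sketch only;
// not Weka's optimized implementation.
public final class NaiveSSK {
    private final double lambda;  // decay factor, must be in (0,1)

    public NaiveSSK(double lambda) {
        this.lambda = lambda;
    }

    /** K_n(s,t): sum over all common subsequences of length n, each
     *  weighted by lambda^(match span in s + match span in t). */
    public double kernel(int n, String s, String t) {
        if (Math.min(s.length(), t.length()) < n) {
            return 0.0;
        }
        String head = s.substring(0, s.length() - 1);
        char x = s.charAt(s.length() - 1);
        double sum = kernel(n, head, t);  // matches not using the last char of s
        for (int j = 0; j < t.length(); j++) {
            if (t.charAt(j) == x) {
                // complete a length-n match: the two final characters
                // contribute lambda^1 each
                sum += kPrime(n - 1, head, t.substring(0, j)) * lambda * lambda;
            }
        }
        return sum;
    }

    /** Normalized kernel: K(s,t) / sqrt(K(s,s)*K(t,t)); 1.0 for s == t. */
    public double normalizedKernel(int n, String s, String t) {
        return kernel(n, s, t) / Math.sqrt(kernel(n, s, s) * kernel(n, t, t));
    }

    /** K'_i(s,t): auxiliary recursion; weights length-i partial matches by
     *  their span extended to the end of both strings. */
    private double kPrime(int i, String s, String t) {
        if (i == 0) {
            return 1.0;
        }
        if (Math.min(s.length(), t.length()) < i) {
            return 0.0;
        }
        String head = s.substring(0, s.length() - 1);
        char x = s.charAt(s.length() - 1);
        double sum = lambda * kPrime(i, head, t);  // last char of s unused
        for (int j = 0; j < t.length(); j++) {
            if (t.charAt(j) == x) {
                sum += kPrime(i - 1, head, t.substring(0, j))
                     * Math.pow(lambda, t.length() - j + 1);
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        NaiveSSK ssk = new NaiveSSK(0.5);
        // Reproduces the Javadoc's worked examples:
        System.out.println(ssk.kernel(2, "ab", "axb"));  // 0.03125 (= 0.5^5, Example 1)
        System.out.println(ssk.kernel(2, "ab", "abb"));  // 0.09375 (= 0.5^5 + 0.5^4, Example 2)
        System.out.println(ssk.kernel(2, "ab", "ab"));   // 0.0625  (= 0.5^4, Example 4)
    }
}
```

Running `main` prints the values from Examples 1, 2, and 4; `normalizedKernel` returns 1.0 for identical strings, matching the Normalization section. This naive version is exponential in string length and is only meant to make the recursion concrete.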