stringkernel.java
package weka.classifiers.functions.supportVector;

import weka.core.Capabilities;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.Option;
import weka.core.SelectedTag;
import weka.core.Tag;
import weka.core.TechnicalInformation;
import weka.core.Utils;
import weka.core.Capabilities.Capability;
import weka.core.TechnicalInformation.Type;
import weka.core.TechnicalInformation.Field;
import weka.core.TechnicalInformationHandler;

import java.util.Enumeration;
import java.util.Vector;

/**
 <!-- globalinfo-start -->
 * Implementation of the subsequence kernel (SSK) as described in [1] and of
 * the subsequence kernel with lambda pruning (SSK-LP) as described in [2].<br/>
 * <br/>
 * For more information, see<br/>
 * <br/>
 * Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini,
 * Christopher J. C. H. Watkins (2002). Text Classification using String
 * Kernels. Journal of Machine Learning Research. 2:419-444.<br/>
 * <br/>
 * F. Kleedorfer, A. Seewald (2005). Implementation of a String Kernel for
 * WEKA. Wien, Austria.
 * <p/>
 <!-- globalinfo-end -->
 *
 <!-- technical-bibtex-start -->
 * BibTeX:
 * <pre>
 * &#64;article{Lodhi2002,
 *    author = {Huma Lodhi and Craig Saunders and John Shawe-Taylor and Nello Cristianini and Christopher J. C. H. Watkins},
 *    journal = {Journal of Machine Learning Research},
 *    pages = {419-444},
 *    title = {Text Classification using String Kernels},
 *    volume = {2},
 *    year = {2002},
 *    HTTP = {http://www.jmlr.org/papers/v2/lodhi02a.html}
 * }
 *
 * &#64;techreport{Kleedorfer2005,
 *    address = {Wien, Austria},
 *    author = {F. Kleedorfer and A. Seewald},
 *    institution = {Oesterreichisches Forschungsinstitut fuer Artificial Intelligence},
 *    number = {TR-2005-13},
 *    title = {Implementation of a String Kernel for WEKA},
 *    year = {2005}
 * }
 * </pre>
 * <p/>
 <!-- technical-bibtex-end -->
 *
 <!-- options-start -->
 * Valid options are: <p/>
 *
 * <pre> -D
 *  Enables debugging output (if available) to be printed.
 *  (default: off)</pre>
 *
 * <pre> -no-checks
 *  Turns off all checks - use with caution!
 *  (default: checks on)</pre>
 *
 * <pre> -P &lt;0|1&gt;
 *  The pruning method to use:
 *  0 = No pruning
 *  1 = Lambda pruning
 *  (default: 0)</pre>
 *
 * <pre> -C &lt;num&gt;
 *  The size of the cache (a prime number).
 *  (default: 250007)</pre>
 *
 * <pre> -IC &lt;num&gt;
 *  The size of the internal cache (a prime number).
 *  (default: 200003)</pre>
 *
 * <pre> -L &lt;num&gt;
 *  The lambda constant. Penalizes non-continuous subsequence
 *  matches. Must be in (0,1).
 *  (default: 0.5)</pre>
 *
 * <pre> -ssl &lt;num&gt;
 *  The length of the subsequence.
 *  (default: 3)</pre>
 *
 * <pre> -ssl-max &lt;num&gt;
 *  The maximum length of the subsequence.
 *  (default: 9)</pre>
 *
 * <pre> -N
 *  Use normalization.
 *  (default: no)</pre>
 <!-- options-end -->
 *
 * <h1>Theory</h1>
 * <h2>Overview</h2>
 * The algorithm computes a measure of similarity between two texts based on
 * the number and form of their common subsequences, which need not be
 * contiguous. The method can be parametrized by specifying the subsequence
 * length k, the penalty factor lambda, which penalizes non-contiguous matches,
 * and optional 'lambda pruning', which takes maxLambdaExponent,
 * <code>m</code>, as a parameter. Lambda pruning causes very 'stretched'
 * substring matches not to be counted, thus speeding up the computation. The
 * functionality of SSK and SSK-LP is explained below using simple examples.
 *
 * <h2>Explanation &amp; Examples</h2>
 * For all of the following examples, we assume these parameter values:
 * <pre>
 * k=2
 * lambda=0.5
 * m=8 (for SSK-LP examples)
 * </pre>
 *
 * <h3>SSK</h3>
 *
 * <h4>Example 1</h4>
 * <pre>
 * SSK(2,"ab","axb") = 0.5^5 = 0.03125
 * </pre>
 * There is one subsequence of length 2 that both strings have in
 * common, "ab".
 * The result of SSK is computed by raising lambda to the power of L, where L
 * is the length of the subsequence match in the one string plus the length of
 * the subsequence match in the other; in our case:
 * <pre>
 *     ab   axb
 * L =  2 +  3  = 5
 * </pre>
 * Hence, the kernel yields 0.5^5 = 0.03125.
 *
 * <h4>Example 2</h4>
 * <pre>
 * SSK(2,"ab","abb") = 0.5^5 + 0.5^4 = 0.09375
 * </pre>
 * Here, we also have one subsequence of length 2 that both strings have in
 * common, "ab". The result of SSK is computed by summing over the values
 * computed for each occurrence of a common subsequence match. In this
 * example, there are two possible cases:
 * <pre>
 * ab   abb
 * --   --    L=4
 * --   - -   L=5
 * </pre>
 * We have two matches, one of length 2+2=4 and one of length 2+3=5, so we get
 * the result 0.5^5 + 0.5^4 = 0.09375.
 *
 * <h3>SSK-LP</h3>
 * Without lambda pruning, the string kernel finds *all* common subsequences
 * of the given length, whereas with lambda pruning, common subsequence
 * matches that are stretched too far in both strings are not taken into
 * account. The argument is that the value yielded for such a common
 * subsequence, <code>lambda^(length[match_in_s] + length[match_in_t])</code>,
 * is too low to matter. Tests have shown that a tremendous speedup can be
 * achieved using this technique with very little loss of quality. <br>
 * Lambda pruning is parametrized by the maximum lambda exponent. As a rule of
 * thumb, it is a good idea to choose that value to be about 3 or 4 times the
 * subsequence length. YMMV.
 *
 * <h4>Example 3</h4>
 * Without lambda pruning, one common subsequence of length 2, "AB", would be
 * found in the following two strings:
 * <pre>
 * SSK(2,"AxxxxxxxxxB","AyB") = 0.5^14 = 0.00006103515625
 * </pre>
 * Lambda pruning allows for control of the match length. So, if m (the
 * maximum lambda exponent) is e.g.
 * 8, these two strings would yield a kernel value of 0:
 * <pre>
 * with lambda pruning:    SSK-LP(2,8,"AxxxxxxxxxB","AyB") = 0
 * without lambda pruning: SSK(2,"AxxxxxxxxxB","AyB")      = 0.5^14 = 0.00006103515625
 * </pre>
 * This is because the exponent for lambda (= the length of the subsequence
 * match) would be 14, which is &gt; 8. In contrast, the next result is
 * &gt; 0:
 * <pre>
 * m=8
 * SSK-LP(2,8,"AxxB","AyyB") = 0.5^8 = 0.00390625
 * </pre>
 * because the lambda exponent would be 8, which is just accepted by lambda
 * pruning.
 *
 * <h3>Normalization</h3>
 * When the string kernel is used for its main purpose, as the kernel of a
 * support vector machine, it is not normalized. The normalized kernel can be
 * switched on via SMO's feature space normalization switch (-F), but it is
 * much slower. Like most unnormalized kernels, K(x,x) is not a fixed value,
 * as the next example shows.
 *
 * <h4>Example 4</h4>
 * <pre>
 * SSK(2,"ab","ab") = 0.5^4 = 0.0625
 * SSK(2,"AxxxxxxxxxB","AxxxxxxxxxB") = 12.761724710464478
 * </pre>
 * SSK is evaluated twice, each time on two identical strings. A good measure
 * of similarity would produce the same value in both cases, indicating the
 * same level of similarity. The value of the normalized SSK would be 1.0 in
 * both cases. So for the purpose of computing string similarity the
 * normalized kernel should be used. For SVMs the unnormalized kernel is
 * usually sufficient.
 *
 * <h2>Complexity of SSK and SSK-LP</h2>
 * The time complexity of this method (without lambda pruning and with an
 * infinitely large cache) is<br>
 * <pre>O(k*|s|*|t|)</pre>
 * Lambda pruning has a complexity (without caching) of<br>
 * <pre>O(m*binom(m,k)^2*(|s|+n)*|t|)</pre><br>
 * <pre>
 * k...          subsequence length (ssl)
 * s,t...        strings
 * |s|...        length of string s
 * binom(x,y)... binomial coefficient (x!/[(x-y)!y!])
 * m...          maxLambdaExponent (ssl-max)
 * </pre>
 *
 * Keep in mind that execution time can increase quickly for long strings
 * and large values of k, especially if you don't use lambda pruning.
 * With lambda pruning, computation is usually so fast that switching on the
 * cache leads to slower computation because of the setup costs. Therefore,
 * caching is switched off for lambda pruning.
 * <br>
 * <br>
 * For details and qualitative experiments about SSK, see [1].<br>
 * For details about lambda pruning and a performance comparison of SSK and
 * SSK-LP (SSK with lambda pruning), see [2]. Note that the complexity
 * estimation in [2] assumes no caching of intermediate results, which has
 * been implemented in the meantime and greatly improves the speed of the SSK
 * without lambda pruning.
 * <br>
 *
 * <h1>Notes for usage within Weka</h1>
 * Only instances of the following form can be processed using string kernels:
 * <pre>
 * +----------+-------------+---------------+
 * |attribute#|      0      |       1       |
 * +----------+-------------+---------------+
 * | content  | [text data] | [class label] |
 * +----------------------------------------+
 * ... or ...
 * +----------+---------------+-------------+
 * |attribute#|       0       |      1      |
 * +----------+---------------+-------------+
 * | content  | [class label] | [text data] |
 * +----------------------------------------+
 * </pre>
 *
 * @author Florian Kleedorfer (kleedorfer@austria.fm)
 * @author Alexander K. Seewald (alex@seewald.at)
 * @version $Revision: 1.2 $
 */
public class StringKernel
  extends Kernel
  implements TechnicalInformationHandler {

  /** for serialization */
  private static final long serialVersionUID = -4902954211202690123L;

  /** The size of the cache (a prime number) */
  private int m_cacheSize = 250007;

  /** The size of the internal cache for intermediate results (a prime number) */
  private int m_internalCacheSize = 200003;

  /** The attribute number of the string attribute */
  private int m_strAttr;

  /** Kernel cache (i.e., cache for kernel evaluations) */
  private double[] m_storage;
  private long[] m_keys;

  /** Counts the number of kernel evaluations.
   */
  private int m_kernelEvals = 0;

  /** The number of instances in the dataset */
  private int m_numInsts;

  /** Pruning method: No Pruning */
  public final static int PRUNING_NONE = 0;
  /** Pruning method: Lambda pruning. See [2] for details. */
  public final static int PRUNING_LAMBDA = 1;
  /** Pruning methods */
  public static final Tag[] TAGS_PRUNING = {
    new Tag(PRUNING_NONE, "No pruning"),
    new Tag(PRUNING_LAMBDA, "Lambda pruning"),
  };

  /** the pruning method */
  protected int m_PruningMethod = PRUNING_NONE;

  /** the decay factor that penalizes non-continuous substring matches.
   * See [1] for details. */
  protected double m_lambda = 0.5;

  /** The substring length */
  private int m_subsequenceLength = 3;

  /** The maximum substring length for lambda pruning */
  private int m_maxSubsequenceLength = 3;

  /** powers of lambda are prepared prior to kernel evaluations.
   * all powers between 0 and this value are precalculated */
  protected static final int MAX_POWER_OF_LAMBDA = 10000;

  /** the precalculated powers of lambda */
  protected double[] m_powersOflambda = null;

  /** flag for switching normalization on or off. This defaults to false and
   * can be turned on by the switch for feature space normalization in SMO */
  private boolean m_normalize = false;

  /** private cache for intermediate results */
  private int maxCache;   // is set in unnormalizedKernel(s1,s2)
  private double[] cachekh;
  private int[] cachekhK;
  private double[] cachekh2;
  private int[] cachekh2K;
  /** cached indexes for private cache */
  private int m_multX;
  private int m_multY;
  private int m_multZ;
  private int m_multZZ;

  private boolean m_useRecursionCache = true;

  /**
   * default constructor
   */
  public StringKernel() {
    super();
  }

  /**
   * creates a new StringKernel object. Initializes the kernel cache and the
   * 'lambda cache', i.e.
   * the precalculated powers of lambda from lambda^2 to
   * lambda^MAX_POWER_OF_LAMBDA.
   *
   * @param data              the dataset to use
   * @param cacheSize         the size of the cache
   * @param subsequenceLength the subsequence length
   * @param lambda            the lambda value
   * @param debug             whether to output debug information
   * @throws Exception if something goes wrong
   */
  public StringKernel(Instances data, int cacheSize, int subsequenceLength,
      double lambda, boolean debug) throws Exception {

    setDebug(debug);
    setCacheSize(cacheSize);
    setInternalCacheSize(200003);
    setSubsequenceLength(subsequenceLength);
    setMaxSubsequenceLength(-1);
    setLambda(lambda);

    buildKernel(data);
  }

  /**
   * Returns a string describing the kernel
   *
   * @return a description suitable for displaying in the
   *         explorer/experimenter gui
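The worked examples in the Javadoc above can be reproduced with a small, self-contained sketch of the uncached SSK recursion from Lodhi et al. (2002). The class name `NaiveSSK`, its method names, and the `normalizedKernel` helper are illustrative assumptions, not part of Weka's API; Weka's actual `StringKernel` adds the caching and lambda pruning described above.

```java
// NaiveSSK: a tiny, uncached reference implementation of the subsequence
// kernel recursion from Lodhi et al. (2002). Illustrative sketch only;
// not Weka's optimized implementation.
public final class NaiveSSK {
    private final double lambda;  // decay factor, must be in (0,1)

    public NaiveSSK(double lambda) {
        this.lambda = lambda;
    }

    /** K_n(s,t): sum over all common subsequences of length n, each
     *  weighted by lambda^(match span in s + match span in t). */
    public double kernel(int n, String s, String t) {
        if (Math.min(s.length(), t.length()) < n) {
            return 0.0;
        }
        String head = s.substring(0, s.length() - 1);
        char x = s.charAt(s.length() - 1);
        double sum = kernel(n, head, t);  // matches not using the last char of s
        for (int j = 0; j < t.length(); j++) {
            if (t.charAt(j) == x) {
                // complete a length-n match: the two final characters
                // contribute lambda^1 each
                sum += kPrime(n - 1, head, t.substring(0, j)) * lambda * lambda;
            }
        }
        return sum;
    }

    /** Normalized kernel: K(s,t) / sqrt(K(s,s)*K(t,t)); 1.0 for s == t. */
    public double normalizedKernel(int n, String s, String t) {
        return kernel(n, s, t) / Math.sqrt(kernel(n, s, s) * kernel(n, t, t));
    }

    /** K'_i(s,t): auxiliary recursion; weights length-i partial matches by
     *  their span extended to the end of both strings. */
    private double kPrime(int i, String s, String t) {
        if (i == 0) {
            return 1.0;
        }
        if (Math.min(s.length(), t.length()) < i) {
            return 0.0;
        }
        String head = s.substring(0, s.length() - 1);
        char x = s.charAt(s.length() - 1);
        double sum = lambda * kPrime(i, head, t);  // last char of s unused
        for (int j = 0; j < t.length(); j++) {
            if (t.charAt(j) == x) {
                sum += kPrime(i - 1, head, t.substring(0, j))
                     * Math.pow(lambda, t.length() - j + 1);
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        NaiveSSK ssk = new NaiveSSK(0.5);
        // Reproduces the Javadoc's worked examples:
        System.out.println(ssk.kernel(2, "ab", "axb"));  // 0.03125 (= 0.5^5, Example 1)
        System.out.println(ssk.kernel(2, "ab", "abb"));  // 0.09375 (= 0.5^5 + 0.5^4, Example 2)
        System.out.println(ssk.kernel(2, "ab", "ab"));   // 0.0625  (= 0.5^4, Example 4)
    }
}
```

Running `main` prints the values from Examples 1, 2, and 4; `normalizedKernel` returns 1.0 for identical strings, matching the Normalization section. This naive version is exponential in string length and is only meant to make the recursion concrete.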