📄 LogisticRegression.java
/*
 * LingPipe v. 3.5
 * Copyright (C) 2003-2008 Alias-i
 *
 * This program is licensed under the Alias-i Royalty Free License
 * Version 1 WITHOUT ANY WARRANTY, without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the Alias-i
 * Royalty Free License Version 1 for more details.
 *
 * You should have received a copy of the Alias-i Royalty Free License
 * Version 1 along with this program; if not, visit
 * http://alias-i.com/lingpipe/licenses/lingpipe-license-1.txt or contact
 * Alias-i, Inc. at 181 North 11th Street, Suite 401, Brooklyn, NY 11211,
 * +1 (718) 290-9170.
 */

package com.aliasi.stats;

import com.aliasi.matrix.DenseVector;
import com.aliasi.matrix.Matrices;
import com.aliasi.matrix.Vector;
import com.aliasi.matrix.SparseFloatVector;

import com.aliasi.util.AbstractExternalizable;
import com.aliasi.util.Compilable;
import com.aliasi.util.Strings;

import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;
import java.io.PrintWriter;
import java.io.Serializable;

import java.util.Arrays;

/**
 * A <code>LogisticRegression</code> instance is a multi-class vector
 * classifier model generating conditional probability estimates of
 * categories.  This class also provides static factory methods for
 * estimating multinomial regression models using stochastic gradient
 * descent (SGD) to find maximum likelihood or maximum a posteriori
 * (MAP) estimates with Laplace, Gaussian, or Cauchy priors on
 * coefficients.
 *
 * <p>The classification package contains a class {@link
 * com.aliasi.classify.LogisticRegressionClassifier} which adapts this
 * class's models and estimators to act as generic classifiers given
 * an instance of {@link com.aliasi.util.FeatureExtractor}.
 *
 * <table border="0" style="font-size:80%; width:80%; margin:0 1em 0 2em">
 * <tr><td>
 * <b>Also Known As (AKA)</b>
 *
 * <p>Multinomial logistic regression is also known as polytomous,
 * polychotomous, or multi-class logistic regression, or just
 * multilogit regression.
 *
 * <p>Binary logistic regression is an instance of a generalized
 * linear model (GLM) with the logit link function.  The logit
 * function is the inverse of the logistic function, and the logistic
 * function is sometimes called the sigmoid function or the s-curve.
 *
 * <p>Logistic regression estimation obeys the maximum entropy
 * principle, and thus logistic regression is sometimes called
 * "maximum entropy modeling", and the resulting classifier
 * the "maximum entropy classifier".
 *
 * <p>The generalization of binomial logistic regression to
 * multinomial logistic regression is sometimes called a softmax or
 * exponential model.
 *
 * <p>Maximum a posteriori (MAP) estimation with Gaussian priors is often
 * referred to as "ridge regression"; with Laplace priors
 * MAP estimation is known as the "lasso".  MAP estimation
 * with Gaussian, Laplace or Cauchy priors is known as parameter
 * shrinkage.  Gaussian and Laplace are forms of regularized
 * regression, with the Gaussian version being regularized with the
 * L<sub>2</sub> norm (Euclidean distance, called the Frobenius norm
 * for matrices of parameters) and the Laplace version being
 * regularized with the L<sub>1</sub> norm (taxicab distance or
 * Manhattan metric); other Minkowski metrics may be used for
 * shrinkage.
 *
 * <p>Binary logistic regression is equivalent to a one-layer,
 * single-output neural network with a logistic activation function
 * trained under log loss.  This is sometimes called classification
 * with a single neuron.</td></tr></table>
 *
 * <h4>Model Parameters</h4>
 *
 * A logistic regression model is a discriminative classifier for
 * vectors of fixed dimensionality.  The dimensions are often referred
 * to as "features".  The method {@link
 * #numInputDimensions()} returns the number of dimensions (features)
 * in the model.  Because the model is well-behaved under sparse
 * vectors, the dimensionality may be returned as {@link
 * Integer#MAX_VALUE}, a common default choice for sparse vectors.
 *
 * <p>A logistic regression model also fixes the number of output
 * categories.  The method {@link #numOutcomes()} returns the number
 * of categories.  These outcome categories will be represented as
 * integers from <code>0</code> to <code>numOutcomes()-1</code>
 * inclusive.
 *
 * <p>A model is parameterized by a real-valued vector for every
 * category other than the last, each of which must be of the same
 * dimensionality as the model's input feature dimensionality.  The
 * constructor {@link #LogisticRegression(Vector[])} takes an array of
 * {@link Vector} objects, which may be dense or sparse, but must all
 * be of the same dimensionality.
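 *
 * <p>For example, a three-category model over four input dimensions could
 * be constructed from two coefficient vectors, one per category other than
 * the last; this is only a sketch, and the coefficient values below are
 * made up for illustration:
 *
 * <blockquote><pre>
 * // hypothetical coefficients for a 3-category, 4-dimension model
 * Vector beta0 = new DenseVector(new double[] { 0.5, -1.0,  2.0, 0.0 });
 * Vector beta1 = new DenseVector(new double[] { 1.5,  0.0, -0.5, 3.0 });
 * LogisticRegression model
 *     = new LogisticRegression(new Vector[] { beta0, beta1 });
 * int numCategories = model.numOutcomes();        // 3: one per vector, plus the last category
 * int numDimensions = model.numInputDimensions(); // 4</pre></blockquote>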
 *
 * <h4>Likelihood</h4>
 *
 * <p>The likelihood of a given output category <code>c <
 * numOutcomes()</code> given an input vector <code>x</code> of
 * dimensionality <code>numInputDimensions()</code> is given by:
 *
 * <blockquote><pre>
 * p(c | x, β) = exp(β<sub>c</sub> * x) / Z(x)   if c < numOutcomes()-1
 *
 *               1 / Z(x)                        if c = numOutcomes()-1</pre></blockquote>
 *
 * where <code>β<sub>c</sub> * x</code> is the vector dot (or inner)
 * product:
 *
 * <blockquote><pre>
 * β<sub>c</sub> * x = <big><big><big>Σ</big></big></big><sub>i < numInputDimensions()</sub> β<sub>c,i</sub> * x<sub>i</sub></pre></blockquote>
 *
 * and where the normalizing denominator, called the partition function,
 * is:
 *
 * <blockquote><pre>
 * Z(x) = 1 + <big><big><big>Σ</big></big></big><sub>k < numOutcomes()-1</sub> exp(β<sub>k</sub> * x)</pre></blockquote>
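 *
 * <p>A minimal sketch of how these conditional probabilities may be
 * computed directly from the definitions above (plain arrays, not this
 * class's internal representation):
 *
 * <blockquote><pre>
 * // sketch: p(c | x, beta) for all categories, per the definitions above;
 * // beta[c] holds the coefficient vector for category c < numOutcomes()-1
 * static double[] conditionalProbs(double[][] beta, double[] x) {
 *     int numOutcomes = beta.length + 1;
 *     double[] p = new double[numOutcomes];
 *     p[numOutcomes - 1] = 1.0;                // last category is unnormalized 1
 *     double z = 1.0;                          // partition function Z(x), starts with last category
 *     for (int c = 0; c < beta.length; ++c) {
 *         double dot = 0.0;
 *         for (int i = 0; i < x.length; ++i)
 *             dot += beta[c][i] * x[i];        // beta_c * x
 *         p[c] = Math.exp(dot);                // unnormalized exp(beta_c * x)
 *         z += p[c];
 *     }
 *     for (int c = 0; c < numOutcomes; ++c)
 *         p[c] /= z;                           // normalize by Z(x)
 *     return p;
 * }</pre></blockquote>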
 *
 * <h4>Error and Gradient</h4>
 *
 * This class computes maximum a posteriori parameter values given a
 * sequence of training pairs <code>(x,c)</code> and a prior, which
 * must be an instance of {@link RegressionPrior}.  The error function
 * is just the negative log likelihood and log prior:
 *
 * <blockquote><pre>
 * Err(D,β) = -<big>(</big> log<sub>2</sub> p(β|σ<sup>2</sup>) + <big><big><big>Σ</big></big></big><sub>{(x,c') in D}</sub> log<sub>2</sub> p(c'|x,β) <big>)</big></pre></blockquote>
 *
 * where <code>p(β|σ<sup>2</sup>)</code> is the likelihood of the parameters
 * <code>β</code> in the prior, and <code>p(c|x,β)</code> is
 * the probability of category <code>c</code> given input <code>x</code>
 * and parameters <code>β</code>.
 *
 * <p>The maximum a posteriori estimate is such that the gradient
 * (the vector of partial derivatives of the error with respect to the
 * parameters) is zero.  If the data is not linearly separable, a
 * maximum likelihood solution must exist.  If the data is not linearly
 * separable and none of the data dimensions is colinear, the solution
 * will be unique.  If there is an informative Cauchy, Gaussian or
 * Laplace prior, there will be a unique MAP solution even in the face
 * of linear separability or colinear dimensions.  Proofs of solution
 * existence require showing that the matrix of second partial
 * derivatives of the error with respect to pairs of parameters is
 * positive semi-definite; if it is positive definite, the error is
 * strictly convex and the MAP solution is unique.
 *
 * <p>The gradient for parameter vector <code>β<sub>c</sub></code> for
 * outcome <code>c < numOutcomes()-1</code> is:
 *
 * <blockquote><pre>
 * grad(Err(D,β<sub>c</sub>))
 * = ∂Err(D,β) / ∂β<sub>c</sub>
 * = ∂(- log p(β|σ<sup>2</sup>)) / ∂β<sub>c</sub>
 *   + ∂(- <big><big>Σ</big></big><sub>{(x,c') in D}</sub> log p(c' | x, β)) / ∂β<sub>c</sub></pre></blockquote>
 *
 * where the gradients of the priors are described in the
 * class documentation for {@link RegressionPrior}, and the
 * gradient of the likelihood function is:
 *
 * <blockquote><pre>
 * ∂(-<big><big>Σ</big></big><sub>{(x,c') in D}</sub> log p(c' | x, β)) / ∂β<sub>c</sub>
 * = - <big><big>Σ</big></big><sub>{(x,c') in D}</sub> ∂ log p(c' | x, β) / ∂β<sub>c</sub>
 * = <big><big>Σ</big></big><sub>{(x,c') in D}</sub> x * (p(c | x, β) - I(c = c'))</pre></blockquote>
 *
 * where the indicator function <code>I(c=c')</code> is equal to 1 if
 * <code>c=c'</code> and equal to 0 otherwise.
 *
 * <h4>Intercept Term</h4>
 *
 * It is conventional to assume that inputs have their first dimension
 * reserved for the constant <code>1</code>, which makes the
 * parameters <code>β<sub>c,0</sub></code> intercepts.  The priors
 * allow the intercept to be given an uninformative prior even if the
 * other dimensions have informative priors.
 *
 * <h4>Feature Normalization</h4>
 *
 * It is also common to convert inputs to z-scores in logistic
 * regression.  The z-score is computed given the mean and deviation
 * of each dimension.  The problem with centering (subtracting the mean
 * from each value) is that it destroys sparsity.  We recommend not
 * centering and using an intercept term with an uninformative prior.
 *
 * <p>Variance normalization can be achieved by setting the variance
 * prior parameter independently for each dimension.
 *
 * <h4>Non-Linear and Interaction Features</h4>
 *
 * It is common in logistic regression to include derived features
 * which represent non-linear combinations of other input features.
 * Typically, this is done through multiplication.  For instance, if
 * the output is a quadratic function of an input dimension
 * <code>i</code>, then in addition to the raw value
 * <code>x<sub>i</sub></code>, another feature <code>j</code> may be
 * introduced with value <code>x<sub>i</sub><sup>2</sup></code>.
 *
 * <p>Similarly, interaction terms are often added for features
 * <code>x<sub>i</sub></code> and <code>x<sub>j</sub></code>,
 * with a new feature <code>x<sub>k</sub></code> being defined
 * with value <code>x<sub>i</sub> * x<sub>j</sub></code>.
 *
 * <p>The resulting model is linear in the derived features, but
 * will no longer be linear in the original features.
 *
 * <h4>Stochastic Gradient Descent</h4>
 *
 * <p>This class estimates logistic regression models using stochastic
 * gradient descent (SGD).  The SGD method runs through the data one
 * or more times, considering one training case at a time, adjusting
 * the parameters along some multiple of the contribution to the gradient
 * of the error for that case.
 *
 * <p>With informative priors, the error function is strictly convex,
 * and there will be a unique solution.  In cases of linear dependence
 * between dimensions or in separable data, maximum likelihood
 * estimation may diverge.
 *
 * <p>The basic algorithm is:
 *
 * <blockquote><pre>
 * β = 0;
 * for (epoch = 0; epoch < maxEpochs; ++epoch)
 *     for training case (x,c') in D
 *         for category c < numOutcomes-1
 *             β<sub>c</sub> -= learningRate(epoch) * grad(Err(x,c,c',β,σ<sup>2</sup>))
 *     if (epoch > minEpochs && converged)
 *         return β</pre></blockquote>
 *
 * where we discuss the learning rate and convergence conditions
 * in the next section.  The gradient of the error is described
 * above, and the gradient contribution of the prior and its
 * parameters <code>σ</code> are described in the class
 * documentation for {@link RegressionPrior}.  Note that the error
 * gradient must be divided by the number of training cases to
 * get the incremental contribution of the prior gradient.
 *
 * The actual algorithm uses a lazy form of updating the contribution
 * of the gradient of the prior.  The result is an algorithm that
 * handles sparse input data, touching only the non-zero dimensions of
 * inputs during parameter updates.
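 *
 * <p>As a rough illustration of the inner update of the basic algorithm
 * above, the following sketch applies the likelihood part of the gradient
 * for a single training case <code>(x,c')</code>; the prior contribution
 * and the lazy updates used by the actual implementation are omitted, and
 * the helper names are hypothetical:
 *
 * <blockquote><pre>
 * // one SGD step for a single case (x, targetCategory), ignoring the prior
 * static void sgdStep(double[][] beta, double[] x, int targetCategory,
 *                     double learningRate) {
 *     double[] p = conditionalProbs(beta, x);   // sketch defined earlier
 *     for (int c = 0; c < beta.length; ++c) {
 *         // residual is p(c | x, beta) - I(c = c')
 *         double residual = p[c] - (c == targetCategory ? 1.0 : 0.0);
 *         for (int i = 0; i < x.length; ++i)
 *             beta[c][i] -= learningRate * residual * x[i];
 *     }
 * }</pre></blockquote>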
 *
 * <h4>Learning Parameters</h4>
 *
 * In addition to the model parameters (including priors) and training
 * data (input vectors and reference categories), the regression
 * estimation method also requires four parameters that control the
 * search.  The simplest search parameters are the minimum and maximum
 * epoch parameters, which control the number of epochs used for
 * optimization.
 *
 * <p>The argument <code>minImprovement</code> determines how much
 * improvement in training data and model log likelihood under the
 * current model is necessary to go on to the next epoch.  This is
 * measured relatively, with the algorithm stopping when the current
 * epoch's error <code>err</code> is relatively close to the previous
 * epoch's error, <code>errLast</code>:
 *
 * <blockquote><pre>
 * abs(err - errLast) / (abs(err) + abs(errLast)) < minImprovement</pre></blockquote>
 *
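 * <p>For concreteness, the relative-improvement test above could be coded
 * as follows; this is a sketch only, as the estimator's actual bookkeeping
 * is not shown in this excerpt:
 *
 * <blockquote><pre>
 * // relative convergence test between successive epoch errors
 * static boolean converged(double err, double errLast, double minImprovement) {
 *     double relativeDiff
 *         = Math.abs(err - errLast) / (Math.abs(err) + Math.abs(errLast));
 *     return relativeDiff < minImprovement;
 * }</pre></blockquote>
 *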