📄 LogisticRegression.java
/*
 * LingPipe v. 3.5
 * Copyright (C) 2003-2008 Alias-i
 *
 * This program is licensed under the Alias-i Royalty Free License
 * Version 1 WITHOUT ANY WARRANTY, without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the Alias-i
 * Royalty Free License Version 1 for more details.
 *
 * You should have received a copy of the Alias-i Royalty Free License
 * Version 1 along with this program; if not, visit
 * http://alias-i.com/lingpipe/licenses/lingpipe-license-1.txt or contact
 * Alias-i, Inc. at 181 North 11th Street, Suite 401, Brooklyn, NY 11211,
 * +1 (718) 290-9170.
 */

package com.aliasi.stats;

import com.aliasi.matrix.DenseVector;
import com.aliasi.matrix.Matrices;
import com.aliasi.matrix.Vector;
import com.aliasi.matrix.SparseFloatVector;

import com.aliasi.util.AbstractExternalizable;
import com.aliasi.util.Compilable;
import com.aliasi.util.Strings;

import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;
import java.io.PrintWriter;
import java.io.Serializable;

import java.util.Arrays;

/**
 * A <code>LogisticRegression</code> instance is a multi-class vector
 * classifier model generating conditional probability estimates of
 * categories.  This class also provides static factory methods for
 * estimating multinomial regression models using stochastic gradient
 * descent (SGD) to find maximum likelihood or maximum a posteriori
 * (MAP) estimates with Laplace, Gaussian, or Cauchy priors on
 * coefficients.
 *
 * <p>The classification package contains a class {@link
 * com.aliasi.classify.LogisticRegressionClassifier} which adapts this
 * class's models and estimators to act as generic classifiers given
 * an instance of {@link com.aliasi.util.FeatureExtractor}.
 *
 * <table border="0" style="font-size:80%; width:80%; margin:0 1em 0 2em">
 * <tr><td>
 * <b>Also Known As (AKA)</b>
 *
 * <p>Multinomial logistic regression is also known as polytomous,
 * polychotomous, or multi-class logistic regression, or just
 * multilogit regression.
 *
 * <p>Binary logistic regression is an instance of a generalized
 * linear model (GLM) with the logit link function.  The logit
 * function is the inverse of the logistic function, and the logistic
 * function is sometimes called the sigmoid function or the s-curve.
 *
 * <p>Logistic regression estimation obeys the maximum entropy
 * principle, and thus logistic regression is sometimes called
 * "maximum entropy modeling", and the resulting classifier
 * the "maximum entropy classifier".
 *
 * <p>The generalization of binomial logistic regression to
 * multinomial logistic regression is sometimes called a softmax or
 * exponential model.
 *
 * <p>Maximum a posteriori (MAP) estimation with Gaussian priors is often
 * referred to as "ridge regression"; with Laplace priors
 * MAP estimation is known as the "lasso".  MAP estimation
 * with Gaussian, Laplace or Cauchy priors is known as parameter
 * shrinkage.  Gaussian and Laplace are forms of regularized
 * regression, with the Gaussian version being regularized with the
 * L<sub>2</sub> norm (Euclidean distance, called the Frobenius norm
 * for matrices of parameters) and the Laplace version being
 * regularized with the L<sub>1</sub> norm (taxicab distance or
 * Manhattan metric); other Minkowski metrics may be used for
 * shrinkage.
 *
 * <p>Binary logistic regression is equivalent to a one-layer,
 * single-output neural network with a logistic activation function
 * trained under log loss.  This is sometimes called classification
 * with a single neuron.</td></tr></table>
 *
 * <h4>Model Parameters</h4>
 *
 * A logistic regression model is a discriminative classifier for
 * vectors of fixed dimensionality.  The dimensions are often referred
 * to as "features".  The method {@link
 * #numInputDimensions()} returns the number of dimensions (features)
 * in the model.  Because the model is well-behaved under sparse
 * vectors, the dimensionality may be returned as {@link
 * Integer#MAX_VALUE}, a common default choice for sparse vectors.
 *
 * <p>A logistic regression model also fixes the number of output
 * categories.  The method {@link #numOutcomes()} returns the number
 * of categories.  These outcome categories will be represented as
 * integers from <code>0</code> to <code>numOutcomes()-1</code>
 * inclusive.
 *
 * <p>A model is parameterized by a real-valued vector for every
 * category other than the last, each of which must be of the same
 * dimensionality as the model's input feature dimensionality.  The
 * constructor {@link #LogisticRegression(Vector[])} takes an array of
 * {@link Vector} objects, which may be dense or sparse, but must all
 * be of the same dimensionality.
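 *
 * <p>For example, a three-category model over four input dimensions could
 * be constructed from two coefficient vectors, one per category other than
 * the last; this is only a sketch, and the coefficient values below are
 * made up for illustration:
 *
 * <blockquote><pre>
 * // hypothetical coefficients for a 3-category, 4-dimension model
 * Vector beta0 = new DenseVector(new double[] { 0.5, -1.0,  2.0, 0.0 });
 * Vector beta1 = new DenseVector(new double[] { 1.5,  0.0, -0.5, 3.0 });
 * LogisticRegression model
 *     = new LogisticRegression(new Vector[] { beta0, beta1 });
 * int numCategories = model.numOutcomes();        // 3: one per vector, plus the last category
 * int numDimensions = model.numInputDimensions(); // 4</pre></blockquote>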
 *
 * <h4>Likelihood</h4>
 *
 * <p>The likelihood of a given output category <code>c <
 * numOutcomes()</code> given an input vector <code>x</code> of
 * dimensionality <code>numInputDimensions()</code> is given by:
 *
 * <blockquote><pre>
 * p(c | x, β) = exp(β<sub>c</sub> * x) / Z(x)   if c < numOutcomes()-1
 *
 *               1 / Z(x)                        if c = numOutcomes()-1</pre></blockquote>
 *
 * where <code>β<sub>c</sub> * x</code> is the vector dot (or inner)
 * product:
 *
 * <blockquote><pre>
 * β<sub>c</sub> * x = <big><big><big>Σ</big></big></big><sub>i < numInputDimensions()</sub> β<sub>c,i</sub> * x<sub>i</sub></pre></blockquote>
 *
 * and where the normalizing denominator, called the partition function,
 * is:
 *
 * <blockquote><pre>
 * Z(x) = 1 + <big><big><big>Σ</big></big></big><sub>k < numOutcomes()-1</sub> exp(β<sub>k</sub> * x)</pre></blockquote>
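 *
 * <p>A minimal sketch of how these conditional probabilities may be
 * computed directly from the definitions above (plain arrays, not this
 * class's internal representation):
 *
 * <blockquote><pre>
 * // sketch: p(c | x, beta) for all categories, per the definitions above;
 * // beta[c] holds the coefficient vector for category c < numOutcomes()-1
 * static double[] conditionalProbs(double[][] beta, double[] x) {
 *     int numOutcomes = beta.length + 1;
 *     double[] p = new double[numOutcomes];
 *     p[numOutcomes - 1] = 1.0;                // last category is unnormalized 1
 *     double z = 1.0;                          // partition function Z(x), starts with last category
 *     for (int c = 0; c < beta.length; ++c) {
 *         double dot = 0.0;
 *         for (int i = 0; i < x.length; ++i)
 *             dot += beta[c][i] * x[i];        // beta_c * x
 *         p[c] = Math.exp(dot);                // unnormalized exp(beta_c * x)
 *         z += p[c];
 *     }
 *     for (int c = 0; c < numOutcomes; ++c)
 *         p[c] /= z;                           // normalize by Z(x)
 *     return p;
 * }</pre></blockquote>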
 *
 * <h4>Error and Gradient</h4>
 *
 * This class computes maximum a posteriori parameter values given a
 * sequence of training pairs <code>(x,c)</code> and a prior, which
 * must be an instance of {@link RegressionPrior}.  The error function
 * is just the negative log likelihood and log prior:
 *
 * <blockquote><pre>
 * Err(D,β) = -<big>(</big> log<sub>2</sub> p(β|σ<sup>2</sup>) + <big><big><big>Σ</big></big></big><sub>{(x,c') in D}</sub> log<sub>2</sub> p(c'|x,β) <big>)</big></pre></blockquote>
 *
 * where <code>p(β|σ<sup>2</sup>)</code> is the likelihood of the parameters
 * <code>β</code> in the prior, and <code>p(c|x,β)</code> is
 * the probability of category <code>c</code> given input <code>x</code>
 * and parameters <code>β</code>.
 *
 * <p>The maximum a posteriori estimate is such that the gradient
 * (the vector of partial derivatives of the error with respect to the
 * parameters) is zero.  If the data is not linearly separable, a
 * maximum likelihood solution must exist.  If the data is not linearly
 * separable and none of the data dimensions is colinear, the solution
 * will be unique.  If there is an informative Cauchy, Gaussian or
 * Laplace prior, there will be a unique MAP solution even in the face
 * of linear separability or colinear dimensions.  Proofs of solution
 * existence require showing that the matrix of second partial
 * derivatives of the error with respect to pairs of parameters is
 * positive semi-definite; if it is positive definite, the error is
 * strictly convex and the MAP solution is unique.
 *
 * <p>The gradient for parameter vector <code>β<sub>c</sub></code> for
 * outcome <code>c < numOutcomes()-1</code> is:
 *
 * <blockquote><pre>
 * grad(Err(D,β<sub>c</sub>))
 * = ∂Err(D,β) / ∂β<sub>c</sub>
 * = ∂(- log p(β|σ<sup>2</sup>)) / ∂β<sub>c</sub>
 *   + ∂(- <big><big>Σ</big></big><sub>{(x,c') in D}</sub> log p(c' | x, β)) / ∂β<sub>c</sub></pre></blockquote>
 *
 * where the gradients of the priors are described in the
 * class documentation for {@link RegressionPrior}, and the
 * gradient of the likelihood function is:
 *
 * <blockquote><pre>
 * ∂(-<big><big>Σ</big></big><sub>{(x,c') in D}</sub> log p(c' | x, β)) / ∂β<sub>c</sub>
 * = - <big><big>Σ</big></big><sub>{(x,c') in D}</sub> ∂ log p(c' | x, β) / ∂β<sub>c</sub>
 * = <big><big>Σ</big></big><sub>{(x,c') in D}</sub> x * (p(c | x, β) - I(c = c'))</pre></blockquote>
 *
 * where the indicator function <code>I(c=c')</code> is equal to 1 if
 * <code>c=c'</code> and equal to 0 otherwise.
 *
 * <h4>Intercept Term</h4>
 *
 * It is conventional to assume that inputs have their first dimension
 * reserved for the constant <code>1</code>, which makes the
 * parameters <code>β<sub>c,0</sub></code> intercepts.  The priors
 * allow the intercept to be given an uninformative prior even if the
 * other dimensions have informative priors.
 *
 * <h4>Feature Normalization</h4>
 *
 * It is also common to convert inputs to z-scores in logistic
 * regression.  The z-score is computed given the mean and deviation
 * of each dimension.  The problem with centering (subtracting the mean
 * from each value) is that it destroys sparsity.  We recommend not
 * centering and using an intercept term with an uninformative prior.
 *
 * <p>Variance normalization can be achieved by setting the variance
 * prior parameter independently for each dimension.
 *
 * <h4>Non-Linear and Interaction Features</h4>
 *
 * It is common in logistic regression to include derived features
 * which represent non-linear combinations of other input features.
 * Typically, this is done through multiplication.  For instance, if
 * the output is a quadratic function of an input dimension
 * <code>i</code>, then in addition to the raw value
 * <code>x<sub>i</sub></code>, another feature <code>j</code> may be
 * introduced with value <code>x<sub>i</sub><sup>2</sup></code>.
 *
 * <p>Similarly, interaction terms are often added for features
 * <code>x<sub>i</sub></code> and <code>x<sub>j</sub></code>,
 * with a new feature <code>x<sub>k</sub></code> being defined
 * with value <code>x<sub>i</sub> * x<sub>j</sub></code>.
 *
 * <p>The resulting model is linear in the derived features, but
 * will no longer be linear in the original features.
 *
 * <h4>Stochastic Gradient Descent</h4>
 *
 * <p>This class estimates logistic regression models using stochastic
 * gradient descent (SGD).  The SGD method runs through the data one
 * or more times, considering one training case at a time, adjusting
 * the parameters along some multiple of the contribution to the gradient
 * of the error for that case.
 *
 * <p>With informative priors, the error function is strictly convex,
 * and there will be a unique solution.  In cases of linear dependence
 * between dimensions or in separable data, maximum likelihood
 * estimation may diverge.
 *
 * <p>The basic algorithm is:
 *
 * <blockquote><pre>
 * β = 0;
 * for (epoch = 0; epoch < maxEpochs; ++epoch)
 *     for training case (x,c') in D
 *         for category c < numOutcomes-1
 *             β<sub>c</sub> -= learningRate(epoch) * grad(Err(x,c,c',β,σ<sup>2</sup>))
 *     if (epoch > minEpochs && converged)
 *         return β</pre></blockquote>
 *
 * where we discuss the learning rate and convergence conditions
 * in the next section.  The gradient of the error is described
 * above, and the gradient contribution of the prior and its
 * parameters <code>σ</code> are described in the class
 * documentation for {@link RegressionPrior}.  Note that the error
 * gradient must be divided by the number of training cases to
 * get the incremental contribution of the prior gradient.
 *
 * The actual algorithm uses a lazy form of updating the contribution
 * of the gradient of the prior.  The result is an algorithm that
 * handles sparse input data, touching only the non-zero dimensions of
 * inputs during parameter updates.
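 *
 * <p>As a rough illustration of the inner update of the basic algorithm
 * above, the following sketch applies the likelihood part of the gradient
 * for a single training case <code>(x,c')</code>; the prior contribution
 * and the lazy updates used by the actual implementation are omitted, and
 * the helper names are hypothetical:
 *
 * <blockquote><pre>
 * // one SGD step for a single case (x, targetCategory), ignoring the prior
 * static void sgdStep(double[][] beta, double[] x, int targetCategory,
 *                     double learningRate) {
 *     double[] p = conditionalProbs(beta, x);   // sketch defined earlier
 *     for (int c = 0; c < beta.length; ++c) {
 *         // residual is p(c | x, beta) - I(c = c')
 *         double residual = p[c] - (c == targetCategory ? 1.0 : 0.0);
 *         for (int i = 0; i < x.length; ++i)
 *             beta[c][i] -= learningRate * residual * x[i];
 *     }
 * }</pre></blockquote>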
 *
 * <h4>Learning Parameters</h4>
 *
 * In addition to the model parameters (including priors) and training
 * data (input vectors and reference categories), the regression
 * estimation method also requires four parameters that control the
 * search.  The simplest search parameters are the minimum and maximum
 * epoch parameters, which control the number of epochs used for
 * optimization.
 *
 * <p>The argument <code>minImprovement</code> determines how much
 * improvement in training data and model log likelihood under the
 * current model is necessary to go on to the next epoch.  This is
 * measured relatively, with the algorithm stopping when the current
 * epoch's error <code>err</code> is relatively close to the previous
 * epoch's error, <code>errLast</code>:
 *
 * <blockquote><pre>
 * abs(err - errLast) / (abs(err) + abs(errLast)) < minImprovement</pre></blockquote>
 *
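 * <p>For concreteness, the relative-improvement test above could be coded
 * as follows; this is a sketch only, as the estimator's actual bookkeeping
 * is not shown in this excerpt:
 *
 * <blockquote><pre>
 * // relative convergence test between successive epoch errors
 * static boolean converged(double err, double errLast, double minImprovement) {
 *     double relativeDiff
 *         = Math.abs(err - errLast) / (Math.abs(err) + Math.abs(errLast));
 *     return relativeDiff < minImprovement;
 * }</pre></blockquote>
 *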