RegressionPrior.java
/*
 * LingPipe v. 3.5
 * Copyright (C) 2003-2008 Alias-i
 *
 * This program is licensed under the Alias-i Royalty Free License
 * Version 1 WITHOUT ANY WARRANTY, without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the Alias-i
 * Royalty Free License Version 1 for more details.
 *
 * You should have received a copy of the Alias-i Royalty Free License
 * Version 1 along with this program; if not, visit
 * http://alias-i.com/lingpipe/licenses/lingpipe-license-1.txt or contact
 * Alias-i, Inc. at 181 North 11th Street, Suite 401, Brooklyn, NY 11211,
 * +1 (718) 290-9170.
 */

package com.aliasi.stats;

import com.aliasi.matrix.Vector;

import com.aliasi.util.AbstractExternalizable;
import com.aliasi.util.Compilable;

import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;
import java.io.Serializable;

import java.util.Arrays;

/**
 * A <code>RegressionPrior</code> instance represents a prior
 * distribution on parameters for linear or logistic regression.
 *
 * <p>Instances of this class are used as parameters in the {@link
 * LogisticRegression} class to control the regularization or lack
 * thereof used by the stochastic gradient descent optimizers. The
 * priors all assume a zero mean (or position) for each dimension, but
 * allow variances (or scales) to vary by input dimension.
 *
 * <p>The behavior of a prior is determined by its gradient: the
 * partial derivative of the error function for the prior (the
 * negative log likelihood) with respect to each coefficient
 * <code>β<sub>i</sub></code>:
 *
 * <blockquote><pre>
 * gradient(β<sub>i</sub>,i) = - ∂ log p(β) / ∂ β<sub>i</sub></pre></blockquote>
 *
 * <p>See the class documentation for {@link LogisticRegression}
 * for more information.
 *
 * <p>Priors also implement a log (base 2) probability density for a
 * given parameter value in a given dimension. The total log prior
 * probability is the sum of the log probabilities for the dimensions.
 *
 * <p>Priors affect gradient descent fitting of regression through
 * their contribution to the gradient of the error function with
 * respect to the parameter vector. The contribution of the prior to
 * the error function is the negative log probability of the parameter
 * vector(s) with respect to the prior distribution. The gradient of
 * the error function is the collection of partial derivatives of the
 * error function with respect to the components of the parameter
 * vector. The regression prior abstract base class is defined in
 * terms of a single method {@link #gradient(double,int)}, which
 * specifies the value of the gradient of the error function for a
 * specified dimension with a specified value in that dimension.
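 *
 * <p>As an illustration (a sketch only, not part of the LingPipe
 * API), a stochastic gradient descent update for coefficient
 * <code>β<sub>i</sub></code> would add the prior's gradient to the
 * gradient of the data's negative log likelihood; here
 * <code>prior</code>, <code>eta</code>, <code>beta</code> and
 * <code>likelihoodGradient</code> are hypothetical names:
 *
 * <blockquote><pre>
 * // one gradient descent step for dimension i with learning rate eta
 * beta[i] -= eta * (likelihoodGradient[i] + prior.gradient(beta[i], i));</pre></blockquote>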
 *
 * <p>This class implements static factory methods to construct
 * non-informative, Gaussian, Laplace, and Cauchy priors. The
 * Gaussian, Laplace, and Cauchy priors may specify a different
 * variance (or scale) for each dimension, but assume all the prior
 * means are zero. The priors also assume the dimensions are
 * independent, so that the full covariance matrix is assumed to be
 * diagonal (that is, there is zero covariance between different
 * dimensions).
 *
 * <h4>Non-informative Prior & Maximum Likelihood Estimation</h4>
 *
 * <p>Using a non-informative prior for regression results in
 * standard maximum likelihood estimation.
 *
 * <p>The non-informative prior assumes a uniform distribution over
 * parameter vectors:
 *
 * <blockquote><pre>
 * p(β<sub>i</sub>,i) = 1.0</pre></blockquote>
 *
 * and thus contributes nothing to the gradient:
 *
 * <blockquote><pre>
 * gradient(β<sub>i</sub>,i) = 0.0</pre></blockquote>
 *
 * A non-informative prior is constructed using the static method
 * {@link #noninformative()}.
 *
 * <h4>Gaussian Prior, L<sub>2</sub> Regularization & Ridge Regression</h4>
 *
 * <p>The Gaussian prior assumes a Gaussian (also known as normal)
 * density over parameter vectors, which results in
 * L<sub>2</sub>-regularized regression, also known as ridge
 * regression. Specifically, the prior allows a variance to be
 * specified per dimension, but assumes dimensions are independent in
 * that all off-diagonal covariances are zero.
 *
 * <p>The Gaussian density is defined by:
 *
 * <blockquote><pre>
 * p(β<sub>i</sub>,i) = 1.0/sqrt(2 * π * σ<sub>i</sub><sup>2</sup>) * exp(-β<sub>i</sub><sup>2</sup>/(2 * σ<sub>i</sub><sup>2</sup>))</pre></blockquote>
 *
 * <p>The Gaussian prior leads to the following contribution to the
 * gradient for a dimension <code>i</code> with parameter
 * <code>β<sub>i</sub></code> and variance
 * <code>σ<sub>i</sub><sup>2</sup></code> (differentiating the
 * negative log density cancels the factor of 2 in the exponent's
 * denominator):
 *
 * <blockquote><pre>
 * gradient(β<sub>i</sub>,i) = β<sub>i</sub> / σ<sub>i</sub><sup>2</sup></pre></blockquote>
 *
 * <p>Gaussian priors are constructed using one of the static factory
 * methods, {@link #gaussian(double[])} or {@link
 * #gaussian(double,boolean)}.
 *
 * <h4>Laplace Prior, L<sub>1</sub> Regularization & the Lasso</h4>
 *
 * <p>The Laplace prior assumes a Laplace density over parameter
 * vectors, which results in L<sub>1</sub>-regularized regression,
 * also known as the lasso. The Laplace prior is called a
 * double-exponential distribution because it looks like an
 * exponential distribution for positive values, joined with the
 * reflection of that exponential distribution around zero (or more
 * generally, around its mean parameter).
 *
 * <p>A Laplace prior allows a variance to be specified per dimension,
 * but like the Gaussian prior, assumes means are zero and that the
 * dimensions are independent in that all off-diagonal covariances are
 * zero.
 *
 * <p>The Laplace density is defined by:
 *
 * <blockquote><pre>
 * p(β<sub>i</sub>,i) = (sqrt(2)/(2 * σ<sub>i</sub>)) * exp(- sqrt(2) * abs(β<sub>i</sub>) / σ<sub>i</sub>)</pre></blockquote>
 *
 * <p>The Laplace prior leads to the following contribution to the
 * gradient for a dimension <code>i</code> with parameter
 * <code>β<sub>i</sub></code>, mean zero and variance
 * <code>σ<sub>i</sub><sup>2</sup></code>:
 *
 * <blockquote><pre>
 * gradient(β<sub>i</sub>,i) = sqrt(2) * signum(β<sub>i</sub>) / σ<sub>i</sub></pre></blockquote>
 *
 * where the <code>signum</code> function is defined by {@link
 * Math#signum(double)}.
 *
 * <p>Laplace priors are constructed using one of the static factory
 * methods, {@link #laplace(double[])} or {@link
 * #laplace(double,boolean)}.
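 *
 * <p>As a worked illustration (hypothetical variable names, not part
 * of the library), the two gradient contributions above may be
 * computed directly for a coefficient β<sub>i</sub> = 1.0 with prior
 * variance σ<sub>i</sub><sup>2</sup> = 4.0:
 *
 * <blockquote><pre>
 * double beta = 1.0;
 * double variance = 4.0;  // σ², so σ = 2.0
 * double gaussianGradient = beta / variance;  // = 0.25
 * double laplaceGradient
 *     = Math.sqrt(2.0) * Math.signum(beta) / Math.sqrt(variance);  // ≈ 0.707</pre></blockquote>
 *
 * Unlike the Gaussian gradient, the Laplace gradient's magnitude does
 * not shrink with the coefficient, which is why the lasso drives
 * small coefficients all the way to zero.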
 *
 * <h4>Cauchy Prior</h4>
 *
 * <p>The Cauchy prior assumes a Cauchy density (also known as a
 * Lorentz density) over parameter vectors. The Cauchy density is a
 * Student-t density with one degree of freedom. The Cauchy density
 * allows a scale to be specified for each dimension. The mean and
 * variance are undefined, as their integrals diverge. The Cauchy
 * distribution is symmetric, and for regression priors we assume a
 * mode of zero.
 *
 * <p>The Cauchy density is defined by:
 *
 * <blockquote><pre>
 * p(β<sub>i</sub>,i) = (1 / π) * (λ<sub>i</sub> / (β<sub>i</sub><sup>2</sup> + λ<sub>i</sub><sup>2</sup>))</pre></blockquote>
 *
 * <p>The Cauchy prior leads to the following contribution to the
 * gradient for dimension <code>i</code> with parameter
 * <code>β<sub>i</sub></code> and scale <code>λ<sub>i</sub></code>:
 *
 * <blockquote><pre>
 * gradient(β<sub>i</sub>,i) = 2 * β<sub>i</sub> / (β<sub>i</sub><sup>2</sup> + λ<sub>i</sub><sup>2</sup>)</pre></blockquote>
 *
 * <p>Cauchy priors are constructed using one of the static factory
 * methods, {@link #cauchy(double[])} or {@link #cauchy(double,boolean)}.
 *
 * <h4>Special Treatment of Intercept</h4>
 *
 * <p>By convention, input dimension zero (<code>0</code>) may be
 * reserved for the intercept and set to value 1.0 in all input
 * vectors. For regularized regression, the regularization is
 * typically not applied to the intercept term. To match this
 * convention, the factory methods allow a boolean parameter
 * indicating whether the intercept parameter has a
 * non-informative/uniform prior. If the intercept flag indicates it
 * is non-informative, then dimension 0 will have an infinite prior
 * variance or scale, and hence a zero gradient. The result is that
 * the intercept will be fit by maximum likelihood.
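 *
 * <p>For instance (an illustrative sketch; see the factory method
 * documentation for the exact semantics of the boolean flag), a
 * Gaussian prior with variance 2.0 on every coefficient but a
 * non-informative prior on the intercept dimension would be
 * constructed as:
 *
 * <blockquote><pre>
 * RegressionPrior prior = RegressionPrior.gaussian(2.0, true);</pre></blockquote>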
 *
 * <h4>Serialization</h4>
 *
 * <p>All of the regression priors may be serialized.
 *
 * <h4>References</h4>
 *
 * <p>For full details on the Gaussian, Laplace, and Cauchy
 * distributions, see:
 *
 * <ul>
 * <li>Wikipedia: <a href="http://en.wikipedia.org/wiki/Normal_distribution">Normal (Gaussian) Distribution</a></li>
 * <li>Wikipedia: <a href="http://en.wikipedia.org/wiki/Laplace_distribution">Laplace (Double Exponential) Distribution</a></li>
 * <li>Wikipedia: <a href="http://en.wikipedia.org/wiki/Cauchy_distribution">Cauchy Distribution</a></li>
 * </ul>
 *
 * <p>For explanations of how the priors are used with logistic
 * regression, see the following two textbooks:
 *
 * <ul>
 * <li>Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2001.
 * <i><a href="http://www-stat.stanford.edu/~tibs/ElemStatLearn/">The Elements of Statistical Learning</a></i>.
 * Springer.</li>
 * <li>Bishop, Christopher M. 2006.
 * <i><a href="http://research.microsoft.com/~cmbishop/PRML/">Pattern Recognition and Machine Learning</a></i>.
 * Springer.</li>
 * </ul>
 *
 * and two tech reports:
 *
 * <ul>
 * <li>Genkin, Alexander, David D. Lewis, and David Madigan. 2004.
 * <a href="http://www.stat.columbia.edu/~gelman/stuff_for_blog/madigan.pdf">Large-Scale Bayesian Logistic Regression for Text Categorization</a>.
 * Rutgers University Technical Report
 * (<a href="http://stat.rutgers.edu/~madigan/PAPERS/techno-06-09-18.pdf">alternate download</a>).</li>
 * <li>Gelman, Andrew, Aleks Jakulin, Yu-Sung Su, and Maria Grazia
 * Pittau. 2007. <a href="http://ssrn.com/abstract=1010421">A Default
 * Prior Distribution for Logistic and Other Regression Models</a>.</li>
 * </ul>
 *
 * @author Bob Carpenter
 * @version 3.5
 * @since LingPipe3.5
 */
public abstract class RegressionPrior implements Serializable {

    // do not allow instances or subclasses
    private RegressionPrior() { }

    /**
     * Returns the contribution to the gradient of the error function
     * of the specified parameter value for the specified dimension.
     *
     * @param betaForDimension Parameter value for the specified dimension.
     * @param dimension The dimension.
     * @return The contribution to the gradient of the error function
     * of the parameter value and dimension.
     */
    public abstract double gradient(double betaForDimension, int dimension);

    /**
     * Returns the log (base 2) of the prior density evaluated at the
     * specified coefficient value for the specified dimension. The
     * overall error function is the sum of the negative log
     * likelihood of the data under the model and the negative log of
     * the prior.
     *
     * @param betaForDimension Parameter value for the specified dimension.
     * @param dimension The dimension.
     * @return The log (base 2) prior density at the specified
     * parameter value for the specified dimension.
     */
    public abstract double log2Prior(double betaForDimension, int dimension);

    /**
     * Returns the log (base 2) prior density for a specified
     * coefficient vector, which is the sum of the log (base 2)
     * densities for its dimensions.
     *
     * @param beta Parameter vector.
     * @return The log (base 2) prior for the specified parameter
     * vector.
     * @throws IllegalArgumentException If the specified parameter
     * vector does not match the dimensionality of the prior (if
     * specified).
     */
    public double log2Prior(Vector beta) {
        int numDimensions = beta.numDimensions();
        verifyNumberOfDimensions(numDimensions);
        double log2Prior = 0.0;
        // reconstructed completion of the truncated listing: total the
        // per-dimension log (base 2) densities, per the method contract
        for (int i = 0; i < numDimensions; ++i)
            log2Prior += log2Prior(beta.value(i), i);
        return log2Prior;
    }
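
    // Usage sketch (illustrative only, not part of the original
    // source): constructing a prior and evaluating its density and
    // gradient for a coefficient vector, assuming the DenseVector
    // implementation from com.aliasi.matrix:
    //
    //   Vector beta = new DenseVector(new double[] { 1.0, -1.2, 3.0 });
    //   RegressionPrior prior = RegressionPrior.gaussian(2.0, true);
    //   double log2Density = prior.log2Prior(beta);
    //   double gradient1 = prior.gradient(beta.value(1), 1);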