RegressionPrior.java
/*
 * LingPipe v. 3.5
 * Copyright (C) 2003-2008 Alias-i
 *
 * This program is licensed under the Alias-i Royalty Free License
 * Version 1 WITHOUT ANY WARRANTY, without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the Alias-i
 * Royalty Free License Version 1 for more details.
 *
 * You should have received a copy of the Alias-i Royalty Free License
 * Version 1 along with this program; if not, visit
 * http://alias-i.com/lingpipe/licenses/lingpipe-license-1.txt or contact
 * Alias-i, Inc. at 181 North 11th Street, Suite 401, Brooklyn, NY 11211,
 * +1 (718) 290-9170.
 */

package com.aliasi.stats;

import com.aliasi.matrix.Vector;

import com.aliasi.util.AbstractExternalizable;
import com.aliasi.util.Compilable;

import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectOutput;
import java.io.Serializable;

import java.util.Arrays;

/**
 * A <code>RegressionPrior</code> instance represents a prior
 * distribution on parameters for linear or logistic regression.
 *
 * <p>Instances of this class are used as parameters in the {@link
 * LogisticRegression} class to control the regularization or lack
 * thereof used by the stochastic gradient descent optimizers. The
 * priors all assume a zero mean (or position) for each dimension, but
 * allow variances (or scales) to vary by input dimension.
 *
 * <p>The behavior of a prior is determined by its gradient: the
 * partial derivative of the error function for the prior (the
 * negative log likelihood) with respect to each coefficient
 * <code>β<sub>i</sub></code>:
 *
 * <blockquote><pre>
 * gradient(β<sub>i</sub>,i) = - ∂ log p(β) / ∂ β<sub>i</sub></pre></blockquote>
 *
 * <p>See the class documentation for {@link LogisticRegression}
 * for more information.
 *
 * <p>Priors also implement a log (base 2) probability density for a
 * given parameter value in a given dimension. The total log prior
 * probability is the sum of the log probabilities for the dimensions.
 *
 * <p>Priors affect gradient descent fitting of regression through
 * their contribution to the gradient of the error function with
 * respect to the parameter vector. The contribution of the prior to
 * the error function is the negative log probability of the parameter
 * vector(s) with respect to the prior distribution. The gradient of
 * the error function is the collection of partial derivatives of the
 * error function with respect to the components of the parameter
 * vector. The regression prior abstract base class is defined in
 * terms of a single method {@link #gradient(double,int)}, which
 * specifies the value of the gradient of the error function for a
 * specified dimension with a specified value in that dimension.
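 *
 * <p>As an illustration (a sketch only, not part of the LingPipe
 * API), a stochastic gradient descent update for coefficient
 * <code>β<sub>i</sub></code> would add the prior's gradient to the
 * gradient of the data's negative log likelihood; here
 * <code>prior</code>, <code>eta</code>, <code>beta</code> and
 * <code>likelihoodGradient</code> are hypothetical names:
 *
 * <blockquote><pre>
 * // one gradient descent step for dimension i with learning rate eta
 * beta[i] -= eta * (likelihoodGradient[i] + prior.gradient(beta[i], i));</pre></blockquote>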
 *
 * <p>This class implements static factory methods to construct
 * non-informative, Gaussian, Laplace, and Cauchy priors. The
 * Gaussian, Laplace, and Cauchy priors may specify a different
 * variance (or scale) for each dimension, but assume all the prior
 * means are zero. The priors also assume the dimensions are
 * independent, so that the full covariance matrix is assumed to be
 * diagonal (that is, there is zero covariance between different
 * dimensions).
 *
 * <h4>Non-informative Prior & Maximum Likelihood Estimation</h4>
 *
 * <p>Using a non-informative prior for regression results in
 * standard maximum likelihood estimation.
 *
 * <p>The non-informative prior assumes a uniform distribution over
 * parameter vectors:
 *
 * <blockquote><pre>
 * p(β<sub>i</sub>,i) = 1.0</pre></blockquote>
 *
 * and thus contributes nothing to the gradient:
 *
 * <blockquote><pre>
 * gradient(β<sub>i</sub>,i) = 0.0</pre></blockquote>
 *
 * A non-informative prior is constructed using the static method
 * {@link #noninformative()}.
 *
 * <h4>Gaussian Prior, L<sub>2</sub> Regularization & Ridge Regression</h4>
 *
 * <p>The Gaussian prior assumes a Gaussian (also known as normal)
 * density over parameter vectors, which results in
 * L<sub>2</sub>-regularized regression, also known as ridge
 * regression. Specifically, the prior allows a variance to be
 * specified per dimension, but assumes dimensions are independent in
 * that all off-diagonal covariances are zero.
 *
 * <p>The Gaussian density is defined by:
 *
 * <blockquote><pre>
 * p(β<sub>i</sub>,i) = 1.0/sqrt(2 * π * σ<sub>i</sub><sup>2</sup>) * exp(-β<sub>i</sub><sup>2</sup>/(2 * σ<sub>i</sub><sup>2</sup>))</pre></blockquote>
 *
 * <p>The Gaussian prior leads to the following contribution to the
 * gradient for a dimension <code>i</code> with parameter
 * <code>β<sub>i</sub></code> and variance
 * <code>σ<sub>i</sub><sup>2</sup></code> (differentiating the
 * negative log density cancels the factor of 2 in the exponent's
 * denominator):
 *
 * <blockquote><pre>
 * gradient(β<sub>i</sub>,i) = β<sub>i</sub> / σ<sub>i</sub><sup>2</sup></pre></blockquote>
 *
 * <p>Gaussian priors are constructed using one of the static factory
 * methods, {@link #gaussian(double[])} or {@link
 * #gaussian(double,boolean)}.
 *
 * <h4>Laplace Prior, L<sub>1</sub> Regularization & the Lasso</h4>
 *
 * <p>The Laplace prior assumes a Laplace density over parameter
 * vectors, which results in L<sub>1</sub>-regularized regression,
 * also known as the lasso. The Laplace prior is called a
 * double-exponential distribution because it looks like an
 * exponential distribution for positive values, joined with the
 * reflection of that exponential distribution around zero (or more
 * generally, around its mean parameter).
 *
 * <p>A Laplace prior allows a variance to be specified per dimension,
 * but like the Gaussian prior, assumes means are zero and that the
 * dimensions are independent in that all off-diagonal covariances are
 * zero.
 *
 * <p>The Laplace density is defined by:
 *
 * <blockquote><pre>
 * p(β<sub>i</sub>,i) = (sqrt(2)/(2 * σ<sub>i</sub>)) * exp(- sqrt(2) * abs(β<sub>i</sub>) / σ<sub>i</sub>)</pre></blockquote>
 *
 * <p>The Laplace prior leads to the following contribution to the
 * gradient for a dimension <code>i</code> with parameter
 * <code>β<sub>i</sub></code>, mean zero and variance
 * <code>σ<sub>i</sub><sup>2</sup></code>:
 *
 * <blockquote><pre>
 * gradient(β<sub>i</sub>,i) = sqrt(2) * signum(β<sub>i</sub>) / σ<sub>i</sub></pre></blockquote>
 *
 * where the <code>signum</code> function is defined by {@link
 * Math#signum(double)}.
 *
 * <p>Laplace priors are constructed using one of the static factory
 * methods, {@link #laplace(double[])} or {@link
 * #laplace(double,boolean)}.
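 *
 * <p>As a worked illustration (hypothetical variable names, not part
 * of the library), the two gradient contributions above may be
 * computed directly for a coefficient β<sub>i</sub> = 1.0 with prior
 * variance σ<sub>i</sub><sup>2</sup> = 4.0:
 *
 * <blockquote><pre>
 * double beta = 1.0;
 * double variance = 4.0;  // σ², so σ = 2.0
 * double gaussianGradient = beta / variance;  // = 0.25
 * double laplaceGradient
 *     = Math.sqrt(2.0) * Math.signum(beta) / Math.sqrt(variance);  // ≈ 0.707</pre></blockquote>
 *
 * Unlike the Gaussian gradient, the Laplace gradient's magnitude does
 * not shrink with the coefficient, which is why the lasso drives
 * small coefficients all the way to zero.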
 *
 * <h4>Cauchy Prior</h4>
 *
 * <p>The Cauchy prior assumes a Cauchy density (also known as a
 * Lorentz density) over parameter vectors. The Cauchy density is a
 * Student-t density with one degree of freedom. The Cauchy density
 * allows a scale to be specified for each dimension. The mean and
 * variance are undefined, as their integrals diverge. The Cauchy
 * distribution is symmetric, and for regression priors we assume a
 * mode of zero.
 *
 * <p>The Cauchy density is defined by:
 *
 * <blockquote><pre>
 * p(β<sub>i</sub>,i) = (1 / π) * (λ<sub>i</sub> / (β<sub>i</sub><sup>2</sup> + λ<sub>i</sub><sup>2</sup>))</pre></blockquote>
 *
 * <p>The Cauchy prior leads to the following contribution to the
 * gradient for dimension <code>i</code> with parameter
 * <code>β<sub>i</sub></code> and scale <code>λ<sub>i</sub></code>:
 *
 * <blockquote><pre>
 * gradient(β<sub>i</sub>,i) = 2 * β<sub>i</sub> / (β<sub>i</sub><sup>2</sup> + λ<sub>i</sub><sup>2</sup>)</pre></blockquote>
 *
 * <p>Cauchy priors are constructed using one of the static factory
 * methods, {@link #cauchy(double[])} or {@link #cauchy(double,boolean)}.
 *
 * <h4>Special Treatment of Intercept</h4>
 *
 * <p>By convention, input dimension zero (<code>0</code>) may be
 * reserved for the intercept and set to value 1.0 in all input
 * vectors. For regularized regression, the regularization is
 * typically not applied to the intercept term. To match this
 * convention, the factory methods allow a boolean parameter
 * indicating whether the intercept parameter has a
 * non-informative/uniform prior. If the intercept flag indicates it
 * is non-informative, then dimension 0 will have an infinite prior
 * variance or scale, and hence a zero gradient. The result is that
 * the intercept will be fit by maximum likelihood.
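 *
 * <p>For instance (an illustrative sketch; see the factory method
 * documentation for the exact semantics of the boolean flag), a
 * Gaussian prior with variance 2.0 on every coefficient but a
 * non-informative prior on the intercept dimension would be
 * constructed as:
 *
 * <blockquote><pre>
 * RegressionPrior prior = RegressionPrior.gaussian(2.0, true);</pre></blockquote>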
 *
 * <h4>Serialization</h4>
 *
 * <p>All of the regression priors may be serialized.
 *
 * <h4>References</h4>
 *
 * <p>For full details on the Gaussian, Laplace, and Cauchy
 * distributions, see:
 *
 * <ul>
 * <li>Wikipedia: <a href="http://en.wikipedia.org/wiki/Normal_distribution">Normal (Gaussian) Distribution</a></li>
 * <li>Wikipedia: <a href="http://en.wikipedia.org/wiki/Laplace_distribution">Laplace (Double Exponential) Distribution</a></li>
 * <li>Wikipedia: <a href="http://en.wikipedia.org/wiki/Cauchy_distribution">Cauchy Distribution</a></li>
 * </ul>
 *
 * <p>For explanations of how the priors are used with logistic
 * regression, see the following two textbooks:
 *
 * <ul>
 * <li>Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. 2001.
 * <i><a href="http://www-stat.stanford.edu/~tibs/ElemStatLearn/">The Elements of Statistical Learning</a></i>.
 * Springer.</li>
 * <li>Bishop, Christopher M. 2006.
 * <i><a href="http://research.microsoft.com/~cmbishop/PRML/">Pattern Recognition and Machine Learning</a></i>.
 * Springer.</li>
 * </ul>
 *
 * and two tech reports:
 *
 * <ul>
 * <li>Genkin, Alexander, David D. Lewis, and David Madigan. 2004.
 * <a href="http://www.stat.columbia.edu/~gelman/stuff_for_blog/madigan.pdf">Large-Scale Bayesian Logistic Regression for Text Categorization</a>.
 * Rutgers University Technical Report
 * (<a href="http://stat.rutgers.edu/~madigan/PAPERS/techno-06-09-18.pdf">alternate download</a>).</li>
 * <li>Gelman, Andrew, Aleks Jakulin, Yu-Sung Su, and Maria Grazia
 * Pittau. 2007. <a href="http://ssrn.com/abstract=1010421">A Default
 * Prior Distribution for Logistic and Other Regression Models</a>.</li>
 * </ul>
 *
 * @author Bob Carpenter
 * @version 3.5
 * @since LingPipe3.5
 */
public abstract class RegressionPrior implements Serializable {

    // do not allow instances or subclasses
    private RegressionPrior() { }

    /**
     * Returns the contribution to the gradient of the error function
     * of the specified parameter value for the specified dimension.
     *
     * @param betaForDimension Parameter value for the specified dimension.
     * @param dimension The dimension.
     * @return The contribution to the gradient of the error function
     * of the parameter value and dimension.
     */
    public abstract double gradient(double betaForDimension, int dimension);

    /**
     * Returns the log (base 2) of the prior density evaluated at the
     * specified coefficient value for the specified dimension. The
     * overall error function is the sum of the negative log
     * likelihood of the data under the model and the negative log of
     * the prior.
     *
     * @param betaForDimension Parameter value for the specified dimension.
     * @param dimension The dimension.
     * @return The log (base 2) prior density at the specified
     * parameter value for the specified dimension.
     */
    public abstract double log2Prior(double betaForDimension, int dimension);

    /**
     * Returns the log (base 2) prior density for a specified
     * coefficient vector, which is the sum of the log (base 2)
     * densities for its dimensions.
     *
     * @param beta Parameter vector.
     * @return The log (base 2) prior for the specified parameter
     * vector.
     * @throws IllegalArgumentException If the specified parameter
     * vector does not match the dimensionality of the prior (if
     * specified).
     */
    public double log2Prior(Vector beta) {
        int numDimensions = beta.numDimensions();
        verifyNumberOfDimensions(numDimensions);
        double log2Prior = 0.0;
        // reconstructed completion of the truncated listing: total the
        // per-dimension log (base 2) densities, per the method contract
        for (int i = 0; i < numDimensions; ++i)
            log2Prior += log2Prior(beta.value(i), i);
        return log2Prior;
    }
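
    // Usage sketch (illustrative only, not part of the original
    // source): constructing a prior and evaluating its density and
    // gradient for a coefficient vector, assuming the DenseVector
    // implementation from com.aliasi.matrix:
    //
    //   Vector beta = new DenseVector(new double[] { 1.0, -1.2, 3.0 });
    //   RegressionPrior prior = RegressionPrior.gaussian(2.0, true);
    //   double log2Density = prior.log2Prior(beta);
    //   double gradient1 = prior.gradient(beta.value(1), 1);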