SvdMatrix.java
/*
 * LingPipe v. 3.5
 * Copyright (C) 2003-2008 Alias-i
 *
 * This program is licensed under the Alias-i Royalty Free License
 * Version 1 WITHOUT ANY WARRANTY, without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the Alias-i
 * Royalty Free License Version 1 for more details.
 *
 * You should have received a copy of the Alias-i Royalty Free License
 * Version 1 along with this program; if not, visit
 * http://alias-i.com/lingpipe/licenses/lingpipe-license-1.txt or contact
 * Alias-i, Inc. at 181 North 11th Street, Suite 401, Brooklyn, NY 11211,
 * +1 (718) 290-9170.
 */

package com.aliasi.matrix;

import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// import java.util.Random;

/**
 * An <code>SvdMatrix</code> provides a means of storing a matrix that
 * has been factored via a singular-value decomposition (SVD). This
 * class also provides a static method for computing regularized
 * singular-value decompositions of partial matrices.
 *
 * <h3>Singular Value Decomposition</h3>
 *
 * <p>Singular value decomposition (SVD) factors an
 * <code>m×n</code> matrix <code>A</code> into a product of
 * three matrices:
 *
 * <blockquote><pre>
 * A = U * S * V<sup>T</sup></pre></blockquote>
 *
 * where <code>U</code> is an <code>m×k</code> matrix,
 * <code>V</code> is an <code>n×k</code> matrix, and
 * <code>S</code> is a <code>k×k</code> matrix, where
 * <code>k</code> is the rank of the matrix <code>A</code>. The
 * multiplication (<code>*</code>) is matrix multiplication and
 * the superscripted <code>T</code> indicates matrix transposition.
 *
 * <p>The <code>m</code>-dimensional vectors making up the columns of
 * <code>U</code> are called left singular vectors, whereas the
 * <code>n</code>-dimensional vectors making up the columns of
 * <code>V</code> are called right singular vectors. The values on
 * the diagonal of <code>S</code> are called the singular values.
 * The combination of the <code>q</code>-th left singular vector,
 * right singular vector, and singular value is known as a factor.
 *
 * <p>The singular value matrix <code>S</code> is a diagonal matrix
 * with positive values arranged in non-increasing order, so that
 * <code>S[i][i] >= S[i+1][i+1]</code> for <code>i+1 < k</code>.
 * The set of left singular vectors and the set of right singular
 * vectors are orthonormal. Normality means that each singular vector
 * is of unit length (length <code>1</code>). Orthogonality means that
 * any pair of distinct left singular vectors is orthogonal and any
 * pair of distinct right singular vectors is orthogonal (meaning
 * their dot product is <code>0</code>).
 *
 * <p>Matrices have unique singular-value decompositions
 * up to the re-ordering of columns with equal singular values and
 * up to cancelling sign changes in the singular vectors.
 *
 * <h3>Construction and Value Computation</h3>
 *
 * An <code>SvdMatrix</code> may be constructed out of the singular
 * vectors and singular values, or out of the vectors with singular
 * values scaled in.
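 *
 * <p>As a concrete reference for the value computation spelled out
 * below, the following is a minimal sketch (not a method of this
 * class; the name and array layout are illustrative assumptions only)
 * of reconstructing a single entry from the unscaled singular vectors
 * and singular values:
 *
 * <blockquote><pre>
 * // Sketch: A[i][j] from U (m×k), V (n×k), and the diagonal of S.
 * static double valueFromSvd(double[][] U, double[][] V,
 *                            double[] singularValues,
 *                            int i, int j) {
 *     double sum = 0.0;
 *     for (int q = 0; q < singularValues.length; ++q)
 *         sum += U[i][q] * singularValues[q] * V[j][q];
 *     return sum;
 * }</pre></blockquote>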
 *
 * <p>Given that <code>S</code> is diagonal, the value of a particular
 * entry in <code>A</code> works out to:
 *
 * <blockquote><pre>
 * A[i][j] = <big>Σ</big><sub>k</sub> U[i][k] * S[k][k] * V[j][k]</pre></blockquote>
 *
 * To save time in the application and space in the class,
 * we factor <code>S</code> into <code>U</code> and <code>V</code>
 * to produce a new pair of matrices <code>U'</code> and <code>V'</code>
 * defined by:
 *
 * <blockquote><pre>
 * U' = U * sqrt(S)
 * V'<sup>T</sup> = sqrt(S) * V<sup>T</sup></pre></blockquote>
 *
 * with the square root performed component-wise:
 *
 * <blockquote><pre>
 * sqrt(S)[i][j] = sqrt(S[i][j])</pre></blockquote>
 *
 * By the associativity of matrix multiplication, we have:
 *
 * <blockquote><pre>
 * U * S * V<sup>T</sup>
 *     = U * sqrt(S) * sqrt(S) * V<sup>T</sup>
 *     = U' * V'<sup>T</sup></pre></blockquote>
 *
 * <p>The class implementation is thus able to store <code>U'</code>
 * and <code>V'</code>, reducing the amount of computation involved
 * in returning a value (using column vectors as the default vector
 * orientation):
 *
 * <blockquote><pre>
 * A[i][j] = U'[i]<sup>T</sup> * V'[j]</pre></blockquote>
 *
 * <h3>Square Error and the Frobenius Norm</h3>
 *
 * Suppose <code>A</code> and <code>B</code> are
 * <code>m×n</code> matrices. The square error between them is
 * the square of the so-called Frobenius norm, which extends the
 * standard vector L<sub>2</sub> or Euclidean norm to matrices:
 *
 * <blockquote><pre>
 * squareError(A,B)
 *     = frobeniusNorm(A-B)<sup>2</sup>
 *     = <big>Σ</big><sub>i < m</sub> <big>Σ</big><sub>j < n</sub> (A[i][j] - B[i][j])<sup>2</sup></pre></blockquote>
 *
 * The square error is sometimes referred to as the residual sum
 * of squares (RSS), because <code>A[i][j] - B[i][j]</code>
 * is the residual (the difference between actual and predicted value).
 *
 * <h3>Lower-order Approximations</h3>
 *
 * Consider factoring a matrix <code>A</code> of dimension
 * <code>m×n</code> into the product of two matrices
 * <code>X * Y<sup>T</sup></code>, where <code>X</code> is of dimension
 * <code>m×k</code> and <code>Y</code> is of dimension
 * <code>n×k</code>. We may then measure the square
 * error <code>squareError(A, X * Y<sup>T</sup>)</code> to determine how
 * well the factorization matches <code>A</code>. We know that if
 * we take <code>U', V'</code> from the singular value
 * decomposition, then:
 *
 * <blockquote><pre>
 * squareError(A, U' * V'<sup>T</sup>) = 0.0</pre></blockquote>
 *
 * Here <code>U'</code> and <code>V'</code> have a number of columns
 * (called factors) equal to the rank of the matrix <code>A</code>.
 * The singular value decomposition is such that the first
 * <code>q</code> columns of <code>U'</code> and <code>V'</code>
 * provide the best order-<code>q</code> approximation of
 * <code>A</code> over any <code>X</code> and <code>Y</code>
 * of dimensions <code>m×q</code> and <code>n×q</code>.
 * In symbols:
 *
 * <blockquote><pre>
 * U'<sub>q</sub>, V'<sub>q</sub> = argmin<sub>X is m×q, Y is n×q</sub> squareError(A, X * Y<sup>T</sup>)</pre></blockquote>
 *
 * where <code>U'<sub>q</sub></code> is the restriction of <code>U'</code>
 * to its first <code>q</code> columns.
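 *
 * <p>The scaled representation just described corresponds to the
 * second construction option above (vectors with the singular values
 * scaled in). A minimal sketch of the scaling and of the resulting
 * dot-product value computation, with illustrative (non-API) names:
 *
 * <blockquote><pre>
 * // Sketch: scale sqrt(S) into a bank of singular vectors, producing U' or V'.
 * // vectors is m×k (or n×k); singularValues is the diagonal of S, length k.
 * static double[][] scaleIn(double[][] vectors, double[] singularValues) {
 *     double[][] scaled = new double[vectors.length][singularValues.length];
 *     for (int i = 0; i < vectors.length; ++i)
 *         for (int q = 0; q < singularValues.length; ++q)
 *             scaled[i][q] = vectors[i][q] * Math.sqrt(singularValues[q]);
 *     return scaled;
 * }
 *
 * // Sketch: A[i][j] as the dot product of row i of U' and row j of V'.
 * static double valueFromScaled(double[][] uPrime, double[][] vPrime,
 *                               int i, int j) {
 *     double sum = 0.0;
 *     for (int q = 0; q < uPrime[i].length; ++q)
 *         sum += uPrime[i][q] * vPrime[j][q];
 *     return sum;
 * }</pre></blockquote>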
 *
 * <p>Often errors are reported as means, where the mean square error
 * (MSE) is defined by:
 *
 * <blockquote><pre>
 * meanSquareError(A,B) = squareError(A,B) / (m×n)</pre></blockquote>
 *
 * <p>To adjust to a linear scale, the square root of mean square
 * error (RMSE) is often used:
 *
 * <blockquote><pre>
 * rootMeanSquareError(A,B) = sqrt(meanSquareError(A,B))</pre></blockquote>
 *
 * <h3>Partial Matrices</h3>
 *
 * A partial matrix is one in which some of the values are unknown.
 * This is in contrast with a sparse matrix, in which most of the
 * values are zero. A variant of singular value decomposition may be
 * used to impute the unknown values by minimizing square error with
 * respect to the known values only. Unknown values are then simply
 * derived from the factorization <code>U' * V'<sup>T</sup></code>.
 * Typically, the approximation is of lower order than the rank of
 * the matrix.
 *
 * <h3>Regularized SVD via Shrinkage</h3>
 *
 * Linear regression techniques such as SVD often overfit their
 * training data in order to derive exact answers. This problem
 * is mitigated somewhat by choosing low-order approximations to
 * the full-rank SVD. Another option is to penalize large values
 * in the singular vectors, thus favoring smaller values. The
 * most common way to do this, because of its practicality, is via
 * parameter shrinkage.
 *
 * <p>Shrinkage is a general technique in least-squares fitting that
 * adds a penalty term proportional to the square of the size of
 * the parameters. Thus the square error objective function is
 * replaced with a regularized version:
 *
 * <blockquote><pre>
 * regularizedError(A, U' * V'<sup>T</sup>)
 *     = squareError(A, U' * V'<sup>T</sup>)
 *     + parameterCost(U') + parameterCost(V')</pre></blockquote>
 *
 * where the parameter cost for a vector <code>X</code> of
 * dimensionality <code>q</code> is the sum of the squared
 * parameters:
 *
 * <blockquote><pre>
 * parameterCost(X)
 *     = λ * <big>Σ</big><sub>i < q</sub> X[i]<sup>2</sup>
 *     = λ * length(X)<sup>2</sup></pre></blockquote>
 *
 * Note that the hyperparameter <code>λ</code> scales the
 * parameter cost term in the overall error.
 *
 * <p>Setting the regularization parameter to zero sets the parameter
 * costs to zero, resulting in an unregularized SVD computation.
 *
 * <p>In Bayesian terms, regularization is equivalent to a normal
 * prior on parameter values centered on zero with variance controlled
 * by <code>λ</code>. The resulting minimizer of the regularized
 * error is the maximum a posteriori (MAP) solution. The Bayesian
 * approach also allows covariances to be estimated (including simple
 * parameter variance estimates), but these are not implemented in
 * this class.
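 *
 * <p>The following is a minimal sketch of the regularized error
 * computed over a partial matrix; it is not the class's actual code,
 * and it assumes the partial matrix is represented as a
 * <code>double[][]</code> with <code>Double.NaN</code> marking
 * unknown entries (an illustrative choice only):
 *
 * <blockquote><pre>
 * // regularizedError = squareError over defined entries
 * //                  + lambda * (squared lengths of the rows of U' and V')
 * static double regularizedError(double[][] m, double[][] uPrime,
 *                                double[][] vPrime, double lambda) {
 *     double error = 0.0;
 *     for (int i = 0; i < m.length; ++i) {
 *         for (int j = 0; j < m[i].length; ++j) {
 *             if (Double.isNaN(m[i][j])) continue;  // unknown entry
 *             double prediction = 0.0;
 *             for (int q = 0; q < uPrime[i].length; ++q)
 *                 prediction += uPrime[i][q] * vPrime[j][q];
 *             double residual = m[i][j] - prediction;
 *             error += residual * residual;
 *         }
 *     }
 *     for (double[] row : uPrime)
 *         for (double x : row) error += lambda * x * x;  // parameterCost(U')
 *     for (double[] row : vPrime)
 *         for (double x : row) error += lambda * x * x;  // parameterCost(V')
 *     return error;
 * }</pre></blockquote>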
 *
 * <h3>Regularized Stochastic Gradient Descent</h3>
 *
 * Singular value decomposition may be computed "exactly"
 * (modulo arithmetic precision and convergence) using an algorithm
 * whose time complexity is <code>O(m<sup>3</sup> + n<sup>3</sup>)</code>
 * for an <code>m×n</code> matrix (equal to
 * <code>O(max(m,n)<sup>3</sup>)</code>). Arithmetic precision is
 * especially vexing at singular values near zero and with highly
 * correlated rows or columns in the input matrix.
 *
 * <p>For large partial matrices, we use a form of stochastic gradient
 * descent which computes a single pair of left and right singular
 * vectors (a single factor, that is) at a time. Each factor is
 * estimated by iteratively visiting the data points and adjusting the
 * unnormalized singular vectors making up the current factor. Each
 * adjustment is a least-squares fitting step, where we simultaneously
 * adjust the left singular vectors given the right singular vectors
 * and vice-versa.
 *
 * <p>The least-squares adjustments take the following form. For each
 * value, we compute the current error (using the order <code>k</code>
 * approximation and the current order <code>k+1</code> values) and
 * then move the vectors to reduce error. We use <code>U'</code> and
 * <code>V'</code> for the incremental values of the singular vectors
 * scaled by the singular values:
 *
 * <blockquote><pre>
 * for (k = 0; k < maxOrder; ++k)
 *     for (i < m) U'[i][k] = random.nextGaussian() * initValue;
 *     for (j < n) V'[j][k] = random.nextGaussian() * initValue;
 *     for (epoch = 0; epoch < maxEpochs && not converged; ++epoch)
 *         for (i,j such that M[i][j] defined)
 *             error = M[i][j] - U'<sub>k</sub>[i] * V'<sub>k</sub>[j]  <small>// * is vector dot product</small>
 *             uTemp = U'[i][k]
 *             vTemp = V'[j][k]
 *             U'[i][k] += learningRate[epoch] * ( error * vTemp - regularization * uTemp )
 *             V'[j][k] += learningRate[epoch] * ( error * uTemp - regularization * vTemp )</pre></blockquote>
 *
 * where <code>initValue</code>, <code>maxEpochs</code>,
 * <code>learningRate</code> (see below), and
 * <code>regularization</code> are algorithm hyperparameters. Note
 * that the initial values of the singular vectors are set randomly to
 * the result of a draw from a Gaussian (normal) distribution with
 * mean 0.0 and variance 1.0, scaled by <code>initValue</code>.
 *
 * <p>Because we use the entire set of
 * factors in the error computation, the current factor is guaranteed
 * to have singular vectors orthogonal to the singular vectors already
 * computed.
 *
 * <p>Note that in the actual implementation, the contribution to the
 * error of the already-computed factors is cached to reduce
 * time in the inner loop. This requires a double-length floating
 * point value for each defined entry in the matrix.
 *
 * <h4>Gradient Interpretation</h4>
 *
 * <p>Like most linear learners, this algorithm merely moves the
 * parameter vectors <code>U'[i]</code> and <code>V'[j]</code> along
 * the negative gradient of the error. The gradient is the
 * multivariate derivative of the objective function being minimized.
 * Because our objective is squared error, the gradient is just its
 * derivative, which is just (two times) the (linear) error itself. We
 * roll the factor of 2 into the learning rate to derive the update in
 * the algorithm pseudo-code above.
 *
 * <p>The term <code>(error * vTemp)</code> is the component of the
 * error gradient due to the current value of <code>V'[j][k]</code>,
 * and the term <code>(regularization * uTemp)</code> is the component
 * of the gradient due to the size of the parameter <code>U'[i][k]</code>.
 * The updates thus move the parameter vectors in the direction that
 * reduces the regularized error.
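 *
 * <p>The following is a minimal sketch in Java of the inner-loop
 * updates in the pseudo-code above; it is not the class's actual
 * code. It assumes the partial matrix is given as a
 * <code>double[][]</code> with <code>Double.NaN</code> marking
 * undefined entries, and that <code>cache[i][j]</code> holds the
 * order-<code>k</code> approximation of <code>m[i][j]</code> (the
 * cached contribution of the already-computed factors); these
 * representational choices are assumptions for illustration only:
 *
 * <blockquote><pre>
 * // One epoch of updates for factor k over the defined entries of m.
 * static void updateFactor(double[][] m, double[][] cache,
 *                          double[][] uPrime, double[][] vPrime, int k,
 *                          double learningRate, double regularization) {
 *     for (int i = 0; i < m.length; ++i) {
 *         for (int j = 0; j < m[i].length; ++j) {
 *             if (Double.isNaN(m[i][j])) continue;  // undefined entry
 *             double prediction = cache[i][j] + uPrime[i][k] * vPrime[j][k];
 *             double error = m[i][j] - prediction;
 *             double uTemp = uPrime[i][k];
 *             double vTemp = vPrime[j][k];
 *             uPrime[i][k] += learningRate * (error * vTemp - regularization * uTemp);
 *             vPrime[j][k] += learningRate * (error * uTemp - regularization * vTemp);
 *         }
 *     }
 * }</pre></blockquote>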
 *
 * <h4>Convergence Conditions</h4>
 *
 * <p>The convergence conditions for a given factor require either
 * hitting the maximum number of allowable epochs, or finding that the
 * improvement from one epoch to the next is below some relative
 * threshold:
 *
 * <blockquote><pre>
 * regError<sup>(epoch)</sup> = regError(M, U'<sub>k</sub><sup>(epoch)</sup> * V'<sub>k</sub><sup>(epoch)T</sup>)
 * relativeImprovement<sup>(epoch+1)</sup> = relativeDiff(regError<sup>(epoch+1)</sup>, regError<sup>(epoch)</sup>)
 * relativeDiff(x,y) = abs(x-y) / (abs(x) + abs(y))</pre></blockquote>
 *
 * When the relative difference in regularized error is less than
 * a hyperparameter threshold <code>minImprovement</code>, iteration
 * over epochs terminates and the algorithm moves on to the next
 * factor <code>k+1</code>.
 *
 * <p>Note that a complete matrix is a degenerate kind of partial
 * matrix. The gradient descent computation still works in this
 * situation, but is not as efficient or as accurate as an
 * algebraic SVD solver for small matrices.
 *
 * <h3>Annealing Schedule</h3>
 *
 * Learning rates that are too high are unstable, whereas learning
 * rates that are too low never reach their targets. To get around
 * this problem, the learning rate, <code>learningRate[epoch]</code>,
 * is lowered as the number of epochs increases. This lowering of the
 * learning rate has a thermodynamic interpretation in terms of free
 * energy, hence the name "annealing". Larger moves are
 * made in earlier epochs, then the temperature is gradually lowered
 * so that the learner may settle into a stable fixed point. The
 * function <code>learningRate[epoch]</code> is called the annealing
 * schedule.
 *
 * <p>There are theoretical requirements on the annealing schedule
 * that guarantee convergence (up to arithmetic precision and with no
 * upper bound on the number of epochs):
 *
 * <blockquote><pre>
 * <big>Σ</big><sub>epoch</sub> learningRate[epoch] = infinity
 * <big>Σ</big><sub>epoch</sub> learningRate[epoch]<sup>2</sup> < infinity</pre></blockquote>
 *
 * The schedule we use is the one commonly chosen to meet the
 * above requirements:
 *
 * <blockquote><pre>
 * learningRate[epoch] = initialLearningRate / (1 + epoch/annealingRate)</pre></blockquote>
 *
 * where <code>initialLearningRate</code> is the initial learning rate
 * and <code>annealingRate</code> determines how quickly it shrinks.
 * The learning rate moves from its initial size
 * (<code>initialLearningRate</code>) to one half (<code>1/2</code>) of
 * its original size after <code>annealingRate</code> epochs, and, more
 * generally, to <code>1/(n+1)</code> of its original size after
 * <code>n * annealingRate</code> epochs.
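 *
 * <p>A minimal sketch of the annealing schedule and the
 * relative-difference convergence test defined above (not the class's
 * actual code; the method names are illustrative):
 *
 * <blockquote><pre>
 * // learningRate[epoch] = initialLearningRate / (1 + epoch/annealingRate)
 * static double learningRate(double initialLearningRate,
 *                            double annealingRate, int epoch) {
 *     return initialLearningRate / (1.0 + epoch / annealingRate);
 * }
 *
 * // relativeDiff(x,y) = abs(x-y) / (abs(x) + abs(y)); iteration for a factor
 * // stops when this falls below minImprovement or maxEpochs is reached.
 * static double relativeDiff(double x, double y) {
 *     return Math.abs(x - y) / (Math.abs(x) + Math.abs(y));
 * }</pre></blockquote>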