LatentDirichletAllocation.java
 * <blockquote><pre>
 * φ<sup>*</sup>[topic][word]
 * = (count(topic,word) + β) / (count(topic) + numWords*β)</pre></blockquote>
 *
 * <p>A complete Gibbs sample is represented as an instance of {@link
 * LatentDirichletAllocation.GibbsSample}, which provides access to
 * the topic assignment to every token, as well as methods to compute
 * <code>θ<sup>*</sup></code> and <code>φ<sup>*</sup></code>
 * as defined above.  A sample also maintains the original priors and
 * word counts.  Just the estimates of the topic-word distributions
 * <code>φ[topic]</code> and the prior topic concentration
 * <code>α</code> are sufficient to define an LDA model.  Note
 * that the imputed values of <code>θ<sup>*</sup>[doc]</code>
 * used during estimation are part of a sample, but are not part of
 * the LDA model itself.  The LDA model contains enough information to
 * estimate <code>θ<sup>*</sup></code> for an arbitrary
 * document, as described in the next section.
 *
 * <p>The Gibbs sampling algorithm starts with a random assignment of
 * topics to words, then simply iterates through the tokens in turn,
 * sampling topics according to the distribution defined above.  After
 * each run through the entire corpus, a callback is made to a handler
 * for the samples.  The sampler may be configured with an initial
 * burn-in period, during which the first batch of samples is simply
 * discarded, and may then be configured to report only every n-th
 * sample thereafter to reduce correlation between samples.
 *
 * <h3>LDA as Multi-Topic Classifier</h3>
 *
 * <p>An LDA model consists of a Dirichlet prior <code>α</code>
 * over topic distributions and a word distribution
 * <code>φ[topic]</code> for each topic.  Given an LDA model and a
 * new document <code>words = { words[0], ..., words[length-1] }</code>
 * consisting of a sequence of words, the posterior distribution over
 * topic weights is given by:
 *
 * <blockquote><pre>
 * p(θ | words, α, φ)</pre></blockquote>
 *
 * Although this distribution is not solvable analytically, it is easy
 * to estimate using a simplified form of the LDA estimator's Gibbs
 * sampler.  The conditional distribution of a topic assignment
 * <code>topics[token]</code> to a single token given an assignment
 * <code>topics'</code> to all other tokens is given by:
 *
 * <blockquote><pre>
 * p(topic[token] | topics', words, α, φ)
 * ∝ p(topic[token], words[token] | topics', α, φ)
 * = p(topic[token] | topics', α)
 *   * p(words[token] | φ[topic[token]])
 * = (count(topic[token]) + α) / (words.length - 1 + numTopics * α)
 *   * p(words[token] | φ[topic[token]])</pre></blockquote>
 *
 * This leads to a straightforward sampler over posterior topic
 * assignments, from which we may directly compute the Dirichlet
 * posterior over topic distributions or a MAP topic distribution.
 *
 * <p>This class provides a method to sample these topic assignments,
 * which may then be used to form Dirichlet distributions or MAP point
 * estimates of <code>θ<sup>*</sup></code> for the document
 * <code>words</code>, as sketched below.
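 * <p>As an illustration only (the method and variable names below are
 * hypothetical and not part of this class's API), a single Gibbs pass
 * over a new document using the conditional distribution above might
 * be sketched as:
 *
 * <blockquote><pre>
 * // One sweep of collapsed Gibbs sampling over a new document, given
 * // topic-word probabilities phi and document-topic prior alpha.
 * // topicCounts[k] holds the number of tokens currently assigned topic k.
 * static void samplePass(int[] words, int[] topics, int[] topicCounts,
 *                        double[][] phi, double alpha,
 *                        java.util.Random random) {
 *     int numTopics = phi.length;
 *     double[] p = new double[numTopics];
 *     for (int tok = 0; tok < words.length; ++tok) {
 *         --topicCounts[topics[tok]];         // remove current assignment
 *         double sum = 0.0;
 *         for (int topic = 0; topic < numTopics; ++topic) {
 *             // (count(topic) + alpha) * p(word|topic); the shared
 *             // denominator (words.length - 1 + numTopics*alpha) cancels
 *             p[topic] = (topicCounts[topic] + alpha) * phi[topic][words[tok]];
 *             sum += p[topic];
 *         }
 *         double u = random.nextDouble() * sum;  // sample proportional to p
 *         int topic = 0;
 *         double cum = p[0];
 *         while (u > cum && topic + 1 < numTopics)
 *             cum += p[++topic];
 *         topics[tok] = topic;                // record new assignment
 *         ++topicCounts[topic];
 *     }
 * }</pre></blockquote>
 *
 * After a burn-in number of such passes, a MAP estimate of
 * <code>θ<sup>*</sup>[topic]</code> may be read off the counts as
 * <code>(topicCounts[topic] + α) / (words.length + numTopics*α)</code>.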
 * <h3>LDA as a Conditional Language Model</h3>
 *
 * <p>An LDA model may be used to estimate the likelihood of a word
 * given a previous bag of words:
 *
 * <blockquote><pre>
 * p(word | words, α, φ)
 * = <big><big><big><big>∫</big></big></big></big> p(word | θ, φ) p(θ | words, α, φ) <i>d</i>θ</pre></blockquote>
 *
 * This integral is easily evaluated by sampling topic distributions
 * from <code>p(θ | words, α, φ)</code> and averaging the word
 * probability determined by each sample.  The word probability for a
 * sample <code>θ</code> is defined by:
 *
 * <blockquote><pre>
 * p(word | θ, φ)
 * = <big><big><big>Σ</big></big></big><sub><sub>topic < numTopics</sub></sub> p(topic | θ) * p(word | φ[topic])</pre></blockquote>
 *
 * Although this approach could theoretically be applied to compute
 * the probability of a document one word at a time, the cost would be
 * prohibitive: quadratically many topic samples are required, because
 * the samples for the <code>n</code>-th word consist of topic
 * assignments to the previous <code>n-1</code> words.
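 * <p>Given a topic distribution <code>theta</code> estimated for the
 * previous words (for instance, by the sampler sketched in the
 * previous section), the inner sum is easy to compute with this
 * class's accessors.  The following static helper is illustrative
 * only and is not part of this class's API:
 *
 * <blockquote><pre>
 * // p(word | theta, phi) = sum over topics of p(topic|theta) * p(word|phi[topic])
 * static double wordProbGivenTheta(LatentDirichletAllocation lda,
 *                                  double[] theta, int word) {
 *     double sum = 0.0;
 *     for (int topic = 0; topic < lda.numTopics(); ++topic)
 *         sum += theta[topic] * lda.wordProbability(topic, word);
 *     return sum;
 * }</pre></blockquote>
 *
 * Averaging the value of this sum over several sampled topic
 * distributions approximates the integral above.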
 * <!--
 * <h3>LDA as a Language Model</h3>
 *
 * The likelihood of a document relative to an LDA model is defined
 * by integrating over all possible topic distributions weighted
 * by prior likelihood:
 *
 * <blockquote><pre>
 * p(words | α, φ)
 * = <big><big><big>∫</big></big></big> p(words | θ, φ) * p(θ | α) <i>d</i>θ</pre></blockquote>
 *
 * The probability of a document is just the product of
 * the probability of its tokens:
 *
 * <blockquote><pre>
 * p(words | θ, φ)
 * = <big><big><big>Π</big></big></big><sub><sub>token < words.length</sub></sub> p(words[token] | θ, φ)</pre></blockquote>
 *
 * The probability of a word given a topic distribution and a
 * per-topic word distribution is derived by summing its probabilities
 * over all topics, weighted by topic probability:
 *
 * <blockquote><pre>
 * p(word | θ, φ)
 * = <big><big><big>Σ</big></big></big><sub><sub>topic < numTopics</sub></sub> p(topic | θ) * p(word | φ[topic])</pre></blockquote>
 *
 * Unfortunately, this value is not easily computed using Gibbs
 * sampling.  Although various estimates exist in the literature,
 * they are quite expensive to compute.
 * -->
 *
 * <h3>Bayesian Calculations and Exchangeability</h3>
 *
 * <p>An LDA model may be used for a variety of statistical
 * calculations.  For instance, it may be used to determine the
 * distribution of topics to words, and using these distributions, it
 * may determine word similarity.  Similarly, document similarity may
 * be determined by the topic distributions in a document.
 *
 * <p>Point estimates are derived using a single LDA model.  For a
 * Bayesian calculation, multiple samples are taken to produce
 * multiple LDA models.  The results of a calculation on these
 * models are then averaged to produce a Bayesian estimate of the
 * quantity of interest.  The sampling methodology effectively
 * computes the integral over the posterior numerically.
 *
 * <p>Bayesian calculations over multiple samples are complicated by
 * the exchangeability of topics in the LDA model.  In particular,
 * there is no guarantee that topics are the same between samples, so
 * it is not acceptable to combine samples in topic-level reasoning.
 * For instance, it does not make sense to estimate the probability of
 * a topic in a document using multiple samples.
 *
 * <h3>Non-Document Data</h3>
 *
 * The "words" in an LDA model don't necessarily have to
 * represent words in documents.  LDA is basically a multinomial
 * mixture model, and any multinomial outcomes may be modeled with
 * LDA.  For instance, a document may correspond to a baseball game
 * and the words may correspond to the outcomes of at-bats (some
 * might occur more than once).  LDA has also been used for
 * gene expression data, where expression levels from mRNA microarray
 * experiments are quantized into multinomial outcomes.
 *
 * <p>LDA has also been applied to collaborative filtering.  Movies
 * act as words, with each user modeled as a document: the bag of
 * movies they've seen.  Given an LDA model and a user's films, the
 * user's topic distribution may be inferred and used to estimate the
 * likelihood of seeing unseen films.
 *
 * <h3>References</h3>
 *
 * <ul>
 * <li>Wikipedia: <a href="http://en.wikipedia.org/wiki/Gibbs_sampling">Gibbs Sampling</a>
 * <li>Wikipedia: <a href="http://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo">Markov chain Monte Carlo</a>
 * <li>Wikipedia: <a href="http://en.wikipedia.org/wiki/Dirichlet_distribution">Dirichlet Distribution</a>
 * <li>Wikipedia: <a href="http://en.wikipedia.org/wiki/Latent_Dirichlet_Allocation">Latent Dirichlet Allocation</a>
 * <li>Steyvers, Mark and Tom Griffiths. 2007.
 * <a href="http://psiexp.ss.uci.edu/research/papers/SteyversGriffithsLSABookFormatted.pdf">Probabilistic topic models</a>.
 * In Thomas K. Landauer, Danielle S. McNamara, Simon Dennis and Walter Kintsch (eds.),
 * <i>Handbook of Latent Semantic Analysis</i>.
 * Lawrence Erlbaum.</li>
 * <!-- alt link: http://cocosci.berkeley.edu/tom/papers/SteyversGriffiths.pdf -->
 *
 * <li>Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003.
 * <a href="http://jmlr.csail.mit.edu/papers/v3/blei03a.html">Latent Dirichlet allocation</a>.
 * <i>Journal of Machine Learning Research</i> <b>3</b>(2003):993-1022.</li>
 * <!-- http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf -->
 * </ul>
 *
 * @author  Bob Carpenter
 * @version 3.3.0
 * @since   LingPipe3.3
 */
public class LatentDirichletAllocation {

    private final double mDocTopicPrior;
    private final double[][] mTopicWordProbs;

    /**
     * Construct a latent Dirichlet allocation (LDA) model using the
     * specified document-topic prior and topic-word distributions.
     *
     * <p>The topic-word probability array <code>topicWordProbs</code>
     * represents a collection of discrete distributions
     * <code>topicWordProbs[topic]</code> over words for each topic,
     * and thus must satisfy:
     *
     * <blockquote><pre>
     * topicWordProbs[topic][word] >= 0.0
     *
     * <big><big><big>Σ</big></big></big><sub><sub>word < numWords</sub></sub> topicWordProbs[topic][word] = 1.0</pre></blockquote>
     *
     * <p><b>Warning:</b> The constructor checks that the entries are
     * in range and that the arrays are all of the same length, but it
     * does <b>not</b> check that each distribution sums to 1.0.
     *
     * <p>See the class documentation above for an explanation of
     * the parameters and what can be done with a model.
     *
     * @param docTopicPrior The document-topic prior.
     * @param topicWordProbs Array of topic-word probability distributions.
     * @throws IllegalArgumentException If the document-topic prior is
     * not finite and positive, or if the topic-word probability
     * arrays are not all the same length with entries between 0.0 and
     * 1.0 inclusive.
     */
    public LatentDirichletAllocation(double docTopicPrior,
                                     double[][] topicWordProbs) {
        if (docTopicPrior <= 0.0
            || Double.isNaN(docTopicPrior)
            || Double.isInfinite(docTopicPrior)) {
            String msg = "Document-topic prior must be finite and positive."
                + " Found docTopicPrior=" + docTopicPrior;
            throw new IllegalArgumentException(msg);
        }
        int numTopics = topicWordProbs.length;
        if (numTopics < 1) {
            String msg = "Require non-empty topic-word probabilities.";
            throw new IllegalArgumentException(msg);
        }
        int numWords = topicWordProbs[0].length;
        for (int topic = 1; topic < numTopics; ++topic) {
            if (topicWordProbs[topic].length != numWords) {
                String msg = "All topics must have the same number of words."
+ " topicWordProbs[0].length=" + topicWordProbs[0].length + " topicWordProbs[" + topic + "]=" + topicWordProbs[topic].length; throw new IllegalArgumentException(msg); } } for (int topic = 0; topic < numTopics; ++topic) { for (int word = 0; word < numWords; ++word) { if (topicWordProbs[topic][word] < 0.0 || topicWordProbs[topic][word] > 1.0) { String msg = "All probabilities must be between 0.0 and 1.0" + " Found topicWordProbs[" + topic + "][" + word + "]=" + topicWordProbs[topic][word]; throw new IllegalArgumentException(msg); } } } mDocTopicPrior = docTopicPrior; mTopicWordProbs = topicWordProbs; } /** * Returns the number of topics in this LDA model. * * @return The number of topics in this model. */ public int numTopics() { return mTopicWordProbs.length; } /** * Returns the number of words on which this LDA model * is based. * * @return The numbe of words in this model. */ public int numWords() { return mTopicWordProbs[0].length; } /** * Returns the concentration value of the uniform Dirichlet prior over * topic distributions for documents. This value is effectively * a prior count for topics used for additive smoothing during * estimation. * * @return The prior count of topics in documents. */ public double documentTopicPrior() { return mDocTopicPrior; } /** * Returns the probability of the specified word in the specified * topic. The values returned should be non-negative and finite, * and should sum to 1.0 over all words for a specifed topic. * * @param topic Topic identifier. * @param word Word identifier. * @return Probability of the specified word in the specified * topic. */ public double wordProbability(int topic, int word) {