📄 LatentDirichletAllocation.java
/*
 * LingPipe v. 3.5
 * Copyright (C) 2003-2008 Alias-i
 *
 * This program is licensed under the Alias-i Royalty Free License
 * Version 1 WITHOUT ANY WARRANTY, without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the Alias-i
 * Royalty Free License Version 1 for more details.
 *
 * You should have received a copy of the Alias-i Royalty Free License
 * Version 1 along with this program; if not, visit
 * http://alias-i.com/lingpipe/licenses/lingpipe-license-1.txt or contact
 * Alias-i, Inc. at 181 North 11th Street, Suite 401, Brooklyn, NY 11211,
 * +1 (718) 290-9170.
 */

package com.aliasi.cluster;

import com.aliasi.corpus.ObjectHandler;

import com.aliasi.symbol.SymbolTable;

import com.aliasi.tokenizer.Tokenizer;
import com.aliasi.tokenizer.TokenizerFactory;

import com.aliasi.stats.Statistics;

import com.aliasi.util.Arrays;
import com.aliasi.util.Math;
import com.aliasi.util.ObjectToCounterMap;
import com.aliasi.util.Strings;

// import java.util.Arrays;

import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.Set;

/**
 * A <code>LatentDirichletAllocation</code> object represents a latent
 * Dirichlet allocation (LDA) model. LDA provides a Bayesian model of
 * document generation in which each document is generated by a
 * mixture of topical multinomials. An LDA model specifies the number
 * of topics, a Dirichlet prior over topic mixtures for a document,
 * and a discrete distribution over words for each topic.
 *
 * <p>A document is generated from an LDA model by first selecting a
 * multinomial over topics given the Dirichlet prior. Then for each
 * token in the document, a topic is generated from the
 * document-specific topic distribution, and then a word is generated
 * from the discrete distribution for that topic. Note that document
 * length is not generated by this model; a fully generative model
 * would need a way of generating lengths (e.g. a Poisson
 * distribution) or terminating documents (e.g. a distinguished
 * end-of-document symbol).
 *
 * <p>An LDA model may be estimated from an unlabeled training corpus
 * (collection of documents) using a second Dirichlet prior, this time
 * over word distributions in a topic. This class provides a static
 * inference method that produces (collapsed Gibbs) samples from the
 * posterior distribution of topic assignments to words, any one of
 * which may be used to construct an LDA model instance.
 *
 * <p>An LDA model can be used to infer the topic mixture of unseen
 * text documents, which can be used to compare documents by topical
 * similarity. A fixed LDA model may also be used to estimate the
 * likelihood of a word occurring in a document given the other words
 * in the document. A collection of LDA models may be used for fully
 * Bayesian reasoning at the corpus (collection of documents) level.
 *
 * <p>LDA may be applied to arbitrary multinomial data. To apply it
 * to text, a tokenizer factory converts a text document to a bag of
 * words and a symbol table converts these tokens to integer outcomes.
 * The static method {@link
 * #tokenizeDocuments(CharSequence[],TokenizerFactory,SymbolTable,int)}
 * is available to carry out the text to multinomial conversion with
 * pruning of low counts.
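 *
 * <p>For concreteness, the following is a minimal, illustrative
 * sketch of that text-to-multinomial conversion. It assumes that
 * <code>tokenizeDocuments</code> returns the ragged
 * <code>int[][]</code> array of word identifiers described in the
 * next section, and it uses LingPipe's
 * <code>IndoEuropeanTokenizerFactory</code> and
 * <code>MapSymbolTable</code> purely as example choices of tokenizer
 * factory and symbol table:
 *
 * <blockquote><pre>
 * CharSequence[] texts = ...;   // raw documents
 * TokenizerFactory tokFactory = new IndoEuropeanTokenizerFactory();
 * SymbolTable symbolTable = new MapSymbolTable();
 * int minCount = 5;             // prune words occurring fewer than 5 times
 * int[][] docWords
 *     = LatentDirichletAllocation
 *       .tokenizeDocuments(texts, tokFactory, symbolTable, minCount);</pre></blockquote>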
 *
 * <h4>LDA Generative Model</h4>
 *
 * <p>LDA operates over a fixed vocabulary of discrete outcomes, which
 * we call words for convenience, and represent as a set of integers:
 *
 * <blockquote><pre>
 * words = { 0, 1, ..., numWords-1 }</pre></blockquote>
 *
 * <p>A corpus is a ragged array of documents, each document consisting
 * of an array of words:
 *
 * <blockquote><pre>
 * int[][] words = { words[0], words[1], ..., words[numDocs-1] }</pre></blockquote>
 *
 * A given document <code>words[i]</code>, <code>i < numDocs</code>,
 * is itself represented as a bag of words, each word being
 * represented as an integer:
 *
 * <blockquote><pre>
 * int[] words[i] = { words[i][0], words[i][1], ..., words[i][words[i].length-1] }</pre></blockquote>
 *
 * The documents do not all need to be the same length, so the
 * two-dimensional array <code>words</code> is ragged.
 *
 * <p>A particular LDA model is defined over a fixed number of topics,
 * also represented as integers:
 *
 * <blockquote><pre>
 * topics = { 0, 1, ..., numTopics-1 }</pre></blockquote>
 *
 * <p>For a given topic <code>topic < numTopics</code>, LDA specifies a
 * discrete distribution <code>φ[topic]</code> over words:
 *
 * <blockquote><pre>
 * φ[topic][word] >= 0.0
 *
 * <big><big><big>Σ</big></big></big><sub><sub>word < numWords</sub></sub> φ[topic][word] = 1.0</pre></blockquote>
 *
 * <p>In an LDA model, each document in the corpus is generated from a
 * document-specific mixture of topics <code>θ[doc]</code>. The
 * distribution <code>θ[doc]</code> is a discrete distribution over
 * topics:
 *
 * <blockquote><pre>
 * θ[doc][topic] >= 0.0
 *
 * <big><big><big>Σ</big></big></big><sub><sub>topic < numTopics</sub></sub> θ[doc][topic] = 1.0</pre></blockquote>
 *
 * <p>Under LDA's generative model for documents, a document-specific
 * topic mixture <code>θ[doc]</code> is generated from a uniform
 * Dirichlet distribution with concentration parameter <code>α</code>.
 * The Dirichlet is the conjugate prior for the multinomial, in which
 * <code>α</code> acts as a prior count assigned to a topic in a
 * document. Typically, LDA is run with a fairly diffuse prior with
 * concentration <code>α < 1</code>, leading to skewed posterior
 * topic distributions.
 *
 * <p>Given a topic distribution <code>θ[doc]</code> for a document,
 * tokens are generated (conditionally) independently. For each token
 * in the document, a topic <code>topics[doc][token]</code> is
 * generated according to the topic distribution <code>θ[doc]</code>,
 * then the word instance <code>words[doc][token]</code> is generated
 * given the topic using the topic-specific distribution over tokens
 * <code>φ[topics[doc][token]]</code>.
 *
 * <p>For estimation purposes, LDA places a uniform Dirichlet prior
 * with concentration parameter <code>β</code> on each of the topic
 * distributions <code>φ[topic]</code>. The first step in modeling a
 * corpus is to generate the topic distributions <code>φ</code> from a
 * Dirichlet parameterized by <code>β</code>.
 *
 * <p>In sampling notation, the LDA generative model is expressed as
 * follows:
 *
 * <blockquote><pre>
 * φ[topic] ~ Dirichlet(β)
 * θ[doc] ~ Dirichlet(α)
 * topics[doc][token] ~ Discrete(θ[doc])
 * words[doc][token] ~ Discrete(φ[topics[doc][token]])</pre></blockquote>
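 *
 * <p>The token-level part of this generative story is easy to
 * simulate. The following sketch is illustrative only (not part of
 * this class's API); it generates a document of a given length,
 * taking the multinomials <code>theta</code> (the document's topic
 * distribution) and <code>phi</code> (per-topic word distributions)
 * as given and eliding the two Dirichlet draws:
 *
 * <blockquote><pre>
 * static int[] generateDocument(double[] theta, double[][] phi,
 *                               int length, Random random) {
 *     int[] words = new int[length];
 *     for (int token = 0; token < length; ++token) {
 *         int topic = sampleDiscrete(theta, random);         // topics[doc][token] ~ Discrete(θ[doc])
 *         words[token] = sampleDiscrete(phi[topic], random); // words[doc][token] ~ Discrete(φ[topic])
 *     }
 *     return words;
 * }
 *
 * static int sampleDiscrete(double[] dist, Random random) {
 *     double u = random.nextDouble();
 *     double cumulative = 0.0;
 *     for (int outcome = 0; outcome < dist.length; ++outcome) {
 *         cumulative += dist[outcome];
 *         if (u < cumulative) return outcome;
 *     }
 *     return dist.length - 1;  // guard against floating-point rounding
 * }</pre></blockquote>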
 *
 * <p>A generative Bayesian model is based on the joint probability of
 * all observed and unobserved variables, given the priors. Given a
 * text corpus, only the words are observed. The unobserved variables
 * include the assignment of a topic to each word, the assignment of a
 * topic distribution to each document, and the assignment of word
 * distributions to each topic. We let
 * <code>topics[doc][token]</code>, <code>doc < numDocs, token <
 * docLength[doc]</code>, be the topic assigned to the specified token
 * in the specified document. This leaves two Dirichlet priors, one
 * parameterized by <code>α</code> for topics in documents, and one
 * parameterized by <code>β</code> for words in topics. These priors
 * are treated as hyperparameters for purposes of estimation;
 * cross-validation may be used to provide so-called empirical Bayes
 * estimates for the priors.
 *
 * <p>The full LDA probability model for a corpus follows the
 * generative process outlined above. First, topic distributions are
 * chosen at the corpus level for each topic given their Dirichlet
 * prior, and then the remaining variables are generated given these
 * topic distributions:
 *
 * <blockquote><pre>
 * p(words, topics, θ, φ | α, β)
 * = <big><big><big>Π</big></big></big><sub><sub>topic < numTopics</sub></sub> p(φ[topic] | β)
 *   * p(words, topics, θ | α, φ)</pre></blockquote>
 *
 * Note that because the multinomial parameters for
 * <code>φ[topic]</code> are continuous, <code>p(φ[topic] | β)</code>
 * represents a density, not a discrete distribution. Thus it does not
 * make sense to talk about the probability of a given multinomial
 * <code>φ[topic]</code>; non-zero results only arise from integrals
 * over measurable subsets of the multinomial simplex. It is possible
 * to sample from a density, so the generative model is well founded.
 *
 * <p>A document is generated by first generating its topic
 * distribution given the Dirichlet prior, and then generating its
 * topics and words:
 *
 * <blockquote><pre>
 * p(words, topics, θ | α, φ)
 * = <big><big><big>Π</big></big></big><sub><sub>doc < numDocs</sub></sub> p(θ[doc] | α)
 *       * p(words[doc], topics[doc] | θ[doc], φ)</pre></blockquote>
 *
 * The topic and word are generated from the multinomials
 * <code>θ[doc]</code> and the topic distributions <code>φ</code>
 * using the chain rule, first generating the topic given the
 * document's topic distribution and then generating the word given
 * the topic's word distribution.
 *
 * <blockquote><pre>
 * p(words[doc], topics[doc] | θ[doc], φ)
 * = <big><big><big>Π</big></big></big><sub><sub>token < words[doc].length</sub></sub> p(topics[doc][token] | θ[doc])
 *       * p(words[doc][token] | φ[topics[doc][token]])</pre></blockquote>
 *
 * <p>Given the topic and document multinomials, this distribution is
 * discrete, and thus may be evaluated. It may also be marginalized by
 * summation:
 *
 * <blockquote><pre>
 * p(words[doc] | θ[doc], φ)
 * = <big><big><big>Π</big></big></big><sub><sub>token < words[doc].length</sub></sub> <big><big><big>Σ</big></big></big><sub><sub>topic < numTopics</sub></sub> p(topic | θ[doc]) * p(words[doc][token] | φ[topic])</pre></blockquote>
 *
 * <p>Conditional probabilities are computed in the usual way by
 * marginalizing other variables through integration. Unfortunately,
 * this simple mathematical operation is often intractable
 * computationally.
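 *
 * <p>The last displayed marginalization can be evaluated directly
 * when <code>θ[doc]</code> and <code>φ</code> are known. The
 * following sketch (illustrative only, not part of this class's API)
 * computes the log of <code>p(words[doc] | θ[doc], φ)</code>, summing
 * over topics for each token and working in logs to avoid underflow:
 *
 * <blockquote><pre>
 * static double docLogLikelihood(int[] docWords, double[] theta, double[][] phi) {
 *     double logLikelihood = 0.0;
 *     for (int token = 0; token < docWords.length; ++token) {
 *         int word = docWords[token];
 *         double tokenProb = 0.0;
 *         for (int topic = 0; topic < theta.length; ++topic)
 *             tokenProb += theta[topic] * phi[topic][word]; // sum over topics
 *         logLikelihood += java.lang.Math.log(tokenProb);   // product becomes a sum of logs
 *     }
 *     return logLikelihood;
 * }</pre></blockquote>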
 *
 * <h4>Estimating LDA with Collapsed Gibbs Sampling</h4>
 *
 * <p>This class uses a collapsed form of Gibbs sampling over the
 * posterior distribution of topic assignments given the documents and
 * Dirichlet priors:
 *
 * <blockquote><pre>
 * p(topics | words, α, β)</pre></blockquote>
 *
 * <p>This distribution may be derived from the joint distribution by
 * marginalizing (also known as &quot;collapsing&quot; or
 * &quot;integrating out&quot;) the contributions of the
 * document-topic and topic-word distributions.
 *
 * <p>The Gibbs sampler used to estimate LDA models produces samples
 * that consist of a topic assignment to every token in the corpus.
 * The conjugacy of the Dirichlet prior for multinomials makes the
 * sampling straightforward.
 *
 * <p>An initial sample is produced by randomly assigning topics to
 * tokens. Then, the sampler works iteratively through the corpus, one
 * token at a time. At each token, it samples a new topic assignment
 * to that token given all the topic assignments to other tokens in
 * the corpus:
 *
 * <blockquote><pre>
 * p(topics[doc][token] | words, topics')</pre></blockquote>
 *
 * The notation <code>topics'</code> represents the set of topic
 * assignments other than to <code>topics[doc][token]</code>. This
 * collapsed posterior conditional is estimated directly:
 *
 * <blockquote><pre>
 * p(topics[doc][token] = topic | words, topics')
 * = (count'(doc,topic) + α) / (docLength[doc]-1 + numTopics*α)
 *   * (count'(topic,word) + β) / (count'(topic) + numWords*β)</pre></blockquote>
 *
 * The counts are all defined relative to <code>topics'</code>; that
 * is, the current topic assignment for the token being sampled is not
 * considered in the counts. Note that the two factors are estimates
 * of <code>θ[doc]</code> and <code>φ[topic]</code> with all data
 * other than the assignment to the current token. Note how the prior
 * concentrations arise as additive smoothing factors in these
 * estimates, a result of the Dirichlet's conjugacy to the
 * multinomial. For the purposes of sampling, the document-length
 * normalization in the denominator of the first term is not
 * necessary, as it remains constant across topics.
 *
 * <p>The posterior Dirichlet distributions may be computed using just
 * the counts. For instance, the posterior distribution for topics in
 * documents is estimated as:
 *
 * <blockquote><pre>
 * p(θ[doc] | α, β, words)
 * = Dirichlet(count(doc,0)+α, count(doc,1)+α, ..., count(doc,numTopics-1)+α)</pre></blockquote>
 *
 * <p>The sampling distribution is defined from the maximum a
 * posteriori (MAP) estimates of the multinomial distribution over
 * topics in a document:
 *
 * <blockquote><pre>
 * θ<sup>*</sup>[doc] = ARGMAX<sub><sub>θ[doc]</sub></sub> p(θ[doc] | α, β, words)</pre></blockquote>
 *
 * which we know from the Dirichlet distribution is:
 *
 * <blockquote><pre>
 * θ<sup>*</sup>[doc][topic]
 * = (count(doc,topic) + α) / (docLength[doc] + numTopics*α)</pre></blockquote>
 *
 * By the same reasoning, the MAP word distribution in topics is:
 *
 * <blockquote><pre>
 * φ<sup>*</sup>[topic][word]
 * = (count(topic,word) + β) / (count(topic) + numWords*β)</pre></blockquote>
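 *
 * <p>To make the resampling step concrete, the following is an
 * illustrative sketch of a single collapsed Gibbs update for one
 * token. The count arrays (<code>docTopicCount</code>,
 * <code>topicWordCount</code>, <code>topicTotalCount</code>) and the
 * other variable names are hypothetical, not fields of this class,
 * and the constant document-length normalization is dropped:
 *
 * <blockquote><pre>
 * // remove the current assignment to obtain the primed counts
 * int oldTopic = topics[doc][token];
 * int word = words[doc][token];
 * --docTopicCount[doc][oldTopic];
 * --topicWordCount[oldTopic][word];
 * --topicTotalCount[oldTopic];
 *
 * // unnormalized p(topics[doc][token] = topic | words, topics')
 * double[] dist = new double[numTopics];
 * double sum = 0.0;
 * for (int topic = 0; topic < numTopics; ++topic) {
 *     dist[topic] = (docTopicCount[doc][topic] + alpha)
 *         * (topicWordCount[topic][word] + beta)
 *         / (topicTotalCount[topic] + numWords * beta);
 *     sum += dist[topic];
 * }
 *
 * // sample the new topic and restore the counts
 * double u = random.nextDouble() * sum;
 * int newTopic = numTopics - 1;
 * double cumulative = 0.0;
 * for (int topic = 0; topic < numTopics; ++topic) {
 *     cumulative += dist[topic];
 *     if (u < cumulative) { newTopic = topic; break; }
 * }
 * topics[doc][token] = newTopic;
 * ++docTopicCount[doc][newTopic];
 * ++topicWordCount[newTopic][word];
 * ++topicTotalCount[newTopic];</pre></blockquote>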