LatentDirichletAllocation.java
 * <blockquote><pre>
 * φ<sup>*</sup>[topic][word]
 * = (count(topic,word) + β) / (count(topic) + numWords*β)</pre></blockquote>
 *
 * <p>A complete Gibbs sample is represented as an instance of {@link
 * LatentDirichletAllocation.GibbsSample}, which provides access to
 * the topic assignment to every token, as well as methods to compute
 * <code>θ<sup>*</sup></code> and <code>φ<sup>*</sup></code>
 * as defined above.  A sample also maintains the original priors and
 * word counts.  Just the estimates of the topic-word distributions
 * <code>φ[topic]</code> and the prior topic concentration
 * <code>α</code> are sufficient to define an LDA model.  Note
 * that the imputed values of <code>θ<sup>*</sup>[doc]</code>
 * used during estimation are part of a sample, but are not part of
 * the LDA model itself.  The LDA model contains enough information to
 * estimate <code>θ<sup>*</sup></code> for an arbitrary
 * document, as described in the next section.
 *
 * <p>The Gibbs sampling algorithm starts with a random assignment of
 * topics to words, then simply iterates through the tokens in turn,
 * sampling topics according to the distribution defined above.  After
 * each run through the entire corpus, a callback is made to a handler
 * for the samples.  The sampler may be configured with an initial
 * burn-in period, during which the first batch of samples is simply
 * discarded, and may then be configured to report only every n-th
 * sample thereafter to reduce correlation between samples.
 *
 * <h3>LDA as Multi-Topic Classifier</h3>
 *
 * <p>An LDA model consists of a Dirichlet prior <code>α</code>
 * over topic distributions and a word distribution
 * <code>φ[topic]</code> for each topic.  Given an LDA model and a
 * new document <code>words = { words[0], ..., words[length-1] }</code>
 * consisting of a sequence of words, the posterior distribution over
 * topic weights is given by:
 *
 * <blockquote><pre>
 * p(θ | words, α, φ)</pre></blockquote>
 *
 * Although this distribution is not solvable analytically, it is easy
 * to estimate using a simplified form of the LDA estimator's Gibbs
 * sampler.  The conditional distribution of a topic assignment
 * <code>topics[token]</code> to a single token given an assignment
 * <code>topics'</code> to all other tokens is given by:
 *
 * <blockquote><pre>
 * p(topic[token] | topics', words, α, φ)
 * ∝ p(topic[token], words[token] | topics', α, φ)
 * = p(topic[token] | topics', α)
 *   * p(words[token] | φ[topic[token]])
 * = (count(topic[token]) + α) / (words.length - 1 + numTopics * α)
 *   * p(words[token] | φ[topic[token]])</pre></blockquote>
 *
 * This leads to a straightforward sampler over posterior topic
 * assignments, from which we may directly compute the Dirichlet
 * posterior over topic distributions or a MAP topic distribution.
 *
 * <p>This class provides a method to sample these topic assignments,
 * which may then be used to form Dirichlet distributions or MAP point
 * estimates of <code>θ<sup>*</sup></code> for the document
 * <code>words</code>, as sketched below.
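 * <p>As an illustration only (the method and variable names below are
 * hypothetical and not part of this class's API), a single Gibbs pass
 * over a new document using the conditional distribution above might
 * be sketched as:
 *
 * <blockquote><pre>
 * // One sweep of collapsed Gibbs sampling over a new document, given
 * // topic-word probabilities phi and document-topic prior alpha.
 * // topicCounts[k] holds the number of tokens currently assigned topic k.
 * static void samplePass(int[] words, int[] topics, int[] topicCounts,
 *                        double[][] phi, double alpha,
 *                        java.util.Random random) {
 *     int numTopics = phi.length;
 *     double[] p = new double[numTopics];
 *     for (int tok = 0; tok < words.length; ++tok) {
 *         --topicCounts[topics[tok]];         // remove current assignment
 *         double sum = 0.0;
 *         for (int topic = 0; topic < numTopics; ++topic) {
 *             // (count(topic) + alpha) * p(word|topic); the shared
 *             // denominator (words.length - 1 + numTopics*alpha) cancels
 *             p[topic] = (topicCounts[topic] + alpha) * phi[topic][words[tok]];
 *             sum += p[topic];
 *         }
 *         double u = random.nextDouble() * sum;  // sample proportional to p
 *         int topic = 0;
 *         double cum = p[0];
 *         while (u > cum && topic + 1 < numTopics)
 *             cum += p[++topic];
 *         topics[tok] = topic;                // record new assignment
 *         ++topicCounts[topic];
 *     }
 * }</pre></blockquote>
 *
 * After a burn-in number of such passes, a MAP estimate of
 * <code>θ<sup>*</sup>[topic]</code> may be read off the counts as
 * <code>(topicCounts[topic] + α) / (words.length + numTopics*α)</code>.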
 * <h3>LDA as a Conditional Language Model</h3>
 *
 * <p>An LDA model may be used to estimate the likelihood of a word
 * given a previous bag of words:
 *
 * <blockquote><pre>
 * p(word | words, α, φ)
 * = <big><big><big><big>∫</big></big></big></big> p(word | θ, φ) p(θ | words, α, φ) <i>d</i>θ</pre></blockquote>
 *
 * This integral is easily evaluated by sampling topic distributions
 * from <code>p(θ | words, α, φ)</code> and averaging the word
 * probability determined by each sample.  The word probability for a
 * sample <code>θ</code> is defined by:
 *
 * <blockquote><pre>
 * p(word | θ, φ)
 * = <big><big><big>Σ</big></big></big><sub><sub>topic < numTopics</sub></sub> p(topic | θ) * p(word | φ[topic])</pre></blockquote>
 *
 * Although this approach could theoretically be applied to compute
 * the probability of a document one word at a time, the cost would be
 * prohibitive: quadratically many topic samples are required, because
 * the samples for the <code>n</code>-th word consist of topic
 * assignments to the previous <code>n-1</code> words.
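 * <p>Given a topic distribution <code>theta</code> estimated for the
 * previous words (for instance, by the sampler sketched in the
 * previous section), the inner sum is easy to compute with this
 * class's accessors.  The following static helper is illustrative
 * only and is not part of this class's API:
 *
 * <blockquote><pre>
 * // p(word | theta, phi) = sum over topics of p(topic|theta) * p(word|phi[topic])
 * static double wordProbGivenTheta(LatentDirichletAllocation lda,
 *                                  double[] theta, int word) {
 *     double sum = 0.0;
 *     for (int topic = 0; topic < lda.numTopics(); ++topic)
 *         sum += theta[topic] * lda.wordProbability(topic, word);
 *     return sum;
 * }</pre></blockquote>
 *
 * Averaging the value of this sum over several sampled topic
 * distributions approximates the integral above.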
 * <!--
 * <h3>LDA as a Language Model</h3>
 *
 * The likelihood of a document relative to an LDA model is defined
 * by integrating over all possible topic distributions weighted
 * by prior likelihood:
 *
 * <blockquote><pre>
 * p(words | α, φ)
 * = <big><big><big>∫</big></big></big> p(words | θ, φ) * p(θ | α) <i>d</i>θ</pre></blockquote>
 *
 * The probability of a document is just the product of
 * the probability of its tokens:
 *
 * <blockquote><pre>
 * p(words | θ, φ)
 * = <big><big><big>Π</big></big></big><sub><sub>token < words.length</sub></sub> p(words[token] | θ, φ)</pre></blockquote>
 *
 * The probability of a word given a topic distribution and a
 * per-topic word distribution is derived by summing its probabilities
 * over all topics, weighted by topic probability:
 *
 * <blockquote><pre>
 * p(word | θ, φ)
 * = <big><big><big>Σ</big></big></big><sub><sub>topic < numTopics</sub></sub> p(topic | θ) * p(word | φ[topic])</pre></blockquote>
 *
 * Unfortunately, this value is not easily computed using Gibbs
 * sampling.  Although various estimates exist in the literature,
 * they are quite expensive to compute.
 * -->
 *
 * <h3>Bayesian Calculations and Exchangeability</h3>
 *
 * <p>An LDA model may be used for a variety of statistical
 * calculations.  For instance, it may be used to determine the
 * distribution of topics to words, and using these distributions, it
 * may determine word similarity.  Similarly, document similarity may
 * be determined by the topic distributions in a document.
 *
 * <p>Point estimates are derived using a single LDA model.  For a
 * Bayesian calculation, multiple samples are taken to produce
 * multiple LDA models.  The results of a calculation on these
 * models are then averaged to produce a Bayesian estimate of the
 * quantity of interest.  The sampling methodology effectively
 * computes the integral over the posterior numerically.
 *
 * <p>Bayesian calculations over multiple samples are complicated by
 * the exchangeability of topics in the LDA model.  In particular,
 * there is no guarantee that topics are the same between samples, so
 * it is not acceptable to combine samples in topic-level reasoning.
 * For instance, it does not make sense to estimate the probability of
 * a topic in a document using multiple samples.
 *
 * <h3>Non-Document Data</h3>
 *
 * The "words" in an LDA model don't necessarily have to
 * represent words in documents.  LDA is basically a multinomial
 * mixture model, and any multinomial outcomes may be modeled with
 * LDA.  For instance, a document may correspond to a baseball game
 * and the words may correspond to the outcomes of at-bats (some
 * might occur more than once).  LDA has also been used for
 * gene expression data, where expression levels from mRNA microarray
 * experiments are quantized into multinomial outcomes.
 *
 * <p>LDA has also been applied to collaborative filtering.  Movies
 * act as words, with each user modeled as a document: the bag of
 * movies they've seen.  Given an LDA model and a user's films, the
 * user's topic distribution may be inferred and used to estimate the
 * likelihood of seeing unseen films.
 *
 * <h3>References</h3>
 *
 * <ul>
 * <li>Wikipedia: <a href="http://en.wikipedia.org/wiki/Gibbs_sampling">Gibbs Sampling</a>
 * <li>Wikipedia: <a href="http://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo">Markov chain Monte Carlo</a>
 * <li>Wikipedia: <a href="http://en.wikipedia.org/wiki/Dirichlet_distribution">Dirichlet Distribution</a>
 * <li>Wikipedia: <a href="http://en.wikipedia.org/wiki/Latent_Dirichlet_Allocation">Latent Dirichlet Allocation</a>
 * <li>Steyvers, Mark and Tom Griffiths. 2007.
 * <a href="http://psiexp.ss.uci.edu/research/papers/SteyversGriffithsLSABookFormatted.pdf">Probabilistic topic models</a>.
 * In Thomas K. Landauer, Danielle S. McNamara, Simon Dennis and Walter Kintsch (eds.),
 * <i>Handbook of Latent Semantic Analysis</i>.
 * Lawrence Erlbaum.</li>
 * <!-- alt link: http://cocosci.berkeley.edu/tom/papers/SteyversGriffiths.pdf -->
 *
 * <li>Blei, David M., Andrew Y. Ng, and Michael I. Jordan. 2003.
 * <a href="http://jmlr.csail.mit.edu/papers/v3/blei03a.html">Latent Dirichlet allocation</a>.
 * <i>Journal of Machine Learning Research</i> <b>3</b>(2003):993-1022.</li>
 * <!-- http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf -->
 * </ul>
 *
 * @author  Bob Carpenter
 * @version 3.3.0
 * @since   LingPipe3.3
 */
public class LatentDirichletAllocation {

    private final double mDocTopicPrior;
    private final double[][] mTopicWordProbs;

    /**
     * Construct a latent Dirichlet allocation (LDA) model using the
     * specified document-topic prior and topic-word distributions.
     *
     * <p>The topic-word probability array <code>topicWordProbs</code>
     * represents a collection of discrete distributions
     * <code>topicWordProbs[topic]</code> over words for each topic,
     * and thus must satisfy:
     *
     * <blockquote><pre>
     * topicWordProbs[topic][word] >= 0.0
     *
     * <big><big><big>Σ</big></big></big><sub><sub>word < numWords</sub></sub> topicWordProbs[topic][word] = 1.0</pre></blockquote>
     *
     * <p><b>Warning:</b> The constructor checks that the entries are
     * in range and that the arrays are all of the same length, but it
     * does <b>not</b> check that each distribution sums to 1.0.
     *
     * <p>See the class documentation above for an explanation of
     * the parameters and what can be done with a model.
     *
     * @param docTopicPrior The document-topic prior.
     * @param topicWordProbs Array of topic-word probability distributions.
     * @throws IllegalArgumentException If the document-topic prior is
     * not finite and positive, or if the topic-word probability
     * arrays are not all the same length with entries between 0.0 and
     * 1.0 inclusive.
     */
    public LatentDirichletAllocation(double docTopicPrior,
                                     double[][] topicWordProbs) {
        if (docTopicPrior <= 0.0
            || Double.isNaN(docTopicPrior)
            || Double.isInfinite(docTopicPrior)) {
            String msg = "Document-topic prior must be finite and positive."
                + " Found docTopicPrior=" + docTopicPrior;
            throw new IllegalArgumentException(msg);
        }
        int numTopics = topicWordProbs.length;
        if (numTopics < 1) {
            String msg = "Require non-empty topic-word probabilities.";
            throw new IllegalArgumentException(msg);
        }
        int numWords = topicWordProbs[0].length;
        for (int topic = 1; topic < numTopics; ++topic) {
            if (topicWordProbs[topic].length != numWords) {
                String msg = "All topics must have the same number of words."
+ " topicWordProbs[0].length=" + topicWordProbs[0].length + " topicWordProbs[" + topic + "]=" + topicWordProbs[topic].length; throw new IllegalArgumentException(msg); } } for (int topic = 0; topic < numTopics; ++topic) { for (int word = 0; word < numWords; ++word) { if (topicWordProbs[topic][word] < 0.0 || topicWordProbs[topic][word] > 1.0) { String msg = "All probabilities must be between 0.0 and 1.0" + " Found topicWordProbs[" + topic + "][" + word + "]=" + topicWordProbs[topic][word]; throw new IllegalArgumentException(msg); } } } mDocTopicPrior = docTopicPrior; mTopicWordProbs = topicWordProbs; } /** * Returns the number of topics in this LDA model. * * @return The number of topics in this model. */ public int numTopics() { return mTopicWordProbs.length; } /** * Returns the number of words on which this LDA model * is based. * * @return The numbe of words in this model. */ public int numWords() { return mTopicWordProbs[0].length; } /** * Returns the concentration value of the uniform Dirichlet prior over * topic distributions for documents. This value is effectively * a prior count for topics used for additive smoothing during * estimation. * * @return The prior count of topics in documents. */ public double documentTopicPrior() { return mDocTopicPrior; } /** * Returns the probability of the specified word in the specified * topic. The values returned should be non-negative and finite, * and should sum to 1.0 over all words for a specifed topic. * * @param topic Topic identifier. * @param word Word identifier. * @return Probability of the specified word in the specified * topic. */ public double wordProbability(int topic, int word) {