📄 LatentDirichletAllocation.java
/*
 * LingPipe v. 3.5
 * Copyright (C) 2003-2008 Alias-i
 *
 * This program is licensed under the Alias-i Royalty Free License
 * Version 1 WITHOUT ANY WARRANTY, without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the Alias-i
 * Royalty Free License Version 1 for more details.
 *
 * You should have received a copy of the Alias-i Royalty Free License
 * Version 1 along with this program; if not, visit
 * http://alias-i.com/lingpipe/licenses/lingpipe-license-1.txt or contact
 * Alias-i, Inc. at 181 North 11th Street, Suite 401, Brooklyn, NY 11211,
 * +1 (718) 290-9170.
 */

package com.aliasi.cluster;

import com.aliasi.corpus.ObjectHandler;

import com.aliasi.symbol.SymbolTable;

import com.aliasi.tokenizer.Tokenizer;
import com.aliasi.tokenizer.TokenizerFactory;

import com.aliasi.stats.Statistics;

import com.aliasi.util.Arrays;
import com.aliasi.util.Math;
import com.aliasi.util.ObjectToCounterMap;
import com.aliasi.util.Strings;

// import java.util.Arrays;

import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.Set;

/**
 * A <code>LatentDirichletAllocation</code> object represents a latent
 * Dirichlet allocation (LDA) model. LDA provides a Bayesian model of
 * document generation in which each document is generated by a
 * mixture of topical multinomials. An LDA model specifies the number
 * of topics, a Dirichlet prior over topic mixtures for a document,
 * and a discrete distribution over words for each topic.
 *
 * <p>A document is generated from an LDA model by first selecting a
 * multinomial over topics given the Dirichlet prior. Then for each
 * token in the document, a topic is generated from the
 * document-specific topic distribution, and then a word is generated
 * from the discrete distribution for that topic. Note that document
 * length is not generated by this model; a fully generative model
 * would need a way of generating lengths (e.g. a Poisson
 * distribution) or terminating documents (e.g. a distinguished
 * end-of-document symbol).
 *
 * <p>An LDA model may be estimated from an unlabeled training corpus
 * (collection of documents) using a second Dirichlet prior, this time
 * over word distributions in a topic. This class provides a static
 * inference method that produces (collapsed Gibbs) samples from the
 * posterior distribution of topic assignments to words, any one of
 * which may be used to construct an LDA model instance.
 *
 * <p>An LDA model can be used to infer the topic mixture of unseen
 * text documents, which can be used to compare documents by topical
 * similarity. A fixed LDA model may also be used to estimate the
 * likelihood of a word occurring in a document given the other words
 * in the document. A collection of LDA models may be used for fully
 * Bayesian reasoning at the corpus (collection of documents) level.
 *
 * <p>LDA may be applied to arbitrary multinomial data. To apply it
 * to text, a tokenizer factory converts a text document to a bag of
 * words and a symbol table converts these tokens to integer outcomes.
 * The static method {@link
 * #tokenizeDocuments(CharSequence[],TokenizerFactory,SymbolTable,int)}
 * is available to carry out the text to multinomial conversion with
 * pruning of low counts.
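 *
 * <p>For concreteness, the following is a minimal, illustrative
 * sketch of that text-to-multinomial conversion. It assumes that
 * <code>tokenizeDocuments</code> returns the ragged
 * <code>int[][]</code> array of word identifiers described in the
 * next section, and it uses LingPipe's
 * <code>IndoEuropeanTokenizerFactory</code> and
 * <code>MapSymbolTable</code> purely as example choices of tokenizer
 * factory and symbol table:
 *
 * <blockquote><pre>
 * CharSequence[] texts = ...;   // raw documents
 * TokenizerFactory tokFactory = new IndoEuropeanTokenizerFactory();
 * SymbolTable symbolTable = new MapSymbolTable();
 * int minCount = 5;             // prune words occurring fewer than 5 times
 * int[][] docWords
 *     = LatentDirichletAllocation
 *       .tokenizeDocuments(texts, tokFactory, symbolTable, minCount);</pre></blockquote>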
 *
 * <h4>LDA Generative Model</h4>
 *
 * <p>LDA operates over a fixed vocabulary of discrete outcomes, which
 * we call words for convenience, and represent as a set of integers:
 *
 * <blockquote><pre>
 * words = { 0, 1, ..., numWords-1 }</pre></blockquote>
 *
 * <p>A corpus is a ragged array of documents, each document consisting
 * of an array of words:
 *
 * <blockquote><pre>
 * int[][] words = { words[0], words[1], ..., words[numDocs-1] }</pre></blockquote>
 *
 * A given document <code>words[i]</code>, <code>i < numDocs</code>,
 * is itself represented as a bag of words, each word being
 * represented as an integer:
 *
 * <blockquote><pre>
 * int[] words[i] = { words[i][0], words[i][1], ..., words[i][words[i].length-1] }</pre></blockquote>
 *
 * The documents do not all need to be the same length, so the
 * two-dimensional array <code>words</code> is ragged.
 *
 * <p>A particular LDA model is defined over a fixed number of topics,
 * also represented as integers:
 *
 * <blockquote><pre>
 * topics = { 0, 1, ..., numTopics-1 }</pre></blockquote>
 *
 * <p>For a given topic <code>topic < numTopics</code>, LDA specifies a
 * discrete distribution <code>φ[topic]</code> over words:
 *
 * <blockquote><pre>
 * φ[topic][word] >= 0.0
 *
 * <big><big><big>Σ</big></big></big><sub><sub>word < numWords</sub></sub> φ[topic][word] = 1.0</pre></blockquote>
 *
 * <p>In an LDA model, each document in the corpus is generated from a
 * document-specific mixture of topics <code>θ[doc]</code>. The
 * distribution <code>θ[doc]</code> is a discrete distribution over
 * topics:
 *
 * <blockquote><pre>
 * θ[doc][topic] >= 0.0
 *
 * <big><big><big>Σ</big></big></big><sub><sub>topic < numTopics</sub></sub> θ[doc][topic] = 1.0</pre></blockquote>
 *
 * <p>Under LDA's generative model for documents, a document-specific
 * topic mixture <code>θ[doc]</code> is generated from a uniform
 * Dirichlet distribution with concentration parameter <code>α</code>.
 * The Dirichlet is the conjugate prior for the multinomial, in which
 * <code>α</code> acts as a prior count assigned to a topic in a
 * document. Typically, LDA is run with a fairly diffuse prior with
 * concentration <code>α < 1</code>, leading to skewed posterior
 * topic distributions.
 *
 * <p>Given a topic distribution <code>θ[doc]</code> for a document,
 * tokens are generated (conditionally) independently. For each token
 * in the document, a topic <code>topics[doc][token]</code> is
 * generated according to the topic distribution <code>θ[doc]</code>,
 * then the word instance <code>words[doc][token]</code> is generated
 * given the topic using the topic-specific distribution over tokens
 * <code>φ[topics[doc][token]]</code>.
 *
 * <p>For estimation purposes, LDA places a uniform Dirichlet prior
 * with concentration parameter <code>β</code> on each of the topic
 * distributions <code>φ[topic]</code>. The first step in modeling a
 * corpus is to generate the topic distributions <code>φ</code> from a
 * Dirichlet parameterized by <code>β</code>.
 *
 * <p>In sampling notation, the LDA generative model is expressed as
 * follows:
 *
 * <blockquote><pre>
 * φ[topic] ~ Dirichlet(β)
 * θ[doc] ~ Dirichlet(α)
 * topics[doc][token] ~ Discrete(θ[doc])
 * words[doc][token] ~ Discrete(φ[topics[doc][token]])</pre></blockquote>
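 *
 * <p>The token-level part of this generative story is easy to
 * simulate. The following sketch is illustrative only (not part of
 * this class's API); it generates a document of a given length,
 * taking the multinomials <code>theta</code> (the document's topic
 * distribution) and <code>phi</code> (per-topic word distributions)
 * as given and eliding the two Dirichlet draws:
 *
 * <blockquote><pre>
 * static int[] generateDocument(double[] theta, double[][] phi,
 *                               int length, Random random) {
 *     int[] words = new int[length];
 *     for (int token = 0; token < length; ++token) {
 *         int topic = sampleDiscrete(theta, random);         // topics[doc][token] ~ Discrete(θ[doc])
 *         words[token] = sampleDiscrete(phi[topic], random); // words[doc][token] ~ Discrete(φ[topic])
 *     }
 *     return words;
 * }
 *
 * static int sampleDiscrete(double[] dist, Random random) {
 *     double u = random.nextDouble();
 *     double cumulative = 0.0;
 *     for (int outcome = 0; outcome < dist.length; ++outcome) {
 *         cumulative += dist[outcome];
 *         if (u < cumulative) return outcome;
 *     }
 *     return dist.length - 1;  // guard against floating-point rounding
 * }</pre></blockquote>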
 *
 * <p>A generative Bayesian model is based on the joint probability of
 * all observed and unobserved variables, given the priors. Given a
 * text corpus, only the words are observed. The unobserved variables
 * include the assignment of a topic to each word, the assignment of a
 * topic distribution to each document, and the assignment of word
 * distributions to each topic. We let
 * <code>topics[doc][token]</code>, <code>doc < numDocs, token <
 * docLength[doc]</code>, be the topic assigned to the specified token
 * in the specified document. This leaves two Dirichlet priors, one
 * parameterized by <code>α</code> for topics in documents, and one
 * parameterized by <code>β</code> for words in topics. These priors
 * are treated as hyperparameters for purposes of estimation;
 * cross-validation may be used to provide so-called empirical Bayes
 * estimates for the priors.
 *
 * <p>The full LDA probability model for a corpus follows the
 * generative process outlined above. First, topic distributions are
 * chosen at the corpus level for each topic given their Dirichlet
 * prior, and then the remaining variables are generated given these
 * topic distributions:
 *
 * <blockquote><pre>
 * p(words, topics, θ, φ | α, β)
 * = <big><big><big>Π</big></big></big><sub><sub>topic < numTopics</sub></sub> p(φ[topic] | β)
 *   * p(words, topics, θ | α, φ)</pre></blockquote>
 *
 * Note that because the multinomial parameters for
 * <code>φ[topic]</code> are continuous, <code>p(φ[topic] | β)</code>
 * represents a density, not a discrete distribution. Thus it does not
 * make sense to talk about the probability of a given multinomial
 * <code>φ[topic]</code>; non-zero results only arise from integrals
 * over measurable subsets of the multinomial simplex. It is possible
 * to sample from a density, so the generative model is well founded.
 *
 * <p>A document is generated by first generating its topic
 * distribution given the Dirichlet prior, and then generating its
 * topics and words:
 *
 * <blockquote><pre>
 * p(words, topics, θ | α, φ)
 * = <big><big><big>Π</big></big></big><sub><sub>doc < numDocs</sub></sub> p(θ[doc] | α)
 *       * p(words[doc], topics[doc] | θ[doc], φ)</pre></blockquote>
 *
 * The topic and word are generated from the multinomials
 * <code>θ[doc]</code> and the topic distributions <code>φ</code>
 * using the chain rule, first generating the topic given the
 * document's topic distribution and then generating the word given
 * the topic's word distribution.
 *
 * <blockquote><pre>
 * p(words[doc], topics[doc] | θ[doc], φ)
 * = <big><big><big>Π</big></big></big><sub><sub>token < words[doc].length</sub></sub> p(topics[doc][token] | θ[doc])
 *       * p(words[doc][token] | φ[topics[doc][token]])</pre></blockquote>
 *
 * <p>Given the topic and document multinomials, this distribution is
 * discrete, and thus may be evaluated. It may also be marginalized by
 * summation:
 *
 * <blockquote><pre>
 * p(words[doc] | θ[doc], φ)
 * = <big><big><big>Π</big></big></big><sub><sub>token < words[doc].length</sub></sub> <big><big><big>Σ</big></big></big><sub><sub>topic < numTopics</sub></sub> p(topic | θ[doc]) * p(words[doc][token] | φ[topic])</pre></blockquote>
 *
 * <p>Conditional probabilities are computed in the usual way by
 * marginalizing other variables through integration. Unfortunately,
 * this simple mathematical operation is often intractable
 * computationally.
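 *
 * <p>The last displayed marginalization can be evaluated directly
 * when <code>θ[doc]</code> and <code>φ</code> are known. The
 * following sketch (illustrative only, not part of this class's API)
 * computes the log of <code>p(words[doc] | θ[doc], φ)</code>, summing
 * over topics for each token and working in logs to avoid underflow:
 *
 * <blockquote><pre>
 * static double docLogLikelihood(int[] docWords, double[] theta, double[][] phi) {
 *     double logLikelihood = 0.0;
 *     for (int token = 0; token < docWords.length; ++token) {
 *         int word = docWords[token];
 *         double tokenProb = 0.0;
 *         for (int topic = 0; topic < theta.length; ++topic)
 *             tokenProb += theta[topic] * phi[topic][word]; // sum over topics
 *         logLikelihood += java.lang.Math.log(tokenProb);   // product becomes a sum of logs
 *     }
 *     return logLikelihood;
 * }</pre></blockquote>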
 *
 * <h4>Estimating LDA with Collapsed Gibbs Sampling</h4>
 *
 * <p>This class uses a collapsed form of Gibbs sampling over the
 * posterior distribution of topic assignments given the documents and
 * Dirichlet priors:
 *
 * <blockquote><pre>
 * p(topics | words, α, β)</pre></blockquote>
 *
 * <p>This distribution may be derived from the joint distribution by
 * marginalizing (also known as &quot;collapsing&quot; or
 * &quot;integrating out&quot;) the contributions of the
 * document-topic and topic-word distributions.
 *
 * <p>The Gibbs sampler used to estimate LDA models produces samples
 * that consist of a topic assignment to every token in the corpus.
 * The conjugacy of the Dirichlet prior for multinomials makes the
 * sampling straightforward.
 *
 * <p>An initial sample is produced by randomly assigning topics to
 * tokens. Then, the sampler works iteratively through the corpus, one
 * token at a time. At each token, it samples a new topic assignment
 * to that token given all the topic assignments to other tokens in
 * the corpus:
 *
 * <blockquote><pre>
 * p(topics[doc][token] | words, topics')</pre></blockquote>
 *
 * The notation <code>topics'</code> represents the set of topic
 * assignments other than to <code>topics[doc][token]</code>. This
 * collapsed posterior conditional is estimated directly:
 *
 * <blockquote><pre>
 * p(topics[doc][token] = topic | words, topics')
 * = (count'(doc,topic) + α) / (docLength[doc]-1 + numTopics*α)
 *   * (count'(topic,word) + β) / (count'(topic) + numWords*β)</pre></blockquote>
 *
 * The counts are all defined relative to <code>topics'</code>; that
 * is, the current topic assignment for the token being sampled is not
 * considered in the counts. Note that the two factors are estimates
 * of <code>θ[doc]</code> and <code>φ[topic]</code> with all data
 * other than the assignment to the current token. Note how the prior
 * concentrations arise as additive smoothing factors in these
 * estimates, a result of the Dirichlet's conjugacy to the
 * multinomial. For the purposes of sampling, the document-length
 * normalization in the denominator of the first term is not
 * necessary, as it remains constant across topics.
 *
 * <p>The posterior Dirichlet distributions may be computed using just
 * the counts. For instance, the posterior distribution for topics in
 * documents is estimated as:
 *
 * <blockquote><pre>
 * p(θ[doc] | α, β, words)
 * = Dirichlet(count(doc,0)+α, count(doc,1)+α, ..., count(doc,numTopics-1)+α)</pre></blockquote>
 *
 * <p>The sampling distribution is defined from the maximum a
 * posteriori (MAP) estimates of the multinomial distribution over
 * topics in a document:
 *
 * <blockquote><pre>
 * θ<sup>*</sup>[doc] = ARGMAX<sub><sub>θ[doc]</sub></sub> p(θ[doc] | α, β, words)</pre></blockquote>
 *
 * which we know from the Dirichlet distribution is:
 *
 * <blockquote><pre>
 * θ<sup>*</sup>[doc][topic]
 * = (count(doc,topic) + α) / (docLength[doc] + numTopics*α)</pre></blockquote>
 *
 * By the same reasoning, the MAP word distribution in topics is:
 *
 * <blockquote><pre>
 * φ<sup>*</sup>[topic][word]
 * = (count(topic,word) + β) / (count(topic) + numWords*β)</pre></blockquote>
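 *
 * <p>To make the resampling step concrete, the following is an
 * illustrative sketch of a single collapsed Gibbs update for one
 * token. The count arrays (<code>docTopicCount</code>,
 * <code>topicWordCount</code>, <code>topicTotalCount</code>) and the
 * other variable names are hypothetical, not fields of this class,
 * and the constant document-length normalization is dropped:
 *
 * <blockquote><pre>
 * // remove the current assignment to obtain the primed counts
 * int oldTopic = topics[doc][token];
 * int word = words[doc][token];
 * --docTopicCount[doc][oldTopic];
 * --topicWordCount[oldTopic][word];
 * --topicTotalCount[oldTopic];
 *
 * // unnormalized p(topics[doc][token] = topic | words, topics')
 * double[] dist = new double[numTopics];
 * double sum = 0.0;
 * for (int topic = 0; topic < numTopics; ++topic) {
 *     dist[topic] = (docTopicCount[doc][topic] + alpha)
 *         * (topicWordCount[topic][word] + beta)
 *         / (topicTotalCount[topic] + numWords * beta);
 *     sum += dist[topic];
 * }
 *
 * // sample the new topic and restore the counts
 * double u = random.nextDouble() * sum;
 * int newTopic = numTopics - 1;
 * double cumulative = 0.0;
 * for (int topic = 0; topic < numTopics; ++topic) {
 *     cumulative += dist[topic];
 *     if (u < cumulative) { newTopic = topic; break; }
 * }
 * topics[doc][token] = newTopic;
 * ++docTopicCount[doc][newTopic];
 * ++topicWordCount[newTopic][word];
 * ++topicTotalCount[newTopic];</pre></blockquote>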