***************************
LATENT DIRICHLET ALLOCATION
***************************

David M. Blei
blei[at]cs.princeton.edu

(C) Copyright 2006, David M. Blei (blei [at] cs [dot] princeton [dot] edu)

This file is part of LDA-C.

LDA-C is free software; you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free
Software Foundation; either version 2 of the License, or (at your
option) any later version.

LDA-C is distributed in the hope that it will be useful, but WITHOUT
ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
for more details.

You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307
USA

------------------------------------------------------------------------

This is a C implementation of latent Dirichlet allocation (LDA), a
model of discrete data which is fully described in Blei et al. (2003)
(http://www.cs.berkeley.edu/~blei/papers/blei03a.pdf).

LDA is a hierarchical probabilistic model of documents.  Let \alpha be
a scalar and \beta_{1:K} be K distributions of words (called "topics").
As implemented here, a K-topic LDA model assumes the following
generative process of an N-word document:

     1. \theta | \alpha ~ Dirichlet(\alpha, ..., \alpha)

     2. for each word n in {1, ..., N}:

        a. z_n | \theta ~ Mult(\theta)

        b. w_n | z_n, \beta ~ Mult(\beta_{z_n})

This code implements variational inference of \theta and z_{1:N} for a
document, and estimation of the topics \beta_{1:K} and Dirichlet
parameter \alpha.

------------------------------------------------------------------------

TABLE OF CONTENTS

A. COMPILING

B. TOPIC ESTIMATION

   1. SETTINGS FILE

   2. DATA FILE FORMAT

C. INFERENCE

D. PRINTING TOPICS

E. QUESTIONS, COMMENTS, PROBLEMS, AND UPDATE ANNOUNCEMENTS

------------------------------------------------------------------------

A. COMPILING

Type "make" in a shell.

------------------------------------------------------------------------

B. TOPIC ESTIMATION

Estimate the model by executing:

     lda est [alpha] [k] [settings] [data] [random/seeded/*] [directory]

The term [random/seeded/*] describes how the topics will be
initialized.  "random" initializes each topic randomly; "seeded"
initializes each topic to a distribution smoothed from a randomly
chosen document; or, you can specify a model name to load a
pre-existing model as the initial model (this is useful to continue EM
from where it left off).  To change the number of initial documents
used, edit lda-estimate.c.

The model (i.e., \alpha and \beta_{1:K}) and variational posterior
Dirichlet parameters will be saved in the specified directory every
ten iterations.  Additionally, there will be a log file for the
likelihood bound and convergence score at each iteration.  The
algorithm runs until that score is less than "em_convergence" (from
the settings file) or "em_max_iter" iterations are reached.  (To
change the lag between saved models, edit lda-estimate.c.)

The saved models are in two files:

     <iteration>.other contains alpha.

     <iteration>.beta contains the log of the topic distributions.
     Each line is a topic; in line k, each entry is log p(w | z=k)

The variational posterior Dirichlets are in:

     <iteration>.gamma

The settings file and data format are described below.

1. Settings file

See settings.txt for a sample.  See inf-settings.txt for an example of
a settings file for inference.  These are placeholder values; they
should be experimented with.

The settings file has the following form:

     var max iter [integer e.g., 10 or -1]
     var convergence [float e.g., 1e-8]
     em max iter [integer e.g., 100]
     em convergence [float e.g., 1e-5]
     alpha [fixed/estimate]

where the settings are

     [var max iter]

     The maximum number of iterations of coordinate ascent variational
     inference for a single document.  A value of -1 indicates "full"
     variational inference, until the variational convergence
     criterion is met.

     [var convergence]

     The convergence criterion for variational inference.  Stop if
     (score_old - score) / abs(score_old) is less than this value (or
     after the maximum number of iterations).  Note that the score is
     the lower bound on the likelihood for a particular document.

     [em max iter]

     The maximum number of iterations of variational EM.

     [em convergence]

     The convergence criterion for variational EM.  Stop if (score_old -
     score) / abs(score_old) is less than this value (or after the
     maximum number of iterations).  Note that "score" is the lower
     bound on the likelihood for the whole corpus.

     [alpha]

     If set to [fixed] then alpha does not change from iteration to
     iteration.  If set to [estimate], then alpha is estimated along
     with the topic distributions.

2. Data format

Under LDA, the words of each document are assumed exchangeable.  Thus,
each document is succinctly represented as a sparse vector of word
counts.  The data is a file where each line is of the form:

     [M] [term_1]:[count] [term_2]:[count] ...  [term_N]:[count]

where [M] is the number of unique terms in the document, and the
[count] associated with each term is how many times that term appeared
in the document.  Note that [term_1] is an integer which indexes the
term; it is not a string.

------------------------------------------------------------------------

C. INFERENCE

To perform inference on a different set of data (in the same format as
for estimation), execute:

     lda inf [settings] [model] [data] [name]

Variational inference is performed on the data using the model in
[model].* (see above).  Two files will be created: [name].gamma
contains the variational Dirichlet parameters for each document;
[name].likelihood contains the bound on the likelihood for each
document.

------------------------------------------------------------------------

D. PRINTING TOPICS

The Python script topics.py lets you print out the top N
words from each topic in a .beta file.  Usage is:

     python topics.py <beta file> <vocab file> <n words>

------------------------------------------------------------------------

E. QUESTIONS, COMMENTS, PROBLEMS, AND UPDATE ANNOUNCEMENTS

Please join the topic-models mailing list,
topic-models@lists.cs.princeton.edu.

To join, go to http://lists.cs.princeton.edu and click on
"topic-models."
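
------------------------------------------------------------------------

Illustration of the data format (section B.2).  The sketch below is not
part of LDA-C; it only shows one way to turn tokenized documents into
lines of the form "[M] [term]:[count] ...".  The function names
(build_vocab, to_lda_c_line, parse_lda_c_line) are hypothetical, and
numbering terms from 0 in order of first appearance is an assumption:
this file only requires that each term be an integer index.

```python
from collections import Counter


def build_vocab(docs):
    """Map each distinct term to an integer index (assumed 0-based),
    assigned in order of first appearance across the corpus."""
    vocab = {}
    for doc in docs:
        for term in doc:
            if term not in vocab:
                vocab[term] = len(vocab)
    return vocab


def to_lda_c_line(doc, vocab):
    """Render one tokenized document as '[M] [term]:[count] ...',
    where M is the number of unique terms in the document."""
    counts = Counter(vocab[term] for term in doc)
    fields = [str(len(counts))]
    fields += [f"{index}:{count}" for index, count in sorted(counts.items())]
    return " ".join(fields)


def parse_lda_c_line(line):
    """Read one data line back into a {term index: count} dictionary,
    checking that the leading [M] matches the number of entries."""
    fields = line.split()
    declared = int(fields[0])
    counts = {}
    for field in fields[1:]:
        index, count = field.split(":")
        counts[int(index)] = int(count)
    if len(counts) != declared:
        raise ValueError("unique-term count does not match the entries")
    return counts


# Two toy documents, already split into word tokens.
docs = [
    ["topic", "model", "topic"],
    ["model", "data", "data", "data"],
]
vocab = build_vocab(docs)
lines = [to_lda_c_line(doc, vocab) for doc in docs]
for line in lines:
    print(line)
```

Writing the lines to a file, one per document, should give something
usable as the [data] argument.  The vocabulary mapping has to be kept
alongside it (for example, one term per line in index order), since the
data file itself stores only integer indices.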