ModErn Text Analysis
META Enumerates Textual Applications
meta::topics::parallel_lda_gibbs Class Reference

An LDA topic model implemented using the Approximate Distributed LDA algorithm.

#include <parallel_lda_gibbs.h>

Inheritance diagram for meta::topics::parallel_lda_gibbs:
meta::topics::parallel_lda_gibbs inherits meta::topics::lda_gibbs, which inherits meta::topics::lda_model.

Public Member Functions

virtual ~parallel_lda_gibbs ()=default
 Destructor: virtual for potential subclassing.
 
- Public Member Functions inherited from meta::topics::lda_gibbs
 lda_gibbs (std::shared_ptr< index::forward_index > idx, uint64_t num_topics, double alpha, double beta)
 Constructs the lda model over the given documents, with the given number of topics, and hyperparameters \(\alpha\) and \(\beta\) for the priors on \(\phi\) (topic distributions) and \(\theta\) (topic proportions), respectively.
 
virtual ~lda_gibbs ()=default
 Destructor: virtual for potential subclassing.
 
virtual void run (uint64_t num_iters, double convergence=1e-6) override
 Runs the sampler for a maximum number of iterations, or until the given convergence criterion is met.
 
virtual double compute_term_topic_probability (term_id term, topic_id topic) const override
 
virtual double compute_doc_topic_probability (doc_id doc, topic_id topic) const override
 
- Public Member Functions inherited from meta::topics::lda_model
 lda_model (std::shared_ptr< index::forward_index > idx, uint64_t num_topics)
 Constructs an lda_model over the given set of documents and with a fixed number of topics.
 
virtual ~lda_model ()=default
 Destructor.
 
void save_doc_topic_distributions (const std::string &filename) const
 Saves the topic proportions \(\theta_d\) for each document to the given file.
 
void save_topic_term_distributions (const std::string &filename) const
 Saves the term distributions \(\phi_j\) for each topic to the given file.
 
void save (const std::string &prefix) const
 Saves the current model to a set of files beginning with prefix: prefix.phi, prefix.theta, and prefix.terms.
 
uint64_t num_topics () const
 

Protected Member Functions

virtual void initialize () override
 Initializes the first set of topic assignments for inference.
 
virtual void perform_iteration (uint64_t iter, bool init=false) override
 Performs a sampling iteration of the AD-LDA algorithm.
 
virtual void decrease_counts (topic_id topic, term_id term, doc_id doc) override
 Decreases all counts associated with the given topic, term, and document by one.
 
virtual void increase_counts (topic_id topic, term_id term, doc_id doc) override
 Increases all counts associated with the given topic, term, and document by one.
 
virtual double compute_sampling_weight (term_id term, doc_id doc, topic_id topic) const override
 Computes a weight proportional to \(P(z_i = j | w, \boldsymbol{z})\).
 
- Protected Member Functions inherited from meta::topics::lda_gibbs
topic_id sample_topic (term_id term, doc_id doc)
 Samples a topic from the full conditional distribution \(P(z_i = j | w, \boldsymbol{z})\).
 
double corpus_log_likelihood () const
 
lda_gibbs & operator= (const lda_gibbs &)=delete
 lda_gibbs cannot be copy assigned.
 
 lda_gibbs (const lda_gibbs &other)=delete
 lda_gibbs cannot be copy constructed.
 
- Protected Member Functions inherited from meta::topics::lda_model
lda_model & operator= (const lda_model &)=delete
 lda_models cannot be copy assigned.
 
 lda_model (const lda_model &)=delete
 lda_models cannot be copy constructed.
 

Protected Attributes

parallel::thread_pool pool_
 The thread pool used for parallelization.
 
std::unordered_map< std::thread::id, std::vector< stats::multinomial< term_id > > > phi_diffs_
 Stores the difference in topic_term counts on a per-thread basis for use in the reduction step.
 
- Protected Attributes inherited from meta::topics::lda_gibbs
std::vector< std::vector< topic_id > > doc_word_topic_
 The topic assignment for every word in every document.
 
std::vector< stats::multinomial< term_id > > phi_
 The word distributions for each topic, \(\phi_t\).
 
std::vector< stats::multinomial< topic_id > > theta_
 The topic distributions for each document, \(\theta_d\).
 
std::mt19937_64 rng_
 The random number generator for the sampler.
 
- Protected Attributes inherited from meta::topics::lda_model
std::shared_ptr< index::forward_index > idx_
 The index containing the documents for the model.
 
size_t num_topics_
 The number of topics.
 
size_t num_words_
 The number of total unique words.
 

Detailed Description

An LDA topic model implemented using the Approximate Distributed LDA algorithm.

Based on the algorithm detailed by David Newman et al.

See also
http://www.jmlr.org/papers/volume10/newman09a/newman09a.pdf

Member Function Documentation

void meta::topics::parallel_lda_gibbs::initialize ( )
override protected virtual

Initializes the first set of topic assignments for inference.

Employs an online application of the sampler where counts are only considered for the words observed so far through the loop.

Reimplemented from meta::topics::lda_gibbs.
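
The online initialization described for initialize() can be sketched as follows. This is a minimal, self-contained illustration and not MeTA's code: the `counts` struct, the `online_init` function, and the plain integer tables are hypothetical stand-ins for the class's internal state, and the weight expression is the standard collapsed Gibbs form, assumed here for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical stand-in for the sampler's count tables.
struct counts
{
    std::vector<std::vector<std::uint64_t>> topic_term; // [topic][term]
    std::vector<std::vector<std::uint64_t>> doc_topic;  // [doc][topic]
    std::vector<std::uint64_t> topic_total;             // [topic]
};

// Assigns an initial topic to every token, one token at a time. Each draw
// only sees the counts of tokens that were assigned earlier in the loop.
std::vector<std::vector<std::size_t>>
online_init(const std::vector<std::vector<std::size_t>>& docs,
            std::size_t num_topics, std::size_t num_terms, double alpha,
            double beta, counts& c, std::mt19937_64& rng)
{
    assert(num_topics > 0 && num_terms > 0 && alpha > 0 && beta > 0);
    c.topic_term.assign(num_topics, std::vector<std::uint64_t>(num_terms, 0));
    c.doc_topic.assign(docs.size(), std::vector<std::uint64_t>(num_topics, 0));
    c.topic_total.assign(num_topics, 0);

    std::vector<std::vector<std::size_t>> assignments(docs.size());
    std::vector<double> weights(num_topics);
    for (std::size_t d = 0; d < docs.size(); ++d)
    {
        for (std::size_t term : docs[d])
        {
            assert(term < num_terms);
            // Weights use only the counts accumulated so far.
            for (std::size_t j = 0; j < num_topics; ++j)
            {
                weights[j] = (c.topic_term[j][term] + beta)
                             / (c.topic_total[j] + num_terms * beta)
                             * (c.doc_topic[d][j] + alpha);
            }
            std::discrete_distribution<std::size_t> dist(weights.begin(),
                                                         weights.end());
            const std::size_t z = dist(rng);
            assignments[d].push_back(z);
            // The new assignment becomes visible to all later draws.
            ++c.topic_term[z][term];
            ++c.doc_topic[d][z];
            ++c.topic_total[z];
        }
    }
    return assignments;
}
```

With all counts at zero, the first token is drawn uniformly over topics; later tokens are increasingly influenced by the assignments made before them.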

void meta::topics::parallel_lda_gibbs::perform_iteration ( uint64_t  iter,
bool  init = false 
)
override protected virtual

Performs a sampling iteration of the AD-LDA algorithm.

This consists of splitting up the sampling of (document, word) topic assignments across threads, keeping for each thread a difference in counts for the potentially shared topic counts. Once the sampling has finished, the counts are reduced down (serially) before the iteration is completed.

Parameters
iter  The current iteration number
init  Whether or not this iteration should use the online method for initializing the sampler

Reimplemented from meta::topics::lda_gibbs.
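
The split, sample, and reduce structure described for perform_iteration() might look roughly like the sketch below. It is an assumption-laden illustration rather than MeTA's implementation: `sampler_state`, `count_diff`, and `adlda_iteration` are hypothetical names, raw `std::thread` objects stand in for `parallel::thread_pool`, documents are dealt to threads round-robin, and the weight expression is the standard collapsed Gibbs form.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <thread>
#include <vector>

// Signed counts so that per-thread differences can be negative.
using table = std::vector<std::vector<std::int64_t>>;

// Hypothetical stand-in for the sampler state (not MeTA's data members).
struct sampler_state
{
    std::vector<std::vector<std::size_t>> docs;   // [doc][pos] -> term id
    std::vector<std::vector<std::size_t>> topics; // [doc][pos] -> topic id
    table topic_term;                      // [topic][term], shared
    std::vector<std::int64_t> topic_total; // [topic], shared
    table doc_topic;                       // [doc][topic], one owner thread
    double alpha;
    double beta;
};

// Differences one thread accumulates against the shared counts.
struct count_diff
{
    table topic_term;
    std::vector<std::int64_t> topic_total;
};

void adlda_iteration(sampler_state& s, std::size_t num_threads,
                     std::uint64_t seed)
{
    assert(num_threads > 0 && !s.topic_term.empty());
    const std::size_t k = s.topic_term.size();
    const std::size_t v = s.topic_term.front().size();
    std::vector<count_diff> diffs(
        num_threads, count_diff{table(k, std::vector<std::int64_t>(v, 0)),
                                std::vector<std::int64_t>(k, 0)});

    auto worker = [&s, &diffs, k, v, num_threads, seed](std::size_t tid) {
        std::mt19937_64 rng{seed + tid};
        count_diff& diff = diffs[tid];
        std::vector<double> weights(k);
        // Thread tid owns documents tid, tid + num_threads, ...
        for (std::size_t d = tid; d < s.docs.size(); d += num_threads)
        {
            for (std::size_t i = 0; i < s.docs[d].size(); ++i)
            {
                const std::size_t term = s.docs[d][i];
                const std::size_t old_topic = s.topics[d][i];
                // Remove the token; shared counts change only via the diff.
                --diff.topic_term[old_topic][term];
                --diff.topic_total[old_topic];
                --s.doc_topic[d][old_topic];
                for (std::size_t j = 0; j < k; ++j)
                {
                    const double n_tw
                        = s.topic_term[j][term] + diff.topic_term[j][term];
                    const double n_t = s.topic_total[j] + diff.topic_total[j];
                    weights[j] = (n_tw + s.beta) / (n_t + v * s.beta)
                                 * (s.doc_topic[d][j] + s.alpha);
                }
                std::discrete_distribution<std::size_t> dist(weights.begin(),
                                                             weights.end());
                const std::size_t new_topic = dist(rng);
                ++diff.topic_term[new_topic][term];
                ++diff.topic_total[new_topic];
                ++s.doc_topic[d][new_topic];
                s.topics[d][i] = new_topic;
            }
        }
    };

    std::vector<std::thread> threads;
    for (std::size_t tid = 0; tid < num_threads; ++tid)
        threads.emplace_back(worker, tid);
    for (auto& t : threads)
        t.join();

    // Serial reduction of the per-thread differences into the shared counts.
    for (const auto& diff : diffs)
    {
        for (std::size_t j = 0; j < k; ++j)
        {
            s.topic_total[j] += diff.topic_total[j];
            for (std::size_t w = 0; w < v; ++w)
                s.topic_term[j][w] += diff.topic_term[j][w];
        }
    }
}
```

During the parallel phase the shared tables are read-only and each thread writes only to its own difference and to the documents it owns, so no locking is needed. The approximation is that a thread does not see the other threads' updates until the reduction.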

void meta::topics::parallel_lda_gibbs::decrease_counts ( topic_id  topic,
term_id  term,
doc_id  doc 
)
override protected virtual

Decreases all counts associated with the given topic, term, and document by one.

Parameters
topic  The topic in question
term  The term in question
doc  The document in question

Reimplemented from meta::topics::lda_gibbs.

void meta::topics::parallel_lda_gibbs::increase_counts ( topic_id  topic,
term_id  term,
doc_id  doc 
)
override protected virtual

Increases all counts associated with the given topic, term, and document by one.

Parameters
topic  The topic in question
term  The term in question
doc  The document in question

Reimplemented from meta::topics::lda_gibbs.
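
As a rough illustration of what decrease_counts() and increase_counts() keep in step, the sketch below updates three tables together. Which counts MeTA actually touches is an assumption here: `count_tables` and its members are hypothetical, and this serial version ignores the per-thread differences that the parallel sampler keeps for shared counts.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical count tables; MeTA stores its counts differently.
struct count_tables
{
    std::vector<std::vector<std::uint64_t>> topic_term; // [topic][term]
    std::vector<std::vector<std::uint64_t>> doc_topic;  // [doc][topic]
    std::vector<std::uint64_t> topic_total;             // [topic]

    count_tables(std::size_t topics, std::size_t terms, std::size_t docs)
        : topic_term(topics, std::vector<std::uint64_t>(terms, 0)),
          doc_topic(docs, std::vector<std::uint64_t>(topics, 0)),
          topic_total(topics, 0)
    {
    }

    // Adds one token of `term` in `doc` to `topic`.
    void increase(std::size_t topic, std::size_t term, std::size_t doc)
    {
        ++topic_term[topic][term];
        ++doc_topic[doc][topic];
        ++topic_total[topic];
    }

    // Removes one such token; it must have been counted before.
    void decrease(std::size_t topic, std::size_t term, std::size_t doc)
    {
        assert(topic_term[topic][term] > 0 && doc_topic[doc][topic] > 0);
        --topic_term[topic][term];
        --doc_topic[doc][topic];
        --topic_total[topic];
    }
};
```

A Gibbs step would call `decrease` for a token's old topic before sampling and `increase` for the newly sampled topic afterwards, so the tables always describe every token except the one being resampled.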

double meta::topics::parallel_lda_gibbs::compute_sampling_weight ( term_id  term,
doc_id  doc,
topic_id  topic 
) const
override protected virtual

Computes a weight proportional to \(P(z_i = j | w, \boldsymbol{z})\).

Parameters
term  The current word we are sampling for
doc  The document in which the term resides
topic  The topic \(j\) we want to compute the probability for
Returns
a weight proportional to the probability that the given term in the given document belongs to the given topic

Reimplemented from meta::topics::lda_gibbs.
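
For orientation, a common form of this weight in collapsed Gibbs sampling for LDA is \((n_{w,j} + \beta) / (n_j + V\beta) \cdot (n_{d,j} + \alpha)\), where the counts exclude the token being resampled and \(V\) is the vocabulary size. The sketch below assumes that form; it is not taken from the MeTA source, and `sampling_weight` and `normalize` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Unnormalized weight for assigning a token to topic j.
//   n_term_topic: times the term is currently assigned to topic j
//   n_topic:      total tokens currently assigned to topic j
//   n_doc_topic:  tokens in the document currently assigned to topic j
double sampling_weight(double n_term_topic, double n_topic,
                       double n_doc_topic, std::size_t num_terms,
                       double alpha, double beta)
{
    assert(alpha > 0 && beta > 0 && num_terms > 0);
    return (n_term_topic + beta) / (n_topic + num_terms * beta)
           * (n_doc_topic + alpha);
}

// Turns per-topic weights into probabilities that sum to one.
std::vector<double> normalize(std::vector<double> weights)
{
    double sum = 0;
    for (double w : weights)
        sum += w;
    assert(sum > 0);
    for (double& w : weights)
        w /= sum;
    return weights;
}
```

Because only the ratios between topics matter when sampling, the weights can be passed to a discrete sampler directly; `normalize` is only needed when actual probabilities are wanted.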

Member Data Documentation

std::unordered_map<std::thread::id, std::vector<stats::multinomial<term_id> > > meta::topics::parallel_lda_gibbs::phi_diffs_
protected

Stores the difference in topic_term counts on a per-thread basis for use in the reduction step.

Indexed as [thread_id][topic]
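
One way to picture the [thread_id][topic] layout is the sketch below. It is a simplified stand-in, not the class's code: `thread_diffs`, `local`, and `reduce_into` are hypothetical names, plain signed integers replace `stats::multinomial< term_id >`, and the mutex-guarded lookup is an assumption about how entries could be created safely.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <thread>
#include <unordered_map>
#include <utility>
#include <vector>

using diff_table = std::vector<std::vector<std::int64_t>>; // [topic][term]

class thread_diffs
{
  public:
    thread_diffs(std::size_t num_topics, std::size_t num_terms)
        : num_topics_{num_topics}, num_terms_{num_terms}
    {
    }

    // Returns the calling thread's table, creating it on first use. The
    // lock guards only the map; references to unordered_map elements stay
    // valid when other entries are inserted, so a thread may keep writing
    // to its own table after the lock is released.
    diff_table& local()
    {
        std::lock_guard<std::mutex> lock{mutex_};
        const auto id = std::this_thread::get_id();
        auto it = diffs_.find(id);
        if (it == diffs_.end())
        {
            diff_table zeros(num_topics_,
                             std::vector<std::int64_t>(num_terms_, 0));
            it = diffs_.emplace(id, std::move(zeros)).first;
        }
        return it->second;
    }

    // Serial reduction; call only after all worker threads have finished.
    // Adds every thread's differences to `shared` and resets them to zero.
    void reduce_into(diff_table& shared)
    {
        assert(shared.size() == num_topics_);
        for (auto& entry : diffs_)
        {
            for (std::size_t topic = 0; topic < num_topics_; ++topic)
            {
                for (std::size_t term = 0; term < num_terms_; ++term)
                {
                    shared[topic][term] += entry.second[topic][term];
                    entry.second[topic][term] = 0;
                }
            }
        }
    }

  private:
    std::size_t num_topics_;
    std::size_t num_terms_;
    std::mutex mutex_;
    std::unordered_map<std::thread::id, diff_table> diffs_;
};
```

Keying the map by `std::thread::id` lets each worker find its own table without being handed an index, at the cost of a short lock on every lookup in this sketch.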

