Latent Dirichlet Allocation (LDA) for Topic Modelling

Bijula Ratheesh
4 min read · Nov 30, 2020

Topic modelling is a statistical technique used to extract topics from a given collection of documents. LDA is one of the most prominent and widely used topic models.

Let us start with its definition as per the original research paper (Blei, Ng and Jordan, 2003) and then move on to each component in detail, along with comparisons to earlier approaches.

From the name we can infer that it has something to do with latent variables (variables that are not directly observed but inferred from the data) and the Dirichlet distribution (https://www.statisticshowto.com/dirichlet-distribution/).

Latent Dirichlet allocation (LDA) is a generative probabilistic model of a corpus. The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words.

The basic methodology for text corpora was proposed by Information Retrieval researchers (IR 1999): reduce each document in the corpus to a vector of real numbers. This method is followed to this day in search engines. [A corpus is a large collection of structured texts; corpora is the plural of corpus.]

The easiest way to implement this basic methodology is the TF-IDF technique. In TF-IDF, for each document in the corpus, the term frequency (TF) is the normalized count of each word's occurrences in that document. The inverse document frequency (IDF) discounts words that appear in many documents, measuring on a log scale how rare a word is across the entire corpus; the TF-IDF score is the product of the two.
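As a quick illustration, here is a minimal TF-IDF sketch using scikit-learn's TfidfVectorizer; the toy documents are invented for the example:

```python
# Minimal TF-IDF sketch (toy documents, invented for illustration).
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

vectorizer = TfidfVectorizer()      # builds TF-IDF weights per word, per document
X = vectorizer.fit_transform(docs)  # sparse matrix: documents x vocabulary

print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))         # each row is one document's TF-IDF vector
```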

The main shortcoming of TF-IDF is that it provides a relatively small amount of reduction in description length and reveals little of the inter- or intra-document statistical structure. This was addressed by Latent Semantic Indexing (LSI), which applies Singular Value Decomposition (SVD) for dimensionality reduction.
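Continuing the sketch, LSI amounts to a truncated SVD of the TF-IDF matrix; the toy documents and the choice of two latent dimensions below are again invented for the example:

```python
# Minimal LSI sketch: truncated SVD on a TF-IDF matrix (toy data,
# 2 latent dimensions chosen purely for illustration).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

X = TfidfVectorizer().fit_transform(docs)  # documents x vocabulary
X_lsi = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

print(X_lsi.round(2))  # each row: a document's coordinates in latent space
```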

However, LSI showed poor retrieval performance and had trouble with synonymy and polysemy. pLSI, or probabilistic Latent Semantic Indexing, was developed as an alternative to LSI. It models each word in a document as a sample from a mixture model, where the mixture components are multinomial random variables that can be viewed as representations of "topics." Thus each word is generated from a single topic, and different words in a document may be generated from different topics. In pLSI, each document is represented as a list of numbers (the mixing proportions for topics), and there is no generative probabilistic model for these numbers. This leads to several problems: (1) the number of parameters in the model grows linearly with the size of the corpus, which leads to serious problems with overfitting, and (2) it is not clear how to assign probability to a document outside of the training set.

Both LSI and pLSI are based on the bag-of-words assumption: the order of words in a document can be neglected. This is called exchangeability, and it extends to the order of documents within a corpus, which can also be neglected. Building a properly generative model consistent with exchangeability led to Latent Dirichlet Allocation, or LDA.

LDA assumes the following generative process for each document w in a corpus D (a simulation of this process in code follows the list):

1. Choose N ∼ Poisson(ξ).

2. Choose θ ∼ Dir(α).

3. For each of the N words wₙ:

(a) Choose a topic zₙ ∼ Multinomial(θ).

(b) Choose a word wₙ from p(wₙ | zₙ, β), a multinomial probability conditioned on the topic zₙ.
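To make this generative story concrete, here is a minimal NumPy simulation; the vocabulary, the two topics, and all parameter values (including the Poisson mean standing in for ξ) are invented for illustration:

```python
# Simulating LDA's generative process (all values invented for illustration).
import numpy as np

rng = np.random.default_rng(0)

vocab = ["ball", "game", "vote", "party", "rain", "sun"]  # toy vocabulary
K, V = 2, len(vocab)                                      # 2 topics (illustrative)

alpha = np.full(K, 0.5)                    # Dirichlet prior over topic mixtures
beta = rng.dirichlet(np.ones(V), size=K)   # per-topic word distributions

N = rng.poisson(8)                  # 1. document length N ~ Poisson(xi)
theta = rng.dirichlet(alpha)        # 2. topic mixture theta ~ Dir(alpha)

words = []
for _ in range(N):                  # 3. for each of the N words:
    z = rng.choice(K, p=theta)      #    (a) topic z_n ~ Multinomial(theta)
    w = rng.choice(V, p=beta[z])    #    (b) word w_n ~ p(w_n | z_n, beta)
    words.append(vocab[w])

print(words)
```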

Given the parameters α and β, the joint distribution of a topic mixture θ, a set of N topics z, and a set of N words w is given by

p(θ, z, w | α, β) = p(θ | α) ∏_{n=1}^{N} p(zₙ | θ) p(wₙ | zₙ, β)

where p(zₙ | θ) is simply θᵢ for the unique i such that zₙⁱ = 1. Integrating over θ and summing over z, we obtain the marginal distribution of a document:

p(w | α, β) = ∫ p(θ | α) ( ∏_{n=1}^{N} Σ_{zₙ} p(zₙ | θ) p(wₙ | zₙ, β) ) dθ

Finally, taking the product of the marginal probabilities of single documents, we obtain the probability of a corpus:

p(D | α, β) = ∏_{d=1}^{M} ∫ p(θ_d | α) ( ∏_{n=1}^{N_d} Σ_{z_dn} p(z_dn | θ_d) p(w_dn | z_dn, β) ) dθ_d

The key inferential problem that we need to solve in order to use LDA is computing the posterior distribution of the hidden variables given a document:

p(θ, z | w, α, β) = p(θ, z, w | α, β) / p(w | α, β)

This posterior is intractable to compute exactly, so approximate inference algorithms are used in practice, such as variational inference (used in the original paper) or Gibbs sampling.

An example of LDA in Python is given below. The dataset can be found at:

https://www.kaggle.com/c/spooky-author-identification
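Below is a minimal sketch using scikit-learn's LatentDirichletAllocation, assuming train.csv from the Kaggle dataset above has been downloaded locally and that its text column holds the documents; the number of topics, vectorizer settings, and file path are illustrative choices:

```python
# Minimal LDA sketch with scikit-learn. Assumes train.csv (from the Kaggle
# dataset linked above) is in the working directory and that its "text"
# column holds the documents. Parameter values are illustrative.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

df = pd.read_csv("train.csv")

# LDA models raw word counts (a multinomial model), so use CountVectorizer
# rather than TF-IDF weights.
vectorizer = CountVectorizer(stop_words="english", max_df=0.95, min_df=2)
X = vectorizer.fit_transform(df["text"])

lda = LatentDirichletAllocation(n_components=10, random_state=0)
doc_topics = lda.fit_transform(X)   # rows: per-document topic mixtures (theta)

# Show the top words of each topic (rows of lda.components_ correspond,
# after normalization, to the per-topic word distributions beta).
words = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = topic.argsort()[-10:][::-1]
    print(f"Topic {k}: {' '.join(words[i] for i in top)}")
```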
