

But if we have a corpus of academic papers on nutrition, there is a good chance that some of these papers contain many occurrences of lactose. And if some of those articles focus on lactose intolerance, there is a fair chance that the word lactase will occur frequently too. At the same time, lactase might be even rarer in the rest of the corpus than lactose. Since these two words – lactose and lactase – co-occur frequently, our topic model will recognize them as elements belonging to the same topic. In terms of Bayesian statistics, topic modelling optimizes, for each word in a document, the probability of a topic given the document times the probability of the word given that topic.
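To make this product concrete, here is how it is usually written for Latent Dirichlet Allocation (LDA), the model implemented in Mallet; the symbols w (word), t (topic), d (document) and the number of topics T are our notation, not the chapter's:

P(w \mid d) = \sum_{t=1}^{T} P(w \mid t) \, P(t \mid d)

Training adjusts the two distributions P(w \mid t) and P(t \mid d) so that this sum assigns high probability to the words actually observed. Because lactose and lactase keep appearing in the same documents, placing both under one shared topic is the most economical way for the model to explain the data.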

Our discussion of document classification in the last chapter provided new possibilities for analyzing large quantities of data, but it also raised an important issue: LightSide makes use of thousands of features, which makes it impossible to interpret these features individually. One way of dealing with this abundance of features is to abstract away from the individual features towards larger concepts, specifically towards topics. This is precisely what topic modelling allows us to do. In topic modelling, we generate models which try to learn from the contexts in which words occur. This approach is based on what is known as the Firthian hypothesis, which states that “you shall know a word by the company it keeps” (Firth 1957: 11). If we have really large amounts of text, words that appear together frequently in similar contexts are likely to contribute to the same topic. In this chapter, we discuss topic modelling, a useful method for analyzing large quantities of text, and present an example using Mallet.
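To illustrate what such a model does in practice before we turn to Mallet, here is a minimal sketch in Python using the gensim library as a stand-in; the toy corpus and all parameter values are invented for illustration and are not taken from the chapter:

# A minimal topic-modelling sketch using gensim's LdaModel as a stand-in
# for Mallet (the tool used in this chapter). The corpus and parameters
# are illustrative only.
from gensim import corpora, models

# Tiny toy corpus: two documents about lactose, two about diet in general.
docs = [
    "lactose intolerance lactase enzyme milk lactose",
    "lactase persistence lactose digestion milk",
    "protein intake muscle training diet",
    "diet protein vitamins training nutrition",
]
texts = [doc.split() for doc in docs]

# Map each word to an integer id and convert documents to bags of words.
dictionary = corpora.Dictionary(texts)
bows = [dictionary.doc2bow(text) for text in texts]

# Fit an LDA model with two topics; passes controls training iterations.
lda = models.LdaModel(bows, num_topics=2, id2word=dictionary,
                      passes=20, random_state=1)

# Words that co-occur (lactose, lactase, milk) should share a topic.
for topic_id in range(2):
    print(topic_id, lda.print_topic(topic_id, topn=4))

On four toy documents the resulting topics are of course unstable; the point is only the mechanics: bags of words go in, and two learned distributions come out – words per topic and topics per document – which is exactly the product described above.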
