Language Modeling in MeTA

A language model is a probability distribution over sequences of tokens. In most cases, these tokens are words and the sequences are sentences. If we train a language model on some reference corpus, it can then be used to calculate the likelihood of new text with respect to the reference. This has many broad and practical applications in natural language processing.

In MeTA, we have a basic (yet efficient) n-gram language model class. An n-gram language model assumes that the probability of a word depends only on the previous n-1 words. In other words, the model assigns probabilities to windows of n words: each word is scored given the n-1 words that precede it.
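Written out, the n-gram assumption approximates the probability of a sentence w_1 ... w_m as a product of conditional probabilities, each conditioned on at most the previous n-1 words. The notation below is the standard textbook form and is not taken from the MeTA source. Positions before the start of the sentence are assumed to be padded with a sentence-start token.

```latex
P(w_1, \dots, w_m) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n+1}, \dots, w_{i-1})
```

For a 3-gram model, each factor is P(w_i | w_{i-2}, w_{i-1}).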

MeTA does not yet support language model inference, which is the process of learning the model parameters. Instead, it reads an already-trained language model from the standardized ARPA file format. We recommend tokenizing data with MeTA and then using a language modeling toolkit such as KenLM to create the .arpa file. MeTA reads this file and creates its own binarized version, which can then be used for various tasks. Using KenLM, we can create an .arpa file for MeTA with the following command:

./lmplz --order 3 --text input.txt.tok --arpa

Thus, the file that MeTA’s LM uses is the .arpa file written by this command, which contains a 3-gram language model (--order 3).

To run the language modeling applications bundled with MeTA, you need to configure the [language-model] section in the configuration file. Here is an example configuration:

arpa-file = "../data/"
binary-file-prefix = "english-sentences-"

The arpa-file parameter is the path to the .arpa model file. MeTA reads this file and then stores its own binarized version with the prefix binary-file-prefix. MeTA uses whichever n-value was used to generate the .arpa file.

With this section configured, run the sentence likelihood application and pass it the configuration file:

./sentence-likelihood config.toml

Example output using the provided model might look like this:

[info]  Loading language model from binary files: english-sentences-* (../src/lm/language_model.cpp:32)
[info]  Done. (2ms) (../src/lm/language_model.cpp:44)
Input a sentence, (blank) to quit.

> I should get a part time job.
Tokenized sentence: <s> I should get a part time job . </s>
Perplexity per word: 8.29551 (0ms)
Log prob: -9.18843 (0ms)

> I should get a part time octopus.
Tokenized sentence: <s> I should get a part time octopus . </s>
Perplexity per word: 30.0232 (0ms)
Log prob: -14.7746 (0ms)

A higher perplexity means that the model considers the input sentence less likely than a sentence with a lower perplexity. Log probability works the other way around: a sentence with a higher log probability is more likely to have been generated by the language model than a sentence with a lower log probability. Note that all log probabilities are negative, so high log probabilities are close to zero.
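The two numbers in the example output are related. The reported values are consistent with a base-10 log probability (the convention used by ARPA files) and a per-token perplexity of 10 raised to the negative log probability divided by the token count, where the count is the 10 tokens of the tokenized sentence including <s> and </s>. The sketch below encodes that relationship. It is inferred from the example output rather than from MeTA's source, and the function name is hypothetical.

```cpp
#include <cmath>
#include <cstddef>

// Per-token perplexity from a base-10 sentence log probability.
// Assumption: num_tokens counts every token of the tokenized sentence,
// including the <s> and </s> markers.
// Hypothetical helper for illustration; not part of MeTA's API.
double perplexity_per_token(double log10_prob, std::size_t num_tokens)
{
    return std::pow(10.0, -log10_prob / static_cast<double>(num_tokens));
}
```

With the values above, perplexity_per_token(-9.18843, 10) is roughly 8.30 and perplexity_per_token(-14.7746, 10) is roughly 30.02, in line with the reported figures.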

Note that the input sentence should be tokenized in the same way as the reference corpus that was used to train the language model. Otherwise, the vocabularies may not match, and the resulting out-of-vocabulary words will unintentionally lower the likelihood of the sentence.
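To make the tokenization issue concrete, the following sketch counts how many tokens of a sentence are missing from a vocabulary. It does not use MeTA's API. The helper name and the toy vocabulary in the usage note are assumptions chosen for illustration.

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Count the tokens that do not appear in the model's vocabulary.
// Hypothetical helper for illustration; not part of MeTA's API.
std::size_t count_oov(const std::vector<std::string>& tokens,
                      const std::unordered_set<std::string>& vocab)
{
    std::size_t missing = 0;
    for (const auto& tok : tokens)
    {
        if (vocab.find(tok) == vocab.end())
            ++missing;
    }
    return missing;
}
```

Suppose the vocabulary holds the tokens <s>, I, should, get, a, part, time, job, ., and </s>. The sentence tokenized as in the example output has no out-of-vocabulary tokens. If the same sentence is instead split only on whitespace, the final word becomes the single token "job.", which is not in the vocabulary and counts as one out-of-vocabulary token.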

The file src/lm/tools/sentence_likelihood.cpp contains a simple use case of the language model class, as demonstrated above.