Language Modeling in MeTA
A language model is a probability distribution over sequences of tokens. In most cases, these tokens are words and the sequences are sentences. If we train a language model on some reference corpus, it can then be used to calculate the likelihood of new text with respect to the reference. This has many broad and practical applications in natural language processing.
In MeTA, we have a basic (yet efficient) n-gram language model class. An n-gram language model makes the assumption that the probability of a word only depends on the previous n-1 words. That is, the model assigns each word a conditional probability given the n-1 words before it, and the probability of a whole sequence is the product of these conditional probabilities over every window of n words.
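To make the n-gram assumption concrete, here is a minimal Python sketch for the case n = 2, where each word depends only on the word before it. It is not MeTA code: the toy corpus and the function names `train_bigram` and `bigram_prob` are made up for illustration, and the estimate is a plain unsmoothed ratio of counts. Toolkits such as KenLM use smoothing and backoff instead, so their numbers will differ.

```python
from collections import Counter

def train_bigram(sentences):
    """Count contexts and bigrams over sentences padded with <s> and </s>."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts

def bigram_prob(prev, cur, context_counts, bigram_counts):
    """Unsmoothed maximum-likelihood estimate of P(cur | prev)."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, cur)] / context_counts[prev]

# Hypothetical three-sentence reference corpus, already tokenized.
corpus = [
    "i should get a job",
    "i should get a part time job",
    "i like a part time job",
]
ctx, bi = train_bigram(corpus)

# "a" is followed by "part" in two of its three occurrences.
p_part = bigram_prob("a", "part", ctx, bi)
# An unseen continuation gets probability zero without smoothing.
p_octopus = bigram_prob("a", "octopus", ctx, bi)
```

The zero probability for the unseen pair is why practical language models are smoothed: without it, a single unseen n-gram would make an entire sentence impossible.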
MeTA does not yet support language model inference, which is the process of
learning the model parameters. Instead, it reads an already-trained language
model from the standardized .arpa
file format. We recommend tokenizing data with MeTA and then using a language
modeling toolkit such as KenLM to create the
.arpa file. MeTA reads this file and creates its own binarized version, which
can then be used for various tasks. Using KenLM, we can create an .arpa file
for MeTA with the following command:
./lmplz --order 3 --text input.txt.tok --arpa output.arpa
Thus, the file that MeTA’s LM uses is
output.arpa, which is a 3-gram language model.
To run the language modeling applications bundled with MeTA, you need to
include a [language-model] section in the configuration file. Here is an example:
[language-model]
arpa-file = "../data/english-sentences.arpa"
binary-file-prefix = "english-sentences-"
The arpa-file parameter is the path to the
.arpa model file. MeTA reads this
file and then stores its own binarized version with the prefix
binary-file-prefix. MeTA uses whichever n-value was used to generate the .arpa file.
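The n-value does not need to be configured separately because an .arpa file declares it in its header: a `\data\` line followed by one `ngram K=COUNT` line per order. The Python sketch below reads that header to recover the order. It is only an illustration of the file format, not how MeTA parses the file; the function name `arpa_ngram_counts` and the counts in the sample text are hypothetical.

```python
import re

def arpa_ngram_counts(lines):
    """Return {order: number of n-grams} from the \\data\\ header of an ARPA file."""
    counts = {}
    in_header = False
    for raw in lines:
        line = raw.strip()
        if line == "\\data\\":
            in_header = True
            continue
        if not in_header:
            continue
        match = re.fullmatch(r"ngram (\d+)=(\d+)", line)
        if match:
            counts[int(match.group(1))] = int(match.group(2))
        elif line.startswith("\\"):
            # The next section marker (for example \1-grams:) ends the header.
            break
    return counts

# Made-up header of a 3-gram model, truncated after the first section marker.
SAMPLE_ARPA = r"""\data\
ngram 1=5
ngram 2=4
ngram 3=2

\1-grams:
"""

counts = arpa_ngram_counts(SAMPLE_ARPA.splitlines())
order = max(counts)  # the highest declared order is the model's n
```

For a real file, the same function can be called on an open file handle, since it only iterates over lines.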
Example output using the provided model might look like this:
[info] Loading language model from binary files: english-sentences-* (../src/lm/language_model.cpp:32)
[info] Done. (2ms) (../src/lm/language_model.cpp:44)
Input a sentence, (blank) to quit.
> I should get a part time job.
Tokenized sentence: <s> I should get a part time job . </s>
Perplexity per word: 8.29551 (0ms)
Log prob: -9.18843 (0ms)
> I should get a part time octopus.
Tokenized sentence: <s> I should get a part time octopus . </s>
Perplexity per word: 30.0232 (0ms)
Log prob: -14.7746 (0ms)
A higher perplexity means that the model considers the input sentence less likely. Log probability works in the opposite direction: a higher log probability means that the input sentence is more likely to have been generated by the language model. Because probabilities are at most one, log probabilities are never positive, so a high log probability is one that is close to zero. In the output above, replacing "job" with "octopus" raises the perplexity from about 8.3 to about 30.0 and lowers the log probability from about -9.2 to about -14.8.
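The two figures reported for each sentence are linked. The Python sketch below reproduces the perplexities in the sample output from the log probabilities, under two assumptions inferred from the numbers rather than taken from MeTA's source: the logarithm is base 10 (the convention used in .arpa files), and the length is all ten tokens of the tokenized sentence, including `<s>` and `</s>`. The function name is hypothetical.

```python
def perplexity_per_word(log10_prob, num_tokens):
    """Perplexity as the inverse geometric mean of the per-token probabilities.

    Assumes log10_prob is a base-10 log probability of the whole sentence.
    """
    return 10.0 ** (-log10_prob / num_tokens)

# Log probabilities copied from the sample output; both sentences have
# ten tokens once <s> and </s> are counted (an assumption, see above).
job = perplexity_per_word(-9.18843, 10)
octopus = perplexity_per_word(-14.7746, 10)

# A lower (more negative) log probability always gives a higher perplexity
# for sentences of the same length.
assert octopus > job
```

Dividing by the number of tokens is what makes perplexity comparable across sentences of different lengths, whereas raw log probability keeps falling as a sentence gets longer.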
Note that the input sentence should be tokenized in the same way as the reference corpus read by the language model inference algorithm. Otherwise, the vocabularies may not match, and words that the model actually knows can be treated as out-of-vocabulary, which unintentionally lowers the likelihood of the sentence.
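A small Python sketch shows how a tokenization mismatch produces out-of-vocabulary tokens. The two tokenizers and the one-sentence vocabulary are made up for illustration and are not MeTA's tokenizers. The point is only that "job." and "job" are different tokens as far as the vocabulary is concerned.

```python
import re

def whitespace_tokenize(text):
    """Split on whitespace only, leaving punctuation attached to words."""
    return text.split()

def punct_tokenize(text):
    """Split punctuation into separate tokens, as in the tokenized output above."""
    return re.findall(r"\w+|[^\w\s]", text)

def oov_tokens(tokens, vocab):
    """Return the tokens that do not appear in the model's vocabulary."""
    return [tok for tok in tokens if tok not in vocab]

# Hypothetical vocabulary built from a reference corpus in which punctuation
# was separated from words.
vocab = set(punct_tokenize("I should get a part time job ."))

sentence = "I should get a part time job."
mismatched = oov_tokens(whitespace_tokenize(sentence), vocab)
matched = oov_tokens(punct_tokenize(sentence), vocab)
```

Here `mismatched` contains the single token "job." while `matched` is empty, even though both come from the same input sentence.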
The file src/lm/tools/sentence_likelihood.cpp contains a simple use case of
the language model class, as demonstrated above.