Word embeddings represent the individual words of a natural language as fixed-length numeric vectors in some vector space. The most useful word embedding models find vectors for which meaning can be captured via (linear) vector composition. For example, one can answer word analogy questions like the following:

• woman is to sister as man is to what? (brother)
• summer is to rain as winter is to what? (snow)
• man is to king as woman is to what? (queen)
• fell is to fallen as ate is to what? (eaten)

We can answer these questions by finding the word vector that is most similar (via some metric like cosine similarity) to the result of some vector math operation. To answer the first question, one might form a query like

$$\mathbf{v}_{\text{sister}} - \mathbf{v}_{\text{woman}} + \mathbf{v}_{\text{man}}$$

where $\mathbf{v}_i$ represents the word embedding vector for a particular word $i$ in our vocabulary.
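As a rough illustration of this kind of query, the sketch below answers "woman is to sister as man is to what?" by computing sister - woman + man and ranking the remaining words by cosine similarity. Everything here is an assumption made for illustration: the names `cosine`, `analogy`, and `toy_embeddings` are hypothetical, the two-dimensional vectors are made up rather than learned, and skipping the words that appear in the question is a choice of this sketch, not something the MeTA tools are known to do.

```cpp
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

using vec = std::vector<double>;

// Cosine similarity between two non-zero vectors of equal length.
double cosine(const vec& a, const vec& b)
{
    double dot = 0.0, norm_a = 0.0, norm_b = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
    {
        dot += a[i] * b[i];
        norm_a += a[i] * a[i];
        norm_b += b[i] * b[i];
    }
    return dot / (std::sqrt(norm_a) * std::sqrt(norm_b));
}

// Answers "a is to b as c is to what?" by ranking every other word
// against the query vector b - a + c.
std::string analogy(const std::map<std::string, vec>& emb,
                    const std::string& a, const std::string& b,
                    const std::string& c)
{
    const vec& va = emb.at(a);
    const vec& vb = emb.at(b);
    const vec& vc = emb.at(c);

    vec query(va.size());
    for (std::size_t i = 0; i < query.size(); ++i)
        query[i] = vb[i] - va[i] + vc[i];

    std::string best;
    double best_score = -2.0; // cosine similarity is never below -1
    for (const auto& entry : emb)
    {
        // skip the words that appear in the question itself
        if (entry.first == a || entry.first == b || entry.first == c)
            continue;
        double score = cosine(query, entry.second);
        if (score > best_score)
        {
            best_score = score;
            best = entry.first;
        }
    }
    return best;
}

// Hypothetical toy vectors: made up for this sketch, not learned.
std::map<std::string, vec> toy_embeddings()
{
    return {{"woman", {1.0, 0.0}},
            {"man", {-1.0, 0.0}},
            {"sister", {1.0, 1.0}},
            {"brother", {-1.0, 1.0}},
            {"rain", {0.0, -1.0}}};
}
```

With these toy vectors, `analogy(toy_embeddings(), "woman", "sister", "man")` selects "brother", because the query vector coincides with the toy vector for that word.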

There are many different models for word embeddings. MeTA learns its word embeddings with the GloVe algorithm. This tutorial will walk you through how to use the tools in MeTA for learning and interacting with word embeddings on your own data.

Learning Embeddings

MeTA’s GloVe implementation is broken into three steps:

1. Extract a vocabulary from the data for which we would like to construct word embeddings
2. Use that vocabulary to extract the co-occurrence matrix from our data
3. Learn word embeddings for each word in our vocabulary using the co-occurrence matrix we extracted

Steps 1 and 2 are one-time, upfront costs. Step 3 can be repeated as many times as you would like (to, e.g., construct embeddings of different dimensionality) once the vocabulary and co-occurrence matrix have been extracted.

Vocabulary Extraction

To extract a vocabulary from your data, you will need to add the following section (with parameters adjusted according to your needs) to your configuration file:

[embeddings]
prefix = "path/to/store/model/files"
filter = [{type = "icu-tokenizer", suppress-tags = "true"},
{type = "lowercase"}]
[embeddings.vocab]
min-count = 10
max-size = 400000

The prefix key indicates the folder where you would like to store the model files. (This path should be created before running the tools.)

The filter key is the filter chain used to extract the token sequences from your data. Feel free to change it to suit your needs. The chain given above is a reasonable default for learning uncased word vectors.

In the embeddings.vocab table, you can specify how to prune your vocabulary. Typically, you will either truncate the vocabulary below a certain frequency count (min-count), or you will truncate the vocabulary at a certain maximum size (max-size) to keep only the most frequent terms. The less data available for a vocabulary item, the worse its word embedding will be.

Note that even if you limit your vocabulary, the model will always include an <unk> vector that will be returned when querying for out-of-vocabulary terms.
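The two pruning settings can be pictured with the following minimal sketch, which is not MeTA's implementation. The function `prune_vocab` and its types are hypothetical, and it assumes that a term with a count exactly equal to min-count is kept and that ties between equally frequent terms are broken alphabetically; the real tool may differ on both points.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using term_count = std::pair<std::string, std::uint64_t>;

// Keeps the terms seen at least min_count times, then truncates the
// result to the max_size most frequent terms.
std::vector<term_count>
prune_vocab(const std::unordered_map<std::string, std::uint64_t>& counts,
            std::uint64_t min_count, std::size_t max_size)
{
    std::vector<term_count> vocab;
    for (const auto& entry : counts)
    {
        // assumption: a count equal to min_count is kept
        if (entry.second >= min_count)
            vocab.emplace_back(entry.first, entry.second);
    }

    // most frequent first; ties broken alphabetically so the result
    // does not depend on hash table iteration order
    std::sort(vocab.begin(), vocab.end(),
              [](const term_count& a, const term_count& b) {
                  if (a.second != b.second)
                      return a.second > b.second;
                  return a.first < b.first;
              });

    if (vocab.size() > max_size)
        vocab.resize(max_size);
    return vocab;
}
```

In this sketch both limits apply together: the frequency threshold removes rare terms first, and the size cap then keeps only the head of the sorted list.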

To extract the vocabulary, you can now run the embedding-vocab tool:

./embedding-vocab config.toml

Embedding Training

Now you are ready to train the embeddings themselves on the global co-occurrence data we extracted in the previous two steps. This process can be configured with the following (optional) values in the [embeddings] section of your configuration file.

max-ram = 4096
vector-size = 50
num-threads = 4
max-iter = 25
learning-rate = 0.05
xmax = 100.0
scale = 0.75
unk-num-avg = 100

• max-ram is a heuristic memory limit that is used during the first phase of the learning algorithm, which shuffles the data for the SGD-based trainer.
• vector-size indicates the desired dimensionality of the generated word embeddings.
• num-threads indicates the number of concurrent threads to run during training. Each thread will operate on its own separate subset of the training data, so this should be set low enough to allow concurrent access to separate files for each thread. By default, we use one thread per “core” (including hyperthreading cores).
• max-iter indicates the number of iterations to run the algorithm for. More iterations result in better optimization but take longer; this is the major time/quality tradeoff setting.
• learning-rate is the initial learning rate. You likely won’t need to adjust this unless you are using truly massive corpora.
• xmax indicates the co-occurrence count at which the “dampening” applied to rare word pairs stops. You likely won’t need to adjust this.
• scale indicates the exponent used in the scaling function. You likely won’t need to adjust this.
• unk-num-avg indicates the number of rare words to average for constructing the <unk> word embedding.
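To make xmax and scale more concrete, here is a small sketch of the weighting function described in the GloVe paper, which these two settings appear to correspond to. The function name `glove_weight` is hypothetical, and whether MeTA computes the weight in exactly this form is an assumption.

```cpp
#include <cmath>

// Weight applied to a word pair's contribution to the training loss:
// pairs seen fewer than xmax times are dampened by (count / xmax)^scale,
// and pairs seen at least xmax times receive the full weight of 1.
double glove_weight(double cooccurrence, double xmax = 100.0,
                    double scale = 0.75)
{
    if (cooccurrence >= xmax)
        return 1.0;
    return std::pow(cooccurrence / xmax, scale);
}
```

Under these assumptions, raising xmax extends the dampened range to more frequent pairs, while a smaller scale flattens the curve so that rare pairs are penalized less.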

You can now train your word embeddings using the glove tool:

./glove config.toml

The output will be written as two vector files: $prefix/embeddings.target.bin and $prefix/embeddings.context.bin.

Playing with Embeddings

Now that you’ve learned some word embeddings on your data, you can explore your dataset with the interactive-embeddings tool.

./interactive-embeddings config.toml

This tool will prompt you for vector-space queries and report the 10 words most similar to your query according to cosine similarity. For example, to answer the analogy questions given at the beginning of the tutorial, we could use the following queries:

• sister - woman + man
• rain - summer + winter
• king - man + woman
• fallen - fell + ate

Any addition or subtraction expression involving at least one word will be accepted.
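One way to picture how such an expression could be turned into a query vector is to split it into words with signs attached, then add or subtract the corresponding embeddings. The sketch below shows only the splitting step. It is not the tool's actual parser: `parse_query` and `signed_term` are hypothetical names, and the sketch assumes that words and operators are separated by whitespace.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// A word paired with +1.0 (added) or -1.0 (subtracted).
using signed_term = std::pair<std::string, double>;

// Splits a query such as "king - man + woman" into signed terms.
// Throws std::invalid_argument if the query is empty or malformed.
std::vector<signed_term> parse_query(const std::string& line)
{
    std::vector<signed_term> terms;
    std::istringstream stream{line};
    std::string token;
    double sign = 1.0;       // the first word is implicitly added
    bool expect_word = true; // words and operators must alternate

    while (stream >> token)
    {
        const bool is_operator = (token == "+" || token == "-");
        if (expect_word)
        {
            if (is_operator)
                throw std::invalid_argument{"expected a word"};
            terms.emplace_back(token, sign);
        }
        else
        {
            if (!is_operator)
                throw std::invalid_argument{"expected + or -"};
            sign = (token == "+") ? 1.0 : -1.0;
        }
        expect_word = !expect_word;
    }

    // an empty query or a trailing operator leaves us waiting for a word
    if (expect_word)
        throw std::invalid_argument{"query must end with a word"};
    return terms;
}
```

Summing each term's embedding multiplied by its sign would then give the vector that is compared against the rest of the vocabulary.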

API for Embeddings

If you want to use word embeddings in your own application, you can load them into a word_embeddings object and query it like so:

// load embeddings given the [embeddings] configuration group
auto model = embeddings::load_embeddings(config);

// query the model for a specific word
auto embed = model.at("dog");
embed.tid; // the term id for the vector
embed.v;   // the embedding vector for the term

// query the model to convert a term id to a string_view
auto term = model.term(embed.tid);

// query the model to find the top_k similar embeddings
auto top = model.top_k(embed.v);

top[0].e;     // the embedding, with fields tid and v
top[0].score; // the score that this embedding obtained