Lecture 2.3 - Word Embeddings
Generative AI Teaching Kit
The NVIDIA Deep Learning Institute Generative AI Teaching Kit is licensed by NVIDIA and Dartmouth College under the Creative Commons Attribution-NonCommercial 4.0 International License.

Page 2
This lecture
- Motivation for Word Embeddings
- Classical Embedding Methods
- Modern Word Embeddings
- Contextual Embeddings and Beyond

Page 3
Motivation for Word Embeddings
Connecting text to data

Page 4
Translating text to numbers
For any language model, we need a way to convert the human-native languages we write and speak into a format our computational language models can use. Encoding this information will lead us to the concept of word embedding vectors, but it helps to first see how we get there. Let's answer these questions:
- Why is this difficult?
- How is it helpful?
- Is there a perfect way to do it?

Page 5
Why is language hard to represent?
The complex language we use is thought to have emerged as a distinctly human trait around 100,000 years ago.
- Language is so important to us as a species that dedicated regions of the brain are responsible for processing and decoding it as we experience and learn it.
- Language itself is already representational; it is used to convey information and represent the world around us.
Converting language to a digital or numerical form immediately poses challenges such as:
- How should we convey the meaning of words?
- Only certain combinations of letters form words; the vast majority of letter combinations in a language are not words.
- The same word can have multiple meanings in different contexts.

Page 6
How is it helpful to represent language numerically?
For our language models to be as useful as possible, they should be capable of interpreting as much information as possible from the tokens they have processed, and of producing the right tokens to complete their tasks.
Representing language as numbers, or even vectors, will give our models far more ability to perform analysis and pattern recognition. With a good representation, we can:
- Build new words
- Cluster similar words
- Use semantic features to map and traverse an embedding space

Page 7
Is there a best way to represent all languages?
While many similarities exist between languages across the globe, differences in culture and in how abundant each language is online mean that no two languages will ever be perfectly aligned in representation or model efficacy. This remains an open problem to this day, with a number of research endeavors focusing on how best to model multiple languages with generative AI models:
- HuggingFace's BLOOM
- Cohere's Aya
- Microsoft's VeLLM

Page 8
Classical Embedding Methods
Early attempts to vectorize language

Page 9
Comparing documents and sparse vectors
Originally, NLP models were used to perform tasks such as document similarity and matching. A common approach was to convert a document (or sentence) into a single vector where each dimension corresponded to a word in the vocabulary. The result is what are known as "sparse" vectors. For small examples this might seem reasonable, but imagine a vocabulary of all English words, roughly 500k... Any given document contains only a tiny set of unique words, meaning that its vector would be almost entirely made up of 0s.

Page 10
Models from word counting
While these sparse vectors might be computationally inefficient, they can be useful for basic NLP applications. Some older word-counting methods, widely used at the time, include:
Bag-of-Words (BoW)
- BoW is a simple text representation technique that converts text into a numerical format by counting the frequency of each word in a document, disregarding grammar and word order.
Term frequency - inverse document frequency (TF-IDF)
- TF-IDF is a statistical measure used to evaluate the importance of a word in a document relative to a collection of documents, helping to identify terms that are significant for understanding content.

Page 11
Limitations of early word counting models
While these methods work reasonably well for identifying similar documents, they have limitations that prevent broader NLP applications:
Limitations of TF-IDF and BoW:
- Lack of context: These methods treat words independently and do not capture their meaning or relationships (e.g., "cat" and "kitten" are unrelated in BoW/TF-IDF).
- High dimensionality: The feature space grows with the size of the vocabulary, leading to sparse and computationally expensive representations.
- Fixed vocabulary: They struggle with unseen words, i.e., out-of-vocabulary issues.
- No semantic meaning: These representations do not encode the meaning of words or the "similarity" between them.

Page 12
Ideal properties of good word vectors
What would a good word vector model consist of?
1. Semantic Similarity: Words with similar meanings (e.g., "cat" and "kitten") should have similar representations, reflecting their conceptual closeness. Contextual relationships should be captured (e.g., "bank" in the sense of finance vs. riverbank).
2. Dimensionality Efficiency: Representations should be compact (dense) to avoid the sparsity and inefficiency of large vectors like those in BoW or TF-IDF.
3. Context Awareness: Ideally, representations should consider the context in which words appear to capture polysemy (e.g., "bat" as an animal vs. a baseball bat).
4. Scalability and Generalizability: They should work well for large datasets and generalize effectively to unseen data, including out-of-vocabulary words or phrases.
5. Arithmetic Properties: Good word representations allow meaningful operations, such as "king - man + woman = queen," to capture analogies and relationships.
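The two counting schemes from Page 10 can be sketched in a few lines of plain Python. This is a minimal illustration on a hypothetical toy corpus, using the unsmoothed IDF variant log(N/df); production implementations such as scikit-learn's TfidfVectorizer add smoothing and normalization.

```python
import math
from collections import Counter

def bag_of_words(docs):
    """Build sparse count vectors: one dimension per vocabulary word."""
    vocab = sorted({w for d in docs for w in d.split()})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for d in docs:
        vec = [0] * len(vocab)
        for w, n in Counter(d.split()).items():
            vec[index[w]] = n
        vectors.append(vec)
    return vocab, vectors

def tf_idf(docs):
    """Weight each term frequency by how rare the word is across the corpus."""
    vocab, counts = bag_of_words(docs)
    n_docs = len(docs)
    # document frequency: in how many documents does each word appear?
    df = [sum(1 for vec in counts if vec[i] > 0) for i in range(len(vocab))]
    weighted = []
    for vec in counts:
        total = sum(vec)
        weighted.append([(n / total) * math.log(n_docs / df[i])
                         for i, n in enumerate(vec)])
    return vocab, weighted

# hypothetical three-document corpus
docs = ["the cat sat on the mat", "the dog sat on the log", "cats and dogs"]
vocab, vectors = bag_of_words(docs)
tf_vocab, tf_vectors = tf_idf(docs)
```

Note how sparse even this tiny example is: each document's vector has zeros in most dimensions, and a word absent from a document (like "the" in the third one) gets a TF-IDF weight of exactly 0.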
In the next section we will discuss two methods that achieve this and more!

Page 13
Modern Word Embeddings
Dense vectors to encode meaning

Page 14
Creating dense, context-aware word vectors
To create dense, context-aware word vectors, we need to make an assumption about words themselves: namely, that words of similar meaning will be surrounded by similar sets of words. This is the distributional hypothesis, attributed to J.R. Firth in 1957:
"You shall know a word by the company it keeps"
This means we can hold the word of interest fixed, look at all of the words that occur around it in our corpus, and use these surrounding words to build out our word vectors. This will 1) allow us to create dense vectors, since the vector entries are filled by the occurrences of other words, and 2) encode similar meanings for similar words, thanks to the distributional hypothesis. But how do we build them?

Page 15
Word2Vec
Word2Vec is a family of algorithms proposed in 2013. Unlike previous approaches that create document vectors with each word as a dimension, Word2Vec represents each word as a dense vector of hundreds of dimensions in what is known as an embedding space. Each unique word corresponds to a unique location in this space. Two approaches are used to generate these embedding vectors:
- Continuous Bag-of-Words (CBOW)
- Skip-gram
We will cover these next.

Page 16
Word2Vec: Continuous Bag-of-Words
CBOW can be viewed as a "fill in the blank" task, where the word embedding represents the way the word influences the relative probabilities of other words in the context window. Words which are semantically similar should influence these probabilities in similar ways, because semantically similar words are used in similar contexts. The order of context words does not influence prediction (the bag-of-words assumption).
Pros:
- Faster Training: It predicts the target word from the context as a whole, making it computationally efficient.
- Works Well with Smaller Datasets: The averaging process makes it robust even with less data.
Cons:
- Dilutes Semantic Specificity: Averaging context words can blur distinctions, leading to less precise embeddings, especially for polysemous words.
- Not Ideal for Rare Words: Rare words may have weaker embeddings, as they contribute less often to the context's average.

Page 17
Word2Vec: Skip-gram
In the continuous skip-gram architecture, the word2vec model uses the target word to predict the surrounding window of context words. The skip-gram model weighs nearby context words more heavily than distant ones.
Pros:
- Better for Rare Words: It handles infrequent words well, as it generates multiple context-word pairs, improving their embeddings.
- Captures Detailed Semantic Relationships: It excels at tasks requiring nuanced word representations, like analogies (e.g., "king - man + woman = queen").
Cons:
- Slower Training: Skip-gram generates one prediction per word-context pair, making it computationally more expensive.
- Requires Larger Data: It performs better with larger datasets and can struggle with sparse data due to its granular focus on word-context pairs.

Page 18
Word Embedding Spaces
Once trained, these word2vec models define a latent space for the word embeddings. With each word represented as a dense vector, they possess several useful features:
- Words with similar meanings (e.g., "king" and "queen") are closer together.
- Linear relationships emerge, allowing analogies like "king - man + woman = queen."
- Embedding spaces typically have 100-300 dimensions, capturing much of the linguistic structure efficiently.
Useful tasks with word embeddings:
- Semantic Similarity: Finding related words (e.g., for recommendations or clustering).
- Downstream Tasks: Serving as input to models for tasks like sentiment analysis, translation, or question answering.
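To make the skip-gram idea concrete, here is a toy trainer with negative sampling in plain Python, on a hypothetical eight-word-vocabulary corpus. This is a sketch of the mechanism only, not the real word2vec implementation (which adds a unigram noise distribution, subsampling of frequent words, and far larger corpora); the corpus, dimensions, and hyperparameters below are all made up for illustration.

```python
import math
import random

random.seed(0)

# hypothetical toy corpus: "king" and "queen" occur in near-identical contexts
corpus = ("the king rules the realm . the queen rules the realm . "
          "a dog chased a ball . a cat chased a mouse .").split()

vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
DIM, WINDOW, EPOCHS, LR, NEG = 16, 2, 200, 0.05, 3

# two matrices: "input" (target) vectors and "output" (context) vectors
W_in  = [[random.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in vocab]
W_out = [[random.uniform(-0.5, 0.5) for _ in range(DIM)] for _ in vocab]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, x))))

def train_pair(t, c, label):
    """One SGD step on a (target, context) pair with a 0/1 label."""
    score = sigmoid(sum(a * b for a, b in zip(W_in[t], W_out[c])))
    g = LR * (label - score)  # gradient of the logistic log-loss
    for d in range(DIM):
        W_in[t][d], W_out[c][d] = (W_in[t][d] + g * W_out[c][d],
                                   W_out[c][d] + g * W_in[t][d])

for _ in range(EPOCHS):
    for i, word in enumerate(corpus):
        t = idx[word]
        for j in range(max(0, i - WINDOW), min(len(corpus), i + WINDOW + 1)):
            if j == i:
                continue
            train_pair(t, idx[corpus[j]], 1)   # observed context word: label 1
            for _ in range(NEG):               # random "noise" words: label 0
                train_pair(t, random.randrange(len(vocab)), 0)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))

# Per the distributional hypothesis, "king" and "queen" (same contexts)
# should end up more similar than "king" and "mouse" in most runs.
sim_kq = cosine(W_in[idx["king"]], W_in[idx["queen"]])
sim_km = cosine(W_in[idx["king"]], W_in[idx["mouse"]])
```

The same vectors support the arithmetic from Page 18: once trained on enough data, expressions like `W["king"] - W["man"] + W["woman"]` land nearest `W["queen"]`, which is typically evaluated with exactly this cosine-similarity function.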
Page 19
Contextual Embeddings and Beyond
Building embeddings with transformers and attention

Page 20
Contextual Embeddings vs. Static Embeddings
In the last section we looked at Word2Vec and its static embeddings, which use a corpus to build up the embedding space. However, the design of these static models has some limitations:
Limitations of static embeddings
- Fixed representation for each word (e.g., "bank" has the same vector in all contexts).
- Cannot handle polysemy or context-specific meanings.
- Limited performance on tasks requiring nuanced understanding.
To improve on these, we will cover the latest methods:
Contextual embeddings
- Dynamic word representations that change with context.
- Capture syntax, grammar, and semantics from surrounding words.
- Power state-of-the-art NLP tasks like translation and question answering.

Page 21
Contextual Embeddings
Contextual embeddings are word representations that adapt based on the context in which a word appears. Words often have multiple meanings depending on their context; contextual embeddings allow models to distinguish these meanings by considering the surrounding words. Contextual embeddings are produced by deep learning models like BERT and GPT. These models use the transformer architecture, which processes the entire sentence at once and learns relationships between all the words using mechanisms like attention. They also handle polysemy, adapt to unseen words through sub-word tokenization, and create embeddings that capture grammatical and semantic relationships.
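The core attention idea behind contextual embeddings can be shown in a deliberately stripped-down sketch: no learned query/key/value projections, no multiple heads or layers, just scaled dot-product mixing over randomly initialized static vectors for a hypothetical four-word vocabulary. The point is only that the same word receives a different output vector in different sentences.

```python
import math
import random

random.seed(1)
DIM = 8
vocab = ["the", "bank", "river", "money"]
# static (context-independent) vectors, as a word2vec-style model would give
static = {w: [random.uniform(-1, 1) for _ in range(DIM)] for w in vocab}

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def contextual(sentence):
    """One round of simplified self-attention: each output vector is a
    similarity-weighted mixture of every vector in the sentence."""
    vecs = [static[w] for w in sentence]
    out = []
    for q in vecs:
        # scaled dot-product attention scores against every word
        scores = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(DIM)
                          for k in vecs])
        out.append([sum(s * v[d] for s, v in zip(scores, vecs))
                    for d in range(DIM)])
    return out

# "bank" has one static vector, but its contextual vector differs
# depending on whether "river" or "money" appears nearby
v_river = contextual(["the", "river", "bank"])[2]
v_money = contextual(["the", "money", "bank"])[2]
diff = sum((a - b) ** 2 for a, b in zip(v_river, v_money))
```

Real transformers stack many such layers with learned projections and trained sub-word embeddings, but the mechanism by which "bank" near "river" diverges from "bank" near "money" is exactly this context-dependent mixing.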
Page 22
Wrap Up
Word Embeddings
- Today we explored word vectors and word embeddings.
- We saw the difficulties of converting language into a numerical representation.
- The simpler concept of document vectors was introduced.
- More complex, denser word embedding algorithms like Word2Vec were discussed as a means to overcome the limitations of simpler sparse vectors.
- Modern, contextual embeddings were also introduced, which are produced by technologies like attention-based transformers.

Page 23
Thank you!