Deep Learning 101: Lesson 28: The Role of Position Embedding in NLP
This article is part of the “Deep Learning 101” series. Explore the full series for more insights and in-depth learning here.
One of the unique aspects of the Transformer is its non-sequential processing of text. However, the order of the words in a sentence carries essential meaning, which brings us to position embeddings. Position embeddings are added to word embeddings to give the model information about the position of words in a sentence. This allows the Transformer to understand the order of words, a crucial aspect of language structure. Some advanced models use “relative” position embeddings to understand the relative positions of words to each other, further enhancing the model’s ability to capture contextual nuance.
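To make this concrete, here is a minimal sketch in Python/NumPy of how position embeddings are combined with word embeddings before the input reaches the Transformer. The shapes and the random values are illustrative assumptions, not values from this article:

```python
import numpy as np

# Illustrative sizes only: 20 tokens, 3-dimensional embeddings.
seq_len, d_model = 20, 3

# One vector per token (what the word is) and one per position (where it is).
token_embeddings = np.random.randn(seq_len, d_model)
position_embeddings = np.random.randn(seq_len, d_model)

# The Transformer input is the element-wise sum of the two.
transformer_input = token_embeddings + position_embeddings
print(transformer_input.shape)  # (20, 3)
```

Because the sum differs at each position, the same word produces a different input vector depending on where it appears in the sentence.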
The most common representation of position embedding consists of multiple sine and cosine waves, as shown in the figure below. This representation provides a continuous and unique coding for each word position. The periodic nature of the sine and cosine functions allows the model to capture both short-term and long-term positional relationships.
Here is the formula used to calculate the sine and cosine values:

PE(p, 2i) = sin(p / n^(2i/d))
PE(p, 2i+1) = cos(p / n^(2i/d))
Where:
- d is the total embedding dimension.
- n is the Frequency Factor (which is generally equal to 10000 for real-life systems).
- p represents the position of a token (word) within the sequence.
- i indexes a sine/cosine pair of dimensions. Given that the embedding has a total dimension of d, i runs from 0 up to roughly d/2 − 1, so that the pairs of dimensions (2i, 2i+1) cover all d dimensions.
- PE(p, 2i) and PE(p, 2i+1) are the values of the position embedding for position p at dimensions 2i and 2i+1, respectively.
- The terms 2i and 2i+1 ensure that sine is applied to the even dimensions and cosine to the odd dimensions.
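Putting the formula into code, the sketch below uses a hypothetical helper (not this article’s own implementation) that fills a seq_len × d matrix following the sine/cosine rule above:

```python
import numpy as np

def sinusoidal_position_embedding(seq_len, d, n=10000):
    """PE(p, 2i) = sin(p / n**(2i/d)), PE(p, 2i+1) = cos(p / n**(2i/d))."""
    pe = np.zeros((seq_len, d))
    for p in range(seq_len):
        for i in range((d + 1) // 2):
            angle = p / n ** (2 * i / d)
            pe[p, 2 * i] = np.sin(angle)           # even dimension: sine
            if 2 * i + 1 < d:                      # guard for odd d
                pe[p, 2 * i + 1] = np.cos(angle)   # odd dimension: cosine
    return pe

pe = sinusoidal_position_embedding(seq_len=20, d=3, n=100)
print(pe.shape)  # (20, 3)
print(pe[0])     # position 0 -> [sin(0), cos(0), sin(0)] = [0., 1., 0.]
```

For d = 3, the inner loop produces dimensions 0 and 1 from i = 0 and dimension 2 from i = 1, which matches the three curves discussed next.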
We use the following values for d, n, and the sequence length L to calculate the sine and cosine values shown in the figure above:
- d = 3
- n = 100
- L = 20
To plot the graphs, we start with i = 0 and calculate the sine, calling it sin0 (plotted as the blue line), and the cosine for i = 0, calling it cos0 (plotted as the orange line). We then increment i to 1 and calculate the sine, calling it sin1 (plotted as the green line). With that, all 3 dimensions of the embedding space are covered; since d = 3, there is no fourth dimension, so cos1 is not needed. The x-axis of this frequency plot is the position p of a token or word, which starts at 0 and runs up to the sequence length L.
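The curves described above can be reproduced with a few lines of matplotlib. This is a sketch assuming the parameters d = 3, n = 100, L = 20 from the list above; the colors are chosen to match the description (blue, orange, green):

```python
import numpy as np
import matplotlib.pyplot as plt

d, n, L = 3, 100, 20
positions = np.arange(L)

# i = 0 pair: dimension 0 (sine) and dimension 1 (cosine), exponent 2*0/d = 0.
sin0 = np.sin(positions / n ** (0 / d))
cos0 = np.cos(positions / n ** (0 / d))
# i = 1: dimension 2 (sine only, since d = 3 has no dimension 3), exponent 2/3.
sin1 = np.sin(positions / n ** (2 / d))

plt.plot(positions, sin0, label="sin0 (dim 0)", color="tab:blue")
plt.plot(positions, cos0, label="cos0 (dim 1)", color="tab:orange")
plt.plot(positions, sin1, label="sin1 (dim 2)", color="tab:green")
plt.xlabel("position p")
plt.ylabel("embedding value")
plt.legend()
plt.show()
```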
The overall position embedding value of a token or word is the net value of all sine and cosine components. This net or total positional embedding (or value) of a word can also be represented as a 3D scatter plot, as shown in the figure below.
In this diagram, the blue circle 0 is the position embedding when a word is at position 0, the blue circle 1 is the position embedding when a word is at position 1, and so on. It is almost as if the position embedding creates a spiral around the word’s token embedding, shown as a red dot, as the same word moves its position from low to high.
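A rough sketch of how such a 3D scatter plot can be generated is shown below. Each position p maps to the point (sin0, cos0, sin1); the red token-embedding dot is omitted here, since its value depends on the specific word:

```python
import numpy as np
import matplotlib.pyplot as plt

d, n, L = 3, 100, 20
positions = np.arange(L)

# The three coordinates of each point are the three embedding dimensions.
xs = np.sin(positions / n ** (0 / d))
ys = np.cos(positions / n ** (0 / d))
zs = np.sin(positions / n ** (2 / d))

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(xs, ys, zs, c="tab:blue")
for p in positions:
    ax.text(xs[p], ys[p], zs[p], str(p))  # label each point with its position index
ax.set_xlabel("dim 0 (sin0)")
ax.set_ylabel("dim 1 (cos0)")
ax.set_zlabel("dim 2 (sin1)")
plt.show()
```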
Summary
Position embeddings play a crucial role in Transformers by providing information about the position of words in a sentence, despite the model’s non-sequential processing nature. By adding position embeddings to word embeddings, the model can capture the order of words, essential for understanding language structure. These embeddings are commonly represented using sine and cosine waves, allowing the model to encode both short-term and long-term positional relationships. The periodic functions of sine and cosine offer continuous and unique coding for each word position, enhancing the model’s ability to grasp contextual nuances. The overall position embedding value is a combination of all sine and cosine components, which can be visualized in 3D scatter plots, showing how positional information is integrated with token embeddings.
4 Ways to Learn
1. Read the article: Position Embedding
2. Play with the visual tool: Position Embedding
3. Watch the video: Position Embedding
4. Practice with the code: Position Embedding
Previous Article: Understanding Word Embeddings
Next Article: Attention Scores in NLP