Deep Learning 101: Lesson 29: Attention Scores in NLP

7 min read · Sep 3, 2024

This article is part of the “Deep Learning 101” series. Explore the full series for more insights and in-depth learning here.

The attention mechanism is arguably the most critical innovation in Transformers, enabling these models to focus on different parts of the input sequence when performing a task. Two related ideas explain how attention scores are produced:

Self-Attention: In Transformer models, the self-attention mechanism allows each word in the input sequence to attend to all other words. This mechanism calculates an ‘Attention Score’ for each word pair, determining how much focus to put on other parts of the input when processing a particular word.
Scaled Dot-Product Attention: This is the specific form of attention used in Transformers. It computes scores from the dot product of the query (a representation of the current word) with all keys (representations of every word in the sequence). The scores are scaled, passed through a softmax function, and then used to weight the values (another representation of each word). The result is a weighted representation that combines information from the relevant parts of the input.

The image below shows an example of a scalar-based attention scoring system. It depicts a straightforward mapping in which the keys “apple,” “banana,” and “chair” are associated with the scalar values 9, 6, and 3, respectively.

Figure 1: Example of a scalar-based attention scoring system

Queries:

Key: apple ⇒ Value: 9 (probability=1.0)

Key: banana ⇒ Value: 6 (probability=1.0)

Key: chair ⇒ Value: 3 (probability=1.0)

Key: fruit ⇒ Value: ? (probability=?)

When querying for these specific keys, the system will return the associated value with absolute certainty; however, it will fall short on a query like “fruit” because it cannot extrapolate from the existing keys.
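
The exact-match behaviour described above can be sketched with a plain Python dictionary. This is only an illustration of the idea in Figure 1; the names `scalar_store` and `lookup` are made up for this example.

```python
# Scalar key-value store from Figure 1: each key maps to exactly one value.
scalar_store = {"apple": 9, "banana": 6, "chair": 3}


def lookup(key):
    """Return the stored value for an exact key match, or None if the key is absent."""
    return scalar_store.get(key)


print(lookup("apple"))  # 9
print(lookup("fruit"))  # None: no stored key matches, and nothing is inferred from similar keys
```

A stored key is returned with certainty, while an unseen key such as "fruit" gets nothing back, which is the limitation the vector-based model addresses.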

The next figure (below) shows a vector-based attention scoring model. Here the same entities “apple”, “banana” and “chair” are presented. However, these entities are vectorized using key attributes such as sweetness, durability, and texture, and value attributes such as price, liking, or use. Note the different numbers assigned to each attribute to provide meaning, context, and relationship. To give some examples of attribute values, we can say that “apple” has more sweetness than “banana” and “chair”; chairs are more durable than apples and bananas; bananas are richer in texture compared to apples and chairs, and so on. Note that in a real system, these numbers are generated by some training mechanism involving various forms of embedding.

Figure 2: Vector-based attention scoring model

Now suppose we want to run a query on this system. This time the query is not a single word or value but a vector of values for Sweetness, Durability, and Texture, as shown below. In other words, we are asking for the keys and values that best match Sweetness = 0.8, Durability = 0.2, and Texture = 0.1.

Query:

Sweetness: 0.8, Durability: 0.2, Texture: 0.1

Let’s assume that we have a system for calculating the “match score” between the query attributes and the key attributes for each of the entities apple, banana, and chair, and that this match score comes out to be the values below:

Note that the top match for this query is apple with a matching score of 0.44, which is higher than 0.37 and 0.16. This ‘matching score’ is called the Scaled Dot Product and is calculated using the formula below.

$$\text{Scaled Dot Product}(K, Q) = \frac{K \cdot Q}{\sqrt{d_k}}$$

Where

K⋅Q is the dot product of vectors K and Q.
d_k is the dimensionality of the K vector.
The scaling by the square root of d_k is done to prevent the dot product from growing too large in magnitude.

In other words, the scaled dot product is the result of taking the dot product between two vectors, K and Q, and then dividing it by the square root of the dimensionality of the vectors. The scaled dot product measures the similarity between two vectors, K and Q, while taking into account their dimensionality. The table below shows the scaled dot product calculated for apple, banana, and chair for the query (0.8, 0.2, 0.1).

dk = 3
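
The calculation can be sketched in a few lines of Python. The key vectors below are hypothetical: they are not taken from Figure 2 and were picked only to respect the orderings described in the text (apple sweetest, chair most durable, banana richest in texture). The function name `scaled_dot_product` is likewise just for illustration.

```python
import math

# Hypothetical key vectors (sweetness, durability, texture); NOT the values in Figure 2.
keys = {
    "apple": [0.9, 0.1, 0.2],
    "banana": [0.7, 0.1, 0.6],
    "chair": [0.1, 0.9, 0.2],
}
query = [0.8, 0.2, 0.1]  # the query from the text


def scaled_dot_product(k, q):
    """Dot product of k and q divided by the square root of the key dimensionality."""
    d_k = len(k)
    dot = sum(k_i * q_i for k_i, q_i in zip(k, q))
    return dot / math.sqrt(d_k)


match_scores = {name: scaled_dot_product(k, query) for name, k in keys.items()}
for name, score in match_scores.items():
    print(f"{name}: {score:.2f}")
# apple: 0.44
# banana: 0.37
# chair: 0.16
```

With these assumed keys the scores happen to land on the three values quoted earlier, and apple is the top match because its key points in nearly the same direction as the query.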

Figure 4: Softmax function transforming scaled dot products into a probability distribution

With the scaled dot product in hand, which measures the alignment between the query and key vectors, the next step is to turn these raw scores into something that can be read as probabilities. This is where the softmax function comes in. Applying softmax to the scaled dot products converts them into a probability distribution: every weight is positive and the weights sum to 1. Because softmax exponentiates each score, higher scores receive proportionally more weight and lower scores less, which gives a clear ranking of importance among the inputs. Below is the formula for calculating softmax values:

$$\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}$$

Where

x_i is the ith element of the vector x.
e is the base of the natural logarithm.
The denominator is the sum of the exponential values of all elements in the vector.

If we want to apply the Softmax function to the Scaled Dot Product values, x_i becomes the scaled dot product for the ith key vector K_i:

$$x_i = \frac{K_i \cdot Q}{\sqrt{d_k}}$$

These Softmax values represent the matching score as probabilities between the K and the Q vectors.
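
As a rough sketch, softmax can be written with the standard library alone. The input uses the three match scores quoted in the text (0.44, 0.37, 0.16); subtracting the maximum before exponentiating is a common numerical-stability trick that is an addition here, not something the article describes, and it does not change the result.

```python
import math


def softmax(xs):
    """Map a list of scores to positive weights that sum to 1."""
    m = max(xs)  # shift by the max for numerical stability; the output is unchanged
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


scaled_scores = [0.44, 0.37, 0.16]  # match scores quoted in the text
softmax_weights = softmax(scaled_scores)

print([round(w, 3) for w in softmax_weights])
print(sum(softmax_weights))  # 1.0, up to floating-point rounding
```

The ordering of the scores is preserved, so the best-matching key still receives the largest weight.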

Having turned the scaled dot products into a probability distribution with softmax, we are ready to compute the Scaled Dot-Product Attention. This step multiplies the softmax weights by the value vectors V. The softmax probabilities act as a weighting system: each entity's value vector is multiplied by its weight and the weighted vectors are summed, so entities that match the query well contribute more and the rest contribute less. The resulting vector is a weighted summary of the input information, tilted toward whatever the query finds most relevant, and it is ready for further processing by subsequent layers. We calculate the Scaled Dot-Product Attention using the formula below.

$$\text{Attention}(Q, K, V) = \sum_{i} \text{softmax}(x_i)\, V_i, \qquad x_i = \frac{K_i \cdot Q}{\sqrt{d_k}}$$

Here K_i and V_i are the key and value vectors of the ith entity.

Using the Softmax values calculated in the previous step, we compute the Scaled Dot-Product Attention in the table below.
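
Putting the steps together, a minimal end-to-end sketch for a single query might look like the following. Both the key vectors and the value vectors (price, liking, use) are hypothetical numbers chosen for illustration, not the figures from the article, and `attention` is an illustrative name.

```python
import math


def softmax(xs):
    """Map a list of scores to positive weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Scaled dot-product attention for one query vector.

    Returns the softmax weights and the weighted sum of the value vectors.
    """
    d_k = len(query)
    scores = [sum(k_i * q_i for k_i, q_i in zip(k, query)) / math.sqrt(d_k) for k in keys]
    weights = softmax(scores)
    dim_v = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_v)]
    return weights, output


# Hypothetical data, in the order apple, banana, chair.
key_vectors = [[0.9, 0.1, 0.2], [0.7, 0.1, 0.6], [0.1, 0.9, 0.2]]    # sweetness, durability, texture
value_vectors = [[0.5, 0.9, 0.3], [0.3, 0.8, 0.2], [0.9, 0.4, 0.8]]  # price, liking, use
q = [0.8, 0.2, 0.1]

attn_weights, attn_output = attention(q, key_vectors, value_vectors)
print([round(w, 3) for w in attn_weights])
print([round(o, 3) for o in attn_output])
```

The output is a single vector in value space: a blend of the three value vectors in which the best-matching entity carries the largest share.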

In the context of our illustrative example with ‘apple’, ‘banana’, and ‘chair’, each with their corresponding key (K) and value (V) vectors, Scaled Dot Product Attention plays a crucial role. For a given query vector (Q), which may represent the desired characteristics of a particular fruit, the dot product attention mechanism efficiently identifies and enhances the attributes of ‘apple’ or ‘banana’ that most closely match Q. By computing the dot product between Q and each K, and then scaling and normalizing these scores through softmax, we obtain a refined focus on the most relevant entity. Multiplying this focused distribution by V yields an aggregated output that preserves the most salient features corresponding to the query, effectively allowing the Transformer model to “attend” to the most relevant information.

This mechanism is critical within the Transformer architecture because it underpins the model’s ability to handle sequences by selectively focusing on specific elements without the constraints of sequential processing. It allows the Transformer to capture complex interdependencies and contextual relationships between words (or items, in the case of ‘apple’, ‘banana’ and ‘chair’), thereby enhancing its ability to understand and generate human language with remarkable subtlety and depth.
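
To connect this back to self-attention, where every item in a sequence attends to every other item, here is a small sketch that builds the full matrix of attention weights. It makes a simplifying assumption: the same hypothetical vectors serve as both queries and keys, whereas a real Transformer derives them from learned projections of the embeddings. The name `self_attention_weights` is illustrative.

```python
import math


def softmax(xs):
    """Map a list of scores to positive weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention_weights(embeddings):
    """Return an n x n matrix; row i holds item i's attention weights over all items.

    Assumption: each embedding is used directly as both query and key
    (no learned projection matrices).
    """
    d_k = len(embeddings[0])
    matrix = []
    for q in embeddings:
        row_scores = [sum(k_i * q_i for k_i, q_i in zip(k, q)) / math.sqrt(d_k) for k in embeddings]
        matrix.append(softmax(row_scores))
    return matrix


items = ["apple", "banana", "chair"]
item_vectors = [[0.9, 0.1, 0.2], [0.7, 0.1, 0.6], [0.1, 0.9, 0.2]]  # hypothetical values

pair_weights = self_attention_weights(item_vectors)
for name, row in zip(items, pair_weights):
    print(name, [round(w, 2) for w in row])
```

Each row is computed independently of the others, which is why the whole matrix can be evaluated in parallel rather than one position at a time.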

Summary

Attention scores are crucial in Transformer models, enabling them to focus on different parts of the input sequence for better task performance. In self-attention, a score is calculated for each word pair in the input sequence. Scaled Dot-Product Attention computes these scores by taking the dot product of queries with keys, scaling by the square root of the key dimensionality, and applying a softmax function. This turns raw alignment scores into probabilities, which are then used to weight the value vectors and produce the final attention output. The resulting vector represents the model’s focus, emphasizing relevant features based on the query. This mechanism allows the Transformer to capture complex relationships and contextual information, enhancing its ability to understand and generate language effectively.

4 Ways to Learn

1. Read the article: Attention Scores

2. Play with the visual tool: Attention Scores

3. Watch the video: Attention Scores

4. Practice with the code: Attention Scores

Previous Article: The Role of Position Embedding in NLP
Next Article: Understanding Text with Attention Heatmaps

Written by Muneeb S. Ahmad