Deep Learning 101: Lesson 31: Exploring BERT

Muneeb S. Ahmad
22 min read · Sep 4, 2024

--

This article is part of the “Deep Learning 101” series. Explore the full series for more insights and in-depth learning here.

BERT, which stands for Bidirectional Encoder Representations from Transformers, is a revolutionary model in natural language processing (NLP). Unlike traditional models that read text in one direction (either left-to-right or right-to-left), BERT analyzes text bidirectionally, capturing the context from both sides of a word. This bidirectional approach allows BERT to better understand the nuances and semantics of words in a sentence. Imagine reading a book and understanding the meaning of a word not only from the words that precede it, but also from the words that follow it. That’s the magic of BERT! Its architecture is based on the concept of transformers, which use attention scores to weigh the importance of different words in a sentence. This attention mechanism helps the model focus on words that are more relevant in a given context. BERT’s capabilities have led to breakthroughs in several NLP tasks such as question answering and spam detection. Its ability to understand the context of words in a sentence makes it exceptionally powerful, and its pre-trained models can be fine-tuned for specific tasks, saving time and computational resources.

The Architecture of BERT

The BERT (Bidirectional Encoder Representations from Transformers) architecture represents a paradigm shift in natural language processing, leveraging the power of deep learning to understand the nuances and context of human language. The diagram below showcases the inner workings of BERT, beginning at the base with the input text, which in this example is a masked sentence: “The [MASK] is hot. The [MASK] are bright.” This input text is first converted into input embeddings, which are vectorized representations of the words that encapsulate semantic meaning.

Figure 1: Masked Language Modeling in BERT

To these input embeddings, positional encodings are added, ensuring that the model retains information about the order of words — a critical aspect of understanding language. The resulting vectors are then passed through a series of encoding layers, each consisting of two sub-layers: a multi-head attention mechanism and a feed-forward neural network, with each sub-layer followed by an “Add & Norm” step that includes residual connections and layer normalization.

The multi-head attention mechanism within the encoder is a distinctive feature of BERT, allowing the model to attend to different parts of the input sentence simultaneously. This mechanism can capture a broad range of contextual clues from all positions in the input sequence, enabling true bidirectionality in understanding language. By using multiple ‘heads’, BERT can focus on various aspects of language, such as syntax and semantics, concurrently, leading to a rich and nuanced representation of text.

Following each multi-head attention block, the “Add & Norm” step applies a residual connection, which helps in combating the vanishing gradients problem by allowing gradients to flow through the network directly. Layer normalization is also applied to stabilize the learning process, ensuring that the activations don’t reach extreme values that could hamper the learning.

The feed-forward network is the second sub-layer of each encoder, introducing additional non-linear transformations that allow BERT to learn complex representations. The output of the feed-forward network is also normalized using an “Add & Norm” step.

As the information flows through the stack of encoders, BERT develops a deep understanding of the input text by considering both the left and right context of each word (or masked token). This is in contrast to previous models that processed text in a unidirectional manner, thereby missing out on the full context.

In the diagram, the top encoder outputs the final representation of the input text, which can then be used for various downstream tasks such as sentiment analysis, question answering, and language inference. The “Masks” notation indicates words that are deliberately hidden during training to encourage the model to predict them based on context, thereby learning deeper bidirectional representations. The “IsNext” label refers to a binary prediction task during BERT’s pre-training that determines whether two text segments naturally follow each other, which further aids in understanding sentence relationships.

Overall, BERT’s architecture is engineered to deeply understand language by capturing the intricate interdependencies of words and their context, setting a new standard for a myriad of NLP tasks.

BERT’s Pre-Training with MLM and NSP

BERT’s pre-training involves two innovative strategies: Masked Language Model (MLM) and Next Sentence Prediction (NSP). These strategies enable the model to understand language context and relationships between sentences, which are crucial for downstream tasks such as question answering and language inference.

Masked Language Model (MLM):

During the MLM pre-training task, a certain percentage of the input tokens are randomly masked. For instance, in the sentence “The quick brown fox jumps over the lazy dog,” words like “brown” and “over” might be replaced with a [MASK] token, resulting in “The quick [MASK] fox jumps [MASK] the lazy dog.” BERT then attempts to predict the original value of the masked words, based solely on their context. This forces the model to develop a deep understanding of the language, as it cannot rely on the left-to-right or right-to-left context only (as was common in previous models) but must use the full context of the surrounding words to make its predictions. Typically, 15% of the words in each sequence are masked during training.
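To make the masking step concrete, here is a minimal sketch (not BERT's exact procedure) of how roughly 15% of the ordinary tokens might be selected and replaced with a [MASK] token. The real procedure also replaces some of the selected tokens with random words or leaves them unchanged (the 80/10/10 rule), which is omitted here; the function name is made up for this example.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Randomly replace roughly 15% of ordinary tokens with [MASK] and return
    the masked sequence plus the (position, original word) pairs to predict."""
    masked, targets = list(tokens), []
    # Special tokens are never masked.
    candidates = [i for i, t in enumerate(tokens)
                  if t not in ("[CLS]", "[SEP]", "[PAD]")]
    random.shuffle(candidates)
    n_to_mask = max(1, round(len(candidates) * mask_prob))
    for pos in sorted(candidates[:n_to_mask]):
        targets.append((pos, masked[pos]))
        masked[pos] = mask_token
    return masked, targets

sentence = "the quick brown fox jumps over the lazy dog".split()
masked, targets = mask_tokens(sentence)
print(masked)    # e.g. ['the', 'quick', '[MASK]', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
print(targets)   # e.g. [(2, 'brown')]
```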

Next Sentence Prediction (NSP):

The NSP task is designed to help BERT understand the relationship between consecutive sentences, which is important for downstream tasks such as question answering and natural language inference. During pre-training, the model is presented with pairs of sentences and must predict whether the second sentence is the subsequent sentence in the original document. In about 50% of the cases, the second sentence is indeed the following sentence (labeled as ‘IsNext’), while in the other 50%, it is a random sentence from the corpus (labeled as ‘NotNext’).

For example, BERT might be presented with the pair of sentences “The quick brown fox jumps over the lazy dog” and “They race across the field,” and must predict whether the second sentence logically follows the first. The model learns to understand the coherence and flow of information in text passages through this binary classification task.
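As an illustration, the sketch below builds one NSP training pair from a toy corpus using the 50/50 sampling described above. The function name and the corpus are made up for this example, and a production implementation would also ensure the random draw is not accidentally the true next sentence.

```python
import random

def make_nsp_pair(sentences, idx):
    """Build one NSP training pair: ~50% of the time use the true next sentence
    (IsNext), otherwise draw a random sentence from the corpus (NotNext)."""
    first = sentences[idx]
    if random.random() < 0.5 and idx + 1 < len(sentences):
        return first, sentences[idx + 1], "IsNext"
    return first, random.choice(sentences), "NotNext"

corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "They race across the field.",
    "Milk is sold at the store.",
]
print(make_nsp_pair(corpus, 0))
```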

These pre-training tasks are performed on a large corpus of unlabeled text data, which allows BERT to learn language patterns from a vast amount of information before it’s fine-tuned on a smaller, task-specific dataset. The pre-trained BERT model can then be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as sentiment analysis, entity recognition, and question answering, without substantial task-specific architecture modifications.

In the input text shown above, the bold-underlined words are the ones that are masked during the pre-training of the BERT model.

We ran the pre-training for 5000 epochs and captured the values of various variables, vectors, and attributes inside BERT. We present these observations in the sections that follow so that you can gain some insight into the system. We will go component by component and present the captured data. Before going deeper into the topic, let’s see how BERT employs Masked Language Modeling to understand natural language.

How “Masked Language Modeling” works

Imagine we have the following sentence pair.

Input sentence pair:

The cat chased the mouse. It was a fast chase.

This input sentence pair is then standardized as required by BERT, using special tokens like [CLS] for classification, [MASK] for masking, [SEP] for segment/sentence separation, and [PAD] for padding to make all training samples the same size, as shown below.

Masked and standardized sentence pair:

[CLS] the [MASK] chased the mouse [SEP] It was a fast [MASK] [SEP] [PAD] (cat, chase)

This masked and standardized sentence pair is then fed into the BERT model and trained through multiple encoder layers over many iterations, or epochs. At the end of a training cycle, the BERT model predicts the masked words as “cat” and “chase”. The model also predicts whether the second sentence is a logical continuation of the first, using the “IsNext” flag.

Figure 2: Masked Language Modeling in BERT
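Below is a minimal sketch of this standardization step, assuming a fixed sequence length of 14 tokens to match the example above; the helper name is made up, and the masking itself is handled separately, as described earlier.

```python
def standardize_pair(sent_a, sent_b, maxlen=14):
    """Wrap a tokenized sentence pair with BERT's special tokens and pad it
    to the model's fixed sequence length (masking is applied separately)."""
    tokens = ["[CLS]"] + sent_a + ["[SEP]"] + sent_b + ["[SEP]"]
    # Segment ids: 0 for [CLS], the first sentence, and its [SEP]; 1 for the rest.
    segment_ids = [0] * (len(sent_a) + 2) + [1] * (len(sent_b) + 1)
    while len(tokens) < maxlen:          # pad every sample to the same length
        tokens.append("[PAD]")
        segment_ids.append(0)
    return tokens, segment_ids

tokens, segments = standardize_pair("the cat chased the mouse".split(),
                                    "it was a fast chase".split())
print(tokens)     # ['[CLS]', 'the', 'cat', ..., '[SEP]', '[PAD]']
print(segments)   # [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0]
```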

In this article we will explore the various components and pieces of BERT so that we can understand what goes on at each step. We will use some predefined training examples to probe the data inside BERT. To make the presentation easier to follow, we will flip the model upside down so that navigation becomes easier. Also, for simplicity, we will use only 2 encoder layers in our model. We will explain the Configuration, Input Text, Input Embedding, Multi-Head Attention, Scaled Dot-Product Attention, “Add and Normalize”, and “FeedForward” networks and components. Finally, we will explore the Output and Prediction parts of the system.

Figure 3: BERT Model

Below is a sample configuration of the BERT model, kept deliberately minimal. The maximum number of tokens per sample is 14, and the number of samples per batch is 6. To keep things simple we use only 1 batch in total, in other words six training samples (sentence pairs) in all. The number of predictions per sample is 2, which means we will not mask more than 2 words per sample. The number of encoder layers is 2. The dimension of the embedding space is 7; note that in real life this number is much higher (in the hundreds). There are also some feed-forward matrices and vectors (you will see them a bit later) whose dimension is 28. The dimension of the Q, K, and V entities used in the attention score (which you will also see later) is 4. Finally, the number of segments (or sentences) per training sample is 2.

Figure 4: BERT Configuration
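For reference, the same configuration can be written as a small Python dictionary. The key names below are illustrative (they follow common minimal BERT implementations), while the values are the ones listed above, plus the five attention heads mentioned in the encoder discussion later in this article.

```python
config = {
    "maxlen": 14,      # maximum number of tokens per sample
    "batch_size": 6,   # samples per batch (a single batch is used)
    "max_pred": 2,     # at most 2 masked words are predicted per sample
    "n_layers": 2,     # number of encoder layers
    "n_heads": 5,      # attention heads per encoder layer (see the encoder section)
    "d_model": 7,      # dimension of the embedding space (hundreds in real models)
    "d_ff": 28,        # dimension of the feed-forward matrices and vectors
    "d_k": 4,          # dimension of the Q, K, and V entities (d_v = d_k)
    "n_segments": 2,   # segments (sentences) per training sample
}
```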

Below is the block of input text containing 6 training samples, each consisting of two sentences separated by a period. The samples themselves are separated by newline characters.

Figure 5: Input Samples (sentence pairs)

Next, we show how these input samples are masked and standardized as required by BERT, using special tokens like [CLS] for classification, [MASK] for masking, [SEP] for segment/sentence separation, and [PAD] for padding to make all training samples the same size.

Figure 6: Masked Input Samples (Sentence pairs)

The table below lists all the words used in the training samples along with their numerical values (indices). These numerical values are the ones the model actually understands and interprets.

Figure 7: Number Dictionary

The next table shows the masked tokens. These are the dictionary indices of the masked words; for example, 37 and 16 are the indices of the words “cat” and “chase” in the first training sample. The IsNext column tells us whether the two sentences in each sample are logically connected. In the way this demo encodes IsNext, “False” is assigned if the sentence pair is logically connected and “True” otherwise. How BERT predicts the masked word indices and the IsNext flag will be explained a bit later.

Figure 8: Masked Word Indices and IsNext Values

The next table is key to understanding how the BERT model processes text inputs. It lists the six input samples, each represented by a sequence of indices (Input Ids). These indices correspond to specific words or tokens in the vocabulary used by the BERT model.

Figure 9: Input Ids

The Segment Ids table in Figure 10 provides information on how input tokens are divided into different segments for the BERT model. Each input sample is represented by a sequence of segment ids, where 0 and 1 indicate the first and second segments, respectively. This segmentation helps the model distinguish between different parts of the input text, such as separating sentences or phrases within a sample. Segment Ids are crucial for tasks like next sentence prediction, where the model needs to understand the relationship between consecutive segments of text. In this table, each row corresponds to an input sample, showing a binary sequence that designates the boundaries between segments, which aids the model in processing and understanding the context of the input data effectively.

Figure 10: Segment Ids

Figure 11 illustrates the masked positions within the input samples for a BERT model. Each row represents an input sample and the specific positions within that sample where tokens have been masked. Masked positions are crucial for the model to predict these masked tokens during training, helping it learn to understand and fill in the missing parts of the input sequences. This table provides a clear view of which tokens are hidden in each sample, guiding the model’s learning process.

Figure 11: Masked Positions


Input Embedding

Let’s see how BERT is trained on the given input samples to predict the masked words and the IsNext flag. The training starts with the Input Embedding component, where the Input Ids and Segment Ids are used to compute the overall embedding. For the rest of the BERT explorer discussion, we will analyze the data in the various components of the BERT architecture for a single input sample, examining how each part contributes to the training and prediction processes.

In Figure 12, the process begins with the input IDs and segment IDs, each containing 14 elements. These IDs are passed into the Embedding Generator. The generator produces three types of embeddings: token embeddings (tok_embed), position embeddings (pos_embed), and segment embeddings (seg_embed), each with a shape of [14, 7].

Figure 12: Input Embedding

These embeddings are then summed to form a unified embedding of shape [14, 7], which captures the combined information from tokens, positions, and segments. This combined embedding is essential for the model to understand the context and relationships between different parts of the input sequence.

Additionally, an attention mask is generated to handle the padding tokens effectively. This mask ensures that the model focuses on the actual tokens and ignores the padded ones during training. The attention mask has a shape of [14, 14], matching the input length to manage dependencies between tokens.
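Below is a simplified PyTorch sketch of this step using the demo dimensions. A batch dimension of 1 is added, and the vocabulary size and the [PAD] token id are assumptions made for illustration; the real tool may compute the embeddings and mask differently.

```python
import torch
import torch.nn as nn

d_model, maxlen, n_segments = 7, 14, 2
vocab_size, PAD_ID = 44, 0        # assumed for illustration

tok_embed = nn.Embedding(vocab_size, d_model)   # token embeddings
pos_embed = nn.Embedding(maxlen, d_model)       # learned position embeddings
seg_embed = nn.Embedding(n_segments, d_model)   # segment (sentence A/B) embeddings

input_ids   = torch.randint(1, vocab_size, (1, maxlen))   # stand-in for the Input Ids (Figure 9)
segment_ids = torch.zeros(1, maxlen, dtype=torch.long)    # stand-in for the Segment Ids (Figure 10)
positions   = torch.arange(maxlen).unsqueeze(0)           # 0, 1, ..., 13

# Overall embedding = token + position + segment embeddings, shape [1, 14, 7]
embedding = tok_embed(input_ids) + pos_embed(positions) + seg_embed(segment_ids)

# Padding attention mask, shape [1, 14, 14]: True where the attended-to token is [PAD]
attn_pad_mask = (input_ids == PAD_ID).unsqueeze(1).expand(-1, maxlen, -1)
print(embedding.shape, attn_pad_mask.shape)
```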

This input embedding setup provides the foundation for the subsequent layers in BERT, enabling the model to process the input data accurately and efficiently. As we proceed through the BERT architecture, we’ll explore how these embeddings and masks are utilized to train the model and generate predictions.

This table below shows the initial embedding vectors for the tokens in Input Sample 1 at the beginning of training (Epoch 0). Each word is mapped to a vector of numerical values, representing its initial position in the embedding space. For example, the word [CLS] is represented by the vector [2.72, -2.42, 2.83, 1.4, -0.3, 1.6, -1.01].

Figure 13: Embeddings Vectors for Input Sample 1 — Initial (Epoch 0) Values

The next table (below) shows the embedding vectors for the same tokens after the model has been fully trained (Epoch 5000). The embedding vectors have been updated to better capture the semantic meanings of the words based on the training data. For example, the embedding vector for [CLS] has changed to [3.6, -3.33, 2.77, 1.42, 1.21, 1.09, -1.2].

Comparing Figures 13 and 14, we observe how the training process refines the embeddings, allowing the model to learn more accurate representations of the tokens. These refined embeddings are crucial for the downstream tasks of masked word prediction and the IsNext classification. By analyzing these vectors, we can understand how the model’s understanding of the tokens evolves over time.

Figure 14: Embeddings Vectors for Input Sample 1 — Fully trained (Epoch 5000) Values

In summary, the training process of BERT involves computing initial embeddings, refining them through multiple epochs, and utilizing them in various components of the model to perform tasks like masked word prediction and sentence pair classification. This iterative training and embedding refinement enable BERT to capture the nuanced meanings of words and their relationships in the input data.

Understanding the Encoder Layer in BERT

The encoder layer is a fundamental component of BERT (Bidirectional Encoder Representations from Transformers). It is responsible for transforming the input embeddings into contextualized representations by capturing the relationships between words in a sentence. Each encoder layer in BERT consists of two main sub-layers: Multi-Head Attention (MHA) and a feed-forward neural network. The MHA allows the model to focus on different parts of the sentence simultaneously, while the feed-forward network processes the combined information to generate a refined output. These layers are stacked multiple times, allowing BERT to build deep, intricate representations of the input text, making it highly effective for a wide range of natural language processing tasks.

Figure 15: BERT Encoder

The inputs to an encoder layer are embedding vectors, attention mask, and masked position. The output of an encoder includes the updated embedding vectors and the propagation of the attention mask and masked position. The major component of an encoder is the Multi-Head Attention (MHA), represented by the light gray block. Within this block, the attention score is computed for each attention head. In this demo, we are using five attention heads, shown as h0, h1, h2, h3, and h4. Multi-Head Attention allows BERT to focus on different parts of a sentence simultaneously, helping the model capture various aspects and relationships within the text more effectively. For more details on calculating and viewing Multi-Head attention, refer to the Attention Heatmap tool.

Since the attention score requires the Query, Key, and Value entities, the diagram groups the Query-related, Key-related, and Value-related components separately. Initially, the input embedding serves as the Q, K, and V matrices. During training, three linear weight matrices (W_Q, W_K, and W_V) are applied to the initial Q, K, and V to generate W_Q(Q), W_K(K), and W_V(V). In the visual tool, black data blocks represent trainable linear weight matrices, which can be inspected by clicking on them and analyzing the data while changing input samples and/or rotating through epochs.

These W_Q(Q), W_K(K), and W_V(V) entities are then split into five entities each to introduce five heads in the Multi-Head Attention calculation. Note the data shape change from 14 by 20 to 14 by 4. These new entities are represented as lower-case q_s, k_s, and v_s, and the attention head numbers are shown as h0 through h4. For each head, the attention score is computed with the scaled dot-product formula: the softmax of q_s multiplied by the transpose of k_s and divided by the square root of d_k, with the attention mask applied so that padded positions are ignored. The attention scores can be explored by clicking on the attention score blocks. The grid data is color-coded, where the first column and row are the words in the input sample. The values at the intersections represent relationships between the words, such as a “next word” relationship, a “pay attention” relationship, a “pronoun” relationship, and so on.
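Here is a minimal PyTorch sketch of this per-head computation with the demo shapes (14 tokens, d_k = 4). It also includes the multiplication with the value entities v_s, which is described a bit further below; random tensors stand in for the real q_s, k_s, and v_s.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q_s, k_s, v_s, attn_mask):
    """One attention head: softmax(q_s @ k_s.T / sqrt(d_k)) applied to v_s,
    with padded positions masked out before the softmax."""
    d_k = q_s.size(-1)
    scores = torch.matmul(q_s, k_s.transpose(-1, -2)) / math.sqrt(d_k)  # [14, 14]
    scores = scores.masked_fill(attn_mask, -1e9)   # ignore [PAD] positions
    attn = F.softmax(scores, dim=-1)               # the attention score grid (the heatmap)
    ctx = torch.matmul(attn, v_s)                  # per-word context, [14, d_k]
    return ctx, attn

# Demo shapes for one head: 14 tokens, d_k = 4; no padding is marked in this toy mask
q_s, k_s, v_s = (torch.randn(14, 4) for _ in range(3))
attn_mask = torch.zeros(14, 14, dtype=torch.bool)
ctx, attn = scaled_dot_product_attention(q_s, k_s, v_s, attn_mask)
print(ctx.shape, attn.shape)   # torch.Size([14, 4]) torch.Size([14, 14])
```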

Using more attention heads allows the system to learn more relationship types. Light colors indicate higher values, and dark colors indicate lower values. By rotating through the epochs until the model is fully trained, you can analyze the relationships between words. For example, for input sample 0 and attention head h2, there is a higher value between “cat” and “chase,” indicating a strong relationship. Different relationships can be spotted by examining different input samples and attention heads.

The next operation in the encoder is the multiplication of the attention score “attn” with the value entities “v_s” for each attention head. This product is called the context of each word (ctx), capturing the essence of its surrounding information in the sentence. Then we concatenate the context entities for all attention heads into a single entity for easier processing and propagation. A linear weight matrix is multiplied with the context to get the output, bringing the entities back to the original embedding dimension of [14, 7]. The residual of the input embedding is added to this context output to ensure that the original embedding information isn’t lost during a long series of data transformations.

After this, Layer Normalization is applied to stabilize and standardize the activations, ensuring that the context data doesn’t reach extremely high or low values. The MHA entity can be explored by viewing its associated vectors. A linear weight matrix is then applied to the MHA output to prepare it for further processing and for generating the encoder output.

Next, a GELU (Gaussian Error Linear Unit) activation function is applied. This nonlinearity allows BERT to capture and model more complex relationships in the data. Another Linear weight matrix is applied to the output to generate the encoder output, transforming the data into a shape that fits the embedding dimension of [14, 7] for feeding into the next encoder layer.
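The sketch below strings these encoder steps together using PyTorch and the demo dimensions. The weight names and the exact placement of the Add & Norm operations vary between implementations, so treat it as illustrative rather than as the tool's exact code; random tensors stand in for the per-head context vectors and the layer's input embedding.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, d_ff, n_heads, d_k, seq_len = 7, 28, 5, 4, 14   # demo sizes from the configuration

# Context vectors from the 5 heads, concatenated back together: [14, 5 * 4] = [14, 20]
ctx = torch.cat([torch.randn(seq_len, d_k) for _ in range(n_heads)], dim=-1)
embedding = torch.randn(seq_len, d_model)        # the encoder layer's input (residual source)

w_o = nn.Linear(n_heads * d_k, d_model)          # brings [14, 20] back to [14, 7]
norm1 = nn.LayerNorm(d_model)
mha_out = norm1(w_o(ctx) + embedding)            # Add & Norm: residual connection + layer norm

# Position-wise feed-forward: Linear -> GELU -> Linear, followed by another Add & Norm
ff1, ff2 = nn.Linear(d_model, d_ff), nn.Linear(d_ff, d_model)
norm2 = nn.LayerNorm(d_model)
enc_out = norm2(ff2(F.gelu(ff1(mha_out))) + mha_out)
print(enc_out.shape)   # torch.Size([14, 7]) -> fed into the next encoder layer
```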

After the input data is processed by the first encoder layer, the output of Encoder 1 is fed into the input of Encoder Layer 2. The second encoder layer also takes three inputs, similar to the first encoder layer: the embedding, the attention mask, and the masked position. These inputs are essential for the layer to perform its function effectively.

In this tutorial, we are using only two encoder layers in our model for simplicity. However, in real-world systems, the number of encoder layers can extend to a couple of dozen or even more. This stacking of multiple encoder layers is a significant aspect of BERT’s architecture.

The reason for having multiple encoder layers is to allow the model to iteratively refine and capture increasingly complex relationships and hierarchies in the input data. With each successive layer, BERT can delve deeper into the nuances, contexts, and dependencies within the text. This multi-layered approach enables BERT to build a comprehensive understanding of the language, making it capable of handling a wide range of natural language processing tasks with high accuracy.

Each layer enhances the representations generated by the previous layers, progressively building up a richer and more intricate understanding of the input data. This depth of processing is what allows BERT to excel in tasks such as question answering, text classification, and language inference, among others.

By passing the data through multiple encoder layers, BERT can capture both local and global context, which is crucial for understanding the subtleties of language. This iterative refinement process ensures that by the time the data reaches the final encoder layer, it is a well-rounded and deeply contextualized representation of the original input.

In summary, the multiple encoder layers in BERT are fundamental to its ability to understand and model the complexities of natural language. They allow the model to capture fine-grained details and long-range dependencies, making BERT one of the most powerful tools in the field of natural language processing.

The Output Layer in BERT

The output layer of BERT processes the final embeddings to generate predictions for the Masked Language Model (MLM) and the “isNext” classification. This layer takes two primary inputs: the output from the last encoder layer and the masked position.

Figure 16: The Output Layer

On the left side, the “isNext” classification processes data to determine if one sentence follows another. On the right side, the Masked Language Model (MLM) processes data to predict masked words in the input text. Both tasks are crucial for understanding and generating coherent text.

Masked Language Model Processing

Let’s explore how the Masked Language Model (MLM) processing works. The masked position block indicates which word indices are masked in the input sample. For instance, in Sample 0, the numbers 2 and 11 mean that the words at indices 2 and 11 are masked. This can be verified by checking the Masked Input Text section.

Next, the masked positions are expanded to match the embedding dimension, creating rows of repeated indices. These rows are then used to extract specific cells from the encoder output, corresponding to the masked positions. This extraction provides a subset of the encoder output, referred to as h_masked.

A Linear weight matrix is applied to h_masked to enhance its features, followed by a GELU (Gaussian Error Linear Unit) activation function to introduce non-linearity, which helps capture complex patterns. Layer Normalization is then applied to stabilize the training process. A Decoder weight matrix reshapes and optimizes h_masked to match the vocabulary size, enabling the prediction of masked words. A bias is added to the logits, producing the final logits for the Masked Language Model (logits_lm), representing the raw scores for each word in the vocabulary.

These logits are then processed to yield final predicted probabilities for each word in the dictionary. By iterating through epochs, the model learns to accurately predict masked words, as indicated by the heatmap representation of logits.
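Below is a condensed PyTorch sketch of this MLM head using the demo dimensions. The vocabulary size is assumed for illustration, and in BERT the decoder weights are typically tied to the token embedding matrix, which is omitted here; a random tensor stands in for the encoder output.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, vocab_size, seq_len = 7, 44, 14      # demo sizes; vocabulary size assumed

enc_out = torch.randn(1, seq_len, d_model)    # output of the last encoder layer
masked_pos = torch.tensor([[2, 11]])          # masked positions for Sample 0

# Expand the positions across the embedding dimension and gather those rows -> h_masked [1, 2, 7]
pos = masked_pos.unsqueeze(-1).expand(-1, -1, d_model)
h_masked = torch.gather(enc_out, 1, pos)

linear = nn.Linear(d_model, d_model)
norm = nn.LayerNorm(d_model)
h_masked = norm(F.gelu(linear(h_masked)))     # Linear -> GELU -> LayerNorm

decoder = nn.Linear(d_model, vocab_size, bias=False)   # projects up to the vocabulary size
bias = nn.Parameter(torch.zeros(vocab_size))
logits_lm = decoder(h_masked) + bias          # raw scores per masked position, [1, 2, 44]
print(logits_lm.shape)
```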

isNext Classification Processing

For the “isNext” classification, the zeroth row from the encoder output is extracted, typically representing the [CLS] token. A Linear weight matrix is applied to this extracted output to refine its features, followed by a Tanh activation function to scale the values between -1 and +1, producing h_pooled. Another Linear weight matrix reshapes and optimizes h_pooled to produce logits for classification (logits_clsf), which are used to predict whether the second sentence follows the first.

By iterating through epochs, the model adjusts the logits to correctly classify sentence pairs as “isNext” or “NotNext.” The final logits are processed to match the input labels, ensuring accurate predictions.
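A minimal sketch of the isNext head under the same assumptions, again with a random tensor standing in for the encoder output:

```python
import torch
import torch.nn as nn

d_model, seq_len = 7, 14
enc_out = torch.randn(1, seq_len, d_model)    # output of the last encoder layer

pooler = nn.Linear(d_model, d_model)
classifier = nn.Linear(d_model, 2)            # two classes: NotNext (index 0), IsNext (index 1)

cls_vec = enc_out[:, 0]                       # row 0 of the encoder output (the [CLS] token)
h_pooled = torch.tanh(pooler(cls_vec))        # Linear followed by Tanh, values in (-1, +1)
logits_clsf = classifier(h_pooled)            # [1, 2] raw class scores
print(logits_clsf)
```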

Prediction

In this final stage, we first explore the Masked Language Model. The Softmax activation function is applied to logits_lm to get the probability distribution over the entire vocabulary of words.

By examining the output after the Softmax is applied, we see that for the first masked position the highest probability falls on word number 37, shown with a bright color, and for the second masked position on word number 16, also indicated with a bright color. If we now apply Argmax to this data, we should get the word indices 37 and 16 as the output. The Argmax function selects the index of the maximum value along a specified dimension, in this case the rows. The output shows the numbers 37 and 16 as the predicted tokens (or word numbers in the vocabulary), which matches the actual tokens, also 37 and 16. Looking at the number dictionary entity, we find that the word associated with the number 37 is “cat,” and the word associated with the number 16 is “chase.” Thus, the BERT model has correctly predicted the masked words.

Figure 17: The Prediction Layer

Now, let’s quickly explore the “isNext” classification. Remember that logits_clsf holds the raw values for the two classes, NotNext and IsNext. For the selected sample, the logits are 3.3 for NotNext and -4.01 for IsNext. Applying the Max operation, the larger value is 3.3, found at index zero, which means NotNext is predicted and IsNext is not. When converted to a Boolean, the predicted value is False, which matches the actual value of False.
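To make the two decoding steps concrete, here is a small sketch with stand-in logits chosen so that the expected outputs are easy to see; the vocabulary size and the logit values are made up for illustration.

```python
import torch

# --- Masked Language Model decoding (Sample 0) ---
logits_lm = torch.randn(2, 44)   # stand-in scores for the 2 masked positions (vocab size assumed)
logits_lm[0, 37] = 10.0          # pretend the model strongly favors word 37 ("cat")
logits_lm[1, 16] = 10.0          # and word 16 ("chase")
probs = torch.softmax(logits_lm, dim=-1)        # probability distribution over the vocabulary
print(torch.argmax(probs, dim=-1).tolist())     # [37, 16]

# --- isNext decoding ---
logits_clsf = torch.tensor([3.3, -4.01])        # NotNext (index 0) vs. IsNext (index 1)
pred_index = torch.argmax(logits_clsf).item()   # 0 -> NotNext wins
print(bool(pred_index))                         # False, matching the actual label
```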

To summarize, let’s take a look at how BERT predicted the masked words and IsNext flags for four of the input samples we used in the training.

Figure 18: Summary of BERT’s predictions

Here is a table showing the 4 input samples (see Figure 18), the Predicted and Actual words, and the IsNext elements. Please note that BERT has performed all the predictions correctly.

  • Sample 0: The sentence pair “[CLS] the [MASK] chased the mouse [SEP] It was a fast [MASK] [SEP] [PAD]” had the predicted words as 37, 16 which translates to “cat, chase” and the actual words were indeed 37, 16. The Predicted IsNext was False, matching the Actual IsNext value of False.
  • Sample 1: The sentence pair “[CLS] I need to [MASK] milk [SEP] please [MASK] to the store [SEP] [PAD]” had the predicted words as 23, 24 which translates to “buy, go” and the actual words were indeed 23, 24. The Predicted IsNext was False, matching the Actual IsNext value of False.
  • Sample 2: The sentence pair “[CLS] buy me some [MASK] [SEP] please visit the [MASK] [SEP] [PAD] [PAD] [PAD]” had the predicted words as 17, 29 which translates to “milk, shop” and the actual words were indeed 17, 29. The Predicted IsNext was True, matching the Actual IsNext value of True.
  • Sample 3: The sentence pair “[CLS] The [MASK] ran quickly [SEP] it was too [MASK] [SEP] [PAD] [PAD] [PAD]” had the predicted words as 12, 40 which translates to “mouse, fast” and the actual words were indeed 12, 40. The Predicted IsNext was False, matching the Actual IsNext value of False.

The table clearly shows that BERT has accurately predicted both the masked words and the IsNext flags for all the samples. This demonstrates the model’s ability to understand and generate contextualized representations effectively.

Summary

In this article, we’ve delved into the inner workings of BERT (Bidirectional Encoder Representations from Transformers), showcasing its revolutionary approach to understanding and processing natural language. BERT’s bidirectional nature allows it to capture context from both directions, leading to more nuanced and accurate representations of words in a sentence. We have explored various components of BERT, including its architecture, pre-training tasks, and the detailed workings of its encoder layers and output layers.

We began by understanding how BERT’s architecture leverages the power of transformers, with a particular focus on the Multi-Head Attention mechanism that enables the model to attend to different parts of a sentence simultaneously. This bidirectional attention is key to BERT’s ability to understand complex language constructs. The addition of positional encodings ensures the model retains information about the order of words, which is critical for capturing the full context.

The pre-training phase of BERT, which includes Masked Language Modeling (MLM) and Next Sentence Prediction (NSP), was also discussed. These tasks allow BERT to develop a deep understanding of language by predicting masked words and determining if two sentences are sequentially related.

We then examined how the input data is processed through the encoder layers, transforming initial embeddings into contextualized representations. The importance of multiple encoder layers in capturing intricate relationships and hierarchies in the input data was highlighted, showing how each layer refines and enhances the understanding of the text.

The article also detailed the output layer of BERT, where the final embeddings are used to generate predictions for masked words and determine the “isNext” relationship between sentences. This section included an in-depth look at the Masked Language Model processing and the isNext classification, demonstrating BERT’s ability to accurately predict and classify based on the learned representations.

To summarize, we’ve showcased how BERT predicted masked words and isNext flags for four input samples used in training. The predictions were accurate, demonstrating BERT’s powerful capability in understanding and generating language.

The journey through BERT’s components has provided insights into how this state-of-the-art model excels in various natural language processing tasks, making it a valuable tool for researchers and practitioners in the field of AI and machine learning.

4 Ways to Learn

  1. Read the article: BERT Explorer
  2. Play with the visual tool: BERT Explorer
  3. Watch the video: BERT Explorer
  4. Practice with the code: BERT Explorer

Previous Article: Understanding Text with Attention Heatmaps
Start the course with Lesson 1: Data Scaling


Muneeb S. Ahmad

Muneeb Ahmad is a Senior Microservices Architect and Recognized Educator at IBM. He is pursuing his passion in ABC (AI, Blockchain, and Cloud).