Introduction
BERT, short for Bidirectional Encoder Representations from Transformers, is a system that combines the transformer architecture with unsupervised pre-training for natural language processing. BERT is pre-trained on two unsupervised tasks, masked language modeling and next sentence prediction, which lets it be tailored to specific downstream tasks without training from scratch. In this article, let’s understand BERT’s attention mechanism and how it works.
Learning Objectives
- Understand the attention mechanism in BERT
- Learn how tokenization is done in BERT
- Learn how attention weights are computed in BERT
- Implement a BERT model in Python
Attention Mechanism in BERT
Let’s start with understanding what attention means in the simplest terms. Attention is a mechanism by which the model puts more weight on the input features that are more important for understanding a sentence.
Let us consider the following examples to understand how attention works fundamentally.
Example 1

In the above sentence, the BERT model may want to put more weight on the word “cat” and the verb “jumped” than on “bag”, since knowing them is more critical for predicting the next word “fell” than knowing where the cat jumped from.
Example 2
Consider the following sentence –

For predicting the word “spaghetti”, the attention mechanism allows giving more weight to the verb “eating” than to the adjective “bland” describing the spaghetti.
Example 3
Similarly, for a translation task like the following:
Input sentence: How was your day
Target sentence: Comment se passe ta journée

For each word in the output phrase, the attention mechanism maps the significant and pertinent words from the input sentence and gives these input words a larger weight. In the above image, notice how the French word ‘Comment’ assigns the highest weight (represented by dark blue) to the word ‘How’, and how for the word ‘journée’ the input word ‘day’ receives the highest weight. This is how the attention mechanism helps attain higher output accuracy, by putting more weight on the words that are most critical for the relevant prediction.
The question that comes to mind is how the model assigns these different weights to the different input words. Let us see in the next section exactly how attention weights enable this mechanism.
Attention Weights For Composite Representations
BERT uses attention weights to process sequences. Consider a sequence X comprising three vectors, each with four elements. The attention function transforms X into a new sequence Y with the same length. Each Y vector is a weighted average of the X vectors, with weights termed attention weights. These weights applied to X’s word embeddings produce composite embeddings in Y.

The calculation of each vector in Y relies on different attention weights being assigned to x1, x2, and x3, depending on how much attention each input feature requires when generating the corresponding vector in Y. Mathematically, it looks something like this:

y1 = 0.4*x1 + 0.3*x2 + 0.2*x3

Here the values 0.4, 0.3, and 0.2 are nothing but the attention weights assigned to x1, x2, and x3 when computing the composite embedding y1. The attention weights assigned to x1, x2, and x3 are completely different when computing y2 and y3.
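To make this concrete, here is a minimal PyTorch sketch with made-up vectors and illustrative weights (each row of weights normalized to sum to one, as BERT’s softmax would produce). Row i of A holds the weights used to build yi, which is exactly the weighted-average computation described above.
import torch

# A sequence X of three vectors, each with four elements
X = torch.tensor([[1.0, 0.0, 2.0, 1.0],
                  [0.0, 3.0, 1.0, 0.0],
                  [2.0, 1.0, 0.0, 1.0]])

# One row of attention weights per output vector; each row sums to one
A = torch.tensor([[0.4, 0.3, 0.3],
                  [0.1, 0.8, 0.1],
                  [0.2, 0.2, 0.6]])

# Each yi is a weighted average of x1, x2 and x3
Y = A @ X
print(Y)   # a new sequence of three composite vectors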
Attention is critical for understanding the context of the sentence as it enables the model to understand how different words are related to each other in addition to understanding the individual words. For example, when a language model tries to predict the next word in the following sentence
“The restless cat was ___”
The model should understand the composite notion of a “restless cat” in addition to understanding the concepts of “restless” or “cat” individually; for example, a restless cat often jumps, so “jumped” could be a fair next word in the sentence.
Keys & Query Vectors For Acquiring Attention Weights
By now we know that attention weights give us composite representations of our input words by computing a weighted average of the inputs. The next question is where these attention weights come from. They come from two vectors known as the key and query vectors.
BERT measures attention between word pairs using a function that assigns a score to each pair based on their relationship. It uses query and key vectors, derived from the word embeddings, to assess compatibility. The compatibility score is calculated by taking the dot product of the query vector of one word and the key vector of the other. For instance, the score between ‘jumping’ and ‘cat’ is computed as the dot product of the query vector of ‘jumping’ (q1) and the key vector of ‘cat’ (k2), i.e. q1*k2.

To convert compatibility scores into valid attention weights, they need to be normalized. BERT does this by applying the softmax function to the scores, ensuring they are positive and sum to one. The resulting values are the final attention weights for each word. Notably, the key and query vectors are computed dynamically from the output of the previous layer, letting BERT adjust its attention mechanism depending on the specific context.
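As an illustration of these two steps (dot-product scores followed by softmax normalization), here is a small PyTorch sketch with random toy vectors. The division by the square root of the vector size is the scaling used in the original transformer and is not essential to the idea above.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_k = 3, 4

# Toy query, key and value vectors, one per word
# (in BERT these are computed from the previous layer's output)
Q = torch.randn(seq_len, d_k)
K = torch.randn(seq_len, d_k)
V = torch.randn(seq_len, d_k)

# Compatibility score between word i and word j is the dot product q_i * k_j
scores = Q @ K.T                            # shape (seq_len, seq_len)

# Softmax turns each row of scores into positive weights that sum to one
weights = F.softmax(scores / d_k ** 0.5, dim=-1)

# Each output vector is a weighted average of the value vectors
Y = weights @ V
print(weights.sum(dim=-1))                  # every row sums to 1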
Attention Heads in BERT
BERT learns multiple attention mechanisms, which are known as heads. These heads operate in parallel. Having multiple heads helps BERT capture relationships between words better than a single head could.
BERT splits its Query, Key, and Value parameters N ways. Each of these N sets independently passes through a separate head, which performs its own attention calculation. The results from these heads are then combined to generate a final attention output. This is why it is termed ‘multi-head attention’, giving BERT an enhanced capability to capture multiple relationships and nuances for each word.
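The sketch below (using arbitrary small dimensions, not BERT’s real sizes) illustrates this splitting: the Query, Key, and Value matrices are divided into N heads, each head runs its own scaled dot-product attention, and the per-head results are concatenated back together.
import torch

seq_len, d_model, num_heads = 3, 8, 2
head_dim = d_model // num_heads

Q = torch.randn(seq_len, d_model)
K = torch.randn(seq_len, d_model)
V = torch.randn(seq_len, d_model)

# Split each matrix into num_heads smaller matrices of size head_dim
def split_heads(t):
    return t.view(seq_len, num_heads, head_dim).transpose(0, 1)   # (heads, seq, head_dim)

Qh, Kh, Vh = split_heads(Q), split_heads(K), split_heads(V)

# Each head runs its own attention in parallel
scores = Qh @ Kh.transpose(-2, -1) / head_dim ** 0.5
weights = torch.softmax(scores, dim=-1)
per_head = weights @ Vh                                           # (heads, seq, head_dim)

# Concatenate the heads back into one d_model-sized representation per word
output = per_head.transpose(0, 1).reshape(seq_len, d_model)
print(output.shape)                                               # torch.Size([3, 8])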

BERT also stacks multiple layers of attention. Each layer takes the output from the previous layer and pays attention to it. By doing this many times, BERT can create very detailed representations as it goes deeper into the model.
Depending on the specific BERT model, there are either 12 or 24 layers of attention, and each layer has either 12 or 16 attention heads. Since the weights are not shared between layers, a single BERT model can have up to 384 (24 × 16) distinct attention mechanisms.
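If you want to check these numbers for a particular checkpoint, the Hugging Face transformers library exposes them in the model configuration; for example, for bert-base-uncased:
from transformers import BertConfig

config = BertConfig.from_pretrained("bert-base-uncased")
print(config.num_hidden_layers)       # 12 attention layers in BERT-base
print(config.num_attention_heads)     # 12 heads per layer
print(config.num_hidden_layers * config.num_attention_heads)   # 144 attention mechanisms in total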
Python Implementation of a BERT model
Step 1. Import the Necessary Libraries
We need to import the ‘torch’ library to use PyTorch, along with BertTokenizer and BertForSequenceClassification from the transformers library. BertTokenizer handles tokenization of the text, while BertForSequenceClassification is used for text classification.
import torch
from transformers import BertTokenizer, BertForSequenceClassification
Step 2. Load Pre-trained BERT Model and Tokenizer
In this step, we load the “bert-base-uncased” pre-trained model by passing its name to BertForSequenceClassification’s from_pretrained method. Since we want to carry out a simple sentiment classification, we set num_labels to 2, representing the “positive” and “negative” classes. Note that from_pretrained attaches a freshly initialized classification head on top of the pre-trained encoder, so the model needs fine-tuning on labeled data before its predictions become meaningful.
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)
Step 3. Set Device to GPU if Available
This step switches the device to GPU if one is available; otherwise, it falls back to the CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
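Since we only run inference in the following steps, it is also good practice to put the model in evaluation mode so that dropout layers are disabled:
model.eval()   # disable dropout for deterministic inference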
Step 4. Define the Input Text and Tokenize
In this step, we define the input text we want to classify and use the tokenizer to convert it into a sequence of tokens, the basic units of information the model works with. The ‘max_length’ parameter sets the maximum length of the tokenized sequence. Setting ‘padding’ to ‘max_length’ pads shorter sequences with the padding token up to that length, while ‘truncation=True’ truncates the sequence if it exceeds the maximum length. The ‘return_tensors’ parameter specifies the format of the returned sequence; here, “pt” returns PyTorch tensors. Finally, we move the generated ‘input_ids’ and ‘attention_mask’ to the chosen device. The attention mask here is a binary tensor that marks which positions contain real tokens (1) and which are padding (0), so the model knows to ignore the padding.
text = "I did not really enjoyed this movie. It was fantastic!"
# Tokenize the input text
tokens = tokenizer.encode_plus(
    text,
    max_length=128,
    padding='max_length',
    truncation=True,
    return_tensors="pt"
)

# Move input tensors to the device
input_ids = tokens['input_ids'].to(device)
attention_mask = tokens['attention_mask'].to(device)
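If you are curious what the tokenizer actually produced, you can map the ids back to word-piece tokens; this optional check shows the special [CLS] and [SEP] markers BERT adds around the text, followed by [PAD] tokens.
# Inspect the first few word-piece tokens, including BERT's special markers
print(tokenizer.convert_ids_to_tokens(input_ids[0].tolist())[:12])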
Step 5. Perform Sentiment Prediction
In the next step, the model generates a prediction for the given input_ids and attention_mask. We wrap the forward pass in torch.no_grad() since gradients are not needed for inference.
with torch.no_grad():
    outputs = model(input_ids, attention_mask=attention_mask)

predicted_label = torch.argmax(outputs.logits, dim=1).item()
sentiment = "positive" if predicted_label == 1 else "negative"
print(f"The sentiment of the input text is {sentiment}.")
Output
The sentiment of the input text is positive.
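Since this article is about attention, it is worth noting that the same model call can also return the attention weights discussed earlier by passing output_attentions=True; here is a short optional sketch reusing the tensors from above.
# Re-run the forward pass and ask the model to return its attention weights
with torch.no_grad():
    outputs = model(input_ids, attention_mask=attention_mask, output_attentions=True)

attentions = outputs.attentions     # one tensor per layer
print(len(attentions))              # 12 layers for bert-base-uncased
print(attentions[0].shape)          # (batch_size, num_heads, seq_len, seq_len)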
Conclusion
This article covered attention in BERT, highlighting its importance in understanding sentence context and word relationships. We explored attention weights, which produce composite representations of the input words through weighted averages. These weights are computed from key and query vectors: BERT determines the compatibility score between two words by taking the dot product of the query vector of one with the key vector of the other. BERT learns many such attention mechanisms, known as heads, and using multiple heads and layers enhances its understanding of word relationships. Finally, we looked at the Python implementation of a pre-trained BERT model.
Key Takeaways
- BERT is based on two crucial NLP advancements: the transformer architecture and unsupervised pre-training.
- It uses ‘attention’ to prioritize relevant input features in sentences, aiding in understanding word relationships and contexts.
- Attention weights produce composite representations by computing a weighted average of the inputs. Multiple attention heads and stacked layers let BERT build detailed word representations, with each layer attending over the previous layer’s outputs.
Frequently Asked Questions
Q1. What is BERT?
A. BERT, short for Bidirectional Encoder Representations from Transformers, is a system leveraging the transformer model and unsupervised pre-training for natural language processing.
Q2. How is BERT pre-trained?
A. It undergoes pre-training, learning beforehand through two unsupervised tasks: masked language modeling and next sentence prediction.
Q3. What can BERT be used for?
A. BERT models are used for a variety of NLP applications including, but not limited to, text classification, sentiment analysis, question answering, text summarization, machine translation, spell and grammar checking, and content recommendation.
Q4. What is self-attention in BERT?
A. Self-attention is a mechanism in the BERT model (and other transformer-based models) that allows each word in the input sequence to interact with every other word. It lets the model take into account the entire context of the sentence, instead of looking at words in isolation or within a fixed window.