
Case Study: BERT

What is BERT?

After the transformer, many other advances followed. One of them, of course, is GPT, which uses a decoder-only transformer architecture to predict the next word in a sentence. GPT needs the masked multi-head attention device precisely to avoid making trivial predictions: it must not see the words it is trying to predict. Ultimately, GPT learns an embedding space that increases the likelihood of choosing meaningful words for a text continuation.

The Google team found another interesting way to obtain this type of representation. They trained an encoder-only transformer to predict words removed from the text, much like we know what is missing in "Luke, I am your ____". The idea is that this task can use information from the future, because it depends heavily on context. Simultaneously, they trained the model to classify whether two given sentences follow each other in a corpus. Thus, BERT was born.

graph LR;
  subgraph Input;
    T["Token embeddings"];
    P["Position embeddings"];
    S["Segment embeddings"];
    ADD(["SUM"]);
    T --> ADD;
    P --> ADD;
    S --> ADD;
  end;
  SEQ["Sequence Model"];
  ADD --> SEQ;
  RES["Result: 1 vector per input token"];
  SEQ --> RES;
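
The diagram above says that each token's input representation is the sum of three embeddings: one for the token itself, one for its position, and one for the segment (sentence A or B) it belongs to. Below is a minimal PyTorch sketch of that sum; the dimensions are made up for illustration and are much smaller than BERT's actual ones.

import torch
import torch.nn as nn

# Hypothetical sizes for illustration only (bert-base uses a 30522-token vocabulary and 768 dimensions)
vocab_size, max_len, n_segments, d_model = 1000, 64, 2, 16

token_emb = nn.Embedding(vocab_size, d_model)
position_emb = nn.Embedding(max_len, d_model)
segment_emb = nn.Embedding(n_segments, d_model)

# A fake batch: one sequence of 5 token ids, all belonging to segment 0
token_ids = torch.tensor([[12, 51, 7, 3, 99]])
positions = torch.arange(token_ids.shape[1]).unsqueeze(0)
segments = torch.zeros_like(token_ids)

# The input to the sequence model is simply the sum of the three embeddings
x = token_emb(token_ids) + position_emb(positions) + segment_emb(segments)
print(x.shape)  # torch.Size([1, 5, 16]): one vector per input token

In the real BERT implementation, this sum also passes through layer normalization and dropout before entering the encoder.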

BERT stands for Bidirectional Encoder Representations from Transformers, and was introduced in this paper from 2019. The greatest contribution of BERT, besides its architecture, is the idea of training the language model on different tasks at the same time.

We are definitely not going to train BERT in class, but we will use it for other tasks, through the BERT implementation from Hugging Face. All help files are here.

Task 1: Masked Language Model

The first task BERT was trained for was the Masked Language Model. It was inspired by a task called "Cloze": remove a word from a sentence and let the system predict which word should fill the gap:

graph LR;
  subgraph Inputs;
    INPUT["[CLS] remove some parts [MASK] a sentence"];
  end;
  INPUT --> BERT["BERT"];
  subgraph Outputs;
    OUTPUT["C T1 T2 T3 T4 T5 T6"];
  end;
  BERT --> OUTPUT;
  Train["Loss: T4 should be the word 'of'"];
  OUTPUT --> Train;

This task suggests that the embedding space created by BERT should allow representing words in the context of the rest of the sentence!

To play with this task using Hugging Face's library, you can use:

In [1]:
from transformers import pipeline
unmasker = pipeline('fill-mask', model='bert-base-uncased')
unmasker("Remove some parts [MASK] a sentence.")
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0
Out[1]:
[{'score': 0.9431136250495911,
  'token': 1997,
  'token_str': 'of',
  'sequence': 'remove some parts of a sentence.'},
 {'score': 0.04985498636960983,
  'token': 2013,
  'token_str': 'from',
  'sequence': 'remove some parts from a sentence.'},
 {'score': 0.004208952654153109,
  'token': 1999,
  'token_str': 'in',
  'sequence': 'remove some parts in a sentence.'},
 {'score': 0.000622662715613842,
  'token': 2306,
  'token_str': 'within',
  'sequence': 'remove some parts within a sentence.'},
 {'score': 0.0005233758711256087,
  'token': 2076,
  'token_str': 'during',
  'sequence': 'remove some parts during a sentence.'}]
In [2]:
unmasker("I have a student called [MASK].")
unmasker("I have a student called [MASK].")
Out[2]:
[{'score': 0.006842342671006918,
  'token': 4074,
  'token_str': 'alex',
  'sequence': 'i have a student called alex.'},
 {'score': 0.006842134054750204,
  'token': 3520,
  'token_str': 'sam',
  'sequence': 'i have a student called sam.'},
 {'score': 0.005493461154401302,
  'token': 6864,
  'token_str': 'amy',
  'sequence': 'i have a student called amy.'},
 {'score': 0.005373646505177021,
  'token': 4532,
  'token_str': 'sarah',
  'sequence': 'i have a student called sarah.'},
 {'score': 0.005297194700688124,
  'token': 3841,
  'token_str': 'ben',
  'sequence': 'i have a student called ben.'}]

Algorithmic bias and hallucinations

Note that BERT is generating words that make sense. However, these continuations do not necessarily correspond to reality. In fact, they are simply completions that maximize a probability estimated from a specific dataset!

Check, for example, the output for:

In [3]:
unmasker("Minas Gerais is famous for its [MASK].")
unmasker("Minas Gerais is famous for its [MASK].")
Out[3]:
[{'score': 0.11554374545812607,
  'token': 4511,
  'token_str': 'wine',
  'sequence': 'minas gerais is famous for its wine.'},
 {'score': 0.09914577007293701,
  'token': 14746,
  'token_str': 'wines',
  'sequence': 'minas gerais is famous for its wines.'},
 {'score': 0.09358436614274979,
  'token': 12212,
  'token_str': 'beaches',
  'sequence': 'minas gerais is famous for its beaches.'},
 {'score': 0.07331068813800812,
  'token': 6813,
  'token_str': 'tourism',
  'sequence': 'minas gerais is famous for its tourism.'},
 {'score': 0.054305534809827805,
  'token': 12846,
  'token_str': 'cuisine',
  'sequence': 'minas gerais is famous for its cuisine.'}]

Minas Gerais is a landlocked Brazilian state: it may or may not have wineries, but it definitely does not have beaches, famous or otherwise! Now, check how the output changes when you swap Minas Gerais for a state you know well, such as Kentucky in the USA.

See - there is no "brain" inside BERT. There is merely a system that finds plausible completions for a task. This is something we have been calling "hallucinations" in LLMs. In the end, the model is just as biased as the dataset used for training it.

Algorithmic prejudice

Despite the funny things that the model can output, some assertions can be dangerous, or outright sexist. Try to see the output of:

In [4]:
unmasker("That [MASK] is a doctor.")
unmasker("That [MASK] is a doctor.")
Out[4]:
[{'score': 0.17646944522857666,
  'token': 2158,
  'token_str': 'man',
  'sequence': 'that man is a doctor.'},
 {'score': 0.11029130220413208,
  'token': 3124,
  'token_str': 'guy',
  'sequence': 'that guy is a doctor.'},
 {'score': 0.08735679090023041,
  'token': 2450,
  'token_str': 'woman',
  'sequence': 'that woman is a doctor.'},
 {'score': 0.0790017694234848,
  'token': 2002,
  'token_str': 'he',
  'sequence': 'that he is a doctor.'},
 {'score': 0.061698563396930695,
  'token': 2016,
  'token_str': 'she',
  'sequence': 'that she is a doctor.'}]

Now, let's make a small change here:

In [5]:
unmasker("That [MASK] is a nurse.")
unmasker("That [MASK] is a nurse.")
Out[5]:
[{'score': 0.2685098946094513,
  'token': 2450,
  'token_str': 'woman',
  'sequence': 'that woman is a nurse.'},
 {'score': 0.22261548042297363,
  'token': 2611,
  'token_str': 'girl',
  'sequence': 'that girl is a nurse.'},
 {'score': 0.20899169147014618,
  'token': 2016,
  'token_str': 'she',
  'sequence': 'that she is a nurse.'},
 {'score': 0.0432039275765419,
  'token': 2028,
  'token_str': 'one',
  'sequence': 'that one is a nurse.'},
 {'score': 0.029987310990691185,
  'token': 7743,
  'token_str': 'bitch',
  'sequence': 'that bitch is a nurse.'}]

We could go on finding examples of other types of prejudice - there are all sorts of sexism and racism lying in BERT's learned representations.

In [6]:
sentences = [
    'That criminal is from [MASK].',
    'That CEO is from [MASK].',
    'That man works as a [MASK].',
    'That woman works as a [MASK].',
]

# Print only the top-ranked completion for each sentence
for s in sentences:
    print(unmasker(s)[0]['sequence'])
that criminal is from mexico.
that ceo is from chicago.
that man works as a lawyer.
that woman works as a prostitute.

This is bad, but remember this was 2019, and people were impressed that the system could generate coherent words at all! Nowadays, LLM outputs usually pass through a filter that flags potentially harmful phrases, so this kind of ugly output does not reach the user.
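
As a toy illustration of such post-hoc filtering (this is not how production systems actually do it, and the blocklist below is a made-up example), we could simply discard completions whose predicted token appears in a list of flagged words:

# A made-up, minimal blocklist; real moderation relies on dedicated classifiers, not word lists
BLOCKLIST = {'bitch', 'prostitute'}

def filtered_unmask(prompt):
    # Keep only the completions whose predicted token is not in the blocklist
    return [r for r in unmasker(prompt) if r['token_str'] not in BLOCKLIST]

for r in filtered_unmask("That [MASK] is a nurse."):
    print(r['token_str'], round(r['score'], 3))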


Task 2: Next Sentence Prediction

BERT was also trained for a task called Next Sentence Prediction. The idea of this task is to insert two sentences in the input of BERT, separating them with a special [SEP] token. Then, the system uses the output of the [CLS] token to classify whether these two sentences do or do not follow each other. It is something like:

graph LR; subgraph Inputs; INPUT["[CLS] Here I am [SEP] rock you like a hurricane"]; end; INPUT --> BERT["BERT"]; subgraph Outputs; OUTPUT["C T1 T2 etc"]; end; BERT --> OUTPUT; OUTPUT --> LR; Train["Loss: C should be equal to 1"] LR --- Train;
graph LR; subgraph Inputs; INPUT["[CLS] Here I am [SEP] rock your body"]; end; INPUT --> BERT["BERT"]; subgraph Outputs; OUTPUT["C T1 T2 etc"]; end; BERT --> OUTPUT; Train["Loss: C should be equal to 0"] OUTPUT --- Train;

The consequence of this training is that the embedding $C$ of the [CLS] token represents the content of the rest of the tokens. Hence, we can use it for classification. To do so, we can go straight to the Hugging Face library and use:

In [7]:
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained("bert-base-uncased")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)

The embedding for the [CLS] token can be accessed using:

In [8]:
import torch

# The [CLS] token is always the first token, so its embedding sits at position 0
output_cls = output.last_hidden_state[0, 0, :]
print(output_cls.shape)

# Collect the [CLS] embedding for three different sentences
text = "I like cake"
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
output_cls1 = output.last_hidden_state[0, 0, :]

text = "I like candy"
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
output_cls2 = output.last_hidden_state[0, 0, :]

text = "My computer is broken"
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
output_cls3 = output.last_hidden_state[0, 0, :]

# Stack the three embeddings into a (3, 768) matrix and convert it to numpy
all_outputs = torch.stack([output_cls1, output_cls2, output_cls3])
print(all_outputs.shape)

x = all_outputs.detach().cpu().numpy()
torch.Size([768])
torch.Size([3, 768])
In [9]:
from scipy.spatial.distance import cdist

# Calculate cosine distance between rows of x
cosine_distances = cdist(x, x, metric='cosine')

print(cosine_distances)
[[0.         0.01527569 0.06267575]
 [0.01527569 0.         0.05894094]
 [0.06267575 0.05894094 0.        ]]
In [10]:
y = ['fun', 'fun', 'serious']
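
The labels above define a tiny classification problem: each sentence's [CLS] embedding in x becomes a feature vector with a label in y. A minimal sketch of one way to close the loop, using scikit-learn's LogisticRegression (my choice here, not something prescribed above):

from sklearn.linear_model import LogisticRegression

# Fit a classifier on the [CLS] embeddings: x has shape (3, 768), y has one label per sentence
clf = LogisticRegression(max_iter=1000)
clf.fit(x, y)

# Classify a new sentence from its [CLS] embedding
text = "I like chocolate"
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
new_cls = output.last_hidden_state[0, 0, :].detach().cpu().numpy().reshape(1, -1)
print(clf.predict(new_cls))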

There are many details in this implementation, so I made a video exploring them all.

Activities

Questions

Remembering (Recall facts and basic concepts)

  1. What are the tasks BERT is trained for?
  2. What is next sentence prediction (NSP)?
  3. What is masked language modelling (MLM)?

Understanding (Explain ideas or concepts)

  1. Explain in your own words the core idea behind the use of the CLS token as a representation of the sentence contents.
  2. Why should we expect biases in the masked token prediction task?

Applying (Use information in new situations)

  1. How would you modify the code for text generation to incorporate concepts like temperature, as we have seen previously?
  2. How could we use BERT to generate long strings of text?

Analyzing (Draw connections among ideas, compare/contrast, break down)

  1. Is the model able to generate novel material, that is, phrases that have never been seen before?
  2. Can the model be considered "creative"?

Evaluating (Justify a stand or decision, critique)

  1. Critique the interpretability of the model (predicting probability for single words). While insightful, what potential inaccuracies or simplifications does this method introduce compared to how words contribute within a text?