
Bullshit generator: a case study for n-gram language models

Finding n-grams!

By now, you might have noticed that using a single word from the past to predict the next word feels wrong. This is because we choose words based on a long-term context, so using a single word is a large oversimplification.

A possible solution is to change our original equation $P(w_n \mid w_{n-1})$ to a less naive one in which the probability of a word is calculated based on the $L$ previous words ($L$ stands for "context length"): $P(w_n \mid w_{n-1}, w_{n-2}, \cdots, w_{n-L})$. To do so, we will need to use n-grams.

N-grams are simply sequences of N words that appear in the text. For example, in "these are nice n-grams", for n=2, we have the n-grams: "these are", "are nice", "nice n-grams". Note that now we can calculate $P(\text{nice}|\text{these are})$.
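This conditional probability can be estimated directly from counts: $P(w_n \mid w_{n-1}, \cdots, w_{n-L}) = \frac{C(w_{n-L} \cdots w_{n-1} w_n)}{C(w_{n-L} \cdots w_{n-1})}$, where $C(\cdot)$ denotes how many times a word sequence appears in the text. In the toy example above, "these are" appears once and is always followed by "nice", so this estimate gives $P(\text{nice} \mid \text{these are}) = 1$. This is exactly the estimate that the counting code below produces.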

We can get all n-grams and their continuations from a string using:

In [36]:
import re

def get_ngrams_and_continuations(input_str: str, L: int) -> tuple[list, list]:
    # Lowercase and keep only word characters, so punctuation does not become a token
    list_of_words = re.findall(r'\w+', input_str.lower())
    # Pair each n-gram of length L with the word that immediately follows it
    ngrams = [tuple(list_of_words[i:i+L]) for i in range(len(list_of_words)-L)]
    continuations = [list_of_words[i+L] for i in range(len(list_of_words)-L)]
    return ngrams, continuations

data = "this is my cat. This is my house. This is my dog. This is my computer."
ngrams, continuations = get_ngrams_and_continuations(data, 2)
for i in range(len(ngrams)):
    print(f"{ngrams[i]} -> {continuations[i]}")
('this', 'is') -> my
('is', 'my') -> cat
('my', 'cat') -> this
('cat', 'this') -> is
('this', 'is') -> my
('is', 'my') -> house
('my', 'house') -> this
('house', 'this') -> is
('this', 'is') -> my
('is', 'my') -> dog
('my', 'dog') -> this
('dog', 'this') -> is
('this', 'is') -> my
('is', 'my') -> computer

An N-Gram language model

We can now estimate the probability of a continuation given an n-gram. In the example above, "this is" is always followed by "my", whereas "is my" can be followed by "cat", "house", "dog", or "computer". We can convert our n-grams and their continuations into a language model by counting each continuation and normalizing the counts into probabilities.
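For instance, in the toy data, "is my" appears four times and is followed by "cat" exactly once, so the model assigns $P(\text{cat} \mid \text{is my}) = \frac{1}{4}$. The cell below computes exactly these relative frequencies: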

In [37]:
from collections import defaultdict

def ngram_language_model(ngrams, continuations):
    model = defaultdict(lambda: defaultdict(int))
    for ngram, continuation in zip(ngrams, continuations):
        model[ngram][continuation] += 1
    
    # Convert counts to probabilities
    for ngram, continuation_counts in model.items():
        total_count = sum(continuation_counts.values())
        for continuation in continuation_counts:
            continuation_counts[continuation] /= total_count
            
    return model

model = ngram_language_model(ngrams, continuations)
for ngram, continuation_counts in model.items():
    print(f"{ngram}: {dict(continuation_counts)}")
('this', 'is'): {'my': 1.0}
('is', 'my'): {'cat': 0.25, 'house': 0.25, 'dog': 0.25, 'computer': 0.25}
('my', 'cat'): {'this': 1.0}
('cat', 'this'): {'is': 1.0}
('my', 'house'): {'this': 1.0}
('house', 'this'): {'is': 1.0}
('my', 'dog'): {'this': 1.0}
('dog', 'this'): {'is': 1.0}

Generating some bullshit

Now, we can generate some bullshit by seeding our model with an initial n-gram and repeatedly sampling the next word:

In [38]:
import numpy as np
np.random.seed(41)  # For reproducibility

initial_text = "this is"

def generate_text(model, initial_text, n=2, length=10):
    words = initial_text.split()
    for _ in range(length):
        ngram = tuple(words[-n:])
        if ngram in model:
            continuations = list(model[ngram].keys())
            probabilities = list(model[ngram].values())
            next_word = np.random.choice(continuations, p=probabilities)
            words.append(next_word)
        else:
            # The current n-gram was never seen in the training data: stop generating.
            break
    return ' '.join(words)

generate_text(model, initial_text, n=2, length=40)
Out[38]:
'this is my cat this is my dog this is my house this is my house this is my house this is my dog this is my house this is my dog this is my dog this is my computer'

A fallback strategy

Also, by now, you have probably found out that larger n-grams become more and more uncommon. This effect is so strong that two texts sharing an n-gram with a context length $L$ larger than around 10 can be flagged as likely copy-paste plagiarism. Hence, with larger n-grams we will often run into situations in which the model has no information on how to proceed.

On the other hand, we might like larger context lengths because they can make our texts more cohesive.

How to deal with that?

One possibility is a weighting strategy in which the probabilities from models with different n-gram lengths are combined. However, the optimal combination can be hard to obtain.
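For reference, here is a minimal sketch of that weighting idea. It is not part of the original notebook; it assumes a models dictionary like the one built in the fallback example below (context length as key) and a hypothetical weights dictionary:

from collections import defaultdict

def interpolated_distribution(models, words, weights):
    # `models` maps a context length L to a model built by ngram_language_model;
    # `weights` maps L to a mixing weight (choosing good weights is the hard part).
    mixed = defaultdict(float)
    for L, weight in weights.items():
        ngram = tuple(words[-L:])
        if ngram in models[L]:
            for word, p in models[L][ngram].items():
                mixed[word] += weight * p
    # Renormalize so the mixed values form a probability distribution again
    total = sum(mixed.values())
    return {word: p / total for word, p in mixed.items()} if total > 0 else {}

# Hypothetical usage, once models for L = 1, 2, 3 exist:
# dist = interpolated_distribution(models, "this is my".split(), {1: 0.1, 2: 0.3, 3: 0.6})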

Another possibility is to use a fallback strategy: we try a model with context $L$. If it fails to find the n-gram, then we proceed to a model with context $L-1$, and so on.

We could implement such a model like this:

In [39]:
models = {} # the key is the n-gram length L and the value is the model
for L in range(1, 5):
    ngrams, continuations = get_ngrams_and_continuations(data, L)
    model = ngram_language_model(ngrams, continuations)
    models[L] = model

def generate_text_with_fallback(models, initial_text, max_length=40):
    model_lengths = sorted(models.keys())[::-1]  # Start with the largest n-gram
    words = initial_text.split()
    for _ in range(max_length):
        for L in model_lengths:
            ngram = tuple(words[-L:])
            if ngram in models[L]:
                continuations = list(models[L][ngram].keys())
                probabilities = list(models[L][ngram].values())
                next_word = np.random.choice(continuations, p=probabilities)
                words.append(next_word)
                break
        else:
            # No model of any order knows the current context: stop generating.
            break
    return ' '.join(words)

initial_text = "this is"
np.random.seed(41)  # For reproducibility
generated_text = generate_text_with_fallback(models, initial_text, max_length=40)
print(generated_text)
this is my cat this is my house this is my dog this is my computer

Generating some Shakespeare!

Well, now, let's get Shakespeare's complete works and do the same:

In [40]:
with open('shakespeare.txt', 'r', encoding='utf-8') as file:
    shakespeare_text = file.read()
    
models = {}
for L in range(1, 7):
    ngrams, continuations = get_ngrams_and_continuations(shakespeare_text, L)
    model = ngram_language_model(ngrams, continuations)
    models[L] = model
In [51]:
np.random.seed(45)  # For reproducibility
initial_text = "I believe"
generated_text = generate_text_with_fallback(models, initial_text, max_length=40)
print(generated_text)
I believe thyself than i will trust a sickly appetite that loathes even as it longs but sure my sister if i were ripe for your persuasion you have said enough to shake me from the arm of the all noble theseus

Activities

Questions

Remembering (Recall facts and basic concepts)

  1. What is the main task being performed in this case study?
  2. What machine learning algorithm is used for the classification task?
  3. What technique is used to convert the texts into models?
  4. What dataset is used for training and testing the model?
  5. Do we need labels for this type of model?

Understanding (Explain ideas or concepts)

  1. Explain in your own words the core idea behind the N-Gram text representation.
  2. What is the purpose of the fallback strategy in text generation?

Applying (Use information in new situations)

  1. How would you modify the code for text generation to incorporate concepts like temperature, as we have seen previously?
  2. How can we use this same idea to generate movie reviews?
  3. How can we use this same idea to generate movie reviews that have positive sentiments?
  4. How can we use this same idea to generate movie reviews that have positive sentiments and mention the cinematography as positive?

Analyzing (Draw connections among ideas, compare/contrast, break down)

  1. Analyze the outputs for Shakespeare. Can you find the generated material within "the complete works of Shakespeare"?
  2. Is the model able to generate novel material, that is, phrases that have never been seen before?
  3. Can the model be considered "creative"?

Evaluating (Justify a stand or decision, critique)

  1. Evaluate the author's statement: "we have a reasonable reproduction of shakespeare".
  2. Critique the interpretability of the model (predicting probability for single words). While insightful, what potential inaccuracies or simplifications does this method introduce compared to how words contribute within a text?

Expected answers

Remembering (Recall facts and basic concepts)

  1. What is the main task being performed in this case study? Prediction of the next word.
  2. What machine learning algorithm is used for the classification task? N-Gram Language Models, or Order-N Markov Chains.
  3. What technique is used to convert the texts into models? Simple counting and dividing.
  4. What dataset is used for training and testing the model? The Complete Works of Shakespeare.
  5. Do we need labels for this type of model? No.

Understanding (Explain ideas or concepts)

  1. Explain in your own words the core idea behind the N-Gram text representation. N-Grams are sequences of words that are considered a single "token". They help model sequences of words.
  2. What is the purpose of the fallback strategy in text generation? Larger N-Grams can become too rare, so it can be necessary to resort to a lower-order model in some situations. The fallback strategy selects the highest-order model that can be used in each situation.

Applying (Use information in new situations)

  1. How would you modify the code for text generation to incorporate concepts like temperature, as we have seen previously? One option is to apply a temperature to the continuation probabilities before sampling the next word (see the sketch after this list). Another idea could be to assume all words are possible, with a minimum probability of $p$, so that we increase the chance of generating diverse outcomes.
  2. How can we use this same idea to generate movie reviews? Train the model on the IMDB dataset.
  3. How can we use this same idea to generate movie reviews that have positive sentiments? Train the model on the positive-label part of IMDB.
  4. How can we use this same idea to generate movie reviews that have positive sentiments and mention the cinematography as positive? Train the model on the IMDB reviews that are positive and contain the word 'cinematography'.
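As a reference for the temperature question above, here is a minimal, hypothetical sketch that is not part of the original notebook: raise each probability to $1/T$ and renormalize before calling np.random.choice. A temperature below 1 sharpens the distribution; above 1 it flattens it.

import numpy as np

def apply_temperature(probabilities, temperature=1.0):
    # Rescale a probability vector: p_i^(1/T) / sum_j p_j^(1/T)
    p = np.asarray(probabilities, dtype=float) ** (1.0 / temperature)
    return p / p.sum()

# Inside generate_text, the sampling step could become:
# probabilities = apply_temperature(list(model[ngram].values()), temperature=1.5)
# next_word = np.random.choice(continuations, p=probabilities)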

Analyzing (Draw connections among ideas, compare/contrast, break down)

  1. Analyze the outputs for Shakespeare. Can you find the generated material within "the complete works of Shakespeare"? Some short parts of the output appear verbatim in the original, but larger chunks are more likely to be recombinations of smaller chunks of the original material, hence they cannot be found.
  2. Is the model able to generate novel material, that is, phrases that have never been seen before? More or less. It can create recombinations of known phrases and themes, but not entirely new themes and ideas.
  3. Can the model be considered "creative"? Not in a human sense. The generated novelties are simply the result of randomness. This is the same as flipping a coin many times - although that specific sequence of heads and tails could be new in the entire history of humanity, we would probably not argue that the coin was creative.

Evaluating (Justify a stand or decision, critique)

  2. Critique the interpretability of the model (predicting probability for single words). While insightful, what potential inaccuracies or simplifications does this method introduce compared to how words contribute within a text? The model is easy to interpret: at each step, we know what the model observes and what its possible outcomes are. In fact, we could use debug messages to track all of this.
