Linguistic Models: the XX-Century approach

So far, we have used probabilistic models to determine the likelihood of finding a word $w$ in any document within the collection $c$, that is: $P(w | c)$. Implicitly, this means that the order of words within a document does not impact its meaning. Metaphorically, it's as if we placed all the words in a big bag, and therefore this type of representation based on the presence or absence of words is called a bag-of-words.

The bag-of-words model is effective for many applications, but it can miss important characteristics of a text: on the one hand, a text that mentions "platypuses" and "kangaroos" is very likely to be about both; on the other hand, the text "platypuses are more dangerous than kangaroos" means something very different from "kangaroos are more dangerous than platypuses", even though both contain exactly the same words.

One way to model the order in which words appear in a text is called a generative linguistic model (a language model). In this type of model, we estimate the probability of finding the $n$-th word of a sequence given the previous word, that is:

$P(w_n \mid w_{n-1})$

We can create a small model for the phrase:

Pass one, pass two, pass three

In this case, our model gives us probabilities like:

$P(\text{pass} | \text{one}) = 1$

$P(\text{pass} | \text{two}) = 1$

$P(\text{two} | \text{pass}) = 1/3$

Note that these probabilities are estimated simply by counting occurrences in a training dataset!
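
As a quick illustration (this sketch is not part of the original exercise code), here is one way to estimate these bigram probabilities by counting, using the toy phrase above:

from collections import Counter, defaultdict

phrase = "pass one pass two pass three".split()

# Count how often each word follows each preceding word.
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(phrase[:-1], phrase[1:]):
    bigram_counts[prev][nxt] += 1

# Normalize counts into conditional probabilities P(next | previous).
bigram_probs = {
    prev: {nxt: count / sum(counter.values()) for nxt, count in counter.items()}
    for prev, counter in bigram_counts.items()
}

print(bigram_probs['pass'])  # {'one': 0.33..., 'two': 0.33..., 'three': 0.33...}
print(bigram_probs['one'])   # {'pass': 1.0}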

Exercise 1: conditional probabilities for next word

In the excerpt below:

Joana went for a walk with some of her seven dogs on a sunny afternoon, and she met a friend. A person who was there also stopped to talk to them, and some other dogs also stopped to play with the dogs.

Calculate:

  • $P(\text{afternoon} | \text{sunny})$
  • $P(\text{some} | \text{with})$
  • $P(\text{on} | \text{dogs})$

Exercise 2: estimating a linguistic model

There are many libraries that can be used to estimate the conditional probabilities for all words in a text. We are going to build these probabilities from scratch.

The piece of code below splits a text into its individual words (we are not concerned with punctuation at this point).

Extend the code so that it generates a "dictionary of dictionaries", similar to an inverted index. This structure must represent conditional probabilities in the following way:

Suppose we have:

  • $P(\text{dogs} | \text{with}) = 0.2$
  • $P(\text{dogs} | \text{her}) = 0.3$
  • $P(\text{cats} | \text{her}) = 0.4$

the data structure should look like:

{
    'with' : { 'dogs' : 0.2 },
    'her' : { 'dogs' : 0.3,
              'cats' : 0.4}
}
import re

text = """Natural language processing (NLP) is a subfield of linguistics, computer science, information engineering, and artificial intelligence
concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze
large amounts of natural language data.
Challenges in natural language processing frequently involve speech recognition, natural language understanding,
and natural language generation. Nowadays, Artificial Intelligence is a highly trending technology and is gaining popularity among NLP developers.
Although Artificial Intelligence is a marvelous technology and has wonderful results, it is still a developing technology and its ethical use is a major concern.
"""

words = re.findall(r"\b\w+\b", text.upper())
print(words)

conditional_probabilities = {}
['NATURAL', 'LANGUAGE', 'PROCESSING', 'NLP', 'IS', 'A', 'SUBFIELD', 'OF', 'LINGUISTICS', 'COMPUTER', 'SCIENCE', 'INFORMATION', 'ENGINEERING', 'AND', 'ARTIFICIAL', 'INTELLIGENCE', 'CONCERNED', 'WITH', 'THE', 'INTERACTIONS', 'BETWEEN', 'COMPUTERS', 'AND', 'HUMAN', 'NATURAL', 'LANGUAGES', 'IN', 'PARTICULAR', 'HOW', 'TO', 'PROGRAM', 'COMPUTERS', 'TO', 'PROCESS', 'AND', 'ANALYZE', 'LARGE', 'AMOUNTS', 'OF', 'NATURAL', 'LANGUAGE', 'DATA', 'CHALLENGES', 'IN', 'NATURAL', 'LANGUAGE', 'PROCESSING', 'FREQUENTLY', 'INVOLVE', 'SPEECH', 'RECOGNITION', 'NATURAL', 'LANGUAGE', 'UNDERSTANDING', 'AND', 'NATURAL', 'LANGUAGE', 'GENERATION', 'NOWADAYS', 'ARTIFICIAL', 'INTELLIGENCE', 'IS', 'A', 'HIGHLY', 'TRENDING', 'TECHNOLOGY', 'AND', 'IS', 'GAINING', 'POPULARITY', 'AMONG', 'NLP', 'DEVELOPERS', 'ALTHOUGH', 'ARTIFICIAL', 'INTELLIGENCE', 'IS', 'A', 'MARVELOUS', 'TECHNOLOGY', 'AND', 'HAS', 'WONDERFUL', 'RESULTS', 'IT', 'IS', 'STILL', 'A', 'DEVELOPING', 'TECHNOLOGY', 'AND', 'ITS', 'ETHICAL', 'USE', 'IS', 'A', 'MAJOR', 'CONCERN']

Exercise 3: suggest a next word

Based on the model estimated in the previous exercise, write a function that takes a word and returns a possible next word. If the given word is not in the model's vocabulary, the function should return a random word from the vocabulary. Use np.random.choice to make choices with predefined probabilities, as shown below. Remember that you should use the probabilities calculated by your model.

import numpy as np
print(np.random.choice(['one', 'two', 'three'], p=[0.5, 0.2, 0.3]))
three
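
One possible shape for such a function is sketched below; it assumes the dictionary-of-dictionaries structure from Exercise 2 and a list vocabulary containing all known words (both names are illustrative, not fixed by the exercise):

def suggest_next_word(model: dict, word: str, vocabulary: list) -> str:
    # Unknown words fall back to a uniformly random word from the vocabulary.
    if word not in model:
        return np.random.choice(vocabulary)
    candidates = list(model[word].keys())
    probabilities = [model[word][c] for c in candidates]
    # Sample the next word according to the estimated conditional probabilities.
    return np.random.choice(candidates, p=probabilities)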

Exercise 4: make a text generator

Use the functionality you implemented above to suggest next words, and then incorporate them into your text. For example:

  1. Start with "artificial intelligence is"
  2. Suggest word "a"
  3. Incorporate "a" into the original sentence, so that we have: "artificial intelligence is a"
  4. Suggest next word based on "artificial intelligence is a"
  5. Keep going!
def generate_text(model: dict, starting_string: str, num_words: int) -> str:
    text = starting_string
    # Generate words based on your model (model should be the conditional probabilities!)
    return text
# Make your solution here
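
A hedged sketch of one way generate_text could be fleshed out, assuming the suggest_next_word function sketched in Exercise 3 (a bigram model only conditions on the last word of the running text, and remember that the model's keys were uppercased):

def generate_text(model: dict, starting_string: str, num_words: int) -> str:
    words = starting_string.upper().split()
    for _ in range(num_words):
        # The bigram model conditions only on the most recent word.
        next_word = suggest_next_word(model, words[-1], list(model.keys()))
        words.append(next_word)
    return ' '.join(words)

print(generate_text(conditional_probabilities, 'artificial intelligence is', 10))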

Exercise 5: generation techniques

At this point, we generate the "next word" in a sequence by sampling from a distribution estimated from data. We can make some tweaks here to make the generated text more interesting.

Temperature

One possibility is to use a parameter called temperature. The temperature parameter (often denoted $\tau$) is inspired by annealing processes: electrons in materials at high temperature jump more often, so they can be found in a probability cloud with larger variance.

When we use temperature, we sample from a distribution $P_z$ calculated by:

$$P_z(A \mid B) = \frac{P(A \mid B)^{e^{-\tau}}}{\sum_{A'} P(A' \mid B)^{e^{-\tau}}}$$

Note that, in this equation, a high temperature $(\tau \rightarrow \infty)$ makes $P_z(A \mid B)$ approach a uniform distribution, whereas low (negative) values make the distribution spikier.

Also, note that there are many formulations for temperature. I like this one because $\tau = 0$ implies $P_z(A \mid B) = P(A \mid B)$, so we have an anchor point. Check the figure:

import matplotlib.pyplot as plt
def apply_temperature(P, tau):
    z =  P**(np.exp(-tau))
    return z / np.sum(z)
P = np.array([0.5, 0.4, 0.1])
print(apply_temperature(P, 0))
plt.figure(figsize=(10,3))

plt.subplot(1,3,2)
plt.bar(range(3), apply_temperature(P, 0))
plt.ylim(0,1)
plt.title('$\\tau$ = 0')
plt.subplot(1,3,3)
plt.bar(range(3), apply_temperature(P, 2))
plt.ylim(0,1)
plt.title('$\\tau$ = 2')
plt.subplot(1,3,1)
plt.bar(range(3), apply_temperature(P, -2))
plt.ylim(0,1)
plt.title('$\\tau$ = -2')
plt.suptitle('Higher temperature makes the distribution closer to uniform!')
plt.tight_layout()
plt.show()
[0.5 0.4 0.1]
[Figure: bar plots of the same distribution with $\tau = -2$, $\tau = 0$, and $\tau = 2$; higher temperature makes the distribution closer to uniform.]

Top-K

Top-K generation is another interesting technique: we select the top $K$ most likely words from the distribution and then sample from them with uniform probabilities (a common variant renormalizes their original probabilities instead). The choice of $K$ controls the level of randomness in the generation, and this technique can avoid large deviations from a particular train of thought.
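
A minimal sketch of the uniform Top-K variant described above (the function name apply_top_k is illustrative, not part of the original notebook):

def apply_top_k(P, k):
    # Keep only the k most likely entries and give them uniform probability;
    # renormalizing their original probabilities is a common alternative.
    z = np.zeros_like(P)
    top_indices = np.argsort(P)[-k:]
    z[top_indices] = 1.0 / k
    return z

P = np.array([0.5, 0.4, 0.1])
print(apply_top_k(P, 2))  # [0.5 0.5 0. ]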

Implementing temperature or top-k

Implement either temperature or top-k generation in your word generation function!

Exercise 6: reading from real data

The code below loads The Complete Works of William Shakespeare from Project Gutenberg. Use this text to train your model and check whether you can generate some Shakespeare-like text with it!

# Make your solution here!
with open('shakespeare.txt', 'r') as f:
    shakespeare_text = f.read()

    
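If shakespeare.txt is not already on disk, one way to fetch it is sketched below; the URL assumes the plain-text edition of Project Gutenberg ebook #100 and may need adjusting if the link has moved:

import urllib.request

# Download the plain-text edition once and cache it locally.
# NOTE: URL assumed (Project Gutenberg ebook #100); verify it before relying on it.
url = 'https://www.gutenberg.org/files/100/100-0.txt'
urllib.request.urlretrieve(url, 'shakespeare.txt')

with open('shakespeare.txt', 'r', encoding='utf-8') as f:
    shakespeare_text = f.read()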