Natural Language Processing - 2025s2

Regular expressions: finding text within text

Learning outcomes

At the end of this lesson, students should be able to:

  • Define simple regular expressions for classical cases
  • Use the re library to implement regex pattern search

Please keep these skills in mind while working on this activity.

Exercise 1

There are many situations that call for detecting particular words in a text. For example, we could count how many times the word "are" appears in a text:

text = """
Llamas are fascinating animals that are found in the Andes Mountains.
Yoda says that cute animals they are.
Are llamas cute? Yes they are!
A llama never leaves its herd and will protect it with its life.
They are known for their long necks and thick fur.
Llamas are used as pack animals by indigenous people because they are strong and can carry heavy loads.
They are also very social animals and are often seen in groups.
Llamas ARE herbivores, which means they are only eating plants.
They are also known for their gentle and calm nature, which makes them popular in petting zoos.
Overall, llamas are remarkable creatures that are a joy to observe and are important to the cultures where they are found.
"""
def find_words(word_to_find, text):
    words = text.split()
    matching_words = []
    for w in words:
        if w == word_to_find:
            matching_words.append(w)
    return len(matching_words)

word_to_find = "are"
count = find_words(word_to_find, text)
print(f'The word "are" appears {count} times in this text.')
The word "are" appears 13 times in this text.

This is a somewhat naive approach: it misses variations of the word. For example, the "are." at the end of the second line is not counted.

Spot the occurrences of the word "are" that are not caught by this code.

Hint: how does string equality (s1==s2) work with uppercase/lowercase or punctuation?
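
As a minimal illustration of the hint, the sketch below compares a few variants against the target word. The list of variants is chosen here for illustration only; Python's == on strings is an exact, character-by-character comparison.

```python
# String equality is exact: case and punctuation both matter.
target = "are"
candidates = ["are", "Are", "ARE", "are.", "are!"]

# Map each candidate to whether it equals the target exactly.
exact_matches = {c: (c == target) for c in candidates}
for candidate, is_equal in exact_matches.items():
    print(f"{candidate!r} == {target!r}: {is_equal}")
```

Only the first candidate compares equal, which is why find_words skips the others.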

Exercise 2

If you check the documentation for Python's string methods, you will find many that can help you build a better word detector.

Use them to rewrite the function find_words_less_naive below so that it:

  • Works when words have punctuation (that is, ignores punctuation)
  • Ignores case (uppercase/lowercase),
  • Does not count inflected forms of a word, such as plurals (e.g., says does not count as say).

def find_words_less_naive(word_to_find, text):
    words = text.split()
    matching_words = []
    for w in words:
        if w == word_to_find:
            matching_words.append(w)
    return len(matching_words)
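
Two string methods that may be useful here are str.lower and str.strip. The sketch below shows them on a hypothetical list of tokens (not taken from the exercise text); it is only a starting point, and how to combine these calls inside find_words_less_naive is left to you.

```python
import string

# Hypothetical sample tokens, as they might come out of text.split().
tokens = ["Are", "are!", "ARE", "are.", "says"]

# str.lower() removes case differences; str.strip(chars) removes any of the
# given characters from both ends of the string (not from the middle).
normalized = [t.lower().strip(string.punctuation) for t in tokens]
print(normalized)
```

Note that "says" is left unchanged, so an exact comparison against "say" still fails, as the third requirement asks.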

Exercise 3

You may have noticed that the words "llama" and "llamas" both start with "llama".

If we wanted a function that searches for either llama or llamas, we could describe the pattern somewhat like this:

The sequence llama followed by an optional s.

This type of pattern is called a regular expression.

A regular expression is a string that contains:

  • Usual literal characters, like a, b, or 7
  • Special markers, which have special behaviors.

We use regular expressions to search for these patterns within strings. This can be seen as a more advanced version of str.find. In fact, searching for a regular expression that has no special characters finds the same first occurrence that str.find does.
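
A minimal sketch of that equivalence, assuming a sample sentence chosen here for illustration: re.search with a purely literal pattern and str.find both report where the first occurrence starts.

```python
import re

sentence = "A llama never leaves its herd."

# With no special characters, the pattern is just a literal substring.
match = re.search(r"llama", sentence)
regex_position = match.start() if match else -1

# str.find returns the index of the first occurrence, or -1 if absent.
find_position = sentence.find("llama")

print(regex_position, find_position)
```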

Special characters, however, let us express more interesting patterns.

The ? character, for example, means that the previous character is optional. That is, llamas? matches both llama and llamas.
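
As a quick check of this behavior, the sketch below runs re.findall on a sample sentence (made up for this illustration) that contains both forms.

```python
import re

sentence = "A llama joined the other llamas."

# "s?" makes the trailing s optional, so both forms are found.
found = re.findall(r"llamas?", sentence)
print(found)
```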

Which of the strings below would match the following expression?

aa?bb?cc?

Check your answers using the re library, as in the code that follows the list:

  1. cba
  2. abbcc
  3. abbc
  4. abcc
  5. aa
  6. aabbc
  7. abc
  8. aabc
  9. bcc
  10. abb
  11. aabcc
  12. ac
  13. ccbbaa
import re

print(re.match(r"aa?bb?cc?", "cba"))
print(re.match(r"aa?bb?cc?", "abc"))
print(re.match(r"aa?bb?cc?", "abc").group())
None
<re.Match object; span=(0, 3), match='abc'>
abc
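
One detail to keep in mind when checking answers: re.match only requires the pattern to match at the start of the string, so a match object does not by itself mean the whole string fits the pattern. The sketch below contrasts it with re.fullmatch, using a sample string that is not one of the exercise items.

```python
import re

pattern = r"aa?bb?cc?"
sample = "abcd"  # hypothetical string, not one of the exercise items

# re.match only requires the pattern to match at the start of the string.
prefix_match = re.match(pattern, sample)
# re.fullmatch requires the pattern to cover the entire string.
whole_match = re.fullmatch(pattern, sample)

print(prefix_match.group())  # the part of the string that matched
print(whole_match)
```

Comparing .group() with the original string, or using re.fullmatch, tells you whether the entire string was matched.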

Exercise 4

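Now we know how to use question marks. The expressions below also use sets: a set, written in square brackets such as [aA], matches any single one of the characters listed inside it.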

Two other symbols that will be useful are:

  • + indicates one or more occurrences of the previous symbol
  • * indicates zero or more occurrences of the previous symbol
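
The difference between the two shows up when the repeated symbol is absent. A minimal sketch, using patterns and strings made up for this illustration:

```python
import re

# "+" needs at least one "b"; "*" also accepts none at all.
plus_on_a = re.fullmatch(r"ab+", "a")        # no "b" present
star_on_a = re.fullmatch(r"ab*", "a")        # zero "b"s is fine for "*"
plus_on_abbb = re.fullmatch(r"ab+", "abbb")  # three "b"s
star_on_abbb = re.fullmatch(r"ab*", "abbb")

print(plus_on_a, star_on_a, plus_on_abbb, star_on_abbb, sep="\n")
```

Only the first call returns None; the other three return match objects.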

Now, match the regular expressions below with their corresponding strings. Check your answers with the re.match() function.

Strings      Expressions
BAbaaaccc    [aA][bB]c*
Bccc         [aA]?[bB]ccc
ab           [aA][bB]c+
ABccccc      [aAbB]*c+

import re

Exercise 5

There are some special sets we should be aware of:

  • [A-Z] denotes a range. In this case, the set contains all uppercase ASCII letters from A to Z. Ranges can be combined, like [A-Za-z] for all ASCII letters.
  • Modern regular expressions have shorthand sets such as \w, for any character that can form a word (letters, digits, and the underscore), or \d, for any digit. Check the documentation at https://docs.python.org/3/library/re.html to see more symbols!
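
The sketch below tries these sets on a few sample strings, which are made up here and unrelated to the exercise table. It also shows one point from the re documentation that matters when reading patterns: an unescaped . matches any character, while [.] matches only a literal dot.

```python
import re

# Hypothetical sample strings, unrelated to the exercise table.
capitalized = re.fullmatch(r"[A-Z][a-z]+", "Llama")  # one uppercase letter, then lowercase letters
digits = re.fullmatch(r"\d+", "2025")                # digits only
wordlike = re.fullmatch(r"\w+", "pack_animal_7")     # letters, digits, underscore

# An unescaped "." matches any character (except a newline by default);
# "[.]" matches only a literal dot.
any_char = re.fullmatch(r"a.b", "a-b")
literal_dot = re.fullmatch(r"a[.]b", "a-b")

print(capitalized, digits, wordlike, any_char, literal_dot, sep="\n")
```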

Match the strings below with their corresponding expressions. Check your answers using the re.match() function.

Strings                      Expressions
SnakeCaseVariableNames123    \w+.com
nlp.com                      \w+
123432412394                 \w\w\w+[.]com
Brown                        [A-Za-z][A-Za-z0-9]+
n!com                        \d+
