Regular expressions: finding text within text
Learning outcomes
At the end of this lesson, students should be able to:
- Define simple regular expressions for common cases
- Use the re library to implement regex pattern search
Please keep these skills in mind while working through this activity.
Exercise 1
There are many situations that call for detecting particular words in a text. For example, we could count how many times the word "are" appears in a text:
text = """
Llamas are fascinating animals that are found in the Andes Mountains.
Yoda says that cute animals they are.
Are llamas cute? Yes they are!
A llama never leaves its herd and will protect it with its life.
They are known for their long necks and thick fur.
Llamas are used as pack animals by indigenous people because they are strong and can carry heavy loads.
They are also very social animals and are often seen in groups.
Llamas ARE herbivores, which means they are only eating plants.
They are also known for their gentle and calm nature, which makes them popular in petting zoos.
Overall, llamas are remarkable creatures that are a joy to observe and are important to the cultures where they are found.
"""
def find_words(word_to_find, text):
    words = text.split()
    matching_words = []
    for w in words:
        if w == word_to_find:
            matching_words.append(w)
    return len(matching_words)
word_to_find = "are"
count = find_words(word_to_find, text)
print(f'The word "are" appears {count} times in this text.')
The word "are" appears 13 times in this text.
This is a somewhat naive approach: it misses variations of a word. For example, the word "are." at the end of the second line is not counted.
Spot the occurrences of the word "are" that are not caught by this code.
Hint: how does string equality (s1==s2) work with uppercase/lowercase or punctuation?
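To make the hint concrete, the short sketch below compares a few tokens against "are" using ==. The token list is made up for illustration; it mimics what text.split() produces.

```python
# Tokens as text.split() would produce them: punctuation stays attached.
tokens = ["are", "are.", "are!", "Are", "ARE"]

# == compares strings character by character, so case and punctuation matter.
exact_matches = [t for t in tokens if t == "are"]
print(exact_matches)  # ['are']
```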
Exercise 2
If you check the documentation for Python's string methods, you will find many that can help you build a better word detector.
Use them to rewrite the function find_words_less_naive below so that it:
- Works when words have punctuation (that is, ignores punctuation)
- Ignores case (uppercase/lowercase)
- Does not count a word if it appears in a different form, such as a plural (e.g., says does not count as say)
def find_words_less_naive(word_to_find, text):
    words = text.split()
    matching_words = []
    for w in words:
        if w == word_to_find:
            matching_words.append(w)
    return len(matching_words)
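As a starting point, here is a small sketch of two string methods that may help. It only cleans up a single token and is not a complete solution; the helper name normalize is hypothetical.

```python
import string

def normalize(token):
    """Strip punctuation from both ends of a token and lowercase it."""
    # string.punctuation holds the ASCII punctuation characters.
    return token.strip(string.punctuation).lower()

print(normalize("are."))   # are
print(normalize("ARE"))    # are
print(normalize("cute?"))  # cute
```

Note that str.strip only removes characters from the ends of the string, so punctuation inside a word is left untouched.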
Exercise 3
You may have noticed that the words "llama" and "llamas" both start with "llama".
If we wanted to write a function that searches for either llama or llamas, we could phrase the search somewhat like this:
The sequence llama followed by an optional s.
This type of pattern is called a regular expression.
A regular expression is a string that contains:
- Usual literal characters, like a, b, or 7
- Special markers, which have special behaviors
We use regular expressions to search for these patterns within strings. This can be seen as a more advanced way of using str.find. In fact, searching for a regular expression that contains no special characters behaves much like str.find.
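A minimal sketch of that comparison, using a sentence from the text above: with a pattern made only of literal characters, re.search locates the same position that str.find reports.

```python
import re

sentence = "A llama never leaves its herd."

# str.find returns the index of the first occurrence, or -1 if absent.
print(sentence.find("llama"))  # 2

# re.search returns a Match object (or None); .start() gives the same index.
match = re.search(r"llama", sentence)
print(match.start())  # 2
```

One difference to keep in mind: when nothing is found, str.find returns -1, while re.search returns None.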
Special characters, however, open up more interesting possibilities.
The ? character, for example, means that the previous character is optional, that is: llamas? matches both llama and llamas.
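The llamas? example can be tried out directly. This sketch uses re.fullmatch, which succeeds only when the whole string fits the pattern; the candidate strings are illustrative.

```python
import re

pattern = r"llamas?"
candidates = ["llama", "llamas", "llamass", "llam"]

# True when the entire candidate fits the pattern, False otherwise.
results = {c: bool(re.fullmatch(pattern, c)) for c in candidates}
print(results)
# {'llama': True, 'llamas': True, 'llamass': False, 'llam': False}
```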
aa?bb?cc?
Which of the strings below does the pattern aa?bb?cc? match?
Check your answers using the re library, as in the code after the list:
- cba
- abbcc
- abbc
- abcc
- aa
- aabbc
- abc
- aabc
- bcc
- abb
- aabcc
- ac
- ccbbaa
import re
print(re.match(r"aa?bb?cc?", "cba"))
print(re.match(r"aa?bb?cc?", "abc"))
print(re.match(r"aa?bb?cc?", "abc").group())
None
<re.Match object; span=(0, 3), match='abc'>
abc
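When checking answers, it helps to know that re.match only requires the pattern to fit at the beginning of the string, so trailing characters are allowed. The sketch below contrasts it with re.fullmatch and re.search on made-up inputs.

```python
import re

pattern = r"aa?bb?cc?"

# re.match anchors at the start only, so the trailing "xyz" is ignored.
prefix = re.match(pattern, "abcxyz")
print(prefix.group())  # abc

# re.fullmatch requires the entire string to fit the pattern.
whole = re.fullmatch(pattern, "abcxyz")
print(whole)  # None

# re.search looks for the pattern anywhere in the string.
anywhere = re.search(pattern, "xyzabc")
print(anywhere.span())  # (3, 6)
```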
Exercise 4
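Now we know how to use sets and question marks. (As a reminder, a set such as [aA] matches any single one of the characters listed inside the brackets.)
Two other symbols that will be useful are:
- The + symbol indicates one or more occurrences of the previous symbol
- The * symbol indicates zero or more occurrences of the previous symbol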
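The difference between the two symbols shows up when the repeated character is absent. A small sketch with illustrative strings (not the ones from the exercise):

```python
import re

samples = ["ll", "lol", "loool"]

# o+ needs at least one "o"; o* also accepts none at all.
plus_results = {s: bool(re.fullmatch(r"lo+l", s)) for s in samples}
star_results = {s: bool(re.fullmatch(r"lo*l", s)) for s in samples}

print(plus_results)  # {'ll': False, 'lol': True, 'loool': True}
print(star_results)  # {'ll': True, 'lol': True, 'loool': True}
```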
Now, match the regular expressions below with their corresponding strings. Check your answers with the re.match() function.
| Strings | Expressions |
|---|---|
| BAbaaaccc | [aA][bB]c* |
| Bccc | [aA]?[bB]ccc |
| ab | [aA][bB]c+ |
| ABccccc | [aAbB]*c+ |
import re
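One possible way to organize the check is a small helper that reports which patterns match a given string. The function name matching_patterns and the example data are hypothetical placeholders, not the exercise's table.

```python
import re

def matching_patterns(text, patterns):
    """Return the patterns that match `text` starting at its first character."""
    return [p for p in patterns if re.match(p, text)]

# Placeholder data for illustration only.
example_patterns = [r"[cC]at", r"[dD]og+", r"[cC]?hat"]
for s in ["Cat", "dogggg", "hat"]:
    print(s, matching_patterns(s, example_patterns))
```

Since re.match accepts trailing characters, more than one pattern may be reported for the same string.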
Exercise 5
There are some special sets we should be aware of:
- [A-Z] denotes a range. In this case, the set contains all uppercase ASCII letters from A to Z.
- Ranges can be combined, like [A-Za-z] for all ASCII letters.
- Modern regular expressions have shorthand sets such as \w for any word character (letters, digits, and the underscore), or \d for any digit.
Check the documentation at https://docs.python.org/3/library/re.html to see more symbols!
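The sets described above can be tried on a few illustrative strings. This sketch uses re.fullmatch so that the whole string has to fit the set.

```python
import re

checks = {
    # [a-z]+ accepts only lowercase ASCII letters.
    "lower_ok": bool(re.fullmatch(r"[a-z]+", "llama")),        # True
    "lower_fail": bool(re.fullmatch(r"[a-z]+", "Llama")),      # False
    # \d+ accepts digits only.
    "digits_ok": bool(re.fullmatch(r"\d+", "2024")),           # True
    "digits_fail": bool(re.fullmatch(r"\d+", "20x4")),         # False
    # \w+ accepts letters, digits, and underscores, but not spaces.
    "word_ok": bool(re.fullmatch(r"\w+", "herd_size_3")),      # True
    "word_fail": bool(re.fullmatch(r"\w+", "herd size")),      # False
}
for name, result in checks.items():
    print(name, result)
```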
Match the strings below with their corresponding expressions. Check your answers using the re.match() function.
| Strings | Expressions |
|---|---|
| SnakeCaseVariableNames123 | \w+.com |
| nlp.com | \w+ |
| 123432412394 | \w\w\w+[.]com |
| Brown | [A-Za-z][A-Za-z0-9]+ |
| n!com | \d+ |