Regular Expressions#
Match patterns against text
Reading a file#
Use open()
names_file = open("names.txt", encoding="utf-8")
data = names_file.read()
names_file.close()
A better way is a with block, which closes the file automatically:
with open("some_file.txt", encoding="utf-8") as open_file:
    data = open_file.read()
Get the regex library#
import re
Match#
Matches only at the beginning of the string.
The r prefix tells Python it is a raw string, so backslashes are not treated as escape characters.
re.match(r'Love', data)
Search#
Match anywhere in string
re.search(r'Kenneth', data)
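A minimal sketch of the difference between match() and search(). The sample string is made up and stands in for the contents of names.txt:

```python
import re

# Hypothetical sample text standing in for the contents of names.txt.
data = "Love, Kenneth\tkenneth@example.com"

# match() only succeeds if the pattern matches at position 0.
at_start = re.match(r'Love', data)
not_at_start = re.match(r'Kenneth', data)
print(at_start)       # a Match object
print(not_at_start)   # None, because 'Kenneth' is not at the start

# search() scans the whole string and returns the first match it finds.
found = re.search(r'Kenneth', data)
print(found.group(), found.start())
```

Both functions return None when nothing matches, so it is worth checking the result before calling group() on it.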
Findall#
Returns a list of all non-overlapping matches in the string.
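As a small illustration of what "non-overlapping" means here, using a made-up string:

```python
import re

# Hypothetical sample text; the notes above read their data from a file.
data = "abababa and ababa"

# findall() scans left to right. After a match ends, scanning resumes
# from that end position, so overlapping occurrences are not reported.
matches = re.findall(r'aba', data)
print(matches)  # ['aba', 'aba', 'aba']
```

Counting overlaps, "aba" occurs five times in that string, but findall() reports only three of them.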
Escape characters#
\w - any Unicode word character
\W - anything that isn't a Unicode word character
\s - whitespace
\S - non-whitespace
\d - any decimal digit, e.g. 0 - 9
\D - anything that isn't a digit
\b - word boundary (edges of a word)
\B - anything that isn't a word boundary
Parentheses define a group in regular expressions.
To match a literal parenthesis you have to escape it: \( or \)
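A short sketch of a few of these escapes in action. The sample sentence is invented for the example:

```python
import re

# Made-up sample line for illustration only.
text = "Call (555) 123-4567 or email me, ok?"

words = re.findall(r'\w+', text)               # runs of word characters
digits = re.findall(r'\d+', text)              # runs of digits
o_words = re.findall(r'\bo\w*', text)          # words starting with "o"
area_code = re.search(r'\(\d+\)', text).group()  # literal parentheses

print(words)
print(digits)     # ['555', '123', '4567']
print(o_words)    # ['or', 'ok']
print(area_code)  # (555)
```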
Frequency#
{3} - exactly 3 times
{,3} - 0 to 3 times
{3,} - 3 or more times
{3,5} - 3 to 5 times
? - optional (0 or 1 time)
* - occurs 0 or more times
+ - occurs 1 or more times
e.g.
re.search(r'\w+', data)
re.search(r'\(?\d{3}\)?', data)
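To see the quantifiers working together, here is a sketch that picks phone numbers out of a made-up string. The numbers and formats are assumptions for the example:

```python
import re

# Hypothetical phone numbers in a few formats.
text = "(555) 123-4567, 555-123-4567, 555 1234"

# \(? and \)? make the parentheses optional, -? and \s? allow an
# optional dash or space, and {3} / {4} fix the number of digits.
pattern = r'\(?\d{3}\)?-?\s?\d{3}-\d{4}'
phones = re.findall(pattern, text)
print(phones)  # ['(555) 123-4567', '555-123-4567']
```

The last entry, "555 1234", is skipped because it has no dash followed by four digits.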
Sets#
[aple] - matches any one of the characters a, p, l, e (so [aple]+ matches apple)
[a-z] - any lowercase letter (ranges)
[^2] - anything that is not 2
Email address example:
print(re.findall(r'[-\w\d+.]+@[-\w\d.]+', data))
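A few quick checks of how sets behave, using invented strings. The email address is a placeholder, not real data:

```python
import re

# Words built only from the letters a, p, l, e.
aple_words = re.findall(r'\b[aple]+\b', "apple pale plea apples")
print(aple_words)  # ['apple', 'pale', 'plea']

# A range: runs of lowercase letters only.
lower = re.findall(r'[a-z]+', "abc DEF ghi")
print(lower)       # ['abc', 'ghi']

# A negated set: every character that is not 2.
not_two = re.findall(r'[^2]', "1223")
print(not_two)     # ['1', '3']

# The email pattern from above, run against a made-up sentence.
emails = re.findall(r'[-\w\d+.]+@[-\w\d.]+',
                    "Contact: kenneth+spam@example.com, thanks")
print(emails)      # ['kenneth+spam@example.com']
```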
Flags#
- Ignore case: re.findall(r'[trehous]+\b', data, re.IGNORECASE)
The shorthand for re.IGNORECASE is re.I
- Multi-line patterns (verbose regex): use re.VERBOSE or re.X, which ignores whitespace and # comments inside the pattern
- Add multiple flags with the pipe symbol: re.VERBOSE|re.I
- Treat each line as its own string for ^ and $: use re.MULTILINE or re.M
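A minimal sketch of re.I on its own and combined with re.X, assuming a made-up input string:

```python
import re

text = "Tree HOUSE short 42"

# re.I lets the lowercase set match uppercase letters too.
ignore_case = re.findall(r'[trehous]+\b', text, re.I)
print(ignore_case)  # ['Tree', 'HOUSE', 'short']

# re.X allows the pattern to be spread over several lines with comments.
# Flags are combined with the pipe symbol.
pattern = re.compile(r'''
    [a-z]+    # a run of letters
    \s        # one whitespace character
    \d+       # a run of digits
''', re.X | re.I)
verbose = pattern.findall(text)
print(verbose)      # ['short 42']
```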
Beginning and End#
Beginning: ^
End: $
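The anchors behave differently depending on whether re.M is set. A short sketch with a two-line sample string:

```python
import re

text = "first line\nsecond line"

# Without re.M, ^ and $ anchor to the whole string.
start_plain = re.findall(r'^\w+', text)
end_plain = re.findall(r'\w+$', text)
print(start_plain, end_plain)  # ['first'] ['line']

# With re.M, they anchor to the start and end of every line.
start_multi = re.findall(r'^\w+', text, re.M)
end_multi = re.findall(r'\w+$', text, re.M)
print(start_multi, end_multi)  # ['first', 'second'] ['line', 'line']
```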
Named Groups#
Use (?P<name>your-expression-here) to give a group a name
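For instance, a sketch with two named groups. The group names last and first and the sample string are chosen just for this example:

```python
import re

# Hypothetical "Last, First" string.
m = re.search(r'(?P<last>\w+),\s(?P<first>\w+)', "Love, Kenneth")

first = m.group('first')   # look up one group by name
groups = m.groupdict()     # all named groups as a dictionary
print(first)   # Kenneth
print(groups)  # {'last': 'Love', 'first': 'Kenneth'}
```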
Making a dictionary out of a list#
line = re.search(r'''
^(?P<name>[-\w ]*,\s[-\w ]+)\t # last and first names
(?P<email>[-\w\d.+]+@[-\w\d.]+)\t # Email
(?P<phone>\(?\d{3}\)?-?\s?\d{3}-\d{4})?\t # Phone
(?P<job>[\w\s]+,\s[\w\s.]+)\t? # Job and company
(?P<twitter>@[\w\d]+)?$ # twitter
''', data, re.X|re.M)
print(line)
print(line.groupdict())
Compile a pattern to an object#
Use re.compile() to get a pattern ready for reuse.
Remove data from the call, as the pattern won't have been run against anything at that stage.
The compiled object allows returning an iterable of matches with finditer().
line = re.compile(r'''
^(?P<name>(?P<last>[-\w ]*),\s(?P<first>[-\w ]+))\t # last and first names
(?P<email>[-\w\d.+]+@[-\w\d.]+)\t # Email
(?P<phone>\(?\d{3}\)?-?\s?\d{3}-\d{4})?\t # Phone
(?P<job>[\w\s]+,\s[\w\s.]+)\t? # Job and company
(?P<twitter>@[\w\d]+)?$ # twitter
''', re.X|re.M)
print(re.search(line, data).groupdict())
print(line.search(data).groupdict())
for match in line.finditer(data):
    print('{first} {last} <{email}>'.format(**match.groupdict()))
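A cut-down, runnable sketch of the same idea. The two records are invented, and the pattern keeps only the name and email fields, assuming a "Last, First", tab, email layout:

```python
import re

# Made-up tab-separated records: "Last, First<TAB>email".
data = "Love, Kenneth\tkenneth@example.com\nDoe, Jane\tjane.doe@example.org\n"

# Compile once. The flags go to re.compile(), and no data is passed yet.
contact = re.compile(r'''
    ^(?P<last>[-\w ]*),\s(?P<first>[-\w ]+)\t   # last and first names
    (?P<email>[-\w\d.+]+@[-\w\d.]+)$            # email
''', re.X | re.M)

# The compiled object has its own search(), findall() and finditer().
formatted = [
    '{first} {last} <{email}>'.format(**m.groupdict())
    for m in contact.finditer(data)
]
print(formatted)
```

Compiling is mostly useful when the same pattern is applied many times, since the flags and the pattern text are then written in one place.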
String Interpolation into a Regex#
You can format a regular expression string the same way you do any other string interpolation:
pattern = r'^set groups {group} interfaces (?P<line>.[^\n]+)$'.format(
    group=group
)
lines_regex = re.compile(pattern, re.MULTILINE)
matches = lines_regex.findall(file_contents)
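One thing to watch for: if the interpolated value can contain regex metacharacters, it may be safer to pass it through re.escape() first. A sketch under that assumption, with a made-up group name and config text:

```python
import re

# Hypothetical config text and group name, for illustration only.
file_contents = (
    "set groups core.1 interfaces ge-0/0/0 unit 0\n"
    "set groups coreX1 interfaces ge-0/0/9 unit 0\n"
)
group = "core.1"

# re.escape() keeps characters such as "." in the interpolated value
# from being read as regex syntax.
pattern = r'^set groups {group} interfaces (?P<line>.[^\n]+)$'.format(
    group=re.escape(group)
)
lines_regex = re.compile(pattern, re.MULTILINE)
matches = lines_regex.findall(file_contents)
print(matches)  # ['ge-0/0/0 unit 0']
```

Without the escape, the "." in core.1 would match any character and the coreX1 line would be returned as well. Note also that quantifier braces such as {3} need doubling ({{3}}) inside a str.format() template.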
Findall with groupdicts#
You can't get groupdicts if you use findall().
However, there is a way to still get the groupdicts using finditer():
[m.groupdict() for m in regex.finditer(search_string)]
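A small side-by-side sketch of the two approaches. The names regex and search_string follow the line above, and the pattern and input are invented:

```python
import re

regex = re.compile(r'(?P<key>\w+)=(?P<value>\w+)')
search_string = "a=1 b=2"

# findall() returns plain tuples, so the group names are lost.
tuples = regex.findall(search_string)
print(tuples)  # [('a', '1'), ('b', '2')]

# finditer() yields Match objects, which still have groupdict().
dicts = [m.groupdict() for m in regex.finditer(search_string)]
print(dicts)   # [{'key': 'a', 'value': '1'}, {'key': 'b', 'value': '2'}]
```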
Get anything that is not a newline#
re.findall(r'^(?P<line>.[^\n]+)$', words, re.MULTILINE)
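A quick sketch of what this pattern returns, using a made-up value for words. Because the pattern is . followed by [^\n]+, a line needs at least two characters to match:

```python
import re

# Hypothetical input with one single-character line.
words = "alpha\nb\ngamma"

# With a single group in the pattern, findall() returns that group's text.
lines = re.findall(r'^(?P<line>.[^\n]+)$', words, re.MULTILINE)
print(lines)  # ['alpha', 'gamma']
```

The single-character line "b" is skipped, as would be any empty line.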