2.35. Pattern basics
A regular expression is a small language for describing strings by
form rather than by their exact contents. Use it when a string
matches if-and-only-if it follows a pattern you can describe but
can’t enumerate – “a sequence of digits followed by a unit”, “a line
that starts with ERROR and ends with a number”, “any of these
file extensions, in any order, with optional v prefixes”.
Reach for re only when a plain string method won’t do. Plain methods already cover the fixed-string cases:
str.startswith(), str.endswith() – testing for a fixed prefix or suffix.
in – testing whether a fixed substring is present.
str.split(), str.find(), str.replace() – working with fixed delimiters.
Each of those is faster, easier to read, and harder to get wrong than the equivalent regex. Use regex when the form of the string matters and the exact substring does not.
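As a rough illustration of the fixed-string cases, the sketch below handles a made-up log line with plain string operations only. The sample text and variable names are assumptions chosen for the example.

```python
# Hypothetical log line, used only for illustration.
line = "ERROR: sensor 3 timeout"

# Fixed prefix: no pattern needed.
is_error = line.startswith("ERROR")

# Fixed substring: the in operator is enough.
mentions_timeout = "timeout" in line

# Fixed delimiter: split once on the first ": ".
level, message = line.split(": ", 1)
```

None of these tests depends on the form of the text, only on exact substrings, which is why re adds nothing here.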
2.35.1. The four things you’ll use
The MicroPython re module surfaces four things:
re.compile() – turn a pattern string into a compiled pattern object you can reuse.
re.match() – try the pattern at the start of a string. The pattern is anchored at position 0.
re.search() – try the pattern anywhere in a string. Returns the first match.
re.sub() – find every match and replace it.
Notable omissions vs CPython: no re.findall, no re.finditer,
no re.split at module level (compiled patterns have a split
method instead), no re.fullmatch, no flag constants like
re.IGNORECASE. Where you’d reach for one of those on CPython,
build the equivalent from re.search() in a loop.
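One possible shape for such a loop is sketched below. The helper name find_all is hypothetical. It assumes the match object provides end(), which CPython always has but which may not be enabled in every MicroPython build. Because it slices the input after each match, a pattern that uses ^ would see each remainder as a fresh start of string.

```python
import re

def find_all(pattern, text):
    """Collect the text of every non-overlapping match, left to right."""
    compiled = re.compile(pattern)
    found = []
    rest = text
    while rest:
        m = compiled.search(rest)
        if m is None:
            break
        found.append(m.group(0))
        # Continue after this match; step at least one character so an
        # empty match cannot loop forever. Assumes m.end() is available.
        rest = rest[max(m.end(), 1):]
    return found
```

For example, find_all(r'\d+', 'log 12, log 345, log 6') collects the three digit runs in order.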
2.35.2. A first pattern
The pattern r'\d+' matches one or more digits:
>>> import re
>>> m = re.search(r'\d+', 'sensor reading 42 ok')
>>> m.group(0)
'42'
A few things to notice:
The pattern is written as a raw string (r'...') so the backslash in \d reaches re instead of being processed as a Python string escape. Always use raw strings for regex patterns.
re.search() returns a match object on success and None on failure. Always check before calling match.group().
m.group(0) is the full text the pattern matched. Groups 1, 2, … appear later, once the pattern contains capturing parentheses.
The same pattern with re.match() returns None because
the string doesn’t start with a digit:
>>> re.match(r'\d+', 'sensor reading 42 ok') is None
True
>>> re.match(r'\d+', '42 readings')
<match num=1>
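The check-before-group() habit can be wrapped in a small helper. This is a minimal sketch: the function name first_digits and its default parameter are assumptions made for the example.

```python
import re

def first_digits(text, default=None):
    """Return the first run of digits in text, or default if there is none."""
    m = re.search(r'\d+', text)
    if m is None:
        # No match: calling m.group() here would raise AttributeError.
        return default
    return m.group(0)
```

Callers then receive either the matched text or the default they chose, and never touch a None match object.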
2.35.3. The pieces of a pattern
Most useful patterns are built from a small set of pieces. The ones that work in MicroPython:
Literal characters – any character that isn’t special matches
itself. hello matches hello.
Special characters – . ^ $ * + ? { } [ ] \ | ( ) all have
meanings below. To match one of them literally, escape it with a
backslash: \. matches a literal dot.
Character classes – shorthands for common character sets:
\d – any digit 0-9
\D – any non-digit
\s – any whitespace character (space, tab, newline)
\S – any non-whitespace
\w – any “word” character: letters, digits, underscore
\W – any non-word character
. – any character except newline
Quantifiers – how many times the previous piece must match:
* – zero or more (greedy)
+ – one or more (greedy)
? – zero or one
{n} – exactly n
{m,n} – between m and n (inclusive)
Combining: \d{3}-\d{4} matches three digits, a dash, four digits.
\s+ matches one or more whitespace characters. hello.*world
matches hello, anything (including nothing), then world.
Note
Greedy means the quantifier consumes as much of the input as it
can while still letting the rest of the pattern match. Against
hello x world y world, the .* in hello.*world matches
the longest run that still leaves a world at the end – it
captures x world y, not the shorter x. The same is true of
+ and the {m,n} range form: the engine takes the longest
match it can, then backs off only if the rest of the pattern
fails.
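The greedy behaviour described in the note can be checked directly. The sketch below reuses the note's input string. The second sample string and the variable names are assumptions for illustration.

```python
import re

text = 'hello x world y world'

# Greedy .* runs on to the last "world" it can still reach.
m = re.search(r'hello.*world', text)
whole = m.group(0) if m else None

# \s+ spans the whole run of whitespace between the number and the unit.
n = re.search(r'\d+\s+\w+', 'reading:  42    volts')
pair = n.group(0) if n else None
```

Here whole is the entire input rather than the shorter 'hello x world', and pair keeps all four spaces between 42 and volts.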
2.35.4. Substitution
re.sub() finds every match and replaces it with a string. The
replacement can reference captured groups via \1, \2, …
(covered with the rest of the group syntax later). Without groups,
re.sub is a straight find-and-replace on a regex:
>>> re.sub(r'\s+', ' ', 'too    many   spaces')
'too many spaces'
>>> re.sub(r'\d+', 'N', 'log 12, log 345, log 6')
'log N, log N, log N'
The third argument is the string to operate on; the result is a new string with every match replaced.
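As a brief preview of group references in the replacement, the sketch below swaps the two sides of key=value pairs. The sample input is made up, and the parentheses are capturing groups.

```python
import re

# \1 and \2 in the replacement refer to the first and second groups.
swapped = re.sub(r'(\w+)=(\w+)', r'\2=\1', 'a=1 b=2')
```

The replacement is also written as a raw string, so \1 and \2 reach re unchanged.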
2.35.5. Splitting – on a compiled pattern only
There is no re.split at module level. To split on a regex, compile
the pattern first and call its split method:
>>> sep = re.compile(r'\s*,\s*')
>>> sep.split('a , b,c , d')
['a', 'b', 'c', 'd']
The optional second argument caps the number of splits:
>>> sep.split('a, b, c, d', 2)
['a', 'b', 'c, d']
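A compiled separator is convenient when the delimiter itself varies. The sketch below is a hypothetical helper, parse_ints, that accepts commas or semicolons with optional surrounding whitespace and skips empty fields. It assumes every non-empty field is a valid integer.

```python
import re

# Comma or semicolon, with optional whitespace on either side.
field_sep = re.compile(r'\s*[,;]\s*')

def parse_ints(line):
    """Split line on the separator and convert non-empty fields to int."""
    values = []
    for piece in field_sep.split(line):
        if piece:  # skip empty fields such as the gap in '4,,5'
            values.append(int(piece))
    return values
```

For example, parse_ints('1, 2 ;3') yields the three integers in order.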
2.35.6. Compiling for reuse
If the same pattern runs many times – inside a loop or in a hot function – compile it once and reuse the compiled object:
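import re

digit_run = re.compile(r'\d+')  # compiled once, when the module loads

def first_number(line):
    # Return the first run of digits as an int, or None if there is none.
    m = digit_run.search(line)
    return int(m.group(0)) if m else None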
Calling pattern.match() and pattern.search() on a compiled
object is the same as the module-level functions but skips the
recompile cost on every call.
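To make the reuse concrete, here is a sketch of a loop that applies one compiled pattern to many lines. The function name total_of_first_numbers and the sample data are assumptions for the example.

```python
import re

number = re.compile(r'\d+')  # compiled once, reused for every line

def total_of_first_numbers(lines):
    """Add up the first number found on each line; skip lines without one."""
    total = 0
    for line in lines:
        m = number.search(line)
        if m:
            total += int(m.group(0))
    return total
```

With the lines 'a 1', 'none' and 'b 20 30', the helper adds 1 and 20 and ignores the 30, because search() stops at the first match.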
2.35.7. Patterns that don’t match anything
Three patterns in particular catch developers out:
.* also matches the empty string, so re.search(r'.*', s) never returns None. On an empty input, group(0) is ''. Don’t use it to test whether anything is there.
An unescaped special character keeps its special meaning instead of matching itself. re.compile(r'cost: $5') can never match the text cost: $5 because $ means “end of string”. Use r'cost: \$5'.
The dot . does not match a newline. To match across newlines, write the pattern to handle them explicitly with [\s\S] or feed one line at a time.
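The effect of escaping can be seen side by side. This is a small sketch with made-up sample strings and variable names.

```python
import re

# Unescaped dot: matches any single character, so '3x14' slips through.
loose = re.search(r'3.14', 'pi is 3x14') is not None

# Escaped dot: only a literal '.' is accepted.
strict = re.search(r'3\.14', 'pi is 3x14') is not None

# Escaped dollar sign: matches the literal text.
m = re.search(r'cost: \$5', 'cost: $5')
price = m.group(0) if m else None
```

Here loose is True and strict is False, while price holds the literal text cost: $5.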
With these pieces a pattern can match almost any fixed-form slice of text. Pulling structured data back out of the match requires capturing groups.