2.35. Pattern basics

A regular expression is a small language for describing strings by form rather than by their exact contents. Use it when the strings you want to accept share a form you can describe but can’t enumerate: “a sequence of digits followed by a unit”, “a line that starts with ERROR and ends with a number”, “any of these file extensions, in any order, with optional v prefixes”.

Reach for re only when a plain string method won’t do. The plain tools already cover the common fixed-text cases:

str.startswith(), str.endswith() – testing for a fixed prefix or suffix.
in – testing whether a fixed substring is present.
str.split(), str.find(), str.replace() – working with fixed delimiters.

Each of those is faster, easier to read, and harder to get wrong than the equivalent regex. Use regex when the form of the string matters and the exact substring does not.
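As a rough illustration of the fixed-text cases listed above, the sketch below uses only built-in string operations. The file name and log line are made up for the example.

```python
# Fixed text: plain string operations are enough, no re needed.
filename = 'boot_v2.py'
line = 'ERROR: sensor=temp, value=42'

is_python = filename.endswith('.py')        # fixed suffix
is_error = line.startswith('ERROR')         # fixed prefix
has_sensor = 'sensor=' in line              # fixed substring
fields = line.split(': ')[1].split(', ')    # fixed delimiters
value_at = line.find('value=')              # index of a fixed substring, -1 if absent
cleaned = line.replace('ERROR', 'E')        # fixed replacement
```

None of these needs a pattern, because every piece of text being tested is known exactly in advance.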

2.35.1. The four things you’ll use

The MicroPython re module surfaces four things:

re.compile() – turn a pattern string into a compiled pattern object you can reuse.
re.match() – try the pattern at the start of a string. The pattern is anchored at position 0.
re.search() – try the pattern anywhere in a string. Returns the first match.
re.sub() – find every match and replace it.

Notable omissions compared with CPython: no re.findall, no re.finditer, no re.split at module level (compiled patterns have a split method instead), no re.fullmatch, no flag constants like re.IGNORECASE. Where you’d reach for re.findall or re.finditer on CPython, build the equivalent from re.search() in a loop.
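One possible shape for such a loop is sketched below. The helper name find_all is hypothetical, and the sketch assumes the match object provides end(); if a given MicroPython build lacks it, the offset has to be derived some other way.

```python
import re

def find_all(pattern, text):
    """Collect every non-overlapping match of pattern in text, left to right."""
    compiled = re.compile(pattern)
    found = []
    rest = text
    while rest:
        m = compiled.search(rest)
        if m is None:
            break
        found.append(m.group(0))
        # Assumption: m.end() exists and gives the offset just past the match.
        # Step at least one character so an empty match cannot loop forever.
        rest = rest[max(m.end(), 1):]
    return found

numbers = find_all(r'\d+', 'log 12, log 345, log 6')
```

Because each pass searches a fresh slice, an anchor such as ^ is tested against the start of every remainder, so this sketch suits unanchored patterns best.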

2.35.2. A first pattern

The pattern r'\d+' matches one or more digits:

>>> import re
>>> m = re.search(r'\d+', 'sensor reading 42 ok')
>>> m.group(0)
'42'

A few things to notice:

The pattern is written as a raw string (r'...') so the backslash in \d reaches re instead of being processed as a Python string escape. Always use raw strings for regex patterns.
re.search() returns a match object on success and None on failure. Always check for None before calling group() on the result.
m.group(0) is the full text the pattern matched. Group 1, 2, … appear later, once the pattern contains capturing parentheses.
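The first two points can be sketched as follows. The helper name first_digits is hypothetical, and the length comparison only illustrates what a raw string preserves.

```python
import re

# In a normal string '\t' is a single tab character; the raw string keeps the
# backslash and the letter as two separate characters for re to interpret.
plain_len = len('\t')    # 1
raw_len = len(r'\t')     # 2

def first_digits(text):
    """Return the first run of digits in text, or None when there is none."""
    m = re.search(r'\d+', text)
    if m is None:
        # No match: calling m.group() here would raise AttributeError.
        return None
    return m.group(0)

found = first_digits('sensor reading 42 ok')
missing = first_digits('sensor offline')
```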

The same pattern with re.match() returns None because the string doesn’t start with a digit:

>>> re.match(r'\d+', 'sensor reading 42 ok') is None
True
>>> re.match(r'\d+', '42 readings')
<match num=1>

2.35.3. The pieces of a pattern

Most useful patterns are built from a small set of pieces. The ones that work in MicroPython:

Literal characters – any character that isn’t special matches itself. hello matches hello.

Special characters – . ^ $ * + ? { } [ ] \ | ( ) all have meanings below. To match one of them literally, escape it with a backslash: \. matches a literal dot.

Character classes – shorthands for common character sets:

\d – any digit 0-9
\D – any non-digit
\s – any whitespace character (space, tab, newline)
\S – any non-whitespace
\w – any “word” character: letters, digits, underscore
\W – any non-word character
. – any character except newline

Quantifiers – how many times the previous piece must match:

* – zero or more (greedy)
+ – one or more (greedy)
? – zero or one
{n} – exactly n
{m,n} – between m and n (inclusive)

Combining: \d{3}-\d{4} matches three digits, a dash, four digits. \s+ matches one or more whitespace characters. hello.*world matches hello, anything (including nothing), then world.
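A small sketch of how these pieces combine, using made-up input strings. It sticks to character classes, + and *, and an escaped dot.

```python
import re

# \d+ : one or more digits; \s* : optional whitespace; \w+ : a unit made of word characters.
reading = re.compile(r'\d+\s*\w+')
m = reading.search('temp: 42 mV at boot')
matched = m.group(0) if m else None

# \. is an escaped, literal dot, so only a real '.' is accepted between the numbers.
literal = re.search(r'v\d+\.\d+', 'firmware v1.20 loaded')
version = literal.group(0) if literal else None

# An unescaped . accepts any character in that position, including 'x'.
wildcard = re.search(r'v\d+.\d+', 'firmware v1x20 loaded')
loose = wildcard.group(0) if wildcard else None

# With the escaped dot, the same input no longer matches.
strict = re.search(r'v\d+\.\d+', 'firmware v1x20 loaded')
```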

Note

Greedy means the quantifier consumes as much of the input as it can while still letting the rest of the pattern match. Against hello x world y world, the .* in hello.*world matches the longest run that still leaves a world at the end: it takes " x world y " (spaces included), not the shorter " x ". The same is true of + and the {m,n} range form: the engine takes the longest match it can, then backs off only if the rest of the pattern fails.
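The greedy behaviour described in the note can be checked directly. This is a minimal sketch using the same sample string.

```python
import re

text = 'hello x world y world'

# .* runs to the last 'world', so the match covers the entire string.
greedy = re.search(r'hello.*world', text)
whole = greedy.group(0) if greedy else None

# + also takes the longest run it can: all five digits, not just the first.
digits = re.search(r'\d+', 'id 12345 end')
run = digits.group(0) if digits else None
```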

2.35.4. Substitution

re.sub() finds every match and replaces it with a string. The replacement can reference captured groups via \1, \2, … (covered with the rest of the group syntax later). Without groups, re.sub is a straight find-and-replace on a regex:

>>> re.sub(r'\s+', ' ', 'too    many   spaces')
'too many spaces'

>>> re.sub(r'\d+', 'N', 'log 12, log 345, log 6')
'log N, log N, log N'

The third argument is the string to operate on; the result is a new string with every match replaced.
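As a sketch of how this fits into a small helper (the name normalize is hypothetical), the function below collapses whitespace and shows that the input string itself is left untouched.

```python
import re

def normalize(line):
    """Collapse whitespace runs to single spaces and trim the ends."""
    return re.sub(r'\s+', ' ', line).strip()

raw = '  too    many\t spaces \n'
tidy = normalize(raw)
# raw still holds the original text; re.sub returned a new string.
```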

2.35.5. Splitting – on a compiled pattern only

There is no re.split at module level. To split on a regex, compile the pattern first and call its split method:

>>> sep = re.compile(r'\s*,\s*')
>>> sep.split('a , b,c ,  d')
['a', 'b', 'c', 'd']

The optional second argument caps the number of splits:

>>> sep.split('a, b, c, d', 2)
['a', 'b', 'c, d']
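The split cap is useful when only the leading fields are structured. The sketch below assumes a made-up record layout of level, source, then a free-form message; parse_record is a hypothetical helper.

```python
import re

sep = re.compile(r'\s*,\s*')

def parse_record(line):
    """Split 'level, source, free-form message' into three fields."""
    # At most two splits, so commas inside the message are left alone.
    return sep.split(line, 2)

record = parse_record('ERROR , disk,sda1 full, retrying')
```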

2.35.6. Compiling for reuse

If the same pattern runs many times – inside a loop or in a hot function – compile it once and reuse the compiled object:

digit_run = re.compile(r'\d+')

def first_number(line):
    m = digit_run.search(line)
    return int(m.group(0)) if m else None

The match() and search() methods of a compiled object behave the same as the module-level functions, but they skip the cost of recompiling the pattern on every call.
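A usage sketch with made-up input lines: the pattern is compiled once and the compiled object is reused for every line.

```python
import re

digit_run = re.compile(r'\d+')   # compiled once, outside the loop

def first_number(line):
    m = digit_run.search(line)
    return int(m.group(0)) if m else None

lines = ['temp 21', 'no reading', 'rpm 1500 ok']
values = [first_number(line) for line in lines]

# The module-level call finds the same text but is handed the pattern string each time.
same = re.search(r'\d+', lines[0]).group(0) == digit_run.search(lines[0]).group(0)
```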

2.35.7. Patterns that don’t match anything

Three patterns in particular catch developers out:

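.* can match the empty string, so it never fails. re.search(r'.*', s) returns a match object for every input, and group(0) is '' when s is empty. The same applies to any pattern whose pieces are all optional: re.search(r'\d*', 'abc').group(0) returns '' because zero digits at position 0 already counts as a match.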
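A pattern with an unescaped special character is usually not a syntax error; it compiles and then silently fails to match. re.search(r'cost: $5', 'cost: $5') returns None because $ means “end of string”, and no 5 can follow the end of the string. Use r'cost: \$5'.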
The dot . does not match a newline. To match across newlines, write the pattern to handle them explicitly with [\s\S] or feed one line at a time.
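The three cases can be sketched as below. The results are what CPython’s re produces, so it is worth re-running the checks on the target MicroPython build.

```python
import re

# 1. Every piece optional: the search succeeds with an empty match.
empty = re.search(r'\d*', 'abc').group(0)

# 2. Unescaped $ means end of string, so this compiles but cannot match.
unescaped = re.search(r'cost: $5', 'cost: $5')
escaped = re.search(r'cost: \$5', 'cost: $5')

# 3. . stops at a newline; [\s\S] accepts any character, newline included.
two_lines = 'start\nend'
dot = re.search(r'start.end', two_lines)
any_char = re.search(r'start[\s\S]end', two_lines)
```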

With these pieces a pattern can match almost any fixed-form slice of text. Pulling structured data back out of the match requires capturing groups.