2.36. Groups and anchors

A pattern can do more than say “this string matches” – it can pull the matched pieces apart and hand each one to the application by number. Parentheses around part of a pattern make it a capturing group; the match object then exposes what each group matched as a separate substring.

2.36.1. Capturing groups

Wrap any part of a pattern in (...) to capture what it matched:

>>> import re
>>> m = re.search(r'temp (\d+) at (\d+)s', 'temp 42 at 137s ok')
>>> m.group(0)
'temp 42 at 137s'
>>> m.group(1)
'42'
>>> m.group(2)
'137'

Group 0 is always the entire match.
Groups 1, 2, … are the captured substrings, numbered left to right by their opening parenthesis.
Calling match.group() with an index past the last group raises IndexError.
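
The numbering rule matters most when groups are nested. The following is a minimal sketch with a made-up input string and pattern: numbers follow the opening parentheses from left to right, so an outer group gets a lower number than the groups nested inside it.

```python
import re

# Hypothetical input: a numeric range followed by a unit.
m = re.search(r'((\d+)-(\d+)) (\w+)', 'range 10-20 mm')

whole = m.group(0)   # the entire match
outer = m.group(1)   # the first "(" opens the outer group
low = m.group(2)     # the second "(" is nested inside group 1
high = m.group(3)    # the third "(" is also nested inside group 1
unit = m.group(4)    # the fourth "(" comes after group 1 closes

# Asking for a group the pattern does not define raises IndexError.
try:
    m.group(5)
    past_end = 'no error'
except IndexError:
    past_end = 'IndexError'
```

Here outer holds '10-20' while low and high hold its two halves, which shows that nesting does not change the left-to-right count.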

A common pattern is “match a known structure, capture the variable parts as ints”:

import re

def parse_temp(line):
    # Return (temperature, seconds) as ints, or None if the line doesn't match.
    m = re.search(r'temp (\d+) at (\d+)s', line)
    if not m:
        return None
    return int(m.group(1)), int(m.group(2))

2.36.2. Non-capturing groups

Parentheses also group a sub-expression so a quantifier can apply to the whole group. That’s the only purpose of grouping in r'(ab)+' – “one or more of ab”. The fact that ab shows up as group 1 is a side effect.

To group without capturing, use (?:...):

>>> re.search(r'(?:ab)+', 'xababy').group(0)
'abab'

Non-capturing groups keep the group numbers tidy when a pattern uses grouping for structure but doesn’t care about pulling each piece out.
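
As a small illustration of the numbering effect, the sketch below runs two variants of the same pattern over a made-up input. The only difference is whether the repeated ab group captures; the variable names are hypothetical.

```python
import re

text = 'xabab42'  # hypothetical input

# Capturing: the repeated group takes number 1, so the digits are group 2.
m1 = re.search(r'(ab)+(\d+)', text)
digits_capturing = m1.group(2)

# Non-capturing: the repeated group gets no number, so the digits are group 1.
m2 = re.search(r'(?:ab)+(\d+)', text)
digits_non_capturing = m2.group(1)
```

Both variables end up holding the same digits; only the group index needed to reach them changes.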

2.36.3. Anchors

Anchors don’t match a character – they match a position.

^ – start of the string.
$ – end of the string.

Anchors explain the difference between re.match() and re.search(). re.match(p, s) behaves like re.search('^' + p, s): it forces the pattern to start at position 0. (If p contains a top-level |, wrap it in a group first, because ^ would otherwise apply only to the first alternative.) Adding $ to the end of a pattern then makes the pattern match the entire string and nothing else:

>>> re.search(r'^\d+$', '12345')
<match num=1>
>>> re.search(r'^\d+$', '12345 ok') is None
True
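
The two kinds of anchoring can be wrapped in small predicates. This is a sketch only; the function names are hypothetical and not part of the re module.

```python
import re

def is_all_digits(s):
    # ^ and $ together force the digits to span the whole string.
    return re.search(r'^\d+$', s) is not None

def starts_with_digits(s):
    # re.match() anchors only the start; trailing text is allowed.
    return re.match(r'\d+', s) is not None
```

With these, '12345 ok' fails is_all_digits() but passes starts_with_digits(), while 'ok 12345' fails both.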

^ and $ in MicroPython re always mean the start and end of the whole string passed to re.search(). There is no re.MULTILINE flag to make them match at every embedded newline, and $ does not match the position before a trailing \n either – it has to be the absolute end of the input. To get per-line behaviour, split the input on newlines first and run the pattern on each line.
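
The split-then-match approach can be sketched as follows. The function name and the sample log are made up for illustration, and the sketch assumes lines are separated by a bare '\n'; input with '\r\n' endings would need the '\r' stripped first.

```python
import re

def parse_log(text):
    """Return a list of (temp, seconds) tuples, one per matching line."""
    readings = []
    for line in text.split('\n'):
        # Each line is its own string, so ^ and $ anchor to that line.
        m = re.search(r'^temp (\d+) at (\d+)s$', line)
        if m:
            readings.append((int(m.group(1)), int(m.group(2))))
    return readings

# Hypothetical multi-line input with one non-matching line.
log = 'temp 42 at 137s\nboot ok\ntemp 43 at 138s\n'
readings = parse_log(log)
```

Lines that do not fit the pattern, including the empty string left after the final newline, are skipped.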

2.36.4. Character sets

Square brackets define an explicit set of characters. The match consumes exactly one character from the set.

[abc] – one of a, b, c.
[a-z] – one character in the range a-z (inclusive).
[a-zA-Z0-9] – letters or digits. Three ranges combined.
[^abc] – not one of a, b, c. The ^ only negates when it’s the first character inside the brackets.

Examples:

>>> re.search(r'[A-F0-9]{6}', 'colour #1A2B3C rest').group(0)
'1A2B3C'
>>> re.search(r'[A-F0-9]{6}', 'colour #1a2b3c rest') is None
True

The second call returns None because the literal text is lowercase and the set lists only uppercase letters. MicroPython re has no re.IGNORECASE flag – to match case-insensitively, write both cases into the set:

>>> re.search(r'[A-Fa-f0-9]{6}', 'colour #1a2b3c rest').group(0)
'1a2b3c'

The class shortcuts (\d, \s, \w, and their negated forms) can be used inside [...] too: [\w-] is “word characters or a literal dash.”
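
A brief sketch of a shortcut inside a set alongside a negated set, assuming the shortcuts behave inside brackets as described above. The input line and field names are made up.

```python
import re

line = 'name=sensor-01_a; mode=fast'  # hypothetical config line

# [\w-]+ : word characters or a literal dash, one or more times.
name = re.search(r'name=([\w-]+)', line).group(1)

# [^;]+ : everything up to, but not including, the next semicolon.
first_field = re.search(r'[^;]+', line).group(0)
```

The dash is placed last in [\w-] so it reads as a literal character rather than as part of a range.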

2.36.5. Greedy vs lazy quantifiers

The quantifiers *, +, ?, and {m,n} are greedy by default – they match as many characters as the rest of the pattern will still allow. Often that’s exactly what’s wanted; sometimes it isn’t:

>>> re.search(r'<(.+)>', 'a <b> <c> d').group(1)
'b> <c'

The greedy .+ grabbed all the way to the last >. Appending ? makes the quantifier lazy – it matches as little as possible:

>>> re.search(r'<(.+?)>', 'a <b> <c> d').group(1)
'b'

The lazy form stops at the first >. Lazy quantifiers come up constantly when extracting text between paired delimiters. They do not handle nesting: the lazy form always stops at the nearest closing delimiter.
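
A negated character set is a common alternative to a lazy quantifier for this job. The sketch below is an illustration on the same sample text, not a recommendation from the library.

```python
import re

text = 'a <b> <c> d'

# Lazy: take as little as possible before the next ">".
lazy = re.search(r'<(.+?)>', text).group(1)

# Negated set: [^>]+ cannot run past a ">", so greedy matching is safe here.
negated = re.search(r'<([^>]+)>', text).group(1)
```

Both forms capture the same text here. The negated set states the stopping condition in the pattern itself, which some readers find easier to audit.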

2.36.6. Backreferences in substitution

re.sub() can refer back to captured groups in the replacement string via \1, \2, … The substitution rewrites every match using the captured pieces:

>>> re.sub(r'(\d+)\.(\d+)', r'\2.\1', 'swap 12.34 and 5.6')
'swap 34.12 and 6.5'

Each match captures two numbers, and the replacement swaps them. \g<1> is an alternative syntax for the same thing – useful when the next character in the replacement is a digit (r'\g<1>0' to append a literal zero to group 1 rather than reading “group 10”).
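
Two more substitutions, sketched with made-up input strings: one reorders three captured fields, and the other uses the \g<1> form because a digit follows the group reference.

```python
import re

# Reorder a date from year-month-day to day/month/year.
reordered = re.sub(r'(\d+)-(\d+)-(\d+)', r'\3/\2/\1', 'logged 2024-01-31')

# \g<1>0 means "group 1, then a literal 0", so each number gains a zero.
scaled = re.sub(r'(\d+)', r'\g<1>0', 'x 12 y 3')
```

The replacement strings are raw strings so that the backslashes reach re.sub() unchanged.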

2.36.7. What’s not available

A reminder of what MicroPython re does not support, in case a pattern written for CPython lands here and surprises you:

Lookahead (?=...) and lookbehind (?<=...) – not implemented.
Named groups (?P<name>...) and named backreferences (?P=name) – not implemented.
Flag constants like re.IGNORECASE, re.MULTILINE, re.DOTALL – not honored. Build the case-insensitive set or pre-split the input yourself.
The match.groups(), match.span(), match.start(), and match.end() methods are gated on a ROM level that no shipped OpenMV board enables. Code that relies on them won’t run on the cam.
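
Where code would otherwise reach for match.groups() or named groups, match.group() with explicit indices can stand in. The helpers below are hypothetical and only a sketch: the caller has to supply the group count or the names, since the match object is not asked for them.

```python
import re

def groups_tuple(m, count):
    # Stand-in for m.groups(): collect groups 1..count one at a time.
    return tuple(m.group(i) for i in range(1, count + 1))

def groups_dict(m, names):
    # Stand-in for named groups: names[0] labels group 1, and so on.
    return {name: m.group(i + 1) for i, name in enumerate(names)}

m = re.search(r'temp (\d+) at (\d+)s', 'temp 42 at 137s ok')
as_tuple = groups_tuple(m, 2)
as_dict = groups_dict(m, ('temp', 'seconds'))
```

Keeping the names next to the pattern makes it easier to update both together when a group is added or removed.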

With patterns, groups, and anchors, the regex toolset on the cam is small enough to learn in one sitting and rich enough for most flat, line-oriented text. Nested or recursive structures still need a real parser.