Groups and anchors
==================

A pattern can do more than say "this string matches" -- it can pull the
matched pieces apart and hand each one to the application by index.
Parentheses around part of a pattern make it a *capturing group*; the match
object then exposes each group as a separate substring.

Capturing groups
----------------

Wrap any part of a pattern in ``(...)`` to capture what it matched::

    >>> import re
    >>> m = re.search(r'temp (\d+) at (\d+)s', 'temp 42 at 137s ok')
    >>> m.group(0)
    'temp 42 at 137s'
    >>> m.group(1)
    '42'
    >>> m.group(2)
    '137'

* Group 0 is always the entire match.
* Groups 1, 2, ... are the captured substrings, numbered left to right by
  their *opening* parenthesis.
* Calling :meth:`match.group` with an index past the last group raises
  ``IndexError``.

A common pattern is "match a known structure, capture the variable parts as
ints"::

    import re

    def parse_temp(line):
        m = re.search(r'temp (\d+) at (\d+)s', line)
        if not m:
            # No reading found on this line.
            return None
        # Group 1 is the temperature, group 2 the timestamp in seconds.
        return int(m.group(1)), int(m.group(2))

Non-capturing groups
--------------------

Parentheses also *group* a sub-expression so a quantifier can apply to the
whole group. That's the only purpose of grouping in ``r'(ab)+'`` -- "one or
more of ``ab``". The fact that ``ab`` shows up as group 1 is a side effect.

To group without capturing, use ``(?:...)``::

    >>> re.search(r'(?:ab)+', 'xababy').group(0)
    'abab'

Non-capturing groups keep the group numbers tidy when a pattern uses
grouping for structure but doesn't need to pull each piece out.

Anchors
-------

Anchors don't match a character -- they match a *position*.

* ``^`` -- start of the string.
* ``$`` -- end of the string.

Anchors are what make :func:`re.match` and :func:`re.search` behave
differently. ``re.match(p, s)`` behaves like ``re.search('^' + p, s)``: it
forces the pattern to start at position 0.
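
The difference between the two calls can be checked directly. The following is a minimal sketch that assumes only :func:`re.match`, :func:`re.search`, and :meth:`match.group` as described above; the helper names are hypothetical and exist only for illustration.

```python
import re

def leading_number(s):
    # re.match only tries the pattern at position 0.
    m = re.match(r'(\d+)', s)
    return int(m.group(1)) if m else None

def first_number(s):
    # re.search scans forward until the pattern matches somewhere.
    m = re.search(r'(\d+)', s)
    return int(m.group(1)) if m else None

def leading_number_anchored(s):
    # Same behaviour as leading_number, spelled with search and an explicit ^.
    m = re.search(r'^(\d+)', s)
    return int(m.group(1)) if m else None
```

For the input ``'temp 42'``, ``first_number`` finds the digits while both anchored variants return ``None``, because the string does not begin with a digit.
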
Adding ``$`` to the end of a pattern then makes the pattern match the
*entire* string and nothing else::

    >>> re.search(r'^\d+$', '12345').group(0)
    '12345'
    >>> re.search(r'^\d+$', '12345 ok') is None
    True

``^`` and ``$`` in MicroPython :mod:`re` always mean the start and end of
the *whole* string passed to :func:`re.search`. There is no
``re.MULTILINE`` flag to make them match at every embedded newline, and
``$`` does not match the position before a trailing ``\n`` either -- it has
to be the absolute end of the input. To get per-line behaviour, split the
input on newlines first and run the pattern on each line.

Character sets
--------------

Square brackets define an explicit set of characters. The match consumes
exactly one character from the set.

* ``[abc]`` -- one of ``a``, ``b``, ``c``.
* ``[a-z]`` -- one character in the range ``a``-``z`` (inclusive).
* ``[a-zA-Z0-9]`` -- letters or digits. Three ranges combined.
* ``[^abc]`` -- *not* one of ``a``, ``b``, ``c``. The ``^`` only negates
  when it's the first character inside the brackets.

Examples::

    >>> re.search(r'[A-F0-9]{6}', 'colour #1A2B3C rest').group(0)
    '1A2B3C'
    >>> re.search(r'[A-F0-9]{6}', 'colour #1a2b3c rest') is None
    True

The second call returns :data:`None` because the literal text is lowercase
and the set lists only uppercase letters. MicroPython :mod:`re` has no
``re.IGNORECASE`` flag -- to match case-insensitively, write both cases
into the set::

    >>> re.search(r'[A-Fa-f0-9]{6}', 'colour #1a2b3c rest').group(0)
    '1a2b3c'

The class shortcuts (``\d``, ``\s``, ``\w``, and their negated forms) can be
used inside ``[...]`` too: ``[\w-]`` is "word characters or a literal
dash."

Greedy vs lazy quantifiers
--------------------------

The quantifiers ``*``, ``+``, ``?``, and ``{m,n}`` are *greedy* by default
-- they match as many characters as the rest of the pattern will still
allow. Often that's exactly what's wanted; sometimes it isn't::

    >>> re.search(r'<(.+)>', 'a <b> <c> d').group(1)
    'b> <c'

The greedy ``.+`` runs on to the last ``>`` in the string, not the first.
Appending ``?`` makes the quantifier *lazy* -- it matches as little as
possible::

    >>> re.search(r'<(.+?)>', 'a <b> <c> d').group(1)
    'b'

The lazy form stops at the first ``>``. Lazy quantifiers come up constantly
when extracting the text between a pair of delimiters.

Backreferences in substitution
------------------------------

:func:`re.sub` can refer back to captured groups in the replacement string
via ``\1``, ``\2``, ... The substitution rewrites every match using the
captured pieces::

    >>> re.sub(r'(\d+)\.(\d+)', r'\2.\1', 'swap 12.34 and 5.6')
    'swap 34.12 and 6.5'

Each match captures two numbers, and the replacement swaps them.

``\g<1>`` is an alternative syntax for the same thing -- useful when the
next character in the replacement is a digit (``r'\g<1>0'`` appends a
literal zero to group 1 rather than reading "group 10").

What's not available
--------------------

A reminder of what the MicroPython :mod:`re` does *not* support, in case a
pattern from CPython lands here and surprises you:

* Lookahead ``(?=...)`` and lookbehind ``(?<=...)`` -- not implemented.
* Named groups ``(?P<name>...)`` and named backreferences ``(?P=name)`` --
  not implemented.
* Flag constants like ``re.IGNORECASE``, ``re.MULTILINE``, ``re.DOTALL`` --
  not honored. Build the case-insensitive set or pre-split the input
  yourself.
* The :meth:`match.groups`, :meth:`match.span`, :meth:`match.start`, and
  :meth:`match.end` methods are gated on a ROM level that no shipped OpenMV
  board enables. Code that relies on them won't run on the cam.

With patterns, groups, and anchors, the regex toolset on the cam is small
enough to learn in one sitting and rich enough to do everything short of
context-sensitive parsing.
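
As a closing illustration, the two workarounds named in the list above (pre-splitting the input in place of ``re.MULTILINE``, and writing both cases into a set in place of ``re.IGNORECASE``) might look like the sketch below. The function names and the sample log format are assumptions made for this example, not part of any OpenMV API, and the code uses only :func:`re.search` and :meth:`match.group`.

```python
import re

# Pattern anchored at both ends, so a line must match in full.
READING = r'^temp (\d+) at (\d+)s$'

def parse_readings(text):
    """Return (temp, seconds) tuples for each line that matches in full."""
    readings = []
    # Split first: ^ and $ then apply to each line on its own.
    for line in text.split('\n'):
        m = re.search(READING, line)
        if m:
            readings.append((int(m.group(1)), int(m.group(2))))
    return readings

def find_hex_colour(text):
    """Return the hex digits after '#', accepting either letter case."""
    # Both cases are written into the set instead of using a flag.
    m = re.search(r'#([A-Fa-f0-9]+)', text)
    return m.group(1) if m else None
```

Because each line is searched separately, a line such as ``'temp 42 at 137s ok'`` is rejected by the trailing ``$`` instead of matching partway through a larger block of text.
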