22. Regular Expressions

Chapter 22 of 36 · 15 min

Regular expressions (regex) are patterns for matching, extracting, and transforming text. In AI work, you'll use them to parse logs, extract structured data from messy text, and validate inputs before they reach your models.

Python's re module is your interface:

import re

# Find the first match anywhere in the string
text = "Result: accuracy=0.9234, loss=0.123"
accuracy = re.search(r'accuracy=([0-9.]+)', text)
if accuracy:  # re.search returns None when nothing matches
    print(accuracy.group(1))  # '0.9234'

# Find all matches
logs = """
2024-01-15 ERROR: Failed to process doc_123.pdf
2024-01-16 INFO: Success for batch 42
2024-01-17 ERROR: Timeout on api-v2.doc
"""
errors = re.findall(r'ERROR: (.+)', logs)
print(errors)  # ['Failed to process doc_123.pdf', 'Timeout on api-v2.doc']

# Replace: \1 and \2 reuse the captured name and number
cleaned = re.sub(r'(\w+)_(\d+)\.pdf', r'\1-\2.txt', "doc_123.pdf")
print(cleaned)  # 'doc-123.txt'
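
The calls above cover searching, collecting, and replacing. Validation, the remaining use mentioned in the introduction, usually means requiring the whole string to match rather than any part of it. The following is a minimal sketch using re.fullmatch; the model-ID format and the helper name is_valid_model_id are hypothetical, chosen only for illustration.

```python
import re

# Hypothetical format: lowercase name, a hyphen, then "v" and a version number.
MODEL_ID = re.compile(r'[a-z]+-v[0-9]+')

def is_valid_model_id(value: str) -> bool:
    """Return True only if the entire string fits the pattern."""
    return MODEL_ID.fullmatch(value) is not None

ok = is_valid_model_id("bert-v2")
rejected = is_valid_model_id("bert-v2; rm -rf /")
# re.search accepts a match anywhere, so it would let the bad input through.
partial = MODEL_ID.search("bert-v2; rm -rf /") is not None

print(ok)        # True
print(rejected)  # False
print(partial)   # True
```

The design point is the choice of function: re.search succeeds on a partial match, so it is a poor fit for validation unless the pattern is anchored.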

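The r'' raw-string prefix matters for regex patterns: it stops Python from processing backslash escapes in the string literal, so sequences such as \b and \1 reach the regex engine unchanged. Capture groups (parentheses) extract specific parts of a match; in a replacement string, \1 and \2 refer to the first and second groups.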
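
To see why the raw-string prefix matters, it can help to compare the same pattern with and without it. This is a small sketch; the sample strings are made up for illustration.

```python
import re

# Without the r prefix, Python turns \b into a single backspace character.
print(len('\b'), len(r'\b'))  # 1 2

text = "cat concatenate cat"

# Raw pattern: \b is a regex word boundary, so only whole-word "cat" matches.
whole_words = re.findall(r'\bcat\b', text)

# Non-raw pattern: the regex engine receives literal backspace characters,
# which do not occur in the text, so nothing matches.
no_matches = re.findall('\bcat\b', text)

print(whole_words)  # ['cat', 'cat']
print(no_matches)   # []

# Group references in a replacement: \2 and \1 swap the two captured parts.
swapped = re.sub(r'(\w+)_(\d+)', r'\2_\1', "doc_123")
print(swapped)  # 123_doc
```

The same applies to replacement strings: without the r prefix, '\1' is the control character chr(1) rather than a reference to the first group.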

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.

EXERCISE

Given this model log output:

MODEL Training Results
Epoch 1: loss=2.345, acc=0.123
Epoch 2: loss=1.876, acc=0.456
Epoch 3: loss=1.234, acc=0.678

Write a Python script that uses regex to extract the epoch, loss, and accuracy from each line. Print the results as a list of dicts, one per epoch.