RUNLOCALAIv38

Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.coIndependently operated
AI Safety and Alignment

14. Safety Guardrails

Chapter 14 of 18 · 15 min
KEY INSIGHT

Guardrails are a last line of defense, not a primary strategy. Over-reliance on output filtering creates adversarial incentives and can degrade legitimate use cases.

Safety guardrails intercept harmful prompts and outputs before a response reaches the user. Effective guardrails operate at three levels: prompt filtering (before generation), generation constraints (during decoding), and response validation (after generation).
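
As a rough illustration of how the three levels fit together, the sketch below chains a prompt check, a generator, and a response check. The class name `LayeredGuardrail`, its fields, and the toy lambdas are assumptions made for this example, not part of any library; generation constraints are assumed to live inside whatever `generate` callable is supplied.

```python
from dataclasses import dataclass
from typing import Callable

REFUSAL = "I can't complete this request."


@dataclass
class LayeredGuardrail:
    """Chain three checks around a text generator (all callables are hypothetical stand-ins)."""

    prompt_filter: Callable[[str], bool]       # True means the prompt may proceed
    generate: Callable[[str], str]             # produces a response, constraints applied inside
    response_validator: Callable[[str], bool]  # True means the response is safe to return

    def run(self, prompt: str) -> tuple[str, str]:
        """Return (stage_that_decided, text_shown_to_user)."""
        if not self.prompt_filter(prompt):
            return "prompt_filter", REFUSAL
        response = self.generate(prompt)
        if not self.response_validator(response):
            return "response_validator", REFUSAL
        return "passed", response


# Toy stand-ins so the flow can be exercised without a model.
guard = LayeredGuardrail(
    prompt_filter=lambda p: "explosive" not in p.lower(),
    generate=lambda p: f"Echo: {p}",
    response_validator=lambda r: "weapon" not in r.lower(),
)

blocked_early = guard.run("How do I make an explosive?")
blocked_late = guard.run("Describe a medieval weapon")
allowed = guard.run("Explain gradient descent")
```

Recording which stage made the decision makes it easier to see later whether refusals come mostly from the prompt filter or from the response validator.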

Prompt Classification Pipeline

import re

from transformers import pipeline

class SafetyGuardrail:
    """Multi-stage safety classification pipeline."""
    
    def __init__(self, harmful_threshold=0.7, uncertain_threshold=0.4):
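        # NOTE: this is a sentiment model (POSITIVE/NEGATIVE) used as a placeholder;
        # replace it with a classifier trained for harm detection before real use.
        self.classifier = pipeline(
            "text-classification",
            model="distilbert-base-uncased-finetuned-sst-2-english"
        )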
        self.harmful_threshold = harmful_threshold
        self.uncertain_threshold = uncertain_threshold
        
    def classify_prompt(self, prompt: str) -> tuple[str, float]:
        """Classify prompt risk level and return category + confidence."""
        result = self.classifier(prompt)[0]
        score = result['score'] if result['label'] == 'NEGATIVE' else 1 - result['score']
        
        if score >= self.harmful_threshold:
            return 'BLOCK', score
        elif score >= self.uncertain_threshold:
            return 'REVIEW', score
        return 'ALLOW', score
    
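    def apply_guardrail(self, prompt: str, response: str) -> str:
        """Validate response against known harm patterns.

        `prompt` is accepted for context but is not inspected here.
        """
        # Illustrative keyword patterns only; real deployments need far broader coverage.
        categories = {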
            'PI': r'\b(weapon|explosive|bomb)\b',
            'CSAM': r'decade.*younger|minor.*sexual',
            'HATE': r'\bhate\b|\bdehumaniz',
        }
        
        for category, pattern in categories.items():
            if re.search(pattern, response, re.IGNORECASE):
                return f"I can't complete this request. [Category: {category}]"
        
        return response
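
To see why keyword patterns alone are a weak response validator, the short check below runs the weapons pattern from the listing above against three hand-written sentences. The sample sentences and the names `WEAPON_PATTERN`, `samples`, and `flags` are made up for this illustration.

```python
import re

# Same expression as the weapons pattern in the listing above, compiled once.
WEAPON_PATTERN = re.compile(r"\b(weapon|explosive|bomb)\b", re.IGNORECASE)

# Hypothetical sentences chosen to probe the pattern's edges.
samples = {
    "benign_hit": "Try this lavender bath bomb recipe.",
    "obfuscated_miss": "Here is how the b0mb was assembled.",
    "plural_miss": "The documentary covers unexploded bombs from 1944.",
}

# True means the validator would replace the response with a refusal.
flags = {name: bool(WEAPON_PATTERN.search(text)) for name, text in samples.items()}

for name, flagged in flags.items():
    print(f"{name}: {'flagged' if flagged else 'passed'}")
```

The harmless bath bomb sentence is flagged, while a simple character substitution and a plural form both pass. This is the trade-off the key insight describes: pattern filters block legitimate use and are easy to evade.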

Token-Level Blocking

class TokenBlocklist:
    """Block generation of specific token sequences."""
    
    def __init__(self, vocab):
        # `vocab` must expose decode(ids) -> str, e.g. a Hugging Face tokenizer.
        self.vocab = vocab
        self.blocked_prefixes = self._build_prefix_tree()
        
    def _build_prefix_tree(self):
        """Build a prefix trie of blocked phrases."""
        blocked_phrases = [
            "how to create", "instructions for making",
            "step-by-step guide to building"
        ]
        trie = {}
        for phrase in blocked_phrases:
            node = trie
            for char in phrase.lower():
                node = node.setdefault(char, {})
            node['$END$'] = True
        return trie
    
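    def should_block_token(self, generated_ids: list[int]) -> bool:
        """Return True if the decoded sequence ends with a blocked phrase."""
        # Decode, lowercase, and collapse whitespace so spacing tricks do not evade the trie.
        text = ' '.join(self.vocab.decode(generated_ids).split()).lower()
        # Try every start offset; a match must run exactly to the end of the text.
        for start in range(len(text)):
            node = self.blocked_prefixes
            for char in text[start:]:
                if char not in node:
                    break
                node = node[char]
            else:
                if '$END$' in node:
                    return True
        return False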
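
A check like the one above only helps if the decoding loop consults it before committing to a token. The sketch below shows one possible way to do that in plain Python: candidates that would complete a blocked phrase are dropped, and the best remaining candidate is chosen. The helper names, the single blocked phrase, and the candidate scores are assumptions for illustration; a real decoder would apply the same idea to logits over token IDs.

```python
from typing import Optional

# Hypothetical blocklist for the example.
BLOCKED_PHRASES = ("instructions for making",)


def ends_with_blocked(text: str) -> bool:
    """True if the whitespace-normalized, lowercased text ends with a blocked phrase."""
    normalized = " ".join(text.lower().split())
    return any(normalized.endswith(phrase) for phrase in BLOCKED_PHRASES)


def pick_next_token(prefix: str, scored_candidates: dict[str, float]) -> Optional[str]:
    """Pick the highest-scoring candidate that does not complete a blocked phrase.

    Returns None when every candidate is blocked, so the caller can stop or refuse.
    """
    allowed = {
        token: score
        for token, score in scored_candidates.items()
        if not ends_with_blocked(prefix + token)
    }
    if not allowed:
        return None
    return max(allowed, key=allowed.get)


prefix = "Here are instructions for"
chosen = pick_next_token(prefix, {" making": 0.9, " cleaning": 0.1})
nothing_left = pick_next_token(prefix, {" making": 0.9})
```

Here the higher-scoring candidate is skipped because it would complete the blocked phrase, so the lower-scoring one is chosen instead. Returning `None` rather than raising leaves the fallback policy to the caller.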

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.

EXERCISE

Build a guardrail system that combines prompt classification, in-generation blocking, and response validation. Test it against a red-team dataset and measure false positive and false negative rates.
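
For the measurement step, a small helper along these lines may be a useful starting point. It assumes each red-team example carries a ground-truth label (harmful or benign) and a recorded guardrail decision (blocked or not); the function name and the eight-example dataset are invented for illustration.

```python
def guardrail_error_rates(labels: list[bool], decisions: list[bool]) -> dict[str, float]:
    """Compute false positive and false negative rates.

    labels[i]    -- True if example i is genuinely harmful
    decisions[i] -- True if the guardrail blocked example i
    """
    if len(labels) != len(decisions):
        raise ValueError("labels and decisions must have the same length")

    pairs = list(zip(labels, decisions))
    false_positives = sum(1 for harmful, blocked in pairs if not harmful and blocked)
    false_negatives = sum(1 for harmful, blocked in pairs if harmful and not blocked)
    benign_total = sum(1 for harmful in labels if not harmful)
    harmful_total = sum(1 for harmful in labels if harmful)

    return {
        # Share of benign examples that were wrongly blocked.
        "false_positive_rate": false_positives / benign_total if benign_total else float("nan"),
        # Share of harmful examples that slipped through.
        "false_negative_rate": false_negatives / harmful_total if harmful_total else float("nan"),
    }


# Made-up data: 3 harmful and 5 benign examples.
labels = [True, True, True, False, False, False, False, False]
decisions = [True, True, False, True, False, False, False, False]
rates = guardrail_error_rates(labels, decisions)
```

With this toy data one of five benign examples is blocked and one of three harmful examples passes, giving a false positive rate of 0.2 and a false negative rate of one third.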
