Threat Taxonomy — AI Safety and Alignment (Chapter 2)

Effective defense requires understanding what you're defending against. This chapter establishes a taxonomy of threats to local AI systems, enabling systematic security assessment.

Threat Categories

AI threats fall into four primary categories, each with distinct characteristics and mitigations.

Model Extraction occurs when attackers query a model to reconstruct its functionality or approximate its weights. Local models face this risk through repeated API calls that map input-output relationships. The attacker builds a substitute model or gains capabilities exceeding intended access levels.

Prompt Injection embeds malicious instructions within inputs that the model executes, overriding its system prompt or intended behavior. This threat is particularly relevant for local deployments processing external inputs—emails, documents, user messages.

Jailbreaking explicitly attempts to bypass safety measures, forcing models to produce outputs their designers restricted. Local models may lack dependable safety layers, creating vulnerabilities.

Data Poisoning compromises training data to influence model behavior. For local deployments, this applies when fine-tuning on potentially adversarial inputs or processing user-generated content that enters training loops.

Attack Surface Analysis

Each local AI deployment has an attack surface determined by its access patterns. Consider a document analysis system:

# Sample attack surface for document-processing AI
# Input: User-uploaded documents
# Processing: Local model inference
# Output: Summaries, extractions, analysis

class DocumentProcessor:
    def __init__(self, model_path):
        self.model = load_model(model_path)
        self.system_prompt = "Analyze documents professionally."
    
    def process(self, uploaded_file):
        # Path 1: Direct file content injection
        content = uploaded_file.read()
        
        # Path 2: Filename metadata embedding
        filename = uploaded_file.filename
        
        # Path 3: Metadata headers
        metadata = uploaded_file.metadata
        
        # All three paths feed into model
        prompt = f"{self.system_prompt}\n\nContent: {content}"
        return self.model.generate(prompt)

Attackers exploit multiple input vectors simultaneously. A poisoned filename might contain instructions; extracted metadata might bypass filters.

Severity and Likelihood Framework

Not all threats require equal investment. A severity-likelihood matrix guides prioritization:

Threat	Severity	Likelihood	Priority
Model theft via API abuse	High	Medium	Critical
Malicious document injection	High	Low	High
Prompt injection via web scraper	Medium	High	High
Accidental PII leakage	Medium	Medium	Medium

Understanding threat taxonomy enables operators to allocate defensive resources appropriately.