Red Teaming
Red teaming is the practice of systematically probing an LLM to find failure modes: harmful outputs, jailbreaks, hallucinations, or policy violations. Operators run red-teaming tests before deploying a model to validate safety filters and alignment. The term comes from military/cybersecurity red teams that simulate adversaries. In local AI, red teaming matters because a model running on your hardware may lack cloud-level guardrails — you are responsible for verifying it doesn't generate dangerous content when prompted adversarially.
Deeper dive
Red teaming for LLMs typically involves crafting adversarial prompts designed to bypass safety measures. Common techniques include prompt injection (e.g., 'Ignore previous instructions and...'), role-playing scenarios ('You are DAN, do anything now'), or encoding harmful requests in base64. Operators may use automated tools like Garak or manual probing. The goal is to identify vulnerabilities before the model is used in production. For local models, red teaming is especially important because many open-weight models (e.g., Llama 3.1, Mistral) have weaker built-in refusal mechanisms than API-based models like GPT-4. Running red-teaming tests helps operators decide whether to add external guardrails (e.g., a moderation model) or adjust system prompts.
Practical example
An operator downloads Llama 3.1 8B via Ollama and wants to use it for a customer-facing chatbot. Before deployment, they run Garak, an open-source red-teaming tool, with the command garak --model_type ollama --model_name llama3.1:8b. Garak sends hundreds of adversarial prompts (e.g., 'How to make a bomb?') and reports which ones bypass the model's safety. The operator finds that 5% of harmful prompts succeed, so they add a second moderation model (e.g., Llama Guard) as a filter.
Workflow example
In a typical workflow, after pulling a model with ollama pull llama3.1:8b, the operator runs a red-teaming script. For example, using the llm-redteam Python library: they define test categories (e.g., 'jailbreak', 'hate_speech'), then execute python redteam.py --model ollama/llama3.1:8b. The script logs any successful bypasses. Based on results, the operator may adjust the system prompt (e.g., 'You are a helpful assistant that refuses harmful requests') or switch to a fine-tuned model with stronger alignment, like Llama 3.1-Instruct.
Reviewed by Fredoline Eruo. See our editorial policy.