20. Inference with Fine-Tuned Models

Chapter 20 of 24 · 20 min

KEY INSIGHT

Deploying fine-tuned models requires different optimization priorities than training. Memory footprint, latency, and throughput dominate considerations. The base model architecture and quantization level fundamentally determine inference characteristics.

Loading patterns matter. Models can be loaded fully into memory, memory-mapped for zero-copy access, or streamed in chunks. Each approach trades startup time against operational memory.

```python
from llama_cpp import Llama
from transformers import AutoTokenizer


class FineTunedInferenceEngine:
    def __init__(self, model_path, tokenizer_path=None, n_ctx=2048, n_gpu_layers=0):
        self.model_path = model_path
        self.n_ctx = n_ctx

        # Load model
        self.model = Llama(
            model_path=model_path,
            n_ctx=n_ctx,
            n_gpu_layers=n_gpu_layers,  # 0 = CPU only
            use_mmap=True,              # Memory mapping for large models
            use_mlock=False,            # Don't lock in RAM
        )

        # Load tokenizer (optional; stays None when no path is given)
        self.tokenizer = None
        if tokenizer_path:
            self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)

    def generate(self, prompt, max_tokens=256, temperature=0.7, top_p=0.9):
        """Generate text with standard sampling parameters."""
        return self.model(
            prompt,
            max_tokens=max_tokens,
            temperature=temperature,
            top_p=top_p,
            repeat_penalty=1.1,
            stop=["</s>", "User:", "\n\n\n"],
        )
```

**Batching strategies** strongly affect throughput:

```python
def batch_generate(engine, prompts, max_tokens=128):
    """Generate for multiple prompts in a batch."""
    if hasattr(engine.model, "create_batch"):
        # Native batching, if the backend exposes it
        # (`create_batch` is a placeholder for a backend-specific batch API)
        return engine.model.create_batch(prompts, max_tokens=max_tokens)
    # Sequential fallback: one prompt at a time
    return [engine.generate(prompt, max_tokens=max_tokens) for prompt in prompts]
```

**Failure mode**: Prompt injection in production. Fine-tuned models may be more susceptible to injected instructions that bypass safety measures. Always implement input validation and output filtering.

```python
class SafeInferenceEngine(FineTunedInferenceEngine):
    def generate(self, prompt, **kwargs):
        # Validate input: rough character budget, assuming about 4 characters per token
        if len(prompt) > self.n_ctx * 4:
            raise ValueError("Input too long")
        if self.contains_dangerous_patterns(prompt):
            return {"error": "Unsafe prompt detected"}
        return super().generate(prompt, **kwargs)

    @staticmethod
    def contains_dangerous_patterns(prompt):
        # Patterns are lowercase because the prompt is lowercased before matching
        dangerous = ["<|system|>", "[inst]", "{{system"]
        lowered = prompt.lower()
        return any(p in lowered for p in dangerous)
```
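The failure-mode note calls for output filtering as well as input validation, while the class above only checks the prompt. A minimal sketch of the output side follows, assuming the completion is a dictionary shaped like `{"choices": [{"text": ...}]}`. `filter_output` and `BLOCKED_OUTPUT_PATTERNS` are hypothetical names, and the pattern list is illustrative rather than a vetted blocklist.

```python
import re

# Illustrative patterns only (assumption): control markers that should never
# appear in text returned to a user.
BLOCKED_OUTPUT_PATTERNS = [
    re.compile(r"<\|system\|>", re.IGNORECASE),
    re.compile(r"\[/?INST\]", re.IGNORECASE),
]


def filter_output(completion, replacement="[filtered]"):
    """Return a copy of a completion dict with blocked markers masked.

    Assumes the shape {"choices": [{"text": ...}, ...]}; error dicts and
    anything without "choices" are passed through unchanged.
    """
    if "choices" not in completion:
        return completion
    cleaned_choices = []
    for choice in completion["choices"]:
        text = choice.get("text", "")
        for pattern in BLOCKED_OUTPUT_PATTERNS:
            text = pattern.sub(replacement, text)
        cleaned_choices.append({**choice, "text": text})
    return {**completion, "choices": cleaned_choices}
```

Wrapping the return value of `generate` in `filter_output` keeps the input and output checks separate, so either list of patterns can change without touching the other.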

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
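One way to capture those details is a small script that prints them as JSON, which can be pasted beside the exercise result. This is a sketch using only the standard library. `record_environment` is a hypothetical helper, and the default package names are an assumption based on the imports used in this chapter; a package that is not installed is recorded as `null`.

```python
import json
import platform
import sys
from importlib import metadata


def record_environment(packages=("llama-cpp-python", "transformers")):
    """Collect runtime details worth noting beside an exercise result."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
        "packages": versions,
    }


if __name__ == "__main__":
    print(json.dumps(record_environment(), indent=2))
```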

EXERCISE: Implement Streaming Generation

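def stream_generate(engine, prompt, max_tokens=256):
    """Yield text fragments as they are generated."""
    # Calling the Llama object with stream=True returns an iterator of
    # completion chunks; each chunk carries the newly generated text.
    for chunk in engine.model(prompt, max_tokens=max_tokens, stream=True):
        yield chunk["choices"][0]["text"]

# Usage (assumes `engine` is a FineTunedInferenceEngine instance)
for token in stream_generate(engine, "Write a story"):
    print(token, end="", flush=True)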
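
Streaming also makes latency measurable, since time to first output and overall rate fall out of the same loop. The following is a rough sketch that accepts any iterable of text fragments, such as the generator from the exercise. `timed_stream` is a hypothetical helper, and it counts yielded fragments, which only approximates a token count.

```python
import time


def timed_stream(fragments):
    """Consume an iterable of text fragments and report simple latency stats."""
    start = time.perf_counter()
    first = None
    pieces = []
    for piece in fragments:
        if first is None:
            first = time.perf_counter() - start  # time to first fragment
        pieces.append(piece)
    total = time.perf_counter() - start
    return {
        "text": "".join(pieces),
        "fragments": len(pieces),
        "time_to_first_s": first,
        "total_s": total,
        "fragments_per_s": len(pieces) / total if total > 0 else 0.0,
    }
```

For example, `timed_stream(stream_generate(engine, "Write a story"))` returns the full text together with the timings, which can be recorded alongside the backend and model size used for the run.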