20. Inference with Fine-Tuned Models

Chapter 20 of 24 · 20 min

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
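
One way to capture those details is a small script that writes them out as JSON. This is only a sketch using the Python standard library. The package names, data path, and notes are placeholders, and `build_run_record` and `package_version` are hypothetical helper names, so replace them with whatever your workspace actually uses.

```python
import json
import platform
import sys
from importlib import metadata


def package_version(name):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def build_run_record(packages, data_path, observed_output, notes=""):
    """Collect the details the checkpoint asks for into one dictionary."""
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "machine": platform.machine(),
        "packages": {name: package_version(name) for name in packages},
        "data_path": str(data_path),
        "observed_output": observed_output,
        "notes": notes,
    }


record = build_run_record(
    packages=["torch", "transformers"],        # placeholder names; list what you use
    data_path="<path to your data>",           # placeholder
    observed_output="<paste the output here>",  # placeholder
    notes="<backend, memory, model size>",     # placeholder for constraints
)
print(json.dumps(record, indent=2))
```

Saving the printed JSON next to the exercise keeps the constraints and the result together.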

EXERCISE: Implement Streaming Generation

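def stream_generate(engine, prompt, max_tokens=256):
    """Yield tokens one at a time as the model produces them.

    Assumes engine.model.generate() returns an iterator of tokens rather than
    a finished string.
    """
    yield from engine.model.generate(prompt, max_tokens=max_tokens)


# Usage: `engine` must already be constructed with a loaded model.
for token in stream_generate(engine, "Write a story"):
    # end="" keeps tokens on one line; flush=True shows each one immediately.
    print(token, end="", flush=True)
print()  # finish the streamed text with a newline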
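
To try the exercise without loading a real model, a stand-in engine can be helpful. The sketch below is an assumption-heavy stub: `FakeModel` and `FakeEngine` are hypothetical classes that only mimic the `engine.model.generate(prompt, max_tokens=...)` shape used above, and the "tokens" are just whitespace-separated words.

```python
import time


class FakeModel:
    """Hypothetical stand-in for a real model: yields one word per step."""

    def __init__(self, reply, delay=0.0):
        self.reply = reply
        self.delay = delay  # seconds to wait before each token, to mimic latency

    def generate(self, prompt, max_tokens=256):
        # The prompt is ignored; a real model would condition on it.
        for i, word in enumerate(self.reply.split()):
            if i >= max_tokens:
                break
            time.sleep(self.delay)
            yield word + " "


class FakeEngine:
    """Hypothetical wrapper exposing the .model attribute the exercise expects."""

    def __init__(self, model):
        self.model = model


def stream_generate(engine, prompt, max_tokens=256):
    """Yield tokens as they are generated."""
    for token in engine.model.generate(prompt, max_tokens=max_tokens):
        yield token


engine = FakeEngine(FakeModel("Once upon a time there was a model"))
tokens = list(stream_generate(engine, "Write a story", max_tokens=4))
print("".join(tokens))
```

Swapping `FakeEngine` for the real engine should leave `stream_generate` unchanged, provided the real `generate` call also yields tokens incrementally.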