03. NER Prompting vs Fine-Tuning
The decision between prompting-based NER and fine-tuning an LLM involves tradeoffs across cost, latency, accuracy, and maintenance overhead. Each approach suits different operational contexts.
Prompting minimizes implementation complexity. No training data is required, only iterative prompt refinement, and schema changes happen through prompt modification rather than model retraining. This flexibility has a cost: the instructions, schema description, and any few-shot examples must be processed on every inference call, which increases latency. They also consume context window space, reducing the input length available for the actual text.
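As a minimal sketch of the prompting approach, the snippet below builds an instruction prompt from an entity schema and validates a JSON reply. The entity types, the prompt wording, and the helper names build_ner_prompt and parse_entities are illustrative assumptions, and the model reply is a hard-coded stand-in, so no model is actually called.

```python
import json

# Hypothetical schema for illustration; not taken from the chapter.
ENTITY_TYPES = ["PERSON", "ORG", "LOCATION"]


def build_ner_prompt(text, entity_types=ENTITY_TYPES):
    """Build an instruction prompt; the schema lives entirely in this string."""
    types = ", ".join(entity_types)
    return (
        "Extract named entities from the text below.\n"
        f"Allowed entity types: {types}.\n"
        'Respond with a JSON list of objects like {"text": "...", "type": "..."}.\n\n'
        f"Text: {text}\n"
        "Entities:"
    )


def parse_entities(response, entity_types=ENTITY_TYPES):
    """Parse the model's reply, dropping malformed items and unknown types."""
    try:
        items = json.loads(response)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []
    allowed = set(entity_types)
    return [
        (item["text"], item["type"])
        for item in items
        if isinstance(item, dict)
        and isinstance(item.get("text"), str)
        and isinstance(item.get("type"), str)
        and item["type"] in allowed
    ]


prompt = build_ner_prompt("Ada Lovelace worked with Charles Babbage in London.")

# Stand-in for a model reply; "CITY" is outside the schema and gets filtered out.
fake_reply = '[{"text": "Ada Lovelace", "type": "PERSON"}, {"text": "London", "type": "CITY"}]'
print(parse_entities(fake_reply))
```

Adding an entity type only means passing a longer entity_types list, which is the schema flexibility described above. The same list is also sent with every request, which is where the per-call overhead comes from.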
Fine-tuning produces specialized models optimized for specific entity types and output formats. Once the model is trained, inference needs little or no instruction text, so each request carries fewer tokens and latency drops. For high-volume NER pipelines processing thousands of documents per minute, fine-tuned models often provide better cost-per-inference economics.
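The cost argument can be made concrete with some back-of-the-envelope arithmetic. The sketch below compares tokens processed per request and estimates a break-even volume. Every number in it is a hypothetical placeholder, and it assumes both models are billed at the same per-token price, which may not hold in practice.

```python
def tokens_per_request(doc_tokens, overhead_tokens, output_tokens):
    """Total tokens the model processes for one document."""
    return doc_tokens + overhead_tokens + output_tokens


def break_even_requests(training_cost, tokens_saved_per_request, price_per_million_tokens):
    """Requests needed before per-request savings repay a one-off training cost."""
    return training_cost * 1_000_000 / (tokens_saved_per_request * price_per_million_tokens)


# All numbers below are hypothetical placeholders, not measurements.
DOC_TOKENS = 300
OUTPUT_TOKENS = 60
PROMPTED_OVERHEAD = 640   # instructions, schema description, few-shot examples
FINETUNED_OVERHEAD = 40   # short task tag only

prompted = tokens_per_request(DOC_TOKENS, PROMPTED_OVERHEAD, OUTPUT_TOKENS)
finetuned = tokens_per_request(DOC_TOKENS, FINETUNED_OVERHEAD, OUTPUT_TOKENS)
saved = prompted - finetuned

print(f"prompted: {prompted} tokens, fine-tuned: {finetuned} tokens")
print(f"break-even after {break_even_requests(60, saved, 1.0):,.0f} requests")
```

The calculation ignores annotation effort and hosting costs for the fine-tuned model, so treat the break-even figure as a lower bound on the volume needed.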
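import json

# Sketch of a LLaMA-Factory fine-tuning setup. LLaMA-Factory is driven by a
# config file passed to its command-line tool rather than by a Python class.
# The key names below follow its published SFT examples; verify them against
# the installed version before training.
config = {
    # Assumed Hugging Face model ID for Llama 3 8B; substitute your own.
    "model_name_or_path": "meta-llama/Meta-Llama-3-8B-Instruct",
    "stage": "sft",
    "do_train": True,
    "finetuning_type": "lora",
    # The dataset name must be registered in LLaMA-Factory's dataset_info.json.
    "dataset": "ner_dataset",
    # The template selects the chat format of the base model, not the task.
    "template": "llama3",
    "output_dir": "./ner_finetuned",
    # Fine-tuning configuration
    "per_device_train_batch_size": 4,
    "learning_rate": 2e-4,
    "num_train_epochs": 3,
    "warmup_ratio": 0.1,
}

# Write the config to disk so the CLI can read it.
with open("ner_sft.json", "w") as f:
    json.dump(config, f, indent=2)

# Then launch training from a shell:
#   llamafactory-cli train ner_sft.json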
Training data requirements for fine-tuning depend on entity type complexity and base model size. Effective fine-tuning typically requires 500-2000 annotated examples per entity type. Data quality matters more than quantity; consistent annotation guidelines produce better results than large noisy datasets.
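A few lines of Python can check a dataset against these guidelines before any training run. The sketch below converts annotated examples into instruction-tuning records, counts examples per entity type, and flags surface strings that carry more than one label. The sample data, the Alpaca-style record layout (instruction, input, output), and all helper names are assumptions chosen for illustration.

```python
import json
from collections import Counter, defaultdict

# Hypothetical annotated examples: text plus (surface string, entity type) pairs.
annotated = [
    {"text": "Acme hired Jane Doe.", "entities": [("Acme", "ORG"), ("Jane Doe", "PERSON")]},
    {"text": "Jane Doe left Acme.", "entities": [("Jane Doe", "PERSON"), ("Acme", "LOCATION")]},
]


def to_sft_record(example, instruction="Extract named entities as JSON."):
    """Convert one annotated example into an instruction/input/output record."""
    output = json.dumps([{"text": t, "type": y} for t, y in example["entities"]])
    return {"instruction": instruction, "input": example["text"], "output": output}


def count_per_type(examples):
    """Count annotated examples (not mentions) that contain each entity type."""
    counts = Counter()
    for example in examples:
        for entity_type in {y for _, y in example["entities"]}:
            counts[entity_type] += 1
    return counts


def underrepresented(counts, minimum=500):
    """Entity types with fewer examples than the chosen minimum."""
    return sorted(t for t, n in counts.items() if n < minimum)


def find_label_conflicts(examples):
    """Surface strings annotated with more than one entity type."""
    labels = defaultdict(set)
    for example in examples:
        for surface, entity_type in example["entities"]:
            labels[surface].add(entity_type)
    return {s: sorted(types) for s, types in labels.items() if len(types) > 1}


records = [to_sft_record(example) for example in annotated]
counts = count_per_type(annotated)
print(underrepresented(counts))
print(find_label_conflicts(annotated))
```

A conflict is a prompt for review rather than proof of an error, since some strings legitimately change type with context. The default minimum of 500 simply mirrors the lower end of the range quoted above.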
Evaluation methodologies differ between approaches. Prompting allows rapid A/B testing of instruction variations on held-out examples. Fine-tuning evaluation requires monitoring validation metrics throughout training so that overfitting is caught early, for example when validation loss starts rising while training loss keeps falling.
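Both evaluation styles can share the same scoring code. The sketch below computes micro-averaged precision, recall, and F1 over exact (text, type) matches, which suits comparing prompt variants on a held-out set. It also includes a simple patience rule for spotting the epoch where validation loss stops improving. The function names, the exact-match criterion, and the sample values are assumptions for illustration.

```python
def entity_prf(gold, predicted):
    """Micro-averaged precision, recall, F1 over exact (text, type) matches."""
    tp = fp = fn = 0
    for gold_doc, pred_doc in zip(gold, predicted):
        gold_set, pred_set = set(gold_doc), set(pred_doc)
        tp += len(gold_set & pred_set)
        fp += len(pred_set - gold_set)
        fn += len(gold_set - pred_set)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def best_epoch_with_patience(val_losses, patience=2):
    """Return (best_epoch, stopped_at); stopped_at is None if the rule never fires."""
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, None


# Hypothetical held-out documents: one list of (text, type) pairs per document.
gold = [[("Ada", "PERSON"), ("London", "LOCATION")], [("Acme", "ORG")]]
pred = [[("Ada", "PERSON"), ("London", "ORG")], [("Acme", "ORG")]]
print(entity_prf(gold, pred))

# Hypothetical per-epoch validation losses from a fine-tuning run.
print(best_epoch_with_patience([0.90, 0.70, 0.62, 0.65, 0.71]))
```

Exact matching is strict: a prediction with the right span but the wrong type counts as both a false positive and a false negative, as with "London" in the sample data.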
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
Implement both prompting and fine-tuning pipelines for the same entity schema. Measure inference latency, annotation cost for training data, and accuracy metrics across three domain variations.
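For the latency part of the exercise, a small timing harness along these lines may be a useful starting point. The name measure_latency and the stub predictor are placeholders: swap in the real prompting and fine-tuned pipelines as the predict argument. The harness assumes a non-empty list of texts.

```python
import math
import statistics
import time


def measure_latency(predict, texts, warmup=1):
    """Time predict(text) per document; returns mean and p95 latency in seconds."""
    for text in texts[:warmup]:
        predict(text)  # warm-up calls are not timed
    timings = []
    for text in texts:
        start = time.perf_counter()
        predict(text)
        timings.append(time.perf_counter() - start)
    timings.sort()
    p95_index = max(0, math.ceil(0.95 * len(timings)) - 1)
    return {
        "n": len(timings),
        "mean_s": statistics.mean(timings),
        "p95_s": timings[p95_index],
    }


def stub_predict(text):
    """Placeholder predictor: tags capitalized tokens; replace with a real pipeline."""
    return [(token, "MISC") for token in text.split() if token[:1].isupper()]


sample_texts = ["Ada Lovelace lived in London.", "Acme hired Jane Doe.", "no entities here"]
report = measure_latency(stub_predict, sample_texts)
print(report)
```

Running the same texts through both pipelines on the same hardware keeps the comparison fair. Record the backend and model size beside the numbers, since both affect the result.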