Should I fine-tune, or just use a better prompt?
The answer
One paragraph. No hedging beyond what the data actually warrants.
Fine-tune ONLY when prompting + RAG can't get you there. Most people who think they need fine-tuning need a better prompt instead.
The 3-question test:
1. Can a single well-crafted prompt with examples (few-shot) get 80%+ of the result you want? If yes — stop. Iterate on the prompt. Cost: $0 to $50 in API calls or local-GPU time.
2. If your problem is "the model doesn't know my data," can you provide that data in the context? If yes — that's RAG, not fine-tuning. Build a small vector index with your docs. Cost: $0 with local embedders.
3. Does the model fail in a SPECIFIC, REPRODUCIBLE pattern that no amount of prompting fixes? Examples: wrong output format on 1 in 5 generations, refuses to follow a domain-specific style, can't classify your taxonomy correctly even with examples in context. This is the actual fine-tuning case.
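Question 2 can be sanity-checked with zero infrastructure before you bother with a real vector index. A minimal sketch of the RAG idea using only a stdlib bag-of-words retriever (the `docs` list, tokenizer, and scoring are illustrative; a real setup would swap in a local embedding model):

```python
import math
import re
from collections import Counter

def tokens(text):
    """Lowercase word tokens, punctuation stripped."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two token Counters."""
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def retrieve(query, docs, k=2):
    """Return the k docs most similar to the query; these get pasted into the prompt context."""
    q = tokens(query)
    return sorted(docs, key=lambda d: cosine(q, tokens(d)), reverse=True)[:k]

docs = [
    "Refund policy: refunds within 30 days of purchase.",
    "Shipping: orders ship within 2 business days.",
    "Warranty: hardware is covered for 12 months.",
]
context = retrieve("how long do I have to request a refund?", docs, k=1)
```

If stuffing the retrieved snippet into the prompt fixes the failure, the problem was missing context, not missing weights.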
Cost reality (May 2026):
- LoRA fine-tune of a 7B model on a single H100 via RunPod: ~$15-30 for a 4-hour run
- Full fine-tune of a 32B model on 4× H100: $300-600 minimum
- Prompt iteration: nearly free
- RAG infrastructure: free with local embedders
The compounding cost: a fine-tuned model loses general capability in proportion to how aggressively you fine-tune it. Fine-tune Llama 3.1 8B on your customer-support corpus and it gets worse at coding, math, and general knowledge. Then you either keep two models in production or accept the loss. Most teams underestimate this cost.
When fine-tuning is the right answer:
- Output format reliability (JSON, function-calling, structured extraction)
- Domain-specific style (legal, medical, brand voice)
- Speed: a fine-tuned 7B model can match a prompted 32B model for narrow tasks at 5× the speed
- Cost: if you're spending $500/mo on API calls for the same task pattern, $100 of fine-tuning amortizes in 3 weeks
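The amortization claim in that last bullet is simple arithmetic. A sketch with illustrative numbers (the $350/mo serving cost is an assumption for the example, not a figure from the pricing pages):

```python
def breakeven_days(finetune_cost, api_monthly, serving_monthly):
    """Days until a one-off fine-tune cost is repaid by monthly savings."""
    monthly_savings = api_monthly - serving_monthly
    if monthly_savings <= 0:
        return float("inf")  # fine-tuning never pays back
    return finetune_cost / (monthly_savings / 30)

# $500/mo API spend, $100 fine-tune, assumed $350/mo to serve the tuned model
days = breakeven_days(finetune_cost=100, api_monthly=500, serving_monthly=350)
# 100 / (150 / 30) = 20 days, i.e. roughly the 3-week figure above
```

Plug in your own serving cost; if it approaches your API spend, the break-even horizon blows out fast.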
The honest pre-check: before any fine-tune run, write down 50 example inputs with expected outputs. If you can't, you don't have the dataset to fine-tune. If you can, run them through the un-tuned model first with your best prompt-engineering pass. The gap between the un-tuned model's results and your bar tells you whether fine-tuning is even necessary.
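That pre-check fits in a short script. A sketch, assuming a `call_model` function that wraps whatever prompted, un-tuned model you're testing (the exact-match checker is illustrative; swap in whatever your bar actually is):

```python
def precheck(examples, call_model):
    """Run input/expected pairs through the un-tuned model.

    examples: list of (input_text, expected_output) pairs.
    call_model: function(str) -> str wrapping your prompted, un-tuned model.
    Returns (pass_rate, failures) so you can look for the reproducible pattern.
    """
    failures = []
    for inp, expected in examples:
        got = call_model(inp)
        if got.strip() != expected.strip():  # exact match; adapt to your task
            failures.append((inp, expected, got))
    pass_rate = 1 - len(failures) / len(examples)
    return pass_rate, failures

# Stub model for illustration only; replace with your real API or local call.
examples = [("2+2", "4"), ("3+3", "6"), ("capital of France", "Paris")]
rate, fails = precheck(examples, call_model=lambda s: {"2+2": "4", "3+3": "7"}.get(s, ""))
```

Inspect `fails`, not just `rate`: a specific, reproducible failure pattern is the fine-tuning case from question 3; scattered random misses usually mean the prompt needs work.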
Explore the numbers for your specific stack
Where we got the numbers
Fine-tune cost estimates are from RunPod and Lambda Labs pricing pages (May 2026). The "fine-tuning kills general capability" observation comes from the LIMO / catastrophic-forgetting literature and r/LocalLLaMA practitioner reports.
Also see
- The path that usually beats fine-tuning for "the model doesn't know my data" problems.
- Coding agents are an example where good prompting + the right base model beats most fine-tunes.
- If you've decided to fine-tune, Unsloth is the operator-grade pick. Apache 2.0, NVIDIA-only.
- TCO including fine-tune training + serving costs vs API spend over 12 months.
Other questions in this thread
Other /q/ landings on the same topic — same editorial discipline.
- Is fine-tuning dead in 2026? RAG vs distillation vs prompting — when does fine-tuning actually win?
- I want my AI conversations to stay private — what's the realistic local-first setup?
- Persistent KV cache vs RAG — which one should I use for 'chat with my docs'?
- Why doesn't my local LLM have web search — and what are the actual offline alternatives?
Found this via a forum search? Bookmark the URL — we update these pages as new data lands. Have a question that should live here? Open a GitHub issue.