Local fine-tuning workstation
QLoRA / LoRA fine-tuning on a single workstation. Axolotl / Unsloth + bitsandbytes + DeepSpeed (optional) + dataset prep + WandB (or self-hosted MLflow). Targets 7B-13B fine-tunes on 24 GB VRAM; pushes 32B with multi-GPU.
Build summary
Goal: Fine-tune open-weight models on private data without renting cloud GPUs.
Operator card
- ✓Domain-specific fine-tunes (legal, medical, code style)
- ✓Researchers iterating on model quality
- ✓Teams that have data they can't ship to a cloud trainer
- ✓Anyone who needs custom chat templates / personas
- ⚠You only have a single 16 GB card (rent cloud time instead)
- ⚠You don't have a ≥100K-row labeled dataset (zero-shot prompting often beats undertrained tunes)
- ⚠You can't commit a workstation to multi-day jobs
- ⚠You haven't yet evaluated whether prompting / RAG solves the problem
Service ledger
6 services across 4 layers. Each entry includes a one-line operator note explaining why this pick over alternatives.
Hardware
7B QLoRA: single RTX 4090 24 GB is comfortable. ~16 GB used (4-bit quantized base weights + LoRA adapter gradients + optimizer states + activations); ~8 GB headroom.
13B QLoRA: tight on a 4090 (~22 GB used). Possible with batch size 1; comfortable on 2× 3090 with FSDP.
32B QLoRA: needs multi-GPU. 2× 3090 / 2× 4090 with DeepSpeed ZeRO-3 or FSDP. Throughput is bandwidth-bound; expect 5-10 hours per epoch on a small dataset.
Full fine-tune (not LoRA): rare on consumer hardware. 7B full FT needs ~80 GB GPU memory — 4× 3090 minimum. Most operators stay LoRA / QLoRA.
CPU + RAM matters more than for inference: dataloader workers + tokenizer + checkpoint streaming. 128 GB RAM is the comfortable floor.
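A back-of-envelope sketch of where those numbers come from; every constant is a rough assumption, and the activation term is the workload-dependent wild card (measure real peaks with torch.cuda.max_memory_allocated()).

```python
# Rough QLoRA VRAM estimate. Constants are assumptions, not measurements;
# the activation term dominates and scales with batch size x sequence length.
def qlora_vram_gb(params_b: float, lora_frac: float = 0.01,
                  activations_gb: float = 8.0, overhead_gb: float = 1.5) -> float:
    base = params_b * 0.5                 # 4-bit base weights: ~0.5 GB per B params
    adapter = params_b * lora_frac * 2    # LoRA adapter in bf16 (2 bytes/param)
    grads = adapter                       # gradients exist only for the adapter
    optim = params_b * lora_frac * 8      # fp32 Adam moments for adapter params
    return base + adapter + grads + optim + activations_gb + overhead_gb

print(f"7B  ≈ {qlora_vram_gb(7):.0f} GB")   # fits a 24 GB card
print(f"13B ≈ {qlora_vram_gb(13):.0f} GB")  # tight on 24 GB once activations grow
```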
Storage
NVMe Gen4 minimum. SATA SSD chokes on dataset I/O during training.
Plan: ~50 GB tokenized dataset cache (HF Arrow), ~15 GB per saved checkpoint (full FT) or ~150 MB (LoRA adapter), ~10 GB NVMe swap for DeepSpeed offload (if used).
Checkpoint frequency matters. Save every N steps proportional to training cost — losing the last hour of an 8-hour run hurts. save_steps: 250 is a typical sweet spot for QLoRA.
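A minimal cadence sketch using HF TrainingArguments (which Axolotl wraps); the output path and rotation limit are hypothetical values, not recommendations.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="/data/checkpoints/qlora-run-01",  # hypothetical path
    save_strategy="steps",
    save_steps=250,        # the "typical sweet spot" above
    save_total_limit=3,    # rotate old checkpoints to cap disk usage
)
# After a crash, trainer.train(resume_from_checkpoint=True)
# resumes from the most recent saved step.
```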
Networking
Local-only training, but: self-hosted WandB needs network access for its dashboard, and so does MLflow. Bind to LAN, not internet.
If pulling training data from HF Hub: pre-download once, work from local cache. Mid-training network blips can crash a run.
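A minimal pre-download sketch; repo IDs are placeholders. Forcing offline mode at launch time makes a mid-run blip a non-event.

```python
from huggingface_hub import snapshot_download

# One-time pre-download into the local HF cache (repo IDs are placeholders).
snapshot_download(repo_id="meta-llama/Llama-2-7b-hf")
snapshot_download(repo_id="org/private-sft-data", repo_type="dataset")

# Then launch training with the hub forced offline, e.g.:
#   HF_HUB_OFFLINE=1 HF_DATASETS_OFFLINE=1 axolotl train qlora.yml
```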
Multi-GPU on a single node uses NCCL; no external network. Multi-node fine-tuning requires 25 GbE+ minimum and falls outside this workflow.
Observability
Training metrics:
- Loss curve. Smooth decrease = healthy. Sudden spikes = LR too high or bad batch.
- Gradient norm. Clipping kicks in at the configured max (commonly 1.0); sustained values far above it = exploding gradients.
- Tokens/sec. Drops over time = GPU throttle (check temps + power).
- VRAM utilization. Should be steady within 1-2 GB. Sudden spikes = OOM-near-miss → reduce batch size.
Hardware metrics:
- GPU temp during training. >85 °C = throttling = silently degraded throughput.
- Power draw. Sustained at the TDP limit = healthy; draw sagging below it under load = thermal throttling.
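A minimal watcher sketch tying both metric lists together, assuming pynvml and a HF Trainer; swap the print for your tracker's logger.

```python
import pynvml
import torch
from transformers import TrainerCallback

class GpuHealthCallback(TrainerCallback):
    """Print GPU temp, power draw, and peak VRAM at every logging step."""

    def __init__(self, device_index: int = 0):
        pynvml.nvmlInit()
        self.handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)

    def on_log(self, args, state, control, logs=None, **kwargs):
        temp = pynvml.nvmlDeviceGetTemperature(self.handle, pynvml.NVML_TEMPERATURE_GPU)
        watts = pynvml.nvmlDeviceGetPowerUsage(self.handle) / 1000   # mW -> W
        peak = torch.cuda.max_memory_allocated() / 1e9               # bytes -> GB
        print(f"step {state.global_step}: {temp} °C | {watts:.0f} W | peak VRAM {peak:.1f} GB")

# trainer.add_callback(GpuHealthCallback())
```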
Post-training: run eval (lm-eval-harness, see /workflows/local-eval-lab). Compare against base model. If +0% → something went wrong (dataset, learning rate, or checkpoint corruption).
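A hedged sketch of that base-vs-tuned comparison, assuming lm-eval's Python API (simple_evaluate); model paths and tasks are placeholders.

```python
import lm_eval

def score(path: str) -> dict:
    out = lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={path}",
        tasks=["hellaswag", "arc_easy"],  # pick tasks that match your domain
    )
    return out["results"]

base = score("/models/base-7b")    # hypothetical paths
tuned = score("/models/tuned-7b")
for task, metrics in tuned.items():
    print(task, "base:", base[task], "tuned:", metrics)
```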
Security
Training data is sensitive. If you fine-tune on customer logs, internal docs, code — that data is now baked into the weights. Treat the resulting model as confidential as the source.
Checkpoint extraction. A bad actor with checkpoint access can extract approximate training data. Treat checkpoint storage like a database backup.
Dataset poisoning. If your dataset comes from semi-public sources (forum scrapes, Discord exports), assume some samples are adversarial. Run a quality-filter pass before training.
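A minimal filter-pass sketch with HF datasets; the thresholds are illustrative assumptions, not tuned values.

```python
from datasets import load_dataset

ds = load_dataset("json", data_files="raw_scrape.jsonl", split="train")

seen: set[int] = set()

def keep(row) -> bool:
    text = row.get("text", "") or ""
    if not (64 <= len(text) <= 20_000):  # drop empty/short and suspiciously long rows
        return False
    h = hash(text)
    if h in seen:                        # drop exact duplicates
        return False
    seen.add(h)
    return True

ds = ds.filter(keep)   # single-process so the dedupe set stays shared
ds.save_to_disk("filtered_sft")
```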
Model card hygiene. Document what was fine-tuned, on what data, with what license, when. Six months later you'll need this when someone asks "where did this model come from?"
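A minimal sketch: write the card next to the weights at save time, while the details are fresh. Everything in the template is a placeholder.

```python
from pathlib import Path

card = """\
---
base_model: <base model repo id>
license: <base model license>
---
# <model name>
- Method: QLoRA SFT (r=<rank>, <epochs> epochs, Axolotl)
- Data: <dataset name + snapshot date; flag if confidential>
- Trained: <date>, <hardware>
- Eval: <benchmark deltas vs base>
"""
Path("/models/tuned-7b/README.md").write_text(card)  # hypothetical path
```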
Upgrade path
Bigger models: dual 3090 → 4× 3090 → cloud H100 rental for 70B+ fine-tunes. Self-hosting 70B fine-tunes requires sustained operational discipline; renting 8× H100 for $20/hr is often cheaper end-to-end.
Better techniques: add ORPO / DPO for alignment after initial SFT. Add reward modeling for RLHF if you have the pipeline.
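A DPO sketch assuming TRL's DPOTrainer (the exact signature shifts between TRL versions; recent releases take the tokenizer as processing_class). Paths and the preference dataset are placeholders; it expects prompt / chosen / rejected columns.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("/models/tuned-7b")  # SFT checkpoint (hypothetical)
tokenizer = AutoTokenizer.from_pretrained("/models/tuned-7b")
pref_ds = load_dataset("json", data_files="prefs.jsonl", split="train")

args = DPOConfig(output_dir="dpo-out", beta=0.1,
                 per_device_train_batch_size=1, gradient_accumulation_steps=8)
trainer = DPOTrainer(model=model, args=args,
                     train_dataset=pref_ds, processing_class=tokenizer)
trainer.train()
```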
Production: wrap training pipeline in Argo Workflows or Buildkite; add per-run cost tracking; promote successful checkpoints to production via a model registry (MLflow registry, HF Hub private repo).
Continuous fine-tuning: active-learning loop where production-flagged bad outputs feed back into a labeled dataset → trigger nightly QLoRA → eval → ship. The "rolling fine-tune" pattern.
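A sketch of that loop as a nightly driver; paths and thresholds are hypothetical, and the Axolotl CLI invocation varies by version.

```python
import json
import subprocess
from pathlib import Path

def nightly_cycle(flagged="flagged.jsonl", train_file="sft_data.jsonl"):
    rows = [json.loads(l) for l in Path(flagged).read_text().splitlines()]
    if len(rows) < 200:              # don't retrain on a trickle of new labels
        return
    with open(train_file, "a") as f:
        for r in rows:               # fold labeled fixes into the SFT dataset
            f.write(json.dumps(r) + "\n")
    # `axolotl train` is the recent CLI form; older versions use
    # `accelerate launch -m axolotl.cli.train`.
    subprocess.run(["axolotl", "train", "qlora.yml"], check=True)
    # Gate on eval (see Observability), then promote via your model registry.
```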
What breaks first
- OOM 4 hours in. Peak VRAM is set by batch size × sequence length, and the longest sequence in the dataset blows past it mid-run. Pre-bucket by length or set a strict max length (see the sketch after this list).
- Catastrophic forgetting. QLoRA on a narrow domain destroys general capability. Always eval base + tuned on diverse benchmarks.
- Tokenizer mismatch. Adding special tokens without resizing embeddings = silent corruption. Audit tokenizer config before training (the sketch after this list shows the resize).
- Driver / CUDA ABI break. A driver update mid-training can crash with cryptic NCCL errors. Pin the driver; don't auto-update.
- Dataset cache bloat. HF datasets caches grow on disk and can fill it during a long run. Monitor free space.
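One combined sketch covering the tokenizer resize and the max-length guard; the model path and added token are hypothetical.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/models/base-7b")  # hypothetical path
model = AutoModelForCausalLM.from_pretrained("/models/base-7b")

# 1. Tokenizer mismatch: after adding special tokens, resize the embeddings,
#    otherwise the new token IDs index past the embedding table.
added = tokenizer.add_special_tokens({"additional_special_tokens": ["<|persona|>"]})
if added:
    model.resize_token_embeddings(len(tokenizer))

# 2. OOM guard: enforce a strict max length so one outlier sequence can't
#    spike activation memory hours into the run.
MAX_LEN = 2048
def short_enough(row) -> bool:
    return len(tokenizer(row["text"]).input_ids) <= MAX_LEN
# ds = ds.filter(short_enough)
```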
Composes these stacks
The /stacks layer covers what to assemble; this workflow shows how those assemblies operate as a system.
Open the custom build engine and explore which hardware tier actually supports this workflow.
Validation
This workflow doesn't name model + hardware specifically enough to validate. Add explicit modelSlug + hardwareSlug to services for the bridge to work.