Local fine-tuning workstation
QLoRA / LoRA fine-tuning on a single workstation. Axolotl / Unsloth + bitsandbytes + DeepSpeed (optional) + dataset prep + WandB (or self-hosted MLflow). Targets 7B-13B fine-tunes on 24 GB VRAM; pushes 32B with multi-GPU.
Build summary
Goal: Fine-tune open-weight models on private data without renting cloud GPUs.
Operator card
- ✓Domain-specific fine-tunes (legal, medical, code style)
- ✓Researchers iterating on model quality
- ✓Teams that have data they can't ship to a cloud trainer
- ✓Anyone who needs custom chat templates / personas
- ⚠You only have a single 16 GB card (rent cloud time instead)
- ⚠You don't have a ≥100K-row labeled dataset (zero-shot prompting often beats undertrained tunes)
- ⚠You can't commit a workstation to multi-day jobs
- ⚠You haven't yet evaluated whether prompting / RAG solves the problem
Service ledger
6 services across 4 layers. Each entry includes a one-line operator note explaining why this pick over alternatives.
Hardware
7B QLoRA: single RTX 4090 24 GB is comfortable. ~16 GB used (4-bit quantized base weights + LoRA adapter gradients + optimizer states + activations); ~8 GB headroom.
13B QLoRA: tight on a 4090 (~22 GB used). Possible with batch size 1; comfortable on 2× 3090 with FSDP.
32B QLoRA: needs multi-GPU. 2× 3090 / 2× 4090 with DeepSpeed ZeRO-3 or FSDP. Throughput is bandwidth-bound; expect 5-10 hours per epoch on a small dataset.
Full fine-tune (not LoRA): rare on consumer hardware. 7B full FT needs ~80 GB GPU memory — 4× 3090 minimum. Most operators stay LoRA / QLoRA.
CPU + RAM matters more than for inference: dataloader workers + tokenizer + checkpoint streaming. 128 GB RAM is the comfortable floor.
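A back-of-envelope sketch of where those numbers come from; every constant is a rough assumption, and the activation term is the workload-dependent wild card (measure real peaks with torch.cuda.max_memory_allocated()).

```python
# Rough QLoRA VRAM estimate. Constants are assumptions, not measurements;
# the activation term dominates and scales with batch size x sequence length.
def qlora_vram_gb(params_b: float, lora_frac: float = 0.01,
                  activations_gb: float = 8.0, overhead_gb: float = 1.5) -> float:
    base = params_b * 0.5                 # 4-bit base weights: ~0.5 GB per B params
    adapter = params_b * lora_frac * 2    # LoRA adapter in bf16 (2 bytes/param)
    grads = adapter                       # gradients exist only for the adapter
    optim = params_b * lora_frac * 8      # fp32 Adam moments for adapter params
    return base + adapter + grads + optim + activations_gb + overhead_gb

print(f"7B  ≈ {qlora_vram_gb(7):.0f} GB")   # fits a 24 GB card
print(f"13B ≈ {qlora_vram_gb(13):.0f} GB")  # tight on 24 GB once activations grow
```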
Storage
NVMe Gen4 minimum. SATA SSD chokes on dataset I/O during training.
Plan: ~50 GB tokenized dataset cache (HF Arrow), ~15 GB per saved checkpoint (full FT) or ~150 MB (LoRA adapter), ~10 GB NVMe swap for DeepSpeed offload (if used).
Checkpoint frequency matters. Save every N steps proportional to training cost — losing the last hour of an 8-hour run hurts. save_steps: 250 is a typical sweet spot for QLoRA.
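A minimal cadence sketch using HF TrainingArguments (which Axolotl wraps); the output path and rotation limit are hypothetical values, not recommendations.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="/data/checkpoints/qlora-run-01",  # hypothetical path
    save_strategy="steps",
    save_steps=250,        # the "typical sweet spot" above
    save_total_limit=3,    # rotate old checkpoints to cap disk usage
)
# After a crash, trainer.train(resume_from_checkpoint=True)
# resumes from the most recent saved step.
```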
Networking
Local-only training, but: self-hosted WandB needs network access for its dashboard, and so does MLflow. Bind to LAN, not internet.
If pulling training data from HF Hub: pre-download once, work from local cache. Mid-training network blips can crash a run.
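A minimal pre-download sketch; repo IDs are placeholders. Forcing offline mode at launch time makes a mid-run blip a non-event.

```python
from huggingface_hub import snapshot_download

# One-time pre-download into the local HF cache (repo IDs are placeholders).
snapshot_download(repo_id="meta-llama/Llama-2-7b-hf")
snapshot_download(repo_id="org/private-sft-data", repo_type="dataset")

# Then launch training with the hub forced offline, e.g.:
#   HF_HUB_OFFLINE=1 HF_DATASETS_OFFLINE=1 axolotl train qlora.yml
```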
Multi-GPU on a single node uses NCCL; no external network. Multi-node fine-tuning requires 25 GbE+ minimum and falls outside this workflow.
Observability
Training metrics:
- Loss curve. Smooth decrease = healthy. Sudden spikes = LR too high or bad batch.
- Gradient norm. Clipping kicks in at the configured max (commonly 1.0); sustained values far above it = exploding gradients.
- Tokens/sec. Drops over time = GPU throttle (check temps + power).
- VRAM utilization. Should be steady within 1-2 GB. Sudden spikes = OOM-near-miss → reduce batch size.
Hardware metrics:
- GPU temp during training. >85 °C = throttling = silently degraded throughput.
- Power draw. Sustained at the TDP limit = healthy; draw sagging below it under load = thermal throttling.
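A minimal watcher sketch tying both metric lists together, assuming pynvml and a HF Trainer; swap the print for your tracker's logger.

```python
import pynvml
import torch
from transformers import TrainerCallback

class GpuHealthCallback(TrainerCallback):
    """Print GPU temp, power draw, and peak VRAM at every logging step."""

    def __init__(self, device_index: int = 0):
        pynvml.nvmlInit()
        self.handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)

    def on_log(self, args, state, control, logs=None, **kwargs):
        temp = pynvml.nvmlDeviceGetTemperature(self.handle, pynvml.NVML_TEMPERATURE_GPU)
        watts = pynvml.nvmlDeviceGetPowerUsage(self.handle) / 1000   # mW -> W
        peak = torch.cuda.max_memory_allocated() / 1e9               # bytes -> GB
        print(f"step {state.global_step}: {temp} °C | {watts:.0f} W | peak VRAM {peak:.1f} GB")

# trainer.add_callback(GpuHealthCallback())
```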
Post-training: run eval (lm-eval-harness, see /workflows/local-eval-lab). Compare against base model. If +0% → something went wrong (dataset, learning rate, or checkpoint corruption).
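A hedged sketch of that base-vs-tuned comparison, assuming lm-eval's Python API (simple_evaluate); model paths and tasks are placeholders.

```python
import lm_eval

def score(path: str) -> dict:
    out = lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={path}",
        tasks=["hellaswag", "arc_easy"],  # pick tasks that match your domain
    )
    return out["results"]

base = score("/models/base-7b")    # hypothetical paths
tuned = score("/models/tuned-7b")
for task, metrics in tuned.items():
    print(task, "base:", base[task], "tuned:", metrics)
```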
Security
Training data is sensitive. If you fine-tune on customer logs, internal docs, code — that data is now baked into the weights. Treat the resulting model as confidential as the source.
Checkpoint extraction. A bad actor with checkpoint access can extract approximate training data. Treat checkpoint storage like a database backup.
Dataset poisoning. If your dataset comes from semi-public sources (forum scrapes, Discord exports), assume some samples are adversarial. Run a quality-filter pass before training.
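A minimal filter-pass sketch with HF datasets; the thresholds are illustrative assumptions, not tuned values.

```python
from datasets import load_dataset

ds = load_dataset("json", data_files="raw_scrape.jsonl", split="train")

seen: set[int] = set()

def keep(row) -> bool:
    text = row.get("text", "") or ""
    if not (64 <= len(text) <= 20_000):  # drop empty/short and suspiciously long rows
        return False
    h = hash(text)
    if h in seen:                        # drop exact duplicates
        return False
    seen.add(h)
    return True

ds = ds.filter(keep)   # single-process so the dedupe set stays shared
ds.save_to_disk("filtered_sft")
```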
Model card hygiene. Document what was fine-tuned, on what data, with what license, when. Six months later you'll need this when someone asks "where did this model come from?"
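A minimal sketch: write the card next to the weights at save time, while the details are fresh. Everything in the template is a placeholder.

```python
from pathlib import Path

card = """\
---
base_model: <base model repo id>
license: <base model license>
---
# <model name>
- Method: QLoRA SFT (r=<rank>, <epochs> epochs, Axolotl)
- Data: <dataset name + snapshot date; flag if confidential>
- Trained: <date>, <hardware>
- Eval: <benchmark deltas vs base>
"""
Path("/models/tuned-7b/README.md").write_text(card)  # hypothetical path
```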
Upgrade path
Bigger models: dual 3090 → 4× 3090 → cloud H100 rental for 70B+ fine-tunes. Self-hosting 70B fine-tunes requires sustained operational discipline; renting 8× H100 for $20/hr is often cheaper end-to-end.
Better techniques: add ORPO / DPO for alignment after initial SFT. Add reward modeling for RLHF if you have the pipeline.
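A DPO sketch assuming TRL's DPOTrainer (the exact signature shifts between TRL versions; recent releases take the tokenizer as processing_class). Paths and the preference dataset are placeholders; it expects prompt / chosen / rejected columns.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained("/models/tuned-7b")  # SFT checkpoint (hypothetical)
tokenizer = AutoTokenizer.from_pretrained("/models/tuned-7b")
pref_ds = load_dataset("json", data_files="prefs.jsonl", split="train")

args = DPOConfig(output_dir="dpo-out", beta=0.1,
                 per_device_train_batch_size=1, gradient_accumulation_steps=8)
trainer = DPOTrainer(model=model, args=args,
                     train_dataset=pref_ds, processing_class=tokenizer)
trainer.train()
```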
Production: wrap training pipeline in Argo Workflows or Buildkite; add per-run cost tracking; promote successful checkpoints to production via a model registry (MLflow registry, HF Hub private repo).
Continuous fine-tuning: active-learning loop where production-flagged bad outputs feed back into a labeled dataset → trigger nightly QLoRA → eval → ship. The "rolling fine-tune" pattern.
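A sketch of that loop as a nightly driver; paths and thresholds are hypothetical, and the Axolotl CLI invocation varies by version.

```python
import json
import subprocess
from pathlib import Path

def nightly_cycle(flagged="flagged.jsonl", train_file="sft_data.jsonl"):
    rows = [json.loads(l) for l in Path(flagged).read_text().splitlines()]
    if len(rows) < 200:              # don't retrain on a trickle of new labels
        return
    with open(train_file, "a") as f:
        for r in rows:               # fold labeled fixes into the SFT dataset
            f.write(json.dumps(r) + "\n")
    # `axolotl train` is the recent CLI form; older versions use
    # `accelerate launch -m axolotl.cli.train`.
    subprocess.run(["axolotl", "train", "qlora.yml"], check=True)
    # Gate on eval (see Observability), then promote via your model registry.
```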
What breaks first
- OOM 4 hours in. Peak VRAM is set by batch size × sequence length, and the longest sequence in the dataset blows past it mid-run. Pre-bucket by length or set a strict max length (see the sketch after this list).
- Catastrophic forgetting. QLoRA on a narrow domain destroys general capability. Always eval base + tuned on diverse benchmarks.
- Tokenizer mismatch. Adding special tokens without resizing embeddings = silent corruption. Audit tokenizer config before training (the sketch after this list shows the resize).
- Driver / CUDA ABI break. A driver update mid-training can crash with cryptic NCCL errors. Pin the driver; don't auto-update.
- Dataset cache bloat. HF datasets caches grow on disk and can fill it during a long run. Monitor free space.
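One combined sketch covering the tokenizer resize and the max-length guard; the model path and added token are hypothetical.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/models/base-7b")  # hypothetical path
model = AutoModelForCausalLM.from_pretrained("/models/base-7b")

# 1. Tokenizer mismatch: after adding special tokens, resize the embeddings,
#    otherwise the new token IDs index past the embedding table.
added = tokenizer.add_special_tokens({"additional_special_tokens": ["<|persona|>"]})
if added:
    model.resize_token_embeddings(len(tokenizer))

# 2. OOM guard: enforce a strict max length so one outlier sequence can't
#    spike activation memory hours into the run.
MAX_LEN = 2048
def short_enough(row) -> bool:
    return len(tokenizer(row["text"]).input_ids) <= MAX_LEN
# ds = ds.filter(short_enough)
```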
Composes these stacks
The /stacks layer covers what to assemble; this workflow shows how those assemblies operate as a system.
Open the custom build engine and explore which hardware tier actually supports this workflow.
Validation
This workflow doesn't name model + hardware specifically enough to validate. Add explicit modelSlug + hardwareSlug to services for the bridge to work.