Research · Week build-out

Local fine-tuning workstation

QLoRA / LoRA fine-tuning on a single workstation. Axolotl / Unsloth + bitsandbytes + DeepSpeed (optional) + dataset prep + WandB (or self-hosted MLflow). Targets 7B-13B fine-tunes on 24 GB VRAM; pushes 32B with multi-GPU.

By Fredoline Eruo · Reviewed 2026-05-07 · ~1,900 words

Build summary

Hardware footprint: RTX 4090 (7B-13B) OR 2× 3090 / 2× 4090 (32B) · 128 GB RAM · 2 TB NVMe
Concurrency: 1 active training run; queued via Buildkite / GHA self-hosted.
Power: sustained 400-450 W per card during training.

Goal: Fine-tune open-weight models on private data without renting cloud GPUs.

Operator card

Workflow
Best for
  • ✓Domain-specific fine-tunes (legal, medical, code style)
  • ✓Researchers iterating on model quality
  • ✓Teams that have data they can't ship to a cloud trainer
  • ✓Anyone who needs custom chat templates / personas
Avoid if
  • ⚠You only have a single 16 GB card (rent cloud time instead)
  • ⚠You don't have a ≥100K-row labeled dataset (zero-shot prompting often beats undertrained tunes)
  • ⚠You can't commit a workstation to multi-day jobs
  • ⚠You haven't yet evaluated whether prompting / RAG solves the problem
Stability: evolving
Maintenance: weekly attention
Skill: expert
Long-session reliability: reliable

Service ledger

6 services across 4 layers. Each entry includes a one-line operator note explaining why this pick over alternatives.

Compute
  • vLLM (post-training) · Inference
    Eval inference. After training, merge LoRA → load in vLLM → run lm-eval-harness. The same engine that serves production also evaluates fine-tuned models.
    Runs: Docker, GPU 0
Surface
  • Axolotl · Router / orchestrator
    Training orchestrator. Config-driven, supports QLoRA / LoRA / full fine-tune; handles dataset templating; sane defaults for most chat templates.
    Runs: Python venv, GPU pool
  • Unsloth · Router / orchestrator
    Faster trainer. 2-3× faster than vanilla HuggingFace + bitsandbytes for QLoRA. Supports fewer model architectures than Axolotl but covers the Llama / Qwen / Mistral families well.
    Runs: Python venv, GPU pool
Data
  • datasets (HF) + custom Python scripts · Ingest
    Dataset preparation. Most fine-tune data wrangling is custom; HF datasets handles the persistence + sharding + Arrow format efficiently.
    Runs: Python venv
  • Local NVMe + S3-compatible (MinIO) for archive · Storage
    Checkpoint storage. Hot checkpoints on NVMe (7B QLoRA = ~150 MB, full fine-tune = ~14 GB). Archive completed runs to MinIO with metadata.
    Runs: host filesystem + Docker
Operations
  • WandB (self-hosted) OR MLflow · Observability · 8080/tcp
    Training tracking. Loss curves, gradient norms, GPU utilization tracked over time. Self-hosted WandB stays private; MLflow is the fully-OSS alternative.
    Runs: Docker container

Hardware

7B QLoRA: single RTX 4090 24 GB comfortable. ~16 GB used (4-bit quantized base weights + LoRA gradients + optimizer states + activations); ~6 GB headroom.

13B QLoRA: tight on a 4090 (~22 GB used). Possible with batch size 1; comfortable on 2× 3090 with FSDP.

32B QLoRA: needs multi-GPU. 2× 3090 / 2× 4090 with DeepSpeed ZeRO-3 or FSDP. Throughput is bandwidth-bound; expect 5-10 hours per epoch on a small dataset.

Full fine-tune (not LoRA): rare on consumer hardware. 7B full FT needs ~80 GB GPU memory — 4× 3090 minimum. Most operators stay LoRA / QLoRA.

CPU and RAM matter more than for inference: dataloader workers + tokenizer + checkpoint streaming. 128 GB RAM is the comfortable floor.
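
For reference, the loading pattern behind these VRAM numbers looks like this in raw transformers + peft. A minimal sketch: the model id, LoRA rank, and target modules are illustrative placeholders, and Axolotl / Unsloth express the same choices in their configs rather than in code.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 base weights keep a 7B model around 4 GB; LoRA adapters add the trainable part.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",          # placeholder base model id
    quantization_config=bnb,
    device_map="auto",
)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # illustrative choice
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()        # typically well under 1% of total parameters
```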

Storage

NVMe Gen4 minimum. SATA SSD chokes on dataset I/O during training.

Plan: ~50 GB tokenized dataset cache (HF Arrow), ~15 GB per saved checkpoint (full FT) or ~150 MB (LoRA), ~10 GB activation memory (DeepSpeed offload if used).

Checkpoint frequency matters. Save every N steps in proportion to how expensive the run is: losing the last hour of an 8-hour run hurts. save_steps: 250 is a typical sweet spot for QLoRA.
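
In raw Hugging Face Trainer terms (Axolotl's YAML maps onto the same fields), a minimal sketch; the output path and the rolling-checkpoint limit are assumptions for illustration, not measured recommendations.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="checkpoints/run-001",  # hot checkpoints land on the NVMe volume
    save_strategy="steps",
    save_steps=250,                    # the QLoRA sweet spot noted above
    save_total_limit=3,                # keep a few rolling checkpoints; archive finished runs to MinIO
    logging_steps=10,
)
```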

Networking

Training itself is local-only, but self-hosted WandB and MLflow both need network access for their dashboards. Bind them to the LAN, not the internet.

If pulling training data from HF Hub: pre-download once, work from local cache. Mid-training network blips can crash a run.
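
A minimal sketch of that pre-download-then-go-offline pattern; the dataset repo id and cache path are placeholders.

```python
import os
from huggingface_hub import snapshot_download

# While the network is up: pull the dataset (or base model) into a local directory once.
snapshot_download(repo_id="my-org/finetune-data", repo_type="dataset",
                  local_dir="/data/hf/finetune-data")

# For the training run itself: refuse to touch the Hub so a network blip can't kill the job.
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["HF_DATASETS_OFFLINE"] = "1"
```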

Multi-GPU on a single node uses NCCL; no external network. Multi-node fine-tuning requires 25 GbE+ minimum and falls outside this workflow.

Observability

Training metrics:

  • Loss curve. Smooth decrease = healthy. Sudden spikes = LR too high or bad batch.
  • Gradient norm. Clipping kicks in around 1.0; sustained much higher = exploding gradients.
  • Tokens/sec. Drops over time = GPU throttle (check temps + power).
  • VRAM utilization. Should be steady within 1-2 GB. Sudden spikes = OOM-near-miss → reduce batch size.

Hardware metrics:

  • GPU temp during training. >85 °C = throttling = silently degraded throughput.
  • Power draw. Sustained at TDP limit = healthy; clipping = thermal limit.
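
A minimal side-car poller for those two hardware metrics, using NVML via the nvidia-ml-py bindings. The 85 °C threshold comes from the note above; the 30-second interval is an arbitrary assumption.

```python
import time
import pynvml  # pip install nvidia-ml-py

pynvml.nvmlInit()
gpu = pynvml.nvmlDeviceGetHandleByIndex(0)

while True:
    temp = pynvml.nvmlDeviceGetTemperature(gpu, pynvml.NVML_TEMPERATURE_GPU)
    power = pynvml.nvmlDeviceGetPowerUsage(gpu) / 1000               # NVML reports milliwatts
    limit = pynvml.nvmlDeviceGetPowerManagementLimit(gpu) / 1000
    flag = "THROTTLE RISK" if temp > 85 else "ok"
    print(f"temp={temp}C  power={power:.0f}/{limit:.0f} W  [{flag}]")
    time.sleep(30)
```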

Post-training: run eval (lm-eval-harness, see /workflows/local-eval-lab). Compare against base model. If +0% → something went wrong (dataset, learning rate, or checkpoint corruption).
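
The merge step referenced in the vLLM entry above, as a hedged sketch with peft; checkpoint and output paths are placeholders. vLLM can also serve LoRA adapters directly, but merging keeps the eval path identical for base and tuned models.

```python
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

ckpt = "checkpoints/run-001/checkpoint-2500"          # placeholder adapter checkpoint
model = AutoPeftModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.bfloat16)
merged = model.merge_and_unload()                     # folds the LoRA deltas into the base weights
merged.save_pretrained("models/domain-7b-merged")
AutoTokenizer.from_pretrained(ckpt).save_pretrained("models/domain-7b-merged")
```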

Security

Training data is sensitive. If you fine-tune on customer logs, internal docs, or code, that data is now baked into the weights. Treat the resulting model with the same confidentiality as the source data.

Checkpoint extraction. A bad actor with checkpoint access can extract approximate training data. Treat checkpoint storage like a database backup.

Dataset poisoning. If your dataset comes from semi-public sources (forum scrapes, Discord exports), assume some samples are adversarial. Run a quality-filter pass before training.
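
A minimal shape for that quality-filter pass with HF datasets. The file name, the `text` column, and the length / alpha-ratio thresholds are all illustrative assumptions; tune them to your data.

```python
from datasets import load_dataset

ds = load_dataset("json", data_files="data/raw_scrape.jsonl", split="train")

def looks_ok(example) -> bool:
    text = example["text"]
    if not (50 <= len(text) <= 8000):          # drop empty shells and giant pastes
        return False
    alpha = sum(c.isalpha() for c in text) / len(text)
    return alpha >= 0.6                        # mostly symbols or markup -> likely junk

clean = ds.filter(looks_ok, num_proc=8)

# Exact-duplicate pass: poisoned samples are often injected many times verbatim.
seen: set[str] = set()
unique = [row for row in clean if not (row["text"] in seen or seen.add(row["text"]))]
print(f"{len(ds)} raw -> {len(clean)} filtered -> {len(unique)} unique")
```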

Model card hygiene. Document what was fine-tuned, on what data, with what license, when. Six months later you'll need this when someone asks "where did this model come from?"

Upgrade path

Bigger models: dual 3090 → 4× 3090 → cloud H100 rental for 70B+ fine-tunes. Self-hosting 70B fine-tunes requires sustained operational discipline; renting 8× H100 for $20/hr is often cheaper end-to-end.

Better techniques: add ORPO / DPO for alignment after initial SFT. Add reward modeling for RLHF if you have the pipeline.

Production: wrap training pipeline in Argo Workflows or Buildkite; add per-run cost tracking; promote successful checkpoints to production via a model registry (MLflow registry, HF Hub private repo).

Continuous fine-tuning: active-learning loop where production-flagged bad outputs feed back into a labeled dataset → trigger nightly QLoRA → eval → ship. The "rolling fine-tune" pattern.

What breaks first

  1. OOM 4 hours in. Batch size + gradient accumulation + sequence length combined exceeds VRAM at the longest sequence in the dataset. Pre-bucket by length or set strict max-len.
  2. Catastrophic forgetting. QLoRA on a narrow domain destroys general capability. Always eval base + tuned on diverse benchmarks.
  3. Tokenizer mismatch. Adding special tokens without resizing embeddings = silent corruption. Audit the tokenizer config before training (see the sketch after this list).
  4. Driver / CUDA ABI break. A driver update mid-training can crash with cryptic NCCL errors. Pin the driver; don't auto-update.
  5. Dataset memory leak. HF datasets caches grow; can fill the disk during a long run. Monitor.
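
For failure #3, a minimal sketch of the resize that has to accompany any added special tokens; the model id and token strings are placeholders.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "meta-llama/Llama-2-7b-hf"                       # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

added = tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<|im_start|>", "<|im_end|>"]})
if added:
    # Without this, the new token ids point past the embedding table -> silent corruption.
    model.resize_token_embeddings(len(tokenizer))

assert model.get_input_embeddings().weight.shape[0] >= len(tokenizer)
```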

Composes these stacks

The /stacks layer covers what to assemble; this workflow shows how those assemblies operate as a system.

  • /stacks/dual-3090-workstation →
  • /stacks/rtx-4090-workstation →

