RUNLOCALAI · v38
Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Large language models

Parameter-Efficient Fine-Tuning (PEFT)

Parameter-Efficient Fine-Tuning (PEFT) is a set of techniques that adapt a pre-trained large language model to a specific task or domain by updating only a small fraction of the model's parameters, rather than retraining all weights. This drastically reduces VRAM and storage requirements, making fine-tuning feasible on consumer hardware. Common PEFT methods include LoRA (Low-Rank Adaptation), which injects trainable rank-decomposition matrices into selected layers (most often the attention projections), and Adapters, which add small bottleneck modules between existing layers. Operators encounter PEFT when they want to customize a model (e.g., for chat style or domain knowledge) without the cost of full fine-tuning.

Deeper dive

PEFT methods keep the original model weights frozen and introduce a small number of new, trainable parameters. LoRA, for example, represents each weight update as the product of two low-rank matrices (typically rank r = 8 to 64), most commonly applied to the attention projection matrices. As a result, a 7B-parameter model might train only ~0.1-1% of its parameters. The trained adapter weights (often a few MB to a few tens of MB, depending on rank and which layers are targeted) can be merged back into the base model or loaded separately at inference. Other PEFT techniques include Prefix Tuning (learns virtual tokens prepended to the input), Prompt Tuning (learns soft prompts), and IA3, also written (IA)^3 (learns element-wise scaling vectors). PEFT is especially relevant for operators with limited VRAM: a full fine-tune of Llama 3.1 8B requires roughly 60 GB of VRAM or more even with gradient checkpointing, depending on optimizer and precision, while LoRA fine-tuning the same model fits in ~16 GB, typically with the frozen base weights loaded in reduced precision (for example 8-bit or 4-bit, as in QLoRA). The trade-off is that PEFT may not match the accuracy of full fine-tuning on tasks that diverge strongly from the pre-training data, but for most instruction-following or style adaptation it performs nearly as well.
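The "~0.1-1% of parameters" figure can be checked with simple arithmetic. The sketch below counts LoRA parameters for one hypothetical configuration: the layer shapes are assumed to resemble Llama 3.1 8B (hidden size 4096, grouped-query key/value width 1024, 32 layers, about 8.03 billion total parameters), and LoRA is assumed to target all four attention projections at rank 16. None of these choices come from the text above, so treat the result as illustrative.

```python
# Assumed shapes, roughly matching Llama 3.1 8B; adjust for your model.
HIDDEN = 4096          # model (embedding) dimension
KV_DIM = 1024          # key/value projection width under grouped-query attention
NUM_LAYERS = 32
TOTAL_PARAMS = 8.03e9  # approximate total parameter count
RANK = 16              # LoRA rank r

# (input dim, output dim) of each attention projection we assume LoRA targets.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


per_layer = sum(lora_params(d_in, d_out, RANK) for d_in, d_out in PROJECTIONS.values())
trainable = per_layer * NUM_LAYERS
fraction = trainable / TOTAL_PARAMS
size_mb_fp16 = trainable * 2 / 1e6  # 2 bytes per parameter in 16-bit storage

print(f"trainable LoRA parameters: {trainable:,}")
print(f"share of all parameters:   {fraction:.2%}")
print(f"adapter size at 16-bit:    {size_mb_fp16:.1f} MB")
```

Under these assumptions the count comes to about 13.6 million trainable parameters, roughly 0.17% of the model. Targeting fewer projections, adding the MLP layers, or changing the rank moves the number accordingly.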

Practical example

An operator with an RTX 3090 (24 GB VRAM) wants to fine-tune Llama 3.1 8B to respond in a specific tone. Full fine-tuning would exceed the card's VRAM, but LoRA with rank=16, batch size=1, and gradient accumulation steps=4 (an effective batch size of 4) fits comfortably. The resulting adapter file is ~34 MB and can be loaded alongside the base model in Ollama or vLLM. Once the LoRA weights are merged into the base model, inference speed is identical to the base model; an adapter applied separately at runtime adds a small overhead.
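Why merging costs nothing at inference follows from the LoRA update being a plain matrix sum, W' = W + (alpha / r) * B * A. The toy sketch below uses tiny random matrices in pure Python (sizes, seed, and alpha are arbitrary assumptions, and random values stand in for trained adapter weights) to show that the merged weight and the separate adapter path give the same output.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_scaled(w, delta, scale):
    """Return w + scale * delta, element by element."""
    return [[wi + scale * di for wi, di in zip(wr, dr)] for wr, dr in zip(w, delta)]


def rand_matrix(rows, cols, rng):
    """Random matrix with entries in [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d_in, d_out, rank, alpha = 8, 6, 2, 4  # toy sizes, not real model dimensions
scale = alpha / rank

W = rand_matrix(d_out, d_in, rng)  # frozen base weight
A = rand_matrix(rank, d_in, rng)   # LoRA down-projection (stand-in for trained values)
B = rand_matrix(d_out, rank, rng)  # LoRA up-projection (stand-in for trained values)
x = rand_matrix(d_in, 1, rng)      # one input column vector

# Unmerged: base path plus a separate low-rank path (two extra small matmuls).
y_unmerged = add_scaled(matmul(W, x), matmul(B, matmul(A, x)), scale)

# Merged: fold the update into the weight once, then one matmul per call.
W_merged = add_scaled(W, matmul(B, A), scale)
y_merged = matmul(W_merged, x)

max_diff = max(abs(u[0] - m[0]) for u, m in zip(y_unmerged, y_merged))
print(f"max difference between the two paths: {max_diff:.2e}")
```

The two results agree up to floating-point rounding. The merged matrix has the same shape as the original, so a merged model does exactly the same amount of work per token as the base model.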

Workflow example

Using Hugging Face Transformers with PEFT: load the base model with from_pretrained, wrap it with LoRA via get_peft_model from the peft library, and train with the standard Trainer. Saving the wrapped model writes only the adapter, which can be pushed to the Hugging Face Hub. In Ollama, create a Modelfile that names the base model and the adapter: FROM llama3.1:8b, then ADAPTER ./lora-adapter.gguf. Running ollama create my-model -f Modelfile then builds a model that applies the adapter on top of the base; the adapter should have been trained from that same base model. In vLLM, LoRA adapters are served with the --enable-lora flag and the --lora-modules argument, which takes name=path pairs.
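The Hugging Face part of this workflow might look like the sketch below. It assumes torch, transformers, and peft are installed; the hyperparameters in LORA_SETTINGS, the target module names (which fit Llama-style models), and the helper names build_lora_model and save_adapter are illustrative choices, not values taken from this page.

```python
# Illustrative LoRA hyperparameters; tune these for your own task.
LORA_SETTINGS = {
    "r": 16,                # rank of the low-rank update
    "lora_alpha": 32,       # scaling numerator (effective scale is alpha / r)
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],  # Llama-style names
    "bias": "none",
    "task_type": "CAUSAL_LM",
}


def build_lora_model(base_model_id: str):
    """Load a causal LM and wrap it so that only the LoRA weights are trainable."""
    # Imported here so the settings above can be inspected without these packages.
    import torch
    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM

    base = AutoModelForCausalLM.from_pretrained(base_model_id, torch_dtype=torch.bfloat16)
    model = get_peft_model(base, LoraConfig(**LORA_SETTINGS))
    model.print_trainable_parameters()  # reports trainable vs. total parameters
    return model


def save_adapter(model, out_dir: str = "./lora-adapter") -> None:
    """Write only the adapter weights and config, not the full base model."""
    model.save_pretrained(out_dir)
```

A typical call would be build_lora_model("meta-llama/Llama-3.1-8B") (the model ID is an assumption, and the repository is gated on the Hub), followed by training with Trainer and then save_adapter(model). Ollama's ADAPTER line expects the adapter in a format it can read, so a conversion step may be needed before the Modelfile step above.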

Reviewed by Fredoline Eruo. See our editorial policy.
