Parameter-Efficient Fine-Tuning (PEFT)

Parameter-Efficient Fine-Tuning (PEFT) is a set of techniques that adapt a pre-trained large language model to a specific task or domain by updating only a small fraction of the model's parameters, rather than retraining all weights. Because gradients and optimizer states are kept only for that small fraction, VRAM and storage requirements drop sharply, which often makes fine-tuning feasible on consumer hardware. Common PEFT methods include LoRA (Low-Rank Adaptation), which injects trainable rank-decomposition matrices into existing layers (most often the attention projections), and Adapters, which add small bottleneck modules between existing layers. Operators encounter PEFT when they want to customize a model (e.g., for chat style or domain knowledge) without the cost of full fine-tuning.

Deeper dive

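PEFT methods keep the original model weights frozen and introduce a small number of new, trainable parameters. LoRA, for example, represents the update to a weight matrix as the product of two low-rank matrices (typically rank r = 8 to 64) and applies it to the attention projection matrices. As a result, a 7B-parameter model might train only ~0.1-1% of its parameters. The trained adapter weights (typically a few MB to a few tens of MB, depending on rank and on which layers are targeted) can be merged back into the base model or loaded separately at inference. Other PEFT techniques include Prefix Tuning (learns continuous prefix vectors prepended to the attention keys and values in each layer), Prompt Tuning (learns soft prompt embeddings prepended to the input), and IA3 (learns element-wise scaling vectors).

PEFT is especially relevant for operators with limited VRAM. A full fine-tune must hold gradients and optimizer states for every weight: with standard mixed-precision AdamW that is roughly 16 bytes per parameter, or well over 100 GB for Llama 3.1 8B, and even with gradient checkpointing and memory-saving optimizers it needs on the order of ~60 GB. LoRA fine-tuning the same model needs far less, because only the adapter has gradients and optimizer states. Note that the 16-bit base weights alone occupy about 16 GB, so fitting a LoRA run in ~16 GB generally means loading the frozen base model in 8-bit or 4-bit precision (the QLoRA approach); with a 16-bit base, a 24 GB card is a more realistic minimum.

The trade-off is that PEFT may not match the accuracy of full fine-tuning on tasks that diverge strongly from the pre-training distribution, but for most instruction-following or style adaptation it performs nearly as well.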
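The low-rank update described above can be sketched in a few lines of plain Python. This is a toy illustration, not a training implementation: the matrix names W, A, and B and the alpha / r scaling follow the usual LoRA formulation and are assumptions not spelled out in the text, and the sizes are deliberately tiny. It shows that applying the adapter separately and merging it into the base weight give the same output.

```python
# Toy LoRA sketch in pure Python (no external libraries).
# W is the frozen weight (d_out x d_in); A (r x d_in) and B (d_out x r)
# are the small trainable matrices. All names and values are illustrative.

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def scale(a, s):
    """Multiply every entry of a matrix by the scalar s."""
    return [[x * s for x in row] for row in a]

def lora_forward(W, A, B, x, alpha, r):
    """Unmerged path: y = W x + (alpha / r) * B (A x)."""
    base = matmul(W, x)
    update = scale(matmul(B, matmul(A, x)), alpha / r)
    return add(base, update)

def merge(W, A, B, alpha, r):
    """Merged weight: W' = W + (alpha / r) * B A."""
    return add(W, scale(matmul(B, A), alpha / r))

# Toy sizes: d_out = d_in = 3, rank r = 1.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
A = [[1.0, 2.0, 3.0]]        # r x d_in
B = [[0.5], [0.0], [-0.5]]   # d_out x r
x = [[1.0], [1.0], [1.0]]    # input as a column vector
alpha, r = 2, 1

y_unmerged = lora_forward(W, A, B, x, alpha, r)
y_merged = matmul(merge(W, A, B, alpha, r), x)
print(y_unmerged, y_merged)
```

Only A and B would receive gradients during training; W stays fixed, which is where the memory savings come from.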

Practical example

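An operator with an RTX 3090 (24 GB VRAM) wants to fine-tune Llama 3.1 8B to respond in a specific tone. Full fine-tuning would exceed the available VRAM, but with LoRA at rank=16, batch size=1, and gradient accumulation steps=4, training can fit on the card, provided sequence lengths stay moderate; loading the frozen base model in 4-bit or 8-bit precision leaves more headroom. The resulting adapter file is ~34 MB (the exact size depends on which layers are targeted and the precision it is saved in), and it can be loaded alongside the base model in Ollama or vLLM. If the adapter is merged into the base weights, inference speed is identical to the base model; if it is served as a separate adapter, the extra low-rank matrix multiplications add a small overhead.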
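To see where an adapter size of this order comes from, the arithmetic can be sketched as below. The layer shapes are assumptions about Llama 3.1 8B (hidden size 4096, 32 layers, 1024-wide key/value projections from grouped-query attention), and the choice of target modules is hypothetical; a LoRA pair on a d_in to d_out linear layer adds r * (d_in + d_out) parameters.

```python
# Back-of-the-envelope LoRA adapter size estimate.
# Shapes below are assumed values for Llama 3.1 8B, not read from a checkpoint.
HIDDEN, KV_DIM, LAYERS, RANK = 4096, 1024, 32, 16

# (d_in, d_out) for each attention projection in one transformer layer.
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}

def lora_params(d_in, d_out, r):
    """Trainable parameters LoRA adds to one linear layer: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)

def adapter_params(target_modules, r=RANK):
    """Total adapter parameters across all layers for the chosen modules."""
    per_layer = sum(lora_params(*SHAPES[name], r) for name in target_modules)
    return per_layer * LAYERS

def size_mb(n_params, bytes_per_param):
    """Approximate file size in MB (10^6 bytes), ignoring file-format overhead."""
    return n_params * bytes_per_param / 1e6

qv = adapter_params(["q_proj", "v_proj"])
attn = adapter_params(["q_proj", "k_proj", "v_proj", "o_proj"])

print(f"q/v only:      {qv:,} params, {size_mb(qv, 4):.1f} MB at 32-bit")
print(f"all attention: {attn:,} params, {size_mb(attn, 2):.1f} MB at 16-bit")
```

Under these assumptions the estimates land in the tens of MB, the same order as the ~34 MB quoted above; targeting more layers or raising the rank scales the size proportionally.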

Workflow example

Using Hugging Face Transformers with PEFT: load the base model with from_pretrained, then wrap it with LoRA via get_peft_model from the peft library. Train with the standard Trainer. The saved adapter can be pushed to the Hugging Face Hub.

In Ollama, create a Modelfile that names the base model and the adapter: FROM llama3.1:8b, then ADAPTER ./lora-adapter.gguf. The adapter must have been trained against the same base model named in FROM. Running ollama create my-model -f Modelfile produces a model with the adapter applied on top of the base.

In vLLM, LoRA adapters are supported via the --enable-lora flag and the --lora-modules argument, which maps an adapter name to its path.
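The Hugging Face steps above might look roughly like the following sketch. It assumes the transformers, peft, and torch packages are installed and that the operator has access to the base checkpoint; the model ID, the lora_alpha and lora_dropout values, the target modules, and the output directory are illustrative assumptions, not values from this article. The heavy imports are deferred into the function so the settings can be inspected without the libraries present, and the training loop itself is omitted.

```python
def lora_settings():
    """LoRA hyperparameters; r=16 matches the practical example, the rest are assumed."""
    return {
        "r": 16,
        "lora_alpha": 32,          # assumed scaling factor
        "lora_dropout": 0.05,      # assumed dropout on the adapter path
        "target_modules": ["q_proj", "v_proj"],  # assumed attention projections
        "task_type": "CAUSAL_LM",
    }

def build_lora_model(base_id="meta-llama/Llama-3.1-8B"):
    """Load the frozen base model and wrap it with trainable LoRA adapters."""
    # Deferred imports: these require torch, transformers, and peft to be installed.
    import torch
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
    model = get_peft_model(base, LoraConfig(**lora_settings()))
    model.print_trainable_parameters()  # reports trainable vs. total parameter counts
    return model

def save_adapter(model, out_dir="./lora-adapter"):
    """Write only the adapter weights and config, not the full base model."""
    model.save_pretrained(out_dir)

if __name__ == "__main__":
    # Downloads the base checkpoint; training with Trainer would go between these calls.
    peft_model = build_lora_model()
    save_adapter(peft_model)
```

The adapter directory saved this way is in the Hugging Face format; using it with the ADAPTER ./lora-adapter.gguf line shown above would require a separate conversion to GGUF.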

Reviewed by Fredoline Eruo. See our editorial policy.