RUNLOCALAI · v38
Independently operated catalog for local-AI hardware and software. Hand-written verdicts. Source-cited claims. Reproducible commands when we have them.

OP·Fredoline Eruo
© 2026 runlocalai.co · Independently operated
Fine-Tuning with LoRA and QLoRA

08. Dataset Preparation

Chapter 8 of 24 · 15 min
KEY INSIGHT

Dataset quality sets the ceiling on fine-tuned model performance; even large datasets produce poor models when they contain label noise or inconsistent formatting.

Training data quality affects fine-tuning success more than any hyperparameter choice. Dataset requirements for LoRA fine-tuning differ from those for pre-training: the dataset can be far smaller, but it must be cleaner and more closely aligned with the target behavior.

Dataset size requirements depend on task complexity and model size. Simple classification or formatting tasks may require only hundreds to thousands of examples. Complex instruction following or behavioral adaptation typically benefits from thousands to tens of thousands of examples. Beyond a certain point, additional data provides diminishing returns unless the additional examples cover new scenarios.

Quality matters more than quantity in fine-tuning. Noisy labels, inconsistent formatting, and duplicate examples degrade model performance more severely in fine-tuning than in pre-training, because the model has less opportunity to average out errors when training on a smaller corpus. Data cleaning and deduplication therefore deserve higher priority than they do in pre-training.
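One way to act on the deduplication point is to hash a normalized form of each example and keep only the first occurrence. The sketch below is illustrative: the field names `instruction` and `response` and the helpers `normalize_text` and `deduplicate` are assumptions, not part of any library. Treating case and whitespace differences as duplicates is a design choice that may be too aggressive for code or other case-sensitive tasks.

```python
import hashlib
import re


def normalize_text(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(examples: list[dict], keys=("instruction", "response")) -> list[dict]:
    """Keep the first occurrence of each example, comparing normalized fields.

    The field names in `keys` are assumed for illustration.
    """
    seen = set()
    unique = []
    for ex in examples:
        # Join normalized fields with a unit separator so field boundaries stay distinct.
        joined = "\x1f".join(normalize_text(str(ex.get(k, ""))) for k in keys)
        digest = hashlib.sha256(joined.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(ex)
    return unique


examples = [
    {"instruction": "Translate to French: cat", "response": "chat"},
    {"instruction": "translate to french:  cat ", "response": "Chat"},
    {"instruction": "Translate to French: dog", "response": "chien"},
]
print(len(deduplicate(examples)))  # 2
```

This catches only exact matches after normalization. Paraphrased or lightly edited duplicates would need a fuzzier comparison, which is outside the scope of this sketch.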

Structuring examples consistently helps the model learn the target behavior more efficiently. Each example should demonstrate the complete input-output relationship without ambiguity. For instruction-tuning tasks, this means including a clear instruction, context when relevant, and the expected response.
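A consistent structure can be enforced with a small validation pass before training. The following is a minimal sketch under an assumed schema: the field names `instruction`, `context`, and `response` mirror the components named above but are otherwise hypothetical, as is the `validate_example` helper.

```python
REQUIRED_FIELDS = ("instruction", "response")  # assumed schema, for illustration
OPTIONAL_FIELDS = ("context",)


def validate_example(example: dict) -> list[str]:
    """Return a list of problems; an empty list means the example is well-formed."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = example.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty '{field}'")
    for field in OPTIONAL_FIELDS:
        value = example.get(field)
        if value is not None and not isinstance(value, str):
            problems.append(f"'{field}' must be a string when present")
    # Fields outside the schema usually signal a formatting inconsistency.
    unknown = set(example) - set(REQUIRED_FIELDS) - set(OPTIONAL_FIELDS)
    if unknown:
        problems.append(f"unexpected fields: {sorted(unknown)}")
    return problems


good = {
    "instruction": "Summarize the text.",
    "context": "LoRA adds low-rank adapters to a frozen model.",
    "response": "LoRA trains small adapter matrices.",
}
bad = {"instruction": "Summarize the text.", "output": ""}

print(validate_example(good))  # []
print(validate_example(bad))
```

Collecting problems as a list, rather than raising on the first one, makes it easier to report every issue in a dataset in a single pass.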

Source data format varies widely: CSV files, JSON Lines, Parquet, database exports, or scraped web content. The training pipeline must normalize this data into a consistent format before tokenization. Common structures include instruction-response pairs, multi-turn conversations, or input-output mappings, depending on the target behavior.
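The normalization step might look like the sketch below, which maps a CSV source and a JSON Lines source onto the same instruction-response record using only the standard library. The column names (`question`, `answer`, `prompt`, `completion`) and the helper functions are hypothetical; real sources will need their own mappings.

```python
import csv
import io
import json


def from_csv(text: str, instruction_col: str, response_col: str) -> list[dict]:
    """Map two named CSV columns onto the common record format."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {"instruction": row[instruction_col].strip(), "response": row[response_col].strip()}
        for row in reader
    ]


def from_jsonl(text: str, instruction_key: str, response_key: str) -> list[dict]:
    """Map two named keys of each JSON Lines object onto the common record format."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        obj = json.loads(line)
        records.append({
            "instruction": str(obj[instruction_key]).strip(),
            "response": str(obj[response_key]).strip(),
        })
    return records


# Inline sample data stands in for real files.
csv_text = "question,answer\nWhat is LoRA?,A low-rank adaptation method.\n"
jsonl_text = '{"prompt": "What is QLoRA?", "completion": "LoRA on a quantized base model."}\n'

records = from_csv(csv_text, "question", "answer") + from_jsonl(jsonl_text, "prompt", "completion")
for record in records:
    print(record)
```

Once every source emits the same keys, downstream steps such as deduplication and tokenization no longer need to know where a record came from.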

Label quality deserves particular attention for supervised fine-tuning. Mislabeled examples teach incorrect behaviors that persist even when the majority of data is correct. Manual inspection of a random sample helps identify labeling errors before investing training resources. Correction or removal of mislabeled examples often improves results more than extending training.
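Manual inspection is easier to repeat and audit when the sample is drawn with a fixed seed and the original row indices are kept. The sketch below assumes the examples are held in a Python list; the helper names, the sample size of 20, and the review counts are illustrative only.

```python
import random


def sample_for_review(examples: list[dict], k: int = 50, seed: int = 0) -> list[tuple[int, dict]]:
    """Draw a reproducible random sample, keeping original indices for follow-up fixes."""
    rng = random.Random(seed)
    k = min(k, len(examples))
    indices = sorted(rng.sample(range(len(examples)), k))
    return [(i, examples[i]) for i in indices]


def estimated_error_rate(num_flagged: int, num_reviewed: int) -> float:
    """Fraction of reviewed examples that a human judged mislabeled."""
    return num_flagged / num_reviewed if num_reviewed else 0.0


# Synthetic placeholder data; a real dataset would be loaded here instead.
examples = [{"instruction": f"q{i}", "response": f"a{i}"} for i in range(200)]
batch = sample_for_review(examples, k=20)
print(len(batch))  # 20

# Suppose a reviewer flags 3 of the 20 sampled examples (hypothetical numbers).
print(estimated_error_rate(3, 20))  # 0.15
```

The error rate from a small sample is only a rough estimate, but it can indicate whether a full relabeling pass is worth the effort before any training run.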

EXERCISE

Download a public instruction-tuning dataset and perform exploratory analysis: check format consistency, identify duplicates, examine label distribution, and flag potential quality issues.

# dataset_analysis.py
from datasets import load_dataset
from collections import Counter

def analyze_instruction_dataset(dataset_name: str) -> dict:
    """Analyze an instruction-tuning dataset for quality metrics."""
    dataset = load_dataset(dataset_name)
    
    split = list(dataset.keys())[0]  # Use first available split
    data = dataset[split]
    
    analysis = {
        "num_examples": len(data),
        "columns": data.column_names,
        "example_lengths": [],
        "null_counts": {},
        "duplicates": 0
    }
    
    # Count missing values (None or empty string) in each column
    for col in data.column_names:
        null_count = sum(1 for x in data[col] if x is None or x == "")
        analysis["null_counts"][col] = null_count
    
    # Measure character length over the first 1,000 rows only, for speed
    # (a prefix of the dataset, not a random sample)
    for idx in range(min(len(data), 1000)):
        example = data[idx]
        total_length = sum(
            len(str(v)) for v in example.values() if v is not None
        )
        analysis["example_lengths"].append(total_length)
    
    # Check for exact duplicates: count every repeat beyond the first occurrence
    row_counts = Counter(str(row) for row in data)
    analysis["duplicates"] = sum(count - 1 for count in row_counts.values())

    # Mean and population standard deviation of the sampled lengths
    lengths = analysis["example_lengths"]
    n = len(lengths)
    mean = sum(lengths) / n if n else 0.0  # guard against an empty dataset
    analysis["avg_length"] = mean
    analysis["length_std"] = (sum((x - mean) ** 2 for x in lengths) / n) ** 0.5 if n else 0.0
    
    
    return analysis

def print_analysis_report(analysis: dict):
    """Print formatted analysis report."""
    print("=" * 50)
    print("Dataset Analysis Report")
    print("=" * 50)
    print(f"Number of examples: {analysis['num_examples']:,}")
    print(f"Columns: {analysis['columns']}")
    print("\nNull counts by column:")
    for col, count in analysis['null_counts'].items():
        # Avoid dividing by zero when the dataset is empty
        pct = 100 * count / analysis['num_examples'] if analysis['num_examples'] else 0.0
        print(f"  {col}: {count} ({pct:.2f}%)")
    print(f"\nDuplicate examples: {analysis['duplicates']}")
    print(f"\nLength statistics:")
    print(f"  Average: {analysis['avg_length']:.1f} chars")
    print(f"  Std dev:  {analysis['length_std']:.1f} chars")

# Example usage
# report = analyze_instruction_dataset("yahma/alpaca-cleaned")
# print_analysis_report(report)
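
The exercise also asks for a label distribution, which the script above does not compute. A minimal sketch follows, assuming the rows are dictionaries with a categorical column. The column name `label` and the helper `label_distribution` are hypothetical, and many instruction-tuning datasets have no such column, in which case a proxy such as a task-type field would have to be counted instead.

```python
from collections import Counter


def label_distribution(rows, column: str = "label") -> dict:
    """Map each label to (count, share of total), most frequent first."""
    counts = Counter(row.get(column) for row in rows)
    total = sum(counts.values())
    if not total:
        return {}
    return {label: (n, n / total) for label, n in counts.most_common()}


# Tiny synthetic sample; real rows would come from the loaded dataset.
rows = [{"label": "pos"}, {"label": "neg"}, {"label": "pos"}, {"label": "pos"}]
for label, (n, share) in label_distribution(rows).items():
    print(f"{label}: {n} ({share:.0%})")
```

A heavily skewed distribution is worth flagging alongside duplicates and null counts, since rare labels may be underrepresented during fine-tuning.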