14. Model Selection for Code

Chapter 14 of 20 · 15 min

Code generation has distinct requirements: syntax accuracy, API knowledge, debugging capability, and readability. This chapter helps you select models optimized for programming tasks.

Code model requirements:

  1. Syntax accuracy: Generates valid Python, JavaScript, etc.
  2. API familiarity: Knows common library interfaces
  3. Context awareness: Uses provided code, not generic patterns
  4. Debugging capability: Reads error messages and suggests fixes
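
The first requirement can be checked mechanically for Python output. The sketch below is one possible check, not a prescribed method: it parses a generated snippet with the standard library's ast module and reports whether it is syntactically valid. The helper name syntax_error_of and both sample snippets are hypothetical.

```python
import ast
from typing import Optional


def syntax_error_of(source: str) -> Optional[str]:
    """Return None if `source` parses as Python, else a short error description."""
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return f"line {exc.lineno}: {exc.msg}"
    return None


# Hypothetical model outputs: the second one is missing a colon.
good = "def add(a, b):\n    return a + b\n"
bad = "def add(a, b)\n    return a + b\n"

print(syntax_error_of(good))
print(syntax_error_of(bad))
```

Parsing only proves the code is well-formed. It says nothing about API usage or logic, which still need tests.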

Benchmark-first selection:

Start with HumanEval and MBPP scores, but also test on your specific stack:

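# code_benchmark.py
# Each case pairs a prompt with test statements that run, in order,
# against the code the model generates.
test_cases = [
    {
        "id": "pandas_cleanup",
        "prompt": (
            "Write two functions for a DataFrame with columns ['date', 'value']: "
            "fill_dates(df) returns the DataFrame with missing dates filled, and "
            "remove_outliers(df) drops rows whose value is more than 3 standard "
            "deviations from the mean."
        ),
        "reference_implementation": True,
        # Assumes the harness has already imported pandas as pd.
        "tests": [
            "test_df = pd.DataFrame(...)",
            "out = remove_outliers(fill_dates(test_df))",
            "assert len(out) > 0"
        ]
    },
    # Add cases specific to your codebase
]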
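
A case list like this needs something to execute it. The following is a minimal sketch of such a runner, assuming each test is a Python statement stored as a string. The function run_case, the sample candidate, and the pandas-free "double" case are all hypothetical stand-ins so the example runs without extra packages.

```python
from typing import Dict, List


def run_case(generated_code: str, tests: List[str]) -> Dict[str, object]:
    """Execute generated code, then each test statement, in one shared namespace."""
    namespace: Dict[str, object] = {}
    try:
        exec(generated_code, namespace)   # defines the model's functions
        for statement in tests:
            exec(statement, namespace)    # setup lines and assertions alike
    except Exception as exc:              # any failure marks the case as failed
        return {"passed": False, "error": f"{type(exc).__name__}: {exc}"}
    return {"passed": True, "error": None}


# Hypothetical model output and a tiny case in the same shape as test_cases.
candidate = "def double(x):\n    return x * 2\n"
case = {"id": "double", "tests": ["out = double(21)", "assert out == 42"]}

result = run_case(candidate, case["tests"])
print(case["id"], result)
```

exec runs whatever the model produced with your own permissions, so a real harness would typically run it in a subprocess or container with a timeout.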

Model selection by language:

Language     Recommended models                    Notes
Python       CodeLlama, Deepseek-Coder, Mistral    Strong Python focus
JavaScript   WizardCoder, CodeLlama                React/Node APIs
General      CodeLlama 70B                         Large, covers multiple languages

Code-specific optimizations:

Some models are fine-tuned specifically for code:

  • CodeLlama: Meta's code-specialized Llama variant, available in multiple sizes
  • Deepseek-Coder: Trained for code completion, with strong results on code benchmarks
  • StarCoder: Trained on permissively licensed GitHub code

At the same parameter count, these typically outperform general-purpose models on code tasks.

Quantization for code:

Code generation often tolerates more aggressive (lower-bit) quantization better than other tasks because:

  1. The answer is verifiable (run the code)
  2. Syntax errors are obvious failures
  3. Complex logic depends more on overall model quality than on quantization precision

Use Q4_K_M as a baseline, and consider Q5_K_M if you are working on critical code.
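
Because generated code can be run, the choice between the two levels can rest on measured pass rates rather than intuition. The sketch below shows one way to compare them. The helper names are hypothetical, and the result dictionaries are placeholder values for illustration, not measurements of any model.

```python
from typing import Dict, List


def pass_rate(results: Dict[str, bool]) -> float:
    """Fraction of tasks whose generated code passed its tests."""
    return sum(results.values()) / len(results) if results else 0.0


def fixed_by_upgrade(lower: Dict[str, bool], higher: Dict[str, bool]) -> List[str]:
    """Task ids that fail at the lower precision but pass at the higher one."""
    return sorted(t for t in lower if not lower[t] and higher.get(t, False))


# Placeholder outcomes (task id -> passed), NOT real benchmark data.
q4_results = {"pandas_cleanup": True, "date_parser": False,
              "retry_wrapper": True, "sql_builder": True}
q5_results = {"pandas_cleanup": True, "date_parser": True,
              "retry_wrapper": True, "sql_builder": True}

print(f"Q4_K_M pass rate: {pass_rate(q4_results):.2f}")
print(f"Q5_K_M pass rate: {pass_rate(q5_results):.2f}")
print("Fixed only by Q5_K_M:", fixed_by_upgrade(q4_results, q5_results))
```

If the list of tasks fixed only by the higher precision is empty on your own cases, the baseline is probably sufficient for that workload.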

Local verification checkpoint

Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.

EXERCISE

Run a model on 5 real coding tasks from your current project. Measure syntax errors, API usage errors, and logical errors separately. Compare the results with the model's published HumanEval score.
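
One way to keep the three error counts separate is to classify each failure by the exception it raises. The sketch below is a rough heuristic under stated assumptions: a parse failure counts as a syntax error, an ImportError, AttributeError, NameError, or TypeError counts as an API usage error, and a failed assertion counts as a logical error. The classify function and the sample outputs are hypothetical, and the mapping will misfile some cases (a TypeError can also come from faulty logic).

```python
import ast
from collections import Counter
from typing import List

# Assumption: these exception types usually signal a wrong or missing API call.
API_ERRORS = (ImportError, AttributeError, NameError, TypeError)


def classify(generated_code: str, tests: List[str]) -> str:
    """Label one model output as 'syntax', 'api', 'logic', 'other', or 'ok'."""
    try:
        ast.parse(generated_code)
    except SyntaxError:
        return "syntax"
    namespace: dict = {}
    try:
        exec(generated_code, namespace)
        for statement in tests:
            exec(statement, namespace)
    except AssertionError:
        return "logic"
    except API_ERRORS:
        return "api"
    except Exception:
        return "other"
    return "ok"


# Hypothetical model outputs for four tasks that share one test.
tests = ["assert f(4) == 2"]
outputs = [
    "def f(x) return x ** 0.5",                               # missing colon
    "import math\ndef f(x):\n    return math.squareroot(x)",  # no such function
    "def f(x):\n    return x / 4",                            # wrong formula
    "import math\ndef f(x):\n    return math.sqrt(x)",        # correct
]

tally = Counter(classify(code, tests) for code in outputs)
print(dict(tally))
```

As with any exec-based harness, run untrusted model output in an isolated process.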