14. Model Selection for Code
Code generation has distinct requirements: syntax accuracy, API knowledge, debugging capability, and readability. This chapter helps you select models optimized for programming tasks.
Code model requirements:
- Syntax accuracy: Generates valid Python, JavaScript, etc.
- API familiarity: Knows common library interfaces
- Context awareness: Uses provided code, not generic patterns
- Debugging capability: Reads error messages and suggests fixes
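The first requirement is the easiest to check mechanically. A minimal sketch for Python output, using the standard library's `ast.parse`; the helper names `is_valid_python` and `syntax_error_rate` and the sample strings are illustrative assumptions, not part of any benchmark:

```python
import ast


def is_valid_python(source: str) -> bool:
    """Return True if `source` parses as Python; nothing is executed."""
    try:
        ast.parse(source)
        return True
    except (SyntaxError, ValueError):
        return False


def syntax_error_rate(samples: list[str]) -> float:
    """Fraction of generated samples that fail to parse."""
    if not samples:
        return 0.0
    failures = sum(not is_valid_python(s) for s in samples)
    return failures / len(samples)


# Hypothetical model outputs: the second is missing a closing parenthesis.
samples = [
    "def add(a, b):\n    return a + b\n",
    "def add(a, b:\n    return a + b\n",
]
print(syntax_error_rate(samples))
```

Parsing only proves the code is well-formed; API usage and logic still need to be exercised by running tests.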
Benchmark-first selection:
Start with HumanEval and MBPP scores, but also test on your specific stack:
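# code_benchmark.py
# Each case pairs a prompt with test statements to run against the model's output.
# The test strings assume the harness has already imported pandas as pd.
test_cases = [
    {
        "id": "pandas_cleanup",
        "prompt": "Write two functions for a DataFrame with columns ['date', 'value']: fill_dates(df) returns the frame with missing dates filled in, and remove_outliers(df) drops rows whose value is more than 3 standard deviations from the mean.",
        "reference_implementation": True,
        "tests": [
            "test_df = pd.DataFrame(...)",
            "out = remove_outliers(fill_dates(test_df))",
            "assert len(out) > 0",
        ],
    },
    # Add cases specific to your codebase
]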
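One way to turn cases like these into a pass/fail signal is to execute the model's code and then each test statement in a shared namespace. The sketch below assumes a hypothetical `run_case` helper and a stdlib-only demo case so that it runs without pandas. Generated code should only ever be executed in a sandbox or throwaway environment.

```python
def run_case(generated_code: str, tests: list[str]) -> tuple[bool, str]:
    """Execute generated code, then each test statement, in one namespace.

    Returns (passed, detail). WARNING: exec runs arbitrary code; use a
    sandbox or container for real model output.
    """
    namespace: dict = {}
    try:
        exec(generated_code, namespace)
        for statement in tests:
            exec(statement, namespace)
    except Exception as exc:  # any failure counts as a failed case
        return False, f"{type(exc).__name__}: {exc}"
    return True, "ok"


# Hypothetical case in the same shape as the entries in test_cases.
case = {
    "id": "dedupe_keep_order",
    "tests": [
        "out = dedupe([3, 1, 3, 2, 1])",
        "assert out == [3, 1, 2]",
    ],
}

# Stand-in for a model response; in practice this comes from the model.
generated = "def dedupe(items):\n    return list(dict.fromkeys(items))\n"

passed, detail = run_case(generated, case["tests"])
print(case["id"], passed, detail)
```

Sharing one namespace lets later test statements use variables defined by earlier ones, which matches how the `tests` lists are written.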
Model selection by language:
| Language | Recommended models | Notes |
|---|---|---|
| Python | CodeLlama, DeepSeek-Coder, Mistral | Strong Python focus |
| JavaScript | WizardCoder, CodeLlama | React/Node APIs |
| General | CodeLlama 70B | Large, covers multiple languages |
Code-specific optimizations:
Some models are trained or fine-tuned specifically for code:
- CodeLlama: Meta's code-specialized variant of Llama 2, released in 7B, 13B, 34B, and 70B sizes
- DeepSeek-Coder: Trained predominantly on source code, with strong results on public code benchmarks
- StarCoder: Trained on permissively licensed GitHub code
These typically outperform general-purpose models of similar size on code tasks.
Quantization for code:
Code generation often tolerates lower-bit quantization better than other tasks because:
- The answer is verifiable (run the code)
- Syntax errors are obvious failures
- A larger model at lower precision often beats a smaller model at higher precision on complex logic
Use Q4_K_M as the baseline, and consider Q5_K_M for critical code if memory allows.
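The memory side of this trade-off is simple arithmetic: weight storage is roughly parameter count times bits per weight. The sketch below uses approximate bits-per-weight figures for the two quantization types (about 4.8 and 5.7, assumed values that vary by model architecture) and ignores KV cache and runtime overhead, so treat the output as a rough lower bound.

```python
# Approximate bits per weight; assumed figures, check real model file sizes.
APPROX_BITS_PER_WEIGHT = {"Q4_K_M": 4.8, "Q5_K_M": 5.7}


def weight_size_gb(params_billions: float, quant: str) -> float:
    """Estimate weight storage in GB (10^9 bytes), excluding KV cache."""
    bits = params_billions * 1e9 * APPROX_BITS_PER_WEIGHT[quant]
    return bits / 8 / 1e9


for params in (7, 13, 34):
    for quant in ("Q4_K_M", "Q5_K_M"):
        print(f"{params}B {quant}: ~{weight_size_gb(params, quant):.1f} GB")
```

Under these assumptions, moving the same model from Q4_K_M to Q5_K_M costs a little under a fifth more memory, which is worth comparing against the cost of stepping up a model size instead.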
Local verification checkpoint
Run the smallest example from this chapter in a local workspace and record the package version, runtime, data path, and observed output. If the result depends on model size, vector count, CPU/GPU backend, or available memory, note that constraint beside the exercise so the lesson remains reproducible.
Run a candidate model on five real coding tasks from your current project. Measure syntax errors, API usage errors, and logical errors separately, then compare the results with the model's published HumanEval score.
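A rough way to separate the three error types for Python output is to map the failing stage or exception to a category. The mapping below is a heuristic assumption (for example, it treats `AttributeError` and `TypeError` as API misuse), and `classify` is a hypothetical helper; expect to review borderline cases by hand.

```python
import ast
from collections import Counter

# Heuristic: exceptions that usually indicate a wrong or misused API.
API_ERRORS = (ImportError, AttributeError, TypeError, NameError)


def classify(generated_code: str, tests: list[str]) -> str:
    """Return 'syntax', 'api', 'logic', or 'pass' for one generated solution."""
    try:
        ast.parse(generated_code)
    except SyntaxError:
        return "syntax"
    namespace: dict = {}
    try:
        exec(generated_code, namespace)  # run only in a sandbox
        for statement in tests:
            exec(statement, namespace)
    except API_ERRORS:
        return "api"
    except Exception:
        return "logic"  # failed assertions and other runtime errors
    return "pass"


# Hypothetical outputs for one task: compute the mean of a list.
tests = ["assert mean([1, 2, 3]) == 2"]
outputs = [
    "def mean(xs):\n    return sum(xs) / len(xs)\n",           # correct
    "def mean(xs)\n    return sum(xs) / len(xs)\n",            # missing colon
    "import math\ndef mean(xs):\n    return math.mean(xs)\n",  # no such function
    "def mean(xs):\n    return sum(xs) / (len(xs) + 1)\n",     # wrong formula
]

tally = Counter(classify(code, tests) for code in outputs)
print(dict(tally))
```

Keeping the categories separate shows whether a model's failures come from the language itself, from your libraries, or from reasoning, which a single pass rate hides.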