Evaluation metrics

R²

R² (coefficient of determination) measures how much of the variance in actual outcomes a regression model's predictions explain. Values usually fall between 0 and 1, but R² can be negative when the model fits worse than simply predicting the mean. In local AI, R² appears when evaluating fine-tuned models on regression tasks (e.g., predicting token-level latency or VRAM usage). An R² of 1 means perfect prediction; 0 means the model performs no better than always predicting the mean. Operators encounter R² in training logs or evaluation scripts, where it helps judge whether a quantized or fine-tuned model preserves predictive accuracy.

Deeper dive

R² is calculated as 1 - (SS_res / SS_tot), where SS_res is the sum of squared residuals (actual value minus prediction) and SS_tot is the sum of squared differences between each actual value and the mean of the actual values. A negative R² occurs when SS_res exceeds SS_tot, meaning the model fits worse than a horizontal line at the mean; this sometimes happens with poorly quantized models on small datasets. In practice, R² is sensitive to outliers, and a single R² value does not separate bias from variance. For local AI, R² is most relevant when benchmarking quantized models on regression benchmarks (e.g., predicting perplexity or runtime). A drop in R² after quantization signals loss of predictive fidelity.
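The formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name r_squared and the sample values are made up for the example and are not measurements.

```python
def r_squared(y_true, y_pred):
    """Return 1 - SS_res / SS_tot for two equal-length sequences of numbers."""
    if len(y_true) == 0 or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    mean_true = sum(y_true) / len(y_true)
    # SS_res: squared prediction errors
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # SS_tot: squared differences between actual values and their mean
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R² is undefined when all actual values are identical")
    return 1 - ss_res / ss_tot


# Illustrative values only (hypothetical, not measured)
y_true = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(y_true, y_true)                # exact predictions -> 1.0
mean_only = r_squared(y_true, [2.5] * 4)           # always predict the mean -> 0.0
worse = r_squared(y_true, [4.0, 3.0, 2.0, 1.0])    # reversed predictions -> -3.0
print(perfect, mean_only, worse)
```

The third case shows how R² goes negative: the reversed predictions give SS_res = 20 against SS_tot = 5, so the result is 1 - 4 = -3.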

Practical example

An operator fine-tunes a small regression model to predict inference latency on an RTX 4090. After applying 4-bit quantization, the R² on a held-out test set drops from 0.95 to 0.82, indicating that quantization reduced the model's ability to accurately predict latency. This helps the operator decide whether the speed gain from quantization is worth the accuracy loss.
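One way to make that decision repeatable is to compare the two scores against a tolerance chosen in advance. The sketch below uses the 0.95 and 0.82 figures from the example; the helper name r2_drop_acceptable and the 0.05 tolerance are assumptions for illustration, not a recommended threshold.

```python
def r2_drop_acceptable(baseline_r2, quantized_r2, max_drop):
    """Return (drop, ok), where ok is True if the R² drop is within max_drop."""
    drop = baseline_r2 - quantized_r2
    return drop, drop <= max_drop


# 0.95 and 0.82 come from the example above; max_drop=0.05 is a hypothetical tolerance
drop, ok = r2_drop_acceptable(0.95, 0.82, max_drop=0.05)
print(f"R² drop: {drop:.2f}, acceptable: {ok}")
```

With these inputs the drop is about 0.13, which exceeds the assumed tolerance, so the check reports the quantized model as not acceptable.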

Workflow example

In a Hugging Face Transformers training script, operators can log R² after each epoch by loading the metric from the Hugging Face evaluate library with evaluate.load('r_squared'). For example, a training script set up this way (here called train.py) prints R² alongside loss when run with python train.py --output_dir ./results. If R² plateaus below the operator's target (0.9, for instance), the operator might adjust the learning rate or increase the dataset size. llama.cpp does not compute R² directly, but operators can compute it externally from model outputs and ground truth.
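For backends that do not report R², the external computation can be a short script. The sketch below assumes the operator has already saved predictions and ground truth to a CSV file; the column names actual and predicted, the function name r_squared_from_csv, and the sample rows are all hypothetical.

```python
import csv
import io


def r_squared_from_csv(handle, truth_col="actual", pred_col="predicted"):
    """Compute R² from a CSV with one ground-truth and one prediction column."""
    rows = list(csv.DictReader(handle))
    if not rows:
        raise ValueError("no data rows found")
    y_true = [float(row[truth_col]) for row in rows]
    y_pred = [float(row[pred_col]) for row in rows]
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R² is undefined when all actual values are identical")
    return 1 - ss_res / ss_tot


# Hypothetical sample data standing in for a file exported by the operator
sample = io.StringIO(
    "actual,predicted\n"
    "10,12\n"
    "20,18\n"
    "30,33\n"
    "40,39\n"
)
r2 = r_squared_from_csv(sample)
print(f"R² = {r2:.3f}")
```

To use it with a real file, pass an open file object instead of the in-memory sample, for example open("predictions.csv", newline=""), where the file name is again a placeholder.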

Reviewed by Fredoline Eruo. See our editorial policy.
