Classical ML algorithms

CatBoost

CatBoost is a gradient boosting library developed by Yandex that handles categorical features automatically without manual encoding. For operators running local AI, CatBoost is relevant when working with tabular data tasks like classification or regression, where it competes with XGBoost and LightGBM. Its key differentiator is the use of ordered boosting to reduce prediction shift, and it natively supports categorical columns, which can simplify preprocessing pipelines.

Deeper dive

CatBoost builds an ensemble of decision trees sequentially, with each tree correcting the errors of the previous ones. By default it grows symmetric (oblivious) trees, in which every node at a given depth applies the same split. It handles categorical features with ordered target statistics: each row's category is encoded using only the target values of rows that precede it in a random permutation of the data, which limits target leakage. The library also implements ordered boosting, which applies the same idea to residuals: conceptually, the residual for each row is computed by a model trained without that row, which reduces prediction shift and overfitting. CatBoost supports GPU training and can be faster than XGBoost on certain datasets. For local operators, CatBoost is available via pip and can be used from Python or from its command-line interface. It outputs a model file that can be loaded for inference. It is not typically used in LLM pipelines and is more common in traditional ML workflows.
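The ordered target statistic described above can be illustrated in a few lines of plain Python. This is a simplified sketch of the idea, not CatBoost's internal implementation: the function name, the `prior` and `prior_weight` parameters, and the single-permutation setup are assumptions made for illustration.

```python
import random
from collections import defaultdict


def ordered_target_statistics(categories, targets, prior, prior_weight=1.0,
                              order=None, seed=0):
    """Encode one categorical column using only "earlier" rows.

    Simplified illustration: each row is encoded from the targets of rows
    that come before it in a permutation, so a row never sees its own target.
    """
    n = len(categories)
    if order is None:
        # Random permutation of row indices (hypothetical default).
        order = list(range(n))
        random.Random(seed).shuffle(order)

    sums = defaultdict(float)   # running sum of targets per category
    counts = defaultdict(int)   # running count of rows per category
    encoded = [0.0] * n

    for idx in order:
        cat = categories[idx]
        # Smoothed mean of the targets seen so far for this category.
        encoded[idx] = (sums[cat] + prior_weight * prior) / (counts[cat] + prior_weight)
        # Only now does this row's target become visible to later rows.
        sums[cat] += targets[idx]
        counts[cat] += 1

    return encoded
```

With an explicit order of `[0, 1, 2, 3]`, the first row of each category receives the prior, and later rows receive a smoothed mean of the targets that came before them. A plain per-category mean over the whole column would instead let each row's own target leak into its feature.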

Practical example

An operator training a model to predict housing prices on a dataset with categorical features such as 'neighborhood' and 'roof type' can pass those columns to CatBoost directly, without one-hot encoding. With a 16 GB GPU, training on 100k rows with 50 features takes roughly 5 to 10 minutes, depending on the number of iterations and the tree depth. The model file is typically a few MB and fits easily in system RAM.
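A minimal sketch of this scenario with the CatBoost Python package might look like the following. The toy rows, prices, column order, and hyperparameter values are invented for illustration and are not tuned. The `catboost` import sits inside the functions so that the data helper can be inspected without the package installed.

```python
# Toy rows: [neighborhood, roof_type, area_m2]. All values are invented.
CAT_FEATURE_INDICES = [0, 1]  # positions of the categorical columns


def toy_housing_rows():
    features = [
        ["riverside", "tile", 120.0],
        ["riverside", "metal", 95.0],
        ["old_town", "tile", 80.0],
        ["old_town", "slate", 150.0],
        ["hillside", "metal", 110.0],
        ["hillside", "slate", 135.0],
    ]
    prices = [310.0, 255.0, 240.0, 420.0, 280.0, 360.0]  # in thousands
    return features, prices


def train_and_save(features, prices, model_path="model.cbm"):
    # Requires `pip install catboost`; imported lazily on purpose.
    from catboost import CatBoostRegressor

    # Illustrative settings; pass task_type="GPU" to train on a GPU.
    model = CatBoostRegressor(iterations=200, depth=4,
                              learning_rate=0.1, verbose=False)
    # String-valued columns are declared as categorical, not one-hot encoded.
    model.fit(features, prices, cat_features=CAT_FEATURE_INDICES)
    model.save_model(model_path)
    return model


def load_and_predict(model_path, features):
    from catboost import CatBoostRegressor

    model = CatBoostRegressor()
    model.load_model(model_path)
    return model.predict(features)
```

Calling `train_and_save(*toy_housing_rows())` followed by `load_and_predict("model.cbm", [["old_town", "tile", 100.0]])` would train on the toy data and score one new row. A real dataset would normally be loaded into a pandas DataFrame and split into training and validation sets first.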

Workflow example

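In a local ML workflow, an operator might run catboost fit --learn-set train.csv --test-set test.csv --column-description col_desc.cd --model-file model.cbm to train a model. The column description file specifies which column holds the label and which columns are categorical. The CLI expects tab-separated input by default, so comma-separated files also need the --delimiter option. After training, the model is saved as model.cbm (without --model-file, the default output name is model.bin). For inference, catboost calc --input-path test.csv --model-path model.cbm --output-path predictions.txt produces predictions. If the input file still contains the label or categorical columns, the same column description file generally needs to be passed to catboost calc as well. This workflow is common in Kaggle competitions and small-scale production systems.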
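The column description file used in this workflow is a small tab-separated text file with one line per described column: the column index, its type, and an optional name. The sketch below writes such a file from Python. The column layout (label in column 0, 'neighborhood' and 'roof_type' in columns 1 and 2) is a hypothetical example, and the helper names are not part of CatBoost.

```python
from pathlib import Path

# Hypothetical layout of train.csv: price, neighborhood, roof_type, area.
# Columns not listed here (column 3) are treated as numerical features.
COLUMNS = [
    (0, "Label", ""),
    (1, "Categ", "neighborhood"),
    (2, "Categ", "roof_type"),
]


def build_column_description(columns):
    """Return the file contents: 'index<TAB>type[<TAB>name]' per line."""
    lines = []
    for index, col_type, name in columns:
        fields = [str(index), col_type]
        if name:
            fields.append(name)
        lines.append("\t".join(fields))
    return "\n".join(lines) + "\n"


def write_column_description(path, columns=COLUMNS):
    Path(path).write_text(build_column_description(columns), encoding="utf-8")
```

Running `write_column_description("col_desc.cd")` produces a file that can be passed to the `--column-description` option shown above.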

Reviewed by Fredoline Eruo. See our editorial policy.

When it doesn't work