Classical ML algorithms

XGBoost

XGBoost (Extreme Gradient Boosting) is a gradient-boosted decision tree (GBDT) library optimized for structured/tabular data. It builds an ensemble of shallow decision trees sequentially, each one fit to correct the errors of the trees before it. Operators encounter XGBoost when training classification or regression models on tabular datasets such as CSV files. It runs on CPU or GPU, and GPU acceleration can reduce training time from hours to minutes, even on consumer GPUs like an RTX 3060. XGBoost is not a neural network; it is a classical ML algorithm that was in wide use well before LLMs for tasks like click prediction or fraud detection.
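To make the "each tree corrects the previous ones" idea concrete, here is a minimal pure-Python sketch of gradient boosting for squared error, using one-split trees (stumps) on a single feature. It is a toy illustration, not XGBoost's actual algorithm: it omits regularization, second-order gradients, and histogram-based split finding. The function names and the data are made up for the example.

```python
from statistics import mean


def fit_stump(xs, residuals):
    """Pick the threshold whose two leaf means best fit the residuals.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right leaf is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value, right_value = mean(left), mean(right)
        error = (sum((r - left_value) ** 2 for r in left)
                 + sum((r - right_value) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_value, right_value)
    return best[1:]


def predict(base, stumps, learning_rate, x):
    """Sum the base value and the shrunken contribution of every stump."""
    total = base
    for threshold, left_value, right_value in stumps:
        total += learning_rate * (left_value if x <= threshold else right_value)
    return total


def boost(xs, ys, n_rounds=50, learning_rate=0.3):
    """Fit stumps one at a time, each on the residuals left by the ensemble so far."""
    base = mean(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - predict(base, stumps, learning_rate, x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return base, stumps


def mse(base, stumps, learning_rate, xs, ys):
    return mean((y - predict(base, stumps, learning_rate, x)) ** 2 for x, y in zip(xs, ys))


# Made-up toy data: y grows roughly linearly with x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.1, 3.9, 5.2, 5.8]

base, stumps = boost(xs, ys)
baseline_mse = mse(base, [], 0.3, xs, ys)      # predicting the mean only
boosted_mse = mse(base, stumps, 0.3, xs, ys)   # after 50 boosting rounds
print(baseline_mse, boosted_mse)
```

The learning rate plays the same role as shrinkage in XGBoost: each stump contributes only a fraction of its fit, so more rounds are needed but no single tree dominates.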

Deeper dive

XGBoost implements gradient boosting with explicit regularization (L1 and L2 penalties on leaf weights, plus a minimum loss reduction required to make a split) to limit overfitting. It uses a weighted quantile sketch for approximate split finding, which lets it handle large datasets efficiently. Key hyperparameters include n_estimators (number of trees; called num_boost_round in the native xgb.train API), max_depth (tree depth, typically 3-10), learning_rate (shrinkage, alias eta), and subsample (fraction of rows sampled per tree). GPU training requires CUDA. In XGBoost 2.0 and later it is enabled with tree_method='hist' and device='cuda'; older releases used tree_method='gpu_hist', which is now deprecated. For operators, XGBoost is relevant when building pipelines that combine classical models with LLM outputs, e.g., using an LLM to generate features and then feeding them into XGBoost. It is also common in Kaggle competitions and in production systems where tabular data dominates.
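As a sketch of how these hyperparameters fit together, the helper below assembles a parameter dictionary for the native xgb.train API. The function build_params and its defaults are hypothetical, chosen for illustration. It assumes XGBoost 2.0 or later, where the GPU is selected with device='cuda'; on older releases the equivalent setting was tree_method='gpu_hist'.

```python
def build_params(use_gpu: bool = False, max_depth: int = 6,
                 learning_rate: float = 0.1, subsample: float = 0.8) -> dict:
    """Build a params dict for xgboost.train (hypothetical helper, XGBoost 2.0+ assumed)."""
    return {
        "objective": "binary:logistic",      # binary classification with probability output
        "max_depth": max_depth,              # depth of each tree, typically 3-10
        "learning_rate": learning_rate,      # shrinkage per tree; "eta" is an alias
        "subsample": subsample,              # fraction of rows sampled for each tree
        "tree_method": "hist",               # histogram-based split finding
        "device": "cuda" if use_gpu else "cpu",
    }


cpu_params = build_params()
gpu_params = build_params(use_gpu=True, max_depth=8)
```

The number of trees is not part of this dictionary: in the native API it is passed separately as num_boost_round to xgb.train, while the scikit-learn style wrappers take it as n_estimators.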

Practical example

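An operator trains an XGBoost classifier on a 10 GB CSV of customer churn data using an RTX 3090. With 1000 boosting rounds (n_estimators=1000), max_depth=6, learning_rate=0.1, and GPU training enabled (tree_method='gpu_hist' on older releases, or tree_method='hist' with device='cuda' on XGBoost 2.0+), training completes in 3 minutes. On CPU (tree_method='hist'), the same job takes 45 minutes, a 15x difference. The trained model file is ~200 MB and is easily loaded in Python by creating an xgboost.Booster and calling its load_model method.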

Workflow example

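In a typical workflow, an operator runs pip install xgboost and then writes a Python script: import xgboost as xgb; dtrain = xgb.DMatrix('train.csv?format=csv&label_column=0'); params = {'objective':'binary:logistic', 'tree_method':'hist', 'device':'cuda', 'max_depth':6}; model = xgb.train(params, dtrain, num_boost_round=1000). Note that loading a CSV directly requires the ?format=csv&label_column=N suffix so XGBoost knows the file format and which column holds the label, and that device='cuda' replaces the older tree_method='gpu_hist' setting in XGBoost 2.0+. The model is saved with model.save_model('churn.json') (JSON or UBJSON is the recommended format in current releases) and later loaded for inference on CPU or GPU. Operators monitor GPU utilization with nvidia-smi during training to confirm the GPU is actually being used.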
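The workflow above can be laid out as a short script. This is a sketch under several assumptions: XGBoost 2.0 or later (for the device parameter), a header-free numeric train.csv with the label in column 0, and a JSON model file named churn.json. The helpers csv_uri, train_and_save, and load_model are hypothetical names for this example. XGBoost's built-in text parser is limited and its behavior varies by version, so for anything beyond a simple file, loading the data with pandas or NumPy first is usually the safer route.

```python
def csv_uri(path: str, label_column: int = 0) -> str:
    """Build the URI form XGBoost uses to read a CSV file directly."""
    return f"{path}?format=csv&label_column={label_column}"


def train_and_save(train_csv: str, model_path: str = "churn.json", use_gpu: bool = True):
    """Train a binary classifier and save it; requires xgboost to be installed."""
    import xgboost as xgb  # imported here so csv_uri stays usable without xgboost

    dtrain = xgb.DMatrix(csv_uri(train_csv))
    params = {
        "objective": "binary:logistic",
        "tree_method": "hist",
        "device": "cuda" if use_gpu else "cpu",  # XGBoost 2.0+; older: tree_method="gpu_hist"
        "max_depth": 6,
    }
    model = xgb.train(params, dtrain, num_boost_round=1000)
    model.save_model(model_path)  # format is inferred from the .json extension
    return model


def load_model(model_path: str = "churn.json"):
    """Load a saved model into a fresh Booster for inference."""
    import xgboost as xgb

    booster = xgb.Booster()
    booster.load_model(model_path)
    return booster


if __name__ == "__main__":
    print(csv_uri("train.csv"))
    # Uncomment once train.csv exists and xgboost is installed:
    # train_and_save("train.csv")
    # booster = load_model()
```

While train_and_save is running, nvidia-smi in a second terminal should show the Python process and nonzero GPU utilization if the device setting took effect.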

Reviewed by Fredoline Eruo. See our editorial policy.

Buyer guides

When it doesn't work