Yoruba Language Models — Local AI for African Markets (Chapter 4)

Yoruba NLP capabilities have expanded significantly through initiatives including the Yoruba WordNet, Bible corpora, and academic language resources from institutions like the University of Lagos. Teams deploying to production need to understand the available models, their architectural trade-offs, and the practical constraints of deployment.

Model options span the capability spectrum. Multilingual models built for African languages often cover Yoruba alongside languages such as Swahili and Hausa; this reflects shared training corpora, not geography, since Swahili is an East African language and Yoruba a West African one. AfriBERTa provides transformer-based embeddings trained on a set of African languages that includes Yoruba. MasakhaNER and related Masakhane projects offer task-specific data and models. Fine-tuning on Yoruba-specific corpora improves performance for specialized domains.

Quantization becomes essential for deployment on constrained hardware. INT8 quantization of a transformer's weights typically cuts memory by about 4x relative to FP32, with minimal accuracy loss. For smaller models, INT4 quantization enables deployment on devices with 1-2 GB of available RAM. The main trade-off lies in calibration dataset selection: calibrating on domain-relevant Yoruba text rather than generic English corpora keeps the quantized value ranges representative of real inputs.
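The memory figures above follow from simple arithmetic on bits per weight. The sketch below is a rough estimate only: it assumes a hypothetical 110-million-parameter encoder and counts weights alone, ignoring activations, quantization scales, and runtime overhead.

```python
def weight_memory_mib(num_params: int, bits_per_weight: int) -> float:
    """Approximate weight storage in MiB at a given numeric precision."""
    return num_params * bits_per_weight / 8 / (1024 ** 2)


# Hypothetical parameter count, chosen only for illustration.
NUM_PARAMS = 110_000_000

footprint = {
    name: weight_memory_mib(NUM_PARAMS, bits)
    for name, bits in [("fp32", 32), ("int8", 8), ("int4", 4)]
}

for name, mib in footprint.items():
    ratio = footprint["fp32"] / mib
    print(f"{name}: {mib:.0f} MiB ({ratio:.0f}x smaller than FP32)")
```

The 4x and 8x ratios hold for any parameter count; whether the result fits in 1-2 GB of RAM depends on everything this estimate leaves out.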

# Quantized Yoruba text classification deployment
import unicodedata

import numpy as np
from onnxruntime import InferenceSession
from tokenizers import BertWordPieceTokenizer

class YorubaClassifier:
    def __init__(self, model_path: str, vocab_path: str, max_length: int = 128):
        # Load quantized ONNX model
        providers = ['CPUExecutionProvider']
        self.session = InferenceSession(model_path, providers=providers)
        inputs = self.session.get_inputs()
        self.input_name = inputs[0].name  # assumed to be the token-ID input
        self.extra_inputs = {i.name for i in inputs[1:]}
        self.output_name = self.session.get_outputs()[0].name

        # Initialize tokenizer. strip_accents=False matters: left unset, it
        # follows `lowercase` and would strip Yoruba tone marks and underdots.
        # These settings must match the ones used when the model was trained.
        self.tokenizer = BertWordPieceTokenizer(
            vocab_path, lowercase=True, strip_accents=False
        )
        self.tokenizer.enable_truncation(max_length=max_length)
        self.tokenizer.enable_padding(length=max_length)
        self.labels = ['farming', 'weather', 'market', 'health', 'general']

    def predict(self, text: str) -> dict:
        # Normalize Yoruba text
        normalized = self._normalize_yoruba(text)

        # Tokenize; truncation and padding were configured in __init__
        encoded = self.tokenizer.encode(normalized)
        feeds = {self.input_name: np.array([encoded.ids], dtype=np.int64)}

        # Feed optional inputs only if the exported model declares them
        optional = {
            'attention_mask': encoded.attention_mask,
            'token_type_ids': encoded.type_ids,
        }
        for name, values in optional.items():
            if name in self.extra_inputs:
                feeds[name] = np.array([values], dtype=np.int64)

        # Run inference; the output is assumed to have shape (batch, num_labels)
        logits = self.session.run([self.output_name], feeds)[0]

        # Convert to probabilities
        probs = self._softmax(logits[0])

        return {
            'prediction': self.labels[int(np.argmax(probs))],
            'confidence': float(np.max(probs)),
            'all_probs': {l: float(p) for l, p in zip(self.labels, probs)}
        }

    def _normalize_yoruba(self, text: str) -> str:
        # Preserve diacritical marks essential for meaning; NFC only unifies
        # equivalent encodings of the same accented character.
        return unicodedata.normalize('NFC', text.strip())

    @staticmethod
    def _softmax(x: np.ndarray) -> np.ndarray:
        exp_x = np.exp(x - np.max(x))  # Subtract max for numerical stability
        return exp_x / np.sum(exp_x)

# Example usage
classifier = YorubaClassifier(
    model_path='yoruba_classifier_int8.onnx',
    vocab_path='yoruba_vocab.txt'
)
result = classifier.predict("Ọjọ́-abamọ́ tí mo ti retí láti oko yóò fẹ́yìn ara rẹ̀")
print(result)
# Illustrative output (actual values depend on the model):
# {'prediction': 'farming', 'confidence': 0.87, ...}
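The reason diacritics need care during normalization can be shown with the standard library alone. The sketch below is illustrative, not a full normalization pipeline: the helper names are hypothetical, and the three example words are a commonly cited set usually glossed as husband, hoe, and vehicle.

```python
import unicodedata


def normalize_yoruba(text: str) -> str:
    """Compose characters canonically (NFC) without removing any marks."""
    return unicodedata.normalize("NFC", text.strip())


def strip_marks(text: str) -> str:
    """What NOT to do: decompose, then drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# The same visible letter typed two ways:
# o + dot below + acute, versus precomposed o-acute + dot below.
variant_a = "o\u0323\u0301"
variant_b = "\u00f3\u0323"
print(variant_a == variant_b)                                      # False
print(normalize_yoruba(variant_a) == normalize_yoruba(variant_b))  # True

# Three words that differ only in the tone mark on the final vowel.
words = ["\u1ecdk\u1ecd", "\u1ecdk\u1ecd\u0301", "\u1ecdk\u1ecd\u0300"]
print({strip_marks(w) for w in words})  # {'oko'}
```

NFC makes differently typed inputs compare equal, while accent stripping collapses distinct words into one token. The same normalization should be applied to the text used to build the tokenizer vocabulary.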

Memory optimization strategies include knowledge distillation from larger models, pruning attention heads with low activation, and vocabulary pruning for domain-specific deployments. A farm advisory application, for example, might remove religious vocabulary while preserving agricultural terms. Because the savings come from the token embedding matrix, this can reduce model size by roughly 15-20% when embeddings account for a large share of the parameters.
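Vocabulary pruning can be sketched as a frequency filter over a domain corpus. Everything in the example below is an assumption made for illustration: the toy vocabulary, the special-token names, the hidden size of 768, and the helper functions. A real deployment would also need to slice the embedding matrix to the kept rows and re-index the tokenizer.

```python
from collections import Counter

# Special tokens are always kept; these names follow BERT-style conventions.
SPECIAL_TOKENS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]


def prune_vocab(vocab, domain_counts, min_count=1):
    """Return (old_index, token) pairs to keep, preserving original order."""
    return [
        (idx, tok)
        for idx, tok in enumerate(vocab)
        if tok in SPECIAL_TOKENS or domain_counts.get(tok, 0) >= min_count
    ]


def embedding_params_saved(old_size, new_size, hidden_size):
    """Parameters removed from the embedding matrix by shrinking the vocabulary."""
    return (old_size - new_size) * hidden_size


# Toy vocabulary: farm-related terms (farm, maize, yam, market)
# plus two religious terms (prayer, prophet).
farm_terms = ["oko", "àgbàdo", "iṣu", "ọjà"]
other_terms = ["àdúrà", "wòlíì"]
vocab = SPECIAL_TOKENS + farm_terms + other_terms

# Stand-in for a tokenized agricultural corpus.
domain_counts = Counter(farm_terms * 3)

kept = prune_vocab(vocab, domain_counts)
kept_rows = [idx for idx, _ in kept]  # rows to retain in the embedding matrix
saved = embedding_params_saved(len(vocab), len(kept), hidden_size=768)
print(f"kept {len(kept)} of {len(vocab)} tokens, saved {saved} embedding parameters")
```

Pruned tokens fall back to the unknown token at inference time, so the frequency threshold should be validated against held-out domain text before deployment.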

Benchmarking Yoruba models requires appropriate datasets. The MasakhaNER corpus provides named entity recognition evaluation data. Bible translation memory systems offer sentence-level alignment for machine translation. Agricultural extension SMS logs, where available, provide realistic domain-specific evaluation. Custom data collection through partnerships with local universities accelerates development while building local AI capacity.
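Whichever dataset is used, topic classifiers like the one in this chapter are often scored with accuracy and macro-averaged F1, since macro-F1 weights rare classes equally. The sketch below assumes gold and predicted labels are already available as lists; the labels shown are made up for illustration and are not results from any real evaluation.

```python
def accuracy(gold, pred):
    """Fraction of examples where the prediction matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred, labels):
    """Unweighted mean of per-label F1 scores."""
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        scores.append(2 * precision * recall / total if total else 0.0)
    return sum(scores) / len(scores)


# Made-up labels for four hypothetical SMS messages.
gold = ["farming", "farming", "weather", "weather"]
pred = ["farming", "weather", "weather", "weather"]

print(f"accuracy: {accuracy(gold, pred):.2f}")
print(f"macro-F1: {macro_f1(gold, pred, ['farming', 'weather']):.2f}")
```

Named entity recognition benchmarks such as MasakhaNER are normally scored at the entity-span level instead, which this label-level sketch does not cover.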