04. Yoruba Language Models
Yoruba NLP capabilities have expanded significantly through initiatives including the Yoruba WordNet, Bible corpora, and academic language resources from institutions like the University of Lagos. Teams deploying to production need to understand the available models, their architectural trade-offs, and practical deployment considerations.
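Model options span the capability spectrum. Multilingual models whose training data is dominated by higher-resource languages such as Swahili often include Yoruba as a smaller secondary share; this reflects how multilingual corpora are assembled rather than geography, since Yoruba is spoken in West Africa and Swahili in East Africa. AfriBERTa provides transformer-based embeddings trained on a set of African languages that includes Yoruba. masakhane-ner and related projects offer task-specific models. Fine-tuning on Yoruba-specific corpora improves performance for specialized domains.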
Quantization becomes essential for deployment on constrained hardware. INT8 quantization of transformer models typically achieves about a 4x memory reduction relative to FP32 weights, with minimal accuracy loss. For smaller models, INT4 quantization enables deployment on devices with 1-2 GB of available RAM. How much accuracy survives depends on the calibration dataset: calibrate with domain-relevant Yoruba text rather than generic English corpora.
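The 4x figure follows from storage width: an FP32 weight occupies four bytes and an INT8 weight occupies one. The following is a minimal NumPy sketch of symmetric per-tensor weight quantization, not the method of any particular toolkit. The function names `quantize_int8` and `dequantize` and the random weight matrix are illustrative assumptions.

```python
import numpy as np


def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor quantization: map FP32 values onto [-127, 127]."""
    max_abs = float(np.max(np.abs(weights)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate FP32 values from the INT8 codes."""
    return q.astype(np.float32) * scale


# Hypothetical weight matrix standing in for one transformer layer
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.02, size=(768, 768)).astype(np.float32)

q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)

memory_ratio = weights.nbytes / q_weights.nbytes  # 4 bytes vs 1 byte per weight
max_error = float(np.max(np.abs(weights - restored)))  # bounded by about scale / 2

print(f"memory ratio: {memory_ratio:.1f}x")
print(f"max abs error: {max_error:.6f} (scale = {scale:.6f})")
```

This sketch covers weights only, which need no calibration set. Calibration text matters for static quantization, where activation ranges are estimated from sample inputs, and that is where domain-relevant Yoruba text comes in.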
```python
# Quantized Yoruba text classification deployment
import unicodedata

import numpy as np
from onnxruntime import InferenceSession
from tokenizers import BertWordPieceTokenizer


class YorubaClassifier:
    def __init__(self, model_path: str, vocab_path: str, max_length: int = 128):
        # Load quantized ONNX model
        providers = ['CPUExecutionProvider']
        self.session = InferenceSession(model_path, providers=providers)
        self.input_name = self.session.get_inputs()[0].name
        self.output_name = self.session.get_outputs()[0].name
        # Initialize tokenizer. strip_accents=False is required: with the
        # default lowercasing, accents (Yoruba tone marks and under-dots)
        # would otherwise be stripped.
        self.tokenizer = BertWordPieceTokenizer(vocab_path, strip_accents=False)
        self.tokenizer.enable_truncation(max_length=max_length)
        self.tokenizer.enable_padding(length=max_length)
        self.labels = ['farming', 'weather', 'market', 'health', 'general']

    def predict(self, text: str) -> dict:
        # Normalize Yoruba text
        normalized = self._normalize_yoruba(text)
        # Tokenize with truncation and padding (configured in __init__)
        encoded = self.tokenizer.encode(normalized)
        input_ids = np.array([encoded.ids], dtype=np.int64)
        attention_mask = np.array([encoded.attention_mask], dtype=np.int64)
        # Run inference. Assumes the model was exported with exactly two
        # inputs: token ids first, plus one named 'attention_mask'.
        logits = self.session.run(
            [self.output_name],
            {self.input_name: input_ids, 'attention_mask': attention_mask}
        )[0]
        # Convert to probabilities
        probs = self._softmax(logits[0])
        return {
            'prediction': self.labels[int(np.argmax(probs))],
            'confidence': float(np.max(probs)),
            'all_probs': {l: float(p) for l, p in zip(self.labels, probs)}
        }

    def _normalize_yoruba(self, text: str) -> str:
        # Preserve diacritical marks essential for meaning; NFC gives
        # composed and decomposed spellings the same code points
        return unicodedata.normalize('NFC', text.strip())

    @staticmethod
    def _softmax(x: np.ndarray) -> np.ndarray:
        exp_x = np.exp(x - np.max(x))  # Subtract max for numerical stability
        return exp_x / np.sum(exp_x)


# Example usage
classifier = YorubaClassifier(
    model_path='yoruba_classifier_int8.onnx',
    vocab_path='yoruba_vocab.txt'
)
result = classifier.predict("Ọjọ́-abamọ́ tí mo ti retí láti oko yóò fẹ́yìn ara rẹ̀")
print(result)
# Illustrative output (actual values depend on the trained model):
# {'prediction': 'farming', 'confidence': 0.87, ...}
```
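The diacritic-preserving step in `_normalize_yoruba` matters because tone marks and under-dots distinguish otherwise identical words. The sketch below uses only the standard library to contrast two operations: NFC normalization, which makes equivalent spellings compare equal, and diacritic stripping, which is lossy. The helper names and word list are illustrative, and the glosses in the comments are the commonly cited ones rather than a verified lexicon.

```python
import unicodedata


def normalize_nfc(text: str) -> str:
    """Compose characters where possible so equivalent spellings compare equal."""
    return unicodedata.normalize("NFC", text.strip())


def strip_diacritics(text: str) -> str:
    """Lossy: decompose, then drop every combining mark (tone marks, under-dots)."""
    decomposed_text = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed_text if not unicodedata.combining(ch))


# The same word as it might arrive from two keyboards: composed vs decomposed
composed = unicodedata.normalize("NFC", "ọkọ́")
decomposed = unicodedata.normalize("NFD", "ọkọ́")
same_after_nfc = normalize_nfc(composed) == normalize_nfc(decomposed)

# Commonly cited glosses: oko 'farm', ọkọ 'husband', ọkọ́ 'hoe', ọkọ̀ 'vehicle'
words = ["oko", "ọkọ", "ọkọ́", "ọkọ̀"]
collapsed = {strip_diacritics(w) for w in words}

print(composed == decomposed)  # raw strings differ at the code point level
print(same_after_nfc)          # NFC makes them identical
print(collapsed)               # stripping merges all four words into one form
```

A tokenizer or preprocessing step that strips accents would therefore hand the classifier the same input for "farm", "husband", "hoe", and "vehicle", so it is worth confirming the tokenizer's accent handling before training.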
Memory optimization strategies include knowledge distillation from larger models, pruning attention heads that contribute little to task performance, and vocabulary pruning for domain-specific deployments. A farm advisory application, for example, might remove religious vocabulary while preserving agricultural terms, which can reduce model size by roughly 15-20%.
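Vocabulary pruning can be sketched as row selection on the embedding matrix. The example below is a simplified illustration under stated assumptions: tokens are split on whitespace rather than by WordPiece, the vocabulary and corpus are toy data, and `prune_vocabulary` is a hypothetical helper, not a library function.

```python
import numpy as np


def prune_vocabulary(embeddings, vocab, domain_corpus, special_tokens):
    """Keep embedding rows only for tokens seen in the domain corpus,
    plus special tokens. Returns the smaller matrix and a remapped vocab."""
    seen = {tok for sentence in domain_corpus for tok in sentence.split()}
    keep = [tok for tok in vocab if tok in seen or tok in special_tokens]
    old_ids = [vocab[tok] for tok in keep]
    new_vocab = {tok: new_id for new_id, tok in enumerate(keep)}
    return embeddings[old_ids], new_vocab


# Toy vocabulary (token -> row id); real vocabularies hold tens of thousands of entries
vocab = {
    "[PAD]": 0, "[UNK]": 1,
    "oko": 2, "òjò": 3, "àgbàdo": 4, "ọjà": 5,   # agricultural-domain tokens
    "àdúrà": 6, "orin": 7, "ìwé": 8,              # tokens unused in this domain
}
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocab), 8)).astype(np.float32)

# Hypothetical domain corpus standing in for agricultural advisory messages
domain_corpus = ["òjò oko àgbàdo", "ọjà àgbàdo"]

pruned, new_vocab = prune_vocabulary(
    embeddings, vocab, domain_corpus, special_tokens={"[PAD]", "[UNK]"}
)
reduction = 1 - pruned.shape[0] / embeddings.shape[0]
print(f"rows: {embeddings.shape[0]} -> {pruned.shape[0]} ({reduction:.0%} fewer)")
```

After pruning, the tokenizer must use the remapped ids, any output layer tied to the embeddings needs the same row selection, and removed tokens fall back to `[UNK]`. The achievable saving depends on how much of the model's size sits in the embedding table.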
Benchmarking Yoruba models requires appropriate datasets. The MasakhaNER corpus provides named entity recognition evaluation data. Bible translation memory systems offer sentence-aligned text for evaluating machine translation. Agricultural extension SMS logs, where available, provide realistic domain-specific evaluation data. Custom data collection through partnerships with local universities can accelerate development while building local AI capacity.
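For a classification benchmark with uneven class sizes, macro-averaged F1 is a common complement to accuracy because it weights every label equally. The following sketch compares two sets of predictions on the five labels used earlier. The gold labels and both prediction lists are invented toy data, not results from any real model.

```python
from collections import Counter

LABELS = ["farming", "weather", "market", "health", "general"]


def accuracy(gold, predicted):
    """Fraction of examples where the prediction matches the gold label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def macro_f1(gold, predicted, labels):
    """Unweighted mean of per-label F1. A label with no gold or predicted
    examples scores 0.0 here; other conventions skip such labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(labels)


# Hypothetical evaluation set and predictions (toy data for illustration only)
gold       = ["farming", "farming", "weather", "market",  "health", "general"]
base_pred  = ["farming", "general", "weather", "general", "health", "general"]
tuned_pred = ["farming", "farming", "weather", "market",  "health", "farming"]

base_f1 = macro_f1(gold, base_pred, LABELS)
tuned_f1 = macro_f1(gold, tuned_pred, LABELS)
print(f"base:  acc={accuracy(gold, base_pred):.2f}  macro-F1={base_f1:.2f}")
print(f"tuned: acc={accuracy(gold, tuned_pred):.2f}  macro-F1={tuned_f1:.2f}")
```

The same two functions apply unchanged to predictions from a base model and a fine-tuned model, provided both are scored on the same held-out examples.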
Fine-tune a small Yoruba language model on agricultural SMS data. Evaluate performance improvements over the base model on agricultural terminology classification.