Ijwi ry'Ikirundi AI | Case Study
Nov 2025 – Present | Founder & Technical Lead
The Problem
Kirundi is spoken by over 12 million people across Burundi, DR Congo, and diaspora communities, yet it had zero AI-ready datasets. No speech recognition. No machine translation. No NLP models. Major AI platforms (OpenAI, Google, Meta) offered little to no Kirundi support. Without data, the language risked being permanently excluded from the AI era.
The Solution
I designed and built an end-to-end data infrastructure from scratch:
- Zero-cost contribution platform: a web app (GitHub Pages + Google Apps Script as backend API) that lets anyone submit Kirundi sentences and audio recordings without installing anything.
- Automated ETL pipeline: Python scripts (Pandas, RegEx) that pull raw contributions, clean and validate them, deduplicate entries, and compute metadata (word count, character count, speaker info).
- Dataset hosting on Hugging Face: versioned via Git LFS, publicly accessible, and structured for ASR, TTS, and NLP downstream tasks.
- Open-source community: CONTRIBUTING.md, issue templates, and contributor onboarding to scale data collection beyond one person.
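The ETL steps listed above (clean, validate, deduplicate, compute metadata) can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: it uses only the Python standard library instead of Pandas so it runs without dependencies, and the field names (`sentence`, `speaker`), the validation rules, and the sample sentences are all assumptions made for the example.

```python
import re
import unicodedata

WHITESPACE = re.compile(r"\s+")
HAS_LETTER = re.compile(r"[^\W\d_]")          # any Unicode letter
URL = re.compile(r"https?://\S+", re.IGNORECASE)


def clean(text):
    """Normalize Unicode to NFC and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return WHITESPACE.sub(" ", text).strip()


def is_valid(text, min_words=2):
    """Hypothetical validity rule: enough words, has letters, no URL."""
    return (
        len(text.split()) >= min_words
        and HAS_LETTER.search(text) is not None
        and URL.search(text) is None
    )


def process(rows):
    """Clean, validate, deduplicate, and annotate raw contributions."""
    seen = set()
    result = []
    for row in rows:
        sentence = clean(row.get("sentence", ""))
        if not is_valid(sentence):
            continue
        key = sentence.casefold()  # case-insensitive duplicate key
        if key in seen:
            continue
        seen.add(key)
        result.append({
            "sentence": sentence,
            "speaker": row.get("speaker", "unknown"),
            "word_count": len(sentence.split()),
            "char_count": len(sentence),
        })
    return result


# Made-up raw contributions, including a duplicate and a spam entry.
raw = [
    {"sentence": "  Urakoze   cane ", "speaker": "s1"},
    {"sentence": "urakoze cane", "speaker": "s2"},
    {"sentence": "http://example.com spam link", "speaker": "s3"},
    {"sentence": "Ndagukunda cane", "speaker": "s1"},
]
processed = process(raw)
for record in processed:
    print(record)
```

The first occurrence of a duplicate wins here, so the earliest contributor keeps attribution; a real pipeline might prefer a different tie-breaking rule.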
Measurable Results
4,700+ Sentences Collected
32,900+ Words Processed
$0 Infrastructure Cost
... Dataset Downloads
Key Technical Decisions
- Zero hosting cost: chose GitHub Pages + Apps Script over a traditional server, eliminating recurring expenses for a community project in a low-income context.
- Git LFS for audio: audio clips are large; Git LFS allows versioned storage on Hugging Face without bloating the repository.
- CSV-first data format: kept the dataset in simple CSV (not Parquet/Arrow) so contributors and researchers in Burundi with limited tooling can inspect and contribute without specialized software.
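To illustrate the CSV-first decision, the sketch below reads a dataset-shaped CSV using only Python's built-in `csv` module, with no Parquet or Arrow tooling. The column names and the inline sample rows are assumptions for the example, not the dataset's actual schema.

```python
import csv
import io

# Inline sample standing in for the dataset file; column names are assumed.
SAMPLE_CSV = """sentence,speaker,word_count,char_count
Urakoze cane,s1,2,12
Ndagukunda cane,s2,2,15
"""


def summarize(csv_file):
    """Return (sentence_count, total_words) for a CSV with a word_count column."""
    reader = csv.DictReader(csv_file)
    sentences = 0
    words = 0
    for row in reader:
        sentences += 1
        words += int(row["word_count"])
    return sentences, words


sentence_count, total_words = summarize(io.StringIO(SAMPLE_CSV))
print(sentence_count, total_words)
```

For a file on disk, the same function should work with `open(path, newline="", encoding="utf-8")` in place of the `io.StringIO` wrapper.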
Why This Matters
This is not a side project. It's digital sovereignty infrastructure. Every sentence in this dataset moves Kirundi closer to having working speech recognition, machine translation, and AI assistants. The architecture is designed to be replicated for other low-resource African languages. The entire stack is open-source and documented.