# Ijwi ry'Ikirundi AI | Pipeline Architecture
End-to-end data flow: from community contributions to a production-ready dataset on Hugging Face.
## Data Flow Overview
| # | Stage | Component | Technology |
|---|---|---|---|
| 1 | Contribution | Community members submit Kirundi sentences and audio recordings via a web app | GitHub Pages + JavaScript |
| 2 | API Layer | Serverless backend receives submissions and writes to staging storage | Google Apps Script |
| 3 | Staging | Raw data stored temporarily for batch processing | Google Sheets |
| 4 | ETL Pipeline | Pull raw data, clean with RegEx, deduplicate, compute metadata (word count, speaker info) | Python + Pandas |
| 5 | Validation | Automated quality checks, format verification, CI/CD workflows | GitHub Actions |
| 6 | Deployment | Versioned dataset pushed to Hugging Face Hub via Git LFS | Hugging Face Hub + Git LFS |
| 7 | Consumption | Dataset available for ASR, TTS, machine translation, and NLP research | HF Datasets API |
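The ETL stage (4) can be sketched with pandas. This is a minimal illustration, not the project's actual script: the column names (`sentence`, `speaker`) and the exact cleaning rules are assumptions; the real staging-sheet schema may differ.

```python
import re

import pandas as pd

# Collapse runs of whitespace, as in stage 4's RegEx cleaning step.
WHITESPACE_RE = re.compile(r"\s+")


def clean_sentences(df: pd.DataFrame) -> pd.DataFrame:
    """Normalize text, drop empties and duplicates, add word-count metadata.

    Assumes a ``sentence`` column; adapt to the real staging schema.
    """
    out = df.copy()
    # Normalize whitespace and trim each sentence.
    out["sentence"] = (
        out["sentence"]
        .astype(str)
        .str.replace(WHITESPACE_RE, " ", regex=True)
        .str.strip()
    )
    # Drop empty rows, then exact duplicates (keeping the first submission).
    out = out[out["sentence"] != ""].drop_duplicates(subset="sentence")
    # Per-row metadata for the published CSV.
    out["word_count"] = out["sentence"].str.split().str.len()
    return out.reset_index(drop=True)


if __name__ == "__main__":
    raw = pd.DataFrame(
        {"sentence": ["Amahoro  meza ", "Amahoro meza", "Urakoze"]}
    )
    print(clean_sentences(raw))
```

Deduplicating after whitespace normalization matters: two submissions that differ only in spacing should count as one sentence.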
## Design Principles
- Zero infrastructure cost: GitHub Pages and Apps Script eliminate hosting expenses
- Git LFS for audio: versioned storage without bloating the repo
- CSV-first format: accessible to researchers with limited tooling
- Fully open-source: every component is public and documented
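The "Git LFS for audio" principle typically comes down to a few `.gitattributes` rules. The extensions below are illustrative assumptions; the repo may track different audio formats.

```
# Route audio blobs through Git LFS so the Git history stays small
# (these lines are what `git lfs track "*.wav"` etc. would generate)
*.wav filter=lfs diff=lfs merge=lfs -text
*.mp3 filter=lfs diff=lfs merge=lfs -text
```

With these rules in place, commits store lightweight pointer files while the audio itself lives in LFS storage, so the dataset stays versioned without bloating clones.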
## Tech Stack
- Python
- Pandas
- RegEx
- Hugging Face Hub
- Git LFS
- Google Apps Script
- GitHub Pages
- GitHub Actions