Ijwi ry'Ikirundi AI | Pipeline Architecture

End-to-end data flow: from community contributions to a production-ready dataset on Hugging Face.

Data Flow Overview
| # | Stage | Component | Technology |
|---|-------|-----------|------------|
| 1 | Contribution | Community members submit Kirundi sentences and audio recordings via a web app | GitHub Pages, JavaScript |
| 2 | API Layer | Serverless backend receives submissions and writes them to staging storage | Google Apps Script |
| 3 | Staging | Raw data stored temporarily for batch processing | Google Sheets |
| 4 | ETL Pipeline | Pulls raw data, cleans it with regular expressions, deduplicates, and computes metadata (word count, speaker info) | Python, Pandas |
| 5 | Validation | Automated quality checks, format verification, CI/CD workflows | GitHub Actions |
| 6 | Deployment | Versioned dataset pushed to the Hugging Face Hub via Git LFS | Hugging Face Hub, Git LFS |
| 7 | Consumption | Dataset available for ASR, TTS, machine translation, and NLP research | HF Datasets API |
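The ETL stage (4) could be sketched as below. The column names (`sentence`, `speaker`) and the specific cleaning rules are illustrative assumptions, not the project's actual schema:

```python
import re
import pandas as pd

def clean_sentence(text: str) -> str:
    """Normalize whitespace and strip invisible characters."""
    text = re.sub(r"[\u200b\ufeff]", "", text)  # drop zero-width chars
    text = re.sub(r"\s+", " ", text).strip()    # collapse whitespace
    return text

def run_etl(raw: pd.DataFrame) -> pd.DataFrame:
    """Clean, deduplicate, and enrich raw staged submissions."""
    df = raw.copy()
    df["sentence"] = df["sentence"].astype(str).map(clean_sentence)
    df = df[df["sentence"] != ""]               # drop empty rows
    df = df.drop_duplicates(subset="sentence")  # keep first occurrence
    df["word_count"] = df["sentence"].str.split().str.len()
    return df.reset_index(drop=True)

# Example: two submissions differing only in whitespace collapse to one row.
raw = pd.DataFrame({
    "sentence": ["Amahoro   meza", "Amahoro meza", "Urakoze cane"],
    "speaker": ["s1", "s2", "s3"],
})
clean = run_etl(raw)
```

The exported frame would then be written with `clean.to_csv(...)` for the CSV-first format described below.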
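The validation stage (5) amounts to per-row checks that a CI workflow can run over the exported CSV. A minimal sketch, assuming hypothetical column names and an audio filename convention:

```python
import re

# Assumed schema and filename pattern for illustration only.
REQUIRED_COLUMNS = {"sentence", "speaker", "audio_file"}
AUDIO_PATTERN = re.compile(r"^[\w-]+\.(wav|mp3)$")

def validate_row(row: dict) -> list[str]:
    """Return a list of human-readable errors for one CSV row."""
    errors = []
    missing = REQUIRED_COLUMNS - row.keys()
    if missing:
        errors.append(f"missing columns: {sorted(missing)}")
    if not row.get("sentence", "").strip():
        errors.append("empty sentence")
    audio = row.get("audio_file", "")
    if audio and not AUDIO_PATTERN.match(audio):
        errors.append(f"bad audio filename: {audio!r}")
    return errors

ok = validate_row({"sentence": "Amahoro", "speaker": "s1", "audio_file": "rec-001.wav"})
bad = validate_row({"sentence": "", "speaker": "s1", "audio_file": "x.exe"})
```

A GitHub Actions job would run such checks over every row and fail the workflow if any row returns a non-empty error list.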
Design Principles
  • Zero infrastructure cost: GitHub Pages + Apps Script eliminates hosting expenses
  • Git LFS for audio: versioned storage without bloating the repo
  • CSV-first format: accessible to researchers with limited tooling
  • Fully open-source: every component is public and documented
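The Git LFS principle above typically corresponds to a `.gitattributes` entry along these lines (a sketch; the actual file patterns depend on the repository's audio formats):

```
# Track audio recordings with Git LFS so the repo history stays small
*.wav filter=lfs diff=lfs merge=lfs -text
*.mp3 filter=lfs diff=lfs merge=lfs -text
```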
Tech Stack
Python · Pandas · RegEx · Hugging Face Hub · Git LFS · Google Apps Script · GitHub Pages · GitHub Actions