# Ijwi ry'Ikirundi AI | Pipeline Architecture
End-to-end data flow: from community contributions to a production-ready dataset on Hugging Face.
## Data Flow Overview
| # | Stage | Component | Technology |
|---|---|---|---|
| 1 | Contribution | Community members submit Kirundi sentences and audio recordings via a web app | GitHub Pages + JavaScript |
| 2 | API Layer | Serverless backend receives submissions and writes to staging storage | Google Apps Script |
| 3 | Staging | Raw data stored temporarily for batch processing | Google Sheets |
| 4 | ETL Pipeline | Pull raw data, clean with RegEx, deduplicate, compute metadata (word count, speaker info) | Python + Pandas |
| 5 | Validation | Automated quality checks, format verification, CI/CD workflows | GitHub Actions |
| 6 | Deployment | Versioned dataset pushed to Hugging Face Hub via Git LFS | Hugging Face Hub + Git LFS |
| 7 | Consumption | Dataset available for ASR, TTS, machine translation, and NLP research | HF Datasets API |
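The ETL stage (4) can be sketched with pandas. This is a minimal illustration, not the project's actual script: the column names (`sentence`, `speaker`) and the exact cleaning rules are assumptions; the real staging-sheet schema may differ.

```python
import re

import pandas as pd

# Collapse runs of whitespace, as in stage 4's RegEx cleaning step.
WHITESPACE_RE = re.compile(r"\s+")


def clean_sentences(df: pd.DataFrame) -> pd.DataFrame:
    """Normalize text, drop empties and duplicates, add word-count metadata.

    Assumes a ``sentence`` column; adapt to the real staging schema.
    """
    out = df.copy()
    # Normalize whitespace and trim each sentence.
    out["sentence"] = (
        out["sentence"]
        .astype(str)
        .str.replace(WHITESPACE_RE, " ", regex=True)
        .str.strip()
    )
    # Drop empty rows, then exact duplicates (keeping the first submission).
    out = out[out["sentence"] != ""].drop_duplicates(subset="sentence")
    # Per-row metadata for the published CSV.
    out["word_count"] = out["sentence"].str.split().str.len()
    return out.reset_index(drop=True)


if __name__ == "__main__":
    raw = pd.DataFrame(
        {"sentence": ["Amahoro  meza ", "Amahoro meza", "Urakoze"]}
    )
    print(clean_sentences(raw))
```

Deduplicating after whitespace normalization matters: two submissions that differ only in spacing should count as one sentence.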
## Design Principles
- Zero infrastructure cost: GitHub Pages and Apps Script eliminate hosting expenses
- Git LFS for audio: versioned storage without bloating the repo
- CSV-first format: accessible to researchers with limited tooling
- Fully open-source: every component is public and documented
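The "Git LFS for audio" principle typically comes down to a few `.gitattributes` rules. The extensions below are illustrative assumptions; the repo may track different audio formats.

```
# Route audio blobs through Git LFS so the Git history stays small
# (these lines are what `git lfs track "*.wav"` etc. would generate)
*.wav filter=lfs diff=lfs merge=lfs -text
*.mp3 filter=lfs diff=lfs merge=lfs -text
```

With these rules in place, commits store lightweight pointer files while the audio itself lives in LFS storage, so the dataset stays versioned without bloating clones.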
## Tech Stack
- Python
- Pandas
- RegEx
- Hugging Face Hub
- Git LFS
- Google Apps Script
- GitHub Pages
- GitHub Actions