Modular tools to collect, preprocess, align, and prepare Mooré speech/text data for Text-to-Speech (TTS) and Automatic Speech Recognition (ASR).
This toolkit reproduces the full Moore Speech Corpora data collection pipeline.
We offer modular, CLI-friendly scripts for tasks like:
- Scrape Mooré audio and text from online sources (Bible, YouTube, etc.)
- Segment and align long audio recordings with their texts
- Preprocess audio: resampling, mono conversion
- Normalize Mooré text
- Denoise and enhance audio
- Export datasets to different formats (Hugging Face, LJSpeech, etc.)
The toolkit is structured as follows:
moore-toolkit/
├─ crawlers/             # Bible, YouTube, etc.
├─ preprocessing/        # Resample, normalize text
├─ forced_alignment/     # MMS scripts & wrappers
├─ datasets/             # HF dataset prep & push
├─ utils/                # Shared helpers
├─ environment.yml
└─ README.md
git clone https://github.com/anyantudre/MooreSpeechCorpora.git
cd MooreSpeechCorpora
conda env create -f environment.yml
conda activate mooredata
Python 3.10.11 is highly recommended.
- Data Crawling: crawls data from sources such as the Bible and YouTube.
# example: crawl the Mooré Bible
sh ./crawlers/bible/crawl.sh
See crawlers/README.md for full instructions and more details.
- Preprocessing: resamples audio, converts it to mono, and normalizes text.
# example: resample Mooré data
bash preprocessing/resample.sh \
  --input_folder datasets/moore/bible/raw \
  --output_folder datasets/moore/bible/resampled
See preprocessing/README.md for full instructions and more details.
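To make the two audio steps concrete, here is a minimal pure-Python sketch of mono conversion and resampling. It is an illustration only, not the code behind preprocessing/resample.sh: the function names `to_mono` and `resample_linear` are hypothetical, and linear interpolation is used for brevity. A production pipeline would normally rely on a filtered resampler (for example from ffmpeg or sox) to avoid aliasing.

```python
from typing import List, Sequence


def to_mono(frames: Sequence[Sequence[float]]) -> List[float]:
    """Average the channels of each frame to get one mono sample per frame."""
    return [sum(frame) / len(frame) for frame in frames]


def resample_linear(samples: Sequence[float], src_rate: int, dst_rate: int) -> List[float]:
    """Resample by linear interpolation (illustrative: no anti-aliasing filter)."""
    if not samples or src_rate == dst_rate:
        return list(samples)
    n_out = max(1, round(len(samples) * dst_rate / src_rate))
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate            # position in the source signal
        left = min(int(pos), len(samples) - 1)   # sample at or before pos
        right = min(left + 1, len(samples) - 1)  # next sample, clamped at the end
        frac = pos - left
        out.append(samples[left] * (1.0 - frac) + samples[right] * frac)
    return out


# Four stereo frames, downmixed and then halved in sample rate.
stereo = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
mono = to_mono(stereo)
halved = resample_linear(mono, 16000, 8000)
print(mono, halved)
```

The sample rates 16000 and 8000 are arbitrary values chosen for the demonstration; the target rate used by the toolkit is documented in preprocessing/README.md.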
- Forced Alignment: outputs segmented audio and manifest.json files for each chapter.
# run forced alignment
bash forced_alignment/align_and_segment.sh \
  --audio_folder datasets/moore/bible/resampled \
  --text_folder datasets/moore/bible/resampled \
  --output_folder datasets/moore/bible/aligned \
  --lang mos \
  --uroman_path ../uroman/bin
See forced_alignment/README.md for full instructions and more details.
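Before running alignment it can help to check that every audio file has a transcript. The sketch below pairs files by shared filename stem. That pairing rule, the `.wav` and `.txt` extensions, and the helper name `pair_audio_and_text` are all assumptions made for illustration; the layout the alignment script actually expects is described in forced_alignment/README.md.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple


def pair_audio_and_text(
    audio_folder: Path,
    text_folder: Path,
    audio_ext: str = ".wav",   # assumed extension
    text_ext: str = ".txt",    # assumed extension
) -> Tuple[List[Tuple[Path, Path]], List[Path]]:
    """Return (audio, text) pairs sharing a stem, plus audio files with no text."""
    pairs, missing = [], []
    for audio in sorted(audio_folder.glob(f"*{audio_ext}")):
        text = text_folder / (audio.stem + text_ext)
        if text.is_file():
            pairs.append((audio, text))
        else:
            missing.append(audio)
    return pairs, missing


# Demonstration on a throwaway folder holding both audio and text,
# mirroring the command above where both flags point at the same path.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    for name in ("chapter_01.wav", "chapter_01.txt", "chapter_02.wav"):
        (folder / name).touch()
    pairs, missing = pair_audio_and_text(folder, folder)
    paired_names = [(audio.name, text.name) for audio, text in pairs]
    missing_names = [audio.name for audio in missing]

print(paired_names, missing_names)
```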
- Dataset Preparation/Export: uploads the dataset to the Hugging Face Hub with the columns audio, transcription, duration, and chapter.
python data_export/prepare_hf_dataset.py \
  --input_folder datasets/moore/bible/aligned \
  --repo_id anyantudre/moore-speech-bible \
  --hf_token hf_xxxx
See datasets/README.md for full instructions and more details.
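As a rough illustration of the four columns, the sketch below builds one row for a single WAV segment using only the standard library. It is not the logic of prepare_hf_dataset.py: the helpers `wav_duration_seconds` and `make_row` are hypothetical, the audio column is represented as a plain file path, and the chapter value is assumed to be a string.

```python
import tempfile
import wave
from pathlib import Path
from typing import Dict, Union


def wav_duration_seconds(path: Union[str, Path]) -> float:
    """Duration of a WAV file: number of frames divided by the frame rate."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def make_row(audio_path: Union[str, Path], transcription: str, chapter: str) -> Dict[str, object]:
    """Build one record with the columns named in this README."""
    return {
        "audio": str(audio_path),               # assumption: stored as a path here
        "transcription": transcription.strip(),
        "duration": wav_duration_seconds(audio_path),
        "chapter": chapter,                     # assumption: chapter label as a string
    }


# Demonstration: one second of 16-bit mono silence at 16 kHz.
with tempfile.TemporaryDirectory() as tmp:
    demo_wav = Path(tmp) / "segment_0001.wav"
    with wave.open(str(demo_wav), "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(16000)
        out.writeframes(b"\x00\x00" * 16000)
    row = make_row(demo_wav, "  example transcription ", "chapter_01")

print(row["duration"], row["transcription"], row["chapter"])
```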
- Denoising & Enhancement (optional): applies Resemble Enhance to improve audio quality. Flags control whether enhancement is applied and whether the original audio is kept.
python denoising/denoise_and_push.py \
  --dataset_id anyantudre/moore-speech-bible \
  --output_repo_id anyantudre/moore-speech-bible-denoised \
  --hf_token hf_xxxx \
  --enhance_audio \
  --keep_original_audio
The resulting dataset will include denoised_audio and, optionally, enhanced_audio fields, depending on the flags.
See denoising/README.md for full instructions and parameters.
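The relationship between the flags and the output fields can be sketched as follows. This is a guess at the behaviour rather than the code of denoise_and_push.py: it assumes that both flags are plain boolean switches (they take no value in the command above) and that --keep_original_audio retains the original audio column. The helper names are hypothetical; denoising/README.md remains the authoritative description.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the flags shown in the command above (illustrative only)."""
    parser = argparse.ArgumentParser(description="Illustrative flag handling.")
    parser.add_argument("--dataset_id", required=True)
    parser.add_argument("--output_repo_id", required=True)
    parser.add_argument("--hf_token", default=None)
    parser.add_argument("--enhance_audio", action="store_true")
    parser.add_argument("--keep_original_audio", action="store_true")
    return parser


def expected_audio_columns(enhance_audio: bool, keep_original_audio: bool) -> List[str]:
    """Audio fields expected in the output under the assumptions stated above."""
    columns = ["denoised_audio"]
    if enhance_audio:
        columns.append("enhanced_audio")
    if keep_original_audio:
        columns.insert(0, "audio")  # assumption: original column is named "audio"
    return columns


args = build_parser().parse_args([
    "--dataset_id", "anyantudre/moore-speech-bible",
    "--output_repo_id", "anyantudre/moore-speech-bible-denoised",
    "--enhance_audio",
    "--keep_original_audio",
])
columns = expected_audio_columns(args.enhance_audio, args.keep_original_audio)
print(columns)
```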
Contributions are more than welcome! Please read CONTRIBUTING.md for guidelines on how to get started.
- cawoylel: this repo is largely inspired by their excellent work on the Fula language!
- Facebook AI Research Fairseq for multilingual alignment tools.
- bible.com for Mooré audio/text
- Uroman for romanization
- Resemble Enhance for speech enhancement