This is the code used for creating GermanRAG, a German dataset for fine-tuning LLMs on Retrieval Augmented Generation (RAG) tasks.
- Install the requirements with `pip install -r requirements.txt`.
- Generate `germandpr_subset.jsonl` with `python germandpr.py`.
- Clone Airoboros, `pip install -e .` there and copy `germandpr_subset.jsonl` as well as `config_germanrag.yaml` into the root directory.
- Copy `airoboros/instructors/germanrag.py` and `airoboros/instructors/prompts/germanrag.txt` from this repo to the respective directories in Airoboros.
- Add `from airoboros.instructors.germanrag import generate as germanrag_generator` here.
- Add `"germanrag": germanrag_generator` here.
- Run `airoboros generate-instructions --config-path config_germanrag.yaml`.
- Copy your generated `instructions.jsonl` back into this repo's root directory.
- Optional: Validate generations with `python validate_generations.py`.
- Run `python germanrag.py` to generate the final dataset.
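
Before copying `instructions.jsonl` back, it can help to confirm that the file is well-formed JSON Lines. The following is a minimal sketch of such a check and is not a substitute for `validate_generations.py`, whose actual checks are not described here. The helper name `check_jsonl` is hypothetical, and the only assumption about the data is that every non-empty line holds one JSON object.

```python
import json
from pathlib import Path


def check_jsonl(path):
    """Return (valid_count, bad_line_numbers) for a JSON Lines file.

    Hypothetical helper: it only checks that each non-empty line parses
    as a JSON object; it knows nothing about the fields inside.
    """
    valid = 0
    bad_lines = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                bad_lines.append(line_number)
                continue
            # Assumption: one JSON object (dict) per line.
            if isinstance(record, dict):
                valid += 1
            else:
                bad_lines.append(line_number)
    return valid, bad_lines
```

Calling `check_jsonl("instructions.jsonl")` returns the number of usable records and the 1-based numbers of any lines that failed to parse, which makes truncated or partially written generations easy to spot.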
- Choose how to deduplicate/collapse the contexts in GermanDPR, i.e. on shortest, longest, first/random answer span.
- Fix function for three-sentence context window.
- Experimental/Optional: Finish chopping and mixing of contexts on chunk level.
- Add (true) negatives beyond hard negatives, by pairing with random/dissimilar contexts.
- Generalize to more datasets in SQuAD format.
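
As an illustration of the first to-do item above (collapsing duplicate contexts by answer span), here is one possible sketch. It is not the repo's implementation: the function name `collapse_contexts` and the record keys `"context"` and `"answer"` are hypothetical and do not reflect the actual GermanDPR schema. A random choice could be added with `random.choice(group)`; the sketch uses "first" to stay deterministic.

```python
from collections import defaultdict


def collapse_contexts(records, strategy="shortest"):
    """Keep one record per distinct context, chosen by answer-span length.

    records: iterable of dicts with the hypothetical keys "context"
             and "answer" (not the real GermanDPR field names).
    strategy: "shortest", "longest", or "first".
    """
    if strategy not in {"shortest", "longest", "first"}:
        raise ValueError(f"unknown strategy: {strategy!r}")

    # Group records that share the same context text; dicts keep
    # insertion order, so contexts stay in order of first appearance.
    grouped = defaultdict(list)
    for record in records:
        grouped[record["context"]].append(record)

    collapsed = []
    for group in grouped.values():
        if strategy == "shortest":
            chosen = min(group, key=lambda r: len(r["answer"]))
        elif strategy == "longest":
            chosen = max(group, key=lambda r: len(r["answer"]))
        else:  # "first"
            chosen = group[0]
        collapsed.append(chosen)
    return collapsed
```

Which strategy fits best probably depends on the downstream use: shorter spans tend to give more extractive answers, while longer spans keep more of the surrounding phrasing.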
- The GermanRAG dataset is derived from GermanDPR, see 'Acknowledgments' in the dataset card.
- Airoboros by Jon Durbin, consider giving a tip ;)
Feel free to open issues/PRs and come join us in our Discord! 😊
Check out our models at DiscoResearch 🪩🧪.