A simple, powerful, minimal codebase to generate synthetic data using OpenAI, MistralAI, AnthropicAI, or offline inference with vLLM.
Create a virtual environment and install the following packages:
pip install --upgrade pip
pip install --no-cache-dir -r ./requirements.txt
- Change PROVIDER under params.py to openai, mistral or anthropic
- Change GPT_MODEL, OPENAI_API_KEY and OUTPUT_FILE_PATH under params.py
- Change MISTRALAI_MODEL, MISTRALAI_API_KEY and OUTPUT_FILE_PATH under params.py
- Change ANTHROPICAI_MODEL, ANTHROPICAI_API_KEY and OUTPUT_FILE_PATH under params.py
- Run with python main.py
- Run with sensei_vllm.py
This example generates 100 input-output pairs per iteration by using a local instance of mistralai/Mixtral-8x7B-Instruct-v0.1 for text generation. The script runs an infinite loop and adds samples to an output file after each iteration.
python sensei_vllm.py --model-id mistralai/Mixtral-8x7B-Instruct-v0.1 --backend vllm --tensor-parallel-size 8 --max_len 1024 --dtype float16 --domain lang --outputs ./ --samples_per_iter 100
To use the system prompts for code, change the domain from lang to code (--domain code).
To use a specific group of topics, pass --topics_group.
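The "generate a batch, then append it to the output file" loop described above can be sketched as follows. This is a minimal illustration and not the actual code of sensei_vllm.py: generate_batch is a hypothetical stand-in for the model call, the JSON Lines output format is an assumption, and max_iters is added only so the sketch can be stopped during experiments.

```python
import json
from pathlib import Path


def generate_batch(n, iteration):
    # Hypothetical stand-in for the model call; returns n input-output pairs.
    return [
        {"input": f"question {iteration}-{i}", "output": f"answer {iteration}-{i}"}
        for i in range(n)
    ]


def run(output_path, samples_per_iter=100, max_iters=None):
    # max_iters=None loops forever, like the script described above.
    # Pass a number to stop after that many iterations.
    path = Path(output_path)
    iteration = 0
    written = 0
    while max_iters is None or iteration < max_iters:
        batch = generate_batch(samples_per_iter, iteration)
        # Append mode keeps the samples from earlier iterations intact.
        with path.open("a", encoding="utf-8") as f:
            for sample in batch:
                f.write(json.dumps(sample, ensure_ascii=False) + "\n")
        written += len(batch)
        iteration += 1
    return written
```

Because the file is reopened in append mode on every iteration, interrupting the loop loses at most the batch that was in progress.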
- Change the topics in topics.py
- Change the system contexts in system_messages.py
- Change the number of workers in params.py
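For orientation, the params.py settings named in this README could be laid out roughly like this. It is only a sketch under assumptions: the setting names PROVIDER, GPT_MODEL, OPENAI_API_KEY, MISTRALAI_MODEL, MISTRALAI_API_KEY, ANTHROPICAI_MODEL, ANTHROPICAI_API_KEY and OUTPUT_FILE_PATH come from the steps above, while every value, the NUM_WORKERS name and the provider_settings helper are hypothetical and may not match the real file.

```python
# Hypothetical params.py layout; all values are placeholders.
PROVIDER = "openai"  # one of: openai, mistral, anthropic

OUTPUT_FILE_PATH = "./output.jsonl"  # placeholder path
NUM_WORKERS = 4  # assumed name for the worker-count setting

GPT_MODEL = "your-openai-model"
OPENAI_API_KEY = "your-openai-api-key"

MISTRALAI_MODEL = "your-mistral-model"
MISTRALAI_API_KEY = "your-mistral-api-key"

ANTHROPICAI_MODEL = "your-anthropic-model"
ANTHROPICAI_API_KEY = "your-anthropic-api-key"


def provider_settings(provider=PROVIDER):
    # Hypothetical helper: pick the model and key that match the provider.
    table = {
        "openai": {"model": GPT_MODEL, "api_key": OPENAI_API_KEY},
        "mistral": {"model": MISTRALAI_MODEL, "api_key": MISTRALAI_API_KEY},
        "anthropic": {"model": ANTHROPICAI_MODEL, "api_key": ANTHROPICAI_API_KEY},
    }
    if provider not in table:
        raise ValueError(f"Unknown provider: {provider!r}")
    return table[provider]
```

Failing early on an unknown provider name surfaces a typo in PROVIDER before any API call is made.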