A sophisticated Reddit community simulation platform that uses local Large Language Models (LLMs) to simulate realistic user interactions and social dynamics on Reddit communities. The system generates AI agents with distinct personas, scrapes real Reddit data, and runs time-based simulations of user behavior including posting, commenting, and engagement patterns.
This simulator works with the following backends (as indicated by the driver filenames):
- Ollama locally hosted models (driver.py)
- Gemini API (driver2.py)
- OpenRouter models (driver3.py)
- Project Overview
- Simulation Flow
- Quick Start
- Project Structure
- Prerequisites
- Installation
- Environment Configuration
- Development Workflow
- Running Simulations
- Troubleshooting
Oasis2.0 simulates Reddit communities by:
- Agent Generation: Creating thousands of AI agents with unique personas based on real Reddit user profiles
- Reddit Data Scraping: Fetching real posts and user data from specified subreddits
- Social Simulation: Running time-stepped simulations where agents interact with posts through likes, comments, and shares
- Recommendation Engine: Implementing a "For You Page" algorithm that surfaces relevant content to users
- Behavioral Analysis: Tracking and logging user engagement patterns over time
Here is a bird's eye view diagram of the simulation:
This is a quick start tutorial on using Gemini API models.
Get up and running in 5 minutes:
- Clone and setup the project
git clone <repository-url>
cd oasis2.0
pip install -r requirements.txt
- Get your own Gemini API key from Google AI Studio. Then, in driver2.py, configure these settings:
...
API_KEY = os.getenv("GEMINI_API_KEY", "YOUR_API_KEY")
genai.configure(api_key=API_KEY)
MODEL_NAME = "gemini-2.5-flash" #or any gemini model
...
- Customise your input options here:
START_TIME = datetime(2025, 7, 9, 0, 0, 0)  # start time of the simulation
TIMESTEP_DAYS = 1       # timestep length; can be configured in hours/minutes/etc.
NUM_TIMESTEPS = 10      # number of timesteps
ONLINE_RATE = 0.0075    # ~0.75% of users online per timestep
SUBREDDIT_SIZE = 43000  # number of followers of the subreddit being modelled
subreddit = "SecurityCamera"  # subreddit name
POSTS_FILE = "posts/posts.json"     # posts path; all posts live in the posts directory
AGENTS_FILE = "agents/agents.json"  # agents path; all agents live in the agents directory
OUTPUT_DIR = "output"
POSTS_OUT_FILE = os.path.join(OUTPUT_DIR, "posts", f"{subreddit}/posts_2.csv")  # where the log file is saved
- Run your first simulation
python driver2.py
That's it! The simulation will start running with default settings for the SecurityCamera subreddit. Results will be saved in the output/ directory.
💡 First-time setup tip: The simulation uses pre-generated agents and posts. For custom subreddits, run python posts/reddit_scraper.py first to collect fresh data.
oasis2.0/
├── agents/                          (not used at this point)
│   ├── agent_generator.py           # Generate AI agents from user profiles
│   ├── agents_{subreddit_name}.json # Generated agent database
│   └── scrape_users_profiles.py     # Scrape user profiles from Reddit
├── posts/
│   ├── reddit_scraper.py            # Scrape Reddit posts and data
│   ├── analyse_posts.py             # Process post engagement data (not used at this point)
│   └── posts.json                   # Scraped Reddit posts
├── recommendation/
│   └── fyp.py                       # "For You Page" recommendation engine
├── output/
│   ├── logs/                        # Simulation execution logs (not used)
│   └── posts/                       # Post engagement results
├── prompts/                         # LLM prompt templates
├── train/                           # Training data and models
│   ├── prompt_gen.py                # Generate a subreddit-specific prompt
│   └── (other files)                # Personal testing files; can be ignored
├── validation/                      # Validation scripts and data
├── driver.py                        # Simulation using Ollama
├── driver2.py                       # Simulation using Gemini API
└── driver3.py                       # Simulation using OpenRouter models
Before setting up the project, ensure your system meets the following requirements:
- RAM: Minimum 8GB (16GB+ recommended for larger simulations)
- Storage: At least 10GB free space for models and simulation data
- Internet Connection: Required for initial setup and Reddit data scraping
Follow these steps to set up the project locally:
git clone https://github.com/b1-ing/oasis2.0
cd oasis2.0
# Create virtual environment
python -m venv venv
# Activate virtual environment
# Windows:
venv\Scripts\activate
# macOS/Linux:
source venv/bin/activate
pip install -r requirements.txt
For driver.py using Ollama, check the following:
# Check if Ollama is running
ollama list
# Pull required model (if not already done)
ollama pull llama3
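Before launching a full run, you can confirm the model actually responds by hitting Ollama's local REST endpoint directly. This is a minimal sanity-check sketch, assuming the default Ollama port (11434) and the llama3 model pulled above:

```python
# quick sanity check against the local Ollama server (default port 11434)
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": "Reply with OK.", "stream": False},
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["response"])  # a short reply confirms driver.py can reach the model
```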
For the Gemini API using driver2.py, get your own Gemini API key from Google AI Studio. Then, in driver2.py, configure these settings:
...
API_KEY = os.getenv("GEMINI_API_KEY", "YOUR_API_KEY")
genai.configure(api_key=API_KEY)
MODEL_NAME = "gemini-2.5-flash" #or any gemini model
...
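To verify the key and model name before running driver2.py, a quick check with the same google-generativeai client can help (a minimal sketch using the settings above):

```python
# sanity check for the Gemini configuration used in driver2.py
import os
import google.generativeai as genai

genai.configure(api_key=os.getenv("GEMINI_API_KEY", "YOUR_API_KEY"))
model = genai.GenerativeModel("gemini-2.5-flash")
response = model.generate_content("Reply with OK.")
print(response.text)  # a short reply confirms the key and model name are valid
```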
For OpenRouter, get an API key from the OpenRouter website. Then, configure these settings in driver3.py:
API_KEY = "YOUR_API_KEY"
MODEL_NAME = "MODEL_NAME" # copy model name from openrouter
model="..." # shortform name for the selected model (for file path naming)
Key configuration parameters are set in the driver scripts:
START_TIME = datetime(SET_TIME_HERE)  # start time of the simulation
TIMESTEP_DAYS = 1       # timestep length; can be configured in hours/minutes/etc.
NUM_TIMESTEPS = 10      # number of timesteps
ONLINE_RATE = 0.0075    # ~0.75% of users online per timestep
SUBREDDIT_SIZE = 43000  # number of followers of the subreddit being modelled
subreddit = "SecurityCamera"  # subreddit name
POSTS_FILE = "posts/posts.json"     # posts path; all posts live in the posts directory
AGENTS_FILE = "agents/agents.json"  # agents path; all agents live in the agents directory
OUTPUT_DIR = "output"
POSTS_OUT_FILE = os.path.join(OUTPUT_DIR, "posts", f"{subreddit}/posts_2.csv")  # where the log file is saved
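To get a feel for what these values imply, the sketch below walks the simulated clock forward and estimates how many agents come online each step. It only mirrors the arithmetic suggested by the parameters above; the drivers' internal loop may differ in detail:

```python
# rough timeline implied by the configuration above
from datetime import datetime, timedelta

START_TIME = datetime(2025, 7, 9, 0, 0, 0)
TIMESTEP_DAYS = 1
NUM_TIMESTEPS = 10
ONLINE_RATE = 0.0075
SUBREDDIT_SIZE = 43000

online_per_step = int(SUBREDDIT_SIZE * ONLINE_RATE)  # ~322 agents online per timestep
for step in range(NUM_TIMESTEPS):
    sim_time = START_TIME + timedelta(days=TIMESTEP_DAYS * step)
    print(f"timestep {step}: {sim_time:%Y-%m-%d %H:%M}, ~{online_per_step} agents online")
```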
The system includes pre-generated user profiles:
- agents_SecurityCamera.json - Reddit user profiles from the SecurityCamera subreddit
- agents_NationalServiceSG.json - specialised profiles for the National Service subreddit
There are post datasets included as part of this repo:
- posts_SecurityCamera.json - a dataset of 189 posts from the SecurityCamera subreddit
- posts_NationalServiceSG.json - a dataset of 276 posts from the National Service subreddit
Keep building onto these existing datasets, and try the simulator on new subreddits!
Instructions to scrape data are below:
Before running simulations, scrape Reddit posts for your target subreddit:
python posts/reddit_scraper.py
Configure the following options:
if __name__ == "__main__":
    subreddit = SUBREDDIT_NAME
    posts = fetch_reddit_json(subreddit, limit=100)
    save_to_json(posts, subreddit)
Note that the Reddit API limits the number of posts that can be scraped to 100 per API call; hence the need to run the scraper program at intervals of a few days to a week to keep building the dataset!
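For orientation, a fetch/save pair along these lines could look like the sketch below. It assumes the scraper reads Reddit's public JSON listing (https://www.reddit.com/r/&lt;subreddit&gt;/new.json), which is what caps each call at 100 posts; the exact fields stored by posts/reddit_scraper.py may differ:

```python
# sketch of a fetch/save pair using Reddit's public JSON listing
# (the fields kept here are illustrative; check posts/reddit_scraper.py for the real ones)
import json
import requests

def fetch_reddit_json(subreddit, limit=100):
    url = f"https://www.reddit.com/r/{subreddit}/new.json"
    resp = requests.get(url, params={"limit": limit}, headers={"User-Agent": "oasis2.0"})
    resp.raise_for_status()
    return [
        {
            "id": c["data"]["id"],
            "title": c["data"]["title"],
            "selftext": c["data"]["selftext"],
            "score": c["data"]["score"],
            "num_comments": c["data"]["num_comments"],
            "created_utc": c["data"]["created_utc"],
        }
        for c in resp.json()["data"]["children"]
    ]

def save_to_json(posts, subreddit, path="posts/posts.json"):
    with open(path, "w", encoding="utf-8") as f:
        json.dump(posts, f, indent=2)
```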
Then, run this to get the posts with the actual engagement scores in csv format:
python validation/reddit_scraper_validation.py
You will be benchmarking the model results against this csv you generate.
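Once both files exist, a quick comparison can be done with pandas. This is only a sketch: the file paths and the post_id / engagement_score column names are placeholders, so adjust them to whatever the two CSVs actually contain:

```python
# compare simulated engagement against scraped ground truth (hypothetical paths and columns)
import pandas as pd

sim = pd.read_csv("output/posts/SecurityCamera/posts_2.csv")   # simulation output
actual = pd.read_csv("validation/SecurityCamera_actual.csv")   # validation scraper output

merged = sim.merge(actual, on="post_id", suffixes=("_sim", "_actual"))
corr = merged["engagement_score_sim"].corr(merged["engagement_score_actual"])
print(f"correlation between simulated and actual engagement: {corr:.3f}")
```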
There is also an option to scrape user profiles from a target subreddit; this fetches the 100 user profiles with the most recent activity on the subreddit. **Note that this is not needed for the simulation at this stage**; it provides an option for further customisation down the road.
To run this, you need a Reddit client ID and secret for PRAW (the Python Reddit API Wrapper), which can be generated from your Reddit app preferences.
Then, run:
python agents/scrape_users_profiles.py
Configure the following options:
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="oasis2.0",
)
subreddit_name = "SUBREDDIT_NAME"
Note that the Reddit API limits the number of profiles that can be scraped to 100 per API call.
This is a useful addition for further customisation in the future.
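For orientation, the kind of PRAW loop involved might look like the sketch below; it assumes the approach is to collect distinct authors from the subreddit's most recent comments, and the output filename is illustrative (see agents/scrape_users_profiles.py for the actual logic and fields):

```python
# sketch: collect up to 100 recently active user profiles from a subreddit via PRAW
import json
import praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="oasis2.0",
)

subreddit_name = "SUBREDDIT_NAME"
profiles = {}
for comment in reddit.subreddit(subreddit_name).comments(limit=1000):
    author = comment.author
    if author is None or author.name in profiles:
        continue  # skip deleted accounts and users we have already seen
    profiles[author.name] = {
        "name": author.name,
        "comment_karma": author.comment_karma,
        "link_karma": author.link_karma,
    }
    if len(profiles) >= 100:
        break

# illustrative output path; the real script may name and structure this differently
with open(f"agents/users_{subreddit_name}.json", "w", encoding="utf-8") as f:
    json.dump(list(profiles.values()), f, indent=2)
```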
Create AI agents based on scraped user profiles:
python agents/agent_generator.py
This creates a JSON file containing the specified number of agents, with each agent assigned one of the scraped user profiles.
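Conceptually, the step amounts to something like the sketch below; NUM_AGENTS, the input/output paths, and the field names are illustrative, and agents/agent_generator.py is the source of truth:

```python
# sketch: build NUM_AGENTS agents, each assigned one of the scraped user profiles
import json
import random

NUM_AGENTS = 1000  # illustrative value

with open("agents/users_SecurityCamera.json", encoding="utf-8") as f:  # hypothetical input path
    profiles = json.load(f)

agents = [{"agent_id": i, "persona": random.choice(profiles)} for i in range(NUM_AGENTS)]

with open("agents/agents_SecurityCamera.json", "w", encoding="utf-8") as f:
    json.dump(agents, f, indent=2)
```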
You can set aside a subsection of the posts dataset to generate subreddit-specific context that helps the LLM identify viral posts correctly. Think of this as your "training set", i.e. these posts should not go into your simulation cycle later on. Create a new JSON file for this separate training set.
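One simple way to carve out such a training set is sketched below; the split size and file names are entirely up to you:

```python
# sketch: split scraped posts into a small "training" set for prompt_gen.py
# and a remainder that stays in the simulation cycle (file names are illustrative)
import json

with open("posts/posts.json", encoding="utf-8") as f:
    posts = json.load(f)

train_posts = posts[:30]  # e.g. the first 30 posts become prompt-generation context
sim_posts = posts[30:]    # the rest remain available for the simulation

with open("train/train_posts.json", "w", encoding="utf-8") as f:
    json.dump(train_posts, f, indent=2)
with open("posts/posts_sim.json", "w", encoding="utf-8") as f:
    json.dump(sim_posts, f, indent=2)
```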
Go to train/prompt_gen.py, and based on which posts actually went viral, change this:
Here are posts from the reddit community r/{SUBREDDIT}:
{posts_str}
Some posts (e.g. INSERT_VIRAL_POST_NUMBERS_HERE) were viral.
Based on these, identify the features that make posts go viral in this community.
and change the input file path at the top.
Afterward, copy the output into your selected driver file here:
print(recommended_posts)
posts_str = json.dumps(recommended_posts, indent=2)
community_details = f"""...
Execute the main simulation by running any one of the three driver files.
Simulation progress is logged in:
- output/posts/{subreddit}/{model}/posts.csv (post engagement data)
- Recommendation Engine: Modify recommendation/fyp.py to adjust content recommendation algorithms.
- Currently, the recommendation algorithm uses a temporal-only approach, showing the 20 most recent posts (see the sketch below).
- I also tried a temporal- and engagement-weighted algorithm, which weighs recent engagement score against the age of the post; if you try to recreate this, note that I ran into an issue where the LLM would not react to some posts at all because it was never shown those posts!
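The temporal-only behaviour described above boils down to roughly the following; this is a sketch of the idea, not the exact code in recommendation/fyp.py (it assumes each post carries a created_utc timestamp):

```python
# sketch: temporal-only "For You Page" -- surface the 20 most recent posts
def recommend_posts(posts, top_n=20):
    """Return the top_n newest posts, newest first (assumes a created_utc field)."""
    return sorted(posts, key=lambda p: p["created_utc"], reverse=True)[:top_n]
```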
Problem: Connection refused or Model not found errors
Solutions:
# Check if Ollama service is running
ollama serve
# Verify model is installed
ollama list
# Pull the required model
ollama pull llama3
Problem: System runs out of memory during large simulations
Solutions:
- Reduce NUM_AGENTS in agent_generator.py
- Decrease NUM_TIMESTEPS in the simulation configuration
- Use a smaller LLM model (e.g., llama3:8b instead of larger variants)