OpenEstimate is a multi-domain benchmark for evaluating language models on probabilistic estimation, a specific form of reasoning under uncertainty.
Real-world LM deployments in healthcare, finance, and other forms of knowledge work require models to handle incomplete information and quantify uncertainty. Yet most LM evaluations focus on well-defined problems with clear answers. OpenEstimate addresses this gap by testing models on probabilistic estimation tasks in which they must synthesize background knowledge into accurate, well-calibrated Bayesian priors.
Language models have access to vast amounts of knowledge, but their ability to reason probabilistically about uncertain outcomes remains poorly understood. OpenEstimate evaluates LMs on three capabilities:
- 🎲 Probabilistic Reasoning: Estimating distributions over uncertain quantities
- 🧩 Knowledge Synthesis: Integrating relevant background knowledge into a distributional estimate
- 📊 Calibration: Producing well-calibrated uncertainty estimates, not just accurate point predictions
The benchmark assesses both the accuracy and calibration of LM-elicited priors, quantifying their usefulness relative to samples from the true distribution.
Across six contemporary language models, we find:
- LM-elicited priors are often inaccurate and overconfident
- Performance improves modestly with different elicitation protocols
- Changes in sampling strategy, reasoning effort, or prompt design have limited impact
OpenEstimate includes three diverse domains with real-world data:
| Dataset | Domain | Description | Variables |
|---|---|---|---|
| 🏥 NHANES | Healthcare | National Health and Nutrition Examination Survey data with health metrics and demographic information | Health outcomes, biomarkers, lifestyle factors |
| 💼 Glassdoor | Employment | Company and employment data including salaries, industries, and workplace metrics | Compensation, company characteristics, job roles |
| 💰 PitchBook | Finance | Startup and venture capital data with funding rounds, valuations, and company metrics | Funding amounts, valuations, company growth |
Each dataset includes:
- ✅ Ground truth distributions computed from observational data
- ✅ Variable descriptions in natural language
- ✅ Conditioning information of varying complexity (1-3 conditions); an illustrative variable record is sketched after this list
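For illustration, a single benchmark variable could look roughly like this. The field names and numbers below are hypothetical, not the repository's actual schema; the real variable files live in data/variables/.

```python
# Hypothetical sketch of one benchmark variable; field names and numbers are
# illustrative only -- the actual format is defined in data/variables/.
example_variable = {
    "dataset": "nhanes",
    "description": "Systolic blood pressure (mmHg) of a surveyed adult",
    "conditions": [  # 1-3 natural-language conditions of varying complexity
        "the respondent is between 40 and 49 years old",
        "the respondent reports smoking daily",
    ],
    "family": "gaussian",                          # target distribution family
    "ground_truth": {"mean": 126.0, "std": 16.0},  # made-up numbers
}
```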
OpenEstimate supports multiple protocols for eliciting distributional beliefs from language models:
- Direct: Model directly specifies distribution parameters (mean, variance)
- Quantile-based: Model provides quantiles (e.g., 10th, 50th, and 90th percentiles), which are then fit to a parametric distribution (see the sketch after this list)
- Mean-Variance: Model separately estimates mean and variance
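To make the quantile-based protocol concrete, the sketch below fits a Gaussian to model-reported percentiles by least squares over the normal quantile function. It is a minimal illustration, not the implementation in elicitation/src/fit_priors.py.

```python
# Minimal sketch (not the repository's fit_priors.py): fit a Gaussian to
# elicited quantiles by least squares over the normal quantile function.
import numpy as np
from scipy import optimize, stats

def fit_gaussian_to_quantiles(levels, values):
    """levels: quantile levels, e.g. [0.1, 0.5, 0.9]; values: the model's answers."""
    values = np.asarray(values, dtype=float)

    def loss(params):
        mu, log_sigma = params
        predicted = stats.norm.ppf(levels, loc=mu, scale=np.exp(log_sigma))
        return np.sum((predicted - values) ** 2)

    # Initialize at the reported median and a rough spread estimate.
    init = [values[len(values) // 2], np.log(values.max() - values.min() + 1e-6)]
    mu, log_sigma = optimize.minimize(loss, init, method="Nelder-Mead").x
    return mu, np.exp(log_sigma)

# e.g. a model reports 10th/50th/90th percentiles of 80, 120, and 170:
mu, sigma = fit_gaussian_to_quantiles([0.1, 0.5, 0.9], [80.0, 120.0, 170.0])
```

The direct and mean-variance protocols skip this fitting step, since the model reports the distribution parameters itself.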
OpenEstimate also includes different expert personas to test prompt sensitivity (a hypothetical illustration follows the list):
- Base: Neutral helpful assistant with domain expertise
- Conservative: Explicitly instructed to provide conservative estimates
- Superforecaster: Prompted to follow forecasting best practices (à la Philip Tetlock)
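The wording below is purely hypothetical and only meant to show how the three personas might differ; the benchmark's actual templates live in elicitation/prompts/.

```python
# Purely hypothetical persona wording, for illustration only; the benchmark's
# actual templates live in elicitation/prompts/.
PERSONAS = {
    "base": (
        "You are a helpful assistant with deep expertise in {domain}. "
        "Give your best distributional estimate of the quantity described below."
    ),
    "conservative": (
        "You are a cautious analyst. Provide conservative estimates and widen "
        "your uncertainty whenever the evidence is thin."
    ),
    "superforecaster": (
        "You are a superforecaster. Decompose the question, consult base rates, "
        "and report a well-calibrated distribution."
    ),
}
```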
OpenEstimate reports three metrics for assessing prior quality (a minimal computation sketch follows the list):
- Mean Absolute Error (MAE): Point estimate accuracy
- Expected Calibration Error (ECE): Calibration of probabilistic predictions
- Uncertainty-Accuracy Correlation: Relationship between uncertainty estimates and accuracy of predictions
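The sketch below shows one common way to compute these quantities for Gaussian priors: MAE over the priors' means, a coverage-based calibration error comparing nominal central-interval coverage with empirical coverage, and the correlation between stated uncertainty and absolute error. The benchmark's own metric definitions (see analysis/) may differ in detail.

```python
# Minimal sketches of the three metrics for Gaussian priors; the benchmark's
# own implementations (see analysis/) may differ in detail.
import numpy as np
from scipy import stats

def mae(point_estimates, true_values):
    """Mean absolute error of the priors' point estimates (e.g. their means)."""
    return float(np.mean(np.abs(np.asarray(point_estimates) - np.asarray(true_values))))

def expected_calibration_error(mus, sigmas, true_values, levels=np.linspace(0.1, 0.9, 9)):
    """Average gap between nominal and empirical central-interval coverage."""
    mus, sigmas, true_values = map(np.asarray, (mus, sigmas, true_values))
    gaps = []
    for level in levels:
        lo = stats.norm.ppf(0.5 - level / 2, loc=mus, scale=sigmas)
        hi = stats.norm.ppf(0.5 + level / 2, loc=mus, scale=sigmas)
        empirical = np.mean((true_values >= lo) & (true_values <= hi))
        gaps.append(abs(empirical - level))
    return float(np.mean(gaps))

def uncertainty_accuracy_correlation(sigmas, point_estimates, true_values):
    """Correlation between stated uncertainty and absolute prediction error."""
    errors = np.abs(np.asarray(point_estimates) - np.asarray(true_values))
    return float(np.corrcoef(np.asarray(sigmas), errors)[0, 1])
```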
Multiple distribution families are supported (illustrated after this list):
- Gaussian/Normal: For unbounded continuous variables
- Beta: For bounded continuous variables (e.g., proportions)
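A minimal illustration of the two families using scipy.stats; the parameter values are arbitrary and not tied to the repository's internal prior representation.

```python
# Arbitrary, illustrative parameter values; not tied to the repository's
# internal prior representation.
from scipy import stats

# Unbounded continuous quantity (e.g. log annual salary): Gaussian prior.
gaussian_prior = stats.norm(loc=11.0, scale=0.5)

# Quantity bounded in [0, 1] (e.g. proportion of remote employees): Beta prior.
beta_prior = stats.beta(a=2.0, b=5.0)

print(gaussian_prior.mean(), gaussian_prior.interval(0.9))  # mean and 90% interval
print(beta_prior.mean(), beta_prior.interval(0.9))
```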
We compare LM-elicited priors against statistical baselines computed by sampling N examples from the true distribution and using them to update an uninformative prior. This enables comparing LM performance against baselines built from different numbers of samples from the true distribution.
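For intuition, such a baseline could use a standard conjugate Normal-Normal update, as sketched below under the assumption of a Gaussian target with known observation variance; the repository's actual baseline construction (see data/baselines/) may differ.

```python
# Sketch of a conjugate Normal-Normal baseline update, assuming a Gaussian
# target with known observation variance; the repository's baselines
# (data/baselines/) may be constructed differently.
import numpy as np

def normal_baseline_posterior(samples, prior_mean=0.0, prior_var=1e6, obs_var=1.0):
    """Update a (nearly) uninformative Normal prior on the mean with N samples."""
    samples = np.asarray(samples, dtype=float)
    n = len(samples)
    post_var = 1.0 / (1.0 / prior_var + n / obs_var)          # precisions add
    post_mean = post_var * (prior_mean / prior_var + samples.sum() / obs_var)
    return post_mean, post_var

# Baseline built from N = 5 draws from the true distribution (simulated here).
rng = np.random.default_rng(0)
draws = rng.normal(loc=120.0, scale=15.0, size=5)
mean, var = normal_baseline_posterior(draws, obs_var=15.0 ** 2)
```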
- Python 3.8 or higher
- API keys for LM providers (OpenAI, Together AI, etc.)
- Clone the repository:

  ```bash
  git clone https://github.com/your-username/openestimate.git
  cd openestimate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
- Set up environment variables: create a `.env` file in the root directory:

  ```
  OPENAI_API_KEY=your-openai-api-key
  TOGETHER_API_KEY=your-together-api-key
  ```
- Generate experiment specifications and run the experiments:

  ```bash
  cd ~/openestimate/experiments
  python generate_specs.py
  python generate_run_scripts.py
  cd dataset_name/experiment_name  # e.g., cd glassdoor/model_family_comparison
  ./run_experiments_generated.sh
  ```
- Analyze results:

  ```bash
  cd ~/openestimate/analysis
  python run_analysis.py \
      --datasets glassdoor,nhanes,pitchbook \
      --output_dir analysis_results
  ```
You can extend OpenEstimate with new datasets. See data/readme.md for details on how to do this.
```
openestimate/
├── data/                       # Data generation and processing
│   ├── generate.py             # Main variable generation pipeline
│   ├── glassdoor.py            # Glassdoor dataset processing
│   ├── nhanes_generation.py    # NHANES dataset processing
│   ├── pitchbook.py            # PitchBook dataset processing
│   ├── compute_posteriors.py   # Ground truth computation
│   ├── baselines/              # Baseline priors
│   └── variables/              # Generated benchmark variables
│
├── elicitation/                # Prior elicitation from language models
│   ├── src/
│   │   ├── main.py             # Main elicitation script
│   │   ├── elicitation.py      # Core elicitation logic
│   │   ├── fit_priors.py       # Prior fitting methods
│   │   ├── clients.py          # LM API clients
│   │   └── utils.py            # Utility functions
│   └── prompts/                # Elicitation protocol templates
│
├── experiments/                # Experiment configurations
│   ├── generate_specs.py       # Generate experiment specifications
│   ├── glassdoor/              # Glassdoor experiments
│   ├── nhanes/                 # NHANES experiments
│   └── pitchbook/              # PitchBook experiments
│
└── analysis/                   # Results analysis and visualization
    ├── run_analysis.py         # Main analysis script
    ├── compare_models.py       # Cross-model comparisons
    ├── ablations.py            # Ablation studies
    ├── plotting.py             # Visualization utilities
    └── utils.py                # Analysis utilities
```
If you use OpenEstimate in your research, please cite:
```bibtex
@article{openestimate2024,
  title={OpenEstimate: A Benchmark for Evaluating Language Models on Probabilistic Estimation},
  author={[Authors]},
  journal={[Venue]},
  year={2024},
  url={https://github.com/your-username/openestimate}
}
```

We welcome contributions! Areas of particular interest:
- Additional datasets and domains
- New elicitation protocols
- Alternative distribution families
- Improved evaluation metrics
- Calibration and uncertainty quantification methods
Please open an issue or submit a pull request.
This project is licensed under the MIT License - see the LICENSE file for details.
For questions or issues, please:
- Open an issue on GitHub
- Contact the authors at [email protected]