OpenEstimate


OpenEstimate is a multi-domain benchmark for evaluating language models on probabilistic estimation, a specific form of reasoning under uncertainty.

Real-world LM deployments in healthcare, finance, and other knowledge-work domains require models to handle incomplete information and quantify uncertainty. Yet most LM evaluations focus on well-defined problems with clear answers. OpenEstimate addresses this gap by testing models on probabilistic estimation tasks, where they must synthesize background knowledge into accurate, well-calibrated Bayesian priors.

Overview 🎯

Language models have access to vast amounts of knowledge, but their ability to reason probabilistically about uncertain outcomes remains poorly understood. OpenEstimate evaluates LMs on three capabilities:

  • 🎲 Probabilistic Reasoning: Estimating distributions over uncertain quantities
  • 🧩 Background Knowledge Synthesis: Combining relevant background knowledge into distributional estimates
  • 📊 Calibration: Producing well-calibrated uncertainty estimates, not just accurate point predictions

The benchmark assesses both the accuracy and calibration of LM-elicited priors, quantifying their usefulness relative to samples from the true distribution.

Key Findings 🔍

Across six contemporary language models, we find:

  • LM-elicited priors are often inaccurate and overconfident
  • Performance improves modestly with different elicitation protocols
  • Changes in sampling strategy, reasoning effort, or prompt design have limited impact

Datasets 📚

OpenEstimate includes three diverse domains with real-world data:

  • 🏥 NHANES (Healthcare): National Health and Nutrition Examination Survey data with health metrics and demographic information. Variables: health outcomes, biomarkers, lifestyle factors.
  • 💼 Glassdoor (Employment): Company and employment data including salaries, industries, and workplace metrics. Variables: compensation, company characteristics, job roles.
  • 💰 PitchBook (Finance): Startup and venture capital data with funding rounds, valuations, and company metrics. Variables: funding amounts, valuations, company growth.

Each dataset includes:

  • ✓ Ground truth distributions computed from observational data
  • ✓ Variable descriptions in natural language
  • ✓ Conditioning information of varying complexity (1-3 conditions); a hypothetical example follows this list
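
To make the task format concrete, here is a purely hypothetical example of what a benchmark variable might look like. The field names and numbers are illustrative assumptions, not the repository's actual schema or data.

    # Hypothetical benchmark variable (illustrative only; not the real schema).
    example_variable = {
        "dataset": "glassdoor",
        "description": "Annual base salary in USD",
        "conditions": [                       # 1-3 natural-language conditions
            "the employee works as a data scientist",
            "the company is in the technology industry",
        ],
        "distribution_family": "normal",      # Gaussian for an unbounded quantity
        "ground_truth": {"mean": 132000.0, "sd": 28000.0},  # made-up numbers
    }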

Evaluation Design ⚙️

Elicitation Protocols

Multiple methods for eliciting distributional beliefs from language models:

  • Direct: Model directly specifies distribution parameters (mean, variance)
  • Quantile-based: Model provides quantiles (e.g., 10th, 50th, 90th percentiles), which are fit to a distribution (a fitting sketch follows this list)
  • Mean-Variance: Model separately estimates mean and variance
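
As an illustration of the quantile-based protocol, the sketch below fits a Gaussian to three elicited quantiles with SciPy. It is a minimal example with made-up quantile values, not the implementation in elicitation/src/fit_priors.py.

    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import norm

    def fit_normal_to_quantiles(levels, values):
        """Find (mu, sigma) whose quantiles best match the elicited values."""
        def loss(params):
            mu, log_sigma = params
            predicted = norm.ppf(levels, loc=mu, scale=np.exp(log_sigma))
            return np.sum((predicted - values) ** 2)
        # Initialize at the median and a rough spread estimate (q90 - q10 ~ 2.56 sigma).
        init = [values[1], np.log((values[-1] - values[0]) / 2.56 + 1e-6)]
        result = minimize(loss, init, method="Nelder-Mead")
        mu, log_sigma = result.x
        return mu, np.exp(log_sigma)

    # Hypothetical elicited 10th, 50th, and 90th percentiles.
    mu, sigma = fit_normal_to_quantiles(np.array([0.1, 0.5, 0.9]),
                                        np.array([42.0, 55.0, 71.0]))
    print(f"Fitted Normal: mean={mu:.1f}, sd={sigma:.1f}")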

System Prompts

Different expert personas to test prompt sensitivity:

  • Base: Neutral helpful assistant with domain expertise
  • Conservative: Explicitly instructed to provide conservative estimates
  • Superforecaster: Prompted to follow forecasting best practices (à la Philip Tetlock)

Evaluation Metrics

Metrics for assessing prior quality (a computation sketch follows this list):

  • Mean Absolute Error (MAE): Point estimate accuracy
  • Expected Calibration Error (ECE): Calibration of probabilistic predictions
  • Uncertainty-Accuracy Correlation: Relationship between uncertainty estimates and accuracy of predictions
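
A minimal sketch of how such metrics can be computed from elicited Gaussian priors. MAE follows the definition above; the interval-coverage check is a simplified stand-in for a calibration metric (the benchmark's ECE computation may differ), and all numbers are made up.

    import numpy as np
    from scipy.stats import norm

    def mae(prior_means, true_values):
        """Mean absolute error of the prior means against the true values."""
        return float(np.mean(np.abs(np.asarray(prior_means) - np.asarray(true_values))))

    def interval_coverage(prior_means, prior_sds, true_values, level=0.8):
        """Fraction of true values inside the central `level` credible interval.
        Well-calibrated priors should cover roughly `level` of the truths."""
        lo = norm.ppf((1 - level) / 2, loc=prior_means, scale=prior_sds)
        hi = norm.ppf(1 - (1 - level) / 2, loc=prior_means, scale=prior_sds)
        return float(np.mean((true_values >= lo) & (true_values <= hi)))

    # Hypothetical elicited priors and ground-truth values.
    means = np.array([55.0, 30.0, 70.0])
    sds = np.array([8.0, 5.0, 10.0])
    truths = np.array([61.0, 29.0, 52.0])
    print("MAE:", mae(means, truths))
    print("80% interval coverage:", interval_coverage(means, sds, truths))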

Distribution Types

Support for multiple distribution families:

  • Gaussian/Normal: For unbounded continuous variables
  • Beta: For bounded continuous variables (e.g., proportions); a moment-matching sketch follows this list
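
As one way an elicited mean and variance could be turned into a Beta prior for a bounded variable, here is a moment-matching sketch. It is an illustration under stated assumptions, not necessarily the conversion used in fit_priors.py.

    def beta_from_mean_var(mean, var):
        """Moment-match a Beta(alpha, beta) to an elicited mean and variance.
        Requires 0 < mean < 1 and var < mean * (1 - mean)."""
        common = mean * (1 - mean) / var - 1
        alpha = mean * common
        beta = (1 - mean) * common
        return alpha, beta

    # Hypothetical elicited belief about a proportion.
    alpha, beta = beta_from_mean_var(mean=0.3, var=0.01)
    print(f"Beta(alpha={alpha:.2f}, beta={beta:.2f})")  # Beta(6.00, 14.00)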

Baselines

We compare LM-elicited priors against statistical baselines computed by sampling N examples from the true distribution and updating an uninformative prior on those samples. Varying N lets us express an LM-elicited prior's usefulness in terms of the number of true-distribution samples it matches.
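
A minimal sketch of a baseline in this spirit, assuming a Gaussian model with known observation noise and a nearly flat prior on the mean; the actual code in data/baselines/ may use a different conjugate family.

    import numpy as np

    def gaussian_mean_posterior(samples, obs_var, prior_mean=0.0, prior_var=1e6):
        """Update a (nearly uninformative) Gaussian prior on the mean of a
        Gaussian with known observation variance, given N samples."""
        samples = np.asarray(samples, dtype=float)
        n = len(samples)
        post_var = 1.0 / (1.0 / prior_var + n / obs_var)
        post_mean = post_var * (prior_mean / prior_var + samples.sum() / obs_var)
        return post_mean, post_var

    rng = np.random.default_rng(0)
    true_mean, obs_var = 55.0, 64.0
    for n in (1, 5, 25):
        draws = rng.normal(true_mean, np.sqrt(obs_var), size=n)
        mean, var = gaussian_mean_posterior(draws, obs_var)
        print(f"N={n:2d}: posterior mean={mean:.1f}, sd={np.sqrt(var):.2f}")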


Installation 🛠️

Prerequisites

  • Python 3.8 or higher
  • API keys for LM providers (OpenAI, Together AI, etc.)

Setup

  1. Clone the repository:

    git clone https://github.com/your-username/openestimate.git
    cd openestimate
  2. Install dependencies:

    pip install -r requirements.txt
  3. Set up environment variables: Create a .env file in the root directory (a loading sketch follows these steps):

    OPENAI_API_KEY=your-openai-api-key
    TOGETHER_API_KEY=your-together-api-key
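
The elicitation code needs these keys at runtime. Below is a minimal sketch of loading them with python-dotenv; whether the repository actually uses python-dotenv (rather than another mechanism) is an assumption.

    import os
    from dotenv import load_dotenv  # pip install python-dotenv

    load_dotenv()  # reads the .env file from the current working directory
    openai_key = os.environ["OPENAI_API_KEY"]
    together_key = os.environ["TOGETHER_API_KEY"]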

Quick Start 🚀

Running Experiments and Analyzing Results

  1. Generate experiment specifications and run the experiments:
    cd ~/openestimate/experiments
    python generate_specs.py
    python generate_run_scripts.py 
    cd dataset_name/experiment_name # e.g., cd glassdoor/model_family_comparison
    ./run_experiments_generated.sh 
  2. Analyze results:
    cd ~/openestimate/analysis
    python run_analysis.py \
      --datasets glassdoor,nhanes,pitchbook \
      --output_dir analysis_results

Generating Custom Benchmarks

You can extend OpenEstimate with new datasets. See data/readme.md for details on how to do this.


Repository Structure 📁

openestimate/
├── data/                      # Data generation and processing
│   ├── generate.py           # Main variable generation pipeline
│   ├── glassdoor.py          # Glassdoor dataset processing
│   ├── nhanes_generation.py  # NHANES dataset processing
│   ├── pitchbook.py          # PitchBook dataset processing
│   ├── compute_posteriors.py # Ground truth computation
│   ├── baselines/            # Baseline priors
│   └── variables/            # Generated benchmark variables
│
├── elicitation/              # Prior elicitation from language models
│   ├── src/
│   │   ├── main.py          # Main elicitation script
│   │   ├── elicitation.py   # Core elicitation logic
│   │   ├── fit_priors.py    # Prior fitting methods
│   │   ├── clients.py       # LM API clients
│   │   └── utils.py         # Utility functions
│   └── prompts/             # Elicitation protocol templates
│
├── experiments/              # Experiment configurations
│   ├── generate_specs.py    # Generate experiment specifications
│   ├── glassdoor/           # Glassdoor experiments
│   ├── nhanes/              # NHANES experiments
│   └── pitchbook/           # PitchBook experiments
│
└── analysis/                 # Results analysis and visualization
    ├── run_analysis.py      # Main analysis script
    ├── compare_models.py    # Cross-model comparisons
    ├── ablations.py         # Ablation studies
    ├── plotting.py          # Visualization utilities
    └── utils.py             # Analysis utilities
Citation 📝

If you use OpenEstimate in your research, please cite:

@article{openestimate2024,
  title={OpenEstimate: A Benchmark for Evaluating Language Models on Probabilistic Estimation},
  author={[Authors]},
  journal={[Venue]},
  year={2024},
  url={https://github.com/your-username/openestimate}
}

Contributing 🤝

We welcome contributions! Areas of particular interest:

  • Additional datasets and domains
  • New elicitation protocols
  • Alternative distribution families
  • Improved evaluation metrics
  • Calibration and uncertainty quantification methods

Please open an issue or submit a pull request.


License 📜

This project is licensed under the MIT License - see the LICENSE file for details.


Contact 📧

For questions or issues, please open an issue on this repository.
