A Python project for parsing SAT Mathematics percentile rank data from PDF files and analyzing selection bias in standardized testing.
This repository contains tools to extract and process SAT Mathematics percentile rank data from official College Board PDF reports (2011-2015) and perform Bayesian analysis of selection bias in standardized testing. The project addresses a key question in educational assessment: whether observed gender differences in SAT Mathematics scores reflect true differences in ability or selection effects where students with different ability levels have different probabilities of taking the test.
The repository includes:
- Data extraction tools for parsing PDF percentile rank reports into structured CSV files
- Bayesian analysis using hierarchical modeling to estimate selection bias effects
- Statistical validation through parameter recovery tests and model diagnostics
- Clone the repository:

  ```
  git clone <repository-url>
  cd SelectionSAT
  ```

- Create the conda environment:

  ```
  make create_environment
  ```

- Activate the environment:

  ```
  conda activate SelectionSAT
  ```

- Install additional requirements:

  ```
  make requirements
  ```

The `selection_bias.ipynb` notebook contains a Bayesian analysis of selection bias in SAT Mathematics scores using data from 2011-2015.
**Observed Score Differences:**

- Male test-takers consistently scored 31-34 points higher than female test-takers across all years (2011-2015)

**Selection Bias Analysis:**

- **Strong evidence of selection effects:** The analysis estimates that boys with low math ability are less likely to take the SAT than girls with equivalent ability
- **Gender difference in selection:** Selection effects are substantially stronger for males than for females
- **Reduced true ability gap:** When accounting for selection bias, the estimated difference in underlying math efficacy between genders drops from 31-34 points to only ~8 points
The analysis uses a hierarchical Bayesian model that:

- **Models the latent ability distribution:** Assumes a Gaussian distribution of true math efficacy extending beyond the SAT score range (100-900)
- **Estimates selection functions:** Relates potential test scores to the probability of taking the SAT
- **Accounts for test artifacts:** Models the "spiky" score distributions caused by SAT scoring mechanics (multiple-choice penalties, etc.)
- **Pools information across years:** Uses hierarchical modeling to share information between years while allowing for year-specific effects

- **Data:** SAT Mathematics percentile rank distributions by gender (2011-2015)
- **Modeling:** PyMC-based Bayesian inference with MCMC sampling (see the structural sketch after this list)
- **Validation:** Parameter recovery tests confirm model reliability
- **Key parameters:**
  - `mu`: Mean of the latent efficacy distribution by gender
  - `sigma`: Standard deviation of latent efficacy by gender
  - `beta`: Selection effect strength (how much efficacy affects test-taking probability)
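To make this structure concrete, here is a minimal PyMC sketch of a single-year version of the model. Everything below (the toy counts, grid resolution, priors, and variable names) is an illustrative assumption rather than the notebook's actual code, and the noise/artifact component and the across-year hierarchy are omitted for brevity.

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(0)
scores = np.arange(200, 810, 10)                    # observable SAT scores
grid = np.arange(100, 910, 10)                      # latent efficacy grid (100-900)
observable = slice(10, 71)                          # grid bins falling inside 200-800
counts = rng.integers(100, 5000, size=(2, scores.size))  # toy counts; rows = genders

with pm.Model() as model:
    # Latent efficacy: one Gaussian per gender (the mu/sigma key parameters above)
    mu = pm.Normal("mu", mu=500, sigma=100, shape=2)
    sigma = pm.HalfNormal("sigma", sigma=150, shape=2)

    # Discretize each Gaussian onto the efficacy grid
    logp = -0.5 * ((grid[None, :] - mu[:, None]) / sigma[:, None]) ** 2
    dens = pm.math.exp(logp - pm.math.logsumexp(logp, axis=1, keepdims=True))

    # Selection function: probability of taking the SAT rises with efficacy
    beta = pm.HalfNormal("beta", sigma=1.0, shape=2)
    p_take = pm.math.sigmoid(beta[:, None] * (grid[None, :] - 500) / 100)

    # Score distribution among actual test-takers, truncated to 200-800
    taken = (dens * p_take)[:, observable]
    probs = taken / taken.sum(axis=1, keepdims=True)

    pm.Multinomial("obs", n=counts.sum(axis=1), p=probs, observed=counts)
    idata = pm.sample()
```

The key output is the posterior on `mu` by gender, which gives the selection-corrected efficacy gap reported above.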
The analysis suggests that a substantial portion of the observed gender gap in SAT Mathematics scores may be due to selection bias rather than true differences in mathematical ability.
The methodology provides a framework for analyzing selection bias in other standardized assessments and educational contexts.
The project processes official College Board SAT Mathematics percentile rank reports:
- `SAT-Mathemathics_Percentile_Ranks_2011.pdf`
- `SAT-Mathemathics-Percentile-Ranks-2012.pdf`
- `SAT-Mathematics-Percentile-Ranks-2013.pdf`
- `sat-percentile-ranks-mathematics-2014.pdf`
- `sat-percentile-ranks-mathematics-2015.pdf`
Run this script to extract data from all available PDF files:

```
python parse_sat_math.py
```

This will:

- Process all SAT Mathematics PDF files in the directory
- Extract score distributions by gender for each year (a sketch of this step follows the list)
- Save individual CSV files for each year (e.g., `sat_math_2011.csv`)
- Display summary statistics for each dataset
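For reference, here is a minimal sketch of the kind of table extraction such a script might perform. It assumes the `pdfplumber` library and that the percentile-rank tables have Score/Total/Male/Female as their first four columns; the actual `parse_sat_math.py` may use a different library and layout-specific logic.

```python
import csv

import pdfplumber

def extract_year(pdf_path, csv_path):
    """Pull (Score, Total, Male, Female) rows out of one percentile-rank PDF."""
    rows = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                for row in table:
                    # Keep rows whose first cell parses as a valid SAT score
                    if row and row[0] and row[0].strip().isdigit():
                        score = int(row[0])
                        if 200 <= score <= 800:
                            rows.append(row[:4])
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Score", "Total_Number", "Male_Number", "Female_Number"])
        writer.writerows(rows)

extract_year("SAT-Mathematics-Percentile-Ranks-2013.pdf", "sat_math_2013.csv")
```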
Each CSV file contains the following columns (a usage example follows the list):

- `Score`: SAT Mathematics score (200-800)
- `Total_Number`: Total number of test-takers with this score
- `Male_Number`: Number of male test-takers with this score
- `Female_Number`: Number of female test-takers with this score
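For example, the observed gender gap for a given year can be computed directly from these columns (assuming `pandas` and that the extraction step above has produced the CSV):

```python
import pandas as pd

df = pd.read_csv("sat_math_2011.csv")

# Count-weighted mean score for each gender
male_mean = (df["Score"] * df["Male_Number"]).sum() / df["Male_Number"].sum()
female_mean = (df["Score"] * df["Female_Number"]).sum() / df["Female_Number"].sum()
print(f"Observed gap: {male_mean - female_mean:.1f} points")
```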
The repository contains four different versions of the selection bias model, each exploring different modeling choices:
**Version 1**

- **Noise parameter:** Fixed `sigma_noise = 0.05`
- **Noise detrending:** None; uses a raw `ZeroSumNormal` for `log_noise`
- **Selection function:** `logits = beta * thetas_scaled` (no intercept parameter; see the sketch after this list)
- **Scaling:** `scale = 100` (thetas scaled to a range of roughly -3 to 3)
- **Characteristics:** Simplest model; serves as the baseline for comparison
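A minimal NumPy illustration of Version 1's scaling and selection function (the `thetas` grid and the `beta` value here are hypothetical):

```python
import numpy as np

thetas = np.arange(200, 810, 10)                  # hypothetical efficacy grid
scale = 100
thetas_scaled = (thetas - thetas.mean()) / scale  # maps 200-800 onto -3..3

beta = 0.8                                        # example selection strength
logits = beta * thetas_scaled                     # Version 1: no intercept
p_take = 1 / (1 + np.exp(-logits))                # probability of taking the SAT
```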
**Version 2**

- **Noise parameter:** Fixed `sigma_noise = 0.05`
- **Noise detrending:** Detrends `log_noise` in the simple model by subtracting a linear trend
- **Selection function:** `logits = alpha + beta * thetas_scaled` (includes an intercept `alpha`; see the sketch after this list)
- **Scaling:** `scale = 400` (a different scaling approach)
- **Hierarchical model:** Uses a `HalfNormal` prior for `beta` and includes a Binomial likelihood for the number of test-takers
- **Characteristics:** Explores an alternative selection-function parameterization with an intercept term
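Version 2's selection function and taker likelihood might look roughly like this in PyMC. The pool sizes, taker counts, and priors below are illustrative; only the intercept, the `HalfNormal` prior on `beta`, and the Binomial likelihood come from the description above.

```python
import numpy as np
import pymc as pm

thetas = np.arange(200, 810, 10)
scale = 400
thetas_scaled = (thetas - thetas.mean()) / scale   # roughly -0.75..0.75, not -3..3

n_eligible = np.full(thetas.size, 1000)            # hypothetical eligible pool
n_takers = (0.6 * n_eligible).astype(int)          # hypothetical taker counts

with pm.Model():
    alpha = pm.Normal("alpha", mu=0.0, sigma=1.0)  # intercept term
    beta = pm.HalfNormal("beta", sigma=1.0)        # HalfNormal prior, per Version 2
    p = pm.math.sigmoid(alpha + beta * thetas_scaled)
    pm.Binomial("takers", n=n_eligible, p=p, observed=n_takers)
```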
**Version 3**

- **Noise parameter:** `sigma_noise` is estimated with a Gamma prior (not fixed; see the sketch after this list)
- **Noise detrending:** Detrends `log_noise` in the simple model; the hierarchical model adds detrending with a slope penalty
- **Selection function:** `logits = beta * thetas_scaled` (no intercept parameter)
- **Scaling:** `scale = 100`
- **Characteristics:** Allows the noise magnitude to be learned from the data rather than fixed
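The change in Version 3 amounts to replacing the fixed constant with a learned parameter, sketched below (the Gamma hyperparameters and the shape are illustrative guesses, not the notebook's values):

```python
import pymc as pm

with pm.Model():
    # Versions 1-2 hard-code sigma_noise = 0.05; here it is learned.
    # These Gamma hyperparameters (prior mean 0.1) are illustrative.
    sigma_noise = pm.Gamma("sigma_noise", alpha=2.0, beta=20.0)
    log_noise = pm.ZeroSumNormal("log_noise", sigma=sigma_noise, shape=61)
```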
**Version 4**

- **Noise parameter:** `sigma_noise` is estimated with a Gamma prior
- **Noise detrending:** Detrends `log_noise` with an explicit slope penalty using `pm.Potential` in both the simple and hierarchical models (see the sketch after this list)
- **Selection function:** `logits = beta * thetas_scaled` (no intercept parameter)
- **Scaling:** `scale = 100`
- **Characteristics:** The most sophisticated noise modeling, with an explicit penalty on the slope of the detrended noise
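A sketch of the detrending-plus-penalty idea using `pm.Potential`. The least-squares slope formula is standard; the penalty scale (0.1) and all names are illustrative assumptions.

```python
import numpy as np
import pymc as pm

n = 61
x = np.linspace(-1.0, 1.0, n)                   # centered, unit-scaled score index

with pm.Model():
    sigma_noise = pm.Gamma("sigma_noise", alpha=2.0, beta=20.0)
    log_noise = pm.ZeroSumNormal("log_noise", sigma=sigma_noise, shape=n)

    # Least-squares slope of log_noise against x
    slope = (x * log_noise).sum() / (x * x).sum()
    detrended = pm.Deterministic("detrended", log_noise - slope * x)

    # Penalize a nonzero slope so any trend is absorbed elsewhere in the model
    pm.Potential("slope_penalty", -0.5 * (slope / 0.1) ** 2)
```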
| Feature | Version 1 | Version 2 | Version 3 | Version 4 |
|---|---|---|---|---|
| `sigma_noise` | Fixed (0.05) | Fixed (0.05) | Estimated | Estimated |
| Noise detrending | None | Yes (simple) | Yes (both) | Yes (both, with penalty) |
| Selection intercept | No | Yes (`alpha`) | No | No |
| Scale | 100 | 400 | 100 | 100 |
| Hierarchical Binomial | No | Yes | No | No |
All versions share the same core structure: they model latent efficacy distributions, estimate selection functions, and account for test artifacts. The differences reflect exploration of modeling choices around noise handling, selection function parameterization, and whether to fix or estimate the noise magnitude parameter.