```bash
# 1. Clone and set up the environment
git clone https://github.com/Alezander9/hep-foundation
cd hep-foundation
uv venv --python 3.9
source .venv/bin/activate

# 2. Install the package and dependencies
uv pip install -e .
uv sync --group dev
pre-commit install

# 3. Run a test to verify the installation
pytest tests/test_pipeline.py
# Check that the test passes and results look good in _test_results/

# 4. Run a real experiment
cp tests/_test_pipeline_config.yaml _experiment_config_stack/
python scripts/run_pipelines.py
# Results will appear in _foundation_experiments/
```

The HEP Foundation pipeline is designed as a simple workflow:
- Create configs → Put YAML configuration files in `_experiment_config_stack/`
- Run pipeline → Execute `python scripts/run_pipelines.py`
- Get results → Find experiment results in `_foundation_experiments/`
Use `tests/_test_pipeline_config.yaml` as a template - just modify the values for your experiments. The pipeline processes all configs in the stack and removes them as it completes each one.
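A minimal sketch of those stack semantics (illustrative only - `run_experiment` below is a hypothetical stand-in for the real pipeline entry point; only the glob-process-remove pattern reflects the behavior described above):

```python
from pathlib import Path

CONFIG_STACK = Path("_experiment_config_stack")

def run_experiment(config_path: Path) -> None:
    """Hypothetical stand-in for the real pipeline entry point."""
    print(f"Running pipeline for {config_path.name} ...")

# Process every YAML config in the stack, removing each one on
# completion so an interrupted run can simply be restarted.
for config_path in sorted(CONFIG_STACK.glob("*.yaml")):
    run_experiment(config_path)
    config_path.unlink()  # config is consumed once its experiment finishes
```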
Each experiment produces a complete folder in `_foundation_experiments/` containing:
- Training results - Model weights, training history, and plots
- Regression evaluation - Data efficiency comparison across model types
- Signal classification - Binary classification performance analysis
- Anomaly detection - Background vs signal discrimination metrics
- Reproducibility - Copy of original config and experiment metadata
The pipeline automatically runs the full sequence: foundation model training → regression → signal classification → anomaly detection.
- Dataset size: Typically O(1M) events per dataset
- Training speed: ~15 seconds per epoch on NERSC A100 GPU
- GPU requirement: Strongly recommended for training (CPU training is very slow)
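If you're unsure whether TensorFlow can see your GPU, a quick check (the repo also ships `scripts/test_gpu.py` for this):

```python
import tensorflow as tf

# Lists detected GPUs; an empty list means training will fall back to CPU.
gpus = tf.config.list_physical_devices("GPU")
print(f"GPUs visible to TensorFlow: {gpus or 'none'}")
```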
For NERSC users:

- Download catalogs on login nodes - The pipeline is bottlenecked by downloading ROOT files from CERN OpenData, so run the downloads first, before requesting a job:

  ```bash
  python scripts/download_catalogs.py  # Downloads catalogs
  ```

- Run the pipeline on the NERSC cluster:

  ```bash
  sbatch jobs/submit_pipeline_simple.sh
  ```
For local development: Just run the pipeline directly as seen in the Quick Start setup.
```
src/hep_foundation/              # Main package source code
├── config/                      # Configuration loading and validation
├── data/                        # Dataset management, PhysLite data system
├── models/                      # Model architectures (VAE, DNN, etc.)
├── pipeline/                    # FoundationModelPipeline and helpers
├── plots/                       # Plotting code and HistogramManager
├── training/                    # ModelTrainer and training utilities
└── utils/                       # Plotting, logging, and utility functions
scripts/                         # Execution and utility scripts
├── run_pipelines.py             # Main pipeline runner
├── create_datasets.py           # Local dataset creation
└── transfer_*.py                # Remote transfer utilities
tests/                           # Test suite and test configurations
jobs/                            # SLURM job submission scripts
├── debug_pipeline.sh            # Check and log environment info
└── submit_pipeline_simple.sh    # Run full pipeline on all catalogs
logs/                            # Pipeline execution logs
_experiment_config_stack/        # Input: YAML configs to process
_foundation_experiments/         # Output: Experiment results
_processed_datasets/             # Cached datasets (HDF5 files)
_test_results/                   # Test outputs (cleaned each run)
```
Creating configs: Use `tests/_test_pipeline_config.yaml` as a template for your experiments.
Key configuration sections:
- `dataset`: Data selection (ATLAS run numbers, signal types)
- `models`: VAE and DNN architectures
- `training`: Training parameters (epochs, batch size, learning rate)
- `evaluation`: Data sizes for efficiency studies
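A common pattern is to copy the template, tweak a few values programmatically, and drop the result onto the stack. The sketch below assumes PyYAML is installed; the key names (`training`, `epochs`) are illustrative, so check the template itself for the authoritative structure.

```python
from pathlib import Path
import yaml

# Start from the shipped template.
template = Path("tests/_test_pipeline_config.yaml")
config = yaml.safe_load(template.read_text())

# Key names below are illustrative; verify them against the template.
config["training"]["epochs"] = 50

# Queue the modified config for the next pipeline run.
out = Path("_experiment_config_stack") / "my_experiment.yaml"
out.write_text(yaml.safe_dump(config))
```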
PhysLite features: Specify any PhysLite branch names in the config. Derived features (`eta`, `pt`, etc.) are automatically calculated from base branches using `physlite_derived_features.py`.
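For intuition, those derived kinematics follow the standard definitions. The snippet below is an illustrative NumPy version, not the repo's implementation, and the branch names (`px`, `py`, `pz`) are placeholders:

```python
import numpy as np

# Placeholder base branches (momentum components, e.g. in GeV).
px, py, pz = np.array([10.0]), np.array([5.0]), np.array([20.0])

pt = np.hypot(px, py)               # transverse momentum
p = np.sqrt(px**2 + py**2 + pz**2)  # total momentum
eta = np.arctanh(pz / p)            # pseudorapidity
phi = np.arctan2(py, px)            # azimuthal angle
```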
Each experiment folder contains:

```
001_Foundation_VAE_Model/
├── _experiment_config.yaml     # Reproducible config copy
├── _experiment_info.json       # Experiment metadata
├── models/foundation_model/    # Saved model weights
├── training/                   # Training metrics and plots
└── testing/
    ├── regression_evaluation/  # Data efficiency: regression tasks
    ├── signal_classification/  # Data efficiency: classification
    └── anomaly_detection/      # Background vs signal scoring
```
Key output files:
- Training plots and metrics in `training/`
- Data efficiency plots comparing foundation model benefits in `testing/*/`
- Model weights for reuse in `models/foundation_model/` (see the loading sketch below)
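Reusing the saved foundation model is a standard Keras load. The path and save format below are assumptions; inspect `models/foundation_model/` in your experiment folder to see what the pipeline actually writes.

```python
import tensorflow as tf

# Path and serialization format are assumptions; check the
# experiment folder for the files the pipeline actually saves.
model = tf.keras.models.load_model(
    "_foundation_experiments/001_Foundation_VAE_Model/models/foundation_model"
)
model.summary()
```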
Code quality:
- `.pre-commit-config.yaml` - Automated code formatting (ruff) and quality checks (vulture)
- `uv` package management with `pyproject.toml` configuration
Visual interface:
- `launch_gradio.py` - Opens a web page with the result viewer UI
Development tools:
- `.devcontainer/` - Docker container for a consistent development environment
- `scripts/test_gpu.py` - Verify TensorFlow GPU access on your system
- `src/hep_foundation/utils/plot_utils.py` - Standardized colors, fonts, and styling for all plots
Testing:
- `tests/test_pipeline.py` - Small-scale (~60 s) test of the whole pipeline. Flags missing or unexpected files in the output folder `_test_results/`
- `python run_pytest.py` - Python wrapper for the tests that only logs warnings, errors, and progress; full logs go to `_test_results/pytest.log`
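Both entry points are ordinary shell commands; if you prefer to invoke the test from Python (for example inside a notebook), `pytest.main` works too - a minor convenience, not a project API:

```python
import pytest

# Equivalent to `pytest tests/test_pipeline.py` on the command line;
# returns a non-zero exit code on failure.
exit_code = pytest.main(["-q", "tests/test_pipeline.py"])
```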
If you use this software in your research, please cite:
Yue, A. (2024). HEP Foundation: Foundation models for High Energy Physics data analysis.
https://github.com/Alezander9/hep-foundation
BibTeX format:

```bibtex
@software{yue_hep_foundation_2024,
  author = {Yue, Alexander},
  title = {HEP Foundation: Foundation models for High Energy Physics data analysis},
  url = {https://github.com/Alezander9/hep-foundation},
  year = {2024}
}
```

This project is licensed under the MIT License - see the LICENSE file for details.
Questions or issues?
- Email: [email protected]
- GitHub: Alezander9/hep-foundation