```bash
# 1. Clone and set up the environment
git clone https://github.com/Alezander9/hep-foundation
cd hep-foundation
uv venv --python 3.9
source .venv/bin/activate

# 2. Install the package and dependencies
uv pip install -e .
uv sync --group dev
pre-commit install

# 3. Run a test to verify the installation
pytest tests/test_pipeline.py
# Check that the test passes and results look good in _test_results/

# 4. Run a real experiment
cp tests/_test_pipeline_config.yaml _experiment_config_stack/
python scripts/run_pipelines.py
# Results will appear in _foundation_experiments/
```

The HEP Foundation pipeline is designed as a simple workflow:
- Create configs → Put YAML configuration files in `_experiment_config_stack/`
- Run pipeline → Execute `python scripts/run_pipelines.py`
- Get results → Find experiment results in `_foundation_experiments/`
Use `tests/_test_pipeline_config.yaml` as a template - just modify the values for your experiments. The pipeline processes all configs in the stack and removes them as it completes each one.
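A minimal sketch of those stack semantics (illustrative only - `run_experiment` below is a hypothetical stand-in for the real pipeline entry point; only the glob-process-remove pattern reflects the behavior described above):

```python
from pathlib import Path

CONFIG_STACK = Path("_experiment_config_stack")

def run_experiment(config_path: Path) -> None:
    """Hypothetical stand-in for the real pipeline entry point."""
    print(f"Running pipeline for {config_path.name} ...")

# Process every YAML config in the stack, removing each one on
# completion so an interrupted run can simply be restarted.
for config_path in sorted(CONFIG_STACK.glob("*.yaml")):
    run_experiment(config_path)
    config_path.unlink()  # config is consumed once its experiment finishes
```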
Each experiment produces a complete folder in `_foundation_experiments/` containing:
- Training results - Model weights, training history, and plots
- Regression evaluation - Data efficiency comparison across model types
- Signal classification - Binary classification performance analysis
- Anomaly detection - Background vs signal discrimination metrics
- Reproducibility - Copy of original config and experiment metadata
The pipeline automatically runs the full sequence: foundation model training → regression → signal classification → anomaly detection.
- Dataset size: Typically O(1M) events per dataset
- Training speed: ~15 seconds per epoch on NERSC A100 GPU
- GPU requirement: Strongly recommended for training (CPU training is very slow)
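If you're unsure whether TensorFlow can see your GPU, a quick check (the repo also ships `scripts/test_gpu.py` for this):

```python
import tensorflow as tf

# Lists detected GPUs; an empty list means training will fall back to CPU.
gpus = tf.config.list_physical_devices("GPU")
print(f"GPUs visible to TensorFlow: {gpus or 'none'}")
```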
For NERSC users:

- Download catalogs on login nodes - The pipeline is bottlenecked by downloading ROOT files from CERN OpenData, so run the downloads first, before requesting a job:

  ```bash
  python scripts/download_catalogs.py  # Downloads catalogs
  ```

- Run the pipeline on the NERSC cluster:

  ```bash
  sbatch jobs/submit_pipeline_simple.sh
  ```
For local development: Just run the pipeline directly as seen in the Quick Start setup.
```
src/hep_foundation/              # Main package source code
├── config/                      # Configuration loading and validation
├── data/                        # Dataset management, PhysLite data system
├── models/                      # Model architectures (VAE, DNN, etc.)
├── pipeline/                    # FoundationModelPipeline and helpers
├── plots/                       # Plotting code and HistogramManager
├── training/                    # ModelTrainer and training utilities
└── utils/                       # Plotting, logging, and utility functions
scripts/                         # Execution and utility scripts
├── run_pipelines.py             # Main pipeline runner
├── create_datasets.py           # Local dataset creation
└── transfer_*.py                # Remote transfer utilities
tests/                           # Test suite and test configurations
jobs/                            # SLURM job submission scripts
├── debug_pipeline.sh            # Check and log environment info
└── submit_pipeline_simple.sh    # Run full pipeline on all catalogs
logs/                            # Pipeline execution logs
_experiment_config_stack/        # Input: YAML configs to process
_foundation_experiments/         # Output: Experiment results
_processed_datasets/             # Cached datasets (HDF5 files)
_test_results/                   # Test outputs (cleaned each run)
```
Creating configs: Use `tests/_test_pipeline_config.yaml` as a template for your experiments.
Key configuration sections:
- `dataset`: Data selection (ATLAS run numbers, signal types)
- `models`: VAE and DNN architectures
- `training`: Training parameters (epochs, batch size, learning rate)
- `evaluation`: Data sizes for efficiency studies
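A common pattern is to copy the template, tweak a few values programmatically, and drop the result onto the stack. The sketch below assumes PyYAML is installed; the key names (`training`, `epochs`) are illustrative, so check the template itself for the authoritative structure.

```python
from pathlib import Path
import yaml

# Start from the shipped template.
template = Path("tests/_test_pipeline_config.yaml")
config = yaml.safe_load(template.read_text())

# Key names below are illustrative; verify them against the template.
config["training"]["epochs"] = 50

# Queue the modified config for the next pipeline run.
out = Path("_experiment_config_stack") / "my_experiment.yaml"
out.write_text(yaml.safe_dump(config))
```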
PhysLite features: Specify any PhysLite branch names in the config. Derived features (`eta`, `pt`, etc.) are automatically calculated from base branches using `physlite_derived_features.py`.
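For intuition, those derived kinematics follow the standard definitions. The snippet below is an illustrative NumPy version, not the repo's implementation, and the branch names (`px`, `py`, `pz`) are placeholders:

```python
import numpy as np

# Placeholder base branches (momentum components, e.g. in GeV).
px, py, pz = np.array([10.0]), np.array([5.0]), np.array([20.0])

pt = np.hypot(px, py)               # transverse momentum
p = np.sqrt(px**2 + py**2 + pz**2)  # total momentum
eta = np.arctanh(pz / p)            # pseudorapidity
phi = np.arctan2(py, px)            # azimuthal angle
```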
Each experiment folder contains:

```
001_Foundation_VAE_Model/
├── _experiment_config.yaml     # Reproducible config copy
├── _experiment_info.json       # Experiment metadata
├── models/foundation_model/    # Saved model weights
├── training/                   # Training metrics and plots
└── testing/
    ├── regression_evaluation/  # Data efficiency: regression tasks
    ├── signal_classification/  # Data efficiency: classification
    └── anomaly_detection/      # Background vs signal scoring
```
Key output files:
- Training plots and metrics in `training/`
- Data efficiency plots comparing foundation model benefits in `testing/*/`
- Model weights for reuse in `models/foundation_model/` (see the loading sketch below)
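Reusing the saved foundation model is a standard Keras load. The path and save format below are assumptions; inspect `models/foundation_model/` in your experiment folder to see what the pipeline actually writes.

```python
import tensorflow as tf

# Path and serialization format are assumptions; check the
# experiment folder for the files the pipeline actually saves.
model = tf.keras.models.load_model(
    "_foundation_experiments/001_Foundation_VAE_Model/models/foundation_model"
)
model.summary()
```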
Code quality:
- `.pre-commit-config.yaml` - Automated code formatting (ruff) and quality checks (vulture)
- `uv` package management with `pyproject.toml` configuration
Visual interface:
- `launch_gradio.py` - Opens a web page with the result viewer UI
Development tools:
- `.devcontainer/` - Docker container for a consistent development environment
- `scripts/test_gpu.py` - Verify TensorFlow GPU access on your system
- `src/hep_foundation/utils/plot_utils.py` - Standardized colors, fonts, and styling for all plots
Testing:
- `tests/test_pipeline.py` - Small-scale (~60 s) test of the whole pipeline. Flags missing or unexpected files in the output folder `_test_results/`
- `python run_pytest.py` - Python wrapper for the tests that only logs warnings, errors, and progress; full logs go to `_test_results/pytest.log`
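Both entry points are ordinary shell commands; if you prefer to invoke the test from Python (for example inside a notebook), `pytest.main` works too - a minor convenience, not a project API:

```python
import pytest

# Equivalent to `pytest tests/test_pipeline.py` on the command line;
# returns a non-zero exit code on failure.
exit_code = pytest.main(["-q", "tests/test_pipeline.py"])
```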
If you use this software in your research, please cite:
Yue, A. (2024). HEP Foundation: Foundation models for High Energy Physics data analysis.
https://github.com/Alezander9/hep-foundation
BibTeX format:

```bibtex
@software{yue_hep_foundation_2024,
  author = {Yue, Alexander},
  title = {HEP Foundation: Foundation models for High Energy Physics data analysis},
  url = {https://github.com/Alezander9/hep-foundation},
  year = {2024}
}
```

This project is licensed under the MIT License - see the LICENSE file for details.
Questions or issues?
- Email: [email protected]
- GitHub: Alezander9/hep-foundation