HEP Foundation


Quick Start

# 1. Clone and setup environment
git clone https://github.com/Alezander9/hep-foundation
cd hep-foundation
uv venv --python 3.9
source .venv/bin/activate

# 2. Install package and dependencies
uv pip install -e .
uv sync --group dev
pre-commit install

# 3. Run test to verify installation
pytest tests/test_pipeline.py
# Check that test passes and results look good in _test_results/

# 4. Run real experiment
cp tests/_test_pipeline_config.yaml _experiment_config_stack/
python scripts/run_pipelines.py
# Results will appear in _foundation_experiments/

Usage

Pipeline Overview

The HEP Foundation pipeline is designed as a simple workflow:

  1. Create configs → Put YAML configuration files in _experiment_config_stack/
  2. Run pipeline → Execute python scripts/run_pipelines.py
  3. Get results → Find experiment results in _foundation_experiments/

Use tests/_test_pipeline_config.yaml as a template - just modify the values for your experiments. The pipeline processes all configs in the stack and removes them as it completes each one.
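
The stack behaviour described above can be sketched as follows. This is a minimal illustration, not the code in scripts/run_pipelines.py: the function name process_config_stack and the run_one callback are hypothetical, and the *.yaml glob and alphabetical ordering are assumptions.

```python
from pathlib import Path
from typing import Callable, List


def process_config_stack(stack_dir: Path, run_one: Callable[[Path], None]) -> List[str]:
    """Run every YAML config in stack_dir, removing each one after it completes."""
    completed = []
    # Alphabetical order is an assumption; the real runner may order differently.
    for config_path in sorted(stack_dir.glob("*.yaml")):
        run_one(config_path)   # hypothetical stand-in for one full pipeline run
        config_path.unlink()   # reached only if run_one returned without raising
        completed.append(config_path.name)
    return completed
```

In this sketch a config whose run raises an exception stays in the stack, so it can be fixed and retried on the next invocation.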

Results

Each experiment produces a complete folder in _foundation_experiments/ containing:

  • Training results - Model weights, training history, and plots
  • Regression evaluation - Data efficiency comparison across model types
  • Signal classification - Binary classification performance analysis
  • Anomaly detection - Background vs signal discrimination metrics
  • Reproducibility - Copy of original config and experiment metadata

The pipeline automatically runs the full sequence: foundation model training → regression → signal classification → anomaly detection.

System Requirements

Performance Expectations

  • Dataset size: Typically O(1M) events per dataset
  • Training speed: ~15 seconds per epoch on NERSC A100 GPU
  • GPU requirement: Strongly recommended for training (CPU training is very slow)

Recommended Workflow

For NERSC users
  1. Download catalogs on login nodes - The pipeline is bottlenecked by downloading ROOT files from CERN OpenData, so run the downloads before requesting a job:

    python scripts/download_catalogs.py  # Downloads catalogs
  2. Run pipeline on NERSC cluster:

    sbatch jobs/submit_pipeline_simple.sh

For local development: Run the pipeline directly, as shown in the Quick Start.

Project Structure

Key Directories

src/hep_foundation/            # Main package source code
├── config/                    # Configuration loading and validation
├── data/                      # Dataset management, PhysLite data system
├── models/                    # Model architectures (VAE, DNN, etc.)
├── pipeline/                  # FoundationModelPipeline and helpers
├── plots/                     # Plotting code and HistogramManager
├── training/                  # ModelTrainer and training utilities
└── utils/                     # Plotting, logging, and utility functions

scripts/                       # Execution and utility scripts
├── run_pipelines.py           # Main pipeline runner
├── create_datasets.py         # Local dataset creation
└── transfer_*.py              # Remote transfer utilities

tests/                         # Test suite and test configurations
jobs/                          # SLURM job submission scripts
├── debug_pipeline.sh          # Check and log environment info
└── submit_pipeline_simple.sh  # Run full pipeline on all catalogs
logs/                          # Pipeline execution logs

_experiment_config_stack/      # Input: YAML configs to process
_foundation_experiments/       # Output: Experiment results
_processed_datasets/           # Cached datasets (HDF5 files)
_test_results/                 # Test outputs (cleaned each run)

Configuration Files

Creating configs: Use tests/_test_pipeline_config.yaml as a template for your experiments.

Key configuration sections:

  • dataset: Data selection (ATLAS run numbers, signal types)
  • models: VAE and DNN architectures
  • training: Training parameters (epochs, batch size, learning rate)
  • evaluation: Data sizes for efficiency studies
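
A quick sanity check on a loaded config might look like the sketch below. It assumes the four section names listed above are the literal top-level YAML keys, which the template file should confirm; the helper missing_sections is hypothetical and not part of the package.

```python
from typing import Any, Dict, List

# Section names taken from the list above. Whether they are the exact
# top-level keys in the YAML template is an assumption.
EXPECTED_SECTIONS = ("dataset", "models", "training", "evaluation")


def missing_sections(config: Dict[str, Any]) -> List[str]:
    """Return the expected section names that are absent from a parsed config."""
    return [name for name in EXPECTED_SECTIONS if name not in config]
```

The input is the dictionary produced by parsing the YAML file; an empty return value means all four sections are present.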

PhysLite features: Specify any PhysLite branch names in the config. Derived features (eta, pt, etc.) are automatically calculated from base branches using physlite_derived_features.py.
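
The derivation step could look roughly like the following sketch. It assumes the base branches are a track's polar angle theta and its charge-over-momentum q/p. The function name derived_track_features is hypothetical, and the actual physlite_derived_features.py may use different inputs and conventions.

```python
import math
from typing import Dict


def derived_track_features(theta: float, q_over_p: float) -> Dict[str, float]:
    """Compute eta and pt from assumed base quantities theta and q/p.

    theta is the polar angle in radians (0 < theta < pi); q_over_p is
    charge over momentum, so pt comes out in the inverse of its unit.
    """
    # Pseudorapidity from the polar angle: eta = -ln(tan(theta / 2)).
    eta = -math.log(math.tan(theta / 2.0))
    # Transverse momentum: |p| * sin(theta), with |p| = 1 / |q/p|.
    pt = abs(1.0 / q_over_p) * math.sin(theta)
    return {"eta": eta, "pt": pt}
```

A track perpendicular to the beam (theta = pi/2) gives eta of approximately 0 and pt equal to the full momentum.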

Understanding Results

Each experiment folder contains:

001_Foundation_VAE_Model/
├── _experiment_config.yaml     # Reproducible config copy
├── _experiment_info.json       # Experiment metadata
├── models/foundation_model/    # Saved model weights
├── training/                   # Training metrics and plots
└── testing/
    ├── regression_evaluation/      # Data efficiency: regression tasks
    ├── signal_classification/      # Data efficiency: classification
    └── anomaly_detection/          # Background vs signal scoring

Key output files:

  • Training plots and metrics in training/
  • Data efficiency plots comparing foundation model benefits in testing/*/
  • Model weights for reuse in models/foundation_model/
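
To confirm that a finished experiment folder matches the layout shown above, a small check like this can help. It is a sketch: the entry list is copied from the tree in this section, and the helper missing_entries is hypothetical rather than something the package provides.

```python
from pathlib import Path
from typing import List

# Relative paths copied from the folder tree above.
EXPECTED_ENTRIES = (
    "_experiment_config.yaml",
    "_experiment_info.json",
    "models/foundation_model",
    "training",
    "testing/regression_evaluation",
    "testing/signal_classification",
    "testing/anomaly_detection",
)


def missing_entries(experiment_dir: Path) -> List[str]:
    """Return the expected files or folders that do not exist in experiment_dir."""
    return [rel for rel in EXPECTED_ENTRIES if not (experiment_dir / rel).exists()]
```

For example, missing_entries(Path("_foundation_experiments/001_Foundation_VAE_Model")) returns an empty list when every expected entry is present.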

Development Utilities

Code quality:

  • .pre-commit-config.yaml - Automated code formatting (ruff) and quality checks (vulture)
  • uv package management with pyproject.toml configuration

Visual Interface:

  • launch_gradio.py - Opens webpage with result viewer UI

Development tools:

  • .devcontainer/ - Docker container for consistent development environment
  • scripts/test_gpu.py - Verify TensorFlow GPU access on your system
  • src/hep_foundation/utils/plot_utils.py - Standardized colors, fonts, and styling for all plots

Testing:

  • tests/test_pipeline.py - Small-scale (~60s) test of the whole pipeline. Flags missing or unexpected files in the output folder _test_results/
  • python run_pytest.py - Python wrapper for the tests; prints only warnings, errors, and progress messages. Full logs go to _test_results/pytest.log

Citation

If you use this software in your research, please cite:

Yue, A. (2024). HEP Foundation: Foundation models for High Energy Physics data analysis.
https://github.com/Alezander9/hep-foundation

BibTeX format:

@software{yue_hep_foundation_2024,
  author = {Yue, Alexander},
  title = {HEP Foundation: Foundation models for High Energy Physics data analysis},
  url = {https://github.com/Alezander9/hep-foundation},
  year = {2024}
}

Note: A research paper is in preparation. This citation will be updated when published.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contact

Questions or issues?

Email: [email protected]
GitHub: Alezander9/hep-foundation
