Glucose Prediction

This code is a refactor of the code used in "Engineering digital biomarkers of interstitial glucose from noninvasive smartwatches" (Bent et al., 2021). There may be minor discrepancies between this code and the original paper, because the refactor adds vectorized operations and parallel processing and updates all of the code to work with current versions of Python libraries.

This repo contains a machine learning pipeline for predicting glucose levels using wearable sensor data, food logs, and demographic information. This project implements both personalized and population-based models using XGBoost with Random Forest feature selection.

Overview

This codebase provides a complete pipeline for:

Feature Engineering: Processing wearable sensor data (EDA, temperature, heart rate, accelerometer) and food logs to engineer 69 features
Model Training: Training personalized and population-based glucose prediction models
Cross-Validation: Leave-one-participant-out (LOPOCV) and personalized 50/50 split validation strategies

See the methods section of the original paper for additional information on rationale for design choices.

Project Structure

glucose-prediction/
├── configs/                    # Configuration files
│   ├── fe_config.yaml         # Feature engineering pipeline config
│   ├── model_loocv.yaml       # Population model config
│   └── model_personalized.yaml # Personalized model config
├── data/                       # Data directory (not in repo)
├── src/
│   ├── glucose_fe/            # Feature engineering pipeline
│   │   ├── cli.py            # Command-line interface
│   │   ├── pipeline.py       # Main pipeline orchestration
│   │   ├── features.py       # Feature computation (Pandas)
│   │   ├── features_polars.py # Feature computation (Polars)
│   │   ├── glucose.py        # Glucose data processing
│   │   ├── hrv.py            # Heart rate variability features
│   │   ├── stress.py         # Stress detection features
│   │   ├── wake.py           # Wake/sleep pattern features
│   │   ├── food.py           # Food log processing
│   │   └── io.py             # Data I/O utilities
│   └── models/                # Model training scripts
│       ├── train_personalized_xgb.py  # Personalized XGBoost training
│       ├── train_population_xgb.py    # Population XGBoost training
│       ├── config.py          # Model configuration
│       └── utils.py           # Model utilities
├── notebooks/                  # Jupyter notebooks
├── requirements.txt            # Python dependencies
└── README.md                  # This file

Prerequisites

Python 3.8+
Virtual environment (recommended)

Installation

Clone the repository:

git clone https://github.com/brinnaebent/glucose-prediction.git
cd glucose-prediction

Create and activate a virtual environment:

python -m venv gp-venv
source gp-venv/bin/activate  # On Windows: gp-venv\Scripts\activate

Install dependencies:
```
pip install -r requirements.txt
```

Data Setup

Before running the pipeline, you need to organize your data in the following structure. If downloaded from PhysioNet, it will already be organized in this way:

data/
├── 001/                       # Participant ID
│   ├── EDA_001.csv               # Empatica EDA sensor data
│   ├── TEMP_001.csv              # Empatica temperature data
│   ├── HR_001.csv                # Empatica heart rate data
│   ├── ACC_001.csv               # Empatica accelerometer data
│   ├── IBI_001.csv               # Empatica Inter-beat interval data
│   ├── Food_Log_001.csv          # Food logs
│   └── Dexcom_001.csv            # Glucose data
├── 002/                       # Another participant
│   └── ...
└── Demographics.csv           # Participant demographics
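
Before running the pipeline, it can help to confirm that every participant folder matches the layout above. The following is a minimal sketch, not part of this repository: the function name `find_missing_files` and the `REQUIRED_PREFIXES` tuple are hypothetical, and the file names are taken from the tree shown above.

```python
from pathlib import Path

# File-name prefixes copied from the layout above (one CSV per prefix per participant).
REQUIRED_PREFIXES = ("EDA", "TEMP", "HR", "ACC", "IBI", "Food_Log", "Dexcom")


def find_missing_files(data_root):
    """Return {participant_id: [missing file names]} for incomplete participants.

    Hypothetical helper: every sub-directory of data_root is treated as a participant.
    """
    root = Path(data_root)
    missing = {}
    for participant_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        pid = participant_dir.name
        absent = [
            f"{prefix}_{pid}.csv"
            for prefix in REQUIRED_PREFIXES
            if not (participant_dir / f"{prefix}_{pid}.csv").is_file()
        ]
        if absent:
            missing[pid] = absent
    # Demographics.csv sits at the top level of the data directory.
    if not (root / "Demographics.csv").is_file():
        missing["(root)"] = ["Demographics.csv"]
    return missing
```

Calling `find_missing_files("data")` returns an empty dict when the layout is complete, and otherwise lists the missing files per participant.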

Configuration

Feature Engineering Config (`configs/fe_config.yaml`)

Update the paths in fe_config.yaml to match your data directory structure:

paths:
  root: "/path/to/your/data"
  medx_dir: "."
  food_logs_dir: "."
  out_dir: "/path/to/output/directory"
  demographics_csv: "Demographics.csv"

Model Configs

Update the data paths in the model configuration files:

configs/model_loocv.yaml for population models
configs/model_personalized.yaml for personalized models

Usage

1. Feature Engineering Pipeline

Run the feature engineering pipeline to process raw sensor data:

python -m src.glucose_fe.cli --config configs/fe_config.yaml --max-workers 1

Options:

--config: Path to configuration file (required)
--compile-only: Only compile features without processing (optional)
--max-workers: Number of parallel workers (optional, defaults to all cores)
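
For readers who want to see how these three options fit together, here is a minimal `argparse` sketch. It is an illustration only and is not the code in `cli.py`: the function name `build_parser` and the help strings are assumptions, and the default for `--max-workers` is modeled with `os.cpu_count()` to match the "defaults to all cores" description above.

```python
import argparse
import os


def build_parser():
    """Hypothetical parser mirroring the documented options (not the repo's cli.py)."""
    parser = argparse.ArgumentParser(description="Feature engineering pipeline (sketch)")
    parser.add_argument("--config", required=True, help="Path to configuration file")
    parser.add_argument(
        "--compile-only",
        action="store_true",
        help="Only compile features without processing",
    )
    parser.add_argument(
        "--max-workers",
        type=int,
        default=os.cpu_count(),  # assumption: "all cores" means os.cpu_count()
        help="Number of parallel workers",
    )
    return parser


# Parse the same arguments as the example command above.
args = build_parser().parse_args(
    ["--config", "configs/fe_config.yaml", "--max-workers", "1"]
)
print(args.config, args.compile_only, args.max_workers)
```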

Output: The pipeline generates a compiled dataset at out/ALL_features_cleaned.parquet

2. Model Training

Population Model (Leave-One-Participant-Out Cross-Validation)

python -m src.models.train_population_xgb --config configs/model_loocv.yaml

Personalized Model (50/50 Split per Participant)

python -m src.models.train_personalized_xgb --config configs/model_personalized.yaml

Key Components

Feature Engineering

Sensor Features: Rolling statistics (mean, std, min, max) for EDA, temperature, heart rate, and accelerometer data
HRV Features: Heart rate variability metrics computed over sliding windows
Stress Detection: EDA peak counting for stress level assessment
Wake/Sleep Patterns: Activity-based sleep pattern detection
Food Features: Meal timing and nutritional information processing
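
As an illustration of the rolling statistics listed under Sensor Features, the sketch below computes mean, std, min, and max over a trailing window using only the standard library. It is not the repository's implementation (which uses Pandas or Polars): the function name `rolling_stats`, the window length, and the choice of population standard deviation are all assumptions made for the example.

```python
from collections import deque
from statistics import fmean, pstdev


def rolling_stats(values, window):
    """Return one dict of (mean, std, min, max) per sample over a trailing window.

    Early samples use a shorter window until `window` samples are available.
    """
    buf = deque(maxlen=window)  # oldest sample drops out automatically
    out = []
    for v in values:
        buf.append(v)
        out.append(
            {
                "mean": fmean(buf),
                "std": pstdev(buf),  # assumption: population std (0.0 for one sample)
                "min": min(buf),
                "max": max(buf),
            }
        )
    return out


# Made-up heart rate samples, purely for illustration.
hr = [60.0, 62.0, 64.0, 70.0]
stats = rolling_stats(hr, window=3)
print(stats[-1])
```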

Modeling

Feature Selection: Random Forest-based feature importance filtering
XGBoost Models: Gradient boosting with early stopping
Validation Strategies:
- Population: Leave-one-participant-out cross-validation
- Personalized: 50/50 temporal split per participant
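
The two validation strategies can be sketched as follows. This is a simplified illustration rather than the code in `src/models/`: the function names, the `timestamp` key, and the choice of the earlier half as the training set are assumptions.

```python
def lopo_folds(participant_ids):
    """Yield (train_ids, held_out_id) pairs, one fold per participant."""
    unique_ids = sorted(set(participant_ids))
    for held_out in unique_ids:
        train = [pid for pid in unique_ids if pid != held_out]
        yield train, held_out


def temporal_half_split(rows, time_key="timestamp"):
    """Split one participant's rows into an earlier half and a later half.

    Assumption: the earlier half is used for training, the later half for testing.
    """
    ordered = sorted(rows, key=lambda r: r[time_key])
    midpoint = len(ordered) // 2
    return ordered[:midpoint], ordered[midpoint:]


# Population: each participant is held out exactly once.
folds = list(lopo_folds(["001", "002", "003"]))

# Personalized: made-up rows for one participant, deliberately out of order.
rows = [{"timestamp": t, "glucose": g} for t, g in [(3, 110), (1, 95), (2, 101), (4, 120)]]
train, test = temporal_half_split(rows)
```

Sorting by time before splitting keeps every test sample later than every training sample, which avoids leaking future readings into the training half.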

Performance Optimization

Polars Engine: Use engine: "polars" in config for faster data processing
Parallel Processing: Adjust --max-workers based on your system capabilities
Memory Management: Polars provides better memory efficiency for large datasets

Output Structure

The pipeline generates the following outputs:

out/
├── ALL_features_cleaned.parquet    # Compiled feature dataset
├── modeling_population/            # Population model outputs
│   ├── models/                     # Trained models
│   ├── preds/                      # Predictions
│   ├── feature_lists/              # Selected features per fold
│   └── feature_importances/        # Feature importance rankings
└── modeling_personalized/          # Personalized model outputs
    ├── models/                     # Trained models
    ├── preds/                      # Predictions
    ├── feature_lists/              # Selected features per participant
    └── feature_importances/        # Feature importance rankings

Results

Using data downloaded directly from PhysioNet and the defaults and configs currently in this repository (as of 8/10/25):

Results for Population LOOCV model

Mean RMSE: 22.973 ± 4.767
Mean MAPE: 15.58% ± 4.17%
Mean Accuracy: 84.42% ± 4.17%

Results for Personalized model

Mean RMSE: 22.286 ± 4.448
Mean MAPE: 14.07% ± 2.75%
Mean Accuracy: 85.93% ± 2.75%

Note: small discrepancies from paper exist due to minor changes mentioned above.
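
In the results above, each accuracy value equals 100% minus the corresponding MAPE (for example, 100 - 15.58 = 84.42). The sketch below shows the standard definitions of RMSE and MAPE together with that relationship. The function names are hypothetical, the sample values are made up, and the exact metric code used in this repository may differ.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the inputs."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (y_true must be nonzero)."""
    return 100.0 * sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


def accuracy_from_mape(mape_percent):
    """Accuracy as 100% minus MAPE, consistent with the reported numbers."""
    return 100.0 - mape_percent


# Made-up glucose values, purely for illustration.
y_true = [100.0, 120.0, 80.0]
y_pred = [110.0, 114.0, 84.0]
print(rmse(y_true, y_pred), mape(y_true, y_pred), accuracy_from_mape(mape(y_true, y_pred)))
```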

Troubleshooting

Common Issues

Memory Errors: Reduce --max-workers or use the Polars engine
Missing Data: Ensure all required sensor files are present for each participant
Path Errors: Verify all paths in configuration files are correct and accessible

Contributing

Fork the repository
Create a feature branch
Make your changes
Add tests if applicable
Submit a pull request

License

MIT

Citation

If you use this code in your research, please cite the original paper:

Bent, B., Cho, P.J., Henriquez, M. et al. Engineering digital biomarkers of interstitial glucose from noninvasive smartwatches. npj Digit. Med. 4, 89 (2021). https://doi.org/10.1038/s41746-021-00465-w

Issues

Please submit an issue! This was refactored ~5 years after the original paper code was written, so there may be minor discrepancies. Support the open source nature of this code - if you fix something, submit a PR and help everyone out :)
