⚠️ Status: Under Active Development
This analysis pipeline is currently being developed and refined. Methods, results, and documentation are subject to change. For detailed technical documentation, see CLAUDE.md.
This repository contains an R-based data analysis pipeline for investigating COVID-19 transmission patterns across geographic regions and demographic groups in North America. The pipeline analyzes SARS-CoV-2 genomic sequences to identify identical or near-identical sequence pairs and calculates relative risk (RR) metrics for transmission between different populations.
- State/Division-level RR matrices: Quantifies transmission risk between US states, Canadian provinces, and Mexican regions
- Census division and region analyses: Aggregated transmission patterns across US Census geographic hierarchies
- Time-stratified geographic RR: Rolling window analysis (2-month windows, ±1 month) to track changing transmission patterns over time
- Network visualization: Identification and visualization of persistent inter-state transmission connections
- Age-stratified RR matrices: Fine-grained age bins (individual years or 5-year bins depending on dataset)
- Age × State joint analyses: Simultaneous stratification by age and geography
- Sex-stratified analyses: Transmission patterns by biological sex
- Age-time series: Temporal dynamics of age-specific transmission with normalization
- School instruction modality correlation: Links school in-person/hybrid/virtual instruction shares with age-specific transmission rates
- State trajectory classification: Classifies states into three categories based on in-person instruction patterns during the 2020-2021 academic year:
  - Persistent High (6 states): Maintained ≥80% in-person throughout
  - Persistent Low (6 states): Maintained ≤20% in-person throughout
  - Rising In-Person (3 states): Transitioned from ≤20% to ≥80% during the year
- Academic year analyses: School-age transmission RR by state across multiple academic years (2020-2024)
- Normalized RR metrics:
  - nRR (diagonal normalization): Measures preferential mixing within groups relative to between-group mixing
  - nRR_fixed (baseline normalization): Transmission relative to a stable reference population (e.g., working-age adults)
- Bootstrap confidence intervals: Subsampling-based uncertainty quantification
- Mixed-effects modeling: Accounts for temporal autocorrelation and state-level random effects in school analyses
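The subsampling-based confidence intervals listed above could be sketched roughly as follows. This is an illustration in base R, not the pipeline's implementation: the toy pair table, the helper names `rr_ab` and `subsample_ci`, the 80% subsample fraction, and the 500 replicates are all assumptions chosen for the example.

```r
set.seed(1)

# Toy table of ordered sequence pairs between two hypothetical groups.
pairs <- data.frame(
  x = sample(c("A", "B"), 200, replace = TRUE),
  y = sample(c("A", "B"), 200, replace = TRUE)
)

# RR for the (A, B) cell, following the README formula with the +1 pseudocount.
rr_ab <- function(p) {
  n_total    <- nrow(p)
  pair_count <- sum(p$x == "A" & p$y == "B")
  (pair_count * n_total + 1) / (sum(p$x == "A") * sum(p$y == "B"))
}

# Repeatedly draw a subsample of pairs without replacement, recompute the
# statistic, and report empirical quantiles as the interval.
# frac and reps are assumed values, not the pipeline's settings.
subsample_ci <- function(p, stat, frac = 0.8, reps = 500, level = 0.95) {
  n_keep <- floor(frac * nrow(p))
  draws  <- replicate(reps, stat(p[sample(nrow(p), n_keep), , drop = FALSE]))
  alpha  <- (1 - level) / 2
  quantile(draws, c(alpha, 1 - alpha), names = FALSE)
}

ci <- subsample_ci(pairs, rr_ab)
```

Here `ci` holds the lower and upper bounds for the toy (A, B) cell; the same wrapper could be applied to any statistic computed from a pair table.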
Sequence Data Processing
- Raw SARS-CoV-2 genomic sequences with metadata (location, date, age, sex, viral clade)
- Pairwise distance calculations identify identical or near-identical sequences
- Sequences filtered by coverage (≥90%), host type (human), and geographic validity
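The filtering step described above might look something like the following base R sketch. The column names (`coverage`, `host`, `division`), the toy rows, and the `valid_divisions` vector are hypothetical stand-ins; the actual metadata schema is defined by the pipeline's cleaning script.

```r
# Toy metadata; column names and values are illustrative assumptions.
meta <- data.frame(
  strain   = c("s1", "s2", "s3", "s4"),
  coverage = c(0.99, 0.85, 0.95, 0.92),
  host     = c("Human", "Human", "Mink", "Human"),
  division = c("Washington", "Ontario", "Utah", NA),
  stringsAsFactors = FALSE
)

# Hypothetical whitelist of recognized geographic divisions.
valid_divisions <- c("Washington", "Ontario", "Utah")

# Keep sequences with >= 90% coverage, a human host, and a recognized division.
# NA divisions fail the %in% check and are dropped.
keep <- meta$coverage >= 0.90 &
  tolower(meta$host) == "human" &
  meta$division %in% valid_divisions

clean <- meta[keep, ]
```

In this toy table only `s1` passes all three filters: `s2` fails on coverage, `s3` on host, and `s4` on the missing division.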
Relative Risk Calculation
- Pairs of identical sequences represent potential transmission events
- RR quantifies whether pairs occur more or less frequently than expected between exposure groups
- Formula: RR = ((pair_count × N_total) + 1) / (x_appearances × y_appearances)
- Pseudocount (+1) prevents division by zero and stabilizes rare category estimates
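As a worked illustration of the formula above, here is a minimal base R sketch. It is not the function in `calculate_rr_matrix.R`. The name `rr_sketch` is hypothetical, and the sketch assumes that `N_total` is the total number of ordered pairs and that `x_appearances` and `y_appearances` are the row and column totals of the pair-count table.

```r
# Hypothetical helper: RR matrix from a table of ordered pairs with columns x, y.
rr_sketch <- function(pairs) {
  groups <- sort(unique(c(pairs$x, pairs$y)))
  # Pair counts for every (x, y) combination, including zero-count cells.
  counts <- unclass(table(factor(pairs$x, levels = groups),
                          factor(pairs$y, levels = groups)))
  n_total <- sum(counts)      # assumed meaning of N_total
  x_app   <- rowSums(counts)  # assumed meaning of x_appearances
  y_app   <- colSums(counts)  # assumed meaning of y_appearances
  # The +1 pseudocount keeps zero-count cells strictly above zero.
  (counts * n_total + 1) / outer(x_app, y_app)
}

# Toy data: 2 A-A pairs, 1 A-B, 1 B-A, 2 B-B.
pairs <- data.frame(
  x = c("A", "A", "A", "B", "B", "B"),
  y = c("A", "A", "B", "A", "B", "B")
)
rr <- rr_sketch(pairs)
```

For this toy table, `rr["A", "A"]` is (2 × 6 + 1) / (3 × 3) = 13/9 and `rr["A", "B"]` is (1 × 6 + 1) / (3 × 3) = 7/9, so within-group pairs are over-represented relative to between-group pairs.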
Temporal Stratification
- Rolling time windows: 2-month windows centered on mid-points spaced every 4 weeks
- Academic year periods: Fixed Sept 1 - May 31 windows for school analyses
- Time-bounding enables tracking of transmission dynamics across pandemic phases
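The rolling-window scheme above can be sketched in base R as follows. The helper name `make_windows` and the example dates are hypothetical, and the sketch approximates "±1 month" as ±30 days, which may differ from how the pipeline handles calendar months.

```r
# Hypothetical helper: windows centered on mid-points spaced every 4 weeks.
# ASSUMPTION: a half-width of 30 days stands in for "±1 month".
make_windows <- function(first_mid, last_mid, half_width_days = 30) {
  mids <- seq(as.Date(first_mid), as.Date(last_mid), by = "4 weeks")
  data.frame(
    mid   = mids,
    start = mids - half_width_days,
    end   = mids + half_width_days
  )
}

windows <- make_windows("2020-03-01", "2020-06-01")

# Because windows are wider than their spacing, one sample date
# can fall into several overlapping windows.
sample_date <- as.Date("2020-04-10")
hits <- windows$start <= sample_date & sample_date <= windows$end
```

With these example dates, four windows are generated and the sample date falls into two of them, which is why adjacent time points in a rolling analysis share data and are not independent.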
Normalization
- Diagonal normalization controls for overall transmission intensity changes over time
- Baseline normalization enables comparison against a stable reference group
- Group-specific normalization (by date, state, etc.) prevents confounding across strata
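To make the two normalizations above concrete, here is one plausible reading in base R. The exact definitions used by the pipeline are not given in this README, so both functions are assumptions: `normalize_diagonal` divides each row by that group's within-group RR, and `normalize_fixed` divides every entry by the within-group RR of a chosen reference group. The function names and the toy matrix are hypothetical.

```r
# Toy RR matrix for two hypothetical age groups (values are made up).
rr <- matrix(
  c(2.0, 0.5,
    0.5, 4.0),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("child", "adult"), c("child", "adult"))
)

# ASSUMPTION: diagonal normalization divides row x by RR[x, x], so the
# diagonal becomes 1 and off-diagonals are relative to within-group mixing.
normalize_diagonal <- function(rr) {
  sweep(rr, 1, diag(rr), "/")
}

# ASSUMPTION: baseline normalization divides every entry by the
# within-group RR of a single fixed reference group.
normalize_fixed <- function(rr, ref) {
  rr / rr[ref, ref]
}

nrr       <- normalize_diagonal(rr)
nrr_fixed <- normalize_fixed(rr, ref = "adult")
```

Under these assumptions, `nrr["child", "adult"]` is 0.5 / 2 = 0.25, while `nrr_fixed["child", "child"]` is 2 / 4 = 0.5. When applied per stratum (for example per time window), each stratum would be normalized by its own diagonal or reference value.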
ncov-usa-mig/
├── data/ # Input data (sequence metadata, pairs, reference data)
│ ├── metadata/ # Compressed sequence metadata
│ ├── distance_aggregated/ # Pairwise sequence comparisons
│ └── state_school_share.csv # School instruction modality by state/month
├── scripts/ # Analysis scripts
│ ├── clean_data.R # Data cleaning and standardization
│ ├── bind_pairs_exp.R # Pair-exposure variable joining
│ ├── calculate_rr_matrix.R # Core RR calculation function
│ ├── age_analysis.R # Age-stratified RR
│ ├── state_analysis.R # Geographic RR
│ ├── age_time_RR_analysis.R # Time-stratified age RR with normalization
│ ├── school_share_analysis.R # School modality correlation & trajectory classification
│ └── *_plot.R / *_heatmap.R # Visualization scripts
├── results/ # Analysis outputs
│ ├── df_RR_by_*.tsv # RR matrices
│ └── time_age/ # Time-stratified school analyses
├── figs/ # Visualizations
├── db_files/ # DuckDB databases for efficient querying
├── Snakefile # Workflow automation
├── config.yaml # Configuration settings
├── CLAUDE.md # Detailed technical documentation
└── README.md # This file
- df_RR_by_age_class.tsv: Age-stratified relative risk
- df_RR_by_state.tsv: State/division-stratified relative risk
- df_RR_by_age_state.tsv: Joint age × state relative risk
- df_RR_by_census_div.tsv: Census division-level relative risk
- time_age/df_RR_by_time_age_series.tsv: Age RR time series (rolling windows)
- time_age/df_RR_by_school_state_time.tsv: School-age RR by state and time
- time_age/df_RR_by_school_state_ay.tsv: School-age RR by academic year
- time_age/state_trajectory_classifications.tsv: State in-person instruction trajectories
- Heatmaps: Symmetric RR matrices with dendrograms
- Geographic maps: State-level transmission patterns on US maps
- Time series plots: Temporal dynamics with confidence intervals
- Network visualizations: Persistent inter-state transmission connections
- Trajectory plots: School instruction modality by state classification
- R (≥4.0) with packages:
  - tidyverse (dplyr, ggplot2, readr, tidyr, purrr, stringr)
  - DuckDB (efficient database queries)
  - dbplyr (lazy evaluation)
  - data.table (fast data manipulation)
  - lme4/lmerTest (mixed-effects models)
  - argparse (command-line interfaces)
  - RColorBrewer, usmap (visualization)
- Snakemake (≥7.18): Workflow management
- DuckDB CLI: Database initialization
- zstd: Data compression/decompression
- SLURM cluster environment (for large-scale analyses)
- Sufficient memory for genomic data processing (recommend ≥32GB RAM)
- Multi-core processor for parallel processing
# 1. Initialize database
./scripts/init_db.sh
# 2. Clean and standardize metadata
Rscript ./scripts/clean_data.R
# 3. Run a specific analysis
Rscript ./scripts/age_analysis.R --ci TRUE
# 4. Generate visualizations
Rscript ./scripts/age_heatmap.R

# Run complete analysis pipeline via Snakemake
snakemake --profile ./profile --cores 32 --group-components duckdb_acc=1
# Or submit to SLURM cluster
sbatch batch_analysis.sh

# Generate school-age RR data (requires age_time_RR_analysis.R to run first)
Rscript ./scripts/school_share_analysis.R

This will:
- Classify states by in-person instruction trajectory
- Generate correlation plots between school modality and transmission
- Create faceted trajectory visualization
- Output state classifications to time_age/state_trajectory_classifications.tsv
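The trajectory classification step could be sketched as below, using the thresholds stated in this README (≥80% and ≤20% in-person). This is an illustration, not the logic in `school_share_analysis.R`: the function name `classify_trajectory`, the use of fractions rather than percentages, the reading of "Rising" as first month ≤20% and last month ≥80%, and the fallback label "Unclassified" are all assumptions.

```r
# Hypothetical classifier for one state's monthly in-person shares,
# given as fractions in chronological order.
classify_trajectory <- function(share, high = 0.80, low = 0.20) {
  if (all(share >= high)) return("Persistent High")
  if (all(share <= low))  return("Persistent Low")
  # ASSUMPTION: "rising" means starting at or below `low`
  # and ending at or above `high`.
  if (share[1] <= low && share[length(share)] >= high) {
    return("Rising In-Person")
  }
  "Unclassified"  # assumed label for states matching none of the three
}

# Toy data with made-up state codes; rows are already in month order.
shares <- data.frame(
  state     = rep(c("S1", "S2", "S3", "S4"), each = 3),
  in_person = c(0.90, 0.85, 0.95,
                0.05, 0.10, 0.15,
                0.10, 0.50, 0.90,
                0.50, 0.60, 0.40)
)

classes <- tapply(shares$in_person, shares$state, classify_trajectory)
```

`classes` is a named vector with one label per state, which could then be written out in the same shape as the classification table.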
- ✅ Implemented normalization functions (diagonal and baseline)
- ✅ Added school instruction modality correlation analyses
- ✅ Created state trajectory classification system
- ✅ Refined time windows from 6-month to 2-month rolling windows
- ✅ Added academic year fixed-period analyses
- ✅ Implemented country-stratified age RR time series
- ✅ Enhanced visualization suite with faceted trajectory plots
- 🔄 Validation of normalization approaches for different research questions
- 🔄 Sensitivity analyses for time window parameters
- 🔄 Integration with policy/intervention timelines
- 🔄 Network analysis of persistent transmission routes
- Ensure make_state_nb_dist.R completes successfully and verify output
- Update state_time_scatter.R to use the updated neighbor distances from make_state_nb_dist.R
- CLAUDE.md: Comprehensive technical documentation for developers
  - Core architecture and data flow
  - Function reference with line numbers
  - Normalization strategy guide
  - File organization and naming conventions
  - Development patterns and best practices
Note: Methods and results are preliminary. Formal publication is in preparation.
For questions about this analysis pipeline, please contact the repository maintainers.
[License information to be added]
Last Updated: October 2025