Simulate ancient DNA (aDNA) reads from pangenome haplotypes
ancestralsim is a pipeline for simulating ancient DNA sequencing reads from locus-specific pangenome haplotypes. It uses gargammel to generate realistic aDNA reads with authentic damage patterns, fragmentation, and contamination profiles, then aligns them to a reference chromosome using BWA.
- Pangenome-based simulation: Works with diploid samples from pangenome FASTA files
- Realistic aDNA characteristics:
- Configurable DNA fragment lengths (typical aDNA: ~45bp)
- Single or double-stranded deamination patterns
- C-to-T transitions at fragment ends
- Contamination modeling: Simulate exogenous human DNA contamination from other samples
- Flexible sequencing: Single-end (SE) or paired-end (PE) library support
git clone --recursive https://github.com/davidebolo1993/ancestralsim.git
cd ancestralsimENV_PATH="/path/to/environment/installation/directory"
mamba env create -f environment.yml -p $ENV_PATH
#fix incorrext lib
cd $ENV_PATH/lib
ln -s libgsl.so.27 libgsl.so.25
cd -singularity pull --dir gargamel/. docker://quay.io/biocontainers/gargammel:1.1.4--hb66fcc3_0When using this singularity container, you need to also have bwa, fastp and samtools executables available in $PATH.
#Help
./ancestralsim.sh -h
# Basic usage
./ancestralsim.sh
-r reference/GRCh38.fa
-g ./gargammel
-b region_mapping.tsv
# Increasing 0.5X coverage, increase modern-human contamination to 15%, simulate a single-end short-read library
./ancestralsim.sh
-r reference/GRCh38.fa
-g ./gargammel
-b region_mapping.tsv
-c 1.0
--cont-ratio 0.15
-o simulation_outputUsage: ancestralsim.sh -r <reference.fa> -g <gargammel_dir> -b <region_mapping.tsv> [options]
Required:
-r FILE Reference genome FASTA
-g DIR Path to gargammel installation directory
-b FILE TSV mapping regions to alleles (chr<TAB>region_name<TAB>fasta_path)
Optional:
-c FLOAT Target coverage (default 0.5)
-l INT Mean fragment length (default 70)
-R INT Read length (default 100)
-L TYPE Library type: se|pe (default pe)
-d TYPE Deamination type: single|double (default single)
--deam-rate "VALS" Custom deam rates like "0.03,0.4,0.01,0.3"
--cont-ratio FLOAT Exogenous DNA ratio (default 0.1)
-o DIR Output directory (default output)
-t INT Threads (default 4)
-h Show this help
Tab-separated file specifying chromosome, region name, and path to alleles FASTA:
chr1 CYP2J2 /path/to/chr1_region1.fasta.gz
chr1 MUC1 /path/to/chr1_region2.fasta.gz
chr17 BRCA1 /path/to/chr17_region1.fasta.gz
chr22 CYP2D6 /path/to/chr22_region1.fasta.gzThe input FASTA file must contain diploid samples with headers in the format:
>SAMPLE_ID#HAPLOTYPE#CHROMOSOME
Example:
>HG00096#1#chr1
ACGTACGTACGT...
>HG00096#2#chr1
ACGTACGTACGT...
>HG00097#1#chr1
ACGTACGTACGT...
>HG00097#2#chr1
ACGTACGTACGT...
The pipeline automatically identifies diploid samples (those with both haplotype #1 and #2) and while it uses a diploid sample for simulation of aDNA reads, another random diploid sample from the same set is used to simulate modern-day contaminants.
The pipeline generates:
output/
├── chr1/
│ ├── reference_chr1.fa # Extracted chromosome reference
│ ├── CYP2J2/
│ │ ├── bams/
│ │ │ ├── HG00096.sorted.bam
│ │ │ ├── HG00096.sorted.bam.bai
│ │ │ └── HG00096.flagstat.txt
│ │ ├── logs/
│ │ │ ├── HG00096_sequence_mapping.txt
│ │ │ ├── HG00096_gargammel.log
│ │ │ └── HG00096_fastp.html
│ │ └── region_summary.txt
│ └── MUC1/
│ └── (same structure)
├── chr17/
│ └── BRCA1/
│ └── (same structure)
├── merged_bams/ # Created after running merge script
│ ├── HG00096.merged.bam # Reads from all regions
│ └── HG00097.merged.bam
├── merge_commands.sh # Auto-generated merge script
└── global_simulation_summary.txt
chr*/region/bams/*.sorted.bam: Per-region aligned aDNA readsmerged_bams/*.merged.bam: Per-sample BAMs merged across all regionslogs/*_sequence_mapping.txt: Maps simulated sequences to original haplotype nameslogs/*_fastp.html: Adapter trimming quality reportsmerge_commands.sh: Script to merge per-region BAMs into per-sample BAMs
After simulation completes, merge per-region BAMs into single BAM files per sample:
bash output/merge_commands.sh
This creates output/merged_bams/SAMPLE.merged.bam files containing reads from all simulated regions.
Target sequencing coverage. For aDNA studies, typical values range from 0.1x to 2x.
Mean DNA fragment length. Ancient DNA is typically highly fragmented:
- Modern DNA: 150-500bp
- Ancient DNA: 30-80bp (mean ~45bp)
DNA damage patterns:
single: C-to-T deamination on one strand (partially treated with UDG)double: C-to-T on both strands (non-UDG treated)
Proportion of exogenous human DNA contamination (0.0-1.0):
- 0.0: No contamination
- 0.1: 10% contamination (typical for well-preserved samples)
- 0.3: 30% contamination (moderate)
- 0.5+: High contamination (challenging samples)
As mentioned above, the pipeline randomly selects a different diploid sample as the contaminant source.
If you use ancestralsim in your research, please cite the gargammel paper:
- gargammel: Renaud G, Hanghøj K, Korneliussen TS, et al. (2017) gargammel: a sequence simulator for ancient DNA. Bioinformatics 33(4):577-579. doi:10.1093/bioinformatics/btw670
This project is licensed under the MIT License - see the LICENSE file for details.
Contributions are welcome! Please feel free to submit a Pull Request.
For issues, questions, or suggestions, please open an issue on GitHub.
Davide Bolognini (@davidebolo1993)