DagFlow

.

DAG-flow runtime.

A self-contained, minimal runtime for parallel task execution in C++20.
Designed for workloads where you know the dependency graph in advance, need predictable scheduling, and want full control over affinity, priorities, and back-pressure — without the complexity of a full TBB.

Mini-runtime for parallel tasks in C++20:

Work-stealing pool (Chase–Lev decks, central MPMC queues).
Lockfree MPMC (Michael–Scott) with QSBR path.
A local small_function for cheap closures.
A lightweight DAG graph with cancellation, competition limits, and back-pressure.
High-level API in the spirit of TBB: submit/then/when_all/parallel_for via TaskScope.

Design

Scheduler: local deques (Chase–Lev) + "central" MPMC queues for external submissions; worker-first, then an attempt to steal from neighbors.
MPMC: Michael–Scott with a QSBR path (quiet state based reclamation) for queue tails in the pool.
Synchronization: memory_order (acquire/release/seq_cst where reconciliation is needed).
Graph Security: Workers keep Core (shared_ptr) → no UAF, even if the TaskGraph was destroyed before wait().

By default, small_function<void(), 128 > in nodes. If you catch a Callable too large, increase the SSO (256) or pack the captures into the std::shared_ptr block.

For "heavy" stages, set concurrency > 1 in ScheduleOptions/NodeOptions.

Configure Config for CPU/NUMA; pin_threads=true is useful for cache stability.

In large pipelines, turn on back-pressure (capacity, Overflow) at the bottleneck stages.

Benchmarks

CPU: Ryzen 7 6800H OS: Windows 11 Build Mode: Release Compiler: MSVCx64 19.44.35214 Config: pin_threads = false

Benchmark	Runs	Mean Time	Min Time	Max Time	Cost of 1 task +-
Dependent chain (1000 tasks)	5	0.00579 s	0.00491 s	0.00765 s	~3–4 μs per node
Independent tasks (1000)	5	1.5283 s	1.41794 s	1.64484 s	—
Independent batched (1000, batch=10)	5	1.41718 s	1.39467 s	1.44458 s	~7–20% faster, than without batch
Parallel_for (1e6 elements)	5	0.54729 s	0.53645 s	0.57182 s	—
Workflow (width=10, depth=5)	5	0.00161 s	0.00147 s	0.00179 s	~30–35 μs for all DAG
Noop tasks (1 000 000)	5	4.411 s*	4.37711 s	4.48551 s	~4.7–4.8 μs per task

Numbers may vary with CPU governor, background load, and configuration.

Compare with TBB

Benchmark	Problem size	My TP mean (s)	TBB mean (s)	Speedup (TP / TBB)	TP throughput	TBB throughput
Dependent chain	1000 tasks	0.00579	0.00288156	2.01×	~172,712 tasks/s	~347,034 tasks/s
Independent tasks	1000 tasks	1.52830	1.25051	1.22×	~654.3 tasks/s	~799.7 tasks/s
Independent batched	1000 (batch=10)	1.41718	1.30121	1.09×	~705.6 tasks/s	~768.5 tasks/s
parallel_for(old)	1,000,000 elems	0.54729	0.0268362	20.39×	~1.83M elems/s	~37.26M elems/s
for_each_ws (new)	1,000,000 elems	0.00957	0.0268362	0.36×*	~104.5M elems/s	~37.26M elems/s
Workflow (w=10,d=5)	~50 stage ops	0.00161	0.00029248	5.50×	—	—
Noop tasks	1,000,000 tasks	4.411	0.225568	20.75×	~213,630 tasks/s	~4,433,253 tasks/s

* - (2.8× faster than TBB)

For each versions

For each version	Runs	Mean Time	Min Time	Max Time	Notes / Speedup
for_each (static chunking)	5	0.01761 s	0.01456 s	0.02074 s	baseline
for_each_ws (range stealing)	5	0.00957 s	0.00855 s	0.01058 s	~1.84× faster vs static

for_each_ws (range stealing) — Lazy binary range partitioning with stealing "upper" halves (help-first). Uniform loading of threads on uneven load and less "tails". Compatible only with RA iterators.

Visualization

Build and usage

git clone https://github.com/cpp20120/DagFlow.git
cd DagFlow
cmake -B build -DTp_BUILD_EXAMPLES=ON
cmake --build build --config Release

How to use at your cmake project

Via add_subdirectory:

add_subdirectory(external/DagFlow)

add_executable(my_app main.cpp)
target_link_libraries(my_app PRIVATE DagFlow)

Via find_package:

find_package(DagFlow REQUIRED)

add_executable(my_app main.cpp)
target_link_libraries(my_app PRIVATE DagFlow::DagFlow)

find_package works after cmake --install.

2.5 On windows add to your target:

if (WIN32 AND TP_BUILD_SHARED)
  add_custom_command(TARGET ${CMAKE_PROJECT_NAME} POST_BUILD
    COMMAND ${CMAKE_COMMAND} -E copy_if_different
      $<TARGET_FILE:DagFlow>
      $<TARGET_FILE_DIR:my_app>
  )
endif()

Exampe of usage: there

Minimal example

#include <dagflow.hpp>

dagflow::Pool pool;
dagflow::TaskScope scope(pool);

auto a = scope.submit([] { /* … */ });
auto b = scope.then(a, [] { /* … */ });
auto c = scope.when_all({a, b}, [] { /* … */ });

scope.run_and_wait();

Note: This pool is not yet optimized noop and non ws version of for_each microbenchmarks — missing chunk/range stealing and has a non-ideal queue implementation for such patterns. The focus is DAG execution, affinity control, and back-pressure.

Name		Name	Last commit message	Last commit date
Latest commit History 43 Commits
bench		bench
cmake		cmake
docs		docs
include		include
src		src
.clang-format		.clang-format
.clang-tidy		.clang-tidy
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
CMakePresets.json		CMakePresets.json
CMakeSettings.json		CMakeSettings.json
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

DagFlow

Design

Benchmarks

Visualization

Build and usage

Exampe of usage: there

About

Uh oh!

Releases

Packages

Contributors 2

Uh oh!

Languages

License

cpp20120/DagFlow

Folders and files

Latest commit

History

Repository files navigation

DagFlow

Design

Benchmarks

Visualization

Build and usage

Exampe of usage: there

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Uh oh!

Languages

Packages