DAG-flow runtime.
A self-contained, minimal runtime for parallel task execution in C++20.
Designed for workloads where you know the dependency graph in advance, need predictable scheduling, and want full control over affinity, priorities, and back-pressure, without pulling in all of TBB.
A mini-runtime for parallel tasks in C++20:
- Work-stealing pool (Chase–Lev deques plus central MPMC queues).
- Lock-free MPMC queue (Michael–Scott) with a QSBR path.
- A local `small_function` for cheap closures.
- A lightweight DAG graph with cancellation, concurrency limits, and back-pressure.
- High-level API in the spirit of TBB: `submit`/`then`/`when_all`/`parallel_for` via `TaskScope`.
- Scheduler: local deques (Chase–Lev) plus "central" MPMC queues for external submissions; a worker drains its own deque first, then tries to steal from neighbors.
- MPMC: Michael–Scott with a QSBR (quiescent-state-based reclamation) path for queue tails in the pool.
- Synchronization: explicit `memory_order` (acquire/release, with `seq_cst` only where full ordering is required).
- Graph safety: workers hold the `Core` via `shared_ptr`, so there is no use-after-free even if the `TaskGraph` is destroyed before `wait()`.
By default, nodes store callables in `small_function<void(), 128>`. If a callable is too large to fit, either increase the SSO buffer (e.g. to 256) or pack the captures into a single `std::shared_ptr` block.
For "heavy" stages, set `concurrency > 1` in `ScheduleOptions`/`NodeOptions`.
Configure `Config` for your CPU/NUMA topology; `pin_threads = true` helps cache stability.
In large pipelines, enable back-pressure (`capacity`, `Overflow`) at the bottleneck stages.
Benchmark environment: Ryzen 7 6800H, Windows 11, Release build, MSVC x64 19.44.35214, `pin_threads = false`.
Benchmark | Runs | Mean time | Min time | Max time | Per-task cost
---|---|---|---|---|---
Dependent chain (1000 tasks) | 5 | 0.00579 s | 0.00491 s | 0.00765 s | ~3–4 μs per node |
Independent tasks (1000) | 5 | 1.5283 s | 1.41794 s | 1.64484 s | — |
Independent batched (1000, batch=10) | 5 | 1.41718 s | 1.39467 s | 1.44458 s | ~7–20% faster than unbatched |
Parallel_for (1e6 elements) | 5 | 0.54729 s | 0.53645 s | 0.57182 s | — |
Workflow (width=10, depth=5) | 5 | 0.00161 s | 0.00147 s | 0.00179 s | ~30–35 μs for the whole DAG |
Noop tasks (1 000 000) | 5 | 4.411 s* | 4.37711 s | 4.48551 s | ~4.7–4.8 μs per task |
\* Numbers may vary with the CPU governor, background load, and configuration.
Comparison with TBB
Benchmark | Problem size | My TP mean (s) | TBB mean (s) | Time ratio (TP / TBB) | TP throughput | TBB throughput
---|---|---|---|---|---|---
Dependent chain | 1000 tasks | 0.00579 | 0.00288156 | 2.01× | ~172,712 tasks/s | ~347,034 tasks/s |
Independent tasks | 1000 tasks | 1.52830 | 1.25051 | 1.22× | ~654.3 tasks/s | ~799.7 tasks/s |
Independent batched | 1000 (batch=10) | 1.41718 | 1.30121 | 1.09× | ~705.6 tasks/s | ~768.5 tasks/s |
parallel_for(old) | 1,000,000 elems | 0.54729 | 0.0268362 | 20.39× | ~1.83M elems/s | ~37.26M elems/s |
for_each_ws (new) | 1,000,000 elems | 0.00957 | 0.0268362 | 0.36×* | ~104.5M elems/s | ~37.26M elems/s |
Workflow (w=10,d=5) | ~50 stage ops | 0.00161 | 0.00029248 | 5.50× | — | — |
Noop tasks | 1,000,000 tasks | 4.411 | 0.225568 | 20.75× | ~213,630 tasks/s | ~4,433,253 tasks/s |
\* For for_each_ws, the ratio of 0.36× means the pool is ~2.8× faster than TBB on this benchmark.
for_each versions
for_each version | Runs | Mean time | Min time | Max time | Notes / Speedup
---|---|---|---|---|---
for_each (static chunking) | 5 | 0.01761 s | 0.01456 s | 0.02074 s | baseline |
for_each_ws (range stealing) | 5 | 0.00957 s | 0.00855 s | 0.01058 s | ~1.84× faster vs static |
for_each_ws (range stealing) uses lazy binary range partitioning: a thief steals the "upper" half of a victim's range (help-first). This evens out thread load under non-uniform work and shortens the loop's "tail". It requires random-access iterators.
```shell
git clone https://github.com/cpp20120/DagFlow.git
cd DagFlow
cmake -B build -DTp_BUILD_EXAMPLES=ON
cmake --build build --config Release
```
How to use in your CMake project
- Via `add_subdirectory`:

```cmake
add_subdirectory(external/DagFlow)
add_executable(my_app main.cpp)
target_link_libraries(my_app PRIVATE DagFlow)
```
- Via `find_package` (works after `cmake --install`):

```cmake
find_package(DagFlow REQUIRED)
add_executable(my_app main.cpp)
target_link_libraries(my_app PRIVATE DagFlow::DagFlow)
```
On Windows, when building the shared library, copy the DLL next to your executable:

```cmake
if (WIN32 AND TP_BUILD_SHARED)
  add_custom_command(TARGET my_app POST_BUILD
    COMMAND ${CMAKE_COMMAND} -E copy_if_different
      $<TARGET_FILE:DagFlow>
      $<TARGET_FILE_DIR:my_app>
  )
endif()
```
Usage examples: see the examples built with `-DTp_BUILD_EXAMPLES=ON`.
Minimal example
```cpp
#include <dagflow.hpp>

int main() {
  dagflow::Pool pool;
  dagflow::TaskScope scope(pool);

  auto a = scope.submit([] { /* … */ });
  auto b = scope.then(a, [] { /* … */ });
  auto c = scope.when_all({a, b}, [] { /* … */ });

  scope.run_and_wait();
}
```
Note: the pool is not yet tuned for the noop and non-work-stealing `for_each` microbenchmarks: those paths lack chunk/range stealing, and the queue implementation is not ideal for such patterns. The focus is DAG execution, affinity control, and back-pressure.