Skip to content

cpp20120/DagFlow

Repository files navigation

DagFlow

CMake License .

DAG-flow runtime.

A self-contained, minimal runtime for parallel task execution in C++20.
Designed for workloads where you know the dependency graph in advance, need predictable scheduling, and want full control over affinity, priorities, and back-pressure — without the complexity of a full TBB.

Mini-runtime for parallel tasks in C++20:

  • Work-stealing pool (Chase–Lev decks, central MPMC queues).

  • Lockfree MPMC (Michael–Scott) with QSBR path.

  • A local small_function for cheap closures.

  • A lightweight DAG graph with cancellation, competition limits, and back-pressure.

  • High-level API in the spirit of TBB: submit/then/when_all/parallel_for via TaskScope.

  • Scheduler: local deques (Chase–Lev) + "central" MPMC queues for external submissions; worker-first, then an attempt to steal from neighbors.

  • MPMC: Michael–Scott with a QSBR path (quiet state based reclamation) for queue tails in the pool.

  • Synchronization: memory_order (acquire/release/seq_cst where reconciliation is needed).

  • Graph Security: Workers keep Core (shared_ptr) → no UAF, even if the TaskGraph was destroyed before wait().

By default, small_function<void(), 128 > in nodes. If you catch a Callable too large, increase the SSO (256) or pack the captures into the std::shared_ptr block.

For "heavy" stages, set concurrency > 1 in ScheduleOptions/NodeOptions.

Configure Config for CPU/NUMA; pin_threads=true is useful for cache stability.

In large pipelines, turn on back-pressure (capacity, Overflow) at the bottleneck stages.

Benchmarks

CPU: Ryzen 7 6800H OS: Windows 11 Build Mode: Release Compiler: MSVCx64 19.44.35214 Config: pin_threads = false

Benchmark Runs Mean Time Min Time Max Time Cost of 1 task +-
Dependent chain (1000 tasks) 5 0.00579 s 0.00491 s 0.00765 s ~3–4 μs per node
Independent tasks (1000) 5 1.5283 s 1.41794 s 1.64484 s
Independent batched (1000, batch=10) 5 1.41718 s 1.39467 s 1.44458 s ~7–20% faster, than without batch
Parallel_for (1e6 elements) 5 0.54729 s 0.53645 s 0.57182 s
Workflow (width=10, depth=5) 5 0.00161 s 0.00147 s 0.00179 s ~30–35 μs for all DAG
Noop tasks (1 000 000) 5 4.411 s* 4.37711 s 4.48551 s ~4.7–4.8 μs per task
  • Numbers may vary with CPU governor, background load, and configuration.

Compare with TBB

Benchmark Problem size My TP mean (s) TBB mean (s) Speedup (TP / TBB) TP throughput TBB throughput
Dependent chain 1000 tasks 0.00579 0.00288156 2.01× ~172,712 tasks/s ~347,034 tasks/s
Independent tasks 1000 tasks 1.52830 1.25051 1.22× ~654.3 tasks/s ~799.7 tasks/s
Independent batched 1000 (batch=10) 1.41718 1.30121 1.09× ~705.6 tasks/s ~768.5 tasks/s
parallel_for(old) 1,000,000 elems 0.54729 0.0268362 20.39× ~1.83M elems/s ~37.26M elems/s
for_each_ws (new) 1,000,000 elems 0.00957 0.0268362 0.36×* ~104.5M elems/s ~37.26M elems/s
Workflow (w=10,d=5) ~50 stage ops 0.00161 0.00029248 5.50×
Noop tasks 1,000,000 tasks 4.411 0.225568 20.75× ~213,630 tasks/s ~4,433,253 tasks/s

* - (2.8× faster than TBB)

For each versions

For each version Runs Mean Time Min Time Max Time Notes / Speedup
for_each (static chunking) 5 0.01761 s 0.01456 s 0.02074 s baseline
for_each_ws (range stealing) 5 0.00957 s 0.00855 s 0.01058 s ~1.84× faster vs static

for_each_ws (range stealing) — Lazy binary range partitioning with stealing "upper" halves (help-first). Uniform loading of threads on uneven load and less "tails". Compatible only with RA iterators.

Visualization

First Second Third

Build and usage

git clone https://github.com/cpp20120/DagFlow.git
cd DagFlow
cmake -B build -DTp_BUILD_EXAMPLES=ON
cmake --build build --config Release

How to use at your cmake project

  1. Via add_subdirectory:
add_subdirectory(external/DagFlow)

add_executable(my_app main.cpp)
target_link_libraries(my_app PRIVATE DagFlow)
  1. Via find_package:
find_package(DagFlow REQUIRED)

add_executable(my_app main.cpp)
target_link_libraries(my_app PRIVATE DagFlow::DagFlow)

find_package works after cmake --install.

2.5 On windows add to your target:

if (WIN32 AND TP_BUILD_SHARED)
  add_custom_command(TARGET ${CMAKE_PROJECT_NAME} POST_BUILD
    COMMAND ${CMAKE_COMMAND} -E copy_if_different
      $<TARGET_FILE:DagFlow>
      $<TARGET_FILE_DIR:my_app>
  )
endif()

Exampe of usage: there

Minimal example

#include <dagflow.hpp>

dagflow::Pool pool;
dagflow::TaskScope scope(pool);

auto a = scope.submit([] { /**/ });
auto b = scope.then(a, [] { /**/ });
auto c = scope.when_all({a, b}, [] { /**/ });

scope.run_and_wait();

Note: This pool is not yet optimized noop and non ws version of for_each microbenchmarks — missing chunk/range stealing and has a non-ideal queue implementation for such patterns. The focus is DAG execution, affinity control, and back-pressure.

About

A self-contained, minimal runtime for parallel task execution in C++20.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 2

  •  
  •