# TACUDA: High-Performance CUDA-Accelerated Technical Analysis Library
TACUDA delivers lightning-fast technical analysis indicators powered by NVIDIA CUDA GPUs. Built for quantitative traders, researchers, and financial analysts who demand maximum performance from their computational workflows.
- GPU-accelerated kernels for massive parallel computation
- Pipelined workloads with optional `cudaStream_t` support
- Zero-copy operations where possible
- Optimized memory patterns for coalesced access
- Moving Averages: SMA, EMA, WMA
- Price Transforms: AVGPRICE, MEDPRICE, TYPPRICE, WCLPRICE, MIDPRICE
- Oscillators: MIDPOINT, MAXINDEX, MININDEX, WILLR
- Momentum: ROC, ROCP, ROCR, ROCR100
- Statistical: STDDEV, MIN, MAX, MINMAX, MINMAXINDEX
- Advanced: RSI, MACD, BBANDS, and more
Comprehensive pattern detection including Doji, Hammer, Engulfing patterns, Three White Soldiers, and many more.
- C/C++: Stable ABI with `extern "C"` interface
- Python: Pythonic API with NumPy integration
- C#: Native .NET bindings with generic support
- Warm-up markers via trailing `NaN` values for incomplete windows
- Comprehensive test suite with GoogleTest
- Memory-safe RAII patterns
- Thread-safe indicator registry
| Component | Version | Notes |
|---|---|---|
| NVIDIA GPU | Compute capability ≥ 6.0 | Pascal architecture or newer |
| CUDA Toolkit | 11.x / 12.x | 12.x recommended |
| CMake | ≥ 3.21 | |
| Compiler | C++17 | GCC, Clang, or MSVC |
| Python | ≥ 3.8 | Optional, for Python bindings |
| .NET | ≥ 7.0 | Optional, for C# bindings |
```sh
# Clone repository
git clone https://github.com/pavadik/tacuda.git
cd tacuda

# Build core library
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

# Run example
./build/examples/tacuda_example
```
```sh
# Build with Python support
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DBUILD_PYTHON=ON
cmake --build build -j$(nproc)

# Test installation
PYTHONPATH=build python -c "
import numpy as np
import tacuda
data = np.arange(1, 11, dtype=np.float32)
result = tacuda.sma(data, window=5)
print('SMA Result:', result)
"
```
```sh
# Build C# bindings
dotnet build bindings/csharp/ConsoleExample -c Release

# Ensure library is discoverable (Linux example)
export LD_LIBRARY_PATH=$PWD/build:$LD_LIBRARY_PATH
dotnet run --project bindings/csharp/ConsoleExample
```
Python and C# bindings are generated from the public header. After editing `include/tacuda.h`, regenerate the artefacts and commit the results:

```sh
python bindings/generate_bindings.py
```

CTest contains a guard that fails if the checked-in bindings are stale.
```python
import numpy as np
import tacuda

# Generate sample price data
prices = np.random.randn(1000).cumsum().astype(np.float32)

# Calculate 20-period Simple Moving Average
sma_20 = tacuda.sma(prices, window=20)

# Generic indicator interface
roc_5 = tacuda.run("ROC", prices, timeperiod=5)

# Williams %R oscillator
willr = tacuda.run("WILLR", prices, timeperiod=14)
```
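For validating results against a CPU baseline, a minimal NumPy sketch of TA-Lib-style ROC can help. This assumes the same left-aligned, trailing-`NaN` output convention the library uses for SMA; `roc_ref` is our illustrative helper, not part of tacuda:

```python
import numpy as np

def roc_ref(x, timeperiod):
    """Rate of change in percent, left-aligned with trailing NaNs.

    out[i] = (x[i + timeperiod] / x[i] - 1) * 100 for complete spans.
    """
    x = np.asarray(x, dtype=np.float64)
    out = np.full(len(x), np.nan)
    if 0 < timeperiod < len(x):
        out[:-timeperiod] = (x[timeperiod:] / x[:-timeperiod] - 1.0) * 100.0
    return out

print(roc_ref([1.0, 2.0, 4.0, 8.0], 1))
```

Comparing such references against the GPU output (e.g. with `np.testing.assert_allclose`) is a quick sanity check when wiring up a new workflow.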
```cpp
#include "tacuda/api.h"
#include "tacuda/indicators/sma.h"
#include <iostream>
#include <vector>

int main() {
    // Input data
    std::vector<float> prices = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    std::vector<float> output;

    // Configure SMA parameters
    tacuda::SMAParams params{5}; // 5-period window

    // Execute on GPU
    auto status = tacuda::run_indicator_host("SMA", prices, output, params);
    if (status != tacuda::Status::OK) {
        std::cerr << "SMA computation failed" << std::endl;
        return 1;
    }

    std::cout << "SMA computed successfully!" << std::endl;
    for (float value : output) {
        std::cout << value << " ";
    }
    std::cout << std::endl;
    return 0;
}
```
Expected output:

```
SMA computed successfully!
3 4 5 6 7 8 nan nan nan nan
```

The library leaves trailing `NaN` values in place of samples that do not yet cover a full window, so consumers can easily spot warm-up regions.
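The same left-aligned, trailing-`NaN` convention can be reproduced on the CPU with a short NumPy reference (illustrative only; `sma_ref` is not part of the library API):

```python
import numpy as np

def sma_ref(x, window):
    """Left-aligned SMA: out[i] = mean(x[i : i + window]), trailing NaNs."""
    x = np.asarray(x, dtype=np.float32)
    n = len(x)
    out = np.full(n, np.nan, dtype=np.float32)
    if 0 < window <= n:
        # One mean per complete window, written to the front of the output
        kernel = np.ones(window, dtype=np.float32) / window
        out[: n - window + 1] = np.convolve(x, kernel, mode="valid")
    return out

print(sma_ref(np.arange(1, 11), 5))  # six means, then four NaNs
```

Such a reference makes it easy to assert that GPU results match, NaN positions included.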
```cpp
#include <tacuda/OHLCVSeries.h>
#include <tacuda.h>
#include <vector>

std::vector<float> open{1.0f, 2.0f, 1.5f};
std::vector<float> high{1.2f, 2.3f, 1.6f};
std::vector<float> low{0.9f, 1.9f, 1.3f};
std::vector<float> close{1.1f, 2.1f, 1.4f};

// Volume defaults to zero when omitted
tacuda::OHLCVSeries candles(open, high, low, close);
std::vector<float> imi(candles.size());
ct_imi(candles.open_data(), candles.close_data(), imi.data(),
       static_cast<int>(candles.size()), 3);
```
```python
from tacuda import OHLCV, imi

ohlcv = OHLCV.from_columns(open, high, low, close, volume)
result = imi(ohlcv.open, ohlcv.close, period=3)
packed = ohlcv.column_major()  # [O1..On, H1..Hn, ...]
```
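The `column_major()` layout noted in the comment is just the columns concatenated end to end. A NumPy sketch of the packing (our illustration, not the binding's actual implementation; `pack_column_major` is a hypothetical helper):

```python
import numpy as np

def pack_column_major(open_, high, low, close, volume=None):
    """Flatten OHLCV columns into one array: [O1..On, H1..Hn, L..., C..., V...]."""
    n = len(open_)
    cols = [open_, high, low, close,
            volume if volume is not None else np.zeros(n)]  # volume defaults to zero
    return np.concatenate([np.asarray(c, dtype=np.float32) for c in cols])

packed = pack_column_major([1.0, 2.0, 1.5], [1.2, 2.3, 1.6],
                           [0.9, 1.9, 1.3], [1.1, 2.1, 1.4])
```

Column-major packing keeps each field contiguous, which suits coalesced GPU reads.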
```csharp
using Tacuda.Bindings;

var candles = new OhlcvSeries(open, high, low, close);
var output = new float[candles.Length];
NativeMethods.ct_imi(candles.Open, candles.Close, output, candles.Length, period: 3);
var columnMajor = candles.ToColumnMajor();
```
```csharp
using System;
using Tacuda;

class Program
{
    static void Main()
    {
        float[] prices = {1f, 2f, 3f, 4f, 5f, 6f, 7f, 8f, 9f, 10f};

        // Simple Moving Average
        var sma = Tacuda.SMA(prices, window: 5);

        // Generic interface
        var roc = Tacuda.Run<ROCParams>("ROC", prices, new ROCParams { timeperiod = 5 });

        Console.WriteLine($"SMA: [{string.Join(", ", sma)}]");
    }
}
```
TACUDA delivers significant performance improvements over CPU implementations:
```python
# Benchmark example (1M data points)
import time

import numpy as np
import tacuda

def benchmark_sma():
    n = 1_000_000
    data = np.random.rand(n).astype(np.float32)
    for window in [5, 14, 50, 200]:
        # CPU implementation
        cpu_start = time.perf_counter()
        cpu_result = np.convolve(data, np.ones(window) / window, mode='same')
        cpu_time = time.perf_counter() - cpu_start

        # GPU implementation
        gpu_start = time.perf_counter()
        gpu_result = tacuda.sma(data, window=window)
        gpu_time = time.perf_counter() - gpu_start

        speedup = cpu_time / gpu_time
        print(f"Window {window:3d}: CPU {cpu_time:.4f}s | GPU {gpu_time:.4f}s | {speedup:.1f}x faster")

benchmark_sma()
```
Typical results (NVIDIA RTX 4090):

```
Window   5: CPU 0.0234s | GPU 0.0031s |  7.5x faster
Window  14: CPU 0.0267s | GPU 0.0033s |  8.1x faster
Window  50: CPU 0.0312s | GPU 0.0035s |  8.9x faster
Window 200: CPU 0.0445s | GPU 0.0041s | 10.9x faster
```
The EMA family of indicators now shares a single CUDA implementation based on a prefix-scan formulation of the linear recurrence:

```
EMA[i] = α · x[i] + (1 − α) · EMA[i − 1]
```

Instead of iterating sequentially, we interpret each update as an affine transformation and compute the cumulative product of these transforms with `thrust::inclusive_scan`. This yields all intermediate EMA values in parallel, which are then reused across EMA, DEMA, TEMA, T3, TRIX, and MACD calculations. The shared helper keeps warm-up regions initialised to `NaN` while guaranteeing identical numerical outputs to the previous per-kernel loops.
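The affine-composition trick can be sketched in NumPy. The loop below performs the scan sequentially; on the GPU, `thrust::inclusive_scan` applies the same associative composition in parallel. Seeding with `x[0]` is a simplification for illustration (the library's actual seeding and warm-up handling may differ):

```python
import numpy as np

def ema_scan(x, period):
    # Each update EMA[i] = a*x[i] + (1-a)*EMA[i-1] is the affine map
    # y -> m*y + c with m = 1 - a and c = a*x[i].  Affine maps compose
    # associatively: (m2, c2) o (m1, c1) = (m2*m1, m2*c1 + c2), so an
    # inclusive scan over the (m, c) pairs yields every EMA at once.
    x = np.asarray(x, dtype=np.float64)
    a = 2.0 / (period + 1)
    m = np.full(len(x), 1.0 - a)
    c = a * x
    m[0], c[0] = 0.0, x[0]          # seed: EMA[0] = x[0] (illustrative choice)
    for i in range(1, len(x)):      # sequential here; the GPU scans in parallel
        c[i] = m[i] * c[i - 1] + c[i]
        m[i] = m[i] * m[i - 1]
    return c                        # m[0] = 0 makes each composed map constant
```

Because the seed zeroes the multiplicative part, the composed constant term `c[i]` is exactly the EMA at position `i`, independent of any initial state.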
To reproduce the performance improvement for large smoothing periods, run the dedicated benchmark:

```sh
python benchmarks/bench_ema.py
```
TACUDA employs a clean, extensible architecture:
```
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│    Language     │    │     Unified      │    │   CUDA Kernel   │
│    Bindings     │───▶│    Dispatcher    │───▶│    Execution    │
│ (Python/C#/C++) │    │     Registry     │    │     Engine      │
└─────────────────┘    └──────────────────┘    └─────────────────┘
         │                      │                       │
         ▼                      ▼                       ▼
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Type Safety   │    │      Memory      │    │    Optimized    │
│    Parameter    │    │    Management    │    │   Algorithms    │
│   Validation    │    │      (RAII)      │    │    (Parallel)   │
└─────────────────┘    └──────────────────┘    └─────────────────┘
```
- 🔧 Registry: Thread-safe, lazy-loaded indicator registry
- ⚡ IndicatorFn: Unified kernel signature for all indicators
- 🎯 DeviceBuffer: RAII memory management with async operations
- 🔌 Bindings: Language-specific wrappers with native feel
```sh
# Run comprehensive test suite
ctest --test-dir build --output-on-failure

# Performance regression testing
cd benchmarks && python benchmark_suite.py

# Memory leak detection (Linux)
valgrind --tool=memcheck ./build/examples/tacuda_example
```
Test Coverage:
- ✅ Numerical accuracy vs reference implementations
- ✅ Edge cases (empty inputs, extreme values)
- ✅ Memory safety and leak detection
- ✅ Multi-threaded safety
- ✅ Cross-platform compatibility
Adding new indicators is straightforward:
```cpp
// include/tacuda/indicators/my_indicator.h
struct MyIndicatorParams {
    int period;
    float factor;
};
```

```cuda
// src/indicators/my_indicator.cu
__global__ void my_indicator_kernel(const float* d_in, float* d_out,
                                    int n, MyIndicatorParams params) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;
    // Your algorithm here
    d_out[idx] = d_in[idx] * params.factor;
}

// Register in the global registry
REGISTER_INDICATOR("MY_INDICATOR", my_indicator_dispatch, MyIndicatorParams);
```

```python
# Python binding
def my_indicator(data, period=10, factor=1.0):
    return run("MY_INDICATOR", data, period=period, factor=factor)
```
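A good habit when adding a kernel is to cross-check it against a CPU reference, mirroring how the GoogleTest suite validates against TA-Lib. For the toy kernel above (which scales each sample by `factor`), a NumPy check might look like this (`my_indicator_ref` is our sketch, not part of the library):

```python
import numpy as np

def my_indicator_ref(data, period=10, factor=1.0):
    """CPU reference for the toy kernel: out[i] = in[i] * factor."""
    return np.asarray(data, dtype=np.float32) * np.float32(factor)

# In a real test you would compare this against the GPU path, e.g.:
#   gpu = tacuda.run("MY_INDICATOR", data, period=10, factor=2.0)
#   np.testing.assert_allclose(gpu, my_indicator_ref(data, factor=2.0))
```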
- CUDA implementations for moving averages, momentum, volatility, oscillators, candlestick recognition, and price transforms with TA-Lib-compatible semantics, backed by 140+ GPU kernels.
- Python (ctypes) and .NET 7 bindings generated from the shared C ABI via `bindings/generate_bindings.py`, with regeneration guards wired into CTest.
- GoogleTest regression suite cross-checking against TA-Lib reference data, including device buffer pool coverage and representative indicator families.
Data ergonomics & throughput
- Columnar OHLCV host container plus binding updates so callers no longer hand-pack arrays.
- Batched/multi-symbol execution entry points with stream-aware scheduling for portfolio-scale workloads.
Release readiness
- Packaging for common ecosystems (PyPI wheel, NuGet, Conda, binary releases) with automated CI publication.
- Documented production deployment guide, ABI stability policy, and fill-in for the referenced `docs/` tree.
Benchmarking & validation
- Curated benchmark datasets and published comparative numbers covering CPU vs GPU baselines.
- Design and implement an OHLCV columnar container usable from C++, Python, and .NET host APIs.
- Update bindings to accept structured inputs and extend tests/benchmarks to exercise the new path.
- Document migration guidance for existing users still relying on manual array packing.
- Add multi-symbol/batched dispatchers with CUDA stream management hooks.
- Extend the registry and bindings to accept portfolio requests, including stress tests and profiling scripts.
- Prototype heuristics for overlapping host/device transfers and kernel launches.
- Author reproducible benchmark datasets and integrate them into the benchmarking harness.
- Stand up CI jobs producing PyPI wheels, Conda packages, NuGet packages, and binary tarballs.
- Capture release criteria covering artifact validation, signing, and smoke tests.
- Build out the referenced documentation tree (API, user guide, performance, operations) and publish deployment runbooks.
- Formalize ABI stability guarantees, including header versioning and changelog automation.
- Prepare an adoption checklist covering monitoring, upgrade sequencing, and support escalation paths.
- API Reference - Complete function documentation
- User Guide - Getting started tutorials
- Performance Guide - Optimization best practices
- Contributing - Development workflow
We welcome contributions! Here's how to get started:
```sh
# Fork and clone
git clone https://github.com/pavadik/tacuda.git
cd tacuda

# Install pre-commit hooks
pip install pre-commit
pre-commit install

# Build in debug mode
cmake -S . -B build -DCMAKE_BUILD_TYPE=Debug -DBUILD_TESTS=ON
cmake --build build -j$(nproc)
```
- 🎯 Focus on performance: Include benchmarks for performance-affecting changes
- 🧪 Test coverage: New features must include comprehensive tests
- 📝 Documentation: Update docs for API changes
- 🎨 Code style: Use `clang-format` and maintain consistency
- 💬 Discussion: Open an issue before major architectural changes
- ✅ All tests must pass
- ✅ No compiler warnings
- ✅ Memory leak-free
- ✅ Thread-safe implementations
- ✅ Proper error handling
Licensed under the Apache License, Version 2.0. See LICENSE for details.
- NVIDIA for CUDA toolkit and documentation
- TA-Lib for algorithmic reference implementations
- NumPy community for API design inspiration
- GoogleTest for testing framework
This project is independent and not affiliated with TA-Lib or any other trademark. "TA-style API" refers to interface similarity for user convenience only.
⭐ Star this repo if you find TACUDA useful!
Made with ❤️ by Pavel Dikalov