
🚀 TACUDA


High-Performance CUDA-Accelerated Technical Analysis Library

TACUDA delivers lightning-fast technical analysis indicators powered by NVIDIA CUDA GPUs. Built for quantitative traders, researchers, and financial analysts who demand maximum performance from their computational workflows.


✨ Key Features

🔥 Performance First

  • GPU-accelerated kernels for massive parallel computation
  • Pipelined workloads with optional cudaStream_t support
  • Zero-copy operations where possible
  • Optimized memory patterns for coalesced access

📊 Comprehensive Indicator Suite

  • Moving Averages: SMA, EMA, WMA
  • Price Transforms: AVGPRICE, MEDPRICE, TYPPRICE, WCLPRICE, MIDPRICE
  • Oscillators: MIDPOINT, MAXINDEX, MININDEX, WILLR
  • Momentum: ROC, ROCP, ROCR, ROCR100
  • Statistical: STDDEV, MIN, MAX, MINMAX, MINMAXINDEX
  • Advanced: RSI, MACD, BBANDS, Others

🕯️ Candlestick Pattern Recognition

Comprehensive pattern detection including Doji, Hammer, Engulfing patterns, Three White Soldiers, and many more.

🔌 Multi-Language Support

  • C/C++: Stable ABI with extern "C" interface
  • Python: Pythonic API with NumPy integration
  • C#: Native .NET bindings with generic support

🛡️ Production Ready

  • Warm-up markers via trailing NaN values for incomplete windows
  • Comprehensive test suite with GoogleTest
  • Memory-safe RAII patterns
  • Thread-safe indicator registry

🛠️ Installation

Prerequisites

| Component | Version | Notes |
| --- | --- | --- |
| NVIDIA GPU | Compute capability ≥ 6.0 | Pascal architecture or newer |
| CUDA Toolkit | 11.x / 12.x | 12.x recommended |
| CMake | ≥ 3.21 | |
| Compiler | C++17 | GCC, Clang, or MSVC |
| Python | ≥ 3.8 | Optional, for bindings |
| .NET | ≥ 7.0 | Optional, for C# bindings |

Quick Build

```bash
# Clone repository
git clone https://github.com/pavadik/tacuda.git
cd tacuda

# Build core library
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)

# Run example
./build/examples/tacuda_example
```

Python Installation

```bash
# Build with Python support
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release -DBUILD_PYTHON=ON
cmake --build build -j$(nproc)

# Test installation
PYTHONPATH=build python -c "
import numpy as np
import tacuda
data = np.arange(1, 11, dtype=np.float32)
result = tacuda.sma(data, window=5)
print('SMA Result:', result)
"
```

C# Installation

```bash
# Build C# bindings
dotnet build bindings/csharp/ConsoleExample -c Release

# Ensure library is discoverable (Linux example)
export LD_LIBRARY_PATH=$PWD/build:$LD_LIBRARY_PATH
dotnet run --project bindings/csharp/ConsoleExample
```

Binding generation workflow

Python and C# bindings are generated from the public header. After editing include/tacuda.h, regenerate the artefacts and commit the results:

```bash
python bindings/generate_bindings.py
```

CTest contains a guard that fails if the checked-in bindings are stale.


🚀 Quick Start

Python API

```python
import numpy as np
import tacuda

# Generate sample price data
prices = np.random.randn(1000).cumsum().astype(np.float32)

# Calculate 20-period Simple Moving Average
sma_20 = tacuda.sma(prices, window=20)

# Generic indicator interface
roc_5 = tacuda.run("ROC", prices, timeperiod=5)

# Williams %R oscillator
willr = tacuda.run("WILLR", prices, timeperiod=14)
```

C++ API

```cpp
#include "tacuda/api.h"
#include "tacuda/indicators/sma.h"
#include <vector>
#include <iostream>

int main() {
    // Input data
    std::vector<float> prices = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    std::vector<float> output;

    // Configure SMA parameters
    tacuda::SMAParams params{5};  // 5-period window

    // Execute on GPU and check the status before touching the output
    auto status = tacuda::run_indicator_host("SMA", prices, output, params);
    if (status != tacuda::Status::OK) {
        std::cerr << "SMA computation failed" << std::endl;
        return 1;
    }

    std::cout << "SMA computed successfully!" << std::endl;
    for (float value : output) {
        std::cout << value << " ";
    }
    std::cout << std::endl;

    return 0;
}
```

Expected output:

```text
SMA computed successfully!
3 4 5 6 7 8 nan nan nan nan
```

The library leaves trailing NaN values in place of samples that do not yet cover a full window so consumers can easily spot warm-up regions.
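The convention is straightforward to mirror on the CPU. The sketch below is a plain-NumPy reference (not tacuda code) that reproduces the same trailing-NaN layout, which is handy for validating results or building test fixtures:

```python
import numpy as np

def sma_reference(data, window):
    """CPU reference for tacuda's SMA convention: output[i] averages
    data[i : i + window], and the last window - 1 slots stay NaN."""
    n = len(data)
    out = np.full(n, np.nan, dtype=np.float32)
    for i in range(n - window + 1):
        out[i] = data[i:i + window].mean()
    return out

prices = np.arange(1, 11, dtype=np.float32)
# Matches the documented output above: 3 4 5 6 7 8 nan nan nan nan
print(sma_reference(prices, 5))
```

Consumers can then trim the warm-up region with `result[~np.isnan(result)]` or mask it explicitly.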

📦 OHLCV Container

C++

```cpp
#include <tacuda/OHLCVSeries.h>
#include <tacuda.h>

std::vector<float> open{1.0f, 2.0f, 1.5f};
std::vector<float> high{1.2f, 2.3f, 1.6f};
std::vector<float> low{0.9f, 1.9f, 1.3f};
std::vector<float> close{1.1f, 2.1f, 1.4f};

// Volume defaults to zero when omitted
tacuda::OHLCVSeries candles(open, high, low, close);
std::vector<float> imi(candles.size());
ct_imi(candles.open_data(), candles.close_data(), imi.data(),
       static_cast<int>(candles.size()), 3);
```

Python

```python
from tacuda import OHLCV, imi

ohlcv = OHLCV.from_columns(open, high, low, close, volume)
result = imi(ohlcv.open, ohlcv.close, period=3)
packed = ohlcv.column_major()  # [O1..On, H1..Hn, ...]
```
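The `[O1..On, H1..Hn, ...]` layout can be reproduced with plain NumPy. This is only a sketch of the packing convention inferred from the comment above; `column_major()` itself lives in the binding:

```python
import numpy as np

# Stand-in for OHLCV.column_major(): stack the four price columns and
# flatten so each field occupies one contiguous run in memory.
open_ = np.array([1.0, 2.0, 1.5], dtype=np.float32)
high  = np.array([1.2, 2.3, 1.6], dtype=np.float32)
low   = np.array([0.9, 1.9, 1.3], dtype=np.float32)
close = np.array([1.1, 2.1, 1.4], dtype=np.float32)

packed = np.concatenate([open_, high, low, close])
# packed[:3] is the open column, packed[3:6] the high column, and so on
```

Contiguous per-field runs are what lets the GPU kernels read each column with coalesced accesses.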

C#

```csharp
using Tacuda.Bindings;

var candles = new OhlcvSeries(open, high, low, close);
var output = new float[candles.Length];
NativeMethods.ct_imi(candles.Open, candles.Close, output, candles.Length, period: 3);
var columnMajor = candles.ToColumnMajor();
```

C# API

```csharp
using Tacuda;

class Program
{
    static void Main()
    {
        float[] prices = {1f, 2f, 3f, 4f, 5f, 6f, 7f, 8f, 9f, 10f};

        // Simple Moving Average
        var sma = Tacuda.SMA(prices, window: 5);

        // Generic interface
        var roc = Tacuda.Run<ROCParams>("ROC", prices, new ROCParams { timeperiod = 5 });

        Console.WriteLine($"SMA: [{string.Join(", ", sma)}]");
    }
}
```

📈 Performance Benchmarks

TACUDA delivers significant performance improvements over CPU implementations:

```python
# Benchmark example (1M data points)
import time
import numpy as np
import tacuda

def benchmark_sma():
    n = 1_000_000
    data = np.random.rand(n).astype(np.float32)

    for window in [5, 14, 50, 200]:
        # CPU baseline (rough: convolve's alignment differs from
        # tacuda's trailing-NaN convention, but the work is comparable)
        cpu_start = time.perf_counter()
        cpu_result = np.convolve(data, np.ones(window) / window, mode='same')
        cpu_time = time.perf_counter() - cpu_start

        # GPU implementation
        gpu_start = time.perf_counter()
        gpu_result = tacuda.sma(data, window=window)
        gpu_time = time.perf_counter() - gpu_start

        speedup = cpu_time / gpu_time
        print(f"Window {window:3d}: CPU {cpu_time:.4f}s | GPU {gpu_time:.4f}s | {speedup:.1f}x faster")

benchmark_sma()
```

Typical Results (NVIDIA RTX 4090):

```text
Window   5: CPU 0.0234s | GPU 0.0031s | 7.5x faster
Window  14: CPU 0.0267s | GPU 0.0033s | 8.1x faster
Window  50: CPU 0.0312s | GPU 0.0035s | 8.9x faster
Window 200: CPU 0.0445s | GPU 0.0041s | 10.9x faster
```

⚙️ Exponential Moving Average Acceleration

The EMA family of indicators now share a single CUDA implementation based on a prefix-scan formulation of the linear recurrence:

```text
EMA[i] = α · x[i] + (1 − α) · EMA[i − 1]
```

Instead of iterating sequentially, we interpret each update as an affine transformation and compute the cumulative product of these transforms with thrust::inclusive_scan. This yields all intermediate EMA values in parallel, which are then re-used across EMA, DEMA, TEMA, T3, TRIX and MACD calculations. The shared helper keeps warm-up regions initialised to NaN while guaranteeing identical numerical outputs to the previous per-kernel loops.
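The reformulation can be sketched in NumPy. This is illustrative only: tacuda composes the affine pairs with `thrust::inclusive_scan`, whereas this closed-form version divides by powers of `1 − α` and is therefore numerically fragile for very long series:

```python
import numpy as np

def ema_sequential(x, alpha):
    # Direct form of the recurrence: EMA[i] = α·x[i] + (1 − α)·EMA[i−1]
    y = np.empty_like(x)
    y[0] = x[0]
    for i in range(1, len(x)):
        y[i] = alpha * x[i] + (1 - alpha) * y[i - 1]
    return y

def ema_scan(x, alpha):
    # Each update is the affine map y ↦ a·y + b with a = 1 − α, b = α·x[i].
    # Composing the maps gives EMA[i] = a^i·x[0] + Σ_{k=1..i} a^(i−k)·α·x[k],
    # which is exactly what a parallel inclusive scan over (a, b) pairs yields.
    n = len(x)
    a = 1.0 - alpha
    powers = a ** np.arange(n)        # a^0, a^1, ..., a^(n−1)
    weighted = alpha * x / powers     # α·x[k] / a^k (underflows for large n!)
    weighted[0] = x[0]                # seed EMA[0] = x[0]
    return powers * np.cumsum(weighted)

x = np.random.rand(64)
assert np.allclose(ema_sequential(x, 0.2), ema_scan(x, 0.2))
```

The CUDA version avoids the underflow issue by scanning the `(a, b)` pairs themselves rather than expanding the closed form.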

To reproduce the performance improvement for large smoothing periods, run the dedicated benchmark:

```bash
python benchmarks/bench_ema.py
```

🏗️ Architecture

TACUDA employs a clean, extensible architecture:

```text
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Language      │    │    Unified       │    │   CUDA Kernel   │
│   Bindings      │───▶│   Dispatcher     │───▶│   Execution     │
│ (Python/C#/C++) │    │   Registry       │    │   Engine        │
└─────────────────┘    └──────────────────┘    └─────────────────┘
         │                       │                       │
         ▼                       ▼                       ▼
┌─────────────────┐    ┌──────────────────┐    ┌─────────────────┐
│   Type Safety   │    │   Memory         │    │   Optimized     │
│   Parameter     │    │   Management     │    │   Algorithms    │
│   Validation    │    │   (RAII)         │    │   (Parallel)    │
└─────────────────┘    └──────────────────┘    └─────────────────┘
```

Core Components

  • 🔧 Registry: Thread-safe, lazy-loaded indicator registry
  • ⚡ IndicatorFn: Unified kernel signature for all indicators
  • 🎯 DeviceBuffer: RAII memory management with async operations
  • 🔌 Bindings: Language-specific wrappers with native feel
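The registry's behavior can be sketched in a few lines of Python. This is purely conceptual: the real registry is C++, and all names here are illustrative:

```python
import threading

class IndicatorRegistry:
    """Conceptual sketch of a lazy, thread-safe registry (real one is C++)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._factories = {}   # name -> factory callable
        self._instances = {}   # name -> lazily built indicator handle

    def register(self, name, factory):
        with self._lock:
            self._factories[name] = factory

    def get(self, name):
        # Build on first use under the lock, then serve the cached instance.
        with self._lock:
            if name not in self._instances:
                self._instances[name] = self._factories[name]()
            return self._instances[name]

registry = IndicatorRegistry()
registry.register("SMA", lambda: "sma-kernel-handle")
print(registry.get("SMA"))  # sma-kernel-handle
```

Lazy construction under a lock means kernels are only instantiated for indicators that are actually requested, and concurrent callers never race the first build.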

🧪 Testing & Quality

```bash
# Run comprehensive test suite
ctest --test-dir build --output-on-failure

# Performance regression testing
cd benchmarks && python benchmark_suite.py

# Memory leak detection (Linux)
valgrind --tool=memcheck ./build/examples/tacuda_example
```

Test Coverage:

  • ✅ Numerical accuracy vs reference implementations
  • ✅ Edge cases (empty inputs, extreme values)
  • ✅ Memory safety and leak detection
  • ✅ Multi-threaded safety
  • ✅ Cross-platform compatibility

🔧 Extending TACUDA

Adding new indicators is straightforward:

1. Define Parameters Structure

```cpp
// include/tacuda/indicators/my_indicator.h
struct MyIndicatorParams {
    int period;
    float factor;
};
```

2. Implement CUDA Kernel

```cpp
// src/indicators/my_indicator.cu
__global__ void my_indicator_kernel(const float* d_in, float* d_out,
                                    int n, MyIndicatorParams params) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;

    // Your algorithm here
    d_out[idx] = d_in[idx] * params.factor;
}
```

3. Register Indicator

```cpp
// Register in the global registry
REGISTER_INDICATOR("MY_INDICATOR", my_indicator_dispatch, MyIndicatorParams);
```

4. Add Language Bindings

```python
# Python binding
def my_indicator(data, period=10, factor=1.0):
    return run("MY_INDICATOR", data, period=period, factor=factor)
```

✅ Status & TODO

✅ Completed pillars

  • CUDA implementations for moving averages, momentum, volatility, oscillators, candlestick recognition, and price transforms with TA-Lib compatible semantics backed by 140+ GPU kernels.
  • Python (ctypes) and .NET 7 bindings generated from the shared C ABI via bindings/generate_bindings.py, with regeneration guards wired into CTest.
  • GoogleTest regression suite cross-checking against TA-Lib reference data, including device buffer pool coverage and representative indicator families.

🚧 Outstanding priorities

Data ergonomics & throughput

  • Columnar OHLCV host container plus binding updates so callers no longer hand-pack arrays.
  • Batched/multi-symbol execution entry points with stream-aware scheduling for portfolio-scale workloads.

Release readiness

  • Packaging for common ecosystems (PyPI wheel, NuGet, Conda, binary releases) with automated CI publication.
  • Documented production deployment guide, ABI stability policy, and fill-in for the referenced docs/ tree.

Benchmarking & validation

  • Curated benchmark datasets and published comparative numbers covering CPU vs GPU baselines.

🗓️ Proposed Sprint Plan

Sprint 1 – Data ergonomics foundation (2 weeks)

  • Design and implement an OHLCV columnar container usable from C++, Python, and .NET host APIs.
  • Update bindings to accept structured inputs and extend tests/benchmarks to exercise the new path.
  • Document migration guidance for existing users still relying on manual array packing.

Sprint 2 – Batched execution & scheduling (2 weeks)

  • Add multi-symbol/batched dispatchers with CUDA stream management hooks.
  • Extend the registry and bindings to accept portfolio requests, including stress tests and profiling scripts.
  • Prototype heuristics for overlapping host/device transfers and kernel launches.

Sprint 3 – Distribution pipeline (2 weeks)

  • Author reproducible benchmark datasets and integrate them into the benchmarking harness.
  • Stand up CI jobs producing PyPI wheels, Conda packages, NuGet packages, and binary tarballs.
  • Capture release criteria covering artifact validation, signing, and smoke tests.

Sprint 4 – Production readiness (2 weeks)

  • Build out the referenced documentation tree (API, user guide, performance, operations) and publish deployment runbooks.
  • Formalize ABI stability guarantees, including header versioning and changelog automation.
  • Prepare an adoption checklist covering monitoring, upgrade sequencing, and support escalation paths.

📚 Documentation


🤝 Contributing

We welcome contributions! Here's how to get started:

Development Setup

```bash
# Fork and clone
git clone https://github.com/pavadik/tacuda.git
cd tacuda

# Install pre-commit hooks
pip install pre-commit
pre-commit install

# Build in debug mode
cmake -S . -B build -DCMAKE_BUILD_TYPE=Debug -DBUILD_TESTS=ON
cmake --build build -j$(nproc)
```

Contribution Guidelines

  • 🎯 Focus on performance: Include benchmarks for performance-affecting changes
  • 🧪 Test coverage: New features must include comprehensive tests
  • 📝 Documentation: Update docs for API changes
  • 🎨 Code style: Use clang-format and maintain consistency
  • 💬 Discussion: Open an issue before major architectural changes

Code Quality Standards

  • ✅ All tests must pass
  • ✅ No compiler warnings
  • ✅ Memory leak-free
  • ✅ Thread-safe implementations
  • ✅ Proper error handling

📄 License

Licensed under the Apache License, Version 2.0. See LICENSE for details.


🙏 Acknowledgments

  • NVIDIA for CUDA toolkit and documentation
  • TA-Lib for algorithmic reference implementations
  • NumPy community for API design inspiration
  • GoogleTest for testing framework

⚠️ Disclaimer

This project is independent and not affiliated with TA-Lib or any other trademark. "TA-style API" refers to interface similarity for user convenience only.


⭐ Star this repo if you find TACUDA useful!

Made with ❤️ by Pavel Dikalov
