waqasm86.github.io Public
Personal engineering portfolio showcasing CUDA + C++ + LLM inference projects. Features production-grade distributed systems, empirical performance research, and on-device AI optimization. Built wi…
Updated Dec 30, 2025
llcuda Public
CUDA-accelerated LLM inference for Python with automatic server management. Zero-configuration setup, JupyterLab-ready, production-grade performance. Just install and start running inference!
Python · MIT License · Updated Dec 30, 2025
Pre-built llama.cpp CUDA binary for Ubuntu 22.04. No compilation required: download, extract, and run. Works with the llcuda Python package for JupyterLab integration. Tested on GPUs ranging from the GeForce 940M to the RTX 4090.
cuda-nvidia-systems-engg Public
Production-grade C++20/CUDA distributed LLM inference system with TCP networking, MPI scheduling, and content-addressed storage. Features comprehensive benchmarking (p50/p95/p99 latencies), epoll a…
C++ · MIT License · Updated Dec 27, 2025
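The TCP layer in a system like this ultimately comes down to framing messages on a byte stream. The repository's actual wire format is not shown here; the sketch below illustrates one common approach, a 4-byte length prefix per message, demonstrated over a local socketpair so it runs without a network.

```cpp
// Minimal length-prefixed framing sketch: every message is a 4-byte
// big-endian length followed by the payload bytes. This is an assumed
// framing scheme for illustration, not the repository's actual protocol.
#include <arpa/inet.h>   // htonl / ntohl
#include <sys/socket.h>  // socketpair, send, recv
#include <sys/types.h>
#include <unistd.h>      // close
#include <cstdint>
#include <iostream>
#include <string>

// Send all bytes, looping until the kernel has accepted everything.
static bool send_all(int fd, const void* buf, size_t len) {
    const char* p = static_cast<const char*>(buf);
    while (len > 0) {
        ssize_t n = send(fd, p, len, 0);
        if (n <= 0) return false;
        p += n; len -= static_cast<size_t>(n);
    }
    return true;
}

// Receive exactly len bytes or fail.
static bool recv_all(int fd, void* buf, size_t len) {
    char* p = static_cast<char*>(buf);
    while (len > 0) {
        ssize_t n = recv(fd, p, len, 0);
        if (n <= 0) return false;
        p += n; len -= static_cast<size_t>(n);
    }
    return true;
}

static bool write_frame(int fd, const std::string& payload) {
    uint32_t len_be = htonl(static_cast<uint32_t>(payload.size()));
    return send_all(fd, &len_be, sizeof(len_be)) &&
           send_all(fd, payload.data(), payload.size());
}

static bool read_frame(int fd, std::string& out) {
    uint32_t len_be = 0;
    if (!recv_all(fd, &len_be, sizeof(len_be))) return false;
    out.resize(ntohl(len_be));
    return recv_all(fd, out.data(), out.size());
}

int main() {
    int fds[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) return 1;
    write_frame(fds[0], "{\"prompt\":\"hello\"}");   // "client" side
    std::string msg;
    read_frame(fds[1], msg);                          // "server" side
    std::cout << "received frame: " << msg << "\n";
    close(fds[0]); close(fds[1]);
    return 0;
}
```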
local-llama-cuda Public
Custom CUDA implementation for LLM inference with MPI-based distributed computing. Memory-efficient layer offloading, multi-rank coordination, and GPU optimization for constrained hardware (1GB VRAM).
C++ · MIT License · Updated Dec 25, 2025
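The layer-offloading constraint is simple budget arithmetic: how many layers fit in VRAM after reserving space for the KV cache, scratch buffers, and CUDA context. The numbers below are illustrative placeholders, not measurements from this repository.

```cpp
// Back-of-the-envelope sketch: how many transformer layers fit on the GPU
// given a VRAM budget? All sizes here are assumed values for illustration.
#include <cstdint>
#include <iostream>

int main() {
    const std::uint64_t vram_budget_bytes = 1ull << 30;    // ~1 GB card
    const std::uint64_t reserved_bytes    = 200ull << 20;  // KV cache, scratch, CUDA context (assumed)
    const std::uint64_t bytes_per_layer   = 60ull << 20;   // e.g. one small quantized layer (assumed)
    const int           total_layers      = 32;

    std::uint64_t usable = vram_budget_bytes - reserved_bytes;
    int gpu_layers = static_cast<int>(usable / bytes_per_layer);
    if (gpu_layers > total_layers) gpu_layers = total_layers;

    std::cout << "offload " << gpu_layers << " of " << total_layers
              << " layers to the GPU; keep the rest on the host\n";
    return 0;
}
```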
cuda-tcp-llama.cpp Public
High-performance TCP inference gateway with epoll async I/O for CUDA-accelerated LLM serving. Binary protocol, connection pooling, streaming responses. Zero dependencies beyond POSIX and CUDA.
C++ · Updated Dec 23, 2025
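For context on the epoll pattern the gateway is built around, here is a bare-bones level-triggered accept/echo loop; the real gateway layers its binary protocol, connection pooling, and streaming responses on top of an event loop of this general shape.

```cpp
// Minimal epoll accept/echo loop (level-triggered). Illustrative only:
// a starting point for the async I/O pattern, not the gateway itself.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int listen_fd = socket(AF_INET, SOCK_STREAM, 0);
    int opt = 1;
    setsockopt(listen_fd, SOL_SOCKET, SO_REUSEADDR, &opt, sizeof(opt));

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(8080);
    if (bind(listen_fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0) {
        perror("bind");
        return 1;
    }
    listen(listen_fd, SOMAXCONN);

    int ep = epoll_create1(0);
    epoll_event ev{};
    ev.events = EPOLLIN;
    ev.data.fd = listen_fd;
    epoll_ctl(ep, EPOLL_CTL_ADD, listen_fd, &ev);

    epoll_event events[64];
    for (;;) {
        int n = epoll_wait(ep, events, 64, -1);
        for (int i = 0; i < n; ++i) {
            int fd = events[i].data.fd;
            if (fd == listen_fd) {
                // New connection: register it with the same epoll instance.
                int client = accept(listen_fd, nullptr, nullptr);
                epoll_event cev{};
                cev.events = EPOLLIN;
                cev.data.fd = client;
                epoll_ctl(ep, EPOLL_CTL_ADD, client, &cev);
            } else {
                // Ready client socket: echo whatever arrived, close on EOF/error.
                char buf[4096];
                ssize_t len = read(fd, buf, sizeof(buf));
                if (len <= 0) {
                    epoll_ctl(ep, EPOLL_CTL_DEL, fd, nullptr);
                    close(fd);
                } else {
                    write(fd, buf, static_cast<size_t>(len));
                }
            }
        }
    }
}
```

An edge-triggered variant (EPOLLET) with non-blocking sockets scales better under load, at the cost of having to drain each socket fully on every wakeup.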
cuda-openmpi Public
CUDA-aware OpenMPI integration for GPU-accelerated distributed computing. Multi-GPU LLM inference with MPI communication, performance benchmarking, and collective operations testing.
Cuda · MIT License · Updated Dec 23, 2025
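A minimal shape for the collective-operations benchmarking mentioned above: time repeated MPI_Allreduce calls over host buffers. With a CUDA-aware OpenMPI build the same call can take device pointers directly; that variant is omitted here, so treat this purely as a timing-harness sketch, not the repository's benchmark.

```cpp
// Minimal MPI_Allreduce micro-benchmark with host buffers.
// Compile with mpicxx, run with mpirun -np <ranks>.
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int count = 1 << 20;                 // 1M floats per rank (assumed size)
    std::vector<float> send(count, 1.0f), recv(count, 0.0f);
    const int iters = 50;

    MPI_Barrier(MPI_COMM_WORLD);               // align ranks before timing
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; ++i) {
        MPI_Allreduce(send.data(), recv.data(), count,
                      MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
    }
    double elapsed = MPI_Wtime() - t0;

    if (rank == 0) {
        std::printf("%d ranks, %d floats: %.3f ms per allreduce\n",
                    size, count, 1e3 * elapsed / iters);
    }
    MPI_Finalize();
    return 0;
}
```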
cuda-llm-storage-pipeline Public
Content-addressed LLM model distribution with SHA256 verification and SeaweedFS integration. Distributed storage, manifest management, LRU caching, and integrity checking for GGUF models.
C++ · Updated Dec 23, 2025
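Content addressing means a model blob is keyed by the hash of its bytes, so the key doubles as the integrity check. The sketch below shows that idea with OpenSSL's EVP API; the pipeline's actual manifest handling and SeaweedFS integration are not reproduced here.

```cpp
// Content-addressing sketch: a blob's key is the hex SHA-256 of its bytes.
// Uses OpenSSL's EVP API (link with -lcrypto); illustrative only.
#include <openssl/evp.h>
#include <cstdio>
#include <string>

// Hash an in-memory buffer and return the lowercase hex digest.
std::string sha256_hex(const void* data, size_t len) {
    unsigned char md[EVP_MAX_MD_SIZE];
    unsigned int md_len = 0;
    EVP_MD_CTX* ctx = EVP_MD_CTX_new();
    EVP_DigestInit_ex(ctx, EVP_sha256(), nullptr);
    EVP_DigestUpdate(ctx, data, len);
    EVP_DigestFinal_ex(ctx, md, &md_len);
    EVP_MD_CTX_free(ctx);

    static const char* hex = "0123456789abcdef";
    std::string out;
    for (unsigned int i = 0; i < md_len; ++i) {
        out.push_back(hex[md[i] >> 4]);
        out.push_back(hex[md[i] & 0x0f]);
    }
    return out;
}

int main() {
    std::string blob = "pretend this is a GGUF shard";
    std::string key = sha256_hex(blob.data(), blob.size());
    // The digest doubles as the storage key and the integrity check:
    // re-hash after download and compare against the manifest entry.
    std::printf("object key: sha256/%s\n", key.c_str());
    return 0;
}
```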
cuda-mpi-llama-scheduler Public
Distributed MPI scheduler with work-stealing algorithm for LLM inference. Percentile latency analysis (p50/p95/p99), throughput benchmarking, multi-rank load balancing, and empirical performance me…
Cuda · Updated Dec 23, 2025
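The percentile reporting referred to above reduces to sorting per-request latencies and indexing by rank. The sketch below uses the nearest-rank method on made-up sample data; the scheduler's cross-rank aggregation is more involved.

```cpp
// Percentile latency reporting sketch: sort the per-request latencies and
// index by rank. Sample data is fabricated for illustration.
#include <algorithm>
#include <cstdio>
#include <vector>

double percentile(std::vector<double> samples, double p) {
    std::sort(samples.begin(), samples.end());
    // Nearest-rank method: index of the p-th percentile in the sorted sample.
    size_t idx = static_cast<size_t>(p / 100.0 * (samples.size() - 1) + 0.5);
    return samples[idx];
}

int main() {
    std::vector<double> latency_ms = {12.1, 9.8, 15.3, 11.0, 48.7,
                                      10.4, 13.9, 11.8, 95.2, 12.6};
    std::printf("p50=%.1f ms  p95=%.1f ms  p99=%.1f ms\n",
                percentile(latency_ms, 50.0),
                percentile(latency_ms, 95.0),
                percentile(latency_ms, 99.0));
    return 0;
}
```

With only a handful of samples, p95 and p99 land on the same extreme values; meaningful tail-latency figures need thousands of requests per measurement window.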
cmake-superbuild-toolkit Public
Qt-style CMake superbuild demo: FetchContent deps, feature flags, install/export targets, CI matrix, tests, and CPack packaging.
CMake · Other · Updated Dec 16, 2025
MCP stdio server for Windsurf that routes tool calls to a local llama.cpp llama-server (GGUF), optimized for low-VRAM GPUs.
Python · Updated Dec 13, 2025
Wolfram-llama.cpp Public
A sample project demonstrating how to use Wolfram with llama.cpp.
Updated Nov 18, 2025
llama.cpp Public
Forked from ggml-org/llama.cpp
LLM inference in C/C++
C++ · MIT License · Updated Sep 25, 2025
FreeDomain Public
Forked from DigitalPlatDev/FreeDomain
DigitalPlat FreeDomain: Free Domain For Everyone
HTML · GNU Affero General Public License v3.0 · Updated May 14, 2025