Stars
Fast CUDA matrix multiplication from scratch
My learning notes for ML SYS.
Tile primitives for speedy kernels
Ongoing research training transformer models at scale
TensorRT LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. Tensor…
Modeling, training, eval, and inference code for OLMo
Repository for the QUIK project, enabling the use of 4bit kernels for generative inference - EMNLP 2024
https://wavespeed.ai/ Best inference performance optimization framework for HuggingFace Diffusers on NVIDIA GPUs.
OneFlow is a deep learning framework designed to be user-friendly, scalable and efficient.
ppl.cv is a high-performance image processing library of openPPL supporting various platforms.
Productive, portable, and performant GPU programming in Python.
CUDA Templates and Python DSLs for High-Performance Linear Algebra
Benchmarks for partial evaluation of different GPU application scenarios
A quick start for CMake: learn the syntax through worked examples. Read online: https://sfumecjf.github.io/cmake-examples-Chinese/
A Graph-Based Framework for Information Extraction
Large scale K-means and K-nn implementation on NVIDIA GPU / CUDA
COLMAP - Structure-from-Motion and Multi-View Stereo
C++ Insights - See your source code with the eyes of a compiler
A curated list of awesome C++ (or C) frameworks, libraries, resources, and shiny things. Inspired by awesome-... stuff.