leslie-fang-intel
  • INTC
  • Shanghai

Showing results

A Datacenter Scale Distributed Inference Serving Framework

Rust 4,969 589 Updated Sep 13, 2025

A simple C++11 Thread Pool implementation

C++ 8,495 2,334 Updated Jul 20, 2024
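The thread-pool pattern this repository implements in C++11 can be sketched with the Python standard library (a minimal illustration of the pattern, not the repository's API): a fixed set of worker threads pulls tasks from a shared queue, and a sentinel value shuts the workers down.

```python
import threading
import queue

class ThreadPool:
    """Minimal thread pool: workers pull (func, args) tasks from a shared queue."""
    def __init__(self, num_workers=4):
        self.tasks = queue.Queue()
        self.workers = [threading.Thread(target=self._worker, daemon=True)
                        for _ in range(num_workers)]
        for w in self.workers:
            w.start()

    def _worker(self):
        while True:
            func, args, result = self.tasks.get()
            if func is None:          # sentinel: shut this worker down
                self.tasks.task_done()
                break
            result.append(func(*args))
            self.tasks.task_done()

    def submit(self, func, *args):
        result = []                   # one-slot container for the return value
        self.tasks.put((func, args, result))
        return result

    def shutdown(self):
        for _ in self.workers:        # one sentinel per worker
            self.tasks.put((None, (), None))
        for w in self.workers:
            w.join()

pool = ThreadPool(num_workers=2)
out = pool.submit(lambda x: x * x, 7)
pool.shutdown()                       # joins workers; all queued tasks have run
print(out[0])                         # 49
```

A production pool (like the C++ repository) would additionally return futures and propagate exceptions; this sketch only shows the queue-plus-workers core.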

User-friendly Desktop Client App for AI Models/LLMs (GPT, Claude, Gemini, Ollama...)

TypeScript 36,589 3,549 Updated Aug 20, 2025

A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations

Python 15,040 1,080 Updated Sep 12, 2025

A Python-embedded DSL that makes it easy to write fast, scalable ML kernels with minimal boilerplate.

Python 299 29 Updated Sep 13, 2025

FlashInfer: Kernel Library for LLM Serving

Cuda 3,732 489 Updated Sep 11, 2025

DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

Cuda 5,704 694 Updated Sep 12, 2025

DeepEP: an efficient expert-parallel communication library

Cuda 8,506 921 Updated Sep 12, 2025

FlashMLA: Efficient MLA kernels

C++ 11,721 899 Updated Aug 27, 2025

Production-tested AI infrastructure tools for efficient AGI development and community-driven innovation

7,910 282 Updated May 15, 2025

Accessible large language models via k-bit quantization for PyTorch.

Python 7,573 782 Updated Sep 9, 2025

AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference.

Python 2,247 288 Updated May 11, 2025

An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.

Python 4,942 526 Updated Apr 11, 2025

Official inference framework for 1-bit LLMs

Python 21,953 1,689 Updated Jun 3, 2025
C++ 55 56 Updated Sep 13, 2025

SGLang is a fast serving framework for large language models and vision language models.

Python 17,873 2,910 Updated Sep 13, 2025

CUDA Templates for Linear Algebra Subroutines

C++ 8,421 1,434 Updated Sep 9, 2025

List of papers related to neural network quantization in recent AI conferences and journals.

715 59 Updated Mar 27, 2025

An experimental CPU backend for Triton (https://github.com/openai/triton)

C++ 45 3 Updated Aug 18, 2025

Development repository for the Triton language and compiler

MLIR 16,852 2,239 Updated Sep 13, 2025

A high-throughput and memory-efficient inference and serving engine for LLMs

Python 57,892 10,088 Updated Sep 13, 2025

Code for the ICLR 2023 paper "GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers".

Python 2,181 181 Updated Mar 27, 2024
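Several of the quantization repositories above (GPTQ, AutoAWQ, torchao) start from the same baseline: round-to-nearest weight quantization with a shared scale per group of weights. A pure-Python sketch of that baseline for symmetric int4 (the group size and example weights are illustrative, not taken from any of the repositories):

```python
def quantize_rtn_int4(weights, group_size=4):
    """Round-to-nearest symmetric int4 quantization with per-group scales.

    This is the naive baseline that methods like GPTQ improve upon: each
    group shares one absmax-derived scale, and each weight is rounded to
    the 16-level integer grid [-8, 7].
    """
    qweights, scales = [], []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        scale = max(abs(w) for w in group) / 7 or 1.0   # map group absmax to level 7
        q = [max(-8, min(7, round(w / scale))) for w in group]
        qweights.extend(q)
        scales.append(scale)
    return qweights, scales

def dequantize(qweights, scales, group_size=4):
    """Recover approximate float weights from int levels and group scales."""
    return [q * scales[i // group_size] for i, q in enumerate(qweights)]

w = [0.12, -0.53, 0.98, -0.07, 0.45, 0.33, -0.91, 0.02]
qw, s = quantize_rtn_int4(w)
w_hat = dequantize(qw, s)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
# round-to-nearest bounds the error by half a quantization step per group
assert max_err <= max(s) / 2 + 1e-9
```

GPTQ's contribution is to do better than this per-weight rounding by using second-order (Hessian) information to compensate rounding error across columns; the int4 grid and group scales are the same.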

PyTorch native quantization and sparsity for training and inference

Python 2,350 336 Updated Sep 13, 2025

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and supports state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorR…

C++ 11,573 1,741 Updated Sep 12, 2025

[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

Python 1,496 181 Updated Jul 12, 2024
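SmoothQuant's core trick is algebraic: per input channel j, compute a smoothing factor s_j = max|X_j|^α / max|W_j|^(1−α), then divide activations by s and multiply the matching weight rows by s. The matmul output is mathematically unchanged, but activation outliers are flattened into the weights, making both easier to quantize. A toy sketch with hypothetical numbers (not the paper's code):

```python
def smooth_scales(act_absmax, w_absmax, alpha=0.5):
    """Per-channel smoothing factors: s_j = max|X_j|^alpha / max|W_j|^(1-alpha)."""
    return [a ** alpha / w ** (1 - alpha) for a, w in zip(act_absmax, w_absmax)]

def matmul(X, W):
    """Plain (n x k) @ (k x m) matrix multiply."""
    return [[sum(X[i][t] * W[t][j] for t in range(len(W)))
             for j in range(len(W[0]))] for i in range(len(X))]

# Toy example: channel 1 of X carries an outlier (hypothetical values).
X = [[1.0, 40.0], [0.5, -32.0]]
W = [[0.2, -0.1], [0.05, 0.3]]

act_absmax = [max(abs(X[i][j]) for i in range(len(X))) for j in range(len(X[0]))]
w_absmax = [max(abs(v) for v in row) for row in W]
s = smooth_scales(act_absmax, w_absmax)

X_s = [[x / s[j] for j, x in enumerate(row)] for row in X]   # X @ diag(1/s)
W_s = [[v * s[i] for v in row] for i, row in enumerate(W)]   # diag(s) @ W

orig = matmul(X, W)
smoothed = matmul(X_s, W_s)
max_diff = max(abs(a - b) for ra, rb in zip(orig, smoothed)
               for a, b in zip(ra, rb))
assert max_diff < 1e-9   # equivalence: (X / s) @ (s * W) == X @ W
```

After smoothing, the outlier channel's activation absmax drops from 40 to about 3.5 while the weights absorb the scale, which is what makes W8A8 quantization accurate in the paper.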

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.

C++ 12,136 2,247 Updated Sep 10, 2025

PyTorch tutorials.

Jupyter Notebook 8,776 4,245 Updated Sep 12, 2025

YOLOv5 🚀 in PyTorch > ONNX > CoreML > TFLite

Python 55,330 17,187 Updated Sep 8, 2025

A Python package that extends official PyTorch to easily obtain performance gains on Intel platforms

Python 1,957 291 Updated Aug 29, 2025