JamesTheZ — repositories

Interactive deep learning book with multi-framework code, math, and discussions. Adopted at 500 universities from 70 countries including Stanford, MIT, Harvard, and Cambridge.

Python · 27,352 stars · 4,833 forks · Updated Aug 18, 2024

The real story of the hardship and darkness behind the development of Noah's Ark's Pangu large model.

11,376 stars · 1,363 forks · Updated Jul 9, 2025

Tutel MoE: an optimized Mixture-of-Experts library, supporting GptOss/DeepSeek/Kimi-K2/Qwen3 with FP8/NVFP4/MXFP4

C · 938 stars · 106 forks · Updated Nov 10, 2025

Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)

Python · 62,615 stars · 7,580 forks · Updated Nov 13, 2025

Python · 1 star · Updated Oct 6, 2025

Machine learning course by Prof. Hung-yi Lee at National Taiwan University.

Jupyter Notebook · 1,126 stars · 384 forks · Updated Jul 15, 2019

Awesome LLM compression research papers and tools.

1,709 stars · 110 forks · Updated Nov 10, 2025

Understanding Deep Learning - Simon J.D. Prince

Jupyter Notebook · 8,493 stars · 1,956 forks · Updated Nov 5, 2025

Fast low-bit matmul kernels in Triton

Python · 396 stars · 29 forks · Updated Oct 26, 2025

This project shares the technical principles behind large models along with hands-on experience (LLM engineering and production deployment of LLM applications).

HTML · 21,883 stars · 2,568 forks · Updated Oct 19, 2025

Code repo for the paper "SpinQuant: LLM Quantization with Learned Rotations"

Python · 346 stars · 59 forks · Updated Feb 14, 2025

My learning notes and code for ML systems.

Python · 4,186 stars · 254 forks · Updated Nov 17, 2025

Code for the paper "Evaluating Large Language Models Trained on Code"

Python · 3,018 stars · 422 forks · Updated Jan 17, 2025

A framework for the evaluation of autoregressive code generation language models.

Python · 993 stars · 250 forks · Updated Jul 22, 2025

SOTA low-bit LLM quantization (INT8/FP8/MXFP8/INT4/MXFP4/NVFP4) & sparsity; leading model compression techniques on TensorFlow, PyTorch, and ONNX Runtime

Python · 2,525 stars · 281 forks · Updated Nov 17, 2025
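As a rough illustration of the basic operation such quantization toolkits automate, per-tensor symmetric INT8 quantization can be sketched in plain NumPy (the function names here are illustrative, not from any of the listed libraries):

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Per-tensor symmetric INT8 quantization: map floats to [-127, 127]."""
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original floats."""
    return q.astype(np.float32) * scale

x = np.array([0.1, -2.0, 0.5, 1.5], dtype=np.float32)
q, s = quantize_int8(x)
x_hat = dequantize(q, s)
# Round-to-nearest keeps the error within half a quantization step.
assert np.max(np.abs(x - x_hat)) <= s / 2 + 1e-6
```

Real toolkits layer per-channel scales, calibration, and mixed-precision policies on top of this primitive, but the scale-round-clip core is the same.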

A low-latency & high-throughput serving engine for LLMs

Python · 445 stars · 59 forks · Updated Oct 16, 2025

A throughput-oriented high-performance serving framework for LLMs

Jupyter Notebook · 915 stars · 44 forks · Updated Oct 29, 2025

Fast and memory-efficient exact attention

Python · 200 stars · 68 forks · Updated Oct 20, 2025
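The trick that makes exact attention memory-efficient in this style of kernel is the online (streaming) softmax: keys and values are processed in blocks while a running max and normalizer are maintained, so the full score matrix is never materialized. A toy single-query NumPy sketch of the idea (illustrative only, not this repo's CUDA kernels):

```python
import numpy as np

def streaming_attention(q, K, V, block=2):
    """Single-query attention over K/V blocks with a running softmax."""
    m = -np.inf                   # running max of attention scores
    l = 0.0                       # running softmax normalizer
    acc = np.zeros(V.shape[1])    # running weighted sum of values
    for i in range(0, K.shape[0], block):
        s = K[i:i + block] @ q            # scores for this block only
        m_new = max(m, s.max())
        # Rescale previous partial results to the new running max.
        acc = acc * np.exp(m - m_new) + np.exp(s - m_new) @ V[i:i + block]
        l = l * np.exp(m - m_new) + np.exp(s - m_new).sum()
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
q = rng.normal(size=4)
K = rng.normal(size=(6, 4))
V = rng.normal(size=(6, 3))

# Matches naive attention that materializes the full score vector.
w = np.exp(K @ q - (K @ q).max())
w /= w.sum()
assert np.allclose(streaming_attention(q, K, V), w @ V)
```

Because each block's contribution is rescaled as the running max updates, the result is bit-for-bit the same softmax attention, computed in O(block) extra memory instead of O(sequence length).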

8-bit CUDA functions for PyTorch

Python · 68 stars · 12 forks · Updated Sep 24, 2025

[ICLR2025 Spotlight] MagicPIG: LSH Sampling for Efficient LLM Generation

Python · 240 stars · 16 forks · Updated Dec 16, 2024

Running large language models on a single GPU for throughput-oriented scenarios.

Python · 9,377 stars · 583 forks · Updated Oct 28, 2024

📰 Must-read papers on KV Cache Compression (constantly updating 🤗).

603 stars · 16 forks · Updated Sep 30, 2025

An acceleration library that supports arbitrary bit-width combinatorial quantization operations

C++ · 238 stars · 21 forks · Updated Sep 30, 2024

Code for the NeurIPS 2024 paper "QuaRot": end-to-end 4-bit inference for large language models.

Python · 449 stars · 51 forks · Updated Nov 26, 2024
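The rotation idea behind QuaRot (and the SpinQuant repo above) rests on a simple identity: for any orthogonal matrix R, Wx = (WRᵀ)(Rx), so weights and activations can both be rotated before quantization without changing the layer's output. The rotation spreads large outlier channels across all dimensions, shrinking the dynamic range the 4-bit quantizer must cover. A toy NumPy sketch of that effect (the Hadamard construction here is illustrative, not the papers' exact implementation):

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Sylvester construction of an orthonormal n x n Hadamard matrix (n a power of 2)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

# Activation vector with one large outlier channel, as is common in LLMs.
x = np.array([0.1, -0.2, 8.0, 0.05, 0.3, -0.1, 0.2, 0.15])

R = hadamard(8)
x_rot = R @ x

# The rotation is exactly invertible (R is orthogonal) ...
assert np.allclose(R.T @ x_rot, x)
# ... and it shrinks the peak magnitude the quantizer must cover.
assert np.max(np.abs(x_rot)) < np.max(np.abs(x))
```

After the rotation, the outlier's energy is shared across all eight channels, so a low-bit quantizer wastes far less of its range on a single extreme value; the inverse rotation can be fused into the adjacent weight matrix at no runtime cost.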

Model Compression Toolbox for Large Language Models and Diffusion Models

Python · 697 stars · 67 forks · Updated Aug 14, 2025

[MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention

C++ · 779 stars · 53 forks · Updated Mar 6, 2025

MIXQ: Taming Dynamic Outliers in Mixed-Precision Quantization by Online Prediction

Python · 94 stars · 15 forks · Updated Oct 29, 2024

[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.

Python · 872 stars · 71 forks · Updated May 22, 2025

[ICML 2024] LLMCompiler: An LLM Compiler for Parallel Function Calling

Python · 1,782 stars · 122 forks · Updated Jul 10, 2024