I may be slow to respond.
  • The Institute of Computing Technology of the Chinese Academy of Sciences
  • Beijing P.R. China

Distributed Compiler based on Triton for Parallel Systems

Python 1,309 117 Updated Dec 27, 2025

Tensor Compute Primitives: Mid-level Intermediate Representation for Machine Learning Programs

MLIR 13 7 Updated Jun 10, 2025

Domain-specific language designed to streamline the development of high-performance GPU/CPU/Accelerators kernels

Python 4,565 385 Updated Jan 10, 2026

A modern model graph visualizer and debugger

JavaScript 1,364 138 Updated Jan 10, 2026

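💯 2026 Software Designer (intermediate-level Ruankao, the Chinese software qualification exam) exam-prep resource library, plus a companion free practice-question app. https://ruankaodaren.com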

1,139 210 Updated Dec 25, 2025

KernelBench: Can LLMs Write GPU Kernels? - Benchmark + Toolkit with Torch -> CUDA (+ more DSLs)

Jupyter Notebook 746 107 Updated Jan 9, 2026
LLVM 278 96 Updated Jan 9, 2026

Shared Middle-Layer for Triton Compilation

MLIR 322 83 Updated Dec 5, 2025

C/C++ frontend for MLIR. Also features polyhedral optimizations, parallel optimizations, and more!

C++ 600 156 Updated Jun 19, 2025

This is the top-level repository for the Accel-Sim framework.

Python 548 181 Updated Nov 26, 2025

A library to benchmark CUDA code, similar to google benchmark.

C++ 30 7 Updated Apr 18, 2021

Collection of benchmarks to measure basic GPU capabilities

C++ 477 73 Updated Oct 24, 2025

A GPU benchmark suite for assessing on-chip GPU memory bandwidth

C++ 109 27 Updated Aug 12, 2017

CUDA GPU Benchmark

Cuda 36 11 Updated Jan 31, 2025

Universal LLM Deployment Engine with ML Compilation

Python 21,853 1,897 Updated Dec 31, 2025

BlackHole is a modern macOS audio loopback driver that allows applications to pass audio to other applications with zero additional latency.

C 18,085 750 Updated Dec 19, 2025

PennyLane is a cross-platform Python library for quantum computing, quantum machine learning, and quantum chemistry. Built by researchers, for research.

Python 3,006 733 Updated Jan 10, 2026
Python 48 6 Updated Jul 13, 2024
C++ 6 Updated May 31, 2023

OSDI 2023 Welder, deep learning compiler

Python 30 8 Updated Nov 24, 2023

Heron: Automatically Constrained High-Performance Library Generation for Deep Learning Accelerators

Python 23 1 Updated Jan 30, 2024

This is a list of awesome edge AI inference-related papers.

98 9 Updated Dec 21, 2023

Must read research papers and links to tools and datasets that are related to using machine learning for compilers and systems optimisation

1,641 173 Updated Sep 12, 2025

Polyhedral Parallel Code Generation (source repository: http://repo.or.cz/ppcg.git)

C 131 37 Updated Jul 22, 2022

Open Neural Network Exchange to C compiler.

C 350 61 Updated Dec 22, 2025

Transformer related optimization, including BERT, GPT

C++ 6,378 928 Updated Mar 27, 2024
Python 1,026 95 Updated Jan 4, 2024