szaher/lectures

 
 

Supplementary Material for Lectures

YouTube Channel

The PMPP Book: Programming Massively Parallel Processors: A Hands-on Approach (Amazon link)

Lecture 1: Profiling and Integrating CUDA kernels in PyTorch

Lecture 2: Recap Ch. 1-3 from the PMPP book

Lecture 3: Getting Started With CUDA

Lecture 4: Intro to Compute and Memory Architecture

Lecture 5: Going Further with CUDA for Python Programmers

Lecture 6: Optimizing PyTorch Optimizers

Lecture 7: Advanced Quantization

Lecture 8: CUDA Performance Checklist

Lecture 9: Reductions

Lecture 10: Build a Prod Ready CUDA Library

Lecture 11: Sparsity

Lecture 12: Flash Attention

Lecture 13: Ring Attention

Lecture 14: Practitioner's Guide to Triton

Lecture 15: CUTLASS

Lecture 16: Hands-On Profiling

Bonus Lecture: CUDA C++ llm.cpp

Lecture 17: GPU Collective Communication (NCCL)

Lecture 18: Fused Kernels

Lecture 19: Data Processing on GPUs

Lecture 20: Scan Algorithm

Lecture 21: Scan Algorithm Part 2

Lecture 22: Hacker's Guide to Speculative Decoding in VLLM

Lecture 23: Tensor Cores

  • Speaker: Vijay Thakkar & Pradeep Ramani
  • Slides

Lecture 24: Scan at the Speed of Light

  • Speaker: Jake Hemstad & Georgii Evtushenko

Lecture 25: Speaking Composable Kernel

  • Speaker: Haocong Wang
  • Slides

Lecture 26: SYCL MODE (Intel GPU)

Lecture 27: gpu.cpp

Lecture 28: Liger Kernel

Lecture 29: Triton Internals

Lecture 30: Quantized training

Lecture 31: Beginners Guide to Metal Kernels

Lecture 32: Unsloth - LLM Systems Engineering

Lecture 33: BitBLAS

Lecture 34: Low Bit Triton Kernels

Lecture 35: SGLang Performance Optimization

Lecture 36: CUTLASS and Flash Attention 3

Lecture 37: Introduction to SASS & GPU Microarchitecture

Lecture 38: Lowbit kernels for ARM CPU

Lecture 39: TorchTitan

  • Speaker: Mark Saroufim and Tianyu Liu

Lecture 40: FlashInfer

Lecture 41: CUDA Docs for Humans

Lecture 42: Mosaic GPU

Lecture 43:

  • Speaker: Erik Schultheis
  • Slides

Lecture 57: CuTE

Lecture 67: NCCL & NVSHMEM

Lecture 69: Quartet 4 bit training

Lecture 70: Fault tolerant communication collectives

Lecture 78: Iris: Multi-GPU Programming in Triton

  • Speakers: Muhammad Awad, Muhammad Osama & Brandon Potter

About

Material for gpu-mode lectures

Languages

  • Jupyter Notebook 85.6%
  • Python 9.4%
  • Cuda 2.2%
  • C 2.1%
  • Objective-C++ 0.5%
  • CMake 0.1%
  • Other 0.1%