Pull requests: pytorch/FBGEMM
Adding support for bias addition + rescaling with token weights to grouped_gemm
cla signed
fb-exported
meta-exported
#5280
opened Dec 29, 2025 by
metastableB
Refactor cumem_utils CPU library to cut GPU dependencies
cla signed
fb-exported
meta-exported
#5279
opened Dec 29, 2025 by
crypt3lx2k
Add repeat_arange cuda kernel
cla signed
fb-exported
meta-exported
#5278
opened Dec 26, 2025 by
yunjiangster
Add tidy fixes (#5268)
cla signed
fb-exported
meta-exported
#5277
opened Dec 24, 2025 by
q10
Replace .data_ptr with .mutable_data_ptr or .const_data_ptr (#5267)
cla signed
fb-exported
meta-exported
#5276
opened Dec 24, 2025 by
q10
Tune max segment length per cta in triton table batched embeddings, and expose the param via cli
cla signed
fb-exported
meta-exported
#5270
opened Dec 22, 2025 by
OmarPavel
Replace .data_ptr with .mutable_data_ptr or .const_data_ptr
cla signed
#5267
opened Dec 20, 2025 by
cyyever
Optimizations for index_select_scalar_cumsum_kernel on ROCm
cla signed
module: rocm
#5263
opened Dec 18, 2025 by
amd-wsung102
Refactor TBE benchmark reporter to use structured data config
cla signed
fb-exported
meta-exported
#5260
opened Dec 18, 2025 by
gchalump
Fix blackwell CUTLASS attention meta registration + actually test compile
cla signed
fb-exported
meta-exported
#5259
opened Dec 18, 2025 by
jbschlosser
Optimize benchmark index generation with std::sample()
cla signed
fb-exported
meta-exported
#5254
opened Dec 17, 2025 by
terdogan
Remove unused dedup_map and associated includes from benchmarks
cla signed
fb-exported
meta-exported
#5253
opened Dec 17, 2025 by
terdogan
Move the prefetched info to preallocated buffers
cla signed
fb-exported
meta-exported
#5251
opened Dec 17, 2025 by
chouxi
Enable direct MX4→BF16 dequantization to reduce memory (python side) (2/2)
cla signed
fb-exported
meta-exported
#5250
opened Dec 17, 2025 by
armandsauzay
Add aarch64 intrinsic-based dequantization to autovec routine
cla signed
fb-exported
meta-exported
#5249
opened Dec 17, 2025 by
Nicoshev
Choose _autovec version of GenerateEmbeddingSpMDMRowWiseSparse on AArch64
cla signed
fb-exported
meta-exported
#5247
opened Dec 17, 2025 by
MatzeB
Specialize more cases to improve EmbeddingSpMDMNBitBenchmark
cla signed
fb-exported
meta-exported
#5245
opened Dec 17, 2025 by
MatzeB
Add EmbeddingSpMDMNBitRowWiseSparse autovectorized variant
cla signed
fb-exported
meta-exported
#5244
opened Dec 17, 2025 by
MatzeB
Optimize group_index_select_or_add_2d_kernel on ROCm by adding a separate codepath for small embedding dimensions
cla signed
module: rocm
#5233
opened Dec 16, 2025 by
aryaman-gupta
support object cache in ssd l2 cache and add more unit tests
cla signed
fb-exported
meta-exported
#5228
opened Dec 16, 2025 by
zhaojuanmao
Optimizing 4-bit dequant to FP32 on AArch64 using vectorized intrinsics in EmbeddingSpMDMAutovec
cla signed
#5224
opened Dec 15, 2025 by
marma01
Update heuristic to support variant batch sizes
cla signed
fb-exported
meta-exported
#5211
opened Dec 10, 2025 by
zjing14
Use H100 runners for OSS CI
cla signed
fb-exported
meta-exported
#5205
opened Dec 9, 2025 by
q10
Modifying clear_all_staged_data to accommodate KV Tensor Deletion
cla signed
fb-exported
meta-exported
#5202
opened Dec 9, 2025 by
Raahul46