Releases: pytorch/helion

v0.2.8

23 Dec 20:10
768326f

What's Changed

  • Support passing Triton function object to hl.triton_kernel() by @yf225 in #1263
  • chore: Bump actions/download-artifact from 6 to 7 by @dependabot[bot] in #1270
  • chore: Bump actions/upload-artifact from 5 to 6 by @dependabot[bot] in #1271
  • [Distributed] one_shot_allreduce_bias_rmsnorm example by @yf225 in #1266
  • [Distributed] matmul_reduce_scatter example by @yf225 in #1269
  • feat(benchmarks): add shapes to json output by @fulvius31 in #1273
  • [Autotuner] Log the 'started' state to CSV so users can more easily spot kernels hanging at runtime by @yf225 in #1279
  • default pattern search by @v0i0 in #1259
  • Set LFBOPatternSearch as default by @ethche in #1280
  • fix surrogate search for singleton population by @v0i0 in #1281
  • Ignore bzl files in git. by @Myrthan in #1282
  • chunk fused_linear_jsd by @v0i0 in #1277
  • Fix buggy interaction between XYZProgramIDs and L2GroupingProgramIDs by @jansel in #1288
  • Fix bug with torch.rand_like compile error by @jansel in #1289
  • [autotuner] print path to generated Triton code after selection of kernel by @bringlein in #1285
  • Add support for torch.gather by @jansel in #1290 (see the sketch after this list)
  • [docs] Add more examples to docs by @oulgen in #1301
  • [lint] remove dead ignores by @oulgen in #1302
  • Add proper error handling for torch.split and torch.tensor_split in device loops by @oulgen in #1297
  • Fix BlockReductionStrategy to use existing index variables for argmax/argmin operations by @oulgen in #1298
  • Skip more failing tests on cpu backend by @oulgen in #1304
  • [CI] Fix broken notebook by @oulgen in #1305
  • Fix shape inference for tile indexing on size-1 dimensions and use broadcast_to for block_ptr by @oulgen in #1299
  • Fix codegen broadcasting for tile indexing on size-1 tensor dimensions by @oulgen in #1300
  • Enable tests on py314 by @oulgen in #1306
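
A minimal, unverified sketch of the new torch.gather support (#1290): gather one value per (row, slot) from a tile of x using indices loaded from idx. The kernel name, body, and shapes are illustrative, not taken from Helion's tests.

```python
import torch
import helion
import helion.language as hl


# Illustrative only: gathers along dim=1 of x using an int64 index tensor.
@helion.kernel()
def gather_cols(x: torch.Tensor, idx: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(idx, dtype=x.dtype)
    for tile_m, tile_n in hl.tile(idx.size()):
        # index tile of shape (block_m, block_n) -> output of the same shape
        out[tile_m, tile_n] = torch.gather(x[tile_m, :], 1, idx[tile_m, tile_n])
    return out
```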

Full Changelog: v0.2.7...v0.2.8

v0.2.7

13 Dec 08:16
d5a2d61

What's Changed

  • [CI] Skip all failing distributed tests by @yf225 in #1206
  • Include index_dtype in the printed decorator snippet by @choijon5 in #1207
  • Add dict comprehension support by @oulgen in #1191
  • settings: set appropriate dot_precision default by @fulvius31 in #1184
  • [Interpret Mode] Support custom block size by @yf225 in #1194
  • [Autotuner] Add autotune_benchmark_fn setting by @yf225 in #1199
  • jagged_dense_bmm (#1126) by @trieuat in #1213
  • benchmarks: Include AMD GCN arch in get_device_name() by @fulvius31 in #1214
  • Fix linter errors by @yf225 in #1218
  • Fix unit test breakage due to upstream change by @yf225 in #1219
  • Fix static_shapes setting in test_dot.py by @yf225 in #1220
  • Fix memory leak when Triton compile error occurs by @yf225 in #1217
  • [Interpret Mode] Re-enable block-size dependent tests by @yf225 in #1212
  • [Interpret Mode] Raise error if hl.store is used with duplicate indices by @yf225 in #1221
  • [Interpret Mode] Fix hl.store automatic dtype conversion by @yf225 in #1226
  • [Interpret Mode] Fix hl.load with multiple 1D tensor indices by @yf225 in #1227
  • [CI] Fix NVSHMEM env vars and re-enable distributed CI job by @yf225 in #1201
  • Move jagged_dense_bmm expected code to the right place by @yf225 in #1232
  • Reduce log volume by moving output code logging behind HELION_PRINT_OUTPUT_CODE=1 by @yf225 in #1233
  • Add setup for Helion to compile on MTIA with basic test by @Myrthan in #1169
  • Make hl.triton_kernel support global var and recursive kernel call by @yf225 in #1234
  • Make hl.triton_kernel support output_like=None without being DCE'd by @yf225 in #1237
  • Show errors when pre-commit fails by @oulgen in #1238
  • example: gated delta net fwd_h by @v0i0 in #1119
  • Change property name from camel case to snake case. by @Myrthan in #1239
  • Move distributed examples to examples/distributed/ by @yf225 in #1240
  • fix for circular dependency by @mengluy0125 in #1236
  • Fix mask propagation for indexed stores when block_id is 0 by checking is not None instead of truthiness by @oulgen in #1244
  • Clean up distributed examples path refs by @yf225 in #1241
  • Fix RNG codegen for constant (specialized) dimensions by @yf225 in #1253
  • Avoid broadcasting for non-consecutive tensor indexers by @yf225 in #1254
  • Implement torch.sort support by @oulgen in #1247
  • Implement torch.topk support by @oulgen in #1248
  • Allow using hl.specialize to specialize on tensor strides by @yf225 in #1215
  • Use torch._dynamo.mark_static() API to allow tensor shape specialization outside of the kernel code by @yf225 in #1210 (see the sketch after this list)
  • chore: Bump actions/cache from 4 to 5 by @dependabot[bot] in #1257
  • Fix invalid Triton code for mixed scalar/block indexing in store operations when block dimension has size 1 by @oulgen in #1258
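
The mark_static change (#1210) can be shown with a tiny, hedged example: torch._dynamo.mark_static() is an existing PyTorch API; how Helion consumes the mark is inferred from the PR title.

```python
import torch

x = torch.randn(8, 4096, device="cuda")
# Mark dim 1 as static so its size (4096) can be specialized into the
# generated code without touching the kernel source itself (per #1210).
torch._dynamo.mark_static(x, 1)
# ...then call the Helion kernel with x as usual.
```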

Full Changelog: v0.2.6...v0.2.7

v0.2.6

04 Dec 04:33
9982041

What's Changed

Full Changelog: v0.2.5...v0.2.6

v0.2.5

02 Dec 21:42
1f880ea

What's Changed

  • Add 2d and 3d indirect indexing support by @yf225 in #593 (see the sketch below)
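
A rough sketch of the 2D indirect-indexing pattern #593 enables, assuming the standard Helion kernel structure; the exact supported forms may differ.

```python
import torch
import helion
import helion.language as hl


# Illustrative only: each element of the 2D int64 tensor idx selects an
# element of the 1D tensor x (a gather through a loaded index tile).
@helion.kernel()
def indirect_gather(x: torch.Tensor, idx: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(idx, dtype=x.dtype)
    for tile_m, tile_n in hl.tile(idx.size()):
        out[tile_m, tile_n] = x[idx[tile_m, tile_n]]
    return out
```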

Full Changelog: v0.2.4...v0.2.5

v0.2.4

01 Dec 22:30
9d0b8bd

What's Changed

  • Add user-customizable autotune_baseline_atol / rtol settings by @yf225 in #1136 (see the sketch after this list)
  • Fix specialize + reshape use case by @yf225 in #1146
  • Emit tl.constexpr dims for block-size-only view/reshape shapes by @oulgen in #1149
  • Add hl.triton_kernel to call Triton kernels from device code by @oulgen in #1150
  • Add torch.library.custom_op compatibility to @helion.kernel by @gmagogsfm in #1153
  • chore: Bump actions/checkout from 5 to 6 by @dependabot[bot] in #1154
  • Skip "Resource temporarily unavailable" error by @mengluy0125 in #1156
  • Automatically use zero tolerance for bitwise comparison for fp8 dtypes during autotuning by @gmagogsfm in #1158
  • Fix min hoisting bug by @yf225 in #1157
  • Fix scalar broadcast bug in inductor lowering by @gmagogsfm in #1159
  • Add LFBO Pattern Search by @ethche in #1115
  • benchmarks: allow external kernel mappings for Helion run.py by @fulvius31 in #1160
  • Fix CI dependency error for nvidia-nvshmem-cu12 when using PyTorch nightly and other CI lint errors from pyrefly change. by @choijon5 in #1165
  • Support AMD-specific autotune parameters: waves_per_eu and matrix_instr_nonkdim by @choijon5 in #1162
  • Get remote tensors inside @helion.kernel by @kwen2501 in #1122
  • fix shape bug in lfbo pattern search by @ethche in #1170
  • Fix lint errors in local dev env by @yf225 in #1174
  • [Ref Mode] Fix error message by @yf225 in #1175
  • Add support for x.view() by @oulgen in #1176
  • Add support for hl.randint by @oulgen in #1177
  • Support torch.tensor in helion.kernel by @oulgen in #1178
  • Support data-dependent hl.tile/hl.grid bounds in persistent kernels by @oulgen in #1180
  • [CI] remove all conda and move to uv by @oulgen in #1181
  • Fix unbacked symints in generated code by @oulgen in #1179
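
The baseline-tolerance settings from #1136 would be passed like any other kernel setting; the kwarg names below are taken from the PR title and should be treated as an assumption.

```python
import torch
import helion
import helion.language as hl


# Loosen the autotuner's baseline accuracy check for this kernel
# (kwarg names assumed from #1136's title).
@helion.kernel(autotune_baseline_atol=1e-2, autotune_baseline_rtol=1e-2)
def double(x: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    for tile in hl.tile(out.size()):
        out[tile] = x[tile] * 2
    return out
```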

Full Changelog: v0.2.3...v0.2.4

v0.2.3

18 Nov 18:21
2644d0a

What's Changed

  • [CI] Fail the distributed CI job if any unit test fails by @yf225 in #1125
  • Add DE-Surrogate hybrid autotuner algorithm + early stopping option for DE and DE-Surrogate by @FranciscoThiesen in #1096
  • Update AGENTS.md by @jansel in #1128
  • Add Settings.persistent_reserved_sms by @jansel in #1129
  • Add Settings.autotune_force_persistent by @jansel in #1130
  • [CI] Fix fbcode test_breakpoint error by @yf225 in #1132
  • Auto-select index_dtype by @jansel in #1131
  • Support tuple indexing by hl.static_range iterator by @yf225 in #1134 (see the sketch after this list)
  • Fix CI to surface errors correctly, fix all existing errors by @yf225 in #1138
  • Workaround TRITON_INTERPRET bug breaking tests by @jansel in #1139
  • Fix size 0 tensor handling by @jansel in #1140
  • [Benchmark CI] Print generated Triton code for the best config by @yf225 in #1142
  • Use pyrefly for type checking by @rchen152 in #1143
  • fix pyrefly errors by @oulgen in #1144
  • [CI] Skip TestBreakpoint in ref-eager CI job by @yf225 in #1141
  • Bump pyrefly to 0.42.1 and remove 'sed' workaround. by @rchen152 in #1145
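
A hedged sketch of the hl.static_range tuple-indexing support from #1134: because the loop index is a compile-time constant, tensors[i] is resolvable and the loop can be unrolled. The accumulation style here is an assumption.

```python
import torch
import helion
import helion.language as hl


# Illustrative only: sum three tensors by indexing a tuple with a
# compile-time hl.static_range iterator (per #1134).
@helion.kernel()
def sum3(a: torch.Tensor, b: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
    tensors = (a, b, c)
    out = torch.zeros_like(a)
    for tile in hl.tile(out.size()):
        for i in hl.static_range(3):
            out[tile] = out[tile] + tensors[i][tile]
    return out
```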

Full Changelog: v0.2.2...v0.2.3

v0.2.2

12 Nov 18:58
51580b4

What's Changed

  • [Benchmark] Update welford torch.compile function name by @yf225 in #1029
  • chore: Bump actions/upload-artifact from 4 to 5 by @dependabot[bot] in #1030
  • chore: Bump actions/download-artifact from 5 to 6 by @dependabot[bot] in #1031
  • [Benchmark CI] Set welford num_inputs to 6 to avoid timeout by @yf225 in #1032
  • Default config: reduce block_size and num_stages to avoid shared mem OOM by @yf225 in #1033
  • Default config: reduce block_size further to avoid shared mem OOM by @yf225 in #1034
  • Disable autotuner progress bar in fbcode unit test by @yf225 in #1035
  • Always print cached config by @oulgen in #1036
  • Fix dtype mismatch error in se_block example by @yf225 in #1040
  • Upgrade clang version by @oulgen in #1043
  • Fix missing static_shapes=False in deployment_autotuning.md by @jansel in #1042
  • Fix matmul output dtype to match PyTorch eager behavior by @yf225 in #1044
  • Fix layernorm bwd unit test by @yf225 in #1047
  • Fix FlattenedTileStrategy to handle unit-sized block dimensions by @yf225 in #1048
  • [CI] Fix debug_str() to be compatible with latest PyTorch nightly by @yf225 in #1050
  • [Fix upcoming CI error] Set current node in inductor lowering by @yf225 in #1052
  • Remove Section Navigation pane from Deployment and Autotuning page. by @choijon5 in #1051
  • Add settings.autotune_baseline_fn to allow passing in custom baseline function to autotuner by @yf225 in #1054
  • Add HELION_PRINT_REPRO=1 to print Helion kernel repro script to console by @yf225 in #1049
  • Fix caching for CPUs by @oulgen in #1055
  • Add get_num_sm for cpu by @oulgen in #1056
  • Support torch.rand / torch.rand_like with dynamic tile sizes by @yf225 in #1057
  • Remove line numbers from expected files by @oulgen in #1061
  • Ignore passed in config when force autotune is turned on by @oulgen in #1060
  • Update Watch Talk link to Triton conf talk. by @choijon5 in #1058
  • Helion Puzzle docs bug fixes by @Athe-kunal in #1062
  • Update test_persistent_kernels.expected by @jansel in #1070
  • Make HELION_PRINT_REPRO=1 take effect in more error cases by @yf225 in #1066
  • add geglu backward by @parsshar-RH in #1069
  • [Unblock internal] Fix log capture issue on internal tests by @yf225 in #1076
  • Add best effort triton-cpu support by @oulgen in #1037
  • Update test_debug_utils.py by @oulgen in #1077
  • Raise user error if device-loop is empty after DCE by @yf225 in #1074
  • Add GRPO loss example by @ighoshsubho in #1063
  • Use HELION_PRINT_REPRO=1 to print repro when device IR lowering or Triton codegen error by @yf225 in #1078
  • add AMD demo link by @vivienfanghuagood in #1068
  • Update test.yml by @oulgen in #1083
  • Fix GRPO loss example unit tests by @yf225 in #1079
  • Remove requirements.txt by @oulgen in #1088
  • Relax requirements for inline_triton output_like=None by @jansel in #1087
  • feat(autotuner): Make autotune cache class configurable via env var by @fulvius31 in #1071
  • Add support for while and pass by @jansel in #1090
  • Update sphinxtheme to pull from pypi package by @sekyondaMeta in #1091
  • [Autotuner] Better error message for default config error by @yf225 in #1092
  • Ignore illegal instruction errors by @jansel in #1093
  • Update talk links to PTC version by @jansel in #1094
  • Add autotuning log by @jansel in #1095
  • Fix builtin min / max handling in device loop by @yf225 in #1085
  • Add skipIfRocm to failing test on main by @jansel in #1101
  • Fix lint in newer triton by @jansel in #1098
  • Add AGENTS.md by @jansel in #1100
  • Refactor _decorators.codegen to allow multiple backends by @jansel in #1099
  • Add extra line before repro log; update repro log tests by @yf225 in #1102
  • Refactor inductor_lowering.py into two files by @jansel in #1103
  • Use CPU machine for triton-cpu by @oulgen in #1105
  • Fix no libdw.so issue on AMD CI by @yf225 in #1107
  • Fixes in helion puzzles by @Athe-kunal in #1104
  • Add distributed CI job (4xH100) and example unit tests by @yf225 in #1106
  • Generalize aten_lowering.py for multiple backends by @jansel in #1108
  • Support tensor.T for transpose by @yf225 in #1110
  • Add warning to discourage use of acc += lhs @ rhs pattern by @yf225 in #1111 (see the sketch after this list)
  • Remove @helion.jit usage and advise use of @helion.kernel by @yf225 in #1116
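
The `acc += lhs @ rhs` warning (#1111) points at a pattern Helion's own matmul examples avoid by using an explicit addmm; a sketch along the lines of those examples:

```python
import torch
import helion
import helion.language as hl


@helion.kernel()
def matmul(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    m, k = x.size()
    _, n = y.size()
    out = torch.empty([m, n], dtype=x.dtype, device=x.device)
    for tile_m, tile_n in hl.tile([m, n]):
        acc = hl.zeros([tile_m, tile_n], dtype=torch.float32)
        for tile_k in hl.tile(k):
            # preferred over `acc += x[...] @ y[...]` (see #1111)
            acc = torch.addmm(acc, x[tile_m, tile_k], y[tile_k, tile_n])
        out[tile_m, tile_n] = acc
    return out
```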

Full Changelog: v0.2.1...v0.2.2

v0.2.1

26 Oct 23:16
c5dbbbe

What's Changed

Full Changelog: v0.2.0...v0.2.1

v0.2.0

20 Oct 20:54
3a0e975

What's Changed

  • Verify compiled kernels in subprocess by @jansel in #914
  • Auto-shrink autotune_precompile_jobs based on free memory by @jansel in #940
  • Make HELION_FORCE_AUTOTUNE or kernel.autotune() skip the cache by @jansel in #930
  • Support warp specialization on B200 by @oulgen in #935
  • Update README.md by @oulgen in #943
  • Register tile symbol origin, to support tile + offset use case in blackwell attention by @yf225 in #939
  • [CI] Print failed tests by @oulgen in #942
  • Update examples to use run_example by @jansel in #941
  • blackwell attn with triton attr set by @v0i0 in #918
  • Set static_shapes=True by @oulgen in #937 (see the sketch after this list)
  • run.py env var to skip exception logging by @v0i0 in #946
  • Fix bug with unit sized dims and block_sizes by @jansel in #932
  • Update static_shapes docs by @jansel in #951
  • Add tile.count by @oulgen in #955
  • Auto detect low vram by @oulgen in #956
  • [CI] Use official PyTorch 2.9 by @oulgen in #962
  • Use interleaved_bench for run_example by @jansel in #945
  • Generalize tile_with_offset pass by @jansel in #949
  • Docstring updates by @jansel in #952
  • Import updates by @jansel in #953
  • Add missing environment variables to docs by @jansel in #957
  • Print out errors vs timeouts in autotuning status by @jansel in #960
  • Add HELION_AUTOTUNE_IGNORE_ERRORS by @jansel in #961
  • Exit autotuning faster on KeyboardInterrupt by @jansel in #963
  • Remove default settings by @jansel in #964
  • Add missing settings environment variables by @jansel in #965
  • Skip test_differential_evolution_search due to slowness by @jansel in #968
  • [Benchmark CI] Give nightly job permissions by @oulgen in #970
  • [Benchmark CI] Allow kicking off workflow dispatch by @oulgen in #971
  • [Benchmark CI] Allow specifying custom env vars via UI by @yf225 in #972
  • [blackwell attn example] qk scale as param by @v0i0 in #969
  • [Benchmark CI] Allow specifying custom args to benchmark runner via UI by @yf225 in #974
  • Add initial backwards compatibility tests by @oulgen in #958
  • Remove unrolling + warp spec by @PaulZhang12 in #967
  • [Benchmark CI] Set atol and rtol to 1e-2 by @yf225 in #976
  • [Benchmark] Fix tritonbench auto-installation by @yf225 in #980
  • [Autotuner] Fix fork-based autotuner to avoid re-initializing CUDA context in subprocess by @yf225 in #981
  • Make fork default precompilation strategy by @oulgen in #979
  • [benchmarks] change tritonbench path by @xuzhao9 in #966
  • Add skipIfA10G decorator by @yf225 in #982
  • Suggest HELION_AUTOTUNE_PRECOMPILE=spawn when IMA happens by @jansel in #984
  • Layer Norm bwd kernel to support large B*M case used by internal by @yf225 in #973
  • Fix timeouts in autotuning by @jansel in #985
  • Log generated triton code at the DEBUG level rather than INFO by @jansel in #986
  • Remove extra debug log for timeouts by @jansel in #987
  • Add squeeze_and_excitation_net kernel by @mengluy0125 in #870
  • Generalize test cases to support XPU by @EikanWang in #983
  • Updated README with News section of upcoming events. Added link to GPU mode talk. by @choijon5 in #991
  • Update README.md by @oulgen in #992
  • Update README.md by @oulgen in #993
  • Mamba2 Chunk Scan & State by @v0i0 in #950
  • Remove unrolling with tma + pipelining by @PaulZhang12 in #994
  • Add provenance annotations to output code by @jansel in #988
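
Two of the knobs above in one hedged sketch: static_shapes is now on by default (#937), so dynamic-shape kernels opt out explicitly, and HELION_FORCE_AUTOTUNE (#930) bypasses the config cache. The exact opt-out spelling is an assumption.

```python
import os

import torch
import helion
import helion.language as hl

os.environ["HELION_FORCE_AUTOTUNE"] = "1"  # skip cached configs (per #930)


@helion.kernel(static_shapes=False)  # opt out of the new default (per #937)
def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    for tile in hl.tile(out.size()):
        out[tile] = x[tile] + y[tile]
    return out
```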

Full Changelog: v0.1.8...v0.2.0

v0.1.8

15 Oct 00:37
b77301f

What's Changed

  • fix rmsnorm fwd tritonbench by @v0i0 in #840
  • Update input shapes for example kernels by @yf225 in #845
  • Extend eviction policy tests to all indexing types by @oulgen in #833
  • [Docs] Remove early development warning by @oulgen in #846
  • [Docs] Add link to gpumode discord by @oulgen in #847
  • [Docs] Add PTC promotional material by @oulgen in #848
  • [Benchmark] Add low mem dropout example by @karthickai in #641
  • Update lint.yml by @oulgen in #854
  • Remove hl.register_reduction_dim API by @yf225 in #834
  • Error message for boolean masking or torch.nonzero by @yf225 in #687
  • Remove hardcoded block_size=1 usage in attention kernel example by @yf225 in #843
  • Revert "Update to use the new attribute setting for tf32." by @choijon5 in #856
  • Decrease num_stages default from 3 to 2, to avoid shared memory OOM by @yf225 in #841
  • Allow user-defined specialization key by @jansel in #853
  • [Benchmark CI] Use fewer num_inputs for flash_attention to avoid timeout by @yf225 in #857
  • Remove legacy register_inductor_lowering code by @yf225 in #864
  • Set setstate/getstate methods to Config by @jansel in #868
  • [doc] Add deployment/autotuning guide by @jansel in #869
  • [Benchmark CI] Use equally-spaced-k mode to sample input shapes by @yf225 in #861
  • Fix sphinx warnings by @jansel in #871
  • Normalize tl.sqrt and libdevice.sqrt for tests by @oulgen in #866
  • [CI] Pin py3.10 and one py3.12 on pytorch2.9 by @oulgen in #858
  • [Docs] Suggest PyTorch 2.9 or above by @oulgen in #859
  • [Benchmark] Pin benchmarks to PyTorch 2.9 by @oulgen in #860
  • Print Triton code when error for easier debugging by @yf225 in #874
  • Terminate autotuning faster if progress is minimal by @oulgen in #855
  • Update README.md by @oulgen in #877
  • [CI] pin b200 to pytorch2.9 by @oulgen in #878
  • [Autotuner] Run CUDA synchronize before / after candidate func call, to surface CUDA errors sooner by @yf225 in #872
  • [Benchmark] bf16 x int16 helion kernel by @karthickai in #794
  • Install git for benchmarks by @oulgen in #882
  • Pin AMD to 6.4.4 by @oulgen in #883
  • Faster int4 gemm by @PaulZhang12 in #751
  • Pin AMD to 6.4.4 by @oulgen in #881
  • Remove PyTorch requirement from deps so that it is easier to install arbitrary version of pytorch by @oulgen in #879
  • [Benchmark CI] Use regular matmul instead of split-k by @yf225 in #884
  • [Benchmark] Use bespoke setup-python action by @oulgen in #885
  • [Benchmark] Drop memory bound kernels and replace them with gemms by @oulgen in #887
  • Add dependabot by @oulgen in #888
  • Update dependabot.yml by @oulgen in #891
  • chore: Bump actions/setup-python from 5 to 6 by @dependabot[bot] in #893
  • chore: Bump actions/download-artifact from 4 to 5 by @dependabot[bot] in #895
  • chore: Bump actions/upload-pages-artifact from 3 to 4 by @dependabot[bot] in #894
  • chore: Bump actions/checkout from 4 to 5 by @dependabot[bot] in #892
  • Upgrade ruff==0.14.0 by @jansel in #889
  • [Benchmark CI] grouped_gemm: include input preproc in timing measurement; update gemm backend name mapping by @yf225 in #898
  • chore: Bump astral-sh/setup-uv from 6 to 7 by @dependabot[bot] in #896
  • [Benchmark] use logger.exception for process errors by @oulgen in #902
  • [Benchmark CI] Reduce num_inputs for grouped_gemm and gemm benchmarks by @yf225 in #903
  • Query minimum dot size for XPU by @EikanWang in #900
  • Add matmul/addmm bwd examples and add test coverage by @tianrengao in #748
  • [CI] Pin amd to rocm7.0 by @oulgen in #907
  • [Benchmark] Move benchmark kernel sharding to dispatch by @oulgen in #905
  • [Benchmark] Provide a way to pass custom list of kernels by @oulgen in #906
  • [Benchmark CI] Use triton_tutorial_matmul for triton matmul baseline by @yf225 in #911
  • Remove cache around set_triton_allocator by @oulgen in #912
  • Add int4_gemm by @oulgen in #917
  • chore: Bump actions/github-script from 7 to 8 by @dependabot[bot] in #916
  • Catch missing cudnn error by @jansel in #873
  • Add progress bar for precompiling by @jansel in #919
  • Add new setting autotune_effort=[none/quick/full] by @choijon5 in #913 (see the sketch after this list)
  • Print error message for torch.chunk / torch.unbind to redirect users to hl.split by @yf225 in #921
  • Avoid setting default --input-sample-mode to equally-spaced-k by @yf225 in #922
  • Remove triton_helpers.* usage in lifted device function arguments by @yf225 in #849
  • Set HELION_DEV_LOW_VRAM=1 on a10g CI machines by @yf225 in #923
  • Suggest use of @helion.kernel(index_dtype=torch.int64) if index offset is out of bounds for int32 by @yf225 in #850
  • Deprecate use_default_config and replace all its uses with autotune_effort by @choijon5 in #924
  • Support hl.arange() with non-power-of-2 input by @yf225 in #862
  • Setting up RunLLm AI Chatbot by @sekyondaMeta in #925
  • Generalize examples with the DEVICE variable by @adam-smnk in #915
  • Fix lint error by @jansel in #926
  • Add lint to make sure examples and tests use device=DEVICE by @oulgen in #929
  • Support tile+offset and tensor descriptors by @jansel in #928
  • Fix triton/torch.compile compatibility issue by @jansel in #927
  • Fix CUDA IMA from combination of unrolling + pipelining by @PaulZhang12 in #920
  • Update the Agent ID by @sekyondaMeta in #931
  • [Benchmark CI] Use --non-square flag for gemm by @yf225 in #938
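
A hedged sketch combining two settings from this release: the autotune_effort knob (#913, values from the PR title) and the index_dtype=torch.int64 suggestion for kernels whose index offsets overflow int32 (#850).

```python
import torch
import helion
import helion.language as hl


@helion.kernel(autotune_effort="quick", index_dtype=torch.int64)
def copy_kernel(x: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    for tile in hl.tile(out.size()):
        out[tile] = x[tile]
    return out
```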

Full Changelog: v0.1.7...v0.1.8