Releases: pytorch/helion
v0.2.8
What's Changed
- Support passing Triton function object to `hl.triton_kernel()` by @yf225 in #1263
- chore: Bump actions/download-artifact from 6 to 7 by @dependabot[bot] in #1270
- chore: Bump actions/upload-artifact from 5 to 6 by @dependabot[bot] in #1271
- [Distributed] `one_shot_allreduce_bias_rmsnorm` example by @yf225 in #1266
- [Distributed] `matmul_reduce_scatter` example by @yf225 in #1269
- feat(benchmarks): add shapes to json output by @fulvius31 in #1273
- [Autotuner] Log the 'started' state to CSV, for easier user monitoring of kernel hanging at runtime by @yf225 in #1279
- default pattern search by @v0i0 in #1259
- Set LFBOPatternSearch as default by @ethche in #1280
- fix surrogate search for singleton population by @v0i0 in #1281
- Ignore bzl files in git. by @Myrthan in #1282
- chunk fused_linear_jsd by @v0i0 in #1277
- Fix buggy interaction between XYZProgramIDs and L2GroupingProgramIDs by @jansel in #1288
- Fix bug with torch.rand_like compile error by @jansel in #1289
- [autotuner] print path to generated Triton code after selection of kernel by @bringlein in #1285
- Add support for torch.gather by @jansel in #1290
- [docs] Add more example to docs by @oulgen in #1301
- [lint] remove dead ignores by @oulgen in #1302
- Add proper error handling for torch.split and torch.tensor_split in device loops by @oulgen in #1297
- Fix BlockReductionStrategy to use existing index variables for argmax/argmin operations by @oulgen in #1298
- Skip more failing tests on cpu backend by @oulgen in #1304
- [CI] Fix broken notebook by @oulgen in #1305
- Fix shape inference for tile indexing on size-1 dimensions and use broadcast_to for block_ptr by @oulgen in #1299
- Fix codegen broadcasting for tile indexing on size-1 tensor dimensions by @oulgen in #1300
- Enable tests on py314 by @oulgen in #1306
New Contributors
- @bringlein made their first contribution in #1285
Full Changelog: v0.2.7...v0.2.8
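Several entries above concern the pattern-search autotuner (making LFBOPatternSearch the default, fixing surrogate search for singleton populations). As a rough illustration only, and not Helion's actual code, a coordinate-style pattern search over tuning parameters can be sketched as:

```python
# Illustrative sketch (not Helion's implementation): a greedy coordinate
# "pattern search" that nudges each parameter up or down by a factor of
# two while the benchmark keeps improving.
def pattern_search(benchmark, start, steps=10):
    """Minimize benchmark(config) by local moves on each parameter."""
    best = dict(start)
    best_time = benchmark(best)
    for _ in range(steps):
        improved = False
        for key in best:
            for factor in (2, 0.5):
                cand = dict(best)
                cand[key] = max(1, int(cand[key] * factor))
                t = benchmark(cand)
                if t < best_time:
                    best, best_time, improved = cand, t, True
        if not improved:
            break  # local optimum under the search pattern
    return best, best_time

# Toy cost model (an assumption for the demo): pretend the optimum is
# block_size=64, num_warps=4.
def toy_benchmark(cfg):
    return abs(cfg["block_size"] - 64) + abs(cfg["num_warps"] - 4)
```

The real autotuner benchmarks compiled Triton kernels instead of a toy cost model, and LFBO adds a learned surrogate to prioritize candidates; the search skeleton is the same idea.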
v0.2.7
What's Changed
- [CI] Skip all failing distributed tests by @yf225 in #1206
- Include index_dtype in the printed decorator snippet by @choijon5 in #1207
- Add dict comprehension support by @oulgen in #1191
- settings: set appropriate dot_precision default by @fulvius31 in #1184
- [Interpret Mode] Support custom block size by @yf225 in #1194
- [Autotuner] Add `autotune_benchmark_fn` setting by @yf225 in #1199
- jagged_dense_bmm (#1126) by @trieuat in #1213
- benchmarks: Include AMD GCN arch in get_device_name() by @fulvius31 in #1214
- Fix linter errors by @yf225 in #1218
- Fix unit test breakage due to upstream change by @yf225 in #1219
- Fix `static_shapes` setting in test_dot.py by @yf225 in #1220
- Fix memory leak when Triton compile error occurs by @yf225 in #1217
- [Interpret Mode] Re-enable block-size dependent tests by @yf225 in #1212
- [Interpret Mode] Raise error if `hl.store` is used with duplicate indices by @yf225 in #1221
- [Interpret Mode] Fix `hl.store` automatic dtype conversion by @yf225 in #1226
- [Interpret Mode] Fix `hl.load` with multiple 1D tensor indices by @yf225 in #1227
- [CI] Fix NVSHMEM env vars and re-enable distributed CI job by @yf225 in #1201
- Move jagged_dense_bmm expected code to the right place by @yf225 in #1232
- Reduce log volume by moving output code logging behind HELION_PRINT_OUTPUT_CODE=1 by @yf225 in #1233
- Add setup for Helion to compile on MTIA with basic test by @Myrthan in #1169
- Make `hl.triton_kernel` support global var and recursive kernel call by @yf225 in #1234
- Make `hl.triton_kernel` support output_like=None without being DCE'd by @yf225 in #1237
- Show errors when pre-commit fails by @oulgen in #1238
- example: gated delta net fwd_h by @v0i0 in #1119
- Change property name from camel case to snake case. by @Myrthan in #1239
- Move distributed examples to `examples/distributed/` by @yf225 in #1240
- fix for circular dependency by @mengluy0125 in #1236
- Fix mask propagation for indexed stores when block_id is 0 by checking is not None instead of truthiness by @oulgen in #1244
- Clean up distributed examples path refs by @yf225 in #1241
- Fix RNG codegen for constant (specialized) dimensions by @yf225 in #1253
- Avoid broadcasting for non-consecutive tensor indexers by @yf225 in #1254
- Implement torch.sort support by @oulgen in #1247
- Implement torch.topk support by @oulgen in #1248
- Allow using `hl.specialize` to specialize on tensor strides by @yf225 in #1215
- Use `torch._dynamo.mark_static()` API to allow tensor shape specialization outside of the kernel code by @yf225 in #1210
- chore: Bump actions/cache from 4 to 5 by @dependabot[bot] in #1257
- Fix invalid Triton code for mixed scalar/block indexing in store operations when block dimension has size 1 by @oulgen in #1258
Full Changelog: v0.2.6...v0.2.7
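One of the interpret-mode fixes above (#1221) makes `hl.store` raise on duplicate indices. The motivation, sketched below in plain Python rather than Helion's actual code, is that a scatter with repeated indices has no well-defined winner on real hardware, so the interpreter rejects it instead of silently picking one:

```python
# Illustrative sketch (not Helion's code): a scatter-store that refuses
# duplicate indices, since which write "wins" on a GPU is not defined.
def checked_store(buf, indices, values):
    """Scatter `values` into `buf` at `indices`, rejecting duplicates."""
    if len(indices) != len(set(indices)):
        raise ValueError("store with duplicate indices is non-deterministic")
    for i, v in zip(indices, values):
        buf[i] = v
    return buf
```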
v0.2.6
What's Changed
- Add tests for tile.index floor division pattern in indexing by @oulgen in #1189
- Update pyrefly pre-commit hook. by @rchen152 in #1186
- [CI] Skip failing distributed test by @yf225 in #1196
- [CI] Pin networkx for py3.14 by @oulgen in #1197
- Add tuple comprehension support by @oulgen in #1190
- Print repro code on autotune success by @yf225 in #1203
- Use int64 indexing for pids as well by @Chillee in #1195
Full Changelog: v0.2.5...v0.2.6
v0.2.5
v0.2.4
What's Changed
- Add user-customizable autotune_baseline_atol / rtol settings by @yf225 in #1136
- Fix specialize + reshape use case by @yf225 in #1146
- Emit tl.constexpr dims for block-size-only view/reshape shapes by @oulgen in #1149
- Add hl.triton_kernel to call Triton kernels from device code by @oulgen in #1150
- Add torch.library.custom_op compatibility to @helion.kernel by @gmagogsfm in #1153
- chore: Bump actions/checkout from 5 to 6 by @dependabot[bot] in #1154
- Skip Resource temporarily unavailable error by @mengluy0125 in #1156
- Automatically use zero tolerance for bitwise comparison for fp8 dtypes during autotuning by @gmagogsfm in #1158
- Fix min hoisting bug by @yf225 in #1157
- Fix scalar broadcast bug in inductor lowering by @gmagogsfm in #1159
- Add LFBO Pattern Search by @ethche in #1115
- benchmarks: allow external kernel mappings for Helion run.py by @fulvius31 in #1160
- Fix CI dependency error for nvidia-nvshmem-cu12 when using PyTorch nightly and other CI lint errors from pyrefly change. by @choijon5 in #1165
- Support AMD-specific autotune parameters: waves_per_eu and matrix_instr_nonkdim by @choijon5 in #1162
- Get remote tensors inside `@helion.kernel` by @kwen2501 in #1122
- fix shape bug in lfbo pattern search by @ethche in #1170
- Fix lint errors in local dev env by @yf225 in #1174
- [Ref Mode] Fix error message by @yf225 in #1175
- Add support for x.view() by @oulgen in #1176
- Add support for hl.randint by @oulgen in #1177
- Support torch.tensor in helion.kernel by @oulgen in #1178
- Support data-dependent hl.tile/hl.grid bounds in persistent kernels by @oulgen in #1180
- [CI] remove all conda and move to uv by @oulgen in #1181
- Fix unbacked symints in generated code by @oulgen in #1179
New Contributors
- @gmagogsfm made their first contribution in #1153
- @ethche made their first contribution in #1115
- @kwen2501 made their first contribution in #1122
Full Changelog: v0.2.3...v0.2.4
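The `autotune_baseline_atol` / `rtol` settings (#1136) and the zero-tolerance fp8 comparison (#1158) both hinge on the standard elementwise closeness criterion, the same one `torch.allclose` uses. A minimal sketch of that acceptance test, in plain Python:

```python
# Illustrative sketch of the usual atol/rtol acceptance criterion the
# autotuner applies when comparing a candidate kernel's output against
# a baseline: |candidate - baseline| <= atol + rtol * |baseline|.
# Setting atol=rtol=0 degenerates to an exact (bitwise-style) match,
# which is what #1158 uses for fp8 dtypes.
def close_enough(candidate, baseline, atol=1e-8, rtol=1e-5):
    return all(
        abs(c - b) <= atol + rtol * abs(b)
        for c, b in zip(candidate, baseline)
    )
```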
v0.2.3
What's Changed
- [CI] Fail the distributed CI job if any unit test fails by @yf225 in #1125
- Add DE-Surrogate hybrid autotuner algorithm + early stopping option for DE and DE-Surrogate by @FranciscoThiesen in #1096
- Update AGENTS.md by @jansel in #1128
- Add Settings.persistent_reserved_sms by @jansel in #1129
- Add Settings.autotune_force_persistent by @jansel in #1130
- [CI] Fix fbcode test_breakpoint error by @yf225 in #1132
- Auto-select index_dtype by @jansel in #1131
- Support tuple indexing by hl.static_range iterator by @yf225 in #1134
- Fix CI to surface errors correctly, fix all existing errors by @yf225 in #1138
- Workaround TRITON_INTERPRET bug breaking tests by @jansel in #1139
- Fix size 0 tensor handling by @jansel in #1140
- [Benchmark CI] Print generated Triton code for the best config by @yf225 in #1142
- Use pyrefly for type checking by @rchen152 in #1143
- fix pyrefly errors by @oulgen in #1144
- [CI] Skip TestBreakpoint in ref-eager CI job by @yf225 in #1141
- Bump pyrefly to 0.42.1 and remove 'sed' workaround. by @rchen152 in #1145
New Contributors
- @FranciscoThiesen made their first contribution in #1096
- @rchen152 made their first contribution in #1143
Full Changelog: v0.2.2...v0.2.3
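"Auto-select index_dtype" (#1131) reflects a common trade-off: 32-bit index arithmetic is cheaper, but overflows once flat offsets exceed the int32 range. The function name and exact policy below are assumptions for illustration, not Helion's implementation:

```python
# Illustrative sketch of auto-selecting an index dtype: stay on int32
# while every flat offset fits, otherwise fall back to int64.
INT32_MAX = 2**31 - 1

def pick_index_dtype(tensor_numels):
    """Return 'int32' when the largest tensor's offsets fit, else 'int64'."""
    return "int32" if max(tensor_numels) <= INT32_MAX else "int64"
```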
v0.2.2
What's Changed
- [Benchmark] Update welford torch.compile function name by @yf225 in #1029
- chore: Bump actions/upload-artifact from 4 to 5 by @dependabot[bot] in #1030
- chore: Bump actions/download-artifact from 5 to 6 by @dependabot[bot] in #1031
- [Benchmark CI] Set welford num_inputs to 6 to avoid timeout by @yf225 in #1032
- Default config: reduce block_size and num_stages to avoid shared mem OOM by @yf225 in #1033
- Default config: reduce block_size further to avoid shared mem OOM by @yf225 in #1034
- Disable autotuner progress bar in fbcode unit test by @yf225 in #1035
- Always print cached config by @oulgen in #1036
- Fix dtype mismatch error in se_block example by @yf225 in #1040
- Upgrade clang version by @oulgen in #1043
- Fix missing static_shapes=False in deployment_autotuning.md by @jansel in #1042
- Fix matmul output dtype to match PyTorch eager behavior by @yf225 in #1044
- Fix layernorm bwd unit test by @yf225 in #1047
- Fix FlattenedTileStrategy to handle unit-sized block dimensions by @yf225 in #1048
- [CI] Fix debug_str() to be compatible with latest PyTorch nightly by @yf225 in #1050
- [Fix upcoming CI error] Set current node in inductor lowering by @yf225 in #1052
- Remove Section Navigation pane from Deployment and Autotuning page. by @choijon5 in #1051
- Add `settings.autotune_baseline_fn` to allow passing in custom baseline function to autotuner by @yf225 in #1054
- Add `HELION_PRINT_REPRO=1` to print Helion kernel repro script to console by @yf225 in #1049
- Fix caching for CPUs by @oulgen in #1055
- Add get_num_sm for cpu by @oulgen in #1056
- Support torch.rand / torch.rand_like with dynamic tile sizes by @yf225 in #1057
- Remove line numbers from expected files by @oulgen in #1061
- Ignore passed in config when force autotune is turned on by @oulgen in #1060
- Update Watch Talk link to Triton conf talk. by @choijon5 in #1058
- Helion Puzzle docs bug fixes by @Athe-kunal in #1062
- Update test_persistent_kernels.expected by @jansel in #1070
- Make HELION_PRINT_REPRO=1 take effect in more error cases by @yf225 in #1066
- add geglu backward by @parsshar-RH in #1069
- [Unblock internal] Fix log capture issue on internal tests by @yf225 in #1076
- Add best effort triton-cpu support by @oulgen in #1037
- Update test_debug_utils.py by @oulgen in #1077
- Raise user error if device-loop is empty after DCE by @yf225 in #1074
- Add GRPO loss example by @ighoshsubho in #1063
- Use HELION_PRINT_REPRO=1 to print repro when device IR lowering or Triton codegen error by @yf225 in #1078
- add AMD demo link by @vivienfanghuagood in #1068
- Update test.yml by @oulgen in #1083
- Fix GRPO loss example unit tests by @yf225 in #1079
- Remove requirements.txt by @oulgen in #1088
- Relax requirements for inline_triton output_like=None by @jansel in #1087
- feat(autotuner): Make autotune cache class configurable via env var by @fulvius31 in #1071
- Add support for while and pass by @jansel in #1090
- Update sphinxtheme to pull from pypi package by @sekyondaMeta in #1091
- [Autotuner] Better error message for default config error by @yf225 in #1092
- Ignore illegal instruction errors by @jansel in #1093
- Update talk links to PTC version by @jansel in #1094
- Add autotuning log by @jansel in #1095
- Fix builtin min / max handling in device loop by @yf225 in #1085
- Add skipIfRocm to failing test on main by @jansel in #1101
- Fix lint in newer triton by @jansel in #1098
- Add AGENTS.md by @jansel in #1100
- Refactor _decorators.codegen to allow multiple backends by @jansel in #1099
- Add extra line before repro log; update repro log tests by @yf225 in #1102
- Refactor inductor_lowering.py into two files by @jansel in #1103
- Use CPU machine for triton-cpu by @oulgen in #1105
- Fix no libdw.so issue on AMD CI by @yf225 in #1107
- Fixes in helion puzzles by @Athe-kunal in #1104
- Add distributed CI job (4xH100) and example unit tests by @yf225 in #1106
- Generalize aten_lowering.py for multiple backends by @jansel in #1108
- Support tensor.T for transpose by @yf225 in #1110
- Add warning to discourage use of the `acc += lhs @ rhs` pattern by @yf225 in #1111
- Remove `@helion.jit` usage and advise use of `@helion.kernel` by @yf225 in #1116
New Contributors
- @Athe-kunal made their first contribution in #1062
- @parsshar-RH made their first contribution in #1069
- @ighoshsubho made their first contribution in #1063
- @vivienfanghuagood made their first contribution in #1068
- @fulvius31 made their first contribution in #1071
Full Changelog: v0.2.1...v0.2.2
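Several PRs in this release route debug output through the `HELION_PRINT_REPRO=1` environment variable (#1049, #1066, #1078). The mechanism is a simple opt-in gate; a sketch in that spirit (the helper name is made up for illustration, the real implementation lives in Helion):

```python
import os

# Illustrative sketch of an env-var-gated debug dump in the style of
# HELION_PRINT_REPRO=1: the repro script is only emitted when the user
# explicitly opted in, keeping default log volume low.
def maybe_print_repro(repro_script, env=os.environ):
    """Print a repro script only when HELION_PRINT_REPRO=1 is set."""
    if env.get("HELION_PRINT_REPRO") == "1":
        print(repro_script)
        return True
    return False
```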
v0.2.1
What's Changed
- No autotuning on block_ptr if tma is available by @PaulZhang12 in #997
- Add reps for benchmarking stability by @PaulZhang12 in #999
- Prioritize outermost loop for warp spec by @PaulZhang12 in #1000
- Add backward pass for softmax kernel by @karthickai in #744
- Fix linter in softmax by @oulgen in #1003
- Fix test_examples.expected by @oulgen in #1002
- Beef up caching tests by @oulgen in #1001
- Add HELION_ASSERT_CACHE_HIT to debug/explain cache miss by @oulgen in #1006
- Better error message for calling Helion kernel from another kernel by @yf225 in #1008
- Assert that we are cache hitting on the CI by @oulgen in #1007
- Always raise `FailedToUnpackTile` when `for tile_m, tile_d in hl.tile(m, d)` is used by @yf225 in #1009
- Adding demo for running softmax kernel on Google Colab by @choijon5 in #944
- int4 gemm accurate baselines by @PaulZhang12 in #1010
- Add sitemap xml by @sekyondaMeta in #1013
- [helion] backward support for swiglu by @shunting314 in #756
- Raise informative error when `hl.dot` with 3D inputs has a batch dim mismatch by @yf225 in #1012
- [CI] Fix AMD journal check errors by @yf225 in #1016
- Support `breakpoint()` in device code when interpret mode is on by @yf225 in #1020
- Sort requirements file by @oulgen in #1021
- Better type checking for eviction policies by @oulgen in #1024
- Bump linter versions by @jansel in #1018
- Garbage collect expected results by @jansel in #1017
- Make indexing choice a list by @oulgen in #1025
- [Docs] Add list of indexing autotuning docs by @oulgen in #1027
- Make store indexing also individually tunable by @oulgen in #1028
New Contributors
- @shunting314 made their first contribution in #756
Full Changelog: v0.2.0...v0.2.1
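The caching work in this release (beefed-up caching tests, `HELION_ASSERT_CACHE_HIT`, asserting cache hits on CI) revolves around one idea: autotuned configs are cached per input signature, and an opt-in strict mode turns any miss into a hard failure so regressions surface on CI. A sketch of that shape (class and method names are assumptions, not Helion's API):

```python
# Illustrative sketch (not Helion's code) of a config cache keyed on
# input shapes/dtypes, with an "assert on miss" mode in the spirit of
# HELION_ASSERT_CACHE_HIT.
class ConfigCache:
    def __init__(self, assert_hit=False):
        self._cache = {}
        self._assert_hit = assert_hit

    @staticmethod
    def key(shapes, dtypes):
        return (tuple(map(tuple, shapes)), tuple(dtypes))

    def lookup(self, shapes, dtypes, autotune):
        k = self.key(shapes, dtypes)
        if k not in self._cache:
            if self._assert_hit:
                raise AssertionError(f"cache miss for {k}")
            self._cache[k] = autotune()  # expensive: run the autotuner
        return self._cache[k]
```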
v0.2.0
What's Changed
- Verify compiled kernels in subprocess by @jansel in #914
- Auto-shrink autotune_precompile_jobs based on free memory by @jansel in #940
- Make HELION_FORCE_AUTOTUNE or kernel.autotune() skip the cache by @jansel in #930
- Support warp specialization on B200 by @oulgen in #935
- Update README.md by @oulgen in #943
- Register tile symbol origin, to support `tile + offset` use case in blackwell attention by @yf225 in #939
- [CI] Print failed tests by @oulgen in #942
- Update examples to use run_example by @jansel in #941
- blackwell attn with triton attr set by @v0i0 in #918
- Set static_shapes=True by @oulgen in #937
- run.py env var to skip exception logging by @v0i0 in #946
- Fix bug with unit sized dims and block_sizes by @jansel in #932
- Update static_shapes docs by @jansel in #951
- Add tile.count by @oulgen in #955
- Auto detect low vram by @oulgen in #956
- [CI] Use official PyTorch 2.9 by @oulgen in #962
- Use interleaved_bench for run_example by @jansel in #945
- Generalize tile_with_offset pass by @jansel in #949
- Docstring updates by @jansel in #952
- Import updates by @jansel in #953
- Add missing environment variables to docs by @jansel in #957
- Print out errors vs timeouts in autotuning status by @jansel in #960
- Add HELION_AUTOTUNE_IGNORE_ERRORS by @jansel in #961
- Exit autotuning faster on KeyboardInterrupt by @jansel in #963
- Remove default settings by @jansel in #964
- Add missing settings environment variables by @jansel in #965
- Skip test_differential_evolution_search due to slowness by @jansel in #968
- [Benchmark CI] Give nightly job permissions by @oulgen in #970
- [Benchmark CI] Allow kicking off workflow dispatch by @oulgen in #971
- [Benchmark CI] Allow specifying custom env vars via UI by @yf225 in #972
- [blackwell attn example] qk scale as param by @v0i0 in #969
- [Benchmark CI] Allow specifying custom args to benchmark runner via UI by @yf225 in #974
- Add initial backwards compatibility tests by @oulgen in #958
- Remove unrolling + warp spec by @PaulZhang12 in #967
- [Benchmark CI] Set atol and rtol to 1e-2 by @yf225 in #976
- [Benchmark] Fix tritonbench auto-installation by @yf225 in #980
- [Autotuner] Fix fork-based autotuner to avoid re-initializing CUDA context in subprocess by @yf225 in #981
- Make fork default precompilation strategy by @oulgen in #979
- [benchmarks] change tritonbench path by @xuzhao9 in #966
- Add skipIfA10G decorator by @yf225 in #982
- Suggest HELION_AUTOTUNE_PRECOMPILE=spawn when IMA happens by @jansel in #984
- Layer Norm bwd kernel to support large B*M case used by internal by @yf225 in #973
- Fix timeouts in autotuning by @jansel in #985
- Log generated triton code at the DEBUG level rather than INFO by @jansel in #986
- Remove extra debug log for timeouts by @jansel in #987
- Add squeeze_and_excitation_net kernel by @mengluy0125 in #870
- Generalize test cases to support XPU by @EikanWang in #983
- Updated README with News section of upcoming events. Added link to GPU mode talk. by @choijon5 in #991
- Update README.md by @oulgen in #992
- Update README.md by @oulgen in #993
- Mamba2 Chunk Scan & State by @v0i0 in #950
- Remove unrolling with tma + pipelining by @PaulZhang12 in #994
- Add provenance annotations to output code by @jansel in #988
Full Changelog: v0.1.8...v0.2.0
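"Auto-shrink autotune_precompile_jobs based on free memory" (#940) is a capacity calculation: cap the requested parallelism by how many compile subprocesses fit in the memory currently available. The function name and the per-job estimate below are assumptions for illustration, not Helion's actual numbers:

```python
# Illustrative sketch of shrinking a parallel precompile job count to
# fit in free memory; 4 GiB per job is an assumed estimate, not the
# value Helion uses.
def pick_precompile_jobs(requested, free_bytes, per_job_bytes=4 * 2**30):
    """Cap the requested job count by how many jobs fit in free memory."""
    affordable = max(1, free_bytes // per_job_bytes)
    return max(1, min(requested, affordable))
```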
v0.1.8
What's Changed
- fix rmsnorm fwd tritonbench by @v0i0 in #840
- Update input shapes for example kernels by @yf225 in #845
- Extend eviction policy tests to all indexing types by @oulgen in #833
- [Docs] Remove early development warning by @oulgen in #846
- [Docs] Add link to gpumode discord by @oulgen in #847
- [Docs] Add PTC promotional material by @oulgen in #848
- [Benchmark] Add low mem dropout example by @karthickai in #641
- Update lint.yml by @oulgen in #854
- Remove `hl.register_reduction_dim` API by @yf225 in #834
- Error message for boolean masking or torch.nonzero by @yf225 in #687
- Remove hardcoded `block_size=1` usage in attention kernel example by @yf225 in #843
- Revert "Update to use the new attribute setting for tf32." by @choijon5 in #856
- Decrease `num_stages` default from 3 to 2, to avoid shared memory OOM by @yf225 in #841
- Allow user-defined specialization key by @jansel in #853
- [Benchmark CI] Use fewer num_inputs for flash_attention to avoid timeout by @yf225 in #857
- Remove legacy `register_inductor_lowering` code by @yf225 in #864
- Set setstate/getstate methods to Config by @jansel in #868
- [doc] Add deployment/autotuning guide by @jansel in #869
- [Benchmark CI] Use equally-spaced-k mode to sample input shapes by @yf225 in #861
- Fix sphinx warnings by @jansel in #871
- Normalize tl.sqrt and libdevice.sqrt for tests by @oulgen in #866
- [CI] Pin py3.10 and one py3.12 on pytorch2.9 by @oulgen in #858
- [Docs] Suggest PyTorch 2.9 or above by @oulgen in #859
- [Benchmark] Pin benchmarks to PyTorch 2.9 by @oulgen in #860
- Print Triton code when error for easier debugging by @yf225 in #874
- Terminate autotuning faster if progress is minimal by @oulgen in #855
- Update README.md by @oulgen in #877
- [CI] pin b200 to pytorch2.9 by @oulgen in #878
- [Autotuner] Run CUDA synchronize before / after candidate func call, to surface CUDA errors sooner by @yf225 in #872
- [Benchmark] bf16 x int16 helion kernel by @karthickai in #794
- Install git for benchmarks by @oulgen in #882
- Pin AMD to 6.4.4 by @oulgen in #883
- Faster int4 gemm by @PaulZhang12 in #751
- Pin AMD to 6.4.4 by @oulgen in #881
- Remove PyTorch requirement from deps so that it is easier to install arbitrary version of pytorch by @oulgen in #879
- [Benchmark CI] Use regular matmul instead of split-k by @yf225 in #884
- [Benchmark] Use bespoke setup-python action by @oulgen in #885
- [Benchmark] Drop memory bound kernels and replace them with gemms by @oulgen in #887
- Add dependabot by @oulgen in #888
- Update dependabot.yml by @oulgen in #891
- chore: Bump actions/setup-python from 5 to 6 by @dependabot[bot] in #893
- chore: Bump actions/download-artifact from 4 to 5 by @dependabot[bot] in #895
- chore: Bump actions/upload-pages-artifact from 3 to 4 by @dependabot[bot] in #894
- chore: Bump actions/checkout from 4 to 5 by @dependabot[bot] in #892
- Upgrade ruff==0.14.0 by @jansel in #889
- [Benchmark CI] grouped_gemm: include input preproc in timing measurement; update gemm backend name mapping by @yf225 in #898
- chore: Bump astral-sh/setup-uv from 6 to 7 by @dependabot[bot] in #896
- [Benchmark] use logger.exception for process errors by @oulgen in #902
- [Benchmark CI] Reduce num_inputs for grouped_gemm and gemm benchmarks by @yf225 in #903
- Query minimum dot size for XPU by @EikanWang in #900
- Add matmul/addmm bwd examples and add test coverage by @tianrengao in #748
- [CI] Pin amd to rocm7.0 by @oulgen in #907
- [Benchmark] Move benchmark kernel sharding to dispatch by @oulgen in #905
- [Benchmark] Provide a way to pass custom list of kernels by @oulgen in #906
- [Benchmark CI] Use triton_tutorial_matmul for triton matmul baseline by @yf225 in #911
- Remove cache around set_triton_allocator by @oulgen in #912
- Add int4_gemm by @oulgen in #917
- chore: Bump actions/github-script from 7 to 8 by @dependabot[bot] in #916
- Catch missing cudnn error by @jansel in #873
- Add progress bar for precompiling by @jansel in #919
- Adding new setting, autotune_effort=[none/quick/full] by @choijon5 in #913
- Print error message for torch.chunk / torch.unbind to redirect users to hl.split by @yf225 in #921
- Avoid setting default `--input-sample-mode` to `equally-spaced-k` by @yf225 in #922
- Remove `triton_helpers.*` usage in lifted device function arguments by @yf225 in #849
- Set HELION_DEV_LOW_VRAM=1 on a10g CI machines by @yf225 in #923
- Suggest use of `@helion.kernel(index_dtype=torch.int64)` if index offset is out of bounds for int32 by @yf225 in #850
- Deprecate use_default_config and replace all its uses with autotune_effort by @choijon5 in #924
- Support `hl.arange()` with non-power-of-2 input by @yf225 in #862
- Setting up RunLLm AI Chatbot by @sekyondaMeta in #925
- Generalize examples with the DEVICE variable by @adam-smnk in #915
- Fix lint error by @jansel in #926
- Add lint to make sure examples and tests use device=DEVICE by @oulgen in #929
- Support tile+offset and tensor descriptors by @jansel in #928
- Fix triton/torch.compile compatibility issue by @jansel in #927
- Fix CUDA IMA from combination of unrolling + pipelining by @PaulZhang12 in #920
- Update the Agent ID by @sekyondaMeta in #931
- [Benchmark CI] Use `--non-square` flag for gemm by @yf225 in #938
New Contributors
- @dependabot[bot] made their first contribution in #893
- @tianrengao made their first contribution in #748
Full Changelog: v0.1.7...v0.1.8
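Supporting `hl.arange()` with non-power-of-2 input (#862) runs into a Triton constraint: `tl.arange` requires a power-of-two length. The standard workaround, sketched below in plain Python as an illustration of the lowering idea rather than Helion's actual code, is to round up to the next power of two and mask off the padding:

```python
# Illustrative sketch: a non-power-of-2 range can be lowered as a
# padded power-of-two range plus a validity mask, since tl.arange only
# accepts power-of-two extents.
def next_power_of_2(n):
    return 1 << (n - 1).bit_length()

def padded_arange(n):
    """Return (padded_indices, mask) where mask marks the valid prefix."""
    size = next_power_of_2(n)
    idx = list(range(size))
    mask = [i < n for i in idx]
    return idx, mask
```

Loads and stores guarded by such a mask simply ignore the padded lanes, which is how masked tails are handled throughout Triton-style codegen.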