
Conversation

@IlyasMoutawwakil (Member) commented on Jul 24, 2025

What does this PR do?

The batched inference and numerical mismatch issues have persisted in the ORTModelForCausalLM class for a very long time. Even with #1381, position ids were only created in the generation tests instead of being created inside the inference class itself; and for generation, prepare_inputs_for_generation takes care of creating position ids anyway. So for a simple forward pass, it was still an issue.
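
For reference, a minimal sketch of how position ids can be derived from a padded attention mask inside the forward pass; this is the usual cumsum trick used by transformers decoders, shown here on illustrative inputs:

import torch

attention_mask = torch.tensor([[0, 0, 1, 1, 1],   # left-padded sequence
                               [1, 1, 1, 1, 1]])  # full sequence

# Cumulative sum gives each attended token its 0-based position;
# padded slots then get a dummy value (they are masked out anyway).
position_ids = attention_mask.long().cumsum(-1) - 1
position_ids.masked_fill_(attention_mask == 0, 1)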

This PR is the result of a rabbit hole I went into when I realized that my decoder testing refactoring removed the only batched generation check we had 🥲. So I enabled batched inference/generation by default and, to my surprise, all models that require position ids were failing batched inference/generation. But simply installing transformers==4.52 made them pass, so the problem obviously came from something in the transformers 4.53 refactoring.

Starting from transformers 4.53, the modeling code uses boolean 4D masks, which are not "officially" supported by the torch ONNX export (it's not exported as a real "masked operation"): the boolean mask is simply converted to a tensor filled with 0 and -inf, see https://github.com/pytorch/pytorch/blob/f8fafdc7a6d260cea6c145643f4cf73631c81460/torch/onnx/symbolic_opset14.py#L187
In the case of padded batched inputs, this makes the softmax return NaNs, which pollute the logits of the entire sequence (fully padded sequences return only NaNs as logits). This behavior used to be avoided by not calling _unmask_unattended during ONNX export. Instead of going down that path again for the new masking methods (which also results in small numerical mismatches), we fix this by patching the torch ONNX exporter directly, overriding the graph it uses to replace aten::scaled_dot_product_attention.
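
A minimal repro of the failure mode, in plain torch with no export involved: once a fully padded row of the boolean mask is converted to an all--inf additive mask, the softmax divides zero by zero.

import torch

# A row where every position is masked becomes all -inf after the boolean-to-float conversion
scores = torch.full((1, 4), float("-inf"))
print(torch.softmax(scores, dim=-1))  # tensor([[nan, nan, nan, nan]])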

A lot of fixes and changes, but hey, we were able to un-bloat the model patcher, get matching logits even for the masked (padded) tokens, and get better support and testing of batched inference/generation across transformers versions.

P.S.: I tested every version of transformers from 4.36 to 4.53, and also tested these changes on torch 2.1.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@IlyasMoutawwakil marked this pull request as ready for review July 25, 2025 12:36
@IlyasMoutawwakil removed the request for review from echarlaix July 28, 2025 08:53
common_inputs = {"input_ids": {0: "batch_size", 1: "sequence_length"}}
common_inputs["attention_mask"] = {0: "batch_size", 1: "past_sequence_length + sequence_length"}
self.add_past_key_values(common_inputs, direction="inputs")
common_inputs["attention_mask"] = {0: "batch_size", 1: "past_sequence_length + 1"}
@IlyasMoutawwakil (Member, Author):

this was wrong (the attention mask axis must be "past_sequence_length + sequence_length", not "past_sequence_length + 1")
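
For context, these axis dicts end up as the dynamic_axes argument of torch.onnx.export. A toy sketch of how they are consumed (the module and file name are illustrative, not the actual optimum export path):

import torch

class Toy(torch.nn.Module):
    def forward(self, input_ids, attention_mask):
        return input_ids * attention_mask

torch.onnx.export(
    Toy(),
    (torch.ones(2, 5, dtype=torch.long), torch.ones(2, 5, dtype=torch.long)),
    "toy.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch_size", 1: "sequence_length"},
        # with cache, the mask spans both past and current tokens
        "attention_mask": {0: "batch_size", 1: "past_sequence_length + sequence_length"},
    },
)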

Comment on lines +561 to +565
if self._normalized_config.multi_query:
    # No dim for `n_head` when using multi-query attention
    inputs_or_outputs[f"{name}.{i}.key_value"] = {0: "batch_size", 1: decoder_sequence_name}
else:
    inputs_or_outputs[f"{name}.{i}.key_value"] = {0: "batch_size", 2: decoder_sequence_name}
@IlyasMoutawwakil (Member, Author):

support for both multi_query=True and multi_query=False for gpt_bigcode
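
In gpt_bigcode the cache is a single fused key_value tensor per layer; with multi-query attention there is no n_head dimension, which moves the sequence axis from 2 to 1. A sketch of the two shapes with illustrative sizes:

import torch

batch_size, n_head, seq_len, head_dim = 2, 8, 5, 64

# multi_query=True: one shared KV head, keys and values fused on the last dim
kv_mqa = torch.zeros(batch_size, seq_len, 2 * head_dim)          # axis 1 is the sequence
# multi_query=False: per-head cache, so the sequence moves to axis 2
kv_mha = torch.zeros(batch_size, n_head, seq_len, 2 * head_dim)  # axis 2 is the sequence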

Comment on lines +257 to +259
# No-op bfloat16 casting to avoid issues with legacy ONNX export which cast to complex128
def noop_bfloat16_casting(self):
    return self
@IlyasMoutawwakil (Member, Author) commented on Jul 28, 2025:

this is for falcon with alibi, and for any modeling method that calls .bfloat16() on a tensor (which is not supported by the ONNX exporter)
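
Presumably the patcher swaps this in for torch.Tensor.bfloat16 for the duration of the export; a minimal sketch of that kind of monkeypatch (the context manager is illustrative, not the actual patcher code):

import contextlib
import torch

@contextlib.contextmanager
def patch_bfloat16_casting():
    # Temporarily replace Tensor.bfloat16 with the no-op defined above
    original = torch.Tensor.bfloat16
    torch.Tensor.bfloat16 = noop_bfloat16_casting
    try:
        yield
    finally:
        torch.Tensor.bfloat16 = original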

Comment on lines 419 to 422
@_onnx_symbolic("aten::__ior_")
@symbolic_helper.parse_args("v", "v")
def __ior_(g: jit_utils.GraphContext, self: torch._C.Value, other: torch._C.Value) -> torch._C.Value:
    return g.op("Or", self, other)
@IlyasMoutawwakil (Member, Author) commented on Jul 28, 2025:

this fixes the missing in-place or (|=) op.
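
This operator shows up whenever modeling code mutates a mask in place; a minimal trigger (illustrative tensors):

import torch

causal = torch.zeros(4, 4, dtype=torch.bool)
padding = torch.ones(4, 4, dtype=torch.bool)
causal |= padding  # traces to aten::__ior_, which previously had no ONNX symbolic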

Comment on lines +90 to +96
global IMAGE
if IMAGE is None:
    # Load a sample image from the Hugging Face Hub
    IMAGE = load_image(
        "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/in_paint/overture-creations-5sI6fQgYIuo.png"
    )
image = IMAGE.resize((width, height))
@IlyasMoutawwakil (Member, Author):

to avoid load_image failing when called multiple times; the method is very error-prone when many calls are made to the same URL in parallel, so the image is fetched once and cached at module level.

Comment on lines +198 to +209
def test_all_models_requiring_postion_ids(self):
    for model_type in TasksManager.get_supported_model_type_for_task(task=self.TASK, exporter="onnx"):
        model_type_requires_position_ids = model_type in MODEL_TYPES_REQUIRING_POSITION_IDS
        onnx_config_class = TasksManager._SUPPORTED_MODEL_TYPE[model_type]["onnx"][self.TASK].func
        onnx_config_class_with_position_ids = issubclass(onnx_config_class, TextDecoderWithPositionIdsOnnxConfig)

        if model_type_requires_position_ids ^ onnx_config_class_with_position_ids:
            raise ValueError(
                f"Model type {model_type} {'requires' if model_type_requires_position_ids else 'does not require'} position ids, "
                f"but the ONNX config class {onnx_config_class} {'is' if onnx_config_class_with_position_ids else 'is not'} "
                f"subclassed from TextDecoderWithPositionIdsOnnxConfig.\n"
            )
@IlyasMoutawwakil (Member, Author):

so that they're always in sync (found a couple of models that weren't added); the XOR flags any model type where the MODEL_TYPES_REQUIRING_POSITION_IDS registry and the ONNX config class hierarchy disagree in either direction.

@echarlaix (Collaborator) left a comment:

Huuuge work, thank you so much @IlyasMoutawwakil 🔥🔥

@IlyasMoutawwakil changed the title from "Fix ORTModelForCausalLM batched generation" to "Fix batched inference/generation, position_ids creation, falcon alibi, gpt_bigcode multi-query,.." Jul 29, 2025
@IlyasMoutawwakil merged commit 31d4ea9 into main Jul 30, 2025
57 of 61 checks passed
@IlyasMoutawwakil deleted the fix-ort-batched-generation branch July 30, 2025 12:45
echarlaix pushed a commit to huggingface/optimum-onnx that referenced this pull request Aug 1, 2025
…con alibi, gpt_bigcode multi-query,.. (#28)

same as huggingface/optimum#2326

---------

Co-authored-by: Copilot <[email protected]>