Conversation

@michaelbenayoun (Member) commented on Jan 4, 2023

What does this PR do?

This is an ONNX graph transformation that wraps the UNet ONNX graph in a for-loop, so that the iteration steps of generation are performed directly inside the ONNX graph.

Sample code:

  1. First, export the UNet model:
optimum-cli export onnx --model hf-internal-testing/tiny-stable-diffusion-torch --task stable-diffusion stable_diffusion

The tiny, non-representative model hf-internal-testing/tiny-stable-diffusion-torch was used to iterate quickly during development.

  2. Then perform the transformation:
import numpy as np
import onnx
from onnxruntime import InferenceSession
from optimum.onnx.graph_transformations import embedd_loop_in_unet

path_unet = "stable_diffusion/unet/model.onnx"
path_unet_with_loop = "unet_with_loop.onnx"

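# Load the exported UNet, embed the generation loop, and save the result.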
unet_model = onnx.load(path_unet)
unet_with_loop = embedd_loop_in_unet(unet_model)
onnx.save(unet_with_loop, path_unet_with_loop)

unet_sess = InferenceSession(path_unet, providers=["CPUExecutionProvider"])
unet_with_loop_sess = InferenceSession(path_unet_with_loop, providers=["CPUExecutionProvider"])


batch_size = 16
initial_sample = np.random.uniform(size=(batch_size, 4, 32, 32)).astype(np.float32)
hidden_states = np.random.uniform(size=(batch_size, 25, 32)).astype(np.float32)

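# Reference implementation: run the UNet step-by-step from Python.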
def outside_loop(iterations):
    sample = initial_sample
    for i in range(iterations):
        timestep = np.array([i] * batch_size, dtype=np.int64)  # make the integer dtype explicit
        out = unet_sess.run(
            ["out_sample"],
            {
                "sample": sample, 
                "timestep": timestep,
                "encoder_hidden_states": hidden_states
            }
        )
        sample = out[0]
    return sample

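# Same computation with the iteration loop embedded in the ONNX graph: a single session.run call.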
def embedded_loop(iterations):
    num_iterations = np.array(iterations, dtype=np.int64)  # Loop trip count must be int64
    outputs = unet_with_loop_sess.run(
        ["sample"], 
        {
            "num_iterations": num_iterations,
            "initial_sample": initial_sample, 
            "encoder_hidden_states": hidden_states
        }
    )
    outputs = outputs[0]
    return outputs

outputs = outside_loop(10)
outputs_with_loop = embedded_loop(10)

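# The maximum absolute difference between the two implementations should be close to zero.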
print(np.max(np.abs(outputs - outputs_with_loop)))
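
For intuition, below is a minimal, self-contained sketch of the kind of graph such a transformation builds with the ONNX Loop operator. The names (loop_body, etc.) are illustrative, not the PR's actual implementation, and an Add node stands in for the UNet computation inside the loop body:

import onnx
from onnx import TensorProto, helper

# Loop body signature: (iteration_num, condition_in, carried...) -> (condition_out, carried...).
iter_num = helper.make_tensor_value_info("iter_num", TensorProto.INT64, [])
cond_in = helper.make_tensor_value_info("cond_in", TensorProto.BOOL, [])
sample_in = helper.make_tensor_value_info("sample_in", TensorProto.FLOAT, [None, 4, 32, 32])
cond_out = helper.make_tensor_value_info("cond_out", TensorProto.BOOL, [])
sample_out = helper.make_tensor_value_info("sample_out", TensorProto.FLOAT, [None, 4, 32, 32])

one = helper.make_tensor("one", TensorProto.FLOAT, [], [1.0])
body = helper.make_graph(
    nodes=[
        # The real transformation would inline the UNet nodes here; Add is a stand-in.
        helper.make_node("Identity", ["cond_in"], ["cond_out"]),
        helper.make_node("Add", ["sample_in", "one"], ["sample_out"]),
    ],
    name="loop_body",
    inputs=[iter_num, cond_in, sample_in],
    outputs=[cond_out, sample_out],
    initializer=[one],
)

# Main graph: a single Loop node. The empty second input means there is no
# early-exit condition, so the body runs exactly num_iterations times and the
# loop-carried sample is returned as the final output.
loop_node = helper.make_node("Loop", ["num_iterations", "", "initial_sample"], ["sample"], body=body)
graph = helper.make_graph(
    nodes=[loop_node],
    name="unet_with_loop_sketch",
    inputs=[
        helper.make_tensor_value_info("num_iterations", TensorProto.INT64, []),
        helper.make_tensor_value_info("initial_sample", TensorProto.FLOAT, [None, 4, 32, 32]),
    ],
    outputs=[helper.make_tensor_value_info("sample", TensorProto.FLOAT, [None, 4, 32, 32])],
)
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 17)])
onnx.checker.check_model(model)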

TODO

  • Outputs differ with CUDAExecutionProvider.
  • ONNX Runtime removes many initializers from the transformed graph; why?
  • Benchmark: on CPU and GPU with the tiny model there is no speedup; real-size models still need to be tried.
  • Support for schedulers: the transformed model currently iterates linearly over the steps, but the timesteps are usually computed by a scheduler, which still needs to be handled (see the sketch after this list).
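
To make the scheduler point concrete, here is a minimal illustration of why the timesteps are not a linear 0..N-1 range (assuming the diffusers library; DDPMScheduler with its default configuration is an arbitrary choice):

from diffusers import DDPMScheduler

scheduler = DDPMScheduler()  # num_train_timesteps defaults to 1000
scheduler.set_timesteps(10)  # select 10 inference steps
# The timesteps are descending and spread over the training range,
# e.g. tensor([900, 800, ..., 0]), not range(10).
print(scheduler.timesteps)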

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@echarlaix echarlaix added the onnx Related to the ONNX export label Jun 4, 2025

github-actions bot commented Sep 4, 2025

This PR has been marked as stale because it has been open for 90 days with no activity. This thread will be automatically closed in 30 days if no further activity occurs.

@github-actions github-actions bot added the Stale label Sep 4, 2025