Conversation

tolgacangoz commented Nov 23, 2025

Hey Modular community!

This PR introduces a dedicated pipeline for diffusion-based vision models, specifically implementing the Z-Image architecture. While MAX currently has strong support for LLMs and VLMs, support for multi-part diffusion systems (VAE + Text Encoder(s) + (maybe Image Encoder) + Backbone Transformer(s) + Scheduler) is still an area with room for growth.

Inspired by the Democratizing AI Compute series, I aim to demonstrate how MAX can optimize multi-part, multi-modal graphs without relying heavily on lossy approximations such as distillation or caching intermediate latents.

The diffusion benchmarks I found on the website test the U-Net, the earlier denoiser architecture for diffusion-based generative vision models, so I am proposing the natural next steps.

Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

TODOs:

✅ ❓ 1. Make it work
  • ❓ The VAE's decoder and the backbone transformer seem to compile, but compilation of the text encoder (Qwen/Qwen3-4B) appears to hang without any warning. I tried to fix it but couldn't manage, and since I am not familiar with the compiler internals, I opened an issue.
  • The overall pipeline seems to "work" (except for the text encoder), but it generates NaNs. I am working on this in the second stage.
⏳ 2. Make it right
  1. Compare the outputs of diffusers and this PR with the same inputs, including seeds.
     [side-by-side comparison images: diffusers vs. this PR]
  2. Ensure that the changes to the Modular codebase are minimal, reasonable, and as native to MAX as possible.
  3. Consider adding tests, or postpone them until the very end.
🔜 3. Make it fast
  • Make it faster than diffusers' flux-fast version of Z-Image.
🔜 4. Make it the fastest

Planned benchmarks for throughput efficiency:

modular, vllm-omni, sglang, stable-diffusion.cpp, diffusers

github-actions bot commented Nov 23, 2025

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

tolgacangoz (Author)

I have read the CLA Document and I hereby sign the CLA

tolgacangoz changed the title from "[MAX] Propose to add support for image editing pipeline: QwenImageEdit" to "[MAX] Propose to add support for image editing pipeline: QwenImageEditPipeline" on Nov 23, 2025
modular-cla-bot bot added a commit to modular/cla that referenced this pull request Nov 24, 2025
lattner (Collaborator) commented Nov 24, 2025

This is really amazing Tolga!

mdanatg requested a review from KCaverly on November 24, 2025 16:43
tolgacangoz (Author) commented Nov 26, 2025

Thank you, Chris! I am definitely looking forward to the next chapters of the series!

I am really enjoying digging into the MAX engine—it feels like a great fit for these types of vision models too. I will keep pushing updates to get this pipeline fully operational and ready for review soon!

tolgacangoz (Author) commented Nov 30, 2025

Hi @KCaverly. I noticed the move toward the new Model API and the recent addition of gpt_oss_module_v3. For this diffusion pipeline, should I stick to the Define-then-Run pattern used in most implementations, or is this a good candidate to pilot the new Eager-style API? The new API doesn't yet include features such as Conv2D (which the VAE needs), so I assumed the former; nevertheless, I wanted to ask.

tolgacangoz force-pushed the integrations/QwenImageEdit2511Pipeline branch from 4a5de12 to 237312e on December 1, 2025 08:52
tolgacangoz changed the title from "[MAX] Propose to add support for image editing pipeline: QwenImageEditPipeline" to "[MAX] Propose to Add Support for Image Generation: ZImagePipeline" on Dec 1, 2025
ehsanmok (Contributor) commented Dec 2, 2025

@tolgacangoz Very impressive work!

For this diffusion pipeline, should I stick to the Define-then-Run pattern used in most implementations, or is this a good candidate to pilot the new Eager-style API? The new API doesn't yet include features such as Conv2D (which the VAE needs), so I assumed the former; nevertheless, I wanted to ask.

About your question on the experimental API: it'd be great to see this using the new API, as we're heading toward migration. As you know, we're trying to make the UX similar to PyTorch, so there's F.conv2d that you can use. We lack proper documentation right now, but here's a minimal example to give you an idea of how we're thinking about it:

import max.nn.module_v3 as nn
from max.dtype import DType
from max.experimental import tensor, defaults, random
from max.experimental import functional as F  # assumed import for the functional ops (F.gelu) used below
from max.graph import TensorType               # tensor type spec for model.compile

class MyModel(nn.Module):
    def __init__(self):
        self.fc = nn.Linear(10, 10)
        self.proj = nn.Linear(10, 1)

    def __call__(self, x: tensor.Tensor) -> tensor.Tensor:
        x = self.fc(x)
        x = F.gelu(x)
        x = self.proj(x)
        return x


dtype, device = defaults()
print(f"default dtype {dtype}, device: {device}")
model = MyModel()
max_output = model(random.normal((10, 10)))  # <- lazy eval and no need for numpy
# OR, if you need to use an np / torch tensor, use `tensor.Tensor.from_dlpack`:
# import numpy as np
# max_output = model(tensor.Tensor.from_dlpack(np.random.randn(10, 10)).cast(DType.float32 if device.is_host else DType.bfloat16).to(device))

# for prod, compile your model
compiled_model = model.compile(TensorType(dtype=DType.float32 if device.is_host else DType.bfloat16, shape=(10, 10), device=device))

Please let us know if any issues come up; feel free to create separate issues, and please tag me :)

tolgacangoz (Author)

Alright, thanks for the answer! I am now focusing entirely on the new Model API.

KCaverly (Contributor) commented Dec 2, 2025

This is great! So I can pick it up on my side and play around with it, do you have an example of how you are using this model as is? I see we haven't wired up the Image Generation Pipeline yet, but this should be fairly close.

tolgacangoz (Author)

Thanks for jumping in, Kyle! I haven't run a full forward pass script yet. I will write up an example script and post it here shortly!

tolgacangoz force-pushed the integrations/QwenImageEdit2511Pipeline branch from 237312e to a149ee6 on December 8, 2025 09:22
KCaverly (Contributor) left a comment

Awesome progress! Just added a few drive-by comments.

modular deleted a comment from CodeAlexx on Dec 9, 2025
tolgacangoz force-pushed the integrations/QwenImageEdit2511Pipeline branch 2 times, most recently from 5cbedcd to 78db594 on December 22, 2025 09:52
Adds the Z-Image-Base model to the list of supported example repositories, expanding the available model options for the Z-Image pipeline.

Removes unnecessary dependencies and decorators from the scheduler implementation, including an unused torch import and the register_to_config decorator, which are no longer needed.

Adds future annotations import to improve type hint compatibility.
Replaces PyTorch-based implementation with native MAX operations for improved performance and native integration with the MAX execution engine.

Converts neural network modules to use MAX primitives including ops for mathematical operations, Linear layers, RMSNorm, and custom attention mechanisms. Implements ragged tensor support using flash_attention_ragged_gpu with input row offsets for variable-length sequences.
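
For context, the ragged layout packs all sequences into one token dimension and describes them with prefix-sum offsets; a minimal sketch of that bookkeeping (NumPy only, helper name hypothetical, not the kernel call itself):

import numpy as np

def build_input_row_offsets(seq_lens: list[int]) -> np.ndarray:
    """Prefix sum of sequence lengths: offsets[i] marks where sequence i
    starts in the flattened (ragged) token dimension."""
    return np.cumsum([0] + list(seq_lens), dtype=np.uint32)

# Three prompts of lengths 5, 3, and 7 packed into one ragged batch of 15 tokens.
print(build_input_row_offsets([5, 3, 7]))  # [ 0  5  8 15]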

Refactors attention processor to stateless implementation, removes PyTorch autocast contexts, and replaces complex number operations with explicit cosine/sine calculations for rotary embeddings.
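
For reference, a framework-agnostic sketch of that cosine/sine formulation (NumPy here, half-split convention assumed; the PR expresses the same math with MAX ops):

import numpy as np

def rope_rotate(x: np.ndarray, pos: np.ndarray, theta: float = 10000.0) -> np.ndarray:
    """Rotate features of x ([seq, dim], dim even) by position-dependent angles
    without complex dtypes: multiply pairs (x1, x2) by (cos, sin) explicitly."""
    half = x.shape[-1] // 2
    freqs = 1.0 / theta ** (np.arange(half) / half)   # [half]
    angles = pos[:, None] * freqs[None, :]            # [seq, half]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    # Same result as (x1 + i*x2) * exp(i*angles), expressed with real ops only.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)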

Updates tensor manipulation to use MAX ops including reshape, concat, gather, and slice operations. Converts gradient checkpointing and sequence padding logic to ragged tensor format with explicit batch indexing.

Adds dtype and device parameters throughout the architecture for explicit control over precision and execution location.
Replaces PyTorch-based implementation with MAX framework equivalents to enable graph-based execution and optimization.

Inlines encoder, decoder, and distribution classes from external modules directly into the file to consolidate dependencies.

Removes attention processor methods and gradient checkpointing support that are not compatible with the MAX execution model.

Converts Conv2d layers to use MAX ops and replaces torch.cat operations with ops.concat for tensor concatenation.

Simplifies method signatures by removing return_dict parameters and returning values directly instead of wrapped outputs.
Switches from a Qwen3-specific tokenizer to a generic TextTokenizer interface in the z_image architecture. This change promotes code reusability and reduces tight coupling to a specific tokenizer implementation.

Also updates import statements to use absolute imports for better clarity.
Aligns VAE configuration with standard diffusers API by replacing custom parameters with standard fields like `block_out_channels`, `down_block_types`, and `force_upcast`. Adds comprehensive documentation for scaling and latent space handling.

Renames `DenoiserConfig` to `TransformerConfig` to better reflect its purpose and improves consistency across the codebase.

Updates text encoder to use `Qwen3Config` instead of `Qwen2_5VLConfig`, reflecting model architecture changes.

Adds proper type hints by importing `Tuple` from `typing` module.
Adds future annotations import for better type hint compatibility.

Consolidates imports by removing unnecessary parentheses and line breaks.

Corrects module name from 'autoencoderkl' to 'autoencoder_kl'.
…ion pipeline into a native MAX Engine model implementation.

Replaces relative imports with local module references for VAE, transformer, and scheduler components. Introduces `ZImageModel` class that inherits from `PipelineModel` and implements proper model loading with separate VAE, text encoder, and transformer compilation.

Adds comprehensive `ZImageInputs` `dataclass` to encapsulate all model inputs including text prompts, image generation parameters, vision inputs, and distributed execution signals.

Starts implementing three separate graph builders (`_build_vae_graph`, `_build_text_encoder_graph`, `_build_transformer_graph`) to support multi-GPU processing and proper device placement.

Removes HuggingFace-specific dependencies (`AutoTokenizer`, `PreTrainedModel`) and adapts the execution flow to use MAX Engine's inference session and weight management system.

Updates method signatures to align with MAX Engine conventions, replacing `__call__` with `execute`.
Adds a `VaeImageProcessor` class for handling image post-processing.

This class includes methods for denormalizing images, converting `Tensor`s to `PIL` images, and providing a unified post-processing interface that allows users to convert the processed images to either `PIL` format or retain the latent representation.
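
As a rough illustration of those two steps (NumPy/PIL shown; the actual class operates on MAX `Tensor`s):

import numpy as np
from PIL import Image

def denormalize(img: np.ndarray) -> np.ndarray:
    """Map VAE output from [-1, 1] back to [0, 1]."""
    return np.clip(img / 2 + 0.5, 0.0, 1.0)

def to_pil(img: np.ndarray) -> Image.Image:
    """img: [H, W, 3] float array in [0, 1] -> 8-bit PIL image."""
    return Image.fromarray((img * 255).round().astype(np.uint8))

# pil_image = to_pil(denormalize(decoded))  # `decoded` being the VAE decoder output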
…kenizer files

- Updated import path for AutoencoderKL in z_image/arch.py.
- Deleted nn/data_processing.py as it was no longer needed.
- Removed tokenizer.py file to streamline the architecture.
- Changed import for Qwen3 model in z_image.py to align with new structure.
- Adjusted build_text_encoder method to instantiate Qwen3 directly.
Improves configuration handling for the Z_Image pipeline by:

- Adds comprehensive docstrings to previously undocumented VAE config fields (latents mean/std, layers per block, normalization groups, etc.)

- Updates Transformer config to use more accurate field names (n_layers instead of num_layers, n_heads instead of num_attention_heads, removes obsolete fields)

- Expands configuration parameter passing to include all relevant fields for both VAE and Transformer components

- Corrects delegation from Llama3Config to Qwen3Config throughout, aligning with the actual text encoder being used

- Removes trailing whitespace and fixes formatting inconsistencies

These changes ensure the configuration accurately reflects the model architecture and improves code maintainability by adding missing documentation.
Migrates the flow match Euler discrete scheduler from diffusers dependencies to a native Modular implementation using experimental tensor operations with Eager style and functional API.

Removes dependencies on numpy, torch, and HuggingFace utilities by implementing equivalent functionality with MAX primitives. Replaces inheritance from ConfigMixin/SchedulerMixin with direct attribute storage.

Updates copyright header to include Modular Inc. and switches to Apache v2.0 with LLVM Exceptions license.

Implements custom linspace function to replace numpy dependency and uses native tensor operations throughout. Makes scipy optional for beta sigmas feature with availability check.
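
Such a helper is small; a dependency-free sketch (the in-tree version returns a MAX tensor rather than a Python list):

def linspace(start: float, stop: float, num: int) -> list[float]:
    """Evenly spaced values from start to stop, inclusive (matches numpy.linspace)."""
    if num == 1:
        return [start]
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]

assert linspace(1.0, 0.0, 5) == [1.0, 0.75, 0.5, 0.25, 0.0]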

Improves type hints to use modern Python union syntax and adds proper return type annotations for all methods.
Replaces PyTorch-based implementation with MAX native module infrastructure to improve performance and integration.

Removes local implementations of Encoder, Decoder, and DiagonalGaussianDistribution classes in favor of importing from shared VAE module.

Updates all operations to use MAX experimental functional API (F.concat) instead of PyTorch ops for better compatibility with MAX runtime.

Refactors blend_v and blend_h methods to avoid in-place tensor item assignment, which is incompatible with MAX's functional paradigm, by using slice/concat operations instead.
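
To illustrate the functional rewrite (NumPy in place of the MAX functional ops, shapes hypothetical): rather than writing blended values into a slice in place, the blended seam and the untouched remainder are concatenated into a new tensor:

import numpy as np

def blend_v(a: np.ndarray, b: np.ndarray, overlap: int) -> np.ndarray:
    """Blend the bottom `overlap` rows of tile `a` into the top rows of tile `b`;
    a, b: [C, H, W]. No in-place item assignment is needed."""
    w = (np.arange(overlap) / overlap)[None, :, None]            # 0 -> 1 down the seam
    seam = a[:, -overlap:, :] * (1 - w) + b[:, :overlap, :] * w
    # Rebuild b from the blended seam plus its untouched remainder.
    return np.concatenate([seam, b[:, overlap:, :]], axis=1)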

Improves tiled encode/decode methods with proper bounds checking and dimension clamping to prevent out-of-bounds errors during tile processing.

Adds proper return_dict parameter handling to encode, decode, and forward methods for consistent API with diffusers library conventions.
Implements core neural network modules for variational autoencoder functionality including encoder/decoder blocks, upsampling/downsampling layers, ResNet blocks, and UNet components.

Introduces DiagonalGaussianDistribution for latent space sampling and AutoencoderMixin for tiling/slicing optimizations. These components enable efficient image encoding/decoding for diffusion models with support for spatial normalization, attention mechanisms, and configurable block architectures.

Provides foundation for image generation pipelines requiring latent space representations.
Implements core neural network building blocks for image processing pipelines including Conv2d, attention mechanisms, normalization layers (RMSNorm, GroupNorm, LayerNorm, SpatialNorm), and activation functions.

Fixes symbolic tensor issues in image processor by using DLPack for numpy conversion instead of direct permute operations, which improves compatibility with the tensor framework.
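
The DLPack route looks roughly like this (assuming the tensor lives on host and exposes the DLPack protocol, as MAX and torch tensors do):

import numpy as np

def to_numpy(t) -> np.ndarray:
    """Zero-copy NumPy view of a host tensor via the DLPack protocol."""
    return np.from_dlpack(t)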

Adds utility functions for sequence padding, attention mask creation, and masked scatter operations to support variable-length sequences in the pipeline architecture.
Refactors the Z-Image transformer implementation to use the v3 module API and experimental functional operations, replacing graph-based operations with tensor-based functional approach.

Simplifies the architecture by removing device and dtype parameters from layer constructors, consolidating sequential operations, and replacing custom attention with manual implementation for CPU compatibility.

Updates patchify/unpatchify logic to handle batched lists of tensors instead of ragged tensors, implementing proper sequence padding and attention masking for variable-length sequences.

Replaces Weight objects with Tensor initialization and ModuleDict/ModuleList for dynamic layer management, improving compatibility with the modern API surface.
Implements a new pipeline for converting text to images by introducing an image generator pipeline class that delegates work to the underlying pipeline model.

Registers the new pipeline type in the pipeline registry and routes image generation tasks to the appropriate pipeline implementation.

Tracks input and output tokens through telemetry metrics to monitor generation performance.
Replaces PyTorch-based tensor operations with MAX experimental tensor API throughout the Z-Image text-to-image generation pipeline.

Removes dependency on DiffusionPipeline base class and switches to eager execution for VAE and transformer models while keeping text encoder as compiled graph.

Updates type hints to use modern Python syntax (PEP 604 union operators) and replaces Optional/Union with native type union notation.

Implements serving interface compatibility by adding next_chunk method and ImageGenerationOutput integration for production deployment.

Removes complex multi-device scattered/gathered vision input handling, simplifying the model execution path and reducing graph complexity.

Consolidates import statements and removes unused vision processing utilities like scatter_gather_indices computation.
Introduces a specialized encoder that extends the base Qwen3 model to expose hidden states instead of logits. This enables the Z-Image pipeline to use Qwen3 as a text encoder by accessing the normalized hidden representations before the language model head.

The encoder applies standard transformer processing (token embedding, positional encoding, layer-by-layer attention) but returns the final normalized hidden states rather than vocabulary logits, making it suitable for multimodal embedding tasks.
Consolidates image generation pipeline code by moving `ImageGeneratorPipeline` from a standalone module into the pipeline variants structure alongside text and embeddings generation.

Removes the minimal `image_generator_pipeline.py` wrapper that delegated to `PipelineModel` methods and replaces it with a full-featured `ImageGenerationPipeline` class that implements the complete generation workflow template.

Exports image generation types (`ImageGenerationContextType`, `ImageGenerationInputs`, `ImageGenerationOutput`, `ImageGenerationRequest`, `ImageGenerationMetadata`) through the public interfaces to enable external usage.

Updates architecture registration to include `ARCHITECTURES` list for image models and adds `ImageGeneratorPipeline` to serve pipelines for streaming image generation requests.
Simplifies architecture definition by removing `multi_gpu_supported` flag and `enable_chunked_prefill` requirement, which were redundant given the current implementation constraints.

Replaces direct `qwen3_arch` reference with dedicated `Qwen3Encoder` class to properly encapsulate text encoding behavior. Reuses `llama3` weight adapters instead of maintaining separate conversion logic.

Restricts supported encodings to `bfloat16` with paged KV cache strategy, removing unused `float32` option to clarify deployment requirements.
Removes runtime-derived fields from `SchedulerConfig`, `VAEConfig`, and `TransformerConfig` dataclasses, replacing required fields with optional fields that have sensible defaults. Updates `generate()` methods to use `getattr()` with fallback values instead of direct attribute access.
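
The fallback pattern is roughly the following (field names and default values illustrative):

from dataclasses import dataclass

@dataclass
class TransformerConfig:
    n_layers: int = 30      # optional fields with sensible defaults
    n_heads: int = 24

def generate(config) -> dict:
    # Tolerate configs that predate a field by falling back to a default value.
    return {
        "n_layers": getattr(config, "n_layers", 30),
        "n_heads": getattr(config, "n_heads", 24),
    }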

Eliminates Float8 config parsing and device specification logic from config generation, as these are now handled externally. Changes tuple types to list types for collection fields.

Reduces coupling between configuration objects and runtime state by removing fields like `_class_name`, `_diffusers_version`, `dtype`, `devices`, and `float8_config` that belong to execution context rather than model architecture definition.
Extends pipeline registry to support Diffusers-based image generation models by detecting `model_index.json` files and extracting architecture information from `_class_name` field.

Renames `ImageGeneratorPipeline` to `ImageGenerationPipeline` for consistency with naming conventions.

Adds optional component fields (`scheduler`, `vae`, `text_encoder`, `transformer`) to `SupportedArchitecture` to accommodate multi-component Diffusers models (maybe temporarily).

Changes `get_active_huggingface_config` return type from `AutoConfig` to `PretrainedConfig` to handle both traditional transformer configs and synthetic configs for Diffusers models.

Configures tokenizer loading with `subfolder="tokenizer"` parameter for Diffusers repository structure.

Updates API server to instantiate and serve image generation pipelines.

Enables recursive weight file discovery for repositories with nested directory structures, falling back to `fs.find()` when globstar patterns are unsupported by the filesystem backend.

Adds direct safetensors header parsing to detect data types across multi-component Diffusers pipelines where different shards may use different dtypes.
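
For reference, the safetensors header is self-describing (8-byte little-endian length followed by a JSON table), so per-shard dtypes can be collected without loading any weights; a minimal sketch:

import json
import struct

def safetensors_dtypes(path: str) -> set[str]:
    """Collect the dtypes declared in a .safetensors file header."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    return {entry["dtype"] for name, entry in header.items() if name != "__metadata__"}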
Aligns layer initialization with checkpoint weight `dtype`s by switching default `dtype` from `float32` to `bfloat16` for `Conv2d`, `GroupNorm`, and `LayerNorm` parameters.

Fixes `Conv2d` to properly handle NCHW input by converting to NHWC before convolution operation, addressing a layout mismatch that would cause incorrect results.
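
The layout fix amounts to a pair of transposes around the convolution; a sketch with NumPy and a hypothetical NHWC convolution callable:

import numpy as np

def conv2d_nchw(x_nchw: np.ndarray, conv_nhwc) -> np.ndarray:
    """Accept NCHW input, run an NHWC convolution, return NCHW output."""
    x_nhwc = np.transpose(x_nchw, (0, 2, 3, 1))   # NCHW -> NHWC
    y_nhwc = conv_nhwc(x_nhwc)
    return np.transpose(y_nhwc, (0, 3, 1, 2))     # NHWC -> NCHW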

Updates `Attention.to_out` from a single `Linear` layer to a `ModuleList` containing `Linear` and `Dropout` to match the checkpoint parameter naming structure from `diffusers`.

Adds Apache 2.0 license header and applies formatting improvements including import reordering and line length fixes.
Reorganizes component initialization to separate weight partitioning from model compilation. Previously, weights were partitioned by string prefix heuristics; now partitions by source file path from the `diffusers`-style repository structure (`vae/`, `text_encoder/`, `transformer/` subfolders).

Loads component configs (scheduler, VAE, text encoder, transformer) explicitly from their respective JSON files rather than relying on a single unified config, enabling proper initialization of each `diffusers` component.

Compiles the VAE decoder and text encoder graph separately, returning compiled models rather than mixing eager and graph execution modes. The transformer compilation is stubbed pending BMM kernel rank fixes.

Removes unused parallel ops, KV cache estimation, and signal buffer infrastructure. Cleans up type hints to use standard Python syntax (list/dict instead of List/Dict).

The new structure makes the `diffusers` pipeline components explicit and sets up proper weight routing for multi-stage compilation.
Updates the architecture module name from `z_image` to `z_image_module_v3` to reflect a new version of the image module implementation.

Changes the import statement in the architecture registry to reference the renamed module while maintaining the same `z_image_arch` export.
Removes the `Qwen3Encoder` wrapper class and weight adapter utilities that were part of the Z-Image v3 architecture implementation, as they are no longer needed.

Also updates the `SupportedArchitecture` type annotation to remove the optional `None` case for KV cache strategies, making the supported encodings mapping more strictly typed.
Refactors `ZImageTransformer2DModel` and related components to be graph-compilable by eliminating operations that cause graph tracing issues.

Reimplements `RopeEmbedder` as graph-compilable `nn.Module` that computes embeddings directly from position IDs without precomputation or `F.gather` operations. Moves RoPE embedding to GPU by passing device to constructor.

Replaces manual attention implementation with `flash_attention_gpu` kernel for improved performance and graph compatibility.

Removes dynamic shape operations (`int(tensor.shape)`, list comprehensions on tensor shapes) throughout transformer forward pass by using fixed compilation parameters for batch size 1 at 1024x1024 resolution.

Simplifies `create_coordinate_grid` to explicitly handle 3D grids without list operations, making it graph-traceable.
Introduces a `SECOND_TO_LAST` option to the `ReturnHiddenStates` enum to extract hidden states from the second-to-last transformer layer, matching how `diffusers` uses `transformers` text encoders via `hidden_states[-2]`.

Updates Z-Image pipeline to use native `Qwen3` text encoder instead of custom `Qwen3Encoder`, configured to return second-to-last layer hidden states for improved conditioning behavior aligned with `diffusers` implementation.

Tracks previous layer output (`prev_h`) during transformer forward pass to enable extraction of second-to-last layer states.
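
The bookkeeping is simple; an illustrative sketch (callables stand in for the real embedding, decoder layers, and final norm):

def encode(tokens, embed, layers, final_norm, second_to_last: bool = True):
    """Keep only the previous layer's output so the second-to-last hidden state
    (diffusers-style hidden_states[-2]) can be returned without storing all layers."""
    h = embed(tokens)
    prev_h = h
    for layer in layers:
        prev_h = h
        h = layer(h)
    return prev_h if second_to_last else final_norm(h)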

Passes device context to transformer construction for GPU-accelerated RoPE precomputation.
tolgacangoz force-pushed the integrations/QwenImageEdit2511Pipeline branch from 3384638 to f6ff121 on December 25, 2025 10:10