LaVin-DiT

Large Vision Diffusion Transformer

Zhaoqing Wang1,4, Xiaobo Xia2, Runnan Chen1, Dongdong Yu4
Changhu Wang4*, Mingming Gong3*, Tongliang Liu1*
1The University of Sydney, 2National University of Singapore, 3The University of Melbourne, 4AIsphere

Abstract

This paper presents the Large Vision Diffusion Transformer (LaVin-DiT), a scalable and unified foundation model designed to tackle over 20 computer vision tasks within a generative framework. Unlike existing large vision models adapted directly from natural language processing architectures, which rely on less efficient autoregressive techniques and disrupt the spatial relationships essential for vision data, LaVin-DiT introduces key innovations to optimize generative performance for vision tasks. First, to address the high dimensionality of visual data, we incorporate a spatial-temporal variational autoencoder that encodes data into a continuous latent space. Second, for generative modeling, we develop a joint diffusion transformer that progressively produces vision outputs. Third, for unified multi-task training, we adopt in-context learning: input-target pairs serve as task context, guiding the diffusion transformer to align outputs with specific tasks within the latent space. During inference, a task-specific context set together with the test data as a query allows LaVin-DiT to generalize across tasks without fine-tuning. Trained on extensive vision datasets, the model scales from 0.1B to 3.4B parameters, demonstrating substantial scalability and state-of-the-art performance across diverse vision tasks. This work introduces a novel pathway for large vision foundation models and underscores the promising potential of diffusion transformers. The code and models will be open-sourced.

Motivation

Large language models (LLMs) such as GPT and LLaMA have rapidly gained widespread attention and transformed the field, demonstrating a strong capability to handle a wide range of language tasks within a unified framework. This breakthrough of integrating diverse language tasks into a single large model has sparked momentum to develop similar large models for computer vision. The potential to create large vision models (LVMs) capable of generalizing across multiple vision tasks represents a promising step toward a more versatile, scalable, and efficient approach to vision-based AI.

However, constructing LVMs is more complex than constructing LLMs because of the inherently diverse and high-dimensional nature of vision data, as well as the need to handle variations in scale, perspective, and lighting across tasks. To address this problem, recent work has developed a sequential modeling method that learns from purely visual data by representing images, videos, and annotations in a unified "visual sentence" format. This method enables the model to predict sequential vision tokens from a vast dataset, entirely independent of language-based inputs. Although this method has shown promising results on diverse vision tasks, it faces two primary challenges. First, autoregressive sequence modeling is inherently inefficient: it demands token-by-token prediction, which is computationally intensive for high-dimensional vision data. Second, converting vision data into a sequential format disrupts spatial coherence, compromising the spatial dependencies that are crucial for performance on vision tasks.

Comparison of autoregressive and diffusion modeling. (a) In autoregressive modeling, visual data is divided into patches and flattened into a one-dimensional sequence. The model then predicts each token sequentially from left to right and top to bottom, which is computationally intensive for high-dimensional visual data. Tokens marked in red and blue illustrate disrupted spatial dependencies, highlighting the difficulty of preserving spatial coherence. (b) In contrast, diffusion modeling denoises all tokens in parallel across N timesteps, significantly improving computational efficiency and preserving the spatial structure crucial for high-performance vision tasks.

Method

Problem Setting

Computer vision encompasses a wide range of tasks, such as object detection and panoptic segmentation, which are typically handled by specialized models designed for specific input-target mappings. While effective for single tasks, this specialization restricts model adaptability and scalability across multiple tasks or diverse visual data. To overcome this limitation, we aim to design a conditional generative framework that unifies multiple vision tasks within a single cohesive model. Specifically, given a query x (e.g., an image or a video), the framework produces the corresponding prediction ŷ to approximate the target y conditioned on a set of input-target pairs s. These conditioning pairs provide the task definition and guidance, enabling the model to flexibly adapt to different tasks according to the supplied examples. Formally, the objective is to model the conditional distribution p(y|x,s).
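Written out slightly more explicitly (in our own notation, not taken verbatim from the paper), the setting reads as follows; the maximum-likelihood form in the last line is the idealized objective, which LaVin-DiT approximates with a latent-space flow-matching loss described under Training & Inference.

```latex
% Illustrative formalization of the problem setting (notation is ours).
\begin{aligned}
  s &= \{(x_i, y_i)\}_{i=1}^{n}
    && \text{task context: } n \text{ in-context input-target pairs,} \\
  \hat{y} &\sim p_\theta(y \mid x, s)
    && \text{prediction for a query } x \text{ sampled from the learned conditional,} \\
  \theta^\star &= \arg\max_\theta \; \mathbb{E}_{(x,\, y,\, s)}\big[\log p_\theta(y \mid x, s)\big]
    && \text{idealized training objective over sampled tasks and pairs.}
\end{aligned}
```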

Architecture Overview

The proposed Large Vision Diffusion Transformer (LaVin-DiT) framework integrates a spatial-temporal variational autoencoder (ST-VAE) with a joint diffusion transformer to unify multiple vision tasks. Given a vision task, e.g., panoptic segmentation, we first sample a set of input-target pairs as the task definition. This set, together with the query visual data, is then fed into the ST-VAE and encoded into latent representations. The encoded representations are subsequently patchified and unfolded into a sequential format. The set and the input visual data form the conditional latent representation zc, while the target is perturbed with random Gaussian noise, yielding a noisy latent representation zt. Both zc and zt are then fed into the joint diffusion transformer (J-DiT), which denoises zt to recover a clean latent representation within the shared latent space. Lastly, the recovered latent representation is passed through the ST-VAE decoder to reconstruct the target in raw pixel space. A minimal code sketch of this data flow is given below.
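The following sketch illustrates that data flow end to end. The module names (ToySTVAE, ToyJDiT), shapes, and the crude Euler-style update are stand-ins under our own assumptions, not the released architecture; they only show how context and target latents move through the pipeline.

```python
# Illustrative sketch of the LaVin-DiT data flow (toy stand-ins, not the released code).
import torch
import torch.nn as nn

class ToySTVAE(nn.Module):
    """Stand-in for the ST-VAE: maps pixels to a lower-dimensional latent grid and back."""
    def __init__(self, in_ch=3, latent_ch=8, patch=8):
        super().__init__()
        self.enc = nn.Conv2d(in_ch, latent_ch, kernel_size=patch, stride=patch)
        self.dec = nn.ConvTranspose2d(latent_ch, in_ch, kernel_size=patch, stride=patch)
    def encode(self, x):  # (B, 3, H, W) -> (B, C, H/8, W/8)
        return self.enc(x)
    def decode(self, z):  # latent grid back to pixel space
        return self.dec(z)

class ToyJDiT(nn.Module):
    """Stand-in for J-DiT: predicts a denoising direction for noisy target tokens,
    conditioned on the task-context tokens and the timestep."""
    def __init__(self, dim=8):
        super().__init__()
        self.proj = nn.Linear(2 * dim + 1, dim)
    def forward(self, z_c, z_t, t):
        ctx = z_c.mean(dim=1, keepdim=True).expand_as(z_t)       # pooled context per target token
        t_emb = t.view(-1, 1, 1).expand(z_t.shape[0], z_t.shape[1], 1)
        return self.proj(torch.cat([ctx, z_t, t_emb], dim=-1))

def patchify(z):
    """Unfold a latent grid (B, C, h, w) into a token sequence (B, h*w, C)."""
    return z.flatten(2).transpose(1, 2)

def unpatchify(tokens, h, w):
    return tokens.transpose(1, 2).reshape(tokens.shape[0], -1, h, w)

vae, jdit = ToySTVAE(), ToyJDiT()
pairs = torch.randn(4, 3, 64, 64)   # in-context input-target pairs (stacked along the batch)
query = torch.randn(1, 3, 64, 64)   # test input

# Conditional latent z_c: encode context pairs and query, patchify, concatenate into one sequence.
z_c = patchify(vae.encode(torch.cat([pairs, query], dim=0))).flatten(0, 1).unsqueeze(0)
# Noisy target latent z_t: at inference the target starts from pure Gaussian noise in latent space.
z_t = torch.randn(1, 64, 8)

num_steps = 4                        # N denoising steps (tiny N for illustration)
for step in range(num_steps):
    t = torch.full((1,), step / num_steps)
    z_t = z_t + (1.0 / num_steps) * jdit(z_c, z_t, t)   # crude Euler-style denoising update

prediction = vae.decode(unpatchify(z_t, 8, 8))           # reconstruct the target in pixel space
```

In practice J-DiT is a full diffusion transformer, and the denoising loop follows the flow-matching procedures described below.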

Overview of the Large Vision Diffusion Transformer (LaVin-DiT). As shown in panel (a), the model first compresses input visual data from the pixel space into a latent space, where multiple input-target pairs serve as the task context. A target is perturbed with Gaussian noise through a diffusion process. Guided by the task context and the query, the Joint Diffusion Transformer (J-DiT) iteratively denoises this noisy target over N timesteps to recover a clean latent representation. The prediction is then generated via the ST-VAE decoder. Panels (b) and (c) provide architectural details of the ST-VAE and J-DiT, respectively. "Down." and "Up." indicate downsampling and upsampling, respectively. Concatenation is represented by ⊙.

Training & Inference Procedures

We present the algorithmic flow of the proposed LaVin-DiT, which is built upon the flow matching framework. The training and inference procedures are provided in Algorithm 1 and Algorithm 2, respectively; a minimal code sketch of both is given after the algorithm figures.

Algorithm 1: Training procedure of LaVin-DiT.
Algorithm 2: Inference procedure of LaVin-DiT.
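For readers unfamiliar with flow matching, below is a minimal sketch of how such training and sampling loops can be structured. The velocity_model callable (standing in for J-DiT), the linear interpolation path, the Euler integrator, and all hyperparameters are our own assumptions, not the paper's exact algorithms.

```python
# Hedged sketch of flow-matching training and Euler-step inference in latent space.
import torch

def training_step(velocity_model, z_context, z_target, optimizer):
    """One flow-matching training step (cf. Algorithm 1; illustrative only)."""
    b = z_target.shape[0]
    t = torch.rand(b, device=z_target.device)              # timestep ~ U(0, 1)
    noise = torch.randn_like(z_target)                      # pure Gaussian noise at t = 0
    t_ = t.view(b, *([1] * (z_target.dim() - 1)))           # broadcastable timestep
    z_t = (1 - t_) * noise + t_ * z_target                  # linear interpolation path
    v_target = z_target - noise                             # constant target velocity along the path
    v_pred = velocity_model(z_context, z_t, t)
    loss = torch.mean((v_pred - v_target) ** 2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def sample(velocity_model, z_context, target_shape, num_steps=50, device="cpu"):
    """Euler integration from noise to a clean latent (cf. Algorithm 2; illustrative only)."""
    z = torch.randn(target_shape, device=device)            # start from pure noise at t = 0
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((target_shape[0],), i * dt, device=device)
        z = z + dt * velocity_model(z_context, z, t)        # follow the predicted velocity
    return z                                                # approximate clean target latent
```

The ToyJDiT stand-in from the previous sketch uses a compatible (z_context, z_t, t) signature, so the two snippets can be combined for a toy end-to-end run.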

Experiments

Main Results

To assess the effectiveness of our proposed method, we conduct extensive experiments across a broad range of computer vision tasks.

Comparison on foreground segmentation, single object detection, and colorization. For foreground segmentation and single object detection, we report mIoU (higher is better). For colorization, we report LPIPS and MSE (lower is better). Note that foreground segmentation and single object detection are unseen tasks during our training.
Comparison on NYU-v2 depth estimation, surface normal estimation, and ImageNet inpainting. For depth estimation, we report absolute relative difference (AbsRel) and threshold accuracy (δ₁). For surface normal estimation, we report mean angular error (MAE) and angle accuracy within a threshold (<11.25°). We report FID for inpainting. † denotes evaluation of the officially released 7B model.

Scalability & Inference Latency Analysis

To investigate the scalability of the proposed LaVin-DiT, we conduct experiments with three model sizes: 0.1B, 1.0B, and 3.4B parameters. In addition, we compare the inference latency of LaVin-DiT and LVM (both 7B models) across increasing resolutions, demonstrating that our method is consistently more efficient.


Left & Center: Training loss curves and performance comparison for LaVin-DiT at varying model sizes. The 3.4B model converges faster, achieving lower training losses than the smaller models; the center panel compares models of different sizes on colorization (MSE) and depth estimation (AbsRel).
Right: Inference latency comparison. LaVin-DiT consistently achieves lower latency than LVM across different resolutions, as tested on an A100-80G GPU with 8 input-target pairs.

Effect of Task Context Length

In-context learning enables the model to adapt to new tasks using a few examples, with performance generally improving as more examples are provided. We investigate this by assessing the effect of task context length across ten downstream tasks.

Effect of task context length. A longer task context consistently improves performance on downstream tasks.

Visualization Results

We showcase visualization results across various tasks. Additional results can be found in our paper.

Citation


@article{wang2024lavin,
  title={LaVin-DiT: Large Vision Diffusion Transformer},
  author={Wang, Zhaoqing and Xia, Xiaobo and Chen, Runnan and Yu, Dongdong and Wang, Changhu and Gong, Mingming and Liu, Tongliang},
  journal={arXiv preprint arXiv:2411.11505},
  year={2024}
}