Large language models (LLMs) such as GPT and LLaMA have rapidly gained widespread attention and transformed the field, demonstrating a strong capability to handle a wide range of language tasks within a unified framework. This breakthrough of integrating diverse language tasks into a single large model has sparked momentum toward developing similar large models for computer vision. The potential to create large vision models (LVMs) capable of generalizing across multiple vision tasks represents a promising step toward a more versatile, scalable, and efficient approach to vision-based AI.
However, constructing LVMs is more complex than building LLMs because vision data are inherently diverse and high-dimensional, and models must handle variations in scale, perspective, and lighting across tasks. To address this, recent work has developed a sequential modeling method that learns from vision data alone by representing images, videos, and annotations in a unified "visual sentence" format. This method enables the model to predict sequential vision tokens from a vast dataset, entirely independent of language-based inputs. Although it has shown promising results on diverse vision tasks, it faces two primary challenges. First, autoregressive sequence modeling is inherently inefficient: it demands token-by-token prediction, which is computationally intensive for high-dimensional vision data. Second, converting vision data into a sequential format disrupts spatial coherence, compromising the spatial dependencies that are crucial for performance in vision tasks.
Computer vision comprises a series of tasks, such as object detection and panoptic segmentation, which are typically handled by specialized models designed for specific input-target mappings. While effective for single tasks, this specialization restricts model adaptability and scalability across multiple tasks or diverse visual data. To overcome this limitation, we aim to design a conditional generative framework that unifies multiple vision tasks within a single cohesive model. Specifically, given a query x (e.g., an image or a video), the framework produces the corresponding prediction ŷ to approximate the target y, conditioned on a set of input-target pairs s. These conditioning pairs provide task definitions and guidance, enabling the model to flexibly adapt to different tasks according to the supplied examples. Formally, the objective is to model the conditional distribution p(y|x, s), as summarized below.
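In compact notation, the setup above can be summarized as follows (an illustrative rendering of the description; the symbol n for the number of conditioning pairs is our own):

```latex
% s is a set of n input-target pairs that defines the task;
% the model p_theta predicts the target y for a query x conditioned on s.
\begin{equation}
  s = \{(x_i, y_i)\}_{i=1}^{n}, \qquad \hat{y} \sim p_\theta(y \mid x, s).
\end{equation}
% In LaVin-DiT, this conditional distribution is learned with a flow matching
% objective in a shared latent space (see the algorithm flows below).
```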
The proposed Large Vision Diffusion Transformer (LaVin-DiT) framework integrates a spatial-temporal variational autoencoder (ST-VAE) with a joint diffusion transformer to unify multiple vision tasks. Given a vision task, e.g., panoptic segmentation, we first sample a set of input-target pairs as the task definition. Afterward, the sampled set and the remaining visual data are fed into the ST-VAE and encoded into latent representations. Subsequently, the encoded representations are patchified and unfolded into a sequential format. The sampled set and the input visual data form the conditional latent representation z_c, while the target is perturbed with random Gaussian noise, yielding a noisy latent representation z_t. Both z_c and z_t are then fed into the joint diffusion transformer (J-DiT), which denoises z_t to recover a clean latent representation within the shared latent space. Lastly, the recovered latent representation is passed through the ST-VAE decoder to reconstruct the target in raw pixel space.
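The sketch below traces this data flow in PyTorch on toy tensors. It is a minimal illustration, not the released implementation: the module definitions (TinySTVAE, TinyJDiT), the additive-noise perturbation, and all shapes are stand-ins of our own, and the real ST-VAE and J-DiT are far larger and differ in detail.

```python
# Minimal sketch of the LaVin-DiT data flow described above (toy stand-in modules).
import torch
import torch.nn as nn

class TinySTVAE(nn.Module):
    """Stand-in spatial-temporal VAE: maps video-like tensors to a compact latent and back."""
    def __init__(self, in_ch=3, latent_ch=8):
        super().__init__()
        self.enc = nn.Conv3d(in_ch, latent_ch, kernel_size=4, stride=4)        # downsample T, H, W by 4
        self.dec = nn.ConvTranspose3d(latent_ch, in_ch, kernel_size=4, stride=4)

    def encode(self, x):            # x: (B, C, T, H, W)
        return self.enc(x)          # (B, latent_ch, T/4, H/4, W/4)

    def decode(self, z):
        return self.dec(z)

class TinyJDiT(nn.Module):
    """Stand-in joint diffusion transformer over patchified latent tokens."""
    def __init__(self, latent_ch=8, dim=64):
        super().__init__()
        self.proj_in = nn.Linear(latent_ch, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.proj_out = nn.Linear(dim, latent_ch)

    def forward(self, z_c_tokens, z_t_tokens):
        # Condition tokens and noisy target tokens are processed jointly;
        # only the target positions are returned as the denoised prediction.
        tokens = torch.cat([z_c_tokens, z_t_tokens], dim=1)
        out = self.proj_out(self.blocks(self.proj_in(tokens)))
        return out[:, z_c_tokens.shape[1]:]

def patchify(z):
    """Unfold a latent (B, C, T, H, W) into a token sequence (B, T*H*W, C)."""
    B, C, T, H, W = z.shape
    return z.permute(0, 2, 3, 4, 1).reshape(B, T * H * W, C)

# Toy example: one task-defining pair as context, one query, one target.
vae, dit = TinySTVAE(), TinyJDiT()
context = torch.randn(1, 3, 4, 32, 32)
query   = torch.randn(1, 3, 4, 32, 32)
target  = torch.randn(1, 3, 4, 32, 32)

z_c = patchify(torch.cat([vae.encode(context), vae.encode(query)], dim=2))   # conditional latent tokens
z_target = vae.encode(target)
z_t = patchify(z_target + torch.randn_like(z_target))                        # noisy target latent tokens

pred_tokens = dit(z_c, z_t)                                                  # denoised target tokens
pred_latent = pred_tokens.reshape(1, 1, 8, 8, 8).permute(0, 4, 1, 2, 3)      # back to (B, C, T, H, W)
recon = vae.decode(pred_latent)                                              # target in raw pixel space
print(recon.shape)                                                           # torch.Size([1, 3, 4, 32, 32])
```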
We present the algorithm flows of the proposed LaVin-DiT, which is built upon the flow matching framework. The training and inference procedures are provided in Algorithm 1 and Algorithm 2, respectively.
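For intuition, here is a minimal, self-contained flow matching sketch on toy latent vectors. It is our own simplification rather than Algorithms 1 and 2: the `velocity_model` MLP stands in for the conditional J-DiT (conditioning on z_c is omitted), and the linear noise-to-data interpolation with Euler sampling is one common flow matching recipe assumed here.

```python
# Toy flow matching: regress the velocity along a straight noise-to-data path, then integrate it.
import torch
import torch.nn as nn

dim = 8
velocity_model = nn.Sequential(nn.Linear(dim + 1, 64), nn.SiLU(), nn.Linear(64, dim))  # stand-in for J-DiT

def training_step(z1, optimizer):
    """One training step: predict the constant velocity of the path z0 -> z1."""
    z0 = torch.randn_like(z1)                          # Gaussian noise endpoint
    t = torch.rand(z1.shape[0], 1)                     # random time in [0, 1]
    zt = (1 - t) * z0 + t * z1                         # linear interpolation between noise and data
    target_v = z1 - z0                                 # velocity of the straight path
    pred_v = velocity_model(torch.cat([zt, t], dim=-1))
    loss = ((pred_v - target_v) ** 2).mean()
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

@torch.no_grad()
def sample(n, steps=50):
    """Inference: Euler integration of the learned velocity field from noise (t=0) to data (t=1)."""
    z = torch.randn(n, dim)
    for i in range(steps):
        t = torch.full((n, 1), i / steps)
        z = z + velocity_model(torch.cat([z, t], dim=-1)) / steps
    return z

opt = torch.optim.Adam(velocity_model.parameters(), lr=1e-3)
data = torch.randn(256, dim) * 0.5 + 2.0               # toy "clean latent" dataset
for _ in range(200):
    idx = torch.randint(0, data.shape[0], (32,))
    training_step(data[idx], opt)
print(sample(4))                                        # samples drawn from the learned distribution
```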
To assess the effectiveness of our proposed method, we conduct extensive experiments across a broad range of computer vision tasks.
To investigate the scalability of the proposed LaVin-DiT, we conduct experiments with three model sizes: 0.1B, 1.0B, and 3.4B parameters. In addition, we compare the inference latency of LaVin-DiT and LVM (both 7B models) across increasing resolutions, demonstrating that our method is consistently more efficient.
Left & Center: Training loss curves and performance comparison for LaVin-DiT at varying model sizes. The 3.4B model converges faster and reaches lower training losses than the smaller models; the center panel compares LaVin-DiT variants of different sizes on colorization (MSE) and depth estimation (AbsRel).
Right: Inference latency comparison. LaVin-DiT consistently achieves lower latency than LVM across different resolutions, as tested on an A100-80G GPU with 8 input-target pairs.
In-context learning enables the model to adapt to new tasks using a few examples, with performance generally improving as more examples are provided. We investigate this by assessing the effect of task context length across ten downstream tasks.
We showcase visualization results across various tasks. Additional results can be found in our paper.
Tasks visualized: Inpainting, Colorization, De-raining, Panoptic Segmentation, Pose Estimation, Pose-to-Image Generation, Depth Estimation, Depth-to-Image Generation, Surface Normal Estimation, Normal-to-Image Generation.
@article{wang2024lavin,
title={LaVin-DiT: Large Vision Diffusion Transformer},
author={Wang, Zhaoqing and Xia, Xiaobo and Chen, Runnan and Yu, Dongdong and Wang, Changhu and Gong, Mingming and Liu, Tongliang},
journal={arXiv preprint arXiv:2411.11505},
year={2024}
}