Unpair-Seg: Open-Vocabulary Segmentation with Unpaired Mask-Text Supervision

Zhaoqing Wang1,4    Xiaobo Xia1    Ziye Chen2   Xiao He4   Yandong Guo4  
Mingming Gong2,3,✉   Tongliang Liu1,✉
1The University of Sydney   2The University of Melbourne    3Mohamed bin Zayed University of Artificial Intelligence   4AI2Robotics
✉ Corresponding authors

Motivation

Figure 1. The proposed Uni-OVSeg framework learns open-vocabulary segmentation with unpaired mask-text supervision. Compared to labour-intensive image-mask-text triplet annotations, independent image-mask and image-text pairs are much easier to collect. With a single suite of weights, given different visual prompts (e.g., points and boxes), Uni-OVSeg can segment and categorise various objects and stuff from an open vocabulary in the real world.

Cutting-edge approaches to open-vocabulary segmentation typically rely on supervision from triplet annotations composed of images, masks, and corresponding texts [1, 2]. Notably, the strict alignment required between each mask and its text makes such annotations expensive to produce. To mitigate this, some weakly-supervised methods propose using only text supervision [3, 4].

However, a model trained with such supervision falls short in capturing complex spatial details, which is suboptimal for dense prediction. Furthermore, this type of supervision lacks positional information, making it difficult for the model to distinguish different instances of the same semantic class. These issues severely limit the versatility and segmentation performance of existing weakly-supervised methods.

As shown in Figure 1, we propose an advanced weakly-supervised open-vocabulary segmentation framework, named Uni-OVSeg, to reduce the annotation expense while significantly enhancing performance.

Method

Figure 2. Overview of the Uni-OVSeg framework. The framework consists of feature extraction, mask generation, and mask-text alignment. A frozen CLIP model and a prompt encoder are used for image/text feature extraction and visual prompt encoding, respectively. We employ a pixel decoder and a mask decoder for binary mask prediction. A mask-text bipartite matching is designed to exploit confident mask-entity pairs. Box visual prompts are omitted for simplicity.

On a macro level, as illustrated in Figure 2, the proposed Uni-OVSeg uses a CLIP model to extract features from both images and text descriptions. With the image-mask pairs, a mask generation branch, consisting of a visual prompt encoder, a pixel decoder, and a mask decoder, is employed to predict a set of binary masks for an input image. With the image-text pairs, mask-text bipartite matching is used to exploit confident pairs between predicted masks and entities in the text descriptions. Afterwards, we adopt a multi-scale feature adapter to enhance the mask-wise visual embeddings, which are then aligned with the associated entity embeddings based on the confident pairs. Finally, we perform open-vocabulary segmentation with the above components. More details can be found in our paper.
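
Conceptually, the mask-text bipartite matching is a one-to-one assignment between mask-wise visual embeddings and entity embeddings driven by their similarity. Below is a minimal sketch of this step, not the exact implementation in the paper; the use of cosine similarity, Hungarian matching, and a fixed confidence threshold are illustrative assumptions.

import torch
from scipy.optimize import linear_sum_assignment

def match_masks_to_entities(mask_emb, entity_emb, conf_thresh=0.5):
    """Assign each text entity to at most one predicted mask.

    mask_emb:   (num_masks, dim)    mask-wise visual embeddings
    entity_emb: (num_entities, dim) entity embeddings from the text description
    Returns a list of (mask_idx, entity_idx) pairs whose similarity exceeds
    conf_thresh, i.e. the "confident" pairs used for mask-text alignment.
    """
    # Cosine similarity between every mask and every entity.
    mask_emb = torch.nn.functional.normalize(mask_emb, dim=-1)
    entity_emb = torch.nn.functional.normalize(entity_emb, dim=-1)
    sim = mask_emb @ entity_emb.t()          # (num_masks, num_entities)

    # Hungarian matching maximises the total similarity of the assignment.
    rows, cols = linear_sum_assignment(sim.cpu().numpy(), maximize=True)

    # Keep only confident pairs for the subsequent alignment loss.
    return [(r, c) for r, c in zip(rows, cols) if sim[r, c] >= conf_thresh]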

Results

1. Open-vocabulary semantic segmentation performance

Table 1. Open-vocabulary semantic segmentation performance. We mainly compare with fully-supervised and weakly-supervised methods. “COCO S.”, “COCO P.”, and “COCO C.” denote the COCO-Stuff, COCO-Panoptic, and COCO-Caption datasets. “O365” denotes the Objects365 dataset. “M. 41M” denotes the merged 41M-image dataset. We report mIoU for all datasets.

As demonstrated in Table 1, we compare our method to previous works across a range of benchmarks, including ADE20K (both the 150- and 847-class variants), PASCAL Context (459- and 59-class variants), PASCAL VOC (20- and 21-class variants), and Cityscapes. Compared to weakly-supervised methods, Uni-OVSeg exhibits remarkable performance improvements across all evaluated datasets. Specifically, on the more challenging PASCAL Context-459 benchmark, Uni-OVSeg not only surpasses its weakly-supervised counterparts but also outperforms cutting-edge fully-supervised methods, e.g., FC-CLIP, indicating a superior capability to categorise a diverse array of semantic classes. Furthermore, on the PASCAL VOC benchmarks (20 and 21 classes), Uni-OVSeg achieves substantial gains of 18.3% and 12.2% mIoU, respectively, over state-of-the-art weakly-supervised methods, which demonstrates that Uni-OVSeg captures fine-grained spatial structures. These results substantially raise the practical applicability of weakly-supervised open-vocabulary segmentation.
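
For reference, mIoU averages the per-class intersection-over-union between predicted and ground-truth label maps. The sketch below is a minimal illustration, assuming integer label maps and an ignore index of 255; it is not tied to any particular benchmark's evaluation code.

import numpy as np

def mean_iou(pred, gt, num_classes, ignore_index=255):
    """Compute mIoU from integer label maps `pred` and `gt` of equal shape."""
    valid = gt != ignore_index
    pred, gt = pred[valid], gt[valid]

    # Confusion matrix: rows are ground-truth classes, columns are predictions.
    conf = np.bincount(gt * num_classes + pred,
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)

    inter = np.diag(conf)
    union = conf.sum(0) + conf.sum(1) - inter
    iou = inter / np.maximum(union, 1)       # avoid division by zero
    return iou[union > 0].mean()             # average over classes that appear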

2. Open-vocabulary panoptic segmentation performance

Table 2. Open-vocabulary panoptic segmentation performance. We mainly compare with fully-supervised and unsupervised methods. “COCO P.” denotes the COCO panoptic dataset. “COCO” denotes the COCO image dataset. “IN 1K” denotes the ImageNet-1K image dataset. We report PQ, SQ, and RQ for all datasets.

As demonstrated in Table 2, for open-vocabulary panoptic segmentation, we evaluate our model zero-shot on the COCO, ADE20K, and Cityscapes datasets. Existing weakly-supervised methods use only text supervision, which makes it challenging to discriminate different instances of the same semantic class. To the best of our knowledge, we are the first to learn open-vocabulary panoptic segmentation with weak supervision. Compared to unsupervised methods, we clearly outperform U2Seg by 1.9% PQ, 1.5% SQ, and 4.4% RQ on the COCO dataset. However, the image-mask pairs we use contain masks at multiple granularities, such as object-level and part-level masks, which differs from the annotation granularity of panoptic segmentation datasets. This discrepancy leads to a number of false positives, limiting our performance.
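
For reference, the reported metrics follow the standard panoptic quality decomposition, where predicted and ground-truth segments are counted as true positives (TP) when their IoU exceeds 0.5:

\[
\mathrm{PQ} \;=\; \underbrace{\frac{\sum_{(p,g) \in \mathit{TP}} \mathrm{IoU}(p, g)}{|\mathit{TP}|}}_{\text{segmentation quality (SQ)}} \;\times\; \underbrace{\frac{|\mathit{TP}|}{|\mathit{TP}| + \tfrac{1}{2}|\mathit{FP}| + \tfrac{1}{2}|\mathit{FN}|}}_{\text{recognition quality (RQ)}}
\]

PQ thus penalises both imprecise masks (low SQ) and spurious or missed segments (low RQ), which is consistent with the false positives mentioned above limiting our scores.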

3. Promptable segmentation performance

Figure 3. Point-promptable segmentation performance. We compare our method with SAM-ViT/L on a wide range of datasets. Given a 20 × 20 point grid as the visual prompt, for each ground-truth mask we select the output mask with the highest IoU against it. We report 1-pt IoU for all datasets.
Figure 4. Box-promptable segmentation performance. We compare our method with SAM-ViT/L on a wide range of datasets. Given a ground-truth box as the visual prompt, for each ground-truth mask we select the output mask with the highest IoU against it. We report 1-pt IoU for all datasets.
Figure 5. Point-promptable segmentation performance. We compare our method with SAM-ViT/L on the SegInW datasets. Given a 20 × 20 point grid as the visual prompt, for each ground-truth mask we select the output mask with the highest IoU against it. We report 1-pt IoU for all datasets.
Figure 6. Box-promptable segmentation performance. We compare our method with SAM-ViT/L on the SegInW datasets. Given a ground-truth box as the visual prompt, for each ground-truth mask we select the output mask with the highest IoU against it. We report 1-pt IoU for all datasets.

To evaluate the segmentation quality of Uni-OVSeg under interactive point and box prompts, we conduct comparative analyses with the SAM-ViT/L model across a range of datasets from diverse domains. For visual prompts, we use a uniform 20 × 20 point grid as the interactive point prompt and the ground-truth bounding boxes as box prompts. Segmentation performance is measured using the 1-pt IoU (Oracle) metric across all datasets; results are reported in Figures 3-6.
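
The oracle metric keeps, for each ground-truth mask, the candidate prediction with the highest IoU and then averages these best scores. The sketch below illustrates that selection; `predict_masks(image, prompts)` is a hypothetical model call (not part of our released API) that returns binary masks for the given point-grid or box prompts.

import numpy as np

def iou(pred, gt):
    # Binary-mask IoU.
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / max(union, 1)

def oracle_iou(image, gt_masks, prompts, predict_masks):
    """Average, over ground-truth masks, of the best-IoU predicted mask."""
    preds = predict_masks(image, prompts)    # e.g., from a 20 x 20 point grid
    scores = [max(iou(p, gt) for p in preds) for gt in gt_masks]
    return float(np.mean(scores))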


BibTeX


@article{wang2024open,
  title={Open-Vocabulary Segmentation with Unpaired Mask-Text Supervision},
  author={Wang, Zhaoqing and Xia, Xiaobo and Chen, Ziye and He, Xiao and Guo, Yandong and Gong, Mingming and Liu, Tongliang},
  journal={arXiv preprint arXiv:2402.08960},
  year={2024}
}