Unpair-Seg: Open-Vocabulary Segmentation with Unpaired Mask-Text Supervision

Zhaoqing Wang1,4    Xiaobo Xia1    Ziye Chen2   Xiao He4   Yandong Guo4  
Mingming Gong2,3,✉   Tongliang Liu1,✉
1The University of Sydney   2The University of Melbourne    3Mohamed bin Zayed University of Artificial Intelligence   4AI2Robotics
✉ Corresponding authors

Motivation

Figure 1. The proposed Uni-OVSeg framework learns open-vocabulary segmentation with unpaired mask-text supervision. Compared to labour-intensive image-mask-text triplet annotations, independent image-mask and image-text pairs are much easier to collect. With a single suite of weights, given different visual prompts (e.g., points and boxes), Uni-OVSeg can segment and categorise various objects and stuff from an open vocabulary in the real world.

Cutting-edge approaches to open-vocabulary segmentation typically rely on supervision from triplet annotations composed of images, masks, and corresponding texts [1, 2]. Notably, the strict alignment required between each mask and its text makes such annotations expensive to produce. To mitigate this, some weakly-supervised methods propose using only text supervision [3, 4].

However, a model trained with such supervision falls short in capturing complex spatial details, which is suboptimal for dense prediction. Furthermore, this type of supervision lacks positional information, making it difficult for the model to distinguish different instances of the same semantic class. These issues severely limit the versatility and segmentation performance of existing weakly-supervised methods.

As shown in Figure 1, we propose an advanced weakly-supervised open-vocabulary segmentation framework, named Uni-OVSeg, to reduce the annotation expense while significantly enhancing performance.

Method

Figure 2. Overview of the Uni-OVSeg framework. The framework consists of feature extraction, mask generation, and mask-text alignment. A frozen CLIP model and a prompt encoder are used for image/text feature extraction and visual prompt encoding, respectively. We employ a pixel decoder and a mask decoder for binary mask prediction. A mask-text bipartite matching is designed to exploit confident mask-entity pairs. Box visual prompts are omitted for simplicity.

On a macro level, as illustrated in Figure 2, the proposed Uni-OVSeg uses a CLIP model to extract features from both images and text descriptions. With the image-mask pairs, a mask generation branch, consisting of a visual prompt encoder, a pixel decoder, and a mask decoder, is employed to predict a set of binary masks for an input image. With the image-text pairs, mask-text bipartite matching is used to exploit confident pairs between predicted masks and entities in the text descriptions. Afterwards, we adopt a multi-scale feature adapter to enhance the mask-wise visual embeddings, which are then aligned with the associated entity embeddings based on the confident pairs. Finally, we perform open-vocabulary segmentation with the above components. More details can be found in our paper.
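
Conceptually, the mask-text bipartite matching is a one-to-one assignment between mask-wise visual embeddings and entity embeddings driven by their similarity. Below is a minimal sketch of this step, not the exact implementation in the paper; the use of cosine similarity, Hungarian matching, and a fixed confidence threshold are illustrative assumptions.

import torch
from scipy.optimize import linear_sum_assignment

def match_masks_to_entities(mask_emb, entity_emb, conf_thresh=0.5):
    """Assign each text entity to at most one predicted mask.

    mask_emb:   (num_masks, dim)    mask-wise visual embeddings
    entity_emb: (num_entities, dim) entity embeddings from the text description
    Returns a list of (mask_idx, entity_idx) pairs whose similarity exceeds
    conf_thresh, i.e. the "confident" pairs used for mask-text alignment.
    """
    # Cosine similarity between every mask and every entity.
    mask_emb = torch.nn.functional.normalize(mask_emb, dim=-1)
    entity_emb = torch.nn.functional.normalize(entity_emb, dim=-1)
    sim = mask_emb @ entity_emb.t()          # (num_masks, num_entities)

    # Hungarian matching maximises the total similarity of the assignment.
    rows, cols = linear_sum_assignment(sim.cpu().numpy(), maximize=True)

    # Keep only confident pairs for the subsequent alignment loss.
    return [(r, c) for r, c in zip(rows, cols) if sim[r, c] >= conf_thresh]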

Results

1. Open-vocabulary semantic segmentation performance

Table 1. Open-vocabulary semantic segmentation performance. We mainly compare with fully-supervised and weakly-supervised methods. “COCO S.”, “COCO P.”, and “COCO C.” denote the COCO-Stuff, COCO-Panoptic, and COCO-Caption datasets. “O365” denotes the Objects365 dataset. “M. 41M” denotes the merged 41M-image dataset. We report mIoU for all datasets.

As demonstrated in Table 1, we compare our method to previous works across a range of benchmarks, including ADE20K (both the 150- and 847-class variants), PASCAL Context (459- and 59-class variants), PASCAL VOC (20- and 21-class variants), and Cityscapes. Compared to weakly-supervised methods, Uni-OVSeg exhibits remarkable performance improvements across all evaluated datasets. Specifically, on the more challenging PASCAL Context-459 benchmark, Uni-OVSeg not only surpasses its weakly-supervised counterparts but also outperforms cutting-edge fully-supervised methods, e.g., FC-CLIP, indicating a superior capability to categorise a diverse array of semantic classes. Furthermore, on the PASCAL VOC benchmarks (20 and 21 classes), Uni-OVSeg achieves substantial gains of 18.3% and 12.2% mIoU, respectively, over state-of-the-art weakly-supervised methods, which demonstrates that Uni-OVSeg captures fine-grained spatial structures. These results substantially raise the practical applicability of weakly-supervised open-vocabulary segmentation.
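
For reference, mIoU averages the per-class intersection-over-union between predicted and ground-truth label maps. The sketch below is a minimal illustration, assuming integer label maps and an ignore index of 255; it is not tied to any particular benchmark's evaluation code.

import numpy as np

def mean_iou(pred, gt, num_classes, ignore_index=255):
    """Compute mIoU from integer label maps `pred` and `gt` of equal shape."""
    valid = gt != ignore_index
    pred, gt = pred[valid], gt[valid]

    # Confusion matrix: rows are ground-truth classes, columns are predictions.
    conf = np.bincount(gt * num_classes + pred,
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)

    inter = np.diag(conf)
    union = conf.sum(0) + conf.sum(1) - inter
    iou = inter / np.maximum(union, 1)       # avoid division by zero
    return iou[union > 0].mean()             # average over classes that appear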

2. Open-vocabulary panoptic segmentation performance

Table 2. Open-vocabulary panoptic segmentation performance. We mainly compare with fully-supervised and unsupervised methods. “COCO P.” denotes the COCO panoptic dataset. “COCO” denotes the COCO image dataset. “IN 1K” denotes the ImageNet-1K image dataset. We report PQ, SQ, and RQ for all datasets.

As demonstrated in Table 2, for open-vocabulary panoptic segmentation, we evaluate our model zero-shot on the COCO, ADE20K, and Cityscapes datasets. Existing weakly-supervised methods use only text supervision, which makes it challenging to discriminate different instances of the same semantic class. To the best of our knowledge, we are the first to learn open-vocabulary panoptic segmentation with weak supervision. Compared to unsupervised methods, we clearly outperform U2Seg by 1.9% PQ, 1.5% SQ, and 4.4% RQ on the COCO dataset. However, the image-mask pairs we use contain masks at multiple granularities, such as object-level and part-level masks, which differs from the annotation granularity of panoptic segmentation datasets. This discrepancy leads to a number of false positives, limiting our performance.
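
For reference, the reported metrics follow the standard panoptic quality decomposition, where predicted and ground-truth segments are counted as true positives (TP) when their IoU exceeds 0.5:

\[
\mathrm{PQ} \;=\; \underbrace{\frac{\sum_{(p,g) \in \mathit{TP}} \mathrm{IoU}(p, g)}{|\mathit{TP}|}}_{\text{segmentation quality (SQ)}} \;\times\; \underbrace{\frac{|\mathit{TP}|}{|\mathit{TP}| + \tfrac{1}{2}|\mathit{FP}| + \tfrac{1}{2}|\mathit{FN}|}}_{\text{recognition quality (RQ)}}
\]

PQ thus penalises both imprecise masks (low SQ) and spurious or missed segments (low RQ), which is consistent with the false positives mentioned above limiting our scores.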

3. Promptable segmentation performance

Figure 3. Point-promptable segmentation performance. We compare our method with SAM-ViT/L on a wide range of datasets. Given a 20 × 20 point grid as the visual prompt, for each ground-truth mask we select the output mask with the highest IoU against it. We report 1-pt IoU for all datasets.
Figure 4. Box-promptable segmentation performance. We compare our method with SAM-ViT/L on a wide range of datasets. Given a ground-truth box as the visual prompt, for each ground-truth mask we select the output mask with the highest IoU against it. We report 1-pt IoU for all datasets.
Figure 5. Point-promptable segmentation performance. We compare our method with SAM-ViT/L on the SegInW datasets. Given a 20 × 20 point grid as the visual prompt, for each ground-truth mask we select the output mask with the highest IoU against it. We report 1-pt IoU for all datasets.
Figure 6. Box-promptable segmentation performance. We compare our method with SAM-ViT/L on the SegInW datasets. Given a ground-truth box as the visual prompt, for each ground-truth mask we select the output mask with the highest IoU against it. We report 1-pt IoU for all datasets.

To evaluate the segmentation quality of Uni-OVSeg under interactive point and box prompts, we conduct comparative analyses with the SAM-ViT/L model across a range of datasets from diverse domains. For visual prompts, we use a uniform 20 × 20 point grid as the interactive point prompt and the ground-truth bounding boxes as box prompts. Segmentation performance is measured using the 1-pt IoU (Oracle) metric across all datasets; results are reported in Figures 3-6.
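
The oracle metric keeps, for each ground-truth mask, the candidate prediction with the highest IoU and then averages these best scores. The sketch below illustrates that selection; `predict_masks(image, prompts)` is a hypothetical model call (not part of our released API) that returns binary masks for the given point-grid or box prompts.

import numpy as np

def iou(pred, gt):
    # Binary-mask IoU.
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / max(union, 1)

def oracle_iou(image, gt_masks, prompts, predict_masks):
    """Average, over ground-truth masks, of the best-IoU predicted mask."""
    preds = predict_masks(image, prompts)    # e.g., from a 20 x 20 point grid
    scores = [max(iou(p, gt) for p in preds) for gt in gt_masks]
    return float(np.mean(scores))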


BibTeX


@article{wang2024open,
  title={Open-Vocabulary Segmentation with Unpaired Mask-Text Supervision},
  author={Wang, Zhaoqing and Xia, Xiaobo and Chen, Ziye and He, Xiao and Guo, Yandong and Gong, Mingming and Liu, Tongliang},
  journal={arXiv preprint arXiv:2402.08960},
  year={2024}
}