Labeling Indoor Scenes with Fusion of
Out-of-the-Box Perception Models


Yimeng Li*
Navid Rajabi*
Sulabh Shrestha
Md Alimoor Reza
Jana Kosecka
George Mason University, Drake University

WACV 2024 2nd Workshop on Pretraining

[Paper]
[Code]


Starting from an RGB-D dataset, we propose a labeling approach for generating semantic segmentation annotations. On top of the semantic segmentation results, we additionally demonstrate two downstream tasks. We build top-down-view semantic maps and use them for zero-shot semantic-goal navigation, and we propose an object part segmentation task for the 'cabinet handle', which is relevant to robot mobile manipulation.


Image annotation is a critical and often the most time-consuming step in training and evaluating object detection and semantic segmentation models. Deploying existing models in novel environments often requires detecting novel semantic classes not present in the training data. Furthermore, indoor scenes contain significant viewpoint variations, which need to be handled properly by trained perception models. We propose to leverage recent advancements in state-of-the-art models for bottom-up segmentation (SAM), object detection (Detic), and semantic segmentation (MaskFormer), all trained on large-scale datasets. We aim to develop a cost-effective labeling approach that obtains pseudo-labels for semantic segmentation and object instance detection in indoor environments, with the ultimate goal of facilitating the training of lightweight models for various downstream tasks. We also propose a multi-view labeling fusion stage, which considers the setting where multiple views of the scene are available and can be used to identify and rectify single-view inconsistencies. We demonstrate the effectiveness of the proposed approach on the Active Vision dataset and the ADE20K dataset, and we evaluate the quality of our labeling process by comparing it with human annotations. We further demonstrate the usefulness of the obtained labels in downstream tasks such as object goal navigation and part discovery. In the context of object goal navigation, we show improved performance of this fusion approach compared to a zero-shot baseline that utilizes large monolithic vision-language pre-trained models.


Single-View Labeling

This diagram gives an overview of our labeling approach at the single-view labeling stage. Given an input image, we generate masks for foreground classes using Detic and SAM at steps (A) and (B). If manual bounding boxes are available for any classes, we generate masks for them using SAM at steps (C) and (D). We generate masks for the entire image using MaskFormer and SAM at steps (E) and (F). We overlay the foreground class masks on top of the entire image's mask to obtain the semantic segmentation and instance segmentation annotations.
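As an illustration of the final overlay step, here is a minimal sketch with hypothetical inputs (background_sem from the MaskFormer+SAM branch, fg_masks and fg_labels from the Detic/SAM and manual-box branches); it is not the exact implementation:

    import numpy as np

    def fuse_single_view(background_sem, fg_masks, fg_labels):
        """Overlay foreground instance masks on the full-image semantic map.

        background_sem : (H, W) int array of class ids (steps E-F).
        fg_masks       : list of (H, W) boolean masks (steps A-D).
        fg_labels      : list of class ids, one per foreground mask.
        """
        semantic = background_sem.copy()
        instance = np.zeros_like(background_sem)
        for inst_id, (mask, label) in enumerate(zip(fg_masks, fg_labels), start=1):
            semantic[mask] = label    # foreground classes override the background prediction
            instance[mask] = inst_id  # one instance id per foreground mask
        return semantic, instance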

MaskFormer and SAM

An example of labeling using semantic segmentation (MaskFormer) and SAM. MaskFormer produces good predictions for background classes but performs noticeably worse on foreground classes: the coffee machine is misclassified as a 'stove' (black bounding box), and the cooking pot (cyan bounding box) is missed entirely.
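For reference, a rough sketch of running the MaskFormer branch, assuming the Hugging Face transformers port and the ADE20K Swin-L checkpoint (the actual pipeline may use a different implementation):

    from PIL import Image
    import torch
    from transformers import MaskFormerImageProcessor, MaskFormerForInstanceSegmentation

    # ADE20K-trained MaskFormer; the checkpoint name is an assumption.
    ckpt = "facebook/maskformer-swin-large-ade"
    processor = MaskFormerImageProcessor.from_pretrained(ckpt)
    model = MaskFormerForInstanceSegmentation.from_pretrained(ckpt).eval()

    image = Image.open("frame.png")  # hypothetical input frame
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # (H, W) tensor of ADE20K class ids, used as the background semantic map.
    semantic_map = processor.post_process_semantic_segmentation(
        outputs, target_sizes=[image.size[::-1]])[0]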

Object Detector (Detic) and SAM

We employ Detic with the LVIS vocabulary, which includes 1203 classes and covers many common indoor object classes. We use SAM to generate high-quality masks, prompting it with the detected bounding boxes. In practice, we prompt SAM with each bounding box together with a point at the centroid of the corresponding Detic mask, leading to the high-quality object instance segmentation shown in the figure.
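A minimal sketch of this box-plus-centroid prompting with the segment_anything SamPredictor interface; rgb_image, detic_boxes (xyxy), and detic_masks are assumed outputs of the detection step:

    import numpy as np
    from segment_anything import sam_model_registry, SamPredictor

    # Load SAM; the ViT-H checkpoint path is an assumption.
    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
    predictor = SamPredictor(sam)
    predictor.set_image(rgb_image)  # (H, W, 3) uint8 RGB image

    refined_masks = []
    for box, coarse_mask in zip(detic_boxes, detic_masks):
        ys, xs = np.nonzero(coarse_mask)
        centroid = np.array([[xs.mean(), ys.mean()]])  # (1, 2) point in (x, y)
        masks, scores, _ = predictor.predict(
            point_coords=centroid,
            point_labels=np.array([1]),  # 1 marks a foreground point
            box=np.asarray(box),         # length-4 xyxy box prompt
            multimask_output=False,
        )
        refined_masks.append(masks[0])   # (H, W) boolean mask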

Multiview Verification

An example of Multiview Verification. The refrigerator (cyan bounding box) in view Ak is initially labeled as 'wall', and the mask for some of its parts is missing. The error is resolved by fusing labels from views Am and An, which carry the correct annotation for the class.
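The verification can be illustrated with a simplified majority vote: per-view labels are lifted into 3D with depth and camera poses, accumulated on a coarse voxel grid, and the winning label can then be projected back into each view. This is a sketch under assumed pinhole intrinsics K and camera-to-world poses, not the exact fusion rule from the paper:

    import numpy as np
    from collections import Counter

    def backproject(depth, K, T_cam_to_world):
        """Lift a depth image to world-frame 3D points, one per pixel."""
        H, W = depth.shape
        u, v = np.meshgrid(np.arange(W), np.arange(H))
        x = (u - K[0, 2]) * depth / K[0, 0]
        y = (v - K[1, 2]) * depth / K[1, 1]
        pts = np.stack([x, y, depth, np.ones_like(depth)], axis=-1).reshape(-1, 4)
        return (T_cam_to_world @ pts.T).T[:, :3]

    def vote_labels(views, voxel=0.05):
        """views: list of (depth, K, pose, label_map) tuples; majority vote per voxel."""
        votes = {}
        for depth, K, pose, labels in views:
            keys = np.floor(backproject(depth, K, pose) / voxel).astype(int)
            for key, lab in zip(map(tuple, keys), labels.reshape(-1)):
                votes.setdefault(key, Counter())[lab] += 1
        return {k: c.most_common(1)[0][0] for k, c in votes.items()}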

Object Part Discovery

This diagram gives an overview of labeling the 'cabinet handle' part. We select SAM segments within the detected 'cabinet handle' bounding box. Then, we extract ResNet50 features for these segments, cluster them with KMeans, and manually label the cluster containing the 'cabinet handle' points. We backproject the labels onto the original image to obtain the 'cabinet handle' annotations.
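A sketch of the clustering step, assuming ImageNet-pretrained ResNet-50 features pooled per segment and a hypothetical number of clusters; rgb_image and segments (SAM masks inside a detected 'cabinet handle' box) are assumed inputs:

    import numpy as np
    import torch
    from torchvision import models, transforms
    from sklearn.cluster import KMeans

    # ResNet-50 backbone without the classification head (2048-d pooled features).
    backbone = torch.nn.Sequential(
        *list(models.resnet50(weights="IMAGENET1K_V2").children())[:-1]).eval()
    preprocess = transforms.Compose([
        transforms.ToTensor(),
        transforms.Resize((224, 224), antialias=True),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
    ])

    def segment_feature(rgb, mask):
        """Crop the segment's bounding box, zero the background, and embed it."""
        ys, xs = np.nonzero(mask)
        crop = rgb[ys.min():ys.max() + 1, xs.min():xs.max() + 1].copy()
        crop[~mask[ys.min():ys.max() + 1, xs.min():xs.max() + 1]] = 0
        with torch.no_grad():
            return backbone(preprocess(crop).unsqueeze(0)).flatten().numpy()

    features = np.stack([segment_feature(rgb_image, m) for m in segments])
    cluster_ids = KMeans(n_clusters=5, n_init=10).fit_predict(features)
    # The cluster containing handle segments is then labeled manually as 'cabinet handle'.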

Results: Object Goal Navigation

A qualitative result of an episode for navigating to a kitchen sink in Home-005. Our approach successfully drives the agent to the sink, while the VL-Map approach fails to reach the target object.
A qualitative result of an episode for navigating to a flowerpot in Home-006. Both approaches successfully reach the target object.


Citation

@inproceedings{li2024labeling,
  title={Labeling Indoor Scenes with Fusion of Out-of-the-Box Perception Models},
  author={Li, Yimeng and Rajabi, Navid and Shrestha, Sulabh and Reza, Md Alimoor and Ko{\v{s}}eck{\'a}, Jana},
  booktitle={Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision},
  pages={578--587},
  year={2024}
}
                        


Acknowledgements

We thank members of the GMU Vision and Robotics Lab.
This webpage template was borrowed from some colorful folks.