SAM2Object: Consolidating View Consistency via SAM2 for Zero-Shot 3D Instance Segmentation

CVPR 2025

Jihuai Zhao, Junbao Zhuo^*, Jiansheng Chen, Huimin Ma^*

University of Science and Technology Beijing

^* Corresponding authors.

Abstract

In the field of zero-shot 3D instance segmentation, existing 2D-to-3D lifting methods typically obtain 2D segmentations across multiple RGB frames using vision foundation models and then project and merge them into 3D space. However, because these models process each frame independently of its adjacent frames, the masks of the same object may vary across frames, leading to a lack of view consistency in the 2D segmentation. Furthermore, current lifting methods average the 2D segmentations from multiple views during the projection into 3D space, so low-quality and high-quality masks receive the same weight. These factors can lead to fragmented 3D segmentation. In this paper, we present SAM2Object, a novel zero-shot 3D instance segmentation method that effectively utilizes the Segment Anything Model 2 (SAM2) to segment and track objects, consolidating view consistency across frames. Our approach combines these consistent 2D masks with 3D geometric priors, improving the robustness of 3D segmentation. Additionally, we introduce a mask consolidation module to filter out low-quality masks across frames, which enables more precise 3D-to-2D matching. Comprehensive evaluations on ScanNetV2, ScanNet++ and ScanNet200 demonstrate the robustness and effectiveness of SAM2Object, showcasing its ability to outperform previous methods.

Method

Our method consists of six steps. First, we over-segment the 3D mesh or point cloud into superpoints. (a) We extract keyframes from the posed images. (b) We use SAM2 to segment the keyframes, and the resulting segmentations serve as mask prompts for SAM2 to track objects throughout the entire set of posed images. (c) We check the consistency of the masks obtained by SAM2 and filter out low-quality mask proposals. (d) We combine the remaining masks with 3D geometric priors to construct a graph whose nodes are superpoints and whose edges represent the affinity scores between the superpoints' masks after consistency consolidation. (e) We perform iterative graph clustering on this graph to obtain the 3D instance segmentation.
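Steps (d) and (e) can be illustrated with a minimal sketch. It is not the authors' implementation: the weighted cosine affinity, the threshold schedule, the union-find merging, and all names (`cover`, `adjacency`, `weights`, `cluster_superpoints`) are assumptions chosen for illustration. The sketch assumes that each superpoint has already been projected into the frames, giving a superpoint-by-mask coverage matrix, and that the consolidation step has produced a per-mask quality weight, with 0 meaning the mask was filtered out.

```python
import numpy as np


def affinity(cover, weights):
    """Weighted cosine similarity between rows of `cover` (regions x masks)."""
    w = cover * np.sqrt(weights)[None, :]
    norms = np.linalg.norm(w, axis=1, keepdims=True)
    unit = np.divide(w, norms, out=np.zeros_like(w), where=norms > 0)
    return unit @ unit.T


def find(parent, i):
    """Union-find root lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def cluster_superpoints(cover, adjacency, weights, thresholds=(0.9, 0.7, 0.5)):
    """Iteratively merge adjacent superpoints whose mask affinity is high.

    cover:     (S, M) float, share of superpoint s lying inside tracked mask m
    adjacency: (S, S) bool, True where two superpoints touch in 3D
    weights:   (M,) float, hypothetical per-mask quality (0 drops the mask)
    Returns an (S,) integer array of instance labels in 0..K-1.
    """
    labels = np.arange(cover.shape[0])
    adj = adjacency.astype(int)
    for t in thresholds:  # strict threshold first, then progressively looser
        ids = np.unique(labels)
        # member[r, s] is True if superpoint s currently belongs to region ids[r]
        member = labels[None, :] == ids[:, None]
        m_int = member.astype(int)
        # Region features: mean coverage of the superpoints in each region
        region_cover = (member @ cover) / member.sum(axis=1, keepdims=True)
        # Two regions are adjacent if any of their superpoints touch
        region_adj = (m_int @ adj @ m_int.T) > 0
        aff = affinity(region_cover, weights)
        parent = np.arange(len(ids))
        candidates = np.triu(region_adj & (aff >= t), k=1)
        for i, j in zip(*np.nonzero(candidates)):
            parent[find(parent, i)] = find(parent, j)
        roots = np.array([find(parent, i) for i in range(len(ids))])
        labels = ids[roots][np.searchsorted(ids, labels)]
    # Relabel to consecutive integers
    _, labels = np.unique(labels, return_inverse=True)
    return labels


if __name__ == "__main__":
    # Four superpoints in a chain; masks 0 and 1 are two objects,
    # mask 2 is a low-quality mask that straddles both and gets weight 0.
    cover = np.array([[1, 0, 0], [1, 0, 1], [0, 1, 1], [0, 1, 0]], dtype=float)
    adjacency = np.zeros((4, 4), dtype=bool)
    for a, b in [(0, 1), (1, 2), (2, 3)]:
        adjacency[a, b] = adjacency[b, a] = True
    weights = np.array([1.0, 1.0, 0.0])
    print(cluster_superpoints(cover, adjacency, weights))
```

In this toy input the straddling mask contributes nothing to the affinity once its weight is zero, so the two pairs of superpoints stay separate instances. Restricting merges to adjacent regions is one simple way to encode a 3D geometric prior; the method itself may use a different prior and a different affinity definition.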