Weijia Dou1, Xu Zhang2, Yi Bin1*, Jian Liu3, Bo Peng2, Guoqing Wang3, Yang Yang3, Heng Tao Shen1 (*Corresponding author)
1Tongji University 2Tianjin University 3University of Electronic Science and Technology of China
This is the official repository for GeoPurify. Our work tackles a key challenge in open-vocabulary 3D segmentation: the noisy and fragmented results produced when lifting features from 2D Vision-Language Models (VLMs) to 3D space.
GeoPurify introduces a framework that learns to purify these semantically rich but geometrically inconsistent 3D features. By distilling robust, class-agnostic geometric priors from a 3D self-supervised model, it reconciles 2D semantics with 3D structure, and its training requires no 3D semantic labels.
Our key novelty in a sentence: GeoPurify achieves state-of-the-art open-vocabulary 3D segmentation with only ~1.5% of training data by learning to purify noisy 2D VLM features using distilled 3D geometric priors.
Our method explicitly decouples semantics and geometry into a two-stage pipeline:
- Stage 1: Training (Geometric Distillation) A sparse 3D Student Affinity Network (φS) is trained to comprehend 3D structure. It learns geometric relationships directly from the point cloud by using contrastive distillation to mimic the embeddings of a powerful, frozen 3D SSL teacher (φT, e.g., Sonata). Crucially, this training phase requires no 3D semantic labels.
- Stage 2: Inference (Geometry-Guided Pooling) A frozen generalist 2D VLM (Ψ2D, e.g., X-Decoder) generates initial 3D features by projecting rich semantic content from multi-view images. Because these features are geometrically inconsistent, our pre-trained student network applies a geometry-aware pooling operation, using its learned affinities to iteratively refine and denoise the initial features. This process yields a final representation that is both semantically rich and geometrically coherent.
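To make the Stage 1 objective more concrete, here is a minimal sketch of one common form of contrastive distillation, an InfoNCE-style loss in which each student embedding is pulled toward the teacher embedding of the same point and pushed away from the teacher embeddings of other points. This README does not spell out the exact loss, so the function name `info_nce_distillation`, the temperature value, and the assumption that student and teacher embeddings share the same dimension are all illustrative, not the repository's implementation.

```python
import math


def normalize(v):
    """Scale a vector to unit length (small epsilon avoids division by zero)."""
    norm = math.sqrt(sum(x * x for x in v)) + 1e-12
    return [x / norm for x in v]


def info_nce_distillation(student, teacher, temperature=0.07):
    """InfoNCE-style distillation loss (hypothetical helper, not the repo's API).

    student, teacher: lists of per-point embeddings with matching length and
    dimension. Point i in the teacher is the positive for point i in the
    student; every other teacher point acts as a negative.
    """
    s = [normalize(v) for v in student]
    t = [normalize(v) for v in teacher]
    total = 0.0
    for i, s_i in enumerate(s):
        # Cosine similarity of student point i to every teacher point.
        logits = [sum(a * b for a, b in zip(s_i, t_j)) / temperature for t_j in t]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching teacher point as the target.
        total += log_z - logits[i]
    return total / len(s)


# Toy check: a student that matches the teacher gets a much lower loss
# than one whose embeddings are assigned to the wrong points.
teacher_emb = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned_loss = info_nce_distillation(teacher_emb, teacher_emb)
shuffled_loss = info_nce_distillation(
    [[0.0, 1.0], [-1.0, 0.0], [1.0, 0.0]], teacher_emb
)
```

In a real training loop the teacher embeddings would come from the frozen 3D SSL model and only the student would receive gradients; the pure-Python version above only illustrates the shape of the objective.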
- ⚡ Unrivaled Data Efficiency: Matches or surpasses SOTA performance on major benchmarks (ScanNetV2, Matterport3D) while training on only ~1.5% of the data, eliminating the need for large-scale 3D annotations.
- 🎓 Novel Geometric Distillation: Introduces a teacher-student framework that distills purely geometric affinities from a 3D self-supervised model. This learns a class-agnostic prior to correct structural inconsistencies in 2D-lifted features.
- 🌍 Strong Generalization: The decoupled architecture provides robust zero-shot performance on long-tail benchmarks and excels in cross-dataset generalization, unlike methods that learn entangled geo-semantic representations.
- 🎯 Simple & Effective Purification: At inference, a lightweight Geometry-Guided Pooling module uses the learned affinities to denoise features, producing coherent and accurate segmentation maps.
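The Geometry-Guided Pooling step described above can be sketched as repeated affinity-weighted averaging of the lifted features. The sketch below is an illustration under stated assumptions, not the repository's code: affinities are taken to be a softmax over cosine similarities of per-point geometric embeddings (standing in for the student network's output), and `affinity_rows`, `geometry_guided_pooling`, `neighbors`, `num_iters`, and `temperature` are hypothetical names and parameters.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-12)


def affinity_rows(geo_emb, neighbors, temperature=0.1):
    """Softmax-normalised affinity from each point to its neighbours.

    geo_emb: per-point geometric embeddings (assumed student output).
    neighbors: neighbors[i] lists the point indices pooled into point i.
    """
    rows = []
    for i, nbrs in enumerate(neighbors):
        logits = [cosine(geo_emb[i], geo_emb[j]) / temperature for j in nbrs]
        m = max(logits)
        exps = [math.exp(l - m) for l in logits]
        z = sum(exps)
        rows.append([e / z for e in exps])
    return rows


def geometry_guided_pooling(features, geo_emb, neighbors, num_iters=3, temperature=0.1):
    """Iteratively replace each feature by the affinity-weighted mean of its neighbours."""
    affinities = affinity_rows(geo_emb, neighbors, temperature)
    feats = [list(f) for f in features]
    dim = len(feats[0])
    for _ in range(num_iters):
        feats = [
            [sum(w * feats[j][d] for w, j in zip(affinities[i], nbrs)) for d in range(dim)]
            for i, nbrs in enumerate(neighbors)
        ]
    return feats


# Toy scene: points 0-2 lie on one surface, point 3 on another.
geo = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
# Lifted 2D features: point 2 carries a noisy feature that disagrees with
# the other points on its surface.
noisy = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
all_points = [[0, 1, 2, 3]] * 4  # every point pools over the whole toy scene
purified = geometry_guided_pooling(noisy, geo, all_points)
```

In this toy scene the noisy point is pulled toward the features of the points it is geometrically similar to, while the point on the other surface is left almost unchanged because its affinity to the first surface is close to zero.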
For detailed setup instructions, please see the Installation Guide.
- Input: Multi-view RGB-D images + 3D point clouds.
- Datasets supported: ScanNetV2, Matterport3D, and ScanNet200.
- Follow the preprocessing scripts in scripts/preprocess.
Run training with the curated subset (~1.5% of data):
sh run/train.sh --exp_dir=out/scannet --config=config/geopurify_scannet.yaml
Apply trained model for open-vocabulary 3D segmentation. Pretrained checkpoints are provided under:
- Matterport3D: result/matterport/model
- ScanNetV2: result/scannet/model
sh run/val.sh --exp_dir=out/scannet --config=config/geopurify_scannet.yaml --ckpt_name=geopurify.pth
- ScanNetV2: 1,500 RGB-D scans.
- Matterport3D: 90 large-scale indoor scenes.
- ScanNet200: Long-tail benchmark emphasizing rare categories.
- mIoU (mean Intersection-over-Union)
- mAcc (mean Accuracy)
- Foreground-mIoU / Foreground-mAcc (excluding wall/floor/ceiling).
- ScanNetV2 (~1.5% data): 55.1 mIoU / 72.5 mAcc
- Matterport3D: 40.2 mIoU / 62.4 mAcc
- ScanNet200 (long-tail): 11.9 f-mIoU / 22.8 f-mAcc
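For readers unfamiliar with the metrics listed above, the following sketch shows how mIoU, mAcc, and their foreground variants are typically computed from per-point predictions. It is an illustration rather than the repository's evaluation script: the function names, the `ignore_label` value, the class ids, and the convention of skipping classes that never appear in the ground truth are all assumptions.

```python
def per_class_counts(pred, gt, num_classes, ignore_label=-1):
    """Count true positives, false positives and false negatives per class."""
    tp = [0] * num_classes
    fp = [0] * num_classes
    fn = [0] * num_classes
    for p, g in zip(pred, gt):
        if g == ignore_label:
            continue  # unlabeled points do not contribute
        if p == g:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1
    return tp, fp, fn


def miou_and_macc(pred, gt, num_classes, exclude=(), ignore_label=-1):
    """Return (mIoU, mAcc) averaged over classes present in the ground truth.

    exclude: class ids left out of the average, e.g. the (hypothetical) ids
    of wall/floor/ceiling when computing the foreground variants.
    """
    tp, fp, fn = per_class_counts(pred, gt, num_classes, ignore_label)
    ious, accs = [], []
    for c in range(num_classes):
        support = tp[c] + fn[c]
        if c in exclude or support == 0:
            continue
        ious.append(tp[c] / (tp[c] + fp[c] + fn[c]))
        accs.append(tp[c] / support)
    if not ious:
        return 0.0, 0.0
    return sum(ious) / len(ious), sum(accs) / len(accs)


# Toy example with three classes; class 0 plays the role of a background
# class that the foreground variant leaves out.
gt_labels = [0, 0, 1, 1, 2, 2]
pred_labels = [0, 1, 1, 1, 2, 0]
miou, macc = miou_and_macc(pred_labels, gt_labels, num_classes=3)
f_miou, f_macc = miou_and_macc(pred_labels, gt_labels, num_classes=3, exclude=(0,))
```

Note that a point wrongly predicted as an excluded class still counts as a false negative for its true class, so excluding background classes changes which classes are averaged, not which points are scored.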
Pretrained checkpoints are available on Google Drive: 🔗 Download Here
- Matterport3D checkpoint: checkpoint/result/matterport/model/geopurify.pth
- ScanNetV2 checkpoint: checkpoint/result/scannet/model/geopurify.pth
If you find this work useful, please cite:
@misc{dou2025geopurifydataefficientgeometricdistillation,
title={GeoPurify: A Data-Efficient Geometric Distillation Framework for Open-Vocabulary 3D Segmentation},
author={Weijia Dou and Xu Zhang and Yi Bin and Jian Liu and Bo Peng and Guoqing Wang and Yang Yang and Heng Tao Shen},
year={2025},
eprint={2510.02186},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2510.02186},
}
We thank the authors of Sonata, X-Decoder, and XMask3D for their excellent open-source contributions.
This project is licensed under the MIT License.
