Sonata: Self-Supervised Learning of Reliable Point Representations

CVPR 2025 Highlight

¹The University of Hong Kong
²Meta Reality Labs

* indicates equal contribution, listed in alphabetical order


We recommend beginning with the inference demo. The training code is for those interested in reproducing the pre-training.
The code is released under Apache-2.0; the weights are released under CC-BY-NC-4.0 due to restrictions on the training data.

Abstract

In this paper, we question whether we have a reliable self-supervised point cloud model that can be used for diverse 3D tasks via simple linear probing, even with limited data and minimal computation. We find that existing 3D self-supervised learning approaches fall short when evaluated on representation quality through linear probing. We hypothesize that this is due to what we term the "geometric shortcut", which causes representations to collapse to low-level spatial features. This challenge is unique to 3D and arises from the sparse nature of point cloud data. We address it through two key strategies: obscuring spatial information and enhancing the reliance on input features, ultimately composing a Sonata of 140k point clouds through self-distillation. Sonata is simple and intuitive, yet its learned representations are strong and reliable: zero-shot visualizations demonstrate semantic grouping, alongside strong spatial reasoning through nearest-neighbor relationships. Sonata demonstrates exceptional parameter and data efficiency, tripling linear probing accuracy (from 21.8% to 72.5%) on ScanNet and nearly doubling performance with only 1% of the data compared to previous approaches. Full fine-tuning further advances SOTA across both 3D indoor and outdoor perception tasks.

What is the Geometric Shortcut?

The geometric shortcut is the model's tendency to collapse to easily accessible, low-level geometric cues. We identify it as the primary reason existing 3D self-supervised methods produce weak representations. To illustrate, we select a point on the sofa arm and compute its pairwise similarity with every other point. The resulting similarity heatmap reveals that CSC (Contrastive Scene Contexts) collapses to surface normals and MSC (Masked Scene Contrast) overfits to point height. In contrast, Sonata extracts higher-level concepts, as shown by the high similarity between all sofa arms, highlighted in red.
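The similarity heatmap described above can be sketched in a few lines. This is a minimal illustration, assuming per-point features are available as an (N, C) array and that similarity is measured by cosine similarity; the function name `similarity_to_query` and the random stand-in features are hypothetical and not part of the released code.

```python
import numpy as np

def similarity_to_query(features: np.ndarray, query_index: int) -> np.ndarray:
    """Cosine similarity between one query point and every point.

    features: (N, C) per-point features (hypothetical input).
    Returns an (N,) array with values in [-1, 1].
    """
    # L2-normalize each feature vector; the floor guards against zero vectors.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-12)
    # Dot product of unit vectors equals cosine similarity.
    return unit @ unit[query_index]

rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 64))  # stand-in for real model output
heat = similarity_to_query(feats, query_index=42)
```

Mapping `heat` through a colormap and painting it onto the points gives a heatmap of the kind described; a representation that has collapsed to surface normals or height would light up points that merely share that geometric property with the query.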

A Challenge Unique to 3D

Consider what remains in a 2D image and in a 3D point cloud once the input features (indicated by color) are removed. In an image, all information is contained in the input features, so nothing remains. A point cloud, in contrast, retains geometric information in its point positions, which operators use directly. This characteristic leads to what we term geometric shortcuts in 3D SSL.

Model Design

Building upon a well-designed Point Self-distillation framework (check our paper for a more detailed explanation), we mitigate the collapse caused by geometric shortcuts through two key strategies: obscuring spatial information and enhancing the reliance on input features. Specifically, we address this issue by applying SSL losses at coarser spatial scales, disturbing the spatial information of masked points without features, and progressively increasing task difficulty to reduce reliance on accessible geometric cues.
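One simple way to picture "progressively increasing task difficulty" is a schedule that ramps a difficulty knob, such as the mask ratio, over training. The sketch below is an assumption for illustration: the linear shape, the helper name `linear_ramp`, and the numeric values are hypothetical and are not taken from the paper or the training code.

```python
def linear_ramp(step: int, total_steps: int, start: float, end: float) -> float:
    """Linearly interpolate from start to end over total_steps, then hold at end."""
    if total_steps <= 0:
        return end
    # Clamp progress to [0, 1] so the value stays fixed after the ramp finishes.
    t = min(max(step / total_steps, 0.0), 1.0)
    return start + t * (end - start)

# Hypothetical values for illustration only; not the paper's settings.
mask_ratios = [
    linear_ramp(step, total_steps=1000, start=0.25, end=0.75)
    for step in (0, 500, 1000, 2000)
]
```

A training loop would query the schedule at each step and pass the result to the masking routine, so that early steps see an easier task and later steps a harder one.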

Comparison with DINOv2

We compare the PCA visualizations of DINOv2, Sonata, and their combined feature representation. DINOv2 excels at capturing photometric details, while Sonata better distinguishes spatial information. The combined model demonstrates improved coherence and detail, showcasing the complementary strengths of both models. The linear probing results of lifted DINOv2, Sonata, and combined representation are 63.1%, 72.5%, and 75.9%, respectively, further confirming the observation above.
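For readers unfamiliar with PCA visualizations, the following sketch shows one common way to turn per-point features into RGB colors. It is an illustration under assumptions: the page does not state how the two feature sets are combined, so the per-point concatenation of L2-normalized features in `concat_normalized` is a guess, and both function names and the random inputs are hypothetical.

```python
import numpy as np

def pca_colors(features: np.ndarray) -> np.ndarray:
    """Project (N, C) features onto their top 3 principal components
    and rescale each component to [0, 1] for use as RGB."""
    centered = features - features.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:3].T
    lo = projected.min(axis=0, keepdims=True)
    hi = projected.max(axis=0, keepdims=True)
    return (projected - lo) / np.maximum(hi - lo, 1e-12)

def concat_normalized(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Assumed combination: L2-normalize each feature set, then concatenate."""
    def unit(x: np.ndarray) -> np.ndarray:
        return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), 1e-12)
    return np.concatenate([unit(a), unit(b)], axis=1)

rng = np.random.default_rng(0)
feats_3d = rng.normal(size=(200, 16))  # stand-in for point features
feats_2d = rng.normal(size=(200, 8))   # stand-in for lifted image features
combined = concat_normalized(feats_3d, feats_2d)
colors = pca_colors(combined)
```

Normalizing before concatenation keeps either feature set from dominating the principal components purely because of its scale.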

Zero-shot Representation Across Scenes

We provide PCA-mapped colors and dense matching (with five representative points marked with ×) on a house-scale point cloud from HM3D, comprising 2 floors and 12 rooms (left: floor 1, right: floor 2). The visualization demonstrates that Sonata consistently delivers semantically rich and informative representations across diverse indoor environments.
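Dense matching of the kind shown here can be approximated by a nearest-neighbor search in feature space. The sketch below assumes cosine similarity as the matching criterion; the function name `match_points` and the synthetic features are hypothetical and do not reflect the actual demo code.

```python
import numpy as np

def match_points(query_feats: np.ndarray, target_feats: np.ndarray):
    """For each query feature, find the most similar target feature.

    query_feats: (Q, C), target_feats: (M, C).
    Returns (indices, scores), each of shape (Q,).
    """
    def unit(x: np.ndarray) -> np.ndarray:
        return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), 1e-12)
    # (Q, M) matrix of cosine similarities.
    sim = unit(query_feats) @ unit(target_feats).T
    return sim.argmax(axis=1), sim.max(axis=1)

rng = np.random.default_rng(0)
scene_feats = rng.normal(size=(500, 32))   # stand-in for one scene's features
marked = np.array([3, 77, 120, 256, 499])  # five representative points
indices, scores = match_points(scene_feats[marked], scene_feats)
```

In this toy setup the queries are drawn from the target set, so each one matches itself; with real features, the queries would come from one room and the targets from another.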

Exceptional Parameter Efficiency & Data Efficiency

Sonata shows exceptional parameter and data efficiency. On ScanNet, its linear probing accuracy is 72.5%, about 3.3x the 21.8% achieved by the baseline. With only 1% of the data, Sonata nearly doubles the performance of previous approaches.
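Linear probing means training only a single linear layer on top of frozen features. The following is a minimal sketch of that protocol using softmax regression on synthetic data; the function names, hyperparameters, and toy features are assumptions for illustration and do not reproduce the evaluation setup behind the numbers above.

```python
import numpy as np

def train_linear_probe(feats, labels, num_classes, lr=0.5, steps=200):
    """Fit a linear classifier on frozen features with gradient descent."""
    n, c = feats.shape
    w = np.zeros((c, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(steps):
        logits = feats @ w + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n  # gradient of mean cross-entropy w.r.t. logits
        w -= lr * feats.T @ grad
        b -= lr * grad.sum(axis=0)
    return w, b

def predict(feats, w, b):
    return (feats @ w + b).argmax(axis=1)

# Toy "frozen features": three well-separated classes in 8 dimensions.
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=400)
centers = 3.0 * np.eye(3, 8)
feats = centers[labels] + 0.5 * rng.normal(size=(400, 8))

w, b = train_linear_probe(feats[:300], labels[:300], num_classes=3)
accuracy = (predict(feats[300:], w, b) == labels[300:]).mean()
```

Because the backbone is never updated, the probe's accuracy reflects how linearly separable the pretrained features already are, which is why it serves as a measure of representation quality.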

Table 2. Parameter efficiency with indoor semantic segmentation.

Strong Perception Performance

Further enhanced with full fine-tuning, Sonata demonstrates exceptional performance across a wide range of 3D perception tasks. It achieves state-of-the-art performance on both indoor and outdoor semantic segmentation and instance segmentation tasks. The results are summarized in Tables 3, 4, and 5.