3D & The Bitter Lesson – CVPR25 Rambling Thoughts

One of the implications of the bitter lesson is that scaling up data and computation is far more effective than hand-designing structures. In this sense, 3D vision is a bit weird - for any end task (robotics, AR/VR, design), we do not have easily accessible large-scale data. When we talk about 3D vision, or 3D perception, we are by definition committing to a specific modular breakdown of an end application. In Ross Girshick’s words, is 3D perception a “parser”? Or in Jon Barron’s words, has generative video “bitter-lessoned” 3D?

How to reconcile the bitter lesson with the data scarcity of 3D? With this big question in mind, I arrived at CVPR25 in Nashville, hoping to get more perspectives from the community. Numerous chats, posters, and workshops later, I found myself extremely excited about two answers to the above question. Since we do not have large-scale data for any 3D application for free, we have two options:

  • A. Embrace some modularity - e.g. separate out perception priors (about objects, motion, or physics) into a vision module that is pretrained separately from E2E tasks
  • B. Build data engines that can leverage new signals that complement our existing datasets

Why 3D understanding tasks are not “parsers”

While the lesson from LLMs is that large-scale E2E training >> human-designed structure, I do think applications like robotics could benefit from some modularity: a good perception module need not be bottlenecked by the scarcity of action data.

I see all 3D understanding tasks as ways to find the right representation of the observed world. The question is whether such representations are best learned from scaled-up robot action data or should be improved via strong visual pretraining. Both sides gave interesting opinions at the Generalization in Robotics Manipulation workshop. On the one hand, Chelsea Finn mentioned that the long video sequences in robotic data already implicitly encode information such as 3D structure. On the other hand, those who currently favor more modular approaches (HoMeR, Track2Act) wish to pursue stronger, more informative visual representations and much more data-efficient action generalization. I expect that both will eventually help us build stronger VLA models. As shown by Gemini, a good VLM backbone pretrained on 3D tasks such as detection and multiview correspondence (Gemini Robotics ER) helps the E2E action model improve in both performance and generalization.

I do believe that we should take advantage of all the vision data and 3D understanding tasks we have to obtain the best representation – an understanding of shapes, motion, and common sense. The end goal is probably to integrate all these tasks/representations into either pretraining tasks or the vision encoder – as earlier results in language (e.g. FLAN) have shown, mixing tasks this way can improve both performance on existing tasks and generalization to new tasks. Along the way, we could also learn which training recipes are scalable and efficient, and gain new insights about architecture. In this sense, I do think it is valuable to work on modular visual understanding tasks and to have standalone metrics.

The true lesson from language, in my opinion, is the discovery that the autoregressive generation pretraining task benefits all downstream E2E tasks, rather than a sole focus on specific E2E tasks such as translation or summarization. Before the GPT results, it was unclear that this specific unified formulation would improve E2E performance on all downstream tasks, yet people spent the time and effort investigating it because they believed in the generality of the autoregressive formulation. Thus, in addition to “don’t build a parser”, I also want to say we should not “focus only on the translation metrics” but should focus on finding general representations and tasks. I think the current investigation into all of 3D understanding (grounding/text-3D/motion), albeit each task having its own metric, helps us in the quest for a general visual representation of the world. Eventually the validity of such representations might only be measurable on large-scale projects (think GPT), but the technology buildup towards that final evaluation needs to happen gradually and perhaps even modularly.

Data engines: look out for new signals that enrich existing data

As we are closer than ever to exhausting our available data sources, we need to start seeking other types of signals, either from humans (in less labor-intensive ways) or from other models. I define data engines broadly as systems composed of one or more models, with the goal of enriching current data with new signals. The new signals fall broadly into two buckets: feedback, or more diverse data sources.

The general idea of incorporating feedback into training is nothing new: that is how we initially achieved text-to-3D (DreamFusion), how we accelerated supervision for foundation models like SAM, and how we now approach LLM alignment (RLHF) and reasoning (RLVR). These are usually one-step approaches that leverage zero or one model, but I believe there is much more room – both in coming up with feedback more creatively, and in applying it to new domains and tasks. I came across one such formulation for robotics in a workshop talk given by Ranjay Krishna. The data engine is composed of a base model and a verifier. The verifier (AHA in this case) can detect robotic failures at each stage of a potentially long-horizon task. The base model rolls out a policy, the verifier checks each stage, and if all stages pass, the rollout is added to the dataset. Without resetting the robot, a VLM proposes the next goal, and the system repeats this rollout + verification loop, theoretically indefinitely. With such a verifier, we can keep “distilling” suboptimal generator models into better ones. One could also imagine that once the base model is good enough, the verifier can be used directly for RLVR.
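To make the loop concrete, here is a minimal sketch in Python. The interfaces (`policy.rollout`, `verifier.check`, `goal_proposer.next_goal`) are hypothetical stand-ins I made up for illustration, not the actual AHA or data-engine API – just one way the rollout + verification cycle described above could be wired together.

```python
# Illustrative sketch of a rollout + verification data engine.
# All interfaces below are hypothetical; the real AHA-based system may differ.
from dataclasses import dataclass

@dataclass
class Trajectory:
    goal: str
    stages: list  # per-stage observations/actions of a long-horizon task

def run_data_engine(policy, verifier, goal_proposer, dataset, max_rollouts=1000):
    """Collect verified rollouts without resetting the robot."""
    goal = goal_proposer.initial_goal()
    for _ in range(max_rollouts):
        traj = policy.rollout(goal)            # base model acts toward the current goal
        if all(verifier.check(stage) for stage in traj.stages):
            dataset.append(traj)               # keep only fully verified rollouts
        goal = goal_proposer.next_goal(traj)   # VLM proposes the next goal from the current state
    return dataset
```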

At CVPR25 there was also a lot of excitement about enriching the 3D data source via reconstruction methods (e.g. VGGT). Such data can be used directly to train self-supervised representations such as Sonata, or combined with other foundation models to provide training data for understanding tasks (such as what I tried in my work, Find3D).
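As a rough illustration of this second flavor of data engine (not Find3D’s actual pipeline – just a simplified sketch under assumed inputs), the snippet below lifts per-pixel predictions from a 2D foundation model onto a reconstructed point cloud using known camera parameters, turning 2D labels into 3D supervision.

```python
# Sketch: transfer 2D foundation-model labels onto reconstructed 3D points.
# Assumes a reconstruction method has already produced points + calibrated cameras.
import numpy as np

def lift_2d_labels_to_3d(points_xyz, cameras, images, label_fn):
    """points_xyz: (N, 3) world points; cameras: list of (K, R, t);
    label_fn: image -> (H, W) integer label map from a 2D foundation model."""
    labels = np.full(len(points_xyz), -1, dtype=np.int64)
    for (K, R, t), image in zip(cameras, images):
        label_map = label_fn(image)                 # 2D model prediction for this view
        cam_pts = points_xyz @ R.T + t              # world -> camera coordinates
        in_front = cam_pts[:, 2] > 0
        uv = cam_pts @ K.T
        uv = uv[:, :2] / uv[:, 2:3]                 # perspective projection to pixels
        u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
        valid = (in_front & (u >= 0) & (u < label_map.shape[1])
                 & (v >= 0) & (v < label_map.shape[0]))
        # Later views simply overwrite earlier ones here; a real system
        # would aggregate predictions across views and handle occlusion.
        labels[valid] = label_map[v[valid], u[valid]]
    return labels
```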

The data limitation in 3D forces us to think harder – since it is not easy to train everything E2E on large data, we need to get more creative. I am extremely excited about both paths forward: making 3D perception a bit modular and “pretrainable” to reduce the need for E2E task data, and building systems that can leverage new signals to enrich our existing data.



