My research is about building generalist intelligence for the physical world. An intelligence that cannot perceive, predict, and act in the physical world is not truly general.
Language models scaled by pretraining on internet text; internet video is the equivalent corpus for physical-world AI. In my view, the missing piece is post-training—learning from interleaved synthetic data, human preferences, and sparse rewards to close the gap between passive video pretraining and active agents.
I've worked on this problem end-to-end: transfer learning and pretrained representations for RL in my PhD, then synthetic data engines and preference-based post-training for 3D at Meta. 3D is a domain where data scarcity demands post-training techniques that will generalize beyond it. Ultimately, I aim to build world models that support capable agents.
A 1.8B flow matching model (6T training tokens) for single-image 3D generation that predicts complete shape, pose, and texture for masked objects. Drove adoption of an LLM-style training paradigm (pretraining → midtraining → SFT → DPO) and a preference-based data engine (online DPO → reward models → rejection sampling). Designed and built the layout model end-to-end (flow matching for pose, multimodal conditioning). 5:1 win rate over SOTA.
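The core objective is standard flow matching; below is a minimal sketch of one training step, assuming a hypothetical `velocity_net(xt, t, cond)` interface and illustrative tensor shapes rather than the actual 1.8B model or its tokenization.

```python
# Minimal sketch of a (conditional) flow matching training step.
# `velocity_net`, `cond`, and the shapes are illustrative assumptions.
import torch
import torch.nn.functional as F

def flow_matching_step(velocity_net, x1, cond):
    """Regress the velocity field along a straight noise→data path."""
    x0 = torch.randn_like(x1)                      # noise endpoint of the path
    t = torch.rand(x1.shape[0], device=x1.device)  # uniform time in [0, 1]
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))       # broadcast over data dims
    xt = (1 - t_) * x0 + t_ * x1                   # point on the straight path
    target_v = x1 - x0                             # constant velocity of that path
    pred_v = velocity_net(xt, t, cond)             # model predicts the velocity
    return F.mse_loss(pred_v, target_v)
```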
3D vision-language grounding via distillation from 2D VLMs. Scaling behavior matches transfer learning predictions, achieving SOTA. Proposed the render-supervised distillation approach; co-developed its design and implementation.
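A minimal sketch of what render-supervised distillation can look like: per-point 3D features are projected into a view and matched against a frozen 2D VLM feature map at the same pixels. The tensor layout, cosine loss, and `uv_norm`/`valid` inputs are illustrative assumptions, not the exact formulation.

```python
# Hedged sketch: align projected 3D features with frozen 2D VLM features.
import torch
import torch.nn.functional as F

def render_supervised_distill_loss(point_feats, uv_norm, valid, vlm_feat_map):
    """
    point_feats:  (N, C) features from the 3D model.
    uv_norm:      (N, 2) projected pixel coords, normalized to [-1, 1].
    valid:        (N,) visibility mask from the projection.
    vlm_feat_map: (C, H, W) frozen 2D VLM feature map for the same view.
    """
    grid = uv_norm.view(1, -1, 1, 2)                              # (1, N, 1, 2)
    target = F.grid_sample(vlm_feat_map[None], grid,              # sample VLM features
                           mode='bilinear', align_corners=False)  # (1, C, N, 1)
    target = target[0, :, :, 0].t()                               # (N, C)
    pred = F.normalize(point_feats[valid], dim=-1)
    tgt = F.normalize(target[valid], dim=-1)
    return (1.0 - (pred * tgt).sum(dim=-1)).mean()                # cosine distillation
```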
Language-based 3D localization in real-world scenes. Introduced problem framing; championed pre/post-training approach and proposed 2D annotation for data engine. Designed and built evaluations; led mask decoder development; drove dataset curation at 20-40x typical scale.
An open-vocabulary benchmark for Embodied Question Answering across 180+ real-world scenes.
Humans far outperform VLMs on tasks that require complex spatial understanding.
A multimodal 2D/3D dataset of millions of frames from thousands of scanned and synthetic scenes.
For surface normal estimation, ViTs trained on Omnidata achieve human-level performance on OASIS.
SOTA depth estimation; the annotation engine is released as OSS.
Large-scale analysis of pretrained image-to-image networks.
Compared to purely semantic perceptual losses like LPIPS, geometric perceptual consistency losses (depth/surface normal) yield sharper details and improve generalization to new domains.
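A minimal sketch of a geometric perceptual consistency loss, assuming a hypothetical frozen surface-normal network `normal_net`; the actual estimators, loss form, and weighting may differ.

```python
# Hedged sketch: compare prediction and target in the output space of a frozen
# geometry network (parameters assumed frozen; gradients flow back through pred_img).
import torch
import torch.nn.functional as F

def geometric_consistency_loss(normal_net, pred_img, target_img):
    with torch.no_grad():
        target_normals = normal_net(target_img)   # reference geometry; no gradients needed
    pred_normals = normal_net(pred_img)           # gradients reach the generated image
    return F.l1_loss(pred_normals, target_normals)
```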
Compared to full fine-tuning and other parameter-efficient fine-tuning methods, simply adding a lightweight side network to control the activations is surprisingly competitive (see the sketch below).
We show results across vision, language, and robotics using both supervised learning and behavior cloning.
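A minimal sketch of the side-network idea under illustrative assumptions: the pretrained base stays frozen and a small trainable side network is blended with it via a learned weight. The exact architecture and combination rule differ per task.

```python
# Hedged sketch: frozen base + lightweight trainable side network, alpha-blended.
import torch
import torch.nn as nn

class SideTuned(nn.Module):
    def __init__(self, base: nn.Module, side: nn.Module):
        super().__init__()
        self.base = base
        for p in self.base.parameters():           # base stays frozen
            p.requires_grad_(False)
        self.side = side                           # lightweight, trainable
        self.alpha = nn.Parameter(torch.zeros(1))  # learned blending weight

    def forward(self, x):
        a = torch.sigmoid(self.alpha)
        return a * self.base(x) + (1 - a) * self.side(x)
```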
Pretrained visual representations enable faster learning and qualitatively better generalization compared to domain randomization, saturating the benchmark in settings where domain randomization fails entirely. Includes zero-shot sim-to-real on physical robots. We train policies with SAC and TD3 using hindsight experience replay (HER) for sample efficiency.
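A minimal sketch of HER relabeling with the "final" strategy, assuming hypothetical transition fields (`obs`, `achieved_goal`, `goal`) and a task-provided `compute_reward`; the actual replay pipeline differs.

```python
# Hedged sketch: store failed trajectories a second time with the achieved goal
# substituted for the intended one, so sparse-reward episodes still carry signal.
import copy

def her_relabel(episode, compute_reward):
    """episode: list of dicts with keys 'obs', 'action', 'achieved_goal', 'goal', 'reward'."""
    relabeled = []
    final_goal = episode[-1]['achieved_goal']      # pretend the final state was the goal
    for t in episode:
        t2 = copy.deepcopy(t)
        t2['goal'] = final_goal
        t2['reward'] = compute_reward(t2['achieved_goal'], final_goal)
        relabeled.append(t2)
    return episode + relabeled                     # keep original and hindsight copies
```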
RL agents with pretrained image representations enable 10x sample efficiency gains vs. random init. We train navigation policies with PPO, adding importance sampling to enable off-policy learning with a replay buffer.
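One way such an off-policy correction can look, sketched under illustrative assumptions: replayed transitions are reweighted by a (truncated) ratio of current to behavior-policy probabilities inside the clipped PPO surrogate. This is not necessarily the exact estimator used.

```python
# Hedged sketch: clipped PPO surrogate on replayed data, with the importance ratio
# taken against the behavior policy that generated the buffer and truncated for stability.
import torch

def off_policy_ppo_loss(new_logp, behavior_logp, advantages,
                        clip_eps=0.2, ratio_max=2.0):
    ratio = torch.exp(new_logp - behavior_logp)                       # pi_new / pi_behavior
    ratio = ratio.clamp(max=ratio_max)                                # truncate for variance control
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```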
A fully computational approach for modeling relationships across 26 fundamental vision tasks.
Exploiting this structure with a pretraining/finetuning learning curriculum reduces overall supervision required by ~2/3.
A rendering and physics simulator to bridge the gap between large-scale simulation and real-world environments.
The simulator features a database of thousands of real spaces and uses a generative model (GAN) for super-sampling.
A large-scale indoor dataset providing mutually registered 2D, 2.5D and 3D modalities with instance-level semantic annotations.
Covers over 6,000m² across 6 large-scale indoor areas with 70,000+ RGB images and corresponding poses, depths, surface normals, semantic annotations, and pointmaps.