I am a senior research scientist at Meta Superintelligence Labs working on the intersection of multimodal models and embodied AI.
My research currently focuses on equipping models with spatial reasoning and memory,
and scaling up to train agents that can not only perceive their physical environment but actively reason and act within it.
My recent work centers on creating new model capabilities for understanding, grounding, and reconstructing the physical world.
It advances the full training and data generation loop, developing scalable techniques for pretraining and post-training multimodal models.
A foundation model for reconstructing objects' shape, pose, and texture from a single image.
We cracked the 3D data bottleneck by having humans rank model generations.
After pretraining on synthetic data, we use human feedback for sim-to-real transfer and preference alignment.
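As a rough illustration of what those rankings buy us, a pairwise preference objective of the Bradley-Terry form can be trained on the human comparisons. The function names and scoring interface below are illustrative sketches, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def preference_loss(score_preferred: torch.Tensor,
                    score_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style pairwise loss: push the model to score the
    human-preferred 3D generation above the rejected one."""
    return -F.logsigmoid(score_preferred - score_rejected).mean()

# Hypothetical usage, where `score_fn` rates a reconstruction given the input image:
#   loss = preference_loss(score_fn(image, recon_a), score_fn(image, recon_b))
#   (humans ranked recon_a above recon_b)
```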
Replace the entire structure-from-motion pipeline with a single transformer.
A purely learning-based approach for 3D reconstruction that scales with more data and compute.
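A minimal sketch of the feed-forward idea, assuming a ViT-style backbone that maps image patches from all views straight to per-pixel 3D points and per-view camera poses (module names, dimensions, and heads are illustrative; positional and view embeddings are omitted for brevity):

```python
import torch.nn as nn

class FeedForwardReconstructor(nn.Module):
    """Illustrative stand-in for an SfM-free reconstructor: a single transformer
    maps image patches from all views to pointmaps and per-view camera poses."""
    def __init__(self, patch=16, dim=768, depth=12, heads=12):
        super().__init__()
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.point_head = nn.Linear(dim, 3 * patch * patch)  # XYZ per pixel in the patch
        self.pose_head = nn.Linear(dim, 9)                   # 6D rotation + translation

    def forward(self, images):                       # images: (B, V, 3, H, W)
        B, V = images.shape[:2]
        tokens = self.embed(images.flatten(0, 1))    # (B*V, dim, h, w)
        tokens = tokens.flatten(2).transpose(1, 2)   # (B*V, N, dim)
        N = tokens.shape[1]
        tokens = tokens.reshape(B, V * N, -1)        # one long sequence over all views
        feats = self.encoder(tokens)                 # joint attention across views
        points = self.point_head(feats)              # a pointmap for every patch
        poses = self.pose_head(feats.reshape(B, V, N, -1).mean(dim=2))
        return points.reshape(B, V, N, -1, 3), poses
```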
Use differentiable rendering to supervise 3D masks.
We distill from 2D VLMs (think: SAM 3 or Gemini 3), then finetune on ~50k samples for 3D vision-language grounding. Achieves SOTA performance with strong empirical data scaling.
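To make the supervision concrete, here is a toy version of the idea: project the 3D points into the image, splat their mask probabilities, and match the result against a 2D pseudo-label mask from a 2D segmenter. The splatting "renderer" and function names are simplified stand-ins, not the actual rendering pipeline; `teacher_mask_2d` is assumed to be a float mask in [0, 1]:

```python
import torch
import torch.nn.functional as F

def render_mask(points_cam, mask_logits, K, H, W):
    """Very simplified differentiable 'renderer': project 3D points with a pinhole
    camera and scatter their mask probabilities onto a 2D grid (no z-buffering)."""
    uv = points_cam @ K.T                                   # (N, 3)
    uv = uv[:, :2] / uv[:, 2:].clamp(min=1e-6)              # perspective divide
    u = uv[:, 0].round().long().clamp(0, W - 1)
    v = uv[:, 1].round().long().clamp(0, H - 1)
    image = torch.zeros(H * W, device=points_cam.device)
    image = image.index_put((v * W + u,), torch.sigmoid(mask_logits), accumulate=True)
    return image.clamp(max=1.0).view(H, W)

def distill_loss(points_cam, mask_logits, K, teacher_mask_2d):
    """Supervise per-point 3D mask logits with a 2D pseudo-label mask."""
    H, W = teacher_mask_2d.shape
    rendered = render_mask(points_cam, mask_logits, K, H, W)
    return F.binary_cross_entropy(rendered, teacher_mask_2d)
```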
An open-vocabulary benchmark for Embodied Question Answering across 180+ real-world scenes.
Humans far outperform VLMs on tasks that require complex spatial understanding.
A multimodal 2D/3D dataset of millions of frames from thousands of scanned and synthetic scenes.
For surface normal estimation, ViTs trained on Omnidata achieve human-level performance on OASIS.
Depth estimation is SOTA, and the annotation engine is released as open-source software.
Large-scale analysis of pretrained image-to-image networks.
Compared to purely semantic perceptual losses like LPIPS, geometric perceptual consistency losses (depth/surface normal) yield sharper details and improve generalization to new domains.
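A minimal sketch of such a geometric perceptual loss, assuming a frozen monocular depth or surface-normal network stands in for the geometry predictor:

```python
import torch
import torch.nn as nn

class GeometricPerceptualLoss(nn.Module):
    """Compare generated and target images in the output space of a frozen
    geometry network (e.g. monocular depth or surface normals) rather than
    in a purely semantic feature space like LPIPS."""
    def __init__(self, geometry_net: nn.Module):
        super().__init__()
        self.geometry_net = geometry_net.eval()
        for p in self.geometry_net.parameters():
            p.requires_grad_(False)

    def forward(self, generated, target):
        with torch.no_grad():
            target_geom = self.geometry_net(target)      # e.g. depth / normal map
        generated_geom = self.geometry_net(generated)    # gradients flow to the generator
        return (generated_geom - target_geom).abs().mean()
```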
Compared to full fine-tuning and other parameter-efficient fine-tuning methods, simply adding a lightweight side network to control the activations is surprisingly competitive.
We show results across vision, language, and robotics using both supervised learning and behavior cloning.
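A minimal sketch of the side-network idea, assuming the frozen base and the lightweight side network produce same-shaped outputs that a learned gate blends:

```python
import torch
import torch.nn as nn

class SideTunedModel(nn.Module):
    """Keep the pretrained base frozen and train only a small side network;
    a learned gate blends the two, steering the base model's activations."""
    def __init__(self, base: nn.Module, side: nn.Module):
        super().__init__()
        self.base = base.eval()
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.side = side                            # lightweight, trainable
        self.alpha = nn.Parameter(torch.zeros(1))   # blending gate

    def forward(self, x):
        with torch.no_grad():
            base_out = self.base(x)
        gate = torch.sigmoid(self.alpha)
        return gate * base_out + (1.0 - gate) * self.side(x)
```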
Compared to domain randomization, pretrained image representations yield qualitatively better generalization to new environments, including sim-to-real transfer.
In data regimes where visual domain randomization gets near-zero training performance, policies using pretrained image representations can saturate the benchmark.
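To illustrate the setup, a behavior-cloning policy can read out actions from a frozen pretrained encoder rather than learning from raw pixels; the encoder interface and dimensions below are hypothetical:

```python
import torch
import torch.nn as nn

class FrozenFeaturePolicy(nn.Module):
    """Behavior-cloning policy on top of a frozen pretrained visual encoder,
    instead of learning from pixels with heavy domain randomization."""
    def __init__(self, encoder: nn.Module, feat_dim: int, action_dim: int):
        super().__init__()
        self.encoder = encoder.eval()
        for p in self.encoder.parameters():
            p.requires_grad_(False)
        self.policy = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, action_dim))

    def forward(self, observation):
        with torch.no_grad():
            feats = self.encoder(observation)   # mid-level visual representation
        return self.policy(feats)

# Training sketch (behavior cloning): regress expert actions, e.g.
#   loss = F.mse_loss(policy(obs_batch), expert_action_batch)
```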
A fully computational approach for modeling relationships across 26 fundamental vision tasks.
Exploiting this structure with a pretraining/finetuning learning curriculum reduces overall supervision required by ~2/3.
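A minimal sketch of that curriculum, assuming a frozen encoder pretrained on a related source task and a shallow readout trained on the much smaller target-task dataset:

```python
import torch.nn as nn

def build_transfer_model(source_encoder: nn.Module, feat_dim: int, target_out: int):
    """Reuse a representation learned on a related source task and train only a
    small readout for the target task, cutting the labeled data it needs."""
    for p in source_encoder.parameters():
        p.requires_grad_(False)                # source encoder stays frozen
    readout = nn.Sequential(                   # shallow target-task decoder
        nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, target_out))
    return nn.Sequential(source_encoder, readout)
```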
A rendering and physics simulator to bridge the gap between large-scale simulation and real-world environments.
The simulator features a database of thousands of real spaces and uses a generative model (GAN) for super-sampling.
A large-scale indoor dataset providing mutually registered 2D, 2.5D, and 3D modalities with instance-level semantic annotations.
Covers over 6,000 m² across 6 large-scale indoor areas with 70,000+ RGB images and corresponding poses, depths, surface normals, semantic annotations, and pointmaps.