I'm a senior research scientist at Meta FAIR in Menlo Park.
My research focuses on building multimodal foundation models to enable Embodied Agents that can
perceive, act in, and communicate about the physical world.
Much of my work involves large-scale training of predictive and generative models using supervised learning and RL.
It often involves developing multimodal 3D datasets, simulators, and techniques for efficient pretraining and finetuning.
Here are some research highlights:
An open-vocabulary benchmark for Embodied Question Answering (EQA) with a new dataset across 180+ real environments.
GPT-4V, at 48.5%, lags well behind human performance (85.9%).
For questions that require complex spatial understanding, VLMs (as of December 2024) are nearly "blind".
An annotation pipeline, built with Blender, for a large-scale dataset of 3D-scanned and artist-generated scenes.
Transformer models trained for monocular surface normal estimation achieved performance comparable to human annotators on the OASIS benchmark, and
the depth models outperformed MiDaS v2.
A method for creating a robust and diverse ensemble of predictive networks where the average pairwise disagreement provides a measure of uncertainty.
The calibration can be supervised with a lightweight post-training step.
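A minimal sketch of the disagreement measure (assuming PyTorch, an ensemble passed in as a list of models, and dense predictions of shape (B, C, H, W); the L1 distance and per-sample averaging are illustrative choices, not necessarily the paper's exact metric):

```python
import torch

# Hedged sketch: uncertainty as the average pairwise disagreement
# between ensemble members' predictions.
def ensemble_uncertainty(models, x):
    preds = [m(x) for m in models]                      # one prediction per member
    n = len(preds)
    disagreement = torch.zeros(preds[0].shape[0], device=x.device)
    pairs = 0
    for i in range(n):
        for j in range(i + 1, n):
            # per-sample L1 distance between two members' predictions
            disagreement += (preds[i] - preds[j]).abs().flatten(1).mean(dim=1)
            pairs += 1
    return disagreement / pairs                         # higher = less agreement = more uncertain
```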
Introducing perceptual consistency losses for large-scale distillation, using learned priors from pretrained networks.
We find that monocular priors from depth and surface normal estimation yield sharper details and better generalization than LPIPS.
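A rough sketch of what such a loss can look like (assuming PyTorch; `prior_net` stands in for a frozen pretrained monocular depth or surface-normal network, and the L1 feature distance is an assumption rather than the exact formulation used):

```python
import torch
import torch.nn.functional as F

# Hedged sketch of a perceptual consistency loss for distillation:
# compare student and teacher outputs through a frozen monocular prior
# network instead of LPIPS.
def perceptual_consistency_loss(student_out, teacher_out, prior_net):
    prior_net.eval()
    with torch.no_grad():
        target = prior_net(teacher_out)      # prior applied to the teacher's output
    pred = prior_net(student_out)            # gradients flow back through the prior
    return F.l1_loss(pred, target)
```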
Adding a lightweight side network to modulate a large pretrained network
is surprisingly competitive for network adaptation and lifelong learning.
We show results in vision, language, and robotics using supervised learning and behavior cloning.
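A minimal sketch of the idea (assuming PyTorch; the alpha-blended additive fusion below is one simple choice, and the exact combination used in the work may differ):

```python
import torch
import torch.nn as nn

# Hedged sketch: a small trainable side network modulates a frozen
# pretrained backbone through a learned blending weight.
class SideTuned(nn.Module):
    def __init__(self, backbone, side_net):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad_(False)                 # large pretrained model stays frozen
        self.side_net = side_net                    # lightweight, trainable network
        self.alpha = nn.Parameter(torch.zeros(1))   # learned blending weight

    def forward(self, x):
        with torch.no_grad():
            base = self.backbone(x)
        side = self.side_net(x)                     # assumed to match base's output shape
        a = torch.sigmoid(self.alpha)
        return a * base + (1.0 - a) * side
```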
In data regimes where visual domain randomization achieves near-zero training performance, policies built on pretrained image representations can saturate the benchmark.
Pretrained representations yield qualitatively better generalization to new environments, including sim-to-real transfer.
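A sketch of the general setup (assuming PyTorch and torchvision's ResNet-18 as a stand-in frozen encoder; the actual pretrained representations and policy architecture in the work may differ):

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Hedged sketch: a small policy head on top of a frozen pretrained image encoder.
class PretrainedRepPolicy(nn.Module):
    def __init__(self, n_actions):
        super().__init__()
        backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        backbone.fc = nn.Identity()            # expose 512-d features
        for p in backbone.parameters():
            p.requires_grad_(False)            # pretrained features stay frozen
        self.encoder = backbone
        self.policy = nn.Sequential(
            nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, n_actions))

    def forward(self, obs):                    # obs: (B, 3, H, W) image batch
        with torch.no_grad():
            feats = self.encoder(obs)
        return self.policy(feats)              # action logits for RL or behavior cloning
```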
A fully computational approach for modeling the structure across 26 fundamental vision tasks.
Exploiting this structure with a transfer learning curriculum reduces overall supervision by ~2/3.
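A toy illustration of exploiting such structure (the affinity matrix `A[source, target]`, task names, and greedy source selection below are hypothetical stand-ins for the idea, not the actual selection procedure):

```python
import numpy as np

# Toy sketch: given pairwise transfer affinities, pick a small budget of
# source tasks to supervise fully and assign each target its best source,
# so most tasks are learned by transfer rather than full supervision.
def plan_transfers(affinity, task_names, budget=4):
    total_benefit = affinity.sum(axis=1)                  # overall usefulness of each source
    sources = list(np.argsort(-total_benefit)[:budget])   # top-`budget` source tasks
    plan = {}
    for t, name in enumerate(task_names):
        best_src = max(sources, key=lambda s: affinity[s, t])
        plan[name] = task_names[best_src]                  # target -> chosen source
    return [task_names[s] for s in sources], plan
```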
A rendering and physics simulator to bridge the gap between large-scale simulation and real-world environments.
The simulator features a database of thousands of real spaces and uses a neural network for infilling, like an early form of DLSS.
A large-scale indoor dataset providing mutually registered 2D, 2.5D and 3D modalities with instance-level semantic annotations.
Covers over 6,000 m² across 6 large-scale indoor areas with 70,000+ RGB images and corresponding depths, surface normals, semantic annotations, and global XYZ coordinates.