My research is about building generalist intelligence for the physical world. An intelligence that cannot perceive, predict, and act in the physical world is not truly general.
Language models scaled by pretraining on internet text; internet video is the equivalent corpus for physical-world AI. In my view, the missing piece is post-training—learning from interleaved synthetic data, human preferences, and sparse rewards to close the gap between passive video pretraining and active agents.
I've worked on this problem end-to-end: transfer learning and pretrained representations for RL in my PhD, then synthetic data engines and preference-based post-training for 3D at Meta. 3D is a domain where data scarcity demands post-training techniques that will generalize beyond it. Ultimately, I aim to build world models that support capable agents.
A 1.8B flow matching model (6T training tokens) for single-image 3D generation that predicts complete shape, pose, and texture for masked objects. Drove adoption of an LLM-style training paradigm (pretraining → midtraining → SFT → DPO) and a preference-based data engine (online DPO → reward models → rejection sampling). Designed and built the layout model end-to-end (flow matching for pose, multimodal conditioning). 5:1 win rate over SOTA.
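The core objective is standard flow matching; below is a minimal sketch of one training step, assuming a hypothetical `velocity_net(xt, t, cond)` interface and illustrative tensor shapes rather than the actual 1.8B model or its tokenization.

```python
# Minimal sketch of a (conditional) flow matching training step.
# `velocity_net`, `cond`, and the shapes are illustrative assumptions.
import torch
import torch.nn.functional as F

def flow_matching_step(velocity_net, x1, cond):
    """Regress the velocity field along a straight noise→data path."""
    x0 = torch.randn_like(x1)                      # noise endpoint of the path
    t = torch.rand(x1.shape[0], device=x1.device)  # uniform time in [0, 1]
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))       # broadcast over data dims
    xt = (1 - t_) * x0 + t_ * x1                   # point on the straight path
    target_v = x1 - x0                             # constant velocity of that path
    pred_v = velocity_net(xt, t, cond)             # model predicts the velocity
    return F.mse_loss(pred_v, target_v)
```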
3D vision-language grounding via distillation from 2D VLMs. Scaling behavior matches transfer learning predictions, achieving SOTA. Proposed the render-supervised distillation approach; co-developed its design and implementation.
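A minimal sketch of what render-supervised distillation can look like: per-point 3D features are projected into a view and matched against a frozen 2D VLM feature map at the same pixels. The tensor layout, cosine loss, and `uv_norm`/`valid` inputs are illustrative assumptions, not the exact formulation.

```python
# Hedged sketch: align projected 3D features with frozen 2D VLM features.
import torch
import torch.nn.functional as F

def render_supervised_distill_loss(point_feats, uv_norm, valid, vlm_feat_map):
    """
    point_feats:  (N, C) features from the 3D model.
    uv_norm:      (N, 2) projected pixel coords, normalized to [-1, 1].
    valid:        (N,) visibility mask from the projection.
    vlm_feat_map: (C, H, W) frozen 2D VLM feature map for the same view.
    """
    grid = uv_norm.view(1, -1, 1, 2)                              # (1, N, 1, 2)
    target = F.grid_sample(vlm_feat_map[None], grid,              # sample VLM features
                           mode='bilinear', align_corners=False)  # (1, C, N, 1)
    target = target[0, :, :, 0].t()                               # (N, C)
    pred = F.normalize(point_feats[valid], dim=-1)
    tgt = F.normalize(target[valid], dim=-1)
    return (1.0 - (pred * tgt).sum(dim=-1)).mean()                # cosine distillation
```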
Language-based 3D localization in real-world scenes. Introduced problem framing; championed pre/post-training approach and proposed 2D annotation for data engine. Designed and built evaluations; led mask decoder development; drove dataset curation at 20-40x typical scale.
An open-vocabulary benchmark for Embodied Question Answering across 180+ real-world scenes.
Humans far outperform VLMs on tasks that require complex spatial understanding.
A multimodal 2D/3D dataset of millions of frames from thousands of scanned and synthetic scenes.
For surface normal estimation, ViTs trained on Omnidata achieve human-level performance on OASIS.
SOTA depth estimation; the annotation engine is released as OSS.
Large-scale analysis of pretrained image-to-image networks.
Compared to purely semantic perceptual losses like LPIPS, geometric perceptual consistency losses (depth/surface normal) yield sharper details and improve generalization to new domains.
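A minimal sketch of a geometric perceptual consistency loss, assuming a hypothetical frozen surface-normal network `normal_net`; the actual estimators, loss form, and weighting may differ.

```python
# Hedged sketch: compare prediction and target in the output space of a frozen
# geometry network (parameters assumed frozen; gradients flow back through pred_img).
import torch
import torch.nn.functional as F

def geometric_consistency_loss(normal_net, pred_img, target_img):
    with torch.no_grad():
        target_normals = normal_net(target_img)   # reference geometry; no gradients needed
    pred_normals = normal_net(pred_img)           # gradients reach the generated image
    return F.l1_loss(pred_normals, target_normals)
```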
Compared to full fine-tuning and other parameter-efficient fine-tuning methods, simply adding a lightweight side network to control the activations is surprisingly competitive (see the sketch below).
We show results across vision, language, and robotics using both supervised learning and behavior cloning.
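A minimal sketch of the side-network idea under illustrative assumptions: the pretrained base stays frozen and a small trainable side network is blended with it via a learned weight. The exact architecture and combination rule differ per task.

```python
# Hedged sketch: frozen base + lightweight trainable side network, alpha-blended.
import torch
import torch.nn as nn

class SideTuned(nn.Module):
    def __init__(self, base: nn.Module, side: nn.Module):
        super().__init__()
        self.base = base
        for p in self.base.parameters():           # base stays frozen
            p.requires_grad_(False)
        self.side = side                           # lightweight, trainable
        self.alpha = nn.Parameter(torch.zeros(1))  # learned blending weight

    def forward(self, x):
        a = torch.sigmoid(self.alpha)
        return a * self.base(x) + (1 - a) * self.side(x)
```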
Pretrained visual representations enable faster learning and qualitatively better generalization compared to domain randomization, saturating the benchmark in settings where domain randomization fails entirely. Includes zero-shot sim-to-real on physical robots. We train policies with SAC and TD3 using hindsight experience replay (HER) for sample efficiency.
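A minimal sketch of HER relabeling with the "final" strategy, assuming hypothetical transition fields (`obs`, `achieved_goal`, `goal`) and a task-provided `compute_reward`; the actual replay pipeline differs.

```python
# Hedged sketch: store failed trajectories a second time with the achieved goal
# substituted for the intended one, so sparse-reward episodes still carry signal.
import copy

def her_relabel(episode, compute_reward):
    """episode: list of dicts with keys 'obs', 'action', 'achieved_goal', 'goal', 'reward'."""
    relabeled = []
    final_goal = episode[-1]['achieved_goal']      # pretend the final state was the goal
    for t in episode:
        t2 = copy.deepcopy(t)
        t2['goal'] = final_goal
        t2['reward'] = compute_reward(t2['achieved_goal'], final_goal)
        relabeled.append(t2)
    return episode + relabeled                     # keep original and hindsight copies
```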
RL agents with pretrained image representations enable 10x sample efficiency gains vs. random init. We train navigation policies with PPO, adding importance sampling to enable off-policy learning with a replay buffer.
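One way such an off-policy correction can look, sketched under illustrative assumptions: replayed transitions are reweighted by a (truncated) ratio of current to behavior-policy probabilities inside the clipped PPO surrogate. This is not necessarily the exact estimator used.

```python
# Hedged sketch: clipped PPO surrogate on replayed data, with the importance ratio
# taken against the behavior policy that generated the buffer and truncated for stability.
import torch

def off_policy_ppo_loss(new_logp, behavior_logp, advantages,
                        clip_eps=0.2, ratio_max=2.0):
    ratio = torch.exp(new_logp - behavior_logp)                       # pi_new / pi_behavior
    ratio = ratio.clamp(max=ratio_max)                                # truncate for variance control
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```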
A fully computational approach for modeling relationships across 26 fundamental vision tasks.
Exploiting this structure with a pretraining/finetuning learning curriculum reduces overall supervision required by ~2/3.
A rendering and physics simulator to bridge the gap between large-scale simulation and real-world environments.
The simulator features a database of thousands of real spaces and uses a generative model (GAN) for super-sampling.
A large-scale indoor dataset providing mutually registered 2D, 2.5D and 3D modalities with instance-level semantic annotations.
Covers over 6,000m² across 6 large-scale indoor areas with 70,000+ RGB images and corresponding poses, depths, surface normals, semantic annotations, and pointmaps.