Alexander (Sasha) Sax

I'm a senior research scientist at Meta FAIR in Menlo Park. My research focuses on building multimodal foundation models to enable Embodied Agents that can perceive, act in, and communicate about the physical world.

I did my PhD at UC Berkeley, where I was advised by Jitendra Malik and Amir Zamir (EPFL). In the summer of 2022, I interned at FAIR with Georgia Gkioxari. I got my MS and BS from Stanford University, where I was advised by Silvio Savarese and graduated with an Erdős number of 3.

Email  /  CV  /  Scholar  /  Github


Research

Much of my work centers on large-scale training of predictive and generative models with supervised learning and RL. It often involves building multimodal 3D datasets, simulators, and techniques for efficient pretraining and finetuning.

A few papers are highlighted. * denotes equal contribution; co-first authors are listed in alphabetical order.

OpenEQA: Embodied Question Answering in the Era of Foundation Models
The Cortex Team @ FAIR
CVPR, 2024
project page / blog post / paper / github

An open-vocabulary benchmark for Embodied Question Answering (EQA), with a new dataset spanning 180+ real-world environments. GPT-4V scores 48.5%, well below human performance (85.9%). On questions that require complex spatial understanding, the VLMs of December 2024 are nearly "blind".

Omnidata: A Scalable Pipeline for Making Multi-Task Mid-Level Vision Datasets from 3D Scans
Ainaz Eftekhar*, Alexander Sax*, Roman Bachmann, Jitendra Malik, Amir Zamir
ICCV, 2021
project page / arXiv / github

A Blender-based pipeline for annotating a large-scale dataset of 3D-scanned and artist-generated scenes. Transformer models trained on the resulting data for monocular surface-normal estimation matched human annotators on the OASIS benchmark, and depth models outperformed MiDaS v2.

Robustness via Cross-Domain Ensembles
Oğuzhan Fatih Kar*, Teresa Yeo*, Alexander Sax, Amir Zamir
ICCV, 2021   (Oral Presentation)
project page / arXiv / github

A method for building a robust and diverse ensemble of predictive networks in which the average pairwise disagreement between members provides a measure of uncertainty. The uncertainty estimates can be calibrated with a lightweight post-training step.
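
A minimal PyTorch sketch of that uncertainty signal; `paths` is a hypothetical list of trained networks that each reach the target domain through a different intermediate domain, and the paper's uncertainty-weighted merging is simplified here to a plain mean.

    import itertools

    import torch

    def ensemble_predict(x, paths):
        """Predict a target domain through several cross-domain paths
        (e.g. RGB -> depth -> normals, RGB -> curvature -> normals, ...).
        Each element of `paths` is a stand-in for one trained path."""
        preds = torch.stack([p(x) for p in paths])      # (K, B, C, H, W)
        mean_pred = preds.mean(dim=0)                   # simplified merge

        # Uncertainty: average pairwise disagreement between the paths.
        pairs = itertools.combinations(range(len(paths)), 2)
        disagreement = torch.stack(
            [(preds[i] - preds[j]).abs() for i, j in pairs]
        ).mean(dim=0)                                   # per-pixel estimate
        return mean_pred, disagreement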

Robust Learning Through Cross-Task Consistency
Alexander Sax*, Amir Zamir*, Teresa Yeo, Oğuzhan Fatih Kar, Nikhil Cheerla, Rohan Suri, Zhangjie Cao, Jitendra Malik, Leonidas Guibas
CVPR, 2020   (Best Paper Award Nominee)
project page / arXiv / github

Introducing perceptual consistency losses for large-scale distillation, using learned priors from pretrained networks. We find that monocular priors from depth and surface-normal estimation yield sharper details and better generalization than LPIPS.
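
A minimal sketch of such a consistency loss, assuming a trainable RGB-to-depth model `depth_net` and a frozen pretrained depth-to-normals network `depth_to_normals` (both hypothetical names):

    import torch
    import torch.nn.functional as F

    def consistency_loss(rgb, depth_gt, depth_net, depth_to_normals):
        """Supervise depth directly, and also require that the predicted
        depth, mapped through a frozen learned prior into the normals
        domain, agrees with the normals implied by the ground truth."""
        depth_pred = depth_net(rgb)
        direct = F.l1_loss(depth_pred, depth_gt)

        with torch.no_grad():                   # target needs no gradient
            normals_gt = depth_to_normals(depth_gt)
        # Gradients flow through the (frozen-parameter) prior network.
        normals_pred = depth_to_normals(depth_pred)
        perceptual = F.l1_loss(normals_pred, normals_gt)
        return direct + perceptual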

Side-Tuning: A Baseline for Network Adaptation via Additive Side Networks
Jeffrey O. Zhang, Alexander Sax, Amir Zamir, Leonidas Guibas, Jitendra Malik
ECCV, 2020
project page / arXiv / github

Adding a lightweight side network to modulate a large pretrained network is surprisingly competitive for network adaptation and lifelong learning. We show results in vision, language, and robotics using supervised learning and behavior cloning.
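
The mechanism fits in a few lines; here is a PyTorch sketch, assuming `base` and `side` produce outputs of the same shape:

    import torch
    import torch.nn as nn

    class SideTune(nn.Module):
        """Blend a frozen pretrained base network with a small trainable
        side network; only the side network and the blend are updated."""

        def __init__(self, base: nn.Module, side: nn.Module):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad_(False)        # base stays frozen
            self.side = side                   # lightweight, trainable
            # Start near the base network (sigmoid(4) ~ 0.98).
            self.alpha = nn.Parameter(torch.tensor(4.0))

        def forward(self, x):
            a = torch.sigmoid(self.alpha)
            return a * self.base(x) + (1 - a) * self.side(x)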

Robust Policies via Mid-Level Visual Representations: An Experimental Study in Manipulation and Navigation
Bryan Chen*, Alexander Sax*, Francis E. Lewis, Silvio Savarese, Jitendra Malik, Amir Zamir, Lerrel Pinto
CoRL, 2020
project page / arXiv / github

In data regimes where visual domain randomization achieves near-zero training performance, policies built on pretrained image representations can saturate the benchmark. The pretrained representations also generalize qualitatively better to new environments, including in sim-to-real transfer.

Learning to Navigate Using Mid-Level Visual Priors
Alexander Sax*, Jeffrey O. Zhang*, Bradley Emi, Amir Zamir, Leonidas Guibas, Silvio Savarese, Jitendra Malik
CoRL, 2019   (Winner of CVPR19 Habitat Challenge RGB Track)
project page / arXiv / github

Training end-to-end policies with pretrained image representations requires ~10x less data to reach a given performance than RL from raw pixels.
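
A sketch of this setup, assuming a pretrained `encoder` (e.g. the feature extractor of a surface-normal or depth network) whose flattened features have size `feat_dim`; only the policy head is trained with RL:

    import torch
    import torch.nn as nn

    class MidLevelPolicy(nn.Module):
        """Policy that acts on frozen mid-level visual features
        rather than raw pixels."""

        def __init__(self, encoder: nn.Module, feat_dim: int, num_actions: int):
            super().__init__()
            self.encoder = encoder
            for p in self.encoder.parameters():
                p.requires_grad_(False)       # the visual prior is frozen
            self.policy = nn.Sequential(      # only this head is trained
                nn.Linear(feat_dim, 256), nn.ReLU(),
                nn.Linear(256, num_actions),  # action logits
            )

        def forward(self, rgb):
            with torch.no_grad():
                feats = self.encoder(rgb).flatten(1)
            return self.policy(feats)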

Taskonomy: Disentangling Task Transfer Learning
Amir Zamir, Alexander Sax*, William B. Shen*, Leonidas Guibas, Jitendra Malik, Silvio Savarese
CVPR, 2018   (Best Paper Award)
project page / arXiv / github

A fully computational approach for modeling the structure across 26 fundamental vision tasks. Exploiting this structure with a transfer learning curriculum reduces overall supervision by ~2/3.
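
As a toy illustration (not the paper's method, which estimates transfer affinities empirically and solves the source selection as an integer program), greedy source-task selection from an affinity matrix might look like:

    import numpy as np

    def pick_source_tasks(affinity: np.ndarray, budget: int) -> list:
        """affinity[s, t]: how well task s transfers to task t.
        Greedily pick `budget` source tasks that maximize the best
        available (non-negative) transfer summed over all targets."""
        sources = []
        for _ in range(budget):
            best = max(
                (s for s in range(affinity.shape[0]) if s not in sources),
                key=lambda s: np.maximum(
                    affinity[sources + [s]].max(axis=0), 0.0
                ).sum(),
            )
            sources.append(best)
        return sources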

Gibson Env: Real-World Perception for Embodied Agents
Zhi-Yang He*, Fei Xia*, Amir Zamir*, Alexander Sax, Jitendra Malik, Silvio Savarese
CVPR, 2018   (Spotlight, NVIDIA Pioneering Research Award)
project page / arXiv / github

A rendering and physics simulator to bridge the gap between large-scale simulation and real-world environments. The simulator features a database of thousands of real spaces and uses a neural network for infilling, like an early form of DLSS.

Joint 2D-3D-Semantic Data for Indoor Scene Understanding
Iro Armeni*, Alexander Sax*, Amir R. Zamir, Silvio Savarese
arXiv, 2017
project page / arXiv / github

A large-scale indoor dataset providing mutually registered 2D, 2.5D, and 3D modalities with instance-level semantic annotations. Covers over 6,000 m² across 6 large-scale indoor areas, with 70,000+ RGB images and corresponding depths, surface normals, semantic annotations, and global XYZ coordinates.

Misc.

Academic Service

Graduate Mentor: BAIR Undergraduate Mentoring, 2019-2023
Graduate Admissions (Berkeley 2x, EPFL 1x)
Student Organizer, 3DV 2016
Stanford Junior Class President, 2014-2015

Teaching

Graduate Student Instructor, Berkeley CS 189/289A 2020 (Machine Learning)
Head TA, Stanford CS331B (Representation Learning) Fall 2018
Teaching Assistant, Stanford CS103 (Mathematical Foundations of Computing) 2015

Thanks to Jon Barron for the clean website template.
Last updated December 2024.