Ankit Goyal


I am a Research Scientist in Robotics at NVIDIA working with Dieter Fox. I did my Ph.D. in Computer Science at Princeton University, where I was advised by Prof. Jia Deng. I completed my Master's at the University of Michigan and my Bachelor's at IIT Kanpur.

I have been fortunate to intern at some wonderful places and work with amazing mentors.


Email  /  CV  /  Google Scholar  /  GitHub  /  LinkedIn


Research Scientist
Princeton University
2018 - 2022
Research Intern
Winter 2021
University of Michigan
2016 - 2018
Research Intern
Summer 2016
Research Intern
Summer 2015
BTech, EE
IIT Kanpur
2012 - 2016

I am interested in understanding various aspects of intelligence, especially reasoning and common sense. In particular, I want to develop computational models for the various reasoning skills that humans possess.

RVT: Robotic View Transformer for 3D Object Manipulation
Ankit Goyal, Jie Xu, Yijie Guo, Valts Blukis, Yu-Wei Chao, Dieter Fox
CoRL 2023 (Oral)
[code] [project page] [video] [slides]

RVT is a multi-view transformer for 3D manipulation that is both scalable and accurate. In simulation, a single RVT model works well across 18 RLBench tasks with 249 task variations, achieving 26% higher relative success than the existing SOTA (PerAct). It also trains 36X faster than PerAct to reach the same performance and achieves 2.3X the inference speed of PerAct. Further, RVT can perform a variety of manipulation tasks in the real world with just a few (~10) demonstrations per task.

Shelving, Stacking, Hanging: Relational Pose Diffusion for Multi-modal Rearrangement
Anthony Simeonov, Ankit Goyal, Lucas Manuelli, Lin Yen-Chen, Alina Sarmiento, Alberto Rodriguez, Pulkit Agrawal, Dieter Fox
CoRL 2023
[code] [project page] [video]

RPDiff rearranges objects into "multimodal" configurations, such as a book inserted in an open slot of a bookshelf. It generalizes to novel geometries, poses, and layouts, and is trained from demonstrations to operate on point clouds.

Infinite Photorealistic Worlds using Procedural Generation
Alexander Raistrick, Lahav Lipson, Zeyu Ma, Lingjie Mei, Mingzhe Wang, Yiming Zuo, Karhan Kayan, Hongyu Wen, Beining Han, Yihan Wang, Alejandro Newell, Hei Law, Ankit Goyal, Kaiyu Yang, and Jia Deng
CVPR 2023
[project page]

Data drives progress in computer vision. Infinigen is a generator of unlimited high-quality 3D data. 100% procedural, no external assets, no AI. Free and open source.

ProgPrompt: Generating Situated Robot Task Plans using Large Language Models
Ishika Singh, Valts Blukis, Arsalan Mousavian, Ankit Goyal, Danfei Xu, Jonathan Tremblay, Dieter Fox, Jesse Thomason, Animesh Garg
ICRA 2023
Also in Autonomous Robots, LaRel @ NeurIPS 2022 and LangRob @ CoRL 2022
[project page]

We use large language models (LLMs) for task planning in robotics. We construct Pythonic prompts, which specify the task, the robot's capabilities, and the environment, to seed LLMs.

IFOR: Iterative Flow Minimization for Robotic Object Rearrangement
Ankit Goyal, Arsalan Mousavian, Chris Paxton, Yu-Wei Chao, Brian Okorn, Jia Deng, Dieter Fox
CVPR 2022
Also in EAI @ CVPR 2022
[project page] [slides] [poster]

IFOR is an end-to-end method for the challenging problem of rearranging unknown objects, given RGBD images of the original and final scenes. It works on cluttered scenes in the real world while training only on synthetic data.

Coupled Iterative Refinement for 6D Multi-Object Pose Estimation
Lahav Lipson, Zachary Teed, Ankit Goyal, Jia Deng
CVPR 2022
[paper] [code]

We propose a state-of-the-art 6DoF multi-object pose estimation system. Our system iteratively refines object pose and correspondence.

Non-deep Networks
Ankit Goyal, Alexey Bochkovskiy, Jia Deng, Vladlen Koltun
NeurIPS 2022
[code] [poster] [slides] [video]


Depth is the hallmark of DNNs. But more depth means more sequential computation and higher latency. This raises the question: is it possible to build high-performing "non-deep" neural networks? We show it is.

Revisiting Point Cloud Shape Classification with a Simple and Effective Baseline
Ankit Goyal, Hei Law, Bowei Liu, Alejandro Newell, Jia Deng
ICML 2021
[code] [slides] [poster] [video]

Many point-based approaches have been proposed, reporting steady benchmark improvements over time. We study the key ingredients of this progress and uncover two critical results. First, auxiliary factors, independent of the model architecture, make a large difference in performance. Second, a very simple projection-based method performs surprisingly well.

Rel3D: A Minimally Contrastive Benchmark for Grounding Spatial Relations in 3D
Ankit Goyal, Kaiyu Yang, Dawei Yang, Jia Deng
NeurIPS 2020, Spotlight (Top 4% of submitted papers)
[code] [slides] [poster] [video]

Understanding spatial relations is important for both humans and robots. We create Rel3D, the first large-scale, human-annotated dataset for grounding spatial relations in 3D. The 3D scenes in Rel3D come in minimally contrastive pairs: two scenes in a pair are almost identical, but a spatial relation holds in one and fails in the other.

PackIt: A Virtual Environment for Geometric Planning
Ankit Goyal, Jia Deng
ICML 2020
[code] [slides] [video]

Simultaneously reasoning about geometry and planning actions is crucial for intelligent agents. This ability of geometric planning comes in handy when grocery shopping, rearranging a room, or managing a warehouse. We create PackIt, a virtual environment that caters to geometric planning.

Think Visually: Question Answering through Virtual Imagery
Ankit Goyal, Jian Wang, Jia Deng
ACL 2018
[code] [poster]

We study geometric reasoning in the context of question answering. We introduce the Dynamic Spatial Memory Network (DSMN), a deep network architecture designed for answering questions that admit latent visual representations.

ProtoNN: Compressed and Accurate kNN for Resource-scarce Devices
C Gupta, AS Suggala, A Goyal, HV Simhadri, BP, AK, SG, RU, MV, P Jain
ICML 2017

Resource-Efficient Machine Learning
Prateek Jain, Chirag Gupta, AS Suggala, Ankit Goyal, HV Simhadri
US Patent Application

We propose ProtoNN, a novel algorithm that addresses the problem of real-time and accurate prediction on resource-scarce devices.

A Multimodal Mixture-Of-Experts Model for Dynamic Emotion Prediction in Movies
Ankit Goyal, Naveen Kumar, Tanaya Guha, Shrikanth S. Narayanan

We address the problem of continuous emotion prediction in movies. We propose a Mixture of Experts (MoE)-based fusion model that dynamically combines information from the audio and video modalities for predicting the emotion evoked in movies.

Object Matching Using Speeded Up Robust Features
NK Verma, Ankit Goyal, A Harsha Vardhan, Rahul Kumar Sevakula, Al Salour
IES 2016

We propose a robust algorithm that detects all instances of a particular object in a scene image using Speeded Up Robust Features (SURF).

Template Matching for Inventory Management using Fuzzy Color Histogram and Spatial Filters
NK Verma, Ankit Goyal, Anadi Chaman, Rahul K Sevakula, Al Salour
ICIEA 2015

We propose a methodology for object counting using color-histogram-based segmentation and spatial filters.


I have been a Teaching Assistant for the following courses:

  • COS529: Advanced Computer Vision at Princeton University [Winter 2020]
  • COS429: Computer Vision at Princeton University [Fall 2018]
  • EECS442: Computer Vision at University of Michigan [Fall 2017, Winter 2018]
