Ankit Goyal

I am a Research Scientist in Robotics at NVIDIA working with Dieter Fox. I did my Ph.D. in Computer Science at Princeton University, where I was advised by Prof. Jia Deng. I completed my Master's at the University of Michigan and my Bachelor's at IIT Kanpur.

I have been fortunate to intern at some wonderful places and work with amazing mentors.

Summer 2021: NVIDIA with Dieter Fox
Winter 2021: Intel with Vladlen Koltun
Summer 2016: Microsoft Research India with Prateek Jain
Summer 2015: USC with Shrikanth Narayanan, Tanaya Guha and Naveen Kumar

Recent News:

June 2025: cuTAMP accepted to RSS 2025 and covered by MIT News.
Feb 2025: 3D-MVP, 3D pretraining for manipulation, was accepted to CVPR 2025.
Jan 2025: HAMSTER, a hierarchical VLA for open-world manipulation, was accepted to ICLR 2025.
Sep 2024: Gave a talk at MILA Robot Learning Seminar.
Aug 2024: ActAIM2, which discovers self-supervised action modes, was accepted to CoRL 2024.
May 2024: RVT-2 accepted to RSS 2024.
May 2024: Gave a keynote talk at the Manipulation Skills Workshop at ICRA.
Oct 2023: Gave a talk at UW Robotics Colloquium along with Caelan Garrett and Iretiayo Akinola.
Aug 2023: Two papers (including one Oral) accepted to CoRL 2023: RVT and "Shelving, Stacking, Hanging".
June 2023: Released Robotic View Transformer for fast and performant 3D manipulation.
May 2023: Selected to be a part of the RSS Pioneers Cohort, 2023.
Apr 2023: Gave a talk at MILA Vision Reading Group. Thanks for the invite!
Older news
- Feb 2023: ProgPrompt accepted to ICRA 2023 - LLM+Robotics led by Ishika and Valts.
- Oct 2022: 6D pose estimation work led by Lahav Lipson won an award in ECCV BOP challenge 2022.
- Sep 2022: Received the NeurIPS Scholar Award.
- Sep 2022: Paper on non-deep networks accepted to NeurIPS 2022.
- Aug 2022: Defended my Ph.D. thesis and started as a Research Scientist at NVIDIA working with Dieter Fox.
- Mar 2022: Two papers accepted to CVPR 2022.
- Aug 2021: Recognized as an outstanding reviewer at ICCV 2021.
- July 2021: Selected for Qualcomm Innovation Fellowship with Zachary Teed!

Email / CV / Google Scholar / Github / LinkedIn / Follow @imankitgoyal


Research Scientist, NVIDIA, Current
PhD, CS, Princeton University, 2018 - 2022
Research Intern, Intel, Winter 2021
MS, CSE, University of Michigan, 2016 - 2018
Research Intern, MSR, Summer 2016
Research Intern, USC, Summer 2015
BTech, EE, IIT Kanpur, 2012 - 2016

Selected Publications

I am interested in understanding various aspects of intelligence, especially reasoning and common sense. In particular, I want to develop computational models for the various reasoning skills that humans possess.

	HAMSTER: Hierarchical Action Models for Open-World Manipulation Yi Li, Yuquan Deng, Jesse Zhang, Joel Jang, Marius Memmel, Raymond Yu, Caelan Reed Garrett, Fabio Ramos, Dieter Fox, Anqi Li^, Abhishek Gupta^, Ankit Goyal^ ICLR 2025* [code] [project page] We introduce HAMSTER, a Hierarchical Vision-Language-Action (VLA) architecture designed for robotic manipulation. This approach effectively combines the advantages of imitation learning models, which require little in-domain robot data, with those of large VLA models that can generalize well.
	3D-MVP: 3D Multiview Pretraining for Robotic Manipulation Shengyi Qian, Kaichun Mo, Valts Blukis, David F. Fouhey, Dieter Fox, Ankit Goyal CVPR 2025 [project page] [paper] We propose 3D multi-view pretraining using MAEs for robot manipulation.
	RVT-2: Learning Precise Manipulation from Few Examples Ankit Goyal, Valts Blukis, Jie Xu, Yijie Guo, Yu-Wei Chao, Dieter Fox RSS 2024 [code] [project page] We study how to build a robotic system that can solve high-precision manipulation tasks from a few demonstrations. Prior works, like PerAct and RVT, have studied few-shot manipulation; however, they often struggle with tasks requiring high precision. We study how to make them more effective, precise, and fast. Using a combination of architectural and system-level improvements, we propose RVT-2, a multitask 3D manipulation model that is 6X faster in training and 2X faster in inference than its predecessor RVT. RVT-2 achieves a new state-of-the-art on RLBench.
	RVT: Robotic View Transformer for 3D Object Manipulation Ankit Goyal, Jie Xu, Yijie Guo, Valts Blukis, Yu-Wei Chao, Dieter Fox CoRL 2023 (Oral) [code] [project page] [video] [slides] RVT is a multi-view transformer for 3D manipulation that is both scalable and accurate. In simulations, a single RVT model works well across 18 RLBench tasks with 249 task variations, achieving 26% higher relative success than existing SOTA (PerAct). It also trains 36X faster than PerAct for achieving the same performance and achieves 2.3X the inference speed of PerAct. Further, RVT can perform a variety of manipulation tasks in the real world with just a few (~10) demonstrations per task.
	Shelving, Stacking, Hanging: Relational Pose Diffusion for Multi-modal Rearrangement Anthony Simeonov, Ankit Goyal, Lucas Manuelli, Lin Yen-Chen, Alina Sarmiento, Alberto Rodriguez, Pulkit Agrawal, Dieter Fox CoRL 2023 [code] [project page] [video] RPDiff rearranges objects into "multimodal" configurations, such as a book inserted in an open slot of a bookshelf. It generalizes to novel geometries, poses, and layouts, and is trained from demonstrations to operate on point clouds.
	Infinite Photorealistic Worlds using Procedural Generation Alexander Raistrick, Lahav Lipson, Zeyu Ma, Lingjie Mei, Mingzhe Wang, Yiming Zuo, Karhan Kayan, Hongyu Wen, Beining Han, Yihan Wang, Alejandro Newell, Hei Law, Ankit Goyal, Kaiyu Yang, and Jia Deng CVPR 2023 [project page] Infinigen is a generator of unlimited high-quality 3D data. Procedural and open-source.
	ProgPrompt: Generating Situated Robot Task Plans using Large Language Models Ishika Singh, Valts Blukis, Arsalan Mousavian, Ankit Goyal, Danfei Xu, Jonathan Tremblay, Dieter Fox, Jesse Thomason, Animesh Garg ICRA 2023 Also in Autonomous Robots, LaRel @ NeurIPS 2022 and LangRob @ CoRL 2022 [project page] We use large language models (LLMs) for task planning in robotics. We construct pythonic prompts, which specify the task, robot capabilities and the environment to seed LLMs.
	IFOR: Iterative Flow Minimization for Robotic Object Rearrangement Ankit Goyal, Arsalan Mousavian, Chris Paxton, Yu-Wei Chao, Brian Okorn, Jia Deng, Dieter Fox CVPR 2022 Also in EAI @ CVPR 2022 [project page] [slides] [poster] IFOR is an end-to-end method for the challenging problem of object rearrangement for unknown objects given an RGBD image of the original and final scenes. It works on cluttered scenes in the real world, while training only on synthetic data.
	Coupled Iterative Refinement for 6D Multi-Object Pose Estimation Lahav Lipson, Zachary Teed, Ankit Goyal, Jia Deng CVPR 2022 [paper] [code] We propose a state-of-the-art 6DOF multi-object pose estimation system. Our system iteratively refines object pose and correspondence.
	Non-deep Networks Ankit Goyal, Alexey Bochkovskiy, Jia Deng, Vladlen Koltun NeurIPS 2022 [code] [poster] [slides] [video] Depth is the hallmark of DNNs. But more depth means more sequential computation and higher latency. This raises the question: is it possible to build high-performing "non-deep" neural networks? We show it is.
	Revisiting Point Cloud Shape Classification with a Simple and Effective Baseline Ankit Goyal, Hei Law, Bowei Liu, Alejandro Newell, Jia Deng ICML 2021 [code] [slides] [poster] [video] Many point-based approaches have been proposed reporting steady benchmark improvements over time. We study the key ingredients of this progress and uncover two critical results. First, auxiliary factors, independent of the model architecture, make a large difference in performance. Second, a very simple projection-based method performs surprisingly well.
	Rel3D: A Minimally Contrastive Benchmark for Grounding Spatial Relations in 3D Ankit Goyal, Kaiyu Yang, Dawei Yang, Jia Deng NeurIPS 2020, Spotlight (Top 4% of submitted papers) [code] [slides] [poster] [video] Understanding spatial relations is important for both humans and robots. We create Rel3D, the first large-scale, human-annotated dataset for grounding spatial relations in 3D. The 3D scenes in Rel3D come in minimally contrastive pairs: two scenes in a pair are almost identical, but a spatial relation holds in one and fails in the other.
	PackIt: A Virtual Environment for Geometric Planning Ankit Goyal, Jia Deng ICML 2020 [code] [slides] [video] Simultaneously reasoning about geometry and planning actions is crucial for intelligent agents. This ability of geometric planning comes in handy while grocery shopping, rearranging a room, managing a warehouse, etc. We create PackIt, a virtual environment that caters to geometric planning.
	Think Visually: Question Answering through Virtual Imagery Ankit Goyal, Jian Wang, Jia Deng ACL 2018 [code] [poster] We study geometric reasoning in the context of question-answering. We introduce Dynamic Spatial Memory Network (DSMN), a deep network architecture designed for answering questions that admit latent visual representations.
	ProtoNN: Compressed and Accurate kNN for Resource-scarce Devices C Gupta, AS Suggala, A Goyal, HV Simhadri, BP, AK, SG, RU, MV, P Jain ICML 2017 [code] Resource-Efficient Machine Learning Prateek Jain, Chirag Gupta, AS Suggala, Ankit Goyal, HV Simhadri US Patent Application We propose ProtoNN, a novel algorithm that addresses the problem of real-time and accurate prediction on resource-scarce devices.
	A Multimodal Mixture-Of-Experts Model for Dynamic Emotion Prediction in Movies Ankit Goyal, Naveen Kumar, Tanaya Guha, Shrikanth S. Narayanan ICASSP 2016 We address the problem of continuous emotion prediction in movies. We propose a Mixture of Experts (MoE)-based fusion model that dynamically combines information from the audio and video modalities for predicting the emotion evoked in movies.