I am a Research Scientist in Robotics at NVIDIA working with Dieter Fox. I did my Ph.D. in Computer Science at Princeton University, where I was advised by Prof. Jia Deng. I completed my Master's at the University of Michigan and my Bachelor's at IIT Kanpur.
I have been fortunate to intern at some wonderful places and work with amazing mentors.
I am interested in understanding various aspects of intelligence, especially reasoning and common sense. In particular, I want to develop computational models for the various reasoning skills that humans possess.
RVT is a multi-view transformer for 3D manipulation that is both scalable and accurate. In simulation, a single RVT model works well across 18 RLBench tasks with 249 task variations, achieving 26% higher relative success than the existing SOTA (PerAct). It also trains 36X faster than PerAct to reach the same performance and achieves 2.3X the inference speed of PerAct. Further, RVT can perform a variety of manipulation tasks in the real world with just a few (~10) demonstrations per task.
RPDiff rearranges objects into "multimodal" configurations, such as a book inserted in an open slot of a bookshelf. It generalizes to novel geometries, poses, and layouts, and is trained from demonstrations to operate on point clouds.
IFOR is an end-to-end method for the challenging problem of rearranging unknown objects, given RGBD images of the original and final scenes. It works on cluttered scenes in the real world, while training only on synthetic data.
Depth is the hallmark of DNNs. But more depth means more sequential computation and higher latency. This raises the question: is it possible to build high-performing "non-deep" neural networks? We show it is.
Many point-based approaches for point cloud classification have been proposed, reporting steady benchmark improvements over time. We study the key ingredients of this progress and uncover two critical findings. First, auxiliary factors independent of the model architecture make a large difference in performance. Second, a very simple projection-based method performs surprisingly well.
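The core of the projection-based idea is to render the raw point cloud into simple depth images from fixed viewpoints and then apply a standard image network. A minimal sketch of such a projection (the resolution and normalization are illustrative assumptions, not the paper's exact setup):

```python
import numpy as np

def project_to_depth_image(points, resolution=32):
    """Orthographically project a point cloud of shape (N, 3) onto the
    XY plane, keeping the largest z value per pixel (a simple depth image).
    Illustrative sketch only; not the paper's exact rendering pipeline."""
    # Normalize each axis to [0, 1].
    span = points.max(axis=0) - points.min(axis=0)
    pts = (points - points.min(axis=0)) / (span + 1e-8)
    img = np.zeros((resolution, resolution), dtype=np.float32)
    xs = np.clip((pts[:, 0] * (resolution - 1)).astype(int), 0, resolution - 1)
    ys = np.clip((pts[:, 1] * (resolution - 1)).astype(int), 0, resolution - 1)
    for x, y, z in zip(xs, ys, pts[:, 2]):
        img[y, x] = max(img[y, x], z)  # keep the frontmost surface per pixel
    return img

# Toy usage: a random cloud becomes a 32x32 depth image,
# which a standard 2D CNN could then classify.
cloud = np.random.rand(1024, 3)
depth = project_to_depth_image(cloud)
```

Rendering from several fixed views yields a small stack of such images, one per view, which can be processed by an off-the-shelf image backbone.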
Understanding spatial relations is important for both humans and robots. We create Rel3D, the first large-scale, human-annotated dataset for grounding spatial relations in 3D. The 3D scenes in Rel3D come in minimally contrastive pairs: two scenes in a pair are almost identical, but a spatial relation holds in one and fails in the other.
Simultaneously reasoning about geometry and planning actions is crucial for intelligent agents. This ability of geometric planning comes in handy while grocery shopping, rearranging a room, managing a warehouse, etc. We create PackIt, a virtual environment that caters to geometric planning.
We study geometric reasoning in the context of question answering. We introduce the Dynamic Spatial Memory Network (DSMN), a deep network architecture designed for answering questions that admit latent visual representations.
We address the problem of continuous emotion prediction in movies. We propose a Mixture of Experts (MoE)-based fusion model that dynamically combines information from the audio and video modalities for predicting the emotion evoked in movies.
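The general Mixture-of-Experts fusion pattern can be sketched as follows: a gating network looks at the joint input and produces weights that dynamically combine per-expert predictions. This is a generic sketch of MoE fusion, not the paper's exact architecture; the expert and gate definitions here are illustrative assumptions.

```python
import numpy as np

def moe_fusion(audio_feat, video_feat, w_gate, experts):
    """Mixture-of-Experts fusion: a gating network weights expert
    predictions based on the concatenated audio+video features.
    Illustrative sketch, not the paper's exact model."""
    x = np.concatenate([audio_feat, video_feat])
    logits = w_gate @ x
    gates = np.exp(logits - logits.max())
    gates /= gates.sum()                                  # softmax over experts
    preds = np.array([e(audio_feat, video_feat) for e in experts])
    return float(gates @ preds)                           # gate-weighted combination

# Toy usage: one "expert" per modality; the gate decides how much to trust each.
rng = np.random.default_rng(0)
a, v = rng.standard_normal(8), rng.standard_normal(8)
experts = [lambda a, v: float(a.mean()),   # audio-driven expert (toy)
           lambda a, v: float(v.mean())]   # video-driven expert (toy)
w_gate = rng.standard_normal((2, 16))      # 2 experts, 16-dim joint input
y = moe_fusion(a, v, w_gate, experts)
```

Because the gates are a softmax, the fused prediction is always a convex combination of the expert outputs; what makes the fusion "dynamic" is that the weights change with each input.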