Publications

Preprints

DreamControl: Human-Inspired Whole-Body Humanoid Control for Scene Interaction via Guided Diffusion

Dvij Kalaria, Sudarshan Harithas, Pushkal Katara, Sangkyung Kwak, Sarthak Bhagat, S. Shankar Sastry, Srinath Sridhar, Sai Vemprala, Ashish Kapoor, Jonathan Huang

ArXiv Preprint, 2025

TL;DR: DreamControl uses a generative model trained on offline human motion data to synthesize diverse human motion trajectories for a wide range of tasks. These trajectories are retargeted to the humanoid and used to train task-specific RL policies, which can be deployed in simulation and, for some tasks, in the real world.

HITTER: A HumanoId Table TEnnis Robot via Hierarchical Planning and Learning

Zhi Su, Bike Zhang, Nima Rahmanian, Yuman Gao, Qiayuan Liao, Caitlin Regan, Koushil Sreenath, S. Shankar Sastry

ArXiv Preprint, 2025

TL;DR: A model-based planner combined with reference motion tracking enables robust real-world ping-pong play (up to 106 consecutive hits!)

LeVERB: Humanoid Whole-Body Control with Latent Vision-Language Instruction

Haoru Xue*, Xiaoyu Huang*, Dantong Niu*, Qiayuan Liao*, Thomas Kragerud, Jan Tommy Gravdahl, Xue Bin Peng, Guanya Shi, Trevor Darrell, Koushil Sreenath, S. Shankar Sastry (* equal contribution)

ArXiv Preprint, 2025

TL;DR: LeVERB is the first latent whole-body humanoid VLA. We introduce a latent vocabulary as an interface between vision-language and whole-body action to enable expressive task specification and interpolatable execution.

2025

LATTE-MV: Learning to Anticipate Table Tennis Hits from Monocular Videos

Daniel Etaat, Dvij Kalaria, Nima Rahmanian, S. Shankar Sastry

CVPR (Conference on Computer Vision and Pattern Recognition), 2025

TL;DR: LATTE-MV is a scalable system that processes over 800 hours of YouTube table tennis footage to reconstruct 27 hours of clean gameplay. This dataset is then used to train an anticipatory controller that predicts the player's next shot, enabling the robot to pre-position itself and perform better in high-speed rallies.

Full-Order Sampling-Based MPC for Torque-Level Locomotion Control via Diffusion-Style Annealing

Haoru Xue*, Chaoyi Pan*, Zeji Yi, Guannan Qu, Guanya Shi (* equal contribution)

ICRA (International Conference on Robotics and Automation), 2025, Best Paper Finalist

TL;DR: DIAL-MPC is the first training-free method to achieve real-time whole-body torque control using full-order dynamics.