Rwik Rana

Papers Blog

A collection of paper summaries and notes, primarily focusing on Robotics, Reinforcement Learning, and Controls.


VENTURA: Adapting Image Diffusion Models for Unified Task Conditioned Navigation

Arthur Zhang, Xiangyun Meng, Luca Calliari, Dong-Ki Kim, Shayegan Omidshafiei, Joydeep Biswas, Ali Agha, Amirreza Shaban

ICRA 2026 · Field AI, UT
Diffusion Model · Representation Learning · Navigation · VLA
Robots must adapt to diverse human instructions and operate safely in unstructured, open-world environments. Recent Vision-Language models (VLMs) offer strong priors for grounding language and perception, but remain difficult to steer for navigation due to differences in action spaces and pretraining objectives that hamper transferability to robotics tasks. Towards addressing this, we introduce VENTURA, a vision-language navigation system that finetunes internet-pretrained image diffusion models for path planning. Instead of directly predicting low-level actions, VENTURA generates a path mask (i.e. a visual plan) in image space that captures fine-grained, context-aware navigation behaviors. A lightweight behavior-cloning policy grounds these visual plans into executable trajectories, yielding an interface that follows natural language instructions to generate diverse robot behaviors. To scale training, we supervise on path masks derived from self-supervised tracking models paired with VLM-augmented captions, avoiding manual pixel-level annotation or highly engineered data collection setups. In extensive real-world evaluations, VENTURA outperforms state-of-the-art foundation model baselines on object reaching, obstacle avoidance, and terrain preference tasks, improving success rates by 33% and reducing collisions by 54% across both seen and unseen scenarios. Notably, we find that VENTURA generalizes to unseen combinations of distinct tasks, revealing emergent compositional capabilities.
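The two-stage interface — a diffusion model emits a path mask in image space, and a lightweight policy grounds it into a trajectory — can be illustrated with a toy sketch. This is not VENTURA's implementation; the mask, the per-row averaging, and the function name are all illustrative assumptions:

```python
# Hypothetical sketch: ground a binary path mask (a "visual plan")
# into image-space waypoints that a low-level policy could track.
def mask_to_waypoints(mask):
    """For each image row, average the columns of 'on' pixels
    to get one (row, col) waypoint along the path mask."""
    waypoints = []
    for r, row in enumerate(mask):
        cols = [c for c, v in enumerate(row) if v]
        if cols:
            waypoints.append((r, sum(cols) / len(cols)))
    return waypoints

# Toy 4x5 path mask drifting to the right.
mask = [
    [0, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 1, 0],
]
print(mask_to_waypoints(mask))  # [(0, 1.0), (1, 1.5), (2, 2.0), (3, 2.5)]
```

In the paper the grounding step is a learned behavior-cloning policy, not a fixed rule like this; the sketch only shows why an image-space plan is a convenient interface between the VLM-scale generator and the robot.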

Unified Latents (UL): How to train your latents

Jonathan Heek, Emiel Hoogeboom, Thomas Mensink, Tim Salimans

arXiv 2026 · Google
Diffusion Model · Representation Learning · Self-Supervised Learning · Foundation Model
We present Unified Latents (UL), a framework for learning latent representations that are jointly regularized by a diffusion prior and decoded by a diffusion model. By linking the encoder’s output noise to the prior’s minimum noise level, we obtain a simple training objective that provides a tight upper bound on the latent bitrate. On ImageNet-512, our approach achieves competitive FID of 1.4, with high reconstruction quality (PSNR) while requiring fewer training FLOPs than models trained on Stable Diffusion latents. On Kinetics-600, we set a new state-of-the-art FVD of 1.3.
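The "upper bound on the latent bitrate" claim rests on a standard rate-distortion identity: the expected number of bits needed to code a latent under a prior is bounded by the KL divergence between the encoder's posterior and that prior. A minimal, generic illustration for scalar Gaussians — this is the textbook bound, not UL's specific training objective:

```python
import math

def gaussian_kl_bits(mu, sigma):
    """KL(N(mu, sigma^2) || N(0, 1)) converted from nats to bits:
    an upper bound on the rate needed to code the latent under a
    standard-normal prior."""
    kl_nats = 0.5 * (sigma ** 2 + mu ** 2 - 1.0 - math.log(sigma ** 2))
    return kl_nats / math.log(2)

print(gaussian_kl_bits(0.0, 1.0))  # posterior matches the prior: 0 bits
print(gaussian_kl_bits(1.0, 1.0))  # shifted mean costs ~0.72 bits
```

UL's contribution is tying the encoder's output noise to the prior's minimum noise level so that this kind of bound falls out of a single diffusion-based objective.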

Latent Action Pretraining from Videos

Seonghyeon Ye, Joel Jang, Byeongguk Jeon, Sejune Joo, Jianwei Yang, Baolin Peng, Ajay Mandlekar, Reuben Tan, Yu-Wei Chao, Bill Yuchen Lin, Lars Liden, Kimin Lee, Jianfeng Gao, Luke Zettlemoyer, Dieter Fox, Minjoon Seo

ICLR 2025 · Meta
FKD · RL · Representation Learning · Video-Based Learning
We introduce Latent Action Pretraining for general Action models (LAPA), an unsupervised method for pretraining Vision-Language-Action (VLA) models without ground-truth robot action labels. Existing Vision-Language-Action models require action labels typically collected by human teleoperators during pretraining, which significantly limits possible data sources and scale. In this work, we propose a method to learn from internet-scale videos that do not have robot action labels. We first train an action quantization model leveraging VQ-VAE-based objective to learn discrete latent actions between image frames, then pretrain a latent VLA model to predict these latent actions from observations and task descriptions, and finally finetune the VLA on small-scale robot manipulation data to map from latent to robot actions. Experimental results demonstrate that our method significantly outperforms existing techniques that train robot manipulation policies from large-scale videos. Furthermore, it outperforms the state-of-the-art VLA model trained with robotic action labels on real-world manipulation tasks that require language conditioning, generalization to unseen objects, and semantic generalization to unseen instructions. Training only on human manipulation videos also shows positive transfer, opening up the potential for leveraging web-scale data for robotics foundation model.
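The first stage — quantizing frame-to-frame change into a discrete latent action with a VQ-VAE-style codebook — amounts to nearest-neighbor assignment in feature space. A toy sketch under stated assumptions (hand-picked 2-D features and codebook, squared-L2 distance; LAPA learns both end to end):

```python
def quantize_action(delta, codebook):
    """Assign a frame-to-frame feature delta to its nearest codebook
    entry (squared L2), yielding a discrete latent action index."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(delta, codebook[i]))

codebook = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]  # 3 toy latent actions
frame_t  = (0.2, 0.5)
frame_t1 = (1.1, 0.4)
delta = tuple(b - a for a, b in zip(frame_t, frame_t1))  # (0.9, -0.1)
print(quantize_action(delta, codebook))  # nearest to (1, 0) -> index 0
```

These discrete indices are what the latent VLA is pretrained to predict from observations and language, before a small amount of robot data maps them to real actions.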

MemDistill: Distilling LiDAR Knowledge into Memory for Camera-Only 3D Object Detection

Donghyeon Kwon, Youngseok Yoon, Hyeongseok Son, Suha Kwak

ICCV 2025 · Samsung
Representation Learning · Cross-Modal Representation Learning · LiDAR · Computer Vision
Camera-based 3D object detection has gained attention for its cost-effectiveness, but it in general lags behind LiDAR-based approaches due to its lack of explicit 3D spatial cues. To take the best of both camera- and LiDAR-based detectors, we propose MemDistill, a novel cross-modal knowledge distillation framework for 3D object detection. MemDistill transfers rich 3D knowledge from a LiDAR-based teacher model to a camera-based student model through a dedicated memory unit and a scene-dependent memory retrieval module. To be specific, our framework distills the teacher's 3D knowledge, optimizes the memory to store that knowledge compactly, and learns the retriever that searches the memory to produce 3D features relevant to the input scene, compensating for the missing LiDAR modality. Experiments on the nuScenes dataset demonstrate that MemDistill significantly improves performance of its camera-only baseline, achieving the state of the art in camera-based 3D object detection.
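A scene-dependent memory retrieval of this kind is typically softmax attention over a bank of stored slots: the camera feature acts as a query, and key similarity decides how much of each stored (LiDAR-distilled) value to blend in. A minimal sketch with made-up 2-D features and keys — the paper's actual memory layout and retriever are learned, not hand-set like this:

```python
import math

def retrieve(query, keys, values):
    """Softmax attention over memory slots: a scene feature (query)
    selects stored knowledge (values) by dot-product key similarity."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

# Two memory slots; this query matches the first key almost exclusively.
keys = [(10.0, 0.0), (0.0, 10.0)]
values = [(1.0, 1.0), (5.0, 5.0)]
print(retrieve((1.0, 0.0), keys, values))  # ~ [1.0, 1.0]
```

At inference time only the camera branch and the memory are needed, which is how the student compensates for the missing LiDAR input.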

Dream to Control: Learning Behaviors by Latent Imagination [Dreamer V1]

Danijar Hafner, Timothy Lillicrap, Jimmy Ba, Mohammad Norouzi

ICLR 2020 · Google
RL · Model-Based RL · Representation Learning · Self-Supervised Learning · Dynamics Learning · FKD
Learned world models summarize an agent's experience to facilitate learning complex behaviors. While learning world models from high-dimensional sensory inputs is becoming feasible through deep learning, there are many potential ways for deriving behaviors from them. We present Dreamer, a reinforcement learning agent that solves long-horizon tasks from images purely by latent imagination. We efficiently learn behaviors by propagating analytic gradients of learned state values back through trajectories imagined in the compact state space of a learned world model. On 20 challenging visual control tasks, Dreamer exceeds existing approaches in data-efficiency, computation time, and final performance.
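The core loop — rolling the learned world model forward purely in latent space and accumulating imagined rewards — can be sketched without any learning machinery. A toy version with a 1-D latent and hand-written policy/dynamics/reward (Dreamer learns all three, and backpropagates analytic value gradients through this rollout rather than just evaluating it):

```python
def imagine_return(state, policy, dynamics, reward, horizon=5, gamma=0.99):
    """Roll out a learned world model purely in latent space and
    accumulate discounted imagined rewards (Dreamer-style imagination)."""
    ret, discount = 0.0, 1.0
    for _ in range(horizon):
        action = policy(state)
        ret += discount * reward(state, action)
        state = dynamics(state, action)   # never touches the environment
        discount *= gamma
    return ret

# Toy 1-D latent: the policy pushes the state toward 0, reward is -|state|.
policy = lambda s: -0.5 * s
dynamics = lambda s, a: s + a
reward = lambda s, a: -abs(s)
print(imagine_return(2.0, policy, dynamics, reward, horizon=3, gamma=1.0))  # -3.5
```

Because every step stays in the compact latent space, long horizons are cheap, which is what buys Dreamer its data efficiency over model-free baselines.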