
RDD: Retrieval-Based Demonstration Decomposer
for Planner Alignment in Long-Horizon Tasks

NeurIPS 2025, San Diego
1University of California, Riverside    2University of Michigan    3Meta AI
*Corresponding author

Trustworthy Autonomous Systems Laboratory (TASL)

Can We Identify Sub-tasks Similar to the Expert-Labeled Ones?

Qualitative results showing RDD performance

Qualitative results of RDD and UVD when decomposing demonstrations from real-world (AgiBotWorld) and simulation (RLBench and LIBERO) benchmarks. RDD robustly identifies sub-tasks that are close to expert sub-task decompositions, while UVD fails to locate keyframes precisely.

Abstract

To tackle long-horizon tasks, recent hierarchical vision-language-action (VLA) frameworks employ vision-language model (VLM)-based planners to decompose complex manipulation tasks into simpler sub-tasks that low-level visuomotor policies can easily handle. Typically, the VLM planner is finetuned to learn to decompose a target task. This finetuning requires target-task demonstrations segmented into sub-tasks by either human annotation or heuristic rules. However, the heuristic sub-tasks can deviate significantly from the training data of the visuomotor policy, which degrades task performance. To address these issues, we propose the Retrieval-based Demonstration Decomposer (RDD), which automatically decomposes demonstrations into sub-tasks by aligning the visual features of the decomposed sub-task intervals with those from the training data of the low-level visuomotor policies. Our method demonstrates superior performance compared to the state-of-the-art sub-task decomposer on both simulation and real-world demonstrations, showcasing robustness across various settings.

The Planner-Visuomotor Dataset Misalignment Problem

Teaser overview of RDD method

The planner-visuomotor dataset misalignment problem: In hierarchical VLAs, the planner, often a powerful VLM, performs task planning and reasoning to break down complex tasks into simpler sub-tasks with step-by-step language instructions. Conditioned on the generated sub-task instructions, a learning-based visuomotor policy, trained on datasets with short-horizon sub-tasks, performs precise manipulation to complete the sub-tasks one by one, thereby completing long-horizon tasks.
A VLM planner typically needs to be finetuned with demonstrations of a given task, where the demonstrations are temporally decomposed into sub-tasks by human annotation or heuristics.
The planner-visuomotor dataset misalignment problem is illustrated in the figure: (a) Two sub-tasks appear in the visuomotor policy's training set, on which the policy has been optimized. (b) Existing sub-task decomposers, such as UVD, use heuristic decomposition rules and may generate "unfamiliar" sub-tasks that are difficult for the low-level visuomotor policy to handle.
Core idea of RDD: As shown in (c), RDD decomposes the demonstration into sub-tasks that are visually similar to those in the training set of the visuomotor policy. The sub-tasks are then used to finetune the high-level planner, which in turn generates sub-task instructions that are "familiar" to the low-level visuomotor policy.
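As a concrete illustration of this alignment idea, here is a minimal sketch (not the official RDD implementation) that scores how "familiar" a candidate sub-task interval is to the visuomotor policy's training data by comparing pooled visual features against features of the training sub-tasks; the feature extraction, mean pooling, and k-nearest-neighbor cosine similarity are our assumptions.

```python
import numpy as np

def interval_feature(frame_features: np.ndarray, start: int, end: int) -> np.ndarray:
    """Pool per-frame visual features of the candidate interval [start, end) into one unit vector."""
    pooled = frame_features[start:end].mean(axis=0)
    return pooled / (np.linalg.norm(pooled) + 1e-8)

def familiarity_score(interval_feat: np.ndarray, train_feats: np.ndarray, k: int = 5) -> float:
    """Average cosine similarity to the k most similar training sub-task features.

    train_feats: row-normalized features of the visuomotor policy's training sub-tasks,
    shape (num_training_subtasks, feature_dim).
    """
    sims = train_feats @ interval_feat  # cosine similarities via dot products
    topk = np.sort(sims)[-k:]           # keep the k nearest training sub-tasks
    return float(topk.mean())
```

An interval that closely resembles sub-tasks the policy was trained on receives a high score; unfamiliar intervals score low and are avoided by the decomposer.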

Retrieval-Based Demonstration Decomposer

RDD method overview and architecture

RDD formulates demonstration decomposition as an optimal partitioning problem, using retrieval with approximate nearest neighbor search (ANNS) and dynamic programming to efficiently find the optimal decomposition strategy.
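Under our assumptions (the paper's exact scoring and constraints may differ), the sketch below shows the optimal-partitioning view: dynamic programming over cut points, where score(i, j) rates how well frames [i, j) match retrieved training sub-tasks (e.g., the familiarity score above), and a bounded maximum sub-task duration keeps the search efficient.

```python
from typing import Callable, List, Tuple

def decompose(T: int, score: Callable[[int, int], float],
              min_len: int = 1, max_len: int = 50) -> List[Tuple[int, int]]:
    """Partition frames [0, T) into contiguous intervals maximizing the total interval score."""
    NEG = float("-inf")
    best = [NEG] * (T + 1)   # best[j]: best total score of any partition of frames [0, j)
    best[0] = 0.0
    prev = [0] * (T + 1)     # prev[j]: start index of the last interval in that best partition
    for j in range(1, T + 1):
        for i in range(max(0, j - max_len), j - min_len + 1):
            if best[i] == NEG:
                continue
            cand = best[i] + score(i, j)
            if cand > best[j]:
                best[j], prev[j] = cand, i
    # Backtrack the cut points into (start, end) intervals.
    intervals, j = [], T
    while j > 0:
        intervals.append((prev[j], j))
        j = prev[j]
    return intervals[::-1]
```

With the duration bound, each frame considers at most `max_len` candidate cuts, so the runtime grows linearly with the demonstration length, matching the scalability behavior reported below.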

Improves End-to-End Performance of Hierarchical VLA

End-to-end evaluation results with the hierarchical VLA RACER

Results are averaged over 10 random seeds. RDD improves the end-to-end performance of the hierarchical VLA RACER and achieves near-oracle performance, with a success rate only 0.2% lower than that of the expert decomposer.

Real-World & Out-of-Distribution Demonstrations

Results on real-world and out-of-distribution demonstrations

Performance (IoU) on AgiBotWorld-Alpha (real-world) and RoboCerebra (out-of-distribution sub-tasks)

Scalability

Runtime scaling of RDD with demonstration length

Linear time complexity of RDD with a bounded maximum sub-task duration. The experiment uses a single CPU core (AMD EPYC 9254). We also provide a conceptual speed evaluation of RDD with the GPU-accelerated ANNS library FAISS; for details, please refer to our paper.
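For illustration only, here is a conceptual sketch of how the retrieval step could be backed by a GPU-accelerated FAISS index; the index type, feature dimension, and parameters below are our assumptions rather than the paper's exact configuration.

```python
import faiss          # requires the faiss-gpu build for the GPU portion
import numpy as np

d = 512                                                     # assumed visual feature dimension
train_feats = np.random.rand(100_000, d).astype("float32")  # placeholder training sub-task features
faiss.normalize_L2(train_feats)                             # so inner product = cosine similarity

quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFFlat(quantizer, d, 1024, faiss.METRIC_INNER_PRODUCT)
index.train(train_feats)                             # learn the coarse clustering
index.add(train_feats)                               # index the training sub-task features

res = faiss.StandardGpuResources()                   # move the index to GPU 0
gpu_index = faiss.index_cpu_to_gpu(res, 0, index)

query = np.random.rand(1, d).astype("float32")       # pooled feature of one candidate interval
faiss.normalize_L2(query)
sims, ids = gpu_index.search(query, 5)               # retrieve the 5 nearest training sub-tasks
```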