🧭 DIRECT: When and Where Should You Allocate Test-Time Compute in Embodied Planners?

¹Stanford University, ²University of Waterloo, ³NVIDIA
  Equal contribution
DIRECT Overview
Video transcript

We will be talking about our work titled DIRECT: When and Where Should You Allocate Test-Time Compute in Embodied Planners?

Vision-Language Models (VLMs) are increasingly deployed as high-level planners that decompose abstract instructions into executable sub-skills for embodied agents. In this hierarchy, an orchestrating VLM planner generates skill plans that are enacted by a low-level policy.

An emerging strategy to improve performance is scaling the compute cost expended at inference—this cost is known as test-time compute. We investigate three axes of scaling test-time compute for embodied planners: chain-of-thought depth, model size, and memory history.

In our preliminary investigations on these three axes, we find that scaling test-time compute is not free in robotics. Chain-of-thought helps under hidden constraints, yet on 44% of tasks the non-thinking (Instruct) model matches Thinking at far lower cost. Model size broadens skills but scales erratically. Memory helps history-dependent tasks but hurts elsewhere.

These investigations yield three trends: (1) large capability gaps separate cheap and expensive configurations; (2) the gaps are non-uniform, hinging on task-specific demands such as the breadth of skills a planner must command or the recall horizon; and (3) lightweight models suffice on many tasks.

Since any static model fixes a single latency–capability compromise, these observations motivate a framework that infers each task's demands from the scene context and allocates compute per task accordingly.

We introduce our routing framework DIRECT, which aims to learn a mapping from a task's scene and instruction to the planner best suited to solve it. Given a user instruction and scene observation, our lightweight router DIRECT selects the planner with the best quality–cost tradeoff from a fixed pool of VLM planners.

The chosen VLM decomposes the instruction into primitive subskills executed by a low-level VLA policy, while a transition detector triggers replanning for multi-stage tasks. By conditioning on multimodal context, DIRECT allocates test-time compute per task rather than using a single fixed-capability planner.
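The route, plan, execute, replan cycle described here can be pictured as a simple control loop. The sketch below is a toy under stated assumptions, not the released implementation: `run_episode`, the dictionary-based `env`, the `STAGES` difficulty list, and all `toy_*` callables are hypothetical stand-ins for the router, the VLM planner, the low-level VLA policy, and the transition detector.

```python
def run_episode(instruction, env, route, plan, execute, stage_changed, max_rounds=20):
    """Route and plan once per stage; replan when a stage transition is detected.

    Returns the planner chosen in each planning round.
    """
    choices = []
    for _ in range(max_rounds):
        if env["done"]:
            break
        planner = route(instruction, env)          # router call for the current stage
        choices.append(planner)
        for skill in plan(planner, instruction, env):
            before = env["stage"]
            execute(skill, env)                    # stand-in for the low-level policy
            if stage_changed(before, env["stage"]):
                break                              # transition detected: go replan
    return choices


# Toy three-stage task; the difficulty labels are made up for illustration.
STAGES = ["easy", "hard", "easy"]


def toy_route(instruction, env):
    # Assumed rule: spend the expensive planner only on the hard stage.
    return "thinking" if STAGES[env["stage"]] == "hard" else "no_thinking"


def toy_plan(planner, instruction, env):
    return [f"pick_item_{env['stage']}"]


def toy_execute(skill, env):
    env["stage"] += 1
    env["done"] = env["stage"] >= len(STAGES)


def toy_stage_changed(before, after):
    return after != before


env = {"stage": 0, "done": False}
choices = run_episode("bag the groceries", env, toy_route, toy_plan,
                      toy_execute, toy_stage_changed)
print(choices)
```

In this toy run the expensive planner is invoked only for the middle stage, which is the behavior per-stage routing is meant to produce.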

In simulation across VLABench and RoboMME, DIRECT achieves the best routing efficiency in every configuration. It recovers or even outperforms the strongest model at lower latency, and turns the erratic size-scaling curve into monotonic improvement.

On Franka DROID, DIRECT matches or exceeds a stronger model at up to 65% lower latency. On multi-step grocery bagging it reaches 95% success at 7 seconds—versus the Thinking planner's 90% at 20 seconds. We find improvements for chain-of-thought depth and memory routing on hardware.

To conclude, naively scaling test-time compute is wasteful, and our investigations document how different axes of test-time compute show qualitatively different capability profiles. Treating the planner as a dynamic variable enables frontier-level embodied planning at a fraction of the cost, with difficulty inferred from fast-to-compute embeddings.

Abstract

VLMs are increasingly used as high-level planners for embodied agents, and a common strategy is to scale test-time compute for better capability. But that compute buys uneven, often diminishing gains while adding latency, tokens, and FLOPs. We argue that choosing when and where to spend it is what brings frontier performance to the real world. DIRECT is a routing framework that uses multimodal scene context to allocate compute per task, improving the success–cost frontier over any fixed model. Across three axes — chain-of-thought depth, model size, and memory history — we show test-time compute is not a uniform lever as each yields distinct capability gains. On VLABench, RoboMME, and a physical Franka arm, DIRECT matches or beats a stronger model's success at up to 65% lower latency. Our results show that naively scaling test-time compute is wasteful, and that DIRECT can provide frontier-level planning in robotic systems at a fraction of the cost.

How Test-Time Compute Impacts Embodied Planning

  Chain-of-Thought

On 44% of VLABench tasks, the non-thinking model matches or beats Thinking at <2% of the latency — about 63× faster (1.9 s vs 118 s). Reasoning pays off only sometimes.

  Model Size

Scaling 2B→235B is non-monotonic — a 32B planner can even run slower than 235B. Size mainly governs the breadth of skills a planner can command, useful only when a task needs them.

  Memory

No single architecture dominates: on easy tasks a lightweight scheme beats MemER at ~10× fewer FLOPs, while MemER and GroundSG lead on hard, long-recall tasks.

DIRECT: An Embodied Routing Framework

DIRECT framework: a lightweight router maps scene and instruction to the best planner from a fixed pool

From a scene image and instruction, a lightweight router picks the best planner from a fixed pool. It encodes this multimodal information and predicts the quality–cost tradeoff per planner. The router runs once per task (or can be called many times for multi-stage tasks), and deploys zero-shot on hardware.
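One way to picture the selection step is as a utility maximization over the pool. The sketch below is illustrative only: the `PlannerEstimate` type, the linear utility `p_success - lam * latency_s`, the `lam` trade-off weight, and the example numbers are all assumptions made for this sketch, not the scoring rule DIRECT actually uses. In a real system the per-planner estimates would come from the learned router rather than being hard-coded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlannerEstimate:
    """Hypothetical router output for one planner in the pool."""
    name: str
    p_success: float   # predicted probability of task success, in [0, 1]
    latency_s: float   # expected planning latency in seconds


def select_planner(estimates, lam=0.01):
    """Return the estimate with the highest utility.

    Utility is an assumed linear trade-off: p_success - lam * latency_s.
    A larger `lam` penalizes slow planners more heavily.
    """
    if not estimates:
        raise ValueError("planner pool is empty")
    return max(estimates, key=lambda e: e.p_success - lam * e.latency_s)


# Made-up predictions for an easy task and a hard task.
easy_task = [
    PlannerEstimate("no_thinking", p_success=0.90, latency_s=2.0),
    PlannerEstimate("thinking", p_success=0.92, latency_s=20.0),
]
hard_task = [
    PlannerEstimate("no_thinking", p_success=0.30, latency_s=2.0),
    PlannerEstimate("thinking", p_success=0.90, latency_s=20.0),
]

print(select_planner(easy_task).name)
print(select_planner(hard_task).name)
```

With these made-up numbers the cheap planner wins on the easy task, where the expensive one adds little predicted success, and the expensive planner wins on the hard task, where the success gap outweighs the latency penalty.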

Experimental Results

Our results cover 270,000+ simulated routing decisions and 245 hardware trajectories, spanning all three test-time-compute axes.

Multi-Step Grocery Bagging (Heaviest→Lightest), Franka DROID

“Place the fruits from heaviest to lightest in the white bin.”

Planning-time comparison (animation; planning bars run at 6× speed, drawn to the same scale for all three planners):

No thinking: fast, but grabs the wrong fruit. 2.0 s planning · ✗ wrong order
DIRECT: routes per subgoal, thinks only on the hard pick. 20.5 s planning · ✓ correct order
Thinking: reasons on every pick, even the easy ones. 104.6 s planning · ✓ correct, but slow

| Planner | Success (%) | Latency (s) |
| --- | --- | --- |
| Qwen3.5-VL 9B (No Thinking) | 47.62 | 2.19 |
| Qwen3.5-VL 9B (Thinking) | 90.48 | 19.58 |
| DIRECT (Ours) | 95.24 | 6.85 |

Routing step-wise, DIRECT beats both planners on success (95.24% vs. 90.48% and 47.62%) while planning in roughly a third of Thinking's latency (6.85 s vs. 19.58 s).
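As a quick check on how these numbers relate to the "up to 65% lower latency" headline, the arithmetic can be reproduced from the table values. The helper name `latency_reduction_pct` and the dictionary layout are our own choices for this sketch; only the six numbers come from the table.

```python
# Numbers copied from the results table (success in %, latency in seconds).
results = {
    "no_thinking": (47.62, 2.19),
    "thinking": (90.48, 19.58),
    "direct": (95.24, 6.85),
}


def latency_reduction_pct(baseline_s, new_s):
    """Percentage by which new_s is lower than baseline_s."""
    return 100.0 * (1.0 - new_s / baseline_s)


success_gain = results["direct"][0] - results["thinking"][0]
reduction = latency_reduction_pct(results["thinking"][1], results["direct"][1])

print(f"success gain over Thinking: {success_gain:.2f} points")
print(f"latency reduction vs Thinking: {reduction:.1f}%")
```

On these figures DIRECT gains 4.76 points of success over the Thinking planner while cutting its planning latency by about 65%.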

Chain-of-Thought

On VLABench, DIRECT has the best routing efficiency in every thinking/non-thinking pair, recovering close to the expensive model's quality at ~30% lower latency.

Model Size

Cumulative routing over 2B–235B turns the erratic size curve monotonic: at the largest model, it adds 5.1 points of success while cutting average latency by 32.4 s.

Memory

DIRECT beats the best single memory architecture at a fraction of MemER's cost, paying recall overhead only when a task genuinely needs it.

Routing Video Demos

Watch DIRECT route on a physical Franka arm — hit Run to see the planner reason before it acts.
