🧭 DIRECT: When and Where Should You Allocate Test-Time Compute in Embodied Planners?
Abstract
Vision-language models (VLMs) are increasingly used as high-level planners for embodied agents, and a common strategy is to scale test-time compute for better capability. But that compute buys uneven, often diminishing gains while adding latency, tokens, and FLOPs. We argue that choosing when and where to spend it is what brings frontier performance to the real world. DIRECT is a routing framework that uses multimodal scene context to allocate compute per task, improving the success–cost frontier over any fixed model. Across three axes (chain-of-thought depth, model size, and memory history), we show that test-time compute is not a uniform lever: each axis yields distinct capability gains. On VLABench, RoboMME, and a physical Franka arm, DIRECT matches or beats a stronger model's success at up to 65% lower latency. Our results show that naively scaling test-time compute is wasteful, and that DIRECT can provide frontier-level planning in robotic systems at a fraction of the cost.
How Test-Time Compute Impacts Embodied Planning
Chain-of-Thought
On 44% of VLABench tasks, the non-thinking model matches or beats the Thinking model at under 2% of the latency, about 63× faster (1.9 s vs. 118 s). Extended reasoning pays off only on the remaining tasks.
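To make the stakes concrete, the short calculation below estimates the average latency of an idealized router that sends exactly those 44% of tasks to the non-thinking model. It is only a back-of-the-envelope sketch: it assumes every task costs the quoted latency for its mode (1.9 s or 118 s) and that the router never misroutes, neither of which is claimed by the results. The function name `mixed_latency_s` is hypothetical.

```python
def mixed_latency_s(fast_share, fast_latency_s, slow_latency_s):
    """Average latency when `fast_share` of tasks go to the fast planner."""
    if not 0.0 <= fast_share <= 1.0:
        raise ValueError("fast_share must lie in [0, 1]")
    return fast_share * fast_latency_s + (1.0 - fast_share) * slow_latency_s


# Figures quoted in the text; treating them as per-task costs is an assumption.
fast_share = 0.44      # share of tasks where non-thinking matches or beats Thinking
non_thinking_s = 1.9   # non-thinking latency in seconds
thinking_s = 118.0     # Thinking latency in seconds

avg_s = mixed_latency_s(fast_share, non_thinking_s, thinking_s)
saving_pct = 100.0 * (thinking_s - avg_s) / thinking_s

print(f"{avg_s:.1f} s average, {saving_pct:.1f}% below always-Thinking")
# prints "66.9 s average, 43.3% below always-Thinking"
```

Under these assumptions, perfect routing would cut average latency by roughly 43% with no loss in success. A real router misroutes some tasks and adds its own overhead.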
Model Size
Scaling from 2B to 235B is non-monotonic: a 32B planner can even run slower than the 235B one. Size mainly governs the breadth of skills a planner can command, which is useful only when a task needs them.
Memory
No single architecture dominates: on easy tasks a lightweight scheme beats MemER at ~10× fewer FLOPs, while MemER and GroundSG lead on hard, long-recall tasks.
DIRECT: An Embodied Routing Framework
From a scene image and instruction, a lightweight router picks the best planner from a fixed pool. It encodes this multimodal information and predicts the quality–cost tradeoff per planner. The router runs once per task (or can be called many times for multi-stage tasks), and deploys zero-shot on hardware.
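The selection step described above can be sketched as follows. This is a minimal illustration and not the DIRECT implementation: it assumes the router has already produced a predicted success probability and a predicted latency for each planner, and it combines them with a linear score. The `PlannerEstimate` class, the `route` function, the `cost_weight` parameter, the planner names, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlannerEstimate:
    """Router prediction for one planner in the pool (values are made up)."""
    name: str
    predicted_success: float    # estimated probability of task success, in [0, 1]
    predicted_latency_s: float  # estimated planning latency in seconds


def route(estimates, cost_weight=0.01):
    """Pick the planner with the best success-minus-weighted-latency score.

    The linear score and the default cost_weight are assumptions made
    for illustration; the source does not specify the scoring rule.
    """
    if not estimates:
        raise ValueError("planner pool is empty")
    return max(
        estimates,
        key=lambda e: e.predicted_success - cost_weight * e.predicted_latency_s,
    )


# Hypothetical predictions for an easy scene and a hard scene.
easy_scene = [
    PlannerEstimate("small-no-thinking", 0.90, 2.0),
    PlannerEstimate("small-thinking", 0.92, 20.0),
]
hard_scene = [
    PlannerEstimate("small-no-thinking", 0.30, 2.0),
    PlannerEstimate("small-thinking", 0.85, 20.0),
]

print(route(easy_scene).name)  # prints "small-no-thinking"
print(route(hard_scene).name)  # prints "small-thinking"
```

In the easy scene the extra reasoning buys two points of predicted success for 18 s of latency, so the cheap planner wins. In the hard scene the success gap is large enough to justify the cost. For multi-stage tasks, the same call could be repeated at each stage with fresh estimates.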
Experimental Results
Results cover 270,000+ simulated routing decisions and 245 hardware trajectories, spanning all three test-time-compute axes.
Multi-Step Grocery Bagging (Heaviest→Lightest), Franka DROID
“Place the fruits from heaviest to lightest in the white bin.”
Planning bars run at 6× speed and are drawn to the same scale across all three planners.
| Planner | Success (%) | Latency (s) |
|---|---|---|
| Qwen3.5-VL 9B (No Thinking) | 47.62 | 2.19 |
| Qwen3.5-VL 9B (Thinking) | 90.48 | 19.58 |
| DIRECT (Ours) | 95.24 | 6.85 |
Routing step-wise, DIRECT exceeds both planners on success (95.24% vs. 90.48% for Thinking) while planning in 6.85 s, about 65% lower latency than Thinking (19.58 s).
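The comparison against the Thinking planner can be checked directly from the table above. The snippet below is a small arithmetic check using only the tabulated values; the helper name `latency_reduction_pct` is hypothetical.

```python
# Success (%) and latency (s) copied from the table above.
results = {
    "Qwen3.5-VL 9B (No Thinking)": {"success": 47.62, "latency_s": 2.19},
    "Qwen3.5-VL 9B (Thinking)": {"success": 90.48, "latency_s": 19.58},
    "DIRECT (Ours)": {"success": 95.24, "latency_s": 6.85},
}


def latency_reduction_pct(baseline_s, candidate_s):
    """Percentage by which the candidate's latency is below the baseline's."""
    return 100.0 * (baseline_s - candidate_s) / baseline_s


thinking = results["Qwen3.5-VL 9B (Thinking)"]
direct = results["DIRECT (Ours)"]

reduction = latency_reduction_pct(thinking["latency_s"], direct["latency_s"])
success_gain = direct["success"] - thinking["success"]

print(f"{reduction:.1f}% lower latency, {success_gain:+.2f} points success")
# prints "65.0% lower latency, +4.76 points success"
```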
Chain-of-Thought
On VLABench, DIRECT has the best routing efficiency in every thinking/non-thinking pair, recovering nearly the quality of the expensive Thinking model at ~30% lower latency.
Model Size
Cumulative routing over 2B–235B turns the erratic size curve monotonic: at the largest model, routing adds 5.1 points of success while cutting average latency by 32.4 s.
Memory
DIRECT beats the best single memory architecture at a fraction of MemER's cost, paying recall overhead only when a task genuinely needs it.
Routing Video Demos
The demos show DIRECT routing on a physical Franka arm, with the planner's reasoning displayed before it acts.