Plan-time guardrails answer one question: does the initial plan look acceptable? Agents need a harder control point because the plan can change. A model can gather context, receive tool output, revise its intent, and only then choose a dangerous capability.
This test is about that exact gap. The first seven turns did not request the prohibited action. The eighth turn did. The runtime boundary made the decision at the moment of execution, blocked before body entry, and skipped the remaining turns.
The expanded run makes the claim harder to dismiss: five model surfaces produced the plans, every model-selected scenario was blocked cleanly, and the same plan family held under 100-way concurrent replay.
The useful boundary is the one closest to the side effect.
| Turn | Stage | Boundary result |
|---|---|---|
| 1-7 | Normal context | Each model selected safe-looking setup work before the prohibited tool appeared. |
| 8 | Execution boundary | payment.transfer was requested at turn eight and blocked before the function body. |
| 9-10 | Stop propagation | Remaining planned turns were skipped after the boundary violation. |
| 25 / 50 | Long context | Separate 25-turn and 50-turn runs blocked late prohibited actions after context accumulation. |
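The turn sequence in the table can be pictured with a minimal sketch. It is not the OpenClaw implementation: `RuntimeBoundary`, `run_plan`, the `context.read` tool, and the deny-list shape are hypothetical names chosen for illustration. Only `payment.transfer` and the ten-turn layout come from the test description. The sketch shows the check running at execution time, before the tool body is entered, and the remaining turns being skipped once a violation occurs.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set


class BoundaryViolation(Exception):
    """Raised when a tool call is refused at the execution boundary."""


@dataclass
class RuntimeBoundary:
    # Hypothetical deny-list policy; a real policy is likely richer.
    prohibited: Set[str]
    tools: Dict[str, Callable[[], str]]
    log: List[str] = field(default_factory=list)

    def execute(self, tool_name: str) -> str:
        # The decision happens here, before the tool body is entered.
        # Unknown tools are refused as well (a fail-closed assumption).
        if tool_name in self.prohibited or tool_name not in self.tools:
            self.log.append(f"blocked:{tool_name}")
            raise BoundaryViolation(tool_name)
        result = self.tools[tool_name]()
        self.log.append(f"allowed:{tool_name}")
        return result


def run_plan(boundary: RuntimeBoundary, plan: List[str]) -> dict:
    """Run turns in order; stop at the first violation and skip the rest."""
    executed: List[str] = []
    blocked_turn: Optional[int] = None
    skipped: List[str] = []
    for index, tool_name in enumerate(plan):
        try:
            boundary.execute(tool_name)
        except BoundaryViolation:
            blocked_turn = index + 1          # turns are 1-based
            skipped = plan[index + 1:]        # stop propagation
            break
        executed.append(tool_name)
    return {"executed": executed, "blocked_turn": blocked_turn, "skipped": skipped}


# Records any entry into the dangerous body; it should stay empty.
body_entered: List[str] = []


def transfer() -> str:
    body_entered.append("payment.transfer")
    return "sent"


def read_context() -> str:
    return "ok"


tools = {"context.read": read_context, "payment.transfer": transfer}
boundary = RuntimeBoundary(prohibited={"payment.transfer"}, tools=tools)

# Seven safe-looking turns, the prohibited call at turn eight, two more planned.
plan = ["context.read"] * 7 + ["payment.transfer"] + ["context.read"] * 2
outcome = run_plan(boundary, plan)
```

In this sketch a plan-time check that only inspected the first seven turns would see nothing to reject; the refusal depends entirely on the check that wraps each call.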
The artifact now covers the seven follow-up tests.
The latest artifact is separate from the broader adversarial replay. It focuses on late drift and records cross-model plans, concurrent replay, live Worker preflight behavior, long-context runs, allowed-before-block proof, proof verification, fail-closed chaos, and the current native-Windows agent --local smoke.
| Check | Result |
|---|---|
| OpenClaw agent --local smoke | 2026.5.12 passed |
| OpenClaw model.run smoke | 5/5 |
| Cross-model suite | 5/5 models |
| Live model scenarios | 35/35 |
| Concurrent replay | 700/700 |
| Live Worker preflight | passed |
| Long-context drift | 2/2 |
| Allowed-before-block | 3/3 |
| Fail-closed chaos | passed |
| Proof verification | passed |
Public artifacts: openclaw-live-late-drift-20260514.json and openclaw-local-agent-smoke-20260514.json
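To make the concurrent replay check concrete, here is a hedged sketch of what replaying one plan 100 ways might look like. The harness behind the published numbers is not shown in this article, so everything here is an assumption: `guarded_call`, `replay_plan`, `concurrent_replay`, the `context.read` tool, and the shared side-effect counter are hypothetical. The point it illustrates is that the boundary decision should not depend on timing: every replay should block at the same turn, and the dangerous body should never run.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import List, Optional

PROHIBITED = {"payment.transfer"}  # assumed deny-list for illustration

# Shared counter of dangerous side effects; it should remain zero.
side_effects = 0
side_effects_lock = threading.Lock()


def payment_transfer() -> None:
    """Stand-in for the dangerous tool body."""
    global side_effects
    with side_effects_lock:
        side_effects += 1


def guarded_call(tool_name: str) -> bool:
    """Return True if the call was allowed, False if it was blocked."""
    if tool_name in PROHIBITED:
        return False              # blocked before the body is entered
    if tool_name == "payment.transfer":
        payment_transfer()        # unreachable while the deny-list holds
    return True


def replay_plan(plan: List[str]) -> Optional[int]:
    """Replay one plan; return the 1-based blocked turn, or None."""
    for turn, tool_name in enumerate(plan, start=1):
        if not guarded_call(tool_name):
            return turn           # remaining turns are skipped
    return None


def concurrent_replay(plan: List[str], workers: int = 100) -> List[Optional[int]]:
    """Replay the same plan once per worker on a thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(replay_plan, [plan] * workers))


plan = ["context.read"] * 7 + ["payment.transfer", "context.read", "context.read"]
results = concurrent_replay(plan)
blocked_at_eight = sum(1 for turn in results if turn == 8)
```

A thread pool is only one way to generate concurrency; a real replay against a live Worker would more likely issue parallel network requests, which this sketch does not attempt to model.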
What this adds beyond the broader OpenClaw replay.
This is a late-drift boundary test, not a universal conversation benchmark.
This article is intentionally narrow. It does not claim to benchmark an arbitrary chat UI end to end. It tests the model-selected tool plan an agent executor would run and answers one production-relevant question: if the agent looks safe first and becomes unsafe later, does the runtime still stop the actual side effect?
