Plan-time guardrails are not enough. The bad action appeared at turn eight.

The distinct failure mode here is late-stage drift. The agent looked safe for seven turns, then attempted payment.transfer at turn eight. The follow-up suite ran that pattern across five OpenClaw model surfaces, 35 live model-selected scenarios, 700 concurrent replays, a live Worker policy-change preflight, long-context turns, argument drift, delegation, and fail-closed chaos. The current OpenClaw agent --local path also completed on native Windows with OpenClaw 2026.5.12 after routing the lab profile through the PI embedded runtime. Before the execution checks ran, the current OpenClaw model.run surface returned valid responses for all five configured model entries.

Failure mode

Plan-time guardrails answer one question: does the initial plan look acceptable? Agents need a harder control point because the plan can change. A model can gather context, receive tool output, revise its intent, and only then choose a dangerous capability.

This test is about that exact gap. The first seven turns did not request the prohibited action. The eighth turn did. The runtime boundary made the decision at the moment of execution, blocked before body entry, and skipped the remaining turns.

The expanded run makes the claim harder to dismiss: five model surfaces produced the plans, every model-selected scenario was blocked cleanly, and the same plan family held under 100-way concurrent replay.

Conversation path

The useful boundary is the one closest to the side effect.

Turn	Stage	Boundary result
`1-7`	Normal context	Each model selected safe-looking setup work before the prohibited tool appeared.
`8`	Execution boundary	payment.transfer was requested at turn eight and blocked before the function body.
`9-10`	Stop propagation	Remaining planned turns were skipped after the boundary violation.
`25 / 50`	Long context	Separate 25-turn and 50-turn runs blocked late prohibited actions after context accumulation.

Evidence packet

The artifact now covers the seven follow-up tests.

The latest artifact is separate from the broader adversarial replay. It focuses on late drift and records cross-model plans, concurrent replay, live Worker preflight behavior, long-context runs, allowed-before-block proof, proof verification, fail-closed chaos, and the current native-Windows agent --local smoke.

OpenClaw agent --local smoke	`2026.5.12 passed`
OpenClaw model.run smoke	`5/5`
Cross-model suite	`5/5 models`
Live model scenarios	`35/35`
Concurrent replay	`700/700`
Live Worker preflight	`passed`
Long-context drift	`2/2`
Allowed-before-block	`3/3`
Fail-closed chaos	`passed`
Proof verification	`passed`

Public artifact: openclaw-live-late-drift-20260514.json and openclaw-local-agent-smoke-20260514.json

Takeaways

What this adds beyond the broader OpenClaw replay.

A safe-looking initial plan cannot be treated as lasting permission.

Every dangerous tool call needs a fresh execution-boundary decision.

Late-stage drift held across OpenAI, Gemini, DeepSeek, and OpenAI-Codex model-plan surfaces.

The same generated plans held under 700/700 concurrent replay attempts.

The production Worker preflight path denied cloud.deploy after a live policy change.

After the block, remaining turns were skipped instead of letting the run continue after a violation.

Scope

This is a late-drift boundary test, not a universal conversation benchmark.

This article is intentionally narrow. It does not claim a full arbitrary chat UI benchmark. It tests the model-selected tool plan an agent executor would run and answers one production-relevant question: if the agent looks safe first and becomes unsafe later, does the runtime still stop the actual side effect?

Plan-time guardrails are not enough. The bad action appeared at turn eight.

Eight chapters. One continuous agent-control story.

The useful boundary is the one closest to the side effect.

The artifact now covers the seven follow-up tests.

What this adds beyond the broader OpenClaw replay.

This is a late-drift boundary test, not a universal conversation benchmark.