The Bugs Hidden Inside a Data Agent Harness
What breaks when tool results, provider payloads, reminders, and subagents cross boundaries invisibly.
A ReAct loop is easy to sketch.
Call the model. If it asks for a tool, run the tool. Return the result. Repeat.
That loop is not the hard part.
The hard part is stopping state from crossing boundaries invisibly: tool calls that lose their matching results, cached data that leaks into the transcript, provider metadata that becomes part of canonical history, and subagents that mutate parent state through the side door.
None of those failures looks dramatic in code. They usually appear later, as a wrong number, a conversation that cannot be replayed, a prompt-cache miss nobody expected, or a subagent result that changed data the parent thought was stable.
This post is about the concrete implementation bugs I try to prevent in a data-agent harness.
It follows from two earlier posts: one on the overall harness design (a data-native ReAct harness without bash, built around a Python interpreter, progressive connector disclosure, and the handle/snapshot pattern), and one on the session cache (handle naming, snapshot dispatch, immutability, eviction, disk-backed spillover).
Tool calls need matching results
A common simplification in toy harnesses is to treat tool calls as just another bit of assistant text.
That breaks down as soon as tools enter the loop.
A tool call creates a concrete obligation: before the next assistant call, the harness must send back a matching tool result for that specific tool-use id. If the assistant asks for three tools, the next user-side tool-result message has to satisfy all three ids. The ordering and ids matter because the provider and the model are both tracking which result belongs to which request.
That is why the harness keeps typed blocks internally:
TextBlock
ToolUseBlock
ToolResultBlock
A message is a role plus a list of blocks. That structure makes the pairing explicit: this tool request produced that tool result, and only after that can the model continue.
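As a sketch, the typed blocks and the pairing invariant might look like the following. The class and field names here are illustrative, not the harness's actual types:

```python
from dataclasses import dataclass

@dataclass
class TextBlock:
    text: str

@dataclass
class ToolUseBlock:
    tool_use_id: str   # the id the provider and model both track
    name: str
    input: dict

@dataclass
class ToolResultBlock:
    tool_use_id: str   # must match a prior ToolUseBlock
    content: str

@dataclass
class Message:
    role: str          # "assistant" or "user"
    blocks: list

def unresolved_tool_ids(messages):
    """Tool-use ids that do not yet have a matching tool result.
    The loop must not call the model again while this is non-empty."""
    requested = {b.tool_use_id for m in messages for b in m.blocks
                 if isinstance(b, ToolUseBlock)}
    answered = {b.tool_use_id for m in messages for b in m.blocks
                if isinstance(b, ToolResultBlock)}
    return requested - answered
```

A guard like `unresolved_tool_ids` makes the obligation checkable: the loop asserts it returns an empty set before every assistant call.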
Bug prevented: the loop accidentally calls the model again while a tool-use id is unresolved, or attaches a query result to the wrong tool call. In a data workflow, that is how the model ends up explaining one table while computing over another.
Tool results need one formatting path
The handle/snapshot pattern says large results go into cache, small results go into context. The implementation question is who decides.
If each tool formats its own output, behaviour drifts quickly. A connector returns a full JSON payload. The interpreter returns stdout inline. list_variables uses a different truncation rule. One tool caches a DataFrame, another pastes the first thousand rows into the conversation. The model then sees different interfaces depending on which tool happened to produce the value.
The harness needs one shared formatter called from every tool-dispatch path. Its policy can be simple:
DataFrames and arrays become handles plus snapshots
short strings and scalars are inlined
small dicts and lists are serialised inline
long strings are cached or truncated consistently
exceptions become readable error strings
The model should learn one rule: small things are answers; large or structured things become handles.
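A minimal sketch of that single formatting path, with a stand-in session cache (the `put`/`snapshot` API and the 500-character inline budget are assumptions, not the harness's real interface):

```python
INLINE_LIMIT = 500  # assumed character budget for inline results

class SessionCache:
    """Minimal stand-in for the session cache."""
    def __init__(self):
        self._values = {}
        self._n = 0

    def put(self, value):
        self._n += 1
        handle = f"var_{self._n}"
        self._values[handle] = value
        return handle

    def snapshot(self, handle):
        v = self._values[handle]
        size = f", len={len(v)}" if hasattr(v, "__len__") else ""
        return f"{type(v).__name__}{size}"

def format_tool_result(value, cache):
    """One policy for every tool: small things inline,
    large or structured things become handles plus snapshots."""
    if isinstance(value, Exception):
        return f"Error: {type(value).__name__}: {value}"
    if isinstance(value, (int, float, bool)) or value is None:
        return repr(value)
    if isinstance(value, str):
        if len(value) <= INLINE_LIMIT:
            return value
        return f"[cached long string as {cache.put(value)}]"
    if isinstance(value, (dict, list)) and len(repr(value)) <= INLINE_LIMIT:
        return repr(value)
    # DataFrames, arrays, large containers: handle plus snapshot
    handle = cache.put(value)
    return f"[stored as {handle}; snapshot: {cache.snapshot(handle)}]"
```

Every tool-dispatch path calls `format_tool_result`; no tool formats its own output.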
Bug prevented: a connector silently dumps a large result into message history, pushes the original user request out of context, and encourages the model to “read” stale sampled text instead of computing over the cached object.
Provider payloads are renderings, not harness state
The generic lesson “adapters should not mutate input” is not the interesting part.
The useful boundary is sharper: canonical harness state and provider-bound payloads are different artefacts.
The harness owns messages, content blocks, tool specs, stop reasons, token counts, cache handles, and the byte-stable system prompt. A provider adapter, or a layer such as LiteLLM above the provider, owns the API-specific rendering of that state.
That matters because provider payloads often need metadata that does not belong in the transcript. Prompt-caching annotations are the obvious example. A provider may need cache-control markers on the system prompt or the latest user message. Those markers are useful in the outbound payload, but they are not part of the canonical conversation.
The adapter can copy harness state, render it into the provider schema, attach provider-specific metadata, call the API, and normalise the response back into harness types. The copy is disposable. The harness state remains the source of truth.
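A sketch of that disposable rendering, assuming an Anthropic-style payload schema with `cache_control` markers (the exact shapes are illustrative):

```python
import copy

def render_for_provider(messages, system_prompt):
    """Render canonical harness state into a provider payload.
    The payload is a disposable copy: cache-control markers are
    attached here and never touch the stored messages."""
    payload_messages = [
        {"role": m["role"], "content": copy.deepcopy(m["content"])}
        for m in messages
    ]
    # Prompt-caching annotations belong to the outbound payload only.
    system = [{"type": "text", "text": system_prompt,
               "cache_control": {"type": "ephemeral"}}]
    if payload_messages and payload_messages[-1]["content"]:
        payload_messages[-1]["content"][-1]["cache_control"] = \
            {"type": "ephemeral"}
    return {"system": system, "messages": payload_messages}
```

The deep copy is the whole point: after the API call returns, the payload is discarded and the harness's messages are byte-identical to before.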
Bug prevented: provider-specific cache-control metadata leaks into stored messages, mutates the prompt prefix, or makes a later replay depend on an API annotation that was never part of the semantic conversation.
Dynamic reminders belong in the suffix
Planner nags and final-turn warnings are useful, but they are current-turn context, not permanent changes to the agent’s instruction hierarchy.
If the harness inserts dynamic reminders into the system prompt, the prompt prefix changes every turn. That breaks the mental model of a stable agent role, and it can also defeat provider prompt caching.
The rule is:
Static instructions live in the system prompt. Dynamic reminders live near the current turn.
In practice, if the latest user message contains tool results, append the reminder as another text block in that same message. If there is no current user message, create one. The model sees the reminder where it is relevant, and the system prompt stays byte-stable.
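That placement rule is a few lines of code. A sketch, assuming a dict-shaped message list and an illustrative `<reminder>` wrapper:

```python
def append_reminder(messages, reminder):
    """Attach a dynamic reminder to the latest user message,
    creating one if needed. The system prompt is never touched."""
    block = {"type": "text", "text": f"<reminder>{reminder}</reminder>"}
    if messages and messages[-1]["role"] == "user":
        # Latest message already carries tool results: ride along.
        messages[-1]["content"].append(block)
    else:
        messages.append({"role": "user", "content": [block]})
    return messages
```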
Bug prevented: a changing “current variables” or “planner state” section gets spliced into the system prompt, causing avoidable prompt-cache misses and making it harder to tell which instructions were actually stable across the run.
Tool handlers should arrive already bound
The loop should not know how every tool gets its dependencies.
Some tools need state:
the interpreter needs the session cache
list_variables needs the session cache
load_connectors needs the connector registry
the planner needs planner state
the subagent tool needs an adapter factory and a tool factory
If the loop has special cases for all of that, it becomes a service container. Adding one tool means editing dispatch logic. Testing one tool means constructing half the harness. The loop stops being the protocol runner and starts owning every dependency boundary.
The cleaner contract is:
A tool spec carries an already-bound handler. The loop calls it with only model-supplied arguments.
Stateful tools can be callable classes, closures, or factory outputs. The loop does not care. Its job is only to find the visible tool by name, call the handler with the tool input, format the result, and append a matching tool-result block.
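A sketch of that contract, with a closure-based factory standing in for however the real harness binds dependencies:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class ToolSpec:
    name: str
    handler: Callable[..., object]  # already bound to its dependencies
    visible: bool = True

def make_list_variables(cache: dict) -> Callable[[], str]:
    """Factory closes over the session cache; the loop never sees it."""
    def list_variables() -> str:
        return ", ".join(sorted(cache)) or "(no variables)"
    return list_variables

def dispatch(specs, name, tool_input):
    """The loop's entire job: find the visible tool by name,
    call its handler with only model-supplied arguments."""
    spec = next(s for s in specs if s.name == name and s.visible)
    return spec.handler(**tool_input)
```

Adding a stateful tool means adding one factory call at construction time; `dispatch` never changes.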
Bug prevented: adding a new stateful tool accidentally changes the central loop, couples dispatch to specific tool internals, or lets one tool’s dependency wiring leak into another tool’s execution path.
Some mutation is the point
Not every mutation is bad. The harness is allowed to mutate state when the mutation is the explicit behaviour the model asked for.
Progressive connector disclosure is the example.
At startup, connector tools may exist in the registry but be invisible. The model sees only the catalogue and the load_connectors tool. When it loads a connector, the registry flips the relevant tool specs to visible. On the next model call, those tools are included.
That mutation is not a side effect of formatting or provider rendering. It is the operation the tool performs.
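A sketch of that explicit mutation (registry shape and method names are assumptions):

```python
class ToolRegistry:
    """Progressive disclosure: connector tools exist from startup
    but only flip to visible when the model loads their connector."""
    def __init__(self):
        self._tools = {}  # name -> [connector, visible]

    def register(self, name, connector, visible=False):
        self._tools[name] = [connector, visible]

    def load_connector(self, connector):
        """The load_connectors tool's handler: an explicit,
        harness-owned visibility mutation."""
        for entry in self._tools.values():
            if entry[0] == connector:
                entry[1] = True

    def visible_tools(self):
        return sorted(n for n, (_, v) in self._tools.items() if v)
```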
The distinction is ownership:
provider payload mutation is a hidden side effect
registry visibility mutation is explicit harness behaviour
Bug prevented: treating all mutation as equally suspicious leads either to over-engineered immutable plumbing or, worse, to hiding real state changes in places the transcript cannot explain. The important question is not “did something mutate?” but “who owns the mutation, and did the model ask for it?”
Subagents need explicit input and output state
A subagent is useful only if it is actually isolated.
If the subagent silently shares the parent cache, parent messages, or parent planner state, it is not a clean-context worker. It is just another loop with shared implicit state.
The default should be strict:
fresh adapter
fresh message history
fresh cache
no recursive subagent tool
no parent state unless explicitly passed
When the parent wants the subagent to inspect data, it should pass input handles by name. The values cross as cache state, not prompt text. The cache post covers the important caveat: the current implementation copies those values by reference, while the intended boundary is type-aware copy so subagent mutations cannot leak back into the parent. Parent-side collision suffixing for published handles is already part of the contract; transparent hydration for disk-backed handles is a planned extension of the same boundary.
The parent should receive two separate channels:
narrative result: final text
data result: published handles plus snapshots
A summary belongs in conversation. A DataFrame belongs in cache.
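A sketch of the intended boundary, with a plain dict standing in for the cache and `run_loop` standing in for the subagent's own ReAct loop (all names hypothetical):

```python
import copy

def run_subagent(run_loop, task, parent_cache, input_handles):
    """Inputs cross as deep copies, so subagent mutations cannot
    leak back. The parent gets two channels: narrative text and
    published handles, merged with collision suffixing."""
    sub_cache = {h: copy.deepcopy(parent_cache[h]) for h in input_handles}
    narrative, published = run_loop(task, sub_cache)
    for name, value in published.items():
        out = name
        while out in parent_cache:   # never overwrite a parent handle
            out += "_sub"
        parent_cache[out] = value
    return narrative
```

Note the deep copy is the intended boundary; as the cache post discusses, the current implementation still copies by reference.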
Bug prevented: a subagent runs an in-place DataFrame operation on what looks like its own working data, but the object is shared with the parent cache. The parent later computes from a handle whose snapshot no longer describes the underlying value.
The pattern underneath all of this
These details look varied:
protocol-shaped transcripts
one shared formatter for tool results
provider payloads as disposable renderings
suffix-only reminders
bound tool handlers
explicit harness-owned mutation
explicit subagent state transfer
But they are all enforcing the same idea:
State should cross boundaries deliberately.
Raw data crosses into cache, not context. Provider-specific formatting crosses inside adapters, not into the loop. Dynamic reminders cross into the suffix, not the system prompt. Subagent inputs cross as handles, not hidden shared state. Tool results cross as typed blocks, not ambiguous strings.
That is what separates a useful harness from a demo loop. It makes later numerical answers inspectable rather than merely plausible.
The loop is simple. The bugs live in the boundaries.