Splitting the Context Window in AI SDK

AI SDK lets a tool's execute be an async generator. Each yield is a preliminary result streamed to the UI, and only the last one is written back to the agent's memory. I explore how these preliminary tools - in an orchestrator-worker style - split the context window, save tokens, and make cache hits almost certain.

Abstract

When an agent does a long reasoning or search task by itself, every intermediate step - queries, partial notes, dead ends - is appended to its message history and re-sent on the next turn. The process is useful to show to the user, but you do not want it to live in the agent's memory forever.

AI SDK has a nice mechanism for this. A tool's execute can be an async generator. Every yield becomes a preliminary tool result, and only the last value is the real result that goes back into the agent.

In this post, I will explore this preliminary tool with a small PoC, and why it is good for splitting the context window, saving tokens, and improving cache hits.

Project

I prepared a sample project.

The important package is packages/preliminary-tool-chain-of-thought-server. The structure is simple.

agent-orchestrator.ts - the main agent. It has memory.
agent-worker.ts - a sub-agent that does the actual research.
index.ts - a Hono server that streams the orchestrator's UI messages.

The generator in `execute`

The core is here. The orchestrator exposes a research tool, and the tool's execute is an async generator.

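import {
  createAgentUIStream,
  readUIMessageStream,
  tool,
  type UIMessage,
} from "ai";
// Assumes the worker is exported from agent-worker.ts next to this file.
import { worker } from "./agent-worker";

// `researchInput` is the tool's input schema (it provides `question`);
// its definition is not shown here.
const research = tool({
  description: "Delegate a research task to a worker agent.",
  inputSchema: researchInput,
  // An async generator: each yield is streamed to the UI as a preliminary result.
  async *execute({ question }) {
    // Run the worker on a fresh, single-message conversation.
    const stream = await createAgentUIStream({
      agent: worker,
      uiMessages: [
        { id: "q", role: "user", parts: [{ type: "text", text: question }] },
      ],
    });
    // Each emission is a growing snapshot of the worker's UI message.
    for await (const message of readUIMessageStream<UIMessage>({ stream })) {
      yield message;
    }
  },
  // Only the text parts of the last yielded message go back to the orchestrator.
  toModelOutput: ({ output }) => ({
    type: "text",
    value:
      output?.parts
        .flatMap((p) => (p.type === "text" ? [p.text] : []))
        .join("") ?? "",
  }),
});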
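To see the flattening step in isolation, here is a minimal standalone sketch. The `Part` and `Message` types are simplified stand-ins I made up for illustration (the real `UIMessage` has many more part variants), and `flattenText` is a hypothetical helper that mirrors the `toModelOutput` logic above.

```typescript
// Simplified, hypothetical part types; the real UIMessage type is richer.
type Part =
  | { type: "text"; text: string }
  | { type: "reasoning"; text: string };

interface Message {
  parts: Part[];
}

// Keep only the text parts and join them, as toModelOutput does above.
function flattenText(output: Message | undefined): string {
  return (
    output?.parts
      .flatMap((p) => (p.type === "text" ? [p.text] : []))
      .join("") ?? ""
  );
}

const finalMessage: Message = {
  parts: [
    { type: "reasoning", text: "I should compare the two readings." },
    { type: "text", text: "It is 2 degrees warmer " },
    { type: "reasoning", text: "Now phrase the result." },
    { type: "text", text: "than an hour ago." },
  ],
};

console.log(flattenText(finalMessage)); // "It is 2 degrees warmer than an hour ago."
console.log(flattenText(undefined) === ""); // true
```

Everything that is not a text part is dropped, so the orchestrator sees one short string no matter how many steps the worker took.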

The worker is just another ToolLoopAgent. It runs its own tool loop - in the sample it calls get-weather twice and a calculator, then summarizes.

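import { anthropic } from "@ai-sdk/anthropic";
import { stepCountIs, ToolLoopAgent } from "ai";

// `weather` and `calculator` are tools defined elsewhere in the sample project.
export const worker = new ToolLoopAgent({
  model: anthropic("claude-sonnet-4-6"),
  temperature: 0,
  instructions:
    "Answer the user's question. ... First get the current weather then tell the user that you will fetch it again. Compare the two results. ...",
  tools: { "get-weather": weather, calculator },
  // Cap the worker's tool loop at five steps.
  stopWhen: stepCountIs(5),
});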

The last yield is the result

This is the most important part.

The worker emits a growing UI message as it streams, and we yield that message on every emission. AI SDK streams each yield as a preliminary result. Once the generator finishes, the last yielded value is delivered as the final, non-preliminary result. The output-available tool part type has a flag for this.

{
  state: 'output-available';
  input: ...;
  output: ...;
  preliminary?: boolean; // true on streamed intermediate results, unset on the final result
}

So the generator has two kinds of output mixed in one place.

The intermediate yields - the whole thinking process, streamed to the UI as a chain of thought.
The final value - the last yield - which becomes the real tool result and is the one passed to toModelOutput.

The client renders the preliminary stream as live reasoning, while the orchestrator receives only the flattened final text. The verbose process is shown to the user, but it is never stored in the orchestrator's history.
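The "last yield is the result" rule can be sketched without the SDK. The code below is only a model of the semantics as I understand them, not AI SDK's actual implementation: `ToolEvent`, `fakeWorker`, and `runGeneratorTool` are hypothetical names, and the strings are invented sample data.

```typescript
// Hypothetical event shape, loosely modeled on the `preliminary` flag.
interface ToolEvent<T> {
  output: T;
  preliminary: boolean;
}

// Stand-in for the worker: each yield is a longer snapshot of its progress.
async function* fakeWorker(): AsyncGenerator<string> {
  yield "Checking the weather...";
  yield "Checking the weather... checking again...";
  yield "It is 2 degrees warmer than before.";
}

// Drain the generator: forward every yield as preliminary, then emit the
// last value once more as the final (non-preliminary) result.
async function runGeneratorTool<T>(
  gen: AsyncGenerator<T>,
  onEvent: (event: ToolEvent<T>) => void,
): Promise<T | undefined> {
  let last: T | undefined;
  for await (const value of gen) {
    last = value;
    onEvent({ output: value, preliminary: true });
  }
  if (last !== undefined) {
    onEvent({ output: last, preliminary: false });
  }
  return last;
}

async function demo(): Promise<{
  events: ToolEvent<string>[];
  result: string | undefined;
}> {
  const events: ToolEvent<string>[] = [];
  const result = await runGeneratorTool(fakeWorker(), (e) => {
    events.push(e);
  });
  return { events, result };
}

demo().then(({ events, result }) => {
  console.log(events.length); // 4: three preliminary events plus one final
  console.log(result); // "It is 2 degrees warmer than before."
});
```

In this model the UI would subscribe to `onEvent`, while the agent loop only ever sees the returned value.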

Splitting the context window

Here is why I like this.

The main agent has memory. Every step it takes is appended to its message history and re-sent on the next turn. If the main agent did the search itself, all of that intermediate process would land in that memory too. When the process is large, it inflates the token bill - you pay to re-send the whole scratchpad on every subsequent turn.

The preliminary tool moves that work into a sub-agent. The sub-agent's process runs in its own throwaway context. Only the flattened result is written back into the main agent's memory.

User
  -> Main agent (has memory)
        memory grows by +1 short answer per call
  -> calls research
        -> worker (throwaway context)
              search / reasoning / multiple tool calls
        ... preliminary stream -> UI: chain of thought (shown, NOT stored)
        -> toModelOutput: short answer only -> back to memory

So the context window is split into two.

The worker's window holds the messy, large process. It is discarded when the call ends.
The orchestrator's window holds only a short answer per call. It stays small.

You do not need a separate memory layer for the intermediate steps. The split itself is the memory strategy - the part you want to keep is what toModelOutput returns, and the rest just disappears.
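A rough back-of-the-envelope calculation shows how much the split can matter. The token counts below are made-up assumptions, not measurements from the sample project, and `totalInputTokens` is a hypothetical helper. It models an agent that re-sends its whole history on each turn, with the history growing by a fixed amount per turn.

```typescript
// Total input tokens sent over `turns` model calls when the whole history is
// re-sent each turn and grows by `addedPerTurn` tokens after every turn.
function totalInputTokens(
  baseTokens: number,
  addedPerTurn: number,
  turns: number,
): number {
  let history = baseTokens;
  let total = 0;
  for (let turn = 0; turn < turns; turn++) {
    total += history; // this turn re-sends everything accumulated so far
    history += addedPerTurn; // then the new material is appended
  }
  return total;
}

// All numbers are illustrative assumptions.
const base = 1000; // system prompt plus first user message
const scratchpad = 5000; // intermediate search and reasoning per call
const answer = 200; // flattened final answer per call
const turns = 10;

// Main agent does the research itself: scratchpad stays in its memory.
const inline = totalInputTokens(base, scratchpad + answer, turns);
// Research delegated to a worker: only the short answer is kept.
const split = totalInputTokens(base, answer, turns);

console.log(inline); // 244000
console.log(split); // 19000
```

This ignores user messages and the worker's own token usage. The worker still pays for its scratchpad during the call, but that cost is not re-sent on the main agent's later turns.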

Cache hit

Prompt caching works on a stable prefix. As long as the start of the messages does not change, the cached part hits.

The worker runs a tool loop, and that loop only ever appends to its context - a tool call, then its result, then the next call. Nothing rewrites what came before. So at each step the prefix is exactly the previous step's full context, and the model call almost always hits the cache from the step before.

This is why multiple tool calls are actually good for the cache here. The more steps the worker takes, the more times that growing-but-stable prefix is re-used, so across the whole loop the cache hit becomes nearly certain.
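The append-only argument can be made concrete with a small simulation. It is a sketch under idealized assumptions: every step writes its full context to the cache, the entry is still alive at the next step, and provider details such as minimum cacheable length or breakpoint limits are ignored. `simulateAppendOnlyLoop` and the token counts are hypothetical.

```typescript
interface StepUsage {
  step: number;
  cachedTokens: number; // prefix read from the cache written by the previous step
  freshTokens: number; // newly appended tokens that miss the cache
}

// `appendedAfterStep[i]` is the number of tokens (tool call plus result)
// appended to the context after step i + 1.
function simulateAppendOnlyLoop(
  basePromptTokens: number,
  appendedAfterStep: number[],
): StepUsage[] {
  const usage: StepUsage[] = [];
  let cachedPrefix = 0; // nothing is cached before the first step
  let context = basePromptTokens;
  for (let step = 1; step <= appendedAfterStep.length + 1; step++) {
    usage.push({
      step,
      cachedTokens: cachedPrefix,
      freshTokens: context - cachedPrefix,
    });
    cachedPrefix = context; // this step's whole context becomes the next prefix
    context += appendedAfterStep[step - 1] ?? 0;
  }
  return usage;
}

const usage = simulateAppendOnlyLoop(2000, [300, 400, 250]);
const cached = usage.reduce((sum, u) => sum + u.cachedTokens, 0);
const fresh = usage.reduce((sum, u) => sum + u.freshTokens, 0);

console.log(cached, fresh); // 7000 2950
console.log((cached / (cached + fresh)).toFixed(2)); // "0.70"
```

Only the first step pays full price for the base prompt. After that, each step's uncached portion is just what the previous step appended.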

The cache TTL makes this even clearer. With Anthropic's prompt caching, for example, an entry lives for 5 minutes by default and is refreshed each time it is hit, and a 1-hour TTL is available as an option. For the main agent, that leaves a few things to reason about: do you re-send the whole message history on every call? Will the user send the next message within the TTL, or come back an hour later when the cache is already gone? The worker barely has to care about any of this. Its tool loop fires step after step in one uninterrupted burst, all within seconds, so the cache from the previous step is still warm when the next one runs.
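The difference between the two agents comes down to the gap between consecutive requests. The sketch below assumes a 5-minute TTL that is refreshed on every request, and assumes that a request arriving after expiry writes a new entry. `warmHits` is a hypothetical helper and the gaps are invented.

```typescript
const FIVE_MINUTES_MS = 5 * 60 * 1000;

// For each gap between consecutive requests, report whether the entry
// written or refreshed by the previous request is still alive.
function warmHits(
  gapsMs: number[],
  ttlMs: number = FIVE_MINUTES_MS,
): boolean[] {
  return gapsMs.map((gap) => gap < ttlMs);
}

// Worker: tool-loop steps a few seconds apart (assumed).
const workerStepGaps = [2000, 3500, 1500];
// Main agent: user replies after 30 s, then 1 h, then 2 min (assumed).
const userTurnGaps = [30_000, 3_600_000, 120_000];

console.log(warmHits(workerStepGaps)); // every step is warm
console.log(warmHits(userTurnGaps)); // warm, cold, warm
```

The worker's gaps are set by model and tool latency, which you can bound. The main agent's gaps are set by the user, which you cannot.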

And because it is nearly certain, you can plan around it - how many tool calls to make, where to put the cache breakpoint, how big each step is - instead of just hoping the cache survives. The predictability is the real benefit.

Conclusion

The preliminary tool is just an async generator in execute, but it splits the context window cleanly. The last yield is the result, everything before it is only shown, and the heavy work stays in a throwaway worker context.

So the main agent stays small, and the worker's loop hits the cache almost every step.