Context and Conversation Management

🧠 Your bot from Chapter 3 already remembers a conversation - but only because you happened to keep one messages array alive in a single process. Pull the plug and the memory is gone; open a second chat and the two talk over each other. The reason is worth saying plainly: the Messages API is stateless. Every client.messages.create call must carry the entire conversation with it, because the server keeps nothing between requests. So the memory has to live in your code, and this chapter is about holding it well - across turns, across the model's context-window ceiling, across users, and across restarts.

You already have new Anthropic(), the env vars, and content-block narrowing from Chapter 1, plus the REPL loop and Telegram token from Chapter 3, so here you only add the stateful layer on top. As always the key comes from the environment - never hardcode it - and Bun auto-loads .env, so there is nothing to import.

History and the system prompt

The shape of memory is a list of turns: a messages array of Anthropic.MessageParam, strictly alternating role: 'user' and role: 'assistant', where each content is a string or an array of content blocks (text, tool_use, tool_result). The one rule that keeps it valid is alternation, and the one move that maintains it is this: after each call, push response.content straight back as an assistant turn before you add the next user message.

// bun run examples/04-context/multi-turn.ts

import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();
const model = process.env.ANTHROPIC_DEFAULT_SONNET_MODEL ?? 'claude-sonnet-4-6';

// The API is stateless: this array is the entire memory you resend every call.
const messages: Anthropic.MessageParam[] = [];

const turns = [
  'My favorite color is teal. Remember it.',
  'What is 12 times 11?',
  'What was my favorite color again?',
];

for (const text of turns) {
  messages.push({ role: 'user', content: text });
  const message = await client.messages.create({ model, max_tokens: 256, messages });

  // Push the response.content block array straight back as the assistant turn.
  messages.push({ role: 'assistant', content: message.content });

  const first = message.content[0];
  const reply = first?.type === 'text' ? first.text : '';
  console.log(`turn ${messages.length / 2} | you: ${text}`);
  console.log(`claude: ${reply}\n`);
}

console.log(`history holds ${messages.length} messages across ${messages.length / 2} turns`);

Notice that the assistant turn is response.content unchanged - the same block array the model returned - so the next call sees the full, faithful history. The printed turn count climbs by two each round, one user and one assistant, which is the alternation made visible. Run the first sample to watch it grow:

bun run examples/04-context/multi-turn.ts
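If you want to check the alternation rule before sending rather than trusting it by eye, a small guard can do it. The following is a minimal sketch, not part of the SDK: `Msg` is a local stand-in for Anthropic.MessageParam so the snippet runs on its own, and `firstBrokenIndex` is a hypothetical helper name.

```typescript
// Local stand-in for Anthropic.MessageParam (assumption: only the role matters here).
type Msg = { role: 'user' | 'assistant'; content: unknown };

// Returns the index of the first message that breaks user/assistant alternation,
// or -1 when the history starts with a user turn and alternates throughout.
function firstBrokenIndex(messages: Msg[]): number {
  return messages.findIndex((m, i) => m.role !== (i % 2 === 0 ? 'user' : 'assistant'));
}

const good: Msg[] = [
  { role: 'user', content: 'My favorite color is teal. Remember it.' },
  { role: 'assistant', content: 'Noted.' },
  { role: 'user', content: 'What was my favorite color again?' },
];

// Two user turns in a row: the assistant reply was never pushed back.
const bad: Msg[] = [
  { role: 'user', content: 'My favorite color is teal. Remember it.' },
  { role: 'user', content: 'What was my favorite color again?' },
];

console.log(firstBrokenIndex(good)); // -1
console.log(firstBrokenIndex(bad)); // 1
```

Calling a guard like this just before create turns a confusing API failure into a local error that points at the exact turn.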

Persona and standing instructions do not belong in a turn - they belong in system, which sits outside the alternation and is sent with every request. You can pass system as a plain string or as an array of text blocks; the array form is what you will want the moment caching enters the picture.

// bun run examples/04-context/system-prompt.ts

import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();
const model = process.env.ANTHROPIC_DEFAULT_SONNET_MODEL ?? 'claude-sonnet-4-6';

// A bare string and a block array set the same persona; pick whichever reads cleaner.
const asString = 'You are Captain Reef, a pirate. Answer in one sentence and end with "Arr!".';
const asBlocks: Anthropic.TextBlockParam[] = [
  { type: 'text', text: 'You are Captain Reef, a pirate.' },
  { type: 'text', text: 'Answer in one sentence and end with "Arr!".' },
];

const messages: Anthropic.MessageParam[] = [];

async function turn(system: string | Anthropic.TextBlockParam[], question: string) {
  messages.push({ role: 'user', content: question });
  const message = await client.messages.create({ model, max_tokens: 256, system, messages });
  const first = message.content[0];
  const reply = first?.type === 'text' ? first.text : '';
  messages.push({ role: 'assistant', content: reply });
  console.log(`> ${question}\n${reply}\n`);
}

// Same system on both turns: the persona and the "Arr!" constraint should survive the follow-up.
await turn(asString, 'What is a variable?');
await turn(asBlocks, 'And a function?');

The same system rides along on both turns, so the persona holds without you ever restating it inside a user message. String or block array, the model reads them identically - the array just gives you a place to attach cache_control later.

Counting tokens and trimming

Here is the wall you will eventually hit: every model has a finite context window, and a conversation that runs long enough will overflow it. You get ahead of that by measuring before you send. client.messages.countTokens takes the same model, system, messages, and tools you are about to pass to create, and returns the input-token count - so you can branch before spending a request.

// bun run examples/04-context/token-counter.ts

import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();
const model = process.env.ANTHROPIC_DEFAULT_SONNET_MODEL ?? 'claude-sonnet-4-6';

// retrieve confirms the model and gives you a label to log; it is not a budget source.
const info = await client.models.retrieve(model);

// A deliberately small threshold so these short demo turns actually trip the trim.
const THRESHOLD = 40;
const system = 'You are a terse assistant. Reply in one short sentence.';
console.log(`counting against ${info.display_name}, trimming above ${THRESHOLD} tokens`);

const messages: Anthropic.MessageParam[] = [];
for (const turn of ['Name a planet.', 'Another?', 'And one more?', 'Last one?']) {
  messages.push({ role: 'user', content: turn });

  // Count the exact payload create will send, then roll the window under threshold.
  let { input_tokens } = await client.messages.countTokens({ model, system, messages });
  while (input_tokens > THRESHOLD && messages.length > 2) {
    messages.splice(0, 2);
    ({ input_tokens } = await client.messages.countTokens({ model, system, messages }));
  }
  console.log(`tokens=${input_tokens} window=${messages.length} msgs`);

  const reply = await client.messages.create({ model, max_tokens: 64, system, messages });
  const first = reply.content[0];
  messages.push({ role: 'assistant', content: first?.type === 'text' ? first.text : '' });
}

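The sample hard-codes a deliberately tiny threshold so that four short turns are enough to trip it; in a real bot, set the budget per model, because a window that is generous on one tier is tight on another. client.models.retrieve only confirms the model ID and supplies a display name for the log line. When the count crosses the threshold, the rolling window drops the oldest user/assistant pair and recounts, repeating until the history fits or only the newest user turn remains.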

Trim in pairs, never one side alone

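Every trim must remove a user turn and its assistant reply together. Drop one side and you break alternation - two user turns in a row, or a history that opens with an assistant turn. Depending on the API version, the next create call either rejects the request or merges the adjacent turns, and in both cases the model no longer sees the conversation as it happened. The window slides by two, always.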
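The pair rule can be isolated into a pure function that is easy to test offline. This is a minimal sketch under stated assumptions: `Msg` is a local stand-in for Anthropic.MessageParam with string content only, `trimInPairs` is a hypothetical helper, and `estimateTokens` is a crude four-characters-per-token guess standing in for client.messages.countTokens.

```typescript
// Local stand-in for Anthropic.MessageParam (assumption: string content only).
type Msg = { role: 'user' | 'assistant'; content: string };

// Crude estimate, roughly four characters per token. Demo only: real counts
// come from the API's token counter.
const estimateTokens = (messages: Msg[]): number =>
  Math.ceil(messages.reduce((sum, m) => sum + m.content.length, 0) / 4);

// Drop the oldest user/assistant pair until the history fits the budget.
// Stops once two or fewer messages remain so the newest turn is never dropped.
function trimInPairs(messages: Msg[], budget: number, count: (m: Msg[]) => number): Msg[] {
  let kept = messages;
  while (count(kept) > budget && kept.length > 2) {
    kept = kept.slice(2); // always two at a time, so the first message stays a user turn
  }
  return kept;
}

const history: Msg[] = [
  { role: 'user', content: 'Name a planet.' },
  { role: 'assistant', content: 'Mars.' },
  { role: 'user', content: 'Another?' },
  { role: 'assistant', content: 'Venus.' },
  { role: 'user', content: 'And one more?' },
];

const trimmed = trimInPairs(history, 8, estimateTokens);
console.log(trimmed.map((m) => `${m.role}: ${m.content}`));
```

Passing the counter in as a parameter keeps the trimming logic independent of how tokens are measured, so the estimate can be swapped for the real count without touching the loop.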

A rolling window is cheap but forgetful: it throws away the early turns wholesale. When those early turns still matter, summarize instead. You call create once with a summarize instruction over the old turns, then replace that whole stretch with a single injected user/assistant pair carrying the summary - history compressed, alternation intact.

// bun run examples/04-context/summarize-history.ts

import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();
const model = process.env.ANTHROPIC_DEFAULT_SONNET_MODEL ?? 'claude-sonnet-4-6';
// A pretend-long history that in a real bot grows turn by turn.
const messages: Anthropic.MessageParam[] = [
  { role: 'user', content: 'I am Ada, planning a five-night Lisbon trip in early May on a tight budget.' },
  { role: 'assistant', content: 'Got it: Ada, Lisbon, early May, five nights, budget-conscious.' },
  { role: 'user', content: 'Vegetarian, and I want to be near the tram lines.' },
  { role: 'assistant', content: 'Noted: vegetarian, lodging close to the historic tram routes.' },
];
// Replace the old turns with one user/assistant pair so roles still alternate.
async function summarize(history: Anthropic.MessageParam[]): Promise<Anthropic.MessageParam[]> {
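  // Flatten each turn to text; a block array would otherwise print as "[object Object]".
  const transcript = history
    .map((m) => `${m.role}: ${typeof m.content === 'string' ? m.content : JSON.stringify(m.content)}`)
    .join('\n');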
  const summary = await client.messages.create({
    model,
    max_tokens: 512,
    messages: [{ role: 'user', content: `Summarize this conversation as durable memory:\n\n${transcript}` }],
  });
  const first = summary.content[0];
  return [
    { role: 'user', content: 'Before we continue, recap what you know about me and my plans so far.' },
    { role: 'assistant', content: first?.type === 'text' ? first.text : '' },
  ];
}
const threshold = 30;
const { input_tokens } = await client.messages.countTokens({ model, messages });
console.log(`history is ${input_tokens} tokens; threshold ${threshold}`);
if (input_tokens > threshold) {
  const compacted = await summarize(messages);
  console.log(`compacted ${messages.length} turns into ${compacted.length}:`);
  for (const m of compacted) console.log(`  ${m.role}: ${m.content}`);
}

The injected pair is the new beginning of messages: one short user turn that asks for the state of things, one assistant turn that holds the summary. Everything before it is gone, but its meaning rides forward in far fewer tokens.
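In a running bot you would usually keep the most recent turns verbatim and summarize only what came before them. The following sketch shows one way to splice the pair in; it assumes the summary text has already been produced by a call like the one above, and `Msg`, `compact`, and the recap wording are illustrative names rather than anything from the SDK.

```typescript
// Local stand-in for Anthropic.MessageParam (assumption: string content only).
type Msg = { role: 'user' | 'assistant'; content: string };

// Replace everything except roughly the last `keepRecent` messages with one
// summary pair. The cut point is rounded down to an even index so the kept
// tail always starts on a user turn and alternation survives.
function compact(history: Msg[], summary: string, keepRecent: number): Msg[] {
  const rawCut = Math.max(0, history.length - keepRecent);
  const cut = rawCut - (rawCut % 2);
  if (cut === 0) return history; // nothing old enough to summarize
  return [
    { role: 'user', content: 'Recap what we have covered so far.' },
    { role: 'assistant', content: summary },
    ...history.slice(cut),
  ];
}

const history: Msg[] = [
  { role: 'user', content: 'I am Ada, planning a five-night Lisbon trip in early May.' },
  { role: 'assistant', content: 'Got it: Ada, Lisbon, early May, five nights.' },
  { role: 'user', content: 'Vegetarian, and I want to be near the tram lines.' },
  { role: 'assistant', content: 'Noted: vegetarian, lodging close to the tram routes.' },
  { role: 'user', content: 'What should I book first?' },
  { role: 'assistant', content: 'Lodging, since May fills up early.' },
];

const compacted = compact(history, 'Ada: Lisbon, early May, five nights, vegetarian, near trams.', 2);
console.log(compacted.map((m) => m.role).join(' > ')); // user > assistant > user > assistant
```

Rounding the cut down rather than up errs on the side of keeping one extra message, which costs a few tokens but never strands an assistant turn at the front of the tail.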

Prompt caching

When a large, stable chunk of context rides along on every request - a long system prompt, a file you keep referencing - you are paying to re-process the same tokens each time. Prompt caching fixes that: add cache_control: { type: 'ephemeral' } to the final text block of your system array (or to a large stable user turn), and the API caches the prompt prefix up to and including that block. Later requests that share the same prefix read it from cache instead of reprocessing it, which cuts both latency and cost.

// bun run examples/04-context/prompt-cache.ts

import Anthropic from '@anthropic-ai/sdk';

const client = new Anthropic();
const model = process.env.ANTHROPIC_DEFAULT_SONNET_MODEL ?? 'claude-sonnet-4-6';

// A byte-stable prefix past Sonnet's ~2048-token minimum: no timestamp or random value, or the cache silently misses.
const stable = 'You are a meticulous code reviewer. '.repeat(900);
const system: Anthropic.TextBlockParam[] = [{ type: 'text', text: stable, cache_control: { type: 'ephemeral' } }];
const messages: Anthropic.MessageParam[] = [{ role: 'user', content: 'Reply with the single word: ok.' }];

async function ask(label: string) {
  const message = await client.messages.create({ model, max_tokens: 16, system, messages });
  const { cache_creation_input_tokens, cache_read_input_tokens } = message.usage;
  console.log(`${label} created=${cache_creation_input_tokens ?? 0} read=${cache_read_input_tokens ?? 0}`);
}

// First request writes the cache; the second, identical request reads it back.
await ask('request 1');
await ask('request 2');

The first request reports cache_creation_input_tokens as it writes the cache; the second, identical request reports a non-zero cache_read_input_tokens - the prefix served from cache, paid for once. Two conditions make or break this:

Requirement	Detail
Minimum size	The cached prefix must exceed the model's floor: ~4096 tokens on Opus 4.x and Haiku 4.5, ~2048 on Sonnet 4.6. Smaller prefixes are never cached.
Byte-stability	The prefix must be byte-for-byte identical across requests. Slip a changing value - a timestamp, a counter - into the cached block and the cache silently misses: you are charged full price with no warning.
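One way to keep the prefix byte-stable is to build the cached block from constants only and push anything volatile into the user turn, which sits after the cache breakpoint. The sketch below illustrates that split without calling the API; `TextBlock`, `Msg`, `STABLE_RULES`, and `buildRequest` are local, hypothetical names, and the block shape mirrors the system array used above.

```typescript
// Local stand-ins for the SDK's text block and message param types.
type TextBlock = { type: 'text'; text: string; cache_control?: { type: 'ephemeral' } };
type Msg = { role: 'user' | 'assistant'; content: string };

// Everything in the cached block is a constant: no clock, no counter, no user data.
const STABLE_RULES = 'You are a meticulous code reviewer.';

function buildRequest(question: string, now: Date): { system: TextBlock[]; messages: Msg[] } {
  return {
    // Cached prefix: identical bytes on every call.
    system: [{ type: 'text', text: STABLE_RULES, cache_control: { type: 'ephemeral' } }],
    // Volatile values ride in the user turn, after the cache breakpoint.
    messages: [{ role: 'user', content: `[${now.toISOString()}] ${question}` }],
  };
}

const first = buildRequest('Review this diff.', new Date(0));
const second = buildRequest('Review this diff.', new Date(60_000));

console.log(JSON.stringify(first.system) === JSON.stringify(second.system)); // true
console.log(first.messages[0]?.content === second.messages[0]?.content); // false
```

Comparing the serialized system blocks in a test is a cheap way to catch a timestamp creeping into the prefix before it shows up as a bill.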

Verify, don't trust

Caching fails quietly, so always confirm it by reading usage.cache_read_input_tokens on the second request. Zero where you expected a hit means your prefix changed or fell under the minimum.
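The check can be wrapped in a small classifier so a log line or an assertion makes the outcome explicit. This is a sketch, assuming the two usage fields arrive as a number or null as in the sample above; `CacheUsage` and `cacheStatus` are hypothetical names, and the token figures in the usage lines are made up for illustration.

```typescript
// Local stand-in for the two usage fields the sample reads from message.usage.
type CacheUsage = {
  cache_creation_input_tokens?: number | null;
  cache_read_input_tokens?: number | null;
};

type CacheStatus = 'hit' | 'write' | 'miss';

// 'hit'   - some of the prefix was served from cache
// 'write' - nothing was read, but a cache entry was created
// 'miss'  - neither: the prefix changed or fell under the minimum size
function cacheStatus(usage: CacheUsage): CacheStatus {
  if ((usage.cache_read_input_tokens ?? 0) > 0) return 'hit';
  if ((usage.cache_creation_input_tokens ?? 0) > 0) return 'write';
  return 'miss';
}

// Illustrative values only, not real API output.
console.log(cacheStatus({ cache_creation_input_tokens: 6300, cache_read_input_tokens: 0 })); // write
console.log(cacheStatus({ cache_creation_input_tokens: 0, cache_read_input_tokens: 6300 })); // hit
console.log(cacheStatus({ cache_creation_input_tokens: null, cache_read_input_tokens: null })); // miss
```

A 'miss' on the second of two identical requests is the signal to go looking for a changing value in the prefix.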

Per-user sessions that survive restarts

Now back to the bot, with everything above in hand. One process serves many chats, so one shared messages array will not do - each Telegram chat.id needs its own history. You key an in-memory Map by chat.id, and to outlast a restart you serialize that Map to a JSON file and load it back on startup.

// bun run examples/04-context/telegram-sessions.ts

import Anthropic from '@anthropic-ai/sdk';

const token = process.env.TELEGRAM_BOT_TOKEN;
if (!token) {
  throw new Error('Set TELEGRAM_BOT_TOKEN in your .env');
}

const base = `https://api.telegram.org/bot${token}`;
const client = new Anthropic();
const model = process.env.ANTHROPIC_DEFAULT_SONNET_MODEL ?? 'claude-sonnet-4-6';

type Update = {
  update_id: number;
  message?: {
    chat: { id: number };
    text?: string;
  };
};

type Entry = [number, Anthropic.MessageParam[]];

// POST a JSON body to one Bot API method and return the parsed response.
async function tg<T>(method: string, body: object): Promise<{ result?: T }> {
  const response = await fetch(`${base}/${method}`, {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify(body),
  });
  return response.json() as Promise<{ result?: T }>;
}

// Long-poll getUpdates forever, yielding one update at a time.
async function* pollUpdates(): AsyncGenerator<Update> {
  let offset = 0;
  while (true) {
    const { result = [] } = await tg<Update[]>('getUpdates', { offset, timeout: 30 });
    for (const update of result) {
      offset = update.update_id + 1;
      yield update;
    }
  }
}

const file = `${import.meta.dir}/sessions.json`;
const sessions = new Map<number, Anthropic.MessageParam[]>();
if (await Bun.file(file).exists()) {
  const saved = (await Bun.file(file).json()) as Entry[];
  for (const [chatId, history] of saved) {
    sessions.set(chatId, history);
  }
}

async function persist(): Promise<void> {
  const entries: Entry[] = [...sessions];
  await Bun.write(file, JSON.stringify(entries, null, 2));
}

for await (const update of pollUpdates()) {
  const chatId = update.message?.chat.id;
  const prompt = update.message?.text?.trim();
  if (chatId === undefined || !prompt) {
    continue;
  }

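  const history = sessions.get(chatId) ?? [];
  history.push({ role: 'user', content: prompt });

  // If the request fails, roll back the unanswered user turn so the stored
  // history never ends up with two user turns in a row.
  const reply = await client.messages
    .create({ model, max_tokens: 1024, messages: history })
    .catch((error: unknown) => {
      history.pop();
      console.error(`chat ${chatId}: request failed`, error);
      return undefined;
    });
  if (!reply) {
    continue;
  }
  history.push({ role: 'assistant', content: reply.content });

  sessions.set(chatId, history);
  await persist();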

  const first = reply.content[0];
  const answer = first?.type === 'text' ? first.text : '...';
  await tg('sendMessage', { chat_id: chatId, text: answer });
}

Each chat's Anthropic.MessageParam[] lives under its own key, so two people never bleed into each other's context. Writing sessions.json after each turn means a crash or a redeploy costs you at most the turn that was in flight - the histories are read back the next time the bot wakes, and every chat resumes from its last saved turn.
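The save and load halves can be pulled out into two pure functions, which makes the round trip testable without Telegram or a file on disk. The sketch below is one possible shape, not the sample's actual code: `Msg` stands in for Anthropic.MessageParam, `serialize` and `restore` are hypothetical helpers, and the defensive parsing is an added assumption for the case where sessions.json is truncated or hand-edited.

```typescript
// Local stand-in for Anthropic.MessageParam.
type Msg = { role: 'user' | 'assistant'; content: unknown };

// A Map is not JSON-serializable directly, so store it as [chatId, history] entries.
function serialize(sessions: Map<number, Msg[]>): string {
  return JSON.stringify(Array.from(sessions.entries()));
}

// Rebuild the Map, skipping anything that is not a [number, array] entry.
// Unparseable input yields an empty Map instead of crashing the bot at startup.
function restore(json: string): Map<number, Msg[]> {
  const sessions = new Map<number, Msg[]>();
  let parsed: unknown;
  try {
    parsed = JSON.parse(json);
  } catch {
    return sessions;
  }
  if (!Array.isArray(parsed)) return sessions;
  for (const entry of parsed) {
    if (!Array.isArray(entry)) continue;
    const [chatId, history] = entry;
    if (typeof chatId === 'number' && Array.isArray(history)) {
      sessions.set(chatId, history as Msg[]);
    }
  }
  return sessions;
}

const live = new Map<number, Msg[]>([
  [101, [{ role: 'user', content: 'hi' }, { role: 'assistant', content: 'hello' }]],
  [202, [{ role: 'user', content: 'ping' }]],
]);

const reloaded = restore(serialize(live));
console.log(reloaded.size, reloaded.get(101)?.length); // 2 2
```

Keys come back as numbers because they are stored as array elements rather than object property names, which JSON would turn into strings.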

What's next: Chapter 5 - Implementing Tools and Function Calling.
