33 Agents, tools and memory

Where we are. In Ch. 31, the model retrieved text to answer better. Here we take the leap: turning it into an agent that acts —uses tools, runs code, browses— in a loop to accomplish a multi-step goal. We will see the reasoning-action loop, tool use, planning, memory… and, above all, the limits and failure modes —because here the hype far outran the real reliability—.

33.1 The idea in one sentence

An agent is an LLM put in a loop that can observe, reason and act on the world (tools, APIs, code), repeating the cycle until it reaches the goal —going from “a mouth that talks” to “hands and eyes with a to-do list”—.

🧩 Analogy — the mouth, the hands and the eyes. The “flat” LLM only has a mouth: it produces text and that is it. Tools give it eyes (observing results) and hands (acting), and the loop gives it a to-do list it ticks off. That is an agent.

33.2 Key concepts and their role in the transformer

Before diving in, we define this chapter’s terms and what each one is for inside a transformer:

Agent. Definition: an LLM put in a loop that observes, reasons and acts on the world. In the transformer: it goes from “generating text” to acting with effects (tools, code, APIs).
Perceive→reason→act→observe loop. Definition: the cycle that repeats until the goal is reached or the budget runs out. In the transformer: it closes the loop with the world that one-shot generation does not have.
ReAct. Definition: interleaving Thought, Action and Observation. In the transformer: it grounds the reasoning in real external information → less hallucination.
Tools / function calling. Definition: the LLM emits a structured call (name + JSON arguments) that a runtime executes. In the transformer: it connects with constrained decoding (Ch. 29) —the JSON must be valid—.
Code as action. Definition: generating and running code as the action space. In the transformer: precise, composable, with loops and libraries —one format covers many tools—.
Planning. Definition: decomposing a goal into sub-steps before executing. In the transformer: it is the weak spot of the autonomous LLM → sometimes delegated to a classical planner.
Memory (episodic / semantic). Definition: an external store of past interactions, beyond the context. In the transformer: the context is finite and volatile; memory is RAG over the agent’s own history.
Error compounding. Definition: the reliability of an \(n\)-step chain falls \(\approx p^{n}\). In the transformer: it explains why long tasks collapse even when each step is good.
Prompt injection. Definition: malicious instructions hidden in retrieved data. In the transformer: it is made worse in an agent because it has hands (actions with effects).

With that in mind, let’s open the loop.

33.3 What an agent is (and what it is not)

Operationally: an LLM placed in a perceive → reason → act → observe loop, which repeats until it meets the goal or exhausts the budget. Two useful contrasts:

Vs. one-shot generation: the normal LLM produces text in one pass and stops; it does not execute anything nor close the loop with the world.
Vs. RAG (Ch. 31): RAG retrieves context (reading) for an answer; the agent acts (reading and effects). In fact, RAG is a “search tool” inside an agent; the agent generalizes to any action.

(Honest: there is no single canonical definition; “agent” is used loosely. The de facto convention is “LLM + tools + loop + memory + planning”.)

33.4 The reasoning-action loop

The backbone is ReAct (Yao et al. 2023) (we glimpsed it in Ch. 30): it interleaves Thought (reasoning), Action (calling a tool) and Observation (reading the result), in a cycle. The reasoning decides the action; the action grounds the reasoning in real external information → less hallucination than reasoning alone.

🧩 Analogy — think aloud, try, look, repeat. ReAct is like someone debugging code: they form a hypothesis (thought), try a command (action), read the error (observation) and adjust. One step at a time, watching what happened.

On top of that, Reflexion (Shinn et al. 2023) adds verbal self-reflection: after a failure, the agent writes in its memory what went wrong and uses it on the retry —a “reinforcement” through language, without touching the weights—.

33.5 Using tools

How does an LLM “act”? By emitting tool calls:

Toolformer (Schick et al. 2023): the model learns in a self-supervised way when and how to call APIs (calculator, search, calendar…), inserting the calls into the text —it keeps the ones that improve its prediction—.
Function calling: the model emits a structured call (name + JSON arguments), the runtime executes it and returns the result. It connects with constrained decoding (Ch. 29): the call’s JSON must be valid. (Modern “tool” interfaces and standards like MCP expose tools to models in a uniform way.)
Code as action: generating and running code is an extremely powerful tool. PAL (Gao et al. 2023) delegates the computation to an interpreter (avoiding the arithmetic errors of reasoning in language); CodeAct (Wang et al. 2024) unifies all actions into executable Python. Why code? It is precise, composable, has loops and conditionals, reuses libraries, and a single format covers many tools.

(It is measured with benchmarks* like ToolLLM (Qin et al. 2024) —16,000+ real APIs— and API-Bank (M. Li et al. 2023).)*

33.6 Planning

For complex goals, decompose into sub-steps:

Plan-and-Solve (L. Wang et al. 2023): first devise a plan (subtasks), then execute it —it tackles the steps that zero-shot CoT skips—.
ReWOO (Xu et al. 2023): it generates the whole plan up front, without interleaving observations → fewer tokens than the ReAct style.
LLM+P (Liu et al. 2023): the LLM translates the problem into a formal format (PDDL), a classical planner finds the optimal plan, and it is translated back. It externalizes planning precisely because the LLM alone is weak at it.

⚠ Honest — planning is the weak spot

Autonomous long-horizon planning is brittle. In a critical study (Valmeekam et al. 2023), GPT-4 generated autonomously executable plans with only ~12 % success on standard planning domains. This is why approaches like LLM+P delegate to a classical planner instead of trusting the model.

33.7 Memory

The context window is not memory: it is finite, volatile and disappears when the session ends. That is why agents add explicit memory:

Short-term: the context / a scratchpad of the current reasoning.
Long-term: an external vector store of past interactions (it is RAG, Ch. 31, applied to the agent’s own history!). A distinction is drawn between episodic memory (what happened) and semantic memory (distilled facts).

Two canonical architectures: Generative Agents (Park et al. 2023) —a memory stream with retrieval by relevance + recency + importance and periodic reflection (the “Smallville” simulations of 25 agents)— and MemGPT (Packer et al. 2023), which manages memory like an operating system (paging information in and out of the context).

🧩 Analogy — the notebook. External memory is a notebook where the agent writes and rereads, because it cannot fit everything “in its head” (the context). MemGPT is the notebook with a filing system; Generative Agents, the notebook with a relevance index and summaries.

33.8 Multi-agent (brief)

Several agents collaborating or debating: CAMEL (G. Li et al. 2023) (role-play), AutoGen (Wu et al. 2023) (conversation among composable agents), multi-agent debate (Du et al. 2023) (several instances debate → better factuality) and Voyager (G. Wang et al. 2023) (a Minecraft agent that accumulates a library of skills). The AutoGPT/BabyAGI projects (2023) popularized the autonomous loop —more phenomenon/demo than scientific result—.

33.9 Limits and failure modes (the honest heart)

Here is what the demos do not show.

Error compounding. If each step has success probability \(p\), an \(n\)-step chain (idealizing that the steps are independent) succeeds \(\approx p^{n}\):

\[ P(\text{éxito}) \approx p^{\,n} \]

With \(p=0{,}95\) and \(n=20\) steps → ~36 %; with \(n=50\) → ~8 %. Reliability collapses as the chain grows. (It is an illustrative model —the errors are not entirely independent—, but it captures the phenomenon.)

🧩 Analogy — the broken telephone. Each link in the chain fails with low probability, but a long chain almost always breaks somewhere. A many-step agent is that chain.

And the empirical evidence is sobering:

Table 33.1: The gap between demo and production

Benchmark	Best agent	Human
GAIA (Mialon et al. 2024)	GPT-4 + plugins 15 %	92 %
WebArena (Zhou et al. 2024)	GPT-4 14.4 %	78 %
τ-bench (pass^8) (Yao et al. 2024)	GPT-4o ~25 %	—

τ-bench (Yao et al. 2024) is revealing: it measures pass^k (does it succeed consistently k times?). GPT-4o is around 61 % on the first try but ~25 % if you demand 8 successes in a row → brutal inconsistency. (The gaps between models show up in AgentBench (Liu et al. 2024).)

Other failures: hallucinated tool calls or ones with invalid JSON; infinite loops; cost/latency from many calls per task; and a key point from the sober critique (Kapoor et al. 2024): the benchmarks reward accuracy while ignoring cost, and often a simple baseline matches a complex agent at ~50× less cost. Honesty: the agentic hype of 2023-2024 outran the real reliability.

⚠ Security — an agent that acts is far more dangerous

Prompt injection (Ch. 30) is enormously aggravated here: malicious instructions hidden in data the agent retrieves (a web page, an email, a document) can hijack its behavior —exfiltrate data, perform unwanted actions— (Greshake et al. 2023). The difference from a chatbot is that the agent has hands: the attack surface grows with each tool that has side effects.

🧪 Try it — tafagent

tafagent does not orchestrate agents, but an agent accumulates context that grows at every step (history, observations, retrieved memory). Its γ and effective horizon (Ch. 15-20) tell you how far the model really attends to that history: a short horizon explains why an agent “forgets” instructions from the beginning in long chains —and how much KV it will cost you to keep that history—.

33.10 Summary

Agent = LLM in a loop perceive→reason→act→observe; it goes beyond generating (RAG retrieves, the agent acts).
Loop: ReAct (Yao et al. 2023) (Thought/Action/Observation, less hallucination); Reflexion (Shinn et al. 2023) (learns from failures through language).
Tools: Toolformer (learns to call APIs), function calling (valid JSON, Ch. 29), code as action (PAL, CodeAct) — precise and composable.
Planning: Plan-and-Solve, ReWOO, LLM+P (delegates to a classical planner). Honest: autonomous planning is brittle (~12 % GPT-4 (Valmeekam et al. 2023)).
Memory: context ≠ memory; external store (RAG over the agent’s own history); Generative Agents, MemGPT.
Limits (honest): error compounding (\(p^n\) collapses); the demo↔︎production gap (GAIA 15 % vs 92 %; τ-bench pass^8 ~25 %); cost; prompt injection amplified by the power to act. The 2023-24 hype outran the reliability.

Next (Chapter 33): we close Part V by stepping out of text: multimodal transformers —seeing images (ViT), uniting vision and language (CLIP), audio—.

33.11 Exercises

Agent vs RAG. How does “acting” differ from “retrieving”? Why is RAG a special case within an agent?
ReAct. Describe the Thought/Action/Observation cycle. Why does it reduce hallucination compared to reasoning alone?
Code as action. Give two reasons code is a good “action space” for an agent.
Error compounding. With steps at 90 % success, what reliability do you expect on a 10-step task? What does that say about long-horizon agents?
Demo vs production. Cite a figure for the agent-vs-human gap and explain what pass^k measures.
Security. Why is prompt injection more dangerous in an agent than in a chatbot?

References

Du, Yilun, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, and Igor Mordatch. 2023. Improving Factuality and Reasoning in Language Models Through Multiagent Debate. https://arxiv.org/abs/2305.14325.

Gao, Luyu, Aman Madaan, Shuyan Zhou, et al. 2023. “PAL: Program-Aided Language Models.” ICML. https://arxiv.org/abs/2211.10435.

Greshake, Kai, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. 2023. Not What You’ve Signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection. https://arxiv.org/abs/2302.12173.

Kapoor, Sayash, Benedikt Stroebl, Zachary S. Siegel, et al. 2024. AI Agents That Matter. https://arxiv.org/abs/2407.01502.

Li, Guohao, Hasan Abed Al Kader Hammoud, Hani Itani, et al. 2023. CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society. https://arxiv.org/abs/2303.17760.

Li, Minghao et al. 2023. “API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs.” EMNLP. https://arxiv.org/abs/2304.08244.

Liu, Bo et al. 2023. “LLM+p: Empowering Large Language Models with Optimal Planning Proficiency.” https://arxiv.org/abs/2304.11477.

Liu, Xiao et al. 2024. “AgentBench: Evaluating LLMs as Agents.” ICLR. https://arxiv.org/abs/2308.03688.

Mialon, Grégoire et al. 2024. “GAIA: A Benchmark for General AI Assistants.” ICLR. https://arxiv.org/abs/2311.12983.

Packer, Charles et al. 2023. MemGPT: Towards LLMs as Operating Systems. https://arxiv.org/abs/2310.08560.

Park, Joon Sung, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. 2023. “Generative Agents: Interactive Simulacra of Human Behavior.” UIST. https://arxiv.org/abs/2304.03442.

Qin, Yujia et al. 2024. “ToolLLM: Facilitating Large Language Models to Master 16000+ Real-World APIs.” ICLR. https://arxiv.org/abs/2307.16789.

Schick, Timo, Jane Dwivedi-Yu, Roberto Dessì, et al. 2023. Toolformer: Language Models Can Teach Themselves to Use Tools. https://arxiv.org/abs/2302.04761.

Shinn, Noah, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. “Reflexion: Language Agents with Verbal Reinforcement Learning.” NeurIPS. https://arxiv.org/abs/2303.11366.

Valmeekam, Karthik, Matthew Marquez, Sarath Sreedharan, and Subbarao Kambhampati. 2023. “On the Planning Abilities of Large Language Models: A Critical Investigation.” NeurIPS. https://arxiv.org/abs/2305.15771.

Wang, Guanzhi et al. 2023. Voyager: An Open-Ended Embodied Agent with Large Language Models. https://arxiv.org/abs/2305.16291.

Wang, Lei et al. 2023. “Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models.” ACL. https://arxiv.org/abs/2305.04091.

Wang, Xingyao, Yangyi Chen, Lifan Yuan, et al. 2024. “Executable Code Actions Elicit Better LLM Agents.” ICML. https://arxiv.org/abs/2402.01030.

Wu, Qingyun et al. 2023. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. https://arxiv.org/abs/2308.08155.

Xu, Binfeng et al. 2023. ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models. https://arxiv.org/abs/2305.18323.

Yao, Shunyu, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. 2024. Tau-Bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. https://arxiv.org/abs/2406.12045.

Yao, Shunyu, Jeffrey Zhao, Dian Yu, et al. 2023. “ReAct: Synergizing Reasoning and Acting in Language Models.” ICLR. https://arxiv.org/abs/2210.03629.

Zhou, Shuyan et al. 2024. “WebArena: A Realistic Web Environment for Building Autonomous Agents.” ICLR. https://arxiv.org/abs/2307.13854.