An agent receives a goal and runs a continuous loop — perceiving, planning, acting, observing — until the task is done. This is the complete guide to what that loop actually does at each step.
Imagine you ask a very capable assistant to book you the cheapest available flight to London next month. A standard AI would tell you how to find cheap flights — useful advice, but you still do all the work. An AI agent would open flight search engines, check dates, compare prices, find the cheapest option, and come back with: "Here it is. Should I book it?"
The agent is not answering a question. It is completing a task. The difference seems small at first. It is not.
An agent works by running the same four steps repeatedly until the task is finished:
The agent reads the current situation — the original goal, any results from previous steps, any new information it has gathered. This is its view of the world at this moment.
Based on what it sees, it decides its next action. Should it search the web? Run some code? Read a file? Ask for clarification? It picks the most logical next step toward the goal.
It takes the action — runs the search, executes the code, reads the file. Something in the world changes as a result. There is now a result to look at.
The result is now part of the situation. The agent evaluates: is the goal achieved? If yes, it stops and delivers the output. If not, it goes back to step 1 with new information and decides the next action.
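The four steps above can be sketched as a short loop. This is a toy illustration, not a production framework: `decide` and `act` are hypothetical stand-ins for the model call and the tool execution.

```python
def run_agent(goal, decide, act, max_steps=10):
    """Toy observe/decide/act/evaluate loop.

    `decide` and `act` are placeholders for the model call and the
    tool execution; the names and dict shapes are illustrative only.
    """
    state = {"goal": goal, "history": []}
    for _ in range(max_steps):
        action = decide(state)           # steps 1-2: observe the state, pick the next action
        if action["type"] == "finish":   # step 4: goal achieved, stop and deliver
            return action["output"]
        result = act(action)             # step 3: execute the action
        state["history"].append((action, result))  # the result becomes part of the situation
    raise RuntimeError("step limit reached before the goal was met")
```

Note the explicit `max_steps`: even this toy version needs a hard stop, for reasons covered in the stopping-conditions section later.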
This loop — look, decide, act, observe — is the engine of every AI agent. The intelligence is in the deciding step. The capability is in the acting step. The quality of the output depends on how well the model reasons and how many tools it has available.
Tools are the hands of an AI agent. Without tools, an agent can only produce text — the same as a standard AI. Tools are what let it actually do things: search the web, run code, read and write files, call external APIs.
The agent decides which tool to use at each step. It does not use them blindly — it reasons about which tool is appropriate for the current situation.
The loop is powerful, but it has several recurring failure points that are worth knowing; they are catalogued in detail later in this guide.
The key insight: An agent's quality comes from two things — how well the underlying model reasons, and how good its tools are. A brilliant reasoner with bad tools produces bad results. A great tool set given to a poor reasoner also produces bad results. Both matter equally.
The loop described above — observe, plan, act, observe — is formally called the ReAct pattern (Reasoning + Acting). In practice, most production agent frameworks implement a variant of this with additional structure around memory, error handling, and stopping conditions.
The agent receives a goal. For simple agents, this is the user's message. For more sophisticated systems, goals arrive as structured task objects from an orchestrator agent. The first internal action is decomposition: the model breaks the goal into a sequence of sub-tasks it can address one at a time.
Quality of decomposition is highly dependent on the underlying model's capability. Stronger models (GPT-4o, Claude 3.5 Sonnet and above, Gemini 1.5 Pro and above) produce better decompositions — they anticipate dependencies between sub-tasks and order them correctly. Weaker models produce flat lists that miss dependencies and lead to wasted steps.
The agent is given a list of available tools, each described in natural language with a schema. The model reads the descriptions and the current state of the task, then decides which tool to call. This decision is probabilistic — the model does not "know" which tool is right in a deterministic sense. It generates a tool call based on the most likely correct action given the context.
This is why tool descriptions matter enormously. A tool named "get_data" with a vague description will be called inappropriately. A tool named "search_web_for_current_information" with a clear description of when to use it will be called at the right moments.
The model does not execute the tool itself. It outputs a structured tool call — a JSON object specifying the tool name and parameters. The calling application — the framework code running around the model — intercepts this, executes the actual function, and returns the result.
This separation is critical to understand. The model is a reasoning engine. The framework is the execution engine. The model never directly accesses the internet, runs code, or modifies files. The framework does, on its behalf, based on the model's instructions.
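A minimal sketch of that separation, assuming a hypothetical tool registry on the framework side (the model only ever emits the JSON string):

```python
import json

# Hypothetical registry mapping tool names to real functions on the framework side.
TOOLS = {
    "get_weather": lambda city: f"18°C and cloudy in {city}",
}

def dispatch(tool_call_json):
    """The framework, not the model, executes the call the model emitted."""
    call = json.loads(tool_call_json)
    fn = TOOLS.get(call["name"])
    if fn is None:
        # Guard against hallucinated tools: report the error back to the model.
        return {"error": f"unknown tool: {call['name']}"}
    return {"result": fn(**call["input"])}

# The model's output is just structured text; the framework acts on it:
out = dispatch('{"name": "get_weather", "input": {"city": "London"}}')
```

The `unknown tool` branch matters: returning the error to the model gives it a chance to self-correct instead of failing silently.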
The tool result is injected back into the conversation context as a structured message. The model now has the result in its working memory. It generates its next action — another tool call, a text response to the user, or a stopping signal — based on all the context accumulated so far.
This is where context management becomes critical. Each tool result adds to the context. Long tasks accumulate large contexts. If the context exceeds the model's window limit, earlier information is lost or must be summarised. Most production frameworks handle this with automatic summarisation, external memory (vector databases), or both.
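A naive version of this management might look like the following sketch; real frameworks summarise history or move it to external memory rather than simply dropping it:

```python
def trim_context(messages, max_chars=4000):
    """Naive context-management sketch: keep the first (goal/system)
    message and drop the oldest intermediate results until the rest fits.
    Illustrative only; production systems summarise instead of discarding."""
    head, tail = messages[:1], messages[1:]
    while tail and sum(len(m) for m in head + tail) > max_chars:
        tail.pop(0)  # discard the oldest intermediate result
    return head + tail
```

Keeping the first message is the important detail: dropping it is exactly the "forgotten constraints" failure described in the failure-modes section.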
Agents can have access to multiple types of memory, each with different properties: the model's context window acts as short-term working memory, summarised history preserves older steps in compressed form, and external stores such as vector databases provide long-term memory that persists beyond a single context window.
Self-correction is one of the most distinctive capabilities of a well-designed agent. It operates at two levels:
When a tool call fails or returns an unexpected result, the agent observes the error, reasons about what went wrong, and tries a different approach. If a web search returns no results, the agent might reformulate the query. If code fails to execute, the agent reads the error message and tries to fix it. This is not magic — it is the same reasoning loop applied to the problem of its own failure.
More capable agents can evaluate whether their intermediate outputs are actually moving toward the goal, not just completing steps. If the results of the first three steps suggest the initial approach was wrong, the agent can revise the plan. This requires the model to maintain an accurate model of the goal throughout the process — something weaker models struggle with over long tasks.
An agent without proper stopping conditions will continue indefinitely — and incur API costs with every step. Production agent systems use several types of stopping conditions: a completion signal from the model itself, a maximum step count, a cost or token budget, and a wall-clock timeout.
Important: Maximum steps and cost limits are not optional in production. An agent that enters an unexpected loop can exhaust an API budget in minutes. Always set explicit limits before deploying any agent system.
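One way to enforce such limits is a small guard object charged on every step. The class and thresholds below are illustrative, not taken from any particular framework:

```python
class BudgetGuard:
    """Hard limits on steps and spend. Illustrative only."""

    def __init__(self, max_steps=25, max_cost_usd=1.00):
        self.max_steps, self.max_cost = max_steps, max_cost_usd
        self.steps, self.cost = 0, 0.0

    def charge(self, cost_usd):
        """Call once per loop iteration, before acting on the model's output."""
        self.steps += 1
        self.cost += cost_usd
        if self.steps > self.max_steps:
            raise RuntimeError(f"stopped: exceeded {self.max_steps} steps")
        if self.cost > self.max_cost:
            raise RuntimeError(f"stopped: exceeded ${self.max_cost:.2f} budget")
```

Raising an exception (rather than returning a flag) makes it impossible for the loop to accidentally ignore the limit.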
Tool hallucination: The agent invents a tool that does not exist in its available set, or calls a real tool with invented parameters. Result: an error, or silently wrong output.

Prompt injection: Malicious content in the environment (a web page, a document the agent reads) contains instructions that override the agent's goal. The agent follows the injected instructions instead.

Goal drift: Over many steps, the agent loses track of the original goal and optimises for an easier or more obvious sub-goal. Common in very long tasks.

Infinite retry loops: A step fails repeatedly. Without a maximum retry limit, the agent loops on the same failing action indefinitely.

Context overflow: The accumulated context from many tool results exceeds the model's window. Earlier instructions are lost, causing the agent to "forget" constraints or goals set at the start.

Premature completion: The agent decides the goal is complete when it is not — a partial result is presented as a final output without flagging the gaps.
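The infinite-retry mode above is usually prevented with a bounded retry wrapper. A sketch, with hypothetical names:

```python
def call_with_retries(tool, args, max_retries=3):
    """Bounded retry: gives up and surfaces the error after a fixed
    number of attempts, instead of looping on a failing action forever."""
    last_error = None
    for attempt in range(max_retries):
        try:
            return tool(**args)
        except Exception as e:
            last_error = e  # a real agent could also reformulate args here
    raise RuntimeError(f"gave up after {max_retries} attempts: {last_error}")
```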
The ReAct framework (Yao et al., 2022, arXiv:2210.03629) defines agent behaviour as an interleaved sequence of Thought, Action, and Observation triplets. Formally:
Thought → Action → Observation, where the Action is a tool call such as search("product pricing 2026"). The sequence T→A→O repeats until a terminal action (producing the final answer) is generated. The key finding of the original paper was that making reasoning explicit — forcing the model to state its thinking before acting — substantially improved accuracy on knowledge-intensive tasks compared to either chain-of-thought reasoning alone or action-only approaches.
All major LLM APIs implement tool calling with broadly consistent mechanics, following a pattern established by OpenAI's function calling feature (June 2023) and adopted by Anthropic (tool use, November 2023) and Google (function calling, Gemini API).
A tool is defined as a JSON object with three required fields:
- name — the function identifier, used in the model's output when calling the tool
- description — natural language explanation of what the tool does and when to use it. This is what the model reads to decide whether to call the tool.
- input_schema — JSON Schema object specifying the parameters the tool accepts, their types, and which are required

When the model decides to call a tool, the API response contains a content block of type tool_use (Anthropic terminology) or function_call (OpenAI terminology) rather than or alongside a text block. The calling application is responsible for detecting this, executing the referenced function, and returning the result as a tool_result content block in the next API request.
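A hypothetical tool definition in this shape (the weather tool and its field contents are invented for illustration):

```python
# Invented example of a tool definition with the three required fields.
weather_tool = {
    "name": "get_current_weather",
    "description": (
        "Get the current weather for a city. Use this whenever the user "
        "asks about present-day weather conditions."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name, e.g. 'London'"},
        },
        "required": ["city"],
    },
}
```

Notice how the description states not just what the tool does but when to use it; as discussed above, this is what steers the model toward calling it at the right moments.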
The model never executes code, accesses network resources, or modifies files directly. All actions are mediated by the calling application. This is both a security property and a capability boundary — the agent can only do what the calling application has implemented as a tool.
MCP (Model Context Protocol), published by Anthropic in November 2024, defines a standardised interface for connecting tools to AI agents. Prior to MCP, every tool integration was bespoke — a tool built for LangChain did not work with AutoGen, and vice versa. MCP addresses this by defining a universal protocol.
An MCP server exposes tools via a JSON-RPC interface over stdio or HTTP with SSE transport. An MCP client (an agent framework or model interface) connects to the server, lists the available tools, and calls them using a standardised schema. The separation means any tool built to the MCP specification works with any agent built to the MCP specification.
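For illustration, a `tools/list` exchange might look roughly like this. The request shape follows JSON-RPC 2.0; the example tool entry in the response is invented:

```python
import json

# JSON-RPC 2.0 request a client sends to list a server's tools.
request = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}

# A server reply would look roughly like this (the tool entry is hypothetical):
response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "tools": [
            {
                "name": "get_current_weather",
                "description": "Get current weather for a city",
                "inputSchema": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                },
            }
        ]
    },
}

# Each message is serialised to JSON and sent over stdio or HTTP.
wire = json.dumps(request)
```

Because both sides speak this one schema, the client code above would work unchanged against any conforming server.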
As of April 2026, MCP has been adopted by LangGraph, AutoGen, Claude (natively), and a growing ecosystem of third-party tool providers. Official specification: modelcontextprotocol.io/specification.
How an agent plans affects both its capability and its cost. Three main planning strategies appear in current systems:
Incremental (reactive) planning: No upfront plan. The agent decides the next action at each step based on the current context. Efficient for short tasks. Prone to losing direction on long tasks. Used by default in most tool-calling implementations.

Plan-and-execute: The agent generates a complete plan at the start, then executes it step by step. If execution deviates from the plan (a step fails, a result is unexpected), the agent revises the plan. More reliable on structured tasks. Requires a capable model to generate a good initial plan. Implemented in LangChain's Plan-and-Execute agent and similar patterns.

Tree search: The agent generates multiple candidate next steps, evaluates each, and pursues the most promising branch. More expensive (multiple model calls per step) but substantially more reliable on tasks with many possible approaches. Introduced in "Tree of Thoughts: Deliberate Problem Solving with Large Language Models" (Yao et al., 2023, arXiv:2305.10601).
External memory in agent systems is most commonly implemented using vector databases. The process: text is chunked into segments, each chunk is converted to an embedding vector using an embedding model (OpenAI text-embedding-3-small, Cohere embed, or similar), and the vector is stored alongside the original text in a vector database (Pinecone, Chroma, Weaviate, Qdrant).
When the agent needs to retrieve relevant past information, it converts the current query to a vector and performs a nearest-neighbour search in the database, returning the most semantically similar chunks. This allows the agent to efficiently query large knowledge bases (millions of documents) without holding everything in the context window.
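The retrieval step reduces to a similarity ranking. A minimal pure-Python sketch (a real system would use an embedding model and a vector database, not hand-written vectors):

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def nearest(query_vec, store, k=2):
    """Return the k chunks whose embeddings are most similar to the query.
    `store` is a list of (embedding, text) pairs, a stand-in for a vector DB."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]
```

A production vector database does the same ranking with approximate nearest-neighbour indexes so it scales to millions of chunks; the semantics are what this sketch shows.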
The quality of retrieval depends on the quality of the embedding model, the chunking strategy, and the retrieval parameters. LlamaIndex provides particularly comprehensive tooling for tuning RAG pipelines in agentic contexts.
Source note: Technical specifications in this guide are drawn from the cited research papers and official API documentation. All claims are traceable to primary sources listed above.