LlamaIndex specialises in AI agents that reason over data — your company's documents, internal knowledge bases, databases, APIs, and more. Where other frameworks treat data retrieval as one tool among many, LlamaIndex is built from the ground up around it. The result: the most capable and most configurable RAG and data-augmented agent stack available in open source.
Large language models are trained on public data up to a cutoff date. They know nothing about your company's internal documents, your product catalogue, your customer data, or anything written after their training ended. If you want an AI that can answer questions about your specific data, you have two options: fine-tune a model on your data (expensive, slow, requires ML expertise) or give the model access to your data at query time so it can read what it needs before answering.
The second approach is called RAG — Retrieval-Augmented Generation. Instead of the model knowing everything, it retrieves the relevant parts of your data when needed and uses those as context. LlamaIndex is the most capable open-source framework for building RAG systems — and for building AI agents that use RAG as one of their core capabilities.
The process has five steps:
Load — your documents are loaded from their source. PDFs, Word files, web pages, Notion pages, databases, APIs. LlamaIndex has over 160 data loaders.
Chunk — documents are split into smaller pieces (chunks). This is necessary because most documents are too large to fit in a model's context window, and smaller, focused chunks produce better search results.
Embed — each chunk is converted to a numerical representation (an embedding vector) that captures its meaning. Similar chunks end up with similar vectors.
Store — chunks and their vectors are stored in a vector database. LlamaIndex works with Pinecone, Chroma, Weaviate, Qdrant, and others.
Retrieve and answer — when a question arrives, LlamaIndex converts it to a vector, finds the most similar chunks in the database, and passes those chunks to the LLM as context. The model answers using the retrieved information.
LangChain can do RAG. The difference is depth. LlamaIndex provides far more tuning options at every step of the pipeline — different chunking strategies, different retrieval algorithms, hybrid search (combining semantic and keyword search), query rewriting, reranking, and evaluation. For applications where RAG quality is critical — a customer service bot that must answer accurately from a product manual, a legal research tool, a medical knowledge base — LlamaIndex's additional control is significant.
Choose LlamaIndex when: your primary challenge is getting an AI to reason accurately over your own documents or data. If your workflow does not involve large custom knowledge bases, start with LangChain or CrewAI.
Documents and Nodes — a Document is a raw piece of content (a PDF, a web page, a database row). A Node is a chunk of a Document, with metadata linking it back to the original source. Everything in LlamaIndex flows through these two types.
VectorStoreIndex — the most commonly used index type. Stores Nodes as embedding vectors. Supports semantic similarity search. The starting point for most RAG applications.
QueryEngine — takes a natural language question, retrieves relevant Nodes, and generates an answer. The simplest complete RAG pipeline: one line to create it from an index, one line to query it.
Agent — a ReAct agent that can use a QueryEngine (and any other tool) to answer multi-step questions. Decides when to query the knowledge base, when to use other tools, and when it has enough information to answer.
LlamaIndex's indexing pipeline consists of four stages, each configurable:
LlamaIndex provides over 160 data loaders via the LlamaHub ecosystem, covering PDFs (using PyMuPDF or PyPDF2), Word documents, web pages, databases (SQL, MongoDB, Elasticsearch), APIs (Notion, Confluence, Slack, Google Drive), and more. Loaders return Document objects with a text field and a metadata dictionary. Metadata is preserved through the pipeline and attached to every Node derived from the Document.
Documents are split into Nodes using a NodeParser. The default is SentenceSplitter, which respects sentence boundaries. Alternative parsers include: TokenTextSplitter (splits on token count), SemanticSplitterNodeParser (uses embeddings to find semantically coherent split points — higher quality but more expensive), and HierarchicalNodeParser (creates parent and child nodes, enabling retrieval at multiple granularities).
Chunk size and overlap are the primary tuning parameters. Smaller chunks (256–512 tokens) produce more precise retrieval but may lack context. Larger chunks (1024–2048 tokens) preserve more context but may retrieve less precisely. The optimal value depends on document structure and query type.
Each Node's text is converted to an embedding vector using a configured embedding model. LlamaIndex supports OpenAI (text-embedding-3-small and text-embedding-3-large), Cohere, HuggingFace models (for local/private deployment), and others via the embeddings abstraction. The embedding model used at ingestion must be the same model used at query time — embedding spaces are model-specific and not interoperable.
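Configuring the embedding model is a one-line Settings change. A sketch of both options — the HuggingFace model name is illustrative:

```python
from llama_index.core import Settings
from llama_index.embeddings.openai import OpenAIEmbedding

# Hosted option (requires OPENAI_API_KEY):
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

# Local/private option via the HuggingFace integration
# (model name is illustrative):
# from llama_index.embeddings.huggingface import HuggingFaceEmbedding
# Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
```

Whichever model you pick, pin it: re-embedding an entire corpus is the price of switching later.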
Nodes and their vectors are stored in a VectorStore. LlamaIndex's SimpleVectorStore stores everything in memory (suitable for development and small collections). Production deployments use external vector databases: Pinecone, Weaviate, Chroma, Qdrant, pgvector (PostgreSQL extension), or others. LlamaIndex provides a consistent interface across all stores, so switching storage backends requires only a configuration change.
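The "configuration change" is the StorageContext. A sketch using Chroma as the backend (path and collection name are illustrative; requires the chromadb and llama-index-vector-stores-chroma packages):

```python
import chromadb
from llama_index.core import StorageContext
from llama_index.vector_stores.chroma import ChromaVectorStore

# Persistent local Chroma collection
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("docs")

vector_store = ChromaVectorStore(chroma_collection=collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Only this wiring changes when you swap backends; the rest of the
# pipeline is identical to the in-memory version:
# index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
```

Swapping to Pinecone, Weaviate, or pgvector replaces only the vector_store construction; indexing and querying code is untouched.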
Instead of embedding the user's query directly, HyDE (Hypothetical Document Embeddings) first asks the LLM to generate a hypothetical document that would answer the query, then embeds that hypothetical document. The idea: a generated answer tends to live in the same embedding space as actual answers, producing better retrieval than embedding the question. LlamaIndex implements this via HyDEQueryTransform.
With sentence-window retrieval, nodes are created at sentence granularity for precise matching, but when a sentence is retrieved, its surrounding window (several sentences before and after) is passed to the LLM as context. This combines precise retrieval with sufficient context for coherent answering — a significant quality improvement over standard chunking.
Hierarchical chunking creates small child nodes and larger parent nodes. When enough child nodes from the same parent are retrieved, LlamaIndex's auto-merging retriever replaces them with the parent node — providing the LLM with a coherent larger chunk rather than several disconnected small ones. Particularly effective for structured documents.
LlamaIndex agents use the ReAct pattern (Reasoning + Acting) with QueryEngines and other tools as the available actions. A QueryEngine over a document collection is wrapped as a Tool with a name and description — the agent reads the description and decides when to query which collection. This enables agents that reason over multiple knowledge bases, deciding which to consult based on the nature of each sub-question.
LlamaIndex's agent framework is less feature-rich than LangGraph for complex workflow orchestration, but substantially easier to use for data-heavy tasks. The two are often combined: a LangGraph orchestrator coordinates the overall workflow; LlamaIndex components handle document retrieval.
LlamaCloud is LlamaIndex's hosted data ingestion and retrieval service. It manages the ingestion pipeline (parsing, chunking, embedding, storage) as a managed API, with connectors to common data sources (Google Drive, Confluence, Notion, SharePoint). Pricing starts with a free tier; enterprise pricing is available. For organisations that want LlamaIndex capabilities without managing vector database infrastructure, LlamaCloud is the production path. Documentation at docs.cloud.llamaindex.ai.
LlamaIndex integrates with RAGAS (Retrieval-Augmented Generation Assessment), the standard evaluation framework for RAG pipelines. RAGAS measures: Context Precision (are retrieved chunks relevant?), Context Recall (were all relevant chunks retrieved?), Faithfulness (does the answer stay within what the retrieved context says?), and Answer Relevancy (does the answer address the question?). These four metrics together characterise a RAG pipeline's quality comprehensively. Running RAGAS evaluations before deployment is strongly recommended for any production RAG application.
Source note: All technical specifications are drawn from the official LlamaIndex documentation at docs.llamaindex.ai and the LlamaIndex GitHub repository. LlamaCloud pricing information is from the official LlamaCloud documentation. Verified April 2026.