LlamaIndex specialises in AI agents that reason over data — your company's documents, internal knowledge bases, databases, APIs, and more. Where other frameworks treat data retrieval as one tool among many, LlamaIndex is built from the ground up around it. The result: the most capable and most configurable RAG and data-augmented agent stack available in open source.
Large language models are trained on public data up to a cutoff date. They know nothing about your company's internal documents, your product catalogue, your customer data, or anything written after their training ended. If you want an AI that can answer questions about your specific data, you have two options: fine-tune a model on your data (expensive, slow, requires ML expertise) or give the model access to your data at query time so it can read what it needs before answering.
The second approach is called RAG — Retrieval-Augmented Generation. Instead of the model knowing everything, it retrieves the relevant parts of your data when needed and uses those as context. LlamaIndex is the most capable open-source framework for building RAG systems — and for building AI agents that use RAG as one of their core capabilities.
The process has five steps:
Load — your documents are loaded from their source. PDFs, Word files, web pages, Notion pages, databases, APIs. LlamaIndex has over 160 data loaders.
Chunk — documents are split into smaller pieces (chunks). This is necessary because most documents are too large to fit in a model's context window, and smaller, focused chunks produce better search results.
Embed — each chunk is converted to a numerical representation (an embedding vector) that captures its meaning. Similar chunks end up with similar vectors.
Store — chunks and their vectors are stored in a vector database. LlamaIndex works with Pinecone, Chroma, Weaviate, Qdrant, and others.
Retrieve and answer — when a question arrives, LlamaIndex converts it to a vector, finds the most similar chunks in the database, and passes those chunks to the LLM as context. The model answers using the retrieved information.
LangChain can do RAG. The difference is depth. LlamaIndex provides far more tuning options at every step of the pipeline — different chunking strategies, different retrieval algorithms, hybrid search (combining semantic and keyword search), query rewriting, reranking, and evaluation. For applications where RAG quality is critical — a customer service bot that must answer accurately from a product manual, a legal research tool, a medical knowledge base — LlamaIndex's additional control is significant.
Choose LlamaIndex when: your primary challenge is getting an AI to reason accurately over your own documents or data. If your workflow does not involve large custom knowledge bases, start with LangChain or CrewAI.
Documents and Nodes — a Document is a raw piece of content (a PDF, a web page, a database row). A Node is a chunk of a Document, with metadata linking it back to the original source. Everything in LlamaIndex flows through these two types.
VectorStoreIndex — the most commonly used index type. Stores Nodes as embedding vectors. Supports semantic similarity search. The starting point for most RAG applications.
QueryEngine — takes a natural language question, retrieves relevant Nodes, and generates an answer. The simplest complete RAG pipeline: one line to create it from an index, one line to query it.
Agent — a ReAct agent that can use a QueryEngine (and any other tool) to answer multi-step questions. Decides when to query the knowledge base, when to use other tools, and when it has enough information to answer.
LlamaIndex's indexing pipeline consists of four stages, each configurable:
LlamaIndex provides over 160 data loaders via the LlamaHub ecosystem, covering PDFs (using PyMuPDF or PyPDF2), Word documents, web pages, databases (SQL, MongoDB, Elasticsearch), APIs (Notion, Confluence, Slack, Google Drive), and more. Loaders return Document objects with a text field and a metadata dictionary. Metadata is preserved through the pipeline and attached to every Node derived from the Document.
Documents are split into Nodes using a NodeParser. The default is SentenceSplitter, which respects sentence boundaries. Alternative parsers include: TokenTextSplitter (splits on token count), SemanticSplitterNodeParser (uses embeddings to find semantically coherent split points — higher quality but more expensive), and HierarchicalNodeParser (creates parent and child nodes, enabling retrieval at multiple granularities).
Chunk size and overlap are the primary tuning parameters. Smaller chunks (256–512 tokens) produce more precise retrieval but may lack context. Larger chunks (1024–2048 tokens) preserve more context but may retrieve less precisely. The optimal value depends on document structure and query type.
Each Node's text is converted to an embedding vector using a configured embedding model. LlamaIndex supports OpenAI (text-embedding-3-small and text-embedding-3-large), Cohere, HuggingFace models (for local/private deployment), and others via the embeddings abstraction. The embedding model used at ingestion must be the same model used at query time — embedding spaces are model-specific and not interoperable.
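Configuring the embedding model is a one-line Settings change. A sketch of both options — the HuggingFace model name is illustrative:

```python
from llama_index.core import Settings
from llama_index.embeddings.openai import OpenAIEmbedding

# Hosted option (requires OPENAI_API_KEY):
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

# Local/private option via the HuggingFace integration
# (model name is illustrative):
# from llama_index.embeddings.huggingface import HuggingFaceEmbedding
# Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
```

Whichever model you pick, pin it: re-embedding an entire corpus is the price of switching later.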
Nodes and their vectors are stored in a VectorStore. LlamaIndex's SimpleVectorStore stores everything in memory (suitable for development and small collections). Production deployments use external vector databases: Pinecone, Weaviate, Chroma, Qdrant, pgvector (PostgreSQL extension), or others. LlamaIndex provides a consistent interface across all stores, so switching storage backends requires only a configuration change.
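The "configuration change" is the StorageContext. A sketch using Chroma as the backend (path and collection name are illustrative; requires the chromadb and llama-index-vector-stores-chroma packages):

```python
import chromadb
from llama_index.core import StorageContext
from llama_index.vector_stores.chroma import ChromaVectorStore

# Persistent local Chroma collection
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("docs")

vector_store = ChromaVectorStore(chroma_collection=collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Only this wiring changes when you swap backends; the rest of the
# pipeline is identical to the in-memory version:
# index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
```

Swapping to Pinecone, Weaviate, or pgvector replaces only the vector_store construction; indexing and querying code is untouched.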
Instead of embedding the user's query directly, HyDE (Hypothetical Document Embeddings) first asks the LLM to generate a hypothetical document that would answer the query, then embeds that hypothetical document. The idea: a generated answer tends to live in the same embedding space as actual answers, producing better retrieval than embedding the question. LlamaIndex implements this via HyDEQueryTransform.
With sentence-window retrieval, nodes are created at sentence granularity for precise matching, but when a sentence is retrieved, its surrounding window (several sentences before and after) is passed to the LLM as context. This combines precise retrieval with sufficient context for coherent answering — a significant quality improvement over standard chunking.
Hierarchical chunking creates small child nodes and larger parent nodes. When enough child nodes from the same parent are retrieved, LlamaIndex's auto-merging retriever replaces them with the parent node — providing the LLM with a coherent larger chunk rather than several disconnected small ones. Particularly effective for structured documents.
LlamaIndex agents use the ReAct pattern (Reasoning + Acting) with QueryEngines and other tools as the available actions. A QueryEngine over a document collection is wrapped as a Tool with a name and description — the agent reads the description and decides when to query which collection. This enables agents that reason over multiple knowledge bases, deciding which to consult based on the nature of each sub-question.
LlamaIndex's agent framework is less feature-rich than LangGraph for complex workflow orchestration, but substantially easier to use for data-heavy tasks. The two are often combined: a LangGraph orchestrator coordinates the overall workflow; LlamaIndex components handle document retrieval.
LlamaCloud is LlamaIndex's hosted data ingestion and retrieval service. It manages the ingestion pipeline (parsing, chunking, embedding, storage) as a managed API, with connectors to common data sources (Google Drive, Confluence, Notion, SharePoint). Pricing starts with a free tier; enterprise pricing is available. For organisations that want LlamaIndex capabilities without managing vector database infrastructure, LlamaCloud is the production path. Documentation at docs.cloud.llamaindex.ai.
LlamaIndex integrates with RAGAS (Retrieval-Augmented Generation Assessment), the standard evaluation framework for RAG pipelines. RAGAS measures: Context Precision (are retrieved chunks relevant?), Context Recall (were all relevant chunks retrieved?), Faithfulness (does the answer stay within what the retrieved context says?), and Answer Relevancy (does the answer address the question?). These four metrics together characterise a RAG pipeline's quality comprehensively. Running RAGAS evaluations before deployment is strongly recommended for any production RAG application.
Source note: All technical specifications are drawn from the official LlamaIndex documentation at docs.llamaindex.ai and the LlamaIndex GitHub repository. LlamaCloud pricing information is from the official LlamaCloud documentation. Verified April 2026.