AI Tool Guide

Llama — The Complete Guide

Meta’s open-source AI — the model anyone can download, run, modify, and build on. The foundation of hundreds of AI products. The most important open-source AI story in the world. What open-source means, the full history, how to use it, why it matters, and complete technical depth. Three reading levels. Official sources only.

Llama Meta AI ~9,700 words Updated April 2026

What is Llama — and why is it different?

Almost every AI you have heard of — ChatGPT, Claude, Gemini, Copilot — is a service. You visit a website, you use the AI, the company controls it, you pay them or use their free tier, and you trust them with whatever you type.

Llama is different. It is an AI model made by Meta — the company that owns Facebook and Instagram — and Meta gives it away for free. Not just free to use — free to download. The actual model itself. You can put it on your own computer and run it. No internet connection needed. No company watching what you type. No usage limits. Just the AI, running on your hardware, entirely under your control.

This is what “open-source” means in AI: the model’s weights — the billions of numerical values that make the AI work — are published openly for anyone to use.

The best analogy I can give you

Imagine recipes. Most restaurants keep their recipes secret — you can eat there, but you can’t take the recipe home. ChatGPT and Claude are like restaurants. Llama is like a restaurant that posts its recipes online for free. You can make the food yourself, in your own kitchen, for your own family, change the recipe, improve it, make it spicier, give your version to friends.

The vast majority of people will still go to restaurants (use ChatGPT or Claude). But for cooks — developers, researchers, companies with specific needs — having the recipe changes everything.

Who made Llama?

Llama was made by Meta AI — the artificial intelligence research division of Meta, the company founded by Mark Zuckerberg. Meta AI is headquartered in Menlo Park, California, with major research offices in Paris and New York.

The person most associated with Meta’s AI philosophy is Yann LeCun, Meta’s Chief AI Scientist. LeCun is one of the most respected figures in AI research — he co-won the 2018 Turing Award (the Nobel Prize of computing) alongside Geoffrey Hinton and Yoshua Bengio for their foundational work on deep learning. Unlike many AI leaders, LeCun has been consistently outspoken in support of open-source AI development and sceptical of what he sees as exaggerated concerns about near-term AI risk.

LeCun’s view — which underpins Meta’s Llama strategy — is that open AI is safer AI: more researchers can study it, identify problems, and improve it. Closed AI concentrates power in the hands of a few companies.

Why does a company like Meta give this away?

It seems counterintuitive. Why would Meta spend hundreds of millions of dollars building an AI and then give it away for free?

Two reasons. First, Meta’s business is social media, not selling AI services — they do not have a competing product to protect. Second, by releasing Llama openly, Meta accelerates the entire field, builds enormous goodwill with developers, and ensures that AI development is not monopolised by OpenAI and Google. A world with Llama is a world where Meta has influence even without a major AI product. Strategic generosity.

The history of Llama — from a research paper to a revolution

The context: closed models dominating (2022)

By late 2022, the frontier of AI was firmly behind closed doors. OpenAI had GPT-3 locked behind a paid API and had just launched ChatGPT as a hosted service. Google had LaMDA but was not releasing it. The open-source AI community had models, but nothing close to the capability of the frontier. If you wanted cutting-edge AI, you went through OpenAI.

The researchers at Meta AI — many of them veterans of Google Brain, DeepMind, and top universities — believed this was wrong. Not just strategically for Meta, but philosophically: AI this powerful and this broadly applicable should not be controlled exclusively by one or two companies.

February 2023: Llama 1 — the first crack in the wall

On 24 February 2023, Meta released Llama 1 — a family of models ranging from 7 billion to 65 billion parameters. The release was initially restricted to researchers who applied for access through a form. Meta wanted to control the rollout carefully.

The models were immediately impressive. Not quite frontier (GPT-4, released the following month, was significantly more capable), but far better than anything previously available openly. For a 65B parameter model running on research hardware, the performance was remarkable.

Then something happened that Meta had not planned for.

The leak — and what it meant

Within days of Llama 1’s restricted release, someone posted the model weights to a public forum. The leak spread instantly. Within a week, Llama 1’s weights were permanently, irrevocably available to anyone on the internet who wanted them.

The reaction divided observers. Some saw this as a disaster — dangerous AI capabilities unleashed without proper safety review. Others saw it as the inevitable consequence of any restricted-release strategy for AI. If you tell researchers they can have access but not share it, someone will share it.

What happened next was extraordinary. Developers around the world began building on Llama 1 at a pace that no closed model could match. Within weeks:

  • Alpaca (Stanford) showed that fine-tuning Llama 1 on 52,000 instruction examples could produce a surprisingly capable assistant model
  • Vicuna demonstrated that fine-tuning on shared ChatGPT conversations could approach ChatGPT-level quality (as judged by GPT-4 in the authors’ own evaluation)
  • Dozens of specialised variants emerged for coding, medicine, law, and other domains
  • llama.cpp made it possible to run Llama models on ordinary consumer hardware, CPU included (the project that Ollama later built on to make local models a one-command experience)

The open-source AI ecosystem, which had been starved of capable foundation models, exploded.

July 2023: Llama 2 — Meta embraces openness

Meta drew the obvious conclusion. On 18 July 2023, Meta released Llama 2 — not restricted to researchers, but openly available for research and commercial use. Anyone could download and use it. Companies could build products on it. The only restriction: organisations with more than 700 million monthly active users needed a special licence (effectively targeting only other tech giants like Google).

Llama 2 was a significant improvement over Llama 1. Three openly released sizes: 7B, 13B, and 70B parameters (a 34B variant was trained but never publicly released). The 70B model was competitive with GPT-3.5 on many benchmarks. Separately, Meta released Llama 2 Chat — versions fine-tuned for conversational use.

The reaction was enormous. Llama 2 was downloaded millions of times in the first days. Hugging Face — the platform that hosts AI models — saw its servers strained. Every major cloud provider added Llama 2 to their model catalogues. Hundreds of companies began building Llama 2-based products.

April 2024: Llama 3 — closing the gap with frontier models

Llama 3 launched on 18 April 2024. Two sizes initially: 8B and 70B. The 70B model’s benchmark performance was striking — it matched or exceeded GPT-3.5 and was competitive with Claude 3 Sonnet on many tasks. For a freely available, open-weight model, this was unprecedented.

Key improvements in Llama 3: a new tokeniser with a vocabulary of 128,000 tokens (vs 32,000 in Llama 2), better instruction following, significantly improved coding ability, and stronger performance on multilingual tasks. The model was trained on over 15 trillion tokens — more than seven times the training data of Llama 2.

July 2024: Llama 3.1 405B — a historic moment

On 23 July 2024, Meta released Llama 3.1 — including a 405-billion parameter model. This was the largest open-weight model ever released. Its performance matched GPT-4o on several benchmarks. The implications were significant: for the first time, frontier-level AI capability was available as open weights, to anyone, for free.

Mark Zuckerberg published a letter arguing that open-source AI was the future — safer, more accessible, and ultimately better for society than closed models. This was a direct challenge to OpenAI and Anthropic’s closed model approach.

The 8B and 70B variants also received significant improvements, including a context window extended to 128,000 tokens — enough to hold over 100 pages of text in a single conversation.

2025: Llama 4 — multimodal and mixture-of-experts

Llama 4 launched in early 2025 as a significant architectural shift. Three variants:

  • Llama 4 Scout — 17B active parameters (109B total), 10M token context window, efficient for deployment
  • Llama 4 Maverick — 17B active parameters (400B total via MoE), multimodal (text + image), competitive with GPT-4o
  • Llama 4 Behemoth — 288B active parameters, used primarily as a teacher model for distillation, not a deployment model

Llama 4 was natively multimodal — trained on text, images, and video from the ground up. This brought it level with commercial multimodal models on visual understanding tasks.

Who actually uses Llama and how?

Most people who “use Llama” do not know they are using Llama. It underpins a large and growing number of consumer and enterprise applications. But there are several ways to interact with it directly:

1. Meta AI — the consumer interface

Meta’s own AI assistant, available at meta.ai and built into WhatsApp, Instagram, and Facebook Messenger. This is the easiest way for most people to experience Llama — it is the familiar chat interface that runs on Meta’s servers. Free to use.

Try Meta AI first
Hello! I want to understand what you are and how you work. Can you: tell me your name and who made you, explain in simple language what kind of AI you are, and give me three practical examples of how you could help me today?

2. Ollama — run Llama on your own computer

Ollama is a free tool that makes running Llama on your Mac, Windows, or Linux computer as simple as a single command. Your data never leaves your machine. No API key needed. No usage limits. Completely private.

Minimum hardware for the 8B model: 8GB RAM (runs on most modern laptops). For the 70B model: around 48GB of RAM for a quantised version — a high-end workstation or an M-series Mac with 64GB of unified memory.

When running locally makes sense

You are a doctor and you want AI help with patient notes — but you cannot send patient data to OpenAI’s servers. Run Llama locally. All patient information stays on your computer.

You are a lawyer reviewing confidential contracts. Run Llama locally. The contracts never leave your office.

You are a developer building an app and you need AI that works offline. Llama locally means your app works without internet.

3. Products built on Llama

Hundreds of products use Llama as their foundation. When you use a customer service chatbot, a coding assistant, a writing tool, or a business AI — there is a reasonable chance it is powered by a fine-tuned Llama model. You may never know, because the company has taken Llama and customised it for their specific use case.

4. Cloud APIs — try without hardware requirements

If you want to try Llama models without downloading them, several cloud providers offer API access at low cost:

  • Groq (groq.com) — Extremely fast Llama inference, free tier available
  • Together AI (together.ai) — Multiple Llama variants, pay per token
  • Replicate (replicate.com) — Easy API access, pay per use
  • AWS Bedrock, Azure, Google Cloud — Enterprise-grade hosting

What makes Llama special — the open-source advantage

Privacy

When you send text to ChatGPT or Claude, that text goes to OpenAI’s or Anthropic’s servers. Their privacy policies say they do not use it to train models (unless you opt in), but your data does leave your device. For sensitive information — medical records, legal documents, financial data, personal matters — this may be unacceptable.

Running Llama locally means your data never leaves your computer. Full stop.

Customisation — fine-tuning

One of the most powerful things you can do with an open-weight model is fine-tune it — train it further on your own data to become a specialist. A hospital can fine-tune Llama on medical literature and clinical notes to create an AI that understands medical language far better than a general model. A law firm can fine-tune on legal documents. An e-commerce company on product catalogue and customer service interactions.

Fine-tuning a closed model (like GPT-4) is limited to what the API provider allows. Fine-tuning Llama has no restrictions — you are working with the weights directly.

Cost at scale

Paying OpenAI or Anthropic per API call adds up at scale. A product making millions of AI requests per day pays substantial API fees. Running your own Llama instance — on your own servers or cloud infrastructure — can be significantly cheaper once you reach sufficient volume. Many companies reach the crossover point where self-hosting is cheaper than API usage.
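That crossover point can be sketched with a toy calculation. The numbers below are purely illustrative, not real prices:

```python
def breakeven_requests_per_month(api_cost_per_million_tokens,
                                 tokens_per_request,
                                 monthly_hosting_cost):
    # Toy break-even model: below this request volume the pay-per-token
    # API is cheaper; above it, the fixed-cost self-hosted server wins.
    # (Ignores engineering and ops overhead, which favour the API.)
    api_cost_per_request = api_cost_per_million_tokens * tokens_per_request / 1_000_000
    return monthly_hosting_cost / api_cost_per_request

# Hypothetical numbers: $1 per million tokens, 2,000 tokens per request,
# $2,000/month for a GPU server
print(round(breakeven_requests_per_month(1.0, 2000, 2000)))  # → 1000000
```

In this hypothetical, self-hosting wins only above a million requests a month — which is exactly why the calculation matters at scale and rarely for individuals.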

Independence

If OpenAI changes its pricing, its terms of service, its model behaviour, or if it goes out of business — every product built on the OpenAI API is affected. Products built on Llama are not. The weights they downloaded do not change. Their model continues working regardless of what happens to Meta.

Using Llama in practice

For most working users, the choice is between Meta AI (simple, no setup), a cloud API (flexible, scalable), or a local installation (private, unlimited). Here is a practical guide for each.

Option 1: Meta AI — zero setup

Go to meta.ai. Sign in with a Meta account (or create one). Start chatting. This is Llama 4 running on Meta’s infrastructure — the same underlying model, with Meta’s fine-tuning and safety layers on top. Free. No technical setup.

Option 2: Ollama — local in 10 minutes

Set up Ollama (Mac/Linux/Windows)
# Step 1: Install Ollama from ollama.com
# (One-click installer for Mac/Windows, curl for Linux)

# Step 2: Pull a model and start chatting
ollama run llama3.2          # 3B — fast, works on most computers
ollama run llama3.1:8b       # 8B — better quality, needs 8GB RAM
ollama run llama3.1:70b      # 70B — best quality, needs 48GB RAM

# Step 3: Ollama also runs as a local API
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Explain quantum computing simply",
  "stream": false
}'
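The same local endpoint can be called from Python using only the standard library. A minimal sketch (the model name is whichever one you have pulled with Ollama):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_payload(model, prompt):
    # Same fields as the curl example; stream=False returns one JSON object
    return {"model": model, "prompt": prompt, "stream": False}

def generate(model, prompt):
    # POST to the local Ollama server; no API key, nothing leaves the machine
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(model, prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Usage (requires Ollama running locally):
# print(generate("llama3.1:8b", "Explain quantum computing simply"))
```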

Option 3: Groq API — fastest inference available

Groq API — Python (free tier available)
from groq import Groq

client = Groq(api_key="your-groq-api-key")

response = client.chat.completions.create(
    model="llama-3.1-70b-versatile",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarise this document: [paste content]"}
    ],
    temperature=0.7,
    max_tokens=1024
)

print(response.choices[0].message.content)

20 high-value prompts for Llama / Meta AI

1. Private document analysis
I am sharing this document with you on my local/private AI. It contains sensitive information. Please: summarise the key points, identify any risks or concerns, extract all action items and deadlines, and flag anything that requires urgent attention. Here is the document: [paste content]
2. Local coding assistant
Act as a senior software engineer. Review this code: [paste code]. Identify: bugs, security vulnerabilities, performance issues, code smells, and anything that does not follow best practices for [language]. For each issue: explain the problem, explain why it matters, and show the corrected code.
3. Fine-tuning data generation
I am building a fine-tuning dataset for a [domain] AI assistant. Generate [number] high-quality instruction-response pairs in the following format: each pair should cover [topic area], responses should be [style/tone], and each should demonstrate [specific capability]. Format as JSON with "instruction" and "response" fields.
4. Medical note summarisation (local/private)
Summarise the following clinical/medical notes for [purpose — patient handover / referral letter / discharge summary]. Keep all medically relevant information. Use appropriate clinical terminology. Format: chief complaint, history, examination findings, assessment, plan. Notes: [paste notes]
5. Legal document review (local/private)
Review this legal document: [paste content]. Identify: the key obligations on each party, any unusual or potentially unfavourable clauses, defined terms that could be interpreted broadly, conditions that could trigger liability, and anything I should seek legal advice about before signing.
6. Domain-specific Q&A system
You are an expert assistant for [specific domain]. Your knowledge base includes the following information: [paste domain documents or facts]. Answer questions about this domain using only the provided information. If the answer is not in the provided information, say so clearly rather than guessing. Question: [question]
7. Build a RAG system prompt
You are an AI assistant for [company/organisation]. Answer questions using only the following retrieved context. If the context does not contain enough information to answer the question, say “I don’t have that information in the provided documents” — do not guess or use your training knowledge. Retrieved context: [context]. Question: [question]
8. Offline research assistant
Based on your training knowledge (no internet access needed), give me a comprehensive overview of [topic]. Cover: background and history, current state of knowledge, key debates or open questions, practical implications, and the most important things someone new to this topic should know. Flag anything where your knowledge may be dated.
9. Multi-language processing
Process the following text which is in [language]: [paste text]. Please: translate it to English, summarise the key points, identify the tone and intended audience, and note any culturally-specific references that a non-[nationality] reader should understand to fully appreciate the content.
10. Structured data extraction
Extract structured data from the following unstructured text: [paste text]. Output as JSON with the following fields: [list fields]. If a field is not present in the text, use null. Extract every instance if there are multiple records in the text. Return only valid JSON with no explanation.
11. System prompt for a custom assistant
Write a system prompt for an AI assistant with the following specifications: Role: [describe the role]. Knowledge domain: [describe the domain]. Personality: [describe tone and style]. What it should always do: [list]. What it should never do: [list]. How it should handle questions outside its domain: [describe]. Format the system prompt ready to use.
12. Batch content processing
Process each of the following items and for each one [describe the task — e.g. classify by category / extract key information / translate / summarise / score by quality]. Format the output as a table with columns: [list columns]. Items: [list items]
13. Evaluation and scoring
Evaluate the following [content type — essay / code / business plan / design brief / argument] against these criteria: [list criteria]. For each criterion: give a score out of 10, explain the score with specific examples from the content, and suggest one specific improvement. Finally, give an overall assessment.
14. Generate test cases for code
Generate comprehensive test cases for the following function/code: [paste code]. Include: happy path tests, edge cases, boundary conditions, error cases, and any inputs that might cause unexpected behaviour. Format as [testing framework — pytest/jest/etc] with clear test names that describe what each test is checking.
15. Content moderation classifier
Classify the following piece of user-generated content against these moderation policies: [list policies — e.g. no hate speech, no personal attacks, no misinformation]. For each policy: state whether it is violated (yes/no/borderline), explain your reasoning with a quote from the content, and suggest the appropriate action (approve/remove/flag for review). Content: [paste content]
16. Customer feedback analysis
Analyse these customer reviews/feedback comments: [paste reviews]. Identify: the top 5 positive themes with supporting examples, the top 5 negative themes with supporting examples, any urgent issues mentioned that need immediate attention, and the overall sentiment score. Present findings in a format suitable for sharing with a product team.
17. Local writing assistant without cloud
Help me improve this piece of writing: [paste text]. I need this to stay entirely local — I am using a private AI. Improve it for: clarity, flow, strength of argument, and appropriate tone for [audience]. Show me the improved version alongside specific notes on what you changed and why.
18. Generate synthetic training data
Generate [number] synthetic [data type — customer support conversations / product reviews / medical case summaries / legal briefs] for training a specialist AI. Each example should: be realistic and varied, represent different scenarios within [domain], avoid any real personally identifiable information, and match this style/format: [describe]. Output as JSONL.
19. Agentic task decomposition
I need to accomplish the following goal: [describe goal]. Break this down into a sequence of specific, executable steps. For each step: describe exactly what needs to be done, what information or tools are required, what the expected output is, and what could go wrong. Format as a numbered task list I can use to guide an AI agent or a team.
20. Privacy-conscious personal assistant
I want you to act as my personal assistant for [type of task — health tracking / financial planning / personal journaling / relationship management]. I am using a local AI specifically because I want these conversations to stay private. Help me [specific task] while keeping everything confidential. Here is the relevant information: [share details]

Llama architecture: technical evolution across versions

The Llama series uses a decoder-only transformer architecture — the same fundamental design as GPT-3 and subsequent OpenAI models — with a set of architectural improvements that have been influential across the open-source community.

Key architectural choices in Llama 1 and 2

Pre-normalisation with RMSNorm: Llama applies layer normalisation before the attention and feed-forward sub-layers (pre-norm) rather than after (post-norm as in the original transformer). This improves training stability. Specifically, Llama uses RMSNorm (Root Mean Square Layer Normalisation) rather than the standard LayerNorm — computationally cheaper and empirically comparable in performance. Source: Zhang and Sennrich (2019), “Root Mean Square Layer Normalization.”
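A minimal sketch of the RMSNorm computation, written per-vector in pure Python for illustration:

```python
import math

def rms_norm(x, gain, eps=1e-6):
    # RMSNorm: rescale by the root mean square of the activations.
    # Unlike LayerNorm, no mean is subtracted and no bias is added,
    # which saves computation at every one of the model's layers.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]

print(rms_norm([3.0, 4.0], [1.0, 1.0]))  # ≈ [0.8485, 1.1314]
```

The learned `gain` vector plays the same role as LayerNorm's scale parameter; only the centring step is dropped.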

SwiGLU activation function: The feed-forward sub-layers use SwiGLU activations (a Swish-gated linear unit variant), following Shazeer (2020) and the PaLM architecture, rather than the ReLU or GELU activations used in GPT-style models. SwiGLU empirically improves model performance at equivalent compute.

Rotary Positional Embeddings (RoPE): Llama uses RoPE (Su et al., 2021) rather than absolute or learnt positional embeddings. RoPE encodes position information directly into the attention computation by rotating query and key vectors, and, combined with scaling techniques such as position interpolation, it supports extending context windows well beyond the lengths seen during training — a significant advantage for long-context applications.
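The core of RoPE is a position-dependent rotation applied to each (even, odd) pair of query and key features. A toy sketch of the rotation for a single pair:

```python
import math

def rope_pair(x, y, position, pair_index, head_dim, base=10000.0):
    # Rotate one (even, odd) feature pair by an angle proportional to the
    # token's position; low pair indices rotate faster than high ones.
    theta = position / (base ** (2 * pair_index / head_dim))
    return (x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta))

print(rope_pair(1.0, 0.0, 0, 0, 128))  # position 0 → identity: (1.0, 0.0)
```

Because rotations compose, the attention dot product between a query at position m and a key at position n depends only on the offset m − n, which is what makes the scheme relative rather than absolute.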

Grouped Query Attention (GQA) — introduced in Llama 2 (70B model): Standard multi-head attention (MHA) uses separate key and value projections for each attention head. GQA lets groups of query heads share a single key-value head, significantly reducing the KV cache memory requirements during inference — enabling longer contexts and faster generation — with minimal quality degradation. This became standard practice across many subsequent models, including all Llama 3 sizes.
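The memory saving is easy to quantify. A rough KV-cache estimate using the published Llama 2 70B shapes (80 layers, head dimension 128, 64 query heads, 8 KV heads):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    # K and V caches: 2 tensors per layer, each seq_len x kv_heads x head_dim,
    # stored here in fp16 (2 bytes per value).
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# Llama 2 70B at an 8,192-token context
mha = kv_cache_bytes(80, 64, 128, 8192)  # 64 KV heads — full multi-head attention
gqa = kv_cache_bytes(80, 8, 128, 8192)   # 8 KV heads — grouped query attention
print(mha / 2**30, gqa / 2**30)  # 20.0 GiB vs 2.5 GiB
```

An 8x reduction in cache size per sequence, which translates directly into longer contexts or more concurrent users on the same GPU.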

Primary sources — Llama architecture

Touvron, H., et al. (2023). “LLaMA: Open and Efficient Foundation Language Models.” Meta AI. arxiv.org/abs/2302.13971

Touvron, H., et al. (2023). “Llama 2: Open Foundation and Fine-Tuned Chat Models.” Meta AI. arxiv.org/abs/2307.09288

Llama 3: Scaling and improved data curation

The Llama 3 technical report documents the most significant training data improvement: a new dataset of over 15 trillion tokens, curated with more aggressive quality filtering, deduplication, and domain balancing than Llama 2’s 2 trillion token dataset. The improved tokeniser uses a vocabulary of 128,256 tokens (vs 32,000 for Llama 2), enabling more efficient representation of code, mathematics, and non-English text.

Llama 3’s instruction-following improvements were achieved through a multi-stage post-training pipeline: supervised fine-tuning, rejection sampling, and direct preference optimisation (DPO). Meta chose DPO over the PPO-based RLHF used for Llama 2, reporting that it was more stable and cheaper to scale.

Primary source — Llama 3

Meta AI (2024). “The Llama 3 Herd of Models.” arxiv.org/abs/2407.21783

Llama 4: Mixture-of-Experts and native multimodality

Llama 4 introduces two architectural paradigm shifts relative to previous Llama versions:

Mixture-of-Experts (MoE): Rather than activating the full parameter set for every token, Llama 4 Maverick and Behemoth use a learned routing mechanism that dispatches each token to a subset of “expert” feed-forward networks. Maverick has 128 routed experts plus a shared expert; each token is processed by the shared expert and one routed expert, so the model totals 400B parameters with only 17B active per token — enabling high model capacity at a fraction of the inference cost of an equivalent dense model.
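A toy sketch of the routing idea — a real implementation batches this over tokens and applies a softmax to the router scores, but the shared-plus-routed structure mirrors Meta's description:

```python
def moe_forward(x, shared_expert, routed_experts, router_logits):
    # Toy top-1 MoE layer: every token goes through the shared expert,
    # plus the single routed expert its router score selects. Capacity
    # scales with the number of experts; per-token compute does not.
    best = max(range(len(routed_experts)), key=lambda i: router_logits[i])
    return shared_expert(x) + routed_experts[best](x)

experts = [lambda v: 2 * v, lambda v: 3 * v, lambda v: 5 * v]
print(moe_forward(1.0, lambda v: v, experts, [0.1, 0.7, 0.2]))  # → 4.0
```

Adding more experts grows total parameters (capacity) while the per-token cost stays fixed at one shared plus one routed expert.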

Native multimodality: Llama 4 is trained on text, images, and video from the beginning of pre-training, rather than adding vision as a post-hoc module. The vision encoder uses a modified ViT (Vision Transformer) architecture. Cross-modal attention enables the language decoder to attend to visual representations at any layer, not just at the input.

Interleaved attention for long context (iRoPE): Llama 4 also interleaves attention layers that use RoPE with attention layers that use no positional embeddings at all, combined with inference-time temperature scaling of attention. Meta calls this the iRoPE architecture and credits it with enabling Scout’s 10-million-token context window.

Primary source — Llama 4

Meta AI (2025). “Llama 4 Model Card and Technical Summary.” github.com/meta-llama/llama-models

Meta AI (2025). “Introducing Llama 4.” ai.meta.com/blog/llama-4

Fine-tuning Llama: LoRA and QLoRA

Full fine-tuning of a 70B parameter model is out of reach for most organisations: the fp16 weights alone occupy roughly 140GB, and adding gradients and optimiser states multiplies the memory requirement several times over. Two parameter-efficient fine-tuning (PEFT) techniques have made fine-tuning accessible:
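The arithmetic behind that claim, using standard mixed-precision Adam accounting (2 bytes of fp16 weights, 2 bytes of gradients, 8 bytes of fp32 optimiser moments per parameter):

```python
def full_finetune_gb(n_params_billion):
    # 2 (weights) + 2 (gradients) + 8 (fp32 Adam moments) = 12 bytes per
    # parameter — before activations and framework overhead, which add more.
    return n_params_billion * 12

print(full_finetune_gb(70))  # → 840 GB of GPU memory for a 70B model
```

That is an entire multi-node cluster of 80GB GPUs just to hold training state, which is the gap LoRA and QLoRA close.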

LoRA (Low-Rank Adaptation, Hu et al. 2021): Rather than updating all model weights, LoRA adds small low-rank matrices to the weight matrices of attention layers. Only these small matrices are trained — reducing trainable parameters by orders of magnitude. The base model weights are frozen. At inference, the LoRA adapters are merged back into the base weights with no inference overhead.

QLoRA (Dettmers et al. 2023): Extends LoRA by quantising the base model to 4-bit precision before fine-tuning (reducing memory requirements by ~4x), while maintaining LoRA adapters in full 16-bit precision. QLoRA enabled fine-tuning Llama 65B on a single 48GB GPU — making specialist AI accessible to individuals and small organisations.

Fine-tuning Llama with QLoRA (Python — Hugging Face)
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch

# Load model in 4-bit (QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B",
    quantization_config=bnb_config,
    device_map="auto"
)

# Configure LoRA
lora_config = LoraConfig(
    r=16,           # Rank — higher = more capacity, more memory
    lora_alpha=32,  # Scaling factor
    target_modules=["q_proj", "v_proj"],  # Which layers to adapt
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Reports trainable vs. total parameter counts — with these settings,
# typically well under 0.1% of the model's parameters are trained
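The trainable-parameter count can be estimated by hand. A back-of-envelope version, assuming the usual Llama 3.1 8B shapes (32 transformer layers, hidden size 4096, GQA key/value dimension 1024):

```python
def lora_params(n_layers, d_model, d_kv, rank):
    # Each adapted projection W (d_in x d_out) gains two low-rank matrices:
    # A (d_in x r) and B (r x d_out). Here q_proj maps d_model -> d_model
    # and v_proj maps d_model -> d_kv (smaller under GQA).
    q_proj = rank * (d_model + d_model)
    v_proj = rank * (d_model + d_kv)
    return n_layers * (q_proj + v_proj)

# Assumed Llama 3.1 8B shapes: 32 layers, hidden 4096, KV dim 1024, rank 16
print(lora_params(32, 4096, 1024, 16))  # → 6815744 (≈0.08% of 8B parameters)
```

Targeting more modules (k_proj, o_proj, the MLP projections) raises the count proportionally; even then the adapters stay a tiny fraction of the base model.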
Primary sources — fine-tuning

Hu, E.J., et al. (2021). “LoRA: Low-Rank Adaptation of Large Language Models.” arxiv.org/abs/2106.09685

Dettmers, T., et al. (2023). “QLoRA: Efficient Finetuning of Quantized LLMs.” arxiv.org/abs/2305.14314

Llama licence — what you can and cannot do

Understanding the licence is essential before commercial use:

  • Llama 3 Community Licence: Broadly permissive. You can use, copy, modify, and distribute for research and commercial purposes. You must include attribution. You must comply with Llama’s usage policy (which prohibits certain harmful applications). The licence restriction for very large organisations (>700M MAU) applies — they need a separate commercial agreement with Meta.
  • Llama 4 Licence: Similar structure, updated terms. Always check the specific licence file in the model repository before commercial deployment.
  • What is NOT allowed under any Llama licence: Using Llama outputs to train competing foundation models, facilitating illegal activities, and certain other prohibited uses detailed in the acceptable use policy.

Full licence text: github.com/meta-llama/llama-models — LICENSE