Meta’s open-source AI — the model anyone can download, run, modify, and build on. The foundation of hundreds of AI products. The most important open-source AI story in the world. What open-source means, the full history, how to use it, why it matters, and complete technical depth. Three reading levels. Official sources only.
Almost every AI you have heard of — ChatGPT, Claude, Gemini, Copilot — is a service. You visit a website, you use the AI, the company controls it, you pay them or use their free tier, and you trust them with whatever you type.
Llama is different. It is an AI model made by Meta — the company that owns Facebook and Instagram — and Meta gives it away for free. Not just free to use — free to download. The actual model itself. You can put it on your own computer and run it. No internet connection needed. No company watching what you type. No usage limits. Just the AI, running on your hardware, entirely under your control.
This is what “open-source” means in AI: the model’s weights — the billions of numerical values that make the AI work — are published openly for anyone to use.
Imagine recipes. Most restaurants keep their recipes secret — you can eat there, but you can’t take the recipe home. ChatGPT and Claude are like restaurants. Llama is like a restaurant that posts its recipes online for free. You can make the food yourself, in your own kitchen, for your own family, change the recipe, improve it, make it spicier, give your version to friends.
The vast majority of people will still go to restaurants (use ChatGPT or Claude). But for cooks — developers, researchers, companies with specific needs — having the recipe changes everything.
Llama was made by Meta AI — the artificial intelligence research division of Meta, the company founded by Mark Zuckerberg. Meta AI is headquartered in Menlo Park, California, with major research offices in Paris and New York.
The person most associated with Meta’s AI philosophy is Yann LeCun, Meta’s Chief AI Scientist. LeCun is one of the most respected figures in AI research — he co-won the 2018 Turing Award (the Nobel Prize of computing) alongside Geoffrey Hinton and Yoshua Bengio for their foundational work on deep learning. Unlike many AI leaders, LeCun has been consistently outspoken in support of open-source AI development and sceptical of what he sees as exaggerated concerns about near-term AI risk.
LeCun’s view — which underpins Meta’s Llama strategy — is that open AI is safer AI: more researchers can study it, identify problems, and improve it. Closed AI concentrates power in the hands of a few companies.
It seems counterintuitive. Why would Meta spend hundreds of millions of dollars building an AI and then give it away for free?
Two reasons. First, Meta’s business is social media, not selling AI services — they do not have a competing product to protect. Second, by releasing Llama openly, Meta accelerates the entire field, builds enormous goodwill with developers, and ensures that AI development is not monopolised by OpenAI and Google. A world with Llama is a world where Meta has influence even without a major AI product. Strategic generosity.
By late 2022, the frontier of AI was firmly behind closed doors. OpenAI had GPT-3 and GPT-3.5 locked behind a paid API. Google had LaMDA but was not releasing it. The open-source AI community had models, but nothing close to the capability of the frontier. If you wanted cutting-edge AI, you paid OpenAI.
The researchers at Meta AI — many of them veterans of Google Brain, DeepMind, and top universities — believed this was wrong. Not just strategically for Meta, but philosophically: AI this powerful and this broadly applicable should not be controlled exclusively by one or two companies.
On 24 February 2023, Meta released Llama 1 — a family of models ranging from 7 billion to 65 billion parameters. The release was initially restricted to researchers who applied for access through a form. Meta wanted to control the rollout carefully.
The models were immediately impressive. Not quite frontier (GPT-4, released the following month, was significantly more capable), but far better than anything previously available openly. For a 65B parameter model running on research hardware, the performance was remarkable.
Then something happened that Meta had not planned for.
Within days of Llama 1’s restricted release, someone posted the model weights to a public forum. The leak spread instantly. Within a week, Llama 1’s weights were permanently, irrevocably available to anyone on the internet who wanted them.
The reaction divided observers. Some saw this as a disaster — dangerous AI capabilities unleashed without proper safety review. Others saw it as the inevitable consequence of any restricted-release strategy for AI. If you tell researchers they can have access but not share it, someone will share it.
What happened next was extraordinary. Developers around the world began building on Llama 1 at a pace that no closed model could match. Within weeks, the open-source AI ecosystem, which had been starved of capable foundation models, exploded with fine-tunes, ports, and tools built on the leaked weights.
Meta drew the obvious conclusion. On 18 July 2023, Meta released Llama 2 — not restricted to researchers, but openly available for research and commercial use. Anyone could download and use it. Companies could build products on it. The only restriction: organisations with more than 700 million monthly active users needed a special licence (effectively targeting only other tech giants like Google).
Llama 2 was a significant improvement over Llama 1. Three sizes were released: 7B, 13B, and 70B parameters (a 34B model was trained but never publicly released). The 70B model was competitive with GPT-3.5 on many benchmarks. Separately, Meta released Llama 2 Chat — versions fine-tuned for conversational use.
The reaction was enormous. Llama 2 was downloaded millions of times in the first days. Hugging Face — the platform that hosts AI models — saw its servers strained. Every major cloud provider added Llama 2 to their model catalogues. Hundreds of companies began building Llama 2-based products.
Llama 3 launched on 18 April 2024. Two sizes initially: 8B and 70B. The 70B model’s benchmark performance was striking — it matched or exceeded GPT-3.5 and was competitive with Claude 3 Sonnet on many tasks. For a freely available, open-weight model, this was unprecedented.
Key improvements in Llama 3: a new tokeniser with a vocabulary of 128,000 tokens (vs 32,000 in Llama 2), better instruction following, significantly improved coding ability, and stronger performance on multilingual tasks. The model was trained on over 15 trillion tokens — more than seven times the training data of Llama 2.
On 23 July 2024, Meta released Llama 3.1 — including a 405-billion parameter model. This was the largest open-weight model ever released. Its performance matched GPT-4o on several benchmarks. The implications were significant: for the first time, frontier-level AI capability was available as open weights, to anyone, for free.
Mark Zuckerberg published a letter arguing that open-source AI was the future — safer, more accessible, and ultimately better for society than closed models. This was a direct challenge to OpenAI and Anthropic’s closed model approach.
The 8B and 70B variants also received significant improvements, including a context window extended to 128,000 tokens — enough to hold over 100 pages of text in a single conversation.
Llama 4 launched in April 2025 as a significant architectural shift. Three variants: Scout, Maverick, and Behemoth (the largest, announced as still in training at launch).
Llama 4 was natively multimodal — trained on text, images, and video from the ground up. This brought it level with commercial multimodal models on visual understanding tasks.
Most people who “use Llama” do not know they are using Llama. It underpins a large and growing number of consumer and enterprise applications. But there are several ways to interact with it directly:
Meta’s own AI assistant, available at meta.ai and built into WhatsApp, Instagram, and Facebook Messenger. This is the easiest way for most people to experience Llama — it is the familiar chat interface that runs on Meta’s servers. Free to use.
Ollama is a free tool that makes running Llama on your Mac, Windows, or Linux computer as simple as a single command. Your data never leaves your machine. No API key needed. No usage limits. Completely private.
Minimum hardware for the 8B model: 8GB RAM (runs on most modern laptops). For 70B: 48GB RAM (high-end workstation or Mac with M-series chip and 64GB RAM).
You are a doctor and you want AI help with patient notes — but you cannot send patient data to OpenAI’s servers. Run Llama locally. All patient information stays on your computer.
You are a lawyer reviewing confidential contracts. Run Llama locally. The contracts never leave your office.
You are a developer building an app and you need AI that works offline. Llama locally means your app works without internet.
Hundreds of products use Llama as their foundation. When you use a customer service chatbot, a coding assistant, a writing tool, or a business AI — there is a reasonable chance it is powered by a fine-tuned Llama model. You may never know, because the company has taken Llama and customised it for their specific use case.
If you want to try Llama models without downloading them, several cloud providers, such as Groq, offer API access at low cost.
When you send text to ChatGPT or Claude, that text goes to OpenAI’s or Anthropic’s servers. Their privacy policies say they do not use it to train models (unless you opt in), but your data does leave your device. For sensitive information — medical records, legal documents, financial data, personal matters — this may be unacceptable.
Running Llama locally means your data never leaves your computer. Full stop.
One of the most powerful things you can do with an open-weight model is fine-tune it — train it further on your own data to become a specialist. A hospital can fine-tune Llama on medical literature and clinical notes to create an AI that understands medical language far better than a general model. A law firm can fine-tune on legal documents. An e-commerce company on product catalogue and customer service interactions.
Fine-tuning a closed model (like GPT-4) is limited to what the API provider allows. Fine-tuning Llama has no restrictions — you are working with the weights directly.
Paying OpenAI or Anthropic per API call adds up at scale. A product making millions of AI requests per day pays substantial API fees. Running your own Llama instance — on your own servers or cloud infrastructure — can be significantly cheaper once you reach sufficient volume. Many companies reach the crossover point where self-hosting is cheaper than API usage.
If OpenAI changes its pricing, its terms of service, its model behaviour, or if it goes out of business — every product built on the OpenAI API is affected. Products built on Llama are not. The weights they downloaded do not change. Their model continues working regardless of what happens to Meta.
For most working users, the choice is between Meta AI (simple, no setup), a cloud API (flexible, scalable), or a local installation (private, unlimited). Here is a practical guide for each.
Go to meta.ai. Create or sign in with a Meta account. Start chatting. This is Llama 4 running on Meta’s infrastructure — the same underlying model, with Meta’s fine-tuning and safety layers on top. Free. No technical setup.
# Step 1: Install Ollama from ollama.com
# (One-click installer for Mac/Windows, curl for Linux)
# Step 2: Pull a model and start chatting
ollama run llama3.2 # 3B — fast, works on most computers
ollama run llama3.1:8b # 8B — better quality, needs 8GB RAM
ollama run llama3.1:70b # 70B — best quality, needs 48GB RAM
# Step 3: Ollama also runs as a local API
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Explain quantum computing simply",
  "stream": false
}'
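The same local endpoint can be called from Python. A minimal sketch using only the standard library (the `ollama_generate` function name is ours; it assumes Ollama is running on its default port):

```python
import json
import urllib.request

def ollama_generate(prompt, model="llama3.1:8b", host="http://localhost:11434"):
    # POST to the same /api/generate endpoint as the curl example above
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(f"{host}/api/generate", data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Requires a running Ollama instance:
# print(ollama_generate("Explain quantum computing simply"))
```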
# Calling Llama 3.1 70B through Groq's OpenAI-compatible Python client
from groq import Groq

client = Groq(api_key="your-groq-api-key")
response = client.chat.completions.create(
    model="llama-3.1-70b-versatile",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarise this document: [paste content]"}
    ],
    temperature=0.7,
    max_tokens=1024
)
print(response.choices[0].message.content)
The Llama series uses a decoder-only transformer architecture — the same fundamental design as GPT-3 and subsequent OpenAI models — with a set of architectural improvements that have been influential across the open-source community.
Pre-normalisation with RMSNorm: Llama applies layer normalisation before the attention and feed-forward sub-layers (pre-norm) rather than after (post-norm as in the original transformer). This improves training stability. Specifically, Llama uses RMSNorm (Root Mean Square Layer Normalisation) rather than the standard LayerNorm — computationally cheaper and empirically comparable in performance. Source: Zhang and Sennrich (2019), “Root Mean Square Layer Normalization.”
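The difference is easy to see in code. A minimal NumPy sketch of both normalisations (illustrative, not Meta's implementation):

```python
import numpy as np

def layer_norm(x, gain, bias, eps=1e-6):
    # LayerNorm: centre on the mean, then rescale by the standard deviation
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * gain + bias

def rms_norm(x, gain, eps=1e-6):
    # RMSNorm: no mean-centring; normalise by the root mean square alone
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return x / rms * gain

x = np.random.randn(4, 4096)   # a batch of hidden-state vectors
out = rms_norm(x, np.ones(4096))
# Each row of `out` now has approximately unit root mean square
```

RMSNorm drops the mean subtraction and bias, saving one pass over the activations per layer.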
SwiGLU activation function: The feed-forward sub-layers use SwiGLU activations (a Swish-gated linear unit variant), following Shazeer (2020) and the PaLM architecture, rather than the ReLU or GELU activations used in GPT-style models. SwiGLU empirically improves model performance at equivalent compute.
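In code, the gated feed-forward looks like this; a NumPy sketch with illustrative (not Llama's actual) dimensions:

```python
import numpy as np

def swish(x):
    return x / (1.0 + np.exp(-x))  # Swish / SiLU: x * sigmoid(x)

def swiglu_ffn(x, W_gate, W_up, W_down):
    # Llama-style FFN: the Swish-activated gate branch multiplies the linear up branch
    return (swish(x @ W_gate) * (x @ W_up)) @ W_down

rng = np.random.default_rng(0)
d_model, d_ff = 512, 1376   # toy sizes; Llama uses roughly an 8/3 expansion ratio
W_gate = rng.normal(size=(d_model, d_ff)) * 0.02
W_up = rng.normal(size=(d_model, d_ff)) * 0.02
W_down = rng.normal(size=(d_ff, d_model)) * 0.02
y = swiglu_ffn(rng.normal(size=(8, d_model)), W_gate, W_up, W_down)
```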
Rotary Positional Embeddings (RoPE): Llama uses RoPE (Su et al., 2021) rather than absolute or learnt positional embeddings. RoPE encodes position information directly into the attention computation, and crucially enables better generalisation to sequence lengths beyond those seen during training — a significant advantage for long-context applications.
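A simplified NumPy sketch of RoPE applied to a single attention head (the pair-wise rotation form; production code typically uses an interleaved or complex-number formulation):

```python
import numpy as np

def rope(x, positions, base=10000.0):
    # Rotate each consecutive pair of dimensions by a position-dependent angle
    d = x.shape[-1]                                # head dimension (must be even)
    freqs = base ** (-np.arange(0, d, 2) / d)      # one frequency per pair
    angles = positions[:, None] * freqs[None, :]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Key property: the q-k dot product depends only on *relative* position
rng = np.random.default_rng(0)
q, k = rng.normal(size=(1, 64)), rng.normal(size=(1, 64))
near = rope(q, np.array([0]))[0] @ rope(k, np.array([5]))[0]
far = rope(q, np.array([100]))[0] @ rope(k, np.array([105]))[0]
# near and far are equal: an offset of 5 scores identically anywhere in the sequence
```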
Grouped Query Attention (GQA) — introduced in Llama 2: Standard multi-head attention (MHA) uses separate key and value projections for each attention head. GQA groups multiple heads to share a single key-value projection, significantly reducing the KV cache memory requirements during inference — enabling longer contexts and faster generation — with minimal quality degradation. This became standard practice across many subsequent models.
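The memory saving is simple arithmetic. A sketch using a Llama-3-70B-like shape (80 layers, head dimension 128, 64 query heads, 8 KV heads; these figures are our assumption of the published configuration):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # Two cached tensors per layer (K and V), each (kv_heads, seq_len, head_dim), fp16
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# 8,192-token context
mha = kv_cache_bytes(layers=80, kv_heads=64, head_dim=128, seq_len=8192)  # no grouping
gqa = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=8192)  # grouped
print(f"MHA: {mha / 2**30:.1f} GiB  GQA: {gqa / 2**30:.1f} GiB")
# MHA: 20.0 GiB  GQA: 2.5 GiB, an 8x reduction per sequence
```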
Touvron, H., et al. (2023). “LLaMA: Open and Efficient Foundation Language Models.” Meta AI. arxiv.org/abs/2302.13971
Touvron, H., et al. (2023). “Llama 2: Open Foundation and Fine-Tuned Chat Models.” Meta AI. arxiv.org/abs/2307.09288
The Llama 3 technical report documents the most significant training data improvement: a new dataset of over 15 trillion tokens, curated with more aggressive quality filtering, deduplication, and domain balancing than Llama 2’s 2 trillion token dataset. The improved tokeniser uses a vocabulary of 128,256 tokens (vs 32,000 for Llama 2), enabling more efficient representation of code, mathematics, and non-English text.
Llama 3’s instruction-following improvements were achieved through a multi-stage post-training pipeline: supervised fine-tuning, rejection sampling fine-tuning, direct preference optimisation (DPO), and proximal policy optimisation (PPO) — a combination more sophisticated than the RLHF approach used in Llama 2.
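For the DPO stage, the core objective is compact enough to show directly. A sketch of the per-example loss (Rafailov et al., 2023), with made-up log-probabilities:

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Inputs are total log-probabilities of each full response under the
    # trainable policy (pi_*) and a frozen reference model (ref_*).
    # The loss rewards widening the policy's preference margin for the
    # chosen response, relative to the reference.
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

low = dpo_loss(-10.0, -30.0, -20.0, -25.0)   # policy prefers the chosen answer
high = dpo_loss(-30.0, -10.0, -25.0, -20.0)  # policy prefers the rejected answer
```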
Meta AI (2024). “The Llama 3 Herd of Models.” arxiv.org/abs/2407.21783
Llama 4 introduces two architectural paradigm shifts relative to previous Llama versions:
Mixture-of-Experts (MoE): Rather than activating the full parameter set for every token, Llama 4 Maverick and Behemoth use a learned routing mechanism that dispatches each token to a subset of “expert” feed-forward networks. Maverick has 128 routed experts plus a shared expert; each token activates just one routed expert, so the model totals 400B parameters but only 17B are active per token — enabling high model capacity at a fraction of the inference cost of an equivalent dense model.
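A toy NumPy sketch of top-1 expert routing (sizes are illustrative; Llama 4's shared expert and load-balancing machinery are omitted):

```python
import numpy as np

def moe_layer(x, router_W, experts):
    # Top-1 routing: each token is processed only by its highest-scoring expert
    choice = (x @ router_W).argmax(axis=-1)    # one expert index per token
    out = np.empty_like(x)
    for e, (W1, W2) in enumerate(experts):
        mask = choice == e
        if mask.any():
            out[mask] = np.maximum(x[mask] @ W1, 0) @ W2  # toy 2-layer ReLU expert
    return out

rng = np.random.default_rng(0)
d, n_experts = 16, 4   # toy sizes; Maverick uses 128 experts
router_W = rng.normal(size=(d, n_experts))
experts = [(rng.normal(size=(d, 64)), rng.normal(size=(64, d)))
           for _ in range(n_experts)]
out = moe_layer(rng.normal(size=(32, d)), router_W, experts)
# All expert parameters exist in memory, but each token touched only one expert
```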
Native multimodality: Llama 4 is trained on text, images, and video from the beginning of pre-training, rather than adding vision as a post-hoc module. The vision encoder uses a modified ViT (Vision Transformer) architecture. Cross-modal attention enables the language decoder to attend to visual representations at any layer, not just at the input.
Mixture of Depths: Llama 4 also incorporates Mixture of Depths — an adaptive computation technique where different tokens are routed to different numbers of transformer layers based on a learned difficulty estimate. Simple tokens use fewer layers; complex tokens use more. This further improves efficiency.
Meta AI (2025). “Llama 4 Model Card and Technical Summary.” github.com/meta-llama/llama-models
Meta AI (2025). “Introducing Llama 4.” ai.meta.com/blog/llama-4
Full fine-tuning of a 70B parameter model requires roughly 140GB of GPU memory just to hold the weights in 16-bit precision, and several times that again for gradients and optimiser states — out of reach for most organisations. Two parameter-efficient fine-tuning (PEFT) techniques have made fine-tuning accessible:
LoRA (Low-Rank Adaptation, Hu et al. 2021): Rather than updating all model weights, LoRA adds small low-rank matrices to the weight matrices of attention layers. Only these small matrices are trained — reducing trainable parameters by orders of magnitude. The base model weights are frozen. At inference, the LoRA adapters are merged back into the base weights with no inference overhead.
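The core idea fits in a few lines. A NumPy sketch with illustrative sizes (in practice this is applied to each targeted attention projection):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 1024, 16                      # hidden size (illustrative) and LoRA rank
W = rng.normal(size=(d, d))          # frozen base weight: never updated
A = rng.normal(size=(r, d)) * 0.01   # trained low-rank factor
B = rng.normal(size=(d, r)) * 0.01   # trained low-rank factor (zero-initialised in practice)

x = rng.normal(size=(d,))
y = W @ x + B @ (A @ x)              # adapted forward pass during fine-tuning

W_merged = W + B @ A                 # merged once after training: no inference overhead
assert np.allclose(W_merged @ x, y)

print(2 * r * d, "vs", d * d)        # trainable vs base parameters for this layer
```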
QLoRA (Dettmers et al. 2023): Extends LoRA by quantising the base model to 4-bit precision before fine-tuning (reducing memory requirements by ~4x), while maintaining LoRA adapters in full 16-bit precision. QLoRA enabled fine-tuning Llama 65B on a single 48GB GPU — making specialist AI accessible to individuals and small organisations.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch

# Load model in 4-bit (QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B",
    quantization_config=bnb_config,
    device_map="auto"
)

# Configure LoRA
lora_config = LoraConfig(
    r=16,                                 # Rank — higher = more capacity, more memory
    lora_alpha=32,                        # Scaling factor
    target_modules=["q_proj", "v_proj"],  # Which layers to adapt
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints trainable vs total parameters: with this config,
# well under 0.1% of the 8B model's weights are trained
Hu, E.J., et al. (2021). “LoRA: Low-Rank Adaptation of Large Language Models.” arxiv.org/abs/2106.09685
Dettmers, T., et al. (2023). “QLoRA: Efficient Finetuning of Quantized LLMs.” arxiv.org/abs/2305.14314
Understanding the licence is essential before commercial use:
Full licence text: github.com/meta-llama/llama-models — LICENSE