From a 2B parameter Indic language model to a sovereign 105B MoE flagship — the story of how Sarvam AI built India's first full-stack foundation models from scratch, in India, for India.
India has 1.4 billion people speaking 22 scheduled languages across 12 scripts. Dominant English-centric LLMs handle most of these languages poorly: token-inefficient tokenization for Indic scripts (4–8 tokens per word vs ~1.4 for English), limited training data in regional languages, and cultural context gaps that matter for real-world deployment in healthcare, banking, government services, and education.
Sarvam AI, founded in August 2023 by Vivek Raghavan and Pratyush Kumar — both formerly of AI4Bharat at IIT Madras — set out to build the full stack from scratch: tokenizer, architecture, training data, training infrastructure, post-training pipelines, and inference systems. Not a fine-tune of someone else's checkpoint. Built in India, on Indian compute, under the IndiaAI Mission.
The result is a model family that spans a 2B base model (Sarvam-1, Oct 2024), a 24B fine-tune (Sarvam-M, May 2025), and now two fully sovereign foundation models: Sarvam 30B and Sarvam 105B, both MoE architectures trained from scratch and open-sourced under Apache 2.0 in February–March 2026.
All officially scheduled Indian languages, including code-mixed formats like Hinglish. Custom tokenizer with fertility rates of 1.4–2.1 vs 4–8 for standard models.
Both 30B and 105B activate only a fraction of total parameters per token — 2.4B and 10.3B respectively — keeping inference costs practical at scale.
Weights on HuggingFace and AIKosh. Enterprise and developer use permitted. Aimed at reducing India's dependence on closed foreign AI systems.
Trained entirely in India on Yotta's Shakti GPU cluster using government-provided compute under the ₹10,372-crore IndiaAI Mission.
Full pipeline: pre-training → SFT → RL. Models include extended thinking mode and agentic traces for real-world tool use and multi-step workflows.
Released in October 2024, Sarvam-1 was the first model to demonstrate that a carefully curated 2B parameter model can outperform much larger general-purpose models on Indian languages. Trained on 2 trillion tokens across 10 Indic languages (Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu), it beat Gemma-2-2B and Llama-3.2-3B on Indic benchmarks, and stayed competitive with Llama-3.1-8B despite being 4× smaller.
The key insight: token efficiency first. Existing multilingual models need 4–8 tokens per Indic word due to poor tokenizer design. Sarvam-1's custom tokenizer achieves 1.4–2.1 fertility across all supported languages — matching English efficiency — which directly improves model capacity utilization and training signal quality.
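Fertility is simply the average number of tokens a tokenizer emits per word. A minimal sketch of the metric, using a deliberately bad byte-level tokenizer as a stand-in (Sarvam's actual tokenizer is not reproduced here):

```python
def fertility(tokenize, text: str) -> float:
    """Average number of tokens produced per whitespace-delimited word."""
    words = text.split()
    if not words:
        return 0.0
    return len(tokenize(text)) / len(words)

# Worst-case stand-in tokenizer: every UTF-8 byte is one token. Devanagari
# characters are 3 bytes each in UTF-8, so per-word token counts balloon,
# mirroring how ill-fitting vocabularies inflate Indic fertility.
byte_tokenize = lambda s: list(s.encode("utf-8"))

print(fertility(byte_tokenize, "hello world"))    # 5.5
print(fertility(byte_tokenize, "नमस्ते दुनिया"))    # 18.0
```

A well-fit vocabulary pulls this number toward 1: each common word (or morpheme) becomes a single token, so the model sees more words per unit of context and compute.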
Sarvam-1 uses more layers with a smaller hidden dimension than similarly-sized models. Research at the time (later reinforced by Qwen3) showed this improves performance for a fixed parameter budget, particularly on multilingual tasks where diverse representations benefit from greater depth.
Trained on Yotta's Shakti HGX H100 cluster with 1,024 GPUs over 5 days using NVIDIA NeMo framework. Kernel fusion and mixed-precision optimizations throughout. Note: this is a base completion model — not instruction-tuned, designed to be fine-tuned for downstream tasks.
Released May 2025, Sarvam-M was a significant step up in capability — fine-tuned on Mistral-Small-3.1-24B-Base to enhance Indian language performance, reasoning, and coding. While technically built on a foreign base model (which later drew criticism for not qualifying as truly sovereign), it served as a crucial bridge that demonstrated Sarvam's SFT and RL pipeline capabilities at scale.
Sarvam-M supports 11 Indian languages, has a 32K token context window with sliding window attention of 4,096 tokens, and includes a thinking mode for extended reasoning. It was Sarvam's first production-deployed model for conversational use cases.
Sarvam-M drew industry criticism because it was fine-tuned from Mistral's architecture — a French AI company's design. While performance on Indic languages improved significantly, critics argued it did not reduce structural dependence on foreign AI infrastructure. This directly motivated Sarvam's decision to train 30B and 105B from scratch.
GQA attention with sliding window of 4,096 tokens for local layers and full attention for global layers. 32K context window. SwiGLU FFN. Standard Pre-Norm with RMSNorm. RoPE positional embeddings. The key Sarvam contribution was the post-training pipeline: instruction tuning, safety fine-tuning, and RL on Indian-language-heavy prompts.
Sarvam 30B is the first model in the family built entirely from scratch in India — architecture, training data, tokenizer, and training infrastructure. Pre-trained on 16 trillion tokens spanning code, web data, mathematics, and multilingual content, with a custom data mixture tuned after extensive ablations. Then put through SFT and RL pipelines developed entirely in-house.
The architecture uses a Heterogeneous MoE design: 19 layers total — 1 dense layer followed by 18 MoE layers. Each MoE layer has 128 experts with top-6 routing (6 experts activate per token). A dedicated shared expert handles common linguistic patterns, keeping consistent representations across all inputs. Grouped Query Attention (GQA) with 4 KV heads per layer balances memory bandwidth and generation quality.
Instead of traditional softmax gating over expert logits, Sarvam uses sigmoid-based routing scores. Softmax normalizes scores across all experts, creating competition and routing collapse where a few experts dominate over time. Sigmoid scores are independent per expert, improving load balancing and encouraging more uniform expert utilization — critical for training stability over 16 trillion tokens.
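The difference can be sketched in a few lines of plain Python. This is a toy illustration of the scoring-then-top-k step, not Sarvam's actual router; the expert count and logits are made up:

```python
import math

def route_token(logits, k, use_sigmoid=True):
    """Pick top-k experts for one token and return normalized gate weights.

    Sigmoid scores each expert independently; softmax makes experts compete
    for a shared probability mass, which can let a few experts dominate.
    """
    if use_sigmoid:
        scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    else:
        m = max(logits)  # subtract max for numerical stability
        exps = [math.exp(x - m) for x in logits]
        total = sum(exps)
        scores = [e / total for e in exps]
    # Indices of the k highest-scoring experts.
    topk = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Renormalize the selected gates so each token's weights sum to 1.
    z = sum(scores[i] for i in topk)
    return {i: scores[i] / z for i in topk}

# One token routed over 8 experts (illustrative; the 30B uses 128 experts, top-6).
gates = route_token([0.2, -1.0, 3.0, 0.5, 1.2, -0.3, 0.0, 2.1], k=6)
print(sorted(gates))                    # [0, 2, 3, 4, 6, 7]
print(round(sum(gates.values()), 6))    # 1.0
```

Note that with sigmoid scoring, raising one expert's logit does not depress every other expert's score, which is exactly the competition dynamic that drives routing collapse under softmax.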
Sarvam 30B uses a RoPE theta of 8,000,000 — orders of magnitude higher than Llama's 500,000 or the original 10,000. This allows stable positional encoding at long contexts without needing a separate RoPE scaling mechanism (no YaRN or ABF needed). The model handles 32K context natively without degradation.
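A quick way to see what a larger base buys: each RoPE frequency pair rotates with angular frequency θ^(−2i/d), so its wavelength in tokens is 2π·θ^(2i/d). Raising θ stretches the slowest wavelengths, keeping distant positions distinguishable. The head dimension below is an assumed value for illustration, not a published Sarvam figure:

```python
import math

def rope_wavelengths(theta: float, head_dim: int):
    """Wavelength (in tokens) of each RoPE rotation pair for base theta.

    Pair i rotates at angular frequency theta**(-2*i/head_dim); wavelength
    is 2*pi divided by that frequency, i.e. 2*pi * theta**(2*i/head_dim).
    """
    return [2 * math.pi * theta ** (2 * i / head_dim) for i in range(head_dim // 2)]

# Slowest-rotating pair's wavelength under the original base (10,000) vs 8M.
# head_dim=128 is an assumption for illustration.
classic = rope_wavelengths(10_000, 128)[-1]
large = rope_wavelengths(8_000_000, 128)[-1]
print(f"theta=1e4: {classic:,.0f} tokens")
print(f"theta=8e6: {large:,.0f} tokens")
```

With θ=10,000 the slowest pair completes a full rotation within roughly 50K tokens; at θ=8,000,000 that wavelength grows by orders of magnitude, which is why no separate RoPE scaling scheme is needed to cover 32K context.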
Sarvam 30B is the engine powering Samvaad, Sarvam's conversational AI platform. The 2.4B active parameters per token make it fast enough for real-time voice interactions across Indian languages. NVIDIA co-designed inference optimizations delivered 4× speedup over baseline H100 performance via kernel fusion, RadixAttention for KV prefix reuse, and Blackwell NVFP4 quantization.
Sarvam 105B extends the 30B architecture to 32 layers (1 dense + 31 MoE) with larger expert FFN hidden size and top-8 routing. The critical architectural addition at this scale is Multi-Head Latent Attention (MLA) — the same KV cache compression technique pioneered by DeepSeek V3 — which enables the 128K context window without prohibitive memory requirements.
Pre-trained on 12 trillion tokens (fewer than the 30B, but with a heavier emphasis on Indian languages, STEM, and agentic data). Full post-training: SFT on diverse, synthetically augmented prompts including agentic traces and tool-use trajectories, followed by RL using an asynchronous GRPO setup with adaptive rollout allocation.
Multi-Head Latent Attention compresses Key and Value tensors into a low-dimensional latent vector before projecting back out for attention computation. The KV cache stores this compressed latent rather than full K/V tensors — dramatically reducing memory at 128K context lengths. Sarvam adopted MLA (similar to DeepSeek V3) specifically because GQA wasn't sufficient for long-context efficiency at the 105B scale.
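A back-of-envelope comparison shows why caching a latent matters at 128K context. All shapes below are assumptions for illustration; Sarvam has not published these dimensions:

```python
# Sketch of MLA-style KV cache accounting: store one low-rank latent per token
# instead of full K/V tensors for every head. Shapes are illustrative only.
hidden = 8192       # model hidden size (assumed)
n_heads = 64        # attention heads (assumed)
head_dim = 128      # per-head dimension (assumed)
latent = 512        # compressed KV latent dimension (assumed)

full_kv_per_token = 2 * n_heads * head_dim   # full K and V for every head
mla_cache_per_token = latent                 # one shared latent per token

ctx = 128_000
print(f"full KV cache:    {full_kv_per_token * ctx / 1e6:.0f}M values")
print(f"MLA latent cache: {mla_cache_per_token * ctx / 1e6:.0f}M values")
print(f"compression:      {full_kv_per_token / mla_cache_per_token:.0f}x")
```

Under these assumed shapes the latent cache is 32× smaller per token; the trade-off is an extra up-projection from the latent back to per-head K/V at attention time, which costs compute but not cache memory.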
The RL stage uses an adaptive knapsack-style rollout allocation: prompts are pre-filtered to remove trivially solvable or unsolvable tasks, then rollouts are dynamically weighted toward tasks near the model's capability frontier — where learning signal is strongest. An asynchronous GRPO setup decouples generation, reward computation, and policy updates to maximize GPU utilization during RL.
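The allocation logic can be sketched as follows (an illustrative toy, not Sarvam's implementation; the p·(1−p) weight is a stand-in proxy for learning signal, since prompts the policy always or never solves yield identical rewards across a GRPO group and hence zero advantage):

```python
# Sketch of capability-frontier rollout allocation. Prompts with estimated
# solve rates near 0.5 receive the most rollouts; trivially solved (~1.0)
# or hopeless (~0.0) prompts are filtered out entirely.
def allocate_rollouts(solve_rates, budget, lo=0.05, hi=0.95):
    """Split a total rollout budget across prompts by learning-signal weight."""
    # Bernoulli variance p*(1-p) peaks at p=0.5, the capability frontier;
    # prompts outside (lo, hi) give no gradient signal and get zero weight.
    weights = [p * (1 - p) if lo < p < hi else 0.0 for p in solve_rates]
    total = sum(weights) or 1.0
    return [round(budget * w / total) for w in weights]

rates = [0.0, 0.1, 0.5, 0.8, 1.0]  # estimated pass rates from prior rollouts
print(allocate_rollouts(rates, budget=100))  # [0, 18, 50, 32, 0]
```

Half the budget goes to the prompt at the frontier (p=0.5), and none to the prompts the model already solves or cannot solve, concentrating GPU hours where policy updates actually move the reward.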
Sarvam 105B powers Indus, Sarvam's AI assistant for complex reasoning and agentic workflows. Benchmarked against JEE Main 2026 papers (Math: 25/25 at Pass@1), Codeforces Div3 problems, and the Tau 2 Bench for agentic reasoning, where it outperforms DeepSeek R1, Gemini 2.5 Flash, and o4-mini.
Benchmark positioning: Sarvam 105B competes with GPT-OSS-120B and Qwen3-Next-80B. The 30B targets Gemma 27B and GPT-OSS-20B. Both models achieve state-of-the-art on Indian language benchmarks at their parameter class, outperforming significantly larger general-purpose models on Indic tasks.
| Model | Params | Active | Attention | Architecture | Context | Key Feature |
|---|---|---|---|---|---|---|
| Sarvam-1 | 2B | 2B (dense) | GQA | Dense Transformer | 4K | Custom Indic tokenizer · SwiGLU + RoPE |
| Sarvam-M | 24B | 24B (dense) | GQA + SWA | Dense (Mistral base) | 32K | Fine-tune of Mistral 3.1 · Thinking mode |
| Sarvam 30B | 30B | 2.4B / token | GQA (4 KV heads) | MoE · 128 experts · top-6 | 32K | Sigmoid routing · RoPE θ=8M · Shared expert |
| Sarvam 105B | 105B | 10.3B / token | MLA (DeepSeek-style) | MoE · 128 experts · top-8 | 128K | MLA KV compression · Async GRPO RL · Expert bias |
Sarvam's biggest competitive advantage isn't model size — it's the custom tokenizer. With 1.4–2.1 fertility vs 4–8 for standard models, every layer of the model gets more signal per Indic token. Better tokenization compounds through the entire training process.
Using sigmoid instead of softmax for MoE gating scores is a meaningful deviation from standard practice. Independent per-expert scoring prevents routing collapse and distributes load more uniformly — important for maintaining expert diversity over 16T training tokens.
Sarvam 105B adopting MLA (alongside Kimi K2, GLM-5, Mistral 3 Large) cements MLA as the emerging standard for long-context efficiency. GQA wasn't sufficient at 128K; MLA's latent KV compression solved what GQA couldn't.
128 small experts with top-6 or top-8 routing follows the DeepSeek V3 fine-grained template rather than Grok 2.5's coarse-grained 8-large-expert approach. More experts, more specialization — especially useful for multilingual models where language-specific routing matters.
The Sarvam-M controversy clarified what "sovereign AI" actually requires: not just open weights, but control over architecture design, training data, infrastructure, and post-training pipeline. Sarvam 30B and 105B meet that bar. Sarvam-M did not.
Sarvam's knapsack-based rollout allocation — concentrating RL compute on tasks near the model's capability frontier — is more sophisticated than standard uniform sampling. Combined with async GRPO and trajectory staleness controls, it squeezes maximum learning signal per GPU hour.
From founding to sovereign flagship — a 2.5 year journey
Vivek Raghavan and Pratyush Kumar — both formerly of AI4Bharat at IIT Madras — found Sarvam AI in Bengaluru. The mission: build large language models and multimodal AI systems with a focus on Indian languages from the ground up. In December 2023, the company closes a combined seed + Series A of ~$41M led by Lightspeed, with Peak XV Partners and Khosla Ventures participating.
The first public model: a 2B dense transformer trained from scratch on 2 trillion tokens across 10 Indic languages. The headline achievement is the custom tokenizer, achieving 1.4–2.1 fertility vs 4–8 for existing multilingual models. Sarvam-1 outperforms Gemma-2-2B and Llama-3.2-3B on Indic benchmarks despite being the same size or smaller, and stays competitive with Llama-3.1-8B with 4× fewer parameters. Uses GQA, SwiGLU, RoPE (θ=10,000), bfloat16 training.
In April 2025, India's Ministry of Electronics (MeitY) selects Sarvam AI under the IndiaAI Mission to develop an indigenous foundation model — providing access to government-backed GPU compute. In May, Sarvam releases Sarvam-M, a 24B model fine-tuned from Mistral-Small-3.1-24B. It supports 11 Indian languages, includes a thinking mode, and is the first Sarvam model deployed in production (for conversational use). However, it quickly draws criticism for being a foreign-architecture fine-tune rather than a truly sovereign model.
At the India AI Impact Summit in New Delhi's Bharat Mandapam, Sarvam unveils two fully sovereign foundation models trained from scratch on Indian government compute under the IndiaAI Mission. Both are MoE architectures with a custom Indic tokenizer supporting all 22 official languages across 12 scripts. The 30B uses GQA and is optimized for real-time deployment. The 105B adds MLA for 128K context and is designed for complex reasoning and agentic tasks. At launch, a demonstration called "Vikram" (named after Vikram Sarabhai) showcases multilingual conversations including Punjabi and Hindi.
Weights for both Sarvam 30B and 105B are officially released on HuggingFace (sarvamai/sarvam-30b, sarvamai/sarvam-105b) and AIKosh under Apache License 2.0 — the most permissive open-source license, allowing commercial use. Sarvam 30B powers Samvaad (conversational agents); 105B powers Indus (reasoning and agentic workflows). Both available via the Sarvam API. Future plans: coding-specialized models, multimodal systems, and scaling to significantly larger checkpoints.