Sovereign AI · India · Deep Dive

Sarvam AI:
India's Own LLMs

From a 2B parameter Indic language model to a sovereign 105B MoE flagship — the story of how Sarvam AI built India's first full-stack foundation models from scratch, in India, for India.

Published March 2026 · Prateek Singh, PhD
Sovereign AI · MoE Architecture · 22 Indian Languages · Open Source

Why Sarvam Matters

India has 1.4 billion people speaking 22 scheduled languages across 12 scripts. The dominant English-centric LLMs handle most of these poorly: token efficiency for Indic scripts is weak (4–8 tokens per word vs ~1.4 for English), training data in regional languages is scarce, and cultural context gaps matter for real-world deployment in healthcare, banking, government services, and education.

Sarvam AI, founded in August 2023 by Vivek Raghavan and Pratyush Kumar — both formerly of AI4Bharat at IIT Madras — set out to build the full stack from scratch: tokenizer, architecture, training data, training infrastructure, post-training pipelines, and inference systems. Not a fine-tune of someone else's checkpoint. Built in India, on Indian compute, under the IndiaAI Mission.

The result is a model family that spans a 2B base model (Sarvam-1, Oct 2024), a 24B fine-tune (Sarvam-M, May 2025), and now two fully sovereign foundation models: Sarvam 30B and Sarvam 105B, both MoE architectures trained from scratch and open-sourced under Apache 2.0 in February–March 2026.

🗣️

22 Indian Languages

All officially scheduled Indian languages, including code-mixed formats like Hinglish. Custom tokenizer with fertility rates of 1.4–2.1 vs 4–8 for standard models.

Sparse MoE Efficiency

Both 30B and 105B activate only a fraction of total parameters per token — 2.4B and 10.3B respectively — keeping inference costs practical at scale.

🔓

Apache 2.0 Open Source

Weights on HuggingFace and AIKosh. Enterprise and developer use permitted. Aimed at reducing India's dependence on closed foreign AI systems.

🇮🇳

Sovereign Infrastructure

Trained entirely in India on Yotta's Shakti GPU cluster using government-provided compute under the ₹10,372-crore IndiaAI Mission.

🧠

Reasoning-First Training

Full pipeline: pre-training → SFT → RL. Models include extended thinking mode and agentic traces for real-world tool use and multi-step workflows.

01
Sarvam-1
The Foundation — India's first purpose-built Indic LLM
2B Parameters · Dense · 4K Context · Oct 2024

Released in October 2024, Sarvam-1 was the first model to demonstrate that a carefully curated 2B parameter model can outperform much larger general-purpose models on Indian languages. Trained on 2 trillion tokens across 10 Indic languages (Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu), it beat Gemma-2-2B and Llama-3.2-3B on Indic benchmarks and stayed competitive with Llama-3.1-8B while being 4× smaller.

The key insight: token efficiency first. Existing multilingual models need 4–8 tokens per Indic word due to poor tokenizer design. Sarvam-1's custom tokenizer achieves 1.4–2.1 fertility across all supported languages — matching English efficiency — which directly improves model capacity utilization and training signal quality.
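Fertility is just the ratio of subword tokens to words; a toy calculation makes the capacity argument concrete. The token counts below are illustrative assumptions, not measurements from any real tokenizer:

```python
def fertility(num_tokens: int, num_words: int) -> float:
    """Tokenizer fertility: average subword tokens emitted per word."""
    return num_tokens / num_words

# Hypothetical token counts for the same 10-word Hindi sentence
# (illustrative only, not measured from any real tokenizer):
generic_bpe_tokens = 62   # English-centric BPE falling back to bytes
indic_tokens = 17         # script-aware tokenizer with Indic merges

print(fertility(generic_bpe_tokens, 10))  # 6.2, in the 4-8 range
print(fertility(indic_tokens, 10))        # 1.7, in the 1.4-2.1 range
```

At equal context lengths, dropping fertility from ~6 to ~1.7 lets the model see roughly 3.5× more words per training step, which is where the capacity-utilization and training-signal gains come from.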

Sarvam-1 Architecture (2B Dense)
Token Embedding → Custom Indic Tokenizer (32K vocab)
[RMSNorm → GQA (Grouped Query Attention) → RMSNorm → SwiGLU FFN] × N layers
RoPE (θ=10,000) · bfloat16 mixed precision · Deeper & thinner config

Key Design Choice: Deeper & Thinner

Sarvam-1 uses more layers with a smaller hidden dimension than similarly-sized models. Research at the time (later reinforced by Qwen3) showed this improves performance for a fixed parameter budget, particularly on multilingual tasks where diverse representations benefit from greater depth.
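A rough parameter count shows how depth can be traded for width at a fixed budget. The sketch below counts only attention projections (MHA-style, for simplicity) plus SwiGLU FFN weights, ignoring embeddings and norms; both configurations are made up for illustration, not Sarvam-1's actual shapes:

```python
def transformer_params(n_layers: int, d_model: int, ffn_mult: float = 8 / 3) -> int:
    """Approximate decoder parameter count: attention projections + SwiGLU FFN."""
    attn = 4 * d_model * d_model        # Q, K, V, O projections (MHA-style)
    d_ff = int(ffn_mult * d_model)      # common SwiGLU hidden size ratio
    ffn = 3 * d_model * d_ff            # SwiGLU: gate, up, down matrices
    return n_layers * (attn + ffn)

wide_shallow = transformer_params(n_layers=16, d_model=2048)
deep_thin = transformer_params(n_layers=28, d_model=1536)

# Both land near the same ~0.8B non-embedding budget; the deeper config
# spends it on more sequential transformation steps instead of wider ones.
print(wide_shallow, deep_thin)
```

GQA with fewer KV heads would shrink the K/V projections below 4d² per layer; the sketch keeps the full count to stay simple.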

Training Setup

Trained on Yotta's Shakti HGX H100 cluster with 1,024 GPUs over 5 days using NVIDIA NeMo framework. Kernel fusion and mixed-precision optimizations throughout. Note: this is a base completion model — not instruction-tuned, designed to be fine-tuned for downstream tasks.

02
Sarvam-M
The Bridge — Fine-tuned Mistral for Indian reasoning
24B Parameters · Dense · 32K Context · May 2025

Released May 2025, Sarvam-M was a significant step up in capability — fine-tuned on Mistral-Small-3.1-24B-Base to enhance Indian language performance, reasoning, and coding. While technically built on a foreign base model (which later drew criticism for not qualifying as truly sovereign), it served as a crucial bridge that demonstrated Sarvam's SFT and RL pipeline capabilities at scale.

Sarvam-M supports 11 Indian languages, has a 32K token context window with sliding window attention of 4,096 tokens, and includes a thinking mode for extended reasoning. It was Sarvam's first production-deployed model for conversational use cases.

The Sovereignty Debate

Sarvam-M drew industry criticism because it was fine-tuned from Mistral's architecture — a French AI company's design. While performance on Indic languages improved significantly, critics argued it did not reduce structural dependence on foreign AI infrastructure. This directly motivated Sarvam's decision to train 30B and 105B from scratch.

Architecture Details (Inherited from Mistral-Small-3.1)

GQA attention with sliding window of 4,096 tokens for local layers and full attention for global layers. 32K context window. SwiGLU FFN. Standard Pre-Norm with RMSNorm. RoPE positional embeddings. The key Sarvam contribution was the post-training pipeline: instruction tuning, safety fine-tuning, and RL on Indian-language-heavy prompts.
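The local/global split comes down to two attention masks. A minimal NumPy sketch of the masking pattern, with toy window and sequence sizes chosen for readability rather than the real 4,096/32K values:

```python
import numpy as np

def attention_mask(seq_len, window=None):
    """Boolean mask: True where a query position may attend to a key.

    window=None -> full causal attention (global layers)
    window=W    -> causal attention limited to the last W keys (local layers)
    """
    q = np.arange(seq_len)[:, None]
    k = np.arange(seq_len)[None, :]
    causal = k <= q
    if window is None:
        return causal
    return causal & (q - k < window)

local = attention_mask(8, window=4)    # sliding-window layer (toy sizes)
global_ = attention_mask(8)            # full-attention layer
print(local.sum(), global_.sum())      # 26 vs 36 allowed query-key pairs
```

Local layers keep per-token attention cost constant in sequence length, while the interleaved global layers preserve access to the full 32K context.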

03
Sarvam 30B
First fully sovereign foundation model — built for real-time production
30B Total · 2.4B Active / token · 32K Context · MoE · 128 Experts · Top-6

Sarvam 30B is the first model in the family built entirely from scratch in India — architecture, training data, tokenizer, and training infrastructure. Pre-trained on 16 trillion tokens spanning code, web data, mathematics, and multilingual content, with a custom data mixture tuned after extensive ablations. Then put through SFT and RL pipelines developed entirely in-house.

The architecture uses a Heterogeneous MoE design: 19 layers total — 1 dense layer followed by 18 MoE layers. Each MoE layer has 128 experts with top-6 routing (6 experts activate per token). A dedicated shared expert handles common linguistic patterns, keeping consistent representations across all inputs. Grouped Query Attention (GQA) with 4 KV heads per layer balances memory bandwidth and generation quality.
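A toy NumPy version of one such MoE block shows how the pieces fit together. Dimensions are shrunk and weights random; renormalizing the selected sigmoid scores is an assumption, since the exact gating arithmetic isn't specified in the source:

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, D_FF, N_EXPERTS, TOP_K = 64, 128, 128, 6   # toy sizes, not Sarvam's

def swiglu(x, w_gate, w_up, w_down):
    gate = w_gate @ x
    return w_down @ ((gate / (1.0 + np.exp(-gate))) * (w_up @ x))  # SiLU * up

def make_expert():
    return (rng.standard_normal((D_FF, D_MODEL)) * 0.02,
            rng.standard_normal((D_FF, D_MODEL)) * 0.02,
            rng.standard_normal((D_MODEL, D_FF)) * 0.02)

experts = [make_expert() for _ in range(N_EXPERTS)]
shared_expert = make_expert()                       # always active
w_router = rng.standard_normal((N_EXPERTS, D_MODEL)) * 0.02

def moe_ffn(x):
    scores = 1.0 / (1.0 + np.exp(-(w_router @ x)))  # sigmoid score per expert
    top = np.argsort(scores)[-TOP_K:]               # top-6 routing
    gates = scores[top] / scores[top].sum()         # renormalize selected (assumed)
    routed = sum(g * swiglu(x, *experts[i]) for g, i in zip(gates, top))
    return routed + swiglu(x, *shared_expert)       # shared expert added in

y = moe_ffn(rng.standard_normal(D_MODEL))
print(y.shape)  # (64,)
```

Only 6 of 128 expert FFNs (plus the shared one) run per token, which is the mechanism behind 2.4B active out of 30B total parameters.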

Sarvam 30B Architecture (Heterogeneous MoE)
Token Input → Custom Indic Tokenizer (22 langs, 12 scripts)
Layer 1 (Dense): GQA (4 KV heads) + SwiGLU FFN
Layers 2–19 (MoE, × 18): GQA (4 KV heads) + MoE FFN · 128 experts · top-6 routed + 1 shared
RMSNorm · RoPE (θ=8,000,000) · Sigmoid routing (not softmax)

Sigmoid Routing — A Key Innovation

Instead of traditional softmax gating over expert logits, Sarvam uses sigmoid-based routing scores. Softmax normalizes scores across all experts, creating competition that can lead to routing collapse, where a few experts dominate over time. Sigmoid scores are independent per expert, improving load balancing and encouraging more uniform expert utilization, which is critical for training stability over 16 trillion tokens.
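The competition-vs-independence point can be verified numerically: under softmax, boosting one expert's logit necessarily lowers every other expert's gate; under sigmoid, the other scores are untouched.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

logits = np.array([2.0, 1.9, -1.0, -1.2])
boosted = logits.copy()
boosted[0] += 2.0                # expert 0 grows stronger during training

# Softmax couples the experts: expert 1's gate shrinks through no fault of its own
assert softmax(boosted)[1] < softmax(logits)[1]

# Sigmoid scores are independent: expert 1 is unaffected
assert np.isclose(sigmoid(boosted)[1], sigmoid(logits)[1])
```

This is the mechanism behind the routing-collapse claim: a softmax gate lets an early-winning expert passively suppress the rest, whereas sigmoid scores leave load balancing to separate mechanisms (the 105B's expert bias term plays a related role).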

Ultra-High RoPE Theta

Sarvam 30B uses a RoPE theta of 8,000,000: 16× Llama 3.1's 500,000 and roughly 800× the original 10,000. The higher base slows the rotation frequencies enough that positional encoding stays stable at long contexts without a separate RoPE scaling mechanism (no YaRN or ABF needed). The model handles 32K context natively without degradation.
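Why a huge theta helps can be read off the rotation wavelengths of each frequency pair. A quick sketch, with a head dimension of 128 as an assumed illustrative value:

```python
import numpy as np

def rope_wavelengths(d_head: int, theta: float) -> np.ndarray:
    """Rotation period, in tokens, of each RoPE frequency pair."""
    inv_freq = theta ** (-np.arange(0, d_head, 2) / d_head)
    return 2 * np.pi / inv_freq

slowest_10k = rope_wavelengths(128, 10_000).max()
slowest_8m = rope_wavelengths(128, 8_000_000).max()

# With theta=10,000 the slowest pair completes a full period within ~55K tokens;
# with theta=8,000,000 the period is in the tens of millions, so positions
# across a 32K window stay well separated without YaRN-style rescaling.
print(round(slowest_10k), round(slowest_8m))
```

The tradeoff is weaker discrimination between nearby positions at the low-frequency end, which a sufficiently large theta budget absorbs without retuning at inference time.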

Production Target: Samvaad

Sarvam 30B is the engine powering Samvaad, Sarvam's conversational AI platform. The 2.4B active parameters per token make it fast enough for real-time voice interactions across Indian languages. NVIDIA co-designed inference optimizations delivered 4× speedup over baseline H100 performance via kernel fusion, RadixAttention for KV prefix reuse, and Blackwell NVFP4 quantization.

04
Sarvam 105B
The flagship — deep reasoning, long context, agentic workflows
105B Total · 10.3B Active / token · 128K Context · MoE · 128 Experts · Top-8 · MLA

Sarvam 105B extends the 30B architecture to 32 layers (1 dense + 31 MoE) with larger expert FFN hidden size and top-8 routing. The critical architectural addition at this scale is Multi-Head Latent Attention (MLA) — the same KV cache compression technique pioneered by DeepSeek V3 — which enables the 128K context window without prohibitive memory requirements.

Pre-trained on 12 trillion tokens (fewer than the 30B, but with a heavier emphasis on Indian languages, STEM, and agentic data). Full post-training: SFT on diverse, synthetically augmented prompts including agentic traces and tool-use trajectories, followed by RL using an asynchronous GRPO setup with adaptive rollout allocation.

Sarvam 105B Architecture (Flagship MoE)
Token Input → Custom Tokenizer (22 langs, 12 scripts)
Layer 1 (Dense): MLA (Multi-Head Latent Attention) + SwiGLU FFN
Layers 2–32 (MoE, × 31): MLA (compressed KV latent) + MoE FFN · 128 experts · top-8 routed + 1 shared
RMSNorm · RoPE · Sigmoid routing · Expert bias term

MLA: DeepSeek's Technique, Sarvam's Scale

Multi-Head Latent Attention compresses Key and Value tensors into a low-dimensional latent vector before projecting back out for attention computation. The KV cache stores this compressed latent rather than full K/V tensors — dramatically reducing memory at 128K context lengths. Sarvam adopted MLA (similar to DeepSeek V3) specifically because GQA wasn't sufficient for long-context efficiency at the 105B scale.
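A shape-level sketch of the compression. Dimensions are toy values, not Sarvam's, and the decoupled RoPE components that real MLA also caches are omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, D_LATENT, N_HEADS, D_HEAD = 512, 64, 8, 64   # toy sizes
SEQ = 1000                                            # cached positions

w_down = rng.standard_normal((D_LATENT, D_MODEL)) * 0.02   # joint KV compressor
w_up_k = rng.standard_normal((N_HEADS * D_HEAD, D_LATENT)) * 0.02
w_up_v = rng.standard_normal((N_HEADS * D_HEAD, D_LATENT)) * 0.02

hidden = rng.standard_normal((SEQ, D_MODEL))

# The KV cache stores only this low-rank latent per position...
kv_latent = hidden @ w_down.T                 # (1000, 64)

# ...and K/V are re-expanded per head only when attention is computed
k = (kv_latent @ w_up_k.T).reshape(SEQ, N_HEADS, D_HEAD)
v = (kv_latent @ w_up_v.T).reshape(SEQ, N_HEADS, D_HEAD)

full_cache = 2 * SEQ * N_HEADS * D_HEAD       # standard MHA caches K and V
mla_cache = SEQ * D_LATENT                    # MLA caches only the latent
print(full_cache / mla_cache)                 # 16.0x smaller cache in this toy setup
```

At 128K positions the cache savings scale linearly with sequence length, which is why GQA's more modest KV reduction was judged insufficient at the 105B scale.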

Adaptive RL Curriculum

The RL stage uses an adaptive knapsack-style rollout allocation: prompts are pre-filtered to remove trivially solvable or unsolvable tasks, then rollouts are dynamically weighted toward tasks near the model's capability frontier — where learning signal is strongest. An asynchronous GRPO setup decouples generation, reward computation, and policy updates to maximize GPU utilization during RL.
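The exact knapsack formulation isn't public; the sketch below captures the stated idea under one common assumption, that learning signal for group-relative advantages scales with reward variance p(1 − p), which vanishes for always-solved or never-solved prompts:

```python
def rollout_allocation(solve_rates, total_rollouts=1024, eps=0.02):
    """Weight rollouts toward prompts near the capability frontier.

    solve_rates: estimated per-prompt success probability from prior rollouts.
    Prompts with p ~ 0 or p ~ 1 give every rollout in a GRPO group the same
    reward, hence zero group-relative advantage, so they are filtered out;
    the rest receive rollouts in proportion to reward variance p * (1 - p).
    """
    kept = {i: p for i, p in enumerate(solve_rates) if eps < p < 1 - eps}
    weight = {i: p * (1 - p) for i, p in kept.items()}   # Bernoulli variance
    z = sum(weight.values())
    return {i: round(total_rollouts * w / z) for i, w in weight.items()}

alloc = rollout_allocation([0.0, 0.05, 0.5, 0.9, 1.0])
print(alloc)   # the frontier prompt (p=0.5) receives the largest share
```

Decoupling this allocation step from generation, reward computation, and policy updates (as the asynchronous GRPO setup does) keeps GPUs busy even when individual rollouts finish at very different speeds.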

Production Target: Indus

Sarvam 105B powers Indus, Sarvam's AI assistant for complex reasoning and agentic workflows. Benchmarked against JEE Main 2026 papers (Math: 25/25 under Pass@1), Codeforces Div3 problems, and the Tau 2 Bench for agentic reasoning — where it outperforms DeepSeek R1, Gemini 2.5 Flash, and o4-mini.

📊

Benchmark positioning: Sarvam 105B competes with GPT-OSS-120B and Qwen3-Next-80B. The 30B targets Gemma 27B and GPT-OSS-20B. Both models achieve state-of-the-art on Indian language benchmarks at their parameter class, outperforming significantly larger general-purpose models on Indic tasks.

Sarvam Model Family Comparison

Model | Params | Active | Attention | Architecture | Context | Key Feature
Sarvam-1 | 2B | 2B (dense) | GQA | Dense Transformer | 4K | Custom Indic tokenizer, SwiGLU + RoPE
Sarvam-M | 24B | 24B (dense) | GQA + SWA | Dense (Mistral base) | 32K | Fine-tune of Mistral 3.1, thinking mode
Sarvam 30B | 30B | 2.4B / token | GQA (4 KV heads) | MoE · 128 experts · top-6 | 32K | Sigmoid routing, RoPE θ=8M, shared expert
Sarvam 105B | 105B | 10.3B / token | MLA (DeepSeek-style) | MoE · 128 experts · top-8 | 128K | MLA + KV compression, async GRPO RL, expert bias

Sarvam AI Release Timeline

From founding to sovereign flagship — a 2.5 year journey

Aug 2023
Sarvam AI Founded
Inception

Vivek Raghavan and Pratyush Kumar — both formerly of AI4Bharat at IIT Madras — found Sarvam AI in Bengaluru. The mission: build large language models and multimodal AI systems with a focus on Indian languages from the ground up. In December 2023, the company closes a combined seed + Series A of ~$41M led by Lightspeed, with Peak XV Partners and Khosla Ventures participating.

Sarvam AI · Bengaluru
Founded by ex-AI4Bharat researchers · $41M seed + Series A in Dec 2023 · Mission: India-first, sovereign AI stack
Oct 2024
Sarvam-1 (2B)
First Model

The first public model: a 2B dense transformer trained from scratch on 2 trillion tokens across 10 Indic languages. The headline achievement is the custom tokenizer — achieving 1.4–2.1 fertility vs 4–8 for existing multilingual models. Sarvam-1 outperforms Gemma-2-2B and Llama-3.2-3B on Indic benchmarks despite being the same or smaller size. Competitive with Llama-3.1-8B at 4× fewer parameters. Uses GQA, SwiGLU, RoPE (θ=10,000), bfloat16 training.

Sarvam-1 · 2B · Dense
Custom tokenizer: 1.4–2.1 fertility · 2T tokens, 10 Indic languages · Deeper & thinner architecture · GQA + SwiGLU + RoPE · Trained on Yotta Shakti: 1,024 H100s, 5 days
Apr–May 2025
IndiaAI Mission Selection · Sarvam-M (24B)
Scale + Controversy

In April 2025, India's Ministry of Electronics and Information Technology (MeitY) selects Sarvam AI under the IndiaAI Mission to develop an indigenous foundation model, providing access to government-backed GPU compute. In May, Sarvam releases Sarvam-M, a 24B model fine-tuned from Mistral-Small-3.1-24B. It supports 11 Indian languages, includes a thinking mode, and is the first Sarvam model deployed in production (for conversational use). However, it quickly draws criticism for being a foreign-architecture fine-tune rather than a truly sovereign model.

Sarvam-M · 24B · Dense · IndiaAI Mission GPU Access
Fine-tune of Mistral Small 3.1 · GQA + sliding window 4K · 32K context, thinking mode · Criticized as non-sovereign base · Government GPU compute unlocked
Feb 2026
Sarvam 30B + 105B Unveiled at India AI Impact Summit
Sovereign Milestone

At the India AI Impact Summit in New Delhi's Bharat Mandapam, Sarvam unveils two fully sovereign foundation models trained from scratch on Indian government compute under the IndiaAI Mission. Both are MoE architectures with a custom Indic tokenizer supporting all 22 official languages across 12 scripts. The 30B uses GQA and is optimized for real-time deployment. The 105B adds MLA for 128K context and is designed for complex reasoning and agentic tasks. At launch, a demonstration called "Vikram" (named after Vikram Sarabhai) showcases multilingual conversations including Punjabi and Hindi.

Sarvam 30B · GQA · 32K
Sarvam 105B · MLA · 128K
Both trained from scratch, in India · 128 experts, sigmoid routing · 22 official Indian languages · 105B: MLA for KV compression · 105B: 128K context window · 4× inference speedup on Blackwell (NVIDIA collab)
Mar 2026
Open-Sourced Under Apache 2.0
Open Source

Weights for both Sarvam 30B and 105B are officially released on HuggingFace (sarvamai/sarvam-30b, sarvamai/sarvam-105b) and AIKosh under Apache License 2.0 — the most permissive open-source license, allowing commercial use. Sarvam 30B powers Samvaad (conversational agents); 105B powers Indus (reasoning and agentic workflows). Both available via the Sarvam API. Future plans: coding-specialized models, multimodal systems, and scaling to significantly larger checkpoints.

HuggingFace Release · AIKosh Release · Sarvam API
Apache 2.0 — commercial use allowed · 30B → Samvaad (conversational AI) · 105B → Indus (reasoning + agentic) · SGLang, vLLM, Transformers support · Roadmap: coding, multimodal, larger models