From a 2B parameter Indic language model to a sovereign 105B MoE flagship — the story of how Sarvam AI built India's first full-stack foundation models from scratch, in India, for India.
India has 1.4 billion people speaking 22 scheduled languages across 12 scripts. Dominant English-centric LLMs handle most of these languages poorly: token-inefficient tokenization for Indic scripts (4–8 tokens per word vs ~1.4 for English), limited training data in regional languages, and cultural context gaps that matter for real-world deployment in healthcare, banking, government services, and education.
Sarvam AI, founded in August 2023 by Vivek Raghavan and Pratyush Kumar — both formerly of AI4Bharat at IIT Madras — set out to build the full stack from scratch: tokenizer, architecture, training data, training infrastructure, post-training pipelines, and inference systems. Not a fine-tune of someone else's checkpoint. Built in India, on Indian compute, under the IndiaAI Mission.
The result is a model family that spans a 2B base model (Sarvam-1, Oct 2024), a 24B fine-tune (Sarvam-M, May 2025), and now two fully sovereign foundation models: Sarvam 30B and Sarvam 105B, both MoE architectures trained from scratch and open-sourced under Apache 2.0 in February–March 2026.
All officially scheduled Indian languages, including code-mixed formats like Hinglish. Custom tokenizer with fertility rates of 1.4–2.1 vs 4–8 for standard models.
Both 30B and 105B activate only a fraction of total parameters per token — 2.4B and 10.3B respectively — keeping inference costs practical at scale.
Weights on HuggingFace and AIKosh. Enterprise and developer use permitted. Aimed at reducing India's dependence on closed foreign AI systems.
Trained entirely in India on Yotta's Shakti GPU cluster using government-provided compute under the ₹10,372-crore IndiaAI Mission.
Full pipeline: pre-training → SFT → RL. Models include extended thinking mode and agentic traces for real-world tool use and multi-step workflows.
Released in October 2024, Sarvam-1 was the first model to demonstrate that a carefully curated 2B parameter model can outperform much larger general-purpose models on Indian languages. Trained on 2 trillion tokens across 10 Indic languages (Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu), it beat Gemma-2-2B and Llama-3.2-3B on Indic benchmarks, and stayed competitive with Llama-3.1-8B despite being 4× smaller.
The key insight: token efficiency first. Existing multilingual models need 4–8 tokens per Indic word due to poor tokenizer design. Sarvam-1's custom tokenizer achieves 1.4–2.1 fertility across all supported languages — matching English efficiency — which directly improves model capacity utilization and training signal quality.
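Fertility is simply the average number of tokens a tokenizer emits per word. A minimal sketch of the metric, using a deliberately bad byte-level tokenizer as a stand-in (Sarvam's actual tokenizer is not reproduced here):

```python
def fertility(tokenize, text: str) -> float:
    """Average number of tokens produced per whitespace-delimited word."""
    words = text.split()
    if not words:
        return 0.0
    return len(tokenize(text)) / len(words)

# Worst-case stand-in tokenizer: every UTF-8 byte is one token. Devanagari
# characters are 3 bytes each in UTF-8, so per-word token counts balloon,
# mirroring how ill-fitting vocabularies inflate Indic fertility.
byte_tokenize = lambda s: list(s.encode("utf-8"))

print(fertility(byte_tokenize, "hello world"))    # 5.5
print(fertility(byte_tokenize, "नमस्ते दुनिया"))    # 18.0
```

A well-fit vocabulary pulls this number toward 1: each common word (or morpheme) becomes a single token, so the model sees more words per unit of context and compute.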
Sarvam-1 uses more layers with a smaller hidden dimension than similarly-sized models. Research at the time (later reinforced by Qwen3) showed this improves performance for a fixed parameter budget, particularly on multilingual tasks where diverse representations benefit from greater depth.
Trained on Yotta's Shakti HGX H100 cluster with 1,024 GPUs over 5 days using NVIDIA NeMo framework. Kernel fusion and mixed-precision optimizations throughout. Note: this is a base completion model — not instruction-tuned, designed to be fine-tuned for downstream tasks.
Released May 2025, Sarvam-M was a significant step up in capability — fine-tuned on Mistral-Small-3.1-24B-Base to enhance Indian language performance, reasoning, and coding. While technically built on a foreign base model (which later drew criticism for not qualifying as truly sovereign), it served as a crucial bridge that demonstrated Sarvam's SFT and RL pipeline capabilities at scale.
Sarvam-M supports 11 Indian languages, has a 32K token context window with sliding window attention of 4,096 tokens, and includes a thinking mode for extended reasoning. It was Sarvam's first production-deployed model for conversational use cases.
Sarvam-M drew industry criticism because it was fine-tuned from Mistral's architecture — a French AI company's design. While performance on Indic languages improved significantly, critics argued it did not reduce structural dependence on foreign AI infrastructure. This directly motivated Sarvam's decision to train 30B and 105B from scratch.
GQA attention with sliding window of 4,096 tokens for local layers and full attention for global layers. 32K context window. SwiGLU FFN. Standard Pre-Norm with RMSNorm. RoPE positional embeddings. The key Sarvam contribution was the post-training pipeline: instruction tuning, safety fine-tuning, and RL on Indian-language-heavy prompts.
Sarvam 30B is the first model in the family built entirely from scratch in India — architecture, training data, tokenizer, and training infrastructure. Pre-trained on 16 trillion tokens spanning code, web data, mathematics, and multilingual content, with a custom data mixture tuned after extensive ablations. Then put through SFT and RL pipelines developed entirely in-house.
The architecture uses a Heterogeneous MoE design: 19 layers total — 1 dense layer followed by 18 MoE layers. Each MoE layer has 128 experts with top-6 routing (6 experts activate per token). A dedicated shared expert handles common linguistic patterns, keeping consistent representations across all inputs. Grouped Query Attention (GQA) with 4 KV heads per layer balances memory bandwidth and generation quality.
Instead of traditional softmax gating over expert logits, Sarvam uses sigmoid-based routing scores. Softmax normalizes scores across all experts, creating competition and routing collapse where a few experts dominate over time. Sigmoid scores are independent per expert, improving load balancing and encouraging more uniform expert utilization — critical for training stability over 16 trillion tokens.
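The difference can be sketched in a few lines of plain Python. This is a toy illustration of the scoring-then-top-k step, not Sarvam's actual router; the expert count and logits are made up:

```python
import math

def route_token(logits, k, use_sigmoid=True):
    """Pick top-k experts for one token and return normalized gate weights.

    Sigmoid scores each expert independently; softmax makes experts compete
    for a shared probability mass, which can let a few experts dominate.
    """
    if use_sigmoid:
        scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    else:
        m = max(logits)  # subtract max for numerical stability
        exps = [math.exp(x - m) for x in logits]
        total = sum(exps)
        scores = [e / total for e in exps]
    # Indices of the k highest-scoring experts.
    topk = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Renormalize the selected gates so each token's weights sum to 1.
    z = sum(scores[i] for i in topk)
    return {i: scores[i] / z for i in topk}

# One token routed over 8 experts (illustrative; the 30B uses 128 experts, top-6).
gates = route_token([0.2, -1.0, 3.0, 0.5, 1.2, -0.3, 0.0, 2.1], k=6)
print(sorted(gates))                    # [0, 2, 3, 4, 6, 7]
print(round(sum(gates.values()), 6))    # 1.0
```

Note that with sigmoid scoring, raising one expert's logit does not depress every other expert's score, which is exactly the competition dynamic that drives routing collapse under softmax.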
Sarvam 30B uses a RoPE theta of 8,000,000 — orders of magnitude higher than Llama's 500,000 or the original 10,000. This allows stable positional encoding at long contexts without needing a separate RoPE scaling mechanism (no YaRN or ABF needed). The model handles 32K context natively without degradation.
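A quick way to see what a larger base buys: each RoPE frequency pair rotates with angular frequency θ^(−2i/d), so its wavelength in tokens is 2π·θ^(2i/d). Raising θ stretches the slowest wavelengths, keeping distant positions distinguishable. The head dimension below is an assumed value for illustration, not a published Sarvam figure:

```python
import math

def rope_wavelengths(theta: float, head_dim: int):
    """Wavelength (in tokens) of each RoPE rotation pair for base theta.

    Pair i rotates at angular frequency theta**(-2*i/head_dim); wavelength
    is 2*pi divided by that frequency, i.e. 2*pi * theta**(2*i/head_dim).
    """
    return [2 * math.pi * theta ** (2 * i / head_dim) for i in range(head_dim // 2)]

# Slowest-rotating pair's wavelength under the original base (10,000) vs 8M.
# head_dim=128 is an assumption for illustration.
classic = rope_wavelengths(10_000, 128)[-1]
large = rope_wavelengths(8_000_000, 128)[-1]
print(f"theta=1e4: {classic:,.0f} tokens")
print(f"theta=8e6: {large:,.0f} tokens")
```

With θ=10,000 the slowest pair completes a full rotation within roughly 50K tokens; at θ=8,000,000 that wavelength grows by orders of magnitude, which is why no separate RoPE scaling scheme is needed to cover 32K context.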
Sarvam 30B is the engine powering Samvaad, Sarvam's conversational AI platform. The 2.4B active parameters per token make it fast enough for real-time voice interactions across Indian languages. NVIDIA co-designed inference optimizations delivered 4× speedup over baseline H100 performance via kernel fusion, RadixAttention for KV prefix reuse, and Blackwell NVFP4 quantization.
Sarvam 105B extends the 30B architecture to 32 layers (1 dense + 31 MoE) with larger expert FFN hidden size and top-8 routing. The critical architectural addition at this scale is Multi-Head Latent Attention (MLA) — the same KV cache compression technique pioneered by DeepSeek V3 — which enables the 128K context window without prohibitive memory requirements.
Pre-trained on 12 trillion tokens (fewer than the 30B, but with a heavier emphasis on Indian languages, STEM, and agentic data). Full post-training: SFT on diverse, synthetically augmented prompts including agentic traces and tool-use trajectories, followed by RL using an asynchronous GRPO setup with adaptive rollout allocation.
Multi-Head Latent Attention compresses Key and Value tensors into a low-dimensional latent vector before projecting back out for attention computation. The KV cache stores this compressed latent rather than full K/V tensors — dramatically reducing memory at 128K context lengths. Sarvam adopted MLA (similar to DeepSeek V3) specifically because GQA wasn't sufficient for long-context efficiency at the 105B scale.
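A back-of-envelope comparison shows why caching a latent matters at 128K context. All shapes below are assumptions for illustration; Sarvam has not published these dimensions:

```python
# Sketch of MLA-style KV cache accounting: store one low-rank latent per token
# instead of full K/V tensors for every head. Shapes are illustrative only.
hidden = 8192       # model hidden size (assumed)
n_heads = 64        # attention heads (assumed)
head_dim = 128      # per-head dimension (assumed)
latent = 512        # compressed KV latent dimension (assumed)

full_kv_per_token = 2 * n_heads * head_dim   # full K and V for every head
mla_cache_per_token = latent                 # one shared latent per token

ctx = 128_000
print(f"full KV cache:    {full_kv_per_token * ctx / 1e6:.0f}M values")
print(f"MLA latent cache: {mla_cache_per_token * ctx / 1e6:.0f}M values")
print(f"compression:      {full_kv_per_token / mla_cache_per_token:.0f}x")
```

Under these assumed shapes the latent cache is 32× smaller per token; the trade-off is an extra up-projection from the latent back to per-head K/V at attention time, which costs compute but not cache memory.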
The RL stage uses an adaptive knapsack-style rollout allocation: prompts are pre-filtered to remove trivially solvable or unsolvable tasks, then rollouts are dynamically weighted toward tasks near the model's capability frontier — where learning signal is strongest. An asynchronous GRPO setup decouples generation, reward computation, and policy updates to maximize GPU utilization during RL.
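The allocation logic can be sketched as follows (an illustrative toy, not Sarvam's implementation; the p·(1−p) weight is a stand-in proxy for learning signal, since prompts the policy always or never solves yield identical rewards across a GRPO group and hence zero advantage):

```python
# Sketch of capability-frontier rollout allocation. Prompts with estimated
# solve rates near 0.5 receive the most rollouts; trivially solved (~1.0)
# or hopeless (~0.0) prompts are filtered out entirely.
def allocate_rollouts(solve_rates, budget, lo=0.05, hi=0.95):
    """Split a total rollout budget across prompts by learning-signal weight."""
    # Bernoulli variance p*(1-p) peaks at p=0.5, the capability frontier;
    # prompts outside (lo, hi) give no gradient signal and get zero weight.
    weights = [p * (1 - p) if lo < p < hi else 0.0 for p in solve_rates]
    total = sum(weights) or 1.0
    return [round(budget * w / total) for w in weights]

rates = [0.0, 0.1, 0.5, 0.8, 1.0]  # estimated pass rates from prior rollouts
print(allocate_rollouts(rates, budget=100))  # [0, 18, 50, 32, 0]
```

Half the budget goes to the prompt at the frontier (p=0.5), and none to the prompts the model already solves or cannot solve, concentrating GPU hours where policy updates actually move the reward.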
Sarvam 105B powers Indus, Sarvam's AI assistant for complex reasoning and agentic workflows. Benchmarked against JEE Main 2026 papers (Math: 25/25 at Pass@1), Codeforces Div3 problems, and the Tau 2 Bench for agentic reasoning, where it outperforms DeepSeek R1, Gemini 2.5 Flash, and o4-mini.
Benchmark positioning: Sarvam 105B competes with GPT-OSS-120B and Qwen3-Next-80B. The 30B targets Gemma 27B and GPT-OSS-20B. Both models achieve state-of-the-art on Indian language benchmarks at their parameter class, outperforming significantly larger general-purpose models on Indic tasks.
| Model | Params | Active | Attention | Architecture | Context | Key Feature |
|---|---|---|---|---|---|---|
| Sarvam-1 | 2B | 2B (dense) | GQA | Dense Transformer | 4K | Custom Indic tokenizer · SwiGLU + RoPE |
| Sarvam-M | 24B | 24B (dense) | GQA + SWA | Dense (Mistral base) | 32K | Fine-tune of Mistral 3.1 · Thinking mode |
| Sarvam 30B | 30B | 2.4B / token | GQA (4 KV heads) | MoE · 128 experts · top-6 | 32K | Sigmoid routing · RoPE θ=8M · Shared expert |
| Sarvam 105B | 105B | 10.3B / token | MLA (DeepSeek-style) | MoE · 128 experts · top-8 | 128K | MLA KV compression · Async GRPO RL · Expert bias |
Sarvam's biggest competitive advantage isn't model size — it's the custom tokenizer. With 1.4–2.1 fertility vs 4–8 for standard models, every layer of the model gets more signal per Indic token. Better tokenization compounds through the entire training process.
Using sigmoid instead of softmax for MoE gating scores is a meaningful deviation from standard practice. Independent per-expert scoring prevents routing collapse and distributes load more uniformly — important for maintaining expert diversity over 16T training tokens.
Sarvam 105B adopting MLA (alongside Kimi K2, GLM-5, Mistral 3 Large) cements MLA as the emerging standard for long-context efficiency. GQA wasn't sufficient at 128K; MLA's latent KV compression solved what GQA couldn't.
128 small experts with top-6 or top-8 routing follows the DeepSeek V3 fine-grained template rather than Grok 2.5's coarse-grained 8-large-expert approach. More experts, more specialization — especially useful for multilingual models where language-specific routing matters.
The Sarvam-M controversy clarified what "sovereign AI" actually requires: not just open weights, but control over architecture design, training data, infrastructure, and post-training pipeline. Sarvam 30B and 105B meet that bar. Sarvam-M did not.
Sarvam's knapsack-based rollout allocation — concentrating RL compute on tasks near the model's capability frontier — is more sophisticated than standard uniform sampling. Combined with async GRPO and trajectory staleness controls, it squeezes maximum learning signal per GPU hour.
From founding to sovereign flagship — a 2.5 year journey
Vivek Raghavan and Pratyush Kumar — both formerly of AI4Bharat at IIT Madras — found Sarvam AI in Bengaluru. The mission: build large language models and multimodal AI systems with a focus on Indian languages from the ground up. In December 2023, the company closes a combined seed + Series A of ~$41M led by Lightspeed, with Peak XV Partners and Khosla Ventures participating.
The first public model: a 2B dense transformer trained from scratch on 2 trillion tokens across 10 Indic languages. The headline achievement is the custom tokenizer, achieving 1.4–2.1 fertility vs 4–8 for existing multilingual models. Sarvam-1 outperforms Gemma-2-2B and Llama-3.2-3B on Indic benchmarks despite being the same size or smaller, and stays competitive with Llama-3.1-8B with 4× fewer parameters. Uses GQA, SwiGLU, RoPE (θ=10,000), bfloat16 training.
In April 2025, India's Ministry of Electronics (MeitY) selects Sarvam AI under the IndiaAI Mission to develop an indigenous foundation model — providing access to government-backed GPU compute. In May, Sarvam releases Sarvam-M, a 24B model fine-tuned from Mistral-Small-3.1-24B. It supports 11 Indian languages, includes a thinking mode, and is the first Sarvam model deployed in production (for conversational use). However, it quickly draws criticism for being a foreign-architecture fine-tune rather than a truly sovereign model.
At the India AI Impact Summit in New Delhi's Bharat Mandapam, Sarvam unveils two fully sovereign foundation models trained from scratch on Indian government compute under the IndiaAI Mission. Both are MoE architectures with a custom Indic tokenizer supporting all 22 official languages across 12 scripts. The 30B uses GQA and is optimized for real-time deployment. The 105B adds MLA for 128K context and is designed for complex reasoning and agentic tasks. At launch, a demonstration called "Vikram" (named after Vikram Sarabhai) showcases multilingual conversations including Punjabi and Hindi.
Weights for both Sarvam 30B and 105B are officially released on HuggingFace (sarvamai/sarvam-30b, sarvamai/sarvam-105b) and AIKosh under Apache License 2.0 — the most permissive open-source license, allowing commercial use. Sarvam 30B powers Samvaad (conversational agents); 105B powers Indus (reasoning and agentic workflows). Both available via the Sarvam API. Future plans: coding-specialized models, multimodal systems, and scaling to significantly larger checkpoints.