Original: Swyx · 03/04/2026
Summary
Gemma 4, Google’s latest multimodal open model, outperforms its predecessor Gemma 3 with enhanced capabilities and a new Apache 2.0 license.
Key Insights
“Gemma 4 is Google’s biggest open-weight licensing + capability jump in a year.” — Discussing the significance of the Gemma 4 release.
“The licensing is also improved with a proper Apache 2.0 license.” — Highlighting the changes in licensing for Gemma 4.
“DeepMind highlights include function calling + structured JSON, and long context up to 256K.” — Describing the key features and capabilities of Gemma 4.
Full Article
The sudden departures at the Allen Institute and the limbo status of GPT-OSS have left the future of American open models in question, so Google DeepMind keeping up the pace with Gemma 4 is a very, very welcome update! The 31B dense variant ties with Kimi K2.5 (744B-A40B) and Z.ai GLM-5 (1T-A32B) for the world's top open models, but with far fewer total parameters (and other interesting arch choices, see below):

[Figure: the obligatory Pareto chart]

This image from Arena shows progress over the years (exaggerated by the ordinal # ranking rather than numerical scores, but truly standard benches like GPQA and AIME also improved tremendously vs Gemma 3).

The licensing is also improved with a proper Apache 2.0 license, and the models natively process video and images, supporting variable resolutions and excelling at visual tasks like OCR and chart understanding. Additionally, the E2B and E4B models feature native audio input for speech recognition and understanding. The excellent on-device capabilities make one wonder if these are the basis for the models that will be deployed in the new Siri under the deal with Apple.

AI News for 4/1/2026-4/2/2026. We checked 12 subreddits, 544 Twitters and no further Discords. The AINews website lets you search all past issues. As a reminder, AINews is now a section of Latent Space. You can opt in/out of email frequencies!

AI Twitter Recap

Google DeepMind's Gemma 4 release: open-weight, Apache 2.0, multimodal, long-context, plus rapid ecosystem rollout

Gemma 4 is Google's biggest open-weight licensing + capability jump in a year: Google/DeepMind launched Gemma 4 as a family of models explicitly positioned for reasoning + agentic workflows and local/edge deployment, now under a commercially permissive Apache 2.0 license (a notable shift from prior Gemma licensing).
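The Pareto-chart framing above (capability relative to total parameter count) can be made concrete with a small frontier check. A minimal sketch; the (params, score) pairs below are illustrative placeholders, not real benchmark numbers:

```python
# Sketch: which models sit on the capability-vs-size Pareto frontier?
# The entries below are made-up placeholders, NOT actual benchmark results.
def pareto_frontier(models):
    """Return names of models not dominated by a smaller-or-equal, better-or-equal model."""
    frontier = []
    for name, params_b, score in models:
        dominated = any(
            p <= params_b and s >= score and (p, s) != (params_b, score)
            for _, p, s in models
        )
        if not dominated:
            frontier.append(name)
    return frontier

models = [
    ("model-a", 31, 85.0),    # small and strong -> on the frontier
    ("model-b", 744, 85.0),   # same score at far larger size -> dominated
    ("model-c", 1000, 86.0),  # highest score -> on the frontier despite size
]
print(pareto_frontier(models))  # ['model-a', 'model-c']
```

This is the sense in which a 31B dense model "ties" with far larger MoEs: equal score at a fraction of the total parameters keeps it on the frontier.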
See launch threads from @GoogleDeepMind, @GoogleAI, and @Google, with Jeff Dean's framing and adoption stats (Gemma 3: 400M downloads, 100K variants) in @JeffDean.

Model lineup + key specs: Four sizes were announced: 31B dense, 26B MoE (A4B, ~4B active), and two edge models with "effective" parameter counts, E4B and E2B, aimed at mobile/IoT with native multimodal support (text/vision/audio called out for edge). DeepMind highlights include function calling + structured JSON, and long context up to 256K (large models) in @GoogleDeepMind and @GoogleAI. Community summaries and "how to run locally" guidance proliferated quickly, e.g. @_philschmid and @UnslothAI.

Early benchmark signals (with caveats):

Arena/Text: Arena reports Gemma-4-31B as #3 among open models (and #27 overall), with Gemma-4-26B-A4B at #6 open in @arena; Arena later calls it the #1 ranked US open model on its open leaderboard in @arena.
Scientific reasoning: Artificial Analysis reports GPQA Diamond 85.7% for Gemma 4 31B (Reasoning) and emphasizes token efficiency (~1.2M output tokens) vs peers in @ArtificialAnlys and @ArtificialAnlys.
Several posts stress the scale/efficiency surprise (e.g., "outperforms models 20× its size") but note that preference-based leaderboards can be gamed; Raschka's more measured read is in @rasbt.

Day-0 ecosystem support became part of the story: Gemma 4 landed immediately across common local + serving stacks:

llama.cpp day-0 support: @ggerganov
Ollama (requires 0.20+): @ollama
vLLM day-0 support (GPU/TPU/etc.): @vllm_project
LM Studio availability: @lmstudio
Transformers/llama.cpp/transformers.js callout: @mervenoyann
Modular/MAX production inference in days: @clattner_llvm

Local inference performance anecdotes got unusually concrete:

"Brew install + llama-server" became the canonical one-liner for many: @julien_c.
llama.cpp performance demo: Gemma 4 26B A4B Q8_0 on M2 Ultra, built-in WebUI, MCP support, 300 t/s (realtime video) in @ggerganov (with a follow-up caveat about prompt-recitation/speculative decoding in @ggerganov).
RTX 4090 long-context throughput + TurboQuant KV quant details in @basecampbernie.
Browser-local run via WebGPU/transformers.js demo noted by @xenovacom and amplified by @ClementDelangue.

Gemma 4 architecture notes: hybrid attention, MoE layering choices, and efficiency tricks

Unusual transformer details eliebakouch highlighted:

per-layer embeddings on small variant
no explicit attention scale (suggesting it may be absorbed into norm weights)
QK norm + V norm
shared K/V for large variant
aggressive KV cache sharing on small variant
sliding window sizes 512 and 1024
no sinks
softcapping
partial-dimension RoPE with different theta for local/global layers

Grad62304977 replied that the missing attention scale is likely merged into QK norm weights.

baseten summarized additional architecture choices:

alternative attention mechanisms
proportional RoPE
Per-Layer Embeddings (PLE)
KV-cache sharing
native aspect-ratio handling for vision
smaller frame window for audio

norpadon called it "very much not a standard transformer."

rasbt offered a more conservative read for the 31B dense: the architecture looks "pretty much unchanged compared to Gemma 3" aside from multimodal support, retaining a hybrid 5:1 local/global attention mechanism and classic GQA, suggesting the bigger jump likely came more from the training recipe and data than radical dense-model architecture change.

"Not a standard transformer" takes, plus specific deltas: A thread flagged Gemma 4 as having "galaxybrained architecture" in @norpadon, followed by more specific notes on how Gemma's MoE differs from DeepSeek/Qwen (Gemma uses MoE blocks as separate layers added alongside normal MLP blocks) in @norpadon.

Concrete low-level details being circulated: A concise recap of quirks (e.g., no explicit attention scale, QK/V norm, KV sharing, sliding window sizes, partial RoPE + different theta, softcapping, per-layer embeddings) is in @eliebakouch.
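The hybrid sliding-window/global attention layout described above can be sketched as mask construction plus a layer interleave. A toy illustration only: the 5:1 local:global ratio and 512-token window echo the posts above, while the mask code itself is a generic simplification, not Gemma's implementation:

```python
def attention_mask(seq_len, window=None):
    """Boolean causal mask; if `window` is set, restrict to a sliding local window."""
    mask = []
    for i in range(seq_len):          # query position
        row = []
        for j in range(seq_len):      # key position
            ok = j <= i               # causal: attend only to past and self
            if window is not None:
                ok = ok and (i - j) < window  # local: only the last `window` tokens
            row.append(ok)
        mask.append(row)
    return mask

def layer_pattern(n_layers, ratio=5):
    """5:1 local:global interleave -- every (ratio+1)-th layer uses global attention."""
    return ["global" if (l + 1) % (ratio + 1) == 0 else "local"
            for l in range(n_layers)]

print(layer_pattern(12))  # five 'local' layers before each 'global' layer
local = attention_mask(8, window=4)   # local layers: small, bounded KV cache
full = attention_mask(8)              # global layers: full causal attention
```

The payoff is in the KV cache: local layers only ever need the last `window` keys/values, so most of the stack's cache stays constant-size regardless of context length, with the sparse global layers carrying long-range information.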
Baseten's launch post also lists similar architecture innovations (PLE, KV-cache sharing, proportional RoPE, aspect-ratio handling for vision, smaller audio frame window) in @baseten.

Raschka's read: minimal architectural change, big recipe/data change: Raschka argues Gemma 4 31B is architecturally close to Gemma 3 27B, still using a hybrid sliding-window + global attention pattern and GQA, implying the leap is likely training recipe/data rather than an architecture overhaul: @rasbt.

Agents, harness engineering, and local agents momentum (Hermes/OpenClaw + model/harness training loops)

Open-models-as-agent-engines is now mainstream positioning: Multiple posts frame Gemma 4 as the perfect local model for open agent stacks (OpenClaw/Hermes/Pi/opencode). See @ClementDelangue, @mervenoyann, and @ben_burtenshaw.

Hermes Agent growth + pluggable memory:

Hermes Agent hit a major usage milestone and asked for roadmap input: @Teknium.
Memory integrations were expanded to multiple providers via a new pluggable system: @Teknium.
A local semantic index plugin (Enzyme) pitched as solving the "too many workspace files" issue with local embedding and 8ms queries: @jphorism.

Harness engineering as the moat (and the loop): A strong "Model/Harness Training Loop" thesis (open models + traces + fine-tuning infra) was articulated in @Vtrivedy10 and echoed more generally in @Vtrivedy10.
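The MoE delta flagged in the architecture notes above (Gemma adds MoE blocks as separate layers alongside normal MLP blocks, where DeepSeek/Qwen swap the MLP out for MoE) can be sketched with a toy residual stream. Everything here is a scalar-valued illustration of the layering difference, not the actual Gemma 4 implementation; the expert and router functions are made up:

```python
# Toy sketch of the two MoE layering styles. Hidden states are plain floats
# (real blocks are tensor-valued); experts and router scores are hypothetical.

def dense_mlp(x):
    return 0.5 * x            # stand-in for a normal MLP block

def moe(x, experts, top_k=1):
    """Route to the top_k experts with the highest (toy) router score."""
    ranked = sorted(experts, key=lambda e: e["score"](x), reverse=True)
    return sum(e["fn"](x) for e in ranked[:top_k])

experts = [
    {"score": lambda x: x,  "fn": lambda x: 2.0 * x},
    {"score": lambda x: -x, "fn": lambda x: 0.1 * x},
]

def deepseek_style_layer(x):
    # MoE *replaces* the MLP inside the block: one residual step, x + MoE(x).
    return x + moe(x, experts)

def gemma_style_layers(x):
    # MoE is a *separate* layer added alongside the normal MLP block:
    # x + MLP(x) as one residual step, then x + MoE(x) as its own step.
    x = x + dense_mlp(x)
    x = x + moe(x, experts)
    return x
```

The structural point is that in the Gemma-style variant the dense MLP stays in the stack and the MoE contributes an additional residual step, rather than substituting for the MLP.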
Related: LangChain notes open models are good enough at tool use/retrieval/file ops to drive harnesses like Deep Agents in @hwchase17.

Agent self-healing + observability trends:

A blog on self-healing GTM agent feedback loops is referenced by @hwchase17 and expanded on by @Vtrivedy10.
LangSmith reports Azure's share of OpenAI traffic rose from 8% to 29% over 10 weeks, based on 6.7B agent runs, suggesting enterprise governance/compliance is driving routing decisions: @LangChain.

Tooling and infra: kernels, fine-tuning stacks, vector DB ergonomics, document extraction

New linear attention kernel: A CUDA linear attention kernel drop is in @eliebakouch (repo link in tweet).

Axolotl v0.16.x: Axolotl's release emphasizes MoE + LoRA speed/memory wins (claimed 15× faster, 40% less memory) and GRPO async training (58% faster) plus a docs overhaul in @winglian and @winglian. Gemma 4 support follows in @winglian.

Vector DB ergonomics: turbopuffer adds multiple vector columns per doc (different dims/types/indexes) in @turbopuffer.

Document automation stack: LiteParse + Extract v2:

LiteParse open-source document parser: spatial text parsing with bounding boxes, fast on large table-heavy PDFs, enabling audit trails back to source in @jerryjliu0.
Extract v2 (LlamaIndex/LlamaParse): simplified tiers, saved extract configs, configurable parsing before extraction, transition period for v1 in @llama_index and additional context from @jerryjliu0.

Frontier org updates: Anthropic interpretability, OpenAI product distribution, and Perplexity Computer for Taxes

Anthropic: "Emotion vectors" inside Claude: Anthropic reports internal emotion concept representations that can be dialed up/down and measurably affect behavior (e.g., increasing a "desperate" vector increases cheating; "calm" reduces it). The core threads are @AnthropicAI, @AnthropicAI, and @AnthropicAI.
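The "dialed up/down" framing matches the standard activation-steering recipe: add a scaled concept direction to a hidden state, h' = h + alpha * v. A toy numeric sketch of that recipe only; the vectors and scale below are invented for illustration and are not Anthropic's actual method or values:

```python
# Toy activation steering: h' = h + alpha * v, where v is a concept direction.
# All vectors and the alpha value here are made-up illustrations.

def steer(hidden, direction, alpha):
    """Add a scaled concept vector to a hidden-state vector."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

def readout(hidden, direction):
    """Dot product as a stand-in for how strongly the concept is expressed."""
    return sum(x * y for x, y in zip(hidden, direction))

hidden = [0.2, -0.1, 0.4]
desperate = [1.0, 0.0, -1.0]   # hypothetical "desperation" direction

before = readout(hidden, desperate)
after = readout(steer(hidden, desperate, alpha=0.5), desperate)
print(before, after)  # dialing the vector up raises its readout
```

Negative alpha dials the concept down instead, which is the "calm reduces it" direction of the reported behavioral effect.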
The work also triggered citation/precedent disputes in the interp community (e.g., @aryaman2020, @dribnet, and discussion around vgel's posts via @jeremyphoward).

OpenAI: CarPlay + Codex pricing changes:

ChatGPT Voice Mode on Apple CarPlay rolling out for iOS 26.4+: @OpenAI.
Codex usage-based pricing in ChatGPT Business/Enterprise (plus promo credits): @OpenAIDevs. Greg Brockman reinforces "try at work without up-front commitment": @gdb.

Perplexity: agentic Computer for Taxes: Perplexity launched a workflow to help draft/review federal tax returns ("Navigate my taxes") in @perplexity_ai with details in @perplexity_ai.

Top tweets (by engagement, filtered to tech/product/research)

Gemma 4 launch (open-weight, Apache 2.0): @Google, @GoogleDeepMind, @demishassabis, @GoogleAI
Anthropic emotion concepts/vectors interp research: @AnthropicAI
Karpathy on LLM Knowledge Bases (Obsidian + compiled markdown wiki workflow): @karpathy
Cursor 3 (agent-collaboration interface): @cursor_ai
ChatGPT on CarPlay: @OpenAI
llama.cpp local performance demo + MCP/WebUI: @ggerganov
Perplexity Computer for Taxes: @perplexity_ai

AI Reddit Recap

/r/LocalLlama + /r/localLLM Recap

1. Gemma 4 Model Releases and Features

Related Articles
[AINews] Good Friday
Swyx · explanation · 84% similar
Gemma 4: Byte for byte, the most capable open models
Simon Willison · explanation · 84% similar
[AINews] Gemma 4 crosses 2 million downloads
Swyx · reference · 81% similar
Originally published at https://www.latent.space/p/ainews-gemma-4-the-best-small-multimodal.