Open Models at Google DeepMind — Cassidy Hardin, Google DeepMind

Gemma 4 is the latest family of open-source AI models from Google DeepMind, offering a range of sizes optimized for diverse applications from on-device processing to complex reasoning tasks.
It introduces significant architectural innovations, including Mixture-of-Experts (MoE), Per-Layer Embeddings (PLE), and advanced attention mechanisms, contributing to unprecedented performance and efficiency across all model sizes.
Gemma 4 features native multimodal capabilities (vision and audio), flexible image processing, and is released under an Apache 2.0 license for widespread developer accessibility and integration.

Gemma 4 is available under an Apache 2.0 license, promoting broad accessibility for developers to integrate models into their development lifecycle.
The family includes four models: 2B and 4B (effective models, optimized for on-device applications with audio/vision support), 26B (the first Gemma Mixture-of-Experts model, efficient with 3.8B active parameters), and 31B (dense, state-of-the-art for advanced reasoning with a 256K context length).
The 31B dense model is purpose-built for advanced reasoning, autonomous workflows, and supports native function calling and structured JSON outputs.
Architectural improvements in dense models include interleaving local (sliding window) and global (all preceding tokens) attention layers, plus Grouped Query Attention (GQA) which shares key/value heads to improve efficiency without significant memory cost.
Effective models (2B, 4B) leverage Per-Layer Embeddings (PLE) stored in flash memory, rather than VRAM, significantly reducing memory constraints and enabling high-performance inference on mobile devices and laptops.
Multimodality is a core, natively integrated feature, with vision encoders supporting variable aspect ratios and resolutions, allowing developers to allocate image token budgets based on task requirements (e.g., higher for OCR).
Audio support (translation, speech recognition) is integrated into the E2B and E4B models, utilizing an audio tokenizer and a 305M parameter conformer to process audio embeddings.
Gemma 4 models can be self-hosted via platforms like Hugging Face, Kaggle, and Ollama, or accessed through cloud-hosted options like AI Studio and Vertex AI for the larger models.

Mixture of Experts (MoE) — An architectural design where different "expert" neural networks are specialized for different input subsets, with a router selecting which experts to activate for a given input.
Dense model — A traditional neural network architecture where all parameters are active during every forward pass, as opposed to sparse models like MoE.
Active parameters — The number of model parameters actively used during a single forward pass, typically much lower in MoE models compared to their total parameters.
Context length — The maximum number of tokens a model can process and consider in a single input sequence.
Function calling — The ability of a language model to identify when a user intends to invoke an external tool or API and generate the appropriate function call.
Grouped Query Attention (GQA) — An attention mechanism that groups multiple queries to share the same key and value heads, improving efficiency and reducing memory usage in large language models.
Per-Layer Embeddings (PLE) — A technique where dedicated embedding tables are stored per layer, allowing for storage in flash memory rather than VRAM, crucial for on-device performance.
Flash memory — A type of non-volatile computer memory, often slower but cheaper and less power-intensive than VRAM, used in mobile devices and SSDs.
Vision encoder — A component in a multimodal model responsible for processing image data and transforming it into numerical representations (embeddings) that the language model can understand.
Conformer — A neural network architecture combining convolution and transformer layers, often used in speech processing tasks due to its ability to capture both local and global dependencies in audio data.
Soft tokens — Intermediate, abstract representations of data (e.g., from images or audio) that are fed into a language model, distinct from discrete text tokens.

Hi everyone, my name is Cassidy and I'm a researcher at Google DeepMind. Today I'm really excited to share with you some of the technical improvements and architecture that we have with Gemma 4. Last week we launched Gemma 4, which is the latest addition to our family of open source models. Gemma 4 brought incredible improvements at a scale that has not been seen before. We have a family of very small models with incredible performance, setting a new precedent for what's possible with small open source models.
Gemma 4 comes in four sizes. We have two smaller effective models which are geared towards on-device applications. These models have been adapted and improved in order to provide incredible performance at a small scale, and they are able to run locally on phones, iPads and laptops. We have two larger models, starting with a 26B mixture of experts model, which is the first ever Gemma MoE. This model has been adapted to have incredible performance while only requiring 3.8 billion active parameters. And our largest model is our 31B dense. This has insane performance, a huge improvement upon what existed within Gemma 3, at a new precedent that hasn't been seen before. Taking a look at our larger models, our 31B and our 26B, these models have both ranked in the top six of all open source models on LMArena.
One of the most exciting improvements and things that we've launched alongside Gemma 4 is the move to an Apache 2.0 license. This was deliberately done in order to make our models more accessible for the everyday developer. You should easily be able to integrate Gemma into your development life cycle, from initial testing all the way to deployment and building within the Gemma universe.
Now, let's take a little look at what each of these models is and some of the use cases that we've adapted them for. Starting with our 31B dense model: this is a state-of-the-art multimodal model which has been purpose-built for advanced reasoning.
This model ranked number three on the global arena for the AI leaderboard. This is outperforming models over 20 times its size. This is a huge improvement. The 31B has a 256K context length and has been purpose-built for autonomous workflows, with native support for thinking, function calling, and structured JSON outputs.
We also have a slightly smaller 26B. This 26B is the first addition of a mixture of experts model to the Gemma family. Only requiring 3.8 billion parameters during any forward pass, this model is small and efficient. Utilizing a total of 128 experts while only requiring 8 experts during any inference, it is efficient to run while still maintaining some of the incredible performance that we saw with our 31B.
On the smaller side, we introduced two effective models. These models are geared towards on-device applications with the additional support of audio. These are vision, text, and audio input models, while remaining text-only output models. Similarly, we have our effective 2B model.
Across a variety of benchmarks, these models are incredible. Looking at our performance across agentic capabilities, coding, multimodal, and multilingual, we've truly set a new frontier for what's possible with the Gemma models. This is significantly outperforming everything we had with the Gemma 3 family of models.
Now, let's take a look at what's actually new in Gemma 4, what we have done, and how we have actually been able to achieve this incredible performance, starting on the architecture side. We have our standard dense model. This is our 31B as well as our smaller effective 2B and 4B models. We have our standard decoder block. What we've done with Gemma 4 is we've made several improvements within attention. We've introduced a 5-to-1 ratio of interleaving local to global layers, with our smaller effective 2B having a 4-to-1 ratio.
This means that within our local layers we have a sliding window of how many tokens we're attending to. And lastly, with our global layers, we've now ensured that the last layer is always a global layer, meaning that our last layer is attending to all preceding tokens. In practice, what this looks like is our global layers are attending to every token that precedes them, whereas our local layers are only attending to a specific number of preceding tokens. In our smaller models, we have a sliding window of 512 tokens, while in our larger models, we have a sliding window of 1,024 tokens. This sliding window has provided significant improvements in the efficiency and optimization of our local layers while still passing information through to the preceding layers.
However, our global layers remain quite expensive. Despite this interleaving of local and global layers, all of our global layers are still required to attend to all preceding tokens, which makes them quite memory intensive and expensive to run. And this is where we've looked into introducing grouped query attention. Within our local layers, we group together two queries to share the same key and value heads. However, in our global layers, we're grouping together eight queries sharing the same key and value heads. Since reducing the number of key and value heads can have a big impact on performance, we've doubled the length of the key value heads within our global layers to have a length of 512 as opposed to 256. This grouped query attention has provided significant performance improvements without massive memory costs or increases in the inference cost of serving these models. This ratio of eight queries to one key value across all of our models has provided significant improvements in the efficiency of our global layers.
These attention changes were present across all of our models, but we also introduced a new architecture with Gemma 4, and this is our MoE.
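
As a rough illustration of the local/global interleaving and grouped query attention described above, here is a minimal Python sketch. The function names are hypothetical, the exact placement of global layers within each group is an assumption, and so is the choice to count the current token as part of the sliding window. The ratios (5-to-1), window sizes (512 and 1,024) and group sizes (2 and 8) come from the talk.

```python
def layer_pattern(num_layers, local_per_global=5):
    """Return 'local' or 'global' for each layer (num_layers >= 1).

    Assumed layout: `local_per_global` local layers followed by one global
    layer, repeated. The final layer is forced to be global, as in the talk.
    """
    kinds = []
    for i in range(num_layers):
        is_global = (i + 1) % (local_per_global + 1) == 0
        kinds.append("global" if is_global else "local")
    kinds[-1] = "global"
    return kinds


def visible_positions(query_pos, kind, window=1024):
    """Positions a query token may attend to under a causal mask."""
    if kind == "global":
        start = 0  # every preceding token
    else:
        start = max(0, query_pos - window + 1)  # sliding window (assumed inclusive)
    return range(start, query_pos + 1)


def kv_head_for(query_head, queries_per_kv):
    """Grouped query attention: consecutive query heads share one KV head."""
    return query_head // queries_per_kv


print(layer_pattern(12))
print(len(visible_positions(2000, "local")), len(visible_positions(2000, "global")))
print([kv_head_for(h, 8) for h in range(16)])
```

With a window of 1,024, a local layer at position 2,000 looks at 1,024 positions while a global layer looks at all 2,001, which is why the global layers dominate memory cost and are the ones given the larger 8-to-1 query grouping.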
Our MoE has a router and one shared expert, with a total of 128 experts and eight activated experts on each forward pass. All of these experts are small feed-forward neural networks. Similar to the architecture that we had with our dense model, we've now replaced that standard feed-forward neural network with an MoE. In practice, what this looks like is having our constant shared expert, which is activated on every pass of the model. This shared expert is three times the size of our regular experts. We then have 128 experts, eight of which are selected by the router during any pass of the model.
And lastly on the architecture side, we get into our dense effective models. So what does it mean when we're saying our model is effectively 2B, effectively 4B? This is where we're looking at the difference between the number of parameters which are required to operate the model and the total number of representational parameters present. Our 2B is effectively 2.3 billion parameters while having a representational depth of 5.1 billion parameters. These models have been specifically designed and optimized in order to have the best on-device performance. They are designed to run on phones and laptops without requiring expensive API calls to models served somewhere else in the world.
These advancements were made possible through PLE. This is our per layer embeddings, where within each of our layers, we now have a dedicated embedding table. Before digging into exactly how our per layer embedding table looks, let's take a look at how the entire token embedding layer works. This hasn't been replaced within our effective models. We still have the standard embedding table, where you're looking at the mapping of a token's ID to its embedding vector. In our E2B, we have an embedding vector size of 1,536. And in our larger E4B, we have an embedding vector size of 2,560. This is where we're storing the vector embeddings of each of these tokens.
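
The MoE layer described above (a shared expert that always runs, plus eight of 128 routed experts) can be sketched in a few lines of Python. This is a toy under stated assumptions, not the Gemma implementation: the experts here are scalar functions rather than feed-forward networks, the router weights are a softmax over the selected experts only, and the shared and routed outputs are simply added. All names are hypothetical.

```python
import math
import random

NUM_EXPERTS = 128  # from the talk
TOP_K = 8          # from the talk


def top_k_route(router_logits, k=TOP_K):
    """Pick the k highest-scoring experts and softmax-normalise their weights."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    chosen = order[:k]
    peak = max(router_logits[i] for i in chosen)  # subtract max for stability
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]


def moe_forward(x, router_logits, experts, shared_expert):
    """Shared expert always runs; routed experts are weighted and summed."""
    routed = sum(w * experts[i](x) for i, w in top_k_route(router_logits))
    return shared_expert(x) + routed


random.seed(0)
# Toy stand-ins for the expert networks (assumption: scalar in, scalar out).
experts = [lambda x, s=i: x * (s + 1) * 0.01 for i in range(NUM_EXPERTS)]
shared_expert = lambda x: 0.5 * x
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

routes = top_k_route(logits)
print(len(routes), "experts active out of", NUM_EXPERTS)
print(moe_forward(1.0, logits, experts, shared_expert))
```

Only the eight selected experts are evaluated for a given token, which is how a model with a large total parameter count can run with a much smaller number of active parameters per forward pass.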
Now, we also have a per layer embedding table. This is where, similarly, we have our entire vocabulary size and an embedding representation for each of these tokens. But now, we also have one of these for each of the layers. The big advancement that comes here is the fact that we store our PLE, our per layer embedding table, in flash memory as opposed to VRAM. VRAM is one of the largest constraints on device. It's where you quickly run out of memory on phones and laptops. So by no longer needing to store this additional embedding table in VRAM, and storing it in flash memory instead, we're able to get incredible improvements on the inference side without the expensive cost of this additional storage and memory.
The big difference in our PLE embedding table, as opposed to the standard embedding table, is that our embedding dimension is now only 256. So this is significantly reduced from the size of the full model. For each token, for example "hi", we now have an embedding for this token at each separate layer within the model. And as you progress through the layers of the models, 35 for the 2B and 42 for our larger model, you'll see the progression and improvements in the embedding representation for each of these tokens at the next subsequent layer.
And how does this work in practice? Now, at the end of our decoder block, we're able to look up the per layer embedding for each of our tokens. And this is where we're able to look up the 256 dimension and project this up to the full embedding size that's expected for each of our models. Ultimately, these improvements with PLE allow our E2B and our E4B to significantly outperform prior generations of Gemma small models.
In addition to everything that we've done on the architecture side, we've also really set a new frontier for what's possible with multimodality. In Gemma 3, we introduced vision for the first time.
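
To make the per layer embedding lookup described above more concrete, here is a small Python sketch with toy sizes. It is an illustration under assumptions, not the actual Gemma code: how the projected vector is combined with the hidden state is not stated in the talk, so adding it is an assumption, and every name here is hypothetical. The talk's real figures are a PLE dimension of 256 and a model dimension of 1,536 for E2B.

```python
import random

random.seed(0)
# Toy sizes so the example runs instantly (the talk cites 256 and 1,536).
VOCAB, LAYERS, PLE_DIM, MODEL_DIM = 10, 3, 4, 6


def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


# One small table per layer: ple_tables[layer][token_id] -> PLE_DIM vector.
# On device these tables could sit in flash, since each step only needs
# one row per token per layer.
ple_tables = [rand_matrix(VOCAB, PLE_DIM) for _ in range(LAYERS)]
# One up-projection per layer: PLE_DIM -> MODEL_DIM.
up_proj = [rand_matrix(PLE_DIM, MODEL_DIM) for _ in range(LAYERS)]


def ple_update(hidden, token_id, layer):
    """Look up the token's per layer embedding, project it up, add to hidden."""
    small = ple_tables[layer][token_id]
    projected = [
        sum(small[i] * up_proj[layer][i][j] for i in range(PLE_DIM))
        for j in range(MODEL_DIM)
    ]
    return [h + p for h, p in zip(hidden, projected)]  # assumed combination


def ple_table_params(vocab, layers, ple_dim=256):
    """Total parameters held in the per layer embedding tables."""
    return vocab * layers * ple_dim


hidden = [0.0] * MODEL_DIM
for layer in range(LAYERS):
    hidden = ple_update(hidden, token_id=3, layer=layer)
print(hidden)
```

The helper `ple_table_params` shows why these tables account for a large share of the representational parameters: the count scales with vocabulary size times number of layers, yet none of it has to live in VRAM.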
We added vision in Gemma 3, adding support across all of our sizes. And in the Gemma 3n model, we launched an open-source audio, vision, and text model for the first time. This really paved the way for Gemma 4 being natively multimodal models. Multimodality has been integrated from the very beginning across all of these models, with quite incredible performance.
Starting on the vision side, our 31B model and 26B model both use a 550 million parameter vision encoder. Our effective 2B and effective 4B have a smaller, more compact encoder of 150 million parameters. We've made big strides and improvements from Gemma 3 through the introduction of variable aspect ratios and variable resolutions. What this means in practice for you as developers is you now have a choice when you're running a Gemma model to select the resolution and the soft token budget that you want to allocate for images. This is available in five different resolutions across all of our models.
Let's take a second to revisit how our vision encoder works. What we want to understand here is we get an initial image and this image is split into patches, patches of 16 by 16 pixels. These patches are then flattened and linearly projected up into patch embeddings. These embeddings are then transformed to account for their positional encodings, and this is what's ultimately shown to our model.
With this in mind, let's take a look back at variable aspect ratios. We can have images of different aspect ratios, where we have an image of 4 by 2 and an image of 3 by 3. What's important to understand here is that patch 4 is now in a completely different position. If we're encoding both of these images in the same way, we need our model to understand that patch 4 in the one on the left is in the second row, right below our first, whereas patch 4 in the image on the right is now at the very end.
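
The point about patch 4 landing in different places can be shown with a one-line mapping. This sketch assumes patches are numbered from 1 in row-major order, and that the two grids in the example are three and four patches wide; the function name is hypothetical.

```python
def patch_position(patch_number, grid_width):
    """Map a 1-based, row-major patch number to a 0-based (row, col)."""
    return divmod(patch_number - 1, grid_width)


# Same patch number, different spatial position depending on the grid shape.
print(patch_position(4, grid_width=3))  # start of the second row
print(patch_position(4, grid_width=4))  # end of the first row
```

A flat sequence index alone cannot distinguish these two cases, which is why the row and column of each patch need to be encoded and passed to the model.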
So what's been a critical development here is ensuring that this spatial positional encoding is also passed through to our model.
In addition to variable aspect ratios, we've also had to introduce variable resolution. This is where, despite the fact that both of our images have a ratio of 3 to 2, they have very different resolutions. We have a higher resolution on the right and a lower resolution on the left. And this is where we introduce variable resolution, where you can select how many tokens you want to allocate to each image. This is critical because it allows you to determine how much of your token budget you want to spend on images. For tasks such as OCR or spatial object recognition, you want to allocate a much higher budget to ensure that you're processing high quality, high resolution images. If your applications are purely text based and you're not using some of the multimodal capabilities, you can allocate a much smaller token budget.
What's important to understand here is that this is a huge improvement from Gemma 3. In Gemma 3, we had to introduce pan and scan. This is where you would give an image of variable resolution and variable aspect ratio, and we split it into a sequence of squares and padded whatever else was necessary. Then your one image was passed to the model as two, three, or four different images, which were each processed sequentially. Now, with these advancements across variable resolution and variable aspect ratios, we're able to process images with a varying number of patches, depending upon the actual image which was provided by the user.
The next thing to understand here is how we actually take these patches of images and pass them through to the model. This is where we take three by three grids of patches and they become a single embedding, and this single embedding is then passed forward to the model.
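
The three by three grouping of patches into a single embedding can be sketched as a pooling step. The talk does not say how the nine patch embeddings are combined, so averaging is an assumption here, and the function name is hypothetical.

```python
def pool_3x3(grid):
    """Average each 3x3 block of patch embeddings into one vector.

    `grid` is a list of rows, each row a list of embedding vectors.
    Assumes the grid has already been padded to a multiple of 3 each way.
    """
    rows, cols = len(grid), len(grid[0])
    assert rows % 3 == 0 and cols % 3 == 0, "pad the grid to a multiple of 3"
    dim = len(grid[0][0])
    pooled = []
    for r in range(0, rows, 3):
        out_row = []
        for c in range(0, cols, 3):
            block = [grid[r + dr][c + dc] for dr in range(3) for dc in range(3)]
            out_row.append([sum(v[d] for v in block) / 9 for d in range(dim)])
        pooled.append(out_row)
    return pooled


# A 6x6 grid of one-dimensional toy embeddings becomes a 2x2 grid.
grid = [[[float(r * 6 + c)] for c in range(6)] for r in range(6)]
soft = pool_3x3(grid)
print(len(grid) * len(grid[0]), "patches ->", len(soft) * len(soft[0]), "pooled embeddings")
```

Here 36 patches reduce to 4 pooled embeddings, the same factor of nine that links a soft token budget to the number of patches an image is split into.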
So if you're running a Gemma model with a token budget of 280, this actually equates to 2,520 different patches, which will be constructed from each of your images. For anything that doesn't fit within a sequence of 16 by 16 patches, or 16 by 16 pixels for each of the patches, we'll provide additional padding. For an example of square images across the five different soft token budgets and resolutions which are supported with Gemma, here's an instance of the resolutions, patches, and the pooled elements in embeddings, which would exist at each of these different sizes. And this is where you really see that if you're doing something such as object detection or OCR, you really want to be running with a higher quality resolution and token budget, and this is where we have 560 and 1120 available for these use cases.
And pulling this all together, we start with an image of N patches. These patches are each turned into N patch embeddings. We then pool these together to produce N over 9 soft tokens, and these soft tokens are linearly projected up to what is actually shown to our models.
Now for the audio side. Audio has been added into the E2B and the E4B with the goal of being able to support translation and speech recognition. This is made possible through the combination of an audio tokenizer and a conformer. This is a 305 million parameter conformer which acts as an audio encoder, processing embeddings rather than tokens. On the audio tokenizer side, we start with raw audio. This raw audio is run through a mel spectrogram, which is able to extract features from the raw audio files. This mel spectrogram is then split into N mel chunks, which are downsampled through two convolutional layers. This ultimately outputs N over 4 soft tokens. These audio embeddings are what's passed forward into the conformer.
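
The arithmetic behind the soft token budgets and the audio downsampling above can be checked with a few lines of Python. The helper names are hypothetical, and the stride of 2 per convolutional layer is an assumption chosen because two such layers give the N over 4 reduction mentioned in the talk.

```python
PATCH_SIDE = 16  # pixels per patch side, from the talk
POOL = 3         # 3x3 patches per soft token, from the talk


def max_patches(soft_token_budget):
    """Patches covered by a given soft token budget (9 patches per token)."""
    return soft_token_budget * POOL * POOL


def max_pixels(soft_token_budget):
    """Pixel area covered by those patches, before any padding."""
    return max_patches(soft_token_budget) * PATCH_SIDE * PATCH_SIDE


def audio_soft_tokens(num_mel_chunks, conv_layers=2, stride=2):
    """Soft tokens left after each conv layer downsamples by `stride`."""
    n = num_mel_chunks
    for _ in range(conv_layers):
        n //= stride  # assumed stride-2 downsampling per layer
    return n


# Only the budgets named in the talk are listed here.
for budget in (280, 560, 1120):
    print(budget, "soft tokens ->", max_patches(budget), "patches,",
          max_pixels(budget), "pixels")
print(audio_soft_tokens(100), "audio soft tokens from 100 mel chunks")
```

A budget of 280 soft tokens corresponds to 2,520 patches, matching the figure quoted above, and doubling the budget doubles the image area the model can see at full detail.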
Our conformer follows a similar architecture to what we've already seen across the dense models and the MoE architecture, although this time we're adding in a convolutional layer.
Returning to where we started today, the Gemma family has four models which we released to you externally last week. We have two smaller models which have been geared towards on-device applications with support for text, vision and audio. Our two larger models are designed for more complex reasoning tasks, with support for agentic workflows and coding.
That brings us to what you can do with Gemma today and how you can get started. There are two main options for getting started with Gemma. You have the option to download and self-host all of our models. These are available across Hugging Face, Kaggle and Ollama. We also have cloud-hosted options for our larger models, our 31B and our 26B, and these are accessible across AI Studio and Vertex. This is what's going to allow you to immediately get started with prototyping, building agentic workflows and testing out function calling in these larger models. Thank you, and I'm happy to answer any questions about Gemma.
