Gemma, DeepMind's Family of Open Models — Omar Sanseviero, Google DeepMind

Gemma 4 is Google's newest family of highly capable open models, released under an Apache 2 license and designed for efficient on-device execution. The models range from 2 billion to 32 billion parameters and support multimodal and agentic applications across devices, including mobile phones.
A key innovation in the smaller Gemma "E" models is the "Per-Layer Embeddings" architecture, which reduces GPU memory requirements by treating per-layer embeddings as lookup tables that can be kept on CPU or disk, making the models efficient for mobile deployment.
The project strongly emphasizes fostering an open ecosystem for fine-tuning and integration, enabling developers to build customized, powerful AI solutions that run entirely offline and on personal devices.

Gemma 4 is Google's family of open models, ranging from 2B to 32B parameters, available for download and deployment on personal infrastructure and devices.
All Gemma 4 models are designed for on-device execution, with even the 31B model fitting on a single consumer GPU, enabling powerful offline AI capabilities.
The models support multimodal understanding (images, video, audio) and agentic capabilities, useful for tasks like coding, Android app development, object detection, and speech translation.
Gemma 4 adopts an Apache 2.0 license, providing users with broad flexibility for commercial and personal use, addressing previous community feedback.
Smaller Gemma "E" models (e.g., E2B, "effectively" 2B parameters out of roughly 5B in total) use a novel "Per-Layer Embeddings" architecture, which significantly reduces GPU memory requirements by keeping the per-layer embedding tables on CPU or disk.
Gemma 4 models are heavily multilingual, trained on over 140 languages using a Gemini-based tokenizer, making them effective for fine-tuning in low-resource languages.
Google actively collaborates with the open-source ecosystem (MLX, Ollama, Hugging Face) to ensure seamless integration and fine-tuning capabilities for Gemma 4 within existing developer workflows.
Official variants such as ShieldGemma (for content moderation) and MedGemma (for medical tasks) are available, along with numerous community-created fine-tunes and quantizations.
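
As a rough illustration of the Per-Layer Embeddings takeaway above, the sketch below estimates how much accelerator memory the weights need when only part of the model is GPU-resident. The 5B total and 2B resident parameter counts are the figures quoted in the talk; the bytes-per-parameter values (2 for 16-bit weights, 0.5 for 4-bit weights) and the helper name `weight_memory_gib` are assumptions made for illustration. The estimate covers weights only and ignores activations and the KV cache.

```python
GIB = 2**30


def weight_memory_gib(num_params: float, bytes_per_param: float) -> float:
    """Memory needed to hold `num_params` weights, in GiB."""
    return num_params * bytes_per_param / GIB


TOTAL_PARAMS = 5e9     # figure quoted in the talk for the whole model
RESIDENT_PARAMS = 2e9  # figure quoted in the talk for what is loaded into the GPU

# Assumed precisions: 16-bit (2 bytes) and 4-bit (0.5 bytes) per parameter.
for label, bytes_per_param in [("16-bit", 2.0), ("4-bit", 0.5)]:
    full = weight_memory_gib(TOTAL_PARAMS, bytes_per_param)
    resident = weight_memory_gib(RESIDENT_PARAMS, bytes_per_param)
    print(f"{label}: all weights {full:.1f} GiB, GPU-resident weights {resident:.1f} GiB")
```

Under these assumptions the GPU-resident share is 40% of the full weight footprint at any precision, since both figures scale with the same bytes-per-parameter factor.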

open models: AI models whose weights and architecture are publicly released, allowing users to download, run, and customize them on their own infrastructure.
parameters: The internal variables or weights within an AI model that are learned during training, determining its complexity and capabilities.
on-device: Refers to AI models running directly on a user's local device (e.g., phone, laptop) without requiring cloud server access, enabling offline functionality.
multimodal: The ability of an AI model to process and understand information from multiple types of data, such as text, images, audio, and video.
agentic: Describes an AI model's capability to act, reason, and make decisions to achieve a goal, often by selecting and executing various tools or skills.
Apache 2 license: A permissive open-source software license that grants users rights to use, modify, and distribute the software, including for commercial purposes, with minimal restrictions.
Per-Layer Embeddings: A novel architectural design for AI models that optimizes memory usage by treating layer-specific embeddings as lookup tables rather than computations, making them efficient for on-device use.
tokenizer: A component of an AI model that converts raw text into numerical tokens (subword units) that the model can process, crucial for handling different languages.
fine-tune: The process of further training a pre-trained AI model on a specific, smaller dataset to adapt it for a particular task, domain, or language.
quantization: A technique to reduce the size and computational requirements of an AI model by representing its weights and activations with lower-precision numbers (e.g., 8-bit integers instead of 32-bit floats).
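
To make the quantization entry above concrete, here is a minimal sketch of one simple scheme: symmetric 8-bit quantization with a single shared scale. It is an illustrative assumption, not a description of how any particular Gemma quantization is produced; community quantizations typically use more elaborate formats. The function names are hypothetical.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.0, 0.25, 0.0, 0.9]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# Rounding to the nearest step bounds the error by half a step (scale / 2).
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, scale, max_error)
```

Each weight is stored in one byte instead of four, at the cost of a reconstruction error of at most half a quantization step.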

Hi everyone, it's full in here. I'm super excited to give this talk because just seven days ago we released Gemma 4. Before this conference, who here had heard about Gemma 4 already? Okay, most of you, great. So Gemma 4 is Google DeepMind's family of open models. Open models means that these are models that you can take, download, run on your own infrastructure and your own devices, and fine-tune for your own use cases.
About a year ago we released Gemma 3. Back then, the Gemma 3 models were the most capable open models that could fit on a single consumer GPU. We designed models from 1 billion parameters all the way to 27 billion parameters, and back then on LMArena it was a very strong model. You see here different open models and their LMArena scores, and those small dots at the bottom represent how many H100s or A100s you would need just to be able to load the models. Again, this is Gemma 3, from one year ago. But you can see that even though it's a model from a year ago, it's a relatively small model that is extremely capable.
But yeah, last week we released Gemma 4, and this is my first conference talking about Gemma 4, so I'm very excited about that. Gemma 4 is the most capable family of open models that Google has released. These are models that go from 2 billion parameters all the way to 32 billion parameters. They have very different capabilities, so I'm going to talk a bit about these different things. And if you're wondering what the E there is, I'll explain that in a second. The smallest models can run on an Android phone, on an iPhone as well, even on a Raspberry Pi. These are really small models that are multimodal, have reasoning, and can do very cool on-device agentic things. And there's an MoE, a mixture-of-experts model, that's super fast, with very low latency, a model that can do very cool things.
And then you have the 31B, which is the most intelligent, most capable model. When you want the most raw intelligence, you would use this large model. But even the 31B is a model that can run on a consumer GPU. So these models have been developed in different sizes, which is quite important to us.
Let me show you a couple of demos, assuming the video loads. There's a lot happening here, so let me begin with the one on the right. That's an application where you have Gemma running directly on an Android phone, where you can pick different skills. So here you have a full agentic setup where the model is picking, say, a skill to play the piano, and then you have Gemma playing the piano. The one on the left is Gemma vibe coding, also on device. This is again in airplane mode, no API calls, running fully on a phone. And the example in the middle is on a laptop, where we have 20 instances, or 10 instances, of Gemma running in parallel. Each of them is doing a different SVG, and in a couple of seconds you are going to see 10 SVGs generated by different agents, all of this running on device with llama.cpp. Even then it's about a hundred tokens per second, and there you can see the SVGs that were generated by the 10 different Gemma models. Gemma is a good coding model: it can do agentic stuff, it can do coding, it can even do Android app development, and again all of this offline.
The LMArena scores are quite nice. Here you can see a bunch of different models; the x-axis is how many billion parameters the model has, and the y-axis is the LMArena score. I know LMArena is not the perfect benchmark, but it does give you some proxy of how much the community likes the model for general use cases, like conversations and so on. And Gemma has a nice mix between being friendly and helpful and at the same time being very capable.
And you can see this corner at the top left, which means that these are very small models that are very capable, which is quite exciting. It's been exciting to see how the models have progressed over the last two years. Last year it was Gemma 3; two years ago it was Gemma 1, sorry, Gemma 2. And you can see that across a bunch of different things, the models have been getting better and better without getting bigger. For me that is quite exciting, because if I think about where we'll stand a year or two from now, I do think we'll have extremely capable models running directly on our own devices, in our own pockets. I'll skip the benchmarks. But what is exciting is that Gemma can fit on a desktop computer, on a laptop, on a phone. I saw just today, or two days ago, that someone put llama.cpp on a Nintendo Switch and is using it to try Gemma directly there. So I don't know how things will be in a couple of years, but I'm excited for it.
Something that we heard a lot with the previous Gemma versions was that the license we had was not great. People wanted a proper open-source license. So with Gemma 4, we changed our license to an actual Apache 2 license, so you have all the flexibility that the Apache 2 license gives you. That's quite nice as well.
Now, you have probably heard about mixture of experts. That's the 27B model, sorry, the 26B model. You have heard about transformers and dense models, but you have probably never heard about the E here. It stands for "effectively": effectively 2 billion parameters. If you actually count all the parameters, the model has 4 billion parameters or so. It has a novel architecture called Per-Layer Embeddings, which was something that we released in the summer of last year. So there's this small block at the bottom. The TL;DR here is that there is an embedding for each layer, as the name indicates.
And it works more as a lookup table than as a computation that you need to do. So this is an extremely fast thing. You don't need to have this on the GPU; you can have it on the CPU, or you can have it on disk. This is an architectural decision that is really optimized for on-device use cases like mobile. That's why the smallest models, the ones running on Android or on an iPhone, use this E2B or E4B architecture. So even if the model is 5 billion parameters, you actually just load 2 billion parameters into the GPU. The rest can sit in much slower memory, because you are not doing any of the matrix multiplications that you would usually do with a transformer architecture. This can already be done in llama.cpp with a simple flag, override-tensor: you move the per-layer embeddings to CPU or even to disk, and it should work quite well out of the box.
A couple of other exciting things: the smallest models can do multimodal understanding for images, for videos, and even for audio. So you can do speech recognition, and you can do speech-to-translated-text, so I can speak in Spanish and the speech can be transcribed into French text. And then the larger models can do extremely capable multimodal understanding: videos, fine-grained details. I actually have a couple of examples in here. For example, it can do things such as pointing to where the llama is in the picture. It can do object detection, so it can detect different objects in a picture.
And what is cool is that this model is heavily multilingual. Gemma 4 was trained with over 140 languages, and it uses a tokenizer that is based on Gemini's as well. So pretty much all of the multilingual research that powers Gemini is also enabling Gemma. The tokenizer piece is quite interesting because, independently of the raw capabilities of Gemma, this tokenizer was designed for multilingual use cases and we took lots of care with it.
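
The Per-Layer Embeddings idea described earlier in the talk (a per-layer lookup table that can live on CPU or disk) can be sketched with a toy example. This is a minimal illustration under stated assumptions, not Gemma's or llama.cpp's actual implementation: the table sizes, the file layout, and the names `write_tables` and `lookup` are all hypothetical. It shows only that fetching a row is a seek and a small read, with no matrix multiplication.

```python
import array
import os
import tempfile

# Toy sizes chosen for illustration only; they are not Gemma's real dimensions.
NUM_LAYERS = 4
VOCAB_SIZE = 16
EMBED_DIM = 8
ITEM_BYTES = array.array("f").itemsize  # bytes per stored float


def write_tables(path):
    """Write one [VOCAB_SIZE x EMBED_DIM] table per layer, back to back, to disk."""
    with open(path, "wb") as f:
        for layer in range(NUM_LAYERS):
            for token_id in range(VOCAB_SIZE):
                # Deterministic fake values so a lookup is easy to check.
                row = array.array("f", [layer * 100 + token_id] * EMBED_DIM)
                row.tofile(f)


def lookup(f, layer, token_id):
    """Read one embedding row by seeking to its byte offset (no matrix multiply)."""
    row_index = layer * VOCAB_SIZE + token_id
    f.seek(row_index * EMBED_DIM * ITEM_BYTES)
    row = array.array("f")
    row.fromfile(f, EMBED_DIM)
    return row.tolist()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "per_layer_embeddings.bin")
        write_tables(path)
        with open(path, "rb") as f:
            # Only EMBED_DIM floats are read for each (layer, token) pair.
            print(lookup(f, layer=2, token_id=5))
```

Because each lookup touches only one row, the full set of tables never has to be resident in fast memory, which is the property the talk attributes to this design.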
This is interesting because if you want to fine-tune Gemma for a different language, one with few digital resources, let's say an indigenous language in Peru such as Quechua, or, I don't know, one of the official languages in India, you can pick the model, use your data, and train it. Independently of the raw capabilities of Gemma, just because of the tokenizer decisions, things tend to work quite well out of the box. Then you can mix the multilingual and multimodal capabilities. For example, here the model gives the text, or an explanation, of an image with Japanese text, and that's quite cool.
We released the model a week ago, and just yesterday we got to 10 million downloads for the Gemma 4 base models alone. There are over 1,000 models based on Gemma 4 already, so quantizations or fine-tunes by the community, and over 500 million downloads of the whole Gemma family. What is very cool for me is that Gemma is not just about being an awesome model that you can use; it's more about enabling the ecosystem to build on top of it. And that's what the community has done over the last few days. It was at the top of Hugging Face, people have been building cool examples, people have been doing full repository audits using Gemma, and people are putting Gemma on all kinds of devices and exploring all of the capabilities, which is quite nice.
And all of this is not done just by us. We collaborate with the open-source ecosystem: we work with Unsloth, MLX, Ollama, Hugging Face, vLLM, SGLang, and pretty much everyone. We want to ensure that when we launch a new tool, both for Gemma and [unclear], people can leverage the capabilities out of the box, right? They should not need to switch to Keras if they want to fine-tune Gemma. If they are fine-tuning with Hugging Face Transformers, they should be able to do that. So for us, it's very important and critical to be where the community is.
And that's why I want to give a real shout-out to all of you who are working in the open-source ecosystem, contributing to different tools, maintaining all of these repositories, because that is really what enables the ecosystem to do amazing things.
Another part that I like about Gemma is all of the product integrations that we can do. Take Android Studio. I don't know if anyone here is an Android developer, but Android Studio has an agent mode where you have an agent that helps you vibe code and develop. And there's an offline mode now where you can have a llama.cpp-, Ollama-, or vLLM-powered system in which Gemma helps you vibe code for Android development. We did include some Android-related datasets and benchmarks while training Gemma, so it's actually a very capable model for Android development.
I talked a bit about how many people are fine-tuning and how many people are sharing, so let me share a bit about the Gemmaverse. This number is outdated; it's from last week. Now we have 500 million downloads, as I mentioned, and in total Gemma has over 100,000 models. So again, maybe you just want to use the models out of the box. The open models may work just right for you. But maybe you want to improve the capabilities. Maybe you want to change the style in which the model talks with users. Maybe you don't want a conversational model, right? Maybe you just want a model that can predict certain things in your own context. Or maybe you just have too many GPUs at home and you want to burn them. I don't know what the reason is, but you can fine-tune models for many cool things.
Google has done a couple of what we call official Gemma variants. We did ShieldGemma, which is a family of guardrail models. Those are great for production use cases where you don't want users to submit, let's say, toxic images or toxic text that does not match the policies that you have set up.
So ShieldGemma is the family of models that allows you to do that. But then there are also other kinds of use cases. For example, for medical use cases, we have released MedGemma, which is a multimodal, Gemma 3-based model for different medical tasks: radiology, chest X-ray understanding, and a bunch of other things. And again, these are open models. You can use them, and you can also fine-tune them even more if you have an even more niche use case. So that's what Google has done.
But the community is also doing cool things. For example, there is AI Singapore, a group that is training models for Southeast Asian languages. There are a bunch of them, and they have been doing quite a bit of research with open models to push the state of the art in multilinguality even further. Another example is Sarvam. In India, there are many official languages, and there is this effort by the government, which is investing in a couple of big startups to train national models. So this is more from the sovereign AI and official languages point of view. But people are doing very interesting stuff on the multilingual side of things.
Apart from that, there is quite a bit of other cool research happening. There was this paper, released in December of last year, about how some researchers from DeepMind were able to use Gemma 3 to propose some cancer therapy pathways, which were then taken to an actual lab. And they were able to validate that the pathways proposed by this Gemma-based model led to actual results. That was quite exciting because it's not just about having an assistant, or chatting, or, I don't know, doing role-playing and whatnot. It's also about building models that can be used for actual things that help the community in many different ways.
So be that finance or be that, I don't know, legal, or offline use cases where you don't want your data to leave your servers. Be that offline modes, if you're in the subway or in an airplane and you need to use AI for something. If you want to have a Chrome extension that has Gemma in it and helps you understand what is on your screen, if you want to do on-device control, open models are getting there. And for me, that's quite exciting, because if you compare where we are now with where we were one or two years ago, open models can now do very cool, very interesting, highly agentic, complex tasks entirely on device, entirely on your phone.
I'd really recommend that all of you spend just one hour in the next two weeks playing with open models, the latest open models, to try to understand what their capabilities are. Of course, there are many things for which you will want to use an API-based model. If you want the most raw intelligence, you will go and use Gemini or your model of choice. But if you want to have things on device, there are many exciting things that you can already do. And for me, what is most exciting is that I don't know how things will be six or 12 months from now, but I think we're heading in a very exciting direction, where people will be able to have extremely capable open models on their own devices, customized for their own use cases with their own data. So yeah, please try the models, build something, and share it. All right, thank you.
