Everything I Learned Training Frontier Small Models — Maxime Labonne, Liquid AI

Small models, particularly for edge devices, are not simply scaled-down versions of larger models; they possess unique challenges and opportunities due to being memory-bound, task-specific, and latency-sensitive.
Optimizing small models requires specialized architectural design, such as smaller embedding layers and efficient operators like short convolutions, combined with extensive pre-training beyond traditional scaling laws.
Key post-training techniques like focused supervised fine-tuning, preference alignment, and reinforcement learning are crucial for improving performance and addressing issues like "doom looping," making small models effective for agentic tasks.

Small models are defined by being memory-bound, task-specific, and latency-sensitive, which dictates their design and optimization strategies.
Architectural design should prioritize effective parameters by reducing the relative size of embedding layers and implementing fast operators like short convolutions (e.g., LFM2 architecture uses short convs and GQA).
Small models can benefit from significantly more pre-training data than suggested by traditional compute optimal scaling laws (e.g., 350M parameter model on 28 trillion tokens showed continued performance growth).
Post-training stages, including supervised fine-tuning (SFT), preference alignment (e.g., DPO), and reinforcement learning (RL), are critical and should be highly targeted to specific tasks for small models.
Doom looping (repetitive output) is a particular challenge for small reasoning models on complex tasks. It can be mitigated during preference alignment data generation and during reinforcement learning with verifiable rewards and n-gram repetition penalties.
Agentic approaches, which equip small models with external tools like web search, are highly effective in compensating for low knowledge capacity and addressing long-context limitations.
Small models are ideal for deployment scenarios requiring on-device processing, low latency, or strict privacy (e.g., in-car, finance, healthcare).
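To make the takeaway about pre-training beyond compute-optimal scaling concrete, here is a minimal arithmetic sketch. It assumes the commonly quoted rule of thumb of roughly 20 training tokens per parameter from the Chinchilla scaling laws; that ratio is an approximation brought in for illustration and is not a figure from the talk. The model size and token count are the ones quoted in the talk.

```python
# Rule-of-thumb ratio often quoted for Chinchilla-style compute-optimal training.
# This is an assumed approximation, not a number given in the talk.
CHINCHILLA_TOKENS_PER_PARAM = 20.0


def compute_optimal_tokens(n_params: float,
                           tokens_per_param: float = CHINCHILLA_TOKENS_PER_PARAM) -> float:
    """Approximate compute-optimal number of training tokens for a model size."""
    return n_params * tokens_per_param


def overtraining_factor(n_params: float, n_tokens: float,
                        tokens_per_param: float = CHINCHILLA_TOKENS_PER_PARAM) -> float:
    """How many times more tokens were used than the rule of thumb suggests."""
    return n_tokens / compute_optimal_tokens(n_params, tokens_per_param)


n_params = 350e6   # 350M parameters (from the talk)
n_tokens = 28e12   # 28 trillion pre-training tokens (from the talk)

optimal = compute_optimal_tokens(n_params)
factor = overtraining_factor(n_params, n_tokens)

print(f"rule-of-thumb optimum: {optimal:.2e} tokens")
print(f"actual budget:         {n_tokens:.2e} tokens")
print(f"overtraining factor:   {factor:.0f}x")
```

Under this assumed ratio, the rule-of-thumb budget for a 350M model is a few billion tokens, so a 28-trillion-token run sits several orders of magnitude beyond it.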

Edge models: Machine learning models designed for on-device deployment, typically with limited computational resources.
Memory-bound: A characteristic where a model's performance is primarily limited by the available memory rather than computational speed.
Latency-sensitive: A characteristic where the speed of response or throughput of a model is critical for its intended application.
Embedding layer: A component of a neural network that maps discrete input (like words) into continuous vector representations.
Effective parameters: The subset of a model's parameters primarily responsible for reasoning and knowledge capacity, excluding components like large embedding layers that primarily handle vocabulary.
Distillation: A technique where a smaller "student" model is trained to mimic the behavior of a larger "teacher" model.
GQA (Grouped Query Attention): An attention variant in which groups of query heads share key and value heads, reducing memory use and improving speed in transformer models.
Short convolutions: Convolution layers that process input over a small receptive field, often favored for speed in edge models.
Compute optimal scaling laws: Empirical relationships that suggest the ideal balance between model size, training data, and compute for maximizing performance.
Supervised fine-tuning (SFT): The process of training a pre-trained language model on a smaller, task-specific dataset with labeled examples.
Preference alignment: A post-training technique that fine-tunes a model to better match preferred responses, using comparisons between a chosen and a rejected answer.
Reinforcement Learning (RL): A machine learning paradigm where an agent learns to make decisions by performing actions in an environment to maximize a reward signal.
Doom looping: A common failure mode in generative models where they repeat the same sequence of words indefinitely.
Direct Preference Optimization (DPO): A type of preference alignment algorithm that directly optimizes a policy on preference pairs without needing a separate reward model.
Verifiable rewards: A reward mechanism in reinforcement learning where the reward is computed by automatically checking the model's output, for example by extracting the final answer and comparing it with a known correct result.
Agentic: A paradigm where models are integrated with external tools and environments (e.g., web search, calculators) to perform complex tasks by breaking them down into sub-problems.
Hallucinate: When a generative AI model produces output that is factually incorrect or nonsensical, despite sounding plausible.

Hi everyone, my name is Maxime Labonne. In this presentation I want to talk about the lessons I've learned post-training small models. For context, I work at Liquid AI as head of post-training. At Liquid we mostly focus on edge models for on-device deployment. As you can see here, we have models from 350 million parameters to 24 billion parameters, so this is very, very small. Yesterday we released our new VLM, 450M, and the week before we released the new version of the 350M model for text. So this is what we do. We work across text, vision, and audio. And yeah, the models are available on Hugging Face if you want to try them out.
In this presentation I want to talk about what separates small models and big models. There are three main characteristics I want to talk about. First of all, small models are memory-bound, because the hardware is what it is, right? On the phone, in the car, etc., we can't really use super big models, which is why we try to keep the size quite small. And because of that we have low knowledge capacity compared to bigger models. Then the models are task-specific, which is great, because if you have small knowledge capacity you can at least focus on one thing and do it very well. That means they are usually not general-purpose chatbots like ChatGPT. They are a lot more narrow in terms of focus, and they can do something like summarization or tool use very, very well. So that's the second aspect. And the final one is that they are very latency-sensitive, which means that you need to have very, very fast throughput.
So all these characteristics are very important, and we'll see in this presentation how they play with each other and how we can do better. But the main lesson I want you to retain from this presentation is that small models are not just scaled-down versions of bigger models. They also have their unique challenges, and we will see how we deal with them in this presentation.
The first thing I want to talk about is the model architecture, because there are a lot of interesting things that we can do here for edge models. I want to first talk about Gemma 3 270M and Qwen 3.5 0.8B. These models are the smallest versions of their respective families, and you can see that both of them adopt a hybrid architecture. Gemma 3 is a hybrid of sliding window attention and GQA. Qwen 3.5 has a new architecture with Gated DeltaNet and gated attention. This is great because it is a lot faster.
But what I'm interested in here is actually the embedding layer. Because if you look at the size of the embedding layer compared to all the parameters of the model, you see that Gemma 3 270M is actually mostly an embedding layer. It's 63% of the total parameters. And even for Qwen 3.5 0.8B, it's still like 29% of the parameters. So that's not super efficient, because the effective parameters, the parameters that are really used for reasoning, for knowledge capacity and all that stuff, are not the embedding parameters. It's the rest. So the effective size is actually a lot smaller, and it means that you could squeeze more reasoning and more performance from the same memory footprint. The reason why they do that is because they use distillation to train the models. So those are the student models, and they have teacher models with huge vocabulary sizes. This is why we have these super big embedding layers.
All right, let's talk about the LFM2 architecture. As you can see, the LFM2 architecture is actually not that different in terms of layers. We also have a hybrid architecture, and this time we have short convolutions and GQA. First, you can see that the embedding layer is actually a lot smaller compared to the others. It's like 19% of the parameters. So we have more effective parameters, which is great. And I want to talk about how we created this architecture.
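The embedding-share argument above can be sketched as a small calculation. The vocabulary sizes, hidden widths, and parameter totals below are illustrative assumptions chosen to resemble a large-vocabulary and a smaller-vocabulary model; they are not figures given in the talk, and the sketch assumes the input embedding is tied with the output projection unless stated otherwise.

```python
def embedding_params(vocab_size: int, hidden_size: int, tied: bool = True) -> int:
    """Parameters in the token embedding (doubled if input/output are untied)."""
    params = vocab_size * hidden_size
    return params if tied else 2 * params


def embedding_share(vocab_size: int, hidden_size: int,
                    total_params: float, tied: bool = True) -> float:
    """Fraction of all parameters that sit in the embedding layer."""
    return embedding_params(vocab_size, hidden_size, tied) / total_params


def effective_params(vocab_size: int, hidden_size: int,
                     total_params: float, tied: bool = True) -> float:
    """Parameters left for the rest of the network (attention, convs, MLPs)."""
    return total_params - embedding_params(vocab_size, hidden_size, tied)


# Hypothetical configurations, for illustration only.
large_vocab = dict(vocab_size=262_144, hidden_size=640, total_params=270e6)
small_vocab = dict(vocab_size=65_536, hidden_size=1024, total_params=350e6)

large_share = embedding_share(**large_vocab)
small_share = embedding_share(**small_vocab)

print(f"large-vocab model: {large_share:.0%} embeddings, "
      f"{effective_params(**large_vocab) / 1e6:.0f}M effective")
print(f"small-vocab model: {small_share:.0%} embeddings, "
      f"{effective_params(**small_vocab) / 1e6:.0f}M effective")
```

With these assumed shapes, the larger-vocabulary configuration spends most of its budget on embeddings even though its total parameter count is lower, which is the trade-off the talk describes.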
And we did on-device profiling. So instead of doing more theoretical work, we decided to really implement it on the target hardware. We had two hardware targets here, and we wanted to see how it performs in real life to be able to optimize the architecture and find the right operators. And the thing that we found is this gated short convolution block that you can see here. Why is this nice? Because it's very, very fast. The short convs are a lot faster than all the alternatives. You can see here, compared to sliding window attention from Gemma 3, the Gated DeltaNet from Qwen 3.5, gated linear attention, and grouped query attention, the cost ratio is really in favor of short convs, which is great because we said that this is very latency-sensitive. So this is exactly what we want.
So this is quite theoretical. But if we look in practice and we profile the inference of these models, you can see here the two CPUs, the AMD Ryzen AI Max+ 395 and the Samsung Galaxy S25 Ultra. All these models don't necessarily have the same size, but it gives you a rough picture. And you can see that the short convs really allow LFM2 to be a lot faster and also use less memory. So that's great. And it's not just for CPU. On GPU too, you can see that it has a lot of throughput, even at very high concurrency levels.
All right, let's talk a bit about training now. The LFM2.5 training recipe is quite similar to what you can find elsewhere in terms of stages. We have pre-training on 28 trillion tokens. We have supervised fine-tuning, preference alignment, and reinforcement learning. And here you can see I'm talking about 28 trillion tokens, and I said that we released a model of 350M parameters last week. So yeah, we pre-trained a 350 million parameter model on 28 trillion tokens. If you're familiar with the Chinchilla scaling laws, that might sound a bit weird, because we're supposed to be compute optimal.
I don't know, maybe 1 trillion, not even 1 trillion tokens. But it's actually not the case. We see that the performance still grows when you scale the number of pre-training tokens. And there was a super interesting paper by Roberts et al. published last week about test-time scaling laws. You can see here how LFM2.5-350M compares to their new scaling laws. You can see the Chinchilla scaling laws here and the new ones that they proposed here. And actually, we did not pre-train the model on enough tokens. We should have pre-trained even more to be optimal according to their laws. But this is cool, because more pre-training works. And it works even at the smallest scale, which is great because these models are a lot cheaper to train than much bigger models.
And here you can see a comparison. It's not just for pre-training, these are post-trained models. You can see that the LFM2.5 model is significantly better than the previous version, LFM2-350M, on a lot of different benchmarks. You have knowledge with GPQA Diamond. You have instruction following with IFBench. You have CaseReportBench, which is data extraction. And also a lot of tool use with BFCL and τ²-Bench. This model is only 350M parameters. So what we wanted is for the model to be very, very good at data extraction and at tool use. And the rest, if it's not the best model in code, it doesn't matter. People don't use it that way anyway. Same for math. I think it's really nice to try to target some capabilities and not try to be average on everything.
All right, let's talk exactly about that. Post-training small and big models, what's the difference? We have pretty much the same stages. So it's not really in terms of stages that we see a difference. It's more about how you do it. For supervised fine-tuning, it's better if you are actually quite narrow and you focus on some tasks. It's true for general-purpose post-training, but it's also true if you do fine-tuning.
So you can take one of these models on Hugging Face and just fine-tune it for your use case. For example, you have a use case where you have a particular function that you want to call. This is great. This is an excellent use case. The more narrow you can find it or design it, the better it is.
Then we have preference alignment. During post-training, we have our own on-policy, length-normalized direct preference optimization algorithm that we quite like. Preference alignment is very nice because it brings you general improvements. It's not just about benchmarks. Overall, after preference alignment, the model is better, it sounds better, and it's really nice to be able to just improve it overall.
And finally, we have reinforcement learning. Reinforcement learning is extremely efficient even at very small scale. It's a really, really important technique that we use everywhere. The main thing is that it's very narrow in terms of focus. So you want to have as many environments and as many tasks as possible and make sure that you generalize well thanks to this. And small models in particular are quite sensitive to cold-start SFT data. So if you have a particular task in reinforcement learning, it's always good to have similar samples and a similar task in your supervised fine-tuning mixture. And this is good feedback: if you see during reinforcement learning that something doesn't train very well, it's probably because you are missing some cold-start SFT data. Maybe the task is too complex, there are different reasons. But you can try to start again from the supervised fine-tuning stage, add your data, and then see if it improves anything.
All right, but there's a new problem with small language models that you might have encountered even with bigger ones. And this is doom looping.
So the point with doom looping is, as you can see here, that the model is going to start outputting a sequence of words over and over and over again, and it just never stops. This is a problem all the time, but it's particularly a problem if you have small models, if you have reasoning models, and if you have complex tasks, where the task is basically too complex for the model. Hopefully this recipe is not too complex for this model, but this can happen anywhere. And when you have the three of them at the same time, so a tiny reasoning model on a super difficult math task, this is the perfect recipe to have a lot of doom loops. So this is a unique challenge that you find with small models, to give you a concrete example.
And here I can talk about how we solve it. The first thing is that we solve it during the preference alignment stage, and in particular in the data generation part. Here you can see the pipeline that we use for the on-policy data generation for preference alignment. We start with prompts, like 1 million samples to give you a rough idea. Then we use the policy model, the model we want to train, with temperature sampling, and we generate five rollouts. Because we use temperature sampling, these rollouts tend to be a lot more diverse, and we expect that not all of them will have doom loops. At least one should not doom loop, right? On the other hand, we generate just one extra rollout with the policy model at temperature zero. And this one, we think that it's going to doom loop. Then we give everything to an LLM jury to score all the rollouts. We pick the best one, the one with the highest score, as the chosen answer, and the one with the worst score as the rejected answer. The idea is that if we have some doom loop here, the response with the doom loop will be rejected. So we will train the model during preference alignment to not doom loop. And this is quite effective.
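The data generation pipeline described above can be sketched roughly as follows. This is a minimal illustration, not Liquid AI's implementation: `generate` and `judge` are hypothetical callables standing in for the policy model and the LLM jury, the default temperature of 1.0 is an assumption, and the toy versions at the bottom exist only so the sketch runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str


def build_preference_pair(
    prompt: str,
    generate: Callable[[str, float], str],  # hypothetical: (prompt, temperature) -> text
    judge: Callable[[str, str], float],     # hypothetical: (prompt, response) -> score
    n_sampled: int = 5,
    temperature: float = 1.0,               # assumed value, not given in the talk
) -> Optional[PreferencePair]:
    """Sample diverse rollouts plus one greedy rollout, then keep best and worst."""
    rollouts = [generate(prompt, temperature) for _ in range(n_sampled)]
    rollouts.append(generate(prompt, 0.0))  # greedy rollout, more likely to loop
    scored = [(judge(prompt, response), response) for response in rollouts]
    best_score, chosen = max(scored, key=lambda item: item[0])
    worst_score, rejected = min(scored, key=lambda item: item[0])
    if best_score == worst_score:
        return None  # no preference signal, skip this prompt
    return PreferencePair(prompt=prompt, chosen=chosen, rejected=rejected)


# Toy stand-ins so the sketch is runnable without a model or a judge.
def toy_generate(prompt: str, temperature: float) -> str:
    if temperature == 0.0:
        return "the answer is " * 4  # simulated doom loop
    return "The answer is 4."


def toy_judge(prompt: str, response: str) -> float:
    words = response.lower().split()
    return len(set(words)) / len(words)  # distinct-word ratio as a crude score


pair = build_preference_pair("What is 2 + 2?", toy_generate, toy_judge)
print(pair)
```

Returning `None` when every rollout gets the same score is a design choice of this sketch: a pair with no score gap carries no preference signal, so the prompt is simply dropped.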
So this is solution number one. And then we have solution number two. This one is about using reinforcement learning with verifiable rewards, and we add a bit of n-gram repetition penalty. Reinforcement learning with verifiable rewards is a very nice way to actually solve this issue. Because if you have a question, like a math question like this one, you are going to try to extract the final answer. If you do not have a final answer, you won't get a positive reward. So this is already being taken care of during reinforcement learning with verifiable rewards. But on top of that, you can add a bit of repetition penalty to make sure that you are going to generate fewer doom loops in general. And the same thing, we also use temperature sampling here, so the rollouts are also quite diverse, and it is just less likely that you are going to get a lot of doom loops all the time. So this is the second solution.
And that allows us to really reduce the doom loop ratio. This is a real example with LFM2.5-1.2B-Thinking, which is a small model. It's a reasoning model. And on top of that, we threw really hard tasks at it. You can see that after mid-training, the doom loop ratio that we calculated across a lot of benchmarks was about 15%, or even 16%. Then after SFT, it barely moves. SFT is not the right stage to fix this. We didn't have doom loop examples during the SFT stage, but it's not enough to get rid of this issue. After DPO, so that was the first solution, it really goes down quite a lot. And you can see that after reinforcement learning, the problem is almost nonexistent. If today you try to do the same thing with Qwen 3.5 0.8B in reasoning mode, you will see a lot, a lot of doom loops, like over 50% of doom loops. It's something that people complain about online. And that also shows that Qwen 3.5, like this tiny model, is just a scaled-down version of bigger models.
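The second solution, a verifiable reward combined with an n-gram repetition penalty, can be sketched as below. Everything specific here is an assumption made for illustration: the "Final answer: ..." output format, whitespace tokenization, the 4-gram window, the penalty weight, and the 0.5 threshold used to flag a completion as a doom loop. The talk does not say how the reward or the doom loop ratio is actually computed.

```python
import re
from collections import Counter
from typing import List, Optional

# Assumed output convention for this sketch: "Final answer: <value>".
ANSWER_PATTERN = re.compile(r"Final answer:\s*(\S+)")


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the last stated final answer, or None if there is none."""
    matches = ANSWER_PATTERN.findall(completion)
    return matches[-1] if matches else None


def repeated_ngram_fraction(completion: str, n: int = 4) -> float:
    """Fraction of n-grams that repeat an n-gram seen earlier in the text."""
    tokens = completion.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeats = sum(count - 1 for count in counts.values())
    return repeats / len(ngrams)


def reward(completion: str, reference: str,
           penalty_weight: float = 0.5, n: int = 4) -> float:
    """1.0 for a correct extracted answer, minus a weighted repetition penalty."""
    correct = extract_final_answer(completion) == reference
    return (1.0 if correct else 0.0) - penalty_weight * repeated_ngram_fraction(completion, n)


def doom_loop_ratio(completions: List[str], threshold: float = 0.5, n: int = 4) -> float:
    """Share of completions whose repetition fraction exceeds the threshold."""
    if not completions:
        return 0.0
    looping = sum(repeated_ngram_fraction(c, n) > threshold for c in completions)
    return looping / len(completions)


good = "2 plus 2 equals 4. Final answer: 4"
loop = "wait let me check " * 10  # never reaches a final answer

print(reward(good, "4"), reward(loop, "4"), doom_loop_ratio([good, loop]))
```

A looping completion is penalized twice in this sketch: it never produces an extractable answer, so it earns no correctness reward, and the repetition term pushes its reward below zero.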
And this is not the approach that we're taking here at Liquid. We want to say, okay, the edge models, they are their own thing. And this is also a way to optimize the entire architecture, the entire stack, to make sure that we treat them as well as possible.
And finally, I want to talk about the next steps for all these small models with agentic reinforcement learning. The final characteristic I didn't come back to here is being memory-bound. If you're memory-bound, it means that you have low knowledge capacity. If you have low knowledge capacity, it means that you're going to hallucinate a lot. But a nice way to solve this issue is just providing web search tools to the model. If you have a tiny model, but it's able to Google everything that you throw at it in terms of knowledge questions, you're going to have much, much better performance than if you just rely on the base model. And same thing with a lot of problems that you can throw at the model. I think that, from experience, these tiny models are actually very good at agentic tasks. And this is how we should use them. It doesn't matter if they don't have the knowledge capacity of big models. What they truly need is really good reasoning abilities to make sure that they are able to use these tools in a reliable manner.
Another point that I haven't mentioned here is that small models are also not very good at long-context capabilities. But it's okay, because if you have something like a recursive language model environment, then you can use Python and basically take a shortcut to solve this issue. So most of the issues that you find with small language models can actually be fixed in different ways. It just requires more creativity. It just requires thinking about this problem not like you would think about it from a bigger-model perspective, but everything about this is fixable. All right, so in conclusion, some takeaways.
I hope I convinced you that edge models have unique challenges and that they are actually interesting from a scientific point of view and also from a production point of view. If you combine them with agentic tools, they tend to perform really, really well. And this is something that is currently under-explored. We talk about agentic workloads with really big models, but it's not necessarily the best use case. It's not necessarily the best fit all the time. And yeah, finally, we're working on LFM3 and we have a ton of crazy experiments and ideas to try. So come work with us if you're interested in this space. Thank you everyone.
Yes. Can you share a bit about how you use these models in your work? How you make the decision?
Yeah, this is a good question. So the question is how we use these models in our workflows and how we decide between small models and big models. The main idea here is that you will try to use the small models when you don't have an internet connection, for example. In-car deployment is a good example of that, because you can't count on a reliable internet connection, so it makes sense. Latency is also a big one. If you have a workload that is very latency-sensitive, small models running locally are always going to be better. And another one is privacy. If you are in a regulated environment, if you work in finance or healthcare, it's also a good one.
I make the models. So I make the models for other people, but in my workflows, not necessarily.
It's a good question. I think you need to do some experiments to see if just distilling from a bigger model translates well in terms of doom looping. I would say no. I don't think so, because I think it would be too close to SFT. It depends how you do this distillation. If you use top-k and you have a large enough k, maybe this is good enough. But I think that it would not be completely solved and you would still need several batches to make sure that it doesn't happen again. All right.
I think I should quit, but I'll be around if you have other questions. Thank you very much.
