- Reinforcement Learning (RL) environments offer a dynamic training paradigm for Language Models (LLMs), allowing them to learn by interacting, exploring, and improving from direct feedback. This approach helps overcome the limitations of traditional static dataset-based training.
- The
Verifiersopen-source library provides a modular framework for building these RL environments, abstracting away complex infrastructure to let developers focus on task logic and reward design. - By leveraging RL with verifiable rewards, even a small, initially underperforming LLM can be transformed into a highly capable agent for specific tasks, demonstrating mastery and outperforming larger models.
Let LLMs Wander: Engineering RL Environments — Stefano Fiorucci
- Reinforcement Learning (RL) with verifiable rewards allows LLMs to explore diverse trajectories and discover efficient reasoning strategies, unlike Supervised Fine-Tuning (SFT) which is limited by the distribution of human-curated examples.
- The
Verifierslibrary simplifies environment creation by offering base classes for single-turn, multi-turn, and tool-enabled LLM interactions, supporting both evaluation and training through an OpenAI-compatible API. - To make RL training effective, environments should incorporate features like controllable opponent skill, deterministic seeding for consistent rollouts, and multi-faceted reward functions that account for format, validity, and reasoning.
- A strategic approach to LLM training often involves an SFT "warm-up" phase to teach basic task syntax and valid moves, followed by RL to build deeper capabilities and reasoning.
Batch sizeis a critical hyperparameter in RL training; larger batch sizes generally lead to more stable learning, while smaller ones can introduce instability or model collapse if not carefully managed.- When choosing a base model for RL, consider using an instruct model over a large reasoning model, especially with limited GPU resources, as instruct models can be more efficiently transformed for specific reasoning tasks without excessive truncation.
- Continuously monitor for subtle biases within the environment (e.g., in opponent AI algorithms) which can lead to model memorization rather than true generalizable learning.
Reinforcement Learning (RL) — A machine learning paradigm where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward.
Supervised Fine-Tuning (SFT) — A training phase where a pre-trained language model is further trained on a dataset of prompt-response pairs to learn specific instructions or tasks.
Agent — In RL, the entity that perceives the environment and takes actions. In the context of LLMs, the language model itself.
Environment — In RL, the system with which the agent interacts, providing states, rewards, and responding to the agent's actions.
Reward Function — A numerical value provided by the environment to the agent, indicating the desirability of an action or state; the agent's goal is to maximize cumulative reward.
Trajectory — A sequence of states, actions, and rewards experienced by an agent during an interaction episode with an environment. Also called a rollout.
Verifiable Rewards — A paradigm in RL where the outcome of a model's action (e.g., an answer, a game win, a tool call) can be automatically checked against a ground truth or rule to generate a reward signal.
Batch Size — The number of interaction trajectories (games, episodes) used to compute the gradient and update the model's weights in a single training step.
Temperature — A parameter in LLM generation that controls the randomness or creativity of the output; higher temperature encourages more diverse and exploratory responses.
Hello everybody and welcome to LatinLampsWonder, Engineering and Enforcement Learning environments. If you want me to talk about me, I am Stefan of your Ruchi, a science software engineer. By day, I work on AI orchestration at Deepset where I develop Haystack and open source at Land Framework. By night, I love tinkering with small language models, fine-tuning and reinforcement learning. Today, I am going to talk about reinforcement learning environments for language models, evaluation and training. This has been a hot topic over the past year and I find it fascinating for several reasons. These environments let models learn by interacting, exploring and improving from feedback. They are natural gyms for LLM agents that can use tools, run code and solve multi-step tasks. In addition, startups building, parallel environments are getting measure funding and working directly with big AI labs. Recent technical reports by DeepSeek and Minimax show that they are effectively using thousands of reinforcement learning environments to improve model performance on challenging tasks and scale intelligence. Don't worry if you know nothing about parallel environments. I'll cover that soon. Here is the agenda for the talk. We'll first review classic reinforcement learning concept and see how they map to the language models domain. We'll then introduce VerifyFs, an Obersource library to build environments as software artifacts and explore some common patterns to implement them. Finally, I'll work you through an experiment where we take a small model that can barely play Tiktok to against a random player and transform it into a master using their reinforcement learning environment. Let's start. First, a quick refresher on reinforcement learning. In reinforcement learning, there are two main characters, the agent and the environment. The environment is the word the agent interrupts with. A teach step, the agent sees the current state of the word and takes an action. The state of the environment then changes in response to detection. The agent also receives a reward from the environment, number indicating how good or bad the state is. The agent's goal is to maximize its cumulative reward over time. And to do this, it has to balance exploration, so trying new actions to discover better strategies and exploitation using actions known to work. By interacting with the environment, the agent learns from experience and improves its behavior. A trajectory or rollout is the sequence of states, actions and rewards that the agent goes through while interacting with the environment. It's a record of the experience. In this presentation, I'll use trajectory to meet a complete episode like one entire game. Let's take a look at Elelem's training. Language model is a statistical model that, given some text, the prompt, returns a text completion. The start-up training recipe is divided into three phases. Pre-training on a massive amount of internet text. Here, the model learns to create text completions. The base node is knowledgeable but can follow instructions and is hardly usable in applications. During supervised point tuning on conversational examples, the model is trained to follow instructions and can learn new tasks. In the third step, reinforcement learning is often used with techniques like proximal policy optimization to align the model with human preferences. It's worth showing an example of supervised point tuning data, as later with frequently compare supervised point tuning learning with new reinforcement learning approaches. As you can see, we have pairs of prompt and responses. During this phase, the model learns by statistical limitation. It's essentially trying to make the examples provided. You might remember Ilya Satsivir talked at Neolips 2024. He pointed out that the Elelem training paradigm, we just saw, is starting to show it's limits. In particular, pre-training, the longer seems to be enough to keep improving model quality at the same rate. We needed a new way to scale. Then, open I publish to its O1 model series. In the release blog post, the mention of reinforcement learning training to make models use chain of dot effectively. The yields on their line that the performance of O1 consistently improved with more reinforcement learning, train time compute, and with more time spent thinking, test time compute. Unfortunately, they did not share many details on how this model was actually trained. The list of tipsy car 1 shed sunlight on how you can possibly achieve those results. First, they recognized that reasoning and chain of dot can improve the performance of models, but teaching this behavior to models using supervised point tuning requires creative data that is too expensive to produce at scale. They used reinforcement learning with very fiber rewards, which we've seen a moment. And deep seek also used GIPO, a new reinforcement learning algorithm that offers a simple, lighter setup compared to techniques like PPO. So what is reinforcement learning with verifiable rewards? With this paradigm, the model is asked the question and generates both a reasoning trace and an answer. The answer is then checked against the non-correct answer. The reward is used for reinforcement learning training. The underlying idea is more general and it has where the outcome can be verified automatically, like a corret answer, a one-game, a successful tool called cancered as a training similar. And this is fundamentally different from supervised point tuning. In SFT, the model learns from curated examples and its completions tend to stay close to the distribution of those examples. They are reinforcement learning with verifiable rewards. The model explores different trajectories from its pre-training and learns to favor the ones that maximize rewards. And this is exciting because the model is no longer limited by the quality of human examples through trial and error, it can discover more efficient reasoning strategies. We can finally map the classic reinforcement learning concepts to LMS. The language model acts as the agent. The environment for any task, including data, analysis and scoring rules, anything needed to check and possibly train the model on the task. From a software perspective, this marks a shift from supervised and tuning to reinforcement learning with verifiable rewards. While SFT mainly relies on conversational datasets, this new paradigm usually requires an environment, a dynamic system that the model can interact with. The definition of the agent is also expanding. Language models can now be given tools from a weather API to a terminal, and this makes environments for training the evaluation more complex and critical. To make this more concrete, consider teaching a model to play tic-tac-toe. The agent is the language model, its action is generating a text response with a specific move. The environment acts as the game engine, it handles pronging the model, tracking the worst status, generating the opponent's move and deciding when the game is over. The reward is the signal from the environment, for example, plus one for a win and zero for a loss. This reward guides the model to find winning strategies through trial and error. This setup allows the agent to discover strategies that maximize its score without needing pre-existing human examples. You have probably understood that I am enthusiastic about this topic, but let's also use André Carpace's words to describe the environment. They give the LLM an opportunity to actually interact, take actions, see outcomes. This means you can hope to do a lot better than statistical expert imitation. Now, let's see how to build these environments. To build environments as software-active X, we can use bird fires and open source library by priming the left. The bird fires provides modular components to create reinforcement learning environments for LLM agents. This can be used for both evaluation and training. Environments are Python packages that can be easily installed and distributed. The library provides space classes for several set ups, single-term environments with just one interaction between the model and the end, multi-tourne environments, two environments where the model is equipped with tools and several others. It also includes abstractions for parsing model responses and defining reward functions. Verifier abstracts model serving. It expects an open AI compatible API endpoint so you can plug in open AI, open router or local models via the LLM. It enables async interaction and parallel trajectories so you focus on the environment logic. For training, verifiers come with its own trainer and integrates with other frameworks such as Pribarral, Tinker and Skyral. In short, verifiers let's us focus on the task and the rewards greater than the infrastructure. Let's start with a single-term environment. Vigorous text is a simple environment to evaluate or train language models on their ability to reverse a string of text. What's going on here? LLM environment is the entry point for every verifier's environment. It contains all the set up logic. First, the dataset is loaded and marked. The default dataset contains a thousand text paragraphs stored in the prompt column. During the mapline step, we just store the original dataset. Question is the text paragraph, while answer is the reversing text. Next, examine parser is initialized. It extracts the text inside the reversing text as specified in the system prompt. We then define the reward function. This compares the model's output to the ground truth and returns a longest common subsequent ratio. Finally, we mando this into a rubric, a collection of weighted rewards and initialize the single-term end. But how does this come to light? Let me show you an evaluation run. Here is what happens under the root. The environment is invoked. Five examples are taken from the dataset and each example is used three times for three roll outs. Each roll outs gets the same question, but may produce different completions due to modern randomness. We have 15 roll outs in total. For each roll out, a conversation is prepared with the system prompt and the question. The conversation is to the model. The model generates a response. The response is parsed, the script in the answer. The reward is computed and results are saved. At the end of the evaluation, you get summary statistics and additionally, for about reward distribution. Training follows the same core mechanism with the additional step of updating model parameters. We look at that in more detail later. Let's look at the different examples from various files, the double check environment. Here, the model answers a marked question and the environment then asks, are you sure? This is a multi-turn environment, similar in speed to Tiktato, in which each trajectory involves multiple interaction between the model and environment. Let's look at what multi-turn and introduces. State makes its first appearance here. It's a dictionary that tracks information during a roll out. We can set the initial state through the setup state method, not used in this example. At the response, instead of the interaction ending after one turn, the environment can reply with list of messages. Here, it just says, are you sure? In more complex environments, the response can be dynamically generated based on this state. The fifth stop, the curator marks a method as a stopping condition. This method runs at every turn of the agent environment interaction. Once it returns through, the roll out terminates. Here, the hood verifiers runs a loop in which the model and environment take turns, exchanging messages, updating shared state until a stopping condition is met and the full trajectory can be evaluated. Another interesting type of environment is the tool environment. All environment types in their fires are built on multi-turn and which implements the core single agent roll out loop. To land at stool-colling to this foundation, as you can see, tools are defined as python function. During roll-outs, the model can call tools, receive results and continue reasoning until it produces a response without tool goals. Each turn consists of a model response followed by the environment's tool execution. For a more realistic example, I recommend checking out the weekly search environment. Beyond the fundamental environments I just showed, verifiers provides more abstraction to build on environments. The NCPM environment automatically connects to model context protocols, servers to expose their tools. An important tool to land is for tools that need the paralleled, persisted state like database connection or session ID. There is also a class implementing recursive language models, and all the idea you might have heard of. It's an inference strategy where language models can decompose and recursively interact with input context or unbounded land through RAPL environments. Verifiers also plays well with others, integrating with several third-party environment libraries. Verifiers is tightly integrated with the environments hub, a community space for sharing these RL environments. Verifiers and environments are different phases of the same event. They aim to fight environment fragmentation. To often, environments are looking into specific trainings dark, making them difficult to reuse. And as a market for closed source environments emerges, these open initiatives ensure we have the robust alternative. We don't want open source models to lag behind just because they lack the right playground training. Plus, beyond the serious side, it just found to explore the hub and see what people are building. Now, let's move on to TicTacTo. We'll use Verifiers to create a TicTacTo environment for training and evaluating language models on this game. Now, why TicTacTo? It's a simple game, but requires multi-term interaction and capturing its dynamics with the static dataset is challenging. Despite its one-state space and the deterministic solutions, more language models often struggle with it. Let's see if reinforcement learning cannot preach that gap. It's best to start with a simple version, run evaluation to verify it works and then iterate. To start, we make a few assumptions. The model always plays SX and goes first. It must output an envelope between 0 and 8 inside Move tags and the opponent just plays randomly. In a load environment, we create a dataset containing the initial user message that starts each game. For each rollout, setup state populates the state digitally with information used and updating during the game, such as the board and the winner. And for response, contains the core game logic. It parses the model last move. Check if it's valid, invite moves are an immediate loss for now, and applies it to the board. Then it applies a random point move and checks for a win or throw. If the game isn't finished, it turns a user message with the current board state and lasts for the next move. We use truly word function, win reward function, win weight 1, and form it, reward function, win weight 0.2, which rewards the model for respecting the XML format. We can now make our environment more flexible, realistic and suitable for both evaluation and training. I made some of these improvements readily while it didn't worth during training. First, we want the model to sometimes play first and sometimes play second. Let's now address opponent's skill. Always playing against a random opponent isn't realistic, so we introduce an optimal opponent using the minimap algorithm. Against this opponent, a draw is the best achievable word outcome. However, for training, we want the opponent's skill to be controllable. If the opponent is too perfect, too early, the model might never say win and fail to learn. We can do so by introducing a probability for the opponent to choose a random move instead of the optimal one. In a lot of environments, we introduce mean random move prob and mass random move prob varying from 0 to 1. If we set both to 0, all games will be against an optimal opponent. If we set both to 1, all games will be against a random opponent. Using these parameters allows us to control the opponent's skill across all games. For different roll outs originating from the same dataset example, the opponent will always have the same probability of choosing random moves, ensuring fair comparison. Now about reasoning. It's common to ask models to produce a think in trace before the final answer. It can improve performance at in-frame time, but it's also instrumental to make models better during training. We define a new form of reward function using a regular expression to also check the presence of think tags. It's now covered in very moves, but experimenting with more open models I observed many of them. Sometimes the outperformed was incorrect, sometimes the chosen cell was occupied and in the game immediately is harsh. It might stop smaller models from getting a useful learning signal. Instead, we now let the game continue and apply a flat minus 0.1 penalty, capping the third and eighth. Let's discuss reducing noise in group-based reinforcement learning. In GARPO learning, we compare cellular roll outs from the same starting point to see which ones to reinforce based on rewards. For this work, differences in rewards should come from all the model placed, not from environment randomness. And how can we reduce noise in this setup? We set an example seat for each example in the data set to select the starting player. Then, for each turn, we write a specific turn seat based on the example seat and worst state. This guarantees that if the two roll outs reach the same board position, the point will always respond the same way. Last point, reducing noise across batches. For our training, the batch size is the number of games taken into consideration before the models where are updated. In our setup, the opponent's case varies across the data set according to mean random move prop and max random move prop. If we train with the small batch size and the random move probability is not fixed, we might sample a batch in which many opponents are art or many of these. This causes the average reward to rotate a lot, making training unstable. To fight this, I added the stratified sampling. This forces every batch to contain a perfectly balanced mix of four-pointed difficulty spanning the chosen range. I know this slide is dense, but you can find all code and more details in the GitHub repository. Trying to evaluate existing models, we choose GPT-5 Mini and LFM 2 by LiquidAI is more fast open model. Using various evaluating models just requires a few comments. Along with some statistical variability, GPT-5 Mini is excellent at following format and is a good tiktok player, but not perfect. The small open model by LiquidAI striples to follow format and to make valid moves. It's a weak tiktok player, sometimes winning against a random opponent, but rarely surviving against an optimal one. There is a significant gap. We decide to train LFM 2 for some reasons. It's a good model for its size and it's an extra model, ideal for transforming it into reasoning model. How can we improve it? We saw that this model striples to follow format and often provides invalid moves. We can use supervised by tooling for a warm up phase where we teach the model the format and valid move syntax. We can then use reinforcement learning to build deeper capabilities. The first step is generating synthetic data for supervised by tooling. Once you have a good environment, the rating data requires a single comment. Here we use GPT-5 Mini since it follows format perfectly. And we don't need many examples. We generate 200 and filter out closing games to avoid making ins about mass strategies. With this synthetic data attend, we can clearly spin up a supervised tooling run using prime array. In this example, I am using the N96 Gigabytes GPU but you can use a smaller one. Training requires only a few minutes. Time to evaluate our fine tuned model. Compared to the original model, it learned format almost perfectly and reduced the number of invades moves. It also improved the game performance but there is still significant work to do. For example, into our training, let's do a quick recap of group relative policy optimization applied to Tiktok. Robots starting from the same initial board, the model plays several games via LLN sampling. Each route is evaluated using the termistic reward function. In our case, win format and invalid move. Another range score is calculated across the group of roll outs. And each roll out is then compared against this average advantage computation. The model is updated to favor trajectories that did better than the group baseline. We'll use CISPO which is an improvement over GIPO. For reinforcement learning training, I used Verifier's REL simple trainer. Yeah, we use a GPU for inference and a GPU for training. Let's comment some parameters in the training configuration. In this training ground, random move probability is ranging from 20 to 70%, no purely random players and no optimal players. It's a good playground to get signal and learn both attack and events. And no groups parameter is used to set up stratified samples. When comes to trainer arguments, we want our model to learn stably while fuel utilizing our GPUs without crashing. For tips on how to use the GPU without going out of memory, I recommend checking out the GIPO. Yeah, I want to stress that reinforcement learning training is sensitive to either parameters and can be unstable. I heard the earthquake that batch size is a key parameter. In this environment, I was certainly unstable training and model collapse and experimented with various lower than 256. Explanation is intuitive. Batch size is the number of games using to update the model's weights. If this number is low, this means learning to play from a very small number of matches and opponent types at once. And this likely leads to subordinate strategies. Let's take a look at training plots. We are in a war function and the total reward constantly improved. Format reward function was already near perfect and did not change significantly. Invite move the penalty function, start it well and converge it to zero towards the end of the technique. It's a good training run, but let's run proper evaluation. Impressive. Thanks to reinforcement learning, our model has become a very competitive tactical player. It dominates random players and draws 85% of the time against an optimal opponent. Invite moves at dropped to near zero. These results are already satisfying. And one can say that not much more we can learn from this example. I am a perfectionist and I'd like to perfect the eye player. In spite of the rollouts, I found some recurrent fatal mods, in particular our model sometimes falls into four traps. This example shows the end of the game. Our model is not playing badly here, but it had already lost the game by allowing the opponent to have two in it. That's it. It's impossible to use our RL environment to push our model farther toward perfection. That's right. We have to make some changes. In this run, I used the bigger GPUs just to experiment quickly. It's probably not required. First, let's discuss opponent's skill. We increase opponent's skill by setting the probability of a random move to range from zero percent to 25 percent. I also tried making the model play against the perfect opponent solely, but it didn't work. The model became overly defensive and failed to exploit errors when testing the game around the players. But how to make our model explore beyond learning strategies? I made several experiments where this model failed to improve and forget the sub-automatic strategies. We want the model to experiment with new approaches, and the temperature is the right parameter to tweak. But this is a bit risky. If the temperature is too high, the model can start generating heat pressure. Let's try. Things get really interesting. There was a significant initial drop in the win-reward function and total reward. I doubt that this has an exploratory phase where the model tried new and random strategies, which after four months at first. But over time, it's recovered and improved to new eyes. Also, four-match reward function and invalid move penalty function at an initial drop, but overall, always stayed around their maximum values. Let's move to proper evaluation. Oh, we finally got a TITTA to master, but why not playing again? I guess it. OK, let's make not the perfect move. We now need to block the model. Let's block it again. Oh no, we lost. It could be interesting now to compare our model performance with GIGGY 5 Mini. The TITTA model we used to generate our synthetic data. Against the random opponent, performance is very similar. Let's see against an optimal opponent. Oh, in this case, our model is superior. This is a very nice achievement. To get these results, I went through several failed experiments and I'd like to share the findings with you. First of all, batch size. If this value is large, yes, your model apparently learns slowly, but in exchange, you get stable training. If the batch size is small, your environment produces diverse matches and opponent skills, the model we learned from is more number of games at once. This can reinforce sub-autma strategies and you may observe unstable training or model collapse. Second lesson. Watched for item biases in environments. Let me split. In a previous experiment, I used a different Mini Max algorithm for the optimal opponent. I thought this was an implementation data and let the load handle it. I got great benchmark results, but then playing against the model, I realized it was too loose. Looking better Mini Max, these are the bias. With multiple moves at the same optimal scores, the first preposition was always selected. I basically was training my model against a specific type of optimal player. Over many games, the model simply memorized it. Now about model choice. You can't start from a model, it is already trained for reasoning, but they tend to output long thinking traces. If you have limited GPU resources and time, you may end up truncating most of the longer completion at the beginning to fit the short limits. This means wasted budget and also the risk of damaging the model's intelligence. It might make more sense to start from an instruct model and transform it into a reasoning model for your task. Another point about models, it could be hard to push very small models to competency. This of course depends on the task. And I recommend, evaluate the base model in your environment, look at a few completions, choose a model that shows promising the areas, even if the numbers are not satisfying yet. In general, it is always a good idea to expect some role as to see the model evolve. Also after training, do not stop at programmatic evaluation, try the model in the real task. Final recommendation is not right to watch your logs and plots when you start training to identify early out of memory errors or instability. It's difficult for me to, but one training begins well as it's just stopping staying in plots for a while. Reenforcement learning is low and takes time to see progress. If you continually monitor it, you list the tendation to stop it and tweak something prematurely. While its glory progress in Ra can be a surprising good well given enough time. So start training and go for a work. During this presentation, we mappled the reinforcement learning concept to the language models domain. Then I introduced various fires and open source library to build environments as software activists. Finally, I walked you through my experiments where I took a small model and turned it into a tick-tock to master using supervised meant tuning and reinforcement learning with verified or rewards. We did not just show the model out to play, we gave it a space to play and guided it through rewards. Nowadays, reinforcement learning complements supervised meant tuning in language models for training. You can do this a tone 2. If you can define a clear reward signal, you can build an environment and train a small specialized model to build a large closed model on a specific task at a fraction of the cost. I want to leave you with a few ideas and resources on this topic. All what I share out today can be found in my free LLM RL Environment link course where I go deep at explaining the day. Take a look and give a start. To figure out what others have been building, explore the environment's hub. And what to be in this? Something I am very excited about is this. Train a small language models on two three tools you often use and try to perform a large model in that specific task. I recommend the Wiki search example in the priming RL repository as a static point. Thank you.
TL;DR
- Reinforcement Learning (RL) environments offer a dynamic training paradigm for Language Models (LLMs), allowing them to learn by interacting, exploring, and improving from direct feedback. This approach helps overcome the limitations of traditional static dataset-based training.
- The
Verifiersopen-source library provides a modular framework for building these RL environments, abstracting away complex infrastructure to let developers focus on task logic and reward design. - By leveraging RL with verifiable rewards, even a small, initially underperforming LLM can be transformed into a highly capable agent for specific tasks, demonstrating mastery and outperforming larger models.
Takeaways
- Reinforcement Learning (RL) with verifiable rewards allows LLMs to explore diverse trajectories and discover efficient reasoning strategies, unlike Supervised Fine-Tuning (SFT) which is limited by the distribution of human-curated examples.
- The
Verifierslibrary simplifies environment creation by offering base classes for single-turn, multi-turn, and tool-enabled LLM interactions, supporting both evaluation and training through an OpenAI-compatible API. - To make RL training effective, environments should incorporate features like controllable opponent skill, deterministic seeding for consistent rollouts, and multi-faceted reward functions that account for format, validity, and reasoning.
- A strategic approach to LLM training often involves an SFT "warm-up" phase to teach basic task syntax and valid moves, followed by RL to build deeper capabilities and reasoning.
Batch sizeis a critical hyperparameter in RL training; larger batch sizes generally lead to more stable learning, while smaller ones can introduce instability or model collapse if not carefully managed.- When choosing a base model for RL, consider using an instruct model over a large reasoning model, especially with limited GPU resources, as instruct models can be more efficiently transformed for specific reasoning tasks without excessive truncation.
- Continuously monitor for subtle biases within the environment (e.g., in opponent AI algorithms) which can lead to model memorization rather than true generalizable learning.
Vocabulary
Reinforcement Learning (RL) — A machine learning paradigm where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward.
Supervised Fine-Tuning (SFT) — A training phase where a pre-trained language model is further trained on a dataset of prompt-response pairs to learn specific instructions or tasks.
Agent — In RL, the entity that perceives the environment and takes actions. In the context of LLMs, the language model itself.
Environment — In RL, the system with which the agent interacts, providing states, rewards, and responding to the agent's actions.
Reward Function — A numerical value provided by the environment to the agent, indicating the desirability of an action or state; the agent's goal is to maximize cumulative reward.
Trajectory — A sequence of states, actions, and rewards experienced by an agent during an interaction episode with an environment. Also called a rollout.
Verifiable Rewards — A paradigm in RL where the outcome of a model's action (e.g., an answer, a game win, a tool call) can be automatically checked against a ground truth or rule to generate a reward signal.
Batch Size — The number of interaction trajectories (games, episodes) used to compute the gradient and update the model's weights in a single training step.
Temperature — A parameter in LLM generation that controls the randomness or creativity of the output; higher temperature encourages more diverse and exploratory responses.
Transcript
Hello everybody and welcome to LatinLampsWonder, Engineering and Enforcement Learning environments. If you want me to talk about me, I am Stefan of your Ruchi, a science software engineer. By day, I work on AI orchestration at Deepset where I develop Haystack and open source at Land Framework. By night, I love tinkering with small language models, fine-tuning and reinforcement learning. Today, I am going to talk about reinforcement learning environments for language models, evaluation and training. This has been a hot topic over the past year and I find it fascinating for several reasons. These environments let models learn by interacting, exploring and improving from feedback. They are natural gyms for LLM agents that can use tools, run code and solve multi-step tasks. In addition, startups building, parallel environments are getting measure funding and working directly with big AI labs. Recent technical reports by DeepSeek and Minimax show that they are effectively using thousands of reinforcement learning environments to improve model performance on challenging tasks and scale intelligence. Don't worry if you know nothing about parallel environments. I'll cover that soon. Here is the agenda for the talk. We'll first review classic reinforcement learning concept and see how they map to the language models domain. We'll then introduce VerifyFs, an Obersource library to build environments as software artifacts and explore some common patterns to implement them. Finally, I'll work you through an experiment where we take a small model that can barely play Tiktok to against a random player and transform it into a master using their reinforcement learning environment. Let's start. First, a quick refresher on reinforcement learning. In reinforcement learning, there are two main characters, the agent and the environment. The environment is the word the agent interrupts with. A teach step, the agent sees the current state of the word and takes an action. The state of the environment then changes in response to detection. The agent also receives a reward from the environment, number indicating how good or bad the state is. The agent's goal is to maximize its cumulative reward over time. And to do this, it has to balance exploration, so trying new actions to discover better strategies and exploitation using actions known to work. By interacting with the environment, the agent learns from experience and improves its behavior. A trajectory or rollout is the sequence of states, actions and rewards that the agent goes through while interacting with the environment. It's a record of the experience. In this presentation, I'll use trajectory to meet a complete episode like one entire game. Let's take a look at Elelem's training. Language model is a statistical model that, given some text, the prompt, returns a text completion. The start-up training recipe is divided into three phases. Pre-training on a massive amount of internet text. Here, the model learns to create text completions. The base node is knowledgeable but can follow instructions and is hardly usable in applications. During supervised point tuning on conversational examples, the model is trained to follow instructions and can learn new tasks. In the third step, reinforcement learning is often used with techniques like proximal policy optimization to align the model with human preferences. It's worth showing an example of supervised point tuning data, as later with frequently compare supervised point tuning learning with new reinforcement learning approaches. As you can see, we have pairs of prompt and responses. During this phase, the model learns by statistical limitation. It's essentially trying to make the examples provided. You might remember Ilya Satsivir talked at Neolips 2024. He pointed out that the Elelem training paradigm, we just saw, is starting to show it's limits. In particular, pre-training, the longer seems to be enough to keep improving model quality at the same rate. We needed a new way to scale. Then, open I publish to its O1 model series. In the release blog post, the mention of reinforcement learning training to make models use chain of dot effectively. The yields on their line that the performance of O1 consistently improved with more reinforcement learning, train time compute, and with more time spent thinking, test time compute. Unfortunately, they did not share many details on how this model was actually trained. The list of tipsy car 1 shed sunlight on how you can possibly achieve those results. First, they recognized that reasoning and chain of dot can improve the performance of models, but teaching this behavior to models using supervised point tuning requires creative data that is too expensive to produce at scale. They used reinforcement learning with very fiber rewards, which we've seen a moment. And deep seek also used GIPO, a new reinforcement learning algorithm that offers a simple, lighter setup compared to techniques like PPO. So what is reinforcement learning with verifiable rewards? With this paradigm, the model is asked the question and generates both a reasoning trace and an answer. The answer is then checked against the non-correct answer. The reward is used for reinforcement learning training. The underlying idea is more general and it has where the outcome can be verified automatically, like a corret answer, a one-game, a successful tool called cancered as a training similar. And this is fundamentally different from supervised point tuning. In SFT, the model learns from curated examples and its completions tend to stay close to the distribution of those examples. They are reinforcement learning with verifiable rewards. The model explores different trajectories from its pre-training and learns to favor the ones that maximize rewards. And this is exciting because the model is no longer limited by the quality of human examples through trial and error, it can discover more efficient reasoning strategies. We can finally map the classic reinforcement learning concepts to LMS. The language model acts as the agent. The environment for any task, including data, analysis and scoring rules, anything needed to check and possibly train the model on the task. From a software perspective, this marks a shift from supervised and tuning to reinforcement learning with verifiable rewards. While SFT mainly relies on conversational datasets, this new paradigm usually requires an environment, a dynamic system that the model can interact with. The definition of the agent is also expanding. Language models can now be given tools from a weather API to a terminal, and this makes environments for training the evaluation more complex and critical. To make this more concrete, consider teaching a model to play tic-tac-toe. The agent is the language model, its action is generating a text response with a specific move. The environment acts as the game engine, it handles pronging the model, tracking the worst status, generating the opponent's move and deciding when the game is over. The reward is the signal from the environment, for example, plus one for a win and zero for a loss. This reward guides the model to find winning strategies through trial and error. This setup allows the agent to discover strategies that maximize its score without needing pre-existing human examples. You have probably understood that I am enthusiastic about this topic, but let's also use André Carpace's words to describe the environment. They give the LLM an opportunity to actually interact, take actions, see outcomes. This means you can hope to do a lot better than statistical expert imitation. Now, let's see how to build these environments. To build environments as software-active X, we can use bird fires and open source library by priming the left. The bird fires provides modular components to create reinforcement learning environments for LLM agents. This can be used for both evaluation and training. Environments are Python packages that can be easily installed and distributed. The library provides space classes for several set ups, single-term environments with just one interaction between the model and the end, multi-tourne environments, two environments where the model is equipped with tools and several others. It also includes abstractions for parsing model responses and defining reward functions. Verifier abstracts model serving. It expects an open AI compatible API endpoint so you can plug in open AI, open router or local models via the LLM. It enables async interaction and parallel trajectories so you focus on the environment logic. For training, verifiers come with its own trainer and integrates with other frameworks such as Pribarral, Tinker and Skyral. In short, verifiers let's us focus on the task and the rewards greater than the infrastructure. Let's start with a single-term environment. Vigorous text is a simple environment to evaluate or train language models on their ability to reverse a string of text. What's going on here? LLM environment is the entry point for every verifier's environment. It contains all the set up logic. First, the dataset is loaded and marked. The default dataset contains a thousand text paragraphs stored in the prompt column. During the mapline step, we just store the original dataset. Question is the text paragraph, while answer is the reversing text. Next, examine parser is initialized. It extracts the text inside the reversing text as specified in the system prompt. We then define the reward function. This compares the model's output to the ground truth and returns a longest common subsequent ratio. Finally, we mando this into a rubric, a collection of weighted rewards and initialize the single-term end. But how does this come to light? Let me show you an evaluation run. Here is what happens under the root. The environment is invoked. Five examples are taken from the dataset and each example is used three times for three roll outs. Each roll outs gets the same question, but may produce different completions due to modern randomness. We have 15 roll outs in total. For each roll out, a conversation is prepared with the system prompt and the question. The conversation is to the model. The model generates a response. The response is parsed, the script in the answer. The reward is computed and results are saved. At the end of the evaluation, you get summary statistics and additionally, for about reward distribution. Training follows the same core mechanism with the additional step of updating model parameters. We look at that in more detail later. Let's look at the different examples from various files, the double check environment. Here, the model answers a marked question and the environment then asks, are you sure? This is a multi-turn environment, similar in speed to Tiktato, in which each trajectory involves multiple interaction between the model and environment. Let's look at what multi-turn and introduces. State makes its first appearance here. It's a dictionary that tracks information during a roll out. We can set the initial state through the setup state method, not used in this example. At the response, instead of the interaction ending after one turn, the environment can reply with list of messages. Here, it just says, are you sure? In more complex environments, the response can be dynamically generated based on this state. The fifth stop, the curator marks a method as a stopping condition. This method runs at every turn of the agent environment interaction. Once it returns through, the roll out terminates. Here, the hood verifiers runs a loop in which the model and environment take turns, exchanging messages, updating shared state until a stopping condition is met and the full trajectory can be evaluated. Another interesting type of environment is the tool environment. All environment types in their fires are built on multi-turn and which implements the core single agent roll out loop. To land at stool-colling to this foundation, as you can see, tools are defined as python function. During roll-outs, the model can call tools, receive results and continue reasoning until it produces a response without tool goals. Each turn consists of a model response followed by the environment's tool execution. For a more realistic example, I recommend checking out the weekly search environment. Beyond the fundamental environments I just showed, verifiers provides more abstraction to build on environments. The NCPM environment automatically connects to model context protocols, servers to expose their tools. An important tool to land is for tools that need the paralleled, persisted state like database connection or session ID. There is also a class implementing recursive language models, and all the idea you might have heard of. It's an inference strategy where language models can decompose and recursively interact with input context or unbounded land through RAPL environments. Verifiers also plays well with others, integrating with several third-party environment libraries. Verifiers is tightly integrated with the environments hub, a community space for sharing these RL environments. Verifiers and environments are different phases of the same event. They aim to fight environment fragmentation. To often, environments are looking into specific trainings dark, making them difficult to reuse. And as a market for closed source environments emerges, these open initiatives ensure we have the robust alternative. We don't want open source models to lag behind just because they lack the right playground training. Plus, beyond the serious side, it just found to explore the hub and see what people are building. Now, let's move on to TicTacTo. We'll use Verifiers to create a TicTacTo environment for training and evaluating language models on this game. Now, why TicTacTo? It's a simple game, but requires multi-term interaction and capturing its dynamics with the static dataset is challenging. Despite its one-state space and the deterministic solutions, more language models often struggle with it. Let's see if reinforcement learning cannot preach that gap. It's best to start with a simple version, run evaluation to verify it works and then iterate. To start, we make a few assumptions. The model always plays SX and goes first. It must output an envelope between 0 and 8 inside Move tags and the opponent just plays randomly. In a load environment, we create a dataset containing the initial user message that starts each game. For each rollout, setup state populates the state digitally with information used and updating during the game, such as the board and the winner. And for response, contains the core game logic. It parses the model last move. Check if it's valid, invite moves are an immediate loss for now, and applies it to the board. Then it applies a random point move and checks for a win or throw. If the game isn't finished, it turns a user message with the current board state and lasts for the next move. We use truly word function, win reward function, win weight 1, and form it, reward function, win weight 0.2, which rewards the model for respecting the XML format. We can now make our environment more flexible, realistic and suitable for both evaluation and training. I made some of these improvements readily while it didn't worth during training. First, we want the model to sometimes play first and sometimes play second. Let's now address opponent's skill. Always playing against a random opponent isn't realistic, so we introduce an optimal opponent using the minimap algorithm. Against this opponent, a draw is the best achievable word outcome. However, for training, we want the opponent's skill to be controllable. If the opponent is too perfect, too early, the model might never say win and fail to learn. We can do so by introducing a probability for the opponent to choose a random move instead of the optimal one. In a lot of environments, we introduce mean random move prob and mass random move prob varying from 0 to 1. If we set both to 0, all games will be against an optimal opponent. If we set both to 1, all games will be against a random opponent. Using these parameters allows us to control the opponent's skill across all games. For different roll outs originating from the same dataset example, the opponent will always have the same probability of choosing random moves, ensuring fair comparison. Now about reasoning. It's common to ask models to produce a think in trace before the final answer. It can improve performance at in-frame time, but it's also instrumental to make models better during training. We define a new form of reward function using a regular expression to also check the presence of think tags. It's now covered in very moves, but experimenting with more open models I observed many of them. Sometimes the outperformed was incorrect, sometimes the chosen cell was occupied and in the game immediately is harsh. It might stop smaller models from getting a useful learning signal. Instead, we now let the game continue and apply a flat minus 0.1 penalty, capping the third and eighth. Let's discuss reducing noise in group-based reinforcement learning. In GARPO learning, we compare cellular roll outs from the same starting point to see which ones to reinforce based on rewards. For this work, differences in rewards should come from all the model placed, not from environment randomness. And how can we reduce noise in this setup? We set an example seat for each example in the data set to select the starting player. Then, for each turn, we write a specific turn seat based on the example seat and worst state. This guarantees that if the two roll outs reach the same board position, the point will always respond the same way. Last point, reducing noise across batches. For our training, the batch size is the number of games taken into consideration before the models where are updated. In our setup, the opponent's case varies across the data set according to mean random move prop and max random move prop. If we train with the small batch size and the random move probability is not fixed, we might sample a batch in which many opponents are art or many of these. This causes the average reward to rotate a lot, making training unstable. To fight this, I added the stratified sampling. This forces every batch to contain a perfectly balanced mix of four-pointed difficulty spanning the chosen range. I know this slide is dense, but you can find all code and more details in the GitHub repository. Trying to evaluate existing models, we choose GPT-5 Mini and LFM 2 by LiquidAI is more fast open model. Using various evaluating models just requires a few comments. Along with some statistical variability, GPT-5 Mini is excellent at following format and is a good tiktok player, but not perfect. The small open model by LiquidAI striples to follow format and to make valid moves. It's a weak tiktok player, sometimes winning against a random opponent, but rarely surviving against an optimal one. There is a significant gap. We decide to train LFM 2 for some reasons. It's a good model for its size and it's an extra model, ideal for transforming it into reasoning model. How can we improve it? We saw that this model striples to follow format and often provides invalid moves. We can use supervised by tooling for a warm up phase where we teach the model the format and valid move syntax. We can then use reinforcement learning to build deeper capabilities. The first step is generating synthetic data for supervised by tooling. Once you have a good environment, the rating data requires a single comment. Here we use GPT-5 Mini since it follows format perfectly. And we don't need many examples. We generate 200 and filter out closing games to avoid making ins about mass strategies. With this synthetic data attend, we can clearly spin up a supervised tooling run using prime array. In this example, I am using the N96 Gigabytes GPU but you can use a smaller one. Training requires only a few minutes. Time to evaluate our fine tuned model. Compared to the original model, it learned format almost perfectly and reduced the number of invades moves. It also improved the game performance but there is still significant work to do. For example, into our training, let's do a quick recap of group relative policy optimization applied to Tiktok. Robots starting from the same initial board, the model plays several games via LLN sampling. Each route is evaluated using the termistic reward function. In our case, win format and invalid move. Another range score is calculated across the group of roll outs. And each roll out is then compared against this average advantage computation. The model is updated to favor trajectories that did better than the group baseline. We'll use CISPO which is an improvement over GIPO. For reinforcement learning training, I used Verifier's REL simple trainer. Yeah, we use a GPU for inference and a GPU for training. Let's comment some parameters in the training configuration. In this training ground, random move probability is ranging from 20 to 70%, no purely random players and no optimal players. It's a good playground to get signal and learn both attack and events. And no groups parameter is used to set up stratified samples. When comes to trainer arguments, we want our model to learn stably while fuel utilizing our GPUs without crashing. For tips on how to use the GPU without going out of memory, I recommend checking out the GIPO. Yeah, I want to stress that reinforcement learning training is sensitive to either parameters and can be unstable. I heard the earthquake that batch size is a key parameter. In this environment, I was certainly unstable training and model collapse and experimented with various lower than 256. Explanation is intuitive. Batch size is the number of games using to update the model's weights. If this number is low, this means learning to play from a very small number of matches and opponent types at once. And this likely leads to subordinate strategies. Let's take a look at training plots. We are in a war function and the total reward constantly improved. Format reward function was already near perfect and did not change significantly. Invite move the penalty function, start it well and converge it to zero towards the end of the technique. It's a good training run, but let's run proper evaluation. Impressive. Thanks to reinforcement learning, our model has become a very competitive tactical player. It dominates random players and draws 85% of the time against an optimal opponent. Invite moves at dropped to near zero. These results are already satisfying. And one can say that not much more we can learn from this example. I am a perfectionist and I'd like to perfect the eye player. In spite of the rollouts, I found some recurrent fatal mods, in particular our model sometimes falls into four traps. This example shows the end of the game. Our model is not playing badly here, but it had already lost the game by allowing the opponent to have two in it. That's it. It's impossible to use our RL environment to push our model farther toward perfection. That's right. We have to make some changes. In this run, I used the bigger GPUs just to experiment quickly. It's probably not required. First, let's discuss opponent's skill. We increase opponent's skill by setting the probability of a random move to range from zero percent to 25 percent. I also tried making the model play against the perfect opponent solely, but it didn't work. The model became overly defensive and failed to exploit errors when testing the game around the players. But how to make our model explore beyond learning strategies? I made several experiments where this model failed to improve and forget the sub-automatic strategies. We want the model to experiment with new approaches, and the temperature is the right parameter to tweak. But this is a bit risky. If the temperature is too high, the model can start generating heat pressure. Let's try. Things get really interesting. There was a significant initial drop in the win-reward function and total reward. I doubt that this has an exploratory phase where the model tried new and random strategies, which after four months at first. But over time, it's recovered and improved to new eyes. Also, four-match reward function and invalid move penalty function at an initial drop, but overall, always stayed around their maximum values. Let's move to proper evaluation. Oh, we finally got a TITTA to master, but why not playing again? I guess it. OK, let's make not the perfect move. We now need to block the model. Let's block it again. Oh no, we lost. It could be interesting now to compare our model performance with GIGGY 5 Mini. The TITTA model we used to generate our synthetic data. Against the random opponent, performance is very similar. Let's see against an optimal opponent. Oh, in this case, our model is superior. This is a very nice achievement. To get these results, I went through several failed experiments and I'd like to share the findings with you. First of all, batch size. If this value is large, yes, your model apparently learns slowly, but in exchange, you get stable training. If the batch size is small, your environment produces diverse matches and opponent skills, the model we learned from is more number of games at once. This can reinforce sub-autma strategies and you may observe unstable training or model collapse. Second lesson. Watched for item biases in environments. Let me split. In a previous experiment, I used a different Mini Max algorithm for the optimal opponent. I thought this was an implementation data and let the load handle it. I got great benchmark results, but then playing against the model, I realized it was too loose. Looking better Mini Max, these are the bias. With multiple moves at the same optimal scores, the first preposition was always selected. I basically was training my model against a specific type of optimal player. Over many games, the model simply memorized it. Now about model choice. You can't start from a model, it is already trained for reasoning, but they tend to output long thinking traces. If you have limited GPU resources and time, you may end up truncating most of the longer completion at the beginning to fit the short limits. This means wasted budget and also the risk of damaging the model's intelligence. It might make more sense to start from an instruct model and transform it into a reasoning model for your task. Another point about models, it could be hard to push very small models to competency. This of course depends on the task. And I recommend, evaluate the base model in your environment, look at a few completions, choose a model that shows promising the areas, even if the numbers are not satisfying yet. In general, it is always a good idea to expect some role as to see the model evolve. Also after training, do not stop at programmatic evaluation, try the model in the real task. Final recommendation is not right to watch your logs and plots when you start training to identify early out of memory errors or instability. It's difficult for me to, but one training begins well as it's just stopping staying in plots for a while. Reenforcement learning is low and takes time to see progress. If you continually monitor it, you list the tendation to stop it and tweak something prematurely. While its glory progress in Ra can be a surprising good well given enough time. So start training and go for a work. During this presentation, we mappled the reinforcement learning concept to the language models domain. Then I introduced various fires and open source library to build environments as software activists. Finally, I walked you through my experiments where I took a small model and turned it into a tick-tock to master using supervised meant tuning and reinforcement learning with verified or rewards. We did not just show the model out to play, we gave it a space to play and guided it through rewards. Nowadays, reinforcement learning complements supervised meant tuning in language models for training. You can do this a tone 2. If you can define a clear reward signal, you can build an environment and train a small specialized model to build a large closed model on a specific task at a fraction of the cost. I want to leave you with a few ideas and resources on this topic. All what I share out today can be found in my free LLM RL Environment link course where I go deep at explaining the day. Take a look and give a start. To figure out what others have been building, explore the environment's hub. And what to be in this? Something I am very excited about is this. Train a small language models on two three tools you often use and try to perform a large model in that specific task. I recommend the Wiki search example in the priming RL repository as a static point. Thank you.