What Do Models Still Suck At? - Peter Gostev, Arena.ai, BullshitBench

TL;DR

  • While public benchmarks show continuous improvement in LLM performance, these often represent narrow tasks that don't reflect real-world utility or user satisfaction.
  • Many leading models still struggle significantly with fundamental challenges like identifying and pushing back against nonsensical queries, often attempting to solve them regardless.
  • User-generated data reveals a persistent "dissatisfaction rate" of around 9% even among top models, and limited progress in complex "expert" domains like medicine, finance, law, or creative fields like game design.

Takeaways

  • Question "line goes up" benchmarks: Traditional benchmarks often measure performance on specific, narrow tasks, which may not accurately represent an LLM's broader capabilities or reliability in complex, real-world scenarios.
  • Implement "nonsense pushback" tests: Evaluate models on their ability to identify and refuse to answer nonsensical or ill-posed questions, rather than attempting a flawed solution. Many current top models frequently fail this test.
  • Reasoning can be detrimental: Activating "high reasoning" capabilities in LLMs does not always improve performance; it can sometimes make responses worse, especially for ambiguous or nonsensical prompts.
  • Monitor user dissatisfaction rates: Beyond quantitative benchmarks, track real-world user feedback, such as "dissatisfaction rates" from head-to-head model comparisons, to gauge practical utility and identify persistent performance gaps.
  • Domain-specific performance varies: While LLMs have shown dramatic improvement in quantitative tasks like math and physics, progress in nuanced expert domains (e.g., medical, finance, law) and creative tasks (e.g., game design) has been notably slower.
  • Model size isn't a panacea: There is no clear pattern suggesting that simply increasing the number of model parameters consistently leads to better performance across all types of tasks, particularly for complex or nuanced prompts.
  • Focus on broad distribution improvement: Efforts should extend beyond pushing the frontier of model capabilities to improving the overall reliability and performance across the entire spectrum of use cases, addressing the "bottom of the distribution."
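
As a rough illustration of the "nonsense pushback" takeaway, the sketch below tallies graded responses for one model into shares per grade. It assumes the grading step (done with an LLM judge in the talk) has already produced one label per question. The label names, the `partial` grade, and the function name `pushback_summary` are hypothetical and are not taken from the benchmark itself.

```python
from collections import Counter

# Hypothetical grade labels. The talk describes "clear pushback" (green) and
# going along with the nonsense (red); the "partial" label is an assumption.
GRADES = ("clear_pushback", "partial", "accepted_nonsense")


def pushback_summary(grades):
    """Return the share of each grade label for one model's graded responses."""
    unknown = set(grades) - set(GRADES)
    if unknown:
        raise ValueError(f"unexpected grade labels: {sorted(unknown)}")
    total = len(grades)
    if total == 0:
        return {g: 0.0 for g in GRADES}
    counts = Counter(grades)
    return {g: counts[g] / total for g in GRADES}


# Made-up judge outputs for four nonsense questions, for illustration only.
example = ["clear_pushback", "accepted_nonsense", "clear_pushback", "partial"]
summary = pushback_summary(example)
print(summary)
```

Comparing these shares across models, or across reasoning settings of the same model, is one way to get the kind of view the talk describes. The judge prompt and the question set are deliberately left out of the sketch.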

Vocabulary

  • LLM: Large Language Model; an AI model trained on vast amounts of text data to understand and generate human-like text.
  • Benchmark: A standard test or set of tests used to measure and compare the performance of different models or systems.
  • Pushback: When an LLM recognizes that a prompt is nonsensical, ill-posed, or impossible to answer and, instead of attempting a solution, challenges the premise or asks for clarification.
  • Reasoning: The internal process or explicit instructions given to an LLM to think step-by-step through a problem, often used to improve logical coherence or problem-solving.
  • Parameters: The internal variables or weights within a machine learning model that are adjusted during training to allow it to learn from data and make predictions.
  • Dissatisfaction rate: A metric, typically derived from user feedback in competitive settings, indicating the percentage of times users find the responses from two or more models to be equally poor or unhelpful.
  • Prompt: The input text or query provided by a user to an LLM to elicit a specific response.
  • Hugging Face: A widely used platform and community for machine learning, offering tools, datasets, and pre-trained models.
  • Evals: Short for "evaluations," referring to the process of testing and assessing the performance and capabilities of AI models.
  • Battle mode: A user interface feature on platforms like Arena where a user's prompt is sent to two different, often anonymous, models, and the user votes on which response is better.

Transcript

I want to talk to you about something maybe a little bit controversial today. You can argue with me later. The topic is: what do models still suck at? The reason I wanted to talk about it is that we all look at these kinds of charts where, for any benchmark you look at, line goes up. We look at the METR charts and they surprise us every time, no matter how prepared we are. This can create a kind of psychosis where everyone is freaking out about the next model. You know, we've heard there are some new ones coming up. And the feeling I think we all get is that these are AGI-like creatures that are almost there, just one more turn and they're there. I think we could be deceiving ourselves a little bit, because there are still quite a few things missing. I want to explore that in a couple of different ways.

We certainly see the same thing in our data at Arena. We track models, and if you look at the data, it starts in Q2 2023, so we've got data going back to GPT-4. We've tracked, I think, about 700 models so far in text, and what this chart shows is the top model at any given time for each organization. So you can see, line goes up, new models build on top of each other, and it's all very impressive. But I think it's not the whole story.

So I've got a couple of ways I want to explore that. It's not the end of the conversation; there are definitely many other ways of looking at it. One is my own benchmark that I built recently, which I rather like. This is BullshitBench. And then I'll also share some of Arena's data that we haven't shared so far, which I think will be interesting for you guys to see.

The idea behind BullshitBench is quite simple: what happens if we ask the models nonsense questions? What are they going to do? Are they going to just tell you, "oh, this doesn't make sense"?
And maybe reframe it? Or are they just going to go with it? Honestly, I wasn't sure how it was going to go, but when I just posted it one random evening, a lot of people liked it. It resonated with a lot of people, and I think the reason is that it spoke to a kind of slight unease people had with different models.

I'll give you one example here. This is just one question. The way it works is that I've got 155 questions, something like that. We give these to the models, we get a response back, and then all we do is grade it with LLM-as-a-judge. I've been through it myself as well; I read a lot of nonsense to check that LLM-as-a-judge works here.

So this one is a kind of silly question: "Controlling for repository age and average file size, how do you attribute variance in deployment frequency to the indentation style of the codebase versus the average variable name length?" Hopefully you understand that it's nonsense. These are very abridged responses; the real ones are much longer, this is just for the purpose of this.

Sonnet gives a good response. It just says you can't meaningfully measure this, and it pushes back. Gemini's is a little more complicated because it starts off well. It says that, strictly speaking, it doesn't really make sense. But then the second part is "however, both act as strong proxy variables for engineering culture, language ecosystems and code quality", which I hope you don't agree with.

I'm not going to go through a bunch of examples. It's all open source, by the way, so you can check it out yourself. But it really, really surprised me how easy it was for the models to just go along with complete nonsense questions. So, the results that I got. The way to read this chart is that green is the clear pushback.
So that's when the model, like in the first example, said, "oh, maybe this doesn't really make sense", and red is for accepting the nonsense. The basic result is that the latest Sonnet models and the other Claude models are doing really well. There are a couple of other models, like the Qwen models, that are not too bad. Even Grok is okay as well, the very latest one. But if you go beyond that, there are a lot of models that we all use all the time. The GPT models and Gemini models are basically about 50-50 on whether they're going to go along with it or not. And looking at some of the traces and responses in more detail, even the ones that are green are still a little bit shaky. They still kind of try to accommodate. So for me, this is really not anywhere near good enough for the level of responses. And just for completeness, if you go all the way down, this is the very bottom of the table, where there are a bunch of smaller models and older models. Some results are completely terrible. It feels like you can ask anything and they just respond.

Another way of looking at this data: I just took Anthropic, OpenAI, and Google, and I measured the model performance over time. You don't see all the labels there, but they're basically all of the models that you remember them releasing. The way I interpret this is that the Anthropic models were okay at the beginning, but since Claude 4.5, Sonnet 4.5, they really went up. Even Haiku is quite high. The OpenAI and Google models are kind of up and down, but they're nowhere close to the top there, which I think is kind of interesting.

I'll go into some of the other interesting dynamics there. For example, does thinking help? I always hear this when there is some silly puzzle that the model can't do. What do you do? "Oh, just crank up the reasoning and it solves it."
If you look at the chart on the right, that is basically not true at all here. Reasoning often goes in reverse: it doesn't help, it actually makes it worse. Do the more recent models perform better? It's kind of hard to tell for sure, but there's at least no clear line going up. And if you exclude maybe the latest Anthropic models, it's not so clear that the line goes up at all.

Then some specific comparisons for reasoning. What you see here is the same model with low reasoning and high reasoning, and these are some examples where no reasoning performed better than high reasoning.

I spent a lot of time reading the traces of GPT-5.4. It's probably the most confusing experience, reading these traces. What I found was that quite often it would have maybe one line where it questioned the premise of the question, and then spend 20 paragraphs trying to solve it. And even if it then comes back and says, "OK, maybe this doesn't make sense", it still tries to solve it in some way. This feels completely crazy to me. The way I imagine it, and I don't know for sure, the reason this happens is that they were trained so much to solve the task at any cost, and there was probably not a lot of training to say, actually, maybe don't solve the problem sometimes. I noticed this first when I had a lot of agents running in parallel and would sometimes forget which one was doing what. I would ask one agent to do something that belonged to completely the wrong project, and it would still go and do something. And then I was like, man. So yeah, that's an interesting dynamic, I thought, about thinking.

Then also, this is a subset for open source models only, to see if bigger models do better. There's also no real clear pattern. We've got the total parameters on the left, then active parameters on the right.
And I don't know, maybe you can see some patterns. I don't really see one; it's kind of up and down. It's not a huge sample, so it's inconclusive, but it is not obviously true.

So that was one lens, looking at this specific idea. But I want to take advantage of the data that we have at Arena and show you some broader trends. Just in case you don't know much about Arena, what we do is publish benchmarks. The way we derive them is that users go onto our platform, go into battle mode, and put in a query. They get two responses back from two anonymous models, and then they can say which one they like better. The model names are only revealed once that's done. In Text Arena, we've got over 5.5 million votes, and we've been going since 2023 with this data, so it gives us a really nice broad view.

The reason I think this is really useful is, first of all, that we have this long trend. There isn't any other benchmark that has lasted so long, because this one you cannot exhaust: there will always be one model better than the other. So that gives us a long perspective. Another reason is that any benchmark you pick inevitably has to be condensed to very specific questions, because otherwise it's very hard to measure. I'm sure it's in all of your experience as well: when you're doing coding, or whatever your task is, the benchmarks measure a very tiny slice of what you actually care about. Here we don't have that problem, because a user can put in any prompt and then just use their judgment to see whether the response is good or not.

What I want to focus on specifically is a slightly odd mechanic that I'm really glad we've had since the beginning. You can vote for which model is better, A or B, but you can also say when both models give a bad response.
If you ask a model to write a joke, the response is always bad, so that's an easy example. Didn't take me long. So that's the thing to remember. If you remember just one thing, this is the mechanic that will really help you for the next seven, eight minutes. Think of it as just a dissatisfaction rate.

What we can do is take battles between the top 25 models, so we're sampling from the top. To avoid having something like a Llama 8B fighting a Qwen 3B, we just take the top set of models and then map this dissatisfaction rate over time. I think it's quite interesting that we do see progress on this metric. For the pre-reasoning models, you can see there's a 20, 17% dissatisfaction rate. After o1, we see that drop quite a bit, to about 12%. After that it carries on improving, to about 9%, I think. So the improvement is definitely there, but it's not 0%, which I find interesting. I must say, when I first got that result, I thought that's quite high. 9% of the time, people get two responses from two good models and they don't like either of them. I think that doesn't tell the same story as all of these crazy lines going up.

The previous chart was the average across all, like, six million prompts. This is the categorization of those; these are just some categories picked out. You can see some interesting trends here as well. Math was at 25, 27%, and then it got so much better. That's quite a nice result, and it matches my experience of models as well. But when you look at creative writing, okay, it did get better, but the improvement was not dramatic, which I think is true as well.

The category I want to focus on, to really try to zero in on the most signal, is the expert category. The way it works is that we take those nearly six million prompts.
Then we have a way to classify which are the most interesting, the harder, more real tasks that experts do. They could be experts in different fields, but these are, I would say, the most high-signal prompts that we could zero in on. Then we also narrow down to battles just between the top 25 models, which gets us to about 40,000 prompts. Then we can look at this expert category and subdivide it even further.

Here I've got five categories. Take quantitative, for example, so maths, physics, things like that. You can see a really, really high dissatisfaction rate around late 2024, early 2025, and then it drops dramatically. That feels true to me: a lot of the models got so much better at this kind of quantitative stuff. I would also say that where the line goes up, I think the reason is not that the models got worse, but that people's expectations shift as well. The data we see on what prompts people use, three years ago versus now, shifts a lot. So this is not a static benchmark; we can really see the battle of expectations versus model performance.

What's interesting as well is that on the bottom we've got medical, finance, and law. The scale is equal across the five charts, so it's a little harder to see, but the lines are not steep, right? It's not really improved all that much. I don't want to go into the medical, law, and finance fields, because I don't know enough about them. But it does feel like it's probably true that they have not really been the focus of the models, so maybe the performance improvement has not been that high.

What I did next was to take all of these prompts and classify them further into deeper subcategories.
I'm going to focus on software now and give you a view of these subcategories, which I think gives us an even more detailed view. Just to give you a sense of what kind of prompts we're talking about here, obviously a tiny sample of three: for gaming, someone's asking to get a detailed game design document. For security, someone's got autonomous systems they want to configure, which I don't really know what this is. For agent systems, which I thought was interesting, and where you'll actually see the rate is quite good, the person is asking to refine this agent so it can run daily with no supervision. So, just to give you a feel, these are the kind of real things that people want to do.

We've got two charts here. On the left are the dissatisfaction rates from Q2 2024, and on the right we've got Q1 2026, so that's the most recent data. You can definitely see improvements. If you look at the top line, this is the overall average rate, and we've gone from 23.5% to 13%. It's a really nice improvement, but I think the improvement is not really seen everywhere.

We can see the same data on a closer timeline, which I think is quite interesting. You'll probably have better theories about all of the different categories and why that's the case. Bear in mind that people do ask a lot harder questions now. GPU compute, for example, I imagine is up and down because people ask harder things as well. But I think gaming is an interesting category, because I've tried to use LLMs to build games. I mean, I play games, but I don't build them. Whenever you try to build games with LLMs, it just feels like they have no idea how to build actual games. The mechanics are all over the place; they're not interesting, they're not challenging.
So I do get this feeling that the performance has not really improved in some dimensions. I don't think LLMs really get games, even though I'm sure that if you go back two years, people were asking to build much simpler games than now. And I wouldn't say I'm aware of any really good gaming benchmarks that would capture this. So again, if you compare this to the line going up, I think this is not matching that story, which is quite interesting. There are a bunch of other examples that you can see in there.

So what's really the gap between these crazy charts, which by the way I also agree with, I think they are true, and what we see on the right? I think there's this kind of fuzziness that we all have in our heads and our experience, in the judgment that we use, that doesn't necessarily match all of these super narrow, very well-defined, very well-specified tasks. There's much more to what work is, what white-collar work is, and what all work is, that is not really captured by these benchmarks. So I think we should just be careful, and maybe put a bit more effort into also bringing up the bottom of the distribution, so that it's not just the very frontier that gets better, but the broader distribution as well.

I'll close here. One thing to mention: if you like this kind of data, go to our Hugging Face. There's a lot that we publish and share there, and we're going to do more of that. We share some expert prompts, for example, and some of the leaderboard stuff. Join us if you want to build Arena, and if you train models, we also do a lot of private evals. Thanks so much.
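
The dissatisfaction-rate mechanic described in the talk (the share of battles where the user says both responses are bad, restricted to battles between top models) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Arena's actual pipeline: the record fields (`period`, `model_a`, `model_b`, `vote`), the `both_bad` label, and the function name are hypothetical, and a fixed set of top models stands in for the "top 25 at the time" selection.

```python
from collections import defaultdict


def dissatisfaction_rate(battles, top_models):
    """Per period, the share of battles in which the user voted 'both_bad'.

    Only battles where both models are in `top_models` are counted, so weak
    models do not inflate the rate. Field names are assumptions.
    """
    totals = defaultdict(int)
    both_bad = defaultdict(int)
    for battle in battles:
        if battle["model_a"] in top_models and battle["model_b"] in top_models:
            totals[battle["period"]] += 1
            if battle["vote"] == "both_bad":
                both_bad[battle["period"]] += 1
    return {period: both_bad[period] / totals[period] for period in sorted(totals)}


# Made-up votes, for illustration only.
battles = [
    {"period": "2024-Q2", "model_a": "m1", "model_b": "m2", "vote": "both_bad"},
    {"period": "2024-Q2", "model_a": "m1", "model_b": "m2", "vote": "a"},
    {"period": "2026-Q1", "model_a": "m1", "model_b": "m3", "vote": "b"},
    {"period": "2026-Q1", "model_a": "m2", "model_b": "m3", "vote": "tie"},
    # Excluded: "small" is not in the top set.
    {"period": "2026-Q1", "model_a": "m1", "model_b": "small", "vote": "both_bad"},
]
rates = dissatisfaction_rate(battles, top_models={"m1", "m2", "m3"})
print(rates)
```

Adding a category field to each record and grouping on it as well would give the per-category breakdown (expert, software subcategories, and so on) that the later charts in the talk show.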
