How METR measures Long Tasks and Experienced Open Source Dev Productivity

Here's the very simple argument. If you look at some notion of compute over time (this could be R&D spending on compute, it could be experiment compute, it could be training compute, whatever some particular lab is using), it looks like this, no surprise. If you have another chart of log time horizon over time, let's say this is the METR measure that many of you will have seen on Twitter, it looks like that. Let's say that this was not merely a coincidence, but that these things were causally proportional, in the sense that if compute growth were to halve, then time horizon growth would halve too. So for the sake of argument, let's say that starting from 2028 or so, the compute curve begins to bend like that: this would be no growth, this would be the original growth, and this is something like half. Then if they were causally related, and in particular causally proportional to one another, you'd expect this to go like that. And then for some milestone that you care about, let's say a one-work-month time horizon up there, the delay implied in AI capabilities is potentially enormous.
Now, why? Lots of people have speculated that there might be some slowdown in compute growth. I'm not an expert in those forecasts, but the priors do seem somewhat strong to me. One is physical constraints that we might hit: power constraints, as mentioned, or various other ones that Epoch have a report on. They consider a number of these, all of which seem not to bite through 2030, but potentially could bite some time after 2030. I think the more likely one is just that dollars are a constraint. Large tech companies can only spend so much at a certain point, and large nation states can only spend so much. I guess there are some scenarios in which you can continue going, but that seems to naturally imply this slowing down.
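As a rough illustration of the argument above, here is a minimal sketch of how far a time-horizon milestone slips if the growth rate halves partway through. It assumes exact causal proportionality, so halving compute growth halves the number of horizon doublings per year. The function name `years_to_reach` and every number in the usage (a 2-hour starting horizon, a 6-month doubling time, a bend after 2 years, a 128-hour milestone) are hypothetical placeholders chosen for easy arithmetic, not figures from the talk.

```python
import math


def years_to_reach(target_hours, start_hours, doubling_time_years,
                   slowdown_after_years=None, slowdown_factor=1.0):
    """Years until an exponentially growing time horizon hits target_hours.

    Hypothetical model: the horizon doubles every doubling_time_years. If
    slowdown_after_years is given, the growth rate (doublings per year) is
    multiplied by slowdown_factor from that point on.
    """
    doublings_needed = math.log2(target_hours / start_hours)
    rate = 1.0 / doubling_time_years  # doublings per year before any slowdown
    if slowdown_after_years is None:
        return doublings_needed / rate
    doublings_before = rate * slowdown_after_years
    if doublings_before >= doublings_needed:
        # The milestone is reached before the slowdown begins.
        return doublings_needed / rate
    remaining = doublings_needed - doublings_before
    return slowdown_after_years + remaining / (rate * slowdown_factor)


# Illustrative inputs only (assumed, not taken from the talk).
baseline = years_to_reach(128, 2, 0.5)
slowed = years_to_reach(128, 2, 0.5, slowdown_after_years=2, slowdown_factor=0.5)
print(f"baseline: {baseline:.1f} years, slowed: {slowed:.1f} years, "
      f"delay: {slowed - baseline:.1f} years")
```

Under this model every doubling that remains after the bend takes twice as long, so the further the milestone lies beyond the bend, the larger the delay.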
And then the additional point that this paper's trying to make is that under a very contestable, but standard, assumption from economics, you should in fact expect these to be causally proportional. In particular, I think you should expect them to be causally proportional to the extent that, or for the period that, a software-only singularity is not possible. That's another thing to look at, I think. But at least in this somewhat business-as-usual scenario, or until that scenario no longer applies, I think this is maybe a reasonable model, and it does imply some sort of slowdown in capabilities growth in the near future. I have no plan for this session whatsoever.
That also suits me. So this assumes we don't have a technological advance that dramatically improves capabilities relative to compute? Like a less predictable technological advance, right?
Yeah, I mean, all predictions assume nothing unpredictable. In AI in general, time horizon included, straight lines on log-linear plots have been, I think, a very highly underrated forecasting tool. They've done extremely well over what is now many orders of magnitude. I think it's reasonable to have the default expectation that the log-linear lines continue through approximately the same number of orders of magnitude, except maybe if there's some significant break in the inputs. Of course, on the upside, there could be something quite dramatic. A software-only singularity is the first thing that comes to my mind, but another transformer-style moment seems like another candidate.
Absolutely. Of course, one of the problems with testing this will be that, for most of the tests that you have, the models will at some point eclipse the maximum possible amount of time that those tests could take. The evaluation set.
Yeah, so I think there are some ways around this that we're working on.
I'd be excited to talk about that. They still feel pretty early. But yeah, I think that's right. If time horizons keep doubling, then eventually, given the doubling time, you can't possibly make long enough development tasks.
It's also possible that we get to a place where time horizon is no longer a useful measure, because at some point you want time to decrease: you want the same results, but with higher reliability at a lower time horizon.
One thing to say about time horizon is that there are two notions of time here. Time horizon is a human-time-axis thing. As for the time that the model is working for, I think you should kind of approximate that as zero. It's not actually zero: they are taking actions. But they largely do their successful work pretty early on, to the extent they're going to be successful on tasks. So my guess would be that it will continue to be the case that there's not so much extra juice on that margin of making the models complete tasks more quickly, although on reliability very much so, obviously.
So most of the time is spent in the human-machine iteration loop?
Not here. The humans are working without AIs and the AIs are working without humans. So the humans, I guess, are only humans.
Yeah, yeah. Cool.
Any questions on METR's work? I can go through some upcoming things that we're excited about, if people are excited about those things.
Yeah, I did have one. One thing I thought you brought up a little bit in the paper is whether or not familiarity with the tools is a confounding factor.
And of course, you also brought up that tool capability has dramatically changed. But there was an interesting presentation from Meta at the developer productivity engineering summit this year. They have probably the best infrastructure for quantitative measurement of developer experience in the world, of any company. And they're able to tell you basically how long it actually takes to make a PR (they call them diffs at Meta): how much actual effort, human time effort, it took to make a PR. And what they saw was a J-curve when they gave people agents. And that J-curve was, I don't know how long, something like three months or six months. And so one of the things that I also wonder is whether it would be interesting to see if there's a cutoff of how much familiarity the person has, like whether they've been using this as their full-time daily driver for a period of months, and whether there's a kind of cutoff that occurs once they reach a certain level of familiarity.
Yeah. I'm totally on board with, not just in this case but in many economically relevant cases outside of software engineering, J-curve-like explanations being a real thing. Of course, and not just for developers: when you experiment with tools, you tend to be slower the first time that you're experimenting with them. But you're doing this as an investment. Later on, you might be more proficient with the tools, or in the case of AI, maybe you just expect the models will get better, and so even if you don't become more proficient, it will be the kind of thing that you want to do. Those explanations broadly make sense to me. I can give you some reasons why I'm skeptical here. One thing to say, as background, is that we're continuing with this work, and we'll see. Another thing to say is just that, quantitatively, the difference between this and this is very large.
I'm like, how much is the J-curve explaining? I think it's not explaining that much. Let me explain that.
Before you do, one point: it has actually been found in some software engineering studies that the one question you can't ask people in a survey is how long a task took. You can ask people how much more productive they felt, and they will give you a response that correlates with quantitative feedback. But on the amount of time that something takes, they are almost always wrong. So when I shared this with my colleagues, I was like, OK, I'm not surprised about that at all. What is interesting is how large the slowdown aspect is. That was what was in there.
Yeah, yeah. Point well taken. That makes a lot of sense. I do think that, despite this, we were interested in time estimates.
I mean, I think the perceptual aspect is relevant to you also, because the perceptual aspect is also the hype aspect.
So developers will tell you that they were faster when they weren't. And I think that is worth knowing. And to the extent that we're interested in measuring the possibility, timing, and nature of capabilities explosions, or of AI R&D being automated, one commonly proposed measure is just to ask developers or researchers how much they've been sped up. And for exactly the reasons you're pointing out, I don't put a lot of faith in those estimates. So it's nice to see it like this.
Yeah, some more J-curve things. So the forecasters, who are not predicting time to complete, they are just predicting this effect size: these are non-developers, the expert forecasters. They are told the degree of experience these developers have. And some of the forecasters, in thinking about how this population might be different to other populations, point out various facts about the study, like: these developers are more experienced, and I expect experienced people to get less speedup.
Or: the repositories are larger, I think AI is less capable of working on larger repositories, so I expect less speedup. They never, never mention familiarity with tools. My sense is that they share the sense I had ahead of time, which was that most of the action is in understanding what kinds of things AI is good at in the first place. And all of these developers have experience with LLMs in their core development workflow. It's just Cursor that most of them were unfamiliar with at the start of the study. So I just wasn't seeing much margin. Yeah, I think it is an open question. I also watched so many hours of screen recordings of these developers working. And I just do not see it. I think they're prompting very reasonably, in some cases worse than me and my colleagues, in some cases better. I'm not seeing advanced workflows that they're failing to access.
Yeah, and my experience is not that far off from this: there are times when I am dramatically slowed down, and there are times when I am accelerated. Although as my familiarity with the tool increases, I definitely don't hit the slowdowns as much, because I learn over time what I can tell it to do and what I can't tell it to do. That's in addition to just getting better with it, like understanding: OK, now I need to plan, now I'll follow up. The thing is, before you make a high-level architectural decision that ten conversation turns down the line is going to blow up in your face, you really try to think about it.
Yeah, yeah, exactly.
And you also scope it down to a smaller problem. At first, I would try problems that were too large, and it can't handle that. But just for the future, if you ever do, I mean, I think it's obviously really hard with the 16-person sample size. But you know, that's great.
Great.
In the future, trying to figure out if there is a cutoff of familiarity where the numbers change would be interesting, to see if that Meta result generalizes.
That's what I should say: we are on it. I think the AI has been getting better during this period, which is going to confound a lot of what's going on, obviously. But yeah.
Interesting. The thing is, the projects themselves are very optimized for people coming onto new projects and figuring things out. The ones that struggle to be organized well for humans to come on board, navigate, and build quickly don't survive very long in the open source ecosystem.
Yeah, these are fairly mature open source projects.
And they're a little bit different from enterprise settings, where things survive because they make money even if they're a pain to develop on. So the context is a bit different.
Yeah, that is really interesting, Brian, because actually some of the repos that I was helped the most with were ones that I was completely unfamiliar with, and which had no documentation of any kind. I had to come in on this legacy codebase that has been around for years and make a change. And the developer who owned it was only partially available to answer questions for me. And so in that case, Claude Code is huge.
Yeah, legacy codebases don't exist because they work well. It's because they make money.
Yeah. So what I wondered was: did all the developers have the same level of AI experience? From what I heard, was there some variance? Is there a plot by level of familiarity?
There's always a plot. There's always a plot.
Or is there a J-curve?
Yeah. So here's some evidence. OK, I've shown you some plots.
I think the sample size is just small enough that you shouldn't really believe any of them. I mean, I think the plots aren't showing so much, but then I don't want to say that's strong evidence that this is not something that's going on. I just think the evidence is kind of weak. The thing that really convinced me is watching the videos. I watch the videos of them working. And often they're better at using Cursor than me. And I'm like, well, I'm working on this project about you using Cursor. But here are some graphs. So this is by whether they have various types of AI experience coming into the study. And basically, you see no movement in point estimates. For people who had used Cursor as their primary IDE before, there's not a huge amount of difference versus people who hadn't. Then the next one: you might have a view that some J-curve comes after this point, but still, within the study, there's some variation in how experienced people are with the AI, because they have multiple issues. After the first AI issue, they're slightly more exposed, then again after the second AI issue. So you might try excluding those data points over time and seeing what pops out. And they don't seem to get better at using it over time.
Well, I think there's probably a statistical issue.
You think there's probably what?
There's probably a statistical issue with that plot right there. Those bars are very, very wide.
Oh, I mean, yeah, all of the plots outside of the main plots, all of these subset things, you should not put a lot of stock in.
Yeah, I totally agree.
OK, and then, so this graph is the reason we put it in as unclear evidence, because we like to have things pointing in different directions.
Lots has been made of this plot, suggesting something J-shaped, in particular that at the end, once people have more experience, they do experience some speedup. Here are some issues. First, the other plots don't show that. I think that's important to include. Second, these hours are coded very conservatively. So, for instance, someone in the 30-to-50-hours bucket had Cursor as their primary IDE in 2024. They had recorded themselves on their time tracking software as having spent 140 hours using Cursor. They conservatively estimated that they'd spent 50 hours using Cursor. And so they're in the 30-to-50-hours bin. This is someone whose primary IDE was Cursor last year. And people have been commenting that these developers had been using Cursor for less than a week. I think that's not a very fair assessment. If you were to move that developer, and again, you shouldn't believe this because of statistics, from the penultimate effect size estimate to the last one, then you see some balancing out, where you get back to essentially zero in the last two. So again, on J-curve explanations, I think this is still very unstable.
Is it not likely that the 50-hour group is also similarly underestimating the time they've spent using Cursor, and that actually, if you just had a longer scale, you would still see it?
Oh, that is an interesting point. That seems plausible to me. And then I guess I want to say, I'm not sure it's underestimating, because we're using this very conservative coding. But certainly, yeah, I think that seems plausible to me. And then for this not to be strong evidence, I retreat back to: I think you shouldn't really believe any of these.
Yeah, I know.
But I think the big thing is that it's a small sample size, and there's also a lot of bias in the data set, effectively. Right? It's a certain kind of data set. It depends on the kinds of developers.
Yeah: open source developers, and also working on open source projects that are pretty mature.
Yeah. For those two things, working with open source developers on pretty mature projects, this is probably reasonably indicative. Maybe the sample size is pretty small, but outside of that it gets a little harder.
Yeah, when I talk about this, I say this group is really weird. It's really interesting, and it's interesting for the same reason it's really weird. Right? We were interested in, again, studying possible effects of AI on AI R&D speedup or automation. There, if any types of developers are not being greatly sped up, that implies the whole thing isn't being sped up. So it is kind of curious to see even particular weird populations. You might imagine that large production inference codebases maybe have a bit more of this shape than scrappy experiment scripts.
Yeah. But yeah, totally. No, I think it's very interesting. It's just hard to generalize from.
Yeah. We are doing this larger study, which includes more greenfield projects. And I think it's still going to be hard.
I see what you're saying, yeah. Although I don't feel like your results are particularly contradictory with any actual independent research that's been conducted. The research that I've seen that I would say is contradictory to yours is research that has been funded by model shops or agent shops.
What can I say about that? I do think that most of the research that's put out is associated with large tech companies. And I think there are other methodological concerns that I have with it as well.
I have methodological critiques of them as well.
I know people who work at some of those places who have methodological concerns with the work that's out there. So, I mean, I think there are concerns about us as well.
Sure. Sure, but I actually feel like, I remember somebody sent me your paper, and when I saw the headline, I was like, no way. Let me read it. I was like, that sounds like BS. Then I read the paper and I was like, oh, this doesn't suck at all. At least your high-level conclusion is both intuitive, coming from a person who's read a lot of software engineering research, and well justified. I have had people argue with me about the 16-developer thing, but I don't think that actually matters in that particular case, because I think they're actually a fairly good controlled set, more or less, for an experiment: they remove a lot of validity concerns by being experts. So sure, they don't represent the broad population of developers, but they also remove a lot of the variance that you would expect from the population. And they allow you to say, hey, let's isolate that factor away and then see what happens. And I like that. And I thought the way that the study was conducted was completely sufficient to draw the high-level conclusion that it drew.
Thank you very much. Here's a curiosity. So we didn't publish this, for organizational reasons that I won't go into, but we did conduct it. People raise various explanations for what's going on here, many of which have lots of merit, some of which I'm more skeptical of. A natural one is brownfield versus greenfield projects. So we ran this kind of enormous hackathon where we randomized half of the teams to use AI versus not, maximally greenfield or something.
And then we'd have a bunch of judges score them, many judge scores per project, to try and even out that noise more. To see: is it the case that the bottom 50% are all the AI-disallowed groups and the top are all the AI-allowed groups, or something like that? Now, unfortunately, it was even smaller. That's part of the reason we're not publishing it. So I think the evidence is really quite weak. The degree of overlap is enormous. I'm a bit nervous about saying this, because it hasn't gone through the kind of review processes that something like this usually goes through, so maybe I've messed something up. But I think the point estimate is something like four percentage points higher, sorry, four percentile points higher, if AI is allowed versus if it's not, after some controls. That is extremely noisy, not something to draw conclusions from, but seemingly maybe kind of small effects from allowing AI.
Yeah. So off the top of my head: I guess this is a good study, and there's also the other research that you guys have done. Have you found a similar pattern? I guess first, have you explored the effect of AI in other domains, not specifically software engineering? And if so, have you also found this kind of surprising result, that maybe there's not as much speedup?
No, no. These are new directions, stuff that we have not done. Yeah, we're interested in understanding the possibility of accelerating AI R&D. Coding is not the only kind of thing that happens at major AI companies; much more conceptual work happens. I'd be very excited about working with math PhD students, or very different types of software developers, or running these kinds of studies inside of major AI companies or large tech companies. I think we are very interested in things that are, not necessarily directly, but in somewhat close analogy to the large AI company case.
So to the extent that something really deviates from that, we're probably less interested.
Interesting. So it sounds like you're interested in measuring capabilities for math research and maybe other research.
Yeah, I'd say I'm interested in what the hell is going on in AI. And how am I going to learn the most about what the hell is going on in AI? I think something a bit more conceptual, something where fewer humans are currently working on it, so it appears less in training data, will help me better triangulate the truth about what's going on in AI. Even if I don't care about math research in particular, I'll still draw helpful qualitative lessons. That's kind of the sense I have.
Yeah, I mean, I was going to pick the areas where I think it's most successful. And of the areas where I'd expect it to be more successful, but where I think it is being less successful, I would probably pick data science as an interesting one: how are data scientists helped by AI today?
Say more about why you expect it to be less successful.
So let me give you an example. At LinkedIn, there are 5,000 tables with "impressions" in the table name, right? So if an analyst wants to understand how many impressions there have been on a page, where the hell do they go? You can't figure that out. Today, there is no existing AI system that could be hooked into a corporate environment like that and process through it; I mean, there are trillions of rows in those tables. So what a data scientist needs to do is say: I need to analyze a bunch of data and come to a conclusion, right? And I hear lots of talk about building systems. People talk about text-to-SQL. The models are much better at writing SQL than they used to be.
But I believe that the state of the underlying data is so bad that actual data scientists are going to get way less value out of the AI than software engineers are getting.
That is very curious. So one view that some more bearish people have, looking at the future of AI, is that there's so much tacit knowledge around, so much knowledge that's embedded inside of companies, that you're not going to pick up from these RL training environments or data or something. Maybe it's not the state of nature that there need to be many specialized AIs, and much less so in the last few years, when one big general AI seems to be more performant. But at some point in the future, when data is locked up inside of companies, we will have more of this proliferation of many more specialist models, as in a GPT-N fine-tuned on LinkedIn data in particular, something like that. I have one reaction that's kind of like that. I do also have a disbelief-like reaction.
But also, contradictory facts. The most common thing is that all these data sets contain conflicting facts. The name of the field will be, you know, "date started", or let's say it'll be "time started", right? And then it will contain only a date, except it will only contain the date up until, like, November of last year. And then after that, it will contain only the month. But then after that, it will contain maybe the seconds at which the thing finished. And in order to actually successfully query the data set, you, the data analyst, the data scientist, have to know what those cutoff dates were, and they're not written anywhere. Although what you could do, theoretically, is import a bunch of the SQL that other analysts have written, to try to figure out how they triangulated these things and worked it out for those reports.
Sorry, I just haven't worked at a large company: people don't fix this at the source?
Oh, no.
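The changing-field problem described above can be made concrete with a small sketch. Everything here is hypothetical: the cutoff dates, the raw encodings, and the function name `interpret_date_started` are invented for illustration, and the sketch assumes each row carries a reliable "row written" date. The point is only that the same column needs three different interpretations, and that the cutoffs are knowledge an analyst (or an AI system) would have to obtain from somewhere outside the table.

```python
from datetime import date, datetime, timezone

# Hypothetical cutoffs. In the scenario described, these are not written
# down anywhere and have to be learned from colleagues or from old queries.
MONTH_ONLY_FROM = date(2024, 11, 1)
FINISH_SECONDS_FROM = date(2025, 3, 1)


def interpret_date_started(raw: str, row_written: date):
    """Return (kind, value) for one raw value of a drifting 'date started' field.

    Assumed encodings, for illustration only:
      before MONTH_ONLY_FROM      -> 'YYYY-MM-DD' start date
      before FINISH_SECONDS_FROM  -> 'YYYY-MM' start month
      afterwards                  -> epoch seconds of the *finish* time
    """
    if row_written < MONTH_ONLY_FROM:
        return ("start_date", date.fromisoformat(raw))
    if row_written < FINISH_SECONDS_FROM:
        year, month = raw.split("-")
        return ("start_month", date(int(year), int(month), 1))
    return ("finish_time", datetime.fromtimestamp(int(raw), tz=timezone.utc))


# The same column yields three different kinds of fact.
for raw, written in [("2024-05-17", date(2024, 5, 17)),
                     ("2024-12", date(2024, 12, 3)),
                     ("1743465600", date(2025, 4, 1))]:
    print(interpret_date_started(raw, written))
```

A query that ignores the cutoffs would silently mix start dates, start months, and finish times, which is the failure mode the speaker is pointing at.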
So I feel like the lesson I learned over and over again is that data specs really matter. Really matter.
No, I've also been working in data analysis and research.
Yeah. And so the problem is that their job is: produce this report for this executive, right? And not: go build infrastructure to produce this report.
Yeah. OK, right. I live that dream every day.
Well, you have to build out an infrastructure for it, and that has to be part of the job description. And the other part is you have to fix the problem at the source. I still remember having a conversation where someone said it's too difficult to fix it at the source because there's too much complexity in all the systems at the source. And I said, OK, wait a minute. This means that a problem that is too big for the entire organization to solve at the source is somehow easier to solve downstream? Come on.
Yeah, that doesn't make any sense.
I just think there's so much potential here, and I have not seen a lot of studies done on how people who are working in that data space are experiencing AI. And what's fascinating about that is that real ML is mostly data work. In ML outside of LLMs, the majority of ML engineers spend most of their time doing feature curation, and trying to clean up bad data for feature curation, rather than doing actual direct model training. So the potential, even for the improvement of ML, from enabling AI to be a better data scientist is huge. And my hypothesis is that if you went into this space, you would discover it is great at telling me how to write SQL, or how to write pandas or Polars or whatever you're using. It is OK at doing very trivial things, and it fails completely at all complex tasks. I haven't even seen a benchmark on it.
Can you give me an example of a complex task?
Sure. Let's say a complex task is: give me the P90 of time between deployments, for all deployments that happen at Capital One.
It would struggle with that? Yeah, that does seem surprising to me.
That seems surprising, right?
Yeah. So I'm like, you know, if it has reasonable context about where it would find this.
So first I have to find that data, right?
Sure, sure, makes sense.
And then, OK, give me that number. And then also, make sure that you can break that down by team hierarchy. So give me that in a table so I can break it down by team hierarchy. Where is the team hierarchy data? Oh, and here's a funny thing: how do I actually determine what time deployments started and ended? It turns out that's not clear in the base telemetry, and you have to know magic to figure out when the deployment started and ended. And also, for my ability to analyze it, tell me how many PRs were in each of those deployments, and which PRs were in each of those deployments. Well, guess what? Imagine the deployment system just contains some information about that data, right? Then where do I get that data? Well, that data doesn't exist in any other system. So maybe I have to go to GitHub and I have to call the GitHub API. And the chance of an LLM agent chaining all of that together today is pretty minimal.
Yeah. You know, relative to my colleagues, I'm pretty bearish on AI progress. I do still have some reaction that's like: how can't you spend a day getting this into a Cursor rules file? You know, where the hierarchy exists?
I would just say, that's why I think it's interesting.
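For reference, the final computation in the example above is short once the data is clean. The sketch below assumes, hypothetically, that deployment start times and team labels have already been recovered as `(team, started_at)` pairs; by the speaker's account, recovering them is the hard part, and nothing here addresses it. The function names and the nearest-rank definition of P90 are choices made for this illustration.

```python
import math
from collections import defaultdict
from datetime import datetime, timedelta


def p90(values):
    """90th percentile by the nearest-rank method (one of several conventions)."""
    ordered = sorted(values)
    rank = math.ceil(0.9 * len(ordered))
    return ordered[rank - 1]


def p90_gap_hours_by_team(deployments):
    """Map each team to the P90 of hours between its consecutive deployments.

    deployments: iterable of (team, started_at) pairs, assumed already cleaned.
    Teams with fewer than two deployments have no gaps and are omitted.
    """
    starts_by_team = defaultdict(list)
    for team, started_at in deployments:
        starts_by_team[team].append(started_at)

    result = {}
    for team, starts in starts_by_team.items():
        starts.sort()
        gaps = [(later - earlier).total_seconds() / 3600
                for earlier, later in zip(starts, starts[1:])]
        if gaps:
            result[team] = p90(gaps)
    return result


# Made-up data: team "a" deploys at hours 0, 1, 3 and 6; team "b" deploys once.
t0 = datetime(2025, 1, 1)
rows = [("a", t0 + timedelta(hours=h)) for h in (0, 1, 3, 6)] + [("b", t0)]
print(p90_gap_hours_by_team(rows))
```

Breaking the result down by a real team hierarchy, or attaching the PRs in each deployment, would need the additional joins the speaker describes (for instance against data fetched from GitHub), which are deliberately left out.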
I have not seen any real comprehensive study on the experience data scientists have.
If you have any ins for us running studies at large tech companies, then I'm all ears.
There is a fellow at OpenAI, I was talking to him, one of the speakers, who does evals, internal evals. And he has mentioned that he's done some work with data scientists. So he might know some people who have that data. But it's all been internal, between him and, like, Cursor, between him and, like, Anthropic or whatever. And I also think one of the groups I'm curious about is lawyers. The more traditional, older professions: lawyers, doctors, and, I think, mathematicians are all interesting. It's because both lawyers and doctors are so constrained by a legacy history of the constraints around them and how they work.
Yeah, legal issues. And they're stodgy. I was interested in how stodgy they are. The stodginess, I think, I'm less bought into as a long-run explanation. The legal restrictions sort of continue to be the case through time. With the stodginess, you set up a new law firm that's less stodgy, and then it outcompetes the previous law firm, it seems to me.
I agree. I don't think it's persistent. I just think one thing that would be interesting to see is whether that affects the mental model that they have today: how they've been talked to about it, or how their trust in it affects how they use it. That would be interesting to know, to me. I don't know if it's a worthwhile study; it's more one of those things that I wonder about idly.
Say you take a lawyer who's just got out of college, who, you know, has spent a lot more time using ChatGPT, and you take a lawyer who's been in the business for 50 years and, you know, has a giant folder of Word docs that contain all the briefs that all their junior associates have written for decades and decades, and he just opens up those briefs and changes a few words and then sends them out to the judge. And he, you know, has known those judges for like 30 years, 40 years, he knows exactly what they want. Is he going to get any value? But is there a value he should get? Is there some way that he would be helped? Like, I certainly know discovery. Discovery in law is a huge, huge problem. And I know that there's Harvey, I don't know anything about what success they've had. I've known a lot of people working in that space specifically, like, that's an ongoing thing, right? There are always tools for it, but the adoption of it is a very different thing from... That's the thing, right? Because I have a little bit of a legal background, and one of the first things that I thought of when ChatGPT came out, I was like, oh, this could totally change discovery. Because discovery is the most painful and most difficult and most expensive part. You'd have serious social consequences from making discovery less expensive. That is the expensive part of having a lawsuit. And so you could actually have significant impact on society if you could make discovery cheap and instantaneous and reliable. Yeah. A question on the graph. Yeah. Person. I'm sure you've missed it. You missed it. Keep on coming to the community. Here. Well, sorry. It's a scatter plot, right? It was worth a certain 50 hours. And according... Shall I always say?
Yeah. Yep. Dump, dump, dump, dump. I'd say it's this one. Yes. So you're saying that for those developers there was no difference. Cursor is, we're talking about the IDE, the vibe coding, and the users, for the 50-hour problem. I was very intrigued by that, because everyone talks about vibe coding and how Cursor is... And how did you get to 50 hours? Just curious. So this is including time... This is including time that developers spent in the experiment plus their past experience. So for some developers working on some issues as part of the experiment, some of them have gotten to more than 50 hours of Cursor experience. And that's just kind of up and back to the end. Was it the same task for each...? No, these are kind of... they're natural tasks that pop up on the GitHub repositories, which, as I mentioned, are kind of... I'm a little bit nervous about saying they're weird because it implies that... I want to say it's very interesting and it's very weird. And it's interesting for the same reasons it's weird. These are projects in which they have an enormous amount of mental context built up that the AIs might not have, that they've worked on for many, many years, that they can... I'm not sure this is always the case, but I imagine in my head that they basically know how to execute on the particular task they have before they even, you know, go about something else. Because they're experts in the project, can you even get a speedup? Is it like 5%? Like, when you move up, what's the kind of speedup? So you might think about... Let's go to this one instead. So on the left-hand side, we have the averages for what the developers say will happen in terms of their time to complete if their issue or their task gets assigned to the AI-disallowed or the AI-allowed group.
They think that if AI is disallowed, it'll take them a bit more time, closer to two hours, and I guess more like an hour and a half or a little bit less if AI is allowed. But then we randomize this particular task to allow AI or not to allow AI. And it turns out that if we randomize to AI-allowed, then the time is a little bit above two hours rather than a bit below two hours. And then you can think of the change in time estimate as sort of being one divided by the other here. It's not quite that, for reasons I can go into, but that's effectively what the transformation is. You know, it's something like AI-disallowed over AI-allowed, minus one. So to draw this out, I might be like, what's the speedup? You know, is it like 1.1x, that these developers are going 1.1 times faster? We're actually on a time-to-complete scale, not a speed scale, but ignoring that detail, is it 1.5x? Is it 0.5x, so they're actually going sort of twice as slow? How would we get that information? Well, we'd do something like take the AI-allowed times divided by the AI-disallowed times. You know, if the AI-allowed times were, let's say, 1.1 times as long as the AI-disallowed times, then we'd get 1.1x. It's something like that. And in fact, you know, we find that slowdown. Thanks. I just read a fascinating article, I can't remember the company, but basically a journalist was allowed to use vibe coding, right? Do a pull request. It was featured. AI was used to assist with building up requirements and, cheap practically, pointed the article just kind of did a little couple of weeks and then the sign up on it. It was really fascinating. It was the whole vibe coding thing. Yeah, I just knew. Like, that was the whole thing. It was like, you didn't have any software engineering background. I was curious if you'd tried to do a study on that. So I definitely do share the sense.
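The ratio arithmetic described here can be sketched as follows. This is a simplification under stated assumptions: it uses plain averages of made-up completion times, whereas the speaker notes the study's actual transformation is "not quite that". The function names and all numbers are hypothetical.

```python
from statistics import mean

# Hypothetical completion times in hours. These are NOT the study's data.
ai_allowed = [2.0, 2.4, 2.2]
ai_disallowed = [1.8, 2.0, 2.2]


def time_ratio(allowed, disallowed):
    """Average AI-allowed time divided by average AI-disallowed time.

    Above 1 means tasks took longer when AI was allowed.
    """
    return mean(allowed) / mean(disallowed)


def change_in_time(allowed, disallowed):
    """'AI-disallowed over AI-allowed, minus one', as described in the talk.

    Positive means AI saved time; negative means a slowdown.
    """
    return mean(disallowed) / mean(allowed) - 1


ratio = time_ratio(ai_allowed, ai_disallowed)
change = change_in_time(ai_allowed, ai_disallowed)
print(f"time ratio: {ratio:.2f}x, change in time: {change:+.1%}")
```

With these illustrative numbers the AI-allowed tasks take about 1.1 times as long, which corresponds to a negative change in time, that is, a slowdown.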
But, you know, if you've got like no idea what's going on, then probably there is going to be some significant speedup. I will say, I guess, number one, it's not, you know, it's not April or August. In fact, we went out and did this hackathon with very experienced people and much less experienced people and tried to see what happened, and what we found is, you know, the scores, the judge scores, are extremely noisy and I think you shouldn't believe it, but the judge scores were not that much higher when AI was allowed versus when it was not. The people aren't actually making that much more progress. And then another thing to say is, and I think there's going to be more expertise in this room than I have, my understanding from sitting with these open source developers for a while, and not being a very capable developer myself, is that the quality bar on the repositories in this study is just very high. Yes, typically. And so I would be very surprised if a journalist, or even frankly a good software engineer without lots of experience on the repository, but certainly someone who wasn't a software engineer, was able to get up a clean PR on these repositories first time. In fact, I think that's a lot of the story for what's going on here: the AIs, you know, they actually kind of do make progress in the right direction some good fraction of the time. But for various reasons, sometimes for reasons of correctness, but sometimes for reasons of, you know, how they've tried to solve the problem and whether that's the typical way of solving the problem, or how various parts of the project speak to one another, these kinds of considerations, they haven't properly accounted for that. And so the humans not only need to spend expensive time verifying, but also clean up all the stuff.
And my sense is that someone who didn't have all that experience basically wouldn't know how to do that step. And so they wouldn't be able to submit a clean PR to these repositories, you know, that's it. Like, relative to these people at least, I suck at software development, and I'm getting up PRs internally all the time. I think they're worse quality. And, you know, over time they're getting better. I do believe that people are coding when they wouldn't otherwise be able to code. They are submitting PRs at lower quality standards when they wouldn't be able to do that at all. But getting up to these expert-level PRs, I do feel kind of skeptical. And that's actually part of what I was getting at: PRs from more novice folks often get rejected on these bigger, high-quality projects, for no reason other than the developer ergonomics impact of the PR, right? So the fact that it makes it harder for me to maintain in future. Because for an open source project, almost all the incentive is biased towards making it easier for me to maintain the project, right? So every time a PR comes in, if it doesn't make it easier for me to maintain the project, I have a tendency to reject it. Yeah. If it does make it easier to maintain the project, then yay, I'm into it. That is unlike what you have in a typical business context, right? Where the important thing is to actually get something done, right? Because, you know, the fact that someone has to spend a lot of time maintaining is almost job security, right? But for open source, it's the opposite. What causes people to leave projects is what is difficult to maintain, right? So it is a different bias on what you accept for pull requests. Can you remind me the name of the English gentleman who maintains the Haskell compiler? Simon something? Yeah. No, I can't remember. No, OK. Bear with me.
So here's one story that might be relevant. You know, there's a bunch of repositories in the study that will have broadly these characteristics; one of them is the Haskell compiler. Famously, on the Haskell compiler, there's some chance, I don't know if it's 50% or 30% or whatever, but there's some chance that if you submit a PR, the... I'm being recorded. Simon, of the Haskell compiler, will come into the comments and argue with you for many, many hours, much longer than you spent working on the pull request, until the PR hits exactly his specifications. Combine that fact with the remarkable fact, I think, that for the median PR in the study, the time they spend working on the code post-review is zero minutes. That is, the median PR is, like, perfect first time round, because the professional incentives of these developers are like that. Now, there's a very long tail. On one of them, I think literally Simon, this gentleman, pops up and, oh, they've been arguing for many hours, and that one's a lot longer. But yeah, they are maintaining this extremely high bar. I'm interested in your other upcoming stuff that you had in your doc. Yeah, so, you know, I think there is a lot to say. I guess let's go in order. As I think you mentioned, if capabilities as measured by time horizon keep doubling, it does seem very, very challenging to keep up with that. In the short term, we have a number of directions for getting on top of that. And I think that will work through the year, but through two years, you know, that seems challenging. I think still possible. Through three years, I think, still seems awful. You know, it starts to get harder and harder.
Anyway, in the short term, building these much longer tasks, and ways in which we might get around the problem entirely. For instance, here's one thing that might be somewhat... But you could also raise the accuracy bar. You could raise the accuracy bar, although, you know, the reason we're interested in this in the first place is we're like, is GPT-5 existentially dangerous? OK, and the answer is no, I think. But why do we think the answer is no? I think there are multiple reasons. But at least we can say, you know, GPT-5 is just not that good at stuff. Like, you're trying to get it to do data science on these columns that have very similar names, and it's not clear exactly the logic that's led to the columns. It's not doing that sort of thing. I'm like, well, you need to do that sort of thing to be... it's not attacking that. You know, you need to be capable of stuff to be existentially dangerous, and it's not capable of stuff. And so, you know, I see that the AIs are failing at these hard tasks, so I'm like, great, whatever. But if it's succeeding 90% of the time and not 99% of the time, which is very challenging for long tasks, then, you know, I think there are other things that we can retreat to, and I'm prepared to retreat to those things. But at least the naive reason, that it's not existentially dangerous because it's not capable of stuff, I think at least starts to get pushed on as an explanation. So here's another possibility for something we could do. What if instead of time horizon here, we had time horizon whilst being watched very closely? And then, you know, maybe the time horizon when not being watched very closely is like this, but the time horizon when being watched very closely is like this.
That is, you know, I hope that major AI companies have monitoring setups going where, for example, AIs are looking over transcripts of AIs performing work to see if anything suspicious is going on. And, you know, it won't be impossible, perhaps, I conjecture, to fool those monitors, but it'll be challenging to fool those monitors. So maybe you can do 40-hour tasks, you know, dangerous-capability-relevant tasks, without any monitoring, but only two-hour-long tasks if you're being watched very, very closely. Being watched like if there's another model evaluating the output, that's the kind of thing that people can have? Maybe you escalate to humans and... With and without safety controls, you mean? Yeah, yeah, roughly, yeah. And then, you know, now we've bought a load more doublings, and we can sort of keep doing the capability extrapolation thing. That's one of the kinds of things, in addition to creating ever longer tasks. Yeah, I mean, I imagine some of the labs do have evaluations of capability with and without safety, because I'm sure that there's an argument between their researchers and their safety teams. Yeah, yeah. You know, I've seen something about this, but not a lot. Yeah. I guess I think this might be an especially quantitatively important consideration. I expect that it will reduce the effective time horizon by maybe an order of magnitude or two. Yeah, I agree that there are some important senses in which there's not really a difference in time. Yeah, of course, then I would also worry that publishing that encourages people to focus less on safety or to try to argue against safety, because of how it impacts capability. I think there are lots of landmines in all sorts of safety work, not just this one. Oh, first. Okay, next thing. You know, we have this, we have this trend.
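As a rough illustration of what "buying more doublings" means, the sketch below converts the gap between an unmonitored and a monitored time horizon into a number of doublings. The 40-hour and 2-hour figures are the talk's illustrative numbers; the doubling time passed to `months_bought` is a hypothetical placeholder, and a constant doubling rate is an assumption rather than a forecast.

```python
import math


def doublings_bought(unmonitored_hours, monitored_hours):
    """Number of time-horizon doublings separating the monitored horizon
    from the unmonitored one."""
    return math.log2(unmonitored_hours / monitored_hours)


def months_bought(unmonitored_hours, monitored_hours, doubling_time_months):
    """Calendar time those doublings correspond to, assuming the horizon
    keeps doubling at a fixed rate (an assumption, not a forecast)."""
    return doublings_bought(unmonitored_hours, monitored_hours) * doubling_time_months


# The talk's illustrative numbers: 40-hour tasks unmonitored, 2-hour tasks monitored.
d = doublings_bought(40, 2)
# Hypothetical doubling time of 6 months, chosen only for illustration.
m = months_bought(40, 2, doubling_time_months=6)
print(f"{d:.2f} doublings, about {m:.0f} months at a 6-month doubling time")

# "An order of magnitude or two" expressed in doublings.
print(f"10x = {math.log2(10):.2f} doublings, 100x = {math.log2(100):.2f} doublings")
```

On this reading, a one to two order of magnitude reduction in effective time horizon corresponds to roughly three to seven doublings of headroom.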
I spoke about this at the beginning, but, you know, we have this trend. Is it going to continue forever? Is this a fact of the universe, or does it somehow depend on inputs, or do you think about intelligence explosions or something like that? Trying to think about that, where this line is actually going, is a pretty active area of work. Also, you know, the ways in which this line or the particular points don't quite correspond to the thing I care about. So one obvious way is that these models are being judged according to... you know, I think the algorithmic scoring that we use at METR is importantly more robust, or covers more of the relevant concerns, than might be the case in just SWE-bench-style unit tests, but it still has a lot of the same character. There are, you know, considerations like being able to build on this work in future, outside of the immediate problem facing you, that aren't being captured by METR's scoring. And maybe if you did capture that, you'd get something a little bit like going from 50% success to 80% success: you know, you can do hour-long tasks if it doesn't matter whether you can build on the work, but only 30-minute tasks if it does matter whether you can build on the work. So bringing these numbers closer to something I care about a little bit more. And then, yeah, projecting out, both if there are compute slowdowns, and if we are going to enter some regime where AIs are building AIs and that leads to some sort of steepening of the curve, these kinds of considerations. That's another thing I'm thinking about. And then capabilities measurement from new angles. So here's one history of METR that I think is not the accepted history and also probably not a very accurate history, but here's one possible telling.
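The 50% versus 80% comparison can be sketched numerically. This assumes, for illustration only, that success probability follows a logistic curve in the log of task length; the function name `horizon_at` and the slope value are hypothetical, with the slope chosen so that the example reproduces the "hour-long versus 30-minute" numbers mentioned above.

```python
import math


def horizon_at(p, h50_minutes, slope):
    """Task length at which the modelled success probability equals p.

    Assumes success follows a logistic curve in log2(task length):
        P(success) = 1 / (1 + exp(-slope * (log2(h50) - log2(t))))
    Solving for t gives t = h50 * 2 ** (-logit(p) / slope).
    """
    logit = math.log(p / (1 - p))
    return h50_minutes * 2 ** (-logit / slope)


# Hypothetical slope, picked so the 80% horizon is half the 50% horizon.
slope = math.log(4)

h50 = horizon_at(0.5, 60, slope)
h80 = horizon_at(0.8, 60, slope)
print(f"50% horizon: {h50:.0f} min, 80% horizon: {h80:.0f} min")
```

A stricter success criterion, such as requiring that the work can be built on later, would act in a similar direction by shrinking the task length the model can handle at a given reliability.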
You know, near the beginning, and I wasn't there and I have sort of no internal knowledge of this, METR has early access to GPT-4. And there are just sort of Q&A datasets going on everywhere, like LSAT datasets. You're like, you know, GPT-4, it seems so smart relative to stuff that went before. Can it do stuff? So you try it out on some tasks: can it do stuff? And the answer is, you know, it can do some stuff and it can't do other stuff. And people are like, oh, that's cool. You try this neat new kind of thing, getting models to do stuff instead of answering questions. And then later you're like, well, different models come out over time. This model comes out in January, this model comes out in February. Can they do different kinds of stuff if we test them on the same stuff? Then we'll try and think of kind of the most obvious summary statistic of whether they can do stuff, a single data point or number that reflects whether they can do stuff, the time horizon, plot it over time and see what happens. You're like, oh, that's kind of interesting. And then you're like, well, what's the next, in some sense, kind of dumbest or most obvious thing you can do? Well, right, kind of the most obvious RCT design: AI allowed or AI not allowed. And then we'll see what happens and, you know, it'll be messy. There are lots of methodological problems that people point out, as there are with this work, but they're different kinds of problems. There are different pros and different cons, and maybe, with these two different things giving two different answers and having two different sets of pros and cons, we can kind of triangulate the truth from that. And now I'm like, well, can we pull that rabbit out of the hat one more time?
Or multiple more times? Are there other sources of evidence that have, you know, different pros and cons, that I won't believe in fully, but they have different pros and cons and they might give different answers, and so on and so forth? Here are two suggestions for things I'm curious about at the moment. The first is in-the-wild transcripts. So, you know, agents in Claude, or in Claude Code, and whatever other products or services, they leave behind traces, traces of the diffs that they've contributed to code, or of their actions, their reasoning chains, and so on and so forth. The traces that they leave in the wild are importantly different from this, where it's more kind of contained and the tasks are sort of neatly packaged and stuff. This is going to be, you know, like the example with the many different columns that are very confusing. This is going to be whatever real crap shows up in the wild, and how the model handles that. There are important reasons why you shouldn't believe that kind of information. It's not very experimental. It's hard to know exactly what to make of it. But it does have these important pros: it's more real, and the data is enormous. Perhaps the data on transcripts is enormous, and perhaps there's a lot you can learn there. That's one thing. And then here's another one. There's this group which you guys should check out called Agent Village, AI Village, sorry, where they have a lot of different models or agents kind of living in this village, occasionally talking to humans, trying to accomplish fuzzy goals that are set for them, basically using computers. They try and do stuff like, you know, organize this event at the park, or run a few human subjects experiments, or run this merch store, stuff like that that's not so clearly specified. And basically all the time they find that the models fall on their faces and suck.
And there are lots of reasons not to believe this evidence. Here are some of the reasons. Number one, it is using computer use, and I think computer use capability is considerably worse than CLI-based stuff at the moment, text-based things in general at the moment. And maybe you care more about text-based things, because that's more relevant to very subtle things you care about, and also lots of GUI-based things can be converted into text-based things. And, you know, there are all these different models hanging around in the village. I'm like, why are there so many models? Why is there a village instead of just some big agent orchestration setup? I don't really understand what's going on there. Anyway, lots of reasons not to believe it, but on the other hand, it is models doing stuff in the world. It's not, like, benchmark-style tasks. It's trying to accomplish some goal, and they can't accomplish even, you know, very basic subsets of the goal. And I feel like that's extremely interesting. And I wonder if you could get rid of some of the most obvious cons: make this only text-based, give them some relevant text-based tools, work a bunch on the elicitation to make these models more performant, get rid of the less performant models in the village, and so on and so forth, but then try and get them to do these fuzzy goals. And just observe where they mess up. Like, you know, they went about step one, it went great, but then they became incoherent, or they went into a strange psychological basin with one of the other models, or they weren't able to interact with external services in the appropriate way, or figure out their resource use. I'd be very interested, just kind of qualitatively, in what goes on when you do that.
Again, keeping in mind that, at least at the moment, I'm most interested in the ability of AIs to automate R&D, and speaking to why that's not the case at the moment, and why that might not be the case in the future, something shaped like this seems like it might be kind of squarely pointed at why that's not the case. I'm not sure exactly what's that, but yeah. My observation is that they are effectively neurodivergent individuals, right? And our world was not built for that. Yeah, everything that we have is designed for a human to do, shaped and sized to a human. It's just like the military: how big are packs? Well, it's based on how much they think a person can reasonably carry, right? And how much do we expect someone to handle for their taxes? That's based on what we think a human can do. And if you think about neurodivergent individuals, they struggle with the way the world's expectations don't align with them. And compared to a neurodivergent individual, these intelligences are really, really different, right? And so with all of the rough edges where they don't align with our world, that's why they need a human assistant in order to accomplish anything real; our world is just too hard for them. Currently. What? Currently. Yeah, I think, hey, it's amazing. Okay, yeah. They're just hopeless. Yeah. Either they have to get really good, or the world will have to change, one of those two things. I agree. I so strongly share the sense, but if you ask me to really pin down why that is the case, again, when they're, you know, beating GPQA, with some of these extremely hard science questions and blah, blah, blah. Like, actually, what? Why are they not able to accomplish things in the world? You've met a neurodivergent individual who wasn't terribly good at something. They still struggle getting through life. Yeah, yeah.
They're all very good at reading books. Yeah. Yeah. There's a lot of those people in the world, right? It's not that surprising. And honestly, my only feeling about it is, it's like, well, today is the 200th day my car didn't rocket off the earth at escape velocity and fly to the moon. Like, that's because you didn't build a rocket. Yeah. There was a lot of talk a year ago about, you know... Maybe I mischaracterized it, but I thought there was a lot of talk a year ago about computer use capability as being impressive today. There was. There was a lot of talk about it. And yet, I have talked to almost nobody who has used them for anything practical. Yeah. Totally, totally. Yeah. But if we move this to text only, and it seems reasonable to complete text only, would you still have the rocket concern? No, I wouldn't have it. I would really want to. I do want to do it. It depends on what the task wants to do. Yeah. Yeah. The kind of thing that a human could do over a CLI only. So I think there was an Anthropic talk that really resonated with me, where they talked about how one way to use them effectively is, if you have a task, figure out a way to present the task or transform the task into something that is in distribution for the model. And I feel like this conversation kind of ties in with that: interacting with Chrome is less in distribution than a CLI. So I think that could be an interesting area of research. OK, so if you're interested in exploring how well it can perform at a really open-ended task, first, I guess, create harnesses and create an interface that is much more in distribution for them, so that's less of a concern. Yeah.
I mean, I think that also speaks to the point about neurodivergent models. You know, it's also not different from management skill or something, giving appropriately scoped tasks to your very talented interns, or very talented neurodivergent interns, something like that. I do think that's right. Sorry to be so repetitive, but from the perspective of capability explosions and automating R&D, you know, I think maybe the models will get extremely good at scoping tasks for themselves, such that it's benchmark-like or something like that. But if they can't do that, I'm like, well, there are a lot of things that don't look like benchmarks that crop up in the real world, and you do need to be able to flexibly work with that if you're to do something as complicated as automate a major AI company. And so I do think it can both be the case that the AIs are incredibly performant on some particular type of problem, or if you make other types of problems more similar in scope or shape to the type of problem that they're best at, and also that they can't flexibly substitute for human workers, because that requires, you know, yourself setting up the problem in a way that's appropriate, or not having those constraints yourself. Yeah, it is interesting though, just on your point about new capabilities, thinking of other axes on the graph that you have, because I wonder if there's not just a time horizon issue, but a task category or type-of-work category. Like, your example of computer use is one of those examples, right? If we think about the capability of computer use versus, OK, the ability that requires computer use versus the capability that can be done purely in text. Yeah, so, yeah, sure.
But, like, almost all these benchmarks are basically text. Yes, yes, and indeed, you know, the ones that aren't, the ones that require sort of vision capabilities, are notably lower. Yeah, I'm not sure exactly what to make of this graph. One thing I make of it is that, you know, there probably is maybe not so much variation in slope, or doubling time, across task distributions. I think it's only weak evidence for that. But in intercepts, or, you know, the base of where we are now, there's possibly a great deal of variety, especially on this sort of image-like capabilities versus not dimension, but physical abilities even more, you know. Yeah, right. So, I mean, you could even go through the senses, like tactile. Like, today, they would all score zero. Nothing has tactile. So it can't tell you anything about tactile. Well, you know, in producing this graph, we try and make the models as performant as possible on some held-out set. So, you know, we'd try and give them some tactile stuff, I don't know, so they perform there. Sure, sure. But on spatial, we do have some examples. Yeah. Spatial judgments, stuff like that. Yeah. You know, we've all seen it for fine motor control and stuff like that. It's just, I don't even know if anybody, maybe somebody has listed out all the capabilities that we would expect in the future. Like, if we actually wanted AGI, what is the entire list of capabilities? That's the way to start a debate that doesn't end. I think Basil Halperin and Arjun Bramani hopefully have a paper on this. That's one of Bramani's. Yeah. And then we have to think about where we are at, and do all capabilities follow the same.
All capabilities that are currently measured, do they follow the same law? Yeah. It does seem like a reasonable null hypothesis to you as well as me, I think. Not a certainty, I mean, you know, who knows? Yeah. Oh, there was something I wanted to add there. Oh, yeah. Here's another thing I think about, not super in a research capacity, although kind of. So, you know, some people like me are sort of skeptical of a software-only singularity, that is, the idea that you could automate AI research without also automating chip design, and maybe also chip production as well. You could quickly get bottlenecked by compute. If there's fixed hardware, there are only so many experiments that you can run on it that will be sufficiently productive to progress upwards. But, you know, even for people like me who are skeptical of that, you might think that in fact chip production is going to get automated, you know, the robots, they're coming, they can do the stuff that humans do, and then maybe you really do have a fully self-sustaining robots-plus-AI economy. And so you have some slowing of the trend from compute slowing down, but then you have sort of a booming back up once the whole thing is in a tight loop. One interesting debate that I heard about recently and would like to think about more is, you know, I think in the public discussion there's some sense that, why are robotics capabilities lagging LLM-like capabilities so much? Well, it's to do with training data, or something like that, or maybe it's to do with hardware constraints. Yeah. I'm curious if it's not to do with hardware constraints. Like, what exactly are these hardware constraints?
If we put superintelligence, like, classical superintelligence, inside of, you know, hardware parts that exist today, could it build chip production facilities? And I don't know, because, you know, I'm out of my depth, but it's not obvious to me what the answer is. I think it's kind of plausible. I'm not sure you need this, like, very flexible fine motor control in order to do it. Also, I think maybe the fine motor control is there, subject to having superintelligence in control. To be fair, like, the key aspects of chip production are... I'm done, I'm sorry. But I'm also thinking, like, building the robots. And, yeah, the whole, you know. And that's where I'll say, I have a friend who spent most of his career doing software development, but during COVID started working on manufacturing things like PAPRs and things that help people. And he found out how hard the manufacturing world is and how slow the iteration process is. And, as he put it, he knew it was going to be worse. He didn't understand that it was, like, next level, like, an order of magnitude worse. And I think that probably, you know, from our perspective, people who don't do it see it as, like, how bad can it be, right? And the feedback I've had from everybody who actually works in that space is way, way different. That's what I've heard as well. I've only talked a little bit with people who work in fabs and stuff, but I was surprised when I did talk to them at the level of human expertise required. Yeah. In order to work at the fab, a lot of those jobs are fairly high paying, actually, and they're jobs that are hard to, like, succeed in. Also the rate of improvement is actually glacial, right? Compared to software, right? I think also because it costs a billion dollars to build a factory. So each iteration is a huge cost. It's brutal.
So I think that's why it's been hard to develop. I mean, just, like, give them a couple more centuries. Maybe they'll get it done. Wait, do you think centuries? I do. I do think I'm just skeptical, like you, about how easy some of these tasks are. Yeah. We think they're easy, but in my experience, like, I don't know, when this self-driving thing came out, when people were pushing ahead, I actually worked in that space for a while, and it was like, I get that we can get really close to it, but getting all the way to something that is acceptable is extremely difficult, right? And we underestimate how much work is involved in getting that last little bit done. The first 90%, I knew we could do it with computers, like, 10 years ago, pretty much, but getting to the last bit, so that everyone's happy with it... Yeah, I feel this myself. I didn't get a driver's license. Yeah, it's okay. Because I expected self-driving cars to come. Yeah, I think it's obviously late, but it hasn't been that long. And they're expanding to the entire Bay. I'm sure it's serious. They're going to get there. I think it's going to take a lot of time. Like, is the robot economy building chip production going to take centuries? I don't know about that. I can see that it might take... so, part of the trick with self-driving is that the economic incentive is moving it along fast, right? And probably the robots-building-robots kind of thing would also... yeah, where we're at right now is, like, RepRap is kind of as far along as we've got with robots building robots, right? Oh, but I feel like, you know, is that paying sufficient attention to the charts? GPT-2, 2019. It's so recent. I know this is so nonsensical, but I'm like, maybe we're in a sort of GPT moment. Yeah, it's a fair point. Yeah, I could be wrong. It's just that my guess is it's going to take a lot longer than we think for it to be able to do, like, real mass production. Yeah.
At a scale that causes the kind of global impact that you're talking about. Yeah. I think they can already do a great job building one-offs, right? Robots are very good at doing one-offs at small scale, but it's totally impractical for doing it at a large scale. There is one fact I think is kind of remarkable. Is this, maybe it's this, is it this? Yeah, yeah. The rate of growth of compute put to robotics models lags behind, sorry, it's about the same, but the level is two orders of magnitude different. I am curious, if that gap closed, what we'd see. It does seem like at least more capable robots are, in some sense, very on the table as something that could be the case very soon. No, I'm not saying all the way. I'm certainly not saying chip production. It just does seem like there's some sort of data overhang. Yeah, yeah. True to me. It's interesting. Also, thinking about it, you don't just need to be scaling data, you can also scale parameters and use the same amount of data; that's a way to use compute to close the gap. I see. Yeah. So o3 just gave me a very interesting overview of where AI is going in fabrication. And what's it say? So it says there are a lot of areas where it's going to help, probably pretty dramatically, in the future. And a lot of it is in computational aspects. There are a lot of problems in lithography with the expense of designing, like, the mask. It's the hole that you're using for the laser to get the transistors. And calculating that, how to build it, and ensuring that it conforms to the spec that you've written, is basically extremely computationally expensive. And there's a lot of opportunity for AI to help there. And there's also theoretically the possibility for, so, like, obviously, the manufacturing is extremely precise, but also fragile.
And the opportunity for an AI to detect parameters that are basically out of whack and leading to failure, potential failure, and, like, nudging away from it, could theoretically improve yield, and yield is a big problem, in fact, in manufacturing. Like, the reason that you get different speeds out of your CPUs is because they actually have the one line that produces all those CPUs, and some of the components come out better and some of them come out worse. And that's why the higher-GHz models are more expensive and the lower ones aren't. Like, if you're Nvidia, your home GPUs, your 50, or 70, or 80, 90 are all the same chip. Right. They just had different quality. Different levels of faults, essentially. No. But the problem is that. That's our recording. It's going to actually cut out soon, but feel free to continue your discussion. Yeah. Cool.

How METR measures Long Tasks and Experienced Open Source Dev Productivity - Joel Becker, METR

TL;DR

Takeaways

Vocabulary

Transcript