
Lessons from Scaling GitHub's Remote MCP Server — Sam Morrow, GitHub

TL;DR

  • GitHub's MCP server initially faced challenges with agent performance due to an overwhelming number of tools and context window explosions, compounded by users' reluctance to configure custom settings.
  • The team addressed these issues through aggressive context reduction techniques, including optimizing tool focus, grouping functionalities, and significantly reducing tool output tokens.
  • Security remains a critical focus, with efforts to promote secure authentication via OAuth/PKCE, filter tools by token scopes, and develop strategies against prompt injection and data exfiltration.

Takeaways

  • Avoid context explosion by limiting tools: More tools do not equate to better agents; they lead to confusion and increased context window usage. Prioritize a focused set of tools tailored to common use cases.
  • Optimize for context reduction through tool design: Drastically cut output tokens of tools, group related functionalities (e.g., CRUD), and tailor tool selection to usage patterns to manage LLM context efficiently.
  • Batch server-side API calls: Encode agent intent into tool surfaces to consolidate multiple API calls into a single server-side operation, reducing round trips, saving LLM context, and improving agent success rates.
  • Prioritize secure authentication and scoped access: Utilize OAuth 2.1 with PKCE for secure connections, automatically filter available tools based on token scopes, and enable step-up authorization for incremental permissions.
  • Mitigate prompt injection risks with cautious tool exposure: Recognize the inherent conflict between agent utility and security, and carefully manage tool availability based on user permissions and risk profiles.
  • Implement evaluation suites for tool interaction: Run evals (evaluation suites) that test tools collectively, ensuring they are called at the right times and don't conflict, so that overall agent performance improves beyond what tuning individual tool descriptions can achieve.
  • Design for stateless, dynamic tool loading: Implement an architecture where a new server instance is created per request, dynamically provisioning tools based on user configuration and access policies for scalability and flexibility.
  • Integrate human-in-the-loop workflows: Allow users to review and edit AI-generated outputs (e.g., GitHub issues) to maintain quality, ensure human oversight, and prevent content from being perceived as purely bot-generated.
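
The scoped-access, read-only, and per-request provisioning takeaways can be sketched together as a single filtering step. This is only an illustration under stated assumptions: the `Tool` type, the `CATALOG` entries, the toolset names, and the scope each tool requires are all hypothetical, not GitHub's actual catalog or code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    toolset: str                     # grouping the user can enable or disable
    required_scopes: frozenset[str]  # token scopes the tool needs to succeed
    read_only: bool                  # mirrors a read-only hint on the tool


# Hypothetical catalog; names and scope requirements are illustrative only.
CATALOG = (
    Tool("list_pull_requests", "pull_requests", frozenset({"repo"}), True),
    Tool("create_issue", "issues", frozenset({"repo"}), False),
    Tool("list_notifications", "notifications", frozenset({"notifications"}), True),
    Tool("run_workflow", "actions", frozenset({"repo", "workflow"}), False),
)


def build_tools_for_request(token_scopes, toolsets, read_only=False):
    """Return the tool names to register on a fresh, per-request server.

    Nothing is cached between requests: the list is recomputed from the
    caller's token scopes and configuration every time.
    """
    scopes = set(token_scopes)
    selected = []
    for tool in CATALOG:
        if tool.toolset not in toolsets:
            continue  # user did not enable this toolset
        if read_only and not tool.read_only:
            continue  # read-only mode hides anything that writes
        if not tool.required_scopes <= scopes:
            continue  # token could never call this tool successfully
        selected.append(tool.name)
    return selected
```

Called with a token that only carries the `repo` scope and the `pull_requests`, `issues`, and `actions` toolsets enabled, the sketch would expose `list_pull_requests` and `create_issue` and hide `run_workflow`, so the agent never spends context on a tool that is bound to fail.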

Vocabulary

  • MCP server: GitHub's server implementation of the Model Context Protocol (MCP) specification, which exposes GitHub functionality to AI agents as tools.
  • Context window: The limited amount of input an LLM can process at one time, including prompts, tools, and previous turns.
  • Tool sets: A grouping concept for related product tools that users or agents can select to manage the available functionalities.
  • Output tokens: The tokens a tool call returns; in this talk, mainly the tokens a tool's response adds to the agent's context, which directly affects the cost and efficiency of LLM interactions.
  • Hallucinate: When an AI agent generates incorrect, nonsensical, or fabricated information with high confidence.
  • Evals: Evaluation suites used to test the performance and interaction of multiple tools within an agentic system.
  • PKCE: Proof Key for Code Exchange; an extension to OAuth 2.0 designed to prevent authorization code interception attacks.
  • Prompt injection: A type of attack where malicious or cleverly crafted input manipulates an LLM to perform unintended actions or reveal sensitive information.
  • Stateless server: A server architecture where each request is processed independently, and no session-specific data is stored on the server between requests.
  • Human-in-the-loop: A system design where human intervention is intentionally incorporated into an AI workflow, often for review, correction, or decision-making.

Transcript

All right, hello London. I hope everyone's been enjoying AI Engineer Europe so far. There are so many amazing speakers. I've been watching talks and talking to people for days now and it's been immense. I'm Sam, a lead developer of GitHub's MCP server. And yeah, I'm here to talk mostly about the challenges we've faced building and scaling our remote server, and how we overcame them.

Before I start, I just like messing with people, so a quick show of hands. Who's used an MCP server? Good, good. Who's used GitHub's? Who has a hot take? No, no. And has anyone built a server or a client? Oh, nice, quite a few. And has anyone contributed to the specification? Oh, yeah, I got one. That's actually the first one, I think, other than at the MCP Dev Summit, where there were quite a lot of them. But anyway, it's really awesome to see so many hands, so I'm glad I've actually come to the right place.

Our MCP journey started, at least in public, in April last year, when we open-sourced our local MCP server. We've just turned one year old, so I'm super stoked by that. Back then, there was a tremendous buzz. We were the most-starred repo on GitHub that particular week, and the exposure meant we got a high volume of public contributions, rapidly filling gaps in platform coverage where people wanted to add tools and things.

But not everything was perfect, right? After a month or so of new features, agents were, in some ways, getting worse at using GitHub, and context windows were getting blown out quicker. We peaked at, I think, over 100 tools, and certainly at the time that was just too many. LangChain had already produced research, published in February, on the exact kind of problems we were seeing. More tools don't make better agents. They get confused and forgetful. Well, I say more tools; to be precise, it's more context, and more tools shoved directly into the context.
But yeah, GitHub's a really expansive platform. We provided tools for repos, issues, PRs, Actions, Projects, and even more things. The hard part of solving this was that we didn't want to prevent users from having the individual tools they needed and used. And suffice to say, our user base is pretty diverse. Probably even on the GitHub platform at the moment there might be one or two claws as well. And for the record, there's a team of us who work on it. It's not just me, and my team is awesome.

So to try and fix some of this, I quickly added this thing, tool sets, which was a kind of grouping concept for related product tools. Users could just pick which ones they wanted and configure it. I also added a dynamic tool selection thing, where agents could discover sets of tools and then turn them on in chunks. And we never released it, but I made a kind of RAG version of the same, for semantic tool search and discovery.

But what do you think happened, in spite of all this stuff? Everyone used the default settings. It was really annoying because, in a way, we had all these elegant solutions. All they required was for users to actually configure the JSON a little bit, and most users just don't. Maybe it's even partially a spec problem, because every proposal so far to add grouping to the MCP specification has been rejected, for various reasons, and there have been several attempts. And in a sense, every mode or configuration we add, one could argue, is papering over potential gaps in the spec, or gaps in client implementations.

As an example, we have a read-only mode, and roughly 17% of our users use it. It maps one-to-one to the read-only hint annotation, but no client exposes that as a method of filtering servers. I think some gateways now do. Anyway, it's an interesting, easy win for more enterprise use cases, where people often only want that. But yeah, we needed to find better solutions for context reduction.
You don't need to worry too much about the specifics; this is dated now. But we started trying to optimize, and we looked at the usage patterns on our remote server. Initially, we cut the amount of context used by focusing the tools more specifically on the general case, based on usage, to get about a 49% reduction of the initial load. We subsequently also grouped CRUD tools and brought that down even more. I think you get about 40 tools if you use the default configuration, and then you can expand or contract that based on your own preference. It's easy to customize.

We've also recently had a massive push to reduce the output tokens of a lot of tools. In this example, just by tailoring exactly what comes back from list pull requests, it's actually lost more than 75% of the tokens used in the output. So in terms of how token-hungry GitHub's server is, it's a moving target. We're constantly changing things to improve it, and if you haven't used it in a while, it's likely very different from even a few months ago. We haven't ruled out more advanced approaches like code mode, and we're always experimenting internally.

On the heels of this, we also dug into our data and found some more opportunities. We made a big push to reduce tool failures as well, and the success rate is, I think, over 95% at this point. Not all failure is preventable, because agents don't necessarily know which repos they have write permission on, and they still hallucinate. But we've been able to identify a significant number of errors that could be overcome, mostly by encoding the agent's intent into our tool surface. You might have to make five API calls to make it more robust, but in that case we do that on the server side to reduce round trips, because that saves context, saves time, and usually makes for a massively better experience and makes the agents more successful.
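
As a rough illustration of the server-side batching and output trimming described above (not GitHub's actual code), the sketch below has one tool call fan out into three backend calls and return only a compact summary. `FakeGitHubAPI`, its methods, the `pull_request_overview` tool, and the choice of returned fields are all hypothetical stand-ins.

```python
class FakeGitHubAPI:
    """Stand-in for a real API client; records which backend calls were made."""

    def __init__(self):
        self.calls = []

    def get_pull_request(self, repo, number):
        self.calls.append("get_pull_request")
        return {
            "number": number,
            "title": "Fix flaky test",
            "state": "open",
            "user": {"login": "octocat", "id": 1, "avatar_url": "https://example.invalid/a.png"},
            "body": "A long description the agent rarely needs in a list view.",
            "head": {"sha": "abc123"},
        }

    def list_reviews(self, repo, number):
        self.calls.append("list_reviews")
        return [{"user": {"login": "hubot"}, "state": "APPROVED", "body": "LGTM"}]

    def get_combined_status(self, repo, sha):
        self.calls.append("get_combined_status")
        return {"state": "success", "statuses": []}


def pull_request_overview(api, repo, number):
    """One tool call for the agent, several API calls on the server.

    The agent states its intent once; the server does the round trips and
    returns only the fields the agent is likely to need.
    """
    pr = api.get_pull_request(repo, number)
    reviews = api.list_reviews(repo, number)
    status = api.get_combined_status(repo, pr["head"]["sha"])
    return {
        "number": pr["number"],
        "title": pr["title"],
        "state": pr["state"],
        "author": pr["user"]["login"],
        "approvals": sum(1 for r in reviews if r["state"] == "APPROVED"),
        "checks": status["state"],
    }
```

In this sketch the agent pays for one tool call and six short fields in its context, instead of three tool calls and three full API payloads.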
And yeah, we also started to run evals last year. I'm not going to go into detail. That link takes you to a blog article that my colleagues wrote about doing it. But the gist is that instead of micro-optimizing individual tool descriptions, you test them against each other to make sure they're called at the right times and not called at the wrong times, so that in the pool with each other, they don't fight. The perfect tool description that makes the agent call it all the time is terrible, as is the reverse of that. So you need to try and get that as tight as possible. It could be a whole other talk.

Security, on the other hand, is a kind of constant menace in all of this. I've seen lots of people talking about it. And it's a real problem in some ways for us, because we have a lot of people using plain-text access tokens for MCP in the wild. Usually they're stored somewhere the agent can access. They're frequently long-lived. They're often overprivileged. And they're kind of sat there just waiting to be abused. End users, I don't think they're choosing this. It's actually hard to make configuration easy and secure at the same time, and clients have to make use of system keyrings or encrypted storage, like VS Code does.

But the MCP spec also provided a better way with remote HTTP, which goes all the way back to April last year as well. We embraced this, of course. We wanted to make the secure connection the path of least resistance, and we didn't want users to have to download a local runtime. Our remote server supports OAuth 2.1, and my team even helped add Proof Key for Code Exchange support, which is commonly known as PKCE, to GitHub's authorization server to improve the security posture for client apps. But as I said, we hoped OAuth would be the path of least resistance. And again, perhaps some of you know what happened. Everyone expected us to support dynamic client registration.
And for us, it creates more problems than it solves. If you implement it kind of properly, it's hard not to have unbounded growth of app databases, and challenges around how you would bucket them for rate limits. And there isn't a reliable app identity. So we considered it and rejected it. We feel like it was a well-intentioned mistake. We're not the only authorization server not to support this, and even MCP itself decided that client ID metadata is probably the way to go. I can't promise that we're going to support it, but I promise that I am trying to get us to support it, and that should make logging in massively easier. More on that in the future.

Also, speaking of security, some of you may have seen this. This was a fun day. Invariant Labs published this, and it's a correctly done prompt injection exfil attack for getting private data out of GitHub. The thing is, they called out GitHub's MCP server specifically. And I think we do provide the tools that can enable that, if you just enable them all. But it applies to almost every agent setup, whether they use MCP or not, and whether they use GitHub's MCP or not. It's the lethal trifecta stuff, which I'm not going to rehash now, because I think many of you have probably seen it, or you can look it up. Simon Willison's blog post on that is excellent.

But the utility of agents is in direct conflict with protecting against this stuff. It's an active space, trying to work out how to prevent these problems, but it's not solved, and it's very much not unique to GitHub. We have users with wildly different risk profiles. We even have people with air-gapped GitHub Enterprise Server instances in much more secure environments. And then obviously the claw bots, et cetera, are also just running straight against GitHub with probably full token access for the agent and everything. That's also interesting. And I'm not criticizing any of this.
It's just cool to see what people do, and to see if we can actually support the different use cases and security postures while everyone experiments with this stuff.

We also lean on auth to manage tools, and this is something I'm pretty happy with. If you log in to GitHub MCP with a PAT, we immediately filter the tools down by the scopes that the token has. You don't have to do anything other than give it the token. With OAuth, we support step-up auth, so we can return a scope challenge, and then the client will interactively ask the user if they want to allow the scope. If they do, the tool call can continue. It doesn't fail, which I think is also nice. VS Code, for example, supports that, and I initially worked on this with them, just because they already have a token to use GitHub. What they wanted was that if their baked-in token doesn't have permissions to use everything, then instead of just failing, there's a mechanism for users to have a clean install and then upscope later if they need it. And lastly, server-to-server tokens as well, like you have on Actions and things. They don't have a user, so user-specific tools are out. By removing those, we're removing constant sources of failure and wasted context at the same time.

We run a completely, sort of, stateless server setup. We've been using Redis for session storage, and it's a standard observability kind of stack. This is not a weird picture. But I guess one of the weird things for some people is that a lot of people are running a stateful MCP server process, in the singular, and have kind of struggled with how you get it into this shape. For us, we did a few things, because it's very dynamic. One of the fun things we did is we actually make a brand new server instance, in the SDK sense, on every single request. And we add the tools to it at the start.
So whatever your configuration is, it just builds this, and then you get what you've asked for, or what you're allowed to use, because some things have policies that impact whether you've got tools or not. And yeah, we've been able to scale to this point. We serve around 7 million tool calls a week, and we don't have session affinity. Even the sessions, we generally only use them because they're the only way to identify the self-reported client identity that comes through MCP. It's useful for us to understand what clients people are using the server with, so we use sessions for that.

We've also wanted to bring experiments to all of you. We have this thing called Insiders mode, and all it does is turn on certain feature flags and things for experiments that we're happy to ship to anyone who wants to use them. This just takes you to the documentation. An example of something that we haven't released generally yet, but is on Insiders, is our MCP Apps. I set up the example before I came in. It's quite nice when you're talking to the agent to have the opportunity to edit the AI-generated issue, especially if you're working heavily in professional open source. You want to make sure that it's you posting, and that it's not going to get closed as a bot-generated thing. This is a nice human-in-the-loop thing that MCP enables. I wasn't sure how much I would like it at first, but I've come to love it, because I care about how my issues and things are received by people, and this is just a really great way to make sure I can check that.

So, in terms of where I think it's going, something along these lines: I think in the near future, server discovery will hopefully be automatic.
And tool use will probably become more compositional, like bash, piping tools into other tools and streaming data between them, or Cloudflare's code mode approach, or Anthropic's tool search tool API, which just landed in Claude Code a couple of weeks ago. OpenAI recently added a similar API too. I fully expect that thousands of tools will be normal very soon. We're trying to iron out all the problems that prevented it in the first place, and we'll probably reverse many of the fewer-tools decisions. Users hopefully won't even have to know what MCP is. They'll just convey what it is they want to do, and the auth setup and the tool selection will become truly autonomous. I don't think we're that far away from this, but we're in this experimental phase where we're not really there yet.

I think that harnesses like Pi are also interesting, because you can build a weird client yourself that maybe optimizes this in a really good way. So I would encourage people to experiment with crazy clients. You never know: if you're super lucky, you could be the next Claw. You could publish something that goes so viral that it totally changes the agentic game.

I wanted to end on a high and look at some numbers. So, on GitHub itself, we've actually got over 11 million Docker downloads of our stdio server, which is by far not the most used version of it either. We've got 126 contributors now, and over 2,300 issues and PRs, which has been over seven a day, every single day, for over a year now. I do look at almost every single thing eventually, so it's been quite a year. I mean, others have it even worse, but I also love it, so please keep doing it. And yeah, we've got almost 4,000 forks, which blows my mind. I kind of want to know the weirder things that people have done that they haven't contributed back. And nearly 30,000 stars.
And we're fast approaching 8 million tool calls a week. GitHub itself is also facing a new challenge. This is really intense, right? And it shows no sign of slowing down. I still want you to keep opening issues and PRs for us. We will cope, but this is new territory. Everything's mildly on fire for everyone, I think, these days, and it's just exciting and fun.

But yeah, thank you so much for having me. I think I've got like 30 seconds. I don't know if anyone has anything they want to ask, but I think things like trying out MCP CLIs are a fun avenue. I don't think it's entirely ironed out, but one thing you could do is take the read-only tools from some MCP server, wrap them in a CLI, give it proper help output, and just see how the agent does with it. Stuff like that is surprisingly effective. And like I say, I want people to mess with this stuff, so I would encourage you to just try it if you're interested. All right, I'm at zero seconds. I will answer you, but in person, if that's OK.
