
Running LLMs locally: Practical LLM Performance on DGX Spark — Mozhgan Kabiri chimeh, NVIDIA

TL;DR

  • Modern AI development runs into local memory and software-stack limitations, often forcing developers to offload workflows to the cloud, which can slow down iteration and increase costs.
  • NVIDIA's DGX Spark, powered by the Grace Blackwell Superchip, offers a powerful local solution, enabling developers to run and prototype large language models (up to 200 billion parameters) efficiently from their desktop.
  • Leveraging 4-bit floating point (NVFP4) quantization is critical for practical local LLM performance, significantly improving both throughput and perceived responsiveness (time to first token) for larger models on local hardware.

Takeaways

  • Address Local Dev Challenges: The increasing demands of AI models on developer systems necessitate local solutions to overcome memory constraints and ensure access to the correct software stack, thereby improving iteration speed and reducing cloud dependencies.
  • Leverage DGX Spark for Local AI: The NVIDIA DGX Spark, featuring the GB10 Grace Blackwell Superchip and 128GB of unified memory, provides a standalone system capable of running and building AI models up to 200 billion parameters locally, using the same NVIDIA AI software stack as production environments.
  • Prioritize Time to First Token: While end-to-end latency is important, time to first token is the critical metric for defining user-perceived performance and responsiveness in AI applications, reflecting how quickly the model begins to generate output.
  • Utilize Quantization for Efficiency: Employing techniques like NVFP4 (NVIDIA's 4-bit floating point quantization) is crucial for optimizing LLM performance on local hardware, dramatically increasing completion tokens per second (throughput) and reducing time to first token for larger models.
  • Understand Memory Capacity vs. Bandwidth: While the DGX Spark offers ample memory capacity (128GB) to fit massive models, actual performance metrics like throughput and responsiveness are primarily limited by memory bandwidth, emphasizing the importance of efficient data formats like NVFP4 to maximize "intelligence per byte."
  • Automate Benchmarking: Implement an automated benchmarking harness that includes strict protocols like environment isolation, mandatory warm-up runs, and detailed GPU metrics logging to ensure reproducible and verifiable performance data.
  • Enable Seamless Dev-to-Prod Workflow: By running the exact same NVIDIA AI software stack locally on DGX Spark as used in data centers and the cloud, developers can build, prototype, and fine-tune models rapidly, with minimal changes required for eventual deployment to production.
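
The capacity side of the "capacity vs. bandwidth" takeaway can be checked with simple arithmetic. The sketch below is illustrative only: it counts weight bytes alone and ignores the KV cache, activations, and the per-block scale factors that 4-bit formats carry, so real footprints are somewhat larger. The function name is hypothetical; the 128 GB and 200 billion figures are the ones quoted in the talk.

```python
def weight_footprint_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate size of the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


UNIFIED_MEMORY_GB = 128  # capacity quoted in the talk

if __name__ == "__main__":
    for label, params in [("14B", 14e9), ("200B", 200e9)]:
        for fmt, bits in [("16-bit", 16), ("4-bit", 4)]:
            size = weight_footprint_gb(params, bits)
            verdict = "fits" if size < UNIFIED_MEMORY_GB else "does not fit"
            print(f"{label} at {fmt}: ~{size:.0f} GB of weights, {verdict} in {UNIFIED_MEMORY_GB} GB")
```

Under these assumptions, a 200 billion parameter model needs roughly 400 GB of weights at 16 bits but roughly 100 GB at 4 bits, which is why the 200 billion figure is tied to 4-bit support.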

Vocabulary

LLM (Large Language Model): A type of artificial intelligence model trained on vast amounts of text data to understand and generate human-like text.
Unified Memory Architecture: A system design where the CPU and GPU share access to the same pool of physical memory, reducing data transfer overhead and improving performance.
vLLM: A high-throughput serving engine for large language models, optimized for efficient inference through techniques like continuous batching and PagedAttention.
Quantization: The process of reducing the precision of a model's parameters (e.g., from 32-bit floating point to 4-bit integers) to decrease memory footprint and accelerate inference speed.
NVFP4: NVIDIA's 4-bit floating point quantization format, specifically designed to accelerate AI inference and improve memory efficiency on NVIDIA hardware.
Throughput: In AI inference, a measure of how many output tokens or requests a model can process per unit of time, typically expressed as tokens per second.
Time to First Token: The latency from when an LLM inference request is sent until the very first token of its response is received, crucial for perceived user responsiveness.
Memory Bandwidth: The rate at which data can be read from or written to memory, a critical factor determining the performance of data-intensive workloads like large language models.
Grace Blackwell Superchip: A high-performance NVIDIA chip architecture integrating CPU and GPU components with unified memory, optimized for AI and high-performance computing workloads.
Developer Relations: A role or department focused on building and maintaining relationships with the developer community, providing support, tools, and resources.

Transcript

Hello everyone, I'm Mozhgan Kabiri chimeh, developer relations manager at NVIDIA, where I work closely with developers building and deploying AI systems. Today, we're looking at running LLMs locally: practical LLM performance on the DGX Spark. This isn't a theoretical talk; it's a data-backed look at the true trade-offs of modern AI infrastructure. The findings are based on hands-on experiments, with the goal of understanding what's actually practical on a single system.
The evolution in AI puts greater demands on developer systems, creating two main challenges. You either run out of memory or you do not have access to the right software stack, and then you end up pushing everything to the cloud or data center. As models move from experiments to production, concerns like cost predictability, data residency, and deterministic latency take center stage. Iteration speed often depends on access to shared infrastructure, and since your work is scheduled against other competing workloads, it causes delays in development work as well. The question becomes: can we bring some of that workflow closer to where development actually happens?
To maximize developer productivity, local solutions to these challenges are required. The DGX Spark is designed from the ground up to build and run AI and can be used as a standalone system. It's powered by the GB10 Grace Blackwell Superchip, combining CPU and GPU with a unified memory architecture. With 128GB of unified memory and FP4 support, it enables developers to work with models of up to around 200 billion parameters locally, in a system that fits under the desk or on top of your desk. It runs the same NVIDIA AI software stack used in production environments, meaning workflows can move from desktop to data center or cloud with minimal changes. The key idea here is not replacing the cloud, but bringing powerful AI development closer to the developer.
This is my setup. Everything runs locally and is reproducible.
To serve the models I used vLLM, with a set of models across different sizes and precision formats. I run this inside an NVIDIA-optimized container. This ensures that our environment is identical to what you would deploy in a data center.
Before showing the raw results, I want to show you the how. I built an automated benchmarking harness. Every model run, from 1.5 billion to 14 billion parameters, follows the same strict protocol: environment isolation, three mandatory warm-up runs, and background GPU metrics logging at one-second intervals. On the left, we are looking at the orchestrator script. For every execution, the script automatically generates a unique directory using a precise timestamp and a sanitized model ID. For every model run, it captures the full model endpoint response and the necessary metrics. On the right, you see the result: a clean, versioned artifact of the run. It contains everything needed to verify the findings: the metadata and the text results from the benchmarks. On the lower right, you can see an example command for getting things started.
Now, let's look at the actual measurement logic. In an AI application, end-to-end latency is important, but time to first token is the metric that defines the user's perceived performance. If the first token arrives instantly, the application feels responsive, and the script here shows how we timestamp the very first chunk of the streaming response. In this script, we are not just calling an API and waiting for a result. We are explicitly handling the streaming response from the vLLM server. If you look at the highlighted block in the stream_one function, you'll see the timestamping logic.
To explore this in practice, I ran a series of experiments using vLLM on the DGX Spark. I tested different models, from the smaller instruction models to larger optimized variants, all under the same setup. Everything was served locally.
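
The slides with the orchestrator and measurement scripts are not reproduced in this transcript, so the following is only a minimal sketch of the ideas described: a unique run directory built from a timestamp and a sanitized model ID, warm-up runs that are discarded, and a timestamp taken on the first non-empty chunk of a streamed response. Every name here (`sanitize_model_id`, `make_run_dir`, `time_stream`, `run_with_warmup`) is hypothetical and is not the speaker's code. The stream is modelled as any iterable of text chunks, so that no particular client API is assumed.

```python
import re
import time
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable, Iterable, Optional


def sanitize_model_id(model_id: str) -> str:
    """Replace characters that are awkward in directory names with underscores."""
    return re.sub(r"[^A-Za-z0-9._-]+", "_", model_id).strip("_")


def make_run_dir(root: Path, model_id: str, now: Optional[datetime] = None) -> Path:
    """Create a unique per-run directory: <timestamp>_<sanitized model id>."""
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%dT%H%M%S_%f")  # microsecond precision
    run_dir = root / f"{stamp}_{sanitize_model_id(model_id)}"
    run_dir.mkdir(parents=True, exist_ok=False)  # fail loudly on a collision
    return run_dir


def time_stream(chunks: Iterable[str], clock: Callable[[], float] = time.perf_counter) -> dict:
    """Consume a stream of text chunks, timestamping the first non-empty one."""
    start = clock()  # must be taken before the request is actually sent
    first = None
    pieces = []
    for chunk in chunks:
        if not chunk:
            continue  # skip empty chunks so they do not count as output
        if first is None:
            first = clock()  # time to first token is measured here
        pieces.append(chunk)
    end = clock()
    return {
        "ttft_s": None if first is None else first - start,
        "total_s": end - start,
        "chunks": len(pieces),
        "text": "".join(pieces),
    }


def run_with_warmup(make_stream: Callable[[], Iterable[str]], warmup_runs: int = 3) -> dict:
    """Discard `warmup_runs` full responses, then time one measured run."""
    for _ in range(warmup_runs):
        for _chunk in make_stream():
            pass
    return time_stream(make_stream())


if __name__ == "__main__":
    def fake_stream():
        # Stand-in for a streaming response from a locally served model.
        yield from ["Hel", "lo"]

    print(run_with_warmup(fake_stream))
```

In a real harness, `make_stream` would wrap a streaming request to the local server and yield the text of each chunk. If it is a lazy generator that only sends the request on first iteration, the `start` timestamp in `time_stream` precedes the request, which is what a time-to-first-token measurement needs.
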
The goal here wasn't to push theoretical limits, but to understand realistic behavior in the developer workflow.
Let's dive into the raw performance data I captured. This bar chart represents the completion tokens per second across the test suite. At the far left, you see the 1.5 billion instruct model delivering a massive 61.73 tokens per second. But the most interesting data point for us is the 14 billion NVFP4 model. Despite being nearly 10 times larger than the 1.5 billion model, it still achieves 20.90 tokens per second. I would say this is a critical engineering sweet spot. By leveraging NVFP4, 4-bit floating point quantization, we were able to maintain a sophisticated, high-intelligence model at a throughput that is still faster than the average human reading speed. What becomes clear here is how aggressively throughput drops as models scale, and how much quantization helps. Have a look at the 14 billion base model: it drops to just 8.4 tokens per second. This proves that on Blackwell hardware, the choice of quantization format is just as important as the hardware itself. It's what allows the DGX Spark to bridge the gap between a desktop system and a production prototyping engine. The Spark allows me to experiment with different precision formats locally and understand the trade-offs in real time.
While throughput is a measure of raw power, time to first token is the metric that defines user experience. It determines whether an application feels instant or broken. In other words, it reflects how quickly the model starts responding. As we see in the results, the DGX Spark delivers exceptional responsiveness. The increase for larger models is expected: more parameters mean more computation before the first token is generated. Of everything in this chart, the most interesting part is the comparison between the two 14 billion parameter models, the base model and the NVFP4 model.
As we can see, the 14 billion NVFP4 model is 3.4 times faster to first token than the unoptimized 14 billion base model. A key takeaway from this data is that memory capacity is not the same as memory bandwidth. While the DGX Spark's 128 gigabytes of unified memory allows us to fit massive models of up to 200 billion parameters, our throughput is still governed by how efficiently we can move data. This is why NVFP4 is the hero here. It effectively increases our intelligence per byte, allowing a 14 billion model to feel as responsive as a much smaller one.
After running all of these tests, here is when to use the DGX Spark: for steady-state workloads, privacy-sensitive data, and rapid prototyping. It allows you to build and fine-tune locally with the exact same software stack used in DGX Cloud. Visit build.nvidia.com/spark to access the playbooks and software stack I used for these benchmarks. And in one line: run locally, iterate quickly, and then be ready to scale to data center or cloud.
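
The bandwidth point can be made concrete with a common rule of thumb: in single-stream decoding, every generated token requires reading roughly all of the model weights once, so memory bandwidth divided by weight size gives a rough ceiling on tokens per second. The sketch below applies that rule to the two 14 billion parameter results reported in the talk. Several inputs are assumptions rather than figures from the talk: the bandwidth value should be checked against the published hardware specification, the base model is assumed to use 16-bit weights, and the small per-block scale overhead of the 4-bit format is ignored.

```python
def weight_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def decode_ceiling_tok_s(num_params: float, bits_per_param: float, bandwidth_gb_s: float) -> float:
    """Rough upper bound on single-stream tokens/s if every token reads all weights once."""
    return bandwidth_gb_s / weight_gb(num_params, bits_per_param)


# Assumption: memory bandwidth in GB/s, not stated in the talk; verify against the spec sheet.
ASSUMED_BANDWIDTH_GB_S = 273.0
PARAMS_14B = 14e9

# (assumed bits per weight, tokens/s reported in the talk)
REPORTED = {
    "14B base (assumed 16-bit)": (16, 8.4),
    "14B 4-bit (scale overhead ignored)": (4, 20.90),
}

if __name__ == "__main__":
    for label, (bits, measured) in REPORTED.items():
        ceiling = decode_ceiling_tok_s(PARAMS_14B, bits, ASSUMED_BANDWIDTH_GB_S)
        print(f"{label}: ceiling ~{ceiling:.1f} tok/s, reported {measured:.2f} tok/s")
    print(f"reported throughput gain from 4-bit: {20.90 / 8.4:.2f}x")
```

Under these assumptions, both reported throughputs sit below their respective ceilings, and the ceiling itself rises fourfold when the weights shrink from 16 to 4 bits. The estimate ignores KV cache traffic, compute time, and batching, so it is a sanity check rather than a prediction.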
