NVIDIA's AI Engineers: Agent Inference at Planetary Scale and "Speed of Light" — Nader Khalil (Brev), Kyle Kranen (Dynamo)

Original: Swyx · 10/03/2026

Summary

NVIDIA’s engineers discuss advancements in AI inference at GTC, focusing on Dynamo’s capabilities and the importance of security in agent operations.

Key Insights

“Agents can do three things. They can access your files, they can access the internet, and then now they can write custom code and execute it.” — Nader discusses the capabilities and security concerns of AI agents.

“We are blessed to have a unique relationship with our first ever NVIDIA guests.” — Swyx introduces the guests at the podcast.

“It’s like everyone who sponsors a conference comes, does their booth. They’re like, we are changing the future of ai or something, some generic bullshit.” — Swyx reflects on the typical conference experience and the need for creativity.

Full Article

Join Kyle, Nader, Vibhu, and swyx live at NVIDIA GTC next week! Now that AIE Europe tix are ~sold out, our attention turns to Miami and World's Fair!

The definitive AI accelerator chip company has more than 10xed this AI Summer and is now a $4.4 trillion megacorp that is somehow still moving like a startup. We are blessed to have a unique relationship with our first ever NVIDIA guests: Kyle Kranen, who gave a great inference keynote at the first World's Fair and is one of the leading architects of NVIDIA Dynamo (a datacenter-scale inference framework supporting SGLang, TRT-LLM, vLLM), and Nader Khalil, a friend of swyx from our days in Celo in The Arena, who has been drawing developers at GTC since before they were even a glimmer in the eye of NVIDIA.

Nader discusses how NVIDIA Brev has drastically reduced the barriers to entry for developers to get a top-of-the-line GPU up and running, and Kyle explains NVIDIA Dynamo as a data center scale inference engine that optimizes serving by scaling out, leveraging techniques like prefill/decode disaggregation, scheduling, and Kubernetes-based orchestration, framed around cost, latency, and quality tradeoffs. We also dive into Jensen's SOL (Speed of Light) first-principles urgency concept, long-context limits and model/hardware co-design, internal model APIs (https://build.nvidia.com), and upcoming Dynamo and agent sessions at GTC.

Full Video pod on YouTube

Timestamps

00:00 Agent Security Basics
00:39 Podcast Welcome and Guests
07:19 Acquisition and DevEx Shift
13:48 SOL Culture and Dynamo Setup
27:38 Why Scale Out Wins
29:02 Scale Up Limits Explained
30:24 From Laptop to Multi Node
33:07 Cost Quality Latency Tradeoffs
38:42 Disaggregation Prefill vs Decode
41:05 Kubernetes Scaling with Grove
43:20 Context Length and Co Design
57:34 Security Meets Agents
58:01 Agent Permissions Model
59:10 Build Nvidia Inference Gateway
01:01:52 Hackathons And Autonomy Dreams
01:10:26 Local GPUs And Scaling Inference
01:15:31 Long Running Agents And SF Reflections

Transcript

Agent Security Basics

Nader: Agents can do three things. They can access your files, they can access the internet, and then now they can write custom code and execute it. You literally only let an agent do two of those three things.
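Nader's rule of thumb (files, internet, code execution: grant at most two) can be sketched as a small policy check. This is only an illustrative sketch, not NVIDIA's or Brev's implementation; the names `Capability`, `MAX_GRANTED`, `is_allowed`, and `grant` are hypothetical.

```python
from enum import Enum


class Capability(Enum):
    """The three things an agent can do, per the discussion (names are hypothetical)."""
    FILES = "files"          # read/write the local file system
    INTERNET = "internet"    # make outbound network requests
    CODE_EXEC = "code_exec"  # write and execute custom code


# Assumption: the rule is "at most two of the three", applied per agent.
MAX_GRANTED = 2


def is_allowed(requested):
    """Return True if the requested capability set respects the two-of-three rule."""
    return len(set(requested)) <= MAX_GRANTED


def grant(requested):
    """Return the granted capabilities, or raise if all three are requested together."""
    requested = frozenset(requested)
    if not is_allowed(requested):
        raise PermissionError(
            "refusing to grant files + internet + code execution to one agent"
        )
    return requested


if __name__ == "__main__":
    print(is_allowed({Capability.FILES, Capability.CODE_EXEC}))  # two of the three
    print(is_allowed(set(Capability)))                           # all three at once
```

A check like this would sit at whatever enforcement point launches the agent, so the decision is made once, before any tool is handed over.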
If you can access your files and you can write custom code, you don't want internet access because that's one to see full vulnerability, right? If you have access to internet and your file system, you should know the full scope of what that agent's capable of doing. Otherwise, now we can get injected or something that can happen. And so that's a lot of what we've been thinking about is like, you know, how do we both enable this because it's clearly the future. But then also, you know, what, what are these enforcement points that we can start to like protect?
swyx: All right.

Podcast Welcome and Guests

swyx: Welcome to the Latent Space podcast in the Chroma studio. Welcome to all the guests here. Uh, we are back with our guest host Vibhu. Welcome. Good to have you back. And our friends, uh, Nader and Kyle from Nvidia. Welcome.
Kyle: Yeah, thanks for having us.
swyx: Yeah, thank you. Actually, I don't even know your titles. Uh, I know you're like architect something of Dynamo.
Kyle: Yeah. I, I'm one of the engineering leaders [00:01:00] and architects of Dynamo.
swyx: And you're director of something and developers, developer tech.
Nader: Yeah.
swyx: You're the developers, developers, developers guy at Nvidia,
Nader: open source agent marketing, Brev,
swyx: and like
Nader: DevRel tools and stuff.
swyx: Yeah. Been
Nader: the focus.
swyx: And we're, we're kind of recording this ahead of Nvidia GTC, which is coming to town, uh, again, uh, or taking over town, uh, which, uh, which we'll all be at. Um, and we'll talk a little bit about your sessions and stuff. Yeah.
Nader: We're super excited for it.

GTC Booth Stunt Stories

swyx: One of my favorite memories for Nader, like you always do like marketing stunts and like while you were at Brev, you like had this surfboard that you like, went down to GTC with and like, Nvidia apparently liked it so much that they bought you. Like what, what was that like? What was that?
Nader: Yeah. Yeah, we, we, um. Our logo was a shaka.
We, we, uh, we were always just kind of like trying to keep true to who we were. I think, you know, sometimes as a startup, you're like trying to pretend that you're a bigger, more mature company than you are. And it was actually Evan Conrad from SF Compute who was just like, you guys are like
swyx: previous guest. Yeah.
Nader: Amazing. Oh, really? Amazing. Yeah. He was just like, guys, you're two dudes in the room. Why are you [00:02:00] pretending that you're not? Uh, and so then we were like, okay, let's make the logo a shaka. We brought surfboards to our booth to GTC and the energy was great. Yeah. Some palm trees too.
Kyle: They, they actually poked out over like the, the walls so you could, you could see the Brev booth. Oh, that's so funny. And
Nader: no one else,
Kyle: just from very far away.
Nader: Oh, so you remember it back then?
Kyle: Yeah, I remember it pre-acquisition. I was like, oh, those guys look cool,
Nader: dude. That makes sense. 'Cause uh, we, so we signed up really last minute, and so we had the last booth. It was all the way in the corner. And so I was, I was worried that no one was gonna come. So that's why we had like the palm trees. We really came in with the surfboards. We even had one of our investors bring her dog and then she was just like walking the dog around to try to like, bring energy towards our booth. Yeah.
swyx: Steph.
Kyle: Yeah. Yeah, she's the best,
swyx: you know, as a conference organizer, I love that. Right? Like, it's like everyone who sponsors a conference comes, does their booth. They're like, we are changing the future of AI or something, some generic bullshit and like, no, like actually try to stand out, make it fun, right? And people still remember it after three years.
Nader: Yeah. Yeah. You know what's so funny? I'll, I'll send, I'll give you this clip if you wanna, if you wanna add it [00:03:00] in, but, uh, my wife was at the time fiance, she was in medical school and she came to help us. 'Cause it was like a big moment for us.
And so we, we bought this Cricut, it's like a vinyl, like a vinyl, uh, printer. 'Cause like, how else are we gonna label the surfboard? So, we got a surfboard, luckily was able to purchase that on the company card. We got a Cricut and it was just like fine tuning for enterprises or something like that, that we put on the. On the surfboard and it's 1:00 AM the day before we go to GTC. She's helping me put these like vinyl stickers on. And she goes, you son of, she's like, if you pull this off, you son of a bitch. And so, uh, right. Pretty much after the acquisition, I stitched that with the mag music acquisition. I sent it to our family group chat.
swyx: Oh yeah. No, well, she, she made a good choice there. Was that like basically the origin story for Launchable is that we, it was, and maybe we should explain what Brev is and
Nader: Yeah. Yeah. Uh, I mean, Brev is just, it's a developer tool that makes it really easy to get a GPU. So we connect a bunch of different GPU sources. So the basics of it is like, how quickly can we SSH you into a G, into a GPU and whenever we would talk to users, they wanted a GPU. They wanted an A100. And if you go to like any cloud [00:04:00] provisioning page, usually it's like three pages of forms or in the forms somewhere there's a dropdown. And in the dropdown there's some weird code that you know to translate to an A100. And I remember just thinking like. Every time someone says they want an A100, like the piece of text that they're telling me that they want is like, stuffed away in the corner. Yeah. And so we were like, what if the biggest piece of text was what the user's asking for? And so when you go to Brev, it's just big GPU chips with the type that you want with
swyx: beautiful animations that you worked on pre, like pre you can, like, now you can just prompt it. But back in the day. Yeah. Yeah. Those were handcraft, handcrafted artisanal code.
Nader: Yeah.
I was actually really proud of that because, uh, it was an, I, I made it in Figma. Yeah. And then I found, I was like really struggling to figure out how to turn it from like Figma to React. So what it actually is, is just an SVG and I, I have all the styles and so when you change the chip, whether it's like active or not it changes the SVG code and that somehow like renders like, looks like it's animating, but it, we just had the transition slow, but it's just like the, a JavaScript function to change the like underlying SVG. Yeah. And that was how I ended up like figuring out how to move it from Figma. But yeah, that's artisanal. [00:05:00]
Kyle: Speaking of marketing stunts though, he actually used those SVGs. Or kind of used those SVGs to make these cards.
Nader: Oh yeah.
Kyle: Like a GPU gift card. Yes. That he handed out everywhere. That was actually my first impression of that one.
Nader: Yeah,
swyx: yeah, yeah.
Nader: Yeah.
swyx: I think I still have one of them.
Nader: They look great.
Kyle: Yeah.
Nader: I have a ton of them still actually in our garage, which just, they don't have labels. We should honestly like bring, bring them back. But, um, I found this old printing press here, actually just around the corner on Van Ness. And it's a third generation San Francisco shop. And so I come in an excited startup founder trying to like, and they just have this crazy old machinery and I'm in awe. 'Cause the whole building is so physical. Like you're seeing these machines, they have like pedals to like move these saws and whatever. I don't know what this machinery is, but I saw all three generations. Like there's like the grandpa, the father and the son, and the son was like, around my age.
swyx: Well, it's like a holy, holy trinity.
Nader: It's funny because we, so I just took the same SVG and we just like printed it and it's foil printing, so they make a, a mold.
That's like an inverse of like the A100 and then they put the foil on it [00:06:00] and then they press it into the paper. And I remember once we got them, he was like, Hey, don't forget about us. You know, I guess like early Apple and Cisco's first business cards were all made there. And so he was like, yeah, we, we get like the startup businesses but then as they mature, they kind of go somewhere else. And so I actually, I think we were talking with marketing about like using them for some, we should go back and make some cards.
swyx: Yeah, yeah, yeah. You know, I remember, you know, as a very, very small Brev investor, I was like, why are we spending time like, doing these like stunts for GPUs? Like, you know, I think like as a, you know, typical like cloud hard hardware person, you go into an AWS you pick like T five X xl, whatever, and it's just like from a list and you look at the specs like, why animate this GPU? And, and I, I do think like it just shows the level of care that goes throughout Brev and Yeah. And now, and also the, and,
Nader: and Nvidia. I think that's what the, the thing that struck me most when we first came in was like the amount of passion that everyone has. Like, I think, um, you know, you talk to, you talk to Kyle, you talk to, like, every VP that I've met at Nvidia goes so close to the metal. Like, I remember it was almost a year ago, and like my VP asked me, he's like, Hey, [00:07:00] what's Cursor? And like, are you using it? And if so, why? Surprised at this, and he downloaded Cursor and he was asking me to help him like, use it. And I thought that was, uh, or like, just show him what he, you know, why we were using it. And so, the amount of care that I think everyone has and the passion, appreciate, passion and appreciation for the moment. Right. This is a very unique time.
So it's really cool to see everyone really like, uh, appreciate that.
swyx: Yeah.

Acquisition and DevEx Shift

swyx: One thing I wanted to do before we move over to sort of like research topics and, uh, the, the stuff that Kyle's working on is just tell the story of the acquisition, right? Like, not many people have been, been through an acquisition with Nvidia. What's it like? Uh, what, yeah, just anything you'd like to say.
Nader: It's a crazy experience. I think, uh, you know, we were the thing that was the most exciting for us was. Our goal was just to make it easier for developers. We wanted to find access to GPUs, make it easier to do that. And then all, oh, actually your question about Launchable. So Launchable was just make one click exper, like one click deploys for any software on top of the GPU. Mm-hmm. And so what we really liked about Nvidia was that it felt like we just got a lot more resources to do all of that. I think, uh, you [00:08:00] know, NVIDIA's goal is to make things as easy for developers as possible. So there was a really nice like synergy there. I think that, you know, when it comes to like an acquisition, I think the amount that the soul of the products align, I think is gonna be. Is going to speak to the success of the acquisition. Yeah. And so it in many ways feels like we're home. This is a really great outcome for us. Like we you know, I love brev.nvidia.com. Like you should, you should use it. It's, it's the
Kyle: front page for GPUs.
Nader: Yeah. Yeah. If you want GPUs,
Kyle: you go there, get
swyx: it there, and it's like internally is growing very quickly. I, I don't remember You said some stats there.
Nader: Yeah, yeah, yeah. It's, uh, I, I wish I had the exact numbers, but like internally, externally, it's been growing really quickly.
We've been working with a bunch of partners with a bunch of different customers and ISVs, if you have a solution that you want someone that runs on the GPU and you want people to use it quickly, we can bundle it up, uh, in a Launchable and make it a one click run. If you're doing things and you want just like a sandbox or something to run on, right. Like OpenClaw. Huge moment. Super exciting. Our, uh, and we'll talk into it more, but. You know, internally, people wanna run this, and you, we know we have to be really careful from the security implications. Do we let this run on the corporate network? Security's guidance was, Hey, [00:09:00] run this on Brev, it's in, you know, it's, it's, it's a VM, it's sitting in the cloud, it's off the corporate network. It's isolated. And so that's been our stance internally and externally about how to even run something like OpenClaw while we figure out how to run these things securely. But yeah,
swyx: I think there's also like, you almost like were the right team at the right time when Nvidia is starting to invest a lot more in developer experience or whatever you call it. Yeah. Uh, UX or I don't know what you call it, like software. Like obviously NVIDIA is always invested in software, but like, there's like, this is like a different audience. Yeah.
Nader: It's a wider
Kyle: developer base.
swyx: Yeah. Right.
Nader: Yeah. Yeah. You know, it's funny, it's like, it's not, uh,
swyx: so like, what, what is it called internally? What, what is this that people should be aware that is going on there?
Nader: Uh, what, like developer experience
swyx: or, yeah, yeah. Is it's called just developer experience or is there like a broader strategy here
Nader: in Nvidia? Um, Nvidia always wants to make a good developer experience. The thing is and a lot of the technology is just really complicated. Like, it's not, it's uh, you know, I think, um.
The thing that's been really growing or the AI's growing is having a huge moment, not [00:10:00] because like, let's say data scientists in 2018, were quiet then and are much louder now. The pie is growing, right? There's a whole bunch of new audiences. My mom's wondering what she's doing. My sister's learned, like taught herself how to code. Like the, um, you know, I, I actually think just generally AI's a big equalizer and you're seeing a more like technologically literate society, I guess. Like everyone's, everyone's learning how to code. Uh, there isn't really an excuse for that. And so building a good UX means that you really understand who your end user is. And when your end user becomes such a wide, uh, variety of people, then you have to almost like reinvent the practice, right? Yeah.
Kyle: You have to, and actually build more developer UX, right? Because the, there are tiers of developer base that were added. You know, the, the hackers that are building on top of OpenClaw, right? For example, have never used GPU. They don't know what CUDA is. They, they, they just want to run something.
Nader: Yeah.
Kyle: You need new UX that is not just. Hey, you know, how do you program something in CUDA and run it? And then, and then we built, you know, like when deep learning was getting big, we built, we built Torch and, and, but so recently the amount of like [00:11:00] layers that are added to that developer stack has just exploded because AI has become ubiquitous. Everyone's using it in different ways. Yeah.
Nader: It's moving fast in every direction. Vertical, horizontal.
Vibhu: Yeah. You guys, you even take it down to hardware, like the DGX Spark, you know, it's, it's basically the same system as just throwing it up on big GPU cluster.
Nader: Yeah, yeah, yeah. It's amazing. Blackwell.
swyx: Yeah. Uh, we saw the preview at the last year's GTC and that was one of the better performing, uh, videos so far, and video coverage so far. Awesome. This will beat it.
Um,
Nader: that was
swyx: actually, we have fingers
Nader: crossed. Yeah.

DGX Spark and Remote Access

Nader: Even when Grace Blackwell or when, um, uh, DGX Spark was first coming out getting to be involved in that from the beginning of the developer experience. And it just comes back to what you
swyx: were involved.
Nader: Yeah. St. St.
swyx: Mars.
Nader: Yeah. Yeah. I mean from, it was just like, I, I got an email, we just got thrown into the loop and suddenly yeah, I, it was actually really funny 'cause I'm still pretty fresh from the acquisition and I'm, I'm getting an email from a bunch of the engineering VPs about like, the new hardware, GPU chip, like we're, or not chip, but just GPU system that we're putting out. And I'm like, okay, cool. Nader's now involved with this for the UX, I'm like. What am I gonna do [00:12:00] here? So, I remember the first meeting, I was just like kind of quiet as I was hearing engineering VPs talk about what this box could be, what it could do, how we should use it. And I remember, uh, one of the first ideas that people had was like, oh, the first thing that it was like, I think a quote was like, the first thing someone's gonna wanna do with this is get two of them and run a Kubernetes cluster on top of them. And I was like, oh, I think I know why I'm here. I was like, the first thing we're doing is easy SSH into the machine. And then, and you know, just kind of like scoping it down of like, once you can do that every, you, like the person who wants to run a Kubernetes cluster on two Sparks has a higher propensity for pain than, you know, someone who buys it and wants to run OpenClaw right now, right? If you can make sure that that's as effortless as possible, then the rest becomes easy. So there's a tool called Nvidia Sync. It just makes the SSH connection really simple. So, you know, if you think about it like.
If you have a Mac, uh, or a PC or whatever, if you have a laptop and you buy this GPU and you want to use it, you should be able to use it like it's a GPU in the cloud, right? Um, but there's all this friction of like, how do you actually get into that? That's part of [00:13:00] Brev's value proposition is just, you know, there's a CLI that wraps SSH and makes it simple. And so our goal is just get you into that machine really easily. And one thing we just launched at CES, it's in, it's still in like early access. We're ironing out some kinks, but it should be ready by GTC. You can register your Spark on Brev. And so now if you
swyx: like remote managed yeah, local hardware. Single pane of glass. Yeah. Yeah. Because Brev can already manage other clouds anyway, right?
Vibhu: Yeah, yeah. And you use the Spark on Brev as well, right?
Nader: Yeah. But yeah, exactly. So, so you, you, so you, you set it up at home you can run the command on it, and then it gets it's essentially it'll appear in your Brev account, and then you can take your laptop to a Starbucks or to a cafe, and you'll continue to use your, you can continue use your Spark just like any other cloud node on Brev. Yeah. Yeah. And it's just like a pre-provisioned center
swyx: in your
Nader: home. Yeah, exactly.
swyx: Yeah. Yeah.
Vibhu: Tiny little data center.
Nader: Tiny little, the size of
Vibhu: your phone.

SOL Culture and Dynamo Setup

swyx: One more thing before we move on to Kyle. Just have so many Jensen stories and I just love, love mining Jensen stories. Uh, my favorite so far is SOL. Uh, what is, yeah, what is SOL?
Nader: SOL is actually, I, I think [00:14:00] of all the lessons I've learned, that one's definitely my favorite.
Kyle: It'll always stick with you.
Nader: Yeah. Yeah. I, you know, in your startup, everything's existential, right? Like we've, we've run out of money. We were like, on the risk of, of losing payroll, we've had to contract our team because we ran outta money.
And so like, um, because of that you're really always forcing yourself to I to like understand the root cause of everything. If you get a date, if you get a timeline, you know exactly why that date or timeline is there. You're, you're pushing every boundary and like, you're not just say, you're not just accepting like a, a no. Just because. And so as you start to introduce more layers, as you start to become a much larger organization, SOL is essentially like what is the physics, right? The speed of light moves at a certain speed. So if light's moving slower, then you know something's in the way. So before trying to like layer reality back in of like, why can't this be delivered at some date? Let's just understand the physics. What is the theoretical limit to like, uh, how fast this can go? And then start to tell me why. 'Cause otherwise people will start telling you why something can't be done. But actually I think any great leader's goal is just to create urgency. Yeah. [00:15:00] There's an infinite
Kyle: create compelling events, right?
Nader: Yeah.
Kyle: Yeah. SOL is a term Nvidia uses to instigate a compelling event. You say this is done. How do we get there? What is the minimum? As much as necessary, as little as possible thing that it takes for us to get exactly here and. It helps you just break through a bunch of noise.
swyx: Yeah.
Kyle: Instantly.
swyx: One thing I'm unclear about is, can only Jensen use the SOL card? Like, oh, no, no, no. Not everyone get the bullshit out because obviously it's Jensen, but like, can someone else be like, no, like
Kyle: frontline engineers use it.
Nader: Yeah. Every, I think it's not so much about like, get the bullshit out. It's like, it's like, give me the root understanding, right? Like, if you tell me something takes three weeks, it like, well, what's the first principles? Yeah, the first principles. It's like, what's the, what? Like why is it three weeks? What is the actual yeah.
What's the actual limit of why this is gonna take three weeks? If you're gonna, if you, if let's say you wanted to buy a new computer and someone told you it's gonna be here in five days, what's the SOL? Well, like the SOL is like, I could walk into a Best Buy and pick it up for you. Right? So then anything that's like beyond that is, and is that practical? Is that how we're gonna, you know, let's say give everyone in the [00:16:00] company a laptop, like obviously not. So then like that's the SOL and then it's like, okay, well if we have to get more than 10, suddenly there might be some, right? And so now we can kind of piece the reality back.
swyx: So, so this is the. Paul Graham do things that don't scale. Yeah. And this is also the, what people would now call high agency. Yeah.
Kyle: It's actually really interesting because there's a, there's a second hardware angle to SOL that like doesn't come up for all the org. SOL is used like culturally at Nvidia for everything.
swyx: I'm also mining for like, I think that can be annoying sometimes. And like someone keeps going SOL on you and you're like, guys, like we have to be stable. We have to, we to fucking plan. Yeah.
Kyle: It's an interesting balance.
Nader: Yeah. I encounter that with like, actually just with, with Alec, right? 'Cause we, we have a new conference so we need to launch, we have, we have goals of what we wanna launch by, uh, by the conference and like, yeah. At the end of the day, where is
swyx: this GTC?
Nader: Um, well this is like, so we, I mean we did it for CES, we did for GTC DC before that, we're doing it for GTC San Jose. So I mean, like every, you know, we have a new moment. Um, and we want to launch something. Yeah. And we want to do so at SOL and that does mean that some, there's some level of prioritization that needs [00:17:00] to happen. And so it, it is difficult, right? I think, um, you have to be careful with what you're pushing.
You know, stability is important and that should be factored into SOL. SOL isn't just like, build everything and let it break, you know, that, that's part of the conversation. So as you're laying, layering in all the details, one of them might be, Hey, we could build this, but then it's not gonna be stable for X, Y, Z reasons. And so that was like, one of our conversations for CES was, you know, hey, like we, we can get this into early access registering your Spark with Brev. But there are a lot of things that we need to do in order to feel really comfortable from a security perspective, right? There's a lot of networking involved before we deliver that to users. So it's like, okay. Let's get this to a point where we can at least let people experiment with it. We had it in a booth, we had it in Jensen's keynote, and then let's go iron out all the networking kinks. And that's not easy. And so, uh, that can come later. And so that was the way that we layered that back in. Yeah. But
Kyle: It's not really about saying like, you don't have to do the, the maintenance or operational work. It's more about saying, you know, it's kind of like [00:18:00] highlights how progress is incremental, right? Like, what is the minimum thing that we can get to. And then there's SOL for like every component after that. But there's the SOL to get you, get you to the, the starting line. And that, that's usually how it's asked. Yeah. On the other side, you know, like SOL came out of like hardware at Nvidia. Right. So SOL is like literally if we ran the accelerator or the GPU with like at basically full speed with like no other constraints, like how fast would we be able to make a program go.
swyx: Yeah. Yeah. Right.
Kyle: So
swyx: in, in training that like, you know, then you work back to like some percentage of like MFU for example.
Kyle: Yeah, that's a, that's a great example. So like, there's an, there's an SOL MFU, and then there's like, you know, what's practically achievable.
swyx: Cool.
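The hardware sense of SOL that Kyle describes can be made concrete with a model FLOPs utilization (MFU) calculation: achieved FLOP/s divided by the accelerator's peak FLOP/s. The sketch below is a rough illustration only. The estimate of about 6 FLOPs per parameter per token is a common approximation for dense transformer training, and the parameter count, peak throughput, and token rate are made-up placeholder numbers, not NVIDIA specifications.

```python
def achieved_flops_per_second(n_params, tokens_per_second):
    """Approximate training FLOP/s for a dense transformer.

    Assumption: ~6 FLOPs per parameter per token (forward + backward pass).
    """
    return 6.0 * n_params * tokens_per_second


def mfu(n_params, tokens_per_second, peak_flops_per_second):
    """Model FLOPs utilization: fraction of the hardware's speed-of-light FLOP/s."""
    return achieved_flops_per_second(n_params, tokens_per_second) / peak_flops_per_second


def sol_tokens_per_second(n_params, peak_flops_per_second):
    """Speed-of-light token rate: what 100% MFU would deliver."""
    return peak_flops_per_second / (6.0 * n_params)


if __name__ == "__main__":
    n_params = 7e9   # hypothetical 7B-parameter model
    peak = 1e15      # hypothetical peak of 1 PFLOP/s per accelerator
    measured = 1e4   # hypothetical measured tokens/s per accelerator

    print(f"MFU: {mfu(n_params, measured, peak):.0%}")
    print(f"SOL tokens/s: {sol_tokens_per_second(n_params, peak):,.0f}")
```

The gap between the measured rate and the SOL rate is the part that has to be explained by real constraints (memory bandwidth, communication, scheduling), which is the "layer reality back in" step from the conversation.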
Should we move on to sort of, uh, Kyle's side? Uh, Kyle, you're coming more from the data science world. And, uh, I, I mean I always, whenever, whenever I meet someone who's done working in tabular stuff, graph neural networks, time series, these are basically when I go to NeurIPS, I go to ICML, I walk the back halls. There's always like a small group of graph people. Yes. Absolute small group of tabular people. [00:19:00] And like, there's no one there. And like, it's very like, you know what I mean? Like, yeah, no, like it's, it's important interesting work if you care about solving the problems that they solve.
Kyle: Yeah.
swyx: But everyone else is just LLMs all the time.
Kyle: Yeah. I mean it's like, it's like the black hole, right? Has the event horizon reached this yet in NeurIPS? Um,
swyx: but like, you know, those are, those are transformers too. Yeah. And, and those are also like interesting things. Anyway, uh, I just wanted to spend a little bit of time on, on those, that background before we go into Dynamo, uh, proper.
Kyle: Yeah, sure. I took a different path to Nvidia than that, or I joined six years ago, seven, if you count, when I was an intern. So I joined Nvidia, like right outta college. And the first thing I jumped into was not what I'd done in, during internship, which was like, you know, like some stuff for autonomous vehicles, like heavyweight object detection. I jumped into like, you know, something, I'm like, recommenders, this is popular. And
swyx: yeah, he did RecSys
Kyle: as well. Yeah, RecSys. Yeah. I mean that, that was the tabular data at the time, right? You have tables of like, audience qualities and item qualities, and you're trying to figure out like which member of [00:20:00] the audience matches which item or, or more practically which item matches which member of the audience.
And at the time, really it was like we were trying to enable. Uh, recommender, which had historically been like a little bit of a CPU-based workflow into something that like, ran really well in GPUs. And it's since been done. Like there are a bunch of libraries for RecSys that run on GPUs. Uh, the common models like Deep Learning Recommendation Model, which came outta Meta and the Wide & Deep model, which was used or was released by Google were very accelerated by GPUs using, you know, the fast HBM on the chips, especially to do, you know, vector lookups. But it was very interesting at the time and super, super relevant because like we were starting to get like. This explosion of feeds and things that required rec recommenders to just actively be on all the time. And sort of transitioned that a little bit towards graph neural networks when I discovered them because I was like, okay, you can actually use graph neural networks to represent like, relationships between people, items, concepts, and that, that interested me. So I jumped into that at [00:21:00] Nvidia and, and got really involved for like two-ish years.
swyx: Yeah. Uh, and something I learned from Bryan Catanzaro. Yeah. Is that you can just kind of choose your own path in Nvidia.
Kyle: Oh my God. Yeah.
swyx: Which is not a normal big corp thing. Yeah. Like you, you have a lane, you stay in your lane.
Nader: I think probably the reason why I enjoy being in a, a big company, the mission is the boss probably from a startup guy. Yeah.
swyx: The mission is the boss.
Nader: Yeah. Uh, it feels like a big game of pickup basketball. Like, you know, if you play one, if you wanna play basketball, you just go up to the court and you're like, Hey look, we're gonna play this game and we need three. Yeah. And you just like find your three. That's honestly for every new initiative that's what it feels like. Yeah.
Vibhu: It also like shows, right? Like Nvidia. Just releasing state-of-the-art stuff in every domain. Yeah.
Like, okay, you expect foundation models with Nemotron, voice just randomly Parakeet. Call Parakeet just comes out another one, uh, voice.
Kyle: The Nvidia voice team has always been producing.
Vibhu: Yeah. There's always just every other domain of paper that comes out, dataset that comes out. It's like, I mean, it also stems back to what Nvidia has to do, right? You have to make chips years before they're actually produced. Right? So you need to know, you need to really [00:22:00] focus.
Kyle: The design process starts like
Vibhu: exactly
Kyle: three to five years before the chip gets to the market.
Vibhu: Yeah. I, I'm curious more about what that's like, right? So like, you have specialist teams. Is it just like, you know, people find an interest, you go in, you go deep on whatever, and that kind of feeds back into, you know, okay, we, we expect predictions. Like the internals at Nvidia must be crazy. Right? You know? Yeah. Yeah. You know, you, you must. Not even without selling to people, you have your own predictions of where things are going. Yeah. And they're very based, very grounded. Right?
Kyle: Yeah. It, it, it's really interesting. So there's like two things that I think that Nvidia does, which are quite interesting. Uh, one is like, we really index into passion. There's a big. Sort of organizational top-down push to like ensure that people are working on the things that they're passionate about. So if someone proposes something that's interesting, many times they can just email someone like way up the chain that they would find this relevant and say like, Hey, can I go work on this?
Nader: It's actually like I worked at a, a big company for a couple years before, uh, starting on my startup journey and like, it felt very weird if you were to like email out of chain, if that makes [00:23:00] sense. Yeah. The emails at Nvidia are like mosh pits
swyx: shoot,
Nader: and it's just like 60 people, just whatever.
And like, there's this,

swyx: they get messy, like, reply-all,

Nader: oh, it's insane. It's insane. They just

Kyle: help, you know, maximize

Nader: the context. But that's actually, so this is a weird thing, where I used to be like, why would we send emails? We have Slack. I'm the exact opposite now. I feel so bad for anyone who's messaging me on Slack, because I'm so unresponsive.

swyx: You're email maxxing.

Nader: I'm email maxxing now. Email is perfect. Man, we can't work together. Email is great, right? Because important threads get bumped back up. And Slack doesn't do that. So I just have this casino going off on the right or on the left, and I don't know which thread was from where or what. But in email the threads get bumped, and then there's also the subject, so you can have working threads.

I think when you're small, if you're not 40,000 people, Slack will work fine. I don't know what the inflection point is, but there is gonna be a point where that becomes really messy and you'll actually prefer having email, because you can have working threads. You can cc more than nine people in a thread.

Kyle: You can fork stuff.

Nader: You can [00:24:00] fork stuff, which is super nice. And that is part of where you can propose a plan. You can also just start. Honestly, momentum's the only authority, right? If you can just start to make a little bit of progress and show someone something, then they can try it. That's, I think, been the most effective way to push anything forward. And that's both at Nvidia and, I think, just generally.

Kyle: Yeah, there's the other concept that is explored a lot at Nvidia, which is this idea of a zero-billion-dollar business.
Like, market creation is a big thing at Nvidia. Like,

swyx: oh, you want to go and start a zero-billion-dollar business?

Kyle: Jensen says, we are completely happy investing in zero-billion-dollar markets. We don't care if this creates revenue. It's important for us to know about this market. We think it will be important in the future. It can be zero billion dollars for a while. I'm probably mincing his words here, but, you know, I'll give an example. NVIDIA's been working on autonomous driving for a long time,

swyx: like an Nvidia car.

Kyle: No, they've

Vibhu: used the Mercedes, right? They're around the HQ, and I think it finally just got licensed out. Now they're starting to be used quite a [00:25:00] bit. For 10 years you've been seeing Mercedes with Nvidia logos driving.

Kyle: If you're in, like, South Santa Clara. It's actually from the south. Yeah. So zero-billion-dollar markets are a thing. Like, you know, Jensen,

swyx: I mean, okay, look, cars are not a zero-billion-dollar market. But yeah, that's a bad example.

Nader: I think he's messaging zero today. Or even internally, right? An org doesn't have to ruthlessly find revenue very quickly to justify its existence. Right. A lot of the important research, a lot of the important technology being developed, that's kind of

Kyle: where research, research is very ideologically free at Nvidia. Yeah. They can pursue things that they were

swyx: Were you in research officially?

Kyle: I was never in research officially. I was always in engineering. I'm in an org called Deep Learning Algorithms, which is basically just, how do we make things that are relevant to deep learning go fast.

swyx: That sounds freaking cool.

Vibhu: And I think a lot of that is underappreciated, right? Like time series. This week Google put out a TimesFM paper. A new time series paper. RecSys,
Semantic IDs [00:26:00] started applying transformers, LLMs, to rec systems. And when you think of the scale of companies deploying these, right, Amazon recommendations, Google web search, it's huge scale, and

Kyle: Yeah.

Vibhu: you want it fast.

Kyle: Yeah. Actually, there's a fun moment that brought me full circle. Amazon Ads recently gave a talk where they talked about using Dynamo for generative recommendation, which was weirdly cathartic for me. I'm like, oh my God, I've supplanted what I was working on. You're using LLMs now to do what I was doing five years ago.

swyx: Yeah. Amazing. And let's go right into Dynamo. Maybe introduce it from the top down.

Kyle: Yeah, sure. I think at this point a lot of people are familiar with the term inference. Funnily enough, I went from inference being a really niche topic to being something that's discussed on normal people's Twitter feeds.

Nader: It's on billboards here now.

Kyle: Yeah. Very, very strange. Driving and seeing just an inference ad on the 101. Inference at scale is becoming a lot more important. We have these moments like OpenClaw, where you have these [00:27:00] agents that take lots and lots of tokens but produce incredible results.

There are many different aspects of test-time scaling, so that you can use more inference to generate a better result than if you were to use a short amount of inference. There's reasoning, there's querying, there's adding agency to the model, allowing it to call tools and use skills.

Dynamo sort of came about at Nvidia because myself and a couple of others were talking about these concepts: you have inference engines like vLLM, SGLang, and TensorRT-LLM, and they have, like, one single copy.
They sort of think about things as one single copy, like one replica, right?

Why Scale Out Wins

Kyle: Like one version of the model. But when you're actually serving things at scale, you can't just scale up that replica, because you end up with performance problems. There's a limit to scaling up replicas. So you actually have to scale out, to use some Kubernetes-type terminology.

We kind of realized that there was a lot of potential optimization that we could do in scaling out and building systems for data [00:28:00] center scale inference. So Dynamo is this data center scale inference engine that sits on top of frameworks like vLLM, SGLang, and TensorRT-LLM and just makes things go faster, because you can leverage the economy of scale. There's the fact that you have KV cache, which we can define a little bit later, in all these machines, and it's unique, so you wanna figure out the ways to maximize your cache hits. Or you want to employ new techniques in inference like disaggregation, which Dynamo introduced to the world in March. Well, not introduced; it was an academic topic beforehand. But we were one of the first frameworks to start supporting it. And we wanna combine all these techniques into a modular framework that allows you to accelerate your inference at scale.

Nader: By the way, Kyle and I became friends on my first day at Nvidia, and I've always loved it, because he always teaches me

swyx: new things. Yeah. By the way, this is why I wanted to put the two of you together. I was like, yeah, this is gonna be

Kyle: good. It's very different, you know. We've talked to each other a bunch. [00:29:00] Actually, you asked, why can't we scale up?

Nader: Yeah.

Scale Up Limits Explained

Nader: Model, you said model replicas.

Kyle: Yeah. So scale up means assigning more

swyx: heavier?

Kyle: Yeah, heavier. Like making things heavier.
Yeah, adding more GPUs. Adding more CPUs. Scale out is just having a barrier and saying, I'm gonna duplicate my representation of the model, or a representation of this microservice or something, and I'm gonna replicate it many times to handle load. And the reason that you can't scale up past some point is that there are hardware bounds and algorithmic bounds on that type of scaling. So I'll give you a good example that's very trivial. Let's say you're on an H100. The maximum NVLink domain for H100, for most DGX H100s, is eight GPUs, right? So if you scaled up past that, you're gonna have to figure out ways to handle the fact that now, for the GPUs to communicate, you have to do it over InfiniBand, which is still very fast, but is not as fast as NVLink.

swyx: Is it like one order of magnitude, like hundreds, or

Kyle: It's about an order of magnitude. Yeah. Okay. So

swyx: not terrible.

Kyle: [00:30:00] Yeah. I need to remember the data sheet here. I think it's about 500 gigabytes a second unidirectional for NVLink, and about 50 gigabytes a second unidirectional for InfiniBand. It depends on the generation.

swyx: I just wanna set this up for people who are not familiar with these kinds of layers and the transfer speeds

Vibhu: and all that. Of course.

From Laptop to Multi Node

Vibhu: Also, maybe even just going a few steps back before that, most people are very familiar with what you can use on your laptop, whatever, vLLM. You can just run inference there. You can run it on that laptop. Then you get to, okay, models got pretty big, right? GLM-5, they doubled the size. So what do you do when you have to go from, okay, I can get 128 gigs of memory, I can run it on a Spark? Then you have to go multi-GPU. Okay, multi-GPU, there's some support there.
Now, if I'm a company and I'm not hiring the best researchers for this, right, but I need to go [00:31:00] multi-node, right? I have a lot of servers. Okay, now there are efficiency problems, right? You can have multiple eight-H100 nodes, but how do you do that efficiently?

Kyle: Yeah. How do you represent them? How do you choose how to represent the model? Yeah, exactly right. That's a hard question. Everyone asks, how do you size, oh, I wanna run GLM-5, which just came out, new model. There have been like four of them in the past week, by the way, a bunch of new models.

swyx: You know why, right? DeepSeek.

Kyle: No comment. Oh, yeah. But GLM-5, right? We have this new model. It's a large size, and you have to figure out how to both scale up and scale out, right? Because you have to find the right representation for what you care about. Everyone does this differently. Let's be very clear. Everyone figures this out on their own path.

Nader: I feel like a lot of AI, or ML even, is like this. There was some tweet a few months ago that was like, why hasn't fine-tuning as a service taken off? You know, that might be me. It might have been you. Yeah. But people want it to be such an easy recipe to follow. But even if you look at an ML model and specific

Kyle: to you. Yeah,

Nader: yeah.

Kyle: And the [00:32:00] model,

Nader: the situation, and there's just so much tinkering, right? When you see a model that has however many experts in the MoE model, it's like, why that many experts? I don't know. They tried a bunch of things and that one seemed to do better. I think when it comes to how you're serving inference, you have a bunch of decisions to make, and you can always argue that you can take something and make it more optimal. But I think it's this internal calibration and appetite for continued calibration.

Vibhu: Yeah.
And that doesn't mean people aren't taking a shot at this, like Tinker from Thinking Machines, you know? RL as a service. Yeah, totally. It also gets even harder when you try to do big model training, right? We're not the best at training MoEs when they're pre-trained. We saw this with Llama 3, right? They're trained in such a sparse way that Meta knows there's gonna be a bunch of inference done on these, right? They'll open source it, but it's very much trained for what Meta's infrastructure wants, right? They wanna inference it a lot. Now the question to think about is, okay, say you wanna serve a chat application, a coding copilot, right? You're doing a layer of RL, you're serving a model for X amount of people. Is it a chat model, a coding model? Dynamo, you know, back to that,

Kyle: It's [00:33:00] like, yeah, sorry. So we sort of jumped off of that topic. Everyone has their own journey.

Cost Quality Latency Tradeoffs

Kyle: And I like to think of it as defined by, what is the model you need? What is the accuracy you need? Actually, I talked to Nader about this earlier. There are three axes you care about. There's quality: what is the quality that you're able to produce? Are you accurate enough, or can you complete the task with high enough performance? There's cost: can you serve the model, or serve your workflow, cheaply enough? Because it's not just the model anymore, it's the workflow. It's the multi-turn with an agent. And then, can you serve it fast enough? And we're seeing all three of these play out. We saw new models from OpenAI that are faster. You have these new fast versions of models. You can change the amount of thinking to change the quality, right? Produce more tokens, but at a higher cost and a higher latency.
And really, when you start this journey of trying to figure out how you wanna host a model, you think about a few things. What is the model I need to serve? How many times do I need to call it? What is the input sequence length? [00:34:00] What does the workflow look like on top of it? What is the latency SLA that I need to achieve? This is usually a constant: you know the SLA that you need to hit, and then you try to find the lowest-cost version that hits all of these constraints.

Usually you start with those things, and you do a bit of experimentation across some common configurations. You change the tensor parallel size, which is a form of parallelism

Vibhu: I take it, it goes even deeper. First you gotta think, what model?

Kyle: Yes, of course, of course. It's a multi-step design process, because, as you said, you can choose a smaller model and then do more test-time scaling, and it'll equal the quality of a larger model, because you're doing the test-time scaling or you're adding a harness or something. So yes, it goes way deeper than that. But from the performance perspective, once you get to the model you need to host, you look at that and you say, hey, I have this model, I need to serve it at this speed. What is the right configuration for that?

Nader: Did you guys see the recent paper? I just saw it a few days ago: if you run [00:35:00] the same prompt twice, you're getting like double

Just try it again.

Nader: Yeah, exactly.

Vibhu: And you get a lot. Yeah. But the key thing there is you give the context of the failed try, right? So it takes a shot. And this has been basic guidance for quite a while. Just try again. Did you try again?
All advice

Nader: in life.

Vibhu: It's a paper from Google, if I'm not mistaken, right?

Yeah,

Vibhu: yeah. I think it's like a seven-page, little short paper. The title's very cute. And it's just like, yeah, just try again. Give it past context,

Kyle: multi-shot. You just say, hey, take a little bit more information, try and fail. Fail.

Vibhu: And that basic concept has gone pretty deep. There's self-distillation RL, where you do self-distillation, you do RL, and you have the past failure, and that gives some signal. So people take "try it again" not strongly enough.

swyx: For listeners who listen to us here, Vibhu and I actually run a second YouTube channel for our paper club, where

Oh, that's awesome.

swyx: Vibhu just covered this. Self-distillation and all that. That's why he's up to speed [00:36:00] on it.

Nader: I'll have to check it out.

swyx: Yeah. It's just a good practice. Everyone needs a paper club where you just read papers together and the social pressure kind of forces you to just,

Nader: There's a big inference

Kyle: reading

Nader: group at Nvidia. I feel so bad every time. He put it on our, he shared it.

swyx: One of

Nader: your guys,

swyx: is big in that. I forget. Ishan. Yeah, yeah,

Kyle: Ishan's on my team, actually. Funny, there's an employee transfer between us. Ishan worked for Nader at Brev, and now he's on my team.

Nader: He was our head of AI. And then, yeah, once we got in, and

swyx: because I'm always looking for, okay, can I start another podcast that only does that thing? And I was trying to nudge Ishan into, is there something here? I mean, I don't think there are new inference techniques every day. So it's like

Kyle: you would actually be surprised at the amount of blog posts you see.
And if

swyx: there was a period where it was like Medusa, Hydra, what, EAGLE, like, you

Kyle: know, now we have new forms of decode, we have new forms of speculative decoding, or new,

swyx: what,

Kyle: what are you

Vibhu: excited about? And it's exciting when you guys put out something like Nemotron. Because I remember the paper on Nemotron 3, [00:37:00] the amount of post-training tokens that the GPU-rich can just train on. And it was a hybrid state space model, right?

Kyle: It's co-designed for the hardware.

Vibhu: Yeah, co-designed for the hardware. And one of the things was always that state space models don't scale as well when you do a conversion, or whatever the performance. And you guys are like, no, just keep training. And Nemotron shows a lot of that.

Nader: Also, something cool about Nemotron: it was released in layers, if you will, very similar to Dynamo. The pre-training and post-training datasets are released. The recipes on how to do it are released. The model itself is released. It's the full model. You just benefit from us turning on the GPUs. But there are companies, like ServiceNow, that took the dataset and trained their own model, and we were super excited and celebrated that work.

Zoom

Vibhu: different. Zoom is, Zoom is CGI, I think. Also, just to add, a lot of models don't put out base models. And if there's that question, why has fine-tuning not taken off? You know, you can do your own training.

Kyle: Yeah, sure.

Vibhu: You guys put out the base model. I think you put out everything.

Nader: I believe. I know [00:38:00]

swyx: about base. Basically

Vibhu: without base

swyx: base can be cancelable.

Vibhu: Yeah. Base can be cancelable.

swyx: Yeah.

Vibhu: Safety training.

swyx: Did we get a full picture of Dynamo?
I don't know if we, what,

Nader: what I'd love is, you mentioned the three axes. Break it down: what's prefill/decode, and what are the optimizations that we can get with Dynamo?

Kyle: Yeah, that's a great point. So to summarize that three-axis problem: there are three things that determine whether or not something can be done with inference: cost, quality, latency, right? Dynamo is supposed to be there to provide you the runtime that allows you to pull levers to mix it up and move around the Pareto frontier, or the Pareto surface, that determines, is this actually possible with inference and AI today?

Nader: It gives you the knobs.

Kyle: Yeah, exactly. It gives you the knobs.

Disaggregation Prefill vs Decode

Kyle: And one thing that we use a lot in contemporary inference, and that is starting to pick up in general knowledge, is this concept of disaggregation. So historically, models would be hosted with a single inference engine. And that inference engine [00:39:00] would ping-pong between two phases. There's prefill, where you're reading the sequence and generating KV cache, which is basically just a set of vectors that represent the sequence. And then there's using that KV cache to generate new tokens, which is called decode. And some brilliant researchers across multiple different papers essentially made the realization that if you separate these two phases, you actually gain some benefits.

Those benefits are basically, A, you don't have to worry about step-synchronous scheduling. The way that an inference engine works is you do one step, and then you finish it, and then you start scheduling the next step. It's not fully asynchronous.
And the problem with that is, essentially, prefill and decode are actually very different in terms of both their resource requirements and sometimes their runtime. So you would have prefill that would block decode steps, because you'd still be prefilling and you couldn't schedule, because the step has to end. So you remove that scheduling issue, and then you also allow yourself to [00:40:00] split the work into two different types of pools.

So prefill, and this changes as model architecture changes, is right now compute bound most of the time. If the sequence is sufficiently long, it's compute bound. On the decode side, because you're doing a full pass over all the weights and the entire sequence every time you do a decode step, and you don't have the quadratic computation of KV cache, it's usually memory bound, because you're retrieving a linear amount of memory and doing a linear amount of compute, as opposed to prefill, where you retrieve a linear amount of memory and then use a quadratic amount of compute. You know,

Nader: It's funny, someone, EXO Labs, did a really cool demo where, since the DGX Spark has a lot more compute, you can do the compute-hungry prefill on a DGX Spark and then do the decode on a Mac. Yeah. And so

Vibhu: that's faster.

Nader: Yeah. Yeah.

Kyle: So you can do that. You can do machine stratification.

Nader: Yeah.

Kyle: And with our future generations of hardware, we actually announced, with Rubin, this [00:41:00] new accelerator that is prefill specific. It's called Rubin CPX. So

Kubernetes Scaling with Grove

Nader: I have a question. When you do the scale out, is scaling out easier with Dynamo? Because when you need a new node, you can dedicate it to either the prefill or the decode.

Kyle: Yeah.
So Dynamo actually has a Kubernetes component in it called Grove that allows you to do this crazy scaling specialization. It's a representation that, well, I don't wanna go too deep into Kubernetes here, but there was a previous way that you would launch multi-node work. It's called LeaderWorkerSet. It's in the Kubernetes standard, and LeaderWorkerSet is great. It served a lot of people super well for a long period of time. But one of the things that it struggles with is representing a set of cases where you have a multi-node replica that has a pair, right? You know, prefill and decode. Or it's not paired, but it has a second stage with a ratio that changes over time. And prefill and decode are two different things as your workload changes, right? The amount of prefill you'll need to do may change. [00:42:00] The amount of decode that you'll need to do might change, right? Let's say you start getting insanely long queries, right? That probably means that your prefill scales harder, because you're hitting this quadratic scaling growth.

swyx: Yeah. And then for listeners, prefill would be long input. Decode would be long output, for example, right?

Kyle: Yeah. I mean, decode is funny, because the number of tokens that you produce scales with the output length, but the amount of work that you do per step scales with the number of tokens in the context.

swyx: Yes.

Kyle: So it scales with both the input and the output.

swyx: That's true.

Kyle: But on the prefill/decode side, if suddenly the amount of work you're doing on the decode side stays about the same, or scales a little bit, and then the prefill side jumps up a lot, you actually don't want that ratio to be the same. You want it to change over time.
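As an aside for readers, the point about the prefill-to-decode ratio shifting with the workload can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Dynamo's planner or Grove's API: `Load`, `size_pools`, and the per-worker throughput numbers are all hypothetical, and the model is deliberately linear in tokens, so it ignores the quadratic attention cost of long prefills and the per-step growth of decode work with context that Kyle mentions.

```python
import math
from dataclasses import dataclass


@dataclass
class Load:
    """Hypothetical description of incoming traffic."""
    requests_per_s: float
    avg_input_tokens: int
    avg_output_tokens: int


def size_pools(load: Load,
               prefill_tok_per_s_per_worker: float,
               decode_tok_per_s_per_worker: float) -> tuple[int, int]:
    """Return (prefill_workers, decode_workers) for a given load.

    Assumption: each pool has a fixed token throughput per worker.
    Real systems are not this linear.
    """
    prefill_demand = load.requests_per_s * load.avg_input_tokens
    decode_demand = load.requests_per_s * load.avg_output_tokens
    prefill = max(1, math.ceil(prefill_demand / prefill_tok_per_s_per_worker))
    decode = max(1, math.ceil(decode_demand / decode_tok_per_s_per_worker))
    return prefill, decode


# Made-up per-worker throughputs, for illustration only.
PREFILL_TPS = 20_000.0
DECODE_TPS = 2_000.0

chat = Load(requests_per_s=10, avg_input_tokens=1_000, avg_output_tokens=500)
long_queries = Load(requests_per_s=10, avg_input_tokens=16_000, avg_output_tokens=500)

chat_pools = size_pools(chat, PREFILL_TPS, DECODE_TPS)
long_pools = size_pools(long_queries, PREFILL_TPS, DECODE_TPS)
print("chat:", chat_pools)
print("long queries:", long_pools)
```

With these made-up numbers, the decode pool stays the same size when queries get longer while the prefill pool grows, which is the changing ratio being described.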
So Dynamo has a set of components that, A, tell you how to scale. It tells you how many prefill workers and decode workers it thinks you should have, and it also provides a scheduling API for Kubernetes that allows you to actually represent and effect this scheduling on your actual [00:43:00] hardware, on your compute infrastructure.

Nader: Not gonna lie, I feel a little embarrassed for being proud of my SVG function earlier.

swyx: No, it

Nader: was really

Kyle: cute. I, I

swyx: like

Nader: it's all,

swyx: it's all engineering. It's all engineering. That's where I'm

Kyle: technical.

swyx: One thing I'm kind of curious about, since you see, at a systems level, everything going on here, and we're scaling it up in distributed systems.

Context Length and Co Design

swyx: I think one thing that's kind of of the moment right now is people are asking, is there any SOL sort of upper bound in terms of, let's just call it context length, for want of a better word? But you can break it down however you like.
It's basically DeepSeek, scaled a little bit differently, and obviously trained differently as well. But they talked about why they made the design choices for context. Kimi has more experts but fewer attention heads, and I believe a slightly smaller attention dimension. But I need to check that. It doesn't matter. They discussed this at length in a blog post on Zhihu, which is like Reddit

swyx: Yeah.

Kyle: in China. Chinese Reddit.

swyx: Yeah.

Kyle: Yeah. So it's actually an incredible blog post. All the MLSys people that I've seen on Zhihu are very brilliant. But the creators of Kimi K2 [00:45:00] actually talked about it there, in the blog post. And they say, we actually did an experiment, right? Attention scales with the number of heads, obviously. If you have 32 heads versus 64 heads, you do half the work of attention. You still scale quadratically, but you do half the work. And they made a very specific sort of barter in their architecture. They basically said, hey, what if we gave it more experts, so we're gonna use more memory capacity, but we keep the number of activated experts the same. We increase the expert sparsity, so the ratio of experts activated to the number of experts is smaller, and we decrease the number of attention heads.

Vibhu: And for context, what we had been seeing was that you make models sparser instead. So no one was really touching heads. You're just having

Kyle: Well, they did. They implicitly made it sparser.

Vibhu: Yeah, for Kimi, they did.

Kyle: Yes.

Vibhu: They also made it sparser. But basically what we were seeing was people were at the level of, okay, there's a sparsity ratio.
You want more total parameters, fewer active, and that's sparsity. [00:46:00] But what you see from papers from labs like Moonshot and DeepSeek is they go to the level of, okay, outside of just the number of experts, you can also change how many attention heads, and fewer attention layers or more attention layers. Layers, yeah. So that's all basically coming back to, just to tie it together, hardware-model co-design, which is

Kyle: hardware-model, hardware-model-context co-design.

Vibhu: Yeah.

Kyle: Right. If you were training a model that was really, really short context, or is really good at super-short-context tasks, you may design it in such a way that you don't care about attention scaling, because it hasn't hit the turning point where the quadratic curve takes over.

Nader: How do you consider attention or context as a separate part of the co-design? I would've thought hardware-model co-design would already be hardware-model-context co-design.

Kyle: Because the harness, and the context that is produced by the harness, is a part of the model once it's trained in.

Vibhu: Even though towards the end you'll do long context, you're not changing architecture through training. Yeah.

Kyle: I mean, you can try.

swyx: You're saying [00:47:00] everyone's training the harness into the model.

Kyle: I would say to some degree, or

swyx: there's co-design for the harness. I know there's a small amount, but I feel like not everyone has gone full send on this.

Kyle: I think it's important to internalize the harness that you think the model will be running in into the model.

swyx: Yeah. Interesting. Okay. Bash is like the universal harness,

Kyle: right? I'll give an example here, right? Or just, like, an easy proof, right?
If you can train against a harness and you're using that harness for everything, wouldn't you just train with the harness to ensure that you get the best possible quality out of it?

swyx: Well, I can provide a counterargument. Yeah, sure. Which is that you wanna provide a generally useful model for other people to plug into their harnesses, right? So if you

Kyle: Yeah. Harnesses can be open source, right?

swyx: Yeah. So I mean, that's effectively what's happening with Codex.

Kyle: Yeah.

swyx: But you may want a different search tool, and then you may have to name it differently, or,

Nader: I don't know how much people have pushed on this, but have people compared training a model for the harness versus [00:48:00] post-training for

swyx: I think it's the same thing. It's the same thing. It's just extra post-training.

Nader: I see.

swyx: And so, I mean, Cognition does this, Cursor does this, where, if your tool is slightly different, you either force your tool to be like the tool that they trained for, or undo their training for their tool and then retrain. Yeah. It's really annoying, and like,

Kyle: I would hope that eventually we hit a certain level of generality with respect to training new

swyx: tools. This is not AGI. This is a really stupid, like, learn my tool, bitch. I don't know if I can say that. But I think what my point kind of is, is that I look at the slopes of the scaling laws, and this slope is not working, man.
We are at a million-token context. Okay, maybe next year, 2 million. We're not going to a hundred trillion, you know? Like, this, oh, there's so many interesting ways to get this. Doesn't work.

Just doesn't work.

Nader: What's kind of funny is, I feel like we always want to see a trend that we can predict, but every time something's come, it's been like a leapfrog. So I don't know how we go from one to two, but I imagine what's likely to happen is [00:49:00] we break through that from some new

Kyle: Yeah.

There's actually an interesting formalization of this. There's a pretty interesting essay by Leopold Aschenbrenner called Situational Awareness.

swyx: Okay? Yes.

Kyle: He introduces a concept called an unhobbler, right? So Leopold in this essay details, hey, I want to get.

You know, I wanna get to this point in intelligence, and I think that it is four orders of magnitude worth of compute and data and training away. And he says, oh yeah, I think data centers can scale up by about this much. I think that you can scale up the data and some other things by this much.

But one of the things that makes the rest of that order-of-magnitude growth possible is unhobblers, these scientific discoveries that are discovered during, you know, model architecture search or training that really, really, really impact how you are able to scale. A good example of this might be that we see a lot of models that are, [00:50:00] and this is probably a very tiny unhobbler.

But it is important from the performance perspective. We see a lot of models that are trained with multi-token prediction natively during pre-training.

And per DeepSeek, in their paper they say, hey, this actually helped us ensure more stable convergence. So there are unhobblers that are like that.

And then there are rather large unhobblers.
Right. Like, architecturally, a lot of our models, we had different types of attention. And one of the problems with attention is that you have a lot of KV, but people found different forms of attention, like grouped-query attention and MLA in DeepSeek, multi-head latent attention, that decrease the burden that KV has on the model, which allows you to grow longer in context.

swyx: Yeah. And that was very drastic for DeepSeek.

Kyle: Yeah. For context, I think the total context length of DeepSeek is 128,000 tokens, or it might be 256,000 with RoPE extension. That entire context, I think it's 128,000, fits into eight gigabytes. Previously, I think the Llama 405B context [00:51:00] of a similar size was like 40 or 80 gigabytes in the same precision.

swyx: Yeah.

Kyle: So those unhobblers really decrease the stuff of that size. And I wouldn't be surprised if we do see the ability to break through to 10 million, 20 million, a hundred million context through an unhobbler showing up.

swyx: I see.

Kyle: And it's just science.

swyx: So more deep learning algorithms is what

Kyle: I'm hearing.

Yeah. More deep learning algorithms. Um,

swyx: yeah,

Kyle: I could actually playing pickup

swyx: and he has

Kyle: room to, I could actually give you an example of, like, a theory, not a theory theory, but something theoretical, an unhobbler

Nader: that you're excited about or,

Kyle: well, an unhobbler that, I mean, I haven't seen, so it could be a tar pit and it could just not work.

But I would be really excited to see a model that does prefill and decode differently.
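
The KV-cache sizes Kyle quotes in this exchange can be sanity-checked with a little arithmetic. The sketch below is only a rough estimate under stated assumptions: it takes "Llama 405B" to mean Llama 3.1 405B, uses layer counts and head dimensions from the publicly released model configs, assumes 2 bytes per cached element (BF16), and ignores paging and framework overhead. The function names are illustrative and do not come from any library.

```python
# Back-of-the-envelope KV-cache sizing for a 128K-token context.
# All model dimensions below are assumptions taken from public configs.

GIB = 1024 ** 3
TOKENS = 128 * 1024


def gqa_kv_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem):
    """Grouped-query attention: one K and one V vector per KV head, per layer."""
    return tokens * layers * 2 * kv_heads * head_dim * bytes_per_elem


def mla_kv_bytes(tokens, layers, latent_dim, rope_dim, bytes_per_elem):
    """Multi-head latent attention: one compressed latent plus a small
    RoPE key per layer, shared by every attention head."""
    return tokens * layers * (latent_dim + rope_dim) * bytes_per_elem


# Llama 3.1 405B (assumed): 126 layers, 8 KV heads, head dimension 128.
llama_405b = gqa_kv_bytes(TOKENS, layers=126, kv_heads=8, head_dim=128, bytes_per_elem=2)

# DeepSeek-V3 (assumed): 61 layers, latent rank 512, RoPE key dimension 64.
deepseek_v3 = mla_kv_bytes(TOKENS, layers=61, latent_dim=512, rope_dim=64, bytes_per_elem=2)

print(f"Llama 3.1 405B (GQA): {llama_405b / GIB:.1f} GiB")   # 63.0 GiB
print(f"DeepSeek-V3 (MLA):    {deepseek_v3 / GIB:.1f} GiB")  # 8.6 GiB
print(f"Ratio:                {llama_405b / deepseek_v3:.1f}x")  # 7.3x
```

Under these assumptions the numbers land close to the "eight gigabytes" versus "40 or 80 gigabytes" quoted in the conversation; halving `bytes_per_elem` to model an 8-bit cache halves both figures and leaves the ratio unchanged.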
So a model that does prefill locally, like document-wise prefill, in chunks, and then you do decode globally across the entire sequence. Because logically, to me, it doesn't seem like you would necessarily need to [00:52:00] have KV be associative between documents that have no mutual association.

But that places a lot of burden on prefill, or sorry, on decode, and on pure attention within the decode phase, to make those connections, since the KV is static at that point. And you see other techniques that are interesting like this too. But if you're able to do that, like.

If prefill becomes local and decode is still global, you solve that prefill quadratic scaling problem, because you have a bunch of small chunks that you prefill independently.

swyx: Okay. All right. Well, let's wait and see, but I think it'll be pretty exciting.

Kyle: Fingers crossed.

swyx: Yeah, fingers crossed.

Yeah. Yeah.

Vibhu: I'm excited for prefill and decode on separate hardware. So, like, yeah, the Groq acquisition, right? Can we decode on the Groq, can we get super fast?

Kyle: I don't think I'm allowed to comment on this.

swyx: Mark is gonna shoot arrows at us.

Nader: Uh, he's got a blow dart, he's in the room, just

Kyle: like,

Nader: like, go to sleep.

Yeah. Yeah.

swyx: But

Nader: I'm super excited to see the team come in and, you know, I've gotten the pleasure of working with some of the Groq people coming in. So, you know, yeah, I,

swyx: I know Sonny, [00:53:00] we've had him, uh, at the same

Kyle: conference that

swyx: you are at.

Nader: Yeah.

swyx: Um, and I think you guys are gonna be doing some sessions at GTC.

I don't know if you wanna, this is a good place to plug them.

Kyle: Yeah, yeah, yeah. So, I can't speak to any LPU-related sessions at GTC. I have no idea about that. Oh, no, that was,

swyx: no. Yours

Kyle: on the, on the Groq side. Yeah. I use the associative NVIDIA U Yeah.
Um, on the Nvidia Dynamo side, there are a large number of sessions.

For those that aren't aware, you can actually search all of these sessions for GTC online. Just go to the GTC website. I don't know what the URL is, but go there. Google it. Yeah. And you can just look up Dynamo and you'll get all the sessions. There are about 20. There are a couple that are hosted by the Dynamo team.

There are a couple that are hosted by people that use Dynamo and wanna show off the results they've been able to get. But there are two that I'm really excited about. One is just the general Dynamo tutorial, and this is the one I'm going out with Harry, who's our lead product manager for Dynamo.

And we're sort of talking about how to use Dynamo to get better performance, and also where we see Dynamo going in the future. And [00:54:00] then there's another session that I'm doing with one of our agents teams at Nvidia to talk about the future of agents in production inference. Yeah. So we're talking about, there's this new horizon with respect to agents, because we have these harnesses that actually impart structure upon calls.

Like, if you compare the past and the present with respect to how LLM calls work: in the early days, when they were chatbots, every call was very different. There was basically no structure you could assume. If it was conversational, there might be some implicit structure, because you have, you know, a multi-turn conversation.

But agents have this harness that abides by rules, right? So it imparts direct structure onto the context. And you see this: there was an interesting Twitter post about how Claude Code structures its context so that you get as many cache hits as possible.

And I think it was by one of the PMs for Claude Code.

And he wrote about it.
And that type of structure that the harness can impart actually goes hand in [00:55:00] hand with the inference co-design. So I'm doing a talk, I don't know the session name or the session number, but you can look me up by name on the GTC website, on how we accelerate agents and where we see specific optimizations for agents going in Dynamo and in inference in general.

swyx: Yeah. I think there's only one PM for Claude Code and its wo the rest. There's DevRel, there's Boris. Maybe it was DevRel. Yeah, exactly. I mean, let's go into agents. I think this was the last part of the discussion we planned. Yeah. How have we not talked about agents also with you guys?

Well, we scheduled, I was like, okay, you know, let's have cohesive sections or,

Vibhu: I mean, there's the big news, right? NVIDIA's a huge, like, deployment of Codex. Yeah, Nvidia

swyx: uses everything. I mean, we use Cursor and we use Claude Code,

Vibhu: but that's a pretty big deployment, right?

Like, that's tens of thousands of people.

Nader: Totally. Yeah.

Vibhu: We're super What? That's,

Nader: yeah. It goes back to the mosh pit of emails we kind of mentioned earlier, or just, um, how fluid the org feels. So when there's new technology, people will just email it out and everyone will try it.

[00:56:00] And if it's making people's lives easier, it'll spread like wildfire.

Kyle: A lot of times Jensen will get it and it'll be like, let's make this work. Yeah. Across the company. Let's make this work right now,

Nader: honestly, if I was a startup, I feel like a cool hack: if you have something that's going to save an Nvidian time, they'll spread it to a couple and the same thing.

Right? It'll just spread like wildfire. Okay.

Vibhu: Careful before your email blows up from startups. Well,

Nader: You gotta know the person. Right? But no, I, um, yeah, so I mean, I love using Codex. It's been a ton of fun. Yeah.
Uh, I've been using it personally. I've been using it at work. It's been, um.

Yeah, I dunno. It's been great to see the rollout. Something really funny: on the day we got Codex and Claude Code access, I found this person, his name's Carlos, at the company. He wrote an Outlook CLI.

Kyle: Oh yeah.

Nader: And, uh, just the CLI for email. And this was, I've

Kyle: been using that,

Nader: yeah, maybe like four or five weeks ago.

And so once I got Codex access, I installed the CLI, it had a skill, and I just asked it to go through all of my emails, which is very messy. So if I don't respond to your email, I'm really sorry. But I asked it to gimme a summary, highlight any [00:57:00] escalations that I should look at, put any thread that it thinks I should respond to in a folder, and then archive everything.

And it did. So if I missed your email, it's because it didn't get,

swyx: so I should put a prompt injection in my V to Yeah, yeah. What you should do is just FaceTime. Yeah. Um, my, yeah, my SLA is highest on FaceTime,

Nader: but that was, it was magic. And so I sent it in a big email thread to like 500 people. A bunch of folks tried it out.

I started FaceTiming whoever I could at the company to get them set up with this.

swyx: Yeah. Um, that specific example, mm-hmm, you guys deal with some pretty sensitive emails.

Nader: Yeah.

swyx: Is there a security review with this?

Security Meets Agents

swyx: 'Cause, like, one guy made it for himself, but it's not meant for all the

Nader: The security team at Nvidia is incredible.

Like, shout out to them. We have an amazing security team, 'cause they're progressive and they know that this is

Kyle: really important technology and you have to bring it in. If you think about it, if you work at a big company, your laptop's usually very locked down, if

Nader: you can only access certain things.

For Nvidia engineers, those restrictions aren't there.
So you're expected to understand the risks when you try things out. And so, very quickly, you know, made sure to [00:58:00] chime in security on what we were doing.

Agent Permissions Model

Nader: There's actually a lot that we've been thinking about, especially with OpenClaw, right? Like, there's, you know, agents can do three things.

Yeah. Agents can do three things. They can access your files, they can access the internet, and then now they can write custom code and execute it. And you literally only let an agent do two of those three things. If you can access your files and you can write custom code, you don't want internet access because thats one to see full vulnerability, right?

If you have access to the internet and your file system, you should know the full scope of what that agent's capable of doing. Otherwise, malware can get injected, or something like that can happen. And so that's a lot of what we've been thinking about: how do we both enable this, because it's clearly the future.

But then also, you know, what are these enforcement points that we can start to protect?

swyx: And is there any directive of like, hey, we have a company account or a company agreement with OpenAI, we use OpenAI models here, or, like, choose whatever?

Nader: No, no. So I would never put any company data in a model that's not either, that we dont even, it has to most security.

Yeah. Yeah. I like how,

swyx: how that goes. Uh, you know, obviously you could run your own [00:59:00] models. You Nemo and, and we, right, we, we as an, we have an internal cluster, so, you know, of course in random,

Kyle: uh, yeah.

swyx: Yeah.

Nader: I think we're Dynamo's first customers. Let's go.

Build Nvidia Inference Gateway

Kyle: Actually, there's a funny story about how I got the experience that informed what we needed for Dynamo. At one point.

There's a website called build.nvidia.com, and also, for us, infra dot nvidia dot com, that allows people to try models.
It gives an API service. You can call the model with, like, a REST API, and you get a response. I ran the model side for that, and it was at one point the largest inference deployment, and still may actually be the largest inference deployment, at Nvidia.

I've since handed it off to some people, and they're doing a wonderful job, by the way. This is an extremely

Nader: underknown, or less known, resource: build.nvidia.com. You can get any of these open-source models. And it's rate limited, but it's free. So it's perfect for hackers to,

Kyle: and the SLA on getting day-zero models up is like a day.

Yeah.

Kyle: Like, they're incredibly good at figuring out the right way to host the model to [01:00:00] get it up there as soon as it comes out.

swyx: You ran this?

Kyle: Yeah, I ran it a long time ago. It was originally called Nvidia AI Playground, then it was called AI Foundational insert. Yeah. And then it was called build.nvidia.com.

And I ran the model side of it. So there was a large multi-organizational team. I ran which models should we host, how should we host them, and what's the proportion of them. And then of course there was an SRE team that made sure that things ran well and scaled the models as well.

But I ran, you know, how do we get the model to silicon? And then, which model. I also worked with our product team to determine which models were important, a very long time ago.

Yeah. Yeah. There's also like a middle ground in between there, right? This is for the hacker: try anything.

There's the Brev console, then there's Dynamo, there was also NIMs, right?

Kyle: Yes.

I remember it had its little moment, like a year or two ago. Is it still?

Nader: Yeah. NIM is, uh, you know, inference, uh, oil. I, I think it like for something is it is a log or acronym. Yeah. It [01:01:00] just, just a name.
But, um, yeah, NIM is how enterprises can take any of this technology and run it with support and all of that.

And so that includes Dynamo. That includes, I don't know, all of our other optimizations that are packaged up for enterprise. Yep.

swyx: Anyway, so you got a bunch of experience running the sort of internal inference gateway playgrounds.

Kyle: Yeah, I got And Bill also built how build NVIDIA's first internal, like, VS Code thing. We call it MB code.

swyx: That's what I, uh, extension.

Kyle: Yeah, it was, it was a V first,

like the fork VS Code.

swyx: We jokes absolutely not. It just a while back they like, we should have a fork VS Code hackathon where you, that's fork. It's the best fork VS Code. We,

we were, we were doing a hack how make a billion dollars, someone from VS Code was there and he was like somewhat down to get involved and I was like,

swyx: oh, you should do that.

That's all. Then the cool thing became fork Chrome hackathon

Chrome,

swyx: And no, no, no, IDEs are not cool.

Nader: I saw, what's it called?

Hackathons And Autonomy Dreams

Nader: I was talking to Joseph from Roboflow and, uh, theyre partnering crime. We were talking about how with the new Alpamayo model, so Nvidia just [01:02:00] released an open source. Uh, the, the Mercedes cars that you saw drag, she on Frazey?

swyx: Yeah.

Nader: Released. Will you open source, an autonomous driving model? Uh, I already, yeah, so we were thinking, like, could we hackathon a driverless car? Like, I have my old car. Let's just try it.

swyx: We'll take it,

Nader: take it to like, click train with a treasure eye, like in the middle of the day. Just like, just see, let everyone, like, how many cameras do we need?

Right? Like, 1, 2,

swyx: 3, 4. They don't. Five, six.

Nader: I don't know. I, yeah. But, um, I think we're gonna try, you just do it with us.

swyx: We can see, we could even

Nader: have a race. It's like the first person to automate their

swyx: driving.
Let me over a weekend. We do have an autonomy track at World's Fair. Uh, Waymo was there, like. Yeah.

Nvidia did send people that for GR00T. Not because he didn't have the driving thing yet.

Nader: Yeah.

swyx: Yeah. It's, that's cool.

Yeah. I think Comma also has a version of this. Comma have open source driving. They've done a fun hackathon on

swyx: music and he and I also, 'cause what I really want is a Tesla with Tesla-level self-driving.

Yeah.

swyx: But as a smart car, like a two-seater. That's the basic CPA wheelchair with a [01:03:00] roof

and only thing they make them, but the demand has d they, no, they realize this probably five years. Yeah. Really?

swyx: Yeah.

They were d manufacturer.

Kyle: I thought it is one of those things, well, where we'll see someone buy the brand and it'll be revived.

swyx: I, I would buy it like I

Kyle: probably. Someone hears this go by

swyx: your car. Yeah. Yeah. That's crazy. Nobody Mercedes, because they, they're like, I think 10 Mercedes, Mercedes, uh, I in Mercedes used

to make them, I don't know. I feel like they own the brand and you out

swyx: that's your dream might come true enough. Okay.

We were time notify and, and I was like, every time I try to park in San Francisco, I have to buy a smart car, because like 20% of the parking lots in San Francisco only fit smart cars.

Nader: Yeah. So, Hey, really?

swyx: That's where, I mean, its mall

Nader: even it was late here trying to, this comes from someone that, like, basically does

Kyle: not drive.

Nader: That's where the Vespa was a life hack. Yeah, exactly. Yeah. You know what happened to the Vespa? Um, I used to have [01:04:00] this yellow Vespa, uh, I left it outside the hacker house when we moved out. It trend. Um, it was always there. And then like a month ago, it's not there anymore. Ive been meeting today.

I dont dunno. You could, its actually tv. You forgot about it.

swyx: Yeah.

Nader: And left.

swyx: Yeah. Yeah. No, this, it's probably hazard.
And speaking of hackathons, I also wanted to give a big shout out to the world's shortest hackathon. Let's go. Uh, you did twice. You gonna watch a

Nader: handful of times? Yeah. There's gonna be one at GTC.

Oh, we're doing pretty much, we have a bunch of challenges that, no, we haven't released. And you get to bring your agent to come and attempt to, uh, go through those

Kyle: challenges again. It's like the zero-minute hackathon idea, which you just, you just bring your, I I approached eight, nine along a long time ago.

You just bring your agent and then you press the go button. You're not allowed to code. It's just the agent doing bond.

Its a good hidden email, right?

Kyle: Yeah.

Do you make a jar? You make

Kyle: I there something I would love to see from Cognition or someone else be like, come bring your agent. Drop it in

because you dont, you dont know you like supervisor.

Well let be [01:05:00] a, you know, operate a browser, order a pizza. Well just see like that snake it, you know,

swyx: and

Kyle: you don't know what the

swyx: task

Kyle: is. Yeah. You dunno what the task is, or you don't even know what the judging categories are, and then you give it the judging categories. Like, try as much as possible.

Its great though. It turns into like, yeah, so lets build something on dining party. Its a great business. See,

Kyle: anyway, funny story.

Agent UX And CLI Everywhere

Kyle: Actually, we have a couple of people at Nvidia, we've been working with security to bring agents really close to compute.
So we now have stuff where we can tell Dynamo, like, go run some experiments with Dynamo on X cluster and just try it right now. Like, queue up, and once you get queued, send this request load. And we've actually been able to just, you know, one-shot problems like.

We used to have this problem where, you know, with Dynamo you have to find the right configurations, and we sort of do it automatically for some parts of it, but you have to have a good initial configuration that you want to use. And we've just had an agent completely one-shot that. It goes, it gets the compute, it runs a couple of experiments.

It's like, [01:06:00] this is the best, these are part of the Pareto frontier. Go run this. And then we just give that to people, and it's faster than anything that they have.

Nader: Agent UX and agent marketing are super important. There's stuff that we've been thinking a lot about. Um, Alec is redoing the entire Brev CLI so that you can fetch all the different compute types that are available.

I don't know, it's gonna be really soon, but then you can just browse what GPUs are available and then provision one, SSH into it right there. And you can pipe all the commands. But I think it goes back to the Outlook CLI. Like, coding agents, it's kind of funny. I feel like coding agents have been so much more effective than general-purpose agents.

And I think a large part of that is it just has access to the terminal, like you said, and that means it has access to everything that you've installed into your terminal. It can run. So, you know, it would write code, and it can compile the code, and if there are errors, it can fix them. It can run your suite of tests, because that's all just in your terminal.

And so, you know, the idea, what got me really excited about the CLI, we're now just churning through building CLIs for the entire business.
Slack, we're building Slack, also Workday CLIs. Let's go. I've also done that for myself first. Really? Yeah. Yeah. Um, we're gonna [01:07:00] open source all of this.

And, like, yeah, all the, they're just CLIs for the business applications. We would love for someone to run with this and build, I don't know, like an open CLI foundation or something. Yeah. Nvidia would love to support anyone that's doing this.

Like, every DevRel tool should really have good CLI support at this point.

Yeah. Like, at one point it was, you want your docs to be accessible by an LLM, right? You want LLM-good docs. Now everything needs some CLI.

Nader: Yeah. It's kind of funny, right? Like, computing began with a terminal, with a shell, but we said that it's not empathetic to humans. So we built these nice user interfaces, and then now we have LLMs navigating our user interfaces.

And ironically, we're not empathetic to the machine anymore.

swyx: Yeah.

Nader: Yeah. Just give the LLM access to the shell.

swyx: One thing that slightly makes me uncomfortable is, like, why do we have to build CLIs? Why can't we just expose APIs? Like,

Kyle: I have an interesting answer to this. So there are a couple of reasons.

Like, you know, portability is one issue. Sometimes APIs are not discoverable or reachable by some, you know, types of [01:08:00] things. There's some element of locality, right? Like, the CLI is literally you interfacing with your local system, which is a little bit different.

You could still do it by API, but there's this highlighting of, like, what is the difference between a CLI and an MCP, right? They kind of occupy the same purpose: you call them, it does something on the system, and that's done. I think that in pre-training there's just an enormous amount.

Oh, okay. Command line data. Yeah.

Yeah.
Like, even, let's ignore RL. Like, you're doing no harness, you're doing no harness post-training. Just the amount of CLI versus API documentation for navigating this world of the CLI and your file system through that is just enormous.

Nader: Yeah. Yeah.

Kyle: Right.

Nader: I think there's a couple of things too. Like, let's say we wanna, so one, I think your intuition's right. The CLI is just wrapping the API,

swyx: right? So functionally

Nader: functionally, right? Yeah. And I think it's nice because, one, you're being very specific, and pedantic even, and that's really good, 'cause you're describing the problem space.

So you know what the, I don't [01:09:00] know, I don't wanna call it the space for vulnerability. You know what network calls you're making. It's not arbitrary, and that's not decided on the fly. That's pre-decided, which is important from a security perspective. But then if you were to write a bunch of API requests, you would probably do that.

I don't know. Would the model use Python to do so? I kind of like that everything like a CLI is just bash, because it's ubiquitous. Like, it's just there. And you don't have to make sure that there are certain environment variables that are set up. Like, if your Python version's, if the, my Python version, we're using the same model to go do the same thing, is it gonna write different code?

It probably would. And so it's kind of like an nice deal work, right? Yeah. Human. Yeah. No, I think just, like, making those decisions happen ahead of time versus, yeah.

swyx: One last thing on this sort of agent, I guess maybe co-location or whatever you call it. One pattern I'm tracking for this year. I always try to think about what's the theme of this year gonna be. Last year?

Definitely coding agents. This year is definitely coding agents breaking out of containment into broadening third world. I go Definitely has. So

Vibhu: you rent a human?

swyx: Yeah.
Yeah.Im on here.swyx: Are you really? [01:10:00]Im like

5,000. I'll do anything. Really? I think so. I need, uh,

swyx: my, uh, my borrow from Costco.

Uh, but I think the best part is only the agent can book me, you know?

Yeah.

swyx: It's very

Kyle: usually like,

swyx: it's just like another labor marketplace. Mechanical Turk was this.

So definitely I have a weird story with why I did it. So back to your example of just giving an agent access to compute, right? Yeah. You guys are GPU rich at Nvidia. Yeah, I hooked up.

Nader: He's not shy about it.

Local GPUs And Scaling Inference

I have a 24/7 agent running, I hooked it up to RunPod.

It doesn't shut down instances. And I'm like, I've tried prompting you, I've given the instruction: shut down when you're done. It's like, I want to keep it warm, I'll need it soon. And it's horrible on time estimates too, 'cause they realize it's like, yeah, I'll need it in 45 minutes. In 45 minutes, I'll shut it down.

45 minutes of human time is actually three minutes of agent time, so it's like, I'm booting it up, I'm waiting, I'll just leave it on all night. And mo moos good at shutting down after some inactivity. I had it on my local server, like a little dual-GPU thing. It just stays on. I have a little space heater at home now, but careful.

[01:11:00] So basically, you know, they don't care about the concept of money. Just burn it. I need it. It's useful.

Nader: And another DGX Spark will be really nice. Like, I think I'm looking at it as super useful for agents, because, yeah, you buy it once, you plug it in, and it can rip. I'm gonna make an Nvidia ad here.

Kyle: Okay. The Blackwell RTX Pro 6000 cards are only, like, I think it's $8,000. Slightly cheaper. Yeah. Well, it's much cheaper than the data center cards.

Vibhu: Yeah.

Kyle: And it's got 96 gigabytes of VRAM. So if you and your crew want to go run a local agent for you, you know, in the home.

I feel like, hmm.
It's got a significant amount of VRAM. I've thought about purchasing this and running it in my basement, except my neighbors would hate me.

It's just a single, like, two- or three-slot GPU. It's mostly,

Kyle: yeah, it's a PCIe.

Yeah, it's

Kyle: PCIe. So, GPU, you can go buy that. I mean, the big difference against the RTX gaming GPUs, I mean, obviously it's Blackwell Pro, it's a pro GPU, and it has a [01:12:00] lot of VRAM, which means you can run pretty large models on it.

You can stack four of them, for the Max-Q, in a system. That's a beast.

Kyle: It's beefy. You can run, uh, what is that, 96 ger or anything? 96, uh, youre on a loge.

Uh, but also they are slow. I mean, performance, speed will be somewhat slower compared to API, like,

Kyle: oh yeah, that's true. So again, the big learning: economy of scale allows you to do things that get you both speed and throughput.

Like, I'll give you an example. There's an optimization called wide EP. I'm not gonna go into it fully, but it featured heavily in InferenceMAX for DeepSeek. And there's a great set of stories from Nvidia and from SemiAnalysis about why wide EP is important, but for MoE models it's basically essential, and the level of expert parallelism, the level of scale-up parallelism used for it is like 32.

So it goes beyond that eight barrier. And it really, really, really is important to have that NVL [01:13:00] 72, GB200 NVLink to serve at scale. And I don't remember the, you know, cost improvement, I think against Hopper, right? Against Hopper, with this NVL72 system, you're getting like 35 times cheaper per token for a lot of the curve.

Yeah.
Which is crazy.

swyx: Yeah.

Kyle: And normalized per GPU, obviously, because the part of the GP is cost or the code, the GST part of the cost.

swyx: One thing I'm exploring is, this year is also the year of the subagent, where you have the main agent, but then that also kicks off tools, which are in themselves agents that have limiteds.

Yeah. And sort of context locally, whatever, right? Yeah. Different prompts. So for example, one thing that Ian does is, before you kick off a search, they do like a fast context model where you kick off April or you just to search, uh, across the code base plus all that. That is better than indexing a lot of the time, not all the time, and you should sell index for some picks, but, like, the idea that agents should be able to command subagents and probably run [01:14:00] them maybe close to inference as well.

I don't know if that's architecturally possible or even

Kyle: Yeah, we're thinking about that for Dynamo. That's like our big theme for the year,

swyx: because if you can design that into your stuff, then a lot more people will use it. Right now it's just kind of theoretical, because.

You do pay a lot of, like, back-and-forth coordination costs. Yes.

Vibhu: I think it'll net speed up though, right? Like, even at a basic level, speculative decoding, you're running a small model, you're running two instances, but it's not,

swyx: that is one example. Yes.

Kyle: Yeah. But this is a little bit different with agents.

Agents, yeah. This is not spec. I think there's a summarization of that trend that I like to say to my team. It's like, this is the year. So there are two things. This is the year of system as model, right?
Where instead of having a single model be a thing, you have a system of models and components that are working together to emulate the black-box model.

So when you make an API call to something that's like a multi-agent in the background, it still looks like an API call to a model. You're still getting back

swyx: tokens, but under the hood.

Kyle: Yeah, under the hood it's like a [01:15:00] billion different models. And that's a lot of complexity. With Dynamo and with other libraries at NVIDIA, we're looking to help manage

Nader: that complexity.

Yeah. It's funny because we actually, for CES, we just released the model router for DGX Spark, where you can have a local model that's running on the Spark and then also a foundation model, and then the model router decides when to send queries to which one. So it's no longer this either-or. It's use the best stuff for everything that's available to you. You have a good post-trained model that's running on

swyx: these. [unclear] also the Brev functionality of being able to manage the Spark.

Kyle: Oh, that'd be cool. Oh yeah,

swyx: [unclear], feature request. There we go.

Long Running Agents And SF Reflections

Kyle: I actually have a question, like, I like to extend and flip it over. How much longer do you guys think agents are gonna be running? Because that's one thing I've been throwing around, like, what happens when,

I mean, always are

Kyle: it

even affects the, like, back to the prefill, the decode, right? Like, yeah. Codex is, I'd say, compared to Claude Code, it's much longer at tasks. Like, yeah, that thing will like to run 6, 7, 8 hours.

I'll run it overnight.

Kyle: Yeah.

And I'll go back, and I have a little crappy logging software I use, and there's just times where it wants to, like, I'm gonna go deep on [01:16:00] research, and it'll eat up 80,000 tokens, go on another, go on another, yeah.
Just eat through tokens and, you know, that's part of it.

Like, at the end it does hit a long task. And I think you only see that expand. Yeah.

Nader: Yeah, there's insatiable demand for tokens, and every improvement that comes kind of just makes our demand even higher. It's kind of funny, right? Like if you have a teammate and you ask them to do a task and they're like, should I save some effort and not think too hard about this task? I'm like, fuck no.

I mean, my favorite was, you can have four shots, right? Yeah. Like the original Codex before the app. Why do one call? Give it four attempts. Just use all the tokens, right? Try more, try again, try more. It's

Kyle: like, it's like the METR index, right? Is the thing that tracks how long models are able to run. I expect that we'll just see log-linear, if not log-superlinear growth. We will see before the end of the year an agent that is capable of running for longer than 24 hours with self-consistency the entire time.

I would also poke at different domains having different [01:17:00] desires, right?

Like at a consumer level, I'm getting slightly frustrated at 20 minutes per basic query. Sure, you can optimize, you know, six, eight hours. I don't see myself shooting off many one-week agents. Right. Someone doing, okay, GPU kernel research or medical or biological, you know, in those domains, sure, shoot off a lot that take a [unclear]. So I think it will be somewhat domain specific, because you also really need to turn that in. Right.

Kyle: It's funny, one of those was doing your taxes. Right. Like, that's tax. Yeah, that's, yeah. Okay. Yeah.

Nader: Get it right. I wonder if the [unclear] sort of, uh, speculative decoding is your agent figuring out at night what you might be prompting it the next day, and pre-fetching.

swyx: Yeah, you can do that.

Nader: Yeah. Really?
Branch, branch prediction.

swyx: Oh, well no, that's too low level, but yes. Sorry. Yeah, yeah, yeah. One question I gotta get, so, uh, we actually did record a pod with the METR folks, uh, with Sarah right here. Their chart is the human-equivalent hours of work rather than how long the agents themselves are being [01:18:00] autonomous.

And that's a huge difference, right? Like human work, five hours; agent work, 30 minutes. It's actually 30 minutes, not, uh, yeah, five hours, right? So that chart that you see is them estimating what the human-equivalent replacement is. Um, I think actually Anthropic released a more recent chart that showed Claude Code autonomy from their production traffic numbers, and that was 20 to 45 minutes. That's roughly where we are. So yeah, that's the sort of realistic thing. I mean, I do think there's experimental setups where we can just, like, Ralph Wiggum it and just prompt it to keep going when it stops.

And obviously that can go arbitrarily long,

Nader: I feel like

from my

Nader: experience. Yeah. I guess 20 to 40 minutes seems right for when I'm using Codex or Claude Code. But then, if I wanna spin up a net new project, I'll often start with Replit and it'll run for, I believe, yeah, yeah.

Like their new, the Agent 3, it'll spin up a web browser and click around and discover new bugs and just keep churning. Um, so I think my longest was over an hour that it had been churning.

I think before [01:19:00] we see super long running, I think there's gonna be a bit of an efficiency hit.

So sure, you can take an hour and go down paths, but you also wanna be more efficient, you wanna be smarter in your reasoning, right? So I think that'll actually go down before we go back up.
Like, you don't wanna scale non-optimized systems just for the heck of it. As much as I love saying use all the tokens, um, you know, they are expensive.

Like going from dense to reasoning models, that's an added cost, right? You're paying for a lot of tokens, and it doesn't make sense to just scale stuff that's not optimized. So there's always that little balance.

Nader: Yeah.

But you know, I think you'll see both sides of it.

Nader: Yeah. So 2023 was super exciting.

I think if you were in SF you were like, okay, I know this is gonna be a huge world-changing moment, but it seemed like, you know, no one had known yet. And maybe even before, was it 2022 maybe?

swyx: Yeah, yeah. I would say, yeah, like roon had this tweet where everyone who was in SF from like 2021 to 2023, yeah, understood what it was like to be late, early.

Nader: Totally. Um, yeah, 2021, that's when I made my first OpenAI account. Yeah, it was crazy. [01:20:00] And I remember it was so funny, because at the time SF had not been doing well. So pretty much what it felt like was the concentration of founders in the city had risen because, um, where my neighbors were used to doing a bunch of stuff, those people had all left.

So the only people that were still in the city were people that really wanted to build tech. It was cheap. It was, yeah, it was also way cheaper. I feel really bad for anyone who is trying to get rent now. But there was, uh, Celo, they had a huge office.

swyx: So blockchain, yeah, they took over the old Casper building.

Nader: Yeah. They had the showroom and they had, I think it was the back warehouse. And it was a huge office. And

swyx: it's right across from OpenAI and Neuralink.

Nader: Yeah. It was in

the original Arena.

swyx: I named the Arena because of it.

Nader: Yeah. Yeah. And so it was really exciting because, like, vo flow, I think, uh, I forgot the, Minify.

Yeah. Minify, uh, Brev was there. You guys were there.
I remember. That was actually, it was there that you bought the AI Engineer domain.

swyx: Yeah. I didn't know what I was gonna do in AI. I wanted to do something,

Nader: but it was kind of this, it was a really fun moment where we were kind of all in this Celo space and, um, I don't know.

It was [01:21:00] a really cool community, especially being so

swyx: early. Yeah. And then you got me early Cruise access. Oh yeah. So there was a period of time when both Cruises and Waymos were just free. Yeah, always.

If you had, I mean, they're so, [unclear] is opened again.

swyx: Yeah. So [unclear] Zoox.

Zoox robotaxi. Yeah. So totally. Yeah.

Nader: Oh. But yeah. And so it's actually really cool that you guys have this studio so close to, uh, Celo. Yeah. There's this rock climbing gym right around the corner. It was like, um, 2000. Oh yeah. Yeah. It's an awesome block.

swyx: Cool. Yeah. Just, [unclear].

Uh, I do think one thing I try to do with the podcast is bring what it's like to be in San Francisco to the rest of the world, and also just, like, maybe give, uh, yeah.

Nader: Yeah. My favorite talk was in the city, uh, and

swyx: yeah, stick and stream. I know. It's very good.

Nader: Yeah. And I guess what it's like to be in San Francisco, I think, is just that everyone seems to be super supportive.

Uh, sometimes I feel like the city believes in you more than you do. And even, uh, I don't know if you remember, but I remember [01:22:00] posting my first blog post, and I had met you on Twitter, and you gave me like an hour of your time super randomly, and you kind of coached me through writing content for developers.

And I was trying really hard not to come off salesy or plug myself. And so I kind of stripped all personality out of the blog post. Yeah. And you brought that out. You're like, it's okay to talk about what you're doing. Like, you don't have to be weird about it.
And I remember just that, I think that really helped me kind of figure out what our voice is and not shy away from it.

And so, always really grateful for you. Hey, you inject your voice into, like, everything. Now it's actually a huge advantage to be, like, very

Kyle: genuine about what you care about.

swyx: Yeah. Yeah. You imagine some, some infra [unclear] DMs you and it's like, can you gimme feedback on this blog post? And it's pretty boring and you're like, fine, you know, he looks interesting, I'll just do a Zoom call, and then you meet this guy. Yeah, right. He's so energetic, so just be, right. But I think people are trained to write a certain way in school and, yeah, they never totally see there's a broader, well,

and

Nader: lots to unlearn

Kyle: writing.

Writing is thinking, and everyone thinks differently. So [01:23:00] might as well just,

swyx: yeah. Yeah.

Kyle: Write your way.

swyx: Cool. Well, thank you for, uh, indulging us in a really broad-ranging discussion, but I love, like, you guys are sort of the young faces of NVIDIA with so much energy, but also a lot of technical depth, and I think people learned a lot from this session.

So thank you.

Nader: This was awesome. Thank you guys. So thank you for everything that you've done, the talk, the podcast, all the above. And, uh, see you at GTC, I'm really looking forward to it. Yeah. Cool. Thanks. That's awesome. Thank you. Thank you.

Originally published at https://www.latent.space/p/nvidia-brev-dynamo.
