Why do we use FreeBSD CURRENT at Netflix? (video, 40m)
OpenFest Bulgaria welcomed Drew Gallatin, who shared his experience from eight years at Netflix, where he works on FreeBSD kernel optimization. Drew, who has also been a significant FreeBSD contributor for more than 25 years, discussed how Netflix uses FreeBSD in its Open Connect content delivery network. Through a practical case study he calls the 'magical mystery merge', he showed the benefits of running the 'current' branch of the operating system.
Describing how Open Connect works, Drew explained how the CDN appliances that serve enormous amounts of data are built. Each appliance has A/B system partitions designed for fast updates: the B side is reimaged, and if the update fails, the server falls back to the A side. Drew also discussed the drawbacks of using ZFS for the content partitions, explaining that Netflix prefers UFS there, since serving from ZFS's separate ARC cache requires an extra copy into the page cache and UFS performs better for this workload.
Drew highlighted the value of simplicity in serving static media files, which are pre-encoded before they ever reach the CDN. In this part of the presentation he also covered the benefits of FreeBSD, whose stability and performance are key for Netflix as a global leader in streaming.
The next part of the talk, on the 'magical mystery merge', described how Netflix changed its approach to updating the FreeBSD kernel. Drew explained how moving from the stable branches to 'current' reduced the technical debt of maintaining out-of-tree changes while allowing problems, including performance regressions, to be identified and fixed much faster.
Toward the end of the talk, Drew highlighted not only the Netflix team's FreeBSD achievements but also its plans for the future. Although the problems described were complex, the team was able to identify and fix them effectively, which significantly improved performance. At the time of writing, the video of the presentation had over 16,801 views and 473 likes, reflecting growing interest in open-source technology.
Timeline summary
- Introduction of the next talk and the speaker, Drew Gallatin.
- Drew's background with FreeBSD and his role at Netflix.
- Overview of how Netflix uses FreeBSD and introduction of the 'magical mystery merge'.
- Description of Netflix's content delivery network (CDN), Open Connect.
- How Open Connect appliances work and how they are updated.
- The different types of Open Connect appliances and their capabilities.
- Netflix's approach to server redundancy.
- Details of video encoding and the importance of video quality.
- Introduction to FreeBSD and its historical context.
- The role of Drew's team in maintaining the software on the Open Connect appliances.
- FreeBSD development practices at Netflix and their merge strategy.
- The benefits of running FreeBSD Current and collaborating with upstream.
- Contributions to FreeBSD, including asynchronous sendfile.
- A look at recent performance-focused contributions to FreeBSD.
- Overview of the work done to rapidly increase CDN server performance.
- A case study of performance problems encountered during an upstream merge.
- Identifying and resolving a performance regression tied to a single commit.
- Q&A, including a discussion of server upgrade practices.
- Closing remarks and thanks to Drew for the presentation.
Transcription
Hello again. We are ready to continue with our next talk. I'm very happy and excited to introduce to you Drew Gallatin, who is here on the stage with me at the moment. Drew has been working for Netflix for eight years, maybe something like that. Yes. And his specialty is FreeBSD kernel optimizations and he's been committing to FreeBSD for more than 25 years. So let's give Drew a round of applause and welcome him on the stage and let's hear more about the magical mystery merge.

Hey, so I'm here to talk a little bit, first, about how we use FreeBSD at Netflix and then to talk about why we run FreeBSD current and to give a nice little case study, which I call the magical mystery merge, which kind of exemplifies why it's good to run current. So I work on our CDN and a CDN is basically a content distribution network. A lot of companies use that to distribute, you know, operating system updates or video games or movies to their customers all over the world. We call our CDN Open Connect and we call our CDN servers Open Connect appliances and we call them appliances mostly because in a lot of ways they have a little bit more in common with your home router than they do with the server at your office. And what I mean by that is we have like an A/B partition and we reimage them every time we update them. So we image the B side and then if that fails, it falls back to the A side. The software is completely identical on every single machine. There's no package on some machine that's not on other machines. This is an example of one of our OCAs that can serve 400 gigabits a second. We currently have three different types of OCAs. We have an all-flash OCA, meaning all NVMe where the older ones are all SATA SSDs. They serve the very most popular content and those are mostly located in our data centers at internet interchange points where the internet comes together. We also have storage OCAs which hold hundreds of terabytes of video assets and that could be anything from a really unpopular title that somebody only watches once a week to even the very most popular things. We also have what we call a global OCA which is given to smaller ISPs to help them offload the traffic that's coming to their network. So for example, if you're watching a movie, it's coming from inside your ISP's network and their uplink isn't affected by that traffic.

So a little bit about the OCAs. We run FreeBSD Current and I'll get into why we run Current. One of the most frequent questions I get asked is, do you run ZFS? Yes, we do run ZFS but we only run it for root. We don't run it for our content partitions. That's because ZFS has its own cache called the ARC or the Adaptive Replacement Cache which is separate from the kernel page cache, which means in order to serve anything out of ZFS, things need to be copied from the ARC to the page cache, which is inefficient, so we run UFS for all of our content partitions. We run NGINX as our web server and the other thing I wanted to mention about our file systems and stuff is that unlike a lot of CDNs, we just let drives fail in place. We don't run any RAID. There's no redundancy except for the root partitions, but for the content partitions, we just let drives fail in place. It's a lot easier to just let it fail in place than to try to get ISPs to replace a drive or to send technicians to data centers. The machines mostly have so much storage that if they lose one or two drives, they're still very useful as they are. Our workload is basically serving static files.
Everything is pre-encoded before it even hits our CDN. If you're watching a video, there will be a different codec for maybe what's on your phone versus what's on your smart TV versus what's on your computer and there will be all the different bit rates. One particular TV show or movie could have dozens of different downloadables for it. We do this because video quality is of the utmost importance and we spend a lot of time and effort on making each encoding as good as it can possibly be. The nice thing is that having everything static simplifies our workload since there's no compute involved on the CDN server.

A little bit about FreeBSD. I think most people here are familiar with Linux. FreeBSD got its start around the same time that Linux did but in a much different way. Whereas Linux was written from scratch, FreeBSD is directly descended from the BSDs that were the basis of SunOS and things like that in the 80s. FreeBSD came out of 386BSD which was a port of the old BSD to x86, 386 boxes. That 386BSD port fell into a little bit of disrepair and there was a patch kit that was maintained and there were two groups of people maintaining the patch kit that sort of forked. NetBSD came out at roughly the same time. They focused on compatibility and correctness and running on as many architectures as they could. Whereas FreeBSD focused on just performance on, at the time, x86 32-bit. Our first 64-bit port was in the late 90s. That's how I actually got my start as a FreeBSD committer. I worked on that port. FreeBSD likes to basically really support the current high-performance architectures. Once an architecture is sunsetted by the vendor, we sunset it as well. The Alpha port was the first port we ever retired because, sadly, Alpha died. FreeBSD is a little bit different than a Linux distro because in Linux, everything is packaged. In FreeBSD, we still have this legacy thing where we have the kernel and all of the basic utilities that you would expect from a shell, like the actual shell itself, ls, mv, cp, that kind of thing. That stuff is unpackaged and belongs in the root file system. We also believe in packaging third-party software. Things like the Firefox web browser or the Nginx web server or things like that come in packages. When you install FreeBSD, you get all of that plus all the source and the man pages and documentation. At Netflix, we have our own stripped-down distribution that we run on our OCAs. That obviously doesn't include compilers or documentation or most packages just because we don't need those things. The team I work on at Netflix is called the OCA dev team. Our responsibility is to maintain the software that runs on the OCAs. Most of us are FreeBSD committers or contributors. There's roughly 10 of us. It's a great place to work and the best job I've ever had.

I want to talk a little bit about how we do FreeBSD development at Netflix. When we first started off, we made the choice that everybody makes, which was to track a stable branch. And, to back up a little bit, the way FreeBSD works is, in some ways, if you're familiar with Linux, similar to Linux. There's what we call FreeBSD Current, which in many ways is similar to the Linux tree, where everything goes in there first. And then there are stable branches, which in Linux, most people call long-term support or LTS branches. In FreeBSD, there's a new release supposedly every two years; in practice it really works out to be every three or four.
Every time there's a new release, a new long-term support or stable branch is created. What we used to do is we used to track the latest stable branch, and then every few weeks we would merge all the security fixes and bug fixes that made it into the stable branch. We would then make an internal release that would go out to our OCAs. That worked really well for the period we were doing that, but what was terrible was when we moved from one stable branch to another, and that would take sometimes, in the best case, weeks. Most times it would take months, up to six months, because you could encounter a problem either in code that we never managed to upstream not being compatible with the new interfaces in the new version, or there could be some kind of regression introduced upstream in the new version that was hard to track down. These merges, every time we updated between stable branches, were just awful. When we ran stable, it was also very hard for us to collaborate with upstream, because in the stable branches the APIs might be different. Some function may take three arguments in stable and four arguments in current or something like that. If we wanted to contribute some change we made, we would first have to port it to current, then somehow test it, because we weren't running current on our machines, and then submit it for review, get it accepted, and then it would finally be there when we went and moved to the next branch. There's very little motivation to do it, which meant we were keeping a lot of things back, which made each successive merge harder, because we were running into more and more conflicts the more stuff we had. We could see that we were building up a technical debt by doing this. About five, six years ago, we decided what we were doing was silly, and what we should do really is to track FreeBSD current. That sounds crazy on the face of it, because that's where everybody pushes all their stuff. Unlike Linux, there are no subsystem maintainers. In FreeBSD, when you have a commit bit, you're allowed to commit whatever you want, and you're encouraged to get reviews, but it's treated with freedom and responsibility. At any rate, current is sometimes somewhat unstable, so running current sounds crazy, but it's actually the best thing in the world, because when we run current, we do the same merge cycle every three or four weeks, where we pull in stuff from upstream, and when we do that, we catch things really fast. If there's some regression, we catch it right away. There's no two or three-year delay between somebody committing something and us finding it's a problem. It also allows us to upstream things much more quickly, because we no longer have to port it to basically a different version of the operating system. It allows us to collaborate with upstream developers and get our changes into FreeBSD quickly. Over the years, the amount of things we've held back has decreased and decreased and decreased. Speaking of upstreaming code to FreeBSD, basically, since we run current, our tree is almost identical to the upstream FreeBSD tree. If we want to add a small feature or make a small bug fix, what we often do is, rather than doing it in our tree, we actually do it in the actual FreeBSD tree itself, get it reviewed and committed, and then that will just naturally come back in our every-three-week merge cycle. That greatly reduces the amount of technical debt we accumulate by keeping our own patches.
If there's something critical like a security fix or a bug fix for a crashing issue or something, we'll cherry-pick it immediately rather than waiting for the merge cycle. But for larger changes, like when we did kernel TLS, that was a very large, very invasive change. We kept that to ourselves for close to five years, not because we were trying to be secret or proprietary, but just because we wanted to get it in the form that it actually needed to be in to be useful upstream and not just to us. So we do testing. Every time we push to a branch, a Jenkins job gets kicked off. At Netflix, we run on both AMD64 (which Microsoft also calls AMD64 and everybody else calls x86-64) and ARM64. So we run a small regression suite on that. Then every night or close to every night, we run what we call a smoke test, where we run with all debugging enabled on the kernels, meaning memory corruption checking, use-after-free checking, lock reversal, lock ordering checking. All the checking we can check is running close to every night on a small cluster of machines looking for any bugs that we've introduced. Then when we do release testing, we run on a slightly wider set of machines with all the bug checks off, just looking for any performance differences or any other regressions that we can find.

So we've contributed a lot to FreeBSD over the years, which I'm very proud of. Probably the most important, most fundamental thing we've contributed is something we call async sendfile. That was done by my colleague, Gleb Smirnoff, in collaboration with Nginx. The gist of that is that sendfile is a system call, which basically says, if you have a file here and a socket there, you don't have to read from the file into user space and then push things back into the kernel. It tells the kernel, take data from this file and put it on this socket. The problem with sendfile in most operating systems is that it blocks, because you have to wait for the stuff to come in from disk. Asynchronous sendfile basically allows the web server to just say, do this sendfile, and then return immediately. Then the operating system in the background brings the data in, and then the interrupt completion handler for the disk drive then pushes things out to the network. That allows us to avoid having thread pools or using async I/O or doing any sort of thing like that. That one thing is the fundamental optimization that's enabled everything else. Something else we do is what we call unmapped mbufs, and that was done to prepare for kernel TLS. Basically, mbufs are the fundamental network buffer in FreeBSD. Unmapped mbufs are essentially very much like a Linux skb frag structure, where rather than pointing to one data object, they can point to many data objects. One single mbuf can carry an entire TLS record with four 4K pages or five 4K pages and header and TLS signature information. That just by itself, even without TLS, is a big optimization because it saves a lot of cache misses when you're doing pointer chasing in things like socket buffers. There's kernel TLS, which basically moves bulk data encryption from OpenSSL in user space to the kernel. The reason that that's valuable is that it, again, avoids doing lots of boundary crossing because without kernel TLS, you need to read data from the file into your user space application, do the crypto, and push it back into the kernel. This allows you to avoid that detour, do the crypto in the kernel, and have things go right out.
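For readers who want to see what the sendfile(2) path described above looks like in code, here is a minimal sketch of a FreeBSD sendfile call. It is illustrative only, not Netflix's actual serving code; the open file descriptor and connected socket are assumed to exist already.

```c
/*
 * Minimal sketch of FreeBSD's sendfile(2) as described in the talk.
 * Illustrative only: "fd" is assumed to be an open file and "sock" a
 * connected TCP socket.
 */
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>

#include <err.h>
#include <stdint.h>
#include <stdio.h>

static void
serve_file(int fd, int sock)
{
	off_t sent = 0;

	/*
	 * Ask the kernel to move the whole file (nbytes == 0 means "until
	 * end of file") onto the socket without copying it through user
	 * space.  With async sendfile the call does not block waiting for
	 * disk I/O; the kernel finishes reading the data and transmits it
	 * in the background.
	 */
	if (sendfile(fd, sock, 0, 0, NULL, &sent, 0) == -1)
		warn("sendfile");
	else
		printf("handed %jd bytes to the kernel\n", (intmax_t)sent);
}
```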
Even better, if you have a network card like a Mellanox ConnectX-6 Dx or newer or a Chelsio T6, then you can do kernel TLS, actually NIC TLS offload, which lets the NIC itself do the encryption. It makes it essentially, from the kernel's perspective, like there's no crypto at all. Stuff just comes in from storage, goes out on the network. The kernel never even touches the memory. That's why the mbufs are unmapped. We've also contributed something we call the CAM I/O scheduler. CAM is the acronym for FreeBSD's storage subsystem, the Common Access Method. The idea behind that is that we like to be able to fill content that's needed at a particular site from other OCAs at the same time we're serving traffic. We found that with a lot of storage devices, if you're doing a lot of writes, the read performance suffers. The CAM I/O scheduler moderates the rate at which writes or reads happen in order to keep everything nice and smooth. We've contributed the RACK and BBR TCP stacks. You may have heard of BBR. It was done by Google initially in Linux and ported to FreeBSD by my colleague Randall Stewart. It's also the congestion control algorithm that's used in QUIC. We actually don't use it much. We use something called RACK, which is similar in spirit. That was completely written at Netflix by Randall Stewart. We've also contributed the TCP pacing system, HPTS. The point of that is that if we wanted to send data as fast as possible, I could send data at 100 gigabits a second to your cable modem in a 64K or 128K chunk, and your cable modem's buffer might only be 32K long, and so you would drop most of it and you would end up having lots of TCP retransmits because we'd just be shoving data at you and it would be dropped all the time. So the pacing system basically allows us to, again, deliver data more smoothly and avoid packet loss. We've also contributed some performance enhancements for NUMA. I've given a lecture about that. You can find it on my page at freebsd.org. We contributed something called pfil memory pointer hooks, which, to be honest, was inspired by Linux's XDP, and the idea behind that is that if we're under a denial of service attack and we need to drop a lot of traffic, we want to do it as efficiently as possible. What this does is it allows the firewall to basically get a pointer from the device driver to the packet that was just received, and the firewall can essentially say, hey, that's good, go the normal way, or drop it immediately. By doing that, the network driver can simply recycle that buffer pretty much for free, whereas if we went to the normal path of the firewall, the driver would have to allocate a new buffer, map it for DMA, un-map the old buffer for DMA, pass the buffer up the chain of the network stack where everybody's looking at it and taking cache misses, and then finally when the firewall said, no, we can't have this packet, it would have to free the packet, and so with this, we can drop tens of millions of packets per second and still serve without any impact to our serving. We've done too many scalability fixes to mention, probably more than I can remember, and most recently, my colleague, Warner Losh, did this thing called kboot, which started off on PowerPC64 and then came to ARM64 and AMD64, and it's a way to run FreeBSD directly from Linux, so in Linux, you can kexec another Linux kernel, or some Linuxes use it to do crash dumps. Well, we use this to boot on systems that use LinuxBIOS where there's no UEFI, so we can kexec directly from Linux into FreeBSD.
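As a rough illustration of how a TCP stack such as RACK gets selected on FreeBSD (a sketch that assumes the tcp_rack kernel module is loaded; this is not Netflix's configuration code), an application can pick a stack per socket with the TCP_FUNCTION_BLK socket option:

```c
/*
 * Sketch: select the RACK TCP stack for a single socket on FreeBSD.
 * Assumes the tcp_rack kernel module is loaded (kldload tcp_rack);
 * "sock" is assumed to be a TCP socket created by the caller.
 */
#include <sys/socket.h>

#include <netinet/in.h>
#include <netinet/tcp.h>

#include <err.h>
#include <string.h>

static void
use_rack_stack(int sock)
{
	struct tcp_function_set fs;

	memset(&fs, 0, sizeof(fs));
	strlcpy(fs.function_set_name, "rack", sizeof(fs.function_set_name));

	/* Switch this connection from the default stack to RACK. */
	if (setsockopt(sock, IPPROTO_TCP, TCP_FUNCTION_BLK,
	    &fs, sizeof(fs)) == -1)
		warn("setsockopt(TCP_FUNCTION_BLK)");
}
```

The same name can also be set fleet-wide via the net.inet.tcp.functions_default sysctl; the per-socket option is shown here only because it is the smallest self-contained example.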
We also contribute in other ways than code. We have great relationships with a lot of hardware vendors, like people really like us and want to sell us stuff, and so if somebody comes to us with a product that we're really interested in that only has a Linux driver or only has a Windows driver, we collaborate with them and help them to find resources to convert their driver to FreeBSD and to upstream it, and I can't mention which, but there's a lot of drivers in FreeBSD that wouldn't exist if it wasn't for this, and we also, Netflix as a company and also us personally, most of us contribute to the FreeBSD Foundation. So my job is to improve performance, but I can't just improve performance, because the easiest thing I could do to improve performance would be to get rid of RACK TCP and to get rid of the TCP pacing system, but that would make our members' quality of experience terrible. One of the things we really pride ourselves on at Netflix is reducing the number of times you see the little spinning wheel where things are rebuffering, and so like if I got rid of the TCP pacing, well, CPU use would go way down, but our members would see a lot more of the little spinning wheels for rebuffering, and so it's a fine line to walk on how we can improve performance, but my goal is basically to improve bandwidth on the same hardware and to also reduce power consumption by being able to deliver the same amount of traffic with a smaller machine.

And here are some things that I'm particularly proud of. In 2017, we had the first, what I believe is the first 100 gigabit per second production CDN server in the world, and that was due to kernel TLS. In 2020, we had the first 200 gigabit per second CDN server. That's an AMD Rome-based system, and it was more like 240 gigabits. In 2021, that same hardware became capable of doing 400 gigabits because we enabled the NIC TLS offload, so we moved the crypto from the kernel down to the NIC, and we had those machines deployed for a while, and there was some software support that needed to be done before we could actually enable that in the fleet. And in 2022, we did the first 800 gigabit per second CDN server, which is basically very similar to the first one, except it's a dual socket system with twice as many NICs and twice as much storage. And the thing that I'm particularly proud of, which isn't quite there yet, but I included it in the presentation because I'm so excited about it, is that we should have the first server that can do 100 gigabits per second for less than 100 watts, meaning that for the power that you used to have for a lightbulb, you can serve video to 20,000 or 30,000 customers. And that's based on an NVIDIA BlueField DPU, and this is basically a 100 gig NIC that has 16 ARM cores on it, and I think something like 48 gigs of DDR5 memory. And the cool thing is it doesn't need a host. You plug it into a PCI expansion chassis, plug in some NVMe, and you have a system that uses very little power that can serve a lot of traffic. And it's not quite there yet because the prototype we have in the lab is using a 1,200 watt power supply, and measuring at the wall, we're getting about 125 watts, and we ordered a power supply that's smaller because if you have a power supply that's 1,200 watts and you're only pulling 100 watts, the efficiency is terrible.
And it also has more NVMe drives than it needs because I'm lazy and I want to ramp machines up quickly, and the more storage I have, the more clients will come and find stuff that they want, so the more traffic I get quickly. And so once I pull a couple of NVMe drives and move to a better power supply, I'm hoping it'll drop from 125 watts to below 100.

All right, now here's the title of the talk, and this is basically why we run FreeBSD Current. So we did an upstream merge in August, and as things started out, it was a very easy merge, like only two merge conflicts, everything looked great, and then I went to test it on my original 100-gig machine, where I test every merge for performance, and I noticed that the CPU use increased by roughly 20%. And I did everything that I normally do when I'm looking for performance problems. I looked to see if there was some function that was super hot, maybe there's lock contention, and I spent a long time looking over profilers, and I just couldn't figure out why the CPU use had gotten so high. So this is an internal graph of ours that shows the bandwidth being served. That's roughly 92 gigs, which is what we serve on our 100-gig machines. And this is what the CPU use should look like, in the low 40% range, and this is what it looks like after the merge, kind of jagged and janky and like 50% or above. And so when you don't know what you're doing, and you have no idea what happened, what you do is you bisect. And I'm not sure how many people here are developers. Has anybody here ever done a git bisect? Can you raise your hand? All right, so I'll try to explain it then. So basically, with a git bisect, you take the last point where you knew everything was good, and then you take where you are, where it's bad, and you check out a tree that's halfway in between. And if that tree is good, then you know that everything before it was good, and so you check out a tree halfway between the bad point and the good point, and you just keep narrowing things down like that until you figure out which commit is the problem. And since we run FreeBSD in a subtree of our entire tree, it was kind of painful because I did the bisect upstream, looked at the git hash, and then did the merge on that git hash in the subtree, and redid the merge conflicts over and over again, which was really boring. So the process was basically to do that, build an image, install it, and test it, and that whole process took about four hours per bisection step. And it was quite slow and painful, and as SpongeBob would say, like one eternity later, I found the commit. And of course, it still didn't make any sense. Usually when you do this, you look at the commit and you're like, oh, now I understand what happened. In this case, it still made absolutely no sense. And the funny thing is, this is the most famous FreeBSD commit that I can remember. There was a guy, Colin Percival, who's a FreeBSD committer. He also runs Tarsnap, if you've ever heard of that. It's a backup service. And he runs it in AWS, and he wants to be able to boot AWS instances as quick as possible. So he's been working on optimizing the boot speed of FreeBSD, and he discovered that there's this sorting routine that runs when FreeBSD boots that was using what he said was a bubble sort, which is a horribly inefficient sort. And FreeBSD spent a lot of its time booting just running this sort, and he fixed that. And the commit that broke everything for us is that commit where he fixed the sort.
So you can see that's a fairly popular tweet, at least for something related to FreeBSD. He was on the front page of Hacker News for a while. And so, to detour into what a SYSINIT is. So basically, every kernel subsystem needs to initialize itself at boot. And that's done using this SYSINIT macro. And the linker, when you build the kernel, sorts every SYSINIT into this alphabetically ordered list. But you don't really want an alphabetical order, you want it in the order that you need to initialize the subsystems in. So that's where the sort comes in. There's 79 subsystems, and within each of those subsystems, there's something like eight different levels. I want to go first, I want to go second, I want to go last, I don't really care where I want to go. And so they're sorted at boot first by subsystem, and the subsystems have an order that they're initialized in. And then within those subsystems, those ordering hints are applied. And so the key is that the SYSINITs that are tied, that have the same subsystem and the same ordering hint, should be able to run in any order. But it turns out the original sort wasn't a bubble sort, it was a selection sort. And that meant that ties were handled differently. And so we went from having everything still in alphabetical order, all the ties were kept in alphabetical order, to having all the ties in reverse alphabetical order. And Colin and I both verified that independently using this thing he made called TSLOG, which logs what's running at boot and how long it takes. And so the easy fix for this was to reverse the order that his new sort was using for ties. And once we did that, everything worked, but it still made no sense, like why did that matter? And so, excuse me, very dry mouth. And so basically, I wanted to figure out why it mattered. So I hacked the kernel to control the order that things were sorted in. And so first I reversed entire subsystems, and I did another bisection where I took the first half of the subsystems, reversed them, saw whether that worked, and then did a bisection that way. And then once I found the subsystem that was at fault, which was the driver subsystem, I then did the same thing with all the things in that subsystem, essentially a second bisection. And that took a long time too, but it was faster because I didn't need to recompile and reimage anything because I could just use, you know, boot arguments to do it. But eventually I figured out what the real problem was. And so essentially there were multiple CPU frequency control drivers that thought that they wanted to attach to the CPU frequency control hardware on our Intel Xeons. And this happened because the way drivers are supposed to work in FreeBSD is there's supposed to be a probe score. So every driver kind of gets to look and say, oh, yeah, yeah, I know what that hardware is. And that's exactly what I was made for. And I want that 100 percent. Or, you know, I think maybe I can handle this hardware, but I'm not sure. And if somebody else, you know, better comes along, you should give it to them. Unfortunately, these drivers weren't doing that. And they were either returning like a one or a zero saying, you know, I can handle it or no, I can't. And both of these drivers were saying, yeah, I can handle it. And so when that happens, the first guy to get there wins.
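To make the SYSINIT ordering walked through above concrete, here is roughly what a kernel-side registration looks like. This is a generic, hypothetical example rather than one of the drivers involved in the bug; two entries with the same subsystem and order hint are exactly the kind of tie whose relative order the new sort changed.

```c
/*
 * Sketch of a FreeBSD kernel SYSINIT registration (a hypothetical
 * example, not one of the drivers from the talk).  The linker gathers
 * all SYSINITs into one set; at boot they are sorted first by subsystem
 * and then by order hint.  Two entries with the same (subsystem, order)
 * pair are a "tie" and may legitimately run in either order.
 */
#include <sys/param.h>
#include <sys/kernel.h>

static void
example_init(void *dummy __unused)
{
	/* One-time initialization for this hypothetical module. */
}

/*
 * Run during the driver-initialization pass, with no preference about
 * where this lands among the other SI_ORDER_MIDDLE entries.
 */
SYSINIT(example_init_id, SI_SUB_DRIVERS, SI_ORDER_MIDDLE, example_init, NULL);
```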
And so that meant that when the order was in alphabetical order, the newer, more modern Intel EST driver attached first, and the older Pentium 4 thermal control something-or-other driver got there second and lost, because it was second. When the order was reversed, the older, much worse driver attached first and the newer, appropriate driver got there, found somebody else was there, and didn't attach. So things had worked accidentally for years, close to a decade, I think, just because of that alphabetical ordering. My colleague, Warner Losh, is actually working to add appropriate probe scoring to all these drivers so that, you know, the ties will be broken appropriately. And I was kicking myself because FreeBSD has a sysctl node which tells you the frequencies that are supported and stuff. And when the P4TCC driver attached, you saw basically garbage. There were frequencies that weren't supported by the processor. And that number behind the slash is like how much power it thinks it's supposed to use, which meant that it had no idea, whereas the EST driver saw all the actual frequencies and knew all the actual power consumption levels for the different frequencies. It was just correct. And so what was happening was basically the CPU was just running at a slower speed than it should have been running at. And so everything was fine. There was no actual regression in the sense of something actually getting less efficient. It's just that the kernel was running the processor at a slower speed without even realizing it.

And so this is, you know, basically an example of great community interaction. I reached out to Colin after I figured out that that commit was the problem, on an internal IRC chat room. And within minutes, he responded and we both independently verified that the sorting order was different. And the key thing is that he remembered the change because he had just done it a few weeks ago. He remembered the change and he was super happy to help. He posted a fix for review that same day. It landed in FreeBSD current the next day. And I pulled it down into our merge branch, you know, within a few minutes of it landing. So the benefits to us, sorry, the benefits to the community rather, are that we noticed this bug almost immediately because we run a lot of server class hardware. Other people might see some kind of problem like this and might just say, oh, the new version of FreeBSD is slower, I don't know why. But we were the first to notice the regression and we were the first to be able to attribute it to this change. And later, we realized that there was one other bug that popped up around the same time, some kind of kernel crash in an AMD temperature driver, that was due to the same issue. And the benefit to us, to Netflix, from this is that this is a crazy bug that required bisection. Having like three weeks of changes to bisect was terrible because each bisection step takes like four hours. And so, you know, bisection is kind of a log base two type thing, where if you have twice as many changes, it's only one more bisection step. But if we're talking three years' worth of changes, I can't imagine even doing that bisection. It would have driven me insane. And the other benefit to us is that since we found it so quickly, Colin was very responsive and it got fixed almost immediately.
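As a hedged sketch of the probe-scoring convention described above (a hypothetical driver, not the actual est(4) or p4tcc(4) code), a well-behaved newbus probe routine returns a priority constant rather than a bare 0 or 1, so that a more specific driver can win a tie instead of whoever happens to probe first:

```c
/*
 * Sketch of a newbus probe method that returns a proper probe priority.
 * Probe values closer to zero win, so a generic driver returning
 * BUS_PROBE_GENERIC loses to a more specific one returning
 * BUS_PROBE_DEFAULT, instead of the tie going to whoever probes first.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/bus.h>
#include <sys/kernel.h>
#include <sys/module.h>

static int
example_hw_supported(device_t dev)
{
	/*
	 * Hypothetical check: a real driver would look at CPUID features,
	 * PCI IDs, or ACPI tables here.
	 */
	return (device_get_unit(dev) >= 0);
}

static int
example_cpufreq_probe(device_t dev)
{
	if (!example_hw_supported(dev))
		return (ENXIO);		/* not our hardware at all */

	device_set_desc(dev, "Example CPU frequency control");

	/*
	 * This driver can drive the hardware but is not the best match,
	 * so it returns a low priority rather than claiming the device
	 * outright with BUS_PROBE_SPECIFIC (0).
	 */
	return (BUS_PROBE_GENERIC);
}
```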
If this had been a change that had been in FreeBSD for three years, it would have been perfectly reasonable for him to say, what, you've got to be crazy. This has been in the tree for three years. Nobody's complained about it. Like you guys have got to be doing something wrong and be in denial. That's not to say anything about him. Maybe he wouldn't say that. He's a great guy. But if it was one of my changes, I would definitely be like, what are you talking about? I did that change three years ago. I barely even remember why I did it. Like why are you talking to me about it now? Nobody's complained about it. Obviously you're doing something wrong. Just go away. And that would be a very easy thing for somebody to say after three years. And so that's basically all I really had to talk about.

And it looks like we have about 10 minutes for questions if anybody has any. No. No questions? Okay, you can go to the mic please. How much time does it take to upgrade all your servers? How much time does it take to upgrade all of your servers? So essentially the servers are upgraded in waves. We don't do them all at once because there could be some bug that we didn't catch in testing. So I think that, I don't do the upgrade of all of our servers. Our fleet is on the order of tens of thousands of machines. We have an operations team that does that. I believe it takes roughly, roughly a week I think, but don't quote me on that because I'm not, I'm not certain. In fact, one of the worst bugs we ever had was due to a bug in network interface firmware. We took new firmware for the network interface card, and it had a bug. And we do a lot of monitoring of temperature, of optical transceivers, of light levels, of everything, and we slam the I2C interface that's on those transceivers. And there was a bug in their firmware where like one in a million, one in ten million I2C reads would time out and cause the NIC to reset and just basically go dead. And so, this was like five or six years ago, we started rolling the firmware out and we noticed that maybe one in a thousand machines was not responsive after a day or so. And that was the worst bug that we ever had, because when a machine's not responsive, it's really hard to deal with, especially if it's at a site where we don't have IPMI access to it. So we have IPMI access to everything that we have at an internet interchange point, but we don't have IPMI access to things that are in ISP data centers. And so thankfully we found that before we hit any ISP data centers, so we could actually reboot the machines. But since then, we've been much more careful about taking new firmware. In fact, we actually have our own copy of some of the firmware that we know works, rather than just taking more of the upstream firmware that's...

And after you fixed the driver sorting bug, did the performance improve or was it like before the update? It was, yeah, the performance was as we expected. It was the original performance in the low-to-mid 40% range that we expected.

Why FreeBSD and not Linux or some other distribution? That's a question that is likely to start a flame war, so it's a question that my management has told me not to get into. But we're not stupid, we wouldn't be running FreeBSD if it didn't work better in this application for us. We're not religious. I'm not religious. In my past, I've worked on Linux.
I worked on Linux at a company where I wrote Unix device drivers for Linux and Solaris and FreeBSD and macOS and ESX, VMware, and all kinds of crazy operating systems. And I worked on Linux when I worked at Google, so I'm not religious. It's just, right now, it just seems like it's the best tool for the job. I think the initial decision may have been made based on the license. I wasn't there, so I don't know. Anything else? Drew, thank you very much for sharing bits of your really challenging work with us. It was really interesting. We have prepared something for you. Oh, well, thank you. Okay, let's give him another round of applause.