How does Netflix use Java in 2025? (video, 48m)
Paul, a Java expert who works at Netflix, explained in his talk how Netflix uses Java, noting that the presentation changes from year to year as the company evolves its architecture and technology. He said that although some people criticize Java, his Java Platform team at Netflix builds the platforms that everyone else develops on, including the DGS framework, their GraphQL tooling. Netflix's backends are predominantly Java, while the user interfaces are written in whatever language fits each device. The talk walks viewers through the complexity Netflix deals with given its huge user base and very high number of requests per second.
Paul highlighted the differences between Netflix's applications, which span streaming workloads and more traditional enterprise applications, each with its own needs and failure models. For streaming, Netflix must ensure low latency and process enormous amounts of data, whereas the business applications can use relational databases and apply stricter standards when choosing where to persist data. Paul also described how they efficiently manage the fan-out across many microservices, using GraphQL to streamline their operations. This overall structure lets Netflix run more consistently and reliably.
Later in the presentation, Paul covered JDK upgrades, stressing the importance of migrating to newer versions such as JDK 17 and 21 to improve performance. As a result of the migration, Netflix saw significant improvements, such as lower CPU usage on garbage collection and shorter pause times thanks to ZGC. This allowed them to run a more demanding yet safer architecture, which translated directly into a better user experience.
Paul also focused on the introduction of virtual threads in Java, which can simplify application structure and let developers write concurrent code more effectively. He described how virtual threads have the potential to replace the more complex patterns of reactive programming. Although there were problems with their early adoption, Paul is optimistic that the current JDK 24 release will let Netflix use virtual threads with much greater success.
Closing his talk, Paul noted that many applications at Netflix use the Spring Boot framework, which gives developers a familiar, comfortable environment that integrates with Netflix's infrastructure and ecosystem. Thanks to a close collaboration with the Spring team, Netflix has made great strides in optimizing its application services. At the time of writing, the recording of the talk had reached 213,196 views and 5,654 likes.
Timeline summary
- Introduction to how Netflix uses Java.
- The speaker has given similar talks over the years, updating the content each time.
- Discussion of Netflix's changing technology and architecture.
- The speaker shares an anecdote about his conference keynote appearance.
- Explanation that Netflix pays no licensing fees to Oracle, because it uses OpenJDK.
- Answers to audience questions about the future of Java compared to other languages, such as Rust and Kotlin.
- Introduction of the speaker, Paul, and his role at Netflix.
- Overview of the Java-based architecture at Netflix.
- Clarification that not all user interfaces at Netflix are built in Java.
- The importance of low latency, achieved by using multiple Amazon regions.
- Discussion of Netflix's streaming architecture versus traditional enterprise applications.
- Handling failures in applications, and how service requests can degrade gracefully.
- Differences in traffic patterns between streaming and traditional applications.
- Details of the migration of all Java services to Spring Boot.
- Outline of the steps taken to upgrade from JDK 8 to 17.
- Improvements in garbage collection that came with the JDK upgrades.
- Discussion of the benefits of switching garbage collectors.
- Introduction to virtual threads and their integration into the frameworks.
- Challenges of rolling out virtual threads in production.
- Improvements to virtual thread handling introduced in JDK 24.
- Overview of Netflix's application framework, based on Spring Boot.
- Challenges of the javax-to-Jakarta namespace change during the upgrade process.
- Discussion of the DGS framework for GraphQL and its role in Netflix's architecture.
- Conclusions on using GraphQL and gRPC as IPC mechanisms, and the drawbacks of REST.
- Closing remarks and invitation for questions.
Transcription
All right. We can get started. So, I'm just going to talk about basically how we use Java at Netflix. You might have seen this talk before in a slightly different variation, because over the last few years I've done similar talks, basically just iterating on how we use Java at Netflix. However, every time I give this talk, it's different, because our architecture keeps changing. We keep changing our technology. We keep learning. Some things go away. Some things come back in. So even if you've seen this talk before, it's probably going to be different than what you might have seen, like, two years ago.

Anyway, before we jump into that: yesterday, if you were at the keynote, I had my three minutes of fame on the keynote stage. And Bruno posted this picture. And then the rest of the day, my Twitter was on fire. Or everyone's favorite social network. And these are some of my favorites. The first question was actually a really good one: I wonder how much we pay Oracle in licensing fees if we build everything in Java. Well, hopefully everyone here knows this: that is zero, because we use OpenJDK. We do actually have a contract with Azul, but that's completely beside the point. Yeah, we don't pay Oracle at all. And then: why not Rust? And this was not the only person. And then the next person asked, well, why Rust? That is actually a pretty good answer to that. So it has been back and forth a little bit. A lot of people apparently are not so happy with Java. Some people never want to watch Netflix again, because apparently it's tainted by Java. So that was interesting. And then there was this one person: well, Java sucks, it's heavyweight and slow, et cetera, you should do Kotlin. I don't think that's how it works. But okay. And then the why-not-Rust discussion was just kind of fun. Anyway, that is kind of how people look at Java, I guess.

So, let's get into some more serious topics. My name is Paul. I'm a Java Champion, and I'm on Java Platform at Netflix. On Java Platform, we are responsible for the Java framework that we are building our stuff on, for the JDK, the build tooling, and all these things. So we basically build the Java platform that everyone else at Netflix is building their services on. I'm also one of the original authors of the DGS framework, which is the GraphQL framework that we use. We'll get into it a little bit. That's also part of Java Platform.

Okay. So we're going to start with more of a high-level overview of the architecture. That is going to be about where we use Java. And then we kind of bubble up the stack: talk about the JDK, the Java frameworks we use, some build tooling, and get a whole picture of where and how we use Java. You're probably familiar with this screen. This is the Netflix app, where you choose the next show you might want to watch. There was actually some confusion in that thread I just showed: people were assuming that everything being Java also means that all the UIs are in Java. The UIs are just in whatever language is most fit for the device you're on. So if you're on a TV, that is different from a mobile device, but none of it is Java. Maybe except Android, because that's kind of Java, but not really. Anyway, everything that I'm saying is about backends. If we think about the Netflix streaming app, that is only one aspect of the type of applications we build at Netflix. But there are some very specific things about Netflix streaming.
It is extremely high RPS, so many requests per second, and that's just because of the number of users we have. We have many, many millions of users. That just means there's a lot of traffic that we have to deal with in our backends. We're multi-region, meaning that we're in four different Amazon regions. And that is just so that wherever you are in the world, you have very low latency to our backend services. Because if you're in Europe, let's say, and you have to connect to a US data center, that just adds a lot of latency and makes things slower. That's not a good experience. So we are in four different Amazon regions. But of course, that has all sorts of implications for our backends, because now we have to deal with these different regions. And connecting from one region to another region is very expensive and relatively slow. In relative terms, of course, because it just adds milliseconds, but that is a problem by itself.

Typically, a request comes in, and this is one request coming in from a device to our backends, and then we have a huge fan-out to all the different microservices that we have. We'll get into that a little bit further. But we have to deal with that fan-out. If you think about failure when it comes to Netflix streaming, we can usually just do a retry. We also set very, very aggressive timeouts, so that latency overall stays very low. But if one of the requests to a microservice times out, or fails for another reason, usually we can just retry. And that will usually just fix the problem, because we land on another instance and things might be better there. Because the one instance that we hit the first time might be in a garbage collection pause, or something like that. So we can get away with just retrying on failure, for the most part. And if that doesn't work, we can often just return a response to the device with maybe some data missing, because the Netflix app as a whole will still be fine. You might miss a specific recommendation or something like that, but for the most part, you as a user won't even notice. So we can be okay with some failure. And that is important, because we have a huge fan-out to many different backend services.

The other thing is, we typically don't use relational data stores on the streaming side. Again, that is because of being multi-region. That doesn't work so well for relational databases, for the most part. And also the type of data we store doesn't fit as well in a relational model. More often, we use things like in-memory distributed data stores for caching and things like that. That is kind of more common on this side. And that's Netflix streaming.

On the other hand, we also have many more traditional enterprise apps. Netflix is also one of the largest film studios in the world. And that means we have built a lot of software around movie production: to manage people, and equipment, and stages, and whatnot. Everything that comes with actually filming. A lot of that software is just built in-house, and these are typically apps that are more traditional. So these are probably the things that you or your company very likely work on. More traditional: you have a UI, you have a backend, the UI stores data in a database or something like that. Super critical apps for the business, but very different if you look at the traffic patterns. Compared to Netflix streaming, these apps are usually very low RPS.
There's not that many concurrent users compared to Netflix itself. We can typically get away with a single region. We don't have to be in every data center in the world for low latency. But the data is often a very good fit for a relational database. That doesn't mean that we always use a relational database, but it's often a good option. On the other hand, failure is not an option. If you are in movie planning, for example, and you have to save some data from your UI, it's not acceptable that the data just disappears. We actually want it to end up in a database. So that whole retry-on-failure mechanism that we can rely on on the streaming side doesn't quite apply for these enterprise apps; the tolerance for failure is very different. So: much lower traffic, much easier to scale, very different failure model.

Now, if you look at the architecture that we use for both of those, interestingly, it is kind of the same. So this is Netflix streaming. From your TV, or whatever device you run Netflix on, a GraphQL request will go to our API gateway. That's the first place you get to. And that means there's one HTTP request containing a GraphQL query. Now, the GraphQL query is actually a federated GraphQL query. So although from the perspective of the device there's one giant GraphQL schema, in fact that schema is implemented by many different backend services. For this one small query that we have, a GraphQL query asking for the title and artwork URLs for shows, which is what we need to build this screen, we might have to hit three different backend services. And the federated GraphQL gateway basically takes care of that. The federated GraphQL spec defines how a backend service can register its schema with the schema registry. And based on that, the gateway knows: oh, if I have to fetch titles, I have to go to the movie catalog service. We call these services DGSs, domain graph services. That's just a term we came up with. But basically, this is just a GraphQL service built with the DGS framework and built with Spring Boot. And that means it's all Java.

Now, very often from there, there's even further fan-out. And for that fan-out, we use gRPC. So most of the time when we go from a device or UI to the backend, that's all GraphQL, because GraphQL fits really, really well for that kind of model. We don't do any REST anymore. gRPC doesn't really work in that case, because devices usually don't work well with gRPC. With a device, you are usually better off with something HTTP-based. But if we go from Java service to Java service, we often use gRPC, because that is an extremely fast mechanism. It's a binary protocol that's very efficient. And we can model the services more like: I'm calling a method on another service. Which, for service-to-service communication, is kind of the model you want to think in. And then you have all sorts of different data stores. It can be an in-memory distributed data store like EVCache, or we have things like Kafka and Cassandra. We have many different data stores that we use, all with their own pros and cons; depending on the use case, you might use one over the other.

Now, if you look at the same picture for the more traditional studio and enterprise apps, it's actually the same GraphQL-based architecture. It's the same thing. We moved everything over to GraphQL, because that just works very well to get flexible schemas.
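To make the DGS idea a bit more concrete, here is a minimal sketch of what a data fetcher in a DGS framework service can look like. The schema and names are illustrative only, not Netflix's actual catalog service; the annotations themselves (@DgsComponent, @DgsQuery, @DgsData) come from the open-source DGS framework.

    import com.netflix.graphql.dgs.DgsComponent;
    import com.netflix.graphql.dgs.DgsData;
    import com.netflix.graphql.dgs.DgsDataFetchingEnvironment;
    import com.netflix.graphql.dgs.DgsQuery;
    import java.util.List;

    // Hypothetical domain type; the real schema is Netflix-internal.
    record Show(String title) {}

    @DgsComponent
    public class ShowsDataFetcher {

        // Resolves the top-level `shows` field of the GraphQL Query type.
        @DgsQuery
        public List<Show> shows() {
            // A real DGS would call a data store or another service here.
            return List.of(new Show("Stranger Things"));
        }

        // Resolves Show.artworkURL for each show returned above.
        @DgsData(parentType = "Show", field = "artworkURL")
        public String artworkURL(DgsDataFetchingEnvironment dfe) {
            Show show = dfe.getSource();
            return "https://artwork.example.com/" + show.title() + ".png";
        }
    }

In a federated setup, the gateway routes the relevant part of the query to a service like this based on the schema it registered.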
So in this case, again, we do a GraphQL query, from someone's laptop in this case. It, again, ends up at a federated GraphQL gateway. The gateway knows: oh, if I have to get the title for movies, again, I go to the movie DGS. In this case, that movie service might just be running a Postgres database, because we are running at a very different scale here. We are not talking about Netflix streaming; we are talking about these enterprise apps. Postgres works very well for that. And there are many of these apps that are deployed this way. So a somewhat simpler fan-out model, it is a little bit smaller, much less traffic, but the architecture in the end is exactly the same.

Now, of course, this is only a small part of what Netflix is doing. The architecture that we just talked about for Netflix streaming is really just discovery. Discovery meaning: this is the Netflix app that you as a consumer use to browse titles and figure out what you want to watch. As soon as you click play, other things start to happen. And these other things are all happening in what we call Open Connect. What we actually have is appliances, so servers, in server racks at internet providers all over the world, so that the actual movie bits that stream to your TV once you click play are coming from somewhere very close to where you live, basically. So all the popular titles are just sitting on giant boxes, basically, at the internet providers, so that they can stream them without actually costing network traffic on their side. That's cheaper for them, cheaper for us, and it's a better experience for you, because you're getting your data faster, basically. Open Connect is all the management software around that. It's also all Java-based, so there's Java there as well. And then, of course, we have things like encoding pipelines for the actual media encoding. That's also all Java-based. We have all sorts of stream processing. Some of the data stores are written in Java. Of course, there are other languages and other things happening there as well. We have some low-level platform stuff in Go, for example, and there are some machine learning things in Python. So there are definitely other languages as well. But for the most part, it is really just all Java in the backend.

Okay, so we've now seen where we use Java, and next we're going to talk about how we use Java. First, we're going to talk about the JDK, and then we're going to bubble up the stack. So just like many other companies, until a few years ago, very sadly, we were still on JDK 8. And that was a little bit embarrassing, because as a tech company, being still on Java 8 is not a great story. That definitely didn't feel great either. We had kind of worked ourselves into a hole. We had an old, outdated application framework that was developed many years ago. That was all in-house built. We were using a lot of old libraries that we had never updated, because we didn't want to break anyone's apps. And these libraries were now incompatible with any Java version newer than JDK 8. So service owners couldn't easily upgrade. That was just not a great story. And then on the other hand, we had JDK 11 available as soon as it came out. But there just wasn't a lot of incentive for developers to do the upgrade, because there's not a lot of new language features coming from 8 to 11. So most of our developers, even though it was available, they were like: yeah, whatever.
I'm not going to put any effort into upgrading. So now we had this big gap from 8 to 17, basically. And we really needed to break that cycle. So we did a few things. The first thing we did is we patched all the unmaintained libraries for JDK compatibility. So if an old service needed to upgrade to JDK 17, even though it was on this ancient, outdated application framework with all sorts of weird old libraries, it could just upgrade to 17, because we patched the libraries. We didn't force anyone to upgrade anything. We just patched things so that they could at least upgrade to the new JDK. And that sounds really complicated. Like, okay, now we're forking this weird open source library that no one maintains anymore, and that seems like a really bad idea. But in the end, when we really looked at it, it was a handful of libraries that we needed to patch. It wasn't that much work. It was really fine. So we just got it done. And that way, we could unblock teams. So I would also recommend, if you're in this situation where you're kind of stuck on JDK 8 because this weird library that you use can't upgrade: just fix it. It's really not that hard. It might look hard. It is actually not.

The other thing that we did, which was kind of unrelated to the JDK upgrade, is that we also wanted to get rid of this old application framework. It was just a bad experience for all our developers; we needed something more modern. So about two, three years ago, we decided to migrate all our Java services to Spring Boot. And that means going from one application framework to a completely different one. That was a lot of work. We built a lot of tooling to make it easier for our teams, like automated code transformations and things like that. But it was still a lot of work. But, maybe surprisingly, we got it all done. We changed about 3,000 applications over to Spring Boot. The good news is that now all services are on Spring Boot. We have maybe a handful of services left that are still on the legacy stack. Those are basically the services that will remain for old device compatibility, until those devices go away. All services are running on JDK 17 or newer. And most of our high-RPS, most important services are on JDK 21 or 23, so that they can use the new garbage collectors.

All right. We actually talked about this a little bit in the keynote yesterday as well. When we moved to JDK 17, what we saw is that the G1 garbage collector had just gotten a lot better. On Java 8 we were using G1; that's probably the garbage collector most of you are using. On 17 we were still using G1. It just got a lot better, because that was a lot of Java releases in which work had been done on the performance of the JVM, mostly on the garbage collectors. And what we saw is that we got about 20% less CPU time spent on garbage collection on a lot of these higher-RPS services. That is a lot of performance we get basically for free, just by upgrading the JDK. It is really hard to get 20% more out of your machines by actually performance tuning, if you've ever tried that. So getting it for free was kind of a big win. That is definitely a reason to upgrade, just by itself.

Then with JDK 21, there was the introduction of generational ZGC. The ZGC garbage collector had been around for a few Java releases already, but it wasn't generational. ZGC is designed to be a low-pause-time garbage collector, meaning that it doesn't do any stop-the-world garbage collection events.
While that sounds really great in theory, it wasn't generational. And a lot of our services have pretty long-lived data: they create objects at startup time, and then these objects just stay around until the service shuts down, basically. If you have large heap sizes and a garbage collector is not generational, it has to go over all that heap space every time it does a garbage collection, and that becomes really slow. So ZGC didn't work for us before it became generational. In 21, it did become generational. And what we saw is that this was just a better garbage collector all around. We expected it would be good for certain use cases, for certain traffic patterns, but it turned out to be just a really good general-purpose garbage collector for most of our workloads.

And if you look at the difference, these are some metrics about the maximum GC pause times. These are metrics from a cluster where we were running JDK 21 with the G1 garbage collector; that is basically everything until the red box. And these green spikes are the stop-the-world garbage collection events. You see that we had stop-the-world events from about a second to a second and a half. That basically means that when a garbage collection event happens, the service would just reject traffic for more than a second. Now, more than a second doesn't sound that significant, but what happens as a result, because we have very aggressive timeouts on our services, on our IPC calls, is that all the IPC calls going into that service during that one second would all just time out. And now they have to retry, and that means additional load on your cluster. When we switched to ZGC, that is basically the red box, you see that the graph just drops. It doesn't mean it stopped measuring. It's still measuring, but it's running ZGC. And you see there are just no pause times anymore. That's really impressive. We went from more than a second of pause time to zero. (Audience question: before that, was it regular ZGC or G1?) G1. So before, G1, and then it switched to generational ZGC. But that's a good question.

As a result of these pause times being gone, if you look at the error rates of these services, so this is the same graph, the same times, looking at the same clusters, but now showing the error rates, you see that the error rates, which are the purple in this graph, also dropped. And that is exactly because of what I just explained. These garbage collection events would previously cause timeouts. And when these garbage collection events don't happen, these timeouts also don't happen. So that just means significantly fewer errors on our IPC calls. And fewer errors are obviously better. It reduces all this retry behavior. It makes your services run a little bit more consistently. It's easier to operate, easier to understand where errors are coming from. And as an effect, we can also just run our services a little bit hotter. We can run at a higher CPU load and still be okay, before things start to fall over in weird ways. So we can basically squeeze a lot more performance out of our machines. And that is, of course, a very good thing. In this case, we did have to switch the garbage collector, but that is basically just a setting: I want to use ZGC instead of G1. That's all we had to do. So again, that's a lot of benefits from mostly just upgrading the JDK. Then the other thing that we got out of JDK 21 and beyond is virtual threads.
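Before the talk moves on to virtual threads, a note for reference: the garbage collector switch described above is just a JVM flag change. On JDK 21, generational ZGC has to be opted into explicitly; from JDK 23 onward, ZGC is generational by default. A minimal sketch:

    # JDK 21: enable ZGC with the generational mode turned on explicitly
    java -XX:+UseZGC -XX:+ZGenerational -jar app.jar

    # JDK 23 and later: generational ZGC is the default when ZGC is selected
    java -XX:+UseZGC -jar app.jar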
And I've been super excited about virtual threads for many years now. Maybe a bit too excited, because it's taken a few years before we actually got there. But that is something that we started experimenting with. And the first thing we started doing is adding virtual thread support in our frameworks. So that is in our Spring Boot-based framework and also in our GraphQL framework, the DGS framework. The idea was that if we automatically use virtual threads, our developers don't even have to change the way they write their code. They will benefit from virtual threads without even knowing it. And we can, again, get better performance out of what we're doing.

So this is a DGS, or GraphQL, example. You don't really have to understand how this works; it's just to illustrate the difference in behavior. Without virtual threads, if I do a query for shows that asks for the artwork URL, the first thing that happens is that we have this @DgsQuery method that executes and gives me a list of shows. Let's say we return five shows. And now I have to call a second method, which will resolve the artwork URL field. That's the second method, with the @DgsData annotation. That method will be called for each of the shows returned by the first method. So if I return five shows, this method will be called five times. If that method is relatively slow, because I have to do a database lookup or I have to call another service, and let's say this has to happen for every show, this is a simplified example of course, and let's say it takes 200 milliseconds to run this method, this would actually happen serially. So 200 milliseconds plus 200 milliseconds, et cetera. In this case, we would have a response time of a second. Not great. With virtual threads, we switched the out-of-the-box behavior of this same code, basically. And now this method is just running in parallel on virtual threads. The effect is that we still have this slow 200-millisecond method, but it's running five times in parallel. And given that we have enough processors, now we only have 200 milliseconds of total processing time.

Now, you might be asking: okay, why do you need virtual threads for this? Couldn't you just run this method without virtual threads on a thread pool, on an executor pool? And the answer is: well, yeah, we could. And we actually can, if we write slightly different code. But we couldn't make it the default. Because in many cases, this method will not take 200 milliseconds. These methods take microseconds, maybe. They're almost not measurable. And then the overhead of putting it on a real thread would be bigger than the benefit we get. So we would only want this if we need this parallel behavior. And in many cases, we don't need it. So we always had to make this trade-off: well, okay, sometimes you want this behavior, but not all the time. So it can't be a default. With virtual threads, that extra overhead of the scheduling is basically gone, because virtual threads are free. So we can just make it the default. It's a better developer experience, because developers don't have to think about: okay, this method should be running in parallel, I should schedule it on a thread, I need to use CompletableFutures and all that. With virtual threads, you just get the right behavior by default, out of the box, because there's no cost to it. As I said, I've been pretty excited about virtual threads for a long time now.
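As a rough illustration of the before-and-after, here is what that parallel fan-out looks like in plain Java with virtual threads. This is not DGS framework internals; the 200 ms sleep stands in for the slow lookup, and all names are made up for the example.

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class ArtworkFanOut {
        public static void main(String[] args) throws Exception {
            List<String> shows = List.of("show-1", "show-2", "show-3", "show-4", "show-5");

            // One cheap virtual thread per lookup; the five ~200ms calls overlap,
            // so total wall time stays around 200ms instead of a full second.
            try (ExecutorService executor = Executors.newVirtualThreadPerTaskExecutor()) {
                List<Future<String>> urls = shows.stream()
                        .map(show -> executor.submit(() -> fetchArtworkUrl(show)))
                        .toList();
                for (Future<String> url : urls) {
                    System.out.println(url.get());
                }
            }
        }

        // Stand-in for a slow database or service call.
        static String fetchArtworkUrl(String show) throws InterruptedException {
            Thread.sleep(200);
            return "https://example.com/art/" + show + ".png";
        }
    }

The point of the framework doing this automatically is that the code developers write looks like the serial version; the parallelism comes for free.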
And my hot take is still that virtual threads combined with structured concurrency are going to completely replace reactive programming. That is kind of a big statement to make, maybe, coming from someone at Netflix. Because, I don't know if you're familiar with the RxJava library, which is one of the earliest reactive programming libraries: that was actually developed at Netflix, for the most part. So we were knee-deep in reactive programming at Netflix. We were big believers in it. We pushed the technology quite a bit. Everything used to be Rx at Netflix, literally every API. And then we found that, actually, that is kind of hard. Because, yes, it has a lot of benefits when it comes to concurrency and things like that, but it adds a lot of complexity to your code, and also a lot of complexity to your debugging. And we found that, in most cases, the trade-off is just not a good one. So we backed out of using reactive programming, for the most part. And basically, no one wants to touch it. That is kind of the extent of it.

We're still using some of it. For example, if you're using an HTTP client that needs to do multiple HTTP calls for a fan-out, kind of your only option today, without virtual threads and structured concurrency, is something like WebClient, which is, again, reactive. But it always comes with problems. Because, one, it's complicated. But also, now you have two different threading models. You have a thread-per-request threading model, and then within that model, you're now getting into a reactive model, which is a completely different threading mechanism. And that gets you into all sorts of hairy situations. So now that structured concurrency is also there, hopefully in its last preview in the current JDK, in JDK 24, it will basically get rid of the last need for reactive programming. And then we can finally live our lives happily.

However, virtual threads on JDK 23 weren't exactly perfect yet. We started rolling out this functionality in the framework, so everyone was starting to use virtual threads. And then our clusters started to completely deadlock, as in: we would have instances that would just be completely dead. There's a blog post written by a few of my co-workers. That blog post is really good, and it goes into the details of this problem. But more importantly, maybe, they describe really well all the debugging steps, how they got to understanding this problem. And that gives a lot of insight into how you can look at threads and deadlocks and understand what's going on. Because that's step one. But anyway, what we found is that some libraries that we were using were using the synchronized keyword. And prior to JDK 24, if a virtual thread blocked inside a synchronized block, the virtual thread would be pinned to its platform thread. So we had some of those in libraries. Now, other libraries were using things like ReentrantLocks. So these are also locks, but not based on synchronized. And the weird scenario that we got into is that we would have all our platform threads pinned by virtual threads. So all our platform threads were in use by virtual threads that were pinned because of the synchronized keyword. However, all those virtual threads were waiting for a lock. And guess who owns the lock? It's owned by another virtual thread. But that virtual thread will never, ever be able to run anything, because there are no platform threads available.
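Sketched in miniature, the hazard looks roughly like this. This is an illustrative reconstruction, not the actual code from the libraries involved; on JDK 21-23 a program like this can hang, while on JDK 24 (with JEP 491) it completes.

    import java.util.concurrent.locks.ReentrantLock;

    public class PinningDeadlockSketch {
        static final ReentrantLock lock = new ReentrantLock();

        public static void main(String[] args) throws Exception {
            // One virtual thread takes the ReentrantLock, then unmounts while sleeping
            // (sleeping outside synchronized does not pin, so its carrier is freed).
            Thread owner = Thread.ofVirtual().start(() -> {
                lock.lock();
                try {
                    try { Thread.sleep(500); } catch (InterruptedException ignored) {}
                } finally {
                    lock.unlock();
                }
            });

            // Enough virtual threads to occupy every carrier (platform) thread.
            int carriers = Runtime.getRuntime().availableProcessors();
            for (int i = 0; i < carriers; i++) {
                Object monitor = new Object();
                Thread.ofVirtual().start(() -> {
                    synchronized (monitor) { // holding a monitor...
                        lock.lock();         // ...then parking here pins the carrier (pre-JDK 24)
                        lock.unlock();
                    }
                });
            }

            // Pre-JDK 24, the lock owner may never get a carrier to run unlock() on,
            // so this join can hang forever; on JDK 24+ it finishes and prints.
            owner.join();
            System.out.println("done");
        }
    }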
So that's the deadlock situation we ran into, where no one is able to do work, because there are no real threads to schedule work on, and the lock is owned by a virtual thread. So we had to back out a little bit: okay, maybe we shouldn't push too hard on virtual threads, because there is a very uncommon but very real scenario here. But the good news is that in JDK 24, which was released yesterday, we have JEP 491. They basically re-implemented the way synchronized works with virtual threads, and this whole thread pinning issue is just completely gone, because they re-implemented the whole mechanism. So that means that with JDK 24 available, we will once again start pushing on virtual threads. And hopefully, well, the expectation is definitely that it will go a lot better. We did some early experiments with preview builds, and that looked really good. So I'm very confident about this.

All right. Let's move up a little bit in the stack and start talking about our application framework. Our applications are based on what we call Spring Boot Netflix. And Spring Boot Netflix is really just open source Spring Boot, and then on top of that we have a whole bunch of modules that make Spring Boot work with our infrastructure and the ecosystem of things around it. From a developer's perspective, it is just Spring Boot. It is the same programming model. We don't add anything to the programming model. We use all the same annotations and things like that. It is, and looks like, just plain Spring Boot. And that's a good thing, because that's what most people understand. And we try to really stay close to open source releases as well. Of course, the upgrade to Spring Boot 3 was a little bit of a bigger story; I will talk about that specifically. But any time a minor release comes out, so going from 3.3 to 3.4, et cetera, we hopefully upgrade within a few days. And it mostly all rolls out automatically.

The things that we add in Spring Boot Netflix are things like security integration, so integration with our authentication and authorization systems. These security systems are all Netflix-specific, but the way that is exposed in Spring Boot is just through Spring Security. So it's the @Secured and @PreAuthorize annotations, things like that. Probably the same stuff that all of you are using if you're using Spring. And we just integrate with our systems under the hood. For all incoming and outgoing traffic in a service, we use a service mesh based on ProxyD. So we integrate the framework with that. The service mesh takes care of things like service discovery and TLS and things like that. Then we have a mechanism that is actually a programming model for gRPC clients and servers. So we have an annotation-based programming model that maybe looks a little bit like you're just writing a REST controller, but you're actually building a gRPC service. Something very similar is being worked on in open source now as well, with the Spring team. That is not exactly what we have built, but it looks pretty much the same. So it just makes it easy to implement a gRPC server, or to call a gRPC server as a client. Then we have observability. That is our distributed logging and tracing and metrics and these things. Again, this mostly just works through the Spring Boot-provided APIs. So this is mostly just using Micrometer, but it then uses our in-house built systems to actually store all the data.
And that's all in-house built, because basically there's nothing that scales to what we need. Then there are fast properties; that is dynamic configuration. Most of our configuration can be changed without restarting a service. That is super important when there are incidents going on and we need to disable feature flags and things like that. And then we have our IPC clients. So the gRPC client to call gRPC services, and things like WebClient. WebClient comes from Spring, of course, but we extend it with all sorts of resiliency behavior for retries and things like that, and that all works out of the box.

Now, the way we implement these things is the same way Spring itself is built, and the way all the frameworks around Spring Boot are built. If you look at, for example, Spring Data or something like that: we basically use the same mechanisms to extend Spring Boot. So it's a lot of auto-configuration to provide and override extra beans. We have default configuration properties; that's all environment post-processors. And then we try to provide test slices for any components we build, so that your tests also run in a very fast way.

Now, you might be wondering: okay, why Spring Boot? Because there seem to be a lot of other frameworks that are interesting. Some frameworks seem more modern. And all these frameworks do look very interesting. However, I don't think they are necessarily better. They're just also good. But Spring Boot has proven to be a really reliable framework in the long term. Actually, one of my first projects, just coming out of university, which is a long time ago, I think it was 2006, is when I started using Spring. The Spring framework is still based on the same concepts. And they have done a really, really good job of iterating and making the framework better, using new language features from Java, et cetera, to make the developer experience better. We're obviously not using XML configuration anymore. But at the same time, it's based on the same concepts. And it is quite impressive how they have managed to evolve the framework. They're obviously still innovating. Every new Java release that comes out, they come out with new features leveraging those new Java features. Virtual threads are one example. And a lot of developers just have Spring experience, which is a big plus when we onboard new folks. It's extensible; otherwise, we wouldn't be able to make it work in our environment. And maybe one of the more important things is that the Spring team is just a really good partner for us. We collaborate quite a bit with them. We give them a lot of feedback about ideas that they have, directions they want to take. And we work on a lot of stuff together with them. And it works really well.

If you think about deployment of these Spring Boot applications, we basically have two different deployment options. For a developer, it almost doesn't matter which route you choose; all the tooling makes it mostly transparent. We either deploy directly on AWS instances, or on Titus, which is our container platform. It is basically Kubernetes, but we built it in-house many years ago, and it's actually starting to use more and more Kubernetes components. So either containerized, or directly on AWS instances. And then we deploy as exploded JAR files with embedded Tomcat. That's the deployment model. We are not using anything native image. We have definitely experimented with that, because, of course, we want faster startup time.
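As an aside on the extension mechanism mentioned a moment ago: "auto-configuration to provide and override extra beans" is standard Spring Boot machinery. A module in an internal platform like this might look roughly like the following sketch; the names are hypothetical, not Netflix's actual code.

    import org.springframework.boot.autoconfigure.AutoConfiguration;
    import org.springframework.boot.autoconfigure.condition.ConditionalOnMissingBean;
    import org.springframework.context.annotation.Bean;

    // Listed in META-INF/spring/org.springframework.boot.autoconfigure.AutoConfiguration.imports
    @AutoConfiguration
    public class ObservabilityAutoConfiguration {

        // Provides a sensible default; a service overrides it simply by defining its own bean.
        @Bean
        @ConditionalOnMissingBean
        public MetricsPublisher metricsPublisher() {
            return new MetricsPublisher();
        }

        // Hypothetical stand-in for an in-house component.
        public static class MetricsPublisher {}
    }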
Everyone wants faster startup time. However, it doesn't quite work well enough for us yet. It is too hard to get it right. It is pretty easily breakable. And although, yes, with a native image you get faster startup time at deployment, it makes the whole development experience a lot worse, because now you add a lot of time to your build times. How do you actually start your application during development? You definitely do not want to build a native image every time you run your app during development. So it's just not a great story if you look at the whole picture. We are now looking more at AOT in general, and at what Project Leyden is doing. Again, the Spring team is also jumping on that ship; they're working with the Leyden team on this. And that is what we're betting on to improve startup time in the future.

I already mentioned that we stay away from anything reactive, for the most part. We are not doing anything WebFlux. We sometimes get asked by our developers at Netflix: hey, can you guys not support WebFlux? And the answer has always been no, because we just don't want to get back into that world. And WebFlux really only works well if you can guarantee a reactive API all the way from the front to the very back, to your database connections. And that's a really hard thing to pull off, especially if you have a lot of existing libraries. So the benefits just don't weigh up against the negatives there. So we're completely standardized on WebMVC. And with virtual threads and structured concurrency, I don't see any reason to go back down the reactive route.

So, if you're using Spring Boot, you are probably struggling with an upgrade to Spring Boot 3. Spring Boot 2 is not maintained anymore in open source, so it's definitely time to take care of this upgrade. There are two big topics in this new release. First of all, they baselined on JDK 17. That's fine; we were already on 17 anyway. I think it's a great thing. Finally, the Java community as a whole can move forward a little bit. I'm very thankful to the Spring team for doing this. The second topic is completely uninteresting, but very impactful. And that is the use of the Jakarta EE namespace instead of the javax namespace. Now, why is this important? This is important for libraries. If you're just building an application, this upgrade is trivial. You can literally just do a find-and-replace in your source code, change everything javax to jakarta, and you're happy. Upgrade completed. However, if you have libraries, these libraries might provide, say, a servlet filter, in this example a javax.servlet Filter. That filter works fine on Spring Boot 2. That filter does not work on Spring Boot 3, because Spring Boot 3 expects jakarta.servlet. The way we mitigated this issue is by using Gradle transforms. We built a Gradle plugin, basically, that works at artifact resolution time. So when a JAR file for a dependency is downloaded, we run a transform; that's just a standard Gradle feature. And we do a bytecode rewrite from javax to jakarta. Now, the good news is that all the Jakarta APIs in this version are completely unchanged. It's literally just a namespace change, a package namespace change. Nothing else changed. So you can safely just replace everything javax with jakarta, and you will be happy. And we do this with this bytecode rewrite. That's all open-sourced.
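To illustrate how mechanical the change is for application code, here is a generic filter before and after; this is an illustrative example, not one of Netflix's libraries. Only the package prefix in the imports changes; the API itself is identical.

    // Spring Boot 2 (Java EE namespace):
    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;

    // Spring Boot 3 (Jakarta EE namespace): same API, new package.
    // import jakarta.servlet.Filter;
    // import jakarta.servlet.FilterChain;
    // import jakarta.servlet.ServletException;
    // import jakarta.servlet.ServletRequest;
    // import jakarta.servlet.ServletResponse;

    public class LoggingFilter implements Filter {
        @Override
        public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
                throws java.io.IOException, ServletException {
            // Illustrative logging before passing the request along.
            System.out.println("request received");
            chain.doFilter(request, response);
        }
    }

The Gradle transform approach does exactly this substitution, but on compiled bytecode of downloaded dependencies, so unmaintained Spring Boot 2 era libraries keep working on Spring Boot 3.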
So if you're using Gradle, you can use the whole thing as part of the Nebula ecosystem, which is the set of open-source Gradle plugins that we provide from Netflix. It is in turn based on another open-source tool that does the actual bytecode manipulation. So even if you're not using Gradle, this would be a good starting point. And this is how we can, at the moment, have libraries that are built against Spring Boot 2 that will also work with Spring Boot 3. And once we get rid of all the Spring Boot 2 apps, and we're almost done with that now, almost everything is on Spring Boot 3, we can start changing the libraries and eventually get rid of this transform. I'm going to skip this.

I talked a lot about GraphQL already, and it's all built on top of the DGS framework. That is the framework that we open-sourced in 2020. This was in the early days, when Netflix started developing on top of GraphQL. We needed a framework to make that easy. There wasn't really anything available in the Java world that looked reasonable, except for GraphQL Java, but that's a very low-level library. And this stuff is all built on top of GraphQL Java. So that's how it works: it provides the Spring Boot integration for that. And you basically just get an annotation-based programming model to write your resolvers for GraphQL queries, and then also a testing framework that makes it really easy to run GraphQL queries against your service without actually having to start a web server and things like that. So that is a pretty important part of it. Then, a few years after we did that, the Spring team also started to realize the importance of GraphQL, and they started working on GraphQL support directly in Spring Boot. Now we were heading towards two almost competing frameworks in one community, and that wouldn't be a good thing. So we worked a lot with the Spring team on shaping what Spring for GraphQL should be. That's the thing that comes out of the Spring team. And we integrated the two frameworks completely. So if you use DGS, under the hood it is in fact using a lot of the Spring for GraphQL components. And you can use both programming models and all the features together, in whatever way you want.

Last slide that I will get into. If you ask: okay, what kind of IPC mechanism should we use? This is how I look at it. You have two good options: GraphQL or gRPC. If you think about a UI talking to a backend, you want a flexible schema, or a flexible API basically, that works for all the different clients you need to deal with. And GraphQL gives you that. It gives you a very flexible way of querying data. And, very importantly, you have a schema. That is how you collaborate between UI developers and backend developers. And when working with GraphQL, you're thinking in data, not in methods. If you're talking about server-to-server communication, you often want to think a little bit more like: okay, I'm actually just calling a method; it just happens to run on another server. That's the mental model you're in. And that is what gRPC is really good at. gRPC is extremely performant, because it's a binary protocol. It still has a schema, because it's Protobuf. It's just a different type of schema than GraphQL's. But it is really good for server-to-server communication. Now, that doesn't mean you will never use GraphQL for server-to-server communication. We have some of that as well.
That's fine. But these are the big buckets. And what does that mean for REST? Well, I don't think you should use REST at all, really. Yes, REST is easier than GraphQL, because you can basically just do a data dump and tell your UI developer: I have a lot of data, good luck. But this is not a very good way to build these UIs. It's just not a good experience if you don't have a schema and it's not a flexible API. You basically always get all the data that is available, which is probably way more data than the UI actually needs. It's just not a good model. And yes, of course, you can use something like OpenAPI to add a schema on top of REST, but that's more or less an after-the-fact thing. So, yeah, my opinion: don't use REST. Of course, if you just want to do something quick and dirty and easy, it's fine. I'm not saying you're a bad person if you ever do REST. You're just not as good a person. Okay. I'm going to skip over this. I'm out of time. Thank you very much. Do I have time for questions? Okay. Then find me in the hallways and ask me questions.