Build an AI Supercomputer with 5 Macs (video, 35m)
NetworkChuck took on the challenge of building a powerful AI cluster by connecting five Mac Studios, with the goal of running the Llama 3.1 405B model. Why Mac Studios? Not just for raw power, but for their unified memory architecture: on M-series Macs, system RAM and GPU memory are a single shared pool, so far more memory can be devoted to the GPU than on a typical desktop, and clustering software can then combine that memory across all five machines. That puts model sizes within reach of ordinary users that were previously unattainable on standard desktop hardware.
Chuck began by unboxing the hardware, an experience that brings joy to most tech enthusiasts. He used software from Exo Labs that handles AI clustering and supports wildly different hardware, from a Raspberry Pi to a powerful gaming desktop. The interface is user-friendly and connecting the machines is straightforward, as he demonstrates in the video. He admits his aim is not only to play with new technology but to build a cluster that does not rely on cloud AI services, giving him full control over his data.
The cluster's main limitation turned out to be data transfer between the machines, which initially communicated over a 10-gigabit switch. When he switched to Thunderbolt, data moved faster, yet he still ran into bottlenecks. During his tests he ran several models, including Llama 3.2 1B and Llama 3.3 70B, and then attempted Llama 3.1 405B, which demands enormous resources. The software automatically discovers the other computers in the cluster, which greatly simplifies setup.
Despite the slow interconnect, Chuck stayed motivated to work out how much memory each model actually needs. He compared pooling RAM across the cluster against the VRAM normally required for the heavier models, and highlighted how the tight integration of compute and memory on modern M-series chips let him run far larger models than a single conventional desktop could, with quantization trading some precision for a much smaller memory footprint. Reflecting on the whole process, he stressed that the approach still needs further research and development.
As for the video itself: at the time of writing, it had already gathered 1,406,722 views and 38,151 likes, which shows how much the tech community appreciates experiments like this and confirms that local AI and clustering have captured broad interest. By engaging with the community, Chuck inspires others to explore cutting-edge technologies and sharpen their skills in the field.
Timeline summary
- Introduction of five Mac Studios for creating an AI cluster.
- Connecting the Mac Studios to form a powerful AI system.
- Goal to run the largest AI model, Llama 3.1 405B.
- Explaining the challenge of running significant AI models locally.
- Thanks to NordVPN for sponsoring the video.
- Transitioning from PC to Mac for video editing.
- Introducing ExoLabs, software for AI clustering.
- Unboxing the Mac Studios and expressing excitement.
- Discussing the importance of AI clusters and resource intensity.
- Explaining model parameters and their significance.
- Introducing TinyLlama as a smaller AI model option.
- Discussing GPU requirements for running different AI models.
- Highlighting the immense VRAM necessary for advanced models.
- Explaining the technical challenges of running a massive AI model.
- Outlining the goal to build an efficient AI cluster.
- Describing the network setup for connecting Macs.
- Discussing potential networking bottlenecks.
- Monitoring network traffic during model usage.
- Initiating installation of ExoLabs on the Macs.
- Installing MLX for machine learning acceleration.
- Completing installation of the exo software.
- Running the AI cluster and monitoring performance.
- Testing the feasibility of running the 405B model.
- Final performance results of running the largest models.
- Concluding thoughts on the challenges faced during the project.
Transcription
One, two, three, four, five Mac Studios. I'm connecting them together and forming a super powerful AI cluster. Why? I wanna run the biggest and baddest AI models I can find. We're throwing everything at it, and my goal is to run the biggest of them all, the Llama 3.1 405B model. This thing is scary. It's normally run by super powerful AI clusters in the cloud with servers that cost more than our houses, but we're gonna try it now with five Mac Studios. Can we do it? I don't know, but we're gonna try and get your coffee ready. Let's go. And thank you to NordVPN for sponsoring this video and making it possible. Yes, they are paying me to play with AI and show you cool stuff. It's kind of awesome. We'll talk more about them later. Now, let me get this out there. I did not just buy five Mac Studios to use it for an AI cluster. I mean, it's not beyond me. I would do that. But here at Network Chuck Studios, we're switching from PC to Mac for our video editing pipeline. Comment below if you think that's a good idea. I'm sure we all agree. But when these beautiful, powerful machines arrived, I'm like, you know what? I can't give it to these guys yet. I wanna play with them first. And I just found a software called ExoLabs. It's new, it's beta, but they're all about AI clustering. Check this out. You can take any type of computer hardware. I'm talking a Raspberry Pi, a spare laptop, a super powerful gaming PC with a 4090, and you can connect them together and just have them run AI models. They share the resources. It's actually kind of easy to do. I'm gonna show you how to do it in this video. But first, I gotta open up all these Mac Studios. Honestly, this is probably my favorite part of the video. I don't know what it is. There's something about opening new tech, unboxing new hardware that just makes you feel joy. And it's anything, a network switch or router. It just makes me happy. Are you the same way? Anyways, I unboxed them. They're beautiful. I did smell them one time. They smelled amazing. But before we get crazy, I first wanna talk about AI clusters. Why do this? Now, I already have a dedicated AI server. His name is Terry. I built him in this video here. And he's awesome. He enables me to run local AI models here in my studio, meaning I don't talk to the cloud and rely on scary giant companies like OpenAI to run things like ChatGPT. Everything's local. They don't get my data. But the reason I had to build Terry, who's rocking two 4090 GPUs, is because running AI models, it's resource intensive. Sometimes, because right now, your computer, the one you're watching me on, can probably run an AI model. In moments, you could download OLAMA, run LLAMA 3.21b, and it works really well. You can talk to it like ChatGPT, but it's not gonna feel like ChatGPT. It's not as smart. And you'll notice that really quickly. The difference is kind of crazy. To get the quality of ChatGPT, you'll have to use a bigger, more sophisticated local model. And this is where your laptop isn't gonna cut it. And when I say larger, I'm mainly talking about a thing called parameters. So I mentioned LLAMA 3.21b. Let's break that down. This is a relatively small model, and that 1b stands for one billion, one billion parameters. When you think about a parameter in the context of AI, each one represents learned knowledge. Each of these parameters is a numerical value or weight in a neural network. And they help the model make predictions. And then that's what a model's doing when you're talking to it. 
It predicts what the response should be based on what you're saying. You can actually think of a parameter as learned knowledge. And the more parameters a model has, the more patterns, relationships, and nuances it can learn from data. Or essentially, the more parameters it has, the smarter it is. Now, a one billion parameter model like Llama 3.2, it's good for simple tasks. You can talk to it. It does basic sentence completion. It can summarize stuff. And you can run it on things like CPU. GPU is gonna be better, but it has weaker reasoning and factual accuracy. I'm kinda using Llama 3.2 as our baseline. We could go lower. They have lower parameter models that get dumber. But they have their use cases. And if you wanna run it on a Raspberry Pi, you'll have better performance. I think there's one called TinyLlama. I'm gonna go find it, actually. And that is a tiny llama. It's pretty cute. Now, check this out. TinyLlama is actually a 1.1 billion parameter model. But because of quantization, we'll talk more about that later, you can run it with less resources. 638 megabytes of VRAM. VRAM, what is that? That's video RAM. So this is not your typical memory or RAM on your computer. This is memory that your GPU has. And yes, when we're talking about running local AI, GPU is the name of the game. If you have one, your life will be better. It doesn't mean you can't run LLMs like TinyLlama on a CPU. You can. But the inference, or having the conversation, will be slower. Now, of course, we can go up. So if Llama 3.2 is our baseline, I'll give you some recommended VRAM, like what kind of GPU you might need for each model. Llama 3.2 1B, 1 billion parameters. It's recommended you have four gigabytes of VRAM. Again, you can use CPU, it'll be slow. Llama 3.2 3B, three billion parameters. You'll need six gigabytes of VRAM, so maybe a 2060 GPU. Llama 3.1 8B, eight billion parameters. 10 gigabytes of VRAM, that's gonna be a 3080. Phi-4 from Microsoft, 14 billion parameters. You'll need 16 gigabytes of VRAM, that's gonna be a 3090. And then here's my favorite local AI model right now, the Llama 3.3 70B, 70 billion parameters. For this, there is not a consumer GPU that can do it right now. You'll need 48 gigabytes of VRAM. For me to run that, I have to use two 4090s. And then let's get crazy. If we go one more up, we've got the Llama 3.1 405B, 405 billion parameters. Now real quick, one thing you might be wondering, you saw me jump from Llama 3.2, then to Llama 3.1, then to 3.3, and then to 3.1 again, what's happening? Those are the different generations of models trained on newer data and having a few new features. But just because Llama 3.2 1B is newer, it doesn't mean it's more intelligent or has better reasoning than Llama 3.1 8B. Anyways, getting back to the 405B, to run this sucker, they recommend one terabyte of VRAM. That's unreal. That's gonna be an AI cluster, and it's not regular GPUs you're gonna be using. You'll be using NVIDIA's H100s or A100s. And this is what I'm aiming for with my cluster. Think about the best GPU on the market right now, a 4090. It has 24 gigabytes of VRAM. I would need 42 4090s to run that. Now, just so you know, for those of you who might know a thing or two about running LLMs, these numbers probably look a little off, and that's because a lot of these already have quantization built into their metrics. What is that? Quantization could be its own video. We're not gonna do that right now. Just know it makes big models fit on smaller GPUs. Now, it doesn't come without a cost.
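A quick aside on the sizing above: a model's raw weight footprint is roughly parameter count times bytes per parameter, and everything else (KV cache, runtime overhead) sits on top of that. Here is a minimal sketch of that arithmetic, assuming the conventional 2 bytes per parameter for FP16 and half a byte for 4-bit quantization; the recommended-VRAM figures quoted in the video already fold in quantization and overhead, so they won't line up exactly with this simple math:

```bash
# Back-of-the-envelope weight sizes: billions of parameters x bytes per parameter.
# These are lower bounds - a running model also needs memory for KV cache and overhead.
for model in "Llama-3.2-1B 1" "Llama-3.1-8B 8" "Phi-4 14" "Llama-3.3-70B 70" "Llama-3.1-405B 405"; do
  set -- $model   # $1 = name, $2 = parameters in billions
  awk -v name="$1" -v b="$2" 'BEGIN {
    printf "%-15s  FP16: %7.1f GB   INT4: %6.1f GB\n", name, b * 2, b * 0.5
  }'
done
```

For the 405B model this comes out around 810 GB of weights at FP16, which is consistent with the "one terabyte of VRAM" recommendation once overhead is included.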
They do have to reduce some precision to get that to fit on a smaller GPU, but they do it in a way to try and maintain accuracy. You'll know a model is quanticized. Is that how you say it? Quantized? Yeah, I think that's what you do. When you see certain notations, so for example, FP32, that's full precision. No alterations. FP16, half. Now, when I say half precision, it doesn't mean it's like half as bad. We're talking a zero to 2% loss in precision, but then we get into integer-based quantization, and this is where it's fun for us because we can run stuff on our GPUs, our consumer GPUs. The first big one is INT8. This will make the model four times smaller with about a one to 3% loss in precision. Now, I say that with a giant asterisk. It depends. It depends on how they quantize that model. There are different ways you can do that, and those different methods change how they try to reduce the loss. Again, that's a whole other video, but just know as we go down to INT4, which is as low as you really want to go, this is eight times smaller than the full FP32 model, but the loss is pretty big, 10 to 30%, and you'll probably notice the degradation for complex tasks like coding or logical reasoning or creative text. Go any lower and it loses its mind. We're talking Arkham Asylum. So many of these models over here are actually using INT8 to make themselves smaller so they can fit on consumer-level GPUs. INT4 is what I'm gonna try and use with Llama 3.1 405B. Now, I'm not gonna quantize it myself. Someone's already done that for me. I'm just gonna try and run it, but even with that quantization, it's a tall order. So how do I expect five Mac studios to run this model when it would take 42 4090s to do this? Well, the new M-series Macs have a trick up their sleeve. It's a thing called unified memory or unified memory architecture. So in most systems, you have your system memory and you have your VRAM, your GPU memory. The new Macs don't do that. They have one pool of memory for everything, and that unlocks something pretty cool because you can get a Mac. For example, my Mac studios, each one of these has 64 gigabytes of RAM. That's shared RAM that can be used for the GPU. So in my mind, I'm thinking 64 times five. What does that give me? 320 gigabytes of RAM that can be used for the GPU. And it's not just the amount of RAM, but it's the transfer. In a typical system, you've got the system memory that has to transfer data between itself and the GPU memory. With unified memory, there's no transfer. It's just all using that memory. One of these Mac studios is $2,600. And that's for the entire computer. One 4090, just one piece of your gaming PC, will cost you 1,600 bucks. And I get way more RAM to use for my GPU with the Mac. Not to mention it's extremely power efficient. It's ridiculous the power consumption on a 4090 versus a Mac studio. You're about to see. But it's not apples to apples, it's Apple to PC. What I mean is if you put a 4090 gaming PC head to head with a Mac studio, the PC is going to win every time. NVIDIA GPUs like the 4090 have dedicated tensor cores. They're optimized for CUDA. What does all that mean? Well, those are the things that AI models have been optimized for for a long time. M-series Macs have not been thought of as AI machines. Up until now, it's just been NVIDIA. So whenever someone makes a new model, they're making that model to run on NVIDIA GPUs. You're going to have a better time. Now, Apple does have something called MLX or machine learning acceleration. 
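To put numbers on those precision levels for the target model specifically, here is the same bytes-per-parameter convention applied to the 405B weights, next to the pooled unified memory of the five Mac Studios. The precision-to-bytes mapping is the standard convention, not anything exo-specific:

```bash
# Llama 3.1 405B weight size at each precision, vs. the cluster's pooled unified memory.
awk 'BEGIN {
  p = 405                      # parameters, in billions
  printf "FP32 (4 B/param):   %7.1f GB\n", p * 4
  printf "FP16 (2 B/param):   %7.1f GB\n", p * 2
  printf "INT8 (1 B/param):   %7.1f GB\n", p * 1
  printf "INT4 (0.5 B/param): %7.1f GB\n", p * 0.5
  printf "Unified memory pool: 5 x 64 GB = %d GB\n", 5 * 64
}'
```

Only at 4-bit, roughly 200 GB of weights, does the model even have a chance of fitting inside the 320 GB pool, which is exactly the bet this video is making.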
And I'll actually be using that with ExoLabs. But CUDA still wins out because of support. Okay, here we go. We're about to test. We have our five Mac studios, which by the way, here are the specs. They're M2 Ultras, 64 gigabytes each of RAM, unified RAM. Now, the first big thing we have to figure out is how do we connect these Macs together? They're going to be clustered, which means they're going to be talking a lot. And that's a lot of bandwidth. For our scenario, I went with the built-in 10 gigabit ethernet connection. So over here, I have a Unify XG6POE 10 gig switch connecting these five Macs together. This however, will be our biggest bottleneck. Not ideal. Now, 10 gig sounds like a lot, but with AI networking, they normally have extremely high speed connections. I'm talking 400 gigabits per second. In fact, last year I did a video on AI networking, working with Juniper, and they were about to come out with 800 gigabit per second connections, which I'm pretty sure is out. So my 10 versus their 800, and it's not just that. AI networking, and we're talking enterprise AI networking, they eliminate a ton of the networking overhead that you might see with ethernet and TCPIP. In many situations, we're doing GPU to GPU access, skipping a lot of the OS overhead. But for us, we've got our Mac studios, and they have to go through the entire TCPIP stack. Now, the reason this matters so much is that our Macs, when I install the XO software, it will actually take whatever model we're going to use. So let's say, for example, Lama 3.2 8B. It won't download the entire model on each individual Mac. It'll actually split up the download. And when we're running our AI model, each Mac will be running part of the job. But like any good team, that depends on efficient communication. They're gonna be talking a lot back and forth, extremely large amounts of data. In fact, I'm gonna try and see that as we're testing it. We're gonna be tracking the amount of power we're using and bandwidth. Now, there is a way with my Mac studios to get more bandwidth, and that's with Thunderbolt. Alex Ziskind, I think it's how you say his name, another YouTuber I just started watching, he did this with a bunch of M4 Mac minis. Thunderbolt's powerful because you get direct PCIe access and bandwidth up to 40 gigabits per second, ideally. The only problem is when you get to where you wanna cluster together five. You only have so many connections and you can't daisy chain all of them. Now, the way you can solve this is by using a Thunderbolt hub or bridge, and that's what he did. But you still will have some bottlenecks. By the way, you should watch this video to see how Thunderbolt performs versus Ethernet, which we're about to do right now. Hey, Network Shuck from the future here. I actually ended up testing Thunderbolt because I just had to. Then the 10 gig was such a bottleneck, you'll see. And yeah, that's all I got. Back to me. But now we're finally at the point to install XO. I'll have a link to the project below, and I will demo how to install XO on a Mac. For Linux, they do have documentation, but the install's pretty much the same. Really, I think the Mac is the harder version. A couple of things, you wanna make sure you have Python 3.12 installed. I'll go and do that right now. I like to use PyENV to manage my Python versions. And then with PyENV, I can install Python 3.12, and I'll do this on all Macs. And by the way, let's get Home Assistant up. 
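For reference, the Python prep just described boils down to a few commands per Mac; a sketch assuming pyenv is already installed (for example via Homebrew) and that the shell is the default zsh:

```bash
# Run on every Mac in the cluster. Depending on your pyenv version you may need
# a full version string such as 3.12.4 instead of the bare 3.12.
pyenv install 3.12      # build and install a Python 3.12.x
pyenv global 3.12       # make it the default interpreter
source ~/.zshrc         # refresh the current shell
python --version        # should now report Python 3.12.x
```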
I'm actually using a smart plug to measure the amount of power I'm drawing with all five Macs. So right now, at a kind of a baseline, we're pulling 46 watts. And that's for all five Macs. Isn't that crazy? All right, Python 12 installed. I'll set it as my global, PyENV global 3.12. And I'll just verify real quick with Python dash dash version. Oh, I probably need to refresh my terminal. I'll do a source.zshrc. Try it once more. Perfect. The first thing I'll do is install MLX, Machine Learning Acceleration for M1 Macs. I'll do that with pip install mlx. Keeping in mind, this is very specific to Mac deployments. Notice it is very quick. And by the way, if you find that you don't have pip installed, you can get pip and all the things you need installed with the Xcode dash select dash dash install command. Okay, MLX installed on all my Macs. Now time to install XO. This will be the easiest part. Just gonna grab the git clone command to clone their repo. I'll do that on every one of my Macs here. Jump into that XO directory. And then we'll use the command pip install dash e dot. And take a little coffee break. Now while it's doing that, a couple of cool things you'll wanna know about XO. First, you're about to see this. When we run XO, the Macs will just discover each other through magic. I mean, through networking stuff, right? But they will automatically discover each other and recognize that they're in a cluster. XO will also launch a GUI for us, a web interface, so that we can look at it, play with it, and test some LLMs. Speaking of LLMs, they also have a chat GPT compatible or open AI compatible API. Which means that if you actually wanna use XO, even though it is still in beta, still fairly new, you can integrate this into anything that also uses the open AI API, which is many tools I use. In fact, I just reached out to Daniel Miesler, the guy who runs the Fabric project. I use Fabric every day and I'm like, hey, can you add XO Labs to this? He's like, yes, I'm on it. Actually, by the time you watch this video, it's probably already there. All right, looks like our installation is complete. Now this is very Mac specific. If I LS the directory I'm currently in, the XO directory, you'll see I have a script called configureMLX.sh. Running that will tune up your Macs a bit to run XO. So I'll run that on each Mac. You might wanna put sudo in front of that so you don't have to put your password in either way. Notice it did some things. Honestly, I have no idea what that's doing. And now we're at the point where we can just run XO. So I'll run on the first one here. XO, XO, XO, XO. And then there's one behind it there. We have five. I can't get to the other terminal. Where are you at, buddy? Oh, there he is. XO. Are you seeing this? So immediately XO discovered that there are five nodes in its cluster. Just auto discover. Now actually, I'm gonna stop them real quick so I want you to see how it rates each machine. So I'll just run one instance of XO right here. A cluster of one. All right, so notice here we got 26.98 teraflops. We're closer to the GPU poor side than the GPU rich. I'll show my 4090 performance right here. I'm normally like right around here. But when I start running the rest of the cluster here, it will discover the other nodes. And now when I operate one more, we'll see it discover two nodes. Even shows the connection down here. And it increases or doubles my teraflops. Now I'm gonna only operate one right now. Let's test an LLM. Just to get our base performance for one Mac. 
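Pulling the install steps from this section together into one place, per Mac. The repository URL is the Exo Labs project's location at the time of writing, which is an assumption on my part (use the link in the video description if it has moved), and the tuning script's exact filename may differ slightly from how it is spoken here:

```bash
# One-time setup on each Mac in the cluster.
xcode-select --install                          # command line tools, if pip/git are missing
pip install mlx                                 # Apple's machine-learning acceleration library
git clone https://github.com/exo-explore/exo.git
cd exo
pip install -e .                                # install exo into the current Python environment
./configure_mlx.sh                              # Mac-specific tuning script (may want sudo)
exo                                             # start a node; nodes on the same network auto-discover
```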
All right, we're down to one cluster. Now when you want to access our GUI, it'll be port 52415. And this is for 1072 and 169. So I'll launch my browser here. And there we go. Notice on the left here, we can select our model. It won't download it just by selecting it. It's like if I click on 7DB, it won't do 7DB. But I'll click on 1D and when I start typing, it'll try and download it. So I'll just say, hey, how you doing? Downloading it now. And cool, it's working now. And then notice right here, we have our performance. It's documenting for us. And what you want to focus on is the tokens per second. Let's say, tell me a scary story about computers. So averaging about 117 tokens per second, just by itself. Given that this is a small model and one of these Mac studios can run that, no problem, no sweat. Now, what I want to test now is the network bottleneck. If I introduce the other four Macs into a cluster and we divide up the jobs, what will happen? What will it look like? Let's do that now. So I'm actually going to delete the model here and go add my other Macs. That's when I watch them come up here. All right, we got two, three. Clustering together is so easy. Four and five nodes. Let's test it out now. I'll go ahead and start talking to it to download the model. And as soon as downloading, looks like the entire model on each one. So maybe I was wrong about that or maybe there's a bug. I don't know. Either way, we can see our cluster's working because it's obviously downloading on every one. Let's try and do that same prompt as before. Tell me a scary story about computers. So, wow. The bandwidth limitations are massive. 29 tokens per second versus the 117 we were doing before. So speed is not going to be our friend here. I expected that. What I'm more excited about is the amount of RAM we have and being able to run bigger models. Hey, guess what time it is? Coffee break time. During this break, I want to tell you about my sponsor, Now, hold on. Before you click that little fast forward button, just know, they make videos like this possible. So please show them some love because I want to tell you three ways I use a VPN right now. Number one, I use it to give me a bit of anonymity. When you're accessing a website, many websites will use your public IP address to identify who you are. They'll use different tracking techniques. Essentially, when you're stepping around on the internet, you're leaving a footprint and they're tracking that. So I'll use NordVPN to hide me and also kind of tricking websites to think I'm someone else. It's a great tool for IT people to quickly change who you are, your identity online. Number two, watch a ton of movies. Did you know that Japan Netflix looks different from American Netflix? Same goes for UK and other regions. But if you're using NordVPN, you can quickly put yourself in the UK, in Japan, and suddenly Netflix thinks you're in Japan and they show you Netflix Japan. Now, you may have heard this before, but my producer, Alex, did something insane. He uses this app called Letterboxd. In fact, Alex, throw it up right here. But one thing it will show you is when you're looking at the movie, it'll show you where you can watch that movie on all the streaming services. But he added all the streaming services for every single country. So now when he wants to watch a movie, he can watch that movie. Before, he would just simply rent it when it wasn't available in the US. Now he just turns on VPN, changes location, boom. 
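One practical note before the break: because exo also speaks a ChatGPT-compatible API on the cluster, you can drive it from the terminal as well as from the web UI. A sketch using the same port as the web UI above; the /v1/chat/completions path and the model identifier are assumptions based on the usual OpenAI-style conventions rather than anything shown on screen:

```bash
# Send a prompt to the cluster through exo's OpenAI-compatible endpoint.
curl http://localhost:52415/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama-3.2-1b",
        "messages": [
          {"role": "user", "content": "Tell me a scary story about computers."}
        ]
      }'
```

Timing a request like this and dividing the number of generated tokens by the elapsed seconds gives roughly the same tokens-per-second figure the web UI reports.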
And yeah, you can run NordVPN on things like an Apple TV. And number three, I'll use a VPN to protect myself and my family when we're away from the house, when we're using our devices. And yes, that does include connecting to weird Wi-Fi networks. Now, many VPN haters will say, you don't need a VPN anymore on the internet because most websites are HTTPS, meaning your connection between yourself and the website server is secure. And that's true for a lot of the internet. That's awesome. But what if you get to a website that does have SSL, but it's not a good website? It doesn't matter if it's encrypted. One way that people do this is a thing called typosquatting, typing of your website. So it might be netflux.com instead of Netflix, which in an ideal world would go nowhere, but people buy these websites, bad people, and put up a Netflix feeling place. But with NordVPN, they do have things like Threat Protection Pro. They'll tell you when websites you're visiting are bad. Also, they'll protect you from ads. Ads suck, except for this one. This one's awesome and you know it. But I put a lot of effort into blocking ads on my home networks and my business networks. But when we're out and about, them ads are looking at us getting hungry, turn on NordVPN, it will block ads. So yes, using a VPN is very much a valid thing to do in 2025. I highly recommend it. I use it all the time. So check out the link below, nordvpn.com forward slash networkchuck, or scan this new fancy QR code. Is the QR code safe? I don't know. Scan it, get NordVPN, and then scan it again to see if you are safe. And of course, if you use my link, you get a special deal. What are you waiting for? Check this out. Actually, it's a new year's big savings plus four extra months. I'm going to get it. Three bucks a month. Anyways, thanks to NordVPN for sponsoring this video and making stuff like this possible. Now back to clustering AI stuff. Now let's see what Thunderbolt does. I'm going to go connect my Thunderbolt stuff right now. All right, Thunderbolt connected. Now, how did I connect these hosts? I'll show you a video of it right now. Here's some B-roll, but essentially we're doing kind of a spoken hub situation. We got one Mac connected to all the other Macs. Obviously a less than ideal situation because this guy does become a bottleneck, but we're talking about 40 gigabit per second networking between these guys. And a nice little Thunderbolt bridge here. Thunderbolt networking isn't quite as advanced as regular TCP IP based networking or ethernet. So this is the best I could do without pulling my hair out with advanced configs, because I don't want to do that. So I assigned static IPs to all of them and XO by default should choose the fastest connection. Let's see if it does. And sure enough, we can see that the bridge zero Thunderbolt is monitored. And before we run all of them, I do want to test one host. Well, not one host. We already know how one host performs. Let's test two first. All right, we got Thunderbolt connection between two hosts. Should be very, very quick. Let's feed it a prompt and have some fun. Okay, it is slightly faster. I think before with 10 gigabit networking, we're talking about 50 tokens per second. So it's significant. Let's add three. Join the party, friend. Yeah, let's do the same prompt. Still, it's better than it was before, but notice even with Thunderbolt, we're hitting that bottleneck. Now, why is Thunderbolt better? 
More bandwidth, that's obvious, but it also has a more direct access to PCIe. Less overhead, more direct. Let's add the team. Come on in guys. All right, we've got a cluster of five. Let's see how we do. Okay, that's actually not bad. It's like, wait, you just told me and then you're telling me you can't? Make up your mind. Let's try this one. Okay, so all hosts are being used. Let's try this prompt and watch the networking happen. So bandwidth usage is obviously pretty much the same. We expected that, right? Now let's test a bigger model, the Llama 3.3 70B. My favorite model. Actually, DeepSeek R1 just came out. I haven't played with it yet. I'm gonna go disconnect the Thunderbolt connections and we'll run 10 gigabit on the first test. Okay, running XO on just one host, running the Llama 3.3 70B. I expect it to be pretty good. This is a quantized model, four bit. Should be able to run everything. And let's see how this performs. Here's the host right here. Watch the RAM usage just go crazy. And the seconds to first token are taking a bit. And then the GPU takes off. So 15 seconds to the first token. It was just low. And the performance is less than great. Now I will say this. We're gonna test Ollama after this. For whatever reason, the models in Ollama are better. I'm not sure what they're doing. And when I say better, they seem to perform better. I'm not actually sure if they are, you know, better. Okay, let's stop this nonsense. Let's try two. 70B. Well, that's the same question. Take a look at our metrics here and let's go. Gotta download part of it to the other host. All right, memory's coming in hot. You know, it's not doing too bad. Better than I expected. Let's check the networking. All right, networking testing now. Is it just me or is it using less bandwidth than before? That's funny. All right, let's add them all in. All five are having a party. Got our monitoring up. Asking a question now. We'll see how it performs. Oh, gotta download one more bit. Two minutes. Killing me. Now, this honestly is the most painful part of making this video. And probably for you, if you're ever gonna use this, it's waiting for these models to download. This video took me way too long. I anticipated one day for this video. Oh, no, no, no, no, no. Foolish Chuck. I wish, and I think it's coming. I saw a few pull requests on GitHub that you could host the models locally and then pull those down. I do love the fact that they break up the model across your network, but it would be so much faster if the model was local, if we didn't have to pull it from hugging face each time. All right, now we can finally try it, I think. Here we stinking go. Hey, that's actually not bad. We're using all hosts, 15 tokens a second. Memory usage is good. GPU spread across all our hosts. I'm happy with this. Not the fastest thing in the world, but it's stinking working. I love this. Let's test networking. All right, we got our networking monitoring set up. We'll launch XO once more. Five nodes up. And let's test it out and watch the networking go crazy. I'm assuming. Yeah, here we go. Okay, so we're distributing the network traffic across all the hosts. What's funny though, is I don't know if it's just iftop, it's acting kind of crazy, but it's showing like dev two is my computer and it's showing me being the highest bandwidth receiver. It's kind of weird, but looking at the cumulative, I mean, 64 megabits per second, that's bytes. And we're getting about 10 tokens a second. Let's test Thunderbolt.
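If you want to watch the interconnect the same way from a plain terminal, macOS ships the tools for it; a sketch, where the interface names are assumptions (bridge0 is what the exo output above reported for the Thunderbolt bridge, and your 10GbE port will be one of the enX interfaces):

```bash
ifconfig -l               # list interface names; find the Thunderbolt bridge (typically bridge0)
netstat -w 1 -I bridge0   # print one traffic sample per second for that interface
```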
We'll test two hosts just as we did before. Let's see how we do. So watch two hosts just go crazy and performance is, I mean, we are using Thunderbolt, right? Yeah, using Thunderbolt. Performance is meh. We're not using any swap, are we? So no swap RAM, meaning so if we ran out of RAM, it would switch over to swap, which means it would start to borrow RAM space from the hard drives, the SSDs, which are less performant, not as fast. RAM is extremely fast, which is why it costs so much. All right, let's test the team. All right, five hosts, Thunderbolt bridge, 70B. Let's see how we do. Hey, that's not too bad. That actually I'm happy with. 11 tokens per second, stuff's being spread across. Man, if we could figure out this bandwidth issue, that would be killer. I don't know how ExoLabs is gonna solve that though, because we're at the mercy of what hardware we have. Maybe they'll figure out something clever. Let's check the networking and then we'll jump into our final test. Can we run the 405B? Now, at the beginning of this video, I did say the 405B is the biggest, baddest of them all. DeepSeek R1 just came out. A local model, supposed to outperform O1 in reasoning. And their biggest one is the 671B, which is just a behemoth. And no, I cannot run that. That's way too big. I like doing the Thunderbolt run because on the networking side, it's very obvious that the hosts aren't talking to me because they're on their own little private network. Full Thunderbolt bridge connection. Let's ask you the fun question. Let's see if it'll do this. Ready, set, go. Okay, so still like 10 tokens per second watching the networking. It's funny, we're not seeing a lot, are we? That's weird. Is it not using the Thunderbolt bridge? They're definitely connected in that way. Am I losing my mind? I'm monitoring these interfaces, right? Why does the scale all the way up here to like 19 megabits when I'm not seeing the progress go? I'm probably just doing something wrong. It's kind of strange. Okay, enough of that. So the 70B, we know we can run this, but now I want to run the biggest, baddest model of them all. We'll see if we can run it on 10 gig first. Now I will say this, to run this model, it took me a bit because to download that model, it is so stinking large. And yes, it's amazing that I can just click on, say something when I went to the 405B up here. Let's see, it's cause they had it sitting right here and it would start to download to all my hosts, but it took forever. And when it finished, it was still kind of buggy. I didn't trust it. So I wanted to download it locally and run it locally, but that involved me finding a pull request that allowed that, a feature that wasn't involved yet. So I did that. I checked it out. I'm currently on that branch and looking down here, you'll see, I have a local 405 four bit model. Now I'm not even going to try around this on one host. It'll kill itself. Actually, you know what? We should just do that for fun, just to see it happen. So running one host, and by the way, I did have to download the entire model, which I think is roughly 200, almost, is about almost 200 gigs and then put it on each host, which was way more efficient than downloading it from Hugging Face. All right, it's about to tell me a story. Watch this RAM load up. It's like, ah, ah, it's about to scream. And then you'll probably see the swap. I mean, you'll definitely see it, I believe. Watch the swap right here. We're at 50. Here comes the swap. There it goes. 
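To see the swap spill just described without the GUI, the same thing can be watched from a terminal; a minimal sketch using built-in macOS tools:

```bash
# Poll unified-memory and swap usage every few seconds while a model loads.
while true; do
  date
  sysctl vm.swapusage              # total / used / free swap
  vm_stat | grep -iE 'free|swap'   # free pages plus swap-ins and swap-outs
  sleep 5
done
```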
We're going to use up all the hard drive space. And this is, it will end up working, I think, eventually, but it's just, I mean, we're at 20 gig swap. I don't want to use it. I think I'm almost out of hard drive space. I'm going to stop that right now. Get out of here. It'll probably end up timing out. Let's see if it goes back down. Okay, cool. That was scary. That's the reason we don't do that. But if we share RAM between all of our hosts, it should be a bit better. Let's make that happen right now. Let's get our cluster running. We're still on 10 gig ethernet. Cool. All the hosts are up. It does have the model running on each. You know, I actually need to run kind of a special version of the command. I need to specify MLX as the inference engine, just like this. It should auto discover that and run that, but I want to make sure I don't screw this up. We're active. Let's think and monitor and see what happens. Again, this was our goal, to run the biggest and baddest model if you ignore the recent news. But here we go. And run. All right, let's watch the RAM fill up across the board. So it's filling up the bottom guy here, bottom left. Hopefully it doesn't get to swap. It'll just start to disperse it across all the nodes. Is it filling it up here? Yep, it's doing the top right guy for now. So swap is inactive, still. Okay, it's just slowly filling up the cups of each node. All right, now we're filling up this guy. Still haven't gotten to a point where we started generating text. All right, filling up the top left. It's just taking a minute to load the model and memory. And I think we're almost there. Just to fully distribute the sucker, it's taking forever. Will it evenly distribute swap? I'm curious about that. Network error. But it did start. It gave me a word, it said here. But so far, I don't see any swap memory used. Let's refresh our page and try it again. It should keep the model loaded, so we don't have to wait again. In one paragraph. Okay, here we go. Generate something for me. Let's go. It's doing it. It only took five seconds to get to that first token, and we're rocking a blazing speed of 0.8 tokens a second. But you know what? We did it. We're running the biggest, baddest model of them all on local hardware. What would normally take an entire data center of stuff, we're doing it right here. Slow, but we did it. Take that, Zuckerberg. Take that, Musk. We don't need you, although we're using your model. So we're rocking 0.5 tokens a second. Will it be faster on Thunderbolt? Let's see. All right, I'm going to stop this nonsense. I got Thunderbolt up and running, or connected. Let's run XO now. Oh wait, I got to do my MLX version. All right, five nodes. We're on Thunderbolt. Here we go. I'm excited to see what this does. So right now it's currently not loaded in RAM, so it might take a bit. Here we stinking go. RAM's going nuts. Here we go. I wish we'd fill them all up at once. Man, it takes forever to load this model. Goodness. Coffee break. Haven't taken one of those in a while. It's going to time out before it loads it all up. Yep, and it timed out. Let's try it again. Okay, GPU is spiking. We're getting some stuff, but we're not using all the GPU. So again, our bottleneck here is the networking. I would love to know what the experience would be with some like serious connectivity between the GPUs of these five Mac studios. Now, performance is not any better. We're talking 0.6 tokens per second. We kind of froze at that. But as far as RAM goes, this is supportive. 
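On the "special version of the command" mentioned above: the transcript never shows the exact flag, so treat this as an assumption based on exo builds I have seen rather than something confirmed in the video, and check `exo --help` on your install:

```bash
# Start each node with the inference engine pinned to MLX (flag name assumed).
exo --inference-engine mlx
```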
Very exciting. Very slow, but exciting. Let's check the network activity. All right, let's get timed out on me. So networking, I mean, it's not using a lot of bandwidth. It's just not. Okay, old man, let's put you to sleep. You're too slow. Okay, so what I want to show you real quickly though, is the performance of Olama. So Olama, if you don't know what that is, is probably one of the best ways to run AI models locally on your machine. Any machine. It's so easy to install. So Olama list, let's see what I have right now installed. Olama's not running. Let me jump into the GUI real quick. With Rust desk. I forgot I wasn't using Linux. Which is why I love Macs. Sometimes you forget you're not using Linux. We'll run the 7DB 3.3. It'll take a minute to download. It's 42 gigs. But you'll see how much better this is. Now, while it's downloading, I was talking with Daniel Miesler, the creator of the Fabric project. And he, he's such a great guy. I love him. I texted him and said, hey, please add this support for ExoLabs. I want to test it for Fabric. He did it. So let's update Fabric real quick. All right, so here it is. And it's gonna go pretty quick. Oh, look at that. It's fast. So the model is definitely loaded up. No swap though. GPU usage, it's using the full GPU. So here's the thing with Exo. I think that maybe MLX performance of Mac isn't quite there, but we'll see something different on non-Mac computers. So things actually running NVIDIA GPUs. You know, I'm replacing my, my video editors PCs, the ones that have NVIDIA GPUs with these Macs. Should I do another video where I cluster all these extra computers together? Let me know. But it's doing great. Now, as you can tell, like it would not go well if I tried to run a larger model like the 405B, but this model with my 64 gigs of RAM on my Mac studio runs like a dream, which is pretty amazing. The 70B model is awesome. The new model from DeepSeek, we got to try that. They have a 70B as well. Let's see how much space we have on our machine here. Yeah, we got space. So let's try and run this. 42 more gigs. While it's doing that, I'm going to run Fabric. Okay. Let's test this out real quick. Hey, how are you? And we're not using swap and we're running DeepSeek. Like that's huge on a Mac studio. That's amazing. Okay. I've got Fabric updated to this branch. I'm so excited to try this. And by the way, I'm able to use Fabric or rather Daniel Miesler is able to implement Fabric with XO because XO uses chat GPT compatible APIs. Ah, that's not the command. Let's run the setup. We'll do XO. Make sure I'm at least, let me run it all my stuff here. Every one of these hosts should run its own little API and I'll run off the main one here. So let go. Tell me a story. Pipe that into Fabric. Okay. Okay. Not working. Oh, there we go. No. Oh, it's working. Yeah. I want it to stream to me though. What stream? It may not support streaming. Oh, there it goes. Oh, that's sick. I'm using Fabric with this. This is so cool. Okay. That's so cool. Daniel Miesler, thank you so much. And let's test it out. Okay. Things are happening. Ah, yes. Fabric. It's taking a minute. But here we go. Oh my gosh. That was a quick story. Let me have it summarize something. Let's go to bleeping computer. Let's take all this text. Can you paste? Summarize. Boom. Okay. So we're sending it a lot of text and there it goes. Oh my gosh. This is awesome. Yes. This is happening. Anyways, that's stinking cool. All right, let's bring this video home. It's been way too long. 
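Before the wrap-up, for reference: the Ollama comparison and the Fabric piping in this section map to a handful of commands. The model tags follow Ollama's usual naming and are assumptions on my part; run `ollama list` to see what you actually have pulled:

```bash
ollama list                                      # models installed locally
ollama run llama3.3:70b "Tell me a story"        # single-Mac run of the 70B model
ollama run deepseek-r1:70b "Hey, how are you"    # the DeepSeek R1 70B model mentioned above

# Fabric reads from stdin, so any text can be piped into a pattern:
pbpaste | fabric --pattern summarize
```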
ExoLabs is very cool. I'm excited about it. For the Mac with MLX, I think there's still more work that has to be done. Currently, networking is still a bottleneck. Although I don't know how it's going to perform on an NVIDIA-based cluster. Let me know if you want to see that below. Also, I was kind of thinking about doing a Raspberry Pi AI cluster with ExoLabs. Let me know if you want me to do that. That's all I got. I'll catch you guys next time.