How I Survived a DDoS Attack on a Raspberry Pi Cluster (video, 13 minutes)
Jeff Geerling recently experienced a DDoS attack on his website that prompted him to conduct a thorough analysis of the situation. Initially, he thought it might just be a DoS attack, but after checking the logs, he confirmed it was indeed a DDoS attack, with thousands of computers across the globe sending over 3,000 requests per second. In the video, Jeff discusses how he brought his site back online and what he learned in the process. He also shares reflections on how all this relates to the war in Ukraine, so it's worth staying tuned until the end.
Jeff recalls that on the morning of the attack he had posted a video on YouTube showing how he hosts his website on a Raspberry Pi cluster over a 4G LTE internet connection. In it, he noted that the main risk would be a lot of people posting comments at the same time, since that traffic has to reach the cluster so the comments can be saved in the site's database. When he then posted a link on Hacker News and the story hit the front page, it drew a substantial amount of traffic to his site, and unfortunately that spike turned out to be the prelude to the DDoS attack.
When Jeff noticed that the VPS proxying his cluster had suddenly gone down, he quickly sprang into action. Tailing the logs, he uncovered massive numbers of requests bombarding his website, and his first attempts at blocking IPs and rate limiting in Nginx didn't hold. Ultimately, he decided to put the site behind Cloudflare, which proved to be the pivotal step in halting the attack, and he adjusted his firewall so that all traffic was funneled through Cloudflare instead of hitting his VPS directly.
Jeff highlighted the importance of monitoring and alerting, which proved crucial in diagnosing the problem and will help prevent similar ones in the future. He also emphasized documenting everything during such incidents and the need to adapt to evolving threats. Following the attack, Cloudflare blocked millions of requests, and Jeff shared his insights on monitoring, risk management, and safeguarding his projects going forward.
In conclusion, Jeff touched on the ongoing war in Ukraine, noting that what is probably the largest IT war ever is currently being waged, with citizens worldwide supporting Ukraine through DDoS attacks on Russian government websites. DDoS attacks are becoming more frequent, and Jeff stressed that analytics tools and CDNs like Cloudflare have become essential for surviving such threats. At the time of writing this article, Jeff's video has 686,486 views and 24,455 likes, indicating that his experience has reached a significant audience.
Timeline summary
- The speaker's website was hit by a DDoS attack.
- They initially thought it was just a basic Denial of Service attack.
- Logs confirmed it was a Distributed Denial of Service attack.
- The attack involved thousands of computers generating over 3,000 requests per second.
- The speaker outlined the event and related it to Russia's invasion of Ukraine.
- On February 9th, they posted a new video about hosting their website.
- The speaker expressed confidence that their website would remain operational.
- After posting a link on Hacker News, traffic surged unexpectedly.
- Graphs showed a gradual traffic rise as users interacted with the site.
- An alert indicated the website was down due to overwhelming traffic.
- Requests peaked at over 2,000 per second before monitoring stopped working.
- The VPS was overloaded with PHP-FPM threads.
- POST requests were sent in massive quantities from various countries.
- The speaker documented everything to figure out the situation.
- They decided to use Cloudflare as an additional protection measure.
- IP blocking ended up blocking Cloudflare's servers too.
- A firewall rule was implemented allowing only Cloudflare to access the VPS.
- Cloudflare successfully blocked millions of attack requests.
- After lunch, the attacker seemed to give up temporarily.
- The speaker fixed subdomain DNS records lost in the migration.
- They decided to power off the Pi servers amid the response.
- Cloudflare handled a significant volume of requests.
- The speaker emphasized the importance of monitoring and alerting.
- They discussed the ongoing cyber conflict related to the war in Ukraine.
- DDoS attacks are prevalent due to botnets and hacked devices.
- The speaker critiqued the centralization of internet traffic around major services.
Transcription
My website just got DDoSed, and at first I thought it might have just been a DoS, or Denial of Service attack, where one person sends a bunch of traffic and makes it so the website is so overloaded it can't respond to every request. But after digging through my logs, I found it was definitely a DDoS, or Distributed Denial of Service attack. There were thousands of computers around the world hitting my website with over 3,000 requests per second. In this video, I'm going to run through what happened, talk about how I got my site back online, and what I learned in the process. And I'll also briefly discuss how all this relates to Russia's invasion of Ukraine, so stay tuned to the end for that.

On February 9th at 9am, I posted a new video to my YouTube channel. The video showed how I was hosting my website on a cluster of Raspberry Pis on a farm, using a 4G LTE internet connection. In the video, I said this: "Even if my website gets hit by a lot of traffic from Reddit or Hacker News, it should stay up and running. The only major risk is if a lot of people post comments at the same time, because all that traffic has to go through to the cluster so the comments get saved in the site's database." So yeah, maybe I shouldn't have tempted fate there. I was worried about a bunch of legitimate users hitting the site and posting comments at the same time, but since I've been running my website now for more than 10 years without a DDoS, that wasn't even something on my radar. So I posted that video, and after a bit, I posted a link to the blog post on Hacker News. Within a few minutes, I saw the story hit the front page on HN, which was a bit weird with how quickly it rose to the front, and as usual, the traffic on the site went up a bit, with over 150 visitors on the site at 10am. Blog posts from my site have hit HN's front page in the past, so I knew what to expect, and the cluster was handling it fine.

This graph shows the slow rise in traffic to the back end as some users were commenting and hitting pages on the site that weren't cached by my VPS. I was responding to some comments on YouTube when I got this email from UptimeRobot: Jeff Geerling was down. That was around 10:41 Central Time. I saw a huge spike in traffic on the VPS that was proxying the Pi cluster. That's this graph of requests from Nginx. The monitoring stopped working once the server was overloaded, but before that happened, Munin could see over 2,000 requests per second. I also saw a pretty sharp rise in traffic hitting the Pi cluster, but each of the Pis was actually handling the requests fast enough. Kubernetes was still responsive, so I knew the problem was up on my VPS. We were just 10 minutes in at this point. I logged back into my VPS and noticed its load was really high, and it was basically stuck with hundreds of PHP-FPM threads trying to handle requests to the Drupal back end on my Pi servers. That was definitely the bottleneck, and somehow whoever was sending this traffic was busting through my cache layer. PHP just couldn't keep up, so most requests were timing out. I live-tailed the Nginx logs and saw thousands and thousands of requests whizzing by. Someone was sending thousands of these POST requests through to my site, and I ran a command to sort requests by IP address and found that it was definitely a DDoS, because these requests were coming from all over the world. Tracking individual IPs, I saw most requests at this point were coming from Indonesia, Russia, and Brazil, among hundreds of other countries.
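The exact command isn't spelled out here, but the idea behind that per-IP sort is simple. A rough Python equivalent, assuming Nginx's default combined log format (where the client IP is the first field of each line) and a typical log path, might look like this:

    from collections import Counter

    def top_talkers(log_path="/var/log/nginx/access.log", limit=20):
        # Count requests per client IP; in the default combined log format
        # the remote address is the first whitespace-separated field.
        counts = Counter()
        with open(log_path, errors="replace") as log:
            for line in log:
                fields = line.split()
                if fields:
                    counts[fields[0]] += 1
        return counts.most_common(limit)

    if __name__ == "__main__":
        for ip, hits in top_talkers():
            print(f"{hits:>8}  {ip}")

A tally like this makes it clear whether you're dealing with one abusive client or, as in Jeff's case, traffic spread across thousands of addresses.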
At this point, I knew I needed to save some of this data off so I could sort through it later, so I started screenshotting everything, and I opened up a text file to dump everything I found. The first thing I learned many years ago, the first time I dealt with a major traffic spike for an enterprise website, was to always document everything that's happening, for two reasons. First, you'll need it later when you try to figure out exactly what happened and how to prevent it from being a problem in the future. And second, it usually helps your brain to slow down a little so it can piece together what to do next. And in my case, the first thing I tried was blocking IPs in Nginx's configuration. Initially, most requests were coming from one IP in Germany, but after I blocked that, 10 more rose up in its place. I also tried adding rate limiting to my Nginx config using the limit_req feature, but that was tricky to get right, and I ended up causing problems for other people visiting the site.

So instead of spending hours playing whack-a-mole with IPs, I decided to put my site behind Cloudflare. I already used Nginx to cache and proxy my Pi cluster so it would be somewhat shielded from a direct DoS attack, at least a smaller one, but my single VPS in New York couldn't handle the full onslaught of a DDoS attack punching through the cache. Cloudflare has thousands of servers and hundreds of data centers around the world, which I don't have, and a huge amount of bandwidth capacity. They handle DDoSes like the one I was getting every day, all day, and have a fully staffed NOC where they monitor things. I just had little old me, and I just wanted my website back up and running so I could focus on other things in life. So once I switched my domain's DNS to Cloudflare, my brilliant IP blocks actually got in the way because I ended up blocking the Cloudflare servers too. Oops. What I ultimately did was drop the IP blocking in Nginx, and instead I set up a DigitalOcean firewall rule that only let Cloudflare servers access my VPS. With that, I could make sure all the traffic hitting my website went through Cloudflare. If I left the server exposed, attackers could still hit the server directly if they knew its IP address, so I had to close it off entirely. Once I got the firewall running and added the page rule on Cloudflare to cache everything on my website, I was still seeing thousands of requests per second. So I went into Cloudflare's firewall rules and added a complete block on all POST requests, at least until I could get things in order. Within 30 minutes of that, Cloudflare had blocked almost 6 million POST requests, and my site was finally stable. Before that, my poor VPS was getting hit with over 3,000 POST requests per second. No wonder it couldn't cope. Both Munin's graphs and Cloudflare showed that the attack was a sustained 40 megabits per second coming from all over the world. Now, my server could handle that much raw traffic if it were all legitimate, like a front page post on Hacker News or Reddit, but because it was all POST requests, the server got crushed trying to run PHP for each request.

So at this point, it was about noon, and it was lunchtime. I decided to document everything else I did in a GitHub issue and take a break to eat. Nobody could comment on my website, and I couldn't even log in at that point, but I needed a break and some food.
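Jeff enforced the "only Cloudflare can reach the origin" rule with a DigitalOcean firewall, and the underlying check is worth illustrating. Cloudflare publishes its IP ranges (https://www.cloudflare.com/ips-v4 for IPv4 at the time of writing, with a matching ips-v6 list), and a minimal Python sketch of an allowlist test against them, with the sample addresses chosen purely for illustration, could look like this:

    import ipaddress
    import urllib.request

    # Cloudflare's published IPv4 ranges; there is a matching /ips-v6 list.
    CLOUDFLARE_IPV4_LIST = "https://www.cloudflare.com/ips-v4"

    def cloudflare_networks():
        # Fetch the current list of CIDR ranges, one per line.
        with urllib.request.urlopen(CLOUDFLARE_IPV4_LIST) as response:
            text = response.read().decode()
        return [ipaddress.ip_network(line) for line in text.split()]

    def is_cloudflare(ip, networks):
        # True if the client address falls inside any Cloudflare range.
        addr = ipaddress.ip_address(ip)
        return any(addr in net for net in networks)

    if __name__ == "__main__":
        nets = cloudflare_networks()
        for candidate in ("173.245.48.1", "203.0.113.7"):
            verdict = "allow" if is_cloudflare(candidate, nets) else "drop"
            print(candidate, "->", verdict)

In practice the check belongs in the firewall itself, as Jeff did, so traffic that doesn't come from Cloudflare never reaches Nginx or PHP at all.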
After lunch, I checked back in, and it looked like the attacker finally gave up, so I switched the POST block rule from blocking everyone to using a challenge to slow down bot attacks. So now people could actually comment on the site, but some of them might have to solve a captcha. But it wasn't over yet. Because I switched to Cloudflare and I have a relatively complex Drupal website, I had to clean up some messes. First of all, some users on Twitter noticed some of my subdomains were offline. They're on different servers, but when I migrated my DNS to Cloudflare, it didn't include any subdomain records. Hey, the shirt is right again. It was DNS. Grab it at redshirtjeff.com. So I had to add back records for sites like ansible.jeffgeerling.com and my Pi PCI Express website, but then I realized I couldn't save new blog posts on my site because Cloudflare's proxying interfered with the complex POST requests Drupal uses to prevent unauthorized access. So I had to set up a separate edit domain, something that's actually pretty common with large-scale Drupal sites that have to deal with multiple proxy layers like I do now. After that, people started reporting they were seeing duplicate stories from my site in their newsreaders. Yes, RSS is alive and well, and yes, I still hate that Google killed Reader. So I've been cleaning up all the fallout, and in the end, I decided to power off the Pi servers at my house, for now at least. Since I'm using Cloudflare anyways, I might bring it back up again, but using Cloudflare Tunnels next time. I mentioned that feature in my farm hosting video, but I didn't want to use it because it meant I'd have to use another third-party service and rely on that. But I mean, if I'm going to start getting hit by regular DDoS attacks, I kind of can't do that on my own anyways.

Speaking of which, after I relaxed my POST rule, a day later, I got another DDoS attack. As you can see from this graph, I quickly detected it and amped up the POST rule again until the attack stopped. This time, the traffic was coming mostly from Indonesia, China, Russia, and Brazil. That attack was almost a non-issue, but another attack that was a bit longer hit over the weekend on a Sunday during some family time when I wasn't checking my notifications. So three attacks in a week after a decade with zero. Cloudflare served up over 25 million requests, and that was after that first hour of my VPS trying to stay alive on its own that first day. In the end, Cloudflare served over 300 gigabytes of traffic over a period of a few hours during those attacks. By my estimation, if I'd left the Pi cluster running over 4G and didn't have any caching on my VPS, that would have cost me 1,400 bucks in data charges. Luckily though, I'm not an idiot, so I haven't left it running in that configuration since the day I released the video, and I've kept my wireless usage under my plan's limits. The attacker hasn't come back again, but I decided to leave Cloudflare in place because at this point my site has enough visibility that more attacks are bound to happen. So now that I've had this time to go through the data and clean up some of the mess, I have some thoughts. First of all, time and time again, when I've encountered something like this, good monitoring and alerting is the first line of defense, and without it, I would be completely blind.
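That first line of defense can start very small. In the spirit of the UptimeRobot alert that first flagged the outage, a bare-bones external check run on a schedule from a machine other than the web server can be sketched in a few lines of Python; the URL, timeout, and alerting step here are placeholders rather than anything from Jeff's setup, and this is only an illustration, not how UptimeRobot or Munin actually work:

    import urllib.request
    import urllib.error

    def site_is_up(url="https://www.jeffgeerling.com/", timeout=10):
        # A bare-bones external check: does the site answer with HTTP 200?
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                return response.status == 200
        except (urllib.error.URLError, OSError):
            return False

    if __name__ == "__main__":
        if site_is_up():
            print("OK: site responded with HTTP 200")
        else:
            # In a real setup this would page someone: email, SMS, or a chat webhook.
            print("ALERT: site appears to be down")

Even a check this simple, run from a separate host, buys the few minutes of head start that made a quick response possible here.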
When I helped build a commerce platform for some of the most popular music artists in the world, we had detailed graphs of everything on every level, and we had days that were way worse, with hundreds of thousands of legitimate requests within seconds of a merch drop. For that, I also had the budget to just throw hardware at the problem, but still, without the right monitoring, we would have lost millions of dollars in sales per year if it took too long to find the exact problem. Instead, we only had a few minutes of downtime here and there, and we helped the development team fix a few bugs that made the store scale much better. On my personal projects, I've been using Munin since around 2006, and it's chugged along without any issues, giving me insights from a separate monitoring server so I can still react even if my main servers go offline. For enterprise projects I've worked on, when budgets are bigger than the 5 or 10 bucks a month I have here, I've been able to use monitoring services like Sumo Logic, Datadog, New Relic, or Splunk. You can also self-host tools like Prometheus and Grafana, Zabbix, or even Munin, like I'm using here. It all depends on what you need and what you can invest in it, but in the end, you need monitoring. You need insights into exactly what's happening on your servers. The second thing I've learned is that I need to adjust my appetite for risk, at least a little. For years, I've run a bunch of services out of my home lab, and I've been pretty cavalier with showing how they work. I'm kind of open source by nature, and while I keep things secure and up to date, it's practically impossible to prevent DDoS attacks against public IPs unless you route everything through CDNs or have unlimited bandwidth, which, because of Spectrum, I definitely don't. I should probably be a little more careful about IP addresses and specific references in my public repos, but I won't stop doing any of this stuff because it's just too fun to share it with everyone.

But what does any of this have to do with the war in Ukraine? Well, right now, in addition to soldiers on the battlefield, Ukraine and Russia are also in the midst of probably the largest IT war ever waged. All around the world, there's been organic resistance to the Russian invasion through DDoS attacks on crucial Russian websites. It's gotten to the point where the Russian government is basically throwing up a massive firewall around their country and giving up trying to mitigate the constant barrage of traffic. How does it work? Well, one educational example is the UA CyberShield project on GitHub. In the description, it states that they don't support unlawful attacks on anyone else's website and that the software is provided only for educational purposes, but, well, I'm guessing there's at least a few people in the world who aren't only testing their own website with tools like this. But the point is this CyberShield software is one of thousands of small programs that people are running, both for good and for bad, on computers all around the world. Most of the computers running DDoS tools are part of botnets, though. They weren't intended to be used this way, but they're computers, routers, and even smart devices around the world that have been hacked and send a barrage of traffic at their targets. And I just so happened to be in the crosshairs of one of these botnets a few weeks ago.
Unless you have the bandwidth and resources to block all the traffic like Cloudflare does, you have no option but to basically go offline for a while until the traffic stops. And even providers like Cloudflare have their limits. Google, Amazon, Azure, and Cloudflare have all stopped historic DDoS attacks in the past few years. Attackers are constantly adapting, and it doesn't help that the internet is flooded with more hacked IoT devices, smart gadgets, old routers, and other compromised hardware every day. The backbone of the internet still has some ugly flaws that make things like DDoS attacks too easy for the attacker, and hard, if not impossible, for the defender. I don't really like the centralization of all internet traffic around services like Cloudflare, AWS, and Google, but it's practically the only way to keep a website online nowadays if you have any notoriety at all. Anyways, now you know why I haven't gotten a graphics card working on my Pi yet. Hopefully I can get back to that, and until next time, I'm Jeff Geerling.