How I survived a DDoS attack on my Raspberry Pi cluster (video, 13 minutes)
Jeff Geerling's website was hit by a DDoS attack, which prompted him to analyze the incident in detail. At first he thought it might only be a DoS attack, but after digging through the logs he confirmed it was a genuine DDoS, with thousands of computers around the world making over 3,000 requests per second. In the video, Jeff walks through how he got the site back online and what he learned along the way. He also shares some reflections on how the attacks relate to the war in Ukraine.
As Jeff recounts, shortly before the attack he had published a YouTube video showing how he hosts his website on a Raspberry Pi cluster over a 4G internet connection. In it he also mentioned the risk posed by a large number of comments being posted at once. To his surprise, the story quickly hit the Hacker News front page and drew a large amount of traffic. Unfortunately, that surge in traffic turned out to be the prelude to the DDoS attack.
When Jeff noticed that his VPS had suddenly stopped responding, he reacted immediately. Watching the logs, he discovered an enormous volume of requests hitting his site, which pushed him to take countermeasures. He ultimately decided to put the site behind Cloudflare, which proved to be the key step in stopping the attack. Jeff quickly made changes so that all traffic passed through Cloudflare rather than reaching his VPS directly.
Jeff stressed the need to monitor and analyze data, which proved invaluable for establishing the cause of the problems and preventing them in the future. He also emphasized how important documentation is in situations like this, and how essential it is to adapt to changing threats. By the time the attack ended, Cloudflare had blocked millions of requests, and Jeff shared his thoughts on monitoring, risk, and keeping his projects secure.
Finally, Jeff touched on the war in Ukraine, noting that what is probably the largest IT war ever is currently being waged, with citizens of various countries supporting Ukraine through DDoS attacks on Russian government websites. DDoS attacks are becoming increasingly common, but Jeff stressed that analytics tools and CDNs such as Cloudflare have become essential for surviving such threats. At the time of writing, Jeff's video had around 686,486 views and 24,455 likes, showing that his experience reached a large audience.
Timeline summary
- The speaker's website was hit by a DDoS attack.
- At first he thought it was just a basic Denial of Service attack.
- The logs confirmed it was a Distributed Denial of Service attack.
- The attack involved thousands of computers generating over 3,000 requests per second.
- The speaker outlined the incident and connected it to Russia's invasion of Ukraine.
- On February 9th he published a new video about hosting his website.
- The speaker expressed confidence that the site would stay up.
- Traffic rose suddenly after the link was posted on Hacker News.
- Graphs showed a gradual rise in traffic as users interacted with the site.
- An alert indicated the site was down due to the traffic overload.
- Requests peaked above 2,000 per second before monitoring failed.
- The VPS was overloaded with PHP-FPM threads.
- POST requests were arriving in huge volumes from many different countries.
- The speaker documented everything in order to understand the situation.
- He decided to use Cloudflare as an additional layer of protection.
- Blocking IP addresses ended up blocking Cloudflare's servers as well.
- A firewall rule was applied that allowed access only from Cloudflare.
- Cloudflare effectively blocked millions of attack requests.
- After lunch the attacker seemed to give up for a while.
- The speaker dealt with subdomains that had been left misconfigured.
- He decided to power off the Pi servers as part of the response.
- Cloudflare handled a substantial volume of requests.
- The speaker emphasized the importance of monitoring and alerting.
- He discussed the ongoing cyber conflict tied to the war in Ukraine.
- DDoS attacks are common because of botnets and compromised devices.
- He criticized the centralization of internet traffic around large services.
Transcription
My website just got DDoSed, and at first I thought it might have just been a DoS, or Denial of Service attack, where one person sends a bunch of traffic and makes it so the website is so overloaded it can't respond to every request. But after digging through my logs, I found it was definitely a DDoS, or Distributed Denial of Service attack. There were thousands of computers around the world hitting my website with over 3,000 requests per second. In this video, I'm going to run through what happened, talk about how I got my site back online, and what I learned in the process. And I'll also briefly discuss how all this relates to Russia's invasion of Ukraine, so stay tuned to the end for that.

On February 9th at 9am, I posted a new video to my YouTube channel. The video showed how I was hosting my website on a farm, on a cluster of Raspberry Pis, using a 4G LTE internet connection. In the video, I said this: "Even if my website gets hit by a lot of traffic from Reddit or Hacker News, it should stay up and running. The only major risk is if a lot of people post comments at the same time, because all that traffic has to go through to the cluster so the comments get saved in the site's database." So yeah, maybe I shouldn't have tempted fate there. I was worried about a bunch of legitimate users hitting the site and posting comments at the same time, but since I've been running my website now for more than 10 years without a DDoS, that wasn't even something on my radar. So I posted that video, and after a bit, I posted a link to the blog post on Hacker News. Within a few minutes, I saw the story hit the front page on HN, which was a bit weird with how quick it rose to the front, and as usual, the traffic on the site went up a bit, with over 150 visitors on the site at 10am. Blog posts from my site have hit HN's front page in the past, so I knew what to expect, and the cluster was handling it fine. This graph shows the slow rise in traffic to the back end as some users were commenting and hitting pages on the site that weren't cached by my VPS.

I was responding to some comments on YouTube when I got this email from UptimeRobot: Jeff Geerling was down. That was around 10:41 Central Time. I saw a huge spike in traffic on the VPS that was proxying the Pi cluster. That's this graph of requests from Nginx. The monitoring stopped working once the server was overloaded, but before that happened, Munin could see over 2,000 requests per second. I also saw a pretty sharp rise in traffic hitting the Pi cluster, but each of the Pis was actually handling the requests fast enough. Kubernetes was still responsive, so I knew the problem was up on my VPS. We were just 10 minutes in at this point. I logged back into my VPS and noticed its load was really high, and it was basically stuck with hundreds of PHP-FPM threads trying to handle requests to the Drupal back end on my Pi servers. That was definitely the bottleneck, and somehow whoever was sending this traffic was busting through my cache layer. PHP just couldn't keep up, so most requests were timing out. I live-tailed the Nginx logs and saw thousands and thousands of requests whizzing by. Someone was sending thousands of these POST requests through to my site, and I ran a command to sort requests by IP address and found that it was definitely a DDoS, because these requests were coming from all over the world. Tracking individual IPs, I saw most requests at this point were coming from Indonesia, Russia, and Brazil, among hundreds of other countries.
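To give a concrete idea of that triage step, here is a rough Python equivalent of the per-IP tally Jeff describes. He ran a shell one-liner against the live logs; this is not his exact command, just a minimal sketch that assumes the default Nginx combined log format (client address as the first field) and an example log path.

    # Rough sketch of the "sort requests by IP" triage described above.
    # Assumes the default Nginx combined log format, where the client IP
    # is the first field on each line; the log path is only an example.
    from collections import Counter

    def top_ips(log_path="/var/log/nginx/access.log", n=20):
        counts = Counter()
        with open(log_path) as log:
            for line in log:
                ip = line.split(" ", 1)[0]  # first field: client address
                counts[ip] += 1
        return counts.most_common(n)

    if __name__ == "__main__":
        for ip, hits in top_ips():
            print(f"{hits:8d}  {ip}")

During a distributed attack the output shows many IPs with moderate counts rather than one dominant source, which is exactly how Jeff distinguished a DDoS from a simple DoS.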
At this point, I knew I needed to save some of this data off so I could sort through it later, so I started screenshotting everything, and I opened up a text file to dump everything I found. The first thing I learned many years ago, the first time I dealt with a major traffic spike for an enterprise website, was to always document everything that's happening, for two reasons. First, you'll need it later when you try to figure out exactly what happened and how to prevent it from being a problem in the future. And second, it usually helps your brain to slow down a little so it can piece together what to do next. And in my case, the first thing I tried was blocking IPs in Nginx's configuration. Initially, most requests were coming from one IP in Germany, but after I blocked that, 10 more rose up in its place. I also tried adding rate limiting to my Nginx config using the limit_req feature, but that was tricky to get right, and I ended up causing other problems for other people visiting the site.

So instead of spending hours playing whack-a-mole with IPs, I decided to put my site behind Cloudflare. I already used Nginx to cache and proxy my Pi cluster so it would be somewhat shielded from a direct DoS attack, at least a smaller one, but my single VPS in New York couldn't handle the full onslaught of a DDoS attack punching through the cache. Cloudflare has thousands of servers and hundreds of data centers around the world, which I don't have, and a huge amount of bandwidth capacity. They handle DDoSes like the one I was getting every day, all day, and have a fully staffed NOC where they monitor things. I just had little old me, and I just wanted my website back up and running so I could focus on other things in life. So once I switched my domain's DNS to Cloudflare, my brilliant IP blocks actually got in the way, because I ended up blocking the Cloudflare servers too. Oops. What I ultimately did was drop the IP blocking in Nginx, and instead I set up a DigitalOcean firewall rule that only let Cloudflare servers access my VPS. With that, I could make sure all the traffic hitting my website went through Cloudflare. If I left the server exposed, attackers could still hit the server directly if they knew its IP address, so I had to close it off entirely. Once I got the firewall running and added the page rule on Cloudflare to cache everything on my website, I was still seeing thousands of requests per second. So I went into Cloudflare's firewall rules and added a complete block on all POST requests, at least until I could get things in order. Within 30 minutes of that, Cloudflare had blocked almost 6 million POST requests, and my site was finally stable. Before that, my poor VPS was getting hit with over 3,000 POST requests per second. No wonder it couldn't cope. Both Munin's graphs and Cloudflare showed that the attack was a sustained 40 megabits per second coming from all over the world. Now, my server could handle that much raw traffic if it were all legitimate, like a front page post on Hacker News or Reddit, but because it was all POST requests, the server got crushed trying to run PHP for each request. So at this point, it was about noon, and it was lunchtime. I decided to document everything else I did in a GitHub issue and take a break to eat. Nobody could comment on my website, and I couldn't even log in right now, but I needed a break and some food.
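One practical detail behind the firewall step above: Cloudflare publishes its edge IP ranges as plain-text lists, which is what makes an "allow only Cloudflare" rule possible. Below is a minimal sketch of pulling those ranges so they can be fed into an allow-list (a DigitalOcean Cloud Firewall in Jeff's case, though any firewall tooling would do). The two URLs are Cloudflare's documented list endpoints; everything else here is illustrative rather than Jeff's actual setup.

    # Minimal sketch: fetch Cloudflare's published IP ranges so they can
    # be turned into "allow only Cloudflare on ports 80/443" firewall
    # rules. How the ranges are applied is left to your firewall tooling.
    import urllib.request

    CLOUDFLARE_RANGE_URLS = [
        "https://www.cloudflare.com/ips-v4",
        "https://www.cloudflare.com/ips-v6",
    ]

    def cloudflare_ranges():
        ranges = []
        for url in CLOUDFLARE_RANGE_URLS:
            with urllib.request.urlopen(url) as resp:
                ranges += resp.read().decode().split()
        return ranges

    if __name__ == "__main__":
        for cidr in cloudflare_ranges():
            print(cidr)  # one CIDR per line, ready for a firewall script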
After lunch, I checked back in, and it looked like the attacker finally gave up, so I switched the POST blocker to not block everyone, but still use a challenge to slow down bot attacks. So now people could actually comment on the site, but some of them might have to solve a captcha. But it wasn't over yet. Because I switched to Cloudflare and I have a relatively complex Drupal website, I had to clean up some messes. First of all, some users on Twitter noticed some of my subdomains were offline. They're on different servers, but when I migrated my DNS to Cloudflare, it didn't include any subdomain records. Hey, the shirt is right again. It was DNS. Grab it at redshirtjeff.com. So I had to add back records for sites like ansible.jeffgeerling.com and my Pi PCI Express website, but then I realized I couldn't save new blog posts on my site because Cloudflare's proxying interfered with the complex POST requests Drupal uses to prevent unauthorized access. So I had to set up a separate edit domain, something that's actually pretty common with large-scale Drupal sites that have to deal with multiple proxy layers like I do now. After that, people started reporting they were seeing duplicate stories from my site in their newsreaders. Yes, RSS is alive and well, and yes, I still hate that Google killed Reader. So I've been cleaning up all the fallout, and in the end, I decided to power off the Pi servers at my house, for now at least. Since I'm using Cloudflare anyways, I might bring it back up again, but using Cloudflare Tunnels next time. I mentioned that feature in my farm hosting video, but I didn't want to use it because it meant I'd have to use another third-party service and rely on that. But I mean, if I'm going to start getting hit by regular DDoS attacks, I kind of can't do that on my own anyways.

Speaking of which, after I relaxed my POST rule, a day later, I got another DDoS attack. As you can see from this graph, I quickly detected it and amped up the POST rule again until the attack stopped. This time, the traffic was coming mostly from Indonesia, China, Russia, and Brazil. That attack was almost a non-issue, but another attack that was a bit longer hit over the weekend on a Sunday during some family time when I wasn't checking my notifications. So three attacks in a week after a decade with zero. Cloudflare served up over 25 million requests, and that was after that first hour of my VPS trying to stay alive on its own that first day. In the end, Cloudflare served over 300 gigabytes of traffic over a period of a few hours during those attacks. By my estimation, if I'd left the Pi cluster running over 4G and didn't have any caching on my VPS, that would have cost me 1,400 bucks in data charges. Luckily though, I'm not an idiot, so I haven't left it running in that configuration since the day I released the video, and I've kept my wireless usage under my plan's limits. The attacker hasn't come back again, but I decided to leave Cloudflare in place because at this point my site has enough visibility that more attacks are bound to happen. So now that I've had this time to go through the data and clean up some of the mess, I have some thoughts. First of all, time and time again, every time I've encountered something like this, good monitoring and alerting is the first line of defense, and without it, I would be completely blind.
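Since the monitoring point keeps coming up, here is a minimal sketch of the kind of outside-in uptime check that raised the first alarm (UptimeRobot, in Jeff's case). The URL, timeout, and alerting hook are illustrative rather than his actual configuration; a real setup would run something like this on a schedule from a separate machine and notify you on failure.

    # Minimal sketch of an external uptime check, in the spirit of the
    # UptimeRobot alert described earlier. URL and timeout are examples.
    import time
    import urllib.request
    import urllib.error

    def check(url="https://www.jeffgeerling.com/", timeout=10):
        start = time.monotonic()
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                status = resp.status
        except (urllib.error.URLError, OSError) as exc:
            # Covers timeouts, connection failures, and HTTP error codes.
            return {"up": False, "error": str(exc)}
        return {"up": 200 <= status < 400,
                "status": status,
                "seconds": round(time.monotonic() - start, 2)}

    if __name__ == "__main__":
        result = check()
        print(result)
        # A real monitor would send an email/SMS/webhook when result["up"]
        # is False or the response time climbs, instead of just printing.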
When I helped build a commerce platform for some of the most popular music artists in the world, we had detailed graphs of everything on every level, and we had days that were way worse, with hundreds of thousands of legitimate requests within seconds of a merch drop. For that, I also had the budget to just throw hardware at the problem, but still, without the right monitoring, we would have lost millions of dollars in sales per year if it took too long to find the exact problem. Instead, we only had a few minutes of downtime here and there, and we helped the development team fix a few bugs, which made the store scale much better. On my personal projects, I've been using Munin since around 2006, and it's chugged along without any issues, giving me insights from a separate monitoring server so I can still react even if my main servers go offline. For enterprise projects I've worked on, when budgets are more than the 5 or 10 bucks a month I have here, I've been able to use monitoring services like Sumo Logic, Datadog, New Relic, or Splunk. You can also self-host tools like Prometheus and Grafana, Zabbix, or even Munin like I'm using here. It all depends on what you need and what you can invest in it, but in the end, you need monitoring. You need insights into exactly what's happening on your servers.

The second thing I've learned is that I need to adjust my appetite for risk, at least a little. For years, I've run a bunch of services out of my home lab, and I've been pretty cavalier with showing how they work. I'm kind of open source by nature, and while I keep things secure and up to date, it's practically impossible to prevent DDoS attacks against public IPs unless you route everything through CDNs or have unlimited bandwidth, which, because of Spectrum, I definitely don't. I should probably be a little more careful about IP addresses and specific references in my public repos, but I won't stop doing any of this stuff because it's just too fun to share it with everyone.

But what does any of this have to do with the war in Ukraine? Well, right now, in addition to soldiers on the battlefield, Ukraine and Russia are also in the midst of probably the largest IT war ever waged. All around the world, there's been organic resistance to the Russian invasion through DDoS attacks on crucial Russian websites. It's gotten to the point where the Russian government is basically throwing up a massive firewall around their country and giving up on trying to mitigate the constant barrage of traffic. How does it work? Well, one educational example is the UA CyberShield project on GitHub. In the description, it states that they don't support unlawful attacks on anyone else's website and that the software is provided only for educational purposes, but, well, I'm guessing there's at least a few people in the world who aren't only testing their own website with tools like this. But the point is this CyberShield software is one of thousands of small programs that people are running, both for good and for bad, on computers all around the world. Most of the computers running DDoS tools are part of botnets, though. They're not intended to be this way, but they're computers, routers, and even smart devices around the world that have been hacked and send a barrage of traffic at their targets. And I just so happened to be in the crosshairs of one of these botnets a few weeks ago.
Unless you have the bandwidth and resources to block all that traffic like Cloudflare does, you have no option but to basically go offline for a while until the traffic stops. And even these providers like Cloudflare have their limits. Google, Amazon, Azure, and Cloudflare have all stopped historic DDoS attacks in the past few years. Attackers are constantly adapting, and it doesn't help that the internet is flooded with more hacked IoT and smart devices, old hacked routers, and other compromised hardware every day. The backbone of the internet still has some ugly flaws that make things like DDoS attacks too easy for the attacker, and hard, if not impossible, for the defender to stop. I don't really like the centralization of all internet traffic around services like Cloudflare, AWS, and Google, but it's practically the only way to keep a website online nowadays if you have any notoriety at all. Anyways, now you know why I haven't gotten a graphics card working on my Pi yet. Hopefully I can get back to that, and until next time, I'm Jeff Geerling.