Slashdotted, Reddited, Fireballed, Penny-Arcaded, whatever the term du jour is: someone popular links to your site, and it’s overwhelmed by the traffic. Like everyone else, we made the poor assumption that we were somehow immune or that it somehow didn’t matter, and then Notch* (one of the Minecraft guys) linked to us and we went offline before we knew it.
There are ways to lower the “cost” of each web access (thus multiplying the capacity of your server), but they’re a bit complicated. Lucky for us, some of the most effective ones have been rolled into a ready-made WordPress plugin called W3 Total Cache. It takes maybe 10 to 20 minutes to install, and gives you something like 10 times the capacity. Just do it.
It’s worth noting (for your edification) that in our particular case, the above might still have helped (or even prevented a total meltdown), but the root cause was something it doesn’t address: a bad setting on our part. We failed to set a reasonable limit on the number of httpd instances (httpd is the program that dishes out the webpage to someone asking for it). Instead of just queueing up the people asking for the page, our server launched too many copies of that program, filled its RAM, and started swapping to disk like crazy (which slowed the machine down by a factor of over 1,000; for all intents and purposes, that’s as bad as crashing it). “What is a reasonable limit?”, you ask? That depends entirely on your machine. You need to actually measure how much memory Apache uses under a typical load, and budget the number of allowed instances accordingly. I would recommend tools for this, but since I’m not a real webmaster myself, my recommendations would probably be laughable. Suggestions in the comments are very welcome.
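To make the "measure, then budget" idea concrete, here is a rough sketch of the arithmetic in Python. Everything in it is an assumption for illustration: the function name, the idea of reserving a fixed chunk of memory for the OS and database, and the sample numbers are all made up, not measurements from our server. You would plug in your own figures (for example, the resident memory of an httpd process as reported by a tool like `ps` or `top`).

```python
def max_httpd_workers(total_ram_mb, reserved_mb, per_worker_mb):
    """Estimate how many httpd processes fit in RAM without swapping.

    total_ram_mb  -- physical memory on the machine (hypothetical input)
    reserved_mb   -- memory set aside for the OS, database, caches, etc.
    per_worker_mb -- measured memory of one httpd process under typical load
    """
    if per_worker_mb <= 0:
        raise ValueError("per_worker_mb must be positive")

    available_mb = total_ram_mb - reserved_mb
    if available_mb <= 0:
        # Nothing left over after the reservation: allow a single worker
        # so the site can still answer requests, however slowly.
        return 1

    # Round down: a worker that only partly fits would push us into swap.
    return max(1, int(available_mb // per_worker_mb))


# Illustrative numbers only: a 2 GB machine, 512 MB reserved,
# and httpd processes that each use about 40 MB.
limit = max_httpd_workers(total_ram_mb=2048, reserved_mb=512, per_worker_mb=40)
print(limit)  # prints 38
```

The result is the sort of number you would then use for Apache's process limit (the `MaxClients` directive in the prefork setup, renamed `MaxRequestWorkers` in Apache 2.4). Rounding down is deliberate: a queue of waiting visitors is much cheaper than a machine that has started swapping.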
For what it’s worth, Ben (crimson_penguin) tells me our previous setting was a limit of 253 httpd processes, each using 0.3-0.5% of the memory. Now, we have something less than 5. Dave, our lead coder, also notes that one ‘vicious cycle’ problem with such runaway accumulation of processes is that modern OSes generally fill any unused RAM with a filesystem cache, and if you spawn enough processes, they crowd that cache out. The system then starts page-faulting at the worst possible time.
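A quick back-of-the-envelope check shows why the old limit was a problem. This sketch assumes the 0.3-0.5% figures are percentages of total RAM and that all 253 processes are alive at once (the worst case); the function name is just for illustration.

```python
def worst_case_memory_percent(process_limit, percent_per_process):
    """Total share of RAM used if every allowed process is running."""
    return process_limit * percent_per_process


OLD_LIMIT = 253  # the previous httpd process limit mentioned above

low = worst_case_memory_percent(OLD_LIMIT, 0.3)   # roughly 76% of RAM
high = worst_case_memory_percent(OLD_LIMIT, 0.5)  # 126.5% of RAM

print(f"{low:.1f}% to {high:.1f}% of memory")

# Anything over 100% cannot fit in RAM and has to spill into swap.
print("exceeds physical memory:", high > 100)
```

Even at the low end, about three quarters of memory would go to httpd alone, leaving little for the filesystem cache or anything else; at the high end the machine is asked for more memory than it has.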
* in the honest and non-snide sense, thanks dude. Our poor preparedness was our own problem; we really appreciated the publicity!