While I sit here and wait for myisamchk to finish and tell me that the table that various folks have spent the better part of the day trying to restore is either healthy or once again dead (how's that for a run-on sentence), I wanted to dump out some of the things we've done to try to make our MySQL backend scale. It's not been pretty, but given that it's strung together with some Perl, some MySQL, and a bunch of paper clips, I think the folks around me have proven themselves brilliant (I just sit around and pretend to know what's going on).
Oh, and this is all without a cluster. That's probably the next step. And "Oh: part 2," myisamchk finished checking 28 million rows. All is good. I've copied the data over to the main server and brought it up and everything is happy. Back to MySQL scalability ...
First issue: We've got too much data
This one was easy to solve. We got rid of it. Sort of. We started archiving off data that we no longer needed, based on age. Old support incidents, logging, anything with a timestamp that we aren't looking at gets sliced off into an archive so that the main tables can be as tidy and fast as we can get them. Which for us is about 8GB of data and not fast at all. But it's better than 12GB of data.
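The archive step boils down to "copy the old rows somewhere else, then delete them from the live table." Here's a minimal sketch of that pattern in Python, with SQLite standing in for MySQL; the table names, columns, and cutoff date are all invented for illustration:

```python
import sqlite3

# Hypothetical schema -- the real tables, columns, and cutoff are assumptions.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE incidents (id INTEGER, created TEXT, body TEXT)")
db.execute("CREATE TABLE incidents_archive (id INTEGER, created TEXT, body TEXT)")

db.executemany(
    "INSERT INTO incidents VALUES (?, ?, ?)",
    [(1, "2005-01-01", "old ticket"), (2, "2007-06-01", "recent ticket")],
)

cutoff = "2006-01-01"  # anything older than this gets sliced off

# Copy old rows into the archive, then delete them from the live table.
db.execute(
    "INSERT INTO incidents_archive SELECT * FROM incidents WHERE created < ?",
    (cutoff,),
)
db.execute("DELETE FROM incidents WHERE created < ?", (cutoff,))
db.commit()

print(db.execute("SELECT COUNT(*) FROM incidents").fetchone()[0])          # live rows
print(db.execute("SELECT COUNT(*) FROM incidents_archive").fetchone()[0])  # archived rows
```

In real MySQL you'd wrap the copy and delete so a crash between them can't lose rows, but the shape is the same.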
Second Issue: We've got too many connections
When you're small(ish), it makes sense to throw a bunch of dbs on the same server. As you grow, those connections start to swamp MySQL. MySQL starts to get all panicked, and it doesn't know how to handle all of the people asking for data, so it starts to get sloppy about closing old handles. Then it's basically like thermal runaway in a transistor. The server can't close old connections, new ones open up, adding more overhead, and all of a sudden your nice server has 5000 open connections and is hosed. Again, this was a pretty easy one to fix. Bring up a new box, move some databases over to it, and hope that you've built your code layer to make that swap pretty easy (ours was). Presto. Now both of your servers are happy.
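The swap stays easy when the code layer resolves database names to hosts in exactly one place. A hypothetical sketch of that lookup (the database and host names are invented; the real thing was Perl, and the DSN string below is just Perl DBI's format for flavor):

```python
# Hypothetical mapping of database name -> server. Moving a database to a
# new box is a one-line change here instead of a hunt through the codebase.
DB_HOSTS = {
    "billing": "db1.internal",
    "support": "db1.internal",
    "logging": "db2.internal",  # moved off the overloaded box
}

def dsn_for(db_name: str) -> str:
    """Build a Perl-DBI-style connection string for the given database."""
    return f"DBI:mysql:database={db_name};host={DB_HOSTS[db_name]}"

print(dsn_for("logging"))  # -> DBI:mysql:database=logging;host=db2.internal
```

Everything above the data layer keeps asking for "logging" and never learns it lives somewhere new.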
Third Issue: We lock up the damn tables all the time
We've got a lot of customers who are constantly accessing their sites. We've got nearly 1000 support agents across the globe who are using our tools to look at customer configuration to make sure there are no issues. This puts a whole chunk of load and repetitive queries on the database. That's easily handled.
Except when you add in a bunch of data updates. Agents, customers, new signups adding and editing data in the database. All of a sudden those hundred pending SELECT statements are stuck because one big SELECT locked the data when an UPDATE came in. Now you've got a bunch of web users who think your stuff is slow and/or broken. We've tried to attack this in a few ways:
- Fix your queries -- We watch our slow queries and try to make them faster. We look at our most often called queries and try to make them faster. Sounds simple, but bad queries are the biggest cause of problems.
- Add indexes -- This goes with "fixing" queries. Add indexes and make sure your queries use them.
- Perform fewer queries -- Can you cache your data? Can you make fewer queries and do more in your language of choice? Can you make your users smarter (maybe without them knowing) about when they need to request data? Do it. The fewer queries you have, the more likely you won't lock things up.
- Split your reads and your writes -- If you can split your reads and writes at the code layer, then you can shuttle reads off to one (or more) boxes and writes off to the primary box, and you should lock up a lot less. We accomplished this by having a couple of boxes replicate the main database, and having one of our smart engineers subclass Perl's DBI to look for SELECT statements and swap the database handle over to the read replica. It helps more than you might think (but it's not a silver bullet).
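That last one is less magic than it sounds: routing just means peeking at the statement's first keyword. We did it by subclassing Perl's DBI; here's a hypothetical Python stand-in with SQLite playing both the primary and its replica:

```python
import sqlite3

class SplitHandle:
    """Sketch of the read/write split: SELECTs go to a replica, everything
    else to the primary. A made-up stand-in, not the real Perl subclass."""

    def __init__(self, primary, replica):
        self.primary = primary
        self.replica = replica

    def execute(self, sql, params=()):
        is_read = sql.lstrip().upper().startswith("SELECT")
        handle = self.replica if is_read else self.primary
        return handle.execute(sql, params)

# Demo: two SQLite databases standing in for the primary and its replica.
primary = sqlite3.connect(":memory:")
replica = sqlite3.connect(":memory:")
for conn in (primary, replica):  # pretend replication already copied the schema
    conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
replica.execute("INSERT INTO users VALUES (1, 'replicated row')")

dbh = SplitHandle(primary, replica)
dbh.execute("INSERT INTO users VALUES (2, 'new row')")   # routed to the primary
rows = dbh.execute("SELECT name FROM users").fetchall()  # routed to the replica
print(rows)
```

The "not a silver bullet" part shows up right in the demo: the replica lags the primary, so a read that immediately follows your own write may not see it.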
Most of this sounds like common sense. It is. But it still matters. We're trying to do a lot with a little, and every ounce of performance you can squeeze out matters when your users are super demanding and will use any slowness as an excuse.
There are some other things we should and probably will try:
- Denormalization to bring data back together and cut down on costly joins
- Sharding to split our data up into smaller chunks and thus cut down on long table scans and huge indexes
- A real MySQL cluster to optimize reads and writes and spread traffic out to many nodes
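On the sharding idea, the simplest possible routing is a hash of the customer ID. A hypothetical sketch (the shard hosts are invented), with the caveat that plain modulo hashing makes adding a shard painful because most keys move:

```python
# Invented shard hosts for illustration.
SHARD_HOSTS = ["db-shard0.internal", "db-shard1.internal", "db-shard2.internal"]

def shard_for(customer_id: int) -> str:
    # Modulo hashing: dead simple, but resharding reshuffles most customers.
    # Consistent hashing or a lookup table is the usual fix for that.
    return SHARD_HOSTS[customer_id % len(SHARD_HOSTS)]

print(shard_for(12345))  # -> db-shard0.internal
```

Every customer's rows land on one box, so their queries scan one small table with one small index instead of one giant shared one.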
I wish I knew more. I'm still barely up the curve compared to some of the engineers and admins I work with. Thankfully, they've been able to keep our many million row tables (and many GB data and index files) humming along with few interruptions.