When setting out to build a reliable web application, people frequently try to get rid of any single point of failure. Running two data centers and two databases is usually the hardest part. There are two common options for building in database and data center reliability: master-standby and master-master.
In master-standby, one data center and its database actively process user requests while the other is an inactive copy that can be quickly activated if the master fails. In master-master, two data centers and their databases both actively serve user traffic.
The advantage of master-standby fault tolerance is that at first blush it seems simpler. You just replicate your database to a standby data center. You don't need to change your app. If the primary data center or database fails, you just change your DNS to point to the backup data center.
The first problem with master-standby is that you don't really know whether your backup data center works, since most of the time it's just sitting there unused. It's relatively easy to test that the read functionality of your app works in the standby, but much harder to test that write functionality works without corrupting your database replication from the master.
The other problem is that you probably don't want automated failover to the backup. Once you fail over, you probably have to shut down the master manually and later go through some reconciliation process when it comes back up. So you risk longer periods of downtime while you decide whether a failure is really persistent, whether to fail everyone over to the standby, how to test the standby after failing over, and so on.
With master-master replication, you know both sides are always working, and you can let failover happen automatically any time health checks fail for one data center, even if the failure is only a transient hiccup lasting a few minutes. Real users are constantly being served out of both data centers, so you know they both work (assuming you have internal error reporting, users who notice problems and complain about them, constant monitoring on both sides, etc.).
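To make the automatic failover idea concrete, here is a minimal sketch of health-check-driven routing. Everything in it is an assumption for illustration: the `DataCenter` class, the `active_sites` function, the site names, and the threshold of three consecutive failed checks are all hypothetical, and a real setup would plug into whatever load balancer or DNS service you actually use.

```python
from dataclasses import dataclass


@dataclass
class DataCenter:
    # Hypothetical record of one site's recent health check history.
    name: str
    consecutive_failures: int = 0

    def record_check(self, ok: bool) -> None:
        # A passing check resets the counter, so a recovered site
        # rejoins the pool without any manual step.
        self.consecutive_failures = 0 if ok else self.consecutive_failures + 1


def active_sites(sites, max_failures=3):
    """Return the names of the data centers that should receive traffic."""
    healthy = [s.name for s in sites if s.consecutive_failures < max_failures]
    # If every site looks down, keep routing to all of them rather than none.
    return healthy or [s.name for s in sites]


east = DataCenter("east")
west = DataCenter("west")

# Three failed checks in a row take "west" out of rotation.
for ok in (False, False, False):
    west.record_check(ok)
print(active_sites([east, west]))  # ['east']

# One passing check puts it back.
west.record_check(True)
print(active_sites([east, west]))  # ['east', 'west']
```

Because both sides already serve real traffic, taking one out of rotation for a few minutes and putting it back is cheap, which is what makes this kind of automation reasonable here and risky in master-standby.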
One additional benefit of master-master replication is that your second data center isn't sitting idle 99% of the time. Be careful not to assume master-master replication increases your write capacity (it actually decreases it, since every write now has to happen twice, once in each data center), but it can increase your site's read capacity. As many web sites are read-mostly, this can be a big win.
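A rough back-of-the-envelope model shows why reads benefit and writes don't. The assumptions are mine, not measurements: each database can handle a fixed number of operations per second, reads and writes cost the same, reads are split evenly across masters, and every write is applied on every master. The function name `max_traffic` and the 1000 ops/sec figure are hypothetical.

```python
def max_traffic(node_capacity, read_fraction, masters):
    """Largest total request rate the setup can sustain under this model."""
    write_fraction = 1.0 - read_fraction
    # Reads are split across masters; every write lands on every master.
    load_per_request = read_fraction / masters + write_fraction
    return node_capacity / load_per_request


capacity = 1000.0  # assumed operations per second for one database
for reads in (0.9, 0.5, 0.0):
    single = max_traffic(capacity, reads, masters=1)
    double = max_traffic(capacity, reads, masters=2)
    print(f"{reads:.0%} reads: single={single:.0f}/s master-master={double:.0f}/s")
```

In this model a 90% read workload gets roughly 1.8 times the capacity of a single master, while a write-only workload gets no gain at all. The model ignores replication overhead, which in practice can push write capacity below that of a single master.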
The obvious downsides of master-master replication are that when replication gets out of sync you have a bigger mess to clean up, and that you have to write your application to be master-master aware. I'll talk about my experience with both of these in a future post.
Saturday, September 5, 2009