UPDATE: I've added the guts of the script in the following post.
A few months ago, we got our secondary MySQL server online, and we decided to run master-master replication between the two boxes. It's actually easier than it sounds, and so far it has been fairly tolerant of one of the masters going offline. The one issue we've had thus far was when an error on one of the boxes (a conflicting insert to a UNIQUE key) caused the masters to stop replicating. This unfortunately happened on a Saturday morning, and we didn't find out about it until Monday morning. Luckily, the guys over at Percona have already solved the problem of how to get the servers back in sync, as well as how best to monitor for the problem.
First, I cleaned up the issues that caused the servers to get out of sync, by skipping the queries causing the issues. I ran the following a few times, until the machines actually finished syncing:
stop slave; set global sql_slave_skip_counter=1; start slave; show slave status\G
Then, I used mk-table-sync to replace the data on the up-to-date master, which then caused replication to update the slave. This is an amazing tool. The TL;DR version is that it selects the data from all of the tables in all of the databases and performs a REPLACE on the data to trigger the replication update. This allows all of the data to "change" without actually changing anything on the correct master.
Finally, I set up some monitoring. Another Percona tool to the rescue, this time in the form of mk-heartbeat. This causes a row to be inserted into a specific table, with information about the master's status. Point both servers at the same table, and you end up replicating the table between them, so both boxes know exactly where the other is at all times.
mk-heartbeat is great, but it won't alert you to problems that you have within the servers (for example, replication stopping). So I wrote a script that wraps mk-heartbeat, using its --check parameter to determine the current lag. It performs this check 5 times, takes the average of those, and, if the average lag is above 30 seconds, sends a text message to me, letting me know that this box thinks its slave is out of sync. It will also tell me a couple of times a day if everything is all clear, just for my own peace of mind.
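To give a rough idea of the shape of that wrapper, here is a minimal Python sketch of the sample-average-alert logic. It is not the actual script. The function names, the `heartbeat` database name, the one-second pause between samples, and the `notify` callback are all made up for illustration, and it assumes `mk-heartbeat --check` prints a single number of seconds on stdout.

```python
import subprocess
import time

LAG_THRESHOLD_SECONDS = 30  # above this average, send an alert
SAMPLES = 5                 # number of --check readings to average


def read_lag(database="heartbeat", host="localhost"):
    """Run mk-heartbeat --check once and return the reported lag in seconds.

    Assumption: the database name and host are placeholders, and the tool
    is assumed to print one number on stdout.
    """
    result = subprocess.run(
        ["mk-heartbeat", "-D", database, "--check", "h=" + host],
        capture_output=True, text=True, check=True,
    )
    return float(result.stdout.strip())


def average_lag(read=read_lag, samples=SAMPLES, pause=1.0):
    """Take several lag readings, pausing between them, and average them."""
    readings = []
    for i in range(samples):
        readings.append(read())
        if i < samples - 1:
            time.sleep(pause)
    return sum(readings) / len(readings)


def check(read=read_lag, notify=print, threshold=LAG_THRESHOLD_SECONDS, pause=1.0):
    """Alert through notify() if the average lag exceeds the threshold.

    notify is a hypothetical hook; swap in whatever sends your text message.
    Returns True if an alert was sent.
    """
    lag = average_lag(read, pause=pause)
    if lag > threshold:
        notify("Replication lag averaging %.1fs (threshold %ds)" % (lag, threshold))
        return True
    return False
```

Calling `check()` from a scheduled job covers the alerting half. The twice-a-day "all clear" message is left out of this sketch.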
This script runs every 2 minutes. Our boxes are local to each other, and we're not really very write-heavy. I figure anything under 30 seconds of lag can be considered 'ok', but past that point, we have a problem that needs to be fixed ASAP. I specifically did not put in code that would only send a message X times over Y minutes, because if we get too far behind, it needs to nag me until the problem is solved. Who knows, I may be asleep the first few times it happens.