Managing consistency in a hierarchical deployment is significantly
different from managing consistency in a simple replication topology where
each slave is connected directly to the master. Here, it is not possible
to wait for a master position, since the positions are changed by every
intermediate relay server. Instead, it is necessary to figure out another
way to wait for the transactions. The MASTER_POS_WAIT function is handy
for handling such waits, so if it were possible to use it here, it would
solve a lot of problems. There are
basically two alternatives that you can use to ensure you are not reading
stale data.
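For reference, in a simple topology the wait can be handled directly on the slave with MASTER_POS_WAIT. The following is a minimal sketch; the binlog file name, position, and timeout are illustrative values only, not taken from the examples below:
// Wait at most 10 seconds for the slave to reach the given master position.
// MASTER_POS_WAIT returns NULL if the slave is not running and -1 on timeout.
$result = $slave->query(
    "SELECT MASTER_POS_WAIT('master-bin.000012', 4711, 10) AS status");
$row = $result->fetch_assoc();
$result->close();
if ($row['status'] === NULL || $row['status'] < 0)
    error_log('Slave did not reach the position before the timeout');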
The first solution is to rely on the global transaction ID to handle slave promotions
and to poll the slave repeatedly until it has processed the
transaction.
The second solution, illustrated in Figure 1, connects to all the
relay servers in the path from the master to the final slave
to ensure the change propagates to the slave. It is necessary to connect
to each relay slave between the master and the slave, since it is not
possible to know which binlog position will be used on each of the relay
servers.
Both solutions have their merits, so let’s consider the advantages
and disadvantages of each of them.
If the slaves are normally up-to-date with respect to the master,
the first solution will perform a simple check of the final slave only and
will usually show that the transaction has been replicated to the slave
and that processing can proceed. If the transaction has not been processed
yet, it is likely that it will be processed before the next check, so the
second time the final slave is checked, it will show that the transaction
has reached the slave. If the checking period is small enough, the delay
will not be noticeable to the user,
so a typical consistency check will require one or two extra messages when
polling the final slave. This approach requires only the final slave to be
polled, not any of the intermediate slaves. This can be an advantage from
an administrative point of view as well, since it does not require keeping track
of the intermediate slaves and how they are connected.
On the other hand, if the slaves normally lag behind, or if the
replication lag varies a lot, the second approach is probably better. The
first solution will repeatedly poll the slave, and most of the time will
report that the transaction has not been committed on
the slave. You can handle this by increasing the polling period, but if
the polling period has to be so large that the response time is
unacceptable, the first solution will not work well. In this case, it is
better to use the second solution and wait for the changes to ripple down
the replication tree and then execute the query.
For a tree with N final slaves, the number of extra
requests will then be proportional to log N. For
instance, if you have 50 relay servers and each relay server handles 50
final slaves, you can handle all 2,500 slaves with exactly two extra
requests per commit: one to the relay slave and one to the final slave.
The disadvantages of the second approach are:
It requires the application code to have access to the relay
slaves so that it can connect to each relay slave in turn and wait
for the position to be reached.
It requires the application code to keep track of the replication
topology so that the relay servers can be queried.
Querying the relay slaves will slow them down, since they have to
handle more work, but in practice, this might turn out not to be a
problem. By introducing a caching database connection layer, you can avoid
some of the traffic. The caching layer will remember the binlog position
each time a wait is requested and will query the relay only if the
requested position is greater than the cached one. The following is a
rough stub for the caching function:
function wait_for_pos($server, $wait_for_pos) {
    if (cached position for $server > $wait_for_pos)
        return TRUE;
    else {
        code to wait for position and update cache
    }
}
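To make the idea slightly more concrete, here is one possible way to fill in the stub. It assumes positions can be compared as plain values, that each server object exposes a name property usable as a cache key, and that a hypothetical helper sync_with_position($server, $pos) performs the actual wait on the relay; none of these are part of the stub above:
$cached_pos = array();   // Highest position known to be applied, keyed by server name

function wait_for_pos($server, $wait_for_pos) {
    global $cached_pos;
    // If the cached position already covers the requested one, skip the round-trip.
    if (isset($cached_pos[$server->name]) && $cached_pos[$server->name] >= $wait_for_pos)
        return TRUE;
    // Otherwise wait on the relay and remember how far it has come.
    if (!sync_with_position($server, $wait_for_pos))   // Hypothetical helper
        return FALSE;
    $cached_pos[$server->name] = $wait_for_pos;
    return TRUE;
}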
Since the binlog positions are always increasing—once a binlog
position is passed it remains passed—there is no risk of returning an
incorrect result. The only way to know for sure which technique is more
efficient is to monitor and profile the deployment to make sure queries
are executed fast enough for the application.
Example 1 shows
sample code to handle the first solution—querying the slave repeatedly to
see whether the transaction has been executed.
Example 1. PHP code for avoiding reading stale data using polling
function fetch_trans_id($server) {
    $result = $server->query('SELECT server_id, trans_id FROM Last_Exec_Trans');
    if ($result == NULL)
        return NULL;   // Execution failed
    $row = $result->fetch_assoc();
    if ($row == NULL)
        return NULL;   // Empty table !?
    $gid = array($row['server_id'], $row['trans_id']);
    $result->close();
    return $gid;
}

function wait_for_trans_id($server, $server_id, $trans_id) {
    if ($server_id == NULL || $trans_id == NULL)
        return TRUE;   // No transactions executed, trivially in sync
    $server->autocommit(TRUE);
    $gid = fetch_trans_id($server);
    if ($gid == NULL)
        return FALSE;
    list($current_server_id, $current_trans_id) = $gid;
    while ($current_server_id != $server_id || $current_trans_id < $trans_id) {
        usleep(500000);   // Wait half a second
        $gid = fetch_trans_id($server);
        if ($gid == NULL)
            return FALSE;
        list($current_server_id, $current_trans_id) = $gid;
    }
    return TRUE;
}

function commit_and_sync($master, $slave) {
    if ($master->commit()) {
        $gid = fetch_trans_id($master);
        if ($gid == NULL)
            return NULL;
        if (!wait_for_trans_id($slave, $gid[0], $gid[1]))
            return NULL;
        return TRUE;
    }
    return FALSE;
}

function start_trans($server) {
    $server->autocommit(FALSE);
}
The difference from the earlier functions is that the ones in Example 1
internally call fetch_trans_id and wait_for_trans_id instead of
fetch_master_pos and wait_for_pos. Some points worth noting in the
code:
We turn on autocommit in wait_for_trans_id before starting to query
the slave. This is necessary because if the isolation level is
repeatable read or stricter, the
SELECT will find the same global transaction ID every time. By turning
on autocommit, each SELECT is committed as a separate transaction and
therefore sees the most recently committed values. An alternative is to
use the read committed isolation level.
To avoid unnecessary sleeps in wait_for_trans_id, we fetch the global transaction ID and check it once before entering
the loop.
This code requires access only to the master and slave, not to
the intermediate relay servers.
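As a hypothetical usage sketch (the table, column, and statements are placeholders, not part of Example 1), a write followed by a consistent read from the slave could look like this:
// $master and $slave are assumed to be connection objects like those used above.
start_trans($master);
$master->query("UPDATE message_board SET message = 'Hello' WHERE id = 4711");
if (commit_and_sync($master, $slave))
    $result = $slave->query('SELECT message FROM message_board WHERE id = 4711');
else
    error_log('Commit or synchronization failed; read from the master instead');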
Example 2 includes
code for ensuring you do not read stale data. It uses the technique of
querying all servers between the master and the final slave. This method
proceeds by first finding the entire chain of servers between the final
slave and the master, and then synchronizing each in turn all the way down
the chain until the transaction reaches the final slave.
Example 2. PHP code for avoiding reading stale data using waiting
function fetch_relay_chain($master, $final) {
    $servers = array();
    $server = $final;
    while ($server !== $master) {
        $servers[] = $server;
        $server = get_master_for($server);
    }
    $servers[] = $master;
    return $servers;
}

function commit_and_sync($master, $slave) {
    if ($master->commit()) {
        $server = fetch_relay_chain($master, $slave);
        for ($i = sizeof($server) - 1; $i > 0; --$i) {
            if (!sync_with_master($server[$i], $server[$i-1]))
                return NULL;   // Synchronization failed
        }
        return TRUE;
    }
    return FALSE;
}

function start_trans($server) {
    $server->autocommit(FALSE);
}
To find all the servers between the master and the slave, we use the
function fetch_relay_chain. It
starts from the slave and uses the function get_master_for to get the master for a slave. We
have deliberately not included the code for this function, since it does
not add anything to our current discussion. However, this function has to
be defined for the code to work.
After the relay chain is fetched, the code synchronizes each slave with
its master all the way down the chain, ending with the final slave.
Note:
One way to fetch the master for a server is to use SHOW SLAVE STATUS and read the Master_Host and Master_Port fields. If you do this for each
transaction you are about to commit, however, the system will be very
slow.
Since the topology rarely changes, it is better to cache the
information on the application servers, or somewhere else, to avoid
excessive traffic to the database servers.
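Although Example 2 deliberately leaves get_master_for undefined, a rough sketch along the lines of this note might look as follows. The connect_to helper, the name property used as a cache key, and the cache structure are assumptions for illustration only:
$master_cache = array();   // Cached master connection for each server, keyed by server name

function get_master_for($server) {
    global $master_cache;
    if (isset($master_cache[$server->name]))
        return $master_cache[$server->name];   // The topology rarely changes, so reuse it
    $result = $server->query('SHOW SLAVE STATUS');
    $row = $result->fetch_assoc();
    $result->close();
    // connect_to() is a hypothetical helper that opens a connection to the given host and port.
    $master = connect_to($row['Master_Host'], $row['Master_Port']);
    $master_cache[$server->name] = $master;
    return $master;
}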
The
master is a critical component of a deployment and is likely to be a more
powerful machine than the slaves, so you should restore it to the master
position when bringing it back. Since the master stopped unexpectedly, it
is very likely to be out of sync with the rest of the deployment. This can
happen in two ways:
If the master has been offline for more than just a short time,
the rest of the system will have committed many transactions that the
master is not aware of. In a sense, the master is in an
alternative future compared to the rest of the
system. An illustration of this situation is shown in Figure 2.
If the master committed a transaction and wrote it to the binary
log, then crashed just after it acknowledged the transaction, the
transaction may not have made it to the slaves. This means the master
has one or more transactions that have not been seen by the slaves,
nor by any other part of the system.
If the original master is not too far behind the current master, the
easiest solution to the first problem is to connect the original master as
a slave to the current master, and then switch over all slaves to the
original master once it has caught up. If, however, the original master
has been offline for a significant period, it is likely to be faster to
clone one of the slaves to the original master and then switch over all
the slaves to it.
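For reference, attaching the original master as a slave of the current master is an ordinary CHANGE MASTER TO followed by START SLAVE. The host, credentials, and binlog coordinates below are placeholders; in practice the coordinates have to be determined as part of the recovery procedure:
// Illustrative values only; run on the recovered original master.
$original_master->query("
    CHANGE MASTER TO
        MASTER_HOST = 'current-master.example.com',
        MASTER_PORT = 3306,
        MASTER_USER = 'repl_user',
        MASTER_PASSWORD = 'xyzzy',
        MASTER_LOG_FILE = 'master-bin.000042',
        MASTER_LOG_POS = 4711");
$original_master->query('START SLAVE');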
If the master is in an alternative future, it is not likely that its
extra transactions should be brought into the deployment. Why? Because the
sudden appearance of a new transaction is likely to conflict with existing
transactions in subtle ways. For example, if the transaction is a message
in a message board, it is likely that a user has already reposted the
message. If a message written earlier but reported as missing—because the
master crashed before the message was sent to a slave—suddenly reappears,
it will befuddle the users and definitely be considered an annoyance. In a
similar manner, users will not look kindly on shopping carts suddenly
having items added because the master was brought back into the
system.
In short, you can solve both of the out-of-sync problems—the master
in an alternative future and the master that needs to catch up—by simply
cloning a slave to the original master and then switching over each of the
current slaves in turn to the original master.
These problems, however, highlight how important it is to ensure
consistency by checking that changes to a master are available on some
other system before reporting the transaction as complete, in the event
that the master should crash. From a recovery perspective, it is not
necessary for every server in the deployment to have the change:
it is sufficient to ensure the transaction is available on at least one
other machine, for example on one of the slaves or relay servers connected
to the master. In general, you can tolerate n−1
failures if you have the change available on n servers.