Managing consistency in a hierarchical deployment is significantly
different from managing consistency in a simple replication topology where
each slave is connected directly to the master. Here, it is not possible
to wait for a master position, since the positions are changed by every
intermediate relay server. Instead, it is necessary to figure out another
way to wait for the transactions. The MASTER_POS_WAIT function is handy
for handling such waits, so if it were possible to use it here, it would
solve a lot of problems. There are
basically two alternatives that you can use to ensure you are not reading
stale data.
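For reference, in a simple topology the wait can be handled directly on the slave with MASTER_POS_WAIT. The following is a minimal sketch; the binlog file name, position, and timeout are illustrative values only, not taken from the examples below:
// Wait at most 10 seconds for the slave to reach the given master position.
// MASTER_POS_WAIT returns NULL if the slave is not running and -1 on timeout.
$result = $slave->query(
    "SELECT MASTER_POS_WAIT('master-bin.000012', 4711, 10) AS status");
$row = $result->fetch_assoc();
$result->close();
if ($row['status'] === NULL || $row['status'] < 0)
    error_log('Slave did not reach the position before the timeout');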
The first solution is to rely on the global transaction ID to handle slave promotions
and to poll the slave repeatedly until it has processed the
transaction.
The second solution, illustrated in Figure 1, connects to all the
relay servers in the path from the master to the final slave
to ensure the change propagates to the slave. It is necessary to connect
to each relay slave between the master and the slave, since it is not
possible to know which binlog position will be used on each of the relay
servers.
Both solutions have their merits, so let’s consider the advantages
and disadvantages of each of them.
If the slaves are normally up-to-date with respect to the master,
the first solution will perform a simple check of the final slave only and
will usually show that the transaction has been replicated to the slave
and that processing can proceed. If the transaction has not been processed
yet, it is likely that it will be processed before the next check, so the
second time the final slave is checked, it will show that the transaction
has reached the slave. If the checking period is small enough, the delay
will not be noticeable to the user,
so a typical consistency check will require one or two extra messages when
polling the final slave. This approach requires only the final slave to be
polled, not any of the intermediate slaves. This can be an advantage from
an administrative point of view as well, since it does not require keeping track
of the intermediate slaves and how they are connected.
On the other hand, if the slaves normally lag behind, or if the
replication lag varies a lot, the second approach is probably better. The
first solution will repeatedly poll the slave, and most of the time will
report that the transaction has not been committed on
the slave. You can handle this by increasing the polling period, but if
the polling period has to be so large that the response time is
unacceptable, the first solution will not work well. In this case, it is
better to use the second solution and wait for the changes to ripple down
the replication tree and then execute the query.
For a tree with N final slaves, the number of extra
requests will then be proportional to log N. For
instance, if you have 50 relay servers and each relay server handles 50
final slaves, you can handle all 2,500 slaves with exactly two extra
requests per commit: one to the relay slave and one to the final slave.
The disadvantages of the second approach are:
It requires the application code to have access to the relay
slaves so that it can connect to each relay slave in turn and wait
for the position to be reached.
It requires the application code to keep track of the replication
topology so that the relay servers can be queried.
Querying the relay slaves will slow them down, since they have to
handle more work, but in practice, this might turn out not to be a
problem. By introducing a caching database connection layer, you can avoid
some of the traffic. The caching layer will remember the binlog position
each time a wait is requested and will query the relay only if the
requested position is greater than the cached one. The following is a
rough stub for the caching function:
function wait_for_pos($server, $wait_for_pos) {
    if (cached position for $server > $wait_for_pos)
        return TRUE;
    else {
        code to wait for position and update cache
    }
}
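To make the idea slightly more concrete, here is one possible way to fill in the stub. It assumes positions can be compared as plain values, that each server object exposes a name property usable as a cache key, and that a hypothetical helper sync_with_position($server, $pos) performs the actual wait on the relay; none of these are part of the stub above:
$cached_pos = array();   // Highest position known to be applied, keyed by server name

function wait_for_pos($server, $wait_for_pos) {
    global $cached_pos;
    // If the cached position already covers the requested one, skip the round-trip.
    if (isset($cached_pos[$server->name]) && $cached_pos[$server->name] >= $wait_for_pos)
        return TRUE;
    // Otherwise wait on the relay and remember how far it has come.
    if (!sync_with_position($server, $wait_for_pos))   // Hypothetical helper
        return FALSE;
    $cached_pos[$server->name] = $wait_for_pos;
    return TRUE;
}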
Since the binlog positions are always increasing—once a binlog
position is passed it remains passed—there is no risk of returning an
incorrect result. The only way to know for sure which technique is more
efficient is to monitor and profile the deployment to make sure queries
are executed fast enough for the application.
Example 1 shows
sample code to handle the first solution—querying the slave repeatedly to
see whether the transaction has been executed.
Example 1. PHP code for avoiding reading stale data using polling
function fetch_trans_id($server) {
    $result = $server->query('SELECT server_id, trans_id FROM Last_Exec_Trans');
    if ($result == NULL)
        return NULL;   // Execution failed
    $row = $result->fetch_assoc();
    if ($row == NULL)
        return NULL;   // Empty table !?
    $gid = array($row['server_id'], $row['trans_id']);
    $result->close();
    return $gid;
}

function wait_for_trans_id($server, $server_id, $trans_id) {
    if ($server_id == NULL || $trans_id == NULL)
        return TRUE;   // No transactions executed, trivially in sync
    $server->autocommit(TRUE);
    $gid = fetch_trans_id($server);
    if ($gid == NULL)
        return FALSE;
    list($current_server_id, $current_trans_id) = $gid;
    while ($current_server_id != $server_id || $current_trans_id < $trans_id) {
        usleep(500000);   // Wait half a second
        $gid = fetch_trans_id($server);
        if ($gid == NULL)
            return FALSE;
        list($current_server_id, $current_trans_id) = $gid;
    }
    return TRUE;
}

function commit_and_sync($master, $slave) {
    if ($master->commit()) {
        $gid = fetch_trans_id($master);
        if ($gid == NULL)
            return NULL;
        if (!wait_for_trans_id($slave, $gid[0], $gid[1]))
            return NULL;
        return TRUE;
    }
    return FALSE;
}

function start_trans($server) {
    $server->autocommit(FALSE);
}
The difference from the earlier functions is that the ones in Example 1
internally call fetch_trans_id and wait_for_trans_id instead of
fetch_master_pos and wait_for_pos. Some points worth noting in the
code:
We turn on autocommit in wait_for_trans_id before starting to query
the slave. This is necessary because if the isolation level is
repeatable read or stricter, the
SELECT will find the same global transaction ID every time. By turning
on autocommit, each SELECT is committed as a separate transaction and
therefore sees the most recently committed values. An alternative is to
use the read committed isolation level.
To avoid unnecessary sleeps in wait_for_trans_id, we fetch the global transaction ID and check it once before entering
the loop.
This code requires access only to the master and slave, not to
the intermediate relay servers.
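As a hypothetical usage sketch (the table, column, and statements are placeholders, not part of Example 1), a write followed by a consistent read from the slave could look like this:
// $master and $slave are assumed to be connection objects like those used above.
start_trans($master);
$master->query("UPDATE message_board SET message = 'Hello' WHERE id = 4711");
if (commit_and_sync($master, $slave))
    $result = $slave->query('SELECT message FROM message_board WHERE id = 4711');
else
    error_log('Commit or synchronization failed; read from the master instead');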
Example 2 includes
code for ensuring you do not read stale data. It uses the technique of
querying all servers between the master and the final slave. This method
proceeds by first finding the entire chain of servers between the final
slave and the master, and then synchronizing each in turn all the way down
the chain until the transaction reaches the final slave.
Example 2. PHP code for avoiding reading stale data using waiting
function fetch_relay_chain($master, $final) {
    $servers = array();
    $server = $final;
    while ($server !== $master) {
        $servers[] = $server;
        $server = get_master_for($server);
    }
    $servers[] = $master;
    return $servers;
}

function commit_and_sync($master, $slave) {
    if ($master->commit()) {
        $server = fetch_relay_chain($master, $slave);
        for ($i = sizeof($server) - 1; $i > 0; --$i) {
            if (!sync_with_master($server[$i], $server[$i-1]))
                return NULL;   // Synchronization failed
        }
        return TRUE;
    }
    return FALSE;
}

function start_trans($server) {
    $server->autocommit(FALSE);
}
To find all the servers between the master and the slave, we use the
function fetch_relay_chain. It
starts from the slave and uses the function get_master_for to get the master for a slave. We
have deliberately not included the code for this function, since it does
not add anything to our current discussion. However, this function has to
be defined for the code to work.
After the relay chain is fetched, the code synchronizes each slave with
its master all the way down the chain, ending with the final slave.
Note:
One way to fetch the master for a server is to use SHOW SLAVE STATUS and read the Master_Host and Master_Port fields. If you do this for each
transaction you are about to commit, however, the system will be very
slow.
Since the topology rarely changes, it is better to cache the
information on the application servers, or somewhere else, to avoid
excessive traffic to the database servers.
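Although Example 2 deliberately leaves get_master_for undefined, a rough sketch along the lines of this note might look as follows. The connect_to helper, the name property used as a cache key, and the cache structure are assumptions for illustration only:
$master_cache = array();   // Cached master connection for each server, keyed by server name

function get_master_for($server) {
    global $master_cache;
    if (isset($master_cache[$server->name]))
        return $master_cache[$server->name];   // The topology rarely changes, so reuse it
    $result = $server->query('SHOW SLAVE STATUS');
    $row = $result->fetch_assoc();
    $result->close();
    // connect_to() is a hypothetical helper that opens a connection to the given host and port.
    $master = connect_to($row['Master_Host'], $row['Master_Port']);
    $master_cache[$server->name] = $master;
    return $master;
}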
The
master is a critical component of a deployment and is likely to be a more
powerful machine than the slaves, so you should restore it to the master
position when bringing it back. Since the master stopped unexpectedly, it
is very likely to be out of sync with the rest of the deployment. This can
happen in two ways:
If the master has been offline for more than just a short time,
the rest of the system will have committed many transactions that the
master is not aware of. In a sense, the master is in an
alternative future compared to the rest of the
system. An illustration of this situation is shown in Figure 2.
If the master committed a transaction and wrote it to the binary
log, then crashed just after it acknowledged the transaction, the
transaction may not have made it to the slaves. This means the master
has one or more transactions that have not been seen by the slaves,
nor by any other part of the system.
If the original master is not too far behind the current master, the
easiest solution to the first problem is to connect the original master as
a slave to the current master, and then switch over all slaves to the
original master once it has caught up. If, however, the original master
has been offline for a significant period, it is likely to be faster to
clone one of the slaves to the original master and then switch over all
the slaves to it.
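For reference, attaching the original master as a slave of the current master is an ordinary CHANGE MASTER TO followed by START SLAVE. The host, credentials, and binlog coordinates below are placeholders; in practice the coordinates have to be determined as part of the recovery procedure:
// Illustrative values only; run on the recovered original master.
$original_master->query("
    CHANGE MASTER TO
        MASTER_HOST = 'current-master.example.com',
        MASTER_PORT = 3306,
        MASTER_USER = 'repl_user',
        MASTER_PASSWORD = 'xyzzy',
        MASTER_LOG_FILE = 'master-bin.000042',
        MASTER_LOG_POS = 4711");
$original_master->query('START SLAVE');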
If the master is in an alternative future, it is not likely that its
extra transactions should be brought into the deployment. Why? Because the
sudden appearance of a new transaction is likely to conflict with existing
transactions in subtle ways. For example, if the transaction is a message
in a message board, it is likely that a user has already reposted the
message. If a message written earlier but reported as missing—because the
master crashed before the message was sent to a slave—suddenly reappears,
it will befuddle the users and definitely be considered an annoyance. In a
similar manner, users will not look kindly on shopping carts suddenly
having items added because the master was brought back into the
system.
In short, you can solve both of the out-of-sync problems—the master
in an alternative future and the master that needs to catch up—by simply
cloning a slave to the original master and then switching over each of the
current slaves in turn to the original master.
These problems, however, highlight how important it is to ensure
consistency by checking that changes to a master are available on some
other system before reporting the transaction as complete, in the event
that the master should crash. From a recovery perspective, it is not
necessary for every server in the deployment to have the change:
it is sufficient to ensure the transaction is available on at least one
other machine, for example on one of the slaves or relay servers connected
to the master. In general, you can tolerate n−1
failures if you have the change available on n servers.