1. Scaling Out Reads, Not Writes
It is important to understand that scaling out in this manner scales out reads,
not writes. Each new slave has to handle the same write load as the
master. The average load of the system can be described as:
So if you have a single server with a total capacity of 10,000
transactions per second, and there is a write load of 4,000 transactions
per second on the master, while there is a read load of 6,000 transactions
per second, the result will be:
Now, if you add three slaves to the master, the total capacity
increases to 40,000 transactions per second. Because the write queries are
replicated as well, each query is executed a total of four times—once on
the master and once on each of the three slaves—which means that each
slave has to handle 4,000 transactions per second in write load. The total
read load does not increase because it is distributed over the slaves.
This means that the average load now is:
Notice that in the formula, the capacity is increased by a factor of
4, since we now have a total of four servers, and replication causes the
write load to increase by a factor of 4 as well.
It is quite common to forget that replication forwards to each slave
all the write queries that the master handles. So you cannot use this
simple approach to scale writes, only reads.
2. The Value of Asynchronous Replication
MySQL replication is asynchronous, a type of
replication particularly suitable for modern applications such as
websites.
To handle a large number of reads, sites use replication to create
copies of the master and then let the slaves handle all read requests
while the master handles the write requests. This replication is
considered asynchronous because the master does not wait for the slaves to
apply the changes, but instead just dispatches each change request to the
slaves and assumes they will catch up eventually and replicate all the
changes. This technique for improving performance is usually a good idea
when you are scaling out.
In contrast, synchronous replication keeps the master and slaves in sync and does not allow a transaction to be
committed on the master unless the slave agrees to commit it as well. That
is, synchronous replication makes the master wait for all the slaves to
keep up with the writes.
Asynchronous replication is a lot faster than synchronous
replication, for reasons our description should make obvious. Compared to
asynchronous replication, synchronous replication requires extra
synchronizations to guarantee consistency. It is usually implemented
through a protocol called two-phase commit, which guarantees
consistency between the master and slaves, but requires extra messages to ping-pong
between them. Typically, it works like this:
When a commit statement is executed, the transaction is sent to the slaves and the slave is asked
to prepare for a commit.
Each slave prepares the transaction so that it can be committed,
and then sends an OK (or ABORT) message to the master, indicating that
the transaction is prepared (or that it could not be prepared).
The master waits for all slaves to send either an OK or an ABORT
message.
If the master receives an OK message from all slaves, it
sends a commit message to all slaves asking them to commit the
transaction.
If the master receives an ABORT message from any of the
slaves, it sends an abort message to all slaves asking them to
abort the transaction.
Each slave is then waiting for either an OK or an ABORT message
from the master.
If the slaves receive the commit request, they commit the
transaction and send an acknowledgment to the master
that the transaction is committed.
If the slaves receive an abort request, they abort the
transaction by undoing any changes and releasing any resources
they held, then send an acknowledgment to the master that the
transaction was aborted.
When the master has received acknowledgments from all slaves, it
reports the transaction as committed (or aborted) and continues with
processing the next transaction.
What makes this protocol slow is that it requires a total of four
messages, including the messages with the transaction and the prepare
request. The major problem is not the amount of network traffic required
to handle the synchronization, but the latency introduced by the network
and by processing the commit on the slave, together with the fact that the
commit is blocked on the master until all the slaves have acknowledged the
transaction. In contrast, asynchronous replication requires only a single
message to be sent with the transaction. As a bonus, the master does not
have to wait for the slave, but can report the transaction as committed
immediately, which improves performance significantly.
So why is it a problem that synchronous replication blocks each
commit while the slaves process it? If the slaves are close to the master
on the network, the extra messages needed by synchronous replication make
little difference, but if the slaves are not nearby—maybe in another town
or even on another continent—it makes a big difference.
Table 1
shows some examples for a server that can commit 10,000
transactions per second. This translates to a commit time of 0.1 ms (but
note that some implementations, such as MySQL Cluster, are able to process several commits in
parallel if they are independent). If the network latency is 0.01 ms (a
number we’ve chosen as a baseline by pinging one of our own computers),
the transaction commit time increases to 0.14 ms, which translates to
approximately 7000 transactions per second. If the network latency is 10
ms (which we found by pinging a server in a nearby city), the transaction
commit time increases to 40.1 ms, which translates to about 25
transactions per second! In contrast, asynchronous replication introduces
no delay at all, because the transactions are reported as committed
immediately, so the transaction commit time stays at the original 10,000
per second, just as if there were no slaves.
Table 1. Typical slowdowns caused by synchronous replication
Latency
(ms) | Transaction commit time
(ms) | Equivalent transactions
per second | Example
case |
---|
0.01 | 0.14 | ~7,100 | Same
computer |
0.1 | 0.5 | ~2,000 | Small LAN |
1 | 4.1 | ~240 | Bigger LAN |
10 | 40.1 | ~25 | Metropolitan
network |
100 | 400.1 | ~2 | Satellite |
The performance of asynchronous replication comes at the price of
consistency. Recall that in asynchronous replication the
transaction is reported as committed immediately,
without waiting for any acknowledgment from the
slave. This means the master may consider the transaction committed when
the slave does not. As a matter of fact, it might not even have left the
master, but is still waiting to be sent to the slave.
There are two problems with this that you need to be aware
of:
In the event of crashes on the master, transactions can
“disappear.”
A query executed on the slaves might return old data.