In the simple scale-out deployment—like the one described thus far—all
slaves receive all data and can therefore handle any kind of query. In
practice, however, requests are rarely distributed evenly over the
different parts of the data. Instead, some data usually needs to be
accessed very frequently while other data is rarely accessed. For
example, on an e-commerce site:
- The product catalog is browsed almost all the time.
- Data about items in stock may not be requested very often.
- User data is not requested very often, since most of the critical
  information is recorded using session-specific information stored in
  the browser as cookies. On the other hand, if cookies are disabled,
  the session data will be requested from the server with almost every
  page request.
- Newly added items are usually accessed more frequently than old
  items; for example, “special offers” might be accessed more
  frequently than other items.
It would clearly be a waste of resources to keep the rarely accessed
data on each and every slave just in case it is requested. It would be
much better to use the deployment shown in Figure 1, where a few servers
are dedicated to keeping rarely accessed data, while a different set of
servers are dedicated to keeping data that is accessed frequently.
To do this, it is necessary to separate tables when replicating.
MySQL can do this either by filtering the events as they leave the
master or by filtering the events as they arrive at the slave.
1. Filtering Replication Events
The two ways of filtering events are called master filters, when the
events are filtered on the master, and slave filters, when the events
are filtered on the slave. Master filters control what goes into the
binary log and therefore what is sent to the slaves, while slave
filters control what is executed on the slave. With master filters,
events for filtered-out tables are never stored in the binary log at
all; with slave filters, the events are stored in the binary log, sent
to the slave, and not filtered out until just before they are about to
be executed.
Each approach has a drawback:
- If master filters are used, the events are not stored in the binary
  log at all. This means that it is not possible to use point-in-time
  recovery (PITR) to recover the filtered-out databases properly. If
  the databases are stored in the backup image, they will still be
  restored when restoring the backup, but any changes made to tables
  in those databases after the backup was taken will not be recovered,
  since the changes are not in the binary log.
- If slave filters are used, all changes are sent over the network.
  This clearly wastes network bandwidth, especially over long-haul
  network connections.
1.1. Master filters
There are two configuration options for creating master
filters:
binlog-do-db=db
If the current database of the statement is db, the statement will be
written to the binary log; otherwise, the statement will be
discarded.
binlog-ignore-db=db
If the current database of the statement is db, the statement will be
discarded; otherwise, the statement will be written to the binary
log.
If you want to replicate everything except a few databases, use
binlog-ignore-db. If you want to
replicate just a few databases, use binlog-do-db. Combining them is
not recommended, since the logic for deciding
whether a database should be replicated is complicated. The options
do not accept lists of databases, so if you want to list several
databases, you have to repeat an option multiple times.
As an example, to replicate everything except the
top and secret databases,
add the following options to the configuration file:
[mysqld]
...
binlog-ignore-db = top
binlog-ignore-db = secret
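Note that these options filter on the current database of the
statement, not on the database of the table being changed, so under
statement-based replication the result depends on which database has
been selected with USE. The following session sketch, assuming the
binlog-ignore-db options above and statement-based replication (the
table names are hypothetical), illustrates the difference:

USE secret;
INSERT INTO codes VALUES (1);        -- current database is secret: not logged
UPDATE app.users SET name = 'x';     -- current database is still secret:
                                     -- not logged either!
USE app;
INSERT INTO secret.codes VALUES (2); -- current database is app: logged, even
                                     -- though it changes a table in secret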
Warning:
Using binlog-*-db options to filter events means that changes to the
two databases will not be stored in the binary log at all, and hence
cannot be recovered using PITR in the event of a crash. For that
reason, it is strongly recommended
that you use slave filters, not master filters, when you want to
filter the replication stream. You should use master filters only
for data that can be considered volatile and that you can afford to
lose.
1.2. Slave filters
Slave filtering offers a longer list of options. In addition to being
able to filter the events based on the database, slave filters can
also filter individual tables and even groups of table names by using
wildcards.
In the following list of rules, the replicate-wild rules look at the full name
of the table, including both the database and table name. The pattern
supplied to the option uses the same patterns as the LIKE string comparison function—that is, an
underscore (_) matches a single character, whereas a percent sign (%) matches a string of any length. Note,
however, that the pattern must contain a period to be legitimate. This
means that the database name and table name are matched individually,
so each wildcard applies only to the database name or table
name.
replicate-do-db=db
If the current database of the statement is db, execute the
statement.
replicate-ignore-db=db
If the current database of the statement is db, discard the
statement.
replicate-do-table=table and replicate-wild-do-table=db_pattern.tbl_pattern
If the name of the table being updated is table or matches the
pattern, execute updates to the table.
replicate-ignore-table=table and replicate-wild-ignore-table=db_pattern.tbl_pattern
If the name of the table being updated is table or matches the
pattern, discard updates to the table.
These filtering rules are evaluated on the slave just before the
server decides whether to execute an event, so all events are sent
over the network to the slave before being filtered.
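As an illustration of the wildcard rules (the database and table
names here are hypothetical), the following option discards updates
to any table whose name starts with tmp_, in any database. The
backslash escapes the underscore so that it matches a literal
underscore instead of an arbitrary single character:

[mysqld]
...
replicate-wild-ignore-table=%.tmp\_%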
2. Using Filtering to Partition Events to Slaves
So what are the benefits and drawbacks of filtering on the master versus
filtering on the slave? At first glance, it might seem like a good
idea to structure the databases so that it is possible to filter events
on the master using the binlog-*-db
options instead of using the replicate-*-db options. That way, the network
is not laden with a lot of useless events that will be removed by the
slave anyway. However, there are
problems associated with filtering on the master:
- Since the events are filtered from the binary log and there is only
  a single binary log, it is not possible to “split” the changes and
  send different parts of the database to different servers.
- The binary log is also used for PITR, so if there are any problems
  with the server, it will not be possible to restore everything.
- If, for some reason, it becomes necessary to split the data
  differently, it will no longer be possible, since the binary log has
  already been filtered and cannot be “unfiltered.”
It would be ideal if the filtering could be applied to the events
sent from the master rather than to the events written to the binary
log. It
would also be good if the filtering could be controlled by the slave so
that the slave could decide which data to replicate. As of MySQL version
5.1, this is not possible, and instead, it is necessary to filter events
using the replicate-* options—that
is, to filter the events on the slave.
Note:
There are ongoing discussions among the replication team about
implementing an advanced filtering feature that will allow filtering
of events at various points in the processing of the events, as well
as complex filtering logic.
At the time of this writing, there has been no decision on which
version of the server will implement this advanced filtering.
As an example, to dedicate a slave to the user data stored in the
two tables users and profiles
in the app database, shut down the server and add
the following filtering options to the my.cnf file:
[mysqld]
...
replicate-wild-do-table=app.users
replicate-wild-do-table=app.profiles
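After restarting the server, you can check that the filters are
active by inspecting the replication status: the
Replicate_Wild_Do_Table field of SHOW SLAVE STATUS lists the patterns
in effect. The output might look roughly like this (abbreviated):

SHOW SLAVE STATUS\G
*************************** 1. row ***************************
                        ...
    Replicate_Wild_Do_Table: app.users,app.profiles
                        ...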
If you are concerned about network traffic—which could be
significant if you replicate over long-haul networks—you can set up a
relay slave on the same machine as the master, as shown in Figure 2 (or on the same
network segment as the master), whose only purpose is to produce a
filtered version of the master’s binary log.
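A minimal sketch of such a relay configuration, reusing the
app.users and app.profiles filters from above: the log-slave-updates
option makes the relay write the changes it replicates to its own
binary log, so a downstream slave reading from the relay sees only
the filtered events.

[mysqld]
...
# The relay must have a binary log of its own and must write
# replicated changes to it for downstream slaves to read.
log-bin
log-slave-updates
# Pass through only the user data.
replicate-wild-do-table=app.users
replicate-wild-do-table=app.profiles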