Service Broker is frequently mentioned as an
excellent choice for helping to scale out database services. One of the
more compelling use cases is a Service Broker service that can be used
to asynchronously request data from a remote system. In such a case, a
request message would be sent to the remote data service from a local
stored procedure, which could do some other work while waiting for the
response—the requested data—to come back.
There are many ways to
architect such a system, and given that Service Broker allows messages
to be sent as either binary or XML, I wondered which format would
provide the best combination of overall performance and code reuse.
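To make the scenario a bit more concrete, the request side of such a conversation might look something like the following sketch. All of the service, contract, and message type names shown here are placeholders; the actual Service Broker objects would be defined as part of the system's setup:
DECLARE @handle UNIQUEIDENTIFIER;

BEGIN DIALOG CONVERSATION @handle
    FROM SERVICE [//DataRequestorService]
    TO SERVICE '//RemoteDataService'
    ON CONTRACT [//DataRequestContract]
    WITH ENCRYPTION = OFF;

--Send the request; the response (the requested data) arrives
--asynchronously on the initiator's queue and is picked up with RECEIVE
SEND ON CONVERSATION @handle
    MESSAGE TYPE [//EmployeeListRequest]
    (N'SELECT * FROM HumanResources.Employee');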
I started working with the AdventureWorks.HumanResources.Employee
table as a sample data set, imagining a remote data service requesting a
list of employees along with their attributes. After some
experimentation, I determined that the FOR XML RAW option is the easiest way to serialize a table in XML format, and I used the ROOT option to make the XML valid:
SELECT *
FROM HumanResources.Employee
FOR XML RAW, ROOT('Employees')
XML is, of course,
known to be an extremely verbose data interchange format, and I was not
surprised to discover that the data size of the resultant XML is 116KB,
despite the fact that the HumanResources.Employee
table itself has only 56KB of data. I experimented with setting shorter
column names, but it had very little effect on the size and created
what I feel to be unmaintainable code.
My first performance test, the results of which are shown in Figure 1,
was not especially promising: simply serializing the results was taking
over 3 seconds per iteration. After some trial and error, I discovered
that adding the TYPE directive hugely improved performance, bringing the average time per iteration down by more than 50%, as shown in Figure 2.
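For reference, the serialization query with the TYPE directive added looks like the following. Assigning the result to an xml variable, here called @p, is one convenient way to hold onto the document; the deserialization code shown a bit later reads from that same variable:
DECLARE @p XML

SET @p =
(
    SELECT *
    FROM HumanResources.Employee
    FOR XML RAW, ROOT('Employees'), TYPE
)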
I was quite pleased
with these results until I decided to test deserialization. The first
problem was the code required to deserialize the XML back into a table.
In order to get back the same table I started with, I had to explicitly
define every column for the result set; this made the code quite a bit
more complex than I'd hoped for:
SELECT
    col.value('@EmployeeID', 'int') AS EmployeeID,
    col.value('@NationalIDNumber', 'nvarchar(15)') AS NationalIDNumber,
    col.value('@ContactID', 'int') AS ContactID,
    col.value('@LoginID', 'nvarchar(256)') AS LoginID,
    col.value('@ManagerID', 'int') AS ManagerID,
    col.value('@Title', 'nvarchar(50)') AS Title,
    col.value('@BirthDate', 'datetime') AS BirthDate,
    col.value('@MaritalStatus', 'nchar(1)') AS MaritalStatus,
    col.value('@Gender', 'nchar(1)') AS Gender,
    col.value('@HireDate', 'datetime') AS HireDate,
    col.value('@SalariedFlag', 'bit') AS SalariedFlag,
    col.value('@VacationHours', 'smallint') AS VacationHours,
    col.value('@SickLeaveHours', 'smallint') AS SickLeaveHours,
    col.value('@CurrentFlag', 'bit') AS CurrentFlag,
    col.value('@rowguid', 'uniqueidentifier') AS rowguid,
    col.value('@ModifiedDate', 'datetime') AS ModifiedDate
FROM @p.nodes('/Employees/row') AS p (col)
The next problem was performance. As shown in Figure 3, when I tested deserializing the XML, performance went from pretty good to downright abysmal.
I decided to investigate
SQLCLR options for solving the problem, focusing on both reuse
potential and performance. My first thought was to return binary
serialized DataTables, and in order to
make that happen, I needed a way to return binary-formatted data from my
CLR routines. This of course called for .NET's BinaryFormatter class, so I created a class called serialization_helper. The following code was cataloged in an EXTERNAL_ACCESS assembly; the elevated permission set is needed so that the class can assert the SerializationFormatter permission, which is not granted to SAFE assemblies:
using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
using System.Security.Permissions;
using System.Runtime.Serialization.Formatters.Binary;
public partial class serialization_helper
{
    public static byte[] getBytes(object o)
    {
        //Assert the SerializationFormatter permission so that
        //callers cataloged as SAFE can use this method
        SecurityPermission sp =
            new SecurityPermission(
                SecurityPermissionFlag.SerializationFormatter);
        sp.Assert();

        //Serialize the object to a MemoryStream and return
        //the stream's contents as a byte array
        BinaryFormatter bf = new BinaryFormatter();

        using (System.IO.MemoryStream ms =
            new System.IO.MemoryStream())
        {
            bf.Serialize(ms, o);
            return (ms.ToArray());
        }
    }

    public static object getObject(byte[] theBytes)
    {
        //Wrap the bytes in a read-only stream and defer to
        //the Stream-based overload
        using (System.IO.MemoryStream ms =
            new System.IO.MemoryStream(theBytes, false))
        {
            return (getObject(ms));
        }
    }

    public static object getObject(System.IO.Stream s)
    {
        SecurityPermission sp =
            new SecurityPermission(
                SecurityPermissionFlag.SerializationFormatter);
        sp.Assert();

        BinaryFormatter bf = new BinaryFormatter();

        return (bf.Deserialize(s));
    }
};
Use of this class is fairly straightforward: to serialize an object, pass it into the getBytes method. This method first uses an assertion to allow SAFE callers to use it, and then uses the binary formatter to serialize the object to a Stream. The stream is then returned as a collection of bytes. Deserialization can be done using either overload of the getObject method. I found that depending on the scenario, I might have ready access to either a Stream
or a collection of bytes, so creating both overloads made sense instead
of duplicating code to produce one from the other. Deserialization also
uses an assertion before running, in order to allow calling code to be
cataloged as SAFE.
My first shot at getting the data was to simply load the input set into a DataTable and run it through the serialization_helper methods. The following code implements a UDF called GetDataTable_Binary, which uses this logic:
[Microsoft.SqlServer.Server.SqlFunction(
    DataAccess = DataAccessKind.Read)]
public static SqlBytes GetDataTable_Binary(string query)
{
    SqlConnection conn =
        new SqlConnection("context connection = true;");

    SqlCommand comm = new SqlCommand();
    comm.Connection = conn;
    comm.CommandText = query;

    SqlDataAdapter da = new SqlDataAdapter();
    da.SelectCommand = comm;

    DataTable dt = new DataTable();
    da.Fill(dt);

    //Serialize and return the output
    return new SqlBytes(
        serialization_helper.getBytes(dt));
}
This method is used by
passing in a query for the table that you'd like to get back in binary
serialized form, as in the following example:
USE AdventureWorks
GO

DECLARE @sql NVARCHAR(4000)
SET @sql = 'SELECT * FROM HumanResources.Employee'

DECLARE @p VARBINARY(MAX)
SET @p = dbo.GetDataTable_Binary(@sql)
While I'd achieved the
reuse potential I hoped for—this function can be used for any number of
queries—I was disappointed to find that the output data size had
ballooned to 232KB. Things looked even worse when I ran a performance
test and serialization speed turned out to be dismal at best, as shown
in Figure 4.
The main problem, as it turned out, was the default serialization behavior of the DataTable. Even when using the BinaryFormatter, a DataTable serializes itself first to XML and only then to binary—double the work I had expected. To fix this, set the DataTable's RemotingFormat property to SerializationFormat.Binary before serializing it:
dt.RemotingFormat = SerializationFormat.Binary;
Making this change resulted in much better performance, as illustrated by the test results shown in Figure 5.
I still felt that I
could do better, and after several more attempts that I won't bore you
with the details of, I decided to forgo the DataTable altogether and focus on a class that I've found historically to be much faster: SqlDataReader.
I worked on pulling the data out into object collections, and initial
tests that I ran showed the data size to be much closer to what I
expected. In addition to size improvements, serialization performance
turned out to be far better than that of the DataTable (but not as good as XML serialization with the TYPE directive).
The advantage of a DataTable
is that it's one easy-to-use unit that contains all of the data, as
well as the metadata. You don't have to be concerned with column names,
types, and sizes, as everything is automatically loaded into the DataTable for you. Working with a SqlDataReader requires a bit more work, since it can't be serialized as a single unit, but must instead be split up into its component parts.
Since the code I implemented is somewhat complex, I will walk you through it section by section. To begin with, the SqlFunctionAttribute's DataAccess property is set to DataAccessKind.Read, in order to allow the method to access data via the context connection. A generic List is instantiated to hold one object collection per row of data, in addition to one for the metadata. Finally, the SqlConnection is instantiated, and the SqlCommand is set up and executed:
[Microsoft.SqlServer.Server.SqlFunction(
    DataAccess = DataAccessKind.Read)]
public static SqlBytes GetBinaryFromQueryResult(string query)
{
    List<object[]> theList = new List<object[]>();

    using (SqlConnection conn =
        new SqlConnection("context connection = true;"))
    {
        SqlCommand comm = new SqlCommand();
        comm.Connection = conn;
        comm.CommandText = query;

        conn.Open();

        SqlDataReader read = comm.ExecuteReader();
The next step is to pull the metadata for each column out of the SqlDataReader. A method called GetSchemaTable is used to return a DataTable
populated with one row per column. The available fields are documented
in the MSDN Library, but I'm using the most common of them in the code
that follows. After populating the object collection with the metadata,
it is added to the output List:
        DataTable dt = read.GetSchemaTable();

        //Populate the field list from the schema table
        object[] fields = new object[dt.Rows.Count];

        for (int i = 0; i < fields.Length; i++)
        {
            object[] field = new object[5];
            field[0] = dt.Rows[i]["ColumnName"];
            field[1] = dt.Rows[i]["ProviderType"];
            field[2] = dt.Rows[i]["ColumnSize"];
            field[3] = dt.Rows[i]["NumericPrecision"];
            field[4] = dt.Rows[i]["NumericScale"];

            fields[i] = field;
        }

        //Add the collection of fields to the output list
        theList.Add(fields);
Finally, the code loops over the rows returned by the query, using the GetValues method to pull each row out into an object collection that is added to the output. The List is converted into an array of object[] (object[][], to be more precise), which is serialized and returned to the caller.
        //Add all of the rows to the output list
        while (read.Read())
        {
            object[] o = new object[read.FieldCount];
            read.GetValues(o);

            theList.Add(o);
        }
    }

    //Serialize and return the output
    return new SqlBytes(
        serialization_helper.getBytes(theList.ToArray()));
}
Once this function is created, calling it is almost identical to calling GetDataTable_Binary:
USE AdventureWorks
GO

DECLARE @sql NVARCHAR(4000)
SET @sql = 'SELECT * FROM HumanResources.Employee'

DECLARE @p VARBINARY(MAX)
SET @p = dbo.GetBinaryFromQueryResult(@sql)
The result: 57KB worth of binary data—quite an improvement over both the XML and DataTable
methods. When using this technique to transfer data between Service
Broker instances on remote servers, the decrease in network traffic can
make a big difference. The serialization performance test, the results of which are
shown in Figure 6, showed that performance is vastly improved over the DataTable attempt, while not as good as XML serialization in conjunction with the TYPE directive.
Pleased with these results,
I decided to go ahead with deserialization. Continuing my emphasis on
reuse potential, I decided that a stored procedure would be a better
choice than a UDF. A stored procedure does not have a fixed output
contract the way a UDF does, so any input table can be deserialized and
returned without worrying about violating a declared column list.
The first part of the stored procedure follows:
[Microsoft.SqlServer.Server.SqlProcedure]
public static void GetTableFromBinary(SqlBytes theTable)
{
    //Deserialize the input
    object[] dt = (object[])(
        serialization_helper.getObject(theTable.Value));

    //First, get the fields
    object[] fields = (object[])(dt[0]);
    SqlMetaData[] cols = new SqlMetaData[fields.Length];

    //Loop over the fields and populate SqlMetaData objects
    for (int i = 0; i < fields.Length; i++)
    {
        object[] field = (object[])(fields[i]);
        SqlDbType dbType = (SqlDbType)field[1];
After deserializing the
input bytes back into a collection of objects, the first item in the
collection—which is assumed to be the column metadata—is converted into a
collection of objects. This collection is looped over item-by-item in
order to create the output SqlMetaData objects that will be used to stream back the data to the caller.
The trickiest part of setting this up is the fact that each SQL Server data type requires a different SqlMetaData overload. DECIMAL
needs a precision and scale setting; character and binary types need a
size; and for other types, size, precision, and scale are all
inappropriate inputs. The following switch statement handles creation of the SqlMetaData instances:
        //Different SqlMetaData overloads are required
        //depending on the data type
        switch (dbType)
        {
            case SqlDbType.Decimal:
                cols[i] = new SqlMetaData(
                    (string)field[0],
                    dbType,
                    (byte)field[3],
                    (byte)field[4]);
                break;

            case SqlDbType.Binary:
            case SqlDbType.Char:
            case SqlDbType.NChar:
            case SqlDbType.NVarChar:
            case SqlDbType.VarBinary:
            case SqlDbType.VarChar:
                switch ((int)field[2])
                {
                    //If it's a MAX type, use -1 as the size
                    case 2147483647:
                        cols[i] = new SqlMetaData(
                            (string)field[0],
                            dbType,
                            -1);
                        break;

                    default:
                        cols[i] = new SqlMetaData(
                            (string)field[0],
                            dbType,
                            (long)((int)field[2]));
                        break;
                }
                break;

            default:
                cols[i] = new SqlMetaData(
                    (string)field[0],
                    dbType);
                break;
        }
    }
Once population of the columns collection has been completed, the data can be sent back to the caller using the SqlPipe class's SendResults methods. After starting the stream, the remainder of the objects in the input collection are looped over, cast to object[], and sent back as SqlDataRecords:
    //Start the result stream
    SqlDataRecord rec = new SqlDataRecord(cols);
    SqlContext.Pipe.SendResultsStart(rec);

    for (int i = 1; i < dt.Length; i++)
    {
        rec.SetValues((object[])dt[i]);
        SqlContext.Pipe.SendResultsRow(rec);
    }

    //End the result stream
    SqlContext.Pipe.SendResultsEnd();
}
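Calling the procedure is just as simple as calling the functions. For instance, assuming both CLR modules are cataloged in the AdventureWorks database, the output of GetBinaryFromQueryResult can be fed straight back in for a quick round-trip test:
USE AdventureWorks
GO

DECLARE @p VARBINARY(MAX)
SET @p = dbo.GetBinaryFromQueryResult(
    'SELECT * FROM HumanResources.Employee')

--Deserialize the binary data and stream the rows back to the caller
EXEC dbo.GetTableFromBinary @p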
Although the serialization
test had not yielded spectacular results, it turns out that
deserialization of data prepared in this manner is exceptionally fast
compared with the alternatives. The performance test, the results of
which are shown in Figure 7, revealed that deserialization of the SqlDataReader
data is almost an order of magnitude faster than deserialization of
similar XML. Although the serialization is slightly slower, I feel that
the combination of better network utilization and much faster
deserialization makes this a great technique for transferring tabular
data between Service Broker instances in scale-out and distributed
processing scenarios.