By far the most interesting SQLCLR feature is the
ability to create custom aggregates. Each of the other SQLCLR features,
with the possible exception of triggers, will see more use in production
applications than will aggregates, but aggregates and types are the
only members of the group that can help developers do things that simply
were not possible before. And unlike types, for which there was some
limited support in previous versions of SQL Server, aggregates bring
something totally new to the table. For that reason, it's a feature that
gets quite a bit of attention.
Unfortunately, the story is
not so great when it comes to actually using user-defined aggregates (UDAs). Developers
experimenting with them for the first time are often extremely
disappointed to discover the maximum size limitation imposed by the
engine. To illustrate this, consider a simple aggregate designed to
concatenate an input set of strings. I'll walk through it section by
section in order to describe how it works.
To begin with, the SqlUserDefinedAggregateAttribute is used. Because strings are used internally, the format must be set to UserDefined. The MaxByteSize property is set to 8000, the maximum allowed:
[Serializable]
[Microsoft.SqlServer.Server.SqlUserDefinedAggregate(
    Format.UserDefined, MaxByteSize=8000)]
public struct string_concat : IBinarySerialize
{
A generic List is used to hold strings as they're sent to the aggregate, and the List is instantiated in the Init method:
    private List<string> theStrings;

    public void Init()
    {
        theStrings = new List<string>();
    }
The Accumulate method checks whether the input is NULL, and if not adds it to the List:
    public void Accumulate(SqlString Value)
    {
        if (!(Value.IsNull))
            theStrings.Add(Value.Value);
    }
The Merge method pulls all data out of the input group's collection, adding it to the local collection:
    public void Merge(string_concat Group)
    {
        foreach (string theString in Group.theStrings)
            this.theStrings.Add(theString);
    }
The Terminate method converts the List to an array, and then uses the Join method to delimit the elements in the array with commas:
    public SqlString Terminate()
    {
        string[] allStrings = theStrings.ToArray();
        string final = String.Join(",", allStrings);
        return new SqlString(final);
    }
The final two methods are Read and Write,
used for serialization and deserialization of the aggregate.
Serialization occurs at two points during an aggregate's lifetime:
before calling Merge and before calling Terminate. Any time the data size of the serialized instance is greater than the MaxByteSize specified in the SqlUserDefinedAggregateAttribute, an exception will be thrown and aggregation will stop. The following implementation of the Read and Write methods works by first serializing the number of members in the List, and then serializing each member:
    #region IBinarySerialize Members

    public void Read(System.IO.BinaryReader r)
    {
        int count = r.ReadInt32();
        this.theStrings = new List<string>(count);
        for (; count > 0; count--)
        {
            theStrings.Add(r.ReadString());
        }
    }

    public void Write(System.IO.BinaryWriter w)
    {
        w.Write(theStrings.Count);
        foreach (string s in theStrings)
            w.Write(s);
    }

    #endregion
}
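Before the aggregate can be called, the compiled assembly must be catalogued in the database and the aggregate created against it. The following T-SQL is a sketch of that deployment; the assembly name and file path here are illustrative, so adjust them to match your build output:

CREATE ASSEMBLY StringAggregates
FROM 'C:\Assemblies\StringAggregates.dll'
WITH PERMISSION_SET = SAFE;
GO

CREATE AGGREGATE dbo.String_Concat (@Input NVARCHAR(4000))
RETURNS NVARCHAR(4000)
EXTERNAL NAME StringAggregates.string_concat;
GO

Visual Studio deployment automates these steps, but the explicit DDL is useful when scripting releases.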
The following T-SQL can be used to run this aggregate against the Name column of the Production.Product table in the AdventureWorks database:
SELECT dbo.String_Concat(Name)
FROM Production.Product
The expected output is a
comma-delimited list of product names, but instead the following
exception results because the serialized data exceeds MaxByteSize:
Msg 6522, Level 16, State 2, Line 1
A .NET Framework error occurred during execution of user-defined routine or
aggregate "string_concat":
System.Data.SqlTypes.SqlTypeException: The buffer is insufficient. Read or write
operation failed.
System.Data.SqlTypes.SqlTypeException:
at System.Data.SqlTypes.SqlBytes.Write(Int64 offset, Byte[] buffer, Int32
offsetInBuffer, Int32 count)
at System.Data.SqlTypes.StreamOnSqlBytes.Write(Byte[] buffer, Int32 offset, Int32
count)
at System.IO.BinaryWriter.Write(String value)
at string_concat.Write(BinaryWriter w)
This is, to put it mildly,
both frustrating and annoying. The idea of being able to produce custom
aggregates, but for them to only be applicable when used for extremely
limited data sets, is akin to keeping the carrot just out of the reach
of the donkey.
I tried various methods
of getting around this limitation, including data compression and
different ways of serializing the data, all without much success. I
finally realized that the key is to not serialize the data
at all, but rather to keep it in memory between calls to the aggregate.
The solution I came up with was to store the data in a static Dictionary between calls, but the problem was what to use as a key.
Once again, I tried
several methods of solving the problem, including keying off of the
caller's SPID and passing a unique key into the aggregate by
concatenating it with the input data. These methods worked to some
degree, but eventually I came up with the idea of generating the key—a
GUID—at serialization time and serializing it instead of the data. This
way, the caller never has to worry about what the aggregate is doing
internally to extend its output byte size—obviously, highly desirable
from an encapsulation point of view.
In order to implement this solution, I used the SafeDictionary described in the section "Working with HostProtection Privileges":
using SafeDictionary;
Thread safety is
extremely important in this scenario, since many callers may be using
the aggregate at the same time, and one caller's insert into or deletion
from the Dictionary may move another caller's data within the Dictionary's internal tables, causing a concurrency problem.
To implement this, I began with the same code as the string_concat aggregate, renaming the struct to string_concat_2. I modified the SqlUserDefinedAggregateAttribute to use a MaxByteSize of 16—the data size of a GUID—and set up a readonly, static instance of the ThreadSafeDictionary in addition to the local List:
[Serializable]
[Microsoft.SqlServer.Server.SqlUserDefinedAggregate(
    Format.UserDefined, MaxByteSize=16)]
public struct string_concat_2 : IBinarySerialize
{
    readonly static ThreadSafeDictionary<Guid, List<string>> theLists =
        new ThreadSafeDictionary<Guid, List<string>>();

    private List<string> theStrings;
Aside from the serialization methods, the only method requiring modification was Terminate. Since I was using Visual Studio deployment, I had to use SqlChars instead of SqlString in order to expose output data typed as NVARCHAR(MAX):
    //Make sure to use SqlChars if you use
    //VS deployment!
    public SqlChars Terminate()
    {
        string[] allStrings = theStrings.ToArray();
        string final = String.Join(",", allStrings);
        return new SqlChars(final);
    }
The Write method
creates a new GUID, which is used as a key for the local collection
holding the already-accumulated strings. This GUID is then serialized so
that it can be used as the key to pull the data from the Dictionary in the Read
method later. Note the exception handling—one major consideration when
working with shared memory in SQLCLR objects is making sure to safeguard
against memory leaks whenever possible.
    public void Write(System.IO.BinaryWriter w)
    {
        Guid g = Guid.NewGuid();

        try
        {
            //Add the local collection to the static dictionary
            theLists.Add(g, this.theStrings);

            //Persist the GUID
            w.Write(g.ToByteArray());
        }
        catch
        {
            //Try to clean up in case of exception
            if (theLists.ContainsKey(g))
                theLists.Remove(g);

            //Rethrow so that the engine sees the failure
            throw;
        }
    }
The Read method deserializes the GUID and uses it to get the collection of strings from the Dictionary.
After this, the collection is immediately removed; again, it's
important to be as cautious as possible regarding memory leaks when
using this technique, in order to ensure that you do not create server
instability.
    public void Read(System.IO.BinaryReader r)
    {
        //Get the GUID from the stream
        Guid g = new Guid(r.ReadBytes(16));

        try
        {
            //Grab the collection of strings
            this.theStrings = theLists[g];
        }
        finally
        {
            //Clean up
            theLists.Remove(g);
        }
    }
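Deployment of this version differs slightly from the first. The following sketch again uses illustrative assembly and file names; note that, in my experience, the synchronization used by the shared static collection falls outside what a SAFE assembly is permitted to do, so the assembly needs to be catalogued with a higher permission set, and the return type becomes NVARCHAR(MAX):

--Illustrative names; the synchronized static collection requires
--a permission set higher than SAFE
CREATE ASSEMBLY StringAggregates2
FROM 'C:\Assemblies\StringAggregates2.dll'
WITH PERMISSION_SET = UNSAFE;
GO

CREATE AGGREGATE dbo.String_Concat_2 (@Input NVARCHAR(4000))
RETURNS NVARCHAR(MAX)
EXTERNAL NAME StringAggregates2.string_concat_2;
GO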
After deploying this modified version of the aggregate, things work the way they should have worked from the start. No exception occurs when aggregating every product name in the Production.Product table; instead, a delimited list is output.
I had a hunch that removing
most of the serialization would also improve performance, so I decided
to test the two versions against one another. I had to filter down the
input set a bit in order to get string_concat to work without throwing an exception, so I added a WHERE clause and limited the input to product IDs less than 500. Figures 1 and 2 show the results of the tests of string_concat and string_concat_2,
respectively. Removing the serialization reduced the overhead of the
aggregation somewhat, and resulted in around a 10% performance
improvement—a nice bonus.
Although I've
shown a solution involving string concatenation, that is certainly not
the only problem for which this technique could be used. Median and
statistical variance calculation are two other areas that spring to
mind, both of which require internally holding a list of inputs. In
cases in which these lists can grow larger than 8000 bytes, this
technique should help to provide aggregate functionality where it was
previously not possible.
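For instance, a median aggregate built on this pattern would accumulate numeric inputs into a List<double> (held in the shared Dictionary between serialization calls, exactly as shown above) and compute the result in Terminate. A minimal sketch of such a Terminate method follows; the theValues field is hypothetical, standing in for a list populated by Accumulate and Merge:

    //Hypothetical Terminate for a median aggregate; assumes a
    //List<double> theValues field populated by Accumulate and Merge
    public SqlDouble Terminate()
    {
        if (theValues.Count == 0)
            return SqlDouble.Null;

        theValues.Sort();
        int mid = theValues.Count / 2;

        //Even number of inputs: average the two middle values
        if (theValues.Count % 2 == 0)
            return new SqlDouble((theValues[mid - 1] + theValues[mid]) / 2.0);

        //Odd number of inputs: take the middle value
        return new SqlDouble(theValues[mid]);
    }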
Keep in mind that this
method does stress memory quite a bit more than the usual way of
developing aggregates. Not only keeping more data in memory, but also
keeping it around for a longer period of time, means that you'll use up
quite a bit more of your server's resources. As with any technique that
exploits the host in a way that wasn't really intended, make sure to
test carefully before deploying solutions to production environments.