By far the most interesting SQLCLR feature is the
ability to create custom aggregates. Each of the other SQLCLR features,
with the possible exception of triggers, will see more use in production
applications than will aggregates, but aggregates and types are the
only members of the group that can help developers do things that simply
were not possible before. And unlike types, for which there was some
limited support in previous versions of SQL Server, aggregates bring
something totally new to the table. For that reason, it's a feature that
gets quite a bit of attention.
Unfortunately, the story is
not so great when it comes to actually using user-defined aggregates (UDAs). Developers
experimenting with them for the first time are often extremely
disappointed to discover the maximum size limitation imposed by the
engine. To illustrate this, consider a simple aggregate designed to
concatenate an input set of strings. I'll walk through it section by
section in order to describe how it works.
To begin with, the SqlUserDefinedAggregateAttribute is used. Because strings are used internally, the format must be set to UserDefined. The MaxByteSize property is set to 8000, the maximum allowed:
[Serializable]
[Microsoft.SqlServer.Server.SqlUserDefinedAggregate(
    Format.UserDefined, MaxByteSize=8000)]
public struct string_concat : IBinarySerialize
{
A generic List is used to hold strings as they're sent to the aggregate, and the List is instantiated in the Init method:
    private List<string> theStrings;

    public void Init()
    {
        theStrings = new List<string>();
    }
The Accumulate method checks whether the input is NULL, and if not adds it to the List:
    public void Accumulate(SqlString Value)
    {
        if (!(Value.IsNull))
            theStrings.Add(Value.Value);
    }
The Merge method pulls all data out of the input group's collection, adding it to the local collection:
    public void Merge(string_concat Group)
    {
        foreach (string theString in Group.theStrings)
            this.theStrings.Add(theString);
    }
The Terminate method converts the List to an array, and then uses the Join method to delimit the elements in the array with commas:
    public SqlString Terminate()
    {
        string[] allStrings = theStrings.ToArray();
        string final = String.Join(",", allStrings);
        return new SqlString(final);
    }
The final two methods are Read and Write,
used for serialization and deserialization of the aggregate.
Serialization occurs at two points during an aggregate's lifetime:
before calling Merge and before calling Terminate. Any time the data size of the serialized instance is greater than the MaxByteSize specified in the SqlUserDefinedAggregateAttribute, an exception will be thrown and aggregation will stop. The following implementation of the Read and Write methods works by first serializing the number of members in the List, and then serializing each member:
    #region IBinarySerialize Members

    public void Read(System.IO.BinaryReader r)
    {
        int count = r.ReadInt32();
        this.theStrings = new List<string>(count);
        for (; count > 0; count--)
        {
            theStrings.Add(r.ReadString());
        }
    }

    public void Write(System.IO.BinaryWriter w)
    {
        w.Write(theStrings.Count);
        foreach (string s in theStrings)
            w.Write(s);
    }

    #endregion
}
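Before the aggregate can be called, the compiled assembly must be catalogued in the database and the aggregate created against it. The following T-SQL is a sketch of that deployment; the assembly name and file path here are illustrative, so adjust them to match your build output:

CREATE ASSEMBLY StringAggregates
FROM 'C:\Assemblies\StringAggregates.dll'
WITH PERMISSION_SET = SAFE;
GO

CREATE AGGREGATE dbo.String_Concat (@Input NVARCHAR(4000))
RETURNS NVARCHAR(4000)
EXTERNAL NAME StringAggregates.string_concat;
GO

Visual Studio deployment automates these steps, but the explicit DDL is useful when scripting releases.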
The following T-SQL can be used to run this aggregate against the Name column of the Production.Product table in the AdventureWorks database:
SELECT dbo.String_Concat(Name)
FROM Production.Product
The expected output is a
comma-delimited list of product names, but instead the following
exception results because the serialized data exceeds MaxByteSize:
Msg 6522, Level 16, State 2, Line 1
A .NET Framework error occurred during execution of user-defined routine or
aggregate "string_concat":
System.Data.SqlTypes.SqlTypeException: The buffer is insufficient. Read or write
operation failed.
System.Data.SqlTypes.SqlTypeException:
at System.Data.SqlTypes.SqlBytes.Write(Int64 offset, Byte[] buffer, Int32
offsetInBuffer, Int32 count)
at System.Data.SqlTypes.StreamOnSqlBytes.Write(Byte[] buffer, Int32 offset, Int32
count)
at System.IO.BinaryWriter.Write(String value)
at string_concat.Write(BinaryWriter w)
This is, to put it mildly,
both frustrating and annoying. The idea of being able to produce custom
aggregates, but for them to only be applicable when used for extremely
limited data sets, is akin to keeping the carrot just out of the reach
of the donkey.
I tried various methods
of getting around this limitation, including data compression and
different ways of serializing the data, all without much success. I
finally realized that the key is to not serialize the data
at all, but rather to keep it in memory between calls to the aggregate.
The solution I came up with was to store the data in a static Dictionary between calls, but the problem was what to use as a key.
Once again, I tried
several methods of solving the problem, including keying off of the
caller's SPID and passing a unique key into the aggregate by
concatenating it with the input data. These methods worked to some
degree, but eventually I came up with the idea of generating the key—a
GUID—at serialization time and serializing it instead of the data. This
way, the caller never has to worry about what the aggregate is doing
internally to extend its output byte size—obviously, highly desirable
from an encapsulation point of view.
In order to implement this solution, I used the SafeDictionary described in the section "Working with HostProtection Privileges":
using SafeDictionary;
Thread safety is
extremely important in this scenario, since many callers may be using
the aggregate at the same time, and one caller's insert into or deletion
from the Dictionary may move another caller's data within the Dictionary's internal tables, causing a concurrency problem.
To implement this, I began with the same code as the string_concat aggregate, renaming the struct to string_concat_2. I modified the SqlUserDefinedAggregateAttribute to use a MaxByteSize of 16—the data size of a GUID—and set up a readonly, static instance of the ThreadSafeDictionary in addition to the local List:
[Serializable]
[Microsoft.SqlServer.Server.SqlUserDefinedAggregate(
    Format.UserDefined, MaxByteSize=16)]
public struct string_concat_2 : IBinarySerialize
{
    readonly static ThreadSafeDictionary<Guid, List<string>> theLists =
        new ThreadSafeDictionary<Guid, List<string>>();

    private List<string> theStrings;
Aside from the serialization methods, the only method requiring modification was Terminate. Since I was using Visual Studio deployment, I had to use SqlChars instead of SqlString in order to expose output data typed as NVARCHAR(MAX):
    //Make sure to use SqlChars if you use
    //VS deployment!
    public SqlChars Terminate()
    {
        string[] allStrings = theStrings.ToArray();
        string final = String.Join(",", allStrings);
        return new SqlChars(final);
    }
The Write method
creates a new GUID, which is used as a key for the local collection
holding the already-accumulated strings. This GUID is then serialized so
that it can be used as the key to pull the data from the Dictionary in the Read
method later. Note the exception handling—one major consideration when
working with shared memory in SQLCLR objects is making sure to safeguard
against memory leaks whenever possible.
    public void Write(System.IO.BinaryWriter w)
    {
        Guid g = Guid.NewGuid();

        try
        {
            //Add the local collection to the static dictionary
            theLists.Add(g, this.theStrings);

            //Persist the GUID
            w.Write(g.ToByteArray());
        }
        catch
        {
            //Try to clean up in case of exception
            if (theLists.ContainsKey(g))
                theLists.Remove(g);

            //Rethrow so that the engine sees the failure
            throw;
        }
    }
The Read method deserializes the GUID and uses it to get the collection of strings from the Dictionary.
After this, the collection is immediately removed; again, it's
important to be as cautious as possible regarding memory leaks when
using this technique, in order to ensure that you do not create server
instability.
    public void Read(System.IO.BinaryReader r)
    {
        //Get the GUID from the stream
        Guid g = new Guid(r.ReadBytes(16));

        try
        {
            //Grab the collection of strings
            this.theStrings = theLists[g];
        }
        finally
        {
            //Clean up
            theLists.Remove(g);
        }
    }
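Deployment of this version differs slightly from the first. The following sketch again uses illustrative assembly and file names; note that, in my experience, the synchronization used by the shared static collection falls outside what a SAFE assembly is permitted to do, so the assembly needs to be catalogued with a higher permission set, and the return type becomes NVARCHAR(MAX):

--Illustrative names; the synchronized static collection requires
--a permission set higher than SAFE
CREATE ASSEMBLY StringAggregates2
FROM 'C:\Assemblies\StringAggregates2.dll'
WITH PERMISSION_SET = UNSAFE;
GO

CREATE AGGREGATE dbo.String_Concat_2 (@Input NVARCHAR(4000))
RETURNS NVARCHAR(MAX)
EXTERNAL NAME StringAggregates2.string_concat_2;
GO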
After deploying this modified version of the aggregate, things work the way they should have worked from the start. No exception occurs when aggregating every product name in the Production.Product table; instead, a delimited list is output.
I had a hunch that removing
most of the serialization would also improve performance, so I decided
to test the two versions against one another. I had to filter down the
input set a bit in order to get string_concat to work without throwing an exception, so I added a WHERE clause and limited the input to product IDs less than 500. Figures 1 and 2 show the results of the tests of string_concat and string_concat_2,
respectively. Removing the serialization reduced the overhead of the
aggregation somewhat, and resulted in around a 10% performance
improvement—a nice bonus.
Although I've
shown a solution involving string concatenation, that is certainly not
the only problem for which this technique could be used. Median and
statistical variance calculation are two other areas that spring to
mind, both of which require internally holding a list of inputs. In
cases in which these lists can grow larger than 8000 bytes, this
technique should help to provide aggregate functionality where it was
previously not possible.
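For instance, a median aggregate built on this pattern would accumulate numeric inputs into a List<double> (held in the shared Dictionary between serialization calls, exactly as shown above) and compute the result in Terminate. A minimal sketch of such a Terminate method follows; the theValues field is hypothetical, standing in for a list populated by Accumulate and Merge:

    //Hypothetical Terminate for a median aggregate; assumes a
    //List<double> theValues field populated by Accumulate and Merge
    public SqlDouble Terminate()
    {
        if (theValues.Count == 0)
            return SqlDouble.Null;

        theValues.Sort();
        int mid = theValues.Count / 2;

        //Even number of inputs: average the two middle values
        if (theValues.Count % 2 == 0)
            return new SqlDouble((theValues[mid - 1] + theValues[mid]) / 2.0);

        //Odd number of inputs: take the middle value
        return new SqlDouble(theValues[mid]);
    }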
Keep in mind that this
method does stress memory quite a bit more than the usual way of
developing aggregates. Not only keeping more data in memory, but also
keeping it around for a longer period of time, means that you'll use up
quite a bit more of your server's resources. As with any technique that
exploits the host in a way that wasn't really intended, make sure to
test carefully before deploying solutions to production environments.