If there is one topic that has generated
more debate than anything else, it is the scalability of lists in
SharePoint. There are as many opinions as there are people. Because I'm
one of those people, allow me to express my opinion as well.
My opinion is that the SharePoint content database
has been designed for low management overhead, not for ultra-extreme
scalability. But where you need that extreme scalability, the
architecture provides enough hooks and management infrastructure to
satisfy almost any conceivable need of a real-world project.
Now let me argue my case. First of all, what is this concept of scalability? Scalability is vastly different from performance.
1. Scalability versus Performance
Let's imagine a non–computer task. Let's say your
business is to carry goods from point A to point B on the back of a
donkey. Business is good, so you seem to be getting more and more
requests to carry such goods. Soon enough, you realize that you need a
donkey that performs better. So you kill your donkey and get a
bodybuilder donkey. As business grows, this bodybuilder donkey is not
enough, either. So you kill him, too, and replace him with a really
expensive Sylvester Stallone donkey. Business is getting even crazier,
so you kill the Sylvester Stallone donkey. You get an Arnold
Schwarzenegger donkey and you pump him with steroids. By now, your
donkey is really performing well, but also really expensive. And even
this crazy donkey has limits. He is expensive, high maintenance,
doesn't show up at work on time like a typical government worker, and
some people say he has a funny accent.
The same goal could have been achieved by
scaling your operation across numerous cheaper donkeys. This means that
a scalable architecture breaks its work into equivalent tasks that any
cheap donkey can pick up.
The process of finding a stronger and stronger donkey is aiming for better performance;
the equivalent in the computer world is replacing weaker servers with
stronger servers, sometimes called scaling up. The process of distributing
distributable loads across multiple servers is referred to as scaling out your operation.
As you can see, scalability is a very different
animal (no pun intended) from performance. The end result is perhaps
the same; you can support more business and more users hitting your
system. It is how you solve the problem, by providing better performance
or better scalability, that differentiates the cost. Numerous
less-powerful servers are almost always cheaper than a single very high
performance supercomputer/superdonkey.
Now let's leave the donkeys behind and come back to
SharePoint. In SharePoint, you can distribute your application across
different site collections and different sites. Within all these sites
and site collections your data will eventually live inside of lists or
document libraries. As the architect of a SharePoint project, the
first thing you should do is attempt to distribute the logical
layout of your applications among sites and site collections just as
you would solve the scalability problem in any other project.
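To make that hierarchy concrete, here is a minimal object model sketch that walks a site collection and shows where the data actually lives: every list and document library hangs off a web, which hangs off a site collection. The URL is a placeholder for your environment.

using System;
using Microsoft.SharePoint;

class ListInventory
{
    static void Main()
    {
        // Hypothetical site collection URL. Each site collection maps to
        // rows in a content database, which is why it is a natural unit
        // for distributing load.
        using (SPSite site = new SPSite("http://intranet/sites/sales"))
        using (SPWeb web = site.OpenWeb())
        {
            // Every list and document library in this site ultimately
            // holds the data for the applications that live here.
            foreach (SPList list in web.Lists)
            {
                Console.WriteLine("{0}: {1} items", list.Title, list.ItemCount);
            }
        }
    }
}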
However, in any project the system can be scaled
out only as far as the load can be split into equivalent, distributable
chunks; beyond that point, the performance of each individual unit
matters. In other words, the overall throughput of a system depends on
both scalability and performance. So performance is important, too! And
we cannot talk about list scalability in SharePoint while turning a
blind eye to performance. So let's talk a little bit about performance as well.
Because you're a developer, I'd like you to sit back
and think of the most complicated project you've ever worked on.
Chances are that project involved some sort of database. In that
complicated database, how many tables had more than 10,000 or 20,000
rows? I would say that ignoring reporting databases, in most scenarios
80% of your tables don't have that many rows. This number of 10,000 or
20,000 rows is something that SharePoint lists can very comfortably
handle. But before you mark an architecture as scalable, you must also
consider all the kinds of operations that may be performed on that
data: inserts, updates, deletes, and queries. Also, SharePoint lists
give you a ton of infrastructure on top of your data that lets you
create sophisticated user interfaces and permissioning models. All that
doesn't come for free!
But despite all the infrastructure and toppings you
get on top of your data with SharePoint, I can very confidently say
that when targeting that 80% scenario, SharePoint 2010 lists are more
than capable of handling all sorts of operations at that number. In
fact, there are built-in mechanisms such as indexing that in reality
let SharePoint go much beyond this number for query operations quite
comfortably.
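To make the indexing point concrete, here is a sketch using the server object model. The URL, list name, and field name are placeholders, but SPField.Indexed and SPQuery are the real mechanisms at work.

using System;
using Microsoft.SharePoint;

class IndexedQueryDemo
{
    static void Main()
    {
        using (SPSite site = new SPSite("http://intranet")) // placeholder URL
        using (SPWeb web = site.OpenWeb())
        {
            SPList orders = web.Lists["Orders"]; // hypothetical list

            // Index the column we filter on, so the query can seek
            // instead of scanning the whole list.
            SPField customer = orders.Fields["CustomerId"]; // hypothetical field
            customer.Indexed = true;
            customer.Update();

            // CAML query against the indexed column; RowLimit keeps the
            // result set (and the page that renders it) small.
            SPQuery query = new SPQuery();
            query.Query =
                "<Where><Eq>" +
                "<FieldRef Name='CustomerId'/>" +
                "<Value Type='Text'>CUST-0042</Value>" +
                "</Eq></Where>";
            query.RowLimit = 100;

            SPListItemCollection items = orders.GetItems(query);
            Console.WriteLine("Matched {0} items", items.Count);
        }
    }
}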
2. The 20% Scenario
Now let's consider the 20% scenario. It consists of
cases in which you have a very large number of items in a list (let's
say millions), extremely large documents inside a document library, or
a large number of smaller documents in a document library.
Let's discuss the 20% scenario of document libraries
first. Document libraries accentuate scalability and performance
concerns because SharePoint by default stores its documents inside the
content database. Is storing documents as blobs inside a database a
good idea?
Perhaps the best and the worst thing about questions like this
is that there is no black-or-white answer. A lot of people argue that
storing blobs in a database is not recommended. Nothing could be
further from the truth. There are very significant advantages of
putting blobs in a database; for instance, you don't have any file name
overwrite issues, you don't have virus issues, backups are easy,
transactions are possible, and so on. One argument I have heard against
using blobs in a database is that performance may suffer. Again, such a
black-or-white argument can be very misleading, especially when talking
about databases. In fact, performance depends on how often the blobs
get updated and the size of those blobs. Databases tend to get
fragmented more easily than file systems. And as they get more and more
fragmented, their performance gets worse.
File systems, on the other hand, tend to defragment
themselves in downtimes. Also, new advances in solid state drive
technologies make defragging a problem of the past. While file systems
have the advantage of being less susceptible to fragmentation, they
suffer from some significant disadvantages as well. Given the
file system's fixed cluster size, and the effort required to look up a
file's location before the file itself can be read, the
performance and overhead for extremely small files may suffer. With
smaller file sizes, you can actually get better performance by putting
the smaller files in a database rather than on a file system.
As a rough rule of thumb, files between zero KB and
256 KB will almost always perform better when placed in the database.
Files between 256 KB and 10 MB are a gray area and depend upon how often
those files get updated. Files larger than 10 MB are better stored on
the file system.
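If you like your rules of thumb in code, here is a toy helper that encodes the thresholds above. The numbers come from this chapter's guidance, not from any SharePoint API.

using System;

class BlobPlacement
{
    // Encodes the rule of thumb: small blobs in the database, large blobs
    // on the file system, and a gray area decided by update frequency.
    static string SuggestStore(long sizeBytes, bool updatedOften)
    {
        const long SmallLimit = 256 * 1024;        // 256 KB
        const long LargeLimit = 10L * 1024 * 1024; // 10 MB

        if (sizeBytes <= SmallLimit) return "content database";
        if (sizeBytes > LargeLimit) return "file system / RBS";
        // Gray area: frequently updated blobs fragment a database faster.
        return updatedOften ? "file system / RBS" : "content database";
    }

    static void Main()
    {
        Console.WriteLine(SuggestStore(100 * 1024, false));       // content database
        Console.WriteLine(SuggestStore(50L * 1024 * 1024, true)); // file system / RBS
    }
}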
As a result, when storing blobs inside the
SharePoint content database, you as the architect have to make a
conscious decision based upon the file size and the update frequency of
those blobs. By default, the blobs will go in the content database.
However, with SharePoint 2010 you have the ability to leverage Remote
Blob Storage (RBS). Using an RBS provider, you can choose to store the
actual file in an alternate store without affecting the SharePoint
object model at all. In other words, the storage location of the
document, be it the content database or an RBS store, is transparent to
the SharePoint object model and thus to your applications.
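Assuming an RBS provider (for example, the SQL FILESTREAM provider) has already been installed against the content database, a SQL-side step not shown here, enabling it from the object model can look roughly like this. The URL and provider name are placeholders for your environment.

using System;
using Microsoft.SharePoint.Administration;

class RbsSetup
{
    static void Main()
    {
        // Placeholder URL; look up the web application and pick a
        // content database to configure.
        SPWebApplication webApp =
            SPWebApplication.Lookup(new Uri("http://intranet"));
        SPContentDatabase contentDb = webApp.ContentDatabases[0];

        // Turn RBS on for this content database and point it at the
        // installed provider; the provider name depends on your install.
        SPRemoteBlobStorageSettings rbs = contentDb.RemoteBlobStorageSettings;
        if (rbs.Installed())
        {
            rbs.Enable();
            rbs.SetActiveProviderName("FilestreamProvider_1"); // placeholder name
        }
    }
}

Existing documents keep working unchanged, because the object model never sees where the bytes actually live.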
Next, let's talk about the 20% scenario, with lists
that contain millions of items. If you reverse-engineer the SharePoint
content database structure (which you should do only for learning
purposes), you see some significant changes between SharePoint 2007 and
SharePoint 2010. Tables now use indexes based on integers in addition
to GUIDs; queries now provide NOLOCK hints and ask the execution plans
to use those integer-based indexes; and connection strings use
Enlist=false, thus preventing distributed transactions and isolation
levels escalating to serializable. There are many such under-the-cover
improvements that allow a SharePoint content database to be much more
scalable than before. However, the SharePoint content database has been
designed so it can be managed with the least overhead.
Your custom applications can usually count on a database administrator
updating statistics at 2:00 AM, a luxury the SharePoint content
database does not have. Yet with auto-updated statistics and clustered
indexes, you still have the ability to easily query millions of items
in a single list and return the results in a very short time. Such
really fast queries can be done using the object
model. However, the user interface has been designed more for the 80%
scenario. In other words, if you generate HTML that is going to show
you millions and millions of rows as one query result, obviously that
page will load more slowly. And to be honest, I'd rather have a system
that works well for 80% and not have the 20% tail wag the 80% dog.
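For the object model route, the usual trick with huge lists is to page through the items rather than pull everything in one shot. Here is a sketch using SPListItemCollectionPosition; the URL and list name are placeholders.

using System;
using Microsoft.SharePoint;

class PagedQueryDemo
{
    static void Main()
    {
        using (SPSite site = new SPSite("http://intranet")) // placeholder URL
        using (SPWeb web = site.OpenWeb())
        {
            SPList bigList = web.Lists["AuditEntries"]; // hypothetical list

            // Fetch 2,000 rows at a time instead of rendering millions
            // of rows as one query result.
            SPQuery query = new SPQuery();
            query.RowLimit = 2000;

            do
            {
                SPListItemCollection page = bigList.GetItems(query);
                Console.WriteLine("Fetched {0} items", page.Count);

                // A null position means we just read the last page.
                query.ListItemCollectionPosition = page.ListItemCollectionPosition;
            }
            while (query.ListItemCollectionPosition != null);
        }
    }
}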
Still, the 20% scenario is important. And to guard against
even this problem, SharePoint comes with numerous management features
that prevent average users from querying more than a certain number of
items, or from creating projections of lists with too many columns in
them. These features are commonly referred to as list throttling.
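You can inspect, and with sufficient rights tune, these throttles from the object model. A sketch, with the URL assumed:

using System;
using Microsoft.SharePoint;
using Microsoft.SharePoint.Administration;

class ThrottleDemo
{
    static void Main()
    {
        // Placeholder URL; read the throttling limits for a web application.
        SPWebApplication webApp =
            SPWebApplication.Lookup(new Uri("http://intranet"));
        Console.WriteLine("List view threshold: {0}",
            webApp.MaxItemsPerThrottledOperation);          // 5,000 by default
        Console.WriteLine("Auditors/admins threshold: {0}",
            webApp.MaxItemsPerThrottledOperationOverride);  // 20,000 by default

        // Privileged code can opt out of throttling per query, where the
        // list settings allow it.
        SPQuery query = new SPQuery();
        query.QueryThrottleMode = SPQueryThrottleOption.Override;
    }
}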
Thus, with extremely large lists, querying is never a
problem. What can become a problem, however, is inserting or updating
items.
So suppose you have a borderline scenario of a table
with millions and millions of items, and these millions and millions of
items are changing very frequently. First, I would argue that this is a
very borderline scenario. Second, this being a borderline scenario, it
is reasonable to expect organizations to hire a database administrator
to manage such a crazy table. And finally, even such a table can be
used inside of SharePoint through Business Connectivity Services (BCS). So it
is really not such a problem after all.
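As a sketch, once a BCS external content type is in place over that big SQL table, the resulting external list is consumed like any other list; every name below is a placeholder.

using System;
using Microsoft.SharePoint;

class ExternalListRead
{
    static void Main()
    {
        // "Customers" is assumed to be an external list backed by a BCS
        // external content type; the data stays in SQL Server, managed
        // by that DBA, while SharePoint merely surfaces it.
        using (SPSite site = new SPSite("http://intranet")) // placeholder URL
        using (SPWeb web = site.OpenWeb())
        {
            SPList customers = web.Lists["Customers"]; // hypothetical list

            SPQuery query = new SPQuery();
            query.RowLimit = 10;
            foreach (SPListItem item in customers.GetItems(query))
            {
                Console.WriteLine(item["CustomerName"]); // hypothetical field
            }
        }
    }
}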
Thus with all this background behind us, let me restate my opinion.
NOTE
In my opinion, the SharePoint content database
has been designed for low management overhead, not for ultra-extreme
scalability. But where you need that extreme scalability, the
architecture provides enough hooks and management infrastructure to
satisfy almost any conceivable need of a real-world project.
I made some interesting allusions to various facilities inside SharePoint 2010 that let you satisfy the 20% extreme scenarios: