In the old days of Exchange, one server could do it
all—an Exchange 5.5 or Exchange 2000 server would receive and deliver
email, handle client connections, and store user data. There was
limited separation of roles between front-end and back-end servers,
achieved by selecting the This Is A Front End Server check box in
Exchange System Manager. But that didn't enable or disable a role; it
merely changed functionality for HTTP, POP3, IMAP4, and NNTP access
from redirect to proxy. Exchange Server 2007 saw a significant change
in architecture with the separation of functions into server roles,
although it wasn't a complete transformation—certain clients (MAPI)
would still connect directly to the Mailbox servers for data while all
other clients connected through Client Access servers. Now in Exchange
Server 2010, even MAPI clients connect to the Client Access servers
through the new RPC Client Access functionality.
The preceding
paragraph should be a quick recap—why reproduce it here? Because it
reinforces a key point: in order to troubleshoot Exchange, you have to
understand the architecture. Understanding which functions of Exchange
are controlled by which server roles is absolutely critical, or you
could spend a lot of time troubleshooting the wrong server.
Troubleshooting Exchange Server 2010 often involves
collecting and reviewing information from a series of servers, rather
than focusing on one. For example, a user complains that he isn't
receiving new email. There are a number of possible causes for this:
The user's client isn't receiving notifications of new email.
The user's client can't connect to the Client Access server to retrieve new email.
All copies of the relevant mailbox database are offline.
The user's mailbox is full.
No Hub Transport servers are available to deliver his messages.
A transport agent is blocking delivery of email to this user.
A closer look at this list shows an interesting
breakdown. The first two issues could loosely be categorized as client
access issues, the next two as database issues, and the last two as
transport issues. These categories correspond nicely to the three
required server roles, and since that makes a logical breakdown, that's
how we'll cover troubleshooting in the following sections. However,
before we dive right into the tools, let's take a moment to consider
what troubleshooting involves.
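To make that breakdown concrete, here is a minimal Python sketch (not an Exchange tool) that groups the six candidate causes listed earlier by the server role you would check first. The dictionary name, the shortened cause descriptions, and the helper function are all illustrative assumptions; the role assignments simply follow the loose categorization described above.

```python
from collections import defaultdict

# Hypothetical mapping: each candidate cause -> the server role to check first.
# The wording of each cause is shortened from the list in the text.
CAUSE_TO_ROLE = {
    "client not receiving new-mail notifications": "Client Access",
    "client cannot connect to the Client Access server": "Client Access",
    "all copies of the mailbox database offline": "Mailbox",
    "mailbox full": "Mailbox",
    "no Hub Transport servers available": "Hub Transport",
    "transport agent blocking delivery": "Hub Transport",
}


def group_by_role(cause_to_role):
    """Return {role: [causes]} so checks can be planned one role at a time."""
    grouped = defaultdict(list)
    for cause, role in cause_to_role.items():
        grouped[role].append(cause)
    return dict(grouped)


checks = group_by_role(CAUSE_TO_ROLE)
for role, causes in checks.items():
    print(role)
    for cause in causes:
        print("  -", cause)
```

Grouping the causes this way is only a planning aid: it tells you which servers to collect information from, not which cause is the real one.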
When faced with a technical problem, your immediate
impulse is often to jump right into the system and start clicking.
While this can be successful, particularly when you're resolving a
problem you've seen hundreds of times and know like the back of your
own hand, it's not necessarily a reproducible strategy. What happens
when you encounter a problem you haven't seen before? What do you do
when you truly have no idea what the root cause could be?
The first step in troubleshooting a problem, any problem, is to define what the problem is.
In many cases, this requires asking for more information. When an end
user says that she can't send email, does she mean that she can't open
Outlook? That she can't generate a new email? That she clicks Send but
the email never leaves the Drafts or Outbox folders? Or that she's sent
messages that were never received? The end result is the same—the user
can't send email—but the root causes are very different.
Once the problem has been defined, the next step is
to determine the scope of the problem. This often helps clarify the
direction of further troubleshooting. By determining how many users are
affected—and more importantly, determining what those users have in
common—you can rule out some possibilities and focus on things with a
greater impact. For example, if one user can't send email, the root
cause could be many things unique to that user, from Outlook
configuration to network connectivity to a disabled user account.
However, if a second user has a similar issue, it's
more likely to be something they have in common. Are they in the same
network segment, perhaps? If 10 users on different floors all report
Outlook problems, the cause is more likely to lie with an Exchange
server. Are all 10 users in the same database, for example, or in the
same Active Directory site?
There are a number of clarifying questions that are extremely useful in determining the scope of a particular problem:
How many users are affected by the outage?
Do all the affected users access Exchange through the same method, such as Outlook, Outlook Web App, or ActiveSync?
What exactly are the users trying to do when they encounter the problem?
Are other users able to perform the same task without problems?
Are all of the users in the same database?
Are all of the users in the same site?
Does the problem occur all the time, only some of the time, or rarely?
The answers to these will often rule out
possibilities right from the start. If one user can't log in to Outlook
successfully, but another in the same database can, you know
immediately that the relevant database must be mounted and accessible,
and you can then concentrate on other things.
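The scoping questions above amount to asking what the affected users have in common. As a rough illustration, the following Python sketch takes a hand-built list of affected users and reports only the attributes that every one of them shares. The user records, attribute names, and the `shared_attributes` helper are hypothetical; in practice you would gather this information from your own environment.

```python
def shared_attributes(users):
    """Return the attributes whose value is identical for every affected user."""
    if not users:
        return {}
    first = users[0]
    return {
        key: value
        for key, value in first.items()
        # "name" is unique per user, so it is never a shared attribute.
        if key != "name" and all(u.get(key) == value for u in users[1:])
    }


# Hypothetical sample data: three users who reported the same problem.
affected = [
    {"name": "user1", "database": "DB01", "site": "HQ", "client": "Outlook"},
    {"name": "user2", "database": "DB01", "site": "Branch", "client": "Outlook"},
    {"name": "user3", "database": "DB01", "site": "HQ", "client": "Outlook Web App"},
]

common = shared_attributes(affected)
print(common)  # {'database': 'DB01'}
```

Here the only thing all three users share is the mailbox database, which suggests starting there rather than with the site or the client. With a single affected user, every attribute counts as shared, which mirrors the point that one report alone does little to narrow the scope.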
Speaking of concentrating on other things, one of
the hardest parts of troubleshooting is ignoring
unimportant distractions and focusing on what's actually causing the issue. It's
often difficult to differentiate between what's important and what's
not unless you know where to start (which is why defining the problem
is so important).
Here's an example: an end user reports that
he can't send email to a specific user, and during investigation you
also discover that he can't access a particular public folder. Is the
public folder problem directly related to the email problem? It might
be—if the recipient's mailbox is on a server that also houses the only
replica of that public folder, and that server's inaccessible, that
would explain both problems. But in many cases it might not—the public
folder store might be dismounted, the user might not have permissions,
or Exchange may be blocking referrals to the replica due to site link
costs. Although there's at least one explanation that covers both
problems, many more exist that are unique to the secondary problem. The
steps to troubleshoot internal mail flow are dramatically different
from those required to troubleshoot public folder access, so if you're
trying to resolve a problem with internal email, concentrate on that
and leave the public folder issue for later.