Symbian OS : Error-Handling Strategies - Escalate Errors

12/20/2012 11:06:54 AM

1. Problem

1.1. Context

Your component is part of or contains a layered software stack and has to appropriately handle non-fatal domain or system errors.

1.2. Summary

Non-fatal errors must be resolved appropriately so that an application or service can continue to provide the functionality the user expects.
There should be a clear separation between code for error handling and the normal flow of execution to reduce development time and maintenance costs by reducing code complexity and increasing software clarity.
Error-handling code should not have a significant impact at run time during normal operation.

1.3. Description

Most well-written software is layered so that high-level abstractions depend on low-level abstractions. For instance, a browser interacts with an end user through its UI code at the highest levels of abstraction. Each time the end user requests a new web page, that request is given to the HTTP subsystem which in turn hands the request to the IP stack at a lower level. At any stage, an error might be encountered that needs to be dealt with appropriately. If the error is fatal in some way, then there's little you can do but panic as per Fail Fast.

Non-fatal error conditions, however, are more likely to be encountered during the operation of your application or service and your code needs to be able to handle them appropriately. Such errors can occur at any point in the software stack whether you are an application developer or a device creator. If your software is not written to handle errors or handles them inappropriately then the user experience will be poor.

As an illustration, the following example of bad code handles a system error by ignoring it and trying to continue anyway. Similarly it handles a recoverable missing file domain error by panicking, which causes the application to terminate with no opportunity to save the user's data or to try an alternative file. Even if you try to set things up so that the file is always there, this is very easily broken by accident:

void CEngine::Construct(const TDesC& aFileName)
  {
  RFs fs;
  fs.Connect(); // Bad code - ignores possible system error

  RFile file;
  TInt err = file.Open(fs,aFileName,EFileRead); // Could return KErrNotFound
  ASSERT(err == KErrNone);
  ...
  }

In general, it is not possible for code running at the lowest layers of your application or service to resolve domain or system errors. Consider a function written to open a file. If the open fails, should the function retry the operation? Not if the user entered the filename incorrectly. If, however, the file is located on a remote file system that does not have guaranteed availability, then retrying would probably be worthwhile. The function cannot tell which situation it is in because its level of abstraction is too low. The point to take away is that code at one layer may not be able to handle an error by itself because it lacks the context present at higher layers of the software.

When an error occurs that cannot be resolved at the layer in which it occurred the only option is to return it up the call stack until it reaches a layer that does have the context necessary to resolve the error. For example, if the engine of an application encounters a disk full error when trying to save the user's data, it is not able to start deleting files to make space without consulting the end user. So instead it escalates the error upwards to the UI layer so that the end user can be informed.
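To make the point about context concrete, here is a minimal sketch in portable C++ rather than Symbian code. The lower layer only reports the outcome of a single attempt, while the caller, which knows whether the file is remote, decides whether retrying makes sense. FileSource, OpenFn, LoadDocument and the retry limit are hypothetical names invented for this illustration; only the error code values mirror the Symbian OS constants.

```cpp
#include <cassert>
#include <functional>

// Error codes mirroring the Symbian OS values.
const int KErrNone = 0;
const int KErrNotFound = -1;

// Context that only the higher layer has (hypothetical).
enum class FileSource { Local, Remote };

// Lower layer: any callable that makes one attempt and returns an error code.
// It cannot know whether a retry is sensible, so it never retries by itself.
using OpenFn = std::function<int()>;

// Higher layer: knows where the file lives, so it chooses the policy.
int LoadDocument(FileSource source, const OpenFn& open, int maxRemoteAttempts = 3)
{
    assert(maxRemoteAttempts >= 1);
    const int attempts = (source == FileSource::Remote) ? maxRemoteAttempts : 1;
    int err = KErrNotFound;
    for (int i = 0; i < attempts && err != KErrNone; ++i)
    {
        err = open(); // one attempt by the lower layer
    }
    return err; // if still failing, the caller escalates further, e.g. to the UI
}
```

With this split, a local file is tried once and a remote file up to three times, yet the open operation itself contains no policy at all.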

It would be inappropriate to use Fail Fast to resolve such errors by panicking. Whilst it does have the advantage of resolving the current error condition in a way which ensures that the integrity of data stored on the phone is unlikely to be compromised, it is too severe a reaction to system or domain errors that should be expected to occur at some point.

Unfortunately, the C++ function call mechanism provides no distinct support for error propagation by itself, which encourages the development of ad-hoc solutions such as returning an error code from a function. All callers of the function then need to check this return code to see if it is an error or a valid return, which can clutter up their code considerably.

As an example of where an error can be resolved without escalation, consider the error-handling logic in the TCP network protocol. The TCP specification requires a peer to resend a packet if it detects that one has been dropped. This is so that application code does not have to deal with the unreliability of networks such as Ethernet. Since the protocol has all the information it needs to resolve the error in the layer in which it occurred, no propagation is required.

1.4. Example

An example of the problem we wish to solve is an application that transmits data via UDP using the Communication Infrastructure subsystem of Symbian OS, colloquially known as Comms-Infras. The application is likely to be split up into at least two layers with the higher or UI layer responsible for dealing with the end user and the lower or engine layer responsible for the communications channel.

The engine layer accesses the UDP protocol via the RSocket API but how should the engine layer handle the errors that it will receive from this API? How should it handle errors which occur during an attempt to establish a connection?

Clearly it should take steps to protect its own integrity and clean up any resources that are no longer needed that were allocated to perform the connection that subsequently failed. But to maintain correct layering the engine shouldn't know whether the connection was attempted as a result of an end user request or because some background task was being performed. In the latter case, notifying the end user would be confusing since they'd have no idea what the error meant as they wouldn't have initiated the connection attempt.
The engine cannot report errors to the end-user because not only does it not know if this is appropriate but doing so would violate the layering of the application and make future maintenance more difficult.
Ignoring errors is also not an option since this might be an important operation which the user expects to run reliably. Ignoring an error might even cause the user's data to be corrupted so it is important the whole application is designed to use the most appropriate error-handling strategy to resolve any problems.

For system errors, such as KErrNoMemory resulting from the failure to allocate a resource, you might think that a valid approach to resolving this error would be to try to free up any unnecessary memory. This would resolve the error within the engine with no need to involve the application at all. But how would you choose which memory to free? Clearly all memory has been allocated for a reason and most of it to allow client requests to be serviced. Perhaps caches and the like can be reduced in size but that will cause operations to take longer. This might be unacceptable if the application or service has real-time constraints that need to be met.

2. Solution

Lower-level components should not try to handle domain or system errors silently unless they have the full context and can do so successfully with no unexpected impact on the layers above. Instead lower layers should detect errors and pass them upwards to the layer that is capable of correctly resolving them.

Symbian OS provides direct support for escalating errors upwards known as the leave and trap operations. These allow errors to be propagated up the call stack by a leave and trapped by the layer that has sufficient context to resolve it. This mechanism is directly analogous to exception handling in C++ and Java.
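The analogy with exception handling can be sketched in standard C++. This is an emulation for illustration only, not the Symbian implementation: LeaveSignal, Leave(), Trap() and the three layer functions are hypothetical stand-ins, and only the error code values mirror the Symbian OS constants.

```cpp
#include <cassert>

const int KErrNone = 0;      // mirrors the Symbian OS value
const int KErrNoMemory = -4; // mirrors the Symbian OS value

// Stand-in for a leave: carries a single integer error code.
struct LeaveSignal { int code; };

// Stand-in for User::Leave(): abandons the current function immediately.
[[noreturn]] void Leave(int code)
{
    assert(code < KErrNone); // by convention, errors are negative
    throw LeaveSignal{code};
}

// Stand-in for a trap harness: runs fn and returns the leave code,
// or KErrNone if fn returned normally.
template <typename Fn>
int Trap(Fn fn)
{
    try
    {
        fn();
        return KErrNone;
    }
    catch (const LeaveSignal& leave)
    {
        return leave.code;
    }
}

// Lowest layer: detects the error and leaves.
void AllocateL(bool memoryAvailable)
{
    if (!memoryAvailable) Leave(KErrNoMemory);
}

// Middle layer: no error-handling code; a leave simply passes through it.
void EngineL(bool memoryAvailable)
{
    AllocateL(memoryAvailable);
}

// Top layer: has the context to resolve the error, so it owns the trap.
int Ui(bool memoryAvailable)
{
    return Trap([&] { EngineL(memoryAvailable); });
}
```

Calling Ui(false) yields KErrNoMemory at the top layer even though the middle layer never mentions errors, which is the property the pattern relies on.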

Symbian OS does not explicitly use the standard C++ exception-handling mechanism, for historical reasons. When Symbian OS, or EPOC32 as it was then known, was first established, the compilers available at that time had poor or non-existent support for exceptions. You can use C++ exceptions within code based on Symbian OS. However, there are a number of difficulties with mixing Leave and trap operations with C++ exceptions and so it is not recommended that you use C++ exceptions.

2.1. Structure

The most basic structure for this pattern (see Figure 1) revolves around the following two concepts:

The caller is the higher-layer component that makes a function call to another component. This component is aware of the motivation for the function call and hence how to resolve any system or domain errors that might occur.

The callee is the lower-layer component on which the function call is made. This component is responsible for attempting to satisfy the function call if possible and detecting any system or domain errors that occur whilst doing so. If an error is detected then the function escalates this to the caller through the Leave operation, otherwise the function returns as normal.

Figure 1. Structure of Escalate Errors pattern

An important point to remember when creating a function is that there should only be one path out of it. Either return an error or leave but do not do both as this forces the caller to separately handle all of the ways an error can be reported. This results in more complex code in the caller and usually combines the disadvantages of both approaches!

Note that all leaves must have a corresponding trap harness. This is to ensure that errors are appropriately handled in all situations. A common strategy for resolving errors is to simply report the problem, in a top-level trap harness, to the end user with a simple message corresponding to the error code.

2.2. Dynamics

Normally, a whole series of caller–callee pairs are chained together in a call stack. In such a situation, when a leave occurs, the call stack is unwound until control is returned to the closest trap. This allows an error to be easily escalated upwards through more than one component since any component that doesn't want to handle the error simply doesn't trap it (see Figure 2).

Figure 2. Dynamics of Escalate Errors pattern

The most important decision to make is where you should place your trap harnesses. Having coarse-grained units of recovery has the advantage of fewer trap harnesses and their associated recovery code but with the disadvantage that the recovery code may be general and complex. There is also the danger that a small error leads to catastrophic results for the end user. For instance, if not having enough memory to apply bold formatting in a word processor resulted in a leave that unwound the entire call stack, this might terminate the application without giving the end user the opportunity to save and hence they might lose their data! On the other hand, too fine-grained units of recovery results in many trap harnesses and lots of recovery code with individual attention required to deal with each error case as well as a potentially significant increase in the size of your executable.
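The word-processor scenario above can be sketched with one trap per user command, so that a failed command is reported and abandoned while the document survives. This is portable C++ emulating leaves with exceptions; Document, ApplyBoldL() and HandleBoldCommand() are hypothetical names, and the memory shortage is simulated with a flag.

```cpp
#include <cassert>
#include <string>
#include <vector>

const int KErrNone = 0;      // mirrors the Symbian OS value
const int KErrNoMemory = -4; // mirrors the Symbian OS value

struct LeaveSignal { int code; };

[[noreturn]] void Leave(int code)
{
    assert(code < KErrNone);
    throw LeaveSignal{code};
}

template <typename Fn>
int Trap(Fn fn)
{
    try { fn(); return KErrNone; }
    catch (const LeaveSignal& leave) { return leave.code; }
}

struct Document
{
    std::string text;
    bool bold = false;
    std::vector<std::string> messages; // what the end user would be shown
};

// Leaving function: fails when the (simulated) memory is exhausted.
void ApplyBoldL(Document& doc, bool memoryAvailable)
{
    if (!memoryAvailable) Leave(KErrNoMemory);
    doc.bold = true;
}

// Unit of recovery is a single command: the trap sits here rather than
// at the very top of the thread, so the document is left untouched.
void HandleBoldCommand(Document& doc, bool memoryAvailable)
{
    const int err = Trap([&] { ApplyBoldL(doc, memoryAvailable); });
    if (err != KErrNone)
    {
        doc.messages.push_back("Could not apply formatting (error "
                               + std::to_string(err) + ")");
    }
}
```

Moving the trap up to a single thread-level harness would remove this function's recovery code, at the price of losing the ability to keep the document open after the failure.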

Unlike other operating systems, Symbian OS is largely event-driven so the current call stack often just handles a tiny event, such as a keystroke or a byte received. Thus trying to handle every entry point with a separate trap is impractical. Instead leaves are typically handled in one of three places:

Many threads have a top-level trap which is used as a last resort to resolve errors to minimize unnecessary error handling in the component. In particular, the Symbian OS application framework provides such a top-level trap for applications. If a leave does occur in an application, the CEikAppUi::HandleError() virtual function is called allowing applications to provide their own error-handling implementation.
Traps are placed in a RunL() implementation when using Active Objects to handle the result of an asynchronous service call. Alternatively, if your RunL() leaves, you can handle the error in the corresponding RunError() function.
Trap harnesses can be nested so you do not need to rely on just having a top-level trap. This allows independent sub-components to do their own error handling if necessary. You should consider inserting a trap at the boundary of a component or layer. This can be useful if you wish to attempt to resolve any domain errors specific to your component or layer before they pass out of your control.
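The RunL() and RunError() pairing mentioned above can be pictured with the following simplified model in portable C++. It is not the real CActive or CActiveScheduler: ActiveSketch, Dispatch() and Downloader are hypothetical, and the sketch only assumes that the scheduler runs RunL() inside a trap and offers any leave code to RunError(), which returns KErrNone when it has dealt with the error.

```cpp
#include <cassert>

const int KErrNone = 0;       // mirrors the Symbian OS value
const int KErrTimedOut = -33; // mirrors the Symbian OS value

struct LeaveSignal { int code; };

[[noreturn]] void Leave(int code)
{
    assert(code < KErrNone);
    throw LeaveSignal{code};
}

// Simplified stand-in for an active object.
class ActiveSketch
{
public:
    virtual ~ActiveSketch() = default;
    virtual void RunL(int status) = 0;                // may leave
    virtual int RunError(int error) { return error; } // default: not handled
};

// Simplified stand-in for the scheduler's dispatch step. Returns KErrNone
// if the object resolved the error, otherwise the code left unresolved.
int Dispatch(ActiveSketch& object, int status)
{
    try
    {
        object.RunL(status);
        return KErrNone;
    }
    catch (const LeaveSignal& leave)
    {
        return object.RunError(leave.code);
    }
}

class Downloader : public ActiveSketch
{
public:
    int retries = 0;

    void RunL(int status) override
    {
        if (status < KErrNone) Leave(status);
        // ... process the completed request ...
    }

    int RunError(int error) override
    {
        if (error == KErrTimedOut)
        {
            ++retries;       // resolved here: schedule another attempt
            return KErrNone;
        }
        return error;        // anything else is escalated
    }
};
```

The object resolves only the error it understands, a timeout, and lets every other code carry on upwards.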

2.3. Implementation

Leaves

A leave is triggered by calling one of the User leave functions defined in e32std.h and exported by euser.dll. By calling one of these functions you indicate to Symbian OS that you cannot finish the current operation you are performing or return normally because an error has occurred. In response, Symbian OS searches up through the call stack looking for a trap harness to handle the leave. Whilst doing so Symbian OS automatically cleans up objects pushed onto the cleanup stack by lower-level functions.

The main leave function is User::Leave(TInt aErr), which is equivalent to a throw statement in C++; its single integer parameter indicates the type of error. By convention, negative integers are used to represent errors. There are a few helper functions that can be used in place of User::Leave():

User::LeaveIfError(TInt aReason) leaves if the reason code is negative or returns the reason if it is zero or positive.
User::LeaveIfNull(TAny* aPtr) leaves with KErrNoMemory if aPtr is null.
new(ELeave) CObject() is an overload of the new operator that automatically leaves with KErrNoMemory if there is not enough memory to allocate the object on the heap.
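The behaviour of the first two helpers can be summarised in a few lines of portable C++. These are emulations written for this explanation, not the euser.dll implementations: the free functions LeaveIfError() and LeaveIfNull() below are hypothetical stand-ins (the latter is a template here, whereas the real function takes a TAny*), and a leave is modelled as a C++ exception.

```cpp
#include <cassert>

const int KErrNone = 0;      // mirrors the Symbian OS value
const int KErrNoMemory = -4; // mirrors the Symbian OS value

struct LeaveSignal { int code; };

[[noreturn]] void Leave(int code)
{
    assert(code < KErrNone);
    throw LeaveSignal{code};
}

template <typename Fn>
int Trap(Fn fn)
{
    try { fn(); return KErrNone; }
    catch (const LeaveSignal& leave) { return leave.code; }
}

// Leaves on a negative code; otherwise passes the value straight through,
// so a function that returns a count or an error can be wrapped directly.
int LeaveIfError(int reason)
{
    if (reason < KErrNone) Leave(reason);
    return reason;
}

// Leaves with KErrNoMemory on a null pointer; otherwise returns the pointer.
template <typename T>
T* LeaveIfNull(T* ptr)
{
    if (ptr == nullptr) Leave(KErrNoMemory);
    return ptr;
}
```

Because LeaveIfError() returns non-negative values unchanged, a call such as `int count = LeaveIfError(ReadSome());` keeps the success value while escalating failures, where ReadSome() stands for any hypothetical function returning a count or a negative error.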

Here is an example where a function leaves if it couldn't establish a connection to the file server due to some system error or because it couldn't find the expected file. In each case, the higher layers are given the opportunity to resolve the error:

void CEngine::ConstructL(const TDesC& aFileName)
  {
  RFs fs;
  User::LeaveIfError(fs.Connect());

  RFile file;
  User::LeaveIfError(file.Open(fs,aFileName,EFileRead));
  ...
  }

By convention, the names of functions which can leave should always be suffixed with an 'L' (e.g., ConstructL()) so that a caller is aware the function may not return normally. Such a function is frequently referred to as a leaving function. Note that this rule applies to any function which calls a leaving function even if it does not call User::Leave() itself. The function implicitly has the potential to leave because un-trapped leaves are propagated upward from any functions it calls.

Unfortunately you need to remember that this is only a convention and is not enforced by the compiler so an 'L' function is not always equivalent to a leaving function. However, static analysis tools such as epoc32\tools\leavescan.exe exist to help you with this. These tools parse your source code to evaluate your use of trap and leave operations and can tell you if you're violating the convention. They also check that all leaves have a trap associated with them to help you avoid USER 175 panics.

Traps

A trap harness is declared by using one of the TRAP macros defined in e32cmn.h. These macros will catch any leave from any function invoked within a TRAP macro. The main trap macro is TRAP(ret, expression) where expression is a call to a leaving function and ret is a pre-existing TInt variable. If a leave reaches the trap then the operating system assigns the error code to ret; if the expression returns normally, without leaving or because the leave was trapped at a lower level in the call stack, then ret is set to KErrNone to indicate that no error occurred.

As the caller of a function within a trap you should not need to worry about resources allocated by the callee. This is because the leaving mechanism is integrated with the Symbian OS cleanup stack. Any objects allocated by the callee and pushed onto the cleanup stack are deleted prior to the operating system invoking the trap harness.
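The way a trap and the cleanup stack cooperate can be modelled roughly as follows. This is a deliberately simplified emulation in portable C++, not the Symbian implementation: gCleanupStack, CleanupPush(), CleanupPopAndDestroy(), Trap() and FillBufferL() are hypothetical stand-ins, and the sketch performs the cleanup inside the harness, whereas the text above says the real mechanism does it before the harness is invoked. The effect seen by the caller is the same: by the time the error code is available, everything the callee pushed has been released.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

const int KErrNone = 0;     // mirrors the Symbian OS value
const int KErrGeneral = -2; // mirrors the Symbian OS value

struct LeaveSignal { int code; };

[[noreturn]] void Leave(int code)
{
    assert(code < KErrNone);
    throw LeaveSignal{code};
}

// Simplified stand-in for the cleanup stack: a stack of release actions.
std::vector<std::function<void()>> gCleanupStack;

void CleanupPush(std::function<void()> release)
{
    gCleanupStack.push_back(std::move(release));
}

void CleanupPopAndDestroy()
{
    assert(!gCleanupStack.empty());
    std::function<void()> release = std::move(gCleanupStack.back());
    gCleanupStack.pop_back();
    release();
}

template <typename Fn>
int Trap(Fn fn)
{
    const std::size_t mark = gCleanupStack.size(); // level on entering the trap
    try
    {
        fn();
        return KErrNone;
    }
    catch (const LeaveSignal& leave)
    {
        // Release everything pushed since the trap was entered, newest first.
        while (gCleanupStack.size() > mark) CleanupPopAndDestroy();
        return leave.code;
    }
}

int gLiveBuffers = 0; // counts buffers not yet released

// Callee: allocates a buffer, protects it, then possibly leaves.
void FillBufferL(bool fail)
{
    int* buffer = new int[16];
    ++gLiveBuffers;
    CleanupPush([buffer] { delete[] buffer; --gLiveBuffers; });
    if (fail) Leave(KErrGeneral);
    CleanupPopAndDestroy(); // normal path releases the buffer as well
}
```

The caller of `Trap([] { FillBufferL(true); })` receives KErrGeneral and has no buffer to free, which is why code inside a trap does not need its own release logic for resources owned by the callee.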

In addition to the basic TRAP macro, Symbian OS defines the following similar macros:

TRAPD(ret, expression) – the same as TRAP except that it automatically declares ret as a TInt variable on the stack for you (hence the 'D' suffix) for convenience.
TRAP_IGNORE(expression) – simply traps expression and ignores whether or not any errors occurred.

Here is an example of using a trap macro:

void CMyComponent::Draw()
  {
  TRAPD(err, iMyClass->AllocBufferL());
  if(err < KErrNone)
    {
    DisplayErrorMsg(err);
    User::Exit(err);
    }
  ... // Continue as normal
  }

Intermediate Traps

If a function traps a leave but then determines from the error code that it is unable to resolve that specific error, it needs to escalate the error further upwards. This can be achieved by calling User::Leave() again with the same error code.

TRAPD(err, iBuffer = iMyClass->AllocBufferL());
if(err < KErrNone)
  {
  if(err == KErrNoMemory)
    {
    // Resolve error
    }
  else
    {
    User::Leave(err); // Escalate the error further up the call stack
    }
  }

Trapping and leaving again is normally only done if a function is only capable of resolving a subset of possible errors and wishes to trap some while escalating others. This should be done sparingly since every intermediate trap increases the cost of the entire leave operation as the stack unwind has to be restarted.

Additional Restrictions on Using Trap–Leave Operations

You should not call a leaving function from within a constructor.
This is because any member objects that have been constructed will not have their destructors called which can cause resource leaks.
You also should not allow a Leave to escape from a destructor.
Essentially this means that it is permissible to call leaving functions within a destructor so long as they are trapped before the destructor completes. There are two reasons for this restriction. The first is that the leave and trap mechanisms are implemented in terms of C++ exceptions, so when a leave occurs the call stack is unwound and the destructors of objects placed on the call stack are called. If the destructors of these objects leave then an abort may occur on some platforms as Symbian OS does not support leaves occurring whilst a leave is already being handled.
The second reason is that, in principle, a destructor should never fail. If a destructor can leave, it suggests that the code has been poorly architected. It also implies that part of the destruction process might fail, potentially leading to memory or handle leaks. One approach to solving this is to introduce 'two-phase destruction' where some form of ShutdownL() function is called prior to deleting the object. For further information on this, see the Symbian Developer Library.
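A minimal sketch of two-phase destruction, in portable C++ with leaves emulated by exceptions, might look like the following. Journal, its members and the simulated disk-full condition are hypothetical; the point is only that the step which can fail lives in ShutdownL(), where the caller can still trap and react, while the destructor does nothing that could fail.

```cpp
#include <cassert>

const int KErrNone = 0;       // mirrors the Symbian OS value
const int KErrDiskFull = -26; // mirrors the Symbian OS value

struct LeaveSignal { int code; };

[[noreturn]] void Leave(int code)
{
    assert(code < KErrNone);
    throw LeaveSignal{code};
}

template <typename Fn>
int Trap(Fn fn)
{
    try { fn(); return KErrNone; }
    catch (const LeaveSignal& leave) { return leave.code; }
}

class Journal
{
public:
    explicit Journal(bool diskFull) : iDiskFull(diskFull) {}

    void Write() { iDirty = true; }
    bool IsDirty() const { return iDirty; }

    // Phase one: the part of teardown that can fail (flushing pending data).
    void ShutdownL()
    {
        if (!iDirty) return;
        if (iDiskFull) Leave(KErrDiskFull);
        iDirty = false; // data flushed
    }

    // Phase two: only releases resources; nothing here can leave.
    ~Journal() noexcept = default;

private:
    bool iDiskFull;
    bool iDirty = false;
};
```

A caller would trap ShutdownL(), decide what to do about unsaved data if it fails, and only then delete the object.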

2.4. Consequences

Positives

Errors can be handled in a more appropriate manner in the layer that understands the error compared to attempting to resolve the error immediately.
Escalating an error to a design layer with sufficient context to handle it ensures that the error is handled correctly. If this is not done and an attempt is made to handle an error at too low a level, your options for handling the error are narrowed to a few possibilities which are likely to be unsuitable.
The low-level code could retry the failed operation; it could silently ignore the error (not normally practical but there may be circumstances when ignoring certain errors is harmless); or it could use Fail Fast. None of these strategies is particularly desirable, especially the use of Fail Fast, which should be reserved for faults rather than the domain or system errors that we are dealing with here.
In order to handle an error correctly without escalating it, the component would probably be forced to commit layering violations, e.g., by calling up into the user interface from lower-level code. This mixing of GUI and service code causes problems with encapsulation and portability as well as decreasing your component's maintainability. This pattern neatly avoids all these issues.
Less error-handling code needs to be written, which means the development costs and code size are reduced as well as making the component more maintainable.
When using this pattern, you do not need to write explicit code to check return codes because the leave and trap mechanism takes care of the process of escalating the error and finding a function higher in the call stack which can handle it for you. You do not need to write code to free resources allocated by the function if an error occurs because this is done automatically by the cleanup stack prior to the trap harness being invoked. This is especially true if you use a single trap harness at the top of a call stack which is handling an event.
Runtime performance may be improved.
Use of leave–trap does not require any logic to be written to check for errors except where trap harnesses are located. Functions which call leaving functions but do not handle the leaves themselves do not have to explicitly propagate errors upwards. This means that efficiency during normal operation improves because there is no need to check return values to see if a function call failed or to perform manual cleanup.

Negatives

Traps and leaves are not as flexible as the C++ exception mechanism. A leave can only escalate a single TInt value and hence can only convey error values without any additional context information.
In addition, a trap harness cannot be used to catch selected error values. If this is what you need to do then you have to trap all errors and leave again for those that you can't resolve at that point which is additional code and a performance overhead for your component.
Runtime performance may get worse when handling errors.
A leave is more expensive in terms of CPU usage, compared to returning an error code from a function, due to the cost of the additional machinery required to manage the data structures associated with traps and leaves. In the Symbian OS v9 Application Binary Interface (ABI), this overhead is currently minimal because the C++ compiler's exception-handling mechanism is used to implement the leave–trap mechanism which is usually very efficient in modern compilers.
It is best to use leaves to escalate errors which are not expected to occur many times a second. Out-of-memory and disk-full errors are a good example of non-fatal errors which are relatively infrequent but need to be reported and where a leave is usually the most effective mechanism. Frequent leaves can become a very noticeable performance bottleneck. Leaves also do not work well as a general reporting mechanism for conditions which are not errors. For example, it would not be appropriate to leave from a function that checks for the presence of a multimedia codec capability when that capability is not present. This is inefficient and leads to code bloat due to the requirement on the caller to add a trap to get the result of the check.
Leaves should not be used in real-time code because the leave implementation does not make any real-time guarantees due to the fact that it involves cleaning up any items on the cleanup stack and freeing resources, usually an unbounded operation.
Without additional support, leaves can only be used to escalate errors within the call stack of a single thread.
This pattern cannot be used when writing code that forms part of the Symbian OS kernel, such as device drivers, because the leave–trap operations are not available within the kernel.

3. Example Resolved

In the example, an application wished to send data to a peer via UDP. To do this, it was divided into two layers: the UI, dealing with the end user, and the engine, dealing with Comms-Infras.

Engine Layer

To achieve this, we need to open a UDP connection to be able to communicate with the peer device. The RSocket::Open() function opens a socket and RSocket::Connect() establishes the connection. These operations will fail if Comms-Infras has insufficient resources or the network is unavailable. The engine cannot resolve these errors because it is located at the bottom layer of the application design and does not have the context to try to transparently recover from an error without potentially adversely affecting the end user. In addition, it has no way of releasing resources to resolve local resource contention errors because they are owned and used by other parts of the application it does not have access to.

We could implement escalation of the errors by using function return codes as follows:

TInt CEngine::SendData(const TDesC8& aData)
  {
  // Open the socket server and create a socket
  RSocketServ serv;
  TInt err = serv.Connect();
  if(err < KErrNone)
    {
    return err;
    }

  RSocket sock;
  err = sock.Open(serv,
                  KAfInet,
                  KSockDatagram,
                  KProtocolInetUdp);
  if(err < KErrNone)
    {
    serv.Close();
    return err;
    }

  // Connect to the localhost.
  TInetAddr addr;
  addr.Input(_L("localhost"));
  addr.SetPort(KTelnetPort);
  TRequestStatus status;
  sock.Connect(addr, status);
  User::WaitForRequest(status);
  if(status.Int() < KErrNone)
    {
    sock.Close();
    serv.Close();
    return status.Int();
    }

  // Send the data in a UDP packet.
  sock.Send(aData, 0, status);
  User::WaitForRequest(status);

  sock.Close();
  serv.Close();
  return status.Int();
  }

However, as you can see, the error-handling code is all mixed up with the normal flow of execution making it more difficult to maintain. A better approach would be to use the Symbian OS error-handling facilities, resulting in a much more compact implementation:

void CEngine::SendDataL(const TDesC8& aData)
  {
  // Open the socket server and create a socket
  RSocketServ serv;
  User::LeaveIfError(serv.Connect());
  CleanupClosePushL(serv);

  RSocket sock;
  User::LeaveIfError(sock.Open(serv,
                               KAfInet,
                               KSockDatagram,
                               KProtocolInetUdp));
  CleanupClosePushL(sock);

  // Connect to the localhost.
  TInetAddr addr;
  addr.Input(_L("localhost"));
  addr.SetPort(KTelnetPort);
  TRequestStatus status;
  sock.Connect(addr, status);
  User::WaitForRequest(status);
  User::LeaveIfError(status.Int());

  // Send the data in a UDP packet.
  sock.Send(aData, 0, status);
  User::WaitForRequest(status);
  User::LeaveIfError(status.Int());

  CleanupStack::PopAndDestroy(2); // sock and serv
  }

Note that in the above we rely on the fact that RSocket::Close() does not leave. This is because we use CleanupClosePushL() to tell the cleanup stack to call Close() on both the RSocketServ and RSocket objects if a leave occurs while they're on the cleanup stack. Not leaving is a common property of Symbian OS functions used for cleanup, such as Close(), Release() and Stop(). There is nothing useful that the caller can do if one of these functions fails, so they need to handle errors silently.

UI Layer

In this case, the application implementation relies on the application framework to provide the top-level trap harness to catch all errors escalated upwards by the Engine. When an error is caught by the trap it then calls the CEikAppUi::HandleError() virtual function. By default, this displays the error that occurred in an alert window to the end user. If you've put everything on the cleanup stack then this may be all you need to do. However, an alternative is to override the function and provide a different implementation. Note that HandleError() is called with a number of parameters in addition to the basic error:

TErrorHandlerResponse CEikAppUi::HandleError(TInt aError,
                                             const SExtendedError& aExtErr,
                                             TDes& aErrorText,
                                             TDes& aContextText)

These parameters are filled in by the application framework and go some way to providing extra context that might be needed when resolving the error at the top of the application's call stack. By relying on this, the lower layers of the application can escalate any errors upwards to the top layer in the design to handle the error. Use of this pattern enables errors to be resolved appropriately and minimizes the amount of error-handling code which needs to be written.

4. Other Known Uses

This pattern is used extensively within Symbian OS so here are just a couple of examples:

RArray
This is just one of many classes exported from euser.dll that leaves when it encounters an error. Basic data structures like these don't have any knowledge of why they're being used so they can't resolve any errors. Interestingly, this class provides both a leave and a function return variant of each of its functions. This is so that it can be used kernel-side, where leaves cannot be used, and user-side, where leaves should be used to simplify the calling code as much as possible.
CommsDat
CommsDat is a database engine for communication settings such as network and bearer information. It is rare for an error to occur when accessing a CommsDat record unless the record is missing. Hence by using leaves to report errors, its clients can avoid having to write excessive amounts of error-handling code. The use of the leave mechanism also frees up the return values of functions in the APIs so that they can be used for passing actual data.
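The dual API described for RArray, a return-code variant alongside a leaving variant, can be sketched in portable C++ as below. ArraySketch is a hypothetical class, not RArray itself; the allocation failure is simulated with a fixed capacity, and a leave is emulated with an exception.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

const int KErrNone = 0;      // mirrors the Symbian OS value
const int KErrNoMemory = -4; // mirrors the Symbian OS value

struct LeaveSignal { int code; };

[[noreturn]] void Leave(int code)
{
    assert(code < KErrNone);
    throw LeaveSignal{code};
}

template <typename T>
class ArraySketch
{
public:
    explicit ArraySketch(std::size_t capacity) : iCapacity(capacity) {}

    // Return-code variant: usable where leaves are not available.
    int Append(const T& item)
    {
        if (iItems.size() >= iCapacity) return KErrNoMemory; // simulated failure
        iItems.push_back(item);
        return KErrNone;
    }

    // Leaving variant: the same operation, with failure escalated by a leave.
    void AppendL(const T& item)
    {
        const int err = Append(item);
        if (err < KErrNone) Leave(err);
    }

    std::size_t Count() const { return iItems.size(); }

private:
    std::size_t iCapacity;
    std::vector<T> iItems;
};

// A caller of the leaving variant needs no per-call error checks.
void FillL(ArraySketch<int>& array, int count)
{
    for (int i = 0; i < count; ++i) array.AppendL(i);
}
```

Layering the leaving variant on top of the return-code one keeps a single implementation of the operation while letting each kind of client use the reporting style that suits it.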

5. Variants and Extensions

Escalating Errors over a Process Boundary
A limitation of the basic trap and leave operations is that they just escalate errors within a single thread. However, the Symbian OS Client–Server framework extends the basic mechanism so that if a leave occurs on the server side it is trapped within the framework. The error code is then used to complete the IPC message which initiated the operation that failed. When received on the client side this may be converted into a leave and continue being escalated up the client call stack. However, this is dependent on the implementation of the client-side DLL for the server.
Escalating Errors without Leaving
Whilst this pattern is normally implemented using the leave and trap mechanisms this isn't an essential part of the pattern. In situations where the higher layer didn't originate the current call stack, it may not be possible to use a Leave to escalate the error back to it. Instead the error will need to be passed upwards via an explicit function call which informs the higher layer of the error that needs to be handled. Examples of this commonly occur when an asynchronous request fails for some reason since by the time the response comes back the original call stack no longer exists; for instance, this occurs in Coordinator and Episodes.
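Escalation through an explicit function call can be sketched with a callback interface that the higher layer implements. The following portable C++ is illustrative only: MEngineObserver, Engine, Ui and their member functions are hypothetical names, and the completion of an asynchronous request is simulated by calling RequestComplete() directly.

```cpp
#include <cassert>
#include <vector>

const int KErrNone = 0;       // mirrors the Symbian OS value
const int KErrTimedOut = -33; // mirrors the Symbian OS value

// Hypothetical callback interface implemented by the higher layer.
class MEngineObserver
{
public:
    virtual ~MEngineObserver() = default;
    virtual void HandleEngineError(int error) = 0;
};

class Engine
{
public:
    explicit Engine(MEngineObserver& observer) : iObserver(observer) {}

    // Called when an asynchronous request completes. The call stack that
    // issued the request no longer exists, so a leave from here could not
    // reach the UI; the error is handed over by an explicit call instead.
    void RequestComplete(int status)
    {
        if (status < KErrNone)
        {
            iObserver.HandleEngineError(status);
            return;
        }
        ++iCompleted;
    }

    int Completed() const { return iCompleted; }

private:
    MEngineObserver& iObserver;
    int iCompleted = 0;
};

// Higher layer: records what it would report to the end user.
class Ui : public MEngineObserver
{
public:
    void HandleEngineError(int error) override
    {
        assert(error < KErrNone); // only errors are escalated
        iReported.push_back(error);
    }

    const std::vector<int>& Reported() const { return iReported; }

private:
    std::vector<int> iReported;
};
```

The engine still knows nothing about how the error is presented; it depends only on the small observer interface, so the layering is preserved.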