2.2 Coding Defensively

Writing programs that run correctly under "normal" use is hard; writing programs that behave gracefully in failure situations is harder. This section demonstrates some coding techniques for finding bugs early and for detecting and recovering from problems in a running program.

The code samples presented later in this book deliberately skip extensive error checking and recovery code because this would obscure the basic functionality being presented. However, the final example in Chapter 11, "A Sample GNU/Linux Application," returns to these techniques and demonstrates how to use them to write robust programs.

2.2.1 Using assert

A good objective to keep in mind when coding application programs is that bugs or unexpected errors should cause the program to fail dramatically, as early as possible. This will help you find bugs earlier in the development and testing cycles. Failures that don't exhibit themselves dramatically are often missed and don't show up until the application is in users' hands.

One of the simplest methods to check for unexpected conditions is the standard C assert macro. The argument to this macro is a Boolean expression. The program is terminated if the expression evaluates to false, after printing an error message containing the source file and line number and the text of the expression. The assert macro is very useful for a wide variety of consistency checks internal to a program. For instance, use assert to test the validity of function arguments, to test preconditions and postconditions of function calls (and method calls, in C++), and to test for unexpected return values.
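As a sketch of these kinds of checks, the hypothetical function max_element below uses assert for preconditions on its arguments and for a postcondition on its result. Calling it with a null pointer or a count of zero is a programming error, so the program stops with an assertion message rather than continuing with bad data.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: return the largest element of an array.
   The asserts check the caller's obligations (preconditions) and the
   function's own promise about its result (postcondition).  */
int max_element (const int *values, size_t count)
{
  size_t i;
  int best;

  assert (values != NULL);  /* precondition: a real array */
  assert (count > 0);       /* precondition: at least one element */

  best = values[0];
  for (i = 1; i < count; ++i)
    if (values[i] > best)
      best = values[i];

  /* Postcondition: no element is larger than the result.  */
  for (i = 0; i < count; ++i)
    assert (values[i] <= best);

  return best;
}
```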

Each use of assert serves not only as a runtime check of a condition, but also as documentation about the program's operation within the source code. If your program contains an assert (condition), that tells someone reading your source code that condition should always be true at that point in the program; if condition is not true, it's probably a bug in the program.

For performance-critical code, runtime checks such as uses of assert can impose a significant performance penalty. In these cases, you can compile your code with the NDEBUG macro defined, by using the -DNDEBUG flag on your compiler command line. With NDEBUG set, appearances of the assert macro will be preprocessed away. It's a good idea to do this only when necessary for performance reasons, though, and only with performance-critical source files.

Because it is possible to preprocess assert macros away, be careful that any expression you use with assert has no side effects. Specifically, you shouldn't call functions inside assert expressions, assign variables, or use modifying operators such as ++. Suppose, for example, that you call a function, do_something, repeatedly in a loop. The do_something function returns zero on success and nonzero on failure, but you don't expect it ever to fail in your program. You might be tempted to write:

 
for (i = 0; i < 100; ++i) 
  assert (do_something () == 0); 

However, you might find that this runtime check imposes too large a performance penalty and decide later to recompile with NDEBUG defined. This will remove the assert call entirely, so the expression will never be evaluated and do_something will never be called. You should write this instead:

 
for (i = 0; i < 100; ++i) {
  int status = do_something (); 
  assert (status == 0); 
} 
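To see the difference concretely, the following sketch defines NDEBUG in the source file itself (equivalent to compiling that file with -DNDEBUG) and counts how many times a stand-in for do_something actually runs under each form. The two count_calls_ functions are hypothetical names used only for this demonstration.

```c
/* Simulate compiling with -DNDEBUG: when NDEBUG is defined before
   <assert.h> is included, every assert in this file expands to nothing.  */
#define NDEBUG
#include <assert.h>

static int calls;  /* how many times do_something has run */

/* Stand-in for do_something from the text: always succeeds and
   records each call.  */
static int do_something (void)
{
  ++calls;
  return 0;
}

/* Side effect inside assert: the call disappears along with the assert.  */
int count_calls_inside_assert (void)
{
  int i;
  calls = 0;
  for (i = 0; i < 100; ++i)
    assert (do_something () == 0);
  return calls;
}

/* Call first, then assert on the saved status: the call always happens.  */
int count_calls_outside_assert (void)
{
  int i;
  calls = 0;
  for (i = 0; i < 100; ++i) {
    int status = do_something ();
    assert (status == 0);
    (void) status;  /* avoid an unused-variable warning under NDEBUG */
  }
  return calls;
}
```

With NDEBUG defined, count_calls_inside_assert returns 0 while count_calls_outside_assert returns 100; without NDEBUG, both return 100.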

Another thing to bear in mind is that you should not use assert to test for invalid user input. Users don't like it when applications simply crash with a cryptic error message, even in response to invalid input. You should still always check for invalid input and produce sensible error messages in response to such input. Use assert for internal runtime checks only.
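By contrast, here is a sketch of validating user input without assert. The hypothetical function parse_positive converts a string typed by the user into a positive int with the standard strtol function; bad input is an expected event rather than a bug, so it produces an error message and a failure return value instead of terminating the program.

```c
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* Convert a user-supplied string to a positive int.
   Returns the value, or -1 (after printing a message) if TEXT is invalid.  */
int parse_positive (const char *text)
{
  char *end;
  long value;

  errno = 0;
  value = strtol (text, &end, 10);
  if (end == text || *end != '\0') {
    /* Empty string, or trailing characters that are not digits.  */
    fprintf (stderr, "'%s' is not a number\n", text);
    return -1;
  }
  if (errno == ERANGE || value <= 0 || value > INT_MAX) {
    fprintf (stderr, "'%s' must be between 1 and %d\n", text, INT_MAX);
    return -1;
  }
  return (int) value;
}
```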

Some good places to use assert are these:

·         Check against null pointers, for instance, as invalid function arguments. The error message generated by assert (pointer != NULL),

Assertion 'pointer != ((void *)0)' failed. 

is more informative than the error message that would result if your program dereferenced a null pointer:

 
Segmentation fault (core dumped) 

·         Check conditions on function parameter values. For instance, if a function should be called only with a positive value for parameter foo, use this at the beginning of the function body:

assert (foo > 0); 

This will help you detect misuses of the function, and it also makes it very clear to someone reading the function's source code that there is a restriction on the parameter's value.
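Both kinds of checks appear together in the following sketch. copy_string is a hypothetical helper, not a standard function: it copies at most size - 1 characters and always terminates the result. Passing a null pointer or a zero size is a bug in the caller, so those conditions are asserted at the top of the function body.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy at most SIZE - 1 characters of SRC into DEST,
   always NUL-terminate, and return the number of characters copied.  */
size_t copy_string (char *dest, const char *src, size_t size)
{
  size_t len;

  assert (dest != NULL);  /* null pointers are caller bugs */
  assert (src != NULL);
  assert (size > 0);      /* room for at least the terminator */

  len = strlen (src);
  if (len >= size)
    len = size - 1;
  memcpy (dest, src, len);
  dest[len] = '\0';
  return len;
}
```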

Don't hold back; use assert liberally throughout your programs.

2.2.2 System Call Failures

Most of us were originally taught how to write programs that execute to completion along a well-defined path. We divide the program into tasks and subtasks, and each function completes a task by invoking other functions to perform corresponding subtasks. Given appropriate inputs, we expect a function to produce the correct output and side effects.

The realities of computer hardware and software intrude into this idealized dream. Computers have limited resources; hardware fails; many programs execute at the same time; users and programmers make mistakes. It's often at the boundary between the application and the operating system that these realities exhibit themselves. Therefore, when using system calls to access system resources, to perform I/O, or for other purposes, it's important to understand not only what happens when the call succeeds, but also how and when the call can fail.

System calls can fail in many ways. For example:

·         The system can run out of resources (or the program can exceed the resource limits that the system enforces on a single program). For example, the program might try to allocate too much memory, to write too much to a disk, or to open too many files at the same time.

·         Linux may block a certain system call when a program attempts to perform an operation for which it does not have permission. For example, a program might attempt to write to a file marked read-only, to access the memory of another process, or to kill another user's program.

·         The arguments to a system call might be invalid, either because the user provided invalid input or because of a program bug. For instance, the program might pass an invalid memory address or an invalid file descriptor to a system call. Or, a program might attempt to open a directory as an ordinary file, or might pass the name of an ordinary file to a system call that expects a directory.

·         A system call can fail for reasons external to a program. This happens most often when a system call accesses a hardware device. The device might be faulty or might not support a particular operation, or perhaps a disk is not inserted in the drive.

·         A system call can sometimes be interrupted by an external event, such as the delivery of a signal. This might not indicate outright failure, but it is the responsibility of the calling program to restart the system call, if desired.

In a well-written program that makes extensive use of system calls, it is often the case that more code is devoted to detecting and handling errors and other exceptional circumstances than to the main work of the program.
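A quick way to see some of these failure modes is to trigger them deliberately. The sketch below uses a hypothetical helper, open_error, that returns the error code (the errno value covered in Section 2.2.3) from a failed open. It assumes a Linux system where / is a directory, /dev/null exists, and /no/such/dir does not.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Try to open PATH with FLAGS.  Returns 0 if the open succeeds (the
   descriptor is closed again), or the errno value if it fails.  */
int open_error (const char *path, int flags)
{
  int fd = open (path, flags);
  if (fd == -1)
    return errno;
  close (fd);
  return 0;
}
```

Opening / for writing fails with EISDIR (a directory used as if it were an ordinary file), and opening a path that doesn't exist fails with ENOENT.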

2.2.3 Error Codes from System Calls

Most system calls return zero if the operation succeeds, or a nonzero value (usually -1) if the operation fails. (Many, though, have different return value conventions; for instance, the library function malloc returns a null pointer to indicate failure. Always read the man page carefully when using a system call.) Although this information may be enough to determine whether the program should continue execution as usual, it probably does not provide enough information for a sensible recovery from errors.

Most system calls use a special variable named errno to store additional information in case of failure. [2] When a call fails, the system sets errno to a value indicating what went wrong. Because all system calls use the same errno variable to store error information, you should copy the value into another variable immediately after the failed call; the next system or library call you make may overwrite it.

[2] Actually, for reasons of thread safety, errno is implemented as a macro, but it is used like a global variable.
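The following sketch shows the copy-immediately rule in practice. open_for_reading is a hypothetical function that reports a failed open and then restores errno, because the fprintf call used for the report may itself change errno before the caller gets a chance to inspect it.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

/* Open PATH for reading.  On failure, print a message and return -1
   with the original error code still in errno for the caller.  */
int open_for_reading (const char *path)
{
  int fd = open (path, O_RDONLY);
  if (fd == -1) {
    int saved_errno = errno;  /* copy before making any other call */
    fprintf (stderr, "cannot open %s: %s\n", path, strerror (saved_errno));
    errno = saved_errno;      /* restore it for the caller */
  }
  return fd;
}
```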

Error values are integers; possible values are given by preprocessor macros, by convention named in all capitals and starting with "E"—for example, EACCES and EINVAL. Always use these macros to refer to errno values rather than integer values. Include the <errno.h> header if you use errno values.

GNU/Linux provides a convenient function, strerror, that returns a character string description of an errno error code, suitable for use in error messages. Include <string.h> if you use strerror.

GNU/Linux also provides perror, which prints the error description directly to the stderr stream. Pass to perror a character string prefix to print before the error description, which should usually include the name of the function that failed. Include <stdio.h> if you use perror.
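As a sketch of perror in use, the hypothetical function remove_file below deletes a file with the unlink system call and, on failure, passes the name of the failing function as the prefix. If the file doesn't exist, the message on stderr reads something like "unlink: No such file or directory".

```c
#include <stdio.h>
#include <unistd.h>

/* Remove the file PATH.  Returns 0 on success; on failure, prints
   "unlink: <description of errno>" to stderr and returns -1.  */
int remove_file (const char *path)
{
  if (unlink (path) == -1) {
    perror ("unlink");
    return -1;
  }
  return 0;
}
```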

This code fragment attempts to open a file; if the open fails, it prints an error message and exits the program. Note that the open call returns an open file descriptor if the open operation succeeds, or -1 if the operation fails.

 
fd = open ("inputfile.txt", O_RDONLY); 
if (fd == -1) {
  /* The open failed.  Print an error message and exit.  */ 
  fprintf (stderr, "error opening file: %s\n", strerror (errno)); 
  exit (1); 
} 
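For reference, here is the same fragment wrapped in a function with the headers it needs. The name open_or_exit is hypothetical; the function returns a descriptor for PATH or prints an error and exits with status 1. Note that strerror (errno) is evaluated as an argument before fprintf runs, so fprintf cannot clobber errno first.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Open PATH read-only, or print an error message and exit.  */
int open_or_exit (const char *path)
{
  int fd = open (path, O_RDONLY);
  if (fd == -1) {
    /* The open failed.  Print an error message and exit.  */
    fprintf (stderr, "error opening file %s: %s\n", path, strerror (errno));
    exit (1);
  }
  return fd;
}
```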

Depending on your program and the nature of the system call, the appropriate action in case of failure might be to print an error message, to cancel an operation, to abort the program, to try again, or even to ignore the error. It's important, though, to include logic that handles all possible failure modes in some way or another.

One possible error code that you should watch for, especially with I/O functions, is EINTR. Some functions, such as read, select, and nanosleep, can take significant time to execute. These are considered blocking functions because program execution is blocked until the call is completed. However, if the program receives a signal while blocked in one of these calls, the call may return without completing the operation. In this case, errno is set to EINTR. (If the signal handler was installed with the SA_RESTART flag, some calls, such as read, are restarted automatically instead.) Usually, you'll want to retry the system call in this case.
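The usual retry idiom is a loop that repeats the call as long as it fails with EINTR. A minimal sketch for read follows; read_retry is a hypothetical wrapper name, and the same pattern fits many other blocking calls.

```c
#include <errno.h>
#include <unistd.h>

/* Call read, starting over whenever a signal interrupts it before any
   data arrives (read fails with EINTR).  Any other result, including a
   real error or a short read, is returned to the caller unchanged.  */
ssize_t read_retry (int fd, void *buffer, size_t count)
{
  ssize_t result;

  do
    result = read (fd, buffer, count);
  while (result == -1 && errno == EINTR);
  return result;
}
```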

Here's a code fragment that uses the chown call to change the owner of the file given by path to the user given by user_id. If the call fails, the program takes action depending on the value of errno. Notice that when we detect what's probably a bug in the program, we exit using abort or assert, which can cause a core file to be generated. This can be useful for post-mortem debugging. For other unrecoverable errors, such as out-of-memory conditions, we exit using exit and a nonzero exit value instead because a core file wouldn't be very useful.

 
rval = chown (path, user_id, -1); 
if (rval != 0) {
  /* Save errno because it's clobbered by the next system call.  */ 
  int error_code = errno; 
  /* The operation didn't succeed; chown should return -1 on error.  */ 
  assert (rval == -1); 
  /* Check the value of errno, and take appropriate action.  */ 
  switch (error_code) {
  case EPERM:         /* Not permitted to change the owner.  */
  case EROFS:         /* PATH is on a read-only file system. */
  case ENAMETOOLONG:  /* PATH is too long.  */
  case ENOENT:        /* PATH does not exist.  */
  case ENOTDIR:       /* A component of PATH is not a directory.  */ 
  case EACCES:        /* A component of PATH is not accessible.  */ 
    /* Something's wrong with the file. Print an error message.  */ 
    fprintf (stderr, "error changing ownership of %s: %s\n", 
             path, strerror (error_code)); 
    /* Don't end the program; perhaps give the user a chance to 
       choose another file...   */ 
    break; 
 
  case EFAULT: 
    /* PATH contains an invalid memory address.  This is probably a bug.  */ 
    abort (); 
 
  case ENOMEM: 
    /* Ran out of kernel memory.  */ 
    fprintf (stderr, "%s\n", strerror (error_code)); 
    exit (1); 
 
  default: 
    /* Some other, unexpected, error code. We've tried to handle all 
       possible error codes; if we've missed one, that's a bug!  */ 
    abort (); 
  }
} 

You could simply have used this code, which behaves the same way if the call succeeds:

 
rval = chown (path, user_id, -1); 
assert (rval == 0); 

But if the call fails, this alternative makes no effort to report, handle, or recover from errors.

Whether you use the first form, the second form, or something in between depends on the error detection and recovery requirements for your program.
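One possible middle ground is sketched below. The hypothetical function change_owner reports every failure but, instead of deciding what to do itself, returns the error code so the caller can choose; EFAULT is still treated as a bug.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Change the owner of PATH to USER_ID, leaving the group unchanged.
   Returns 0 on success or the errno value on failure.  */
int change_owner (const char *path, uid_t user_id)
{
  int error_code;

  if (chown (path, user_id, (gid_t) -1) == 0)
    return 0;
  error_code = errno;  /* save before fprintf can change it */
  assert (error_code != EFAULT);  /* a bad address is a program bug */
  fprintf (stderr, "error changing ownership of %s: %s\n",
           path, strerror (error_code));
  return error_code;
}
```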

2.2.4 Errors and Resource Allocation

Often, when a system call fails, it's appropriate to cancel the current operation but not to terminate the program because it may be possible to recover from the error. One way to do this is to return from the current function, passing a return code to the caller indicating the error.

If you decide to return from the middle of a function, it's important to make sure that any resources successfully allocated previously in the function are first deallocated. These resources can include memory, file descriptors, file pointers, temporary files, synchronization objects, and so on. Otherwise, if your program continues running, the resources allocated before the failure occurred will be leaked.

Consider, for example, a function that reads from a file into a buffer. The function might follow these steps:

1.       Allocate the buffer.

2.       Open the file.

3.       Read from the file into the buffer.

4.       Close the file.

5.       Return the buffer.

If the file doesn't exist, Step 2 will fail. An appropriate course of action might be to return NULL from the function. However, if the buffer has already been allocated in Step 1, there is a risk of leaking that memory. You must remember to deallocate the buffer along every path through the function that does not return it. If Step 3 fails, not only must you deallocate the buffer before returning, but you also must close the file.

Listing 2.6 shows an example of how you might write this function.

Listing 2.6 (readfile.c) Freeing Resources During Abnormal Conditions
#include <fcntl.h> 
#include <stdlib.h> 
#include <sys/stat.h> 
#include <sys/types.h> 
#include <unistd.h> 
char* read_from_file (const char* filename, size_t length) 
{
  char* buffer; 
  int fd; 
  ssize_t bytes_read; 
 
  /* Allocate the buffer.  */ 
  buffer = (char*) malloc (length); 
  if (buffer == NULL) 
    return NULL; 
  /* Open the file.  */ 
  fd = open (filename, O_RDONLY); 
  if (fd == -1) {
    /* open failed. Deallocate buffer before returning.  */ 
    free (buffer); 
    return NULL; 
  } 
  /* Read the data.  */ 
  bytes_read = read (fd, buffer, length); 
  if (bytes_read == -1 || (size_t) bytes_read != length) {
    /* read failed or came up short. Deallocate buffer and close fd before returning.  */
    free (buffer); 
    close (fd); 
    return NULL; 
  } 
  /* Everything's fine. Close the file and return the buffer.  */ 
  close (fd); 
  return buffer; 
} 
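A common alternative in C, and not the approach Listing 2.6 takes, is to send every failure to one cleanup label with goto, so the deallocation code is written only once. The sketch below is a hypothetical variant, read_from_file_goto, with the same behavior as Listing 2.6, including treating a short read as a failure.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Same contract as read_from_file in Listing 2.6, written with the
   "goto cleanup" idiom: every failure jumps to a single exit path that
   releases whatever has been acquired so far.  */
char* read_from_file_goto (const char* filename, size_t length)
{
  char* buffer;
  int fd = -1;
  ssize_t bytes_read;

  /* Allocate the buffer.  */
  buffer = (char*) malloc (length);
  if (buffer == NULL)
    goto fail;
  /* Open the file.  */
  fd = open (filename, O_RDONLY);
  if (fd == -1)
    goto fail;
  /* Read the data; a short read counts as a failure.  */
  bytes_read = read (fd, buffer, length);
  if (bytes_read == -1 || (size_t) bytes_read != length)
    goto fail;
  /* Everything's fine. Close the file and return the buffer.  */
  close (fd);
  return buffer;

 fail:
  /* free (NULL) does nothing, so buffer needs no test here.  */
  free (buffer);
  if (fd != -1)
    close (fd);
  return NULL;
}
```

This form scales more gracefully as resources are added: each new resource needs one check where it is acquired and one line in the cleanup block, rather than a longer release sequence in every error branch.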

Linux cleans up allocated memory, open files, and most other resources when a program terminates, so it's not necessary to deallocate buffers and close files before calling exit. You might need to manually free other shared resources, however, such as temporary files and shared memory, which can potentially outlive a program.
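For temporary files, one way to guarantee cleanup even if the program crashes or is killed is to unlink the file right after opening it. The name disappears at once, and the kernel reclaims the storage when the last descriptor is closed, which includes process termination. The helper below, open_anonymous_temp, is hypothetical; the caller chooses the path, and O_EXCL makes the open fail rather than reuse an existing file.

```c
#include <fcntl.h>
#include <unistd.h>

/* Create a scratch file at PATH that cannot outlive the process: the
   name is removed as soon as the file is open.  Returns the descriptor,
   or -1 on failure.  If unlink fails, the named file may remain.  */
int open_anonymous_temp (const char *path)
{
  int fd = open (path, O_RDWR | O_CREAT | O_EXCL, 0600);
  if (fd == -1)
    return -1;
  if (unlink (path) == -1) {
    close (fd);
    return -1;
  }
  return fd;
}
```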