
7.3. Delaying Execution

Device drivers often need to delay the execution of a particular piece of code for a period of time, usually to allow the hardware to accomplish some task. In this section we cover a number of different techniques for achieving delays. The circumstances of each situation determine which technique is best to use; we go over them all, and point out the advantages and disadvantages of each.

One important thing to consider is how the delay you need compares with the clock tick, considering the range of HZ across the various platforms. Delays that are reliably longer than the clock tick, and don't suffer from its coarse granularity, can make use of the system clock. Very short delays typically must be implemented with software loops. In between these two cases lies a gray area. In this chapter, we use the phrase "long" delay to refer to a multiple-jiffy delay, which can be as low as a few milliseconds on some platforms, but is still long as seen by the CPU and the kernel.

The following sections talk about the different delays by taking a somewhat long path from various intuitive but inappropriate solutions to the right solution. We chose this path because it allows a more in-depth discussion of kernel issues related to timing. If you are eager to find the right code, just skim through the section.

7.3.1. Long Delays

Occasionally a driver needs to delay execution for relatively long periods—more than one clock tick. There are a few ways of accomplishing this sort of delay; we start with the simplest technique, then proceed to the more advanced techniques.

7.3.1.1 Busy waiting

If you want to delay execution by a multiple of the clock tick, allowing some slack in the value, the easiest (though not recommended) implementation is a loop that monitors the jiffy counter. The busy-waiting implementation usually looks like the following code, where j1 is the value of jiffies at the expiration of the delay:

while (time_before(jiffies, j1))
    cpu_relax();
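
The value of j1 is typically computed just before the loop by adding the desired number of ticks to the current value of the counter; putting the pieces together, a minimal sketch for a one-second delay (the one-second figure is only illustrative) would be:

unsigned long j1 = jiffies + HZ;    /* expire HZ ticks (one second) from now */

while (time_before(jiffies, j1))
    cpu_relax();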

The call to cpu_relax invokes an architecture-specific way of saying that you're not doing much with the processor at the moment. On many systems it does nothing at all; on symmetric multithreaded ("hyperthreaded") systems, it may yield the core to the other thread. In any case, this approach should definitely be avoided whenever possible. We show it here because on occasion you might want to run this code to better understand the internals of other code.

So let's look at how this code works. The loop is guaranteed to work because jiffies is declared as volatile by the kernel headers and, therefore, is fetched from memory any time some C code accesses it. Although technically correct (in that it works as designed), this busy loop severely degrades system performance. If you didn't configure your kernel for preemptive operation, the loop completely locks the processor for the duration of the delay; the scheduler never preempts a process that is running in kernel space, and the computer looks completely dead until time j1 is reached. The problem is less serious if you are running a preemptive kernel, because, unless the code is holding a lock, some of the processor's time can be recovered for other uses. Busy waits are still expensive on preemptive systems, however.

Still worse, if interrupts happen to be disabled when you enter the loop, jiffies won't be updated, and the while condition remains true forever. Running a preemptive kernel won't help either, and you'll be forced to hit the big red button.

This implementation of delaying code is available, like the following ones, in the jit module. The /proc/jit* files created by the module delay a whole second each time you read a line of text, and lines are guaranteed to be 20 bytes each. If you want to test the busy-wait code, you can read /proc/jitbusy, which busy-loops for one second for each line it returns.

Be sure to read, at most, one line (or a few lines) at a time from /proc/jitbusy. The simplified kernel mechanism to register /proc files invokes the read method over and over to fill the data buffer the user requested. Therefore, a command such as cat /proc/jitbusy, if it reads 4 KB at a time, freezes the computer for 205 seconds (4096 bytes at 20 bytes per line is about 205 lines, each delaying one second).


The suggested command to read /proc/jitbusy is dd bs=20 < /proc/jitbusy, optionally specifying the number of blocks as well. Each 20-byte line returned by the file represents the value the jiffy counter had before and after the delay. This is a sample run on an otherwise unloaded computer:

phon% dd bs=20 count=5 < /proc/jitbusy
  1686518   1687518
  1687519   1688519
  1688520   1689520
  1689520   1690520
  1690521   1691521

All looks good: delays are exactly one second (1000 jiffies), and the next read system call starts immediately after the previous one is over. But let's see what happens on a system with a large number of CPU-intensive processes running (and a nonpreemptive kernel):

phon% dd bs=20 count=5 < /proc/jitbusy
  1911226   1912226
  1913323   1914323
  1919529   1920529
  1925632   1926632
  1931835   1932835

Here, each read system call delays exactly one second, but the kernel can take more than 5 seconds before scheduling the dd process so it can issue the next system call. That's expected in a multitasking system; CPU time is shared between all running processes, and a CPU-intensive process has its dynamic priority reduced. (A discussion of scheduling policies is outside the scope of this book.)

The test under load shown above has been performed while running the load50 sample program. This program forks a number of processes that do nothing, but do it in a CPU-intensive way. The program is part of the sample files accompanying this book, and forks 50 processes by default, although the number can be specified on the command line. In this chapter, and elsewhere in the book, the tests with a loaded system have been performed with load50 running in an otherwise idle computer.

If you repeat the command while running a preemptible kernel, you'll find no noticeable difference on an otherwise idle CPU and the following behavior under load:

phon% dd bs=20 count=5 < /proc/jitbusy
 14940680  14942777
 14942778  14945430
 14945431  14948491
 14948492  14951960
 14951961  14955840

Here, there is no significant delay between the end of one system call and the beginning of the next one, but the individual delays are far longer than one second: up to 3.8 seconds in the example shown, and increasing over time. These values demonstrate that the process has been preempted during its delay so that other processes could run. Since scheduling can now happen inside the busy loop itself, the gap between system calls is no longer the only point at which other processes get the CPU, so no extra delay shows up there.

7.3.1.2 Yielding the processor

As we have seen, busy waiting imposes a heavy load on the system as a whole; we would like to find a better technique. The first change that comes to mind is to explicitly release the CPU when we're not interested in it. This is accomplished by calling the schedule function, declared in <linux/sched.h>:

while (time_before(jiffies, j1)) {
    schedule();
}

This loop can be tested by reading /proc/jitsched as we read /proc/jitbusy above. However, it still isn't optimal. The current process does nothing but release the CPU, but it remains in the run queue. If it is the only runnable process, it actually runs (it calls the scheduler, which selects the same process, which calls the scheduler, which . . . ). In other words, the load of the machine (the average number of running processes) is at least one, and the idle task (process number 0, also called swapper for historical reasons) never runs. Though this issue may seem irrelevant, running the idle task when the computer is idle relieves the processor's workload, decreasing its temperature and increasing its lifetime, as well as extending battery life if the computer happens to be a laptop. Moreover, since the process is actually executing during the delay, it is accountable for all the time it consumes.

The behavior of /proc/jitsched is actually similar to running /proc/jitbusy under a preemptive kernel. This is a sample run, on an unloaded system:

phon% dd bs=20 count=5 < /proc/jitsched
  1760205   1761207
  1761209   1762211
  1762212   1763212
  1763213   1764213
  1764214   1765217

It's interesting to note that each read sometimes ends up waiting a few clock ticks more than requested. This problem gets worse and worse as the system gets busy, and the driver could end up waiting longer than expected. Once a process releases the processor with schedule, there are no guarantees that the process will get the processor back anytime soon. Therefore, calling schedule in this manner is not a safe solution to the driver's needs, in addition to being bad for the computing system as a whole. If you test jitsched while running load50, you can see that the delay associated with each line is extended by a few seconds, because other processes are using the CPU when the timeout expires.

7.3.1.3 Timeouts

The suboptimal delay loops shown up to now work by watching the jiffy counter without telling anyone. But the best way to implement a delay, as you may imagine, is usually to ask the kernel to do it for you. There are two ways of setting up jiffy-based timeouts, depending on whether your driver is waiting for other events or not.

If your driver uses a wait queue to wait for some other event, but you also want to be sure that it runs within a certain period of time, it can use wait_event_timeout or wait_event_interruptible_timeout:

#include <linux/wait.h>
long wait_event_timeout(wait_queue_head_t q, condition, long timeout);
long wait_event_interruptible_timeout(wait_queue_head_t q,
                      condition, long timeout);

These functions sleep on the given wait queue, but they return after the timeout (expressed in jiffies) expires. Thus, they implement a bounded sleep that does not go on forever. Note that the timeout value represents the number of jiffies to wait, not an absolute time value. The value is represented by a signed number, because it sometimes is the result of a subtraction, although the functions complain through a printk statement if the provided timeout is negative. If the timeout expires, the functions return 0; if the process is awakened by another event, they return the remaining delay expressed in jiffies. Barring signals, the return value is never negative, even if the delay is greater than expected because of system load; wait_event_interruptible_timeout does, however, return -ERESTARTSYS if a signal breaks the sleep.
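
To make the calling convention concrete, here is a minimal sketch of how a driver might bound its wait for data from an interrupt handler; the wait queue mydrv_wq, the flag data_ready, and the one-second bound are hypothetical names chosen for illustration, not part of any real module:

static DECLARE_WAIT_QUEUE_HEAD(mydrv_wq);   /* hypothetical wait queue */
static int data_ready;                      /* set to 1 by the (hypothetical) interrupt handler */

/* In the read method: wait at most one second (HZ jiffies) for data. */
long left = wait_event_interruptible_timeout(mydrv_wq, data_ready, HZ);
if (left == 0)
    return -ETIMEDOUT;     /* the timeout expired and data_ready is still 0 */
if (left < 0)
    return left;           /* a signal broke the sleep (-ERESTARTSYS) */
/* otherwise, data arrived with 'left' jiffies of the timeout to spare */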

The /proc/jitqueue file shows a delay based on wait_event_interruptible_timeout, although the module has no event to wait for, and uses 0 as a condition:

wait_queue_head_t wait;
init_waitqueue_head(&wait);
wait_event_interruptible_timeout(wait, 0, delay);

The observed behavior, when reading /proc/jitqueue, is nearly optimal, even under load:

phon% dd bs=20 count=5 < /proc/jitqueue
  2027024   2028024
  2028025   2029025
  2029026   2030026
  2030027   2031027
  2031028   2032028

Since the reading process (dd above) is not in the run queue while waiting for the timeout, you see no difference in behavior whether the code is run in a preemptive kernel or not.

wait_event_timeout and wait_event_interruptible_timeout were designed with a hardware driver in mind, where execution could be resumed in either of two ways: either somebody calls wake_up on the wait queue, or the timeout expires. This doesn't apply to jitqueue, as nobody ever calls wake_up on the wait queue (after all, no other code even knows about it), so the process always wakes up when the timeout expires. To accommodate this very situation, where you want to delay execution while waiting for no specific event, the kernel offers the schedule_timeout function so you can avoid declaring and using a superfluous wait queue head:

#include <linux/sched.h>
signed long schedule_timeout(signed long timeout);

Here, timeout is the number of jiffies to delay. The return value is 0 unless the function returns before the given timeout has elapsed (in response to a signal, for example), in which case it is the number of jiffies remaining. schedule_timeout requires that the caller first set the current process state, so a typical call looks like:

set_current_state(TASK_INTERRUPTIBLE);
schedule_timeout(delay);

The previous lines (from /proc/jitschedto) cause the process to sleep until the given time has passed. Since wait_event_interruptible_timeout relies on schedule_timeout internally, we won't bother showing the numbers jitschedto returns, because they are the same as those of jitqueue. Once again, it is worth noting that an extra time interval could pass between the expiration of the timeout and when your process is actually scheduled to execute.

In the example just shown, the first line calls set_current_state to set things up so that the scheduler won't run the current process again until the timeout places it back in TASK_RUNNING state. To achieve an uninterruptible delay, use TASK_UNINTERRUPTIBLE instead. If you forget to change the state of the current process, a call to schedule_timeout behaves like a call to schedule (i.e., the jitsched behavior), setting up a timer that is not used.
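
Putting the two lines together, a tiny helper that delays the current process for a given number of jiffies without being disturbed by signals might look like the following; this is only a sketch, and the helper name jiffy_delay is hypothetical:

static void jiffy_delay(long delay)     /* delay is expressed in jiffies */
{
    set_current_state(TASK_UNINTERRUPTIBLE);
    schedule_timeout(delay);
}

Calling jiffy_delay(HZ / 4), for instance, pauses the caller for roughly a quarter of a second, plus whatever scheduling latency passes before the process runs again.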

If you want to play with the four jit files under different system situations or different kernels, or try other ways to delay execution, you may want to configure the amount of the delay when loading the module by setting the delay module parameter.

7.3.2. Short Delays

When a device driver needs to deal with latencies in its hardware, the delays involved are usually a few dozen microseconds at most. In this case, relying on the clock tick is definitely not the way to go.

The kernel functions ndelay, udelay, and mdelay serve well for short delays, delaying execution for the specified number of nanoseconds, microseconds, or milliseconds respectively.[2] Their prototypes are:

[2] The u in udelay represents the Greek letter mu and stands for micro.

#include <linux/delay.h>
void ndelay(unsigned long nsecs);
void udelay(unsigned long usecs);
void mdelay(unsigned long msecs);

The actual implementations of the functions, being architecture-specific, are in <asm/delay.h> and sometimes build on an external function. Every architecture implements udelay, but the other functions may or may not be defined; if they are not, <linux/delay.h> offers a default version based on udelay. In all cases, the delay achieved is at least the requested value but could be more; actually, no platform currently achieves nanosecond precision, although several platforms offer submicrosecond precision. Delaying more than the requested value is usually not a problem, as short delays in a driver are usually needed to wait for the hardware, and the requirement is to wait for at least a given time lapse.
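
As a concrete (and entirely hypothetical) example of where these functions fit: suppose a device's datasheet requires about 10 microseconds between writing a command register and reading back the status. The register names, the ioport variable, and the 10-microsecond figure below are invented purely for illustration:

unsigned char status;

outb(CMD_RESET, ioport + CMD_REG);    /* hypothetical command write */
udelay(10);                           /* busy-wait the 10 us the (imaginary) datasheet calls for */
status = inb(ioport + STATUS_REG);    /* the status register is now valid */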

The implementation of udelay (and possibly ndelay too) uses a software loop based on the processor speed calculated at boot time, using the integer variable loops_per_jiffy. If you want to look at the actual code, however, be aware that the x86 implementation is quite a complex one because of the different timing sources it uses, based on what CPU type is running the code.

To avoid integer overflows in loop calculations, udelay and ndelay impose an upper bound on the value passed to them. If your module fails to load and displays an unresolved symbol, __bad_udelay, it means you called udelay with too large an argument. Note, however, that the compile-time check can be performed only on constant values and that not all platforms implement it. As a general rule, if you are trying to delay for thousands of nanoseconds, you should be using udelay rather than ndelay; similarly, millisecond-scale delays should be done with mdelay and not one of the finer-grained functions.

It's important to remember that the three delay functions are busy-waiting; other tasks can't run during the time lapse, so they replicate, though on a different scale, the behavior of jitbusy. For this reason, these functions should only be used when there is no practical alternative.

There is another way of achieving millisecond (and longer) delays that does not involve busy waiting. The file <linux/delay.h> declares these functions:

void msleep(unsigned int millisecs);
unsigned long msleep_interruptible(unsigned int millisecs);
void ssleep(unsigned int seconds);

The first two functions put the calling process to sleep for the given number of milliseconds. A call to msleep is uninterruptible; you can be sure that the process sleeps for at least the given number of milliseconds. If you need the sleep to be breakable (by a signal, for example), use msleep_interruptible. The return value from msleep_interruptible is normally 0; if, however, the process is awakened early, the return value is the number of milliseconds remaining in the originally requested sleep period. A call to ssleep puts the process into an uninterruptible sleep for the given number of seconds.
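
For instance, a driver that needs to give its hardware roughly half a second to settle, but wants a signal to be able to cut the wait short, might do something like the following (a minimal sketch; the 500-millisecond figure is purely illustrative):

unsigned long pending;

pending = msleep_interruptible(500);       /* sleep up to 500 ms */
if (pending)
    return -ERESTARTSYS;                   /* a signal arrived; 'pending' ms were still left */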

In general, if you can tolerate longer delays than requested, you should use schedule_timeout, msleep, or ssleep.
