
7.1. Measuring Time Lapses

The kernel keeps track of the flow of time by means of timer interrupts. Interrupts are covered in detail in Chapter 10.

Timer interrupts are generated by the system's timing hardware at regular intervals; this interval is programmed at boot time by the kernel according to the value of HZ, which is an architecture-dependent value defined in <linux/param.h> or a subplatform file included by it. Default values in the distributed kernel source range from 50 to 1200 ticks per second on real hardware, down to 24 for software simulators. Most platforms run at 100 or 1000 interrupts per second; the popular x86 PC defaults to 1000, although it used to be 100 in previous versions (up to and including 2.4). As a general rule, even if you know the value of HZ, you should never count on that specific value when programming.

It is possible to change the value of HZ for those who want systems with a different clock interrupt frequency. If you change HZ in the header file, you need to recompile the kernel and all modules with the new value. You might want to raise HZ to get a more fine-grained resolution in your asynchronous tasks, if you are willing to pay the overhead of the extra timer interrupts to achieve your goals. Actually, raising HZ to 1000 was pretty common with x86 industrial systems using Version 2.4 or 2.2 of the kernel. With current versions, however, the best approach to the timer interrupt is to keep the default value for HZ, by virtue of our complete trust in the kernel developers, who have certainly chosen the best value. Besides, some internal calculations are currently implemented only for HZ in the range from 12 to 1535 (see <linux/timex.h> and RFC-1589).

Every time a timer interrupt occurs, the value of an internal kernel counter is incremented. The counter is initialized to 0 at system boot, so it represents the number of clock ticks since last boot. The counter is a 64-bit variable (even on 32-bit architectures) and is called jiffies_64. However, driver writers normally access the jiffies variable, an unsigned long that is the same as either jiffies_64 or its least significant bits. Using jiffies is usually preferred because it is faster, and accesses to the 64-bit jiffies_64 value are not necessarily atomic on all architectures.

In addition to the low-resolution kernel-managed jiffy mechanism, some CPU platforms feature a high-resolution counter that software can read. Although its actual use varies somewhat across platforms, it's sometimes a very powerful tool.

7.1.1. Using the jiffies Counter

The counter and the utility functions to read it live in <linux/jiffies.h>, although you'll usually just include <linux/sched.h>, which automatically pulls jiffies.h in. Needless to say, both jiffies and jiffies_64 must be considered read-only.

Whenever your code needs to remember the current value of jiffies, it can simply access the unsigned long variable, which is declared as volatile to tell the compiler not to optimize memory reads. You need to read the current counter whenever your code needs to calculate a future time stamp, as shown in the following example:

#include <linux/jiffies.h>
unsigned long j, stamp_1, stamp_half, stamp_n;

j = jiffies;                      /* read the current value */
stamp_1    = j + HZ;              /* 1 second in the future */
stamp_half = j + HZ/2;            /* half a second */
stamp_n    = j + n * HZ / 1000;   /* n milliseconds */

This code has no problem with jiffies wrapping around, as long as different values are compared in the right way. Even though on 32-bit platforms the counter wraps around only once every 50 days when HZ is 1000, your code should be prepared to face that event. To compare your cached value (like stamp_1 above) and the current value, you should use one of the following macros:

#include <linux/jiffies.h>
int time_after(unsigned long a, unsigned long b);
int time_before(unsigned long a, unsigned long b);
int time_after_eq(unsigned long a, unsigned long b);
int time_before_eq(unsigned long a, unsigned long b);

The first evaluates true when a, as a snapshot of jiffies, represents a time after b, the second evaluates true when time a is before time b, and the last two compare for "after or equal" and "before or equal." The code works by converting the values to signed long, subtracting them, and comparing the result. If you need to know the difference between two instances of jiffies in a safe way, you can use the same trick: diff = (long)t2 - (long)t1;.
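As a rough user-space illustration of why the signed-difference trick survives a wraparound, the sketch below re-creates two of the comparisons as plain functions. The demo_ names are invented for this example and are not kernel symbols. The sketch subtracts in unsigned arithmetic before converting to long, which avoids signed overflow in standard C while still producing the signed difference described above.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical user-space stand-ins; the real macros live in <linux/jiffies.h>. */

/* Nonzero when tick value a is later than b, even across one wraparound. */
static int demo_time_after(unsigned long a, unsigned long b)
{
    return (long)(b - a) < 0;
}

/* Nonzero when tick value a is earlier than b. */
static int demo_time_before(unsigned long a, unsigned long b)
{
    return demo_time_after(b, a);
}

/* Signed distance from t1 to t2; valid while the two values are
 * less than half the counter range apart. */
static long demo_tick_diff(unsigned long t1, unsigned long t2)
{
    return (long)(t2 - t1);
}

/* Walk through a wraparound: the counter is 5 ticks short of overflowing. */
static void demo_wraparound_check(void)
{
    unsigned long now = ULONG_MAX - 5;
    unsigned long deadline = now + 10;       /* wraps around to 4 */

    assert(deadline < now);                  /* a plain comparison is fooled */
    assert(demo_time_after(deadline, now));  /* the signed difference is not */
    assert(demo_time_before(now, deadline));
    assert(demo_tick_diff(now, deadline) == 10);
}
```

Calling demo_wraparound_check() exercises the case that a naive `deadline > now` test gets wrong.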

You can convert a jiffies difference to milliseconds trivially through:

msec = diff * 1000 / HZ;
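One caveat with this expression: on a 32-bit platform the product diff * 1000 overflows an unsigned long once diff exceeds roughly 4.29 million ticks (about 71 minutes when HZ is 1000). The following user-space sketch, which assumes a tick rate of 250 and uses made-up names, splits the value into whole seconds and a remainder so that only the remainder is multiplied. It is one possible way to avoid the overflow, not the kernel's own code.

```c
#include <assert.h>

#define DEMO_HZ 250UL  /* assumed tick rate, for illustration only */

/* Convert a tick difference to milliseconds without the intermediate
 * overflow that diff * 1000 can hit on 32-bit platforms. */
static unsigned long demo_ticks_to_msec(unsigned long diff)
{
    unsigned long secs = diff / DEMO_HZ;   /* whole seconds */
    unsigned long rem  = diff % DEMO_HZ;   /* leftover ticks, always < DEMO_HZ */

    return secs * 1000UL + rem * 1000UL / DEMO_HZ;
}

static void demo_msec_check(void)
{
    assert(demo_ticks_to_msec(0) == 0);
    assert(demo_ticks_to_msec(1) == 4);       /* one tick is 4 ms at 250 Hz */
    assert(demo_ticks_to_msec(125) == 500);   /* half a second */
    assert(demo_ticks_to_msec(251) == 1004);  /* one second plus one tick */
}
```

Whenever the simple expression does not overflow, both forms give the same result, because the whole-second part divides evenly.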

Sometimes, however, you need to exchange time representations with user space programs that tend to represent time values with struct timeval and struct timespec. The two structures represent a precise time quantity with two numbers: seconds and microseconds are used in the older and popular struct timeval, and seconds and nanoseconds are used in the newer struct timespec. The kernel exports four helper functions to convert time values expressed as jiffies to and from those structures:

#include <linux/time.h>

unsigned long timespec_to_jiffies(struct timespec *value);
void jiffies_to_timespec(unsigned long jiffies, struct timespec *value);
unsigned long timeval_to_jiffies(struct timeval *value);
void jiffies_to_timeval(unsigned long jiffies, struct timeval *value);
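To give a feel for what such a conversion involves, here is a user-space sketch that converts between struct timeval and a tick count. It assumes a tick rate of 250 and uses invented demo_ names rather than the kernel helpers. It also rounds partial ticks up so that a converted interval is never shorter than requested; that rounding choice is an assumption of this sketch.

```c
#include <assert.h>
#include <sys/time.h>

#define DEMO_HZ 250UL  /* assumed tick rate, for illustration only */

/* Seconds and microseconds to ticks, rounding partial ticks up. */
static unsigned long demo_timeval_to_ticks(const struct timeval *tv)
{
    unsigned long ticks = (unsigned long)tv->tv_sec * DEMO_HZ;

    ticks += ((unsigned long)tv->tv_usec * DEMO_HZ + 999999UL) / 1000000UL;
    return ticks;
}

/* Ticks back to seconds and microseconds. */
static void demo_ticks_to_timeval(unsigned long ticks, struct timeval *tv)
{
    tv->tv_sec  = (time_t)(ticks / DEMO_HZ);
    tv->tv_usec = (suseconds_t)((ticks % DEMO_HZ) * 1000000UL / DEMO_HZ);
}

static void demo_timeval_check(void)
{
    struct timeval in = { 1, 500000 };   /* 1.5 seconds */
    struct timeval out;

    assert(demo_timeval_to_ticks(&in) == 375);
    demo_ticks_to_timeval(375, &out);
    assert(out.tv_sec == 1 && out.tv_usec == 500000);
}
```

With the rounding used here, even a one-microsecond interval converts to one full tick rather than zero.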

Accessing the 64-bit jiffy count is not as straightforward as accessing jiffies. While on 64-bit computer architectures the two variables are actually one, access to the value is not atomic for 32-bit processors. This means you might read the wrong value if both halves of the variable get updated while you are reading them. It's extremely unlikely you'll ever need to read the 64-bit counter, but in case you do, you'll be glad to know that the kernel exports a specific helper function that does the proper locking for you:

#include <linux/jiffies.h>
u64 get_jiffies_64(void);

In the above prototype, the u64 type is used. This is one of the types defined by <linux/types.h> and represents an unsigned 64-bit type.

If you're wondering how 32-bit platforms update both the 32-bit and 64-bit counters at the same time, read the linker script for your platform (look for a file whose name matches vmlinux*.lds*). There, the jiffies symbol is defined to access the least significant word of the 64-bit value, according to whether the platform is little-endian or big-endian. Actually, the same trick is used for 64-bit platforms, so that the unsigned long and u64 variables are accessed at the same address.
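The effect of that aliasing can be imitated in user space with a union. The sketch below is only an analogy under stated assumptions: the kernel uses a linker-script symbol rather than a union, the demo_ names are invented, and the short view is taken to be 32 bits wide.

```c
#include <assert.h>
#include <stdint.h>

/* One 64-bit counter viewed either whole or as two 32-bit words. */
union demo_counter {
    uint64_t full;
    uint32_t word[2];
};

/* Index of the least significant word: 0 on little-endian, 1 on big-endian. */
static int demo_low_word_index(void)
{
    union demo_counter probe;

    probe.full = 1;
    return probe.word[0] == 1 ? 0 : 1;
}

/* Read the short counter through the alias, much as jiffies aliases jiffies_64. */
static uint32_t demo_read_low(const union demo_counter *c)
{
    return c->word[demo_low_word_index()];
}

static void demo_alias_check(void)
{
    union demo_counter c;

    c.full = 0xFFFFFFFFULL;             /* low word about to wrap */
    assert(demo_read_low(&c) == 0xFFFFFFFFu);

    c.full += 1;                        /* a single update of the 64-bit value */
    assert(demo_read_low(&c) == 0);     /* the short view wrapped to zero */
    assert(c.full == 0x100000000ULL);   /* the full view kept counting */
}
```

A single increment of the 64-bit value updates both views at once, which is why no separate bookkeeping is needed for the short counter.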

Finally, note that the actual clock frequency is almost completely hidden from user space. The macro HZ always expands to 100 when user-space programs include param.h, and every counter reported to user space is converted accordingly. This applies to clock(3), times(2), and any related function. The only evidence of the HZ value available to users is how fast timer interrupts happen, as shown in /proc/interrupts. For example, you can obtain HZ by dividing the timer interrupt count by the system uptime reported in /proc/uptime.
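The division itself is easy to sketch. The function below is a hypothetical helper that takes a timer interrupt count and an uptime in seconds, values you would read from /proc/interrupts and /proc/uptime yourself, and returns the nearest whole tick rate. It assumes a single CPU whose timer interrupt fires at a steady rate.

```c
#include <assert.h>

/* Estimate the tick rate from a timer interrupt count and the uptime in
 * seconds. Returns 0 if the uptime is not positive. */
static unsigned long demo_estimate_hz(unsigned long long timer_irqs,
                                      double uptime_seconds)
{
    if (uptime_seconds <= 0.0)
        return 0;
    /* Adding 0.5 before truncating rounds to the nearest integer. */
    return (unsigned long)((double)timer_irqs / uptime_seconds + 0.5);
}

static void demo_hz_check(void)
{
    /* 1,000,000 interrupts over 1000 seconds of uptime */
    assert(demo_estimate_hz(1000000ULL, 1000.0) == 1000);
    /* a slightly noisy sample still lands on 100 */
    assert(demo_estimate_hz(250130ULL, 2500.0) == 100);
    assert(demo_estimate_hz(42ULL, 0.0) == 0);
}
```

Rounding to the nearest integer absorbs the small mismatch that arises because the two files cannot be read at exactly the same instant.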

7.1.2. Processor-Specific Registers

If you need to measure very short time intervals or you need extremely high precision in your figures, you can resort to platform-dependent resources, a choice of precision over portability.

In modern processors, the pressing demand for empirical performance figures is thwarted by the intrinsic unpredictability of instruction timing in most CPU designs due to cache memories, instruction scheduling, and branch prediction. As a response, CPU manufacturers introduced a way to count clock cycles as an easy and reliable way to measure time lapses. Therefore, most modern processors include a counter register that is steadily incremented once at each clock cycle. Nowadays, this clock counter is the only reliable way to carry out high-resolution timekeeping tasks.

The details differ from platform to platform: the register may or may not be readable from user space, it may or may not be writable, and it may be 64 or 32 bits wide. In the last case, you must be prepared to handle overflows just like we did with the jiffy counter. The register may not even exist on your platform, or the hardware designer may have implemented it in an external device if the CPU lacks the feature and you are dealing with a special-purpose computer.

Whether or not the register can be zeroed, we strongly discourage resetting it, even when hardware permits. You might not, after all, be the only user of the counter at any given time; on some platforms supporting SMP, for example, the kernel depends on such a counter to be synchronized across processors. Since you can always measure differences between values, as long as that difference doesn't exceed the overflow time, you can get the work done without claiming exclusive ownership of the register by modifying its current value.

The most renowned counter register is the TSC (timestamp counter), introduced in x86 processors with the Pentium and present in all CPU designs ever since—including the x86_64 platform. It is a 64-bit register that counts CPU clock cycles; it can be read from both kernel space and user space.

After including <asm/msr.h> (an x86-specific header whose name stands for "machine-specific registers"), you can use one of these macros:

rdtsc(low32,high32);
rdtscl(low32);
rdtscll(var64);

The first macro atomically reads the 64-bit value into two 32-bit variables; the next one ("read low half") reads the low half of the register into a 32-bit variable, discarding the high half; the last reads the 64-bit value into a long long variable, hence the name. All of these macros store values into their arguments.
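If you read the two halves separately and later need the full value, joining them is a matter of a shift and an OR. The helper below is a small sketch with an invented name; it runs in user space and does not read the register itself.

```c
#include <assert.h>
#include <stdint.h>

/* Join the low and high 32-bit halves of a cycle count into one
 * 64-bit value. The cast must come before the shift, or the high
 * half would be shifted out of a 32-bit type. */
static uint64_t demo_join_halves(uint32_t low32, uint32_t high32)
{
    return ((uint64_t)high32 << 32) | low32;
}

static void demo_join_check(void)
{
    assert(demo_join_halves(0x00000002u, 0x00000001u) == 0x0000000100000002ULL);
    assert(demo_join_halves(0xFFFFFFFFu, 0) == 0xFFFFFFFFULL);
}
```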

Reading the low half of the counter is enough for most common uses of the TSC. A 1-GHz CPU overflows it only once every 4.2 seconds, so you won't need to deal with multiregister variables if the time lapse you are benchmarking reliably takes less time. However, as CPU frequencies rise over time and as timing requirements increase, you'll most likely need to read the 64-bit counter more often in the future.
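The 4.2-second figure is simply 2^32 cycles divided by the clock frequency, and unsigned subtraction keeps a difference of two low-half readings correct even if the counter wraps once between them. The user-space sketch below illustrates both points; the function names are invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Seconds until a 32-bit cycle counter wraps at the given clock frequency. */
static double demo_wrap_seconds(double cpu_hz)
{
    assert(cpu_hz > 0.0);
    return 4294967296.0 / cpu_hz;   /* 2^32 cycles */
}

/* Cycles elapsed between two low-half readings; correct as long as
 * fewer than 2^32 cycles passed between them. */
static uint32_t demo_cycle_delta(uint32_t start, uint32_t end)
{
    return (uint32_t)(end - start);
}

static void demo_low_half_check(void)
{
    double s = demo_wrap_seconds(1e9);   /* 1-GHz clock */

    assert(s > 4.29 && s < 4.30);
    /* start just before the wrap, end just after it */
    assert(demo_cycle_delta(0xFFFFFFF0u, 0x00000010u) == 0x20u);
}
```

At 3 GHz the same calculation gives a little over 1.4 seconds, which shows how quickly the low half alone stops being sufficient as clock rates rise.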

As an example using only the low half of the register, the following lines measure how long the read instruction itself takes to execute:

unsigned long ini, end;
rdtscl(ini); rdtscl(end);
printk("time lapse: %lu\n", end - ini);   /* unsigned difference, so %lu */

Some of the other platforms offer similar functionality, and kernel headers offer an architecture-independent function that you can use instead of rdtsc. It is called get_cycles, defined in <asm/timex.h> (included by <linux/timex.h>). Its prototype is:

#include <linux/timex.h>
cycles_t get_cycles(void);

This function is defined for every platform, and it always returns 0 on the platforms that have no cycle-counter register. The cycles_t type is an appropriate unsigned type to hold the value read.

Despite the availability of an architecture-independent function, we'd like to take the opportunity to show an example of inline assembly code. To this aim, we implement a rdtscl function for MIPS processors that works in the same way as the x86 one.

We base the example on MIPS because most MIPS processors feature a 32-bit counter as register 9 of their internal "coprocessor 0." To access the register, readable only from kernel space, you can define the following macro that executes a "move from coprocessor 0" assembly instruction:[1]

[1] The trailing nop instruction is required to prevent the compiler from accessing the target register in the instruction immediately following mfc0. This kind of interlock is typical of RISC processors, and the compiler can still schedule useful instructions in the delay slots. In this case, we use nop because inline assembly is a black box for the compiler and no optimization can be performed.

#define rdtscl(dest) \
   __asm__ __volatile__("mfc0 %0,$9; nop" : "=r" (dest))

With this macro in place, the MIPS processor can execute the same code shown earlier for the x86.

With gcc inline assembly, the allocation of general-purpose registers is left to the compiler. The macro just shown uses %0 as a placeholder for "argument 0," which is later specified as "any register (r) used as output (=)." The macro also states that the output register must correspond to the C expression dest. The syntax for inline assembly is very powerful but somewhat complex, especially for architectures that have constraints on what each register can do (namely, the x86 family). The syntax is described in the gcc documentation, usually available in the info documentation tree.

The short C-code fragment shown in this section has been run on a K7-class x86 processor and a MIPS VR4181 (using the macro just described). The former reported a time lapse of 11 clock ticks and the latter just 2 clock ticks. The small figure was expected, since RISC processors usually execute one instruction per clock cycle.

There is one other thing worth knowing about timestamp counters: they are not necessarily synchronized across processors in an SMP system. To be sure of getting a coherent value, you should disable preemption for code that is querying the counter.
