Poster of Linux kernelThe best gift for a Linux geek
 Linux kernel map 
⇦ prev ⇱ home next ⇨

6.3. poll and select

Applications that use nonblocking I/O often use the poll, select, and epoll system calls as well. poll, select, and epoll have essentially the same functionality: each allow a process to determine whether it can read from or write to one or more open files without blocking. These calls can also block a process until any of a given set of file descriptors becomes available for reading or writing. Therefore, they are often used in applications that must use multiple input or output streams without getting stuck on any one of them. The same functionality is offered by multiple functions, because two were implemented in Unix almost at the same time by two different groups: select was introduced in BSD Unix, whereas poll was the System V solution. The epoll call[4] was added in 2.5.45 as a way of making the polling function scale to thousands of file descriptors.

[4] Actually, epoll is a set of three calls that together can be used to achieve the polling functionality. For our purposes, though, we can think of it as a single call.

Support for any of these calls requires support from the device driver. This support (for all three calls) is provided through the driver's poll method. This method has the following prototype:

unsigned int (*poll) (struct file *filp, poll_table *wait);

The driver method is called whenever the user-space program performs a poll, select, or epoll system call involving a file descriptor associated with the driver. The device method is in charge of these two steps:

  1. Call poll_wait on one or more wait queues that could indicate a change in the poll status. If no file descriptors are currently available for I/O, the kernel causes the process to wait on the wait queues for all file descriptors passed to the system call.

  2. Return a bit mask describing the operations (if any) that could be immediately performed without blocking.

Both of these operations are usually straightforward and tend to look very similar from one driver to the next. They rely, however, on information that only the driver can provide and, therefore, must be implemented individually by each driver.

The poll_table structure, the second argument to the poll method, is used within the kernel to implement the poll, select, and epoll calls; it is declared in <linux/poll.h>, which must be included by the driver source. Driver writers do not need to know anything about its internals and must use it as an opaque object; it is passed to the driver method so that the driver can load it with every wait queue that could wake up the process and change the status of the poll operation. The driver adds a wait queue to the poll_table structure by calling the function poll_wait:

 void poll_wait (struct file *, wait_queue_head_t *, poll_table *);

The second task performed by the poll method is returning the bit mask describing which operations could be completed immediately; this is also straightforward. For example, if the device has data available, a read would complete without sleeping; the poll method should indicate this state of affairs. Several flags (defined via <linux/poll.h>) are used to indicate the possible operations:

POLLIN

This bit must be set if the device can be read without blocking.

POLLRDNORM

This bit must be set if "normal" data is available for reading. A readable device returns (POLLIN | POLLRDNORM).

POLLRDBAND

This bit indicates that out-of-band data is available for reading from the device. It is currently used only in one place in the Linux kernel (the DECnet code) and is not generally applicable to device drivers.

POLLPRI

High-priority data (out-of-band) can be read without blocking. This bit causes select to report that an exception condition occurred on the file, because select reports out-of-band data as an exception condition.

POLLHUP

When a process reading this device sees end-of-file, the driver must set POLLHUP (hang-up). A process calling select is told that the device is readable, as dictated by the select functionality.

POLLERR

An error condition has occurred on the device. When poll is invoked, the device is reported as both readable and writable, since both read and write return an error code without blocking.

POLLOUT

This bit is set in the return value if the device can be written to without blocking.

POLLWRNORM

This bit has the same meaning as POLLOUT, and sometimes it actually is the same number. A writable device returns (POLLOUT | POLLWRNORM).

POLLWRBAND

Like POLLRDBAND, this bit means that data with nonzero priority can be written to the device. Only the datagram implementation of poll uses this bit, since a datagram can transmit out-of-band data.

It's worth repeating that POLLRDBAND and POLLWRBAND are meaningful only with file descriptors associated with sockets: device drivers won't normally use these flags.

The description of poll takes up a lot of space for something that is relatively simple to use in practice. Consider the scullpipe implementation of the poll method:

static unsigned int scull_p_poll(struct file *filp, poll_table *wait)
{
    struct scull_pipe *dev = filp->private_data;
    unsigned int mask = 0;

    /*
     * The buffer is circular; it is considered full
     * if "wp" is right behind "rp" and empty if the
     * two are equal.
     */
    down(&dev->sem);
    poll_wait(filp, &dev->inq,  wait);
    poll_wait(filp, &dev->outq, wait);
    if (dev->rp != dev->wp)
        mask |= POLLIN | POLLRDNORM;    /* readable */
    if (spacefree(dev))
        mask |= POLLOUT | POLLWRNORM;   /* writable */
    up(&dev->sem);
    return mask;
}

This code simply adds the two scullpipe wait queues to the poll_table, then sets the appropriate mask bits depending on whether data can be read or written.

The poll code as shown is missing end-of-file support, because scullpipe does not support an end-of-file condition. For most real devices, the poll method should return POLLHUP if no more data is (or will become) available. If the caller used the select system call, the file is reported as readable. Regardless of whether poll or select is used, the application knows that it can call read without waiting forever, and the read method returns, 0 to signal end-of-file.

With real FIFOs, for example, the reader sees an end-of-file when all the writers close the file, whereas in scullpipe the reader never sees end-of-file. The behavior is different because a FIFO is intended to be a communication channel between two processes, while scullpipe is a trash can where everyone can put data as long as there's at least one reader. Moreover, it makes no sense to reimplement what is already available in the kernel, so we chose to implement a different behavior in our example.

Implementing end-of-file in the same way as FIFOs do would mean checking dev->nwriters, both in read and in poll, and reporting end-of-file (as just described) if no process has the device opened for writing. Unfortunately, though, with this implementation, if a reader opened the scullpipe device before the writer, it would see end-of-file without having a chance to wait for data. The best way to fix this problem would be to implement blocking within open like real FIFOs do; this task is left as an exercise for the reader.

6.3.1. Interaction with read and write

The purpose of the poll and select calls is to determine in advance if an I/O operation will block. In that respect, they complement read and write. More important, poll and select are useful, because they let the application wait simultaneously for several data streams, although we are not exploiting this feature in the scull examples.

A correct implementation of the three calls is essential to make applications work correctly: although the following rules have more or less already been stated, we summarize them here.

6.3.1.1 Reading data from the device
  • If there is data in the input buffer, the read call should return immediately, with no noticeable delay, even if less data is available than the application requested, and the driver is sure the remaining data will arrive soon. You can always return less data than you're asked for if this is convenient for any reason (we did it in scull), provided you return at least one byte. In this case, poll should return POLLIN|POLLRDNORM.

  • If there is no data in the input buffer, by default read must block until at least one byte is there. If O_NONBLOCK is set, on the other hand, read returns immediately with a return value of -EAGAIN (although some old versions of System V return 0 in this case). In these cases, poll must report that the device is unreadable until at least one byte arrives. As soon as there is some data in the buffer, we fall back to the previous case.

  • If we are at end-of-file, read should return immediately with a return value of 0, independent of O_NONBLOCK. poll should report POLLHUP in this case.

6.3.1.2 Writing to the device
  • If there is space in the output buffer, write should return without delay. It can accept less data than the call requested, but it must accept at least one byte. In this case, poll reports that the device is writable by returning POLLOUT|POLLWRNORM.

  • If the output buffer is full, by default write blocks until some space is freed. If O_NONBLOCK is set, write returns immediately with a return value of -EAGAIN (older System V Unices returned 0). In these cases, poll should report that the file is not writable. If, on the other hand, the device is not able to accept any more data, write returns -ENOSPC ("No space left on device"), independently of the setting of O_NONBLOCK.

  • Never make a write call wait for data transmission before returning, even if O_NONBLOCK is clear. This is because many applications use select to find out whether a write will block. If the device is reported as writable, the call must not block. If the program using the device wants to ensure that the data it enqueues in the output buffer is actually transmitted, the driver must provide an fsync method. For instance, a removable device should have an fsync entry point.

Although this is a good set of general rules, one should also recognize that each device is unique and that sometimes the rules must be bent slightly. For example, record-oriented devices (such as tape drives) cannot execute partial writes.

6.3.1.3 Flushing pending output

We've seen how the write method by itself doesn't account for all data output needs. The fsync function, invoked by the system call of the same name, fills the gap. This method's prototype is

 int (*fsync) (struct file *file, struct dentry *dentry, int datasync);

If some application ever needs to be assured that data has been sent to the device, the fsync method must be implemented regardless of whether O_NONBLOCK is set. A call to fsync should return only when the device has been completely flushed (i.e., the output buffer is empty), even if that takes some time. The datasync argument is used to distinguish between the fsync and fdatasync system calls; as such, it is only of interest to filesystem code and can be ignored by drivers.

The fsync method has no unusual features. The call isn't time critical, so every device driver can implement it to the author's taste. Most of the time, char drivers just have a NULL pointer in their fops. Block devices, on the other hand, always implement the method with the general-purpose block_fsync, which, in turn, flushes all the blocks of the device, waiting for I/O to complete.

6.3.2. The Underlying Data Structure

The actual implementation of the poll and select system calls is reasonably simple, for those who are interested in how it works; epoll is a bit more complex but is built on the same mechanism. Whenever a user application calls poll, select, or epoll_ctl,[5] the kernel invokes the poll method of all files referenced by the system call, passing the same poll_table to each of them. The poll_table structure is just a wrapper around a function that builds the actual data structure. That structure, for poll and select, is a linked list of memory pages containing poll_table_entry structures. Each poll_table_entry holds the struct file and wait_queue_head_t pointers passed to poll_wait, along with an associated wait queue entry. The call to poll_wait sometimes also adds the process to the given wait queue. The whole structure must be maintained by the kernel so that the process can be removed from all of those queues before poll or select returns.

[5] This is the function that sets up the internal data structure for future calls to epoll_wait.

If none of the drivers being polled indicates that I/O can occur without blocking, the poll call simply sleeps until one of the (perhaps many) wait queues it is on wakes it up.

What's interesting in the implementation of poll is that the driver's poll method may be called with a NULL pointer as a poll_table argument. This situation can come about for a couple of reasons. If the application calling poll has provided a timeout value of 0 (indicating that no wait should be done), there is no reason to accumulate wait queues, and the system simply does not do it. The poll_table pointer is also set to NULL immediately after any driver being polled indicates that I/O is possible. Since the kernel knows at that point that no wait will occur, it does not build up a list of wait queues.

When the poll call completes, the poll_table structure is deallocated, and all wait queue entries previously added to the poll table (if any) are removed from the table and their wait queues.

We tried to show the data structures involved in polling in Figure 6-1; the figure is a simplified representation of the real data structures, because it ignores the multipage nature of a poll table and disregards the file pointer that is part of each poll_table_entry. The reader interested in the actual implementation is urged to look in <linux/poll.h> and fs/select.c.

Figure 6-1. The data structures behind poll


At this point, it is possible to understand the motivation behind the new epoll system call. In a typical case, a call to poll or select involves only a handful of file descriptors, so the cost of setting up the data structure is small. There are applications out there, however, that work with thousands of file descriptors. At that point, setting up and tearing down this data structure between every I/O operation becomes prohibitively expensive. The epoll system call family allows this sort of application to set up the internal kernel data structure exactly once and to use it many times.

    ⇦ prev ⇱ home next ⇨
    Poster of Linux kernelThe best gift for a Linux geek