16.3. Request Processing

The core of every block driver is its request function. This function is where the real work gets done—or at least started; all the rest is overhead. Consequently, we spend a fair amount of time looking at request processing in block drivers.

A disk driver's performance can be a critical part of the performance of the system as a whole. Therefore, the kernel's block subsystem has been written with performance very much in mind; it does everything possible to enable your driver to get the most out of the devices it controls. This is a good thing, in that it enables blindingly fast I/O. On the other hand, the block subsystem unnecessarily exposes a great deal of complexity in the driver API. It is possible to write a very simple request function (we will see one shortly), but if your driver must perform at a high level on complex hardware, it will be anything but simple.

16.3.1. Introduction to the request Method

The block driver request method has the following prototype:

void request(request_queue_t *queue);

This function is called whenever the kernel believes it is time for your driver to process some reads, writes, or other operations on the device. The request function does not need to actually complete all of the requests on the queue before it returns; indeed, it probably does not complete any of them for most real devices. It must, however, make a start on those requests and ensure that they are all, eventually, processed by the driver.

Every device has a request queue. This is because actual transfers to and from a disk can take place far away from the time the kernel requests them, and because the kernel needs the flexibility to schedule each transfer at the most propitious moment (grouping together, for instance, requests that affect sectors close together on the disk). And the request function, you may remember, is associated with a request queue when that queue is created. Let us look back at how sbull makes its queue:

dev->queue = blk_init_queue(sbull_request, &dev->lock);

Thus, when the queue is created, the request function is associated with it. We also provided a spinlock as part of the queue creation process. Whenever our request function is called, that lock is held by the kernel. As a result, the request function is running in an atomic context; it must follow all of the usual rules for atomic code discussed in Chapter 5.

The queue lock also prevents the kernel from queuing any other requests for your device while your request function holds the lock. Under some conditions, you may want to consider dropping that lock while the request function runs. If you do so, however, you must be sure not to access the request queue, or any other data structure protected by the lock, while the lock is not held. You must also reacquire the lock before the request function returns.

Finally, the invocation of the request function is (usually) entirely asynchronous with respect to the actions of any user-space process. You cannot assume that the kernel is running in the context of the process that initiated the current request. You do not know if the I/O buffer provided by the request is in kernel or user space. So any sort of operation that explicitly accesses user space is in error and will certainly lead to trouble. As you will see, everything your driver needs to know about the request is contained within the structures passed to you via the request queue.

16.3.2. A Simple request Method

The sbull example driver provides a few different methods for request processing. By default, sbull uses a method called sbull_request, which is meant to be an example of the simplest possible request method. Without further ado, here it is:

static void sbull_request(request_queue_t *q)
{
    struct request *req;

    while ((req = elv_next_request(q)) != NULL) {
        struct sbull_dev *dev = req->rq_disk->private_data;
        if (! blk_fs_request(req)) {
            printk (KERN_NOTICE "Skip non-fs request\n");
            end_request(req, 0);
            continue;
        }
        sbull_transfer(dev, req->sector, req->current_nr_sectors,
                req->buffer, rq_data_dir(req));
        end_request(req, 1);
    }
}

This function introduces the struct request structure. We will examine struct request in great detail later on; for now, suffice it to say that it represents a block I/O request for us to execute.

The kernel provides the function elv_next_request to obtain the first incomplete request on the queue; that function returns NULL when there are no requests to be processed. Note that elv_next_request does not remove the request from the queue. If you call it twice with no intervening operations, it returns the same request structure both times. In this simple mode of operation, requests are taken off the queue only when they are complete.
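
The peek-then-complete pattern described here can be illustrated with a small userspace model. This is only a sketch: the types and the model_ functions are hypothetical stand-ins, not the kernel's struct request, request_queue_t, or elv_next_request. It shows that asking for the next request twice returns the same entry, and that only completion advances the queue.

```c
#include <stddef.h>

/* Userspace model only: hypothetical names, not kernel structures. */
struct model_request {
    int id;
    struct model_request *next;
};

struct model_queue {
    struct model_request *head;
};

/* Return the first incomplete request, or NULL, without removing it. */
static struct model_request *model_next_request(struct model_queue *q)
{
    return q->head;
}

/* Completing the head request is what takes it off the queue. */
static void model_end_request(struct model_queue *q)
{
    if (q->head != NULL)
        q->head = q->head->next;
}

/* Process requests one at a time, as the simple loop does;
 * return how many were handled. */
static int model_drain(struct model_queue *q)
{
    int handled = 0;

    while (model_next_request(q) != NULL) {
        model_end_request(q);
        handled++;
    }
    return handled;
}
```

If model_end_request were omitted from the loop, model_next_request would keep returning the same entry and the loop would never finish, which is the same hazard a real request function faces if it forgets to complete a request.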

A block request queue can contain requests that do not actually move blocks to and from a disk. Such requests can include vendor-specific, low-level diagnostics operations or instructions relating to specialized device modes, such as the packet writing mode for recordable media. Most block drivers do not know how to handle such requests and simply fail them; sbull works in this way as well. The call to blk_fs_request tells us whether we are looking at a filesystem request, that is, one that moves blocks of data. If a request is not a filesystem request, we pass it to end_request:

void end_request(struct request *req, int succeeded);

When we dispose of nonfilesystem requests, we pass succeeded as 0 to indicate that we did not successfully complete the request. Otherwise, we call sbull_transfer to actually move the data, using a set of fields provided in the request structure:

sector_t sector;

The index of the beginning sector on our device. Remember that this sector number, like all such numbers passed between the kernel and the driver, is expressed in 512-byte sectors. If your hardware uses a different sector size, you need to scale sector accordingly. For example, if the hardware uses 2048-byte sectors, you need to divide the beginning sector number by four before putting it into a request for the hardware.

unsigned long nr_sectors;

The number of (512-byte) sectors to be transferred.

char *buffer;

A pointer to the buffer to or from which the data should be transferred. This pointer is a kernel virtual address and can be dereferenced directly by the driver if need be.

rq_data_dir(struct request *req);

This macro extracts the direction of the transfer from the request; a zero return value denotes a read from the device, and a nonzero return value denotes a write to the device.
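
The sector scaling mentioned under the sector field can be sketched in plain C. The helper name kernel_to_hw_sector is hypothetical, and the sketch assumes the hardware sector size is a multiple of 512 bytes; it is meant only to make the arithmetic concrete.

```c
#include <assert.h>

#define KERNEL_SECTOR_SIZE 512UL

/* Convert a 512-byte kernel sector index to a hardware sector index.
 * hw_sector_size is assumed to be a nonzero multiple of 512. */
static unsigned long kernel_to_hw_sector(unsigned long sector,
        unsigned long hw_sector_size)
{
    unsigned long ratio;

    assert(hw_sector_size >= KERNEL_SECTOR_SIZE);
    assert(hw_sector_size % KERNEL_SECTOR_SIZE == 0);
    ratio = hw_sector_size / KERNEL_SECTOR_SIZE;
    return sector / ratio;
}
```

With 2048-byte hardware sectors the ratio is four, so kernel sector 8 maps to hardware sector 2.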

Given this information, the sbull driver can implement the actual data transfer with a simple memcpy call—our data is already in memory, after all. The function that performs this copy operation (sbull_transfer) also handles the scaling of sector sizes and ensures that we do not try to copy beyond the end of our virtual device:

static void sbull_transfer(struct sbull_dev *dev, unsigned long sector,
        unsigned long nsect, char *buffer, int write)
{
    unsigned long offset = sector*KERNEL_SECTOR_SIZE;
    unsigned long nbytes = nsect*KERNEL_SECTOR_SIZE;

    if ((offset + nbytes) > dev->size) {
        printk (KERN_NOTICE "Beyond-end write (%lu %lu)\n", offset, nbytes);
        return;
    }
    if (write)
        memcpy(dev->data + offset, buffer, nbytes);
    else
        memcpy(buffer, dev->data + offset, nbytes);
}

With the code, sbull implements a complete, simple RAM-based disk device. It is not, however, a realistic driver for many types of devices, for a couple of reasons.

The first of those reasons is that sbull executes requests synchronously, one at a time. High-performance disk devices are capable of having numerous requests outstanding at the same time; the disk's onboard controller can then choose to execute them in the optimal order (one hopes). As long as we process only the first request in the queue, we can never have multiple requests being fulfilled at a given time. Being able to work with more than one request requires a deeper understanding of request queues and the request structure; the next few sections help build that understanding.

There is another issue to consider, however. The best performance is obtained from disk devices when the system performs large transfers involving multiple sectors that are located together on the disk. The highest cost in a disk operation is always the positioning of the read and write heads; once that is done, the time required to actually read or write the data is almost insignificant. The developers who design and implement filesystems and virtual memory subsystems understand this, so they do their best to locate related data contiguously on the disk and to transfer as many sectors as possible in a single request. The block subsystem also helps in this regard; request queues contain a great deal of logic aimed at finding adjacent requests and coalescing them into larger operations.

The sbull driver, however, takes all that work and simply ignores it. Only one buffer is transferred at a time, meaning that the largest single transfer is almost never going to exceed the size of a single page. A block driver can do much better than that, but it requires a deeper understanding of request structures and the bio structures from which requests are built.

The next few sections delve more deeply into how the block layer does its job and the data structures that result from that work.

16.3.3. Request Queues

In the simplest sense, a block request queue is exactly that: a queue of block I/O requests. If you look under the hood, a request queue turns out to be a surprisingly complex data structure. Fortunately, drivers need not worry about most of that complexity.

Request queues keep track of outstanding block I/O requests. But they also play a crucial role in the creation of those requests. The request queue stores parameters that describe what kinds of requests the device is able to service: their maximum size, how many separate segments may go into a request, the hardware sector size, alignment requirements, etc. If your request queue is properly configured, it should never present you with a request that your device cannot handle.

Request queues also implement a plug-in interface that allows multiple I/O schedulers (or elevators) to be used. An I/O scheduler's job is to present I/O requests to your driver in a way that maximizes performance. To this end, most I/O schedulers accumulate a batch of requests, sort them into increasing (or decreasing) block index order, and present the requests to the driver in that order. The disk head, when given a sorted list of requests, works its way from one end of the disk to the other, much like a full elevator moves in a single direction until all of its "requests" (people waiting to get off) have been satisfied. The 2.6 kernel includes a "deadline scheduler," which makes an effort to ensure that every request is satisfied within a preset maximum time, and an "anticipatory scheduler," which actually stalls a device briefly after a read request in anticipation that another, adjacent read will arrive almost immediately. As of this writing, the default scheduler is the anticipatory scheduler, which seems to give the best interactive system performance.

The I/O scheduler is also charged with merging adjacent requests. When a new I/O request is handed to the scheduler, it searches the queue for requests involving adjacent sectors; if one is found and if the resulting request would not be too large, the two requests are merged.
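
The merge test just described can be sketched as a userspace model. The struct and function names below are hypothetical and the real scheduler logic is considerably more involved; the direction check reflects the rule, stated later in this chapter, that reads and writes are never combined in one request.

```c
#include <stdbool.h>

/* Hypothetical userspace model of a request; not the kernel's struct. */
struct merge_req {
    unsigned long sector;      /* first 512-byte sector */
    unsigned long nr_sectors;  /* length in sectors */
    int write;                 /* nonzero for a write */
};

/* Try to append 'next' to 'req'. This succeeds only if next starts
 * exactly where req ends, both move data in the same direction, and
 * the combined request fits within max_sectors. */
static bool model_try_back_merge(struct merge_req *req,
        const struct merge_req *next, unsigned long max_sectors)
{
    if (req->sector + req->nr_sectors != next->sector)
        return false;
    if ((req->write != 0) != (next->write != 0))
        return false;
    if (req->nr_sectors + next->nr_sectors > max_sectors)
        return false;
    req->nr_sectors += next->nr_sectors;
    return true;
}
```

Two 8-sector reads at sectors 8 and 16 merge into one 16-sector read; a write starting at sector 24 would be refused even though it is adjacent.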

Request queues have a type of struct request_queue or request_queue_t. This type, and the many functions that operate on it, are defined in <linux/blkdev.h>. If you are interested in the implementation of request queues, you can find most of the code in drivers/block/ll_rw_blk.c and elevator.c.

16.3.3.1 Queue creation and deletion

As we saw in our example code, a request queue is a dynamic data structure that must be created by the block I/O subsystem. The function to create and initialize a request queue is:

request_queue_t *blk_init_queue(request_fn_proc *request, spinlock_t *lock);

The arguments are, of course, the request function for this queue and a spinlock that controls access to the queue. This function allocates memory (quite a bit of memory, actually) and can fail because of this; you should always check the return value before attempting to use the queue.

As part of the initialization of a request queue, you can set the field queuedata (which is a void * pointer) to any value you like. This field is the request queue's equivalent to the private_data we have seen in other structures.

To return a request queue to the system (at module unload time, generally), call blk_cleanup_queue :

void blk_cleanup_queue(request_queue_t *);

After this call, your driver sees no more requests from the given queue and should not reference it again.

16.3.3.2 Queueing functions

There is a very small set of functions for the manipulation of requests on queues—at least, as far as drivers are concerned. You must hold the queue lock before you call these functions.

The function that returns the next request to process is elv_next_request :

struct request *elv_next_request(request_queue_t *queue);

We have already seen this function in the simple sbull example. It returns a pointer to the next request to process (as determined by the I/O scheduler) or NULL if no more requests remain to be processed. elv_next_request leaves the request on the queue but marks it as being active; this mark prevents the I/O scheduler from attempting to merge other requests with this one once you start to execute it.

To actually remove a request from a queue, use blkdev_dequeue_request :

void blkdev_dequeue_request(struct request *req);

If your driver operates on multiple requests from the same queue simultaneously, it must dequeue them in this manner.

Should you need to put a dequeued request back on the queue for some reason, you can call:

void elv_requeue_request(request_queue_t *queue, struct request *req);

16.3.3.3 Queue control functions

The block layer exports a set of functions that can be used by a driver to control how a request queue operates. These functions include:

void blk_stop_queue(request_queue_t *queue);

void blk_start_queue(request_queue_t *queue);

If your device has reached a state where it can handle no more outstanding commands, you can call blk_stop_queue to tell the block layer. After this call, your request function will not be called until you call blk_start_queue. Needless to say, you should not forget to restart the queue when your device can handle more requests. The queue lock must be held when calling either of these functions.

void blk_queue_bounce_limit(request_queue_t *queue, u64 dma_addr);

Function that tells the kernel the highest physical address to which your device can perform DMA. If a request comes in containing a reference to memory above the limit, a bounce buffer will be used for the operation; this is, of course, an expensive way to perform block I/O and should be avoided whenever possible. You can provide any reasonable physical address in this argument, or make use of the predefined symbols BLK_BOUNCE_HIGH (use bounce buffers for high-memory pages), BLK_BOUNCE_ISA (the driver can DMA only into the 16-MB ISA zone), or BLK_BOUNCE_ANY (the driver can perform DMA to any address). The default value is BLK_BOUNCE_HIGH.

void blk_queue_max_sectors(request_queue_t *queue, unsigned short max);

void blk_queue_max_phys_segments(request_queue_t *queue, unsigned short max);

void blk_queue_max_hw_segments(request_queue_t *queue, unsigned short max);

void blk_queue_max_segment_size(request_queue_t *queue, unsigned int max);

Functions that set parameters describing the requests that can be satisfied by this device. blk_queue_max_sectors can be used to set the maximum size of any request in (512-byte) sectors; the default is 255. blk_queue_max_phys_segments and blk_queue_max_hw_segments both control how many physical segments (nonadjacent areas in system memory) may be contained within a single request. Use blk_queue_max_phys_segments to say how many segments your driver is prepared to cope with; this may be the size of a statically allocated scatterlist, for example. blk_queue_max_hw_segments, in contrast, is the maximum number of segments that the device itself can handle. Both of these parameters default to 128. Finally, blk_queue_max_segment_size tells the kernel how large any individual segment of a request can be in bytes; the default is 65,536 bytes.

void blk_queue_segment_boundary(request_queue_t *queue, unsigned long mask);

Some devices cannot handle requests that cross a particular size memory boundary; if your device is one of those, use this function to tell the kernel about that boundary. For example, if your device has trouble with requests that cross a 4-MB boundary, pass in a mask of 0x3fffff. The default mask is 0xffffffff.

void blk_queue_dma_alignment(request_queue_t *queue, int mask);

Function that tells the kernel about the memory alignment constraints your device imposes on DMA transfers. All requests are created with the given alignment, and the length of the request also matches the alignment. The default mask is 0x1ff, which causes all requests to be aligned on 512-byte boundaries.

void blk_queue_hardsect_size(request_queue_t *queue, unsigned short max);

Tells the kernel about your device's hardware sector size. All requests generated by the kernel are a multiple of this size and are properly aligned. All communications between the block layer and the driver continue to be expressed in 512-byte sectors, however.
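
To make the mask arithmetic behind several of these limits concrete, here is a userspace sketch. The model_limits structure and both functions are hypothetical; they only mirror how a bounce limit, a maximum segment size, a boundary mask such as 0x3fffff, and an alignment mask such as 0x1ff could be tested against a single memory segment. They are not how the block layer itself is implemented.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model of a few queue limits. */
struct model_limits {
    uint64_t bounce_limit;     /* highest DMA-able physical address */
    uint32_t max_segment_size; /* bytes */
    uint64_t boundary_mask;    /* e.g. 0x3fffff for 4-MB boundaries */
    uint32_t dma_align_mask;   /* e.g. 0x1ff for 512-byte alignment */
};

/* True if any byte of [addr, addr + len) lies above the bounce limit,
 * so a bounce buffer would be needed. len must be nonzero. */
static bool model_needs_bounce(const struct model_limits *l,
        uint64_t addr, uint32_t len)
{
    assert(len > 0);
    return addr + len - 1 > l->bounce_limit;
}

/* True if the segment satisfies size, boundary, and alignment limits. */
static bool model_segment_ok(const struct model_limits *l,
        uint64_t addr, uint32_t len)
{
    uint64_t last;

    if (len == 0 || len > l->max_segment_size)
        return false;
    last = addr + len - 1;
    /* First and last byte must fall in the same boundary window. */
    if ((addr | l->boundary_mask) != (last | l->boundary_mask))
        return false;
    /* Both the address and the length must honor the alignment mask. */
    if ((addr & l->dma_align_mask) || (len & l->dma_align_mask))
        return false;
    return true;
}
```

With a boundary mask of 0x3fffff, an 8-KB segment starting at 0x3ff000 is rejected because its last byte lies beyond the 4-MB line at 0x400000.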

16.3.4. The Anatomy of a Request

In our simple example, we encountered the request structure. However, we have barely scratched the surface of that complicated data structure. In this section, we look, in some detail, at how block I/O requests are represented in the Linux kernel.

Each request structure represents one block I/O request, although it may have been formed through a merger of several independent requests at a higher level. The sectors to be transferred for any particular request may be distributed throughout main memory, although they always correspond to a set of consecutive sectors on the block device. The request is represented as a set of segments, each of which corresponds to one in-memory buffer. The kernel may join multiple requests that involve adjacent sectors on the disk, but it never combines read and write operations within a single request structure. The kernel also makes sure not to combine requests if the result would violate any of the request queue limits described in the previous section.

A request structure is implemented, essentially, as a linked list of bio structures combined with some housekeeping information to enable the driver to keep track of its position as it works through the request. The bio structure is a low-level description of a portion of a block I/O request; we take a look at it now.

16.3.4.1 The bio structure

When the kernel, in the form of a filesystem, the virtual memory subsystem, or a system call, decides that a set of blocks must be transferred to or from a block I/O device, it puts together a bio structure to describe that operation. That structure is then handed to the block I/O code, which merges it into an existing request structure or, if need be, creates a new one. The bio structure contains everything that a block driver needs to carry out the request without reference to the user-space process that caused that request to be initiated.

The bio structure, which is defined in <linux/bio.h>, contains a number of fields that may be of use to driver authors:

sector_t bi_sector;

The first (512-byte) sector to be transferred for this bio.

unsigned int bi_size;

The size of the data to be transferred, in bytes. It is often easier to use bio_sectors(bio) instead, a macro that gives the size in sectors.

unsigned long bi_flags;

A set of flags describing the bio; the least significant bit is set if this is a write request (although the macro bio_data_dir(bio) should be used instead of looking at the flags directly).

unsigned short bi_phys_segments;

unsigned short bi_hw_segments;

The number of physical segments contained within this bio and the number of segments seen by the hardware after DMA mapping is done, respectively.

The core of a bio, however, is an array called bi_io_vec , which is made up of the following structure:

struct bio_vec {
        struct page     *bv_page;
        unsigned int    bv_len;
        unsigned int    bv_offset;
};

Figure 16-1 shows how these structures all tie together. As you can see, by the time a block I/O request is turned into a bio structure, it has been broken down into individual pages of physical memory. All a driver needs to do is to step through this array of structures (there are bi_vcnt of them), and transfer data within each page (but only bv_len bytes starting at bv_offset).

Figure 16-1. The bio structure
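
The stepping just described (bi_vcnt entries, each contributing bv_len bytes starting at bv_offset) can be modeled in userspace as follows. This is a sketch under stated assumptions: model_bio and model_bio_vec are hypothetical stand-ins in which a "page" is simply a byte array, so no kernel mapping is involved.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userspace stand-ins: a "page" is just a byte array. */
struct model_bio_vec {
    unsigned char *bv_page;   /* stands in for struct page * */
    unsigned int bv_len;      /* bytes to transfer from this page */
    unsigned int bv_offset;   /* starting offset within the page */
};

struct model_bio {
    struct model_bio_vec *bi_io_vec;
    unsigned short bi_vcnt;   /* number of entries in bi_io_vec */
};

/* Copy every segment of the bio into dst, in order, and return the
 * number of bytes copied. The caller must size dst for the whole bio. */
static size_t model_gather(const struct model_bio *bio, unsigned char *dst)
{
    size_t copied = 0;
    unsigned short i;

    assert(bio != NULL && dst != NULL);
    for (i = 0; i < bio->bi_vcnt; i++) {
        const struct model_bio_vec *bvec = &bio->bi_io_vec[i];

        memcpy(dst + copied, bvec->bv_page + bvec->bv_offset,
               bvec->bv_len);
        copied += bvec->bv_len;
    }
    return copied;
}
```

Only the bytes named by each entry are touched; the rest of each page is ignored, which is why the offset and length must both be honored.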


Working directly with the bi_io_vec array is discouraged in the interest of kernel developers being able to change the bio structure in the future without breaking things. To that end, a set of macros has been provided to ease the process of working with the bio structure. The place to start is with bio_for_each_segment, which simply loops through every unprocessed entry in the bi_io_vec array. This macro should be used as follows:

int segno;
struct bio_vec *bvec;

bio_for_each_segment(bvec, bio, segno) {
    /* Do something with this segment */
}

Within this loop, bvec points to the current bio_vec entry, and segno is the current segment number. These values can be used to set up DMA transfers (an alternative way using blk_rq_map_sg is described in Section 16.3.5.2). If you need to access the pages directly, you should first ensure that a proper kernel virtual address exists; to that end, you can use:

char *__bio_kmap_atomic(struct bio *bio, int i, enum km_type type);
void __bio_kunmap_atomic(char *buffer, enum km_type type);

These low-level functions allow you to directly map, and later unmap, the buffer found in a given bio_vec, as indicated by the index i. An atomic kmap is created; the caller must provide the appropriate slot to use (as described in Section 15.1.4).

The block layer also maintains a set of pointers within the bio structure to keep track of the current state of request processing. Several macros exist to provide access to that state:

struct page *bio_page(struct bio *bio);

Returns a pointer to the page structure representing the page to be transferred next.

int bio_offset(struct bio *bio);

Returns the offset within the page for the data to be transferred.

int bio_cur_sectors(struct bio *bio);

Returns the number of sectors to be transferred out of the current page.

char *bio_data(struct bio *bio);

Returns a kernel logical address pointing to the data to be transferred. Note that this address is available only if the page in question is not located in high memory; calling it in other situations is a bug. By default, the block subsystem does not pass high-memory buffers to your driver, but if you have changed that setting with blk_queue_bounce_limit, you probably should not be using bio_data.

char *bio_kmap_irq(struct bio *bio, unsigned long *flags);

void bio_kunmap_irq(char *buffer, unsigned long *flags);

bio_kmap_irq returns a kernel virtual address for any buffer, regardless of whether it resides in high or low memory. An atomic kmap is used, so your driver cannot sleep while this mapping is active. Use bio_kunmap_irq to unmap the buffer. Note that the flags argument is passed by pointer here. Note also that since an atomic kmap is used, you cannot map more than one segment at a time.

All of the functions just described access the "current" buffer—the first buffer that, as far as the kernel knows, has not been transferred. Drivers often want to work through several buffers in the bio before signaling completion on any of them (with end_that_request_first, to be described shortly), so these functions are often not useful. Several other macros exist for working with the internals of the bio structure (see <linux/bio.h> for details).

16.3.4.2 Request structure fields

Now that we have an idea of how the bio structure works, we can get deep into struct request and see how request processing works. The fields of this structure include:

sector_t hard_sector;

unsigned long hard_nr_sectors;

unsigned int hard_cur_sectors;

Fields that track the sectors that the driver has yet to complete. The first sector that has not been transferred is stored in hard_sector, the total number of sectors yet to transfer is in hard_nr_sectors, and the number of sectors remaining in the current bio is hard_cur_sectors. These fields are intended for use only within the block subsystem; drivers should not make use of them.

struct bio *bio;

bio is the linked list of bio structures for this request. You should not access this field directly; use rq_for_each_bio (described later) instead.

char *buffer;

The simple driver example earlier in this chapter used this field to find the buffer for the transfer. With our deeper understanding, we can now see that this field is simply the result of calling bio_data on the current bio.

unsigned short nr_phys_segments;

The number of distinct segments occupied by this request in physical memory after adjacent pages have been merged.

struct list_head queuelist;

The linked-list structure (as described in Section 11.5) that links the request into the request queue. If (and only if) you remove the request from the queue with blkdev_dequeue_request, you may use this list head to track the request in an internal list maintained by your driver.

Figure 16-2 shows how the request structure and its component bio structures fit together. In the figure, the request has been partially satisfied; the cbio and buffer fields point to the first bio that has not yet been transferred.

Figure 16-2. A request queue with a partially processed request


There are many other fields inside the request structure, but the list in this section should be enough for most driver writers.

16.3.4.3 Barrier requests

The block layer reorders requests before your driver sees them to improve I/O performance. Your driver, too, can reorder requests if there is a reason to do so. Often, this reordering happens by passing multiple requests to the drive and letting the hardware figure out the optimal ordering. There is a problem with unrestricted reordering of requests, however: some applications require guarantees that certain operations will complete before others are started. Relational database managers, for example, must be absolutely sure that their journaling information has been flushed to the drive before executing a transaction on the database contents. Journaling filesystems, which are now in use on most Linux systems, have very similar ordering constraints. If the wrong operations are reordered, the result can be severe, undetected data corruption.

The 2.6 block layer addresses this problem with the concept of a barrier request. If a request is marked with the REQ_HARDBARRIER flag, it must be written to the drive before any following request is initiated. By "written to the drive," we mean that the data must actually reside and be persistent on the physical media. Many drives perform caching of write requests; this caching improves performance, but it can defeat the purpose of barrier requests. If a power failure occurs when the critical data is still sitting in the drive's cache, that data is still lost even if the drive has reported completion. So a driver that implements barrier requests must take steps to force the drive to actually write the data to the media.

If your driver honors barrier requests, the first step is to inform the block layer of this fact. Barrier handling is another request queue setting; it is set with:

void blk_queue_ordered(request_queue_t *queue, int flag);

To indicate that your driver implements barrier requests, set the flag parameter to a nonzero value.

The actual implementation of barrier requests is simply a matter of testing for the associated flag in the request structure. A macro has been provided to perform this test:

int blk_barrier_rq(struct request *req);

If this macro returns a nonzero value, the request is a barrier request. Depending on how your hardware works, you may have to stop taking requests from the queue until the barrier request has been completed. Other drives can understand barrier requests themselves; in this case, all your driver has to do is to issue the proper operations for those drives.
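
One way to picture "stop taking requests until the barrier has completed" is the small userspace model below. Everything in it is hypothetical, including the names and the bookkeeping. It also makes one assumption beyond the text: a barrier is held back until earlier requests have finished, which is a conservative choice for hardware that has no ordering support of its own.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a device's outstanding work. */
struct model_dev {
    int in_flight;          /* requests issued but not yet completed */
    bool barrier_in_flight; /* a barrier request is among them */
};

/* May the next request be handed to the hardware now? */
static bool model_can_issue(const struct model_dev *dev, bool is_barrier)
{
    /* Nothing may start until an outstanding barrier has completed. */
    if (dev->barrier_in_flight)
        return false;
    /* Assumption of this sketch: a barrier waits for earlier work. */
    if (is_barrier && dev->in_flight > 0)
        return false;
    return true;
}

/* Record that a request has been handed to the hardware. */
static void model_issue(struct model_dev *dev, bool is_barrier)
{
    dev->in_flight++;
    if (is_barrier)
        dev->barrier_in_flight = true;
}

/* Record that the hardware has finished a request. */
static void model_complete(struct model_dev *dev, bool was_barrier)
{
    assert(dev->in_flight > 0);
    dev->in_flight--;
    if (was_barrier)
        dev->barrier_in_flight = false;
}
```

A request function built this way would stop pulling requests from the queue whenever model_can_issue returns false and resume from its completion handler.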

16.3.4.4 Nonretryable requests

Block drivers often attempt to retry requests that fail the first time. This behavior can lead to a more reliable system and help to avoid data loss. The kernel, however, sometimes marks requests as not being retryable. Such requests should simply fail as quickly as possible if they cannot be executed on the first try.

If your driver is considering retrying a failed request, it should first make a call to:

int blk_noretry_request(struct request *req);

If this macro returns a nonzero value, your driver should simply abort the request with an error code instead of retrying it.
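
The retry decision can be reduced to a small predicate, sketched here in userspace C. MODEL_MAX_RETRIES and model_should_retry are hypothetical; the noretry argument stands in for a nonzero result from blk_noretry_request, and the retry limit is an arbitrary driver policy chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MAX_RETRIES 3  /* hypothetical driver policy */

/* Decide whether a failed request should be tried again.
 * noretry stands in for a nonzero blk_noretry_request result. */
static bool model_should_retry(int attempts_so_far, bool noretry)
{
    assert(attempts_so_far >= 0);
    if (noretry)
        return false;   /* fail fast, as the kernel asked */
    return attempts_so_far < MODEL_MAX_RETRIES;
}
```

The important point is the order of the checks: the no-retry mark overrides whatever retry budget the driver would otherwise allow.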

16.3.5. Request Completion Functions

There are, as we will see, several different ways of working through a request structure. All of them make use of a couple of common functions, however, which handle the completion of an I/O request or parts of a request. Both of these functions are atomic and can be safely called from an atomic context.

When your device has completed transferring some or all of the sectors in an I/O request, it must inform the block subsystem with:

int end_that_request_first(struct request *req, int success, int count);

This function tells the block code that your driver has finished with the transfer of count sectors starting where you last left off. If the I/O was successful, pass success as 1; otherwise pass 0. Note that you must signal completion in order from the first sector to the last; if your driver and device somehow conspire to complete requests out of order, you have to store the out-of-order completion status until the intervening sectors have been transferred.

The return value from end_that_request_first is an indication of whether all sectors in this request have been transferred or not. A return value of 0 means that all sectors have been transferred and that the request is complete. At that point, you must dequeue the request with blkdev_dequeue_request (if you have not already done so) and pass it to:

void end_that_request_last(struct request *req);

end_that_request_last informs whoever is waiting for the request that it has completed and recycles the request structure; it must be called with the queue lock held.
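
The return convention of end_that_request_first is easy to get backward, so a userspace model may help. The struct and function below are hypothetical and track only a sector count; they imitate the convention described above (nonzero while sectors remain, 0 when the request is finished) and nothing else about the real function.

```c
/* Hypothetical userspace model of partial completion. */
struct model_partial_req {
    unsigned long sector;      /* first sector not yet completed */
    unsigned long nr_sectors;  /* sectors still to complete */
};

/* Mark 'count' sectors as done, starting where we last left off.
 * Returns nonzero while sectors remain and 0 once the whole request
 * is complete. count is clamped to what is left. */
static int model_end_first(struct model_partial_req *req,
        unsigned long count)
{
    if (count > req->nr_sectors)
        count = req->nr_sectors;
    req->sector += count;
    req->nr_sectors -= count;
    return req->nr_sectors != 0;
}
```

A 24-sector request completed in three 8-sector steps yields nonzero twice and then 0; only on that final 0 would a driver go on to dequeue the request and finish it.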

In our simple sbull example, we didn't use any of the above functions. That example, instead, called end_request. To show the effects of this call, here is the entire end_request function as seen in the 2.6.10 kernel:

void end_request(struct request *req, int uptodate)
{
    if (!end_that_request_first(req, uptodate, req->hard_cur_sectors)) {
        add_disk_randomness(req->rq_disk);
        blkdev_dequeue_request(req);
        end_that_request_last(req);
    }
}

The function add_disk_randomness uses the timing of block I/O requests to contribute entropy to the system's random number pool; it should be called only if the disk's timing is truly random. That is true for most mechanical devices, but it is not true for a memory-based virtual device, such as sbull. For this reason, the more complicated version of sbull shown in the next section does not call add_disk_randomness.

16.3.5.1 Working with bios

You now know enough to write a block driver that works directly with the bio structures that make up a request. An example might help, however. If the sbull driver is loaded with the request_mode parameter set to 1, it registers a bio-aware request function instead of the simple function we saw above. That function looks like this:

static void sbull_full_request(request_queue_t *q)
{
    struct request *req;
    int sectors_xferred;
    struct sbull_dev *dev = q->queuedata;

    while ((req = elv_next_request(q)) != NULL) {
        if (! blk_fs_request(req)) {
            printk (KERN_NOTICE "Skip non-fs request\n");
            end_request(req, 0);
            continue;
        }
        sectors_xferred = sbull_xfer_request(dev, req);
        if (! end_that_request_first(req, 1, sectors_xferred)) {
            blkdev_dequeue_request(req);
            end_that_request_last(req);
        }
    }
}

This function simply takes each request, passes it to sbull_xfer_request, then completes it with end_that_request_first and, if necessary, end_that_request_last. Thus, this function is handling the high-level queue and request management parts of the problem. The job of actually executing a request, however, falls to sbull_xfer_request:

static int sbull_xfer_request(struct sbull_dev *dev, struct request *req)
{
    struct bio *bio;
    int nsect = 0;
    
    rq_for_each_bio(bio, req) {
        sbull_xfer_bio(dev, bio);
        nsect += bio->bi_size/KERNEL_SECTOR_SIZE;
    }
    return nsect;
}

Here we introduce another macro: rq_for_each_bio. As you might expect, this macro simply steps through each bio structure in the request, giving us a pointer that we can pass to sbull_xfer_bio for the transfer. That function looks like:

static int sbull_xfer_bio(struct sbull_dev *dev, struct bio *bio)
{
    int i;
    struct bio_vec *bvec;
    sector_t sector = bio->bi_sector;

    /* Do each segment independently. */
    bio_for_each_segment(bvec, bio, i) {
        char *buffer = __bio_kmap_atomic(bio, i, KM_USER0);
        sbull_transfer(dev, sector, bio_cur_sectors(bio),
                buffer, bio_data_dir(bio) == WRITE);
        sector += bio_cur_sectors(bio);
        __bio_kunmap_atomic(bio, KM_USER0);
    }
    return 0; /* Always "succeed" */
}

This function simply steps through each segment in the bio structure, gets a kernel virtual address to access the buffer, then calls the same sbull_transfer function we saw earlier to copy the data over.

Each device has its own needs, but, as a general rule, the code just shown should serve as a model for many situations where digging through the bio structures is needed.
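The segment walk can also be tried outside the kernel. The following is a user-space sketch under stated assumptions: toy_vec, toy_bio, toy_xfer_bio, and TOY_SECTOR_SIZE are hypothetical stand-ins for bio_vec, bio, sbull_xfer_bio, and KERNEL_SECTOR_SIZE, and the "device" is a plain byte array. The sketch advances the sector position by each segment's own length and adds a bounds check, which is a choice made for this model rather than a copy of the sbull code.

```c
#include <string.h>

#define TOY_SECTOR_SIZE 512

struct toy_vec {            /* stands in for one bio_vec segment */
    char *buf;
    unsigned int len;       /* bytes, a multiple of TOY_SECTOR_SIZE */
};

struct toy_bio {            /* stands in for struct bio */
    unsigned long sector;   /* starting sector on the device */
    int write;              /* nonzero: copy memory -> device */
    int nvecs;
    struct toy_vec *vecs;
};

/*
 * Copy every segment to or from dev, one segment at a time.
 * Returns the number of sectors moved, or -1 if a segment would
 * run past the end of the device.
 */
static int toy_xfer_bio(char *dev, unsigned long dev_sectors,
                        const struct toy_bio *bio)
{
    unsigned long sector = bio->sector;
    int i, moved = 0;

    for (i = 0; i < bio->nvecs; i++) {
        const struct toy_vec *v = &bio->vecs[i];
        unsigned long nsect = v->len / TOY_SECTOR_SIZE;
        char *where;

        if (sector > dev_sectors || nsect > dev_sectors - sector)
            return -1;
        where = dev + sector * TOY_SECTOR_SIZE;
        if (bio->write)
            memcpy(where, v->buf, v->len);
        else
            memcpy(v->buf, where, v->len);
        sector += nsect;            /* next segment lands right after */
        moved += (int)nsect;
    }
    return moved;
}
```

In a real driver, the memcpy calls would be replaced by whatever programs the hardware, and the buffer address would come from a mapping call rather than a plain pointer.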

16.3.5.2 Block requests and DMA

If you are working on a high-performance block driver, chances are you will be using DMA for the actual data transfers. A block driver can certainly step through the bio structures, as described above, create a DMA mapping for each one, and pass the result to the device. There is an easier way, however, if your device can do scatter/gather I/O. The function:

int blk_rq_map_sg(request_queue_t *queue, struct request *req, 
                  struct scatterlist *list);

fills in the given list with the full set of segments from the given request. Segments that are adjacent in memory are coalesced prior to insertion into the scatterlist, so you need not try to detect them yourself. The return value is the number of entries in the list. The function also passes back, in its third argument, a scatterlist suitable for passing to dma_map_sg. (See Section 15.4.4.7 for more information on dma_map_sg.)

Your driver must allocate the storage for the scatterlist before calling blk_rq_map_sg. The list must be able to hold at least as many entries as the request has physical segments; the struct request field nr_phys_segments holds that count, which will not exceed the maximum number of physical segments specified with blk_queue_max_phys_segments.

If you do not want blk_rq_map_sg to coalesce adjacent segments, you can change the default behavior with a call such as:

clear_bit(QUEUE_FLAG_CLUSTER, &queue->queue_flags);

Some SCSI disk drivers mark their request queue in this way, since they do not benefit from the coalescing of requests.
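To make the coalescing rule concrete, here is a small user-space model of it. It is only a sketch: toy_seg and toy_map_sg are hypothetical names, segments are described by bare addresses and lengths, and the model ignores any size or boundary limits a real queue may impose. The cluster argument plays the role of the QUEUE_FLAG_CLUSTER bit.

```c
#include <stddef.h>

struct toy_seg {            /* one contiguous memory range */
    unsigned long addr;
    unsigned long len;
};

/*
 * Fill out[] from in[], merging a segment into the previous entry when it
 * begins exactly where that entry ends and cluster is nonzero.
 * Returns the number of entries written to out[].
 * out[] must have room for n entries (the no-merge worst case).
 */
static size_t toy_map_sg(const struct toy_seg *in, size_t n,
                         struct toy_seg *out, int cluster)
{
    size_t i, used = 0;

    for (i = 0; i < n; i++) {
        if (cluster && used > 0 &&
            out[used - 1].addr + out[used - 1].len == in[i].addr) {
            out[used - 1].len += in[i].len;     /* extend previous entry */
        } else {
            out[used++] = in[i];                /* start a new entry */
        }
    }
    return used;
}
```

With four input segments forming two adjacent pairs, the model returns two entries when clustering is on and four when it is off. Sizing the output for the worst case mirrors the requirement that the scatterlist hold as many entries as the request has physical segments.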

16.3.5.3 Doing without a request queue

We have previously discussed the work the kernel does to optimize the order of requests in the queue; this work involves sorting requests and, perhaps, even stalling the queue to allow an anticipated request to arrive. These techniques help the system's performance when dealing with a real, spinning disk drive. They are completely wasted, however, with a device like sbull. Many block-oriented devices, such as flash memory arrays, readers for media cards used in digital cameras, and RAM disks, have truly random-access performance and do not benefit from advanced request-queueing logic. Other devices, such as software RAID arrays or virtual disks created by logical volume managers, do not have the performance characteristics for which the block layer's request queues are optimized. For these kinds of devices, it would be better to accept requests directly from the block layer and not bother with the request queue at all.

For these situations, the block layer supports a "no queue" mode of operation. To make use of this mode, your driver must provide a "make request" function, rather than a request function. The make_request function has this prototype:

typedef int (make_request_fn) (request_queue_t *q, struct bio *bio);

Note that a request queue is still present, even though it will never actually hold any requests. The make_request function takes as its main parameter a bio structure, which represents one or more buffers to be transferred. The make_request function can do one of two things: it can either perform the transfer directly, or it can redirect the request to another device.

Performing the transfer directly is just a matter of working through the bio with the accessor methods we described earlier. Since there is no request structure to work with, however, your function should signal completion directly to the creator of the bio structure with a call to bio_endio:

void bio_endio(struct bio *bio, unsigned int bytes, int error);

Here, bytes is the number of bytes you have transferred so far. It can be less than the number of bytes represented by the bio as a whole; in this way, you can signal partial completion and update the internal "current buffer" pointers within the bio. You should either call bio_endio again as your device makes further progress, or signal an error if you are unable to complete the request. Errors are indicated by providing a nonzero value for the error parameter; this value is normally an error code such as -EIO. The make_request function should return 0, regardless of whether the I/O is successful.

If sbull is loaded with request_mode=2, it operates with a make_request function. Since sbull already has a function that can transfer a single bio, the make_request function is simple:

static int sbull_make_request(request_queue_t *q, struct bio *bio)
{
    struct sbull_dev *dev = q->queuedata;
    int status;

    status = sbull_xfer_bio(dev, bio);
    bio_endio(bio, bio->bi_size, status);
    return 0;
}

Please note that you should never call bio_endio from a regular request function; that job is handled by end_that_request_first instead.

Some block drivers, such as those implementing volume managers and software RAID arrays, really need to redirect the request to another device that handles the actual I/O. Writing such a driver is beyond the scope of this book. We note, however, that if the make_request function returns a nonzero value, the bio is submitted again. A "stacking" driver can, therefore, modify the bi_bdev field to point to a different device, change the starting sector value, then return; the block system then passes the bio to the new device. There is also a bio_split call that can be used to split a bio into multiple chunks for submission to more than one device, although if the queue parameters are set up correctly, splitting a bio in this way should almost never be necessary.
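The redirection idea can be illustrated with a user-space model of a linear concatenation of devices. Everything here is hypothetical and invented for the sketch: remap_bio carries only a device number and a starting sector (standing in for bi_bdev and bi_sector), toy_member describes one underlying device, and toy_linear_make_request follows the "return nonzero to resubmit" convention described above. The model does not handle a bio that straddles two members, and it reports failure through a flag where a real driver would signal completion with an error.

```c
struct remap_bio {
    int dev;                /* stands in for bi_bdev */
    unsigned long sector;   /* stands in for bi_sector */
};

struct toy_member {         /* one underlying device in the concatenation */
    int dev;
    unsigned long nsectors;
};

/*
 * Remap bio onto whichever member holds its starting sector.
 * Returns 1 ("resubmit to the new device") after rewriting the bio,
 * or 0 with *error set if the sector lies beyond the last member.
 */
static int toy_linear_make_request(const struct toy_member *m, int count,
                                   struct remap_bio *bio, int *error)
{
    unsigned long start = 0;    /* first sector covered by member i */
    int i;

    for (i = 0; i < count; i++) {
        if (bio->sector - start < m[i].nsectors) {
            bio->dev = m[i].dev;
            bio->sector -= start;   /* sector relative to that member */
            return 1;
        }
        start += m[i].nsectors;
    }
    *error = 1;
    return 0;
}
```

For two members of 100 and 50 sectors, a bio starting at sector 120 is rewritten to sector 20 of the second member, while one starting at sector 150 falls off the end and is rejected.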

Either way, you must tell the block subsystem that your driver is using a custom make_request function. To do so, you must allocate a request queue with:

request_queue_t *blk_alloc_queue(int flags);

This function differs from blk_init_queue in that it does not actually set up the queue to hold requests. The flags argument is a set of allocation flags to be used in allocating memory for the queue; usually the right value is GFP_KERNEL. Once you have a queue, pass it and your make_request function to blk_queue_make_request:

void blk_queue_make_request(request_queue_t *queue, make_request_fn *func);

The sbull code to set up the make_request function looks like:

dev->queue = blk_alloc_queue(GFP_KERNEL);
if (dev->queue == NULL)
    goto out_vfree;
blk_queue_make_request(dev->queue, sbull_make_request);

For the curious, some time spent digging through drivers/block/ll_rw_blk.c shows that all queues have a make_request function. The default version, __make_request, handles the incorporation of the bio into a request structure. By providing a make_request function of its own, a driver is really just overriding a specific request queue method and bypassing much of the work.
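The "overriding a method" view can be sketched in user space with a function pointer. This is a model only: toy_queue and every toy_ function below are hypothetical, bios are reduced to an integer ID, and the default method merely counts what it would have queued.

```c
struct toy_queue;
typedef int (toy_make_request_fn)(struct toy_queue *q, int bio_id);

struct toy_queue {
    toy_make_request_fn *make_request;  /* the per-queue method */
    int queued;                         /* bios parked for later processing */
    int direct;                         /* bios handled immediately */
};

/* Default method: park the bio on the queue for later processing. */
static int toy_default_make_request(struct toy_queue *q, int bio_id)
{
    (void)bio_id;
    q->queued++;
    return 0;
}

/* Models queue creation: the queue starts out with the default method. */
static void toy_init_queue(struct toy_queue *q)
{
    q->make_request = toy_default_make_request;
    q->queued = 0;
    q->direct = 0;
}

/* Models blk_queue_make_request: replace the method. */
static void toy_queue_make_request(struct toy_queue *q,
                                   toy_make_request_fn *fn)
{
    q->make_request = fn;
}

/* Driver-supplied method: do the transfer right away, skipping the queue. */
static int toy_direct_make_request(struct toy_queue *q, int bio_id)
{
    (void)bio_id;
    q->direct++;
    return 0;
}

/* Every submission goes through whatever method the queue currently holds. */
static int toy_submit(struct toy_queue *q, int bio_id)
{
    return q->make_request(q, bio_id);
}
```

Once the driver installs its own method, later submissions never touch the queueing path at all, which is the whole point of the "no queue" mode.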
