Poster of Linux kernelThe best gift for a Linux geek
 Linux kernel map 
⇦ prev ⇱ home next ⇨

4.3. Debugging by Querying

The previous section described how printk works and how it can be used. What it didn't talk about are its disadvantages.

A massive use of printk can slow down the system noticeably, even if you lower console_loglevel to avoid loading the console device, because syslogd keeps syncing its output files; thus, every line that is printed causes a disk operation. This is the right implementation from syslogd 's perspective. It tries to write everything to disk in case the system crashes right after printing the message; however, you don't want to slow down your system just for the sake of debugging messages. This problem can be solved by prefixing the name of your log file as it appears in /etc/syslogd.conf with a hyphen.[2] The problem with changing the configuration file is that the modification will likely remain there after you are done debugging, even though during normal system operation you do want messages to be flushed to disk as soon as possible. An alternative to such a permanent change is running a program other than klogd (such as cat /proc/kmsg, as suggested earlier), but this may not provide a suitable environment for normal system operation.

[2] The hyphen, or minus sign, is a "magic" marker to prevent syslogd from flushing the file to disk at every new message, documented in syslog.conf(5), a manpage worth reading.

More often than not, the best way to get relevant information is to query the system when you need the information, instead of continually producing data. In fact, every Unix system provides many tools for obtaining system information: ps, netstat, vmstat, and so on.

A few techniques are available to driver developers for querying the system: creating a file in the /proc filesystem, using the ioctl driver method, and exporting attributes via sysfs. The use of sysfs requires quite some background on the driver model. It is discussed in Chapter 14.

4.3.1. Using the /proc Filesystem

The /proc filesystem is a special, software-created filesystem that is used by the kernel to export information to the world. Each file under /proc is tied to a kernel function that generates the file's "contents" on the fly when the file is read. We have already seen some of these files in action; /proc/modules, for example, always returns a list of the currently loaded modules.

/proc is heavily used in the Linux system. Many utilities on a modern Linux distribution, such as ps, top, and uptime, get their information from /proc. Some device drivers also export information via /proc, and yours can do so as well. The /proc filesystem is dynamic, so your module can add or remove entries at any time.

Fully featured /proc entries can be complicated beasts; among other things, they can be written to as well as read from. Most of the time, however, /proc entries are read-only files. This section concerns itself with the simple read-only case. Those who are interested in implementing something more complicated can look here for the basics; the kernel source may then be consulted for the full picture.

Before we continue, however, we should mention that adding files under /proc is discouraged. The /proc filesystem is seen by the kernel developers as a bit of an uncontrolled mess that has gone far beyond its original purpose (which was to provide information about the processes running in the system). The recommended way of making information available in new code is via sysfs. As suggested, working with sysfs requires an understanding of the Linux device model, however, and we do not get to that until Chapter 14. Meanwhile, files under /proc are slightly easier to create, and they are entirely suitable for debugging purposes, so we cover them here.

4.3.1.1 Implementing files in /proc

All modules that work with /proc should include <linux/proc_fs.h> to define the proper functions.

To create a read-only /proc file, your driver must implement a function to produce the data when the file is read. When some process reads the file (using the read system call), the request reaches your module by means of this function. We'll look at this function first and get to the registration interface later in this section.

When a process reads from your /proc file, the kernel allocates a page of memory (i.e., PAGE_SIZE bytes) where the driver can write data to be returned to user space. That buffer is passed to your function, which is a method called read_proc:

int (*read_proc)(char *page, char **start, off_t offset, int count, 
                 int *eof, void *data);

The page pointer is the buffer where you'll write your data; start is used by the function to say where the interesting data has been written in page (more on this later); offset and count have the same meaning as for the read method. The eof argument points to an integer that must be set by the driver to signal that it has no more data to return, while data is a driver-specific data pointer you can use for internal bookkeeping.

This function should return the number of bytes of data actually placed in the page buffer, just like the read method does for other files. Other output values are *eof and *start. eof is a simple flag, but the use of the start value is somewhat more complicated; its purpose is to help with the implementation of large (greater than one page) /proc files.

The start parameter has a somewhat unconventional use. Its purpose is to indicate where (within page) the data to be returned to the user is found. When your proc_read method is called, *start will be NULL. If you leave it NULL, the kernel assumes that the data has been put into page as if offset were 0; in other words, it assumes a simple-minded version of proc_read, which places the entire contents of the virtual file in page without paying attention to the offset parameter. If, instead, you set *start to a non-NULL value, the kernel assumes that the data pointed to by *start takes offset into account and is ready to be returned directly to the user. In general, simple proc_read methods that return tiny amounts of data just ignore start. More complex methods set *start to page and only place data beginning at the requested offset there.

There has long been another major issue with /proc files, which start is meant to solve as well. Sometimes the ASCII representation of kernel data structures changes between successive calls to read, so the reader process could find inconsistent data from one call to the next. If *start is set to a small integer value, the caller uses it to increment filp->f_pos independently of the amount of data you return, thus making f_pos an internal record number of your read_proc procedure. If, for example, your read_proc function is returning information from a big array of structures, and five of those structures were returned in the first call, *start could be set to 5. The next call provides that same value as the offset; the driver then knows to start returning data from the sixth structure in the array. This is acknowledged as a "hack" by its authors and can be seen in fs/proc/generic.c.

Note that there is a better way to implement large /proc files; it's called seq_file, and we'll discuss it shortly. First, though, it is time for an example. Here is a simple (if somewhat ugly) read_proc implementation for the scull device:

int scull_read_procmem(char *buf, char **start, off_t offset,
                   int count, int *eof, void *data)
{
    int i, j, len = 0;
    int limit = count - 80; /* Don't print more than this */

    for (i = 0; i < scull_nr_devs && len <= limit; i++) {
        struct scull_dev *d = &scull_devices[i];
        struct scull_qset *qs = d->data;
        if (down_interruptible(&d->sem))
            return -ERESTARTSYS;
        len += sprintf(buf+len,"\nDevice %i: qset %i, q %i, sz %li\n",
                i, d->qset, d->quantum, d->size);
        for (; qs && len <= limit; qs = qs->next) { /* scan the list */
            len += sprintf(buf + len, "  item at %p, qset at %p\n",
                    qs, qs->data);
            if (qs->data && !qs->next) /* dump only the last item */
                for (j = 0; j < d->qset; j++) {
                    if (qs->data[j])
                        len += sprintf(buf + len,
                                "    % 4i: %8p\n",
                                j, qs->data[j]);
                }
        }
        up(&scull_devices[i].sem);
    }
    *eof = 1;
    return len;
}

This is a fairly typical read_proc implementation. It assumes that there will never be a need to generate more than one page of data and so ignores the start and offset values. It is, however, careful not to overrun its buffer, just in case.

4.3.1.2 An older interface

If you read through the kernel source, you may encounter code implementing /proc files with an older interface:

int (*get_info)(char *page, char **start, off_t offset, int count);

All of the arguments have the same meaning as they do for read_proc, but the eof and data arguments are missing. This interface is still supported, but it could go away in the future; new code should use the read_proc interface instead.

4.3.1.3 Creating your /proc file

Once you have a read_proc function defined, you need to connect it to an entry in the /proc hierarchy. This is done with a call to create_proc_read_entry :

struct proc_dir_entry *create_proc_read_entry(const char *name,
                              mode_t mode, struct proc_dir_entry *base, 
                              read_proc_t *read_proc, void *data);

Here, name is the name of the file to create, mode is the protection mask for the file (it can be passed as 0 for a system-wide default), base indicates the directory in which the file should be created (if base is NULL, the file is created in the /proc root), read_proc is the read_proc function that implements the file, and data is ignored by the kernel (but passed to read_proc). Here is the call used by scull to make its /proc function available as /proc/scullmem:

create_proc_read_entry("scullmem", 0 /* default mode */,
        NULL /* parent dir */, scull_read_procmem,
        NULL /* client data */);

Here, we create a file called scullmem directly under /proc, with the default, world-readable protections.

The directory entry pointer can be used to create entire directory hierarchies under /proc. Note, however, that an entry may be more easily placed in a subdirectory of /proc simply by giving the directory name as part of the name of the entry—as long as the directory itself already exists. For example, an (often ignored) convention says that /proc entries associated with device drivers should go in the subdirectory driver/; scull could place its entry there simply by giving its name as driver/scullmem.

Entries in /proc, of course, should be removed when the module is unloaded. remove_proc_entry is the function that undoes what create_proc_read_entry already did:

remove_proc_entry("scullmem", NULL /* parent dir */);

Failure to remove entries can result in calls at unwanted times, or, if your module has been unloaded, kernel crashes.

When using /proc files as shown, you must remember a few nuisances of the implementation—no surprise its use is discouraged nowadays.

The most important problem is with removal of /proc entries. Such removal may well happen while the file is in use, as there is no owner associated to /proc entries, so using them doesn't act on the module's reference count. This problem is simply triggered by running sleep 100 < /proc/myfile just before removing the module, for example.

Another issue is about registering two entries with the same name. The kernel trusts the driver and doesn't check if the name is already registered, so if you are not careful you might end up with two or more entries with the same name. This is a problem known to happen in classrooms, and such entries are indistinguishable, both when you access them and when you call remove_proc_entry.

4.3.1.4 The seq_file interface

As we noted above, the implementation of large files under /proc is a little awkward. Over time, /proc methods have become notorious for buggy implementations when the amount of output grows large. As a way of cleaning up the /proc code and making life easier for kernel programmers, the seq_file interface was added. This interface provides a simple set of functions for the implementation of large kernel virtual files.

The seq_file interface assumes that you are creating a virtual file that steps through a sequence of items that must be returned to user space. To use seq_file, you must create a simple "iterator" object that can establish a position within the sequence, step forward, and output one item in the sequence. It may sound complicated, but, in fact, the process is quite simple. We'll step through the creation of a /proc file in the scull driver to show how it is done.

The first step, inevitably, is the inclusion of <linux/seq_file.h>. Then you must create four iterator methods, called start, next, stop, and show.

The start method is always called first. The prototype for this function is:

void *start(struct seq_file *sfile, loff_t *pos);

The sfile argument can almost always be ignored. pos is an integer position indicating where the reading should start. The interpretation of the position is entirely up to the implementation; it need not be a byte position in the resulting file. Since seq_file implementations typically step through a sequence of interesting items, the position is often interpreted as a cursor pointing to the next item in the sequence. The scull driver interprets each device as one item in the sequence, so the incoming pos is simply an index into the scull_devices array. Thus, the start method used in scull is:

static void *scull_seq_start(struct seq_file *s, loff_t *pos)
{
    if (*pos >= scull_nr_devs)
        return NULL;   /* No more to read */
    return scull_devices + *pos;
}

The return value, if non-NULL, is a private value that can be used by the iterator implementation.

The next function should move the iterator to the next position, returning NULL if there is nothing left in the sequence. This method's prototype is:

void *next(struct seq_file *sfile, void *v, loff_t *pos);

Here, v is the iterator as returned from the previous call to start or next, and pos is the current position in the file. next should increment the value pointed to by pos; depending on how your iterator works, you might (though probably won't) want to increment pos by more than one. Here's what scull does:

static void *scull_seq_next(struct seq_file *s, void *v, loff_t *pos)
{
    (*pos)++;
    if (*pos >= scull_nr_devs)
        return NULL;
    return scull_devices + *pos;
}

When the kernel is done with the iterator, it calls stop to clean up:

void stop(struct seq_file *sfile, void *v);

The scull implementation has no cleanup work to do, so its stop method is empty.

It is worth noting that the seq_file code, by design, does not sleep or perform other nonatomic tasks between the calls to start and stop. You are also guaranteed to see one stop call sometime shortly after a call to start. Therefore, it is safe for your start method to acquire semaphores or spinlocks. As long as your other seq_file methods are atomic, the whole sequence of calls is atomic. (If this paragraph does not make sense to you, come back to it after you've read the next chapter.)

In between these calls, the kernel calls the show method to actually output something interesting to the user space. This method's prototype is:

int show(struct seq_file *sfile, void *v);

This method should create output for the item in the sequence indicated by the iterator v. It should not use printk, however; instead, there is a special set of functions for seq_file output:

int seq_printf(struct seq_file *sfile, const char *fmt, ...);

This is the printf equivalent for seq_file implementations; it takes the usual format string and additional value arguments. You must also pass it the seq_file structure given to the show function, however. If seq_printf returns a nonzero value, it means that the buffer has filled, and output is being discarded. Most implementations ignore the return value, however.

int seq_putc(struct seq_file *sfile, char c);

int seq_puts(struct seq_file *sfile, const char *s);

These are the equivalents of the user-space putc and puts functions.

int seq_escape(struct seq_file *m, const char *s, const char *esc);

This function is equivalent to seq_puts with the exception that any character in s that is also found in esc is printed in octal format. A common value for esc is " \t\n\\", which keeps embedded white space from messing up the output and possibly confusing shell scripts.

int seq_path(struct seq_file *sfile, struct vfsmount *m, struct dentry

*dentry, char *esc);

This function can be used for outputting the file name associated with a given directory entry. It is unlikely to be useful in device drivers; we have included it here for completeness.

Getting back to our example; the show method used in scull is:

static int scull_seq_show(struct seq_file *s, void *v)
{
    struct scull_dev *dev = (struct scull_dev *) v;
    struct scull_qset *d;
    int i;

    if (down_interruptible(&dev->sem))
        return -ERESTARTSYS;
    seq_printf(s, "\nDevice %i: qset %i, q %i, sz %li\n",
            (int) (dev - scull_devices), dev->qset,
            dev->quantum, dev->size);
    for (d = dev->data; d; d = d->next) { /* scan the list */
        seq_printf(s, "  item at %p, qset at %p\n", d, d->data);
        if (d->data && !d->next) /* dump only the last item */
            for (i = 0; i < dev->qset; i++) {
                if (d->data[i])
                    seq_printf(s, "    % 4i: %8p\n",
                            i, d->data[i]);
            }
    }
    up(&dev->sem);
    return 0;
}

Here, we finally interpret our "iterator" value, which is simply a pointer to a scull_dev structure.

Now that it has a full set of iterator operations, scull must package them up and connect them to a file in /proc. The first step is done by filling in a seq_operations structure:

static struct seq_operations scull_seq_ops = {
    .start = scull_seq_start,
    .next  = scull_seq_next,
    .stop  = scull_seq_stop,
    .show  = scull_seq_show
};

With that structure in place, we must create a file implementation that the kernel understands. We do not use the read_proc method described previously; when using seq_file, it is best to connect in to /proc at a slightly lower level. That means creating a file_operations structure (yes, the same structure used for char drivers) implementing all of the operations needed by the kernel to handle reads and seeks on the file. Fortunately, this task is straightforward. The first step is to create an open method that connects the file to the seq_file operations:

static int scull_proc_open(struct inode *inode, struct file *file)
{
    return seq_open(file, &scull_seq_ops);
}

The call to seq_open connects the file structure with our sequence operations defined above. As it turns out, open is the only file operation we must implement ourselves, so we can now set up our file_operations structure:

static struct file_operations scull_proc_ops = {
    .owner   = THIS_MODULE,
    .open    = scull_proc_open,
    .read    = seq_read,
    .llseek  = seq_lseek,
    .release = seq_release
};

Here we specify our own open method, but use the canned methods seq_read, seq_lseek, and seq_release for everything else.

The final step is to create the actual file in /proc:

entry = create_proc_entry("scullseq", 0, NULL);
if (entry)
    entry->proc_fops = &scull_proc_ops;

Rather than using create_proc_read_entry, we call the lower-level create_proc_entry, which has this prototype:

struct proc_dir_entry *create_proc_entry(const char *name,
                              mode_t mode, 
                              struct proc_dir_entry *parent);

The arguments are the same as their equivalents in create_proc_read_entry: the name of the file, its protections, and the parent directory.

With the above code, scull has a new /proc entry that looks much like the previous one. It is superior, however, because it works regardless of how large its output becomes, it handles seeks properly, and it is generally easier to read and maintain. We recommend the use of seq_file for the implementation of files that contain more than a very small number of lines of output.

4.3.2. The ioctl Method

ioctl, which we show you how to use in Chapter 6, is a system call that acts on a file descriptor; it receives a number that identifies a command to be performed and (optionally) another argument, usually a pointer. As an alternative to using the /proc filesystem, you can implement a few ioctl commands tailored for debugging. These commands can copy relevant data structures from the driver to user space where you can examine them.

Using ioctl this way to get information is somewhat more difficult than using /proc, because you need another program to issue the ioctl and display the results. This program must be written, compiled, and kept in sync with the module you're testing. On the other hand, the driver-side code can be easier than what is needed to implement a /proc file.

There are times when ioctl is the best way to get information, because it runs faster than reading /proc. If some work must be performed on the data before it's written to the screen, retrieving the data in binary form is more efficient than reading a text file. In addition, ioctl doesn't require splitting data into fragments smaller than a page.

Another interesting advantage of the ioctl approach is that information-retrieval commands can be left in the driver even when debugging would otherwise be disabled. Unlike a /proc file, which is visible to anyone who looks in the directory (and too many people are likely to wonder "what that strange file is"), undocumented ioctl commands are likely to remain unnoticed. In addition, they will still be there should something weird happen to the driver. The only drawback is that the module will be slightly bigger.

    ⇦ prev ⇱ home next ⇨
    Poster of Linux kernelThe best gift for a Linux geek