
6.1. ioctl

Most drivers need—in addition to the ability to read and write the device—the ability to perform various types of hardware control via the device driver. Most devices can perform operations beyond simple data transfers; user space must often be able to request, for example, that the device lock its door, eject its media, report error information, change a baud rate, or self destruct. These operations are usually supported via the ioctl method, which implements the system call by the same name.

In user space, the ioctl system call has the following prototype:

int ioctl(int fd, unsigned long cmd, ...);

The prototype stands out in the list of Unix system calls because of the dots, which usually mark the function as having a variable number of arguments. In a real system, however, a system call can't actually have a variable number of arguments. System calls must have a well-defined prototype, because user programs can access them only through hardware "gates." Therefore, the dots in the prototype represent not a variable number of arguments but a single optional argument, traditionally identified as char *argp. The dots are simply there to prevent type checking during compilation. The actual nature of the third argument depends on the specific control command being issued (the second argument). Some commands take no arguments, some take an integer value, and some take a pointer to other data. Using a pointer is the way to pass arbitrary data to the ioctl call; the device is then able to exchange any amount of data with user space.
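As a small user-space illustration of the pointer form of the third argument, the following sketch asks the kernel how many unread bytes are queued in a pipe. It assumes a Linux system where FIONREAD is available through <sys/ioctl.h>; the helper names pending_bytes and pipe_pending_after_write are ours and appear nowhere else.

```c
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns the number of unread bytes queued on fd, or -1 on error. */
int pending_bytes(int fd)
{
    int n = 0;

    /* The third argument is a pointer: the kernel stores the count in n. */
    if (ioctl(fd, FIONREAD, &n) < 0)
        return -1;
    return n;
}

/* Queues len bytes in a fresh pipe and reports what FIONREAD sees. */
int pipe_pending_after_write(const char *data, size_t len)
{
    int fds[2];
    int n;

    if (pipe(fds) < 0)
        return -1;
    if (write(fds[1], data, len) != (ssize_t)len)
        n = -1;
    else
        n = pending_bytes(fds[0]);
    close(fds[0]);
    close(fds[1]);
    return n;
}
```

Because the compiler does not check the type of the third argument, passing n instead of &n here would compile without complaint and fail only at runtime.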

The unstructured nature of the ioctl call has caused it to fall out of favor among kernel developers. Each ioctl command is, essentially, a separate, usually undocumented system call, and there is no way to audit these calls in any sort of comprehensive manner. It is also difficult to make the unstructured ioctl arguments work identically on all systems; for example, consider 64-bit systems with a user-space process running in 32-bit mode. As a result, there is strong pressure to implement miscellaneous control operations by just about any other means. Possible alternatives include embedding commands into the data stream (we will discuss this approach later in this chapter) or using virtual filesystems, either sysfs or driver-specific filesystems. (We will look at sysfs in Chapter 14.) However, the fact remains that ioctl is often the easiest and most straightforward choice for true device operations.

The ioctl driver method has a prototype that differs somewhat from the user-space version:

int (*ioctl) (struct inode *inode, struct file *filp,
              unsigned int cmd, unsigned long arg);

The inode and filp pointers are the values corresponding to the file descriptor fd passed on by the application and are the same parameters passed to the open method. The cmd argument is passed from the user unchanged, and the optional arg argument is passed in the form of an unsigned long, regardless of whether it was given by the user as an integer or a pointer. If the invoking program doesn't pass a third argument, the arg value received by the driver operation is undefined. Because type checking is disabled on the extra argument, the compiler can't warn you if an invalid argument is passed to ioctl, and any associated bug would be difficult to spot.

As you might imagine, most ioctl implementations consist of a big switch statement that selects the correct behavior according to the cmd argument. Different commands have different numeric values, which are usually given symbolic names to simplify coding. The symbolic name is assigned by a preprocessor definition. Custom drivers usually declare such symbols in their header files; scull.h declares them for scull. User programs must, of course, include that header file as well to have access to those symbols.

6.1.1. Choosing the ioctl Commands

Before writing the code for ioctl, you need to choose the numbers that correspond to commands. The first instinct of many programmers is to choose a set of small numbers starting with 0 or 1 and going up from there. There are, however, good reasons for not doing things that way. The ioctl command numbers should be unique across the system in order to prevent errors caused by issuing the right command to the wrong device. Such a mismatch is not unlikely to happen, and a program might find itself trying to change the baud rate of a non-serial-port input stream, such as a FIFO or an audio device. If each ioctl number is unique, the application gets an EINVAL error rather than succeeding in doing something unintended.

To help programmers create unique ioctl command codes, these codes have been split up into several bitfields. The first versions of Linux used 16-bit numbers: the top eight were the "magic" numbers associated with the device, and the bottom eight were a sequential number, unique within the device. This happened because Linus was "clueless" (his own word); a better division of bitfields was conceived only later. Unfortunately, quite a few drivers still use the old convention. They have to: changing the command codes would break no end of binary programs, and that is not something the kernel developers are willing to do.

To choose ioctl numbers for your driver according to the Linux kernel convention, you should first check include/asm/ioctl.h and Documentation/ioctl-number.txt. The header defines the bitfields you will be using: type (magic number), ordinal number, direction of transfer, and size of argument. The ioctl-number.txt file lists the magic numbers used throughout the kernel,[1] so you'll be able to choose your own magic number and avoid overlaps. The text file also lists the reasons why the convention should be used.

[1] Maintenance of this file has been somewhat scarce as of late, however.

The approved way to define ioctl command numbers uses four bitfields, which have the following meanings. New symbols introduced in this list are defined in <linux/ioctl.h>.


type
The magic number. Just choose one number (after consulting ioctl-number.txt) and use it throughout the driver. This field is eight bits wide (_IOC_TYPEBITS).

number
The ordinal (sequential) number. It's eight bits (_IOC_NRBITS) wide.

direction
The direction of data transfer, if the particular command involves a data transfer. The possible values are _IOC_NONE (no data transfer), _IOC_READ, _IOC_WRITE, and _IOC_READ|_IOC_WRITE (data is transferred both ways). Data transfer is seen from the application's point of view; _IOC_READ means reading from the device, so the driver must write to user space. Note that the field is a bit mask, so _IOC_READ and _IOC_WRITE can be extracted using a bitwise AND operation.

size
The size of user data involved. The width of this field is architecture dependent, but is usually 13 or 14 bits. You can find its value for your specific architecture in the macro _IOC_SIZEBITS. It's not mandatory that you use the size field (the kernel does not check it), but it is a good idea. Proper use of this field can help detect user-space programming errors and enable you to implement backward compatibility if you ever need to change the size of the relevant data item. If you need larger data structures, however, you can just ignore the size field. We'll see how this field is used soon.

The header file <asm/ioctl.h>, which is included by <linux/ioctl.h>, defines macros that help set up the command numbers as follows: _IO(type,nr) (for a command that has no argument), _IOR(type,nr,datatype) (for reading data from the driver), _IOW(type,nr,datatype) (for writing data), and _IOWR(type,nr,datatype) (for bidirectional transfers). The type and number fields are passed as arguments, and the size field is derived by applying sizeof to the datatype argument.

The header also defines macros that may be used in your driver to decode the numbers: _IOC_DIR(nr), _IOC_TYPE(nr), _IOC_NR(nr), and _IOC_SIZE(nr). We won't go into any more detail about these macros because the header file is clear, and sample code is shown later in this section.
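To make the encoding concrete, here is a minimal user-space sketch that builds one command with _IOR and takes it apart again with the decoding macros. It assumes the kernel headers are installed so that <linux/ioctl.h> can be included from user space; DEMO_MAGIC, DEMO_GET_INT, struct ioctl_fields, and decode_cmd are hypothetical names used only for this illustration.

```c
#include <linux/ioctl.h>

#define DEMO_MAGIC   'x'                       /* hypothetical magic number */
#define DEMO_GET_INT _IOR(DEMO_MAGIC, 3, int)  /* hypothetical "read an int" command */

/* The four bitfields of an ioctl command number, unpacked. */
struct ioctl_fields {
    unsigned int dir;   /* _IOC_NONE, _IOC_READ, _IOC_WRITE, or both */
    unsigned int type;  /* magic number */
    unsigned int nr;    /* ordinal number */
    unsigned int size;  /* sizeof the datatype given to _IOR/_IOW/_IOWR */
};

struct ioctl_fields decode_cmd(unsigned int cmd)
{
    struct ioctl_fields f;

    f.dir  = _IOC_DIR(cmd);
    f.type = _IOC_TYPE(cmd);
    f.nr   = _IOC_NR(cmd);
    f.size = _IOC_SIZE(cmd);
    return f;
}
```

Decoding DEMO_GET_INT this way yields the direction _IOC_READ, the type 'x', the number 3, and a size equal to sizeof(int).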

Here is how some ioctl commands are defined in scull. In particular, these commands set and get the driver's configurable parameters.

/* Use 'k' as magic number */
#define SCULL_IOC_MAGIC  'k'
/* Please use a different 8-bit number in your code */


/*
 * S means "Set" through a ptr,
 * T means "Tell" directly with the argument value
 * G means "Get": reply by setting through a pointer
 * Q means "Query": response is on the return value
 * X means "eXchange": switch G and S atomically
 * H means "sHift": switch T and Q atomically
 */

#define SCULL_IOC_MAXNR 14

The actual source file defines a few extra commands that have not been shown here.
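As a sketch of how the S/T/G/Q/X/H naming scheme maps onto the encoding macros, the quantum commands might be declared along the following lines. The command names match those used by the ioctl implementation later in this section; the name SCULL_IOCRESET and the specific ordinal numbers are assumptions made for illustration and may differ from the actual scull.h.

```c
#include <linux/ioctl.h>

#define SCULL_IOC_MAGIC  'k'

/* Assumed reset command: no argument, so _IO is enough. */
#define SCULL_IOCRESET    _IO(SCULL_IOC_MAGIC, 0)

/* Ordinal numbers below are illustrative. */
#define SCULL_IOCSQUANTUM _IOW(SCULL_IOC_MAGIC,  1, int)  /* Set: int passed by pointer */
#define SCULL_IOCTQUANTUM _IO(SCULL_IOC_MAGIC,   3)       /* Tell: value is in arg itself */
#define SCULL_IOCGQUANTUM _IOR(SCULL_IOC_MAGIC,  5, int)  /* Get: int returned by pointer */
#define SCULL_IOCQQUANTUM _IO(SCULL_IOC_MAGIC,   7)       /* Query: value is the return code */
#define SCULL_IOCXQUANTUM _IOWR(SCULL_IOC_MAGIC, 9, int)  /* eXchange: pointer, both ways */
#define SCULL_IOCHQUANTUM _IO(SCULL_IOC_MAGIC,  11)       /* sHift: arg in, return value out */
```

The "Tell," "Query," and "sHift" commands use _IO because no user-space memory is touched: the value travels in the argument or in the return value, so there is no size or direction to encode.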

We chose to implement both ways of passing integer arguments: by pointer and by explicit value (although, by an established convention, ioctl should exchange values by pointer). Similarly, both ways are used to return an integer number: by pointer or by setting the return value. This works as long as the return value is a positive integer; as you know by now, on return from any system call, a positive value is preserved (as we saw for read and write), while a negative value is considered an error and is used to set errno in user space.[2]

[2] Actually, all libc implementations currently in use (including uClibc) consider as error codes only values in the range -4095 to -1. Unfortunately, being able to return large negative numbers but not small ones is not very useful.

The "exchange" and "shift" operations are not particularly useful for scull. We implemented "exchange" to show how the driver can combine separate operations into a single atomic one, and "shift" to pair "tell" and "query." There are times when atomic test-and-set operations like these are needed, in particular, when applications need to set or release locks.

The explicit ordinal number of the command has no specific meaning. It is used only to tell the commands apart. Actually, you could even use the same ordinal number for a read command and a write command, since the actual ioctl number is different in the "direction" bits, but there is no reason why you would want to do so. We chose not to use the ordinal number of the command anywhere but in the declaration, so we didn't assign a symbolic value to it. That's why explicit numbers appear in the definition given previously. The example shows one way to use the command numbers, but you are free to do it differently.

With the exception of a small number of predefined commands (to be discussed shortly), the value of the ioctl cmd argument is not currently used by the kernel, and it's quite unlikely it will be in the future. Therefore, you could, if you were feeling lazy, avoid the complex declarations shown earlier and explicitly declare a set of scalar numbers. On the other hand, if you did, you wouldn't benefit from using the bitfields, and you would encounter difficulties if you ever submitted your code for inclusion in the mainline kernel. The header <linux/kd.h> is an example of this old-fashioned approach, using 16-bit scalar values to define the ioctl commands. That source file relied on scalar numbers because it used the conventions obeyed at that time, not out of laziness. Changing it now would cause gratuitous incompatibility.

6.1.2. The Return Value

The implementation of ioctl is usually a switch statement based on the command number. But what should the default selection be when the command number doesn't match a valid operation? The question is controversial. Several kernel functions return -EINVAL ("Invalid argument"), which makes sense because the command argument is indeed not a valid one. The POSIX standard, however, states that if an inappropriate ioctl command has been issued, then -ENOTTY should be returned. This error code is interpreted by the C library as "inappropriate ioctl for device," which is usually exactly what the programmer needs to hear. It's still pretty common, though, to return -EINVAL in response to an invalid ioctl command.
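The ENOTTY behavior is easy to observe from user space. The sketch below, which assumes a Linux system, issues the tty-only TCGETS command on a pipe and reports the errno value that comes back; the helper name errno_for_tty_ioctl_on_pipe is ours.

```c
#include <errno.h>
#include <sys/ioctl.h>
#include <termios.h>
#include <unistd.h>

/*
 * Issues a terminal command on a pipe and returns the resulting errno.
 * Returns 0 if the call unexpectedly succeeds, -1 if no pipe could be made.
 */
int errno_for_tty_ioctl_on_pipe(void)
{
    int fds[2];
    int saved = 0;
    struct termios t;

    if (pipe(fds) < 0)
        return -1;
    /* A pipe is not a terminal, so the kernel rejects the command. */
    if (ioctl(fds[0], TCGETS, &t) < 0)
        saved = errno;
    close(fds[0]);
    close(fds[1]);
    return saved;
}
```

On Linux the value returned is ENOTTY, which strerror reports as "Inappropriate ioctl for device."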

6.1.3. The Predefined Commands

Although the ioctl system call is most often used to act on devices, a few commands are recognized by the kernel. Note that these commands, when applied to your device, are decoded before your own file operations are called. Thus, if you choose the same number for one of your ioctl commands, you won't ever see any request for that command, and the application gets something unexpected because of the conflict between the ioctl numbers.

The predefined commands are divided into three groups:

  • Those that can be issued on any file (regular, device, FIFO, or socket)

  • Those that are issued only on regular files

  • Those specific to the filesystem type

Commands in the last group are executed by the implementation of the hosting filesystem (this is how the chattr command works). Device driver writers are interested only in the first group of commands, whose magic number is "T." Looking at the workings of the other groups is left to the reader as an exercise; ext2_ioctl is a most interesting function (and easier to understand than one might expect), because it implements the append-only flag and the immutable flag.

The following ioctl commands are predefined for any file, including device-special files:


FIOCLEX
Set the close-on-exec flag (File IOctl CLose on EXec). Setting this flag causes the file descriptor to be closed when the calling process executes a new program.

FIONCLEX
Clear the close-on-exec flag (File IOctl Not CLos on EXec). The command restores the common file behavior, undoing what FIOCLEX above does.

FIOASYNC
Set or reset asynchronous notification for the file (as discussed in Section 6.4 later in this chapter). Note that kernel versions up to Linux 2.2.4 incorrectly used this command to modify the O_SYNC flag. Since both actions can be accomplished through fcntl, nobody actually uses the FIOASYNC command, which is reported here only for completeness.

FIOQSIZE
This command returns the size of a file or directory; when applied to a device file, however, it yields an ENOTTY error return.

FIONBIO
"File IOctl Non-Blocking I/O" (described in Section 6.2.3). This call modifies the O_NONBLOCK flag in filp->f_flags. The third argument to the system call is used to indicate whether the flag is to be set or cleared. (We'll look at the role of the flag later in this chapter.) Note that the usual way to change this flag is with the fcntl system call, using the F_SETFL command.

The last item in the list introduced a new system call, fcntl, which looks like ioctl. In fact, the fcntl call is very similar to ioctl in that it gets a command argument and an extra (optional) argument. It is kept separate from ioctl mainly for historical reasons: when Unix developers faced the problem of controlling I/O operations, they decided that files and devices were different. At the time, the only devices with ioctl implementations were ttys, which explains why -ENOTTY is the standard reply for an incorrect ioctl command. Things have changed, but fcntl remains a separate system call.
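The overlap between the two system calls can be seen in a short user-space sketch: the predefined ioctl commands change per-descriptor flags, and fcntl reads the same flags back. The example assumes Linux and a descriptor such as one end of a pipe; the helper names are ours.

```c
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Set or clear close-on-exec through the predefined ioctl commands. */
int set_cloexec_ioctl(int fd)   { return ioctl(fd, FIOCLEX); }
int clear_cloexec_ioctl(int fd) { return ioctl(fd, FIONCLEX); }

/* FIONBIO takes a pointer to an int: nonzero sets O_NONBLOCK, zero clears it. */
int set_nonblock_ioctl(int fd, int on)
{
    return ioctl(fd, FIONBIO, &on);
}

/* Read the same flags back through fcntl; returns 1, 0, or -1 on error. */
int is_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    return flags < 0 ? -1 : (flags & FD_CLOEXEC) != 0;
}

int is_nonblock(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    return flags < 0 ? -1 : (flags & O_NONBLOCK) != 0;
}
```

Note that FIOCLEX and FIONCLEX take no third argument at all, while FIONBIO takes a pointer, which shows in practice how the meaning of the optional argument depends on the command.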

6.1.4. Using the ioctl Argument

Another point we need to cover before looking at the ioctl code for the scull driver is how to use the extra argument. If it is an integer, it's easy: it can be used directly. If it is a pointer, however, some care must be taken.

When a pointer is used to refer to user space, we must ensure that the user address is valid. An attempt to access an unverified user-supplied pointer can lead to incorrect behavior, a kernel oops, system corruption, or security problems. It is the driver's responsibility to make proper checks on every user-space address it uses and to return an error if it is invalid.

In Chapter 3, we looked at the copy_from_user and copy_to_user functions, which can be used to safely move data to and from user space. Those functions can be used in ioctl methods as well, but ioctl calls often involve small data items that can be more efficiently manipulated through other means. To start, address verification (without transferring data) is implemented by the function access_ok, which is declared in <asm/uaccess.h>:

int access_ok(int type, const void *addr, unsigned long size);

The first argument should be either VERIFY_READ or VERIFY_WRITE, depending on whether the action to be performed is reading the user-space memory area or writing it. The addr argument holds a user-space address, and size is a byte count. If ioctl, for instance, needs to read an integer value from user space, size is sizeof(int). If you need to both read and write at the given address, use VERIFY_WRITE, since it is a superset of VERIFY_READ.

Unlike most kernel functions, access_ok returns a boolean value: 1 for success (access is OK) and 0 for failure (access is not OK). If it returns false, the driver should usually return -EFAULT to the caller.

There are a couple of interesting things to note about access_ok. First, it does not do the complete job of verifying memory access; it only checks to see that the memory reference is in a region of memory that the process might reasonably have access to. In particular, access_ok ensures that the address does not point to kernel-space memory. Second, most driver code need not actually call access_ok. The memory-access routines described later take care of that for you. Nonetheless, we demonstrate its use so that you can see how it is done.

The scull source exploits the bitfields in the ioctl number to check the arguments before the switch:

int err = 0, tmp;
int retval = 0;

/*
 * extract the type and number bitfields, and don't decode
 * wrong cmds: return ENOTTY (inappropriate ioctl) before access_ok(  )
 */
if (_IOC_TYPE(cmd) != SCULL_IOC_MAGIC) return -ENOTTY;
if (_IOC_NR(cmd) > SCULL_IOC_MAXNR) return -ENOTTY;

/*
 * the direction is a bitmask, and VERIFY_WRITE catches R/W
 * transfers. `Type' is user-oriented, while
 * access_ok is kernel-oriented, so the concept of "read" and
 * "write" is reversed
 */
if (_IOC_DIR(cmd) & _IOC_READ)
    err = !access_ok(VERIFY_WRITE, (void __user *)arg, _IOC_SIZE(cmd));
else if (_IOC_DIR(cmd) & _IOC_WRITE)
    err = !access_ok(VERIFY_READ, (void __user *)arg, _IOC_SIZE(cmd));
if (err) return -EFAULT;
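The reversal between the user-oriented direction bits and the kernel-oriented check is easy to get backward, so it may help to see the mapping in isolation. The following is a user-space model only: enum verify_type and check_needed are hypothetical stand-ins for the kernel's VERIFY_READ and VERIFY_WRITE constants and for the if/else chain above, and no real address checking takes place.

```c
#include <linux/ioctl.h>

/* Hypothetical stand-ins for the kernel's VERIFY_* constants. */
enum verify_type { CHECK_NONE = -1, CHECK_READ = 0, CHECK_WRITE = 1 };

/* Maps the direction bits of cmd to the kind of check the driver needs. */
enum verify_type check_needed(unsigned int cmd)
{
    /* The application reads, so the driver writes to user memory. */
    if (_IOC_DIR(cmd) & _IOC_READ)
        return CHECK_WRITE;
    /* The application writes, so the driver reads from user memory. */
    if (_IOC_DIR(cmd) & _IOC_WRITE)
        return CHECK_READ;
    /* No data transfer: nothing to verify. */
    return CHECK_NONE;
}
```

A bidirectional command has both bits set and is caught by the first test, which is why the write check alone suffices for it.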

After calling access_ok, the driver can safely perform the actual transfer. In addition to the copy_from_user and copy_to_user functions, the programmer can exploit a set of functions that are optimized for the most used data sizes (one, two, four, and eight bytes). These functions are described in the following list and are defined in <asm/uaccess.h>:

put_user(datum, ptr)

__put_user(datum, ptr)

These macros write the datum to user space; they are relatively fast and should be called instead of copy_to_user whenever single values are being transferred. The macros have been written to allow the passing of any type of pointer to put_user, as long as it is a user-space address. The size of the data transfer depends on the type of the ptr argument and is determined at compile time using the sizeof and typeof compiler builtins. As a result, if ptr is a char pointer, one byte is transferred, and so on for two, four, and possibly eight bytes.

put_user checks to ensure that the process is able to write to the given memory address. It returns 0 on success, and -EFAULT on error. __put_user performs less checking (it does not call access_ok), but can still fail if the memory pointed to is not writable by the user. Thus, __put_user should only be used if the memory region has already been verified with access_ok.

As a general rule, you call __put_user to save a few cycles when you are implementing a read method, or when you copy several items and, thus, call access_ok just once before the first data transfer, as shown above for ioctl.

get_user(local, ptr)

__get_user(local, ptr)

These macros are used to retrieve a single datum from user space. They behave like put_user and __put_user, but transfer data in the opposite direction. The value retrieved is stored in the local variable local; the return value indicates whether the operation succeeded. Again, __get_user should only be used if the address has already been verified with access_ok.

If an attempt is made to use one of the listed functions to transfer a value that does not fit one of the specific sizes, the result is usually a strange message from the compiler, such as "conversion to non-scalar type requested." In such cases, copy_to_user or copy_from_user must be used.
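The way the transfer size follows from the pointer type can be imitated in user space. The sketch below is not the kernel's implementation: mock_store, mock_put_user, and mock_get_user are hypothetical names, the NULL test merely stands in for real fault handling, and the macro relies on the GCC/Clang __typeof__ extension.

```c
#include <errno.h>
#include <string.h>

/* Copies size bytes to dst; a NULL dst plays the role of a bad user address. */
int mock_store(void *dst, const void *src, size_t size)
{
    if (dst == NULL)
        return -EFAULT;
    memcpy(dst, src, size);
    return 0;
}

/*
 * The datum is first converted to the pointed-to type, then exactly
 * sizeof(*(ptr)) bytes are written, so a char pointer moves one byte
 * and an int pointer moves sizeof(int) bytes.
 */
#define mock_put_user(x, ptr) \
    mock_store((ptr), &(__typeof__(*(ptr))){ (x) }, sizeof(*(ptr)))

/* Stores *ptr into local and evaluates to 0, or to -EFAULT for NULL. */
#define mock_get_user(local, ptr) \
    ((ptr) != NULL ? ((local) = *(ptr), 0) : -EFAULT)
```

As with the real macros, the return value reports success or -EFAULT, and the datum itself travels through the first argument.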

6.1.5. Capabilities and Restricted Operations

Access to a device is controlled by the permissions on the device file(s), and the driver is not normally involved in permissions checking. There are situations, however, where any user is granted read/write permission on the device, but some control operations should still be denied. For example, not all users of a tape drive should be able to set its default block size, and a user who has been granted read/write access to a disk device should probably still be denied the ability to format it. In cases like these, the driver must perform additional checks to be sure that the user is capable of performing the requested operation.

Unix systems have traditionally restricted privileged operations to the superuser account. This meant that privilege was an all-or-nothing thing: the superuser can do absolutely anything, but all other users are highly restricted. The Linux kernel provides a more flexible system called capabilities. A capability-based system leaves the all-or-nothing mode behind and breaks down privileged operations into separate subgroups. In this way, a particular user (or program) can be empowered to perform a specific privileged operation without giving away the ability to perform other, unrelated operations. The kernel uses capabilities exclusively for permissions management and exports two system calls, capget and capset, to allow them to be managed from user space.

The full set of capabilities can be found in <linux/capability.h>. These are the only capabilities known to the system; it is not possible for driver authors or system administrators to define new ones without modifying the kernel source. A subset of those capabilities that might be of interest to device driver writers includes the following:


CAP_DAC_OVERRIDE
The ability to override access restrictions (data access control, or DAC) on files and directories.

CAP_NET_ADMIN
The ability to perform network administration tasks, including those that affect network interfaces.

CAP_SYS_MODULE
The ability to load or remove kernel modules.

CAP_SYS_RAWIO
The ability to perform "raw" I/O operations. Examples include accessing device ports or communicating directly with USB devices.

CAP_SYS_ADMIN
A catch-all capability that provides access to many system administration operations.

CAP_SYS_TTY_CONFIG
The ability to perform tty configuration tasks.

Before performing a privileged operation, a device driver should check that the calling process has the appropriate capability; failure to do so could result in user processes performing unauthorized operations with bad results on system stability or security. Capability checks are performed with the capable function (defined in <linux/sched.h>):

 int capable(int capability);

In the scull sample driver, any user is allowed to query the quantum and quantum set sizes. Only privileged users, however, may change those values, since inappropriate values could badly affect system performance. When needed, the scull implementation of ioctl checks a user's privilege level as follows:

 if (! capable (CAP_SYS_ADMIN))
        return -EPERM;

In the absence of a more specific capability for this task, CAP_SYS_ADMIN was chosen for this test.
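From user space, a process can find out in advance whether such a check would pass by asking the kernel for its own capability sets. The following sketch uses the raw capget system call rather than a helper library; it assumes a Linux system whose headers provide <linux/capability.h> with the version 3 capability layout, and has_effective_cap is a hypothetical helper name.

```c
#include <linux/capability.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Returns 1 if the calling process holds cap in its effective set,
 * 0 if it does not, and -1 on error or for an out-of-range number.
 */
int has_effective_cap(int cap)
{
    struct __user_cap_header_struct hdr = {
        .version = _LINUX_CAPABILITY_VERSION_3,
        .pid = 0,                       /* 0 means the calling process */
    };
    struct __user_cap_data_struct data[_LINUX_CAPABILITY_U32S_3] = {{0}};

    if (cap < 0 || cap > CAP_LAST_CAP)
        return -1;
    if (syscall(SYS_capget, &hdr, data) != 0)
        return -1;
    /* Each array element carries 32 capability bits. */
    return (data[cap / 32].effective >> (cap % 32)) & 1u;
}
```

A configuration tool could call has_effective_cap(CAP_SYS_ADMIN) before issuing a privileged command and report a clear message instead of relying on the EPERM error.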

6.1.6. The Implementation of the ioctl Commands

The scull implementation of ioctl only transfers the configurable parameters of the device and turns out to be as easy as the following:

switch(cmd) {

  case SCULL_IOCRESET:
    scull_quantum = SCULL_QUANTUM;
    scull_qset = SCULL_QSET;
    break;

  case SCULL_IOCSQUANTUM: /* Set: arg points to the value */
    if (! capable (CAP_SYS_ADMIN))
        return -EPERM;
    retval = __get_user(scull_quantum, (int __user *)arg);
    break;

  case SCULL_IOCTQUANTUM: /* Tell: arg is the value */
    if (! capable (CAP_SYS_ADMIN))
        return -EPERM;
    scull_quantum = arg;
    break;

  case SCULL_IOCGQUANTUM: /* Get: arg is pointer to result */
    retval = __put_user(scull_quantum, (int __user *)arg);
    break;

  case SCULL_IOCQQUANTUM: /* Query: return it (it's positive) */
    return scull_quantum;

  case SCULL_IOCXQUANTUM: /* eXchange: use arg as pointer */
    if (! capable (CAP_SYS_ADMIN))
        return -EPERM;
    tmp = scull_quantum;
    retval = __get_user(scull_quantum, (int __user *)arg);
    if (retval == 0)
        retval = __put_user(tmp, (int __user *)arg);
    break;

  case SCULL_IOCHQUANTUM: /* sHift: like Tell + Query */
    if (! capable (CAP_SYS_ADMIN))
        return -EPERM;
    tmp = scull_quantum;
    scull_quantum = arg;
    return tmp;

  default:  /* redundant, as cmd was checked against MAXNR */
    return -ENOTTY;
}
return retval;

scull also includes six entries that act on scull_qset. These entries are identical to the ones for scull_quantum and are not worth showing in print.

The six ways to pass and receive arguments look like the following from the caller's point of view (i.e., from user space):

int quantum;

ioctl(fd,SCULL_IOCSQUANTUM, &quantum);          /* Set by pointer */
ioctl(fd,SCULL_IOCTQUANTUM, quantum);           /* Set by value */

ioctl(fd,SCULL_IOCGQUANTUM, &quantum);          /* Get by pointer */
quantum = ioctl(fd,SCULL_IOCQQUANTUM);          /* Get by return value */

ioctl(fd,SCULL_IOCXQUANTUM, &quantum);          /* Exchange by pointer */
quantum = ioctl(fd,SCULL_IOCHQUANTUM, quantum); /* Exchange by value */

Of course, a normal driver would not implement such a mix of calling modes. We have done so here only to demonstrate the different ways in which things could be done. Normally, however, data exchanges would be consistently performed, either through pointers or by value, and mixing of the two techniques would be avoided.

6.1.7. Device Control Without ioctl

Sometimes controlling the device is better accomplished by writing control sequences to the device itself. For example, this technique is used in the console driver, where so-called escape sequences are used to move the cursor, change the default color, or perform other configuration tasks. The benefit of implementing device control this way is that the user can control the device just by writing data, without needing to use (or sometimes write) programs built just for configuring the device. When devices can be controlled in this manner, the program issuing commands often need not even be running on the same system as the device it is controlling.

For example, the setterm program acts on the console (or another terminal) configuration by printing escape sequences. The controlling program can live on a different computer from the controlled device, because a simple redirection of the data stream does the configuration job. This is what happens every time you run a remote tty session: escape sequences are printed remotely but affect the local tty; the technique is not restricted to ttys, though.

The drawback of controlling by printing is that it adds policy constraints to the device; for example, it is viable only if you are sure that the control sequence can't appear in the data being written to the device during normal operation. This is only partly true for ttys. Although a text display is meant to display only ASCII characters, sometimes control characters can slip through in the data being written and can, therefore, affect the console setup. This can happen, for example, when you cat a binary file to the screen; the resulting mess can contain anything, and you often end up with the wrong font on your console.

Controlling by write is definitely the way to go for those devices that don't transfer data but just respond to commands, such as robotic devices.

For instance, a driver written for fun by one of your authors moves a camera on two axes. In this driver, the "device" is simply a pair of old stepper motors, which can't really be read from or written to. The concept of "sending a data stream" to a stepper motor makes little or no sense. In this case, the driver interprets what is being written as ASCII commands and converts the requests to sequences of impulses that manipulate the stepper motors. The idea is similar, somewhat, to the AT commands you send to the modem in order to set up communication, the main difference being that the serial port used to communicate with the modem must transfer real data as well. The advantage of direct device control is that you can use cat to move the camera without writing and compiling special code to issue the ioctl calls.

When writing command-oriented drivers, there's no reason to implement the ioctl method. An additional command in the interpreter is easier to implement and use.
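To give an idea of how small such an interpreter can be, here is a sketch of the parsing step for a two-axis device. The command syntax (an axis letter followed by a signed step count, such as "X 10"), struct move_cmd, and parse_move are all hypothetical; the camera driver's real command set is not described here. The sketch is written as ordinary user-space C.

```c
#include <stdio.h>

/* One parsed command: which axis to move and by how many steps. */
struct move_cmd {
    char axis;   /* 'X' or 'Y' */
    int  steps;  /* the sign selects the direction */
};

/*
 * Parses one text command of the assumed form "<axis> <steps>".
 * Returns 0 and fills *out on success, -1 if the text is malformed.
 */
int parse_move(const char *text, struct move_cmd *out)
{
    char axis;
    int steps;

    if (sscanf(text, " %c %d", &axis, &steps) != 2)
        return -1;
    if (axis != 'X' && axis != 'Y')
        return -1;
    out->axis = axis;
    out->steps = steps;
    return 0;
}
```

With an interpreter like this behind the write method, a shell command that echoes "X 10" into the device node would be enough to move the first axis by ten steps, and adding a new command means adding one more branch to the parser.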

Sometimes, though, you might choose to act the other way around: instead of turning the write method into an interpreter and avoiding ioctl, you might choose to avoid write altogether and use ioctl commands exclusively, while accompanying the driver with a specific command-line tool to send those commands to the driver. This approach moves the complexity from kernel space to user space, where it may be easier to deal with, and helps keep the driver small while denying use of simple cat or echo commands.
