|⇦ prev||⇱ home||next ⇨|
15.1. Memory Management in Linux
Rather than describing the theory of memory management in operating systems, this section tries to pinpoint the main features of the Linux implementation. Although you do not need to be a Linux virtual memory guru to implement mmap, a basic overview of how things work is useful. What follows is a fairly lengthy description of the data structures used by the kernel to manage memory. Once the necessary background has been covered, we can get into working with these structures.
15.1.1. Address Types
Linux is, of course, a virtual memory system, meaning that the addresses seen by user programs do not directly correspond to the physical addresses used by the hardware. Virtual memory introduces a layer of indirection that allows a number of nice things. With virtual memory, programs running on the system can allocate far more memory than is physically available; indeed, even a single process can have a virtual address space larger than the system's physical memory. Virtual memory also allows the program to play a number of tricks with the process's address space, including mapping the program's memory to device memory.
Thus far, we have talked about virtual and physical addresses, but a number of the details have been glossed over. The Linux system deals with several types of addresses, each with its own semantics. Unfortunately, the kernel code is not always very clear on exactly which type of address is being used in each situation, so the programmer must be careful.
The following is a list of address types used in Linux. Figure 15-1 shows how these address types relate to physical memory.
Figure 15-1. Address types used in Linux
If you have a logical address, the macro _ _pa( ) (defined in <asm/page.h>) returns its associated physical address. Physical addresses can be mapped back to logical addresses with _ _va( ), but only for low-memory pages.
Different kernel functions require different types of addresses. It would be nice if there were different C types defined, so that the required address types were explicit, but we have no such luck. In this chapter, we try to be clear on which types of addresses are used where.
15.1.2. Physical Addresses and Pages
Physical memory is divided into discrete units called pages. Much of the system's internal handling of memory is done on a per-page basis. Page size varies from one architecture to the next, although most systems currently use 4096-byte pages. The constant PAGE_SIZE (defined in <asm/page.h>) gives the page size on any given architecture.
If you look at a memory address—virtual or physical—it is divisible into a page number and an offset within the page. If 4096-byte pages are being used, for example, the 12 least-significant bits are the offset, and the remaining, higher bits indicate the page number. If you discard the offset and shift the rest of an offset to the right, the result is called a page frame number (PFN). Shifting bits to convert between page frame numbers and addresses is a fairly common operation; the macro PAGE_SHIFT tells how many bits must be shifted to make this conversion.
15.1.3. High and Low Memory
The difference between logical and kernel virtual addresses is highlighted on 32-bit systems that are equipped with large amounts of memory. With 32 bits, it is possible to address 4 GB of memory. Linux on 32-bit systems has, until recently, been limited to substantially less memory than that, however, because of the way it sets up the virtual address space.
The kernel (on the x86 architecture, in the default configuration) splits the 4-GB virtual address space between user-space and the kernel; the same set of mappings is used in both contexts. A typical split dedicates 3 GB to user space, and 1 GB for kernel space. The kernel's code and data structures must fit into that space, but the biggest consumer of kernel address space is virtual mappings for physical memory. The kernel cannot directly manipulate memory that is not mapped into the kernel's address space. The kernel, in other words, needs its own virtual address for any memory it must touch directly. Thus, for many years, the maximum amount of physical memory that could be handled by the kernel was the amount that could be mapped into the kernel's portion of the virtual address space, minus the space needed for the kernel code itself. As a result, x86-based Linux systems could work with a maximum of a little under 1 GB of physical memory.
In response to commercial pressure to support more memory while not breaking 32-bit application and the system's compatibility, the processor manufacturers have added "address extension" features to their products. The result is that, in many cases, even 32-bit processors can address more than 4 GB of physical memory. The limitation on how much memory can be directly mapped with logical addresses remains, however. Only the lowest portion of memory (up to 1 or 2 GB, depending on the hardware and the kernel configuration) has logical addresses; the rest (high memory) does not. Before accessing a specific high-memory page, the kernel must set up an explicit virtual mapping to make that page available in the kernel's address space. Thus, many kernel data structures must be placed in low memory; high memory tends to be reserved for user-space process pages.
The term "high memory" can be confusing to some, especially since it has other meanings in the PC world. So, to make things clear, we'll define the terms here:
On i386 systems, the boundary between low and high memory is usually set at just under 1 GB, although that boundary can be changed at kernel configuration time. This boundary is not related in any way to the old 640 KB limit found on the original PC, and its placement is not dictated by the hardware. It is, instead, a limit set by the kernel itself as it splits the 32-bit address space between kernel and user space.
15.1.4. The Memory Map and Struct Page
Historically, the kernel has used logical addresses to refer to pages of physical memory. The addition of high-memory support, however, has exposed an obvious problem with that approach—logical addresses are not available for high memory. Therefore, kernel functions that deal with memory are increasingly using pointers to struct page (defined in <linux/mm.h>) instead. This data structure is used to keep track of just about everything the kernel needs to know about physical memory; there is one struct page for each physical page on the system. Some of the fields of this structure include the following:
There is much more information within struct page, but it is part of the deeper black magic of memory management and is not of concern to driver writers.
The kernel maintains one or more arrays of struct page entries that track all of the physical memory on the system. On some systems, there is a single array called mem_map. On some systems, however, the situation is more complicated. Nonuniform memory access (NUMA) systems and those with widely discontiguous physical memory may have more than one memory map array, so code that is meant to be portable should avoid direct access to the array whenever possible. Fortunately, it is usually quite easy to just work with struct page pointers without worrying about where they come from.
Some functions and macros are defined for translating between struct page pointers and virtual addresses:
15.1.5. Page Tables
On any modern system, the processor must have a mechanism for translating virtual addresses into its corresponding physical addresses. This mechanism is called a page table; it is essentially a multilevel tree-structured array containing virtual-to-physical mappings and a few associated flags. The Linux kernel maintains a set of page tables even on architectures that do not use such tables directly.
A number of operations commonly performed by device drivers can involve manipulating page tables. Fortunately for the driver author, the 2.6 kernel has eliminated any need to work with page tables directly. As a result, we do not describe them in any detail; curious readers may want to have a look at Understanding The Linux Kernel by Daniel P. Bovet and Marco Cesati (O'Reilly) for the full story.
15.1.6. Virtual Memory Areas
The virtual memory area (VMA) is the kernel data structure used to manage distinct regions of a process's address space. A VMA represents a homogeneous region in the virtual memory of a process: a contiguous range of virtual addresses that have the same permission flags and are backed up by the same object (a file, say, or swap space). It corresponds loosely to the concept of a "segment," although it is better described as "a memory object with its own properties." The memory map of a process is made up of (at least) the following areas:
The memory areas of a process can be seen by looking in /proc/<pid/maps> (in which pid, of course, is replaced by a process ID). /proc/self is a special case of /proc/pid, because it always refers to the current process. As an example, here are a couple of memory maps (to which we have added short comments in italics):
# cat /proc/1/maps look at init 08048000-0804e000 r-xp 00000000 03:01 64652 /sbin/init text 0804e000-0804f000 rw-p 00006000 03:01 64652 /sbin/init data 0804f000-08053000 rwxp 00000000 00:00 0 zero-mapped BSS 40000000-40015000 r-xp 00000000 03:01 96278 /lib/ld-2.3.2.so text 40015000-40016000 rw-p 00014000 03:01 96278 /lib/ld-2.3.2.so data 40016000-40017000 rw-p 00000000 00:00 0 BSS for ld.so 42000000-4212e000 r-xp 00000000 03:01 80290 /lib/tls/libc-2.3.2.so text 4212e000-42131000 rw-p 0012e000 03:01 80290 /lib/tls/libc-2.3.2.so data 42131000-42133000 rw-p 00000000 00:00 0 BSS for libc bffff000-c0000000 rwxp 00000000 00:00 0 Stack segment ffffe000-fffff000 ---p 00000000 00:00 0 vsyscall page # rsh wolf cat /proc/self/maps #### x86-64 (trimmed) 00400000-00405000 r-xp 00000000 03:01 1596291 /bin/cat text 00504000-00505000 rw-p 00004000 03:01 1596291 /bin/cat data 00505000-00526000 rwxp 00505000 00:00 0 bss 3252200000-3252214000 r-xp 00000000 03:01 1237890 /lib64/ld-2.3.3.so 3252300000-3252301000 r--p 00100000 03:01 1237890 /lib64/ld-2.3.3.so 3252301000-3252302000 rw-p 00101000 03:01 1237890 /lib64/ld-2.3.3.so 7fbfffe000-7fc0000000 rw-p 7fbfffe000 00:00 0 stack ffffffffff600000-ffffffffffe00000 ---p 00000000 00:00 0 vsyscall
The fields in each line are:
start-end perm offset major:minor inode image
220.127.116.11 The vm_area_struct structure
When a user-space process calls mmap to map device memory into its address space, the system responds by creating a new VMA to represent that mapping. A driver that supports mmap (and, thus, that implements the mmap method) needs to help that process by completing the initialization of that VMA. The driver writer should, therefore, have at least a minimal understanding of VMAs in order to support mmap.
Let's look at the most important fields in struct vm_area_struct (defined in <linux/mm.h>). These fields may be used by device drivers in their mmap implementation. Note that the kernel maintains lists and trees of VMAs to optimize area lookup, and several fields of vm_area_struct are used to maintain this organization. Therefore, VMAs can't be created at will by a driver, or the structures break. The main fields of VMAs are as follows (note the similarity between these fields and the /proc output we just saw):
Like struct vm_area_struct, the vm_operations_struct is defined in <linux/mm.h>; it includes the operations listed below. These operations are the only ones needed to handle the process's memory needs, and they are listed in the order they are declared. Later in this chapter, some of these functions are implemented.
15.1.7. The Process Memory Map
The final piece of the memory management puzzle is the process memory map structure, which holds all of the other data structures together. Each process in the system (with the exception of a few kernel-space helper threads) has a struct mm_struct (defined in <linux/sched.h>) that contains the process's list of virtual memory areas, page tables, and various other bits of memory management housekeeping information, along with a semaphore (mmap_sem) and a spinlock (page_table_lock). The pointer to this structure is found in the task structure; in the rare cases where a driver needs to access it, the usual way is to use current->mm. Note that the memory management structure can be shared between processes; the Linux implementation of threads works in this way, for example.
|⇦ prev||⇱ home||next ⇨|