Appendix D  Process Address Space

D.1  Process Memory Descriptors

This section covers the functions used to allocate, initialise, copy and destroy memory descriptors.

D.1.1  Initialising a Descriptor

The initial mm_struct in the system is called init_mm and is statically initialised at compile time using the macro INIT_MM().

238 #define INIT_MM(name) \
239 {                                                         \
240       mm_rb:           RB_ROOT,                           \
241       pgd:             swapper_pg_dir,                    \
242       mm_users:        ATOMIC_INIT(2),                    \
243       mm_count:        ATOMIC_INIT(1),                    \
244       mmap_sem:        __RWSEM_INITIALIZER(name.mmap_sem),\
245       page_table_lock: SPIN_LOCK_UNLOCKED,                \
246       mmlist:          LIST_HEAD_INIT(name.mmlist),       \
247 }

Once it is established, new mm_structs are copies of their parent mm_struct. They are copied using copy_mm(), with the process-specific fields initialised by mm_init().

D.1.2  Copying a Descriptor

D.1.2.1  Function: copy_mm

Source: kernel/fork.c

This function makes a copy of the mm_struct for the given task. This is only called from do_fork() after a new process has been created and needs its own mm_struct.

315 static int copy_mm(unsigned long clone_flags, 
                       struct task_struct * tsk)
316 {
317       struct mm_struct * mm, *oldmm;
318       int retval;
320       tsk->min_flt = tsk->maj_flt = 0;
321       tsk->cmin_flt = tsk->cmaj_flt = 0;
322       tsk->nswap = tsk->cnswap = 0;
324       tsk->mm = NULL;
325       tsk->active_mm = NULL;
327       /*
328        * Are we cloning a kernel thread?
330        * We need to steal a active VM for that..
331        */
332       oldmm = current->mm;
333       if (!oldmm)
334             return 0;
336       if (clone_flags & CLONE_VM) {
337             atomic_inc(&oldmm->mm_users);
338             mm = oldmm;
339             goto good_mm;
340       }

Reset fields that are not inherited by a child mm_struct and find a mm to copy from.

315The parameters are the flags passed for clone and the task that is creating a copy of the mm_struct
320-325Initialise the task_struct fields related to memory management
332Borrow the mm of the current running process to copy from
333A kernel thread has no mm so it can return immediately
336-340If the CLONE_VM flag is set, the child process is to share the mm with the parent process. This is required by users like pthreads. The mm_users field is incremented so the mm is not destroyed prematurely later. The good_mm label sets tsk->mm and tsk->active_mm and returns success
342       retval = -ENOMEM;
343       mm = allocate_mm();
344       if (!mm)
345             goto fail_nomem;
347       /* Copy the current MM stuff.. */
348       memcpy(mm, oldmm, sizeof(*mm));
349       if (!mm_init(mm))
350             goto fail_nomem;
352       if (init_new_context(tsk,mm))
353             goto free_pt;
355       down_write(&oldmm->mmap_sem);
356       retval = dup_mmap(mm);
357       up_write(&oldmm->mmap_sem);
343Allocate a new mm
348-350Copy the parent mm and initialise the process specific mm fields with mm_init()
352-353Initialise the MMU context for architectures that do not automatically manage their MMU
355-357Call dup_mmap() which is responsible for copying all the VMA regions in use by the parent process
359       if (retval)
360             goto free_pt;
362       /*
363        * child gets a private LDT (if there was an LDT in the parent)
364        */
365       copy_segments(tsk, mm);
367 good_mm:
368       tsk->mm = mm;
369       tsk->active_mm = mm;
370       return 0;
372 free_pt:
373       mmput(mm);
374 fail_nomem:
375       return retval;
376 }
359dup_mmap() returns 0 on success. If it failed, the label free_pt will call mmput() which decrements the use count of the mm
365This copies the LDT for the new process based on the parent process
368-370Set the new mm, active_mm and return success
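
The two paths through copy_mm(), sharing the parent mm or taking a private copy, can be modelled in userspace. The following is a minimal sketch and not kernel code: struct toy_mm, toy_copy_mm() and TOY_CLONE_VM are hypothetical names, a plain int stands in for the atomic mm_users counter, and the VMA and page table duplication performed by dup_mmap() is left out entirely.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_CLONE_VM 0x1UL  /* hypothetical stand-in for CLONE_VM */

struct toy_mm {
    int users;              /* stands in for mm_users */
    int map_count;          /* example of state inherited from the parent */
};

/*
 * Return the mm the child should use: the parent's own mm when
 * TOY_CLONE_VM is set (sharing), otherwise a private copy whose
 * user count is reset to 1. Returns NULL if allocation fails.
 */
struct toy_mm *toy_copy_mm(unsigned long flags, struct toy_mm *oldmm)
{
    struct toy_mm *mm;

    assert(oldmm != NULL);
    if (flags & TOY_CLONE_VM) {
        oldmm->users++;             /* one more user of the shared mm */
        return oldmm;
    }

    mm = malloc(sizeof(*mm));
    if (!mm)
        return NULL;
    memcpy(mm, oldmm, sizeof(*mm)); /* start from the parent's state */
    mm->users = 1;                  /* then reset the per-process field */
    return mm;
}
```

As in the listing above, the shared path only bumps a counter, while the copy path starts from a byte copy of the parent and then resets the fields that must not be inherited.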

D.1.2.2  Function: mm_init

Source: kernel/fork.c

This function initialises process specific mm fields.

230 static struct mm_struct * mm_init(struct mm_struct * mm)
231 {
232       atomic_set(&mm->mm_users, 1);
233       atomic_set(&mm->mm_count, 1);
234       init_rwsem(&mm->mmap_sem);
235       mm->page_table_lock = SPIN_LOCK_UNLOCKED;
236       mm->pgd = pgd_alloc(mm);
237       mm->def_flags = 0;
238       if (mm->pgd)
239             return mm;
240       free_mm(mm);
241       return NULL;
242 }
232Set the number of users to 1
233Set the reference count of the mm to 1
234Initialise the semaphore protecting the VMA list
235Initialise the spinlock protecting write access to it
236Allocate a new PGD for the struct
237By default, pages used by the process are not locked in memory
238If a PGD exists, return the initialised struct
240Initialisation failed, delete the mm_struct and return

D.1.3  Allocating a Descriptor

Two functions are provided for allocating a mm_struct. To be slightly confusing, they have essentially the same name. allocate_mm() will allocate a mm_struct from the slab allocator. mm_alloc() will allocate the struct and then call the function mm_init() to initialise it.

D.1.3.1  Function: allocate_mm

Source: kernel/fork.c

227 #define allocate_mm()   (kmem_cache_alloc(mm_cachep, SLAB_KERNEL))
227Allocate a mm_struct from the slab allocator

D.1.3.2  Function: mm_alloc

Source: kernel/fork.c

248 struct mm_struct * mm_alloc(void)
249 {
250       struct mm_struct * mm;
252       mm = allocate_mm();
253       if (mm) {
254             memset(mm, 0, sizeof(*mm));
255             return mm_init(mm);
256       }
257       return NULL;
258 }
252Allocate a mm_struct from the slab allocator
254Zero out all contents of the struct
255Perform basic initialisation

D.1.4  Destroying a Descriptor

A new user of an mm increments the usage count with a simple call,

atomic_inc(&mm->mm_users);

It is decremented with a call to mmput(). If the mm_users count reaches zero, all the mapped regions are deleted with exit_mmap() and the page tables destroyed as there are no longer any users of the userspace portions. The mm_count count is decremented with mmdrop() as all the users of the page tables and VMAs are counted as one mm_struct user. When mm_count reaches zero, the mm_struct will be destroyed.
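
The relationship between the two counters can be sketched in userspace with C11 atomics. This is an illustrative model only: struct toy_mm, toy_mmput() and toy_mmdrop() are hypothetical names, the two boolean fields stand in for the real teardown work, and the mmlist and swap_mm handling is omitted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct toy_mm {
    atomic_int users;         /* users of the address space (mm_users) */
    atomic_int count;         /* references to the struct (mm_count) */
    bool mappings_torn_down;  /* stands in for exit_mmap() having run */
    bool freed;               /* stands in for the struct being freed */
};

void toy_mm_init(struct toy_mm *mm)
{
    atomic_init(&mm->users, 1);
    atomic_init(&mm->count, 1);   /* all users together hold one reference */
    mm->mappings_torn_down = false;
    mm->freed = false;
}

/* Drop a reference to the struct itself; the last one frees it. */
void toy_mmdrop(struct toy_mm *mm)
{
    assert(!mm->freed);
    if (atomic_fetch_sub(&mm->count, 1) == 1)
        mm->freed = true;
}

/*
 * Drop a user of the address space. The last user tears down the
 * mappings and gives up the single reference the users shared.
 */
void toy_mmput(struct toy_mm *mm)
{
    if (atomic_fetch_sub(&mm->users, 1) == 1) {
        mm->mappings_torn_down = true;
        toy_mmdrop(mm);
    }
}
```

With two users and one extra reference of the kind a lazy TLB task would hold, the mappings go away at the second toy_mmput() but the struct survives until the extra reference is dropped.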

D.1.4.1  Function: mmput

Source: kernel/fork.c

Figure D.1: Call Graph: mmput()

276 void mmput(struct mm_struct *mm)
277 {
278       if (atomic_dec_and_lock(&mm->mm_users, &mmlist_lock)) {
279             extern struct mm_struct *swap_mm;
280             if (swap_mm == mm)
281                   swap_mm = list_entry(mm->mmlist.next, 
                              struct mm_struct, mmlist);
282             list_del(&mm->mmlist);
283             mmlist_nr--;
284             spin_unlock(&mmlist_lock);
285             exit_mmap(mm);
286             mmdrop(mm);
287       }
288 }
278Atomically decrement the mm_users field while holding the mmlist_lock lock. Return with the lock held if the count reaches zero
279-286If the usage count reaches zero, the mm and associated structures need to be removed
279-281The swap_mm is the last mm that was swapped out by the vmscan code. If the current process was the last mm swapped, move to the next entry in the list
282Remove this mm from the list
283-284Reduce the count of mms in the list and release the mmlist lock
285Remove all associated mappings
286Delete the mm

D.1.4.2  Function: mmdrop

Source: include/linux/sched.h

765 static inline void mmdrop(struct mm_struct * mm)
766 {
767       if (atomic_dec_and_test(&mm->mm_count))
768             __mmdrop(mm);
769 }
767Atomically decrement the reference count. The reference count could be higher if the mm was being used by lazy TLB switching tasks
768If the reference count reaches zero, call __mmdrop()

D.1.4.3  Function: __mmdrop

Source: kernel/fork.c

265 inline void __mmdrop(struct mm_struct *mm)
266 {
267       BUG_ON(mm == &init_mm);
268       pgd_free(mm->pgd);
269       destroy_context(mm);
270       free_mm(mm);
271 }
267Make sure the init_mm is not destroyed
268Delete the PGD entry
269Delete the LDT
270Call kmem_cache_free() for the mm freeing it with the slab allocator

D.2  Creating Memory Regions

This large section deals with the creation, deletion and manipulation of memory regions.

D.2.1  Creating A Memory Region

The main call graph for creating a memory region is shown in Figure 4.4.

D.2.1.1  Function: do_mmap

Source: include/linux/mm.h

This is a very simple wrapper function around do_mmap_pgoff() which performs most of the work.

557 static inline unsigned long do_mmap(struct file *file, 
            unsigned long addr,
558         unsigned long len, unsigned long prot,
559         unsigned long flag, unsigned long offset)
560 {
561     unsigned long ret = -EINVAL;
562     if ((offset + PAGE_ALIGN(len)) < offset)
563         goto out;
564     if (!(offset & ~PAGE_MASK))
565         ret = do_mmap_pgoff(file, addr, len, prot, flag, 
                                offset >> PAGE_SHIFT);
566 out:
567         return ret;
568 }
561By default, return -EINVAL
562-563Make sure that the size of the region will not overflow the total size of the address space
564-565If the offset is page aligned, convert it to a page offset and call do_mmap_pgoff() to map the region. An unaligned offset leaves the return value at -EINVAL
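
The two checks made by do_mmap() are easy to try out in isolation. The sketch below assumes 4KiB pages; toy_offset_to_pgoff() and the TOY_ macros are hypothetical names that mirror, but are not, the kernel's PAGE_SHIFT, PAGE_MASK and PAGE_ALIGN().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_PAGE_SHIFT 12UL                       /* assumes 4KiB pages */
#define TOY_PAGE_SIZE  (1UL << TOY_PAGE_SHIFT)
#define TOY_PAGE_MASK  (~(TOY_PAGE_SIZE - 1))
#define TOY_PAGE_ALIGN(x) (((x) + TOY_PAGE_SIZE - 1) & TOY_PAGE_MASK)

/*
 * Validate a byte offset and length. On success store the offset in
 * pages in *pgoff and return 0, otherwise return -EINVAL.
 */
int toy_offset_to_pgoff(unsigned long offset, unsigned long len,
                        unsigned long *pgoff)
{
    assert(pgoff != NULL);
    if (offset + TOY_PAGE_ALIGN(len) < offset)
        return -EINVAL;        /* the sum wrapped around */
    if (offset & ~TOY_PAGE_MASK)
        return -EINVAL;        /* offset is not page aligned */
    *pgoff = offset >> TOY_PAGE_SHIFT;
    return 0;
}
```

The wrap-around test relies on unsigned arithmetic being modular in C: if adding the aligned length produces a value smaller than the starting offset, the sum overflowed.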

D.2.1.2  Function: do_mmap_pgoff

Source: mm/mmap.c

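This function is very large and so is broken up into a number of sections. Broadly speaking, the sections sanity check the parameters, find a free linear address range large enough for the mapping, calculate the VM flags and check them against the file access permissions, remove any old mapping that is in the way, allocate a vm_area_struct and fill in its fields, call the filesystem or device specific mmap function, link in the new VMA, and finally update the statistics and return.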

393 unsigned long do_mmap_pgoff(struct file * file, 
                unsigned long addr,
                unsigned long len, unsigned long prot,
394             unsigned long flags, unsigned long pgoff)
395 {
396     struct mm_struct * mm = current->mm;
397     struct vm_area_struct * vma, * prev;
398     unsigned int vm_flags;
399     int correct_wcount = 0;
400     int error;
401     rb_node_t ** rb_link, * rb_parent;
403     if (file && (!file->f_op || !file->f_op->mmap))
404         return -ENODEV;
406     if (!len)
407         return addr;
409     len = PAGE_ALIGN(len);
        if (len > TASK_SIZE || len == 0)
            return -EINVAL;
414     /* offset overflow? */
415     if ((pgoff + (len >> PAGE_SHIFT)) < pgoff)
416         return -EINVAL;
418     /* Too many mappings? */
419     if (mm->map_count > max_map_count)
420         return -ENOMEM;
393The parameters which correspond directly to the parameters to the mmap system call are
file the struct file to mmap if this is a file backed mapping
addr the requested address to map
len the length in bytes to mmap
prot is the permissions on the area
flags are the flags for the mapping
pgoff is the offset within the file to begin the mmap at
403-404If a file or device is being mapped, make sure a filesystem or device specific mmap function is provided. For most filesystems, this will call generic_file_mmap() (See Section D.6.2.1)
406-407Make sure a zero length mmap() is not requested
409Ensure that the mapping is confined to the userspace portion of the address space. On the x86, kernel space begins at PAGE_OFFSET (3GiB)
415-416Ensure the mapping will not overflow the end of the largest possible file size
419-420Only max_map_count number of mappings are allowed. By default this value is DEFAULT_MAX_MAP_COUNT or 65536 mappings
422     /* Obtain the address to map to. we verify (or select) it and
423      * ensure that it represents a valid section of the address space.
424      */
425     addr = get_unmapped_area(file, addr, len, pgoff, flags);
426     if (addr & ~PAGE_MASK)
427         return addr;
425After basic sanity checks, this function will call the device or file specific get_unmapped_area() function. If a device specific one is unavailable, arch_get_unmapped_area() is called. This function is discussed in Section D.3.2.2
429     /* Do simple checking here so the lower-level routines won't have
430      * to. we assume access permissions have been handled by the open
431      * of the memory object, so we don't do any here.
432      */
433     vm_flags = calc_vm_flags(prot,flags) | mm->def_flags 
                 | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;
435     /* mlock MCL_FUTURE? */
436     if (vm_flags & VM_LOCKED) {
437         unsigned long locked = mm->locked_vm << PAGE_SHIFT;
438         locked += len;
439         if (locked > current->rlim[RLIMIT_MEMLOCK].rlim_cur)
440             return -EAGAIN;
441     }
433calc_vm_flags() takes the prot and flags from userspace and translates them to their VM_ equivalents
436-440Check if it has been requested that all future mappings be locked in memory. If yes, make sure the process isn't locking more memory than it is allowed to. If it is, return -EAGAIN
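
The locked memory check is simple arithmetic against a resource limit, and the same limit is visible from userspace through getrlimit(). The sketch below is an assumption-laden illustration: toy_check_memlock() and toy_memlock_limit() are hypothetical helpers, 4KiB pages are assumed, and RLIMIT_MEMLOCK is taken to be available as it is on Linux.

```c
#include <assert.h>
#include <errno.h>
#include <sys/resource.h>

#define TOY_PAGE_SHIFT 12   /* assumes 4KiB pages */

/*
 * Return 0 if locking len more bytes keeps the total within limit,
 * or -EAGAIN if it would exceed it. locked_pages is the number of
 * pages that are already locked.
 */
int toy_check_memlock(unsigned long locked_pages, unsigned long len,
                      rlim_t limit)
{
    unsigned long locked = (locked_pages << TOY_PAGE_SHIFT) + len;

    return locked > limit ? -EAGAIN : 0;
}

/* Query the calling process's soft limit on locked memory, in bytes. */
rlim_t toy_memlock_limit(void)
{
    struct rlimit rl;
    int err = getrlimit(RLIMIT_MEMLOCK, &rl);

    assert(err == 0);
    (void)err;              /* keep NDEBUG builds free of warnings */
    return rl.rlim_cur;
}
```

A caller could pass toy_memlock_limit() as the limit argument to predict whether a locked mapping of a given size is likely to be refused.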
443     if (file) {
444         switch (flags & MAP_TYPE) {
445         case MAP_SHARED:
446             if ((prot & PROT_WRITE) && 
                !(file->f_mode & FMODE_WRITE))
447                 return -EACCES;
449             /* Make sure we don't allow writing to 
                 an append-only file.. */
450             if (IS_APPEND(file->f_dentry->d_inode) &&
                    (file->f_mode & FMODE_WRITE))
451                 return -EACCES;
453             /* make sure there are no mandatory 
                 locks on the file. */
454             if (locks_verify_locked(file->f_dentry->d_inode))
455                 return -EAGAIN;
457             vm_flags |= VM_SHARED | VM_MAYSHARE;
458             if (!(file->f_mode & FMODE_WRITE))
459                 vm_flags &= ~(VM_MAYWRITE | VM_SHARED);
461             /* fall through */
462         case MAP_PRIVATE:
463             if (!(file->f_mode & FMODE_READ))
464                 return -EACCES;
465             break;
467         default:
468             return -EINVAL;
469         }
443-470If a file is being memory mapped, check the file's access permissions
446-447If write access is requested, make sure the file is opened for write
450-451Similarly, if the file is opened for append, make sure it cannot be written to through the mapping. The prot field is not checked because it applies only to the mapping, whereas the check needed here is on the opened file
454-455If the file is mandatory locked, return -EAGAIN so the caller will try a second time
457-459Fix up the flags to be consistent with the file flags
463-464Make sure the file can be read before mmapping it
470     } else {
471         vm_flags |= VM_SHARED | VM_MAYSHARE;
472         switch (flags & MAP_TYPE) {
473         default:
474             return -EINVAL;
475         case MAP_PRIVATE:
476             vm_flags &= ~(VM_SHARED | VM_MAYSHARE);
477             /* fall through */
478         case MAP_SHARED:
479             break;
480         }
481     }
471-481If no file is being mapped, this is an anonymous mapping. Fix up the flags if the requested mapping is MAP_PRIVATE to make sure the flags are consistent
483     /* Clear old maps */
484 munmap_back:
485     vma = find_vma_prepare(mm, addr, &prev, &rb_link, &rb_parent);
486     if (vma && vma->vm_start < addr + len) {
487         if (do_munmap(mm, addr, len))
488             return -ENOMEM;
489         goto munmap_back;
490     }
485find_vma_prepare()(See Section D.2.2.2) steps through the RB tree for the VMA corresponding to a given address
486-488If a VMA was found and it is part of the new mapping, remove the old mapping as the new one will cover both
492     /* Check against address space limit. */
493     if ((mm->total_vm << PAGE_SHIFT) + len
494         > current->rlim[RLIMIT_AS].rlim_cur)
495         return -ENOMEM;
497     /* Private writable mapping? Check memory availability.. */
498     if ((vm_flags & (VM_SHARED | VM_WRITE)) == VM_WRITE &&
499         !(flags & MAP_NORESERVE)                 &&
500         !vm_enough_memory(len >> PAGE_SHIFT))
501         return -ENOMEM;
503     /* Can we just expand an old anonymous mapping? */
504     if (!file && !(vm_flags & VM_SHARED) && rb_parent)
505         if (vma_merge(mm, prev, rb_parent, 
                    addr, addr + len, vm_flags))
506             goto out;
493-495Make sure the new mapping will not exceed the total VM a process is allowed to have. It is unclear why this check is not made earlier
498-501If this is a private writable mapping and the caller has not used MAP_NORESERVE to request that free space is not checked, make sure enough memory is available to satisfy the mapping under current conditions
504-506If two adjacent memory mappings are anonymous and can be treated as one, expand the old mapping rather than creating a new one
508     /* Determine the object being mapped and call the appropriate
509      * specific mapper. the address has already been validated, but
510      * not unmapped, but the maps are removed from the list.
511      */
512     vma = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL);
513     if (!vma)
514         return -ENOMEM;
516     vma->vm_mm = mm;
517     vma->vm_start = addr;
518     vma->vm_end = addr + len;
519     vma->vm_flags = vm_flags;
520     vma->vm_page_prot = protection_map[vm_flags & 0x0f];
521     vma->vm_ops = NULL;
522     vma->vm_pgoff = pgoff;
523     vma->vm_file = NULL;
524     vma->vm_private_data = NULL;
525     vma->vm_raend = 0;
512Allocate a vm_area_struct from the slab allocator
516-525Fill in the basic vm_area_struct fields
527     if (file) {
528         error = -EINVAL;
529         if (vm_flags & (VM_GROWSDOWN|VM_GROWSUP))
530             goto free_vma;
531         if (vm_flags & VM_DENYWRITE) {
532             error = deny_write_access(file);
533             if (error)
534                 goto free_vma;
535             correct_wcount = 1;
536         }
537         vma->vm_file = file;
538         get_file(file);
539         error = file->f_op->mmap(file, vma);
540         if (error)
541             goto unmap_and_free_vma;
527-542Fill in the file related fields if this is a file being mapped
529-530These are both invalid flags for a file mapping so free the vm_area_struct and return
531-536This flag is cleared by the mmap() system call but may still be set by kernel code that calls this function directly. Historically, -ETXTBSY was returned to the calling process if the underlying file was being written to
537Fill in the vm_file field
538This increments the file usage count
539Call the filesystem or device specific mmap() function. In many filesystem cases, this will call generic_file_mmap() (See Section D.6.2.1)
540-541If an error occurred, goto unmap_and_free_vma to clean up and return the error
542     } else if (flags & MAP_SHARED) {
543         error = shmem_zero_setup(vma);
544         if (error)
545             goto free_vma;
546     }
543If this is an anonymous shared mapping, the region is created and set up by shmem_zero_setup() (See Section L.7.1). Anonymous shared pages are backed by a virtual tmpfs filesystem so that they can be synchronised properly with swap. The writeback function is shmem_writepage() (See Section L.6.1)
548     /* Can addr have changed??
549      *
550      * Answer: Yes, several device drivers can do it in their
551      *     f_op->mmap method. -DaveM
552      */
553     if (addr != vma->vm_start) {
554         /*
555          * It is a bit too late to pretend changing the virtual
556          * area of the mapping, we just corrupted userspace
557          * in the do_munmap, so FIXME (not in 2.4 to avoid
558          * breaking the driver API).
559          */
560         struct vm_area_struct * stale_vma;
561         /* Since addr changed, we rely on the mmap op to prevent 
562          * collisions with existing vmas and just use
563          * find_vma_prepare to update the tree pointers.
564          */
565         addr = vma->vm_start;
566         stale_vma = find_vma_prepare(mm, addr, &prev,
567                         &rb_link, &rb_parent);
568         /*
569          * Make sure the lowlevel driver did its job right.
570          */
571         if (unlikely(stale_vma && stale_vma->vm_start <
                 vma->vm_end)) {
572             printk(KERN_ERR "buggy mmap operation: [<%p>]\n",
573                 file ? file->f_op->mmap : NULL);
574             BUG();
575         }
576     }
578     vma_link(mm, vma, prev, rb_link, rb_parent);
579     if (correct_wcount)
580         atomic_inc(&file->f_dentry->d_inode->i_writecount);
553-576If the address has changed, it means the device specific mmap operation moved the VMA address to somewhere else. The function find_vma_prepare() (See Section D.2.2.2) is used to find where the VMA was moved to
578Link in the new vm_area_struct
579-580Update the file write count
582 out:    
583     mm->total_vm += len >> PAGE_SHIFT;
584     if (vm_flags & VM_LOCKED) {
585         mm->locked_vm += len >> PAGE_SHIFT;
586         make_pages_present(addr, addr + len);
587     }
588     return addr;
590 unmap_and_free_vma:
591     if (correct_wcount)
592         atomic_inc(&file->f_dentry->d_inode->i_writecount);
593     vma->vm_file = NULL;
594     fput(file);
596     /* Undo any partial mapping done by a device driver. */
597     zap_page_range(mm, vma->vm_start, vma->vm_end - vma->vm_start);
598 free_vma:
599     kmem_cache_free(vm_area_cachep, vma);
600     return error;
601 }
583-588Update statistics for the process mm_struct and return the new address
590-597This is reached if the file has been partially mapped before failing. The write statistics are updated and then all user pages are removed with zap_page_range()
598-600This goto is used if the mapping failed immediately after the vm_area_struct is created. It is freed back to the slab allocator before the error is returned

D.2.2  Inserting a Memory Region

The call graph for insert_vm_struct() is shown in Figure 4.6.

D.2.2.1  Function: __insert_vm_struct

Source: mm/mmap.c

This is the top level function for inserting a new vma into an address space. There is a second function like it called simply insert_vm_struct() that is not described in detail here as the only difference is the one line of code increasing the map_count.

1174 void __insert_vm_struct(struct mm_struct * mm, 
                     struct vm_area_struct * vma)
1175 {
1176     struct vm_area_struct * __vma, * prev;
1177     rb_node_t ** rb_link, * rb_parent;
1179     __vma = find_vma_prepare(mm, vma->vm_start, &prev, 
                      &rb_link, &rb_parent);
1180     if (__vma && __vma->vm_start < vma->vm_end)
1181         BUG();
1182     __vma_link(mm, vma, prev, rb_link, rb_parent);
1183     mm->map_count++;
1184     validate_mm(mm);
1185 }
1174The arguments are the mm_struct that represents the linear address space and the vm_area_struct that is to be inserted
1179find_vma_prepare()(See Section D.2.2.2) locates where the new VMA can be inserted. It will be inserted between prev and __vma and the required nodes for the red-black tree are also returned
1180-1181This is a check to make sure the returned VMA does not overlap the region being inserted. It is virtually impossible for this condition to occur without manually inserting bogus VMAs into the address space
1182This function does the actual work of linking the vma struct into the linear linked list and the red-black tree
1183Increase the map_count to show a new mapping has been added. This line is not present in insert_vm_struct()
1184validate_mm() is a debugging macro for red-black trees. If DEBUG_MM_RB is set, the linear list of VMAs and the tree will be traversed to make sure it is valid. The tree traversal is a recursive function so it is very important that it is used only if really necessary as a large number of mappings could cause a stack overflow. If it is not set, validate_mm() does nothing at all

D.2.2.2  Function: find_vma_prepare

Source: mm/mmap.c

This is responsible for finding the correct places to insert a VMA at the supplied address. It returns a number of pieces of information via the actual return and the function arguments. The forward VMA to link to is returned with return. pprev is the previous node which is required because the list is a singly linked list. rb_link and rb_parent are the parent and leaf node the new VMA will be inserted between.

246 static struct vm_area_struct * find_vma_prepare(
                       struct mm_struct * mm,
                       unsigned long addr,
247                    struct vm_area_struct ** pprev,
248                    rb_node_t *** rb_link,
                     rb_node_t ** rb_parent)
249 {
250     struct vm_area_struct * vma;
251     rb_node_t ** __rb_link, * __rb_parent, * rb_prev;
253     __rb_link = &mm->mm_rb.rb_node;
254     rb_prev = __rb_parent = NULL;
255     vma = NULL;
257     while (*__rb_link) {
258         struct vm_area_struct *vma_tmp;
260         __rb_parent = *__rb_link;
261         vma_tmp = rb_entry(__rb_parent, 
                     struct vm_area_struct, vm_rb);
263         if (vma_tmp->vm_end > addr) {
264             vma = vma_tmp;
265             if (vma_tmp->vm_start <= addr)
266                 return vma;
267             __rb_link = &__rb_parent->rb_left;
268         } else {
269             rb_prev = __rb_parent;
270             __rb_link = &__rb_parent->rb_right;
271         }
272     }
274     *pprev = NULL;
275     if (rb_prev)
276         *pprev = rb_entry(rb_prev, struct vm_area_struct, vm_rb);
277     *rb_link = __rb_link;
278     *rb_parent = __rb_parent;
279     return vma;
280 }
246The function arguments are described above
253-255Initialise the search
263-272This is a similar tree walk to what was described for find_vma(). The only real difference is the nodes last traversed are remembered with the __rb_link and __rb_parent variables
275-276Get the back linking VMA via the red-black tree
279Return the forward linking VMA
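
The walk performed by find_vma_prepare() does not depend on the tree being red-black, so it can be illustrated on a plain binary search tree. The following is a sketch under that assumption: struct toy_region and toy_find_prepare() are hypothetical names, and the tree is never rebalanced.

```c
#include <assert.h>
#include <stddef.h>

struct toy_region {
    unsigned long start, end;          /* covers [start, end) */
    struct toy_region *left, *right;   /* plain binary search tree links */
};

/*
 * Return the first region whose end lies above addr, or NULL. If no
 * region contains addr, *prev receives the closest region below addr
 * (or NULL) and *link and *parent describe the empty slot where a new
 * node for addr would be attached. As in the kernel function, the
 * three outputs are not written when a region containing addr is found.
 */
struct toy_region *toy_find_prepare(struct toy_region **root,
                                    unsigned long addr,
                                    struct toy_region **prev,
                                    struct toy_region ***link,
                                    struct toy_region **parent)
{
    struct toy_region **slot = root;
    struct toy_region *up = NULL, *below = NULL, *found = NULL;

    assert(root && prev && link && parent);
    while (*slot) {
        struct toy_region *cur = *slot;

        up = cur;
        if (cur->end > addr) {
            found = cur;               /* candidate: ends above addr */
            if (cur->start <= addr)
                return found;          /* addr lies inside this region */
            slot = &cur->left;
        } else {
            below = cur;               /* closest region below addr so far */
            slot = &cur->right;
        }
    }
    *prev = below;
    *link = slot;
    *parent = up;
    return found;
}
```

Because link is the address of the NULL child pointer where the search ended, an insertion can store the new node through it without walking the tree a second time.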

D.2.2.3  Function: vma_link

Source: mm/mmap.c

This is the top-level function for linking a VMA into the proper lists. It is responsible for acquiring the necessary locks to make a safe insertion.

337 static inline void vma_link(struct mm_struct * mm, 
                struct vm_area_struct * vma, 
                struct vm_area_struct * prev,
338                 rb_node_t ** rb_link, rb_node_t * rb_parent)
339 {
340     lock_vma_mappings(vma);
341     spin_lock(&mm->page_table_lock);
342     __vma_link(mm, vma, prev, rb_link, rb_parent);
343     spin_unlock(&mm->page_table_lock);
344     unlock_vma_mappings(vma);
346     mm->map_count++;
347     validate_mm(mm);
348 }
337mm is the address space the VMA is to be inserted into. prev is the backwards linked VMA for the linear linked list of VMAs. rb_link and rb_parent are the nodes required to make the rb insertion
340This function acquires the spinlock protecting the address_space representing the file that is being memory mapped
341Acquire the page table lock which protects the whole mm_struct
342Insert the VMA
343Free the lock protecting the mm_struct
344Unlock the address_space for the file
346Increase the number of mappings in this mm
347If DEBUG_MM_RB is set, the RB trees and linked lists will be checked to make sure they are still valid

D.2.2.4  Function: __vma_link

Source: mm/mmap.c

This simply calls three helper functions which are responsible for linking the VMA into the three linked lists that link VMAs together.

329 static void __vma_link(struct mm_struct * mm, 
               struct vm_area_struct * vma,
               struct vm_area_struct * prev,
330            rb_node_t ** rb_link, rb_node_t * rb_parent)
331 {
332     __vma_link_list(mm, vma, prev, rb_parent);
333     __vma_link_rb(mm, vma, rb_link, rb_parent);
334     __vma_link_file(vma);
335 }
332This links the VMA into the linear linked lists of VMAs in this mm via the vm_next field
333This links the VMA into the red-black tree of VMAs in this mm, whose root is stored in the mm_rb field of the mm_struct. The VMA itself is linked in via its vm_rb field
334This links the VMA into the shared mapping VMA links. Memory mapped files are linked together over potentially many mms by this function via the vm_next_share and vm_pprev_share fields

D.2.2.5  Function: __vma_link_list

Source: mm/mmap.c

282 static inline void __vma_link_list(struct mm_struct * mm, 
                     struct vm_area_struct * vma, 
                     struct vm_area_struct * prev,
283                    rb_node_t * rb_parent)
284 {
285     if (prev) {
286         vma->vm_next = prev->vm_next;
287         prev->vm_next = vma;
288     } else {
289         mm->mmap = vma;
290         if (rb_parent)
291             vma->vm_next = rb_entry(rb_parent, 
                                struct vm_area_struct, vm_rb);
292         else
293             vma->vm_next = NULL;
294     }
295 }
285If prev is not null, the vma is simply inserted into the list
289Else this is the first mapping and the first element of the list has to be stored in the mm_struct
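290-293If there is a parent node in the red-black tree, the VMA it belongs to becomes the next entry in the list. Otherwise the new VMA is the only entry and vm_next is NULL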
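
The list insertion can be sketched on its own as follows. This is a simplified model: struct toy_vma and toy_link_list() are hypothetical names, and where the kernel derives the successor of a new head from rb_parent, the sketch simply uses the old head of the list.

```c
#include <assert.h>
#include <stddef.h>

struct toy_vma {
    unsigned long start;
    struct toy_vma *next;    /* stands in for vm_next */
};

/*
 * Link vma into the address-ordered list headed by *head. prev is the
 * node that should precede vma, or NULL if vma becomes the new head.
 */
void toy_link_list(struct toy_vma **head, struct toy_vma *vma,
                   struct toy_vma *prev)
{
    assert(head && vma);
    if (prev) {
        vma->next = prev->next;
        prev->next = vma;
    } else {
        vma->next = *head;   /* old head, or NULL for an empty list */
        *head = vma;
    }
}
```

The caller is expected to supply the correct prev, which is why find_vma_prepare() returns it: a singly linked list cannot be searched backwards from the insertion point.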

D.2.2.6  Function: __vma_link_rb

Source: mm/mmap.c

The principal workings of this function are stored within <linux/rbtree.h> and will not be discussed in detail in this book.

297 static inline void __vma_link_rb(struct mm_struct * mm, 
                     struct vm_area_struct * vma,
298                  rb_node_t ** rb_link, 
                     rb_node_t * rb_parent)
299 {
300     rb_link_node(&vma->vm_rb, rb_parent, rb_link);
301     rb_insert_color(&vma->vm_rb, &mm->mm_rb);
302 }

D.2.2.7  Function: __vma_link_file

Source: mm/mmap.c

This function links the VMA into a linked list of shared file mappings.

304 static inline void __vma_link_file(struct vm_area_struct * vma)
305 {
306     struct file * file;
308     file = vma->vm_file;
309     if (file) {
310         struct inode * inode = file->f_dentry->d_inode;
311         struct address_space *mapping = inode->i_mapping;
312         struct vm_area_struct **head;
314         if (vma->vm_flags & VM_DENYWRITE)
315             atomic_dec(&inode->i_writecount);
317         head = &mapping->i_mmap;
318         if (vma->vm_flags & VM_SHARED)
319             head = &mapping->i_mmap_shared;
321         /* insert vma into inode's share list */
322         if((vma->vm_next_share = *head) != NULL)
323             (*head)->vm_pprev_share = &vma->vm_next_share;
324         *head = vma;
325         vma->vm_pprev_share = head;
326     }
327 }
309Check to see if this VMA has a shared file mapping. If it does not, this function has nothing more to do
310-312Extract the relevant information about the mapping from the VMA
314-315If this mapping is not allowed to write even if the permissions are ok for writing, decrement the i_writecount field. A negative value to this field indicates that the file is memory mapped and may not be written to. Efforts to open the file for writing will now fail
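317-319Select the list to insert into. By default this is the i_mmap list of the address_space, but a shared mapping is placed on the i_mmap_shared list instead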
322-325Insert the VMA into the shared mapping linked list
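
The share list uses a pointer to the previous node's next pointer rather than a pointer to the previous node itself. A minimal sketch of that technique follows; struct toy_node and both functions are hypothetical names, and the removal helper is not part of the listing above but is included to show why the back pointer is useful.

```c
#include <assert.h>
#include <stddef.h>

struct toy_node {
    struct toy_node *next_share;    /* next node in the list */
    struct toy_node **pprev_share;  /* address of the pointer pointing at us */
};

/* Insert node at the front of the list headed by *head. */
void toy_share_insert(struct toy_node **head, struct toy_node *node)
{
    assert(head && node);
    node->next_share = *head;
    if (node->next_share)
        node->next_share->pprev_share = &node->next_share;
    *head = node;
    node->pprev_share = head;
}

/* Unlink node without needing to know which list head it hangs off. */
void toy_share_remove(struct toy_node *node)
{
    assert(node && node->pprev_share);
    if (node->next_share)
        node->next_share->pprev_share = node->pprev_share;
    *node->pprev_share = node->next_share;
    node->next_share = NULL;
    node->pprev_share = NULL;
}
```

Since the back pointer refers either to the list head or to the previous node's next pointer, removal needs no special case for the first element, and it works the same way whichever of the two lists the node was placed on.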

D.2.3  Merging Contiguous Regions

D.2.3.1  Function: vma_merge

Source: mm/mmap.c

This function checks to see if a region pointed to by prev may be expanded forwards to cover the area from addr to end instead of allocating a new VMA. If it cannot, the VMA ahead is checked to see if it can be expanded backwards instead.

350 static int vma_merge(struct mm_struct * mm, 
                 struct vm_area_struct * prev,
351                rb_node_t * rb_parent, 
                 unsigned long addr, unsigned long end, 
                 unsigned long vm_flags)
352 {
353     spinlock_t * lock = &mm->page_table_lock;
354     if (!prev) {
355         prev = rb_entry(rb_parent, struct vm_area_struct, vm_rb);
356         goto merge_next;
357     }
350The parameters are as follows:
mm The mm the VMAs belong to
prev The VMA before the address we are interested in
rb_parent The parent RB node as returned by find_vma_prepare()
addr The starting address of the region to be merged
end The end of the region to be merged
vm_flags The permission flags of the region to be merged
353This is the lock to the mm
354-357If prev is not passed in, it is taken to mean that the VMA being tested for merging is in front of the region from addr to end. The entry for that VMA is extracted from the rb_parent
358     if (prev->vm_end == addr && can_vma_merge(prev, vm_flags)) {
359         struct vm_area_struct * next;
361         spin_lock(lock);
362         prev->vm_end = end;
363         next = prev->vm_next;
364         if (next && prev->vm_end == next->vm_start &&
                   can_vma_merge(next, vm_flags)) {
365             prev->vm_end = next->vm_end;
366             __vma_unlink(mm, next, prev);
367             spin_unlock(lock);
369             mm->map_count--;
370             kmem_cache_free(vm_area_cachep, next);
371             return 1;
372         }
373         spin_unlock(lock);
374         return 1;
375     }
377     prev = prev->vm_next;
378     if (prev) {
379  merge_next:
380         if (!can_vma_merge(prev, vm_flags))
381             return 0;
382         if (end == prev->vm_start) {
383             spin_lock(lock);
384             prev->vm_start = addr;
385             spin_unlock(lock);
386             return 1;
387         }
388     }
390     return 0;
391 }
358-375Check to see if the region pointed to by prev may be expanded to cover the current region
358The function can_vma_merge() checks the permissions of prev with those in vm_flags and that the VMA has no file mappings (i.e. it is anonymous). If it is true, the area at prev may be expanded
361Lock the mm
362Expand the end of the VMA region (vm_end) to the end of the new mapping (end)
363next is now the VMA in front of the newly expanded VMA
364Check if the expanded region can be merged with the VMA in front of it
365If it can, continue to expand the region to cover the next VMA
366As a VMA has been merged, one region is now defunct and may be unlinked
367No further adjustments are made to the mm struct so the lock is released
369There is one less mapped region so reduce the map_count
370Delete the struct describing the merged VMA
371Return success
377If this line is reached it means the region pointed to by prev could not be expanded forward so a check is made to see if the region ahead can be merged backwards instead
382-388Same idea as the above block except instead of adjusting vm_end to cover end, vm_start is moved back to cover addr
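The forward and backward merge can be modelled in user space with a sorted singly linked list. This is only a minimal sketch under stated assumptions, not the kernel code: struct region, region_can_merge() and region_merge() are hypothetical names, the has_file field stands in for vm_file, locking and the red-black tree are left out, and prev must not be NULL.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for vm_area_struct. */
struct region {
    unsigned long start, end;   /* covers [start, end) */
    unsigned long flags;
    int has_file;               /* non-zero if file backed */
    struct region *next;        /* next region in address order */
};

/* Mirrors can_vma_merge(): anonymous and identical flags. */
static int region_can_merge(const struct region *r, unsigned long flags)
{
    return !r->has_file && r->flags == flags;
}

/*
 * Try to cover [addr, end) by growing prev forwards or, failing that,
 * by growing the region after prev backwards. Returns 1 if a merge
 * took place and 0 if the caller must allocate a new region.
 * Regions absorbed by a merge are released with free().
 */
static int region_merge(struct region *prev, unsigned long addr,
                        unsigned long end, unsigned long flags)
{
    struct region *next;

    assert(prev != NULL && addr < end);

    if (prev->end == addr && region_can_merge(prev, flags)) {
        prev->end = end;
        next = prev->next;
        /* The grown region may now touch its successor as well. */
        if (next && prev->end == next->start &&
            region_can_merge(next, flags)) {
            prev->end = next->end;
            prev->next = next->next;
            free(next);
        }
        return 1;
    }

    next = prev->next;
    if (next && region_can_merge(next, flags) && end == next->start) {
        next->start = addr;
        return 1;
    }
    return 0;
}
```

A return value of 0 corresponds to the case where vma_merge() fails and the caller has to allocate and insert a fresh VMA.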

D.2.3.2  Function: can_vma_merge

Source: include/linux/mm.h

This trivial function checks to see if the permissions of the supplied VMA match the permissions in vm_flags

582 static inline int can_vma_merge(struct vm_area_struct * vma, 
                        unsigned long vm_flags)
583 {
584     if (!vma->vm_file && vma->vm_flags == vm_flags)
585         return 1;
586     else
587         return 0;
588 }
584Self explanatory. Return true if there is no file/device mapping (i.e. it is anonymous) and the VMA flags for both regions match

D.2.4  Remapping and Moving a Memory Region

D.2.4.1  Function: sys_mremap

Source: mm/mremap.c

The call graph for this function is shown in Figure 4.7. This is the system service call to remap a memory region

347 asmlinkage unsigned long sys_mremap(unsigned long addr,
348     unsigned long old_len, unsigned long new_len,
349     unsigned long flags, unsigned long new_addr)
350 {
351     unsigned long ret;
353     down_write(&current->mm->mmap_sem);
354     ret = do_mremap(addr, old_len, new_len, flags, new_addr);
355     up_write(&current->mm->mmap_sem);
356     return ret;
357 }
347-349The parameters are the same as those described in the mremap() man page
353Acquire the mm semaphore
354do_mremap()(See Section D.2.4.2) is the top level function for remapping a region
355Release the mm semaphore
356Return the status of the remapping

D.2.4.2  Function: do_mremap

Source: mm/mremap.c

This function does most of the actual “work” required to remap, resize and move a memory region. It is quite long but can be broken up into distinct parts which will be dealt with separately here.

219 unsigned long do_mremap(unsigned long addr,
220     unsigned long old_len, unsigned long new_len,
221     unsigned long flags, unsigned long new_addr)
222 {
223     struct vm_area_struct *vma;
224     unsigned long ret = -EINVAL;
226     if (flags & ~(MREMAP_FIXED | MREMAP_MAYMOVE))
227         goto out;
229     if (addr & ~PAGE_MASK)
230         goto out;
232     old_len = PAGE_ALIGN(old_len);
233     new_len = PAGE_ALIGN(new_len);
219The parameters of the function are
addr is the old starting address
old_len is the old region length
new_len is the new region length
flags is the option flags passed. If MREMAP_MAYMOVE is specified, it means that the region is allowed to move if there is not enough linear address space at the current location. If MREMAP_FIXED is specified, it means that the whole region is to move to the specified new_addr with the new length. The area from new_addr to new_addr+new_len will be unmapped with do_munmap().
new_addr is the address of the new region if it is moved
224At this point, the default return is -EINVAL for invalid arguments
226-227Make sure flags other than the two allowed flags are not used
229-230The address passed in must be page aligned
232-233Page align the passed region lengths
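The alignment test and the rounding of the lengths are plain bit arithmetic. The following sketch assumes 4KiB pages as on 32-bit x86; the DEMO_ macros and demo_ functions are illustrative names that mirror, but are not, the kernel's PAGE_MASK and PAGE_ALIGN().

```c
#include <assert.h>

/* Assumed 4KiB pages, as on 32-bit x86; names are illustrative. */
#define DEMO_PAGE_SHIFT 12
#define DEMO_PAGE_SIZE  (1UL << DEMO_PAGE_SHIFT)
#define DEMO_PAGE_MASK  (~(DEMO_PAGE_SIZE - 1))

/* Non-zero if addr sits exactly on a page boundary (lines 229-230). */
static int demo_page_aligned(unsigned long addr)
{
    return (addr & ~DEMO_PAGE_MASK) == 0;
}

/* Round len up to the next page boundary (lines 232-233). */
static unsigned long demo_page_align(unsigned long len)
{
    return (len + DEMO_PAGE_SIZE - 1) & DEMO_PAGE_MASK;
}

/* A few worked values for the two helpers above. */
static void demo_alignment_examples(void)
{
    assert(demo_page_aligned(0x2000));
    assert(!demo_page_aligned(0x2001));
    assert(demo_page_align(1) == 0x1000);
    assert(demo_page_align(0x1000) == 0x1000);
    assert(demo_page_align(0x1001) == 0x2000);
}
```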
236     if (flags & MREMAP_FIXED) {
237         if (new_addr & ~PAGE_MASK)
238             goto out;
239         if (!(flags & MREMAP_MAYMOVE))
240             goto out;
242         if (new_len > TASK_SIZE || new_addr > TASK_SIZE - new_len)
243             goto out;
245         /* Check if the location we're moving into overlaps the
246          * old location at all, and fail if it does.
247          */
248         if ((new_addr <= addr) && (new_addr+new_len) > addr)
249             goto out;
251         if ((addr <= new_addr) && (addr+old_len) > new_addr)
252             goto out;
254         do_munmap(current->mm, new_addr, new_len);
255     }

This block handles the condition where the region location is fixed and must be fully moved. It ensures the area being moved to is safe and definitely unmapped.

236MREMAP_FIXED is the flag which indicates the location is fixed
237-238The specified new_addr must be page aligned
239-240If MREMAP_FIXED is specified, then the MAYMOVE flag must be used as well
242-243Make sure the resized region does not exceed TASK_SIZE
248-252Just as the comments indicate, the two regions being used for the move may not overlap
254Unmap the region that is about to be used. It is presumed the caller ensures that the region is not in use for anything important
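The two overlap tests on lines 248-252 can be isolated into a small predicate. This is a sketch only; demo_fixed_move_overlaps() is a hypothetical helper and, like the kernel code, it ignores arithmetic overflow.

```c
#include <assert.h>

/*
 * Returns 1 if the destination [new_addr, new_addr + new_len) overlaps
 * the source [addr, addr + old_len), using the same two comparisons as
 * do_mremap(). Regions that merely touch do not overlap.
 */
static int demo_fixed_move_overlaps(unsigned long addr, unsigned long old_len,
                                    unsigned long new_addr,
                                    unsigned long new_len)
{
    /* Destination starts at or below the source and runs into it. */
    if (new_addr <= addr && new_addr + new_len > addr)
        return 1;
    /* Source starts at or below the destination and runs into it. */
    if (addr <= new_addr && addr + old_len > new_addr)
        return 1;
    return 0;
}

/* Worked cases: disjoint, adjacent and overlapping in either direction. */
static void demo_overlap_examples(void)
{
    assert(!demo_fixed_move_overlaps(0x1000, 0x1000, 0x3000, 0x1000));
    assert(!demo_fixed_move_overlaps(0x1000, 0x1000, 0x2000, 0x1000));
    assert(demo_fixed_move_overlaps(0x1000, 0x2000, 0x2000, 0x1000));
    assert(demo_fixed_move_overlaps(0x3000, 0x1000, 0x2000, 0x2000));
}
```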
261     ret = addr;
262     if (old_len >= new_len) {
263         do_munmap(current->mm, addr+new_len, old_len - new_len);
264         if (!(flags & MREMAP_FIXED) || (new_addr == addr))
265             goto out;
266     }
261At this point, the address of the resized region is the return value
262If the old length is larger than the new length, then the region is shrinking
263Unmap the unused region
264-265If the region is not to be moved, either because MREMAP_FIXED is not used or the new address matches the old address, goto out which will return the address
271     ret = -EFAULT;
272     vma = find_vma(current->mm, addr);
273     if (!vma || vma->vm_start > addr)
274         goto out;
275     /* We can't remap across vm area boundaries */
276     if (old_len > vma->vm_end - addr)
277         goto out;
278     if (vma->vm_flags & VM_DONTEXPAND) {
279         if (new_len > old_len)
280             goto out;
281     }
282     if (vma->vm_flags & VM_LOCKED) {
283         unsigned long locked = current->mm->locked_vm << PAGE_SHIFT;
284         locked += new_len - old_len;
285         ret = -EAGAIN;
286         if (locked > current->rlim[RLIMIT_MEMLOCK].rlim_cur)
287             goto out;
288     }
289     ret = -ENOMEM;
290     if ((current->mm->total_vm << PAGE_SHIFT) + (new_len - old_len)
291         > current->rlim[RLIMIT_AS].rlim_cur)
292         goto out;
293     /* Private writable mapping? Check memory availability.. */
294     if ((vma->vm_flags & (VM_SHARED | VM_WRITE)) == VM_WRITE &&
295         !(flags & MAP_NORESERVE) &&
296         !vm_enough_memory((new_len - old_len) >> PAGE_SHIFT))
297         goto out;

Do a number of checks to make sure it is safe to grow or move the region

271At this point, the default action is to return -EFAULT causing a segmentation fault as the ranges of memory being used are invalid
272Find the VMA responsible for the requested address
273If the returned VMA is not responsible for this address, then an invalid address was used so return a fault
276-277If the old_len passed in exceeds the length of the VMA, it means the user is trying to remap multiple regions which is not allowed
278-281If the VMA has been explicitly marked as non-resizable, raise a fault
282-288If the pages for this VMA must be locked in memory, recalculate the number of locked pages that will be kept in memory. If the number of pages exceeds the ulimit set for this resource, return EAGAIN indicating to the caller that the region is locked and cannot be resized
289The default return at this point is to indicate there is not enough memory
290-292Ensure that the user will not exceed their allowed allocation of memory
294-297Ensure that there is enough memory to satisfy the request after the resizing with vm_enough_memory()(See Section M.1.1)
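The limit check on lines 290-292 compares the size the address space would have after the resize against a byte limit. A minimal sketch, assuming 4KiB pages; demo_growth_exceeds_limit() is a hypothetical name and the limit is passed in directly rather than read from the task's rlim array.

```c
#include <assert.h>

#define DEMO_PAGE_SHIFT 12  /* assumed 4KiB pages */

/*
 * cur_pages is the current size in pages (total_vm in the kernel),
 * old_len and new_len are page aligned byte lengths and limit is a
 * byte limit. Returns 1 if growing would exceed the limit.
 */
static int demo_growth_exceeds_limit(unsigned long cur_pages,
                                     unsigned long old_len,
                                     unsigned long new_len,
                                     unsigned long limit)
{
    assert(new_len >= old_len);
    return (cur_pages << DEMO_PAGE_SHIFT) + (new_len - old_len) > limit;
}
```

Landing exactly on the limit is allowed because the comparison is a strict greater-than, as in the listing.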
302     if (old_len == vma->vm_end - addr &&
303         !((flags & MREMAP_FIXED) && (addr != new_addr)) &&
304         (old_len != new_len || !(flags & MREMAP_MAYMOVE))) {
305         unsigned long max_addr = TASK_SIZE;
306         if (vma->vm_next)
307             max_addr = vma->vm_next->vm_start;
308         /* can we just expand the current mapping? */
309         if (max_addr - addr >= new_len) {
310             int pages = (new_len - old_len) >> PAGE_SHIFT;
311             spin_lock(&vma->vm_mm->page_table_lock);
312             vma->vm_end = addr + new_len;
313             spin_unlock(&vma->vm_mm->page_table_lock);
314             current->mm->total_vm += pages;
315             if (vma->vm_flags & VM_LOCKED) {
316                 current->mm->locked_vm += pages;
317                 make_pages_present(addr + old_len,
318                            addr + new_len);
319             }
320             ret = addr;
321             goto out;
322         }
323     }

Handle the case where the region is being expanded and cannot be moved

302If it is the full region that is being remapped and ...
303The region is definitely not being moved and ...
304The region is being expanded and cannot be moved then ...
305Set the maximum address that can be used to TASK_SIZE, 3GiB on an x86
306-307If there is another region, set the max address to be the start of the next region
309-322Only allow the expansion if the newly sized region does not overlap with the next VMA
310Calculate the number of extra pages that will be required
311Lock the mm spinlock
312Expand the VMA
313Free the mm spinlock
314Update the statistics for the mm
315-319If the pages for this region are locked in memory, make them present now
320-321Return the address of the resized region
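The in-place expansion reduces to one comparison against the start of the next region followed by an update of the end address. The sketch below assumes 4KiB pages; struct demo_span and demo_expand_in_place() are hypothetical, and limit plays the role of max_addr.

```c
#include <assert.h>

#define DEMO_PAGE_SHIFT 12  /* assumed 4KiB pages */

/* Hypothetical region covering [start, end). */
struct demo_span {
    unsigned long start, end;
};

/*
 * Grow r to new_len bytes provided it stays at or below limit, which
 * is either the start of the next region or the top of user space.
 * Returns the number of pages added, or -1 if there is no room.
 */
static long demo_expand_in_place(struct demo_span *r, unsigned long new_len,
                                 unsigned long limit)
{
    unsigned long old_len = r->end - r->start;

    assert(new_len >= old_len && limit >= r->start);
    if (limit - r->start < new_len)
        return -1;
    r->end = r->start + new_len;
    return (long)((new_len - old_len) >> DEMO_PAGE_SHIFT);
}
```

The returned page count corresponds to the value added to total_vm, and to locked_vm for a locked region, in the listing.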
329     ret = -ENOMEM;
330     if (flags & MREMAP_MAYMOVE) {
331         if (!(flags & MREMAP_FIXED)) {
332             unsigned long map_flags = 0;
333             if (vma->vm_flags & VM_SHARED)
334                 map_flags |= MAP_SHARED;
336             new_addr = get_unmapped_area(vma->vm_file, 0,
                     new_len, vma->vm_pgoff, map_flags);
337             ret = new_addr;
338             if (new_addr & ~PAGE_MASK)
339                 goto out;
340         }
341         ret = move_vma(vma, addr, old_len, new_len, new_addr);
342     }
343 out:
344     return ret;
345 }

To expand the region, a new one has to be allocated and the old one moved to it

329The default action is to return saying no memory is available
330Check to make sure the region is allowed to move
331If MREMAP_FIXED is not specified, it means the new location was not supplied so one must be found
333-334Preserve the MAP_SHARED option
336Find an unmapped region of memory large enough for the expansion
337The return value is the address of the new region
338-339For the returned address to be not page aligned, get_unmapped_area() would need to be broken. This could possibly be the case with a buggy device driver implementing get_unmapped_area() incorrectly
341Call move_vma to move the region
343-344Return the address if successful and the error code otherwise

D.2.4.3  Function: move_vma

Source: mm/mremap.c

The call graph for this function is shown in Figure 4.8. This function is responsible for moving all the page table entries from one VMA to another region. If necessary a new VMA will be allocated for the region being moved to. Just like the function above, it is very long but may be broken up into distinct parts.

125 static inline unsigned long move_vma(struct vm_area_struct * vma,
126     unsigned long addr, unsigned long old_len, unsigned long new_len,
127     unsigned long new_addr)
128 {
129     struct mm_struct * mm = vma->vm_mm;
130     struct vm_area_struct * new_vma, * next, * prev;
131     int allocated_vma;
133     new_vma = NULL;
134     next = find_vma_prev(mm, new_addr, &prev);
125-127The parameters are
vma The VMA that the address being moved belongs to
addr The starting address of the moving region
old_len The old length of the region to move
new_len The new length of the region moved
new_addr The new address to relocate to
134Find the VMA preceding the address being moved to, indicated by prev, and return the region after the new mapping as next
135     if (next) {
136         if (prev && prev->vm_end == new_addr &&
137             can_vma_merge(prev, vma->vm_flags) && 
              !vma->vm_file && !(vma->vm_flags & VM_SHARED)) {
138             spin_lock(&mm->page_table_lock);
139             prev->vm_end = new_addr + new_len;
140             spin_unlock(&mm->page_table_lock);
141             new_vma = prev;
142             if (next != prev->vm_next)
143                 BUG();
144             if (prev->vm_end == next->vm_start &&
                can_vma_merge(next, prev->vm_flags)) {
145                 spin_lock(&mm->page_table_lock);
146                 prev->vm_end = next->vm_end;
147                 __vma_unlink(mm, next, prev);
148                 spin_unlock(&mm->page_table_lock);
150                 mm->map_count--;
151                 kmem_cache_free(vm_area_cachep, next);
152             }
153         } else if (next->vm_start == new_addr + new_len &&
154                can_vma_merge(next, vma->vm_flags) &&
                 !vma->vm_file && !(vma->vm_flags & VM_SHARED)) {
155             spin_lock(&mm->page_table_lock);
156             next->vm_start = new_addr;
157             spin_unlock(&mm->page_table_lock);
158             new_vma = next;
159         }
160     } else {

In this block, the new location is between two existing VMAs. Checks are made to see if the preceding region can be expanded to cover the new mapping and then if it can be expanded to cover the next VMA as well. If it cannot be expanded, the next region is checked to see if it can be expanded backwards.

136-137If the preceding region touches the address to be mapped to and may be merged then enter this block which will attempt to expand regions
138Lock the mm
139Expand the preceding region to cover the new location
140Unlock the mm
141The new VMA is now the preceding VMA which was just expanded
142-143Make sure the VMA linked list is intact. It would require a device driver with severe brain damage to cause this situation to occur
144Check if the region can be expanded forward to encompass the next region
145If it can, then lock the mm
146Expand the VMA further to cover the next VMA
147There is now an extra VMA so unlink it
148Unlock the mm
150There is one less mapping now so update the map_count
151Free the memory used by the memory mapping
153Else the prev region could not be expanded forward so check if the region pointed to by next may be expanded backwards to cover the new mapping instead
155If it can, lock the mm
156Expand the mapping backwards
157Unlock the mm
158The VMA representing the new mapping is now next
161         prev = find_vma(mm, new_addr-1);
162         if (prev && prev->vm_end == new_addr &&
163             can_vma_merge(prev, vma->vm_flags) && !vma->vm_file &&
                    !(vma->vm_flags & VM_SHARED)) {
164             spin_lock(&mm->page_table_lock);
165             prev->vm_end = new_addr + new_len;
166             spin_unlock(&mm->page_table_lock);
167             new_vma = prev;
168         }
169     }

This block is for the case where the newly mapped region is the last VMA (next is NULL) so a check is made to see if the preceding region can be expanded.

161Get the previously mapped region
162-163Check if the regions may be merged
164Lock the mm
165Expand the preceding region to cover the new mapping
166Unlock the mm
167The VMA representing the new mapping is now prev
171     allocated_vma = 0;
172     if (!new_vma) {
173         new_vma = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL);
174         if (!new_vma)
175             goto out;
176         allocated_vma = 1;
177     }
171Initialise the flag to indicate that a new VMA has not been allocated
172If a VMA has not been expanded to cover the new mapping then...
173Allocate a new VMA from the slab allocator
174-175If it could not be allocated, goto out to return failure
176Set the flag indicating a new VMA was allocated
179     if (!move_page_tables(current->mm, new_addr, addr, old_len)) {
180         unsigned long vm_locked = vma->vm_flags & VM_LOCKED;
182         if (allocated_vma) {
183             *new_vma = *vma;
184             new_vma->vm_start = new_addr;
185             new_vma->vm_end = new_addr+new_len;
186             new_vma->vm_pgoff += 
                     (addr-vma->vm_start) >> PAGE_SHIFT;
187             new_vma->vm_raend = 0;
188             if (new_vma->vm_file)
189                 get_file(new_vma->vm_file);
190             if (new_vma->vm_ops && new_vma->vm_ops->open)
191                 new_vma->vm_ops->open(new_vma);
192             insert_vm_struct(current->mm, new_vma);
193         }
194         do_munmap(current->mm, addr, old_len);
197         current->mm->total_vm += new_len >> PAGE_SHIFT;
198         if (new_vma->vm_flags & VM_LOCKED) {
199             current->mm->locked_vm += new_len >> PAGE_SHIFT;
200             make_pages_present(new_vma->vm_start,
201                        new_vma->vm_end);
202         }
203         return new_addr;
204     }
205     if (allocated_vma)
206         kmem_cache_free(vm_area_cachep, new_vma);
207  out:
208     return -ENOMEM;
209 }
179move_page_tables()(See Section D.2.4.6) is responsible for copying all the page table entries. It returns 0 on success
182-193If a new VMA was allocated, fill in all the relevant details, including the file/device entries and insert it into the various VMA linked lists with insert_vm_struct()(See Section D.2.2.1)
194Unmap the old region as it is no longer required
197Update the total_vm size for this process. The size of the old region is not important as it is handled within do_munmap()
198-202If the VMA has the VM_LOCKED flag, all the pages within the region are made present with make_pages_present()
203Return the address of the new region
205-206This is the error path. If a VMA was allocated, delete it
208Return an out of memory error

D.2.4.4  Function: make_pages_present

Source: mm/memory.c

This function makes all pages between addr and end present. It assumes that the two addresses are within the one VMA.

1460 int make_pages_present(unsigned long addr, unsigned long end)
1461 {
1462     int ret, len, write;
1463     struct vm_area_struct * vma;
1465     vma = find_vma(current->mm, addr);
1466     write = (vma->vm_flags & VM_WRITE) != 0;
1467     if (addr >= end)
1468         BUG();
1469     if (end > vma->vm_end)
1470         BUG();
1471     len = (end+PAGE_SIZE-1)/PAGE_SIZE-addr/PAGE_SIZE;
1472     ret = get_user_pages(current, current->mm, addr,
1473                 len, write, 0, NULL, NULL);
1474     return ret == len ? 0 : -1;
1475 }
1465Find the VMA with find_vma()(See Section D.3.1.1) that contains the starting address
1466Record if write-access is allowed in write
1467-1468If the starting address is after the end address, then BUG()
1469-1470If the range spans more than one VMA, it is a bug
1471Calculate the length of the region to fault in
1472Call get_user_pages() to fault in all the pages in the requested region. It returns the number of pages that were faulted in
1474Return 0 if all the requested pages were successfully faulted in and -1 otherwise
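The length calculation on line 1471 counts every page touched by the range, including partially covered pages at either end. A small sketch assuming 4KiB pages, with demo_pages_spanned() as a hypothetical name:

```c
#include <assert.h>

#define DEMO_PAGE_SIZE 4096UL   /* assumed 4KiB pages */

/*
 * Number of pages touched by [addr, end): round end up to a page
 * boundary, round addr down and take the difference in pages.
 */
static unsigned long demo_pages_spanned(unsigned long addr, unsigned long end)
{
    assert(addr < end);
    return (end + DEMO_PAGE_SIZE - 1) / DEMO_PAGE_SIZE
           - addr / DEMO_PAGE_SIZE;
}

/* Aligned, unaligned and single-byte ranges. */
static void demo_pages_spanned_examples(void)
{
    assert(demo_pages_spanned(0x1000, 0x3000) == 2);
    assert(demo_pages_spanned(0x1800, 0x2001) == 2);
    assert(demo_pages_spanned(0x1001, 0x1002) == 1);
}
```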

D.2.4.5  Function: get_user_pages

Source: mm/memory.c

This function is used to fault in user pages and may be used to fault in pages belonging to another process, which is required by ptrace() for example.

454 int get_user_pages(struct task_struct *tsk, struct mm_struct *mm, 
                       unsigned long start,
455                    int len, int write, int force, struct page **pages, 
                       struct vm_area_struct **vmas)
456 {
457     int i;
458     unsigned int flags;
460     /*
461      * Require read or write permissions.
462      * If 'force' is set, we only require the "MAY" flags.
463      */
464     flags =  write ? (VM_WRITE | VM_MAYWRITE) : (VM_READ | VM_MAYREAD);
465     flags &= force ? (VM_MAYREAD | VM_MAYWRITE) : (VM_READ | VM_WRITE);
466     i = 0;
454The parameters are:
tsk is the process that pages are being faulted for
mm is the mm_struct managing the address space being faulted
start is where to start faulting
len is the length of the region, in pages, to fault
write indicates if the pages are being faulted for writing
force indicates that the pages should be faulted even if the region only has the VM_MAYREAD or VM_MAYWRITE flags
pages is an array of struct pages which may be NULL. If supplied, the array will be filled with struct pages that were faulted in
vmas is similar to the pages array. If supplied, it will be filled with VMAs that were affected by the faults
464Set the required flags to VM_WRITE and VM_MAYWRITE if the parameter write is set to 1. Otherwise use the read equivalents
465If force is specified, only require the MAY flags
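The effect of the two assignments on lines 464-465 is easiest to see by working through the four combinations of write and force. In the sketch below the DEMO_VM_ constants are illustrative bit values chosen for the example, and demo_required_flags() and demo_access_allowed() are hypothetical helpers.

```c
#include <assert.h>

/* Illustrative bit values; only their distinctness matters here. */
#define DEMO_VM_READ     0x01UL
#define DEMO_VM_WRITE    0x02UL
#define DEMO_VM_MAYREAD  0x10UL
#define DEMO_VM_MAYWRITE 0x20UL

/* Reduce the candidate flags to the single bit the VMA must have. */
static unsigned long demo_required_flags(int write, int force)
{
    unsigned long flags;

    flags = write ? (DEMO_VM_WRITE | DEMO_VM_MAYWRITE)
                  : (DEMO_VM_READ | DEMO_VM_MAYREAD);
    flags &= force ? (DEMO_VM_MAYREAD | DEMO_VM_MAYWRITE)
                   : (DEMO_VM_READ | DEMO_VM_WRITE);
    return flags;
}

/* Same test as !(flags & vma->vm_flags) on line 473, inverted. */
static int demo_access_allowed(unsigned long vm_flags, int write, int force)
{
    return (demo_required_flags(write, force) & vm_flags) != 0;
}

/* The four combinations of write and force. */
static void demo_flag_examples(void)
{
    assert(demo_required_flags(0, 0) == DEMO_VM_READ);
    assert(demo_required_flags(1, 0) == DEMO_VM_WRITE);
    assert(demo_required_flags(0, 1) == DEMO_VM_MAYREAD);
    assert(demo_required_flags(1, 1) == DEMO_VM_MAYWRITE);
}
```

A region that is currently read-only but could be made writable therefore refuses a normal write request and accepts a forced one.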
468     do {
469         struct vm_area_struct * vma;
471         vma = find_extend_vma(mm, start);
473         if ( !vma || 
                 (pages && vma->vm_flags & VM_IO) || 
                 !(flags & vma->vm_flags) )
474             return i ? : -EFAULT;
476         spin_lock(&mm->page_table_lock);
477         do {
478             struct page *map;
479             while (!(map = follow_page(mm, start, write))) {
480                 spin_unlock(&mm->page_table_lock);
481                 switch (handle_mm_fault(mm, vma, start, write)) {
482                 case 1:
483                     tsk->min_flt++;
484                     break;
485                 case 2:
486                     tsk->maj_flt++;
487                     break;
488                 case 0:
489                     if (i) return i;
490                     return -EFAULT;
491                 default:
492                     if (i) return i;
493                     return -ENOMEM;
494                 }
495                 spin_lock(&mm->page_table_lock);
496             }
497             if (pages) {
498                 pages[i] = get_page_map(map);
499                 /* FIXME: call the correct function,
500                  * depending on the type of the found page
501                  */
502                 if (!pages[i])
503                     goto bad_page;
504                 page_cache_get(pages[i]);
505             }
506             if (vmas)
507                 vmas[i] = vma;
508             i++;
509             start += PAGE_SIZE;
510             len--;
511         } while(len && start < vma->vm_end);
512         spin_unlock(&mm->page_table_lock);
513     } while(len);
514 out:
515     return i;
468-513This outer loop will move through every VMA affected by the faults
471Find the VMA affected by the current value of start. This variable is incremented in PAGE_SIZEd strides
473If a VMA does not exist for the address, or the caller has requested struct pages for a region that is IO mapped (and therefore not backed by physical memory), or the VMA does not have the required flags, then return the number of pages faulted in so far or, if there were none, -EFAULT
476Lock the page table spinlock
479-496follow_page()(See Section C.2.1) walks the page tables and returns the struct page representing the frame mapped at start. This loop will only be entered if the PTE is not present and will keep looping until the PTE is known to be present with the page table spinlock held
480Unlock the page table spinlock as handle_mm_fault() is likely to sleep
481If the page is not present, fault it in with handle_mm_fault()(See Section D.5.3.1)
482-487Update the task_struct statistics indicating if a major or minor fault occurred
488-490If the faulting address is invalid, return
491-493If the system is out of memory, return -ENOMEM
495Relock the page tables. The loop will check to make sure the page is actually present
497-505If the caller requested it, populate the pages array with struct pages affected by this function. Each struct will have a reference to it taken with page_cache_get()
506-507Similarly, record VMAs affected
508Increment i which is a counter for the number of pages present in the requested region
509Increment start in a page-sized stride
510Decrement the number of pages that must be faulted in
511Keep moving through the VMAs until the requested pages have been faulted in
512Release the page table spinlock
515Return the number of pages known to be present in the region
517     /*
518      * We found an invalid page in the VMA.  Release all we have
519      * so far and fail.
520      */
521 bad_page:
522     spin_unlock(&mm->page_table_lock);
523     while (i--)
524         page_cache_release(pages[i]);
525     i = -EFAULT;
526     goto out;
527 }
521This will only be reached if a struct page is found which represents a non-existent page frame
523-524If one is found, release references to all pages stored in the pages array
525-526Return -EFAULT

D.2.4.6  Function: move_page_tables

Source: mm/mremap.c

The call graph for this function is shown in Figure 4.9. This function is responsible for copying all the page table entries from the region pointed to by old_addr to new_addr. It works by literally copying page table entries one at a time. When it is finished, it deletes all the entries from the old area. This is not the most efficient way to perform the operation, but it makes error recovery very easy.

 90 static int move_page_tables(struct mm_struct * mm,
 91     unsigned long new_addr, unsigned long old_addr, 
        unsigned long len)
 92 {
 93     unsigned long offset = len;
 95     flush_cache_range(mm, old_addr, old_addr + len);
102     while (offset) {
103         offset -= PAGE_SIZE;
104         if (move_one_page(mm, old_addr + offset, new_addr + offset))
105             goto oops_we_failed;
106     }
107     flush_tlb_range(mm, old_addr, old_addr + len);
108     return 0;
117 oops_we_failed:
118     flush_cache_range(mm, new_addr, new_addr + len);
119     while ((offset += PAGE_SIZE) < len)
120         move_one_page(mm, new_addr + offset, old_addr + offset);
121     zap_page_range(mm, new_addr, len);
122     return -1;
123 }
90The parameters are the mm for the process, the new location, the old location and the length of the region to move entries for
95flush_cache_range() will flush all CPU caches for this range. It must be called first as some architectures, notably Sparc, require that a virtual to physical mapping exist before flushing the TLB
102-106This loops through each page in the region and moves the PTE with move_one_page()(See Section D.2.4.7). This translates to a lot of page table walking and could be performed much better but it is a rare operation
107Flush the TLB for the old region
108Return success
118-120This block moves all the PTEs back. A flush_tlb_range() is not necessary as there is no way the region could have been used yet so no TLB entries should exist
121Zap any pages that were allocated for the move
122Return failure
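The move-then-undo structure can be tried out in user space with a flat array standing in for the page tables. This is a sketch under stated assumptions: struct demo_table, demo_move_one() and demo_move_entries() are hypothetical, a slot value of 0 means no entry, and a single destination index bad simulates a failed allocation. Cache and TLB flushing have no equivalent here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flat "page table": a slot value of 0 means no entry. */
struct demo_table {
    unsigned long *slot;
    size_t bad;         /* destination index that cannot be allocated */
};

/* Move one entry; returns non-zero if the destination is unavailable. */
static int demo_move_one(struct demo_table *t, size_t from, size_t to)
{
    if (t->slot[from] == 0)
        return 0;               /* nothing mapped, nothing to do */
    if (to == t->bad)
        return 1;               /* models alloc_one_pte() failing */
    t->slot[to] = t->slot[from];
    t->slot[from] = 0;
    return 0;
}

/*
 * Move len entries from src to dst, highest index first. On failure
 * every entry that was already moved is put back and -1 is returned.
 */
static int demo_move_entries(struct demo_table *t, size_t dst, size_t src,
                             size_t len)
{
    size_t offset = len;

    while (offset) {
        offset--;
        if (demo_move_one(t, src + offset, dst + offset))
            goto undo;
    }
    return 0;

undo:
    /* Entries above offset were moved before the failure. */
    while (++offset < len)
        demo_move_one(t, dst + offset, src + offset);
    return -1;
}
```

Working from the highest offset downwards means the undo loop only has to walk upwards from the point of failure, which is the same trick used by the while loop on line 119.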

D.2.4.7  Function: move_one_page

Source: mm/mremap.c

This function is responsible for acquiring the spinlock before finding the correct PTE with get_one_pte() and copying it with copy_one_pte()

 77 static int move_one_page(struct mm_struct *mm, 
                 unsigned long old_addr, unsigned long new_addr)
 78 {
 79     int error = 0;
 80     pte_t * src;
 82     spin_lock(&mm->page_table_lock);
 83     src = get_one_pte(mm, old_addr);
 84     if (src)
 85         error = copy_one_pte(mm, src, alloc_one_pte(mm, new_addr));
 86     spin_unlock(&mm->page_table_lock);
 87     return error;
 88 }
82Acquire the mm lock
83Call get_one_pte()(See Section D.2.4.8) which walks the page tables to get the correct PTE
84-85If the PTE exists, allocate a PTE for the destination and copy the PTEs with copy_one_pte()(See Section D.2.4.10)
86Release the lock
87Return whatever copy_one_pte() returned. It will only return an error if alloc_one_pte()(See Section D.2.4.9) failed on line 85

D.2.4.8  Function: get_one_pte

Source: mm/mremap.c

This is a very simple page table walk.

 18 static inline pte_t *get_one_pte(struct mm_struct *mm, 
                                     unsigned long addr)
 19 {
 20     pgd_t * pgd;
 21     pmd_t * pmd;
 22     pte_t * pte = NULL;
 24     pgd = pgd_offset(mm, addr);
 25     if (pgd_none(*pgd))
 26         goto end;
 27     if (pgd_bad(*pgd)) {
 28         pgd_ERROR(*pgd);
 29         pgd_clear(pgd);
 30         goto end;
 31     }
 33     pmd = pmd_offset(pgd, addr);
 34     if (pmd_none(*pmd))
 35         goto end;
 36     if (pmd_bad(*pmd)) {
 37         pmd_ERROR(*pmd);
 38         pmd_clear(pmd);
 39         goto end;
 40     }
 42     pte = pte_offset(pmd, addr);
 43     if (pte_none(*pte))
 44         pte = NULL;
 45 end:
 46     return pte;
 47 }
24Get the PGD for this address
25-26If no PGD exists, return NULL as no PTE will exist either
27-31If the PGD is bad, mark that an error occurred in the region, clear its contents and return NULL
33-40Acquire the correct PMD in the same fashion as for the PGD
42Acquire the PTE so it may be returned if it exists
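The shape of the walk can be shown with a two-level table in user space. This sketch assumes the 32-bit x86 layout without PAE, where the PMD level is folded away, the top 10 bits of the address index the PGD and the next 10 bits index the PTE table. The demo_ names are hypothetical and the bad-entry handling of the real function is omitted.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PAGE_SHIFT  12     /* 4KiB pages */
#define DEMO_PGDIR_SHIFT 22     /* each PGD entry covers 4MiB */
#define DEMO_PTRS        1024   /* entries per table */

typedef unsigned long demo_pte;     /* 0 means no entry */

/* Top level table: each entry points to a table of PTEs or is NULL. */
struct demo_pgd {
    demo_pte *table[DEMO_PTRS];
};

/* Return a pointer to the PTE mapping addr, or NULL if none exists. */
static demo_pte *demo_get_pte(struct demo_pgd *pgd, unsigned long addr)
{
    size_t pgd_idx = (addr >> DEMO_PGDIR_SHIFT) & (DEMO_PTRS - 1);
    size_t pte_idx = (addr >> DEMO_PAGE_SHIFT) & (DEMO_PTRS - 1);
    demo_pte *ptes = pgd->table[pgd_idx];

    assert(pgd != NULL);
    if (!ptes)
        return NULL;            /* like pgd_none(): no lower table */
    if (ptes[pte_idx] == 0)
        return NULL;            /* like pte_none(): no mapping */
    return &ptes[pte_idx];
}
```

For the address 0x08049000 the PGD index is 32 and the PTE index is 73.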

D.2.4.9  Function: alloc_one_pte

Source: mm/mremap.c

Trivial function to allocate what is necessary for one PTE in a region.

 49 static inline pte_t *alloc_one_pte(struct mm_struct *mm, 
                     unsigned long addr)
 50 {
 51     pmd_t * pmd;
 52     pte_t * pte = NULL;
 54     pmd = pmd_alloc(mm, pgd_offset(mm, addr), addr);
 55     if (pmd)
 56         pte = pte_alloc(mm, pmd, addr);
 57     return pte;
 58 }
54If a PMD entry does not exist, allocate it
55-56If the PMD exists, allocate a PTE entry. The check to make sure it succeeded is performed later in the function copy_one_pte()

D.2.4.10  Function: copy_one_pte

Source: mm/mremap.c

Copies the contents of one PTE to another.

 60 static inline int copy_one_pte(struct mm_struct *mm, 
                   pte_t * src, pte_t * dst)
 61 {
 62     int error = 0;
 63     pte_t pte;
 65     if (!pte_none(*src)) {
 66         pte = ptep_get_and_clear(src);
 67         if (!dst) {
 68             /* No dest?  We must put it back. */
 69             dst = src;
 70             error++;
 71         }
 72         set_pte(dst, pte);
 73     }
 74     return error;
 75 }
65If the source PTE does not exist, just return 0 to say the copy was successful
66Get the PTE and remove it from its old location
67-71If the dst does not exist, it means the call to alloc_one_pte() failed and the copy operation has failed and must be aborted
72Move the PTE to its new location
74Return an error if one occurred
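The put-it-back behaviour when the destination is missing can be checked with plain integers in place of PTEs. A minimal sketch in which demo_pte_t and demo_copy_one_pte() are hypothetical, 0 stands for an empty entry, and the read-then-clear pair stands in for ptep_get_and_clear():

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned long demo_pte_t;   /* 0 means no entry */

/*
 * Move *src to *dst. If dst is NULL the entry is restored to src and
 * 1 is returned so that the caller can abort the whole move.
 */
static int demo_copy_one_pte(demo_pte_t *src, demo_pte_t *dst)
{
    int error = 0;
    demo_pte_t pte;

    if (*src != 0) {
        pte = *src;             /* read the entry ... */
        *src = 0;               /* ... and clear the old location */
        if (!dst) {
            dst = src;          /* no destination: put it back */
            error++;
        }
        *dst = pte;
    }
    return error;
}

/* Successful move, failed move and empty source. */
static void demo_copy_examples(void)
{
    demo_pte_t a = 5, b = 0, c = 7, d = 0, e = 9;

    assert(demo_copy_one_pte(&a, &b) == 0 && a == 0 && b == 5);
    assert(demo_copy_one_pte(&c, NULL) == 1 && c == 7);
    assert(demo_copy_one_pte(&d, &e) == 0 && e == 9);
}
```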

D.2.5  Deleting a memory region

D.2.5.1  Function: do_munmap

Source: mm/mmap.c

The call graph for this function is shown in Figure 4.11. This function is responsible for unmapping a region. If necessary, the unmapping can span multiple VMAs and it can partially unmap one if necessary. Hence the full unmapping operation is divided into two major operations. This function is responsible for finding what VMAs are affected and unmap_fixup() is responsible for fixing up the remaining VMAs.

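This function is divided into a number of small sections which will be dealt with in turn. Broadly speaking, they are: the function preamble, which validates the request and finds the first VMA affected; the removal of all affected VMAs from the mm onto a linked list headed by the variable free; the loop over that list, which unmaps the pages in each region and calls unmap_fixup() to fix up the mappings; and finally the validation of the mm and the freeing of memory associated with the unmapping.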

924 int do_munmap(struct mm_struct *mm, unsigned long addr, 
                  size_t len)
925 {
926     struct vm_area_struct *mpnt, *prev, **npp, *free, *extra;
928     if ((addr & ~PAGE_MASK) || addr > TASK_SIZE || 
                     len  > TASK_SIZE-addr)
929         return -EINVAL;
931     if ((len = PAGE_ALIGN(len)) == 0)
932         return -EINVAL;
939     mpnt = find_vma_prev(mm, addr, &prev);
940     if (!mpnt)
941         return 0;
942     /* we have  addr < mpnt->vm_end  */
944     if (mpnt->vm_start >= addr+len)
945         return 0;
948     if ((mpnt->vm_start < addr && mpnt->vm_end > addr+len)
949         && mm->map_count >= max_map_count)
950         return -ENOMEM;
956     extra = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL);
957     if (!extra)
958         return -ENOMEM;
924The parameters are as follows;
mm The mm for the process performing the unmap operation
addr The starting address of the region to unmap
len The length of the region
928-929Ensure the address is page aligned and that the area to be unmapped is not in the kernel virtual address space
931-932Page align the length of the region to unmap and return -EINVAL if the resulting length is 0
939Find the VMA that contains the starting address and the preceding VMA so it can be easily unlinked later
940-941If no mpnt was returned, the address must be past the last used VMA, so the address space is unused and the function just returns
944-945If the returned VMA starts past the region we are trying to unmap, then the region is unused and the function just returns
948-950The first part of the check sees if the VMA is only being partially unmapped with a hole created in the middle. If it is, another VMA will be created later to deal with the region being broken in two, so the map_count has to be checked to make sure it is not too large
956-958In case a new mapping is required, it is allocated now as later it will be much more difficult to back out in the event of an error
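
The argument checks at lines 928-932 can be tried in isolation. This is a rough userspace sketch under stated assumptions: 4 KiB pages, a hypothetical SKETCH_TASK_SIZE of 3 GiB, and a hypothetical helper name check_unmap_request(). It only mirrors the arithmetic and is not the kernel's interface.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define PAGE_SIZE        4096UL                 /* assumption: 4 KiB pages */
#define PAGE_MASK        (~(PAGE_SIZE - 1))
#define PAGE_ALIGN(x)    (((x) + PAGE_SIZE - 1) & PAGE_MASK)
#define SKETCH_TASK_SIZE 0xC0000000UL           /* hypothetical user space limit */

/* Returns 0 and stores the page aligned length, or -EINVAL for a bad request. */
int check_unmap_request(unsigned long addr, unsigned long len,
                        unsigned long *aligned_len)
{
    assert(aligned_len != NULL);

    /* The address must be page aligned and the range must stay in user space. */
    if ((addr & ~PAGE_MASK) || addr > SKETCH_TASK_SIZE ||
        len > SKETCH_TASK_SIZE - addr)
        return -EINVAL;

    /* Round the length up to a whole number of pages; 0 is rejected. */
    len = PAGE_ALIGN(len);
    if (len == 0)
        return -EINVAL;

    *aligned_len = len;
    return 0;
}
```

Comparing len against SKETCH_TASK_SIZE - addr rather than adding addr and len avoids an overflow in the range check.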
960     npp = (prev ? &prev->vm_next : &mm->mmap);
961     free = NULL;
962     spin_lock(&mm->page_table_lock);
963     for ( ; mpnt && mpnt->vm_start < addr+len; mpnt = *npp) {
964         *npp = mpnt->vm_next;
965         mpnt->vm_next = free;
966         free = mpnt;
967         rb_erase(&mpnt->vm_rb, &mm->mm_rb);
968     }
969     mm->mmap_cache = NULL;  /* Kill the cache. */
970     spin_unlock(&mm->page_table_lock);

This section takes all the VMAs affected by the unmapping and places them on a separate linked list headed by a variable called free. This makes the fixup of the regions much easier.

960npp is the address of the pointer that links to the current VMA (mpnt). It is initialised to either the vm_next field of the previous VMA or, if there is no previous VMA, the head of the VMA list in the mm
961free is the head of a linked list of VMAs that are affected by the unmapping
962Lock the mm
963Cycle through the list until the start of the current VMA is past the end of the region to be unmapped
964Unlink the current VMA by making the link that npp points to refer to the next VMA in the list
965-966Remove the current VMA from the linear linked list within the mm and place it on a linked list headed by free. The current mpnt becomes the head of the free linked list
967Delete mpnt from the red-black tree
969Remove the cached result in case the last looked up result is one of the regions to be unmapped
970Unlock the mm
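
The pointer-to-pointer idiom used with npp is easier to follow on a small list. The sketch below is a userspace illustration only: struct region and detach_until() are hypothetical names, and locking and the red-black tree are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified region descriptor (not the kernel's vm_area_struct). */
struct region {
    unsigned long start, end;   /* covers [start, end) */
    struct region *next;
};

/*
 * Detach every region that starts before limit, beginning with the node
 * that *link points to, and return the detached nodes as a separate list.
 * Nodes are pushed at the head, so the detached list is in reverse order.
 */
struct region *detach_until(struct region **link, unsigned long limit)
{
    struct region *detached = NULL;
    struct region *cur;

    assert(link != NULL);
    for (cur = *link; cur && cur->start < limit; cur = *link) {
        *link = cur->next;      /* unlink cur from the main list */
        cur->next = detached;   /* push cur onto the detached list */
        detached = cur;
    }
    return detached;
}
```

Because link addresses either the list head or the next field of the previous node, the same code handles removal from the front and from the middle of the list without a special case.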
972     /* Ok - we have the memory areas we should free on the 
973      * 'free' list, so release them, and unmap the page range..
974      * If the one of the segments is only being partially unmapped,
975      * it will put new vm_area_struct(s) into the address space.
976      * In that case we have to be careful with VM_DENYWRITE.
977      */
978     while ((mpnt = free) != NULL) {
979         unsigned long st, end, size;
980         struct file *file = NULL;
982         free = free->vm_next;
984         st = addr < mpnt->vm_start ? mpnt->vm_start : addr;
985         end = addr+len;
986         end = end > mpnt->vm_end ? mpnt->vm_end : end;
987         size = end - st;
989         if (mpnt->vm_flags & VM_DENYWRITE &&
990             (st != mpnt->vm_start || end != mpnt->vm_end) &&
991             (file = mpnt->vm_file) != NULL) {
992            atomic_dec(&file->f_dentry->d_inode->i_writecount);
993         }
994         remove_shared_vm_struct(mpnt);
995         mm->map_count--;
997         zap_page_range(mm, st, size);
999         /*
1000         * Fix the mapping, and free the old area 
             * if it wasn't reused.
1001         */
1002        extra = unmap_fixup(mm, mpnt, st, size, extra);
1003        if (file)
1004           atomic_inc(&file->f_dentry->d_inode->i_writecount);
1005     }
978Keep stepping through the list until no VMAs are left
982Move free to the next element in the list leaving mpnt as the head about to be removed
984st is the start of the region to be unmapped. If the addr is before the start of the VMA, the starting point is mpnt->vm_start, otherwise it is the supplied address
985-986Calculate the end of the region to unmap in a similar fashion
987Calculate the size of the region to be unmapped in this pass
989-993If the VM_DENYWRITE flag is specified, the VMA is only being partially unmapped and a file is mapped, then the i_writecount is decremented. When this field is negative, it counts how many users there are protecting this file from being opened for writing
994Remove the file mapping. If the file is still partially mapped, it will be acquired again during unmap_fixup()(See Section D.2.5.2)
995Reduce the map count
997Remove all pages within this region
1002Call unmap_fixup()(See Section D.2.5.2) to fix up the regions after this one is deleted
1003-1004Increment the writecount to the file as the region has been unmapped. If it was just partially unmapped, this call will simply balance out the decrement at line 992
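
The calculation at lines 984-987 clamps the requested range to the bounds of the VMA currently being processed. A small worked sketch follows; struct span and clamp_to_region() are hypothetical names, and the caller is assumed to pass ranges that overlap.

```c
#include <assert.h>

/* Hypothetical result type: start and size of the part to unmap. */
struct span { unsigned long st, size; };

/*
 * Return the part of [addr, addr+len) that lies inside one region
 * [vm_start, vm_end). The two ranges must overlap.
 */
struct span clamp_to_region(unsigned long addr, unsigned long len,
                            unsigned long vm_start, unsigned long vm_end)
{
    struct span s;
    unsigned long end = addr + len;

    assert(addr < vm_end && vm_start < end);    /* ranges overlap */
    s.st = addr < vm_start ? vm_start : addr;   /* later of the two starts */
    end = end > vm_end ? vm_end : end;          /* earlier of the two ends */
    s.size = end - s.st;
    return s;
}
```

For example, unmapping 0x2000-0x6000 across three regions covering 0x1000-0x3000, 0x3000-0x5000 and 0x5000-0x8000 gives sizes of 0x1000, 0x2000 and 0x1000 respectively.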
1006     validate_mm(mm);
1008     /* Release the extra vma struct if it wasn't used */
1009     if (extra)
1010         kmem_cache_free(vm_area_cachep, extra);
1012     free_pgtables(mm, prev, addr, addr+len);
1014     return 0;
1015 }
1006validate_mm() is a debugging function. If enabled, it will ensure the VMA tree for this mm is still valid
1009-1010If the extra VMA was not required, free it
1012Free all the page tables that were used for the unmapped region
1014Return success

D.2.5.2  Function: unmap_fixup

Source: mm/mmap.c

This function fixes up the regions after a block has been unmapped. It is passed the VMA that is affected by the unmapping, the start address and length of the region to be unmapped and a spare VMA that may be required to fix up the region if a hole is created. There are four principal cases it handles: the unmapping of a whole region, partial unmapping from the start to somewhere in the middle, partial unmapping from somewhere in the middle to the end and the creation of a hole in the middle of the region. Each case will be taken in turn.

787 static struct vm_area_struct * unmap_fixup(struct mm_struct *mm, 
788     struct vm_area_struct *area, unsigned long addr, size_t len, 
789     struct vm_area_struct *extra)
790 {
791     struct vm_area_struct *mpnt;
792     unsigned long end = addr + len;
794     area->vm_mm->total_vm -= len >> PAGE_SHIFT;
795     if (area->vm_flags & VM_LOCKED)
796         area->vm_mm->locked_vm -= len >> PAGE_SHIFT;

Function preamble.

787The parameters to the function are;
mm is the mm the unmapped region belongs to
area is the VMA being fixed up, which do_munmap() has just taken from the head of the linked list of VMAs affected by the unmapping
addr is the starting address of the unmapping
len is the length of the region to be unmapped
extra is a spare VMA passed in for when a hole in the middle is created
792Calculate the end address of the region being unmapped
794Reduce the count of the number of pages used by the process
795-796If the pages were locked in memory, reduce the locked page count
798     /* Unmapping the whole area. */
799     if (addr == area->vm_start && end == area->vm_end) {
800         if (area->vm_ops && area->vm_ops->close)
801             area->vm_ops->close(area);
802         if (area->vm_file)
803             fput(area->vm_file);
804         kmem_cache_free(vm_area_cachep, area);
805         return extra;
806     }

The first, and easiest, case is where the full region is being unmapped

799The full region is unmapped if the addr is the start of the VMA and the end is the end of the VMA. This holds even when the unmapping spans several regions because do_munmap() clamps the start and size it passes in to the bounds of each VMA, so a VMA lying fully inside the unmapped range is still recognised here
800-801If a close operation is supplied by the VMA, call it
802-803If a file or device is mapped, call fput() which decrements the usage count and releases it if the count falls to 0
804Free the memory for the VMA back to the slab allocator
805Return the extra VMA as it was unused
809     if (end == area->vm_end) {
810         /*
811          * here area isn't visible to the semaphore-less readers
812          * so we don't need to update it under the spinlock.
813          */
814         area->vm_end = addr;
815         lock_vma_mappings(area);
816         spin_lock(&mm->page_table_lock);
817     }

Handle the case where the middle of the region to the end is being unmapped

814Truncate the VMA back to addr. At this point, the pages for the region have already been freed and the page table entries will be freed later so no further work is required
815If a file/device is being mapped, the lock protecting shared access to it is taken in the function lock_vma_mappings()
816Lock the mm. Later in the function, the remaining VMA will be reinserted into the mm
817           else if (addr == area->vm_start) {
818         area->vm_pgoff += (end - area->vm_start) >> PAGE_SHIFT;
819         /* same locking considerations of the above case */
820         area->vm_start = end;
821         lock_vma_mappings(area);
822         spin_lock(&mm->page_table_lock);
823     } else {

Handle the case where the VMA is being unmapped from the start to some part in the middle

818Increase the offset within the file/device mapped by the number of pages this unmapping represents
820Move the start of the VMA to the end of the region being unmapped
821-822Lock the file/device and mm as above
823     } else {
825         /* Add end mapping -- leave beginning for below */
826         mpnt = extra;
827         extra = NULL;
829         mpnt->vm_mm = area->vm_mm;
830         mpnt->vm_start = end;
831         mpnt->vm_end = area->vm_end;
832         mpnt->vm_page_prot = area->vm_page_prot;
833         mpnt->vm_flags = area->vm_flags;
834         mpnt->vm_raend = 0;
835         mpnt->vm_ops = area->vm_ops;
836         mpnt->vm_pgoff = area->vm_pgoff + 
                     ((end - area->vm_start) >> PAGE_SHIFT);
837         mpnt->vm_file = area->vm_file;
838         mpnt->vm_private_data = area->vm_private_data;
839         if (mpnt->vm_file)
840             get_file(mpnt->vm_file);
841         if (mpnt->vm_ops && mpnt->vm_ops->open)
842             mpnt->vm_ops->open(mpnt);
843         area->vm_end = addr;    /* Truncate area */
845         /* Because mpnt->vm_file == area->vm_file this locks
846          * things correctly.
847          */
848         lock_vma_mappings(area);
849         spin_lock(&mm->page_table_lock);
850         __insert_vm_struct(mm, mpnt);
851     }

Handle the case where a hole is being created by a partial unmapping. In this case, the extra VMA is required to create a new mapping from the end of the unmapped region to the end of the old VMA

826-827Take the extra VMA and make extra NULL so that the calling function will know it is in use and cannot be freed
829-838Copy in all the VMA information
839If a file/device is mapped, get a reference to it with get_file()
841-842If an open function is provided, call it
843Truncate the VMA so that it ends at the start of the region to be unmapped
848-849Lock the files and mm as with the two previous cases
850Insert the extra VMA into the mm
853     __insert_vm_struct(mm, area);
854     spin_unlock(&mm->page_table_lock);
855     unlock_vma_mappings(area);
856     return extra;
857 }
853Reinsert the VMA into the mm
854Unlock the page tables
855Unlock the spinlock to the shared mapping
856Return the extra VMA if it was not used and NULL if it was
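
The order in which unmap_fixup() tests for its four cases can be summarised in a few lines. The sketch below is illustrative only: enum fixup_case and classify_unmap() are hypothetical names, and the range is assumed to lie inside the region, as it does once do_munmap() has clamped it.

```c
#include <assert.h>

enum fixup_case {
    UNMAP_WHOLE,    /* the entire region goes away */
    UNMAP_TAIL,     /* middle to end: the region is truncated */
    UNMAP_HEAD,     /* start to middle: the region start moves up */
    UNMAP_HOLE      /* hole in the middle: the region splits in two */
};

/* Classify the unmapping of [addr, addr+len) from [vm_start, vm_end). */
enum fixup_case classify_unmap(unsigned long vm_start, unsigned long vm_end,
                               unsigned long addr, unsigned long len)
{
    unsigned long end = addr + len;

    assert(len > 0 && vm_start <= addr && end <= vm_end);
    if (addr == vm_start && end == vm_end)
        return UNMAP_WHOLE;
    if (end == vm_end)
        return UNMAP_TAIL;
    if (addr == vm_start)
        return UNMAP_HEAD;
    return UNMAP_HOLE;
}
```

Only the last case needs the spare VMA, which is why do_munmap() allocates it in advance and frees it afterwards if it was not consumed.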

D.2.6  Deleting all memory regions

D.2.6.1  Function: exit_mmap

Source: mm/mmap.c

This function simply steps through all VMAs associated with the supplied mm and unmaps them.

1127 void exit_mmap(struct mm_struct * mm)
1128 {
1129     struct vm_area_struct * mpnt;
1131     release_segments(mm);
1132     spin_lock(&mm->page_table_lock);
1133     mpnt = mm->mmap;
1134     mm->mmap = mm->mmap_cache = NULL;
1135     mm->mm_rb = RB_ROOT;
1136     mm->rss = 0;
1137     spin_unlock(&mm->page_table_lock);
1138     mm->total_vm = 0;
1139     mm->locked_vm = 0;
1141     flush_cache_mm(mm);
1142     while (mpnt) {
1143         struct vm_area_struct * next = mpnt->vm_next;
1144         unsigned long start = mpnt->vm_start;
1145         unsigned long end = mpnt->vm_end;
1146         unsigned long size = end - start;
1148         if (mpnt->vm_ops) {
1149             if (mpnt->vm_ops->close)
1150                 mpnt->vm_ops->close(mpnt);
1151         }
1152         mm->map_count--;
1153         remove_shared_vm_struct(mpnt);
1154         zap_page_range(mm, start, size);
1155         if (mpnt->vm_file)
1156             fput(mpnt->vm_file);
1157         kmem_cache_free(vm_area_cachep, mpnt);
1158         mpnt = next;
1159     }
1160     flush_tlb_mm(mm);
1162     /* This is just debugging */
1163     if (mm->map_count)
1164         BUG();
1166     clear_page_tables(mm, FIRST_USER_PGD_NR, USER_PTRS_PER_PGD);
1167 }
1131release_segments() will release memory segments associated with the process on its Local Descriptor Table (LDT) if the architecture supports segments and the process was using them. Some applications, notably WINE, use this feature
1132Lock the mm
1133mpnt becomes the first VMA on the list
1134Clear VMA related information from the mm so it may be unlocked
1137Unlock the mm
1138-1139Clear the mm statistics
1141Flush the CPU cache for the whole address space of the mm
1142-1159Step through every VMA that was associated with the mm
1143Record what the next VMA to clear will be so this one may be deleted
1144-1146Record the start, end and size of the region to be deleted
1148-1151If there is a close operation associated with this VMA, call it
1152Reduce the map count
1153Remove the file/device mapping from the shared mappings list
1154Free all pages associated with this region
1155-1156If a file/device was mapped in this region, free it
1157Free the VMA struct
1158Move to the next VMA
1160Flush the TLB for this whole mm as it is about to be unmapped
1163-1164If the map_count is non-zero, it means the map count was not accounted for properly so call BUG() to mark it
1166Clear the page tables associated with this region with clear_page_tables() (See Section D.2.6.2)

D.2.6.2  Function: clear_page_tables

Source: mm/memory.c

This is the top-level function used to unmap all PTEs and free pages within a region. It is used when pagetables need to be torn down such as when the process exits or a region is unmapped.

146 void clear_page_tables(struct mm_struct *mm, 
                           unsigned long first, int nr)
147 {
148     pgd_t * page_dir = mm->pgd;
150     spin_lock(&mm->page_table_lock);
151     page_dir += first;
152     do {
153         free_one_pgd(page_dir);
154         page_dir++;
155     } while (--nr);
156     spin_unlock(&mm->page_table_lock);
158     /* keep the page table cache within bounds */
159     check_pgt_cache();
160 }
148Get the PGD for the mm being unmapped
150Lock the pagetables
151-155Step through all PGDs in the requested range. For each PGD found, call free_one_pgd() (See Section D.2.6.3)
156Unlock the pagetables
159Check the cache of available PGD structures. If there are too many PGDs in the PGD quicklist, some of them will be reclaimed

D.2.6.3  Function: free_one_pgd

Source: mm/memory.c

This function tears down one PGD. For each PMD in this PGD, free_one_pmd() will be called.

109 static inline void free_one_pgd(pgd_t * dir)
110 {
111     int j;
112     pmd_t * pmd;
114     if (pgd_none(*dir))
115         return;
116     if (pgd_bad(*dir)) {
117         pgd_ERROR(*dir);
118         pgd_clear(dir);
119         return;
120     }
121     pmd = pmd_offset(dir, 0);
122     pgd_clear(dir);
123     for (j = 0; j < PTRS_PER_PMD ; j++) {
124         prefetchw(pmd+j+(PREFETCH_STRIDE/16));
125         free_one_pmd(pmd+j);
126     }
127     pmd_free(pmd);
128 }
114-115If no PGD exists here, return
116-120If the PGD is bad, flag the error and return
121Get the first PMD in the PGD
122Clear the PGD entry
123-126For each PMD in this PGD, call free_one_pmd() (See Section D.2.6.4)
127Free the PMD page to the PMD quicklist. Later, check_pgt_cache() will be called and if the cache has too many PMD pages in it, they will be reclaimed

D.2.6.4  Function: free_one_pmd

Source: mm/memory.c

 93 static inline void free_one_pmd(pmd_t * dir)
 94 {
 95     pte_t * pte;
 97     if (pmd_none(*dir))
 98         return;
 99     if (pmd_bad(*dir)) {
100         pmd_ERROR(*dir);
101         pmd_clear(dir);
102         return;
103     }
104     pte = pte_offset(dir, 0);
105     pmd_clear(dir);
106     pte_free(pte);
107 }
97-98If no PMD exists here, return
99-103If the PMD is bad, flag the error and return
104Get the first PTE in the PMD
105Clear the PMD from the pagetable
106Free the PTE page to the PTE quicklist cache with pte_free(). Later, check_pgt_cache() will be called and if the cache has too many PTE pages in it, they will be reclaimed
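
The three functions above form a simple nested teardown: clear the entry at one level, then free what it pointed to. The following userspace sketch models that shape with heap allocated tables. Every name in it (ENTRIES, leaf_t, mid_t, top_t and the three functions) is hypothetical, free() stands in for the quicklists, and the bad-entry checks are omitted.

```c
#include <assert.h>
#include <stdlib.h>

#define ENTRIES 4   /* hypothetical, tiny fan-out for the sketch */

typedef struct { unsigned long slot[ENTRIES]; } leaf_t;  /* plays the PTE page */
typedef struct { leaf_t *leaf[ENTRIES]; } mid_t;         /* plays the PMD page */
typedef struct { mid_t *mid[ENTRIES]; } top_t;           /* plays the PGD */

/* Free the leaf table behind one middle-level entry; returns tables freed. */
int free_one_mid_entry(leaf_t **entry)
{
    leaf_t *leaf = *entry;

    if (leaf == NULL)
        return 0;
    *entry = NULL;          /* clear the entry before freeing its target */
    free(leaf);
    return 1;
}

/* Free the middle table behind one top-level entry and all of its leaves. */
int free_one_top_entry(mid_t **entry)
{
    mid_t *mid = *entry;
    int freed = 0, j;

    if (mid == NULL)
        return 0;
    *entry = NULL;
    for (j = 0; j < ENTRIES; j++)
        freed += free_one_mid_entry(&mid->leaf[j]);
    free(mid);
    return freed + 1;
}

/* Tear down nr top-level entries starting at index first. */
int clear_tables(top_t *top, int first, int nr)
{
    int freed = 0;

    assert(top != NULL && first >= 0 && nr > 0 && first + nr <= ENTRIES);
    do {
        freed += free_one_top_entry(&top->mid[first++]);
    } while (--nr);
    return freed;
}
```

The return values count the tables released, which is a convenience for the sketch; the kernel functions return nothing.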

D.3  Searching Memory Regions

The functions in this section deal with searching the virtual address space for mapped and free regions.

D.3.1  Finding a Mapped Memory Region

D.3.1.1  Function: find_vma

Source: mm/mmap.c

661 struct vm_area_struct * find_vma(struct mm_struct * mm, 
                                     unsigned long addr)
662 {
663     struct vm_area_struct *vma = NULL;
665     if (mm) {
666         /* Check the cache first. */
667         /* (Cache hit rate is typically around 35%.) */
668         vma = mm->mmap_cache;
669         if (!(vma && vma->vm_end > addr && 
              vma->vm_start <= addr)) {
670             rb_node_t * rb_node;
672             rb_node = mm->mm_rb.rb_node;
673             vma = NULL;
675             while (rb_node) {
676                 struct vm_area_struct * vma_tmp;
678                 vma_tmp = rb_entry(rb_node, 
                        struct vm_area_struct, vm_rb);
680                 if (vma_tmp->vm_end > addr) {
681                     vma = vma_tmp;
682                     if (vma_tmp->vm_start <= addr)
683                         break;
684                     rb_node = rb_node->rb_left;
685                 } else
686                     rb_node = rb_node->rb_right;
687             }
688             if (vma)
689                 mm->mmap_cache = vma;
690         }
691     }
692     return vma;
693 }
661The two parameters are the top level mm_struct that is to be searched and the address the caller is interested in
663Default to returning NULL for address not found
665Make sure the caller does not try and search a bogus mm
668mmap_cache has the result of the last call to find_vma(). If it also satisfies this lookup, the red-black tree does not have to be searched at all
669If it is a valid VMA that is being examined, check to see if the address being searched is contained within it. If it is, the VMA was the mmap_cache one so it can be returned, otherwise the tree is searched
670-674Start at the root of the tree
675-687This block is the tree walk
678The macro, as the name suggests, returns the VMA this tree node points to
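680Check whether the next node to traverse is the left or right child. If this VMA ends after the address, it is remembered as a candidate and the left child is next, otherwise the right child is next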
682If the current VMA is what is required, exit the while loop
689If the VMA is valid, set the mmap_cache for the next call to find_vma()
692Return the VMA that contains the address or, as a side effect of the tree walk, the closest VMA that ends after the requested address. NULL is returned if no such VMA exists
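
The search rule used by find_vma(), "first region that ends above addr", does not depend on the red-black tree. The sketch below applies the same rule with a binary search over a sorted array, which is a swapped-in data structure chosen to keep the example short. struct region and find_region() are hypothetical names and there is no cache.

```c
#include <assert.h>
#include <stddef.h>

struct region { unsigned long start, end; };   /* covers [start, end) */

/*
 * Return the first region whose end is above addr, or NULL if there is
 * none. The array must be sorted by address and must not overlap.
 */
const struct region *find_region(const struct region *regions, size_t n,
                                 unsigned long addr)
{
    const struct region *found = NULL;
    size_t lo = 0, hi = n;

    assert(regions != NULL || n == 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (regions[mid].end > addr) {
            found = &regions[mid];          /* candidate, like line 681 */
            if (regions[mid].start <= addr)
                break;                      /* addr is inside it */
            hi = mid;                       /* look for a lower candidate */
        } else {
            lo = mid + 1;                   /* region is entirely below addr */
        }
    }
    return found;
}
```

As with find_vma(), a non-NULL result does not guarantee that the address is inside the region, so callers still have to compare the address against the start of the region.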

D.3.1.2  Function: find_vma_prev

Source: mm/mmap.c

696 struct vm_area_struct * find_vma_prev(struct mm_struct * mm, 
                        unsigned long addr,
697                     struct vm_area_struct **pprev)
698 {
699     if (mm) {
700         /* Go through the RB tree quickly. */
701         struct vm_area_struct * vma;
702         rb_node_t * rb_node, * rb_last_right, * rb_prev;
704         rb_node = mm->mm_rb.rb_node;
705         rb_last_right = rb_prev = NULL;
706         vma = NULL;
708         while (rb_node) {
709             struct vm_area_struct * vma_tmp;
711             vma_tmp = rb_entry(rb_node, 
                             struct vm_area_struct, vm_rb);
713             if (vma_tmp->vm_end > addr) {
714                 vma = vma_tmp;
715                 rb_prev = rb_last_right;
716                 if (vma_tmp->vm_start <= addr)
717                     break;
718                 rb_node = rb_node->rb_left;
719             } else {
720                 rb_last_right = rb_node;
721                 rb_node = rb_node->rb_right;
722             }
723         }
724         if (vma) {
725             if (vma->vm_rb.rb_left) {
726                 rb_prev = vma->vm_rb.rb_left;
727                 while (rb_prev->rb_right)
728                     rb_prev = rb_prev->rb_right;
729             }
730             *pprev = NULL;
731             if (rb_prev)
732                 *pprev = rb_entry(rb_prev, struct
                         vm_area_struct, vm_rb);
733             if ((rb_prev ? (*pprev)->vm_next : mm->mmap) != vma)
734                 BUG();
735             return vma;
736         }
737     }
738     *pprev = NULL;
739     return NULL;
740 }
696-723This is essentially the same as the find_vma() function already described. The only difference is that the last right node accessed is remembered as this will represent the vma previous to the requested vma.
725-729If the returned VMA has a left node, it means that it has to be traversed. It first takes the left leaf and then follows each right leaf until the bottom of the tree is found.
731-732Extract the VMA from the red-black tree node
733-734A debugging check, if this is the previous node, then its next field should point to the VMA being returned. If it is not, it is a bug

D.3.1.3  Function: find_vma_intersection

Source: include/linux/mm.h

673 static inline struct vm_area_struct * find_vma_intersection(
                       struct mm_struct * mm, 
                       unsigned long start_addr, unsigned long end_addr)
674 {
675     struct vm_area_struct * vma = find_vma(mm,start_addr);
677     if (vma && end_addr <= vma->vm_start)
678         vma = NULL;
679     return vma;
680 }
675Return the VMA closest to the starting address
677If a VMA is returned and the end address is at or before the beginning of the returned VMA, the VMA does not intersect
679Return the VMA if it does intersect

D.3.2  Finding a Free Memory Region

D.3.2.1  Function: get_unmapped_area

Source: mm/mmap.c

The call graph for this function is shown in Figure 4.5.

644 unsigned long get_unmapped_area(struct file *file, 
                        unsigned long addr,
                        unsigned long len, 
                        unsigned long pgoff, 
                        unsigned long flags)
645 {
646     if (flags & MAP_FIXED) {
647         if (addr > TASK_SIZE - len)
648             return -ENOMEM;
649         if (addr & ~PAGE_MASK)
650             return -EINVAL;
651         return addr;
652     }
654     if (file && file->f_op && file->f_op->get_unmapped_area)
655         return file->f_op->get_unmapped_area(file, addr, 
                                len, pgoff, flags);
657     return arch_get_unmapped_area(file, addr, len, pgoff, flags);
658 }
644The parameters passed are
file The file or device being mapped
addr The requested address to map to
len The length of the mapping
pgoff The offset within the file being mapped
flags Mapping flags such as MAP_FIXED
646-652Sanity checks. If it is required that the mapping be placed at the specified address, make sure it will not overflow the address space and that it is page aligned
654If the struct file provides a get_unmapped_area() function, use it
657Else use arch_get_unmapped_area()(See Section D.3.2.2) as an anonymous version of the get_unmapped_area() function

D.3.2.2  Function: arch_get_unmapped_area

Source: mm/mmap.c

Architectures have the option of specifying this function for themselves by defining HAVE_ARCH_UNMAPPED_AREA. If the architecture does not supply one, this version is used.

614 #ifndef HAVE_ARCH_UNMAPPED_AREA
615 static inline unsigned long arch_get_unmapped_area(
            struct file *filp,
            unsigned long addr, unsigned long len, 
            unsigned long pgoff, unsigned long flags)
616 {
617     struct vm_area_struct *vma;
619     if (len > TASK_SIZE)
620         return -ENOMEM;
622     if (addr) {
623         addr = PAGE_ALIGN(addr);
624         vma = find_vma(current->mm, addr);
625         if (TASK_SIZE - len >= addr &&
626             (!vma || addr + len <= vma->vm_start))
627             return addr;
628     }
629     addr = PAGE_ALIGN(TASK_UNMAPPED_BASE);
631     for (vma = find_vma(current->mm, addr); ; vma = vma->vm_next) {
632         /* At this point:  (!vma || addr < vma->vm_end). */
633         if (TASK_SIZE - len < addr)
634             return -ENOMEM;
635         if (!vma || addr + len <= vma->vm_start)
636             return addr;
637         addr = vma->vm_end;
638     }
639 }
640 #else
641 extern unsigned long arch_get_unmapped_area(struct file *, 
                     unsigned long, unsigned long, 
                     unsigned long, unsigned long);
642 #endif
614If this is not defined, it means that the architecture does not provide its own arch_get_unmapped_area() so this one is used instead
615The parameters are the same as those for get_unmapped_area()(See Section D.3.2.1)
619-620Sanity check, make sure the required map length is not too long
622-628If an address is provided, use it for the mapping
623Make sure the address is page aligned
624find_vma()(See Section D.3.1.1) will return the region closest to the requested address
625-627Make sure the mapping will not overlap with another region. If it does not, return it as it is safe to use. Otherwise it gets ignored
629TASK_UNMAPPED_BASE is the starting point for searching for a free region to use
631-638Starting from TASK_UNMAPPED_BASE, linearly search the VMAs until a large enough region between them is found to store the new mapping. This is essentially a first fit search
641If an external function is provided, it still needs to be declared here
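
The first fit search at lines 631-638 can be reproduced over a sorted array of regions. This is a sketch under stated assumptions rather than the kernel's implementation: struct region, first_fit() and the NO_SPACE marker are hypothetical, the array replaces the VMA list, and base and limit play the roles of TASK_UNMAPPED_BASE and TASK_SIZE.

```c
#include <assert.h>
#include <stddef.h>

struct region { unsigned long start, end; };   /* covers [start, end) */

#define NO_SPACE ((unsigned long)-1)   /* hypothetical failure marker */

/*
 * Return the lowest address at or above base where len bytes fit between
 * the sorted, non-overlapping regions without passing limit.
 */
unsigned long first_fit(const struct region *regions, size_t n,
                        unsigned long base, unsigned long limit,
                        unsigned long len)
{
    unsigned long addr = base;
    size_t i = 0;

    assert(regions != NULL || n == 0);
    if (len > limit)
        return NO_SPACE;

    /* Start at the first region that ends above the starting address. */
    while (i < n && regions[i].end <= addr)
        i++;

    for (;; i++) {
        if (limit - len < addr)
            return NO_SPACE;            /* would run past the limit */
        if (i == n || addr + len <= regions[i].start)
            return addr;                /* the gap is large enough */
        addr = regions[i].end;          /* try just after this region */
    }
}
```

The search costs time linear in the number of regions, which is the price of taking the first gap that fits rather than tracking free space separately.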

D.4  Locking and Unlocking Memory Regions

This section contains the functions related to locking and unlocking a region. The main complexity in them is how the regions need to be fixed up after the operation takes place.

D.4.1  Locking a Memory Region

D.4.1.1  Function: sys_mlock

Source: mm/mlock.c

The call graph for this function is shown in Figure 4.10. This is the system call mlock() for locking a region of memory into physical memory. This function simply checks to make sure that process and user limits are not exceeded and that the region to lock is page aligned.

195 asmlinkage long sys_mlock(unsigned long start, size_t len)
196 {
197     unsigned long locked;
198     unsigned long lock_limit;
199     int error = -ENOMEM;
201     down_write(&current->mm->mmap_sem);
202     len = PAGE_ALIGN(len + (start & ~PAGE_MASK));
203     start &= PAGE_MASK;
205     locked = len >> PAGE_SHIFT;
206     locked += current->mm->locked_vm;
208     lock_limit = current->rlim[RLIMIT_MEMLOCK].rlim_cur;
209     lock_limit >>= PAGE_SHIFT;
211     /* check against resource limits */
212     if (locked > lock_limit)
213         goto out;
215     /* we may lock at most half of physical memory... */
216     /* (this check is pretty bogus, but doesn't hurt) */
217     if (locked > num_physpages/2)
218         goto out;
220     error = do_mlock(start, len, 1);
221 out:
222     up_write(&current->mm->mmap_sem);
223     return error;
224 }
201Take the semaphore, we are likely to sleep during this so a spinlock can not be used
202Round the length up to the page boundary
203Round the start address down to the page boundary
205Calculate how many pages will be locked
206Calculate how many pages will be locked in total by this process
208-209Calculate what the limit is to the number of locked pages
212-213Do not allow the process to lock more than it should
217-218Do not allow the process to lock more than half of physical memory
220Call do_mlock()(See Section D.4.1.4) which starts the “real” work by finding the VMA closest to the area to lock before calling mlock_fixup()(See Section D.4.3.1)
222Release the semaphore
223Return the error or success code from do_mlock()
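
The rounding at lines 202-205 is worth working through once, because the length must grow by the offset of start within its page before it is rounded up. The sketch below assumes 4 KiB pages; struct page_range, round_to_pages() and pages_in() are hypothetical names.

```c
#include <assert.h>

#define PAGE_SHIFT    12                        /* assumption: 4 KiB pages */
#define PAGE_SIZE     (1UL << PAGE_SHIFT)
#define PAGE_MASK     (~(PAGE_SIZE - 1))
#define PAGE_ALIGN(x) (((x) + PAGE_SIZE - 1) & PAGE_MASK)

struct page_range { unsigned long start, len; };

/* Expand [start, start+len) outwards so that it covers whole pages. */
struct page_range round_to_pages(unsigned long start, unsigned long len)
{
    struct page_range r;

    /* Add the offset into the first page, then round the length up. */
    r.len = PAGE_ALIGN(len + (start & ~PAGE_MASK));
    /* Round the start down to its page boundary. */
    r.start = start & PAGE_MASK;
    assert((r.start & ~PAGE_MASK) == 0 && (r.len & ~PAGE_MASK) == 0);
    return r;
}

/* Number of pages the rounded range covers. */
unsigned long pages_in(struct page_range r)
{
    return r.len >> PAGE_SHIFT;
}
```

A request for 0x2000 bytes starting at 0x1234 touches three pages, not two: the range becomes 0x1000 with a length of 0x3000.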

D.4.1.2  Function: sys_mlockall

Source: mm/mlock.c

This is the system call mlockall() which attempts to lock all pages in the calling process in memory. If MCL_CURRENT is specified, all current pages will be locked. If MCL_FUTURE is specified, all future mappings will be locked. The flags may be or-ed together. This function makes sure that the flags and process limits are ok before calling do_mlockall().

266 asmlinkage long sys_mlockall(int flags)
267 {
268     unsigned long lock_limit;
269     int ret = -EINVAL;
271     down_write(&current->mm->mmap_sem);
272     if (!flags || (flags & ~(MCL_CURRENT | MCL_FUTURE)))
273         goto out;
275     lock_limit = current->rlim[RLIMIT_MEMLOCK].rlim_cur;
276     lock_limit >>= PAGE_SHIFT;
278     ret = -ENOMEM;
279     if (current->mm->total_vm > lock_limit)
280         goto out;
282     /* we may lock at most half of physical memory... */
283     /* (this check is pretty bogus, but doesn't hurt) */
284     if (current->mm->total_vm > num_physpages/2)
285         goto out;
287     ret = do_mlockall(flags);
288 out:
289     up_write(&current->mm->mmap_sem);
290     return ret;
291 }
269By default, return -EINVAL to indicate invalid parameters
271Acquire the current mm_struct semaphore
272-273Make sure that some valid flag has been specified. If not, goto out to unlock the semaphore and return -EINVAL
275-276Check the process limits to see how many pages may be locked
278From here on, the default error is -ENOMEM
279-280If the size of the locking would exceed set limits, then goto out
284-285Do not allow this process to lock more than half of physical memory. This is a bogus check because four processes locking a quarter of physical memory each will bypass this. It is acceptable though as only root processes are allowed to lock memory and are unlikely to make this type of mistake
287Call the core function do_mlockall()(See Section D.4.1.3)
289-290Unlock the semaphore and return

D.4.1.3  Function: do_mlockall

Source: mm/mlock.c

238 static int do_mlockall(int flags)
239 {
240     int error;
241     unsigned int def_flags;
242     struct vm_area_struct * vma;
244     if (!capable(CAP_IPC_LOCK))
245         return -EPERM;
247     def_flags = 0;
248     if (flags & MCL_FUTURE)
249         def_flags = VM_LOCKED;
250     current->mm->def_flags = def_flags;
252     error = 0;
253     for (vma = current->mm->mmap; vma ; vma = vma->vm_next) {
254         unsigned int newflags;
256         newflags = vma->vm_flags | VM_LOCKED;
257         if (!(flags & MCL_CURRENT))
258             newflags &= ~VM_LOCKED;
259         error = mlock_fixup(vma, vma->vm_start, vma->vm_end, 
                            newflags);
260         if (error)
261             break;
262     }
263     return error;
264 }
244-245The calling process must be either root or have CAP_IPC_LOCK capabilities
248-250The MCL_FUTURE flag says that all future pages should be locked so if set, the def_flags for VMAs should be VM_LOCKED
253-262Cycle through all VMAs
256Set the VM_LOCKED flag in the current VMA flags
257-258If the MCL_CURRENT flag has not been set requesting that all current pages be locked, then clear the VM_LOCKED flag. The logic is arranged like this so that the unlock code can use this same function just with no flags
259Call mlock_fixup()(See Section D.4.3.1) which will adjust the regions to match the locking as necessary
260-261If a non-zero value is returned at any point, stop locking. It is interesting to note that VMAs already locked will not be unlocked
263Return the success or error value
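
The flag handling described above can be tried outside the kernel. The following is a minimal userspace sketch, not kernel code. The DEMO_ constants and both function names are hypothetical and the flag values are chosen only for illustration.

```c
/* Illustrative values only, the real constants live in the kernel headers */
#define DEMO_VM_LOCKED   0x2000u
#define DEMO_MCL_CURRENT 1
#define DEMO_MCL_FUTURE  2

/* New flags for one existing VMA, mirroring lines 256-258 */
unsigned int demo_vma_newflags(unsigned int vm_flags, int flags)
{
    unsigned int newflags = vm_flags | DEMO_VM_LOCKED;

    /* Without MCL_CURRENT the lock bit is cleared again (the unlock path) */
    if (!(flags & DEMO_MCL_CURRENT))
        newflags &= ~DEMO_VM_LOCKED;
    return newflags;
}

/* Default flags applied to future mappings, mirroring lines 247-250 */
unsigned int demo_def_flags(int flags)
{
    return (flags & DEMO_MCL_FUTURE) ? DEMO_VM_LOCKED : 0;
}
```

Passing 0 as flags clears the lock bit on every VMA and resets the default flags, which is how the unlock path is able to reuse this function.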

D.4.1.4  Function: do_mlock

Source: mm/mlock.c

This function is responsible for starting the work needed to either lock or unlock a region depending on the value of the on parameter. It is broken up into two sections. The first makes sure the region is page aligned (despite the fact the only two callers of this function do the same thing) before finding the VMA that is to be adjusted. The second part then sets the appropriate flags before calling mlock_fixup() for each VMA that is affected by this locking.

148 static int do_mlock(unsigned long start, size_t len, int on)
149 {
150     unsigned long nstart, end, tmp;
151     struct vm_area_struct * vma, * next;
152     int error;
154     if (on && !capable(CAP_IPC_LOCK))
155         return -EPERM;
156     len = PAGE_ALIGN(len);
157     end = start + len;
158     if (end < start)
159         return -EINVAL;
160     if (end == start)
161         return 0;
162     vma = find_vma(current->mm, start);
163     if (!vma || vma->vm_start > start)
164         return -ENOMEM;

Page align the request and find the VMA

154-155Only processes with the CAP_IPC_LOCK capability (normally root) can lock pages. Unlocking requires no special capability
156Page align the length. This is redundant as the length is page aligned in the parent functions
157-159Calculate the end of the locking and make sure it is a valid region. Return -EINVAL if it is not
160-161If locking a region of size 0, just return
162Find the VMA that will be affected by this locking
163-164If the VMA for this address range does not exist, return -ENOMEM
166     for (nstart = start ; ; ) {
167         unsigned int newflags;
171         newflags = vma->vm_flags | VM_LOCKED;
172         if (!on)
173             newflags &= ~VM_LOCKED;
175         if (vma->vm_end >= end) {
176             error = mlock_fixup(vma, nstart, end, newflags);
177             break;
178         }
180         tmp = vma->vm_end;
181         next = vma->vm_next;
182         error = mlock_fixup(vma, nstart, tmp, newflags);
183         if (error)
184             break;
185         nstart = tmp;
186         vma = next;
187         if (!vma || vma->vm_start != nstart) {
188             error = -ENOMEM;
189             break;
190         }
191     }
192     return error;
193 }

Walk through the VMAs affected by this locking and call mlock_fixup() for each of them.

166-192Cycle through as many VMAs as necessary to lock the pages
171Set the VM_LOCKED flag on the VMA
172-173Unless this is an unlock in which case, remove the flag
175-177If this VMA is the last VMA to be affected by the locking or unlocking, call mlock_fixup() with the end address for the locking and exit
180-190Else the rest of this VMA needs to be locked. To lock it, the end of this VMA is passed as a parameter to mlock_fixup()(See Section D.4.3.1) instead of the end of the actual locking
180tmp is the end of the mapping on this VMA
181next is the next VMA that will be affected by the locking
182Call mlock_fixup()(See Section D.4.3.1) for this VMA
183-184If an error occurs, stop and return it. Note that the VMAs already fixed up are not restored to their previous state
185The next start address is the start of the next VMA
186Move to the next VMA
187-190If there is no VMA, return -ENOMEM. The next condition though would require the regions to be extremely broken as a result of a broken implementation of mlock_fixup() or have VMAs that overlap
192Return the error or success value
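
The walk over the VMAs can be illustrated with a simplified userspace model. This is only a sketch under stated assumptions: struct demo_vma, demo_find_vma() and demo_do_mlock() are hypothetical stand-ins, and marking the whole region replaces the call to mlock_fixup(), so no VMA is ever split here.

```c
#include <stddef.h>
#include <errno.h>

/* Hypothetical, much reduced stand-in for struct vm_area_struct */
struct demo_vma {
    unsigned long vm_start, vm_end;   /* covers [vm_start, vm_end) */
    int locked;
    struct demo_vma *vm_next;         /* list sorted by address */
};

/* Stand-in for find_vma(): first region that ends above addr */
struct demo_vma *demo_find_vma(struct demo_vma *head, unsigned long addr)
{
    for (; head; head = head->vm_next)
        if (addr < head->vm_end)
            return head;
    return NULL;
}

/* Walk [start, end) in the same order as lines 162-192 */
int demo_do_mlock(struct demo_vma *head, unsigned long start,
                  unsigned long end, int on)
{
    struct demo_vma *vma = demo_find_vma(head, start);
    unsigned long nstart;

    if (!vma || vma->vm_start > start)
        return -ENOMEM;               /* start is not inside any region */

    for (nstart = start; ; ) {
        vma->locked = on;             /* stands in for mlock_fixup() */
        if (vma->vm_end >= end)
            return 0;                 /* last region affected */
        nstart = vma->vm_end;
        vma = vma->vm_next;
        if (!vma || vma->vm_start != nstart)
            return -ENOMEM;           /* hole inside the requested range */
    }
}
```

As in the kernel function, a failure part way through leaves the earlier regions changed.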

D.4.2  Unlocking the region

D.4.2.1  Function: sys_munlock

Source: mm/mlock.c

Page align the request before calling do_mlock() which begins the real work of fixing up the regions.

226 asmlinkage long sys_munlock(unsigned long start, size_t len)
227 {
228     int ret;
230     down_write(&current->mm->mmap_sem);
231     len = PAGE_ALIGN(len + (start & ~PAGE_MASK));
232     start &= PAGE_MASK;
233     ret = do_mlock(start, len, 0);
234     up_write(&current->mm->mmap_sem);
235     return ret;
236 }
230Acquire the semaphore protecting the mm_struct
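231Add the offset of start within its page to the length and round the result up to the nearest page boundary, so that the range still covers the last byte requested once start is rounded down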
232Round the start of the region down to the nearest page boundary
233Call do_mlock()(See Section D.4.1.4) with 0 as the third parameter to unlock the region
234Release the semaphore
235Return the success or failure code
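
The alignment arithmetic on lines 231-232 is easy to check in isolation. The sketch below assumes 4KiB pages as on the x86. The DEMO_ macros and demo_align_range() are hypothetical names that mimic PAGE_MASK and PAGE_ALIGN() in userspace.

```c
/* Assumption: 4KiB pages */
#define DEMO_PAGE_SHIFT 12
#define DEMO_PAGE_SIZE  (1UL << DEMO_PAGE_SHIFT)
#define DEMO_PAGE_MASK  (~(DEMO_PAGE_SIZE - 1))
#define DEMO_PAGE_ALIGN(x) (((x) + DEMO_PAGE_SIZE - 1) & DEMO_PAGE_MASK)

/* Widen [start, start + len) outwards to page boundaries */
void demo_align_range(unsigned long *start, unsigned long *len)
{
    /* Grow len by the offset of start within its page, then round up */
    *len = DEMO_PAGE_ALIGN(*len + (*start & ~DEMO_PAGE_MASK));
    /* Round start down to the beginning of its page */
    *start &= DEMO_PAGE_MASK;
}
```

For example, a request starting at 0x1234 with a length of 0x2000 becomes a start of 0x1000 and a length of 0x3000, which covers every page touched by the original range.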

D.4.2.2  Function: sys_munlockall

Source: mm/mlock.c

Trivial function. Calling do_mlockall() with flags of 0 means that none of the current pages need to be locked and no future mappings should be locked either, which means the VM_LOCKED flag will be removed from all VMAs.

293 asmlinkage long sys_munlockall(void)
294 {
295     int ret;
297     down_write(&current->mm->mmap_sem);
298     ret = do_mlockall(0);
299     up_write(&current->mm->mmap_sem);
300     return ret;
301 }
297Acquire the semaphore protecting the mm_struct
298Call do_mlockall()(See Section D.4.1.3) with 0 as flags which will remove the VM_LOCKED from all VMAs
299Release the semaphore
300Return the error or success code

D.4.3  Fixing up regions after locking/unlocking

D.4.3.1  Function: mlock_fixup

Source: mm/mlock.c

This function identifies four separate types of locking that must be addressed. The first is where the full VMA is to be locked where it calls mlock_fixup_all(). The second is where only the beginning portion of the VMA is affected, handled by mlock_fixup_start(). The third is the locking of a region at the end handled by mlock_fixup_end() and the last is locking a region in the middle of the VMA with mlock_fixup_middle().

117 static int mlock_fixup(struct vm_area_struct * vma, 
118    unsigned long start, unsigned long end, unsigned int newflags)
119 {
120     int pages, retval;
122     if (newflags == vma->vm_flags)
123         return 0;
125     if (start == vma->vm_start) {
126         if (end == vma->vm_end)
127             retval = mlock_fixup_all(vma, newflags);
128         else
129             retval = mlock_fixup_start(vma, end, newflags);
130     } else {
131         if (end == vma->vm_end)
132             retval = mlock_fixup_end(vma, start, newflags);
133         else
134             retval = mlock_fixup_middle(vma, start, 
                            end, newflags);
135     }
136     if (!retval) {
137         /* keep track of amount of locked VM */
138         pages = (end - start) >> PAGE_SHIFT;
139         if (newflags & VM_LOCKED) {
140             pages = -pages;
141             make_pages_present(start, end);
142         }
143         vma->vm_mm->locked_vm -= pages;
144     }
145     return retval;
146 }
122-123If no change is to be made, just return
125If the start of the locking is at the start of the VMA, it means that either the full region is to be locked or only a portion at the beginning
126-127The full VMA is being locked, call mlock_fixup_all() (See Section D.4.3.2)
128-129Part of the VMA is being locked with the start of the VMA matching the start of the locking, call mlock_fixup_start() (See Section D.4.3.3)
130Else either a region at the end is to be locked or a region in the middle
131-132The end of the locking matches the end of the VMA, call mlock_fixup_end() (See Section D.4.3.4)
133-134A region in the middle of the VMA is to be locked, call mlock_fixup_middle() (See Section D.4.3.5)
136-144The fixup functions return 0 on success. If the fixup of the regions succeeds, the count of locked pages for the mm (locked_vm) is updated and, if the regions are now marked as locked, make_pages_present() is called which makes some basic checks before calling get_user_pages() which faults in all the pages in the same way the page fault handler does
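
The four cases and the locked_vm accounting can be summarised in a small userspace sketch. The enum, the function names and the 4KiB page size are assumptions made for illustration and none of them exist in the kernel.

```c
#define DEMO_PAGE_SHIFT 12   /* assumption: 4KiB pages */

enum demo_fixup {
    DEMO_FIXUP_ALL,      /* whole VMA affected */
    DEMO_FIXUP_START,    /* beginning of the VMA affected */
    DEMO_FIXUP_END,      /* end of the VMA affected */
    DEMO_FIXUP_MIDDLE    /* region strictly inside the VMA */
};

/* Same decision tree as lines 125-135 */
enum demo_fixup demo_classify(unsigned long vm_start, unsigned long vm_end,
                              unsigned long start, unsigned long end)
{
    if (start == vm_start)
        return (end == vm_end) ? DEMO_FIXUP_ALL : DEMO_FIXUP_START;
    return (end == vm_end) ? DEMO_FIXUP_END : DEMO_FIXUP_MIDDLE;
}

/* Same accounting as lines 138-143: locking adds pages, unlocking removes */
long demo_locked_vm_after(long locked_vm, unsigned long start,
                          unsigned long end, int locking)
{
    long pages = (long)((end - start) >> DEMO_PAGE_SHIFT);

    if (locking)
        pages = -pages;
    return locked_vm - pages;
}
```

Negating pages before the subtraction is what lets a single statement handle both directions.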

D.4.3.2  Function: mlock_fixup_all

Source: mm/mlock.c

 15 static inline int mlock_fixup_all(struct vm_area_struct * vma, 
                    int newflags)
 16 {
 17     spin_lock(&vma->vm_mm->page_table_lock);
 18     vma->vm_flags = newflags;
 19     spin_unlock(&vma->vm_mm->page_table_lock);
 20     return 0;
 21 }
17-19Trivial, take the page_table_lock spinlock of the mm_struct that owns the VMA, set the new flags, release the lock and return success

D.4.3.3  Function: mlock_fixup_start

Source: mm/mlock.c

Slightly more complicated. A new VMA is required to represent the affected region. The start of the old VMA is moved forward.

 23 static inline int mlock_fixup_start(struct vm_area_struct * vma,
 24     unsigned long end, int newflags)
 25 {
 26     struct vm_area_struct * n;
 28     n = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL);
 29     if (!n)
 30         return -EAGAIN;
 31     *n = *vma;
 32     n->vm_end = end;
 33     n->vm_flags = newflags;
 34     n->vm_raend = 0;
 35     if (n->vm_file)
 36         get_file(n->vm_file);
 37     if (n->vm_ops && n->vm_ops->open)
 38         n->vm_ops->open(n);
 39     vma->vm_pgoff += (end - vma->vm_start) >> PAGE_SHIFT;
 40     lock_vma_mappings(vma);
 41     spin_lock(&vma->vm_mm->page_table_lock);
 42     vma->vm_start = end;
 43     __insert_vm_struct(current->mm, n);
 44     spin_unlock(&vma->vm_mm->page_table_lock);
 45     unlock_vma_mappings(vma);
 46     return 0;
 47 }
28Allocate a VMA from the slab allocator for the affected region
31-34Copy in the necessary information
35-36If the VMA has a file or device mapping, get_file() will increment the reference count
37-38If an open() function is provided, call it
39Update the offset within the file or device mapping for the old VMA to be the end of the locked region
40lock_vma_mappings() will lock any files if this VMA is a shared region
41-44Lock the parent mm_struct, move the start of the old VMA to the end of the affected region, insert the new VMA into the process's linked lists (See Section D.2.2.1) and release the lock
45Unlock the file mappings with unlock_vma_mappings()
46Return success

D.4.3.4  Function: mlock_fixup_end

Source: mm/mlock.c

Essentially the same as mlock_fixup_start() except the affected region is at the end of the VMA.

 49 static inline int mlock_fixup_end(struct vm_area_struct * vma,
 50     unsigned long start, int newflags)
 51 {
 52     struct vm_area_struct * n;
 54     n = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL);
 55     if (!n)
 56         return -EAGAIN;
 57     *n = *vma;
 58     n->vm_start = start;
 59     n->vm_pgoff += (n->vm_start - vma->vm_start) >> PAGE_SHIFT;
 60     n->vm_flags = newflags;
 61     n->vm_raend = 0;
 62     if (n->vm_file)
 63         get_file(n->vm_file);
 64     if (n->vm_ops && n->vm_ops->open)
 65         n->vm_ops->open(n);
 66     lock_vma_mappings(vma);
 67     spin_lock(&vma->vm_mm->page_table_lock);
 68     vma->vm_end = start;
 69     __insert_vm_struct(current->mm, n);
 70     spin_unlock(&vma->vm_mm->page_table_lock);
 71     unlock_vma_mappings(vma);
 72     return 0;
 73 }
54Allocate a VMA from the slab allocator for the affected region
57-61Copy in the necessary information and update the offset within the file or device mapping
62-63If the VMA has a file or device mapping, get_file() will increment the reference count
64-65If an open() function is provided, call it
66lock_vma_mappings() will lock any files if this VMA is a shared region
67-70Lock the parent mm_struct, move the end of the old VMA to the start of the affected region, insert the new VMA into the process's linked lists (See Section D.2.2.1) and release the lock
71Unlock the file mappings with unlock_vma_mappings()
72Return success

D.4.3.5  Function: mlock_fixup_middle

Source: mm/mlock.c

Similar to the previous two fixup functions except that 2 new regions are required to fix up the mapping.

 75 static inline int mlock_fixup_middle(struct vm_area_struct * vma,
 76     unsigned long start, unsigned long end, int newflags)
 77 {
 78     struct vm_area_struct * left, * right;
 80     left = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL);
 81     if (!left)
 82         return -EAGAIN;
 83     right = kmem_cache_alloc(vm_area_cachep, SLAB_KERNEL);
 84     if (!right) {
 85         kmem_cache_free(vm_area_cachep, left);
 86         return -EAGAIN;
 87     }
 88     *left = *vma;
 89     *right = *vma;
 90     left->vm_end = start;
 91     right->vm_start = end;
 92     right->vm_pgoff += (right->vm_start - left->vm_start) >> PAGE_SHIFT;
 93     vma->vm_flags = newflags;
 94     left->vm_raend = 0;
 95     right->vm_raend = 0;
 96     if (vma->vm_file)
 97         atomic_add(2, &vma->vm_file->f_count);
 99     if (vma->vm_ops && vma->vm_ops->open) {
100         vma->vm_ops->open(left);
101         vma->vm_ops->open(right);
102     }
103     vma->vm_raend = 0;
104     vma->vm_pgoff += (start - vma->vm_start) >> PAGE_SHIFT;
105     lock_vma_mappings(vma);
106     spin_lock(&vma->vm_mm->page_table_lock);
107     vma->vm_start = start;
108     vma->vm_end = end;
109     vma->vm_flags = newflags;
110     __insert_vm_struct(current->mm, left);
111     __insert_vm_struct(current->mm, right);
112     spin_unlock(&vma->vm_mm->page_table_lock);
113     unlock_vma_mappings(vma);
114     return 0;
115 }
80-87Allocate the two new VMAs from the slab allocator
88-89Copy in the information from the old VMA into them
90The end of the left region is the start of the region to be affected
91The start of the right region is the end of the affected region
92Update the file offset
93The old VMA is now the affected region so update its flags
94-95Make the readahead window 0 to ensure pages not belonging to their regions are not accidentally read ahead
96-97Increment the reference count to the file/device mapping if there is one
99-102Call the open() function for the two new mappings
103-104Cancel the readahead window and update the offset within the file to be the beginning of the locked region
105Lock the shared file/device mappings
106-112Lock the parent mm_struct, update the VMA and insert the two new regions into the process before releasing the lock again
113Unlock the shared mappings
114Return success
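
The address and file offset arithmetic of the three way split is shown below as a userspace sketch. struct demo_region and demo_split_middle() are hypothetical. Allocation, reference counting, locking and list insertion are all left out, and 4KiB pages are assumed.

```c
#define DEMO_PAGE_SHIFT 12   /* assumption: 4KiB pages */

/* Hypothetical subset of the fields of struct vm_area_struct */
struct demo_region {
    unsigned long vm_start, vm_end;   /* covers [vm_start, vm_end) */
    unsigned long vm_pgoff;           /* offset into the file, in pages */
    unsigned int vm_flags;
};

/*
 * Split *vma around [start, end). On return *left and *right keep the
 * old flags and *vma describes only the affected middle region.
 */
void demo_split_middle(struct demo_region *vma, struct demo_region *left,
                       struct demo_region *right, unsigned long start,
                       unsigned long end, unsigned int newflags)
{
    *left = *vma;
    *right = *vma;
    left->vm_end = start;
    right->vm_start = end;
    /* right begins (end - old start) bytes further into the file */
    right->vm_pgoff += (right->vm_start - left->vm_start) >> DEMO_PAGE_SHIFT;
    /* the middle begins (start - old start) bytes further into the file */
    vma->vm_pgoff += (start - vma->vm_start) >> DEMO_PAGE_SHIFT;
    vma->vm_start = start;
    vma->vm_end = end;
    vma->vm_flags = newflags;
}
```

Splitting a region covering 0x10000 to 0x20000 with a page offset of 5 around 0x14000 to 0x18000 gives page offsets of 5, 9 and 13 for the left, middle and right regions respectively.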

D.5  Page Faulting

This section deals with the page fault handler. It begins with the architecture specific function for the x86 and then moves to the architecture independent layer. The architecture specific functions all have the same responsibilities.

D.5.1  x86 Page Fault Handler

D.5.1.1  Function: do_page_fault

Source: arch/i386/mm/fault.c

The call graph for this function is shown in Figure 4.12. This function is the x86 architecture dependent handler for page fault exceptions. Each architecture registers its own but all of them have similar responsibilities.

140 asmlinkage void do_page_fault(struct pt_regs *regs, 
                  unsigned long error_code)
141 {
142     struct task_struct *tsk;
143     struct mm_struct *mm;
144     struct vm_area_struct * vma;
145     unsigned long address;
146     unsigned long page;
147     unsigned long fixup;
148     int write;
149     siginfo_t info;
151     /* get the address */
152     __asm__("movl %%cr2,%0":"=r" (address));
154     /* It's safe to allow irq's after cr2 has been saved */
155     if (regs->eflags & X86_EFLAGS_IF)
156         local_irq_enable();
158     tsk = current;

Function preamble. Get the fault address and enable interrupts

140The parameters are
regs is a struct containing the contents of all the registers at fault time
error_code indicates what sort of fault occurred
152As the comment indicates, the cr2 register holds the fault address
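155-156If interrupts were enabled in the context that faulted (the IF flag is set in the saved eflags), enable them again now that the fault address has been saved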
158Set the current task
173     if (address >= TASK_SIZE && !(error_code & 5))
174         goto vmalloc_fault;
176     mm = tsk->mm;
177     info.si_code = SEGV_MAPERR;
183     if (in_interrupt() || !mm)
184         goto no_context;

Check for exceptional faults, kernel faults, fault in interrupt and fault with no memory context

173If the fault address is at or above TASK_SIZE, it is within the kernel address space. If neither bit 0 nor bit 2 of the error code is set (error_code & 5 is 0), the fault happened while in kernel mode and is not a protection error so handle a vmalloc fault
176Record the working mm
183If this is an interrupt, or there is no memory context (such as with a kernel thread), there is no way to safely handle the fault so goto no_context
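
The test on line 173 can be sketched in userspace as follows. DEMO_TASK_SIZE assumes the default 3GiB/1GiB split of the x86 address space and the function name is hypothetical.

```c
#include <stdbool.h>

/* Assumption: default x86 split, userspace ends at 3GiB */
#define DEMO_TASK_SIZE 0xC0000000UL

/*
 * Bit 0 of the error code is set for a protection fault and bit 2 is
 * set when the fault came from user mode. Masking with 5 tests both.
 */
bool demo_is_vmalloc_fault(unsigned long address, unsigned long error_code)
{
    return address >= DEMO_TASK_SIZE && !(error_code & 5);
}
```

Bit 1 (read or write) is deliberately ignored, so a kernel mode write to a missing page in the kernel address space still qualifies.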
186     down_read(&mm->mmap_sem);
188     vma = find_vma(mm, address);
189     if (!vma)
190         goto bad_area;
191     if (vma->vm_start <= address)
192         goto good_area;
193     if (!(vma->vm_flags & VM_GROWSDOWN))
194         goto bad_area;
195     if (error_code & 4) {
196         /*
197          * accessing the stack below %esp is always a bug.
198          * The "+ 32" is there due to some instructions (like
199          * pusha) doing post-decrement on the stack and that
200          * doesn't show up until later..
201          */
202         if (address + 32 < regs->esp)
203             goto bad_area;
204     }
205     if (expand_stack(vma, address))
206         goto bad_area;

If a fault in userspace, find the VMA for the faulting address and determine if it is a good area, a bad area or if the fault occurred near a region that can be expanded such as the stack

186Take the long lived mm semaphore
188Find the VMA that is responsible or is closest to the faulting address
189-190If a VMA does not exist at all, goto bad_area
191-192If the start of the region is before the address, it means this VMA is the correct VMA for the fault so goto good_area which will check the permissions
193-194For the region that is closest, check if it can grow down (VM_GROWSDOWN). If it can, it means the stack can probably be expanded. If not, goto bad_area
195-204Check to make sure it isn't an access below the stack. If bit 2 of error_code is set (error_code & 4), the fault occurred while running in userspace
205-206The stack is the only region with VM_GROWSDOWN set so if we reach here, the stack is expanded with expand_stack()(See Section D.5.2.1). If it fails, goto bad_area
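
The decision made on lines 188-206 is summarised by the following userspace sketch. The enum and demo_classify_fault() are hypothetical, and the VMA is reduced to the two properties the decision depends on.

```c
enum demo_area { DEMO_GOOD_AREA, DEMO_BAD_AREA, DEMO_EXPAND_STACK };

/*
 * have_vma   non-zero if find_vma() returned a region
 * vm_start   start of that region
 * grows_down non-zero if the region has VM_GROWSDOWN set
 */
enum demo_area demo_classify_fault(int have_vma, unsigned long vm_start,
                                   int grows_down, unsigned long address,
                                   unsigned long error_code,
                                   unsigned long esp)
{
    if (!have_vma)
        return DEMO_BAD_AREA;
    if (vm_start <= address)
        return DEMO_GOOD_AREA;        /* address lies inside the region */
    if (!grows_down)
        return DEMO_BAD_AREA;
    /* From user mode, anything more than 32 bytes below esp is a bug */
    if ((error_code & 4) && address + 32 < esp)
        return DEMO_BAD_AREA;
    return DEMO_EXPAND_STACK;         /* expand_stack() may still fail */
}
```

The esp check only applies to faults from user mode, so a kernel mode access below the stack pointer still attempts the expansion.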
211 good_area:
212     info.si_code = SEGV_ACCERR;
213     write = 0;
214     switch (error_code & 3) {
215         default:    /* 3: write, present */
216 #ifdef TEST_VERIFY_AREA
217             if (regs->cs == KERNEL_CS)
218                 printk("WP fault at %08lx\n", regs->eip);
219 #endif
220             /* fall through */
221         case 2:     /* write, not present */
222             if (!(vma->vm_flags & VM_WRITE))
223                 goto bad_area;
224             write++;
225             break;
226         case 1:     /* read, present */
227             goto bad_area;
228         case 0:     /* read, not present */
229             if (!(vma->vm_flags & (VM_READ | VM_EXEC)))
230                 goto bad_area;
231     }

This is where the first part of a good area is handled. The permissions need to be checked in case this is a protection fault.

212By default return an error
214Check bits 0 and 1 of the error code. Bit 0 at 0 means the page was not present. At 1, it means a protection fault like a write to a read-only area. Bit 1 is 0 if it was a read fault and 1 if a write
215If it is 3, both bits are 1 so it is a write protection fault
221Bit 1 is a 1 so it is a write fault
222-223If the region can not be written to, it is a bad write so goto bad_area. If the region can be written to, this is a page that is marked Copy On Write (COW)
224Flag that a write has occurred
226-227This is a read and the page is present. There is no reason for the fault so must be some other type of exception like a divide by zero, goto bad_area where it is handled
228-230A read occurred on a missing page. Make sure it is ok to read or exec this page. If not, goto bad_area. The check for exec is made because the x86 can not exec protect a page and instead uses the read protect flag. This is why both have to be checked
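
The permission check in the switch statement can be modelled in userspace. This is a sketch only. demo_check_access() is a hypothetical function that takes the relevant VMA permissions as booleans and returns -1 where the kernel code would goto bad_area.

```c
#include <stdbool.h>

/*
 * Returns 0 if the access may proceed or -1 for a bad area.
 * *write is set to 1 when the fault was a permitted write.
 */
int demo_check_access(unsigned long error_code, bool vm_read,
                      bool vm_write, bool vm_exec, int *write)
{
    *write = 0;
    switch (error_code & 3) {
    default:        /* 3: write, present */
    case 2:         /* write, not present */
        if (!vm_write)
            return -1;
        (*write)++;
        break;
    case 1:         /* read, present */
        return -1;
    case 0:         /* read, not present */
        if (!(vm_read || vm_exec))
            return -1;
    }
    return 0;
}
```

Only bits 0 and 1 are examined here. Bit 2, which says whether the fault came from user mode, plays no part in the permission check.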
233  survive:
239     switch (handle_mm_fault(mm, vma, address, write)) {
240     case 1:
241         tsk->min_flt++;
242         break;
243     case 2:
244         tsk->maj_flt++;
245         break;
246     case 0:
247         goto do_sigbus;
248     default:
249         goto out_of_memory;
250     }
252     /*
253      * Did it hit the DOS screen memory VA from vm86 mode?
254      */
255     if (regs->eflags & VM_MASK) {
256         unsigned long bit = (address - 0xA0000) >> PAGE_SHIFT;
257         if (bit < 32)
258             tsk->thread.screen_bitmap |= 1 << bit;
259     }
260     up_read(&mm->mmap_sem);
261     return;

At this point, an attempt is going to be made to handle the fault gracefully with handle_mm_fault().

239Call handle_mm_fault() with the relevant information about the fault. This is the architecture independent part of the handler
240-242A return of 1 means it was a minor fault. Update statistics
243-245A return of 2 means it was a major fault. Update statistics
246-247A return of 0 means some IO error happened during the fault so go to the do_sigbus handler
248-249Any other return means memory could not be allocated for the fault so we are out of memory. In reality this does not happen as another function out_of_memory() is invoked in mm/oom_kill.c before this could happen which is a lot more graceful about who it kills
260Release the lock to the mm
261Return as the fault has been successfully handled
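
The way the return value of handle_mm_fault() is interpreted can be written as a small mapping. The enum and function below are hypothetical and exist only to make the four outcomes explicit.

```c
enum demo_outcome {
    DEMO_MINOR_FAULT,    /* 1: handled without IO, min_flt is incremented */
    DEMO_MAJOR_FAULT,    /* 2: handled with IO, maj_flt is incremented */
    DEMO_SIGBUS,         /* 0: an IO error occurred during the fault */
    DEMO_OUT_OF_MEMORY   /* anything else, including -1 */
};

/* Mirrors the switch statement on lines 239-250 */
enum demo_outcome demo_fault_outcome(int ret)
{
    switch (ret) {
    case 1:
        return DEMO_MINOR_FAULT;
    case 2:
        return DEMO_MAJOR_FAULT;
    case 0:
        return DEMO_SIGBUS;
    default:
        return DEMO_OUT_OF_MEMORY;
    }
}
```
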
267 bad_area:
268     up_read(&mm->mmap_sem);
270     /* User mode accesses just cause a SIGSEGV */
271     if (error_code & 4) {
272         tsk->thread.cr2 = address;
273         tsk->thread.error_code = error_code;
274         tsk->thread.trap_no = 14;
275         info.si_signo = SIGSEGV;
276         info.si_errno = 0;
277         /* info.si_code has been set above */
278         info.si_addr = (void *)address;
279         force_sig_info(SIGSEGV, &info, tsk);
280         return;
281     }
283     /*
284      * Pentium F0 0F C7 C8 bug workaround.
285      */
286     if (boot_cpu_data.f00f_bug) {
287         unsigned long nr;
289         nr = (address - idt) >> 3;
291         if (nr == 6) {
292             do_invalid_op(regs, 0);
293             return;
294         }
295     }

This is the bad area handler such as using memory with no vm_area_struct managing it. If the fault is not by a user process or the f00f bug, the no_context label is fallen through to.

271If bit 2 of the error code is set, the fault came from userspace so it is a simple case of sending a SIGSEGV to kill the process
272-274Set thread information about what happened which can be read by a debugger later
275Record that a SIGSEGV signal was sent
276clear errno
278Record the address
279Send the SIGSEGV signal. The process will exit and dump all the relevant information
280Return as the fault has been successfully handled
286-295A bug in the first Pentiums was called the f00f bug which caused the processor to constantly page fault. It was used as a local DoS attack on a running Linux system. This bug was trapped within a few hours and a patch released. Now it results in a harmless termination of the process rather than a rebooting system
297 no_context:
298     /* Are we prepared to handle this kernel fault?  */
299     if ((fixup = search_exception_table(regs->eip)) != 0) {
300         regs->eip = fixup;
301         return;
302     }
299-302Search the exception table with search_exception_table() to see if this exception can be handled and if so, call the proper exception handler after returning. This is really important during copy_from_user() and copy_to_user() when an exception handler is especially installed to trap reads and writes to invalid regions in userspace without having to make expensive checks. It means that a small fixup block of code can be called rather than falling through to the next block which causes an oops
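
The idea behind the exception table lookup can be sketched as a search over pairs of addresses. Everything in the block below is an assumption made for illustration: the structure layout, the names, and the use of a binary search over a table sorted by instruction address. It is not a copy of search_exception_table().

```c
#include <stddef.h>

/* Hypothetical entry: an instruction that may fault and its fixup code */
struct demo_extable_entry {
    unsigned long insn;    /* address of the instruction that may fault */
    unsigned long fixup;   /* address to resume at if it does */
};

/*
 * Return the fixup address for eip, or 0 if there is no entry.
 * The table must be sorted by insn in ascending order.
 */
unsigned long demo_search_extable(const struct demo_extable_entry *table,
                                  size_t n, unsigned long eip)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (table[mid].insn == eip)
            return table[mid].fixup;
        if (table[mid].insn < eip)
            lo = mid + 1;
        else
            hi = mid;
    }
    return 0;
}
```

A return of 0 corresponds to the case on line 299 where no fixup exists and the handler continues on to generate an oops.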
304 /*
305  * Oops. The kernel tried to access some bad page. We'll have to
306  * terminate things with extreme prejudice.
307  */
309     bust_spinlocks(1);
311     if (address < PAGE_SIZE)
312         printk(KERN_ALERT "Unable to handle kernel NULL pointer dereference");
313     else
314         printk(KERN_ALERT "Unable to handle kernel paging request");
315     printk(" at virtual address %08lx\n",address);
316     printk(" printing eip:\n");
317     printk("%08lx\n", regs->eip);
318     asm("movl %%cr3,%0":"=r" (page));
319     page = ((unsigned long *) __va(page))[address >> 22];
320     printk(KERN_ALERT "*pde = %08lx\n", page);
321     if (page & 1) {
322         page &= PAGE_MASK;
323         address &= 0x003ff000;
324         page = ((unsigned long *) 
                __va(page))[address >> PAGE_SHIFT];
325         printk(KERN_ALERT "*pte = %08lx\n", page);
326     }
327     die("Oops", regs, error_code);
328     bust_spinlocks(0);
329     do_exit(SIGKILL);

This is the no_context handler. Some bad exception occurred which, in all likelihood, is going to end with the process being terminated. Otherwise the kernel faulted when it definitely should not have and an OOPS report is generated.

309-329Otherwise the kernel faulted when it really shouldn't have and it is a kernel bug. This block generates an oops report
309Forcibly free spinlocks which might prevent a message getting to console
311-312If the address is < PAGE_SIZE, it means that a null pointer was used. Linux deliberately has page 0 unassigned to trap this type of fault which is a common programming error
313-314Otherwise it is just some bad kernel error such as a driver trying to access userspace incorrectly
315-320Print out information about the fault
321-326Print out information about the page being faulted
327Die and generate an oops report which can be used later to get a stack trace so a developer can see more accurately where and how the fault occurred
329Forcibly kill the faulting process
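
The index arithmetic used on lines 319 and 323-324 to find the page directory and page table entries can be checked in userspace. The sketch assumes the two level x86 layout without PAE, where a 32 bit address is split into 10, 10 and 12 bits. The function names are hypothetical.

```c
#include <stdint.h>

#define DEMO_PAGE_SHIFT 12   /* assumption: 4KiB pages */

/* Index into the page directory, the top 10 bits of the address */
uint32_t demo_pde_index(uint32_t address)
{
    return address >> 22;
}

/* Index into the page table, the middle 10 bits of the address */
uint32_t demo_pte_index(uint32_t address)
{
    return (address & 0x003ff000) >> DEMO_PAGE_SHIFT;
}
```

For the address 0xC0101234 the page directory index is 0x300 and the page table index is 0x101.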
335 out_of_memory:
336     if (tsk->pid == 1) {
337         yield();
338         goto survive;
339     }
340     up_read(&mm->mmap_sem);
341     printk("VM: killing process %s\n", tsk->comm);
342     if (error_code & 4)
343         do_exit(SIGKILL);
344     goto no_context;

The out of memory handler. Usually ends with the faulting process getting killed unless it is init

336-339If the process is init, just yield and goto survive which will try to handle the fault gracefully. init should never be killed
340Free the mm semaphore
341Print out a helpful “You are Dead” message
342If from userspace, just kill the process
344If in kernel space, go to the no_context handler which in this case will probably result in a kernel oops
346 do_sigbus:
347     up_read(&mm->mmap_sem);
353     tsk->thread.cr2 = address;
354     tsk->thread.error_code = error_code;
355     tsk->thread.trap_no = 14;
356     info.si_signo = SIGBUS;
357     info.si_errno = 0;
358     info.si_code = BUS_ADRERR;
359     info.si_addr = (void *)address;
360     force_sig_info(SIGBUS, &info, tsk);
362     /* Kernel mode? Handle exceptions or die */
363     if (!(error_code & 4))
364         goto no_context;
365     return;
347Free the mm lock
353-359Fill in information to show a SIGBUS occurred at the faulting address so that a debugger can trap it later
360Send the signal
363-364If in kernel mode, try and handle the exception during no_context
365If in userspace, just return and the process will die in due course
367 vmalloc_fault:
368     {
376         int offset = __pgd_offset(address);
377         pgd_t *pgd, *pgd_k;
378         pmd_t *pmd, *pmd_k;
379         pte_t *pte_k;
381         asm("movl %%cr3,%0":"=r" (pgd));
382         pgd = offset + (pgd_t *)__va(pgd);
383         pgd_k = init_mm.pgd + offset;
385         if (!pgd_present(*pgd_k))
386             goto no_context;
387         set_pgd(pgd, *pgd_k);
389         pmd = pmd_offset(pgd, address);
390         pmd_k = pmd_offset(pgd_k, address);
391         if (!pmd_present(*pmd_k))
392             goto no_context;
393         set_pmd(pmd, *pmd_k);
395         pte_k = pte_offset(pmd_k, address);
396         if (!pte_present(*pte_k))
397             goto no_context;
398         return;
399     }
400 }

This is the vmalloc fault handler. When pages are mapped in the vmalloc space, only the reference page table is updated. As each process references this area, a fault will be trapped and the process page tables will be synchronised with the reference page table here.

376Get the offset within a PGD
381Copy the address of the PGD for the process from the cr3 register to pgd
382Calculate the pgd pointer from the process PGD
383Calculate for the kernel reference PGD
385-386If the pgd entry is invalid for the kernel page table, goto no_context
387Set the page table entry in the process page table with a copy from the kernel reference page table
389-393Same idea for the PMD. Copy the page table entry from the kernel reference page table to the process page tables
395Check the PTE
396-397If it is not present, it means the page was not valid even in the kernel reference page table so goto no_context to handle what is probably a kernel bug, probably a reference to a random part of unused kernel space
398Otherwise return knowing the process page tables have been updated and are in sync with the kernel page tables
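
The synchronisation step can be modelled with two arrays standing in for the process and reference page directories. This is a heavily simplified userspace sketch: it assumes two level paging with 1024 directory entries, treats bit 0 of an entry as the present bit, and omits the PMD and PTE checks. All names are hypothetical.

```c
#include <stdint.h>

#define DEMO_PTRS_PER_PGD 1024    /* assumption: two level paging, no PAE */
#define DEMO_PRESENT      0x1u    /* bit 0 of an entry marks it present */

/*
 * Copy the directory entry covering address from the reference table
 * into the process table. Returns 0 on success or -1 where the kernel
 * code would goto no_context.
 */
int demo_sync_pgd(uint32_t *process_pgd, const uint32_t *reference_pgd,
                  uint32_t address)
{
    uint32_t offset = address >> 22;   /* same role as __pgd_offset() */

    if (!(reference_pgd[offset] & DEMO_PRESENT))
        return -1;                     /* not mapped in the kernel either */
    process_pgd[offset] = reference_pgd[offset];
    return 0;
}
```

After a successful call, the faulting access can be retried because the process page directory now points at the same lower level tables as the reference page table.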

D.5.2  Expanding the Stack

D.5.2.1  Function: expand_stack

Source: include/linux/mm.h

This function is called by the architecture dependent page fault handler. The VMA supplied is guaranteed to be one that can grow to cover the address.

640 static inline int expand_stack(struct vm_area_struct * vma, 
                                   unsigned long address)
641 {
642     unsigned long grow;
644     /*
645      * vma->vm_start/vm_end cannot change under us because 
         * the caller is required
646      * to hold the mmap_sem in write mode. We need to get the
647      * spinlock only before relocating the vma range ourself.
648      */
649     address &= PAGE_MASK;
650     spin_lock(&vma->vm_mm->page_table_lock);
651     grow = (vma->vm_start - address) >> PAGE_SHIFT;
652     if (vma->vm_end - address > current->rlim[RLIMIT_STACK].rlim_cur ||
653     ((vma->vm_mm->total_vm + grow) << PAGE_SHIFT) > 
                                       current->rlim[RLIMIT_AS].rlim_cur) {
654         spin_unlock(&vma->vm_mm->page_table_lock);
655         return -ENOMEM;
656     }
657     vma->vm_start = address;
658     vma->vm_pgoff -= grow;
659     vma->vm_mm->total_vm += grow;
660     if (vma->vm_flags & VM_LOCKED)
661         vma->vm_mm->locked_vm += grow;
662     spin_unlock(&vma->vm_mm->page_table_lock);
663     return 0;
664 }
649Round the address down to the nearest page boundary
650Lock the page tables spinlock
651Calculate how many pages the stack needs to grow by
652Check to make sure that the size of the stack does not exceed the process limits
653Check to make sure that the size of the address space will not exceed process limits after the stack is grown
654-655If either of the limits is reached, return -ENOMEM which will cause the faulting process to segfault
657-658Grow the VMA down
659Update the amount of address space used by the process
660-661If the region is locked, update the number of locked pages used by the process
662-663Unlock the process page tables and return success
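
The growth arithmetic and the two limit checks can be exercised with a userspace sketch. struct demo_stack and demo_expand_stack() are hypothetical, the resource limits are passed in as plain parameters rather than read from the task, locking is omitted and 4KiB pages are assumed.

```c
#include <errno.h>

#define DEMO_PAGE_SHIFT 12   /* assumption: 4KiB pages */
#define DEMO_PAGE_MASK  (~((1UL << DEMO_PAGE_SHIFT) - 1))

/* Hypothetical merge of the VMA and mm fields that expand_stack() touches */
struct demo_stack {
    unsigned long vm_start, vm_end;   /* stack region [vm_start, vm_end) */
    unsigned long vm_pgoff;
    unsigned long total_vm;           /* pages mapped by the process */
    unsigned long locked_vm;          /* pages locked by the process */
    int locked;                       /* non-zero if VM_LOCKED is set */
};

int demo_expand_stack(struct demo_stack *s, unsigned long address,
                      unsigned long rlim_stack, unsigned long rlim_as)
{
    unsigned long grow;

    address &= DEMO_PAGE_MASK;
    grow = (s->vm_start - address) >> DEMO_PAGE_SHIFT;
    /* The new stack size and the new total size are both checked */
    if (s->vm_end - address > rlim_stack ||
        ((s->total_vm + grow) << DEMO_PAGE_SHIFT) > rlim_as)
        return -ENOMEM;
    s->vm_start = address;
    s->vm_pgoff -= grow;
    s->total_vm += grow;
    if (s->locked)
        s->locked_vm += grow;
    return 0;
}
```

Nothing is modified when a limit is exceeded, so a failed expansion leaves the region exactly as it was.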

D.5.3  Architecture Independent Page Fault Handler

This is the top level pair of functions for the architecture independent page fault handler.

D.5.3.1  Function: handle_mm_fault

Source: mm/memory.c

The call graph for this function is shown in Figure 4.14. This function allocates the PMD and PTE necessary for this new PTE that is about to be allocated. It takes the necessary locks to protect the page tables before calling handle_pte_fault() to fault in the page itself.

1364 int handle_mm_fault(struct mm_struct *mm, 
         struct vm_area_struct * vma,
1365     unsigned long address, int write_access)
1366 {
1367     pgd_t *pgd;
1368     pmd_t *pmd;
1370     current->state = TASK_RUNNING;
1371     pgd = pgd_offset(mm, address);
1373     /*
1374      * We need the page table lock to synchronize with kswapd
1375      * and the SMP-safe atomic PTE updates.
1376      */
1377     spin_lock(&mm->page_table_lock);
1378     pmd = pmd_alloc(mm, pgd, address);
1380     if (pmd) {
1381         pte_t * pte = pte_alloc(mm, pmd, address);
1382         if (pte)
1383             return handle_pte_fault(mm, vma, address,
                            write_access, pte);
1384     }
1385     spin_unlock(&mm->page_table_lock);
1386     return -1;
1387 }
1364The parameters of the function are;
mm is the mm_struct for the faulting process
vma is the vm_area_struct managing the region the fault occurred in
address is the faulting address
write_access is 1 if the fault is a write fault
1370Set the current state of the process
1371Get the pgd entry from the top level page table
1377Lock the mm_struct as the page tables will change
1378pmd_alloc() will allocate a pmd_t if one does not already exist
1380If the pmd has been successfully allocated then...
1381Allocate a PTE for this address if one does not already exist
1382-1383Handle the page fault with handle_pte_fault() (See Section D.5.3.2) and return the status code
1385Failure path, unlock the mm_struct
1386Return -1 which will be interpreted as an out of memory condition which is correct as this line is only reached if a PMD or PTE could not be allocated
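
The allocate-on-demand walk can be illustrated with a toy three-level table. This is a sketch only: the toy_ types, the 16-entry tables and toy_walk_alloc() are all hypothetical, calloc() stands in for the kernel allocators and no locking is modelled.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_ENTRIES 16 /* hypothetical table size */

struct toy_pte_table { unsigned long pte[TOY_ENTRIES]; };
struct toy_pmd_table { struct toy_pte_table *pte_tables[TOY_ENTRIES]; };
struct toy_pgd       { struct toy_pmd_table *pmds[TOY_ENTRIES]; };

/*
 * Return a pointer to the PTE slot at indices (i, j, k), allocating
 * the middle and bottom level tables if they do not already exist.
 * NULL is returned if an index is out of range or memory runs out.
 */
static unsigned long *toy_walk_alloc(struct toy_pgd *pgd, unsigned int i,
                                     unsigned int j, unsigned int k)
{
    struct toy_pmd_table *pmd;

    if (i >= TOY_ENTRIES || j >= TOY_ENTRIES || k >= TOY_ENTRIES)
        return NULL;

    /* Plays the role of pmd_alloc() */
    if (!pgd->pmds[i]) {
        pgd->pmds[i] = calloc(1, sizeof(*pgd->pmds[i]));
        if (!pgd->pmds[i])
            return NULL;
    }
    pmd = pgd->pmds[i];

    /* Plays the role of pte_alloc() */
    if (!pmd->pte_tables[j]) {
        pmd->pte_tables[j] = calloc(1, sizeof(*pmd->pte_tables[j]));
        if (!pmd->pte_tables[j])
            return NULL;
    }
    return &pmd->pte_tables[j]->pte[k];
}
```

A second walk to the same indices returns the same slot without allocating anything, which mirrors the "if one does not already exist" behaviour noted for lines 1378 and 1381.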

D.5.3.2  Function: handle_pte_fault

Source: mm/memory.c

This function decides what type of fault this is and which function should handle it. do_no_page() is called if this is the first time a page is to be allocated. do_swap_page() handles the case where the page was swapped out to disk with the exception of pages swapped out from tmpfs. do_wp_page() breaks COW pages. If none of them are appropriate, the PTE entry is simply updated. If it was written to, it is marked dirty and it is marked accessed to show it is a young page.

1331 static inline int handle_pte_fault(struct mm_struct *mm,
1332     struct vm_area_struct * vma, unsigned long address,
1333     int write_access, pte_t * pte)
1334 {
1335     pte_t entry;
1337     entry = *pte;
1338     if (!pte_present(entry)) {
1339         /*
1340          * If it truly wasn't present, we know that kswapd
1341          * and the PTE updates will not touch it later. So
1342          * drop the lock.
1343          */
1344         if (pte_none(entry))
1345             return do_no_page(mm, vma, address, 
                         write_access, pte);
1346         return do_swap_page(mm, vma, address, pte, entry,
                         write_access);
1347     }
1349     if (write_access) {
1350         if (!pte_write(entry))
1351             return do_wp_page(mm, vma, address, pte, entry);
1353         entry = pte_mkdirty(entry);
1354     }
1355     entry = pte_mkyoung(entry);
1356     establish_pte(vma, address, pte, entry);
1357     spin_unlock(&mm->page_table_lock);
1358     return 1;
1359 }
1331The parameters of the function are the same as those for handle_mm_fault() except the PTE for the fault is included
1337Record the PTE
1338Handle the case where the PTE is not present
1344If the PTE has never been filled, handle the allocation of the PTE with do_no_page()(See Section D.5.4.1)
1346If the page has been swapped out to backing storage, handle it with do_swap_page()(See Section D.5.5.1)
1349-1354Handle the case where the page is being written to
1350-1351If the PTE is not marked writable, it is a COW page so handle it with do_wp_page()(See Section D.5.6.1)
1353Otherwise just simply mark the page as dirty
1355Mark the page as accessed
1356establish_pte() copies the PTE and then updates the TLB and MMU cache. This does not copy in a new PTE but some architectures require the TLB and MMU update
1357Unlock the mm_struct and return that a minor fault occurred
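
The decision made by this function can be condensed into a small classifier. This is a sketch, not the kernel code: struct toy_pte and the enum values are hypothetical and simply name which handler the real function would call.

```c
#include <assert.h>

/* Which path handle_pte_fault() would take (hypothetical names) */
enum toy_fault {
    TOY_NO_PAGE,     /* do_no_page(): entry never filled */
    TOY_SWAP_PAGE,   /* do_swap_page(): page swapped out */
    TOY_WP_PAGE,     /* do_wp_page(): write to a COW page */
    TOY_UPDATE_ONLY  /* just mark dirty/young and re-establish */
};

/* Hypothetical flattened view of the bits the function tests */
struct toy_pte {
    int present;  /* what pte_present() would report */
    int none;     /* what pte_none() would report */
    int writable; /* what pte_write() would report */
};

static enum toy_fault toy_classify(struct toy_pte pte, int write_access)
{
    if (!pte.present)
        return pte.none ? TOY_NO_PAGE : TOY_SWAP_PAGE;
    if (write_access && !pte.writable)
        return TOY_WP_PAGE;
    return TOY_UPDATE_ONLY;
}
```

Note that the order of the tests matters. A write to a page that is not present is resolved as a demand allocation or swap-in first and never reaches the COW check.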

D.5.4  Demand Allocation

D.5.4.1  Function: do_no_page

Source: mm/memory.c

The call graph for this function is shown in Figure 4.15. This function is called the first time a page is referenced so that it may be allocated and filled with data if necessary. If it is an anonymous page, determined by the lack of a vm_ops available to the VMA or the lack of a nopage() function, then do_anonymous_page() is called. Otherwise the supplied nopage() function is called to allocate a page and it is inserted into the page tables here. The function has the following tasks;

1245 static int do_no_page(struct mm_struct * mm, 
         struct vm_area_struct * vma,
1246     unsigned long address, int write_access, pte_t *page_table)
1247 {
1248     struct page * new_page;
1249     pte_t entry;
1251     if (!vma->vm_ops || !vma->vm_ops->nopage)
1252         return do_anonymous_page(mm, vma, page_table,
                        write_access, address);
1253     spin_unlock(&mm->page_table_lock);
1255     new_page = vma->vm_ops->nopage(vma, address & PAGE_MASK, 0);
1257     if (new_page == NULL)   /* no page was available -- SIGBUS */
1258         return 0;
1259     if (new_page == NOPAGE_OOM)
1260         return -1;
1245The parameters supplied are the same as those for handle_pte_fault()
1251-1252If no vm_ops is supplied or no nopage() function is supplied, then call do_anonymous_page()(See Section D.5.4.2) to allocate a page and return it
1253Otherwise free the page table lock as the nopage() function can not be called with spinlocks held
1255Call the supplied nopage function, in the case of filesystems, this is frequently filemap_nopage()(See Section D.6.4.1) but will be different for each device driver
1257-1258If NULL is returned, it means some error occurred in the nopage function such as an IO error while reading from disk. In this case, 0 is returned which results in a SIGBUS being sent to the faulting process
1259-1260If NOPAGE_OOM is returned, the physical page allocator failed to allocate a page and -1 is returned which will forcibly kill the process
1265     if (write_access && !(vma->vm_flags & VM_SHARED)) {
1266         struct page * page = alloc_page(GFP_HIGHUSER);
1267         if (!page) {
1268             page_cache_release(new_page);
1269             return -1;
1270         }
1271         copy_user_highpage(page, new_page, address);
1272         page_cache_release(new_page);
1273         lru_cache_add(page);
1274         new_page = page;
1275     }

Break COW early in this block if appropriate. COW is broken if the fault is a write fault and the region is not shared with VM_SHARED. If COW was not broken in this case, a second fault would occur immediately upon return.

1265Check if COW should be broken early
1266If so, allocate a new page for the process
1267-1270If the page could not be allocated, reduce the reference count to the page returned by the nopage() function and return -1 for out of memory
1271Otherwise copy the contents
1272Reduce the reference count to the returned page which may still be in use by another process
1273Add the new page to the LRU lists so it may be reclaimed by kswapd later
1277     spin_lock(&mm->page_table_lock);
1288     /* Only go through if we didn't race with anybody else... */
1289     if (pte_none(*page_table)) {
1290         ++mm->rss;
1291         flush_page_to_ram(new_page);
1292         flush_icache_page(vma, new_page);
1293         entry = mk_pte(new_page, vma->vm_page_prot);
1294         if (write_access)
1295             entry = pte_mkwrite(pte_mkdirty(entry));
1296         set_pte(page_table, entry);
1297     } else {
1298         /* One of our sibling threads was faster, back out. */
1299         page_cache_release(new_page);
1300         spin_unlock(&mm->page_table_lock);
1301         return 1;
1302     }
1304     /* no need to invalidate: a not-present page shouldn't 
        * be cached */
1305     update_mmu_cache(vma, address, entry);
1306     spin_unlock(&mm->page_table_lock);
1307     return 2;     /* Major fault */
1308 }
1277Lock the page tables again as the allocations have finished and the page tables are about to be updated
1289Check if there is still no PTE in the entry we are about to use. If two faults hit here at the same time, it is possible another processor has already completed the page fault and this one should be backed out
1290-1297If there is no PTE entered, complete the fault
1290Increase the RSS count as the process is now using another page. A check really should be made here to make sure it isn't the global zero page as the RSS count could be misleading
1291As the page is about to be mapped to the process space, it is possible for some architectures that writes to the page in kernel space will not be visible to the process. flush_page_to_ram() ensures the CPU cache will be coherent
1292flush_icache_page() is similar in principle except it ensures the icache and dcache are coherent
1293Create a pte_t with the appropriate permissions
1294-1295If this is a write, then make sure the PTE has write permissions
1296Place the new PTE in the process page tables
1297-1302If the PTE is already filled, the page acquired from the nopage() function must be released
1299Decrement the reference count to the page. If it drops to 0, it will be freed
1300-1301Release the mm_struct lock and return 1 to signal this is a minor page fault as no major work had to be done for this fault as it was all done by the winner of the race
1305Update the MMU cache for architectures that require it
1306-1307Release the mm_struct lock and return 2 to signal this is a major page fault
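
The return values used by the fault handlers in this section follow one convention, summarised in the sketch below. The helper toy_fault_result() is hypothetical and exists only to collect the meanings given in the commentary above in one place.

```c
#include <assert.h>
#include <string.h>

/* Describe a fault handler return value as used in this section */
static const char *toy_fault_result(int ret)
{
    switch (ret) {
    case 2:
        return "major fault";   /* data had to be read from disk */
    case 1:
        return "minor fault";   /* no disk access was required */
    case 0:
        return "SIGBUS";        /* nopage() reported an error */
    case -1:
        return "out of memory"; /* an allocation failed */
    default:
        return "unknown";
    }
}
```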

D.5.4.2  Function: do_anonymous_page

Source: mm/memory.c

This function allocates a new page for a process accessing a page for the first time. If it is a read access, a system wide page containing only zeros is mapped into the process. If it is a write access, a zero filled page is allocated and placed within the page tables.

1190 static int do_anonymous_page(struct mm_struct * mm, 
                  struct vm_area_struct * vma, 
                  pte_t *page_table, int write_access, 
                  unsigned long addr)
1191 {
1192     pte_t entry;
1194     /* Read-only mapping of ZERO_PAGE. */
1195     entry = pte_wrprotect(mk_pte(ZERO_PAGE(addr), 
                                   vma->vm_page_prot));
1197     /* ..except if it's a write access */
1198     if (write_access) {
1199         struct page *page;
1201         /* Allocate our own private page. */
1202         spin_unlock(&mm->page_table_lock);
1204         page = alloc_page(GFP_HIGHUSER);
1205         if (!page)
1206             goto no_mem;
1207         clear_user_highpage(page, addr);
1209         spin_lock(&mm->page_table_lock);
1210         if (!pte_none(*page_table)) {
1211             page_cache_release(page);
1212             spin_unlock(&mm->page_table_lock);
1213             return 1;
1214         }
1215         mm->rss++;
1216         flush_page_to_ram(page);
1217         entry = pte_mkwrite(
                 pte_mkdirty(mk_pte(page, vma->vm_page_prot)));
1218         lru_cache_add(page);
1219         mark_page_accessed(page);
1220     }
1222     set_pte(page_table, entry);
1224     /* No need to invalidate - it was non-present before */
1225     update_mmu_cache(vma, addr, entry);
1226     spin_unlock(&mm->page_table_lock);
1227     return 1;     /* Minor fault */
1229 no_mem:
1230     return -1;
1231 }
1190The parameters are the same as those passed to handle_pte_fault() (See Section D.5.3.2)
1195For read accesses, simply map the system wide empty_zero_page which the ZERO_PAGE() macro returns with the given permissions. The page is write protected so that a write to the page will result in a page fault
1198-1220If this is a write fault, then allocate a new page and zero fill it
1202Unlock the mm_struct as the allocation of a new page could sleep
1204Allocate a new page
1205If a page could not be allocated, return -1 to handle the OOM situation
1207Zero fill the page
1209Reacquire the lock as the page tables are to be updated
1215Update the RSS for the process. Note that the RSS is not updated if it is the global zero page being mapped as is the case with the read-only fault at line 1195
1216Ensure the cache is coherent
1217Mark the PTE writable and dirty as it has been written to
1218Add the page to the LRU list so it may be reclaimed by the swapper later
1219Mark the page accessed which ensures the page is marked hot and on the top of the active list
1222Fix the PTE in the page tables for this process
1225Update the MMU cache if the architecture needs it
1226Free the page table lock
1227Return as a minor fault as even though it is possible the page allocator spent time writing out pages, data did not have to be read from disk to fill this page
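
The read and write cases can be modelled in userspace as follows. This is a sketch under assumptions: struct toy_mapping stands in for the PTE, toy_zero_page for the system wide zero page, calloc() for alloc_page() plus clear_user_highpage(), and no locking or race handling is modelled.

```c
#include <assert.h>
#include <stdlib.h>

#define TOY_PAGE_SIZE 4096 /* assumed page size */

/* Shared read-only page of zeros, zero initialised as a static object */
static const unsigned char toy_zero_page[TOY_PAGE_SIZE];

/* Hypothetical stand-in for the PTE of one anonymous page */
struct toy_mapping {
    const unsigned char *page; /* page the mapping points at */
    int writable;              /* 0 means a write will fault again */
};

/* Returns 1 for a minor fault or -1 if no memory is available */
static int toy_anonymous_fault(struct toy_mapping *m, unsigned long *rss,
                               int write_access)
{
    unsigned char *page;

    if (!write_access) {
        /* Read fault: map the shared zero page write protected.
         * The RSS is deliberately left alone. */
        m->page = toy_zero_page;
        m->writable = 0;
        return 1;
    }

    /* Write fault: allocate a private zero filled page */
    page = calloc(1, TOY_PAGE_SIZE);
    if (!page)
        return -1;
    m->page = page;
    m->writable = 1;
    ++*rss; /* the process now uses one more physical page */
    return 1;
}
```

A read fault followed by a write fault on the same mapping leaves the RSS at 1, as only the private page is counted.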

D.5.5  Demand Paging

D.5.5.1  Function: do_swap_page

Source: mm/memory.c

The call graph for this function is shown in Figure 4.16. This function handles the case where a page has been swapped out. A swapped out page may exist in the swap cache if it is shared between a number of processes or recently swapped in during readahead. This function is broken up into three parts

1117 static int do_swap_page(struct mm_struct * mm,
1118     struct vm_area_struct * vma, unsigned long address,
1119     pte_t * page_table, pte_t orig_pte, int write_access)
1120 {
1121     struct page *page;
1122     swp_entry_t entry = pte_to_swp_entry(orig_pte);
1123     pte_t pte;
1124     int ret = 1;
1126     spin_unlock(&mm->page_table_lock);
1127     page = lookup_swap_cache(entry);

Function preamble, check for the page in the swap cache

1117-1119The parameters are the same as those supplied to handle_pte_fault() (See Section D.5.3.2)
1122Get the swap entry information from the PTE
1126Free the mm_struct spinlock
1127Lookup the page in the swap cache
1128     if (!page) {
1129         swapin_readahead(entry);
1130         page = read_swap_cache_async(entry);
1131         if (!page) {
1136             int retval;
1137             spin_lock(&mm->page_table_lock);
1138             retval = pte_same(*page_table, orig_pte) ? -1 : 1;
1139             spin_unlock(&mm->page_table_lock);
1140             return retval;
1141         }
1143         /* Had to read the page from swap area: Major fault */
1144         ret = 2;
1145     }

If the page did not exist in the swap cache, then read it from backing storage with swapin_readahead() which reads in the requested pages and a number of pages after it. Once it completes, read_swap_cache_async() should be able to return the page.

1128-1145This block is executed if the page was not in the swap cache
1129swapin_readahead()(See Section D.6.6.1) reads in the requested page and a number of pages after it. The number of pages read in is determined by the page_cluster variable in mm/swap.c which is initialised to 2 on machines with less than 16MiB of memory and 3 otherwise. 2^page_cluster pages are read in after the requested page unless a bad or empty page entry is encountered
1130read_swap_cache_async() (See Section K.3.1.1) will look up the requested page and read it from disk if necessary
1131-1141If the page does not exist, there was another fault which swapped in this page and removed it from the cache while spinlocks were dropped
1137Lock the mm_struct
1138Compare the current PTE with the original one. If they still match, nobody else has faulted the page in, so -1 is returned to signal an error. Otherwise another fault has already handled this page and 1 is returned to mark a minor page fault as a disk access was not required for this particular page.
1139-1140Unlock the mm_struct and return the status
1144The disk had to be accessed, so mark that this is a major page fault
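
The readahead size described for line 1129 is a simple power of two. The sketch below only restates the commentary: toy_page_cluster() and toy_readahead_pages() are hypothetical helpers and the memory size is passed in MiB as a plain parameter.

```c
#include <assert.h>

/* page_cluster as described above: 2 below 16MiB of memory, else 3 */
static unsigned int toy_page_cluster(unsigned long mem_mib)
{
    return mem_mib < 16 ? 2 : 3;
}

/* Number of pages in one readahead cluster: 2^page_cluster */
static unsigned long toy_readahead_pages(unsigned long mem_mib)
{
    return 1UL << toy_page_cluster(mem_mib);
}
```

With these values a small machine reads 4 pages per cluster and anything with 16MiB or more reads 8.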
1147     mark_page_accessed(page);
1149     lock_page(page);
1151     /*
1152      * Back out if somebody else faulted in this pte while we
1153      * released the page table lock.
1154      */
1155     spin_lock(&mm->page_table_lock);
1156     if (!pte_same(*page_table, orig_pte)) {
1157         spin_unlock(&mm->page_table_lock);
1158         unlock_page(page);
1159         page_cache_release(page);
1160         return 1;
1161     }
1163     /* The page isn't present yet, go ahead with the fault. */
1165     swap_free(entry);
1166     if (vm_swap_full())
1167         remove_exclusive_swap_page(page);
1169     mm->rss++;
1170     pte = mk_pte(page, vma->vm_page_prot);
1171     if (write_access && can_share_swap_page(page))
1172         pte = pte_mkdirty(pte_mkwrite(pte));
1173     unlock_page(page);
1175     flush_page_to_ram(page);
1176     flush_icache_page(vma, page);
1177     set_pte(page_table, pte);
1179     /* No need to invalidate - it was non-present before */
1180     update_mmu_cache(vma, address, pte);
1181     spin_unlock(&mm->page_table_lock);
1182     return ret;
1183 }

Place the page in the process page tables

1147mark_page_accessed()(See Section J.2.3.1) will mark the page as active so it will be moved to the top of the active LRU list
1149Lock the page which has the side effect of waiting for the IO swapping in the page to complete
1155-1161If someone else faulted in the page before we could, the reference to the page is dropped, the lock freed and return that this was a minor fault
1165The function swap_free()(See Section K.2.2.1) reduces the reference to a swap entry. If it drops to 0, it is actually freed
1166-1167Page slots in swap space are reserved for the same page once they have been swapped out to avoid having to search for a free slot each time. If the swap space is full though, the reservation is broken and the slot freed up for another page
1169The page is now going to be used so increment the mm_structs RSS count
1170Make a PTE for this page
1171If the page is being written to and is not shared between more than one process, mark it writable and dirty so that it will be kept in sync with the backing storage and swap cache for other processes
1173Unlock the page
1175As the page is about to be mapped to the process space, it is possible for some architectures that writes to the page in kernel space will not be visible to the process. flush_page_to_ram() ensures the cache will be coherent
1176flush_icache_page() is similar in principle except it ensures the icache and dcache are coherent
1177Set the PTE in the process page tables
1180Update the MMU cache if the architecture requires it
1181-1182Unlock the mm_struct and return whether it was a minor or major page fault

D.5.5.2  Function: can_share_swap_page

Source: mm/swapfile.c

This function determines if the swap cache entry for this page may be used or not. It may be used if there are no other references to it. Most of the work is performed by exclusive_swap_page() but this function first makes a few basic checks to avoid having to acquire too many locks.

259 int can_share_swap_page(struct page *page)
260 {
261     int retval = 0;
263     if (!PageLocked(page))
264         BUG();
265     switch (page_count(page)) {
266     case 3:
267         if (!page->buffers)
268                 break;
269         /* Fallthrough */
270     case 2:
271         if (!PageSwapCache(page))
272                 break;
273         retval = exclusive_swap_page(page);
274         break;
275     case 1:
276         if (PageReserved(page))
277                 break;
278             retval = 1;
279     }
280         return retval;
281 }
263-264This function is called from the fault path and the page must be locked
265Switch based on the number of references
266-268If the count is 3, but there are no buffers associated with it, there is more than one process using the page. Buffers may be associated for just one process if the page is backed by a swap file instead of a partition
270-273If the count is only two, but it is not a member of the swap cache, then it has no slot which may be shared so return false. Otherwise perform a full check with exclusive_swap_page() (See Section D.5.5.3)
276-277If the page is reserved, it is the global ZERO_PAGE so it cannot be shared. Otherwise, this process is definitely the only user of the page

D.5.5.3  Function: exclusive_swap_page

Source: mm/swapfile.c

This function checks if the process is the only user of a locked swap page.

229 static int exclusive_swap_page(struct page *page)
230 {
231     int retval = 0;
232     struct swap_info_struct * p;
233     swp_entry_t entry;
235     entry.val = page->index;
236     p = swap_info_get(entry);
237     if (p) {
238         /* Is the only swap cache user the cache itself? */
239         if (p->swap_map[SWP_OFFSET(entry)] == 1) {
240             /* Recheck the page count with the pagecache 
                 * lock held.. */
241             spin_lock(&pagecache_lock);
242             if (page_count(page) - !!page->buffers == 2)
243                 retval = 1;
244             spin_unlock(&pagecache_lock);
245         }
246         swap_info_put(p);
247     }
248     return retval;
249 }
231By default, return false
235The swp_entry_t for the page is stored in page->index as explained in Section 2.4
236Get the swap_info_struct with swap_info_get()(See Section K.2.3.1)
237-247If a slot exists, check if we are the exclusive user and return true if we are
239Check if the slot is only being used by the cache itself. If it is, the page count needs to be checked again with the pagecache_lock held
242-243!!page->buffers will evaluate to 1 if buffers are present so this block effectively checks if the process is the only user of the page. If it is, retval is set to 1 so that true will be returned
246Drop the reference to the slot that was taken with swap_info_get() (See Section K.2.3.1)
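
The double negation idiom used at line 242 can be examined in isolation. In this sketch struct toy_page and toy_is_exclusive() are hypothetical, the swap map count is passed in directly and no locks are taken.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cut-down page: a reference count and a buffer pointer */
struct toy_page {
    int count;
    void *buffers;
};

/*
 * Return 1 if the caller is the only user of the page. The swap slot
 * must be referenced only by the swap cache (count of 1) and, once
 * any reference held by buffers is discounted, exactly two references
 * to the page must remain.
 */
static int toy_is_exclusive(const struct toy_page *page,
                            int swap_map_count)
{
    if (swap_map_count != 1)
        return 0;
    /* !!ptr collapses any non-NULL pointer to 1 and NULL to 0 */
    return page->count - !!page->buffers == 2;
}
```

A page with a count of 3 is therefore exclusive only when one of those references belongs to its buffers.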

D.5.6  Copy On Write (COW) Pages

D.5.6.1  Function: do_wp_page

Source: mm/memory.c

The call graph for this function is shown in Figure 4.17. This function handles the case where a user tries to write to a private page shared among processes, such as what happens after fork(). Basically what happens is a page is allocated, the contents copied to the new page and the shared count decremented in the old page.

948 static int do_wp_page(struct mm_struct *mm, 
            struct vm_area_struct * vma,
949         unsigned long address, pte_t *page_table, pte_t pte)
950 {
951     struct page *old_page, *new_page;
953     old_page = pte_page(pte);
954     if (!VALID_PAGE(old_page))
955         goto bad_wp_page;
948-950The parameters are the same as those supplied to handle_pte_fault()
953-955Get a reference to the current page in the PTE and make sure it is valid
957     if (!TryLockPage(old_page)) {
958         int reuse = can_share_swap_page(old_page);
959         unlock_page(old_page);
960         if (reuse) {
961             flush_cache_page(vma, address);
962             establish_pte(vma, address, page_table,
                          pte_mkyoung(pte_mkdirty(pte_mkwrite(pte))));
963             spin_unlock(&mm->page_table_lock);
964             return 1;       /* Minor fault */
965         }
966     }
957First try to lock the page. If 0 is returned, it means the page was previously unlocked
958If we managed to lock it, call can_share_swap_page() (See Section D.5.5.2) to see if we are the exclusive user of the swap slot for this page. If we are, it means that we are the last process to break COW and we can simply use this page rather than allocating a new one
960-965If we are the only users of the swap slot, then it means we are the only user of this page and the last process to break COW so the PTE is simply re-established and we return a minor fault
968     /*
969      * Ok, we need to copy. Oh, well..
970      */
971     page_cache_get(old_page);
972     spin_unlock(&mm->page_table_lock);
974     new_page = alloc_page(GFP_HIGHUSER);
975     if (!new_page)
976         goto no_mem;
977     copy_cow_page(old_page,new_page,address);
971We need to copy this page so first get a reference to the old page so it doesn't disappear before we are finished with it
972Unlock the spinlock as we are about to call alloc_page() (See Section F.2.1) which may sleep
974-976Allocate a page and make sure one was returned
977No prizes for guessing what this function does. If the page being broken is the global zero page, clear_user_highpage() will be used to zero out the contents of the page, otherwise copy_user_highpage() copies the actual contents
982     spin_lock(&mm->page_table_lock);
983     if (pte_same(*page_table, pte)) {
984         if (PageReserved(old_page))
985             ++mm->rss;
986         break_cow(vma, new_page, address, page_table);
987         lru_cache_add(new_page);
989         /* Free the old page.. */
990         new_page = old_page;
991     }
992     spin_unlock(&mm->page_table_lock);
993     page_cache_release(new_page);
994     page_cache_release(old_page);
995     return 1;       /* Minor fault */
982The page table lock was released for alloc_page()(See Section F.2.1) so reacquire it
983Make sure the PTE hasn't changed in the meantime which could have happened if another fault occurred while the spinlock was released
984-985The RSS is only updated if PageReserved() is true which will only happen if the page being faulted is the global ZERO_PAGE which is not accounted for in the RSS. If this was a normal page, the process would be using the same number of physical frames after the fault as it was before but against the zero page, it'll be using a new frame so rss++
986break_cow() is responsible for calling the architecture hooks to ensure the CPU cache and TLBs are up to date and then establish the new page into the PTE. It first calls flush_page_to_ram() which must be called when a struct page is about to be placed in userspace. Next is flush_cache_page() which flushes the page from the CPU cache. Lastly is establish_pte() which establishes the new page into the PTE
987Add the page to the LRU lists
992Release the spinlock
993-994Drop the references to the pages
995Return a minor fault
997 bad_wp_page:
998     spin_unlock(&mm->page_table_lock);
999     printk("do_wp_page: bogus page at address %08lx (page 0x%lx)\n",
                    address,(unsigned long)old_page);
1000     return -1;
1001 no_mem:
1002     page_cache_release(old_page);
1003     return -1;
1004 }
997-1000This is a false COW break which will only happen with a buggy kernel. Print out an informational message and return
1001-1003The page allocation failed so release the reference to the old page and return -1
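
The overall effect of breaking COW can be shown with a toy reference counted page. This is a sketch under a loud assumption: the reuse decision here is a plain reference count test, whereas the kernel makes it with can_share_swap_page(). struct toy_page and toy_break_cow() are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_PAGE_SIZE 64 /* small page to keep the example light */

struct toy_page {
    int refcount;
    unsigned char data[TOY_PAGE_SIZE];
};

/*
 * Handle a write fault against old. Returns the page the writer
 * should now map, or NULL if a copy could not be allocated.
 */
static struct toy_page *toy_break_cow(struct toy_page *old)
{
    struct toy_page *new_page;

    /* Sole user: reuse the page instead of copying it */
    if (old->refcount == 1)
        return old;

    new_page = malloc(sizeof(*new_page));
    if (!new_page)
        return NULL;

    /* Copy the contents and drop the writer's reference to old */
    memcpy(new_page->data, old->data, TOY_PAGE_SIZE);
    new_page->refcount = 1;
    old->refcount--;
    return new_page;
}
```

After the first writer has taken its copy, the remaining user finds a count of 1 and keeps the original page, which corresponds to the reuse path at lines 960-965.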

D.6  Page-Related Disk IO

D.6.1  Generic File Reading

This is more the domain of the IO manager than the VM but because it performs the operations via the page cache, we will cover it briefly. The operation of generic_file_write() is essentially the same although it is not covered by this book. However, if you understand how the read takes place, the write function will pose no problem to you.

D.6.1.1  Function: generic_file_read

Source: mm/filemap.c

This is the generic file read function used by any filesystem that reads pages through the page cache. For normal IO, it is responsible for building a read_descriptor_t for use with do_generic_file_read() and file_read_actor(). For direct IO, this function is basically a wrapper around generic_file_direct_IO().

1695 ssize_t generic_file_read(struct file * filp, 
                               char * buf, size_t count, 
                               loff_t *ppos)
1696 {
1697     ssize_t retval;
1699     if ((ssize_t) count < 0)
1700         return -EINVAL;
1702     if (filp->f_flags & O_DIRECT)
1703         goto o_direct;
1705     retval = -EFAULT;
1706     if (access_ok(VERIFY_WRITE, buf, count)) {
1707         retval = 0;
1709         if (count) {
1710             read_descriptor_t desc;
1712             desc.written = 0;
1713             desc.count = count;
1714             desc.buf = buf;
1715             desc.error = 0;
1716             do_generic_file_read(filp, ppos, &desc, 
                                      file_read_actor);
1718             retval = desc.written;
1719             if (!retval)
1720                 retval = desc.error;
1721         }
1722     }
1723  out:
1724     return retval;

This block is concerned with normal file IO.

1702-1703If this is direct IO, jump to the o_direct label
1706If the access permissions to write to a userspace page are ok, then proceed
1709If count is 0, there is no IO to perform
1712-1715Populate a read_descriptor_t structure which will be used by file_read_actor()(See Section L.3.2.3)
1716Perform the file read
1718Extract the number of bytes written from the read descriptor struct
1719-1720If an error occurred, extract what the error was
1724Return either the number of bytes read or the error that occurred
1726  o_direct:
1727     {
1728         loff_t pos = *ppos, size;
1729         struct address_space *mapping = 
                          filp->f_dentry->d_inode->i_mapping;
1730         struct inode *inode = mapping->host;
1732         retval = 0;
1733         if (!count)
1734             goto out; /* skip atime */
1735         down_read(&inode->i_alloc_sem);
1736         down(&inode->i_sem);
1737         size = inode->i_size;
1738         if (pos < size) {
1739             retval = generic_file_direct_IO(READ, filp, buf, 
                                                 count, pos);
1740             if (retval > 0)
1741                 *ppos = pos + retval;
1742         }
1743         UPDATE_ATIME(filp->f_dentry->d_inode);
1744         goto out;
1745     }
1746 }

This block is concerned with direct IO. It is largely responsible for extracting the parameters required for generic_file_direct_IO().

1729Get the address_space used by this struct file
1733-1734If no IO has been requested, jump to out to avoid updating the inode's access time
1737Get the size of the file
1738-1739If the current position is before the end of the file, the read is safe so call generic_file_direct_IO()
1740-1741If the read was successful, update the current position in the file for the reader
1743Update the access time
1744Goto out which just returns retval

D.6.1.2  Function: do_generic_file_read

Source: mm/filemap.c

This is the core part of the generic file read operation. It is responsible for allocating a page if it doesn't already exist in the page cache. If it does, it must make sure the page is up-to-date and finally, it is responsible for making sure that the appropriate readahead window is set.

1349 void do_generic_file_read(struct file * filp, 
                               loff_t *ppos, 
                               read_descriptor_t * desc, 
                               read_actor_t actor)
1350 {
1351     struct address_space *mapping = 
                      filp->f_dentry->d_inode->i_mapping;
1352     struct inode *inode = mapping->host;
1353     unsigned long index, offset;
1354     struct page *cached_page;
1355     int reada_ok;
1356     int error;
1357     int max_readahead = get_max_readahead(inode);
1359     cached_page = NULL;
1360     index = *ppos >> PAGE_CACHE_SHIFT;
1361     offset = *ppos & ~PAGE_CACHE_MASK;
1357Get the maximum readahead window size for this block device
1360Calculate the page index which holds the current file position pointer
1361Calculate the offset within the page that holds the current file position pointer
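
The index and offset calculation is a shift and a mask. The sketch below assumes 4KiB page cache pages. The TOY_ macros and helper names are hypothetical but mirror the expressions at lines 1360 and 1361.

```c
#include <assert.h>

/* Assumed 4KiB page cache pages */
#define TOY_PAGE_CACHE_SHIFT 12
#define TOY_PAGE_CACHE_SIZE  (1UL << TOY_PAGE_CACHE_SHIFT)
#define TOY_PAGE_CACHE_MASK  (~(TOY_PAGE_CACHE_SIZE - 1))

/* Index of the page cache page holding file position pos */
static unsigned long toy_page_index(unsigned long long pos)
{
    return (unsigned long)(pos >> TOY_PAGE_CACHE_SHIFT);
}

/* Byte offset of file position pos within that page */
static unsigned long toy_page_offset(unsigned long long pos)
{
    return (unsigned long)(pos & ~TOY_PAGE_CACHE_MASK);
}
```

A file position of 10000 falls in page 2 at offset 1808, as 2 * 4096 + 1808 = 10000.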
1363 /*
1364  * If the current position is outside the previous read-ahead
1365  * window, we reset the current read-ahead context and set read
1366  * ahead max to zero (will be set to just needed value later),
1367  * otherwise, we assume that the file accesses are sequential
1368  * enough to continue read-ahead.
1369  */
1370     if (index > filp->f_raend || 
             index + filp->f_rawin < filp->f_raend) {
1371         reada_ok = 0;
1372         filp->f_raend = 0;
1373         filp->f_ralen = 0;
1374         filp->f_ramax = 0;
1375         filp->f_rawin = 0;
1376     } else {
1377         reada_ok = 1;
1378     }
1379 /*
1380  * Adjust the current value of read-ahead max.
1381  * If the read operation stay in the first half page, force no
1382  * readahead. Otherwise try to increase read ahead max just
      * enough to do the read request.
1383  * Then, at least MIN_READAHEAD if read ahead is ok,
1384  * and at most MAX_READAHEAD in all cases.
1385  */
1386     if (!index && offset + desc->count <= (PAGE_CACHE_SIZE >> 1)) {
1387         filp->f_ramax = 0;
1388     } else {
1389         unsigned long needed;
1391         needed = ((offset + desc->count) >> PAGE_CACHE_SHIFT) + 1;
1393         if (filp->f_ramax < needed)
1394             filp->f_ramax = needed;
1396         if (reada_ok && filp->f_ramax < vm_min_readahead)
1397                 filp->f_ramax = vm_min_readahead;
1398         if (filp->f_ramax > max_readahead)
1399             filp->f_ramax = max_readahead;
1400     }
1370-1378As the comment suggests, the readahead window gets reset if the current file position is outside the current readahead window. It gets reset to 0 here and adjusted by generic_file_readahead()(See Section D.6.1.3) as necessary
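1386-1400As the comment states, the readahead maximum is adjusted here. If the read stays within the first half of the first page, readahead is forced off. Otherwise the maximum is raised just enough to cover the request, raised to at least vm_min_readahead if readahead is ok and capped at max_readahead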
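
The adjustment made at lines 1386-1400 can be written as a pure function. This is a sketch only: toy_adjust_ramax() is hypothetical, the minimum and maximum are passed as parameters instead of being read from vm_min_readahead and get_max_readahead(), and 4KiB pages are assumed.

```c
#include <assert.h>

/* Assumed 4KiB page cache pages */
#define TOY_PAGE_CACHE_SHIFT 12
#define TOY_PAGE_CACHE_SIZE  (1UL << TOY_PAGE_CACHE_SHIFT)

/* Return the new readahead maximum, in pages */
static unsigned long toy_adjust_ramax(unsigned long ramax,
                                      unsigned long index,
                                      unsigned long offset,
                                      unsigned long count,
                                      int reada_ok,
                                      unsigned long min_readahead,
                                      unsigned long max_readahead)
{
    unsigned long needed;

    /* A read confined to the first half of page 0 forces no readahead */
    if (!index && offset + count <= (TOY_PAGE_CACHE_SIZE >> 1))
        return 0;

    /* Raise the maximum just enough to cover this request */
    needed = ((offset + count) >> TOY_PAGE_CACHE_SHIFT) + 1;
    if (ramax < needed)
        ramax = needed;

    /* At least the minimum if readahead is ok, never above the maximum */
    if (reada_ok && ramax < min_readahead)
        ramax = min_readahead;
    if (ramax > max_readahead)
        ramax = max_readahead;
    return ramax;
}
```

With a hypothetical minimum of 3 and maximum of 31 pages, a 1MiB request needs 257 pages and is capped at 31.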
1402     for (;;) {
1403         struct page *page, **hash;
1404         unsigned long end_index, nr, ret;
1406         end_index = inode->i_size >> PAGE_CACHE_SHIFT;
1408         if (index > end_index)
1409             break;
1410         nr = PAGE_CACHE_SIZE;
1411         if (index == end_index) {
1412             nr = inode->i_size & ~PAGE_CACHE_MASK;
1413             if (nr <= offset)
1414                 break;
1415         }
1417         nr = nr - offset;
1419         /*
1420          * Try to find the data in the page cache..
1421          */
1422         hash = page_hash(mapping, index);
1424         spin_lock(&pagecache_lock);
1425         page = __find_page_nolock(mapping, index, *hash);
1426         if (!page)
1427             goto no_cached_page;
1402This loop goes through each of the pages necessary to satisfy the read request
1406Calculate where the end of the file is in pages
1408-1409If the current index is beyond the end, then break out as we are trying to read beyond the end of the file
1410-1417Calculate nr to be the number of bytes remaining to be read in the current page. The block takes into account that this might be the last page used by the file and where the current file position is within the page
1422-1425Search for the page in the page cache
1426-1427If the page is not in the page cache, goto no_cached_page where it will be allocated
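
The calculation of nr at lines 1406-1417 is easy to get wrong at the end of the file, so a small model may help. This is a userspace sketch only: bytes_left_in_page() is a hypothetical helper, a 4KiB page size is assumed, and a return value of 0 stands for the two break statements.

```c
#include <assert.h>

#define PAGE_CACHE_SHIFT 12                      /* assumed: 4KiB pages */
#define PAGE_CACHE_SIZE  (1UL << PAGE_CACHE_SHIFT)
#define PAGE_CACHE_MASK  (~(PAGE_CACHE_SIZE - 1))

/* Returns how many bytes can be read from page `index` starting at
 * `offset`, or 0 if that position is at or beyond the end of the file. */
unsigned long bytes_left_in_page(unsigned long long i_size,
                                 unsigned long index, unsigned long offset)
{
    unsigned long end_index = (unsigned long)(i_size >> PAGE_CACHE_SHIFT);
    unsigned long nr = PAGE_CACHE_SIZE;

    assert(offset < PAGE_CACHE_SIZE);
    if (index > end_index)
        return 0;                       /* line 1408: past the last page */
    if (index == end_index) {
        /* Last page: only the tail of the file is valid data */
        nr = (unsigned long)(i_size & ~PAGE_CACHE_MASK);
        if (nr <= offset)
            return 0;                   /* line 1413: at or past EOF */
    }
    return nr - offset;
}
```

For a 10000 byte file, page 2 holds only 1808 valid bytes, so a read positioned at offset 1808 of that page returns nothing.
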
1428 found_page:
1429         page_cache_get(page);
1430         spin_unlock(&pagecache_lock);
1432         if (!Page_Uptodate(page))
1433             goto page_not_up_to_date;
1434         generic_file_readahead(reada_ok, filp, inode, page);

In this block, the page was found in the page cache.

1429Take a reference to the page in the page cache so it does not get freed prematurely
1432-1433If the page is not up-to-date, goto page_not_up_to_date to update the page with information on the disk
1434Perform file readahead with generic_file_readahead()(See Section D.6.1.3)
1435 page_ok:
1436         /* If users can be writing to this page using arbitrary
1437          * virtual addresses, take care about potential aliasing
1438          * before reading the page on the kernel side.
1439          */
1440         if (mapping->i_mmap_shared != NULL)
1441             flush_dcache_page(page);
1443         /*
1444          * Mark the page accessed if we read the
1445          * beginning or we just did an lseek.
1446          */
1447         if (!offset || !filp->f_reada)
1448             mark_page_accessed(page);
1450         /*
1451          * Ok, we have the page, and it's up-to-date, so
1452          * now we can copy it to user space...
1453          *
1454          * The actor routine returns how many bytes were actually used..
1455          * NOTE! This may not be the same as how much of a user buffer
1456          * we filled up (we may be padding etc), so we can only update
1457          * "pos" here (the actor routine has to update the user buffer
1458          * pointers and the remaining count).
1459          */
1460         ret = actor(desc, page, offset, nr);
1461         offset += ret;
1462         index += offset >> PAGE_CACHE_SHIFT;
1463         offset &= ~PAGE_CACHE_MASK;
1465         page_cache_release(page);
1466         if (ret == nr && desc->count)
1467             continue;
1468         break;

In this block, the page is present in the page cache and ready to be read by the file read actor function.

1440-1441As other users could be writing this page, call flush_dcache_page() to make sure the changes are visible
1447-1448If the read starts at the beginning of the page, or the file position was just changed by a seek (f_reada is clear), call mark_page_accessed() (See Section J.2.3.1) to move it to the active_list
1460Call the actor function. In this case, the actor function is file_read_actor() (See Section L.3.2.3) which is responsible for copying the bytes from the page to userspace
1461Update the current offset within the file
1462Move to the next page if necessary
1463Update the offset within the page we are currently reading. Remember that we could have just crossed into the next page in the file
1465Release our reference to this page
1466-1468If there is still data to be read, loop again to read the next page. Otherwise break as the read operation is complete
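
The bookkeeping at lines 1461-1463 splits the file position into a page index and an offset within that page. A minimal userspace sketch follows, assuming 4KiB pages; struct file_pos, advance() and file_position() are hypothetical names, and file_position() recombines the pair in the same way the function does when it finally stores *ppos.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_CACHE_SHIFT 12                      /* assumed: 4KiB pages */
#define PAGE_CACHE_SIZE  (1UL << PAGE_CACHE_SHIFT)
#define PAGE_CACHE_MASK  (~(PAGE_CACHE_SIZE - 1))

/* Hypothetical pair standing in for the index and offset locals. */
struct file_pos {
    unsigned long index;    /* page number within the file */
    unsigned long offset;   /* byte offset within that page */
};

/* Account for `ret` bytes consumed by the actor routine. */
void advance(struct file_pos *pos, unsigned long ret)
{
    assert(pos != NULL);
    pos->offset += ret;
    pos->index  += pos->offset >> PAGE_CACHE_SHIFT;  /* crossed a page? */
    pos->offset &= ~PAGE_CACHE_MASK;                 /* wrap within page */
}

/* Recombine index and offset into a byte position in the file. */
long long file_position(const struct file_pos *pos)
{
    return ((long long)pos->index << PAGE_CACHE_SHIFT) + (long long)pos->offset;
}
```

Reading the last 96 bytes of page 2 leaves the position at the very start of page 3, which is byte 12288 of the file.
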
1470 /*
1471  * Ok, the page was not immediately readable, so let's try to read 
      * ahead while we're at it..
1472  */
1473 page_not_up_to_date:
1474         generic_file_readahead(reada_ok, filp, inode, page);
1476         if (Page_Uptodate(page))
1477             goto page_ok;
1479         /* Get exclusive access to the page ... */
1480         lock_page(page);
1482         /* Did it get unhashed before we got the lock? */
1483         if (!page->mapping) {
1484             UnlockPage(page);
1485             page_cache_release(page);
1486             continue;
1487         }
1489         /* Did somebody else fill it already? */
1490         if (Page_Uptodate(page)) {
1491             UnlockPage(page);
1492             goto page_ok;
1493         }

In this block, the page being read was not up-to-date with information on the disk. generic_file_readahead() is called to update the current page and readahead as IO is required anyway.

1474Call generic_file_readahead()(See Section D.6.1.3) to sync the current page and readahead if necessary
1476-1477If the page is now up-to-date, goto page_ok to start copying the bytes to userspace
1480Otherwise something happened with readahead so lock the page for exclusive access
1483-1487If the page was somehow removed from the page cache while spinlocks were not held, then release the reference to the page and start all over again. The second time around, the page will get allocated and inserted into the page cache all over again
1490-1493If someone updated the page while we did not have a lock on the page then unlock it again and goto page_ok to copy the bytes to userspace
1495 readpage:
1496         /* ... and start the actual read. The read will 
              * unlock the page. */
1497         error = mapping->a_ops->readpage(filp, page);
1499         if (!error) {
1500             if (Page_Uptodate(page))
1501                 goto page_ok;
1503             /* Again, try some read-ahead while waiting for
                  * the page to finish.. */
1504             generic_file_readahead(reada_ok, filp, inode, page);
1505             wait_on_page(page);
1506             if (Page_Uptodate(page))
1507                 goto page_ok;
1508             error = -EIO;
1509         }
1511         /* UHHUH! A synchronous read error occurred. Report it */
1512         desc->error = error;
1513         page_cache_release(page);
1514         break;

In this block, readahead failed so we synchronously read the page with the address_space supplied readpage() function.

1497Call the address_space filesystem-specific readpage() function. In many cases this will ultimately call the function block_read_full_page() declared in fs/buffer.c
1499-1501If no error occurred and the page is now up-to-date, goto page_ok to begin copying the bytes to userspace
1504Otherwise, schedule some readahead to occur as we are forced to wait on IO anyway
1505-1507Wait for IO on the requested page to complete. If it finished successfully, then goto page_ok
1508Otherwise an error occurred so set -EIO to be returned to userspace
1512-1514An IO error occurred so record it and release the reference to the current page. This error will be picked up from the read_descriptor_t struct by generic_file_read() (See Section D.6.1.1)
1516 no_cached_page:
1517         /*
1518          * Ok, it wasn't cached, so we need to create a new
1519          * page..
1520          *
1521          * We get here with the page cache lock held.
1522          */
1523         if (!cached_page) {
1524             spin_unlock(&pagecache_lock);
1525             cached_page = page_cache_alloc(mapping);
1526             if (!cached_page) {
1527                 desc->error = -ENOMEM;
1528                 break;
1529             }
1531             /*
1532              * Somebody may have added the page while we
1533              * dropped the page cache lock. Check for that.
1534              */
1535             spin_lock(&pagecache_lock);
1536             page = __find_page_nolock(mapping, index, *hash);
1537             if (page)
1538                 goto found_page;
1539         }
1541         /*
1542          * Ok, add the new page to the hash-queues...
1543          */
1544         page = cached_page;
1545         __add_to_page_cache(page, mapping, index, hash);
1546         spin_unlock(&pagecache_lock);
1547         lru_cache_add(page);        
1548         cached_page = NULL;
1550         goto readpage;
1551     }

In this block, the page does not exist in the page cache so allocate one and add it.

1523-1539If a cache page has not already been allocated then allocate one and make sure that someone else did not insert one into the page cache while we were sleeping
1524Release pagecache_lock as page_cache_alloc() may sleep
1525-1529Allocate a page and set -ENOMEM to be returned if the allocation failed
1535-1536Acquire pagecache_lock again and search the page cache to make sure another process has not inserted it while the lock was dropped
1537If another process added a suitable page to the cache already, jump to found_page as the one we just allocated is no longer necessary
1544-1545Otherwise, add the page we just allocated to the page cache
1547Add the page to the LRU lists
1548Set cached_page to NULL as it is now in use
1550Goto readpage to schedule the page to be read from disk
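
The pattern in this block (drop the lock, allocate, retake the lock, then look again) can be sketched in userspace. Everything below is a toy model under stated assumptions: struct toy_cache, find_or_create() and rival_insert() are hypothetical, the lock is a plain flag, and the race callback simulates another task inserting a page while the lock is dropped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define NSLOTS 8                       /* hypothetical tiny "page cache" */

struct toy_cache {
    int   locked;                      /* stands in for pagecache_lock */
    void *slot[NSLOTS];                /* stands in for the page hash */
    void (*race)(struct toy_cache *, unsigned long); /* simulated rival */
};

/* Simulated rival task: inserts its own page for `index`. */
void rival_insert(struct toy_cache *c, unsigned long index)
{
    static char rival_page[64];
    c->slot[index] = rival_page;
}

/* Look up or create the entry for `index`. Called with the lock held,
 * returns with it released. *spare carries a preallocated buffer between
 * calls, in the way cached_page does in do_generic_file_read(). */
void *find_or_create(struct toy_cache *c, unsigned long index, void **spare)
{
    void *page;

    assert(c->locked && index < NSLOTS);
    page = c->slot[index];
    if (!page && !*spare) {
        c->locked = 0;                 /* allocation may sleep: drop lock */
        *spare = malloc(64);
        if (!*spare)
            return NULL;               /* caller would report -ENOMEM */
        if (c->race)
            c->race(c, index);         /* another task may run here */
        c->locked = 1;
        page = c->slot[index];         /* re-check after retaking lock */
    }
    if (!page) {
        page = c->slot[index] = *spare;
        *spare = NULL;                 /* the cache now owns the buffer */
    }
    c->locked = 0;
    return page;
}
```

When the rival wins the race, the caller is left holding an unused spare buffer, which corresponds to the cached_page that do_generic_file_read() releases just before returning.
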
1553     *ppos = ((loff_t) index << PAGE_CACHE_SHIFT) + offset;
1554     filp->f_reada = 1;
1555     if (cached_page)
1556         page_cache_release(cached_page);
1557     UPDATE_ATIME(inode);
1558 }
1553Update our position within the file
1555-1556If a page was allocated for addition to the page cache and then found to be unneeded, release it here
1557Update the access time to the file

D.6.1.3  Function: generic_file_readahead

Source: mm/filemap.c

This function performs generic file read-ahead. Readahead is one of the few areas that is very heavily commented upon in the code. It is highly recommended that you read the comments in mm/filemap.c marked with “Read-ahead context”.

1222 static void generic_file_readahead(int reada_ok,
1223     struct file * filp, struct inode * inode,
1224     struct page * page)
1225 {
1226     unsigned long end_index;
1227     unsigned long index = page->index;
1228     unsigned long max_ahead, ahead;
1229     unsigned long raend;
1230     int max_readahead = get_max_readahead(inode);
1232     end_index = inode->i_size >> PAGE_CACHE_SHIFT;
1234     raend = filp->f_raend;
1235     max_ahead = 0;
1227Get the index to start from based on the supplied page
1230Get the maximum sized readahead for this block device
1232Get the index, in pages, of the end of the file
1234Get the end of the readahead window from the struct file
1237 /*
1238  * The current page is locked.
1239  * If the current position is inside the previous read IO request, 
1240  * do not try to reread previously read ahead pages.
1241  * Otherwise decide or not to read ahead some pages synchronously.
1242  * If we are not going to read ahead, set the read ahead context
1243  * for this page only.
1244  */
1245     if (PageLocked(page)) {
1246         if (!filp->f_ralen || 
                 index >= raend || 
                 index + filp->f_rawin < raend) {
1247             raend = index;
1248             if (raend < end_index)
1249                 max_ahead = filp->f_ramax;
1250             filp->f_rawin = 0;
1251             filp->f_ralen = 1;
1252             if (!max_ahead) {
1253                 filp->f_raend  = index + filp->f_ralen;
1254                 filp->f_rawin += filp->f_ralen;
1255             }
1256         }
1257     }

This block has encountered a page that is locked so it must decide whether to temporarily disable readahead.

1245If the current page is locked for IO, then check if the current page is within the last readahead window. If it is, there is no point trying to readahead again. If it is not, or readahead has not been performed previously, update the readahead context
1246The first check is if readahead has been performed previously. The second is to see if the current locked page is after where the previous readahead finished. The third check is if the current locked page lies before the start of the current readahead window. If any of them is true, the readahead context is reset
1247Update the end of the readahead window
1248-1249If the end of the readahead window is not after the end of the file, set max_ahead to be the maximum amount of readahead that should be used with this struct file (filp->f_ramax)
1250-1255Set readahead to only occur with the current page, effectively disabling readahead
1258 /*
1259  * The current page is not locked.
1260  * If we were reading ahead and,
1261  * if the current max read ahead size is not zero and,
1262  * if the current position is inside the last read-ahead IO
1263  * request, it is the moment to try to read ahead asynchronously.
1264  * We will later force unplug device in order to force
      * asynchronous read IO.
1265  */
1266     else if (reada_ok && filp->f_ramax && raend >= 1 &&
1267          index <= raend && index + filp->f_ralen >= raend) {
1268 /*
1269  * Add ONE page to max_ahead in order to try to have about the
1270  * same IO maxsize as synchronous read-ahead 
1271  * Compute the position of the last page we have tried to read
1272  * in order to begin to read ahead just at the next page.
1273  */
1274         raend -= 1;
1275         if (raend < end_index)
1276             max_ahead = filp->f_ramax + 1;
1278         if (max_ahead) {
1279             filp->f_rawin = filp->f_ralen;
1280             filp->f_ralen = 0;
1281             reada_ok      = 2;
1282         }
1283     }

This is one of the rare cases where the in-code commentary makes the code as clear as it possibly could be. Basically, it is saying that if the current page is not locked for IO, then extend the readahead window slightly and remember that readahead is currently going well.

1284 /*
1285  * Try to read ahead pages.
1286  * We hope that ll_rw_blk() plug/unplug, coalescence, requests
1287  * sort and the scheduler, will work enough for us to avoid too 
      * bad actuals IO requests.
1288  */
1289     ahead = 0;
1290     while (ahead < max_ahead) {
1291         ahead ++;
1292         if ((raend + ahead) >= end_index)
1293             break;
1294         if (page_cache_read(filp, raend + ahead) < 0)
1295             break;
1296     }

This block performs the actual readahead by calling page_cache_read() for each of the pages in the readahead window. Note here how ahead is incremented for each page that is readahead.
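
One detail of this loop is that ahead is incremented before either check is made, so it also counts the attempt that caused the loop to stop. A userspace sketch of the loop follows; issue_readahead() and the two callbacks are hypothetical stand-ins, with the callback playing the role of page_cache_read().

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for page_cache_read(): returns < 0 on failure. */
typedef int (*read_fn)(unsigned long index);

/* Mirrors lines 1289-1296 and returns the final value of ahead. */
unsigned long issue_readahead(unsigned long raend, unsigned long max_ahead,
                              unsigned long end_index, read_fn read_page)
{
    unsigned long ahead = 0;

    assert(read_page != NULL);
    while (ahead < max_ahead) {
        ahead++;
        if (raend + ahead >= end_index)
            break;                      /* would read past end of file */
        if (read_page(raend + ahead) < 0)
            break;                      /* the read could not be issued */
    }
    return ahead;
}

/* Demo callbacks for exercising the loop. */
int always_ok(unsigned long index)   { (void)index; return 0; }
int always_fail(unsigned long index) { (void)index; return -1; }
```

With raend at page 3 and room for four pages, pages 4 to 7 are requested. If the file ends at page 5, the loop stops on its second pass and ahead is 2 even though only one page was requested.
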

1297 /*
1298  * If we tried to read ahead some pages,
1299  * If we tried to read ahead asynchronously,
1300  *   Try to force unplug of the device in order to start an
1301  *   asynchronous read IO request.
1302  * Update the read-ahead context.
1303  * Store the length of the current read-ahead window.
1304  * Double the current max read ahead size.
1305  *   That heuristic avoid to do some large IO for files that are
1306  *   not really accessed sequentially.
1307  */
1308     if (ahead) {
1309         filp->f_ralen += ahead;
1310         filp->f_rawin += filp->f_ralen;
1311         filp->f_raend = raend + ahead + 1;
1313         filp->f_ramax += filp->f_ramax;
1315         if (filp->f_ramax > max_readahead)
1316             filp->f_ramax = max_readahead;
1318 #ifdef PROFILE_READAHEAD
1319         profile_readahead((reada_ok == 2), filp);
1320 #endif
1321     }
1323     return;
1324 }

If readahead was successful, then update the readahead fields in the struct file to mark the progress. This is basically growing the readahead context but it can be reset by do_generic_file_read() if it is found that the readahead is ineffective.

1309Update the f_ralen with the number of pages that were readahead in this pass
1310Update the size of the readahead window
1311Mark the end of the readahead
1313Double the current maximum-sized readahead
1315-1316Do not let the maximum sized readahead get larger than the maximum readahead defined for this block device
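
The context update at lines 1308-1316 can be traced with a small model. This is a userspace sketch with hypothetical names (struct ra_state, ra_update()); the fields correspond to f_raend, f_ralen, f_ramax and f_rawin.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the readahead fields of struct file. */
struct ra_state {
    unsigned long raend, ralen, ramax, rawin;
};

/* Mirrors lines 1308-1316: grow the context after `ahead` pages
 * were attempted beyond page `raend`. */
void ra_update(struct ra_state *ra, unsigned long raend,
               unsigned long ahead, unsigned long max_readahead)
{
    assert(ra != NULL);
    if (!ahead)
        return;                         /* nothing was attempted */
    ra->ralen += ahead;                 /* length of this readahead */
    ra->rawin += ra->ralen;             /* total window size */
    ra->raend  = raend + ahead + 1;     /* first page past the window */
    ra->ramax += ra->ramax;             /* double the maximum... */
    if (ra->ramax > max_readahead)
        ra->ramax = max_readahead;      /* ...but respect the cap */
}
```

Starting from a synchronous context of one page with a maximum of 3, reading 3 pages ahead of page 10 gives a window of 4 pages ending at page 14 and doubles the maximum to 6.
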

D.6.2  Generic File mmap()

D.6.2.1  Function: generic_file_mmap

Source: mm/filemap.c

This is the generic mmap() function used by many struct files as their struct file_operations. It is mainly responsible for ensuring the appropriate address_space functions exist and setting what VMA operations to use.

2249 int generic_file_mmap(struct file * file, 
                           struct vm_area_struct * vma)
2250 {
2251     struct address_space *mapping = 
2252     struct inode *inode = mapping->host;
2254     if ((vma->vm_flags & VM_SHARED) && 
             (vma->vm_flags & VM_MAYWRITE)) {
2255         if (!mapping->a_ops->writepage)
2256             return -EINVAL;
2257     }
2258     if (!mapping->a_ops->readpage)
2259         return -ENOEXEC;
2260     UPDATE_ATIME(inode);
2261     vma->vm_ops = &generic_file_vm_ops;
2262     return 0;
2263 }
2251Get the address_space that is managing the file being mapped
2252Get the struct inode for this address_space
2254-2257If the VMA is to be shared and writable, make sure an a_ops->writepage() function exists. Return -EINVAL if it does not
2258-2259Make sure an a_ops->readpage() function exists. Return -ENOEXEC if it does not
2260Update the access time for the inode
2261Use generic_file_vm_ops for the VMA operations. This generic VM operations structure, defined in mm/filemap.c, only supplies filemap_nopage() (See Section D.6.4.1) as its nopage() function. No other callback is defined
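
The order of the two checks decides which error a caller sees, which the following userspace sketch illustrates. struct toy_aops, check_mmap() and dummy_op() are hypothetical, and the flag values are given for illustration only rather than taken from the kernel headers.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define VM_SHARED   0x08UL      /* illustrative values only */
#define VM_MAYWRITE 0x20UL

/* Hypothetical cut-down address_space_operations. */
struct toy_aops {
    int (*readpage)(void);
    int (*writepage)(void);
};

/* Mirrors lines 2254-2259: 0 if the mapping may proceed. */
int check_mmap(const struct toy_aops *a_ops, unsigned long vm_flags)
{
    assert(a_ops != NULL);
    if ((vm_flags & VM_SHARED) && (vm_flags & VM_MAYWRITE)) {
        if (!a_ops->writepage)
            return -EINVAL;     /* cannot write shared pages back */
    }
    if (!a_ops->readpage)
        return -ENOEXEC;        /* cannot fault pages in */
    return 0;
}

/* Demo operation so that a callback can be marked as present. */
int dummy_op(void) { return 0; }
```

A mapping with neither operation reports -EINVAL for a shared writable request because the writepage() check comes first, and -ENOEXEC for any other request.
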

D.6.3  Generic File Truncation

This section covers the path where a file is being truncated. The actual system call truncate() is implemented by sys_truncate() in fs/open.c. By the time the top-level function in the VM is called (vmtruncate()), the dentry information for the file has been updated and the inode's semaphore has been acquired.

D.6.3.1  Function: vmtruncate

Source: mm/memory.c

This is the top-level VM function responsible for truncating a file. When it completes, all page table entries mapping pages that have been truncated have been unmapped and reclaimed if possible.

1042 int vmtruncate(struct inode * inode, loff_t offset)
1043 {
1044     unsigned long pgoff;
1045     struct address_space *mapping = inode->i_mapping;
1046     unsigned long limit;
1048     if (inode->i_size < offset)
1049         goto do_expand;
1050     inode->i_size = offset;
1051     spin_lock(&mapping->i_shared_lock);
1052     if (!mapping->i_mmap && !mapping->i_mmap_shared)
1053         goto out_unlock;
1055     pgoff = (offset + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
1056     if (mapping->i_mmap != NULL)
1057         vmtruncate_list(mapping->i_mmap, pgoff);
1058     if (mapping->i_mmap_shared != NULL)
1059         vmtruncate_list(mapping->i_mmap_shared, pgoff);
1061 out_unlock:
1062     spin_unlock(&mapping->i_shared_lock);
1063     truncate_inode_pages(mapping, offset);
1064     goto out_truncate;
1066 do_expand:
1067     limit = current->rlim[RLIMIT_FSIZE].rlim_cur;
1068     if (limit != RLIM_INFINITY && offset > limit)
1069         goto out_sig;
1070     if (offset > inode->i_sb->s_maxbytes)
1071         goto out;
1072     inode->i_size = offset;
1074 out_truncate:
1075     if (inode->i_op && inode->i_op->truncate) {
1076         lock_kernel();
1077         inode->i_op->truncate(inode);
1078         unlock_kernel();
1079     }
1080     return 0;
1081 out_sig:
1082     send_sig(SIGXFSZ, current, 0);
1083 out:
1084     return -EFBIG;
1085 }
1042The parameters passed are the inode being truncated and the new offset marking the new end of the file. The old length of the file is stored in inode->i_size
1045Get the address_space responsible for the inode
1048-1049If the new file size is larger than the old size, then goto do_expand where the ulimits for the process will be checked before the file is grown
1050Here, the file is being shrunk so update inode->i_size to match
1051Lock the spinlock protecting the two lists of VMAs using this inode
1052-1053If no VMAs are mapping the inode, goto out_unlock where the pages used by the file will be reclaimed by truncate_inode_pages() (See Section D.6.3.6)
1055Calculate pgoff as the offset within the file in pages where the truncation will begin
1056-1057Truncate pages from all private mappings with vmtruncate_list() (See Section D.6.3.2)
1058-1059Truncate pages from all shared mappings
1062Unlock the spinlock protecting the VMA lists
1063Call truncate_inode_pages() (See Section D.6.3.6) to reclaim the pages if they exist in the page cache for the file
1064Goto out_truncate to call the filesystem specific truncate() function so the blocks used on disk will be freed
1066-1071If the file is being expanded, make sure that the process limits for maximum file size are not being exceeded and the hosting filesystem is able to support the new filesize
1072If the limits are fine, then update the inode's size and fall through to call the filesystem-specific truncate function which will fill the expanded filesize with zeros
1075-1079If the filesystem provides a truncate() function, then lock the kernel, call it and unlock the kernel again. Filesystems do not acquire the proper locks to prevent races between file truncation and file expansion due to writing or faulting so the big kernel lock is needed
1080Return success
1082-1084If the file size would grow to being too big, send the SIGXFSZ signal to the calling process and return -EFBIG
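
The do_expand path has two ways of failing with -EFBIG, only one of which raises a signal. A minimal userspace sketch of that decision follows; check_expand() and TOY_RLIM_INFINITY are hypothetical stand-ins, and the flag written through send_sigxfsz stands for the call to send_sig().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_RLIM_INFINITY (~0ULL)   /* hypothetical stand-in for RLIM_INFINITY */

/* Mirrors lines 1067-1071. Returns 0 if growing the file to `offset`
 * is allowed and -EFBIG otherwise. *send_sigxfsz is set only when the
 * process limit, not the filesystem limit, was exceeded. */
int check_expand(unsigned long long offset, unsigned long long rlimit,
                 unsigned long long s_maxbytes, int *send_sigxfsz)
{
    assert(send_sigxfsz != NULL);
    *send_sigxfsz = 0;
    if (rlimit != TOY_RLIM_INFINITY && offset > rlimit) {
        *send_sigxfsz = 1;          /* out_sig: SIGXFSZ, then -EFBIG */
        return -EFBIG;
    }
    if (offset > s_maxbytes)
        return -EFBIG;              /* out: no signal is sent */
    return 0;
}
```

Exceeding the process limit sets the signal flag, while exceeding only the filesystem limit returns the same error silently.
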

D.6.3.2  Function: vmtruncate_list

Source: mm/memory.c

This function cycles through all VMAs in one of an address_space's VMA lists and calls zap_page_range() for the range of addresses which map a file that is being truncated.

1006 static void vmtruncate_list(struct vm_area_struct *mpnt, 
                                 unsigned long pgoff)
1007 {
1008     do {
1009         struct mm_struct *mm = mpnt->vm_mm;
1010         unsigned long start = mpnt->vm_start;
1011         unsigned long end = mpnt->vm_end;
1012         unsigned long len = end - start;
1013         unsigned long diff;
1015         /* mapping wholly truncated? */
1016         if (mpnt->vm_pgoff >= pgoff) {
1017             zap_page_range(mm, start, len);
1018             continue;
1019         }
1021         /* mapping wholly unaffected? */
1022         len = len >> PAGE_SHIFT;
1023         diff = pgoff - mpnt->vm_pgoff;
1024         if (diff >= len)
1025             continue;
1027         /* Ok, partially affected.. */
1028         start += diff << PAGE_SHIFT;
1029         len = (len - diff) << PAGE_SHIFT;
1030         zap_page_range(mm, start, len);
1031     } while ((mpnt = mpnt->vm_next_share) != NULL);
1032 }
1008-1031Loop through all VMAs in the list
1009Get the mm_struct that hosts this VMA
1010-1012Calculate the start, end and length of the VMA
1016-1019If the whole VMA is being truncated, call the function zap_page_range() (See Section D.6.3.3) with the start and length of the full VMA
1022Calculate the length of the VMA in pages
1023-1025Check if the VMA maps any of the region being truncated. If the VMA is unaffected, continue to the next VMA
1028-1029Else the VMA is being partially truncated so calculate the start address and the length in bytes of the region to unmap
1030Call zap_page_range() (See Section D.6.3.3) to unmap the affected region
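
The three cases handled for each VMA can be worked through with concrete numbers. The following userspace sketch assumes 4KiB pages; struct zap_range and vma_zap_range() are hypothetical names, and a length of 0 stands for the unaffected case where the kernel simply continues.

```c
#include <assert.h>

#define PAGE_SHIFT 12                   /* assumed: 4KiB pages */

/* Region that would be passed to zap_page_range(); len == 0 means
 * the VMA is unaffected and nothing is unmapped. */
struct zap_range {
    unsigned long start, len;
};

/* Mirrors lines 1009-1030 for a single VMA. */
struct zap_range vma_zap_range(unsigned long vm_start, unsigned long vm_end,
                               unsigned long vm_pgoff, unsigned long pgoff)
{
    struct zap_range r = { vm_start, vm_end - vm_start };
    unsigned long pages, diff;

    assert(vm_start < vm_end);
    if (vm_pgoff >= pgoff)
        return r;                       /* mapping wholly truncated */
    pages = r.len >> PAGE_SHIFT;
    diff = pgoff - vm_pgoff;            /* pages of the VMA that survive */
    if (diff >= pages) {
        r.len = 0;                      /* mapping wholly unaffected */
        return r;
    }
    r.start += diff << PAGE_SHIFT;      /* partially affected */
    r.len = (pages - diff) << PAGE_SHIFT;
    return r;
}
```

For an 8 page VMA at 0x10000 that maps the file from page 4 onwards, truncating at page 6 keeps the first two pages and unmaps the remaining six starting at 0x12000.
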

D.6.3.3  Function: zap_page_range

Source: mm/memory.c

This function is the top-level pagetable-walk function which unmaps userpages in the specified range from a mm_struct.

360 void zap_page_range(struct mm_struct *mm, 
                        unsigned long address, unsigned long size)
361 {
362     mmu_gather_t *tlb;
363     pgd_t * dir;
364     unsigned long start = address, end = address + size;
365     int freed = 0;
367     dir = pgd_offset(mm, address);
369     /*
370      * This is a long-lived spinlock. That's fine.
371      * There's no contention, because the page table
372      * lock only protects against kswapd anyway, and
373      * even if kswapd happened to be looking at this
374      * process we _want_ it to get stuck.
375      */
376     if (address >= end)
377         BUG();
378     spin_lock(&mm->page_table_lock);
379     flush_cache_range(mm, address, end);
380     tlb = tlb_gather_mmu(mm);
382     do {
383         freed += zap_pmd_range(tlb, dir, address, end - address);
384         address = (address + PGDIR_SIZE) & PGDIR_MASK;
385         dir++;
386     } while (address && (address < end));
388     /* this will flush any remaining tlb entries */
389     tlb_finish_mmu(tlb, start, end);
391     /*
392      * Update rss for the mm_struct (not necessarily current->mm)
393      * Notice that rss is an unsigned long.
394      */
395     if (mm->rss > freed)
396         mm->rss -= freed;
397     else
398         mm->rss = 0;
399     spin_unlock(&mm->page_table_lock);
400 }
364Calculate the start and end address for zapping
367Calculate the PGD (dir) that contains the starting address
376-377Make sure the start address is not after the end address
378Acquire the spinlock protecting the page tables. This is a very long-held lock and would normally be considered a bad idea but the comment above the block explains why it is ok in this case
379Flush the CPU cache for this range
380tlb_gather_mmu() records the MM that is being altered. Later, tlb_remove_page() will be called to unmap the PTE which stores the PTEs in a struct free_pte_ctx until the zapping is finished. This is to avoid having to constantly flush the TLB as PTEs are freed
382-386For each PMD affected by the zapping, call zap_pmd_range() until the end address has been reached. Note that tlb is passed as well for tlb_remove_page() to use later
389tlb_finish_mmu() frees all the PTEs that were unmapped by tlb_remove_page() and then flushes the TLBs. Doing the flushing this way avoids a storm of TLB flushing that would be otherwise required for each PTE unmapped
395-398Update RSS count
399Release the pagetable lock
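
Two small details of this function are the loop condition, which also tests address for zero to catch wraparound at the top of the address space, and the saturating update of the RSS count. Both are modelled below as a userspace sketch. A 32-bit address space with a PGDIR_SHIFT of 22 is assumed, and pgd_entries_visited() and rss_after() are hypothetical helpers.

```c
#include <assert.h>
#include <stdint.h>

#define PGDIR_SHIFT 22                            /* assumed: 4MiB per PGD entry */
#define PGDIR_SIZE  ((uint32_t)1 << PGDIR_SHIFT)
#define PGDIR_MASK  (~(PGDIR_SIZE - 1))

/* Mirrors the loop at lines 382-386: how many PGD entries are walked. */
unsigned pgd_entries_visited(uint32_t address, uint32_t size)
{
    uint32_t end = address + size;
    unsigned visited = 0;

    assert(address < end);              /* the kernel calls BUG() otherwise */
    do {
        visited++;
        /* Step to the start of the next PGD entry; wraps to 0 at the top */
        address = (address + PGDIR_SIZE) & PGDIR_MASK;
    } while (address && (address < end));
    return visited;
}

/* Mirrors lines 395-398: rss is unsigned so it must not underflow. */
unsigned long rss_after(unsigned long rss, unsigned long freed)
{
    return rss > freed ? rss - freed : 0;
}
```

A two page range that straddles a 4MiB boundary visits two PGD entries, and a range in the last PGD entry stops after one pass because the stepped address wraps to 0.
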

D.6.3.4  Function: zap_pmd_range

Source: mm/memory.c

This function is unremarkable. It steps through the PMDs that are affected by the requested range and calls zap_pte_range() for each one.

331 static inline int zap_pmd_range(mmu_gather_t *tlb, pgd_t * dir, 
                                    unsigned long address, 
        unsigned long size)
332 {
333     pmd_t * pmd;
334     unsigned long end;
335     int freed;
337     if (pgd_none(*dir))
338         return 0;
339     if (pgd_bad(*dir)) {
340         pgd_ERROR(*dir);
341         pgd_clear(dir);
342         return 0;
343     }
344     pmd = pmd_offset(dir, address);
345     end = address + size;
346     if (end > ((address + PGDIR_SIZE) & PGDIR_MASK))
347         end = ((address + PGDIR_SIZE) & PGDIR_MASK);
348     freed = 0;
349     do {
350         freed += zap_pte_range(tlb, pmd, address, end - address);
351         address = (address + PMD_SIZE) & PMD_MASK; 
352         pmd++;
353     } while (address < end);
354     return freed;
355 }
337-338If no PGD exists, return
339-343If the PGD is bad, flag the error and return
344Get the starting pmd
345-347Calculate the end address of the zapping. If it is beyond the end of this PGD, then set end to the end of the PGD
349-353Step through all PMDs in this PGD. For each PMD, call zap_pte_range() (See Section D.6.3.5) to unmap the PTEs
354Return how many pages were freed

D.6.3.5  Function: zap_pte_range

Source: mm/memory.c

This function calls tlb_remove_page() for each PTE in the requested pmd within the requested address range.

294 static inline int zap_pte_range(mmu_gather_t *tlb, pmd_t * pmd, 
                                    unsigned long address, 
        unsigned long size)
295 {
296     unsigned long offset;
297     pte_t * ptep;
298     int freed = 0;
300     if (pmd_none(*pmd))
301         return 0;
302     if (pmd_bad(*pmd)) {
303         pmd_ERROR(*pmd);
304         pmd_clear(pmd);
305         return 0;
306     }
307     ptep = pte_offset(pmd, address);
308     offset = address & ~PMD_MASK;
309     if (offset + size > PMD_SIZE)
310         size = PMD_SIZE - offset;
311     size &= PAGE_MASK;
312     for (offset=0; offset < size; ptep++, offset += PAGE_SIZE) {
313         pte_t pte = *ptep;
314         if (pte_none(pte))
315             continue;
316         if (pte_present(pte)) {
317             struct page *page = pte_page(pte);
318             if (VALID_PAGE(page) && !PageReserved(page))
319                 freed ++;
320             /* This will eventually call __free_pte on the pte. */
321             tlb_remove_page(tlb, ptep, address + offset);
322         } else {
323             free_swap_and_cache(pte_to_swp_entry(pte));
324             pte_clear(ptep);
325         }
326     }
328     return freed;
329 }
300-301If the PMD does not exist, return
302-306If the PMD is bad, flag the error and return
307Get the starting PTE offset
308Calculate the offset of the address within the PMD
309If the size of the region to unmap is past the PMD boundary, fix the size so that only this PMD will be affected
311Align size to a page boundary
312-326Step through all PTEs in the region
314-315If no PTE exists, continue to the next one
316-322If the PTE is present, then call tlb_remove_page() to unmap the page. If the page is reclaimable, increment the freed count
322-325If the PTE is in use but the page is paged out or in the swap cache, then free the swap slot and page with free_swap_and_cache() (See Section K.3.2.3). It is possible that a page is reclaimed if it was in the swap cache that is unaccounted for here but it is not of paramount importance
328Return the number of pages that were freed
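
The size adjustments at lines 308-311 determine how many PTEs the loop examines. This is a userspace sketch only, assuming 4KiB pages and a PMD that covers 4MiB; ptes_to_scan() is a hypothetical helper.

```c
#include <assert.h>

#define PAGE_SHIFT 12                   /* assumed: 4KiB pages */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define PAGE_MASK  (~(PAGE_SIZE - 1))
#define PMD_SHIFT  22                   /* assumed: 4MiB per PMD entry */
#define PMD_SIZE   (1UL << PMD_SHIFT)
#define PMD_MASK   (~(PMD_SIZE - 1))

/* Mirrors lines 308-312: number of PTEs the loop will step through. */
unsigned long ptes_to_scan(unsigned long address, unsigned long size)
{
    unsigned long offset = address & ~PMD_MASK;   /* offset within PMD */

    assert(size > 0);
    if (offset + size > PMD_SIZE)
        size = PMD_SIZE - offset;       /* stay inside this PMD */
    size &= PAGE_MASK;                  /* whole pages only */
    return size >> PAGE_SHIFT;
}
```

A three page request that begins on the last page of a PMD is cut down to a single PTE, leaving the caller to move on to the next PMD for the rest.
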

D.6.3.6  Function: truncate_inode_pages

Source: mm/filemap.c

This is the top-level function responsible for truncating all pages from the page cache that occur after lstart in a mapping.

327 void truncate_inode_pages(struct address_space * mapping, 
                              loff_t lstart) 
328 {
329     unsigned long start = (lstart + PAGE_CACHE_SIZE - 1) >> 
330     unsigned partial = lstart & (PAGE_CACHE_SIZE - 1);
331     int unlocked;
333     spin_lock(&pagecache_lock);
334     do {
335         unlocked = truncate_list_pages(&mapping->clean_pages, 
                                           start, &partial);
336         unlocked |= truncate_list_pages(&mapping->dirty_pages, 
                                            start, &partial);
337         unlocked |= truncate_list_pages(&mapping->locked_pages, 
                                            start, &partial);
338     } while (unlocked);
339     /* Traversed all three lists without dropping the lock */
340     spin_unlock(&pagecache_lock);
341 }
329Calculate where to start the truncation as an index in pages
330Calculate partial as an offset within the last page if it is being partially truncated
333Lock the page cache
334This will loop until none of the calls to truncate_list_pages() return that a page was found that should have been reclaimed
335Use truncate_list_pages() (See Section D.6.3.7) to truncate all pages in the clean_pages list
336Similarly, truncate pages in the dirty_pages list
337Similarly, truncate pages in the locked_pages list
340Unlock the page cache
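
The meaning of start and partial is easiest to see with numbers. The following userspace sketch assumes 4KiB pages; struct trunc_point and truncation_point() are hypothetical names.

```c
#include <assert.h>

#define PAGE_CACHE_SHIFT 12                       /* assumed: 4KiB pages */
#define PAGE_CACHE_SIZE  (1LL << PAGE_CACHE_SHIFT)

/* Where truncation begins: pages with index >= start are dropped
 * completely. If partial is non-zero, page start - 1 keeps its first
 * `partial` bytes and the rest of that page is truncated. */
struct trunc_point {
    unsigned long start;
    unsigned partial;
};

/* Mirrors lines 329-330 for a new file length of lstart bytes. */
struct trunc_point truncation_point(long long lstart)
{
    struct trunc_point t;

    assert(lstart >= 0);
    t.start = (unsigned long)((lstart + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT);
    t.partial = (unsigned)(lstart & (PAGE_CACHE_SIZE - 1));
    return t;
}
```

Truncating to 5000 bytes drops pages 2 and above and leaves 904 bytes of page 1, while truncating to exactly 8192 bytes also drops pages 2 and above but has no partial page.
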

D.6.3.7  Function: truncate_list_pages

Source: mm/filemap.c

This function searches the requested list (head) which is part of an address_space. If pages are found after start, they will be truncated.

259 static int truncate_list_pages(struct list_head *head, 
                                   unsigned long start, 
                                   unsigned *partial)
260 {
261     struct list_head *curr;
262     struct page * page;
263     int unlocked = 0;
265  restart:
266     curr = head->prev;
267     while (curr != head) {
268         unsigned long offset;
270         page = list_entry(curr, struct page, list);
271         offset = page->index;
273         /* Is one of the pages to truncate? */
274         if ((offset >= start) || 
                (*partial && (offset + 1) == start)) {
275             int failed;
277             page_cache_get(page);
278             failed = TryLockPage(page);
280             list_del(head);
281             if (!failed)
282                 /* Restart after this page */
283                 list_add_tail(head, curr);
284             else
285                 /* Restart on this page */
286                 list_add(head, curr);
288             spin_unlock(&pagecache_lock);
289             unlocked = 1;
291             if (!failed) {
292                 if (*partial && (offset + 1) == start) {
293                     truncate_partial_page(page, *partial);
294                     *partial = 0;
295                 } else 
296                     truncate_complete_page(page);
298                 UnlockPage(page);
299             } else
300                 wait_on_page(page);
302             page_cache_release(page);
304             if (current->need_resched) {
305                 __set_current_state(TASK_RUNNING);
306                 schedule();
307             }
309             spin_lock(&pagecache_lock);
310             goto restart;
311         }
312         curr = curr->prev;
313     }
314     return unlocked;
315 }
266-267Record the start of the list and loop until the full list has been scanned
270-271Get the page for this entry and what offset within the file it represents
274If the current page is after start or is a page that is to be partially truncated, then truncate this page, else move to the next one
277-278Take a reference to the page and try to lock it
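280Remove the list head from the list. It is re-inserted next to the current page so that the scan, which restarts from head->prev, resumes at the right place
281-283If we locked the page, re-insert the head just before the page so that the page is skipped over when the scan restarts
284-286Else re-insert the head just after the page so that the page is found again immediately. Later in the function, wait_on_page() is called to wait until the page is unlocked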
288Release the pagecache lock
289Set unlocked to 1 to indicate that a page was found that had to be truncated and that the lock was dropped. This will force truncate_inode_pages() to call this function again to make sure there are no pages left behind. This looks like an oversight: the intention appears to have been to call the function again only if a locked page was found, but the way it is implemented means that it will be called again whether the page was locked or not
291-299If we locked the page, then truncate it
292-294If the page is to be partially truncated, call truncate_partial_page() (See Section D.6.3.10) with the offset within the page where the truncation begins (partial)
296Else call truncate_complete_page() (See Section D.6.3.8) to truncate the whole page
298Unlock the page
300If the page locking failed, call wait_on_page() to wait until the page can be locked
302Release the reference to the page. If there are no more mappings for the page, it will be reclaimed
304-307Check if the process should call schedule() before continuing. This is to prevent a truncating process from hogging the CPU
309Reacquire the spinlock and restart the scanning for pages to reclaim
312The current page should not be reclaimed so move to the next page
314Return 1 if a page was found in the list that had to be truncated
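
The list manipulation on lines 280-286 is easy to misread, so the following is a minimal user-space sketch of it. The list helpers are simplified re-implementations written for this example with the same insertion semantics as the kernel's list_add() and list_add_tail(); restart_point() is a hypothetical name, and all locking and page handling is left out.

```c
#include <assert.h>

struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h; h->prev = h; }

static void list_insert(struct list_head *entry,
                        struct list_head *prev, struct list_head *next)
{
    next->prev = entry;
    entry->next = next;
    entry->prev = prev;
    prev->next = entry;
}

/* Insert entry immediately after pos */
static void list_add(struct list_head *entry, struct list_head *pos)
{
    list_insert(entry, pos, pos->next);
}

/* Insert entry immediately before pos */
static void list_add_tail(struct list_head *entry, struct list_head *pos)
{
    list_insert(entry, pos->prev, pos);
}

static void list_del(struct list_head *entry)
{
    entry->prev->next = entry->next;
    entry->next->prev = entry->prev;
}

/* Move the head next to curr and return where a backwards scan
 * (starting at head->prev) would resume */
static struct list_head *restart_point(struct list_head *head,
                                       struct list_head *curr, int locked)
{
    assert(curr != head);
    list_del(head);
    if (locked)
        list_add_tail(head, curr);  /* head before curr: curr is skipped */
    else
        list_add(head, curr);       /* head after curr: curr is seen again */
    return head->prev;
}
```

For a list holding a, b and c in that order, moving the head around b resumes the scan at a when b was locked successfully and at b itself when it was not.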

D.6.3.8  Function: truncate_complete_page

Source: mm/filemap.c

239 static void truncate_complete_page(struct page *page)
240 {
241     /* Leave it on the LRU if it gets converted into 
         * anonymous buffers */
242     if (!page->buffers || do_flushpage(page, 0))
243         lru_cache_del(page);
245     /*
246      * We remove the page from the page cache _after_ we have
247      * destroyed all buffer-cache references to it. Otherwise some
248      * other process might think this inode page is not in the
249      * page cache and creates a buffer-cache alias to it causing
250      * all sorts of fun problems ...  
251      */
252     ClearPageDirty(page);
253     ClearPageUptodate(page);
254     remove_inode_page(page);
255     page_cache_release(page);
256 }
242If the page has buffers, call do_flushpage() (See Section D.6.3.9) to flush all buffers associated with the page. The comments in the following lines describe the problem concisely
243Delete the page from the LRU
252-253Clear the dirty and uptodate flags for the page
254Call remove_inode_page() (See Section J.1.2.1) to delete the page from the page cache
255Drop the reference to the page. The page will be later reclaimed when truncate_list_pages() drops its own private reference to it

D.6.3.9  Function: do_flushpage

Source: mm/filemap.c

This function is responsible for flushing all buffers associated with a page.

223 static int do_flushpage(struct page *page, unsigned long offset)
224 {
225     int (*flushpage) (struct page *, unsigned long);
226     flushpage = page->mapping->a_ops->flushpage;
227     if (flushpage)
228         return (*flushpage)(page, offset);
229     return block_flushpage(page, offset);
230 }
226-228If the page->mapping provides a flushpage() function, call it
229Else call block_flushpage() which is the generic function for flushing buffers associated with a page
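
The pattern here is an optional per-mapping hook with a generic fallback. A minimal sketch of that dispatch, using hypothetical stand-in types rather than the kernel's struct page and address_space_operations, might look like this:

```c
#include <assert.h>
#include <stddef.h>

struct page_sketch;                 /* hypothetical stand-in for struct page */

struct a_ops_sketch {
    /* Optional filesystem-specific hook; NULL selects the generic path */
    int (*flushpage)(struct page_sketch *, unsigned long);
};

struct page_sketch {
    const struct a_ops_sketch *a_ops;
};

/* Stand-in for block_flushpage(); the return value is an arbitrary marker */
static int generic_flushpage(struct page_sketch *page, unsigned long offset)
{
    (void)page;
    (void)offset;
    return 1;
}

/* Stand-in for a filesystem's own hook; also returns an arbitrary marker */
static int fs_flushpage(struct page_sketch *page, unsigned long offset)
{
    (void)page;
    (void)offset;
    return 2;
}

static int do_flushpage_sketch(struct page_sketch *page, unsigned long offset)
{
    int (*flushpage)(struct page_sketch *, unsigned long);

    assert(page != NULL && page->a_ops != NULL);
    flushpage = page->a_ops->flushpage;
    if (flushpage)
        return flushpage(page, offset);
    return generic_flushpage(page, offset);
}
```

A mapping that leaves the hook as NULL ends up in the generic function, while one that sets it has its own function called instead.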

D.6.3.10  Function: truncate_partial_page

Source: mm/filemap.c

This function partially truncates a page by zeroing out the higher bytes no longer in use and flushing any associated buffers.

232 static inline void truncate_partial_page(struct page *page, 
                                             unsigned partial)
233 {
234     memclear_highpage_flush(page, partial, PAGE_CACHE_SIZE-partial);
235     if (page->buffers)
236         do_flushpage(page, partial);
237 }
234memclear_highpage_flush() fills an address range with zeros. In this case, it will zero from partial to the end of the page
235-236If the page has any associated buffers, flush any buffers containing data in the truncated region
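
The zeroing step on line 234 can be pictured with an ordinary buffer. The sketch below assumes a 4KiB page and uses memset() on a plain array; it ignores the high memory mapping and cache flushing that memclear_highpage_flush() also has to deal with, and zero_page_tail() is a name invented for this example.

```c
#include <assert.h>
#include <string.h>

#define PAGE_CACHE_SIZE 4096u   /* assumed page size */

/* Zero everything from byte offset partial to the end of the page */
static void zero_page_tail(unsigned char *page, unsigned partial)
{
    assert(partial <= PAGE_CACHE_SIZE);
    memset(page + partial, 0, PAGE_CACHE_SIZE - partial);
}
```

Bytes below partial are left untouched, which is why only the tail of the last page of a truncated file is cleared.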

D.6.4  Reading Pages for the Page Cache

D.6.4.1  Function: filemap_nopage

Source: mm/filemap.c

This is the generic nopage() function used by many VMAs. It loops around itself with a large number of gotos which can be difficult to trace, but there is nothing novel here. It is principally responsible for fetching the faulting page from either the page cache or reading it from disk. If appropriate, it will also perform file read-ahead.

1994 struct page * filemap_nopage(struct vm_area_struct * area, 
                                  unsigned long address, 
                                  int unused)
1995 {
1996     int error;
1997     struct file *file = area->vm_file;
1998     struct address_space *mapping = 
1999     struct inode *inode = mapping->host;
2000     struct page *page, **hash;
2001     unsigned long size, pgoff, endoff;
2003     pgoff = ((address - area->vm_start) >> PAGE_CACHE_SHIFT) + 
2004     endoff = ((area->vm_end - area->vm_start) >> PAGE_CACHE_SHIFT) + 

This block acquires the struct file, address_space and inode important for this page fault. It then acquires the starting offset within the file needed for this fault and the offset that corresponds to the end of this VMA. The offset is the end of the VMA instead of the end of the page in case file read-ahead is performed.

1997-1999Acquire the struct file, address_space and inode required for this fault
2003Calculate pgoff which is the offset within the file corresponding to the beginning of the fault
2004Calculate the offset within the file corresponding to the end of the VMA
2006 retry_all:
2007     /*
2008      * An external ptracer can access pages that normally aren't
2009      * accessible..
2010      */
2011     size = (inode->i_size + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
2012     if ((pgoff >= size) && (area->vm_mm == current->mm))
2013         return NULL;
2015     /* The "size" of the file, as far as mmap is concerned, isn't 
            bigger than the mapping */
2016     if (size > endoff)
2017         size = endoff;
2019     /*
2020      * Do we have something in the page cache already?
2021      */
2022     hash = page_hash(mapping, pgoff);
2023 retry_find:
2024     page = __find_get_page(mapping, pgoff, hash);
2025     if (!page)
2026         goto no_cached_page;
2028     /*
2029      * Ok, found a page in the page cache, now we need to check
2030      * that it's up-to-date.
2031      */
2032     if (!Page_Uptodate(page))
2033         goto page_not_uptodate;
2011Calculate the size of the file in pages
2012If the faulting pgoff is beyond the end of the file and this is not a tracing process, return NULL
2016-2017If the VMA maps beyond the end of the file, then set the size of the file to be the end of the mapping
2022-2024Search for the page in the page cache
2025-2026If it does not exist, goto no_cached_page where page_cache_read() will be called to read the page from backing storage
2032-2033If the page is not up-to-date, goto page_not_uptodate where the page will either be declared invalid or else the data in the page updated
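
The size checks on lines 2011-2017 are plain arithmetic, sketched below in user space. The sketch assumes 4KiB pages; the function names are invented for illustration, and own_mm stands for the area->vm_mm == current->mm test.

```c
#include <assert.h>

/* Assumed values for x86 with 4KiB pages */
#define PAGE_CACHE_SHIFT 12
#define PAGE_CACHE_SIZE  (1ULL << PAGE_CACHE_SHIFT)

/* Line 2011: file size rounded up to whole pages */
static unsigned long size_in_pages(unsigned long long i_size)
{
    return (unsigned long)((i_size + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT);
}

/* Lines 2012-2013: a fault past the end of the file is rejected unless
 * another process (such as a ptracer) is accessing the mapping */
static int fault_is_rejected(unsigned long pgoff, unsigned long size, int own_mm)
{
    return pgoff >= size && own_mm;
}

/* Lines 2016-2017: never treat the file as larger than the mapping */
static unsigned long clamp_to_mapping(unsigned long size, unsigned long endoff)
{
    return size > endoff ? endoff : size;
}

void worked_example(void)
{
    assert(size_in_pages(4097) == 2);       /* one byte into a second page */
    assert(fault_is_rejected(2, 2, 1));     /* page 2 of a 2 page file */
    assert(!fault_is_rejected(2, 2, 0));    /* same fault from a tracer */
    assert(clamp_to_mapping(10, 4) == 4);   /* VMA covers only 4 pages */
}
```

Note that valid page indices run from 0 to size - 1, which is why the comparison against size uses >= rather than >.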
2035 success:
2036     /*
2037      * Try read-ahead for sequential areas.
2038      */
2039     if (VM_SequentialReadHint(area))
2040         nopage_sequential_readahead(area, pgoff, size);
2042     /*
2043      * Found the page and have a reference on it, need to check sharing
2044      * and possibly copy it over to another page..
2045      */
2046     mark_page_accessed(page);
2047     flush_page_to_ram(page);
2048     return page;
2039-2040If this mapping specified the VM_SEQ_READ hint, then the pages after the current fault will be pre-faulted with nopage_sequential_readahead()
2046Mark the faulted-in page as accessed so it will be moved to the active_list
2047As the page is about to be installed into a process page table, call flush_page_to_ram() so that recent stores by the kernel to the page will definitely be visible to userspace
2048Return the faulted-in page
2050 no_cached_page:
2051     /*
2052      * If the requested offset is within our file, try to read
2053      * a whole cluster of pages at once.
2054      *
2055      * Otherwise, we're off the end of a privately mapped file,
2056      * so we need to map a zero page.
2057      */
2058     if ((pgoff < size) && !VM_RandomReadHint(area))
2059         error = read_cluster_nonblocking(file, pgoff, size);
2060     else
2061         error = page_cache_read(file, pgoff);
2063     /*
2064      * The page we want has now been added to the page cache.
2065      * In the unlikely event that someone removed it in the
2066      * meantime, we'll just come back here and read it again.
2067      */
2068     if (error >= 0)
2069         goto retry_find;
2071     /*
2072      * An error return from page_cache_read can result if the
2073      * system is low on memory, or a problem occurs while trying
2074      * to schedule I/O.
2075      */
2076     if (error == -ENOMEM)
2077         return NOPAGE_OOM;
2078     return NULL;
2058-2059If the end of the file has not been reached and the random-read hint has not been specified, call read_cluster_nonblocking() to pre-fault in just a few pages near the faulting page
2061Else, the file is being accessed randomly, so just call page_cache_read() (See Section D.6.4.2) to read in just the faulting page
2068-2069If no error occurred, goto retry_find at line 2023 which will check to make sure the page is in the page cache before returning
2076-2077If the error was due to being out of memory, return that so the fault handler can act accordingly
2078Else return NULL to indicate that a nonexistent page was faulted, resulting in a SIGBUS signal being sent to the faulting process
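
The decisions made in the no_cached_page block can be summarised as two small functions. This is only a sketch of the control flow on lines 2058-2078: the enum and function names are invented for illustration, random_hint stands for VM_RandomReadHint(area), and no IO is performed.

```c
#include <assert.h>
#include <errno.h>

enum read_strategy { READ_CLUSTER, READ_SINGLE_PAGE };
enum read_outcome  { OUTCOME_RETRY_FIND, OUTCOME_OOM, OUTCOME_SIGBUS };

/* Lines 2058-2061: read a whole cluster only for offsets inside the file
 * on mappings that are not marked as randomly accessed */
static enum read_strategy choose_read(unsigned long pgoff, unsigned long size,
                                      int random_hint)
{
    return (pgoff < size && !random_hint) ? READ_CLUSTER : READ_SINGLE_PAGE;
}

/* Lines 2068-2078: map the read result onto what the fault handler does next */
static enum read_outcome classify_read(int error)
{
    if (error >= 0)
        return OUTCOME_RETRY_FIND;  /* look the page up again */
    if (error == -ENOMEM)
        return OUTCOME_OOM;         /* corresponds to NOPAGE_OOM */
    return OUTCOME_SIGBUS;          /* corresponds to returning NULL */
}

void worked_example(void)
{
    assert(choose_read(3, 10, 0) == READ_CLUSTER);
    assert(choose_read(3, 10, 1) == READ_SINGLE_PAGE);
    assert(classify_read(-ENOMEM) == OUTCOME_OOM);
}
```

Any negative error other than -ENOMEM is treated the same way, so an IO scheduling failure ends in the NULL return described for line 2078.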
2080 page_not_uptodate:
2081     lock_page(page);
2083     /* Did it get unhashed while we waited for it? */
2084     if (!page->mapping) {
2085         UnlockPage(page);
2086         page_cache_release(page);
2087         goto retry_all;
2088     }
2090     /* Did somebody else get it up-to-date? */
2091     if (Page_Uptodate(page)) {
2092         UnlockPage(page);
2093         goto success;
2094     }
2096     if (!mapping->a_ops->readpage(file, page)) {
2097         wait_on_page(page);
2098         if (Page_Uptodate(page))
2099             goto success;
2100     }

In this block, the page was found but it was not up-to-date so the reasons for the page not being up to date are checked. If it looks ok, the appropriate readpage() function is called to resync the page.

2081Lock the page for IO
2084-2088If the page was removed from the mapping (possibly because of a file truncation) and is now anonymous, then goto retry_all which will try to fault in the page again
2090-2094Check the Uptodate flag again in case the page was updated just before we locked the page for IO
2096Call the address_space readpage() function to schedule the data to be read from disk
2097Wait for the IO to complete and if it is now up-to-date, goto success to return the page. If the readpage() function failed, fall through to the error recovery path
2102     /*
2103      * Umm, take care of errors if the page isn't up-to-date.
2104      * Try to re-read it _once_. We do this synchronously,
2105      * because there really aren't any performance issues here
2106      * and we need to check for errors.
2107      */
2108     lock_page(page);
2110     /* Somebody truncated the page on us? */
2111     if (!page->mapping) {
2112         UnlockPage(page);
2113         page_cache_release(page);
2114         goto retry_all;
2115     }
2117     /* Somebody else successfully read it in? */
2118     if (Page_Uptodate(page)) {
2119         UnlockPage(page);
2120         goto success;
2121     }
2122     ClearPageError(page);
2123     if (!mapping->a_ops->readpage(file, page)) {
2124         wait_on_page(page);
2125         if (Page_Uptodate(page))
2126             goto success;
2127     }
2129     /*
2130      * Things didn't work out. Return zero to tell the
2131      * mm layer so, possibly freeing the page cache page first.
2132      */
2133     page_cache_release(page);
2134     return NULL;
2135 }

In this path, the page is not up-to-date due to some IO error. A second attempt is made to read the page data and if it fails, return.

2110-2127This is almost identical to the previous block. The only difference is that ClearPageError() is called to clear the error caused by the previous IO
2133If it still failed, release the reference to the page because it is useless
2134Return NULL because the fault failed

D.6.4.2  Function: page_cache_read

Source: mm/filemap.c

This function adds the page corresponding to the offset within the file to the page cache if it does not exist there already.

702 static int page_cache_read(struct file * file, 
                               unsigned long offset)
703 {
704     struct address_space *mapping = 
705     struct page **hash = page_hash(mapping, offset);
706     struct page *page; 
708     spin_lock(&pagecache_lock);
709     page = __find_page_nolock(mapping, offset, *hash);
710     spin_unlock(&pagecache_lock);
711     if (page)
712         return 0;
714     page = page_cache_alloc(mapping);
715     if (!page)
716         return -ENOMEM;
718     if (!add_to_page_cache_unique(page, mapping, offset, hash)) {
719         int error = mapping->a_ops->readpage(file, page);
720         page_cache_release(page);
721         return error;
722     }
723     /*
724      * We arrive here in the unlikely event that someone 
725      * raced with us and added our page to the cache first.
726      */
727     page_cache_release(page);
728     return 0;
729 }
704Acquire the address_space mapping managing the file
705The page cache is a hash table and page_hash() returns the first page in the bucket for this mapping and offset
708-709Search the page cache with __find_page_nolock() (See Section J.1.4.3). This basically will traverse the list starting at hash to see if the requested page can be found
711-712If the page is already in the page cache, return
714Allocate a new page for insertion into the page cache. page_cache_alloc() will allocate a page from the buddy allocator using GFP mask information contained in mapping
718Insert the page into the page cache with add_to_page_cache_unique() (See Section J.1.1.2). This function is used because a second check needs to be made to make sure the page was not inserted into the page cache while the pagecache_lock spinlock was not acquired
719If the allocated page was inserted into the page cache, it needs to be populated with data so the readpage() function for the mapping is called. This schedules the IO to take place and the page will be unlocked when the IO completes
720The path in add_to_page_cache_unique() (See Section J.1.1.2) takes an extra reference to the page being added to the page cache which is dropped here. The page will not be freed
727If another process added the page to the page cache, it is released here by page_cache_release() as there will be no users of the page
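
The reference counting in this function is the subtle part, so here is a single-threaded sketch of the same sequence: look up, allocate, insert only if still absent, then drop the local reference. The tiny array-backed cache, the struct and every function name are hypothetical; there is no locking and no IO, so the "lost the race" branch is shown but cannot actually be taken in this form.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define CACHE_SLOTS 16              /* hypothetical cache, one slot per offset */

struct cached_page { unsigned long offset; int refcount; };

static struct cached_page *cache[CACHE_SLOTS];

/* Returns nonzero if a page for this offset is already cached */
static int cache_add_unique(struct cached_page *page)
{
    if (cache[page->offset])
        return 1;
    page->refcount++;               /* the cache takes its own reference */
    cache[page->offset] = page;
    return 0;
}

static void page_release(struct cached_page *page)
{
    if (--page->refcount == 0)
        free(page);
}

static int page_cache_read_sketch(unsigned long offset)
{
    struct cached_page *page;

    assert(offset < CACHE_SLOTS);
    if (cache[offset])
        return 0;                   /* already cached, nothing to do */

    page = malloc(sizeof(*page));
    if (!page)
        return -ENOMEM;
    page->offset = offset;
    page->refcount = 1;             /* the allocator's reference */

    if (!cache_add_unique(page)) {
        /* the kernel would start the read with readpage() here */
        page_release(page);         /* cache still holds a reference */
        return 0;
    }
    page_release(page);             /* lost the race: frees the page */
    return 0;
}
```

After a successful insertion the page survives with a single reference owned by the cache, which matches the note for line 720 that the page will not be freed.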

D.6.5  File Readahead for nopage()

D.6.5.1  Function: nopage_sequential_readahead

Source: mm/filemap.c

This function is only called by filemap_nopage() when the VM_SEQ_READ flag has been specified in the VMA. When half of the current readahead-window has been faulted in, the next readahead window is scheduled for IO and pages from the previous window are freed.

1936 static void nopage_sequential_readahead(
         struct vm_area_struct * vma,
1937     unsigned long pgoff, unsigned long filesize)
1938 {
1939     unsigned long ra_window;
1941     ra_window = get_max_readahead(vma->vm_file->f_dentry->d_inode);
1942     ra_window = CLUSTER_OFFSET(ra_window + CLUSTER_PAGES - 1);
1944     /* vm_raend is zero if we haven't read ahead 
          * in this area yet.  */
1945     if (vma->vm_raend == 0)
1946         vma->vm_raend = vma->vm_pgoff + ra_window;
1941get_max_readahead() returns the maximum sized readahead window for the block device the specified inode resides on
1942CLUSTER_PAGES is the number of pages that are paged-in or paged-out in bulk. The macro CLUSTER_OFFSET() will align the readahead window to a cluster boundary
1945-1946If read-ahead has not occurred yet, set the end of the read-ahead window (vm_raend)
1948     /*
1949      * If we've just faulted the page half-way through our window,
1950      * then schedule reads for the next window, and release the
1951      * pages in the previous window.
1952      */
1953     if ((pgoff + (ra_window >> 1)) == vma->vm_raend) {
1954         unsigned long start = vma->vm_pgoff + vma->vm_raend;
1955         unsigned long end = start + ra_window;
1957         if (end > ((vma->vm_end >> PAGE_SHIFT) + vma->vm_pgoff))
1958             end = (vma->vm_end >> PAGE_SHIFT) + vma->vm_pgoff;
1959         if (start > end)
1960             return;
1962         while ((start < end) && (start < filesize)) {
1963             if (read_cluster_nonblocking(vma->vm_file,
1964                             start, filesize) < 0)
1965                 break;
1966             start += CLUSTER_PAGES;
1967         }
1968         run_task_queue(&tq_disk);
1970         /* if we're far enough past the beginning of this area,
1971            recycle pages that are in the previous window. */
1972         if (vma->vm_raend > 
                              (vma->vm_pgoff + ra_window + ra_window)) {
1973             unsigned long window = ra_window << PAGE_SHIFT;
1975             end = vma->vm_start + (vma->vm_raend << PAGE_SHIFT);
1976             end -= window + window;
1977             filemap_sync(vma, end - window, window, MS_INVALIDATE);
1978         }
1980         vma->vm_raend += ra_window;
1981     }
1983     return;
1984 }
1953If the fault has occurred half-way through the read-ahead window then schedule the next readahead window to be read in from disk and free the pages for the first half of the current window as they are presumably not required any more
1954-1955Calculate the start and end of the next readahead window as we are about to schedule it for IO
1957If the end of the readahead window is after the end of the VMA, then set end to the end of the VMA
1959-1960If we are at the end of the mapping, just return as there is no more readahead to perform
1962-1967Schedule the next readahead window to be paged in by calling read_cluster_nonblocking() (See Section D.6.5.2)
1968Call run_task_queue() to start the IO
1972-1978Recycle the pages in the previous read-ahead window with filemap_sync() as they are no longer required
1980Update where the end of the readahead window is
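
To see when the window advances, the sketch below keeps only the bookkeeping from lines 1945-1946, 1953 and 1980 and drops the IO and page recycling. struct ra_state and readahead_step() are names made up for this example; the two fields mirror vm_pgoff and vm_raend.

```c
#include <assert.h>

struct ra_state {
    unsigned long vm_pgoff;     /* file offset of the mapping, in pages */
    unsigned long vm_raend;     /* end of the current window, 0 if unset */
};

/* Returns 1 if a fault at pgoff would schedule the next window */
static int readahead_step(struct ra_state *ra, unsigned long pgoff,
                          unsigned long ra_window)
{
    assert(ra);
    if (ra->vm_raend == 0)
        ra->vm_raend = ra->vm_pgoff + ra_window;

    /* Only a fault exactly half a window before the end triggers */
    if (pgoff + (ra_window >> 1) != ra->vm_raend)
        return 0;

    ra->vm_raend += ra_window;
    return 1;
}
```

With a 32 page window on a mapping starting at file offset 0, the first trigger is the fault on page 16 and the next is on page 48. Because the test on line 1953 is an equality, a process that skips over the half-way page does not trigger the next window at that point.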

D.6.5.2  Function: read_cluster_nonblocking

Source: mm/filemap.c

737 static int read_cluster_nonblocking(struct file * file, 
                                        unsigned long offset,
738     unsigned long filesize)
739 {
740     unsigned long pages = CLUSTER_PAGES;
742     offset = CLUSTER_OFFSET(offset);
743     while ((pages-- > 0) && (offset < filesize)) {
744         int error = page_cache_read(file, offset);
745         if (error < 0)
746             return error;
747         offset ++;
748     }
750     return 0;
751 }
740CLUSTER_PAGES will be 4 pages in low memory systems and 8 pages in larger ones. This means that on an x86 with ample memory, 32KiB will be read in one cluster
742CLUSTER_OFFSET() will align the offset to a cluster-sized alignment
743-748Read the full cluster into the page cache by calling page_cache_read() (See Section D.6.4.2) for each page in the cluster
745-746If an error occurs during read-ahead, return the error
750Return success
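
The effect of the alignment on line 742 is easiest to see by counting which pages would be requested. The sketch below assumes page_cluster is 3 (so CLUSTER_PAGES is 8) and that CLUSTER_OFFSET() rounds down with a pair of shifts, as toff is computed in valid_swaphandles(); cluster_read_count() is an invented helper that counts pages instead of reading them.

```c
#include <assert.h>

#define PAGE_CLUSTER      3u                        /* assumed page_cluster */
#define CLUSTER_PAGES     (1UL << PAGE_CLUSTER)
#define CLUSTER_OFFSET(x) (((x) >> PAGE_CLUSTER) << PAGE_CLUSTER)

/* Returns how many pages a cluster read around offset would request and
 * stores the first page of the cluster in *first */
static unsigned long cluster_read_count(unsigned long offset,
                                        unsigned long filesize,
                                        unsigned long *first)
{
    unsigned long pages = CLUSTER_PAGES, count = 0;

    assert(first);
    offset = CLUSTER_OFFSET(offset);
    *first = offset;
    while ((pages-- > 0) && (offset < filesize)) {
        count++;            /* the kernel calls page_cache_read() here */
        offset++;
    }
    return count;
}
```

A fault on page 13 therefore reads pages 8 to 15, so pages before the faulting one are brought in as well as pages after it, and the read stops early if the file ends inside the cluster.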

D.6.6  Swap Related Read-Ahead

D.6.6.1  Function: swapin_readahead

Source: mm/memory.c

This function will fault in a number of pages after the current entry. It will stop when either CLUSTER_PAGES have been swapped in or an unused swap entry is found.

1093 void swapin_readahead(swp_entry_t entry)
1094 {
1095     int i, num;
1096     struct page *new_page;
1097     unsigned long offset;
1099     /*
1100      * Get the number of handles we should do readahead io to.
1101      */
1102     num = valid_swaphandles(entry, &offset);
1103     for (i = 0; i < num; offset++, i++) {
1104         /* Ok, do the async read-ahead now */
1105         new_page = read_swap_cache_async(SWP_ENTRY(SWP_TYPE(entry), 
1106         if (!new_page)
1107             break;
1108         page_cache_release(new_page);
1109     }
1110     return;
1111 }
1102valid_swaphandles() is what determines how many pages should be swapped in. It will stop at the first empty entry or when CLUSTER_PAGES is reached
1103-1109Swap in the pages
1105Attempt to swap the page into the swap cache with read_swap_cache_async() (See Section K.3.1.1)
1106-1107If the page could not be paged in, break and return
1108Drop the reference to the page that read_swap_cache_async() takes

D.6.6.2  Function: valid_swaphandles

Source: mm/swapfile.c

This function determines how many pages should be readahead from swap starting from offset. It will readahead to the next unused swap slot but at most, it will return CLUSTER_PAGES.

1238 int valid_swaphandles(swp_entry_t entry, unsigned long *offset)
1239 {
1240     int ret = 0, i = 1 << page_cluster;
1241     unsigned long toff;
1242     struct swap_info_struct *swapdev = SWP_TYPE(entry) + swap_info;
1244     if (!page_cluster)      /* no readahead */
1245         return 0;
1246     toff = (SWP_OFFSET(entry) >> page_cluster) << page_cluster;
1247     if (!toff)          /* first page is swap header */
1248         toff++, i--;
1249     *offset = toff;
1251     swap_device_lock(swapdev);
1252     do {
1253         /* Don't read-ahead past the end of the swap area */
1254         if (toff >= swapdev->max)
1255             break;
1256         /* Don't read in free or bad pages */
1257         if (!swapdev->swap_map[toff])
1258             break;
1259         if (swapdev->swap_map[toff] == SWAP_MAP_BAD)
1260             break;
1261         toff++;
1262         ret++;
1263     } while (--i);
1264     swap_device_unlock(swapdev);
1265     return ret;
1266 }
1240i is set to CLUSTER_PAGES which is the equivalent of the bitshift shown here
1242Get the swap_info_struct that contains this entry
1244-1245If readahead has been disabled, return
1246Calculate toff to be entry rounded down to the nearest CLUSTER_PAGES-sized boundary
1247-1248If toff is 0, move it to 1 as the first page contains information about the swap area
1251Lock the swap device as we are about to scan it
1252-1263Loop at most i, which is initialised to CLUSTER_PAGES, times
1254-1255If the end of the swap area is reached, then that is as far as can be readahead
1257-1258If an unused entry is reached, stop scanning as it is as far as we want to readahead
1259-1260Likewise, stop scanning if a bad entry is discovered
1261Move to the next slot
1262Increment the number of pages to be readahead
1264Unlock the swap device
1265Return the number of pages which should be readahead
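
The scan can be tried out in user space by replacing the swap_info_struct with a plain array. In the sketch below swap_map, max and page_cluster are passed in as parameters, the locking is dropped, and the value used for SWAP_MAP_BAD is an assumption; count_readahead_slots() is a hypothetical name for the adapted function.

```c
#include <assert.h>

#define SWAP_MAP_BAD 0x8000     /* assumed marker for an unusable slot */

static int count_readahead_slots(const unsigned short *swap_map,
                                 unsigned long max,
                                 unsigned long entry_offset,
                                 unsigned page_cluster,
                                 unsigned long *offset)
{
    int ret = 0, i = 1 << page_cluster;
    unsigned long toff;

    assert(swap_map && offset);
    if (!page_cluster)                  /* no readahead */
        return 0;

    /* Round down to the start of the cluster containing the entry */
    toff = (entry_offset >> page_cluster) << page_cluster;
    if (!toff) {                        /* slot 0 is the swap header */
        toff++;
        i--;
    }
    *offset = toff;

    do {
        if (toff >= max)                        /* end of the swap area */
            break;
        if (!swap_map[toff])                    /* free slot */
            break;
        if (swap_map[toff] == SWAP_MAP_BAD)     /* bad slot */
            break;
        toff++;
        ret++;
    } while (--i);

    return ret;
}
```

Note that the count starts from the beginning of the cluster, not from the faulting entry, so a free slot early in the cluster can end the scan before the requested entry is even reached.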
