
Appendix B  Describing Physical Memory

B.1  Initialising Zones

B.1.1  Function: setup_memory

Source: arch/i386/kernel/setup.c

The call graph for this function is shown in Figure 2.3. This function gets the necessary information to give to the boot memory allocator to initialise itself. It is broken up into a number of different tasks.

991 static unsigned long __init setup_memory(void)
992 {
993       unsigned long bootmap_size, start_pfn, max_low_pfn;
995       /*
996        * partially used pages are not usable - thus
997        * we are rounding upwards:
998        */
999       start_pfn = PFN_UP(__pa(&_end));
1001      find_max_pfn();
1003      max_low_pfn = find_max_low_pfn();
1005 #ifdef CONFIG_HIGHMEM
1006      highstart_pfn = highend_pfn = max_pfn;
1007      if (max_pfn > max_low_pfn) {
1008            highstart_pfn = max_low_pfn;
1009      }
1010      printk(KERN_NOTICE "%ldMB HIGHMEM available.\n",
1011            pages_to_mb(highend_pfn - highstart_pfn));
1012 #endif
1013      printk(KERN_NOTICE "%ldMB LOWMEM available.\n",
1014                  pages_to_mb(max_low_pfn));
999 PFN_UP() takes a physical address, rounds it up to the next page and returns the page frame number. _end is the address of the end of the loaded kernel image, so start_pfn is now the offset of the first physical page frame that may be used
1001 find_max_pfn() loops through the e820 map searching for the highest available pfn
1003 find_max_low_pfn() finds the highest page frame addressable in ZONE_NORMAL
1005-1011 If high memory is enabled, start with a high memory region of 0. If it turns out there is memory after max_low_pfn, put the start of high memory (highstart_pfn) there and the end of high memory at max_pfn. Print out an informational message on the availability of high memory
1013-1014 Print out an informational message on the amount of low memory
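The rounding performed by PFN_UP() can be sketched in user space. This is a minimal sketch, assuming 4KiB pages (PAGE_SHIFT of 12 as on the i386); pfn_up(), pfn_down() and pfn_phys() are illustrative stand-ins for the kernel macros, not the kernel's definitions.

```c
/* User-space sketch of the PFN conversion macros, assuming
 * PAGE_SHIFT = 12 (4KiB pages) as on the i386. */
#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Round a physical address up to the next page frame number.
 * Partially used pages are not usable, hence the round-up. */
static unsigned long pfn_up(unsigned long paddr)
{
        return (paddr + PAGE_SIZE - 1) >> PAGE_SHIFT;
}

/* Round a physical address down to a page frame number. */
static unsigned long pfn_down(unsigned long paddr)
{
        return paddr >> PAGE_SHIFT;
}

/* Convert a page frame number back to a physical address. */
static unsigned long pfn_phys(unsigned long pfn)
{
        return pfn << PAGE_SHIFT;
}
```

Any address that is not exactly page aligned rounds up to the following frame, which is why the first usable frame after the kernel image is PFN_UP(__pa(&_end)) rather than a plain shift.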
1018      bootmap_size = init_bootmem(start_pfn, max_low_pfn);
1020      register_bootmem_low_pages(max_low_pfn);
1028      reserve_bootmem(HIGH_MEMORY, (PFN_PHYS(start_pfn) +
1029                  bootmap_size + PAGE_SIZE-1) - (HIGH_MEMORY));
1035      reserve_bootmem(0, PAGE_SIZE);
1037 #ifdef CONFIG_SMP
1043       reserve_bootmem(PAGE_SIZE, PAGE_SIZE);
1044 #endif
1045 #ifdef CONFIG_ACPI_SLEEP
1046         /*
1047          * Reserve low memory region for sleep support.
1048          */
1049         acpi_reserve_bootmem();
1050 #endif
1018 init_bootmem() (See Section E.1.1) initialises the bootmem_data struct for the contig_page_data node. It sets where physical memory begins and ends for the node, allocates a bitmap representing the pages and sets all pages as reserved initially
1020 register_bootmem_low_pages() reads the e820 map and calls free_bootmem() (See Section E.3.1) for all usable pages in the running system. This is what marks the pages marked as reserved during initialisation as free
1028-1029 Reserve the pages that are being used to store the bitmap representing the pages
1035 Reserve page 0 as it is often a special page used by the BIOS
1043 Reserve an extra page which is required by the trampoline code. The trampoline code deals with how userspace enters kernel space
1045-1050 If sleep support is added, reserve memory required for it. This is only of interest to laptops interested in suspending and beyond the scope of this book
1051 #ifdef CONFIG_X86_LOCAL_APIC
1052       /*
1053        * Find and reserve possible boot-time SMP configuration:
1054        */
1055       find_smp_config();
1056 #endif
1057 #ifdef CONFIG_BLK_DEV_INITRD
1058       if (LOADER_TYPE && INITRD_START) {
1059             if (INITRD_START + INITRD_SIZE <= 
                    (max_low_pfn << PAGE_SHIFT)) {
1060                   reserve_bootmem(INITRD_START, INITRD_SIZE);
1061                   initrd_start =
1062                    INITRD_START ? INITRD_START + PAGE_OFFSET : 0;
1063                   initrd_end = initrd_start+INITRD_SIZE;
1064             }
1065             else {
1066                   printk(KERN_ERR 
                           "initrd extends beyond end of memory "
1067                       "(0x%08lx > 0x%08lx)\ndisabling initrd\n",
1068                       INITRD_START + INITRD_SIZE,
1069                       max_low_pfn << PAGE_SHIFT);
1070                   initrd_start = 0;
1071             }
1072       }
1073 #endif
1075       return max_low_pfn;
1076 }
1055 This function reserves memory that stores config information about the SMP setup
1057-1073 If initrd is enabled, the memory containing its image will be reserved. initrd provides a tiny filesystem image which is used to boot the system
1075 Return the upper limit of addressable memory in ZONE_NORMAL

B.1.2  Function: zone_sizes_init

Source: arch/i386/mm/init.c

This is the top-level function which is used to initialise each of the zones. The size of the zones in PFNs was discovered during setup_memory() (See Section B.1.1). This function populates an array of zone sizes for passing to free_area_init().

323 static void __init zone_sizes_init(void)
324 {
325     unsigned long zones_size[MAX_NR_ZONES] = {0, 0, 0};
326     unsigned int max_dma, high, low;
328     max_dma = virt_to_phys((char *)MAX_DMA_ADDRESS) >> PAGE_SHIFT;
329     low = max_low_pfn;
330     high = highend_pfn;
332     if (low < max_dma)
333         zones_size[ZONE_DMA] = low;
334     else {
335         zones_size[ZONE_DMA] = max_dma;
336         zones_size[ZONE_NORMAL] = low - max_dma;
337 #ifdef CONFIG_HIGHMEM
338         zones_size[ZONE_HIGHMEM] = high - low;
339 #endif
340     }
341     free_area_init(zones_size);
342 }
325 Initialise the sizes to 0
328 Calculate the PFN for the maximum possible DMA address. This doubles up as the largest number of pages that may exist in ZONE_DMA
329 max_low_pfn is the highest PFN available to ZONE_NORMAL
330 highend_pfn is the highest PFN available to ZONE_HIGHMEM
332-333 If the highest PFN in ZONE_NORMAL is below MAX_DMA_ADDRESS, then just set the size of ZONE_DMA to it. The other zones remain at 0
335 Set the number of pages in ZONE_DMA
336 The size of ZONE_NORMAL is max_low_pfn minus the number of pages in ZONE_DMA
338 The size of ZONE_HIGHMEM is the highest possible PFN minus the highest possible PFN in ZONE_NORMAL (max_low_pfn)
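The zone-sizing logic above condenses into a small user-space sketch. size_zones() is a hypothetical helper, not a kernel function; the arguments are PFNs, with max_dma corresponding to MAX_DMA_ADDRESS, low to max_low_pfn and high to highend_pfn.

```c
/* User-space sketch of the splitting performed by zone_sizes_init().
 * All arguments are page frame numbers. */
#define MAX_NR_ZONES 3
enum { ZONE_DMA = 0, ZONE_NORMAL = 1, ZONE_HIGHMEM = 2 };

static void size_zones(unsigned long zones_size[MAX_NR_ZONES],
                       unsigned long max_dma, unsigned long low,
                       unsigned long high)
{
        zones_size[ZONE_DMA] = 0;
        zones_size[ZONE_NORMAL] = 0;
        zones_size[ZONE_HIGHMEM] = 0;

        if (low < max_dma) {
                /* All of low memory fits below the DMA limit, so
                 * everything goes to ZONE_DMA */
                zones_size[ZONE_DMA] = low;
        } else {
                zones_size[ZONE_DMA] = max_dma;
                zones_size[ZONE_NORMAL] = low - max_dma;
                zones_size[ZONE_HIGHMEM] = high - low;
        }
}
```

On a typical 1GiB i386 box with 4KiB pages, max_dma is 4096 (16MiB), low is 229376 (896MiB) and high is 262144 (1GiB), giving 16MiB of ZONE_DMA, 880MiB of ZONE_NORMAL and 128MiB of ZONE_HIGHMEM.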

B.1.3  Function: free_area_init

Source: mm/page_alloc.c

This is the architecture-independent function for setting up a UMA architecture. It simply calls the core function passing the static contig_page_data as the node. NUMA architectures will use free_area_init_node() instead.

838 void __init free_area_init(unsigned long *zones_size)
839 {
840     free_area_init_core(0, &contig_page_data, &mem_map, zones_size, 
                            0, 0, 0);
841 }
838 The parameters passed to free_area_init_core() are
0 is the Node Identifier for the node, which is 0
contig_page_data is the static global pg_data_t
mem_map is the global mem_map used for tracking struct pages. The function free_area_init_core() will allocate memory for this array
zones_size is the array of zone sizes filled by zone_sizes_init()
0 This zero is the starting physical address
0 The second zero is an array of memory hole sizes which doesn't apply to UMA architectures
0 The last 0 is a pointer to a local mem_map for this node which is used by NUMA architectures

B.1.4  Function: free_area_init_node

Source: mm/numa.c

There are two versions of this function. The first is almost identical to free_area_init() except it uses a different starting physical address. It is for architectures that have only one node (so they use contig_page_data) but whose physical address is not at 0.

This version of the function, called after the pagetable initialisation, is for initialising each pgdat in the system. The caller has the option of allocating their own local portion of the mem_map and passing it in as a parameter if they want to optimise its location for the architecture. If they choose not to, it will be allocated later by free_area_init_core().

 61 void __init free_area_init_node(int nid, 
        pg_data_t *pgdat, struct page *pmap,
 62     unsigned long *zones_size, unsigned long zone_start_paddr, 
 63     unsigned long *zholes_size)
 64 {
 65     int i, size = 0;
 66     struct page *discard;
 68     if (mem_map == (mem_map_t *)NULL)
 69         mem_map = (mem_map_t *)PAGE_OFFSET;
 71     free_area_init_core(nid, pgdat, &discard, zones_size, 
 72                     zholes_size, pmap);
 73     pgdat->node_id = nid;
 75     /*
 76      * Get space for the valid bitmap.
 77      */
 78     for (i = 0; i < MAX_NR_ZONES; i++)
 79         size += zones_size[i];
 80     size = LONG_ALIGN((size + 7) >> 3);
 81     pgdat->valid_addr_bitmap = 
                     (unsigned long *)alloc_bootmem_node(pgdat, size);
 82     memset(pgdat->valid_addr_bitmap, 0, size);
 83 }
61 The parameters to the function are:
nid is the Node Identifier (NID) of the pgdat passed in
pgdat is the node to be initialised
pmap is a pointer to the portion of the mem_map for this node to use, frequently passed as NULL and allocated later
zones_size is an array of zone sizes in this node
zone_start_paddr is the starting physical address for the node
zholes_size is an array of hole sizes in each zone
68-69 If the global mem_map has not been set, set it to the beginning of the kernel portion of the linear address space. Remember that with NUMA, mem_map is a virtual array with portions filled in by local maps used by each node
71 Call free_area_init_core(). Note that discard is passed in as the third parameter as no global mem_map needs to be set for NUMA
73 Record the pgdat's NID
78-79 Calculate the total size of the node
80 Recalculate size as the number of bytes required to hold a bitmap with one bit for every page of the node
81 Allocate a bitmap to represent where valid areas exist in the node. In reality, this is only used by the sparc architecture so it is unfortunate to waste the memory for every other architecture
82 Initially, all areas are invalid. Valid regions are marked later in the mem_init() functions for the sparc. Other architectures just ignore the bitmap
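The bitmap sizing can be checked with a short sketch. long_align() stands in for the kernel's LONG_ALIGN() using the host's word size, and valid_bitmap_bytes() is an illustrative name, not a kernel function.

```c
/* Sketch of the valid_addr_bitmap sizing in free_area_init_node():
 * one bit per page in the node, rounded up to whole bytes and then
 * to a whole long. */
static unsigned long long_align(unsigned long x)
{
        return (x + sizeof(unsigned long) - 1) &
               ~(unsigned long)(sizeof(unsigned long) - 1);
}

static unsigned long valid_bitmap_bytes(const unsigned long *zones_size,
                                        int nr_zones)
{
        unsigned long size = 0;
        int i;

        for (i = 0; i < nr_zones; i++)
                size += zones_size[i];      /* total pages in the node */

        /* (size + 7) >> 3 converts a bit count to bytes, rounding up */
        return long_align((size + 7) >> 3);
}
```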

B.1.5  Function: free_area_init_core

Source: mm/page_alloc.c

This function is responsible for initialising all zones and allocating their local lmem_map within a node. In UMA architectures, this function is called in a way that will initialise the global mem_map array. In NUMA architectures, the array is treated as a virtual array that is sparsely populated.

684 void __init free_area_init_core(int nid, 
        pg_data_t *pgdat, struct page **gmap,
685     unsigned long *zones_size, unsigned long zone_start_paddr, 
686     unsigned long *zholes_size, struct page *lmem_map)
687 {
688     unsigned long i, j;
689     unsigned long map_size;
690     unsigned long totalpages, offset, realtotalpages;
691     const unsigned long zone_required_alignment = 
                                              1UL << (MAX_ORDER-1);
693     if (zone_start_paddr & ~PAGE_MASK)
694         BUG();
696     totalpages = 0;
697     for (i = 0; i < MAX_NR_ZONES; i++) {
698         unsigned long size = zones_size[i];
699         totalpages += size;
700     }
701     realtotalpages = totalpages;
702     if (zholes_size)
703         for (i = 0; i < MAX_NR_ZONES; i++)
704             realtotalpages -= zholes_size[i];
706     printk("On node %d totalpages: %lu\n", nid, realtotalpages);

This block is mainly responsible for calculating the size of each zone.

691 The zone must be aligned against the maximum sized block that can be allocated by the buddy allocator for bitwise operations to work
693-694 It is a bug if the physical address is not page aligned
696 Initialise the totalpages count for this node to 0
697-700 Calculate the total size of the node by iterating through zone_sizes
701-704 Calculate the real amount of memory by subtracting the size of the holes in zholes_size
706 Print an informational message for the user on how much memory is available in this node
708     /*
709      * Some architectures (with lots of mem and discontinous memory
710      * maps) have to search for a good mem_map area:
711      * For discontigmem, the conceptual mem map array starts from 
712      * PAGE_OFFSET, we need to align the actual array onto a mem map 
713      * boundary, so that MAP_NR works.
714      */
715     map_size = (totalpages + 1)*sizeof(struct page);
716     if (lmem_map == (struct page *)0) {
717         lmem_map = (struct page *) alloc_bootmem_node(pgdat, map_size);
718         lmem_map = (struct page *)(PAGE_OFFSET + 
719             MAP_ALIGN((unsigned long)lmem_map - PAGE_OFFSET));
720     }
721     *gmap = pgdat->node_mem_map = lmem_map;
722     pgdat->node_size = totalpages;
723     pgdat->node_start_paddr = zone_start_paddr;
724     pgdat->node_start_mapnr = (lmem_map - mem_map);
725     pgdat->nr_zones = 0;
727     offset = lmem_map - mem_map;    

This block allocates the local lmem_map if necessary and sets the gmap. In UMA architectures, gmap is actually mem_map and so this is where the memory for it is allocated

715 Calculate the amount of memory required for the array. It is the total number of pages multiplied by the size of a struct page
716 If the map has not already been allocated, allocate it
717 Allocate the memory from the boot memory allocator
718 MAP_ALIGN() will align the array on a struct page sized boundary for calculations that locate offsets within the mem_map based on the physical address with the MAP_NR() macro
721 Set the gmap and pgdat->node_mem_map variables to the allocated lmem_map. In UMA architectures, this just sets mem_map
722 Record the size of the node
723 Record the starting physical address
724 Record what offset within mem_map this node occupies
725 Initialise the zone count to 0. This will be set later in the function
727 offset is now the offset within mem_map that the local portion lmem_map begins at
728     for (j = 0; j < MAX_NR_ZONES; j++) {
729         zone_t *zone = pgdat->node_zones + j;
730         unsigned long mask;
731         unsigned long size, realsize;
733         zone_table[nid * MAX_NR_ZONES + j] = zone;
734         realsize = size = zones_size[j];
735         if (zholes_size)
736             realsize -= zholes_size[j];
738         printk("zone(%lu): %lu pages.\n", j, size);
739         zone->size = size;
740         zone->name = zone_names[j];
741         zone->lock = SPIN_LOCK_UNLOCKED;
742         zone->zone_pgdat = pgdat;
743         zone->free_pages = 0;
744         zone->need_balance = 0;
745         if (!size)
746             continue;

This block starts a loop which initialises every zone_t within the node. The initialisation starts by setting the simpler fields whose values already exist.

728 Loop through all zones in the node
733 Record a pointer to this zone in the zone_table. See Section 2.4.1
734-736 Calculate the real size of the zone based on the full size in zones_size minus the size of the holes in zholes_size
738 Print an informational message saying how many pages are in this zone
739 Record the size of the zone
740 zone_names is the string name of the zone for printing purposes
741-744 Initialise some other fields for the zone such as its parent pgdat
745-746 If the zone has no memory, continue to the next zone as nothing further is required
752         zone->wait_table_size = wait_table_size(size);
753         zone->wait_table_shift =
754             BITS_PER_LONG - wait_table_bits(zone->wait_table_size);
755         zone->wait_table = (wait_queue_head_t *)
756             alloc_bootmem_node(pgdat, zone->wait_table_size
757                         * sizeof(wait_queue_head_t));
759         for(i = 0; i < zone->wait_table_size; ++i)
760             init_waitqueue_head(zone->wait_table + i);

Initialise the waitqueue for this zone. Processes waiting on pages in the zone use this hashed table to select a queue to wait on. This means that all processes waiting in a zone will not have to be woken when a page is unlocked, just a smaller subset.

752 wait_table_size() calculates the size of the table to use based on the number of pages in the zone and the desired ratio between the number of queues and the number of pages. The table will never be larger than 4KiB
753-754 Calculate the shift for the hashing algorithm
755 Allocate a table of wait_queue_head_t that can hold zone->wait_table_size entries
759-760 Initialise all of the wait queues
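The idea of a hashed waitqueue table can be sketched in user space. The multiplier and the fold below are illustrative only and do not match the kernel's page_waitqueue() hash; the point is simply that many pages map deterministically onto a small, power-of-two table of queues, so an unlock only wakes the subset of sleepers hashed to the same queue.

```c
/* Illustrative hash from a page address to a waitqueue index.
 * table_size is assumed to be a power of two, as the zone
 * wait_table sizes are. The multiplier is Knuth's golden-ratio
 * constant, chosen here only as a plausible mixing function. */
static unsigned long wait_queue_index(unsigned long page_addr,
                                      unsigned long table_size)
{
        unsigned long hash = page_addr * 2654435761UL;

        /* Discard the low bits (poorly mixed) and mask down to
         * an index within the table */
        return (hash >> 16) & (table_size - 1);
}
```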
762         pgdat->nr_zones = j+1;
764         mask = (realsize / zone_balance_ratio[j]);
765         if (mask < zone_balance_min[j])
766             mask = zone_balance_min[j];
767         else if (mask > zone_balance_max[j])
768             mask = zone_balance_max[j];
769         zone->pages_min = mask;
770         zone->pages_low = mask*2;
771         zone->pages_high = mask*3;
773         zone->zone_mem_map = mem_map + offset;
774         zone->zone_start_mapnr = offset;
775         zone->zone_start_paddr = zone_start_paddr;
777         if ((zone_start_paddr >> PAGE_SHIFT) & 
                        (zone_required_alignment-1))
778             printk("BUG: wrong zone alignment, it will crash\n");

Calculate the watermarks for the zone and record the location of the zone. The watermarks are calculated as ratios of the zone size.

762 First, as a new zone is active, update the number of zones in this node
764 Calculate the mask (which will be used as the pages_min watermark) as the size of the zone divided by the balance ratio for this zone. The balance ratio is 128 for all zones as declared at the top of mm/page_alloc.c
765-766 The zone_balance_min ratios are 20 for all zones so this means that pages_min will never be below 20
767-768 Similarly, the zone_balance_max ratios are all 255 so pages_min will never be over 255
769 pages_min is set to mask
770 pages_low is twice the number of pages as pages_min
771 pages_high is three times the number of pages as pages_min
773 Record where the first struct page for this zone is located within mem_map
774 Record the index within mem_map this zone begins at
775 Record the starting physical address
777-778 Ensure that the zone is correctly aligned for use with the buddy allocator, otherwise the bitwise operations used for the buddy allocator will break
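The watermark arithmetic can be written out directly. The constants 128, 20 and 255 are the default values of zone_balance_ratio, zone_balance_min and zone_balance_max declared in mm/page_alloc.c; calc_watermarks() itself is an illustrative helper, not a kernel function.

```c
/* Sketch of the pages_min/pages_low/pages_high calculation in
 * free_area_init_core(), using the default balance constants. */
static void calc_watermarks(unsigned long realsize,
                            unsigned long *pages_min,
                            unsigned long *pages_low,
                            unsigned long *pages_high)
{
        unsigned long mask = realsize / 128;   /* zone_balance_ratio */

        if (mask < 20)                         /* zone_balance_min */
                mask = 20;
        else if (mask > 255)                   /* zone_balance_max */
                mask = 255;

        *pages_min  = mask;
        *pages_low  = mask * 2;
        *pages_high = mask * 3;
}
```

So a small zone is clamped to pages_min of 20 and any zone over about 128KiB-pages (roughly 127MiB with 4KiB pages) is clamped to 255.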
780         /*
781          * Initially all pages are reserved - free ones are freed
782          * up by free_all_bootmem() once the early boot process is
783          * done. Non-atomic initialization, single-pass.
784          */
785         for (i = 0; i < size; i++) {
786             struct page *page = mem_map + offset + i;
787             set_page_zone(page, nid * MAX_NR_ZONES + j);
788             set_page_count(page, 0);
789             SetPageReserved(page);
790             INIT_LIST_HEAD(&page->list);
791             if (j != ZONE_HIGHMEM)
792                 set_page_address(page, __va(zone_start_paddr));
793             zone_start_paddr += PAGE_SIZE;
794         }
785-794 Initially, all pages in the zone are marked as reserved as there is no way to know which ones are in use by the boot memory allocator. When the boot memory allocator is retired in free_all_bootmem(), the unused pages will have their PG_reserved bit cleared
786 Get the page for this offset
787 The zone the page belongs to is encoded with the page flags. See Section 2.4.1
788 Set the count to 0 as no one is using it
789 Set the reserved flag. Later, the boot memory allocator will clear this bit if the page is no longer in use
790 Initialise the list head for the page
791-792 Set the page->virtual field if it is available and the page is in low memory
793 Increment zone_start_paddr by a page size as this variable will be used to record the beginning of the next zone
796         offset += size;
797         for (i = 0; ; i++) {
798             unsigned long bitmap_size;
800             INIT_LIST_HEAD(&zone->free_area[i].free_list);
801             if (i == MAX_ORDER-1) {
802                 zone->free_area[i].map = NULL;
803                 break;
804             }
829             bitmap_size = (size-1) >> (i+4);
830             bitmap_size = LONG_ALIGN(bitmap_size+1);
831             zone->free_area[i].map = 
832               (unsigned long *) alloc_bootmem_node(pgdat, bitmap_size);
833         }
834     }
835     build_zonelists(pgdat);
836 }

This block initialises the free lists for the zone and allocates the bitmap used by the buddy allocator to record the state of page buddies.

797 This will loop from 0 to MAX_ORDER-1
800 Initialise the linked list for the free_list of the current order i
801-804 If this is the last order, then set the free area map to NULL as this is what marks the end of the free lists
829 Calculate the bitmap_size to be the number of bytes required to hold a bitmap where each bit represents a pair of buddies that are 2^i pages in size
830 Align the size to a long with LONG_ALIGN() as all bitwise operations are on longs
831-832 Allocate the memory for the map
834 This loops back to move to the next zone
835 Build the zone fallback lists for this node with build_zonelists()
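The bitmap sizing for each order can be checked with a small sketch. One bit is needed per pair of 2^i-page buddies: size >> (i+1) gives the number of pairs and a further >> 3 converts bits to bytes, which is where (size-1) >> (i+4) comes from. buddy_bitmap_bytes() is a hypothetical helper that uses the host word size in place of LONG_ALIGN().

```c
/* Sketch of the free_area bitmap sizing loop in
 * free_area_init_core(): for order i, one bit per pair of
 * 2^i-page buddies, rounded up and long-aligned. */
static unsigned long buddy_bitmap_bytes(unsigned long size, int order)
{
        /* >> (order+1) pairs of buddies, >> 3 more for bits-to-bytes */
        unsigned long bitmap_size = (size - 1) >> (order + 4);

        /* LONG_ALIGN(bitmap_size + 1) */
        bitmap_size += 1;
        return (bitmap_size + sizeof(unsigned long) - 1) &
               ~(unsigned long)(sizeof(unsigned long) - 1);
}
```

For a 16MiB zone (4096 pages of 4KiB), order 0 needs 256 bytes of bitmap and order 3 needs 32, halving at each order.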

B.1.6  Function: build_zonelists

Source: mm/page_alloc.c

This builds the list of fallback zones for each zone in the requested node. This is used when an allocation cannot be satisfied and another zone is consulted. When this is finished, allocations from ZONE_HIGHMEM will fall back to ZONE_NORMAL. Allocations from ZONE_NORMAL will fall back to ZONE_DMA, which in turn has nothing to fall back on.

589 static inline void build_zonelists(pg_data_t *pgdat)
590 {
591     int i, j, k;
593     for (i = 0; i <= GFP_ZONEMASK; i++) {
594         zonelist_t *zonelist;
595         zone_t *zone;
597         zonelist = pgdat->node_zonelists + i;
598         memset(zonelist, 0, sizeof(*zonelist));
600         j = 0;
601         k = ZONE_NORMAL;
602         if (i & __GFP_HIGHMEM)
603             k = ZONE_HIGHMEM;
604         if (i & __GFP_DMA)
605             k = ZONE_DMA;
607         switch (k) {
608             default:
609                 BUG();
610             /*
611              * fallthrough:
612              */
613             case ZONE_HIGHMEM:
614                 zone = pgdat->node_zones + ZONE_HIGHMEM;
615                 if (zone->size) {
616 #ifndef CONFIG_HIGHMEM
617                     BUG();
618 #endif
619                     zonelist->zones[j++] = zone;
620                 }
621             case ZONE_NORMAL:
622                 zone = pgdat->node_zones + ZONE_NORMAL;
623                 if (zone->size)
624                     zonelist->zones[j++] = zone;
625             case ZONE_DMA:
626                 zone = pgdat->node_zones + ZONE_DMA;
627                 if (zone->size)
628                     zonelist->zones[j++] = zone;
629         }
630         zonelist->zones[j++] = NULL;
631     } 
632 }
593 This loops through the maximum possible number of zones
597 Get the zonelist for this zone and zero it
600 Start j at 0 which corresponds to ZONE_DMA
601-605 Set k to be the type of zone currently being examined
615-620 If the zone has memory, then ZONE_HIGHMEM is the preferred zone to allocate from for high memory allocations. If ZONE_HIGHMEM has no memory, then ZONE_NORMAL will become the preferred zone when the next case is fallen through to as j is not incremented for an empty zone
621-624 Set the next preferred zone to allocate from to be ZONE_NORMAL. Again, do not use it if the zone has no memory
626-628 Set the final fallback zone to be ZONE_DMA. The check is still made for ZONE_DMA having memory as in a NUMA architecture, not all nodes will have a ZONE_DMA
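The fallthrough through the switch cases is equivalent to walking from the preferred zone type down to ZONE_DMA, appending each zone that has memory. A user-space sketch of that equivalence, with build_fallback() as an illustrative name and -1 standing in for the NULL terminator:

```c
/* Sketch of build_zonelists() fallthrough: zone types index as
 * ZONE_DMA=0, ZONE_NORMAL=1, ZONE_HIGHMEM=2 as in the kernel.
 * zone_pages[k] is the number of pages in zone k. */
static void build_fallback(int preferred, const unsigned long *zone_pages,
                           int *list)
{
        int j = 0, k;

        /* Walk from the preferred zone down to ZONE_DMA, skipping
         * empty zones so they never appear in the fallback list */
        for (k = preferred; k >= 0; k--)
                if (zone_pages[k])
                        list[j++] = k;

        list[j] = -1;   /* terminator, as the kernel's NULL entry */
}
```

With no high memory configured, a request for ZONE_HIGHMEM falls straight through to ZONE_NORMAL, exactly as the commentary for lines 615-620 describes.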

B.2  Page Operations


B.2 Page Operations
B.2.1 Locking Pages
  B.2.1.1 Function: lock_page()
  B.2.1.2 Function: __lock_page()
  B.2.1.3 Function: sync_page()
B.2.2 Unlocking Pages
  B.2.2.1 Function: unlock_page()
B.2.3 Waiting on Pages
  B.2.3.1 Function: wait_on_page()
  B.2.3.2 Function: ___wait_on_page()

B.2.1  Locking Pages

B.2.1.1  Function: lock_page

Source: mm/filemap.c

This function tries to lock a page. If the page cannot be locked, it will cause the process to sleep until the page is available.

921 void lock_page(struct page *page)
922 {
923     if (TryLockPage(page))
924         __lock_page(page);
925 }
923 TryLockPage() is just a wrapper around test_and_set_bit() for the PG_locked bit in page->flags. If the bit was previously clear, the function returns immediately as the page is now locked
924 Otherwise call __lock_page() (See Section B.2.1.2) to put the process to sleep
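The test_and_set_bit() behaviour behind TryLockPage() can be sketched with C11 atomics. PG_LOCKED_BIT and the helper names are illustrative, not the kernel's; the essential property is that exactly one caller observes the bit transition from clear to set and therefore owns the lock.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative lock bit within a flags word; not the kernel's
 * PG_locked value. */
#define PG_LOCKED_BIT 0

/* Atomically set the lock bit and report whether it was previously
 * clear, i.e. whether this caller acquired the lock. */
static bool try_lock_flags(atomic_ulong *flags)
{
        unsigned long old = atomic_fetch_or(flags, 1UL << PG_LOCKED_BIT);

        return !(old & (1UL << PG_LOCKED_BIT));
}

/* Atomically clear the lock bit again */
static void unlock_flags(atomic_ulong *flags)
{
        atomic_fetch_and(flags, ~(1UL << PG_LOCKED_BIT));
}
```

A second caller finding the bit already set is the case where lock_page() falls through to __lock_page() and sleeps.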

B.2.1.2  Function: __lock_page

Source: mm/filemap.c

This is called after a TryLockPage() failed. It will locate the waitqueue for this page and sleep on it until the lock can be acquired.

897 static void __lock_page(struct page *page)
898 {
899     wait_queue_head_t *waitqueue = page_waitqueue(page);
900     struct task_struct *tsk = current;
901     DECLARE_WAITQUEUE(wait, tsk);
903     add_wait_queue_exclusive(waitqueue, &wait);
904     for (;;) {
905         set_task_state(tsk, TASK_UNINTERRUPTIBLE);
906         if (PageLocked(page)) {
907             sync_page(page);
908             schedule();
909         }
910         if (!TryLockPage(page))
911             break;
912     }
913     __set_task_state(tsk, TASK_RUNNING);
914     remove_wait_queue(waitqueue, &wait);
915 }
899 page_waitqueue() is the implementation of the hash algorithm which determines which wait queue this page belongs to in the table zone->wait_table
900-901 Initialise the waitqueue for this task
903 Add this process to the waitqueue returned by page_waitqueue()
904-912 Loop here until the lock is acquired
905 Set the process state as being in uninterruptible sleep. When schedule() is called, the process will be put to sleep and will not wake again until the queue is explicitly woken up
906 If the page is still locked then call the sync_page() function to schedule the page to be synchronised with its backing storage. Call schedule() to sleep until the queue is woken up such as when IO on the page completes
910-911 Try and lock the page again. If we succeed, exit the loop, otherwise sleep on the queue again
913-914 The lock is now acquired so set the process state to TASK_RUNNING and remove it from the wait queue. The function now returns with the lock acquired

B.2.1.3  Function: sync_page

Source: mm/filemap.c

This calls the filesystem-specific sync_page() to synchronise the page with its backing storage.

140 static inline int sync_page(struct page *page)
141 {
142     struct address_space *mapping = page->mapping;
144     if (mapping && mapping->a_ops && mapping->a_ops->sync_page)
145         return mapping->a_ops->sync_page(page);
146     return 0;
147 }
142 Get the address_space for the page if it exists
144-145 If a backing exists, and it has an associated address_space_operations which provides a sync_page() function, then call it

B.2.2  Unlocking Pages

B.2.2.1  Function: unlock_page

Source: mm/filemap.c

This function unlocks a page and wakes up any processes that may be waiting on it.

874 void unlock_page(struct page *page)
875 {
876     wait_queue_head_t *waitqueue = page_waitqueue(page);
877     ClearPageLaunder(page);
878     smp_mb__before_clear_bit();
879     if (!test_and_clear_bit(PG_locked, &(page)->flags))
880         BUG();
881     smp_mb__after_clear_bit(); 
883     /*
884      * Although the default semantics of wake_up() are
885      * to wake all, here the specific function is used
886      * to make it even more explicit that a number of
887      * pages are being waited on here.
888      */
889     if (waitqueue_active(waitqueue))
890         wake_up_all(waitqueue);
891 }
876 page_waitqueue() is the implementation of the hash algorithm which determines which wait queue this page belongs to in the table zone->wait_table
877 Clear the launder bit as IO has now completed on the page
878 This is a memory barrier which must be issued before performing bit operations that may be seen by multiple processors
879-880 Clear the PG_locked bit. It is a BUG() if the bit was already cleared
881 Complete the SMP memory barrier operation
889-890 If there are processes waiting on the page queue for this page, wake them

B.2.3  Waiting on Pages

B.2.3.1  Function: wait_on_page

Source: include/linux/pagemap.h

 94 static inline void wait_on_page(struct page * page)
 95 {
 96     if (PageLocked(page))
 97         ___wait_on_page(page);
 98 }
96-97 If the page is currently locked, then call ___wait_on_page() to sleep until it is unlocked

B.2.3.2  Function: ___wait_on_page

Source: mm/filemap.c

This function is called after PageLocked() has been used to determine the page is locked. The calling process will probably sleep until the page is unlocked.

849 void ___wait_on_page(struct page *page)
850 {
851     wait_queue_head_t *waitqueue = page_waitqueue(page);
852     struct task_struct *tsk = current;
853     DECLARE_WAITQUEUE(wait, tsk);
855     add_wait_queue(waitqueue, &wait);
856     do {
857         set_task_state(tsk, TASK_UNINTERRUPTIBLE);
858         if (!PageLocked(page))
859             break;
860         sync_page(page);
861         schedule();
862     } while (PageLocked(page));
863     __set_task_state(tsk, TASK_RUNNING);
864     remove_wait_queue(waitqueue, &wait);
865 }
851 page_waitqueue() is the implementation of the hash algorithm which determines which wait queue this page belongs to in the table zone->wait_table
852-853 Initialise the waitqueue for the current task
855 Add this task to the waitqueue returned by page_waitqueue()
857 Set the process state to be in uninterruptible sleep. When schedule() is called, the process will sleep
858-859 Check to make sure the page was not unlocked since we last checked
860 Call sync_page() (See Section B.2.1.3) to call the filesystem-specific function to synchronise the page with its backing storage
861 Call schedule() to go to sleep. The process will be woken when the page is unlocked
862 Check if the page is still locked. Remember that multiple pages could be using this wait queue and there could be processes sleeping that wish to lock this page
863-864 The page has been unlocked. Set the process to be in the TASK_RUNNING state and remove the process from the waitqueue
