
Appendix C  Page Table Management

C.1  Page Table Initialisation

C.1.1  Function: paging_init

Source: arch/i386/mm/init.c

This is the top-level function called from setup_arch(). When this function returns, the page tables have been fully set up. Be aware that this is all x86 specific.

351 void __init paging_init(void)
352 {
353     pagetable_init();
355     load_cr3(swapper_pg_dir);       
357 #if CONFIG_X86_PAE
362     if (cpu_has_pae)
363         set_in_cr4(X86_CR4_PAE);
364 #endif
366     __flush_tlb_all();
368 #ifdef CONFIG_HIGHMEM
369     kmap_init();
370 #endif
371     zone_sizes_init();
372 }

353: pagetable_init() is responsible for setting up a static page table using swapper_pg_dir as the PGD
355: Load the initialised swapper_pg_dir into the CR3 register so that the CPU will be able to use it
362-363: If PAE is enabled, set the appropriate bit in the CR4 register
366: Flush all TLBs, including the global kernel ones
369: kmap_init() initialises the region of pagetables reserved for use with kmap()
371: zone_sizes_init() (See Section B.1.2) records the size of each of the zones before calling free_area_init() (See Section B.1.3) to initialise each zone
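
Each CR4 update in the boot path, here for PAE and later in pagetable_init() for PSE and PGE, amounts to OR-ing one feature bit into the register. The following is a minimal user-space sketch of that idea, not the kernel's code: cr4_shadow, set_in_cr4_sim() and enable_paging_features() are hypothetical names, and only the bit values are the architectural CR4 positions.

```c
#include <assert.h>
#include <stdint.h>

/* Architectural CR4 feature bits on x86. */
#define X86_CR4_PSE 0x0010u /* 4MiB page size extension */
#define X86_CR4_PAE 0x0020u /* physical address extension */
#define X86_CR4_PGE 0x0080u /* global pages */

/* Hypothetical stand-in for the real control register. */
uint32_t cr4_shadow;

/* Simulated set_in_cr4(): OR a single feature bit into the register image. */
void set_in_cr4_sim(uint32_t mask)
{
    assert((mask & (mask - 1)) == 0); /* at most one bit per call */
    cr4_shadow |= mask;
}

/* Enable whichever features the CPU reports; returns the CR4 image. */
uint32_t enable_paging_features(int has_pae, int has_pse, int has_pge)
{
    cr4_shadow = 0;
    if (has_pae)
        set_in_cr4_sim(X86_CR4_PAE);
    if (has_pse)
        set_in_cr4_sim(X86_CR4_PSE);
    if (has_pge)
        set_in_cr4_sim(X86_CR4_PGE);
    return cr4_shadow;
}
```

Calling enable_paging_features(1, 1, 1) yields an image with all three bits set, which is the state a PAE kernel on a PSE and PGE capable CPU would be expected to reach by the end of paging_init().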

C.1.2  Function: pagetable_init

Source: arch/i386/mm/init.c

This function is responsible for statically initialising a pagetable starting with a statically defined PGD called swapper_pg_dir. At the very least, a PTE will be available that points to every page frame in ZONE_NORMAL.

205 static void __init pagetable_init (void)
206 {
207     unsigned long vaddr, end;
208     pgd_t *pgd, *pgd_base;
209     int i, j, k;
210     pmd_t *pmd;
211     pte_t *pte, *pte_base;
213     /*
214      * This can be zero as well - no problem, in that case we exit
215      * the loops anyway due to the PTRS_PER_* conditions.
216      */
217     end = (unsigned long)__va(max_low_pfn*PAGE_SIZE);
219     pgd_base = swapper_pg_dir;
220 #if CONFIG_X86_PAE
221     for (i = 0; i < PTRS_PER_PGD; i++)
222         set_pgd(pgd_base + i, __pgd(1 + __pa(empty_zero_page)));
223 #endif
224     i = __pgd_offset(PAGE_OFFSET);
225     pgd = pgd_base + i;

This first block initialises the PGD. It does this by pointing each entry to the global zero page. Entries needed to reference available memory in ZONE_NORMAL will be allocated later.

217: The variable end marks the end of physical memory in ZONE_NORMAL
219: pgd_base is set to the beginning of the statically declared PGD
220-223: If PAE is enabled, it is insufficient to leave each entry as simply 0 (which in effect points each entry to the global zero page) as each pgd_t is a struct. Instead, set_pgd() must be called for each pgd_t to point the entry to the global zero page
224: i is initialised as the offset within the PGD that corresponds to PAGE_OFFSET. In other words, this function will only be initialising the kernel portion of the linear address space; the userspace portion is left alone
225: pgd is initialised to the pgd_t corresponding to the beginning of the kernel portion of the linear address space
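
The arithmetic behind lines 217 and 224 can be sketched in user space. The example assumes the default 3GiB/1GiB split (PAGE_OFFSET of 0xC0000000) and 4KiB pages. The helper names pgd_index(), kernel_pgd_slots() and zone_normal_end() are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_OFFSET 0xC0000000u /* assumed default 3GiB/1GiB split */
#define PAGE_SIZE   4096u

/* Mirrors __pgd_offset(): which PGD slot covers this linear address?
 * shift is 22 with 1024 slots (no PAE) or 30 with 4 slots (PAE). */
unsigned int pgd_index(uint32_t vaddr, int shift, unsigned int ptrs_per_pgd)
{
    return (vaddr >> shift) & (ptrs_per_pgd - 1);
}

/* Number of PGD slots from PAGE_OFFSET to the top of the address space. */
unsigned int kernel_pgd_slots(int shift, unsigned int ptrs_per_pgd)
{
    unsigned int first = pgd_index(PAGE_OFFSET, shift, ptrs_per_pgd);

    assert(first < ptrs_per_pgd);
    return ptrs_per_pgd - first;
}

/* Mirrors line 217: the virtual address just past the last low page. */
uint32_t zone_normal_end(uint32_t max_low_pfn)
{
    return PAGE_OFFSET + max_low_pfn * PAGE_SIZE; /* __va(pfn * PAGE_SIZE) */
}
```

Under these assumptions the kernel portion starts at PGD slot 768 of 1024 without PAE and at slot 3 of 4 with PAE, which is why the loops that follow begin at i rather than at 0.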
227     for (; i < PTRS_PER_PGD; pgd++, i++) {
228         vaddr = i*PGDIR_SIZE;
229         if (end && (vaddr >= end))
230             break;
231 #if CONFIG_X86_PAE
232         pmd = (pmd_t *) alloc_bootmem_low_pages(PAGE_SIZE);
233         set_pgd(pgd, __pgd(__pa(pmd) + 0x1));
234 #else
235         pmd = (pmd_t *)pgd;
236 #endif
237         if (pmd != pmd_offset(pgd, 0))
238             BUG();

This loop begins setting up valid PMD entries for the PGD entries to point to. In the PAE case, pages are allocated with alloc_bootmem_low_pages() and the PGD is set appropriately. Without PAE, there is no middle directory, so it is just “folded” back onto the PGD to preserve the illusion of a 3-level pagetable.

227: i is already initialised to the beginning of the kernel portion of the linear address space so keep looping until the last pgd_t at PTRS_PER_PGD is reached
228: Calculate the virtual address for this PGD
229-230: If the end of ZONE_NORMAL is reached, exit the loop as further page table entries are not needed
231-234: If PAE is enabled, allocate a page for the PMD and install it in the PGD with set_pgd()
235: If PAE is not available, just set pmd to the current pgd_t. This is the “folding back” trick for emulating 3-level pagetables
237-238: Sanity check to make sure the PMD is valid
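
The folding trick can be illustrated with a small sketch. It assumes a two-level layout in which the PMD has a single entry, and it models pgd_t and pmd_t as plain integers, whereas the kernel wraps them in structs. pmd_offset_folded() and set_via_folded_pmd() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Plain integers here for simplicity; the kernel uses wrapper structs. */
typedef uint32_t pgd_t;
typedef uint32_t pmd_t;

/* Two-level paging: the "PMD" has exactly one entry, the PGD slot itself. */
#define PTRS_PER_PMD 1

/* Folded pmd_offset(): ignore the address and hand back the PGD slot. */
pmd_t *pmd_offset_folded(pgd_t *pgd, uint32_t address)
{
    (void)address;
    return (pmd_t *)pgd;
}

/* Write through the folded PMD pointer and read the value back via the PGD. */
uint32_t set_via_folded_pmd(pgd_t *pgd, uint32_t value)
{
    pmd_t *pmd = pmd_offset_folded(pgd, 0);

    assert((void *)pmd == (void *)pgd); /* same check the kernel guards with BUG() */
    *pmd = value;
    return *pgd;
}
```

Because the PMD pointer and the PGD pointer alias the same slot, generic three-level code can call set_pmd() and still end up writing the PGD entry on a two-level machine.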
239         for (j = 0; j < PTRS_PER_PMD; pmd++, j++) {
240             vaddr = i*PGDIR_SIZE + j*PMD_SIZE;
241             if (end && (vaddr >= end))
242                 break;
243             if (cpu_has_pse) {
244                 unsigned long __pe;
246                 set_in_cr4(X86_CR4_PSE);
247                 boot_cpu_data.wp_works_ok = 1;
248                 __pe = _KERNPG_TABLE + _PAGE_PSE + __pa(vaddr);
249                 /* Make it "global" too if supported */
250                 if (cpu_has_pge) {
251                     set_in_cr4(X86_CR4_PGE);
252                     __pe += _PAGE_GLOBAL;
253                 }
254                 set_pmd(pmd, __pmd(__pe));
255                 continue;
256             }
258             pte_base = pte = 
                           (pte_t *) alloc_bootmem_low_pages(PAGE_SIZE);

Initialise each entry in the PMD. This loop will only iterate more than once if PAE is enabled. Remember that without PAE, PTRS_PER_PMD is 1.

240: Calculate the virtual address for this PMD
241-242: If the end of ZONE_NORMAL is reached, finish
243-247: If the CPU supports PSE, then use large TLB entries. This means that for kernel pages, a TLB entry will map 4MiB instead of the normal 4KiB and the third level of PTEs is unnecessary
248: __pe is set as the flags for a kernel pagetable (_KERNPG_TABLE), the flag to indicate that this is an entry mapping 4MiB (_PAGE_PSE) and then set to the physical address for this virtual address with __pa(). Note that this means 4MiB of physical memory is mapped by this one entry, with no page of PTEs behind it
250-253: If the CPU supports PGE, then set it for this page table entry. This marks the entry as being “global” and visible to all processes
254-255: As the third level is not required because of PSE, set the PMD now with set_pmd() and continue to the next PMD
258: Else, PSE is not supported and PTEs are required so allocate a page for them
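
The value assembled in __pe can be worked through by hand. The sketch below assumes the usual i386 flag bit values, a PAGE_OFFSET of 0xC0000000 and 4MiB large pages (the non-PAE case). make_large_pmd() is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_OFFSET    0xC0000000u /* assumed default kernel base */
#define _PAGE_PRESENT  0x001u
#define _PAGE_RW       0x002u
#define _PAGE_ACCESSED 0x020u
#define _PAGE_DIRTY    0x040u
#define _PAGE_PSE      0x080u /* entry maps a 4MiB page directly */
#define _PAGE_GLOBAL   0x100u /* TLB entry survives a CR3 reload */
#define _KERNPG_TABLE  (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | _PAGE_DIRTY)

/* Build the PMD value for one 4MiB kernel mapping, as lines 248-253 do. */
uint32_t make_large_pmd(uint32_t vaddr, int cpu_has_pge)
{
    uint32_t pe;

    assert((vaddr & 0x3FFFFFu) == 0); /* large pages are 4MiB aligned */
    pe = _KERNPG_TABLE + _PAGE_PSE + (vaddr - PAGE_OFFSET); /* __pa(vaddr) */
    if (cpu_has_pge)
        pe += _PAGE_GLOBAL;
    return pe;
}
```

For the second 4MiB of kernel space, 0xC0400000, this gives 0x004001E3 on a PGE capable CPU: physical address 0x00400000 in the upper bits and the flags 0x1E3 in the lower ones.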
260             for (k = 0; k < PTRS_PER_PTE; pte++, k++) {
261                 vaddr = i*PGDIR_SIZE + j*PMD_SIZE + k*PAGE_SIZE;
262                 if (end && (vaddr >= end))
263                     break;
264                 *pte = mk_pte_phys(__pa(vaddr), PAGE_KERNEL);
265             }
266             set_pmd(pmd, __pmd(_KERNPG_TABLE + __pa(pte_base)));
267             if (pte_base != pte_offset(pmd, 0))
268                 BUG();
270         }
271     }

Initialise the PTEs.

260-265: For each pte_t, calculate the virtual address currently being examined and create a PTE that points to the appropriate physical page frame
266: The PTEs have been initialised so set the PMD to point to the page that holds them
267-268: Make sure that the entry was established correctly
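
The inner loop is easy to reproduce outside the kernel. This sketch assumes the non-PAE layout (1024 PTEs per page), a PAGE_OFFSET of 0xC0000000 and a PAGE_KERNEL value of 0x063, that is, without the global bit. fill_pte_page() is a hypothetical helper that folds the i and j terms of the address calculation into a single base argument.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_OFFSET  0xC0000000u
#define PAGE_SIZE    4096u
#define PTRS_PER_PTE 1024u  /* non-PAE: 1024 entries of 4 bytes each */
#define PAGE_KERNEL  0x063u /* present | rw | accessed | dirty (assumed) */

/*
 * Fill one page of PTEs for the region starting at base, stopping early
 * once end is reached (0 means no limit). Returns the number of PTEs set.
 */
unsigned int fill_pte_page(uint32_t *pte, uint32_t base, uint32_t end)
{
    unsigned int k;

    assert((base & (PAGE_SIZE - 1)) == 0); /* base must be page aligned */
    for (k = 0; k < PTRS_PER_PTE; k++) {
        uint32_t vaddr = base + k * PAGE_SIZE;

        if (end && vaddr >= end)
            break;
        pte[k] = (vaddr - PAGE_OFFSET) | PAGE_KERNEL; /* mk_pte_phys() */
    }
    return k;
}
```

When end falls inside the region, only the leading PTEs are written and the remainder of the page keeps whatever the allocator left there, which for the boot memory allocator is expected to be zero.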
273     /*
274      * Fixed mappings, only the page table structure has to be
275      * created - mappings will be set by set_fixmap():
276      */
277     vaddr = __fix_to_virt(__end_of_fixed_addresses - 1) & PMD_MASK;
278     fixrange_init(vaddr, 0, pgd_base);
280 #if CONFIG_HIGHMEM
281     /*
282      * Permanent kmaps:
283      */
284     vaddr = PKMAP_BASE;
285     fixrange_init(vaddr, vaddr + PAGE_SIZE*LAST_PKMAP, pgd_base);
287     pgd = swapper_pg_dir + __pgd_offset(vaddr);
288     pmd = pmd_offset(pgd, vaddr);
289     pte = pte_offset(pmd, vaddr);
290     pkmap_page_table = pte;
291 #endif
293 #if CONFIG_X86_PAE
294     /*
295      * Add low memory identity-mappings - SMP needs it when
296      * starting up on an AP from real-mode. In the non-PAE
297      * case we already have these mappings through head.S.
298      * All user-space mappings are explicitly cleared after
299      * SMP startup.
300      */
301     pgd_base[0] = pgd_base[USER_PTRS_PER_PGD];
302 #endif
303 }

At this point, page table entries have been set up which reference all parts of ZONE_NORMAL. The remaining regions needed are those for fixed mappings and those needed for mapping high memory pages with kmap().

277: The fixed address space is considered to start at FIXADDR_TOP and “finish” earlier in the address space. __fix_to_virt() takes an index as a parameter and returns the index'th page frame backwards (starting from FIXADDR_TOP) within the fixed virtual address space. __end_of_fixed_addresses is one past the last index used by the fixed virtual address space, so __end_of_fixed_addresses - 1 is the last index. In other words, this line computes the PMD-aligned virtual address that covers the beginning of the fixed virtual address space
278: By passing 0 as the “end” to fixrange_init(), the function will start at vaddr and build valid PGDs and PMDs until the end of the virtual address space. The individual PTEs are not filled in here; as the comment at lines 273-276 says, set_fixmap() sets the mappings later
280-291: Set up page tables for use with kmap()
287-290: Get the PTE corresponding to the beginning of the region for use with kmap()
301: This sets up a temporary identity mapping between the virtual address 0 and the physical address 0
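
The calculation on line 277 can be tried with concrete numbers. This sketch assumes a FIXADDR_TOP of 0xFFFFE000 and the non-PAE PMD size of 4MiB; both are assumptions made for the example, not values taken from the listing. fix_to_virt() and fixmap_pmd_start() are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT  12
#define FIXADDR_TOP 0xFFFFE000u /* assumed value for this sketch */
#define PMD_SIZE    (1u << 22)  /* non-PAE: 4MiB per PMD entry */
#define PMD_MASK    (~(PMD_SIZE - 1))

/* Mirrors __fix_to_virt(): slot idx lies idx pages below FIXADDR_TOP. */
uint32_t fix_to_virt(unsigned int idx)
{
    return FIXADDR_TOP - ((uint32_t)idx << PAGE_SHIFT);
}

/* Start address handed to fixrange_init() for a given number of slots. */
uint32_t fixmap_pmd_start(unsigned int end_of_fixed_addresses)
{
    assert(end_of_fixed_addresses > 0);
    return fix_to_virt(end_of_fixed_addresses - 1) & PMD_MASK;
}
```

With 16 slots, for instance, the lowest fixmap page sits at 0xFFFEF000 and rounding down with PMD_MASK gives 0xFFC00000, so a single PMD entry would cover the whole fixed region under these assumptions.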

C.1.3  Function: fixrange_init

Source: arch/i386/mm/init.c

This function creates valid PGDs and PMDs for fixed virtual address mappings.

167 static void __init fixrange_init (unsigned long start, 
                                      unsigned long end, 
                                      pgd_t *pgd_base)
168 {
169     pgd_t *pgd;
170     pmd_t *pmd;
171     pte_t *pte;
172     int i, j;
173     unsigned long vaddr;
175     vaddr = start;
176     i = __pgd_offset(vaddr);
177     j = __pmd_offset(vaddr);
178     pgd = pgd_base + i;
180     for ( ; (i < PTRS_PER_PGD) && (vaddr != end); pgd++, i++) {
181 #if CONFIG_X86_PAE
182         if (pgd_none(*pgd)) {
183             pmd = (pmd_t *) alloc_bootmem_low_pages(PAGE_SIZE);
184             set_pgd(pgd, __pgd(__pa(pmd) + 0x1));
185             if (pmd != pmd_offset(pgd, 0))
186                 printk("PAE BUG #02!\n");
187         }
188         pmd = pmd_offset(pgd, vaddr);
189 #else
190         pmd = (pmd_t *)pgd;
191 #endif
192         for (; (j < PTRS_PER_PMD) && (vaddr != end); pmd++, j++) {
193             if (pmd_none(*pmd)) {
194                 pte = (pte_t *) alloc_bootmem_low_pages(PAGE_SIZE);
195                 set_pmd(pmd, __pmd(_KERNPG_TABLE + __pa(pte)));
196                 if (pte != pte_offset(pmd, 0))
197                     BUG();
198             }
199             vaddr += PMD_SIZE;
200         }
201         j = 0;
202     }
203 }

175: Set the starting virtual address (vaddr) to the requested starting address provided as the parameter
176: Get the index within the PGD corresponding to vaddr
177: Get the index within the PMD corresponding to vaddr
178: Get the starting pgd_t
180: Keep cycling until end is reached. When pagetable_init() passes in 0, this loop will continue until the end of the PGD
182-187: In the case of PAE, allocate a page for the PMD if one has not already been allocated
190: Without PAE, there is no PMD so treat the pgd_t as the pmd_t
192-200: For each entry in the PMD, allocate a page for the pte_t entries if one is not already present and set it within the pagetables. Note that vaddr is incremented in PMD-sized strides
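
The loop termination is worth tracing, since passing 0 as end relies on either the PGD index running out or the 32-bit address wrapping around to 0. The sketch below simulates only the non-PAE case, where the folded PMD loop runs once per PGD slot. fixrange_slots() is a hypothetical helper, and the sketch assumes PMD-aligned bounds because the loop compares with != rather than <.

```c
#include <assert.h>
#include <stdint.h>

#define PGDIR_SHIFT  22
#define PTRS_PER_PGD 1024u
#define PMD_SIZE     (1u << 22) /* non-PAE: one PMD entry per PGD slot */

/*
 * Count the PGD slots visited by the outer loop for [start, end).
 * With a folded PMD this is also the number of PTE pages that
 * fixrange_init() would make sure exist.
 */
unsigned int fixrange_slots(uint32_t start, uint32_t end)
{
    uint32_t vaddr = start;
    unsigned int i = vaddr >> PGDIR_SHIFT;
    unsigned int visited = 0;

    assert((start & (PMD_SIZE - 1)) == 0); /* sketch assumes aligned start */
    for (; i < PTRS_PER_PGD && vaddr != end; i++) {
        visited++;
        vaddr += PMD_SIZE; /* wraps to 0 after the last slot */
    }
    return visited;
}
```

Starting from the last 4MiB of the address space with end set to 0 visits exactly one slot, after which both exit conditions hold at once.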

C.1.4  Function: kmap_init

Source: arch/i386/mm/init.c

This function only exists if CONFIG_HIGHMEM is set during compile time. It is responsible for caching where the beginning of the kmap region is, the PTE referencing it and the protection for the page tables. This means the PGD will not have to be checked every time kmap() is used.

 74 #if CONFIG_HIGHMEM
 75 pte_t *kmap_pte;
 76 pgprot_t kmap_prot;
 78 #define kmap_get_fixmap_pte(vaddr)                      \
 79     pte_offset(pmd_offset(pgd_offset_k(vaddr), (vaddr)), (vaddr))
 81 void __init kmap_init(void)
 82 {
 83     unsigned long kmap_vstart;
 85     /* cache the first kmap pte */
 86     kmap_vstart = __fix_to_virt(FIX_KMAP_BEGIN);
 87     kmap_pte = kmap_get_fixmap_pte(kmap_vstart);
 89     kmap_prot = PAGE_KERNEL;
 90 }
 91 #endif /* CONFIG_HIGHMEM */

78-79: As fixrange_init() has already set up valid PGDs and PMDs, there is no need to double check them so kmap_get_fixmap_pte() is responsible for quickly traversing the page table
86: Cache the virtual address for the kmap region in kmap_vstart
87: Cache the PTE for the start of the kmap region in kmap_pte
89: Cache the protection for the page table entries with kmap_prot
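
The benefit of caching the first PTE is that later users can index from it rather than walk the page tables. The sketch below is a toy model of that pattern, not the kernel's code. It assumes that slot idx is reached by subtracting idx from the cached pointer, on the grounds that fixmap addresses grow downwards. NR_SLOTS, pte_page, kmap_init_sim() and kmap_slot_set() are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define NR_SLOTS 8u /* hypothetical number of kmap fixmap slots */

/* One toy page of PTEs; the kmap slots occupy its top NR_SLOTS entries. */
uint32_t pte_page[1024];

/* Cached once at boot, in the spirit of kmap_pte and kmap_prot. */
uint32_t *kmap_pte_cached;
uint32_t kmap_prot_cached;

void kmap_init_sim(void)
{
    kmap_pte_cached = &pte_page[1023]; /* first slot is highest in memory */
    kmap_prot_cached = 0x063u;         /* PAGE_KERNEL-style protection bits */
}

/* Install a mapping for slot idx without walking the PGD or PMD again. */
uint32_t *kmap_slot_set(unsigned int idx, uint32_t phys)
{
    uint32_t *pte;

    assert(kmap_pte_cached != 0 && idx < NR_SLOTS);
    pte = kmap_pte_cached - idx; /* later slots sit at lower addresses */
    *pte = phys | kmap_prot_cached;
    return pte;
}
```

After kmap_init_sim() has run once, every call to kmap_slot_set() costs one pointer subtraction and one store, which is the saving the cached values are meant to provide.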

C.2  Page Table Walking

C.2.1  Function: follow_page

Source: mm/memory.c

This function returns the struct page used by the PTE at address in mm's page tables.

405 static struct page * follow_page(struct mm_struct *mm, 
                                     unsigned long address, 
                                     int write) 
406 {
407     pgd_t *pgd;
408     pmd_t *pmd;
409     pte_t *ptep, pte;
411     pgd = pgd_offset(mm, address);
412     if (pgd_none(*pgd) || pgd_bad(*pgd))
413         goto out;
415     pmd = pmd_offset(pgd, address);
416     if (pmd_none(*pmd) || pmd_bad(*pmd))
417         goto out;
419     ptep = pte_offset(pmd, address);
420     if (!ptep)
421         goto out;
423     pte = *ptep;
424     if (pte_present(pte)) {
425         if (!write ||
426             (pte_write(pte) && pte_dirty(pte)))
427             return pte_page(pte);
428     }
430 out:
431     return 0;
432 }

405: The parameters are the mm whose page tables are about to be walked, the address whose struct page is of interest and write, which indicates if the page is about to be written to
411-413: Get the PGD for the address and make sure it is present and valid
415-417: Get the PMD for the address and make sure it is present and valid
419-421: Get the PTE for the address and make sure it exists
424: If the PTE is currently present, then we have something to return
425-426: If the caller has indicated a write is about to take place, check that the PTE has write permission set and is already marked dirty. The function only tests these bits; it does not modify the PTE
427: If the PTE is present and the permissions are fine, return the struct page mapped by the PTE
431: Return 0 indicating that the address has no associated struct page
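
The walk and the write check can be modelled on a toy two-level table. This is a simplified sketch: it assumes a 10/10/12 address split, it returns a page frame number instead of a struct page, and it uses -1 where the kernel would return 0. toy_mm and follow_page_sim() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PTE_PRESENT 0x001u
#define PTE_RW      0x002u
#define PTE_DIRTY   0x040u
#define PAGE_SHIFT  12
#define PGDIR_SHIFT 22
#define PTRS        1024u

/* Toy two-level table: each PGD slot is NULL or points to a page of PTEs. */
typedef struct { uint32_t *pte_page[PTRS]; } toy_mm;

/*
 * Toy follow_page(): returns the page frame number mapped at address,
 * or -1 in the cases where the kernel function would return 0.
 */
long follow_page_sim(const toy_mm *mm, uint32_t address, int write)
{
    const uint32_t *ptes;
    uint32_t pte;

    assert(mm != NULL);
    ptes = mm->pte_page[address >> PGDIR_SHIFT];
    if (ptes == NULL)                  /* stands in for pgd_none()/pmd_none() */
        return -1;
    pte = ptes[(address >> PAGE_SHIFT) & (PTRS - 1)];
    if (!(pte & PTE_PRESENT))
        return -1;
    if (write && !((pte & PTE_RW) && (pte & PTE_DIRTY)))
        return -1;                     /* not yet safe to write through */
    return (long)(pte >> PAGE_SHIFT);
}
```

A present, writable but clean PTE is therefore returned for a read and rejected for a write, matching the condition on lines 425-426.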
