From ea319518ba3de282c13ae1cf4bf2215c5e03e67e Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Fri, 26 Dec 2008 15:08:55 +0100 Subject: locking, percpu counters: introduce separate lock classes Impact: fix lockdep false positives Classify percpu_counter instances similar to regular lock objects -- that is, per instantiation site. The networking code has increased its use of percpu_counters, which leads to false positives if they are treated as a single class. Signed-off-by: Peter Zijlstra Signed-off-by: Ingo Molnar --- mm/backing-dev.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/backing-dev.c b/mm/backing-dev.c index f2e574dbc30..f3b12585782 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -220,7 +220,7 @@ int bdi_init(struct backing_dev_info *bdi) bdi->max_prop_frac = PROP_FRAC_BASE; for (i = 0; i < NR_BDI_STAT_ITEMS; i++) { - err = percpu_counter_init_irq(&bdi->bdi_stat[i], 0); + err = percpu_counter_init(&bdi->bdi_stat[i], 0); if (err) goto err; } -- cgit From 08fba69986e20c1c9e5fe2e6064d146cc4f42480 Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Tue, 6 Jan 2009 14:38:53 -0800 Subject: mm: report the pagesize backing a VMA in /proc/pid/smaps It is useful to verify a hugepage-aware application is using the expected pagesizes for its memory regions. This patch creates an entry called KernelPageSize in /proc/pid/smaps that is the size of page used by the kernel to back a VMA. The entry is not called PageSize as it is possible the MMU uses a different size. This extension should not break any sensible parser that skips lines containing unrecognised information. Signed-off-by: Mel Gorman Acked-by: "KOSAKI Motohiro" Cc: Alexey Dobriyan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/hugetlb.c | 16 ++++++++++++++++ 1 file changed, 16 insertions(+) (limited to 'mm') diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 6058b53dcb8..5cb8bc7c80f 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -219,6 +219,22 @@ static pgoff_t vma_hugecache_offset(struct hstate *h, (vma->vm_pgoff >> huge_page_order(h)); } +/* + * Return the size of the pages allocated when backing a VMA. In the majority + * cases this will be same size as used by the page table entries. + */ +unsigned long vma_kernel_pagesize(struct vm_area_struct *vma) +{ + struct hstate *hstate; + + if (!is_vm_hugetlb_page(vma)) + return PAGE_SIZE; + + hstate = hstate_vma(vma); + + return 1UL << (hstate->order + PAGE_SHIFT); +} + /* * Flags for MAP_PRIVATE reservations. These are stored in the bottom * bits of the reservation map pointer, which are always clear due to -- cgit From 3340289ddf29ca75c3acfb3a6b72f234b2f74d5c Mon Sep 17 00:00:00 2001 From: Mel Gorman Date: Tue, 6 Jan 2009 14:38:54 -0800 Subject: mm: report the MMU pagesize in /proc/pid/smaps The KernelPageSize entry in /proc/pid/smaps is the pagesize used by the kernel to back a VMA. This matches the size used by the MMU in the majority of cases. However, one counter-example occurs on PPC64 kernels whereby a kernel using 64K as a base pagesize may still use 4K pages for the MMU on older processor. To distinguish, this patch reports MMUPageSize as the pagesize used by the MMU in /proc/pid/smaps. 
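For illustration only (not part of this patch): a minimal userspace reader for the two new smaps fields, assuming nothing beyond what the changelog above describes. Unrecognised lines are skipped, as the changelog expects of any sensible parser, and error handling is kept minimal.

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
            char line[256];
            FILE *f = fopen("/proc/self/smaps", "r");

            if (!f)
                    return 1;
            while (fgets(line, sizeof(line), f)) {
                    /* Print only the two fields introduced above; skip
                     * anything unrecognised. */
                    if (!strncmp(line, "KernelPageSize:", 15) ||
                        !strncmp(line, "MMUPageSize:", 12))
                            fputs(line, stdout);
            }
            fclose(f);
            return 0;
    }
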
Signed-off-by: Mel Gorman Cc: "KOSAKI Motohiro" Cc: Alexey Dobriyan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/hugetlb.c | 13 +++++++++++++ 1 file changed, 13 insertions(+) (limited to 'mm') diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 5cb8bc7c80f..9595278b5ab 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -235,6 +235,19 @@ unsigned long vma_kernel_pagesize(struct vm_area_struct *vma) return 1UL << (hstate->order + PAGE_SHIFT); } +/* + * Return the page size being used by the MMU to back a VMA. In the majority + * of cases, the page size used by the kernel matches the MMU size. On + * architectures where it differs, an architecture-specific version of this + * function is required. + */ +#ifndef vma_mmu_pagesize +unsigned long vma_mmu_pagesize(struct vm_area_struct *vma) +{ + return vma_kernel_pagesize(vma); +} +#endif + /* * Flags for MAP_PRIVATE reservations. These are stored in the bottom * bits of the reservation map pointer, which are always clear due to -- cgit From bf3f3bc5e734706730c12a323f9b2068052aa1f0 Mon Sep 17 00:00:00 2001 From: Nick Piggin Date: Tue, 6 Jan 2009 14:38:55 -0800 Subject: mm: don't mark_page_accessed in fault path Doing a mark_page_accessed at fault-time, then doing SetPageReferenced at unmap-time if the pte is young has a number of problems. mark_page_accessed is supposed to be roughly the equivalent of a young pte for unmapped references. Unfortunately it doesn't come with any context: after being called, reclaim doesn't know who or why the page was touched. So calling mark_page_accessed not only adds extra lru or PG_referenced manipulations for pages that are already going to have pte_young ptes anyway, but it also adds these references which are difficult to work with from the context of vma specific references (eg. MADV_SEQUENTIAL pte_young may not wish to contribute to the page being referenced). Then, simply doing SetPageReferenced when zapping a pte and finding it is young, is not a really good solution either. SetPageReferenced does not correctly promote the page to the active list for example. So after removing mark_page_accessed from the fault path, several mmap()+touch+munmap() would have a very different result from several read(2) calls for example, which is not really desirable. Signed-off-by: Nick Piggin Acked-by: Johannes Weiner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/filemap.c | 1 - mm/memory.c | 2 +- 2 files changed, 1 insertion(+), 2 deletions(-) (limited to 'mm') diff --git a/mm/filemap.c b/mm/filemap.c index f5769b4dc07..f9d88183f69 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1530,7 +1530,6 @@ retry_find: /* * Found the page and have a reference on it. 
*/ - mark_page_accessed(page); ra->prev_pos = (loff_t)page->index << PAGE_CACHE_SHIFT; vmf->page = page; return ret | VM_FAULT_LOCKED; diff --git a/mm/memory.c b/mm/memory.c index 7b9db658aca..5e0e91cc6b6 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -768,7 +768,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, if (pte_dirty(ptent)) set_page_dirty(page); if (pte_young(ptent)) - SetPageReferenced(page); + mark_page_accessed(page); file_rss--; } page_remove_rmap(page, vma); -- cgit From 390722baa7fc447b0a4f0c3c3f537ed056dbc944 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Tue, 6 Jan 2009 14:38:57 -0800 Subject: mm: don't mark_page_accessed in shmem_fault Following "mm: don't mark_page_accessed in fault path", which now places a mark_page_accessed() in zap_pte_range(), we should remove the mark_page_accessed() from shmem_fault(). Signed-off-by: Hugh Dickins Cc: Nick Piggin Cc: Johannes Weiner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/shmem.c | 1 - 1 file changed, 1 deletion(-) (limited to 'mm') diff --git a/mm/shmem.c b/mm/shmem.c index f1b0d4871f3..24f18fdee6e 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -1444,7 +1444,6 @@ static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf) if (error) return ((error == -ENOMEM) ? VM_FAULT_OOM : VM_FAULT_SIGBUS); - mark_page_accessed(vmf->page); return ret | VM_FAULT_LOCKED; } -- cgit From 3140a2273009c01c27d316f35ab76a37e105fdd8 Mon Sep 17 00:00:00 2001 From: Brice Goglin Date: Tue, 6 Jan 2009 14:38:57 -0800 Subject: mm: rework do_pages_move() to work on page_sized chunks Rework do_pages_move() to work by page-sized chunks of struct page_to_node that are passed to do_move_page_to_node_array(). We now only have to allocate a single page instead a possibly very large vmalloc area to store all page_to_node entries. As a result, new_page_node() will now have a very small lookup, hidding much of the overall sys_move_pages() overhead. Signed-off-by: Brice Goglin Signed-off-by: Nathalie Furmento Acked-by: Christoph Lameter Cc: Nick Piggin Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/migrate.c | 79 +++++++++++++++++++++++++++++++++--------------------------- 1 file changed, 44 insertions(+), 35 deletions(-) (limited to 'mm') diff --git a/mm/migrate.c b/mm/migrate.c index 21631ab8c08..0a75716cb73 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -919,41 +919,43 @@ static int do_pages_move(struct mm_struct *mm, struct task_struct *task, const int __user *nodes, int __user *status, int flags) { - struct page_to_node *pm = NULL; + struct page_to_node *pm; nodemask_t task_nodes; - int err = 0; - int i; + unsigned long chunk_nr_pages; + unsigned long chunk_start; + int err; task_nodes = cpuset_mems_allowed(task); - /* Limit nr_pages so that the multiplication may not overflow */ - if (nr_pages >= ULONG_MAX / sizeof(struct page_to_node) - 1) { - err = -E2BIG; - goto out; - } - - pm = vmalloc((nr_pages + 1) * sizeof(struct page_to_node)); - if (!pm) { - err = -ENOMEM; + err = -ENOMEM; + pm = (struct page_to_node *)__get_free_page(GFP_KERNEL); + if (!pm) goto out; - } - /* - * Get parameters from user space and initialize the pm - * array. Return various errors if the user did something wrong. 
+ * Store a chunk of page_to_node array in a page, + * but keep the last one as a marker */ - for (i = 0; i < nr_pages; i++) { - const void __user *p; + chunk_nr_pages = (PAGE_SIZE / sizeof(struct page_to_node)) - 1; - err = -EFAULT; - if (get_user(p, pages + i)) - goto out_pm; + for (chunk_start = 0; + chunk_start < nr_pages; + chunk_start += chunk_nr_pages) { + int j; - pm[i].addr = (unsigned long)p; - if (nodes) { + if (chunk_start + chunk_nr_pages > nr_pages) + chunk_nr_pages = nr_pages - chunk_start; + + /* fill the chunk pm with addrs and nodes from user-space */ + for (j = 0; j < chunk_nr_pages; j++) { + const void __user *p; int node; - if (get_user(node, nodes + i)) + err = -EFAULT; + if (get_user(p, pages + j + chunk_start)) + goto out_pm; + pm[j].addr = (unsigned long) p; + + if (get_user(node, nodes + j + chunk_start)) goto out_pm; err = -ENODEV; @@ -964,22 +966,29 @@ static int do_pages_move(struct mm_struct *mm, struct task_struct *task, if (!node_isset(node, task_nodes)) goto out_pm; - pm[i].node = node; - } else - pm[i].node = 0; /* anything to not match MAX_NUMNODES */ - } - /* End marker */ - pm[nr_pages].node = MAX_NUMNODES; + pm[j].node = node; + } + + /* End marker for this chunk */ + pm[chunk_nr_pages].node = MAX_NUMNODES; + + /* Migrate this chunk */ + err = do_move_page_to_node_array(mm, pm, + flags & MPOL_MF_MOVE_ALL); + if (err < 0) + goto out_pm; - err = do_move_page_to_node_array(mm, pm, flags & MPOL_MF_MOVE_ALL); - if (err >= 0) /* Return status information */ - for (i = 0; i < nr_pages; i++) - if (put_user(pm[i].status, status + i)) + for (j = 0; j < chunk_nr_pages; j++) + if (put_user(pm[j].status, status + j + chunk_start)) { err = -EFAULT; + goto out_pm; + } + } + err = 0; out_pm: - vfree(pm); + free_page((unsigned long)pm); out: return err; } -- cgit From 5bd1455c239672081d0e7f086e899b8cbc7a9844 Mon Sep 17 00:00:00 2001 From: Brice Goglin Date: Tue, 6 Jan 2009 14:38:58 -0800 Subject: mm: move_pages: no need to set pp->page to ZERO_PAGE(0) by default pp->page is never used when not set to the right page, so there is no need to set it to ZERO_PAGE(0) by default. Signed-off-by: Brice Goglin Acked-by: Christoph Lameter Cc: Nick Piggin Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/migrate.c | 6 ------ 1 file changed, 6 deletions(-) (limited to 'mm') diff --git a/mm/migrate.c b/mm/migrate.c index 0a75716cb73..60510306c44 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -848,12 +848,6 @@ static int do_move_page_to_node_array(struct mm_struct *mm, struct vm_area_struct *vma; struct page *page; - /* - * A valid page pointer that will not match any of the - * pages that will be moved. - */ - pp->page = ZERO_PAGE(0); - err = -EFAULT; vma = find_vma(mm, pp->addr); if (!vma || !vma_migratable(vma)) -- cgit From 1c0fe6e3bda0464728c23c8d84aa47567e8b716c Mon Sep 17 00:00:00 2001 From: Nick Piggin Date: Tue, 6 Jan 2009 14:38:59 -0800 Subject: mm: invoke oom-killer from page fault Rather than have the pagefault handler kill a process directly if it gets a VM_FAULT_OOM, have it call into the OOM killer. With increasingly sophisticated oom behaviour (cpusets, memory cgroups, oom killing throttling, oom priority adjustment or selective disabling, panic on oom, etc), it's silly to unconditionally kill the faulting process at page fault time. Create a hook for pagefault oom path to call into instead. Only converted x86 and uml so far. 
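For illustration only (not part of this patch): the rough shape of an architecture-side conversion. This is a hedged sketch, not the actual x86 or uml hunk; the handler name and argument are made up, and only the call to the new hook comes from this series.

    /* Sketch: called by a (hypothetical) arch fault handler once
     * handle_mm_fault() has returned VM_FAULT_OOM. */
    static void example_fault_oom(unsigned int fault)
    {
            if (fault & VM_FAULT_OOM) {
                    /*
                     * Instead of unconditionally killing the faulting task,
                     * let the OOM killer apply its policy (cpusets, memory
                     * cgroups, panic_on_oom, oom_kill_allocating_task, ...).
                     */
                    pagefault_out_of_memory();
                    return;
            }
            /* ... VM_FAULT_SIGBUS and friends handled as before ... */
    }
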
[akpm@linux-foundation.org: make __out_of_memory() static] [akpm@linux-foundation.org: fix comment] Signed-off-by: Nick Piggin Cc: Jeff Dike Acked-by: Ingo Molnar Cc: Thomas Gleixner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/oom_kill.c | 94 +++++++++++++++++++++++++++++++++++++++++------------------ 1 file changed, 65 insertions(+), 29 deletions(-) (limited to 'mm') diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 558f9afe6e4..c592965dab2 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -509,6 +509,69 @@ void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask) spin_unlock(&zone_scan_mutex); } +/* + * Must be called with tasklist_lock held for read. + */ +static void __out_of_memory(gfp_t gfp_mask, int order) +{ + if (sysctl_oom_kill_allocating_task) { + oom_kill_process(current, gfp_mask, order, 0, NULL, + "Out of memory (oom_kill_allocating_task)"); + + } else { + unsigned long points; + struct task_struct *p; + +retry: + /* + * Rambo mode: Shoot down a process and hope it solves whatever + * issues we may have. + */ + p = select_bad_process(&points, NULL); + + if (PTR_ERR(p) == -1UL) + return; + + /* Found nothing?!?! Either we hang forever, or we panic. */ + if (!p) { + read_unlock(&tasklist_lock); + panic("Out of memory and no killable processes...\n"); + } + + if (oom_kill_process(p, gfp_mask, order, points, NULL, + "Out of memory")) + goto retry; + } +} + +/* + * pagefault handler calls into here because it is out of memory but + * doesn't know exactly how or why. + */ +void pagefault_out_of_memory(void) +{ + unsigned long freed = 0; + + blocking_notifier_call_chain(&oom_notify_list, 0, &freed); + if (freed > 0) + /* Got some memory back in the last second. */ + return; + + if (sysctl_panic_on_oom) + panic("out of memory from page fault. panic_on_oom is selected.\n"); + + read_lock(&tasklist_lock); + __out_of_memory(0, 0); /* unknown gfp_mask and order */ + read_unlock(&tasklist_lock); + + /* + * Give "p" a good chance of killing itself before we + * retry to allocate memory. + */ + if (!test_thread_flag(TIF_MEMDIE)) + schedule_timeout_uninterruptible(1); +} + /** * out_of_memory - kill the "best" process when we run out of memory * @zonelist: zonelist pointer @@ -522,8 +585,6 @@ void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask) */ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order) { - struct task_struct *p; - unsigned long points = 0; unsigned long freed = 0; enum oom_constraint constraint; @@ -544,7 +605,7 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order) switch (constraint) { case CONSTRAINT_MEMORY_POLICY: - oom_kill_process(current, gfp_mask, order, points, NULL, + oom_kill_process(current, gfp_mask, order, 0, NULL, "No available memory (MPOL_BIND)"); break; @@ -553,35 +614,10 @@ void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order) panic("out of memory. panic_on_oom is selected\n"); /* Fall-through */ case CONSTRAINT_CPUSET: - if (sysctl_oom_kill_allocating_task) { - oom_kill_process(current, gfp_mask, order, points, NULL, - "Out of memory (oom_kill_allocating_task)"); - break; - } -retry: - /* - * Rambo mode: Shoot down a process and hope it solves whatever - * issues we may have. - */ - p = select_bad_process(&points, NULL); - - if (PTR_ERR(p) == -1UL) - goto out; - - /* Found nothing?!?! Either we hang forever, or we panic. 
*/ - if (!p) { - read_unlock(&tasklist_lock); - panic("Out of memory and no killable processes...\n"); - } - - if (oom_kill_process(p, gfp_mask, order, points, NULL, - "Out of memory")) - goto retry; - + __out_of_memory(gfp_mask, order); break; } -out: read_unlock(&tasklist_lock); /* -- cgit From c7d4caeb1d68d07f77cc09fc20b7759d6d7aa3b1 Mon Sep 17 00:00:00 2001 From: David Rientjes Date: Tue, 6 Jan 2009 14:39:00 -0800 Subject: oom: fix zone_scan_mutex name zone_scan_mutex is actually a spinlock, so name it appropriately. Signed-off-by: David Rientjes Reviewed-by: KOSAKI Motohiro Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/oom_kill.c | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) (limited to 'mm') diff --git a/mm/oom_kill.c b/mm/oom_kill.c index c592965dab2..e5f50cfdca4 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -31,7 +31,7 @@ int sysctl_panic_on_oom; int sysctl_oom_kill_allocating_task; int sysctl_oom_dump_tasks; -static DEFINE_SPINLOCK(zone_scan_mutex); +static DEFINE_SPINLOCK(zone_scan_lock); /* #define DEBUG */ /** @@ -470,7 +470,7 @@ int try_set_zone_oom(struct zonelist *zonelist, gfp_t gfp_mask) struct zone *zone; int ret = 1; - spin_lock(&zone_scan_mutex); + spin_lock(&zone_scan_lock); for_each_zone_zonelist(zone, z, zonelist, gfp_zone(gfp_mask)) { if (zone_is_oom_locked(zone)) { ret = 0; @@ -480,7 +480,7 @@ int try_set_zone_oom(struct zonelist *zonelist, gfp_t gfp_mask) for_each_zone_zonelist(zone, z, zonelist, gfp_zone(gfp_mask)) { /* - * Lock each zone in the zonelist under zone_scan_mutex so a + * Lock each zone in the zonelist under zone_scan_lock so a * parallel invocation of try_set_zone_oom() doesn't succeed * when it shouldn't. */ @@ -488,7 +488,7 @@ int try_set_zone_oom(struct zonelist *zonelist, gfp_t gfp_mask) } out: - spin_unlock(&zone_scan_mutex); + spin_unlock(&zone_scan_lock); return ret; } @@ -502,11 +502,11 @@ void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_mask) struct zoneref *z; struct zone *zone; - spin_lock(&zone_scan_mutex); + spin_lock(&zone_scan_lock); for_each_zone_zonelist(zone, z, zonelist, gfp_zone(gfp_mask)) { zone_clear_flag(zone, ZONE_OOM_LOCKED); } - spin_unlock(&zone_scan_mutex); + spin_unlock(&zone_scan_lock); } /* -- cgit From 75aa199410359dc5fbcf9025ff7af98a9d20f0d5 Mon Sep 17 00:00:00 2001 From: David Rientjes Date: Tue, 6 Jan 2009 14:39:01 -0800 Subject: oom: print triggering task's cpuset and mems allowed When cpusets are enabled, it's necessary to print the triggering task's set of allowable nodes so the subsequently printed meminfo can be interpreted correctly. We also print the task's cpuset name for informational purposes. 
[rientjes@google.com: task lock current before dereferencing cpuset] Cc: Paul Menage Cc: Li Zefan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/oom_kill.c | 3 +++ 1 file changed, 3 insertions(+) (limited to 'mm') diff --git a/mm/oom_kill.c b/mm/oom_kill.c index e5f50cfdca4..6b9e758c98a 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -392,6 +392,9 @@ static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order, printk(KERN_WARNING "%s invoked oom-killer: " "gfp_mask=0x%x, order=%d, oomkilladj=%d\n", current->comm, gfp_mask, order, current->oomkilladj); + task_lock(current); + cpuset_print_task_mems_allowed(current); + task_unlock(current); dump_stack(); show_mem(); if (sysctl_oom_dump_tasks) -- cgit From 31a12666d8f0c22235297e1c1575f82061480029 Mon Sep 17 00:00:00 2001 From: Nick Piggin Date: Tue, 6 Jan 2009 14:39:04 -0800 Subject: mm: write_cache_pages cyclic fix In write_cache_pages, scanned == 1 is supposed to mean that cyclic writeback has circled through zero, thus we should not circle again. However it gets set to 1 after the first successful pagevec lookup. This leads to cases where not enough data gets written. Counterexample: file with first 10 pages dirty, writeback_index == 5, nr_to_write == 10. Then the 5 last pages will be found, and scanned will be set to 1, after writing those out, we will not cycle back to get the first 5. Rework this logic, now we'll always cycle unless we started off from index 0. When cycling, only write out as far as 1 page before the start page from the first cycle (so we don't write parts of the file twice). Signed-off-by: Nick Piggin Cc: Chris Mason Cc: Dave Chinner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page-writeback.c | 25 ++++++++++++++++++------- 1 file changed, 18 insertions(+), 7 deletions(-) (limited to 'mm') diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 2970e35fd03..eb277bdd4c5 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -868,9 +868,10 @@ int write_cache_pages(struct address_space *mapping, int done = 0; struct pagevec pvec; int nr_pages; + pgoff_t uninitialized_var(writeback_index); pgoff_t index; pgoff_t end; /* Inclusive */ - int scanned = 0; + int cycled; int range_whole = 0; long nr_to_write = wbc->nr_to_write; @@ -881,14 +882,19 @@ int write_cache_pages(struct address_space *mapping, pagevec_init(&pvec, 0); if (wbc->range_cyclic) { - index = mapping->writeback_index; /* Start from prev offset */ + writeback_index = mapping->writeback_index; /* prev offset */ + index = writeback_index; + if (index == 0) + cycled = 1; + else + cycled = 0; end = -1; } else { index = wbc->range_start >> PAGE_CACHE_SHIFT; end = wbc->range_end >> PAGE_CACHE_SHIFT; if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX) range_whole = 1; - scanned = 1; + cycled = 1; /* ignore range_cyclic tests */ } retry: while (!done && (index <= end) && @@ -897,7 +903,6 @@ retry: min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) { unsigned i; - scanned = 1; for (i = 0; i < nr_pages; i++) { struct page *page = pvec.pages[i]; @@ -915,7 +920,11 @@ retry: continue; } - if (!wbc->range_cyclic && page->index > end) { + if (page->index > end) { + /* + * can't be range_cyclic (1st pass) because + * end == -1 in that case. 
+ */ done = 1; unlock_page(page); continue; @@ -946,13 +955,15 @@ retry: pagevec_release(&pvec); cond_resched(); } - if (!scanned && !done) { + if (!cycled) { /* + * range_cyclic: * We hit the last page and there is more work to be done: wrap * back to the start of the file */ - scanned = 1; + cycled = 1; index = 0; + end = writeback_index - 1; goto retry; } if (!wbc->no_nrwrite_index_update) { -- cgit From bd19e012f6fd3b7309689165ea865cbb7bb88c1e Mon Sep 17 00:00:00 2001 From: Nick Piggin Date: Tue, 6 Jan 2009 14:39:06 -0800 Subject: mm: write_cache_pages early loop termination We'd like to break out of the loop early in many situations, however the existing code has been setting mapping->writeback_index past the final page in the pagevec lookup for cyclic writeback. This is a problem if we don't process all pages up to the final page. Currently the code mostly keeps writeback_index reasonable and hacked around this by not breaking out of the loop or writing pages outside the range in these cases. Keep track of a real "done index" that enables us to terminate the loop in a much more flexible manner. Needed by the subsequent patch to preserve writepage errors, and then further patches to break out of the loop early for other reasons. However there are no functional changes with this patch alone. Signed-off-by: Nick Piggin Cc: Chris Mason Cc: Dave Chinner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page-writeback.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/page-writeback.c b/mm/page-writeback.c index eb277bdd4c5..01b9cb8ccf6 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -871,6 +871,7 @@ int write_cache_pages(struct address_space *mapping, pgoff_t uninitialized_var(writeback_index); pgoff_t index; pgoff_t end; /* Inclusive */ + pgoff_t done_index; int cycled; int range_whole = 0; long nr_to_write = wbc->nr_to_write; @@ -897,6 +898,7 @@ int write_cache_pages(struct address_space *mapping, cycled = 1; /* ignore range_cyclic tests */ } retry: + done_index = index; while (!done && (index <= end) && (nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, PAGECACHE_TAG_DIRTY, @@ -906,6 +908,8 @@ retry: for (i = 0; i < nr_pages; i++) { struct page *page = pvec.pages[i]; + done_index = page->index + 1; + /* * At this point we hold neither mapping->tree_lock nor * lock on the page itself: the page may be truncated or @@ -968,7 +972,7 @@ retry: } if (!wbc->no_nrwrite_index_update) { if (wbc->range_cyclic || (range_whole && nr_to_write > 0)) - mapping->writeback_index = index; + mapping->writeback_index = done_index; wbc->nr_to_write = nr_to_write; } -- cgit From 00266770b8b3a6a77f896ca501a0613739086832 Mon Sep 17 00:00:00 2001 From: Nick Piggin Date: Tue, 6 Jan 2009 14:39:06 -0800 Subject: mm: write_cache_pages writepage error fix In write_cache_pages, if ret signals a real error, but we still have some pages left in the pagevec, done would be set to 1, but the remaining pages would continue to be processed and ret will be overwritten in the process. It could easily be overwritten with success, and thus success will be returned even if there is an error. Thus the caller is told all writes succeeded, wheras in reality some did not. Fix this by bailing immediately if there is an error, and retaining the first error code. This is a data integrity bug. 
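For illustration only (not part of this patch): the "keep the first error" rule the fix enforces, shown on a plain userspace write loop. The function and parameter names are made up, and partial writes are ignored for brevity.

    #include <errno.h>
    #include <stddef.h>
    #include <unistd.h>

    /* Write n buffers; on the first failure stop and report that error,
     * rather than carrying on and letting a later success overwrite it. */
    static int write_buffers(int fd, const void *bufs[], const size_t lens[], int n)
    {
            int i;

            for (i = 0; i < n; i++) {
                    if (write(fd, bufs[i], lens[i]) < 0)
                            return -errno;  /* bail immediately: first error wins */
            }
            return 0;
    }
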
Signed-off-by: Nick Piggin Cc: Chris Mason Cc: Dave Chinner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page-writeback.c | 26 ++++++++++++++++++++------ 1 file changed, 20 insertions(+), 6 deletions(-) (limited to 'mm') diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 01b9cb8ccf6..2e847cdcad0 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -944,12 +944,26 @@ retry: } ret = (*writepage)(page, wbc, data); - - if (unlikely(ret == AOP_WRITEPAGE_ACTIVATE)) { - unlock_page(page); - ret = 0; - } - if (ret || (--nr_to_write <= 0)) + if (unlikely(ret)) { + if (ret == AOP_WRITEPAGE_ACTIVATE) { + unlock_page(page); + ret = 0; + } else { + /* + * done_index is set past this page, + * so media errors will not choke + * background writeout for the entire + * file. This has consequences for + * range_cyclic semantics (ie. it may + * not be suitable for data integrity + * writeout). + */ + done = 1; + break; + } + } + + if (--nr_to_write <= 0) done = 1; if (wbc->nonblocking && bdi_write_congested(bdi)) { wbc->encountered_congestion = 1; -- cgit From 05fe478dd04e02fa230c305ab9b5616669821dd3 Mon Sep 17 00:00:00 2001 From: Nick Piggin Date: Tue, 6 Jan 2009 14:39:08 -0800 Subject: mm: write_cache_pages integrity fix In write_cache_pages, nr_to_write is heeded even for data-integrity syncs, so the function will return success after writing out nr_to_write pages, even if that was not sufficient to guarantee data integrity. The callers tend to set it to values that could break data interity semantics easily in practice. For example, nr_to_write can be set to mapping->nr_pages * 2, however if a file has a single, dirty page, then fsync is called, subsequent pages might be concurrently added and dirtied, then write_cache_pages might writeout two of these newly dirty pages, while not writing out the old page that should have been written out. Fix this by ignoring nr_to_write if it is a data integrity sync. This is a data integrity bug. The reason this has been done in the past is to avoid stalling sync operations behind page dirtiers. "If a file has one dirty page at offset 1000000000000000 then someone does an fsync() and someone else gets in first and starts madly writing pages at offset 0, we want to write that page at 1000000000000000. Somehow." What we do today is return success after an arbitrary amount of pages are written, whether or not we have provided the data-integrity semantics that the caller has asked for. Even this doesn't actually fix all stall cases completely: in the above situation, if the file has a huge number of pages in pagecache (but not dirty), then mapping->nrpages is going to be huge, even if pages are being dirtied. This change does indeed make the possibility of long stalls lager, and that's not a good thing, but lying about data integrity is even worse. We have to either perform the sync, or return -ELINUXISLAME so at least the caller knows what has happened. There are subsequent competing approaches in the works to solve the stall problems properly, without compromising data integrity. 
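For illustration only (not part of this patch): the userspace guarantee the fix is about. Once fsync() returns success, the lone dirty page at the high offset must be on stable storage, no matter how many other pages were dirtied concurrently. The path and offset below are arbitrary.

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
            int fd = open("/tmp/example-file", O_RDWR | O_CREAT, 0600);
            char byte = 'x';

            if (fd < 0)
                    return 1;
            /* Dirty a single page far into the (sparse) file... */
            if (pwrite(fd, &byte, 1, (off_t)1 << 30) != 1)
                    return 1;
            /* ...fsync() must write that page back before returning 0, even
             * if someone else keeps dirtying pages at offset 0 meanwhile. */
            if (fsync(fd))
                    perror("fsync");
            close(fd);
            return 0;
    }
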
Signed-off-by: Nick Piggin Cc: Chris Mason Cc: Dave Chinner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/filemap.c | 2 +- mm/page-writeback.c | 6 ++++-- 2 files changed, 5 insertions(+), 3 deletions(-) (limited to 'mm') diff --git a/mm/filemap.c b/mm/filemap.c index f9d88183f69..9c5e6235cc7 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -210,7 +210,7 @@ int __filemap_fdatawrite_range(struct address_space *mapping, loff_t start, int ret; struct writeback_control wbc = { .sync_mode = sync_mode, - .nr_to_write = mapping->nrpages * 2, + .nr_to_write = LONG_MAX, .range_start = start, .range_end = end, }; diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 2e847cdcad0..5edca676e2c 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -963,8 +963,10 @@ retry: } } - if (--nr_to_write <= 0) - done = 1; + if (wbc->sync_mode == WB_SYNC_NONE) { + if (--wbc->nr_to_write <= 0) + done = 1; + } if (wbc->nonblocking && bdi_write_congested(bdi)) { wbc->encountered_congestion = 1; done = 1; -- cgit From 5a3d5c9813db56a75934eb1015367fda23a8b0b4 Mon Sep 17 00:00:00 2001 From: Nick Piggin Date: Tue, 6 Jan 2009 14:39:09 -0800 Subject: mm: write_cache_pages cleanups Get rid of some complex expressions from flow control statements, add a comment, remove some duplicate code. Signed-off-by: Nick Piggin Cc: Chris Mason Cc: Dave Chinner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page-writeback.c | 34 ++++++++++++++++++++++------------ 1 file changed, 22 insertions(+), 12 deletions(-) (limited to 'mm') diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 5edca676e2c..c3fb38b1ea4 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -899,11 +899,14 @@ int write_cache_pages(struct address_space *mapping, } retry: done_index = index; - while (!done && (index <= end) && - (nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, - PAGECACHE_TAG_DIRTY, - min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) { - unsigned i; + while (!done && (index <= end)) { + int i; + + nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, + PAGECACHE_TAG_DIRTY, + min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1); + if (nr_pages == 0) + break; for (i = 0; i < nr_pages; i++) { struct page *page = pvec.pages[i]; @@ -919,7 +922,16 @@ retry: */ lock_page(page); + /* + * Page truncated or invalidated. We can freely skip it + * then, even for data integrity operations: the page + * has disappeared concurrently, so there could be no + * real expectation of this data interity operation + * even if there is now a new, dirty page at the same + * pagecache address. + */ if (unlikely(page->mapping != mapping)) { +continue_unlock: unlock_page(page); continue; } @@ -930,18 +942,15 @@ retry: * end == -1 in that case. 
*/ done = 1; - unlock_page(page); - continue; + goto continue_unlock; } if (wbc->sync_mode != WB_SYNC_NONE) wait_on_page_writeback(page); if (PageWriteback(page) || - !clear_page_dirty_for_io(page)) { - unlock_page(page); - continue; - } + !clear_page_dirty_for_io(page)) + goto continue_unlock; ret = (*writepage)(page, wbc, data); if (unlikely(ret)) { @@ -964,7 +973,8 @@ retry: } if (wbc->sync_mode == WB_SYNC_NONE) { - if (--wbc->nr_to_write <= 0) + wbc->nr_to_write--; + if (wbc->nr_to_write <= 0) done = 1; } if (wbc->nonblocking && bdi_write_congested(bdi)) { -- cgit From 515f4a037fb9ab736f8bad733fcd2ffd350cf265 Mon Sep 17 00:00:00 2001 From: Nick Piggin Date: Tue, 6 Jan 2009 14:39:10 -0800 Subject: mm: write_cache_pages optimise page cleaning In write_cache_pages, if we get stuck behind another process that is cleaning pages, we will be forced to wait for them to finish, then perform our own writeout (if it was redirtied during the long wait), then wait for that. If a page under writeout is still clean, we can skip waiting for it (if we're part of a data integrity sync, we'll be waiting for all writeout pages afterwards, so we'll still be waiting for the other guy's write that's cleaned the page). Signed-off-by: Nick Piggin Cc: Chris Mason Cc: Dave Chinner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page-writeback.c | 17 +++++++++++++---- 1 file changed, 13 insertions(+), 4 deletions(-) (limited to 'mm') diff --git a/mm/page-writeback.c b/mm/page-writeback.c index c3fb38b1ea4..2e8c2b01d5d 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -945,11 +945,20 @@ continue_unlock: goto continue_unlock; } - if (wbc->sync_mode != WB_SYNC_NONE) - wait_on_page_writeback(page); + if (!PageDirty(page)) { + /* someone wrote it for us */ + goto continue_unlock; + } + + if (PageWriteback(page)) { + if (wbc->sync_mode != WB_SYNC_NONE) + wait_on_page_writeback(page); + else + goto continue_unlock; + } - if (PageWriteback(page) || - !clear_page_dirty_for_io(page)) + BUG_ON(PageWriteback(page)); + if (!clear_page_dirty_for_io(page)) goto continue_unlock; ret = (*writepage)(page, wbc, data); -- cgit From d5482cdf8a0aacb1e6468a97d5544f5829c8d8c4 Mon Sep 17 00:00:00 2001 From: Nick Piggin Date: Tue, 6 Jan 2009 14:39:11 -0800 Subject: mm: write_cache_pages terminate quickly Terminate the write_cache_pages loop upon encountering the first page past end, without locking the page. Pages cannot have their index change when we have a reference on them (truncate, eg truncate_inode_pages_range performs the same check without the page lock). Signed-off-by: Nick Piggin Cc: Chris Mason Cc: Dave Chinner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page-writeback.c | 32 ++++++++++++++++---------------- 1 file changed, 16 insertions(+), 16 deletions(-) (limited to 'mm') diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 2e8c2b01d5d..0d986c13d47 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -911,15 +911,24 @@ retry: for (i = 0; i < nr_pages; i++) { struct page *page = pvec.pages[i]; - done_index = page->index + 1; - /* - * At this point we hold neither mapping->tree_lock nor - * lock on the page itself: the page may be truncated or - * invalidated (changing page->mapping to NULL), or even - * swizzled back from swapper_space to tmpfs file - * mapping + * At this point, the page may be truncated or + * invalidated (changing page->mapping to NULL), or + * even swizzled back from swapper_space to tmpfs file + * mapping. 
However, page->index will not change + * because we have a reference on the page. */ + if (page->index > end) { + /* + * can't be range_cyclic (1st pass) because + * end == -1 in that case. + */ + done = 1; + break; + } + + done_index = page->index + 1; + lock_page(page); /* @@ -936,15 +945,6 @@ continue_unlock: continue; } - if (page->index > end) { - /* - * can't be range_cyclic (1st pass) because - * end == -1 in that case. - */ - done = 1; - goto continue_unlock; - } - if (!PageDirty(page)) { /* someone wrote it for us */ goto continue_unlock; -- cgit From 82fd1a9a8ced9607312b54859572bcc6211e8919 Mon Sep 17 00:00:00 2001 From: Andrew Morton Date: Tue, 6 Jan 2009 14:39:11 -0800 Subject: mm: write_cache_pages more terminate quickly Now that we have the early-termination logic in place, it makes sense to bail out early in all other cases where done is set to 1. Signed-off-by: Nick Piggin Cc: Chris Mason Cc: Dave Chinner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page-writeback.c | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 0d986c13d47..08d2b960b29 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -983,12 +983,15 @@ continue_unlock: if (wbc->sync_mode == WB_SYNC_NONE) { wbc->nr_to_write--; - if (wbc->nr_to_write <= 0) + if (wbc->nr_to_write <= 0) { done = 1; + break; + } } if (wbc->nonblocking && bdi_write_congested(bdi)) { wbc->encountered_congestion = 1; done = 1; + break; } } pagevec_release(&pvec); -- cgit From c04fc586c1a480ba198f03ae7b6cbd7b57380b91 Mon Sep 17 00:00:00 2001 From: Gary Hade Date: Tue, 6 Jan 2009 14:39:14 -0800 Subject: mm: show node to memory section relationship with symlinks in sysfs Show node to memory section relationship with symlinks in sysfs Add /sys/devices/system/node/nodeX/memoryY symlinks for all the memory sections located on nodeX. For example: /sys/devices/system/node/node1/memory135 -> ../../memory/memory135 indicates that memory section 135 resides on node1. Also revises documentation to cover this change as well as updating Documentation/ABI/testing/sysfs-devices-memory to include descriptions of memory hotremove files 'phys_device', 'phys_index', and 'state' that were previously not described there. In addition to it always being a good policy to provide users with the maximum possible amount of physical location information for resources that can be hot-added and/or hot-removed, the following are some (but likely not all) of the user benefits provided by this change. Immediate: - Provides information needed to determine the specific node on which a defective DIMM is located. This will reduce system downtime when the node or defective DIMM is swapped out. - Prevents unintended onlining of a memory section that was previously offlined due to a defective DIMM. This could happen during node hot-add when the user or node hot-add assist script onlines _all_ offlined sections due to user or script inability to identify the specific memory sections located on the hot-added node. The consequences of reintroducing the defective memory could be ugly. - Provides information needed to vary the amount and distribution of memory on specific nodes for testing or debugging purposes. Future: - Will provide information needed to identify the memory sections that need to be offlined prior to physical removal of a specific node. Symlink creation during boot was tested on 2-node x86_64, 2-node ppc64, and 2-node ia64 systems. 
Symlink creation during physical memory hot-add tested on a 2-node x86_64 system. Signed-off-by: Gary Hade Signed-off-by: Badari Pulavarty Acked-by: Ingo Molnar Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memory_hotplug.c | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) (limited to 'mm') diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index b1737118546..2ba38bc07b4 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -216,7 +216,8 @@ static int __meminit __add_zone(struct zone *zone, unsigned long phys_start_pfn) return 0; } -static int __meminit __add_section(struct zone *zone, unsigned long phys_start_pfn) +static int __meminit __add_section(int nid, struct zone *zone, + unsigned long phys_start_pfn) { int nr_pages = PAGES_PER_SECTION; int ret; @@ -234,7 +235,7 @@ static int __meminit __add_section(struct zone *zone, unsigned long phys_start_p if (ret < 0) return ret; - return register_new_memory(__pfn_to_section(phys_start_pfn)); + return register_new_memory(nid, __pfn_to_section(phys_start_pfn)); } #ifdef CONFIG_SPARSEMEM_VMEMMAP @@ -273,8 +274,8 @@ static int __remove_section(struct zone *zone, struct mem_section *ms) * call this function after deciding the zone to which to * add the new pages. */ -int __ref __add_pages(struct zone *zone, unsigned long phys_start_pfn, - unsigned long nr_pages) +int __ref __add_pages(int nid, struct zone *zone, unsigned long phys_start_pfn, + unsigned long nr_pages) { unsigned long i; int err = 0; @@ -284,7 +285,7 @@ int __ref __add_pages(struct zone *zone, unsigned long phys_start_pfn, end_sec = pfn_to_section_nr(phys_start_pfn + nr_pages - 1); for (i = start_sec; i <= end_sec; i++) { - err = __add_section(zone, i << PFN_SECTION_SHIFT); + err = __add_section(nid, zone, i << PFN_SECTION_SHIFT); /* * EEXIST is finally dealt with by ioresource collision -- cgit From 5594c8c813d9e907ff55da7080d42653478b73e8 Mon Sep 17 00:00:00 2001 From: Yinghai Lu Date: Tue, 6 Jan 2009 14:39:14 -0800 Subject: mm: print out memmap number only if it is not zero Don't print the size of the zone's memmap array if it does not have one. Impact: cleanup Signed-off-by: Yinghai Lu Cc: Mel Gorman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) (limited to 'mm') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index d8ac0147456..2f644c3e3da 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3469,9 +3469,10 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat, PAGE_ALIGN(size * sizeof(struct page)) >> PAGE_SHIFT; if (realsize >= memmap_pages) { realsize -= memmap_pages; - printk(KERN_DEBUG - " %s zone: %lu pages used for memmap\n", - zone_names[j], memmap_pages); + if (memmap_pages) + printk(KERN_DEBUG + " %s zone: %lu pages used for memmap\n", + zone_names[j], memmap_pages); } else printk(KERN_WARNING " %s zone: %lu pages exceeds realsize %lu\n", -- cgit From 1b0bd118862cd9fe9ac2872137a1b8107e83ff9d Mon Sep 17 00:00:00 2001 From: KOSAKI Motohiro Date: Tue, 6 Jan 2009 14:39:15 -0800 Subject: mm: get rid of pagevec_release_nonlru() speculative page references patch (commit: e286781d5f2e9c846e012a39653a166e9d31777d) removed last pagevec_release_nonlru() caller. So this function can be removed now. This patch doesn't have any functional change. 
Signed-off-by: KOSAKI Motohiro Cc: Nick Piggin Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/swap.c | 22 ---------------------- 1 file changed, 22 deletions(-) (limited to 'mm') diff --git a/mm/swap.c b/mm/swap.c index b135ec90cde..21a566fc457 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -397,28 +397,6 @@ void __pagevec_release(struct pagevec *pvec) EXPORT_SYMBOL(__pagevec_release); -/* - * pagevec_release() for pages which are known to not be on the LRU - * - * This function reinitialises the caller's pagevec. - */ -void __pagevec_release_nonlru(struct pagevec *pvec) -{ - int i; - struct pagevec pages_to_free; - - pagevec_init(&pages_to_free, pvec->cold); - for (i = 0; i < pagevec_count(pvec); i++) { - struct page *page = pvec->pages[i]; - - VM_BUG_ON(PageLRU(page)); - if (put_page_testzero(page)) - pagevec_add(&pages_to_free, page); - } - pagevec_free(&pages_to_free); - pagevec_reinit(pvec); -} - /* * Add the passed pages to the LRU, then drop the caller's refcount * on them. Reinitialises the caller's pagevec. -- cgit From 64cdd548ffe26849d4cd113ac640f60606063b14 Mon Sep 17 00:00:00 2001 From: KOSAKI Motohiro Date: Tue, 6 Jan 2009 14:39:16 -0800 Subject: mm: cleanup: remove #ifdef CONFIG_MIGRATION #ifdef in *.c file decrease source readability a bit. removing is better. This patch doesn't have any functional change. Signed-off-by: KOSAKI Motohiro Cc: Christoph Lameter Cc: Mel Gorman Cc: Lee Schermerhorn Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/mprotect.c | 6 ++---- mm/rmap.c | 10 +++------- 2 files changed, 5 insertions(+), 11 deletions(-) (limited to 'mm') diff --git a/mm/mprotect.c b/mm/mprotect.c index cfb4c485206..d0f6e7ce09f 100644 --- a/mm/mprotect.c +++ b/mm/mprotect.c @@ -22,6 +22,7 @@ #include #include #include +#include #include #include #include @@ -59,8 +60,7 @@ static void change_pte_range(struct mm_struct *mm, pmd_t *pmd, ptent = pte_mkwrite(ptent); ptep_modify_prot_commit(mm, addr, pte, ptent); -#ifdef CONFIG_MIGRATION - } else if (!pte_file(oldpte)) { + } else if (PAGE_MIGRATION && !pte_file(oldpte)) { swp_entry_t entry = pte_to_swp_entry(oldpte); if (is_write_migration_entry(entry)) { @@ -72,9 +72,7 @@ static void change_pte_range(struct mm_struct *mm, pmd_t *pmd, set_pte_at(mm, addr, pte, swp_entry_to_pte(entry)); } -#endif } - } while (pte++, addr += PAGE_SIZE, addr != end); arch_leave_lazy_mmu_mode(); pte_unmap_unlock(pte - 1, ptl); diff --git a/mm/rmap.c b/mm/rmap.c index 10993942d6c..53c56dacd72 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -50,6 +50,7 @@ #include #include #include +#include #include @@ -818,8 +819,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma, spin_unlock(&mmlist_lock); } dec_mm_counter(mm, anon_rss); -#ifdef CONFIG_MIGRATION - } else { + } else if (PAGE_MIGRATION) { /* * Store the pfn of the page in a special migration * pte. 
do_swap_page() will wait until the migration @@ -827,19 +827,15 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma, */ BUG_ON(!migration); entry = make_migration_entry(page, pte_write(pteval)); -#endif } set_pte_at(mm, address, pte, swp_entry_to_pte(entry)); BUG_ON(pte_file(*pte)); - } else -#ifdef CONFIG_MIGRATION - if (migration) { + } else if (PAGE_MIGRATION && migration) { /* Establish migration entry for a file page */ swp_entry_t entry; entry = make_migration_entry(page, pte_write(pteval)); set_pte_at(mm, address, pte, swp_entry_to_pte(entry)); } else -#endif dec_mm_counter(mm, file_rss); -- cgit From 4917e5d0499b5ae7b26b56fccaefddf9aec9369c Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Tue, 6 Jan 2009 14:39:17 -0800 Subject: mm: more likely reclaim MADV_SEQUENTIAL mappings File pages mapped only in sequentially read mappings are perfect reclaim canditates. This patch makes these mappings behave like weak references, their pages will be reclaimed unless they have a strong reference from a normal mapping as well. It changes the reclaim and the unmap path where they check if the page has been referenced. In both cases, accesses through sequentially read mappings will be ignored. Benchmark results from KOSAKI Motohiro: http://marc.info/?l=linux-mm&m=122485301925098&w=2 Signed-off-by: Johannes Weiner Signed-off-by: Rik van Riel Acked-by: KOSAKI Motohiro Cc: Nick Piggin Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memory.c | 3 ++- mm/rmap.c | 13 +++++++++++-- 2 files changed, 13 insertions(+), 3 deletions(-) (limited to 'mm') diff --git a/mm/memory.c b/mm/memory.c index 5e0e91cc6b6..99e8d5c7b31 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -767,7 +767,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, else { if (pte_dirty(ptent)) set_page_dirty(page); - if (pte_young(ptent)) + if (pte_young(ptent) && + likely(!VM_SequentialReadHint(vma))) mark_page_accessed(page); file_rss--; } diff --git a/mm/rmap.c b/mm/rmap.c index 53c56dacd72..f01e92244c5 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -360,8 +360,17 @@ static int page_referenced_one(struct page *page, goto out_unmap; } - if (ptep_clear_flush_young_notify(vma, address, pte)) - referenced++; + if (ptep_clear_flush_young_notify(vma, address, pte)) { + /* + * Don't treat a reference through a sequentially read + * mapping as such. If the page has been used in + * another mapping, we will catch it; if this other + * mapping is already gone, the unmap path will have + * set PG_referenced or activated the page. + */ + if (likely(!VM_SequentialReadHint(vma))) + referenced++; + } /* Pretend the page is referenced if the task has the swap token and is in the middle of a page fault. */ -- cgit From c1279c4ef37a06ba708e6b1f6fd98b45c52770f6 Mon Sep 17 00:00:00 2001 From: Glauber Costa Date: Tue, 6 Jan 2009 14:39:18 -0800 Subject: mm: vmalloc tweak failure printk If we can't service a vmalloc allocation, show size of the allocation that actually failed. Useful for debugging. 
Signed-off-by: Glauber Costa Signed-off-by: Nick Piggin Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmalloc.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) (limited to 'mm') diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 7465f22fec0..2644afb9d6a 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -381,8 +381,9 @@ found: goto retry; } if (printk_ratelimit()) - printk(KERN_WARNING "vmap allocation failed: " - "use vmalloc= to increase size.\n"); + printk(KERN_WARNING + "vmap allocation for size %lu failed: " + "use vmalloc= to increase size.\n", size); return ERR_PTR(-EBUSY); } -- cgit From 848778483351e90f9a2c587bdbe0c78b17c1e30b Mon Sep 17 00:00:00 2001 From: Glauber Costa Date: Tue, 6 Jan 2009 14:39:19 -0800 Subject: mm: vmalloc improve vmallocinfo If we do that, output of files like /proc/vmallocinfo will show things like "vmalloc_32", "vmalloc_user", or whomever the caller was as the caller. This info is not as useful as the real caller of the allocation. So, proposal is to call __vmalloc_node node directly, with matching parameters to save the caller information Signed-off-by: Glauber Costa Signed-off-by: Nick Piggin Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmalloc.c | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-) (limited to 'mm') diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 2644afb9d6a..b62ea569aa4 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -1376,7 +1376,8 @@ void *vmalloc_user(unsigned long size) struct vm_struct *area; void *ret; - ret = __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO, PAGE_KERNEL); + ret = __vmalloc_node(size, GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO, + PAGE_KERNEL, -1, __builtin_return_address(0)); if (ret) { area = find_vm_area(ret); area->flags |= VM_USERMAP; @@ -1421,7 +1422,8 @@ EXPORT_SYMBOL(vmalloc_node); void *vmalloc_exec(unsigned long size) { - return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM, PAGE_KERNEL_EXEC); + return __vmalloc_node(size, GFP_KERNEL | __GFP_HIGHMEM, PAGE_KERNEL_EXEC, + -1, __builtin_return_address(0)); } #if defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA32) @@ -1441,7 +1443,8 @@ void *vmalloc_exec(unsigned long size) */ void *vmalloc_32(unsigned long size) { - return __vmalloc(size, GFP_VMALLOC32, PAGE_KERNEL); + return __vmalloc_node(size, GFP_VMALLOC32, PAGE_KERNEL, + -1, __builtin_return_address(0)); } EXPORT_SYMBOL(vmalloc_32); @@ -1457,7 +1460,8 @@ void *vmalloc_32_user(unsigned long size) struct vm_struct *area; void *ret; - ret = __vmalloc(size, GFP_VMALLOC32 | __GFP_ZERO, PAGE_KERNEL); + ret = __vmalloc_node(size, GFP_VMALLOC32 | __GFP_ZERO, PAGE_KERNEL, + -1, __builtin_return_address(0)); if (ret) { area = find_vm_area(ret); area->flags |= VM_USERMAP; -- cgit From e97a630eb0f5b8b380fd67504de6cedebb489003 Mon Sep 17 00:00:00 2001 From: Nick Piggin Date: Tue, 6 Jan 2009 14:39:19 -0800 Subject: mm: vmalloc use mutex for purge The vmalloc purge lock can be a mutex so we can sleep while a purge is going on (purge involves a global kernel TLB invalidate, so it can take quite a while). 
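For illustration only (not part of this patch): the generic lock-conversion pattern being applied here. When the critical section may sleep (a global kernel TLB flush can take a while), a mutex is the appropriate primitive, and opportunistic callers can still avoid blocking by using trylock. All names below are made up.

    static DEFINE_MUTEX(example_purge_lock);

    static void example_purge(bool sync)
    {
            if (!sync) {
                    /* Opportunistic caller: skip if a purge is in progress. */
                    if (!mutex_trylock(&example_purge_lock))
                            return;
            } else {
                    mutex_lock(&example_purge_lock);        /* may sleep */
            }

            /* ... potentially sleeping work, e.g. a global TLB flush ... */

            mutex_unlock(&example_purge_lock);
    }
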
Signed-off-by: Nick Piggin Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmalloc.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) (limited to 'mm') diff --git a/mm/vmalloc.c b/mm/vmalloc.c index b62ea569aa4..78689cba178 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -14,6 +14,7 @@ #include #include #include +#include #include #include #include @@ -473,7 +474,7 @@ static atomic_t vmap_lazy_nr = ATOMIC_INIT(0); static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end, int sync, int force_flush) { - static DEFINE_SPINLOCK(purge_lock); + static DEFINE_MUTEX(purge_lock); LIST_HEAD(valist); struct vmap_area *va; int nr = 0; @@ -484,10 +485,10 @@ static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end, * the case that isn't actually used at the moment anyway. */ if (!sync && !force_flush) { - if (!spin_trylock(&purge_lock)) + if (!mutex_trylock(&purge_lock)) return; } else - spin_lock(&purge_lock); + mutex_lock(&purge_lock); rcu_read_lock(); list_for_each_entry_rcu(va, &vmap_area_list, list) { @@ -519,7 +520,7 @@ static void __purge_vmap_area_lazy(unsigned long *start, unsigned long *end, __free_vmap_area(va); spin_unlock(&vmap_area_lock); } - spin_unlock(&purge_lock); + mutex_unlock(&purge_lock); } /* -- cgit From cd52858c73f9f7df859a08fb08496ca39b9b3d8d Mon Sep 17 00:00:00 2001 From: Nick Piggin Date: Tue, 6 Jan 2009 14:39:20 -0800 Subject: mm: vmalloc make lazy unmapping configurable Lazy unmapping in the vmalloc code has now opened the possibility for use after free bugs to go undetected. We can catch those by forcing an unmap and flush (which is going to be slow, but that's what happens). Signed-off-by: Nick Piggin Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmalloc.c | 24 ++++++++++++++++++++++++ 1 file changed, 24 insertions(+) (limited to 'mm') diff --git a/mm/vmalloc.c b/mm/vmalloc.c index 78689cba178..c5db9a7264d 100644 --- a/mm/vmalloc.c +++ b/mm/vmalloc.c @@ -434,6 +434,27 @@ static void unmap_vmap_area(struct vmap_area *va) vunmap_page_range(va->va_start, va->va_end); } +static void vmap_debug_free_range(unsigned long start, unsigned long end) +{ + /* + * Unmap page tables and force a TLB flush immediately if + * CONFIG_DEBUG_PAGEALLOC is set. This catches use after free + * bugs similarly to those in linear kernel virtual address + * space after a page has been freed. + * + * All the lazy freeing logic is still retained, in order to + * minimise intrusiveness of this debugging feature. + * + * This is going to be *slow* (linear kernel virtual address + * debugging doesn't do a broadcast TLB flush so it is a lot + * faster). + */ +#ifdef CONFIG_DEBUG_PAGEALLOC + vunmap_page_range(start, end); + flush_tlb_kernel_range(start, end); +#endif +} + /* * lazy_max_pages is the maximum amount of virtual address space we gather up * before attempting to purge with a TLB flush. 
@@ -914,6 +935,7 @@ void vm_unmap_ram(const void *mem, unsigned int count) BUG_ON(addr & (PAGE_SIZE-1)); debug_check_no_locks_freed(mem, size); + vmap_debug_free_range(addr, addr+size); if (likely(count <= VMAP_MAX_ALLOC)) vb_free(mem, size); @@ -1130,6 +1152,8 @@ struct vm_struct *remove_vm_area(const void *addr) if (va && va->flags & VM_VM_AREA) { struct vm_struct *vm = va->private; struct vm_struct *tmp, **p; + + vmap_debug_free_range(va->va_start, va->va_end); free_unmap_vmap_area(va); vm->size -= PAGE_SIZE; -- cgit From 38e0edb15bd07c6a0caf0cfe39f8f90bd98601b2 Mon Sep 17 00:00:00 2001 From: Jeremy Fitzhardinge Date: Tue, 6 Jan 2009 14:39:21 -0800 Subject: mm/apply_to_range: call pte function with lazy updates Make the pte-level function in apply_to_range be called in lazy mmu mode, so that any pagetable modifications can be batched. Signed-off-by: Jeremy Fitzhardinge Cc: Johannes Weiner Cc: Nick Piggin Cc: Venkatesh Pallipadi Cc: Hugh Dickins Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memory.c | 4 ++++ 1 file changed, 4 insertions(+) (limited to 'mm') diff --git a/mm/memory.c b/mm/memory.c index 99e8d5c7b31..b5af358b8b2 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1645,6 +1645,8 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd, BUG_ON(pmd_huge(*pmd)); + arch_enter_lazy_mmu_mode(); + token = pmd_pgtable(*pmd); do { @@ -1653,6 +1655,8 @@ static int apply_to_pte_range(struct mm_struct *mm, pmd_t *pmd, break; } while (pte++, addr += PAGE_SIZE, addr != end); + arch_leave_lazy_mmu_mode(); + if (mm != &init_mm) pte_unmap_unlock(pte-1, ptl); return err; -- cgit From 3c1d43787b48c798f44dc32a6e6deb5ca2da3e68 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Tue, 6 Jan 2009 14:39:23 -0800 Subject: mm: remove GFP_HIGHUSER_PAGECACHE GFP_HIGHUSER_PAGECACHE is just an alias for GFP_HIGHUSER_MOVABLE, making that harder to track down: remove it, and its out-of-work brothers GFP_NOFS_PAGECACHE and GFP_USER_PAGECACHE. Since we're making that improvement to hotremove_migrate_alloc(), I think we can now also remove one of the "o"s from its comment. Signed-off-by: Hugh Dickins Acked-by: Mel Gorman Cc: Nick Piggin Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memory_hotplug.c | 9 +++------ 1 file changed, 3 insertions(+), 6 deletions(-) (limited to 'mm') diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c index 2ba38bc07b4..c083cf5fd6d 100644 --- a/mm/memory_hotplug.c +++ b/mm/memory_hotplug.c @@ -627,15 +627,12 @@ int scan_lru_pages(unsigned long start, unsigned long end) } static struct page * -hotremove_migrate_alloc(struct page *page, - unsigned long private, - int **x) +hotremove_migrate_alloc(struct page *page, unsigned long private, int **x) { - /* This should be improoooooved!! */ - return alloc_page(GFP_HIGHUSER_PAGECACHE); + /* This should be improooooved!! */ + return alloc_page(GFP_HIGHUSER_MOVABLE); } - #define NR_OFFLINE_AT_ONCE_PAGES (256) static int do_migrate_range(unsigned long start_pfn, unsigned long end_pfn) -- cgit From 6d91add09f4bad5f4d4233b13faa392f0c4b16be Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Tue, 6 Jan 2009 14:39:24 -0800 Subject: mm: add Set,ClearPageSwapCache stubs If we add NOOP stubs for SetPageSwapCache() and ClearPageSwapCache(), then we can remove the #ifdef CONFIG_SWAPs from mm/migrate.c. 
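For illustration only: a sketch of the header-side idea this patch relies on, written with plain inline stubs. The real page-flags header generates its stubs via NOOP macros, so the exact form there differs; only the concept (no-op Set/Clear when CONFIG_SWAP is off, so callers need no #ifdefs) is taken from this series.

    #ifdef CONFIG_SWAP
    /* the usual PageSwapCache()/SetPageSwapCache()/ClearPageSwapCache() */
    #else
    static inline int PageSwapCache(struct page *page)        { return 0; }
    static inline void SetPageSwapCache(struct page *page)    { }
    static inline void ClearPageSwapCache(struct page *page)  { }
    #endif
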
Signed-off-by: Hugh Dickins Acked-by: Christoph Lameter Cc: Nick Piggin Cc: Mel Gorman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/migrate.c | 4 ---- 1 file changed, 4 deletions(-) (limited to 'mm') diff --git a/mm/migrate.c b/mm/migrate.c index 60510306c44..55373983c9c 100644 --- a/mm/migrate.c +++ b/mm/migrate.c @@ -300,12 +300,10 @@ static int migrate_page_move_mapping(struct address_space *mapping, * Now we know that no one else is looking at the page. */ get_page(newpage); /* add cache reference */ -#ifdef CONFIG_SWAP if (PageSwapCache(page)) { SetPageSwapCache(newpage); set_page_private(newpage, page_private(page)); } -#endif radix_tree_replace_slot(pslot, newpage); @@ -373,9 +371,7 @@ static void migrate_page_copy(struct page *newpage, struct page *page) mlock_migrate_page(newpage, page); -#ifdef CONFIG_SWAP ClearPageSwapCache(page); -#endif ClearPagePrivate(page); set_page_private(page, 0); /* page->mapping contains a flag for PageAnon() */ -- cgit From 51726b1222863852c46ca21ed0115b85d1edfd89 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Tue, 6 Jan 2009 14:39:25 -0800 Subject: mm: replace some BUG_ONs by VM_BUG_ONs The swap code is over-provisioned with BUG_ONs on assorted page flags, mostly dating back to 2.3. They're good documentation, and guard against developer error, but a waste of space on most systems: change them to VM_BUG_ONs, conditional on CONFIG_DEBUG_VM. Just delete the PagePrivate ones: they're later, from 2.5.69, but even less interesting now. Signed-off-by: Hugh Dickins Reviewed-by: Christoph Lameter Cc: Nick Piggin Cc: Mel Gorman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page_io.c | 4 ++-- mm/swap_state.c | 19 +++++++++---------- mm/swapfile.c | 8 +++----- 3 files changed, 14 insertions(+), 17 deletions(-) (limited to 'mm') diff --git a/mm/page_io.c b/mm/page_io.c index 065c4480eaf..d277a80efa7 100644 --- a/mm/page_io.c +++ b/mm/page_io.c @@ -125,8 +125,8 @@ int swap_readpage(struct file *file, struct page *page) struct bio *bio; int ret = 0; - BUG_ON(!PageLocked(page)); - BUG_ON(PageUptodate(page)); + VM_BUG_ON(!PageLocked(page)); + VM_BUG_ON(PageUptodate(page)); bio = get_swap_bio(GFP_KERNEL, page_private(page), page, end_swap_bio_read); if (bio == NULL) { diff --git a/mm/swap_state.c b/mm/swap_state.c index 3353c9029ce..e793fdea275 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -72,10 +72,10 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp_mask) { int error; - BUG_ON(!PageLocked(page)); - BUG_ON(PageSwapCache(page)); - BUG_ON(PagePrivate(page)); - BUG_ON(!PageSwapBacked(page)); + VM_BUG_ON(!PageLocked(page)); + VM_BUG_ON(PageSwapCache(page)); + VM_BUG_ON(!PageSwapBacked(page)); + error = radix_tree_preload(gfp_mask); if (!error) { page_cache_get(page); @@ -108,10 +108,9 @@ int add_to_swap_cache(struct page *page, swp_entry_t entry, gfp_t gfp_mask) */ void __delete_from_swap_cache(struct page *page) { - BUG_ON(!PageLocked(page)); - BUG_ON(!PageSwapCache(page)); - BUG_ON(PageWriteback(page)); - BUG_ON(PagePrivate(page)); + VM_BUG_ON(!PageLocked(page)); + VM_BUG_ON(!PageSwapCache(page)); + VM_BUG_ON(PageWriteback(page)); radix_tree_delete(&swapper_space.page_tree, page_private(page)); set_page_private(page, 0); @@ -134,8 +133,8 @@ int add_to_swap(struct page * page, gfp_t gfp_mask) swp_entry_t entry; int err; - BUG_ON(!PageLocked(page)); - BUG_ON(!PageUptodate(page)); + VM_BUG_ON(!PageLocked(page)); + VM_BUG_ON(!PageUptodate(page)); for (;;) { entry = get_swap_page(); diff --git 
a/mm/swapfile.c b/mm/swapfile.c index 54a9f87e516..214e90b9494 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -333,7 +333,7 @@ int can_share_swap_page(struct page *page) { int count; - BUG_ON(!PageLocked(page)); + VM_BUG_ON(!PageLocked(page)); count = page_mapcount(page); if (count <= 1 && PageSwapCache(page)) count += page_swapcount(page); @@ -350,8 +350,7 @@ static int remove_exclusive_swap_page_count(struct page *page, int count) struct swap_info_struct * p; swp_entry_t entry; - BUG_ON(PagePrivate(page)); - BUG_ON(!PageLocked(page)); + VM_BUG_ON(!PageLocked(page)); if (!PageSwapCache(page)) return 0; @@ -432,7 +431,6 @@ void free_swap_and_cache(swp_entry_t entry) if (page) { int one_user; - BUG_ON(PagePrivate(page)); one_user = (page_count(page) == 2); /* Only cache user (+us), or swap space full? Free it! */ /* Also recheck PageSwapCache after page is locked (above) */ @@ -1209,7 +1207,7 @@ int page_queue_congested(struct page *page) { struct backing_dev_info *bdi; - BUG_ON(!PageLocked(page)); /* It pins the swap_info_struct */ + VM_BUG_ON(!PageLocked(page)); /* It pins the swap_info_struct */ if (PageSwapCache(page)) { swp_entry_t entry = { .val = page_private(page) }; -- cgit From b5934c531849ff4a51ce0f290141efe564290e40 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Tue, 6 Jan 2009 14:39:25 -0800 Subject: mm: add_active_or_unevictable into rmap lru_cache_add_active_or_unevictable() and page_add_new_anon_rmap() always appear together. Save some symbol table space and some jumping around by removing lru_cache_add_active_or_unevictable(), folding its code into page_add_new_anon_rmap(): like how we add file pages to lru just after adding them to page cache. Remove the nearby "TODO: is this safe?" comments (yes, it is safe), and change page_add_new_anon_rmap()'s address BUG_ON to VM_BUG_ON as originally intended. Signed-off-by: Hugh Dickins Acked-by: Rik van Riel Cc: Lee Schermerhorn Cc: Nick Piggin Cc: Mel Gorman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memory.c | 6 ------ mm/rmap.c | 7 ++++++- mm/swap.c | 19 ------------------- 3 files changed, 6 insertions(+), 26 deletions(-) (limited to 'mm') diff --git a/mm/memory.c b/mm/memory.c index b5af358b8b2..a138c50dc39 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1949,10 +1949,7 @@ gotten: */ ptep_clear_flush_notify(vma, address, page_table); SetPageSwapBacked(new_page); - lru_cache_add_active_or_unevictable(new_page, vma); page_add_new_anon_rmap(new_page, vma, address); - -//TODO: is this safe? do_anonymous_page() does it this way. set_pte_at(mm, address, page_table, entry); update_mmu_cache(vma, address, entry); if (old_page) { @@ -2448,7 +2445,6 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma, goto release; inc_mm_counter(mm, anon_rss); SetPageSwapBacked(page); - lru_cache_add_active_or_unevictable(page, vma); page_add_new_anon_rmap(page, vma, address); set_pte_at(mm, address, page_table, entry); @@ -2597,7 +2593,6 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma, if (anon) { inc_mm_counter(mm, anon_rss); SetPageSwapBacked(page); - lru_cache_add_active_or_unevictable(page, vma); page_add_new_anon_rmap(page, vma, address); } else { inc_mm_counter(mm, file_rss); @@ -2607,7 +2602,6 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma, get_page(dirty_page); } } -//TODO: is this safe? do_anonymous_page() does it this way. 
set_pte_at(mm, address, page_table, entry); /* no need to invalidate: a not-present page won't be cached */ diff --git a/mm/rmap.c b/mm/rmap.c index f01e92244c5..10da68202b1 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -47,6 +47,7 @@ #include #include #include +#include #include #include #include @@ -671,9 +672,13 @@ void page_add_anon_rmap(struct page *page, void page_add_new_anon_rmap(struct page *page, struct vm_area_struct *vma, unsigned long address) { - BUG_ON(address < vma->vm_start || address >= vma->vm_end); + VM_BUG_ON(address < vma->vm_start || address >= vma->vm_end); atomic_set(&page->_mapcount, 0); /* elevate count by 1 (starts at -1) */ __page_set_anon_rmap(page, vma, address); + if (page_evictable(page, vma)) + lru_cache_add_lru(page, LRU_ACTIVE + page_is_file_cache(page)); + else + add_page_to_unevictable_list(page); } /** diff --git a/mm/swap.c b/mm/swap.c index 21a566fc457..ff0b290475f 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -246,25 +246,6 @@ void add_page_to_unevictable_list(struct page *page) spin_unlock_irq(&zone->lru_lock); } -/** - * lru_cache_add_active_or_unevictable - * @page: the page to be added to LRU - * @vma: vma in which page is mapped for determining reclaimability - * - * place @page on active or unevictable LRU list, depending on - * page_evictable(). Note that if the page is not evictable, - * it goes directly back onto it's zone's unevictable list. It does - * NOT use a per cpu pagevec. - */ -void lru_cache_add_active_or_unevictable(struct page *page, - struct vm_area_struct *vma) -{ - if (page_evictable(page, vma)) - lru_cache_add_lru(page, LRU_ACTIVE + page_is_file_cache(page)); - else - add_page_to_unevictable_list(page); -} - /* * Drain pages out of the cpu's pagevecs. * Either "cpu" is the current CPU, and preemption has already been -- cgit From 2afd1c928f1132b8d0099866e75ce8ad713a1180 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Tue, 6 Jan 2009 14:39:26 -0800 Subject: mm: make page_lock_anon_vma() static page_lock_anon_vma() and page_unlock_anon_vma() were made available to show_page_path() in vmscan.c; but now that has been removed, make them static in rmap.c again, they're better kept private if possible. Signed-off-by: Hugh Dickins Reviewed-by: KOSAKI Motohiro Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/rmap.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'mm') diff --git a/mm/rmap.c b/mm/rmap.c index 10da68202b1..892e1877366 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -193,7 +193,7 @@ void __init anon_vma_init(void) * Getting a lock on a stable anon_vma from a page off the LRU is * tricky: page_lock_anon_vma rely on RCU to guard against the races. */ -struct anon_vma *page_lock_anon_vma(struct page *page) +static struct anon_vma *page_lock_anon_vma(struct page *page) { struct anon_vma *anon_vma; unsigned long anon_mapping; @@ -213,7 +213,7 @@ out: return NULL; } -void page_unlock_anon_vma(struct anon_vma *anon_vma) +static void page_unlock_anon_vma(struct anon_vma *anon_vma) { spin_unlock(&anon_vma->lock); rcu_read_unlock(); -- cgit From cbf84b7add8103b92aaa84928e335df726bfc8da Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Tue, 6 Jan 2009 14:39:27 -0800 Subject: mm: further cleanup page_add_new_anon_rmap Moving lru_cache_add_active_or_unevictable() into page_add_new_anon_rmap() was good but stupid: we can and should SetPageSwapBacked() there too; and we know for sure that this anonymous, swap-backed page is not file cache. 
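From a fault handler's point of view, the net effect of this patch together with the previous one is roughly the following (illustrative summary, not a literal hunk from the tree):

	/* before */
	SetPageSwapBacked(page);
	lru_cache_add_active_or_unevictable(page, vma);
	page_add_new_anon_rmap(page, vma, address);
	set_pte_at(mm, address, page_table, entry);

	/* after: one call sets the flag, adds the rmap and picks the LRU */
	page_add_new_anon_rmap(page, vma, address);
	set_pte_at(mm, address, page_table, entry);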
Signed-off-by: Hugh Dickins Cc: Lee Schermerhorn Cc: Nick Piggin Acked-by: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memory.c | 3 --- mm/rmap.c | 6 +++--- 2 files changed, 3 insertions(+), 6 deletions(-) (limited to 'mm') diff --git a/mm/memory.c b/mm/memory.c index a138c50dc39..122d965e820 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1948,7 +1948,6 @@ gotten: * thread doing COW. */ ptep_clear_flush_notify(vma, address, page_table); - SetPageSwapBacked(new_page); page_add_new_anon_rmap(new_page, vma, address); set_pte_at(mm, address, page_table, entry); update_mmu_cache(vma, address, entry); @@ -2444,7 +2443,6 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma, if (!pte_none(*page_table)) goto release; inc_mm_counter(mm, anon_rss); - SetPageSwapBacked(page); page_add_new_anon_rmap(page, vma, address); set_pte_at(mm, address, page_table, entry); @@ -2592,7 +2590,6 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma, entry = maybe_mkwrite(pte_mkdirty(entry), vma); if (anon) { inc_mm_counter(mm, anon_rss); - SetPageSwapBacked(page); page_add_new_anon_rmap(page, vma, address); } else { inc_mm_counter(mm, file_rss); diff --git a/mm/rmap.c b/mm/rmap.c index 892e1877366..b1770b11a57 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -47,7 +47,6 @@ #include #include #include -#include #include #include #include @@ -673,10 +672,11 @@ void page_add_new_anon_rmap(struct page *page, struct vm_area_struct *vma, unsigned long address) { VM_BUG_ON(address < vma->vm_start || address >= vma->vm_end); - atomic_set(&page->_mapcount, 0); /* elevate count by 1 (starts at -1) */ + SetPageSwapBacked(page); + atomic_set(&page->_mapcount, 0); /* increment count (starts at -1) */ __page_set_anon_rmap(page, vma, address); if (page_evictable(page, vma)) - lru_cache_add_lru(page, LRU_ACTIVE + page_is_file_cache(page)); + lru_cache_add_lru(page, LRU_ACTIVE_ANON); else add_page_to_unevictable_list(page); } -- cgit From 58a01a45721bf7bd3a41a86248c3cb02a6b0c501 Mon Sep 17 00:00:00 2001 From: Julia Lawall Date: Tue, 6 Jan 2009 14:39:28 -0800 Subject: mm/page_alloc.c: eliminate NULL test and memset after alloc_bootmem As noted by Akinobu Mita in patch b1fceac2b9e04d278316b2faddf276015fc06e3b, alloc_bootmem and related functions never return NULL and always return a zeroed region of memory. Thus a NULL test or memset after calls to these functions is unnecessary. This was fixed using the following semantic patch. (http://www.emn.fr/x-info/coccinelle/) // @@ expression E; statement S; @@ E = \(alloc_bootmem\|alloc_bootmem_low\|alloc_bootmem_pages\|alloc_bootmem_low_pages\|alloc_bootmem_node\|alloc_bootmem_low_pages_node\|alloc_bootmem_pages_node\)(...) ... when != E ( - BUG_ON (E == NULL); | - if (E == NULL) S ) @@ expression E,E1; @@ E = \(alloc_bootmem\|alloc_bootmem_low\|alloc_bootmem_pages\|alloc_bootmem_low_pages\|alloc_bootmem_node\|alloc_bootmem_low_pages_node\|alloc_bootmem_pages_node\)(...) ... 
when != E - memset(E,0,E1); // Signed-off-by: Julia Lawall Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) (limited to 'mm') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 2f644c3e3da..ce2a026219b 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3381,10 +3381,8 @@ static void __init setup_usemap(struct pglist_data *pgdat, { unsigned long usemapsize = usemap_size(zonesize); zone->pageblock_flags = NULL; - if (usemapsize) { + if (usemapsize) zone->pageblock_flags = alloc_bootmem_node(pgdat, usemapsize); - memset(zone->pageblock_flags, 0, usemapsize); - } } #else static void inline setup_usemap(struct pglist_data *pgdat, -- cgit From 364aeb2849789b51bf4b9af2ddd02fee7285c54e Mon Sep 17 00:00:00 2001 From: David Rientjes Date: Tue, 6 Jan 2009 14:39:29 -0800 Subject: mm: change dirty limit type specifiers to unsigned long The background dirty and dirty limits are better defined with type specifiers of unsigned long since negative writeback thresholds are not possible. These values, as returned by get_dirty_limits(), are normally compared with ZVC values to determine whether writeback shall commence or be throttled. Such page counts cannot be negative, so declaring the page limits as signed is unnecessary. Acked-by: Peter Zijlstra Cc: Dave Chinner Cc: Christoph Lameter Signed-off-by: David Rientjes Cc: Andrea Righi Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/backing-dev.c | 6 +++--- mm/page-writeback.c | 22 +++++++++++----------- 2 files changed, 14 insertions(+), 14 deletions(-) (limited to 'mm') diff --git a/mm/backing-dev.c b/mm/backing-dev.c index 801c08b046e..6f80beddd8a 100644 --- a/mm/backing-dev.c +++ b/mm/backing-dev.c @@ -24,9 +24,9 @@ static void bdi_debug_init(void) static int bdi_debug_stats_show(struct seq_file *m, void *v) { struct backing_dev_info *bdi = m->private; - long background_thresh; - long dirty_thresh; - long bdi_thresh; + unsigned long background_thresh; + unsigned long dirty_thresh; + unsigned long bdi_thresh; get_dirty_limits(&background_thresh, &dirty_thresh, &bdi_thresh, bdi); diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 08d2b960b29..4d4074cff30 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -362,13 +362,13 @@ unsigned long determine_dirtyable_memory(void) } void -get_dirty_limits(long *pbackground, long *pdirty, long *pbdi_dirty, - struct backing_dev_info *bdi) +get_dirty_limits(unsigned long *pbackground, unsigned long *pdirty, + unsigned long *pbdi_dirty, struct backing_dev_info *bdi) { int background_ratio; /* Percentages */ int dirty_ratio; - long background; - long dirty; + unsigned long background; + unsigned long dirty; unsigned long available_memory = determine_dirtyable_memory(); struct task_struct *tsk; @@ -423,9 +423,9 @@ static void balance_dirty_pages(struct address_space *mapping) { long nr_reclaimable, bdi_nr_reclaimable; long nr_writeback, bdi_nr_writeback; - long background_thresh; - long dirty_thresh; - long bdi_thresh; + unsigned long background_thresh; + unsigned long dirty_thresh; + unsigned long bdi_thresh; unsigned long pages_written = 0; unsigned long write_chunk = sync_writeback_pages(); @@ -580,8 +580,8 @@ EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr); void throttle_vm_writeout(gfp_t gfp_mask) { - long background_thresh; - long dirty_thresh; + unsigned long background_thresh; + unsigned long dirty_thresh; for ( ; ; ) { get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL); @@ 
-624,8 +624,8 @@ static void background_writeout(unsigned long _min_pages) }; for ( ; ; ) { - long background_thresh; - long dirty_thresh; + unsigned long background_thresh; + unsigned long dirty_thresh; get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL); if (global_page_state(NR_FILE_DIRTY) + -- cgit From 2da02997e08d3efe8174c7a47696e6f7cbe69ba9 Mon Sep 17 00:00:00 2001 From: David Rientjes Date: Tue, 6 Jan 2009 14:39:31 -0800 Subject: mm: add dirty_background_bytes and dirty_bytes sysctls This change introduces two new sysctls to /proc/sys/vm: dirty_background_bytes and dirty_bytes. dirty_background_bytes is the counterpart to dirty_background_ratio and dirty_bytes is the counterpart to dirty_ratio. With growing memory capacities of individual machines, it's no longer sufficient to specify dirty thresholds as a percentage of the amount of dirtyable memory over the entire system. dirty_background_bytes and dirty_bytes specify quantities of memory, in bytes, that represent the dirty limits for the entire system. If either of these values is set, its value represents the amount of dirty memory that is needed to commence either background or direct writeback. When a `bytes' or `ratio' file is written, its counterpart becomes a function of the written value. For example, if dirty_bytes is written to be 8096, 8K of memory is required to commence direct writeback. dirty_ratio is then functionally equivalent to 8K / the amount of dirtyable memory: dirtyable_memory = free pages + mapped pages + file cache dirty_background_bytes = dirty_background_ratio * dirtyable_memory -or- dirty_background_ratio = dirty_background_bytes / dirtyable_memory AND dirty_bytes = dirty_ratio * dirtyable_memory -or- dirty_ratio = dirty_bytes / dirtyable_memory Only one of dirty_background_bytes and dirty_background_ratio may be specified at a time, and only one of dirty_bytes and dirty_ratio may be specified. When one sysctl is written, the other appears as 0 when read. The `bytes' files operate on a page size granularity since dirty limits are compared with ZVC values, which are in page units. Prior to this change, the minimum dirty_ratio was 5 as implemented by get_dirty_limits() although /proc/sys/vm/dirty_ratio would show any user written value between 0 and 100. This restriction is maintained, but dirty_bytes has a lower limit of only one page. Also prior to this change, the dirty_background_ratio could not equal or exceed dirty_ratio. This restriction is maintained in addition to restricting dirty_background_bytes. If either background threshold equals or exceeds that of the dirty threshold, it is implicitly set to half the dirty threshold. 
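A trivial userspace calculation makes the equivalence above concrete (the figures are invented for the example; the kernel itself works in page units inside get_dirty_limits(), as the diff further down shows):

	#include <stdio.h>

	int main(void)
	{
		unsigned long long dirtyable = 4ULL << 30;	/* assume 4 GB of dirtyable memory */
		unsigned long long dirty_bytes = 400ULL << 20;	/* as if vm.dirty_bytes = 400 MB */
		unsigned long long dirty_ratio = 10;		/* as if vm.dirty_ratio = 10 */

		/* dirty_ratio = dirty_bytes / dirtyable_memory */
		printf("dirty_bytes=400M over 4G behaves like dirty_ratio ~ %llu%%\n",
		       dirty_bytes * 100 / dirtyable);

		/* dirty_bytes = dirty_ratio * dirtyable_memory */
		printf("dirty_ratio=10%% of 4G allows ~%llu MB of dirty memory\n",
		       dirty_ratio * dirtyable / 100 >> 20);
		return 0;
	}

As described, only one member of each pair is authoritative at a time: writing a bytes value zeroes its ratio counterpart and vice versa, which is exactly what the new sysctl handlers below do.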
Acked-by: Peter Zijlstra Cc: Dave Chinner Cc: Christoph Lameter Signed-off-by: David Rientjes Cc: Andrea Righi Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page-writeback.c | 102 +++++++++++++++++++++++++++++++++++++++++++--------- 1 file changed, 86 insertions(+), 16 deletions(-) (limited to 'mm') diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 4d4074cff30..b493db7841d 100644 --- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -68,6 +68,12 @@ static inline long sync_writeback_pages(void) */ int dirty_background_ratio = 5; +/* + * dirty_background_bytes starts at 0 (disabled) so that it is a function of + * dirty_background_ratio * the amount of dirtyable memory + */ +unsigned long dirty_background_bytes; + /* * free highmem will not be subtracted from the total free memory * for calculating free ratios if vm_highmem_is_dirtyable is true @@ -79,6 +85,12 @@ int vm_highmem_is_dirtyable; */ int vm_dirty_ratio = 10; +/* + * vm_dirty_bytes starts at 0 (disabled) so that it is a function of + * vm_dirty_ratio * the amount of dirtyable memory + */ +unsigned long vm_dirty_bytes; + /* * The interval between `kupdate'-style writebacks, in jiffies */ @@ -135,23 +147,75 @@ static int calc_period_shift(void) { unsigned long dirty_total; - dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) / 100; + if (vm_dirty_bytes) + dirty_total = vm_dirty_bytes / PAGE_SIZE; + else + dirty_total = (vm_dirty_ratio * determine_dirtyable_memory()) / + 100; return 2 + ilog2(dirty_total - 1); } /* - * update the period when the dirty ratio changes. + * update the period when the dirty threshold changes. */ +static void update_completion_period(void) +{ + int shift = calc_period_shift(); + prop_change_shift(&vm_completions, shift); + prop_change_shift(&vm_dirties, shift); +} + +int dirty_background_ratio_handler(struct ctl_table *table, int write, + struct file *filp, void __user *buffer, size_t *lenp, + loff_t *ppos) +{ + int ret; + + ret = proc_dointvec_minmax(table, write, filp, buffer, lenp, ppos); + if (ret == 0 && write) + dirty_background_bytes = 0; + return ret; +} + +int dirty_background_bytes_handler(struct ctl_table *table, int write, + struct file *filp, void __user *buffer, size_t *lenp, + loff_t *ppos) +{ + int ret; + + ret = proc_doulongvec_minmax(table, write, filp, buffer, lenp, ppos); + if (ret == 0 && write) + dirty_background_ratio = 0; + return ret; +} + int dirty_ratio_handler(struct ctl_table *table, int write, struct file *filp, void __user *buffer, size_t *lenp, loff_t *ppos) { int old_ratio = vm_dirty_ratio; - int ret = proc_dointvec_minmax(table, write, filp, buffer, lenp, ppos); + int ret; + + ret = proc_dointvec_minmax(table, write, filp, buffer, lenp, ppos); if (ret == 0 && write && vm_dirty_ratio != old_ratio) { - int shift = calc_period_shift(); - prop_change_shift(&vm_completions, shift); - prop_change_shift(&vm_dirties, shift); + update_completion_period(); + vm_dirty_bytes = 0; + } + return ret; +} + + +int dirty_bytes_handler(struct ctl_table *table, int write, + struct file *filp, void __user *buffer, size_t *lenp, + loff_t *ppos) +{ + int old_bytes = vm_dirty_bytes; + int ret; + + ret = proc_doulongvec_minmax(table, write, filp, buffer, lenp, ppos); + if (ret == 0 && write && vm_dirty_bytes != old_bytes) { + update_completion_period(); + vm_dirty_ratio = 0; } return ret; } @@ -365,23 +429,29 @@ void get_dirty_limits(unsigned long *pbackground, unsigned long *pdirty, unsigned long *pbdi_dirty, struct backing_dev_info *bdi) { - int 
background_ratio; /* Percentages */ - int dirty_ratio; unsigned long background; unsigned long dirty; unsigned long available_memory = determine_dirtyable_memory(); struct task_struct *tsk; - dirty_ratio = vm_dirty_ratio; - if (dirty_ratio < 5) - dirty_ratio = 5; + if (vm_dirty_bytes) + dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE); + else { + int dirty_ratio; - background_ratio = dirty_background_ratio; - if (background_ratio >= dirty_ratio) - background_ratio = dirty_ratio / 2; + dirty_ratio = vm_dirty_ratio; + if (dirty_ratio < 5) + dirty_ratio = 5; + dirty = (dirty_ratio * available_memory) / 100; + } + + if (dirty_background_bytes) + background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE); + else + background = (dirty_background_ratio * available_memory) / 100; - background = (background_ratio * available_memory) / 100; - dirty = (dirty_ratio * available_memory) / 100; + if (background >= dirty) + background = dirty / 2; tsk = current; if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) { background += background / 4; -- cgit From 878b63ac889df706d01048f2c110e322ad2f996d Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Tue, 6 Jan 2009 14:39:32 -0800 Subject: mm: gup persist for write permission do_wp_page()'s VM_FAULT_WRITE return value tells __get_user_pages() that COW has been done if necessary, though it may be leaving the pte without write permission - for the odd case of forced writing to a readonly vma for ptrace. At present GUP then retries the follow_page() without asking for write permission, to escape an endless loop when forced. But an application may be relying on GUP to guarantee a writable page which won't be COWed again when written from userspace, whereas a race here might leave a readonly pte in place? Change the VM_FAULT_WRITE handling to ask follow_page() for write permission again, except in that odd case of forced writing to a readonly vma. Signed-off-by: Hugh Dickins Cc: Lee Schermerhorn Cc: Rik van Riel Cc: Nick Piggin Cc: KAMEZAWA Hiroyuki Cc: Robin Holt Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memory.c | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-) (limited to 'mm') diff --git a/mm/memory.c b/mm/memory.c index 122d965e820..f594bb65a9f 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1264,9 +1264,15 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm, * do_wp_page has broken COW when necessary, * even if maybe_mkwrite decided not to set * pte_write. We can thus safely do subsequent - * page lookups as if they were reads. + * page lookups as if they were reads. But only + * do so when looping for pte_write is futile: + * in some cases userspace may also be wanting + * to write to the gotten user page, which a + * read fault here might prevent (a readonly + * page might get reCOWed by userspace write). */ - if (ret & VM_FAULT_WRITE) + if ((ret & VM_FAULT_WRITE) && + !(vma->vm_flags & VM_WRITE)) foll_flags &= ~FOLL_WRITE; cond_resched(); -- cgit From ab967d86015a19777955370deebc8262d50fed63 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Tue, 6 Jan 2009 14:39:33 -0800 Subject: mm: wp lock page before deciding cow An application may rely on get_user_pages() to give it pages writable from userspace and shared with a driver, GUP breaking COW if necessary. It may mprotect() the pages' writability, off and on, from time to time. 
Normally this works fine (so long as the app does not fork); but just occasionally, under memory pressure, a readonly pte in a newly writable area is COWed unnecessarily, breaking the link with the driver: because do_wp_page() does trylock_page, and falls back to COW whenever that fails. For reliable behaviour in the unshared case, when the trylock_page fails, now unlock pagetable, lock page and relock pagetable, before deciding whether Copy-On-Write is really necessary. Reported-by: Zhou Yingchao Signed-off-by: Hugh Dickins Cc: Lee Schermerhorn Cc: Rik van Riel Cc: Nick Piggin Cc: KAMEZAWA Hiroyuki Cc: Robin Holt Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memory.c | 17 ++++++++++++++--- 1 file changed, 14 insertions(+), 3 deletions(-) (limited to 'mm') diff --git a/mm/memory.c b/mm/memory.c index f594bb65a9f..3922ffcf3df 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1848,10 +1848,21 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, * not dirty accountable. */ if (PageAnon(old_page)) { - if (trylock_page(old_page)) { - reuse = can_share_swap_page(old_page); - unlock_page(old_page); + if (!trylock_page(old_page)) { + page_cache_get(old_page); + pte_unmap_unlock(page_table, ptl); + lock_page(old_page); + page_table = pte_offset_map_lock(mm, pmd, address, + &ptl); + if (!pte_same(*page_table, orig_pte)) { + unlock_page(old_page); + page_cache_release(old_page); + goto unlock; + } + page_cache_release(old_page); } + reuse = can_share_swap_page(old_page); + unlock_page(old_page); } else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) == (VM_WRITE|VM_SHARED))) { /* -- cgit From 7b1fe59793e61f826bef053107b57b23954833bb Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Tue, 6 Jan 2009 14:39:34 -0800 Subject: mm: reuse_swap_page replaces can_share_swap_page A good place to free up old swap is where do_wp_page(), or do_swap_page(), is about to redirty the page: the data on disk is then stale and won't be read again; and if we do decide to write the page out later, using the previous swap location makes an unnecessary disk seek very likely. So give can_share_swap_page() the side-effect of delete_from_swap_cache() when it safely can. And can_share_swap_page() was always a misleading name, the more so if it has a side-effect: rename it reuse_swap_page(). Irrelevant cleanup nearby: remove swap_token_default_timeout definition from swap.h: it's used nowhere. 
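The calling convention is unchanged from can_share_swap_page() -- page locked, non-zero return means the caller may reuse the page for a write without COW -- only the side effect is new. A hedged sketch of the contract as seen by a caller:

	lock_page(page);
	if (reuse_swap_page(page)) {
		/*
		 * We are the only user: the pte can simply be made
		 * writable.  If the page had a swap slot that nothing
		 * else referenced, it has just been freed, so any later
		 * writeout will allocate a fresh, well-clustered slot.
		 */
	}
	unlock_page(page);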
Signed-off-by: Hugh Dickins Cc: Lee Schermerhorn Acked-by: Rik van Riel Cc: Nick Piggin Cc: KAMEZAWA Hiroyuki Cc: Robin Holt Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memory.c | 4 ++-- mm/swapfile.c | 15 +++++++++++---- 2 files changed, 13 insertions(+), 6 deletions(-) (limited to 'mm') diff --git a/mm/memory.c b/mm/memory.c index 3922ffcf3df..8f471edcb98 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1861,7 +1861,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma, } page_cache_release(old_page); } - reuse = can_share_swap_page(old_page); + reuse = reuse_swap_page(old_page); unlock_page(old_page); } else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) == (VM_WRITE|VM_SHARED))) { @@ -2392,7 +2392,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma, inc_mm_counter(mm, anon_rss); pte = mk_pte(page, vma->vm_page_prot); - if (write_access && can_share_swap_page(page)) { + if (write_access && reuse_swap_page(page)) { pte = maybe_mkwrite(pte_mkdirty(pte), vma); write_access = 0; } diff --git a/mm/swapfile.c b/mm/swapfile.c index 214e90b9494..bfd4ee59cb8 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -326,17 +326,24 @@ static inline int page_swapcount(struct page *page) } /* - * We can use this swap cache entry directly - * if there are no other references to it. + * We can write to an anon page without COW if there are no other references + * to it. And as a side-effect, free up its swap: because the old content + * on disk will never be read, and seeking back there to write new content + * later would only waste time away from clustering. */ -int can_share_swap_page(struct page *page) +int reuse_swap_page(struct page *page) { int count; VM_BUG_ON(!PageLocked(page)); count = page_mapcount(page); - if (count <= 1 && PageSwapCache(page)) + if (count <= 1 && PageSwapCache(page)) { count += page_swapcount(page); + if (count == 1 && !PageWriteback(page)) { + delete_from_swap_cache(page); + SetPageDirty(page); + } + } return count == 1; } -- cgit From a2c43eed8334e878702fca713b212ae2a11d84b9 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Tue, 6 Jan 2009 14:39:36 -0800 Subject: mm: try_to_free_swap replaces remove_exclusive_swap_page remove_exclusive_swap_page(): its problem is in living up to its name. It doesn't matter if someone else has a reference to the page (raised page_count); it doesn't matter if the page is mapped into userspace (raised page_mapcount - though that hints it may be worth keeping the swap): all that matters is that there be no more references to the swap (and no writeback in progress). swapoff (try_to_unuse) has been removing pages from swapcache for years, with no concern for page count or page mapcount, and we used to have a comment in lookup_swap_cache() recognizing that: if you go for a page of swapcache, you'll get the right page, but it could have been removed from swapcache by the time you get page lock. So, give up asking for exclusivity: get rid of remove_exclusive_swap_page(), and remove_exclusive_swap_page_ref() and remove_exclusive_swap_page_count() which were spawned for the recent LRU work: replace them by the simpler try_to_free_swap() which just checks page_swapcount(). Similarly, remove the page_count limitation from free_swap_and_count(), but assume that it's worth holding on to the swap if page is mapped and swap nowhere near full. Add a vm_swap_full() test in free_swap_cache()? It would be consistent, but I think we probably have enough for now. 
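Summarising the test that actually changes (a hedged paraphrase of the old and new code shown in the diff below):

	/*
	 * old, remove_exclusive_swap_page_count(page, count):
	 *	page locked, in swap cache, not under writeback,
	 *	page_count(page) == count   (us + cache + mappers), and
	 *	the swap map shows the cache as the only swap reference.
	 *
	 * new, try_to_free_swap(page):
	 *	page locked, in swap cache, not under writeback, and
	 *	page_swapcount(page) == 0   (the cache is the only user
	 *	of the swap entry).
	 *
	 * Raised page_count or page_mapcount -- pagevec references,
	 * speculative gets, userspace mappings -- no longer block
	 * freeing the swap slot.
	 */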
Signed-off-by: Hugh Dickins Cc: Lee Schermerhorn Cc: Rik van Riel Cc: Nick Piggin Cc: KAMEZAWA Hiroyuki Cc: Robin Holt Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memory.c | 2 +- mm/page_io.c | 2 +- mm/swap.c | 3 +-- mm/swap_state.c | 8 +++---- mm/swapfile.c | 70 ++++++++++----------------------------------------------- mm/vmscan.c | 2 +- 6 files changed, 20 insertions(+), 67 deletions(-) (limited to 'mm') diff --git a/mm/memory.c b/mm/memory.c index 8f471edcb98..1a83fe5339a 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2403,7 +2403,7 @@ static int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma, swap_free(entry); if (vm_swap_full() || (vma->vm_flags & VM_LOCKED) || PageMlocked(page)) - remove_exclusive_swap_page(page); + try_to_free_swap(page); unlock_page(page); if (write_access) { diff --git a/mm/page_io.c b/mm/page_io.c index d277a80efa7..dc6ce0afbde 100644 --- a/mm/page_io.c +++ b/mm/page_io.c @@ -98,7 +98,7 @@ int swap_writepage(struct page *page, struct writeback_control *wbc) struct bio *bio; int ret = 0, rw = WRITE; - if (remove_exclusive_swap_page(page)) { + if (try_to_free_swap(page)) { unlock_page(page); goto out; } diff --git a/mm/swap.c b/mm/swap.c index ff0b290475f..ba2c0e8b8b5 100644 --- a/mm/swap.c +++ b/mm/swap.c @@ -454,8 +454,7 @@ void pagevec_swap_free(struct pagevec *pvec) struct page *page = pvec->pages[i]; if (PageSwapCache(page) && trylock_page(page)) { - if (PageSwapCache(page)) - remove_exclusive_swap_page_ref(page); + try_to_free_swap(page); unlock_page(page); } } diff --git a/mm/swap_state.c b/mm/swap_state.c index e793fdea275..bcb47276929 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -195,14 +195,14 @@ void delete_from_swap_cache(struct page *page) * If we are the only user, then try to free up the swap cache. * * Its ok to check for PageSwapCache without the page lock - * here because we are going to recheck again inside - * exclusive_swap_page() _with_ the lock. + * here because we are going to recheck again inside + * try_to_free_swap() _with_ the lock. * - Marcelo */ static inline void free_swap_cache(struct page *page) { - if (PageSwapCache(page) && trylock_page(page)) { - remove_exclusive_swap_page(page); + if (PageSwapCache(page) && !page_mapped(page) && trylock_page(page)) { + try_to_free_swap(page); unlock_page(page); } } diff --git a/mm/swapfile.c b/mm/swapfile.c index bfd4ee59cb8..f4360182760 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -348,68 +348,23 @@ int reuse_swap_page(struct page *page) } /* - * Work out if there are any other processes sharing this - * swap cache page. Free it if you can. Return success. + * If swap is getting full, or if there are no more mappings of this page, + * then try_to_free_swap is called to free its swap space. */ -static int remove_exclusive_swap_page_count(struct page *page, int count) +int try_to_free_swap(struct page *page) { - int retval; - struct swap_info_struct * p; - swp_entry_t entry; - VM_BUG_ON(!PageLocked(page)); if (!PageSwapCache(page)) return 0; if (PageWriteback(page)) return 0; - if (page_count(page) != count) /* us + cache + ptes */ - return 0; - - entry.val = page_private(page); - p = swap_info_get(entry); - if (!p) + if (page_swapcount(page)) return 0; - /* Is the only swap cache user the cache itself? */ - retval = 0; - if (p->swap_map[swp_offset(entry)] == 1) { - /* Recheck the page count with the swapcache lock held.. 
*/ - spin_lock_irq(&swapper_space.tree_lock); - if ((page_count(page) == count) && !PageWriteback(page)) { - __delete_from_swap_cache(page); - SetPageDirty(page); - retval = 1; - } - spin_unlock_irq(&swapper_space.tree_lock); - } - spin_unlock(&swap_lock); - - if (retval) { - swap_free(entry); - page_cache_release(page); - } - - return retval; -} - -/* - * Most of the time the page should have two references: one for the - * process and one for the swap cache. - */ -int remove_exclusive_swap_page(struct page *page) -{ - return remove_exclusive_swap_page_count(page, 2); -} - -/* - * The pageout code holds an extra reference to the page. That raises - * the reference count to test for to 2 for a page that is only in the - * swap cache plus 1 for each process that maps the page. - */ -int remove_exclusive_swap_page_ref(struct page *page) -{ - return remove_exclusive_swap_page_count(page, 2 + page_mapcount(page)); + delete_from_swap_cache(page); + SetPageDirty(page); + return 1; } /* @@ -436,13 +391,12 @@ void free_swap_and_cache(swp_entry_t entry) spin_unlock(&swap_lock); } if (page) { - int one_user; - - one_user = (page_count(page) == 2); - /* Only cache user (+us), or swap space full? Free it! */ - /* Also recheck PageSwapCache after page is locked (above) */ + /* + * Not mapped elsewhere, or swap space full? Free it! + * Also recheck PageSwapCache now page is locked (above). + */ if (PageSwapCache(page) && !PageWriteback(page) && - (one_user || vm_swap_full())) { + (!page_mapped(page) || vm_swap_full())) { delete_from_swap_cache(page); SetPageDirty(page); } diff --git a/mm/vmscan.c b/mm/vmscan.c index d196f46c880..c8601dd3660 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -759,7 +759,7 @@ cull_mlocked: activate_locked: /* Not a candidate for swapping, so reclaim swap space. */ if (PageSwapCache(page) && vm_swap_full()) - remove_exclusive_swap_page_ref(page); + try_to_free_swap(page); VM_BUG_ON(PageActive(page)); SetPageActive(page); pgactivate++; -- cgit From 68bdc8d64742ccc5e340c5d122ebbab3f0cf2a74 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Tue, 6 Jan 2009 14:39:37 -0800 Subject: mm: try_to_unuse check removing right swap There's a possible race in try_to_unuse() which Nick Piggin led me to two years ago. Where it does lock_page() after read_swap_cache_async(), what if another task removed that page from swapcache just before we locked it? It would sail though the (*swap_map > 1) tests doing nothing (because it could not have been removed from swapcache before its swap references were gone), until it reaches the delete_from_swap_cache(page) near the bottom. Now imagine that this page has been allocated to swap on a different swap area while we dropped page lock (perhaps at the top, perhaps in unuse_mm): we could wrongly remove from swap cache before the page has been written to swap, so a subsequent do_swap_page() would read in stale data from swap. I think this case could not happen before: remove_exclusive_swap_page() refused while page count was raised. But now with reuse_swap_page() and try_to_free_swap() removing from swap cache without minding page count, I think it could happen - the previous patch argued that it was safe because try_to_unuse() already ignored page count, but overlooked that it might be breaking the assumptions in try_to_unuse() itself. 
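The interleaving being guarded against can be sketched as follows (a hedged reconstruction of the scenario described above, not code from the tree):

	/*
	 *   try_to_unuse()  [swap area A]       another task
	 *   -----------------------------       -----------------------------
	 *   page = read_swap_cache_async(entry)
	 *   ...page lock not yet taken...
	 *                                        frees the swap and removes the
	 *                                        page from swap cache (e.g. via
	 *                                        reuse_swap_page())
	 *                                        page later gets a slot on
	 *                                        swap area B
	 *   lock_page(page);
	 *   delete_from_swap_cache(page);        <-- would discard the area-B
	 *                                            entry before the page was
	 *                                            ever written there
	 *
	 * Hence the extra check: only delete when page_private(page) still
	 * equals entry.val, i.e. the page is still bound to the entry we
	 * are unusing.
	 */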
Signed-off-by: Hugh Dickins Cc: Lee Schermerhorn Cc: Rik van Riel Cc: Nick Piggin Cc: KAMEZAWA Hiroyuki Cc: Robin Holt Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/swapfile.c | 11 ++++++++++- 1 file changed, 10 insertions(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/swapfile.c b/mm/swapfile.c index f4360182760..9ce7f81c8ab 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -889,7 +889,16 @@ static int try_to_unuse(unsigned int type) lock_page(page); wait_on_page_writeback(page); } - if (PageSwapCache(page)) + + /* + * It is conceivable that a racing task removed this page from + * swap cache just before we acquired the page lock at the top, + * or while we dropped it in unuse_mm(). The page might even + * be back in swap cache on another swap area: that we must not + * delete, since it may not have been written out to swap yet. + */ + if (PageSwapCache(page) && + likely(page_private(page) == entry.val)) delete_from_swap_cache(page); /* -- cgit From 63d6c5ad7fc27455ce5cb4706884671fb7e0df08 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Tue, 6 Jan 2009 14:39:38 -0800 Subject: mm: remove try_to_munlock from vmscan An unfortunate feature of the Unevictable LRU work was that reclaiming an anonymous page involved an extra scan through the anon_vma: to check that the page is evictable before allocating swap, because the swap could not be freed reliably soon afterwards. Now try_to_free_swap() has replaced remove_exclusive_swap_page(), that's not an issue any more: remove try_to_munlock() call from shrink_page_list(), leaving it to try_to_munmap() to discover if the page is one to be culled to the unevictable list - in which case then try_to_free_swap(). Update unevictable-lru.txt to remove comments on the try_to_munlock() in shrink_page_list(), and shorten some lines over 80 columns. Signed-off-by: Hugh Dickins Cc: Lee Schermerhorn Acked-by: Rik van Riel Cc: Nick Piggin Cc: KAMEZAWA Hiroyuki Cc: Robin Holt Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmscan.c | 11 ++--------- 1 file changed, 2 insertions(+), 9 deletions(-) (limited to 'mm') diff --git a/mm/vmscan.c b/mm/vmscan.c index c8601dd3660..74f875733e2 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -625,15 +625,6 @@ static unsigned long shrink_page_list(struct list_head *page_list, if (PageAnon(page) && !PageSwapCache(page)) { if (!(sc->gfp_mask & __GFP_IO)) goto keep_locked; - switch (try_to_munlock(page)) { - case SWAP_FAIL: /* shouldn't happen */ - case SWAP_AGAIN: - goto keep_locked; - case SWAP_MLOCK: - goto cull_mlocked; - case SWAP_SUCCESS: - ; /* fall thru'; add to swap cache */ - } if (!add_to_swap(page, GFP_ATOMIC)) goto activate_locked; may_enter_fs = 1; @@ -752,6 +743,8 @@ free_it: continue; cull_mlocked: + if (PageSwapCache(page)) + try_to_free_swap(page); unlock_page(page); putback_lru_page(page); continue; -- cgit From ac47b003d03c2a4f28aef1d505b66d24ad191c4f Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Tue, 6 Jan 2009 14:39:39 -0800 Subject: mm: remove gfp_mask from add_to_swap Remove gfp_mask argument from add_to_swap(): it's misleading because its only caller, shrink_page_list(), is not atomic at that point; and in due course (implementing discard) we'll sometimes want to allocate some memory with GFP_NOIO (as is used in swap_writepage) when allocating swap. No change to the gfp_mask passed down to add_to_swap_cache(): still use __GFP_HIGH without __GFP_WAIT (with nomemalloc and nowarn as before): though it's not obvious if that's the best combination to ask for here. 
Signed-off-by: Hugh Dickins Cc: Lee Schermerhorn Cc: Rik van Riel Cc: Nick Piggin Cc: KAMEZAWA Hiroyuki Cc: Robin Holt Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/swap_state.c | 4 ++-- mm/vmscan.c | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) (limited to 'mm') diff --git a/mm/swap_state.c b/mm/swap_state.c index bcb47276929..81c825f67a7 100644 --- a/mm/swap_state.c +++ b/mm/swap_state.c @@ -128,7 +128,7 @@ void __delete_from_swap_cache(struct page *page) * Allocate swap space for the page and add the page to the * swap cache. Caller needs to hold the page lock. */ -int add_to_swap(struct page * page, gfp_t gfp_mask) +int add_to_swap(struct page *page) { swp_entry_t entry; int err; @@ -153,7 +153,7 @@ int add_to_swap(struct page * page, gfp_t gfp_mask) * Add it to the swap cache and mark it dirty */ err = add_to_swap_cache(page, entry, - gfp_mask|__GFP_NOMEMALLOC|__GFP_NOWARN); + __GFP_HIGH|__GFP_NOMEMALLOC|__GFP_NOWARN); switch (err) { case 0: /* Success */ diff --git a/mm/vmscan.c b/mm/vmscan.c index 74f875733e2..cc7401571cb 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -625,7 +625,7 @@ static unsigned long shrink_page_list(struct list_head *page_list, if (PageAnon(page) && !PageSwapCache(page)) { if (!(sc->gfp_mask & __GFP_IO)) goto keep_locked; - if (!add_to_swap(page, GFP_ATOMIC)) + if (!add_to_swap(page)) goto activate_locked; may_enter_fs = 1; } -- cgit From 60371d971a3d01afd102f0bbf2681f32ecc31d78 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Tue, 6 Jan 2009 14:39:40 -0800 Subject: mm: add add_to_swap stub If we add a failing stub for add_to_swap(), then we can remove the #ifdef CONFIG_SWAP from mm/vmscan.c. This was intended as a source cleanup, but looking more closely, it turns out that the !CONFIG_SWAP case was going to keep_locked for an anonymous page, whereas now it goes to the more suitable activate_locked, like the CONFIG_SWAP nr_swap_pages 0 case. Signed-off-by: Hugh Dickins Cc: Lee Schermerhorn Acked-by: Rik van Riel Cc: Nick Piggin Cc: KAMEZAWA Hiroyuki Cc: Robin Holt Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmscan.c | 2 -- 1 file changed, 2 deletions(-) (limited to 'mm') diff --git a/mm/vmscan.c b/mm/vmscan.c index cc7401571cb..f350523a8ee 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -617,7 +617,6 @@ static unsigned long shrink_page_list(struct list_head *page_list, referenced && page_mapping_inuse(page)) goto activate_locked; -#ifdef CONFIG_SWAP /* * Anonymous process memory has backing store? * Try to allocate it some swap space here. @@ -629,7 +628,6 @@ static unsigned long shrink_page_list(struct list_head *page_list, goto activate_locked; may_enter_fs = 1; } -#endif /* CONFIG_SWAP */ mapping = page_mapping(page); -- cgit From b962716b459505a8d83aea313fea0abe76749f42 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Tue, 6 Jan 2009 14:39:41 -0800 Subject: mm: optimize get_scan_ratio for no swap Rik suggests a simplified get_scan_ratio() for !CONFIG_SWAP. Yes, the gcc optimizer gives us that, when nr_swap_pages is #defined as 0L. Move usual declaration to swapfile.c: it never belonged in page_alloc.c. 
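Spelled out, the optimisation relies on the swapless build seeing nr_swap_pages as a compile-time constant -- the "#defined as 0L" mentioned above, presumably in the !CONFIG_SWAP section of swap.h. A sketch of how gcc then prunes get_scan_ratio():

	#define nr_swap_pages	0L		/* assumed !CONFIG_SWAP stub */

	if (nr_swap_pages <= 0) {		/* constant-true without swap */
		percent[0] = 0;			/* never scan anon pages */
		percent[1] = 100;		/* scan only file-backed pages */
		return;
	}
	/*
	 * With the anon/file/free accounting moved below this early exit
	 * (as the diff below does), everything after it is dead code for
	 * swapless kernels and the compiler simply drops it.
	 */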
Signed-off-by: Hugh Dickins Cc: Lee Schermerhorn Acked-by: Rik van Riel Cc: Nick Piggin Cc: KAMEZAWA Hiroyuki Cc: Robin Holt Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 1 - mm/swapfile.c | 1 + mm/vmscan.c | 12 ++++++------ 3 files changed, 7 insertions(+), 7 deletions(-) (limited to 'mm') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index ce2a026219b..65133436fe3 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -69,7 +69,6 @@ EXPORT_SYMBOL(node_states); unsigned long totalram_pages __read_mostly; unsigned long totalreserve_pages __read_mostly; -long nr_swap_pages; int percpu_pagelist_fraction; #ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE diff --git a/mm/swapfile.c b/mm/swapfile.c index 9ce7f81c8ab..725e56c362d 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -35,6 +35,7 @@ static DEFINE_SPINLOCK(swap_lock); static unsigned int nr_swapfiles; +long nr_swap_pages; long total_swap_pages; static int swap_overflow; static int least_priority; diff --git a/mm/vmscan.c b/mm/vmscan.c index f350523a8ee..d500b906746 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1327,12 +1327,6 @@ static void get_scan_ratio(struct zone *zone, struct scan_control *sc, unsigned long anon_prio, file_prio; unsigned long ap, fp; - anon = zone_page_state(zone, NR_ACTIVE_ANON) + - zone_page_state(zone, NR_INACTIVE_ANON); - file = zone_page_state(zone, NR_ACTIVE_FILE) + - zone_page_state(zone, NR_INACTIVE_FILE); - free = zone_page_state(zone, NR_FREE_PAGES); - /* If we have no swap space, do not bother scanning anon pages. */ if (nr_swap_pages <= 0) { percent[0] = 0; @@ -1340,6 +1334,12 @@ static void get_scan_ratio(struct zone *zone, struct scan_control *sc, return; } + anon = zone_page_state(zone, NR_ACTIVE_ANON) + + zone_page_state(zone, NR_INACTIVE_ANON); + file = zone_page_state(zone, NR_ACTIVE_FILE) + + zone_page_state(zone, NR_INACTIVE_FILE); + free = zone_page_state(zone, NR_FREE_PAGES); + /* If we have very few page cache pages, force-scan anon pages. */ if (unlikely(file + free <= zone->pages_high)) { percent[0] = 100; -- cgit From 077cbc5864cd9188fa4c4e181e48ff58317e6400 Mon Sep 17 00:00:00 2001 From: KOSAKI Motohiro Date: Tue, 6 Jan 2009 14:39:42 -0800 Subject: memcg: reclaim shouldn't change zone->recent_rotated statistics memcg reclaim shouldn't change zone->recent_rotated statistics. If memcgroup reclaim changes zone statistics, global reclaim can get a bit confused. Signed-off-by: KOSAKI Motohiro Acked-by: Rik van Riel Cc: Balbir Singh Cc: KAMEZAWA Hiroyuki Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmscan.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/vmscan.c b/mm/vmscan.c index d500b906746..da7c3a2304a 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1246,7 +1246,8 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone, * This helps balance scan pressure between file and anonymous * pages in get_scan_ratio. */ - zone->recent_rotated[!!file] += pgmoved; + if (scan_global_lru(sc)) + zone->recent_rotated[!!file] += pgmoved; /* * Move the pages to the [file or anon] inactive list. -- cgit From feb166948876e2ff8f70b2da273b2a8e86957578 Mon Sep 17 00:00:00 2001 From: KOSAKI Motohiro Date: Tue, 6 Jan 2009 14:39:43 -0800 Subject: mm: make init_section_page_cgroup() static Sparse output following warning. mm/page_cgroup.c:100:15: warning: symbol 'init_section_page_cgroup' was not declared. Should it be static? cleanup here. 
Signed-off-by: KOSAKI Motohiro Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page_cgroup.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/page_cgroup.c b/mm/page_cgroup.c index ab27ff75051..d6507a660ed 100644 --- a/mm/page_cgroup.c +++ b/mm/page_cgroup.c @@ -101,7 +101,7 @@ struct page_cgroup *lookup_page_cgroup(struct page *page) } /* __alloc_bootmem...() is protected by !slab_available() */ -int __init_refok init_section_page_cgroup(unsigned long pfn) +static int __init_refok init_section_page_cgroup(unsigned long pfn) { struct mem_section *section; struct page_cgroup *base, *pc; -- cgit From 2bc7273b0e3a509fb598abfc5b9fe50158b830d2 Mon Sep 17 00:00:00 2001 From: KOSAKI Motohiro Date: Tue, 6 Jan 2009 14:39:43 -0800 Subject: mm: make maddr __iomem sparse output following warnings. mm/memory.c:2936:8: warning: incorrect type in assignment (different address spaces) mm/memory.c:2936:8: expected void *maddr mm/memory.c:2936:8: got void [noderef] cleanup here. Signed-off-by: KOSAKI Motohiro Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memory.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/memory.c b/mm/memory.c index 1a83fe5339a..89339c61f8e 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -2966,7 +2966,7 @@ int generic_access_phys(struct vm_area_struct *vma, unsigned long addr, { resource_size_t phys_addr; unsigned long prot = 0; - void *maddr; + void __iomem *maddr; int offset = addr & (PAGE_SIZE-1); if (follow_phys(vma, addr, write, &prot, &phys_addr)) -- cgit From d38d2a7582012ecf53aac33683ca5c689093cf65 Mon Sep 17 00:00:00 2001 From: KOSAKI Motohiro Date: Tue, 6 Jan 2009 14:39:44 -0800 Subject: mm: make mem_cgroup_resize_limit() static Sparse output following warnings. mm/memcontrol.c:782:5: warning: symbol 'mem_cgroup_resize_limit' was not declared. Should it be static? cleanup here. Signed-off-by: KOSAKI Motohiro Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memcontrol.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 866dcc7eeb0..51ee9654557 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -779,7 +779,8 @@ int mem_cgroup_shrink_usage(struct mm_struct *mm, gfp_t gfp_mask) return 0; } -int mem_cgroup_resize_limit(struct mem_cgroup *memcg, unsigned long long val) +static int mem_cgroup_resize_limit(struct mem_cgroup *memcg, + unsigned long long val) { int retry_count = MEM_CGROUP_RECLAIM_RETRIES; -- cgit From ff30153bf9647c8646538810d4c01015a5e44787 Mon Sep 17 00:00:00 2001 From: KOSAKI Motohiro Date: Tue, 6 Jan 2009 14:39:44 -0800 Subject: mm: make scan_all_zones_unevictable_pages() static sparse output following warning. mm/vmscan.c:2549:6: warning: symbol 'scan_all_zones_unevictable_pages' was not declared. Should it be static? cleanup here. Signed-off-by: KOSAKI Motohiro Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmscan.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/vmscan.c b/mm/vmscan.c index da7c3a2304a..062767d159d 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2506,7 +2506,7 @@ void scan_zone_unevictable_pages(struct zone *zone) * that has possibly/probably made some previously unevictable pages * evictable. 
*/ -void scan_all_zones_unevictable_pages(void) +static void scan_all_zones_unevictable_pages(void) { struct zone *zone; -- cgit From 14b90b22ec0f359ef4791033ab386b2b627bae07 Mon Sep 17 00:00:00 2001 From: KOSAKI Motohiro Date: Tue, 6 Jan 2009 14:39:45 -0800 Subject: mm: make scan_zone_unevictable_pages() static sparse output following warning mm/vmscan.c:2507:6: warning: symbol 'scan_zone_unevictable_pages' was not declared. Should it be static? cleanup here. Signed-off-by: KOSAKI Motohiro Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmscan.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/vmscan.c b/mm/vmscan.c index 062767d159d..1ef5a2eeb29 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -2464,7 +2464,7 @@ void scan_mapping_unevictable_pages(struct address_space *mapping) * back onto @zone's unevictable list. */ #define SCAN_UNEVICTABLE_BATCH_SIZE 16UL /* arbitrary lock hold batch size */ -void scan_zone_unevictable_pages(struct zone *zone) +static void scan_zone_unevictable_pages(struct zone *zone) { struct list_head *l_unevictable = &zone->lru[LRU_UNEVICTABLE].list; unsigned long scan; -- cgit From efab81864161f8c546d4403873e7ae7831ed5b26 Mon Sep 17 00:00:00 2001 From: KOSAKI Motohiro Date: Tue, 6 Jan 2009 14:39:46 -0800 Subject: mm: make setup_per_zone_inactive_ratio() static Sparse output following warning. mm/page_alloc.c:4301:6: warning: symbol 'setup_per_zone_inactive_ratio' was not declared. Should it be static? cleanup here. Signed-off-by: KOSAKI Motohiro Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 65133436fe3..31c512410a9 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4314,7 +4314,7 @@ void setup_per_zone_pages_min(void) * 1TB 101 10GB * 10TB 320 32GB */ -void setup_per_zone_inactive_ratio(void) +static void setup_per_zone_inactive_ratio(void) { struct zone *zone; -- cgit From 73fd8748ab0b9b3ddd178bea1d7ae03372033d96 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Tue, 6 Jan 2009 14:39:47 -0800 Subject: swapfile: swapon needs larger size type sys_swapon()'s swapfilesize (better renamed swapfilepages) is declared as an int, but should be an unsigned long like the maxpages it's compared against: on 64-bit (with 4kB pages) a swapfile of 2^44 bytes was rejected with "Swap area shorter than signature indicates". Signed-off-by: Hugh Dickins Cc: KAMEZAWA Hiroyuki Cc: Nick Piggin Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/swapfile.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'mm') diff --git a/mm/swapfile.c b/mm/swapfile.c index 725e56c362d..e2adc8eb931 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1461,7 +1461,7 @@ asmlinkage long sys_swapon(const char __user * specialfile, int swap_flags) int nr_extents = 0; sector_t span; unsigned long maxpages = 1; - int swapfilesize; + unsigned long swapfilepages; unsigned short *swap_map = NULL; struct page *page = NULL; struct inode *inode = NULL; @@ -1539,7 +1539,7 @@ asmlinkage long sys_swapon(const char __user * specialfile, int swap_flags) goto bad_swap; } - swapfilesize = i_size_read(inode) >> PAGE_SHIFT; + swapfilepages = i_size_read(inode) >> PAGE_SHIFT; /* * Read the swap header. 
@@ -1616,7 +1616,7 @@ asmlinkage long sys_swapon(const char __user * specialfile, int swap_flags) error = -EINVAL; if (!maxpages) goto bad_swap; - if (swapfilesize && maxpages > swapfilesize) { + if (swapfilepages && maxpages > swapfilepages) { printk(KERN_WARNING "Swap area shorter than signature indicates\n"); goto bad_swap; -- cgit From 22c6f8fdb31993cf49bdd4a47b64a7002391e1c7 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Tue, 6 Jan 2009 14:39:48 -0800 Subject: swapfile: remove SWP_ACTIVE mask Remove the SWP_ACTIVE mask: it just obscures the SWP_WRITEOK flag. Signed-off-by: Hugh Dickins Cc: KAMEZAWA Hiroyuki Cc: Nick Piggin Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/swapfile.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'mm') diff --git a/mm/swapfile.c b/mm/swapfile.c index e2adc8eb931..915cb3fc43d 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1222,7 +1222,7 @@ asmlinkage long sys_swapoff(const char __user * specialfile) spin_lock(&swap_lock); for (type = swap_list.head; type >= 0; type = swap_info[type].next) { p = swap_info + type; - if ((p->flags & SWP_ACTIVE) == SWP_ACTIVE) { + if (p->flags & SWP_WRITEOK) { if (p->swap_file->f_mapping == mapping) break; } @@ -1674,7 +1674,7 @@ asmlinkage long sys_swapon(const char __user * specialfile, int swap_flags) else p->prio = --least_priority; p->swap_map = swap_map; - p->flags = SWP_ACTIVE; + p->flags |= SWP_WRITEOK; nr_swap_pages += nr_good_pages; total_swap_pages += nr_good_pages; -- cgit From 886bb7e9c3ed0bb3e4a2b1f336d8c6a6e5a4b782 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Tue, 6 Jan 2009 14:39:48 -0800 Subject: swapfile: remove surplus whitespace Remove trailing whitespace from swapfile.c, and odd swap_show() alignment. Signed-off-by: Hugh Dickins Cc: KAMEZAWA Hiroyuki Cc: Nick Piggin Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/swapfile.c | 22 +++++++++++----------- 1 file changed, 11 insertions(+), 11 deletions(-) (limited to 'mm') diff --git a/mm/swapfile.c b/mm/swapfile.c index 915cb3fc43d..c46c83d6aab 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -92,7 +92,7 @@ static inline unsigned long scan_swap_map(struct swap_info_struct *si) unsigned long offset, last_in_cluster; int latency_ration = LATENCY_LIMIT; - /* + /* * We try to cluster swap pages by allocating them sequentially * in swap. Once we've allocated SWAPFILE_CLUSTER pages this * way, however, we resort to first-free allocation, starting @@ -269,7 +269,7 @@ bad_nofile: printk(KERN_ERR "swap_free: %s%08lx\n", Bad_file, entry.val); out: return NULL; -} +} static int swap_entry_free(struct swap_info_struct *p, unsigned long offset) { @@ -736,10 +736,10 @@ static int try_to_unuse(unsigned int type) break; } - /* + /* * Get a page for the entry, using the existing swap * cache page if there is one. Otherwise, get a clean - * page and read the swap into it. + * page and read the swap into it. */ swap_map = &si->swap_map[i]; entry = swp_entry(type, i); @@ -1202,7 +1202,7 @@ asmlinkage long sys_swapoff(const char __user * specialfile) char * pathname; int i, type, prev; int err; - + if (!capable(CAP_SYS_ADMIN)) return -EPERM; @@ -1395,12 +1395,12 @@ static int swap_show(struct seq_file *swap, void *v) file = ptr->swap_file; len = seq_path(swap, &file->f_path, " \t\n\\"); seq_printf(swap, "%*s%s\t%u\t%u\t%d\n", - len < 40 ? 40 - len : 1, " ", - S_ISBLK(file->f_path.dentry->d_inode->i_mode) ? + len < 40 ? 40 - len : 1, " ", + S_ISBLK(file->f_path.dentry->d_inode->i_mode) ? 
"partition" : "file\t", - ptr->pages << (PAGE_SHIFT - 10), - ptr->inuse_pages << (PAGE_SHIFT - 10), - ptr->prio); + ptr->pages << (PAGE_SHIFT - 10), + ptr->inuse_pages << (PAGE_SHIFT - 10), + ptr->prio); return 0; } @@ -1565,7 +1565,7 @@ asmlinkage long sys_swapon(const char __user * specialfile, int swap_flags) error = -EINVAL; goto bad_swap; } - + switch (swap_header_version) { case 1: printk(KERN_ERR "version 0 swap is no longer supported. " -- cgit From 81e33971271ec8603fe696731ff9967afb99e729 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Tue, 6 Jan 2009 14:39:49 -0800 Subject: swapfile: remove v0 SWAP-SPACE message The kernel has not supported v0 SWAP-SPACE since 2.5.22: I think we can now safely drop its "version 0 swap is no longer supported" message - just say "Unable to find swap-space signature" as usual. This removes one level of indentation from a stretch of sys_swapon(). I'd have liked to be specific, saying "Unable to find SWAPSPACE2 signature", but it's just too confusing that the version 1 signature shows the number 2. Irrelevant nearby cleanup: kmap(page) already gives page_address(page). Signed-off-by: Hugh Dickins Cc: KAMEZAWA Hiroyuki Cc: Nick Piggin Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/swapfile.c | 146 ++++++++++++++++++++++++++-------------------------------- 1 file changed, 65 insertions(+), 81 deletions(-) (limited to 'mm') diff --git a/mm/swapfile.c b/mm/swapfile.c index c46c83d6aab..85ff603385c 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1456,7 +1456,6 @@ asmlinkage long sys_swapon(const char __user * specialfile, int swap_flags) int i, prev; int error; union swap_header *swap_header = NULL; - int swap_header_version; unsigned int nr_good_pages = 0; int nr_extents = 0; sector_t span; @@ -1553,101 +1552,86 @@ asmlinkage long sys_swapon(const char __user * specialfile, int swap_flags) error = PTR_ERR(page); goto bad_swap; } - kmap(page); - swap_header = page_address(page); + swap_header = kmap(page); - if (!memcmp("SWAP-SPACE",swap_header->magic.magic,10)) - swap_header_version = 1; - else if (!memcmp("SWAPSPACE2",swap_header->magic.magic,10)) - swap_header_version = 2; - else { + if (memcmp("SWAPSPACE2", swap_header->magic.magic, 10)) { printk(KERN_ERR "Unable to find swap-space signature\n"); error = -EINVAL; goto bad_swap; } - switch (swap_header_version) { - case 1: - printk(KERN_ERR "version 0 swap is no longer supported. " - "Use mkswap -v1 %s\n", name); + /* swap partition endianess hack... */ + if (swab32(swap_header->info.version) == 1) { + swab32s(&swap_header->info.version); + swab32s(&swap_header->info.last_page); + swab32s(&swap_header->info.nr_badpages); + for (i = 0; i < swap_header->info.nr_badpages; i++) + swab32s(&swap_header->info.badpages[i]); + } + /* Check the swap header's sub-version */ + if (swap_header->info.version != 1) { + printk(KERN_WARNING + "Unable to handle swap header version %d\n", + swap_header->info.version); error = -EINVAL; goto bad_swap; - case 2: - /* swap partition endianess hack... 
*/ - if (swab32(swap_header->info.version) == 1) { - swab32s(&swap_header->info.version); - swab32s(&swap_header->info.last_page); - swab32s(&swap_header->info.nr_badpages); - for (i = 0; i < swap_header->info.nr_badpages; i++) - swab32s(&swap_header->info.badpages[i]); - } - /* Check the swap header's sub-version and the size of - the swap file and bad block lists */ - if (swap_header->info.version != 1) { - printk(KERN_WARNING - "Unable to handle swap header version %d\n", - swap_header->info.version); - error = -EINVAL; - goto bad_swap; - } + } - p->lowest_bit = 1; - p->cluster_next = 1; + p->lowest_bit = 1; + p->cluster_next = 1; - /* - * Find out how many pages are allowed for a single swap - * device. There are two limiting factors: 1) the number of - * bits for the swap offset in the swp_entry_t type and - * 2) the number of bits in the a swap pte as defined by - * the different architectures. In order to find the - * largest possible bit mask a swap entry with swap type 0 - * and swap offset ~0UL is created, encoded to a swap pte, - * decoded to a swp_entry_t again and finally the swap - * offset is extracted. This will mask all the bits from - * the initial ~0UL mask that can't be encoded in either - * the swp_entry_t or the architecture definition of a - * swap pte. - */ - maxpages = swp_offset(pte_to_swp_entry(swp_entry_to_pte(swp_entry(0,~0UL)))) - 1; - if (maxpages > swap_header->info.last_page) - maxpages = swap_header->info.last_page; - p->highest_bit = maxpages - 1; + /* + * Find out how many pages are allowed for a single swap + * device. There are two limiting factors: 1) the number of + * bits for the swap offset in the swp_entry_t type and + * 2) the number of bits in the a swap pte as defined by + * the different architectures. In order to find the + * largest possible bit mask a swap entry with swap type 0 + * and swap offset ~0UL is created, encoded to a swap pte, + * decoded to a swp_entry_t again and finally the swap + * offset is extracted. This will mask all the bits from + * the initial ~0UL mask that can't be encoded in either + * the swp_entry_t or the architecture definition of a + * swap pte. 
+ */ + maxpages = swp_offset(pte_to_swp_entry( + swp_entry_to_pte(swp_entry(0, ~0UL)))) - 1; + if (maxpages > swap_header->info.last_page) + maxpages = swap_header->info.last_page; + p->highest_bit = maxpages - 1; - error = -EINVAL; - if (!maxpages) - goto bad_swap; - if (swapfilepages && maxpages > swapfilepages) { - printk(KERN_WARNING - "Swap area shorter than signature indicates\n"); - goto bad_swap; - } - if (swap_header->info.nr_badpages && S_ISREG(inode->i_mode)) - goto bad_swap; - if (swap_header->info.nr_badpages > MAX_SWAP_BADPAGES) - goto bad_swap; + error = -EINVAL; + if (!maxpages) + goto bad_swap; + if (swapfilepages && maxpages > swapfilepages) { + printk(KERN_WARNING + "Swap area shorter than signature indicates\n"); + goto bad_swap; + } + if (swap_header->info.nr_badpages && S_ISREG(inode->i_mode)) + goto bad_swap; + if (swap_header->info.nr_badpages > MAX_SWAP_BADPAGES) + goto bad_swap; - /* OK, set up the swap map and apply the bad block list */ - swap_map = vmalloc(maxpages * sizeof(short)); - if (!swap_map) { - error = -ENOMEM; - goto bad_swap; - } + /* OK, set up the swap map and apply the bad block list */ + swap_map = vmalloc(maxpages * sizeof(short)); + if (!swap_map) { + error = -ENOMEM; + goto bad_swap; + } - error = 0; - memset(swap_map, 0, maxpages * sizeof(short)); - for (i = 0; i < swap_header->info.nr_badpages; i++) { - int page_nr = swap_header->info.badpages[i]; - if (page_nr <= 0 || page_nr >= swap_header->info.last_page) - error = -EINVAL; - else - swap_map[page_nr] = SWAP_MAP_BAD; - } - nr_good_pages = swap_header->info.last_page - - swap_header->info.nr_badpages - - 1 /* header page */; - if (error) + memset(swap_map, 0, maxpages * sizeof(short)); + for (i = 0; i < swap_header->info.nr_badpages; i++) { + int page_nr = swap_header->info.badpages[i]; + if (page_nr <= 0 || page_nr >= swap_header->info.last_page) { + error = -EINVAL; goto bad_swap; + } + swap_map[page_nr] = SWAP_MAP_BAD; } + nr_good_pages = swap_header->info.last_page - + swap_header->info.nr_badpages - + 1 /* header page */; if (nr_good_pages) { swap_map[0] = SWAP_MAP_BAD; -- cgit From ebebbbe904634b0ca1c674457b399f68db5e05b1 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Tue, 6 Jan 2009 14:39:50 -0800 Subject: swapfile: rearrange scan and swap_info Before making functional changes, rearrange scan_swap_map() to simplify subsequent diffs. Actually, there is one functional change in there: leave cluster_nr negative while scanning for a new cluster - resetting it early increased the likelihood that when we have difficulty finding a free cluster, another task may come in and try doing exactly the same - just a waste of cpu. Before making functional changes, rearrange struct swap_info_struct slightly: flags will be needed as an unsigned long (for wait_on_bit), next is a good int to pair with prio, old_block_size is uninteresting so shift it to the end. 
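For illustration, the cluster_nr change amounts to roughly the following (a simplified sketch of scan_swap_map(), not the exact code; the real hunks are in the diff below):

	/* before: the counter was reset as soon as it hit zero */
	if (unlikely(!si->cluster_nr)) {
		si->cluster_nr = SWAPFILE_CLUSTER - 1;
		/* drop swap_lock and scan for a free cluster ... */
	}

	/* after: the post-decrement leaves cluster_nr negative for the
	 * duration of the unlocked scan, so concurrent allocators do not
	 * all start the same expensive scan; it is set back to
	 * SWAPFILE_CLUSTER - 1 only once a cluster has been settled on.
	 */
	if (unlikely(!si->cluster_nr--)) {
		/* scan, then si->cluster_nr = SWAPFILE_CLUSTER - 1; */
	}
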
Signed-off-by: Hugh Dickins Cc: KAMEZAWA Hiroyuki Cc: Nick Piggin Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/swapfile.c | 66 +++++++++++++++++++++++++++++++++-------------------------- 1 file changed, 37 insertions(+), 29 deletions(-) (limited to 'mm') diff --git a/mm/swapfile.c b/mm/swapfile.c index 85ff603385c..4d9855f86e7 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -89,7 +89,8 @@ void swap_unplug_io_fn(struct backing_dev_info *unused_bdi, struct page *page) static inline unsigned long scan_swap_map(struct swap_info_struct *si) { - unsigned long offset, last_in_cluster; + unsigned long offset; + unsigned long last_in_cluster; int latency_ration = LATENCY_LIMIT; /* @@ -103,10 +104,13 @@ static inline unsigned long scan_swap_map(struct swap_info_struct *si) */ si->flags += SWP_SCANNING; - if (unlikely(!si->cluster_nr)) { - si->cluster_nr = SWAPFILE_CLUSTER - 1; - if (si->pages - si->inuse_pages < SWAPFILE_CLUSTER) - goto lowest; + offset = si->cluster_next; + + if (unlikely(!si->cluster_nr--)) { + if (si->pages - si->inuse_pages < SWAPFILE_CLUSTER) { + si->cluster_nr = SWAPFILE_CLUSTER - 1; + goto checks; + } spin_unlock(&swap_lock); offset = si->lowest_bit; @@ -118,43 +122,47 @@ static inline unsigned long scan_swap_map(struct swap_info_struct *si) last_in_cluster = offset + SWAPFILE_CLUSTER; else if (offset == last_in_cluster) { spin_lock(&swap_lock); - si->cluster_next = offset-SWAPFILE_CLUSTER+1; - goto cluster; + offset -= SWAPFILE_CLUSTER - 1; + si->cluster_next = offset; + si->cluster_nr = SWAPFILE_CLUSTER - 1; + goto checks; } if (unlikely(--latency_ration < 0)) { cond_resched(); latency_ration = LATENCY_LIMIT; } } + + offset = si->lowest_bit; spin_lock(&swap_lock); - goto lowest; + si->cluster_nr = SWAPFILE_CLUSTER - 1; } - si->cluster_nr--; -cluster: - offset = si->cluster_next; - if (offset > si->highest_bit) -lowest: offset = si->lowest_bit; -checks: if (!(si->flags & SWP_WRITEOK)) +checks: + if (!(si->flags & SWP_WRITEOK)) goto no_page; if (!si->highest_bit) goto no_page; - if (!si->swap_map[offset]) { - if (offset == si->lowest_bit) - si->lowest_bit++; - if (offset == si->highest_bit) - si->highest_bit--; - si->inuse_pages++; - if (si->inuse_pages == si->pages) { - si->lowest_bit = si->max; - si->highest_bit = 0; - } - si->swap_map[offset] = 1; - si->cluster_next = offset + 1; - si->flags -= SWP_SCANNING; - return offset; + if (offset > si->highest_bit) + offset = si->lowest_bit; + if (si->swap_map[offset]) + goto scan; + + if (offset == si->lowest_bit) + si->lowest_bit++; + if (offset == si->highest_bit) + si->highest_bit--; + si->inuse_pages++; + if (si->inuse_pages == si->pages) { + si->lowest_bit = si->max; + si->highest_bit = 0; } + si->swap_map[offset] = 1; + si->cluster_next = offset + 1; + si->flags -= SWP_SCANNING; + return offset; +scan: spin_unlock(&swap_lock); while (++offset <= si->highest_bit) { if (!si->swap_map[offset]) { @@ -167,7 +175,7 @@ checks: if (!(si->flags & SWP_WRITEOK)) } } spin_lock(&swap_lock); - goto lowest; + goto checks; no_page: si->flags -= SWP_SCANNING; -- cgit From 6a6ba83175c029c7820765bae44692266b29e67a Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Tue, 6 Jan 2009 14:39:51 -0800 Subject: swapfile: swapon use discard (trim) When adding swap, all the old data on swap can be forgotten: sys_swapon() discard all but the header page of the swap partition (or every extent but the header of the swap file), to give a solidstate swap device the opportunity to optimize its wear-levelling. 
If that succeeds, note SWP_DISCARDABLE for later use, and report it with a "D" at the right end of the kernel's "Adding ... swap" message. Perhaps something should be shown in /proc/swaps (swapon -s), but we have to be more cautious before making any addition to that format. Signed-off-by: Hugh Dickins Cc: KAMEZAWA Hiroyuki Cc: Nick Piggin Cc: David Woodhouse Cc: Jens Axboe Cc: Matthew Wilcox Cc: Joern Engel Cc: James Bottomley Cc: Donjun Shin Cc: Tejun Heo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/swapfile.c | 39 +++++++++++++++++++++++++++++++++++++-- 1 file changed, 37 insertions(+), 2 deletions(-) (limited to 'mm') diff --git a/mm/swapfile.c b/mm/swapfile.c index 4d9855f86e7..fbeb4bb8eb5 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -84,6 +84,37 @@ void swap_unplug_io_fn(struct backing_dev_info *unused_bdi, struct page *page) up_read(&swap_unplug_sem); } +/* + * swapon tell device that all the old swap contents can be discarded, + * to allow the swap device to optimize its wear-levelling. + */ +static int discard_swap(struct swap_info_struct *si) +{ + struct swap_extent *se; + int err = 0; + + list_for_each_entry(se, &si->extent_list, list) { + sector_t start_block = se->start_block << (PAGE_SHIFT - 9); + pgoff_t nr_blocks = se->nr_pages << (PAGE_SHIFT - 9); + + if (se->start_page == 0) { + /* Do not discard the swap header page! */ + start_block += 1 << (PAGE_SHIFT - 9); + nr_blocks -= 1 << (PAGE_SHIFT - 9); + if (!nr_blocks) + continue; + } + + err = blkdev_issue_discard(si->bdev, start_block, + nr_blocks, GFP_KERNEL); + if (err) + break; + + cond_resched(); + } + return err; /* That will often be -EOPNOTSUPP */ +} + #define SWAPFILE_CLUSTER 256 #define LATENCY_LIMIT 256 @@ -1658,6 +1689,9 @@ asmlinkage long sys_swapon(const char __user * specialfile, int swap_flags) goto bad_swap; } + if (discard_swap(p) == 0) + p->flags |= SWP_DISCARDABLE; + mutex_lock(&swapon_mutex); spin_lock(&swap_lock); if (swap_flags & SWAP_FLAG_PREFER) @@ -1671,9 +1705,10 @@ asmlinkage long sys_swapon(const char __user * specialfile, int swap_flags) total_swap_pages += nr_good_pages; printk(KERN_INFO "Adding %uk swap on %s. " - "Priority:%d extents:%d across:%lluk\n", + "Priority:%d extents:%d across:%lluk%s\n", nr_good_pages<<(PAGE_SHIFT-10), name, p->prio, - nr_extents, (unsigned long long)span<<(PAGE_SHIFT-10)); + nr_extents, (unsigned long long)span<<(PAGE_SHIFT-10), + (p->flags & SWP_DISCARDABLE) ? " D" : ""); /* insert swap space into swap_list: */ prev = -1; -- cgit From 7992fde72ce06c73280a1939b7a1e903bc95ef85 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Tue, 6 Jan 2009 14:39:53 -0800 Subject: swapfile: swap allocation use discard When scan_swap_map() finds a free cluster of swap pages to allocate, discard the old contents of the cluster if the device supports discard. But don't bother when swap is so fragmented that we allocate single pages. Be careful about racing allocations made while we're scanning for a cluster; and hold up allocations made while we're discarding. 
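In outline, the race handling added to scan_swap_map() looks like this (a simplified sketch with details trimmed; the real code is in the diff below):

	/* before dropping swap_lock to scan for a free cluster */
	si->lowest_alloc = si->max;
	si->highest_alloc = 0;

	/* racing allocators note what they take while the scan runs */
	if (offset < si->lowest_alloc)
		si->lowest_alloc = offset;
	if (offset > si->highest_alloc)
		si->highest_alloc = offset;

	/* the task that found the cluster sets SWP_DISCARDING, drops
	 * swap_lock and discards the cluster, trimmed so that pages the
	 * racers already took are not discarded; racers allocating from
	 * that range meanwhile wait_on_bit() until the discard is issued.
	 */
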
Signed-off-by: Hugh Dickins Cc: KAMEZAWA Hiroyuki Cc: Nick Piggin Cc: David Woodhouse Cc: Jens Axboe Cc: Matthew Wilcox Cc: Joern Engel Cc: James Bottomley Cc: Donjun Shin Cc: Tejun Heo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/swapfile.c | 119 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 118 insertions(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/swapfile.c b/mm/swapfile.c index fbeb4bb8eb5..ca75b9e7c09 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -115,14 +115,62 @@ static int discard_swap(struct swap_info_struct *si) return err; /* That will often be -EOPNOTSUPP */ } +/* + * swap allocation tell device that a cluster of swap can now be discarded, + * to allow the swap device to optimize its wear-levelling. + */ +static void discard_swap_cluster(struct swap_info_struct *si, + pgoff_t start_page, pgoff_t nr_pages) +{ + struct swap_extent *se = si->curr_swap_extent; + int found_extent = 0; + + while (nr_pages) { + struct list_head *lh; + + if (se->start_page <= start_page && + start_page < se->start_page + se->nr_pages) { + pgoff_t offset = start_page - se->start_page; + sector_t start_block = se->start_block + offset; + pgoff_t nr_blocks = se->nr_pages - offset; + + if (nr_blocks > nr_pages) + nr_blocks = nr_pages; + start_page += nr_blocks; + nr_pages -= nr_blocks; + + if (!found_extent++) + si->curr_swap_extent = se; + + start_block <<= PAGE_SHIFT - 9; + nr_blocks <<= PAGE_SHIFT - 9; + if (blkdev_issue_discard(si->bdev, start_block, + nr_blocks, GFP_NOIO)) + break; + } + + lh = se->list.next; + if (lh == &si->extent_list) + lh = lh->next; + se = list_entry(lh, struct swap_extent, list); + } +} + +static int wait_for_discard(void *word) +{ + schedule(); + return 0; +} + #define SWAPFILE_CLUSTER 256 #define LATENCY_LIMIT 256 static inline unsigned long scan_swap_map(struct swap_info_struct *si) { unsigned long offset; - unsigned long last_in_cluster; + unsigned long last_in_cluster = 0; int latency_ration = LATENCY_LIMIT; + int found_free_cluster = 0; /* * We try to cluster swap pages by allocating them sequentially @@ -142,6 +190,19 @@ static inline unsigned long scan_swap_map(struct swap_info_struct *si) si->cluster_nr = SWAPFILE_CLUSTER - 1; goto checks; } + if (si->flags & SWP_DISCARDABLE) { + /* + * Start range check on racing allocations, in case + * they overlap the cluster we eventually decide on + * (we scan without swap_lock to allow preemption). + * It's hardly conceivable that cluster_nr could be + * wrapped during our scan, but don't depend on it. + */ + if (si->lowest_alloc) + goto checks; + si->lowest_alloc = si->max; + si->highest_alloc = 0; + } spin_unlock(&swap_lock); offset = si->lowest_bit; @@ -156,6 +217,7 @@ static inline unsigned long scan_swap_map(struct swap_info_struct *si) offset -= SWAPFILE_CLUSTER - 1; si->cluster_next = offset; si->cluster_nr = SWAPFILE_CLUSTER - 1; + found_free_cluster = 1; goto checks; } if (unlikely(--latency_ration < 0)) { @@ -167,6 +229,7 @@ static inline unsigned long scan_swap_map(struct swap_info_struct *si) offset = si->lowest_bit; spin_lock(&swap_lock); si->cluster_nr = SWAPFILE_CLUSTER - 1; + si->lowest_alloc = 0; } checks: @@ -191,6 +254,60 @@ checks: si->swap_map[offset] = 1; si->cluster_next = offset + 1; si->flags -= SWP_SCANNING; + + if (si->lowest_alloc) { + /* + * Only set when SWP_DISCARDABLE, and there's a scan + * for a free cluster in progress or just completed. 
+ */ + if (found_free_cluster) { + /* + * To optimize wear-levelling, discard the + * old data of the cluster, taking care not to + * discard any of its pages that have already + * been allocated by racing tasks (offset has + * already stepped over any at the beginning). + */ + if (offset < si->highest_alloc && + si->lowest_alloc <= last_in_cluster) + last_in_cluster = si->lowest_alloc - 1; + si->flags |= SWP_DISCARDING; + spin_unlock(&swap_lock); + + if (offset < last_in_cluster) + discard_swap_cluster(si, offset, + last_in_cluster - offset + 1); + + spin_lock(&swap_lock); + si->lowest_alloc = 0; + si->flags &= ~SWP_DISCARDING; + + smp_mb(); /* wake_up_bit advises this */ + wake_up_bit(&si->flags, ilog2(SWP_DISCARDING)); + + } else if (si->flags & SWP_DISCARDING) { + /* + * Delay using pages allocated by racing tasks + * until the whole discard has been issued. We + * could defer that delay until swap_writepage, + * but it's easier to keep this self-contained. + */ + spin_unlock(&swap_lock); + wait_on_bit(&si->flags, ilog2(SWP_DISCARDING), + wait_for_discard, TASK_UNINTERRUPTIBLE); + spin_lock(&swap_lock); + } else { + /* + * Note pages allocated by racing tasks while + * scan for a free cluster is in progress, so + * that its final discard can exclude them. + */ + if (offset < si->lowest_alloc) + si->lowest_alloc = offset; + if (offset > si->highest_alloc) + si->highest_alloc = offset; + } + } return offset; scan: -- cgit From 20137a490f397d9c01fc9fadd83a8d198bda4477 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Tue, 6 Jan 2009 14:39:54 -0800 Subject: swapfile: swapon randomize if nonrot Swap allocation has always started from the beginning of the swap area; but if we're dealing with a solidstate swap device which can only remap blocks within limited zones, that would sooner wear out the first zone. Therefore sys_swapon() test whether blk_queue is non-rotational, and if so randomize the cluster_next starting position for allocation. If blk_queue is nonrot, note SWP_SOLIDSTATE for later use, and report it with an "SS" at the right end of the kernel's "Adding ... swap" message (so that if it's both nonrot and discardable, "SSD" will be shown there). Perhaps something should be shown in /proc/swaps (swapon -s), but we have to be more cautious before making any addition to that format. Signed-off-by: Hugh Dickins Cc: KAMEZAWA Hiroyuki Cc: Nick Piggin Cc: David Woodhouse Cc: Jens Axboe Cc: Matthew Wilcox Cc: Joern Engel Cc: James Bottomley Cc: Donjun Shin Cc: Tejun Heo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/swapfile.c | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) (limited to 'mm') diff --git a/mm/swapfile.c b/mm/swapfile.c index ca75b9e7c09..b0f56603b9b 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -16,6 +16,7 @@ #include #include #include +#include #include #include #include @@ -1806,6 +1807,11 @@ asmlinkage long sys_swapon(const char __user * specialfile, int swap_flags) goto bad_swap; } + if (blk_queue_nonrot(bdev_get_queue(p->bdev))) { + p->flags |= SWP_SOLIDSTATE; + srandom32((u32)get_seconds()); + p->cluster_next = 1 + (random32() % p->highest_bit); + } if (discard_swap(p) == 0) p->flags |= SWP_DISCARDABLE; @@ -1822,10 +1828,11 @@ asmlinkage long sys_swapon(const char __user * specialfile, int swap_flags) total_swap_pages += nr_good_pages; printk(KERN_INFO "Adding %uk swap on %s. 
" - "Priority:%d extents:%d across:%lluk%s\n", + "Priority:%d extents:%d across:%lluk %s%s\n", nr_good_pages<<(PAGE_SHIFT-10), name, p->prio, nr_extents, (unsigned long long)span<<(PAGE_SHIFT-10), - (p->flags & SWP_DISCARDABLE) ? " D" : ""); + (p->flags & SWP_SOLIDSTATE) ? "SS" : "", + (p->flags & SWP_DISCARDABLE) ? "D" : ""); /* insert swap space into swap_list: */ prev = -1; -- cgit From c60aa176c6de82703f064082b909496fc4fee956 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Tue, 6 Jan 2009 14:39:55 -0800 Subject: swapfile: swap allocation cycle if nonrot Though attempting to find free clusters (Andrea), swap allocation has always restarted its searches from the beginning of the swap area (sct), to reduce seek times between swap pages, by not scattering them all over the partition. But on a solidstate swap device, seeks are cheap, and block remapping to level the wear may be limited by zones: in that case it's better to cycle around the whole partition. Signed-off-by: Hugh Dickins Cc: KAMEZAWA Hiroyuki Cc: Nick Piggin Cc: David Woodhouse Cc: Jens Axboe Cc: Matthew Wilcox Cc: Joern Engel Cc: James Bottomley Cc: Donjun Shin Cc: Tejun Heo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/swapfile.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++---- 1 file changed, 46 insertions(+), 4 deletions(-) (limited to 'mm') diff --git a/mm/swapfile.c b/mm/swapfile.c index b0f56603b9b..763210732b5 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -169,6 +169,7 @@ static int wait_for_discard(void *word) static inline unsigned long scan_swap_map(struct swap_info_struct *si) { unsigned long offset; + unsigned long scan_base; unsigned long last_in_cluster = 0; int latency_ration = LATENCY_LIMIT; int found_free_cluster = 0; @@ -181,10 +182,11 @@ static inline unsigned long scan_swap_map(struct swap_info_struct *si) * all over the entire swap partition, so that we reduce * overall disk seek times between swap pages. -- sct * But we do now try to find an empty cluster. -Andrea + * And we let swap pages go all over an SSD partition. Hugh */ si->flags += SWP_SCANNING; - offset = si->cluster_next; + scan_base = offset = si->cluster_next; if (unlikely(!si->cluster_nr--)) { if (si->pages - si->inuse_pages < SWAPFILE_CLUSTER) { @@ -206,7 +208,16 @@ static inline unsigned long scan_swap_map(struct swap_info_struct *si) } spin_unlock(&swap_lock); - offset = si->lowest_bit; + /* + * If seek is expensive, start searching for new cluster from + * start of partition, to minimize the span of allocated swap. + * But if seek is cheap, search from our current position, so + * that swap is allocated from all over the partition: if the + * Flash Translation Layer only remaps within limited zones, + * we don't want to wear out the first zone too quickly. 
+ */ + if (!(si->flags & SWP_SOLIDSTATE)) + scan_base = offset = si->lowest_bit; last_in_cluster = offset + SWAPFILE_CLUSTER - 1; /* Locate the first empty (unaligned) cluster */ @@ -228,6 +239,27 @@ static inline unsigned long scan_swap_map(struct swap_info_struct *si) } offset = si->lowest_bit; + last_in_cluster = offset + SWAPFILE_CLUSTER - 1; + + /* Locate the first empty (unaligned) cluster */ + for (; last_in_cluster < scan_base; offset++) { + if (si->swap_map[offset]) + last_in_cluster = offset + SWAPFILE_CLUSTER; + else if (offset == last_in_cluster) { + spin_lock(&swap_lock); + offset -= SWAPFILE_CLUSTER - 1; + si->cluster_next = offset; + si->cluster_nr = SWAPFILE_CLUSTER - 1; + found_free_cluster = 1; + goto checks; + } + if (unlikely(--latency_ration < 0)) { + cond_resched(); + latency_ration = LATENCY_LIMIT; + } + } + + offset = scan_base; spin_lock(&swap_lock); si->cluster_nr = SWAPFILE_CLUSTER - 1; si->lowest_alloc = 0; @@ -239,7 +271,7 @@ checks: if (!si->highest_bit) goto no_page; if (offset > si->highest_bit) - offset = si->lowest_bit; + scan_base = offset = si->lowest_bit; if (si->swap_map[offset]) goto scan; @@ -323,8 +355,18 @@ scan: latency_ration = LATENCY_LIMIT; } } + offset = si->lowest_bit; + while (++offset < scan_base) { + if (!si->swap_map[offset]) { + spin_lock(&swap_lock); + goto checks; + } + if (unlikely(--latency_ration < 0)) { + cond_resched(); + latency_ration = LATENCY_LIMIT; + } + } spin_lock(&swap_lock); - goto checks; no_page: si->flags -= SWP_SCANNING; -- cgit From 858a29900ea2d639759e697be901a60b759cdcfb Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Tue, 6 Jan 2009 14:39:56 -0800 Subject: swapfile: change discard pgoff_t to sector_t Change pgoff_t nr_blocks in discard_swap() and discard_swap_cluster() to sector_t: given the constraints on swap offsets (in particular, the 5 bits of swap type accommodated in the same unsigned long), pgoff_t was actually safe as is, but it certainly looked worrying when shifted left. [akpm@linux-foundation.org: fix shift overflow] Signed-off-by: Hugh Dickins Cc: KAMEZAWA Hiroyuki Cc: Nick Piggin Cc: David Woodhouse Cc: Jens Axboe Cc: Matthew Wilcox Cc: Joern Engel Cc: James Bottomley Cc: Donjun Shin Cc: Tejun Heo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/swapfile.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) (limited to 'mm') diff --git a/mm/swapfile.c b/mm/swapfile.c index 763210732b5..6a078557306 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -96,7 +96,7 @@ static int discard_swap(struct swap_info_struct *si) list_for_each_entry(se, &si->extent_list, list) { sector_t start_block = se->start_block << (PAGE_SHIFT - 9); - pgoff_t nr_blocks = se->nr_pages << (PAGE_SHIFT - 9); + sector_t nr_blocks = (sector_t)se->nr_pages << (PAGE_SHIFT - 9); if (se->start_page == 0) { /* Do not discard the swap header page! 
*/ @@ -133,7 +133,7 @@ static void discard_swap_cluster(struct swap_info_struct *si, start_page < se->start_page + se->nr_pages) { pgoff_t offset = start_page - se->start_page; sector_t start_block = se->start_block + offset; - pgoff_t nr_blocks = se->nr_pages - offset; + sector_t nr_blocks = se->nr_pages - offset; if (nr_blocks > nr_pages) nr_blocks = nr_pages; -- cgit From f0d7a4b3ed46816f5097d521850a8ab7a0d40f3c Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Tue, 6 Jan 2009 14:39:57 -0800 Subject: swapfile: let others seed random Remove the srandom32((u32)get_seconds()) from non-rotational swapon: there's been a coincidental discussion of earlier randomization, assume that goes ahead, let swapon be a client rather than stirring for itself. Signed-off-by: Hugh Dickins Cc: David Woodhouse Cc: Donjun Shin Cc: James Bottomley Cc: Jens Axboe Cc: Joern Engel Cc: KAMEZAWA Hiroyuki Cc: Matthew Wilcox Cc: Nick Piggin Cc: Tejun Heo Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/swapfile.c | 1 - 1 file changed, 1 deletion(-) (limited to 'mm') diff --git a/mm/swapfile.c b/mm/swapfile.c index 6a078557306..d0052360191 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1851,7 +1851,6 @@ asmlinkage long sys_swapon(const char __user * specialfile, int swap_flags) if (blk_queue_nonrot(bdev_get_queue(p->bdev))) { p->flags |= SWP_SOLIDSTATE; - srandom32((u32)get_seconds()); p->cluster_next = 1 + (random32() % p->highest_bit); } if (discard_swap(p) == 0) -- cgit From ebdd4aea8d736e3b5ce27ab0a26860c9fded341b Mon Sep 17 00:00:00 2001 From: Hannes Eder Date: Tue, 6 Jan 2009 14:39:58 -0800 Subject: hugetlb: fix sparse warnings Fix the following sparse warnings: mm/hugetlb.c:375:3: warning: returning void-valued expression mm/hugetlb.c:408:3: warning: returning void-valued expression Signed-off-by: Hannes Eder Acked-by: Nishanth Aravamudan Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/hugetlb.c | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-) (limited to 'mm') diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 9595278b5ab..82321da23cc 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -400,8 +400,10 @@ static void clear_huge_page(struct page *page, { int i; - if (unlikely(sz > MAX_ORDER_NR_PAGES)) - return clear_gigantic_page(page, addr, sz); + if (unlikely(sz > MAX_ORDER_NR_PAGES)) { + clear_gigantic_page(page, addr, sz); + return; + } might_sleep(); for (i = 0; i < sz/PAGE_SIZE; i++) { @@ -433,8 +435,10 @@ static void copy_huge_page(struct page *dst, struct page *src, int i; struct hstate *h = hstate_vma(vma); - if (unlikely(pages_per_huge_page(h) > MAX_ORDER_NR_PAGES)) - return copy_gigantic_page(dst, src, addr, vma); + if (unlikely(pages_per_huge_page(h) > MAX_ORDER_NR_PAGES)) { + copy_gigantic_page(dst, src, addr, vma); + return; + } might_sleep(); for (i = 0; i < pages_per_huge_page(h); i++) { -- cgit From a79311c14eae4bb946a97af25f3e1b17d625985d Mon Sep 17 00:00:00 2001 From: Rik van Riel Date: Tue, 6 Jan 2009 14:40:01 -0800 Subject: vmscan: bail out of direct reclaim after swap_cluster_max pages When the VM is under pressure, it can happen that several direct reclaim processes are in the pageout code simultaneously. It also happens that the reclaiming processes run into mostly referenced, mapped and dirty pages in the first round. This results in multiple direct reclaim processes having a lower pageout priority, which corresponds to a higher target of pages to scan. This in turn can result in each direct reclaim process freeing many pages. 
Together, they can end up freeing way too many pages. This kicks useful data out of memory (in some cases more than half of all memory is swapped out). It also impacts performance by keeping tasks stuck in the pageout code for too long. A 30% improvement in hackbench has been observed with this patch. The fix is relatively simple: in shrink_zone() we can check how many pages we have already freed, direct reclaim tasks break out of the scanning loop if they have already freed enough pages and have reached a lower priority level. We do not break out of shrink_zone() when priority == DEF_PRIORITY, to ensure that equal pressure is applied to every zone in the common case. However, in order to do this we do need to know how many pages we already freed, so move nr_reclaimed into scan_control. akpm: a historical interlude... We tried this in 2004: :commit e468e46a9bea3297011d5918663ce6d19094cf87 :Author: akpm :Date: Thu Jun 24 15:53:52 2004 +0000 : :[PATCH] vmscan.c: dont reclaim too many pages : : The shrink_zone() logic can, under some circumstances, cause far too many : pages to be reclaimed. Say, we're scanning at high priority and suddenly hit : a large number of reclaimable pages on the LRU. : Change things so we bale out when SWAP_CLUSTER_MAX pages have been reclaimed. And we reverted it in 2006: :commit 210fe530305ee50cd889fe9250168228b2994f32 :Author: Andrew Morton :Date: Fri Jan 6 00:11:14 2006 -0800 : : [PATCH] vmscan: balancing fix : : Revert a patch which went into 2.6.8-rc1. The changelog for that patch was: : : The shrink_zone() logic can, under some circumstances, cause far too many : pages to be reclaimed. Say, we're scanning at high priority and suddenly : hit a large number of reclaimable pages on the LRU. : : Change things so we bale out when SWAP_CLUSTER_MAX pages have been : reclaimed. : : Problem is, this change caused significant imbalance in inter-zone scan : balancing by truncating scans of larger zones. : : Suppose, for example, ZONE_HIGHMEM is 10x the size of ZONE_NORMAL. The zone : balancing algorithm would require that if we're scanning 100 pages of : ZONE_HIGHMEM, we should scan 10 pages of ZONE_NORMAL. But this logic will : cause the scanning of ZONE_HIGHMEM to bale out after only 32 pages are : reclaimed. Thus effectively causing smaller zones to be scanned relatively : harder than large ones. : : Now I need to remember what the workload was which caused me to write this : patch originally, then fix it up in a different way... And we haven't demonstrated that whatever problem caused that reversion is not being reintroduced by this change in 2008. Signed-off-by: Rik van Riel Cc: KOSAKI Motohiro Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmscan.c | 62 ++++++++++++++++++++++++++++++++----------------------------- 1 file changed, 33 insertions(+), 29 deletions(-) (limited to 'mm') diff --git a/mm/vmscan.c b/mm/vmscan.c index 1ef5a2eeb29..5faa7739487 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -52,6 +52,9 @@ struct scan_control { /* Incremented by the number of inactive pages that were scanned */ unsigned long nr_scanned; + /* Number of pages freed so far during a call to shrink_zones() */ + unsigned long nr_reclaimed; + /* This context's GFP mask */ gfp_t gfp_mask; @@ -1400,12 +1403,11 @@ static void get_scan_ratio(struct zone *zone, struct scan_control *sc, /* * This is a basic per-zone page freer. Used by both kswapd and direct reclaim. 
*/ -static unsigned long shrink_zone(int priority, struct zone *zone, +static void shrink_zone(int priority, struct zone *zone, struct scan_control *sc) { unsigned long nr[NR_LRU_LISTS]; unsigned long nr_to_scan; - unsigned long nr_reclaimed = 0; unsigned long percent[2]; /* anon @ 0; file @ 1 */ enum lru_list l; @@ -1446,10 +1448,21 @@ static unsigned long shrink_zone(int priority, struct zone *zone, (unsigned long)sc->swap_cluster_max); nr[l] -= nr_to_scan; - nr_reclaimed += shrink_list(l, nr_to_scan, + sc->nr_reclaimed += shrink_list(l, nr_to_scan, zone, sc, priority); } } + /* + * On large memory systems, scan >> priority can become + * really large. This is fine for the starting priority; + * we want to put equal scanning pressure on each zone. + * However, if the VM has a harder time of freeing pages, + * with multiple processes reclaiming pages, the total + * freeing target can get unreasonably large. + */ + if (sc->nr_reclaimed > sc->swap_cluster_max && + priority < DEF_PRIORITY && !current_is_kswapd()) + break; } /* @@ -1462,7 +1475,6 @@ static unsigned long shrink_zone(int priority, struct zone *zone, shrink_active_list(SWAP_CLUSTER_MAX, zone, sc, priority, 0); throttle_vm_writeout(sc->gfp_mask); - return nr_reclaimed; } /* @@ -1476,16 +1488,13 @@ static unsigned long shrink_zone(int priority, struct zone *zone, * b) The zones may be over pages_high but they must go *over* pages_high to * satisfy the `incremental min' zone defense algorithm. * - * Returns the number of reclaimed pages. - * * If a zone is deemed to be full of pinned pages then just give it a light * scan then give up on it. */ -static unsigned long shrink_zones(int priority, struct zonelist *zonelist, +static void shrink_zones(int priority, struct zonelist *zonelist, struct scan_control *sc) { enum zone_type high_zoneidx = gfp_zone(sc->gfp_mask); - unsigned long nr_reclaimed = 0; struct zoneref *z; struct zone *zone; @@ -1516,10 +1525,8 @@ static unsigned long shrink_zones(int priority, struct zonelist *zonelist, priority); } - nr_reclaimed += shrink_zone(priority, zone, sc); + shrink_zone(priority, zone, sc); } - - return nr_reclaimed; } /* @@ -1544,7 +1551,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist, int priority; unsigned long ret = 0; unsigned long total_scanned = 0; - unsigned long nr_reclaimed = 0; struct reclaim_state *reclaim_state = current->reclaim_state; unsigned long lru_pages = 0; struct zoneref *z; @@ -1572,7 +1578,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist, sc->nr_scanned = 0; if (!priority) disable_swap_token(); - nr_reclaimed += shrink_zones(priority, zonelist, sc); + shrink_zones(priority, zonelist, sc); /* * Don't shrink slabs when reclaiming memory from * over limit cgroups @@ -1580,13 +1586,13 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist, if (scan_global_lru(sc)) { shrink_slab(sc->nr_scanned, sc->gfp_mask, lru_pages); if (reclaim_state) { - nr_reclaimed += reclaim_state->reclaimed_slab; + sc->nr_reclaimed += reclaim_state->reclaimed_slab; reclaim_state->reclaimed_slab = 0; } } total_scanned += sc->nr_scanned; - if (nr_reclaimed >= sc->swap_cluster_max) { - ret = nr_reclaimed; + if (sc->nr_reclaimed >= sc->swap_cluster_max) { + ret = sc->nr_reclaimed; goto out; } @@ -1609,7 +1615,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist, } /* top priority shrink_zones still had more to do? 
don't OOM, then */ if (!sc->all_unreclaimable && scan_global_lru(sc)) - ret = nr_reclaimed; + ret = sc->nr_reclaimed; out: /* * Now that we've scanned all the zones at this priority level, note @@ -1704,7 +1710,6 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order) int priority; int i; unsigned long total_scanned; - unsigned long nr_reclaimed; struct reclaim_state *reclaim_state = current->reclaim_state; struct scan_control sc = { .gfp_mask = GFP_KERNEL, @@ -1723,7 +1728,7 @@ static unsigned long balance_pgdat(pg_data_t *pgdat, int order) loop_again: total_scanned = 0; - nr_reclaimed = 0; + sc.nr_reclaimed = 0; sc.may_writepage = !laptop_mode; count_vm_event(PAGEOUTRUN); @@ -1809,11 +1814,11 @@ loop_again: */ if (!zone_watermark_ok(zone, order, 8*zone->pages_high, end_zone, 0)) - nr_reclaimed += shrink_zone(priority, zone, &sc); + shrink_zone(priority, zone, &sc); reclaim_state->reclaimed_slab = 0; nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL, lru_pages); - nr_reclaimed += reclaim_state->reclaimed_slab; + sc.nr_reclaimed += reclaim_state->reclaimed_slab; total_scanned += sc.nr_scanned; if (zone_is_all_unreclaimable(zone)) continue; @@ -1827,7 +1832,7 @@ loop_again: * even in laptop mode */ if (total_scanned > SWAP_CLUSTER_MAX * 2 && - total_scanned > nr_reclaimed + nr_reclaimed / 2) + total_scanned > sc.nr_reclaimed + sc.nr_reclaimed / 2) sc.may_writepage = 1; } if (all_zones_ok) @@ -1845,7 +1850,7 @@ loop_again: * matches the direct reclaim path behaviour in terms of impact * on zone->*_priority. */ - if (nr_reclaimed >= SWAP_CLUSTER_MAX) + if (sc.nr_reclaimed >= SWAP_CLUSTER_MAX) break; } out: @@ -1867,7 +1872,7 @@ out: goto loop_again; } - return nr_reclaimed; + return sc.nr_reclaimed; } /* @@ -2219,7 +2224,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order) struct task_struct *p = current; struct reclaim_state reclaim_state; int priority; - unsigned long nr_reclaimed = 0; struct scan_control sc = { .may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE), .may_swap = !!(zone_reclaim_mode & RECLAIM_SWAP), @@ -2252,9 +2256,9 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order) priority = ZONE_RECLAIM_PRIORITY; do { note_zone_scanning_priority(zone, priority); - nr_reclaimed += shrink_zone(priority, zone, &sc); + shrink_zone(priority, zone, &sc); priority--; - } while (priority >= 0 && nr_reclaimed < nr_pages); + } while (priority >= 0 && sc.nr_reclaimed < nr_pages); } slab_reclaimable = zone_page_state(zone, NR_SLAB_RECLAIMABLE); @@ -2278,13 +2282,13 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order) * Update nr_reclaimed by the number of slab pages we * reclaimed from this zone. */ - nr_reclaimed += slab_reclaimable - + sc.nr_reclaimed += slab_reclaimable - zone_page_state(zone, NR_SLAB_RECLAIMABLE); } p->reclaim_state = NULL; current->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE); - return nr_reclaimed >= nr_pages; + return sc.nr_reclaimed >= nr_pages; } int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order) -- cgit From 01dbe5c9b1004dab045cb7f38428258ca9cddc02 Mon Sep 17 00:00:00 2001 From: KOSAKI Motohiro Date: Tue, 6 Jan 2009 14:40:02 -0800 Subject: vmscan: improve reclaim throughput to bail out patch The vmscan bail out patch move nr_reclaimed variable to struct scan_control. Unfortunately, indirect access can easily happen cache miss. if heavy memory pressure happend, that's ok. cache miss already plenty. it is not observable. 
but, if memory pressure is lite, performance degression is obserbable. I compared following three pattern (it was mesured 10 times each) hackbench 125 process 3000 hackbench 130 process 3000 hackbench 135 process 3000 2.6.28-rc6 bail-out 125 130 135 125 130 135 ============================================================== 71.866 75.86 81.274 93.414 73.254 193.382 74.145 78.295 77.27 74.897 75.021 80.17 70.305 77.643 75.855 70.134 77.571 79.896 74.288 73.986 75.955 77.222 78.48 80.619 72.029 79.947 78.312 75.128 82.172 79.708 71.499 77.615 77.042 74.177 76.532 77.306 76.188 74.471 83.562 73.839 72.43 79.833 73.236 75.606 78.743 76.001 76.557 82.726 69.427 77.271 76.691 76.236 79.371 103.189 72.473 76.978 80.643 69.128 78.932 75.736 avg 72.545 76.767 78.534 76.017 77.03 93.256 std 1.89 1.71 2.41 6.29 2.79 34.16 min 69.427 73.986 75.855 69.128 72.43 75.736 max 76.188 79.947 83.562 93.414 82.172 193.382 about 4-5% degression. Then, this patch introduces a temporary local variable. result: 2.6.28-rc6 this patch num 125 130 135 125 130 135 ============================================================== 71.866 75.86 81.274 67.302 68.269 77.161 74.145 78.295 77.27 72.616 72.712 79.06 70.305 77.643 75.855 72.475 75.712 77.735 74.288 73.986 75.955 69.229 73.062 78.814 72.029 79.947 78.312 71.551 74.392 78.564 71.499 77.615 77.042 69.227 74.31 78.837 76.188 74.471 83.562 70.759 75.256 76.6 73.236 75.606 78.743 69.966 76.001 78.464 69.427 77.271 76.691 69.068 75.218 80.321 72.473 76.978 80.643 72.057 77.151 79.068 avg 72.545 76.767 78.534 70.425 74.2083 78.462 std 1.89 1.71 2.41 1.66 2.34 1.00 min 69.427 73.986 75.855 67.302 68.269 76.6 max 76.188 79.947 83.562 72.616 77.151 80.321 OK. the degression is disappeared. Signed-off-by: KOSAKI Motohiro Acked-by: Rik van Riel Cc: Mel Gorman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmscan.c | 15 +++++++++------ 1 file changed, 9 insertions(+), 6 deletions(-) (limited to 'mm') diff --git a/mm/vmscan.c b/mm/vmscan.c index 5faa7739487..13f050d667e 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1410,6 +1410,8 @@ static void shrink_zone(int priority, struct zone *zone, unsigned long nr_to_scan; unsigned long percent[2]; /* anon @ 0; file @ 1 */ enum lru_list l; + unsigned long nr_reclaimed = sc->nr_reclaimed; + unsigned long swap_cluster_max = sc->swap_cluster_max; get_scan_ratio(zone, sc, percent); @@ -1425,7 +1427,7 @@ static void shrink_zone(int priority, struct zone *zone, } zone->lru[l].nr_scan += scan; nr[l] = zone->lru[l].nr_scan; - if (nr[l] >= sc->swap_cluster_max) + if (nr[l] >= swap_cluster_max) zone->lru[l].nr_scan = 0; else nr[l] = 0; @@ -1444,12 +1446,11 @@ static void shrink_zone(int priority, struct zone *zone, nr[LRU_INACTIVE_FILE]) { for_each_evictable_lru(l) { if (nr[l]) { - nr_to_scan = min(nr[l], - (unsigned long)sc->swap_cluster_max); + nr_to_scan = min(nr[l], swap_cluster_max); nr[l] -= nr_to_scan; - sc->nr_reclaimed += shrink_list(l, nr_to_scan, - zone, sc, priority); + nr_reclaimed += shrink_list(l, nr_to_scan, + zone, sc, priority); } } /* @@ -1460,11 +1461,13 @@ static void shrink_zone(int priority, struct zone *zone, * with multiple processes reclaiming pages, the total * freeing target can get unreasonably large. */ - if (sc->nr_reclaimed > sc->swap_cluster_max && + if (nr_reclaimed > swap_cluster_max && priority < DEF_PRIORITY && !current_is_kswapd()) break; } + sc->nr_reclaimed = nr_reclaimed; + /* * Even if we did not try to evict anon pages at all, we want to * rebalance the anon lru active/inactive ratio. 
-- cgit From 09f445e7f5107c91be12ed386350de6cd055e0a4 Mon Sep 17 00:00:00 2001 From: KOSAKI Motohiro Date: Tue, 6 Jan 2009 14:40:03 -0800 Subject: mm: kill zone_is_near_oom() zone_is_near_oom() is unused. Signed-off-by: KOSAKI Motohiro Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmscan.c | 5 ----- 1 file changed, 5 deletions(-) (limited to 'mm') diff --git a/mm/vmscan.c b/mm/vmscan.c index 13f050d667e..466a36b3bad 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1167,11 +1167,6 @@ static inline void note_zone_scanning_priority(struct zone *zone, int priority) zone->prev_priority = priority; } -static inline int zone_is_near_oom(struct zone *zone) -{ - return zone->pages_scanned >= (zone_lru_pages(zone) * 3); -} - /* * This moves pages from the active list to the inactive list. * -- cgit From 79f4b7bf393e67bbffec807cc68caaefc72b82ee Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Tue, 6 Jan 2009 14:40:05 -0800 Subject: badpage: simplify page_alloc flag check+clear Simplify the PAGE_FLAGS checking and clearing when freeing and allocating a page: check the same flags as before when freeing, clear ALL the flags (unless PageReserved) when freeing, check ALL flags off when allocating. Signed-off-by: Hugh Dickins Cc: Nick Piggin Cc: Christoph Lameter Cc: Mel Gorman Cc: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 19 ++++++------------- 1 file changed, 6 insertions(+), 13 deletions(-) (limited to 'mm') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 31c512410a9..b90a74d2848 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -231,7 +231,6 @@ static void bad_page(struct page *page) printk(KERN_EMERG "Trying to fix it up, but a reboot is needed\n" KERN_EMERG "Backtrace:\n"); dump_stack(); - page->flags &= ~PAGE_FLAGS_CLEAR_WHEN_BAD; set_page_count(page, 0); reset_page_mapcount(page); page->mapping = NULL; @@ -468,16 +467,16 @@ static inline int free_pages_check(struct page *page) (page_count(page) != 0) | (page->flags & PAGE_FLAGS_CHECK_AT_FREE))) bad_page(page); - if (PageDirty(page)) - __ClearPageDirty(page); - if (PageSwapBacked(page)) - __ClearPageSwapBacked(page); /* * For now, we report if PG_reserved was found set, but do not * clear it, and do not free the page. But we shall soon need * to do more, for when the ZERO_PAGE count wraps negative. */ - return PageReserved(page); + if (PageReserved(page)) + return 1; + if (page->flags & PAGE_FLAGS_CHECK_AT_PREP) + page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP; + return 0; } /* @@ -621,13 +620,7 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags) if (PageReserved(page)) return 1; - page->flags &= ~(1 << PG_uptodate | 1 << PG_error | 1 << PG_reclaim | - 1 << PG_referenced | 1 << PG_arch_1 | - 1 << PG_owner_priv_1 | 1 << PG_mappedtodisk -#ifdef CONFIG_UNEVICTABLE_LRU - | 1 << PG_mlocked -#endif - ); + page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP; set_page_private(page, 0); set_page_refcounted(page); -- cgit From 8cc3b39221b0ecbd83a338948a8396df097fc656 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Tue, 6 Jan 2009 14:40:06 -0800 Subject: badpage: keep any bad page out of circulation Until now the bad_page() checkers have special-cased PageReserved, keeping those pages out of circulation thereafter. Now extend the special case to all: we want to keep ANY page with bad state out of circulation - the "free" page may well be in use by something. 
Leave the bad state of those pages untouched, for examination by debuggers; except for PageBuddy - leaving that set would risk bringing the page back. Signed-off-by: Hugh Dickins Cc: Nick Piggin Cc: Christoph Lameter Cc: Mel Gorman Cc: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 52 ++++++++++++++++++++++++---------------------------- 1 file changed, 24 insertions(+), 28 deletions(-) (limited to 'mm') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index b90a74d2848..bd330252fc7 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -231,9 +231,9 @@ static void bad_page(struct page *page) printk(KERN_EMERG "Trying to fix it up, but a reboot is needed\n" KERN_EMERG "Backtrace:\n"); dump_stack(); - set_page_count(page, 0); - reset_page_mapcount(page); - page->mapping = NULL; + + /* Leave bad fields for debug, except PageBuddy could make trouble */ + __ClearPageBuddy(page); add_taint(TAINT_BAD_PAGE); } @@ -290,25 +290,31 @@ void prep_compound_gigantic_page(struct page *page, unsigned long order) } #endif -static void destroy_compound_page(struct page *page, unsigned long order) +static int destroy_compound_page(struct page *page, unsigned long order) { int i; int nr_pages = 1 << order; + int bad = 0; - if (unlikely(compound_order(page) != order)) + if (unlikely(compound_order(page) != order) || + unlikely(!PageHead(page))) { bad_page(page); + bad++; + } - if (unlikely(!PageHead(page))) - bad_page(page); __ClearPageHead(page); + for (i = 1; i < nr_pages; i++) { struct page *p = page + i; - if (unlikely(!PageTail(p) | - (p->first_page != page))) + if (unlikely(!PageTail(p) | (p->first_page != page))) { bad_page(page); + bad++; + } __ClearPageTail(p); } + + return bad; } static inline void prep_zero_page(struct page *page, int order, gfp_t gfp_flags) @@ -428,7 +434,8 @@ static inline void __free_one_page(struct page *page, int migratetype = get_pageblock_migratetype(page); if (unlikely(PageCompound(page))) - destroy_compound_page(page, order); + if (unlikely(destroy_compound_page(page, order))) + return; page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1); @@ -465,15 +472,10 @@ static inline int free_pages_check(struct page *page) if (unlikely(page_mapcount(page) | (page->mapping != NULL) | (page_count(page) != 0) | - (page->flags & PAGE_FLAGS_CHECK_AT_FREE))) + (page->flags & PAGE_FLAGS_CHECK_AT_FREE))) { bad_page(page); - /* - * For now, we report if PG_reserved was found set, but do not - * clear it, and do not free the page. But we shall soon need - * to do more, for when the ZERO_PAGE count wraps negative. - */ - if (PageReserved(page)) return 1; + } if (page->flags & PAGE_FLAGS_CHECK_AT_PREP) page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP; return 0; @@ -521,11 +523,11 @@ static void __free_pages_ok(struct page *page, unsigned int order) { unsigned long flags; int i; - int reserved = 0; + int bad = 0; for (i = 0 ; i < (1 << order) ; ++i) - reserved += free_pages_check(page + i); - if (reserved) + bad += free_pages_check(page + i); + if (bad) return; if (!PageHighMem(page)) { @@ -610,17 +612,11 @@ static int prep_new_page(struct page *page, int order, gfp_t gfp_flags) if (unlikely(page_mapcount(page) | (page->mapping != NULL) | (page_count(page) != 0) | - (page->flags & PAGE_FLAGS_CHECK_AT_PREP))) + (page->flags & PAGE_FLAGS_CHECK_AT_PREP))) { bad_page(page); - - /* - * For now, we report if PG_reserved was found set, but do not - * clear it, and do not allocate the page: as a safety net. 
- */ - if (PageReserved(page)) return 1; + } - page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP; set_page_private(page, 0); set_page_refcounted(page); -- cgit From 3dc147414ccad81dc33edb80774b1fed12a38c08 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Tue, 6 Jan 2009 14:40:08 -0800 Subject: badpage: replace page_remove_rmap Eeek and BUG Now that bad pages are kept out of circulation, there is no need for the infamous page_remove_rmap() BUG() - once that page is freed, its negative mapcount will issue a "Bad page state" message and the page won't be freed. Removing the BUG() allows more info, on subsequent pages, to be gathered. We do have more info about the page at this point than bad_page() can know - notably, what the pmd is, which might pinpoint something like low 64kB corruption - but page_remove_rmap() isn't given the address to find that. In practice, there is only one call to page_remove_rmap() which has ever reported anything, that from zap_pte_range() (usually on exit, sometimes on munmap). It has all the info, so remove page_remove_rmap()'s "Eeek" message and leave it all to zap_pte_range(). mm/memory.c already has a hardly used print_bad_pte() function, showing some of the appropriate info: extend it to show what we want for the rmap case: pte info, page info (when there is a page) and vma info to compare. zap_pte_range() already knows the pmd, but print_bad_pte() is easier to use if it works that out for itself. Some of this info is also shown in bad_page()'s "Bad page state" message. Keep them separate, but adjust them to match each other as far as possible. Say "Bad page map" in print_bad_pte(), and add a TAINT_BAD_PAGE there too. print_bad_pte() show current->comm unconditionally (though it should get repeated in the usually irrelevant stack trace): sorry, I misled Nick Piggin to make it conditional on vm_mm == current->mm, but current->mm is already NULL in the exit case. Usually current->comm is good, though exceptionally it may not be that of the mm (when "swapoff" for example). Signed-off-by: Hugh Dickins Cc: Nick Piggin Cc: Christoph Lameter Cc: Mel Gorman Cc: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memory.c | 52 ++++++++++++++++++++++++++++++++++++++++------------ mm/page_alloc.c | 16 ++++++++-------- mm/rmap.c | 16 ---------------- 3 files changed, 48 insertions(+), 36 deletions(-) (limited to 'mm') diff --git a/mm/memory.c b/mm/memory.c index 89339c61f8e..cda04b19f73 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -52,6 +52,9 @@ #include #include #include +#include +#include +#include #include #include @@ -59,9 +62,6 @@ #include #include -#include -#include - #include "internal.h" #ifndef CONFIG_NEED_MULTIPLE_NODES @@ -375,15 +375,41 @@ static inline void add_mm_rss(struct mm_struct *mm, int file_rss, int anon_rss) * * The calling function must still handle the error. */ -static void print_bad_pte(struct vm_area_struct *vma, pte_t pte, - unsigned long vaddr) -{ - printk(KERN_ERR "Bad pte = %08llx, process = %s, " - "vm_flags = %lx, vaddr = %lx\n", - (long long)pte_val(pte), - (vma->vm_mm == current->mm ? current->comm : "???"), - vma->vm_flags, vaddr); +static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr, + pte_t pte, struct page *page) +{ + pgd_t *pgd = pgd_offset(vma->vm_mm, addr); + pud_t *pud = pud_offset(pgd, addr); + pmd_t *pmd = pmd_offset(pud, addr); + struct address_space *mapping; + pgoff_t index; + + mapping = vma->vm_file ? 
vma->vm_file->f_mapping : NULL; + index = linear_page_index(vma, addr); + + printk(KERN_EMERG "Bad page map in process %s pte:%08llx pmd:%08llx\n", + current->comm, + (long long)pte_val(pte), (long long)pmd_val(*pmd)); + if (page) { + printk(KERN_EMERG + "page:%p flags:%p count:%d mapcount:%d mapping:%p index:%lx\n", + page, (void *)page->flags, page_count(page), + page_mapcount(page), page->mapping, page->index); + } + printk(KERN_EMERG + "addr:%p vm_flags:%08lx anon_vma:%p mapping:%p index:%lx\n", + (void *)addr, vma->vm_flags, vma->anon_vma, mapping, index); + /* + * Choose text because data symbols depend on CONFIG_KALLSYMS_ALL=y + */ + if (vma->vm_ops) + print_symbol(KERN_EMERG "vma->vm_ops->fault: %s\n", + (unsigned long)vma->vm_ops->fault); + if (vma->vm_file && vma->vm_file->f_op) + print_symbol(KERN_EMERG "vma->vm_file->f_op->mmap: %s\n", + (unsigned long)vma->vm_file->f_op->mmap); dump_stack(); + add_taint(TAINT_BAD_PAGE); } static inline int is_cow_mapping(unsigned int flags) @@ -773,6 +799,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, file_rss--; } page_remove_rmap(page, vma); + if (unlikely(page_mapcount(page) < 0)) + print_bad_pte(vma, addr, ptent, page); tlb_remove_page(tlb, page); continue; } @@ -2684,7 +2712,7 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma, /* * Page table corrupted: show pte and kill process. */ - print_bad_pte(vma, orig_pte, address); + print_bad_pte(vma, address, orig_pte, NULL); return VM_FAULT_OOM; } diff --git a/mm/page_alloc.c b/mm/page_alloc.c index bd330252fc7..3acb216e9a7 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -222,14 +222,14 @@ static inline int bad_range(struct zone *zone, struct page *page) static void bad_page(struct page *page) { - printk(KERN_EMERG "Bad page state in process '%s'\n" KERN_EMERG - "page:%p flags:0x%0*lx mapping:%p mapcount:%d count:%d\n", - current->comm, page, (int)(2*sizeof(unsigned long)), - (unsigned long)page->flags, page->mapping, - page_mapcount(page), page_count(page)); - - printk(KERN_EMERG "Trying to fix it up, but a reboot is needed\n" - KERN_EMERG "Backtrace:\n"); + printk(KERN_EMERG "Bad page state in process %s pfn:%05lx\n", + current->comm, page_to_pfn(page)); + printk(KERN_EMERG + "page:%p flags:%p count:%d mapcount:%d mapping:%p index:%lx\n", + page, (void *)page->flags, page_count(page), + page_mapcount(page), page->mapping, page->index); + printk(KERN_EMERG "Trying to fix it up, but a reboot is needed\n"); + dump_stack(); /* Leave bad fields for debug, except PageBuddy could make trouble */ diff --git a/mm/rmap.c b/mm/rmap.c index b1770b11a57..32098255082 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -47,7 +47,6 @@ #include #include #include -#include #include #include #include @@ -725,21 +724,6 @@ void page_dup_rmap(struct page *page, struct vm_area_struct *vma, unsigned long void page_remove_rmap(struct page *page, struct vm_area_struct *vma) { if (atomic_add_negative(-1, &page->_mapcount)) { - if (unlikely(page_mapcount(page) < 0)) { - printk (KERN_EMERG "Eeek! page_mapcount(page) went negative! 
(%d)\n", page_mapcount(page)); - printk (KERN_EMERG " page pfn = %lx\n", page_to_pfn(page)); - printk (KERN_EMERG " page->flags = %lx\n", page->flags); - printk (KERN_EMERG " page->count = %x\n", page_count(page)); - printk (KERN_EMERG " page->mapping = %p\n", page->mapping); - print_symbol (KERN_EMERG " vma->vm_ops = %s\n", (unsigned long)vma->vm_ops); - if (vma->vm_ops) { - print_symbol (KERN_EMERG " vma->vm_ops->fault = %s\n", (unsigned long)vma->vm_ops->fault); - } - if (vma->vm_file && vma->vm_file->f_op) - print_symbol (KERN_EMERG " vma->vm_file->f_op->mmap = %s\n", (unsigned long)vma->vm_file->f_op->mmap); - BUG(); - } - /* * Now that the last pte has gone, s390 must transfer dirty * flag from storage key to struct page. We can usually skip -- cgit From 22b31eec63e5f2e219a3ee15f456897272bc73e8 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Tue, 6 Jan 2009 14:40:09 -0800 Subject: badpage: vm_normal_page use print_bad_pte print_bad_pte() is so far being called only when zap_pte_range() finds negative page_mapcount, or there's a fault on a pte_file where it does not belong. That's weak coverage when we suspect pagetable corruption. Originally, it was called when vm_normal_page() found an invalid pfn: but pfn_valid is expensive on some architectures and configurations, so 2.6.24 put that under CONFIG_DEBUG_VM (which doesn't help in the field), then 2.6.26 replaced it by a VM_BUG_ON (likewise). Reinstate the print_bad_pte() in vm_normal_page(), but use a cheaper test than pfn_valid(): memmap_init_zone() (used in bootup and hotplug) keep a __read_mostly note of the highest_memmap_pfn, vm_normal_page() then check pfn against that. We could call this pfn_plausible() or pfn_sane(), but I doubt we'll need it elsewhere: of course it's not reliable, but gives much stronger pagetable validation on many boxes. Also use print_bad_pte() when the pte_special bit is found outside a VM_PFNMAP or VM_MIXEDMAP area, instead of VM_BUG_ON. 
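In outline, the cheap test amounts to recording a high-water mark once when the memmap is set up and doing a plain comparison on every lookup. A rough sketch of that pattern (the identifiers here are illustrative stand-ins, not the names the patch uses; the real hunks follow below):

        /* recorded at memmap init time (boot or memory hotplug) */
        static unsigned long highest_plausible_pfn __read_mostly;

        static void note_memmap_range(unsigned long start_pfn, unsigned long end_pfn)
        {
                if (highest_plausible_pfn < end_pfn - 1)
                        highest_plausible_pfn = end_pfn - 1;
        }

        /* hot path: no pfn_valid() work, just a bounds check */
        static inline int pfn_plausible(unsigned long pfn)
        {
                return pfn <= highest_plausible_pfn;
        }

This does not guarantee the pfn has a struct page, only that it is not obviously off the end of the memmap, which is exactly the trade-off described above.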
Signed-off-by: Hugh Dickins Cc: Nick Piggin Cc: Christoph Lameter Cc: Mel Gorman Cc: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/internal.h | 1 + mm/memory.c | 20 ++++++++++---------- mm/page_alloc.c | 4 ++++ 3 files changed, 15 insertions(+), 10 deletions(-) (limited to 'mm') diff --git a/mm/internal.h b/mm/internal.h index 13333bc2eb6..1981bc9454f 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -49,6 +49,7 @@ extern void putback_lru_page(struct page *page); /* * in mm/page_alloc.c */ +extern unsigned long highest_memmap_pfn; extern void __free_pages_bootmem(struct page *page, unsigned int order); /* diff --git a/mm/memory.c b/mm/memory.c index cda04b19f73..890095f5f36 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -467,21 +467,18 @@ static inline int is_cow_mapping(unsigned int flags) struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr, pte_t pte) { - unsigned long pfn; + unsigned long pfn = pte_pfn(pte); if (HAVE_PTE_SPECIAL) { - if (likely(!pte_special(pte))) { - VM_BUG_ON(!pfn_valid(pte_pfn(pte))); - return pte_page(pte); - } - VM_BUG_ON(!(vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))); + if (likely(!pte_special(pte))) + goto check_pfn; + if (!(vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))) + print_bad_pte(vma, addr, pte, NULL); return NULL; } /* !HAVE_PTE_SPECIAL case follows: */ - pfn = pte_pfn(pte); - if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) { if (vma->vm_flags & VM_MIXEDMAP) { if (!pfn_valid(pfn)) @@ -497,11 +494,14 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr, } } - VM_BUG_ON(!pfn_valid(pfn)); +check_pfn: + if (unlikely(pfn > highest_memmap_pfn)) { + print_bad_pte(vma, addr, pte, NULL); + return NULL; + } /* * NOTE! We still have PageReserved() pages in the page tables. - * * eg. VDSO mappings can cause them to exist. */ out: diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 3acb216e9a7..755c99a0ac7 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -69,6 +69,7 @@ EXPORT_SYMBOL(node_states); unsigned long totalram_pages __read_mostly; unsigned long totalreserve_pages __read_mostly; +unsigned long highest_memmap_pfn __read_mostly; int percpu_pagelist_fraction; #ifdef CONFIG_HUGETLB_PAGE_SIZE_VARIABLE @@ -2597,6 +2598,9 @@ void __meminit memmap_init_zone(unsigned long size, int nid, unsigned long zone, unsigned long pfn; struct zone *z; + if (highest_memmap_pfn < end_pfn - 1) + highest_memmap_pfn = end_pfn - 1; + z = &NODE_DATA(nid)->node_zones[zone]; for (pfn = start_pfn; pfn < end_pfn; pfn++) { /* -- cgit From 2509ef26db4699a5d9fa876e90ddfc107afcab84 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Tue, 6 Jan 2009 14:40:10 -0800 Subject: badpage: zap print_bad_pte on swap and file Complete zap_pte_range()'s coverage of bad pagetable entries by calling print_bad_pte() on a pte_file in a linear vma and on a bad swap entry. That needs free_swap_and_cache() to tell it, which will also have shown one of those "swap_free" errors (but with much less information). Similar checks in fork's copy_one_pte()? No, that would be more noisy than helpful: we'll see them when parent and child exec or exit. Where do_nonlinear_fault() calls print_bad_pte(): omit !VM_CAN_NONLINEAR case, that could only be a bug in sys_remap_file_pages(), not a bad pte. VM_FAULT_OOM rather than VM_FAULT_SIGBUS? Well, okay, that is consistent with what happens if do_swap_page() operates a bad swap entry; but don't we have patches to be more careful about killing when VM_FAULT_OOM? 
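Written out straight, the intent of the zap_pte_range() hunk below is: a pte_file entry is only legitimate in a VM_NONLINEAR vma, and free_swap_and_cache() now reports whether the swap entry it was handed referenced a real swap area, so both kinds of corruption get a print_bad_pte() (sketch of the logic, not the literal diff):

        if (pte_file(ptent)) {
                /* file pte outside a nonlinear mapping: page table corruption */
                if (unlikely(!(vma->vm_flags & VM_NONLINEAR)))
                        print_bad_pte(vma, addr, ptent, NULL);
        } else if (unlikely(!free_swap_and_cache(pte_to_swp_entry(ptent)))) {
                /* free_swap_and_cache() returns 0 for a bogus swap entry */
                print_bad_pte(vma, addr, ptent, NULL);
        }
        pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);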
Signed-off-by: Hugh Dickins Cc: Nick Piggin Cc: Christoph Lameter Cc: Mel Gorman Cc: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memory.c | 11 +++++++---- mm/swapfile.c | 7 ++++--- 2 files changed, 11 insertions(+), 7 deletions(-) (limited to 'mm') diff --git a/mm/memory.c b/mm/memory.c index 890095f5f36..b273cc12b15 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -810,8 +810,12 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, */ if (unlikely(details)) continue; - if (!pte_file(ptent)) - free_swap_and_cache(pte_to_swp_entry(ptent)); + if (pte_file(ptent)) { + if (unlikely(!(vma->vm_flags & VM_NONLINEAR))) + print_bad_pte(vma, addr, ptent, NULL); + } else if + (unlikely(!free_swap_and_cache(pte_to_swp_entry(ptent)))) + print_bad_pte(vma, addr, ptent, NULL); pte_clear_not_present_full(mm, addr, pte, tlb->fullmm); } while (pte++, addr += PAGE_SIZE, (addr != end && *zap_work > 0)); @@ -2707,8 +2711,7 @@ static int do_nonlinear_fault(struct mm_struct *mm, struct vm_area_struct *vma, if (!pte_unmap_same(mm, pmd, page_table, orig_pte)) return 0; - if (unlikely(!(vma->vm_flags & VM_NONLINEAR) || - !(vma->vm_flags & VM_CAN_NONLINEAR))) { + if (unlikely(!(vma->vm_flags & VM_NONLINEAR))) { /* * Page table corrupted: show pte and kill process. */ diff --git a/mm/swapfile.c b/mm/swapfile.c index d0052360191..f2874585577 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -571,13 +571,13 @@ int try_to_free_swap(struct page *page) * Free the swap entry like above, but also try to * free the page cache entry if it is the last user. */ -void free_swap_and_cache(swp_entry_t entry) +int free_swap_and_cache(swp_entry_t entry) { - struct swap_info_struct * p; + struct swap_info_struct *p; struct page *page = NULL; if (is_migration_entry(entry)) - return; + return 1; p = swap_info_get(entry); if (p) { @@ -603,6 +603,7 @@ void free_swap_and_cache(swp_entry_t entry) unlock_page(page); page_cache_release(page); } + return p != NULL; } #ifdef CONFIG_HIBERNATION -- cgit From edc315fd222497ae4f4b959a9e31ada1e68a4755 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Tue, 6 Jan 2009 14:40:11 -0800 Subject: badpage: remove vma from page_remove_rmap Remove page_remove_rmap()'s vma arg, which was only for the Eeek message. And remove the BUG_ON(page_mapcount(page) == 0) from CONFIG_DEBUG_VM's page_dup_rmap(): we're trying to be more resilient about that than BUGs. Signed-off-by: Hugh Dickins Cc: Nick Piggin Cc: Christoph Lameter Cc: Mel Gorman Cc: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/filemap_xip.c | 2 +- mm/fremap.c | 2 +- mm/memory.c | 4 ++-- mm/rmap.c | 8 +++----- 4 files changed, 7 insertions(+), 9 deletions(-) (limited to 'mm') diff --git a/mm/filemap_xip.c b/mm/filemap_xip.c index b5167dfb2f2..0c04615651b 100644 --- a/mm/filemap_xip.c +++ b/mm/filemap_xip.c @@ -193,7 +193,7 @@ retry: /* Nuke the page table entry. 
*/ flush_cache_page(vma, address, pte_pfn(*pte)); pteval = ptep_clear_flush_notify(vma, address, pte); - page_remove_rmap(page, vma); + page_remove_rmap(page); dec_mm_counter(mm, file_rss); BUG_ON(pte_dirty(pteval)); pte_unmap_unlock(pte, ptl); diff --git a/mm/fremap.c b/mm/fremap.c index 7d12ca70ef7..62d5bbda921 100644 --- a/mm/fremap.c +++ b/mm/fremap.c @@ -37,7 +37,7 @@ static void zap_pte(struct mm_struct *mm, struct vm_area_struct *vma, if (page) { if (pte_dirty(pte)) set_page_dirty(page); - page_remove_rmap(page, vma); + page_remove_rmap(page); page_cache_release(page); update_hiwater_rss(mm); dec_mm_counter(mm, file_rss); diff --git a/mm/memory.c b/mm/memory.c index b273cc12b15..0f9abbaf618 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -798,7 +798,7 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb, mark_page_accessed(page); file_rss--; } - page_remove_rmap(page, vma); + page_remove_rmap(page); if (unlikely(page_mapcount(page) < 0)) print_bad_pte(vma, addr, ptent, page); tlb_remove_page(tlb, page); @@ -2023,7 +2023,7 @@ gotten: * mapcount is visible. So transitively, TLBs to * old page will be flushed before it can be reused. */ - page_remove_rmap(old_page, vma); + page_remove_rmap(old_page); } /* Free the old page.. */ diff --git a/mm/rmap.c b/mm/rmap.c index 32098255082..ac4af8cffbf 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -707,7 +707,6 @@ void page_add_file_rmap(struct page *page) */ void page_dup_rmap(struct page *page, struct vm_area_struct *vma, unsigned long address) { - BUG_ON(page_mapcount(page) == 0); if (PageAnon(page)) __page_check_anon_rmap(page, vma, address); atomic_inc(&page->_mapcount); @@ -717,11 +716,10 @@ void page_dup_rmap(struct page *page, struct vm_area_struct *vma, unsigned long /** * page_remove_rmap - take down pte mapping from a page * @page: page to remove mapping from - * @vma: the vm area in which the mapping is removed * * The caller needs to hold the pte lock. */ -void page_remove_rmap(struct page *page, struct vm_area_struct *vma) +void page_remove_rmap(struct page *page) { if (atomic_add_negative(-1, &page->_mapcount)) { /* @@ -837,7 +835,7 @@ static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma, dec_mm_counter(mm, file_rss); - page_remove_rmap(page, vma); + page_remove_rmap(page); page_cache_release(page); out_unmap: @@ -952,7 +950,7 @@ static int try_to_unmap_cluster(unsigned long cursor, unsigned int *mapcount, if (pte_dirty(pteval)) set_page_dirty(page); - page_remove_rmap(page, vma); + page_remove_rmap(page); page_cache_release(page); dec_mm_counter(mm, file_rss); (*mapcount)--; -- cgit From d936cf9b39b06c8d2e0d7fb5e7b4f176e18dec69 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Tue, 6 Jan 2009 14:40:12 -0800 Subject: badpage: ratelimit print_bad_pte and bad_page print_bad_pte() and bad_page() might each need ratelimiting - especially for their dump_stacks, almost never of interest, yet not quite dispensible. Correlating corruption across neighbouring entries can be very helpful, so allow a burst of 60 reports before keeping quiet for the remainder of that minute (or allow a steady drip of one report per second). 
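Stripped of the reporting itself, the throttle reduces to a few static counters and a resume time; roughly as follows (report_allowed() is an illustrative wrapper, not a function the patch adds):

        static unsigned long resume;      /* jiffies when reporting may resume */
        static unsigned long nr_shown;    /* reports printed in the current burst */
        static unsigned long nr_unshown;  /* reports swallowed while keeping quiet */

        static int report_allowed(void)
        {
                if (nr_shown == 60) {
                        if (time_before(jiffies, resume)) {
                                nr_unshown++;           /* still in the quiet period */
                                return 0;
                        }
                        if (nr_unshown) {
                                printk(KERN_EMERG "%lu messages suppressed\n",
                                       nr_unshown);
                                nr_unshown = 0;
                        }
                        nr_shown = 0;                   /* start a new burst */
                }
                if (nr_shown++ == 0)
                        resume = jiffies + 60 * HZ;     /* quiet until a minute from now */
                return 1;
        }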
Signed-off-by: Hugh Dickins Cc: Nick Piggin Cc: Christoph Lameter Cc: Mel Gorman Cc: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memory.c | 23 +++++++++++++++++++++++ mm/page_alloc.c | 26 +++++++++++++++++++++++++- 2 files changed, 48 insertions(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/memory.c b/mm/memory.c index 0f9abbaf618..b12888c1b4e 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -383,6 +383,29 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr, pmd_t *pmd = pmd_offset(pud, addr); struct address_space *mapping; pgoff_t index; + static unsigned long resume; + static unsigned long nr_shown; + static unsigned long nr_unshown; + + /* + * Allow a burst of 60 reports, then keep quiet for that minute; + * or allow a steady drip of one report per second. + */ + if (nr_shown == 60) { + if (time_before(jiffies, resume)) { + nr_unshown++; + return; + } + if (nr_unshown) { + printk(KERN_EMERG + "Bad page map: %lu messages suppressed\n", + nr_unshown); + nr_unshown = 0; + } + nr_shown = 0; + } + if (nr_shown++ == 0) + resume = jiffies + 60 * HZ; mapping = vma->vm_file ? vma->vm_file->f_mapping : NULL; index = linear_page_index(vma, addr); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 755c99a0ac7..d1a80f652c6 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -223,6 +223,30 @@ static inline int bad_range(struct zone *zone, struct page *page) static void bad_page(struct page *page) { + static unsigned long resume; + static unsigned long nr_shown; + static unsigned long nr_unshown; + + /* + * Allow a burst of 60 reports, then keep quiet for that minute; + * or allow a steady drip of one report per second. + */ + if (nr_shown == 60) { + if (time_before(jiffies, resume)) { + nr_unshown++; + goto out; + } + if (nr_unshown) { + printk(KERN_EMERG + "Bad page state: %lu messages suppressed\n", + nr_unshown); + nr_unshown = 0; + } + nr_shown = 0; + } + if (nr_shown++ == 0) + resume = jiffies + 60 * HZ; + printk(KERN_EMERG "Bad page state in process %s pfn:%05lx\n", current->comm, page_to_pfn(page)); printk(KERN_EMERG @@ -232,7 +256,7 @@ static void bad_page(struct page *page) printk(KERN_EMERG "Trying to fix it up, but a reboot is needed\n"); dump_stack(); - +out: /* Leave bad fields for debug, except PageBuddy could make trouble */ __ClearPageBuddy(page); add_taint(TAINT_BAD_PAGE); -- cgit From 1e9e63650d6cb88e6d6d2ca6cc3ee276c26de4a3 Mon Sep 17 00:00:00 2001 From: Hugh Dickins Date: Tue, 6 Jan 2009 14:40:13 -0800 Subject: badpage: KERN_ALERT BUG instead of KERN_EMERG bad_page() and rmap Eeek messages have said KERN_EMERG for a few years, which I've followed in print_bad_pte(). These are serious system errors, on a par with BUGs, but they're not quite emergencies, and we do our best to carry on: say KERN_ALERT "BUG: " like the x86 oops does. And remove the "Trying to fix it up, but a reboot is needed" line: it's not untrue, but I hope the KERN_ALERT "BUG: " conveys as much. 
Signed-off-by: Hugh Dickins Cc: Nick Piggin Cc: Christoph Lameter Cc: Mel Gorman Cc: Rik van Riel Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/memory.c | 15 ++++++++------- mm/page_alloc.c | 9 ++++----- 2 files changed, 12 insertions(+), 12 deletions(-) (limited to 'mm') diff --git a/mm/memory.c b/mm/memory.c index b12888c1b4e..db68af8e0bc 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -397,8 +397,8 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr, return; } if (nr_unshown) { - printk(KERN_EMERG - "Bad page map: %lu messages suppressed\n", + printk(KERN_ALERT + "BUG: Bad page map: %lu messages suppressed\n", nr_unshown); nr_unshown = 0; } @@ -410,26 +410,27 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr, mapping = vma->vm_file ? vma->vm_file->f_mapping : NULL; index = linear_page_index(vma, addr); - printk(KERN_EMERG "Bad page map in process %s pte:%08llx pmd:%08llx\n", + printk(KERN_ALERT + "BUG: Bad page map in process %s pte:%08llx pmd:%08llx\n", current->comm, (long long)pte_val(pte), (long long)pmd_val(*pmd)); if (page) { - printk(KERN_EMERG + printk(KERN_ALERT "page:%p flags:%p count:%d mapcount:%d mapping:%p index:%lx\n", page, (void *)page->flags, page_count(page), page_mapcount(page), page->mapping, page->index); } - printk(KERN_EMERG + printk(KERN_ALERT "addr:%p vm_flags:%08lx anon_vma:%p mapping:%p index:%lx\n", (void *)addr, vma->vm_flags, vma->anon_vma, mapping, index); /* * Choose text because data symbols depend on CONFIG_KALLSYMS_ALL=y */ if (vma->vm_ops) - print_symbol(KERN_EMERG "vma->vm_ops->fault: %s\n", + print_symbol(KERN_ALERT "vma->vm_ops->fault: %s\n", (unsigned long)vma->vm_ops->fault); if (vma->vm_file && vma->vm_file->f_op) - print_symbol(KERN_EMERG "vma->vm_file->f_op->mmap: %s\n", + print_symbol(KERN_ALERT "vma->vm_file->f_op->mmap: %s\n", (unsigned long)vma->vm_file->f_op->mmap); dump_stack(); add_taint(TAINT_BAD_PAGE); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index d1a80f652c6..d531e8ef998 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -237,8 +237,8 @@ static void bad_page(struct page *page) goto out; } if (nr_unshown) { - printk(KERN_EMERG - "Bad page state: %lu messages suppressed\n", + printk(KERN_ALERT + "BUG: Bad page state: %lu messages suppressed\n", nr_unshown); nr_unshown = 0; } @@ -247,13 +247,12 @@ static void bad_page(struct page *page) if (nr_shown++ == 0) resume = jiffies + 60 * HZ; - printk(KERN_EMERG "Bad page state in process %s pfn:%05lx\n", + printk(KERN_ALERT "BUG: Bad page state in process %s pfn:%05lx\n", current->comm, page_to_pfn(page)); - printk(KERN_EMERG + printk(KERN_ALERT "page:%p flags:%p count:%d mapcount:%d mapping:%p index:%lx\n", page, (void *)page->flags, page_count(page), page_mapcount(page), page->mapping, page->index); - printk(KERN_EMERG "Trying to fix it up, but a reboot is needed\n"); dump_stack(); out: -- cgit From b555749aac87d7c2637f153e44bd77c7fdf4c65b Mon Sep 17 00:00:00 2001 From: Andrew Morton Date: Tue, 6 Jan 2009 14:40:13 -0800 Subject: vmscan: shrink_active_list(): reduce lru_lock hold time These three statements manipulate local variables and do not need the lock coverage. 
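The principle is the usual one of keeping critical sections minimal: work that touches only local variables is hoisted out so the lock covers nothing but the shared LRU state. Schematically (setup_local_state() and update_shared_lists() are placeholder names, not kernel functions):

        /* before: lock held across purely local setup */
        spin_lock_irq(&zone->lru_lock);
        setup_local_state(&local);              /* no shared data touched here */
        update_shared_lists(zone, &local);
        spin_unlock_irq(&zone->lru_lock);

        /* after: the lock covers only the shared update */
        setup_local_state(&local);
        spin_lock_irq(&zone->lru_lock);
        update_shared_lists(zone, &local);
        spin_unlock_irq(&zone->lru_lock);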
Cc: Johannes Weiner Cc: Lee Schermerhorn Cc: Rik van Riel Signed-off-by: Linus Torvalds --- mm/vmscan.c | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) (limited to 'mm') diff --git a/mm/vmscan.c b/mm/vmscan.c index 466a36b3bad..5daf606e0a3 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1237,6 +1237,13 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone, list_add(&page->lru, &l_inactive); } + /* + * Move the pages to the [file or anon] inactive list. + */ + pagevec_init(&pvec, 1); + pgmoved = 0; + lru = LRU_BASE + file * LRU_FILE; + spin_lock_irq(&zone->lru_lock); /* * Count referenced pages from currently used mappings as @@ -1247,13 +1254,6 @@ static void shrink_active_list(unsigned long nr_pages, struct zone *zone, if (scan_global_lru(sc)) zone->recent_rotated[!!file] += pgmoved; - /* - * Move the pages to the [file or anon] inactive list. - */ - pagevec_init(&pvec, 1); - - pgmoved = 0; - lru = LRU_BASE + file * LRU_FILE; while (!list_empty(&l_inactive)) { page = lru_to_page(&l_inactive); prefetchw_prev_lru_page(page, &l_inactive, flags); -- cgit From 4779280d1ea4d361af13ae77ba55217fbcd16d4c Mon Sep 17 00:00:00 2001 From: Ying Han Date: Tue, 6 Jan 2009 14:40:18 -0800 Subject: mm: make get_user_pages() interruptible The initial implementation of checking TIF_MEMDIE covers the cases of OOM killing. If the process has been OOM killed, the TIF_MEMDIE is set and it return immediately. This patch includes: 1. add the case that the SIGKILL is sent by user processes. The process can try to get_user_pages() unlimited memory even if a user process has sent a SIGKILL to it(maybe a monitor find the process exceed its memory limit and try to kill it). In the old implementation, the SIGKILL won't be handled until the get_user_pages() returns. 2. change the return value to be ERESTARTSYS. It makes no sense to return ENOMEM if the get_user_pages returned by getting a SIGKILL signal. Considering the general convention for a system call interrupted by a signal is ERESTARTNOSYS, so the current return value is consistant to that. Lee: An unfortunate side effect of "make-get_user_pages-interruptible" is that it prevents a SIGKILL'd task from munlock-ing pages that it had mlocked, resulting in freeing of mlocked pages. Freeing of mlocked pages, in itself, is not so bad. We just count them now--altho' I had hoped to remove this stat and add PG_MLOCKED to the free pages flags check. However, consider pages in shared libraries mapped by more than one task that a task mlocked--e.g., via mlockall(). If the task that mlocked the pages exits via SIGKILL, these pages would be left mlocked and unevictable. Proposed fix: Add another GUP flag to ignore sigkill when calling get_user_pages from munlock()--similar to Kosaki Motohiro's 'IGNORE_VMA_PERMISSIONS flag for the same purpose. We are not actually allocating memory in this case, which "make-get_user_pages-interruptible" intends to avoid. We're just munlocking pages that are already resident and mapped, and we're reusing get_user_pages() to access those pages. ?? Maybe we should combine 'IGNORE_VMA_PERMISSIONS and '_IGNORE_SIGKILL into a single flag: GUP_FLAGS_MUNLOCK ??? 
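The decision the fault loop ends up making can be boiled down to a small predicate: abort on a pending fatal signal unless the caller passed GUP_FLAGS_IGNORE_SIGKILL, and on abort return the pages already pinned, or -ERESTARTSYS if there are none. Sketch only (gup_should_abort() is an illustrative helper, not something the patch introduces):

        /* illustrative: should __get_user_pages() stop faulting pages in? */
        static inline int gup_should_abort(int flags)
        {
                if (flags & GUP_FLAGS_IGNORE_SIGKILL)
                        return 0;       /* e.g. munlock tearing down on exit */
                return fatal_signal_pending(current);
        }

        /* inside the per-page loop: */
        if (unlikely(gup_should_abort(flags)))
                return i ? i : -ERESTARTSYS;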
[Lee.Schermerhorn@hp.com: ignore sigkill in get_user_pages during munlock] Signed-off-by: Paul Menage Signed-off-by: Ying Han Reviewed-by: KOSAKI Motohiro Reviewed-by: Pekka Enberg Cc: Nick Piggin Cc: Hugh Dickins Cc: Oleg Nesterov Cc: Lee Schermerhorn Cc: Rohit Seth Cc: David Rientjes Signed-off-by: Lee Schermerhorn Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/internal.h | 1 + mm/memory.c | 14 +++++++++----- mm/mlock.c | 9 +++++---- 3 files changed, 15 insertions(+), 9 deletions(-) (limited to 'mm') diff --git a/mm/internal.h b/mm/internal.h index 1981bc9454f..478223b73a2 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -276,6 +276,7 @@ static inline void mminit_validate_memmodel_limits(unsigned long *start_pfn, #define GUP_FLAGS_WRITE 0x1 #define GUP_FLAGS_FORCE 0x2 #define GUP_FLAGS_IGNORE_VMA_PERMISSIONS 0x4 +#define GUP_FLAGS_IGNORE_SIGKILL 0x8 int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm, unsigned long start, int len, int flags, diff --git a/mm/memory.c b/mm/memory.c index db68af8e0bc..3f8fa06b963 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1210,6 +1210,7 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm, int write = !!(flags & GUP_FLAGS_WRITE); int force = !!(flags & GUP_FLAGS_FORCE); int ignore = !!(flags & GUP_FLAGS_IGNORE_VMA_PERMISSIONS); + int ignore_sigkill = !!(flags & GUP_FLAGS_IGNORE_SIGKILL); if (len <= 0) return 0; @@ -1288,12 +1289,15 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm, struct page *page; /* - * If tsk is ooming, cut off its access to large memory - * allocations. It has a pending SIGKILL, but it can't - * be processed until returning to user space. + * If we have a pending SIGKILL, don't keep faulting + * pages and potentially allocating memory, unless + * current is handling munlock--e.g., on exit. In + * that case, we are not allocating memory. Rather, + * we're only unlocking already resident/mapped pages. */ - if (unlikely(test_tsk_thread_flag(tsk, TIF_MEMDIE))) - return i ? i : -ENOMEM; + if (unlikely(!ignore_sigkill && + fatal_signal_pending(current))) + return i ? i : -ERESTARTSYS; if (write) foll_flags |= FOLL_WRITE; diff --git a/mm/mlock.c b/mm/mlock.c index 3035a56e761..e125156c664 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -173,12 +173,13 @@ static long __mlock_vma_pages_range(struct vm_area_struct *vma, (atomic_read(&mm->mm_users) != 0)); /* - * mlock: don't page populate if page has PROT_NONE permission. - * munlock: the pages always do munlock althrough - * its has PROT_NONE permission. + * mlock: don't page populate if vma has PROT_NONE permission. + * munlock: always do munlock although the vma has PROT_NONE + * permission, or SIGKILL is pending. */ if (!mlock) - gup_flags |= GUP_FLAGS_IGNORE_VMA_PERMISSIONS; + gup_flags |= GUP_FLAGS_IGNORE_VMA_PERMISSIONS | + GUP_FLAGS_IGNORE_SIGKILL; if (vma->vm_flags & VM_WRITE) gup_flags |= GUP_FLAGS_WRITE; -- cgit From 853ac43ab194f5051b27a55060215d696dc9480d Mon Sep 17 00:00:00 2001 From: Matt Mackall Date: Tue, 6 Jan 2009 14:40:20 -0800 Subject: shmem: unify regular and tiny shmem tiny-shmem shares most of its 130 lines of code with shmem and tends to break when particular bits of shmem get modified. Unifying saves code and makes keeping these two in sync much easier. 
before: 14367 392 24 14783 39bf mm/shmem.o 396 72 8 476 1dc mm/tiny-shmem.o after: 14367 392 24 14783 39bf mm/shmem.o 412 72 8 492 1ec mm/shmem.o tiny Signed-off-by: Matt Mackall Acked-by: Hugh Dickins Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/Makefile | 4 +- mm/shmem.c | 81 +++++++++++++++++++++++++++++----- mm/tiny-shmem.c | 134 -------------------------------------------------------- 3 files changed, 72 insertions(+), 147 deletions(-) delete mode 100644 mm/tiny-shmem.c (limited to 'mm') diff --git a/mm/Makefile b/mm/Makefile index 51c27709cc7..72255be57f8 100644 --- a/mm/Makefile +++ b/mm/Makefile @@ -9,7 +9,7 @@ mmu-$(CONFIG_MMU) := fremap.o highmem.o madvise.o memory.o mincore.o \ obj-y := bootmem.o filemap.o mempool.o oom_kill.o fadvise.o \ maccess.o page_alloc.o page-writeback.o pdflush.o \ - readahead.o swap.o truncate.o vmscan.o \ + readahead.o swap.o truncate.o vmscan.o shmem.o \ prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \ page_isolation.o mm_init.o $(mmu-y) @@ -21,9 +21,7 @@ obj-$(CONFIG_HUGETLBFS) += hugetlb.o obj-$(CONFIG_NUMA) += mempolicy.o obj-$(CONFIG_SPARSEMEM) += sparse.o obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o -obj-$(CONFIG_SHMEM) += shmem.o obj-$(CONFIG_TMPFS_POSIX_ACL) += shmem_acl.o -obj-$(CONFIG_TINY_SHMEM) += tiny-shmem.o obj-$(CONFIG_SLOB) += slob.o obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o obj-$(CONFIG_SLAB) += slab.o diff --git a/mm/shmem.c b/mm/shmem.c index 24f18fdee6e..5941f980136 100644 --- a/mm/shmem.c +++ b/mm/shmem.c @@ -14,31 +14,39 @@ * Copyright (c) 2004, Luke Kenneth Casson Leighton * Copyright (c) 2004 Red Hat, Inc., James Morris * + * tiny-shmem: + * Copyright (c) 2004, 2008 Matt Mackall + * * This file is released under the GPL. */ +#include +#include +#include +#include +#include +#include +#include +#include + +static struct vfsmount *shm_mnt; + +#ifdef CONFIG_SHMEM /* * This virtual memory filesystem is heavily based on the ramfs. It * extends ramfs by the ability to use swap and honor resource limits * which makes it a completely usable filesystem. */ -#include -#include -#include #include #include #include -#include #include -#include -#include #include #include #include #include #include -#include #include #include #include @@ -2485,7 +2493,6 @@ static struct file_system_type tmpfs_fs_type = { .get_sb = shmem_get_sb, .kill_sb = kill_litter_super, }; -static struct vfsmount *shm_mnt; static int __init init_tmpfs(void) { @@ -2524,7 +2531,51 @@ out4: shm_mnt = ERR_PTR(error); return error; } -module_init(init_tmpfs) + +#else /* !CONFIG_SHMEM */ + +/* + * tiny-shmem: simple shmemfs and tmpfs using ramfs code + * + * This is intended for small system where the benefits of the full + * shmem code (swap-backed and resource-limited) are outweighed by + * their complexity. On systems without swap this code should be + * effectively equivalent, but much lighter weight. 
+ */ + +#include + +static struct file_system_type tmpfs_fs_type = { + .name = "tmpfs", + .get_sb = ramfs_get_sb, + .kill_sb = kill_litter_super, +}; + +static int __init init_tmpfs(void) +{ + BUG_ON(register_filesystem(&tmpfs_fs_type) != 0); + + shm_mnt = kern_mount(&tmpfs_fs_type); + BUG_ON(IS_ERR(shm_mnt)); + + return 0; +} + +int shmem_unuse(swp_entry_t entry, struct page *page) +{ + return 0; +} + +#define shmem_file_operations ramfs_file_operations +#define shmem_vm_ops generic_file_vm_ops +#define shmem_get_inode ramfs_get_inode +#define shmem_acct_size(a, b) 0 +#define shmem_unacct_size(a, b) do {} while (0) +#define SHMEM_MAX_BYTES LLONG_MAX + +#endif /* CONFIG_SHMEM */ + +/* common code */ /** * shmem_file_setup - get an unlinked file living in tmpfs @@ -2568,12 +2619,20 @@ struct file *shmem_file_setup(char *name, loff_t size, unsigned long flags) if (!inode) goto close_file; +#ifdef CONFIG_SHMEM SHMEM_I(inode)->flags = flags & VM_ACCOUNT; +#endif d_instantiate(dentry, inode); inode->i_size = size; inode->i_nlink = 0; /* It is unlinked */ init_file(file, shm_mnt, dentry, FMODE_WRITE | FMODE_READ, - &shmem_file_operations); + &shmem_file_operations); + +#ifndef CONFIG_MMU + error = ramfs_nommu_expand_for_mapping(inode, size); + if (error) + goto close_file; +#endif return file; close_file: @@ -2605,3 +2664,5 @@ int shmem_zero_setup(struct vm_area_struct *vma) vma->vm_ops = &shmem_vm_ops; return 0; } + +module_init(init_tmpfs) diff --git a/mm/tiny-shmem.c b/mm/tiny-shmem.c deleted file mode 100644 index 3e67d575ee6..00000000000 --- a/mm/tiny-shmem.c +++ /dev/null @@ -1,134 +0,0 @@ -/* - * tiny-shmem.c: simple shmemfs and tmpfs using ramfs code - * - * Matt Mackall January, 2004 - * derived from mm/shmem.c and fs/ramfs/inode.c - * - * This is intended for small system where the benefits of the full - * shmem code (swap-backed and resource-limited) are outweighed by - * their complexity. On systems without swap this code should be - * effectively equivalent, but much lighter weight. 
- */ - -#include -#include -#include -#include -#include -#include -#include -#include -#include - -static struct file_system_type tmpfs_fs_type = { - .name = "tmpfs", - .get_sb = ramfs_get_sb, - .kill_sb = kill_litter_super, -}; - -static struct vfsmount *shm_mnt; - -static int __init init_tmpfs(void) -{ - BUG_ON(register_filesystem(&tmpfs_fs_type) != 0); - - shm_mnt = kern_mount(&tmpfs_fs_type); - BUG_ON(IS_ERR(shm_mnt)); - - return 0; -} -module_init(init_tmpfs) - -/** - * shmem_file_setup - get an unlinked file living in tmpfs - * @name: name for dentry (to be seen in /proc//maps - * @size: size to be set for the file - * @flags: vm_flags - */ -struct file *shmem_file_setup(char *name, loff_t size, unsigned long flags) -{ - int error; - struct file *file; - struct inode *inode; - struct dentry *dentry, *root; - struct qstr this; - - if (IS_ERR(shm_mnt)) - return (void *)shm_mnt; - - error = -ENOMEM; - this.name = name; - this.len = strlen(name); - this.hash = 0; /* will go */ - root = shm_mnt->mnt_root; - dentry = d_alloc(root, &this); - if (!dentry) - goto put_memory; - - error = -ENFILE; - file = get_empty_filp(); - if (!file) - goto put_dentry; - - error = -ENOSPC; - inode = ramfs_get_inode(root->d_sb, S_IFREG | S_IRWXUGO, 0); - if (!inode) - goto close_file; - - d_instantiate(dentry, inode); - inode->i_size = size; - inode->i_nlink = 0; /* It is unlinked */ - init_file(file, shm_mnt, dentry, FMODE_WRITE | FMODE_READ, - &ramfs_file_operations); - -#ifndef CONFIG_MMU - error = ramfs_nommu_expand_for_mapping(inode, size); - if (error) - goto close_file; -#endif - return file; - -close_file: - put_filp(file); -put_dentry: - dput(dentry); -put_memory: - return ERR_PTR(error); -} -EXPORT_SYMBOL_GPL(shmem_file_setup); - -/** - * shmem_zero_setup - setup a shared anonymous mapping - * @vma: the vma to be mmapped is prepared by do_mmap_pgoff - */ -int shmem_zero_setup(struct vm_area_struct *vma) -{ - struct file *file; - loff_t size = vma->vm_end - vma->vm_start; - - file = shmem_file_setup("dev/zero", size, vma->vm_flags); - if (IS_ERR(file)) - return PTR_ERR(file); - - if (vma->vm_file) - fput(vma->vm_file); - vma->vm_file = file; - vma->vm_ops = &generic_file_vm_ops; - return 0; -} - -int shmem_unuse(swp_entry_t entry, struct page *page) -{ - return 0; -} - -#ifndef CONFIG_MMU -unsigned long shmem_get_unmapped_area(struct file *file, - unsigned long addr, - unsigned long len, - unsigned long pgoff, - unsigned long flags) -{ - return ramfs_nommu_get_unmapped_area(file, addr, len, pgoff, flags); -} -#endif -- cgit From 48aae42556e5ea1ba0d8ddab25352706577af2ed Mon Sep 17 00:00:00 2001 From: ZhenwenXu Date: Tue, 6 Jan 2009 14:40:21 -0800 Subject: mm/mmap.c: fix coding style Fix a little of the coding style in mm/mmap.c [akpm@linux-foundation.org: cleanup] Signed-off-by: ZhenwenXu Signed-off-by: Hugh Dickins Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/mmap.c | 17 ++++++++--------- 1 file changed, 8 insertions(+), 9 deletions(-) (limited to 'mm') diff --git a/mm/mmap.c b/mm/mmap.c index 2c778fcfd9b..e4507b23e62 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -413,7 +413,7 @@ void __vma_link_rb(struct mm_struct *mm, struct vm_area_struct *vma, static void __vma_link_file(struct vm_area_struct *vma) { - struct file * file; + struct file *file; file = vma->vm_file; if (file) { @@ -474,11 +474,10 @@ static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma, * insert vm structure into list and rbtree and anon_vma, * but it has already been inserted into prio_tree 
earlier. */ -static void -__insert_vm_struct(struct mm_struct * mm, struct vm_area_struct * vma) +static void __insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma) { - struct vm_area_struct * __vma, * prev; - struct rb_node ** rb_link, * rb_parent; + struct vm_area_struct *__vma, *prev; + struct rb_node **rb_link, *rb_parent; __vma = find_vma_prepare(mm, vma->vm_start,&prev, &rb_link, &rb_parent); BUG_ON(__vma && __vma->vm_start < vma->vm_end); @@ -908,7 +907,7 @@ void vm_stat_account(struct mm_struct *mm, unsigned long flags, * The caller must hold down_write(current->mm->mmap_sem). */ -unsigned long do_mmap_pgoff(struct file * file, unsigned long addr, +unsigned long do_mmap_pgoff(struct file *file, unsigned long addr, unsigned long len, unsigned long prot, unsigned long flags, unsigned long pgoff) { @@ -1464,7 +1463,7 @@ get_unmapped_area(struct file *file, unsigned long addr, unsigned long len, EXPORT_SYMBOL(get_unmapped_area); /* Look up the first VMA which satisfies addr < vm_end, NULL if none. */ -struct vm_area_struct * find_vma(struct mm_struct * mm, unsigned long addr) +struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr) { struct vm_area_struct *vma = NULL; @@ -1507,7 +1506,7 @@ find_vma_prev(struct mm_struct *mm, unsigned long addr, struct vm_area_struct **pprev) { struct vm_area_struct *vma = NULL, *prev = NULL; - struct rb_node * rb_node; + struct rb_node *rb_node; if (!mm) goto out; @@ -1541,7 +1540,7 @@ out: * update accounting. This is shared with both the * grow-up and grow-down cases. */ -static int acct_stack_growth(struct vm_area_struct * vma, unsigned long size, unsigned long grow) +static int acct_stack_growth(struct vm_area_struct *vma, unsigned long size, unsigned long grow) { struct mm_struct *mm = vma->vm_mm; struct rlimit *rlim = current->signal->rlim; -- cgit From 48b47c561e41525061b5bc0cfd67d6367fd11dc4 Mon Sep 17 00:00:00 2001 From: Nick Piggin Date: Tue, 6 Jan 2009 14:40:22 -0800 Subject: mm: direct IO starvation improvement Direct IO can invalidate and sync a lot of pagecache pages in the mapping. A 4K direct IO will actually try to sync and/or invalidate the pagecache of the entire file, for example (which might be many GB or TB large). Improve this by doing range syncs. Also, memory no longer has to be unmapped to catch the dirty bits for syncing, as dirty bits would remain coherent due to dirty mmap accounting. This fixes the immediate DM deadlocks when doing direct IO reads to block device with a mounted filesystem, if only by papering over the problem somewhat rather than addressing the fsync starvation cases. 
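On the read side, for example, the flush is bounded to the byte range the direct IO will actually touch rather than the whole mapping; a sketch of the resulting call sequence (the real hunks are below):

        loff_t end = pos + iov_length(iov, nr_segs) - 1;   /* last byte of the I/O */

        /* write back and wait on just [pos, end], not the entire file */
        retval = filemap_write_and_wait_range(mapping, pos, end);
        if (!retval)
                retval = mapping->a_ops->direct_IO(READ, iocb, iov,
                                                   pos, nr_segs);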
Signed-off-by: Nick Piggin Reviewed-by: Jeff Moyer Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/filemap.c | 16 +++++----------- 1 file changed, 5 insertions(+), 11 deletions(-) (limited to 'mm') diff --git a/mm/filemap.c b/mm/filemap.c index 9c5e6235cc7..f3555fb806d 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1317,7 +1317,8 @@ generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov, goto out; /* skip atime */ size = i_size_read(inode); if (pos < size) { - retval = filemap_write_and_wait(mapping); + retval = filemap_write_and_wait_range(mapping, pos, + pos + iov_length(iov, nr_segs) - 1); if (!retval) { retval = mapping->a_ops->direct_IO(READ, iocb, iov, pos, nr_segs); @@ -2059,18 +2060,10 @@ generic_file_direct_write(struct kiocb *iocb, const struct iovec *iov, if (count != ocount) *nr_segs = iov_shorten((struct iovec *)iov, *nr_segs, count); - /* - * Unmap all mmappings of the file up-front. - * - * This will cause any pte dirty bits to be propagated into the - * pageframes for the subsequent filemap_write_and_wait(). - */ write_len = iov_length(iov, *nr_segs); end = (pos + write_len - 1) >> PAGE_CACHE_SHIFT; - if (mapping_mapped(mapping)) - unmap_mapping_range(mapping, pos, write_len, 0); - written = filemap_write_and_wait(mapping); + written = filemap_write_and_wait_range(mapping, pos, pos + write_len - 1); if (written) goto out; @@ -2290,7 +2283,8 @@ generic_file_buffered_write(struct kiocb *iocb, const struct iovec *iov, * the file data here, to try to honour O_DIRECT expectations. */ if (unlikely(file->f_flags & O_DIRECT) && written) - status = filemap_write_and_wait(mapping); + status = filemap_write_and_wait_range(mapping, + pos, pos + written - 1); return written ? written : status; } -- cgit From 67d58ac47d25f7e2a105248a4aea6113131ab874 Mon Sep 17 00:00:00 2001 From: Nick Piggin Date: Tue, 6 Jan 2009 14:40:28 -0800 Subject: mm: pagecache gfp flags fix Frustratingly, gfp_t is really divided into two classes of flags. One are the context dependent ones (can we sleep? can we enter filesystem? block subsystem? should we use some extra reserves, etc.). The other ones are the type of memory required and depend on how the algorithm is implemented rather than the point at which the memory is allocated (highmem? dma memory? etc). Some of the functions which allocate a page and add it to page cache take a gfp_t, but sometimes those functions or their callers aren't really doing the right thing: when allocating pagecache page, the memory type should be mapping_gfp_mask(mapping). When allocating radix tree nodes, the memory type should be kernel mapped (not highmem) memory. The gfp_t argument should only really be needed for context dependent options. This patch doesn't really solve that tangle in a nice way, but it does attempt to fix a couple of bugs. - find_or_create_page changes its radix-tree allocation to only include the main context dependent flags in order so the pagecache page may be allocated from arbitrary types of memory without affecting the radix-tree. In practice, slab allocations don't come from highmem anyway, and radix-tree only uses slab allocations. So there isn't a practical change (unless some fs uses GFP_DMA for pages). - grab_cache_page_nowait() is changed to allocate radix-tree nodes with GFP_NOFS, because it is not supposed to reenter the filesystem. This bug could cause lock recursion if a filesystem is not expecting the function to reenter the fs (as-per documentation). 
Filesystems should be careful about exactly what semantics they want and what they get when fiddling with gfp_t masks to allocate pagecache. One should be as liberal as possible with the type of memory that can be used, and same for the the context specific flags. Signed-off-by: Nick Piggin Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/filemap.c | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) (limited to 'mm') diff --git a/mm/filemap.c b/mm/filemap.c index f3555fb806d..2f55a1e2baf 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -741,7 +741,14 @@ repeat: page = __page_cache_alloc(gfp_mask); if (!page) return NULL; - err = add_to_page_cache_lru(page, mapping, index, gfp_mask); + /* + * We want a regular kernel memory (not highmem or DMA etc) + * allocation for the radix tree nodes, but we need to honour + * the context-specific requirements the caller has asked for. + * GFP_RECLAIM_MASK collects those requirements. + */ + err = add_to_page_cache_lru(page, mapping, index, + (gfp_mask & GFP_RECLAIM_MASK)); if (unlikely(err)) { page_cache_release(page); page = NULL; @@ -950,7 +957,7 @@ grab_cache_page_nowait(struct address_space *mapping, pgoff_t index) return NULL; } page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~__GFP_FS); - if (page && add_to_page_cache_lru(page, mapping, index, GFP_KERNEL)) { + if (page && add_to_page_cache_lru(page, mapping, index, GFP_NOFS)) { page_cache_release(page); page = NULL; } -- cgit From 901608d9045146aec6f14a7777ea4b1501c379f0 Mon Sep 17 00:00:00 2001 From: Oleg Nesterov Date: Tue, 6 Jan 2009 14:40:29 -0800 Subject: mm: introduce get_mm_hiwater_xxx(), fix taskstats->hiwater_xxx accounting xacct_add_tsk() relies on do_exit()->update_hiwater_xxx() and uses mm->hiwater_xxx directly, this leads to 2 problems: - taskstats_user_cmd() can call fill_pid()->xacct_add_tsk() at any moment before the task exits, so we should check the current values of rss/vm anyway. - do_exit()->update_hiwater_xxx() calls are racy. An exiting thread can be preempted right before mm->hiwater_xxx = new_val, and another thread can use A_LOT of memory and exit in between. When the first thread resumes it can be the last thread in the thread group, in that case we report the wrong hiwater_xxx values which do not take A_LOT into account. Introduce get_mm_hiwater_rss() and get_mm_hiwater_vm() helpers and change xacct_add_tsk() to use them. The first helper will also be used by rusage->ru_maxrss accounting. Kill do_exit()->update_hiwater_xxx() calls. Unless we are going to decrease rss/vm there is no point to update mm->hiwater_xxx, and nobody can look at this mm_struct when exit_mmap() actually unmaps the memory. Signed-off-by: Oleg Nesterov Acked-by: Hugh Dickins Reviewed-by: KOSAKI Motohiro Acked-by: Balbir Singh Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/mmap.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) (limited to 'mm') diff --git a/mm/mmap.c b/mm/mmap.c index e4507b23e62..1f97d8aa9b0 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2102,7 +2102,7 @@ void exit_mmap(struct mm_struct *mm) lru_add_drain(); flush_cache_mm(mm); tlb = tlb_gather_mmu(mm, 1); - /* Don't update_hiwater_rss(mm) here, do_exit already did */ + /* update_hiwater_rss(mm) here? 
but nobody should be looking */ /* Use -1 here to ensure all VMAs in the mm are unmapped */ end = unmap_vmas(&tlb, vma, 0, -1, &nr_accounted, NULL); vm_unacct_memory(nr_accounted); -- cgit From 9f572e3f96b8a2ef70dcb881e64c7b9c10057d98 Mon Sep 17 00:00:00 2001 From: KOSAKI Motohiro Date: Tue, 6 Jan 2009 14:40:29 -0800 Subject: mm: remove CONFIG_OUT_OF_LINE_PFN_TO_PAGE No architectures use CONFIG_OUT_OF_LINE_PFN_TO_PAGE - it can be removed. Signed-off-by: KOSAKI Motohiro Cc: KAMEZAWA Hiroyuki Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/page_alloc.c | 13 ------------- 1 file changed, 13 deletions(-) (limited to 'mm') diff --git a/mm/page_alloc.c b/mm/page_alloc.c index d531e8ef998..7bf22e04531 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -4587,19 +4587,6 @@ void *__init alloc_large_system_hash(const char *tablename, return table; } -#ifdef CONFIG_OUT_OF_LINE_PFN_TO_PAGE -struct page *pfn_to_page(unsigned long pfn) -{ - return __pfn_to_page(pfn); -} -unsigned long page_to_pfn(struct page *page) -{ - return __page_to_pfn(page); -} -EXPORT_SYMBOL(pfn_to_page); -EXPORT_SYMBOL(page_to_pfn); -#endif /* CONFIG_OUT_OF_LINE_PFN_TO_PAGE */ - /* Return a pointer to the bitmap storing bits affecting a block of pages */ static inline unsigned long *get_pageblock_bitmap(struct zone *zone, unsigned long pfn) -- cgit From 084f71ae5ceeb16734d8ac47559d3c718456a865 Mon Sep 17 00:00:00 2001 From: KOSAKI Motohiro Date: Tue, 6 Jan 2009 14:40:30 -0800 Subject: mm: kill page_queue_congested() page_queue_congested() was introduced in 2002, but it was never used Signed-off-by: KOSAKI Motohiro Cc: Peter Zijlstra Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/swapfile.c | 20 -------------------- 1 file changed, 20 deletions(-) (limited to 'mm') diff --git a/mm/swapfile.c b/mm/swapfile.c index f2874585577..eec5ca758a2 100644 --- a/mm/swapfile.c +++ b/mm/swapfile.c @@ -1372,26 +1372,6 @@ out: return ret; } -#if 0 /* We don't need this yet */ -#include -int page_queue_congested(struct page *page) -{ - struct backing_dev_info *bdi; - - VM_BUG_ON(!PageLocked(page)); /* It pins the swap_info_struct */ - - if (PageSwapCache(page)) { - swp_entry_t entry = { .val = page_private(page) }; - struct swap_info_struct *sis; - - sis = get_swap_info_struct(swp_type(entry)); - bdi = sis->bdev->bd_inode->i_mapping->backing_dev_info; - } else - bdi = page->mapping->backing_dev_info; - return bdi_write_congested(bdi); -} -#endif - asmlinkage long sys_swapoff(const char __user * specialfile) { struct swap_info_struct * p = NULL; -- cgit From dcd4a049b9751828c516c59709f3fdf50436df85 Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Tue, 6 Jan 2009 14:40:31 -0800 Subject: mm: check for no mmaps in exit_mmap() When dup_mmap() ooms we can end up with mm->mmap == NULL. The error path does mmput() and unmap_vmas() gets a NULL vma which it dereferences. In exit_mmap() there is nothing to do at all for this case, we can cancel the callpath right there. 
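The fix is just an early bail-out at the top of exit_mmap(), before anything dereferences the (possibly NULL) vma list; roughly (the comment wording here is illustrative, the actual hunk follows):

        /*
         * dup_mmap() can OOM before any vma is linked in; then there is
         * nothing to unmap, unaccount or free here.
         */
        if (!mm->mmap)
                return;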
[akpm@linux-foundation.org: add sorely-needed comment] Signed-off-by: Johannes Weiner Reported-by: Akinobu Mita Cc: Nick Piggin Cc: Hugh Dickins Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/mmap.c | 3 +++ 1 file changed, 3 insertions(+) (limited to 'mm') diff --git a/mm/mmap.c b/mm/mmap.c index 1f97d8aa9b0..a910c045cfd 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2090,6 +2090,9 @@ void exit_mmap(struct mm_struct *mm) arch_exit_mmap(mm); mmu_notifier_release(mm); + if (!mm->mmap) /* Can happen if dup_mmap() received an OOM */ + return; + if (mm->locked_vm) { vma = mm->mmap; while (vma) { -- cgit From 594fe1a044325bb0a1a49ca7d086e3df4f1df59a Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Tue, 6 Jan 2009 14:40:32 -0800 Subject: bootmem: print request details before BUG_ON(them) Moving the request details print-out before the sanity checks that might panic() enables us to analyse invalid requests without having access to the line information of the stack dump. Signed-off-by: Johannes Weiner Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/bootmem.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) (limited to 'mm') diff --git a/mm/bootmem.c b/mm/bootmem.c index ac5a891f142..51a0ccf61e0 100644 --- a/mm/bootmem.c +++ b/mm/bootmem.c @@ -435,6 +435,10 @@ static void * __init alloc_bootmem_core(struct bootmem_data *bdata, unsigned long fallback = 0; unsigned long min, max, start, sidx, midx, step; + bdebug("nid=%td size=%lx [%lu pages] align=%lx goal=%lx limit=%lx\n", + bdata - bootmem_node_data, size, PAGE_ALIGN(size) >> PAGE_SHIFT, + align, goal, limit); + BUG_ON(!size); BUG_ON(align & (align - 1)); BUG_ON(limit && goal + size > limit); @@ -442,10 +446,6 @@ static void * __init alloc_bootmem_core(struct bootmem_data *bdata, if (!bdata->node_bootmem_map) return NULL; - bdebug("nid=%td size=%lx [%lu pages] align=%lx goal=%lx limit=%lx\n", - bdata - bootmem_node_data, size, PAGE_ALIGN(size) >> PAGE_SHIFT, - align, goal, limit); - min = bdata->node_min_pfn; max = bdata->node_low_pfn; -- cgit From 73ce02e96fe34a983199a9855b2ae738f960a6ee Mon Sep 17 00:00:00 2001 From: KOSAKI Motohiro Date: Tue, 6 Jan 2009 14:40:33 -0800 Subject: mm: stop kswapd's infinite loop at high order allocation Wassim Dagash reported following kswapd infinite loop problem. kswapd runs in some infinite loop trying to swap until order 10 of zone highmem is OK.... kswapd will continue to try to balance order 10 of zone highmem forever (or until someone release a very large chunk of highmem). For non order-0 allocations, the system may never be balanced due to fragmentation but kswapd should not infinitely loop as a result. Instead, recheck all watermarks at order-0 as they are the most important. If watermarks are ok, kswapd will go back to sleep. [akpm@linux-foundation.org: fix comment] Reported-by: wassim dagash Signed-off-by: KOSAKI Motohiro Reviewed-by: Nick Piggin Signed-off-by: Mel Gorman Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/vmscan.c | 17 +++++++++++++++++ 1 file changed, 17 insertions(+) (limited to 'mm') diff --git a/mm/vmscan.c b/mm/vmscan.c index 5daf606e0a3..b07c48b09a9 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1867,6 +1867,23 @@ out: try_to_freeze(); + /* + * Fragmentation may mean that the system cannot be + * rebalanced for high-order allocations in all zones. + * At this point, if nr_reclaimed < SWAP_CLUSTER_MAX, + * it means the zones have been fully scanned and are still + * not balanced. 
For high-order allocations, there is + * little point trying all over again as kswapd may + * infinite loop. + * + * Instead, recheck all watermarks at order-0 as they + * are the most important. If watermarks are ok, kswapd will go + * back to sleep. High-order users can still perform direct + * reclaim if they wish. + */ + if (sc.nr_reclaimed < SWAP_CLUSTER_MAX) + order = sc.order = 0; + goto loop_again; } -- cgit From 91f47662dfaa5b459aebe13284c6c38db27350dc Mon Sep 17 00:00:00 2001 From: Cyrill Gorcunov Date: Tue, 6 Jan 2009 14:40:33 -0800 Subject: mm: hugetlb: remove redundant `if' operation At this point we already know that 'addr' is not NULL so get rid of redundant 'if'. Probably gcc eliminate it by optimization pass. [akpm@linux-foundation.org: use __weak, too] Signed-off-by: Cyrill Gorcunov Reviewed-by: Ingo Molnar Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/hugetlb.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) (limited to 'mm') diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 82321da23cc..618e9830408 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1005,7 +1005,7 @@ static struct page *alloc_huge_page(struct vm_area_struct *vma, return page; } -__attribute__((weak)) int alloc_bootmem_huge_page(struct hstate *h) +int __weak alloc_bootmem_huge_page(struct hstate *h) { struct huge_bootmem_page *m; int nr_nodes = nodes_weight(node_online_map); @@ -1024,8 +1024,7 @@ __attribute__((weak)) int alloc_bootmem_huge_page(struct hstate *h) * puts them into the mem_map). */ m = addr; - if (m) - goto found; + goto found; } hstate_next_node(h); nr_nodes--; -- cgit From 67faaada1ebcccf29745346f1d7cb5392f46500a Mon Sep 17 00:00:00 2001 From: Geert Uytterhoeven Date: Tue, 6 Jan 2009 14:41:12 -0800 Subject: Remove obsolete CONFIG_RESOURCES_64BIT commit 8308c54d7e312f7a03e2ce2057d0837e6fe3843f ("generic: redefine resource_size_t as phys_addr_t") made CONFIG_RESOURCES_64BIT obsolete, but didn't remove it. Remove it. Signed-off-by: Geert Uytterhoeven Cc: Jeremy Fitzhardinge Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds --- mm/Kconfig | 6 ------ 1 file changed, 6 deletions(-) (limited to 'mm') diff --git a/mm/Kconfig b/mm/Kconfig index 5b5790f8a81..a5b77811fdf 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -181,12 +181,6 @@ config MIGRATION example on NUMA systems to put pages nearer to the processors accessing the page. -config RESOURCES_64BIT - bool "64 bit Memory and IO resources (EXPERIMENTAL)" if (!64BIT && EXPERIMENTAL) - default 64BIT - help - This option allows memory and IO resources to be 64 bit. - config PHYS_ADDR_T_64BIT def_bool 64BIT || ARCH_PHYS_ADDR_T_64BIT -- cgit