* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
slub: No need for per node slab counters if !SLUB_DEBUG
slub: Move map/flag clearing to __free_slab
slub: Fixes to per cpu stat output in sysfs
slub: Deal with config variable dependencies
slub: Reduce #ifdef ZONE_DMA by moving kmalloc_caches_dma near dma logic
slub: Initialize per-cpu stats
Fix two regressions dealing with the kgdb core.
1) kgdb_skipexception and kgdb_post_primary_code are optional
functions that are only required on archs that need special exception
fixups.
2) The kernel address space scope must be set on any probe_kernel_*
function or archs such as ARCH=arm will not allow access to the kernel
memory space. As an example, it is required to allow the full kernel
address space is when you the kernel debugger to inspect a system
call.
Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
add probe_kernel_read() and probe_kernel_write().
Uninlined and restricted to kernel range memory only, as suggested
by Linus.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Fix memory corruption and crash on 32-bit x86 systems.
If a !PAE x86 kernel is booted on a 32-bit system with more than 4GB of
RAM, then we call memory_present() with a start/end that goes outside
the scope of MAX_PHYSMEM_BITS.
That causes this loop to happily walk over the limit of the sparse
memory section map:
for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) {
unsigned long section = pfn_to_section_nr(pfn);
struct mem_section *ms;
sparse_index_init(section, nid);
set_section_nid(section, nid);
ms = __nr_to_section(section);
if (!ms->section_mem_map)
ms->section_mem_map = sparse_encode_early_nid(nid) |
SECTION_MARKED_PRESENT;
'ms' will be out of bounds and we'll corrupt a small amount of memory by
encoding the node ID and writing SECTION_MARKED_PRESENT (==0x1) over it.
The corruption might happen when encoding a non-zero node ID, or due to
the SECTION_MARKED_PRESENT which is 0x1:
mmzone.h:#define SECTION_MARKED_PRESENT (1UL<<0)
The fix is to sanity check anything the architecture passes to
sparsemem.
This bug seems to be rather old (as old as sparsemem support itself),
but the exact incarnation depended on random details like configs, which
made this bug more prominent in v2.6.25-to-be.
An additional enhancement might be to print a warning about ignored or
trimmed memory ranges.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Tested-by: Christoph Lameter <clameter@sgi.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Yinghai Lu <Yinghai.Lu@sun.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The per node counters are used mainly for showing data through the sysfs API.
If that API is not compiled in then there is no point in keeping track of this
data. Disable counters for the number of slabs and the number of total slabs
if !SLUB_DEBUG. Incrementing the per node counters is also accessing a
potentially contended cacheline so this could actually be a performance
benefit to embedded systems.
SLABINFO support is also affected. It now must depends on SLUB_DEBUG (which
is on by default).
Patch also avoids a check for a NULL kmem_cache_node pointer in new_slab()
if the system is not compiled with NUMA support.
[penberg@cs.helsinki.fi: fix oops and move ->nr_slabs into CONFIG_SLUB_DEBUG]
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
__free_slab does some diagnostics. The resetting of mapcount etc
in discard_slab() can interfere with debug processing. So move
the reset immediately before the page is freed.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Only output per cpu stats if the kernel is build for SMP.
Use a capital "C" as a leading character for the processor number
(same as the numa statistics that also use a capital letter "N").
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
count_partial() is used by both slabinfo and the sysfs proc support. Move
the function directly before the beginning of the sysfs code so that it can
be easily found. Rework the preprocessor conditional to take into account
that slub sysfs support depends on CONFIG_SYSFS *and* CONFIG_SLUB_DEBUG.
Make CONFIG_SLUB_STATS depend on CONFIG_SLUB_DEBUG and CONFIG_SYSFS. There
is no point of keeping statistics if no one can restrive them.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Move the definition of kmalloc_caches_dma() into a later #ifdef CONFIG_ZONE_DMA.
This saves one #ifdef and leaves us with a total of two #ifdefs for dma slab support.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
As spotted by kmemcheck, we need to initialize the per-CPU ->stat array before
using it.
[kmem_cache_cpu structures are usually allocated from arrays defined via
DEFINE_PER_CPU that are zeroed so we have not noticed this so far --cl].
Reported-by: Vegard Nossum <vegard.nossum@gmail.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
This should be N_NORMAL_MEMORY.
N_NORMAL_MEMORY is "true" if a node has memory for the kernel. N_HIGH_MEMORY
is "true" if a node has memory for HIGHMEM. (If CONFIG_HIGHMEM=n, always
"true")
This check is used for testing whether we can use kmalloc_node() on a node.
Then, if there is a node which only contains HIGHMEM, the system will call
kmalloc_node() which doesn't contain memory for the kernel. If it happens
under SLUB, the kernel will panic. I think this only happens on x86_32-numa.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A boot option for the memory controller was discussed on lkml. It is a good
idea to add it, since it saves memory for people who want to turn off the
memory controller.
By default the option is on for the following two reasons:
1. It provides compatibility with the current scheme where the memory
controller turns on if the config option is enabled
2. It allows for wider testing of the memory controller, once the config
option is enabled
We still allow the create, destroy callbacks to succeed, since they are not
aware of boot options. We do not populate the directory will memory resource
controller specific files.
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: Sudhir Kumar <skumar@linux.vnet.ibm.com>
Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Small typo in the patch recently merged to avoid the unused symbol
message for count_partial(). Discussion thread with confirmation of fix at
http://marc.info/?t=120696854400001&r=1&w=2
Typo in the check if we need the count_partial function that was
introduced by 53625b4204
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This reverts commit 3811dbf671.
The masking was not at all useless, and it was sensible. We handle
GFP_ZERO in the caller, and passing it down to any page allocator logic
is buggy and wrong.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Running the counters testcase from libhugetlbfs results in on 2.6.25-rc5
and 2.6.25-rc5-mm1:
BUG: soft lockup - CPU#3 stuck for 61s! [counters:10531]
NIP: c0000000000d1f3c LR: c0000000000d1f2c CTR: c0000000001b5088
REGS: c000005db12cb360 TRAP: 0901 Not tainted (2.6.25-rc5-autokern1)
MSR: 8000000000009032 <EE,ME,IR,DR> CR: 48008448 XER: 20000000
TASK = c000005dbf3d6000[10531] 'counters' THREAD: c000005db12c8000 CPU: 3
GPR00: 0000000000000004 c000005db12cb5e0 c000000000879228 0000000000000004
GPR04: 0000000000000010 0000000000000000 0000000000200200 0000000000100100
GPR08: c0000000008aba10 000000000000ffff 0000000000000004 0000000000000000
GPR12: 0000000028000442 c000000000770080
NIP [c0000000000d1f3c] .return_unused_surplus_pages+0x84/0x18c
LR [c0000000000d1f2c] .return_unused_surplus_pages+0x74/0x18c
Call Trace:
[c000005db12cb5e0] [c000005db12cb670] 0xc000005db12cb670 (unreliable)
[c000005db12cb670] [c0000000000d24c4] .hugetlb_acct_memory+0x2e0/0x354
[c000005db12cb740] [c0000000001b5048] .truncate_hugepages+0x1d4/0x214
[c000005db12cb890] [c0000000001b50a4] .hugetlbfs_delete_inode+0x1c/0x3c
[c000005db12cb920] [c000000000103fd8] .generic_delete_inode+0xf8/0x1c0
[c000005db12cb9b0] [c0000000001b5100] .hugetlbfs_drop_inode+0x3c/0x24c
[c000005db12cba50] [c00000000010287c] .iput+0xdc/0xf8
[c000005db12cbad0] [c0000000000fee54] .dentry_iput+0x12c/0x194
[c000005db12cbb60] [c0000000000ff050] .d_kill+0x6c/0xa4
[c000005db12cbbf0] [c0000000000ffb74] .dput+0x18c/0x1b0
[c000005db12cbc70] [c0000000000e9e98] .__fput+0x1a4/0x1e8
[c000005db12cbd10] [c0000000000e61ec] .filp_close+0xb8/0xe0
[c000005db12cbda0] [c0000000000e62d0] .sys_close+0xbc/0x134
[c000005db12cbe30] [c00000000000872c] syscall_exit+0x0/0x40
Instruction dump:
ebbe8038 38800010 e8bf0002 3bbd0008 7fa3eb78 38a50001 7ca507b4 4818df25
60000000 38800010 38a00000 7c601b78 <7fa3eb78> 2f800010 409d0008 38000010
This was tracked down to a potential livelock in
return_unused_surplus_hugepages(). In the case where we have surplus
pages on some node, but no free pages on the same node, we may never
break out of the loop. To avoid this livelock, terminate the search if
we iterate a number of times equal to the number of online nodes without
freeing a page.
Thanks to Andy Whitcroft and Adam Litke for helping with debugging and
the patch.
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently we show the surplus hugetlb pool state in /proc/meminfo, but
not in the per-node meminfo files, even though we track the information
on a per-node basis. Printing it there can help track down dynamic pool
bugs including the one in the follow-on patch.
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 556a169dab ("slab: fix bootstrap on
memoryless node") introduced bootstrap-time cache_cache list3s for all nodes
but forgot that initkmem_list3 needs to be accessed by [somevalue + node]. This
patch fixes list_add() corruption in mm/slab.c seen on the ES7000.
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Olaf Hering <olaf@aepfle.de>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Dan Yeisley <dan.yeisley@unisys.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Avoid warnings about unused functions if neither SLUB_DEBUG nor CONFIG_SLABINFO
is defined. This patch will be reversed when slab defrag is merged since slab
defrag requires count_partial() to determine the fragmentation status of
slab caches.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
[PATCH] get stack footprint of pathname resolution back to relative sanity
[PATCH] double iput() on failure exit in hugetlb
[PATCH] double dput() on failure exit in tiny-shmem
[PATCH] fix up new filp allocators
[PATCH] check for null vfsmount in dentry_open()
[PATCH] reiserfs: eliminate private use of struct file in xattr
[PATCH] sanitize hppfs
hppfs pass vfsmount to dentry_open()
[PATCH] restore export of do_kern_mount()
Revert commit f1a9ee758de7de1e040de849fdef46e6802ea117:
Author: Rik van Riel <riel@redhat.com>
Date: Thu Feb 7 00:14:08 2008 -0800
kswapd should only wait on IO if there is IO
The current kswapd (and try_to_free_pages) code has an oddity where the
code will wait on IO, even if there is no IO in flight. This problem is
notable especially when the system scans through many unfreeable pages,
causing unnecessary stalls in the VM.
Additionally, tasks without __GFP_FS or __GFP_IO in the direct reclaim path
will sleep if a significant number of pages are encountered that should be
written out. This gives kswapd a chance to write out those pages, while
the direct reclaim task sleeps.
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Because of large latencies and interactivity problems reported by Carlos,
here: http://lkml.org/lkml/2008/3/22/211
Cc: Rik van Riel <riel@redhat.com>
Cc: "Carlos R. Mafra" <crmafra2@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
With numa enabled, some callers could have a range of memory on one node
but try to free that on other node. This can cause some pages to be
freed wrongly.
For example: when we try to allocate 128g boot ram early for
gart/swiotlb, and free that range later so gart/swiotlb can get some
range afterwards.
With this patch, we don't need to care which node holds the range, just
loop to call free_bootmem_node for all online nodes.
This patch makes free_bootmem_core() more robust by trimming the sidx
and eidx according the ram range that the node has.
And make the free_bootmem_core handle this out of range case. We could
use bdata_list to make sure the range can be freed for sure. So next
time, we don't need to loop online nodes and could use free_bootmem
directly.
Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Tested-by: Ingo Molnar <mingo@elte.hu>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix kernel-doc notation in mm/readahead.c.
Change ":" to ";" so that it doesn't get treated as a doc section heading.
Move the comment block ending "*/" to a line by itself so that the text on
that last line is not lost (dropped).
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The check t->pid == t->pid is not the blessed way to check whether a task is a
group leader.
This is not about the code beautifulness only, but about pid namespaces fixes
- both the tgid and the pid fields on the task_struct are (slowly :( )
becoming deprecated.
Besides, the thread_group_leader() macro makes only one dereference :)
Signed-off-by: Pavel Emelyanov <xemul@openvz.org>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Correct kernel-doc function names and parameters in rmap.c.
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Convert tiny-shmem.c function comments to kernel-doc. Add parameters and
convert/fix other kernel-doc in shmem.c.
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix various kernel-doc notation in mm/:
filemap.c: add function short description; convert 2 to kernel-doc
fremap.c: change parameter 'prot' to @prot
pagewalk.c: change "-" in function parameters to ":"
slab.c: fix short description of kmem_ptr_validate()
swap.c: fix description & parameters of put_pages_list()
swap_state.c: fix function parameters
vmalloc.c: change "@returns" to "Returns:" since that is not a parameter
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The fallback path needs to enable interrupts like done for
the other page allocator calls. This was not necessary with
the alternate fast path since we handled irq enable/disable in
the slow path. The regular fastpath handles irq enable/disable
around calls to the slow path so we need to restore the proper
status before calling the page allocator from the slowpath.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
iov_iter_advance() skips over zero-length iovecs, however it does not properly
terminate at the end of the iovec array. Fix this by checking against
i->count before we skip a zero-length iov.
The bug was reproduced with a test program that continually randomly creates
iovs to writev. The fix was also verified with the same program and also it
could verify that the correct data was contained in the file after each
writev.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Tested-by: "Kevin Coffman" <kwc@citi.umich.edu>
Cc: "Alexey Dobriyan" <adobriyan@gmail.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Free pages in the hugetlb pool are free and as such have a reference count of
zero. Regular allocations into the pool from the buddy are "freed" into the
pool which results in their page_count dropping to zero. However, surplus
pages can be directly utilized by the caller without first being freed to the
pool. Therefore, a call to put_page_testzero() is in order so that such a
page will be handed to the caller with a correct count.
This has not affected end users because the bad page count is reset before the
page is handed off. However, under CONFIG_DEBUG_VM this triggers a BUG when
the page count is validated.
Thanks go to Mel for first spotting this issue and providing an initial fix.
Signed-off-by: Adam Litke <agl@us.ibm.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Address 3 known bugs in the current memory policy reference counting method.
I have a series of patches to rework the reference counting to reduce overhead
in the allocation path. However, that series will require testing in -mm once
I repost it.
1) alloc_page_vma() does not release the extra reference taken for
vma/shared mempolicy when the mode == MPOL_INTERLEAVE. This can result in
leaking mempolicy structures. This is probably occurring, but not being
noticed.
Fix: add the conditional release of the reference.
2) hugezonelist unconditionally releases a reference on the mempolicy when
mode == MPOL_INTERLEAVE. This can result in decrementing the reference
count for system default policy [should have no ill effect] or premature
freeing of task policy. If this occurred, the next allocation using task
mempolicy would use the freed structure and probably BUG out.
Fix: add the necessary check to the release.
3) The current reference counting method assumes that vma 'get_policy()'
methods automatically add an extra reference a non-NULL returned mempolicy.
This is true for shmem_get_policy() used by tmpfs mappings, including
regular page shm segments. However, SHM_HUGETLB shm's, backed by
hugetlbfs, just use the vma policy without the extra reference. This
results in freeing of the vma policy on the first allocation, with reuse of
the freed mempolicy structure on subsequent allocations.
Fix: Rather than add another condition to the conditional reference
release, which occur in the allocation path, just add a reference when
returning the vma policy in shm_get_policy() to match the assumptions.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Greg KH <greg@kroah.com>
Cc: Andi Kleen <ak@suse.de>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: David Rientjes <rientjes@google.com>
Cc: <eric.whitney@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
NUMA slab allocator cpu migration bugfix
The NUMA slab allocator (specifically, cache_alloc_refill)
is not refreshing its local copies of what cpu and what
numa node it is on, when it drops and reacquires the irq
block that it inherited from its caller. As a result
those values become invalid if an attempt to migrate the
process to another numa node occured while the irq block
had been dropped.
The solution is to make cache_alloc_refill reload these
variables whenever it drops and reacquires the irq block.
The error is very difficult to hit. When it does occur,
one gets the following oops + stack traceback bits in
check_spinlock_acquired:
kernel BUG at mm/slab.c:2417
cache_alloc_refill+0xe6
kmem_cache_alloc+0xd0
...
This patch was developed against 2.6.23, ported to and
compiled-tested only against 2.6.25-rc4.
Signed-off-by: Joe Korty <joe.korty@ccur.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
SLUB should pack even small objects nicely into cachelines if that is what
has been asked for. Use the same algorithm as SLAB for this.
The effect of this patch for a system with a cacheline size of 64
bytes is that the 24 byte sized slab caches will now put exactly
2 objects into a cacheline instead of 3 with some overlap into
the next cacheline. This reduces the object density in a 4k slab
from 170 to 128 objects (same as SLAB).
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Make them all use angle brackets and the directory name.
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
The NUMA fallback logic should be passing local_flags to kmem_get_pages() and not simply the
flags passed in.
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
The remote frees are in the freelist of the page and not in the
percpu freelist.
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Adam Litke noticed that currently we grow the hugepage pool independent of any
cpuset the running process may be in, but when shrinking the pool, the cpuset
is checked. This leads to inconsistency when shrinking the pool in a
restricted cpuset -- an administrator may have been able to grow the pool on a
node restricted by a containing cpuset, but they cannot shrink it there.
There are two options: either prevent growing of the pool outside of the
cpuset or allow shrinking outside of the cpuset. >From previous discussions
on linux-mm, /proc/sys/vm/nr_hugepages is an administrative interface that
should not be restricted by cpusets. So allow shrinking the pool by removing
pages from nodes outside of current's cpuset.
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Acked-by: Adam Litke <agl@us.ibm.com>
Cc: William Irwin <wli@holomorphy.com>
Cc: Lee Schermerhorn <Lee.Schermerhonr@hp.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Paul Jackson <pj@sgi.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A hugetlb reservation may be inadequately backed in the event of racing
allocations and frees when utilizing surplus huge pages. Consider the
following series of events in processes A and B:
A) Allocates some surplus pages to satisfy a reservation
B) Frees some huge pages
A) A notices the extra free pages and drops hugetlb_lock to free some of
its surplus pages back to the buddy allocator.
B) Allocates some huge pages
A) Reacquires hugetlb_lock and returns from gather_surplus_huge_pages()
Avoid this by commiting the reservation after pages have been allocated but
before dropping the lock to free excess pages. For parity, release the
reservation in return_unused_surplus_pages().
This patch also corrects the cpuset_mems_nr() error path in
hugetlb_acct_memory(). If the cpuset check fails, uncommit the
reservation, but also be sure to return any surplus huge pages that may
have been allocated to back the failed reservation.
Thanks to Andy Whitcroft for discovering this.
Signed-off-by: Adam Litke <agl@us.ibm.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
While testing force_empty, during an exit_mmap, __mem_cgroup_remove_list
called from mem_cgroup_uncharge_page oopsed on a NULL pointer in the lru list.
I couldn't see what racing tasks on other cpus were doing, but surmise that
another must have been in mem_cgroup_charge_common on the same page, between
its unlock_page_cgroup and spin_lock_irqsave near done (thanks to that kzalloc
which I'd almost changed to a kmalloc).
Normally such a race cannot happen, the ref_cnt prevents it, the final
uncharge cannot race with the initial charge. But force_empty buggers the
ref_cnt, that's what it's all about; and thereafter forced pages are
vulnerable to races such as this (just think of a shared page also mapped into
an mm of another mem_cgroup than that just emptied). And remain vulnerable
until they're freed indefinitely later.
This patch just fixes the oops by moving the unlock_page_cgroups down below
adding to and removing from the list (only possible given the previous patch);
and while we're at it, we might as well make it an invariant that
page->page_cgroup is always set while pc is on lru.
But this behaviour of force_empty seems highly unsatisfactory to me: why have
a ref_cnt if we always have to cope with it being violated (as in the earlier
page migration patch). We may prefer force_empty to move pages to an orphan
mem_cgroup (could be the root, but better not), from which other cgroups could
recover them; we might need to reverse the locking again; but no time now for
such concerns.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
As for force_empty, though this may not be the main topic here,
mem_cgroup_force_empty_list() can be implemented simpler. It is possible to
make the function just call mem_cgroup_uncharge_page() instead of releasing
page_cgroups by itself. The tip is to call get_page() before invoking
mem_cgroup_uncharge_page(), so the page won't be released during this
function.
Kamezawa-san points out that by the time mem_cgroup_uncharge_page() uncharges,
the page might have been reassigned to an lru of a different mem_cgroup, and
now be emptied from that; but Hugh claims that's okay, the end state is the
same as when it hasn't gone to another list.
And once force_empty stops taking lock_page_cgroup within mz->lru_lock,
mem_cgroup_move_lists() can be simplified to take mz->lru_lock directly while
holding page_cgroup lock (but still has to use try_lock_page_cgroup).
Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ever since the VM_BUG_ON(page_get_page_cgroup(page)) (now Bad page state) went
into page freeing, I've hit it from time to time in testing on some machines,
sometimes only after many days. Recently found a machine which could usually
produce it within a few hours, which got me there at last.
The culprit is mem_cgroup_move_lists, whose locking is inadequate; and the
arrangement of structures was such that you got page_cgroups from the lru list
neatly put on to SLUB's freelist. Kamezawa-san identified the same hole
independently.
The main problem was that it was missing the lock_page_cgroup it needs to
safely page_get_page_cgroup; but it's tricky to go beyond that too, and I
couldn't do it with SLAB_DESTROY_BY_RCU as I'd expected. See the code for
comments on the constraints.
This patch immediately gets replaced by a simpler one from Hirokazu-san; but
is it just foolish pride that tells me to put this one on record, in case we
need to come back to it later?
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hirokazu Takahashi <taka@valinux.co.jp>
Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mem_cgroup_uncharge_page does css_put on the mem_cgroup before uncharging from
it, and before removing page_cgroup from one of its lru lists: isn't there a
danger that struct mem_cgroup memory could be freed and reused before
completing that, so corrupting something? Never seen it, and for all I know
there may be other constraints which make it impossible; but let's be
defensive and reverse the ordering there.
mem_cgroup_force_empty_list is safe because there's an extra css_get around
all its works; but even so, change its ordering the same way round, to help
get in the habit of doing it like this.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hirokazu Takahashi <taka@valinux.co.jp>
Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove clear_page_cgroup: it's an unhelpful helper, see for example how
mem_cgroup_uncharge_page had to unlock_page_cgroup just in order to call it
(serious races from that? I'm not sure).
Once that's gone, you can see it's pointless for page_cgroup's ref_cnt to be
atomic: it's always manipulated under lock_page_cgroup, except where
force_empty unilaterally reset it to 0 (and how does uncharge's
atomic_dec_and_test protect against that?).
Simplify this page_cgroup locking: if you've got the lock and the pc is
attached, then the ref_cnt must be positive: VM_BUG_ONs to check that, and to
check that pc->page matches page (we're on the way to finding why sometimes it
doesn't, but this patch doesn't fix that).
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hirokazu Takahashi <taka@valinux.co.jp>
Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>