From ab1de7ead871ebe6d12a774c3c25de0388cde082 Mon Sep 17 00:00:00 2001 From: Qi Zheng Date: Wed, 17 May 2023 07:45:45 +0000 Subject: [PATCH 1/3] cgroup: fix missing cpus_read_{lock,unlock}() in cgroup_transfer_tasks() The commit 4f7e7236435c ("cgroup: Fix threadgroup_rwsem <-> cpus_read_lock() deadlock") fixed the deadlock between cgroup_threadgroup_rwsem and cpus_read_lock() by introducing cgroup_attach_{lock,unlock}() and removing cpus_read_{lock,unlock}() from cpuset_attach(). But cgroup_transfer_tasks() was missed and not handled, which will cause th following warning: WARNING: CPU: 0 PID: 589 at kernel/cpu.c:526 lockdep_assert_cpus_held+0x32/0x40 CPU: 0 PID: 589 Comm: kworker/1:4 Not tainted 6.4.0-rc2-next-20230517 #50 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 Workqueue: events cpuset_hotplug_workfn RIP: 0010:lockdep_assert_cpus_held+0x32/0x40 <...> Call Trace: cpuset_attach+0x40/0x240 cgroup_migrate_execute+0x452/0x5e0 ? _raw_spin_unlock_irq+0x28/0x40 cgroup_transfer_tasks+0x1f3/0x360 ? find_held_lock+0x32/0x90 ? cpuset_hotplug_workfn+0xc81/0xed0 cpuset_hotplug_workfn+0xcb1/0xed0 ? process_one_work+0x248/0x5b0 process_one_work+0x2b9/0x5b0 worker_thread+0x56/0x3b0 ? process_one_work+0x5b0/0x5b0 kthread+0xf1/0x120 ? kthread_complete_and_exit+0x20/0x20 ret_from_fork+0x1f/0x30 So just use the cgroup_attach_{lock,unlock}() helper to fix it. Reported-by: Zhao Gongyi Signed-off-by: Qi Zheng Acked-by: Muchun Song Fixes: 05c7b7a92cc8 ("cgroup/cpuset: Fix a race between cpuset_attach() and cpu hotplug") Cc: stable@vger.kernel.org # v5.17+ Signed-off-by: Tejun Heo --- kernel/cgroup/cgroup-v1.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/kernel/cgroup/cgroup-v1.c b/kernel/cgroup/cgroup-v1.c index aeef06c465ef..5407241dbb45 100644 --- a/kernel/cgroup/cgroup-v1.c +++ b/kernel/cgroup/cgroup-v1.c @@ -108,7 +108,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from) cgroup_lock(); - percpu_down_write(&cgroup_threadgroup_rwsem); + cgroup_attach_lock(true); /* all tasks in @from are being moved, all csets are source */ spin_lock_irq(&css_set_lock); @@ -144,7 +144,7 @@ int cgroup_transfer_tasks(struct cgroup *to, struct cgroup *from) } while (task && !ret); out_err: cgroup_migrate_finish(&mgctx); - percpu_up_write(&cgroup_threadgroup_rwsem); + cgroup_attach_unlock(true); cgroup_unlock(); return ret; } From 2bd110339288c18823dcace602b63b0d8627e520 Mon Sep 17 00:00:00 2001 From: John Sperbeck Date: Sun, 21 May 2023 19:29:53 +0000 Subject: [PATCH 2/3] cgroup: always put cset in cgroup_css_set_put_fork A successful call to cgroup_css_set_fork() will always have taken a ref on kargs->cset (regardless of CLONE_INTO_CGROUP), so always do a corresponding put in cgroup_css_set_put_fork(). Without this, a cset and its contained css structures will be leaked for some fork failures. The following script reproduces the leak for a fork failure due to exceeding pids.max in the pids controller. A similar thing can happen if we jump to the bad_fork_cancel_cgroup label in copy_process(). [ -z "$1" ] && echo "Usage $0 pids-root" && exit 1 PID_ROOT=$1 CGROUP=$PID_ROOT/foo [ -e $CGROUP ] && rmdir -f $CGROUP mkdir $CGROUP echo 5 > $CGROUP/pids.max echo $$ > $CGROUP/cgroup.procs fork_bomb() { set -e for i in $(seq 10); do /bin/sleep 3600 & done } (fork_bomb) & wait echo $$ > $PID_ROOT/cgroup.procs kill $(cat $CGROUP/cgroup.procs) rmdir $CGROUP Fixes: ef2c41cf38a7 ("clone3: allow spawning processes into cgroups") Cc: stable@vger.kernel.org # v5.7+ Signed-off-by: John Sperbeck Signed-off-by: Tejun Heo --- kernel/cgroup/cgroup.c | 17 ++++++++--------- 1 file changed, 8 insertions(+), 9 deletions(-) diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 625d7483951c..245cf62ce85a 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -6486,19 +6486,18 @@ static int cgroup_css_set_fork(struct kernel_clone_args *kargs) static void cgroup_css_set_put_fork(struct kernel_clone_args *kargs) __releases(&cgroup_threadgroup_rwsem) __releases(&cgroup_mutex) { + struct cgroup *cgrp = kargs->cgrp; + struct css_set *cset = kargs->cset; + cgroup_threadgroup_change_end(current); + if (cset) { + put_css_set(cset); + kargs->cset = NULL; + } + if (kargs->flags & CLONE_INTO_CGROUP) { - struct cgroup *cgrp = kargs->cgrp; - struct css_set *cset = kargs->cset; - cgroup_unlock(); - - if (cset) { - put_css_set(cset); - kargs->cset = NULL; - } - if (cgrp) { cgroup_put(cgrp); kargs->cgrp = NULL; From 5647e53f7856bb39dae781fe26aa65a699e2fc9f Mon Sep 17 00:00:00 2001 From: Dan Schatzberg Date: Thu, 1 Jun 2023 11:38:19 -0700 Subject: [PATCH 3/3] cgroup: Documentation: Clarify usage of memory limits The existing documentation refers to memory.high as the "main mechanism to control memory usage." This seems incorrect to me - memory.high can result in reclaim pressure which simply leads to stalls unless some external component observes and actions on it (e.g. systemd-oomd can be used for this purpose). While this is feasible, users are unaware of this interaction and are led to believe that memory.high alone is an effective mechanism for limiting memory. The documentation should recommend the use of memory.max as the effective way to enforce memory limits - it triggers reclaim and results in OOM kills by itself. Signed-off-by: Dan Schatzberg Acked-by: Johannes Weiner Acked-by: Chris Down Signed-off-by: Tejun Heo --- Documentation/admin-guide/cgroup-v2.rst | 22 ++++++++++------------ 1 file changed, 10 insertions(+), 12 deletions(-) diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index f67c0829350b..e592a9364473 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -1213,23 +1213,25 @@ PAGE_SIZE multiple when read back. A read-write single value file which exists on non-root cgroups. The default is "max". - Memory usage throttle limit. This is the main mechanism to - control memory usage of a cgroup. If a cgroup's usage goes + Memory usage throttle limit. If a cgroup's usage goes over the high boundary, the processes of the cgroup are throttled and put under heavy reclaim pressure. Going over the high limit never invokes the OOM killer and - under extreme conditions the limit may be breached. + under extreme conditions the limit may be breached. The high + limit should be used in scenarios where an external process + monitors the limited cgroup to alleviate heavy reclaim + pressure. memory.max A read-write single value file which exists on non-root cgroups. The default is "max". - Memory usage hard limit. This is the final protection - mechanism. If a cgroup's memory usage reaches this limit and - can't be reduced, the OOM killer is invoked in the cgroup. - Under certain circumstances, the usage may go over the limit - temporarily. + Memory usage hard limit. This is the main mechanism to limit + memory usage of a cgroup. If a cgroup's memory usage reaches + this limit and can't be reduced, the OOM killer is invoked in + the cgroup. Under certain circumstances, the usage may go + over the limit temporarily. In default configuration regular 0-order allocations always succeed unless OOM killer chooses current task as a victim. @@ -1238,10 +1240,6 @@ PAGE_SIZE multiple when read back. Caller could retry them differently, return into userspace as -ENOMEM or silently ignore in cases like disk readahead. - This is the ultimate protection mechanism. As long as the - high limit is used and monitored properly, this limit's - utility is limited to providing the final safety net. - memory.reclaim A write-only nested-keyed file which exists for all cgroups.