{"api_version":"1","generated_at":"2026-05-28T15:54:17+00:00","cve":"CVE-2026-46223","urls":{"html":"https://cve.report/CVE-2026-46223","api":"https://cve.report/api/cve/CVE-2026-46223.json","docs":"https://cve.report/api","cve_org":"https://www.cve.org/CVERecord?id=CVE-2026-46223","nvd":"https://nvd.nist.gov/vuln/detail/CVE-2026-46223"},"summary":{"title":"cgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated","description":"In the Linux kernel, the following vulnerability has been resolved:\n\ncgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated\n\nA chain of commits going back to v7.0 reworked rmdir to satisfy the\ncontroller invariant that a subsystem's ->css_offline() must not run while\ntasks are still doing kernel-side work in the cgroup.\n\n[1] d245698d727a (\"cgroup: Defer task cgroup unlink until after the task is done switching out\")\n[2] a72f73c4dd9b (\"cgroup: Don't expose dead tasks in cgroup\")\n[3] 1b164b876c36 (\"cgroup: Wait for dying tasks to leave on rmdir\")\n[4] 4c56a8ac6869 (\"cgroup: Fix cgroup_drain_dying() testing the wrong condition\")\n[5] 13e786b64bd3 (\"cgroup: Increment nr_dying_subsys_* from rmdir context\")\n\n[1] moved task cset unlink from do_exit() to finish_task_switch() so a\ntask's cset link drops only after the task has fully stopped scheduling.\nThat made tasks past exit_signals() linger on cset->tasks until their final\ncontext switch, which led to a series of problems as what userspace expected\nto see after rmdir diverged from what the kernel needs to wait for. [2]-[5]\ntried to bridge that divergence: [2] filtered the exiting tasks from\ncgroup.procs; [3] had rmdir(2) sleep in TASK_UNINTERRUPTIBLE for them; [4]\nfixed the wait's condition; [5] made nr_dying_subsys_* visible\nsynchronously.\n\nThe cgroup_drain_dying() wait in [3] turned out to be a dead end. When the\nrmdir caller is also the reaper of a zombie that pins a pidns teardown (e.g.\nhost PID 1 systemd reaping orphan pids that were re-parented to it during\nthe same teardown), rmdir blocks in TASK_UNINTERRUPTIBLE waiting for those\npids to free, the pids can't free because PID 1 is the reaper and it's stuck\nin rmdir, and the system A-A deadlocks. No internal lock ordering breaks\nthis; the wait itself is the bug.\n\nThe css killing side that drove the original reorder, however, can be made\ncleanly asynchronous: ->css_offline() is already async, run from\ncss_killed_work_fn() driven by percpu_ref_kill_and_confirm(). The fix is to\nmake that chain start only after all tasks have left the cgroup. rmdir's\nuser-visible side then returns as soon as cgroup.procs and friends are\nempty, while ->css_offline() still runs only after the cgroup is fully\ndrained.\n\nVerified by the original reproducer (pidns teardown + zombie reaper, runs\nunder vng) which hangs vanilla and succeeds here, and by per-commit\ndeterministic repros for [2], [3], [4], [5] with a boot parameter that\nwidens the post-exit_signals() window so each state is reliably reachable.\nSome stress tests on top of that.\n\ncgroup_apply_control_disable() has the same shape of pre-existing race:\nwhen a controller is disabled via subtree_control, kill_css() ran\nsynchronously while tasks past exit_signals() could still be linked to\nthe cgroup's csets, and ->css_offline() could fire before they drained.\nThis patch preserves the existing synchronous behavior at that call site\n(kill_css_sync() + kill_css_finish() back-to-back) and a follow-up patch\nwill defer kill_css_finish() there using a per-css trigger.\n\nThis seems like the right approach and I don't see problems with it. The\nchanges are somewhat invasive but not excessively so, so backporting to\n-stable should be okay. If something does turn out to be wrong, the fallback\nis to revert the entire chain ([1]-[5]) and rework in the development branch\ninstead.\n\nv2: Pin cgrp across the deferred destroy work with explicit\n    cgroup_get()/cgroup_put() around queue_work() and the work_fn. v1\n    wasn't actually broken (ordered cgroup_offline_wq + queue_work order\n    in cgroup_task_dead() saved it) but the explicit ref removes the\n    dependency on those non-obvious invariants. Also note the\n    pre-existing cgroup_apply_control_disable() race in the description;\n    a follow-up will defer kill_css_finish() there.","state":"PUBLISHED","assigner":"Linux","published_at":"2026-05-28 10:16:37","updated_at":"2026-05-28 13:44:01"},"problem_types":[],"metrics":[],"references":[{"url":"https://git.kernel.org/stable/c/93618edf753838a727dbff63c7c291dee22d656b","name":"https://git.kernel.org/stable/c/93618edf753838a727dbff63c7c291dee22d656b","refsource":"416baaa9-dc9f-4396-8d5f-8c081fb06d67","tags":[],"title":"","mime":"","httpstatus":"","archivestatus":"0"},{"url":"https://git.kernel.org/stable/c/33fa2e6b1507a0a377a151a8826438bedad1d0b0","name":"https://git.kernel.org/stable/c/33fa2e6b1507a0a377a151a8826438bedad1d0b0","refsource":"416baaa9-dc9f-4396-8d5f-8c081fb06d67","tags":[],"title":"","mime":"","httpstatus":"","archivestatus":"0"},{"url":"https://www.cve.org/CVERecord?id=CVE-2026-46223","name":"CVE Program record","refsource":"CVE.ORG","tags":["canonical"]},{"url":"https://nvd.nist.gov/vuln/detail/CVE-2026-46223","name":"NVD vulnerability detail","refsource":"NVD","tags":["canonical","analysis"]}],"affected":[{"source":"CNA","vendor":"Linux","product":"Linux","version":"affected 1b164b876c36c3eb5561dd9b37702b04401b0166 33fa2e6b1507a0a377a151a8826438bedad1d0b0 git","platforms":[]},{"source":"CNA","vendor":"Linux","product":"Linux","version":"affected 1b164b876c36c3eb5561dd9b37702b04401b0166 93618edf753838a727dbff63c7c291dee22d656b git","platforms":[]},{"source":"CNA","vendor":"Linux","product":"Linux","version":"affected 78c72bce4a87819126211c0d24e18350010604fb git","platforms":[]},{"source":"CNA","vendor":"Linux","product":"Linux","version":"affected 6.19.12 6.20 semver","platforms":[]},{"source":"CNA","vendor":"Linux","product":"Linux","version":"affected 7.0","platforms":[]},{"source":"CNA","vendor":"Linux","product":"Linux","version":"unaffected 7.0 semver","platforms":[]},{"source":"CNA","vendor":"Linux","product":"Linux","version":"unaffected 7.0.9 7.0.* semver","platforms":[]},{"source":"CNA","vendor":"Linux","product":"Linux","version":"unaffected 7.1-rc3 * original_commit_for_fix","platforms":[]}],"timeline":[],"solutions":[],"workarounds":[],"exploits":[],"credits":[],"nvd_cpes":[],"vendor_comments":[],"enrichments":{"kev":null,"epss":null,"legacy_qids":[]},"source_records":{"cve_program":{"containers":{"cna":{"affected":[{"defaultStatus":"unaffected","product":"Linux","programFiles":["include/linux/cgroup-defs.h","kernel/cgroup/cgroup.c"],"repo":"https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git","vendor":"Linux","versions":[{"lessThan":"33fa2e6b1507a0a377a151a8826438bedad1d0b0","status":"affected","version":"1b164b876c36c3eb5561dd9b37702b04401b0166","versionType":"git"},{"lessThan":"93618edf753838a727dbff63c7c291dee22d656b","status":"affected","version":"1b164b876c36c3eb5561dd9b37702b04401b0166","versionType":"git"},{"status":"affected","version":"78c72bce4a87819126211c0d24e18350010604fb","versionType":"git"},{"lessThan":"6.20","status":"affected","version":"6.19.12","versionType":"semver"}]},{"defaultStatus":"affected","product":"Linux","programFiles":["include/linux/cgroup-defs.h","kernel/cgroup/cgroup.c"],"repo":"https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git","vendor":"Linux","versions":[{"status":"affected","version":"7.0"},{"lessThan":"7.0","status":"unaffected","version":"0","versionType":"semver"},{"lessThanOrEqual":"7.0.*","status":"unaffected","version":"7.0.9","versionType":"semver"},{"lessThanOrEqual":"*","status":"unaffected","version":"7.1-rc3","versionType":"original_commit_for_fix"}]}],"cpeApplicability":[{"nodes":[{"cpeMatch":[{"criteria":"cpe:2.3:o:linux:linux_kernel:*:*:*:*:*:*:*:*","versionEndExcluding":"7.0.9","versionStartIncluding":"7.0","vulnerable":true},{"criteria":"cpe:2.3:o:linux:linux_kernel:*:*:*:*:*:*:*:*","versionEndExcluding":"7.1-rc3","versionStartIncluding":"7.0","vulnerable":true},{"criteria":"cpe:2.3:o:linux:linux_kernel:*:*:*:*:*:*:*:*","versionStartIncluding":"6.19.12","vulnerable":true}],"negate":false,"operator":"OR"}]}],"descriptions":[{"lang":"en","value":"In the Linux kernel, the following vulnerability has been resolved:\n\ncgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated\n\nA chain of commits going back to v7.0 reworked rmdir to satisfy the\ncontroller invariant that a subsystem's ->css_offline() must not run while\ntasks are still doing kernel-side work in the cgroup.\n\n[1] d245698d727a (\"cgroup: Defer task cgroup unlink until after the task is done switching out\")\n[2] a72f73c4dd9b (\"cgroup: Don't expose dead tasks in cgroup\")\n[3] 1b164b876c36 (\"cgroup: Wait for dying tasks to leave on rmdir\")\n[4] 4c56a8ac6869 (\"cgroup: Fix cgroup_drain_dying() testing the wrong condition\")\n[5] 13e786b64bd3 (\"cgroup: Increment nr_dying_subsys_* from rmdir context\")\n\n[1] moved task cset unlink from do_exit() to finish_task_switch() so a\ntask's cset link drops only after the task has fully stopped scheduling.\nThat made tasks past exit_signals() linger on cset->tasks until their final\ncontext switch, which led to a series of problems as what userspace expected\nto see after rmdir diverged from what the kernel needs to wait for. [2]-[5]\ntried to bridge that divergence: [2] filtered the exiting tasks from\ncgroup.procs; [3] had rmdir(2) sleep in TASK_UNINTERRUPTIBLE for them; [4]\nfixed the wait's condition; [5] made nr_dying_subsys_* visible\nsynchronously.\n\nThe cgroup_drain_dying() wait in [3] turned out to be a dead end. When the\nrmdir caller is also the reaper of a zombie that pins a pidns teardown (e.g.\nhost PID 1 systemd reaping orphan pids that were re-parented to it during\nthe same teardown), rmdir blocks in TASK_UNINTERRUPTIBLE waiting for those\npids to free, the pids can't free because PID 1 is the reaper and it's stuck\nin rmdir, and the system A-A deadlocks. No internal lock ordering breaks\nthis; the wait itself is the bug.\n\nThe css killing side that drove the original reorder, however, can be made\ncleanly asynchronous: ->css_offline() is already async, run from\ncss_killed_work_fn() driven by percpu_ref_kill_and_confirm(). The fix is to\nmake that chain start only after all tasks have left the cgroup. rmdir's\nuser-visible side then returns as soon as cgroup.procs and friends are\nempty, while ->css_offline() still runs only after the cgroup is fully\ndrained.\n\nVerified by the original reproducer (pidns teardown + zombie reaper, runs\nunder vng) which hangs vanilla and succeeds here, and by per-commit\ndeterministic repros for [2], [3], [4], [5] with a boot parameter that\nwidens the post-exit_signals() window so each state is reliably reachable.\nSome stress tests on top of that.\n\ncgroup_apply_control_disable() has the same shape of pre-existing race:\nwhen a controller is disabled via subtree_control, kill_css() ran\nsynchronously while tasks past exit_signals() could still be linked to\nthe cgroup's csets, and ->css_offline() could fire before they drained.\nThis patch preserves the existing synchronous behavior at that call site\n(kill_css_sync() + kill_css_finish() back-to-back) and a follow-up patch\nwill defer kill_css_finish() there using a per-css trigger.\n\nThis seems like the right approach and I don't see problems with it. The\nchanges are somewhat invasive but not excessively so, so backporting to\n-stable should be okay. If something does turn out to be wrong, the fallback\nis to revert the entire chain ([1]-[5]) and rework in the development branch\ninstead.\n\nv2: Pin cgrp across the deferred destroy work with explicit\n    cgroup_get()/cgroup_put() around queue_work() and the work_fn. v1\n    wasn't actually broken (ordered cgroup_offline_wq + queue_work order\n    in cgroup_task_dead() saved it) but the explicit ref removes the\n    dependency on those non-obvious invariants. Also note the\n    pre-existing cgroup_apply_control_disable() race in the description;\n    a follow-up will defer kill_css_finish() there."}],"providerMetadata":{"dateUpdated":"2026-05-28T09:40:40.791Z","orgId":"416baaa9-dc9f-4396-8d5f-8c081fb06d67","shortName":"Linux"},"references":[{"url":"https://git.kernel.org/stable/c/33fa2e6b1507a0a377a151a8826438bedad1d0b0"},{"url":"https://git.kernel.org/stable/c/93618edf753838a727dbff63c7c291dee22d656b"}],"title":"cgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated","x_generator":{"engine":"bippy-1.2.0"}}},"cveMetadata":{"assignerOrgId":"416baaa9-dc9f-4396-8d5f-8c081fb06d67","assignerShortName":"Linux","cveId":"CVE-2026-46223","datePublished":"2026-05-28T09:40:40.791Z","dateReserved":"2026-05-13T15:03:33.106Z","dateUpdated":"2026-05-28T09:40:40.791Z","state":"PUBLISHED"},"dataType":"CVE_RECORD","dataVersion":"5.2"},"nvd":{"publishedDate":"2026-05-28 10:16:37","lastModifiedDate":"2026-05-28 13:44:01","problem_types":[],"metrics":[],"configurations":[]},"legacy_mitre":{"record":{"CveYear":"2026","CveId":"46223","Ordinal":"1","Title":"cgroup: Defer css percpu_ref kill on rmdir until cgroup is depop","CVE":"CVE-2026-46223","Year":"2026"},"notes":[{"CveYear":"2026","CveId":"46223","Ordinal":"1","NoteData":"In the Linux kernel, the following vulnerability has been resolved:\n\ncgroup: Defer css percpu_ref kill on rmdir until cgroup is depopulated\n\nA chain of commits going back to v7.0 reworked rmdir to satisfy the\ncontroller invariant that a subsystem's ->css_offline() must not run while\ntasks are still doing kernel-side work in the cgroup.\n\n[1] d245698d727a (\"cgroup: Defer task cgroup unlink until after the task is done switching out\")\n[2] a72f73c4dd9b (\"cgroup: Don't expose dead tasks in cgroup\")\n[3] 1b164b876c36 (\"cgroup: Wait for dying tasks to leave on rmdir\")\n[4] 4c56a8ac6869 (\"cgroup: Fix cgroup_drain_dying() testing the wrong condition\")\n[5] 13e786b64bd3 (\"cgroup: Increment nr_dying_subsys_* from rmdir context\")\n\n[1] moved task cset unlink from do_exit() to finish_task_switch() so a\ntask's cset link drops only after the task has fully stopped scheduling.\nThat made tasks past exit_signals() linger on cset->tasks until their final\ncontext switch, which led to a series of problems as what userspace expected\nto see after rmdir diverged from what the kernel needs to wait for. [2]-[5]\ntried to bridge that divergence: [2] filtered the exiting tasks from\ncgroup.procs; [3] had rmdir(2) sleep in TASK_UNINTERRUPTIBLE for them; [4]\nfixed the wait's condition; [5] made nr_dying_subsys_* visible\nsynchronously.\n\nThe cgroup_drain_dying() wait in [3] turned out to be a dead end. When the\nrmdir caller is also the reaper of a zombie that pins a pidns teardown (e.g.\nhost PID 1 systemd reaping orphan pids that were re-parented to it during\nthe same teardown), rmdir blocks in TASK_UNINTERRUPTIBLE waiting for those\npids to free, the pids can't free because PID 1 is the reaper and it's stuck\nin rmdir, and the system A-A deadlocks. No internal lock ordering breaks\nthis; the wait itself is the bug.\n\nThe css killing side that drove the original reorder, however, can be made\ncleanly asynchronous: ->css_offline() is already async, run from\ncss_killed_work_fn() driven by percpu_ref_kill_and_confirm(). The fix is to\nmake that chain start only after all tasks have left the cgroup. rmdir's\nuser-visible side then returns as soon as cgroup.procs and friends are\nempty, while ->css_offline() still runs only after the cgroup is fully\ndrained.\n\nVerified by the original reproducer (pidns teardown + zombie reaper, runs\nunder vng) which hangs vanilla and succeeds here, and by per-commit\ndeterministic repros for [2], [3], [4], [5] with a boot parameter that\nwidens the post-exit_signals() window so each state is reliably reachable.\nSome stress tests on top of that.\n\ncgroup_apply_control_disable() has the same shape of pre-existing race:\nwhen a controller is disabled via subtree_control, kill_css() ran\nsynchronously while tasks past exit_signals() could still be linked to\nthe cgroup's csets, and ->css_offline() could fire before they drained.\nThis patch preserves the existing synchronous behavior at that call site\n(kill_css_sync() + kill_css_finish() back-to-back) and a follow-up patch\nwill defer kill_css_finish() there using a per-css trigger.\n\nThis seems like the right approach and I don't see problems with it. The\nchanges are somewhat invasive but not excessively so, so backporting to\n-stable should be okay. If something does turn out to be wrong, the fallback\nis to revert the entire chain ([1]-[5]) and rework in the development branch\ninstead.\n\nv2: Pin cgrp across the deferred destroy work with explicit\n    cgroup_get()/cgroup_put() around queue_work() and the work_fn. v1\n    wasn't actually broken (ordered cgroup_offline_wq + queue_work order\n    in cgroup_task_dead() saved it) but the explicit ref removes the\n    dependency on those non-obvious invariants. Also note the\n    pre-existing cgroup_apply_control_disable() race in the description;\n    a follow-up will defer kill_css_finish() there.","Type":"Description","Title":"cgroup: Defer css percpu_ref kill on rmdir until cgroup is depop"}]}}}