refactor(ipc): rework SysV SHM to tmpfs/pagecache backing for Linux compat #1986

Merged: fslongjin merged 2 commits into DragonOS-Community:master from fslongjin:refactor/sysv-shm-linux-compat on Jun 25, 2026
@fslongjin

Member

Summary

Rework SysV shared memory (shmget/shmat/shmdt/shmctl) to be backed by unlinked tmpfs/shmem files + the page cache, converging on the Linux 6.6.139 architecture rather than the previous "pre-allocate a contiguous run of physical pages + manually maintain attach counts through VMA.shm_id" path.

The old design diverges from Linux's and produces incorrect lifecycle, permission, and fragment handling across shmat/shmdt/fork/mremap/exit.

What changes

  • tmpfs/pagecache backing. SHM segments are now backed by unlinked tmpfs/shmem files + PageCache instead of a pre-allocated contiguous physical range, enabling on-demand / non-contiguous allocation and reuse of the existing file-mmap path.
  • Unified VMA/file lifecycle. An attach is modeled as a wrapper file + vm_pgoff + shm_vm_ops; attach counts are driven by VMA open/close hooks instead of manual bookkeeping, so fork/munmap/exit/mremap share one path.
  • Correct shmdt. shmdt now detaches only VMAs that belong to the original SysV SHM attach, including fragments left behind by munmap/mprotect/mremap, instead of unmapping whatever mapping happens to start at the given address.
  • IPC permissions & namespaces. Adds IPC permission checks (owner / non-owner, capabilities) and IPC-namespace awareness (ipc_namespace, uid/gid mapping).
  • mlock accounting. Adds RLIMIT_MEMLOCK accounting and marks resident SHM pages unevictable while the segment is locked.
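
As a rough illustration of the "attach counts driven by VMA open/close hooks" idea, here is a minimal user-space sketch. It is not the kernel code from this PR: `Segment`, `ShmVma`, and the use of `Clone`/`Drop` as stand-ins for the open/close hooks are all assumptions made for the example.

```rust
use std::cell::Cell;
use std::rc::Rc;

/// Hypothetical stand-in for a SysV SHM segment; it only tracks the attach count.
#[derive(Default)]
struct Segment {
    nattch: Cell<usize>,
}

/// Hypothetical stand-in for a VMA that maps the segment.
struct ShmVma {
    seg: Rc<Segment>,
}

impl ShmVma {
    /// "open" hook: runs whenever a VMA referencing the segment comes into existence
    /// (shmat in this sketch; fork goes through `clone` below).
    fn open(seg: &Rc<Segment>) -> ShmVma {
        seg.nattch.set(seg.nattch.get() + 1);
        ShmVma { seg: Rc::clone(seg) }
    }
}

impl Clone for ShmVma {
    /// fork duplicates the VMA, which goes through the same open hook.
    fn clone(&self) -> Self {
        ShmVma::open(&self.seg)
    }
}

impl Drop for ShmVma {
    /// "close" hook: runs when the VMA goes away, whatever the cause
    /// (shmdt, munmap, or process exit in the real system).
    fn drop(&mut self) {
        self.seg.nattch.set(self.seg.nattch.get() - 1);
    }
}
```

In this sketch no caller ever touches `nattch` directly: creating a `ShmVma` raises the count, cloning it (the fork case) raises it again, and dropping either copy lowers it, so every path shares the same bookkeeping.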

Supporting changes

  • mm: fault, madvise, page, and the mmap/mprotect/mremap/msync syscalls.
  • filesystem/page_cache: unevictable/locking aggregation.
  • filesystem/tmpfs: creation of kernel-private unlinked shmem files.
  • mm/ucontext: unified VMA/file lifecycle hooks.

Tests

  • New dunitest suite sysv_shm_semantics covering nattch, RMID, read-only mapping, partial unmap, key/size, and repeated RMID.
  • Extended mlock_semantics; whitelisted the new suite.

This PR contains only the staged code changes; the accompanying design/plan/review docs under docs/kernel/ipc/ are intentionally left out.

🤖 Generated with Claude Code

…ompat

SysV SHM previously pre-allocated a contiguous run of physical pages and
maintained attach counts by hand through VMA.shm_id. That diverges from
Linux's design and produces incorrect lifecycle, permission, and fragment
handling across shmat/shmdt/fork/mremap/exit.

This change reworks SysV SHM toward Linux 6.6.139 semantics:

* Back SHM segments with unlinked tmpfs/shmem files + PageCache instead of
  pre-allocated contiguous pages, enabling on-demand/non-contiguous
  allocation and reuse of the existing file-mmap path.
* Model an attach as a wrapper file + vm_pgoff + shm_vm_ops, with attach
  counts driven by unified VMA open/close hooks rather than manual
  bookkeeping.
* shmdt now only detaches the original SysV SHM VMA, including fragments
  left behind by munmap/mprotect/mremap.
* Add IPC permission checks (owner/non-owner, capabilities) together with
  IPC-namespace awareness (ipc_namespace, uid/gid mapping).
* Add RLIMIT_MEMLOCK accounting and mark resident SHM pages unevictable
  while the segment is locked.
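
The shmdt rule above can be sketched as a small user-space model. This assumes a fragment still belongs to the attach at `addr` when it is SHM-backed and its page offset equals its distance from `addr`; the `Vma`, `Backing`, and `attach_id` names are hypothetical and do not come from the PR.

```rust
const PAGE: usize = 4096;

/// Hypothetical tag for what backs a VMA in this model.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Backing {
    Shm { attach_id: u32 },
    Other,
}

/// Hypothetical, simplified VMA: address range, page offset into the backing object, backing kind.
struct Vma {
    start: usize,
    end: usize,
    pgoff: usize,
    backing: Backing,
}

/// Detach the SHM attach whose base address is `addr`.
/// Returns the number of bytes detached, or "EINVAL" if no SHM VMA matches.
fn shmdt(vmas: &mut Vec<Vma>, addr: usize) -> Result<usize, &'static str> {
    // A fragment sits where the original attach would have placed it
    // when its offset from `addr` matches its page offset.
    let at_expected_offset =
        |v: &Vma| v.start >= addr && (v.start - addr) / PAGE == v.pgoff;

    // Pick the attach from the first matching SHM VMA; non-SHM mappings never qualify.
    let target = vmas
        .iter()
        .find_map(|v| match v.backing {
            Backing::Shm { attach_id } if at_expected_offset(v) => Some(attach_id),
            _ => None,
        })
        .ok_or("EINVAL")?;

    // Remove every fragment of that attach and leave everything else alone.
    let mut detached = 0;
    vmas.retain(|v| {
        let hit = v.backing == Backing::Shm { attach_id: target } && at_expected_offset(v);
        if hit {
            detached += v.end - v.start;
        }
        !hit
    });
    Ok(detached)
}
```

With an unrelated mapping placed at the old base address and two SHM fragments further up, `shmdt` on the base address removes only the two fragments; calling it again reports `EINVAL` and leaves the unrelated mapping in place.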

Supporting work lands in mm (fault, madvise, page, mmap/mprotect/mremap/
msync), page_cache (unevictable/locking aggregation), tmpfs (creation of
kernel-private unlinked shmem files), and ucontext (unified VMA/file
lifecycle hooks).

Tests: add the dunitest sysv_shm_semantics suite (nattch, RMID, read-only
mapping, partial unmap, key/size, repeated RMID) and extend mlock_semantics;
whitelist the new suite.

Signed-off-by: longjin <longjin@dragonos.org>
@fslongjin force-pushed the refactor/sysv-shm-linux-compat branch from 195770e to 66fc347 on June 25, 2026 08:36
@fslongjin

Member Author

@codex review

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 66fc347265

len: size,
prot_flags,
map_flags,
may_exec: true,

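P1: Do not unconditionally grant SysV SHM mappings the right to become executable later

The mapping here is created with VM_MAYEXEC even when the caller did not pass SHM_EXEC. A process that has only read permission can therefore succeed with shmat(..., SHM_RDONLY) and then call mprotect(PROT_EXEC) on the returned address, bypassing the logic in attach_begin() that checks execute permission only when SHM_EXEC is set. may_exec should be allowed only when shmflg contains SHM_EXEC.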

Member Author

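I re-checked the Linux 6.6.139 semantics, and my conclusion is that this should not be changed.

Linux do_shmat() does add PROT_EXEC to the initial prot only when SHM_EXEC is passed, and only then includes the execute bit in the SysV IPC permission check. However, once it enters the generic do_mmap(), a file mapping gets VM_MAYEXEC by default unless the mount holding the path is noexec. A later mprotect(PROT_EXEC) is gated mainly by VM_MAYEXEC and does not go back through the SysV SHM SHM_EXEC permission check.

I also ran a minimal probe on a Linux host: shmat(..., SHM_RDONLY) without SHM_EXEC, followed by mprotect(PROT_READ | PROT_EXEC), succeeded. Changing this to may_exec = shmflg.contains(SHM_EXEC) would therefore be stricter than Linux and would break compatibility.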
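
The behavior described in this reply can be written down as a toy model. It is only a sketch under stated assumptions: `Mapping`, `attach`, and `mprotect_exec` are hypothetical names, and the EPERM result for SHM_EXEC on a noexec mount reflects my understanding of Linux do_mmap() rather than anything stated in this thread.

```rust
/// Hypothetical, simplified view of a mapping: whether it is executable now,
/// and whether it may become executable later (the VM_MAYEXEC idea).
struct Mapping {
    prot_exec: bool,
    may_exec: bool,
}

/// Model of attach: SHM_EXEC decides the initial protection only, while the
/// "may become executable" bit follows the mount's noexec state.
fn attach(shm_exec: bool, mount_noexec: bool) -> Result<Mapping, &'static str> {
    if mount_noexec && shm_exec {
        // Assumption: asking for an executable mapping on a noexec mount is refused.
        return Err("EPERM");
    }
    Ok(Mapping {
        prot_exec: shm_exec,
        may_exec: !mount_noexec,
    })
}

/// Model of mprotect(PROT_EXEC): gated by may_exec only, with no second look at SHM_EXEC.
fn mprotect_exec(m: &mut Mapping) -> Result<(), &'static str> {
    if !m.may_exec {
        return Err("EACCES");
    }
    m.prot_exec = true;
    Ok(())
}
```

In this model an attach without SHM_EXEC starts non-executable but can still be made executable afterwards, which matches the probe result reported above; only a noexec mount blocks the later mprotect.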

Comment thread kernel/src/ipc/shm.rs
let mut guard = SYSV_SHM_MEMLOCK_ACCOUNT.lock();
let current = guard.get(&account_key).copied().unwrap_or(0);
let next = current.checked_add(bytes).ok_or(SystemError::ENOMEM)?;
if (next as u128) > rlimit as u128 && !capable(CAPFlags::CAP_IPC_LOCK) {

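P2: Evaluate the SHM_LOCK privilege against the IPC user namespace

shm_lock_begin() already accepts CAP_IPC_LOCK relative to the user_ns of the current IPC namespace, but this code ignores the target namespace passed in and uses capable() (initial user namespace) to bypass the rlimit. A caller such as a container root, which holds CAP_IPC_LOCK only in the IPC namespace's user_ns, passes the permission check and then incorrectly gets ENOMEM when the segment size exceeds RLIMIT_MEMLOCK. This should call ns_capable(..., CAP_IPC_LOCK) with the passed-in target_user_ns, or skip the charge when the caller already holds the capability in that namespace.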

Member Author

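There are two layers here: the permission check and the memlock rlimit bypass.

shm_lock_begin() checking CAP_IPC_LOCK against the user_ns of the current IPC namespace matches Linux shmctl_do_lock(). But when RLIMIT_MEMLOCK is actually exceeded, the decision whether to bypass the charge is made in Linux 6.6.139's user_shm_lock() with capable(CAP_IPC_LOCK), that is, with initial-user-namespace semantics, not with ns_capable(ns->user_ns, CAP_IPC_LOCK).

So switching the rlimit bypass to target_user_ns as suggested would be inconsistent with Linux semantics. In follow-up commit f264620a I removed the unused target_user_ns parameter from charge_memlock_for_shm() so the code is no longer misleading; the actual rlimit bypass keeps Linux's capable(CAP_IPC_LOCK) behavior.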
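
To make the two layers concrete, here is a small user-space sketch loosely modeled on the snippet under review. The `Caller` struct, the `shm_lock` function, and the `u32` account key are assumptions for illustration, and any owner-based path in the permission layer is left out.

```rust
use std::collections::HashMap;

/// Hypothetical caller description: the same capability, evaluated in two different namespaces.
struct Caller {
    cap_ipc_lock_in_ipc_ns: bool,  // what an ns_capable(ipc_ns.user_ns, ..) style check would see
    cap_ipc_lock_in_init_ns: bool, // what a capable(..) style check would see
}

/// Layer 1 gates the operation; layer 2 charges locked bytes against the rlimit.
fn shm_lock(
    account: &mut HashMap<u32, u64>,
    account_key: u32,
    bytes: u64,
    rlimit: u64,
    caller: &Caller,
) -> Result<(), &'static str> {
    // Layer 1: permission, relative to the IPC namespace's user namespace
    // (simplified: capability only).
    if !caller.cap_ipc_lock_in_ipc_ns {
        return Err("EPERM");
    }
    // Layer 2: memlock accounting. Only the initial-namespace capability lifts the limit.
    let current = account.get(&account_key).copied().unwrap_or(0);
    let next = current.checked_add(bytes).ok_or("ENOMEM")?;
    if next > rlimit && !caller.cap_ipc_lock_in_init_ns {
        return Err("ENOMEM");
    }
    account.insert(account_key, next);
    Ok(())
}
```

A caller that holds the capability only in the IPC namespace passes layer 1 but still gets ENOMEM for a segment larger than the limit, which is the outcome the reply describes as matching Linux.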

Signed-off-by: longjin <longjin@dragonos.org>
@fslongjin merged commit 4bdfeb4 into DragonOS-Community:master on Jun 25, 2026
30 of 31 checks passed