refactor(ipc): rework SysV SHM to tmpfs/pagecache backing for Linux compat #1986
…ompat

SysV SHM previously pre-allocated a contiguous run of physical pages and maintained attach counts by hand through VMA.shm_id. That diverges from Linux's design and produces incorrect lifecycle, permission, and fragment handling across shmat/shmdt/fork/mremap/exit.

This change reworks SysV SHM toward Linux 6.6.139 semantics:

* Back SHM segments with unlinked tmpfs/shmem files + PageCache instead of pre-allocated contiguous pages, enabling on-demand/non-contiguous allocation and reuse of the existing file-mmap path.
* Model an attach as a wrapper file + vm_pgoff + shm_vm_ops, with attach counts driven by unified VMA open/close hooks rather than manual bookkeeping.
* shmdt now only detaches the original SysV SHM VMA, including fragments left behind by munmap/mprotect/mremap.
* Add IPC permission checks (owner/non-owner, capabilities) together with IPC-namespace awareness (ipc_namespace, uid/gid mapping).
* Add RLIMIT_MEMLOCK accounting and mark resident SHM pages unevictable while the segment is locked.

Supporting work lands in mm (fault, madvise, page, mmap/mprotect/mremap/msync), page_cache (unevictable/locking aggregation), tmpfs (creation of kernel-private unlinked shmem files), and ucontext (unified VMA/file lifecycle hooks).

Tests: add the dunitest sysv_shm_semantics suite (nattch, RMID, read-only mapping, partial unmap, key/size, repeated RMID) and extend mlock_semantics; whitelist the new suite.

Signed-off-by: longjin <longjin@dragonos.org>
195770e to 66fc347
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 66fc347265
    len: size,
    prot_flags,
    map_flags,
    may_exec: true,
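I re-checked the Linux 6.6.139 semantics here, and my conclusion is that this point should not be changed.
Linux do_shmat() does add PROT_EXEC to the initial prot only when SHM_EXEC is passed, and it includes the execute bit in the SysV IPC permission check. However, once it enters the generic do_mmap(), a file mapping gets VM_MAYEXEC by default unless the mount containing the path is noexec. A later mprotect(PROT_EXEC) is gated mainly by VM_MAYEXEC and does not go back through the SysV SHM SHM_EXEC permission check.
I also ran a minimal probe on a Linux host: without SHM_EXEC, shmat(..., SHM_RDONLY) followed by mprotect(PROT_READ | PROT_EXEC) succeeds. Changing this to may_exec = shmflg.contains(SHM_EXEC) would therefore be stricter than Linux and would break compatibility.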
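The distinction drawn above, between the protection an attach starts with and the ceiling that later mprotect calls are checked against, can be sketched as a small userspace model. This is an illustration only, not DragonOS or Linux code: `Prot`, `Vma`, `attach`, and `mprotect` are hypothetical names, and the rule that a read-only attach also loses its write ceiling is an assumption of the model.

```rust
// Userspace model of the two separate "exec" decisions discussed above.
// All names are hypothetical; this is not kernel code.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Prot {
    read: bool,
    write: bool,
    exec: bool,
}

#[derive(Debug)]
struct Vma {
    prot: Prot,      // protection currently in effect
    may_write: bool, // VM_MAYWRITE-style ceiling
    may_exec: bool,  // VM_MAYEXEC-style ceiling
}

/// Model of an attach: SHM_EXEC only affects the initial protection.
/// The exec ceiling depends on the backing mount, not on SHM_EXEC.
fn attach(shm_rdonly: bool, shm_exec: bool, mount_noexec: bool) -> Vma {
    Vma {
        prot: Prot { read: true, write: !shm_rdonly, exec: shm_exec },
        may_write: !shm_rdonly, // assumption: read-only attach drops the write ceiling
        may_exec: !mount_noexec,
    }
}

/// Model of mprotect: a request is refused only when it exceeds a ceiling.
fn mprotect(vma: &mut Vma, want: Prot) -> Result<(), &'static str> {
    if want.write && !vma.may_write {
        return Err("EACCES");
    }
    if want.exec && !vma.may_exec {
        return Err("EACCES");
    }
    vma.prot = want;
    Ok(())
}

fn main() {
    // Same shape as the probe: no SHM_EXEC, SHM_RDONLY, then ask for R+X.
    let mut vma = attach(true, false, false);
    let rx = Prot { read: true, write: false, exec: true };
    println!("mprotect(R|X): {:?}", mprotect(&mut vma, rx));
    println!("protection now: {:?}", vma.prot);

    // Asking for write on the read-only attach exceeds the write ceiling.
    let rw = Prot { read: true, write: true, exec: false };
    println!("mprotect(R|W): {:?}", mprotect(&mut vma, rw));
}
```

In this model, tying `may_exec` to `shm_exec` would make the first `mprotect` call fail, which is the stricter-than-Linux behavior the reply argues against.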
    let mut guard = SYSV_SHM_MEMLOCK_ACCOUNT.lock();
    let current = guard.get(&account_key).copied().unwrap_or(0);
    let next = current.checked_add(bytes).ok_or(SystemError::ENOMEM)?;
    if (next as u128) > rlimit as u128 && !capable(CAPFlags::CAP_IPC_LOCK) {
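shm_lock_begin() already accepts CAP_IPC_LOCK according to the user_ns of the current IPC namespace, but this code ignores the target namespace that was passed in and uses capable() (initial user namespace) to bypass the rlimit. A caller such as a container root, which holds CAP_IPC_LOCK only in the IPC namespace's user_ns, passes the permission check first and then wrongly gets ENOMEM when the segment size exceeds RLIMIT_MEMLOCK. This should use the passed-in target_user_ns for ns_capable(..., CAP_IPC_LOCK), or skip the charge when the caller already holds the capability in that namespace.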
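There are two layers to look at here: the permission check and the memlock rlimit bypass.
Having shm_lock_begin() check CAP_IPC_LOCK against the user_ns of the current IPC namespace matches Linux shmctl_do_lock(). But when RLIMIT_MEMLOCK is actually exceeded, the decision to bypass the charge in Linux 6.6.139's user_shm_lock() uses capable(CAP_IPC_LOCK), that is, initial user namespace semantics, not ns_capable(ns->user_ns, CAP_IPC_LOCK).
So changing the rlimit bypass to use target_user_ns, as suggested, would be inconsistent with Linux semantics. In the follow-up commit f264620a I removed the unused target_user_ns parameter from charge_memlock_for_shm() so the code is no longer misleading; the actual rlimit bypass keeps Linux's capable(CAP_IPC_LOCK) behavior.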
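The second layer described above, the charge against RLIMIT_MEMLOCK, can be sketched as a userspace model that mirrors the shape of the quoted snippet. It is a sketch under stated assumptions, not the PR's code: `MemlockAccounts`, `ChargeError`, `locked_bytes`, `uncharge`, and the `init_ns_cap_ipc_lock` flag are hypothetical names, and the first-layer permission check is deliberately left out.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq, Eq)]
enum ChargeError {
    NoMem, // stands in for ENOMEM
}

/// Locked-byte totals per account. The u32 key is a hypothetical account id.
struct MemlockAccounts {
    locked: HashMap<u32, u64>,
}

impl MemlockAccounts {
    fn new() -> Self {
        Self { locked: HashMap::new() }
    }

    fn locked_bytes(&self, account: u32) -> u64 {
        self.locked.get(&account).copied().unwrap_or(0)
    }

    /// Charge `bytes` to `account`. Going over `rlimit` is allowed only when
    /// the caller holds CAP_IPC_LOCK in the *initial* user namespace; holding
    /// it only in the IPC namespace's user_ns does not bypass the limit here.
    fn charge(
        &mut self,
        account: u32,
        bytes: u64,
        rlimit: u64,
        init_ns_cap_ipc_lock: bool,
    ) -> Result<(), ChargeError> {
        let current = self.locked_bytes(account);
        let next = current.checked_add(bytes).ok_or(ChargeError::NoMem)?;
        if next > rlimit && !init_ns_cap_ipc_lock {
            return Err(ChargeError::NoMem);
        }
        self.locked.insert(account, next);
        Ok(())
    }

    /// Undo a previous charge when the segment is unlocked or destroyed.
    fn uncharge(&mut self, account: u32, bytes: u64) {
        if let Some(total) = self.locked.get_mut(&account) {
            *total = total.saturating_sub(bytes);
        }
    }
}

fn main() {
    let mut accounts = MemlockAccounts::new();
    let rlimit = 8192;

    // Within the limit: succeeds without any capability.
    println!("{:?}", accounts.charge(1000, 4096, rlimit, false));
    // Over the limit without the initial-namespace capability: refused.
    println!("{:?}", accounts.charge(1000, 8192, rlimit, false));
    // Over the limit with the initial-namespace capability: allowed.
    println!("{:?}", accounts.charge(1000, 8192, rlimit, true));
    println!("locked: {}", accounts.locked_bytes(1000));

    accounts.uncharge(1000, 12288);
    println!("locked after uncharge: {}", accounts.locked_bytes(1000));
}
```

A container root in this model would call `charge` with `init_ns_cap_ipc_lock` set to `false`, so an over-limit lock is refused even though the namespace-scoped permission check would have passed.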
Signed-off-by: longjin <longjin@dragonos.org>
Summary
Rework SysV shared memory (`shmget`/`shmat`/`shmdt`/`shmctl`) to be backed by unlinked tmpfs/shmem files + the page cache, converging on the Linux 6.6.139 architecture rather than the previous "pre-allocate a contiguous run of physical pages + manually maintain attach counts through `VMA.shm_id`" path.

The old design diverges from Linux's and produces incorrect lifecycle, permission, and fragment handling across `shmat`/`shmdt`/`fork`/`mremap`/`exit`.

What changes
- Segments are backed by unlinked tmpfs/shmem files + `PageCache` instead of a pre-allocated contiguous physical range, enabling on-demand / non-contiguous allocation and reuse of the existing file-mmap path.
- An attach is modeled as a wrapper file + `vm_pgoff` + `shm_vm_ops`; attach counts are driven by VMA open/close hooks instead of manual bookkeeping, so `fork`/`munmap`/`exit`/`mremap` share one path.
- `shmdt` only detaches the original SysV SHM VMA, including fragments left behind by `munmap`/`mprotect`/`mremap`, instead of unmapping an arbitrary mapping at a given start address.
- Adds IPC permission checks (owner/non-owner, capabilities) together with IPC-namespace awareness (`ipc_namespace`, uid/gid mapping).
- Adds `RLIMIT_MEMLOCK` accounting and marks resident SHM pages unevictable while the segment is locked.

Supporting changes
- `mm`: fault, madvise, page, and the mmap/mprotect/mremap/msync syscalls.
- `filesystem/page_cache`: unevictable/locking aggregation.
- `filesystem/tmpfs`: creation of kernel-private unlinked shmem files.
- `mm/ucontext`: unified VMA/file lifecycle hooks.

Tests
- Added the dunitest suite `sysv_shm_semantics` covering nattch, RMID, read-only mapping, partial unmap, key/size, and repeated RMID.
- Extended `mlock_semantics`; whitelisted the new suite.

This PR contains only the staged code changes; the accompanying design/plan/review docs under `docs/kernel/ipc/` are intentionally left out.

🤖 Generated with Claude Code
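The attach-count item under "What changes" (counts driven by VMA open/close hooks rather than manual bookkeeping) can be sketched as a userspace model. This is not the PR's implementation: `Segment`, `Vma`, `dup`, and `split` are hypothetical names, the open hook is modeled by the constructor and the close hook by `Drop`, and the model assumes that splitting a mapping runs the open hook for the new piece and that a segment marked with RMID is destroyed on the last close.

```rust
use std::cell::RefCell;
use std::rc::Rc;

#[derive(Debug, Default)]
struct Segment {
    nattch: u64,
    rmid_pending: bool,
    destroyed: bool,
}

impl Segment {
    /// Open hook: every new mapping of the segment counts as one attach.
    fn vma_open(&mut self) {
        self.nattch += 1;
    }

    /// Close hook: the last close of an RMID-marked segment destroys it.
    fn vma_close(&mut self) {
        self.nattch -= 1;
        if self.nattch == 0 && self.rmid_pending {
            self.destroyed = true;
        }
    }

    /// Model of IPC_RMID: destroy now if unattached, otherwise defer.
    fn rmid(&mut self) {
        self.rmid_pending = true;
        if self.nattch == 0 {
            self.destroyed = true;
        }
    }
}

/// A mapping of bytes [start, end) of one segment.
struct Vma {
    seg: Rc<RefCell<Segment>>,
    start: usize,
    end: usize,
}

impl Vma {
    /// Creating a mapping runs the open hook.
    fn new(seg: &Rc<RefCell<Segment>>, start: usize, end: usize) -> Vma {
        seg.borrow_mut().vma_open();
        Vma { seg: Rc::clone(seg), start, end }
    }

    /// fork-style duplicate: the child's copy goes through the same open hook.
    fn dup(&self) -> Vma {
        Vma::new(&self.seg, self.start, self.end)
    }

    /// Split at `at`, as a partial munmap or mprotect would: `self` keeps
    /// [start, at) and the returned mapping covers [at, end).
    fn split(&mut self, at: usize) -> Vma {
        assert!(self.start < at && at < self.end);
        let tail = Vma::new(&self.seg, at, self.end);
        self.end = at;
        tail
    }
}

impl Drop for Vma {
    /// Any teardown path (munmap, shmdt, exit) ends in the close hook.
    fn drop(&mut self) {
        self.seg.borrow_mut().vma_close();
    }
}

fn main() {
    let seg = Rc::new(RefCell::new(Segment::default()));
    let mut parent = Vma::new(&seg, 0, 8192);
    let child = parent.dup();
    let tail = parent.split(4096);
    println!("nattch after attach, fork, split: {}", seg.borrow().nattch);

    seg.borrow_mut().rmid();
    println!("destroyed while attached: {}", seg.borrow().destroyed);

    drop(tail);
    drop(child);
    drop(parent);
    println!("destroyed after last close: {}", seg.borrow().destroyed);
}
```

Because every path that creates or removes a mapping goes through the same two hooks, the count cannot drift the way a per-syscall manual counter can.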