
Commit da06bb0

David Hildenbrand (Red Hat) authored and gregkh committed
mm/hugetlb: fix excessive IPI broadcasts when unsharing PMD tables using mmu_gather
commit 8ce720d upstream.

As reported, ever since commit 1013af4 ("mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race") we can end up in situations where we perform so many IPI broadcasts when unsharing hugetlb PMD page tables that some workloads regress severely. In particular, when we fork()+exit(), or when we munmap() a large area backed by many shared PMD tables, we perform one IPI broadcast per unshared PMD table.

There are two optimizations to be had:

(1) When we process (unshare) multiple such PMD tables, such as during exit(), it is sufficient to send a single IPI broadcast (as long as we respect locking rules) instead of one per PMD table. Locking prevents any of these PMD tables from getting reused before we drop the lock.

(2) When we are not the last sharer (> 2 users including us), there is no need to send the IPI broadcast. The shared PMD tables cannot become exclusive (fully unshared) before an IPI is broadcast by the last sharer.

Concurrent GUP-fast could walk into a PMD table just before we unshared it. It could then succeed in grabbing a page from the shared page table even after munmap() etc. succeeded (and suppressed an IPI). But there is no difference compared to GUP-fast simply sleeping for a while after grabbing the page and re-enabling IRQs. Most importantly, GUP-fast will never walk into page tables that are no longer shared, because the last sharer will issue an IPI broadcast. (If ever required, checking in GUP-fast whether the PUD changed after grabbing the page, like we do in the PTE case, could handle this.)

So let's rework PMD sharing TLB flushing + IPI sync to use the mmu_gather infrastructure, so we can implement these optimizations and demystify the code at least a bit. Extend the mmu_gather infrastructure to be able to deal with our special hugetlb PMD table sharing implementation.
To make initialization of the mmu_gather easier when working on a single VMA (in particular, when dealing with hugetlb), provide tlb_gather_mmu_vma().

We'll consolidate the handling for (full) unsharing of PMD tables in tlb_unshare_pmd_ptdesc() and tlb_flush_unshared_tables(), and track in "struct mmu_gather" whether we had (full) unsharing of PMD tables.

Because locking is very special (concurrent unsharing+reuse must be prevented), we disallow deferring flushing to tlb_finish_mmu() and instead require an explicit earlier call to tlb_flush_unshared_tables(). From hugetlb code, we call huge_pmd_unshare_flush(), where we make sure that the expected lock protecting us from concurrent unsharing+reuse is still held. Check with a VM_WARN_ON_ONCE() in tlb_finish_mmu() that tlb_flush_unshared_tables() was properly called earlier. Document it all properly.

Notes about tlb_remove_table_sync_one() interaction with unsharing -- there are two fairly tricky things:

(1) tlb_remove_table_sync_one() is a NOP on architectures without CONFIG_MMU_GATHER_RCU_TABLE_FREE. Here, the assumption is that the previous TLB flush would send an IPI to all relevant CPUs. Careful: some architectures like x86 only send IPIs to all relevant CPUs when tlb->freed_tables is set. The relevant architectures should be selecting MMU_GATHER_RCU_TABLE_FREE, but x86 might not do that in stable kernels, and it might have been problematic before this patch. Also, the arch flushing behavior (independent of IPIs) is different when tlb->freed_tables is set. Do we have to enlighten them to also take care of tlb->unshared_tables? So far we didn't care, so hopefully we are fine. Of course, we could be setting tlb->freed_tables as well, but that might then unnecessarily flush too much, because the semantics of tlb->freed_tables are a bit fuzzy. This patch changes nothing in this regard.

(2) tlb_remove_table_sync_one() is not a NOP on architectures with CONFIG_MMU_GATHER_RCU_TABLE_FREE that actually don't need a sync.
Take x86 as an example: in the common case (!pv, !X86_FEATURE_INVLPGB) we still issue IPIs during TLB flushes and don't actually need the second tlb_remove_table_sync_one(). This optimization can be implemented on top of this patch, by checking, e.g., in tlb_remove_table_sync_one() whether we really need IPIs. But as described in (1), it really must honor tlb->freed_tables then to send IPIs to all relevant CPUs.

Notes on TLB flushing changes:

(1) Flushing for non-shared PMD tables: we're converting from flush_hugetlb_tlb_range() to tlb_remove_huge_tlb_entry(). Given that we properly initialize the MMU gather in tlb_gather_mmu_vma() to be hugetlb aware, similar to __unmap_hugepage_range(), that should be fine.

(2) Flushing for shared PMD tables: we're converting from various things (flush_hugetlb_tlb_range(), tlb_flush_pmd_range(), flush_tlb_range()) to tlb_flush_pmd_range(). tlb_flush_pmd_range() achieves the same that tlb_remove_huge_tlb_entry() would achieve in these scenarios. Note that tlb_remove_huge_tlb_entry() also calls __tlb_remove_tlb_entry(); however, that is only implemented on powerpc, which does not support PMD table sharing. Similar to (1), tlb_gather_mmu_vma() should make sure that TLB flushing keeps working as expected.

Further, note that the ptdesc_pmd_pts_dec() in huge_pmd_share() is not a concern, as we are holding the i_mmap_lock the whole time, preventing concurrent unsharing. That ptdesc_pmd_pts_dec() usage will be removed separately as a cleanup later.

There are plenty more cleanups to be had, but they have to wait until this is fixed.
[david@kernel.org: fix kerneldoc]
Link: https://lkml.kernel.org/r/f223dd74-331c-412d-93fc-69e360a5006c@kernel.org
Link: https://lkml.kernel.org/r/20251223214037.580860-5-david@kernel.org
Fixes: 1013af4 ("mm/hugetlb: fix huge_pmd_unshare() vs GUP-fast race")
Signed-off-by: David Hildenbrand (Red Hat) <david@kernel.org>
Reported-by: "Uschakow, Stanislav" <suschako@amazon.de>
Closes: https://lore.kernel.org/all/4d3878531c76479d9f8ca9789dc6485d@amazon.de/
Tested-by: Laurence Oberman <loberman@redhat.com>
Acked-by: Harry Yoo <harry.yoo@oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Lance Yang <lance.yang@linux.dev>
Cc: Liu Shixin <liushixin2@huawei.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Rik van Riel <riel@surriel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
1 parent baa198a commit da06bb0

6 files changed

Lines changed: 208 additions & 66 deletions


include/asm-generic/tlb.h

Lines changed: 75 additions & 2 deletions
@@ -46,7 +46,8 @@
  *
  * The mmu_gather API consists of:
  *
- *  - tlb_gather_mmu() / tlb_gather_mmu_fullmm() / tlb_finish_mmu()
+ *  - tlb_gather_mmu() / tlb_gather_mmu_fullmm() / tlb_gather_mmu_vma() /
+ *    tlb_finish_mmu()
  *
  *    start and finish a mmu_gather
  *
@@ -344,6 +345,20 @@ struct mmu_gather {
 	unsigned int vma_huge : 1;
 	unsigned int vma_pfn : 1;
 
+	/*
+	 * Did we unshare (unmap) any shared page tables? For now only
+	 * used for hugetlb PMD table sharing.
+	 */
+	unsigned int unshared_tables : 1;
+
+	/*
+	 * Did we unshare any page tables such that they are now exclusive
+	 * and could get reused+modified by the new owner? When setting this
+	 * flag, "unshared_tables" will be set as well. For now only used
+	 * for hugetlb PMD table sharing.
+	 */
+	unsigned int fully_unshared_tables : 1;
+
 	unsigned int batch_count;
 
 #ifndef CONFIG_MMU_GATHER_NO_GATHER
@@ -380,6 +395,7 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb)
 	tlb->cleared_pmds = 0;
 	tlb->cleared_puds = 0;
 	tlb->cleared_p4ds = 0;
+	tlb->unshared_tables = 0;
 	/*
 	 * Do not reset mmu_gather::vma_* fields here, we do not
 	 * call into tlb_start_vma() again to set them if there is an
@@ -459,7 +475,7 @@ static inline void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
 	 * these bits.
 	 */
 	if (!(tlb->freed_tables || tlb->cleared_ptes || tlb->cleared_pmds ||
-	      tlb->cleared_puds || tlb->cleared_p4ds))
+	      tlb->cleared_puds || tlb->cleared_p4ds || tlb->unshared_tables))
 		return;
 
 	tlb_flush(tlb);
@@ -748,6 +764,63 @@ static inline bool huge_pmd_needs_flush(pmd_t oldpmd, pmd_t newpmd)
 }
 #endif
 
+#ifdef CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING
+static inline void tlb_unshare_pmd_ptdesc(struct mmu_gather *tlb, struct ptdesc *pt,
+		unsigned long addr)
+{
+	/*
+	 * The caller must make sure that concurrent unsharing + exclusive
+	 * reuse is impossible until tlb_flush_unshared_tables() was called.
+	 */
+	VM_WARN_ON_ONCE(!ptdesc_pmd_is_shared(pt));
+	ptdesc_pmd_pts_dec(pt);
+
+	/* Clearing a PUD pointing at a PMD table with PMD leaves. */
+	tlb_flush_pmd_range(tlb, addr & PUD_MASK, PUD_SIZE);
+
+	/*
+	 * If the page table is now exclusively owned, we fully unshared
+	 * a page table.
+	 */
+	if (!ptdesc_pmd_is_shared(pt))
+		tlb->fully_unshared_tables = true;
+	tlb->unshared_tables = true;
+}
+
+static inline void tlb_flush_unshared_tables(struct mmu_gather *tlb)
+{
+	/*
+	 * As soon as the caller drops locks to allow for reuse of
+	 * previously-shared tables, these tables could get modified and
+	 * even reused outside of hugetlb context, so we have to make sure that
+	 * any page table walkers (incl. TLB, GUP-fast) are aware of that
+	 * change.
+	 *
+	 * Even if we are not fully unsharing a PMD table, we must
+	 * flush the TLB for the unsharer now.
+	 */
+	if (tlb->unshared_tables)
+		tlb_flush_mmu_tlbonly(tlb);
+
+	/*
+	 * Similarly, we must make sure that concurrent GUP-fast will not
+	 * walk previously-shared page tables that are getting modified+reused
+	 * elsewhere. So broadcast an IPI to wait for any concurrent GUP-fast.
+	 *
+	 * We only perform this when we are the last sharer of a page table,
+	 * as the IPI will reach all CPUs: any GUP-fast.
+	 *
+	 * Note that on configs where tlb_remove_table_sync_one() is a NOP,
+	 * the expectation is that the tlb_flush_mmu_tlbonly() would have issued
+	 * required IPIs already for us.
+	 */
+	if (tlb->fully_unshared_tables) {
+		tlb_remove_table_sync_one();
+		tlb->fully_unshared_tables = false;
+	}
+}
+#endif /* CONFIG_HUGETLB_PMD_PAGE_TABLE_SHARING */
+
 #endif /* CONFIG_MMU */
 
 #endif /* _ASM_GENERIC__TLB_H */

include/linux/hugetlb.h

Lines changed: 10 additions & 5 deletions
@@ -235,8 +235,9 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
 pte_t *huge_pte_offset(struct mm_struct *mm,
 		       unsigned long addr, unsigned long sz);
 unsigned long hugetlb_mask_last_page(struct hstate *h);
-int huge_pmd_unshare(struct mm_struct *mm, struct vm_area_struct *vma,
-		unsigned long addr, pte_t *ptep);
+int huge_pmd_unshare(struct mmu_gather *tlb, struct vm_area_struct *vma,
+		unsigned long addr, pte_t *ptep);
+void huge_pmd_unshare_flush(struct mmu_gather *tlb, struct vm_area_struct *vma);
 void adjust_range_if_pmd_sharing_possible(struct vm_area_struct *vma,
 				unsigned long *start, unsigned long *end);
 
@@ -295,13 +296,17 @@ static inline struct address_space *hugetlb_folio_mapping_lock_write(
 	return NULL;
 }
 
-static inline int huge_pmd_unshare(struct mm_struct *mm,
-				   struct vm_area_struct *vma,
-				   unsigned long addr, pte_t *ptep)
+static inline int huge_pmd_unshare(struct mmu_gather *tlb,
+		struct vm_area_struct *vma, unsigned long addr, pte_t *ptep)
 {
 	return 0;
 }
 
+static inline void huge_pmd_unshare_flush(struct mmu_gather *tlb,
+		struct vm_area_struct *vma)
+{
+}
+
 static inline void adjust_range_if_pmd_sharing_possible(
 				struct vm_area_struct *vma,
 				unsigned long *start, unsigned long *end)

include/linux/mm_types.h

Lines changed: 1 addition & 0 deletions
@@ -1262,6 +1262,7 @@ static inline unsigned int mm_cid_size(void)
 struct mmu_gather;
 extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm);
 extern void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm);
+void tlb_gather_mmu_vma(struct mmu_gather *tlb, struct vm_area_struct *vma);
 extern void tlb_finish_mmu(struct mmu_gather *tlb);
 
 struct vm_fault;
