Commit a76f5ca

bfaccini authored and gregkh committed
mm/fake-numa: allow later numa node hotplug
[ Upstream commit 63db817 ]

The current fake-numa implementation prevents new NUMA nodes from being
hot-plugged later by drivers. A common symptom of this limitation is the
"node <X> was absent from the node_possible_map" message, with its
associated warning in mm/memory_hotplug.c: add_memory_resource().

This comes from the lack of remapping in both the pxm_to_node_map[] and
node_to_pxm_map[] tables to take fake-numa nodes into account, which
triggers collisions with the original, physical-nodes-only mapping that
had been determined from the BIOS tables.

This patch fixes this by doing the necessary node-id translation in both
the pxm_to_node_map[] and node_to_pxm_map[] tables. The node_distance[]
table has also been fixed accordingly.

Details:

When trying to use the fake-numa feature on our system, where new NUMA
nodes are "hot-plugged" upon driver load, this fails with the following
type of message and warning with stack:

node 8 was absent from the node_possible_map
WARNING: CPU: 61 PID: 4259 at mm/memory_hotplug.c:1506 add_memory_resource+0x3dc/0x418

This issue prevents the use of the fake-NUMA debug feature with the
system's full configuration, even though the feature has sometimes proven
extremely useful for performance testing of multi-tasked, memory-bound
applications, as it enables better isolation of processes/ranks compared
to fat NUMA nodes.

Usual numactl output after driver has “hot-plugged”/unveiled some
new Numa nodes with and without memory :

$ numactl --hardware
available: 9 nodes (0-8)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
node 0 size: 490037 MB
node 0 free: 484432 MB
node 1 cpus:
node 1 size: 97280 MB
node 1 free: 97279 MB
node 2 cpus:
node 2 size: 0 MB
node 2 free: 0 MB
node 3 cpus:
node 3 size: 0 MB
node 3 free: 0 MB
node 4 cpus:
node 4 size: 0 MB
node 4 free: 0 MB
node 5 cpus:
node 5 size: 0 MB
node 5 free: 0 MB
node 6 cpus:
node 6 size: 0 MB
node 6 free: 0 MB
node 7 cpus:
node 7 size: 0 MB
node 7 free: 0 MB
node 8 cpus:
node 8 size: 0 MB
node 8 free: 0 MB
node distances:
node   0   1   2   3   4   5   6   7   8
  0:  10  80  80  80  80  80  80  80  80
  1:  80  10 255 255 255 255 255 255 255
  2:  80 255  10 255 255 255 255 255 255
  3:  80 255 255  10 255 255 255 255 255
  4:  80 255 255 255  10 255 255 255 255
  5:  80 255 255 255 255  10 255 255 255
  6:  80 255 255 255 255 255  10 255 255
  7:  80 255 255 255 255 255 255  10 255
  8:  80 255 255 255 255 255 255 255  10

With recent M.Rapoport set of fake-numa patches in mm-everything
and using numa=fake=4 boot parameter :

$ numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
node 0 size: 122518 MB
node 0 free: 117141 MB
node 1 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
node 1 size: 219911 MB
node 1 free: 219751 MB
node 2 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
node 2 size: 122599 MB
node 2 free: 122541 MB
node 3 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
node 3 size: 122479 MB
node 3 free: 122408 MB
node distances:
node   0   1   2   3
  0:  10  10  10  10
  1:  10  10  10  10
  2:  10  10  10  10
  3:  10  10  10  10

With recent M.Rapoport set of fake-numa patches in mm-everything,
this patch on top, using numa=fake=4 boot parameter :

# numactl --hardware
available: 12 nodes (0-11)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
node 0 size: 122518 MB
node 0 free: 116429 MB
node 1 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
node 1 size: 122631 MB
node 1 free: 122576 MB
node 2 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
node 2 size: 122599 MB
node 2 free: 122544 MB
node 3 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
node 3 size: 122479 MB
node 3 free: 122419 MB
node 4 cpus:
node 4 size: 97280 MB
node 4 free: 97279 MB
node 5 cpus:
node 5 size: 0 MB
node 5 free: 0 MB
node 6 cpus:
node 6 size: 0 MB
node 6 free: 0 MB
node 7 cpus:
node 7 size: 0 MB
node 7 free: 0 MB
node 8 cpus:
node 8 size: 0 MB
node 8 free: 0 MB
node 9 cpus:
node 9 size: 0 MB
node 9 free: 0 MB
node 10 cpus:
node 10 size: 0 MB
node 10 free: 0 MB
node 11 cpus:
node 11 size: 0 MB
node 11 free: 0 MB
node distances:
node   0   1   2   3   4   5   6   7   8   9  10  11
  0:  10  10  10  10  80  80  80  80  80  80  80  80
  1:  10  10  10  10  80  80  80  80  80  80  80  80
  2:  10  10  10  10  80  80  80  80  80  80  80  80
  3:  10  10  10  10  80  80  80  80  80  80  80  80
  4:  80  80  80  80  10 255 255 255 255 255 255 255
  5:  80  80  80  80 255  10 255 255 255 255 255 255
  6:  80  80  80  80 255 255  10 255 255 255 255 255
  7:  80  80  80  80 255 255 255  10 255 255 255 255
  8:  80  80  80  80 255 255 255 255  10 255 255 255
  9:  80  80  80  80 255 255 255 255 255  10 255 255
 10:  80  80  80  80 255 255 255 255 255 255  10 255
 11:  80  80  80  80 255 255 255 255 255 255 255  10

Link: https://lkml.kernel.org/r/20250106120659.359610-2-bfaccini@nvidia.com
Signed-off-by: Bruno Faccini <bfaccini@nvidia.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Mike Rapoport (Microsoft) <rppt@kernel.org>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Stable-dep-of: f46c26f ("mm: numa,memblock: include <asm/numa.h> for 'numa_nodes_parsed'")
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
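
The symptom described in the message can be pictured with a small userspace sketch. It assumes, as a simplification, that the set of possible nodes is derived from the nodes parsed at boot, so that with numa=fake=4 only nodes 0-3 are possible while a stale PXM lookup still yields node 8. The `demo_` names and the 32-bit mask are hypothetical stand-ins, not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a nodemask: bit n set means node n is possible. */
typedef uint32_t demo_nodemask_t;

/* Build a mask with nodes first..last (inclusive) marked possible. */
demo_nodemask_t demo_mask_range(int first, int last)
{
	demo_nodemask_t mask = 0;

	assert(first >= 0 && last < 32 && first <= last);
	for (int n = first; n <= last; n++)
		mask |= (demo_nodemask_t)1 << n;
	return mask;
}

/* A hot-added node id is only acceptable if it is in the possible mask. */
bool demo_node_possible(demo_nodemask_t possible, int nid)
{
	return nid >= 0 && nid < 32 && ((possible >> nid) & 1u);
}
```

With a mask built by `demo_mask_range(0, 3)`, node 8 is rejected, which corresponds to the reported warning. With `demo_mask_range(0, 11)` and the driver's node renumbered to 11, the check passes, matching the 12-node output shown with the patch applied.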
1 parent d1beb4d commit a76f5ca

5 files changed

Lines changed: 133 additions & 8 deletions

drivers/acpi/numa/srat.c

Lines changed: 86 additions & 0 deletions
@@ -81,6 +81,92 @@ int acpi_map_pxm_to_node(int pxm)
 }
 EXPORT_SYMBOL(acpi_map_pxm_to_node);
 
+#ifdef CONFIG_NUMA_EMU
+/*
+ * Take max_nid - 1 fake-numa nodes into account in both
+ * pxm_to_node_map()/node_to_pxm_map[] tables.
+ */
+int __init fix_pxm_node_maps(int max_nid)
+{
+	static int pxm_to_node_map_copy[MAX_PXM_DOMAINS] __initdata
+			= { [0 ... MAX_PXM_DOMAINS - 1] = NUMA_NO_NODE };
+	static int node_to_pxm_map_copy[MAX_NUMNODES] __initdata
+			= { [0 ... MAX_NUMNODES - 1] = PXM_INVAL };
+	int i, j, index = -1, count = 0;
+	nodemask_t nodes_to_enable;
+
+	if (numa_off || srat_disabled())
+		return -1;
+
+	/* find fake nodes PXM mapping */
+	for (i = 0; i < MAX_NUMNODES; i++) {
+		if (node_to_pxm_map[i] != PXM_INVAL) {
+			for (j = 0; j <= max_nid; j++) {
+				if ((emu_nid_to_phys[j] == i) &&
+				    WARN(node_to_pxm_map_copy[j] != PXM_INVAL,
+					 "Node %d is already binded to PXM %d\n",
+					 j, node_to_pxm_map_copy[j]))
+					return -1;
+				if (emu_nid_to_phys[j] == i) {
+					node_to_pxm_map_copy[j] =
+						node_to_pxm_map[i];
+					if (j > index)
+						index = j;
+					count++;
+				}
+			}
+		}
+	}
+	if (WARN(index != max_nid, "%d max nid when expected %d\n",
+		 index, max_nid))
+		return -1;
+
+	nodes_clear(nodes_to_enable);
+
+	/* map phys nodes not used for fake nodes */
+	for (i = 0; i < MAX_NUMNODES; i++) {
+		if (node_to_pxm_map[i] != PXM_INVAL) {
+			for (j = 0; j <= max_nid; j++)
+				if (emu_nid_to_phys[j] == i)
+					break;
+			/* fake nodes PXM mapping has been done */
+			if (j <= max_nid)
+				continue;
+			/* find first hole */
+			for (j = 0;
+			     j < MAX_NUMNODES &&
+				 node_to_pxm_map_copy[j] != PXM_INVAL;
+			     j++)
+			;
+			if (WARN(j == MAX_NUMNODES,
+			    "Number of nodes exceeds MAX_NUMNODES\n"))
+				return -1;
+			node_to_pxm_map_copy[j] = node_to_pxm_map[i];
+			node_set(j, nodes_to_enable);
+			count++;
+		}
+	}
+
+	/* creating reverse mapping in pxm_to_node_map[] */
+	for (i = 0; i < MAX_NUMNODES; i++)
+		if (node_to_pxm_map_copy[i] != PXM_INVAL &&
+		    pxm_to_node_map_copy[node_to_pxm_map_copy[i]] == NUMA_NO_NODE)
+			pxm_to_node_map_copy[node_to_pxm_map_copy[i]] = i;
+
+	/* overwrite with new mapping */
+	for (i = 0; i < MAX_NUMNODES; i++) {
+		node_to_pxm_map[i] = node_to_pxm_map_copy[i];
+		pxm_to_node_map[i] = pxm_to_node_map_copy[i];
+	}
+
+	/* enable other nodes found in PXM for hotplug */
+	nodes_or(numa_nodes_parsed, nodes_to_enable, numa_nodes_parsed);
+
+	pr_debug("found %d total number of nodes\n", count);
+	return 0;
+}
+#endif
+
 static void __init
 acpi_table_print_srat_entry(struct acpi_subtable_header *header)
 {
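
The remapping performed by fix_pxm_node_maps() can be tried outside the kernel with a simplified userspace sketch. It is not the kernel code: the `demo_` names, the 16-entry table size, and the assumption that PXM values fit in the same small range are all hypothetical, and the first step iterates over fake nodes directly instead of scanning physical nodes as the patch does.

```c
#include <assert.h>

#define DEMO_MAX_NODES 16	/* hypothetical stand-in for MAX_NUMNODES */
#define DEMO_PXM_INVAL (-1)	/* stand-in for PXM_INVAL */
#define DEMO_NO_NODE   (-1)	/* stand-in for NUMA_NO_NODE */

static int demo_pxm_ok(int pxm)
{
	return pxm >= 0 && pxm < DEMO_MAX_NODES;
}

/*
 * Build new node<->PXM tables for fake nodes 0..max_nid.
 * Returns the number of nodes mapped, or -1 on inconsistent input.
 */
int demo_fix_maps(const int phys_node_to_pxm[DEMO_MAX_NODES],
		  const int emu_to_phys[DEMO_MAX_NODES], int max_nid,
		  int new_node_to_pxm[DEMO_MAX_NODES],
		  int new_pxm_to_node[DEMO_MAX_NODES])
{
	int i, j, count = 0;

	assert(max_nid >= 0 && max_nid < DEMO_MAX_NODES);
	for (i = 0; i < DEMO_MAX_NODES; i++) {
		new_node_to_pxm[i] = DEMO_PXM_INVAL;
		new_pxm_to_node[i] = DEMO_NO_NODE;
	}

	/* Step 1: each fake node inherits the PXM of its physical node. */
	for (j = 0; j <= max_nid; j++) {
		int phys = emu_to_phys[j];

		if (phys < 0 || phys >= DEMO_MAX_NODES ||
		    !demo_pxm_ok(phys_node_to_pxm[phys]))
			return -1;
		new_node_to_pxm[j] = phys_node_to_pxm[phys];
		count++;
	}

	/* Step 2: physical nodes backing no fake node take the first hole. */
	for (i = 0; i < DEMO_MAX_NODES; i++) {
		int used = 0;

		if (phys_node_to_pxm[i] == DEMO_PXM_INVAL)
			continue;
		if (!demo_pxm_ok(phys_node_to_pxm[i]))
			return -1;
		for (j = 0; j <= max_nid; j++)
			if (emu_to_phys[j] == i)
				used = 1;
		if (used)
			continue;
		for (j = 0; j < DEMO_MAX_NODES &&
			    new_node_to_pxm[j] != DEMO_PXM_INVAL; j++)
			;
		if (j == DEMO_MAX_NODES)
			return -1;
		new_node_to_pxm[j] = phys_node_to_pxm[i];
		count++;
	}

	/* Step 3: reverse map; the lowest node id claiming a PXM wins. */
	for (i = 0; i < DEMO_MAX_NODES; i++) {
		int pxm = new_node_to_pxm[i];

		if (pxm != DEMO_PXM_INVAL && new_pxm_to_node[pxm] == DEMO_NO_NODE)
			new_pxm_to_node[pxm] = i;
	}
	return count;
}
```

Feeding it nine physical nodes (with an assumed identity node-to-PXM mapping) and four fake nodes carved out of physical node 0 yields twelve nodes, with the former physical nodes 1-8 renumbered to 4-11, which is consistent with the numactl output quoted in the commit message.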

include/acpi/acpi_numa.h

Lines changed: 5 additions & 0 deletions
@@ -17,11 +17,16 @@ extern int node_to_pxm(int);
 extern int acpi_map_pxm_to_node(int);
 extern unsigned char acpi_srat_revision;
 extern void disable_srat(void);
+extern int fix_pxm_node_maps(int max_nid);
 
 extern void bad_srat(void);
 extern int srat_disabled(void);
 
 #else				/* CONFIG_ACPI_NUMA */
+static inline int fix_pxm_node_maps(int max_nid)
+{
+	return 0;
+}
 static inline void disable_srat(void)
 {
 }

include/linux/numa_memblks.h

Lines changed: 3 additions & 0 deletions
@@ -29,7 +29,10 @@ int __init numa_cleanup_meminfo(struct numa_meminfo *mi);
 int __init numa_memblks_init(int (*init_func)(void),
 			     bool memblock_force_top_down);
 
+extern int numa_distance_cnt;
+
 #ifdef CONFIG_NUMA_EMU
+extern int emu_nid_to_phys[MAX_NUMNODES];
 int numa_emu_cmdline(char *str);
 void __init numa_emu_update_cpu_to_node(int *emu_nid_to_phys,
 					unsigned int nr_emu_nids);

mm/numa_emulation.c

Lines changed: 38 additions & 7 deletions
@@ -8,11 +8,12 @@
 #include <linux/memblock.h>
 #include <linux/numa_memblks.h>
 #include <asm/numa.h>
+#include <acpi/acpi_numa.h>
 
 #define FAKE_NODE_MIN_SIZE	((u64)32 << 20)
 #define FAKE_NODE_MIN_HASH_MASK	(~(FAKE_NODE_MIN_SIZE - 1UL))
 
-static int emu_nid_to_phys[MAX_NUMNODES];
+int emu_nid_to_phys[MAX_NUMNODES];
 static char *emu_cmdline __initdata;
 
 int __init numa_emu_cmdline(char *str)
@@ -379,6 +380,7 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
 	size_t phys_size = numa_dist_cnt * numa_dist_cnt * sizeof(phys_dist[0]);
 	int max_emu_nid, dfl_phys_nid;
 	int i, j, ret;
+	nodemask_t physnode_mask = numa_nodes_parsed;
 
 	if (!emu_cmdline)
 		goto no_emu;
@@ -395,7 +397,6 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
 	 * split the system RAM into N fake nodes.
 	 */
 	if (strchr(emu_cmdline, 'U')) {
-		nodemask_t physnode_mask = numa_nodes_parsed;
 		unsigned long n;
 		int nid = 0;
 
@@ -465,20 +466,28 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
 	 */
 	max_emu_nid = setup_emu2phys_nid(&dfl_phys_nid);
 
-	/* commit */
-	*numa_meminfo = ei;
-
 	/* Make sure numa_nodes_parsed only contains emulated nodes */
 	nodes_clear(numa_nodes_parsed);
 	for (i = 0; i < ARRAY_SIZE(ei.blk); i++)
 		if (ei.blk[i].start != ei.blk[i].end &&
 		    ei.blk[i].nid != NUMA_NO_NODE)
 			node_set(ei.blk[i].nid, numa_nodes_parsed);
 
-	numa_emu_update_cpu_to_node(emu_nid_to_phys, ARRAY_SIZE(emu_nid_to_phys));
+	/* fix pxm_to_node_map[] and node_to_pxm_map[] to avoid collision
+	 * with faked numa nodes, particularly during later memory hotplug
+	 * handling, and also update numa_nodes_parsed accordingly.
+	 */
+	ret = fix_pxm_node_maps(max_emu_nid);
+	if (ret < 0)
+		goto no_emu;
+
+	/* commit */
+	*numa_meminfo = ei;
+
+	numa_emu_update_cpu_to_node(emu_nid_to_phys, max_emu_nid + 1);
 
 	/* make sure all emulated nodes are mapped to a physical node */
-	for (i = 0; i < ARRAY_SIZE(emu_nid_to_phys); i++)
+	for (i = 0; i < max_emu_nid + 1; i++)
 		if (emu_nid_to_phys[i] == NUMA_NO_NODE)
 			emu_nid_to_phys[i] = dfl_phys_nid;
 
@@ -501,12 +510,34 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
 			numa_set_distance(i, j, dist);
 		}
 	}
+	for (i = 0; i < numa_distance_cnt; i++) {
+		for (j = 0; j < numa_distance_cnt; j++) {
+			int physi, physj;
+			u8 dist;
+
+			/* distance between fake nodes is already ok */
+			if (emu_nid_to_phys[i] != NUMA_NO_NODE &&
+			    emu_nid_to_phys[j] != NUMA_NO_NODE)
+				continue;
+			if (emu_nid_to_phys[i] != NUMA_NO_NODE)
+				physi = emu_nid_to_phys[i];
+			else
+				physi = i - max_emu_nid;
+			if (emu_nid_to_phys[j] != NUMA_NO_NODE)
+				physj = emu_nid_to_phys[j];
+			else
+				physj = j - max_emu_nid;
+			dist = phys_dist[physi * numa_dist_cnt + physj];
+			numa_set_distance(i, j, dist);
+		}
+	}
 
 	/* free the copied physical distance table */
 	memblock_free(phys_dist, phys_size);
 	return;
 
 no_emu:
+	numa_nodes_parsed = physnode_mask;
 	/* No emulation. Build identity emu_nid_to_phys[] for numa_add_cpu() */
 	for (i = 0; i < ARRAY_SIZE(emu_nid_to_phys); i++)
 		emu_nid_to_phys[i] = i;
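
The distance lookup added in the last hunk can be illustrated in isolation. The following is a minimal userspace sketch, assuming (as the hunk does) that a node with no entry in the fake-to-physical table corresponds to physical node `nid - max_emu_nid`. The `demo_` names are hypothetical and the table is a plain row-major array rather than the kernel's numa_distance storage.

```c
#include <assert.h>

#define DEMO_NO_NODE (-1)	/* stand-in for NUMA_NO_NODE */

/* Resolve a (possibly renumbered) node id to its physical node index. */
static int demo_phys_index(const int *emu_to_phys, int nid, int max_emu_nid)
{
	if (emu_to_phys[nid] != DEMO_NO_NODE)
		return emu_to_phys[nid];	/* fake node: use its backing node */
	return nid - max_emu_nid;		/* hot-pluggable node shifted up */
}

/*
 * Look up the distance between nodes i and j in the copied physical
 * distance table, which holds phys_cnt x phys_cnt entries, row-major.
 */
unsigned char demo_distance(const unsigned char *phys_dist, int phys_cnt,
			    const int *emu_to_phys, int max_emu_nid,
			    int i, int j)
{
	int physi = demo_phys_index(emu_to_phys, i, max_emu_nid);
	int physj = demo_phys_index(emu_to_phys, j, max_emu_nid);

	assert(physi >= 0 && physi < phys_cnt);
	assert(physj >= 0 && physj < phys_cnt);
	return phys_dist[physi * phys_cnt + physj];
}
```

For example, with three physical nodes and four fake nodes split from physical node 0 (so `max_emu_nid` is 3), nodes 4 and 5 resolve to physical nodes 1 and 2, and a fake node paired with node 4 gets the distance between physical nodes 0 and 1. Note that the `nid - max_emu_nid` shift only lines up when all fake nodes come from a single physical node, as in the commit's example.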

mm/numa_memblks.c

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@
 #include <linux/numa.h>
 #include <linux/numa_memblks.h>
 
-static int numa_distance_cnt;
+int numa_distance_cnt;
 static u8 *numa_distance;
 
 nodemask_t numa_nodes_parsed __initdata;
