[Bug]: dcgm-exporter appears to stall after "Initializing system entities of type 'CPU Core'" on H200 with dense MIG topology #2482

@anoopsinghnegi

Description

Environment

  • GPU Operator version: v26.3.0
  • dcgm-exporter image: 4.5.1-4.8.0-distroless
  • GPU: 8 x NVIDIA H200 NVL
  • MIG enabled: true
  • OS: Ubuntu 24.04.3
  • Container runtime: containerd 2.1.5

MIG configuration

Observed behavior depends on the MIG profile:

  • 3g.71gb → exporter starts successfully
  • 2g.35gb → exporter startup stalls
  • 1g.18gb → exporter startup stalls
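
For context, the three profiles imply very different MIG instance counts per node. The sketch below is rough arithmetic only. It assumes the usual maximum instance counts per GPU for these profile sizes (7 for 1g, 3 for 2g, 2 for 3g; taken from NVIDIA's general MIG profile tables, not measured on this node) and that all 8 GPUs are fully partitioned with a single profile.

```python
# Assumed maximum MIG instances per GPU for each profile (not measured on this node).
MAX_INSTANCES_PER_GPU = {"3g.71gb": 2, "2g.35gb": 3, "1g.18gb": 7}
GPUS_PER_NODE = 8  # from the Environment section: 8 x NVIDIA H200 NVL


def mig_instances_per_node(profile: str, gpus: int = GPUS_PER_NODE) -> int:
    """Total MIG instances on the node if every GPU is fully split into `profile`."""
    return MAX_INSTANCES_PER_GPU[profile] * gpus


for profile in MAX_INSTANCES_PER_GPU:
    print(f"{profile}: {mig_instances_per_node(profile)} MIG instances per node")
```

Under these assumptions, the profile that works has the fewest instances on the node, and the two that stall have more.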

Problem

nvidia-dcgm-exporter starts and initializes DCGM successfully, but with dense MIG profiles (1g.18gb, 2g.35gb) it then appears to stall during startup: it never begins serving metrics and never becomes ready.

Relevant logs:

time=2026-05-26T12:08:45.313Z level=INFO msg="Starting dcgm-exporter" Version=4.5.1-4.8.0
time=2026-05-26T12:08:45.317Z level=INFO msg="Attempting to initialize DCGM."
time=2026-05-26T12:08:50.342Z level=INFO msg="Initialized DCGM Fields module."
time=2026-05-26T12:08:50.342Z level=INFO msg="Attempting to initialize NVML library."
time=2026-05-26T12:08:50.342Z level=INFO msg="NVML provider successfully initialized for Kubernetes MIG support"
time=2026-05-26T12:08:50.342Z level=INFO msg="DCGM successfully initialized!"
time=2026-05-26T12:08:51.107Z level=INFO msg="Successfully queried DCGM profiling metric groups" reload_id=0 count=2 gpu_model="NVIDIA H200 NVL"
time=2026-05-26T12:08:51.107Z level=INFO msg="Building registry for current GPU topology"
time=2026-05-26T12:08:51.107Z level=INFO msg="Falling back to metric file '/etc/dcgm-exporter/dcgm-metrics.csv'"
time=2026-05-26T12:08:51.108Z level=INFO msg="Initializing system entities of type 'GPU'"
time=2026-05-26T12:08:51.616Z level=INFO msg="Initializing system entities of type 'NvSwitch'"
time=2026-05-26T12:08:51.616Z level=INFO msg="Not collecting NvSwitch metrics; no switches to monitor"
time=2026-05-26T12:08:51.616Z level=INFO msg="Initializing system entities of type 'NvLink'"
time=2026-05-26T12:08:51.616Z level=WARN msg="Failed to initialize NvSwitch/NvLink info" error="no switches to monitor"
time=2026-05-26T12:08:52.017Z level=INFO msg="Initializing system entities of type 'CPU'"
time=2026-05-26T12:08:52.055Z level=INFO msg="Not collecting CPU metrics; error retrieving DCGM CPU hierarchy: This request is serviced by a module of DCGM that is not currently loaded"
time=2026-05-26T12:08:52.055Z level=INFO msg="Initializing system entities of type 'CPU Core'"
time=2026-05-26T12:08:52.056Z level=INFO msg="Not collecting CPU Core metrics; error retrieving DCGM CPU hierarchy: This request is serviced by a module of DCGM that is not currently loaded"

No additional startup logs are emitted after this point (for example, no indication that the metrics server is ready).
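
To compare runs across MIG profiles, it may help to extract the last message logged and the largest gap between consecutive lines. The sketch below is a minimal parser for the `time=... level=... msg="..."` format seen above. The helper names (`parse_line`, `summarize`) are hypothetical and not part of dcgm-exporter, and the sample is a subset of the lines in this report.

```python
import re
from datetime import datetime

# Matches the time=, level= and msg="..." fields at the start of each log line.
LINE_RE = re.compile(r'^time=(\S+) level=(\w+) msg="((?:[^"\\]|\\.)*)"')


def parse_line(line):
    """Return (timestamp, level, message), or None for lines that do not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    timestamp = datetime.strptime(match.group(1), "%Y-%m-%dT%H:%M:%S.%fZ")
    return timestamp, match.group(2), match.group(3)


def summarize(lines):
    """Report the last message and the largest gap between consecutive log lines."""
    events = [e for e in (parse_line(l) for l in lines) if e is not None]
    if not events:
        return None
    largest_gap, before_gap = 0.0, None
    for prev, cur in zip(events, events[1:]):
        gap = (cur[0] - prev[0]).total_seconds()
        if gap > largest_gap:
            largest_gap, before_gap = gap, prev[2]
    return {
        "last_message": events[-1][2],
        "largest_gap_s": largest_gap,
        "message_before_gap": before_gap,
    }


SAMPLE = """\
time=2026-05-26T12:08:45.317Z level=INFO msg="Attempting to initialize DCGM."
time=2026-05-26T12:08:50.342Z level=INFO msg="Initialized DCGM Fields module."
time=2026-05-26T12:08:52.055Z level=INFO msg="Initializing system entities of type 'CPU Core'"
time=2026-05-26T12:08:52.056Z level=INFO msg="Not collecting CPU Core metrics; error retrieving DCGM CPU hierarchy: This request is serviced by a module of DCGM that is not currently loaded"
""".splitlines()

print(summarize(SAMPLE))
```

Running the same summary on logs from a 3g.71gb node would show which message follows the "CPU Core" line when startup succeeds.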

Workaround observed

Changing:

DCGM_EXPORTER_INTERVAL=2000

to:

DCGM_EXPORTER_INTERVAL=50000

allows the exporter to start successfully and serve metrics.

This suggests startup/initialization behavior may be affected by the default polling interval when many MIG instances are present.
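
To put the two settings in perspective, the arithmetic below compares how often a collection cycle would run at each interval. It assumes DCGM_EXPORTER_INTERVAL is expressed in milliseconds and that one cycle runs per interval. It says nothing about the root cause, only about the relative polling rate.

```python
def cycles_per_minute(interval_ms: int) -> float:
    """Collection cycles per minute, assuming one cycle per interval (in ms)."""
    return 60_000 / interval_ms


stalling_ms = 2000    # value with which startup stalls
working_ms = 50000    # value with which startup succeeds

print(cycles_per_minute(stalling_ms))   # 30.0
print(cycles_per_minute(working_ms))    # 1.2
print(working_ms / stalling_ms)         # 25.0
```

Under that assumption, the workaround polls 25 times less often than the setting that stalls.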

Expected behavior

dcgm-exporter should complete initialization and begin serving metrics regardless of MIG partition density, without requiring a large DCGM_EXPORTER_INTERVAL.

Observations

  • DCGM initialization succeeds.
  • NVML initialization succeeds.
  • Profiling metric group discovery succeeds.
  • Issue appears only when the number of MIG instances is high (1g / 2g profiles).
  • This suggests a possible issue during registry/entity initialization for dense MIG topologies.

Request

Please investigate whether dcgm-exporter 4.5.1 has a startup/initialization issue with dense MIG configurations on 8 x H200 NVL, especially with 1g.18gb / 2g.35gb profiles.

Also, please let us know if any additional logs/debug flags should be collected to help further diagnose this issue.
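
In the meantime, one way to tell "stalled" apart from "serving but silent in the logs" is to poll the metrics endpoint directly. The sketch below is a hypothetical helper, not part of dcgm-exporter. It assumes the exporter's HTTP endpoint is reachable (for example through a port-forward) and that a healthy response contains metric names starting with `DCGM_FI_`.

```python
import time
import urllib.request


def fetch_metrics(url, timeout=5.0):
    """Fetch the raw metrics page as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def wait_until_serving(url, deadline_s=120.0, poll_s=5.0, fetch=fetch_metrics):
    """Poll `url` until DCGM metrics appear.

    Returns the elapsed seconds on success, or None if `deadline_s` passes first.
    """
    start = time.monotonic()
    while True:
        try:
            if "DCGM_FI_" in fetch(url):
                return time.monotonic() - start
        except OSError:
            # Connection refused, timeout, or HTTP error: the endpoint is not ready yet.
            pass
        if time.monotonic() - start >= deadline_s:
            return None
        time.sleep(poll_s)
```

For example, `wait_until_serving("http://localhost:9400/metrics")` would apply if the exporter listens on port 9400 and that port is forwarded locally; adjust the URL to match the deployment. A `None` result at DCGM_EXPORTER_INTERVAL=2000 and a numeric result at 50000 would confirm the behavior described above independently of the logs.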
