Environment
- GPU Operator version: v26.3.0
- dcgm-exporter image: 4.5.1-4.8.0-distroless
- GPU: 8 x NVIDIA H200 NVL
- MIG enabled: true
- OS type: Ubuntu 24.04.3
- containerd: 2.1.5
MIG configuration
Observed behavior depends on the MIG profile applied to the GPUs:
3g.71gb → exporter starts successfully
2g.35gb → exporter startup stalls
1g.18gb → exporter startup stalls
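To make "dense" concrete, the sketch below estimates how many MIG instances the exporter would have to enumerate per profile. It assumes every GPU is fully partitioned with a single profile and that the per-GPU maxima are 2, 3, and 7 instances for 3g.71gb, 2g.35gb, and 1g.18gb respectively; neither assumption is stated in this report, so the real layout on the node may differ.

```python
# Rough estimate of node-wide MIG instance counts per profile.
# ASSUMPTION: per-GPU maxima below are not taken from this report;
# verify against the output of `nvidia-smi mig -lgip` on the node.
MAX_INSTANCES_PER_GPU = {
    "3g.71gb": 2,
    "2g.35gb": 3,
    "1g.18gb": 7,
}

GPU_COUNT = 8  # from the Environment section


def total_mig_instances(profile: str, gpu_count: int = GPU_COUNT) -> int:
    """Return the node-wide instance count when every GPU uses `profile`."""
    return MAX_INSTANCES_PER_GPU[profile] * gpu_count


if __name__ == "__main__":
    for profile in MAX_INSTANCES_PER_GPU:
        print(f"{profile}: {total_mig_instances(profile)} instances")
```

Under these assumptions the working profile yields 16 instances, while the two stalling profiles yield 24 and 56.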
Problem
nvidia-dcgm-exporter starts and initializes DCGM successfully, but with dense MIG profiles (1g.18gb, 2g.35gb) it stalls during startup: it never begins serving metrics and never becomes ready.
Relevant logs:
time=2026-05-26T12:08:45.313Z level=INFO msg="Starting dcgm-exporter" Version=4.5.1-4.8.0
time=2026-05-26T12:08:45.317Z level=INFO msg="Attempting to initialize DCGM."
time=2026-05-26T12:08:50.342Z level=INFO msg="Initialized DCGM Fields module."
time=2026-05-26T12:08:50.342Z level=INFO msg="Attempting to initialize NVML library."
time=2026-05-26T12:08:50.342Z level=INFO msg="NVML provider successfully initialized for Kubernetes MIG support"
time=2026-05-26T12:08:50.342Z level=INFO msg="DCGM successfully initialized!"
time=2026-05-26T12:08:51.107Z level=INFO msg="Successfully queried DCGM profiling metric groups" reload_id=0 count=2 gpu_model="NVIDIA H200 NVL"
time=2026-05-26T12:08:51.107Z level=INFO msg="Building registry for current GPU topology"
time=2026-05-26T12:08:51.107Z level=INFO msg="Falling back to metric file '/etc/dcgm-exporter/dcgm-metrics.csv'"
time=2026-05-26T12:08:51.108Z level=INFO msg="Initializing system entities of type 'GPU'"
time=2026-05-26T12:08:51.616Z level=INFO msg="Initializing system entities of type 'NvSwitch'"
time=2026-05-26T12:08:51.616Z level=INFO msg="Not collecting NvSwitch metrics; no switches to monitor"
time=2026-05-26T12:08:51.616Z level=INFO msg="Initializing system entities of type 'NvLink'"
time=2026-05-26T12:08:51.616Z level=WARN msg="Failed to initialize NvSwitch/NvLink info" error="no switches to monitor"
time=2026-05-26T12:08:52.017Z level=INFO msg="Initializing system entities of type 'CPU'"
time=2026-05-26T12:08:52.055Z level=INFO msg="Not collecting CPU metrics; error retrieving DCGM CPU hierarchy: This request is serviced by a module of DCGM that is not currently loaded"
time=2026-05-26T12:08:52.055Z level=INFO msg="Initializing system entities of type 'CPU Core'"
time=2026-05-26T12:08:52.056Z level=INFO msg="Not collecting CPU Core metrics; error retrieving DCGM CPU hierarchy: This request is serviced by a module of DCGM that is not currently loaded"
No additional startup logs are emitted after this point (for example, no indication that the metrics server is ready).
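To compare a stalled run with a working one, a small parser like the following can report the last stage reached and the time spent between log lines. This is a minimal sketch that assumes the `time=... level=... msg="..."` layout shown above; the function names are hypothetical and not part of dcgm-exporter.

```python
import re
from datetime import datetime

# Matches the key=value prefix of the log lines shown in this report.
LINE_RE = re.compile(
    r'time=(?P<time>\S+) level=(?P<level>\w+) msg="(?P<msg>(?:[^"\\]|\\.)*)"'
)


def parse_line(line):
    """Return (timestamp, level, msg) for a log line, or None if it does not match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    # Python 3.7+ accepts a trailing "Z" for %z.
    ts = datetime.strptime(match.group("time"), "%Y-%m-%dT%H:%M:%S.%f%z")
    return ts, match.group("level"), match.group("msg")


def stage_gaps(lines):
    """Yield (seconds_since_previous_line, msg) for each parsed line after the first."""
    prev = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        ts, _, msg = parsed
        if prev is not None:
            yield (ts - prev).total_seconds(), msg
        prev = ts


def last_stage(lines):
    """Return the last parsed (timestamp, level, msg), or None if nothing parsed."""
    last = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None:
            last = parsed
    return last
```

Feeding it the log above would identify the "Not collecting CPU Core metrics" line as the last stage reached before the stall.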
Workaround observed
Changing:
DCGM_EXPORTER_INTERVAL=2000
to:
DCGM_EXPORTER_INTERVAL=50000
allows the exporter to start successfully and serve metrics.
This suggests startup/initialization behavior may be affected by the default polling interval when many MIG instances are present.
Expected behavior
dcgm-exporter should complete initialization and begin serving metrics regardless of MIG partition density, without requiring a large DCGM_EXPORTER_INTERVAL.
Observations
- DCGM initialization succeeds.
- NVML initialization succeeds.
- Profiling metric group discovery succeeds.
- Issue appears only when the number of MIG instances is high (1g / 2g profiles).
- This suggests a possible issue during registry/entity initialization for dense MIG topologies.
Request
Please investigate whether dcgm-exporter 4.5.1 has a startup/initialization issue with dense MIG configurations on 8 x H200 NVL, especially with 1g.18gb / 2g.35gb profiles.
Also, please let us know if any additional logs/debug flags should be collected to help further diagnose this issue.
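In the meantime, one way to quantify the stall for different DCGM_EXPORTER_INTERVAL values is to measure how long the metrics endpoint takes to respond after the pod starts. The sketch below polls an HTTP endpoint until it returns 200 or a deadline passes. The URL is an assumption (port 9400 and the /metrics path are the exporter's usual defaults, reached here through a port-forward), and the helper names are hypothetical.

```python
import http.client
import time
import urllib.error
import urllib.request

# ASSUMPTION: exporter reachable locally, e.g. via `kubectl port-forward`.
METRICS_URL = "http://127.0.0.1:9400/metrics"


def fetch_status(url, timeout=5.0):
    """Return the HTTP status for `url`, or 0 if the request fails or errors."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except (urllib.error.URLError, OSError, http.client.HTTPException):
        return 0


def wait_until_ready(url, deadline_s=120.0, poll_s=5.0,
                     fetch=fetch_status, sleep=time.sleep, clock=time.monotonic):
    """Poll `url` until it returns 200; return elapsed seconds, or None on timeout."""
    start = clock()
    while True:
        if fetch(url) == 200:
            return clock() - start
        if clock() - start >= deadline_s:
            return None
        sleep(poll_s)


if __name__ == "__main__":
    elapsed = wait_until_ready(METRICS_URL)
    if elapsed is None:
        print("metrics endpoint did not become ready before the deadline")
    else:
        print(f"metrics endpoint ready after {elapsed:.1f}s")
```

Running this once per interval setting and per MIG profile would show whether readiness is merely delayed or never reached.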