Environment
- GPU Operator Version: 26.3.0
- GPU Model: H100-NVL
- MIG Strategy: single
Description
During MIG repartitioning in the NVIDIA GPU Operator, if the MIG Manager cannot complete the requested MIG partitioning and the node remains stuck in mig.config.state=pending, the nvidia.com/gpu.product label is removed from the node.
One reproducible scenario observed is:
- MIG repartitioning starts
- node reboots during repartitioning
- MIG Manager cannot complete reconciliation after reboot
- mig.config.state remains pending
- the nvidia.com/gpu.product label is removed
This causes issues for external components/controllers that rely on the nvidia.com/gpu.product label to identify the GPU type for reconciliation, scheduling, or inventory purposes.
Current Behavior
When repartitioning is interrupted or fails and the node remains in pending state:
- mig.config.state=pending
- the nvidia.com/gpu.product label is removed entirely
As a result, dependent components lose visibility into the underlying GPU hardware type.
Expected Behavior
Instead of removing the label completely while the node is in the repartitioning/pending state, preserve the GPU product information and append a suffix indicating that repartitioning is still pending.
This would be similar to the invalid partitioning case, where the label is set to nvidia.com/gpu.product=NVIDIA-H200-NVL-INVALID.
Example:
nvidia.com/gpu.product=NVIDIA-H200-NVL-ERROR
This would:
- preserve GPU hardware visibility
- indicate repartitioning is incomplete/pending
- help external controllers continue identifying the GPU type
- avoid reconciliation or scheduling issues caused by label disappearance
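The requested behavior could be sketched roughly as follows. This is only an illustration of the proposal, not GPU Operator code: the function name, the suffix table, and the choice of "-ERROR" for the pending state (taken from the example above) are all assumptions.

```python
# Hypothetical sketch of the proposed labeling rule; not part of the GPU Operator.
# Maps a mig.config.state value to the suffix appended to the product name.
STATE_SUFFIXES = {
    "pending": "-ERROR",  # assumed suffix, taken from the example in this issue
}


def product_label_value(base_product, mig_state):
    """Return the nvidia.com/gpu.product value to publish for a given state.

    The base product name is always kept; a suffix is appended only for
    states listed in STATE_SUFFIXES.
    """
    return base_product + STATE_SUFFIXES.get(mig_state, "")


print(product_label_value("NVIDIA-H200-NVL", "pending"))
```

The point of the sketch is that the base product name is never dropped, so the hardware type stays recoverable from the label in every state.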
Impact
External components depending on nvidia.com/gpu.product cannot reliably determine GPU type once the label is removed, especially during interrupted repartitioning scenarios such as node reboot.
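To illustrate the impact on the consumer side, here is a minimal sketch of how an external controller might derive the GPU type from node labels. The function name and the suffix list are hypothetical; only the label key and the "-INVALID" and "-ERROR" suffixes come from this issue.

```python
# Hypothetical consumer-side helper; names are illustrative only.
PRODUCT_LABEL = "nvidia.com/gpu.product"
KNOWN_SUFFIXES = ("-INVALID", "-ERROR")  # suffixes mentioned in this issue


def gpu_type_from_labels(labels):
    """Return the GPU type from a node's label dict, or None if the label is absent."""
    value = labels.get(PRODUCT_LABEL)
    if value is None:
        # Current behavior while pending: the label is gone, so the type is unknown.
        return None
    for suffix in KNOWN_SUFFIXES:
        if value.endswith(suffix):
            # Proposed behavior: strip the status suffix to recover the type.
            return value[: -len(suffix)]
    return value
```

With the label removed, the helper can only return None; with a suffixed label, the GPU type is still recoverable.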