[Bug]: nvidia.com/gpu.product label removed when MIG repartitioning remains in pending state #2495

@anoopsinghnegi

Environment

  • GPU Operator Version: 26.3.0
  • GPU Model: H100-NVL
  • MIG Strategy: single

Description

During MIG repartitioning with the NVIDIA GPU Operator, if the MIG Manager cannot complete the requested partitioning and the node remains stuck in:

mig.config.state=pending

the nvidia.com/gpu.product label gets removed from the node.

One reproducible scenario observed is:

  • MIG repartitioning starts
  • node reboots during repartitioning
  • MIG Manager cannot complete reconciliation after reboot
  • mig.config.state remains pending
  • nvidia.com/gpu.product label gets removed

This causes issues for external components/controllers that rely on the nvidia.com/gpu.product label to identify the GPU type for reconciliation, scheduling, or inventory purposes.

Current Behavior

When repartitioning is interrupted or fails and the node remains in the pending state:

  • mig.config.state=pending
  • nvidia.com/gpu.product label is removed entirely

As a result, dependent components lose visibility into the underlying GPU hardware type.
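
The condition above can be checked from a node's label map. This is a minimal sketch, assuming the state is published under the full key nvidia.com/mig.config.state (abbreviated as mig.config.state in this report) and that the labels have already been fetched into a plain dictionary. The helper name is_product_label_lost and the sample label maps are hypothetical, not part of GPU Operator.

```python
from typing import Mapping

# Hypothetical helper for illustration; not part of GPU Operator.
STATE_LABEL = "nvidia.com/mig.config.state"  # assumed full key for mig.config.state
PRODUCT_LABEL = "nvidia.com/gpu.product"


def is_product_label_lost(labels: Mapping[str, str]) -> bool:
    """Return True when the node is stuck in pending and has no product label."""
    return labels.get(STATE_LABEL) == "pending" and PRODUCT_LABEL not in labels


# Made-up label maps mirroring the scenario in this report.
stuck_node = {STATE_LABEL: "pending"}
healthy_node = {STATE_LABEL: "success", PRODUCT_LABEL: "NVIDIA-H200-NVL"}

print(is_product_label_lost(stuck_node))    # True
print(is_product_label_lost(healthy_node))  # False
```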

Expected Behavior

Instead of removing the label completely while repartitioning is pending, preserve the GPU product information and append a suffix indicating that the node is still pending repartitioning. This would be similar to the invalid partitioning case, where the label is set to nvidia.com/gpu.product=NVIDIA-H200-NVL-INVALID

Example:

nvidia.com/gpu.product=NVIDIA-H200-NVL-ERROR

This would:

  • preserve GPU hardware visibility
  • indicate repartitioning is incomplete/pending
  • help external controllers continue identifying the GPU type
  • avoid reconciliation or scheduling issues caused by label disappearance
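
The requested behavior could be sketched roughly as follows. This is only an illustration of the proposal, not how GPU Operator computes the label today. The function name product_label_value and the suffix table are hypothetical, and the "-ERROR" suffix is taken from the example above. The validity check reflects the Kubernetes rule that label values are at most 63 characters, begin and end with an alphanumeric character, and may contain '-', '_' and '.' in between.

```python
import re

# Kubernetes label values: at most 63 characters, alphanumeric at both ends,
# with '-', '_' and '.' allowed in between (an empty value is also valid).
LABEL_VALUE_RE = re.compile(r"(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?")

# Hypothetical mapping from MIG config state to a product suffix.
# "-ERROR" for the pending state follows the example in this report.
STATE_SUFFIX = {"pending": "-ERROR"}


def product_label_value(base_product: str, mig_state: str) -> str:
    """Build the proposed nvidia.com/gpu.product value for a given MIG state.

    States without an entry in STATE_SUFFIX return the base product unchanged;
    this sketch does not model the existing MIG-specific suffixes.
    """
    value = base_product + STATE_SUFFIX.get(mig_state, "")
    if len(value) > 63 or not LABEL_VALUE_RE.fullmatch(value):
        raise ValueError(f"not a valid label value: {value!r}")
    return value


print(product_label_value("NVIDIA-H200-NVL", "pending"))  # NVIDIA-H200-NVL-ERROR
```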

Impact

External components depending on nvidia.com/gpu.product cannot reliably determine GPU type once the label is removed, especially during interrupted repartitioning scenarios such as node reboot.
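
On the consumer side, an external controller could recover the GPU type from a suffixed label along these lines. This is a sketch under the assumption that only the two suffixes named in this report ("-ERROR" and "-INVALID") need to be handled. base_gpu_product is a hypothetical helper name.

```python
# Suffixes named in this report; any others are outside this sketch.
KNOWN_SUFFIXES = ("-ERROR", "-INVALID")


def base_gpu_product(label_value: str) -> str:
    """Strip a known state suffix from a nvidia.com/gpu.product value."""
    for suffix in KNOWN_SUFFIXES:
        if label_value.endswith(suffix):
            return label_value[: -len(suffix)]
    return label_value


print(base_gpu_product("NVIDIA-H200-NVL-ERROR"))    # NVIDIA-H200-NVL
print(base_gpu_product("NVIDIA-H200-NVL-INVALID"))  # NVIDIA-H200-NVL
print(base_gpu_product("NVIDIA-H200-NVL"))          # NVIDIA-H200-NVL
```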
