[Question]: One pod, nvidia-cuda-validator-rr9qz, is stuck at 0/1 Init:Error #2511

@AleksFirsta

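Hello team,

In my cluster, one of the pods keeps failing and I can't fix it: nvidia-cuda-validator-rr9qz stays in Init:Error, and nvidia-operator-validator-zfrz6 is stuck at Init:2/4.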

gpu-feature-discovery-9ztn5                                       1/1     Running      0               16m
gpu-feature-discovery-nlg6b                                       1/1     Running      0               16m
gpu-feature-discovery-rmvtr                                       1/1     Running      0               16m
gpu-operator-75b97bdcbc-fjmsm                                     1/1     Running      0               17m
nvidia-container-toolkit-daemonset-65m2r                          1/1     Running      0               16m
nvidia-container-toolkit-daemonset-prg6k                          1/1     Running      0               16m
nvidia-container-toolkit-daemonset-s9g8m                          1/1     Running      0               16m
nvidia-cuda-validator-44rfx                                       0/1     Completed    0               14m
nvidia-cuda-validator-7bh8j                                       0/1     Completed    0               13m
nvidia-cuda-validator-rr9qz                                       0/1     Init:Error   4 (2m14s ago)   5m1s
nvidia-dcgm-exporter-bn65s                                        1/1     Running      0               16m
nvidia-dcgm-exporter-f454z                                        1/1     Running      0               16m
nvidia-dcgm-exporter-z8n4w                                        1/1     Running      0               16m
nvidia-device-plugin-daemonset-k4rl8                              1/1     Running      0               16m
nvidia-device-plugin-daemonset-p7w7s                              1/1     Running      0               16m
nvidia-device-plugin-daemonset-vt5d7                              1/1     Running      0               16m
nvidia-driver-daemonset-9hcbr                                     1/1     Running      0               17m
nvidia-driver-daemonset-dhq4j                                     1/1     Running      0               16m
nvidia-driver-daemonset-t84wg                                     1/1     Running      0               16m
nvidia-gpu-operator-node-feature-discovery-gc-796ccf46c6-p9674    1/1     Running      0               17m
nvidia-gpu-operator-node-feature-discovery-master-65589f87tw7qq   1/1     Running      0               17m
nvidia-gpu-operator-node-feature-discovery-worker-62nv7           1/1     Running      1 (16m ago)     17m
nvidia-gpu-operator-node-feature-discovery-worker-gjvjg           1/1     Running      0               17m
nvidia-gpu-operator-node-feature-discovery-worker-jvmqm           1/1     Running      1 (16m ago)     17m
nvidia-gpu-operator-node-feature-discovery-worker-pxkgq           1/1     Running      0               17m
nvidia-gpu-operator-node-feature-discovery-worker-qgrbd           1/1     Running      1 (16m ago)     17m
nvidia-gpu-operator-node-feature-discovery-worker-t9rq6           1/1     Running      1 (16m ago)     17m
nvidia-mig-manager-jq47r                                          1/1     Running      0               15m
nvidia-mig-manager-l6wdd                                          1/1     Running      0               13m
nvidia-mig-manager-twltl                                          1/1     Running      0               14m
nvidia-operator-validator-hpqxf                                   1/1     Running      0               16m
nvidia-operator-validator-tc7b2                                   1/1     Running      0               16m
nvidia-operator-validator-zfrz6                                   0/1     Init:2/4     2 (5m13s ago)   16m

I tried restarting the nodes and restarting the DaemonSet, but neither helped.

GPU Operator version: v26.3.2
Driver: 580.159.04
GPU: NVIDIA B200
OS: Ubuntu 24.04
CUDA Version: 13.0
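
The listing above shows which pods are unhealthy but not why the init container exits. A minimal sketch of how the failing pods could be picked out of that listing and their events and init-container logs collected is below. The namespace `gpu-operator` and the helper names `failing_pods` and `collect_pod_diagnostics` are assumptions for illustration; the issue does not state the namespace.

```shell
#!/usr/bin/env bash
# Sketch only. NAMESPACE is an assumption: the issue does not say where the
# operator is installed. Override it with NAMESPACE=... if it differs.
NAMESPACE="${NAMESPACE:-gpu-operator}"

# Hypothetical helper: read `kubectl get pods --no-headers` output on stdin and
# print the names of pods whose STATUS column is neither Running nor Completed.
failing_pods() {
  awk '$3 != "Running" && $3 != "Completed" { print $1 }'
}

# Hypothetical helper: show the node a pod is scheduled on, its events, and the
# logs of all of its containers (init containers included).
collect_pod_diagnostics() {
  local pod="$1"
  kubectl -n "$NAMESPACE" get pod "$pod" -o wide
  kubectl -n "$NAMESPACE" describe pod "$pod"
  kubectl -n "$NAMESPACE" logs "$pod" --all-containers --prefix
  # Logs from the previous attempt; this can fail for containers that have
  # not restarted, so the error is ignored.
  kubectl -n "$NAMESPACE" logs "$pod" --all-containers --prefix --previous || true
}

# Usage (not executed here):
#   kubectl -n "$NAMESPACE" get pods --no-headers | failing_pods |
#     while read -r pod; do collect_pod_diagnostics "$pod"; done
```

Run against the listing above, the filter would select nvidia-cuda-validator-rr9qz and nvidia-operator-validator-zfrz6. The `-o wide` output shows which node they are scheduled on, which helps tell a single-node problem from a cluster-wide one.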
