NCP-AII模擬資料質問114：A critical AI model training job consistently fails on a specific GPU server in your cluster after running

<<前へ次へ>>

質問 114/131

A critical AI model training job consistently fails on a specific GPU server in your cluster after running for approximately 24 hours.
Monitoring data shows a sudden drop in GPU power consumption followed by a system reboot. All other GPUs on the server appear normal. The server has redundant PSUs. What is the MOST likely cause?

A. A software bug in the A1 model causing a kernel panic specifically triggered after 24 hours of execution. B. Thermal runaway on the GPU due to a failing thermal interface material (TIM) between the GPU die and the heatsink. C. A transient power supply issue affecting only one of the redundant PSUs, triggering a system-wide protection mechanism. D. ECC memory errors accumulating over time, eventually leading to a non-recoverable system fault. E. A driver crash, causing the GPU to reset and the system to reboot.

質問 114/131

コメントを発表する

Download PDF File