A critical AI model training job consistently fails on a specific GPU server in your cluster after running for approximately 24 hours. Monitoring data shows a sudden drop in GPU power consumption followed by a system reboot. All other GPUs on the server appear normal. The server has redundant PSUs. What is the MOST likely cause?
正解:B
Thermal runaway (B) is the most probable cause. The 24-hour delay suggests a gradual heat buildup. A failing TIM would cause the GPU to overheat until it triggers a thermal shutdown, resulting in the power drop and reboot. While a PSU issue (C) is possible, redundant PSUs should prevent a complete failure unless one PSU is completely dead and the second PSU is overloaded by the entire load for a short period. The other options are less likely to cause this specific failure pattern.