You're deploying a multi-GPU training job on a cluster using Slurm. You need to ensure that the GPUs allocated to the job are healthy and functioning correctly before the training starts. What's the MOST effective approach to pre-validate the GPU hardware?
Correct answer: C
Running the DCGM diagnostic suite is the most thorough way to pre-validate GPU hardware. DCGM provides comprehensive tests that check GPU health, including memory, compute, and interconnects. A simple CUDA program or a check of nvidia-smi output provides only basic validation and is less comprehensive than DCGM. Monitoring temperature is reactive rather than proactive: it reveals a problem only once the job is already running. Assuming the GPUs are healthy without any validation is risky.
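
As an illustration of the pre-validation step described above, here is a minimal Python sketch that runs the DCGM diagnostic through the `dcgmi diag -r <level>` command before training is launched. The helper names (`build_diag_command`, `gpus_pass_diag`) and the choice of run level 3 are assumptions made for this example, not part of the original question. The sketch also assumes that `dcgmi` is on the PATH, that the DCGM host engine is running on the node, and that a nonzero exit status indicates a failed diagnostic; verify that last point against the installed DCGM version.

```python
import shutil
import subprocess
import sys


def build_diag_command(run_level=3):
    """Return the dcgmi command line for a diagnostic at the given run level."""
    # Higher run levels run longer and more thorough tests.
    if run_level not in (1, 2, 3, 4):
        raise ValueError("run_level must be 1, 2, 3, or 4")
    return ["dcgmi", "diag", "-r", str(run_level)]


def gpus_pass_diag(run_level=3, runner=subprocess.run):
    """Run the diagnostic and return True only if the command exits with 0.

    `runner` is injectable so the logic can be exercised without real GPUs.
    Assumption: a nonzero exit status means the diagnostic did not pass.
    """
    result = runner(build_diag_command(run_level), capture_output=True, text=True)
    if result.returncode != 0:
        # Forward the diagnostic report so it lands in the job's error log.
        print(result.stdout, file=sys.stderr)
    return result.returncode == 0


if __name__ == "__main__":
    if shutil.which("dcgmi") is None:
        sys.exit("dcgmi not found on PATH; cannot validate GPUs")
    # Exit nonzero so a calling script can abort before training starts.
    sys.exit(0 if gpus_pass_diag() else 1)
```

One way to use this, under the same assumptions, is to call the script at the top of the Slurm batch script and start training only if it exits with status 0, so that a node with a faulty GPU fails fast instead of partway through the run.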