NCP-AII Exam Fee and Past Questions, Question 66: During a 72-hour HPL burn-in test on a DGX H100 cluster, one node shows a 15% performance drop after


Question 66/131

During a 72-hour HPL burn-in test on a DGX H100 cluster, one node shows a 15% performance drop after 48 hours. What are the two most likely causes and diagnostic steps?
Pick the 2 correct responses below.

A. MPI configuration error; rerun with --cpu-affinity adjustments.
B. Network packet loss; analyze ibdiagnet reports.
C. Thermal throttling due to cooling issues; check nvidia-smi dmon.
D. Memory corruption; reboot the node and reduce problem size N.
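
As an illustration of the check named in option C, the following is a minimal Python sketch that scans text in the style of `nvidia-smi dmon` output for samples where a GPU is both hot and running at a reduced clock. It assumes the default column names `gpu`, `gtemp`, and `pclk` appear in the header line. The sample text, the function names, and the thresholds (80 C and 1500 MHz) are hypothetical values chosen for illustration, not figures from the question or from NVIDIA documentation.

```python
def parse_dmon(text):
    """Parse nvidia-smi dmon style text into a list of dicts keyed by column name."""
    columns = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            fields = line.lstrip("#").split()
            # The header line that starts with "gpu" names the columns;
            # the following "#" line lists units and is skipped.
            if columns is None and fields and fields[0] == "gpu":
                columns = fields
            continue
        if columns is None:
            continue
        values = line.split()
        if len(values) != len(columns):
            continue  # ignore malformed or truncated rows
        rows.append(dict(zip(columns, values)))
    return rows


def flag_hot_slow_samples(rows, temp_limit_c=80, min_clock_mhz=1500):
    """Return (gpu, temp_c, clock_mhz) for samples that are hot and clocked low.

    Both thresholds are hypothetical; pick values suited to the hardware.
    """
    flagged = []
    for row in rows:
        try:
            gpu = int(row["gpu"])
            temp = int(row["gtemp"])
            clock = int(row["pclk"])
        except (KeyError, ValueError):
            continue  # a "-" value marks a metric the GPU does not report
        if temp >= temp_limit_c and clock < min_clock_mhz:
            flagged.append((gpu, temp, clock))
    return flagged


# Hypothetical sample, not captured from a real node.
SAMPLE = """\
# gpu   pwr gtemp mtemp    sm   mem   enc   dec  mclk  pclk
# Idx     W     C     C     %     %     %     %   MHz   MHz
    0   690    71    80    99    62     0     0  2619  1980
    1   540    88    95    99    60     0     0  2619  1275
"""

print(flag_hot_slow_samples(parse_dmon(SAMPLE)))  # [(1, 88, 1275)]
```

In this sample only GPU 1 is flagged, because it is the only one above the temperature limit while also below the clock floor. On a real node the same functions could be fed text captured from `nvidia-smi dmon` during the burn-in, so that samples from before and after the performance drop can be compared.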
