During a 72-hour HPL burn-in test on a DGX H100 cluster, one node shows a 15% performance drop after 48 hours. What are the two most likely causes and diagnostic steps?
Pick the 2 correct responses below.
Correct answer: B, C
The two most likely causes are network packet loss and thermal throttling. A drop that appears only after 48 hours of HPL burn-in is unlikely to be a launch-time MPI configuration problem, because MPI affinity errors normally show up from the start of the run as consistently poor performance. Delayed degradation instead suggests that the system changed state under sustained load.

Thermal throttling is a common cause: after many hours, rack cooling imbalance, blocked airflow, high inlet temperature, or fan problems can force GPU clocks down. nvidia-smi dmon monitors GPU temperature, power, utilization, and clocks over time, so it can show whether clocks fell as temperature rose.

Network packet loss is also likely in multi-node HPL, because HPL depends on heavy communication across the InfiniBand fabric. Link errors, symbol errors, retransmissions, degraded cables, or congestion can all reduce sustained performance. ibdiagnet is the appropriate fabric-level diagnostic tool for collecting and analyzing InfiniBand health, topology, counters, and link issues.

Rebooting the node or reducing the matrix size would hide the symptom rather than diagnose it. Correct burn-in practice is to preserve evidence, inspect thermal telemetry, review network diagnostics, and compare the affected node against healthy peers.
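The thermal check described above can be sketched as a small log filter. This is a minimal illustration, not a definitive tool. It assumes the telemetry was captured as plain text from something like `nvidia-smi dmon -s pc`, and that the header names the columns `gpu`, `gtemp`, and `pclk`. Column names can differ between driver versions, so check them against the actual log. The helper names, the 80 °C limit, the 10% clock-drop threshold, and the sample rows are all hypothetical values chosen for illustration.

```python
def parse_dmon(text):
    """Parse dmon-style text into a list of dict rows keyed by column name."""
    columns = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # The first header line names the columns; the units line and any
            # repeated headers later in the log are skipped.
            if columns is None:
                columns = line.lstrip("#").split()
            continue
        if columns is None:
            continue
        fields = line.split()
        if len(fields) != len(columns):
            continue  # ignore truncated or malformed samples
        row = {}
        for name, value in zip(columns, fields):
            try:
                row[name] = float(value)
            except ValueError:
                row[name] = None  # e.g. "-" for a field the GPU does not report
        rows.append(row)
    return rows


def find_throttle_suspects(rows, temp_limit_c=80.0, clock_drop_fraction=0.10):
    """Return (gpu, gtemp, pclk) samples that are hot AND clocked well below
    the highest SM clock seen for that GPU in the same log.

    Both thresholds are illustrative assumptions, not vendor limits.
    """
    baseline = {}
    for row in rows:
        gpu, pclk = row.get("gpu"), row.get("pclk")
        if gpu is None or pclk is None:
            continue
        baseline[int(gpu)] = max(baseline.get(int(gpu), 0.0), pclk)

    suspects = []
    for row in rows:
        gpu, temp, pclk = row.get("gpu"), row.get("gtemp"), row.get("pclk")
        if gpu is None or temp is None or pclk is None:
            continue
        floor = baseline[int(gpu)] * (1.0 - clock_drop_fraction)
        if temp >= temp_limit_c and pclk <= floor:
            suspects.append((int(gpu), temp, pclk))
    return suspects


# Synthetic sample for illustration only; not real measurements.
SAMPLE = """\
# gpu    pwr  gtemp  mtemp   mclk   pclk
# Idx      W      C      C    MHz    MHz
    0    690     62      -   2619   1980
    1    688     63      -   2619   1980
    0    640     86      -   2619   1590
    1    689     64      -   2619   1980
"""

if __name__ == "__main__":
    for suspect in find_throttle_suspects(parse_dmon(SAMPLE)):
        print(suspect)
```

Running the same filter over logs from the degraded node and from a healthy peer gives a like-for-like comparison. A hit only on the affected node points toward a local cooling problem, while no hits anywhere shifts attention to the fabric diagnostics. The throttle reasons reported by `nvidia-smi -q -d PERFORMANCE` are a more direct confirmation than this heuristic.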