Porting topography version of AWP-ODC to OLCF Frontier, with ROCm-Aware support to improve the performance of collective communication

Yifeng Cui, Akash Palla, Arnav Talreja, Lars Koesterke, Wenyang Zhang, Te-Yang Yeh, & Philip J. Maechling

Published September 8, 2024, SCEC Contribution #13964, 2024 SCEC Annual Meeting Poster #216

We have ported the topography version of AWP-ODC, with the discontinuous mesh feature enabled, to HIP and verified that it runs on AMD MI250X GPUs. On Frontier, we benchmarked a parallel efficiency of 103.3% between 8 and 4,096 nodes, that is, on up to 32,768 GCDs. Frontier, housed at the Oak Ridge Leadership Computing Facility (OLCF) at Oak Ridge National Laboratory (ORNL), is a two-exaflop/s computing system based on AMD Instinct GPUs and EPYC CPUs. This HIP topography code has been used in production runs on Frontier, a primary computing engine for the 2024 SCEC INCITE allocation, an award of 700K node-hours of supercomputing time. We also implemented ROCm-Aware GPU-direct support in the topography code and demonstrated an additional 14% reduction in time-to-solution on up to 4,096 nodes. The AWP-ODC-Topo code has also been tuned on TACC Vista, an Arm-based system built on NVIDIA GH200 Grace Hopper Superchips, where it demonstrated excellent performance. This poster presents weak-scaling studies and performance characteristics on GPUs. We also discuss our efforts to verify the ROCm-Aware development and to use high-performance MVAPICH libraries with on-the-fly compression on modern GPU clusters.
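
As a reading aid for the scaling figures quoted above, the following is a minimal Python sketch of how weak-scaling parallel efficiency and a relative reduction in time-to-solution are commonly computed. It assumes the usual weak-scaling definition (fixed work per node, efficiency equal to baseline time divided by scaled time); the abstract does not state the exact formula used. The function names are hypothetical, and the timings in the usage section are placeholders chosen only to illustrate the arithmetic, not measurements from the poster. The value of 8 GCDs per node follows from the abstract's own figures (32,768 GCDs on 4,096 nodes).

```python
# Sketch only: helper names are hypothetical and the timings used in the
# usage section are illustrative placeholders, not results from the poster.

GCDS_PER_NODE = 8  # 32,768 GCDs / 4,096 nodes, as implied by the abstract


def total_gcds(nodes: int) -> int:
    """Number of GCDs available on a given number of nodes."""
    return nodes * GCDS_PER_NODE


def weak_scaling_efficiency(t_base: float, t_scaled: float) -> float:
    """Weak-scaling efficiency with fixed work per node.

    Assumed definition: time on the baseline node count divided by time
    on the larger node count. Values above 1.0 mean the larger run was
    slightly faster per step than the baseline.
    """
    if t_base <= 0 or t_scaled <= 0:
        raise ValueError("timings must be positive")
    return t_base / t_scaled


def time_reduction(t_before: float, t_after: float) -> float:
    """Fractional reduction in time-to-solution, e.g. 0.14 for 14%."""
    if t_before <= 0 or t_after < 0:
        raise ValueError("timings must be non-negative, baseline positive")
    return (t_before - t_after) / t_before


if __name__ == "__main__":
    # Placeholder timings in seconds (hypothetical values).
    t_8_nodes, t_4096_nodes = 10.33, 10.0
    eff = weak_scaling_efficiency(t_8_nodes, t_4096_nodes)
    print(f"GCDs on 4,096 nodes: {total_gcds(4096)}")
    print(f"weak-scaling efficiency: {eff:.1%}")
    print(f"time reduction: {time_reduction(100.0, 86.0):.0%}")
```

Under this definition, an efficiency above 100% simply means the per-step time at the larger node count came out slightly lower than at the baseline.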

Key Words
AWP-ODC, topography, Frontier, wave propagation

Citation
Cui, Y., Palla, A., Talreja, A., Koesterke, L., Zhang, W., Yeh, T., & Maechling, P. J. (2024, 09). Porting topography version of AWP-ODC to OLCF Frontier, with ROCm-Aware support to improve the performance of collective communication. Poster Presentation at 2024 SCEC Annual Meeting.


Related Projects & Working Groups
Research Computing (RC)