Building for AMD GPUs

The AMD ROCm development platform

The AMD Radeon Open Compute (ROCm) platform is a software stack for developing and running programs on GPUs. ROCm supports several programming models, such as the Heterogeneous-computing Interface for Portability (HIP), offloading to GPU with OpenMP directives, and the SYCL programming model.

Programs on Dardel are installed using a specific Cray Programming Environment (CPE). The main version of the CPE on Dardel is currently 23.12, which can be loaded with

ml PDC/23.12

Load the ROCm module version 5.7.0 and set the accelerator target to amd-gfx90a (AMD MI250X GPU) with

ml rocm/5.7.0
ml craype-accel-amd-gfx90a
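
Before building, it can be useful to confirm that both modules are actually loaded in the current shell. The following is a minimal sketch, assuming the module system exports the colon-separated LOADEDMODULES variable (as Lmod and Environment Modules do). The helper name module_is_loaded is hypothetical and not part of the Dardel setup.

```shell
# Hypothetical helper: succeed if the given module name appears in
# LOADEDMODULES, the colon-separated list kept by the module system.
module_is_loaded() {
    case ":${LOADEDMODULES:-}:" in
        *":$1:"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Report the status of the two modules used on this page.
for m in rocm/5.7.0 craype-accel-amd-gfx90a; do
    if module_is_loaded "$m"; then
        echo "loaded:  $m"
    else
        echo "missing: $m (load it with: ml $m)"
    fi
done
```

The check matches full module names only, so rocm/5.7 does not match rocm/5.7.0.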

Programs can then be built with the different toolchains (Cray, GNU, AOCC) that are available in the different versions of the Cray Programming Environment; see Compilers and libraries.

For running programs as batch jobs on the GPU nodes, see job script example 6 in Job script examples.

Compiler and linker flags environment variables

For executables that are built with the compilers of the Cray Compiler Environment (CCE), verbose runtime information can be enabled with the environment variable CRAY_ACC_DEBUG, which takes the values 1, 2 or 3. For the highest level of information, use

export CRAY_ACC_DEBUG=3
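
Exporting the variable affects every later command in the shell. As an alternative, the sketch below sets it for a single command only. The function name run_with_acc_debug is hypothetical, and the accepted levels are limited to the values 1, 2 and 3 named above.

```shell
# Hypothetical helper: run one command with CRAY_ACC_DEBUG set only for
# that command, leaving the calling shell's environment unchanged.
run_with_acc_debug() {
    if [ "$#" -lt 2 ]; then
        echo "usage: run_with_acc_debug LEVEL COMMAND [ARGS...]" >&2
        return 2
    fi
    level=$1
    shift
    case "$level" in
        1|2|3) ;;
        *) echo "CRAY_ACC_DEBUG level must be 1, 2 or 3" >&2; return 2 ;;
    esac
    # env passes the variable to the child process only.
    env CRAY_ACC_DEBUG="$level" "$@"
}

# Demonstration with a child shell that prints the value it receives.
run_with_acc_debug 3 sh -c 'echo "CRAY_ACC_DEBUG=$CRAY_ACC_DEBUG"'
```

The demonstration prints CRAY_ACC_DEBUG=3. In practice the child shell would be replaced by an executable built with CCE.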

Build and run examples

Example 1: Build and run a C++ code with offloading to GPU with HIP

In this example we build and test run a Hello World C++ code in which offloading to GPU is done with the Heterogeneous-computing Interface for Portability (HIP). The program is built with the AMD hipcc compiler.

# Download the source code
wget https://raw.githubusercontent.com/PDC-support/introduction-to-pdc/master/example/hello_world_gpu.cpp

# Load the ROCm module and set the accelerator target to amd-gfx90a (AMD MI250X GPU)
ml rocm/5.7.0
ml craype-accel-amd-gfx90a

# We use the AMD hipcc compiler. Check the full path of the command hipcc
which hipcc
# returns
/opt/rocm-5.7.0/bin/hipcc

# Compile the code on the login node
hipcc --offload-arch=gfx90a hello_world_gpu.cpp -o hello_world_gpu.x

# Test the code in an interactive session 
# First queue to get one GPU node reserved for 10 minutes
salloc -N 1 -t 0:10:00 -A <project name> -p gpu
# wait for a node 

# then run the program
srun -n 1 ./hello_world_gpu.x

# with program output to standard out
You can access GPU devices: 0-7
GPU 0: hello world
...

Example 2: Build and run a Fortran code with offloading to GPU with OpenMP

In this example we build and test run a Fortran program that calculates the dot product of two long vectors by means of offloading to GPU with OpenMP. The build is done within the PrgEnv-cray environment using the Cray Compiler Environment.

# Download the source code
wget https://github.com/ENCCS/openmp-gpu/raw/main/content/exercise/ex04/solution/ex04.F90

# Load the ROCm module and set the accelerator target to amd-gfx90a (AMD MI250X GPU)
ml rocm/5.7.0
ml craype-accel-amd-gfx90a

# Check which compiler the compiler wrapper is pointing to
ftn --version
# returns
Cray Fortran : Version 17.0.0

# Compile the code on the login node
ftn -fopenmp ex04.F90 -o ex04.x

# Test the code in an interactive session
# First queue to get one GPU node reserved for 10 minutes
salloc -N 1 -t 0:10:00 -A <project name> -p gpu
# wait for a node 

# then run the program
srun -n 1 ./ex04.x

# with program output to standard out
The sum is:  1.25

# Alternatively, log in to the node with, for example,
ssh nid002792
# where nid002792 is one of the Dardel GPU nodes

# Load the rocm module
ml rocm/5.7.0

# then run the program
./ex04.x

# with program output to standard out
The sum is:  1.25

# For executables built with CCE, enable verbose runtime information on
# the offloading to GPU with the environment variable
export CRAY_ACC_DEBUG=3

# When rerunning the program
./ex04.x

# a detailed listing of the data transfers between host memory and
# device memory is displayed
ACC: Version 5.0 of HIP already initialized, runtime version 50013601
ACC: Get Device 0
...
...
ACC: End transfer (to acc 0 bytes, to host 4 bytes)
ACC:
The sum is:  1.25
ACC: __tgt_unregister_lib
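
The verbose listing can become long. As a rough sketch, the transfer totals can be extracted from a saved copy of the output. This assumes that every transfer summary has the same form as the "ACC: End transfer" line shown above. The function name sum_acc_transfers and the file name acc_debug.log are hypothetical.

```shell
# Hypothetical helper: sum the byte counts on "ACC: End transfer" lines
# read from standard input. All other lines are ignored.
sum_acc_transfers() {
    awk '/ACC: End transfer/ {
             to_acc  += $6     # byte count following the words to acc
             to_host += $10    # byte count following the words to host
         }
         END { printf "to acc: %d bytes, to host: %d bytes\n", to_acc, to_host }'
}

# Demonstrate on the summary line from the listing above.
printf '%s\n' 'ACC: End transfer (to acc 0 bytes, to host 4 bytes)' | sum_acc_transfers
```

For the single line from the listing this prints "to acc: 0 bytes, to host: 4 bytes". A run captured with ./ex04.x > acc_debug.log 2>&1 could then be summarized with sum_acc_transfers < acc_debug.log.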