ROCm on an AMD 3020e Laptop APU
2024-02-27
There are broadly 3 approaches to installing ROCm:
- Using prebuilt ROCm and PyTorch binaries
- Using prebuilt binaries for ROCm but compiling PyTorch from source
- Compiling both ROCm and PyTorch from source
Installation is non-trivial. The default ROCm installation instructions don't work for APUs.
The following information is likely useful for any Vega-series (also known as GCN5.0 or gfx9) APU.
Specification references:
AMD 3020e specs
AMD GPU device ID database
The AMD 3020e APU has a Vega 3 gfx902 integrated graphics processor.
For the gfx9 series, discrete GPUs use gfx900 (i.e. Vega 64) while gfx902 is the APU variant.
ROCm 5 as of 2022 is unofficially compatible with gfx900 GPUs using HSA_OVERRIDE_GFX_VERSION.
AMD does not officially or unofficially support gfx902, but let's try anyway.
A 2018 Github comment on APU support:
The ISA does have meaningful changes each generation.
In particular, the binary encoding of even similar instructions can change slightly each generation.
As such, moving between e.g. gfx7 to gfx8 or gfx9 GPUs is likely to result in binaries that are wholly incompatible.
As described here, one of the major differences between the dGPU and APU instruction sets (e.g. gfx900 dGPU and gfx902 APU) is the required support for XNACK.
XNACK is needed so that the GPU can take precise exceptions on particular memory accesses for things like page faults.
Our current dGPUs do not need this, so the compiler does not create code with XNACK -- this offers higher performance in these GPUs.
The APUs do need these, if we want to run code in "HSA" mode where the CPU and GPU share a virtual memory space with IOMMU support.
So the compiler must target these differently.
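The debug logs later in this post show HIP's fatbin loader searching for code objects by full ISA name, which is where this distinction bites: a binary built for gfx900 without XNACK is a different ISA target than gfx900:xnack+. As a rough sketch (the triple format here is an assumption based on LLVM's amdgcn target naming and the log output below), the name combines the target triple, the gfx architecture, and feature flags such as xnack:

```python
from typing import Optional

def isa_name(arch: str, xnack: Optional[bool] = None) -> str:
    """Build an amdgcn ISA name like the ones HIP's fatbin loader logs.

    `arch` is the gfx architecture (e.g. "gfx900", "gfx902");
    xnack=True/False appends the ":xnack+"/":xnack-" feature flag,
    and None omits it entirely.
    """
    name = f"amdgcn-amd-amdhsa--{arch}"
    if xnack is not None:
        name += ":xnack+" if xnack else ":xnack-"
    return name

print(isa_name("gfx900", xnack=True))   # amdgcn-amd-amdhsa--gfx900:xnack+
print(isa_name("gfx902"))               # amdgcn-amd-amdhsa--gfx902
```

The practical consequence: even if an override convinces the runtime to treat a gfx902 APU as gfx900, the shipped binaries still have to contain a code object for the exact feature combination the runtime asks for.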
A 2023 Github comment on APU vRAM being limited by BIOS config:
In conclusion, there are several solutions for other AMD APU users:
- Override the BIOS settings to allocate more memory. This method is like the old days when you set your dedicated video memory in the BIOS. More VRAM means less system memory.
- torch-apu-helper uses the Unified Memory Architecture (UMA), so the APU can allocate memory from the system dynamically. It is a good demo, but this way not all APIs work (e.g. getDeviceStats). If you are using an application based on PyTorch, it would likely not work. I filed an issue on PyTorch; hopefully they can add native AMD APU support.
Relevant links:
Niconiconi's guide to enable XNACK
Pomoke's torch-apu-helper
Prior attempts for ROCm on APUs:
Compiling from source for APUs
Using prebuilt binaries
Using prebuilt binaries on Arch
Using a custom build for gfx902
First attempt: Binary install on Arch Linux (unsuccessful)
Install steps:
# add user to video and render groups
sudo gpasswd -a edw render
sudo gpasswd -a edw video
# install rocm
sudo pacman -Syu rocm-device-libs rocminfo rocm-smi-lib
# install texlive
sudo pacman -Syu texlive
# install headers, rocm-hip-sdk, dkms
sudo pacman -Syu linux-headers rocm-hip-sdk dkms
# change tmpdir to fix memory error, install rocm pytorch nightly (to match arch installing rocm6)
export TMPDIR='/var/tmp'
pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.0
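One quick sanity check after the pip install is whether the wheel that was pulled in is actually a ROCm build: ROCm wheels set torch.version.hip, while CUDA/CPU wheels leave it as None. A minimal sketch of that check (exercised here against stand-in objects so it runs even without PyTorch installed; a real check would pass the imported torch module itself):

```python
from types import SimpleNamespace

def is_rocm_build(torch_module) -> bool:
    """Return True if this PyTorch module reports a HIP (ROCm) runtime."""
    hip = getattr(getattr(torch_module, "version", None), "hip", None)
    return hip is not None

# Stand-ins for a ROCm wheel and a CPU/CUDA wheel:
fake_rocm = SimpleNamespace(version=SimpleNamespace(hip="6.0.32830"))
fake_cpu = SimpleNamespace(version=SimpleNamespace(hip=None))
print(is_rocm_build(fake_rocm))  # True
print(is_rocm_build(fake_cpu))   # False
```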
Environment variable config in .bash_profile/.bashrc:
# add toolchain to path
export PATH=$PATH:/opt/rocm/bin
# enable xnack (required for apus)
export HSA_XNACK=1
# override to gfx900 (closest rocm version to gfx902)
export HSA_OVERRIDE_GFX_VERSION=9.0.0
# force pytorch to emulate cuda behaviour
export DEVICE=cuda
# enable detailed debug logging
export AMD_LOG_LEVEL=4
# set hip to use device 0 (integrated gpu)
export HIP_VISIBLE_DEVICES=0
# force pytorch to compile for gfx900:xnack+
export PYTORCH_ROCM_ARCH=gfx900:xnack+
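HSA_OVERRIDE_GFX_VERSION takes a major.minor.step triple that the runtime maps onto a gfx target name. As far as I can tell, the mapping simply concatenates the three fields, with the step rendered in hex (so 9.0.0 becomes gfx900 and 10.3.0 becomes gfx1030). A sketch of that assumed mapping:

```python
def override_to_gfx(version: str) -> str:
    """Map an HSA_OVERRIDE_GFX_VERSION triple to a gfx target name.

    Assumed mapping: "major.minor.step" -> "gfx{major}{minor}{step-in-hex}".
    """
    major, minor, step = (int(part) for part in version.split("."))
    return f"gfx{major}{minor}{step:x}"

print(override_to_gfx("9.0.0"))   # gfx900
print(override_to_gfx("10.3.0"))  # gfx1030
```

This is why 9.0.0 is the value used here: it is the closest target (gfx900) for which prebuilt ROCm binaries actually exist.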
Modprobe config in /etc/modprobe.d/modprobe.conf to enable xnack retry and disable power saving:
options amdgpu noretry=0
options amdgpu ppfeaturemask=0xfff73fff
PyTorch detects CUDA as available, and hipInfo reports XNACK as enabled.
Results:
Test command:
HSA_OVERRIDE_GFX_VERSION=9.0.0 ./venv/bin/python -c 'import torch; print(torch.cuda.is_available())'
Result:
True (successful)
Test command:
HSA_OVERRIDE_GFX_VERSION=9.0.0 HSA_XNACK=1 DEVICE=cuda HIP_VISIBLE_DEVICES=0 AMD_LOG_LEVEL=4 ./venv/bin/python -c 'import torch;a=torch.randn(3).to("cuda")'
Curated debug output:
:1:hip_fatbin.cpp :109 : 0675447520 us: [pid:1045 tid:0x799c431e1740] Missing CO for these ISAs -
:1:hip_fatbin.cpp :112 : 0675447528 us: [pid:1045 tid:0x799c431e1740] amdgcn-amd-amdhsa--gfx900:xnack+
:1:hip_fatbin.cpp :302 : 0675447536 us: [pid:1045 tid:0x799c431e1740] Releasing COMGR data failed with status 2
:1:hip_fatbin.cpp :256 : 0675449091 us: [pid:1045 tid:0x799c431e1740] Cannot find CO in the bundle for ISA: amdgcn-amd-amdhsa--gfx900:xnack+
(The above output occurs during 'import torch')
(Laptop later becomes unresponsive and display starts flickering)
(Behavior is reproducible with or without xnack, device=cuda and hip_visible_devices)
(Interestingly, this behavior can also be reproduced with llama.cpp, suggesting the issue isn't on the PyTorch side)
Second attempt: using custom build for gfx902 on Ubuntu 18.04 (unsuccessful)
Installed deb packages from the zip bundle (bs-rocm-2.8.0_git40bf6f3e-apu-core-pkgs_1804LTS_amd64.zip) and followed the PyTorch install instructions from the log file attached to the whl download (rocm-2.6.0-apu_Ubuntu_1804_AMD64/pytorch).
$ sudo apt install git python3-pip libopenblas-dev cmake libnuma-dev autoconf build-essential \
ca-certificates curl libgoogle-glog-dev libhiredis-dev libleveldb-dev liblmdb-dev \
libopencv-dev libpthread-stubs0-dev libsnappy-dev sudo vim libprotobuf-dev protobuf-compiler \
python3-opencv
$ pip3 install enum34 numpy pyyaml setuptools typing cffi future hypothesis
$ pip3 install torch-1.3.0a0+e8acc2e-cp36-cp36m-linux_x86_64.whl
Various packages, such as rccl and hipsparse, were missing, so I installed them from the bionic (18.04) apt repo.
I had to backport a significant chunk of runtime libraries using PPAs to fix shared-object errors when installing software from the apt repo (e.g. libMIOpen.so.1 not found, libjasper not installed).
At some point this led to conflicts in libc, causing PyTorch to fail with references to libsystemd and librt.
Reattempting this on a newer version of Ubuntu rather than the recommended 18.04 might be worthwhile, though the custom build's PyTorch and ROCm versions are outdated, which limits its usefulness.
Another issue is that each part of the install process (zip bundle, apt, PyTorch) uses a different version of ROCm, causing version conflicts.
A strategic retreat: Path forward
For now, I've put getting ROCm working on the backburner. It seems the juice isn't quite worth the squeeze. The primary path forward I see is to compile the ROCm toolchain from source on Gentoo (just compiling PyTorch from source won't be enough), then try to reconfigure the build to include the missing COMGR bundle.
wip