Setting Up the OS, NVIDIA Drivers, CUDA, and cuDNN on IBM Power 9 Servers
Background
This is the first post in a tutorial series on how to build a Language Model API on an IBM Power9 server, covering everything from setting up the operating system to having the API running remote inference.
This step of the tutorial shows how to set up the operating system and install NVIDIA drivers, CUDA, and cuDNN on IBM Power System AC922 machines, which use Power9 processors. The focus is on ensuring everything works correctly on the ppc64le architecture, which is common in high-performance environments.
IBM Power9: The IBM Power9 AC922 is a high-performance machine used for demanding tasks such as artificial intelligence and scientific computing. It uses Power9 processors and works well with NVIDIA GPUs, offering high-speed communication between the CPU and GPU.
NVIDIA Drivers: Software that allows the operating system to communicate correctly with NVIDIA GPUs. These drivers are essential to enable GPU acceleration.
CUDA: NVIDIA’s platform for accelerating parallel computing on GPUs. It lets you run complex algorithms efficiently, such as Large Language Model inference.
cuDNN: A GPU-optimized library of primitives for deep neural networks (DNNs) developed by NVIDIA. It offers high-performance implementations of key DNN operations like convolutions, pooling, and normalization, significantly speeding up training and inference on GPUs.
TL;DR
- This post provides a step-by-step guide on setting up Power9 servers, including the OS and NVIDIA configurations.
- The main challenge is finding compatible versions for the Power9 machine architecture.
Setting up the Operating System
Let’s start with the installation of Red Hat Enterprise Linux 8.10 (Ootpa). On Power systems, the architecture used is ppc64le (PowerPC 64-bit little-endian), so it’s essential to ensure the .iso image is compatible with this architecture. Otherwise, the Power9’s petitboot won’t recognize the media and installation won’t proceed.
- You can download the correct image from the link provided.
- In this tutorial, we’ll use the Boot ISO option and follow the official Red Hat documentation to create a bootable USB medium.
- After inserting the installation media into the Power9 server and rebooting, the system should automatically start petitboot.
- From there, just follow the official installation guide to complete the OS setup.
Setting up NVIDIA Driver and CUDA
Checking GPUs and Operating System
To enable the operating system to communicate properly with the server’s GPUs, we need to install and configure the NVIDIA driver.
- First, let’s check for the presence of the GPU(s):
lspci | grep -i nvidia
The expected output is something like:
0004:04:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)
- Next, let’s check the system architecture and operating system name:
uname -m && cat /etc/redhat-release
The expected output is:
ppc64le Red Hat Enterprise Linux release 8.10 (Ootpa)
Avoiding conflicts
To avoid potential conflicts, it’s recommended to disable the nouveau driver and SELinux.
The nouveau driver is an open-source driver for NVIDIA GPUs, typically used when free software is preferred over the performance of the proprietary driver. When SELinux is enabled, it restricts certain processes from making changes to the system, which can conflict with the installations we'll do in this tutorial.
- Disable the nouveau driver:
echo -e "blacklist nouveau\noptions nouveau modeset=0" | sudo tee /etc/modprobe.d/disable-nouveau.conf
- To disable SELinux, let's first check its status by running:
sestatus
If it’s active, you’ll need to set the SELINUX=disabled parameter in the /etc/selinux/config file to proceed. Remember that saving changes requires sudo permissions.
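Instead of editing the file by hand, the change can be made with a single sed command. This is a sketch; it backs up the original file first so the change is easy to revert:

```shell
# Back up the current SELinux config, then switch SELinux to disabled.
sudo cp /etc/selinux/config /etc/selinux/config.bak
sudo sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config

# Confirm the change took effect in the file.
grep '^SELINUX=' /etc/selinux/config
```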
- After that, update the initramfs and reboot the machine with the following commands:
sudo dracut --force
sudo reboot
- To verify everything worked so far, let's check if nouveau is disabled:
lsmod | grep nouveau
If it’s been successfully disabled, there will be no output.
- To verify SELinux:
sestatus
If it’s disabled, the output will be: SELinux status: disabled
Installing Prerequisites
- Let’s install some prerequisites before starting the actual installation:
sudo dnf install pciutils environment-modules
sudo dnf install kernel-devel-$(uname -r) kernel-headers
sudo dnf install https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm
sudo dnf clean all
sudo dnf install dkms
- We also need to enable some repositories:
sudo subscription-manager repos --enable=rhel-8-for-ppc64le-appstream-rpms
sudo subscription-manager repos --enable=rhel-8-for-ppc64le-baseos-rpms
sudo subscription-manager repos --enable=codeready-builder-for-rhel-8-ppc64le-rpms
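Before moving on, you can confirm that all three repositories are actually active. A sketch of such a check (the output format of subscription-manager may vary slightly between versions):

```shell
# List the enabled repos and check that each required one is present.
for repo in rhel-8-for-ppc64le-appstream-rpms \
            rhel-8-for-ppc64le-baseos-rpms \
            codeready-builder-for-rhel-8-ppc64le-rpms; do
    if sudo subscription-manager repos --list-enabled | grep -q "$repo"; then
        echo "enabled: $repo"
    else
        echo "MISSING: $repo" >&2
    fi
done
```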
Downloading and Installing CUDA Package Repositories
- Let’s download CUDA version 12.2 and NVIDIA Driver 535.54.03-1 with the following command:
wget https://developer.download.nvidia.com/compute/cuda/12.2.0/local_installers/cuda-repo-rhel8-12-2-local-12.2.0_535.54.03-1.ppc64le.rpm
- To install the downloaded package:
sudo rpm -i cuda-repo-rhel8-12-2-local-12.2.0_535.54.03-1.ppc64le.rpm
- To install the NVIDIA driver and CUDA, run the following commands:
sudo dnf install nvidia-driver-cuda
sudo dnf clean all
sudo dnf module reset nvidia-driver
sudo dnf module enable nvidia-driver:latest-dkms
sudo dnf -y module install nvidia-driver:latest-dkms
sudo dnf -y install cuda
With these commands, the driver and CUDA installation is complete.
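As a quick sanity check before the post-installation steps, you can confirm that the packages actually landed. This is a sketch; the exact version strings will depend on your install, and nvidia-smi will only report correctly once the driver module is loaded (typically after a reboot):

```shell
# Confirm the CUDA metapackage is installed.
rpm -q cuda || echo "cuda package not found"

# Check that the NVIDIA kernel module was built (DKMS).
modinfo nvidia | grep '^version' || echo "nvidia kernel module not available yet"

# After the driver is loaded, the version can also be queried directly:
nvidia-smi --query-gpu=driver_version --format=csv,noheader
```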
Post-Installation Steps
- Let's set the PATH and LD_LIBRARY_PATH environment variables. To do this, edit the .bashrc file and add these two lines:
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
To update the environment variables, run the following command:
source ~/.bashrc
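With the variables in place, the CUDA toolchain should resolve from the new paths. A quick sketch of how to confirm this (release number assumes the CUDA 12.2 install from earlier):

```shell
# nvcc should now resolve from /usr/local/cuda/bin and report release 12.2.
which nvcc
nvcc --version | grep release

# The CUDA runtime library directory should be on the loader path as well.
echo "$LD_LIBRARY_PATH" | tr ':' '\n' | grep -x /usr/local/cuda/lib64
```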
We need to make two manual changes because they aren’t handled automatically by the CUDA package installation. If these aren’t done, the CUDA driver installation will not work properly.
- The first change is to configure the NVIDIA persistence daemon. First, check its status, and if it’s not active, enable it:
systemctl status nvidia-persistenced
systemctl enable nvidia-persistenced
Some Linux distributions have a udev rule that brings hot-plugged memory online as soon as it’s detected, preventing NVIDIA software from correctly configuring GPU memory on Power9.
- To disable this rule, run the following commands:
sudo cp /lib/udev/rules.d/40-redhat.rules /etc/udev/rules.d/
sudo sed -i 's/SUBSYSTEM!="memory",.*GOTO="memory_hotplug_end"/SUBSYSTEM=="*", GOTO="memory_hotplug_end"/' /etc/udev/rules.d/40-redhat.rules
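You can verify that the sed command above actually rewrote the rule before rebooting. A sketch of that check:

```shell
# Show the memory hot-plug line in the overriding rules file.
grep 'memory_hotplug_end' /etc/udev/rules.d/40-redhat.rules

# If the original SUBSYSTEM!="memory" form is still present, the edit did not apply.
if grep -q 'SUBSYSTEM!="memory"' /etc/udev/rules.d/40-redhat.rules; then
    echo "WARNING: udev memory hot-plug rule still active" >&2
fi
```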
Installation Check
After completing all these steps, let’s reboot the machine and verify the installations:
- Reboot the machine:
sudo reboot
- Check the NVIDIA driver:
nvidia-smi
The output of the command above should display the installed driver version and the CUDA version it supports. It should also list the available devices (GPUs) with details like name, memory usage, temperature, and other information.
To perform the final check, let’s download the cuda-samples repository and run the device test.
- Download the repository and check out the cuda-samples tag matching the installed CUDA version:
git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples/Samples/1_Utilities/deviceQuery
git checkout v12.2
- To build and run the tests:
make
./deviceQuery
After running this test, you should see Result = PASS in the last line. This confirms that the Power9 is set up with the NVIDIA driver and CUDA working correctly.
Setting up cuDNN
- First, we need to download and install the .rpm package specific to ppc64le:
wget https://developer.download.nvidia.com/compute/cudnn/9.0.0/local_installers/cudnn-local-repo-rhel8-9.0.0-1.0-1.ppc64le.rpm
sudo rpm -i cudnn-local-repo-rhel8-9.0.0-1.0-1.ppc64le.rpm
sudo dnf clean all
sudo dnf -y install cudnn
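To confirm the install, you can query the package and read the version straight from the cuDNN header. The header path below is an assumption (the standard location for RPM installs); adjust it if your install placed it elsewhere:

```shell
# Confirm the cudnn package is installed.
rpm -q cudnn

# Read the cuDNN version from the header (assumed path: /usr/include).
grep -E '#define CUDNN_(MAJOR|MINOR|PATCHLEVEL)' /usr/include/cudnn_version.h
```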
- After installing, set the CUDNN_LIBRARY and CUDNN_INCLUDE_DIR environment variables by adding these lines to your .bashrc:
echo 'export CUDNN_LIBRARY=/usr/lib64' >> ~/.bashrc
echo 'export CUDNN_INCLUDE_DIR=/usr/include' >> ~/.bashrc
After that, the cuDNN installation process is complete.
This is the first part of our tutorial. Once you’ve finished all the steps in this post, the server will be ready to install the conda package manager and the pytorch library. You can access the second part of this tutorial at this link.
