Setting Up Conda and PyTorch on IBM Power9 Servers
Background
This is the second post in a tutorial series on how to build a Language Model API on an IBM Power9 server, covering everything from setting up the operating system to serving remote inference from the API. The first post covers installing the OS and configuring the NVIDIA drivers, CUDA, and cuDNN. In this step, we’ll show how to set up the Conda package manager and the PyTorch library.
Conda: Conda is an open-source, cross-platform package and environment management system. It’s like a “toolbox” for data scientists and developers to organize their projects.
PyTorch: PyTorch is an open-source machine learning library developed primarily by Facebook AI Research (FAIR). It’s especially popular for building deep learning applications, a subfield of machine learning inspired by how the human brain works.
TL;DR
- This post provides a step-by-step guide to installing Conda and PyTorch.
- The main challenge is finding compatible versions for the Power9 machine architecture.
Setting up Conda
We’ll start by installing Conda. On Power systems, the architecture is ppc64le (PowerPC 64-bit little-endian), so it’s essential to download the installer built for this architecture. We’ll use Miniconda, a lighter option that’s better suited for custom setups like the Power9 server.
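As a quick sanity check before downloading anything, Python’s standard library can report the machine architecture; on a Power9 host this should print ppc64le (the platform.machine() call is standard, the comparison value comes from this tutorial):

```python
import platform

# On an IBM Power9 server this should print "ppc64le";
# any other value means the ppc64le installer below won't run here.
arch = platform.machine()
print(arch)
print("Power little-endian:", arch == "ppc64le")
```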
- To download and install the latest version of Miniconda:
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-ppc64le.sh
bash Miniconda3-latest-Linux-ppc64le.sh
- Check if Conda was activated automatically:
conda --version
If it didn’t start automatically, you’ll need to activate it.
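If that’s the case, the lines below load Conda into the current shell by hand (this assumes the default Miniconda install location, ~/miniconda3; adjust the path if you installed elsewhere):

```shell
# Load Conda's shell functions and activate the base environment.
source ~/miniconda3/etc/profile.d/conda.sh
conda activate base
```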
- To ensure it’s automatically activated on each new connection, add the initialization command to your .bashrc (or .zshrc) file:
echo 'source ~/miniconda3/etc/profile.d/conda.sh' >> ~/.bashrc
source ~/.bashrc
Check again with the command:
conda --version
Expected output looks like: conda 23.10.0
Installing and configuring the PyTorch library
There are no official builds or Conda/PyPI wheels with full support for the ppc64le architecture. To install PyTorch, you’ll need to build it from source.
(Optional) Creating a Conda virtual environment
It’s recommended to create a dedicated virtual environment to install PyTorch in isolation.
- To create and activate the virtual environment, run:
conda create -y -n api_llm python=3.10
conda activate api_llm
Installing prerequisites
We need to install some packages required to properly build PyTorch.
- First, install the packages using the following commands:
conda install -y -c conda-forge openblas libblas cmake ninja python3-devel gcc-c++ rust cargo
CMake 4.x (the build tool used by PyTorch) dropped support for scripts that declare compatibility with versions older than 3.5, and some of PyTorch’s bundled dependencies still do. To address this, we need to pin a pre-4.0 version of CMake using pip.
- Run the command:
pip install cmake==3.27.7
To make sure the correct version was installed, run the command:
cmake --version
Expected output: cmake version 3.27.7
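To see why the pin matters, here’s a hedged sketch (the helper name and sample strings are illustrative, not part of the tutorial’s commands) that parses the first line of “cmake --version” and flags 4.x releases, which are the ones that reject the old compatibility declarations:

```python
def cmake_major(version_line):
    # Expects the first line of `cmake --version`, e.g. "cmake version 3.27.7".
    return int(version_line.split()[2].split(".")[0])

# Substitute the real output of `cmake --version` here.
line = "cmake version 3.27.7"
if cmake_major(line) >= 4:
    print("This CMake rejects scripts declaring compatibility with <3.5")
else:
    print("CMake major version OK:", cmake_major(line))
```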
Building PyTorch
Now let’s start the PyTorch build process.
- The first step is to clone the repository and set it up to install version 2.6.0:
git clone --recursive https://github.com/pytorch/pytorch
cd pytorch
git checkout v2.6.0
git submodule sync
git submodule update --init --recursive
- To install the required packages via pip, run the following command:
pip install -r requirements.txt
- And finally, build and install PyTorch with setup.py. Run it without sudo so the build uses the Conda environment’s Python rather than the system interpreter:
USE_CUDA=1 USE_DISTRIBUTED=1 USE_NCCL=1 USE_GLOO=1 USE_CUDNN=1 python setup.py install
The build process usually takes a while, around 15 minutes.
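If the build exhausts the machine’s RAM (common on hosts with many cores, since compilation defaults to one job per core), you can cap the number of parallel compile jobs before running setup.py. MAX_JOBS is honored by PyTorch’s build scripts; the value 8 here is just an illustrative starting point:

```shell
# Limit parallel compile jobs to reduce peak memory usage during the build.
export MAX_JOBS=8
```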
- To check if everything worked correctly, create a file named test_torch.py:
nano test_torch.py
This file should contain the following lines:
import torch
print(torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("Number of GPUs:", torch.cuda.device_count())
print("GPU name:", torch.cuda.get_device_name(0))
x = torch.rand(3, 3).cuda()
y = torch.rand(3, 3).cuda()
print("Sum on GPU:", (x + y))
print("cuDNN available:", torch.backends.cudnn.is_available())
print("C extensions loaded:", torch._C._cuda_getDeviceCount() > 0)
When you run this file, you’ll check:
- Installed PyTorch version
- CUDA availability
- Number of available GPUs
- GPU name on the Power9 server
- Whether GPU usage is working correctly
- CUDNN availability
- Whether the .so files were compiled correctly
This script simply verifies some CUDA and PyTorch information and performs a basic addition using GPU tensors.
- Run the file with the command:
python test_torch.py
Expected output should look something like:
2.6.0a0+git1eba9b3
CUDA available: True
Number of GPUs: 4
GPU name: Tesla V100-SXM2-16GB
Sum on GPU: tensor([[1.9163, 1.2208, 0.5998],
[1.7962, 0.6040, 1.3943],
[0.9536, 0.8010, 0.0668]], device='cuda:0')
cuDNN available: True
C extensions loaded: True
Keep in mind that the output may vary depending on the number and model of GPUs, and the tensor sums will differ on every run (the values are random). What matters is that the boolean checks in the script all print True.
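For scripts you want to run on machines both with and without a GPU, a device-agnostic variant of the same check avoids hard failures (this is a sketch, not part of the test file above):

```python
import torch

# Pick the GPU when one is visible, otherwise fall back to CPU,
# so the same script runs on any PyTorch build.
device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.rand(3, 3, device=device)
y = torch.rand(3, 3, device=device)
s = x + y
print("Sum on", device, ":", s)
```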
With this, PyTorch is installed and ready to use. In the next tutorial, we’ll run the first Language Model inference on the Power9 server.
