LLM Inference with Ollama on IBM POWER9 Using CPU

Context

This post presents a practical guide to performing inference with Large Language Models (LLMs) using Ollama in an IBM POWER9 environment. Ollama is a framework built on llama.cpp, designed to simplify deploying and running such models, offering a user-friendly interface and support for a variety of tasks.

Figure 1: Flow of a request

Despite the growth in LLM usage, material focused on the ppc64le architecture (IBM POWER9) is still quite limited. The tutorials that do exist tend to be outdated, poorly detailed, or aimed at more common architectures such as x86_64, which makes it difficult to reproduce the environment presented here. This is the first of two posts in this series, and it performs inference entirely on the CPU, exploring the ppc64le architecture in an up-to-date, practical, and reproducible way. In the next post, we will use a GPU to accelerate the process.

TL;DR

  • This post details how to configure an IBM POWER9 environment to perform LLM inference;
  • Execution is performed on the CPU using Ollama;
  • The main challenge is configuring the environment correctly, especially dependencies such as Go, GCC, and CMake, along with compatibility with RHEL-based distributions.

Environment Used

Hardware:

  • ppc64le architecture;
  • RAM: ~64 GB;
  • Execution: virtual machine (VM).

Operating System: AlmaLinux 8.10 (ppc64le), binary compatible with Red Hat Enterprise Linux (RHEL) 8.9/8.10.

Initial Setup

To run Ollama on the POWER9 architecture, it is necessary to prepare the environment with the appropriate dependencies. The first step is to update the system and install basic utilities:

sudo dnf update -y
sudo dnf install -y wget git tar make gcc gcc-c++ cmake gcc-toolset-11

Although this command installs the core dependencies, you still need to confirm that the correct versions are in use, as covered in the sections below.
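As a quick sanity check before moving on, a small helper loop (hypothetical, not required by the tutorial) can report which tools are already on the PATH:

```shell
# Hypothetical helper: report which required build tools are present
# before continuing with the setup.
check_tools() {
  for tool in go cmake gcc git make wget; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "$tool: found"
    else
      echo "$tool: missing"
    fi
  done
}
check_tools
```

Anything reported as missing will be installed in the following sections.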

Configuring Go

Ollama is written in Go, so it is necessary to ensure the appropriate version is available.

Expected Version: 1.25.7 linux/ppc64le

If not installed:

wget https://go.dev/dl/go1.25.7.linux-ppc64le.tar.gz
sudo tar -C /usr/local -xzf go1.25.7.linux-ppc64le.tar.gz
export PATH=/usr/local/go/bin:$PATH

To add to PATH permanently:

echo 'export PATH=/usr/local/go/bin:$PATH' >> ~/.bashrc
source ~/.bashrc

Verify that the version is correct: go version

Configuring CMake

Verify that the version is correct: cmake --version

Expected Version: cmake 3.26.5

If not installed:

wget https://github.com/Kitware/CMake/releases/download/v3.26.5/cmake-3.26.5.tar.gz
tar -xzf cmake-3.26.5.tar.gz
cd cmake-3.26.5
./bootstrap
make -j$(nproc)
sudo make install

Configuring GCC

Expected Version: gcc 11.2.1

Important: on AlmaLinux 8, gcc-toolset is not activated automatically. You must enable it manually for each session:

scl enable gcc-toolset-11 bash

This command activates GCC 11 only in the current shell session. If you open another terminal, you will need to run it again.
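To avoid repeating this in every new shell, one option is to source the toolset's enable script from ~/.bashrc. This is a sketch; /opt/rh/gcc-toolset-11/enable is the standard location shipped by gcc-toolset on RHEL/AlmaLinux 8:

```shell
# Persist the toolset activation for future shells. The 'enable' script
# is the standard path installed by gcc-toolset-11 on RHEL/AlmaLinux 8.
echo 'source /opt/rh/gcc-toolset-11/enable' >> ~/.bashrc
```

New shells will then pick up GCC 11 automatically, with no scl enable step.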

Verify the version: gcc --version

If not installed:

sudo dnf install -y gcc-toolset-11
scl enable gcc-toolset-11 bash

Cloning Ollama

With the environment configured, we can build Ollama. Here we clone the official Ollama repository and check out a specific release tag (important for POWER compatibility and for a stable build).

cd /root
git clone https://github.com/ollama/ollama.git
cd ollama

# Check out the release tag:
git checkout v0.9.4

To verify, use: git status (it should report "HEAD detached at v0.9.4").

Build Ollama

With the correct GCC version active:

export CGO_ENABLED=1
go clean -cache -modcache -i -r
go build -o ollama .

CGO must be enabled because Ollama depends on llama.cpp, which uses C/C++ code for performance optimizations. Without it, the build fails, since the Go code cannot link against the native backend.

The build should complete without errors and produce the ollama binary in the current directory.

To verify: ./ollama --version

Performing Inference

With Ollama compiled, we can start the server:

./ollama serve

An important observation: since the environment runs on a virtual machine, you cannot keep this command running in the main terminal and, at the same time, use another terminal in the same session for inference, unless you use an auxiliary tool to manage multiple terminals. Here we run the server in the background, though you could instead use tmux or screen; either way, the same terminal remains available for the remaining commands. To run it in the background:

./ollama serve &

To verify that it is running: ps aux | grep ollama. The output will look something like:

Figure 2: Ollama running
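If you prefer to script this check, a small helper (hypothetical, not part of Ollama) can poll the server's HTTP API, which listens on port 11434 by default, until it responds:

```shell
# Hypothetical helper: poll Ollama's default endpoint (port 11434) until
# the server answers, so follow-up commands do not race its startup.
wait_for_ollama() {
  tries=${1:-30}
  i=0
  while [ "$i" -lt "$tries" ]; do
    if curl -fsS http://localhost:11434/api/version >/dev/null 2>&1; then
      echo "ollama is up"
      return 0
    fi
    sleep 1
    i=$((i + 1))
  done
  echo "ollama did not start" >&2
  return 1
}
```

Usage: `wait_for_ollama 30` waits up to 30 seconds before giving up.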

Download the test model and run inference

For validation, we used the TinyLlama model, as it is lightweight and well suited to CPU execution. Since the server is running in the background, run in the same terminal:

./ollama pull tinyllama

To run inference:

./ollama run tinyllama "Is the sky blue?"

If everything has been done correctly, you will have something like:

Figure 3: Inference being executed

It is important to highlight that, by default, Ollama works with models from its own library, which are already converted and optimized for execution, generally in a format compatible with llama.cpp. These models can be easily used via the ollama pull command, as with the TinyLlama model in this example. External models can also be used, but this requires additional steps, such as converting them to a compatible format (for example, GGUF) and writing a Modelfile.
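As a rough sketch of those additional steps, a minimal Modelfile for a hypothetical local GGUF file (my-model.gguf is an assumed name, not part of this tutorial) could look like this:

```shell
# Sketch: wrap a hypothetical local GGUF file as an Ollama model.
# 'my-model.gguf' is an assumed filename for illustration only.
cat > Modelfile <<'EOF'
FROM ./my-model.gguf
PARAMETER temperature 0.7
EOF
# Then register and run it (requires the GGUF file to actually exist):
#   ./ollama create my-model -f Modelfile
#   ./ollama run my-model "Hello"
```

The FROM instruction points at the local weights, and PARAMETER lines set default inference options for the model.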

Final Considerations

With the steps presented, it was possible to configure the environment to run LLM inferences on an IBM POWER9 machine using the CPU. Although functional, this approach has limitations in performance, especially for larger models, due to the absence of GPU acceleration. As a next step, we intend to explore execution using GPU, evaluating performance gains and scalability.

Next Steps

  • Test newer versions and their compatibility;
  • Run benchmarking experiments comparing CPU inference performance against GPU inference;
  • Publish the second post in this series, covering GPU inference.