TensorFlow on IBM UFCG

Running VMs with KubeVirt on IBM Power9 (ppc64le)

Fri, 22 May 2026 00:00:00 +0000

Context

This post aims to present the process of adapting KubeVirt for the IBM POWER9 (ppc64le) architecture. It covers the main challenges encountered, the modifications made to the source code, the role of each component, and the results obtained at the end of the process.

KubeVirt is an operator that extends Kubernetes to manage virtual machines (VMs) as native resources. In traditional environments, VMs are managed by tools like libvirt/virsh, separate from the container ecosystem. KubeVirt eliminates this separation: with it, you can create, start, stop, and monitor VMs using the same Kubernetes commands and workflows — kubectl, YAML, namespaces, and RBAC. VMs run as real QEMU/KVM processes inside pods managed by Kubernetes.

The motivation for this work arose in the context of the Multiarq project, which maintains a shared HPC infrastructure on IBM POWER9. The ability to manage VMs and containers in the same Kubernetes cluster simplifies environment administration and opens the door to scenarios such as GPU passthrough for AI/ML workloads inside VMs, isolation of research environments, and multi-architecture compatibility testing.

The main challenge is that KubeVirt does not officially support ppc64le. Only x86_64 (amd64), arm64, and s390x are supported. This means the build system, API validations, configuration defaults, and the libvirt domain generation pipeline do not recognize ppc64le, defaulting everything to amd64.

TL;DR

KubeVirt does not officially support ppc64le; the entire pipeline assumes amd64 as a fallback.
We compiled Go binaries directly, bypassing the Bazel build system that does not recognize the architecture.
Patches were required across ~14 Go files and 4 new files were created to add ppc64le support.
Docker images were built with custom Dockerfiles and served via a local registry.
With these adaptations, it was possible to run a CirrOS ppc64le VM via KubeVirt on POWER9, managed entirely by Kubernetes.

Execution Environment

Architecture: IBM Power9 server (ppc64le).
Operating System: AlmaLinux 8.10, binary compatible with RHEL 8.9/8.10.
GPUs: 4x NVIDIA Tesla V100-SXM2-16GB.
Docker: Docker CE 26.1.3.
Kubernetes: v1.35.0 via minikube v1.38.0 (docker driver, containerd runtime).
KubeVirt: v1.8.2.
Go: 1.24.9.

What KubeVirt Is and How It Works

KubeVirt is composed of several components that work together to translate a Kubernetes resource (the VirtualMachineInstance, or VMI) into a real QEMU/KVM VM running on the host.

The virt-operator is the entry point: when the administrator creates the KubeVirt Custom Resource in the cluster, the operator provisions all other components — deployments, daemonsets, services, RBAC. It acts as a permanent installer that reconciles the desired state.

The virt-api handles Kubernetes API calls for KubeVirt resources. When the user runs kubectl apply on a VMI, virt-api validates the YAML (e.g., is the architecture supported? does the machine type exist?) and injects defaults (e.g., firmware UUID, CPU topology).

The virt-controller watches VMIs and decides where they should run. It creates a special pod — the virt-launcher — on the appropriate node, with all necessary configurations (volumes, devices, node selectors).

The virt-handler runs as a DaemonSet (one per node) and is the local agent that bridges Kubernetes and libvirt/QEMU. When the virt-launcher pod appears on the node, virt-handler reads the VMI spec, generates the libvirt domain XML, and instructs libvirt to create the VM. It also registers device plugins with the kubelet (/dev/kvm, /dev/net/tun, /dev/vhost-net) so pods can access the required devices.

The virt-launcher is the pod that encapsulates the VM. Each VMI generates a dedicated pod with three containers: compute (QEMU + libvirt), guest-console-log, and container disk. Inside the compute container, the QEMU process runs the actual VM — with its own kernel, memory, and virtual CPU.

The full flow is:

kubectl apply → API Server → virt-api (validates, injects defaults)
virt-controller detects the VMI → creates the virt-launcher pod on the appropriate node
kubelet starts the virt-launcher pod on the node
virt-handler detects the pod → reads the VMI spec → generates libvirt XML → calls libvirt
libvirt starts QEMU → VM runs inside the pod

It is important to note that the VM does not become a container — it runs as a real QEMU process inside a pod. Kubernetes manages the pod lifecycle, and KubeVirt translates between the two worlds.

Challenges and Adaptations

Build System

KubeVirt’s build system uses Bazel, which does not recognize ppc64le. The format_archname function in the build script only accepts x86_64, aarch64, and s390x. The solution was to compile the Go binaries directly with go build, bypassing Bazel.
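As a rough illustration of the pattern described above, the sketch below shows an architecture-normalizing helper with a ppc64le case, followed by a direct `go build` line. The function names, the `_out/` output directory, and the `./cmd/<component>` layout are assumptions made for this example; it is not KubeVirt's actual format_archname.

```shell
# Hypothetical helper modeled on the arch-normalizing pattern described above.
# It is NOT KubeVirt's actual format_archname; names and output are assumptions.
normalize_arch() {
    case "$1" in
        x86_64|amd64)  echo "amd64" ;;
        aarch64|arm64) echo "arm64" ;;
        s390x)         echo "s390x" ;;
        ppc64le)       echo "ppc64le" ;;  # the case a POWER build needs
        *)
            echo "unsupported architecture: $1" >&2
            return 1
            ;;
    esac
}

# Print (rather than run) a direct `go build` invocation for one component.
# The _out/ directory and ./cmd/<component> layout are assumptions.
print_build_cmd() {
    goarch="$(normalize_arch "$1")" || return 1
    echo "GOOS=linux GOARCH=${goarch} go build -o _out/$2 ./cmd/$2"
}

print_build_cmd ppc64le virt-api
```

Printing the command instead of running it keeps the sketch safe to try outside a KubeVirt checkout; the detailed guide linked later in the post remains the reference for the real build commands.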

An additional dependency is libnbd: virt-launcher requires version 1.18+, but AlmaLinux 8 only provides 1.6. It was necessary to compile libnbd 1.20 from source. The container-disk component is a C program (not Go) that requires static compilation to run in FROM scratch containers.

API Validation

The virt-api validation webhook rejects VMIs with an unknown architecture. Without the patch, a VMI with architecture: ppc64le would be rejected before even reaching the scheduler. It was necessary to add cases in the admitter and create a specific validation function for ppc64le.

Configuration Defaults

KubeVirt needs to know which machine type to use for each architecture (e.g., pc-q35 for amd64, virt for arm64). For ppc64le, we configured pseries as the default machine type — the virtual machine type for POWER.

Libvirt Domain Generation

This was the central challenge. KubeVirt converts the VMI spec into a libvirt domain XML that QEMU interprets. This pipeline has two parts:

The arch-defaulter sets default OS type values (arch and machine) in the XML. Without the patch, it returned x86_64 for ppc64le, causing libvirt to attempt creating an x86 VM on a POWER machine — resulting in the error No emulator found for arch 'x86_64'.

The converter is an interface with ~12 methods that define architecture-specific behaviors: whether USB is needed, SMBIOS, PCIe placement, ROM tuning, etc. Implementations existed for amd64, arm64, and s390x, but not for ppc64le. The code fell back to converterAMD64, generating incompatible configurations. We created converterPPC64LE with values appropriate for POWER: no USB, no SMBIOS, no PCIe placement, with VirtIO as the disk model.

After resolving the converter, a USB device error appeared: the graphics/video pipeline had no case for ppc64le, so libvirt added a default VGA video device that depended on USB, even though the USB controller was disabled (IsUSBNeeded: false). The solution was to add a ppc64le case in the video configurator with virtio as the video device, following the same pattern as s390x and arm64.

Finally, the CPU model: KubeVirt uses host-model as the default, which does not work with nested virtualization on POWER9. The solution was to specify POWER9 as the CPU model in the VMI.
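To make the settings discussed in this section concrete, here is a minimal sketch of a VMI manifest that sets the architecture, the pseries machine type, the POWER9 CPU model, and the autoattachGraphicsDevice: false workaround. The image reference and the memory request are hypothetical placeholders, not values taken from the project.

```shell
# Write a minimal VMI manifest for ppc64le to a temporary file.
# The image reference is a hypothetical placeholder for a CirrOS ppc64le
# containerDisk served from the local registry; the memory size is an assumption.
VMI_FILE="$(mktemp)"
cat > "$VMI_FILE" <<'EOF'
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: test-vmi
spec:
  architecture: ppc64le
  domain:
    machine:
      type: pseries
    cpu:
      model: POWER9
    devices:
      autoattachGraphicsDevice: false
      disks:
        - name: containerdisk
          disk:
            bus: virtio
    resources:
      requests:
        memory: 512Mi
  volumes:
    - name: containerdisk
      containerDisk:
        image: localhost:5000/cirros-ppc64le:latest
EOF

echo "manifest written to $VMI_FILE"
# To submit it to a cluster running the patched KubeVirt:
#   kubectl apply -f "$VMI_FILE"
```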

Docker Images

With no Dockerfiles in the project (everything is generated by Bazel), we created custom Dockerfiles for each component. The simpler components (virt-operator, virt-api, virt-controller, virt-exportproxy) use ubi8/ubi-minimal as the base. virt-handler requires additional system tools. virt-launcher is the most complex, using almalinux:8 as the base and dependencies on qemu-kvm, libvirt, and the compiled libnbd. A local registry (registry:2 on port 5000) serves the images to minikube.

Technical Step-by-Step Guide

Due to the large number of steps involved, a detailed step-by-step guide with all patches to the Go code, Dockerfiles, compilation commands, and configuration is available in the kubevirt-ppc64le-installation-guide.

Results

With all adaptations applied, it was possible to run a CirrOS ppc64le VM via KubeVirt on POWER9, managed entirely by Kubernetes:

$ kubectl get vmi test-vmi -o wide
NAME       AGE     PHASE     IP               NODENAME   READY
test-vmi   2m43s   Running   10.244.120.124   minikube   True

Data collected from inside the VM confirms correct execution:

Property	Value
Architecture	ppc64le
CPU	POWER9 (architected), altivec supported
Hypervisor	KVM
Platform	pSeries
Model	IBM pSeries (emulated by qemu)
Kernel	5.15.0-71-generic ppc64le

These results confirm that KubeVirt is generating the correct libvirt domain for ppc64le, with machine type pseries, POWER9 CPU, and KVM/QEMU virtualization with VirtIO paravirtualization.

Final Considerations

With the adaptations made, it became possible to use KubeVirt to create and manage virtual machines on an IBM POWER9 via Kubernetes. The VM that runs is a real KVM/QEMU VM — with its own kernel, isolated memory, and virtual CPU — managed like any other Kubernetes resource.

In the context of the Multiarq project, this solution allows unifying the management of containers and VMs in the same cluster, simplifying administration of the shared infrastructure. Workloads that require kernel isolation or direct hardware access (such as GPU passthrough) can run in VMs without leaving the Kubernetes ecosystem.

The patches made are potentially contributable to the upstream KubeVirt project. KubeVirt’s architecture already provides for extensibility by architecture — the interface pattern (Converter, ArchDefaulter) and per-arch switches make it straightforward to add new platforms. ppc64le follows the same pattern as s390x, which was also added to the project at a later stage.

Next Steps

Resolve the USB/Graphics conflict to allow VNC without the autoattachGraphicsDevice: false workaround, enabling graphical access to VMs;
Adjust the default CPU model in the code so that ppc64le automatically uses POWER9 without requiring manual specification in the VMI;
Explore GPU passthrough of the V100s via KubeVirt to run AI/ML workloads inside VMs managed by Kubernetes;
Test other distributions as containerDisk (Fedora, Ubuntu Server, AlmaLinux ppc64le) to validate compatibility beyond CirrOS;
Configure masquerade networking to enable live migration between nodes;
Document the changes in PR format for contribution to the KubeVirt upstream;
Validate KubeVirt on the Single Node OpenShift (OCP 4.21) already installed on the machine, using OpenShift Virtualization as the operator.

TensorFlow 2.21 CPU on IBM Power9 (ppc64le)

Mon, 04 May 2026 00:00:00 +0000

Context

TensorFlow (TF) is one of the most widely adopted machine learning frameworks. However, Google ended official support for pre-compiled ppc64le binaries in 2021, and the tensorflow/community repository was archived in 2025.

Environment Used

Hardware: ppc64le architecture;
RAM: ~64GB;
Execution: Virtual Machine (VM);
Operating System: AlmaLinux 8.10 (ppc64le), binary compatible with Red Hat Enterprise Linux (RHEL) 8.9/8.10.

Initial Setup (Installing TF 2.14)

As a starting point, we validated the installation of TensorFlow 2.14.1 (via RocketCE) on an IBM Power9 VM (ppc64le architecture) with AlmaLinux, using Miniforge (conda). Here are the commands for installation:

conda create -n tf214 python=3.11 -y
conda activate tf214
conda install -c rocketce tensorflow-cpu=2.14.1 -y
# expected output: 2.14.1
python -c "import tensorflow as tf; print(tf.__version__)"

The result should be a working TensorFlow 2.14.1 installation. This same version is also available on the Open-CE channels of Oregon State University and MIT. With TF 2.14 working, we have access to: Keras, TensorBoard, TensorFlow Hub, tensorflow-text, Hugging Face Transformers, Jupyter, and the entire classic ML stack.

TF 2.14 vs TF 2.21 (The Latest)

Version 2.14 is functional but is several versions behind the latest, 2.21. The most significant gap is that TF 2.14 is incompatible with two important tools:

Keras 3: a complete rewrite that transforms Keras into a multi-backend framework, allowing the same model and code to run on TensorFlow, PyTorch, or JAX without any changes. TF 2.14 only supports Keras 2.
NumPy 2: In addition to correcting dozens of historical API inconsistencies, NumPy 2.0 brings significant efficiency gains. TF 2.14 does not support NumPy 2.

Compiling TensorFlow 2.21 Natively on Power9 (CPU-Only)

Initially, we successfully compiled TensorFlow 2.21 (CPU-Only) directly from source code. This compilation was performed on an IBM Power9 VM and generated a native .whl package for linux_ppc64le. Subsequently, TF 2.21 had its functionality validated through a complete suite of tests. This is a fundamental milestone upon which GPU support will be built in the next stage.

Challenges: Hermeticity and x86 Dependency

TensorFlow’s modern build, driven by Bazel 7, embraces the “hermetic” model: it forces the use of pre-compiled binaries and build logic tied to the x86_64 and aarch64 architectures and to NVIDIA accelerators. For ppc64le, this means that a naive compilation simply fails when trying to download tools for incompatible architectures.

We identified four categories of blockage:

Bazel 7: Google does not distribute Bazel 7 for PowerPC. It would be necessary to compile it from scratch.
Hermetic Toolchains: TF 2.21 tries to download pre-compiled LLVM/Clang for x86 or aarch64, which doesn’t run on Power9.
CUDA/GPU Dependencies: Even in CPU-only mode, the build system tries to download and link giant NVIDIA libraries. Our strategy was to completely isolate GPU support with empty stubs, ensuring a stable CPU-only foundation before adding any accelerators.
Latent C++ Bugs: XLA and MLIR code contain constructs that work in Google’s Clang but break in the system’s default GCC 8.5, from AVX-512 flags to template ambiguities in absl::NoDestructor.

Compilation Process

Step 1: Compiling Bazel 7.1.0 from Scratch

Since Google does not distribute Bazel 7 for ppc64le, the first step was to compile Bazel itself from source, using the -dist.zip file, which already includes the bootstrap artifacts Bazel needs to build itself without depending on a previous version. The process requires Java 21 and takes between 1 and 2 hours depending on the cores available in the VM. The critical point is passing the correct variables to the compile.sh script. Without this step, none of the following steps are possible, because the bazel build command simply doesn’t exist for ppc64le otherwise. We created a tutorial covering the Bazel 7.1.0 installation process, which can be accessed in the repository.

Step 2: Bypass Strategy — Stub Repositories

With Bazel 7 functional on ppc64le architecture, we attacked the problem of hermetic dependencies. Our solution was to create “stub” repositories, empty local directories that satisfy Bazel’s dependency declarations without downloading anything:

LLVM stubs: Empty filegroups that satisfy toolchain rules without trying to install LLVM.
CUDA/ROCm/TensorRT stubs: Empty C++ libraries and Starlark rules that allow the build to proceed without missing dependency errors.
PyPI stubs: Stub Python modules that simulate the dependencies of Google’s hermetic pip, forcing the use of libraries from the conda environment.
Python stub: Redirects to the Python in our conda environment, bypassing the download of the hermetic Python that doesn’t exist for ppc64le.

All stubs are injected via --override_repository in the bazel build call, without altering the TensorFlow source code.
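One way to assemble such a flag list is sketched below, assuming each stub lives in its own subdirectory named after the Bazel repository it replaces. The directory names in the demo are illustrative stand-ins and not the repository names TensorFlow actually declares.

```shell
# Emit one --override_repository flag per stub directory.
# Assumption: each subdirectory of the stubs root is named after the Bazel
# repository it replaces; the layout and names below are illustrative only.
override_flags() {
    stubs_root="$1"
    for dir in "$stubs_root"/*/; do
        [ -d "$dir" ] || continue
        name="$(basename "$dir")"
        printf '%s=%s=%s\n' "--override_repository" "$name" "${dir%/}"
    done
}

# Demo with throwaway directories standing in for real stub repositories.
STUBS="$(mktemp -d)"
mkdir -p "$STUBS/stub_llvm" "$STUBS/stub_cuda"
override_flags "$STUBS"
```

The printed lines could then be appended to a bazel build invocation through command substitution, which keeps a long flag list derived from the directory tree rather than maintained by hand.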

Bypass Strategy — Stub Repositories

Step 3: Surgical Patches in the Source Code

With the build infrastructure resolved, we found 21 incompatibilities in TensorFlow’s C++ and Python code that manifest exclusively in the GCC 13 + ppc64le combination. The problems fell into three categories:

Clang-exclusive compilation flags that GCC rejects.
C++ template ambiguities in XLA and MLIR components that Google’s compiler masks but GCC 13 exposes.
References to CUDA and TensorRT headers that cease to exist when replaced by stubs.

Each incompatibility was resolved with a precise Python patch, without altering TensorFlow’s functional logic. The complete table with all 21 patches is available in the repository.

Step 4: The Compilation

With all patches applied, the final compilation is triggered with a single bazel build command. In addition to standard optimization flags, the command injects all stub repositories via --override_repository, totaling about 80 flags. Bazel’s incremental cache is fundamental here: each time a patch is needed and compilation is resumed, only the affected targets are recompiled. This transformed the “patch → compile → error → patch” cycle from unfeasible to manageable (about 4 hours).

The Definitive Solution: Conda Package and Binaries (Ready for Use)

So that the community doesn’t need to redo all of this build engineering, we packaged the result into a “plug and play” solution.

We made an official Release available in the repository containing the source code already with all patches applied and the generated native .whl binary. More importantly: we created and published a complete Conda recipe that automatically resolves classic C++ library compatibility issues (GLIBCXX and GCC mismatch) common on Power9.

Now, native TensorFlow 2.21 can be installed directly through our Conda channel, providing the same installation experience as official corporate distributions.

How to Install (Quick Tutorial)

To use TensorFlow 2.21 in your Power9 environment immediately, simply run:

conda create -n tf221 python=3.11 -y
conda activate tf221
conda install -c ufcg-ibm -c conda-forge tensorflow-cpu=2.21.0 -y

A detailed installation tutorial via Conda is also available in our repository.

Functional Result on IBM Power9 server

We installed the final package and executed a complete suite of 35 tests, covering eight functional categories: from basic tensor operations to model save/load and stress tests. All 35 tests passed. The stress test (5000×5000 matrix multiplication) successfully executed on the IBM Power9 CPU, and training an MLP for 20 epochs confirmed loss convergence, indicating that automatic differentiation, optimizers, and numerical operations are all working correctly from end to end.

IBM Tools using TensorFlow

IBM AI tools like AIF360, AIX360, and ART were already compatible with TensorFlow 2.14, as they are Python libraries that use the environment’s TF without binary coupling. The real value of native TensorFlow 2.21 compiled for Power9 lies in continuity: these libraries were already starting to declare dependencies on TF versions higher than 2.14, which meant that without this build, the Power9 environment would remain stuck on old and unsupported versions. Additionally, the improvements accumulated in TF between versions 2.14 and 2.21 bring incremental performance gains to fairness, explainability, and adversarial robustness analysis pipelines.

Reproducibility and Materials

The entire process and generated artifacts are documented and available in our repository:

Official Release: Altered source code and ready-to-use .whl binary.
Conda Installation Tutorial: Practical guide to install version 2.21 directly through our Conda channel.
Bazel 7.1.0 Compilation Tutorial: Step-by-step guide to compile Bazel 7.1.0 from source.
TensorFlow 2.21 Compilation Tutorial: Full guide to compile TensorFlow 2.21 with all necessary patches.

Impact

This compilation represents the latest version of TensorFlow natively available for ppc64le and with it:

Keras 3 becomes available for ppc64le for the first time.
NumPy 2.0 ceases to be a bottleneck for the Python scientific ecosystem on IBM Power9.
The Hugging Face Transformers stack gains more models compatible with Power9.

Next Steps

The TF 2.21 we compiled runs exclusively on CPU. The next challenge is to repeat the process with CUDA enabled on IBM Power9 servers equipped with NVIDIA GPUs. The stubs we created to isolate the GPU in this compilation were designed precisely to facilitate this transition: by replacing them with real CUDA libraries, we will have a solid starting point for GPU compilation. If successful, Power9 would have the latest deep learning framework with hardware acceleration, something non-existent today in any distribution for ppc64le.

LLM Inference with Ollama on IBM Power9 Using GPU

Thu, 16 Apr 2026 00:00:00 +0000

Context

This is the second post in the series about language model inference on POWER9 with Ollama. In this article, we will cover how to send requests using GPU, achieving a significant performance gain compared to the CPU approach shown in the previous post.

The main challenge is that Ollama does not offer official support for the ppc64le architecture with CUDA. The solution was found through an official IBM community blog, where a contributor made a fork of Ollama adapted to support NVIDIA GPUs on POWER9 via CUDA. However, that fork is outdated and does not support newer models like Gemma 3 and DeepSeek.

Therefore, we developed an updated fork, based on the official Ollama (v0.23.2), with the necessary patches for ppc64le and GPU support via CUDA. This tutorial explains how to compile Ollama for the ppc64le architecture, and for those who don’t want to compile, we also provide a pre-compiled binary in the releases on GitHub.

TL;DR

This post presents details on setting up the environment to perform inferences using IBM POWER9 infrastructure;
Ollama does not offer official support for ppc64le with CUDA;
The fork was compiled from scratch using CMake and Go, pointing to CUDA 12.2 and specifying the V100 architecture (sm_70);
A pre-compiled binary is also available on the project’s GitHub;
With this, it was possible to run LLM inference on IBM POWER9 with GPU acceleration and support for recent models.

Environment Used

Hardware:

ppc64le architecture;
Recommended minimum RAM: ~64GB;
GPU: NVIDIA Tesla V100;
NVIDIA driver: 535.54.03;
CUDA: version 12.2.

Operating System: AlmaLinux 8.10 (ppc64le), binary compatible with Red Hat Enterprise Linux (RHEL) 8.9/8.10.

Initial Checks

Verify that the driver and GPU are visible:

nvidia-smi

Verify that CUDA is installed:

nvcc --version

Note: If nothing appears, try:

export PATH=/usr/local/cuda-12.2/bin:$PATH
export CUDACXX=/usr/local/cuda-12.2/bin/nvcc

Also verify that CUDA 12 exists:

ls -la /usr/local/cuda-12

Running in a Virtual Environment

In this tutorial, we make the necessary configuration inside a virtual environment to isolate execution and settings. This is optional but recommended.

conda create -n ollamaGPU python=3.11 -y
conda activate ollamaGPU

To deactivate the environment:

conda deactivate

Initial Setup

To compile Ollama on POWER9, the following dependencies with appropriate versions are required:

Go: 1.26.0
GCC: 11.2.1 (via gcc-toolset-11)
CMake: >= 3.24
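A quick way to check a minimum version such as CMake >= 3.24 is sketched below. It relies on `sort -V` from GNU coreutils; the helper names and the sample version numbers are assumptions chosen for illustration.

```shell
# version_ge A B: succeeds when version A >= version B.
# Relies on `sort -V` (version sort), available in GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Report whether a detected version satisfies a minimum.
check_min() {
    if version_ge "$2" "$3"; then
        echo "$1 $2: ok (>= $3)"
    else
        echo "$1 $2: too old (need >= $3)"
    fi
}

# Sample values; on a real host these would come from
# `cmake --version`, `go version`, and `gcc --version`.
check_min cmake 3.26.5 3.24
check_min cmake 3.20.2 3.24
```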

Cloning and Building Ollama

With the environment configured, we can build Ollama. The compilation uses CMake to generate CUDA kernels with nvcc, and Go to compile the binary. An important detail is the CUDA_ARCHITECTURES=70 parameter: each NVIDIA GPU has a specific architecture identified by an sm_XX code, and the V100 is from the Volta architecture (sm_70). By specifying this value, we instruct the build to compile only for the V100, reducing compilation time.
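The relationship between a GPU's compute capability and the architecture value can be sketched as follows. The helper names and the `build` directory are assumptions; the exact configure line used by the fork is the one documented in its README.

```shell
# Convert a compute capability such as "7.0" into the value expected by
# CMAKE_CUDA_ARCHITECTURES ("70"), i.e. the XX in sm_XX.
cc_to_arch() {
    echo "$1" | tr -d '.'
}

# Print (rather than run) a CMake configure line for that architecture.
# The build directory name is an assumption; the fork's README is authoritative.
print_configure_cmd() {
    echo "cmake -B build -DCMAKE_CUDA_ARCHITECTURES=$(cc_to_arch "$1")"
}

print_configure_cmd 7.0
# On a host with a recent NVIDIA driver, the capability can be queried with:
#   nvidia-smi --query-gpu=compute_cap --format=csv,noheader
```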

The complete step-by-step compilation, including the necessary fixes for ppc64le, as well as installation and configuration of the dependencies mentioned earlier, is documented in the repository’s README.

For those who don’t want to compile, a pre-compiled binary is available directly from the releases page:

# Download the binary
wget https://github.com/llm-pt-ibm/ollama-ppc64le/releases/download/v0.23.2-ppc64le-power9/ollama-ppc64le
# Give execute permission
chmod +x ollama-ppc64le

Note: The repository contains branches of the official Ollama. The patches for ppc64le are exclusively in the ollama-ppc64le branch.

Performing Inference

With Ollama compiled, we can start the server:

./ollama serve

To verify it worked, type: ps aux | grep ollama.

Wait a few seconds and check the logs to confirm the server detected the GPUs correctly. Look for these lines:

inference compute ... library=CUDA compute=7.0 ... description="Tesla V100-SXM2-16GB" total="16.0 GiB"
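If the server output is saved to a file, a small filter can confirm the detection automatically. The sketch below assumes the log line has the shape shown above; the sample is an abbreviated copy of that line with the elided fields left out, and the helper name is hypothetical.

```shell
# gpu_names: read server log lines on stdin and print the description of
# every device reported with library=CUDA.
gpu_names() {
    grep 'library=CUDA' | sed -n 's/.*description="\([^"]*\)".*/\1/p'
}

# Abbreviated sample based on the log line shown above.
SAMPLE_LOG='inference compute library=CUDA compute=7.0 description="Tesla V100-SXM2-16GB" total="16.0 GiB"'

echo "$SAMPLE_LOG" | gpu_names
```

An empty result would suggest the server fell back to CPU, which is worth checking before running any benchmark.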

Download the test model and run inference

For validation, we used the llama3.1:8b model. In another terminal, run:

./ollama pull llama3.1:8b

To run inference:

./ollama run llama3.1:8b "tell me all odd numbers up to 100"

Confirm GPU usage

In another terminal, with inference running, run:

nvidia-smi

In the processes section, you should see ollama with memory allocated on one of the GPUs:

Ollama using the GPU

Final Considerations

With the steps presented, it was possible to configure the environment to run LLM inference on an IBM POWER9 machine using NVIDIA Tesla V100 GPUs. With this approach, model inference has a significant performance gain compared to CPU execution. Using the Meta Llama 3.1 8B Instruct model as a reference, GPU execution achieved a higher token generation rate than CPU execution.

Let’s look at the collected data for the same request (tell me all odd numbers up to 100) with both types of execution:

	CPU	GPU
Token generation rate	0.71 tokens/s	79.82 tokens/s
Total duration	3m49s	4.52s
Prompt evaluation rate	10.67 tokens/s	295.77 tokens/s

With the data presented in the table, we see that GPU execution was approximately 112 times faster in token generation, with total response time reduced from 3 minutes and 49 seconds to 4.52 seconds.
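The ratios behind these figures can be reproduced with a few lines of arithmetic, using only the values from the table (3m49s converted to 229 seconds). The helper name is an assumption.

```shell
# ratio A B: print A/B rounded to one decimal place.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

# Values taken from the table above.
echo "token generation speedup:  $(ratio 79.82 0.71)x"
echo "prompt evaluation speedup: $(ratio 295.77 10.67)x"
# 3m49s = 229 s
echo "total duration speedup:    $(ratio 229 4.52)x"
```

This gives roughly 112.4x for token generation, 27.7x for prompt evaluation, and 50.7x for total duration, so the headline figure of 112 times refers specifically to the generation rate.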

Next Steps

Evaluate GPU and CPU execution in a comparative post and with other architectures;
Test GPU inference with larger models, with more than 8 billion parameters, for example;
Test new models available in the updated fork, such as Gemma 3 and DeepSeek.

LLM Inference with vLLM Using GPU on Power9

Fri, 10 Apr 2026 00:00:00 +0000

Background

This post aims to present the steps necessary to install vLLM in an IBM POWER9 environment (ppc64le architecture). The main required resources, modifications, dependencies, versions used, and installation steps necessary to run inference with a given model will be detailed.

vLLM is a tool focused on serving and efficient inference of large language models (LLMs), allowing models to be exposed through an API and execute inference in an optimized way, especially in GPU environments.

The need to install vLLM arose during the data generation process with the InstructLab tool. In that workflow, it is necessary to use a teacher model to generate synthetic data that will later be used for training or fine-tuning other models. For this, it is possible to use tools such as llama-cpp, already compatible with the IBM POWER9 environment, or vLLM, which was not yet available due to installation difficulties on this architecture. Unlike llama-cpp, which is more geared towards local execution and smaller-scale scenarios, vLLM stands out for better GPU utilization and the ability to handle multiple requests simultaneously in an efficient manner, being more suitable for large-scale inference scenarios and production environments.

Thus, we will present the technical steps required to make the vLLM installation feasible in the IBM POWER9 environment (ppc64le), describing the adaptations made so that the tool works correctly in this context.

TL;DR

Compilation and installation of LLVM, required as build infrastructure for subsequent dependencies.
Compilation and adaptation of Triton, including adjustments for compatibility with the Power9 architecture.
Installation and configuration of vLLM, considering its dependencies and specific runtime requirements.
Development of containers containing the entire configured environment for executing the tool.
Practical demonstration of using the images, including server startup and running inference using GPU.

Execution Environment

The environment used for the vLLM installation includes:

Architecture: IBM Power9 Server (ppc64le architecture).
Operating System (OS): AlmaLinux 8.10 binary compatible with Red Hat Enterprise Linux (RHEL) 8.9/8.10.
RAM: 512GB.
GPUs: 4x NVIDIA Tesla V100 SXM2 16GB (NVLink2).

Dependencies and Installation

During the vLLM build process, three main dependencies stand out: LLVM, Triton, and PyTorch. Each is essential for the tool to function correctly, and none of them has native support for ppc64le.

LLVM constitutes the foundation of the compilation infrastructure used throughout the process, being responsible for generating, optimizing, and transforming intermediate representation into executable low-level code. In the context of vLLM, its role is essential to enable efficient execution of GPU kernels, especially those defined by Triton, which rely directly on its compilation backends (components responsible for generating optimized code for different hardware architectures). Triton, in turn, acts as the component responsible for defining and executing GPU-optimized kernels, playing a central role in the inference efficiency of language models. Its integration with LLVM allows generating highly optimized code for different architectures. PyTorch provides the foundation for tensor manipulation and model execution, offering the fundamental operations for GPU inference, in addition to serving as an interface to acceleration mechanisms and low-level libraries.

Dependency flow for compiling vLLM on Power9.

Due to the lack of native support for these packages on the ppc64le architecture, their use on IBM POWER9 required several adaptations based on the official repositories of these tools. These modifications ranged from fixing incompatibilities in specific methods to adjusting sub-dependencies that did not support the ppc64le architecture, as well as using Conda to help manage environments and dependencies. In some cases, manual compilation of additional components was also necessary. After overcoming these challenges, it became possible to install and run vLLM on the IBM POWER9 environment.

Due to the large number of steps involved, the detailed step-by-step procedures are presented in the vllm installation guide. It is worth noting that each of the steps described is essential to guarantee the correct compilation and execution of vLLM in the proposed environment.

Containerization

During the installation process, it was observed that the large number of involved steps could make environment reproduction difficult and lead to inconsistent scenarios. Because of this, we chose containerization of the solution as a way to make the experiment reproducible, portable, and simpler to use for other users.

For this, we provide (in this repository) scripts responsible for both building the images and automating execution, organizing all necessary steps. These scripts perform tasks such as identifying available resources, copying required CUDA binaries, and starting vLLM properly.

Execution was simplified so that the user only needs to provide the local path of the model to be used. Parameters such as port, number of GPUs, and image to be executed are optional and have predefined default values.
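The "one required argument, everything else defaulted" interface can be sketched with shell parameter expansion. This is a hypothetical illustration: the function name, the default port, the default GPU count, and the image name are all assumptions and do not reflect the values used by the project's scripts.

```shell
# Hypothetical launcher sketch: the model path is required; port, GPU count,
# and image are optional. All default values and the image name are
# assumptions, not those of the project's scripts.
resolve_settings() {
    model_path="$1"
    if [ -z "$model_path" ]; then
        echo "error: MODEL_PATH is required" >&2
        return 1
    fi
    port="${2:-8000}"
    gpus="${3:-1}"
    image="${4:-vllm-ppc64le:latest}"
    echo "model=$model_path port=$port gpus=$gpus image=$image"
}

resolve_settings /data/models/my-model
resolve_settings /data/models/my-model 9000 4
```

Resolving the settings in one place, before any container is started, makes it easy to print or log the effective configuration for a run.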

Repository developed for running vLLM via containers.

Additionally, we provide a video (vLLM Power9 demonstration) that demonstrates the use of vLLM from the provided repository.

Final Considerations

With the resources provided in this repository, it became possible to automate the process of installing and using vLLM on ppc64le architectures with V100 GPUs.

In the context of the IBM-MultiArq project, this solution proves especially relevant for using InstructLab, enabling local execution of teacher models via vLLM, expanding experimentation and development possibilities within the proposed environment.

Next Steps

As a continuation of this work, we propose conducting a comparative performance study between llama-cpp and vLLM. Additionally, the repository was structured to provide continuous support for vLLM, including its adaptation to future versions, the identification of remaining limitations, and the evolution of solutions as new challenges arise.

Installing Docker in a ppc64le (Power9) Environment

Wed, 01 Apr 2026 00:00:00 +0000

Context

Given the need to standardize software execution on our IBM Power9 (ppc64le) server, containers are a robust solution for avoiding environment conflicts. This post continues the work of structuring our infrastructure by detailing the installation of Docker Engine on AlmaLinux. Adopting this technology is strategically important for ensuring strict dependency isolation and portability across applications. With it, we can package everything from general-purpose libraries to more complex services, ensuring a clean, secure, and highly reproducible runtime environment.

Docker Engine has official support for AlmaLinux on the x86_64, arm64, s390x, and ppc64le architectures, which allows us to use it directly on Power9 without special adaptations. However, some care is required before and during installation, such as uninstalling tools that conflict with Docker and ensuring the images used are compatible with ppc64le.

TL;DR

This post presents the step-by-step process for installing Docker Engine on AlmaLinux in the ppc64le architecture.
You must remove Podman and Buildah before installing, because they conflict with Docker.
Docker Hub images need explicit ppc64le support to work on Power9.

Environment Used

Architecture: IBM Power9 server (ppc64le architecture)
Operating System (OS): AlmaLinux 8.10 binary compatible with Red Hat Enterprise Linux (RHEL) 8.9/8.10
RAM: 512GB

Prerequisites

Before installing Docker, it is important to be aware of a firewall limitation: when exposing container ports with Docker, those ports bypass the default firewalld rules. Make sure this does not pose a problem for your environment before proceeding. It is also important to note that Docker Engine is compatible with Rocky Linux 8 and 9 and AlmaLinux 8 on the ppc64le architecture.

Removing Conflicting Packages

AlmaLinux includes Podman and Buildah by default. These packages conflict with Docker Engine and must be removed before installation. It is also recommended to remove any older Docker versions that might be present:

sudo dnf remove -y podman \
    buildah \
    docker \
    docker-client \
    docker-client-latest \
    docker-common \
    docker-latest \
    docker-latest-logrotate \
    docker-logrotate \
    docker-engine

Adding the Docker Repository and Installing Required Packages

Repository Setup

The recommended installation method is to use Docker’s official repository. It is worth mentioning that Docker uses the CentOS repository for RHEL-based distributions such as AlmaLinux, and this is officially supported. First, install the dnf-plugins-core package and add the repository:

sudo dnf install -y dnf-plugins-core
sudo dnf config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo

Installing Docker Engine

With the repository configured, install the latest version of Docker Engine along with the build and compose plugins:

sudo dnf install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

Starting the service

Unlike Debian-based distributions such as Ubuntu, Docker does not start automatically on AlmaLinux after installation. You need to start the service manually and enable it so it comes up with the system:

sudo systemctl start docker
sudo systemctl enable docker

Verifying the Installation

To confirm that everything was installed correctly, run the hello-world image. Docker will automatically detect the ppc64le architecture and pull the correct image:

sudo docker run hello-world

The expected output is a message confirming that Docker is working correctly.

Post-installation Configuration

By default, only the root user or users with sudo privileges can run Docker commands. To avoid using sudo on every command, add your user to the docker group. First, create the group if it does not already exist:

sudo groupadd docker

Then add your user to the group:

sudo usermod -aG docker $USER

You need to log out and log back in for the permissions to take effect.

Tips for Power9 Architecture

Because we are using IBM Power9, a few additional considerations matter when working with Docker Hub. The first is image compatibility: not all images available on Docker Hub support ppc64le. Images built only for x86_64 will fail on Power9, so always verify that the desired image lists ppc64le among its supported architectures (the OS/ARCH column on the image's Tags page) before using it.
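The compatibility check can also be scripted. The sketch below is a minimal example, assuming the JSON printed by `docker manifest inspect <image>` uses the manifest-list layout (a `manifests` array whose entries carry a `platform` object). The function names and the sample data are illustrative and not part of the original setup.

```python
import json
import subprocess


def supports_arch(manifest: dict, arch: str = "ppc64le", os_name: str = "linux") -> bool:
    """Return True if a manifest list advertises the given OS/architecture pair."""
    for entry in manifest.get("manifests", []):
        platform = entry.get("platform", {})
        if platform.get("architecture") == arch and platform.get("os") == os_name:
            return True
    # Single-platform manifests carry no platform list, so they end up here.
    return False


def inspect_image(image: str) -> dict:
    """Run `docker manifest inspect` and parse its JSON output (needs registry access)."""
    result = subprocess.run(
        ["docker", "manifest", "inspect", image],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


# Hand-written sample in the manifest-list layout (not real registry output).
sample = {
    "manifests": [
        {"platform": {"architecture": "amd64", "os": "linux"}},
        {"platform": {"architecture": "ppc64le", "os": "linux"}},
    ]
}
print(supports_arch(sample))
```

On the server itself, `supports_arch(inspect_image("hello-world"))` would perform the same check against the registry. An image that publishes a single-platform manifest is reported as unsupported by this sketch, so a `False` result means "check manually", not necessarily "incompatible".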

To validate that Docker is running correctly and recognizing the machine architecture, use:

docker version --format '{{.Server.Arch}}'

The expected output is ppc64le.

Final Considerations

Installing Docker Engine on AlmaLinux (ppc64le) follows a straightforward path as long as conflicts with Podman and Buildah are resolved beforehand. Official ppc64le support from Docker provides a stable experience on Power9, with the important caveat that image compatibility must always be checked before use.

With Docker installed and configured, the environment is ready to run containers and move on to the next steps in our language model infrastructure.

LLM Inference with Ollama on IBM Power9 Using CPU

Wed, 01 Apr 2026 00:00:00 +0000

Context

This post presents a practical guide for running inference with Large Language Models (LLMs) using Ollama in an IBM POWER9 environment. Ollama is a framework based on llama.cpp, designed to simplify the deployment and execution of such models, offering a user-friendly interface and support for various tasks.

Flow of a request

Despite the growth in LLM usage, the availability of materials focused on the ppc64le architecture (IBM POWER9) is still quite limited. In general, available tutorials are old, poorly detailed, or focused on more common architectures like x86_64, which makes reproducing the environment in the presented context difficult. This is the first of two posts in this series, which aims to perform inference entirely via CPU, exploring the ppc64le architecture, in an updated, practical, and reproducible way. In the next post, we will address the use of GPU to accelerate the process.

TL;DR

This post details how to configure the environment to run inference on IBM POWER9 infrastructure.
Execution is performed via CPU using Ollama.
The main challenge is configuring the environment correctly, especially dependencies like Go, GCC, and CMake, in addition to compatibility with RHEL.

Environment Used

Hardware:

ppc64le architecture;
RAM: ~64GB;
Execution: Virtual Machine (VM);

Operating System: AlmaLinux 8.10 (ppc64le), binary compatible with Red Hat Enterprise Linux (RHEL) 8.9/8.10.

Initial Setup

To run Ollama on the POWER9 architecture, it is necessary to prepare the environment with the appropriate dependencies. The first step is to update the system and install basic utilities:

sudo dnf update -y
sudo dnf install -y wget git tar make gcc gcc-c++ cmake gcc-toolset-11

Although this command installs some dependencies, it is necessary to ensure that the correct versions are being used.

Configuring Go

Ollama is developed in Go, so it is necessary to ensure the appropriate version.

Expected Version: 1.25.7 linux/ppc64le

If not installed:

wget https://go.dev/dl/go1.25.7.linux-ppc64le.tar.gz
sudo tar -C /usr/local -xzf go1.25.7.linux-ppc64le.tar.gz
export PATH=/usr/local/go/bin:$PATH

To add to PATH permanently:

echo 'export PATH=/usr/local/go/bin:$PATH' >> ~/.bashrc
source ~/.bashrc

Verify if the version is correct: go version

Configuring CMake

Verify if the version is correct: cmake --version

Expected Version: cmake 3.26.5

If not installed:

wget https://github.com/Kitware/CMake/releases/download/v3.26.5/cmake-3.26.5.tar.gz
tar -xzf cmake-3.26.5.tar.gz
cd cmake-3.26.5
./bootstrap
make -j$(nproc)
sudo make install

Configuring GCC

Expected Version: gcc 11.2.1

Important: On AlmaLinux 8, the gcc-toolset is not activated automatically. It is necessary to enable the session manually:

scl enable gcc-toolset-11 bash

This command activates GCC only in the current session. If you open another terminal, you will need to run the command again.

Verify the version: gcc --version

If not installed:

sudo dnf install -y gcc-toolset-11
scl enable gcc-toolset-11 bash
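The three version checks (Go, CMake, and GCC) can be collected into a single script. This is a minimal sketch: the expected versions are the ones listed in this post, while the helper names and the version-matching regular expression are assumptions made for illustration.

```python
import re
import subprocess

# Versions this post expects for each tool.
EXPECTED = {"go": "1.25.7", "cmake": "3.26.5", "gcc": "11.2.1"}
COMMANDS = {
    "go": ["go", "version"],
    "cmake": ["cmake", "--version"],
    "gcc": ["gcc", "--version"],
}


def extract_version(output: str):
    """Return the first dotted version number found in the text, or None."""
    match = re.search(r"\d+\.\d+(?:\.\d+)?", output)
    return match.group(0) if match else None


def installed_version(tool: str):
    """Run the tool's version command; return None if it is missing or fails."""
    try:
        result = subprocess.run(COMMANDS[tool], capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        return None
    return extract_version(result.stdout)


def mismatches(found: dict) -> dict:
    """Map each tool whose version differs to an (expected, found) pair."""
    return {
        tool: (expected, found.get(tool))
        for tool, expected in EXPECTED.items()
        if found.get(tool) != expected
    }
```

Calling `mismatches({tool: installed_version(tool) for tool in EXPECTED})` returns an empty dictionary when everything matches. It should be run inside the session opened by `scl enable gcc-toolset-11 bash`, otherwise the system GCC is the one reported.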

Cloning Ollama

With the environment configured, we can build Ollama. Here we clone the official Ollama repository and change the version used (important for POWER compatibility and to get a stable version).

cd /root
git clone https://github.com/ollama/ollama.git
cd ollama
# Change the version:
git checkout v0.9.4

To verify, use: git status

Build Ollama

After activating GCC in the correct version:

export CGO_ENABLED=1
go clean -cache -modcache -i -r
go build -o ollama .

CGO needs to be enabled because Ollama depends on llama.cpp, which uses C/C++ code for performance optimizations. Without it, the build fails or loses compatibility with the architecture.

The build should complete without errors and produce the ollama binary in the current directory.

To verify: ./ollama --version

Performing Inference

With Ollama compiled, we can start the server:

./ollama serve

An important observation: since the environment runs on a virtual machine, the server command occupies the main terminal, and you cannot use another terminal in the same session to perform inference without an auxiliary tool to manage multiple terminals. Here we run the server in the background; alternatively, you can use tmux or screen. Either way, the same terminal remains available for the remaining commands (which we will see next). To run the server in the background:

./ollama serve &

To verify if it worked: ps aux | grep ollama. It will show something like:

Ollama running
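Besides checking the process list, you can wait until the server actually answers HTTP requests before sending prompts. The sketch below assumes Ollama is listening on its default address, 127.0.0.1:11434 (that is, OLLAMA_HOST has not been changed). The function names are illustrative.

```python
import time
import urllib.error
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434/"  # assumed default listen address


def server_is_up(url: str = OLLAMA_URL, timeout: float = 2.0) -> bool:
    """Return True if the server answers the request with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(probe=server_is_up, attempts: int = 30, delay: float = 1.0) -> bool:
    """Call probe() up to `attempts` times, sleeping `delay` seconds after each failure."""
    for _ in range(attempts):
        if probe():
            return True
        time.sleep(delay)
    return False
```

A script that starts `./ollama serve &` can then call `wait_until_ready()` and continue with `./ollama pull` only when it returns `True`.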

Download the test model and run inference

For validation, we used the TinyLlama model, as it is lightweight and suitable for CPU execution. For this, in another terminal, run:

./ollama pull tinyllama

To run inference:

./ollama run tinyllama "The sky is blue?"

If everything has been done correctly, you will have something like:

Inference being executed

It is important to highlight that Ollama works, by default, with models available in its own repository, which are already converted and optimized for execution, generally in a format compatible with llama.cpp. These models can be easily used via the ollama pull command, as in the case of TinyLlama used in this example. Although it is possible to use external models, this requires additional steps, such as conversion to compatible formats (for example, GGUF) and the creation of a Modelfile.
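For the external-model path, the Modelfile can be generated by a small script once a GGUF file is available. This is a hedged sketch: `FROM` and `PARAMETER` are standard Modelfile instructions, but the file path, the temperature value, and the helper names are placeholders chosen for illustration.

```python
from pathlib import Path


def render_modelfile(gguf_path: str, temperature: float = 0.7) -> str:
    """Build the text of a minimal Modelfile pointing at a local GGUF file."""
    lines = [
        f"FROM {gguf_path}",                      # local model weights in GGUF format
        f"PARAMETER temperature {temperature}",   # illustrative sampling parameter
    ]
    return "\n".join(lines) + "\n"


def write_modelfile(directory: Path, gguf_path: str, temperature: float = 0.7) -> Path:
    """Write the Modelfile into `directory` and return its path."""
    target = directory / "Modelfile"
    target.write_text(render_modelfile(gguf_path, temperature))
    return target
```

After writing the file, the model would be registered with `./ollama create my-model -f Modelfile` and used with `./ollama run my-model`, where `my-model` is a name of your choice.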

Final Considerations

With the steps presented, it was possible to configure the environment to run LLM inferences on an IBM POWER9 machine using the CPU. Although functional, this approach has limitations in performance, especially for larger models, due to the absence of GPU acceleration. As a next step, we intend to explore execution using GPU, evaluating performance gains and scalability.

Next Steps

Test newer versions and compatibility between them;
Conduct benchmarking experiments to compare CPU inference performance against GPU inference;
Publish the second post in this series, covering GPU inference.

Power9 Virtualization: how we structured an isolated environment with KVM and Libvirt

Fri, 27 Mar 2026 00:00:00 +0000

Context

Given the need to establish isolated and secure environments for installing libraries, frameworks, and general-purpose tools, environment encapsulation emerged as an effective solution, implemented through KVM managed via virt-manager and virsh.

Virtualization is widely used in x86 environments, with mature tooling and established workflows. However, when migrating to architectures such as IBM Power9 (ppc64le), many of these processes are no longer straightforward and require architecture-specific adaptations. Below, we provide a diagram showing this interaction across four layers.

Communication flow between Hardware (Power9) and Virtual Machines

The flow is organized into the following layers:

Figure 1: Diagram representing a 4-layer virtualization architecture.

In this work, we explore how to build a virtualized environment using KVM and Libvirt on a Power9 server, with focus on isolation, reproducibility, and shared team usage.

TL;DR

We implemented a virtualized environment on Power9 using KVM + Libvirt.
We adapted common virtualization workflows to ppc64le, solving permission, write-lock, and provisioning issues.
The environment provides secure isolation between users and straightforward VM management.
We provide ready-to-use images with NVIDIA/CUDA drivers for immediate use.

Environment used

Architecture: IBM Power9 server (ppc64le architecture).
Operating System (OS): AlmaLinux 8.10 binary-compatible with Red Hat Enterprise Linux (RHEL) 8.9/8.10.
RAM: 512GB.
Execution: Virtual Machine Manager (virt-manager) for Virtual Machine (VM) management.
Hypervisor: KVM (Kernel-based Virtual Machine) / QEMU.
Management: Libvirt (virsh, virt-install, virt-customize).
Storage: Virtual disks in .qcow2 format.
GPUs: 4x NVIDIA Tesla V100 SXM2 16GB (NVLink2).

Installing the virtualization environment (KVM + Libvirt)

Before creating any VM, you need to install and configure KVM and Libvirt on the Power9 server.

Package installation:

sudo dnf install -y qemu-kvm libvirt libvirt-client libvirt-daemon libvirt-daemon-kvm virt-install virt-viewer guestfs-tools \
    libguestfs-tools python3-libvirt

Starting the service:

sudo systemctl enable --now libvirtd
sudo systemctl status libvirtd

Adding your user to the libvirt group:

This lets non-root users manage VMs without requiring sudo for every command.

Run the command below:

sudo usermod -aG libvirt $(whoami)

Log out and log back in for the change to take effect.

Verifying the installation:

Check virsh version:

sudo virsh version

Validate CPU virtualization support:

sudo virt-host-validate

Setup

Environment preparation:

In KVM, the fastest way to provision VMs is to clone a “seed” image (.qcow2) and expand it, instead of performing a clean install from ISO. To keep things organized, all virtual disks should be stored in a dedicated directory (/home/user/discos in the examples below).

Download the AlmaLinux 8 base image:

cd /home/user/
wget https://repo.almalinux.org/almalinux/8/cloud/ppc64le/images/AlmaLinux-8-GenericCloud-latest.ppc64le.qcow2 -O alma8_base.qcow2

Hypervisor management:

Hypervisor and instance administration follows specific procedures to ensure system stability. The administrator commands below control the virtualization services on Power9:

Stop KVM services:

sudo systemctl stop libvirtd

Start KVM services again:

sudo systemctl start libvirtd

Enable at boot:

sudo systemctl enable libvirtd

Permission setup:

The system user running KVM (qemu) needs permission to access VM disks. If disks are stored inside a personal home directory, Linux blocks access by default. To allow hypervisor access without exposing personal files, grant execute (o+x) permission on the directories:

Allow qemu to traverse the home directory (traversal only, no read permission):

chmod o+x /home/user

Allow qemu to access the disk directory:

chmod o+x /home/user/discos

Virtual network configuration (Libvirt):

Libvirt creates a default NAT network (default) that places VMs in the 192.168.122.0/24 range. VMs can access the internet through NAT, but they are not directly reachable from external networks without additional setup.

Check network status:

sudo virsh net-list --all

If inactive, start and enable at boot:

sudo virsh net-start default
sudo virsh net-autostart default

If the network does not exist, define and initialize it:

sudo virsh net-define /usr/share/libvirt/networks/default.xml
sudo virsh net-start default
sudo virsh net-autostart default

If the XML file is missing, install the network config package:

sudo dnf install -y libvirt-daemon-config-network
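When debugging connectivity, it can help to confirm that an address reported for a VM really belongs to the default NAT range mentioned above. The sketch below uses only the 192.168.122.0/24 range stated in this post; the function name and the sample addresses are illustrative.

```python
import ipaddress

# Range of libvirt's default NAT network, as described in this post.
DEFAULT_NAT = ipaddress.ip_network("192.168.122.0/24")


def in_default_network(address: str, network=DEFAULT_NAT) -> bool:
    """Accept '192.168.122.45' or '192.168.122.45/24' and test membership in the network."""
    return ipaddress.ip_interface(address).ip in network


print(in_default_network("192.168.122.45/24"))
print(in_default_network("10.0.0.5"))
```

An address outside this range suggests the VM is attached to a different libvirt network or to a bridge.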

Creating new VMs:

Clone the base image:

cp /home/user/alma8_base.qcow2 /home/user/discos/nome_vm.qcow2

Expand the disk (must be done BEFORE creating the VM):

qemu-img resize /home/user/discos/nome_vm.qcow2 +100G

Create the VM:

sudo virt-install \
  --connect qemu:///system \
  --name vm_nome \
  --memory 131072 \
  --vcpus 16 \
  --cpu host \
  --disk path=/home/user/discos/nome_vm.qcow2,format=qcow2 \
  --import \
  --os-variant almalinux8 \
  --network network=default \
  --graphics none \
  --noautoconsole

Post-creation VM customization:

After creating the VM, you must set the root password, since cloud images usually come without one. We use virt-customize for this. Important: the VM must be powered off before its disk can be edited safely.

Shut down the VM:

sudo virsh shutdown vm_nome

Wait for complete shutdown:

sudo virsh list --all

Inject the root password into disk:

sudo virt-customize -a /home/user/discos/nome_vm.qcow2 \
  --root-password password:senha_desejada

Start the VM again:

sudo virsh start vm_nome
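The "wait for complete shutdown" step in the sequence above can be automated by polling the domain state instead of re-running `virsh list --all` by hand. This is a minimal sketch, assuming the user running it can talk to `qemu:///system` (root, or a member of the libvirt group). The function names, the number of attempts, and the polling interval are illustrative.

```python
import subprocess
import time


def domain_state(name: str) -> str:
    """Return the state printed by `virsh domstate`, e.g. 'running' or 'shut off'."""
    result = subprocess.run(
        ["virsh", "--connect", "qemu:///system", "domstate", name],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def wait_for_shutdown(name: str, get_state=domain_state,
                      attempts: int = 60, delay: float = 2.0) -> bool:
    """Poll until the domain reports 'shut off'; return False if it never does."""
    for _ in range(attempts):
        if get_state(name) == "shut off":
            return True
        time.sleep(delay)
    return False
```

A provisioning script would call `wait_for_shutdown("vm_nome")` between `virsh shutdown` and `virt-customize`, and stop if it returns `False` so the disk is never edited while the VM is still running.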

Accessing VMs:

Via serial console

Connect to VM console:

sudo virsh console vm_nome

To exit the console, use Ctrl + ].

Via SSH

Find the VM IP address:

sudo virsh domifaddr vm_nome

Access via SSH:

ssh root@<ip_da_vm>

Managing and deleting VMs:

If you need to destroy an environment and recreate it from scratch, follow these three mandatory cleanup steps:

Force-stop the VM:

sudo virsh destroy nome_da_vm

Remove VM definition from Libvirt:

sudo virsh undefine nome_da_vm

Delete the virtual disk to free Power9 storage:

rm -f /home/user/discos/nome_da_vm.qcow2

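Creating a VM from an existing image (cloning):

To create a new VM from an already configured image, such as the prebuilt NVIDIA-ready images:

Option A: create an overlay via qemu-img (the new disk stores only the changes and depends on the original image, which stays intact and must not be moved or deleted):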

qemu-img create -f qcow2 -b imagem-base.qcow2 -F qcow2 nova-vm.qcow2

Option B: clone via virt-clone:

virt-clone \
  --original vm-base \
  --name vm-nova \
  --file /home/user/discos/nova-vm.qcow2

If needed, you can run the VM deletion steps above and recreate the VM by following the "Creating new VMs" steps.

Ready-to-use images with NVIDIA drivers

To simplify the use of Tesla V100 GPUs available on the server, we provide pre-configured .qcow2 images with NVIDIA drivers, CUDA, and cuDNN already installed. This removes the need to configure the base environment for every new use.

Available images:

Image	Contents
AlmaLinux-8-Power9-NVIDIA-drivers.qcow2.xz	AlmaLinux 8.10 + NVIDIA 535 drivers + CUDA 12.2 + cuDNN 9.0
InstructLab-Power9-0.25.0.qcow2.xz	AlmaLinux 8.10 + InstructLab 0.25.0 + dependencies required for execution on Power9 (ppc64le).

How to use pre-configured images:

Download the image from the shared folder and decompress it:

pip install --user gdown
gdown --folder "https://drive.google.com/drive/u/1/folders/1WM8fHKWaMu-NJOzwqh6cdcET7mNE50du"
xz -d InstructLab-Power9-0.25.0.qcow2.xz

Move it to the disks directory and create a VM from it:

cp InstructLab-Power9-0.25.0.qcow2 /home/user/discos/minha-vm-gpu.qcow2

Create the VM as usual:

sudo virt-install \
  --connect qemu:///system \
  --name vm_gpu \
  --memory 131072 \
  --vcpus 16 \
  --cpu host \
  --disk path=/home/user/discos/minha-vm-gpu.qcow2,format=qcow2 \
  --import \
  --os-variant almalinux8 \
  --network network=default \
  --graphics none \
  --noautoconsole

For the VM to access physical GPUs, PCIe passthrough must be configured as described in the next post of this series.

How to generate a new image from a configured VM:

After installing drivers or any software inside a VM, you can export its current state as a reusable image:

Shut down the VM:

sudo virsh shutdown vm_nome

Convert and compress the image (removes unused space):

qemu-img convert -O qcow2 -c \
  /home/user/discos/vm_nome.qcow2 \
  /home/user/discos/AlmaLinux-8-Power9-minha-imagem.qcow2

Compress for distribution:

xz -T0 -v /home/user/discos/AlmaLinux-8-Power9-minha-imagem.qcow2

Expected output: AlmaLinux-8-Power9-minha-imagem.qcow2.xz.

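Verify image integrity (run this before compressing, or on a decompressed copy, because xz replaces the .qcow2 file with the .xz archive):

qemu-img check AlmaLinux-8-Power9-minha-imagem.qcow2
qemu-img info AlmaLinux-8-Power9-minha-imagem.qcow2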

Evaluation of IBM Granite Models for Code-Generation Tasks on HumanEvalX

Fri, 28 Nov 2025 00:00:00 +0000

Context

The use of language models for code generation and understanding has become essential in modern development workflows.
As part of a joint research effort between LSD/UFCG and IBM, we investigated the performance of the IBM Granite 4 family on the HumanEvalX benchmark, which evaluates programming capabilities in five languages: Python, Java, Go, C++, and JavaScript.

The goal was to answer key questions from the team:

How versatile are the Granite models across different languages?
Do smaller models deliver useful performance?
How do the Granites compare to models from other providers such as DeepSeek Coder and CodeLlama?

Methodology / Process

The evaluation was conducted using OpenCompass, a modern and extensible framework for large-scale LLM benchmarking. It allowed experiments to be executed in a standardized, reproducible way with consistent inference protocols.

Since OpenCompass does not provide native support for models hosted on the IBM Cloud, it was necessary to develop a custom client to integrate the framework with the IBM Cloud Inference API. This client allowed the evaluation process to send requests transparently, handle authentication, manage generation parameters, and return outputs in the expected benchmark format. Experiments were also run in Google Colab, which served as a practical environment for prototyping and running the models.

We used the HumanEvalX benchmark, an extension of the traditional HumanEval, covering five languages with the Pass@1 metric.

The evaluated models included:

Granite 4.0 Micro (3B)
Granite 4.0 (1B)
Granite 4.0 h-tiny (7B)
Granite 4.0 h-small (30B), via IBM Cloud
Granite 4.0 (350M)
Granite Code Instruct (8B), via IBM Cloud
DeepSeek Coder (6.7B)
CodeLlama (7B)

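The metric used was Pass@1, following the benchmark protocol: the share of problems for which a single generated solution passes all of the problem's unit tests.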
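To make the metric concrete, the sketch below computes Pass@k with the commonly used unbiased estimator, where n is the number of samples generated for a problem and c is the number that pass the tests. The post does not state how many samples were drawn per problem, so treating each problem as an (n, c) pair is an assumption. With one sample per problem, Pass@1 reduces to the fraction of problems solved.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int = 1) -> float:
    """Unbiased pass@k estimate for one problem: n samples, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(problems: list, k: int = 1) -> float:
    """Average pass@k over a list of (n, c) pairs, one pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in problems) / len(problems)


# Four hypothetical problems with one sample each; two of them pass.
print(mean_pass_at_k([(1, 1), (1, 0), (1, 1), (1, 0)]))
```

The benchmark score for a language is this average over all problems in that language, reported as a percentage.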

Results and Conclusions

Performance heatmap of the models on HumanEvalX.

The evaluation revealed important behaviors:

1. granite-4.0-h-small stood out for its versatility

It surpassed 60% Pass@1 in Java, C++, and JavaScript, while also maintaining over 50% in Python and Go. This consistent performance across languages suggests that the model has good generalization capability, showing promise in scenarios that involve different programming ecosystems, although additional benchmarks and evaluations are important before drawing broader conclusions.

2. Granite Micro (3B) performed above expectations

Despite being a small model, Granite Micro (3B) delivered 65.85% in JavaScript and 68.90% in Java, outperforming even some larger models evaluated. This shows that even with a compact architecture, it can deliver solid results, making it a highly efficient option for applications that require low computational cost without sacrificing performance.

3. The size progression (350M → 1B → 3B → 7B → 30B) shows gradual and coherent evolution

The results show that as we move through the different sizes of the Granite line, there is a coherent evolution in performance. Smaller models deliver stable results within their category, while larger ones gradually expand the ability to solve more complex tasks. This distribution helps clarify where each model fits in the usage spectrum.

4. Comparing different providers helps contextualize the results

Alongside the IBM models, we also evaluated models from other providers such as DeepSeek and Meta. In some languages, the differences were small, but in all of them there was at least one model from the Granite family that achieved the highest score. The Granite 4 Micro (3B) and Granite 4 h-small (30B) models were the standouts, with results that were close to, and in some cases above, those of models recognized as code specialists.

Next Steps

Run the same Granite models on LiveCodeBench, a broader benchmark that goes beyond code generation and also evaluates code execution and test-output prediction.
Fine-tune Granite 4.0 Micro (3B) using InstructLab and measure the impact on the model's HumanEvalX performance, comparing results before and after the adjustment.

Computação@UFCG Leads Brazil's Contributions to the HELM-Stanford Framework in Partnership with IBM

Wed, 09 Jul 2025 00:00:00 +0000

Collaboration between UFCG’s Computer Science department and IBM makes the university the top Brazilian contributor to the HELM-Stanford evaluation framework in 2025.

HELM-Stanford is one of the world’s leading frameworks for evaluating language models, measuring accuracy, robustness, and fairness. Being the top Brazilian contributor, through the partnership between Computação@UFCG and IBM, highlights Brazil's leading role in developing fairer, safer, and more representative metrics for LLMs, especially in multilingual and culturally diverse contexts.

The partnership between Computação@UFCG and IBM resulted in 15 significant contributions to HELM-Stanford in 2025. These contributions include adding Portuguese-language benchmarks, fixing bugs, improving source code, and including new evaluation sets, expanding the framework’s linguistic diversity and robustness.

The project, coordinated by Professor João Brunet with participation from Professors Fábio Morais and Leandro Balby, features a multidisciplinary team dedicated to LLM evaluation. The team also includes one professor from IFPB, three graduate students, three undergraduate students, and a professional with software development experience. IBM, as a project partner, has also assigned professionals to work directly on the collaboration. Together, the group has made meaningful contributions to advancing HELM-Stanford, with a focus on including the Portuguese language and continuously improving the framework.

Multidisciplinary project team

LLMs Inference API on IBM Power9 Server

Thu, 03 Jul 2025 00:00:00 +0000

Background

This is the fourth and final post in a tutorial series that shows step by step how to build an LLM API on a Power9 server, from operating system setup to remote inference execution. We already configured the operating system, NVIDIA drivers, CUDA, and cuDNN in the first post, installed Conda and PyTorch in the second post, and built the API in the third post. In this stage, we present the finished API and show how to make requests.

TL;DR

This post introduces the built LLM inference API and how to use it.
We will show how to make requests using Python and curl.

Introducing the API

The API is built with FastAPI. It loads specific models, keeps them in GPU memory for successive calls, and generates text from prompts sent via HTTP requests. It also includes API Key access control, memory management (loading and unloading models), support for multiple GPUs with automatic sharding, and endpoints for status queries. The goal is to provide a robust, production-ready service optimized for intensive use, ensuring fast inferences and easy integration with external applications.

Architecture Overview

The API exposes LLMs via FastAPI with REST endpoints. The ModelManager handles loading, unloading, and model inference, keeping models in GPU memory for fast calls. Authentication is enforced via API Key. The architecture supports multiple GPUs with automatic sharding to optimize memory usage and performance. Models are sourced from Hugging Face and use the Transformers library to perform inferences.

Architecture Diagram

Main Features

Load Models
- /load_model
- Loads a model from the Hugging Face Hub
- Performs sharding across GPUs
- Supports Hugging Face Token
Generate Text
- /generate
- Accepts prompt, max_tokens, model name, temperature, and top_p
- Uses an already loaded model or loads a new one
- Returns result in JSON
Management
- /status: Checks the loaded model and device (CPU/GPU)
- /unload_model: Frees GPU and memory
- /generate_apikey: Creates API keys from LDAP user

Usage Flow

Usage flow diagram

Inputs and Endpoints

The table below describes the API endpoints, required inputs, and responses.

Inputs and endpoints table
Endpoint	Method	API Key	Input (Body/Query)	Response
`/generate_apikey`	POST	❌	{username}	API Key
`/load_model`	POST	✅	{model_name, hf_token (optional), device (optional)}	None, just loads the model
`/generate`	POST	✅	{model_name, prompt, hf_token (optional), max_tokens (optional), temperature (optional), top_p (optional)}	Text generated by the model
`/status`	GET	✅	None	Model status and the device it is loaded on
`/unload_model`	POST	✅	None	None, just unloads the model
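As a companion to the table, the sketch below builds the request body for `/generate` so that optional fields are sent only when they are set. The field names and the `x-api-key` header come from this post. The helper names are illustrative, and the assumption that the server applies its own defaults to omitted optional fields is based only on the table marking them as optional.

```python
def build_headers(api_key: str) -> dict:
    """Headers expected by the protected endpoints."""
    return {"Content-Type": "application/json", "x-api-key": api_key}


def build_generate_payload(model_name: str, prompt: str, hf_token=None,
                           max_tokens=None, temperature=None, top_p=None) -> dict:
    """Return the /generate body, leaving out optional fields that were not provided."""
    payload = {"model_name": model_name, "prompt": prompt}
    optional = {
        "hf_token": hf_token,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }
    # Keep explicit zeros (for example temperature=0.0); drop only values left as None.
    payload.update({key: value for key, value in optional.items() if value is not None})
    return payload


print(build_generate_payload("ibm-granite/granite-3.3-8b-instruct", "Hello", max_tokens=50))
```

The resulting dictionaries can be passed directly as the `headers=` and `json=` arguments of `requests.post`.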

How to Use the API with Python

Generate API Key

import requests
import json
import os

# Base URL without a trailing slash; endpoints are appended as f"{url}/..."
url = "http://<power9_ip_server>:8000"
username = "<ldap_user>"  # replace with your LDAP username
hf_token = os.getenv("HUGGINGFACE_TOKEN")

response = requests.post(f"{url}/generate_apikey", json={"username": username}).content.decode()

api_key = json.loads(response).get("api_key")

The Hugging Face token must be set as the HUGGINGFACE_TOKEN environment variable in the environment from which the requests are sent.
api_key holds the key returned by the /generate_apikey endpoint.

Load Model

First, we need to create a header containing the API Key returned from the code above and the payload with model_name and the Hugging Face token hf_token. After that, we can send the request with these two pieces of information.

headers = {"Content-Type": "application/json",
           "x-api-key": api_key}

payload = {"model_name": "ibm-granite/granite-3.3-8b-instruct",
           "hf_token": hf_token}

resp = requests.post(f"{url}/load_model", headers=headers, json=payload)

Generate Text

Now we need to create a new payload with the necessary information to generate text with an LLM, which includes: prompt, model_name, and hf_token.

payload = {"prompt": "Hello, tell me a little about the Federal University of Campina Grande (UFCG)",
           "model_name": "ibm-granite/granite-3.3-8b-instruct",
           "hf_token": hf_token}

resp = requests.post(f"{url}/generate", headers=headers, json=payload)

resp = json.loads(resp.content.decode())

Check status and unload the model

To check the status and unload the model, we don’t need to send anything in the payload—just the header with the API key:

requests.get(f"{url}/status", headers=headers).content

resp = requests.post(f"{url}/unload_model", headers=headers)

How to use the API with curl in CLI

Generate API Key

curl -X POST "http://<power9_ip_server>:8000/generate_apikey" \
  -H "Content-Type: application/json" \
  -d '{"username": "<ldap_user>"}'

The Hugging Face token must be set as the HUGGINGFACE_TOKEN environment variable in the shell from which the requests are sent.
The user in the username field must be enclosed in double quotation marks (" ").
After running the request above, save the returned API key as an environment variable to make future requests easier. To save it, copy the returned API key and run the command:

export API_KEY=<returned_api_key>

Load Model

curl -X POST "http://<power9_ip_server>:8000/load_model" \
  -H "Content-Type: application/json" \
  -H "x-api-key: $API_KEY" \
  -d '{
    "model_name": "ibm-granite/granite-3.3-8b-instruct",
    "hf_token": "'"$HUGGINGFACE_TOKEN"'"
  }'

Generate Text

curl -X POST "http://<power9_ip_server>:8000/generate" \
  -H "Content-Type: application/json" \
  -H "x-api-key: $API_KEY" \
  -d '{
    "model_name": "ibm-granite/granite-3.3-8b-instruct",
    "prompt": "Hello, tell me a little about the Federal University of Campina Grande (UFCG)",
    "hf_token": "'"$HUGGINGFACE_TOKEN"'",
    "max_tokens": 50
  }'

Check status and unload the model

To check the status and unload the model, we don’t need to send anything in the payload—just the header with the API key:

curl -X GET "http://<power9_ip_server>:8000/status" \
  -H "Content-Type: application/json" \
  -H "x-api-key: $API_KEY"

curl -X POST "http://<power9_ip_server>:8000/unload_model" \
  -H "Content-Type: application/json" \
  -H "x-api-key: $API_KEY"

We hope this series has helped clarify the full development and deployment process. The LLM-IBM-UFCG team is available for questions or suggestions about future improvements.

Building an API for LLM inferences on IBM Power9 servers

Wed, 02 Jul 2025 00:00:00 +0000

Background

This is the third post in a tutorial series designed to show step by step how to build an LLM API on a Power9 server, from operating system setup to remote inference execution. We already configured the operating system, NVIDIA drivers, CUDA, and cuDNN in the first post, and installed Conda and PyTorch in the second post. In this stage, we will build the API using FastAPI and the Transformers library, downloading models from Hugging Face and running the web server with uvicorn.

The implemented API will support generating API keys, loading models, performing inferences, checking status, and unloading models.

FastAPI: a modern web framework for building APIs with Python 3.8+, based on static typing and async programming. It is designed to be fast, easy to use, and robust, making API development more efficient.

Transformers: an open-source library developed by Hugging Face. It offers easy and efficient access to a wide collection of state-of-the-art pretrained models for Natural Language Processing (NLP), computer vision, and audio.

Hugging Face: Hugging Face is a platform focused on artificial intelligence, known for hosting NLP models and other tasks. The Hugging Face Hub is a collaborative repository where developers and researchers can share, version, and download ready-to-use models, making access and integration easier.

Uvicorn: ASGI (Asynchronous Server Gateway Interface) web server. Uvicorn is a high-performance server for asynchronous Python applications.

TL;DR

This post provides a step-by-step guide to implementing an API that performs LLM inferences.
We will use FastAPI and Transformers to develop this API and Hugging Face to download the models.

Environment Setup

Directory Structure

Start by creating the basic project structure:

model_api/
├── requirements.txt
├── app/
│   ├── __init__.py
│   ├── main.py
│   ├── schemas.py
│   ├── auth.py
│   ├── model_manager.py
│   ├── utils.py
│   └── apikey_store.json
└── README.md (optional)
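If you prefer to script this step, the layout above can be created with a few lines of Python. This is a minimal sketch using only the standard library; the function name create_skeleton is hypothetical, and the optional README.md is left out.

```python
from pathlib import Path

# File names taken from the tree above; README.md is optional and omitted.
APP_FILES = ["__init__.py", "main.py", "schemas.py", "auth.py",
             "model_manager.py", "utils.py"]


def create_skeleton(root):
    """Create the model_api layout under root and return the project path."""
    project = Path(root) / "model_api"
    app_dir = project / "app"
    app_dir.mkdir(parents=True, exist_ok=True)
    (project / "requirements.txt").touch()
    for name in APP_FILES:
        (app_dir / name).touch()
    # The key store starts as an empty JSON object.
    store = app_dir / "apikey_store.json"
    if not store.exists():
        store.write_text("{}\n")
    return project
```

Calling create_skeleton(".") creates the project in the current directory. Existing files are left untouched, so the function is safe to run more than once.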

`requirements.txt` File

We will use FastAPI and Transformers to build the API. Additionally, we will use uvicorn to run the server, pydantic for input data validation, and torch, which we installed in the previous tutorial.

First, we'll install the required libraries and then populate the requirements.txt file. Remember to activate your Conda environment if you created one, so that the PyTorch build from the previous tutorial is used.

conda activate api_llm
pip install fastapi uvicorn transformers

The requirements.txt file will look like this:

requirements.txt

fastapi>=0.104.0
uvicorn>=0.24.0
torch>=2.0.0
transformers>=4.35.0
pydantic>=2.0.0

API Key Storage File

The apikey_store.json file will store the generated API keys. We will start with it empty, containing only {}.

apikey_store.json

{}

Schemas and Data Validation

Schemas are essential for validating the API’s input and output data. They ensure data is in the correct format and enable automatic documentation generation.

We will create the app/schemas.py file containing all the data models. We will define four models: GenerateRequest, LoadModelRequest, ApiKeyResponse, and LDAPUserRequest.

schemas.py

from typing import Optional

from pydantic import BaseModel, Field


class GenerateRequest(BaseModel):
    model_name: str = Field(..., description="The name of the model to use for generation.")
    prompt: str = Field(..., description="The input text to generate a response for.")
    max_tokens: Optional[int] = Field(300, description="The maximum length of the generated response.")
    temperature: Optional[float] = Field(1.0, description="The sampling temperature for generation.")
    top_p: Optional[float] = Field(1.0, description="The cumulative probability for nucleus sampling.")
    hf_token: Optional[str] = Field(None, description="The Hugging Face access token to use, if applicable.")


class LoadModelRequest(BaseModel):
    model_name: str = Field(..., description="The name of the model to load.")
    device: Optional[str] = Field("cuda", description="The device to load the model on (e.g., 'cpu', 'cuda').")
    hf_token: Optional[str] = Field(None, description="The Hugging Face access token to use, if applicable.")


class ApiKeyResponse(BaseModel):
    api_key: str = Field(..., description="The API key for accessing the model API.")


class LDAPUserRequest(BaseModel):
    username: str = Field(..., description="The username for LDAP authentication.")

All classes inherit from pydantic’s BaseModel, gaining validation, serialization, and automatic documentation features.
The Field(...) declaration defines a required field with no default value.
The Field(value) declaration defines a field that may be omitted, in which case value is used as its default.
The Optional[type] annotation indicates that the field accepts either a value of type type or None.

With the schemas defined, let’s create the file responsible for API Key authentication.

Authentication and API Keys

The authentication system protects your API by ensuring that only authorized users can access the endpoints. We will implement a mechanism based on API Keys.

Let’s create the app/auth.py file with all the authentication functionalities.

auth.py

import json
import secrets

from fastapi import HTTPException, Request

APIKEY_STORE_FILE = "app/apikey_store.json"


def load_apikeys() -> dict:
    # Read the {username: api_key} mapping from disk.
    try:
        with open(APIKEY_STORE_FILE, "r") as f:
            return json.load(f)
    except FileNotFoundError:
        raise HTTPException(
            status_code=404,
            detail=f"API keys file not found: {APIKEY_STORE_FILE}")


def save_apikeys(keys: dict):
    with open(APIKEY_STORE_FILE, "w") as f:
        json.dump(keys, f, indent=4)


def generate_apikey(user: str) -> str:
    # 32 random bytes, hex-encoded (64 characters).
    key = secrets.token_hex(32)
    keys = load_apikeys()
    keys[user] = key
    save_apikeys(keys)
    return key


async def verify_apikey(request: Request) -> bool:
    # Header names are case-insensitive, so "x-api-key" also matches.
    apikey = request.headers.get("x-API-Key")
    if not apikey:
        raise HTTPException(
            status_code=401,
            detail="API key not provided.")
    try:
        keys = load_apikeys()
        # Explicitly return False for unknown keys instead of falling through.
        return apikey in keys.values()
    except json.JSONDecodeError:
        raise HTTPException(
            status_code=403,
            detail="Invalid API Key")

The load_apikeys function loads the information stored in the app/apikey_store.json file.
save_apikeys is responsible for saving the content in JSON format.
The generate_apikey function creates a key for a user and adds it to the dictionary using the provided username as the key.
verify_apikey will be called whenever a request arrives, to perform validation.
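The key lifecycle described above can be tried in isolation, without FastAPI or the JSON file. The sketch below uses an in-memory dict and hypothetical helper names (issue_key, find_user). It also swaps the membership test for secrets.compare_digest, a constant-time comparison; that is a variation for illustration, not what auth.py does.

```python
import secrets


def issue_key(store, user):
    """Create a key for user and record it in the dict store."""
    key = secrets.token_hex(32)  # 32 random bytes -> 64 hex characters
    store[user] = key
    return key


def find_user(store, presented_key):
    """Return the user owning presented_key, or None if it is unknown."""
    for user, key in store.items():
        # compare_digest avoids leaking information through timing.
        # With str arguments it expects ASCII-only text.
        if secrets.compare_digest(key, presented_key):
            return user
    return None


store = {}
alice_key = issue_key(store, "alice")
print(find_user(store, alice_key))         # alice
print(find_user(store, "not-a-real-key"))  # None
```

Returning the user name rather than a boolean would let an endpoint know who is calling. The tutorial's verify_apikey only reports whether the key is valid.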

Model and GPU Manager

The app/model_manager.py file is the core of the API, responsible for loading, managing, and running LLMs. It handles GPU/CPU placement and text generation.

model_manager.py

import gc

import torch
from fastapi import HTTPException
from transformers import AutoModelForCausalLM, AutoTokenizer

from .utils import is_model_on_gpu

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"


class ModelManager:
    def __init__(self):
        self.model = None
        self.tokenizer = None
        self.model_name = None

    def load_model(self, model_name: str, hf_token: str = None, device: str = DEVICE):
        # Only one model is kept in memory, so drop the previous one first.
        if self.model_name is not None and self.model_name != model_name:
            print("Removing previously loaded model...")
            self.unload_model()

        if self.model_name == model_name:
            print(f"The model {model_name} is already loaded.")
            return

        print(f"Loading model {model_name} on device {device}...")
        try:
            # An empty token is treated as "no token".
            token = hf_token or None
            self.tokenizer = AutoTokenizer.from_pretrained(model_name, token=token)
            # device_map="balanced" spreads the layers across the available GPUs;
            # the device argument is only used in the log message above.
            self.model = AutoModelForCausalLM.from_pretrained(
                model_name, device_map="balanced", token=token)
            self.model.eval()
            self.model_name = model_name
            print(is_model_on_gpu(self.model.hf_device_map, self.model_name))
        except Exception as e:
            raise HTTPException(status_code=500, detail=f"Error loading model: {str(e)}")

    def generate(self, model_name: str, hf_token: str, prompt: str, max_tokens: int = 300,
                 temperature: float = 1.0, top_p: float = 1.0) -> str:
        # Load (or swap in) the requested model on demand.
        if self.model_name != model_name:
            self.load_model(model_name, hf_token, device=DEVICE)

        if self.model is None or self.tokenizer is None:
            raise HTTPException(status_code=400, detail="No model loaded.")

        try:
            inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
            with torch.no_grad():
                # temperature and top_p only take effect when sampling is enabled
                # (do_sample=True here or in the model's generation config).
                outputs = self.model.generate(
                    **inputs, max_new_tokens=max_tokens, temperature=temperature,
                    top_p=top_p, eos_token_id=self.tokenizer.eos_token_id)
            return self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        except Exception as e:
            raise HTTPException(status_code=500, detail=f"Error generating text: {str(e)}")

    def get_status(self) -> str:
        if self.model is None:
            self.unload_model()
            return "No model loaded."
        return is_model_on_gpu(self.model.hf_device_map, self.model_name)

    def unload_model(self):
        self.model = None
        self.tokenizer = None
        old_model = self.model_name if self.model_name else False
        self.model_name = None

        # Release Python references first, then the cached CUDA memory.
        gc.collect()
        torch.cuda.empty_cache()
        return f"Model {old_model} successfully unloaded." if old_model else "No model loaded to unload."


manager = ModelManager()

The load_model function loads a new model into memory, removing any previously loaded model.
generate is the main function of the API, responsible for performing model inference. It allows adjusting the parameters: temperature, top_p, and max_tokens.
get_status reports whether there is a loaded model and whether it is on the GPU or CPU.
The unload_model function removes the model from memory, clears the CUDA cache, and invokes Python’s garbage collector to avoid leftovers that could interfere with future loads.
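The "one model at a time" behavior described above is easier to see without PyTorch in the way. The following is a simplified sketch, not the tutorial's class: SingleSlotManager is a hypothetical name, and the loader is any callable standing in for from_pretrained.

```python
import gc


class SingleSlotManager:
    """Keeps at most one loaded object; loader is any callable name -> object."""

    def __init__(self, loader):
        self.loader = loader
        self.name = None
        self.obj = None

    def load(self, name):
        if self.name == name:
            return "already loaded"
        if self.name is not None:
            self.unload()  # make room before loading another model
        self.obj = self.loader(name)
        self.name = name
        return "loaded"

    def unload(self):
        old = self.name
        self.name = None
        self.obj = None
        gc.collect()  # the real manager also calls torch.cuda.empty_cache()
        return old


calls = []
manager = SingleSlotManager(lambda name: calls.append(name) or f"<{name}>")
print(manager.load("model-a"))  # loaded
print(manager.load("model-a"))  # already loaded
print(manager.load("model-b"))  # loaded
print(calls)                    # ['model-a', 'model-b']
```

Injecting the loader keeps the swap logic testable on a machine without GPUs, which can be handy before deploying to the Power9 server.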

FastAPI API Endpoints

The app/main.py file is where all the components come together. In it, we define all the endpoints and the API’s routing logic.

main.py

from fastapi import Depends, FastAPI, HTTPException, Request
from fastapi.responses import JSONResponse

from app import auth, model_manager, schemas

app = FastAPI()


async def require_api_key(request: Request) -> bool:
    # verify_apikey raises 401 itself when the header is missing.
    valid = await auth.verify_apikey(request)
    if not valid:
        raise HTTPException(status_code=401, detail="Invalid API Key")
    return valid


@app.post("/generate_apikey")
async def generate_apikey(payload: schemas.LDAPUserRequest) -> JSONResponse:
    key = auth.generate_apikey(payload.username)
    return JSONResponse(status_code=200, content={"api_key": key})


@app.post("/load_model", dependencies=[Depends(require_api_key)])
async def load_model(payload: schemas.LoadModelRequest) -> JSONResponse:
    try:
        model_manager.manager.load_model(payload.model_name, payload.hf_token, payload.device)
        return JSONResponse(content={"message": f"Model {payload.model_name} loaded successfully."})
    except Exception as e:
        # HTTPException takes `detail`, not `content`.
        raise HTTPException(status_code=500, detail=str(e))


@app.post("/generate", dependencies=[Depends(require_api_key)])
async def generate(payload: schemas.GenerateRequest) -> JSONResponse:
    try:
        result = model_manager.manager.generate(
            payload.model_name, payload.hf_token, payload.prompt,
            payload.max_tokens, payload.temperature, payload.top_p)
        return JSONResponse(content={"result": result})
    except Exception as e:
        return JSONResponse(status_code=500, content={"error": str(e)})


@app.get("/status", dependencies=[Depends(require_api_key)])
async def status() -> JSONResponse:
    str_status = model_manager.manager.get_status()
    return JSONResponse(content={"status": str_status})


@app.post("/unload_model", dependencies=[Depends(require_api_key)])
async def unload_model() -> JSONResponse:
    try:
        str_unload = model_manager.manager.unload_model()
        return JSONResponse(content={"message": str_unload})
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

The require_api_key function checks the API key on each request and raises a 401 error when it is missing or invalid.
generate_apikey creates and returns a new API key for the specified user.
load_model loads the specified model. If needed, it also accepts a Hugging Face token.
The generate function makes the model perform inference using the given prompt and parameters.
Calling the status endpoint returns the current status of the model manager.
unload_model unloads the currently loaded model and returns a success message if completed properly.

`utils.py` File

The app/utils.py file contains the function that checks whether the loaded model is fully or partially on the GPU, or if it was loaded on the CPU.

utils.py

def is_model_on_gpu(hf_device_map: dict, model_name: str) -> str:
    if '' in hf_device_map.keys() and hf_device_map[''] == 'cpu':
        return f"Model {model_name} fully loaded on CPU."
    elif 'cpu' in hf_device_map.values():
        return f"Some layers of the model {model_name} are loaded on the CPU."
    else:
        return f"Model {model_name} fully loaded on GPU."
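To get a feel for the kind of dictionary this function receives, the sketch below counts modules per device. It assumes the usual shape of a Transformers device map: module names as keys, and integer GPU indices or strings such as "cpu" as values. The helper summarize_device_map and the sample map are hypothetical and do not come from a real run.

```python
from collections import Counter


def summarize_device_map(device_map):
    """Count how many modules sit on each device (hypothetical helper)."""
    counts = Counter()
    for device in device_map.values():
        # GPU entries are typically integer indices; "cpu" and "disk" are strings.
        label = f"cuda:{device}" if isinstance(device, int) else str(device)
        counts[label] += 1
    return dict(counts)


# Illustrative map, not taken from a real run.
example = {"model.embed_tokens": 0, "model.layers.0": 0,
           "model.layers.1": 1, "lm_head": "cpu"}
print(summarize_device_map(example))  # {'cuda:0': 2, 'cuda:1': 1, 'cpu': 1}
```

A summary like this shows at a glance how many modules ended up on the CPU when is_model_on_gpu reports a partial placement.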

Running the API

To run the API with uvicorn, simply execute a command specifying the host and port for the service to start.

uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload

app.main:app refers to the app object created in the app/main.py file, which connects all components and handles user requests.
--host 0.0.0.0 sets the IP address on which the Uvicorn server will listen. The value 0.0.0.0 allows the server to be accessible from any network interface on the Power9 machine.
--port 8000 specifies the port on which the server will listen for requests.
--reload is a flag for development use. It automatically reloads the server whenever changes are made.

By following this guide, you'll have a working API capable of running LLM inference using models downloaded from Hugging Face. In the next tutorial, we will show how to send requests to the API using curl and Python.

Setting Up the Conda and PyTorch on IBM Power9 Servers

Mon, 30 Jun 2025 00:00:00 +0000

Background

This is the second post in a tutorial series on how to build a Language Model API on an IBM Power9 server, covering everything from setting up the operating system to having the API running remote inference. The first post covers installing the OS and configuring NVIDIA drivers, CUDA, and cuDNN. In this step, we'll show how to set up the Conda package manager and the PyTorch library.

Conda: Conda is an open-source, cross-platform package and environment management system. It’s like a “toolbox” for data scientists and developers to organize their projects.

PyTorch: PyTorch is an open-source machine learning library developed primarily by Facebook AI Research (FAIR). It’s especially popular for building deep learning applications, a subfield of machine learning inspired by how the human brain works.

TL;DR

This post provides a step-by-step guide to installing Conda and PyTorch.
The main challenge is finding compatible versions for the Power9 machine architecture.

Setting up the Conda

We'll start by installing Conda. On Power systems, the architecture used is ppc64le (PowerPC 64-bit little-endian), so it's essential to download the installer built for this architecture. We'll use Miniconda, a lighter option that's better suited for custom setups like the Power9 server.

To download and install the latest version of Miniconda:

sudo wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-ppc64le.sh
bash ~/Miniconda3-latest-Linux-ppc64le.sh

Check if Conda was activated automatically:

conda --version

If it didn’t start automatically, you’ll need to activate it.

To ensure it’s automatically activated with each new connection, we will write the command into your .bashrc (or .zshrc) file.

echo 'source ~/miniconda3/etc/profile.d/conda.sh' >> ~/.bashrc
source ~/.bashrc

Check again with the command:

conda --version

Expected output looks like: conda 23.10.0

Installing and configuring the PyTorch library

There are no official builds or Conda/PyPI wheels with full support for the ppc64le architecture. To install PyTorch, you'll need to build it from source.

(Optional) Creating a Conda virtual environment

It’s recommended to create a dedicated virtual environment to install PyTorch in isolation.

To create and activate the virtual environment, run:

conda create -y -n api_llm python=3.10
conda activate api_llm

Installing prerequisites

We need to install some packages required to properly build PyTorch.

First, install the packages using the following commands:

conda install -y -c conda-forge openblas libblas cmake ninja python3-devel gcc-c++ rust cargo

CMake (the build system used by PyTorch) dropped support, starting with the 4.0 series, for projects that declare compatibility with versions older than 3.5. To avoid build failures caused by this, we pin an older CMake release (3.27.7) using pip.

Run the command:

pip install cmake==3.27.7

To make sure the correct version was installed, run the command:

cmake --version

Expected output: cmake version 3.27.7
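The same check can be scripted, for example as a guard at the start of a build script. The sketch below parses the text printed by cmake --version. It assumes the constraint is "older than the 4.0 series", the release line that removed compatibility with pre-3.5 project declarations. Both function names are hypothetical.

```python
import re


def parse_cmake_version(output):
    """Extract (major, minor, patch) from `cmake --version` output."""
    match = re.search(r"cmake version (\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        raise ValueError("could not find a CMake version in the output")
    return tuple(int(part) for part in match.groups())


def is_pre_4(version):
    # Assumption for this tutorial: any release below 4.0 is acceptable.
    return version < (4, 0, 0)


version = parse_cmake_version("cmake version 3.27.7")
print(version, is_pre_4(version))  # (3, 27, 7) True
```

In a real script, the string passed to parse_cmake_version would come from running cmake --version, for instance through subprocess.run with capture_output=True.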

Building PyTorch

Now let’s start the PyTorch build process.

The first step is to clone the repository and set it up to install version 2.6.0:

git clone --recursive https://github.com/pytorch/pytorch
cd pytorch
git checkout v2.6.0
git submodule sync
git submodule update --init --recursive

To install the required packages via pip, run the following command:

pip install -r requirements.txt

And finally, to build PyTorch, run Python’s setup.py:

sudo USE_CUDA=1 USE_DISTRIBUTED=1 USE_NCCL=1 USE_GLOO=1 USE_CUDNN=1 python setup.py install

The build process usually takes a while, around 15 minutes.

To check if everything worked correctly, create a file named test_torch.py

nano test_torch.py

This file should contain the following lines:

import torch

print(torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("Number of GPUs:", torch.cuda.device_count())
print("GPU name:", torch.cuda.get_device_name(0))
x = torch.rand(3, 3).cuda()
y = torch.rand(3, 3).cuda()
print("Sum on GPU:", (x + y))
print("cuDNN available:", torch.backends.cudnn.is_available())
print("C extensions loaded:", torch._C._cuda_getDeviceCount() > 0)

When you run this file, you’ll check:

Installed PyTorch version
CUDA availability
Number of available GPUs
GPU name on the Power9 server
Whether GPU usage is working correctly
cuDNN availability
Whether the .so files were compiled correctly

This script simply verifies some CUDA and PyTorch information and performs a basic addition using GPU tensors.

Run the file with the command:

python test_torch.py

Expected output should look something like:

2.6.0a0+git1eba9b3
CUDA available: True
Number of GPUs: 4
GPU name: Tesla V100-SXM2-16GB
Sum on GPU: tensor([[1.9163, 1.2208, 0.5998],
        [1.7962, 0.6040, 1.3943],
        [0.9536, 0.8010, 0.0668]], device='cuda:0')
cuDNN available: True
C extensions loaded: True

Keep in mind that the output may vary depending on the number and model of GPUs, as well as the tensor sums (due to randomness). What matters is that the boolean outputs in the script return True.

With this, PyTorch is installed and ready to use. In the next tutorial, we’ll run the first Language Model inference on the Power9 server.

Setting Up the OS, NVIDIA Drivers, CUDA, and cuDNN on IBM Power 9 Servers

Sun, 29 Jun 2025 00:00:00 +0000

Background

This is the first post in a tutorial series on how to build a Language Model API on an IBM Power9 server, covering everything from setting up the operating system to having the API running remote inference. This step of the tutorial shows how to set up the operating system and install NVIDIA drivers, CUDA, and cuDNN on IBM Power9 AC922 machines. The focus is on ensuring everything works correctly on ppc64le architectures, which are common in high-performance environments.

IBM Power9: The IBM Power9 AC922 is a high-performance machine used for demanding tasks such as artificial intelligence and scientific computing. It uses Power9 processors and works well with NVIDIA GPUs, offering high-speed communication between the CPU and GPU.

NVIDIA Drivers: Software that allows the operating system to communicate correctly with NVIDIA GPUs. These drivers are essential to enable GPU acceleration.

CUDA: NVIDIA’s platform for accelerating parallel computing on GPUs. It lets you run complex algorithms efficiently, such as Large Language Model inference.

cuDNN: A GPU-optimized library of primitives for deep neural networks (DNNs) developed by NVIDIA. It offers high-performance implementations of key DNN operations like convolutions, pooling, and normalization, significantly speeding up training and inference on GPUs.

TL;DR

This post provides a step-by-step guide on setting up Power9 servers, including the OS and NVIDIA configurations.
The main challenge is finding compatible versions for the Power9 machine architecture.

Setting up the Operating System

Let’s start with the installation of Red Hat Enterprise Linux 8.10 (Ootpa). On Power systems, the architecture used is ppc64le (PowerPC 64-bit little-endian), so it’s essential to ensure the .iso image is compatible with this architecture. Otherwise, the Power9’s petitboot won’t recognize the media and installation won’t proceed.

You can download the correct image from the link provided.
In this tutorial, we’ll use the Boot ISO option and follow the official Red Hat documentation to create a bootable USB medium.
After inserting the installation media into the Power9 server and rebooting, the system should automatically start petitboot.
From there, just follow the official installation guide to complete the OS setup.

Setting up NVIDIA Driver and CUDA

Checking GPUs and Operating System

To enable the operating system to communicate properly with the server’s GPUs, we need to install and configure the NVIDIA driver.

First, let’s check for the presence of the GPU(s):

lspci | grep -i nvidia

The expected output is something like:

0004:04:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)

Next, let’s check the system architecture and operating system name:

uname -m && cat /etc/redhat-release

The expected output is:

ppc64le
Red Hat Enterprise Linux release 8.10 (Ootpa)

Avoiding conflicts

To avoid potential conflicts, it’s recommended to disable the nouveau driver and SELinux.

The nouveau driver is an open-source driver for NVIDIA GPUs that replaces the proprietary driver when users want to use only free software without needing high performance.

When enabled, SELinux restricts certain processes from making changes to the system, which can conflict with the installations we'll do in this tutorial.

Disable the nouveau driver:

echo -e "blacklist nouveau\noptions nouveau modeset=0" | sudo tee /etc/modprobe.d/disable-nouveau.conf

To disable SELinux, let’s first check its status by running:

sestatus

If it’s active, you’ll need to set the SELINUX=disabled parameter in the /etc/selinux/config file to proceed. Remember that saving changes requires sudo permissions.
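For automated setups, the configured mode can also be read programmatically. This is a minimal sketch that parses the file's text; the function name selinux_mode and the sample content are illustrative, and on a real system the text would be read from /etc/selinux/config.

```python
def selinux_mode(config_text):
    """Return the value of the SELINUX= line, ignoring comments and SELINUXTYPE."""
    for raw_line in config_text.splitlines():
        line = raw_line.strip()
        if line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip() == "SELINUX":
            return value.strip()
    return None


# Illustrative file content; on a real system read /etc/selinux/config.
sample = """# This file controls the state of SELinux on the system.
SELINUX=disabled
SELINUXTYPE=targeted
"""
print(selinux_mode(sample))  # disabled
```

Note that the file only reflects the configured mode. The running state changes after the reboot and is what sestatus reports.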

After that, update the initramfs and reboot the machine with the following commands:

sudo dracut --force
sudo reboot

To verify everything worked so far, let’s check if nouveau is disabled:

lsmod | grep nouveau

If it’s been successfully disabled, there will be no output.

To verify the SELinux status:

sestatus

If it’s disabled, the output will be: SELinux status: disabled

Installing Prerequisites

Let’s install some prerequisites before starting the actual installation:

sudo dnf install pciutils environment-modules
sudo dnf install kernel-devel-$(uname -r) kernel-headers
sudo dnf install https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm
sudo dnf clean all
sudo dnf install dkms

We also need to enable some repositories:

sudo subscription-manager repos --enable=rhel-8-for-ppc64le-appstream-rpms
sudo subscription-manager repos --enable=rhel-8-for-ppc64le-baseos-rpms
sudo subscription-manager repos --enable=codeready-builder-for-rhel-8-ppc64le-rpms

Downloading and Installing CUDA Package Repositories

Let’s download CUDA version 12.2 and NVIDIA Driver 535.54.03-1 with the following command:

wget https://developer.download.nvidia.com/compute/cuda/12.2.0/local_installers/cuda-repo-rhel8-12-2-local-12.2.0_535.54.03-1.ppc64le.rpm

To install the downloaded package:

sudo rpm -i cuda-repo-rhel8-12-2-local-12.2.0_535.54.03-1.ppc64le.rpm

To install the NVIDIA driver and CUDA, run the following commands:

sudo dnf install nvidia-driver-cuda
sudo dnf clean all
sudo dnf module reset nvidia-driver
sudo dnf module enable nvidia-driver:latest-dkms
sudo dnf -y module install nvidia-driver:latest-dkms
sudo dnf -y install cuda

With these commands, the driver and CUDA installation is complete.

Post-Installation Steps

Let’s set the PATH and LD_LIBRARY_PATH environment variables. To do this, edit the .bashrc file and add these two lines:

export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

To update the environment variables, run the following command:

source ~/.bashrc
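To confirm that the new shell session actually picked up both entries, a small check like the one below can help. It is a sketch with a hypothetical helper name (has_entry); the directories are the ones exported above.

```python
import os


def has_entry(path_value, directory):
    """True if directory is one of the os.pathsep-separated entries."""
    wanted = os.path.normpath(directory)
    entries = [os.path.normpath(p) for p in path_value.split(os.pathsep) if p]
    return wanted in entries


for var, directory in [("PATH", "/usr/local/cuda/bin"),
                       ("LD_LIBRARY_PATH", "/usr/local/cuda/lib64")]:
    ok = has_entry(os.environ.get(var, ""), directory)
    print(f"{var} contains {directory}: {ok}")
```

Normalizing the paths means a trailing slash in .bashrc does not cause a false negative.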

We need to make two manual changes because they aren’t handled automatically by the CUDA package installation. If these aren’t done, the CUDA driver installation will not work properly.

The first change is to configure the NVIDIA persistence daemon. First, check its status, and if it’s not active, enable it:

systemctl status nvidia-persistenced
systemctl enable nvidia-persistenced

The second change concerns memory hot-plug. Some Linux distributions ship a udev rule that brings hot-plugged memory online as soon as it's detected, which prevents NVIDIA software from correctly configuring GPU memory on Power9.

To disable this rule, run the following commands:

sudo cp /lib/udev/rules.d/40-redhat.rules /etc/udev/rules.d/
sudo sed -i 's/SUBSYSTEM!="memory",.*GOTO="memory_hotplug_end"/SUBSYSTEM=="*", GOTO="memory_hotplug_end"/' /etc/udev/rules.d/40-redhat.rules
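The sed expression is dense, so it can help to see its effect on a single line. The sketch below applies the same pattern and replacement with Python's re module. The input line is illustrative only, since the exact content of 40-redhat.rules may differ between releases, and patch_rules is a hypothetical name.

```python
import re

# Same pattern and replacement as the sed command above.
PATTERN = r'SUBSYSTEM!="memory",.*GOTO="memory_hotplug_end"'
REPLACEMENT = 'SUBSYSTEM=="*", GOTO="memory_hotplug_end"'


def patch_rules(text):
    """Apply the substitution to every matching line of a rules file."""
    return re.sub(PATTERN, REPLACEMENT, text)


# Illustrative line; the exact content of 40-redhat.rules may differ.
before = 'SUBSYSTEM!="memory", ACTION!="add", GOTO="memory_hotplug_end"'
print(patch_rules(before))  # SUBSYSTEM=="*", GOTO="memory_hotplug_end"
```

After the substitution the rule matches every subsystem and jumps straight to the memory_hotplug_end label, so the memory-onlining logic below it is skipped.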

Installation Check

After completing all these steps, let’s reboot the machine and verify the installations:

Reboot the machine:

sudo reboot

Check the NVIDIA driver:

nvidia-smi

The output of the command above should display the driver version and the CUDA version it supports. It should also list the available devices (GPUs) with details like name, memory, temperature, and other information.

To perform the final check, let’s download the cuda-samples repository and run the device test.

Download the repository and access the cuda-samples version matching the installed CUDA:

git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples/Samples/1_Utilities/deviceQuery
git checkout v12.2

To build and run the tests:

make
./deviceQuery

After running this test, you should see Result = PASS in the last line. This confirms that the Power9 is set up with the NVIDIA driver and CUDA working correctly.

Setting up the CUDNN

First, we need to download and install the .rpm package specific to ppc64le.

wget https://developer.download.nvidia.com/compute/cudnn/9.0.0/local_installers/cudnn-local-repo-rhel8-9.0.0-1.0-1.ppc64le.rpm
sudo rpm -i cudnn-local-repo-rhel8-9.0.0-1.0-1.ppc64le.rpm
sudo dnf clean all
sudo dnf -y install cudnn

After installing, set the CUDNN_LIBRARY and CUDNN_INCLUDE_DIR environment variables by appending these lines to your .bashrc (the include path below assumes the RPM's default header location):

echo 'export CUDNN_LIBRARY=/usr/lib64' >> ~/.bashrc
echo 'export CUDNN_INCLUDE_DIR=/usr/include' >> ~/.bashrc

After that, the CUDNN installation process is complete.

This is the first part of our tutorial. Once you've finished all the steps in this post, the server will be ready for installing the Conda package manager and the PyTorch library. You can access the second part of this tutorial at this link.

Evaluating Small-Scale LLMs (up to 8B) on PT-BR Benchmarks

Mon, 02 Jun 2025 00:00:00 +0000

Background

This is the first of two posts in this series, aimed at providing a summary of the investigation we conducted using the HELM (Holistic Evaluation of Language Models) evaluation framework to assess the Granite family of models, the Llama-3.1-8B model, and the DeepSeek-R1-Distill-Llama-3.1-8B model. The evaluations cover both Portuguese-language benchmarks and code generation tasks. In this first part, the focus is on evaluating model performance in Brazilian Portuguese (PT-BR) for sentiment analysis and MQA (Multiple-Choice Question Answering) tasks. The second part, to be published soon, will present the evaluation results for code generation tasks.

The use of English-language datasets for evaluating language models is common practice. However, to evaluate these models across different languages and cultural contexts, it is important to test them on benchmarks in other languages. In the case of PT-BR, which typically represents a smaller share of the data used to train multilingual models, understanding model behavior is an important step in evaluating their suitability for tasks and contexts specific to this language. This post aims to contribute to that understanding by highlighting both the advances and the remaining challenges in these LLMs' performance on tasks in the PT-BR context.

TL;DR

We evaluated the models: Granite, Llama-3.1-8B, and DeepSeek-R1-Distill-Llama-3.1-8B on the ENEM Challenge, TweetSent-Br, and IMDB benchmarks.
Our method involved experimentation supported by the HELM framework, which we describe in detail in this document.
The results show that the models accurately classify sentiments in movie reviews in PT-BR.

Method

Execution Environment and Tool Used

We used HELM as the evaluation tool. HELM is an LLM evaluation framework developed by researchers at Stanford University. It includes a variety of benchmarks, such as sentiment analysis, code generation, and multiple-choice question answering. Using these benchmarks, we evaluated and compared the performance of the Granite (8B), Llama-3.1-8B, and DeepSeek-R1-Distill-Llama-3.1-8B models.

For running the experiments, we used Google Colab as the environment, which provides access to an A100 GPU. In this setup, we were able to clone the HELM repository and run models with 8 billion parameters. All configuration and testing were carried out on this platform, ensuring convenience and access to the necessary computational resources.

In a future post, we will go into more detail about LLM evaluation strategies and tools, with a deeper focus on HELM’s capabilities and operation.

Benchmarks and Models

To run tests in Brazilian Portuguese scenarios, it was necessary to extend HELM by adding new benchmarks, since the tool did not previously support this language. This effort represented a direct contribution to HELM, adding three benchmarks:

ENEM Challenge: built from questions from the Exame Nacional do Ensino Médio (ENEM), designed to evaluate LLMs' ability to handle MQA tasks across various knowledge areas, including Humanities, Natural Sciences, Languages, and Mathematics.
TweetSent-Br: composed of tweets, specifically for sentiment analysis tasks. The dataset is organized into three main classes: positive (tweets expressing a positive reaction about the main topic), negative (tweets expressing a negative reaction), and neutral (tweets that don’t fit the other categories).
IMDB: made up of movie reviews written in Brazilian Portuguese. This benchmark also focuses on sentiment classification tasks, but uses longer-form review texts, in contrast to TweetSent-Br’s shorter posts.
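All three benchmarks are scored by comparing the label a model produces with a reference label. As a rough sketch of that idea, and not HELM's actual implementation, accuracy can be computed as below. The function name and the toy labels are hypothetical and are not benchmark data.

```python
def accuracy(predictions, references):
    """Fraction of predictions that exactly match their reference label."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")
    # Compare case-insensitively and ignore surrounding whitespace.
    hits = sum(p.strip().lower() == r.strip().lower()
               for p, r in zip(predictions, references))
    return hits / len(references)


# Toy labels for illustration only; these are not benchmark results.
preds = ["positive", "neutral", "negative", "positive"]
golds = ["positive", "negative", "negative", "positive"]
print(accuracy(preds, golds))  # 0.75
```

For ENEM-style multiple-choice questions the labels would be the answer letters instead of sentiment classes, with the same comparison applied.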

Model selection was guided by compatibility with the available execution environment and by citation relevance and performance. This included the Granite family of models developed by IBM; the Llama models from Meta; and the DeepSeek-R1-Distill-Llama-8B, a compact, optimized version derived from Llama 3.1. This choice enabled a fair and practical comparison among the models.

Results

Below, we present the results obtained, along with charts developed by the team to make it easier to visualize and understand the models’ performance on the evaluated tasks.

ENEM Challenge:

Chart of results on the ENEM Challenge

The results indicate that the models showed similar performance, with a slight advantage for Llama. The models achieved an average accuracy of 62.53%, suggesting that while they demonstrate some level of understanding of the questions, they still lack sufficient ability to answer ENEM exam questions satisfactorily. Improvement is still needed, particularly in reasoning and interpretation in Portuguese.

TweetSent-Br:

Chart of results on the TweetSent-Br

In this benchmark, as observed with the ENEM Challenge, the results were also similar across models. This reinforces the view that there are still gaps in model performance on sentiment classification tasks in Portuguese. Classifying a message as positive, negative, or neutral remains a challenge for these models, especially given the nuances and ambiguities of the language.

IMDB:

Chart of results on the IMDB

In the IMDB benchmark, the results were quite positive. The models achieved accuracy rates above 90%, demonstrating strong performance in sentiment classification. The highlight was the Granite model with 8B parameters, which showed a slight advantage over the others. These results indicate that the models can easily categorize movie reviews in Portuguese, showing greater proficiency in this type of task.

Conclusion

This study provided a clearer view of the performance of language models in PT-BR through evaluation on three different benchmarks. The results show that the models analyzed have reasonable performance when selecting an answer in ENEM knowledge areas, while also indicating that there is still room for improvement. On the other hand, in the IMDB sentiment analysis task, these smaller-scale models demonstrated good classification ability.

The team plans, in future studies, to conduct experiments with larger-scale models to enable broader comparisons of performance and efficiency. This will allow for a more detailed analysis of the errors made by each model, contributing to a deeper understanding of their strengths and limitations.

Performing CPU Inference on Power10

Sun, 06 Apr 2025 00:00:00 +0000

Background

In this post, we will share our experience running the Granite-20b-Code-Instruct model on a Power10 machine, describing the challenges and the necessary configurations to perform inference using Llama.cpp, one of the most popular open-source libraries in this domain.

TL;DR

This post provides details on how to set up and run inference using IBM Power10 infrastructure.
Our main challenge was configuring Llama.cpp, which required adjustments such as installing the Ninja build tool, compiling OpenBLAS, and updating the C compiler.

Infrastructure

Inference was performed on a machine with IBM POWER10 architecture, equipped with 750 GB of RAM and running Red Hat Enterprise Linux 8.10. The environment was accessed remotely through a VM, which required a VPN connection for secure and controlled communication with the system.

Initial Setup

The library that makes it possible to run LLMs using only CPU resources is Llama.cpp. To set it up, we needed to resolve two external dependencies: Ninja and OpenBLAS. Ninja is a build tool that speeds up the compilation process, while OpenBLAS is a high-performance library for matrix computations.

During the OpenBLAS build process, the internal tests that validate matrix calculations reported discrepancies, indicating a compatibility problem with the available C compiler, an older GCC release (8.5.0). The solution was to update the compiler to a newer version, 13.2, which ensured better compatibility with the Power10 architecture and validated the accuracy of the numerical operations required by Llama.cpp. Below, we present the step-by-step process used to enable the compilation of the required libraries and update the C compiler.

Creating the build environment for the builder

sudo dnf update -y && dnf -y groupinstall 'Development Tools' && dnf install -y \
    cmake git ninja-build-debugsource.ppc64le \
    && dnf clean all

Updating the C Compiler and Setting Environment Variables

scl enable gcc-toolset-13 bash
export CC=/usr/bin/gcc-13
export CXX=/usr/bin/g++-13
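Before building, it can help to confirm that the shell really picks up the newer compiler. The sketch below parses the first line of `gcc --version` output and compares it with a minimum version. The helper names and the two sample strings are hypothetical illustrations, not output captured from the machine used in this post.

```python
import re


def parse_gcc_version(version_output):
    """Extract (major, minor, patch) from the first line of `gcc --version`."""
    lines = version_output.splitlines()
    first_line = lines[0] if lines else ""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", first_line)
    if match is None:
        raise ValueError("no version number found in compiler output")
    return tuple(int(part) for part in match.groups())


def meets_minimum(version, minimum=(13, 2, 0)):
    """True when the parsed version is at least the required minimum."""
    return version >= minimum


# Illustrative sample strings (assumed format, not taken from the real system).
sample_old = "gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-22)"
sample_new = "gcc (GCC) 13.2.1 20231205 (Red Hat 13.2.1-6)"

old_version = parse_gcc_version(sample_old)
new_version = parse_gcc_version(sample_new)
print(old_version, meets_minimum(old_version))
print(new_version, meets_minimum(new_version))
```

On a real system, the input string could come from `subprocess.run(["gcc", "--version"], capture_output=True, text=True).stdout`.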

Downloading and Building OpenBLAS

git clone --recursive https://github.com/DanielCasali/OpenBLAS.git && cd OpenBLAS && \
    make -j$(nproc --all) TARGET=POWER10 DYNAMIC_ARCH=1 && \
    make PREFIX=/opt/OpenBLAS install && \
    cd /
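The discrepancies mentioned earlier appeared in tests that validate matrix calculations. As a rough illustration of that kind of check (not the actual OpenBLAS test suite), the sketch below compares a matrix product with a known reference within a tolerance. All names and values here are hypothetical.

```python
import math


def matmul(a, b):
    """Reference matrix product using plain Python lists."""
    rows, inner, cols = len(a), len(b), len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for i in range(rows)
    ]


def matrices_close(x, y, rel_tol=1e-9, abs_tol=1e-12):
    """True when both matrices have the same shape and element-wise close values."""
    if len(x) != len(y) or any(len(rx) != len(ry) for rx, ry in zip(x, y)):
        return False
    return all(
        math.isclose(u, v, rel_tol=rel_tol, abs_tol=abs_tol)
        for rx, ry in zip(x, y)
        for u, v in zip(rx, ry)
    )


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
expected = [[19.0, 22.0], [43.0, 50.0]]

# A deliberately wrong result, standing in for a miscompiled kernel.
faulty = [[19.0, 22.0], [43.0, 50.5]]

print(matrices_close(matmul(a, b), expected))
print(matrices_close(faulty, expected))
```

A build whose kernels produce results like `faulty` would fail such a comparison, which is the kind of signal that pointed to the compiler in this case.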

Downloading and Building Llama.cpp using the OpenBLAS library

git clone https://github.com/DanielCasali/llama.cpp.git && cd llama.cpp && sed -i "s/powerpc64le/native -mvsx -mtune=native -D__POWER10_VECTOR__/g" ggml/src/CMakeLists.txt && \
    mkdir build; \
    cd build; \
    cmake -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS -DBLAS_INCLUDE_DIRS=/opt/OpenBLAS/include -G Ninja ..; \
    cmake --build . --config Release

With all these steps completed successfully, the environment was properly configured and optimized for running Llama.cpp locally. We are now able to start a server to perform inference with LLMs efficiently, using only CPU resources.

Performing Inference

We chose the Granite-20b-code-instruct model in the GGUF format, which is well suited to running language models in CPU-only environments. These models are quantized, meaning their numerical precision is reduced, which in turn lowers their size and memory consumption and makes them a good fit for efficient execution with Llama.cpp. This approach enables high-performance local inference even on processor-only architectures such as POWER10.

The model was downloaded directly from Hugging Face. Below, we show the step-by-step process to download it:

Create a directory for the model in Llama.cpp:

mkdir -p /root/llama.cpp/models/granite-20b-code-instruct-8k-GGUF

Access the directory in Llama.cpp:

cd /root/llama.cpp/models/granite-20b-code-instruct-8k-GGUF

Download the model from Hugging Face:

wget https://huggingface.co/ibm-granite/granite-20b-code-instruct-8k-GGUF/resolve/main/granite-20b-code-instruct.Q4_K_M.gguf
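The `Q4_K_M` suffix in the file name indicates a 4-bit quantized variant of the model. To give an intuition for what quantization does, the sketch below applies a simple symmetric 4-bit scheme to one small block of weights. This is a simplified illustration under our own assumptions; the actual Q4_K_M scheme used by Llama.cpp is more elaborate, and the function names and sample weights are hypothetical.

```python
def quantize_block(values, bits=4):
    """Map floats to small signed integers plus a single shared scale."""
    qmax = 2 ** (bits - 1) - 1  # largest magnitude representable, 7 for 4 bits
    peak = max(abs(v) for v in values)
    scale = peak / qmax if peak > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_block(quantized, scale):
    """Recover approximate floats from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.70, -0.30, 0.24, 0.00]  # hypothetical weights
q, scale = quantize_block(weights)
restored = dequantize_block(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("quantized:", q)
print("restored:", [round(r, 3) for r in restored])
print("max error:", round(max_error, 3))
```

Each weight is stored as a small integer instead of a 16- or 32-bit float, which is where the savings in file size and memory come from, at the cost of a small rounding error per weight.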

The download may take a while, depending on the size of the model. Once the steps above are completed, we can start a Llama.cpp server to perform inference. By default, the server is exposed on port 8080 of the Power10 machine, but this is fully customizable. The following command shows how to configure and run the Llama server:

/root/llama.cpp/build/bin/llama-server --host 0.0.0.0 --model /root/llama.cpp/models/granite-20b-code-instruct-8k-GGUF/granite-20b-code-instruct.Q4_K_M.gguf

With the Llama.cpp server running on port 8080, we can now perform inference via HTTP requests. In this example, for simplicity, we use curl to make the requests:

curl -X POST http://localhost:8080/completion \
    -H "Content-Type: application/json" \
    -d '{ "prompt": "Make a hello world program in Java. Your answer should be in Java code only.", "max_tokens": 100 }'

Below is an example of how the response is returned:

{ "content": "public class HelloWorld { public static void main(String[] args) { System.out.println(\"Hello, World!\"); }}" }
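The same exchange can also be scripted. The sketch below uses only the Python standard library to build a request equivalent to the curl call and to read the `content` field from a response body. The helper names are hypothetical, and the code assumes the server is reachable at `http://localhost:8080`, as in the example above.

```python
import json
import urllib.request


def build_completion_request(prompt, max_tokens=100, base_url="http://localhost:8080"):
    """Build (but do not send) a POST request mirroring the curl call."""
    payload = json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/completion",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_content(response_body):
    """Return the generated text from a JSON response body."""
    return json.loads(response_body)["content"]


request = build_completion_request(
    "Make a hello world program in Java. Your answer should be in Java code only."
)
print(request.get_method(), request.full_url)

# To actually send the request (requires the server to be running):
# with urllib.request.urlopen(request) as response:
#     print(extract_content(response.read()))
```

Separating request construction from sending makes it easy to inspect the payload before any network traffic happens.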

With this setup, we are now able to perform inference on CPU. Our upcoming posts will focus on running these inferences using the HELM (Holistic Evaluation of Language Models) framework as the intermediary.

Introduction

Wed, 12 Mar 2025 00:00:00 +0000

Welcome to the blog of the partnership between the Federal University of Campina Grande (UFCG) and IBM!

This space brings together articles, tutorials, and research results produced by our team across different projects. Each project focuses on a distinct area of research:

LLM Evaluation — evaluation of large language models, with a focus on benchmarks for Brazilian Portuguese.
AgentOps — development of AI agents capable of autonomously performing multiple tasks.
Judo-AI — use of AI models for analysis of judo matches and training sessions, applying computer vision and deep learning techniques for movement detection and action recognition.
5G — integration of AI techniques in 5G network environments, with intelligent control, optimization, and network management mechanisms.
MultiArq — provisioning of common tools for new architectures (ppc64le), seeking and adapting specific tools and creating technical documentation about the architecture.

Browse the posts and follow the latest updates!