<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Virtualization on IBM UFCG</title><link>https://llm-pt-ibm.github.io/en/tags/virtualization/</link><description>Recent content in Virtualization on IBM UFCG</description><generator>Hugo -- gohugo.io</generator><language>en</language><copyright>IBM &amp; UFCG - 2025</copyright><lastBuildDate>Fri, 27 Mar 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://llm-pt-ibm.github.io/en/tags/virtualization/index.xml" rel="self" type="application/rss+xml"/><item><title>LLM Inference with Ollama on IBM Power9 Using CPU</title><link>https://llm-pt-ibm.github.io/en/posts/ollama_cpu/</link><pubDate>Wed, 01 Apr 2026 00:00:00 +0000</pubDate><guid>https://llm-pt-ibm.github.io/en/posts/ollama_cpu/</guid><description>&lt;h2 id="context">Context&lt;/h2>&lt;p>This post presents a practical guide for performing inference of Large Language Models (LLMs) using &lt;a href="https://ollama.com/" rel="external">&lt;span class="link-personalizado">&lt;em>Ollama&lt;/em>&lt;/span>&lt;/a>, in an IBM POWER9 environment. Ollama is a &lt;em>framework&lt;/em> based on &lt;a href="https://github.com/ggml-org/llama.cpp.git" rel="external">&lt;span class="link-personalizado">&lt;em>llama.cpp&lt;/em>&lt;/span>&lt;/a>, designed to simplify the implementation and execution of such models, offering a user-friendly interface and support for various tasks.&lt;/p>&lt;figure>&lt;img src="https://llm-pt-ibm.github.io/images/funcionamento_ollama.png" alt="Figure 1"/>&lt;figcaption> &lt;p>Flow of a request&lt;/p> &lt;/figcaption>&lt;/figure>&lt;p>Despite the growth in LLM usage, the availability of materials focused on the &lt;em>ppc64le&lt;/em> architecture (IBM POWER9) is still quite limited. In general, available tutorials are old, poorly detailed, or focused on more common architectures like &lt;em>x86_64&lt;/em>, which makes it difficult to reproduce the environment in the context presented here.
This is the first of two posts in this series, which aims to perform inference entirely via CPU, exploring the &lt;em>ppc64le&lt;/em> architecture, in an updated, practical, and reproducible way. In the next post, we will address the use of GPU to accelerate the process.&lt;/p>&lt;h2 id="tldr">TL;DR&lt;/h2>&lt;ul>&lt;li>This post presents details on how to configure the environment to perform inferences with IBM POWER9 infrastructure.&lt;/li>&lt;li>Execution is performed via CPU using Ollama;&lt;/li>&lt;li>The main challenge involves correctly configuring the environment, especially dependencies like &lt;em>Go&lt;/em>, &lt;em>GCC&lt;/em>, and &lt;em>CMake&lt;/em>, in addition to compatibility with &lt;a href="https://www.redhat.com/en/technologies/linux-platforms/enterprise-linux" rel="external">&lt;span class="link-personalizado">&lt;em>RHEL&lt;/em>&lt;/span>&lt;/a>.&lt;/li>&lt;/ul>&lt;h2 id="environment-used">Environment Used&lt;/h2>&lt;p>&lt;strong>Hardware&lt;/strong>:&lt;/p>&lt;ul>&lt;li>&lt;em>ppc64le&lt;/em> architecture;&lt;/li>&lt;li>RAM: ~64GB;&lt;/li>&lt;li>Execution: Virtual Machine (VM);&lt;/li>&lt;/ul>&lt;p>&lt;strong>Operating System:&lt;/strong> AlmaLinux 8.10 (&lt;em>ppc64le&lt;/em>), binary compatible with &lt;em>Red Hat Enterprise Linux (RHEL)&lt;/em> 8.9/8.10.&lt;/p>&lt;h2 id="initial-setup">Initial &lt;em>Setup&lt;/em>&lt;/h2>&lt;p>To run Ollama on the POWER9 architecture, it is necessary to prepare the environment with the appropriate dependencies. The first step is to update the system and install basic utilities:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>sudo dnf update -y&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>sudo dnf install -y wget git tar make gcc gcc-c++ cmake gcc-toolset-11&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Although this command installs some dependencies, it is necessary to ensure that the correct versions are being used.&lt;/p>&lt;h3 id="configuring-go">Configuring &lt;em>Go&lt;/em>&lt;/h3>&lt;p>Ollama is developed in &lt;em>Go&lt;/em>, so it is necessary to ensure the appropriate version.&lt;/p>&lt;p>&lt;strong>Expected Version:&lt;/strong> 1.25.7 linux/ppc64le&lt;/p>&lt;h4 id="if-not-installed">If not installed:&lt;/h4>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>wget https://go.dev/dl/go1.25.7.linux-ppc64le.tar.gz&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>sudo tar -C /usr/local -xzf go1.25.7.linux-ppc64le.tar.gz&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>export PATH&lt;span style="color:#f92672">=&lt;/span>/usr/local/go/bin:$PATH&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>To add to &lt;em>PATH&lt;/em> permanently:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>echo &lt;span style="color:#e6db74">&amp;#39;export PATH=/usr/local/go/bin:$PATH&amp;#39;&lt;/span> &amp;gt;&amp;gt; ~/.bashrc&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>source ~/.bashrc&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Verify that the version is correct: &lt;code>go version&lt;/code>&lt;/p>&lt;h3 id="configuring-cmake">Configuring &lt;em>CMake&lt;/em>&lt;/h3>&lt;p>Verify that the version is correct: &lt;code>cmake --version&lt;/code>&lt;/p>&lt;p>&lt;strong>Expected Version:&lt;/strong> cmake 3.26.5&lt;/p>&lt;h4 id="if-not-installed-1">If not installed:&lt;/h4>&lt;div class="highlight">&lt;pre tabindex="0"
style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>wget https://github.com/Kitware/CMake/releases/download/v3.26.5/cmake-3.26.5.tar.gz&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>tar -xzf cmake-3.26.5.tar.gz&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>cd cmake-3.26.5&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>./bootstrap&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>make -j&lt;span style="color:#66d9ef">$(&lt;/span>nproc&lt;span style="color:#66d9ef">)&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>sudo make install&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="configuring-gcc">Configuring &lt;em>GCC&lt;/em>&lt;/h3>&lt;p>&lt;strong>Expected Version:&lt;/strong> &lt;code>gcc 11.2.1&lt;/code>&lt;/p>&lt;p>&lt;strong>Important:&lt;/strong> On AlmaLinux 8, the &lt;em>gcc-toolset&lt;/em> is not activated automatically. It is necessary to enable the session manually:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>scl enable gcc-toolset-11 bash&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>This command activates GCC only in the current session. 
If you open another terminal, you will need to run the command again.&lt;/p>&lt;p>&lt;strong>Verify the version:&lt;/strong> &lt;code>gcc --version&lt;/code>&lt;/p>&lt;h4 id="if-not-installed-2">If not installed:&lt;/h4>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>sudo dnf install -y gcc-toolset-11&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>scl enable gcc-toolset-11 bash&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="cloning-ollama">Cloning Ollama&lt;/h3>&lt;p>With the environment configured, we can build Ollama. Here we clone the official Ollama repository and change the version used (important for POWER compatibility and to get a stable version).&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>cd /root&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>git clone https://github.com/ollama/ollama.git&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>cd ollama&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">#Change the version: &lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>git checkout v0.9.4&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>To verify, use: &lt;code>git status&lt;/code>&lt;/p>&lt;h2 id="build-ollama">&lt;em>Build&lt;/em> Ollama&lt;/h2>&lt;p>After activating GCC in the correct version:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>export CGO_ENABLED&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">1&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>go clean -cache -modcache -i -r&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>go build -o ollama .&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;em>CGO&lt;/em> needs to be enabled because Ollama depends on llama.cpp, which uses C/C++ code for performance optimizations. Without it, the &lt;em>build&lt;/em> fails or loses compatibility with the architecture.&lt;/p>&lt;p>This should complete without errors and generate the &lt;code>ollama&lt;/code> binary in the current directory.&lt;/p>&lt;p>To verify: &lt;code>./ollama --version&lt;/code>&lt;/p>&lt;h2 id="performing-inference">Performing Inference&lt;/h2>&lt;p>With &lt;em>Ollama&lt;/em> compiled, we can start the server:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>./ollama serve&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>An important observation: since the environment runs on a virtual machine, it is not possible to keep this command running in the main terminal and, simultaneously, use another terminal in the same session to perform inference without some auxiliary tool to manage multiple terminals. We will therefore run the server in the background, but you can instead use &lt;em>Tmux&lt;/em> or &lt;em>Screen&lt;/em>, keeping the same terminal available for executing the remaining commands (which we will see next).
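If you opt for Tmux, a minimal detached-session sketch (assuming the tmux package is installed on the VM; the session name is arbitrary) could look like this:

```shell
# Hypothetical sketch: create a detached tmux session named 'ollama',
# so the current terminal stays free for the remaining commands.
tmux new-session -d -s ollama
# Type the server command into that session without attaching to it:
tmux send-keys -t ollama './ollama serve' Enter
tmux ls   # list sessions to confirm it exists
# Attach later with: tmux attach -t ollama   (detach again with Ctrl+B, then D)
```

The same idea applies to Screen; tmux is shown only as one possible choice.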
For this, you can run:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>./ollama serve &amp;amp;&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>To verify if it worked: &lt;code>ps aux | grep ollama&lt;/code>. It will show something like:&lt;/p>&lt;figure>&lt;img src="https://llm-pt-ibm.github.io/images/print_ollama_serve.png" alt="Figure 2"/>&lt;figcaption> &lt;p>Ollama running&lt;/p> &lt;/figcaption>&lt;/figure>&lt;h2 id="download-the-test-model-and-run-inference">Download the test model and run inference&lt;/h2>&lt;p>For validation, we used the &lt;em>TinyLlama&lt;/em> model, as it is lightweight and suitable for CPU execution. For this, in another terminal, run:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>./ollama pull tinyllama&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>To run inference:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>./ollama run tinyllama &lt;span style="color:#e6db74">&amp;#34;The sky is blue?&amp;#34;&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>If everything has been done correctly, you will have something like:&lt;/p>&lt;figure>&lt;img src="https://llm-pt-ibm.github.io/images/ollama_run.png" alt="Figure 3"/>&lt;figcaption> &lt;p>Inference being executed&lt;/p> &lt;/figcaption>&lt;/figure>&lt;p>It is important to highlight that &lt;em>Ollama&lt;/em> works, by default, with models available in its own repository, which are already converted and optimized for execution, 
generally in a format compatible with &lt;em>llama.cpp&lt;/em>. These models can be easily used via the &lt;code>ollama pull&lt;/code> command, as in the case of &lt;em>TinyLlama&lt;/em> used in this example. Although it is possible to use external models, this requires additional steps, such as conversion to compatible formats (for example, &lt;em>GGUF&lt;/em>) and the creation of a &lt;em>Modelfile&lt;/em>.&lt;/p>&lt;h2 id="final-considerations">Final Considerations&lt;/h2>&lt;p>With the steps presented, it was possible to configure the environment to run LLM inference on an IBM POWER9 machine using the CPU. Although functional, this approach has performance limitations, especially for larger models, due to the absence of GPU acceleration. As a next step, we intend to explore execution using the GPU, evaluating performance gains and scalability.&lt;/p>&lt;h2 id="next-steps">Next Steps&lt;/h2>&lt;ul>&lt;li>Test newer versions and compatibility between them;&lt;/li>&lt;li>Conduct benchmarking experiments to compare CPU inference performance against GPU inference;&lt;/li>&lt;li>Publish the second post in this series, performing GPU inference.&lt;/li>&lt;/ul></description></item><item><title>Power9 Virtualization: how we structured an isolated environment with KVM and Libvirt</title><link>https://llm-pt-ibm.github.io/en/posts/virtualization/</link><pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate><guid>https://llm-pt-ibm.github.io/en/posts/virtualization/</guid><description>&lt;h2 id="context">Context&lt;/h2>&lt;p>Given the need to establish isolated and secure environments for installing libraries, frameworks, and general-purpose tools, environment encapsulation emerged as an effective solution, implemented through KVM managed via &lt;code>virt-manager&lt;/code> and &lt;code>virsh&lt;/code>.&lt;/p>&lt;p>Virtualization is widely used in x86 environments, with mature tooling and established workflows.
However, when migrating to architectures such as IBM Power9 (&lt;code>ppc64le&lt;/code>), many of these processes are no longer straightforward and require architecture-specific adaptations. Below, we provide a diagram showing this interaction across four layers.&lt;/p>&lt;h2 id="communication-flow-between-hardware-power9-and-virtual-machines">Communication flow between Hardware (Power9) and Virtual Machines&lt;/h2>&lt;p>The flow is organized into the following layers:&lt;/p>&lt;figure>&lt;img src="https://llm-pt-ibm.github.io/images/kvm_virtualizacao.png" alt="Figure 1: Diagram representing a 4-layer virtualization architecture."/>&lt;figcaption> &lt;p>Figure 1: Diagram representing a 4-layer virtualization architecture.&lt;/p> &lt;/figcaption>&lt;/figure>&lt;p>In this work, we explore how to build a virtualized environment using KVM and Libvirt on a Power9 server, with a focus on isolation, reproducibility, and shared team usage.&lt;/p>&lt;h2 id="tldr">TL;DR&lt;/h2>&lt;ul>&lt;li>We implemented a virtualized environment on Power9 using KVM + Libvirt.&lt;/li>&lt;li>We adapted common virtualization workflows to &lt;code>ppc64le&lt;/code>, solving permission, write-lock, and provisioning issues.&lt;/li>&lt;li>The environment provides secure isolation between users and straightforward VM management.&lt;/li>&lt;li>We provide ready-to-use images with NVIDIA/CUDA drivers for immediate use.&lt;/li>&lt;/ul>&lt;h2 id="environment-used">Environment used&lt;/h2>&lt;ul>&lt;li>&lt;strong>Architecture&lt;/strong>: IBM Power9 server (&lt;code>ppc64le&lt;/code> architecture).&lt;/li>&lt;li>&lt;strong>Operating System (OS)&lt;/strong>: AlmaLinux 8.10 binary-compatible with Red Hat Enterprise Linux (RHEL) 8.9/8.10.&lt;/li>&lt;li>&lt;strong>RAM&lt;/strong>: 512GB.&lt;/li>&lt;li>&lt;strong>Execution&lt;/strong>: &lt;code>virt-manager&lt;/code> for
Virtual Machine (VM) management.&lt;/li>&lt;li>&lt;strong>Hypervisor&lt;/strong>: KVM (Kernel-based Virtual Machine) / QEMU.&lt;/li>&lt;li>&lt;strong>Management&lt;/strong>: Libvirt (&lt;code>virsh&lt;/code>, &lt;code>virt-install&lt;/code>, &lt;code>virt-customize&lt;/code>).&lt;/li>&lt;li>&lt;strong>Storage&lt;/strong>: Virtual disks in &lt;code>.qcow2&lt;/code> format.&lt;/li>&lt;li>&lt;strong>GPUs&lt;/strong>: 4x NVIDIA Tesla V100 SXM2 16GB (NVLink2).&lt;/li>&lt;/ul>&lt;h2 id="installing-the-virtualization-environment-kvm--libvirt">Installing the virtualization environment (KVM + Libvirt)&lt;/h2>&lt;p>Before creating any VM, you need to install and configure KVM and Libvirt on the Power9 server.&lt;/p>&lt;ol>&lt;li>&lt;strong>Package installation&lt;/strong>:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>sudo dnf install -y qemu-kvm libvirt libvirt-client libvirt-daemon libvirt-daemon-kvm virt-install virt-viewer guestfs-tools \
libguestfs-tools python3-libvirt&lt;/code>&lt;/pre>&lt;ol start="2">&lt;li>&lt;strong>Starting the service&lt;/strong>:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>sudo systemctl enable --now libvirtd
sudo systemctl status libvirtd&lt;/code>&lt;/pre>&lt;ol start="3">&lt;li>&lt;strong>Adding your user to the &lt;code>libvirt&lt;/code> group&lt;/strong>: So non-root users can manage VMs without requiring &lt;code>sudo&lt;/code> for every command:&lt;/li>&lt;/ol>&lt;p>Run the command below:&lt;/p>&lt;pre tabindex="0">&lt;code>sudo usermod -aG libvirt $(whoami)&lt;/code>&lt;/pre>&lt;p>Log out and log back in for the change to take effect.&lt;/p>&lt;ol start="4">&lt;li>&lt;strong>Verifying the installation&lt;/strong>:&lt;/li>&lt;/ol>&lt;p>Check &lt;code>virsh&lt;/code> version:&lt;/p>&lt;pre tabindex="0">&lt;code>sudo virsh version&lt;/code>&lt;/pre>&lt;p>Validate CPU virtualization support:&lt;/p>&lt;pre tabindex="0">&lt;code>sudo virt-host-validate&lt;/code>&lt;/pre>&lt;h2 id="setup">Setup&lt;/h2>&lt;ol>&lt;li>&lt;strong>Environment
preparation&lt;/strong>: In KVM, the fastest way to provision VMs is to clone a “seed” image (&lt;code>.qcow2&lt;/code>) and expand it, instead of performing a clean install from ISO. To keep things organized, all virtual disks should be stored in a dedicated directory:&lt;/li>&lt;/ol>&lt;p>Download the AlmaLinux 8 base image:&lt;/p>&lt;pre tabindex="0">&lt;code>cd /home/user/
wget https://repo.almalinux.org/almalinux/8/cloud/ppc64le/images/AlmaLinux-8-GenericCloud-latest.ppc64le.qcow2 -O alma8_base.qcow2&lt;/code>&lt;/pre>&lt;ol start="2">&lt;li>&lt;strong>Hypervisor management&lt;/strong>: Hypervisor and instance administration follows specific procedures to ensure system stability. Administrator commands to control virtualization services on Power9:&lt;/li>&lt;/ol>&lt;p>Stop KVM services:&lt;/p>&lt;pre tabindex="0">&lt;code>sudo systemctl stop libvirtd&lt;/code>&lt;/pre>&lt;p>Start KVM services again:&lt;/p>&lt;pre tabindex="0">&lt;code>sudo systemctl start libvirtd&lt;/code>&lt;/pre>&lt;p>Enable at boot:&lt;/p>&lt;pre tabindex="0">&lt;code>sudo systemctl enable libvirtd&lt;/code>&lt;/pre>&lt;ol start="3">&lt;li>&lt;strong>Permission setup&lt;/strong>: The system user running KVM (&lt;code>qemu&lt;/code>) needs permission to access VM disks. If disks are stored inside a personal home directory, Linux blocks access by default.
To allow hypervisor access without exposing personal files, grant execute (&lt;code>o+x&lt;/code>) permission on directories:&lt;/li>&lt;/ol>&lt;p>Allow &lt;code>qemu&lt;/code> to traverse the home directory (traversal only, no read permission):&lt;/p>&lt;pre tabindex="0">&lt;code>chmod o+x /home/user&lt;/code>&lt;/pre>&lt;p>Allow &lt;code>qemu&lt;/code> to access the disk directory:&lt;/p>&lt;pre tabindex="0">&lt;code>chmod o+x /home/user/discos&lt;/code>&lt;/pre>&lt;ol start="4">&lt;li>&lt;strong>Virtual network configuration (Libvirt)&lt;/strong>: Libvirt creates a default NAT network (&lt;code>default&lt;/code>) that places VMs in the &lt;code>192.168.122.0/24&lt;/code> range. VMs can access the internet through NAT, but they are not directly reachable from external networks without additional setup.&lt;/li>&lt;/ol>&lt;p>Check network status:&lt;/p>&lt;pre tabindex="0">&lt;code>sudo virsh net-list --all&lt;/code>&lt;/pre>&lt;p>If inactive, start and enable at boot:&lt;/p>&lt;pre tabindex="0">&lt;code>sudo virsh net-start default
sudo virsh net-autostart default&lt;/code>&lt;/pre>&lt;p>If the network does not exist, define and initialize it:&lt;/p>&lt;pre tabindex="0">&lt;code>sudo virsh net-define /usr/share/libvirt/networks/default.xml
sudo virsh net-start default
sudo virsh net-autostart default&lt;/code>&lt;/pre>&lt;p>If the XML file is missing, install the network config package:&lt;/p>&lt;pre tabindex="0">&lt;code>sudo dnf install -y libvirt-daemon-config-network&lt;/code>&lt;/pre>&lt;ol start="5">&lt;li>&lt;strong>Creating new VMs&lt;/strong>:&lt;/li>&lt;/ol>&lt;p>Clone the base image:&lt;/p>&lt;pre tabindex="0">&lt;code>cp /home/user/alma8_base.qcow2 /home/user/discos/nome_vm.qcow2&lt;/code>&lt;/pre>&lt;p>Expand the disk (must be done BEFORE creating the VM):&lt;/p>&lt;pre tabindex="0">&lt;code>qemu-img resize /home/user/discos/nome_vm.qcow2 +100G&lt;/code>&lt;/pre>&lt;p>Create the VM:&lt;/p>&lt;pre tabindex="0">&lt;code>sudo virt-install \
  --connect qemu:///system \
  --name vm_nome \
  --memory 131072 \
  --vcpus 16 \
  --cpu host \
  --disk path=/home/user/discos/nome_vm.qcow2,format=qcow2 \
  --import \
  --os-variant almalinux8 \
  --network network=default \
  --graphics none \
  --noautoconsole&lt;/code>&lt;/pre>&lt;ol start="6">&lt;li>&lt;strong>Post-creation VM customization&lt;/strong>: After creating the VM, you must set the root password, since cloud images usually come without one. We use &lt;code>virt-customize&lt;/code> for this. &lt;strong>Important&lt;/strong>: the VM must be powered off before safely editing its disk.&lt;/li>&lt;/ol>&lt;p>Shut down the VM:&lt;/p>&lt;pre tabindex="0">&lt;code>sudo virsh shutdown vm_nome&lt;/code>&lt;/pre>&lt;p>Wait for complete shutdown:&lt;/p>&lt;pre tabindex="0">&lt;code>sudo virsh list --all&lt;/code>&lt;/pre>&lt;p>Inject the root password into disk:&lt;/p>&lt;pre tabindex="0">&lt;code>sudo virt-customize -a /home/user/discos/nome_vm.qcow2 \
  --root-password password:senha_desejada&lt;/code>&lt;/pre>&lt;p>Start the VM again:&lt;/p>&lt;pre tabindex="0">&lt;code>sudo virsh start vm_nome&lt;/code>&lt;/pre>&lt;ol start="7">&lt;li>&lt;strong>Accessing VMs&lt;/strong>:&lt;/li>&lt;/ol>&lt;p>&lt;strong>Via serial console&lt;/strong>&lt;/p>&lt;p>Connect to VM console:&lt;/p>&lt;pre tabindex="0">&lt;code>sudo virsh console vm_nome&lt;/code>&lt;/pre>&lt;p>To exit the console, use &lt;code>Ctrl + ]&lt;/code>.&lt;/p>&lt;p>&lt;strong>Via SSH&lt;/strong>&lt;/p>&lt;p>Find the VM IP address:&lt;/p>&lt;pre tabindex="0">&lt;code>sudo virsh domifaddr vm_nome&lt;/code>&lt;/pre>&lt;p>Access via SSH:&lt;/p>&lt;pre tabindex="0">&lt;code>ssh root@&amp;lt;ip_da_vm&amp;gt;&lt;/code>&lt;/pre>&lt;ol start="8">&lt;li>&lt;strong>Managing and deleting VMs&lt;/strong>: If you need to destroy an environment and recreate it from scratch, follow these 3 mandatory cleanup steps:&lt;/li>&lt;/ol>&lt;p>Force-stop the VM:&lt;/p>&lt;pre tabindex="0">&lt;code>sudo virsh destroy nome_da_vm&lt;/code>&lt;/pre>&lt;p>Remove VM
definition from Libvirt:&lt;/p>&lt;pre tabindex="0">&lt;code>sudo virsh undefine nome_da_vm&lt;/code>&lt;/pre>&lt;p>Delete the virtual disk to free Power9 storage:&lt;/p>&lt;pre tabindex="0">&lt;code>rm -f /home/user/discos/nome_da_vm.qcow2&lt;/code>&lt;/pre>&lt;ol start="9">&lt;li>&lt;strong>Creating a VM from an existing image (cloning)&lt;/strong>: To create a new VM from an already configured image, such as prebuilt NVIDIA-ready images:&lt;/li>&lt;/ol>&lt;p>Option A: clone via &lt;code>qemu-img&lt;/code> (keeps original image intact):&lt;/p>&lt;pre tabindex="0">&lt;code>qemu-img create -f qcow2 -b imagem-base.qcow2 -F qcow2 nova-vm.qcow2&lt;/code>&lt;/pre>&lt;p>Option B: clone via &lt;code>virt-clone&lt;/code>:&lt;/p>&lt;pre tabindex="0">&lt;code>virt-clone \
  --original vm-base \
  --name vm-nova \
  --file /home/user/discos/nova-vm.qcow2&lt;/code>&lt;/pre>&lt;p>If needed, you can execute the VM deletion step above and recreate it according to step 5.&lt;/p>&lt;h2 id="ready-to-use-images-with-nvidia-drivers">Ready-to-use images with NVIDIA drivers&lt;/h2>&lt;p>To simplify the use of Tesla V100 GPUs available on the server, we provide pre-configured &lt;code>.qcow2&lt;/code> images with NVIDIA drivers, CUDA, and cuDNN already installed.
This removes the need to configure the base environment for every new use.&lt;/p>&lt;ol>&lt;li>&lt;p>&lt;strong>Available images&lt;/strong>:&lt;/p>&lt;table>&lt;thead>&lt;tr>&lt;th style="text-align:left">Image&lt;/th>&lt;th style="text-align:left">Contents&lt;/th>&lt;/tr>&lt;/thead>&lt;tbody>&lt;tr>&lt;td style="text-align:left">AlmaLinux-8-Power9-NVIDIA-drivers.qcow2.xz&lt;/td>&lt;td style="text-align:left">AlmaLinux 8.10 + NVIDIA drivers 535 + CUDA 12.2 + cuDNN 9.0&lt;/td>&lt;/tr>&lt;/tbody>&lt;/table>&lt;/li>&lt;li>&lt;p>&lt;strong>How to use pre-configured images&lt;/strong>:&lt;/p>&lt;/li>&lt;/ol>&lt;p>Download and decompress the image:&lt;/p>&lt;pre tabindex="0">&lt;code>wget &amp;lt;url_do_repositorio&amp;gt;/AlmaLinux-8-Power9-NVIDIA-drivers.qcow2.xz
xz -d AlmaLinux-8-Power9-NVIDIA-drivers.qcow2.xz&lt;/code>&lt;/pre>&lt;p>Move it to the disks directory and create a VM from it:&lt;/p>&lt;pre tabindex="0">&lt;code>cp AlmaLinux-8-Power9-NVIDIA-drivers.qcow2 /home/user/discos/minha-vm-gpu.qcow2&lt;/code>&lt;/pre>&lt;p>Create the VM as usual:&lt;/p>&lt;pre tabindex="0">&lt;code>sudo virt-install \
  --connect qemu:///system \
  --name vm_gpu \
  --memory 131072 \
  --vcpus 16 \
  --cpu host \
  --disk path=/home/user/discos/minha-vm-gpu.qcow2,format=qcow2 \
  --import \
  --os-variant almalinux8 \
  --network network=default \
  --graphics none \
  --noautoconsole&lt;/code>&lt;/pre>&lt;p>For the VM to access physical GPUs, PCIe passthrough must be configured as described in the next post of this series.&lt;/p>&lt;ol start="3">&lt;li>&lt;strong>How to generate a new image from a configured VM&lt;/strong>: After installing drivers or any software inside a VM, you can export its current state as a reusable image:&lt;/li>&lt;/ol>&lt;p>Shut down the VM:&lt;/p>&lt;pre tabindex="0">&lt;code>sudo virsh shutdown vm_nome&lt;/code>&lt;/pre>&lt;p>Convert and compress the image (removes unused space):&lt;/p>&lt;pre tabindex="0">&lt;code>qemu-img convert -O qcow2 -c \
  /home/user/discos/vm_nome.qcow2 \
  /home/user/discos/AlmaLinux-8-Power9-minha-imagem.qcow2&lt;/code>&lt;/pre>&lt;p>Compress for distribution:&lt;/p>&lt;pre tabindex="0">&lt;code>xz -T0 -v /home/user/discos/AlmaLinux-8-Power9-minha-imagem.qcow2&lt;/code>&lt;/pre>&lt;p>Expected output: &lt;code>AlmaLinux-8-Power9-minha-imagem.qcow2.xz&lt;/code>.&lt;/p>&lt;p>Verify image integrity:&lt;/p>&lt;pre tabindex="0">&lt;code>qemu-img check AlmaLinux-8-Power9-minha-imagem.qcow2
qemu-img info AlmaLinux-8-Power9-minha-imagem.qcow2&lt;/code>&lt;/pre></description></item><item><title>Evaluation of IBM Granite Models for Code-Generation Tasks on HumanEvalX</title><link>https://llm-pt-ibm.github.io/en/posts/post_humanevalx/</link><pubDate>Fri, 28 Nov 2025 00:00:00 +0000</pubDate><guid>https://llm-pt-ibm.github.io/en/posts/post_humanevalx/</guid><description>&lt;h2 id="context">Context&lt;/h2>&lt;p>The use of language models for &lt;strong>code generation and understanding&lt;/strong> has become essential in modern development workflows.&lt;br>As part of a joint research effort between &lt;strong>LSD/UFCG&lt;/strong> and &lt;strong>IBM&lt;/strong>, we investigated the performance of the &lt;strong>IBM Granite 4&lt;/strong> family on the &lt;strong>HumanEvalX&lt;/strong> benchmark, which evaluates programming capabilities in &lt;em>five languages&lt;/em>: Python, Java, Go, C++, and JavaScript.&lt;/p>&lt;p>The goal was to answer key questions from the team:&lt;/p>&lt;ul>&lt;li>&lt;em>How versatile are the Granite models across different languages?&lt;/em>&lt;/li>&lt;li>&lt;em>Do smaller models deliver useful performance?&lt;/em>&lt;/li>&lt;li>&lt;em>How do the Granites compare to models from other providers such as DeepSeek Coder and CodeLlama?&lt;/em>&lt;/li>&lt;/ul>&lt;hr>&lt;h2 id="methodology--process">Methodology / Process&lt;/h2>&lt;p>The evaluation was conducted using &lt;strong>OpenCompass&lt;/strong>, a modern and extensible framework for large-scale LLM benchmarking.
It allowed experiments to be executed in a standardized, reproducible way with consistent inference protocols.&lt;/p>&lt;p>Since OpenCompass does not provide native support for models hosted on the &lt;strong>IBM Cloud&lt;/strong>, it was necessary to develop a custom client to integrate the framework with the IBM Cloud Inference API. This client allowed the evaluation process to send requests transparently, handle authentication, manage generation parameters, and return outputs in the expected benchmark format. Experiments were also run in &lt;strong>Google Colab&lt;/strong>, which served as a practical environment for prototyping and running the models.&lt;/p>&lt;p>We used the HumanEvalX benchmark, an extension of the traditional HumanEval, covering five languages with the &lt;strong>Pass@1&lt;/strong> metric.&lt;/p>&lt;p>The evaluated models included:&lt;/p>&lt;ul>&lt;li>Granite 4.0 Micro (3B)&lt;/li>&lt;li>Granite 4.0 (1B)&lt;/li>&lt;li>Granite 4.0 h-tiny (7B)&lt;/li>&lt;li>Granite 4.0 h-small (30B) — via IBM Cloud&lt;/li>&lt;li>Granite 4.0 (350M)&lt;/li>&lt;li>Granite Code Instruct (8B) — via IBM Cloud&lt;/li>&lt;li>DeepSeek Coder (6.7B)&lt;/li>&lt;li>CodeLlama (7B)&lt;/li>&lt;/ul>&lt;p>The metric used was &lt;strong>Pass@1&lt;/strong>, following the benchmark protocol.&lt;/p>&lt;hr>&lt;h2 id="results-and-conclusions">Results and Conclusions&lt;/h2>&lt;figure>&lt;img src="https://llm-pt-ibm.github.io/images/heatmap_humanevalX.png" alt="Performance heatmap"/>&lt;figcaption> &lt;p>Performance heatmap of the models on HumanEvalX.&lt;/p> &lt;/figcaption>&lt;/figure>&lt;p>The evaluation revealed important behaviors:&lt;/p>&lt;h3 id="1-granite-40-h-small-stood-out-for-its-versatility">&lt;strong>1. granite-4.0-h-small stood out for its versatility&lt;/strong>&lt;/h3>&lt;p>It surpassed 60% &lt;strong>Pass@1&lt;/strong> in Java, C++, and JavaScript, while also maintaining over 50% in Python and Go.
This consistent performance across languages suggests that the model has good generalization capability, showing promise in scenarios that involve different programming ecosystems, although additional benchmarks and evaluations are important before drawing broader conclusions.&lt;/p>&lt;h3 id="2-granite-micro-3b-performed-above-expectations">&lt;strong>2. Granite Micro (3B) performed above expectations&lt;/strong>&lt;/h3>&lt;p>Despite being a small model, Granite Micro (3B) delivered 65.85% in JavaScript and 68.90% in Java, outperforming even some larger models evaluated. This shows that even with a compact architecture, it can deliver solid results, making it a highly efficient option for applications that require low computational cost without sacrificing performance.&lt;/p>&lt;h3 id="3-the-size-progression-350m--1b--3b--7b--30b-shows-gradual-and-coherent-evolution">&lt;strong>3. The size progression (350M → 1B → 3B → 7B → 30B) shows gradual and coherent evolution&lt;/strong>&lt;/h3>&lt;p>The results show that as we move through the different sizes of the Granite line, there is a coherent evolution in performance. Smaller models deliver stable results within their category, while larger ones gradually expand the ability to solve more complex tasks. This distribution helps clarify where each model fits in the usage spectrum.&lt;/p>&lt;h3 id="4-comparing-different-providers-helps-contextualize-the-results">&lt;strong>4. Comparing different providers helps contextualize the results&lt;/strong>&lt;/h3>&lt;p>Alongside the IBM models, we also evaluated models from other providers such as DeepSeek and Meta. In some languages, the differences were small, but in all of them there was at least one model from the Granite family that achieved the highest score. 
The Granite 4 Micro (3B) and Granite 4 h-small (30B) models were the standouts, with results that were close to, and in some cases above, those of models recognized as code specialists.&lt;/p>&lt;hr>&lt;h2 id="next-steps">Next Steps&lt;/h2>&lt;ul>&lt;li>Run the same Granite models on &lt;strong>LiveCodeBench&lt;/strong>, a broader benchmark that goes beyond &lt;strong>code generation&lt;/strong>, also evaluating &lt;strong>code execution&lt;/strong> and &lt;strong>test output prediction&lt;/strong>.&lt;/li>&lt;li>&lt;strong>Fine-tune Granite 4.0 Micro (3B) using InstructLab&lt;/strong> and observe the impact of this adaptation on the model’s performance in &lt;strong>HumanEvalX&lt;/strong>, comparing results before and after the adjustment.&lt;/li>&lt;/ul></description></item><item><title>Computação@UFCG Leads Brazil's Contributions to the HELM-Stanford Framework in Partnership with IBM</title><link>https://llm-pt-ibm.github.io/en/posts/contribuicao_helm/</link><pubDate>Wed, 09 Jul 2025 00:00:00 +0000</pubDate><guid>https://llm-pt-ibm.github.io/en/posts/contribuicao_helm/</guid><description>&lt;p>&lt;strong>Collaboration between UFCG’s Computer Science department and IBM makes the university the top Brazilian contributor to the &lt;a href="https://github.com/stanford-crfm/helm" rel="external">&lt;span class="link-personalizado">HELM-Stanford&lt;/span>&lt;/a> evaluation framework in 2025.&lt;/strong>&lt;/p>&lt;p>HELM-Stanford is one of the world’s leading frameworks for evaluating language models, measuring accuracy, robustness, and fairness. Being the top Brazilian contributor — through the partnership between Computação@UFCG and IBM — highlights Brazil’s leading role in developing fairer, safer, and more representative metrics for LLMs, especially in multilingual and culturally diverse contexts.&lt;/p>&lt;p>The partnership between Computação@UFCG and IBM resulted in 15 significant contributions to HELM-Stanford in 2025.
These contributions include adding Portuguese-language benchmarks, fixing bugs, improving source code, and including new evaluation sets, expanding the framework’s linguistic diversity and robustness.&lt;/p>&lt;p>The project, coordinated by Professor João Brunet with participation from Professors Fábio Morais and Leandro Balby, features a multidisciplinary team dedicated to LLM evaluation. The team also includes one professor from IFPB, three graduate students, three undergraduate students, and a professional with software development experience. IBM, as a project partner, has also assigned professionals to work directly on the collaboration. Together, the group has made meaningful contributions to advancing HELM-Stanford, with a focus on including the Portuguese language and continuously improving the framework.&lt;/p>&lt;figure>&lt;img src="https://llm-pt-ibm.github.io/images/carvalheira.jpeg" alt="Multidisciplinary project team"/>&lt;figcaption> &lt;p>Multidisciplinary project team&lt;/p> &lt;/figcaption>&lt;/figure></description></item><item><title>LLMs Inference API on IBM Power9 Server</title><link>https://llm-pt-ibm.github.io/en/posts/tutorial_power9_pt4_en/</link><pubDate>Thu, 03 Jul 2025 00:00:00 +0000</pubDate><guid>https://llm-pt-ibm.github.io/en/posts/tutorial_power9_pt4_en/</guid><description>&lt;h2 id="background">Background&lt;/h2>&lt;p>This is the fourth and final post in a tutorial series that aims to show step by step how to build an LLM API on a Power9 server, from operating system setup to remote inference execution.
We already configured the operating system, NVIDIA drivers, CUDA, and cuDNN in the &lt;a href="https://llm-pt-ibm.github.io/en/posts/tutorial_power9_pt1_en/">&lt;span class="link-personalizado">first post&lt;/span>&lt;/a>, installed Conda and PyTorch in the &lt;a href="https://llm-pt-ibm.github.io/en/posts/tutorial_power9_pt2_en/">&lt;span class="link-personalizado">second post&lt;/span>&lt;/a>, and built the API in the &lt;a href="https://llm-pt-ibm.github.io/en/posts/tutorial_power9_pt3_en/">&lt;span class="link-personalizado">third post&lt;/span>&lt;/a>. In this stage, we will present the built API and show how to make requests.&lt;/p>&lt;h2 id="tldr">TL;DR&lt;/h2>&lt;ul>&lt;li>This post introduces the built LLM inference API and how to use it.&lt;/li>&lt;li>We will show how to make requests using Python and curl.&lt;/li>&lt;/ul>&lt;h2 id="introducing-the-api">Introducing the API&lt;/h2>&lt;p>The API, built with FastAPI, supports loading specific models, keeping them in GPU memory for successive calls, and generating text from prompts sent via HTTP requests. It also provides API Key access control, memory management (loading and unloading models), support for multiple GPUs with automatic sharding, and endpoints for status queries. The goal is to provide a robust, production-ready service optimized for intensive use, ensuring fast inferences and easy integration with external applications.&lt;/p>&lt;h4 id="architecture-overview">Architecture Overview&lt;/h4>&lt;p>The API exposes LLMs via FastAPI with REST endpoints. The ModelManager handles loading, unloading, and model inference, keeping models in GPU memory for fast calls. Authentication is enforced via API Key. The architecture supports multiple GPUs with automatic sharding to optimize memory usage and performance.
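The ModelManager lifecycle just described (load once, serve successive calls from memory, unload to free it) boils down to a single-slot cache. A hypothetical sketch with an injected loader function standing in for the actual Transformers model loading (names and structure are illustrative, not the real implementation):

```python
class ModelManager:
    """Single-slot model cache: keep one model resident, reuse it across
    requests, release it on demand. Sketch of the pattern, not the real code."""

    def __init__(self, loader):
        self.loader = loader      # the real API would wrap AutoModelForCausalLM here
        self.model = None
        self.model_name = None

    def load(self, model_name):
        if self.model_name == model_name:
            return self.model     # already resident: skip the expensive reload
        self.unload()             # free the previous model first
        self.model = self.loader(model_name)
        self.model_name = model_name
        return self.model

    def unload(self):
        self.model = None         # the real API would also empty the CUDA cache
        self.model_name = None

# The loader runs only on the first request for a given model.
calls = []
manager = ModelManager(loader=lambda name: calls.append(name) or f"<{name}>")
manager.load("ibm-granite/granite-3.3-8b-instruct")
manager.load("ibm-granite/granite-3.3-8b-instruct")  # served from cache
print(len(calls))  # 1
```

Keeping the model resident is what makes successive `/generate` calls fast: only the first request pays the load cost.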
Models are sourced from Hugging Face and use the Transformers library to perform inferences.&lt;/p>&lt;figure>&lt;img src="https://llm-pt-ibm.github.io/images/arquitetura_api_llm_01_en.png" alt="Architecture diagram of the LLM inference API"/>&lt;figcaption> &lt;p>Architecture Diagram&lt;/p> &lt;/figcaption>&lt;/figure>&lt;h4 id="main-features">Main Features&lt;/h4>&lt;ul>&lt;li>&lt;p>&lt;strong>Load Models&lt;/strong>&lt;/p>&lt;ul>&lt;li>&lt;code>/load_model&lt;/code>&lt;/li>&lt;li>Loads a model from the Hugging Face Hub&lt;/li>&lt;li>Performs sharding across GPUs&lt;/li>&lt;li>Supports Hugging Face Token&lt;/li>&lt;/ul>&lt;/li>&lt;li>&lt;p>&lt;strong>Generate Text&lt;/strong>&lt;/p>&lt;ul>&lt;li>&lt;code>/generate&lt;/code>&lt;/li>&lt;li>Accepts prompt, max_tokens, model name, temperature, and top_p&lt;/li>&lt;li>Uses an already loaded model or loads a new one&lt;/li>&lt;li>Returns result in JSON&lt;/li>&lt;/ul>&lt;/li>&lt;li>&lt;p>&lt;strong>Management&lt;/strong>&lt;/p>&lt;ul>&lt;li>&lt;code>/status&lt;/code>: Checks the loaded model and device (CPU/GPU)&lt;/li>&lt;li>&lt;code>/unload_model&lt;/code>: Frees GPU and memory&lt;/li>&lt;li>&lt;code>/generate_apikey&lt;/code>: Creates API keys from LDAP user&lt;/li>&lt;/ul>&lt;/li>&lt;/ul>&lt;h4 id="usage-flow">Usage Flow&lt;/h4>&lt;figure>&lt;img src="https://llm-pt-ibm.github.io/images/arquitetura_api_llm_02_en.png" alt="Usage flow diagram of the LLM inference API"/>&lt;figcaption> &lt;p>Usage flow diagram&lt;/p> &lt;/figcaption>&lt;/figure>&lt;h4 id="inputs-and-endpoints">Inputs and Endpoints&lt;/h4>&lt;p>The table below describes the API endpoints, required inputs, and responses.&lt;/p>&lt;style>table { border-collapse: collapse; width: 100%;}th { background-color: #cccccc; text-align: center; padding: 8px; border: 1px solid #b3b3b3;}td { padding: 8px; border: 1px solid #ccc; text-align: left;}td.center { text-align: center;}caption { caption-side: bottom}&lt;/style>&lt;table> &lt;caption>Inputs and endpoints table&lt;/caption> &lt;thead> &lt;tr>
&lt;th>Endpoints&lt;/th> &lt;th>Method&lt;/th> &lt;th>API Key&lt;/th> &lt;th>Input (Body/Query)&lt;/th> &lt;th>Response&lt;/th> &lt;/tr> &lt;/thead> &lt;tbody> &lt;tr> &lt;td>&lt;code>/generate_apikey&lt;/code>&lt;/td> &lt;td class="center">POST&lt;/td> &lt;td class="center">❌&lt;/td> &lt;td class="center">{username}&lt;/td> &lt;td class="center">API Key&lt;/td> &lt;/tr> &lt;tr> &lt;td>&lt;code>/load_model&lt;/code>&lt;/td> &lt;td class="center">POST&lt;/td> &lt;td class="center">✅&lt;/td> &lt;td class="center">{model_name &lt;br> hf_token (optional) &lt;br> device (optional)}&lt;/td> &lt;td class="center">None, just loads the model&lt;/td> &lt;/tr> &lt;tr> &lt;td>&lt;code>/generate&lt;/code>&lt;/td> &lt;td class="center">POST&lt;/td> &lt;td class="center">✅&lt;/td> &lt;td class="center">{model_name &lt;br> prompt &lt;br> hf_token (optional) &lt;br> max_tokens (optional) &lt;br> temperature (optional) &lt;br> top_p (optional)}&lt;/td> &lt;td class="center">Text generated by the model&lt;/td> &lt;/tr> &lt;tr> &lt;td>&lt;code>/status&lt;/code>&lt;/td> &lt;td class="center">GET&lt;/td> &lt;td class="center">✅&lt;/td> &lt;td class="center">None&lt;/td> &lt;td class="center">Model status and the device it is loaded on&lt;/td> &lt;/tr> &lt;tr> &lt;td>&lt;code>/unload_model&lt;/code>&lt;/td> &lt;td class="center">POST&lt;/td> &lt;td class="center">✅&lt;/td> &lt;td class="center">None&lt;/td> &lt;td class="center">None, just unloads the model&lt;/td> &lt;/tr> &lt;/tbody>&lt;/table>&lt;h2 id="how-to-use-the-api-with-python">How to Use the API with Python&lt;/h2>&lt;h4 id="generate-api-key">Generate API Key&lt;/h4>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">
1&lt;/span>&lt;span>&lt;span style="color:#f92672">import&lt;/span> requests&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 2&lt;/span>&lt;span>&lt;span style="color:#f92672">import&lt;/span> json&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 3&lt;/span>&lt;span>&lt;span style="color:#f92672">import&lt;/span> os&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 4&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 5&lt;/span>&lt;span>url &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&amp;#34;http://&amp;lt;power9_ip_server&amp;gt;:8000/&amp;#34;&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 6&lt;/span>&lt;span>username &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#f92672">&amp;lt;&lt;/span>ldap_user&lt;span style="color:#f92672">&amp;gt;&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 7&lt;/span>&lt;span>hf_token &lt;span style="color:#f92672">=&lt;/span> os&lt;span style="color:#f92672">.&lt;/span>getenv(&lt;span style="color:#e6db74">&amp;#34;HUGGINGFACE_TOKEN&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span 
style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 8&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 9&lt;/span>&lt;span>response &lt;span style="color:#f92672">=&lt;/span> requests&lt;span style="color:#f92672">.&lt;/span>post(&lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">{&lt;/span>url&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74">/generate_apikey&amp;#34;&lt;/span>, json&lt;span style="color:#f92672">=&lt;/span>{&lt;span style="color:#e6db74">&amp;#34;username&amp;#34;&lt;/span>: username})&lt;span style="color:#f92672">.&lt;/span>content&lt;span style="color:#f92672">.&lt;/span>decode()&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">10&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">11&lt;/span>&lt;span>api_key &lt;span style="color:#f92672">=&lt;/span> json&lt;span style="color:#f92672">.&lt;/span>loads(response)&lt;span style="color:#f92672">.&lt;/span>get(&lt;span style="color:#e6db74">&amp;#34;api_key&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ul>&lt;li>It is important that the Hugging Face Token is set as an environment variable in the location where the inference will run.&lt;/li>&lt;li>&lt;code>api_key&lt;/code> will be the return value of the called function.&lt;/li>&lt;/ul>&lt;h4 id="load-model">Load Model&lt;/h4>&lt;p>First, we need to create a header containing the API Key returned from the code above and the payload 
with &lt;code>model_name&lt;/code> and the Hugging Face token &lt;code>hf_token&lt;/code>. After that, we can send the request with these two pieces of information.&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1&lt;/span>&lt;span>headers &lt;span style="color:#f92672">=&lt;/span> {&lt;span style="color:#e6db74">&amp;#34;Content-Type&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;application/json&amp;#34;&lt;/span>,&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">2&lt;/span>&lt;span>&lt;span style="color:#e6db74">&amp;#34;x-api-key&amp;#34;&lt;/span>: api_key}&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">3&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">4&lt;/span>&lt;span>payload &lt;span style="color:#f92672">=&lt;/span> {&lt;span style="color:#e6db74">&amp;#34;model_name&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;ibm-granite/granite-3.3-8b-instruct&amp;#34;&lt;/span>,&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">5&lt;/span>&lt;span> &lt;span style="color:#e6db74">&amp;#34;hf_token&amp;#34;&lt;/span>: hf_token}&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span 
style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">6&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">7&lt;/span>&lt;span>resp &lt;span style="color:#f92672">=&lt;/span> requests&lt;span style="color:#f92672">.&lt;/span>post(&lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">{&lt;/span>url&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74">/load_model&amp;#34;&lt;/span>, headers&lt;span style="color:#f92672">=&lt;/span>headers, json&lt;span style="color:#f92672">=&lt;/span>payload)&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h4 id="generate-text">Generate Text&lt;/h4>&lt;p>Now we need to create a new payload with the necessary information to generate text with an LLM, which includes: &lt;code>prompt&lt;/code>, &lt;code>model_name&lt;/code>, and &lt;code>hf_token&lt;/code>.&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1&lt;/span>&lt;span>payload &lt;span style="color:#f92672">=&lt;/span> {&lt;span style="color:#e6db74">&amp;#34;prompt&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;Hello, tell me a little about the Federal University of Campina Grande (UFCG)&amp;#34;&lt;/span>,&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">2&lt;/span>&lt;span> &lt;span 
style="color:#e6db74">&amp;#34;model_name&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;ibm-granite/granite-3.3-8b-instruct&amp;#34;&lt;/span>,&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">3&lt;/span>&lt;span> &lt;span style="color:#e6db74">&amp;#34;hf_token&amp;#34;&lt;/span>: hf_token}&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">4&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">5&lt;/span>&lt;span>resp &lt;span style="color:#f92672">=&lt;/span> requests&lt;span style="color:#f92672">.&lt;/span>post(&lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">{&lt;/span>url&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74">/generate&amp;#34;&lt;/span>, headers&lt;span style="color:#f92672">=&lt;/span>headers, json&lt;span style="color:#f92672">=&lt;/span>payload)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">6&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">7&lt;/span>&lt;span>resp &lt;span style="color:#f92672">=&lt;/span> json&lt;span style="color:#f92672">.&lt;/span>loads(resp&lt;span style="color:#f92672">.&lt;/span>content&lt;span style="color:#f92672">.&lt;/span>decode())&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h4 
id="check-status-and-unload-the-model">Check status and unload the model&lt;/h4>&lt;p>To check the status and unload the model, we don&amp;rsquo;t need to send anything in the payload—just the header with the API key:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1&lt;/span>&lt;span>requests&lt;span style="color:#f92672">.&lt;/span>get(&lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">{&lt;/span>url&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74">/status&amp;#34;&lt;/span>, headers&lt;span style="color:#f92672">=&lt;/span>headers)&lt;span style="color:#f92672">.&lt;/span>content&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1&lt;/span>&lt;span>resp &lt;span style="color:#f92672">=&lt;/span> requests&lt;span style="color:#f92672">.&lt;/span>post(&lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">{&lt;/span>url&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74">/unload_model&amp;#34;&lt;/span>, headers&lt;span style="color:#f92672">=&lt;/span>headers)&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="how-to-use-the-api-with-curl-in-cli">How to use the API with curl in CLI&lt;/h2>&lt;h4 id="generate-api-key-1">Generate API 
Key&lt;/h4>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>curl -X POST &lt;span style="color:#e6db74">&amp;#34;http://&amp;lt;power9_ip_server&amp;gt;:8000/generate_apikey&amp;#34;&lt;/span> &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> -H &lt;span style="color:#e6db74">&amp;#34;Content-Type: application/json&amp;#34;&lt;/span> &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> -d &lt;span style="color:#e6db74">&amp;#39;{&amp;#34;username&amp;#34;: &amp;lt;ldap_user&amp;gt;}&amp;#39;&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ul>&lt;li>It is important that the Hugging Face Token is set as an environment variable in the location where the inference will run.&lt;/li>&lt;li>The user in the &lt;code>username&lt;/code> field must be enclosed in double quotation marks (&amp;quot;&amp;quot;).&lt;/li>&lt;li>After running the request above, the returned API key should be saved as an environment variable to make future executions easier.
To save it, copy the returned API key and run the command:&lt;/li>&lt;/ul>&lt;pre tabindex="0">&lt;code>export API_KEY=&amp;lt;returned_api_key&amp;gt;&lt;/code>&lt;/pre>&lt;h4 id="load-model-1">Load Model&lt;/h4>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>curl -X POST &lt;span style="color:#e6db74">&amp;#34;http://&amp;lt;power9_ip_server&amp;gt;:8000/load_model&amp;#34;&lt;/span> &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> -H &lt;span style="color:#e6db74">&amp;#34;Content-Type: application/json&amp;#34;&lt;/span> &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> -H &lt;span style="color:#e6db74">&amp;#34;x-api-key: &lt;/span>$API_KEY&lt;span style="color:#e6db74">&amp;#34;&lt;/span> &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> -d &lt;span style="color:#e6db74">&amp;#39;{&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> &amp;#34;model_name&amp;#34;:&amp;#34;ibm-granite/granite-3.3-8b-instruct&amp;#34;,&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> &amp;#34;hf_token&amp;#34;:&amp;#34;&amp;#39;&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>$HUGGINGFACE_TOKEN&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">&amp;#39;&amp;#34;&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> }&amp;#39;&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h4 id="generate-text-1">Generate Text&lt;/h4>&lt;div
class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>curl -X POST &lt;span style="color:#e6db74">&amp;#34;http://&amp;lt;power9_ip_server&amp;gt;:8000/generate&amp;#34;&lt;/span> &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> -H &lt;span style="color:#e6db74">&amp;#34;Content-Type: application/json&amp;#34;&lt;/span> &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> -H &lt;span style="color:#e6db74">&amp;#34;x-api-key: &lt;/span>$API_KEY&lt;span style="color:#e6db74">&amp;#34;&lt;/span> &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> -d &lt;span style="color:#e6db74">&amp;#39;{&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> &amp;#34;model_name&amp;#34;: &amp;#34;ibm-granite/granite-3.3-8b-instruct&amp;#34;,&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> &amp;#34;prompt&amp;#34;:&amp;#34;Hello, tell me a little about the Federal University of Campina Grande (UFCG)&amp;#34;,&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> &amp;#34;hf_token&amp;#34;: &amp;#34;&amp;#39;&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>$HUGGINGFACE_TOKEN&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">&amp;#39;&amp;#34;,&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> &amp;#34;max_tokens&amp;#34;:50&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span
style="color:#e6db74"> }&amp;#39;&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h4 id="check-status-and-unload-the-model-1">Check status and unload the model&lt;/h4>&lt;p>To check the status and unload the model, we don&amp;rsquo;t need to send anything in the payload—just the header with the API key:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>curl -X GET &lt;span style="color:#e6db74">&amp;#34;http://&amp;lt;power9_ip_server&amp;gt;:8000/status&amp;#34;&lt;/span> &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> -H &lt;span style="color:#e6db74">&amp;#34;Content-Type: application/json&amp;#34;&lt;/span> &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> -H &lt;span style="color:#e6db74">&amp;#34;x-api-key: &lt;/span>$API_KEY&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>curl -X POST &lt;span style="color:#e6db74">&amp;#34;http://&amp;lt;power9_ip_server&amp;gt;:8000/unload_model&amp;#34;&lt;/span> &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> -H &lt;span style="color:#e6db74">&amp;#34;Content-Type: application/json&amp;#34;&lt;/span> &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> -H &lt;span 
style="color:#e6db74">&amp;#34;x-api-key: &lt;/span>$API_KEY&lt;span style="color:#e6db74">&amp;#34;&lt;/span> &lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>We hope this series has helped clarify the full development and deployment process. The LLM-IBM-UFCG team is available for questions or suggestions about future improvements.&lt;/p></description></item><item><title>Building an API for LLM inferences on IBM Power9 servers</title><link>https://llm-pt-ibm.github.io/en/posts/tutorial_power9_pt3_en/</link><pubDate>Wed, 02 Jul 2025 00:00:00 +0000</pubDate><guid>https://llm-pt-ibm.github.io/en/posts/tutorial_power9_pt3_en/</guid><description>&lt;h2 id="background">Background&lt;/h2>&lt;p>This is the third post in a tutorial series designed to show step by step how to build a LLM API on a Power9 server, from operating system setup to remote inference execution. We already configured the operating system, NVIDIA drivers, CUDA, and cuDNN in the &lt;a href="https://llm-pt-ibm.github.io/en/posts/tutorial_power9_pt1_en/">&lt;span class="link-personalizado">first post&lt;/span>&lt;/a>, and installed Conda and PyTorch in the &lt;a href="https://llm-pt-ibm.github.io/en/posts/tutorial_power9_pt2_en/">&lt;span class="link-personalizado">second post&lt;/span>&lt;/a>. In this stage, we will build the API using FastAPI and the Transformers library, downloading models from Hugging Face and running the web server with uvicorn.&lt;/p>&lt;p>The implemented API will support generating API keys, loading models, performing inferences, checking status, and unloading models.&lt;/p>&lt;p>&lt;strong>FastAPI&lt;/strong>: a modern web framework for building APIs with Python 3.8+, based on static typing and async programming. It is designed to be fast, easy to use, and robust, making API development more efficient.&lt;/p>&lt;p>&lt;strong>Transformers&lt;/strong>: an open-source library developed by Hugging Face. 
It offers easy and efficient access to a wide collection of state-of-the-art pretrained models for Natural Language Processing (NLP), computer vision, and audio.&lt;/p>&lt;p>&lt;strong>Hugging Face&lt;/strong>: a platform focused on artificial intelligence, known for hosting models for NLP and other tasks. The Hugging Face Hub is a collaborative repository where developers and researchers can share, version, and download ready-to-use models, making access and integration easier.&lt;/p>&lt;p>&lt;strong>Uvicorn&lt;/strong>: a high-performance ASGI (Asynchronous Server Gateway Interface) web server for asynchronous Python applications.&lt;/p>&lt;h2 id="tldr">TL;DR&lt;/h2>&lt;ul>&lt;li>This post provides a step-by-step guide to implementing an API that performs LLM inferences.&lt;/li>&lt;li>We will use FastAPI and Transformers to develop this API and Hugging Face to download the models.&lt;/li>&lt;/ul>&lt;h2 id="environment-setup">Environment Setup&lt;/h2>&lt;h4 id="directory-structure">Directory Structure&lt;/h4>&lt;p>Start by creating the basic project structure:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-txt" data-lang="txt">&lt;span style="display:flex;">&lt;span>model_api/&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>├── requirements.txt&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>├── app/&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ ├── __init__.py&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ ├── main.py&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ ├── schemas.py&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ ├── auth.py&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ ├── model_manager.py&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ ├── utils.py&lt;/span>&lt;/span>&lt;span
style="display:flex;">&lt;span>│ └── apikey_store.json&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>└── README.md (optional)&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h4 id="requirementstxt-file">&lt;code>requirements.txt&lt;/code> File&lt;/h4>&lt;p>We will use FastAPI and Transformers to build the API. Additionally, we will use uvicorn to run the server, pydantic for input data validation, and torch, which we installed in the &lt;a href="https://llm-pt-ibm.github.io/en/posts/tutorial_power9_pt2_en/">previous tutorial&lt;/a>.&lt;/p>&lt;p>First, we&amp;rsquo;ll install the required libraries and then populate the &lt;code>requirements.txt&lt;/code> file. Remember to activate your &lt;code>conda&lt;/code> environment if you created one, to ensure proper use of &lt;code>pytorch&lt;/code>.&lt;/p>&lt;pre tabindex="0">&lt;code>conda activate llm_apipip install fastapi uvicorn transformers&lt;/code>&lt;/pre>&lt;p>The &lt;code>requirements.txt&lt;/code> file will look like this:&lt;/p>&lt;p>&lt;strong>requirements.txt&lt;/strong>&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-txt" data-lang="txt">&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1&lt;/span>&lt;span>fastapi&amp;gt;=0.104.0&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">2&lt;/span>&lt;span>uvicorn&amp;gt;=0.24.0&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">3&lt;/span>&lt;span>torch&amp;gt;=2.0.0&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span 
style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">4&lt;/span>&lt;span>transformers&amp;gt;=4.35.0&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">5&lt;/span>&lt;span>pydantic&amp;gt;=2.0.0&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h4 id="api-key-storage-file">API Key Storage File&lt;/h4>&lt;p>The &lt;code>apikey_store.json&lt;/code> file will store the generated API keys. We will start with it empty, containing only &lt;code>{}&lt;/code>.&lt;/p>&lt;p>&lt;strong>apikey_store.json&lt;/strong>&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1&lt;/span>&lt;span>{}&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="schemas-and-data-validation">Schemas and Data Validation&lt;/h2>&lt;p>Schemas are essential for validating the API&amp;rsquo;s input and output data. They ensure data is in the correct format and enable automatic documentation generation.&lt;/p>&lt;p>We will create the &lt;code>app/schemas.py&lt;/code> file containing all the data models. 
We will define four models: &lt;code>GenerateRequest&lt;/code>, &lt;code>LoadModelRequest&lt;/code>, &lt;code>ApiKeyResponse&lt;/code>, and &lt;code>LDAPUserRequest&lt;/code>.&lt;/p>&lt;p>&lt;strong>schemas.py&lt;/strong>&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 1&lt;/span>&lt;span>&lt;span style="color:#f92672">from&lt;/span> pydantic &lt;span style="color:#f92672">import&lt;/span> BaseModel, Field&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 2&lt;/span>&lt;span>&lt;span style="color:#f92672">from&lt;/span> typing &lt;span style="color:#f92672">import&lt;/span> Optional&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 3&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 4&lt;/span>&lt;span>&lt;span style="color:#66d9ef">class&lt;/span> &lt;span style="color:#a6e22e">GenerateRequest&lt;/span>(BaseModel):&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 5&lt;/span>&lt;span> model_name: str &lt;span style="color:#f92672">=&lt;/span> Field(&lt;span style="color:#f92672">...&lt;/span>, description&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;The name of the model to use for 
generation.&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 6&lt;/span>&lt;span> prompt: str &lt;span style="color:#f92672">=&lt;/span> Field(&lt;span style="color:#f92672">...&lt;/span>, description&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;The input text to generate a response for.&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 7&lt;/span>&lt;span> max_tokens: Optional[int] &lt;span style="color:#f92672">=&lt;/span> Field(&lt;span style="color:#ae81ff">300&lt;/span>, description&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;The maximum length of the generated response.&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 8&lt;/span>&lt;span> temperature: Optional[float] &lt;span style="color:#f92672">=&lt;/span> Field(&lt;span style="color:#ae81ff">1.0&lt;/span>, description&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;The sampling temperature for generation.&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 9&lt;/span>&lt;span> top_p: Optional[float] &lt;span style="color:#f92672">=&lt;/span> Field(&lt;span style="color:#ae81ff">1.0&lt;/span>, description&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;The cumulative probability for nucleus sampling.&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span 
style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">10&lt;/span>&lt;span> hf_token: Optional[str] &lt;span style="color:#f92672">=&lt;/span> Field(&lt;span style="color:#66d9ef">None&lt;/span>, description&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;The Hugging Face tokenizer to use, if applicable.&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">11&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">12&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">13&lt;/span>&lt;span>&lt;span style="color:#66d9ef">class&lt;/span> &lt;span style="color:#a6e22e">LoadModelRequest&lt;/span>(BaseModel):&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">14&lt;/span>&lt;span> model_name: str &lt;span style="color:#f92672">=&lt;/span> Field(&lt;span style="color:#f92672">...&lt;/span>, description&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;The name of the model to load.&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">15&lt;/span>&lt;span> device: Optional[str] &lt;span style="color:#f92672">=&lt;/span> Field(&lt;span style="color:#e6db74">&amp;#34;cuda&amp;#34;&lt;/span>, 
description&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;The device to load the model on (e.g., &amp;#39;cpu&amp;#39;, &amp;#39;cuda&amp;#39;).&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">16&lt;/span>&lt;span>    hf_token: Optional[str] &lt;span style="color:#f92672">=&lt;/span> Field(&lt;span style="color:#66d9ef">None&lt;/span>, description&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;The Hugging Face access token to use, if applicable.&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">17&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">18&lt;/span>&lt;span>&lt;span style="color:#66d9ef">class&lt;/span> &lt;span style="color:#a6e22e">ApiKeyResponse&lt;/span>(BaseModel):&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">19&lt;/span>&lt;span>    api_key: str &lt;span style="color:#f92672">=&lt;/span> Field(&lt;span style="color:#f92672">...&lt;/span>, description&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;The API key for accessing the model API.&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">20&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span 
style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">21&lt;/span>&lt;span>&lt;span style="color:#66d9ef">class&lt;/span> &lt;span style="color:#a6e22e">LDAPUserRequest&lt;/span>(BaseModel):&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">22&lt;/span>&lt;span>    username: str &lt;span style="color:#f92672">=&lt;/span> Field(&lt;span style="color:#f92672">...&lt;/span>, description&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;The username for LDAP authentication.&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ul>&lt;li>All classes inherit from &lt;code>pydantic&lt;/code>&amp;rsquo;s &lt;code>BaseModel&lt;/code>, gaining validation, serialization, and automatic documentation features.&lt;/li>&lt;li>The &lt;code>Field(...)&lt;/code> declaration defines a required field with no default value.&lt;/li>&lt;li>The &lt;code>Field(value)&lt;/code> declaration defines an optional field with &lt;code>value&lt;/code> as its default.&lt;/li>&lt;li>The &lt;code>Optional[type]&lt;/code> annotation indicates the field is optional but must be of type &lt;code>type&lt;/code> if provided.&lt;/li>&lt;/ul>&lt;p>With the schemas defined, let&amp;rsquo;s create the file responsible for API Key authentication.&lt;/p>&lt;h2 id="authentication-and-api-keys">Authentication and API Keys&lt;/h2>&lt;p>The authentication system protects your API by ensuring that only authorized users can access the endpoints. 
We will implement a mechanism based on API Keys.&lt;/p>&lt;p>Let&amp;rsquo;s create the &lt;code>app/auth.py&lt;/code> file with all the authentication functionalities.&lt;/p>&lt;p>&lt;strong>auth.py&lt;/strong>&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 1&lt;/span>&lt;span>&lt;span style="color:#f92672">import&lt;/span> secrets &lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 2&lt;/span>&lt;span>&lt;span style="color:#f92672">import&lt;/span> json&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 3&lt;/span>&lt;span>&lt;span style="color:#f92672">from&lt;/span> fastapi &lt;span style="color:#f92672">import&lt;/span> HTTPException, Request&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 4&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 5&lt;/span>&lt;span>APIKEY_STORE_FILE &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&amp;#34;app/apikey_store.json&amp;#34;&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 
6&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 7&lt;/span>&lt;span>&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">load_apikeys&lt;/span>():&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 8&lt;/span>&lt;span> &lt;span style="color:#66d9ef">try&lt;/span>:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 9&lt;/span>&lt;span> &lt;span style="color:#66d9ef">with&lt;/span> open(APIKEY_STORE_FILE, &lt;span style="color:#e6db74">&amp;#34;r&amp;#34;&lt;/span>) &lt;span style="color:#66d9ef">as&lt;/span> f:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">10&lt;/span>&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> json&lt;span style="color:#f92672">.&lt;/span>load(f)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">11&lt;/span>&lt;span> &lt;span style="color:#66d9ef">except&lt;/span> &lt;span style="color:#a6e22e">FileNotFoundError&lt;/span>:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">12&lt;/span>&lt;span> &lt;span style="color:#66d9ef">raise&lt;/span> HTTPException(&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 
0.4em 0 0.4em;color:#7f7f7f">13&lt;/span>&lt;span> status_code&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">404&lt;/span>,&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">14&lt;/span>&lt;span> detail&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;API keys file not found: &lt;/span>&lt;span style="color:#e6db74">{&lt;/span>APIKEY_STORE_FILE&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">15&lt;/span>&lt;span> &lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">16&lt;/span>&lt;span>&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">save_apikeys&lt;/span>(keys: dict):&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">17&lt;/span>&lt;span> &lt;span style="color:#66d9ef">with&lt;/span> open(APIKEY_STORE_FILE, &lt;span style="color:#e6db74">&amp;#34;w&amp;#34;&lt;/span>) &lt;span style="color:#66d9ef">as&lt;/span> f:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">18&lt;/span>&lt;span> json&lt;span style="color:#f92672">.&lt;/span>dump(keys, f, indent&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">4&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span 
style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">19&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">20&lt;/span>&lt;span>&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">generate_apikey&lt;/span>(user:str) &lt;span style="color:#f92672">-&amp;gt;&lt;/span> str:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">21&lt;/span>&lt;span> key &lt;span style="color:#f92672">=&lt;/span> secrets&lt;span style="color:#f92672">.&lt;/span>token_hex(&lt;span style="color:#ae81ff">32&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">22&lt;/span>&lt;span> keys &lt;span style="color:#f92672">=&lt;/span> load_apikeys()&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">23&lt;/span>&lt;span> keys[user] &lt;span style="color:#f92672">=&lt;/span> key&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">24&lt;/span>&lt;span> save_apikeys(keys)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">25&lt;/span>&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> key&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span 
style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">26&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">27&lt;/span>&lt;span>&lt;span style="color:#66d9ef">async&lt;/span> &lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">verify_apikey&lt;/span>(request: Request) &lt;span style="color:#f92672">-&amp;gt;&lt;/span> bool:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">28&lt;/span>&lt;span> apikey &lt;span style="color:#f92672">=&lt;/span> request&lt;span style="color:#f92672">.&lt;/span>headers&lt;span style="color:#f92672">.&lt;/span>get(&lt;span style="color:#e6db74">&amp;#34;x-API-Key&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">29&lt;/span>&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> &lt;span style="color:#f92672">not&lt;/span> apikey:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">30&lt;/span>&lt;span> &lt;span style="color:#66d9ef">raise&lt;/span> HTTPException(&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">31&lt;/span>&lt;span> status_code&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">401&lt;/span>,&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span 
style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">32&lt;/span>&lt;span>            detail&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;API key not provided.&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">33&lt;/span>&lt;span>    &lt;span style="color:#66d9ef">try&lt;/span>:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">34&lt;/span>&lt;span>        keys &lt;span style="color:#f92672">=&lt;/span> load_apikeys()&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">35&lt;/span>&lt;span>        &lt;span style="color:#66d9ef">if&lt;/span> apikey &lt;span style="color:#f92672">in&lt;/span> keys&lt;span style="color:#f92672">.&lt;/span>values():&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">36&lt;/span>&lt;span>            &lt;span style="color:#66d9ef">return&lt;/span> &lt;span style="color:#66d9ef">True&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">37&lt;/span>&lt;span>        &lt;span style="color:#66d9ef">raise&lt;/span> HTTPException(status_code&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">403&lt;/span>, detail&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;Invalid API Key&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">38&lt;/span>&lt;span>    &lt;span style="color:#66d9ef">except&lt;/span> json&lt;span style="color:#f92672">.&lt;/span>JSONDecodeError:&lt;/span>&lt;/span>&lt;span 
style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">39&lt;/span>&lt;span> &lt;span style="color:#66d9ef">raise&lt;/span> HTTPException(&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">40&lt;/span>&lt;span> status_code&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">403&lt;/span>,&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">41&lt;/span>&lt;span> detail&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;Invalid API Key&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ul>&lt;li>The &lt;code>load_apikeys&lt;/code> function loads the information stored in the &lt;code>app/apikey_store.json&lt;/code> file.&lt;/li>&lt;li>&lt;code>save_apikeys&lt;/code> is responsible for saving the content in JSON format.&lt;/li>&lt;li>The &lt;code>generate_apikey&lt;/code> function creates a key for a user and adds it to the dictionary using the provided username as the key.&lt;/li>&lt;li>&lt;code>verify_apikey&lt;/code> will be called whenever a request arrives, to perform validation.&lt;/li>&lt;/ul>&lt;h2 id="model-and-gpu-manager">Model and GPU Manager&lt;/h2>&lt;p>The &lt;code>app/model_manager.py&lt;/code> is the core of the API, responsible for loading, managing, and running llm. 
It optimizes GPU/CPU usage and ensures efficient text generation.&lt;/p>&lt;p>&lt;strong>model_manager.py&lt;/strong>&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 1&lt;/span>&lt;span>&lt;span style="color:#f92672">import&lt;/span> torch &lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 2&lt;/span>&lt;span>&lt;span style="color:#f92672">from&lt;/span> transformers &lt;span style="color:#f92672">import&lt;/span> AutoTokenizer, AutoModelForCausalLM&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 3&lt;/span>&lt;span>&lt;span style="color:#f92672">from&lt;/span> fastapi &lt;span style="color:#f92672">import&lt;/span> HTTPException&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 4&lt;/span>&lt;span>&lt;span style="color:#f92672">import&lt;/span> gc&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 5&lt;/span>&lt;span>&lt;span style="color:#f92672">from&lt;/span> .utils &lt;span style="color:#f92672">import&lt;/span> is_model_on_gpu&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 
6&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 7&lt;/span>&lt;span>DEVICE &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&amp;#34;cuda&amp;#34;&lt;/span> &lt;span style="color:#66d9ef">if&lt;/span> torch&lt;span style="color:#f92672">.&lt;/span>cuda&lt;span style="color:#f92672">.&lt;/span>is_available() &lt;span style="color:#66d9ef">else&lt;/span> &lt;span style="color:#e6db74">&amp;#34;cpu&amp;#34;&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 8&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 9&lt;/span>&lt;span>&lt;span style="color:#66d9ef">class&lt;/span> &lt;span style="color:#a6e22e">ModelManager&lt;/span>:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">10&lt;/span>&lt;span> &lt;span style="color:#66d9ef">def&lt;/span> __init__(self):&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">11&lt;/span>&lt;span> self&lt;span style="color:#f92672">.&lt;/span>model &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#66d9ef">None&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">12&lt;/span>&lt;span> self&lt;span style="color:#f92672">.&lt;/span>tokenizer &lt;span 
style="color:#f92672">=&lt;/span> &lt;span style="color:#66d9ef">None&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">13&lt;/span>&lt;span> self&lt;span style="color:#f92672">.&lt;/span>model_name &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#66d9ef">None&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">14&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">15&lt;/span>&lt;span> &lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">load_model&lt;/span>(self, model_name: str, hf_token:str &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#66d9ef">None&lt;/span>, device: str &lt;span style="color:#f92672">=&lt;/span> DEVICE):&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">16&lt;/span>&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>model_name &lt;span style="color:#f92672">!=&lt;/span> &lt;span style="color:#66d9ef">None&lt;/span> &lt;span style="color:#f92672">and&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>model_name &lt;span style="color:#f92672">!=&lt;/span> model_name:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">17&lt;/span>&lt;span> print(&lt;span style="color:#e6db74">&amp;#34;Removing previously loaded 
model...&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">18&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">19&lt;/span>&lt;span> self&lt;span style="color:#f92672">.&lt;/span>unload_model() &lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">20&lt;/span>&lt;span> print(&lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;Loading model &lt;/span>&lt;span style="color:#e6db74">{&lt;/span>model_name&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74"> on device &lt;/span>&lt;span style="color:#e6db74">{&lt;/span>device&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74">...&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">21&lt;/span>&lt;span> &lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">22&lt;/span>&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>model_name &lt;span style="color:#f92672">!=&lt;/span> model_name:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">23&lt;/span>&lt;span> &lt;span style="color:#66d9ef">try&lt;/span>: &lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span 
style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">24&lt;/span>&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> hf_token: &lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">25&lt;/span>&lt;span> self&lt;span style="color:#f92672">.&lt;/span>tokenizer &lt;span style="color:#f92672">=&lt;/span> AutoTokenizer&lt;span style="color:#f92672">.&lt;/span>from_pretrained(model_name, token&lt;span style="color:#f92672">=&lt;/span>hf_token)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">26&lt;/span>&lt;span> self&lt;span style="color:#f92672">.&lt;/span>model &lt;span style="color:#f92672">=&lt;/span> AutoModelForCausalLM&lt;span style="color:#f92672">.&lt;/span>from_pretrained(model_name, device_map&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;balanced&amp;#34;&lt;/span>, token&lt;span style="color:#f92672">=&lt;/span>hf_token)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">27&lt;/span>&lt;span> &lt;span style="color:#66d9ef">else&lt;/span>:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">28&lt;/span>&lt;span> self&lt;span style="color:#f92672">.&lt;/span>tokenizer &lt;span style="color:#f92672">=&lt;/span> AutoTokenizer&lt;span style="color:#f92672">.&lt;/span>from_pretrained(model_name)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 
0.4em 0 0.4em;color:#7f7f7f">29&lt;/span>&lt;span> self&lt;span style="color:#f92672">.&lt;/span>model &lt;span style="color:#f92672">=&lt;/span> AutoModelForCausalLM&lt;span style="color:#f92672">.&lt;/span>from_pretrained(model_name, device_map&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;balanced&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">30&lt;/span>&lt;span> self&lt;span style="color:#f92672">.&lt;/span>model&lt;span style="color:#f92672">.&lt;/span>eval()&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">31&lt;/span>&lt;span> self&lt;span style="color:#f92672">.&lt;/span>model_name &lt;span style="color:#f92672">=&lt;/span> model_name&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">32&lt;/span>&lt;span> print(is_model_on_gpu(self&lt;span style="color:#f92672">.&lt;/span>model&lt;span style="color:#f92672">.&lt;/span>hf_device_map, self&lt;span style="color:#f92672">.&lt;/span>model_name))&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">33&lt;/span>&lt;span> &lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">34&lt;/span>&lt;span> &lt;span style="color:#66d9ef">except&lt;/span> &lt;span style="color:#a6e22e">Exception&lt;/span> &lt;span style="color:#66d9ef">as&lt;/span> e:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span 
style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">35&lt;/span>&lt;span>                &lt;span style="color:#66d9ef">raise&lt;/span> HTTPException(status_code&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">500&lt;/span>, detail&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;Error loading model: &lt;/span>&lt;span style="color:#e6db74">{&lt;/span>str(e)&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">36&lt;/span>&lt;span>        &lt;span style="color:#66d9ef">else&lt;/span>:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">37&lt;/span>&lt;span>            print(&lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;The model &lt;/span>&lt;span style="color:#e6db74">{&lt;/span>model_name&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74"> is already loaded.&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">38&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">39&lt;/span>&lt;span>    &lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">generate&lt;/span>(self, model_name:str, hf_token: str, prompt:str, max_tokens:int &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">300&lt;/span>, 
temperature:float &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">1.0&lt;/span>, top_p:float &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">1.0&lt;/span>) &lt;span style="color:#f92672">-&amp;gt;&lt;/span> str:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">40&lt;/span>&lt;span> &lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">41&lt;/span>&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>model_name &lt;span style="color:#f92672">!=&lt;/span> model_name:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">42&lt;/span>&lt;span> self&lt;span style="color:#f92672">.&lt;/span>load_model(model_name, hf_token, device&lt;span style="color:#f92672">=&lt;/span>DEVICE)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">43&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">44&lt;/span>&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>model &lt;span style="color:#f92672">is&lt;/span> &lt;span style="color:#66d9ef">None&lt;/span> &lt;span style="color:#f92672">or&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>tokenizer &lt;span style="color:#f92672">is&lt;/span> &lt;span style="color:#66d9ef">None&lt;/span>:&lt;/span>&lt;/span>&lt;span 
style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">45&lt;/span>&lt;span> &lt;span style="color:#66d9ef">raise&lt;/span> HTTPException(status_code&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">400&lt;/span>, detail&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;No model loaded.&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">46&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">47&lt;/span>&lt;span> &lt;span style="color:#66d9ef">try&lt;/span>:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">48&lt;/span>&lt;span> inputs &lt;span style="color:#f92672">=&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>tokenizer(prompt, return_tensors&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;pt&amp;#34;&lt;/span>)&lt;span style="color:#f92672">.&lt;/span>to(self&lt;span style="color:#f92672">.&lt;/span>model&lt;span style="color:#f92672">.&lt;/span>device)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">49&lt;/span>&lt;span> &lt;span style="color:#66d9ef">with&lt;/span> torch&lt;span style="color:#f92672">.&lt;/span>no_grad(): &lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 
0.4em;color:#7f7f7f">50&lt;/span>&lt;span> outputs &lt;span style="color:#f92672">=&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>model&lt;span style="color:#f92672">.&lt;/span>generate(&lt;span style="color:#f92672">**&lt;/span>inputs, max_new_tokens&lt;span style="color:#f92672">=&lt;/span>max_tokens,temperature&lt;span style="color:#f92672">=&lt;/span>temperature, top_p&lt;span style="color:#f92672">=&lt;/span>top_p, eos_token_id&lt;span style="color:#f92672">=&lt;/span>self&lt;span style="color:#f92672">.&lt;/span>tokenizer&lt;span style="color:#f92672">.&lt;/span>eos_token_id)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">51&lt;/span>&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>tokenizer&lt;span style="color:#f92672">.&lt;/span>decode(outputs[&lt;span style="color:#ae81ff">0&lt;/span>], skip_special_tokens&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#66d9ef">True&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">52&lt;/span>&lt;span> &lt;span style="color:#66d9ef">except&lt;/span> &lt;span style="color:#a6e22e">Exception&lt;/span> &lt;span style="color:#66d9ef">as&lt;/span> e:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">53&lt;/span>&lt;span> &lt;span style="color:#66d9ef">raise&lt;/span> HTTPException(status_code&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">500&lt;/span>, detail&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;Error generating 
text:&lt;/span>&lt;span style="color:#e6db74">{&lt;/span>str(e)&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">54&lt;/span>&lt;span> &lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">55&lt;/span>&lt;span> &lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">get_status&lt;/span>(self) &lt;span style="color:#f92672">-&amp;gt;&lt;/span> str: &lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">56&lt;/span>&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>model &lt;span style="color:#f92672">is&lt;/span> &lt;span style="color:#66d9ef">None&lt;/span>:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">57&lt;/span>&lt;span> self&lt;span style="color:#f92672">.&lt;/span>unload_model()&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">58&lt;/span>&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> &lt;span style="color:#e6db74">&amp;#34;No model loaded.&amp;#34;&lt;/span> &lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">59&lt;/span>&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> is_model_on_gpu(self&lt;span 
style="color:#f92672">.&lt;/span>model&lt;span style="color:#f92672">.&lt;/span>hf_device_map, self&lt;span style="color:#f92672">.&lt;/span>model_name)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">60&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">61&lt;/span>&lt;span> &lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">unload_model&lt;/span>(self):&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">62&lt;/span>&lt;span> self&lt;span style="color:#f92672">.&lt;/span>model &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#66d9ef">None&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">63&lt;/span>&lt;span> self&lt;span style="color:#f92672">.&lt;/span>tokenizer &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#66d9ef">None&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">64&lt;/span>&lt;span> old_model &lt;span style="color:#f92672">=&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>model_name &lt;span style="color:#66d9ef">if&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>model_name &lt;span style="color:#66d9ef">else&lt;/span> &lt;span style="color:#66d9ef">False&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span 
style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">65&lt;/span>&lt;span> self&lt;span style="color:#f92672">.&lt;/span>model_name &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#66d9ef">None&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">66&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">67&lt;/span>&lt;span> gc&lt;span style="color:#f92672">.&lt;/span>collect()&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">68&lt;/span>&lt;span> torch&lt;span style="color:#f92672">.&lt;/span>cuda&lt;span style="color:#f92672">.&lt;/span>empty_cache()&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">69&lt;/span>&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> &lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;Model &lt;/span>&lt;span style="color:#e6db74">{&lt;/span>old_model&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74"> successfully unloaded.&amp;#34;&lt;/span> &lt;span style="color:#66d9ef">if&lt;/span> old_model &lt;span style="color:#66d9ef">else&lt;/span> &lt;span style="color:#e6db74">&amp;#34;No model loaded to unload.&amp;#34;&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">70&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span 
style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">71&lt;/span>&lt;span>manager &lt;span style="color:#f92672">=&lt;/span> ModelManager()&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ul>&lt;li>The &lt;code>load_model&lt;/code> function loads a new model into memory, removing any previously loaded model.&lt;/li>&lt;li>&lt;code>generate&lt;/code> is the main function of the API, responsible for performing model inference. It allows adjusting the parameters: temperature, top_p, and max_tokens.&lt;/li>&lt;li>&lt;code>get_status&lt;/code> reports whether there is a loaded model and whether it is on the GPU or CPU.&lt;/li>&lt;li>The &lt;code>unload_model&lt;/code> function removes the model from memory, clears the CUDA cache, and invokes Python’s garbage collector to avoid leftovers that could interfere with future loads.&lt;/li>&lt;/ul>&lt;h2 id="fastapi-api-endpoints">FastAPI API Endpoints&lt;/h2>&lt;p>The &lt;code>app/main.py&lt;/code> file is where all the components come together. 
In it, we define all the endpoints and the API’s routing logic.&lt;/p>&lt;p>&lt;strong>main.py&lt;/strong>&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 1&lt;/span>&lt;span>&lt;span style="color:#f92672">from&lt;/span> fastapi &lt;span style="color:#f92672">import&lt;/span> FastAPI, Request, HTTPException, Depends&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 2&lt;/span>&lt;span>&lt;span style="color:#f92672">from&lt;/span> fastapi.responses &lt;span style="color:#f92672">import&lt;/span> JSONResponse&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 3&lt;/span>&lt;span>&lt;span style="color:#f92672">from&lt;/span> app &lt;span style="color:#f92672">import&lt;/span> schemas, model_manager, auth&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 4&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 5&lt;/span>&lt;span>app &lt;span style="color:#f92672">=&lt;/span> FastAPI()&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 6&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span 
style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 7&lt;/span>&lt;span>&lt;span style="color:#66d9ef">async&lt;/span> &lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">require_api_key&lt;/span>(request: Request) &lt;span style="color:#f92672">-&amp;gt;&lt;/span> schemas&lt;span style="color:#f92672">.&lt;/span>LDAPUserRequest:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 8&lt;/span>&lt;span> user &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#66d9ef">await&lt;/span> auth&lt;span style="color:#f92672">.&lt;/span>verify_apikey(request)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 9&lt;/span>&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> &lt;span style="color:#f92672">not&lt;/span> user:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">10&lt;/span>&lt;span> &lt;span style="color:#66d9ef">raise&lt;/span> HTTPException(status_code&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">401&lt;/span>, detail&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;Invalid API Key&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">11&lt;/span>&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> user&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span 
style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">12&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">13&lt;/span>&lt;span>&lt;span style="color:#a6e22e">@app.post&lt;/span>(&lt;span style="color:#e6db74">&amp;#34;/generate_apikey&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">14&lt;/span>&lt;span>&lt;span style="color:#66d9ef">async&lt;/span> &lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">generate_apikey&lt;/span>(payload: schemas&lt;span style="color:#f92672">.&lt;/span>LDAPUserRequest) &lt;span style="color:#f92672">-&amp;gt;&lt;/span> JSONResponse:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">15&lt;/span>&lt;span> key &lt;span style="color:#f92672">=&lt;/span> auth&lt;span style="color:#f92672">.&lt;/span>generate_apikey(payload&lt;span style="color:#f92672">.&lt;/span>username)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">16&lt;/span>&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> JSONResponse(status_code&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">200&lt;/span>, content&lt;span style="color:#f92672">=&lt;/span>{&lt;span style="color:#e6db74">&amp;#34;api_key&amp;#34;&lt;/span>: key})&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 
0.4em;color:#7f7f7f">17&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">18&lt;/span>&lt;span>&lt;span style="color:#a6e22e">@app.post&lt;/span>(&lt;span style="color:#e6db74">&amp;#34;/load_model&amp;#34;&lt;/span>, dependencies&lt;span style="color:#f92672">=&lt;/span>[Depends(require_api_key)])&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">19&lt;/span>&lt;span>&lt;span style="color:#66d9ef">async&lt;/span> &lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">load_model&lt;/span>(payload: schemas&lt;span style="color:#f92672">.&lt;/span>LoadModelRequest) &lt;span style="color:#f92672">-&amp;gt;&lt;/span> JSONResponse:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">20&lt;/span>&lt;span> &lt;span style="color:#66d9ef">try&lt;/span>:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">21&lt;/span>&lt;span> model_manager&lt;span style="color:#f92672">.&lt;/span>manager&lt;span style="color:#f92672">.&lt;/span>load_model(payload&lt;span style="color:#f92672">.&lt;/span>model_name, payload&lt;span style="color:#f92672">.&lt;/span>hf_token, payload&lt;span style="color:#f92672">.&lt;/span>device)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">22&lt;/span>&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> JSONResponse(content&lt;span 
style="color:#f92672">=&lt;/span>{&lt;span style="color:#e6db74">&amp;#34;message&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;Model &lt;/span>&lt;span style="color:#e6db74">{&lt;/span>payload&lt;span style="color:#f92672">.&lt;/span>model_name&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74"> loaded successfully.&amp;#34;&lt;/span>})&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">23&lt;/span>&lt;span> &lt;span style="color:#66d9ef">except&lt;/span> &lt;span style="color:#a6e22e">Exception&lt;/span> &lt;span style="color:#66d9ef">as&lt;/span> e:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">24&lt;/span>&lt;span> &lt;span style="color:#66d9ef">raise&lt;/span> HTTPException(status_code&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">500&lt;/span>, content&lt;span style="color:#f92672">=&lt;/span>{&lt;span style="color:#e6db74">&amp;#34;error&amp;#34;&lt;/span>: str(e)})&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">25&lt;/span>&lt;span> &lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">26&lt;/span>&lt;span>&lt;span style="color:#a6e22e">@app.post&lt;/span>(&lt;span style="color:#e6db74">&amp;#34;/generate&amp;#34;&lt;/span>, dependencies&lt;span style="color:#f92672">=&lt;/span>[Depends(require_api_key)])&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span 
style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">27&lt;/span>&lt;span>&lt;span style="color:#66d9ef">async&lt;/span> &lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">generate&lt;/span>(payload: schemas&lt;span style="color:#f92672">.&lt;/span>GenerateRequest)&lt;span style="color:#f92672">-&amp;gt;&lt;/span> JSONResponse:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">28&lt;/span>&lt;span> &lt;span style="color:#66d9ef">try&lt;/span>:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">29&lt;/span>&lt;span> result &lt;span style="color:#f92672">=&lt;/span> model_manager&lt;span style="color:#f92672">.&lt;/span>manager&lt;span style="color:#f92672">.&lt;/span>generate(payload&lt;span style="color:#f92672">.&lt;/span>model_name, payload&lt;span style="color:#f92672">.&lt;/span>hf_token,payload&lt;span style="color:#f92672">.&lt;/span>prompt, payload&lt;span style="color:#f92672">.&lt;/span>max_tokens, payload&lt;span style="color:#f92672">.&lt;/span>temperature, payload&lt;span style="color:#f92672">.&lt;/span>top_p)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">30&lt;/span>&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> {&lt;span style="color:#e6db74">&amp;#34;result&amp;#34;&lt;/span>: result}&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">31&lt;/span>&lt;span> &lt;span style="color:#66d9ef">except&lt;/span> &lt;span 
style="color:#a6e22e">Exception&lt;/span> &lt;span style="color:#66d9ef">as&lt;/span> e:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">32&lt;/span>&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> JSONResponse(status_code&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">500&lt;/span>, content&lt;span style="color:#f92672">=&lt;/span>{&lt;span style="color:#e6db74">&amp;#34;error&amp;#34;&lt;/span>: str(e)})&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">33&lt;/span>&lt;span> &lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">34&lt;/span>&lt;span>&lt;span style="color:#a6e22e">@app.get&lt;/span>(&lt;span style="color:#e6db74">&amp;#34;/status&amp;#34;&lt;/span>, dependencies&lt;span style="color:#f92672">=&lt;/span>[Depends(require_api_key)])&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">35&lt;/span>&lt;span>&lt;span style="color:#66d9ef">async&lt;/span> &lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">status&lt;/span>()&lt;span style="color:#f92672">-&amp;gt;&lt;/span> JSONResponse:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">36&lt;/span>&lt;span> str_status &lt;span style="color:#f92672">=&lt;/span> model_manager&lt;span style="color:#f92672">.&lt;/span>manager&lt;span style="color:#f92672">.&lt;/span>get_status()&lt;/span>&lt;/span>&lt;span 
style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">37&lt;/span>&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> JSONResponse(content&lt;span style="color:#f92672">=&lt;/span>{&lt;span style="color:#e6db74">&amp;#34;status&amp;#34;&lt;/span>: str_status})&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">38&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">39&lt;/span>&lt;span>&lt;span style="color:#a6e22e">@app.post&lt;/span>(&lt;span style="color:#e6db74">&amp;#34;/unload_model&amp;#34;&lt;/span>, dependencies&lt;span style="color:#f92672">=&lt;/span>[Depends(require_api_key)])&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">40&lt;/span>&lt;span>&lt;span style="color:#66d9ef">async&lt;/span> &lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">unload_model&lt;/span>() &lt;span style="color:#f92672">-&amp;gt;&lt;/span> JSONResponse:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">41&lt;/span>&lt;span> &lt;span style="color:#66d9ef">try&lt;/span>:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">42&lt;/span>&lt;span> str_unload &lt;span style="color:#f92672">=&lt;/span> model_manager&lt;span style="color:#f92672">.&lt;/span>manager&lt;span 
style="color:#f92672">.&lt;/span>unload_model()&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">43&lt;/span>&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> JSONResponse(content&lt;span style="color:#f92672">=&lt;/span>{&lt;span style="color:#e6db74">&amp;#34;message&amp;#34;&lt;/span>:str_unload})&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">44&lt;/span>&lt;span> &lt;span style="color:#66d9ef">except&lt;/span> &lt;span style="color:#a6e22e">Exception&lt;/span> &lt;span style="color:#66d9ef">as&lt;/span> e:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">45&lt;/span>&lt;span> &lt;span style="color:#66d9ef">raise&lt;/span> HTTPException(status_code&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">500&lt;/span>, content&lt;span style="color:#f92672">=&lt;/span>{&lt;span style="color:#e6db74">&amp;#34;error&amp;#34;&lt;/span>: str(e)})&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ul>&lt;li>The &lt;code>require_api_key&lt;/code> function checks the API Key on each request and returns the authenticated user or raises a 401 error.&lt;/li>&lt;li>&lt;code>generate_apikey&lt;/code> creates and returns a new API key for the specified user.&lt;/li>&lt;li>&lt;code>load_model&lt;/code> loads the specified model. 
If needed, it also accepts a Hugging Face token.&lt;/li>&lt;li>The &lt;code>generate&lt;/code> function makes the model perform inference using the given prompt and parameters.&lt;/li>&lt;li>Calling the &lt;code>status&lt;/code> endpoint returns the current status of the model manager.&lt;/li>&lt;li>&lt;code>unload_model&lt;/code> unloads the currently loaded model and returns a success message if completed properly.&lt;/li>&lt;/ul>&lt;h2 id="utilspy-file">&lt;code>utils.py&lt;/code> File&lt;/h2>&lt;p>The &lt;code>app/utils.py&lt;/code> file contains the function that checks whether the loaded model is fully or partially on the GPU, or if it was loaded on the CPU.&lt;/p>&lt;p>&lt;strong>utils.py&lt;/strong>&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1&lt;/span>&lt;span>&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">is_model_on_gpu&lt;/span>(hf_device_map: dict, model_name: str) &lt;span style="color:#f92672">-&amp;gt;&lt;/span> str:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">2&lt;/span>&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> &lt;span style="color:#e6db74">&amp;#39;&amp;#39;&lt;/span> &lt;span style="color:#f92672">in&lt;/span> hf_device_map&lt;span style="color:#f92672">.&lt;/span>keys() &lt;span style="color:#f92672">and&lt;/span> hf_device_map[&lt;span style="color:#e6db74">&amp;#39;&amp;#39;&lt;/span>] &lt;span style="color:#f92672">==&lt;/span> &lt;span style="color:#e6db74">&amp;#39;cpu&amp;#39;&lt;/span>:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span 
style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">3&lt;/span>&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> &lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;Model &lt;/span>&lt;span style="color:#e6db74">{&lt;/span>model_name&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74"> fully loaded on CPU.&amp;#34;&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">4&lt;/span>&lt;span> &lt;span style="color:#66d9ef">elif&lt;/span> &lt;span style="color:#e6db74">&amp;#39;cpu&amp;#39;&lt;/span> &lt;span style="color:#f92672">in&lt;/span> hf_device_map&lt;span style="color:#f92672">.&lt;/span>values():&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">5&lt;/span>&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> &lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;Some layers of the model &lt;/span>&lt;span style="color:#e6db74">{&lt;/span>model_name&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74"> are loaded on the CPU.&amp;#34;&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">6&lt;/span>&lt;span> &lt;span style="color:#66d9ef">else&lt;/span>:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">7&lt;/span>&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> &lt;span style="color:#e6db74">f&lt;/span>&lt;span 
style="color:#e6db74">&amp;#34;Model &lt;/span>&lt;span style="color:#e6db74">{&lt;/span>model_name&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74"> fully loaded on GPU.&amp;#34;&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="running-the-api">Running the API&lt;/h2>&lt;p>To run the API with &lt;code>uvicorn&lt;/code>, simply execute a command specifying the host and port for the service to start.&lt;/p>&lt;pre tabindex="0">&lt;code>uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload&lt;/code>&lt;/pre>&lt;ul>&lt;li>&lt;p>&lt;code>app:main&lt;/code> refers to the &lt;code>app/main.py&lt;/code> file, which connects all components and handles user requests.&lt;/p>&lt;/li>&lt;li>&lt;p>&lt;code>--host 0.0.0.0&lt;/code> sets the IP address on which the Uvicorn server will listen. The value &lt;code>0.0.0.0&lt;/code> allows the server to be accessible from any network interface on the Power9 machine.&lt;/p>&lt;/li>&lt;li>&lt;p>&lt;code>--port 8000&lt;/code> specifies the port on which the server will listen for requests.&lt;/p>&lt;/li>&lt;li>&lt;p>&lt;code>--reload&lt;/code> is a flag for development use. It automatically reloads the server whenever changes are made.&lt;/p>&lt;/li>&lt;/ul>&lt;p>BBy following this guide, you&amp;rsquo;ll have a working API capable of running LLM inference using models downloaded from Hugging Face. 
In the &lt;a href="https://llm-pt-ibm.github.io/en/posts/tutorial_power9_pt4_en/">&lt;span class="link-personalizado">next tutorial&lt;/span>&lt;/a>, we will show how to send requests to the API using curl and Python.&lt;/p></description></item><item><title>Setting Up Conda and PyTorch on IBM Power9 Servers</title><link>https://llm-pt-ibm.github.io/en/posts/tutorial_power9_pt2_en/</link><pubDate>Mon, 30 Jun 2025 00:00:00 +0000</pubDate><guid>https://llm-pt-ibm.github.io/en/posts/tutorial_power9_pt2_en/</guid><description>&lt;h2 id="background">Background&lt;/h2>&lt;p>This is the second post in a tutorial series on how to build a Language Model API on an IBM Power9 server, covering everything from setting up the operating system to having the API running remote inference. The &lt;a href="https://llm-pt-ibm.github.io/en/posts/tutorial_power9_pt1_en/">&lt;span class="link-personalizado">first post&lt;/span>&lt;/a> covers installing the OS and configuring NVIDIA drivers, CUDA, and CUDNN. In this step, we&amp;rsquo;ll show how to set up the Conda package manager and the PyTorch library.&lt;/p>&lt;p>&lt;strong>Conda&lt;/strong>: Conda is an open-source, cross-platform package and environment management system. It&amp;rsquo;s like a &amp;ldquo;toolbox&amp;rdquo; for data scientists and developers to organize their projects.&lt;/p>&lt;p>&lt;strong>PyTorch&lt;/strong>: PyTorch is an open-source machine learning library developed primarily by Facebook AI Research (FAIR). 
It&amp;rsquo;s especially popular for building deep learning applications, a subfield of machine learning inspired by how the human brain works.&lt;/p>&lt;h2 id="tldr">TL;DR&lt;/h2>&lt;ul>&lt;li>This post provides a step-by-step guide to installing Conda and PyTorch.&lt;/li>&lt;li>The main challenge is finding compatible versions for the Power9 machine architecture.&lt;/li>&lt;/ul>&lt;h2 id="setting-up-the-conda">Setting up Conda&lt;/h2>&lt;p>We&amp;rsquo;ll start with installing &lt;strong>Conda&lt;/strong>. On Power systems, the architecture used is &lt;code>ppc64le&lt;/code> (PowerPC 64-bit little-endian), so it&amp;rsquo;s essential to download the version for this architecture. We&amp;rsquo;ll use &lt;strong>miniconda&lt;/strong>, a lighter option that&amp;rsquo;s better suited for custom setups like the Power9 server.&lt;/p>&lt;ol>&lt;li>To download and install the latest version of Miniconda:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>sudo wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-ppc64le.sh
bash ~/Miniconda3-latest-Linux-ppc64le.sh&lt;/code>&lt;/pre>&lt;ol start="2">&lt;li>Check if Conda was activated automatically:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>conda --version&lt;/code>&lt;/pre>&lt;p>If it didn&amp;rsquo;t start automatically, you&amp;rsquo;ll need to activate it.&lt;/p>&lt;ol start="3">&lt;li>To ensure it&amp;rsquo;s automatically activated with each new connection, we will write the command into your &lt;code>.bashrc&lt;/code> (or &lt;code>.zshrc&lt;/code>) file.&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>echo &amp;#39;source ~/miniconda3/etc/profile.d/conda.sh&amp;#39; &amp;gt;&amp;gt; ~/.bashrc
source ~/.bashrc&lt;/code>&lt;/pre>&lt;p>Check again with the command:&lt;/p>&lt;pre tabindex="0">&lt;code>conda --version&lt;/code>&lt;/pre>&lt;p>Expected output looks like: &lt;code>conda 23.10.0&lt;/code>&lt;/p>&lt;h2 id="installing-and-configuring-the-pytorch-library">Installing and configuring the PyTorch 
library&lt;/h2>&lt;p>There are no official builds or Conda/PyPI wheels with full support for the &lt;strong>ppc64le&lt;/strong> architecture. To install PyTorch, you’ll need to build it manually.&lt;/p>&lt;h4 id="optional-creating-a-conda-virtual-environment">(Optional) Creating a Conda virtual environment&lt;/h4>&lt;p>It’s recommended to create a dedicated virtual environment to install PyTorch in isolation.&lt;/p>&lt;ol>&lt;li>To create and activate the virtual environment, run:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>conda create -y -n api_llm python=3.10
conda activate api_llm&lt;/code>&lt;/pre>&lt;h4 id="installing-prerequisites">Installing prerequisites&lt;/h4>&lt;p>We need to install some packages required to properly build PyTorch.&lt;/p>&lt;ol>&lt;li>First, install the packages using the following commands:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>conda install -y -c conda-forge openblas libblas cmake ninja python3-devel gcc-c++ rust cargo&lt;/code>&lt;/pre>&lt;p>Newer releases of CMake (the build system used by PyTorch) dropped support for build scripts declaring compatibility with CMake versions older than 3.5. 
To address this, we need to pin an older CMake release using pip.&lt;/p>&lt;ol start="2">&lt;li>Run the command:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>pip install cmake==3.27.7&lt;/code>&lt;/pre>&lt;p>To make sure the correct version was installed, run the command:&lt;/p>&lt;pre tabindex="0">&lt;code>cmake --version &lt;/code>&lt;/pre>&lt;p>Expected output: &lt;code>cmake version 3.27.7&lt;/code>&lt;/p>&lt;h4 id="building-pytorch">Building PyTorch&lt;/h4>&lt;p>Now let&amp;rsquo;s start the &lt;strong>PyTorch&lt;/strong> build process.&lt;/p>&lt;ol>&lt;li>The first step is to clone the repository and set it up to install version 2.6.0:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>git clone --recursive https://github.com/pytorch/pytorch
cd pytorch
git checkout v2.6.0
git submodule sync
git submodule update --init --recursive &lt;/code>&lt;/pre>&lt;ol start="2">&lt;li>To install the required packages via pip, run the following command:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>pip install -r requirements.txt&lt;/code>&lt;/pre>&lt;ol start="3">&lt;li>And finally, to build PyTorch, run Python’s setup.py:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>sudo USE_CUDA=1 USE_DISTRIBUTED=1 USE_NCCL=1 USE_GLOO=1 USE_CUDNN=1 python setup.py install&lt;/code>&lt;/pre>&lt;p>The build process usually takes a while, around 15 minutes.&lt;/p>&lt;ol start="4">&lt;li>To check if everything worked correctly, create a file named &lt;code>test_torch.py&lt;/code>&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>nano test_torch.py&lt;/code>&lt;/pre>&lt;p>This file should contain the following lines:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 
1&lt;/span>&lt;span>&lt;span style="color:#f92672">import&lt;/span> torch&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 2&lt;/span>&lt;span>print(torch&lt;span style="color:#f92672">.&lt;/span>__version__)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 3&lt;/span>&lt;span>print(&lt;span style="color:#e6db74">&amp;#34;CUDA available:&amp;#34;&lt;/span>, torch&lt;span style="color:#f92672">.&lt;/span>cuda&lt;span style="color:#f92672">.&lt;/span>is_available())&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 4&lt;/span>&lt;span>print(&lt;span style="color:#e6db74">&amp;#34;Number of GPUs:&amp;#34;&lt;/span>, torch&lt;span style="color:#f92672">.&lt;/span>cuda&lt;span style="color:#f92672">.&lt;/span>device_count())&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 5&lt;/span>&lt;span>print(&lt;span style="color:#e6db74">&amp;#34;GPU name:&amp;#34;&lt;/span>, torch&lt;span style="color:#f92672">.&lt;/span>cuda&lt;span style="color:#f92672">.&lt;/span>get_device_name(&lt;span style="color:#ae81ff">0&lt;/span>))&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 6&lt;/span>&lt;span>x &lt;span style="color:#f92672">=&lt;/span> torch&lt;span style="color:#f92672">.&lt;/span>rand(&lt;span style="color:#ae81ff">3&lt;/span>, &lt;span style="color:#ae81ff">3&lt;/span>)&lt;span 
style="color:#f92672">.&lt;/span>cuda()&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 7&lt;/span>&lt;span>y &lt;span style="color:#f92672">=&lt;/span> torch&lt;span style="color:#f92672">.&lt;/span>rand(&lt;span style="color:#ae81ff">3&lt;/span>, &lt;span style="color:#ae81ff">3&lt;/span>)&lt;span style="color:#f92672">.&lt;/span>cuda()&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 8&lt;/span>&lt;span>print(&lt;span style="color:#e6db74">&amp;#34;Sum on GPU:&amp;#34;&lt;/span>, (x &lt;span style="color:#f92672">+&lt;/span> y))&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 9&lt;/span>&lt;span>print(&lt;span style="color:#e6db74">&amp;#34;cuDNN available:&amp;#34;&lt;/span>, torch&lt;span style="color:#f92672">.&lt;/span>backends&lt;span style="color:#f92672">.&lt;/span>cudnn&lt;span style="color:#f92672">.&lt;/span>is_available())&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">10&lt;/span>&lt;span>print(&lt;span style="color:#e6db74">&amp;#34;C extensions loaded:&amp;#34;&lt;/span>, torch&lt;span style="color:#f92672">.&lt;/span>_C&lt;span style="color:#f92672">.&lt;/span>_cuda_getDeviceCount() &lt;span style="color:#f92672">&amp;gt;&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span>)&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>When you run this file, you’ll check:&lt;/p>&lt;ul>&lt;li>Installed PyTorch version&lt;/li>&lt;li>CUDA availability&lt;/li>&lt;li>Number of available GPUs&lt;/li>&lt;li>GPU name on the Power9 
server&lt;/li>&lt;li>Whether GPU usage is working correctly&lt;/li>&lt;li>CUDNN availability&lt;/li>&lt;li>Whether the .so files were compiled correctly&lt;/li>&lt;/ul>&lt;p>This script simply prints some CUDA and PyTorch information and performs a basic addition using GPU tensors.&lt;/p>&lt;ol start="5">&lt;li>Run the file with the command:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>python test_torch.py&lt;/code>&lt;/pre>&lt;p>Expected output should look something like:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>2.6.0a0+git1eba9b3&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>CUDA available: True&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>Number of GPUs: &lt;span style="color:#ae81ff">4&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>GPU name: Tesla V100-SXM2-16GB&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>Sum on GPU: tensor&lt;span style="color:#f92672">([[&lt;/span>1.9163, 1.2208, 0.5998&lt;span style="color:#f92672">]&lt;/span>,&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">[&lt;/span>1.7962, 0.6040, 1.3943&lt;span style="color:#f92672">]&lt;/span>,&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">[&lt;/span>0.9536, 0.8010, 0.0668&lt;span style="color:#f92672">]]&lt;/span>, device&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#39;cuda:0&amp;#39;&lt;/span>&lt;span style="color:#f92672">)&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>cuDNN available: True&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>C extensions loaded: True&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Keep in mind that the output may vary depending on the number and model of 
GPUs, as well as the tensor sums (due to randomness). What matters is that the boolean outputs in the script return &lt;code>True&lt;/code>.&lt;/p>&lt;p>With this, PyTorch is installed and ready to use. &lt;a href="https://llm-pt-ibm.github.io/en/posts/tutorial_power9_pt3_en/">&lt;span class="link-personalizado">In the next tutorial&lt;/span>&lt;/a>, we’ll run the first Language Model inference on the Power9 server.&lt;/p></description></item><item><title>Setting Up the OS, NVIDIA Drivers, CUDA, and cuDNN on IBM Power 9 Servers</title><link>https://llm-pt-ibm.github.io/en/posts/tutorial_power9_pt1_en/</link><pubDate>Sun, 29 Jun 2025 00:00:00 +0000</pubDate><guid>https://llm-pt-ibm.github.io/en/posts/tutorial_power9_pt1_en/</guid><description>&lt;h2 id="background">Background&lt;/h2>&lt;p>This is the first post in a tutorial series on how to build a Language Model API on an IBM Power9 server, covering everything from setting up the operating system to having the API running remote inference. This step of the tutorial shows how to set up the operating system and install NVIDIA drivers, CUDA, and cuDNN on IBM Power9 AC922 machines. The focus is on ensuring everything works correctly on &lt;code>ppc64le&lt;/code> architectures, which are common in high-performance environments.&lt;/p>&lt;p>&lt;strong>IBM Power9&lt;/strong>: The IBM Power9 AC922 is a high-performance machine used for demanding tasks such as artificial intelligence and scientific computing. It uses Power9 processors and works well with NVIDIA GPUs, offering high-speed communication between the CPU and GPU.&lt;/p>&lt;p>&lt;strong>NVIDIA Drivers&lt;/strong>: Software that allows the operating system to communicate correctly with NVIDIA GPUs. These drivers are essential to enable GPU acceleration.&lt;/p>&lt;p>&lt;strong>CUDA&lt;/strong>: NVIDIA&amp;rsquo;s platform for accelerating parallel computing on GPUs. 
It lets you run complex algorithms efficiently, such as Large Language Model inference.&lt;/p>&lt;p>&lt;strong>cuDNN&lt;/strong>: A GPU-optimized library of primitives for deep neural networks (DNNs) developed by NVIDIA. It offers high-performance implementations of key DNN operations like convolutions, pooling, and normalization, significantly speeding up training and inference on GPUs.&lt;/p>&lt;h2 id="tldr">TL;DR&lt;/h2>&lt;ul>&lt;li>This post provides a step-by-step guide on setting up Power9 servers, including the OS and NVIDIA configurations.&lt;/li>&lt;li>The main challenge is finding compatible versions for the Power9 machine architecture.&lt;/li>&lt;/ul>&lt;h2 id="setting-up-the-operating-system">Setting up the Operating System&lt;/h2>&lt;p>Let&amp;rsquo;s start with the installation of &lt;strong>Red Hat Enterprise Linux 8.10 (Ootpa)&lt;/strong>. On Power systems, the architecture used is &lt;code>ppc64le&lt;/code> (PowerPC 64-bit little-endian), so it&amp;rsquo;s essential to ensure the .iso image is compatible with this architecture. 
Otherwise, the Power9&amp;rsquo;s petitboot won&amp;rsquo;t recognize the media and installation won&amp;rsquo;t proceed.&lt;/p>&lt;ol>&lt;li>You can download the correct image from the &lt;a href="https://access.redhat.com/downloads/content/279/ver=/rhel---8/8.10/ppc64le/product-software" rel="external">&lt;span class="link-personalizado">link&lt;/span>&lt;/a> provided.&lt;/li>&lt;li>In this tutorial, we&amp;rsquo;ll use the &lt;strong>Boot ISO&lt;/strong> option and follow the &lt;a href="https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/8/html/interactively_installing_rhel_from_installation_media/assembly_creating-a-bootable-installation-medium_rhel-installer" rel="external">&lt;span class="link-personalizado">official Red Hat documentation&lt;/span>&lt;/a> to create a bootable USB medium.&lt;/li>&lt;li>After inserting the installation media into the Power9 server and rebooting, the system should automatically start petitboot.&lt;/li>&lt;li>From there, just follow the &lt;a href="https://www.ibm.com/docs/en/linuxonibm/liabw/rhelqs_guide_Power_p9_usb.pdf" rel="external">&lt;span class="link-personalizado">official installation guide&lt;/span>&lt;/a> to complete the OS setup.&lt;/li>&lt;/ol>&lt;h2 id="setting-up-nvidia-driver-and-cuda">Setting up NVIDIA Driver and CUDA&lt;/h2>&lt;h4 id="checking-gpus-and-operating-system">Checking GPUs and Operating System&lt;/h4>&lt;p>To enable the operating system to communicate properly with the server&amp;rsquo;s GPUs, we need to install and configure the NVIDIA driver.&lt;/p>&lt;ol>&lt;li>First, let&amp;rsquo;s check for the presence of the GPU(s):&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>lspci | grep -i nvidia&lt;/code>&lt;/pre>&lt;p>The expected output is something like:&lt;/p>&lt;p>&lt;code>0004:04:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)&lt;/code>&lt;/p>&lt;ol start="2">&lt;li>Next, let&amp;rsquo;s check the system architecture and operating system 
name:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>uname -m &amp;amp;&amp;amp; cat /etc/redhat-release&lt;/code>&lt;/pre>&lt;p>The expected output is:&lt;/p>&lt;p>&lt;code>ppc64le Red Hat Enterprise Linux release 8.10 (Ootpa)&lt;/code>&lt;/p>&lt;h4 id="avoiding-conflicts">Avoiding conflicts&lt;/h4>&lt;p>To avoid potential conflicts, it&amp;rsquo;s recommended to disable the &lt;code>nouveau&lt;/code> driver and &lt;code>SELinux&lt;/code>.&lt;/p>&lt;p>The &lt;code>nouveau&lt;/code> driver is an open-source driver for NVIDIA GPUs that replaces the proprietary driver when users want to use only free software without needing high performance.&lt;/p>&lt;p>When enabled, &lt;code>SELinux&lt;/code> restricts certain processes from making changes to the system, which can conflict with the installations we&amp;rsquo;ll do in this tutorial.&lt;/p>&lt;ol>&lt;li>Disable the &lt;code>nouveau&lt;/code> driver:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>echo -e &amp;#34;blacklist nouveau\noptions nouveau modeset=0&amp;#34; | sudo tee /etc/modprobe.d/disable-nouveau.conf&lt;/code>&lt;/pre>&lt;ol start="2">&lt;li>To disable &lt;code>SELinux&lt;/code>, let&amp;rsquo;s first check its status by running:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>sestatus&lt;/code>&lt;/pre>&lt;p>If it&amp;rsquo;s active, you&amp;rsquo;ll need to set the &lt;code>SELINUX=disabled&lt;/code> parameter in the &lt;code>/etc/selinux/config&lt;/code> file to proceed. 
Remember that saving changes requires sudo permissions.&lt;/p>&lt;ol start="3">&lt;li>After that, update the &lt;code>initramfs&lt;/code> and reboot the machine with the following commands:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>sudo dracut --force
sudo reboot&lt;/code>&lt;/pre>&lt;ol start="4">&lt;li>To verify everything worked so far, let&amp;rsquo;s check if &lt;code>nouveau&lt;/code> is disabled:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>lsmod | grep nouveau&lt;/code>&lt;/pre>&lt;p>If it&amp;rsquo;s been successfully disabled, there will be no output.&lt;/p>&lt;ol start="5">&lt;li>To verify the &lt;code>SELinux&lt;/code>:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>sestatus&lt;/code>&lt;/pre>&lt;p>If it&amp;rsquo;s disabled, the output will be: &lt;code>SELinux status: disabled&lt;/code>&lt;/p>&lt;h4 id="installing-prerequisites">Installing Prerequisites&lt;/h4>&lt;ol>&lt;li>Let&amp;rsquo;s install some prerequisites before starting the actual installation:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>sudo dnf install pciutils environment-modules
sudo dnf install kernel-devel-$(uname -r) kernel-headers
sudo dnf install https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm
sudo dnf clean all
sudo dnf install dkms&lt;/code>&lt;/pre>&lt;ol start="2">&lt;li>We also need to enable some repositories:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>sudo subscription-manager repos --enable=rhel-8-for-ppc64le-appstream-rpms
sudo subscription-manager repos --enable=rhel-8-for-ppc64le-baseos-rpms
sudo subscription-manager repos --enable=codeready-builder-for-rhel-8-ppc64le-rpms&lt;/code>&lt;/pre>&lt;h4 id="downloading-and-installing-cuda-package-repositories">Downloading and Installing CUDA Package Repositories&lt;/h4>&lt;ol>&lt;li>Let&amp;rsquo;s download &lt;strong>CUDA version 12.2&lt;/strong> and &lt;strong>NVIDIA Driver 535.54.03-1&lt;/strong> with the following command:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>wget https://developer.download.nvidia.com/compute/cuda/12.2.0/local_installers/cuda-repo-rhel8-12-2-local-12.2.0_535.54.03-1.ppc64le.rpm&lt;/code>&lt;/pre>&lt;ol start="2">&lt;li>To install the downloaded package:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>sudo rpm -i cuda-repo-rhel8-12-2-local-12.2.0_535.54.03-1.ppc64le.rpm&lt;/code>&lt;/pre>&lt;ol start="3">&lt;li>To install the NVIDIA driver and CUDA, run the following commands:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>sudo dnf install nvidia-driver-cuda
sudo dnf clean all
sudo dnf module reset nvidia-driver
sudo dnf module enable nvidia-driver:latest-dkms
sudo dnf -y module install nvidia-driver:latest-dkms
sudo dnf -y install cuda &lt;/code>&lt;/pre>&lt;p>With these commands, the driver and CUDA installation is complete.&lt;/p>&lt;h4 id="post-installation-steps">Post-Installation Steps&lt;/h4>&lt;ol>&lt;li>Let&amp;rsquo;s set the &lt;code>PATH&lt;/code> and &lt;code>LD_LIBRARY_PATH&lt;/code> environment variables. To do this, edit the &lt;code>.bashrc&lt;/code> file and add these two lines:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH&lt;/code>&lt;/pre>&lt;p>To update the environment variables, run the following command:&lt;/p>&lt;pre tabindex="0">&lt;code>source ~/.bashrc&lt;/code>&lt;/pre>&lt;p>We need to make two manual changes because they aren&amp;rsquo;t handled automatically by the CUDA package installation. If these aren&amp;rsquo;t done, the CUDA driver installation will not work properly.&lt;/p>&lt;ol start="2">&lt;li>The first change is to configure the NVIDIA persistence daemon. 
First, check its status, and if it&amp;rsquo;s not active, enable it:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>systemctl status nvidia-persistenced
sudo systemctl enable nvidia-persistenced&lt;/code>&lt;/pre>&lt;p>Some Linux distributions have a udev rule that brings hot-plugged memory online as soon as it&amp;rsquo;s detected, preventing NVIDIA software from correctly configuring GPU memory on Power9.&lt;/p>&lt;ol start="3">&lt;li>To disable this rule, run the following commands:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>sudo cp /lib/udev/rules.d/40-redhat.rules /etc/udev/rules.d/
sudo sed -i &amp;#39;s/SUBSYSTEM!=&amp;#34;memory&amp;#34;,.*GOTO=&amp;#34;memory_hotplug_end&amp;#34;/SUBSYSTEM==&amp;#34;*&amp;#34;, GOTO=&amp;#34;memory_hotplug_end&amp;#34;/&amp;#39; /etc/udev/rules.d/40-redhat.rules&lt;/code>&lt;/pre>&lt;h4 id="installation-check">Installation Check&lt;/h4>&lt;p>After completing all these steps, let&amp;rsquo;s reboot the machine and verify the installations:&lt;/p>&lt;ol>&lt;li>Reboot the machine:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>sudo reboot&lt;/code>&lt;/pre>&lt;ol start="2">&lt;li>Check the NVIDIA driver:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>nvidia-smi&lt;/code>&lt;/pre>&lt;p>The output of the command above should display the installed driver version and the CUDA version it supports. 
It should also list available devices (GPUs) with details like name, memory, temperature, and other information.&lt;/p>&lt;p>To perform the final check, let&amp;rsquo;s download the &lt;code>cuda-samples&lt;/code> repository and run the device test.&lt;/p>&lt;ol start="3">&lt;li>Download the repository and check out the &lt;code>cuda-samples&lt;/code> version matching the installed CUDA:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples/Samples/1_Utilities/deviceQuery
git checkout v12.2&lt;/code>&lt;/pre>&lt;ol start="4">&lt;li>To build and run the tests:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>make
./deviceQuery&lt;/code>&lt;/pre>&lt;p>After running this test, you should see &lt;code>Result = PASS&lt;/code> in the last line. This confirms that the Power9 is set up with the NVIDIA driver and CUDA working correctly.&lt;/p>&lt;h2 id="setting-up-the-cudnn">Setting up the CUDNN&lt;/h2>&lt;ol>&lt;li>First, we need to download and install the &lt;code>.rpm&lt;/code> package specific to &lt;code>ppc64le&lt;/code>.&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>wget https://developer.download.nvidia.com/compute/cudnn/9.0.0/local_installers/cudnn-local-repo-rhel8-9.0.0-1.0-1.ppc64le.rpm
sudo rpm -i cudnn-local-repo-rhel8-9.0.0-1.0-1.ppc64le.rpm
sudo dnf clean all
sudo dnf -y install cudnn&lt;/code>&lt;/pre>&lt;ol start="2">&lt;li>After installing, set the &lt;code>CUDNN_LIBRARY&lt;/code> and &lt;code>CUDNN_INCLUDE_DIR&lt;/code> environment variables by adding these lines to your &lt;code>.bashrc&lt;/code>:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>echo &amp;#39;export CUDNN_LIBRARY=/usr/lib64&amp;#39; &amp;gt;&amp;gt; ~/.bashrc
echo &amp;#39;export CUDNN_INCLUDE_DIR=/usr/include&amp;#39; &amp;gt;&amp;gt; ~/.bashrc&lt;/code>&lt;/pre>&lt;p>After that, the CUDNN installation process is complete.&lt;/p>&lt;p>This is the first part of our tutorial. 
Once you&amp;rsquo;ve finished all the steps in this post, the server will be ready to install the &lt;code>conda&lt;/code> package manager and the &lt;code>pytorch&lt;/code> library. You can access the second part of this tutorial at this &lt;a href="https://llm-pt-ibm.github.io/en/posts/tutorial_power9_pt2_en/">&lt;span class="link-personalizado">link&lt;/span>&lt;/a>.&lt;/p></description></item><item><title>Evaluating Small-Scale LLMs (up to 8B) on PT-BR Benchmarks</title><link>https://llm-pt-ibm.github.io/en/posts/experimentos_benchmarks_pt_br/</link><pubDate>Mon, 02 Jun 2025 00:00:00 +0000</pubDate><guid>https://llm-pt-ibm.github.io/en/posts/experimentos_benchmarks_pt_br/</guid><description>&lt;h2 id="background">Background&lt;/h2>&lt;p>This is the first of two posts in this series, aimed at providing a summary of the investigation we conducted using the &lt;a href="https://github.com/stanford-crfm/helm" rel="external">&lt;span class="link-personalizado">HELM&lt;/span>&lt;/a> (&lt;em>Holistic Evaluation of Language Models&lt;/em>) evaluation framework to assess the &lt;a href="https://huggingface.co/ibm-granite" rel="external">&lt;span class="link-personalizado">Granite&lt;/span>&lt;/a> family of models, the &lt;a href="https://huggingface.co/meta-llama/Llama-3.1-8B" rel="external">&lt;span class="link-personalizado">Llama-3.1-8B&lt;/span>&lt;/a> model, and the &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B" rel="external">&lt;span class="link-personalizado">DeepSeek-R1-Distill-Llama-3.1-8B&lt;/span>&lt;/a> model. The evaluations cover both Portuguese-language benchmarks and code generation tasks. In this first part, the focus is on evaluating model performance in Brazilian Portuguese (PT-BR) for &lt;strong>sentiment analysis&lt;/strong> and &lt;strong>MQA&lt;/strong> (&lt;em>Multiple-Choice Question Answering&lt;/em>) tasks. 
The second part, to be published soon, will present the evaluation results for code generation tasks.&lt;/p>&lt;p>The use of English-language datasets for evaluating language models is common practice. However, to evaluate these models across different languages and cultural contexts, it is important to test them on benchmarks in other languages. In the case of PT-BR, which typically represents a smaller share of the data used to train multilingual models, understanding model behavior is an important step in evaluating their suitability for tasks and contexts specific to this language. In this sense, this post aims to contribute to that understanding by highlighting both the advances and the remaining challenges in these LLMs’ performance on tasks in the PT-BR context.&lt;/p>&lt;h2 id="tldr">TL;DR&lt;/h2>&lt;div style="text-align: justify;">&lt;ul>&lt;li>We evaluated the Granite, Llama-3.1-8B, and DeepSeek-R1-Distill-Llama-3.1-8B models on the ENEM Challenge, TweetSent-Br, and IMDB benchmarks.&lt;/li>&lt;li>Our method involved experimentation supported by the HELM framework, which we describe in detail in this document.&lt;/li>&lt;li>The results show that the models accurately classify sentiments in movie reviews in PT-BR.&lt;/li>&lt;/ul>&lt;/div>&lt;h2 id="method">Method&lt;/h2>&lt;h3 id="execution-environment-and-tool-used">Execution Environment and Tool Used&lt;/h3>&lt;p>We used HELM as the evaluation tool. HELM is an LLM evaluation framework developed by researchers at Stanford University. It includes a variety of benchmarks, such as sentiment analysis, code generation, and multiple-choice question answering. Using these benchmarks, we evaluated and compared the performance of the Granite (8B), Llama-3.1-8B, and DeepSeek-R1-Distill-Llama-3.1-8B models.&lt;/p>&lt;p>For running the experiments, we used Google Colab as the environment, which provides access to an A100 GPU. 
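&lt;/p>&lt;p>To give a concrete picture, a HELM evaluation is typically launched from the command line with &lt;code>helm-run&lt;/code> and summarized with &lt;code>helm-summarize&lt;/code>. The sketch below is illustrative only: the model identifier and suite name are placeholders, and the exact run-entry and scenario names depend on the HELM version installed, so check the HELM documentation before running it:&lt;/p>&lt;pre tabindex="0">&lt;code>pip install crfm-helm
helm-run --run-entries &amp;#34;enem_challenge:model=your-org/your-model&amp;#34; --suite my-ptbr-suite --max-eval-instances 100
helm-summarize --suite my-ptbr-suite&lt;/code>&lt;/pre>&lt;p>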
In this setup, we were able to clone the HELM repository and run models with 8 billion parameters. All configuration and testing were carried out on this platform, ensuring convenience and access to the necessary computational resources.&lt;/p>&lt;p>In a future post, we will go into more detail about LLM evaluation strategies and tools, with a deeper focus on HELM’s capabilities and operation.&lt;/p>&lt;h3 id="benchmarks-and-models">Benchmarks and Models&lt;/h3>&lt;p>To run tests in Brazilian Portuguese scenarios, it was necessary to extend HELM by adding new benchmarks, since the tool did not previously support this language. This effort represented a direct contribution to HELM, adding three benchmarks:&lt;/p>&lt;ul>&lt;li>&lt;p>&lt;a href="https://huggingface.co/datasets/eduagarcia/enem_challenge" rel="external">&lt;span class="link-personalizado">&lt;strong>ENEM Challenge&lt;/strong>&lt;/span>&lt;/a>: built from questions from the Exame Nacional do Ensino Médio (ENEM), designed to evaluate LLMs&amp;rsquo; ability to handle MQA tasks across various knowledge areas, including Humanities, Natural Sciences, Languages, and Mathematics.&lt;/p>&lt;/li>&lt;li>&lt;p>&lt;a href="https://huggingface.co/datasets/eduagarcia/tweetsentbr_fewshot" rel="external">&lt;span class="link-personalizado">&lt;strong>TweetSent-Br&lt;/strong>&lt;/span>&lt;/a>: composed of tweets, specifically for sentiment analysis tasks. The dataset is organized into three main classes: positive (tweets expressing a positive reaction about the main topic), negative (tweets expressing a negative reaction), and neutral (tweets that don’t fit the other categories).&lt;/p>&lt;/li>&lt;li>&lt;p>&lt;a href="https://huggingface.co/datasets/maritaca-ai/imdb_pt" rel="external">&lt;span class="link-personalizado">&lt;strong>IMDB&lt;/strong>&lt;/span>&lt;/a>: made up of movie reviews written in Brazilian Portuguese. 
This benchmark also focuses on sentiment classification tasks, but uses longer-form review texts, in contrast to TweetSent-Br’s shorter posts.&lt;/p>&lt;/li>&lt;/ul>&lt;p>As for the models, selection was guided by compatibility with the available execution environment, as well as by citation relevance and performance. This included the Granite family of models developed by IBM; the Llama models from Meta; and the DeepSeek-R1-Distill-Llama-8B, a compact, optimized version derived from Llama 3.1. This choice enabled a fair and practical comparison among the models.&lt;/p>&lt;h2 id="results">Results&lt;/h2>&lt;p>Below, we present the results obtained, along with charts developed by the team to make it easier to visualize and understand the models’ performance on the evaluated tasks.&lt;/p>&lt;ul>&lt;li>&lt;strong>ENEM Challenge&lt;/strong>:&lt;/li>&lt;/ul>&lt;!--&lt;div style="text-align: center;"> &lt;img src="https://llm-pt-ibm.github.io/images/experimentos_benchmarks_pt_br_image001.png" style="max-width: 90%;">&lt;/div> -->&lt;figure>&lt;img src="https://llm-pt-ibm.github.io/images/experimentos_benchmarks_pt_br_image001.png" alt="Chart of results on the ENEM Challenge"/>&lt;figcaption> &lt;p>Chart of results on the ENEM Challenge&lt;/p> &lt;/figcaption>&lt;/figure>&lt;p>The results indicate that the models showed similar performance, with a slight advantage for Llama. The models achieved an average accuracy of 62.53%, suggesting that while they demonstrate some level of understanding of the questions, they still lack sufficient ability to answer ENEM exam questions satisfactorily. 
Improvement is still needed, particularly in reasoning and interpretation in Portuguese.&lt;/p>&lt;ul>&lt;li>&lt;strong>TweetSent-Br&lt;/strong>:&lt;/li>&lt;/ul>&lt;!--&lt;div style="text-align: center;"> &lt;img src="https://llm-pt-ibm.github.io/images/experimentos_benchmarks_pt_br_image002.png" style="max-width: 90%;">&lt;/div> -->&lt;figure>&lt;img src="https://llm-pt-ibm.github.io/images/experimentos_benchmarks_pt_br_image002.png" alt="Chart of results on the TweetSent-Br"/>&lt;figcaption> &lt;p>Chart of results on the TweetSent-Br&lt;/p> &lt;/figcaption>&lt;/figure>&lt;p>In this benchmark, as observed with the ENEM Challenge, the results were also similar across models. This reinforces the view that there are still gaps in model performance on sentiment classification tasks in Portuguese. Classifying a message as positive, negative, or neutral remains a challenge for these models, especially given the nuances and ambiguities of the language.&lt;/p>&lt;ul>&lt;li>&lt;strong>IMDB&lt;/strong>:&lt;/li>&lt;/ul>&lt;!--&lt;div style="text-align: center;"> &lt;img src="https://llm-pt-ibm.github.io/images/experimentos_benchmarks_pt_br_image003.png" style="max-width: 90%;">&lt;/div>-->&lt;figure>&lt;img src="https://llm-pt-ibm.github.io/images/experimentos_benchmarks_pt_br_image003.png" alt="Chart of results on the IMDB"/>&lt;figcaption> &lt;p>Chart of results on the IMDB&lt;/p> &lt;/figcaption>&lt;/figure>&lt;p>In the IMDB benchmark, the results were quite positive. The models achieved accuracy rates above 90%, demonstrating strong performance in sentiment classification. The highlight was the Granite model with 8B parameters, which showed a slight advantage over the others. 
These results indicate that the models can easily categorize movie reviews in Portuguese, showing greater proficiency in this type of task.&lt;/p>&lt;h2 id="conclusion">Conclusion&lt;/h2>&lt;p>This study provided a clearer view of the performance of language models in PT-BR through evaluation on three different benchmarks. The results show that the models analyzed have reasonable performance when selecting an answer in ENEM knowledge areas, while also indicating that there is still room for improvement. On the other hand, in the IMDB sentiment analysis task, these smaller-scale models demonstrated good classification ability.&lt;/p>&lt;p>The team plans, in future studies, to conduct experiments with larger-scale models to enable broader comparisons of performance and efficiency. This will allow for a more detailed analysis of the errors made by each model, contributing to a deeper understanding of their strengths and limitations.&lt;/p></description></item><item><title>Performing CPU Inference on Power10</title><link>https://llm-pt-ibm.github.io/en/posts/power10/</link><pubDate>Sun, 06 Apr 2025 00:00:00 +0000</pubDate><guid>https://llm-pt-ibm.github.io/en/posts/power10/</guid><description>&lt;h2 id="background">Background&lt;/h2>&lt;p>In this post, we will share our experience running the Granite-20b-Code-Instruct model on a Power10 machine, describing the challenges and the necessary configurations to perform inference using Llama.cpp, one of the most popular open-source libraries in this domain.&lt;/p>&lt;h2 id="tldr">TL;DR&lt;/h2>&lt;ul>&lt;li>This post provides details on how to set up and run inference using IBM Power10 infrastructure.&lt;/li>&lt;li>Our main challenge was configuring Llama.cpp, which required adjustments such as installing Ninja-builder, compiling OpenBLAS, and updating the C compiler.&lt;/li>&lt;/ul>&lt;h2 id="infrastructure">Infrastructure&lt;/h2>&lt;p>Inference was performed on a machine with IBM POWER10 architecture, equipped with 750 GB 
of RAM and running Red Hat Enterprise Linux 8.10. Access to the environment was provided through a VM, requiring the use of a VPN to establish secure and controlled communication with the system, enabling remote and efficient execution of activities.&lt;/p>&lt;h2 id="initial-setup">Initial Setup&lt;/h2>&lt;p>The library that enables running LLMs using CPU resources is Llama.cpp. To set it up, we needed to resolve two external dependencies: Ninja-builder and OpenBLAS. Ninja-builder optimizes the compilation process, while OpenBLAS is a high-performance library for matrix computations.&lt;/p>&lt;p>During the OpenBLAS build process, we identified discrepancies in the internal tests validating matrix calculations, indicating a compatibility problem with the available C compiler, which was an older version (8.5.0). The solution was to &lt;strong>update the compiler to a newer version, 13.2&lt;/strong>, ensuring better compatibility with the Power10 architecture and validating the accuracy of the numerical operations required for Llama.cpp. 
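&lt;/p>&lt;p>A quick way to confirm which compiler is active before and after the update (a generic sanity check; the toolset name below assumes the &lt;code>gcc-toolset-13&lt;/code> software collection used in the steps that follow):&lt;/p>&lt;pre tabindex="0">&lt;code>gcc --version
scl enable gcc-toolset-13 -- gcc --version&lt;/code>&lt;/pre>&lt;p>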
Below, we present the step-by-step process used to enable the compilation of the required libraries and update the C compiler.&lt;/p>&lt;ol>&lt;li>Creating the build environment for the builder&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>sudo dnf update -y &amp;amp;&amp;amp; dnf -y groupinstall &amp;#39;Development Tools&amp;#39; &amp;amp;&amp;amp; dnf install -y \
    cmake git ninja-build-debugsource.ppc64le \
    &amp;amp;&amp;amp; dnf clean all&lt;/code>&lt;/pre>&lt;ol start="2">&lt;li>Updating the C Compiler and Setting Environment Variables&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>scl enable gcc-toolset-13 bash
export CC=/usr/bin/gcc-13
export CXX=/usr/bin/g++-13&lt;/code>&lt;/pre>&lt;ol start="3">&lt;li>Downloading and Building OpenBLAS&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>git clone --recursive https://github.com/DanielCasali/OpenBLAS.git &amp;amp;&amp;amp; cd OpenBLAS &amp;amp;&amp;amp; \
    make -j$(nproc --all) TARGET=POWER10 DYNAMIC_ARCH=1 &amp;amp;&amp;amp; \
    make PREFIX=/opt/OpenBLAS install &amp;amp;&amp;amp; \
    cd /&lt;/code>&lt;/pre>&lt;ol start="4">&lt;li>Downloading and Building Llama.cpp using the OpenBLAS library&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>git clone https://github.com/DanielCasali/llama.cpp.git &amp;amp;&amp;amp; cd llama.cpp &amp;amp;&amp;amp; \
    sed -i &amp;#34;s/powerpc64le/native -mvsx -mtune=native -D__POWER10_VECTOR__/g&amp;#34; ggml/src/CMakeLists.txt &amp;amp;&amp;amp; \
    mkdir build; \
    cd build; \
    cmake -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS -DBLAS_INCLUDE_DIRS=/opt/OpenBLAS/include -G Ninja ..; \
    cmake --build . --config Release&lt;/code>&lt;/pre>&lt;p>With all these steps completed successfully, the environment was properly configured and optimized for running Llama.cpp locally. 
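&lt;/p>&lt;p>As a quick check that the build produced the expected binaries (the path below assumes the repository was cloned under &lt;code>/root&lt;/code>; adjust it to your clone location):&lt;/p>&lt;pre tabindex="0">&lt;code>ls /root/llama.cpp/build/bin/ | grep llama&lt;/code>&lt;/pre>&lt;p>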
We are now able to start a server to perform inference with LLMs efficiently, using only CPU resources.&lt;/p>&lt;h2 id="performing-inference">Performing Inference&lt;/h2>&lt;p>We chose the Granite-20b-code-instruct model in the .GGUF format, which is specifically designed to optimize the performance of language models in CPU-only environments. These models are quantized, meaning their calculation precision is reduced, which in turn lowers their size and memory consumption, making them ideal for efficient execution with Llama.cpp. This approach enables high-performance local inference even on processor-only architectures such as POWER10. The model was downloaded directly from Hugging Face. Below, we show the step-by-step process to download it:&lt;/p>&lt;ol>&lt;li>Create a directory for the model in Llama.cpp:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>mkdir -p /root/llama.cpp/models/granite-20b-code-instruct-8k-GGUF&lt;/code>&lt;/pre>&lt;ol start="2">&lt;li>Access the directory in Llama.cpp:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>cd /root/llama.cpp/models/granite-20b-code-instruct-8k-GGUF&lt;/code>&lt;/pre>&lt;ol start="3">&lt;li>Download the model from Hugging Face:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>wget https://huggingface.co/ibm-granite/granite-20b-code-instruct-8k-GGUF/resolve/main/granite-20b-code-instruct.Q4_K_M.gguf&lt;/code>&lt;/pre>&lt;p>The last step can take a while, depending on the model’s number of parameters. However, once the steps above are completed, we can start a Llama.cpp server to perform inference. By default, the server is exposed on port 8080 of the Power10 machine, but this is fully customizable. 
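&lt;/p>&lt;p>For example, to expose the server on a different port, you can pass the &lt;code>--port&lt;/code> flag (a sketch; run &lt;code>llama-server --help&lt;/code> to confirm the options available in your build):&lt;/p>&lt;pre tabindex="0">&lt;code>/root/llama.cpp/build/bin/llama-server --host 0.0.0.0 --port 8081 \
    --model /root/llama.cpp/models/granite-20b-code-instruct-8k-GGUF/granite-20b-code-instruct.Q4_K_M.gguf&lt;/code>&lt;/pre>&lt;p>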
The following code illustrates how to configure and run the Llama server:&lt;/p>&lt;pre tabindex="0">&lt;code>/root/llama.cpp/build/bin/llama-server --host 0.0.0.0 --model /root/llama.cpp/models/granite-20b-code-instruct-8k-GGUF/granite-20b-code-instruct.Q4_K_M.gguf&lt;/code>&lt;/pre>&lt;p>With the Llama.cpp server running on port 8080, we can now perform inference via HTTP requests. In this example, for simplicity, we use curl to make the requests:&lt;/p>&lt;pre tabindex="0">&lt;code>curl -X POST http://localhost:8080/completion \
    -H &amp;#34;Content-Type: application/json&amp;#34; \
    -d &amp;#39;{ &amp;#34;prompt&amp;#34;: &amp;#34;Make a hello world program in Java. Your answer should be in Java code only.&amp;#34;, &amp;#34;max_tokens&amp;#34;: 100 }&amp;#39;&lt;/code>&lt;/pre>&lt;p>Below is an example of how the response is returned:&lt;/p>&lt;pre tabindex="0">&lt;code>{
  &amp;#34;content&amp;#34;: &amp;#34;public class HelloWorld { public static void main(String[] args) { System.out.println(\&amp;#34;Hello, World!\&amp;#34;); } }&amp;#34;
}&lt;/code>&lt;/pre>&lt;p>With this setup, we are now able to perform inference on CPU. Our upcoming posts will focus on running these inferences using the HELM (&lt;em>Holistic Evaluation of Language Models&lt;/em>) framework as the intermediary.&lt;/p></description></item><item><title>Introduction</title><link>https://llm-pt-ibm.github.io/en/posts/introducao/</link><pubDate>Wed, 12 Mar 2025 00:00:00 +0000</pubDate><guid>https://llm-pt-ibm.github.io/en/posts/introducao/</guid><description>&lt;p>Welcome to the blog of the partnership between the &lt;strong>Federal University of Campina Grande (UFCG)&lt;/strong> and &lt;strong>IBM&lt;/strong>!&lt;/p>&lt;p>This space brings together articles, tutorials, and research results produced by our team across different projects. 
Each project focuses on a distinct area of research:&lt;/p>&lt;ul>&lt;li>&lt;strong>&lt;a href="https://llm-pt-ibm.github.io/en/projects/llm-eval/">LLM Evaluation&lt;/a>&lt;/strong> — evaluation of large language models, with a focus on benchmarks for Brazilian Portuguese.&lt;/li>&lt;li>&lt;strong>&lt;a href="https://llm-pt-ibm.github.io/projects/agents-ai">AgentOps&lt;/a>&lt;/strong> — development of AI agents capable of autonomously performing multiple tasks.&lt;/li>&lt;li>&lt;strong>&lt;a href="https://llm-pt-ibm.github.io/en/projects/judo-ai/">Judo-AI&lt;/a>&lt;/strong> — use of AI models for analysis of judo matches and training sessions, applying computer vision and deep learning techniques for movement detection and action recognition.&lt;/li>&lt;li>&lt;strong>&lt;a href="https://llm-pt-ibm.github.io/en/projects/5g/">5G&lt;/a>&lt;/strong> — integration of AI techniques in 5G network environments, with intelligent control, optimization, and network management mechanisms.&lt;/li>&lt;li>&lt;strong>&lt;a href="https://llm-pt-ibm.github.io/en/projects/multiarq/">MultiArq&lt;/a>&lt;/strong> — provisioning of common tools for new architectures (ppc64le), seeking and adapting specific tools and creating technical documentation about the architecture.&lt;/li>&lt;/ul>&lt;p>Browse the posts and follow the latest updates!&lt;/p></description></item></channel></rss>