LLM Inference API on IBM Power9 Server

Background

This is the fourth and final post in a tutorial series that shows, step by step, how to build an LLM API on a Power9 server, from operating system setup to remote inference execution. In the first post we configured the operating system, NVIDIA drivers, CUDA, and cuDNN; in the second we installed Conda and PyTorch; and in the third we built the API. In this post, we present the finished API and show how to make requests to it.

TL;DR

This post introduces the built LLM inference API and how to use it.
We will show how to make requests using Python and curl.

Introducing the API

The API is built with FastAPI. It loads specific models, keeps them in GPU memory across successive calls, and generates text from prompts sent via HTTP requests. It also includes API key access control, memory management (loading and unloading models), support for multiple GPUs with automatic sharding, and endpoints for status queries. The goal is to provide a robust, production-ready service optimized for intensive use, with fast inference and easy integration with external applications.

Architecture Overview

The API exposes LLMs through REST endpoints built with FastAPI. The ModelManager handles model loading, unloading, and inference, keeping models in GPU memory so that repeated calls are fast. Authentication is enforced with an API key. The architecture supports multiple GPUs with automatic sharding to optimize memory usage and performance. Models are downloaded from Hugging Face, and inference runs through the Transformers library.
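The idea of keeping a model resident between calls can be pictured with a minimal sketch. This is not the series' actual ModelManager: the class name `ModelCache`, the injected `loader` callable, and the method names are hypothetical, and a real service would load a Transformers model where this example uses a stand-in string.

```python
from typing import Any, Callable, Optional


class ModelCache:
    """Keeps at most one loaded model in memory between calls (illustrative only)."""

    def __init__(self, loader: Callable[[str], Any]) -> None:
        self._loader = loader  # hypothetical: function that loads a model by name
        self._name: Optional[str] = None
        self._model: Any = None

    def get(self, model_name: str) -> Any:
        # Reuse the resident model when the same name is requested again.
        if self._name != model_name:
            self.unload()
            self._model = self._loader(model_name)
            self._name = model_name
        return self._model

    def unload(self) -> None:
        # Drop the reference so the memory can be reclaimed.
        self._model = None
        self._name = None

    def status(self) -> dict:
        return {"loaded_model": self._name}


# Stand-in loader that records how often a "load" actually happens.
load_calls = []


def fake_loader(name: str) -> str:
    load_calls.append(name)
    return f"<model {name}>"


cache = ModelCache(fake_loader)
cache.get("model-a")
cache.get("model-a")  # served from memory, no second load
cache.get("model-b")  # different name: the previous model is replaced
```

Injecting the loader keeps the caching logic separate from the expensive load step, which also makes it easy to exercise without a GPU.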

Figure: Architecture diagram

Main Features

Load Models
- /load_model
- Loads a model from the Hugging Face Hub
- Performs sharding across GPUs
- Supports Hugging Face Token
Generate Text
- /generate
- Accepts prompt, max_tokens, model name, temperature, and top_p
- Uses an already loaded model or loads a new one
- Returns result in JSON
Management
- /status: Checks the loaded model and device (CPU/GPU)
- /unload_model: Frees GPU and memory
- /generate_apikey: Creates an API key for an LDAP user

Usage Flow

Inputs and Endpoints

The table below describes the API endpoints, required inputs, and responses.

Inputs and endpoints table

| Endpoint | Method | API Key | Input (Body/Query) | Response |
| --- | --- | --- | --- | --- |
| `/generate_apikey` | POST | ❌ | `username` | API key |
| `/load_model` | POST | ✅ | `model_name`, `hf_token` (optional), `device` (optional) | None, just loads the model |
| `/generate` | POST | ✅ | `model_name`, `prompt`, `hf_token` (optional), `max_tokens` (optional), `temperature` (optional), `top_p` (optional) | Text generated by the model |
| `/status` | GET | ✅ | None | Model status and the device it is loaded on |
| `/unload_model` | POST | ✅ | None | None, just unloads the model |
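As an illustration of the required and optional inputs for /generate, here is a small helper that builds the JSON body and leaves out any optional field that was not set. The helper name `build_generate_payload` is hypothetical, and the sketch assumes the server applies its own defaults to omitted fields, which the table implies but does not spell out.

```python
from typing import Optional


def build_generate_payload(
    model_name: str,
    prompt: str,
    hf_token: Optional[str] = None,
    max_tokens: Optional[int] = None,
    temperature: Optional[float] = None,
    top_p: Optional[float] = None,
) -> dict:
    """Return the JSON body for POST /generate, omitting unset optional fields."""
    optional = {
        "hf_token": hf_token,
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }
    payload = {"model_name": model_name, "prompt": prompt}
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload


payload = build_generate_payload(
    "ibm-granite/granite-3.3-8b-instruct", "Hello", max_tokens=50
)
```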

How to Use the API with Python

Generate API Key

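import json  # used when decoding responses in the snippets below
import os

import requests

# Base URL without a trailing slash, so f"{url}/endpoint" contains a single "/"
url = "http://<power9_ip_server>:8000"
username = "<ldap_user>"  # LDAP username, passed as a string
hf_token = os.getenv("HUGGINGFACE_TOKEN")  # sent later to /load_model and /generate

# Request an API key for the LDAP user
response = requests.post(f"{url}/generate_apikey", json={"username": username})

api_key = response.json().get("api_key")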

The Hugging Face token must be set as the HUGGINGFACE_TOKEN environment variable in the environment where this script runs, because the code reads it with os.getenv.
api_key holds the key returned by the /generate_apikey endpoint.
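Because `.get("api_key")` quietly returns None when the request fails, it can help to validate the response before moving on. The sketch below shows one way to do that; the function name `extract_api_key` is hypothetical, and the error body in the example is invented for the demonstration since the post does not show what the API returns on failure.

```python
import json


def extract_api_key(body: bytes) -> str:
    """Parse the /generate_apikey response body and return the key, or raise."""
    try:
        data = json.loads(body.decode())
    except (UnicodeDecodeError, json.JSONDecodeError) as exc:
        raise ValueError("response is not valid JSON") from exc
    api_key = data.get("api_key") if isinstance(data, dict) else None
    if not api_key:
        raise ValueError(f"no api_key in response: {data!r}")
    return api_key


# Example body with the "api_key" field used in the snippet above.
ok = extract_api_key(b'{"api_key": "abc123"}')
```

With the requests client, this would be called as `extract_api_key(response.content)`.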

Load Model

First, we create a header containing the API key returned by the code above and a payload with model_name and the Hugging Face token hf_token. We then send the request with both.

headers = {"Content-Type": "application/json",
           "x-api-key": api_key}

payload = {"model_name": "ibm-granite/granite-3.3-8b-instruct",
           "hf_token": hf_token}

resp = requests.post(f"{url}/load_model", headers=headers, json=payload)

Generate Text

Now we need to create a new payload with the necessary information to generate text with an LLM, which includes: prompt, model_name, and hf_token.

payload = {"prompt": "Hello, tell me a little about the Federal University of Campina Grande (UFCG)",
           "model_name": "ibm-granite/granite-3.3-8b-instruct",
           "hf_token": hf_token}

resp = requests.post(f"{url}/generate", headers=headers, json=payload)

# Decode the JSON body into a Python object
result = json.loads(resp.content.decode())
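The /generate endpoint also accepts the optional temperature and top_p fields. The post only names these parameters, so the following toy sketch shows what they conventionally mean in text generation, not how this API or the Transformers library implements them internally. The function names and the four-token vocabulary are made up for the example, and temperature is assumed to be greater than zero.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most-likely tokens whose probabilities reach top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    # Renormalize so the surviving probabilities sum to 1 again.
    return {i: probs[i] / total for i in kept}


logits = [2.0, 1.0, 0.0, -1.0]  # toy scores for a four-token vocabulary
sharp = softmax_with_temperature(logits, 0.5)
flat = softmax_with_temperature(logits, 1.5)
nucleus = top_p_filter([0.5, 0.3, 0.15, 0.05], top_p=0.75)
```

In this example the low temperature concentrates more probability on the top token than the high one does, and the top_p filter keeps only the two most likely tokens before sampling.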

Check status and unload the model

Checking the status and unloading the model require no payload, only the header with the API key:

requests.get(f"{url}/status", headers=headers).content

resp = requests.post(f"{url}/unload_model", headers=headers)
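Since a loaded model occupies GPU memory until /unload_model is called, one option is to pair the load and unload calls so that the unload always happens. The sketch below does this with a context manager; `loaded_model` is a hypothetical helper, and the `post` argument is assumed to behave like `requests.post` (a stand-in is used here so the example runs without a server).

```python
from contextlib import contextmanager


@contextmanager
def loaded_model(post, url, headers, model_name, hf_token=None):
    """Load a model on entry and always unload it on exit."""
    payload = {"model_name": model_name}
    if hf_token:
        payload["hf_token"] = hf_token
    post(f"{url}/load_model", headers=headers, json=payload)
    try:
        yield
    finally:
        # Runs even if the body raises, so GPU memory is not left occupied.
        post(f"{url}/unload_model", headers=headers)


# Demonstration with a stand-in for requests.post that only records the URLs.
calls = []


def fake_post(endpoint, headers=None, json=None):
    calls.append(endpoint)


with loaded_model(fake_post, "http://host:8000", {"x-api-key": "k"}, "some-model"):
    calls.append("work")
```

With the real client, `requests.post` would be passed as `post`, and the calls to /generate would go inside the `with` block.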

How to Use the API with curl

Generate API Key

curl -X POST "http://<power9_ip_server>:8000/generate_apikey" \
  -H "Content-Type: application/json" \
  -d '{"username": "<ldap_user>"}'

The Hugging Face token must be set as the HUGGINGFACE_TOKEN environment variable in the shell where the curl commands run, because the requests below read it from there.
The value of the username field must be enclosed in double quotation marks (" ").
After running the request above, save the returned API key as an environment variable to make later commands easier. To do so, copy the returned API key and run:

export API_KEY="<returned_api_key>"

Load Model

curl -X POST "http://<power9_ip_server>:8000/load_model" \
  -H "Content-Type: application/json" \
  -H "x-api-key: $API_KEY" \
  -d '{
        "model_name":"ibm-granite/granite-3.3-8b-instruct",
        "hf_token":"'"$HUGGINGFACE_TOKEN"'"
      }'

Generate Text

curl -X POST "http://<power9_ip_server>:8000/generate" \
  -H "Content-Type: application/json" \
  -H "x-api-key: $API_KEY" \
  -d '{
        "model_name": "ibm-granite/granite-3.3-8b-instruct",
        "prompt":"Hello, tell me a little about the Federal University of Campina Grande (UFCG)",
        "hf_token": "'"$HUGGINGFACE_TOKEN"'",
        "max_tokens":50
      }'
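Hand-written JSON inside a shell command is easy to get wrong (a missing comma or an unescaped quote is enough to make the request fail). As an optional aid, the sketch below uses Python's standard library to serialize a payload and quote it for the shell; the helper name `curl_data_argument` is hypothetical and the prompt is shortened for the example.

```python
import json
import shlex


def curl_data_argument(payload: dict) -> str:
    """Serialize a payload to JSON and quote it for use after `curl -d`."""
    return shlex.quote(json.dumps(payload))


payload = {
    "model_name": "ibm-granite/granite-3.3-8b-instruct",
    "prompt": "Hello, tell me a little about UFCG",
    "max_tokens": 50,
}
arg = curl_data_argument(payload)
print(f"-d {arg}")  # paste this after the -H options of the curl command
```

Because `json.dumps` produces the body, commas and escaping are always valid, and `shlex.quote` handles prompts that contain single quotes.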

Check status and unload the model

Checking the status and unloading the model require no payload, only the header with the API key:

curl -X GET "http://<power9_ip_server>:8000/status" \
  -H "Content-Type: application/json" \
  -H "x-api-key: $API_KEY"

curl -X POST "http://<power9_ip_server>:8000/unload_model" \
  -H "Content-Type: application/json" \
  -H "x-api-key: $API_KEY"

We hope this series has helped clarify the full development and deployment process. The LLM-IBM-UFCG team is available for questions or suggestions about future improvements.