Building an API for LLM inferences on IBM Power9 servers

Background

This is the third post in a tutorial series that shows, step by step, how to build an LLM API on a Power9 server, from operating system setup to remote inference execution. In the first post we configured the operating system, NVIDIA drivers, CUDA, and cuDNN, and in the second we installed Conda and PyTorch. In this stage, we will build the API using FastAPI and the Transformers library, downloading models from Hugging Face and serving the application with uvicorn.

The implemented API will support generating API keys, loading models, performing inferences, checking status, and unloading models.

FastAPI: a modern web framework for building APIs with Python 3.8+, based on standard Python type hints and async programming. It is designed to be fast, easy to use, and robust, making API development more efficient.

Transformers: an open-source library developed by Hugging Face. It offers easy and efficient access to a wide collection of state-of-the-art pretrained models for Natural Language Processing (NLP), computer vision, and audio.

Hugging Face: Hugging Face is a platform focused on artificial intelligence, known for hosting NLP models and other tasks. The Hugging Face Hub is a collaborative repository where developers and researchers can share, version, and download ready-to-use models, making access and integration easier.

Uvicorn: a high-performance ASGI (Asynchronous Server Gateway Interface) web server for asynchronous Python applications.

TL;DR

  • This post provides a step-by-step guide to implementing an API that performs LLM inferences.
  • We will use FastAPI and Transformers to develop this API and Hugging Face to download the models.

Environment Setup

Directory Structure

Start by creating the basic project structure:

model_api/
├── requirements.txt
├── app/
│   ├── __init__.py
│   ├── main.py
│   ├── schemas.py
│   ├── auth.py
│   ├── model_manager.py
│   ├── utils.py
│   └── apikey_store.json
└── README.md (optional)

requirements.txt File

We will use FastAPI and Transformers to build the API. Additionally, we will use uvicorn to run the server, pydantic for input data validation, and torch, which we installed in the previous tutorial.

First, we’ll install the required libraries and then populate the requirements.txt file. Remember to activate your conda environment, if you created one, to ensure the correct PyTorch installation is used.

conda activate llm_api
pip install fastapi uvicorn transformers

The requirements.txt file will look like this:

requirements.txt

fastapi>=0.104.0
uvicorn>=0.24.0
torch>=2.0.0
transformers>=4.35.0
pydantic>=2.0.0

API Key Storage File

The apikey_store.json file will store the generated API keys. We will start with it empty, containing only {}.

apikey_store.json

{}

Schemas and Data Validation

Schemas are essential for validating the API’s input and output data. They ensure data is in the correct format and enable automatic documentation generation.

We will create the app/schemas.py file containing all the data models. We will define four models: GenerateRequest, LoadModelRequest, ApiKeyResponse, and LDAPUserRequest.

schemas.py

from pydantic import BaseModel, Field
from typing import Optional

class GenerateRequest(BaseModel):
    model_name: str = Field(..., description="The name of the model to use for generation.")
    prompt: str = Field(..., description="The input text to generate a response for.")
    max_tokens: Optional[int] = Field(300, description="The maximum number of new tokens to generate.")
    temperature: Optional[float] = Field(1.0, description="The sampling temperature for generation.")
    top_p: Optional[float] = Field(1.0, description="The cumulative probability for nucleus sampling.")
    hf_token: Optional[str] = Field(None, description="A Hugging Face access token, if the model requires one.")


class LoadModelRequest(BaseModel):
    model_name: str = Field(..., description="The name of the model to load.")
    device: Optional[str] = Field("cuda", description="The device to load the model on (e.g., 'cpu', 'cuda').")
    hf_token: Optional[str] = Field(None, description="A Hugging Face access token, if the model requires one.")

class ApiKeyResponse(BaseModel):
    api_key: str = Field(..., description="The API key for accessing the model API.")

class LDAPUserRequest(BaseModel):
    username: str = Field(..., description="The username for LDAP authentication.")
  • All classes inherit from pydantic’s BaseModel, gaining validation, serialization, and automatic documentation features.
  • The Field(...) declaration defines a required field with no default value.
  • The Field(value) declaration defines an optional field with value as its default.
  • The Optional[type] annotation indicates the field is optional but must be of type type if provided.
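To see these Field semantics in action, here is a minimal sketch using a trimmed-down version of GenerateRequest (only three of its fields, for brevity):

```python
from typing import Optional
from pydantic import BaseModel, Field, ValidationError

class GenerateRequest(BaseModel):
    model_name: str = Field(..., description="The name of the model to use.")
    prompt: str = Field(..., description="The input text.")
    max_tokens: Optional[int] = Field(300, description="Maximum new tokens.")

# Optional fields fall back to their declared defaults.
req = GenerateRequest(model_name="gpt2", prompt="Hello")
print(req.max_tokens)  # 300

# Omitting a required Field(...) raises a ValidationError.
try:
    GenerateRequest(prompt="Hello")
except ValidationError:
    print("model_name is required")
```

FastAPI performs this same validation automatically for every request body, returning a 422 response when it fails.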

With the schemas defined, let’s create the file responsible for API Key authentication.

Authentication and API Keys

The authentication system protects your API by ensuring that only authorized users can access the endpoints. We will implement a mechanism based on API Keys.

Let’s create the app/auth.py file with all the authentication functionalities.

auth.py

import secrets
import json
from fastapi import HTTPException, Request

APIKEY_STORE_FILE = "app/apikey_store.json"

def load_apikeys():
    try:
        with open(APIKEY_STORE_FILE, "r") as f:
            return json.load(f)
    except FileNotFoundError:
        raise HTTPException(
            status_code=404,
            detail=f"API keys file not found: {APIKEY_STORE_FILE}")

def save_apikeys(keys: dict):
    with open(APIKEY_STORE_FILE, "w") as f:
        json.dump(keys, f, indent=4)

def generate_apikey(user: str) -> str:
    key = secrets.token_hex(32)
    keys = load_apikeys()
    keys[user] = key
    save_apikeys(keys)
    return key

async def verify_apikey(request: Request) -> bool:
    apikey = request.headers.get("X-API-Key")
    if not apikey:
        raise HTTPException(
            status_code=401,
            detail="API key not provided.")
    try:
        keys = load_apikeys()
    except json.JSONDecodeError:
        raise HTTPException(
            status_code=500,
            detail="API key store is corrupted.")
    if apikey in keys.values():
        return True
    raise HTTPException(
        status_code=403,
        detail="Invalid API Key")
  • The load_apikeys function loads the information stored in the app/apikey_store.json file.
  • save_apikeys is responsible for saving the content in JSON format.
  • The generate_apikey function creates a key for a user and adds it to the dictionary using the provided username as the key.
  • verify_apikey will be called whenever a request arrives, to perform validation.
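The key-generation flow can be exercised on its own with just the standard library. The sketch below mirrors generate_apikey but takes the store path as a parameter, writing to a temporary file instead of app/apikey_store.json so it doesn't touch real state:

```python
import json
import secrets
import tempfile

# Stand-in for app/apikey_store.json, so the demo leaves real state alone.
store = tempfile.NamedTemporaryFile(mode="w", suffix=".json", delete=False)
json.dump({}, store)
store.close()

def generate_apikey(user: str, path: str) -> str:
    # 32 random bytes -> 64 hex characters.
    key = secrets.token_hex(32)
    with open(path) as f:
        keys = json.load(f)
    keys[user] = key
    with open(path, "w") as f:
        json.dump(keys, f, indent=4)
    return key

key = generate_apikey("alice", store.name)
print(len(key))  # 64

with open(store.name) as f:
    print(json.load(f)["alice"] == key)  # True
```

Because each user maps to a single key, generating a new key for the same user silently replaces the old one.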

Model and GPU Manager

The app/model_manager.py file is the core of the API, responsible for loading, managing, and running LLMs. It optimizes GPU/CPU usage and ensures efficient text generation.

model_manager.py

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from fastapi import HTTPException
import gc
from .utils import is_model_on_gpu

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

class ModelManager:
    def __init__(self):
        self.model = None
        self.tokenizer = None
        self.model_name = None

    def load_model(self, model_name: str, hf_token: str = None, device: str = DEVICE):
        if self.model_name == model_name:
            print(f"The model {model_name} is already loaded.")
            return

        if self.model_name is not None:
            print("Removing previously loaded model...")
            self.unload_model()

        print(f"Loading model {model_name} on device {device}...")
        try:
            if hf_token:
                self.tokenizer = AutoTokenizer.from_pretrained(model_name, token=hf_token)
                self.model = AutoModelForCausalLM.from_pretrained(model_name, device_map="balanced", token=hf_token)
            else:
                self.tokenizer = AutoTokenizer.from_pretrained(model_name)
                self.model = AutoModelForCausalLM.from_pretrained(model_name, device_map="balanced")
            self.model.eval()
            self.model_name = model_name
            print(is_model_on_gpu(self.model.hf_device_map, self.model_name))
        except Exception as e:
            raise HTTPException(status_code=500, detail=f"Error loading model: {str(e)}")

    def generate(self, model_name: str, hf_token: str, prompt: str, max_tokens: int = 300, temperature: float = 1.0, top_p: float = 1.0) -> str:
        if self.model_name != model_name:
            self.load_model(model_name, hf_token, device=DEVICE)

        if self.model is None or self.tokenizer is None:
            raise HTTPException(status_code=400, detail="No model loaded.")

        try:
            inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
            with torch.no_grad():
                outputs = self.model.generate(**inputs, max_new_tokens=max_tokens, temperature=temperature, top_p=top_p, eos_token_id=self.tokenizer.eos_token_id)
            return self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        except Exception as e:
            raise HTTPException(status_code=500, detail=f"Error generating text: {str(e)}")

    def get_status(self) -> str:
        if self.model is None:
            return "No model loaded."
        return is_model_on_gpu(self.model.hf_device_map, self.model_name)

    def unload_model(self):
        old_model = self.model_name
        self.model = None
        self.tokenizer = None
        self.model_name = None

        gc.collect()
        torch.cuda.empty_cache()
        return f"Model {old_model} successfully unloaded." if old_model else "No model loaded to unload."

manager = ModelManager()
  • The load_model function loads a new model into memory, removing any previously loaded model.
  • generate is the main function of the API, responsible for performing model inference. It allows adjusting the parameters: temperature, top_p, and max_tokens.
  • get_status reports whether there is a loaded model and whether it is on the GPU or CPU.
  • The unload_model function removes the model from memory, clears the CUDA cache, and invokes Python’s garbage collector to avoid leftovers that could interfere with future loads.

FastAPI API Endpoints

The app/main.py file is where all the components come together. In it, we define all the endpoints and the API’s routing logic.

main.py

from fastapi import FastAPI, Request, HTTPException, Depends
from fastapi.responses import JSONResponse
from app import schemas, model_manager, auth

app = FastAPI()

async def require_api_key(request: Request) -> bool:
    valid = await auth.verify_apikey(request)
    if not valid:
        raise HTTPException(status_code=401, detail="Invalid API Key")
    return valid

@app.post("/generate_apikey")
async def generate_apikey(payload: schemas.LDAPUserRequest) -> JSONResponse:
    key = auth.generate_apikey(payload.username)
    return JSONResponse(status_code=200, content={"api_key": key})

@app.post("/load_model", dependencies=[Depends(require_api_key)])
async def load_model(payload: schemas.LoadModelRequest) -> JSONResponse:
    try:
        model_manager.manager.load_model(payload.model_name, payload.hf_token, payload.device)
        return JSONResponse(content={"message": f"Model {payload.model_name} loaded successfully."})
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.post("/generate", dependencies=[Depends(require_api_key)])
async def generate(payload: schemas.GenerateRequest) -> JSONResponse:
    try:
        result = model_manager.manager.generate(payload.model_name, payload.hf_token, payload.prompt, payload.max_tokens, payload.temperature, payload.top_p)
        return JSONResponse(content={"result": result})
    except Exception as e:
        return JSONResponse(status_code=500, content={"error": str(e)})

@app.get("/status", dependencies=[Depends(require_api_key)])
async def status() -> JSONResponse:
    str_status = model_manager.manager.get_status()
    return JSONResponse(content={"status": str_status})

@app.post("/unload_model", dependencies=[Depends(require_api_key)])
async def unload_model() -> JSONResponse:
    try:
        str_unload = model_manager.manager.unload_model()
        return JSONResponse(content={"message": str_unload})
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
  • The require_api_key function runs on every protected request and rejects calls that do not carry a valid API Key.
  • generate_apikey creates and returns a new API key for the specified user.
  • load_model loads the specified model. If needed, it also accepts a Hugging Face token.
  • The generate function makes the model perform inference using the given prompt and parameters.
  • Calling the status endpoint returns the current status of the model manager.
  • unload_model unloads the currently loaded model and returns a success message if completed properly.
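Protected endpoints expect the key in the X-API-Key header, which is what verify_apikey reads on the server side (HTTP header names are case-insensitive). As a quick sketch using only Python's standard library, here is how such a request could be assembled; the model name, key, and base URL are placeholders:

```python
import json
import urllib.request

# Placeholder base URL; adjust to wherever the API is served.
BASE_URL = "http://localhost:8000"

def build_generate_request(api_key: str, prompt: str) -> urllib.request.Request:
    # Body fields match the GenerateRequest schema; "gpt2" is just an example model.
    payload = {"model_name": "gpt2", "prompt": prompt, "max_tokens": 50}
    return urllib.request.Request(
        f"{BASE_URL}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-API-Key": api_key},
        method="POST",
    )

req = build_generate_request("paste-your-key-here", "Hello, Power9!")
print(req.full_url)      # http://localhost:8000/generate
print(req.get_method())  # POST
```

With the server running, `urllib.request.urlopen(req)` would send the request; the next tutorial covers clients in more detail.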

utils.py File

The app/utils.py file contains the function that checks whether the loaded model is fully or partially on the GPU, or if it was loaded on the CPU.

utils.py

def is_model_on_gpu(hf_device_map: dict, model_name: str) -> str:
    if '' in hf_device_map.keys() and hf_device_map[''] == 'cpu':
        return f"Model {model_name} fully loaded on CPU."
    elif 'cpu' in hf_device_map.values():
        return f"Some layers of the model {model_name} are loaded on the CPU."
    else:
        return f"Model {model_name} fully loaded on GPU."
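The helper can be sanity-checked without loading a real model by feeding it hand-written device maps similar to those device_map="balanced" produces; the layer names below are illustrative:

```python
def is_model_on_gpu(hf_device_map: dict, model_name: str) -> str:
    if '' in hf_device_map.keys() and hf_device_map[''] == 'cpu':
        return f"Model {model_name} fully loaded on CPU."
    elif 'cpu' in hf_device_map.values():
        return f"Some layers of the model {model_name} are loaded on the CPU."
    else:
        return f"Model {model_name} fully loaded on GPU."

# A single '' key means the whole model sits on one device.
print(is_model_on_gpu({'': 'cpu'}, "gpt2"))
# Model gpt2 fully loaded on CPU.

# A mixed map means some layers were offloaded to the CPU.
print(is_model_on_gpu({'transformer.h.0': 0, 'lm_head': 'cpu'}, "gpt2"))
# Some layers of the model gpt2 are loaded on the CPU.

# All-GPU placements (integer device indices) report a full GPU load.
print(is_model_on_gpu({'transformer.h.0': 0, 'lm_head': 0}, "gpt2"))
# Model gpt2 fully loaded on GPU.
```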

Running the API

To run the API with uvicorn, simply execute a command specifying the host and port for the service to start.

uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
  • app.main:app refers to the app object defined in app/main.py, the file that connects all components and handles user requests.

  • --host 0.0.0.0 sets the IP address on which the Uvicorn server will listen. The value 0.0.0.0 allows the server to be accessible from any network interface on the Power9 machine.

  • --port 8000 specifies the port on which the server will listen for requests.

  • --reload is a flag for development use. It automatically reloads the server whenever changes are made.

By following this guide, you’ll have a working API capable of running LLM inference using models downloaded from Hugging Face. In the next tutorial, we will show how to send requests to the API using curl and Python.