<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Virtualization on IBM UFCG</title><link>https://llm-pt-ibm.github.io/en/tags/virtualization/</link><description>Recent content in Virtualization on IBM UFCG</description><generator>Hugo -- gohugo.io</generator><language>en</language><copyright>IBM &amp; UFCG - 2025</copyright><lastBuildDate>Fri, 27 Mar 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://llm-pt-ibm.github.io/en/tags/virtualization/index.xml" rel="self" type="application/rss+xml"/><item><title>LLM Inference with Ollama on IBM Power9 Using CPU</title><link>https://llm-pt-ibm.github.io/en/posts/ollama_cpu/</link><pubDate>Wed, 01 Apr 2026 00:00:00 +0000</pubDate><guid>https://llm-pt-ibm.github.io/en/posts/ollama_cpu/</guid><description>&lt;h2 id="context">Context&lt;/h2>&lt;p>This post presents a practical guide for performing inference of Large Language Models (LLMs) using &lt;a href="https://ollama.com/" rel="external">&lt;span class="link-personalizado">&lt;em>Ollama&lt;/em>&lt;/span>&lt;/a>, in an IBM POWER9 environment. Ollama is a &lt;em>framework&lt;/em> based on &lt;a href="https://github.com/ggml-org/llama.cpp.git" rel="external">&lt;span class="link-personalizado">&lt;em>llama.cpp&lt;/em>&lt;/span>&lt;/a>, designed to simplify the implementation and execution of such models, offering a user-friendly interface and support for various tasks.&lt;/p>&lt;figure>&lt;img src="https://llm-pt-ibm.github.io/images/funcionamento_ollama.png" alt="Figure 1"/>&lt;figcaption> &lt;p>Flow of a request&lt;/p> &lt;/figcaption>&lt;/figure>&lt;p>Despite the growth in LLM usage, the availability of materials focused on the &lt;em>ppc64le&lt;/em> architecture (IBM POWER9) is still quite limited. In general, available tutorials are old, poorly detailed, or focused on more common architectures like &lt;em>x86_64&lt;/em>, which makes it difficult to reproduce the environment in the context presented here.
This is the first of two posts in this series, which aims to perform inference entirely via CPU, exploring the &lt;em>ppc64le&lt;/em> architecture, in an updated, practical, and reproducible way. In the next post, we will address the use of GPU to accelerate the process.&lt;/p>&lt;h2 id="tldr">TL;DR&lt;/h2>&lt;ul>&lt;li>This post presents details on how to configure the environment to perform inferences with IBM POWER9 infrastructure.&lt;/li>&lt;li>Execution is performed via CPU using Ollama;&lt;/li>&lt;li>The main challenge involves correctly configuring the environment, especially dependencies like &lt;em>Go&lt;/em>, &lt;em>GCC&lt;/em>, and &lt;em>CMake&lt;/em>, in addition to compatibility with &lt;a href="https://www.redhat.com/en/technologies/linux-platforms/enterprise-linux" rel="external">&lt;span class="link-personalizado">&lt;em>RHEL&lt;/em>&lt;/span>&lt;/a>.&lt;/li>&lt;/ul>&lt;h2 id="environment-used">Environment Used&lt;/h2>&lt;p>&lt;strong>Hardware&lt;/strong>:&lt;/p>&lt;ul>&lt;li>&lt;em>ppc64le&lt;/em> architecture;&lt;/li>&lt;li>RAM: ~64GB;&lt;/li>&lt;li>Execution: Virtual Machine (VM);&lt;/li>&lt;/ul>&lt;p>&lt;strong>Operating System:&lt;/strong> AlmaLinux 8.10 (&lt;em>ppc64le&lt;/em>), binary compatible with &lt;em>Red Hat Enterprise Linux (RHEL)&lt;/em> 8.9/8.10.&lt;/p>&lt;h2 id="initial-setup">Initial &lt;em>Setup&lt;/em>&lt;/h2>&lt;p>To run Ollama on the POWER9 architecture, it is necessary to prepare the environment with the appropriate dependencies. The first step is to update the system and install basic utilities:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>sudo dnf update -y&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>sudo dnf install -y wget git tar make gcc gcc-c++ cmake gcc-toolset-11&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Although this command installs some dependencies, it is necessary to ensure that the correct versions are being used.&lt;/p>&lt;h3 id="configuring-go">Configuring &lt;em>Go&lt;/em>&lt;/h3>&lt;p>Ollama is developed in &lt;em>Go&lt;/em>, so it is necessary to ensure the appropriate version.&lt;/p>&lt;p>&lt;strong>Expected Version:&lt;/strong> 1.25.7 linux/ppc64le&lt;/p>&lt;h4 id="if-not-installed">If not installed:&lt;/h4>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>wget https://go.dev/dl/go1.25.7.linux-ppc64le.tar.gz&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>sudo tar -C /usr/local -xzf go1.25.7.linux-ppc64le.tar.gz&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>export PATH&lt;span style="color:#f92672">=&lt;/span>/usr/local/go/bin:$PATH&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>To add to &lt;em>PATH&lt;/em> permanently:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>echo &lt;span style="color:#e6db74">&amp;#39;export PATH=/usr/local/go/bin:$PATH&amp;#39;&lt;/span> &amp;gt;&amp;gt; ~/.bashrc&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>source ~/.bashrc&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Verify that the version is correct: &lt;code>go version&lt;/code>&lt;/p>&lt;h3 id="configuring-cmake">Configuring &lt;em>CMake&lt;/em>&lt;/h3>&lt;p>Verify that the version is correct: &lt;code>cmake --version&lt;/code>&lt;/p>&lt;p>&lt;strong>Expected Version:&lt;/strong> cmake 3.26.5&lt;/p>&lt;h4 id="if-not-installed-1">If not installed:&lt;/h4>&lt;div class="highlight">&lt;pre tabindex="0"
style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>wget https://github.com/Kitware/CMake/releases/download/v3.26.5/cmake-3.26.5.tar.gz&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>tar -xzf cmake-3.26.5.tar.gz&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>cd cmake-3.26.5&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>./bootstrap&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>make -j&lt;span style="color:#66d9ef">$(&lt;/span>nproc&lt;span style="color:#66d9ef">)&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>sudo make install&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="configuring-gcc">Configuring &lt;em>GCC&lt;/em>&lt;/h3>&lt;p>&lt;strong>Expected Version:&lt;/strong> &lt;code>gcc 11.2.1&lt;/code>&lt;/p>&lt;p>&lt;strong>Important:&lt;/strong> On AlmaLinux 8, the &lt;em>gcc-toolset&lt;/em> is not activated automatically. It is necessary to enable the session manually:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>scl enable gcc-toolset-11 bash&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>This command activates GCC only in the current session. 
If you open another terminal, you will need to run the command again.&lt;/p>&lt;p>&lt;strong>Verify the version:&lt;/strong> &lt;code>gcc --version&lt;/code>&lt;/p>&lt;h4 id="if-not-installed-2">If not installed:&lt;/h4>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>sudo dnf install -y gcc-toolset-11&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>scl enable gcc-toolset-11 bash&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="cloning-ollama">Cloning Ollama&lt;/h3>&lt;p>With the environment configured, we can build Ollama. Here we clone the official Ollama repository and change the version used (important for POWER compatibility and to get a stable version).&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>cd /root&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>git clone https://github.com/ollama/ollama.git&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>cd ollama&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">#Change the version: &lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>git checkout v0.9.4&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>To verify, use: &lt;code>git status&lt;/code>&lt;/p>&lt;h2 id="build-ollama">&lt;em>Build&lt;/em> Ollama&lt;/h2>&lt;p>After activating GCC in the correct version:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>export CGO_ENABLED&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">1&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>go clean -cache -modcache -i -r&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>go build -o ollama .&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;em>CGO&lt;/em> needs to be enabled because Ollama depends on llama.cpp, which uses C/C++ code for performance optimizations. Without it, the &lt;em>build&lt;/em> fails or loses compatibility with the architecture.&lt;/p>&lt;p>This should complete without errors and generate the &lt;code>ollama&lt;/code> binary in the current directory.&lt;/p>&lt;p>To verify: &lt;code>./ollama --version&lt;/code>&lt;/p>&lt;h2 id="performing-inference">Performing Inference&lt;/h2>&lt;p>With &lt;em>Ollama&lt;/em> compiled, we can start the server:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>./ollama serve&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>An important observation: since the environment runs on a virtual machine, it is not possible to keep this command running in the main terminal and, simultaneously, use another terminal in the same session to perform inference without some auxiliary tool to manage multiple terminals. We will therefore run the server in the background, but you can instead use &lt;em>Tmux&lt;/em> or &lt;em>Screen&lt;/em>, keeping the same terminal available for executing the remaining commands (which we will see next).
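If you opt for Tmux, a minimal detached-session sketch (assuming the tmux package is installed on the VM; the session name is arbitrary) could look like this:

```shell
# Hypothetical sketch: create a detached tmux session named 'ollama',
# so the current terminal stays free for the remaining commands.
tmux new-session -d -s ollama
# Type the server command into that session without attaching to it:
tmux send-keys -t ollama './ollama serve' Enter
tmux ls   # list sessions to confirm it exists
# Attach later with: tmux attach -t ollama   (detach again with Ctrl+B, then D)
```

The same idea applies to Screen; tmux is shown only as one possible choice.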
For this, you can run:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>./ollama serve &amp;amp;&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>To verify if it worked: &lt;code>ps aux | grep ollama&lt;/code>. It will show something like:&lt;/p>&lt;figure>&lt;img src="https://llm-pt-ibm.github.io/images/print_ollama_serve.png" alt="Figure 2"/>&lt;figcaption> &lt;p>Ollama running&lt;/p> &lt;/figcaption>&lt;/figure>&lt;h2 id="download-the-test-model-and-run-inference">Download the test model and run inference&lt;/h2>&lt;p>For validation, we used the &lt;em>TinyLlama&lt;/em> model, as it is lightweight and suitable for CPU execution. For this, in another terminal, run:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>./ollama pull tinyllama&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>To run inference:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>./ollama run tinyllama &lt;span style="color:#e6db74">&amp;#34;The sky is blue?&amp;#34;&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>If everything has been done correctly, you will have something like:&lt;/p>&lt;figure>&lt;img src="https://llm-pt-ibm.github.io/images/ollama_run.png" alt="Figure 3"/>&lt;figcaption> &lt;p>Inference being executed&lt;/p> &lt;/figcaption>&lt;/figure>&lt;p>It is important to highlight that &lt;em>Ollama&lt;/em> works, by default, with models available in its own repository, which are already converted and optimized for execution, 
generally in a format compatible with &lt;em>llama.cpp&lt;/em>. These models can be easily used via the &lt;code>ollama pull&lt;/code> command, as in the case of &lt;em>TinyLlama&lt;/em> used in this example. Although it is possible to use external models, this requires additional steps, such as conversion to compatible formats (for example, &lt;em>GGUF&lt;/em>) and the creation of a &lt;em>Modelfile&lt;/em>.&lt;/p>&lt;h2 id="final-considerations">Final Considerations&lt;/h2>&lt;p>With the steps presented, it was possible to configure the environment to run LLM inference on an IBM POWER9 machine using the CPU. Although functional, this approach has performance limitations, especially for larger models, due to the absence of GPU acceleration. As a next step, we intend to explore execution using the GPU, evaluating performance gains and scalability.&lt;/p>&lt;h2 id="next-steps">Next Steps&lt;/h2>&lt;ul>&lt;li>Test newer versions and compatibility between them;&lt;/li>&lt;li>Conduct benchmarking experiments to compare CPU inference performance against GPU inference;&lt;/li>&lt;li>Publish the second post in this series, performing GPU inference.&lt;/li>&lt;/ul></description></item><item><title>Power9 Virtualization: how we structured an isolated environment with KVM and Libvirt</title><link>https://llm-pt-ibm.github.io/en/posts/virtualization/</link><pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate><guid>https://llm-pt-ibm.github.io/en/posts/virtualization/</guid><description>&lt;h2 id="context">Context&lt;/h2>&lt;p>Given the need to establish isolated and secure environments for installing libraries, frameworks, and general-purpose tools, environment encapsulation emerged as an effective solution, implemented through KVM managed via &lt;code>virt-manager&lt;/code> and &lt;code>virsh&lt;/code>.&lt;/p>&lt;p>Virtualization is widely used in x86 environments, with mature tooling and established workflows.
However, when migrating to architectures such as IBM Power9 (&lt;code>ppc64le&lt;/code>), many of these processes are no longer straightforward and require architecture-specific adaptations. Below, we provide a diagram showing this interaction across four layers.&lt;/p>&lt;h2 id="communication-flow-between-hardware-power9-and-virtual-machines">Communication flow between Hardware (Power9) and Virtual Machines&lt;/h2>&lt;p>The flow is organized into the following layers:&lt;/p>&lt;figure>&lt;img src="https://llm-pt-ibm.github.io/images/kvm_virtualizacao.png" alt="Figure 1: Diagram representing a 4-layer virtualization architecture."/>&lt;figcaption> &lt;p>Figure 1: Diagram representing a 4-layer virtualization architecture.&lt;/p> &lt;/figcaption>&lt;/figure>&lt;p>In this work, we explore how to build a virtualized environment using KVM and Libvirt on a Power9 server, with a focus on isolation, reproducibility, and shared team usage.&lt;/p>&lt;h2 id="tldr">TL;DR&lt;/h2>&lt;ul>&lt;li>We implemented a virtualized environment on Power9 using KVM + Libvirt.&lt;/li>&lt;li>We adapted common virtualization workflows to &lt;code>ppc64le&lt;/code>, solving permission, write-lock, and provisioning issues.&lt;/li>&lt;li>The environment provides secure isolation between users and straightforward VM management.&lt;/li>&lt;li>We provide ready-to-use images with NVIDIA/CUDA drivers for immediate use.&lt;/li>&lt;/ul>&lt;h2 id="environment-used">Environment used&lt;/h2>&lt;ul>&lt;li>&lt;strong>Architecture&lt;/strong>: IBM Power9 server (&lt;code>ppc64le&lt;/code> architecture).&lt;/li>&lt;li>&lt;strong>Operating System (OS)&lt;/strong>: AlmaLinux 8.10 binary-compatible with Red Hat Enterprise Linux (RHEL) 8.9/8.10.&lt;/li>&lt;li>&lt;strong>RAM&lt;/strong>: 512GB.&lt;/li>&lt;li>&lt;strong>Execution&lt;/strong>: &lt;code>virt-manager&lt;/code> for
Virtual Machine (VM) management.&lt;/li>&lt;li>&lt;strong>Hypervisor&lt;/strong>: KVM (Kernel-based Virtual Machine) / QEMU.&lt;/li>&lt;li>&lt;strong>Management&lt;/strong>: Libvirt (&lt;code>virsh&lt;/code>, &lt;code>virt-install&lt;/code>, &lt;code>virt-customize&lt;/code>).&lt;/li>&lt;li>&lt;strong>Storage&lt;/strong>: Virtual disks in &lt;code>.qcow2&lt;/code> format.&lt;/li>&lt;li>&lt;strong>GPUs&lt;/strong>: 4x NVIDIA Tesla V100 SXM2 16GB (NVLink2).&lt;/li>&lt;/ul>&lt;h2 id="installing-the-virtualization-environment-kvm--libvirt">Installing the virtualization environment (KVM + Libvirt)&lt;/h2>&lt;p>Before creating any VM, you need to install and configure KVM and Libvirt on the Power9 server.&lt;/p>&lt;ol>&lt;li>&lt;strong>Package installation&lt;/strong>:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>sudo dnf install -y qemu-kvm libvirt libvirt-client libvirt-daemon libvirt-daemon-kvm virt-install virt-viewer guestfs-tools \
libguestfs-tools python3-libvirt&lt;/code>&lt;/pre>&lt;ol start="2">&lt;li>&lt;strong>Starting the service&lt;/strong>:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>sudo systemctl enable --now libvirtd
sudo systemctl status libvirtd&lt;/code>&lt;/pre>&lt;ol start="3">&lt;li>&lt;strong>Adding your user to the &lt;code>libvirt&lt;/code> group&lt;/strong>: So non-root users can manage VMs without requiring &lt;code>sudo&lt;/code> for every command:&lt;/li>&lt;/ol>&lt;p>Run the command below:&lt;/p>&lt;pre tabindex="0">&lt;code>sudo usermod -aG libvirt $(whoami)&lt;/code>&lt;/pre>&lt;p>Log out and log back in for the change to take effect.&lt;/p>&lt;ol start="4">&lt;li>&lt;strong>Verifying the installation&lt;/strong>:&lt;/li>&lt;/ol>&lt;p>Check &lt;code>virsh&lt;/code> version:&lt;/p>&lt;pre tabindex="0">&lt;code>sudo virsh version&lt;/code>&lt;/pre>&lt;p>Validate CPU virtualization support:&lt;/p>&lt;pre tabindex="0">&lt;code>sudo virt-host-validate&lt;/code>&lt;/pre>&lt;h2 id="setup">Setup&lt;/h2>&lt;ol>&lt;li>&lt;strong>Environment
preparation&lt;/strong>: In KVM, the fastest way to provision VMs is to clone a “seed” image (&lt;code>.qcow2&lt;/code>) and expand it, instead of performing a clean install from ISO. To keep things organized, all virtual disks should be stored in a dedicated directory:&lt;/li>&lt;/ol>&lt;p>Download the AlmaLinux 8 base image:&lt;/p>&lt;pre tabindex="0">&lt;code>cd /home/user/
wget https://repo.almalinux.org/almalinux/8/cloud/ppc64le/images/AlmaLinux-8-GenericCloud-latest.ppc64le.qcow2 -O alma8_base.qcow2&lt;/code>&lt;/pre>&lt;ol start="2">&lt;li>&lt;strong>Hypervisor management&lt;/strong>: Hypervisor and instance administration follows specific procedures to ensure system stability. Administrator commands to control virtualization services on Power9:&lt;/li>&lt;/ol>&lt;p>Stop KVM services:&lt;/p>&lt;pre tabindex="0">&lt;code>sudo systemctl stop libvirtd&lt;/code>&lt;/pre>&lt;p>Start KVM services again:&lt;/p>&lt;pre tabindex="0">&lt;code>sudo systemctl start libvirtd&lt;/code>&lt;/pre>&lt;p>Enable at boot:&lt;/p>&lt;pre tabindex="0">&lt;code>sudo systemctl enable libvirtd&lt;/code>&lt;/pre>&lt;ol start="3">&lt;li>&lt;strong>Permission setup&lt;/strong>: The system user running KVM (&lt;code>qemu&lt;/code>) needs permission to access VM disks. If disks are stored inside a personal home directory, Linux blocks access by default.
To allow hypervisor access without exposing personal files, grant execute (&lt;code>o+x&lt;/code>) permission on directories:&lt;/li>&lt;/ol>&lt;p>Allow &lt;code>qemu&lt;/code> to traverse the home directory (traversal only, no read permission):&lt;/p>&lt;pre tabindex="0">&lt;code>chmod o+x /home/user&lt;/code>&lt;/pre>&lt;p>Allow &lt;code>qemu&lt;/code> to access the disk directory:&lt;/p>&lt;pre tabindex="0">&lt;code>chmod o+x /home/user/discos&lt;/code>&lt;/pre>&lt;ol start="4">&lt;li>&lt;strong>Virtual network configuration (Libvirt)&lt;/strong>: Libvirt creates a default NAT network (&lt;code>default&lt;/code>) that places VMs in the &lt;code>192.168.122.0/24&lt;/code> range. VMs can access the internet through NAT, but they are not directly reachable from external networks without additional setup.&lt;/li>&lt;/ol>&lt;p>Check network status:&lt;/p>&lt;pre tabindex="0">&lt;code>sudo virsh net-list --all&lt;/code>&lt;/pre>&lt;p>If inactive, start and enable at boot:&lt;/p>&lt;pre tabindex="0">&lt;code>sudo virsh net-start default
sudo virsh net-autostart default&lt;/code>&lt;/pre>&lt;p>If the network does not exist, define and initialize it:&lt;/p>&lt;pre tabindex="0">&lt;code>sudo virsh net-define /usr/share/libvirt/networks/default.xml
sudo virsh net-start default
sudo virsh net-autostart default&lt;/code>&lt;/pre>&lt;p>If the XML file is missing, install the network config package:&lt;/p>&lt;pre tabindex="0">&lt;code>sudo dnf install -y libvirt-daemon-config-network&lt;/code>&lt;/pre>&lt;ol start="5">&lt;li>&lt;strong>Creating new VMs&lt;/strong>:&lt;/li>&lt;/ol>&lt;p>Clone the base image:&lt;/p>&lt;pre tabindex="0">&lt;code>cp /home/user/alma8_base.qcow2 /home/user/discos/nome_vm.qcow2&lt;/code>&lt;/pre>&lt;p>Expand the disk (must be done BEFORE creating the VM):&lt;/p>&lt;pre tabindex="0">&lt;code>qemu-img resize /home/user/discos/nome_vm.qcow2 +100G&lt;/code>&lt;/pre>&lt;p>Create the VM:&lt;/p>&lt;pre tabindex="0">&lt;code>sudo virt-install \
  --connect qemu:///system \
  --name vm_nome \
  --memory 131072 \
  --vcpus 16 \
  --cpu host \
  --disk path=/home/user/discos/nome_vm.qcow2,format=qcow2 \
  --import \
  --os-variant almalinux8 \
  --network network=default \
  --graphics none \
  --noautoconsole&lt;/code>&lt;/pre>&lt;ol start="6">&lt;li>&lt;strong>Post-creation VM customization&lt;/strong>: After creating the VM, you must set the root password, since cloud images usually come without one. We use &lt;code>virt-customize&lt;/code> for this. &lt;strong>Important&lt;/strong>: the VM must be powered off before safely editing its disk.&lt;/li>&lt;/ol>&lt;p>Shut down the VM:&lt;/p>&lt;pre tabindex="0">&lt;code>sudo virsh shutdown vm_nome&lt;/code>&lt;/pre>&lt;p>Wait for complete shutdown:&lt;/p>&lt;pre tabindex="0">&lt;code>sudo virsh list --all&lt;/code>&lt;/pre>&lt;p>Inject the root password into disk:&lt;/p>&lt;pre tabindex="0">&lt;code>sudo virt-customize -a /home/user/discos/nome_vm.qcow2 \
  --root-password password:senha_desejada&lt;/code>&lt;/pre>&lt;p>Start the VM again:&lt;/p>&lt;pre tabindex="0">&lt;code>sudo virsh start vm_nome&lt;/code>&lt;/pre>&lt;ol start="7">&lt;li>&lt;strong>Accessing VMs&lt;/strong>:&lt;/li>&lt;/ol>&lt;p>&lt;strong>Via serial console&lt;/strong>&lt;/p>&lt;p>Connect to VM console:&lt;/p>&lt;pre tabindex="0">&lt;code>sudo virsh console vm_nome&lt;/code>&lt;/pre>&lt;p>To exit the console, use &lt;code>Ctrl + ]&lt;/code>.&lt;/p>&lt;p>&lt;strong>Via SSH&lt;/strong>&lt;/p>&lt;p>Find the VM IP address:&lt;/p>&lt;pre tabindex="0">&lt;code>sudo virsh domifaddr vm_nome&lt;/code>&lt;/pre>&lt;p>Access via SSH:&lt;/p>&lt;pre tabindex="0">&lt;code>ssh root@&amp;lt;ip_da_vm&amp;gt;&lt;/code>&lt;/pre>&lt;ol start="8">&lt;li>&lt;strong>Managing and deleting VMs&lt;/strong>: If you need to destroy an environment and recreate it from scratch, follow these 3 mandatory cleanup steps:&lt;/li>&lt;/ol>&lt;p>Force-stop the VM:&lt;/p>&lt;pre tabindex="0">&lt;code>sudo virsh destroy nome_da_vm&lt;/code>&lt;/pre>&lt;p>Remove VM
definition from Libvirt:&lt;/p>&lt;pre tabindex="0">&lt;code>sudo virsh undefine nome_da_vm&lt;/code>&lt;/pre>&lt;p>Delete the virtual disk to free Power9 storage:&lt;/p>&lt;pre tabindex="0">&lt;code>rm -f /home/user/discos/nome_da_vm.qcow2&lt;/code>&lt;/pre>&lt;ol start="9">&lt;li>&lt;strong>Creating a VM from an existing image (cloning)&lt;/strong>: To create a new VM from an already configured image, such as prebuilt NVIDIA-ready images:&lt;/li>&lt;/ol>&lt;p>Option A: clone via &lt;code>qemu-img&lt;/code> (keeps original image intact):&lt;/p>&lt;pre tabindex="0">&lt;code>qemu-img create -f qcow2 -b imagem-base.qcow2 -F qcow2 nova-vm.qcow2&lt;/code>&lt;/pre>&lt;p>Option B: clone via &lt;code>virt-clone&lt;/code>:&lt;/p>&lt;pre tabindex="0">&lt;code>virt-clone \
  --original vm-base \
  --name vm-nova \
  --file /home/user/discos/nova-vm.qcow2&lt;/code>&lt;/pre>&lt;p>If needed, you can execute the VM deletion step above and recreate it according to step 5.&lt;/p>&lt;h2 id="ready-to-use-images-with-nvidia-drivers">Ready-to-use images with NVIDIA drivers&lt;/h2>&lt;p>To simplify the use of Tesla V100 GPUs available on the server, we provide pre-configured &lt;code>.qcow2&lt;/code> images with NVIDIA drivers, CUDA, and cuDNN already installed.
This removes the need to configure the base environment for every new use.&lt;/p>&lt;ol>&lt;li>&lt;p>&lt;strong>Available images&lt;/strong>:&lt;/p>&lt;table>&lt;thead>&lt;tr>&lt;th style="text-align:left">Image&lt;/th>&lt;th style="text-align:left">Contents&lt;/th>&lt;/tr>&lt;/thead>&lt;tbody>&lt;tr>&lt;td style="text-align:left">AlmaLinux-8-Power9-NVIDIA-drivers.qcow2.xz&lt;/td>&lt;td style="text-align:left">AlmaLinux 8.10 + NVIDIA drivers 535 + CUDA 12.2 + cuDNN 9.0&lt;/td>&lt;/tr>&lt;/tbody>&lt;/table>&lt;/li>&lt;li>&lt;p>&lt;strong>How to use pre-configured images&lt;/strong>:&lt;/p>&lt;/li>&lt;/ol>&lt;p>Download and decompress the image:&lt;/p>&lt;pre tabindex="0">&lt;code>wget &amp;lt;url_do_repositorio&amp;gt;/AlmaLinux-8-Power9-NVIDIA-drivers.qcow2.xz
xz -d AlmaLinux-8-Power9-NVIDIA-drivers.qcow2.xz&lt;/code>&lt;/pre>&lt;p>Move it to the disks directory and create a VM from it:&lt;/p>&lt;pre tabindex="0">&lt;code>cp AlmaLinux-8-Power9-NVIDIA-drivers.qcow2 /home/user/discos/minha-vm-gpu.qcow2&lt;/code>&lt;/pre>&lt;p>Create the VM as usual:&lt;/p>&lt;pre tabindex="0">&lt;code>sudo virt-install \
  --connect qemu:///system \
  --name vm_gpu \
  --memory 131072 \
  --vcpus 16 \
  --cpu host \
  --disk path=/home/user/discos/minha-vm-gpu.qcow2,format=qcow2 \
  --import \
  --os-variant almalinux8 \
  --network network=default \
  --graphics none \
  --noautoconsole&lt;/code>&lt;/pre>&lt;p>For the VM to access physical GPUs, PCIe passthrough must be configured as described in the next post of this series.&lt;/p>&lt;ol start="3">&lt;li>&lt;strong>How to generate a new image from a configured VM&lt;/strong>: After installing drivers or any software inside a VM, you can export its current state as a reusable image:&lt;/li>&lt;/ol>&lt;p>Shut down the VM:&lt;/p>&lt;pre tabindex="0">&lt;code>sudo virsh shutdown vm_nome&lt;/code>&lt;/pre>&lt;p>Convert and compress the image (removes unused space):&lt;/p>&lt;pre tabindex="0">&lt;code>qemu-img convert -O qcow2 -c \
  /home/user/discos/vm_nome.qcow2 \
  /home/user/discos/AlmaLinux-8-Power9-minha-imagem.qcow2&lt;/code>&lt;/pre>&lt;p>Compress for distribution:&lt;/p>&lt;pre tabindex="0">&lt;code>xz -T0 -v /home/user/discos/AlmaLinux-8-Power9-minha-imagem.qcow2&lt;/code>&lt;/pre>&lt;p>Expected output: &lt;code>AlmaLinux-8-Power9-minha-imagem.qcow2.xz&lt;/code>.&lt;/p>&lt;p>Verify image integrity:&lt;/p>&lt;pre tabindex="0">&lt;code>qemu-img check AlmaLinux-8-Power9-minha-imagem.qcow2
qemu-img info AlmaLinux-8-Power9-minha-imagem.qcow2&lt;/code>&lt;/pre></description></item><item><title>Evaluation of IBM Granite Models for Code-Generation Tasks on HumanEvalX</title><link>https://llm-pt-ibm.github.io/en/posts/post_humanevalx/</link><pubDate>Fri, 28 Nov 2025 00:00:00 +0000</pubDate><guid>https://llm-pt-ibm.github.io/en/posts/post_humanevalx/</guid><description>&lt;h2 id="context">Context&lt;/h2>&lt;p>The use of language models for &lt;strong>code generation and understanding&lt;/strong> has become essential in modern development workflows.&lt;br>As part of a joint research effort between &lt;strong>LSD/UFCG&lt;/strong> and &lt;strong>IBM&lt;/strong>, we investigated the performance of the &lt;strong>IBM Granite 4&lt;/strong> family on the &lt;strong>HumanEvalX&lt;/strong> benchmark, which evaluates programming capabilities in &lt;em>five languages&lt;/em>: Python, Java, Go, C++, and JavaScript.&lt;/p>&lt;p>The goal was to answer key questions from the team:&lt;/p>&lt;ul>&lt;li>&lt;em>How versatile are the Granite models across different languages?&lt;/em>&lt;/li>&lt;li>&lt;em>Do smaller models deliver useful performance?&lt;/em>&lt;/li>&lt;li>&lt;em>How do the Granites compare to models from other providers such as DeepSeek Coder and CodeLlama?&lt;/em>&lt;/li>&lt;/ul>&lt;hr>&lt;h2 id="methodology--process">Methodology / Process&lt;/h2>&lt;p>The evaluation was conducted using &lt;strong>OpenCompass&lt;/strong>, a modern and extensible framework for large-scale LLM benchmarking.
It allowed experiments to be executed in a standardized, reproducible way with consistent inference protocols.&lt;/p>&lt;p>Since OpenCompass does not provide native support for models hosted on the &lt;strong>IBM Cloud&lt;/strong>, it was necessary to develop a custom client to integrate the framework with the IBM Cloud Inference API. This client allowed the evaluation process to send requests transparently, handle authentication, manage generation parameters, and return outputs in the expected benchmark format. Experiments were also run in &lt;strong>Google Colab&lt;/strong>, which served as a practical environment for prototyping and running the models.&lt;/p>&lt;p>We used the HumanEvalX benchmark, an extension of the traditional HumanEval, covering five languages with the &lt;strong>Pass@1&lt;/strong> metric.&lt;/p>&lt;p>The evaluated models included:&lt;/p>&lt;ul>&lt;li>Granite 4.0 Micro (3B)&lt;/li>&lt;li>Granite 4.0 (1B)&lt;/li>&lt;li>Granite 4.0 h-tiny (7B)&lt;/li>&lt;li>Granite 4.0 h-small (30B) — via IBM Cloud&lt;/li>&lt;li>Granite 4.0 (350M)&lt;/li>&lt;li>Granite Code Instruct (8B) — via IBM Cloud&lt;/li>&lt;li>DeepSeek Coder (6.7B)&lt;/li>&lt;li>CodeLlama (7B)&lt;/li>&lt;/ul>&lt;p>The metric used was &lt;strong>Pass@1&lt;/strong>, following the benchmark protocol.&lt;/p>&lt;hr>&lt;h2 id="results-and-conclusions">Results and Conclusions&lt;/h2>&lt;figure>&lt;img src="https://llm-pt-ibm.github.io/images/heatmap_humanevalX.png" alt="Performance heatmap"/>&lt;figcaption> &lt;p>Performance heatmap of the models on HumanEvalX.&lt;/p> &lt;/figcaption>&lt;/figure>&lt;p>The evaluation revealed important behaviors:&lt;/p>&lt;h3 id="1-granite-40-h-small-stood-out-for-its-versatility">&lt;strong>1. granite-4.0-h-small stood out for its versatility&lt;/strong>&lt;/h3>&lt;p>It surpassed 60% &lt;strong>Pass@1&lt;/strong> in Java, C++, and JavaScript, while also maintaining over 50% in Python and Go.
This consistent performance across languages suggests that the model has good generalization capability, showing promise in scenarios that involve different programming ecosystems, although additional benchmarks and evaluations are important before drawing broader conclusions.&lt;/p>&lt;h3 id="2-granite-micro-3b-performed-above-expectations">&lt;strong>2. Granite Micro (3B) performed above expectations&lt;/strong>&lt;/h3>&lt;p>Despite being a small model, Granite Micro (3B) delivered 65.85% in JavaScript and 68.90% in Java, outperforming even some larger models evaluated. This shows that even with a compact architecture, it can deliver solid results, making it a highly efficient option for applications that require low computational cost without sacrificing performance.&lt;/p>&lt;h3 id="3-the-size-progression-350m--1b--3b--7b--30b-shows-gradual-and-coherent-evolution">&lt;strong>3. The size progression (350M → 1B → 3B → 7B → 30B) shows gradual and coherent evolution&lt;/strong>&lt;/h3>&lt;p>The results show that as we move through the different sizes of the Granite line, there is a coherent evolution in performance. Smaller models deliver stable results within their category, while larger ones gradually expand the ability to solve more complex tasks. This distribution helps clarify where each model fits in the usage spectrum.&lt;/p>&lt;h3 id="4-comparing-different-providers-helps-contextualize-the-results">&lt;strong>4. Comparing different providers helps contextualize the results&lt;/strong>&lt;/h3>&lt;p>Alongside the IBM models, we also evaluated models from other providers such as DeepSeek and Meta. In some languages, the differences were small, but in all of them there was at least one model from the Granite family that achieved the highest score. 
The Granite 4 Micro (3B) and Granite 4 h-small (30B) models were the standouts, with results that were close to, and in some cases above, those of models recognized as code specialists.&lt;/p>&lt;hr>&lt;h2 id="next-steps">Next Steps&lt;/h2>&lt;ul>&lt;li>Run the same Granite models on &lt;strong>LiveCodeBench&lt;/strong>, a broader benchmark that goes beyond &lt;strong>code generation&lt;/strong>, also evaluating &lt;strong>code execution&lt;/strong> and &lt;strong>test output prediction&lt;/strong>.&lt;/li>&lt;li>&lt;strong>Fine-tune Granite 4.0 Micro (3B) using InstructLab&lt;/strong> and observe the impact of this adaptation on the model’s performance in &lt;strong>HumanEvalX&lt;/strong>, comparing results before and after the adjustment.&lt;/li>&lt;/ul></description></item><item><title>Computação@UFCG Leads Brazil's Contributions to the HELM-Stanford Framework in Partnership with IBM</title><link>https://llm-pt-ibm.github.io/en/posts/contribuicao_helm/</link><pubDate>Wed, 09 Jul 2025 00:00:00 +0000</pubDate><guid>https://llm-pt-ibm.github.io/en/posts/contribuicao_helm/</guid><description>&lt;p>&lt;strong>Collaboration between UFCG’s Computer Science department and IBM makes the university the top Brazilian contributor to the &lt;a href="https://github.com/stanford-crfm/helm" rel="external">&lt;span class="link-personalizado">HELM-Stanford&lt;/span>&lt;/a> evaluation framework in 2025.&lt;/strong>&lt;/p>&lt;p>HELM-Stanford is one of the world’s leading frameworks for evaluating language models, measuring accuracy, robustness, and fairness. Being the top Brazilian contributor — through the partnership between Computação@UFCG and IBM — highlights Brazil’s leading role in developing fairer, safer, and more representative metrics for LLMs, especially in multilingual and culturally diverse contexts.&lt;/p>&lt;p>The partnership between Computação@UFCG and IBM resulted in 15 significant contributions to HELM-Stanford in 2025.
These contributions include adding Portuguese-language benchmarks, fixing bugs, improving source code, and including new evaluation sets, expanding the framework’s linguistic diversity and robustness.&lt;/p>&lt;p>The project, coordinated by Professor João Brunet with participation from Professors Fábio Morais and Leandro Balby, features a multidisciplinary team dedicated to LLM evaluation. The team also includes one professor from IFPB, three graduate students, three undergraduate students, and a professional with software development experience. IBM, as a project partner, has also assigned professionals to work directly on the collaboration. Together, the group has made meaningful contributions to advancing HELM-Stanford, with a focus on including the Portuguese language and continuously improving the framework.&lt;/p>&lt;figure>&lt;img src="https://llm-pt-ibm.github.io/images/carvalheira.jpeg" alt="Multidisciplinary project team"/>&lt;figcaption> &lt;p>Multidisciplinary project team&lt;/p> &lt;/figcaption>&lt;/figure></description></item><item><title>LLMs Inference API on IBM Power9 Server</title><link>https://llm-pt-ibm.github.io/en/posts/tutorial_power9_pt4_en/</link><pubDate>Thu, 03 Jul 2025 00:00:00 +0000</pubDate><guid>https://llm-pt-ibm.github.io/en/posts/tutorial_power9_pt4_en/</guid><description>&lt;h2 id="background">Background&lt;/h2>&lt;p>This is the fourth and final post in a tutorial series that aims to show step by step how to build an LLM API on a Power9 server, from operating system setup to remote inference execution.
We already configured the operating system, NVIDIA drivers, CUDA, and cuDNN in the &lt;a href="https://llm-pt-ibm.github.io/en/posts/tutorial_power9_pt1_en/">&lt;span class="link-personalizado">first post&lt;/span>&lt;/a>, installed Conda and PyTorch in the &lt;a href="https://llm-pt-ibm.github.io/en/posts/tutorial_power9_pt2_en/">&lt;span class="link-personalizado">second post&lt;/span>&lt;/a>, and built the API in the &lt;a href="https://llm-pt-ibm.github.io/en/posts/tutorial_power9_pt3_en/">&lt;span class="link-personalizado">third post&lt;/span>&lt;/a>. In this stage, we will present the built API and show how to make requests.&lt;/p>&lt;h2 id="tldr">TL;DR&lt;/h2>&lt;ul>&lt;li>This post introduces the built LLM inference API and how to use it.&lt;/li>&lt;li>We will show how to make requests using Python and curl.&lt;/li>&lt;/ul>&lt;h2 id="introducing-the-api">Introducing the API&lt;/h2>&lt;p>The API, built with FastAPI, supports loading specific models, keeping them in GPU memory for successive calls, and generating text from prompts sent via HTTP requests. It also provides API Key access control, memory management (loading and unloading models), support for multiple GPUs with automatic sharding, and endpoints for status queries. The goal is to provide a robust, production-ready service optimized for intensive use, ensuring fast inferences and easy integration with external applications.&lt;/p>&lt;h4 id="architecture-overview">Architecture Overview&lt;/h4>&lt;p>The API exposes LLMs via FastAPI with REST endpoints. The ModelManager handles loading, unloading, and model inference, keeping models in GPU memory for fast calls. Authentication is enforced via API Key. The architecture supports multiple GPUs with automatic sharding to optimize memory usage and performance.
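The ModelManager lifecycle just described (load once, serve successive calls from memory, unload to free it) boils down to a single-slot cache. A hypothetical sketch with an injected loader function standing in for the actual Transformers model loading (names and structure are illustrative, not the real implementation):

```python
class ModelManager:
    """Single-slot model cache: keep one model resident, reuse it across
    requests, release it on demand. Sketch of the pattern, not the real code."""

    def __init__(self, loader):
        self.loader = loader      # the real API would wrap AutoModelForCausalLM here
        self.model = None
        self.model_name = None

    def load(self, model_name):
        if self.model_name == model_name:
            return self.model     # already resident: skip the expensive reload
        self.unload()             # free the previous model first
        self.model = self.loader(model_name)
        self.model_name = model_name
        return self.model

    def unload(self):
        self.model = None         # the real API would also empty the CUDA cache
        self.model_name = None

# The loader runs only on the first request for a given model.
calls = []
manager = ModelManager(loader=lambda name: calls.append(name) or f"<{name}>")
manager.load("ibm-granite/granite-3.3-8b-instruct")
manager.load("ibm-granite/granite-3.3-8b-instruct")  # served from cache
print(len(calls))  # 1
```

Keeping the model resident is what makes successive `/generate` calls fast: only the first request pays the load cost.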
Models are sourced from Hugging Face and use the Transformers library to perform inferences.&lt;/p>&lt;figure>&lt;img src="https://llm-pt-ibm.github.io/images/arquitetura_api_llm_01_en.png" alt="Architecture diagram of the LLM inference API"/>&lt;figcaption> &lt;p>Architecture Diagram&lt;/p> &lt;/figcaption>&lt;/figure>&lt;h4 id="main-features">Main Features&lt;/h4>&lt;ul>&lt;li>&lt;p>&lt;strong>Load Models&lt;/strong>&lt;/p>&lt;ul>&lt;li>&lt;code>/load_model&lt;/code>&lt;/li>&lt;li>Loads a model from the Hugging Face Hub&lt;/li>&lt;li>Performs sharding across GPUs&lt;/li>&lt;li>Supports Hugging Face Token&lt;/li>&lt;/ul>&lt;/li>&lt;li>&lt;p>&lt;strong>Generate Text&lt;/strong>&lt;/p>&lt;ul>&lt;li>&lt;code>/generate&lt;/code>&lt;/li>&lt;li>Accepts prompt, max_tokens, model name, temperature, and top_p&lt;/li>&lt;li>Uses an already loaded model or loads a new one&lt;/li>&lt;li>Returns result in JSON&lt;/li>&lt;/ul>&lt;/li>&lt;li>&lt;p>&lt;strong>Management&lt;/strong>&lt;/p>&lt;ul>&lt;li>&lt;code>/status&lt;/code>: Checks the loaded model and device (CPU/GPU)&lt;/li>&lt;li>&lt;code>/unload_model&lt;/code>: Frees GPU and memory&lt;/li>&lt;li>&lt;code>/generate_apikey&lt;/code>: Creates API keys from LDAP user&lt;/li>&lt;/ul>&lt;/li>&lt;/ul>&lt;h4 id="usage-flow">Usage Flow&lt;/h4>&lt;figure>&lt;img src="https://llm-pt-ibm.github.io/images/arquitetura_api_llm_02_en.png" alt="Usage flow diagram of the LLM inference API"/>&lt;figcaption> &lt;p>Usage flow diagram&lt;/p> &lt;/figcaption>&lt;/figure>&lt;h4 id="inputs-and-endpoints">Inputs and Endpoints&lt;/h4>&lt;p>The table below describes the API endpoints, required inputs, and responses.&lt;/p>&lt;style>table { border-collapse: collapse; width: 100%;}th { background-color: #cccccc; text-align: center; padding: 8px; border: 1px solid #b3b3b3;}td { padding: 8px; border: 1px solid #ccc; text-align: left;}td.center { text-align: center;}caption { caption-side: bottom}&lt;/style>&lt;table> &lt;caption>Inputs and endpoints table&lt;/caption> &lt;thead> &lt;tr>
&lt;th>Endpoints&lt;/th> &lt;th>Method&lt;/th> &lt;th>API Key&lt;/th> &lt;th>Input (Body/Query)&lt;/th> &lt;th>Response&lt;/th> &lt;/tr> &lt;/thead> &lt;tbody> &lt;tr> &lt;td>&lt;code>/generate_apikey&lt;/code>&lt;/td> &lt;td class="center">POST&lt;/td> &lt;td class="center">❌&lt;/td> &lt;td class="center">{username}&lt;/td> &lt;td class="center">API Key&lt;/td> &lt;/tr> &lt;tr> &lt;td>&lt;code>/load_model&lt;/code>&lt;/td> &lt;td class="center">POST&lt;/td> &lt;td class="center">✅&lt;/td> &lt;td class="center">{model_name &lt;br> hf_token (optional) &lt;br> device (optional)}&lt;/td> &lt;td class="center">None, just loads the model&lt;/td> &lt;/tr> &lt;tr> &lt;td>&lt;code>/generate&lt;/code>&lt;/td> &lt;td class="center">POST&lt;/td> &lt;td class="center">✅&lt;/td> &lt;td class="center">{model_name &lt;br> prompt &lt;br> hf_token (optional) &lt;br> max_tokens (optional) &lt;br> temperature (optional) &lt;br> top_p (optional)}&lt;/td> &lt;td class="center">Text generated by the model&lt;/td> &lt;/tr> &lt;tr> &lt;td>&lt;code>/status&lt;/code>&lt;/td> &lt;td class="center">GET&lt;/td> &lt;td class="center">✅&lt;/td> &lt;td class="center">None&lt;/td> &lt;td class="center">Model status and the device it is loaded on&lt;/td> &lt;/tr> &lt;tr> &lt;td>&lt;code>/unload_model&lt;/code>&lt;/td> &lt;td class="center">POST&lt;/td> &lt;td class="center">✅&lt;/td> &lt;td class="center">None&lt;/td> &lt;td class="center">None, just unloads the model&lt;/td> &lt;/tr> &lt;/tbody>&lt;/table>&lt;h2 id="how-to-use-the-api-with-python">How to Use the API with Python&lt;/h2>&lt;h4 id="generate-api-key">Generate API Key&lt;/h4>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">
1&lt;/span>&lt;span>&lt;span style="color:#f92672">import&lt;/span> requests&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 2&lt;/span>&lt;span>&lt;span style="color:#f92672">import&lt;/span> json&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 3&lt;/span>&lt;span>&lt;span style="color:#f92672">import&lt;/span> os&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 4&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 5&lt;/span>&lt;span>url &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&amp;#34;http://&amp;lt;power9_ip_server&amp;gt;:8000/&amp;#34;&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 6&lt;/span>&lt;span>username &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#f92672">&amp;lt;&lt;/span>ldap_user&lt;span style="color:#f92672">&amp;gt;&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 7&lt;/span>&lt;span>hf_token &lt;span style="color:#f92672">=&lt;/span> os&lt;span style="color:#f92672">.&lt;/span>getenv(&lt;span style="color:#e6db74">&amp;#34;HUGGINGFACE_TOKEN&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span 
style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 8&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 9&lt;/span>&lt;span>response &lt;span style="color:#f92672">=&lt;/span> requests&lt;span style="color:#f92672">.&lt;/span>post(&lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">{&lt;/span>url&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74">/generate_apikey&amp;#34;&lt;/span>, json&lt;span style="color:#f92672">=&lt;/span>{&lt;span style="color:#e6db74">&amp;#34;username&amp;#34;&lt;/span>: username})&lt;span style="color:#f92672">.&lt;/span>content&lt;span style="color:#f92672">.&lt;/span>decode()&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">10&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">11&lt;/span>&lt;span>api_key &lt;span style="color:#f92672">=&lt;/span> json&lt;span style="color:#f92672">.&lt;/span>loads(response)&lt;span style="color:#f92672">.&lt;/span>get(&lt;span style="color:#e6db74">&amp;#34;api_key&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ul>&lt;li>It is important that the Hugging Face Token is set as an environment variable in the location where the inference will run.&lt;/li>&lt;li>&lt;code>api_key&lt;/code> will be the return value of the called function.&lt;/li>&lt;/ul>&lt;h4 id="load-model">Load Model&lt;/h4>&lt;p>First, we need to create a header containing the API Key returned from the code above and the payload 
with &lt;code>model_name&lt;/code> and the Hugging Face token &lt;code>hf_token&lt;/code>. After that, we can send the request with these two pieces of information.&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1&lt;/span>&lt;span>headers &lt;span style="color:#f92672">=&lt;/span> {&lt;span style="color:#e6db74">&amp;#34;Content-Type&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;application/json&amp;#34;&lt;/span>,&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">2&lt;/span>&lt;span>&lt;span style="color:#e6db74">&amp;#34;x-api-key&amp;#34;&lt;/span>: api_key}&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">3&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">4&lt;/span>&lt;span>payload &lt;span style="color:#f92672">=&lt;/span> {&lt;span style="color:#e6db74">&amp;#34;model_name&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;ibm-granite/granite-3.3-8b-instruct&amp;#34;&lt;/span>,&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">5&lt;/span>&lt;span> &lt;span style="color:#e6db74">&amp;#34;hf_token&amp;#34;&lt;/span>: hf_token}&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span 
style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">6&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">7&lt;/span>&lt;span>resp &lt;span style="color:#f92672">=&lt;/span> requests&lt;span style="color:#f92672">.&lt;/span>post(&lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">{&lt;/span>url&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74">/load_model&amp;#34;&lt;/span>, headers&lt;span style="color:#f92672">=&lt;/span>headers, json&lt;span style="color:#f92672">=&lt;/span>payload)&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h4 id="generate-text">Generate Text&lt;/h4>&lt;p>Now we need to create a new payload with the necessary information to generate text with an LLM, which includes: &lt;code>prompt&lt;/code>, &lt;code>model_name&lt;/code>, and &lt;code>hf_token&lt;/code>.&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1&lt;/span>&lt;span>payload &lt;span style="color:#f92672">=&lt;/span> {&lt;span style="color:#e6db74">&amp;#34;prompt&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;Hello, tell me a little about the Federal University of Campina Grande (UFCG)&amp;#34;&lt;/span>,&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">2&lt;/span>&lt;span> &lt;span 
style="color:#e6db74">&amp;#34;model_name&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;ibm-granite/granite-3.3-8b-instruct&amp;#34;&lt;/span>,&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">3&lt;/span>&lt;span> &lt;span style="color:#e6db74">&amp;#34;hf_token&amp;#34;&lt;/span>: hf_token}&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">4&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">5&lt;/span>&lt;span>resp &lt;span style="color:#f92672">=&lt;/span> requests&lt;span style="color:#f92672">.&lt;/span>post(&lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">{&lt;/span>url&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74">/generate&amp;#34;&lt;/span>, headers&lt;span style="color:#f92672">=&lt;/span>headers, json&lt;span style="color:#f92672">=&lt;/span>payload)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">6&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">7&lt;/span>&lt;span>resp &lt;span style="color:#f92672">=&lt;/span> json&lt;span style="color:#f92672">.&lt;/span>loads(resp&lt;span style="color:#f92672">.&lt;/span>content&lt;span style="color:#f92672">.&lt;/span>decode())&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h4 
id="check-status-and-unload-the-model">Check status and unload the model&lt;/h4>&lt;p>To check the status and unload the model, we don&amp;rsquo;t need to send anything in the payload—just the header with the API key:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1&lt;/span>&lt;span>requests&lt;span style="color:#f92672">.&lt;/span>get(&lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">{&lt;/span>url&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74">/status&amp;#34;&lt;/span>, headers&lt;span style="color:#f92672">=&lt;/span>headers)&lt;span style="color:#f92672">.&lt;/span>content&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1&lt;/span>&lt;span>resp &lt;span style="color:#f92672">=&lt;/span> requests&lt;span style="color:#f92672">.&lt;/span>post(&lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">{&lt;/span>url&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74">/unload_model&amp;#34;&lt;/span>, headers&lt;span style="color:#f92672">=&lt;/span>headers)&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="how-to-use-the-api-with-curl-in-cli">How to use the API with curl in CLI&lt;/h2>&lt;h4 id="generate-api-key-1">Generate API 
Key&lt;/h4>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>curl -X POST &lt;span style="color:#e6db74">&amp;#34;http://&amp;lt;power9_ip_server&amp;gt;:8000/generate_apikey&amp;#34;&lt;/span> &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> -H &lt;span style="color:#e6db74">&amp;#34;Content-Type: application/json&amp;#34;&lt;/span> &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> -d &lt;span style="color:#e6db74">&amp;#39;{&amp;#34;username&amp;#34;: &amp;lt;ldap_user&amp;gt;}&amp;#39;&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ul>&lt;li>It is important that the Hugging Face Token is set as an environment variable in the location where the inference will run.&lt;/li>&lt;li>The user in the &lt;code>username&lt;/code> field must be enclosed in double quotation marks (&amp;quot;&amp;quot;).&lt;/li>&lt;li>After running the request above, the returned API key should be saved as an environment variable to make future executions easier.
To save it, copy the returned API key and run the command:&lt;/li>&lt;/ul>&lt;pre tabindex="0">&lt;code>export API_KEY=&amp;lt;returned_api_key&amp;gt;&lt;/code>&lt;/pre>&lt;h4 id="load-model-1">Load Model&lt;/h4>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>curl -X POST &lt;span style="color:#e6db74">&amp;#34;http://&amp;lt;power9_ip_server&amp;gt;:8000/load_model&amp;#34;&lt;/span> &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> -H &lt;span style="color:#e6db74">&amp;#34;Content-Type: application/json&amp;#34;&lt;/span> &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> -H &lt;span style="color:#e6db74">&amp;#34;x-api-key: &lt;/span>$API_KEY&lt;span style="color:#e6db74">&amp;#34;&lt;/span> &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> -d &lt;span style="color:#e6db74">&amp;#39;{&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> &amp;#34;model_name&amp;#34;:&amp;#34;ibm-granite/granite-3.3-8b-instruct&amp;#34;,&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> &amp;#34;hf_token&amp;#34;:&amp;#34;&amp;#39;&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>$HUGGINGFACE_TOKEN&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">&amp;#39;&amp;#34;&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> }&amp;#39;&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h4 id="generate-text-1">Generate Text&lt;/h4>&lt;div
class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>curl -X POST &lt;span style="color:#e6db74">&amp;#34;http://&amp;lt;power9_ip_server&amp;gt;:8000/generate&amp;#34;&lt;/span> &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> -H &lt;span style="color:#e6db74">&amp;#34;Content-Type: application/json&amp;#34;&lt;/span> &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> -H &lt;span style="color:#e6db74">&amp;#34;x-api-key: &lt;/span>$API_KEY&lt;span style="color:#e6db74">&amp;#34;&lt;/span> &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> -d &lt;span style="color:#e6db74">&amp;#39;{&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> &amp;#34;model_name&amp;#34;: &amp;#34;ibm-granite/granite-3.3-8b-instruct&amp;#34;,&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> &amp;#34;prompt&amp;#34;:&amp;#34;Hello, tell me a little about the Federal University of Campina Grande (UFCG)&amp;#34;,&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> &amp;#34;hf_token&amp;#34;: &amp;#34;&amp;#39;&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>$HUGGINGFACE_TOKEN&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">&amp;#39;&amp;#34;,&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> &amp;#34;max_tokens&amp;#34;:50&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span
style="color:#e6db74"> }&amp;#39;&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h4 id="check-status-and-unload-the-model-1">Check status and unload the model&lt;/h4>&lt;p>To check the status and unload the model, we don&amp;rsquo;t need to send anything in the payload—just the header with the API key:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>curl -X GET &lt;span style="color:#e6db74">&amp;#34;http://&amp;lt;power9_ip_server&amp;gt;:8000/status&amp;#34;&lt;/span> &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> -H &lt;span style="color:#e6db74">&amp;#34;Content-Type: application/json&amp;#34;&lt;/span> &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> -H &lt;span style="color:#e6db74">&amp;#34;x-api-key: &lt;/span>$API_KEY&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>curl -X POST &lt;span style="color:#e6db74">&amp;#34;http://&amp;lt;power9_ip_server&amp;gt;:8000/unload_model&amp;#34;&lt;/span> &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> -H &lt;span style="color:#e6db74">&amp;#34;Content-Type: application/json&amp;#34;&lt;/span> &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> -H &lt;span 
style="color:#e6db74">&amp;#34;x-api-key: &lt;/span>$API_KEY&lt;span style="color:#e6db74">&amp;#34;&lt;/span> &lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>We hope this series has helped clarify the full development and deployment process. The LLM-IBM-UFCG team is available for questions or suggestions about future improvements.&lt;/p></description></item><item><title>Building an API for LLM inferences on IBM Power9 servers</title><link>https://llm-pt-ibm.github.io/en/posts/tutorial_power9_pt3_en/</link><pubDate>Wed, 02 Jul 2025 00:00:00 +0000</pubDate><guid>https://llm-pt-ibm.github.io/en/posts/tutorial_power9_pt3_en/</guid><description>&lt;h2 id="background">Background&lt;/h2>&lt;p>This is the third post in a tutorial series designed to show step by step how to build a LLM API on a Power9 server, from operating system setup to remote inference execution. We already configured the operating system, NVIDIA drivers, CUDA, and cuDNN in the &lt;a href="https://llm-pt-ibm.github.io/en/posts/tutorial_power9_pt1_en/">&lt;span class="link-personalizado">first post&lt;/span>&lt;/a>, and installed Conda and PyTorch in the &lt;a href="https://llm-pt-ibm.github.io/en/posts/tutorial_power9_pt2_en/">&lt;span class="link-personalizado">second post&lt;/span>&lt;/a>. In this stage, we will build the API using FastAPI and the Transformers library, downloading models from Hugging Face and running the web server with uvicorn.&lt;/p>&lt;p>The implemented API will support generating API keys, loading models, performing inferences, checking status, and unloading models.&lt;/p>&lt;p>&lt;strong>FastAPI&lt;/strong>: a modern web framework for building APIs with Python 3.8+, based on static typing and async programming. It is designed to be fast, easy to use, and robust, making API development more efficient.&lt;/p>&lt;p>&lt;strong>Transformers&lt;/strong>: an open-source library developed by Hugging Face. 
It offers easy and efficient access to a wide collection of state-of-the-art pretrained models for Natural Language Processing (NLP), computer vision, and audio.&lt;/p>&lt;p>&lt;strong>Hugging Face&lt;/strong>: a platform focused on artificial intelligence, known for hosting models for NLP and other tasks. The Hugging Face Hub is a collaborative repository where developers and researchers can share, version, and download ready-to-use models, making access and integration easier.&lt;/p>&lt;p>&lt;strong>Uvicorn&lt;/strong>: a high-performance ASGI (Asynchronous Server Gateway Interface) web server for asynchronous Python applications.&lt;/p>&lt;h2 id="tldr">TL;DR&lt;/h2>&lt;ul>&lt;li>This post provides a step-by-step guide to implementing an API that performs LLM inferences.&lt;/li>&lt;li>We will use FastAPI and Transformers to develop this API and Hugging Face to download the models.&lt;/li>&lt;/ul>&lt;h2 id="environment-setup">Environment Setup&lt;/h2>&lt;h4 id="directory-structure">Directory Structure&lt;/h4>&lt;p>Start by creating the basic project structure:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-txt" data-lang="txt">&lt;span style="display:flex;">&lt;span>model_api/&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>├── requirements.txt&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>├── app/&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ ├── __init__.py&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ ├── main.py&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ ├── schemas.py&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ ├── auth.py&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ ├── model_manager.py&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ ├── utils.py&lt;/span>&lt;/span>&lt;span
style="display:flex;">&lt;span>│ └── apikey_store.json&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>└── README.md (optional)&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h4 id="requirementstxt-file">&lt;code>requirements.txt&lt;/code> File&lt;/h4>&lt;p>We will use FastAPI and Transformers to build the API. Additionally, we will use uvicorn to run the server, pydantic for input data validation, and torch, which we installed in the &lt;a href="https://llm-pt-ibm.github.io/en/posts/tutorial_power9_pt2_en/">previous tutorial&lt;/a>.&lt;/p>&lt;p>First, we&amp;rsquo;ll install the required libraries and then populate the &lt;code>requirements.txt&lt;/code> file. Remember to activate your &lt;code>conda&lt;/code> environment if you created one, to ensure proper use of &lt;code>pytorch&lt;/code>.&lt;/p>&lt;pre tabindex="0">&lt;code>conda activate llm_apipip install fastapi uvicorn transformers&lt;/code>&lt;/pre>&lt;p>The &lt;code>requirements.txt&lt;/code> file will look like this:&lt;/p>&lt;p>&lt;strong>requirements.txt&lt;/strong>&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-txt" data-lang="txt">&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1&lt;/span>&lt;span>fastapi&amp;gt;=0.104.0&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">2&lt;/span>&lt;span>uvicorn&amp;gt;=0.24.0&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">3&lt;/span>&lt;span>torch&amp;gt;=2.0.0&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span 
style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">4&lt;/span>&lt;span>transformers&amp;gt;=4.35.0&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">5&lt;/span>&lt;span>pydantic&amp;gt;=2.0.0&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h4 id="api-key-storage-file">API Key Storage File&lt;/h4>&lt;p>The &lt;code>apikey_store.json&lt;/code> file will store the generated API keys. We will start with it empty, containing only &lt;code>{}&lt;/code>.&lt;/p>&lt;p>&lt;strong>apikey_store.json&lt;/strong>&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1&lt;/span>&lt;span>{}&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="schemas-and-data-validation">Schemas and Data Validation&lt;/h2>&lt;p>Schemas are essential for validating the API&amp;rsquo;s input and output data. They ensure data is in the correct format and enable automatic documentation generation.&lt;/p>&lt;p>We will create the &lt;code>app/schemas.py&lt;/code> file containing all the data models. 
We will define four models: &lt;code>GenerateRequest&lt;/code>, &lt;code>LoadModelRequest&lt;/code>, &lt;code>ApiKeyResponse&lt;/code>, and &lt;code>LDAPUserRequest&lt;/code>.&lt;/p>&lt;p>&lt;strong>schemas.py&lt;/strong>&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 1&lt;/span>&lt;span>&lt;span style="color:#f92672">from&lt;/span> pydantic &lt;span style="color:#f92672">import&lt;/span> BaseModel, Field&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 2&lt;/span>&lt;span>&lt;span style="color:#f92672">from&lt;/span> typing &lt;span style="color:#f92672">import&lt;/span> Optional&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 3&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 4&lt;/span>&lt;span>&lt;span style="color:#66d9ef">class&lt;/span> &lt;span style="color:#a6e22e">GenerateRequest&lt;/span>(BaseModel):&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 5&lt;/span>&lt;span> model_name: str &lt;span style="color:#f92672">=&lt;/span> Field(&lt;span style="color:#f92672">...&lt;/span>, description&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;The name of the model to use for 
generation.&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 6&lt;/span>&lt;span> prompt: str &lt;span style="color:#f92672">=&lt;/span> Field(&lt;span style="color:#f92672">...&lt;/span>, description&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;The input text to generate a response for.&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 7&lt;/span>&lt;span> max_tokens: Optional[int] &lt;span style="color:#f92672">=&lt;/span> Field(&lt;span style="color:#ae81ff">300&lt;/span>, description&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;The maximum length of the generated response.&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 8&lt;/span>&lt;span> temperature: Optional[float] &lt;span style="color:#f92672">=&lt;/span> Field(&lt;span style="color:#ae81ff">1.0&lt;/span>, description&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;The sampling temperature for generation.&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 9&lt;/span>&lt;span> top_p: Optional[float] &lt;span style="color:#f92672">=&lt;/span> Field(&lt;span style="color:#ae81ff">1.0&lt;/span>, description&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;The cumulative probability for nucleus sampling.&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span 
style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">10&lt;/span>&lt;span> hf_token: Optional[str] &lt;span style="color:#f92672">=&lt;/span> Field(&lt;span style="color:#66d9ef">None&lt;/span>, description&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;The Hugging Face tokenizer to use, if applicable.&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">11&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">12&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">13&lt;/span>&lt;span>&lt;span style="color:#66d9ef">class&lt;/span> &lt;span style="color:#a6e22e">LoadModelRequest&lt;/span>(BaseModel):&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">14&lt;/span>&lt;span> model_name: str &lt;span style="color:#f92672">=&lt;/span> Field(&lt;span style="color:#f92672">...&lt;/span>, description&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;The name of the model to load.&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">15&lt;/span>&lt;span> device: Optional[str] &lt;span style="color:#f92672">=&lt;/span> Field(&lt;span style="color:#e6db74">&amp;#34;cuda&amp;#34;&lt;/span>, 
description&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;The device to load the model on (e.g., &amp;#39;cpu&amp;#39;, &amp;#39;cuda&amp;#39;).&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">16&lt;/span>&lt;span>    hf_token: Optional[str] &lt;span style="color:#f92672">=&lt;/span> Field(&lt;span style="color:#66d9ef">None&lt;/span>, description&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;The Hugging Face access token to use, if applicable.&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">17&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">18&lt;/span>&lt;span>&lt;span style="color:#66d9ef">class&lt;/span> &lt;span style="color:#a6e22e">ApiKeyResponse&lt;/span>(BaseModel):&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">19&lt;/span>&lt;span>    api_key: str &lt;span style="color:#f92672">=&lt;/span> Field(&lt;span style="color:#f92672">...&lt;/span>, description&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;The API key for accessing the model API.&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">20&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span 
style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">21&lt;/span>&lt;span>&lt;span style="color:#66d9ef">class&lt;/span> &lt;span style="color:#a6e22e">LDAPUserRequest&lt;/span>(BaseModel):&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">22&lt;/span>&lt;span>    username: str &lt;span style="color:#f92672">=&lt;/span> Field(&lt;span style="color:#f92672">...&lt;/span>, description&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;The username for LDAP authentication.&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ul>&lt;li>All classes inherit from &lt;code>pydantic&lt;/code>&amp;rsquo;s &lt;code>BaseModel&lt;/code>, gaining validation, serialization, and automatic documentation features.&lt;/li>&lt;li>The &lt;code>Field(...)&lt;/code> declaration defines a required field with no default value.&lt;/li>&lt;li>The &lt;code>Field(value)&lt;/code> declaration defines an optional field with &lt;code>value&lt;/code> as its default.&lt;/li>&lt;li>The &lt;code>Optional[type]&lt;/code> annotation indicates the field is optional but must be of type &lt;code>type&lt;/code> if provided.&lt;/li>&lt;/ul>&lt;p>With the schemas defined, let&amp;rsquo;s create the file responsible for API Key authentication.&lt;/p>&lt;h2 id="authentication-and-api-keys">Authentication and API Keys&lt;/h2>&lt;p>The authentication system protects your API by ensuring that only authorized users can access the endpoints. 
We will implement a mechanism based on API Keys.&lt;/p>&lt;p>Let&amp;rsquo;s create the &lt;code>app/auth.py&lt;/code> file with all the authentication functionalities.&lt;/p>&lt;p>&lt;strong>auth.py&lt;/strong>&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 1&lt;/span>&lt;span>&lt;span style="color:#f92672">import&lt;/span> secrets &lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 2&lt;/span>&lt;span>&lt;span style="color:#f92672">import&lt;/span> json&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 3&lt;/span>&lt;span>&lt;span style="color:#f92672">from&lt;/span> fastapi &lt;span style="color:#f92672">import&lt;/span> HTTPException, Request&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 4&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 5&lt;/span>&lt;span>APIKEY_STORE_FILE &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&amp;#34;app/apikey_store.json&amp;#34;&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 
6&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 7&lt;/span>&lt;span>&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">load_apikeys&lt;/span>():&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 8&lt;/span>&lt;span> &lt;span style="color:#66d9ef">try&lt;/span>:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 9&lt;/span>&lt;span> &lt;span style="color:#66d9ef">with&lt;/span> open(APIKEY_STORE_FILE, &lt;span style="color:#e6db74">&amp;#34;r&amp;#34;&lt;/span>) &lt;span style="color:#66d9ef">as&lt;/span> f:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">10&lt;/span>&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> json&lt;span style="color:#f92672">.&lt;/span>load(f)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">11&lt;/span>&lt;span> &lt;span style="color:#66d9ef">except&lt;/span> &lt;span style="color:#a6e22e">FileNotFoundError&lt;/span>:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">12&lt;/span>&lt;span> &lt;span style="color:#66d9ef">raise&lt;/span> HTTPException(&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 
0.4em 0 0.4em;color:#7f7f7f">13&lt;/span>&lt;span> status_code&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">404&lt;/span>,&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">14&lt;/span>&lt;span> detail&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;API keys file not found: &lt;/span>&lt;span style="color:#e6db74">{&lt;/span>APIKEY_STORE_FILE&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">15&lt;/span>&lt;span> &lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">16&lt;/span>&lt;span>&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">save_apikeys&lt;/span>(keys: dict):&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">17&lt;/span>&lt;span> &lt;span style="color:#66d9ef">with&lt;/span> open(APIKEY_STORE_FILE, &lt;span style="color:#e6db74">&amp;#34;w&amp;#34;&lt;/span>) &lt;span style="color:#66d9ef">as&lt;/span> f:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">18&lt;/span>&lt;span> json&lt;span style="color:#f92672">.&lt;/span>dump(keys, f, indent&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">4&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span 
style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">19&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">20&lt;/span>&lt;span>&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">generate_apikey&lt;/span>(user:str) &lt;span style="color:#f92672">-&amp;gt;&lt;/span> str:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">21&lt;/span>&lt;span> key &lt;span style="color:#f92672">=&lt;/span> secrets&lt;span style="color:#f92672">.&lt;/span>token_hex(&lt;span style="color:#ae81ff">32&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">22&lt;/span>&lt;span> keys &lt;span style="color:#f92672">=&lt;/span> load_apikeys()&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">23&lt;/span>&lt;span> keys[user] &lt;span style="color:#f92672">=&lt;/span> key&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">24&lt;/span>&lt;span> save_apikeys(keys)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">25&lt;/span>&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> key&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span 
style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">26&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">27&lt;/span>&lt;span>&lt;span style="color:#66d9ef">async&lt;/span> &lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">verify_apikey&lt;/span>(request: Request) &lt;span style="color:#f92672">-&amp;gt;&lt;/span> bool:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">28&lt;/span>&lt;span> apikey &lt;span style="color:#f92672">=&lt;/span> request&lt;span style="color:#f92672">.&lt;/span>headers&lt;span style="color:#f92672">.&lt;/span>get(&lt;span style="color:#e6db74">&amp;#34;x-API-Key&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">29&lt;/span>&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> &lt;span style="color:#f92672">not&lt;/span> apikey:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">30&lt;/span>&lt;span> &lt;span style="color:#66d9ef">raise&lt;/span> HTTPException(&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">31&lt;/span>&lt;span> status_code&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">401&lt;/span>,&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span 
style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">32&lt;/span>&lt;span>            detail&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;API key not provided.&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">33&lt;/span>&lt;span>    &lt;span style="color:#66d9ef">try&lt;/span>:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">34&lt;/span>&lt;span>        keys &lt;span style="color:#f92672">=&lt;/span> load_apikeys()&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">35&lt;/span>&lt;span>        &lt;span style="color:#66d9ef">if&lt;/span> apikey &lt;span style="color:#f92672">in&lt;/span> keys&lt;span style="color:#f92672">.&lt;/span>values():&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">36&lt;/span>&lt;span>            &lt;span style="color:#66d9ef">return&lt;/span> &lt;span style="color:#66d9ef">True&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">37&lt;/span>&lt;span>        &lt;span style="color:#66d9ef">raise&lt;/span> HTTPException(status_code&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">403&lt;/span>, detail&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;Invalid API Key&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">38&lt;/span>&lt;span>    &lt;span style="color:#66d9ef">except&lt;/span> json&lt;span style="color:#f92672">.&lt;/span>JSONDecodeError:&lt;/span>&lt;/span>&lt;span 
style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">39&lt;/span>&lt;span> &lt;span style="color:#66d9ef">raise&lt;/span> HTTPException(&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">40&lt;/span>&lt;span> status_code&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">403&lt;/span>,&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">41&lt;/span>&lt;span> detail&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;Invalid API Key&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ul>&lt;li>The &lt;code>load_apikeys&lt;/code> function loads the information stored in the &lt;code>app/apikey_store.json&lt;/code> file.&lt;/li>&lt;li>&lt;code>save_apikeys&lt;/code> is responsible for saving the content in JSON format.&lt;/li>&lt;li>The &lt;code>generate_apikey&lt;/code> function creates a key for a user and adds it to the dictionary using the provided username as the key.&lt;/li>&lt;li>&lt;code>verify_apikey&lt;/code> will be called whenever a request arrives, to perform validation.&lt;/li>&lt;/ul>&lt;h2 id="model-and-gpu-manager">Model and GPU Manager&lt;/h2>&lt;p>The &lt;code>app/model_manager.py&lt;/code> is the core of the API, responsible for loading, managing, and running llm. 
It optimizes GPU/CPU usage and ensures efficient text generation.&lt;/p>&lt;p>&lt;strong>model_manager.py&lt;/strong>&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 1&lt;/span>&lt;span>&lt;span style="color:#f92672">import&lt;/span> torch &lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 2&lt;/span>&lt;span>&lt;span style="color:#f92672">from&lt;/span> transformers &lt;span style="color:#f92672">import&lt;/span> AutoTokenizer, AutoModelForCausalLM&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 3&lt;/span>&lt;span>&lt;span style="color:#f92672">from&lt;/span> fastapi &lt;span style="color:#f92672">import&lt;/span> HTTPException&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 4&lt;/span>&lt;span>&lt;span style="color:#f92672">import&lt;/span> gc&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 5&lt;/span>&lt;span>&lt;span style="color:#f92672">from&lt;/span> .utils &lt;span style="color:#f92672">import&lt;/span> is_model_on_gpu&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 
6&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 7&lt;/span>&lt;span>DEVICE &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&amp;#34;cuda&amp;#34;&lt;/span> &lt;span style="color:#66d9ef">if&lt;/span> torch&lt;span style="color:#f92672">.&lt;/span>cuda&lt;span style="color:#f92672">.&lt;/span>is_available() &lt;span style="color:#66d9ef">else&lt;/span> &lt;span style="color:#e6db74">&amp;#34;cpu&amp;#34;&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 8&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 9&lt;/span>&lt;span>&lt;span style="color:#66d9ef">class&lt;/span> &lt;span style="color:#a6e22e">ModelManager&lt;/span>:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">10&lt;/span>&lt;span> &lt;span style="color:#66d9ef">def&lt;/span> __init__(self):&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">11&lt;/span>&lt;span> self&lt;span style="color:#f92672">.&lt;/span>model &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#66d9ef">None&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">12&lt;/span>&lt;span> self&lt;span style="color:#f92672">.&lt;/span>tokenizer &lt;span 
style="color:#f92672">=&lt;/span> &lt;span style="color:#66d9ef">None&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">13&lt;/span>&lt;span> self&lt;span style="color:#f92672">.&lt;/span>model_name &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#66d9ef">None&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">14&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">15&lt;/span>&lt;span> &lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">load_model&lt;/span>(self, model_name: str, hf_token:str &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#66d9ef">None&lt;/span>, device: str &lt;span style="color:#f92672">=&lt;/span> DEVICE):&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">16&lt;/span>&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>model_name &lt;span style="color:#f92672">!=&lt;/span> &lt;span style="color:#66d9ef">None&lt;/span> &lt;span style="color:#f92672">and&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>model_name &lt;span style="color:#f92672">!=&lt;/span> model_name:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">17&lt;/span>&lt;span> print(&lt;span style="color:#e6db74">&amp;#34;Removing previously loaded 
model...&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">18&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">19&lt;/span>&lt;span> self&lt;span style="color:#f92672">.&lt;/span>unload_model() &lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">20&lt;/span>&lt;span> print(&lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;Loading model &lt;/span>&lt;span style="color:#e6db74">{&lt;/span>model_name&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74"> on device &lt;/span>&lt;span style="color:#e6db74">{&lt;/span>device&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74">...&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">21&lt;/span>&lt;span> &lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">22&lt;/span>&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>model_name &lt;span style="color:#f92672">!=&lt;/span> model_name:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">23&lt;/span>&lt;span> &lt;span style="color:#66d9ef">try&lt;/span>: &lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span 
style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">24&lt;/span>&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> hf_token: &lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">25&lt;/span>&lt;span> self&lt;span style="color:#f92672">.&lt;/span>tokenizer &lt;span style="color:#f92672">=&lt;/span> AutoTokenizer&lt;span style="color:#f92672">.&lt;/span>from_pretrained(model_name, token&lt;span style="color:#f92672">=&lt;/span>hf_token)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">26&lt;/span>&lt;span> self&lt;span style="color:#f92672">.&lt;/span>model &lt;span style="color:#f92672">=&lt;/span> AutoModelForCausalLM&lt;span style="color:#f92672">.&lt;/span>from_pretrained(model_name, device_map&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;balanced&amp;#34;&lt;/span>, token&lt;span style="color:#f92672">=&lt;/span>hf_token)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">27&lt;/span>&lt;span> &lt;span style="color:#66d9ef">else&lt;/span>:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">28&lt;/span>&lt;span> self&lt;span style="color:#f92672">.&lt;/span>tokenizer &lt;span style="color:#f92672">=&lt;/span> AutoTokenizer&lt;span style="color:#f92672">.&lt;/span>from_pretrained(model_name)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 
0.4em 0 0.4em;color:#7f7f7f">29&lt;/span>&lt;span> self&lt;span style="color:#f92672">.&lt;/span>model &lt;span style="color:#f92672">=&lt;/span> AutoModelForCausalLM&lt;span style="color:#f92672">.&lt;/span>from_pretrained(model_name, device_map&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;balanced&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">30&lt;/span>&lt;span> self&lt;span style="color:#f92672">.&lt;/span>model&lt;span style="color:#f92672">.&lt;/span>eval()&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">31&lt;/span>&lt;span> self&lt;span style="color:#f92672">.&lt;/span>model_name &lt;span style="color:#f92672">=&lt;/span> model_name&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">32&lt;/span>&lt;span> print(is_model_on_gpu(self&lt;span style="color:#f92672">.&lt;/span>model&lt;span style="color:#f92672">.&lt;/span>hf_device_map, self&lt;span style="color:#f92672">.&lt;/span>model_name))&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">33&lt;/span>&lt;span> &lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">34&lt;/span>&lt;span> &lt;span style="color:#66d9ef">except&lt;/span> &lt;span style="color:#a6e22e">Exception&lt;/span> &lt;span style="color:#66d9ef">as&lt;/span> e:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span 
style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">35&lt;/span>&lt;span>                &lt;span style="color:#66d9ef">raise&lt;/span> HTTPException(status_code&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">500&lt;/span>, detail&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;Error loading model: &lt;/span>&lt;span style="color:#e6db74">{&lt;/span>str(e)&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">36&lt;/span>&lt;span>        &lt;span style="color:#66d9ef">else&lt;/span>:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">37&lt;/span>&lt;span>            print(&lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;The model &lt;/span>&lt;span style="color:#e6db74">{&lt;/span>model_name&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74"> is already loaded.&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">38&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">39&lt;/span>&lt;span>    &lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">generate&lt;/span>(self, model_name:str, hf_token: str, prompt:str, max_tokens:int &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">300&lt;/span>, 
temperature:float &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">1.0&lt;/span>, top_p:float &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">1.0&lt;/span>) &lt;span style="color:#f92672">-&amp;gt;&lt;/span> str:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">40&lt;/span>&lt;span> &lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">41&lt;/span>&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>model_name &lt;span style="color:#f92672">!=&lt;/span> model_name:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">42&lt;/span>&lt;span> self&lt;span style="color:#f92672">.&lt;/span>load_model(model_name, hf_token, device&lt;span style="color:#f92672">=&lt;/span>DEVICE)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">43&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">44&lt;/span>&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>model &lt;span style="color:#f92672">is&lt;/span> &lt;span style="color:#66d9ef">None&lt;/span> &lt;span style="color:#f92672">or&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>tokenizer &lt;span style="color:#f92672">is&lt;/span> &lt;span style="color:#66d9ef">None&lt;/span>:&lt;/span>&lt;/span>&lt;span 
style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">45&lt;/span>&lt;span> &lt;span style="color:#66d9ef">raise&lt;/span> HTTPException(status_code&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">400&lt;/span>, detail&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;No model loaded.&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">46&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">47&lt;/span>&lt;span> &lt;span style="color:#66d9ef">try&lt;/span>:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">48&lt;/span>&lt;span> inputs &lt;span style="color:#f92672">=&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>tokenizer(prompt, return_tensors&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;pt&amp;#34;&lt;/span>)&lt;span style="color:#f92672">.&lt;/span>to(self&lt;span style="color:#f92672">.&lt;/span>model&lt;span style="color:#f92672">.&lt;/span>device)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">49&lt;/span>&lt;span> &lt;span style="color:#66d9ef">with&lt;/span> torch&lt;span style="color:#f92672">.&lt;/span>no_grad(): &lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 
0.4em;color:#7f7f7f">50&lt;/span>&lt;span> outputs &lt;span style="color:#f92672">=&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>model&lt;span style="color:#f92672">.&lt;/span>generate(&lt;span style="color:#f92672">**&lt;/span>inputs, max_new_tokens&lt;span style="color:#f92672">=&lt;/span>max_tokens,temperature&lt;span style="color:#f92672">=&lt;/span>temperature, top_p&lt;span style="color:#f92672">=&lt;/span>top_p, eos_token_id&lt;span style="color:#f92672">=&lt;/span>self&lt;span style="color:#f92672">.&lt;/span>tokenizer&lt;span style="color:#f92672">.&lt;/span>eos_token_id)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">51&lt;/span>&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>tokenizer&lt;span style="color:#f92672">.&lt;/span>decode(outputs[&lt;span style="color:#ae81ff">0&lt;/span>], skip_special_tokens&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#66d9ef">True&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">52&lt;/span>&lt;span> &lt;span style="color:#66d9ef">except&lt;/span> &lt;span style="color:#a6e22e">Exception&lt;/span> &lt;span style="color:#66d9ef">as&lt;/span> e:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">53&lt;/span>&lt;span> &lt;span style="color:#66d9ef">raise&lt;/span> HTTPException(status_code&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">500&lt;/span>, detail&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;Error generating 
text:&lt;/span>&lt;span style="color:#e6db74">{&lt;/span>str(e)&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">54&lt;/span>&lt;span> &lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">55&lt;/span>&lt;span> &lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">get_status&lt;/span>(self) &lt;span style="color:#f92672">-&amp;gt;&lt;/span> str: &lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">56&lt;/span>&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>model &lt;span style="color:#f92672">is&lt;/span> &lt;span style="color:#66d9ef">None&lt;/span>:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">57&lt;/span>&lt;span> self&lt;span style="color:#f92672">.&lt;/span>unload_model()&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">58&lt;/span>&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> &lt;span style="color:#e6db74">&amp;#34;No model loaded.&amp;#34;&lt;/span> &lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">59&lt;/span>&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> is_model_on_gpu(self&lt;span 
style="color:#f92672">.&lt;/span>model&lt;span style="color:#f92672">.&lt;/span>hf_device_map, self&lt;span style="color:#f92672">.&lt;/span>model_name)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">60&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">61&lt;/span>&lt;span> &lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">unload_model&lt;/span>(self):&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">62&lt;/span>&lt;span> self&lt;span style="color:#f92672">.&lt;/span>model &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#66d9ef">None&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">63&lt;/span>&lt;span> self&lt;span style="color:#f92672">.&lt;/span>tokenizer &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#66d9ef">None&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">64&lt;/span>&lt;span> old_model &lt;span style="color:#f92672">=&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>model_name &lt;span style="color:#66d9ef">if&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>model_name &lt;span style="color:#66d9ef">else&lt;/span> &lt;span style="color:#66d9ef">False&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span 
style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">65&lt;/span>&lt;span> self&lt;span style="color:#f92672">.&lt;/span>model_name &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#66d9ef">None&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">66&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">67&lt;/span>&lt;span> gc&lt;span style="color:#f92672">.&lt;/span>collect()&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">68&lt;/span>&lt;span> torch&lt;span style="color:#f92672">.&lt;/span>cuda&lt;span style="color:#f92672">.&lt;/span>empty_cache()&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">69&lt;/span>&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> &lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;Model &lt;/span>&lt;span style="color:#e6db74">{&lt;/span>old_model&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74"> successfully unloaded.&amp;#34;&lt;/span> &lt;span style="color:#66d9ef">if&lt;/span> old_model &lt;span style="color:#66d9ef">else&lt;/span> &lt;span style="color:#e6db74">&amp;#34;No model loaded to unload.&amp;#34;&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">70&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span 
style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">71&lt;/span>&lt;span>manager &lt;span style="color:#f92672">=&lt;/span> ModelManager()&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ul>&lt;li>The &lt;code>load_model&lt;/code> function loads a new model into memory, removing any previously loaded model.&lt;/li>&lt;li>&lt;code>generate&lt;/code> is the main function of the API, responsible for performing model inference. It allows adjusting the parameters: temperature, top_p, and max_tokens.&lt;/li>&lt;li>&lt;code>get_status&lt;/code> reports whether there is a loaded model and whether it is on the GPU or CPU.&lt;/li>&lt;li>The &lt;code>unload_model&lt;/code> function removes the model from memory, clears the CUDA cache, and invokes Python’s garbage collector to avoid leftovers that could interfere with future loads.&lt;/li>&lt;/ul>&lt;h2 id="fastapi-api-endpoints">FastAPI API Endpoints&lt;/h2>&lt;p>The &lt;code>app/main.py&lt;/code> file is where all the components come together. 
In it, we define all the endpoints and the API’s routing logic.&lt;/p>&lt;p>&lt;strong>main.py&lt;/strong>&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 1&lt;/span>&lt;span>&lt;span style="color:#f92672">from&lt;/span> fastapi &lt;span style="color:#f92672">import&lt;/span> FastAPI, Request, HTTPException, Depends&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 2&lt;/span>&lt;span>&lt;span style="color:#f92672">from&lt;/span> fastapi.responses &lt;span style="color:#f92672">import&lt;/span> JSONResponse&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 3&lt;/span>&lt;span>&lt;span style="color:#f92672">from&lt;/span> app &lt;span style="color:#f92672">import&lt;/span> schemas, model_manager, auth&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 4&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 5&lt;/span>&lt;span>app &lt;span style="color:#f92672">=&lt;/span> FastAPI()&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 6&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span 
style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 7&lt;/span>&lt;span>&lt;span style="color:#66d9ef">async&lt;/span> &lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">require_api_key&lt;/span>(request: Request) &lt;span style="color:#f92672">-&amp;gt;&lt;/span> schemas&lt;span style="color:#f92672">.&lt;/span>LDAPUserRequest:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 8&lt;/span>&lt;span> user &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#66d9ef">await&lt;/span> auth&lt;span style="color:#f92672">.&lt;/span>verify_apikey(request)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 9&lt;/span>&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> &lt;span style="color:#f92672">not&lt;/span> user:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">10&lt;/span>&lt;span> &lt;span style="color:#66d9ef">raise&lt;/span> HTTPException(status_code&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">401&lt;/span>, detail&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;Invalid API Key&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">11&lt;/span>&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> user&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span 
style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">12&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">13&lt;/span>&lt;span>&lt;span style="color:#a6e22e">@app.post&lt;/span>(&lt;span style="color:#e6db74">&amp;#34;/generate_apikey&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">14&lt;/span>&lt;span>&lt;span style="color:#66d9ef">async&lt;/span> &lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">generate_apikey&lt;/span>(payload: schemas&lt;span style="color:#f92672">.&lt;/span>LDAPUserRequest) &lt;span style="color:#f92672">-&amp;gt;&lt;/span> JSONResponse:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">15&lt;/span>&lt;span> key &lt;span style="color:#f92672">=&lt;/span> auth&lt;span style="color:#f92672">.&lt;/span>generate_apikey(payload&lt;span style="color:#f92672">.&lt;/span>username)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">16&lt;/span>&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> JSONResponse(status_code&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">200&lt;/span>, content&lt;span style="color:#f92672">=&lt;/span>{&lt;span style="color:#e6db74">&amp;#34;api_key&amp;#34;&lt;/span>: key})&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 
0.4em;color:#7f7f7f">17&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">18&lt;/span>&lt;span>&lt;span style="color:#a6e22e">@app.post&lt;/span>(&lt;span style="color:#e6db74">&amp;#34;/load_model&amp;#34;&lt;/span>, dependencies&lt;span style="color:#f92672">=&lt;/span>[Depends(require_api_key)])&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">19&lt;/span>&lt;span>&lt;span style="color:#66d9ef">async&lt;/span> &lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">load_model&lt;/span>(payload: schemas&lt;span style="color:#f92672">.&lt;/span>LoadModelRequest) &lt;span style="color:#f92672">-&amp;gt;&lt;/span> JSONResponse:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">20&lt;/span>&lt;span> &lt;span style="color:#66d9ef">try&lt;/span>:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">21&lt;/span>&lt;span> model_manager&lt;span style="color:#f92672">.&lt;/span>manager&lt;span style="color:#f92672">.&lt;/span>load_model(payload&lt;span style="color:#f92672">.&lt;/span>model_name, payload&lt;span style="color:#f92672">.&lt;/span>hf_token, payload&lt;span style="color:#f92672">.&lt;/span>device)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">22&lt;/span>&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> JSONResponse(content&lt;span 
style="color:#f92672">=&lt;/span>{&lt;span style="color:#e6db74">&amp;#34;message&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;Model &lt;/span>&lt;span style="color:#e6db74">{&lt;/span>payload&lt;span style="color:#f92672">.&lt;/span>model_name&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74"> loaded successfully.&amp;#34;&lt;/span>})&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">23&lt;/span>&lt;span> &lt;span style="color:#66d9ef">except&lt;/span> &lt;span style="color:#a6e22e">Exception&lt;/span> &lt;span style="color:#66d9ef">as&lt;/span> e:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">24&lt;/span>&lt;span> &lt;span style="color:#66d9ef">raise&lt;/span> HTTPException(status_code&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">500&lt;/span>, content&lt;span style="color:#f92672">=&lt;/span>{&lt;span style="color:#e6db74">&amp;#34;error&amp;#34;&lt;/span>: str(e)})&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">25&lt;/span>&lt;span> &lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">26&lt;/span>&lt;span>&lt;span style="color:#a6e22e">@app.post&lt;/span>(&lt;span style="color:#e6db74">&amp;#34;/generate&amp;#34;&lt;/span>, dependencies&lt;span style="color:#f92672">=&lt;/span>[Depends(require_api_key)])&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span 
style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">27&lt;/span>&lt;span>&lt;span style="color:#66d9ef">async&lt;/span> &lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">generate&lt;/span>(payload: schemas&lt;span style="color:#f92672">.&lt;/span>GenerateRequest)&lt;span style="color:#f92672">-&amp;gt;&lt;/span> JSONResponse:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">28&lt;/span>&lt;span> &lt;span style="color:#66d9ef">try&lt;/span>:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">29&lt;/span>&lt;span> result &lt;span style="color:#f92672">=&lt;/span> model_manager&lt;span style="color:#f92672">.&lt;/span>manager&lt;span style="color:#f92672">.&lt;/span>generate(payload&lt;span style="color:#f92672">.&lt;/span>model_name, payload&lt;span style="color:#f92672">.&lt;/span>hf_token,payload&lt;span style="color:#f92672">.&lt;/span>prompt, payload&lt;span style="color:#f92672">.&lt;/span>max_tokens, payload&lt;span style="color:#f92672">.&lt;/span>temperature, payload&lt;span style="color:#f92672">.&lt;/span>top_p)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">30&lt;/span>&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> {&lt;span style="color:#e6db74">&amp;#34;result&amp;#34;&lt;/span>: result}&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">31&lt;/span>&lt;span> &lt;span style="color:#66d9ef">except&lt;/span> &lt;span 
style="color:#a6e22e">Exception&lt;/span> &lt;span style="color:#66d9ef">as&lt;/span> e:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">32&lt;/span>&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> JSONResponse(status_code&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">500&lt;/span>, content&lt;span style="color:#f92672">=&lt;/span>{&lt;span style="color:#e6db74">&amp;#34;error&amp;#34;&lt;/span>: str(e)})&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">33&lt;/span>&lt;span> &lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">34&lt;/span>&lt;span>&lt;span style="color:#a6e22e">@app.get&lt;/span>(&lt;span style="color:#e6db74">&amp;#34;/status&amp;#34;&lt;/span>, dependencies&lt;span style="color:#f92672">=&lt;/span>[Depends(require_api_key)])&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">35&lt;/span>&lt;span>&lt;span style="color:#66d9ef">async&lt;/span> &lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">status&lt;/span>()&lt;span style="color:#f92672">-&amp;gt;&lt;/span> JSONResponse:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">36&lt;/span>&lt;span> str_status &lt;span style="color:#f92672">=&lt;/span> model_manager&lt;span style="color:#f92672">.&lt;/span>manager&lt;span style="color:#f92672">.&lt;/span>get_status()&lt;/span>&lt;/span>&lt;span 
style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">37&lt;/span>&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> JSONResponse(content&lt;span style="color:#f92672">=&lt;/span>{&lt;span style="color:#e6db74">&amp;#34;status&amp;#34;&lt;/span>: str_status})&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">38&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">39&lt;/span>&lt;span>&lt;span style="color:#a6e22e">@app.post&lt;/span>(&lt;span style="color:#e6db74">&amp;#34;/unload_model&amp;#34;&lt;/span>, dependencies&lt;span style="color:#f92672">=&lt;/span>[Depends(require_api_key)])&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">40&lt;/span>&lt;span>&lt;span style="color:#66d9ef">async&lt;/span> &lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">unload_model&lt;/span>() &lt;span style="color:#f92672">-&amp;gt;&lt;/span> JSONResponse:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">41&lt;/span>&lt;span> &lt;span style="color:#66d9ef">try&lt;/span>:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">42&lt;/span>&lt;span> str_unload &lt;span style="color:#f92672">=&lt;/span> model_manager&lt;span style="color:#f92672">.&lt;/span>manager&lt;span 
style="color:#f92672">.&lt;/span>unload_model()&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">43&lt;/span>&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> JSONResponse(content&lt;span style="color:#f92672">=&lt;/span>{&lt;span style="color:#e6db74">&amp;#34;message&amp;#34;&lt;/span>:str_unload})&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">44&lt;/span>&lt;span> &lt;span style="color:#66d9ef">except&lt;/span> &lt;span style="color:#a6e22e">Exception&lt;/span> &lt;span style="color:#66d9ef">as&lt;/span> e:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">45&lt;/span>&lt;span> &lt;span style="color:#66d9ef">raise&lt;/span> HTTPException(status_code&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">500&lt;/span>, content&lt;span style="color:#f92672">=&lt;/span>{&lt;span style="color:#e6db74">&amp;#34;error&amp;#34;&lt;/span>: str(e)})&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ul>&lt;li>The &lt;code>require_api_key&lt;/code> function checks the API Key on each request and returns the authenticated user or raises a 401 error.&lt;/li>&lt;li>&lt;code>generate_apikey&lt;/code> creates and returns a new API key for the specified user.&lt;/li>&lt;li>&lt;code>load_model&lt;/code> loads the specified model. 
If needed, it also accepts a Hugging Face token.&lt;/li>&lt;li>The &lt;code>generate&lt;/code> function makes the model perform inference using the given prompt and parameters.&lt;/li>&lt;li>Calling the &lt;code>status&lt;/code> endpoint returns the current status of the model manager.&lt;/li>&lt;li>&lt;code>unload_model&lt;/code> unloads the currently loaded model and returns a success message if completed properly.&lt;/li>&lt;/ul>&lt;h2 id="utilspy-file">&lt;code>utils.py&lt;/code> File&lt;/h2>&lt;p>The &lt;code>app/utils.py&lt;/code> file contains the function that checks whether the loaded model is fully or partially on the GPU, or if it was loaded on the CPU.&lt;/p>&lt;p>&lt;strong>utils.py&lt;/strong>&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1&lt;/span>&lt;span>&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">is_model_on_gpu&lt;/span>(hf_device_map: dict, model_name: str) &lt;span style="color:#f92672">-&amp;gt;&lt;/span> str:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">2&lt;/span>&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> &lt;span style="color:#e6db74">&amp;#39;&amp;#39;&lt;/span> &lt;span style="color:#f92672">in&lt;/span> hf_device_map&lt;span style="color:#f92672">.&lt;/span>keys() &lt;span style="color:#f92672">and&lt;/span> hf_device_map[&lt;span style="color:#e6db74">&amp;#39;&amp;#39;&lt;/span>] &lt;span style="color:#f92672">==&lt;/span> &lt;span style="color:#e6db74">&amp;#39;cpu&amp;#39;&lt;/span>:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span 
style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">3&lt;/span>&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> &lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;Model &lt;/span>&lt;span style="color:#e6db74">{&lt;/span>model_name&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74"> fully loaded on CPU.&amp;#34;&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">4&lt;/span>&lt;span> &lt;span style="color:#66d9ef">elif&lt;/span> &lt;span style="color:#e6db74">&amp;#39;cpu&amp;#39;&lt;/span> &lt;span style="color:#f92672">in&lt;/span> hf_device_map&lt;span style="color:#f92672">.&lt;/span>values():&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">5&lt;/span>&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> &lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;Some layers of the model &lt;/span>&lt;span style="color:#e6db74">{&lt;/span>model_name&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74"> are loaded on the CPU.&amp;#34;&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">6&lt;/span>&lt;span> &lt;span style="color:#66d9ef">else&lt;/span>:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">7&lt;/span>&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> &lt;span style="color:#e6db74">f&lt;/span>&lt;span 
style="color:#e6db74">&amp;#34;Model &lt;/span>&lt;span style="color:#e6db74">{&lt;/span>model_name&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74"> fully loaded on GPU.&amp;#34;&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="running-the-api">Running the API&lt;/h2>&lt;p>To run the API with &lt;code>uvicorn&lt;/code>, simply execute a command specifying the host and port for the service to start.&lt;/p>&lt;pre tabindex="0">&lt;code>uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload&lt;/code>&lt;/pre>&lt;ul>&lt;li>&lt;p>&lt;code>app:main&lt;/code> refers to the &lt;code>app/main.py&lt;/code> file, which connects all components and handles user requests.&lt;/p>&lt;/li>&lt;li>&lt;p>&lt;code>--host 0.0.0.0&lt;/code> sets the IP address on which the Uvicorn server will listen. The value &lt;code>0.0.0.0&lt;/code> allows the server to be accessible from any network interface on the Power9 machine.&lt;/p>&lt;/li>&lt;li>&lt;p>&lt;code>--port 8000&lt;/code> specifies the port on which the server will listen for requests.&lt;/p>&lt;/li>&lt;li>&lt;p>&lt;code>--reload&lt;/code> is a flag for development use. It automatically reloads the server whenever changes are made.&lt;/p>&lt;/li>&lt;/ul>&lt;p>BBy following this guide, you&amp;rsquo;ll have a working API capable of running LLM inference using models downloaded from Hugging Face. 
In the &lt;a href="https://llm-pt-ibm.github.io/en/posts/tutorial_power9_pt4_en/">&lt;span class="link-personalizado">next tutorial&lt;/span>&lt;/a>, we will show how to send requests to the API using curl and Python.&lt;/p></description></item><item><title>Setting Up Conda and PyTorch on IBM Power9 Servers</title><link>https://llm-pt-ibm.github.io/en/posts/tutorial_power9_pt2_en/</link><pubDate>Mon, 30 Jun 2025 00:00:00 +0000</pubDate><guid>https://llm-pt-ibm.github.io/en/posts/tutorial_power9_pt2_en/</guid><description>&lt;h2 id="background">Background&lt;/h2>&lt;p>This is the second post in a tutorial series on how to build a Language Model API on an IBM Power9 server, covering everything from setting up the operating system to having the API running remote inference. The &lt;a href="https://llm-pt-ibm.github.io/en/posts/tutorial_power9_pt1_en/">&lt;span class="link-personalizado">first post&lt;/span>&lt;/a> covers installing the OS and configuring NVIDIA drivers, CUDA, and CUDNN. In this step, we&amp;rsquo;ll show how to set up the Conda package manager and the PyTorch library.&lt;/p>&lt;p>&lt;strong>Conda&lt;/strong>: Conda is an open-source, cross-platform package and environment management system. It&amp;rsquo;s like a &amp;ldquo;toolbox&amp;rdquo; for data scientists and developers to organize their projects.&lt;/p>&lt;p>&lt;strong>PyTorch&lt;/strong>: PyTorch is an open-source machine learning library developed primarily by Facebook AI Research (FAIR). 
It&amp;rsquo;s especially popular for building deep learning applications, a subfield of machine learning inspired by how the human brain works.&lt;/p>&lt;h2 id="tldr">TL;DR&lt;/h2>&lt;ul>&lt;li>This post provides a step-by-step guide to installing Conda and PyTorch.&lt;/li>&lt;li>The main challenge is finding compatible versions for the Power9 machine architecture.&lt;/li>&lt;/ul>&lt;h2 id="setting-up-the-conda">Setting up Conda&lt;/h2>&lt;p>We&amp;rsquo;ll start with installing &lt;strong>Conda&lt;/strong>. On Power systems, the architecture used is &lt;code>ppc64le&lt;/code> (PowerPC 64-bit little-endian), so it&amp;rsquo;s essential to download the version for this architecture. We&amp;rsquo;ll use &lt;strong>miniconda&lt;/strong>, a lighter option that&amp;rsquo;s better suited for custom setups like the Power9 server.&lt;/p>&lt;ol>&lt;li>To download and install the latest version of Miniconda:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>sudo wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-ppc64le.sh
bash ~/Miniconda3-latest-Linux-ppc64le.sh&lt;/code>&lt;/pre>&lt;ol start="2">&lt;li>Check if Conda was activated automatically:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>conda --version&lt;/code>&lt;/pre>&lt;p>If it didn&amp;rsquo;t start automatically, you&amp;rsquo;ll need to activate it.&lt;/p>&lt;ol start="3">&lt;li>To ensure it&amp;rsquo;s automatically activated with each new connection, we will write the command into your &lt;code>.bashrc&lt;/code> (or &lt;code>.zshrc&lt;/code>) file.&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>echo &amp;#39;source ~/miniconda3/etc/profile.d/conda.sh&amp;#39; &amp;gt;&amp;gt; ~/.bashrc
source ~/.bashrc&lt;/code>&lt;/pre>&lt;p>Check again with the command:&lt;/p>&lt;pre tabindex="0">&lt;code>conda --version&lt;/code>&lt;/pre>&lt;p>Expected output looks like: &lt;code>conda 23.10.0&lt;/code>&lt;/p>&lt;h2 id="installing-and-configuring-the-pytorch-library">Installing and configuring the PyTorch 
library&lt;/h2>&lt;p>There are no official builds or Conda/PyPI wheels with full support for the &lt;strong>ppc64le&lt;/strong> architecture. To install PyTorch, you’ll need to build it manually.&lt;/p>&lt;h4 id="optional-creating-a-conda-virtual-environment">(Optional) Creating a Conda virtual environment&lt;/h4>&lt;p>It’s recommended to create a dedicated virtual environment to install PyTorch in isolation.&lt;/p>&lt;ol>&lt;li>To create and activate the virtual environment, run:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>conda create -y -n api_llm python=3.10
conda activate api_llm&lt;/code>&lt;/pre>&lt;h4 id="installing-prerequisites">Installing prerequisites&lt;/h4>&lt;p>We need to install some packages required to properly build PyTorch.&lt;/p>&lt;ol>&lt;li>First, install the packages using the following commands:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>conda install -y -c conda-forge openblas libblas cmake ninja python3-devel gcc-c++ rust cargo&lt;/code>&lt;/pre>&lt;p>Newer releases of CMake (the build system used by PyTorch) dropped support for build scripts declaring compatibility with CMake versions older than 3.5. 
To address this, we need to pin an older CMake release using pip.&lt;/p>&lt;ol start="2">&lt;li>Run the command:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>pip install cmake==3.27.7&lt;/code>&lt;/pre>&lt;p>To make sure the correct version was installed, run the command:&lt;/p>&lt;pre tabindex="0">&lt;code>cmake --version &lt;/code>&lt;/pre>&lt;p>Expected output: &lt;code>cmake version 3.27.7&lt;/code>&lt;/p>&lt;h4 id="building-pytorch">Building PyTorch&lt;/h4>&lt;p>Now let&amp;rsquo;s start the &lt;strong>PyTorch&lt;/strong> build process.&lt;/p>&lt;ol>&lt;li>The first step is to clone the repository and set it up to install version 2.6.0:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>git clone --recursive https://github.com/pytorch/pytorch
cd pytorch
git checkout v2.6.0
git submodule sync
git submodule update --init --recursive &lt;/code>&lt;/pre>&lt;ol start="2">&lt;li>To install the required packages via pip, run the following command:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>pip install -r requirements.txt&lt;/code>&lt;/pre>&lt;ol start="3">&lt;li>And finally, to build PyTorch, run Python’s setup.py:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>sudo USE_CUDA=1 USE_DISTRIBUTED=1 USE_NCCL=1 USE_GLOO=1 USE_CUDNN=1 python setup.py install&lt;/code>&lt;/pre>&lt;p>The build process usually takes a while, around 15 minutes.&lt;/p>&lt;ol start="4">&lt;li>To check if everything worked correctly, create a file named &lt;code>test_torch.py&lt;/code>&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>nano test_torch.py&lt;/code>&lt;/pre>&lt;p>This file should contain the following lines:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 
1&lt;/span>&lt;span>&lt;span style="color:#f92672">import&lt;/span> torch&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 2&lt;/span>&lt;span>print(torch&lt;span style="color:#f92672">.&lt;/span>__version__)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 3&lt;/span>&lt;span>print(&lt;span style="color:#e6db74">&amp;#34;CUDA available:&amp;#34;&lt;/span>, torch&lt;span style="color:#f92672">.&lt;/span>cuda&lt;span style="color:#f92672">.&lt;/span>is_available())&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 4&lt;/span>&lt;span>print(&lt;span style="color:#e6db74">&amp;#34;Number of GPUs:&amp;#34;&lt;/span>, torch&lt;span style="color:#f92672">.&lt;/span>cuda&lt;span style="color:#f92672">.&lt;/span>device_count())&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 5&lt;/span>&lt;span>print(&lt;span style="color:#e6db74">&amp;#34;GPU name:&amp;#34;&lt;/span>, torch&lt;span style="color:#f92672">.&lt;/span>cuda&lt;span style="color:#f92672">.&lt;/span>get_device_name(&lt;span style="color:#ae81ff">0&lt;/span>))&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 6&lt;/span>&lt;span>x &lt;span style="color:#f92672">=&lt;/span> torch&lt;span style="color:#f92672">.&lt;/span>rand(&lt;span style="color:#ae81ff">3&lt;/span>, &lt;span style="color:#ae81ff">3&lt;/span>)&lt;span 
style="color:#f92672">.&lt;/span>cuda()&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 7&lt;/span>&lt;span>y &lt;span style="color:#f92672">=&lt;/span> torch&lt;span style="color:#f92672">.&lt;/span>rand(&lt;span style="color:#ae81ff">3&lt;/span>, &lt;span style="color:#ae81ff">3&lt;/span>)&lt;span style="color:#f92672">.&lt;/span>cuda()&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 8&lt;/span>&lt;span>print(&lt;span style="color:#e6db74">&amp;#34;Sum on GPU:&amp;#34;&lt;/span>, (x &lt;span style="color:#f92672">+&lt;/span> y))&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 9&lt;/span>&lt;span>print(&lt;span style="color:#e6db74">&amp;#34;cuDNN available:&amp;#34;&lt;/span>, torch&lt;span style="color:#f92672">.&lt;/span>backends&lt;span style="color:#f92672">.&lt;/span>cudnn&lt;span style="color:#f92672">.&lt;/span>is_available())&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">10&lt;/span>&lt;span>print(&lt;span style="color:#e6db74">&amp;#34;C extensions loaded:&amp;#34;&lt;/span>, torch&lt;span style="color:#f92672">.&lt;/span>_C&lt;span style="color:#f92672">.&lt;/span>_cuda_getDeviceCount() &lt;span style="color:#f92672">&amp;gt;&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span>)&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>When you run this file, you’ll check:&lt;/p>&lt;ul>&lt;li>Installed PyTorch version&lt;/li>&lt;li>CUDA availability&lt;/li>&lt;li>Number of available GPUs&lt;/li>&lt;li>GPU name on the Power9 
server&lt;/li>&lt;li>Whether GPU usage is working correctly&lt;/li>&lt;li>CUDNN availability&lt;/li>&lt;li>Whether the .so files were compiled correctly&lt;/li>&lt;/ul>&lt;p>This script simply prints some CUDA and PyTorch information and performs a basic addition using GPU tensors.&lt;/p>&lt;ol start="5">&lt;li>Run the file with the command:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>python test_torch.py&lt;/code>&lt;/pre>&lt;p>Expected output should look something like:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>2.6.0a0+git1eba9b3&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>CUDA available: True&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>Number of GPUs: &lt;span style="color:#ae81ff">4&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>GPU name: Tesla V100-SXM2-16GB&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>Sum on GPU: tensor&lt;span style="color:#f92672">([[&lt;/span>1.9163, 1.2208, 0.5998&lt;span style="color:#f92672">]&lt;/span>,&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">[&lt;/span>1.7962, 0.6040, 1.3943&lt;span style="color:#f92672">]&lt;/span>,&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">[&lt;/span>0.9536, 0.8010, 0.0668&lt;span style="color:#f92672">]]&lt;/span>, device&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#39;cuda:0&amp;#39;&lt;/span>&lt;span style="color:#f92672">)&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>cuDNN available: True&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>C extensions loaded: True&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Keep in mind that the output may vary depending on the number and model of 
GPUs, as well as the tensor sums (due to randomness). What matters is that the boolean outputs in the script return &lt;code>True&lt;/code>.&lt;/p>&lt;p>With this, PyTorch is installed and ready to use. &lt;a href="https://llm-pt-ibm.github.io/en/posts/tutorial_power9_pt3_en/">&lt;span class="link-personalizado">In the next tutorial&lt;/span>&lt;/a>, we’ll run the first Language Model inference on the Power9 server.&lt;/p></description></item><item><title>Setting Up the OS, NVIDIA Drivers, CUDA, and cuDNN on IBM Power 9 Servers</title><link>https://llm-pt-ibm.github.io/en/posts/tutorial_power9_pt1_en/</link><pubDate>Sun, 29 Jun 2025 00:00:00 +0000</pubDate><guid>https://llm-pt-ibm.github.io/en/posts/tutorial_power9_pt1_en/</guid><description>&lt;h2 id="background">Background&lt;/h2>&lt;p>This is the first post in a tutorial series on how to build a Language Model API on an IBM Power9 server, covering everything from setting up the operating system to having the API running remote inference. This step of the tutorial shows how to set up the operating system and install NVIDIA drivers, CUDA, and cuDNN on IBM Power9 AC922 machines. The focus is on ensuring everything works correctly on &lt;code>ppc64le&lt;/code> architectures, which are common in high-performance environments.&lt;/p>&lt;p>&lt;strong>IBM Power9&lt;/strong>: The IBM Power9 AC922 is a high-performance machine used for demanding tasks such as artificial intelligence and scientific computing. It uses Power9 processors and works well with NVIDIA GPUs, offering high-speed communication between the CPU and GPU.&lt;/p>&lt;p>&lt;strong>NVIDIA Drivers&lt;/strong>: Software that allows the operating system to communicate correctly with NVIDIA GPUs. These drivers are essential to enable GPU acceleration.&lt;/p>&lt;p>&lt;strong>CUDA&lt;/strong>: NVIDIA&amp;rsquo;s platform for accelerating parallel computing on GPUs. 
It lets you run complex algorithms efficiently, such as Large Language Model inference.&lt;/p>&lt;p>&lt;strong>cuDNN&lt;/strong>: A GPU-optimized library of primitives for deep neural networks (DNNs) developed by NVIDIA. It offers high-performance implementations of key DNN operations like convolutions, pooling, and normalization, significantly speeding up training and inference on GPUs.&lt;/p>&lt;h2 id="tldr">TL;DR&lt;/h2>&lt;ul>&lt;li>This post provides a step-by-step guide on setting up Power9 servers, including the OS and NVIDIA configurations.&lt;/li>&lt;li>The main challenge is finding compatible versions for the Power9 machine architecture.&lt;/li>&lt;/ul>&lt;h2 id="setting-up-the-operating-system">Setting up the Operating System&lt;/h2>&lt;p>Let&amp;rsquo;s start with the installation of &lt;strong>Red Hat Enterprise Linux 8.10 (Ootpa)&lt;/strong>. On Power systems, the architecture used is &lt;code>ppc64le&lt;/code> (PowerPC 64-bit little-endian), so it&amp;rsquo;s essential to ensure the .iso image is compatible with this architecture. 
Otherwise, the Power9&amp;rsquo;s petitboot won&amp;rsquo;t recognize the media and installation won&amp;rsquo;t proceed.&lt;/p>&lt;ol>&lt;li>You can download the correct image from the &lt;a href="https://access.redhat.com/downloads/content/279/ver=/rhel---8/8.10/ppc64le/product-software" rel="external">&lt;span class="link-personalizado">link&lt;/span>&lt;/a> provided.&lt;/li>&lt;li>In this tutorial, we&amp;rsquo;ll use the &lt;strong>Boot ISO&lt;/strong> option and follow the &lt;a href="https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/8/html/interactively_installing_rhel_from_installation_media/assembly_creating-a-bootable-installation-medium_rhel-installer" rel="external">&lt;span class="link-personalizado">official Red Hat documentation&lt;/span>&lt;/a> to create a bootable USB medium.&lt;/li>&lt;li>After inserting the installation media into the Power9 server and rebooting, the system should automatically start petitboot.&lt;/li>&lt;li>From there, just follow the &lt;a href="https://www.ibm.com/docs/en/linuxonibm/liabw/rhelqs_guide_Power_p9_usb.pdf" rel="external">&lt;span class="link-personalizado">official installation guide&lt;/span>&lt;/a> to complete the OS setup.&lt;/li>&lt;/ol>&lt;h2 id="setting-up-nvidia-driver-and-cuda">Setting up NVIDIA Driver and CUDA&lt;/h2>&lt;h4 id="checking-gpus-and-operating-system">Checking GPUs and Operating System&lt;/h4>&lt;p>To enable the operating system to communicate properly with the server&amp;rsquo;s GPUs, we need to install and configure the NVIDIA driver.&lt;/p>&lt;ol>&lt;li>First, let&amp;rsquo;s check for the presence of the GPU(s):&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>lspci | grep -i nvidia&lt;/code>&lt;/pre>&lt;p>The expected output is something like:&lt;/p>&lt;p>&lt;code>0004:04:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)&lt;/code>&lt;/p>&lt;ol start="2">&lt;li>Next, let&amp;rsquo;s check the system architecture and operating system 
name:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>uname -m &amp;amp;&amp;amp; cat /etc/redhat-release&lt;/code>&lt;/pre>&lt;p>The expected output is:&lt;/p>&lt;p>&lt;code>ppc64le Red Hat Enterprise Linux release 8.10 (Ootpa)&lt;/code>&lt;/p>&lt;h4 id="avoiding-conflicts">Avoiding conflicts&lt;/h4>&lt;p>To avoid potential conflicts, it&amp;rsquo;s recommended to disable the &lt;code>nouveau&lt;/code> driver and &lt;code>SELinux&lt;/code>.&lt;/p>&lt;p>The &lt;code>nouveau&lt;/code> driver is an open-source driver for NVIDIA GPUs that replaces the proprietary driver when users want to use only free software without needing high performance.&lt;/p>&lt;p>When enabled, &lt;code>SELinux&lt;/code> restricts certain processes from making changes to the system, which can conflict with the installations we&amp;rsquo;ll do in this tutorial.&lt;/p>&lt;ol>&lt;li>Disable the &lt;code>nouveau&lt;/code> driver:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>echo -e &amp;#34;blacklist nouveau\noptions nouveau modeset=0&amp;#34; | sudo tee /etc/modprobe.d/disable-nouveau.conf&lt;/code>&lt;/pre>&lt;ol start="2">&lt;li>To disable &lt;code>SELinux&lt;/code>, let&amp;rsquo;s first check its status by running:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>sestatus&lt;/code>&lt;/pre>&lt;p>If it&amp;rsquo;s active, you&amp;rsquo;ll need to set the &lt;code>SELINUX=disabled&lt;/code> parameter in the &lt;code>/etc/selinux/config&lt;/code> file to proceed. 
Remember that saving changes requires sudo permissions.&lt;/p>&lt;ol start="3">&lt;li>After that, update the &lt;code>initramfs&lt;/code> and reboot the machine with the following commands:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>sudo dracut --force
sudo reboot&lt;/code>&lt;/pre>&lt;ol start="4">&lt;li>To verify everything worked so far, let&amp;rsquo;s check if &lt;code>nouveau&lt;/code> is disabled:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>lsmod | grep nouveau&lt;/code>&lt;/pre>&lt;p>If it&amp;rsquo;s been successfully disabled, there will be no output.&lt;/p>&lt;ol start="5">&lt;li>To verify the &lt;code>SELinux&lt;/code>:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>sestatus&lt;/code>&lt;/pre>&lt;p>If it&amp;rsquo;s disabled, the output will be: &lt;code>SELinux status: disabled&lt;/code>&lt;/p>&lt;h4 id="installing-prerequisites">Installing Prerequisites&lt;/h4>&lt;ol>&lt;li>Let&amp;rsquo;s install some prerequisites before starting the actual installation:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>sudo dnf install pciutils environment-modules
sudo dnf install kernel-devel-$(uname -r) kernel-headers
sudo dnf install https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm
sudo dnf clean all
sudo dnf install dkms&lt;/code>&lt;/pre>&lt;ol start="2">&lt;li>We also need to enable some repositories:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>sudo subscription-manager repos --enable=rhel-8-for-ppc64le-appstream-rpms
sudo subscription-manager repos --enable=rhel-8-for-ppc64le-baseos-rpms
sudo subscription-manager repos --enable=codeready-builder-for-rhel-8-ppc64le-rpms&lt;/code>&lt;/pre>&lt;h4 id="downloading-and-installing-cuda-package-repositories">Downloading and Installing CUDA Package Repositories&lt;/h4>&lt;ol>&lt;li>Let&amp;rsquo;s download &lt;strong>CUDA version 12.2&lt;/strong> and &lt;strong>NVIDIA Driver 535.54.03-1&lt;/strong> with the following command:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>wget https://developer.download.nvidia.com/compute/cuda/12.2.0/local_installers/cuda-repo-rhel8-12-2-local-12.2.0_535.54.03-1.ppc64le.rpm&lt;/code>&lt;/pre>&lt;ol start="2">&lt;li>To install the downloaded package:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>sudo rpm -i cuda-repo-rhel8-12-2-local-12.2.0_535.54.03-1.ppc64le.rpm&lt;/code>&lt;/pre>&lt;ol start="3">&lt;li>To install the NVIDIA driver and CUDA, run the following commands:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>sudo dnf install nvidia-driver-cuda
sudo dnf clean all
sudo dnf module reset nvidia-driver
sudo dnf module enable nvidia-driver:latest-dkms
sudo dnf -y module install nvidia-driver:latest-dkms
sudo dnf -y install cuda &lt;/code>&lt;/pre>&lt;p>With these commands, the driver and CUDA installation is complete.&lt;/p>&lt;h4 id="post-installation-steps">Post-Installation Steps&lt;/h4>&lt;ol>&lt;li>Let&amp;rsquo;s set the &lt;code>PATH&lt;/code> and &lt;code>LD_LIBRARY_PATH&lt;/code> environment variables. To do this, edit the &lt;code>.bashrc&lt;/code> file and add these two lines:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH&lt;/code>&lt;/pre>&lt;p>To update the environment variables, run the following command:&lt;/p>&lt;pre tabindex="0">&lt;code>source ~/.bashrc&lt;/code>&lt;/pre>&lt;p>We need to make two manual changes because they aren&amp;rsquo;t handled automatically by the CUDA package installation. If these aren&amp;rsquo;t done, the CUDA driver installation will not work properly.&lt;/p>&lt;ol start="2">&lt;li>The first change is to configure the NVIDIA persistence daemon. 
First, check its status, and if it&amp;rsquo;s not active, enable it:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>systemctl status nvidia-persistenced
sudo systemctl enable nvidia-persistenced&lt;/code>&lt;/pre>&lt;p>Some Linux distributions have a udev rule that brings hot-plugged memory online as soon as it&amp;rsquo;s detected, preventing NVIDIA software from correctly configuring GPU memory on Power9.&lt;/p>&lt;ol start="3">&lt;li>To disable this rule, run the following commands:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>sudo cp /lib/udev/rules.d/40-redhat.rules /etc/udev/rules.d/
sudo sed -i &amp;#39;s/SUBSYSTEM!=&amp;#34;memory&amp;#34;,.*GOTO=&amp;#34;memory_hotplug_end&amp;#34;/SUBSYSTEM==&amp;#34;*&amp;#34;, GOTO=&amp;#34;memory_hotplug_end&amp;#34;/&amp;#39; /etc/udev/rules.d/40-redhat.rules&lt;/code>&lt;/pre>&lt;h4 id="installation-check">Installation Check&lt;/h4>&lt;p>After completing all these steps, let&amp;rsquo;s reboot the machine and verify the installations:&lt;/p>&lt;ol>&lt;li>Reboot the machine:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>sudo reboot&lt;/code>&lt;/pre>&lt;ol start="2">&lt;li>Check the NVIDIA driver:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>nvidia-smi&lt;/code>&lt;/pre>&lt;p>The output of the command above should display the installed driver version and the CUDA version it supports. 
It should also list available devices (GPUs) with details like name, memory, temperature, and other information.&lt;/p>&lt;p>To perform the final check, let&amp;rsquo;s download the &lt;code>cuda-samples&lt;/code> repository and run the device test.&lt;/p>&lt;ol start="3">&lt;li>Download the repository and check out the &lt;code>cuda-samples&lt;/code> version matching the installed CUDA:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples/Samples/1_Utilities/deviceQuery
git checkout v12.2&lt;/code>&lt;/pre>&lt;ol start="4">&lt;li>To build and run the tests:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>make
./deviceQuery&lt;/code>&lt;/pre>&lt;p>After running this test, you should see &lt;code>Result = PASS&lt;/code> in the last line. This confirms that the Power9 is set up with the NVIDIA driver and CUDA working correctly.&lt;/p>&lt;h2 id="setting-up-the-cudnn">Setting up the CUDNN&lt;/h2>&lt;ol>&lt;li>First, we need to download and install the &lt;code>.rpm&lt;/code> package specific to &lt;code>ppc64le&lt;/code>.&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>wget https://developer.download.nvidia.com/compute/cudnn/9.0.0/local_installers/cudnn-local-repo-rhel8-9.0.0-1.0-1.ppc64le.rpm
sudo rpm -i cudnn-local-repo-rhel8-9.0.0-1.0-1.ppc64le.rpm
sudo dnf clean all
sudo dnf -y install cudnn&lt;/code>&lt;/pre>&lt;ol start="2">&lt;li>After installing, set the &lt;code>CUDNN_LIBRARY&lt;/code> and &lt;code>CUDNN_INCLUDE_DIR&lt;/code> environment variables by adding these lines to your &lt;code>.bashrc&lt;/code>:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>echo &amp;#39;export CUDNN_LIBRARY=/usr/lib64&amp;#39; &amp;gt;&amp;gt; ~/.bashrc
echo &amp;#39;export CUDNN_INCLUDE_DIR=/usr/include&amp;#39; &amp;gt;&amp;gt; ~/.bashrc&lt;/code>&lt;/pre>&lt;p>After that, the CUDNN installation process is complete.&lt;/p>&lt;p>This is the first part of our tutorial. 
Once you&amp;rsquo;ve finished all the steps in this post, the server will be ready to install the &lt;code>conda&lt;/code> package manager and the &lt;code>pytorch&lt;/code> library. You can access the second part of this tutorial at this &lt;a href="https://llm-pt-ibm.github.io/en/posts/tutorial_power9_pt2_en/">&lt;span class="link-personalizado">link&lt;/span>&lt;/a>.&lt;/p></description></item><item><title>Evaluating Small-Scale LLMs (up to 8B) on PT-BR Benchmarks</title><link>https://llm-pt-ibm.github.io/en/posts/experimentos_benchmarks_pt_br/</link><pubDate>Mon, 02 Jun 2025 00:00:00 +0000</pubDate><guid>https://llm-pt-ibm.github.io/en/posts/experimentos_benchmarks_pt_br/</guid><description>&lt;h2 id="background">Background&lt;/h2>&lt;p>This is the first of two posts in this series, aimed at providing a summary of the investigation we conducted using the &lt;a href="https://github.com/stanford-crfm/helm" rel="external">&lt;span class="link-personalizado">HELM&lt;/span>&lt;/a> (&lt;em>Holistic Evaluation of Language Models&lt;/em>) evaluation framework to assess the &lt;a href="https://huggingface.co/ibm-granite" rel="external">&lt;span class="link-personalizado">Granite&lt;/span>&lt;/a> family of models, the &lt;a href="https://huggingface.co/meta-llama/Llama-3.1-8B" rel="external">&lt;span class="link-personalizado">Llama-3.1-8B&lt;/span>&lt;/a> model, and the &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B" rel="external">&lt;span class="link-personalizado">DeepSeek-R1-Distill-Llama-3.1-8B&lt;/span>&lt;/a> model. The evaluations cover both Portuguese-language benchmarks and code generation tasks. In this first part, the focus is on evaluating model performance in Brazilian Portuguese (PT-BR) for &lt;strong>sentiment analysis&lt;/strong> and &lt;strong>MQA&lt;/strong> (&lt;em>Multiple-Choice Question Answering&lt;/em>) tasks. 
The second part, to be published soon, will present the evaluation results for code generation tasks.&lt;/p>&lt;p>The use of English-language datasets for evaluating language models is common practice. However, to evaluate these models across different languages and cultural contexts, it is important to test them on benchmarks in other languages. In the case of PT-BR, which typically represents a smaller share of the data used to train multilingual models, understanding model behavior is an important step in evaluating their suitability for tasks and contexts specific to this language. In this sense, this post aims to contribute to that understanding by highlighting both the advances and the remaining challenges in these LLMs’ performance on tasks in the PT-BR context.&lt;/p>&lt;h2 id="tldr">TL;DR&lt;/h2>&lt;div style="text-align: justify;">&lt;ul>&lt;li>We evaluated the Granite, Llama-3.1-8B, and DeepSeek-R1-Distill-Llama-3.1-8B models on the ENEM Challenge, TweetSent-Br, and IMDB benchmarks.&lt;/li>&lt;li>Our method involved experimentation supported by the HELM framework, which we describe in detail in this document.&lt;/li>&lt;li>The results show that the models accurately classify sentiments in movie reviews in PT-BR.&lt;/li>&lt;/ul>&lt;/div>&lt;h2 id="method">Method&lt;/h2>&lt;h3 id="execution-environment-and-tool-used">Execution Environment and Tool Used&lt;/h3>&lt;p>We used HELM as the evaluation tool. HELM is an LLM evaluation framework developed by researchers at Stanford University. It includes a variety of benchmarks, such as sentiment analysis, code generation, and multiple-choice question answering. Using these benchmarks, we evaluated and compared the performance of the Granite (8B), Llama-3.1-8B, and DeepSeek-R1-Distill-Llama-3.1-8B models.&lt;/p>&lt;p>For running the experiments, we used Google Colab as the environment, which provides access to an A100 GPU. 
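&lt;/p>&lt;p>To give a concrete picture, a HELM evaluation is typically launched from the command line with &lt;code>helm-run&lt;/code> and summarized with &lt;code>helm-summarize&lt;/code>. The sketch below is illustrative only: the model identifier and suite name are placeholders, and the exact run-entry and scenario names depend on the HELM version installed, so check the HELM documentation before running it:&lt;/p>&lt;pre tabindex="0">&lt;code>pip install crfm-helm
helm-run --run-entries &amp;#34;enem_challenge:model=your-org/your-model&amp;#34; --suite my-ptbr-suite --max-eval-instances 100
helm-summarize --suite my-ptbr-suite&lt;/code>&lt;/pre>&lt;p>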
In this setup, we were able to clone the HELM repository and run models with 8 billion parameters. All configuration and testing were carried out on this platform, ensuring convenience and access to the necessary computational resources.&lt;/p>&lt;p>In a future post, we will go into more detail about LLM evaluation strategies and tools, with a deeper focus on HELM’s capabilities and operation.&lt;/p>&lt;h3 id="benchmarks-and-models">Benchmarks and Models&lt;/h3>&lt;p>To run tests in Brazilian Portuguese scenarios, it was necessary to extend HELM by adding new benchmarks, since the tool did not previously support this language. This effort represented a direct contribution to HELM, adding three benchmarks:&lt;/p>&lt;ul>&lt;li>&lt;p>&lt;a href="https://huggingface.co/datasets/eduagarcia/enem_challenge" rel="external">&lt;span class="link-personalizado">&lt;strong>ENEM Challenge&lt;/strong>&lt;/span>&lt;/a>: built from questions from the Exame Nacional do Ensino Médio (ENEM), designed to evaluate LLMs&amp;rsquo; ability to handle MQA tasks across various knowledge areas, including Humanities, Natural Sciences, Languages, and Mathematics.&lt;/p>&lt;/li>&lt;li>&lt;p>&lt;a href="https://huggingface.co/datasets/eduagarcia/tweetsentbr_fewshot" rel="external">&lt;span class="link-personalizado">&lt;strong>TweetSent-Br&lt;/strong>&lt;/span>&lt;/a>: composed of tweets, specifically for sentiment analysis tasks. The dataset is organized into three main classes: positive (tweets expressing a positive reaction about the main topic), negative (tweets expressing a negative reaction), and neutral (tweets that don’t fit the other categories).&lt;/p>&lt;/li>&lt;li>&lt;p>&lt;a href="https://huggingface.co/datasets/maritaca-ai/imdb_pt" rel="external">&lt;span class="link-personalizado">&lt;strong>IMDB&lt;/strong>&lt;/span>&lt;/a>: made up of movie reviews written in Brazilian Portuguese. 
This benchmark also focuses on sentiment classification tasks, but uses longer-form review texts, in contrast to TweetSent-Br’s shorter posts.&lt;/p>&lt;/li>&lt;/ul>&lt;p>As for the models, selection was guided by compatibility with the available execution environment, as well as by citation relevance and performance. This included the Granite family of models developed by IBM; the Llama models from Meta; and the DeepSeek-R1-Distill-Llama-8B, a compact, optimized version derived from Llama 3.1. This choice enabled a fair and practical comparison among the models.&lt;/p>&lt;h2 id="results">Results&lt;/h2>&lt;p>Below, we present the results obtained, along with charts developed by the team to make it easier to visualize and understand the models’ performance on the evaluated tasks.&lt;/p>&lt;ul>&lt;li>&lt;strong>ENEM Challenge&lt;/strong>:&lt;/li>&lt;/ul>&lt;!--&lt;div style="text-align: center;"> &lt;img src="https://llm-pt-ibm.github.io/images/experimentos_benchmarks_pt_br_image001.png" style="max-width: 90%;">&lt;/div> -->&lt;figure>&lt;img src="https://llm-pt-ibm.github.io/images/experimentos_benchmarks_pt_br_image001.png" alt="Chart of results on the ENEM Challenge"/>&lt;figcaption> &lt;p>Chart of results on the ENEM Challenge&lt;/p> &lt;/figcaption>&lt;/figure>&lt;p>The results indicate that the models showed similar performance, with a slight advantage for Llama. The models achieved an average accuracy of 62.53%, suggesting that while they demonstrate some level of understanding of the questions, they still lack sufficient ability to answer ENEM exam questions satisfactorily. 
Improvement is still needed, particularly in reasoning and interpretation in Portuguese.&lt;/p>&lt;ul>&lt;li>&lt;strong>TweetSent-Br&lt;/strong>:&lt;/li>&lt;/ul>&lt;!--&lt;div style="text-align: center;"> &lt;img src="https://llm-pt-ibm.github.io/images/experimentos_benchmarks_pt_br_image002.png" style="max-width: 90%;">&lt;/div> -->&lt;figure>&lt;img src="https://llm-pt-ibm.github.io/images/experimentos_benchmarks_pt_br_image002.png" alt="Chart of results on the TweetSent-Br"/>&lt;figcaption> &lt;p>Chart of results on the TweetSent-Br&lt;/p> &lt;/figcaption>&lt;/figure>&lt;p>In this benchmark, as observed with the ENEM Challenge, the results were also similar across models. This reinforces the view that there are still gaps in model performance on sentiment classification tasks in Portuguese. Classifying a message as positive, negative, or neutral remains a challenge for these models, especially given the nuances and ambiguities of the language.&lt;/p>&lt;ul>&lt;li>&lt;strong>IMDB&lt;/strong>:&lt;/li>&lt;/ul>&lt;!--&lt;div style="text-align: center;"> &lt;img src="https://llm-pt-ibm.github.io/images/experimentos_benchmarks_pt_br_image003.png" style="max-width: 90%;">&lt;/div>-->&lt;figure>&lt;img src="https://llm-pt-ibm.github.io/images/experimentos_benchmarks_pt_br_image003.png" alt="Chart of results on the IMDB"/>&lt;figcaption> &lt;p>Chart of results on the IMDB&lt;/p> &lt;/figcaption>&lt;/figure>&lt;p>In the IMDB benchmark, the results were quite positive. The models achieved accuracy rates above 90%, demonstrating strong performance in sentiment classification. The highlight was the Granite model with 8B parameters, which showed a slight advantage over the others. 
These results indicate that the models can easily categorize movie reviews in Portuguese, showing greater proficiency in this type of task.&lt;/p>&lt;h2 id="conclusion">Conclusion&lt;/h2>&lt;p>This study provided a clearer view of the performance of language models in PT-BR through evaluation on three different benchmarks. The results show that the models analyzed have reasonable performance when selecting an answer in ENEM knowledge areas, while also indicating that there is still room for improvement. On the other hand, in the IMDB sentiment analysis task, these smaller-scale models demonstrated good classification ability.&lt;/p>&lt;p>The team plans, in future studies, to conduct experiments with larger-scale models to enable broader comparisons of performance and efficiency. This will allow for a more detailed analysis of the errors made by each model, contributing to a deeper understanding of their strengths and limitations.&lt;/p></description></item><item><title>Performing CPU Inference on Power10</title><link>https://llm-pt-ibm.github.io/en/posts/power10/</link><pubDate>Sun, 06 Apr 2025 00:00:00 +0000</pubDate><guid>https://llm-pt-ibm.github.io/en/posts/power10/</guid><description>&lt;h2 id="background">Background&lt;/h2>&lt;p>In this post, we will share our experience running the Granite-20b-Code-Instruct model on a Power10 machine, describing the challenges and the necessary configurations to perform inference using Llama.cpp, one of the most popular open-source libraries in this domain.&lt;/p>&lt;h2 id="tldr">TL;DR&lt;/h2>&lt;ul>&lt;li>This post provides details on how to set up and run inference using IBM Power10 infrastructure.&lt;/li>&lt;li>Our main challenge was configuring Llama.cpp, which required adjustments such as installing Ninja-builder, compiling OpenBLAS, and updating the C compiler.&lt;/li>&lt;/ul>&lt;h2 id="infrastructure">Infrastructure&lt;/h2>&lt;p>Inference was performed on a machine with IBM POWER10 architecture, equipped with 750 GB 
of RAM and running Red Hat Enterprise Linux 8.10. Access to the environment was provided through a VM, requiring the use of a VPN to establish secure and controlled communication with the system, enabling remote and efficient execution of activities.&lt;/p>&lt;h2 id="initial-setup">Initial Setup&lt;/h2>&lt;p>The library that enables running LLMs using CPU resources is Llama.cpp. To set it up, we needed to resolve two external dependencies: Ninja-builder and OpenBLAS. Ninja-builder optimizes the compilation process, while OpenBLAS is a high-performance library for matrix computations.&lt;/p>&lt;p>During the OpenBLAS build process, we identified discrepancies in the internal tests validating matrix calculations, indicating a compatibility problem with the available C compiler, which was an older version (8.5.0). The solution was to &lt;strong>update the compiler to a newer version, 13.2&lt;/strong>, ensuring better compatibility with the Power10 architecture and validating the accuracy of the numerical operations required for Llama.cpp. 
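&lt;/p>&lt;p>A quick way to confirm which compiler is active before and after the update (a generic sanity check; the toolset name below assumes the &lt;code>gcc-toolset-13&lt;/code> software collection used in the steps that follow):&lt;/p>&lt;pre tabindex="0">&lt;code>gcc --version
scl enable gcc-toolset-13 -- gcc --version&lt;/code>&lt;/pre>&lt;p>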
Below, we present the step-by-step process used to enable the compilation of the required libraries and update the C compiler.&lt;/p>&lt;ol>&lt;li>Creating the build environment for the builder&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>sudo dnf update -y &amp;amp;&amp;amp; dnf -y groupinstall &amp;#39;Development Tools&amp;#39; &amp;amp;&amp;amp; dnf install -y \
    cmake git ninja-build-debugsource.ppc64le \
    &amp;amp;&amp;amp; dnf clean all&lt;/code>&lt;/pre>&lt;ol start="2">&lt;li>Updating the C Compiler and Setting Environment Variables&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>scl enable gcc-toolset-13 bash
export CC=/usr/bin/gcc-13
export CXX=/usr/bin/g++-13&lt;/code>&lt;/pre>&lt;ol start="3">&lt;li>Downloading and Building OpenBLAS&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>git clone --recursive https://github.com/DanielCasali/OpenBLAS.git &amp;amp;&amp;amp; cd OpenBLAS &amp;amp;&amp;amp; \
    make -j$(nproc --all) TARGET=POWER10 DYNAMIC_ARCH=1 &amp;amp;&amp;amp; \
    make PREFIX=/opt/OpenBLAS install &amp;amp;&amp;amp; \
    cd /&lt;/code>&lt;/pre>&lt;ol start="4">&lt;li>Downloading and Building Llama.cpp using the OpenBLAS library&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>git clone https://github.com/DanielCasali/llama.cpp.git &amp;amp;&amp;amp; cd llama.cpp &amp;amp;&amp;amp; \
    sed -i &amp;#34;s/powerpc64le/native -mvsx -mtune=native -D__POWER10_VECTOR__/g&amp;#34; ggml/src/CMakeLists.txt &amp;amp;&amp;amp; \
    mkdir build; \
    cd build; \
    cmake -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS -DBLAS_INCLUDE_DIRS=/opt/OpenBLAS/include -G Ninja ..; \
    cmake --build . --config Release&lt;/code>&lt;/pre>&lt;p>With all these steps completed successfully, the environment was properly configured and optimized for running Llama.cpp locally. 
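&lt;/p>&lt;p>As a quick check that the build produced the expected binaries (the path below assumes the repository was cloned under &lt;code>/root&lt;/code>; adjust it to your clone location):&lt;/p>&lt;pre tabindex="0">&lt;code>ls /root/llama.cpp/build/bin/ | grep llama&lt;/code>&lt;/pre>&lt;p>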
We are now able to start a server to perform inference with LLMs efficiently, using only CPU resources.&lt;/p>&lt;h2 id="performing-inference">Performing Inference&lt;/h2>&lt;p>We chose the Granite-20b-code-instruct model in the .GGUF format, which is specifically designed to optimize the performance of language models in CPU-only environments. These models are quantized, meaning their calculation precision is reduced, which in turn lowers their size and memory consumption, making them ideal for efficient execution with Llama.cpp. This approach enables high-performance local inference even on processor-only architectures such as POWER10. The model was downloaded directly from Hugging Face. Below, we show the step-by-step process to download it:&lt;/p>&lt;ol>&lt;li>Create a directory for the model in Llama.cpp:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>mkdir -p /root/llama.cpp/models/granite-20b-code-instruct-8k-GGUF&lt;/code>&lt;/pre>&lt;ol start="2">&lt;li>Access the directory in Llama.cpp:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>cd /root/llama.cpp/models/granite-20b-code-instruct-8k-GGUF&lt;/code>&lt;/pre>&lt;ol start="3">&lt;li>Download the model from Hugging Face:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>wget https://huggingface.co/ibm-granite/granite-20b-code-instruct-8k-GGUF/resolve/main/granite-20b-code-instruct.Q4_K_M.gguf&lt;/code>&lt;/pre>&lt;p>The last step can take a while, depending on the model’s number of parameters. However, once the steps above are completed, we can start a Llama.cpp server to perform inference. By default, the server is exposed on port 8080 of the Power10 machine, but this is fully customizable. 
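&lt;/p>&lt;p>For example, to expose the server on a different port, you can pass the &lt;code>--port&lt;/code> flag (a sketch; run &lt;code>llama-server --help&lt;/code> to confirm the options available in your build):&lt;/p>&lt;pre tabindex="0">&lt;code>/root/llama.cpp/build/bin/llama-server --host 0.0.0.0 --port 8081 \
    --model /root/llama.cpp/models/granite-20b-code-instruct-8k-GGUF/granite-20b-code-instruct.Q4_K_M.gguf&lt;/code>&lt;/pre>&lt;p>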
The following code illustrates how to configure and run the Llama server:&lt;/p>&lt;pre tabindex="0">&lt;code>/root/llama.cpp/build/bin/llama-server --host 0.0.0.0 --model /root/llama.cpp/models/granite-20b-code-instruct-8k-GGUF/granite-20b-code-instruct.Q4_K_M.gguf&lt;/code>&lt;/pre>&lt;p>With the Llama.cpp server running on port 8080, we can now perform inference via HTTP requests. In this example, for simplicity, we use curl to make the requests:&lt;/p>&lt;pre tabindex="0">&lt;code>curl -X POST http://localhost:8080/completion \
    -H &amp;#34;Content-Type: application/json&amp;#34; \
    -d &amp;#39;{ &amp;#34;prompt&amp;#34;: &amp;#34;Make a hello world program in Java. Your answer should be in Java code only.&amp;#34;, &amp;#34;max_tokens&amp;#34;: 100 }&amp;#39;&lt;/code>&lt;/pre>&lt;p>Below is an example of how the response is returned:&lt;/p>&lt;pre tabindex="0">&lt;code>{
  &amp;#34;content&amp;#34;: &amp;#34;public class HelloWorld { public static void main(String[] args) { System.out.println(\&amp;#34;Hello, World!\&amp;#34;); } }&amp;#34;
}&lt;/code>&lt;/pre>&lt;p>With this setup, we are now able to perform inference on CPU. Our upcoming posts will focus on running these inferences using the HELM (&lt;em>Holistic Evaluation of Language Models&lt;/em>) framework as the intermediary.&lt;/p></description></item><item><title>Introduction</title><link>https://llm-pt-ibm.github.io/en/posts/introducao/</link><pubDate>Wed, 12 Mar 2025 00:00:00 +0000</pubDate><guid>https://llm-pt-ibm.github.io/en/posts/introducao/</guid><description>&lt;p>Welcome to the blog of the partnership between the &lt;strong>Federal University of Campina Grande (UFCG)&lt;/strong> and &lt;strong>IBM&lt;/strong>!&lt;/p>&lt;p>This space brings together articles, tutorials, and research results produced by our team across different projects. 
Each project focuses on a distinct area of research:&lt;/p>&lt;ul>&lt;li>&lt;strong>&lt;a href="https://llm-pt-ibm.github.io/en/projects/llm-eval/">LLM Evaluation&lt;/a>&lt;/strong> — evaluation of large language models, with a focus on benchmarks for Brazilian Portuguese.&lt;/li>&lt;li>&lt;strong>&lt;a href="https://llm-pt-ibm.github.io/projects/agents-ai">AgentOps&lt;/a>&lt;/strong> — development of AI agents capable of autonomously performing multiple tasks.&lt;/li>&lt;li>&lt;strong>&lt;a href="https://llm-pt-ibm.github.io/en/projects/judo-ai/">Judo-AI&lt;/a>&lt;/strong> — use of AI models for analysis of judo matches and training sessions, applying computer vision and deep learning techniques for movement detection and action recognition.&lt;/li>&lt;li>&lt;strong>&lt;a href="https://llm-pt-ibm.github.io/en/projects/5g/">5G&lt;/a>&lt;/strong> — integration of AI techniques in 5G network environments, with intelligent control, optimization, and network management mechanisms.&lt;/li>&lt;li>&lt;strong>&lt;a href="https://llm-pt-ibm.github.io/en/projects/multiarq/">MultiArq&lt;/a>&lt;/strong> — provisioning of common tools for new architectures (ppc64le), seeking and adapting specific tools and creating technical documentation about the architecture.&lt;/li>&lt;/ul>&lt;p>Browse the posts and follow the latest updates!&lt;/p></description></item></channel></rss>