<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>TensorFlow on IBM UFCG</title><link>https://llm-pt-ibm.github.io/en/tags/tensorflow/</link><description>Recent content in TensorFlow on IBM UFCG</description><generator>Hugo -- gohugo.io</generator><language>en</language><copyright>IBM &amp; UFCG - 2025</copyright><lastBuildDate>Mon, 04 May 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://llm-pt-ibm.github.io/en/tags/tensorflow/index.xml" rel="self" type="application/rss+xml"/><item><title>Running VMs with KubeVirt on IBM Power9 (ppc64le)</title><link>https://llm-pt-ibm.github.io/en/posts/kubevirt-post/</link><pubDate>Fri, 22 May 2026 00:00:00 +0000</pubDate><guid>https://llm-pt-ibm.github.io/en/posts/kubevirt-post/</guid><description>&lt;h2 id="context">Context&lt;/h2>&lt;p>This post aims to present the process of adapting &lt;a href="https://kubevirt.io/" rel="external">KubeVirt&lt;/a> for the IBM POWER9 (ppc64le) architecture. It covers the main challenges encountered, the modifications made to the source code, the role of each component, and the results obtained at the end of the process.&lt;/p>&lt;p>KubeVirt is an operator that extends Kubernetes to manage virtual machines (VMs) as native resources. In traditional environments, VMs are managed by tools like libvirt/virsh, separate from the container ecosystem. KubeVirt eliminates this separation: with it, you can create, start, stop, and monitor VMs using the same Kubernetes commands and workflows — &lt;code>kubectl&lt;/code>, YAML, namespaces, and RBAC. VMs run as real QEMU/KVM processes inside pods managed by Kubernetes.&lt;/p>&lt;p>The motivation for this work arose in the context of the Multiarq project, which maintains a shared HPC infrastructure on IBM POWER9. 
The ability to manage VMs and containers in the same Kubernetes cluster simplifies environment administration and opens the door to scenarios such as GPU passthrough for AI/ML workloads inside VMs, isolation of research environments, and multi-architecture compatibility testing.&lt;/p>&lt;p>The main challenge is that KubeVirt &lt;strong>does not officially support ppc64le&lt;/strong>. Only x86_64 (amd64), arm64, and s390x are supported. This means the build system, API validations, configuration defaults, and the libvirt domain generation pipeline do not recognize ppc64le, defaulting everything to amd64.&lt;/p>&lt;h2 id="tldr">TL;DR&lt;/h2>&lt;ul>&lt;li>KubeVirt does not officially support ppc64le; the entire pipeline assumes amd64 as a fallback.&lt;/li>&lt;li>We compiled Go binaries directly, bypassing the Bazel build system that does not recognize the architecture.&lt;/li>&lt;li>Patches were required across ~14 Go files and 4 new files were created to add ppc64le support.&lt;/li>&lt;li>Docker images were built with custom Dockerfiles and served via a local registry.&lt;/li>&lt;li>With these adaptations, it was possible to run a CirrOS ppc64le VM via KubeVirt on POWER9, managed entirely by Kubernetes.&lt;/li>&lt;/ul>&lt;h2 id="execution-environment">Execution Environment&lt;/h2>&lt;ul>&lt;li>&lt;strong>Architecture:&lt;/strong> IBM Power9 server (ppc64le).&lt;/li>&lt;li>&lt;strong>Operating System:&lt;/strong> AlmaLinux 8.10, binary compatible with RHEL 8.9/8.10.&lt;/li>&lt;li>&lt;strong>GPUs:&lt;/strong> 4x NVIDIA Tesla V100-SXM2-16GB.&lt;/li>&lt;li>&lt;strong>Docker:&lt;/strong> Docker CE 26.1.3.&lt;/li>&lt;li>&lt;strong>Kubernetes:&lt;/strong> v1.35.0 via minikube v1.38.0 (docker driver, containerd runtime).&lt;/li>&lt;li>&lt;strong>KubeVirt:&lt;/strong> v1.8.2.&lt;/li>&lt;li>&lt;strong>Go:&lt;/strong> 1.24.9.&lt;/li>&lt;/ul>&lt;h2 id="what-kubevirt-is-and-how-it-works">What KubeVirt Is and How It Works&lt;/h2>&lt;p>KubeVirt is composed of several components 
that work together to translate a Kubernetes resource (the VirtualMachineInstance, or VMI) into a real QEMU/KVM VM running on the host.&lt;/p>&lt;p>The &lt;strong>virt-operator&lt;/strong> is the entry point: when the administrator creates the KubeVirt Custom Resource in the cluster, the operator provisions all other components — deployments, daemonsets, services, RBAC. It acts as a permanent installer that reconciles the desired state.&lt;/p>&lt;p>The &lt;strong>virt-api&lt;/strong> handles Kubernetes API calls for KubeVirt resources. When the user runs &lt;code>kubectl apply&lt;/code> on a VMI, virt-api validates the YAML (e.g., is the architecture supported? does the machine type exist?) and injects defaults (e.g., firmware UUID, CPU topology).&lt;/p>&lt;p>The &lt;strong>virt-controller&lt;/strong> watches VMIs and decides where they should run. It creates a special pod — the virt-launcher — on the appropriate node, with all necessary configurations (volumes, devices, node selectors).&lt;/p>&lt;p>The &lt;strong>virt-handler&lt;/strong> runs as a DaemonSet (one per node) and is the local agent that bridges Kubernetes and libvirt/QEMU. When the virt-launcher pod appears on the node, virt-handler reads the VMI spec, generates the libvirt domain XML, and instructs libvirt to create the VM. It also registers device plugins with the kubelet (&lt;code>/dev/kvm&lt;/code>, &lt;code>/dev/net/tun&lt;/code>, &lt;code>/dev/vhost-net&lt;/code>) so pods can access the required devices.&lt;/p>&lt;p>The &lt;strong>virt-launcher&lt;/strong> is the pod that encapsulates the VM. Each VMI generates a dedicated pod with three containers: compute (QEMU + libvirt), guest-console-log, and container disk. 
Inside the compute container, the QEMU process runs the actual VM — with its own kernel, memory, and virtual CPU.&lt;/p>&lt;p>The full flow is:&lt;/p>&lt;ol>&lt;li>&lt;code>kubectl apply&lt;/code> → API Server → virt-api (validates, injects defaults)&lt;/li>&lt;li>virt-controller detects the VMI → creates the virt-launcher pod on the appropriate node&lt;/li>&lt;li>kubelet starts the virt-launcher pod on the node&lt;/li>&lt;li>virt-handler detects the pod → reads the VMI spec → generates libvirt XML → calls libvirt&lt;/li>&lt;li>libvirt starts QEMU → VM runs inside the pod&lt;/li>&lt;/ol>&lt;p>It is important to note that the VM &lt;strong>does not become a container&lt;/strong> — it runs as a real QEMU process inside a pod. Kubernetes manages the pod lifecycle, and KubeVirt translates between the two worlds.&lt;/p>&lt;h2 id="challenges-and-adaptations">Challenges and Adaptations&lt;/h2>&lt;h3 id="build-system">Build System&lt;/h3>&lt;p>KubeVirt&amp;rsquo;s build system uses Bazel, which does not recognize ppc64le. The &lt;code>format_archname&lt;/code> function in the build script only accepts &lt;code>x86_64&lt;/code>, &lt;code>aarch64&lt;/code>, and &lt;code>s390x&lt;/code>. The solution was to compile the Go binaries directly with &lt;code>go build&lt;/code>, bypassing Bazel.&lt;/p>&lt;p>An additional dependency is &lt;strong>libnbd&lt;/strong>: virt-launcher requires version 1.18+, but AlmaLinux 8 only provides 1.6. It was necessary to compile libnbd 1.20 from source. The &lt;strong>container-disk&lt;/strong> component is a C program (not Go) that requires static compilation to run in &lt;code>FROM scratch&lt;/code> containers.&lt;/p>&lt;h3 id="api-validation">API Validation&lt;/h3>&lt;p>The virt-api validation webhook rejects VMIs with an unknown architecture. Without the patch, a VMI with &lt;code>architecture: ppc64le&lt;/code> would be rejected before even reaching the scheduler. 
It was necessary to add cases in the admitter and create a specific validation function for ppc64le.&lt;/p>&lt;h3 id="configuration-defaults">Configuration Defaults&lt;/h3>&lt;p>KubeVirt needs to know which machine type to use for each architecture (e.g., &lt;code>pc-q35&lt;/code> for amd64, &lt;code>virt&lt;/code> for arm64). For ppc64le, we configured &lt;code>pseries&lt;/code>, the virtual machine type for POWER, as the default machine type.&lt;/p>&lt;h3 id="libvirt-domain-generation">Libvirt Domain Generation&lt;/h3>&lt;p>This was the central challenge. KubeVirt converts the VMI spec into a libvirt domain XML, which libvirt uses to configure and launch QEMU. This pipeline has two parts:&lt;/p>&lt;p>The &lt;strong>arch-defaulter&lt;/strong> sets default OS type values (&lt;code>arch&lt;/code> and &lt;code>machine&lt;/code>) in the XML. Without the patch, it returned &lt;code>x86_64&lt;/code> for ppc64le, causing libvirt to attempt to create an x86 VM on a POWER machine, which failed with the error &lt;code>No emulator found for arch 'x86_64'&lt;/code>.&lt;/p>&lt;p>The &lt;strong>converter&lt;/strong> is an interface with ~12 methods that define architecture-specific behaviors: whether USB is needed, SMBIOS, PCIe placement, ROM tuning, etc. Implementations existed for amd64, arm64, and s390x, but not for ppc64le, so the code fell back to &lt;code>converterAMD64&lt;/code>, generating incompatible configurations. We created &lt;code>converterPPC64LE&lt;/code> with values appropriate for POWER: no USB, no SMBIOS, no PCIe placement, with VirtIO as the disk model.&lt;/p>&lt;p>After resolving the converter, a &lt;strong>USB device&lt;/strong> error appeared: the graphics/video pipeline had no case for ppc64le, causing libvirt to add a default VGA video device that depended on USB, even though the USB controller was disabled (&lt;code>IsUSBNeeded: false&lt;/code>). 
The solution was to add a ppc64le case in the video configurator with &lt;code>virtio&lt;/code> as the video device, following the same pattern as s390x and arm64.&lt;/p>&lt;p>Finally, the &lt;strong>CPU model&lt;/strong>: KubeVirt uses &lt;code>host-model&lt;/code> as the default, which does not work with nested virtualization on POWER9. The solution was to specify &lt;code>POWER9&lt;/code> as the CPU model in the VMI.&lt;/p>&lt;h3 id="docker-images">Docker Images&lt;/h3>&lt;p>With no Dockerfiles in the project (everything is generated by Bazel), we created custom Dockerfiles for each component. The simpler components (virt-operator, virt-api, virt-controller, virt-exportproxy) use &lt;code>ubi8/ubi-minimal&lt;/code> as the base. virt-handler requires additional system tools. virt-launcher is the most complex: it uses &lt;code>almalinux:8&lt;/code> as the base and depends on qemu-kvm, libvirt, and the compiled libnbd. A local registry (&lt;code>registry:2&lt;/code> on port 5000) serves the images to minikube.&lt;/p>&lt;h3 id="technical-step-by-step-guide">Technical Step-by-Step Guide&lt;/h3>&lt;p>Due to the large number of steps involved, a detailed step-by-step guide with all patches to the Go code, Dockerfiles, compilation commands, and configuration is available at this link: &lt;a href="https://github.com/llm-pt-ibm/kubevirt-ppc64le" rel="external">kubevirt-ppc64le-installation-guide&lt;/a>.&lt;/p>&lt;h2 id="results">Results&lt;/h2>&lt;p>With all adaptations applied, it was possible to run a CirrOS ppc64le VM via KubeVirt on POWER9, managed entirely by Kubernetes:&lt;/p>&lt;pre tabindex="0">&lt;code>$ kubectl get vmi test-vmi -o wide
NAME       AGE     PHASE     IP               NODENAME   READY
test-vmi   2m43s   Running   10.244.120.124   minikube   True&lt;/code>&lt;/pre>&lt;p>Data collected from inside the VM confirms correct 
execution:&lt;/p>&lt;table>&lt;thead>&lt;tr>&lt;th>&lt;/th>&lt;th>Value&lt;/th>&lt;/tr>&lt;/thead>&lt;tbody>&lt;tr>&lt;td>Architecture&lt;/td>&lt;td>ppc64le&lt;/td>&lt;/tr>&lt;tr>&lt;td>CPU&lt;/td>&lt;td>POWER9 (architected), altivec supported&lt;/td>&lt;/tr>&lt;tr>&lt;td>Hypervisor&lt;/td>&lt;td>KVM&lt;/td>&lt;/tr>&lt;tr>&lt;td>Platform&lt;/td>&lt;td>pSeries&lt;/td>&lt;/tr>&lt;tr>&lt;td>Model&lt;/td>&lt;td>IBM pSeries (emulated by qemu)&lt;/td>&lt;/tr>&lt;tr>&lt;td>Kernel&lt;/td>&lt;td>5.15.0-71-generic ppc64le&lt;/td>&lt;/tr>&lt;/tbody>&lt;/table>&lt;p>These results confirm that KubeVirt is generating the correct libvirt domain for ppc64le, with machine type &lt;code>pseries&lt;/code>, POWER9 CPU, and KVM/QEMU virtualization with VirtIO paravirtualization.&lt;/p>&lt;h2 id="final-considerations">Final Considerations&lt;/h2>&lt;p>With the adaptations made, it became possible to use KubeVirt to create and manage virtual machines on an IBM POWER9 via Kubernetes. The VM that runs is a real KVM/QEMU VM — with its own kernel, isolated memory, and virtual CPU — managed like any other Kubernetes resource.&lt;/p>&lt;p>In the context of the Multiarq project, this solution allows unifying the management of containers and VMs in the same cluster, simplifying administration of the shared infrastructure. Workloads that require kernel isolation or direct hardware access (such as GPU passthrough) can run in VMs without leaving the Kubernetes ecosystem.&lt;/p>&lt;p>The patches made are potentially contributable to the upstream KubeVirt project. KubeVirt&amp;rsquo;s architecture already provides for extensibility by architecture — the interface pattern (Converter, ArchDefaulter) and per-arch switches make it straightforward to add new platforms. 
ppc64le follows the same pattern as s390x, which was also added to the project at a later stage.&lt;/p>&lt;h2 id="next-steps">Next Steps&lt;/h2>&lt;ul>&lt;li>Resolve the USB/Graphics conflict to allow VNC without the &lt;code>autoattachGraphicsDevice: false&lt;/code> workaround, enabling graphical access to VMs;&lt;/li>&lt;li>Adjust the default CPU model in the code so that ppc64le automatically uses &lt;code>POWER9&lt;/code> without requiring manual specification in the VMI;&lt;/li>&lt;li>Explore GPU passthrough of the V100s via KubeVirt to run AI/ML workloads inside VMs managed by Kubernetes;&lt;/li>&lt;li>Test other distributions as containerDisk (Fedora, Ubuntu Server, AlmaLinux ppc64le) to validate compatibility beyond CirrOS;&lt;/li>&lt;li>Configure masquerade networking to enable live migration between nodes;&lt;/li>&lt;li>Document the changes in PR format for contribution to the KubeVirt upstream;&lt;/li>&lt;li>Validate KubeVirt on the Single Node OpenShift (OCP 4.21) already installed on the machine, using OpenShift Virtualization as the operator.&lt;/li>&lt;/ul></description></item><item><title>TensorFlow 2.21 CPU on IBM Power9 (ppc64le)</title><link>https://llm-pt-ibm.github.io/en/posts/post_tf221_power9_en/</link><pubDate>Mon, 04 May 2026 00:00:00 +0000</pubDate><guid>https://llm-pt-ibm.github.io/en/posts/post_tf221_power9_en/</guid><description>&lt;h2 id="context">Context&lt;/h2>&lt;p>TensorFlow (TF) is one of the most widely adopted machine learning frameworks. 
However, in 2021 Google ended official support for pre-compiled binaries for the ppc64le architecture, and the &lt;code>tensorflow/community&lt;/code> repository was archived in 2025.&lt;/p>&lt;h2 id="environment-used">Environment Used&lt;/h2>&lt;ul>&lt;li>&lt;strong>Hardware:&lt;/strong> ppc64le architecture;&lt;/li>&lt;li>&lt;strong>RAM:&lt;/strong> ~64GB;&lt;/li>&lt;li>&lt;strong>Execution:&lt;/strong> Virtual Machine (VM);&lt;/li>&lt;li>&lt;strong>Operating System:&lt;/strong> AlmaLinux 8.10 (ppc64le), binary compatible with Red Hat Enterprise Linux (RHEL) 8.9/8.10.&lt;/li>&lt;/ul>&lt;h2 id="initial-setup-installing-tf-214">Initial Setup (Installing TF 2.14)&lt;/h2>&lt;p>As a starting point, we validated the installation of TensorFlow 2.14.1 (via RocketCE) on an IBM Power9 VM (ppc64le architecture) with AlmaLinux, using Miniforge (conda). The installation commands are:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>conda create -n tf214 python&lt;span style="color:#f92672">=&lt;/span>3.11 -y&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>conda activate tf214&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>conda install -c rocketce tensorflow-cpu&lt;span style="color:#f92672">=&lt;/span>2.14.1 -y&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># expected output: 2.14.1&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>python -c &lt;span style="color:#e6db74">&amp;#34;import tensorflow as tf; print(tf.__version__)&amp;#34;&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The result should be a working TensorFlow 2.14.1 installation. 
This same version is also available on the Open-CE channels of Oregon State University and MIT. With TF 2.14 working, we have access to: Keras, TensorBoard, TensorFlow Hub, tensorflow-text, Hugging Face Transformers, Jupyter, and the entire classic ML stack.&lt;/p>&lt;h2 id="tf-214-vs-tf-221-the-latest">TF 2.14 vs TF 2.21 (The Latest)&lt;/h2>&lt;p>Version 2.14 is functional but is several versions behind the latest, 2.21. The most significant differences focus on incompatibility with two very important tools:&lt;/p>&lt;ul>&lt;li>&lt;strong>Keras 3:&lt;/strong> a complete rewrite that transforms Keras into a multi-backend framework, allowing the same model and code to run on TensorFlow, PyTorch, or JAX without any changes. TF 2.14 only supports Keras 2.&lt;/li>&lt;li>&lt;strong>NumPy 2:&lt;/strong> In addition to correcting dozens of historical API inconsistencies, NumPy 2.0 brings significant efficiency gains. TF 2.14 does not support NumPy 2.&lt;/li>&lt;/ul>&lt;h2 id="compiling-tensorflow-221-natively-on-power9-cpu-only">Compiling TensorFlow 2.21 Natively on Power9 (CPU-Only)&lt;/h2>&lt;p>Initially, we successfully compiled TensorFlow 2.21 (CPU-Only) directly from source code. This compilation was performed on an IBM Power9 VM and generated a native &lt;code>.whl&lt;/code> package for &lt;code>linux_ppc64le&lt;/code>. Subsequently, TF 2.21 had its functionality validated through a complete suite of tests. This is a fundamental milestone upon which GPU support will be built in the next stage.&lt;/p>&lt;h2 id="challenges-hermeticity-and-x86-dependency">Challenges: Hermeticity and x86 Dependency&lt;/h2>&lt;p>The modern architecture of TensorFlow (and its build system, Bazel 7) embraced the &amp;ldquo;Hermetic&amp;rdquo; model: forcing the use of pre-compiled binaries and logic tied to x86_64, aarch64 architectures, and NVIDIA accelerators. 
For ppc64le, this means that a naive compilation simply fails when trying to download tools for incompatible architectures.&lt;/p>&lt;h3 id="we-identified-four-categories-of-blockage">We identified four categories of blockage:&lt;/h3>&lt;ol>&lt;li>&lt;strong>Bazel 7:&lt;/strong> Google does not distribute Bazel 7 for PowerPC. It had to be compiled from source.&lt;/li>&lt;li>&lt;strong>Hermetic Toolchains:&lt;/strong> TF 2.21 tries to download pre-compiled LLVM/Clang for x86 or aarch64, which doesn&amp;rsquo;t run on Power9.&lt;/li>&lt;li>&lt;strong>CUDA/GPU Dependencies:&lt;/strong> Even in CPU-only mode, the build system tries to download and link large NVIDIA libraries. Our strategy was to completely isolate GPU support with empty stubs, ensuring a stable CPU-only foundation before adding any accelerators.&lt;/li>&lt;li>&lt;strong>Latent C++ Bugs:&lt;/strong> XLA and MLIR code contains constructs that work in Google&amp;rsquo;s Clang but break in the system&amp;rsquo;s default GCC 8.5, from AVX-512 flags to template ambiguities in &lt;code>absl::NoDestructor&lt;/code>.&lt;/li>&lt;/ol>&lt;h2 id="compilation-process">Compilation Process&lt;/h2>&lt;h3 id="step-1-compiling-bazel-710-from-scratch">Step 1: Compiling Bazel 7.1.0 from Scratch&lt;/h3>&lt;p>Since Google does not distribute Bazel 7 for ppc64le, the first step was to compile Bazel itself from source, using the &lt;code>-dist.zip&lt;/code> archive, which already includes the bootstrap artifacts Bazel needs to build itself without a pre-existing Bazel binary. The process requires Java 21 and takes between 1 and 2 hours depending on the cores available in the VM. The critical point here is passing the correct variables to the &lt;code>compile.sh&lt;/code> script. Without this step, none of the following steps are possible. The &lt;code>bazel build&lt;/code> command simply doesn&amp;rsquo;t exist for ppc64le otherwise. 
We created a &lt;a href="https://github.com/llm-pt-ibm/TensorFlow-2.21-Power9/blob/main/bazel/tutorial_bazel7_power9.md" rel="external">tutorial&lt;/a> with the Bazel 7.1 installation process which can be accessed in the repository.&lt;/p>&lt;h3 id="step-2-bypass-strategy--stub-repositories">Step 2: Bypass Strategy — Stub Repositories&lt;/h3>&lt;p>With Bazel 7 functional on ppc64le architecture, we attacked the problem of hermetic dependencies. Our solution was to create &amp;ldquo;stub&amp;rdquo; repositories, empty local directories that satisfy Bazel&amp;rsquo;s dependency declarations without downloading anything:&lt;/p>&lt;ul>&lt;li>&lt;strong>LLVM stubs:&lt;/strong> Empty filegroups that satisfy toolchain rules without trying to install LLVM.&lt;/li>&lt;li>&lt;strong>CUDA/ROCm/TensorRT stubs:&lt;/strong> Empty C++ libraries and Starlark rules that allow the build to proceed without missing dependency errors.&lt;/li>&lt;li>&lt;strong>PyPI stubs:&lt;/strong> Stub Python modules that simulate the dependencies of Google&amp;rsquo;s hermetic pip, forcing the use of libraries from the conda environment.&lt;/li>&lt;li>&lt;strong>Python stub:&lt;/strong> Redirects to the Python in our conda environment, bypassing the download of the hermetic Python that doesn&amp;rsquo;t exist for ppc64le.&lt;/li>&lt;/ul>&lt;p>All stubs are injected via &lt;code>--override_repository&lt;/code> in the &lt;code>bazel build&lt;/code> call, without altering the TensorFlow source code.&lt;/p>&lt;figure>&lt;img src="https://llm-pt-ibm.github.io/images/tf221_bypass_strategy.png" alt="Bypass Strategy"/>&lt;figcaption> &lt;p>Bypass Strategy — Stub Repositories&lt;/p> &lt;/figcaption>&lt;/figure>&lt;h3 id="step-3-surgical-patches-in-the-source-code">Step 3: Surgical Patches in the Source Code&lt;/h3>&lt;p>With the build infrastructure resolved, we found 21 incompatibilities in TensorFlow&amp;rsquo;s C++ and Python code that manifest exclusively in the GCC 13 + ppc64le combination. 
The problems fell into three categories:&lt;/p>&lt;ol>&lt;li>Clang-exclusive compilation flags that GCC rejects.&lt;/li>&lt;li>C++ template ambiguities in XLA and MLIR components that Google&amp;rsquo;s compiler masks but GCC 13 exposes.&lt;/li>&lt;li>References to CUDA and TensorRT headers that no longer exist once the real repositories are replaced by stubs.&lt;/li>&lt;/ol>&lt;p>Each incompatibility was resolved with a precise Python patch, without altering TensorFlow&amp;rsquo;s functional logic. The &lt;a href="https://github.com/llm-pt-ibm/TensorFlow-2.21-Power9/blob/main/cpu/tutorial_tf221_power9.md" rel="external">complete table with all 21 patches&lt;/a> is available in the repository.&lt;/p>&lt;h3 id="step-4-the-compilation">Step 4: The Compilation&lt;/h3>&lt;p>With all patches applied, the final compilation is triggered with a single &lt;code>bazel build&lt;/code> command. In addition to standard optimization flags, the command injects all stub repositories via &lt;code>--override_repository&lt;/code>, totaling about 80 flags. Bazel&amp;rsquo;s incremental cache is fundamental here: each time a patch is needed and compilation is resumed, only the affected targets are recompiled. This transformed the &amp;ldquo;patch → compile → error → patch&amp;rdquo; cycle from infeasible to manageable (about 4 hours).&lt;/p>&lt;h2 id="the-definitive-solution-conda-package-and-binaries-ready-for-use">The Definitive Solution: Conda Package and Binaries (Ready for Use)&lt;/h2>&lt;p>So that the community doesn&amp;rsquo;t need to redo all this build engineering, we packaged the result into a &amp;ldquo;plug and play&amp;rdquo; solution.&lt;/p>&lt;p>We made an &lt;a href="https://github.com/llm-pt-ibm/tensorflow/releases/tag/v2.21.0-cpu-only" rel="external">official Release&lt;/a> available in the repository containing the source code with all patches already applied and the generated native &lt;code>.whl&lt;/code> binary. 
More importantly: we created and published a complete Conda recipe that automatically resolves classic C++ library compatibility issues (GLIBCXX and GCC mismatch) common on Power9.&lt;/p>&lt;p>Now, native TensorFlow 2.21 can be installed directly through our Conda channel, providing the same installation experience as official corporate distributions.&lt;/p>&lt;h2 id="how-to-install-quick-tutorial">How to Install (Quick Tutorial)&lt;/h2>&lt;p>To use TensorFlow 2.21 in your Power9 environment immediately, simply run:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>conda create -n tf221 python&lt;span style="color:#f92672">=&lt;/span>3.11 -y&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>conda activate tf221&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>conda install -c ufcg-ibm -c conda-forge tensorflow-cpu&lt;span style="color:#f92672">=&lt;/span>2.21.0 -y&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>A &lt;a href="https://github.com/llm-pt-ibm/TensorFlow-2.21-Power9/blob/main/cpu/install-tensorflow-ppc64le.md" rel="external">detailed installation tutorial via Conda&lt;/a> is also available in our repository.&lt;/p>&lt;h2 id="functional-result-on-ibm-power9-server">Functional Result on IBM Power9 server&lt;/h2>&lt;p>We installed the final package and executed a complete suite of 35 tests, covering eight functional categories: from basic tensor operations to model save/load and stress tests. All 35 tests passed. 
The stress test (5000×5000 matrix multiplication) successfully executed on the IBM Power9 CPU, and training an MLP for 20 epochs confirmed loss convergence, indicating that automatic differentiation, optimizers, and numerical operations are all working correctly from end to end.&lt;/p>&lt;h2 id="ibm-tools-using-tensorflow">IBM Tools using TensorFlow&lt;/h2>&lt;p>IBM AI tools like AIF360, AIX360, and ART were already compatible with TensorFlow 2.14, as they are Python libraries that use the environment&amp;rsquo;s TF without binary coupling. The real value of native TensorFlow 2.21 compiled for Power9 lies in continuity: these libraries were already starting to declare dependencies on TF versions higher than 2.14, which meant that without this build, the Power9 environment would remain stuck on old and unsupported versions. Additionally, the improvements accumulated in TF between versions 2.14 and 2.21 bring incremental performance gains to fairness, explainability, and adversarial robustness analysis pipelines.&lt;/p>&lt;h2 id="reproducibility-and-materials">Reproducibility and Materials&lt;/h2>&lt;p>The entire process and generated artifacts are documented and available in our repository:&lt;/p>&lt;ul>&lt;li>&lt;strong>&lt;a href="https://github.com/llm-pt-ibm/tensorflow/releases/tag/v2.21.0-cpu-only" rel="external">Official Release&lt;/a>:&lt;/strong> Altered source code and ready-to-use &lt;code>.whl&lt;/code> binary.&lt;/li>&lt;li>&lt;strong>&lt;a href="https://github.com/llm-pt-ibm/TensorFlow-2.21-Power9/blob/main/cpu/install-tensorflow-ppc64le.md" rel="external">Conda Installation Tutorial&lt;/a>:&lt;/strong> Practical guide to install version 2.21 directly through our Conda channel.&lt;/li>&lt;li>&lt;strong>&lt;a href="https://github.com/llm-pt-ibm/TensorFlow-2.21-Power9/blob/main/bazel/tutorial_bazel7_power9.md" rel="external">Bazel 7.1.0 Compilation Tutorial&lt;/a>:&lt;/strong> Step-by-step guide to compile Bazel 7.1.0 from 
source.&lt;/li>&lt;li>&lt;strong>&lt;a href="https://github.com/llm-pt-ibm/TensorFlow-2.21-Power9/blob/main/cpu/tutorial_tf221_power9.md" rel="external">TensorFlow 2.21 Compilation Tutorial&lt;/a>:&lt;/strong> Full guide to compile TensorFlow 2.21 with all necessary patches.&lt;/li>&lt;/ul>&lt;h2 id="impact">Impact&lt;/h2>&lt;p>This compilation represents the latest version of TensorFlow natively available for ppc64le and with it:&lt;/p>&lt;ul>&lt;li>&lt;strong>Keras 3&lt;/strong> becomes available for ppc64le for the first time.&lt;/li>&lt;li>&lt;strong>NumPy 2.0&lt;/strong> ceases to be a bottleneck for the Python scientific ecosystem on IBM Power9.&lt;/li>&lt;li>&lt;strong>Hugging Face Transformers stack&lt;/strong> with more models compatible with Power9.&lt;/li>&lt;/ul>&lt;h2 id="next-steps">Next Steps&lt;/h2>&lt;p>The TF 2.21 we compiled runs exclusively on CPU. The next challenge is to repeat the process with CUDA enabled on IBM Power9 servers equipped with NVIDIA GPUs. The stubs we created to isolate the GPU in this compilation were designed precisely to facilitate this transition: by replacing them with real CUDA libraries, we will have a solid starting point for GPU compilation. If successful, Power9 would have the latest deep learning framework with hardware acceleration, something non-existent today in any distribution for ppc64le.&lt;/p></description></item><item><title>LLM Inference with Ollama on IBM Power9 Using GPU</title><link>https://llm-pt-ibm.github.io/en/posts/ollama_gpu/</link><pubDate>Thu, 16 Apr 2026 00:00:00 +0000</pubDate><guid>https://llm-pt-ibm.github.io/en/posts/ollama_gpu/</guid><description>&lt;h2 id="context">Context&lt;/h2>&lt;p>This is the second post in the series about language model inference on POWER9 with &lt;a href="https://ollama.com/" rel="external">&lt;span class="link-personalizado">&lt;em>Ollama&lt;/em>&lt;/span>&lt;/a>. 
In this article, we will cover how to send requests using GPU, achieving a significant performance gain compared to the CPU approach shown in the &lt;a href="https://llm-pt-ibm.github.io/en/posts/ollama_cpu/" rel="external">&lt;span class="link-personalizado">&lt;em>previous post&lt;/em>&lt;/span>&lt;/a>.&lt;/p>&lt;p>The main challenge is that Ollama does not offer official support for the &lt;em>ppc64le&lt;/em> architecture with &lt;a href="https://developer.nvidia.com/cuda-12-2-0-download-archive?target_os=Linux&amp;target_arch=ppc64le&amp;Distribution=RHEL&amp;target_version=8&amp;target_type=rpm_local" rel="external">&lt;span class="link-personalizado">&lt;em>CUDA&lt;/em>&lt;/span>&lt;/a>. The solution was found through an &lt;a href="https://community.ibm.com/community/user/blogs/andrey-klyachkin/2025/03/06/run-ollama-on-almalinux-ppc64le-ibm-power" rel="external">&lt;span class="link-personalizado">&lt;em>official IBM community blog&lt;/em>&lt;/span>&lt;/a>, where a contributor made a &lt;a href="https://github.com/naveedus/ollama-ppc64le" rel="external">&lt;span class="link-personalizado">&lt;em>fork&lt;/em>&lt;/span>&lt;/a> of Ollama adapted to support NVIDIA GPUs on POWER9 via CUDA. However, that fork is outdated and does not support newer models like Gemma 3 and DeepSeek.&lt;/p>&lt;p>Therefore, we developed an &lt;a href="https://github.com/llm-pt-ibm/ollama-ppc64le" rel="external">&lt;span class="link-personalizado">&lt;em>updated fork&lt;/em>&lt;/span>&lt;/a>, based on the official Ollama (v0.23.2), with the necessary patches for ppc64le and GPU support via CUDA. 
This tutorial explains how to compile Ollama for the ppc64le architecture, and for those who don&amp;rsquo;t want to compile, we also provide a &lt;a href="https://github.com/llm-pt-ibm/ollama-ppc64le/releases/tag/v0.23.2-ppc64le-power9" rel="external">&lt;span class="link-personalizado">&lt;em>pre-compiled binary&lt;/em>&lt;/span>&lt;/a> in the &lt;em>releases&lt;/em> on GitHub.&lt;/p>&lt;h2 id="tldr">TL;DR&lt;/h2>&lt;ul>&lt;li>This post presents details on setting up the environment to perform inferences using IBM POWER9 infrastructure;&lt;/li>&lt;li>Ollama does not offer official support for &lt;em>ppc64le&lt;/em> with CUDA;&lt;/li>&lt;li>The fork was compiled from scratch using CMake and Go, pointing to CUDA 12.2 and specifying the V100 architecture (&lt;code>sm_70&lt;/code>);&lt;/li>&lt;li>A pre-compiled binary is also available on the project&amp;rsquo;s GitHub;&lt;/li>&lt;li>With this, it was possible to run LLM inference on IBM POWER9 with GPU acceleration and support for recent models.&lt;/li>&lt;/ul>&lt;h2 id="environment-used">Environment Used&lt;/h2>&lt;p>&lt;strong>Hardware&lt;/strong>:&lt;/p>&lt;ul>&lt;li>&lt;em>ppc64le&lt;/em> architecture;&lt;/li>&lt;li>Recommended minimum RAM: ~64GB;&lt;/li>&lt;li>GPU: NVIDIA Tesla V100;&lt;/li>&lt;li>NVIDIA driver: 535.54.03;&lt;/li>&lt;li>CUDA: version 12.2.&lt;/li>&lt;/ul>&lt;p>&lt;strong>Operating System:&lt;/strong> Alma Linux 8.10 (&lt;em>ppc64le&lt;/em>), binary compatible with &lt;em>Red Hat Enterprise Linux (RHEL)&lt;/em> 8.9/8.10.&lt;/p>&lt;h2 id="initial-checks">Initial Checks&lt;/h2>&lt;ol>&lt;li>Verify that the driver and GPU are visible:&lt;/li>&lt;/ol>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>nvidia-smi&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ol start="2">&lt;li>Verify that CUDA is 
installed:&lt;/li>&lt;/ol>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>nvcc --version&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Note: if &lt;code>nvcc&lt;/code> is not found, add CUDA 12.2 to the &lt;code>PATH&lt;/code>, point &lt;code>CUDACXX&lt;/code> to its compiler, and run the check again:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>export PATH&lt;span style="color:#f92672">=&lt;/span>/usr/local/cuda-12.2/bin:$PATH&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>export CUDACXX&lt;span style="color:#f92672">=&lt;/span>/usr/local/cuda-12.2/bin/nvcc&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ol start="3">&lt;li>Also verify that the CUDA 12 installation directory exists:&lt;/li>&lt;/ol>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>ls -la /usr/local/cuda-12&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="running-in-a-virtual-environment">Running in a Virtual Environment&lt;/h2>&lt;p>In this tutorial, we perform the configuration inside a Conda virtual environment to isolate execution and settings. 
This is optional but recommended.&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>conda create -n ollamaGPU python&lt;span style="color:#f92672">=&lt;/span>3.11 -y&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>conda activate ollamaGPU&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>To deactivate the environment:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>conda deactivate&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="initial-setup">Initial Setup&lt;/h2>&lt;p>To compile Ollama on POWER9, the following dependencies with appropriate versions are required:&lt;/p>&lt;ul>&lt;li>&lt;strong>Go:&lt;/strong> 1.26.0&lt;/li>&lt;li>&lt;strong>GCC:&lt;/strong> 11.2.1 (via gcc-toolset-11)&lt;/li>&lt;li>&lt;strong>CMake:&lt;/strong> &amp;gt;= 3.24&lt;/li>&lt;/ul>&lt;h2 id="cloning-and-building-ollama">Cloning and Building Ollama&lt;/h2>&lt;p>With the environment configured, we can build Ollama. The compilation uses CMake to generate CUDA kernels with &lt;code>nvcc&lt;/code>, and Go to compile the binary. An important detail is the &lt;code>CUDA_ARCHITECTURES=70&lt;/code> parameter: each NVIDIA GPU has a specific architecture identified by an &lt;code>sm_XX&lt;/code> code, and the V100 is from the Volta architecture (&lt;code>sm_70&lt;/code>). 
By specifying this value, we instruct the build to compile only for the V100, reducing compilation time.&lt;/p>&lt;p>The complete step-by-step compilation, including the necessary fixes for ppc64le, as well as installation and configuration of the dependencies mentioned earlier, is documented in the &lt;a href="https://github.com/llm-pt-ibm/ollama-ppc64le/blob/ollama-ppc64le/README_POWER9.md" rel="external">&lt;span class="link-personalizado">&lt;em>repository&amp;rsquo;s README&lt;/em>&lt;/span>&lt;/a>.&lt;/p>&lt;p>For those who don&amp;rsquo;t want to compile, a pre-compiled binary is available directly from the &lt;a href="https://github.com/llm-pt-ibm/ollama-ppc64le/releases/tag/v0.23.2-ppc64le-power9" rel="external">&lt;span class="link-personalizado">&lt;em>releases&lt;/em>&lt;/span>&lt;/a> page:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># Download the binary&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>wget https://github.com/llm-pt-ibm/ollama-ppc64le/releases/download/v0.23.2-ppc64le-power9/ollama-ppc64le&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># Give execute permission&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>chmod +x ollama-ppc64le&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;strong>Note:&lt;/strong> The repository contains branches of the official Ollama. 
The patches for ppc64le are exclusively in the &lt;code>ollama-ppc64le&lt;/code> branch.&lt;/p>&lt;h2 id="performing-inference">Performing Inference&lt;/h2>&lt;p>With Ollama compiled, we can start the server. If you downloaded the pre-compiled binary instead, it is named &lt;code>ollama-ppc64le&lt;/code>, so use &lt;code>./ollama-ppc64le&lt;/code> in place of &lt;code>./ollama&lt;/code> in the commands below:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>./ollama serve&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>To verify that the server process is running, use: &lt;code>ps aux | grep ollama&lt;/code>.&lt;/p>&lt;p>Wait a few seconds and check the logs to confirm the server detected the GPUs correctly. Look for a line like this:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>inference compute ... library&lt;span style="color:#f92672">=&lt;/span>CUDA compute&lt;span style="color:#f92672">=&lt;/span>7.0 ... description&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;Tesla V100-SXM2-16GB&amp;#34;&lt;/span> total&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;16.0 GiB&amp;#34;&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="download-the-test-model-and-run-inference">Download the test model and run inference&lt;/h2>&lt;p>For validation, we used the &lt;code>llama3.1:8b&lt;/code> model. 
In another terminal, run:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>./ollama pull llama3.1:8b&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>To run inference:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>./ollama run llama3.1:8b &lt;span style="color:#e6db74">&amp;#34;tell me all odd numbers up to 100&amp;#34;&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="confirm-gpu-usage">Confirm GPU usage&lt;/h2>&lt;p>In another terminal, with inference running, run:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>nvidia-smi&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>In the processes section, you should see &lt;code>ollama&lt;/code> with memory allocated on one of the GPUs:&lt;/p>&lt;figure>&lt;img src="https://llm-pt-ibm.github.io/images/ollama_gpu.png" alt="Figure 1"/>&lt;figcaption> &lt;p>Ollama using the GPU&lt;/p> &lt;/figcaption>&lt;/figure>&lt;h2 id="final-considerations">Final Considerations&lt;/h2>&lt;p>With the steps presented, it was possible to configure the environment to run LLM inference on an IBM POWER9 machine using NVIDIA Tesla V100 GPUs. With this approach, model inference has a significant performance gain compared to CPU execution. 
Using the Meta Llama 3.1 8B Instruct model as a reference, GPU execution achieved a higher token generation rate than CPU execution.&lt;/p>&lt;p>Let&amp;rsquo;s look at the collected data for the same request (&lt;code>tell me all odd numbers up to 100&lt;/code>) with both types of execution:&lt;/p>&lt;table>&lt;thead>&lt;tr>&lt;th>&lt;/th>&lt;th>CPU&lt;/th>&lt;th>GPU&lt;/th>&lt;/tr>&lt;/thead>&lt;tbody>&lt;tr>&lt;td>Token generation rate&lt;/td>&lt;td>0.71 tokens/s&lt;/td>&lt;td>79.82 tokens/s&lt;/td>&lt;/tr>&lt;tr>&lt;td>Total duration&lt;/td>&lt;td>3m49s&lt;/td>&lt;td>4.52s&lt;/td>&lt;/tr>&lt;tr>&lt;td>Prompt evaluation rate&lt;/td>&lt;td>10.67 tokens/s&lt;/td>&lt;td>295.77 tokens/s&lt;/td>&lt;/tr>&lt;/tbody>&lt;/table>&lt;p>The table shows that GPU execution was approximately 112 times faster in token generation (79.82 vs. 0.71 tokens/s), with total response time reduced from 3 minutes and 49 seconds to 4.52 seconds.&lt;/p>&lt;h2 id="next-steps">Next Steps&lt;/h2>&lt;ul>&lt;li>Compare GPU and CPU execution in a dedicated post, including other architectures;&lt;/li>&lt;li>Test GPU inference with larger models, for example with more than 8 billion parameters;&lt;/li>&lt;li>Test new models available in the updated fork, such as Gemma 3 and DeepSeek.&lt;/li>&lt;/ul></description></item><item><title>LLM Inference with vLLM Using GPU on Power9</title><link>https://llm-pt-ibm.github.io/en/posts/vllm_gpu/</link><pubDate>Fri, 10 Apr 2026 00:00:00 +0000</pubDate><guid>https://llm-pt-ibm.github.io/en/posts/vllm_gpu/</guid><description>&lt;h2 id="background">Background&lt;/h2>&lt;p>This post presents the steps needed to install &lt;a href="https://github.com/vllm-project/vllm" rel="external">&lt;span class="link-personalizado">vLLM&lt;/span>&lt;/a> in an IBM POWER9 environment (ppc64le architecture). 
The main required resources, modifications, dependencies, versions used, and installation steps necessary to run inference with a given model will be detailed.&lt;/p>&lt;p>vLLM is a tool focused on serving and efficient inference of large language models (LLMs), allowing models to be exposed through an API and execute inference in an optimized way, especially in GPU environments.&lt;/p>&lt;p>The need to install vLLM arose during the data generation process with the &lt;a href="https://github.com/instructlab" rel="external">&lt;span class="link-personalizado">InstructLab&lt;/span>&lt;/a> tool. In that workflow, it is necessary to use a teacher model to generate synthetic data that will later be used for training or fine-tuning other models. For this, it is possible to use tools such as llama-cpp, already compatible with the IBM POWER9 environment, or vLLM, which was not yet available due to installation difficulties on this architecture. Unlike llama-cpp, which is more geared towards local execution and smaller-scale scenarios, vLLM stands out for better GPU utilization and the ability to handle multiple requests simultaneously in an efficient manner, being more suitable for large-scale inference scenarios and production environments.&lt;/p>&lt;p>Thus, we will present the technical steps required to make the vLLM installation feasible in the IBM POWER9 environment (ppc64le), describing the adaptations made so that the tool works correctly in this context.&lt;/p>&lt;h2 id="tldr">TL;DR&lt;/h2>&lt;ul>&lt;li>Compilation and installation of LLVM, required as build infrastructure for subsequent dependencies.&lt;/li>&lt;li>Compilation and adaptation of Triton, including adjustments for compatibility with the Power9 architecture.&lt;/li>&lt;li>Installation and configuration of vLLM, considering its dependencies and specific runtime requirements.&lt;/li>&lt;li>Development of containers containing the entire configured environment for executing the 
tool.&lt;/li>&lt;li>Practical demonstration of using the images, including server startup and running inference on GPU.&lt;/li>&lt;/ul>&lt;h2 id="execution-environment">Execution Environment&lt;/h2>&lt;p>The environment used for the vLLM installation includes:&lt;/p>&lt;ul>&lt;li>&lt;strong>Architecture:&lt;/strong> IBM Power9 Server (ppc64le architecture).&lt;/li>&lt;li>&lt;strong>Operating System (OS):&lt;/strong> AlmaLinux 8.10, binary compatible with Red Hat Enterprise Linux (RHEL) 8.9/8.10.&lt;/li>&lt;li>&lt;strong>RAM:&lt;/strong> 512GB.&lt;/li>&lt;li>&lt;strong>GPUs:&lt;/strong> 4x NVIDIA Tesla V100 SXM2 16GB (NVLink2).&lt;/li>&lt;/ul>&lt;h2 id="dependencies-and-installation">Dependencies and Installation&lt;/h2>&lt;p>During the vLLM build process, three main dependencies stand out: &lt;a href="https://github.com/llvm/llvm-project" rel="external">&lt;span class="link-personalizado">LLVM&lt;/span>&lt;/a>, &lt;a href="https://github.com/triton-lang/triton" rel="external">&lt;span class="link-personalizado">Triton&lt;/span>&lt;/a>, and &lt;a href="https://github.com/pytorch/pytorch" rel="external">&lt;span class="link-personalizado">PyTorch&lt;/span>&lt;/a>. The tool depends on all three to work correctly, and they are the problematic part of the installation because none of them offers native ppc64le support.&lt;/p>&lt;p>LLVM is the foundation of the compilation infrastructure used throughout the process: it generates and optimizes an intermediate representation and lowers it to executable low-level code. In the context of vLLM, its role is essential to enable efficient execution of GPU kernels, especially those defined by Triton, which rely directly on its compilation backends (components responsible for generating optimized code for different hardware architectures). Triton, in turn, is the component responsible for defining and executing GPU-optimized kernels, playing a central role in the inference efficiency of language models. 
Its integration with LLVM allows generating highly optimized code for different architectures. PyTorch provides the foundation for tensor manipulation and model execution, offering the fundamental operations for GPU inference, in addition to serving as an interface to acceleration mechanisms and low-level libraries.&lt;/p>&lt;figure>&lt;img src="https://llm-pt-ibm.github.io/images/vllm_dependencias.png" alt="Figure 1"/>&lt;figcaption> &lt;p>Dependency flow for compiling vLLM on Power9.&lt;/p> &lt;/figcaption>&lt;/figure>&lt;p>Due to the lack of native support for these packages on the ppc64le architecture, their use on IBM POWER9 required several adaptations based on the official repositories of these tools. These modifications ranged from fixing incompatibilities in specific methods to adjusting sub-dependencies that did not support the ppc64le architecture, as well as using Conda to help manage environments and dependencies. In some cases, manual compilation of additional components was also necessary. After overcoming these challenges, it became possible to install and run vLLM on the IBM POWER9 environment.&lt;/p>&lt;p>Because of the large number of steps involved, the detailed step-by-step procedure is documented in the &lt;a href="https://github.com/llm-pt-ibm/vllm_gpu/blob/main/manual-installation-guide.md" rel="external">&lt;span class="link-personalizado">vLLM installation guide&lt;/span>&lt;/a>. It is worth noting that each of the steps described is essential to guarantee the correct compilation and execution of vLLM in the proposed environment.&lt;/p>&lt;h2 id="containerization">Containerization&lt;/h2>&lt;p>During the installation process, we observed that the large number of steps involved could make the environment hard to reproduce and lead to inconsistent setups. 
Because of this, we chose containerization of the solution as a way to make the experiment reproducible, portable, and simpler to use for other users.&lt;/p>&lt;p>For this, we provide (in this &lt;a href="https://github.com/llm-pt-ibm/vllm_gpu" rel="external">&lt;span class="link-personalizado">repository&lt;/span>&lt;/a>) scripts responsible for both building the images and automating execution, organizing all necessary steps. These scripts perform tasks such as identifying available resources, copying required CUDA binaries, and starting vLLM properly.&lt;/p>&lt;p>Execution was simplified so that the user only needs to provide the local path of the model to be used. Parameters such as port, number of GPUs, and image to be executed are optional and have predefined default values.&lt;/p>&lt;figure>&lt;img src="https://llm-pt-ibm.github.io/images/repositorio_git_vllm.png" alt="Figure 2"/>&lt;figcaption> &lt;p>Repository developed for running vLLM via containers.&lt;/p> &lt;/figcaption>&lt;/figure>&lt;p>Additionally, we provide a video (&lt;a href="https://drive.google.com/file/d/1chIuklLfjQBMMu6XlgpohP08rTW04Keg/view" rel="external">&lt;span class="link-personalizado">vLLM Power9 demonstration&lt;/span>&lt;/a>) that demonstrates the use of vLLM from the provided repository.&lt;/p>&lt;h2 id="final-considerations">Final Considerations&lt;/h2>&lt;p>With the resources provided in this repository, it became possible to automate the process of installing and using vLLM on ppc64le architectures with V100 GPUs.&lt;/p>&lt;p>In the context of the IBM-MultiArq project, this solution proves especially relevant for using InstructLab, enabling local execution of teacher models via vLLM, expanding experimentation and development possibilities within the proposed environment.&lt;/p>&lt;h2 id="next-steps">Next Steps&lt;/h2>&lt;p>As a continuation of this work, we propose conducting a comparative performance study between llama-cpp and vLLM. 
Additionally, the repository was structured to provide continuous support for vLLM, including its adaptation to future versions, the identification of remaining limitations, and the evolution of solutions as new challenges arise.&lt;/p></description></item><item><title>Installing Docker in a ppc64le (Power9) Environment</title><link>https://llm-pt-ibm.github.io/en/posts/post_docker/</link><pubDate>Wed, 01 Apr 2026 00:00:00 +0000</pubDate><guid>https://llm-pt-ibm.github.io/en/posts/post_docker/</guid><description>&lt;h2 id="context">Context&lt;/h2>&lt;p>Given the need to standardize software execution on our IBM Power9 (ppc64le) server, containers are a robust solution for avoiding environment conflicts. This post continues the work of structuring our infrastructure by detailing the installation of Docker Engine on AlmaLinux. Adopting this technology is strategically important for ensuring strict dependency isolation and portability across applications. With it, we can package everything from general-purpose libraries to more complex services, ensuring a clean, secure, and highly reproducible runtime environment.&lt;/p>&lt;p>Docker Engine has official support for AlmaLinux on the x86_64, arm64, s390x, and ppc64le architectures, which allows us to use it directly on Power9 without special adaptations. 
However, some care is required before and during installation, such as uninstalling tools that conflict with Docker and ensuring the images used are compatible with ppc64le.&lt;/p>&lt;h2 id="tldr">TL;DR&lt;/h2>&lt;ul>&lt;li>This post presents the step-by-step process for installing Docker Engine on AlmaLinux in the ppc64le architecture.&lt;/li>&lt;li>You must remove Podman and Buildah before installing, because they conflict with Docker.&lt;/li>&lt;li>Docker Hub images need explicit ppc64le support to work on Power9.&lt;/li>&lt;/ul>&lt;h2 id="environment-used">Environment Used&lt;/h2>&lt;ul>&lt;li>&lt;strong>Architecture&lt;/strong>: IBM Power9 server (ppc64le architecture)&lt;/li>&lt;li>&lt;strong>Operating System (OS)&lt;/strong>: AlmaLinux 8.10 binary compatible with Red Hat Enterprise Linux (RHEL) 8.9/8.10&lt;/li>&lt;li>&lt;strong>RAM&lt;/strong>: 512GB&lt;/li>&lt;/ul>&lt;h2 id="prerequisites">Prerequisites&lt;/h2>&lt;p>Before installing Docker, it is important to be aware of a firewall limitation: when exposing container ports with Docker, those ports bypass the default firewalld rules. Make sure this does not pose a problem for your environment before proceeding. It is also important to note that Docker Engine is compatible with Rocky Linux 8 and 9 and AlmaLinux 8 on the ppc64le architecture.&lt;/p>&lt;h2 id="removing-conflicting-packages">Removing Conflicting Packages&lt;/h2>&lt;p>AlmaLinux includes Podman and Buildah by default. These packages conflict with Docker Engine and must be removed before installation. 
It is also recommended to remove any older Docker versions that might be present:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>sudo dnf remove -y podman &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> buildah &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> docker &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> docker-client &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> docker-client-latest &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> docker-common &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> docker-latest &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> docker-latest-logrotate &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> docker-logrotate &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> docker-engine&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="adding-the-docker-repository-and-installing-required-packages">Adding the Docker Repository and Installing Required Packages&lt;/h2>&lt;h3 id="repository-setup">Repository 
Setup&lt;/h3>&lt;p>The recommended installation method is to use Docker&amp;rsquo;s official repository. It is worth mentioning that Docker uses the CentOS repository for RHEL-based distributions such as AlmaLinux, and this is officially supported. First, install the dnf-plugins-core package and add the repository:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>sudo dnf install -y dnf-plugins-core&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>sudo dnf config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="installing-docker-engine">Installing Docker Engine&lt;/h3>&lt;p>With the repository configured, install the latest version of Docker Engine along with the build and compose plugins:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>sudo dnf install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="starting-the-service">Starting the service&lt;/h3>&lt;p>Unlike Debian-based distributions such as Ubuntu, Docker does not start automatically on AlmaLinux after installation. 
You need to start the service manually and enable it so it comes up with the system:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>sudo systemctl start docker&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>sudo systemctl enable docker&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="verifying-the-installation">Verifying the Installation&lt;/h2>&lt;p>To confirm that everything was installed correctly, run the hello-world image. Docker will automatically detect the ppc64le architecture and pull the correct image:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>sudo docker run hello-world&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The expected output is a message confirming that Docker is working correctly.&lt;/p>&lt;h2 id="post-installation-configuration">Post-installation Configuration&lt;/h2>&lt;p>By default, only the root user or users with sudo privileges can run Docker commands. To avoid using sudo on every command, add your user to the docker group. 
First, create the group if it does not already exist (if it does, the command reports an error that is safe to ignore):&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>sudo groupadd docker&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Then add your user to the group:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>sudo usermod -aG docker $USER&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>You need to log out and log back in for the permissions to take effect.&lt;/p>&lt;h2 id="tips-for-power9-architecture">Tips for Power9 Architecture&lt;/h2>&lt;p>Because we are using IBM Power9, a few additional considerations matter when working with Docker Hub. The first point is image compatibility: not all images available on Docker Hub support ppc64le. Images built only for x86_64 will fail on Power9, so always verify that the desired image lists &lt;code>ppc64le&lt;/code> among its supported architectures before using it.&lt;/p>&lt;p>To validate that Docker is running correctly and recognizing the machine architecture, use:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>docker version --format &lt;span style="color:#e6db74">&amp;#39;{{.Server.Arch}}&amp;#39;&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The expected output is &lt;code>ppc64le&lt;/code>.&lt;/p>&lt;h2 id="final-considerations">Final Considerations&lt;/h2>&lt;p>Installing Docker Engine on AlmaLinux (ppc64le) follows a straightforward path as long as conflicts with Podman and Buildah are resolved beforehand. 
Official ppc64le support from Docker provides a stable experience on Power9, with the important caveat that image compatibility must always be checked before use.&lt;/p>&lt;p>With Docker installed and configured, the environment is ready to run containers and move on to the next steps in our language model infrastructure.&lt;/p></description></item><item><title>LLM Inference with Ollama on IBM Power9 Using CPU</title><link>https://llm-pt-ibm.github.io/en/posts/ollama_cpu/</link><pubDate>Wed, 01 Apr 2026 00:00:00 +0000</pubDate><guid>https://llm-pt-ibm.github.io/en/posts/ollama_cpu/</guid><description>&lt;h2 id="context">Context&lt;/h2>&lt;p>This post presents a practical guide for performing inference of large Language Models (LLMs) using &lt;a href="https://ollama.com/" rel="external">&lt;span class="link-personalizado">&lt;em>Ollama&lt;/em>&lt;/span>&lt;/a>, in an IBM POWER9 environment. Ollama is a &lt;em>framework&lt;/em> based on &lt;a href="https://github.com/ggml-org/llama.cpp.git" rel="external">&lt;span class="link-personalizado">&lt;em>llama.cpp&lt;/em>&lt;/span>&lt;/a>, designed to simplify the implementation and execution of such models, offering a user-friendly interface and support for various tasks.&lt;/p>&lt;figure>&lt;img src="https://llm-pt-ibm.github.io/images/funcionamento_ollama.png" alt="Figure 1"/>&lt;figcaption> &lt;p>Flow of a request&lt;/p> &lt;/figcaption>&lt;/figure>&lt;p>Despite the growth in LLM usage, the availability of materials focused on the &lt;em>ppc64le&lt;/em> architecture (IBM POWER9) is still quite limited. In general, available tutorials are old, poorly detailed, or focused on more common architectures like &lt;em>x86_64&lt;/em>, which makes reproducing the environment in the presented context difficult. This is the first of two posts in this series, which aims to perform inference entirely via CPU, exploring the &lt;em>ppc64le&lt;/em> architecture, in an updated, practical, and reproducible way. 
In the next post, we will address the use of GPU to accelerate the process.&lt;/p>&lt;h2 id="tldr">TL;DR&lt;/h2>&lt;ul>&lt;li>This post details how to configure the environment to run inference on IBM POWER9 infrastructure;&lt;/li>&lt;li>Execution is performed via CPU using Ollama;&lt;/li>&lt;li>The main challenge involves correctly configuring the environment, especially dependencies like &lt;em>Go&lt;/em>, &lt;em>GCC&lt;/em>, and &lt;em>CMake&lt;/em>, in addition to compatibility with &lt;a href="https://www.redhat.com/en/technologies/linux-platforms/enterprise-linux" rel="external">&lt;span class="link-personalizado">&lt;em>RHEL&lt;/em>&lt;/span>&lt;/a>.&lt;/li>&lt;/ul>&lt;h2 id="environment-used">Environment Used&lt;/h2>&lt;p>&lt;strong>Hardware&lt;/strong>:&lt;/p>&lt;ul>&lt;li>&lt;em>ppc64le&lt;/em> architecture;&lt;/li>&lt;li>RAM: ~64GB;&lt;/li>&lt;li>Execution: Virtual Machine (VM).&lt;/li>&lt;/ul>&lt;p>&lt;strong>Operating System:&lt;/strong> AlmaLinux 8.10 (&lt;em>ppc64le&lt;/em>), binary compatible with &lt;em>Red Hat Enterprise Linux (RHEL)&lt;/em> 8.9/8.10.&lt;/p>&lt;h2 id="initial-setup">Initial &lt;em>Setup&lt;/em>&lt;/h2>&lt;p>To run Ollama on the POWER9 architecture, it is necessary to prepare the environment with the appropriate dependencies. The first step is to update the system and install basic utilities:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>sudo dnf update -y&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>sudo dnf install -y wget git tar make gcc gcc-c++ cmake gcc-toolset-11&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Although this command installs some dependencies, it is still necessary to ensure that the correct versions are being used.&lt;/p>&lt;h3 id="configuring-go">Configuring &lt;em>Go&lt;/em>&lt;/h3>&lt;p>Ollama 
is developed in &lt;em>Go&lt;/em>, so it is necessary to ensure the appropriate version.&lt;/p>&lt;p>&lt;strong>Expected Version:&lt;/strong> 1.25.7 linux/ppc64le&lt;/p>&lt;h4 id="if-not-installed">If not installed:&lt;/h4>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>wget https://go.dev/dl/go1.25.7.linux-ppc64le.tar.gz&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>sudo tar -C /usr/local -xzf go1.25.7.linux-ppc64le.tar.gz&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>export PATH&lt;span style="color:#f92672">=&lt;/span>/usr/local/go/bin:$PATH&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>To add to &lt;em>PATH&lt;/em> permanently:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>echo &lt;span style="color:#e6db74">&amp;#39;export PATH=/usr/local/go/bin:$PATH&amp;#39;&lt;/span> &amp;gt;&amp;gt; ~/.bashrc&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>source ~/.bashrc&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Verify if the version is correct: &lt;code>go version&lt;/code>&lt;/p>&lt;h3 id="configuring-cmake">Configuring &lt;em>CMake&lt;/em>&lt;/h3>&lt;p>Verify if the version is correct: &lt;code>cmake --version&lt;/code>&lt;/p>&lt;p>&lt;strong>Expected Version:&lt;/strong> cmake 3.26.5&lt;/p>&lt;h4 id="if-not-installed-1">If not installed:&lt;/h4>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>wget 
https://github.com/Kitware/CMake/releases/download/v3.26.5/cmake-3.26.5.tar.gz&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>tar -xzf cmake-3.26.5.tar.gz&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>cd cmake-3.26.5&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>./bootstrap&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>make -j&lt;span style="color:#66d9ef">$(&lt;/span>nproc&lt;span style="color:#66d9ef">)&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>sudo make install&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="configuring-gcc">Configuring &lt;em>GCC&lt;/em>&lt;/h3>&lt;p>&lt;strong>Expected Version:&lt;/strong> &lt;code>gcc 11.2.1&lt;/code>&lt;/p>&lt;p>&lt;strong>Important:&lt;/strong> On AlmaLinux 8, the &lt;em>gcc-toolset&lt;/em> is not activated automatically. It is necessary to enable the session manually:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>scl enable gcc-toolset-11 bash&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>This command activates GCC only in the current session. 
If you open another terminal, you will need to run the command again.&lt;/p>&lt;p>&lt;strong>Verify the version:&lt;/strong> &lt;code>gcc --version&lt;/code>&lt;/p>&lt;h4 id="if-not-installed-2">If not installed:&lt;/h4>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>sudo dnf install -y gcc-toolset-11&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>scl enable gcc-toolset-11 bash&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h3 id="cloning-ollama">Cloning Ollama&lt;/h3>&lt;p>With the environment configured, we can build Ollama. Here we clone the official Ollama repository and change the version used (important for POWER compatibility and to get a stable version).&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>cd /root&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>git clone https://github.com/ollama/ollama.git&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>cd ollama&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">#Change the version: &lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>git checkout v0.9.4&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>To verify, use: &lt;code>git status&lt;/code>&lt;/p>&lt;h2 id="build-ollama">&lt;em>Build&lt;/em> Ollama&lt;/h2>&lt;p>After activating GCC in the correct version:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>export 
CGO_ENABLED&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">1&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>go clean -cache -modcache -i -r&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>go build -o ollama .&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>&lt;em>CGO&lt;/em> needs to be enabled because Ollama depends on llama.cpp, which uses C/C++ code for performance optimizations. Without it, the &lt;em>build&lt;/em> fails or loses compatibility with the architecture.&lt;/p>&lt;p>The build should complete without errors and produce the &lt;code>ollama&lt;/code> binary in the current directory.&lt;/p>&lt;p>To verify: &lt;code>./ollama --version&lt;/code>&lt;/p>&lt;h2 id="performing-inference">Performing Inference&lt;/h2>&lt;p>With &lt;em>Ollama&lt;/em> compiled, we can start the server:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>./ollama serve&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Note that this command occupies the terminal while the server runs. Because the environment runs on a virtual machine, the server cannot stay in the foreground of the main terminal while a second terminal in the same session is used for inference, unless an auxiliary tool manages multiple terminals. You can use &lt;em>Tmux&lt;/em> or &lt;em>Screen&lt;/em> for that; here we instead run the server in the background, which keeps the same terminal available for the remaining commands (which we will see next). 
For this, you can run:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>./ollama serve &amp;amp;&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>To verify if it worked: &lt;code>ps aux | grep ollama&lt;/code>. It will show something like:&lt;/p>&lt;figure>&lt;img src="https://llm-pt-ibm.github.io/images/print_ollama_serve.png" alt="Figure 2"/>&lt;figcaption> &lt;p>Ollama running&lt;/p> &lt;/figcaption>&lt;/figure>&lt;h2 id="download-the-test-model-and-run-inference">Download the test model and run inference&lt;/h2>&lt;p>For validation, we used the &lt;em>TinyLlama&lt;/em> model, as it is lightweight and suitable for CPU execution. For this, in another terminal, run:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>./ollama pull tinyllama&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>To run inference:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>./ollama run tinyllama &lt;span style="color:#e6db74">&amp;#34;The sky is blue?&amp;#34;&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>If everything has been done correctly, you will have something like:&lt;/p>&lt;figure>&lt;img src="https://llm-pt-ibm.github.io/images/ollama_run.png" alt="Figure 3"/>&lt;figcaption> &lt;p>Inference being executed&lt;/p> &lt;/figcaption>&lt;/figure>&lt;p>It is important to highlight that &lt;em>Ollama&lt;/em> works, by default, with models available in its own repository, which are already converted and optimized for execution, 
generally in a format compatible with &lt;em>llama.cpp&lt;/em>. These models can be easily used via the &lt;code>ollama pull&lt;/code> command, as in the case of &lt;em>TinyLlama&lt;/em> used in this example. Although it is possible to use external models, this requires additional steps, such as conversion to compatible formats (for example, &lt;em>GGUF&lt;/em>) and the creation of a &lt;em>Modelfile&lt;/em>.&lt;/p>&lt;h2 id="final-considerations">Final Considerations&lt;/h2>&lt;p>With the steps presented, it was possible to configure the environment to run LLM inference on an IBM POWER9 machine using the CPU. Although functional, this approach has performance limitations, especially for larger models, due to the absence of GPU acceleration. As a next step, we intend to explore GPU execution, evaluating performance gains and scalability.&lt;/p>&lt;h2 id="next-steps">Next Steps&lt;/h2>&lt;ul>&lt;li>Test newer versions and compatibility between them;&lt;/li>&lt;li>Conduct benchmarking experiments to compare CPU inference performance against GPU inference;&lt;/li>&lt;li>Publish the second post in this series, covering GPU inference.&lt;/li>&lt;/ul></description></item><item><title>Power9 Virtualization: how we structured an isolated environment with KVM and Libvirt</title><link>https://llm-pt-ibm.github.io/en/posts/post_virtualization/</link><pubDate>Fri, 27 Mar 2026 00:00:00 +0000</pubDate><guid>https://llm-pt-ibm.github.io/en/posts/post_virtualization/</guid><description>&lt;h2 id="context">Context&lt;/h2>&lt;p>Given the need to establish isolated and secure environments for installing libraries, frameworks, and general-purpose tools, environment encapsulation emerged as an effective solution, implemented through KVM managed via &lt;code>virt-manager&lt;/code> and &lt;code>virsh&lt;/code>.&lt;/p>&lt;p>Virtualization is widely used in x86 environments, with mature tooling and established workflows. 
However, when migrating to architectures such as IBM Power9 (&lt;code>ppc64le&lt;/code>), many of these processes are no longer straightforward and require architecture-specific adaptations. The diagram below shows how the hardware and the virtual machines interact across four layers.&lt;/p>&lt;h2 id="communication-flow-between-hardware-power9-and-virtual-machines">Communication flow between Hardware (Power9) and Virtual Machines&lt;/h2>&lt;p>The flow is organized into the following layers:&lt;/p>&lt;figure>&lt;img src="https://llm-pt-ibm.github.io/images/kvm_virtualization-en.png" alt="Figure 1: Diagram representing a 4-layer virtualization architecture."/>&lt;figcaption> &lt;p>Figure 1: Diagram representing a 4-layer virtualization architecture.&lt;/p> &lt;/figcaption>&lt;/figure>&lt;p>In this work, we explore how to build a virtualized environment using KVM and Libvirt on a Power9 server, with a focus on isolation, reproducibility, and shared team usage.&lt;/p>&lt;h2 id="tldr">TL;DR&lt;/h2>&lt;ul>&lt;li>We implemented a virtualized environment on Power9 using KVM + Libvirt.&lt;/li>&lt;li>We adapted common virtualization workflows to &lt;code>ppc64le&lt;/code>, solving permission, write-lock, and provisioning issues.&lt;/li>&lt;li>The environment provides secure isolation between users and straightforward VM management.&lt;/li>&lt;li>We provide ready-to-use images with NVIDIA/CUDA drivers for immediate use.&lt;/li>&lt;/ul>&lt;h2 id="environment-used">Environment used&lt;/h2>&lt;ul>&lt;li>&lt;strong>Architecture&lt;/strong>: IBM Power9 server (&lt;code>ppc64le&lt;/code> architecture).&lt;/li>&lt;li>&lt;strong>Operating System (OS)&lt;/strong>: AlmaLinux 8.10, binary-compatible with Red Hat Enterprise Linux (RHEL) 8.9/8.10.&lt;/li>&lt;li>&lt;strong>RAM&lt;/strong>: 512GB.&lt;/li>&lt;li>&lt;strong>Execution&lt;/strong>: Virtual Manager 
for Virtual Machine (VM) management.&lt;/li>&lt;li>&lt;strong>Hypervisor&lt;/strong>: KVM (Kernel-based Virtual Machine) / QEMU.&lt;/li>&lt;li>&lt;strong>Management&lt;/strong>: Libvirt (&lt;code>virsh&lt;/code>, &lt;code>virt-install&lt;/code>, &lt;code>virt-customize&lt;/code>).&lt;/li>&lt;li>&lt;strong>Storage&lt;/strong>: Virtual disks in &lt;code>.qcow2&lt;/code> format.&lt;/li>&lt;li>&lt;strong>GPUs&lt;/strong>: 4x NVIDIA Tesla V100 SXM2 16GB (NVLink2).&lt;/li>&lt;/ul>&lt;h2 id="installing-the-virtualization-environment-kvm--libvirt">Installing the virtualization environment (KVM + Libvirt)&lt;/h2>&lt;p>Before creating any VM, you need to install and configure KVM and Libvirt on the Power9 server.&lt;/p>&lt;ol>&lt;li>&lt;strong>Package installation&lt;/strong>:&lt;/li>&lt;/ol>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>sudo dnf install -y qemu-kvm libvirt libvirt-client libvirt-daemon libvirt-daemon-kvm virt-install virt-viewer guestfs-tools &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span>libguestfs-tools python3-libvirt&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ol start="2">&lt;li>&lt;strong>Starting the service&lt;/strong>:&lt;/li>&lt;/ol>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>sudo systemctl enable --now libvirtd&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>sudo systemctl status libvirtd&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ol start="3">&lt;li>&lt;strong>Adding your user to the &lt;code>libvirt&lt;/code> group&lt;/strong>: So non-root users can manage VMs without 
requiring &lt;code>sudo&lt;/code> for every command:&lt;/li>&lt;/ol>&lt;p>Run the command below:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>sudo usermod -aG libvirt &lt;span style="color:#66d9ef">$(&lt;/span>whoami&lt;span style="color:#66d9ef">)&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Log out and log back in for the change to take effect.&lt;/p>&lt;ol start="4">&lt;li>&lt;strong>Verifying the installation&lt;/strong>:&lt;/li>&lt;/ol>&lt;p>Check the &lt;code>virsh&lt;/code> version:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>sudo virsh version&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Validate CPU virtualization support:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>sudo virt-host-validate&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="setup">Setup&lt;/h2>&lt;ol>&lt;li>&lt;strong>Environment preparation&lt;/strong>: In KVM, the fastest way to provision VMs is to clone a “seed” image (&lt;code>.qcow2&lt;/code>) and expand it, instead of performing a clean install from ISO. 
To keep things organized, all virtual disks should be stored in a dedicated directory (in this post, &lt;code>/home/user/discos&lt;/code>):&lt;/li>&lt;/ol>&lt;p>Download the AlmaLinux 8 base image:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>cd /home/user/&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>wget https://repo.almalinux.org/almalinux/8/cloud/ppc64le/images/AlmaLinux-8-GenericCloud-latest.ppc64le.qcow2 -O alma8_base.qcow2&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ol start="2">&lt;li>&lt;strong>Hypervisor management&lt;/strong>: Hypervisor and instance administration follows specific procedures to ensure system stability. The administrator commands below control the virtualization services on Power9:&lt;/li>&lt;/ol>&lt;p>Stop KVM services:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>sudo systemctl stop libvirtd&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Start KVM services again:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>sudo systemctl start libvirtd&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Enable at boot:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>sudo systemctl enable libvirtd&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ol start="3">&lt;li>&lt;strong>Permission setup&lt;/strong>: The system user running KVM (&lt;code>qemu&lt;/code>) needs 
permission to access VM disks. If disks are stored inside a personal home directory, Linux blocks access by default. To allow hypervisor access without exposing personal files, grant execute (&lt;code>o+x&lt;/code>) permission on each directory in the path to the disks:&lt;/li>&lt;/ol>&lt;p>Allow &lt;code>qemu&lt;/code> to traverse the home directory (traversal only, no read permission):&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>chmod o+x /home/user&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Allow &lt;code>qemu&lt;/code> to access the disk directory:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>chmod o+x /home/user/discos&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ol start="4">&lt;li>&lt;strong>Virtual network configuration (Libvirt)&lt;/strong>: Libvirt creates a default NAT network (&lt;code>default&lt;/code>) that places VMs in the &lt;code>192.168.122.0/24&lt;/code> range. 
VMs can access the internet through NAT, but they are not directly reachable from external networks without additional setup.&lt;/li>&lt;/ol>&lt;p>Check network status:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>sudo virsh net-list --all&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>If inactive, start and enable at boot:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>sudo virsh net-start default&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>sudo virsh net-autostart default&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>If the network does not exist, define and initialize it:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>sudo virsh net-define /usr/share/libvirt/networks/default.xml&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>sudo virsh net-start default&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>sudo virsh net-autostart default&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>If the XML file is missing, install the network config package:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>sudo dnf install -y libvirt-daemon-config-network&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ol start="5">&lt;li>&lt;strong>Creating new VMs&lt;/strong>:&lt;/li>&lt;/ol>&lt;p>Clone the base 
image:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>cp /home/user/alma8_base.qcow2 /home/user/discos/nome_vm.qcow2&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Expand the disk (must be done BEFORE creating the VM):&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>qemu-img resize /home/user/discos/nome_vm.qcow2 +100G&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Create the VM:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>sudo virt-install &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> --connect qemu:///system &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> --name vm_nome &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> --memory &lt;span style="color:#ae81ff">131072&lt;/span> &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> --vcpus &lt;span style="color:#ae81ff">16&lt;/span> &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> --cpu host &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span 
style="color:#ae81ff">&lt;/span> --disk path&lt;span style="color:#f92672">=&lt;/span>/home/user/discos/nome_vm.qcow2,format&lt;span style="color:#f92672">=&lt;/span>qcow2 &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> --import &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> --os-variant almalinux8 &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> --network network&lt;span style="color:#f92672">=&lt;/span>default &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> --graphics none &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> --noautoconsole&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ol start="6">&lt;li>&lt;strong>Post-creation VM customization&lt;/strong>:After creating the VM, you must set the root password, since cloud images usually come without one. We use &lt;code>virt-customize&lt;/code> for this. 
&lt;strong>Important&lt;/strong>: the VM must be powered off before safely editing its disk.&lt;/li>&lt;/ol>&lt;p>Shut down the VM:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>sudo virsh shutdown vm_nome&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Wait for complete shutdown:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>sudo virsh list --all&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Inject the root password into disk:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>sudo virt-customize -a /home/user/discos/nome_vm.qcow2 &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> --root-password password:senha_desejada&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Start the VM again:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>sudo virsh start vm_nome&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ol start="7">&lt;li>&lt;strong>Accessing VMs&lt;/strong>:&lt;/li>&lt;/ol>&lt;p>&lt;strong>Via serial console&lt;/strong>&lt;/p>&lt;p>Connect to VM console:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" 
data-lang="bash">&lt;span style="display:flex;">&lt;span>sudo virsh console vm_nome&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>To exit the console, use &lt;code>Ctrl + ]&lt;/code>.&lt;/p>&lt;p>&lt;strong>Via SSH&lt;/strong>&lt;/p>&lt;p>Find the VM IP address:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>sudo virsh domifaddr vm_nome&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Access via SSH:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>ssh root@&amp;lt;ip_da_vm&amp;gt;&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ol start="8">&lt;li>&lt;strong>Managing and deleting VMs&lt;/strong>:If you need to destroy an environment and recreate it from scratch, follow these 3 mandatory cleanup steps:&lt;/li>&lt;/ol>&lt;p>Force-stop the VM:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>sudo virsh destroy nome_da_vm&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Remove VM definition from Libvirt:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>sudo virsh undefine nome_da_vm&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Delete the virtual disk to free Power9 storage:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code 
class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>rm -f /home/user/discos/nome_da_vm.qcow2&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ol start="9">&lt;li>&lt;strong>Creating a VM from an existing image (cloning)&lt;/strong>:To create a new VM from an already configured image, such as prebuilt NVIDIA-ready images:&lt;/li>&lt;/ol>&lt;p>Option A: clone via &lt;code>qemu-img&lt;/code> (keeps original image intact):&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>qemu-img create -f qcow2 -b imagem-base.qcow2 -F qcow2 nova-vm.qcow2&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Option B: clone via &lt;code>virt-clone&lt;/code>:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>virt-clone &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> --original vm-base &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> --name vm-nova &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> --file /home/user/discos/nova-vm.qcow2&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>If needed, you can execute the VM deletion step above and recreate it according to step 5.&lt;/p>&lt;h2 id="ready-to-use-images-with-nvidia-drivers">Ready-to-use images with NVIDIA drivers&lt;/h2>&lt;p>To simplify the use of Tesla V100 GPUs available on the server, we provide pre-configured &lt;code>.qcow2&lt;/code> images with NVIDIA drivers, 
CUDA, and cuDNN already installed. This removes the need to configure the base environment for every new use.&lt;/p>&lt;ol>&lt;li>&lt;p>&lt;strong>Available images&lt;/strong>:&lt;/p>&lt;table>&lt;thead>&lt;tr>&lt;th style="text-align:left">Image&lt;/th>&lt;th style="text-align:left">Contents&lt;/th>&lt;/tr>&lt;/thead>&lt;tbody>&lt;tr>&lt;td style="text-align:left">AlmaLinux-8-Power9-NVIDIA-drivers.qcow2.xz&lt;/td>&lt;td style="text-align:left">AlmaLinux 8.10 + NVIDIA 535 drivers + CUDA 12.2 + cuDNN 9.0&lt;/td>&lt;/tr>&lt;tr>&lt;td style="text-align:left">InstructLab-Power9-0.25.0.qcow2.xz&lt;/td>&lt;td style="text-align:left">AlmaLinux 8.10 + InstructLab 0.25.0 + dependencies required for execution on Power9 (ppc64le).&lt;/td>&lt;/tr>&lt;/tbody>&lt;/table>&lt;/li>&lt;li>&lt;p>&lt;strong>How to use pre-configured images&lt;/strong>:&lt;/p>&lt;/li>&lt;/ol>&lt;p>Download the images from the shared folder and decompress the one you need:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>pip install --user gdown&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>gdown --folder &lt;span style="color:#e6db74">&amp;#34;https://drive.google.com/drive/u/1/folders/1WM8fHKWaMu-NJOzwqh6cdcET7mNE50du&amp;#34;&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>xz -d InstructLab-Power9-0.25.0.qcow2.xz&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Copy it to the disks directory to serve as the new VM's disk:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>cp InstructLab-Power9-0.25.0.qcow2 /home/user/discos/minha-vm-gpu.qcow2&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Create the VM as 
usual:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>sudo virt-install &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> --connect qemu:///system &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> --name vm_gpu &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> --memory &lt;span style="color:#ae81ff">131072&lt;/span> &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> --vcpus &lt;span style="color:#ae81ff">16&lt;/span> &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> --cpu host &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> --disk path&lt;span style="color:#f92672">=&lt;/span>/home/user/discos/minha-vm-gpu.qcow2,format&lt;span style="color:#f92672">=&lt;/span>qcow2 &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> --import &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> --os-variant almalinux8 &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> --network network&lt;span style="color:#f92672">=&lt;/span>default &lt;span 
style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> --graphics none &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> --noautoconsole&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>For the VM to access physical GPUs, PCIe passthrough must be configured as described in the next post of this series.&lt;/p>&lt;ol start="3">&lt;li>&lt;strong>How to generate a new image from a configured VM&lt;/strong>: After installing drivers or any other software inside a VM, you can export its current state as a reusable image:&lt;/li>&lt;/ol>&lt;p>Shut down the VM and wait until &lt;code>virsh list --all&lt;/code> shows it as shut off:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>sudo virsh shutdown vm_nome&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Convert and compress the image (removes unused space):&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>qemu-img convert -O qcow2 -c &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> /home/user/discos/vm_nome.qcow2 &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> /home/user/discos/AlmaLinux-8-Power9-minha-imagem.qcow2&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Compress for distribution:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" 
data-lang="bash">&lt;span style="display:flex;">&lt;span>xz -T0 -v /home/user/discos/AlmaLinux-8-Power9-minha-imagem.qcow2&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Expected output: &lt;code>AlmaLinux-8-Power9-minha-imagem.qcow2.xz&lt;/code>.&lt;/p>&lt;p>Verify image integrity (run this on the uncompressed &lt;code>.qcow2&lt;/code> file, since &lt;code>xz&lt;/code> replaces the original with the &lt;code>.xz&lt;/code> archive unless &lt;code>-k&lt;/code> is given):&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>qemu-img check AlmaLinux-8-Power9-minha-imagem.qcow2&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>qemu-img info AlmaLinux-8-Power9-minha-imagem.qcow2&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div></description></item><item><title>Evaluation of IBM Granite Models for Code-Generation Tasks on HumanEvalX</title><link>https://llm-pt-ibm.github.io/en/posts/post_humanevalx/</link><pubDate>Fri, 28 Nov 2025 00:00:00 +0000</pubDate><guid>https://llm-pt-ibm.github.io/en/posts/post_humanevalx/</guid><description>&lt;h2 id="context">Context&lt;/h2>&lt;p>The use of language models for &lt;strong>code generation and understanding&lt;/strong> has become essential in modern development workflows.&lt;br>As part of a joint research effort between &lt;strong>LSD/UFCG&lt;/strong> and &lt;strong>IBM&lt;/strong>, we investigated the performance of the &lt;strong>IBM Granite 4&lt;/strong> family on the &lt;strong>HumanEvalX&lt;/strong> benchmark, which evaluates programming capabilities in &lt;em>five languages&lt;/em>: Python, Java, Go, C++, and JavaScript.&lt;/p>&lt;p>The goal was to answer key questions from the team:&lt;/p>&lt;ul>&lt;li>&lt;em>How versatile are the Granite models across different languages?&lt;/em>&lt;/li>&lt;li>&lt;em>Do smaller models deliver useful performance?&lt;/em>&lt;/li>&lt;li>&lt;em>How do the Granite models compare to models from other providers such as DeepSeek Coder and CodeLlama?&lt;/em>&lt;/li>&lt;/ul>&lt;hr>&lt;h2 
id="methodology--process">Methodology / Process&lt;/h2>&lt;p>The evaluation was conducted using &lt;strong>OpenCompass&lt;/strong>, a modern and extensible framework for large-scale LLM benchmarking. It allowed experiments to be executed in a standardized, reproducible way with consistent inference protocols.&lt;/p>&lt;p>Since OpenCompass does not provide native support for models hosted on the &lt;strong>IBM Cloud&lt;/strong>, it was necessary to develop a custom client to integrate the framework with the IBM Cloud Inference API. This client allowed the evaluation process to send requests transparently, handle authentication, manage generation parameters, and return outputs in the expected benchmark format. Experiments were also run in &lt;strong>Google Colab&lt;/strong>, which served as a practical environment for prototyping and running the models.&lt;/p>&lt;p>We used the HumanEvalX benchmark, an extension of the traditional HumanEval that covers five languages, and report the &lt;strong>Pass@1&lt;/strong> metric, following the benchmark protocol.&lt;/p>&lt;p>The evaluated models included:&lt;/p>&lt;ul>&lt;li>Granite 4.0 Micro (3B)&lt;/li>&lt;li>Granite 4.0 (1B)&lt;/li>&lt;li>Granite 4.0 h-tiny (7B)&lt;/li>&lt;li>Granite 4.0 h-small (30B), via IBM Cloud&lt;/li>&lt;li>Granite 4.0 (350M)&lt;/li>&lt;li>Granite Code Instruct (8B), via IBM Cloud&lt;/li>&lt;li>DeepSeek Coder (6.7B)&lt;/li>&lt;li>CodeLlama (7B)&lt;/li>&lt;/ul>&lt;hr>&lt;h2 id="results-and-conclusions">Results and Conclusions&lt;/h2>&lt;figure>&lt;img src="https://llm-pt-ibm.github.io/images/heatmap_humanevalX.png" alt="Performance heatmap"/>&lt;figcaption> &lt;p>Performance heatmap of the models on HumanEvalX.&lt;/p> &lt;/figcaption>&lt;/figure>&lt;p>The evaluation revealed important behaviors:&lt;/p>&lt;h3 id="1-granite-40-h-small-stood-out-for-its-versatility">&lt;strong>1. 
granite-4.0-h-small stood out for its versatility&lt;/strong>&lt;/h3>&lt;p>It surpassed 60% &lt;strong>Pass@1&lt;/strong> in Java, C++, and JavaScript, while also maintaining over 50% in Python and Go. This consistent performance across languages suggests that the model generalizes well and shows promise in scenarios that involve different programming ecosystems, although additional benchmarks and evaluations are needed before drawing broader conclusions.&lt;/p>&lt;h3 id="2-granite-micro-3b-performed-above-expectations">&lt;strong>2. Granite Micro (3B) performed above expectations&lt;/strong>&lt;/h3>&lt;p>Despite being a small model, Granite Micro (3B) delivered 65.85% in JavaScript and 68.90% in Java, outperforming even some of the larger models evaluated. This shows that a compact architecture can still deliver solid results, making it an efficient option for applications that require low computational cost without sacrificing performance.&lt;/p>&lt;h3 id="3-the-size-progression-350m--1b--3b--7b--30b-shows-gradual-and-coherent-evolution">&lt;strong>3. The size progression (350M → 1B → 3B → 7B → 30B) shows gradual and coherent evolution&lt;/strong>&lt;/h3>&lt;p>The results show that as we move through the different sizes of the Granite line, there is a coherent evolution in performance. Smaller models deliver stable results within their category, while larger ones gradually expand the ability to solve more complex tasks. This distribution helps clarify where each model fits in the usage spectrum.&lt;/p>&lt;h3 id="4-comparing-different-providers-helps-contextualize-the-results">&lt;strong>4. Comparing different providers helps contextualize the results&lt;/strong>&lt;/h3>&lt;p>Alongside the IBM models, we also evaluated models from other providers such as DeepSeek and Meta. In some languages, the differences were small, but in every language at least one model from the Granite family achieved the highest score. 
The Granite 4 Micro (3B) and Granite 4 h-small (30B) models were the standouts, with results that were close to, and in some cases above, those of models recognized as code specialists.&lt;/p>&lt;hr>&lt;h2 id="next-steps">Next Steps&lt;/h2>&lt;ul>&lt;li>Run the same Granite models on &lt;strong>LiveCodeBench&lt;/strong>, a broader benchmark that goes beyond &lt;strong>code generation&lt;/strong>, also evaluating &lt;strong>code execution&lt;/strong> and &lt;strong>test-output prediction&lt;/strong>.&lt;/li>&lt;li>&lt;strong>Fine-tune Granite 4.0 Micro (3B) using InstructLab&lt;/strong> and observe the impact of this adaptation on the model’s performance in &lt;strong>HumanEvalX&lt;/strong>, comparing before and after the adjustment.&lt;/li>&lt;/ul></description></item><item><title>Computação@UFCG Leads Brazil's Contributions to the HELM-Stanford Framework in Partnership with IBM</title><link>https://llm-pt-ibm.github.io/en/posts/contribuicao_helm/</link><pubDate>Wed, 09 Jul 2025 00:00:00 +0000</pubDate><guid>https://llm-pt-ibm.github.io/en/posts/contribuicao_helm/</guid><description>&lt;p>&lt;strong>Collaboration between UFCG’s Computer Science department and IBM makes the university the top Brazilian contributor to the &lt;a href="https://github.com/stanford-crfm/helm" rel="external">&lt;span class="link-personalizado">HELM-Stanford&lt;/span>&lt;/a> evaluation framework in 2025.&lt;/strong>&lt;/p>&lt;p>HELM-Stanford is one of the world’s leading frameworks for evaluating language models, measuring accuracy, robustness, and fairness. Being the top Brazilian contributor, through the partnership between Computação@UFCG and IBM, highlights Brazil’s leading role in developing fairer, safer, and more representative metrics for LLMs, especially in multilingual and culturally diverse contexts.&lt;/p>&lt;p>The partnership between Computação@UFCG and IBM resulted in 15 significant contributions to HELM-Stanford in 2025. 
These contributions include adding Portuguese-language benchmarks, fixing bugs, improving source code, and including new evaluation sets, expanding the framework’s linguistic diversity and robustness.&lt;/p>&lt;p>The project, coordinated by Professor João Brunet with participation from Professors Fábio Morais and Leandro Balby, features a multidisciplinary team dedicated to LLM evaluation. The team also includes one professor from IFPB, three graduate students, three undergraduate students, and a professional with software development experience. IBM, as a project partner, has also assigned professionals to work directly on the collaboration. Together, the group has made meaningful contributions to advancing HELM-Stanford, with a focus on including the Portuguese language and continuously improving the framework.&lt;/p>&lt;figure>&lt;img src="https://llm-pt-ibm.github.io/images/carvalheira.jpeg" alt="Multidisciplinary project team"/>&lt;figcaption> &lt;p>Multidisciplinary project team&lt;/p> &lt;/figcaption>&lt;/figure></description></item><item><title>LLMs Inference API on IBM Power9 Server</title><link>https://llm-pt-ibm.github.io/en/posts/tutorial_power9_pt4_en/</link><pubDate>Thu, 03 Jul 2025 00:00:00 +0000</pubDate><guid>https://llm-pt-ibm.github.io/en/posts/tutorial_power9_pt4_en/</guid><description>&lt;h2 id="background">Background&lt;/h2>&lt;p>This is the fourth and final post in a tutorial series that shows, step by step, how to build an LLM API on a Power9 server, from operating system setup to remote inference execution. 
We already configured the operating system, NVIDIA drivers, CUDA, and cuDNN in the &lt;a href="https://llm-pt-ibm.github.io/en/posts/tutorial_power9_pt1_en/">&lt;span class="link-personalizado">first post&lt;/span>&lt;/a>, installed Conda and PyTorch in the &lt;a href="https://llm-pt-ibm.github.io/en/posts/tutorial_power9_pt2_en/">&lt;span class="link-personalizado">second post&lt;/span>&lt;/a>, and built the API in the &lt;a href="https://llm-pt-ibm.github.io/en/posts/tutorial_power9_pt3_en/">&lt;span class="link-personalizado">third post&lt;/span>&lt;/a>. In this post, we present the finished API and show how to make requests.&lt;/p>&lt;h2 id="tldr">TL;DR&lt;/h2>&lt;ul>&lt;li>This post introduces the finished LLM inference API and how to use it.&lt;/li>&lt;li>We show how to make requests using Python and curl.&lt;/li>&lt;/ul>&lt;h2 id="introducing-the-api">Introducing the API&lt;/h2>&lt;p>Built with FastAPI, the API loads specific models, keeps them in GPU memory for successive calls, and generates text from prompts sent via HTTP requests. It also provides API Key access control, memory management (loading and unloading models), support for multiple GPUs with automatic sharding, and endpoints for status queries. The goal is to provide a robust, production-ready service optimized for intensive use, with fast inference and easy integration with external applications.&lt;/p>&lt;h4 id="architecture-overview">Architecture Overview&lt;/h4>&lt;p>The API exposes LLMs via FastAPI with REST endpoints. The ModelManager handles model loading, unloading, and inference, keeping models in GPU memory for fast calls. Authentication is enforced via API Key. The architecture supports multiple GPUs with automatic sharding to optimize memory usage and performance. 
Models are sourced from Hugging Face and run inference through the Transformers library.&lt;/p>&lt;figure>&lt;img src="https://llm-pt-ibm.github.io/images/arquitetura_api_llm_01_en.png" alt="Architecture diagram"/>&lt;figcaption> &lt;p>Architecture Diagram&lt;/p> &lt;/figcaption>&lt;/figure>&lt;h4 id="main-features">Main Features&lt;/h4>&lt;ul>&lt;li>&lt;p>&lt;strong>Load Models&lt;/strong>&lt;/p>&lt;ul>&lt;li>&lt;code>/load_model&lt;/code>&lt;/li>&lt;li>Loads a model from the Hugging Face Hub&lt;/li>&lt;li>Performs sharding across GPUs&lt;/li>&lt;li>Supports Hugging Face Token&lt;/li>&lt;/ul>&lt;/li>&lt;li>&lt;p>&lt;strong>Generate Text&lt;/strong>&lt;/p>&lt;ul>&lt;li>&lt;code>/generate&lt;/code>&lt;/li>&lt;li>Accepts prompt, max_tokens, model name, temperature, and top_p&lt;/li>&lt;li>Uses an already loaded model or loads a new one&lt;/li>&lt;li>Returns result in JSON&lt;/li>&lt;/ul>&lt;/li>&lt;li>&lt;p>&lt;strong>Management&lt;/strong>&lt;/p>&lt;ul>&lt;li>&lt;code>/status&lt;/code>: Checks the loaded model and device (CPU/GPU)&lt;/li>&lt;li>&lt;code>/unload_model&lt;/code>: Frees GPU and memory&lt;/li>&lt;li>&lt;code>/generate_apikey&lt;/code>: Creates an API key for an LDAP user&lt;/li>&lt;/ul>&lt;/li>&lt;/ul>&lt;h4 id="usage-flow">Usage Flow&lt;/h4>&lt;figure>&lt;img src="https://llm-pt-ibm.github.io/images/arquitetura_api_llm_02_en.png" alt="Usage flow diagram"/>&lt;figcaption> &lt;p>Usage flow diagram&lt;/p> &lt;/figcaption>&lt;/figure>&lt;h4 id="inputs-and-endpoints">Inputs and Endpoints&lt;/h4>&lt;p>The table below describes the API endpoints, required inputs, and responses.&lt;/p>&lt;style>table { border-collapse: collapse; width: 100%;}th { background-color: #cccccc; text-align: center; padding: 8px; border: 1px solid #b3b3b3;}td { padding: 8px; border: 1px solid #ccc; text-align: left;}td.center { text-align: center;}caption { caption-side: bottom}&lt;/style>&lt;table> &lt;caption>Inputs and endpoints table&lt;/caption> &lt;thead> &lt;tr> 
&lt;th>Endpoint&lt;/th> &lt;th>Method&lt;/th> &lt;th>API Key&lt;/th> &lt;th>Input (Body/Query)&lt;/th> &lt;th>Response&lt;/th> &lt;/tr> &lt;/thead> &lt;tbody> &lt;tr> &lt;td>&lt;code>/generate_apikey&lt;/code>&lt;/td> &lt;td class="center">POST&lt;/td> &lt;td class="center">❌&lt;/td> &lt;td class="center">{username}&lt;/td> &lt;td class="center">API Key&lt;/td> &lt;/tr> &lt;tr> &lt;td>&lt;code>/load_model&lt;/code>&lt;/td> &lt;td class="center">POST&lt;/td> &lt;td class="center">✅&lt;/td> &lt;td class="center">{model_name &lt;br> hf_token (optional) &lt;br> device (optional)}&lt;/td> &lt;td class="center">None, just loads the model&lt;/td> &lt;/tr> &lt;tr> &lt;td>&lt;code>/generate&lt;/code>&lt;/td> &lt;td class="center">POST&lt;/td> &lt;td class="center">✅&lt;/td> &lt;td class="center">{model_name &lt;br> prompt &lt;br> hf_token (optional) &lt;br> max_tokens (optional) &lt;br> temperature (optional) &lt;br> top_p (optional)}&lt;/td> &lt;td class="center">Text generated by the model&lt;/td> &lt;/tr> &lt;tr> &lt;td>&lt;code>/status&lt;/code>&lt;/td> &lt;td class="center">GET&lt;/td> &lt;td class="center">✅&lt;/td> &lt;td class="center">None&lt;/td> &lt;td class="center">Model status and the device it is loaded on&lt;/td> &lt;/tr> &lt;tr> &lt;td>&lt;code>/unload_model&lt;/code>&lt;/td> &lt;td class="center">POST&lt;/td> &lt;td class="center">✅&lt;/td> &lt;td class="center">None&lt;/td> &lt;td class="center">None, just unloads the model&lt;/td> &lt;/tr> &lt;/tbody>&lt;/table>&lt;h2 id="how-to-use-the-api-with-python">How to Use the API with Python&lt;/h2>&lt;h4 id="generate-api-key">Generate API Key&lt;/h4>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 
1&lt;/span>&lt;span>&lt;span style="color:#f92672">import&lt;/span> requests&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 2&lt;/span>&lt;span>&lt;span style="color:#f92672">import&lt;/span> json&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 3&lt;/span>&lt;span>&lt;span style="color:#f92672">import&lt;/span> os&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 4&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 5&lt;/span>&lt;span>url &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&amp;#34;http://&amp;lt;power9_ip_server&amp;gt;:8000&amp;#34;&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 6&lt;/span>&lt;span>username &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&amp;#34;&amp;lt;ldap_user&amp;gt;&amp;#34;&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 7&lt;/span>&lt;span>hf_token &lt;span style="color:#f92672">=&lt;/span> os&lt;span style="color:#f92672">.&lt;/span>getenv(&lt;span style="color:#e6db74">&amp;#34;HUGGINGFACE_TOKEN&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span 
style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 8&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 9&lt;/span>&lt;span>response &lt;span style="color:#f92672">=&lt;/span> requests&lt;span style="color:#f92672">.&lt;/span>post(&lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">{&lt;/span>url&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74">/generate_apikey&amp;#34;&lt;/span>, json&lt;span style="color:#f92672">=&lt;/span>{&lt;span style="color:#e6db74">&amp;#34;username&amp;#34;&lt;/span>: username})&lt;span style="color:#f92672">.&lt;/span>content&lt;span style="color:#f92672">.&lt;/span>decode()&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">10&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">11&lt;/span>&lt;span>api_key &lt;span style="color:#f92672">=&lt;/span> json&lt;span style="color:#f92672">.&lt;/span>loads(response)&lt;span style="color:#f92672">.&lt;/span>get(&lt;span style="color:#e6db74">&amp;#34;api_key&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ul>&lt;li>The Hugging Face token must be set as an environment variable (&lt;code>HUGGINGFACE_TOKEN&lt;/code>) in the environment where this script runs.&lt;/li>&lt;li>&lt;code>api_key&lt;/code> holds the API key returned by the request.&lt;/li>&lt;/ul>&lt;h4 id="load-model">Load Model&lt;/h4>&lt;p>First, we need to create a header containing the API Key returned from the code above and the payload 
with &lt;code>model_name&lt;/code> and the Hugging Face token &lt;code>hf_token&lt;/code>. After that, we can send the request with these two pieces of information.&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1&lt;/span>&lt;span>headers &lt;span style="color:#f92672">=&lt;/span> {&lt;span style="color:#e6db74">&amp;#34;Content-Type&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;application/json&amp;#34;&lt;/span>,&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">2&lt;/span>&lt;span>&lt;span style="color:#e6db74">&amp;#34;x-api-key&amp;#34;&lt;/span>: api_key}&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">3&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">4&lt;/span>&lt;span>payload &lt;span style="color:#f92672">=&lt;/span> {&lt;span style="color:#e6db74">&amp;#34;model_name&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;ibm-granite/granite-3.3-8b-instruct&amp;#34;&lt;/span>,&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">5&lt;/span>&lt;span> &lt;span style="color:#e6db74">&amp;#34;hf_token&amp;#34;&lt;/span>: hf_token}&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span 
style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">6&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">7&lt;/span>&lt;span>resp &lt;span style="color:#f92672">=&lt;/span> requests&lt;span style="color:#f92672">.&lt;/span>post(&lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">{&lt;/span>url&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74">/load_model&amp;#34;&lt;/span>, headers&lt;span style="color:#f92672">=&lt;/span>headers, json&lt;span style="color:#f92672">=&lt;/span>payload)&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h4 id="generate-text">Generate Text&lt;/h4>&lt;p>Now we need to create a new payload with the necessary information to generate text with an LLM, which includes: &lt;code>prompt&lt;/code>, &lt;code>model_name&lt;/code>, and &lt;code>hf_token&lt;/code>.&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1&lt;/span>&lt;span>payload &lt;span style="color:#f92672">=&lt;/span> {&lt;span style="color:#e6db74">&amp;#34;prompt&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;Hello, tell me a little about the Federal University of Campina Grande (UFCG)&amp;#34;&lt;/span>,&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">2&lt;/span>&lt;span> &lt;span 
style="color:#e6db74">&amp;#34;model_name&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;ibm-granite/granite-3.3-8b-instruct&amp;#34;&lt;/span>,&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">3&lt;/span>&lt;span> &lt;span style="color:#e6db74">&amp;#34;hf_token&amp;#34;&lt;/span>: hf_token}&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">4&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">5&lt;/span>&lt;span>resp &lt;span style="color:#f92672">=&lt;/span> requests&lt;span style="color:#f92672">.&lt;/span>post(&lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">{&lt;/span>url&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74">/generate&amp;#34;&lt;/span>, headers&lt;span style="color:#f92672">=&lt;/span>headers, json&lt;span style="color:#f92672">=&lt;/span>payload)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">6&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">7&lt;/span>&lt;span>resp &lt;span style="color:#f92672">=&lt;/span> json&lt;span style="color:#f92672">.&lt;/span>loads(resp&lt;span style="color:#f92672">.&lt;/span>content&lt;span style="color:#f92672">.&lt;/span>decode())&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h4 
id="check-status-and-unload-the-model">Check status and unload the model&lt;/h4>&lt;p>To check the status and unload the model, we don&amp;rsquo;t need to send anything in the payload—just the header with the API key:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1&lt;/span>&lt;span>requests&lt;span style="color:#f92672">.&lt;/span>get(&lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">{&lt;/span>url&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74">/status&amp;#34;&lt;/span>, headers&lt;span style="color:#f92672">=&lt;/span>headers)&lt;span style="color:#f92672">.&lt;/span>content&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1&lt;/span>&lt;span>resp &lt;span style="color:#f92672">=&lt;/span> requests&lt;span style="color:#f92672">.&lt;/span>post(&lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">{&lt;/span>url&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74">/unload_model&amp;#34;&lt;/span>, headers&lt;span style="color:#f92672">=&lt;/span>headers)&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="how-to-use-the-api-with-curl-in-cli">How to use the API with curl in CLI&lt;/h2>&lt;h4 id="generate-api-key-1">Generate API 
Key&lt;/h4>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>curl -X POST &lt;span style="color:#e6db74">&amp;#34;http://&amp;lt;power9_ip_server&amp;gt;:8000/generate_apikey&amp;#34;&lt;/span> &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> -H &lt;span style="color:#e6db74">&amp;#34;Content-Type: application/json&amp;#34;&lt;/span> &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> -d &lt;span style="color:#e6db74">&amp;#39;{&amp;#34;username&amp;#34;: &amp;#34;&amp;lt;ldap_user&amp;gt;&amp;#34;}&amp;#39;&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ul>&lt;li>The Hugging Face token must be set as an environment variable (&lt;code>HUGGINGFACE_TOKEN&lt;/code>) in the shell from which the requests are sent.&lt;/li>&lt;li>The value of the &lt;code>username&lt;/code> field must be enclosed in double quotation marks (&amp;quot;&amp;quot;).&lt;/li>&lt;li>After running the request above, save the returned API key as an environment variable to make future requests easier. 
To save it, copy the returned API key and run the command:&lt;/li>&lt;/ul>&lt;pre tabindex="0">&lt;code>export API_KEY=&amp;lt;returned_api_key&amp;gt;&lt;/code>&lt;/pre>&lt;h4 id="load-model-1">Load Model&lt;/h4>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>curl -X POST &lt;span style="color:#e6db74">&amp;#34;http://&amp;lt;power9_ip_server&amp;gt;:8000/load_model&amp;#34;&lt;/span> &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> -H &lt;span style="color:#e6db74">&amp;#34;Content-Type: application/json&amp;#34;&lt;/span> &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> -H &lt;span style="color:#e6db74">&amp;#34;x-api-key: &lt;/span>$API_KEY&lt;span style="color:#e6db74">&amp;#34;&lt;/span> &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> -d &lt;span style="color:#e6db74">&amp;#39;{&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> &amp;#34;model_name&amp;#34;:&amp;#34;ibm-granite/granite-3.3-8b-instruct&amp;#34;,&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> &amp;#34;hf_token&amp;#34;:&amp;#34;&amp;#39;&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>$HUGGINGFACE_TOKEN&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">&amp;#39;&amp;#34;&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> }&amp;#39;&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h4 id="generate-text-1">Generate Text&lt;/h4>&lt;div 
class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>curl -X POST &lt;span style="color:#e6db74">&amp;#34;http://&amp;lt;power9_ip_server&amp;gt;:8000/generate&amp;#34;&lt;/span> &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> -H &lt;span style="color:#e6db74">&amp;#34;Content-Type: application/json&amp;#34;&lt;/span> &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> -H &lt;span style="color:#e6db74">&amp;#34;x-api-key: &lt;/span>$API_KEY&lt;span style="color:#e6db74">&amp;#34;&lt;/span> &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> -d &lt;span style="color:#e6db74">&amp;#39;{&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> &amp;#34;model_name&amp;#34;: &amp;#34;ibm-granite/granite-3.3-8b-instruct&amp;#34;,&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> &amp;#34;prompt&amp;#34;:&amp;#34;Hello, tell me a little about the Federal University of Campina Grande (UFCG)&amp;#34;,&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> &amp;#34;hf_token&amp;#34;: &amp;#34;&amp;#39;&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>$HUGGINGFACE_TOKEN&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">&amp;#39;&amp;#34;,&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> &amp;#34;max_tokens&amp;#34;:50&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span 
style="color:#e6db74"> }&amp;#39;&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h4 id="check-status-and-unload-the-model-1">Check status and unload the model&lt;/h4>&lt;p>To check the status and unload the model, we don&amp;rsquo;t need to send anything in the payload—just the header with the API key:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>curl -X GET &lt;span style="color:#e6db74">&amp;#34;http://&amp;lt;power9_ip_server&amp;gt;:8000/status&amp;#34;&lt;/span> &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> -H &lt;span style="color:#e6db74">&amp;#34;Content-Type: application/json&amp;#34;&lt;/span> &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> -H &lt;span style="color:#e6db74">&amp;#34;x-api-key: &lt;/span>$API_KEY&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>curl -X POST &lt;span style="color:#e6db74">&amp;#34;http://&amp;lt;power9_ip_server&amp;gt;:8000/unload_model&amp;#34;&lt;/span> &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> -H &lt;span style="color:#e6db74">&amp;#34;Content-Type: application/json&amp;#34;&lt;/span> &lt;span style="color:#ae81ff">\&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">&lt;/span> -H &lt;span 
style="color:#e6db74">&amp;#34;x-api-key: &lt;/span>$API_KEY&lt;span style="color:#e6db74">&amp;#34;&lt;/span> &lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>We hope this series has helped clarify the full development and deployment process. The LLM-IBM-UFCG team is available for questions or suggestions about future improvements.&lt;/p></description></item><item><title>Building an API for LLM inferences on IBM Power9 servers</title><link>https://llm-pt-ibm.github.io/en/posts/tutorial_power9_pt3_en/</link><pubDate>Wed, 02 Jul 2025 00:00:00 +0000</pubDate><guid>https://llm-pt-ibm.github.io/en/posts/tutorial_power9_pt3_en/</guid><description>&lt;h2 id="background">Background&lt;/h2>&lt;p>This is the third post in a tutorial series designed to show step by step how to build an LLM API on a Power9 server, from operating system setup to remote inference execution. We already configured the operating system, NVIDIA drivers, CUDA, and cuDNN in the &lt;a href="https://llm-pt-ibm.github.io/en/posts/tutorial_power9_pt1_en/">&lt;span class="link-personalizado">first post&lt;/span>&lt;/a>, and installed Conda and PyTorch in the &lt;a href="https://llm-pt-ibm.github.io/en/posts/tutorial_power9_pt2_en/">&lt;span class="link-personalizado">second post&lt;/span>&lt;/a>. In this stage, we will build the API using FastAPI and the Transformers library, downloading models from Hugging Face and running the web server with uvicorn.&lt;/p>&lt;p>The implemented API will support generating API keys, loading models, performing inferences, checking status, and unloading models.&lt;/p>&lt;p>&lt;strong>FastAPI&lt;/strong>: a modern web framework for building APIs with Python 3.8+, based on standard Python type hints and async programming. It is designed to be fast, easy to use, and robust, making API development more efficient.&lt;/p>&lt;p>&lt;strong>Transformers&lt;/strong>: an open-source library developed by Hugging Face. 
It offers easy and efficient access to a wide collection of state-of-the-art pretrained models for Natural Language Processing (NLP), computer vision, and audio.&lt;/p>&lt;p>&lt;strong>Hugging Face&lt;/strong>: a platform focused on artificial intelligence, known for hosting models for NLP and other tasks. The Hugging Face Hub is a collaborative repository where developers and researchers can share, version, and download ready-to-use models, making access and integration easier.&lt;/p>&lt;p>&lt;strong>Uvicorn&lt;/strong>: a high-performance ASGI (Asynchronous Server Gateway Interface) web server for asynchronous Python applications.&lt;/p>&lt;h2 id="tldr">TL;DR&lt;/h2>&lt;ul>&lt;li>This post provides a step-by-step guide to implementing an API that performs LLM inferences.&lt;/li>&lt;li>We will use FastAPI and Transformers to develop this API and Hugging Face to download the models.&lt;/li>&lt;/ul>&lt;h2 id="environment-setup">Environment Setup&lt;/h2>&lt;h4 id="directory-structure">Directory Structure&lt;/h4>&lt;p>Start by creating the basic project structure:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-txt" data-lang="txt">&lt;span style="display:flex;">&lt;span>model_api/&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>├── requirements.txt&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>├── app/&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ ├── __init__.py&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ ├── main.py&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ ├── schemas.py&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ ├── auth.py&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ ├── model_manager.py&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>│ ├── utils.py&lt;/span>&lt;/span>&lt;span 
style="display:flex;">&lt;span>│ └── apikey_store.json&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>└── README.md (optional)&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h4 id="requirementstxt-file">&lt;code>requirements.txt&lt;/code> File&lt;/h4>&lt;p>We will use FastAPI and Transformers to build the API. Additionally, we will use uvicorn to run the server, pydantic for input data validation, and torch, which we installed in the &lt;a href="https://llm-pt-ibm.github.io/en/posts/tutorial_power9_pt2_en/">previous tutorial&lt;/a>.&lt;/p>&lt;p>First, we&amp;rsquo;ll install the required libraries and then populate the &lt;code>requirements.txt&lt;/code> file. Remember to activate your &lt;code>conda&lt;/code> environment if you created one, to ensure proper use of &lt;code>pytorch&lt;/code>.&lt;/p>&lt;pre tabindex="0">&lt;code>conda activate llm_api
pip install fastapi uvicorn transformers&lt;/code>&lt;/pre>&lt;p>The &lt;code>requirements.txt&lt;/code> file will look like this:&lt;/p>&lt;p>&lt;strong>requirements.txt&lt;/strong>&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-txt" data-lang="txt">&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1&lt;/span>&lt;span>fastapi&amp;gt;=0.104.0&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">2&lt;/span>&lt;span>uvicorn&amp;gt;=0.24.0&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">3&lt;/span>&lt;span>torch&amp;gt;=2.0.0&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span 
style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">4&lt;/span>&lt;span>transformers&amp;gt;=4.35.0&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">5&lt;/span>&lt;span>pydantic&amp;gt;=2.0.0&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h4 id="api-key-storage-file">API Key Storage File&lt;/h4>&lt;p>The &lt;code>apikey_store.json&lt;/code> file will store the generated API keys. We will start with it empty, containing only &lt;code>{}&lt;/code>.&lt;/p>&lt;p>&lt;strong>apikey_store.json&lt;/strong>&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1&lt;/span>&lt;span>{}&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="schemas-and-data-validation">Schemas and Data Validation&lt;/h2>&lt;p>Schemas are essential for validating the API&amp;rsquo;s input and output data. They ensure data is in the correct format and enable automatic documentation generation.&lt;/p>&lt;p>We will create the &lt;code>app/schemas.py&lt;/code> file containing all the data models. 
We will define four models: &lt;code>GenerateRequest&lt;/code>, &lt;code>LoadModelRequest&lt;/code>, &lt;code>ApiKeyResponse&lt;/code>, and &lt;code>LDAPUserRequest&lt;/code>.&lt;/p>&lt;p>&lt;strong>schemas.py&lt;/strong>&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 1&lt;/span>&lt;span>&lt;span style="color:#f92672">from&lt;/span> pydantic &lt;span style="color:#f92672">import&lt;/span> BaseModel, Field&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 2&lt;/span>&lt;span>&lt;span style="color:#f92672">from&lt;/span> typing &lt;span style="color:#f92672">import&lt;/span> Optional&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 3&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 4&lt;/span>&lt;span>&lt;span style="color:#66d9ef">class&lt;/span> &lt;span style="color:#a6e22e">GenerateRequest&lt;/span>(BaseModel):&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 5&lt;/span>&lt;span> model_name: str &lt;span style="color:#f92672">=&lt;/span> Field(&lt;span style="color:#f92672">...&lt;/span>, description&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;The name of the model to use for 
generation.&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 6&lt;/span>&lt;span> prompt: str &lt;span style="color:#f92672">=&lt;/span> Field(&lt;span style="color:#f92672">...&lt;/span>, description&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;The input text to generate a response for.&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 7&lt;/span>&lt;span> max_tokens: Optional[int] &lt;span style="color:#f92672">=&lt;/span> Field(&lt;span style="color:#ae81ff">300&lt;/span>, description&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;The maximum length of the generated response.&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 8&lt;/span>&lt;span> temperature: Optional[float] &lt;span style="color:#f92672">=&lt;/span> Field(&lt;span style="color:#ae81ff">1.0&lt;/span>, description&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;The sampling temperature for generation.&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 9&lt;/span>&lt;span> top_p: Optional[float] &lt;span style="color:#f92672">=&lt;/span> Field(&lt;span style="color:#ae81ff">1.0&lt;/span>, description&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;The cumulative probability for nucleus sampling.&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span 
style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">10&lt;/span>&lt;span> hf_token: Optional[str] &lt;span style="color:#f92672">=&lt;/span> Field(&lt;span style="color:#66d9ef">None&lt;/span>, description&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;The Hugging Face access token to use, if applicable.&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">11&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">12&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">13&lt;/span>&lt;span>&lt;span style="color:#66d9ef">class&lt;/span> &lt;span style="color:#a6e22e">LoadModelRequest&lt;/span>(BaseModel):&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">14&lt;/span>&lt;span> model_name: str &lt;span style="color:#f92672">=&lt;/span> Field(&lt;span style="color:#f92672">...&lt;/span>, description&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;The name of the model to load.&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">15&lt;/span>&lt;span> device: Optional[str] &lt;span style="color:#f92672">=&lt;/span> Field(&lt;span style="color:#e6db74">&amp;#34;cuda&amp;#34;&lt;/span>, 
description&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;The device to load the model on (e.g., &amp;#39;cpu&amp;#39;, &amp;#39;cuda&amp;#39;).&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">16&lt;/span>&lt;span> hf_token: Optional[str] &lt;span style="color:#f92672">=&lt;/span> Field(&lt;span style="color:#66d9ef">None&lt;/span>, description&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;The Hugging Face access token to use, if applicable.&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">17&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">18&lt;/span>&lt;span>&lt;span style="color:#66d9ef">class&lt;/span> &lt;span style="color:#a6e22e">ApiKeyResponse&lt;/span>(BaseModel):&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">19&lt;/span>&lt;span> api_key: str &lt;span style="color:#f92672">=&lt;/span> Field(&lt;span style="color:#f92672">...&lt;/span>, description&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;The API key for accessing the model API.&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">20&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span 
style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">21&lt;/span>&lt;span>&lt;span style="color:#66d9ef">class&lt;/span> &lt;span style="color:#a6e22e">LDAPUserRequest&lt;/span>(BaseModel):&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">22&lt;/span>&lt;span> username: str &lt;span style="color:#f92672">=&lt;/span> Field(&lt;span style="color:#f92672">...&lt;/span>, description&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;The username for LDAP authentication.&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ul>&lt;li>All classes inherit from &lt;code>pydantic&lt;/code>&amp;rsquo;s &lt;code>BaseModel&lt;/code>, gaining validation, serialization, and automatic documentation features.&lt;/li>&lt;li>The &lt;code>Field(...)&lt;/code> declaration defines a required field with no default value.&lt;/li>&lt;li>The &lt;code>Field(value)&lt;/code> declaration defines a field that may be omitted from the request, in which case it takes &lt;code>value&lt;/code> as its default.&lt;/li>&lt;li>The &lt;code>Optional[type]&lt;/code> annotation indicates the field may be &lt;code>None&lt;/code>; when a value is provided, it must be of type &lt;code>type&lt;/code>.&lt;/li>&lt;/ul>&lt;p>With the schemas defined, let&amp;rsquo;s create the file responsible for API Key authentication.&lt;/p>&lt;h2 id="authentication-and-api-keys">Authentication and API Keys&lt;/h2>&lt;p>The authentication system protects your API by ensuring that only authorized users can access the endpoints. 
We will implement a mechanism based on API Keys.&lt;/p>&lt;p>Let&amp;rsquo;s create the &lt;code>app/auth.py&lt;/code> file with all the authentication functionalities.&lt;/p>&lt;p>&lt;strong>auth.py&lt;/strong>&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 1&lt;/span>&lt;span>&lt;span style="color:#f92672">import&lt;/span> secrets &lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 2&lt;/span>&lt;span>&lt;span style="color:#f92672">import&lt;/span> json&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 3&lt;/span>&lt;span>&lt;span style="color:#f92672">from&lt;/span> fastapi &lt;span style="color:#f92672">import&lt;/span> HTTPException, Request&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 4&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 5&lt;/span>&lt;span>APIKEY_STORE_FILE &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&amp;#34;app/apikey_store.json&amp;#34;&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 
6&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 7&lt;/span>&lt;span>&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">load_apikeys&lt;/span>():&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 8&lt;/span>&lt;span> &lt;span style="color:#66d9ef">try&lt;/span>:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 9&lt;/span>&lt;span> &lt;span style="color:#66d9ef">with&lt;/span> open(APIKEY_STORE_FILE, &lt;span style="color:#e6db74">&amp;#34;r&amp;#34;&lt;/span>) &lt;span style="color:#66d9ef">as&lt;/span> f:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">10&lt;/span>&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> json&lt;span style="color:#f92672">.&lt;/span>load(f)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">11&lt;/span>&lt;span> &lt;span style="color:#66d9ef">except&lt;/span> &lt;span style="color:#a6e22e">FileNotFoundError&lt;/span>:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">12&lt;/span>&lt;span> &lt;span style="color:#66d9ef">raise&lt;/span> HTTPException(&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 
0.4em 0 0.4em;color:#7f7f7f">13&lt;/span>&lt;span> status_code&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">404&lt;/span>,&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">14&lt;/span>&lt;span> detail&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;API keys file not found: &lt;/span>&lt;span style="color:#e6db74">{&lt;/span>APIKEY_STORE_FILE&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">15&lt;/span>&lt;span> &lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">16&lt;/span>&lt;span>&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">save_apikeys&lt;/span>(keys: dict):&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">17&lt;/span>&lt;span> &lt;span style="color:#66d9ef">with&lt;/span> open(APIKEY_STORE_FILE, &lt;span style="color:#e6db74">&amp;#34;w&amp;#34;&lt;/span>) &lt;span style="color:#66d9ef">as&lt;/span> f:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">18&lt;/span>&lt;span> json&lt;span style="color:#f92672">.&lt;/span>dump(keys, f, indent&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">4&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span 
style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">19&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">20&lt;/span>&lt;span>&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">generate_apikey&lt;/span>(user:str) &lt;span style="color:#f92672">-&amp;gt;&lt;/span> str:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">21&lt;/span>&lt;span> key &lt;span style="color:#f92672">=&lt;/span> secrets&lt;span style="color:#f92672">.&lt;/span>token_hex(&lt;span style="color:#ae81ff">32&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">22&lt;/span>&lt;span> keys &lt;span style="color:#f92672">=&lt;/span> load_apikeys()&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">23&lt;/span>&lt;span> keys[user] &lt;span style="color:#f92672">=&lt;/span> key&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">24&lt;/span>&lt;span> save_apikeys(keys)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">25&lt;/span>&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> key&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span 
style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">26&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">27&lt;/span>&lt;span>&lt;span style="color:#66d9ef">async&lt;/span> &lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">verify_apikey&lt;/span>(request: Request) &lt;span style="color:#f92672">-&amp;gt;&lt;/span> bool:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">28&lt;/span>&lt;span> apikey &lt;span style="color:#f92672">=&lt;/span> request&lt;span style="color:#f92672">.&lt;/span>headers&lt;span style="color:#f92672">.&lt;/span>get(&lt;span style="color:#e6db74">&amp;#34;x-API-Key&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">29&lt;/span>&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> &lt;span style="color:#f92672">not&lt;/span> apikey:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">30&lt;/span>&lt;span> &lt;span style="color:#66d9ef">raise&lt;/span> HTTPException(&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">31&lt;/span>&lt;span> status_code&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">401&lt;/span>,&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span 
style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">32&lt;/span>&lt;span> detail&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;API key not provided.&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">33&lt;/span>&lt;span> &lt;span style="color:#66d9ef">try&lt;/span>:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">34&lt;/span>&lt;span> keys &lt;span style="color:#f92672">=&lt;/span> load_apikeys()&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">35&lt;/span>&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> apikey &lt;span style="color:#f92672">in&lt;/span> keys&lt;span style="color:#f92672">.&lt;/span>values():&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">36&lt;/span>&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> &lt;span style="color:#66d9ef">True&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">37&lt;/span>&lt;span> &lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">38&lt;/span>&lt;span> &lt;span style="color:#66d9ef">except&lt;/span> json&lt;span style="color:#f92672">.&lt;/span>JSONDecodeError:&lt;/span>&lt;/span>&lt;span 
style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">39&lt;/span>&lt;span> &lt;span style="color:#66d9ef">raise&lt;/span> HTTPException(&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">40&lt;/span>&lt;span> status_code&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">403&lt;/span>,&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">41&lt;/span>&lt;span> detail&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;Invalid API Key&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ul>&lt;li>The &lt;code>load_apikeys&lt;/code> function loads the information stored in the &lt;code>app/apikey_store.json&lt;/code> file.&lt;/li>&lt;li>&lt;code>save_apikeys&lt;/code> is responsible for saving the content in JSON format.&lt;/li>&lt;li>The &lt;code>generate_apikey&lt;/code> function creates a key for a user and adds it to the dictionary using the provided username as the key.&lt;/li>&lt;li>&lt;code>verify_apikey&lt;/code> is called whenever a request arrives, to validate the key it carries.&lt;/li>&lt;/ul>&lt;h2 id="model-and-gpu-manager">Model and GPU Manager&lt;/h2>&lt;p>The &lt;code>app/model_manager.py&lt;/code> module is the core of the API, responsible for loading, managing, and running LLMs. 
It keeps at most one model in memory at a time, spreads it across the available devices, and releases GPU memory when the model is swapped or unloaded.&lt;/p>&lt;p>&lt;strong>model_manager.py&lt;/strong>&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 1&lt;/span>&lt;span>&lt;span style="color:#f92672">import&lt;/span> torch &lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 2&lt;/span>&lt;span>&lt;span style="color:#f92672">from&lt;/span> transformers &lt;span style="color:#f92672">import&lt;/span> AutoTokenizer, AutoModelForCausalLM&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 3&lt;/span>&lt;span>&lt;span style="color:#f92672">from&lt;/span> fastapi &lt;span style="color:#f92672">import&lt;/span> HTTPException&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 4&lt;/span>&lt;span>&lt;span style="color:#f92672">import&lt;/span> gc&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 5&lt;/span>&lt;span>&lt;span style="color:#f92672">from&lt;/span> .utils &lt;span style="color:#f92672">import&lt;/span> is_model_on_gpu&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 
6&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 7&lt;/span>&lt;span>DEVICE &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&amp;#34;cuda&amp;#34;&lt;/span> &lt;span style="color:#66d9ef">if&lt;/span> torch&lt;span style="color:#f92672">.&lt;/span>cuda&lt;span style="color:#f92672">.&lt;/span>is_available() &lt;span style="color:#66d9ef">else&lt;/span> &lt;span style="color:#e6db74">&amp;#34;cpu&amp;#34;&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 8&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 9&lt;/span>&lt;span>&lt;span style="color:#66d9ef">class&lt;/span> &lt;span style="color:#a6e22e">ModelManager&lt;/span>:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">10&lt;/span>&lt;span> &lt;span style="color:#66d9ef">def&lt;/span> __init__(self):&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">11&lt;/span>&lt;span> self&lt;span style="color:#f92672">.&lt;/span>model &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#66d9ef">None&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">12&lt;/span>&lt;span> self&lt;span style="color:#f92672">.&lt;/span>tokenizer &lt;span 
style="color:#f92672">=&lt;/span> &lt;span style="color:#66d9ef">None&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">13&lt;/span>&lt;span> self&lt;span style="color:#f92672">.&lt;/span>model_name &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#66d9ef">None&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">14&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">15&lt;/span>&lt;span> &lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">load_model&lt;/span>(self, model_name: str, hf_token:str &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#66d9ef">None&lt;/span>, device: str &lt;span style="color:#f92672">=&lt;/span> DEVICE):&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">16&lt;/span>&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>model_name &lt;span style="color:#f92672">!=&lt;/span> &lt;span style="color:#66d9ef">None&lt;/span> &lt;span style="color:#f92672">and&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>model_name &lt;span style="color:#f92672">!=&lt;/span> model_name:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">17&lt;/span>&lt;span> print(&lt;span style="color:#e6db74">&amp;#34;Removing previously loaded 
model...&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">18&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">19&lt;/span>&lt;span> self&lt;span style="color:#f92672">.&lt;/span>unload_model() &lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">20&lt;/span>&lt;span> print(&lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;Loading model &lt;/span>&lt;span style="color:#e6db74">{&lt;/span>model_name&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74"> on device &lt;/span>&lt;span style="color:#e6db74">{&lt;/span>device&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74">...&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">21&lt;/span>&lt;span> &lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">22&lt;/span>&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>model_name &lt;span style="color:#f92672">!=&lt;/span> model_name:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">23&lt;/span>&lt;span> &lt;span style="color:#66d9ef">try&lt;/span>: &lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span 
style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">24&lt;/span>&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> hf_token: &lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">25&lt;/span>&lt;span> self&lt;span style="color:#f92672">.&lt;/span>tokenizer &lt;span style="color:#f92672">=&lt;/span> AutoTokenizer&lt;span style="color:#f92672">.&lt;/span>from_pretrained(model_name, token&lt;span style="color:#f92672">=&lt;/span>hf_token)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">26&lt;/span>&lt;span> self&lt;span style="color:#f92672">.&lt;/span>model &lt;span style="color:#f92672">=&lt;/span> AutoModelForCausalLM&lt;span style="color:#f92672">.&lt;/span>from_pretrained(model_name, device_map&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;balanced&amp;#34;&lt;/span>, token&lt;span style="color:#f92672">=&lt;/span>hf_token)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">27&lt;/span>&lt;span> &lt;span style="color:#66d9ef">else&lt;/span>:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">28&lt;/span>&lt;span> self&lt;span style="color:#f92672">.&lt;/span>tokenizer &lt;span style="color:#f92672">=&lt;/span> AutoTokenizer&lt;span style="color:#f92672">.&lt;/span>from_pretrained(model_name)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 
0.4em 0 0.4em;color:#7f7f7f">29&lt;/span>&lt;span> self&lt;span style="color:#f92672">.&lt;/span>model &lt;span style="color:#f92672">=&lt;/span> AutoModelForCausalLM&lt;span style="color:#f92672">.&lt;/span>from_pretrained(model_name, device_map&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;balanced&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">30&lt;/span>&lt;span> self&lt;span style="color:#f92672">.&lt;/span>model&lt;span style="color:#f92672">.&lt;/span>eval()&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">31&lt;/span>&lt;span> self&lt;span style="color:#f92672">.&lt;/span>model_name &lt;span style="color:#f92672">=&lt;/span> model_name&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">32&lt;/span>&lt;span> print(is_model_on_gpu(self&lt;span style="color:#f92672">.&lt;/span>model&lt;span style="color:#f92672">.&lt;/span>hf_device_map, self&lt;span style="color:#f92672">.&lt;/span>model_name))&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">33&lt;/span>&lt;span> &lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">34&lt;/span>&lt;span> &lt;span style="color:#66d9ef">except&lt;/span> &lt;span style="color:#a6e22e">Exception&lt;/span> &lt;span style="color:#66d9ef">as&lt;/span> e:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span 
style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">35&lt;/span>&lt;span> &lt;span style="color:#66d9ef">raise&lt;/span> HTTPException(status_code&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">500&lt;/span>, detail&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;Error loading model: &lt;/span>&lt;span style="color:#e6db74">{&lt;/span>str(e)&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">36&lt;/span>&lt;span> &lt;span style="color:#66d9ef">else&lt;/span>:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">37&lt;/span>&lt;span> print(&lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;The model &lt;/span>&lt;span style="color:#e6db74">{&lt;/span>model_name&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74"> is already loaded.&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">38&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">39&lt;/span>&lt;span> &lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">generate&lt;/span>(self, model_name:str, hf_token: str, prompt:str, max_tokens:int &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">300&lt;/span>, 
temperature:float &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">1.0&lt;/span>, top_p:float &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">1.0&lt;/span>) &lt;span style="color:#f92672">-&amp;gt;&lt;/span> str:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">40&lt;/span>&lt;span> &lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">41&lt;/span>&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>model_name &lt;span style="color:#f92672">!=&lt;/span> model_name:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">42&lt;/span>&lt;span> self&lt;span style="color:#f92672">.&lt;/span>load_model(model_name, hf_token, device&lt;span style="color:#f92672">=&lt;/span>DEVICE)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">43&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">44&lt;/span>&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>model &lt;span style="color:#f92672">is&lt;/span> &lt;span style="color:#66d9ef">None&lt;/span> &lt;span style="color:#f92672">or&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>tokenizer &lt;span style="color:#f92672">is&lt;/span> &lt;span style="color:#66d9ef">None&lt;/span>:&lt;/span>&lt;/span>&lt;span 
style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">45&lt;/span>&lt;span> &lt;span style="color:#66d9ef">raise&lt;/span> HTTPException(status_code&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">400&lt;/span>, detail&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;No model loaded.&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">46&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">47&lt;/span>&lt;span> &lt;span style="color:#66d9ef">try&lt;/span>:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">48&lt;/span>&lt;span> inputs &lt;span style="color:#f92672">=&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>tokenizer(prompt, return_tensors&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;pt&amp;#34;&lt;/span>)&lt;span style="color:#f92672">.&lt;/span>to(self&lt;span style="color:#f92672">.&lt;/span>model&lt;span style="color:#f92672">.&lt;/span>device)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">49&lt;/span>&lt;span> &lt;span style="color:#66d9ef">with&lt;/span> torch&lt;span style="color:#f92672">.&lt;/span>no_grad(): &lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 
0.4em;color:#7f7f7f">50&lt;/span>&lt;span> outputs &lt;span style="color:#f92672">=&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>model&lt;span style="color:#f92672">.&lt;/span>generate(&lt;span style="color:#f92672">**&lt;/span>inputs, max_new_tokens&lt;span style="color:#f92672">=&lt;/span>max_tokens,temperature&lt;span style="color:#f92672">=&lt;/span>temperature, top_p&lt;span style="color:#f92672">=&lt;/span>top_p, eos_token_id&lt;span style="color:#f92672">=&lt;/span>self&lt;span style="color:#f92672">.&lt;/span>tokenizer&lt;span style="color:#f92672">.&lt;/span>eos_token_id)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">51&lt;/span>&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>tokenizer&lt;span style="color:#f92672">.&lt;/span>decode(outputs[&lt;span style="color:#ae81ff">0&lt;/span>], skip_special_tokens&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#66d9ef">True&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">52&lt;/span>&lt;span> &lt;span style="color:#66d9ef">except&lt;/span> &lt;span style="color:#a6e22e">Exception&lt;/span> &lt;span style="color:#66d9ef">as&lt;/span> e:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">53&lt;/span>&lt;span> &lt;span style="color:#66d9ef">raise&lt;/span> HTTPException(status_code&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">500&lt;/span>, detail&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;Error generating 
text: &lt;/span>&lt;span style="color:#e6db74">{&lt;/span>str(e)&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">54&lt;/span>&lt;span> &lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">55&lt;/span>&lt;span> &lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">get_status&lt;/span>(self) &lt;span style="color:#f92672">-&amp;gt;&lt;/span> str: &lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">56&lt;/span>&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>model &lt;span style="color:#f92672">is&lt;/span> &lt;span style="color:#66d9ef">None&lt;/span>:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">57&lt;/span>&lt;span> self&lt;span style="color:#f92672">.&lt;/span>unload_model()&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">58&lt;/span>&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> &lt;span style="color:#e6db74">&amp;#34;No model loaded.&amp;#34;&lt;/span> &lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">59&lt;/span>&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> is_model_on_gpu(self&lt;span 
style="color:#f92672">.&lt;/span>model&lt;span style="color:#f92672">.&lt;/span>hf_device_map, self&lt;span style="color:#f92672">.&lt;/span>model_name)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">60&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">61&lt;/span>&lt;span> &lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">unload_model&lt;/span>(self):&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">62&lt;/span>&lt;span> self&lt;span style="color:#f92672">.&lt;/span>model &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#66d9ef">None&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">63&lt;/span>&lt;span> self&lt;span style="color:#f92672">.&lt;/span>tokenizer &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#66d9ef">None&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">64&lt;/span>&lt;span> old_model &lt;span style="color:#f92672">=&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>model_name &lt;span style="color:#66d9ef">if&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>model_name &lt;span style="color:#66d9ef">else&lt;/span> &lt;span style="color:#66d9ef">False&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span 
style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">65&lt;/span>&lt;span> self&lt;span style="color:#f92672">.&lt;/span>model_name &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#66d9ef">None&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">66&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">67&lt;/span>&lt;span> gc&lt;span style="color:#f92672">.&lt;/span>collect()&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">68&lt;/span>&lt;span> torch&lt;span style="color:#f92672">.&lt;/span>cuda&lt;span style="color:#f92672">.&lt;/span>empty_cache()&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">69&lt;/span>&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> &lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;Model &lt;/span>&lt;span style="color:#e6db74">{&lt;/span>old_model&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74"> successfully unloaded.&amp;#34;&lt;/span> &lt;span style="color:#66d9ef">if&lt;/span> old_model &lt;span style="color:#66d9ef">else&lt;/span> &lt;span style="color:#e6db74">&amp;#34;No model loaded to unload.&amp;#34;&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">70&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span 
style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">71&lt;/span>&lt;span>manager &lt;span style="color:#f92672">=&lt;/span> ModelManager()&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ul>&lt;li>The &lt;code>load_model&lt;/code> function loads a new model into memory, removing any previously loaded model.&lt;/li>&lt;li>&lt;code>generate&lt;/code> is the main function of the API, responsible for performing model inference. It allows adjusting the parameters: temperature, top_p, and max_tokens.&lt;/li>&lt;li>&lt;code>get_status&lt;/code> reports whether there is a loaded model and whether it is on the GPU or CPU.&lt;/li>&lt;li>The &lt;code>unload_model&lt;/code> function removes the model from memory, clears the CUDA cache, and invokes Python’s garbage collector to avoid leftovers that could interfere with future loads.&lt;/li>&lt;/ul>&lt;h2 id="fastapi-api-endpoints">FastAPI API Endpoints&lt;/h2>&lt;p>The &lt;code>app/main.py&lt;/code> file is where all the components come together. 
In it, we define all the endpoints and the API’s routing logic.&lt;/p>&lt;p>&lt;strong>main.py&lt;/strong>&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 1&lt;/span>&lt;span>&lt;span style="color:#f92672">from&lt;/span> fastapi &lt;span style="color:#f92672">import&lt;/span> FastAPI, Request, HTTPException, Depends&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 2&lt;/span>&lt;span>&lt;span style="color:#f92672">from&lt;/span> fastapi.responses &lt;span style="color:#f92672">import&lt;/span> JSONResponse&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 3&lt;/span>&lt;span>&lt;span style="color:#f92672">from&lt;/span> app &lt;span style="color:#f92672">import&lt;/span> schemas, model_manager, auth&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 4&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 5&lt;/span>&lt;span>app &lt;span style="color:#f92672">=&lt;/span> FastAPI()&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 6&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span 
style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 7&lt;/span>&lt;span>&lt;span style="color:#66d9ef">async&lt;/span> &lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">require_api_key&lt;/span>(request: Request) &lt;span style="color:#f92672">-&amp;gt;&lt;/span> bool:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 8&lt;/span>&lt;span> user &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#66d9ef">await&lt;/span> auth&lt;span style="color:#f92672">.&lt;/span>verify_apikey(request)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 9&lt;/span>&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> &lt;span style="color:#f92672">not&lt;/span> user:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">10&lt;/span>&lt;span> &lt;span style="color:#66d9ef">raise&lt;/span> HTTPException(status_code&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">401&lt;/span>, detail&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#34;Invalid API Key&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">11&lt;/span>&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> user&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span 
style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">12&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">13&lt;/span>&lt;span>&lt;span style="color:#a6e22e">@app.post&lt;/span>(&lt;span style="color:#e6db74">&amp;#34;/generate_apikey&amp;#34;&lt;/span>)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">14&lt;/span>&lt;span>&lt;span style="color:#66d9ef">async&lt;/span> &lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">generate_apikey&lt;/span>(payload: schemas&lt;span style="color:#f92672">.&lt;/span>LDAPUserRequest) &lt;span style="color:#f92672">-&amp;gt;&lt;/span> JSONResponse:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">15&lt;/span>&lt;span>    key &lt;span style="color:#f92672">=&lt;/span> auth&lt;span style="color:#f92672">.&lt;/span>generate_apikey(payload&lt;span style="color:#f92672">.&lt;/span>username)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">16&lt;/span>&lt;span>    &lt;span style="color:#66d9ef">return&lt;/span> JSONResponse(status_code&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">200&lt;/span>, content&lt;span style="color:#f92672">=&lt;/span>{&lt;span style="color:#e6db74">&amp;#34;api_key&amp;#34;&lt;/span>: key})&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 
0.4em;color:#7f7f7f">17&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">18&lt;/span>&lt;span>&lt;span style="color:#a6e22e">@app.post&lt;/span>(&lt;span style="color:#e6db74">&amp;#34;/load_model&amp;#34;&lt;/span>, dependencies&lt;span style="color:#f92672">=&lt;/span>[Depends(require_api_key)])&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">19&lt;/span>&lt;span>&lt;span style="color:#66d9ef">async&lt;/span> &lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">load_model&lt;/span>(payload: schemas&lt;span style="color:#f92672">.&lt;/span>LoadModelRequest) &lt;span style="color:#f92672">-&amp;gt;&lt;/span> JSONResponse:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">20&lt;/span>&lt;span>    &lt;span style="color:#66d9ef">try&lt;/span>:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">21&lt;/span>&lt;span>        model_manager&lt;span style="color:#f92672">.&lt;/span>manager&lt;span style="color:#f92672">.&lt;/span>load_model(payload&lt;span style="color:#f92672">.&lt;/span>model_name, payload&lt;span style="color:#f92672">.&lt;/span>hf_token, payload&lt;span style="color:#f92672">.&lt;/span>device)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">22&lt;/span>&lt;span>        &lt;span style="color:#66d9ef">return&lt;/span> JSONResponse(content&lt;span 
style="color:#f92672">=&lt;/span>{&lt;span style="color:#e6db74">&amp;#34;message&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;Model &lt;/span>&lt;span style="color:#e6db74">{&lt;/span>payload&lt;span style="color:#f92672">.&lt;/span>model_name&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74"> loaded successfully.&amp;#34;&lt;/span>})&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">23&lt;/span>&lt;span>    &lt;span style="color:#66d9ef">except&lt;/span> &lt;span style="color:#a6e22e">Exception&lt;/span> &lt;span style="color:#66d9ef">as&lt;/span> e:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">24&lt;/span>&lt;span>        &lt;span style="color:#66d9ef">raise&lt;/span> HTTPException(status_code&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">500&lt;/span>, detail&lt;span style="color:#f92672">=&lt;/span>{&lt;span style="color:#e6db74">&amp;#34;error&amp;#34;&lt;/span>: str(e)})&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">25&lt;/span>&lt;span> &lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">26&lt;/span>&lt;span>&lt;span style="color:#a6e22e">@app.post&lt;/span>(&lt;span style="color:#e6db74">&amp;#34;/generate&amp;#34;&lt;/span>, dependencies&lt;span style="color:#f92672">=&lt;/span>[Depends(require_api_key)])&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span 
style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">27&lt;/span>&lt;span>&lt;span style="color:#66d9ef">async&lt;/span> &lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">generate&lt;/span>(payload: schemas&lt;span style="color:#f92672">.&lt;/span>GenerateRequest) &lt;span style="color:#f92672">-&amp;gt;&lt;/span> JSONResponse:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">28&lt;/span>&lt;span>    &lt;span style="color:#66d9ef">try&lt;/span>:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">29&lt;/span>&lt;span>        result &lt;span style="color:#f92672">=&lt;/span> model_manager&lt;span style="color:#f92672">.&lt;/span>manager&lt;span style="color:#f92672">.&lt;/span>generate(payload&lt;span style="color:#f92672">.&lt;/span>model_name, payload&lt;span style="color:#f92672">.&lt;/span>hf_token, payload&lt;span style="color:#f92672">.&lt;/span>prompt, payload&lt;span style="color:#f92672">.&lt;/span>max_tokens, payload&lt;span style="color:#f92672">.&lt;/span>temperature, payload&lt;span style="color:#f92672">.&lt;/span>top_p)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">30&lt;/span>&lt;span>        &lt;span style="color:#66d9ef">return&lt;/span> {&lt;span style="color:#e6db74">&amp;#34;result&amp;#34;&lt;/span>: result}&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">31&lt;/span>&lt;span>    &lt;span style="color:#66d9ef">except&lt;/span> &lt;span 
style="color:#a6e22e">Exception&lt;/span> &lt;span style="color:#66d9ef">as&lt;/span> e:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">32&lt;/span>&lt;span>        &lt;span style="color:#66d9ef">return&lt;/span> JSONResponse(status_code&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">500&lt;/span>, content&lt;span style="color:#f92672">=&lt;/span>{&lt;span style="color:#e6db74">&amp;#34;error&amp;#34;&lt;/span>: str(e)})&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">33&lt;/span>&lt;span> &lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">34&lt;/span>&lt;span>&lt;span style="color:#a6e22e">@app.get&lt;/span>(&lt;span style="color:#e6db74">&amp;#34;/status&amp;#34;&lt;/span>, dependencies&lt;span style="color:#f92672">=&lt;/span>[Depends(require_api_key)])&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">35&lt;/span>&lt;span>&lt;span style="color:#66d9ef">async&lt;/span> &lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">status&lt;/span>() &lt;span style="color:#f92672">-&amp;gt;&lt;/span> JSONResponse:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">36&lt;/span>&lt;span>    str_status &lt;span style="color:#f92672">=&lt;/span> model_manager&lt;span style="color:#f92672">.&lt;/span>manager&lt;span style="color:#f92672">.&lt;/span>get_status()&lt;/span>&lt;/span>&lt;span 
style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">37&lt;/span>&lt;span>    &lt;span style="color:#66d9ef">return&lt;/span> JSONResponse(content&lt;span style="color:#f92672">=&lt;/span>{&lt;span style="color:#e6db74">&amp;#34;status&amp;#34;&lt;/span>: str_status})&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">38&lt;/span>&lt;span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">39&lt;/span>&lt;span>&lt;span style="color:#a6e22e">@app.post&lt;/span>(&lt;span style="color:#e6db74">&amp;#34;/unload_model&amp;#34;&lt;/span>, dependencies&lt;span style="color:#f92672">=&lt;/span>[Depends(require_api_key)])&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">40&lt;/span>&lt;span>&lt;span style="color:#66d9ef">async&lt;/span> &lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">unload_model&lt;/span>() &lt;span style="color:#f92672">-&amp;gt;&lt;/span> JSONResponse:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">41&lt;/span>&lt;span>    &lt;span style="color:#66d9ef">try&lt;/span>:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">42&lt;/span>&lt;span>        str_unload &lt;span style="color:#f92672">=&lt;/span> model_manager&lt;span style="color:#f92672">.&lt;/span>manager&lt;span 
style="color:#f92672">.&lt;/span>unload_model()&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">43&lt;/span>&lt;span>        &lt;span style="color:#66d9ef">return&lt;/span> JSONResponse(content&lt;span style="color:#f92672">=&lt;/span>{&lt;span style="color:#e6db74">&amp;#34;message&amp;#34;&lt;/span>: str_unload})&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">44&lt;/span>&lt;span>    &lt;span style="color:#66d9ef">except&lt;/span> &lt;span style="color:#a6e22e">Exception&lt;/span> &lt;span style="color:#66d9ef">as&lt;/span> e:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">45&lt;/span>&lt;span>        &lt;span style="color:#66d9ef">raise&lt;/span> HTTPException(status_code&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#ae81ff">500&lt;/span>, detail&lt;span style="color:#f92672">=&lt;/span>{&lt;span style="color:#e6db74">&amp;#34;error&amp;#34;&lt;/span>: str(e)})&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;ul>&lt;li>The &lt;code>require_api_key&lt;/code> function checks the API key on each request and either returns the authenticated user or raises a 401 error.&lt;/li>&lt;li>&lt;code>generate_apikey&lt;/code> creates and returns a new API key for the specified user.&lt;/li>&lt;li>&lt;code>load_model&lt;/code> loads the specified model. 
If needed, it also accepts a Hugging Face token.&lt;/li>&lt;li>The &lt;code>generate&lt;/code> function makes the model perform inference using the given prompt and parameters.&lt;/li>&lt;li>Calling the &lt;code>status&lt;/code> endpoint returns the current status of the model manager.&lt;/li>&lt;li>&lt;code>unload_model&lt;/code> unloads the currently loaded model and returns a success message if completed properly.&lt;/li>&lt;/ul>&lt;h2 id="utilspy-file">&lt;code>utils.py&lt;/code> File&lt;/h2>&lt;p>The &lt;code>app/utils.py&lt;/code> file contains the function that checks whether the loaded model is fully or partially on the GPU, or if it was loaded on the CPU.&lt;/p>&lt;p>&lt;strong>utils.py&lt;/strong>&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">1&lt;/span>&lt;span>&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">is_model_on_gpu&lt;/span>(hf_device_map: dict, model_name: str) &lt;span style="color:#f92672">-&amp;gt;&lt;/span> str:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">2&lt;/span>&lt;span>    &lt;span style="color:#66d9ef">if&lt;/span> &lt;span style="color:#e6db74">&amp;#39;&amp;#39;&lt;/span> &lt;span style="color:#f92672">in&lt;/span> hf_device_map&lt;span style="color:#f92672">.&lt;/span>keys() &lt;span style="color:#f92672">and&lt;/span> hf_device_map[&lt;span style="color:#e6db74">&amp;#39;&amp;#39;&lt;/span>] &lt;span style="color:#f92672">==&lt;/span> &lt;span style="color:#e6db74">&amp;#39;cpu&amp;#39;&lt;/span>:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span 
style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">3&lt;/span>&lt;span>        &lt;span style="color:#66d9ef">return&lt;/span> &lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;Model &lt;/span>&lt;span style="color:#e6db74">{&lt;/span>model_name&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74"> fully loaded on CPU.&amp;#34;&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">4&lt;/span>&lt;span>    &lt;span style="color:#66d9ef">elif&lt;/span> &lt;span style="color:#e6db74">&amp;#39;cpu&amp;#39;&lt;/span> &lt;span style="color:#f92672">in&lt;/span> hf_device_map&lt;span style="color:#f92672">.&lt;/span>values():&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">5&lt;/span>&lt;span>        &lt;span style="color:#66d9ef">return&lt;/span> &lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;Some layers of the model &lt;/span>&lt;span style="color:#e6db74">{&lt;/span>model_name&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74"> are loaded on the CPU.&amp;#34;&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">6&lt;/span>&lt;span>    &lt;span style="color:#66d9ef">else&lt;/span>:&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">7&lt;/span>&lt;span>        &lt;span style="color:#66d9ef">return&lt;/span> &lt;span style="color:#e6db74">f&lt;/span>&lt;span 
style="color:#e6db74">&amp;#34;Model &lt;/span>&lt;span style="color:#e6db74">{&lt;/span>model_name&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74"> fully loaded on GPU.&amp;#34;&lt;/span>&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h2 id="running-the-api">Running the API&lt;/h2>&lt;p>To run the API with &lt;code>uvicorn&lt;/code>, run the following command, specifying the host and port on which the service should listen.&lt;/p>&lt;pre tabindex="0">&lt;code>uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload&lt;/code>&lt;/pre>&lt;ul>&lt;li>&lt;p>&lt;code>app.main:app&lt;/code> refers to the &lt;code>app&lt;/code> object (the FastAPI instance) defined in the &lt;code>app/main.py&lt;/code> file, which connects all components and handles user requests.&lt;/p>&lt;/li>&lt;li>&lt;p>&lt;code>--host 0.0.0.0&lt;/code> sets the IP address on which the Uvicorn server will listen. The value &lt;code>0.0.0.0&lt;/code> makes the server reachable through any network interface on the Power9 machine.&lt;/p>&lt;/li>&lt;li>&lt;p>&lt;code>--port 8000&lt;/code> specifies the port on which the server will listen for requests.&lt;/p>&lt;/li>&lt;li>&lt;p>&lt;code>--reload&lt;/code> is a flag for development use. It automatically restarts the server whenever the source code changes.&lt;/p>&lt;/li>&lt;/ul>&lt;p>By following this guide, you&amp;rsquo;ll have a working API capable of running LLM inference using models downloaded from Hugging Face. 
In the &lt;a href="https://llm-pt-ibm.github.io/en/posts/tutorial_power9_pt4_en/">&lt;span class="link-personalizado">next tutorial&lt;/span>&lt;/a>, we will show how to send requests to the API using curl and Python.&lt;/p></description></item><item><title>Setting Up the Conda and PyTorch on IBM Power9 Servers</title><link>https://llm-pt-ibm.github.io/en/posts/tutorial_power9_pt2_en/</link><pubDate>Mon, 30 Jun 2025 00:00:00 +0000</pubDate><guid>https://llm-pt-ibm.github.io/en/posts/tutorial_power9_pt2_en/</guid><description>&lt;h2 id="background">Background&lt;/h2>&lt;p>This is the second post in a tutorial series on how to build a Language Model API on an IBM Power9 server, covering everything from setting up the operating system to having the API running remote inference. The &lt;a href="https://llm-pt-ibm.github.io/en/posts/tutorial_power9_pt1_en/">&lt;span class="link-personalizado">first post&lt;/span>&lt;/a> covers installing the OS and configuring NVIDIA drivers, CUDA, and CUDNN. In this step, we&amp;rsquo;ll show how to set up the Conda package manager and the PyTorch library.&lt;/p>&lt;p>&lt;strong>Conda&lt;/strong>: Conda is an open-source, cross-platform package and environment management system. It&amp;rsquo;s like a &amp;ldquo;toolbox&amp;rdquo; for data scientists and developers to organize their projects.&lt;/p>&lt;p>&lt;strong>PyTorch&lt;/strong>: PyTorch is an open-source machine learning library developed primarily by Facebook AI Research (FAIR). 
It&amp;rsquo;s especially popular for building applications in deep learning, a subfield of machine learning inspired by how the human brain works.&lt;/p>&lt;h2 id="tldr">TL;DR&lt;/h2>&lt;ul>&lt;li>This post provides a step-by-step guide to installing Conda and PyTorch.&lt;/li>&lt;li>The main challenge is finding compatible versions for the Power9 machine architecture.&lt;/li>&lt;/ul>&lt;h2 id="setting-up-the-conda">Setting up the Conda&lt;/h2>&lt;p>We&amp;rsquo;ll start by installing &lt;strong>Conda&lt;/strong>. On Power systems, the architecture used is &lt;code>ppc64le&lt;/code> (PowerPC 64-bit little-endian), so it&amp;rsquo;s essential to download the version for this architecture. We&amp;rsquo;ll use &lt;strong>Miniconda&lt;/strong>, a lighter option that&amp;rsquo;s better suited for custom setups like the Power9 server.&lt;/p>&lt;ol>&lt;li>To download and install the latest version of Miniconda:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>sudo wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-ppc64le.sh
bash ~/Miniconda3-latest-Linux-ppc64le.sh&lt;/code>&lt;/pre>&lt;ol start="2">&lt;li>Check if Conda was activated automatically:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>conda --version&lt;/code>&lt;/pre>&lt;p>If it didn&amp;rsquo;t start automatically, you&amp;rsquo;ll need to activate it.&lt;/p>&lt;ol start="3">&lt;li>To ensure it&amp;rsquo;s automatically activated with each new connection, add the activation command to your &lt;code>.bashrc&lt;/code> (or &lt;code>.zshrc&lt;/code>) file and reload it:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>echo &amp;#39;source ~/miniconda3/etc/profile.d/conda.sh&amp;#39; &amp;gt;&amp;gt; ~/.bashrc
source ~/.bashrc&lt;/code>&lt;/pre>&lt;p>Check again with the command:&lt;/p>&lt;pre tabindex="0">&lt;code>conda --version&lt;/code>&lt;/pre>&lt;p>Expected output looks like: &lt;code>conda 23.10.0&lt;/code>&lt;/p>&lt;h2 id="installing-and-configuring-the-pytorch-library">Installing and configuring the PyTorch 
library&lt;/h2>&lt;p>There are no official builds or Conda/PyPI wheels with full support for the &lt;strong>ppc64le&lt;/strong> architecture. To install PyTorch, you’ll need to build it from source.&lt;/p>&lt;h4 id="optional-creating-a-conda-virtual-environment">(Optional) Creating a Conda virtual environment&lt;/h4>&lt;p>It’s recommended to create a dedicated virtual environment to install PyTorch in isolation.&lt;/p>&lt;ol>&lt;li>To create and activate the virtual environment, run:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>conda create -y -n api_llm python=3.10
conda activate api_llm&lt;/code>&lt;/pre>&lt;h4 id="installing-prerequisites">Installing prerequisites&lt;/h4>&lt;p>We need to install some packages required to properly build PyTorch.&lt;/p>&lt;ol>&lt;li>First, install the packages using the following command:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>conda install -y -c conda-forge openblas libblas cmake ninja python3-devel gcc-c++ rust cargo&lt;/code>&lt;/pre>&lt;p>Since version 4.0, CMake (the build system used by PyTorch) no longer accepts build scripts that declare compatibility with older versions (&amp;lt;3.5). 
To address this, we need to install, using pip, a CMake release older than 4.0, which still accepts these scripts.&lt;/p>&lt;ol start="2">&lt;li>Run the command:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>pip install cmake==3.27.7&lt;/code>&lt;/pre>&lt;p>To make sure the correct version was installed, run the command:&lt;/p>&lt;pre tabindex="0">&lt;code>cmake --version&lt;/code>&lt;/pre>&lt;p>Expected output: &lt;code>cmake version 3.27.7&lt;/code>&lt;/p>&lt;h4 id="building-pytorch">Building PyTorch&lt;/h4>&lt;p>Now let&amp;rsquo;s start the &lt;strong>PyTorch&lt;/strong> build process.&lt;/p>&lt;ol>&lt;li>The first step is to clone the repository and check out version 2.6.0:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>git clone --recursive https://github.com/pytorch/pytorch
cd pytorch
git checkout v2.6.0
git submodule sync
git submodule update --init --recursive&lt;/code>&lt;/pre>&lt;ol start="2">&lt;li>To install the required packages via pip, run the following command:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>pip install -r requirements.txt&lt;/code>&lt;/pre>&lt;ol start="3">&lt;li>Finally, to build PyTorch, run the repository&amp;rsquo;s &lt;code>setup.py&lt;/code>:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>sudo USE_CUDA=1 USE_DISTRIBUTED=1 USE_NCCL=1 USE_GLOO=1 USE_CUDNN=1 python setup.py install&lt;/code>&lt;/pre>&lt;p>The build process usually takes a while, around 15 minutes.&lt;/p>&lt;ol start="4">&lt;li>To check if everything worked correctly, create a file named &lt;code>test_torch.py&lt;/code>:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>nano test_torch.py&lt;/code>&lt;/pre>&lt;p>This file should contain the following lines:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 
1&lt;/span>&lt;span>&lt;span style="color:#f92672">import&lt;/span> torch&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 2&lt;/span>&lt;span>print(torch&lt;span style="color:#f92672">.&lt;/span>__version__)&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 3&lt;/span>&lt;span>print(&lt;span style="color:#e6db74">&amp;#34;CUDA available:&amp;#34;&lt;/span>, torch&lt;span style="color:#f92672">.&lt;/span>cuda&lt;span style="color:#f92672">.&lt;/span>is_available())&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 4&lt;/span>&lt;span>print(&lt;span style="color:#e6db74">&amp;#34;Number of GPUs:&amp;#34;&lt;/span>, torch&lt;span style="color:#f92672">.&lt;/span>cuda&lt;span style="color:#f92672">.&lt;/span>device_count())&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 5&lt;/span>&lt;span>print(&lt;span style="color:#e6db74">&amp;#34;GPU name:&amp;#34;&lt;/span>, torch&lt;span style="color:#f92672">.&lt;/span>cuda&lt;span style="color:#f92672">.&lt;/span>get_device_name(&lt;span style="color:#ae81ff">0&lt;/span>))&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 6&lt;/span>&lt;span>x &lt;span style="color:#f92672">=&lt;/span> torch&lt;span style="color:#f92672">.&lt;/span>rand(&lt;span style="color:#ae81ff">3&lt;/span>, &lt;span style="color:#ae81ff">3&lt;/span>)&lt;span 
style="color:#f92672">.&lt;/span>cuda()&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 7&lt;/span>&lt;span>y &lt;span style="color:#f92672">=&lt;/span> torch&lt;span style="color:#f92672">.&lt;/span>rand(&lt;span style="color:#ae81ff">3&lt;/span>, &lt;span style="color:#ae81ff">3&lt;/span>)&lt;span style="color:#f92672">.&lt;/span>cuda()&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 8&lt;/span>&lt;span>print(&lt;span style="color:#e6db74">&amp;#34;Sum on GPU:&amp;#34;&lt;/span>, (x &lt;span style="color:#f92672">+&lt;/span> y))&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f"> 9&lt;/span>&lt;span>print(&lt;span style="color:#e6db74">&amp;#34;cuDNN available:&amp;#34;&lt;/span>, torch&lt;span style="color:#f92672">.&lt;/span>backends&lt;span style="color:#f92672">.&lt;/span>cudnn&lt;span style="color:#f92672">.&lt;/span>is_available())&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span style="white-space:pre;-webkit-user-select:none;user-select:none;margin-right:0.4em;padding:0 0.4em 0 0.4em;color:#7f7f7f">10&lt;/span>&lt;span>print(&lt;span style="color:#e6db74">&amp;#34;C extensions loaded:&amp;#34;&lt;/span>, torch&lt;span style="color:#f92672">.&lt;/span>_C&lt;span style="color:#f92672">.&lt;/span>_cuda_getDeviceCount() &lt;span style="color:#f92672">&amp;gt;&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span>)&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>When you run this file, you’ll check:&lt;/p>&lt;ul>&lt;li>Installed PyTorch version&lt;/li>&lt;li>CUDA availability&lt;/li>&lt;li>Number of available GPUs&lt;/li>&lt;li>GPU name on the Power9 
server&lt;/li>&lt;li>Whether GPU usage is working correctly&lt;/li>&lt;li>cuDNN availability&lt;/li>&lt;li>Whether the .so files were compiled correctly&lt;/li>&lt;/ul>&lt;p>This script simply checks some CUDA and PyTorch information and performs a basic addition operation using GPU tensors.&lt;/p>&lt;ol start="5">&lt;li>Run the file with the command:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>python test_torch.py&lt;/code>&lt;/pre>&lt;p>Expected output should look something like:&lt;/p>&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>2.6.0a0+git1eba9b3&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>CUDA available: True&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>Number of GPUs: &lt;span style="color:#ae81ff">4&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>GPU name: Tesla V100-SXM2-16GB&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>Sum on GPU: tensor&lt;span style="color:#f92672">([[&lt;/span>1.9163, 1.2208, 0.5998&lt;span style="color:#f92672">]&lt;/span>,&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">[&lt;/span>1.7962, 0.6040, 1.3943&lt;span style="color:#f92672">]&lt;/span>,&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">[&lt;/span>0.9536, 0.8010, 0.0668&lt;span style="color:#f92672">]]&lt;/span>, device&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#e6db74">&amp;#39;cuda:0&amp;#39;&lt;/span>&lt;span style="color:#f92672">)&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>cuDNN available: True&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>C extensions loaded: True&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Keep in mind that the output may vary depending on the number and model of 
GPUs, as well as the tensor sums (due to randomness). What matters is that every boolean value printed by the script is &lt;code>True&lt;/code>.&lt;/p>&lt;p>With this, PyTorch is installed and ready to use. &lt;a href="https://llm-pt-ibm.github.io/en/posts/tutorial_power9_pt3_en/">&lt;span class="link-personalizado">In the next tutorial&lt;/span>&lt;/a>, we’ll run the first Language Model inference on the Power9 server.&lt;/p></description></item><item><title>Setting Up the OS, NVIDIA Drivers, CUDA, and cuDNN on IBM Power 9 Servers</title><link>https://llm-pt-ibm.github.io/en/posts/tutorial_power9_pt1_en/</link><pubDate>Sun, 29 Jun 2025 00:00:00 +0000</pubDate><guid>https://llm-pt-ibm.github.io/en/posts/tutorial_power9_pt1_en/</guid><description>&lt;h2 id="background">Background&lt;/h2>&lt;p>This is the first post in a tutorial series on how to build a Language Model API on an IBM Power9 server, covering everything from setting up the operating system to having the API running remote inference. This step of the tutorial shows how to set up the operating system and install NVIDIA drivers, CUDA, and cuDNN on IBM Power9 AC922 machines. The focus is on ensuring everything works correctly on the &lt;code>ppc64le&lt;/code> architecture, which is common in high-performance environments.&lt;/p>&lt;p>&lt;strong>IBM Power9&lt;/strong>: The IBM Power9 AC922 is a high-performance machine used for demanding tasks such as artificial intelligence and scientific computing. It uses Power9 processors and works well with NVIDIA GPUs, offering high-speed communication between the CPU and GPU.&lt;/p>&lt;p>&lt;strong>NVIDIA Drivers&lt;/strong>: Software that allows the operating system to communicate correctly with NVIDIA GPUs. These drivers are essential to enable GPU acceleration.&lt;/p>&lt;p>&lt;strong>CUDA&lt;/strong>: NVIDIA&amp;rsquo;s platform for accelerating parallel computing on GPUs. 
It lets you run complex algorithms efficiently, such as Large Language Model inference.&lt;/p>&lt;p>&lt;strong>cuDNN&lt;/strong>: A GPU-optimized library of primitives for deep neural networks (DNNs) developed by NVIDIA. It offers high-performance implementations of key DNN operations like convolutions, pooling, and normalization, significantly speeding up training and inference on GPUs.&lt;/p>&lt;h2 id="tldr">TL;DR&lt;/h2>&lt;ul>&lt;li>This post provides a step-by-step guide on setting up Power9 servers, including the OS and NVIDIA configurations.&lt;/li>&lt;li>The main challenge is finding compatible versions for the Power9 machine architecture.&lt;/li>&lt;/ul>&lt;h2 id="setting-up-the-operating-system">Setting up the Operating System&lt;/h2>&lt;p>Let&amp;rsquo;s start with the installation of &lt;strong>Red Hat Enterprise Linux 8.10 (Ootpa)&lt;/strong>. On Power systems, the architecture used is &lt;code>ppc64le&lt;/code> (PowerPC 64-bit little-endian), so it&amp;rsquo;s essential to ensure the .iso image is compatible with this architecture. 
Otherwise, the Power9&amp;rsquo;s petitboot won&amp;rsquo;t recognize the media and installation won&amp;rsquo;t proceed.&lt;/p>&lt;ol>&lt;li>You can download the correct image from the &lt;a href="https://access.redhat.com/downloads/content/279/ver=/rhel---8/8.10/ppc64le/product-software" rel="external">&lt;span class="link-personalizado">link&lt;/span>&lt;/a> provided.&lt;/li>&lt;li>In this tutorial, we&amp;rsquo;ll use the &lt;strong>Boot ISO&lt;/strong> option and follow the &lt;a href="https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/8/html/interactively_installing_rhel_from_installation_media/assembly_creating-a-bootable-installation-medium_rhel-installer" rel="external">&lt;span class="link-personalizado">official Red Hat documentation&lt;/span>&lt;/a> to create a bootable USB medium.&lt;/li>&lt;li>After inserting the installation media into the Power9 server and rebooting, the system should automatically start petitboot.&lt;/li>&lt;li>From there, just follow the &lt;a href="https://www.ibm.com/docs/en/linuxonibm/liabw/rhelqs_guide_Power_p9_usb.pdf" rel="external">&lt;span class="link-personalizado">official installation guide&lt;/span>&lt;/a> to complete the OS setup.&lt;/li>&lt;/ol>&lt;h2 id="setting-up-nvidia-driver-and-cuda">Setting up NVIDIA Driver and CUDA&lt;/h2>&lt;h4 id="checking-gpus-and-operating-system">Checking GPUs and Operating System&lt;/h4>&lt;p>To enable the operating system to communicate properly with the server&amp;rsquo;s GPUs, we need to install and configure the NVIDIA driver.&lt;/p>&lt;ol>&lt;li>First, let&amp;rsquo;s check for the presence of the GPU(s):&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>lspci | grep -i nvidia&lt;/code>&lt;/pre>&lt;p>The expected output is something like:&lt;/p>&lt;p>&lt;code>0004:04:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 SXM2 16GB] (rev a1)&lt;/code>&lt;/p>&lt;ol start="2">&lt;li>Next, let&amp;rsquo;s check the system architecture and operating system 
name:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>uname -m &amp;amp;&amp;amp; cat /etc/redhat-release&lt;/code>&lt;/pre>&lt;p>The expected output is:&lt;/p>&lt;p>&lt;code>ppc64le Red Hat Enterprise Linux release 8.10 (Ootpa)&lt;/code>&lt;/p>&lt;h4 id="avoiding-conflicts">Avoiding Conflicts&lt;/h4>&lt;p>To avoid potential conflicts, it&amp;rsquo;s recommended to disable the &lt;code>nouveau&lt;/code> driver and &lt;code>SELinux&lt;/code>.&lt;/p>&lt;p>The &lt;code>nouveau&lt;/code> driver is an open-source driver for NVIDIA GPUs, used in place of the proprietary driver by users who want only free software and don&amp;rsquo;t need high performance.&lt;/p>&lt;p>When enabled, &lt;code>SELinux&lt;/code> restricts certain processes from making changes to the system, which can conflict with the installations we&amp;rsquo;ll do in this tutorial.&lt;/p>&lt;ol>&lt;li>Disable the &lt;code>nouveau&lt;/code> driver:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>echo -e &amp;#34;blacklist nouveau\noptions nouveau modeset=0&amp;#34; | sudo tee /etc/modprobe.d/disable-nouveau.conf&lt;/code>&lt;/pre>&lt;ol start="2">&lt;li>To disable &lt;code>SELinux&lt;/code>, let&amp;rsquo;s first check its status by running:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>sestatus&lt;/code>&lt;/pre>&lt;p>If it&amp;rsquo;s active, you&amp;rsquo;ll need to set the &lt;code>SELINUX=disabled&lt;/code> parameter in the &lt;code>/etc/selinux/config&lt;/code> file to proceed. 
Remember that saving changes requires sudo permissions.&lt;/p>&lt;ol start="3">&lt;li>After that, update the &lt;code>initramfs&lt;/code> and reboot the machine with the following commands:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>sudo dracut --force
sudo reboot&lt;/code>&lt;/pre>&lt;ol start="4">&lt;li>To verify everything worked so far, let&amp;rsquo;s check if &lt;code>nouveau&lt;/code> is disabled:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>lsmod | grep nouveau&lt;/code>&lt;/pre>&lt;p>If it&amp;rsquo;s been successfully disabled, there will be no output.&lt;/p>&lt;ol start="5">&lt;li>To verify the &lt;code>SELinux&lt;/code> status:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>sestatus&lt;/code>&lt;/pre>&lt;p>If it&amp;rsquo;s disabled, the output will be: &lt;code>SELinux status: disabled&lt;/code>&lt;/p>&lt;h4 id="installing-prerequisites">Installing Prerequisites&lt;/h4>&lt;ol>&lt;li>Let&amp;rsquo;s install some prerequisites before starting the actual installation:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>sudo dnf install pciutils environment-modules
sudo dnf install kernel-devel-$(uname -r) kernel-headers
sudo dnf install https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm
sudo dnf clean all
sudo dnf install dkms&lt;/code>&lt;/pre>&lt;ol start="2">&lt;li>We also need to enable some repositories:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>sudo subscription-manager repos --enable=rhel-8-for-ppc64le-appstream-rpms
sudo subscription-manager repos --enable=rhel-8-for-ppc64le-baseos-rpms
sudo subscription-manager repos --enable=codeready-builder-for-rhel-8-ppc64le-rpms&lt;/code>&lt;/pre>&lt;h4 id="downloading-and-installing-cuda-package-repositories">Downloading and Installing CUDA Package Repositories&lt;/h4>&lt;ol>&lt;li>Let&amp;rsquo;s download &lt;strong>CUDA version 12.2&lt;/strong> and &lt;strong>NVIDIA Driver 535.54.03-1&lt;/strong> with the following command:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>wget https://developer.download.nvidia.com/compute/cuda/12.2.0/local_installers/cuda-repo-rhel8-12-2-local-12.2.0_535.54.03-1.ppc64le.rpm&lt;/code>&lt;/pre>&lt;ol start="2">&lt;li>To install the downloaded package:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>sudo rpm -i cuda-repo-rhel8-12-2-local-12.2.0_535.54.03-1.ppc64le.rpm&lt;/code>&lt;/pre>&lt;ol start="3">&lt;li>To install the NVIDIA driver and CUDA, run the following commands:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>sudo dnf install nvidia-driver-cuda
sudo dnf clean all
sudo dnf module reset nvidia-driver
sudo dnf module enable nvidia-driver:latest-dkms
sudo dnf -y module install nvidia-driver:latest-dkms
sudo dnf -y install cuda&lt;/code>&lt;/pre>&lt;p>With these commands, the driver and CUDA installation is complete.&lt;/p>&lt;h4 id="post-installation-steps">Post-Installation Steps&lt;/h4>&lt;ol>&lt;li>Let&amp;rsquo;s set the &lt;code>PATH&lt;/code> and &lt;code>LD_LIBRARY_PATH&lt;/code> environment variables. To do this, edit the &lt;code>.bashrc&lt;/code> file and add these two lines:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH&lt;/code>&lt;/pre>&lt;p>To update the environment variables, run the following command:&lt;/p>&lt;pre tabindex="0">&lt;code>source ~/.bashrc&lt;/code>&lt;/pre>&lt;p>We need to make two manual changes because they aren&amp;rsquo;t handled automatically by the CUDA package installation. If these aren&amp;rsquo;t done, the CUDA driver installation will not work properly.&lt;/p>&lt;ol start="2">&lt;li>The first change is to configure the NVIDIA persistence daemon. 
First, check its status, and if it&amp;rsquo;s not active, enable it:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>systemctl status nvidia-persistenced
sudo systemctl enable nvidia-persistenced&lt;/code>&lt;/pre>&lt;p>The second change concerns a udev rule: some Linux distributions bring hot-plugged memory online as soon as it&amp;rsquo;s detected, which prevents NVIDIA software from correctly configuring GPU memory on Power9.&lt;/p>&lt;ol start="3">&lt;li>To disable this rule, run the following commands:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>sudo cp /lib/udev/rules.d/40-redhat.rules /etc/udev/rules.d/
sudo sed -i &amp;#39;s/SUBSYSTEM!=&amp;#34;memory&amp;#34;,.*GOTO=&amp;#34;memory_hotplug_end&amp;#34;/SUBSYSTEM==&amp;#34;*&amp;#34;, GOTO=&amp;#34;memory_hotplug_end&amp;#34;/&amp;#39; /etc/udev/rules.d/40-redhat.rules&lt;/code>&lt;/pre>&lt;h4 id="installation-check">Installation Check&lt;/h4>&lt;p>After completing all these steps, let&amp;rsquo;s reboot the machine and verify the installations:&lt;/p>&lt;ol>&lt;li>Reboot the machine:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>sudo reboot&lt;/code>&lt;/pre>&lt;ol start="2">&lt;li>Check the NVIDIA driver:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>nvidia-smi&lt;/code>&lt;/pre>&lt;p>The output of the command above should display the installed driver version and the CUDA version it supports. 
It should also list available devices (GPUs) with details like name, memory, temperature, and other information.&lt;/p>&lt;p>To perform the final check, let&amp;rsquo;s download the &lt;code>cuda-samples&lt;/code> repository and run the device test.&lt;/p>&lt;ol start="3">&lt;li>Download the repository and access the &lt;code>cuda-samples&lt;/code> version matching the installed CUDA:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples/Samples/1_Utilities/deviceQuery
git checkout v12.2&lt;/code>&lt;/pre>&lt;ol start="4">&lt;li>To build and run the tests:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>make
./deviceQuery&lt;/code>&lt;/pre>&lt;p>After running this test, you should see &lt;code>Result = PASS&lt;/code> in the last line. This confirms that the Power9 is set up with the NVIDIA driver and CUDA working correctly.&lt;/p>&lt;h2 id="setting-up-the-cudnn">Setting up cuDNN&lt;/h2>&lt;ol>&lt;li>First, we need to download and install the &lt;code>.rpm&lt;/code> package specific to &lt;code>ppc64le&lt;/code>.&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>wget https://developer.download.nvidia.com/compute/cudnn/9.0.0/local_installers/cudnn-local-repo-rhel8-9.0.0-1.0-1.ppc64le.rpm
sudo rpm -i cudnn-local-repo-rhel8-9.0.0-1.0-1.ppc64le.rpm
sudo dnf clean all
sudo dnf -y install cudnn&lt;/code>&lt;/pre>&lt;ol start="2">&lt;li>After installing, set the &lt;code>CUDNN_LIBRARY&lt;/code> and &lt;code>CUDNN_INCLUDE_DIR&lt;/code> environment variables by appending these lines to your &lt;code>.bashrc&lt;/code>:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>echo &amp;#39;export CUDNN_LIBRARY=/usr/lib64&amp;#39; &amp;gt;&amp;gt; ~/.bashrc
echo &amp;#39;export CUDNN_INCLUDE_DIR=/usr/include&amp;#39; &amp;gt;&amp;gt; ~/.bashrc&lt;/code>&lt;/pre>&lt;p>After that, the cuDNN installation process is complete.&lt;/p>&lt;p>This is the first part of our tutorial. 
Once you&amp;rsquo;ve finished all the steps in this post, the server will be ready to install the &lt;code>conda&lt;/code> package manager and the &lt;code>pytorch&lt;/code> library. You can access the second part of this tutorial at this &lt;a href="https://llm-pt-ibm.github.io/en/posts/tutorial_power9_pt2_en/">&lt;span class="link-personalizado">link&lt;/span>&lt;/a>.&lt;/p></description></item><item><title>Evaluating Small-Scale LLMs (up to 8B) on PT-BR Benchmarks</title><link>https://llm-pt-ibm.github.io/en/posts/experimentos_benchmarks_pt_br/</link><pubDate>Mon, 02 Jun 2025 00:00:00 +0000</pubDate><guid>https://llm-pt-ibm.github.io/en/posts/experimentos_benchmarks_pt_br/</guid><description>&lt;h2 id="background">Background&lt;/h2>&lt;p>This is the first of two posts in this series, aimed at providing a summary of the investigation we conducted using the &lt;a href="https://github.com/stanford-crfm/helm" rel="external">&lt;span class="link-personalizado">HELM&lt;/span>&lt;/a> (&lt;em>Holistic Evaluation of Language Models&lt;/em>) evaluation framework to assess the &lt;a href="https://huggingface.co/ibm-granite" rel="external">&lt;span class="link-personalizado">Granite&lt;/span>&lt;/a> family of models, the &lt;a href="https://huggingface.co/meta-llama/Llama-3.1-8B" rel="external">&lt;span class="link-personalizado">Llama-3.1-8B&lt;/span>&lt;/a> model, and the &lt;a href="https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B" rel="external">&lt;span class="link-personalizado">DeepSeek-R1-Distill-Llama-3.1-8B&lt;/span>&lt;/a> model. The evaluations cover both Portuguese-language benchmarks and code generation tasks. In this first part, the focus is on evaluating model performance in Brazilian Portuguese (PT-BR) for &lt;strong>sentiment analysis&lt;/strong> and &lt;strong>MQA&lt;/strong> (&lt;em>Multiple-Choice Question Answering&lt;/em>) tasks. 
The second part, to be published soon, will present the evaluation results for code generation tasks.&lt;/p>&lt;p>The use of English-language datasets for evaluating language models is common practice. However, to evaluate these models across different languages and cultural contexts, it is important to test them on benchmarks in other languages. In the case of PT-BR, which typically represents a smaller share of the data used to train multilingual models, understanding model behavior is an important step in evaluating their suitability for tasks and contexts specific to this language. In this sense, this post aims to contribute to that understanding by highlighting both the advances and the remaining challenges in these LLMs’ performance on tasks in the PT-BR context.&lt;/p>&lt;h2 id="tldr">TL;DR&lt;/h2>&lt;div style="text-align: justify;">&lt;ul>&lt;li>We evaluated Granite, Llama-3.1-8B, and DeepSeek-R1-Distill-Llama-3.1-8B on the ENEM Challenge, TweetSent-Br, and IMDB benchmarks.&lt;/li>&lt;li>Our method involved experimentation supported by the HELM framework, which we describe in detail in this document.&lt;/li>&lt;li>The results show that the models accurately classify sentiments in movie reviews in PT-BR.&lt;/li>&lt;/ul>&lt;/div>&lt;h2 id="method">Method&lt;/h2>&lt;h3 id="execution-environment-and-tool-used">Execution Environment and Tool Used&lt;/h3>&lt;p>We used HELM as the evaluation tool. HELM is an LLM evaluation framework developed by researchers at Stanford University. It includes a variety of benchmarks, such as sentiment analysis, code generation, and multiple-choice question answering. Using these benchmarks, we evaluated and compared the performance of the Granite (8B), Llama-3.1-8B, and DeepSeek-R1-Distill-Llama-3.1-8B models.&lt;/p>&lt;p>For running the experiments, we used Google Colab as the environment, which provides access to an A100 GPU. 
In this setup, we were able to clone the HELM repository and run models with 8 billion parameters. All configuration and testing were carried out on this platform, ensuring convenience and access to the necessary computational resources.&lt;/p>&lt;p>In a future post, we will go into more detail about LLM evaluation strategies and tools, with a deeper focus on HELM’s capabilities and operation.&lt;/p>&lt;h3 id="benchmarks-and-models">Benchmarks and Models&lt;/h3>&lt;p>To run tests in Brazilian Portuguese scenarios, it was necessary to extend HELM by adding new benchmarks, since the tool did not previously support this language. This effort represented a direct contribution to HELM, adding three benchmarks:&lt;/p>&lt;ul>&lt;li>&lt;p>&lt;a href="https://huggingface.co/datasets/eduagarcia/enem_challenge" rel="external">&lt;span class="link-personalizado">&lt;strong>ENEM Challenge&lt;/strong>&lt;/span>&lt;/a>: built from questions from the Exame Nacional do Ensino Médio (ENEM), designed to evaluate LLMs’ ability to handle MQA tasks across various knowledge areas, including Humanities, Natural Sciences, Languages, and Mathematics.&lt;/p>&lt;/li>&lt;li>&lt;p>&lt;a href="https://huggingface.co/datasets/eduagarcia/tweetsentbr_fewshot" rel="external">&lt;span class="link-personalizado">&lt;strong>TweetSent-Br&lt;/strong>&lt;/span>&lt;/a>: composed of tweets, specifically for sentiment analysis tasks. The dataset is organized into three main classes: positive (tweets expressing a positive reaction about the main topic), negative (tweets expressing a negative reaction), and neutral (tweets that don’t fit the other categories).&lt;/p>&lt;/li>&lt;li>&lt;p>&lt;a href="https://huggingface.co/datasets/maritaca-ai/imdb_pt" rel="external">&lt;span class="link-personalizado">&lt;strong>IMDB&lt;/strong>&lt;/span>&lt;/a>: made up of movie reviews written in Brazilian Portuguese. 
This benchmark also focuses on sentiment classification tasks, but uses longer-form review texts, in contrast to TweetSent-Br’s shorter posts.&lt;/p>&lt;/li>&lt;/ul>&lt;p>Model selection was guided by compatibility with the available execution environment and by the models’ citation relevance and performance. The selection included the Granite family of models developed by IBM; the Llama models from Meta; and DeepSeek-R1-Distill-Llama-8B, a compact, optimized version derived from Llama 3.1. This choice enabled a fair and practical comparison among the models.&lt;/p>&lt;h2 id="results">Results&lt;/h2>&lt;p>Below, we present the results obtained, along with charts developed by the team to make it easier to visualize and understand the models’ performance on the evaluated tasks.&lt;/p>&lt;ul>&lt;li>&lt;strong>ENEM Challenge&lt;/strong>:&lt;/li>&lt;/ul>&lt;figure>&lt;img src="https://llm-pt-ibm.github.io/images/experimentos_benchmarks_pt_br_image001.png" alt="Chart of results on the ENEM Challenge"/>&lt;figcaption> &lt;p>Chart of results on the ENEM Challenge&lt;/p> &lt;/figcaption>&lt;/figure>&lt;p>The results indicate that the models showed similar performance, with a slight advantage for Llama. The models achieved an average accuracy of 62.53%, suggesting that while they demonstrate some level of understanding of the questions, they still lack sufficient ability to answer ENEM exam questions satisfactorily. 
Improvement is still needed, particularly in reasoning and interpretation in Portuguese.&lt;/p>&lt;ul>&lt;li>&lt;strong>TweetSent-Br&lt;/strong>:&lt;/li>&lt;/ul>&lt;figure>&lt;img src="https://llm-pt-ibm.github.io/images/experimentos_benchmarks_pt_br_image002.png" alt="Chart of results on the TweetSent-Br"/>&lt;figcaption> &lt;p>Chart of results on the TweetSent-Br&lt;/p> &lt;/figcaption>&lt;/figure>&lt;p>In this benchmark, as observed with the ENEM Challenge, the results were also similar across models. This reinforces the view that there are still gaps in model performance on sentiment classification tasks in Portuguese. Classifying a message as positive, negative, or neutral remains a challenge for these models, especially given the nuances and ambiguities of the language.&lt;/p>&lt;ul>&lt;li>&lt;strong>IMDB&lt;/strong>:&lt;/li>&lt;/ul>&lt;figure>&lt;img src="https://llm-pt-ibm.github.io/images/experimentos_benchmarks_pt_br_image003.png" alt="Chart of results on the IMDB"/>&lt;figcaption> &lt;p>Chart of results on the IMDB&lt;/p> &lt;/figcaption>&lt;/figure>&lt;p>In the IMDB benchmark, the results were quite positive. The models achieved accuracy rates above 90%, demonstrating strong performance in sentiment classification. The highlight was the Granite model with 8B parameters, which showed a slight advantage over the others. 
These results indicate that the models can easily categorize movie reviews in Portuguese, showing greater proficiency in this type of task.&lt;/p>&lt;h2 id="conclusion">Conclusion&lt;/h2>&lt;p>This study provided a clearer view of the performance of language models in PT-BR through evaluation on three different benchmarks. The results show that the models analyzed have reasonable performance when selecting an answer in ENEM knowledge areas, while also indicating that there is still room for improvement. On the other hand, in the IMDB sentiment analysis task, these smaller-scale models demonstrated good classification ability.&lt;/p>&lt;p>The team plans, in future studies, to conduct experiments with larger-scale models to enable broader comparisons of performance and efficiency. This will allow for a more detailed analysis of the errors made by each model, contributing to a deeper understanding of their strengths and limitations.&lt;/p></description></item><item><title>Performing CPU Inference on Power10</title><link>https://llm-pt-ibm.github.io/en/posts/power10/</link><pubDate>Sun, 06 Apr 2025 00:00:00 +0000</pubDate><guid>https://llm-pt-ibm.github.io/en/posts/power10/</guid><description>&lt;h2 id="background">Background&lt;/h2>&lt;p>In this post, we will share our experience running the Granite-20b-Code-Instruct model on a Power10 machine, describing the challenges and the necessary configurations to perform inference using Llama.cpp, one of the most popular open-source libraries in this domain.&lt;/p>&lt;h2 id="tldr">TL;DR&lt;/h2>&lt;ul>&lt;li>This post provides details on how to set up and run inference using IBM Power10 infrastructure.&lt;/li>&lt;li>Our main challenge was configuring Llama.cpp, which required adjustments such as installing Ninja-builder, compiling OpenBLAS, and updating the C compiler.&lt;/li>&lt;/ul>&lt;h2 id="infrastructure">Infrastructure&lt;/h2>&lt;p>Inference was performed on a machine with IBM POWER10 architecture, equipped with 750 GB 
of RAM and running Red Hat Enterprise Linux 8.10. Access to the environment was provided through a VM, requiring the use of a VPN to establish secure and controlled communication with the system, enabling remote and efficient execution of activities.&lt;/p>&lt;h2 id="initial-setup">Initial Setup&lt;/h2>&lt;p>The library that lets us run LLMs using only CPU resources is Llama.cpp. To set it up, we needed to resolve two external dependencies: Ninja-builder and OpenBLAS. Ninja-builder optimizes the compilation process, while OpenBLAS is a high-performance library for matrix computations.&lt;/p>&lt;p>During the OpenBLAS build process, we identified discrepancies in the internal tests validating matrix calculations, indicating a compatibility problem with the available C compiler, which was an older version (8.5.0). The solution was to &lt;strong>update the compiler to a newer version, 13.2&lt;/strong>, ensuring better compatibility with the Power10 architecture and validating the accuracy of the numerical operations required for Llama.cpp. 
Below, we present the step-by-step process used to enable the compilation of the required libraries and update the C compiler.&lt;/p>&lt;ol>&lt;li>Creating the build environment for the builder&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>sudo dnf update -y &amp;amp;&amp;amp; dnf -y groupinstall &amp;#39;Development Tools&amp;#39; &amp;amp;&amp;amp; dnf install -y \
    cmake git ninja-build-debugsource.ppc64le \
    &amp;amp;&amp;amp; dnf clean all&lt;/code>&lt;/pre>&lt;ol start="2">&lt;li>Updating the C Compiler and Setting Environment Variables&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>scl enable gcc-toolset-13 bash
export CC=/usr/bin/gcc-13
export CXX=/usr/bin/g++-13&lt;/code>&lt;/pre>&lt;ol start="3">&lt;li>Downloading and Building OpenBLAS&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>git clone --recursive https://github.com/DanielCasali/OpenBLAS.git &amp;amp;&amp;amp; cd OpenBLAS &amp;amp;&amp;amp; \
    make -j$(nproc --all) TARGET=POWER10 DYNAMIC_ARCH=1 &amp;amp;&amp;amp; \
    make PREFIX=/opt/OpenBLAS install &amp;amp;&amp;amp; \
    cd /&lt;/code>&lt;/pre>&lt;ol start="4">&lt;li>Downloading and Building Llama.cpp using the OpenBLAS library&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>git clone https://github.com/DanielCasali/llama.cpp.git &amp;amp;&amp;amp; cd llama.cpp &amp;amp;&amp;amp; sed -i &amp;#34;s/powerpc64le/native -mvsx -mtune=native -D__POWER10_VECTOR__/g&amp;#34; ggml/src/CMakeLists.txt &amp;amp;&amp;amp; \
    mkdir build; \
    cd build; \
    cmake -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS -DBLAS_INCLUDE_DIRS=/opt/OpenBLAS/include -G Ninja ..; \
    cmake --build . --config Release&lt;/code>&lt;/pre>&lt;p>With all these steps completed successfully, the environment was properly configured and optimized for running Llama.cpp locally. 
We are now able to start a server to perform inference with LLMs efficiently, using only CPU resources.&lt;/p>&lt;h2 id="performing-inference">Performing Inference&lt;/h2>&lt;p>We chose the Granite-20b-code-instruct model in the .GGUF format, which is specifically designed to optimize the performance of language models in CPU-only environments. These models are quantized, meaning their calculation precision is reduced, which in turn lowers their size and memory consumption, making them ideal for efficient execution with Llama.cpp. This approach enables high-performance local inference even on processor-only architectures such as POWER10. The model was downloaded directly from Hugging Face. Below, we show the step-by-step process to download it:&lt;/p>&lt;ol>&lt;li>Create a directory for the model in Llama.cpp:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>mkdir -p /root/llama.cpp/models/granite-20b-code-instruct-8k-GGUF&lt;/code>&lt;/pre>&lt;ol start="2">&lt;li>Access the directory in Llama.cpp:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>cd /root/llama.cpp/models/granite-20b-code-instruct-8k-GGUF&lt;/code>&lt;/pre>&lt;ol start="3">&lt;li>Download the model from Hugging Face:&lt;/li>&lt;/ol>&lt;pre tabindex="0">&lt;code>wget https://huggingface.co/ibm-granite/granite-20b-code-instruct-8k-GGUF/resolve/main/granite-20b-code-instruct.Q4_K_M.gguf&lt;/code>&lt;/pre>&lt;p>The last step can take a while, depending on the size of the model. Once the steps above are completed, we can start a Llama.cpp server to perform inference. By default, the server is exposed on port 8080 of the Power10 machine, but this is fully customizable. 
The following code illustrates how to configure and run the Llama server:&lt;/p>&lt;pre tabindex="0">&lt;code>/root/llama.cpp/build/bin/llama-server --host 0.0.0.0 --model /root/llama.cpp/models/granite-20b-code-instruct-8k-GGUF/granite-20b-code-instruct.Q4_K_M.gguf&lt;/code>&lt;/pre>&lt;p>With the Llama.cpp server running on port 8080, we can now perform inference via HTTP requests. In this example, for simplicity, we use curl to make the requests:&lt;/p>&lt;pre tabindex="0">&lt;code>curl -X POST http://localhost:8080/completion \
    -H &amp;#34;Content-Type: application/json&amp;#34; \
    -d &amp;#39;{ &amp;#34;prompt&amp;#34;: &amp;#34;Make a hello world program in Java. Your answer should be in Java code only.&amp;#34;, &amp;#34;max_tokens&amp;#34;: 100 }&amp;#39;&lt;/code>&lt;/pre>&lt;p>Below is an example of how the response is returned:&lt;/p>&lt;pre tabindex="0">&lt;code>{ &amp;#34;content&amp;#34;: &amp;#34;public class HelloWorld { public static void main(String[] args) { System.out.println(\&amp;#34;Hello, World!\&amp;#34;); }}&amp;#34; }&lt;/code>&lt;/pre>&lt;p>With this setup, we are now able to perform inference on CPU. Our upcoming posts will focus on running these inferences using the HELM (&lt;em>Holistic Evaluation of Language Models&lt;/em>) framework as the intermediary.&lt;/p></description></item><item><title>Introduction</title><link>https://llm-pt-ibm.github.io/en/posts/introducao/</link><pubDate>Wed, 12 Mar 2025 00:00:00 +0000</pubDate><guid>https://llm-pt-ibm.github.io/en/posts/introducao/</guid><description>&lt;p>Welcome to the blog of the partnership between the &lt;strong>Federal University of Campina Grande (UFCG)&lt;/strong> and &lt;strong>IBM&lt;/strong>!&lt;/p>&lt;p>This space brings together articles, tutorials, and research results produced by our team across different projects. 
Each project focuses on a distinct area of research:&lt;/p>&lt;ul>&lt;li>&lt;strong>&lt;a href="https://llm-pt-ibm.github.io/en/projects/llm-eval/">LLM Evaluation&lt;/a>&lt;/strong> — evaluation of large language models, with a focus on benchmarks for Brazilian Portuguese.&lt;/li>&lt;li>&lt;strong>&lt;a href="https://llm-pt-ibm.github.io/projects/agents-ai">AgentOps&lt;/a>&lt;/strong> — development of AI agents capable of autonomously performing multiple tasks.&lt;/li>&lt;li>&lt;strong>&lt;a href="https://llm-pt-ibm.github.io/en/projects/judo-ai/">Judo-AI&lt;/a>&lt;/strong> — use of AI models for analysis of judo matches and training sessions, applying computer vision and deep learning techniques for movement detection and action recognition.&lt;/li>&lt;li>&lt;strong>&lt;a href="https://llm-pt-ibm.github.io/en/projects/5g/">5G&lt;/a>&lt;/strong> — integration of AI techniques in 5G network environments, with intelligent control, optimization, and network management mechanisms.&lt;/li>&lt;li>&lt;strong>&lt;a href="https://llm-pt-ibm.github.io/en/projects/multiarq/">MultiArq&lt;/a>&lt;/strong> — provisioning of common tools for new architectures (ppc64le), seeking and adapting specific tools and creating technical documentation about the architecture.&lt;/li>&lt;/ul>&lt;p>Browse the posts and follow the latest updates!&lt;/p></description></item></channel></rss>