
Run Ollama LXC on Proxmox VE With NVIDIA RTX 3090 GPU

In this article, I will explain how I set up Ollama on my home lab running Proxmox VE in a Linux Container (LXC) with an NVIDIA RTX 3090 GPU featuring 24GB of VRAM. This setup allows for efficient local AI inference with large language models while maintaining the flexibility and isolation that containers provide.

What you’ll learn

In this guide, you’ll learn:

  • How to configure Proxmox VE for NVIDIA GPU passthrough
  • Setting up NVIDIA drivers on the host system
  • Creating and configuring an LXC container with GPU access
  • Installing and running Ollama with GPU acceleration
  • Optimizing performance for local AI inference with large language models

Environment Overview

I used the following environment for this setup:

Environment

  • GPU: NVIDIA RTX 3090 24GB GDDR6X
  • RAM: 32GB DDR3
  • CPU: AMD Ryzen 3 2200G
  • Proxmox VE: 9.2.3
  • Kernel version: 7.0.6-2-pve
  • Virtualization: AMD-V / IOMMU

Why Use LXC

The general idea is to avoid the overhead of running a full virtual machine. We can put all packages and processes for running local LLMs into a separate container while still directly accessing the relevant kernel modules and devices on the host. Since my machine is not the most powerful in terms of CPU and RAM, I found this more efficient than deploying a separate VM or adding another container runtime to my Proxmox home lab. I will address some drawbacks at the end.

Step 1: Prepare Your Proxmox Host System

Before creating the LXC container, we need to ensure the host system is properly configured with NVIDIA drivers.

Enable IOMMU in BIOS

First, we enable IOMMU in BIOS:

  • Intel: Enable VT-d (Virtualization Technology for Directed I/O)
  • AMD: Enable IOMMU (AMD-Vi); AMD-V/SVM is the separate CPU virtualization switch

Enable IOMMU in Bootloader

Then we add the IOMMU parameter to the kernel command line in GRUB. On an AMD system use amd_iommu=on, on an Intel system use intel_iommu=on. After editing the file, run update-grub and reboot so the change takes effect.

# /etc/default/grub
# GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt"
GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on iommu=pt"
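Whether the parameter actually reached the running kernel can be checked after the reboot. The following is a minimal sketch: the helper name cmdline_has_iommu is made up for this example, and it only looks for the two switches used above.

```shell
# Hypothetical helper: succeeds if a kernel command line contains one of the
# IOMMU switches used in this guide.
cmdline_has_iommu() {
    case " $1 " in
        *" amd_iommu=on "*|*" intel_iommu=on "*) return 0 ;;
        *) return 1 ;;
    esac
}

# After editing /etc/default/grub, run as root and then reboot:
#   update-grub
# Hosts booted via proxmox-boot-tool keep the command line in
# /etc/kernel/cmdline and apply it with:
#   proxmox-boot-tool refresh

# Check the command line of the running kernel.
if cmdline_has_iommu "$(cat /proc/cmdline 2>/dev/null)"; then
    echo "IOMMU parameter is active"
else
    echo "IOMMU parameter not found on the running kernel command line"
fi
```

For more detail, the kernel log can be searched with dmesg | grep -i -e DMAR -e IOMMU.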

Disable open source NVIDIA kernel modules

We do not want the open source NVIDIA drivers nouveau or nova to interfere with the proprietary driver that we install later. Therefore we blacklist their kernel modules in modprobe so they are not loaded during boot.

# /etc/modprobe.d/blacklist.conf
blacklist nouveau
blacklist nova
blacklist nova_core

Update the initial RAM filesystem (initramfs) so the blacklist also applies during early boot.

# apply changes for next boot
update-initramfs -u -k all

Now unload the kernel modules so we can safely install NVIDIA drivers.

# unload the open source NVIDIA modules if they are currently loaded
# (nova before nova_core, since nova depends on nova_core)
for mod in nova nova_core nouveau; do
    if lsmod | grep -q "^${mod} "; then
        modprobe -r "${mod}" 2>/dev/null || true
    fi
done
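Because the loop above ignores unload errors, it is worth confirming that none of the modules is still loaded before starting the installer. A small sketch, with a helper name (conflicting_modules) that is made up for this example:

```shell
# Hypothetical helper: reads `lsmod` output on stdin and prints every
# open source NVIDIA module that is still loaded, one per line.
conflicting_modules() {
    awk '$1 == "nouveau" || $1 == "nova" || $1 == "nova_core" { print $1 }'
}

# Only query the live system when lsmod is available.
if command -v lsmod >/dev/null 2>&1; then
    still_loaded=$(lsmod | conflicting_modules)
    if [ -n "$still_loaded" ]; then
        echo "Still loaded: $still_loaded"
    else
        echo "No conflicting modules loaded"
    fi
fi
```

If a module is still listed, a reboot with the blacklist in place is usually the simplest way to get rid of it.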

Installing NVIDIA Drivers on Proxmox

We update the system packages and install the kernel headers and build tools that the driver installer needs.

apt update && apt upgrade -y # upgrade optional
apt install -y proxmox-default-headers proxmox-headers-$(uname -r) gcc make dkms dwarves

We install NVIDIA driver version 595.80 for the RTX 3090, without the display (OpenGL) components.

wget -O /tmp/NVIDIA-Linux-x86_64-595.80.run https://us.download.nvidia.com/XFree86/Linux-x86_64/595.80/NVIDIA-Linux-x86_64-595.80.run
chmod +x /tmp/NVIDIA-Linux-x86_64-595.80.run
/tmp/NVIDIA-Linux-x86_64-595.80.run --dkms --disable-nouveau --kernel-module-type=proprietary --no-opengl-files --no-install-libglvnd --silent

I prefer this method since it makes it easy to control the driver version and to keep the libraries on the host and in the LXC in sync, as we will see later on. Proxmox runs on Debian, but we will use Ubuntu for the LXC since NVIDIA recommends this distro for their CUDA software.

After installation we can reboot.

reboot

Verify NVIDIA Driver Installation

Now we can set up the persistence daemon, which keeps the NVIDIA driver loaded and the GPU initialized even while no process is using it.

# /etc/systemd/system/nvidia-persistenced.service
[Unit]
Description=NVIDIA Persistence Daemon
Wants=syslog.target

[Service]
Type=forking
ExecStart=/usr/bin/nvidia-persistenced --user nvpd
ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced

[Install]
WantedBy=multi-user.target

Now we enable and start the service.

systemctl enable --now nvidia-persistenced

The GPU should now be visible.

nvidia-smi
# Output
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 595.80                 Driver Version: 595.80         CUDA Version: 13.2     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        Off |   00000000:01:00.0 Off |                  N/A |
| 53%   31C    P8             53W /  390W |       1MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

You should see output showing your RTX 3090 GPU and driver information.

Step 2: Create LXC Container

Now we’ll create an LXC container for Ollama. This container will be configured to use GPU passthrough.

Creating the Container

  1. In Proxmox Web UI as root user, navigate to your node
  2. Click “Create CT” to start creating a new container
  3. Configure basic settings:
    • Hostname: ollama-container
    • Password: Set a secure password for root user
    • Unprivileged container: Uncheck (the container runs in privileged mode, see Troubleshooting)
    • Template: Choose Ubuntu 24.04 LTS
    • Memory: At least 4GB RAM
    • Disk: 200GB
    • CPU cores: 2-4 cores
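The same container can also be created from the host shell with pct. The sketch below only prints the command so it can be reviewed first. The container ID, the storage name local-lvm, the bridge vmbr0 and the template file name are assumptions for illustration, not values from my setup.

```shell
# Prints (does not run) a pct create command mirroring the settings above.
# Storage name, bridge name and sizes are assumptions for illustration.
print_pct_create() {
    vmid=$1      # free container ID, e.g. 200
    template=$2  # template volume as shown by: pveam list local
    printf 'pct create %s %s \\\n' "$vmid" "$template"
    printf '  --hostname ollama-container \\\n'
    printf '  --unprivileged 0 \\\n'
    printf '  --memory 4096 --cores 4 \\\n'
    printf '  --rootfs local-lvm:200 \\\n'
    printf '  --net0 name=eth0,bridge=vmbr0,ip=dhcp\n'
}

print_pct_create 200 "local:vztmpl/<ubuntu-24.04-template>.tar.zst"
```

The root password is left out here on purpose; it can be set inside the container with passwd after the first pct enter.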

Configure GPU Passthrough

To enable GPU passthrough for the container:

  1. In the container configuration, go to the “Options” tab
  2. Enable “GPU Passthrough”
  3. Select your NVIDIA RTX 3090 GPU from the available devices
  4. Save the configuration
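If your Proxmox version does not offer such an option in the UI, a commonly used alternative is to add the NVIDIA device nodes to the container configuration by hand. The sketch below generates the required lines. The helper name lxc_gpu_config is made up for this example, and the device list in the comment is an assumption, so check ls -l /dev/nvidia* on your host first.

```shell
# Hypothetical helper: prints LXC config lines for the given character devices.
# One cgroup rule per device major number, one bind mount per device node.
lxc_gpu_config() {
    for dev in "$@"; do
        # stat prints the major number in hex; convert it to decimal.
        major=$(printf '%d' "0x$(stat -c '%t' "$dev")")
        echo "lxc.cgroup2.devices.allow: c ${major}:* rwm"
    done | sort -u
    for dev in "$@"; do
        echo "lxc.mount.entry: ${dev} ${dev#/} none bind,optional,create=file"
    done
}

# On the host, after the driver is loaded (device names can vary):
#   lxc_gpu_config /dev/nvidia0 /dev/nvidiactl /dev/nvidia-uvm /dev/nvidia-uvm-tools
# Append the output to /etc/pve/lxc/<lxc_id>.conf and restart the container.
```

Generating the rules from the existing device nodes avoids hard-coding major numbers, which are not fixed for every NVIDIA device.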

Step 3: Install Ollama in LXC Container

Once the container is created and started, we’ll install Ollama inside it.

Access Your Container

From the Proxmox host, copy the NVIDIA installer into the container and then open a shell in it with pct.

pct push <lxc_id> /tmp/NVIDIA-Linux-x86_64-595.80.run /tmp/NVIDIA-Linux-x86_64-595.80.run
pct enter <lxc_id>

Install Required Dependencies

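Update the package lists, install the dependencies, and install the NVIDIA user-space libraries from the same installer. The --no-kernel-module flag skips the kernel module, since the container uses the one from the host.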

apt update
apt install curl zstd -y
/tmp/NVIDIA-Linux-x86_64-595.80.run --no-kernel-module

Install Ollama

Install Ollama using the official installation script:

curl -fsSL https://ollama.com/install.sh | sh
# Enable on container startup
systemctl enable --now ollama

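By default Ollama only listens on localhost. To expose the service to other machines in your network, bind it to all interfaces with a systemd drop-in. Flash attention and the KV cache type are server-wide settings, so they go into the same file.

# /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"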

Reload systemd and restart Ollama to apply the configuration.

systemctl daemon-reload
systemctl restart ollama
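To confirm that the service is reachable, the HTTP API can be queried with curl. This is a minimal sketch; the wrapper name ollama_get is made up for this example and the default address is an assumption that should be replaced with the IP of your container when testing from another machine.

```shell
# Base URL of the Ollama API; replace the address with your container's IP.
OLLAMA_URL="${OLLAMA_URL:-http://127.0.0.1:11434}"

# Small wrapper around curl for GET requests against the API.
ollama_get() {
    curl -fsS --max-time 5 "${OLLAMA_URL}$1"
}

# Usage, once the service is running:
#   ollama_get /api/version   # server version as JSON
#   ollama_get /api/tags      # locally available models as JSON
```

Keep in mind that the API has no authentication, so binding to 0.0.0.0 only makes sense in a trusted network.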

Step 4: Configure a Custom Qwen3 Model

I currently run qwen3-coder:30b-a3b-q4_K_M with 4-bit quantization and a 128k context window, which should be about the sweet spot between the size of the model and the size of the context window. The configuration can be done in a Modelfile.

# ~/Modelfile
FROM qwen3-coder:30b-a3b-q4_K_M
# Forces Ollama to handle up to 128k tokens
PARAMETER num_ctx 131072
# Flash attention and KV cache quantization are needed to handle 128k tokens
# within 24GB VRAM, but they are Ollama server settings, not Modelfile parameters.
# Set OLLAMA_FLASH_ATTENTION=1 and OLLAMA_KV_CACHE_TYPE=q8_0 in the service environment.
# Allows the agent to output up to 4096 tokens per single turn
PARAMETER num_predict 4096
# Temperature balance for strict logical tool use vs creative coding
PARAMETER temperature 0.2

Now we create the configured model with Ollama:

ollama create rtx3090-qwen3 -f ~/Modelfile
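The reason the KV cache type matters at 128k tokens can be shown with some rough arithmetic. The sketch below assumes the model uses 48 layers, 4 key-value heads and a head dimension of 128. I took these numbers as assumptions, so check the model card before relying on them. The block sizes are those of the GGML formats: 32 values take 64 bytes in f16, 34 bytes in q8_0 and 18 bytes in q4_0.

```shell
# Estimates the KV cache size in MiB.
# Arguments: layers, KV heads, head dimension, context length,
# and bytes per block of 32 cached values (f16: 64, q8_0: 34, q4_0: 18).
kv_cache_mib() {
    layers=$1; kv_heads=$2; head_dim=$3; ctx=$4; block_bytes=$5
    # Keys and values (factor 2) for every layer, KV head and token.
    elements=$((2 * layers * kv_heads * head_dim * ctx))
    echo $((elements / 32 * block_bytes / 1024 / 1024))
}

# Assumed architecture: 48 layers, 4 KV heads, head dimension 128.
kv_cache_mib 48 4 128 131072 64   # f16  -> 12288
kv_cache_mib 48 4 128 131072 34   # q8_0 -> 6528
```

Under these assumptions q8_0 roughly halves the cache from 12 GiB to about 6.4 GiB. Whether that fits depends on the VRAM left after the model weights are loaded, which nvidia-smi reports.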

Step 5: Test Ollama with GPU Acceleration

Now let’s verify that Ollama is properly utilizing your NVIDIA RTX 3090.

We run the model.

ollama run rtx3090-qwen3

In a separate shell we can check on our GPU.

watch -n 1 nvidia-smi

Now we should see GPU utilization and VRAM allocation grow, along with the temperature and the power draw.
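For a more compact reading than the full nvidia-smi table, the query interface can print just the memory numbers. A small sketch, with a helper name (vram_pct) that is made up for this example:

```shell
# Hypothetical helper: turns one "used, total" CSV line into a percentage.
vram_pct() {
    awk -F', *' '{ printf "%d\n", $1 * 100 / $2 }'
}

# One-shot reading; only runs where nvidia-smi is installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=memory.used,memory.total \
        --format=csv,noheader,nounits | vram_pct
fi
```

In addition, ollama ps lists the loaded models together with how they are split between CPU and GPU, which quickly shows whether a model spilled over into system RAM.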

Troubleshooting

Common Issues and Solutions

Privileged Mode Requirements

We have to start the LXC in privileged mode so it can access the NVIDIA devices. This may require some additional considerations on how to restrict access. Also, if the LXC runs into problems that may affect kernel processes, the host may be affected as well.

Driver Version Mismatch

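You must ensure that the NVIDIA driver version on your host system and the NVIDIA libraries in your LXC are identical. Otherwise the user-space libraries in the container cannot talk to the kernel module on the host, and tools such as nvidia-smi fail with a driver/library version mismatch. I avoid this by using the same installer on both sides with different parameters.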
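After a driver update it is easy to forget one of the two sides, so a quick comparison from the host can help. The following is a sketch only; the helper name driver_versions_match is made up for this example.

```shell
# Hypothetical helper: compares two driver version strings.
driver_versions_match() {
    if [ "$1" = "$2" ]; then
        echo "OK: host and container both use $1"
    else
        echo "MISMATCH: host=$1 container=$2"
        return 1
    fi
}

# On the Proxmox host (replace <lxc_id> with the container ID):
#   q="--query-gpu=driver_version --format=csv,noheader"
#   driver_versions_match "$(nvidia-smi $q)" "$(pct exec <lxc_id> -- nvidia-smi $q)"
```

If the container side cannot even run nvidia-smi, that is usually already the mismatch showing up, and rerunning the installer with --no-kernel-module inside the container is the fix I would try first.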

Ideas for Optimization

Offload model storage to an NVMe

On a cold call to an LLM via Ollama, the model must first be loaded from disk into the GPU's VRAM. Consider using an NVMe drive to reduce load times. In my case I use a mainboard with an NVMe drive in the first M.2 slot, which is directly connected to the CPU via PCIe. This cuts loading down to a couple of seconds for models of about 20GB. In any case, check your mainboard manual, since M.2 slots may also be connected to the mainboard chipset, which then relays the data transfer to the CPU and can become a bottleneck or add latency.
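Ollama reads its model directory from the OLLAMA_MODELS environment variable, so the models can be moved to the NVMe without touching the rest of the installation. The sketch below prints a matching systemd drop-in. The path /mnt/nvme/ollama/models is an assumption for illustration and has to be a mount point that is available inside the container.

```shell
# Prints a systemd drop-in that points Ollama at a different model directory.
# The directory is an assumption; use your own NVMe mount point.
print_models_dropin() {
    printf '[Service]\n'
    printf 'Environment="OLLAMA_MODELS=%s"\n' "$1"
}

print_models_dropin /mnt/nvme/ollama/models
```

The output can be saved as an additional file under /etc/systemd/system/ollama.service.d/. The directory must be writable by the ollama user, and the service needs a daemon-reload and a restart afterwards.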

Optimizing the ollama config for Coding

Things to experiment with:

  • Use a model with fewer parameters (15B instead of 30B) -> lower VRAM footprint but a less capable model
  • Decrease the context size -> I feel like <64K is not recommendable
  • Use higher compression of the key-value cache -> 8-bit could be the minimum for coding and a context of >64K
  • Offload to the CPU -> In my case an AMD 2200G with DDR3 RAM is more or less useless

Conclusion

This setup provides an efficient way to run Ollama with GPU acceleration in a home lab environment. By using LXC containers with NVIDIA GPU passthrough, you can achieve good performance while maintaining the flexibility and isolation that containers provide.

Key takeaways from this guide:

  • GPU passthrough allows direct hardware access for optimal performance
  • Proper driver configuration on both host and container systems is crucial
  • LXC containers offer a good balance between performance and resource efficiency
  • The RTX 3090 with 24GB VRAM provides some usable capacity for running large language models

For larger deployments or production environments, consider:

  • Using virtual machines instead of containers for better isolation
  • Implementing more robust monitoring and management solutions
  • Exploring container orchestration platforms like Kubernetes for scaling
  • More GPUs and especially more VRAM

This approach is particularly useful for developers and researchers who need to experiment with various LLMs while maintaining good performance and resource utilization.
