Run Ollama LXC on Proxmox VE With NVIDIA RTX 3090 GPU
In this article, I will explain how I set up Ollama on my home lab running Proxmox VE in a Linux Container (LXC) with an NVIDIA RTX 3090 GPU featuring 24GB of VRAM. This setup allows for efficient local AI inference with large language models while maintaining the flexibility and isolation that containers provide.
What you’ll learn
In this guide, you’ll learn:
- How to configure Proxmox VE for NVIDIA GPU passthrough
- Setting up NVIDIA drivers on the host system
- Creating and configuring an LXC container with GPU access
- Installing and running Ollama with GPU acceleration
- Optimizing performance for local AI inference with large language models
Environment Overview
I used the following environment for this setup:
Environment
- GPU: NVIDIA RTX 3090 24GB GDDR6X
- RAM: 32GB DDR3
- CPU: AMD Ryzen 3 2200G
- ProxmoxVE: 9.2.3
- Kernel version: 7.0.6-2-pve
- Virtualization: AMD-V / IOMMU
Why Use LXC
The general idea is to avoid the overhead of a full virtual machine. All packages and processes for running local LLMs live in a separate container, which still accesses the relevant kernel modules and devices on the host directly. Since my machine is not the most powerful in terms of CPU and RAM, I found this more efficient than deploying a separate VM or adding another container runtime to my Proxmox home lab. I will address some drawbacks at the end.
Step 1: Prepare Your Proxmox Host System
Before creating the LXC container, we need to ensure the host system is properly configured with NVIDIA drivers.
Enable IOMMU in BIOS
First, we enable IOMMU in BIOS:
- Intel: Enable VT-d (Virtualization Technology for Directed I/O)
- AMD: Enable IOMMU (AMD-Vi)
Enable IOMMU in Bootloader
Then we add the IOMMU parameter to the kernel command line in GRUB: amd_iommu=on on AMD systems, intel_iommu=on on Intel systems. After editing the file, run update-grub so the change is applied on the next boot.
# /etc/default/grub
# GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt"
GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on iommu=pt"
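After the reboot it is worth confirming that the parameter actually reached the kernel. The following is a minimal sketch: the helper name has_iommu_flag is hypothetical, and it only inspects a command line string, so it does not prove that the IOMMU is enabled in the firmware.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if a kernel command line string
# contains one of the IOMMU switches used in this guide.
has_iommu_flag() {
    case " $1 " in
        *" amd_iommu=on "* | *" intel_iommu=on "*) return 0 ;;
        *) return 1 ;;
    esac
}

# On the Proxmox host, after `update-grub` and a reboot:
#   has_iommu_flag "$(cat /proc/cmdline)" && echo "IOMMU flag present"
#   dmesg | grep -i -e iommu -e amd-vi    # kernel messages about the IOMMU
```

The padding with spaces around the argument makes sure only whole parameters match, so a value such as amd_iommu=only would not count.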
Disable open source NVIDIA kernel modules
We do not want the open source NVIDIA drivers nouveau or nova to interfere with the proprietary driver that we install later.
Therefore we blacklist these kernel modules in modprobe so they are not loaded during boot.
# /etc/modprobe.d/blacklist.conf
blacklist nouveau
blacklist nova
blacklist nova_core
Update the initial RAM filesystem (initramfs) loaded on boot.
# apply changes for next boot
update-initramfs -u -k all
Now unload the kernel modules if they are currently loaded, so we can safely install the NVIDIA driver.
# unload open source NVIDIA kernel modules that are currently loaded
for mod in nova_core nova nouveau; do
    if lsmod | grep -q "^${mod} "; then
        modprobe -r "${mod}" 2>/dev/null || true
    fi
done
Installing NVIDIA Drivers on Proxmox
We update the package lists, optionally upgrade the system, and install the kernel headers and build tools the driver installer needs.
apt update && apt upgrade -y # upgrade optional
apt install -y proxmox-default-headers proxmox-headers-$(uname -r) gcc make dkms dwarves
We install NVIDIA driver version 595.80 for the RTX 3090 without the OpenGL display components.
wget -O /tmp/NVIDIA-Linux-x86_64-595.80.run https://us.download.nvidia.com/XFree86/Linux-x86_64/595.80/NVIDIA-Linux-x86_64-595.80.run
chmod +x /tmp/NVIDIA-Linux-x86_64-595.80.run
/tmp/NVIDIA-Linux-x86_64-595.80.run --dkms --disable-nouveau --kernel-module-type=proprietary --no-opengl-files --no-install-libglvnd --silent
I prefer this method since it makes it easy to control the driver version and to keep the libraries on the host and in the LXC in sync, which we will see later on. Proxmox runs on Debian, but we will use Ubuntu for the LXC since it is one of the distributions NVIDIA supports for its CUDA software.
After installation we can reboot.
reboot
Verify NVIDIA Driver Installation
Now we can enable the persistence daemon, which keeps the NVIDIA driver initialized even when no process is currently using the GPU.
# /etc/systemd/system/nvidia-persistenced.service
[Unit]
Description=NVIDIA Persistence Daemon
Wants=syslog.target
[Service]
Type=forking
ExecStart=/usr/bin/nvidia-persistenced --user nvpd
ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced
[Install]
WantedBy=multi-user.target
Now we enable the persistence mode.
systemctl enable --now nvidia-persistenced
The GPU should now be visible.
nvidia-smi
# Output
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 595.80                 Driver Version: 595.80         CUDA Version: 13.2     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        Off |   00000000:01:00.0 Off |                  N/A |
| 53%   31C    P8             53W /  390W |       1MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
You should see output showing your RTX 3090 GPU and driver information.
Step 2: Create LXC Container
Now we’ll create an LXC container for Ollama. This container will be configured to use GPU passthrough.
Creating the Container
- In the Proxmox web UI, logged in as root, navigate to your node
- Click “Create CT” to start creating a new container
- Configure basic settings:
  - Hostname: ollama-container
  - Password: Set a secure password for the root user
  - Unprivileged container: Uncheck
  - Template: Choose Ubuntu 24.04 LTS
  - Memory: At least 4GB RAM
  - Disk: 200GB
  - CPU cores: 2-4 cores
Configure GPU Passthrough
To enable GPU passthrough for the container:
- In the container configuration, go to the “Options” tab
- Enable “GPU Passthrough”
- Select your NVIDIA RTX 3090 GPU from the available devices
- Save the configuration
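If the web UI of your Proxmox version does not offer such an option, the same device mapping can usually be expressed as dev entries in the container configuration at /etc/pve/lxc/<lxc_id>.conf. The sketch below only prints candidate lines for review. The helper name print_dev_entries is hypothetical, and the device list in the usage comment is an assumption about what a typical NVIDIA driver installation creates on the host; check with ls -l /dev/nvidia* first.

```shell
#!/bin/sh
# Hypothetical helper: print one "devN: <path>" line per argument,
# in the format used by Proxmox container configuration files.
print_dev_entries() {
    i=0
    for dev in "$@"; do
        printf 'dev%d: %s\n' "$i" "$dev"
        i=$((i + 1))
    done
}

# On the host, review the output before adding it to
# /etc/pve/lxc/<lxc_id>.conf (the device names are assumptions):
#   print_dev_entries /dev/nvidia0 /dev/nvidiactl /dev/nvidia-uvm /dev/nvidia-uvm-tools
```

The container has to be restarted before new device entries take effect.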
Step 3: Install Ollama in LXC Container
Once the container is created and started, we’ll install Ollama inside it.
Access Your Container
From the Proxmox host, copy the driver installer into the container with pct push, then open a shell in the container with pct enter.
pct push <lxc_id> /tmp/NVIDIA-Linux-x86_64-595.80.run /tmp/NVIDIA-Linux-x86_64-595.80.run
pct enter <lxc_id>
Install Required Dependencies
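Update the package lists and install the necessary dependencies. Then run the same NVIDIA installer as on the host, this time without the kernel module, since the container uses the module loaded on the host:
apt update
apt install curl zstd -y
sh /tmp/NVIDIA-Linux-x86_64-595.80.run --no-kernel-module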
Install Ollama
Install Ollama using the official installation script:
curl -fsSL https://ollama.com/install.sh | sh
# Enable on container startup
systemctl enable --now ollama
Expose the service to other users in your network.
# /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Apply configs.
systemctl daemon-reload
systemctl restart ollama
Step 4: Configure a Custom Qwen3 Model
I currently run qwen3-coder:30b-a3b-q4_K_M with 4-bit quantization and a 128k context window, which for me is about the sweet spot between model size and context window size.
The configuration can be done in a Modelfile.
# ~/Modelfile
FROM qwen3-coder:30b-a3b-q4_K_M
# Lets Ollama handle up to 128k tokens of context
PARAMETER num_ctx 131072
# Flash attention and KV cache quantization are not Modelfile parameters.
# They are set on the Ollama service through the environment variables
# OLLAMA_FLASH_ATTENTION=1 and OLLAMA_KV_CACHE_TYPE=q8_0.
# Allows the agent to output up to 4096 tokens per single turn
PARAMETER num_predict 4096
# Low temperature for strict, logical tool use rather than creative coding
PARAMETER temperature 0.2
Now we build the configured model with ollama:
ollama create rtx3090-qwen3 -f ~/Modelfile
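Flash attention and KV cache quantization are settings of the Ollama server rather than of a single model. They are read from the OLLAMA_FLASH_ATTENTION and OLLAMA_KV_CACHE_TYPE environment variables. A minimal sketch of a systemd drop-in that sets them follows; the helper name write_kv_dropin and the file name kv-cache.conf are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: write a systemd drop-in with the two Ollama
# server settings into the directory given as first argument.
write_kv_dropin() {
    dir=$1
    mkdir -p "$dir" || return 1
    cat > "$dir/kv-cache.conf" <<'EOF'
[Service]
Environment="OLLAMA_FLASH_ATTENTION=1"
Environment="OLLAMA_KV_CACHE_TYPE=q8_0"
EOF
}

# Inside the container:
#   write_kv_dropin /etc/systemd/system/ollama.service.d
#   systemctl daemon-reload && systemctl restart ollama
```

Keeping these settings in their own drop-in file leaves other overrides of the service untouched.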
Step 5: Test Ollama with GPU Acceleration
Now let’s verify that Ollama is properly utilizing your NVIDIA RTX 3090.
We run the model.
ollama run rtx3090-qwen3
In a separate shell we can check on our GPU.
watch -n 1 nvidia-smi
Now we should see GPU utilization and VRAM allocation grow, along with the temperature and the power draw.
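For a more compact view than the full nvidia-smi table, the query mode can print selected fields as CSV. A small sketch, assuming a single GPU: the helper name vram_percent is hypothetical and simply turns a "used, total" pair in MiB into a percentage.

```shell
#!/bin/sh
# Hypothetical helper: read "used, total" (MiB, as printed by nvidia-smi
# in CSV mode without units) and print the used share as an integer percent.
vram_percent() {
    used=${1%%,*}      # text before the first comma
    total=${1##*, }    # text after the last ", "
    echo $(( used * 100 / total ))
}

# While a model is loaded (single GPU assumed):
#   line=$(nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits)
#   echo "VRAM in use: $(vram_percent "$line")%"
#   ollama ps    # lists loaded models and where they run
```

If ollama ps reports that part of the model runs on the CPU, the model and its context do not fit into VRAM completely.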
Troubleshooting
Common Issues and Solutions
Privileged Mode Requirements
We have to start the LXC in privileged mode so it can access the NVIDIA devices. This may require some additional considerations on how to restrict access. Also, if the LXC runs into problems that may affect kernel processes, the host may be affected as well.
Driver Version Mismatch
You must ensure that the NVIDIA software stack on your host system and the one in your LXC have the same version. Otherwise, GPU access inside the container fails, typically with a driver/library version mismatch error from nvidia-smi. I avoid this by running the same installer on both systems, with different flags.
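A quick way to compare both sides is to query the driver version on the host and inside the container. This is a minimal sketch; the helper name same_driver_version is hypothetical, and <lxc_id> stands for your container ID.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if both arguments are non-empty
# and identical driver version strings.
same_driver_version() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}

# On the Proxmox host:
#   host=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader)
#   ct=$(pct exec <lxc_id> -- nvidia-smi --query-gpu=driver_version --format=csv,noheader)
#   same_driver_version "$host" "$ct" || echo "mismatch: host=$host container=$ct"
```

Running this check after every driver update on the host shows right away whether the container needs the matching installer as well.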
Ideas for Optimization
Offload model storage to an NVMe
On a cold call to an LLM via Ollama, the model must first be loaded from disk into the GPU's VRAM. Consider using an NVMe drive to reduce load times. In my case, the mainboard's first M.2 slot holds an NVMe drive that is directly connected to the CPU via PCIe. This speeds up loading significantly, to a couple of seconds for models of ~20GB. In any case, check your mainboard manual, since M.2 slots may also be connected to the mainboard chipset, which then relays data transfers to the CPU and can become a bottleneck or add latency.
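The effect of the storage medium can be estimated with simple arithmetic: load time is roughly model size divided by sustained read throughput. The sketch below uses illustrative throughput figures that are assumptions, not measurements from my system, and the helper name load_seconds is hypothetical.

```shell
#!/bin/sh
# Rough estimate: seconds needed to read a model of SIZE_MB megabytes
# at a sustained THROUGHPUT_MBPS megabytes per second (rounded up).
load_seconds() {
    size_mb=$1
    mbps=$2
    echo $(( (size_mb + mbps - 1) / mbps ))
}

# Illustrative throughput figures (assumptions, not measurements):
load_seconds 20000 3500   # NVMe on PCIe 3.0 x4
load_seconds 20000 550    # SATA SSD
load_seconds 20000 150    # spinning hard disk
```

Real load times also depend on the file system, caching, and the PCIe link to the GPU, so treat these numbers as an order-of-magnitude guide only.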
Optimizing the Ollama Config for Coding
Things to play around with:
- Use a model with fewer parameters (15B instead of 30B) -> lower VRAM footprint but a less capable model
- Decrease the context size -> I feel like <64K is not advisable
- Use higher compression of the key-value cache -> 8-bit is probably the minimum for coding with a context of >64K
- Offload to the CPU -> in my case, with an AMD 2200G and DDR3 RAM, this is more or less useless
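To get a feeling for how context size and cache compression trade off against VRAM, the size of the key-value cache can be estimated from the model architecture. The sketch below is a rough estimate only. The helper name kv_cache_mib is hypothetical, and the architecture numbers (48 layers, 4 KV heads, head dimension 128) are assumptions for a Qwen3 30B-A3B style model that you should check against the model card.

```shell
#!/bin/sh
# Estimate the KV cache size in MiB.
# Arguments: layers, KV heads, head dimension, context length,
#            bytes per 32 cached values (64 for f16, 34 for q8_0).
kv_cache_mib() {
    layers=$1; kv_heads=$2; head_dim=$3; ctx=$4; bytes_per_32=$5
    # Factor 2: one key and one value vector per layer and token.
    values=$(( 2 * layers * kv_heads * head_dim * ctx ))
    echo $(( values / 32 * bytes_per_32 / 1048576 ))
}

# Architecture numbers are assumptions, see the text above.
kv_cache_mib 48 4 128 131072 64   # f16 cache at 128k context
kv_cache_mib 48 4 128 131072 34   # q8_0 cache at 128k context
kv_cache_mib 48 4 128 65536 34    # q8_0 cache at 64k context
```

Under these assumptions, the 8-bit cache roughly halves the memory needed for the context, and halving the context halves it again. The result has to fit into VRAM next to the model weights, otherwise Ollama offloads part of the work to the CPU.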
Conclusion
This setup provides an efficient way to run Ollama with GPU acceleration in a home lab environment. By using LXC containers with NVIDIA GPU passthrough, you can achieve good performance while maintaining the flexibility and isolation that containers provide.
Key takeaways from this guide:
- GPU passthrough allows direct hardware access for optimal performance
- Proper driver configuration on both host and container systems is crucial
- LXC containers offer a good balance between performance and resource efficiency
- The RTX 3090 with 24GB VRAM provides some usable capacity for running large language models
For larger deployments or production environments, consider:
- Using virtual machines instead of containers for better isolation
- Implementing more robust monitoring and management solutions
- Exploring container orchestration platforms like Kubernetes for scaling
- More GPUs and especially more VRAM
This approach is particularly useful for developers and researchers who need to experiment with various LLMs while maintaining good performance and resource utilization.