CLI Coding Agent via Ollama · AMD ROCm · Qwen2.5-Coder 14B
The amdgpu kernel driver lives on the Proxmox host. You'll need the device major numbers before creating the LXC container.
lspci -k | grep -A 2 "VGA"
# Expected: Kernel driver in use: amdgpu
ls -la /dev/kfd
# Expected: crw-rw---- 1 root render 510, 0 ...
# If /dev/kfd is missing:
modprobe amdgpu
dmesg | grep -i kfd
ls -la /dev/dri/
# Major number for DRI = 226 (look for: 226, 0 ... card0)
ls -la /dev/kfd
# Major number for KFD — commonly 510, verify yours
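Reading the numbers off `ls -la` by eye is easy to get wrong, so here is a minimal sketch that prints them directly. The helper name `dev_numbers` is hypothetical, and it assumes GNU `stat` (as shipped on Proxmox/Debian), whose `%t` and `%T` formats report the major and minor numbers in hexadecimal.

```shell
#!/usr/bin/env bash
# Sketch: print "major:minor" (decimal) for a character device node.
# GNU stat reports %t (major) and %T (minor) in hex, so convert them.
dev_numbers() {
  local node=$1 major_hex minor_hex
  major_hex=$(stat -c '%t' "$node") || return 1
  minor_hex=$(stat -c '%T' "$node") || return 1
  printf '%d:%d\n' "0x$major_hex" "0x$minor_hex"
}

# Report every GPU node present on this host; missing nodes are skipped.
for node in /dev/kfd /dev/dri/card* /dev/dri/renderD*; do
  if [ -e "$node" ]; then
    printf '%s %s\n' "$node" "$(dev_numbers "$node")"
  fi
done
```

The first number on each line is the major number to use in the cgroup rules; the second is the minor.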
The container must be privileged — unprivileged containers cannot pass through GPU devices. Edit the LXC config before first boot to add the cgroup device rules.
pveam update
pveam download local ubuntu-22.04-standard_22.04-1_amd64.tar.zst
| Field | Value | Notes |
|---|---|---|
| Hostname | llm-coder | — |
| Template | ubuntu-22.04-standard | — |
| Disk size | 60 GB minimum | Models are large — Qwen2.5-Coder 14B is ~9GB |
| CPU cores | 8+ | — |
| RAM | 16384 MB (16 GB) | ROCm is memory-hungry |
| Swap | 4096 MB | — |
| Network | Set a static IP | You'll reference this IP from your laptop's Aider config |
| Privileged | ✅ CHECK THIS | Required — unprivileged containers cannot pass through GPU devices |
Replace <CTID> with your container ID:
nano /etc/pve/lxc/<CTID>.conf
Add these lines at the bottom:
# DRI devices (render nodes)
lxc.cgroup2.devices.allow: c 226:0 rwm
lxc.cgroup2.devices.allow: c 226:1 rwm
lxc.cgroup2.devices.allow: c 226:128 rwm
lxc.cgroup2.devices.allow: c 226:129 rwm
# KFD compute device — replace 510 with your actual major number
lxc.cgroup2.devices.allow: c 510:0 rwm
# Bind mount into container
lxc.mount.entry: /dev/dri dev/dri none bind,optional,create=dir
lxc.mount.entry: /dev/kfd dev/kfd none bind,optional,create=file
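If your major or minor numbers differ from the ones above, the same lines can be generated rather than hand-edited. This is only a sketch: `gpu_lxc_conf` is a hypothetical helper, and it assumes the DRI major is 226 as noted earlier.

```shell
#!/usr/bin/env bash
# Sketch: emit the GPU passthrough lines for an LXC config.
# $1        = KFD major number (from `ls -la /dev/kfd` on the host)
# remaining = DRI minor numbers to allow (DRI major assumed to be 226)
gpu_lxc_conf() {
  local kfd_major=$1 minor
  shift
  for minor in "$@"; do
    printf 'lxc.cgroup2.devices.allow: c 226:%s rwm\n' "$minor"
  done
  printf 'lxc.cgroup2.devices.allow: c %s:0 rwm\n' "$kfd_major"
  printf 'lxc.mount.entry: /dev/dri dev/dri none bind,optional,create=dir\n'
  printf 'lxc.mount.entry: /dev/kfd dev/kfd none bind,optional,create=file\n'
}

# Print the lines for review before adding them to the container config.
gpu_lxc_conf 510 0 1 128 129
```

Once the output looks right, it can be appended to /etc/pve/lxc/<CTID>.conf by redirecting with `>>`.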
Start the container and verify:
pct start <CTID>
pct enter <CTID>
ls -la /dev/dri/ # Should show card0, card1, renderD128, renderD129
ls -la /dev/kfd # Must exist
All commands from here run inside the LXC container. Install only the ROCm userspace: --no-dkms is critical because the kernel driver lives on the Proxmox host.
apt update && apt upgrade -y
apt install -y wget gnupg2 curl git build-essential \
python3-pip libnuma-dev libpci-dev \
ca-certificates software-properties-common
wget https://repo.radeon.com/amdgpu-install/6.1.3/ubuntu/jammy/amdgpu-install_6.1.60103-1_all.deb
dpkg -i amdgpu-install_6.1.60103-1_all.deb
apt update
# --no-dkms is critical: kernel driver is already on the Proxmox host
amdgpu-install --usecase=rocm --no-dkms -y
usermod -aG render,video root
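# /etc/environment takes plain KEY=value lines (no "export"); profile.d scripts are sourced by the shell and need "export"
echo 'HSA_OVERRIDE_GFX_VERSION=10.3.0' >> /etc/environment
echo 'export HSA_OVERRIDE_GFX_VERSION=10.3.0' >> /etc/profile.d/rocm.sh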
export HSA_OVERRIDE_GFX_VERSION=10.3.0
rocminfo | grep -A 5 "gfx"
# Expect a gfx103x agent (gfx1031 natively; typically reported as gfx1030 once the override is active) and Compute Unit: 40
rocm-smi
# Shows GPU temp, VRAM usage, utilization
If rocminfo hangs or shows no agents: recheck /dev/kfd is present and the cgroup major number matches.
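It can help to see what the override value actually asks for. The sketch below converts an HSA_OVERRIDE_GFX_VERSION string into the gfx target name it selects. It assumes the usual "major.minor.stepping" convention, with minor and stepping written as hex digits; `override_to_gfx` is a hypothetical helper, not part of ROCm.

```shell
#!/usr/bin/env bash
# Sketch: map an HSA_OVERRIDE_GFX_VERSION value to a gfx target name.
# ASSUMPTION: "major.minor.stepping", minor and stepping rendered in hex.
override_to_gfx() {
  local major minor step
  IFS=. read -r major minor step <<< "$1"
  printf 'gfx%d%x%x\n' "$major" "$minor" "$step"
}

override_to_gfx 10.3.0
```

Under that assumption, 10.3.0 selects gfx1030, so the override asks the runtime to treat the card as that target rather than its native gfx1031.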
Ollama serves the model over a local API that Aider will call from your laptop. The key settings are injecting the GPU override into the systemd service and binding to all interfaces so it's accessible over LAN.
curl -fsSL https://ollama.com/install.sh | sh
mkdir -p /etc/systemd/system/ollama.service.d
cat > /etc/systemd/system/ollama.service.d/override.conf << 'EOF'
[Service]
Environment="HSA_OVERRIDE_GFX_VERSION=10.3.0"
Environment="OLLAMA_HOST=0.0.0.0"
Environment="OLLAMA_NUM_PARALLEL=1"
Environment="OLLAMA_KEEP_ALIVE=60m"
EOF
HSA_OVERRIDE_GFX_VERSION: without this, ROCm ignores the GPU entirely
OLLAMA_HOST=0.0.0.0: binds to all interfaces so your laptop can reach it over LAN
OLLAMA_NUM_PARALLEL=1: one request at a time; prevents VRAM fragmentation with a single user
OLLAMA_KEEP_ALIVE=60m: keeps the model in VRAM between requests. Aider sessions are slow with cold loads (default is 5m)
systemctl daemon-reload
systemctl restart ollama
systemctl enable ollama
# Verify it started cleanly and detected the GPU:
journalctl -u ollama -n 50 | grep -i "gpu\|rocm\|error"
ollama pull qwen2.5-coder:14b
# ~9GB download — also consider a smaller fallback:
ollama pull qwen2.5-coder:7b # ~4.5GB
Tune the model for coding workloads (larger context, lower temperature):
ollama show qwen2.5-coder:14b --modelfile > /root/coder14b.modelfile
Edit the file and ensure these parameters:
PARAMETER num_ctx 16384
PARAMETER num_gpu 99
PARAMETER temperature 0.2
PARAMETER repeat_penalty 1.05
num_ctx 16384: 16K context fits in 12GB VRAM at this quantization and gives Aider room to load multiple files
temperature 0.2: more deterministic, less "creative" with syntax
num_gpu 99: forces all layers onto the GPU
ollama create qwen2.5-coder-14b-aider -f /root/coder14b.modelfile
ollama list # Verify it appears
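To get a feel for how num_ctx affects VRAM, here is a rough KV-cache estimate. The architecture figures (48 layers, 8 KV heads, head dimension 128) are assumptions taken from the published Qwen2.5 14B configuration, and the f16 cache is also an assumption. Ollama's real usage additionally depends on its cache type and compute buffers, so treat the result as an order-of-magnitude check only.

```shell
#!/usr/bin/env bash
# Sketch: estimate KV-cache size in MiB for a given context length.
# Formula: 2 (K and V) * layers * kv_heads * head_dim * bytes * tokens
kv_cache_mib() {
  local layers=$1 kv_heads=$2 head_dim=$3 ctx=$4 bytes=${5:-2}
  echo $(( 2 * layers * kv_heads * head_dim * bytes * ctx / 1024 / 1024 ))
}

# ASSUMED architecture for Qwen2.5 14B: 48 layers, 8 KV heads, head_dim 128
kv_cache_mib 48 8 128 16384   # f16 cache at num_ctx 16384
kv_cache_mib 48 8 128 8192    # f16 cache at num_ctx 8192
```

The estimate scales linearly with num_ctx, so halving the context halves the cache on top of the model weights.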
curl http://<LXC-IP>:11434/api/tags
# Should return JSON listing your models
# Also test a real inference call:
curl http://<LXC-IP>:11434/api/generate \
-d '{"model":"qwen2.5-coder-14b-aider","prompt":"def fibonacci(n):","stream":false}'
Don't proceed to Part 5 until this curl returns a valid response.
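Right after a restart, Ollama may take a few seconds before it answers. A minimal sketch of a readiness check that polls /api/tags until it responds or a timeout passes; `wait_for_ollama` is a hypothetical helper, and the URL in the usage comment is the same placeholder used above.

```shell
#!/usr/bin/env bash
# Sketch: poll the Ollama API until it responds or the timeout expires.
# $1 = base URL, $2 = timeout in seconds (default 30)
# Returns 0 once /api/tags answers, 1 if the timeout is reached first.
wait_for_ollama() {
  local base_url=$1 timeout=${2:-30} waited=0
  until curl -fsS --max-time 2 "$base_url/api/tags" >/dev/null 2>&1; do
    (( waited >= timeout )) && return 1
    sleep 1
    waited=$(( waited + 1 ))
  done
}

# Usage (substitute your container's address):
#   wait_for_ollama "http://<LXC-IP>:11434" 60 && echo "Ollama is up"
```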
All commands from here run on your laptop, not the Proxmox server. Aider handles the file edits and Git commits locally — the Proxmox LXC just provides the inference engine over LAN.
| Prefix | Behavior |
|---|---|
| ollama/qwen2.5-coder:14b | Uses raw completions API. Works but misses system prompt formatting. |
| ollama_chat/qwen2.5-coder:14b | Uses chat completions API with proper role formatting. Use this one. |
Always use the ollama_chat/ prefix for coding models.
Don't pip install on your system Python; use pipx for isolation:
# Install pipx if you don't have it:
sudo apt install pipx # Ubuntu/Debian
# or: brew install pipx # macOS
pipx install aider-chat
pipx ensurepath # Adds ~/.local/bin to PATH
# Verify:
aider --version
cd /path/to/your/project
# If not already a repo:
git init
git add .
git commit -m "initial commit before aider session"
Aider makes a commit after every accepted change, giving you a clean undo trail.
Create ~/.aider.conf.yml:
cat > ~/.aider.conf.yml << 'EOF'
# Ollama server on Proxmox LXC
model: ollama_chat/qwen2.5-coder-14b-aider
# Aider reads the Ollama endpoint from the OLLAMA_API_BASE environment variable
set-env:
  - OLLAMA_API_BASE=http://<YOUR_LXC_IP>:11434
# Behavior
auto-commits: true
dirty-commits: true
stream: true
# Context
map-tokens: 2048
max-chat-history-tokens: 4096
# UI
pretty: true
EOF
With this file in place, just run aider with no flags.
cd /your/project
aider
Aider has three interaction modes:
Default (code) mode edits files directly from your request
/ask <question> asks without editing
/architect <task> reasons about the approach first, then writes code. Closest to how Claude Code works.

/add <file> Add a file to the active context (model can edit it)
/drop <file> Remove a file from context (save tokens)
/ls List files currently in context
/run <command> Run a shell command and feed output to the model
/ask <question> Ask without making changes
/diff Show all changes made in this session
/undo Revert the last commit Aider made
/clear Clear chat history (keeps files in context)
/exit Exit Aider
Add only the files the task needs (/add). Review the diff before committing. Use /undo if the change isn't right. Feed test failures directly with /run pytest tests/ -v.
watch -n 1 rocm-smi
During active generation you should see: GPU utilization 80–100%, VRAM ~10–11GB (for 14B with 16K context), temp 60–80°C — normal for this card.
# Verify you're using ollama_chat/ not ollama/
# Check model is staying loaded (not reloading each request):
journalctl -u ollama -f
# You should NOT see "loading model" on every request
# If you do, OLLAMA_KEEP_ALIVE may not have applied:
systemctl show ollama | grep KEEP_ALIVE
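Grepping the full `systemctl show` output works, but pulling out a single value makes the check unambiguous. A sketch, assuming the `Environment=` line is a space-separated list of KEY=value pairs (true for the values used in this guide, which contain no spaces); `env_value` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch: extract one variable from a systemd "Environment=..." line.
# $1 = variable name; stdin = output of
#      systemctl show ollama --property=Environment
env_value() {
  tr ' ' '\n' \
    | sed 's/^Environment=//' \
    | awk -F= -v key="$1" '$1 == key { print substr($0, length(key) + 2) }'
}

# Demo on a sample line; on the server, pipe the systemctl output in instead.
sample='Environment=HSA_OVERRIDE_GFX_VERSION=10.3.0 OLLAMA_HOST=0.0.0.0 OLLAMA_KEEP_ALIVE=60m'
printf '%s\n' "$sample" | env_value OLLAMA_KEEP_ALIVE
```

An empty result means the drop-in was not picked up; rerun systemctl daemon-reload and restart the service.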
# On the server, list available models:
ollama list
# Make sure the exact name matches what's in your config
# Common mistake: config says qwen2.5-coder-14b-aider but you forgot
# to run `ollama create` with the custom modelfile
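The name check can also be done from the laptop against the /api/tags response. This sketch matches on the "name" fields of the JSON with plain text tools, so it assumes Ollama's usual response shape and that an untagged name resolves to the :latest tag; `model_available` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch: succeed if a model name appears in /api/tags JSON read from stdin.
# $1 = model name; ":latest" is assumed when no tag is given.
model_available() {
  local want=$1
  [[ $want == *:* ]] || want="$want:latest"
  grep -oE '"name":[[:space:]]*"[^"]+"' | cut -d'"' -f4 | grep -Fxq "$want"
}

# Demo on a sample response; in practice pipe in:
#   curl -s http://<LXC-IP>:11434/api/tags
sample='{"models":[{"name":"qwen2.5-coder-14b-aider:latest"},{"name":"qwen2.5-coder:14b"}]}'
if printf '%s' "$sample" | model_available qwen2.5-coder-14b-aider; then
  echo "model found"
fi
```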
git config --global user.email "you@example.com"
git config --global user.name "Your Name"
Use /drop on anything not immediately needed
Run /clear (removes conversation but keeps files)
Reduce num_ctx in your modelfile to 8192 if 16384 is causing OOM
On RX 6700 XT with Qwen2.5-Coder 14B (Q4_K_M) at 16K context:
| Metric | Expected |
|---|---|
| Generation speed | 12–20 tokens/sec |
| Prompt processing | ~1000–2000 tokens/sec |
| VRAM (model loaded, idle) | ~9.5 GB |
| VRAM (during generation) | ~11–11.5 GB |
| Time to first token | 1–4 seconds |
| Cold load time (not in VRAM) | 8–15 seconds |
With OLLAMA_KEEP_ALIVE=60m, cold loads only happen after 60 minutes of inactivity.
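The table's rates translate into a rough wall-clock estimate per request: prompt tokens divided by the prompt-processing rate, plus output tokens divided by the generation rate. A sketch using the low end of the ranges above; the token counts are illustrative assumptions, and `estimate_seconds` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch: estimate response time in seconds for a warm model.
# $1 = prompt tokens, $2 = output tokens,
# $3 = prompt-processing tokens/sec, $4 = generation tokens/sec
estimate_seconds() {
  awk -v p="$1" -v o="$2" -v pp="$3" -v tg="$4" \
    'BEGIN { printf "%.1f\n", p / pp + o / tg }'
}

# Illustrative: 4096-token prompt, 400-token reply, 1000 and 12 tokens/sec
estimate_seconds 4096 400 1000 12
```

Generation dominates for typical edits, which is why trimming context helps less than keeping replies short.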
# === On the Proxmox LXC (server) ===
systemctl status ollama
journalctl -u ollama -f
ollama list
rocm-smi
watch -n 1 rocm-smi
# === On your laptop (client) ===
aider # Launch with config file defaults
aider --model ollama_chat/qwen2.5-coder:14b # Override model
# Inside Aider:
/add <file> # Add file to context
/drop <file> # Remove file from context
/run <cmd> # Run shell command, feed output to model
/ask <question> # Ask without editing
/architect <task> # Reason first, then edit (most powerful mode)
/undo # Revert last AI commit
/diff # Show all changes this session
/clear # Reset chat history