Deploying a Local LLM Cluster on Raspberry Pi 5: A Step-by-Step Guide

2026-02-26

Practical guide to provisioning, thermal/swap tuning, model sharding, and OTA CI/CD for a Pi5 + AI HAT+ 2 LLM cluster.

Why build a Raspberry Pi 5 LLM cluster at the edge in 2026?

If you manage ML-enabled products, you’ve hit the same wall as many teams in 2025–26: cloud LLM costs and privacy constraints are rising, and delivering low-latency, private inference at the edge is suddenly practical thanks to better quantization and purpose-built accelerators like the AI HAT+ 2 for Raspberry Pi 5. This guide gives a practical, production-minded path to provisioning, networking, and orchestrating a small Pi5 cluster for local model serving — covering thermal and swap tuning, model sharding strategies, secure SSH/OTA workflows, and CI/CD templates you can run today.

What you’ll get — at a glance

Hardware and topology recommendations for a 3–5 node Pi5 + AI HAT+ 2 cluster
Provisioning steps (images, SSH, network, inventory)
Thermal & swap tuning (zram, swapfile, fan control, monitoring)
Two practical model-sharding options and sample orchestration patterns
Secure remote update (OTA) and GitOps CI/CD blueprint
Scripts, systemd units, and an Ansible/k3s template to get you running

Context & 2026 trends that matter

Late 2025 and early 2026 cemented a few trends relevant to edge LLM deployments:

Wide adoption of GGUF/4-bit and 3-bit quantization: models that used to require big GPUs now run acceptably on ARM CPUs with acceleration from boards like AI HAT+ 2.
Better ARM aarch64 toolchains and container images — multi-arch builds are standard in CI, enabling reproducible edge images.
Edge orchestration gets pragmatic — lightweight Kubernetes (k3s) and GitOps (Flux/ArgoCD) are common for fleets of small nodes.
Security-first OTA — signed delta updates and remote attestations matter as deployments move from lab to production.

Cluster topology & hardware checklist

Design for reliability and realistic throughput; a recommended small cluster:

3–5 x Raspberry Pi 5 (a 3-node cluster gives redundancy; 5 helps with throughput/replication)
AI HAT+ 2 attached to each Pi5 for acceleration and offload
Fast local storage: NVMe via compatible PCIe adapter or high-end UHS SD (avoid cheap SD cards)
Gigabit LAN switch, dedicated /24 subnet for the cluster (or VLAN)
Power with per-node UPS or a small rack UPS for graceful shutdowns
Active cooling / heatsink stack per Pi5 + thermal monitoring

Provisioning: from image to SSH (practical steps)

Use a reproducible base image and automation. The steps below assume aarch64 Raspberry Pi OS or Ubuntu Server 24.04+ for Pi5.

1) Build a golden image

Start from the official Ubuntu or Raspberry Pi OS aarch64 image. Customize with cloud-init or a first-boot Ansible pull.

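# prepare a build host for image customization (example outline)
sudo apt update && sudo apt install -y qemu-user-static
# PiShrink is not packaged for apt; fetch pishrink.sh from its upstream GitHub repository if you want to shrink the finished image
# copy and mount the image, chroot, and install packages: docker/containerd, python3, openssh-server, prometheus-node-exporter, etc.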

2) SSH and identity

Disable password auth and ensure SSH keys and host CAs are in place.

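# on your admin workstation
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_rsa
# push public keys into each admin user's ~/.ssh/authorized_keys via Ansible
# /etc/ssh/sshd_config changes on every node (reload sshd afterwards)
PasswordAuthentication no
PermitRootLogin no
# KbdInteractiveAuthentication replaces the deprecated ChallengeResponseAuthentication alias
KbdInteractiveAuthentication no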
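
A quick way to confirm the hardening landed on every node is to audit the deployed config. The sketch below is illustrative only: the function name and the required-settings table are assumptions, and it deliberately ignores Match blocks and Include directives.

```python
# Hypothetical audit helper; REQUIRED mirrors the two settings shown above.
REQUIRED = {"passwordauthentication": "no", "permitrootlogin": "no"}


def audit_sshd_config(text, required=REQUIRED):
    """Return settings that are missing or differ from the required values.

    The result maps each offending keyword to the value found (None if absent).
    """
    seen = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        key, value = parts[0].lower(), parts[1].strip().lower()
        # sshd uses the first value it obtains for each keyword.
        seen.setdefault(key, value)
    return {k: seen.get(k) for k, v in required.items() if seen.get(k) != v}


if __name__ == "__main__":
    sample = "PasswordAuthentication no\nPermitRootLogin yes\n"
    print(audit_sshd_config(sample))
```

An empty result means both settings are in place; anything else names the keyword that still needs attention.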

3) Inventory and naming

Use stable hostnames and static DHCP reservations or static IPs. Example naming: pi5-llm-01.local, -02, -03. Create an Ansible inventory (snippet):

[llm_nodes]
pi5-llm-01 ansible_host=192.168.50.21
pi5-llm-02 ansible_host=192.168.50.22
pi5-llm-03 ansible_host=192.168.50.23
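
For more than a handful of nodes, the inventory can be generated rather than typed. This is a minimal sketch assuming the naming scheme and 192.168.50.0/24 addressing used above; the function name, its parameters, and the starting host offset of 21 are illustrative choices, not part of Ansible.

```python
import ipaddress


def build_inventory(count, subnet="192.168.50.0/24", first_host=21,
                    group="llm_nodes", prefix="pi5-llm"):
    """Return Ansible INI inventory text for `count` sequentially addressed nodes."""
    network = ipaddress.ip_network(subnet)
    base = int(network.network_address)
    lines = [f"[{group}]"]
    for i in range(count):
        address = ipaddress.ip_address(base + first_host + i)
        if address not in network:
            raise ValueError(f"{address} falls outside {network}")
        lines.append(f"{prefix}-{i + 1:02d} ansible_host={address}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_inventory(3), end="")
```

Calling build_inventory(3) reproduces the three-node snippet above; the range check keeps a typo in the node count from silently spilling into a neighbouring subnet.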

Thermal management: keep the cluster stable under inference load

LLM workloads are sustained and can trigger thermal throttling. Combine passive cooling, active control, and software limits.

Hardware mitigations

Z-height aluminum heatsinks and a compact fan per Pi; align airflow across the rack/shelf.
Use metal case plates to distribute heat when stacking.
Ensure AI HAT+ 2 has its own thermal path (heatsink + airflow).

Software & control (example)

Use the built-in hwmon/sysfs and a small daemon to control a PWM fan and log temps. Sample systemd unit (concept):

[Unit]
Description=Simple PWM fan controller
After=multi-user.target

[Service]
Type=simple
ExecStart=/usr/local/bin/fan-control.py
Restart=always

[Install]
WantedBy=multi-user.target

Your fan-control.py reads /sys/class/thermal/thermal_zone*/temp (values are in millidegrees Celsius) and sets the fan PWM via GPIO; it can emit Prometheus metrics from the same loop. Keep thresholds conservative: start the fan at 55°C, run it at full speed from 75°C, and gracefully reduce CPU frequency at 85°C.
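
The control logic can be sketched as a simple linear fan curve between the two thresholds. This is an assumption-laden outline rather than the tested script: the function names and the 30% minimum duty are hypothetical, and the hardware-specific step of writing the duty cycle to the fan's PWM pin is intentionally left out.

```python
import glob


def fan_duty_percent(temp_c, start_c=55.0, max_c=75.0, min_duty=30.0):
    """Map a temperature in Celsius to a PWM duty cycle in percent.

    Below start_c the fan is off; at or above max_c it runs at 100%.
    In between, duty rises linearly from min_duty (an assumed floor that
    keeps small fans from stalling) to 100%.
    """
    if temp_c < start_c:
        return 0.0
    if temp_c >= max_c:
        return 100.0
    span = (temp_c - start_c) / (max_c - start_c)
    return min_duty + span * (100.0 - min_duty)


def read_hottest_zone_c(pattern="/sys/class/thermal/thermal_zone*/temp"):
    """Return the hottest thermal zone in Celsius, or None if none is readable."""
    readings = []
    for path in glob.glob(pattern):
        try:
            with open(path) as handle:
                # sysfs reports millidegrees Celsius
                readings.append(int(handle.read().strip()) / 1000.0)
        except (OSError, ValueError):
            continue  # some zones refuse reads; skip them
    return max(readings) if readings else None


if __name__ == "__main__":
    temp = read_hottest_zone_c()
    if temp is None:
        print("no readable thermal zones")
    else:
        print(f"{temp:.1f} C -> {fan_duty_percent(temp):.0f}% duty")
```

A real daemon would call these in a loop every few seconds and add a little hysteresis so the fan does not toggle around 55°C.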

Swap and memory: zram + fallback swapfile

Pi5 + AI HAT+ 2 still has limited RAM compared to servers. Use zram for compressed RAM swap with a fast on-disk swapfile as fallback for big batch spikes.

Enable zram

sudo apt install -y zram-tools
# /etc/default/zramswap (example)
PERCENT=30  # percent of RAM to use for compressed swap
# apply the change
sudo systemctl restart zramswap

Fallback swapfile

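sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# persist the swapfile across reboots
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
# tune swappiness now and persist it
sudo sysctl vm.swappiness=10
echo 'vm.swappiness=10' | sudo tee /etc/sysctl.d/99-swappiness.conf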

Keep vm.swappiness low (10-20) — prefer RAM and zram; use swap only during occasional pressure. Monitor with htop and vmstat.
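
For unattended monitoring, the same numbers htop shows can be read from /proc/meminfo. A minimal sketch, assuming the standard "Key: value kB" layout of that file; the function names are illustrative.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def swap_used_percent(meminfo):
    """Return swap usage as a percentage of total swap (0.0 when no swap exists)."""
    total = meminfo.get("SwapTotal", 0)
    if total == 0:
        return 0.0
    return 100.0 * (total - meminfo.get("SwapFree", 0)) / total


if __name__ == "__main__":
    sample = "MemTotal: 8000000 kB\nSwapTotal: 1000 kB\nSwapFree: 250 kB\n"
    print(f"swap used: {swap_used_percent(parse_meminfo(sample)):.1f}%")
```

On a node, pass the contents of /proc/meminfo instead of the sample string; a sustained high percentage suggests the model or batch size is too large for the node.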

Model serving and two practical sharding strategies

There are two realistic paths for small Pi clusters in 2026: request-level sharding (replication + routing) and model-parallel sharding (layer-slicing). Choose based on throughput, latency targets, and model size.

Option A: Request-level sharding (recommended for reliability)

Run a full quantized model instance (GGUF/ggml) on each node and use a lightweight router to distribute requests. This is simple, resilient, and works well when models are small enough to fit per-node memory (many 4-bit quant models do).

Pros: simpler, no network activation passing, easy scaling and rolling updates
Cons: higher memory per node, less efficient for very large models
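
Whether a model fits per node is mostly arithmetic: weight memory is roughly parameters × bits per weight ÷ 8. The sketch below works that through; the overhead and reserve figures are placeholder assumptions to replace with your own measurements, and real GGUF quantization formats store per-block scales, so the effective bits per weight sits somewhat above the nominal figure.

```python
GIB = 1024 ** 3


def weights_gib(n_params, bits_per_weight):
    """Approximate weight memory in GiB for a quantized model."""
    return n_params * bits_per_weight / 8 / GIB


def fits_on_node(n_params, bits_per_weight, node_ram_gib,
                 overhead_gib=1.5, reserve_gib=1.0):
    """Rough fit check.

    overhead_gib (KV cache, runtime buffers) and reserve_gib (OS and other
    services) are assumed placeholders, not measured values.
    """
    needed = weights_gib(n_params, bits_per_weight) + overhead_gib
    return needed <= node_ram_gib - reserve_gib


if __name__ == "__main__":
    for params in (3e9, 7e9, 13e9):
        size = weights_gib(params, 4.5)
        print(f"{params / 1e9:.0f}B at 4.5 bits: {size:.2f} GiB, "
              f"fits in 8 GiB: {fits_on_node(params, 4.5, 8)}")
```

Treat the result as a first filter only; measure resident memory under real prompts before committing to a model.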

Architecture blueprint:

API Gateway (small x86 or Pi node) — routes to nodes via Nginx or Envoy
Node model server — container running a llama.cpp-based server (or a comparable ggml runtime) exposed on a local port
Health checks + metrics exported (Prometheus node-exporter & model-exporter)

# sample Nginx upstream (simple round-robin)
upstream llm_pool {
    server 192.168.50.21:8080;
    server 192.168.50.22:8080;
    server 192.168.50.23:8080;
}

server {
    listen 80;
    location /v1/completions {
        proxy_pass http://llm_pool;
    }
}
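
To make the routing behaviour concrete, here is a small round-robin selector that skips nodes marked unhealthy. It illustrates the logic only and is not a replacement for Nginx or Envoy; the class and method names are hypothetical.

```python
import itertools


class RoundRobinPool:
    """Cycle through backends in order, skipping any marked unhealthy."""

    def __init__(self, backends):
        if not backends:
            raise ValueError("at least one backend is required")
        self._backends = list(backends)
        self._healthy = set(self._backends)
        self._cycle = itertools.cycle(self._backends)

    def mark(self, backend, healthy):
        """Record the result of a health check for one backend."""
        if healthy:
            self._healthy.add(backend)
        else:
            self._healthy.discard(backend)

    def next_backend(self):
        """Return the next healthy backend, trying each backend at most once."""
        for _ in range(len(self._backends)):
            candidate = next(self._cycle)
            if candidate in self._healthy:
                return candidate
        raise RuntimeError("no healthy backends available")


if __name__ == "__main__":
    pool = RoundRobinPool(["192.168.50.21:8080", "192.168.50.22:8080",
                           "192.168.50.23:8080"])
    pool.mark("192.168.50.22:8080", healthy=False)
    print([pool.next_backend() for _ in range(4)])
```

With one node marked down, requests alternate between the remaining two, which is the failure mode a 3-node cluster is meant to absorb.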

Option B: Model-parallel sharding (layer-slicing) — advanced

Distribute layers across nodes and pass activations over a fast RPC (gRPC/ZeroMQ). This is more complex but lets you serve models larger than a single node's memory.

Pros: serve larger models with small nodes, potentially higher aggregate throughput
Cons: latency overhead from activation passing, fragile network dependency, complex sync

High-level implementation steps (blueprint):

Quantize model into layer partitions (layer 0..N split into contiguous blocks)
On each node, run a lightweight inference engine that can execute only its layer slice (use a compiled ggml/libtorch backend that supports loading partial model weights)
Implement an activation RPC: the forward pass endpoint accepts a tensor, runs local layers, returns next activation to the orchestrator
Front-end aggregator coordinates the sequence and performs token decoding

Use a binary RPC over a fast LAN (gRPC streaming or custom TCP with protobuf). Measure and optimize to reduce copy/serialization costs (use raw bytes for tensor transport). Expect to tune batch size and pipeline concurrency heavily.
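
Two pieces of this blueprint are easy to prototype before touching an inference engine: assigning contiguous layer blocks to nodes, and framing an activation tensor as raw bytes. The sketch below assumes float32 activations and little-endian hosts on every node; the function names and the two-integer header are illustrative, not a protocol from any existing library.

```python
import struct
from array import array

# Header: rows and cols as little-endian unsigned 32-bit integers (assumed layout).
HEADER = struct.Struct("<II")


def partition_layers(n_layers, n_nodes):
    """Split layers 0..n_layers-1 into contiguous half-open (start, end) blocks."""
    if n_nodes < 1 or n_layers < n_nodes:
        raise ValueError("need at least one layer per node")
    base, extra = divmod(n_layers, n_nodes)
    blocks, start = [], 0
    for node in range(n_nodes):
        size = base + (1 if node < extra else 0)  # earlier nodes absorb the remainder
        blocks.append((start, start + size))
        start += size
    return blocks


def pack_activation(rows, cols, values):
    """Serialize a rows x cols float32 tensor as header + raw bytes."""
    if len(values) != rows * cols:
        raise ValueError("value count does not match shape")
    # array('f') uses native byte order; all nodes are assumed little-endian.
    return HEADER.pack(rows, cols) + array("f", values).tobytes()


def unpack_activation(payload):
    """Inverse of pack_activation; returns (rows, cols, values)."""
    rows, cols = HEADER.unpack_from(payload)
    values = array("f")
    values.frombytes(payload[HEADER.size:])
    if len(values) != rows * cols:
        raise ValueError("payload size does not match header")
    return rows, cols, list(values)


if __name__ == "__main__":
    print(partition_layers(32, 3))
    frame = pack_activation(1, 2, [1.0, 2.5])
    print(len(frame), unpack_activation(frame))
```

Raw framing like this avoids per-element serialization cost; the trade-off is that shape, dtype, and byte order become part of the contract between nodes.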

Orchestration: k3s + GitOps for reproducible deployments

For fleets, use k3s (lightweight Kubernetes) and Flux/ArgoCD for Git-driven deployments. k3s supports ARM and is sufficiently small for Pi clusters.

Install k3s (quick)

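curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.28.6+k3s1 sh -s - server --disable traefik
# read the join token on the server: sudo cat /var/lib/rancher/k3s/server/node-token
# join agent nodes with the node token (run on each agent)
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.28.6+k3s1 K3S_URL=https://192.168.50.21:6443 K3S_TOKEN=XXX sh -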

Deploy model servers as k3s workloads

Use a Deployment + Service per model. Example Helm/manifest approach:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ggml-model
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ggml
  template:
    metadata:
      labels:
        app: ggml
    spec:
      containers:
      - name: ggml
        image: myregistry/ggml-arm64:latest  # replace latest with the CI-built commit-SHA tag in production
        ports:
        - containerPort: 8080
        resources:
          limits:
            memory: "6Gi"
            cpu: "1500m"

Remote update strategies (OTA) and CI/CD

Edge OTA must be safe: signed images, delta updates, staged rollouts, and rollback. For small teams you can combine Docker image signing + GitOps:

Recommended CI pipeline (GitHub Actions outline)

On push to main: build multi-arch images using buildx (arm64) and push to registry
Sign images with cosign and push signatures to a registry or key server
Create a new Kubernetes manifest commit (image tag bump) in an infra repo
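Flux in the cluster notices the change and applies the manifests; to enforce signatures on workload images, pair it with an admission policy (for example Kyverno or the Sigstore policy-controller)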

# sample buildx (CI job)
docker buildx build --platform linux/arm64 --push -t myregistry/ggml-arm64:${{ github.sha }} .
# cosign.key should come from a CI secret; --yes skips the interactive confirmation prompt
cosign sign --yes --key cosign.key myregistry/ggml-arm64:${{ github.sha }}
# commit the image tag to infra repo and push
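
The tag-bump step can be a few lines of scripting run by the CI job before it commits to the infra repo. This is a sketch under stated assumptions: the helper name is hypothetical, it edits the manifest as plain text, and it does not handle digest-pinned images (name@sha256:...).

```python
import re


def bump_image_tag(manifest_text, image, new_tag):
    """Rewrite every `image: <image>[:tag]` line to use new_tag.

    Indentation, a leading list dash, and any trailing comment are preserved.
    """
    pattern = re.compile(
        r"^([ \t]*(?:-[ \t]*)?image:[ \t]*)" + re.escape(image)
        + r"(?::[^\s#]+)?([ \t]*(?:#.*)?)$",
        re.MULTILINE,
    )
    updated, count = pattern.subn(
        lambda m: f"{m.group(1)}{image}:{new_tag}{m.group(2)}", manifest_text
    )
    if count == 0:
        raise ValueError(f"no image line found for {image}")
    return updated


if __name__ == "__main__":
    manifest = "      containers:\n      - name: ggml\n        image: myregistry/ggml-arm64:latest\n"
    print(bump_image_tag(manifest, "myregistry/ggml-arm64", "3f9c2ab"), end="")
```

Failing loudly when no line matches is deliberate: a silent no-op would leave the cluster on the old image while CI reports success.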

Staged rollout and rollback

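Use Kubernetes Deployment rolling-update settings (maxSurge/maxUnavailable) together with readiness and liveness probes. A failing rollout stalls but is not reverted automatically, so automate rollback (for example a scripted kubectl rollout undo, or a progressive-delivery tool such as Flagger) when probes fail post-deploy. Keep a canary namespace for testing updates on a single node first.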
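
The promote-or-rollback decision for a canary can be reduced to a small, testable rule. The sketch below is one possible policy, not a feature of k3s or Flux: the function name, the sample minimum, and the 10% failure budget are all assumptions to tune.

```python
def rollout_decision(probe_results, min_samples=10, max_failure_ratio=0.1):
    """Return 'wait', 'promote', or 'rollback' from boolean probe outcomes.

    probe_results holds True for each passing health probe against the
    canary and False for each failure, oldest first.
    """
    if len(probe_results) < min_samples:
        return "wait"  # not enough evidence yet
    failures = sum(1 for ok in probe_results if not ok)
    if failures / len(probe_results) > max_failure_ratio:
        return "rollback"
    return "promote"


if __name__ == "__main__":
    print(rollout_decision([True] * 8 + [False] * 2))
```

A wrapper script would act on "rollback" by reverting the image-tag commit in the infra repo, which keeps Git as the source of truth.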

Security & networking best practices

Use a private VLAN for the cluster and only expose the gateway/API to authorized networks.
Mutual TLS between services — use cert-manager on k3s or pre-provisioned certificates.
SSH via bastion host and SSH certs (short-lived) — avoid exposing SSH to the internet.
Restrict container capabilities; run model containers as non-root users.

Monitoring, logging, and alerting

Collect the basics: node metrics, thermal readings, swap usage, and inference latency. Minimal stack:

Prometheus + Grafana (node-exporter, custom model-exporter)
Alertmanager for critical conditions (temp > 85°C, swap > threshold, service down)
Centralized logs with Loki or a hosted alternative; keep logs compressed to reduce bandwidth

Example: Ansible snippet to install zram, containerd, and the fan controller

- hosts: llm_nodes
  become: yes
  tasks:
    - name: install packages
      apt:
        name: ["zram-tools","containerd","prometheus-node-exporter"]
        state: present
        update_cache: yes
    - name: deploy fan control script
      copy:
        src: fan-control.py
        dest: /usr/local/bin/fan-control.py
        mode: '0755'
    - name: deploy fan control unit
      copy:
        src: fan-control.service
        dest: /etc/systemd/system/fan-control.service
        mode: '0644'
    - name: enable fan service
      systemd:
        name: fan-control.service
        daemon_reload: yes
        enabled: yes
        state: started

Operational tips & trade-offs

Start with request-level sharding. It’s simple and less likely to fail in production.
Measure memory: some quant models still don’t fit. Prefer 4-bit quant or trimmed LLMs for Pi nodes.
Network latency matters — use wired Ethernet and a separate LAN for inference traffic.
If you need true model-parallel, prototype in a lab with 2–3 nodes and a synthetic workload first.
Expect to iterate on batch sizes, thermal thresholds, and timeouts to keep tail latencies acceptable.

Case study (short): 3-node prototype for on-prem chat assistant

In late 2025 a small fintech team built a private chat assistant for internal docs using three Pi5 nodes with AI HAT+ 2. They chose request-level sharding with a 4-bit quantized GGUF model. Outcomes:

Average latency: 300–600ms for short prompts (single-turn)
Operational cost: near-zero after one-off hardware purchase
Lessons learned: aggressive thermal tuning and zram were essential; a single node failure was handled gracefully by round-robin routing

Deploy small, test often, and automate updates with GitOps — those three practices move a Pi cluster from proof-of-concept to dependable service.

Actionable takeaways

Start small: provision 3 Pi5 nodes, enable zram, and run a quantized GGUF model per node.
Protect thermals: add fans and monitor temps; reduce CPU frequency deliberately in software rather than let uncontrolled thermal throttling derail inference.
Use k3s + Flux: GitOps for safe, auditable OTA updates with signed images.
Prefer request-level sharding first: implement model-parallel only if you must host models larger than a single node.

Next steps — checklist to deploy in a weekend

Order 3 Pi5 + AI HAT+ 2 and NVMe or high-quality SD cards
Build the golden image with SSH keys and zram enabled
Deploy k3s and add nodes
Deploy one quantized model as a Deployment and route with Nginx/Envoy
Configure Prometheus alerts for temp & swap and wire them into your incident workflow

Closing: future-proofing and 2026 predictions

Edge LLMs will keep maturing in 2026: expect better 3-bit quant formats, faster ARM-specific inference kernels, and more robust model-sharding libraries. Investing in standardized orchestration (k3s + GitOps), signed CI artifacts, and thermal/swap reliability pays dividends as you scale from a hobby cluster to a production edge fleet. The AI HAT+ 2 makes local inference feasible — but architecture and operations decide whether a deployment is maintainable.

Call to action

If you want a ready-to-run repo with Ansible playbooks, k3s manifests, a sample GitHub Actions CI, and a tested fan-control script for Pi5 + AI HAT+ 2, drop your email or start a trial with our templates library at various.cloud/edge-llm — we’ll send the repo and a 30-minute onboarding checklist to get your first cluster live this weekend.
