Deploying a Local LLM Cluster on Raspberry Pi 5: A Step-by-Step Guide


2026-02-26
10 min read

Practical guide to provisioning, thermal/swap tuning, model sharding, and OTA CI/CD for a Pi5 + AI HAT+ 2 LLM cluster.

Why build a Raspberry Pi 5 LLM cluster at the edge in 2026?

If you manage ML-enabled products, you’ve hit the same wall as many teams in 2025–26: cloud LLM costs and privacy constraints are rising, and delivering low-latency, private inference at the edge is suddenly practical thanks to better quantization and purpose-built accelerators like the AI HAT+ 2 for Raspberry Pi 5. This guide gives a practical, production-minded path to provisioning, networking, and orchestrating a small Pi5 cluster for local model serving — covering thermal and swap tuning, model sharding strategies, secure SSH/OTA workflows, and CI/CD templates you can run today.

What you’ll get — at a glance

  • Hardware and topology recommendations for a 3–5 node Pi5 + AI HAT+ 2 cluster
  • Provisioning steps (images, SSH, network, inventory)
  • Thermal & swap tuning (zram, swapfile, fan control, monitoring)
  • Two practical model-sharding options and sample orchestration patterns
  • Secure remote update (OTA) and GitOps CI/CD blueprint
  • Scripts, systemd units, and an Ansible/k3s template to get you running

Late 2025 and early 2026 cemented a few trends relevant to edge LLM deployments:

  • Wide adoption of GGUF 4-bit and 3-bit quantization — models that used to require big GPUs now run acceptably on ARM CPUs with acceleration from boards like the AI HAT+ 2.
  • Better ARM aarch64 toolchains and container images — multi-arch builds are standard in CI, enabling reproducible edge images.
  • Edge orchestration gets pragmatic — lightweight Kubernetes (k3s) and GitOps (Flux/ArgoCD) are common for fleets of small nodes.
  • Security-first OTA — signed delta updates and remote attestations matter as deployments move from lab to production.

Cluster topology & hardware checklist

Design for reliability and realistic throughput; a recommended small cluster:

  • 3–5 x Raspberry Pi 5 (a 3-node cluster gives redundancy; 5 helps with throughput/replication)
  • AI HAT+ 2 attached to each Pi5 for acceleration and offload
  • Fast local storage: NVMe via compatible PCIe adapter or high-end UHS SD (avoid cheap SD cards)
  • Gigabit LAN switch, dedicated /24 subnet for the cluster (or VLAN)
  • Power with per-node UPS or a small rack UPS for graceful shutdowns
  • Active cooling / heatsink stack per Pi5 + thermal monitoring

Provisioning: from image to SSH (practical steps)

Use a reproducible base image and automation. The steps below assume aarch64 Raspberry Pi OS or Ubuntu Server 24.04+ for Pi5.

1) Build a golden image

Start from the official Ubuntu or Raspberry Pi OS aarch64 image. Customize with cloud-init or a first-boot Ansible pull.

# generate a golden image (example outline)
sudo apt update && sudo apt install -y qemu-user-static
# PiShrink is a standalone script, not an apt package — fetch it from its GitHub repo
# copy and mount the image, chroot, and install packages: docker/containerd, python3, openssh-server, prometheus-node-exporter, etc.

2) SSH and identity

Disable password auth and ensure SSH keys and host CAs are in place.

# on your admin workstation
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_rsa
# push public keys into each node's ~/.ssh/authorized_keys via Ansible
# sshd_config changes
PasswordAuthentication no
PermitRootLogin no
ChallengeResponseAuthentication no

3) Inventory and naming

Use stable hostnames and static DHCP reservations or static IPs. Example naming: pi5-llm-01.local, -02, -03. Create an Ansible inventory (snippet):

[llm_nodes]
pi5-llm-01 ansible_host=192.168.50.21
pi5-llm-02 ansible_host=192.168.50.22
pi5-llm-03 ansible_host=192.168.50.23

Thermal management: keep the cluster stable under inference load

LLM workloads are sustained and can trigger thermal throttling. Combine passive cooling, active control, and software limits.

Hardware mitigations

  • Full-height aluminum heatsink stacks and a compact fan per Pi; align airflow across the rack/shelf.
  • Use metal case plates to distribute heat when stacking.
  • Ensure AI HAT+ 2 has its own thermal path (heatsink + airflow).

Software & control (example)

Use the built-in hwmon/sysfs and a small daemon to control a PWM fan and log temps. Sample systemd unit (concept):

[Unit]
Description=Simple PWM fan controller
After=multi-user.target

[Service]
Type=simple
ExecStart=/usr/local/bin/fan-control.py
Restart=always

[Install]
WantedBy=multi-user.target

Your fan-control.py reads /sys/class/thermal/thermal_zone*/temp and sets PWM via GPIO; emit Prometheus metrics concurrently. Keep thresholds conservative: start fan at 55°C, max at 75°C, and gracefully reduce CPU frequency at 85°C.
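A minimal sketch of what fan-control.py can look like. The sysfs glob and the linear PWM ramp are illustrative; the actual PWM write depends on your fan and HAT wiring, so it is left as a stub.

```python
#!/usr/bin/env python3
"""Sketch of fan-control.py: map SoC temperature to a PWM duty cycle.

Thresholds follow the text: fan on at 55 C, full speed at 75 C.
The sysfs paths and PWM wiring are assumptions; adapt to your hardware.
"""
import glob
import time

FAN_ON_C, FAN_MAX_C = 55.0, 75.0

def read_max_temp_c(pattern="/sys/class/thermal/thermal_zone*/temp"):
    """Return the hottest thermal zone in degrees Celsius."""
    temps = []
    for path in glob.glob(pattern):
        with open(path) as f:
            temps.append(int(f.read().strip()) / 1000.0)  # sysfs reports millidegrees
    return max(temps) if temps else 0.0

def duty_for_temp(temp_c):
    """Linear ramp: 0% below FAN_ON_C, 100% at or above FAN_MAX_C."""
    if temp_c <= FAN_ON_C:
        return 0
    if temp_c >= FAN_MAX_C:
        return 100
    return round(100 * (temp_c - FAN_ON_C) / (FAN_MAX_C - FAN_ON_C))

def control_loop(set_pwm, interval_s=5):
    """Main daemon loop: read temps, drive the fan, repeat."""
    while True:
        set_pwm(duty_for_temp(read_max_temp_c()))
        time.sleep(interval_s)
```

Call control_loop() from the systemd unit above, passing a set_pwm callback wired to your GPIO library of choice; emitting the same readings as Prometheus metrics can happen in the same loop.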

Swap and memory: zram + fallback swapfile

Pi5 + AI HAT+ 2 still has limited RAM compared to servers. Use zram for compressed RAM swap with a fast on-disk swapfile as fallback for big batch spikes.

Enable zram

sudo apt install -y zram-tools
# /etc/default/zramswap (example; zram-tools uses the PERCENT variable)
PERCENT=30  # percent of RAM to use for compressed swap

Fallback swapfile

sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# add to /etc/fstab
/swapfile none swap sw 0 0
# tune swappiness (add vm.swappiness=10 to /etc/sysctl.conf to persist across reboots)
sudo sysctl vm.swappiness=10

Keep vm.swappiness low (10-20) — prefer RAM and zram; use swap only during occasional pressure. Monitor with htop and vmstat.
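To watch that pressure programmatically (for cron or a Prometheus textfile collector), a small /proc/meminfo parser is enough. This is a sketch; feed the percentage into your own alerting, e.g. page when swap stays high for minutes.

```python
"""Quick swap-pressure check to run alongside htop/vmstat."""

def parse_meminfo(text):
    """Return a dict of /proc/meminfo fields in kB."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            info[key] = int(fields[0])
    return info

def swap_used_percent(info):
    """Percentage of configured swap (zram + swapfile) currently in use."""
    total = info.get("SwapTotal", 0)
    if total == 0:
        return 0.0
    return 100.0 * (total - info.get("SwapFree", 0)) / total
```

Usage on a node: `swap_used_percent(parse_meminfo(open("/proc/meminfo").read()))`.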

Model serving and two practical sharding strategies

There are two realistic paths for small Pi clusters in 2026: request-level sharding (replication + routing) and model-parallel sharding (layer-slicing). Choose based on throughput, latency targets, and model size.

Option A: Request-level sharding (replication + routing)

Run a full quantized model instance (GGUF/ggml) on each node and use a lightweight router to distribute requests. This is simple, resilient, and works well when models are small enough to fit per-node memory (many 4-bit quant models do).

  • Pros: simpler, no network activation passing, easy scaling and rolling updates
  • Cons: higher memory per node, less efficient for very large models

Architecture blueprint:

  • API Gateway (small x86 or Pi node) — routes to nodes via Nginx or Envoy
  • Node model server — container running a llama.cpp server (or another GGUF-capable runtime) exposed on a local port
  • Health checks + metrics exported (Prometheus node-exporter & model-exporter)
# sample Nginx upstream (simple round-robin)
upstream llm_pool {
    server 192.168.50.21:8080;
    server 192.168.50.22:8080;
    server 192.168.50.23:8080;
}

server {
    listen 80;
    location /v1/completions {
        proxy_pass http://llm_pool;
    }
}
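Nginx handles the rotation transparently, but the failover behavior is worth understanding. Here is a toy Python router mirroring the upstream above; the addresses come from the example subnet, and the `send` function is injected so the routing logic runs without a live cluster.

```python
"""Toy round-robin router with failover, mirroring the Nginx upstream above."""
from itertools import count

NODES = ["192.168.50.21:8080", "192.168.50.22:8080", "192.168.50.23:8080"]

class RoundRobinRouter:
    def __init__(self, nodes):
        self.nodes = list(nodes)
        self._counter = count()  # advances the starting node per request

    def route(self, prompt, send):
        """Try each node starting from the round-robin position; skip failures."""
        start = next(self._counter)
        last_err = None
        for i in range(len(self.nodes)):
            node = self.nodes[(start + i) % len(self.nodes)]
            try:
                return node, send(node, prompt)
            except ConnectionError as err:
                last_err = err  # node down: fall through to the next one
        raise RuntimeError("all nodes failed") from last_err
```

This is the behavior the case study below relies on: a dead node is simply skipped on the next attempt, with no coordination required.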

Option B: Model-parallel sharding (layer-slicing) — advanced

Distribute layers across nodes and pass activations over a fast RPC (gRPC/ZeroMQ). This is more complex but lets you serve models larger than a single node's memory.

  • Pros: serve larger models with small nodes, potentially higher aggregate throughput
  • Cons: latency overhead from activation passing, fragile network dependency, complex sync

High-level implementation steps (blueprint):

  1. Quantize model into layer partitions (layer 0..N split into contiguous blocks)
  2. On each node, run a lightweight inference engine that can execute only its layer slice (use a compiled ggml/libtorch backend that supports loading partial model weights)
  3. Implement an activation RPC: the forward pass endpoint accepts a tensor, runs local layers, returns next activation to the orchestrator
  4. Front-end aggregator coordinates the sequence and performs token decoding

Use a binary RPC over a fast LAN (gRPC streaming or custom TCP with protobuf). Measure and optimize to reduce copy/serialization costs (use raw bytes for tensor transport). Expect to tune batch size and pipeline concurrency heavily.
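The four blueprint steps can be simulated end to end in a few lines. The "layers" below are arithmetic stand-ins for transformer blocks, and each run_slice call stands in for the gRPC/TCP round trip in a real cluster.

```python
"""Toy simulation of layer-sliced inference: each 'node' owns a contiguous
block of layers and hands its activation to the next node in the pipeline."""

def make_layer(scale):
    """Stand-in for a transformer layer (just affine arithmetic here)."""
    return lambda xs: [v * scale + 1 for v in xs]

# six "layers" split into contiguous blocks across three nodes (step 1)
LAYERS = [make_layer(s) for s in (1, 1, 2, 2, 1, 1)]
SHARDS = {"node-0": LAYERS[0:2], "node-1": LAYERS[2:4], "node-2": LAYERS[4:6]}

def run_slice(node, activation):
    """Stand-in for the per-node RPC endpoint (steps 2-3)."""
    for layer in SHARDS[node]:
        activation = layer(activation)
    return activation

def forward(tokens):
    """Aggregator: pipe the activation through every node in order (step 4)."""
    activation = tokens
    for node in ("node-0", "node-1", "node-2"):
        activation = run_slice(node, activation)
    return activation
```

The key property to preserve in a real implementation is shown by the equivalence check: running the slices in sequence must give exactly the same result as running all layers on one machine.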

Orchestration: k3s + GitOps for reproducible deployments

For fleets, use k3s (lightweight Kubernetes) and Flux/ArgoCD for Git-driven deployments. k3s supports ARM and is sufficiently small for Pi clusters.

Install k3s (quick)

curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.28.6+k3s1 sh -s - server --disable traefik
# join agent nodes with the server's node token (/var/lib/rancher/k3s/server/node-token)
curl -sfL https://get.k3s.io | K3S_URL=https://192.168.50.21:6443 K3S_TOKEN=XXX sh -

Deploy model servers as k3s workloads

Use a Deployment + Service per model. Example Helm/manifest approach:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ggml-model
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ggml
  template:
    metadata:
      labels:
        app: ggml
    spec:
      containers:
      - name: ggml
        image: myregistry/ggml-arm64:latest
        ports:
        - containerPort: 8080
        resources:
          limits:
            memory: "6Gi"
            cpu: "1500m"
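The text calls for a Deployment plus a Service per model; a minimal ClusterIP Service matching the labels above might look like this (name and port are illustrative, matching the example Deployment):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: ggml-model
spec:
  selector:
    app: ggml
  ports:
  - port: 8080
    targetPort: 8080
```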

Remote update strategies (OTA) and CI/CD

Edge OTA must be safe: signed images, delta updates, staged rollouts, and rollback. For small teams you can combine Docker image signing + GitOps:

  1. On push to main: build multi-arch images using buildx (arm64) and push to registry
  2. Sign images with cosign and push signatures to a registry or key server
  3. Create a new Kubernetes manifest commit (image tag bump) in an infra repo
  4. Flux in the cluster notices the change, validates signatures, and applies manifests
# sample buildx (CI job)
docker buildx build --platform linux/arm64 --push -t myregistry/ggml-arm64:${{ github.sha }} .
cosign sign --key cosign.key myregistry/ggml-arm64:${{ github.sha }}
# commit the image tag to infra repo and push

Staged rollout and rollback

Use k3s Deployment strategies (maxSurge/maxUnavailable) and health checks. Implement an automatic rollback if liveness probes fail post-deploy. Keep a canary namespace for testing updates on a single node first.
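The rollback decision itself is simple enough to sketch. Here probe and rollback are injected; in practice probe would hit the canary pod's liveness endpoint and rollback would shell out to `kubectl rollout undo`.

```python
"""Sketch of a post-deploy canary check: probe N times, roll back on failure."""

def verify_or_rollback(probe, rollback, attempts=5, required_ok=4):
    """Return True if enough probes pass; otherwise trigger rollback and return False."""
    ok = sum(1 for _ in range(attempts) if probe())
    if ok >= required_ok:
        return True
    rollback()
    return False
```

Tolerating one failed probe out of five (required_ok=4) avoids rolling back on a single transient timeout; tune both numbers to your latency SLO.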

Security & networking best practices

  • Use a private VLAN for the cluster and only expose the gateway/API to authorized networks.
  • Mutual TLS between services — use cert-manager on k3s or pre-provisioned certificates.
  • SSH via bastion host and SSH certs (short-lived) — avoid exposing SSH to the internet.
  • Restrict container capabilities; run model containers as non-root users.

Monitoring, logging, and alerting

Collect the basics: node metrics, thermal readings, swap usage, and inference latency. Minimal stack:

  • Prometheus + Grafana (node-exporter, custom model-exporter)
  • Alertmanager for critical conditions (temp > 85°C, swap > threshold, service down)
  • Centralized logs with Loki or a hosted alternative; keep logs compressed to reduce bandwidth

Example: Ansible snippet to install zram, Docker, and the fan controller

- hosts: llm_nodes
  become: yes
  tasks:
    - name: install packages
      apt:
        name: ["zram-tools","containerd","prometheus-node-exporter"]
        state: present
    - name: deploy fan control script
      copy:
        src: fan-control.py
        dest: /usr/local/bin/fan-control.py
        mode: '0755'
    - name: enable fan service
      systemd:
        name: fan-control.service
        enabled: yes
        state: started

Operational tips & trade-offs

  • Start with request-level sharding. It’s simple and less likely to fail in production.
  • Measure memory: some quant models still don’t fit. Prefer 4-bit quant or trimmed LLMs for Pi nodes.
  • Network latency matters — use wired Ethernet and a separate LAN for inference traffic.
  • If you need true model-parallel, prototype in a lab with 2–3 nodes and a synthetic workload first.
  • Expect to iterate on batch sizes, temperature settings, and timeouts to keep tail latencies acceptable.
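The memory question in the second tip is answerable with quick arithmetic. The 1.2x overhead factor below (KV cache, runtime buffers) is a rough assumption, so always confirm on-device.

```python
"""Back-of-envelope check for 'does this quant model fit on a node?'."""

def model_fits(params_billion, bits_per_weight, ram_gib, overhead=1.2):
    """Approximate weight footprint in GiB and compare against available RAM."""
    weight_gib = params_billion * 1e9 * bits_per_weight / 8 / 2**30
    return weight_gib * overhead <= ram_gib, round(weight_gib, 2)

# e.g. a 7B model at 4-bit is about 3.26 GiB of weights, roughly 3.9 GiB with
# overhead: workable on an 8 GB Pi 5, but tight alongside the OS and server.
```

By the same arithmetic, anything much past 13B at 4-bit is Option B territory on 8 GB nodes.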

Case study (short): 3-node prototype for on-prem chat assistant

In late 2025 a small fintech team built a private chat assistant for internal docs using three Pi5 nodes with AI HAT+ 2. They chose request-level sharding with a 4-bit quantized GGUF model. Outcomes:

  • Average latency: 300–600ms for short prompts (single-turn)
  • Operational cost: near-zero after one-off hardware purchase
  • Lessons learned: aggressive thermal tuning and zram were essential; one node failure gracefully handled by round-robin routing

Deploy small, test often, and automate updates with GitOps — those three practices move a Pi cluster from proof-of-concept to dependable service.

Actionable takeaways

  • Start small: provision 3 Pi5 nodes, enable zram, and run a quantized GGUF model per node.
  • Protect thermals: add fans and monitor temps; throttle proactively rather than let the SoC hard-throttle mid-inference.
  • Use k3s + Flux: GitOps for safe, auditable OTA updates with signed images.
  • Prefer request-level sharding first: implement model-parallel only if you must host models larger than a single node.

Next steps — checklist to deploy in a weekend

  1. Order 3 Pi5 + AI HAT+ 2 and NVMe or high-quality SD cards
  2. Build the golden image with SSH keys and zram enabled
  3. Deploy k3s and add nodes
  4. Deploy one quantized model as a Deployment and route with Nginx/Envoy
  5. Configure Prometheus alerts for temp & swap and onboard them into your incident workflow

Closing: future-proofing and 2026 predictions

Edge LLMs will keep maturing in 2026: expect better 3-bit quant formats, faster ARM-specific inference kernels, and more robust model-sharding libraries. Investing in standardized orchestration (k3s + GitOps), signed CI artifacts, and thermal/swap reliability pays dividends as you scale from a hobby cluster to a production edge fleet. The AI HAT+ 2 makes local inference feasible — but architecture and operations decide whether a deployment is maintainable.

Call to action

If you want a ready-to-run repo with Ansible playbooks, k3s manifests, a sample GitHub Actions CI, and a tested fan-control script for Pi5 + AI HAT+ 2, drop your email or start a trial with our templates library at various.cloud/edge-llm — we’ll send the repo and a 30-minute onboarding checklist to get your first cluster live this weekend.
