Why build a Raspberry Pi 5 LLM cluster at the edge in 2026?
If you manage ML-enabled products, you’ve probably hit the same wall as many teams in 2025–26: cloud LLM costs and privacy constraints are rising. At the same time, low-latency, private inference at the edge has become practical thanks to better quantization and purpose-built accelerators like the AI HAT+ 2 for Raspberry Pi 5. This guide gives a practical, production-minded path to provisioning, networking, and orchestrating a small Pi 5 cluster for local model serving. It covers thermal and swap tuning, model sharding strategies, secure SSH/OTA workflows, and CI/CD templates you can run today.
What you’ll get — at a glance
- Hardware and topology recommendations for a 3–5 node Pi5 + AI HAT+ 2 cluster
- Provisioning steps (images, SSH, network, inventory)
- Thermal & swap tuning (zram, swapfile, fan control, monitoring)
- Two practical model-sharding options and sample orchestration patterns
- Secure remote update (OTA) and GitOps CI/CD blueprint
- Scripts, systemd units, and an Ansible/k3s template to get you running
Context & 2026 trends that matter
Late 2025 and early 2026 cemented a few trends relevant to edge LLM deployments:
- Wide adoption of GGUF/4-bit and 3-bit quantization: models that used to require big GPUs now run acceptably on ARM CPUs with acceleration from boards like AI HAT+ 2.
- Better ARM aarch64 toolchains and container images — multi-arch builds are standard in CI, enabling reproducible edge images.
- Edge orchestration gets pragmatic — lightweight Kubernetes (k3s) and GitOps (Flux/ArgoCD) are common for fleets of small nodes.
- Security-first OTA — signed delta updates and remote attestations matter as deployments move from lab to production.
Cluster topology & hardware checklist
Design for reliability and realistic throughput; a recommended small cluster:
- 3–5 x Raspberry Pi 5 (a 3-node cluster gives redundancy; 5 helps with throughput/replication)
- AI HAT+ 2 attached to each Pi5 for acceleration and offload
- Fast local storage: NVMe via compatible PCIe adapter or high-end UHS SD (avoid cheap SD cards)
- Gigabit LAN switch, dedicated /24 subnet for the cluster (or VLAN)
- Power with per-node UPS or a small rack UPS for graceful shutdowns
- Active cooling / heatsink stack per Pi5 + thermal monitoring
Provisioning: from image to SSH (practical steps)
Use a reproducible base image and automation. The steps below assume aarch64 Raspberry Pi OS or Ubuntu Server 24.04+ for Pi5.
1) Build a golden image
Start from the official Ubuntu or Raspberry Pi OS aarch64 image. Customize with cloud-init or a first-boot Ansible pull.
# prepare image-build tooling (example outline)
sudo apt update && sudo apt install -y qemu-user-static
# PiShrink is distributed as a shell script from its GitHub repository, not as an apt package; install it separately
# copy and mount the image, chroot, and install packages: docker/containerd, python3, openssh-server, node-exporter, etc.
2) SSH and identity
Disable password auth and ensure SSH keys and host CAs are in place.
# on your admin workstation
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_rsa
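# push public keys into each user's ~/.ssh/authorized_keys via Ansible
# (or into a central path if you set AuthorizedKeysFile in sshd_config)
# sshd_config changes
PasswordAuthentication no
PermitRootLogin no
# KbdInteractiveAuthentication replaces the deprecated ChallengeResponseAuthentication keyword
KbdInteractiveAuthentication no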
3) Inventory and naming
Use stable hostnames and static DHCP reservations or static IPs. Example naming: pi5-llm-01.local, -02, -03. Create an Ansible inventory (snippet):
[llm_nodes]
pi5-llm-01 ansible_host=192.168.50.21
pi5-llm-02 ansible_host=192.168.50.22
pi5-llm-03 ansible_host=192.168.50.23
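As the cluster grows, the inventory above can be generated rather than hand-edited. The following is a minimal sketch, assuming nodes are numbered consecutively and receive consecutive IPs; the function name build_inventory is hypothetical and not part of Ansible.

```python
import ipaddress


def build_inventory(group, prefix, first_ip, count):
    """Return INI-style inventory text for `count` consecutively addressed nodes.

    Assumption: hostnames follow <prefix>-NN and IPs increase by one per node,
    matching the naming scheme used in this guide.
    """
    start = ipaddress.IPv4Address(first_ip)
    lines = [f"[{group}]"]
    for i in range(count):
        host = f"{prefix}-{i + 1:02d}"
        # IPv4Address supports integer addition, which yields the next address
        lines.append(f"{host} ansible_host={start + i}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_inventory("llm_nodes", "pi5-llm", "192.168.50.21", 3))
```

Redirect the output to your inventory file and commit it alongside the playbooks so that node additions are reviewed like any other change.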
Thermal management: keep the cluster stable under inference load
LLM workloads are sustained and can trigger thermal throttling. Combine passive cooling, active control, and software limits.
Hardware mitigations
- Z-height aluminum heatsinks and a compact fan per Pi; align airflow across the rack/shelf.
- Use metal case plates to distribute heat when stacking.
- Ensure AI HAT+ 2 has its own thermal path (heatsink + airflow).
Software & control (example)
Use the built-in hwmon/sysfs and a small daemon to control a PWM fan and log temps. Sample systemd unit (concept):
[Unit]
Description=Simple PWM fan controller
After=local-fs.target
[Service]
Type=simple
ExecStart=/usr/local/bin/fan-control.py
Restart=always
[Install]
WantedBy=multi-user.target
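Your fan-control.py reads /sys/class/thermal/thermal_zone*/temp (reported in millidegrees Celsius), sets the fan PWM duty via GPIO, and exports Prometheus metrics from the same process. Keep thresholds conservative: start the fan at 55°C and reach full speed at 75°C. The Pi 5 firmware itself starts throttling the CPU at around 80°C, so apply any software frequency cap before that point and treat 85°C as a critical alert level rather than a tuning target.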
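The core of such a controller is the mapping from temperature to fan duty. Here is a minimal sketch of that logic, assuming a linear ramp between the 55°C and 75°C thresholds; MIN_DUTY and both function names are illustrative assumptions, and the actual PWM write is left out because it depends on your fan wiring and GPIO library.

```python
import glob

FAN_START_C = 55.0  # fan starts spinning at this temperature
FAN_MAX_C = 75.0    # fan runs at full duty from this temperature up
MIN_DUTY = 0.3      # assumption: lowest duty at which the fan reliably spins


def read_max_temp_c(pattern="/sys/class/thermal/thermal_zone*/temp"):
    """Return the hottest thermal zone in degrees Celsius, or None if unreadable."""
    temps = []
    for path in glob.glob(pattern):
        try:
            with open(path) as f:
                # sysfs reports millidegrees Celsius
                temps.append(int(f.read().strip()) / 1000.0)
        except (OSError, ValueError):
            continue
    return max(temps) if temps else None


def duty_for_temp(temp_c):
    """Map a temperature to a PWM duty cycle in the range 0.0 to 1.0."""
    if temp_c < FAN_START_C:
        return 0.0
    if temp_c >= FAN_MAX_C:
        return 1.0
    span = (temp_c - FAN_START_C) / (FAN_MAX_C - FAN_START_C)
    return MIN_DUTY + (1.0 - MIN_DUTY) * span
```

In a real daemon you would call read_max_temp_c() every few seconds, add a little hysteresis so the fan does not toggle around 55°C, and default to full duty whenever the reading is None.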
Swap and memory: zram + fallback swapfile
Pi5 + AI HAT+ 2 still has limited RAM compared to servers. Use zram for compressed RAM swap with a fast on-disk swapfile as fallback for big batch spikes.
Enable zram
sudo apt install -y zram-tools
# /etc/default/zramswap (example)
PERCENT=30 # percent of RAM to use for compressed swap
Fallback swapfile
sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
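# add to /etc/fstab so the swapfile survives reboots
/swapfile none swap sw 0 0
# tune swappiness now, then persist it (the file name under /etc/sysctl.d is arbitrary)
sudo sysctl vm.swappiness=10
echo 'vm.swappiness=10' | sudo tee /etc/sysctl.d/99-swappiness.conf
Keep vm.swappiness low (10-20) so the kernel prefers RAM and zram and touches the on-disk swapfile only under occasional pressure. Monitor with htop and vmstat.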
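For automated monitoring, swap pressure can be read straight from /proc/meminfo instead of watched in htop. A minimal sketch, assuming the standard "Key: value kB" layout of that file; the function names are illustrative.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def swap_used_fraction(meminfo):
    """Return the used share of total swap (zram plus swapfile) as 0.0 to 1.0."""
    total = meminfo.get("SwapTotal", 0)
    if total == 0:
        return 0.0  # no swap configured
    return (total - meminfo.get("SwapFree", 0)) / total


if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        info = parse_meminfo(f.read())
    print(f"swap used: {swap_used_fraction(info):.1%}")
```

The resulting fraction is a natural candidate for a Prometheus gauge and for the swap alert threshold.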
Model serving and two practical sharding strategies
There are two realistic paths for small Pi clusters in 2026: request-level sharding (replication + routing) and model-parallel sharding (layer-slicing). Choose based on throughput, latency targets, and model size.
Option A: Request-level sharding (recommended for reliability)
Run a full quantized model instance (GGUF/ggml) on each node and use a lightweight router to distribute requests. This is simple, resilient, and works well when models are small enough to fit per-node memory (many 4-bit quant models do).
- Pros: simpler, no network activation passing, easy scaling and rolling updates
- Cons: higher memory per node, less efficient for very large models
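Whether a model fits per-node memory can be estimated before anything is deployed. The sketch below uses the rule of thumb weights = parameters × bits per weight / 8, plus a flat allowance for KV cache and runtime. The 1 GiB overhead default and the 4.5 bits-per-weight figure in the example are assumptions; measure your own model before relying on them.

```python
GIB = 2**30


def estimate_model_gib(n_params, bits_per_weight, overhead_gib=1.0):
    """Rough resident size in GiB: quantized weights plus a flat overhead.

    Assumption: overhead_gib stands in for KV cache, activations, and runtime
    buffers; the real figure depends on context length and the runtime used.
    """
    weight_bytes = n_params * bits_per_weight / 8
    return weight_bytes / GIB + overhead_gib


def fits(n_params, bits_per_weight, budget_gib, overhead_gib=1.0):
    """True if the estimate stays within the per-node memory budget."""
    return estimate_model_gib(n_params, bits_per_weight, overhead_gib) <= budget_gib


if __name__ == "__main__":
    # Assumption: 4-bit formats often cost a bit more than 4 bits per weight
    # once per-block scales are included; 4.5 is used here as an example.
    for name, params in [("7B", 7e9), ("13B", 13e9)]:
        size = estimate_model_gib(params, 4.5)
        print(f"{name}: ~{size:.1f} GiB, fits in 6 GiB: {fits(params, 4.5, 6.0)}")
```

Treat the result as a first filter only; the memory limit you set on the container should come from measured resident memory under realistic prompts.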
Architecture blueprint:
- API Gateway (small x86 or Pi node) — routes to nodes via Nginx or Envoy
- Node model server: a container running a llama.cpp-based server (or another ggml/GGUF runtime) exposed on a local port
- Health checks + metrics exported (Prometheus node-exporter & model-exporter)
# sample Nginx upstream (simple round-robin)
upstream llm_pool {
    server 192.168.50.21:8080;
    server 192.168.50.22:8080;
    server 192.168.50.23:8080;
}
server {
    listen 80;
    location /v1/completions {
        proxy_pass http://llm_pool;
    }
}
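If you later replace Nginx with a custom gateway, the same routing policy is easy to express in code. This is a minimal sketch of round-robin selection that skips backends marked unhealthy; the class RoundRobinPool is illustrative, and the health-check loop that would call mark() is assumed to exist elsewhere.

```python
import itertools


class RoundRobinPool:
    """Round-robin backend selection that skips backends marked unhealthy."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._cycle = itertools.cycle(self.backends)

    def mark(self, backend, ok):
        """Record the latest health-check result for a backend."""
        if ok:
            self.healthy.add(backend)
        else:
            self.healthy.discard(backend)

    def next_backend(self):
        """Return the next healthy backend, trying each backend at most once."""
        for _ in range(len(self.backends)):
            candidate = next(self._cycle)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy backends available")


if __name__ == "__main__":
    pool = RoundRobinPool(["192.168.50.21:8080", "192.168.50.22:8080", "192.168.50.23:8080"])
    pool.mark("192.168.50.22:8080", False)  # simulate a failed health check
    print([pool.next_backend() for _ in range(4)])
```

The class is not thread-safe as written; guard next_backend() and mark() with a lock if the gateway handles requests concurrently.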
Option B: Model-parallel sharding (layer-slicing) — advanced
Distribute layers across nodes and pass activations over a fast RPC (gRPC/ZeroMQ). This is more complex but lets you serve models larger than a single node's memory.
- Pros: serve larger models with small nodes, potentially higher aggregate throughput
- Cons: latency overhead from activation passing, fragile network dependency, complex sync
High-level implementation steps (blueprint):
- Quantize model into layer partitions (layer 0..N split into contiguous blocks)
- On each node, run a lightweight inference engine that can execute only its layer slice (use a compiled ggml/libtorch backend that supports loading partial model weights)
- Implement an activation RPC: the forward pass endpoint accepts a tensor, runs local layers, returns next activation to the orchestrator
- Front-end aggregator coordinates the sequence and performs token decoding
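The first step, splitting layers into contiguous blocks, can be sketched as follows. It assumes every layer costs roughly the same memory, which ignores embedding and output layers; partition_layers is an illustrative name.

```python
def partition_layers(n_layers, n_nodes):
    """Split layers 0..n_layers-1 into contiguous half-open ranges, one per node.

    Assumption: all layers are about the same size, so block lengths differ by
    at most one. Earlier nodes receive the extra layers.
    """
    if n_nodes <= 0 or n_layers < n_nodes:
        raise ValueError("need at least one layer per node")
    base, extra = divmod(n_layers, n_nodes)
    ranges = []
    start = 0
    for node in range(n_nodes):
        size = base + (1 if node < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


if __name__ == "__main__":
    for node, (lo, hi) in enumerate(partition_layers(32, 3)):
        print(f"node {node}: layers {lo}..{hi - 1}")
```

Because activations flow through the nodes in order, the slowest block sets the pipeline pace, so rebalance the ranges if one node also hosts the embedding or decoding step.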
Use a binary RPC over a fast LAN (gRPC streaming or custom TCP with protobuf). Measure and optimize to reduce copy/serialization costs (use raw bytes for tensor transport). Expect to tune batch size and pipeline concurrency heavily.
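To illustrate the raw-bytes transport idea, here is a minimal framing sketch: a fixed header with the tensor shape followed by float32 values. The header layout is an assumption made for this example, not a standard wire format, and it relies on all nodes sharing the same byte order (true for Pi and x86 hosts, which are little-endian).

```python
import struct
from array import array

# Assumption: 2-D activations with a (rows, cols) header of two little-endian uint32s
HEADER = struct.Struct("<II")


def pack_activation(rows, cols, values):
    """Serialize a row-major float32 tensor as header + raw bytes."""
    if len(values) != rows * cols:
        raise ValueError("shape does not match number of values")
    # array('f') stores native-order float32; see the byte-order assumption above
    return HEADER.pack(rows, cols) + array("f", values).tobytes()


def unpack_activation(frame):
    """Inverse of pack_activation; returns (rows, cols, list_of_floats)."""
    rows, cols = HEADER.unpack_from(frame, 0)
    data = array("f")
    data.frombytes(frame[HEADER.size:])
    if len(data) != rows * cols:
        raise ValueError("payload length does not match header")
    return rows, cols, data.tolist()


if __name__ == "__main__":
    frame = pack_activation(2, 2, [1.0, 2.5, -3.0, 0.0])
    print(len(frame), unpack_activation(frame))
```

Compared with a text or JSON encoding, this keeps the payload at four bytes per value and avoids parsing; on a real link you would also prefix each frame with its length so the receiver knows where it ends.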
Orchestration: k3s + GitOps for reproducible deployments
For fleets, use k3s (lightweight Kubernetes) and Flux/ArgoCD for Git-driven deployments. k3s supports ARM and is sufficiently small for Pi clusters.
Install k3s (quick)
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.28.6+k3s1 sh -s - server --disable traefik
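# join agent nodes with the node token (read it on the server from /var/lib/rancher/k3s/server/node-token)
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.28.6+k3s1 K3S_URL=https://192.168.50.21:6443 K3S_TOKEN=XXX sh -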
Deploy model servers as k3s workloads
Use a Deployment + Service per model. Example Helm/manifest approach:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ggml-model
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ggml
  template:
    metadata:
      labels:
        app: ggml
    spec:
      containers:
        - name: ggml
          image: myregistry/ggml-arm64:latest
          ports:
            - containerPort: 8080
          resources:
            limits:
              memory: "6Gi"
              cpu: "1500m"
Remote update strategies (OTA) and CI/CD
Edge OTA must be safe: signed images, delta updates, staged rollouts, and rollback. For small teams you can combine Docker image signing + GitOps:
Recommended CI pipeline (GitHub Actions outline)
- On push to main: build multi-arch images using buildx (arm64) and push to registry
- Sign images with cosign; by default the signature is stored in the same registry, next to the image
- Create a new Kubernetes manifest commit (image tag bump) in an infra repo
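- Flux in the cluster notices the change and applies the manifests; enforce image signature verification at admission time (for example with Kyverno or the Sigstore policy-controller) so unsigned images are rejected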
# sample buildx (CI job)
docker buildx build --platform linux/arm64 --push -t myregistry/ggml-arm64:${{ github.sha }} .
cosign sign --key cosign.key myregistry/ggml-arm64:${{ github.sha }}
# commit the image tag to infra repo and push
Staged rollout and rollback
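Use Kubernetes Deployment rolling-update settings (maxSurge/maxUnavailable) together with readiness and liveness probes. A Deployment does not roll back by itself when probes fail; the rollout only stalls, so trigger the rollback explicitly (for example with kubectl rollout undo, or by reverting the image-tag commit in the infra repo and letting Flux reconcile). Keep a canary namespace for testing updates on a single node first.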
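The promote-or-rollback decision for a canary can be made explicit with a small gate that your pipeline evaluates after deploying to the canary node. This is a minimal sketch; the thresholds and the function name canary_verdict are assumptions for illustration, and the probe results would come from your own health checks.

```python
def canary_verdict(results, min_samples=20, max_failure_rate=0.05):
    """Decide what to do with a canary based on recent probe results.

    results: iterable of booleans, True for a passing health probe.
    Returns "wait" until enough samples exist, then "promote" or "rollback".
    """
    results = list(results)
    if len(results) < min_samples:
        return "wait"
    failures = sum(1 for ok in results if not ok)
    if failures / len(results) > max_failure_rate:
        return "rollback"
    return "promote"


if __name__ == "__main__":
    probes = [True] * 18 + [False] * 2
    print(canary_verdict(probes))
```

A "rollback" verdict would then map to the explicit rollback action for your setup, and "promote" to merging the image-tag bump for the remaining nodes.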
Security & networking best practices
- Use a private VLAN for the cluster and only expose the gateway/API to authorized networks.
- Mutual TLS between services — use cert-manager on k3s or pre-provisioned certificates.
- SSH via bastion host and SSH certs (short-lived) — avoid exposing SSH to the internet.
- Restrict container capabilities; run model containers as non-root users.
Monitoring, logging, and alerting
Collect the basics: node metrics, thermal readings, swap usage, and inference latency. Minimal stack:
- Prometheus + Grafana (node-exporter, custom model-exporter)
- Alertmanager for critical conditions (temp > 85°C, swap > threshold, service down)
- Centralized logs with Loki or a hosted alternative; keep logs compressed to reduce bandwidth
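Average latency hides the slow requests that users notice, so track percentiles as well. The following is a minimal nearest-rank percentile sketch for a window of latency samples; in production a Prometheus histogram would normally do this for you, and the function name percentile is illustrative.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty sample list (pct in (0, 100])."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # nearest-rank method: the smallest value with at least pct% of samples at or below it
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


if __name__ == "__main__":
    latencies_ms = [310, 420, 380, 590, 450, 330, 1200, 400]
    print("p50:", percentile(latencies_ms, 50), "p95:", percentile(latencies_ms, 95))
```

Alerting on p95 or p99 instead of the mean catches thermal throttling and swap thrash earlier, since both show up first in the tail.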
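Example: Ansible snippet to install zram, containerd, and the fan controller
- hosts: llm_nodes
  become: yes
  tasks:
    - name: install packages
      apt:
        name: ["zram-tools", "containerd", "prometheus-node-exporter"]
        state: present
        update_cache: yes
    - name: deploy fan control script
      copy:
        src: fan-control.py
        dest: /usr/local/bin/fan-control.py
        mode: '0755'
    # the unit file must be on the node before the service can be enabled
    - name: deploy fan control unit file
      copy:
        src: fan-control.service
        dest: /etc/systemd/system/fan-control.service
        mode: '0644'
    - name: enable fan service
      systemd:
        name: fan-control.service
        daemon_reload: yes
        enabled: yes
        state: started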
Operational tips & trade-offs
- Start with request-level sharding. It’s simple and less likely to fail in production.
- Measure memory: some quant models still don’t fit. Prefer 4-bit quant or trimmed LLMs for Pi nodes.
- Network latency matters — use wired Ethernet and a separate LAN for inference traffic.
- If you need true model-parallel, prototype in a lab with 2–3 nodes and a synthetic workload first.
- Expect to iterate on batch sizes, temperature settings, and timeouts to keep tail latencies acceptable.
Case study (short): 3-node prototype for on-prem chat assistant
In late 2025 a small fintech team built a private chat assistant for internal docs using three Pi5 nodes with AI HAT+ 2. They chose request-level sharding with a 4-bit quantized GGUF model. Outcomes:
- Average latency: 300–600ms for short prompts (single-turn)
- Operational cost: near-zero after one-off hardware purchase
- Lessons learned: aggressive thermal tuning and zram were essential; a single node failure was handled gracefully by round-robin routing
Deploy small, test often, and automate updates with GitOps — those three practices move a Pi cluster from proof-of-concept to dependable service.
Actionable takeaways
- Start small: provision 3 Pi5 nodes, enable zram, and run a quantized GGUF model per node.
- Protect thermals: add fans and monitor temps; cap CPU frequency proactively rather than letting firmware throttling derail latency.
- Use k3s + Flux: GitOps for safe, auditable OTA updates with signed images.
- Prefer request-level sharding first: implement model-parallel only if you must host models larger than a single node.
Next steps — checklist to deploy in a weekend
- Order 3 Pi5 + AI HAT+ 2 and NVMe or high-quality SD cards
- Build the golden image with SSH keys and zram enabled
- Deploy k3s and add nodes
- Deploy one quantized model as a Deployment and route with Nginx/Envoy
- Configure Prometheus alerts for temp & swap and wire them into your incident workflow
Closing: future-proofing and 2026 predictions
Edge LLMs will keep maturing in 2026: expect better 3-bit quant formats, faster ARM-specific inference kernels, and more robust model-sharding libraries. Investing in standardized orchestration (k3s + GitOps), signed CI artifacts, and thermal/swap reliability pays dividends as you scale from a hobby cluster to a production edge fleet. The AI HAT+ 2 makes local inference feasible — but architecture and operations decide whether a deployment is maintainable.
Call to action
If you want a ready-to-run repo with Ansible playbooks, k3s manifests, a sample GitHub Actions CI, and a tested fan-control script for Pi5 + AI HAT+ 2, drop your email or start a trial with our templates library at various.cloud/edge-llm — we’ll send the repo and a 30-minute onboarding checklist to get your first cluster live this weekend.