Deploying a Local LLM Cluster on Raspberry Pi 5: A Step-by-Step Guide
Practical guide to provisioning, thermal/swap tuning, model sharding, and OTA CI/CD for a Pi5 + AI HAT+ 2 LLM cluster.
Why build a Raspberry Pi 5 LLM cluster at the edge in 2026?
If you manage ML-enabled products, you’ve hit the same wall as many teams in 2025–26: cloud LLM costs and privacy constraints are rising, and delivering low-latency, private inference at the edge is suddenly practical thanks to better quantization and purpose-built accelerators like the AI HAT+ 2 for Raspberry Pi 5. This guide gives a practical, production-minded path to provisioning, networking, and orchestrating a small Pi5 cluster for local model serving — covering thermal and swap tuning, model sharding strategies, secure SSH/OTA workflows, and CI/CD templates you can run today.
What you’ll get — at a glance
- Hardware and topology recommendations for a 3–5 node Pi5 + AI HAT+ 2 cluster
- Provisioning steps (images, SSH, network, inventory)
- Thermal & swap tuning (zram, swapfile, fan control, monitoring)
- Two practical model-sharding options and sample orchestration patterns
- Secure remote update (OTA) and GitOps CI/CD blueprint
- Scripts, systemd units, and an Ansible/k3s template to get you running
Context & 2026 trends that matter
Late 2025 and early 2026 cemented a few trends relevant to edge LLM deployments:
- Wide adoption of GGUF/4-bit and 3-bit quantization — models that used to require big GPUs now run acceptably on ARM CPUs with acceleration from boards like AI HAT+ 2.
- Better ARM aarch64 toolchains and container images — multi-arch builds are standard in CI, enabling reproducible edge images.
- Edge orchestration gets pragmatic — lightweight Kubernetes (k3s) and GitOps (Flux/ArgoCD) are common for fleets of small nodes.
- Security-first OTA — signed delta updates and remote attestations matter as deployments move from lab to production.
Cluster topology & hardware checklist
Design for reliability and realistic throughput; a recommended small cluster:
- 3–5 x Raspberry Pi 5 (a 3-node cluster gives redundancy; 5 helps with throughput/replication)
- AI HAT+ 2 attached to each Pi5 for acceleration and offload
- Fast local storage: NVMe via compatible PCIe adapter or high-end UHS SD (avoid cheap SD cards)
- Gigabit LAN switch, dedicated /24 subnet for the cluster (or VLAN)
- Power with per-node UPS or a small rack UPS for graceful shutdowns
- Active cooling / heatsink stack per Pi5 + thermal monitoring
Provisioning: from image to SSH (practical steps)
Use a reproducible base image and automation. The steps below assume aarch64 Raspberry Pi OS or Ubuntu Server 24.04+ for Pi5.
1) Build a golden image
Start from the official Ubuntu/Raspbian aarch64 image. Customize with cloud-init or a first-boot Ansible pull.
# generate a partitioned image (example outline)
sudo apt update && sudo apt install -y qemu-user-static
# PiShrink is a standalone script (fetch it from its GitHub repo), not an apt package
# copy and mount image, chroot and install packages: docker/containerd, python3, openssh-server, node-exporter etc.
2) SSH and identity
Disable password auth and ensure SSH keys and host CAs are in place.
# on your admin workstation
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_ed25519
# push public keys into each user's ~/.ssh/authorized_keys via Ansible
# sshd_config changes
PasswordAuthentication no
PermitRootLogin no
ChallengeResponseAuthentication no
3) Inventory and naming
Use stable hostnames and static DHCP reservations or static IPs. Example naming: pi5-llm-01.local, -02, -03. Create an Ansible inventory (snippet):
[llm_nodes]
pi5-llm-01 ansible_host=192.168.50.21
pi5-llm-02 ansible_host=192.168.50.22
pi5-llm-03 ansible_host=192.168.50.23
Thermal management: keep the cluster stable under inference load
LLM workloads are sustained and can trigger thermal throttling. Combine passive cooling, active control, and software limits.
Hardware mitigations
- Full-height aluminum heatsinks and a compact fan per Pi; align airflow across the rack/shelf.
- Use metal case plates to distribute heat when stacking.
- Ensure AI HAT+ 2 has its own thermal path (heatsink + airflow).
Software & control (example)
Use the built-in hwmon/sysfs and a small daemon to control a PWM fan and log temps. Sample systemd unit (concept):
[Unit]
Description=Simple PWM fan controller
After=multi-user.target
[Service]
Type=simple
ExecStart=/usr/local/bin/fan-control.py
Restart=always
[Install]
WantedBy=multi-user.target
Your fan-control.py reads /sys/class/thermal/thermal_zone*/temp and sets PWM via GPIO; emit Prometheus metrics concurrently. Keep thresholds conservative: start fan at 55°C, max at 75°C, and gracefully reduce CPU frequency at 85°C.
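A minimal sketch of such a controller, using the thresholds above (fan off below 55°C, full speed at 75°C). The actual PWM write is hardware-specific (gpiozero, RPi.GPIO, or a sysfs PWM channel) and is stubbed out here:

```python
#!/usr/bin/env python3
"""Minimal fan-control sketch: maps CPU temperature to a PWM duty cycle.

Thresholds follow the text; the PWM output itself is hardware-specific
and left as a stub (set_pwm).
"""
import glob

FAN_ON_C = 55.0   # fan off below this temperature
FAN_MAX_C = 75.0  # fan at 100% at or above this temperature

def read_max_temp_c():
    """Highest temperature across all thermal zones, in degrees C."""
    temps = []
    for path in glob.glob("/sys/class/thermal/thermal_zone*/temp"):
        with open(path) as f:
            temps.append(int(f.read().strip()) / 1000.0)
    return max(temps) if temps else 0.0

def temp_to_duty(temp_c):
    """Linear ramp: 0% below FAN_ON_C, 100% at or above FAN_MAX_C."""
    if temp_c <= FAN_ON_C:
        return 0.0
    if temp_c >= FAN_MAX_C:
        return 100.0
    return 100.0 * (temp_c - FAN_ON_C) / (FAN_MAX_C - FAN_ON_C)

def control_step(set_pwm):
    """One control iteration: read temps, set fan duty."""
    set_pwm(temp_to_duty(read_max_temp_c()))

# The systemd service would loop: control_step(...) then sleep ~5s,
# and export the current temp/duty as Prometheus gauges.
```

The same daemon is a natural place to emit the Prometheus metrics mentioned above, so temperature and fan duty land in the same scrape.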
Swap and memory: zram + fallback swapfile
Pi5 + AI HAT+ 2 still has limited RAM compared to servers. Use zram for compressed RAM swap with a fast on-disk swapfile as fallback for big batch spikes.
Enable zram
sudo apt install -y zram-tools
# /etc/default/zramswap (example)
PERCENT=30 # percent of RAM to reserve for compressed swap (zram-tools)
Fallback swapfile
sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
# add to /etc/fstab
/swapfile none swap sw 0 0
# tune swappiness
sudo sysctl vm.swappiness=10
Keep vm.swappiness low (10-20) — prefer RAM and zram; use swap only during occasional pressure. Monitor with htop and vmstat.
Model serving and two practical sharding strategies
There are two realistic paths for small Pi clusters in 2026: request-level sharding (replication + routing) and model-parallel sharding (layer-slicing). Choose based on throughput, latency targets, and model size.
Option A: Request-level sharding (recommended for reliability)
Run a full quantized model instance (GGUF/ggml) on each node and use a lightweight router to distribute requests. This is simple, resilient, and works well when models are small enough to fit per-node memory (many 4-bit quant models do).
- Pros: simpler, no network activation passing, easy scaling and rolling updates
- Cons: higher memory per node, less efficient for very large models
Architecture blueprint:
- API Gateway (small x86 or Pi node) — routes to nodes via Nginx or Envoy
- Node model server — container running a llama.cpp or similar ggml-based server binary, exposed on a local port
- Health checks + metrics exported (Prometheus node-exporter & model-exporter)
# sample Nginx upstream (simple round-robin)
upstream llm_pool {
    server 192.168.50.21:8080;
    server 192.168.50.22:8080;
    server 192.168.50.23:8080;
}
server {
    listen 80;
    location /v1/completions {
        proxy_pass http://llm_pool;
    }
}
Option B: Model-parallel sharding (layer-slicing) — advanced
Distribute layers across nodes and pass activations over a fast RPC (gRPC/ZeroMQ). This is more complex but lets you serve models larger than a single node's memory.
- Pros: serve larger models with small nodes, potentially higher aggregate throughput
- Cons: latency overhead from activation passing, fragile network dependency, complex sync
High-level implementation steps (blueprint):
- Quantize model into layer partitions (layer 0..N split into contiguous blocks)
- On each node, run a lightweight inference engine that can execute only its layer slice (use a compiled ggml/libtorch backend that supports loading partial model weights)
- Implement an activation RPC: the forward pass endpoint accepts a tensor, runs local layers, returns next activation to the orchestrator
- Front-end aggregator coordinates the sequence and performs token decoding
Use a binary RPC over a fast LAN (gRPC streaming or custom TCP with protobuf). Measure and optimize to reduce copy/serialization costs (use raw bytes for tensor transport). Expect to tune batch size and pipeline concurrency heavily.
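The orchestration pattern in the steps above can be sketched in-process before committing to gRPC. In this toy version each shard is a local function and "activations" are plain lists of floats; the per-shard op is a stand-in for running that node's layer slice:

```python
"""Toy sketch of layer-sliced pipeline inference (Option B).

Each shard stands in for one node's layer slice; in production the
shard call becomes a streaming RPC carrying raw tensor bytes.
"""
from typing import Callable, List

Activation = List[float]

def make_shard(weights: List[float]) -> Callable[[Activation], Activation]:
    """One node's slice: applies its layers (here, a toy elementwise op)."""
    def forward(x: Activation) -> Activation:
        return [xi * w for xi, w in zip(x, weights)]
    return forward

def pipeline_forward(shards, x: Activation) -> Activation:
    """Orchestrator: pass the activation through each shard in order."""
    for shard in shards:
        x = shard(x)  # in production: RPC returning the next activation
    return x

shards = [make_shard([2.0, 2.0]), make_shard([0.5, 3.0])]
print(pipeline_forward(shards, [1.0, 1.0]))  # [1.0, 6.0]
```

The serial dependency is visible here: every token's forward pass crosses the network once per shard, which is why activation transport cost dominates tuning in Option B.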
Orchestration: k3s + GitOps for reproducible deployments
For fleets, use k3s (lightweight Kubernetes) and Flux/ArgoCD for Git-driven deployments. k3s supports ARM and is sufficiently small for Pi clusters.
Install k3s (quick)
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=v1.28.6+k3s1 sh -s - server --disable traefik
# join agent nodes with node token
curl -sfL https://get.k3s.io | K3S_URL=https://192.168.50.21:6443 K3S_TOKEN=XXX sh -
Deploy model servers as k3s workloads
Use a Deployment + Service per model. Example Helm/manifest approach:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ggml-model
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ggml
  template:
    metadata:
      labels:
        app: ggml
    spec:
      containers:
      - name: ggml
        image: myregistry/ggml-arm64:latest
        ports:
        - containerPort: 8080
        resources:
          limits:
            memory: "6Gi"
            cpu: "1500m"
Remote update strategies (OTA) and CI/CD
Edge OTA must be safe: signed images, delta updates, staged rollouts, and rollback. For small teams you can combine Docker image signing + GitOps:
Recommended CI pipeline (GitHub Actions outline)
- On push to main: build multi-arch images using buildx (arm64) and push to registry
- Sign images with cosign and push signatures to a registry or key server
- Create a new Kubernetes manifest commit (image tag bump) in an infra repo
- Flux in the cluster notices the change, validates signatures, and applies manifests
# sample buildx (CI job)
docker buildx build --platform linux/arm64 --push -t myregistry/ggml-arm64:${{ github.sha }} .
cosign sign --key cosign.key myregistry/ggml-arm64:${{ github.sha }}
# commit the image tag to infra repo and push
Staged rollout and rollback
Use k3s Deployment strategies (maxSurge/maxUnavailable) and health checks. Implement an automatic rollback if liveness probes fail post-deploy. Keep a canary namespace for testing updates on a single node first.
Security & networking best practices
- Use a private VLAN for the cluster and only expose the gateway/API to authorized networks.
- Mutual TLS between services — use cert-manager on k3s or pre-provisioned certificates.
- SSH via bastion host and SSH certs (short-lived) — avoid exposing SSH to the internet.
- Restrict container capabilities; run model containers as non-root users.
Monitoring, logging, and alerting
Collect the basics: node metrics, thermal readings, swap usage, and inference latency. Minimal stack:
- Prometheus + Grafana (node-exporter, custom model-exporter)
- Alertmanager for critical conditions (temp > 85°C, swap > threshold, service down)
- Centralized logs with Loki or a hosted alternative; keep logs compressed to reduce bandwidth
Example: Ansible snippet to install zram, Docker, and the fan controller
- hosts: llm_nodes
  become: yes
  tasks:
    - name: install packages
      apt:
        name: ["zram-tools", "containerd", "prometheus-node-exporter"]
        state: present
    - name: deploy fan control script
      copy:
        src: fan-control.py
        dest: /usr/local/bin/fan-control.py
        mode: '0755'
    - name: enable fan service
      systemd:
        name: fan-control.service
        enabled: yes
        state: started
Operational tips & trade-offs
- Start with request-level sharding. It’s simple and less likely to fail in production.
- Measure memory: some quant models still don’t fit. Prefer 4-bit quant or trimmed LLMs for Pi nodes.
- Network latency matters — use wired Ethernet and a separate LAN for inference traffic.
- If you need true model-parallel, prototype in a lab with 2–3 nodes and a synthetic workload first.
- Expect to iterate on batch sizes, temperature settings, and timeouts to keep tail latencies acceptable.
Case study (short): 3-node prototype for on-prem chat assistant
In late 2025 a small fintech team built a private chat assistant for internal docs using three Pi5 nodes with AI HAT+ 2. They chose request-level sharding with a 4-bit quantized GGUF model. Outcomes:
- Average latency: 300–600ms for short prompts (single-turn)
- Operational cost: near-zero after one-off hardware purchase
- Lessons learned: aggressive thermal tuning and zram were essential; one node failure gracefully handled by round-robin routing
Deploy small, test often, and automate updates with GitOps — those three practices move a Pi cluster from proof-of-concept to dependable service.
Actionable takeaways
- Start small: provision 3 Pi5 nodes, enable zram, and run a quantized GGUF model per node.
- Protect thermals: add fans and monitor temps; throttle workloads proactively rather than letting firmware thermal throttling derail latency.
- Use k3s + Flux: GitOps for safe, auditable OTA updates with signed images.
- Prefer request-level sharding first: implement model-parallel only if you must host models larger than a single node.
Next steps — checklist to deploy in a weekend
- Order 3 Pi5 + AI HAT+ 2 and NVMe or high-quality SD cards
- Build the golden image with SSH keys and zram enabled
- Deploy k3s and add nodes
- Deploy one quantized model as a Deployment and route with Nginx/Envoy
- Configure Prometheus alerts for temperature and swap, and onboard them into your incident workflow
Closing: future-proofing and 2026 predictions
Edge LLMs will keep maturing in 2026: expect better 3-bit quant formats, faster ARM-specific inference kernels, and more robust model-sharding libraries. Investing in standardized orchestration (k3s + GitOps), signed CI artifacts, and thermal/swap reliability pays dividends as you scale from a hobby cluster to a production edge fleet. The AI HAT+ 2 makes local inference feasible — but architecture and operations decide whether a deployment is maintainable.
Call to action
If you want a ready-to-run repo with Ansible playbooks, k3s manifests, a sample GitHub Actions CI, and a tested fan-control script for Pi5 + AI HAT+ 2, drop your email or start a trial with our templates library at various.cloud/edge-llm — we’ll send the repo and a 30-minute onboarding checklist to get your first cluster live this weekend.