CI/CD Pipeline for TinyML: Continuous Delivery to Raspberry Pi 5 with AI HAT+ 2

2026-02-28

Template and tutorial for TinyML CI/CD: package, cross-compile, test, and safely canary-deploy LLMs to Raspberry Pi 5 with AI HAT+ 2.

Ship tiny-but-powerful LLMs to Raspberry Pi 5 + AI HAT+ 2: a production CI/CD template

You can run real-world LLM inference on fleets of Raspberry Pi 5 devices with the AI HAT+ 2, but getting build, test, and deployment right is painful: cross-compilation, model packaging, constrained-memory failures, and rollbacks across hundreds of physical devices. This guide gives you a ready-to-use CI/CD template and tutorial (2026) for packaging quantized models, cross-compiling runtimes, running automated tests, and performing safe canary rollouts and automated rollbacks for Pi5 edge LLMs.

Why this matters in 2026

Edge LLM adoption accelerated through 2024–2025 as quantized models and optimized runtimes (ggml/llama.cpp variants, MLC LLM, and vendor NPUs) matured. The AI HAT+ 2, released in late 2025, unlocked practical generative workloads on the Raspberry Pi 5 range by exposing hardware acceleration and more RAM headroom for INT8/FP16 inference. That made TinyML CI/CD an operations-first problem: edge fleets are physical devices with firmware, kernel modules, and storage constraints, not ephemeral containers in the cloud.

Expectations for 2026: security-by-design (SLSA), artifact signing, OCI model registries, and automated safety gates will be baseline in production pipelines. This article is for developers and infra leads who need a repeatable, secure pipeline for packaging edge LLMs and delivering them safely to Pi5 devices.

High-level pipeline (inverted pyramid)

  1. Produce artifacts: quantized model files + tokenizer + runtime binaries (arm64 optimized).
  2. Test artifacts: unit, integration (emulated), and hardware smoke tests.
  3. Store & sign: push to artifact registry (OCI or S3) and cosign signatures.
  4. Deliver: push container/image or firmware bundle to device-management platform (Mender, balena, etc.).
  5. Rollout: staged canary rollout with automated metrics-based rollback.

Architecture & components

  • Build host: CI runners (GitHub Actions / GitLab) with Docker buildx and QEMU for cross builds.
  • Model conversion: quantizer + tokenizer bundling (ggml, ONNX->ggml, or optimized MLC pipelines).
  • Runtime: llama.cpp, custom C++ runtime, or containerized Python runtime built for arm64 with NEON acceleration.
  • Artifact storage: OCI registry (GitHub Packages, Harbor) or object store (S3) with content-addressable names.
  • Device management: Mender / balenaCloud / Fleet API for orchestrating deployments and groups.
  • Monitoring: Prometheus metrics (latency, memory, error rate) and alert rules for rollback triggers.
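The monitoring-driven rollback triggers can be wired up as Prometheus alerting rules. A sketch, reusing the infer_latency_ms and infer_errors_total metric names that appear in the rollback script later in this guide; rule names, thresholds, and labels are illustrative:

```yaml
groups:
  - name: canary-rollback
    rules:
      # Average canary latency over 10m exceeds the 600ms budget
      - alert: CanaryLatencyBreach
        expr: avg_over_time(infer_latency_ms{group="canary"}[10m]) > 600
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Canary inference latency above 600ms; pause rollout"
      # Average canary error rate over 10m exceeds the 2% budget
      - alert: CanaryErrorRateBreach
        expr: avg_over_time(infer_errors_total{group="canary"}[10m]) > 0.02
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Canary error rate above 2%; roll back"
```

Route these alerts to the pipeline (webhook or Alertmanager receiver) so a breach pauses promotion automatically rather than paging a human first.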

Model packaging: deterministic, minimal, and auditable

Edge LLM bundles must be small and self-describing. Use a zip/tarball or OCI artifact with this structure:

<model-name>/
  ├─ model/          (quantized .ggml / .bin files)
  ├─ tokenizer/      (vocab.json, merges)
  ├─ runtime/        (compiled binary or wheel for arm64)
  ├─ metadata.json   (version, quantization, input_size, checksums)
  ├─ provenance.json (training/model-card, SLSA provenance if available)
  └─ signature.cosign

Best practices:

  • Include clear metadata: quantization format (INT8/INT4), expected memory footprint, and a checksum per file.
  • Keep tokenizers bundled to avoid runtime download or incompatible tokenizer versions.
  • Sign artifacts using cosign and attach provenance as part of SLSA supply chain assertions.
  • Use semantic versioning and content-addressable artifact names: modelname@sha256:...

Cross-compilation & runtime builds for Pi5

Pi5 is ARM64. Build strategies in CI:

  1. Container multi-arch builds (recommended): Use Docker buildx with QEMU to produce arm64 images from x86 CI workers.
  2. Native cross-compile: Use an aarch64 cross toolchain (gcc/clang) with CFLAGS tuned for NEON/FPU, or GOARCH=arm64 for Go components.
  3. Binary packaging: Compile runtime as static binary to reduce runtime dependencies; strip symbols for smaller size.

Dockerfile pattern (multi-stage)

FROM --platform=$BUILDPLATFORM ubuntu:24.04 AS builder
# Cross toolchain for aarch64; this stage runs natively on the CI architecture
RUN apt-get update && apt-get install -y build-essential cmake git \
    gcc-aarch64-linux-gnu g++-aarch64-linux-gnu
WORKDIR /src
# Checkout runtime and build for target
COPY . .
RUN --mount=type=cache,target=/root/.cache \
  mkdir -p build && cd build && \
  cmake .. -DCMAKE_SYSTEM_NAME=Linux -DCMAKE_SYSTEM_PROCESSOR=aarch64 \
    -DCMAKE_C_COMPILER=aarch64-linux-gnu-gcc \
    -DCMAKE_CXX_COMPILER=aarch64-linux-gnu-g++ && \
  make -j"$(nproc)"

FROM --platform=$TARGETPLATFORM debian:bookworm-slim
COPY --from=builder /src/build/my_runtime /usr/local/bin/my_runtime
ENTRYPOINT ["/usr/local/bin/my_runtime"]

Then invoke buildx in CI: docker buildx build --platform linux/arm64,linux/amd64 -t $IMAGE --push .

Automated tests for edge LLMs

Tests are the difference between a controlled canary and a fleet-wide incident.

Test categories

  • Unit tests: tokenizer correctness, config parsing, small deterministic outputs.
  • Integration (emulated): run runtime in QEMU or arm64 container and do a smoke infer on a tiny prompt. Verify token counts and no memory OOM.
  • Hardware smoke tests: on a Pi5 device or a small pool of self-hosted runners: full inference to detect driver/firmware mismatches on AI HAT+ 2.
  • Performance/SLO checks: latency percentiles and max memory. Fail if > threshold.
  • Security scans: dependency CVE checks and cosign verification of inputs.

Example: a CI job that runs an inference using the quantized artifact in an arm64 container and asserts latency < 400ms and RSS < 1.2GB.
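That assertion step can be sketched as a small harness: measure wall-clock latency, capture the child's peak RSS, and gate on the thresholds. SMOKE_CMD is a placeholder for the real inference invocation (e.g. your runtime binary with a tiny prompt), and the GNU time dependency is an assumption:

```shell
#!/usr/bin/env bash
# Smoke-test sketch: latency and memory gates around a stand-in workload.
set -euo pipefail

MAX_LATENCY_MS=400
MAX_RSS_KB=$((1200 * 1024))          # 1.2 GB budget from the artifact metadata
SMOKE_CMD=${SMOKE_CMD:-"sleep 0.1"}  # placeholder for the real inference command

start_ms=$(date +%s%3N)
if command -v /usr/bin/time >/dev/null; then
  # GNU time -v reports the child's peak RSS ("Maximum resident set size", kB)
  /usr/bin/time -v -o /tmp/smoke.time bash -c "$SMOKE_CMD"
  rss_kb=$(awk -F': ' '/Maximum resident set size/ {print $2}' /tmp/smoke.time)
else
  bash -c "$SMOKE_CMD"
  rss_kb=0   # GNU time unavailable: skip the memory gate
fi
end_ms=$(date +%s%3N)
latency_ms=$((end_ms - start_ms))

echo "latency=${latency_ms}ms peak_rss=${rss_kb}kB"
[ "$latency_ms" -lt "$MAX_LATENCY_MS" ] || { echo "FAIL: latency SLO"; exit 1; }
[ "$rss_kb" -lt "$MAX_RSS_KB" ] || { echo "FAIL: memory SLO"; exit 1; }
echo "smoke test PASS"
```

A nonzero exit here fails the CI job, which is exactly the behavior the canary gate needs.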

Artifact storage & signing

For reproducibility, store both model artifacts and the runtime in an OCI registry when possible. Use ORAS to push model bundles as OCI artifacts so your deployment system can pull them consistently.

Sign with cosign and store signatures in the registry. In 2026, policy engines (conftest + SLSA attestations) are commonly enforced in CI/CD gates.

# push using ORAS
oras push $OCI_REGISTRY/model:1.2.0 model.tar.gz --manifest-config metadata.json:application/vnd.acme.model.config.v1+json
# sign
cosign sign --key cosign.key $OCI_REGISTRY/model:1.2.0
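On the pull side, a gate should refuse any bundle whose signature or digest fails to verify. Full signature checking needs the registry and public key (cosign verify --key cosign.pub <ref>); the digest comparison alone is a few lines of shell. A sketch with a stand-in artifact, where in production the pinned digest comes from signed metadata or the OCI manifest rather than being recomputed locally:

```shell
#!/usr/bin/env bash
# Digest-gate sketch: compare a bundle's sha256 against a pinned value before activation.
set -euo pipefail

printf 'demo-bundle\n' > model.tar.gz   # stand-in artifact for this sketch
# In production this pinned digest comes from metadata.json / the registry manifest.
pinned="sha256:$(sha256sum model.tar.gz | awk '{print $1}')"

actual="sha256:$(sha256sum model.tar.gz | awk '{print $1}')"
if [ "$actual" != "$pinned" ]; then
  echo "digest mismatch: refusing activation" >&2
  exit 1
fi
echo "digest OK: $actual"
```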

CI/CD example: GitHub Actions pipeline

This YAML shows key jobs: build, test, sign, push, and trigger canary deployment via Mender API. Adapt to GitLab CI and others.

name: Build-Test-Deploy-Pi5

on:
  push:
    tags: ['v*']
  workflow_dispatch:

env:
  IMAGE: ghcr.io/org/edge-llm
  MODEL_NAME: tiny-llm-quant
  OCI_REGISTRY: ghcr.io/org   # used by the ORAS model push below

jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    steps:
      - uses: actions/checkout@v4
      - name: Set up QEMU
        uses: docker/setup-qemu-action@v2
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v2
      - name: Login to GHCR
        uses: docker/login-action@v2
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Build multi-arch image and push
        run: |
          docker buildx build --platform linux/amd64,linux/arm64 \
            -t $IMAGE/$MODEL_NAME:${{ github.ref_name }} \
            --push .
  test-emulated:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up QEMU
        uses: docker/setup-qemu-action@v2
      - name: Pull arm64 image and run smoke test
        run: |
          docker run --rm --platform linux/arm64 $IMAGE/$MODEL_NAME:${{ github.ref_name }} \
            --smoke-test --max-latency-ms 400
  sign-and-publish:
    needs: test-emulated
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install cosign
        run: curl -sSL "https://github.com/sigstore/cosign/releases/latest/download/cosign-linux-amd64" -o /usr/local/bin/cosign && chmod +x /usr/local/bin/cosign
      - name: Sign image
        env:
          COSIGN_PASSWORD: ${{ secrets.COSIGN_PASSWORD }}
        run: |
          # Restore the private key from a repository secret; never commit cosign.key
          echo "${{ secrets.COSIGN_KEY }}" > cosign.key
          cosign sign --key cosign.key $IMAGE/$MODEL_NAME:${{ github.ref_name }}
      - name: Push model to OCI (optional)
        run: oras push $OCI_REGISTRY/$MODEL_NAME:${{ github.ref_name }} model.tar.gz --manifest-config metadata.json:application/vnd.acme.model.config.v1+json
  canary-deploy:
    needs: sign-and-publish
    runs-on: ubuntu-latest
    steps:
      - name: Trigger Mender deployment (canary group)
        env:
          MENDER_TOKEN: ${{ secrets.MENDER_TOKEN }}
        run: |
          curl -s -X POST https://hosted.mender.io/api/management/v1/deployments \
            -H "Authorization: Bearer $MENDER_TOKEN" \
            -H 'Content-Type: application/json' \
            -d "{\"artifact_name\":\"$IMAGE/$MODEL_NAME:${{ github.ref_name }}\",\"name\":\"canary-${{ github.ref_name }}\",\"devices\":\"group:canary-pi5\"}"

Canary rollout and automated rollback

Canaries for Pi fleets are device-group based (e.g., 5–10 physical Pi5 units). A safe canary pipeline includes:

  • Start with a small device group labeled "canary"
  • Deploy artifact and run hardware smoke tests via device management (sanity probe runs)
  • Monitor latency, memory, inference error rate and system-level logs for OOM or kernel issues
  • Promote to larger group gradually if metrics remain within threshold
  • Automatic rollback if metrics breach thresholds

Example rollback automation (pseudo-script)

#!/bin/bash
set -euo pipefail
THRESHOLD_LATENCY_MS=600
THRESHOLD_ERROR_RATE=0.02
DEPLOYMENT_ID=${1:?usage: $0 <deployment-id>}
: "${MENDER_TOKEN:?MENDER_TOKEN must be set in the environment}"
# Query Prometheus for the canary group's average latency and error rate over the last 10m
avg_latency=$(curl -s 'http://prometheus/api/v1/query?query=avg_over_time(infer_latency_ms{group="canary"}[10m])' | jq -r '.data.result[0].value[1]')
error_rate=$(curl -s 'http://prometheus/api/v1/query?query=avg_over_time(infer_errors_total{group="canary"}[10m])' | jq -r '.data.result[0].value[1]')
if (( $(echo "$avg_latency > $THRESHOLD_LATENCY_MS" | bc -l) )) || \
   (( $(echo "$error_rate > $THRESHOLD_ERROR_RATE" | bc -l) )); then
  echo "Threshold breached: initiating rollback"
  # Trigger rollback via the Mender API (exact endpoint shape varies by Mender version)
  curl -s -X POST \
    -H "Authorization: Bearer $MENDER_TOKEN" \
    -d '{"deployment_id":"'"$DEPLOYMENT_ID"'","force":true}' \
    https://hosted.mender.io/api/management/v1/deployments/rollback
fi

Integrate this script in a scheduled GitHub Actions job or in the device manager's health-check hooks to enforce automated rollbacks.

Device-level safety: health probes and feature flags

  • Run a local health-probe on Pi5 that checks process alive, memory headroom, and inference latency. The device should refuse activation of a new model if health checks fail.
  • Use feature flags or model selectors so a device can run multiple models but only activate one at a time.
  • Keep a lightweight supervisor (systemd or container supervisor) that can auto-restart the runtime and report state to the fleet manager.
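The probe described above can be sketched as a small script. The process name, headroom threshold, and check set are placeholders to adapt; a latency probe (timing one short inference) would slot in as a third check:

```shell
#!/usr/bin/env bash
# Device-side health-probe sketch: process liveness + memory headroom.
set -euo pipefail

RUNTIME_PROC=${RUNTIME_PROC:-my_runtime}   # placeholder runtime process name
MIN_FREE_MB=${MIN_FREE_MB:-256}            # required memory headroom

fail() { echo "UNHEALTHY: $*" >&2; exit 1; }

# 1) Runtime process alive? Warn only: during activation the new runtime
#    may not have started yet.
if command -v pgrep >/dev/null && ! pgrep -x "$RUNTIME_PROC" >/dev/null; then
  echo "WARN: $RUNTIME_PROC not running"
fi

# 2) Memory headroom from the kernel (MemAvailable is reported in kB).
free_mb=$(( $(awk '/MemAvailable/ {print $2}' /proc/meminfo) / 1024 ))
[ "$free_mb" -ge "$MIN_FREE_MB" ] || fail "only ${free_mb}MB of memory available"

echo "HEALTHY: ${free_mb}MB headroom"
```

A nonzero exit tells the supervisor or fleet manager to refuse activation of the new model.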

Operational checklist before deployment

  • Signed artifact present in OCI / object store.
  • Metadata with declared memory + expected max tokens.
  • Unit & emulated integration tests passed.
  • Hardware smoke tests against a small Pi5 pool with AI HAT+ 2 passed.
  • Canary group defined, monitoring & alerts wired to the pipeline.
  • Rollback automation and image retention policy configured.

Advanced strategies & 2026 predictions

Expect the following trends through 2026:

  • OCI model registries: standardization around ORAS & OCI model manifests for provenance and signatures.
  • SLSA enforcement in CI: model provenance and attestations will be mandatory for regulated deployments.
  • On-device runtime orchestration: edge containers with tiny orchestrators (k3s-lite or balena supervisors) will allow A/B runtime experiments.
  • Hardware-aware quantization: quantization toolchains that embed AI HAT+ 2 acceleration parameters into metadata to guide runtime selection.

Adopting these now future-proofs your pipeline and reduces friction when new device NPUs or quant formats appear.

Real-world example (case study sketch)

Company X shipped an on-device summarization agent to a network of 500 Pi5 kiosks. They used this pipeline:

  1. Quantized model to INT8 using a custom ggml pipeline and packaged as OCI artifact.
  2. Built and signed the arm64 runtime with Docker buildx and cosign.
  3. Ran emulated latency tests in CI and hardware smoke tests on 10 Pi5 canaries with AI HAT+ 2.
  4. Rolled out to 10% of devices per day, with a Prometheus rule that automatically paused rollout if P95 latency increased >20%.

Result: stable rollout in 4 days and one automatic rollback during a nightly kernel update that revealed a driver mismatch — rollback prevented a fleet-wide failure.

Common pitfalls and how to avoid them

  • Missing tokenizer: always bundle tokenizer to avoid subtle output mismatches.
  • Undetected memory growth: include RSS and OOM probes; run long-running soak tests on at least one canary device.
  • Unsigned artifacts: enforce cosign signature verification in device managers before activation.
  • No device labels: tag devices with hardware/firmware versions to avoid bad rollouts across incompatible units.

Quick reference: commands & tools

  • Buildx + QEMU: docker buildx build --platform linux/arm64 --push
  • Model push (ORAS): oras push registry/model:tag model.tar.gz --manifest-config metadata.json
  • Sign (cosign): cosign sign --key cosign.key registry/model:tag
  • Device management: Mender / balenaCloud APIs for group deployments
  • Monitoring: Prometheus + Grafana with SLO-based alerts for rollback

Actionable takeaways

  • Design artifacts as self-contained OCI bundles with metadata and cosign signatures.
  • Use Docker buildx + QEMU in CI to produce arm64 images without Pi hardware in the loop.
  • Run emulated tests in CI plus minimal hardware smoke tests on a canary pool of Pi5s with AI HAT+ 2 before promotion.
  • Automate canary rollouts tied to Prometheus SLOs and implement automatic rollback scripts.
  • Enforce provenance (SLSA) and artifact signature verification on the device before activation.

Next steps & template repo

Take this article and map it to a Git repo with these components:

  • ci/workflows/build-and-deploy.yml (GitHub Actions example above)
  • scripts/pack-model.sh (creates model.tar.gz and metadata)
  • scripts/health-probe.sh (device-level)
  • deployment/mender-template.json (canary group deployment API template)
  • tests/smoke-inference (emulated inference test harness)

Final thought (2026)

Edge LLMs on Pi5 with AI HAT+ 2 are production-capable if you treat the device like physical infrastructure: build reproducible artifacts, sign and attest provenance, run emulated+hardware tests, and adopt staged canaries with automated rollbacks. The pattern in this guide reflects what production teams adopted across late 2025 into 2026 — plug these building blocks into your pipeline and you’ll reduce deployment risk and accelerate delivery to the edge.

“In 2026, the differentiator in edge AI isn’t model size — it’s delivery discipline.”

Call to action: Clone the starter CI/CD template (build + test + canary) from our repo, adapt the GitHub Actions example to your registry, and run the emulated smoke tests today. Need help tailoring the pipeline to your fleet size or choosing a device manager? Contact our engineering team for a free audit of one Pi5 canary deployment plan.
