AI platform and reliability engineering for production AI systems

Jonathan Wrede

AI Platform & Reliability Engineer

I help teams turn AI prototypes into production-ready systems with evals, observability, CI/CD, cost and latency monitoring, and deployment gates on Kubernetes. In production I work with 100,000+ IoT devices; my public AI infrastructure projects include aipreflight, llmprobe, and tokentoll.

Current focus: turning notebooks, RAG demos, and LLM prototypes into measurable, debuggable, and releasable AI systems.

Contact | AI Audit | CV | GitHub | LinkedIn

aipreflight

Preflight gates for evals, cost, latency, and observability.

llm-bench

Live benchmark for TTFT, latency, errors, and throughput.

tokentoll

AI WORKLOAD

K8s

OPERATE

SLOs

Measurable Impact

100k+ IoT thermostats monitored

~400 customer sites with active battery prediction

95+ dbt models in the analytics warehouse

61-99% memory reduction in thesis experiments

Proven Production Stack

Ingest

Python / IoT APIs / TimescaleDB

Transform

dbt Core / Polars / SQL

Orchestrate

Dagster / Docker / Cron

Deploy

Kubernetes / Helm / ArgoCD

Observe

Grafana / Prometheus

Selected Project Experience

AI INFRA / OBSERVABILITY

aipreflight inference profile

SLA gates / Prometheus / vLLM

Operator workflow for LLM inference deployments: external probes, SLA gates, Prometheus/vLLM correlation, concurrency sweeps, Grafana inspection, runbooks, tests, and CI.

Python|Go|Prometheus|Grafana|vLLM

BUILDING TECH / IoT

IoT Analytics Platform

ELT / dbt / Dagster / ArgoCD

Central analytics warehouse for 100,000+ IoT thermostats, from raw device telemetry to tested dbt models, Dagster orchestration, and Grafana dashboards on Kubernetes.

dbt|Dagster|Grafana|ArgoCD|K8s

Production ML / IoT

Battery Intelligence System

Python Library / Algorithms / Fleet

ML-powered capacity and runtime prediction for 100,000+ IoT devices, shipped as a production Python library with confidence intervals and per-hardware defaults.

Python|Polars|pydantic|Dagster|Docker

AI SECURITY

Secure Neural Network Inference

M.Sc. Thesis · In Progress · 2026

Memory profiling and optimization across 4 SNNI systems on BERT and ViT. Current thesis experiments show 61-99% memory reduction, plus analytical models and a deployability framework.

Python|PyTorch|HE / CKKS|MPC

In progress

Open Source AI Infrastructure

Merged contributions to production-grade AI/data infrastructure projects, plus applied tools for LLM operations: deploy gates, endpoint health, latency, and cost.

GitHub PRs

LLM serving

Security and reliability hardening for the llm-d router.

llm-d/llm-d-router #960

Observability

Passed OpenAI tool definitions through to OpenTelemetry GenAI telemetry.

opentelemetry-python-contrib #4554

Feature stores

Feast fixes for the BigQuery, DynamoDB, and Trino offline stores.

#6362 #6366 #6365 #6381 #6360

Applied AI platform tools: concrete operator workflows for the problems that actually hurt when operating AI features, namely eval gates, deploy gates, endpoint health, latency, and cost.

aipreflight

CI/CD readiness gate for AI apps and LLM endpoints: evals, cost budgets, external probes, SLA gates, Prometheus/vLLM correlation, and runbooks.

llmprobe

Go CLI for synthetic monitoring and CI smoke tests of LLM inference endpoints. Measures TTFT, latency, throughput, and errors.

tokentoll

GitHub Action and CLI for LLM cost diffs in code review: static analysis, pricing database, PR comments, and MCP server.

Experience

Jan 2025 - Present

Machine Learning Engineer | Vilisto

Battery prediction algorithms, IoT data platform (dbt + Dagster + ArgoCD), fleet monitoring, and internal FastAPI ops tools for 100,000+ devices.

Jan 2023 - Dec 2024

Data Platform Consultant | Saracus

Cloud data warehouses (Snowflake / BigQuery / Azure Synapse), Python ETL pipelines, and Streamlit apps with OAuth SSO for enterprise clients.

Sep 2021 - Oct 2022

Working Student, Data Scientist | Vilisto

Presence-detection ML models (Python, scikit-learn), PostgreSQL/TimescaleDB pipelines, a Vue.js frontend, and an internal labeling tool.

Education

M.Sc. Computer Science, AI / ML / Statistics

University of Münster | Thesis: Optimizing memory footprints for secure neural network inference in transformers

Latest writing

View all

AI INFRA / OBSERVABILITY aipreflight inference profile

Context

LLM inference deployments need a reliable readiness signal from the client side. Internal server metrics show what happens inside the engine process, but they do not decide whether an endpoint should receive traffic from the user's perspective. aipreflight's inference profile combines external probes with Prometheus/vLLM metrics and turns them into a deployment verdict with an exit code.
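A minimal sketch of what a client-side probe measures, assuming the endpoint streams its response as an iterable of text chunks. The names `probe_stream` and `ProbeResult` and the injected `clock` parameter are hypothetical and chosen for illustration; they are not taken from aipreflight or llmprobe.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class ProbeResult:
    ttft_s: Optional[float]  # time to first token; None if nothing arrived
    total_s: float           # wall-clock time for the whole stream
    chunks: int              # number of non-empty chunks received


def probe_stream(
    stream: Iterable[str],
    clock: Callable[[], float] = time.monotonic,
) -> ProbeResult:
    """Time a streamed response from the client's point of view."""
    start = clock()
    ttft = None
    count = 0
    for chunk in stream:
        if not chunk:
            continue  # ignore empty keep-alive chunks
        if ttft is None:
            ttft = clock() - start  # first real content marks TTFT
        count += 1
    return ProbeResult(ttft_s=ttft, total_s=clock() - start, chunks=count)
```

Injecting the clock keeps the measurement logic testable without a live endpoint; in a real probe the `stream` argument would be the response iterator of an OpenAI-compatible streaming call.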

Readiness Workflow

[Workflow diagram: gate.sh, OpenAI API, Prometheus, Report, Exit code. Labels: SLA thresholds, TTFT / TPOT, vLLM metrics, CI ready]

Implementation

Readiness gate with SLA thresholds for TTFT, latency, error rate, and throughput. The exit code makes the result usable in CI, release jobs, or manual rollout checks.
Diagnosis mode correlates external probe data with Prometheus/vLLM metrics. That shows whether latency originates in the model server or outside it, for example in the gateway, proxy, network, or routing.
Concurrency sweeps create reproducible run directories per load level, including a Markdown report, a JSON result, and the Prometheus window for later analysis.
The Grafana dashboard and runbooks remain inspection tools. The primary interface is intentionally the verdict: route traffic, investigate, or scale.
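The gate and diagnosis steps above can be sketched as a small pure function. This is an illustration under stated assumptions, not aipreflight's actual code: the type names, the exit codes 0/1/2, and the rule that a server-side latency share of at least 80% means "scale" are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    max_ttft_s: float
    max_latency_s: float
    max_error_rate: float
    min_throughput_tps: float


@dataclass(frozen=True)
class Probe:
    ttft_s: float
    latency_s: float
    error_rate: float
    throughput_tps: float


def breaches(probe: Probe, sla: Thresholds) -> list[str]:
    """Return the names of all SLA thresholds the probe violates."""
    failed = []
    if probe.ttft_s > sla.max_ttft_s:
        failed.append("ttft")
    if probe.latency_s > sla.max_latency_s:
        failed.append("latency")
    if probe.error_rate > sla.max_error_rate:
        failed.append("error_rate")
    if probe.throughput_tps < sla.min_throughput_tps:
        failed.append("throughput")
    return failed


def verdict(probe: Probe, sla: Thresholds, server_latency_s: float,
            server_share: float = 0.8) -> tuple[str, int]:
    """Map probe results to (verdict, exit code).

    Assumption: if the model server accounts for at least `server_share`
    of the externally observed latency, the engine is the bottleneck
    ("scale"); otherwise the problem sits in front of it ("investigate").
    """
    if not breaches(probe, sla):
        return "route traffic", 0
    if probe.latency_s > 0 and server_latency_s / probe.latency_s >= server_share:
        return "scale", 2
    return "investigate", 1
```

A CI job would pass the returned code to `sys.exit`, so a non-zero result blocks the rollout while the verdict string tells the operator where to look first.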

Design Decision

The project is not a dashboard-first demo. It is an operator workflow: a small, testable interface between LLM endpoint, Prometheus, and deployment decision. That is the kind of infrastructure work that matters for AI platform roles.

Python Go Prometheus Grafana vLLM OpenAI-compatible APIs CI

GitHub | Technical case study

BUILDING TECH / IoT IoT Analytics Platform

Context

ELT platform for IoT telemetry from 100,000+ thermostats. Core problem: raw data lives in a transactional time-series database that is unsuitable for analytical queries. The goal was an analytics warehouse fully separate from production, with tested transformations and traceable data lineage.

Architecture

[Architecture diagram: IoT Sources, EL Job (PII-safe), Timescale raw, dbt (Models + Tests), Dagster, Grafana. Platform: Kubernetes + Helm, ArgoCD, Sqitch migrations]

dbt Model Layers (95+ models)

Staging

stg_valve_movement_daily

stg_last_updates_per_day

stg_battery_fleet_daily

stg_thermostat_hierarchy

+ 4 more…

Mart

device_states_per_day

battery_algorithm_eval

online_ratios_by_firmware

hardware_fw_distributions

+ many more…

Implementation

ELT over ETL: data is loaded raw and transformed inside the warehouse. PII is stripped in the ingestion layer before data reaches the analytics schema. Dagster assets run 12-hourly and daily, with automatic backfill support.
Two-layer dbt architecture: staging models map 1:1 to raw sources (casting, renaming, and deduplication only) and carry no business logic. Mart models own aggregations and joins. Chunked upserts handle historical backfills without memory issues.
Sqitch owns the raw schema (the EL target) and dbt owns its own analytics schema, so the two tools never conflict over migrations. ArgoCD GitOps: the staging app auto-syncs to the main branch; the prod app is promoted manually.
Dagster software-defined assets with an explicit dependency graph: orchestration is derived from declared asset dependencies, not manually wired DAGs. This enables automatic detection of stale assets when schemas change.
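One way to picture the chunked backfills mentioned above is a generator that splits a long date range into bounded windows, so that each upsert only touches one window at a time. This is a sketch under the assumption that backfills are partitioned by day; `backfill_windows` and `chunk_days` are hypothetical names, not part of the platform described here.

```python
from datetime import date, timedelta
from typing import Iterator


def backfill_windows(
    start: date, end: date, chunk_days: int
) -> Iterator[tuple[date, date]]:
    """Yield half-open [lo, hi) date windows that together cover [start, end)."""
    if chunk_days < 1:
        raise ValueError("chunk_days must be >= 1")
    lo = start
    while lo < end:
        # Clamp the last window so it never runs past the requested end date.
        hi = min(lo + timedelta(days=chunk_days), end)
        yield lo, hi
        lo = hi
```

Each window would then drive a single bounded upsert, for example a PostgreSQL `INSERT ... ON CONFLICT ... DO UPDATE` restricted to that date range, which keeps memory use proportional to the chunk size instead of the full history.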

Design Decisions

ELT over ETL allows re-running transformations without re-extracting. The strict staging/mart split prevents raw data structures from leaking into application code. Sqitch and dbt own separate schemas so that dbt migrations never touch raw data tables.

dbt Dagster PostgreSQL / TimescaleDB Grafana Prometheus ArgoCD Kubernetes Helm Python Sqitch

Production ML / IoT Battery Intelligence System

Context

IoT thermostats actuate radiator valves via a motor. Valve movement step counts and terminal voltage are the only signals available for battery estimation; there is no direct current sensor. The challenge: accurate capacity and runtime estimation from indirect, noisy signals across ~400 deployments with different hardware generations.

System Architecture

[System architecture diagram: Valve + Voltage, datavil library (Algorithms), thermostat-supervisor, PostgreSQL, 100k+ devices. datavil core algorithms: Capacity Estimation, Faulty Day Detection, p5 / p95 Intervals, Runtime Prediction. Delivery: GitLab CI/CD, private PyPI · v0.10.3, Docker]

Implementation

Valve movement as energy proxy: each motor step has a known mAh cost per hardware version. Coulomb conversion accumulates these step costs over time. Devices without measured capacity fall back to per-hardware-version defaults.
Faulty day detection uses two threshold types: voltage drop and activity anomaly. Single-day dips (window < 6h) receive a lighter penalty than sustained multi-day segments, to avoid over-penalizing brief sensor noise.
Confidence intervals: the empirical distribution of measured capacities per hardware version forms a prior. Missing days add a capacity penalty term. p5/p95 are derived from this distribution; the output is a runtime interval, not a point estimate.
Library design: pure functions only, with no DB access or side effects inside datavil. Polars DataFrames with pandera schemas at function boundaries enforce correct column types before any computation. beartype adds runtime checks on top of pyright static analysis.
Release pipeline: commitizen parses conventional commit messages to auto-bump semver (fix maps to patch, feat to minor, a breaking change to major), updates CHANGELOG.md, sets a git tag, and triggers the GitLab CI publish job to the private PyPI registry.
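The energy-proxy and interval ideas above can be sketched with the standard library alone. Everything specific here is an assumption for illustration: the step costs in `STEP_COST_MAH` are made-up values, the function names are hypothetical, and the linear extrapolation at the average observed daily draw is a simplification, not datavil's actual algorithm.

```python
from statistics import quantiles

# Hypothetical per-step cost in mAh, keyed by hardware version (made-up values).
STEP_COST_MAH = {"hw-a": 0.002, "hw-b": 0.003}


def consumed_mah(daily_steps: list[int], hardware: str) -> float:
    """Accumulate motor-step costs into the total charge drawn so far."""
    return sum(daily_steps) * STEP_COST_MAH[hardware]


def capacity_interval(measured_mah: list[float]) -> tuple[float, float]:
    """p5 / p95 of the empirical capacity distribution for one hardware version."""
    cuts = quantiles(measured_mah, n=20, method="inclusive")  # cut points 5% apart
    return cuts[0], cuts[-1]


def runtime_interval_days(
    daily_steps: list[int],
    hardware: str,
    measured_mah: list[float],
    missing_days: int = 0,
    penalty_mah_per_day: float = 0.0,
) -> tuple[float, float]:
    """Remaining runtime as a (low, high) interval in days, not a point estimate."""
    if not daily_steps:
        raise ValueError("need at least one day of step counts")
    used = consumed_mah(daily_steps, hardware)
    per_day = used / len(daily_steps)
    if per_day == 0:
        return float("inf"), float("inf")  # no observed draw, no finite estimate
    penalty = missing_days * penalty_mah_per_day  # missing days reduce capacity
    lo_cap, hi_cap = capacity_interval(measured_mah)
    lo_left = max(lo_cap - used - penalty, 0.0)
    hi_left = max(hi_cap - used - penalty, 0.0)
    return lo_left / per_day, hi_left / per_day
```

Because every function takes plain values and returns plain values, the whole chain can be exercised in unit tests without a database, which matches the pure-function design described in the list above.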

Design Decisions

Separating the algorithm library (datavil) from the scheduler (thermostat-supervisor) keeps domain logic testable without a database. Pure functions with strict schema validation mean algorithm bugs surface in unit tests, not in production cron runs. Shipping the library as a versioned PyPI package lets other services consume the same algorithms without duplicating code.

Python Polars pydantic beartype pandera Docker GitLab CI/CD uv PostgreSQL

AI SECURITY / RESEARCH Secure Neural Network Inference

M.Sc. thesis in progress, University of Münster · Expected completion October 2026

Problem

Secure Neural Network Inference (SNNI) enables private inference: neither the client data nor the server model is revealed. Memory requirements of tens to hundreds of GB make SNNI deployments practically infeasible. Prior evaluations largely ignore memory.

SNNI Protocol

[Protocol diagram: Client (private input), HE / MPC, Server (private model), Encrypted result]

4 SNNI Systems · 4 Cryptographic Paradigms

HE / CKKS

MPC

Hybrid HE+MPC

SHARK

BERT-base ViT-Base/16 SNNI-Benchmark repo

Contributions

HE/CKKS homomorphically encrypts the client input: the server evaluates the network on ciphertext without decrypting, which has a high compute cost but requires no communication during inference. MPC (secret sharing) distributes input and weights across parties; base operations are cheaper, but communication overhead is high and proportional to activation sizes.
Memory profiling of the 4 systems on BERT-base and ViT-Base/16: layer-level RSS tracking identifies where allocations occur. Stock implementations peak at 3.3 GB to over 192 GB, largely due to buffers pre-allocated for the full inference pass rather than released layer by layer.
Waste factors quantify how much more memory a system allocates than the theoretical minimum. Analytical models derive the waste factor from paradigm properties (ciphertext expansion, communication buffers, protocol overhead), so it can be predicted without profiling.
Optimizations (layer fusion, in-place operations, incremental buffer release) reduce peak memory by 61-99% and generalize across paradigms and model architectures. The deployability framework maps hardware memory budgets to feasible system/model combinations.
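The waste-factor and deployability ideas above reduce to two small computations. The sketch below assumes a waste factor defined as peak memory divided by the theoretical minimum, and a feasibility check that compares peak memory against a hardware budget; the `Profile` type, the function names, and any numbers passed in are hypothetical placeholders, not thesis results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    system: str        # placeholder identifier for an SNNI system
    model: str         # placeholder identifier for a model architecture
    peak_gb: float     # measured or predicted peak memory
    minimum_gb: float  # theoretical minimum for the same computation


def waste_factor(p: Profile) -> float:
    """How many times more memory is allocated than theoretically needed."""
    return p.peak_gb / p.minimum_gb


def feasible(profiles: list[Profile], budget_gb: float) -> list[tuple[str, str]]:
    """System/model combinations whose peak memory fits the hardware budget."""
    return sorted((p.system, p.model) for p in profiles if p.peak_gb <= budget_gb)
```

Feeding `feasible` with peak values from an analytical model instead of from profiling runs would give a deployability answer before any system is actually executed on the target hardware.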

Memory Reduction (schematic)

Stock: 3.3 GB to 192 GB+

Optimized: 61-99% reduction

Python PyTorch HE / CKKS MPC BERT ViT Transformers