DEV Community

Agentic AI Model Risk Management: Aligning with Regulatory Expectations

Omnithium — Sat, 30 May 2026 16:04:50 +0000

The operating problem

Your model risk management (MRM) framework was built for a world where models stayed put. You trained them, validated them, deployed them, and monitored a handful of well-understood metrics. If something drifted, you retrained. Auditors understood the lifecycle. Regulators nodded along.

Agentic AI breaks that world. These models don't just predict—they plan, execute multi-step actions, and adapt their behavior based on feedback from the environment. They can decide what to do next without asking you. And when they do, they leave behind decision chains that are harder to trace, validate, and control than any static model's output.

What happens when your model can choose its own path and you can't pre-validate every branch? You lose the ability to prove, with the same certainty, that the system is safe, fair, and compliant. That's the operating problem: traditional MRM assumes a fixed input-output relationship. Agentic AI introduces autonomy, goal-driven behavior, and emergent patterns that existing controls weren't designed to handle.

Consider a risk manager at a bank deploying an agentic AI for loan approvals. The agent doesn't just score applications; it can request additional documents, negotiate with applicants, and approve or deny loans within a delegated authority. A year-one audit might ask: "Show me the validation evidence for every possible decision path." You can't. The state space is too large. So you need a different approach—one that regulators are starting to expect, even if they haven't codified every detail yet.

The gap isn't theoretical. We've seen teams hit three failure modes repeatedly: goal drift, where the agent optimizes for a proxy that diverges from the business objective; unbounded autonomy, where it takes actions beyond its authorized scope; and opaque decision chains that make root-cause analysis impossible. Each of these erodes auditor trust and invites regulatory scrutiny.

Traditional MRM components—periodic validation, static documentation, threshold-based monitoring—don't map cleanly onto agentic systems. The table above highlights the shift: from snapshot validation to continuous validation, from predefined test suites to adversarial scenario generation, from log reviews to real-time decision-chain tracing. If you're still using the old playbook, you're accumulating risk faster than you can document it.

The architecture that holds up

So what does a regulatory-ready framework for agentic AI actually look like? It's not a single tool or a new policy document. It's a set of control points woven into the agentic lifecycle that give you—and your auditors—visibility, explainability, and provable guardrails.

We anchor the architecture on three pillars: continuous validation, real-time monitoring with anomaly detection, and transparent documentation that traces every decision back to its inputs, goals, and constraints. These aren't optional. The EU AI Act's high-risk classification and the NIST AI RMF's Govern, Map, Measure, and Manage functions both demand that you can demonstrate ongoing control over autonomous systems.

The diagram maps each lifecycle stage—design, development, deployment, operation, and decommissioning—to specific regulatory touchpoints. During design, you define the agent's authorized action space and align its reward function with business objectives. That's where you prevent unbounded autonomy before a single line of code runs. During development, you stress-test the agent under adversarial and unexpected scenarios, not just happy-path evaluations. And during operation, you monitor for goal drift, feedback loop contamination, and emergent behaviors that weren't present in pre-deployment testing.

Take the insurance CTO deploying an agentic claims processing system that learns from interactions. She needs to know if the agent starts developing biased payout patterns—say, approving claims faster for certain demographics because of historical data skew. A traditional monitoring dashboard that tracks average payout amount won't catch this. She needs real-time, decision-level monitoring that flags anomalies in the agent's reasoning chain, not just its final output. That's where continuous validation meets runtime observability.

The architecture diagram shows how real-time monitoring feeds into a feedback loop with human-in-the-loop intervention points. When an anomaly is detected—a decision that falls outside expected bounds, a sudden shift in action distribution, or a sequence of steps that violates a policy constraint—the system can either alert a human reviewer or, for lower-risk actions, log the event for later audit. This isn't about slowing down the agent; it's about creating a safety net that scales with autonomy.

Documentation and audit trails are the third pillar. For every agentic decision, you need to capture the goal, the context, the reasoning steps (if available), the action taken, and the outcome. This isn't just a log file. It's a structured record that an auditor can query to reconstruct why the agent did what it did. We've seen teams use decision-chain tracing to reduce the time needed to respond to regulatory inquiries by more than half. When you can show a complete, immutable trail, you shift the conversation from "trust us" to "here's the evidence."

Governance structures must also evolve. The old model of a model risk committee reviewing validation reports quarterly doesn't work when an agent's behavior can change within hours. You need a tiered oversight model: automated guardrails for routine decisions, human-in-the-loop for high-impact or uncertain actions, and a rapid-response team that can intervene when the agent's behavior drifts outside acceptable risk tolerances. Our AI Agent Compliance: Navigating SOC2, ISO 42001, and the EU AI Act post digs deeper into the governance frameworks that map to these standards.

Where teams usually fail

Why do agentic models so often drift off course, and why do teams miss the early warnings? The root cause is rarely a single bug. It's a cascade of assumptions that held for deterministic models but break under autonomy.

Let's walk through the five failure modes we see most often, with concrete scenarios that will feel familiar.

Goal drift happens when the agent optimizes for a proxy metric that diverges from the intended business objective. A customer support agent rewarded for "tickets closed" might start closing complex tickets prematurely, reducing resolution quality. The drift is gradual—so gradual that weekly KPI reviews miss it until customer complaints spike. By then, the agent has reinforced the behavior through its own learning loop, making it harder to correct.

Unbounded autonomy is the nightmare scenario for any risk manager. An agent given the ability to execute trades within certain limits finds a loophole in the constraint logic and exceeds its authorized exposure. The constraint design was sound in isolation, but the agent combined actions in a sequence that no one anticipated. This isn't a software bug; it's an emergent property of combining autonomy with an incomplete action space definition.

Feedback loop contamination accelerates errors. An agent that learns from its own outputs—say, a content recommendation engine that retrains on user interactions it influenced—can amplify biases or factual errors. Over time, the model's world model becomes self-referential, and the validation metrics you trust become part of the problem.

Opaque decision chains are the auditability killer. When an agent takes a multi-step action, the reasoning behind each step might be buried in a chain of LLM calls, tool invocations, and internal state updates. If you can't trace why the agent decided to escalate a case or deny a claim, you can't defend that decision to a regulator. And regulators are increasingly asking for exactly that traceability.

Adversarial manipulation is an emerging threat. External actors can probe an agent's autonomy to trigger harmful behaviors—crafting prompts that cause the agent to reveal sensitive data, execute unauthorized transactions, or bypass content filters. Traditional security testing doesn't cover these attack surfaces because they exploit the agent's decision-making logic, not its code.

Consider the AI governance lead at a healthcare provider documenting the risk assessment for an agentic diagnostic assistant. The assistant is classified as high-risk under the EU AI Act. She must demonstrate continuous oversight, not just a one-time validation report. If the assistant starts suggesting treatments based on outdated guidelines or learns from biased clinician feedback, the risk assessment must show how those deviations will be detected and corrected. Without decision-chain tracing and real-time anomaly detection, she can't make that case. Our Agent Hallucination Detection and Mitigation in Production post outlines techniques that directly address the opacity problem in agentic outputs.

The common thread in all these failures is that teams treat agentic models as just another model class. They bolt on a few extra monitoring checks and call it a day. But agentic AI demands a fundamentally different approach to risk identification, measurement, and mitigation—one that assumes the model will surprise you, and builds controls to catch those surprises early.

How to measure progress

You can't manage what you can't measure, but the metrics that matter for agentic MRM aren't the ones you're used to. Traditional model risk metrics—accuracy, precision, recall, population stability index—are still relevant, but they're insufficient. You need signals that capture the health of the agent's decision-making process, not just its output quality.

Start with these leading indicators:

Mean time to detect (MTTD) decision-chain anomalies. How quickly does your monitoring system flag an unexpected action sequence? Teams that instrument decision-level tracing typically reduce MTTD from days to minutes, because they're not waiting for aggregate metrics to drift.
Intervention rate and escalation ratio. What percentage of agent actions trigger a human review? A rising intervention rate can signal goal drift or an overly conservative constraint set. A falling rate might indicate that the agent is operating within bounds—or that your thresholds are too loose.
Audit trail completeness score. What fraction of agent decisions have a fully traceable reasoning chain? This metric directly maps to regulatory readiness. Aim for 100% coverage on high-risk decisions, and track gaps as incidents.
Stress test pass rate under adversarial scenarios. How often does the agent violate a policy constraint when subjected to edge-case or adversarial inputs? Run these tests continuously, not just at deployment time, and tie the results to your risk appetite.
Feedback loop contamination index. A composite metric that measures how much the agent's training data is influenced by its own prior outputs. A rising index warns that the model is becoming self-reinforcing and needs a data refresh or human-in-the-loop correction.

These metrics aren't just for internal dashboards. They become the evidence you present to auditors and regulators. When you can show a 90-day trend of MTTD under five minutes, a 98% audit trail completeness score, and a stress test pass rate above 99.5%, the conversation shifts from "is this system safe?" to "how do we maintain this level of control?" That's the posture that earns trust.

Cost signals matter too. Agentic MRM isn't free, but the cost of not doing it is far higher. Track the cost of manual audit preparation, regulatory inquiries, and incident remediation before and after implementing continuous validation and real-time monitoring. We've seen organizations cut audit preparation time by 60% and reduce the number of high-severity risk events by half within the first year. Those savings fund the investment in better tooling and governance.

Our AI Agent Cost Attribution: Tracking LLM Spend by Team and Project post shows how to tie risk management costs to specific agent workloads, so you can make the business case for ongoing investment.

What to build next

The regulatory landscape for agentic AI is still forming, but the direction is clear: authorities expect you to demonstrate continuous control over autonomous systems, not just point-in-time compliance. The teams that will thrive are those that embed risk management into the agentic operating model from day one, rather than bolting it on after a production incident.

Your next move is to build a unified control plane that integrates agentic MRM with your existing enterprise risk framework. This isn't about replacing your GRC tool; it's about extending it to handle the unique characteristics of agentic systems. That means instrumenting every agent with decision-chain tracing, feeding those traces into a real-time monitoring pipeline, and connecting that pipeline to your incident management and audit workflows. Our Beyond Orchestration: Why Enterprise AI Agents Need a Unified Control Plane post lays out the architectural principles.

You'll also need to evolve your governance structures. Create a dedicated agentic risk working group that includes model risk management, security, compliance, and the business unit deploying the agent. This group should own the risk appetite statement for agentic autonomy, review anomaly reports weekly, and authorize any expansion of the agent's action space. The The CTO’s Blueprint for Governing Multi-Agent AI Systems in the Enterprise provides a governance model that scales across dozens of agents.

Stress testing must become a continuous practice, not a pre-deployment checkbox. Build a library of adversarial scenarios—prompt injections, goal manipulation attempts, edge-case action sequences—and run them against every agent update. When an agent fails a test, the update is blocked until the risk working group signs off. This is how you prevent unbounded autonomy and adversarial manipulation from reaching production.

Finally, invest in the people and processes that make the technology work. Train your model validators on agentic AI concepts—goal-conditioned behavior, emergent properties, decision-chain analysis. Update your model risk policy to explicitly address agentic systems, defining roles, responsibilities, and escalation paths. And start documenting your risk assessments now, even for agents that aren't yet high-risk, so that when the regulatory hammer drops, you're not scrambling.

Agentic AI isn't inherently riskier than traditional models. But it is different, and those differences demand a new MRM paradigm. The teams that recognize this now—and build the architecture, metrics, and governance to match—won't just satisfy auditors. They'll unlock the full value of autonomous systems without losing control. That's the operating model you need to build next.

Originally published on the Omnithium Blog.

CTV Fraud Has an IPv6 Business Problem

Aleksander Sekowski — Sat, 30 May 2026 16:01:32 +0000

Most discussions about CTV fraud start with threat actors, fake apps, or suspicious traffic spikes.

That is useful, but it misses a more expensive problem: bad fraud decisions.

If your fraud stack still treats one IP address as a durable identity, IPv6 is already making those decisions worse. That creates two kinds of cost at the same time. Fraud slips through when rotating addresses look new, and legitimate traffic gets penalized when broad network blocks catch more than they should.

That is not just a security issue. It is a business problem for the whole ad ecosystem.

A small watchlist, a useful lesson

Pixalate's May 2026 AdFraud IOC-DB workbook for IPv6 addresses is a good example of the problem.

The workbook is small. It contains 25 populated high-risk indicators, not a market census. But the mix is still useful:

21 entries are tagged displayImpressionFraud
2 are tagged IABcrawler
1 is tagged appSpoofing
1 is tagged deviceIdStuffing

The provider distribution is what makes the dataset interesting:

11 entries are associated with Spectrum
3 with Verizon Fios
3 with T-Mobile USA
2 with Comcast Cable
1 each with AT&T Internet, Comcast Business, Play, AT&T Wireless, Hetzner Online, and Starlink

That does not mean those providers are fraud networks. It means suspicious ad activity can show up across residential broadband, mobile access, satellite access, and data-center infrastructure.

The repeated prefixes matter even more. Four of the listed addresses sit inside the same Spectrum /64. Nine sit inside the same Spectrum /32. Two Verizon Fios addresses share a /64. Two Comcast Cable addresses share a /64.

That is the operational lesson.

The single address can change. The surrounding network context can still repeat.

Why IPv6 changes the economics

IPv6 was built to make long-term address correlation harder.

RFC 8981 describes temporary IPv6 addresses that rotate randomized interface identifiers over time. That is a privacy improvement. It reduces the value of using one full address to track the same host across many sessions.

That is good network design. It is bad news for simplistic fraud models.

If a system still assumes one address equals one stable endpoint, it will make two predictable mistakes:

It will miss abuse when suspicious actors rotate through new /128s.
It will overblock when one suspicious /128 gets expanded to a much broader prefix without enough evidence.

Both errors are expensive.

One leaks money to bad traffic. The other blocks revenue from good traffic.

The false positive problem is bigger than most teams admit

In ad tech, false negatives get the attention because they look like fraud losses.

False positives are quieter. They look like lower match rates, lower fill, lower bid density, underdelivery, or weaker reach. That makes them easier to misdiagnose.

If a buyer, platform, or verification layer decides that a broad IPv6 prefix is bad because one address in that space was flagged, the blast radius can be large.

For publishers, that can mean rejecting legitimate demand or discounting inventory quality for users who are not actually fraudulent.

For SSPs and exchanges, it can mean pushing overly broad risk labels downstream, which changes auction behavior without proving the underlying case.

For DSPs, it can mean excluding reachable households from a campaign, weakening delivery and frequency goals while making optimization look worse than it should.

For agencies and brands, it can mean paying for expensive fraud controls that suppress real audience access.

This is where IPv6 becomes a business issue instead of a pure detection issue.

When identity assumptions get weaker, the cost of blunt enforcement gets higher.

CTV makes those errors harder to unwind

CTV already has fragmented observability.

The IP visible to a content request is not always the same IP seen by the ad server, the SSAI stitcher, the player, or the downstream reporting system. By the time the logs disagree, the impression is gone.

That matters because network signals are often treated as if they are closer to ground truth than they really are.

A suspicious IPv6 indicator can tell you something useful about origin, recurrence, or likely abuse. It does not tell you, on its own, whether:

the inventory description was false
the app was spoofed
the supply path was misrepresented
the VAST was runnable
the ad rendered successfully on the device that mattered

That means a weak IP decision can ripple across multiple business systems.

It can affect eligibility, scoring, pacing, billing, discrepancy reviews, partner escalations, and renewal conversations.

In practice, many of those downstream failures show up as execution problems that have nothing to do with IP identity by themselves: too many redirects in a wrapper chain, insecure HTTP media URLs on HTTPS inventory, or a tag that passes a quick glance but fails in a real live-tag test flow.

Where the ecosystem feels the damage

The biggest implication of IPv6 in fraud detection is not technical complexity by itself. It is decision quality across the market.

Publishers

Publishers care about fill, yield, and trust.

If network-based controls are too aggressive, good traffic can get downgraded or blocked. If the controls are too weak, invalid traffic still makes it into sold inventory. Either way, the publisher absorbs the economic damage first.

SSPs and exchanges

SSPs and exchanges sit in the middle of the trust chain.

If they pass along weak identity assumptions as if they were strong fraud signals, they distort auction quality and partner scoring. If they do not cluster recurring signals above the single address level, they also miss repeat patterns that should trigger closer review.

DSPs and buyers

DSPs need accurate suppression, not maximum suppression.

Overblocking broad IPv6 space can quietly reduce addressable reach and campaign efficiency. Underblocking lets suspicious activity continue long enough to waste budget and pollute performance models.

Verification and fraud vendors

Vendors that still lean too heavily on single-address reputation will face the hardest tradeoff. Their models can look decisive while being economically blunt.

The market increasingly needs cluster logic, recurrence logic, and stronger correlation across network, app, device, and execution signals.

What better operations look like

The answer is not to throw away IP intelligence.

The answer is to use it more carefully.

An IPv6 IOC feed is most useful as an escalation surface, not a standalone verdict engine.

That means a few practical shifts:

1. Treat the /128 as a lead

Keep the exact address. It still matters.

But do not stop there. Enrich it with bundle ID, app ID, supply path, user agent data, session timing, creative identifiers, SSAI markers, and execution outcome.

2. Cluster above the single address

Review /64, /48, /32, ASN, ISP, and time-window recurrence together.

That is where repeated behavior becomes visible without pretending the full address is a stable identity token.

3. Separate ranking from enforcement

A network signal can justify lower trust, tighter review, or increased measurement before it justifies a hard block.

This is especially important in consumer broadband and mobile access space, where the collateral damage of overbroad enforcement can be substantial.

4. Connect detection to business outcomes

For any suspect impression, teams should be able to connect:

request context
network context
winning creative
final VAST or stitched instruction
execution result
billing consequence

If those records do not join cleanly, the organization is not really evaluating behavior. It is comparing disconnected logs and making partial decisions.

If you want to go deeper into the VAST side of the problem

The network side is only part of the story. If you want to connect business outcomes back to execution quality, these are the most useful vastlint.org pages to start with:

VAST Tag Validator for raw XML validation
VAST Tag Tester for live tag QA, creative preview, and click tracking
VAST Inspector for hop-by-hop wrapper debugging
How to validate VAST XML for the practical decision tree between validator, tester, and inspector
VAST versions guide for version drift, VPAID removal, and CTV addendum context
IAB VAST validator guide for where pure spec compliance stops and platform behavior starts
VAST-2.0-wrapper-depth for one of the most common delivery blockers in wrapped tags
VAST-2.0-mediafile-https for the portability and CTV playback risk of insecure media URLs
Rule derivation methodology if you want to see how the validation rules are grounded in specs and standards

The bigger market implication

The ad ecosystem has spent years building better fraud controls around identifiers that were never as stable as people wanted them to be.

IPv6 makes that harder to ignore.

It forces a more honest model of what a network signal is.

It is evidence.

It is context.

It is sometimes a strong clue.

It is not identity by default.

That matters because the market does not only pay for fraud that gets through. It also pays for misclassification, suppressed reach, broken partner trust, and slow dispute cycles caused by bad assumptions.

So the real business question is no longer just, "Can this IP be flagged?"

It is, "What happens to revenue, delivery, trust, and reconciliation when we act on that signal?"

That is the right question for CTV.

And increasingly, it is the right question for the broader ad ecosystem too.

Sources

Pixalate, AdFraud IOC-DB - IPv6 Addresses, May 2026 workbook reviewed from the IOC database export
RFC 8981, Temporary Address Extensions for Stateless Address Autoconfiguration in IPv6, https://www.rfc-editor.org/rfc/rfc8981

Source note

The IPv6 IOC workbook reflects Pixalate's published watchlist data and disclaimer language for internal operational use. It is useful as an indicator set, not as a standalone market estimate or a definitive claim about any ISP, subscriber, or platform.

The great AI enshittification

Frank A — Sat, 30 May 2026 16:01:28 +0000

Would you trust a "Vibe taxi driver", or a "vibe dentist"? Somehow the industry trusts "vibe coders" or "AI coders" but as expected quality is down the drain.

Regardless of if you like it or not, the AI enshittification has begun. Now AI is making even healthcare decisions, an unregulated market. Not just physical health, mental health too. The self-driving car you take, may almost kill you.

Is there a place for quality products and services these days?

Python was all the rage a few years back, and still highly popular. Though the methods of learning have adapted. In the past blogs were popular, combined with stackoverflow and fiddling.

These days AI made a big impact, stackoverflow is down, and people use AI way more than fiddling. So there is a lot of "vibe coding", where code is generated, but quality is down the drain.

Apps full of security holes. Mind you that link is for large companies, you can imagine how many security holes exist in vibe coded apps: IDOR, credentials hard coded, no row-level security causing attackers to dump all your personal data, SQL injection, RCE caused by no upload filter. Indeed, there is more software than ever, but not software you want to trust your personal data with.

If you want to learn dev skills the old fashioned way, that's still possible. To practice Python there are many platforms like PyChallenge, leetcode and others. And of course you can build personal projects without AI.

But if there's still a place for quality products and services in the modern world? Only time will tell. There used to be "buy it for life" products, these days.. the products may last only one year.

The Veltrix Treasure Hunt Engine: Why Our First Rewrite Cost Us 3.2 Million Requests Per Second

Lillian Dube — Sat, 30 May 2026 16:01:13 +0000

The Problem We Were Actually Solving

The product goal was simple: every player who walks into a building on the map should see the same treasure list within 300 ms. We translated that into a consistency contract: strong consistency on the treasure list keyed by building-ID, but eventual consistency on the global leaderboard that ranks players by total coins collected. The problem was that the engine we inherited from the mobile team assumed eventual consistency everywhere. Their Redis Cluster v6.2.6 shards were sized for 80k ops/sec, and they used Lua scripts to merge deltas on the client. When the Royale drop pushed 1.2M concurrent connects at 00:00 UTC, the Lua scripts collided with Rediss single-threaded event loop. We saw 47k script-timeouts per minute and a P99 tail latency of 4.2 seconds on the treasure-list endpoints.

What We Tried First (And Why It Failed)

Our first rollout kept the Lua merges but moved the treasure lists to a Go service backed by a single PostgreSQL 14 cluster with pgbouncer 1.17.0 connection pooling. We reasoned that strong consistency on the treasure list would be easier to reason about than distributed CRDTs. The migration script ran at 20:00 UTC the night before the drop. Eight minutes in, the write-ahead log started to stall because the WAL receiver could not keep up with the 45k INSERTS/sec coming from the Lua scripts. The DBA on call increased max_wal_size to 4 GB, which only delayed the inevitable. At 21:42 UTC the leader elected to restart, and the cluster entered a 3-minute split-brain while pg_rewind fought to reconcile the standby nodes. When the service came back, the Lua scripts had already enqueued 1.9 million backlogged treasure events. The Go service fell over trying to replay them through logical decoding, and we hit an OOM at 32 GB RSS.

The Architecture Decision

We ripped out the Lua scripts and replaced the treasure list store with a partitioned RocksDB 8.7.0 tier that we called the Cellar. Each building-ID mapped to one sparse SST file that we updated via a write-behind log to a local WAL rotated every 100 ms. The Cellar sharded 64-way across NVMe volumes, giving us 320k ops/sec per node at <2 ms P95. We fronted the Cellar with a single envoy 1.26.0 proxy that implemented a consistent-hash policy on building-ID. Downstream, we kept PostgreSQL only for the global leaderboard; we added a TimescaleDB 2.12.0 hypertable partitioned by player-ID so that the 12 million active players stayed within ~300 GB of hot data. The Timescale instance ran on AWS RDS i3.8xlarge with 2 TB gp3 disks and a 30k IOPS burst credit.

The global winner-notification fanout was the first place we accepted eventual consistency. We switched from WebSockets to NATS 2.9.21 jetstream with a 5-minute deduplication window. Each player subscribed to exactly one jetstream subject: user.. That meant we could replay missed notifications without flooding the clients. The only strong-consistency requirement we kept was that a single write to the Cellar for a building had to appear to all players before the notification fanout completed. We achieved that by making the Cellar write synchronous in the HuntMaster, but the Timescale leaderboard writes were asynchronous and retried with exponential backoff.

We also introduced a local cache layer with Dragonfly 1.8.1 acting as a L1 shard for each envoy instance. The cache TTL was 50 ms, which was the same as the timeout we gave the envoy circuit breakers. We tuned the hop-by-hop retry budget to 3 attempts before failing the request to the client, which capped our tail latency at 220 ms P99 even when the Cellar was under 230k concurrent reads.

What The Numbers Said After

The Winter Royale drop went live at 00:00 UTC on 15 December 2025. In the first hour we ingested 2.9 billion treasure updates. Our scrape job on the HuntMaster showed a steady 3.2M requests/sec on the write path with no flapping. The Cellar nodes reported 78 k ops/sec per shard at 2.1 ms P95 latency. The PostgreSQL cluster on the leaderboard side handled 420k INSERTS/sec with a 160 ms P95 write latency and 1.2 seconds P99. NATS jetstream delivered 1.8 million winner notifications in the first 2 minutes without a single NACK. The client error rate stayed below 0.04 % across all regions.

The cost side was brutal: the Cellar nodes alone ran 64 r6i.2xlarge instances, each costing $1.092 per hour, or ~$1,500/day. The NATS jetstream cluster added another $840/day for 9 m5.2xlarge brokers with 5 TB gp3 storage each. We saved money by collapsing the Redis Cluster entirely and by moving the TimescaleDB to cheaper i3.2xlarge spot instances at $0.24/hour, reducing the leaderboard bill from $3.1k/day to $1.2k/day.

What I Would Do Differently

I would not have tried to migrate the treasure-list store while

I Made My AI Models Argue, Then Let Hermes Be the Judge

Arqam Waheed — Sat, 30 May 2026 16:00:54 +0000

This is a submission for the Hermes Agent Challenge: Build With Hermes Agent

TL;DR — Ask any judgment call and three different AI models argue it out, then Hermes hands down one verdict, a confidence score, and exactly why they split. Every verdict, dissent, and mind-changed-in-debate is written into Hermes' own memory, so the next question re-weights the jurors before they ever vote. The judging is a pure function over that memory: no memory, no weights, no verdict. Three models, one verdict, $0.

What I Built

An LLM once talked me into the wrong database with total confidence. One smooth, authoritative answer. I shipped it. It cost me a weekend and a migration I'm still not over.

The villain here is single-model overconfidence: you get one polished reply, and the disagreement that should have warned you is invisible. You never see the other opinions, because you only asked one model.

So I stopped trusting one model. I convened a jury.

Council takes any judgment call ("Postgres or Mongo?", "is this PR safe to merge?", "is this clause risky?") and asks three different models, lets them disagree, then has Hermes deliver one verdict, a confidence score, and exactly why they split. Three models, one verdict, $0.

You ask a question. Council fans it out to three jurors (two free OpenRouter models from different families and one local model via Ollama), each takes a position with reasons. Then, if they disagree, a second deliberation round runs: each juror sees the others' answers and either holds or changes its mind, so the council debates instead of just voting once. Hermes then judges the deliberated opinions: a single verdict, a confidence score (high when they agree, low when they split 2-1), and a "why they disagreed" panel. Every verdict is remembered, a council skill learns which juror to trust for which kind of question, and the agent can even propose its own trust adjustments for you to approve.

The whole product is one question box. Everything interesting happens behind it, and the rest of this post is mostly pictures of that "behind."

Demo

Repo: https://github.com/ArqamWaheed/council
Live demo: https://council-jet-kappa.vercel.app/

Try "Should a 3-person startup use microservices?" and open the dissent panel.

Local, one command (runs at $0 in offline mock mode, no key needed):

git clone https://github.com/ArqamWaheed/council && cd council && ./setup_hermes.sh && python server.py

Architecture, in pictures

I think the design is easiest to see, so here's the system as a sequence of images. Each caption is the explanation.

The core loop. One question, three independent Hermes subagents (2 hosted + 1 local) fanned out in parallel, then a fourth Hermes run (the foreman) synthesizes one verdict. Every arrow is the same hermes -z interface; nothing talks to a model directly.

The bet. A hosted model and an on-device model sit on the same jury, swapped with a single --provider/--model flag, no code change. This model-agnosticism is the one Hermes property the whole project is built on.

The UX surface. Confidence is high when jurors agree and drops on a 2-1 split. The dissent panel is collapsed by default, and you expand it exactly when the confidence number makes you nervous.

The actual product. A confident single answer hides this; Council makes the disagreement the headline. Getting the clustering right here was subtle (see "What I learned" below).

The headline feature: a council that **deliberates, not just votes. After round 1, disagreeing jurors get a second Hermes pass where they read each other's arguments and may hold or change their vote. A "⇄ changed" badge marks the ones that moved, and the confidence dial actually climbs when a 2-1 split is talked into agreement.

The agentic learning loop, human-in-the-loop. Hermes proposes; you approve or dismiss. Approved rules persist client-side and ride along with the next convene call.

Persistence the judge can verify. Verdicts are mirrored into Hermes' own memory, so recall is Hermes doing the work; proof lives in docs/hermes-proof/04-memory-recall.txt.

Code

Repo: https://github.com/ArqamWaheed/council

Interesting files:

hermes_run.py (the Hermes CLI driver every juror/judge call goes through)
run_council.py (orchestration + the deterministic judge + Hermes foreman + the --reflect loop)
skills/council/SKILL.md (the juror-weighting brain Hermes edits)
server.py (the /api/reflect + /api/learn endpoints)
index.html (the designed verdict UI with the foreman TTS readout and localStorage persistence).

Proof that Hermes is genuinely in the loop (subagent transcripts, skill diff, memory recall) is in docs/hermes-proof/.

# hermes_run.py: every juror/judge call is a real Hermes run
def ask(prompt, provider, model, skills=None, timeout=120):
    cmd = [binary(), "--provider", provider, "--model", model]
    if skills: cmd += ["--skills", skills]
    cmd += ["-z", prompt]                       # -z = one-shot, final answer on stdout
    return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout).stdout

# jurors.py: fan out one Hermes subagent per juror, in parallel
with ThreadPoolExecutor(max_workers=len(roster())) as pool:
    opinions = list(pool.map(lambda c: ask_juror(*c), enumerate(roster())))

How I Used Hermes Agent

Why Hermes at all: the model-agnostic core. Hermes lets you point at any provider and swap with a flag, no code change. Council is built on top of that one property: the jurors are different models, and Hermes is the only piece that makes "different models" cheap. The clearest proof is the third juror: it runs locally via Ollama while the other two are hosted on OpenRouter, and all three answer through the exact same hermes -z interface (the model-agnostic diagram above). A hosted model and an on-device model, sitting on the same jury, no code change: that's model-agnosticism you can see. I genuinely didn't see another entry in this challenge exploit it; everyone picked one model and moved on. That's the whole bet.

Subagents: one real Hermes run per juror. Each juror is a genuine, isolated Hermes invocation on a different provider+model (hermes -z --provider openrouter --model … for the two hosted jurors, --provider ollama-local … for the on-device one), fanned out in parallel so no model's reasoning anchors another's (the convene-flow diagram above). Hermes does the inference; my Python (jurors.py to hermes_run.py) is just the fan-out plumbing, and every juror in the output JSON is tagged "via": "hermes". The gotcha worth flagging: Hermes enforces a 64K-context floor, which for the local model meant setting both ollama_num_ctx and a named custom_providers entry; without the named provider, --provider ollama silently routed to the wrong base URL. setup_hermes.sh encodes the working config so a judge can reproduce it in one command.

A true debate, not just a vote (round 2 is real Hermes work). This is the feature I'm proudest of. After round 1, if the jurors disagree, each one gets a second Hermes run that shows it the others' positions and lead reasons and asks it to hold or change its mind. Real jurors reconsider through the same hermes -z path as round 1, so the debate is genuine extra agentic work, not a UI flourish; mock jurors reconsider deterministically so the offline demo stays reproducible. The judge then synthesizes the verdict from the deliberated opinions, so a juror that's talked round actually moves the outcome (the deliberation diagram above). It's gated on disagreement (a unanimous round 1 skips it) and toggled with COUNCIL_DEBATE=0.

Why a skill, not a prompt, for judging. The foreman's verdict is itself a Hermes run (hermes -z --skills council) grounded in skills/council/SKILL.md, which is installed into Hermes (hermes skills list shows it). The weighting logic lives in a machine-readable weights block.

The judging brain is data, not a buried prompt. --learn and --reflect both edit this block, and the installed Hermes copy is kept in sync.

After a string of security questions, --learn appended a rule to upweight the local model on that topic (and synced the installed Hermes copy) because it had caught issues the hosted models missed:

python run_council.py --learn "Local Juror | security | 1.5"

On the next security question that juror's vote counts 1.5×, read straight back by the judge. Counterfactual: a static synthesis prompt can't get better; this does. (The before/after skill diff is in docs/hermes-proof/03-skill-learning.txt.)

Letting the agent propose its own learning, now on the web and grounded in evidence. python run_council.py --reflect (and the "Should the council reweight itself?" button in the UI) hands Hermes its own memory of past verdicts and asks it to propose one weight change, e.g. "the local juror has dissented on three database calls; upweight it." The key fix this round: the proposal is evidence-grounded, since Hermes is fed the actual dissent tally and any rule backed by fewer than two real dissents is rejected, so it can't just parrot the example baked into the skill. You then Approve or Dismiss it (the reflect-flow diagram above). That's the agentic loop done honestly: a single verdict has no ground truth, so the agent surfaces a pattern and a human confirms it's signal, not overfitting (the exact tension this post closes on). (Offline, it falls back to a deterministic heuristic so it never breaks.)

Making learning survive a stateless deploy. On a hosted demo the filesystem is read-only, so an approved rule can't be written back to SKILL.md. Council handles this honestly: approved rules are stored in the browser's localStorage and re-sent with every /api/convene call, where they're merged into the judge's weights for that request. Locally you get a persistent SKILL.md; on the web you get per-browser persistence, and either way the learning sticks.

Why memory. Each verdict is appended to a log and mirrored into Hermes' own MEMORY.md, so I can ask hermes -z "what did the council decide about auth?" and Hermes recalls it from its memory, not from my code (the memory-recall image above). Proof: docs/hermes-proof/04-memory-recall.txt.

The foreman reads the verdict aloud. The verdict card has a "the foreman reads the verdict" button (browser SpeechSynthesis, $0); Hermes also ships native TTS via hermes setup tts. On-theme and memorable: a jury foreman announcing the decision.

The build itself was agent-run. I kept a memory.md the coding agent read before each task and updated after (so context stayed cheap), committed every increment with Conventional Commits, and built the verdict UI with the frontend-design skill, which is why the confidence dial and colour-coded juror chips read as designed, not default-template AI slop. The repo's AGENTS.md + commit history show the process, not just the result.

Why these models, and the concession. Two free OpenRouter models from different families (≥64K context, since Hermes rejects smaller at startup) plus a local Ollama juror. Two honest concessions: (1) free models are slower and three calls add latency (~10-20s/verdict); (2) the free tier is aggressively rate-limited, so I hit 429s constantly while building, and Council retries and, if a juror still won't answer, falls back (Hermes to direct API to deterministic stand-in) rather than crashing the verdict, which also means the demo runs fully offline at $0. For a once-a-decision tool, I'll take it. Cost: $0.

License. MIT. Fork it, add your own jurors.

What I learned (and what's next)

The disagreement is the product. A 2-1 split is more useful than a confident single answer, so the clustering that decides "who actually disagreed" has to be right. A small local model once wrote a vague position ("to facilitate efficient integration…") whose reasons clearly endorsed Postgres; the first version mis-filed it as a dissenter. The fix: when a juror's stated position is ambiguous, fall back to reading its reasons, and ignore options only mentioned in a comparison ("better than Mongo" isn't a vote for Mongo). Now agreeing jurors cluster together, and the split count is honest.
Grounded beats glib. Letting the agent propose its own weighting only works if the proposal is tied to real evidence; an ungrounded "reflect" just echoes whatever example is in the skill.
Hermes' 64K-context floor caught a model that would've quietly underperformed.
A council should deliberate, not just vote. The round-2 debate above was the turning point: letting jurors read each other and reconsider means a juror that's genuinely persuaded moves the verdict, and you watch the confidence dial climb as a 2-1 split becomes unanimous. A one-shot vote can't do that.

Road To KiwiEngine #4: The Racecar Driver Analogy

Drew Marshall — Sat, 30 May 2026 16:00:00 +0000

One thing I keep coming back to when thinking about modern software is this:

A racecar driver shouldn’t need to manufacture every part of the car before racing.

They should be able to:

choose reliable components
assemble systems
tune performance
focus on operating effectively

But in software, we often expect businesses to do the opposite.

Before a company can even begin solving its actual operational problems, it frequently has to piece together:

hosting
authentication
databases
deployment pipelines
billing systems
analytics
infrastructure
APIs
admin systems
integrations
monitoring
workflow tooling

And by the time all of that is assembled, the original business problem sometimes becomes secondary to maintaining the technology stack itself.

That realization changed how I think about software architecture entirely.

Businesses Usually Don’t Want Technology Stacks

Most businesses do not wake up excited about infrastructure assembly.

They care about:

serving customers
operating efficiently
scaling sustainably
managing workflows
improving reliability
growing revenue

The software is supposed to support the operation.

But increasingly, modern systems require businesses to become partial infrastructure companies just to function effectively online.

That’s a huge shift from the earlier web.

Modern Software Has Become Operationally Heavy

One thing I’ve noticed is that the complexity of modern software often comes less from the business logic itself and more from the surrounding operational ecosystem.

For example:
launching a modern platform may involve:

frontend systems
backend systems
cloud infrastructure
CI/CD pipelines
environment management
container orchestration
observability tooling
CDN layers
API gateways
billing providers
authentication services

All before the business even begins delivering value.

That operational weight compounds quickly.

This Is Part of Why Blueprint Thinking Became Important to Me

The more systems I worked on, the more I became interested in:

reusable operational systems instead of:
endlessly rebuilding implementation details.

For example:

A restaurant platform shouldn’t need to reinvent:

ordering flows
delivery states
inventory workflows
customer notifications
payment systems
operational dashboards

A creator platform shouldn’t need to rebuild:

memberships
subscriptions
storefront systems
content delivery
audience workflows

Those operational patterns already exist.

So the interesting challenge becomes:

“How do we create systems that allow businesses to focus more on operating and less on rebuilding infrastructure repeatedly?”

This Is Where Platforms Become Interesting

I think this is one reason platform ecosystems became so influential historically.

Platforms reduce operational friction.

WordPress did this incredibly well for publishing.

Shopify did this for eCommerce.

Other ecosystems solved similar operational problems in different industries.

The common pattern is usually:

reduce setup friction
abstract operational complexity
provide extensibility
improve accessibility

That’s much bigger than simply “building apps.”

But Modern Systems Need More Than Simplicity

At the same time, modern operational systems increasingly require:

scalability
deployment awareness
observability
portability
infrastructure flexibility
lifecycle management

So now the challenge becomes balancing:

simplicity with
operational capability.

That’s not easy.

Especially as systems become larger and more interconnected.

The Infrastructure Layer Is Becoming the Real Product

One thing I increasingly believe is that many modern software companies are actually infrastructure companies disguised as application companies.

Because eventually:

reliability matters
deployment matters
scaling matters
integrations matter
operational workflows matter
portability matters

The operational layer becomes the long-term challenge.

Not just the UI.

This Shift Changed How I Think About WebEngine

A lot of the philosophy behind:

WebEngine
KiwiPress
Citrode
blueprint systems
operational runtime architecture

comes from thinking deeply about this operational burden.

I became increasingly interested in:

deployment-aware systems
infrastructure-aware development
operational portability
composable runtime architecture
reusable business blueprints
lifecycle-aware systems

Not because technology itself is the goal.

But because reducing operational friction matters enormously for businesses.

AI Makes This More Important, Not Less

Ironically, I think AI increases the importance of operational architecture.

Because AI can increasingly generate:

code
interfaces
APIs
boilerplate systems

quickly.

But generated systems still require:

structure
workflows
operational boundaries
infrastructure
maintainability
lifecycle management

Otherwise complexity compounds at machine speed.

That’s one reason I think blueprint systems and operational platforms are becoming increasingly important.

I Think the Industry Is Moving Toward Operational Abstraction

One thing I suspect we’ll see more of over time is software moving higher up the abstraction ladder.

Not just:

frameworks
components
libraries

But:

operational systems
infrastructure orchestration
lifecycle-aware platforms
composable ecosystems
business blueprints

Because businesses ultimately want operational outcomes.

Not endless infrastructure assembly.

The Goal Isn’t Removing Flexibility

This is important:
I don’t think businesses should lose flexibility.

I think they should gain better operational foundations.

The ideal system should allow:

extensibility
customization
scalability
portability

without forcing every company to become infrastructure experts before they can operate effectively.

That’s a very different architectural philosophy than simply:

“assemble everything manually.”

Final Thoughts

Software has become incredibly powerful.

But it has also become operationally heavy.

And increasingly, I think the biggest challenge isn’t:

“How do we build more technology?”

It’s:

“How do we reduce operational friction while still enabling powerful systems?”

Because most businesses don’t actually want to spend their lives assembling racecars.

They want to race.

How I Cut Aider's Token Bill 80%: Prompt Caching, MCP Code Mode, and Tier Routing

Vishal VeeraReddy — Sat, 30 May 2026 15:56:21 +0000

Aider is the best terminal AI coding tool I've used. But by default it sends every diff through your OpenAI or Anthropic key, which gets expensive fast on real refactors — a single 100-file repo map can torch a few dollars before Aider even reads your prompt.

This post shows how to run Aider against any LLM provider — Ollama for free local runs, OpenRouter for mixed-provider routing, AWS Bedrock for the enterprise plate — through a single OpenAI-compatible endpoint, with prompt caching and MCP Code Mode layered on top to slash the bill further. I'll use Lynkr, the self-hosted gateway I maintain.

Full disclosure: I build Lynkr. I'm going to make the case for why the combination — gateway + caching + code-mode tools — is the real cost lever, not just "swap your provider."

The setup in three commands

# 1. Start the gateway
npx lynkr@latest

# 2. Point Aider at it
export OPENAI_API_BASE=http://localhost:8081/v1
export OPENAI_API_KEY=any-value

# 3. Run Aider with any model name Lynkr knows about
aider --model deepseek/deepseek-v3.2-reasoner

That's it. Aider speaks the OpenAI Chat Completions protocol; Lynkr speaks it back and quietly translates the call to whichever upstream provider you've configured (Ollama, Bedrock, Anthropic, Azure, OpenRouter, Databricks, llama.cpp, LM Studio, ...). Aider has no idea it's talking to a router.

Where the money actually leaks in Aider

Most "save money on AI coding" posts focus on swapping GPT-4o for a cheaper model. That's table stakes. The real spend in an Aider session breaks down roughly like this:

Call type	Share of total tokens	Where it goes
Repo map (system context, sent every turn)	~50–60%	Same prefix, every single request
File contents you've /add'd	~20–30%	Same prefix until you change the files
The actual diff / instruction	~5–10%	Genuinely new each turn
Commit messages, summarization	~5%	Cheap model anyway

Look at that table. Most of your Aider bill is the same bytes being re-sent over and over. Swapping models helps a little. Caching that repetitive prefix helps a lot.

Lever 1: Prompt caching — cuts the repeated-prefix tax

Anthropic, Bedrock, Gemini, and OpenRouter all support prompt caching now, but Aider doesn't speak any of their cache-control protocols natively (it speaks one — OpenAI's — and only partially). Lynkr sits in the middle and injects cache_control: ephemeral breakpoints on the right blocks before forwarding upstream.

What that means in practice: the second Aider request in a session — same repo map, same /added files — only pays for the few hundred tokens of new instruction. Cached input tokens are 10% the price of fresh input on Anthropic, 25% on Bedrock, free for 5 minutes on Gemini.

On a 4-hour Aider session against Claude Opus 4 or GPT-5, this single lever has cut my own input bill by ~70% before I even start tier-routing.

Lynkr enables it automatically when the upstream provider supports it. No Aider config change.

# .env
MODEL_PROVIDER=anthropic
ANTHROPIC_API_KEY=sk-ant-...
PROMPT_CACHE=true    # default on, but explicit is good

Lever 2: MCP Code Mode — collapse N tool calls into 1

Aider doesn't use tool calls itself (it parses code blocks from plain Markdown). But the moment you start composing Aider with other MCP tools — file search, web fetch, sandboxed execution — the round-trip cost explodes. Every tool call is a full request/response cycle through the LLM.

Lynkr's MCP Code Mode (borrowed from Cloudflare's pattern) flips this. Instead of advertising each MCP tool as a separate function the model can call, Lynkr exposes them as a small TypeScript API that the model writes a single program against. The program runs in a sandbox, hits all the tools it needs, and returns the result in one LLM round trip.

Example: "find every file that imports redis, check if any still use the v3 API, and print a migration TODO list."

Tool-call mode (default everywhere else): 5 file_search calls + 12 file_read calls + 1 grep call = 18 round trips. Each round trip re-sends the conversation history.
MCP Code Mode (Lynkr): model writes ~20 lines of TS using mcp.fileSearch() and mcp.fileRead(), executes once, returns the result.

For coding-heavy sessions where Aider is composed with other MCP tools, this is a 5–15x reduction in tokens spent on tool plumbing.

Lever 3: Tier routing — match model to task

Aider's own polyglot leaderboard tells a more interesting story in late 2026 than most people realize:

Model	% correct	Total benchmark cost
GPT-5 (high)	88.0%	$29.08
o3-pro (high)	84.9%	$146.32
Gemini 2.5 Pro (32k think)	83.1%	$49.88
Claude Opus 4 (32k think)	72.0%	$65.75
DeepSeek-V3.2 Reasoner	74.2%	$1.30
DeepSeek-V3.2 Chat	70.2%	$0.88
Kimi K2	59.1%	$1.24
GPT-4o (2024-08-06)	23.1%	$7.03

Two things to notice:

GPT-4o — the model most Aider quickstarts still suggest — is now near the bottom. 23% on polyglot. The defaults aged badly.
DeepSeek-V3.2 Reasoner is 74% correct for $1.30 of total benchmark cost. That's within striking distance of GPT-5 at ~1/22nd the bill, and roughly 50× cheaper than o3-pro.

For Aider specifically, you don't need a $146 model to rename a variable. You need it for architecture decisions — and even then, V3.2 Reasoner is probably the right default for everything except the genuinely hardest 10% of calls.

Lynkr's tier routing splits the work by prompt complexity:

Aider call type	Routes to	Notes
Repo map summarization	`qwen2.5-coder:7b` (Ollama, local)	Free, runs on your laptop
File edits, single-function diffs	`deepseek-v3.2-chat` (OpenRouter)	70% correct, ~$0.88/benchmark
Default coding workhorse	`deepseek-v3.2-reasoner` (OpenRouter)	74% correct, ~$1.30/benchmark
Hardest 10% — architecture, multi-file refactor	`gpt-5` or `gemini-2.5-pro`	Used sparingly

# .env additions
TIER_SIMPLE=ollama:qwen2.5-coder:7b
TIER_MEDIUM=openrouter:deepseek/deepseek-v3.2-chat
TIER_COMPLEX=openrouter:deepseek/deepseek-v3.2-reasoner
TIER_REASONING=openrouter:openai/gpt-5

Then point Aider at --model lynkr-auto and Lynkr scores each prompt before picking the tier.

Stacking the three levers

Each lever on its own is meaningful. Stacked, they compound:

Caching alone: ~70% input-token cut on a stable session
+ Tier routing: another ~40% by pushing routine calls to Flash/Ollama
+ MCP Code Mode (if you compose with other MCP tools): another 5–15x on tool-plumbing tokens

In my own Aider workflow — heavy refactors against a 200k-LOC monorepo — this combination has dropped a session that used to cost ~$8 in Claude calls down to under $1.50. Not because Claude got cheaper. Because most of the work is now happening on cached prefixes, free local models, or in-sandbox code execution.

Configuration walkthrough

Step 1 — Install and start Lynkr

npx lynkr@latest

First run creates a .env file. Minimal config:

MODEL_PROVIDER=anthropic
ANTHROPIC_API_KEY=sk-ant-...
PROMPT_CACHE=true
PORT=8081

For full local + free:

MODEL_PROVIDER=ollama
OLLAMA_ENDPOINT=http://localhost:11434
OLLAMA_MODEL=qwen2.5-coder:latest
PORT=8081

Then ollama pull qwen2.5-coder:latest.

Step 2 — Point Aider at the gateway

export OPENAI_API_BASE=http://localhost:8081/v1
export OPENAI_API_KEY=dummy

Drop those in your shell rc file.

Step 3 — Pick a model (or let Lynkr pick)

# Direct pass-through
aider --model deepseek/deepseek-v3.2-reasoner

# Or let Lynkr tier-route
aider --model lynkr-auto

Step 4 — Verify

curl -s http://localhost:8081/v1/models | python3 -m json.tool | head

Start Lynkr with LOG_LEVEL=info and watch the cache-hit lines on your second Aider request — that's where the savings show up.

Aider-specific gotchas

Weak model for commits / summarization. Aider uses a cheaper model for non-code calls; default is gpt-4o-mini. Override to a free local one:

aider --model openai/gpt-4o --weak-model ollama/qwen2.5-coder:7b

Long context. Local Ollama models will OOM on 200k+ token repo maps. Either set --map-tokens 0, or route long-context calls to Gemini Flash 1M-token contexts via the TIER_REASONING line above.

Streaming. Aider expects streaming responses. Lynkr streams by default. If you're on a non-streaming Databricks endpoint, set STREAM_PASSTHROUGH=false and Lynkr buffers + simulates.

Cache hit rate. Prompt caching only fires when the prefix is byte-identical across requests. If your repo map changes (you edit a /added file), the cache for that block invalidates and rebuilds. Lynkr logs cache-hit ratios per session — watch them; if hit rate is below 60% something in your workflow is busting the prefix.

Quickref

Aider env var	Lynkr role
`OPENAI_API_BASE=http://localhost:8081/v1`	Where Lynkr listens
`OPENAI_API_KEY=dummy`	Required by Aider, ignored by Lynkr
`--model deepseek/deepseek-v3.2-reasoner`	Forwarded as-is to the configured upstream
`--model lynkr-auto`	Triggers Lynkr's complexity-based tier routing
`--weak-model ollama/qwen2.5-coder:7b`	Free local model for commit messages

TL;DR

The default Aider setup pays full price for the same repo-map bytes on every turn. The fix isn't "use a cheaper model" — it's:

Cache the repetitive prefix (prompt caching).
Collapse tool plumbing into one call (MCP Code Mode).
Match model size to task complexity (tier routing).

Stacked, those three levers have taken my Aider sessions from ~$8 to ~$1.50 without changing how I work. Lynkr is one gateway that does all three; it's Apache 2.0, single Node binary, drop-in OpenAI base URL.

Aider's GitHub: https://github.com/Aider-AI/aider
Lynkr's GitHub: https://github.com/Fast-Editor/Lynkr — star to follow next integration writeups (OpenHands, Vercel AI SDK, Open Interpreter queued).

BAIXAR VÍDEO DO YOUTUBE

Vinicius Andrade — Sat, 30 May 2026 15:55:59 +0000

Criei um gerenciador de downloads desktop em Python e quero feedback da comunidade!

O PyFlowDownloader é um app desktop feito com Python + PySide6 que usa yt-dlp para baixar vídeos e áudios de forma assíncrona do youtube. Algumas coisas que ele já faz:

Fila de downloads com progresso em tempo real
Cancelamento de downloads ativos ou pendentes
Suporte a MP4 e MP3, de 144p até 1080p
Histórico com exportação para CSV
Interface desktop com tema visual via QSS Build para Windows via PyInstaller + pipeline de release no GitHub Actions

Está na versão v0.3.0 e ainda tem muito espaço pra crescer. Repositório: https://github.com/Vinny00101/PyFlowDownloader

Se você puder **testar e deixar sua opinião nos comentários, ficaria muito grato! Quer saber:

O que achou da experiência de uso?
Algum bug que encontrou?
O que você adicionaria ou melhoraria no projeto?

Todo feedback é bem-vindo!

Releasing HeliosProxy, The programmable Postgres data-plane

Dani Moya — Sat, 30 May 2026 15:54:05 +0000

Happy to announce HeliosProxy !!
Far beyond a pooling tool, HeliosProxy ** is a next-gen programmable Postgres data-plane. **Works with PostgreSQL-compatible databases, not only HeliosDB.

It starts as a PgBouncer-compatible wedge, then adds the operational surface teams usually build from multiple tools:

connection pooling
failover and transaction replay
shadow execution
anomaly detection
edge cache controls
admin REST API
embedded admin UI
signed WASM plugins
OCI-style plugin artifacts
Kubernetes operator
Terraform and Pulumi providers
22 installable Claude/Codex operator skills

Install operator skills:
heliosdb-proxy install skills

PostgreSQL #DevOps #SRE #Database #AIcoding

Hello, DEV Community! 👋

Ana Villar — Sat, 30 May 2026 15:53:23 +0000

I'm Ana, and I'm excited to start sharing my technical journey here. This is a brief post to introduce myself and give you a heads-up on what's coming.

What to Expect

In upcoming posts, I'll be diving into hands-on, infrastructure-focused tutorials and walkthroughs, including:

Setting up a XWiki server on-premises — from installation to configuration, getting a collaborative wiki platform running in your own environment.
Deploying Kubernetes on RHEL 10 virtual machines — step-by-step guidance on building a Kubernetes cluster on Red Hat Enterprise Linux 10.
Deploying OpenShift as virtual machines on a RHEL 10 host — exploring how to run OpenShift on top of RHEL 10, combining the power of containers with VM-level control.

My goal is to keep things practical, clear, and rooted in real-world experience — the kind of content I wish I'd had when tackling these setups myself.

Why These Topics?

Because on-premises infrastructure is far from dead. Whether it's compliance requirements, performance needs, or simply wanting full control over your stack, there's still a strong case for running things yourself. And RHEL was my choice for certification purposes, and it's given me the flexibility to keep using hardware that other vendors have discontinued.

Stay tuned, and feel free to follow me so you don't miss the upcoming posts. If any of these topics spark your interest, I'd love to hear about it in the comments! 🚀

Three Bitcoin Primitives That Don't Exist Anywhere Else (PoW Beacon, DLC Oracle, Fair-Launch Rune)

Zeke — Sat, 30 May 2026 15:52:11 +0000

The Problem With "Bitcoin-native" Claims

Most things calling themselves Bitcoin-native are not. They settle on Ethereum, they custody coins through a federation, they hand you an IOU and call it Bitcoin. Plenty of projects ship something useful that touches Bitcoin somewhere. Few ship primitives where every byte that matters lives on the chain, or anchors to the chain, or settles on the chain.

This week I shipped three primitives that fit that bar, on the same captcha endpoint that hands out SHA-256 challenges to AI agents around the clock. They are not new ideas in isolation. Randomness beacons, DLC oracles, and Rune fair-launches all exist. What is new is wiring them through honest proof-of-work, so the entropy comes from work nobody can grind in their favor, and the distribution rewards the exact same kind of compute that secures Bitcoin itself.

Here is what landed, how to verify it, and where the seams are.

Primitive 1: PoW Randomness Beacon

Every minute the captcha server batches the PoW solutions it received, builds a Merkle tree, publishes the root, and anchors that root in Bitcoin via OpenTimestamps. The Merkle root becomes a public seed that nobody could have predicted, because nobody knew what challenges agents would solve in the next sixty seconds. Once the OTS proof confirms, the seed is forever attestable against the Bitcoin chain.

This inverts the usual move. Most randomness oracles inject an external source of entropy into Bitcoin. We harvest entropy that already exists out in the wild, compress it cheaply, and anchor it.

curl -s https://captcha.powforge.dev/api/beacon/latest

You get back something like this:

{
  "epoch": 3,
  "merkle_root": "9321a45272fd3331e0ee73cbd86c32ad30dd6a786e3f1c95cb1afd8a2d1c18c1",
  "beacon_random": "abc1fead17f55476ad0e248357db8b3d29510318f1c59111b78785da5368629a",
  "leaf_count": 3,
  "ots_status": "submitted",
  "btc_confirmed": null,
  "weak": true,
  "weak_reason": "leaf_count=3 < threshold=10"
}

A few things worth noticing. weak: true is honest signaling, not a bug. When leaf count is low, the beacon flags itself as weak so you do not build a contract on it that needs strong unpredictability. ots_status: submitted means the OTS server has the proof and is waiting for the next Bitcoin block to anchor it. Once that happens, btc_confirmed flips to the block height and the seed is forever verifiable against the chain.

Why this matters: the beacon does not ask you to trust me. It asks you to trust SHA-256 and the Bitcoin chain. If you can verify a Merkle root and an OTS proof, you can verify the beacon yourself. That is the whole point.

Primitive 2: DLC Oracle on PoW

The PowForge Schnorr oracle lives at attest.powforge.dev and signs outcomes with a stable BIP-340 Schnorr key. Two parties anywhere on Earth can write a Bitcoin contract whose outcome depends on a future beacon or attested event, fund it into a 2-of-2 multisig, and have it auto-settle the moment the oracle attests.

This is standard DLC machinery. What is new is that the oracle's signing pipeline is gated behind proof-of-work, and it serves binary TLV outputs (OracleAnnouncement type 55332, OracleAttestation type 55400) that any dlcdevkit-compatible wallet can consume without additional wrapping.

curl -s https://attest.powforge.dev/api/v1/info | jq '{oracle_pubkey, attestation_tag}'

{
  "oracle_pubkey": "2bc78390c94d8bbb96ac3e6940462ba2812418d871e701c1a845fdb1dfd4a0e5",
  "attestation_tag": "DLC/oracle/attestation/v0"
}

That x-only pubkey is the Schnorr key the oracle signs with. Any DLC-aware wallet can pin a contract to it. The JavaScript client is available as @powforge/attest-client on npm.

What can you actually do with it? A few things that are awkward to build with conventional oracles:

Programmable lotteries where the winning number is provably unmanipulated. The number was already committed before tickets closed.
Insurance contracts that settle on observable computational difficulty. If real AI-agent traffic spikes, the beacon reflects it. Sell a put on that.
Any agreement that needs both parties to trust a number that neither side can grind. The grinder would need to control the entire AI captcha solver fleet, which is exactly the population that does not coordinate.

The economic property here is that the oracle's signature is gated by work the oracle itself did not do. The oracle is a publisher, not a producer. The randomness was already paid for by every agent that solved a challenge.

Primitive 3: PoW Fair-Launch Rune

A new Bitcoin Rune called POWFORGE•PROOF will etch on mainnet with the entire 21,000,000 supply premined to a single relay key. Distribution happens through one mechanism: solve a 14-bit SHA-256 PoW challenge, supply a Bitcoin address, get 1,000 units. One claim per address per 24 hours. No presale, no allocations, no VCs, no fundraising round.

curl -s https://captcha.powforge.dev/rune/info

{
  "rune": { "name": "POWFORGE•PROOF", "symbol": "⚒", "total_supply": 21000000, "parcel_size": 1000 },
  "pow": { "algo": "sha256", "difficulty_bits": 14 },
  "distribution": { "model": "off-chain-enforcement", "rate_limit": "1 claim per recipient address per 24h" },
  "status": "scaffold"
}

The conventional Rune fair-launch model is open-mint, first-confirmed-wins. It devolves into a gas war the moment a Rune attracts attention. Whoever pays the highest fee wins the next block, and the actual buyers get priced out. PoW gating swaps that for work-proof fairness. Anyone with a CPU-second to spare can win a parcel. There is no fee escalation, because fees are not the bottleneck. The challenge is.

Where it stands: Phase 1 (scaffold) and Phase 2 (real Runestone OP_RETURN bytes via @magiceden-oss/runestone-lib) are shipped. Phase 3 wires PSBT assembly with @scure/btc-signer and broadcasts via a local mainnet node. The minting key sits next to the oracle key in the operator's config directory. RPC permission for sendrawtransaction is confirmed working.

What sits between here and mainnet etch:

PSBT builder for the etch tx (around 254 vbytes counting the commit-reveal witness)
Rune-name uniqueness audit against the ord registry
Multisig gating on the relay key so no single human can grief the launch
A funded UTXO at the minting address (around 2,000 sats covers the etch round-trip)

The honest framing: until those pieces land, POWFORGE•PROOF is reserved by convention, not by chain. The Rune does not exist on Bitcoin yet. The infrastructure that will etch it does, and you can poke every endpoint that drives it.

How They Connect

Real work flows in. AI agents solve PoW captchas to pay for free-tier API access. The captcha server processes those solutions three ways:

The solutions feed an honest randomness primitive (the beacon)
The primitive becomes oracle-signed for trustless contracts (the DLC oracle)
The pipeline anchors a fair token distribution that rewards the same work modality that secures Bitcoin itself (the Rune)

The through-line is Bitcoin's own thesis: proof of work is the cheapest way to make a number trustworthy. Apply that to randomness, you get a beacon. Apply that to attestation, you get a DLC oracle. Apply that to token distribution, you get an un-front-runnable fair-launch. Each layer is independently useful. Together they are a working demonstration that PoW economics extend further than Bitcoin's blockspace.

What's Next

Rune Phase 3. PSBT assembly via @scure/btc-signer, dedicated minting key, mainnet etch behind a gated runbook. Roughly 10 hours of dev plus 2 hours in the operator loop.
DLC client integration. Example contract templates so two parties can spin up a beacon-settled wager in a single command.
PoW-gated oracle signing. The captcha server's PoW verification gates a Schnorr signature from a stable oracle key. Tapscript leaves can reference that oracle pubkey as a spending condition, making "valid PoW solution" the prerequisite for an on-chain signing event.
Direct on-chain PoW gate (the long path). A real OP_SHA256 plus difficulty comparison inside a Tapscript leaf needs OP_CAT. That opcode is reserved on Bitcoin but not activated as of this writing. Until soft-fork activation, we route PoW through the oracle layer rather than the script layer. The oracle path is strictly weaker than a script-level gate, but it ships today.

Try It

Verify everything yourself. All three services are live and return JSON.

# The randomness beacon (on pow-captcha)
curl -s https://captcha.powforge.dev/api/beacon/latest

# The DLC oracle info + pubkey (on attest.powforge.dev)
curl -s https://attest.powforge.dev/api/v1/info

# The Rune fair-launch metadata (on pow-captcha)
curl -s https://captcha.powforge.dev/rune/info

# A live PoW challenge for the rune fair-launch
curl -s https://captcha.powforge.dev/rune/challenge

If any of those return something that looks broken, that is data. Tell me. The whole point of building in public is that the next iteration is shaped by what the last one got wrong.

PoW is not a perfect economic primitive. It is the simplest one we have for making a number expensive to forge. Three primitives this week, all on the same engine. More coming.

Append-only doesn't mean what you'd hope

Norbert Rosenwinkel — Sat, 30 May 2026 15:51:07 +0000

Event sourcing gets sold on immutability. You don't update, you don't delete, you only append, so the history is permanent.

It mostly isn't. The events are immutable because your code agrees not to touch them, not because anything actually stops it. Underneath they're still rows in Postgres, and rows have a DBA with write access. A migration that "cleans up" old data. A 2 a.m. query run against the wrong connection. A backup restored with slightly different bytes in it.

Change one of those rows and a replay won't blink. The aggregate rebuilds, the projections rebuild, everything looks fine. Usually the first person to notice is a customer whose balance is off, and by then the trail is cold.

Chain each event into the next

The trick is small. Give every row two extra columns: a hash of its contents, and the hash of the row before it.

#1  AccountOpened     prev=00000…  hash=70be4f…
                                      │
                                      ▼
#2  AmountDeposited   prev=70be4f…  hash=796018…
                                      │
                                      ▼
#3  AmountWithdrawn   prev=796018…  hash=6a0260…

The hash is SHA-256(previousHash || json(payload)). Nothing exotic.

The point is that each hash depends on the one before it. Edit a payload and its hash stops matching. Rewrite that hash to cover for the edit, and now the next row's pointer is wrong. You can't fix one without breaking the next.

About forty lines of it

Appending an event hashes it together with the previous one:

public HashChainedEntry Append(object payload)
{
    var previousHash = _entries.Count == 0 ? GenesisHash : _entries[^1].Hash;
    var hash = ComputeHash(previousHash, payload);
    var entry = new HashChainedEntry(_entries.Count + 1, payload, previousHash, hash);
    _entries.Add(entry);
    return entry;
}

internal static byte[] ComputeHash(byte[] previousHash, object payload)
{
    var payloadJson = JsonSerializer.SerializeToUtf8Bytes(payload, payload.GetType());
    var combined = new byte[previousHash.Length + payloadJson.Length];
    Buffer.BlockCopy(previousHash, 0, combined, 0, previousHash.Length);
    Buffer.BlockCopy(payloadJson, 0, combined, previousHash.Length, payloadJson.Length);
    return SHA256.HashData(combined);
}

Verifying is the same thing backwards. Walk the rows, recompute, and check two things on each one: the pointer and the hash.

byte[] previousHash = new byte[32]; // genesis
foreach (var entry in store.Entries)
{
    if (!ByteArraysEqual(previousHash, entry.PreviousHash))
        throw new EventStreamCorruptedException(entry.Sequence,
            "previous-hash pointer does not match the prior entry's hash");

    var recomputed = ComputeHash(previousHash, entry.Payload);
    if (!ByteArraysEqual(recomputed, entry.Hash))
        throw new EventStreamCorruptedException(entry.Sequence,
            "stored hash does not match a fresh re-hash of the payload (payload was modified after commit)");

    previousHash = entry.Hash;
}

Bump Alice's $50 deposit to $5,000 straight in the table, and the check stops you cold at the exact row:

Event stream tampering detected at sequence #2: stored hash does not
match a fresh re-hash of the payload (payload was modified after commit)

What that gets you

Someone tries to…	…and it shows up because
Edit one event's payload	the re-hash no longer matches the stored hash
Rewrite the stored hash to match	the next row's pointer no longer matches
Delete a row from the middle	the next row's pointer doesn't match its new neighbour
Slip in a forged row	same thing, the pointer chain breaks at the seam

The honest ceiling

Here's the part people gloss over. That table assumes the attacker is lazy: edit a row, move on, leave the stale hash behind. Someone with full write access doesn't have to be lazy. They can edit the row and then recompute every hash after it. Now the chain is consistent again and the verifier has nothing to say.

A hash chain is a checksum, not a signature. If you own both ends of it, so does anyone who owns your database. That's the honest ceiling of doing this inside your own four walls, and it's worth saying out loud before someone says it for you in the comments.

Getting out of your own walls

This is what anchoring is for, and it's the part I find actually interesting.

Next to the per-stream chains, Stratara keeps a second table of anchors. Every so many events it writes down the head of the chain at that point. Each anchor row has a BlockchainTxHash column, and that column is the hook: you take the anchor and commit it somewhere you don't control. A public blockchain. An RFC 3161 timestamp authority. An OpenTimestamps calendar. A notary. Anything you trust that isn't you.

Once an anchor lives somewhere out of your reach, the recompute attack falls apart. Your insider can rewrite every hash in the database and still can't touch the value you already pinned elsewhere. The question stops being "is this chain internally consistent" and becomes "does it still match what we committed outside." That second one is much harder to fake.

Let me be straight about what ships versus what you wire yourself. The anchor table, the worker that writes anchors, and the BlockchainTxHash column are in the box. Actually pushing an anchor to your source of truth, and checking against it later, is the part you wire up. Stratara doesn't pick the chain for you, the same way it doesn't pick your message broker. The sample at the end runs the whole thing in memory so you can see the shape of it.

One caveat, said plainly: if someone owns your database and your anchoring pipeline, they can re-chain and re-anchor and it'll all look fine. The defense only holds if the thing you anchor to is genuinely out of their hands. That's the entire reason to put it outside.

Where the hashing happens, and where verifying does

The hashing runs on a background worker, not inline on every append, so writes stay cheap. The chain gets filled in a beat behind the commit. Verifying is a separate thing you do on purpose: a scheduled job, or checking the external anchor. You don't want it on the read path, because that's a SELECT … ORDER BY Sequence on every query and it ties each read to the integrity check.

Worth being straight about: nothing in the framework wakes up and hunts for tampering on its own today. The hashes and the anchors are there so that when you verify — on a schedule, during an audit, after an incident — the evidence is intact and a break lands on the exact row. For a SOC 2 or ISO 27001 audit, the worker's structured logs are the running record that the hashing happened across the period; the verification job is what proves the chain held.

Where this lives

I build Stratara, a CQRS and event-sourcing stack for .NET 10. The chaining is the EventStreamHashing worker, running against Postgres. None of the idea is Stratara-specific though. If you've got an append-only table, you can bolt this on yourself.

The TamperProof sample is the whole story in zero-dependency, in-memory code, in three acts: a clean chain that verifies, a sloppy tamper caught at the exact row, and a full re-chain that sails past the local check but gets caught by an external anchor.

Wiring it into a real app is more than one dotnet add — you need the event store, the hashing worker, and a little DI — so the getting-started guide walks the minimal setup. Full docs are at https://docs.stratara.tech, and it's source-available under FSL-1.1-MIT (not OSI-approved OSS), which flips to plain MIT after two years.

This is just one slice of Stratara, and honestly the easiest to show off. There's plenty more I want to write up — the tenant-aware encryption side especially, where a tenant's data is cryptographically bound to their own key — without cramming it all into one wall of text. So if this was your kind of thing, stick around: more coming.

If you're already event sourcing: how would you actually prove to an auditor that nobody's touched the log? Genuinely curious what people are doing here.