De Moura’s thesis is blunt: generation scales faster than verification, and our review habits are eroding as “good enough” output floods repos. He cites patterns like “accept all” reviews, high failure rates in security tests, and the Heartbleed lesson as a preview of systemic risk when bugs propagate across shared dependencies. The prescription is formal specifications that stand apart from the code and proofs checked by a minimal, trusted kernel. Speed without proof becomes liability, not leverage. leodemoura.gith... 🔍
What does a path forward look like in practice? One strand is formalization at the math level, where TorchLean treats learned components as first‑class objects, executes with explicit Float32 semantics, and applies IBP and CROWN‑LiRPA to certify properties like robustness or controller safety. This is not a silver bullet, but it shows how execution and verification can share a single semantics instead of ad‑hoc test harnesses. It pushes correctness closer to the code people actually run. leandojo.org 🧪
Another strand is operational: test the behavior users experience, continuously and deterministically. Cekura simulates full conversations with synthetic users, evaluates outcomes with structured judges, and monitors live traffic at the session level to catch failures that single‑turn checks miss. Formal proofs guard correctness at the core, while simulation guards the surface where regressions hurt customers. Together they sketch a reliability stack that scales with generation. news.ycombinato... ⚙️