Andreas Bigger

Engineer. Investor. Writer.

Ramblings · March 5, 2026

Missing Years Is a Bug

A routine version bump. A database that wouldn't open. And a bug that had been hiding in plain sight for years.

Wednesday morning. A cup of coffee, a couple meetings scheduled, and a routine deploy. Light work.

We were just bumping a version that was mostly dependency updates: few logic changes, and a local devnet that worked end to end.

That morning, we were rolling out a new consensus client. That was where all our attention was.

Other components -- including the execution client -- were just along for the ride. The kind of thing where you almost consider just skipping the staging environment. Almost.

If something was going to break, it wasn't going to be in the execution client.

deploy · production · Wednesday 09:14

The node didn't start.

Not a graceful error, not a missing config key, not a "port already in use." It just panicked immediately on boot, before it had done anything useful, with this:

node startup
thread 'main' panicked at DatabaseError: MDBX_INVALID: the environment is not an MDBX file, or different page size

MDBX_INVALID. The storage engine refused to open the database. This database had been running fine. Same data directory, same permissions, same config as the last deploy. Nothing about the database had changed. We hadn't touched it. The only thing that changed was the binary sitting on top of it, and that binary came from a completely unremarkable version bump.

That's the shape of bug that makes you question everything. It's not telling you what's wrong. It's just saying no.

We Couldn't Reproduce It

Locally? Worked fine. Fresh cloud dev boxes? Worked fine. Built it ourselves anywhere we could think of? Fine. The one place it failed was the binary that came out of the actual build container, the one we use for real deployments.

"Works on my machine" usually means prod is right and your laptop is wrong. This was backwards. Every binary built outside the container opened the database. The container's binary couldn't.

Build Container
- OS: Ubuntu 22.04 LTS
- Rust: 1.82.0
- Target: x86_64-unknown-linux-gnu
- Cargo.lock: identical
- Result: MDBX_INVALID

Dev / Runtime Box
- OS: Ubuntu 22.04 LTS
- Rust: 1.82.0
- Target: x86_64-unknown-linux-gnu
- Cargo.lock: identical
- Result: binary opens DB

Everything visible looks the same. The binaries are not.

We spent a while not believing this. Same Cargo.lock, same target triple, same compiler version listed in the config. How is the binary different?
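The first cheap check, before any deep ELF digging, is whether the two artifacts are even byte-identical. A minimal sketch; the filenames are placeholders standing in for the real container and local builds:

```shell
# Sketch: confirm two builds of the "same" program actually differ.
# /tmp/node.container and /tmp/node.local stand in for real artifacts;
# here we copy two unrelated binaries just so the example runs.
cp /bin/true /tmp/node.container
cp /bin/false /tmp/node.local
sha256sum /tmp/node.container /tmp/node.local
# Different hashes from "identical" inputs mean the build
# environment, not the source, is where the difference lives.
```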

While we were at it, we checked whether the container's Ubuntu 22.04 base was somehow a factor. It wasn't -- but pulling up the EOL date was its own small surprise. 22.04 goes end-of-life in April 2027, closer than it feels when you're used to thinking of LTS releases as basically forever. Another thing to deal with eventually.

Chasing the Ghost

1. MDBX_INVALID on database open. The database file exists and was created by the same binary version, but the new node refuses to start. No grace, just a panic.

2. Ran mdbx_chk and walked through integrity checks. The database is structurally sound -- pages, trees, and checksums all pass. It's not a corruption issue.

3. Diffed every environment variable and config file between working and broken environments. Identical. Same image, same flags, same data directory, same everything.

4. Binaries built locally open the database fine. Binaries that come out of the actual build container don't. Same source, same Cargo.lock, same target. The difference has to be in how the container builds.

5. Compared the compiled artifacts directly using readelf and ldd. Different section layouts, different relocation patterns. The binaries are structurally different despite identical source.

6. The build container had mold pinned to a 1.x release from years ago. Dev environments were installing the latest 2.x from the package registry. mold shipped breaking changes between those major versions, including section alignment behavior. A silent drift that finally surfaced as an incompatible binary.

We ran mdbx_chk. We diffed every environment variable between working and broken deployments. We stared at config files side by side. We ran the binary under strace. At some point someone suggested maybe the database was subtly corrupt in a way the integrity check didn't catch, so we tried opening it with a known-good binary from a different machine. It opened fine.

The database was fine. The binary was wrong. Specifically our binary, the one built in the real container, was producing something that MDBX rejected on page geometry grounds.

That's when we started looking at the binary itself. Not the source. The actual ELF. readelf, ldd, symbol tables, section layouts. The binaries were structurally different in ways they had no business being.
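The comparison itself is mundane: readelf's section listing diffs like any other text. A sketch, assuming the two binaries sit at hypothetical paths:

```shell
# Dump section headers (-S) in wide format (-W) and diff them.
# bin/node.container and bin/node.local are hypothetical paths.
readelf -SW bin/node.container > /tmp/sections.container
readelf -SW bin/node.local     > /tmp/sections.local
diff -u /tmp/sections.container /tmp/sections.local

# ldd shows whether the dynamic library dependencies diverge too.
ldd bin/node.container
ldd bin/node.local
```

Differences in the Align column of the section table are exactly the kind of thing a picky storage engine can trip over.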

The Linker

Both environments used mold, the fast modern linker popular in Rust builds. Different versions.

The build container had mold pinned. Set it up once, installed a version, moved on. Dev and runtime environments just installed latest off the package registry, so they had a current release. The build container was years behind. Not a patch behind, not a minor version. Years. The world had shipped mold 2.x. The container was still on 1.x. Nobody noticed because nobody looked. It was just "how the container is set up."
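Nothing enforced the pin, and nothing compared it against the world. A guard like the following, a hypothetical CI step, would have surfaced the drift years earlier; the parsing is our assumption about what `mold --version` prints:

```shell
# Fail loudly if the linker is not the version the build was qualified on.
# VERSION_LINE simulates `mold --version` output; in CI, run the real command.
PINNED="2.35.1"
VERSION_LINE="mold 2.35.1 (compatible with GNU ld)"
ACTUAL="$(printf '%s\n' "$VERSION_LINE" | sed -n 's/^mold \([0-9][0-9.]*\).*/\1/p')"
if [ "$ACTUAL" = "$PINNED" ]; then
  echo "linker matches pin ($PINNED)"
else
  echo "linker drift: expected $PINNED, got $ACTUAL" >&2
fi
```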

Dependency drift over time

[Timeline, 2021 to 2026: the build container's mold stays on 1.x while the ecosystem moves to mold 2.x, opening a widening "silent incompatibility zone" -- with "you are here" at 2026.]

The 1.x to 2.x transition wasn't just performance improvements. There were changes to how mold handles certain relocation types and memory section alignment. For most software that doesn't matter. libmdbx is not most software -- it has hard assumptions about page geometry baked into a C library, and internal consistency checks that enforce them.

When the build container's old mold produced a binary with subtly different section alignment, those checks fired. The binary coming out of the real build container looked wrong to the storage engine. Binaries built locally with current mold looked fine.

MDBX_INVALID. An entire day of debugging a perfectly healthy database.

This Isn't Debt

After we found it, someone pulled up the Dockerfile's git log. The mold install line had one entry -- the commit that first set up the build environment, years back. Nobody had touched it since. There was no decision to stay on 1.x. It was just whatever version was current when someone first wired the container together, and then everyone moved on and didn't look again.

Nobody had tested this configuration in years. The mold maintainers hadn't. The Rust toolchain team hadn't. libmdbx certainly hadn't. They'd all moved on. Three years of releases, three years of the ecosystem building and testing against current behavior -- and we were outside all of it.

GitHub Has the Data

630 million repositories. 1.9 million of them with Dockerfiles. 15.5 billion documents indexed. 115 terabytes of raw code. GitHub knows what version of mold is in your build container, when that line was last touched, and what the current release is. They have that information for every repository on the platform.

Dependency Drift Analyzer · 2 warnings
Build toolchain drift detected

mold: build container pins 1.11.0; the current 2.x line is 2.35.1 (3+ years behind; section alignment semantics changed in 2.x)

libgcc: build container pins 12.3; current is 14.2 (ABI changes may affect C FFI boundaries)

powered by LLM analysis · what if this check existed?

Dependabot is enabled on 2.66 million repos. It opened 75 million pull requests in 2022 alone. It's genuinely impressive at what it does -- runtime dependencies, known CVEs, manifest files it understands. What it doesn't do is read apt-get install lines in a Dockerfile. As far as Dependabot is concerned, the build environment doesn't exist.

That's a choice, not a technical limitation. The Dockerfile is in the repository. The commit history is there. The release history of mold is public. The Cargo.lock is committed and tells you exactly what ecosystem you're in. A check that says "your build container pins mold 1.11.0, the current release is 2.35.1, and that gap includes breaking changes to linker semantics" is not a research problem. It's a query over data GitHub already has.
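The mechanical core of such a check is genuinely small. A sketch, with an illustrative Dockerfile and version pattern; a real analyzer would also fetch the tool's public release feed to compare against:

```shell
# Extract pinned toolchain versions from a Dockerfile.
# The Dockerfile below is illustrative, not the one from the post.
cat > /tmp/Dockerfile <<'EOF'
FROM ubuntu:22.04
ARG MOLD_VERSION=1.11.0
RUN apt-get update && apt-get install -y build-essential
EOF
grep -oE '[A-Z_]+_VERSION=[0-9][0-9.]+' /tmp/Dockerfile
# Comparing each extracted pin against the upstream release list is
# the part GitHub already has the data for.
```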

They're not building it.

The Fix

One line. Pin mold to a specific 2.x release in the build container. Rebuild. The node opened its database without complaint.
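The shape of that line, sketched as a Dockerfile fragment. The tarball URL follows mold's published release naming, but treat the install method as an assumption about this particular build container:

```dockerfile
# Pin mold to an explicit 2.x release instead of whatever was current
# when the container was first wired together.
ARG MOLD_VERSION=2.35.1
RUN curl -fsSL "https://github.com/rui314/mold/releases/download/v${MOLD_VERSION}/mold-${MOLD_VERSION}-x86_64-linux.tar.gz" \
    | tar -xz -C /usr/local --strip-components=1
```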

Wednesday ended late.