cosmo-regulus

A working open-source library that turns the rad-tolerance assumption from a claim into a falsifiable model.

Modern AI workloads run on commercial GPUs (H100, B200, Rubin-class). These are not radiation-hardened. The traditional aerospace answer is rad-hard silicon (RAD750, LEON5, rad-hard FPGAs), which runs 10–100× slower than commercial parts and lags the state of the art by ~10 years. Rad-hard silicon cannot run modern transformer inference at useful throughput. Full stop.

The alternative is software-defined tolerance: detect and correct radiation-induced bit flips at the model-serving layer. The hard question is how aggressive that tolerance should be, and it has no published answer. cosmo-regulus produces one.
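To make "detect and correct at the model-serving layer" concrete, here is a minimal sketch of checksum-based weight scrubbing. It is illustrative only and is not taken from the cosmo-regulus code: the function names, the 64-byte block size, and the use of a pristine "golden" copy as the repair source are all assumptions made for the example.

```python
import zlib

def block_checksums(data: bytes, block_size: int) -> list[int]:
    """Precompute one CRC32 per fixed-size block of the pristine weights."""
    return [zlib.crc32(data[i:i + block_size])
            for i in range(0, len(data), block_size)]

def flip_bit(buf: bytearray, bit_index: int) -> None:
    """Simulate a single-event upset by inverting one bit in place."""
    buf[bit_index // 8] ^= 1 << (bit_index % 8)

def scrub(live: bytearray, golden: bytes, checksums: list[int],
          block_size: int) -> int:
    """Restore every block whose CRC32 no longer matches; return repair count."""
    repaired = 0
    for block, expected in enumerate(checksums):
        start = block * block_size
        end = start + block_size
        if zlib.crc32(live[start:end]) != expected:
            live[start:end] = golden[start:end]  # reload from the pristine copy
            repaired += 1
    return repaired

BLOCK_SIZE = 64                      # hypothetical scrub granularity, in bytes
golden = bytes(range(256)) * 4       # stand-in for a weight tensor (1024 bytes)
checksums = block_checksums(golden, BLOCK_SIZE)

live = bytearray(golden)
flip_bit(live, 1000)                 # upset in byte 125
flip_bit(live, 5000)                 # upset in byte 625
repaired_blocks = scrub(live, golden, checksums, BLOCK_SIZE)
```

The scrub interval is the tuning knob here: a shorter interval bounds how long a flipped weight can corrupt outputs, at the price of memory bandwidth taken away from inference.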

What It Computes

Two specific quantities, neither of which appears in the published literature:

1. The economic Pareto curve. For a given orbit or surface location, what combination of shielding mass × replica count × weight-scrubbing rate minimizes $ per million tokens at a target quality threshold?

2. The adaptive tolerance policy. Given a live (or simulated) particle-flux signal, what real-time policy of detection and recovery primitives keeps output quality above threshold X with throughput cost below Y%, across both quiet-sun and Solar Energetic Particle (SEP) event conditions?

Primitives exist piecemeal in the literature (ATTNChecker DAC 2025; ReaLM DAC 2025; FT-Transformer 2025; SAVE USENIX ATC 2025). The controller that turns environment data into a real-time tolerance policy does not.
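As a rough illustration of what such a controller could look like, the sketch below switches between two tolerance policies with hysteresis, so that a flux signal hovering near the threshold does not cause rapid toggling. Everything in it is hypothetical: the policy names, the replica counts and scrub intervals, the thresholds, and the assumption that flux is expressed as a multiple of the quiet-sun baseline. It is not the policy cosmo-regulus computes.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Policy:
    name: str
    replicas: int
    scrub_interval_h: float

# Hypothetical operating points, chosen only for illustration.
QUIET = Policy("quiet-sun", replicas=1, scrub_interval_h=168.0)
STORM = Policy("sep-event", replicas=3, scrub_interval_h=1.0)

class HysteresisController:
    """Two-state controller: escalate above one threshold, relax below a lower one."""

    def __init__(self, enter_storm: float, exit_storm: float) -> None:
        if exit_storm >= enter_storm:
            raise ValueError("exit threshold must be below the entry threshold")
        self.enter_storm = enter_storm
        self.exit_storm = exit_storm
        self.current = QUIET

    def update(self, flux: float) -> Policy:
        """Consume one flux sample (multiples of quiet-sun baseline) and return the policy."""
        if self.current is QUIET and flux >= self.enter_storm:
            self.current = STORM
        elif self.current is STORM and flux <= self.exit_storm:
            self.current = QUIET
        return self.current

controller = HysteresisController(enter_storm=10.0, exit_storm=3.0)
flux_trace = [1.0, 5.0, 12.0, 6.0, 4.0, 2.0, 1.0]
policy_trace = [controller.update(f).name for f in flux_trace]
# policy_trace: quiet, quiet, storm, storm, storm, quiet, quiet
```

The gap between the two thresholds is what keeps the system in the conservative policy while an SEP event decays, rather than relaxing on the first sample below the entry threshold.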

Anchored on Measured Data

The fault model is grounded on the first time-resolved dose-rate measurement of the lunar surface:

Chang'E-4 LND: Zhang et al., Science Advances 2020. Reports 13.2 ± 1 µGy(Si)/hr dose rate on the lunar far-side surface, equivalent to ~116 mGy(Si)/yr unshielded, with quality factor ⟨Q⟩ ≈ 4.3.
LRO CRaTER: Mazur et al., Space Weather 2011. Reports ~130 mGy(Si)/yr behind <2 cm Al in lunar orbit, cross-checking Chang'E-4 within ~15%.

Every downstream λ_SEU (per-GPU bit-flip rate) number is traceable to those measurements rather than to CREME96 extrapolation.
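The unit conversion and the cross-check quoted above can be reproduced with a few lines of arithmetic. The sketch assumes a 365.25-day year (8,766 hours); the variable and function names are illustrative.

```python
HOURS_PER_YEAR = 365.25 * 24  # 8766 h, assuming a Julian year

LND_RATE_UGY_PER_HR = 13.2    # Chang'E-4 LND dose rate in silicon
LND_SIGMA_UGY_PER_HR = 1.0    # quoted uncertainty
CRATER_MGY_PER_YR = 130.0     # LRO CRaTER, behind <2 cm Al

def annual_dose_mgy(rate_ugy_per_hr: float) -> float:
    """Convert a dose rate in µGy/hr to an annual dose in mGy/yr."""
    return rate_ugy_per_hr * HOURS_PER_YEAR / 1000.0

lnd_annual = annual_dose_mgy(LND_RATE_UGY_PER_HR)
lnd_low = annual_dose_mgy(LND_RATE_UGY_PER_HR - LND_SIGMA_UGY_PER_HR)
lnd_high = annual_dose_mgy(LND_RATE_UGY_PER_HR + LND_SIGMA_UGY_PER_HR)

# Relative gap between the orbital and surface anchors.
relative_gap = (CRATER_MGY_PER_YR - lnd_annual) / lnd_annual

print(f"LND annual dose: {lnd_annual:.1f} mGy/yr "
      f"(range {lnd_low:.1f} to {lnd_high:.1f})")
print(f"CRaTER vs LND gap: {relative_gap:.1%}")
```

The central value comes out near 116 mGy(Si)/yr and the gap to the CRaTER figure near 12%, consistent with the "within ~15%" statement.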

First-Cut Result

The current pareto command sweeps 84 (shielding × replica × scrub) cells under peak-SEP worst-case design assumptions and identifies the Pareto-optimal frontier at quality ≥ 0.95:

Economic Pareto frontier, first-cut model

Shielding	Replicas	Scrub interval	Quality	Shielding cost ($)	$/M-tokens
100 cm regolith	1	168 h	0.9627	2,000	0.15
50 cm regolith	2	168 h	0.9755	1,000	0.30
25 cm regolith	3	168 h	0.9723	500	0.46

These are first-cut numbers. Several coefficients are planning placeholders pending vendor-specific HBM cross-section data and a refined LET-spectrum-integrated shielding model. The full assumptions ledger lives in the repository at docs/results.md.
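For readers unfamiliar with Pareto filtering, the sketch below shows one way a frontier like the table above could be extracted from a swept grid. It is not the pareto command's implementation. It assumes three objectives (maximize quality, minimize shielding cost, minimize $/M-tokens), which is one reading under which all three table rows are mutually non-dominated. The first three cells are the table rows; the last two are hypothetical cells added to show what gets filtered out.

```python
from typing import NamedTuple

class Cell(NamedTuple):
    shielding: str
    replicas: int
    scrub_h: int
    quality: float
    shielding_cost: float
    usd_per_mtok: float

def dominates(a: Cell, b: Cell) -> bool:
    """True if a is at least as good as b on every objective and better on one."""
    no_worse = (a.quality >= b.quality
                and a.shielding_cost <= b.shielding_cost
                and a.usd_per_mtok <= b.usd_per_mtok)
    strictly_better = (a.quality > b.quality
                       or a.shielding_cost < b.shielding_cost
                       or a.usd_per_mtok < b.usd_per_mtok)
    return no_worse and strictly_better

def pareto_front(cells: list[Cell], min_quality: float) -> list[Cell]:
    """Drop cells below the quality floor, then keep the non-dominated ones."""
    feasible = [c for c in cells if c.quality >= min_quality]
    return [c for c in feasible
            if not any(dominates(other, c) for other in feasible)]

cells = [
    Cell("100 cm regolith", 1, 168, 0.9627, 2000, 0.15),  # table row
    Cell("50 cm regolith", 2, 168, 0.9755, 1000, 0.30),   # table row
    Cell("25 cm regolith", 3, 168, 0.9723, 500, 0.46),    # table row
    Cell("hypothetical A", 3, 168, 0.9700, 1000, 0.50),   # dominated by the 50 cm row
    Cell("hypothetical B", 1, 168, 0.9000, 0, 0.10),      # fails quality >= 0.95
]

front = pareto_front(cells, min_quality=0.95)
for cell in front:
    print(cell.shielding, cell.replicas, cell.quality, cell.usd_per_mtok)
```

With only 84 cells, the quadratic pairwise comparison is negligible; a larger sweep would call for a sort-based frontier algorithm.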

Built On

Source	Role
ReaLM (Xie et al., DAC 2025)	LLM-inference fault-model methodology
SAVE (Zheng et al., USENIX ATC 2025)	Software-only fault tolerance on commodity GPU memory
RedNet (Wang et al., 2024)	Closest published space-environment → DNN-inference bridge
Google Suncatcher (Nov 2025)	Empirical TPU + AMD-host beam-test data; HBM cross-section sanity check
Chang'E-4 LND (Zhang et al., 2020)	Primary environment anchor; lunar far-side surface dose
LRO CRaTER (Mazur et al., 2011)	Secondary anchor; lunar orbital dose cross-check

Cited, not forked. The license posture is Apache-2.0, chosen explicitly because AGPL contagion would be a poison pill for commercial adoption.

Repository

github.com/dubthree/cosmo-regulus, Apache-2.0.

End-to-end CLI; ~50 tests; reproducible Pareto curve in under a minute on a laptop. Limitations documented on page one of the README; the work is positioned as a falsifiable model, not a flight-grade certification.

Why It Sits Here

cosmo-regulus is the de-risking artifact for the load-bearing rad-tolerance assumption in Centradiant's lunar surface compute thesis (see /lunar). The cascade is useless if the GPUs do not survive the environment. This library makes the soft-error component of that survival claim quantitative. It now also carries a first-cut model of single-event latch-up: a distinct, potentially destructive failure that ECC, scrubbing, and replicas do not address and that a regolith berm does not stop. The latch-up cross-sections are explicit placeholders, so resolving the real risk still requires part-level beam testing; the model exists to size its economic impact and isolate it as a tracked open item, not to certify it away.