cosmo-regulus
A working open-source library that turns the radiation-tolerance assumption from a claim into a falsifiable model.
Modern AI workloads run on commercial GPUs (H100, B200, Rubin-class). These are not radiation-hardened. The traditional aerospace answer is rad-hard silicon (RAD750, LEON5, rad-hard FPGAs), which runs 10–100× slower than commercial parts and lags the state of the art by ~10 years. Rad-hard silicon cannot run modern transformer inference at useful throughput. Full stop.
The alternative is software-defined tolerance: detect and correct radiation-induced bit flips at the model-serving layer. The hard question — how aggressive should that tolerance be? — has no published answer. cosmo-regulus produces one.
What It Computes
Two specific quantities, neither of which appears in the published literature:
1. The economic Pareto curve. For a given orbit or surface location, what combination of shielding mass × replica count × weight-scrubbing rate minimizes $ per million tokens at a target quality threshold?
2. The adaptive tolerance policy. Given a live (or simulated) particle-flux signal, what real-time policy of detection and recovery primitives keeps output quality above threshold X with throughput cost below Y%, across both quiet-sun and Solar Energetic Particle (SEP) event conditions?
The detection and recovery primitives exist piecemeal in the literature (ATTNChecker, DAC 2025; ReaLM, DAC 2025; FT-Transformer, 2025; SAVE, USENIX ATC 2025). The controller that turns environment data into a real-time tolerance policy does not.
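To make the shape of such a controller concrete, here is a minimal sketch of a two-level policy with hysteresis. It is not the cosmo-regulus controller: hysteresis thresholding is a stand-in technique chosen for illustration, and every name (`ToleranceLevel`, `HysteresisPolicy`), the flux units, the thresholds, and the storm-mode replica count and scrub interval are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToleranceLevel:
    """One operating point: how much redundancy and scrubbing to run."""
    name: str
    replicas: int
    scrub_interval_h: float


# Placeholder operating points, not values from the library.
QUIET = ToleranceLevel("quiet-sun", replicas=1, scrub_interval_h=168.0)
STORM = ToleranceLevel("sep-event", replicas=3, scrub_interval_h=1.0)


class HysteresisPolicy:
    """Switch between QUIET and STORM based on a particle-flux reading.

    Two thresholds (enter > exit) keep the policy from flapping when the
    flux hovers near a single cut-off.
    """

    def __init__(self, enter_storm: float, exit_storm: float) -> None:
        if exit_storm >= enter_storm:
            raise ValueError("exit_storm must be below enter_storm")
        self.enter_storm = enter_storm
        self.exit_storm = exit_storm
        self.level = QUIET

    def update(self, flux: float) -> ToleranceLevel:
        """Consume one flux sample (arbitrary units) and return the level."""
        if self.level is QUIET and flux >= self.enter_storm:
            self.level = STORM
        elif self.level is STORM and flux <= self.exit_storm:
            self.level = QUIET
        return self.level


policy = HysteresisPolicy(enter_storm=10.0, exit_storm=5.0)
# Flux rises past 10, relaxes to 7 (still in storm mode), then drops below 5.
trace = [policy.update(f).name for f in [1.0, 12.0, 7.0, 4.0]]
print(trace)
```

A real policy would presumably choose among more than two levels and weigh the throughput cost of each against the quality threshold. The sketch only shows where a flux signal enters and a tolerance setting comes out.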
Anchored on Measured Data
The fault model is grounded in the first time-resolved dose-rate measurement on the lunar surface, cross-checked against an orbital measurement:
- Chang'E-4 LND — Zhang et al., Science Advances 2020. Reports 13.2 ± 1 µGy(Si)/hr dose rate on the lunar far-side surface, equivalent to ~116 mGy(Si)/yr unshielded, with quality factor ⟨Q⟩ ≈ 4.3.
- LRO CRaTER — Mazur et al., Space Weather 2011. Reports ~130 mGy(Si)/yr behind <2 cm Al in lunar orbit, cross-checking Chang'E-4 within ~15%.
Every downstream λ_SEU (per-GPU bit-flip rate) number is traceable to those measurements rather than to CREME96 extrapolation.
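The unit conversion and cross-check quoted above can be reproduced in a few lines. This sketch uses only the figures stated in this section. The variable names are illustrative, and an 8,760-hour year is assumed.

```python
# Figures quoted in this section; names are illustrative only.
LND_DOSE_RATE_UGY_PER_HR = 13.2    # Chang'E-4 LND, µGy(Si)/hr
LND_UNCERTAINTY_UGY_PER_HR = 1.0   # quoted ± on the LND rate
CRATER_DOSE_MGY_PER_YR = 130.0     # LRO CRaTER, mGy(Si)/yr
HOURS_PER_YEAR = 24 * 365          # assumption: 8,760 h, no leap day


def ugy_per_hr_to_mgy_per_yr(rate_ugy_per_hr: float) -> float:
    """Convert an hourly dose rate in µGy to an annual dose in mGy."""
    return rate_ugy_per_hr * HOURS_PER_YEAR / 1000.0


lnd_annual = ugy_per_hr_to_mgy_per_yr(LND_DOSE_RATE_UGY_PER_HR)
lnd_annual_unc = ugy_per_hr_to_mgy_per_yr(LND_UNCERTAINTY_UGY_PER_HR)
relative_gap = (CRATER_DOSE_MGY_PER_YR - lnd_annual) / lnd_annual

print(f"LND annual dose: {lnd_annual:.1f} ± {lnd_annual_unc:.1f} mGy(Si)/yr")
print(f"CRaTER vs LND gap: {relative_gap:.1%}")
```

The hourly rate scales to about 115.6 mGy(Si)/yr, and the CRaTER figure sits roughly 12% above it, consistent with the "within ~15%" statement.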
First-Cut Result
The current pareto command sweeps 84 (shielding × replica × scrub) cells under peak-SEP worst-case design assumptions and identifies the Pareto-optimal frontier at quality ≥ 0.95:
| Shielding | Replicas | Scrub interval | Quality | Shielding cost ($) | Cost ($/M tokens) |
|---|---|---|---|---|---|
| 100 cm regolith | 1 | 168 h | 0.9627 | 2,000 | 0.15 |
| 50 cm regolith | 2 | 168 h | 0.9755 | 1,000 | 0.30 |
| 25 cm regolith | 3 | 168 h | 0.9723 | 500 | 0.46 |
These are first-cut numbers. Several coefficients are planning placeholders pending vendor-specific HBM cross-section data and a refined LET-spectrum-integrated shielding model. The full assumptions ledger lives in the repository at docs/results.md.
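For readers unfamiliar with Pareto filtering, the sketch below shows a generic non-dominated filter applied to the three rows above. It is an illustration, not the library's `pareto` implementation. The choice of objectives (maximize quality, minimize shielding cost, minimize cost per million tokens) is an assumption, the `Cell` type is hypothetical, and the last two cells are made-up examples added only to show what gets filtered out.

```python
from typing import List, NamedTuple


class Cell(NamedTuple):
    shielding: str
    replicas: int
    scrub_hours: int
    quality: float
    shielding_cost: float   # $
    cost_per_mtok: float    # $ per million tokens


def dominates(a: Cell, b: Cell) -> bool:
    """True if a is at least as good as b everywhere and better somewhere."""
    no_worse = (a.quality >= b.quality
                and a.shielding_cost <= b.shielding_cost
                and a.cost_per_mtok <= b.cost_per_mtok)
    strictly_better = (a.quality > b.quality
                       or a.shielding_cost < b.shielding_cost
                       or a.cost_per_mtok < b.cost_per_mtok)
    return no_worse and strictly_better


def pareto_front(cells: List[Cell], min_quality: float = 0.95) -> List[Cell]:
    """Drop cells below the quality floor, then drop dominated cells."""
    feasible = [c for c in cells if c.quality >= min_quality]
    return [c for c in feasible
            if not any(dominates(other, c) for other in feasible)]


cells = [
    # The three rows reported in the table above.
    Cell("100 cm regolith", 1, 168, 0.9627, 2000, 0.15),
    Cell("50 cm regolith", 2, 168, 0.9755, 1000, 0.30),
    Cell("25 cm regolith", 3, 168, 0.9723, 500, 0.46),
    # Made-up cells for illustration only, not library output.
    Cell("made-up dominated cell", 1, 168, 0.9600, 2000, 0.20),
    Cell("made-up low-quality cell", 1, 168, 0.9400, 100, 0.10),
]

front = pareto_front(cells)
for cell in front:
    print(cell.shielding, cell.replicas, cell.quality, cell.cost_per_mtok)
```

Under these assumed objectives all three reported rows survive. The 25 cm row costs more per token than the 50 cm row at slightly lower quality, but it stays on the frontier because its shielding cost is lower.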
Built On
| Source | Role |
|---|---|
| ReaLM (Xie et al., DAC 2025) | LLM-inference fault-model methodology |
| SAVE (Zheng et al., USENIX ATC 2025) | Software-only fault tolerance on commodity GPU memory |
| RedNet (Wang et al., 2024) | Closest published space-environment → DNN-inference bridge |
| Google Suncatcher (Nov 2025) | Empirical TPU + AMD-host beam-test data; HBM cross-section sanity check |
| Chang'E-4 LND (Zhang et al., 2020) | Primary environment anchor; lunar far-side surface dose |
| LRO CRaTER (Mazur et al., 2011) | Secondary anchor; lunar orbital dose cross-check |
These sources are cited, not forked. The license is Apache-2.0, chosen explicitly because AGPL contagion would be a poison pill for commercial adoption.
Repository
github.com/dubthree/cosmo-regulus — Apache-2.0.
End-to-end CLI; ~50 tests; reproducible Pareto curve in under a minute on a laptop. Limitations documented on page one of the README; the work is positioned as a falsifiable model, not a flight-grade certification.
Why It Sits Here
cosmo-regulus is the de-risking artifact for the load-bearing rad-tolerance assumption in Centradiant's lunar surface compute thesis — see /lunar. The orbital thermal cascade enables compute in orbit; the surface thermal cascade enables compute on the Moon. Both are useless if the GPUs do not survive the environment. This library makes the survival claim quantitative.