cosmo-regulus

A working open-source library that turns the rad-tolerance assumption from a claim into a falsifiable model.

Modern AI workloads run on commercial GPUs (H100, B200, Rubin-class). These are not radiation-hardened. The traditional aerospace answer is rad-hard silicon (RAD750, LEON5, rad-hard FPGAs), which runs 10–100× slower than commercial parts and lags the state of the art by ~10 years. Rad-hard silicon cannot run modern transformer inference at useful throughput. Full stop.

The alternative is software-defined tolerance: detect and correct radiation-induced bit flips at the model-serving layer. The hard question — how aggressive should that tolerance be? — has no published answer. cosmo-regulus produces one.


What It Computes

Two specific quantities, neither of which appears in the published literature:

1. The economic Pareto curve. For a given orbit or surface location, what combination of shielding mass × replica count × weight-scrubbing rate minimizes $ per million tokens at a target quality threshold?

2. The adaptive tolerance policy. Given a live (or simulated) particle-flux signal, what real-time policy of detection and recovery primitives keeps output quality above threshold X with throughput cost below Y%, across both quiet-sun and Solar Energetic Particle (SEP) event conditions?

Primitives exist piecemeal in the literature (ATTNChecker DAC 2025; ReaLM DAC 2025; FT-Transformer 2025; SAVE USENIX ATC 2025). The controller that turns environment data into a real-time tolerance policy does not.
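
To make the shape of such a controller concrete, here is a minimal sketch of a flux-driven policy with hysteresis. It is not the cosmo-regulus controller: the class name, the three level names, the flux units, and every threshold are hypothetical placeholders chosen for illustration. The sketch escalates protection as soon as flux crosses an upper threshold and relaxes it only after flux falls below a lower one, so the policy does not flap during a noisy SEP onset.

```python
from dataclasses import dataclass

# Hypothetical tolerance levels, ordered from cheapest to most protective.
LEVELS = ("baseline", "elevated", "storm")


@dataclass
class HysteresisController:
    # Flux (arbitrary units) at which the policy steps up into level i+1.
    # Placeholder values, not calibrated against any instrument.
    up: tuple = (10.0, 100.0)
    # The policy steps back down out of level i+1 only below down[i].
    down: tuple = (5.0, 50.0)
    level: int = 0

    def update(self, flux: float) -> str:
        # Escalate as far as the current flux warrants.
        while self.level < len(self.up) and flux >= self.up[self.level]:
            self.level += 1
        # Relax only once flux is clearly below the lower threshold.
        while self.level > 0 and flux < self.down[self.level - 1]:
            self.level -= 1
        return LEVELS[self.level]


controller = HysteresisController()
trace = [controller.update(f) for f in (1, 12, 7, 4, 150, 60, 40, 3)]
print(trace)
```

In a real policy each level would map to a set of detection and recovery primitives and a measured throughput cost; the sketch covers only the switching logic.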


Anchored on Measured Data

The fault model is grounded in the first time-resolved dose-rate measurement on the lunar surface:

  • Chang'E-4 LND — Zhang et al., Science Advances 2020. Reports 13.2 ± 1 µGy(Si)/hr dose rate on the lunar far-side surface, equivalent to ~116 mGy(Si)/yr unshielded, with quality factor ⟨Q⟩ ≈ 4.3.
  • LRO CRaTER — Mazur et al., Space Weather 2011. Reports ~130 mGy(Si)/yr behind <2 cm Al in lunar orbit, cross-checking Chang'E-4 within ~15%.
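
The unit conversion and cross-check quoted above can be reproduced in a few lines. This is plain arithmetic on the two published central values, assuming a 365.25-day year; the function and variable names are illustrative and not part of the library.

```python
HOURS_PER_YEAR = 365.25 * 24  # 8766 h, assuming a Julian year

lnd_rate_ugy_per_hr = 13.2   # Chang'E-4 LND central value, µGy(Si)/hr
crater_mgy_per_yr = 130.0    # LRO CRaTER approximate value, mGy(Si)/yr


def annual_dose_mgy(rate_ugy_per_hr: float) -> float:
    """Convert an hourly dose rate in µGy/hr to an annual dose in mGy/yr."""
    return rate_ugy_per_hr * HOURS_PER_YEAR / 1000.0


lnd_mgy_per_yr = annual_dose_mgy(lnd_rate_ugy_per_hr)
# Relative gap between the orbital and surface figures, using LND as reference.
relative_gap = (crater_mgy_per_yr - lnd_mgy_per_yr) / lnd_mgy_per_yr

print(f"LND annual dose: {lnd_mgy_per_yr:.1f} mGy(Si)/yr")
print(f"CRaTER vs LND gap: {relative_gap:.1%}")
```

The result is about 115.7 mGy(Si)/yr for the surface measurement and a gap of roughly 12% to the orbital figure, consistent with the "~116" and "within ~15%" statements.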

Every downstream λ_SEU (per-GPU bit-flip rate) number is traceable to those measurements rather than to CREME96 extrapolation.
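
Once a λ_SEU value is fixed, its interaction with the scrub interval follows from elementary Poisson statistics. The sketch below assumes upsets arrive independently at a constant rate, which is a modelling assumption and may not match how the library treats SEP bursts. The rate used in the example is a made-up placeholder, not a cosmo-regulus output; the 168 h interval is the scrub setting that appears in the first-cut results.

```python
import math


def expected_flips(rate_per_hr: float, scrub_interval_hr: float) -> float:
    """Mean number of bit flips accumulated between two scrubs."""
    return rate_per_hr * scrub_interval_hr


def prob_at_least_one_flip(rate_per_hr: float, scrub_interval_hr: float) -> float:
    """P(N >= 1) for a Poisson process: 1 - exp(-lambda * T)."""
    # -expm1(-x) evaluates 1 - exp(-x) without cancellation for small x.
    return -math.expm1(-expected_flips(rate_per_hr, scrub_interval_hr))


# Placeholder rate for illustration only (flips per GPU per hour).
hypothetical_rate = 1e-3
weekly_scrub_hr = 168.0

mean_flips = expected_flips(hypothetical_rate, weekly_scrub_hr)
p_dirty = prob_at_least_one_flip(hypothetical_rate, weekly_scrub_hr)
print(f"mean flips per interval: {mean_flips:.3f}, P(at least one): {p_dirty:.3f}")
```

Shortening the scrub interval lowers the chance that a replica serves from corrupted weights but costs throughput, which is the trade the Pareto sweep prices.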


First-Cut Result

The current pareto command sweeps 84 (shielding × replica × scrub) cells under peak-SEP worst-case design assumptions and identifies the Pareto-optimal frontier at quality ≥ 0.95:

Economic Pareto frontier (first-cut model)

Shielding          Replicas   Scrub interval   Quality   Shielding $   $/M-tokens
100 cm regolith    1          168 h            0.9627    2,000         0.15
50 cm regolith     2          168 h            0.9755    1,000         0.30
25 cm regolith     3          168 h            0.9723    500           0.46

First-cut numbers. Several coefficients are planning placeholders pending vendor-specific HBM cross-section data and a refined LET-spectrum-integrated shielding model. Full assumptions ledger lives in the repository at docs/results.md.
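
For readers unfamiliar with Pareto filtering, the selection step can be sketched as follows. This is a generic non-dominated filter, not the code behind the pareto command. It assumes three objectives: maximize quality, minimize shielding cost, and minimize $/M-tokens. That choice is an inference from the table, since on quality and $/M-tokens alone the 25 cm row would be dominated by the 50 cm row. The first three cells are the published rows; the last two are hypothetical and exist only to show a dominated cell and an infeasible one being dropped.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    shielding: str
    replicas: int
    quality: float         # higher is better
    shielding_cost: float  # lower is better
    usd_per_mtok: float    # lower is better


def dominates(a: Cell, b: Cell) -> bool:
    """True if a is no worse than b on every objective and better on one."""
    no_worse = (a.quality >= b.quality
                and a.shielding_cost <= b.shielding_cost
                and a.usd_per_mtok <= b.usd_per_mtok)
    better = (a.quality > b.quality
              or a.shielding_cost < b.shielding_cost
              or a.usd_per_mtok < b.usd_per_mtok)
    return no_worse and better


def pareto_front(cells, min_quality=0.95):
    """Drop cells below the quality floor, then drop dominated cells."""
    feasible = [c for c in cells if c.quality >= min_quality]
    return [c for c in feasible
            if not any(dominates(other, c) for other in feasible)]


cells = [
    Cell("100 cm regolith", 1, 0.9627, 2000, 0.15),
    Cell("50 cm regolith", 2, 0.9755, 1000, 0.30),
    Cell("25 cm regolith", 3, 0.9723, 500, 0.46),
    # Hypothetical cells, not from the published sweep:
    Cell("100 cm regolith", 2, 0.9627, 2000, 0.30),  # dominated by row one
    Cell("no shielding", 1, 0.90, 0, 0.15),          # below the quality floor
]

front = pareto_front(cells)
for cell in front:
    print(cell)
```

Under these assumptions the three published rows survive and both hypothetical cells are removed.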


Built On

Source                                  Role
ReaLM (Xie et al., DAC 2025)            LLM-inference fault-model methodology
SAVE (Zheng et al., USENIX ATC 2025)    Software-only fault tolerance on commodity GPU memory
RedNet (Wang et al., 2024)              Closest published space-environment → DNN-inference bridge
Google Suncatcher (Nov 2025)            Empirical TPU + AMD-host beam-test data; HBM cross-section sanity check
Chang'E-4 LND (Zhang et al., 2020)      Primary environment anchor; lunar far-side surface dose
LRO CRaTER (Mazur et al., 2011)         Secondary anchor; lunar orbital dose cross-check

Cited, not forked. The license posture is Apache-2.0 — chosen explicitly because AGPL contagion would be a poison pill for commercial adoption.


Repository

github.com/dubthree/cosmo-regulus — Apache-2.0.

End-to-end CLI; ~50 tests; reproducible Pareto curve in under a minute on a laptop. Limitations documented on page one of the README; the work is positioned as a falsifiable model, not a flight-grade certification.


Why It Sits Here

cosmo-regulus is the de-risking artifact for the load-bearing rad-tolerance assumption in Centradiant's lunar surface compute thesis — see /lunar. The orbital thermal cascade enables compute in orbit; the surface thermal cascade enables compute on the Moon. Both are useless if the GPUs do not survive the environment. This library makes the survival claim quantitative.