Validation: Detectability, Hindcast, SBC, and Posterior Predictive Checks
This notebook asks four questions about the Neptune inverse problem:
Detectability — before we validate any particular Neptune fit, is the Uranus anomaly itself large enough and structured enough to rule out noise? This is the precondition that motivates solving an inverse problem at all.
Hindcast validation — fit on 1781..cutoff-1, predict the cutoff..1846 holdout. This is the most important scientific check: does an early Neptune fit actually generalize out of sample?
Simulation-based calibration (SBC) — draw synthetic problems from a narrow prior, fit each one, and check whether posterior intervals behave sensibly. The checked-in run is intentionally lightweight and demonstrative, not statistically rigorous.
Posterior predictive checks — draw posterior samples, replicate the Uranus longitude series, and compare the implied residual structure against what was observed.
Runtime notes for a fresh run: - detectability: under a second - hindcast: about 30 seconds - SBC: a few minutes - posterior predictive (synthetic): under a minute - posterior predictive (JPL): optional, cache-guarded
Setup
The notebook works from either the repository root or the docs/ directory. We keep cache artifacts under data/cache/validation/ and rebuild the lightweight trial objects from cached parquet files when present.
Code
from IPython.display import displayimport loggingfrom pathlib import Pathimport matplotlib.pyplot as pltimport numpy as npimport pandas as pdfrom discoverneptune.bayesian import run_mcmcfrom discoverneptune.data import PLANET_IDS, fetch_planet_vectorsfrom discoverneptune.plotting import ( plot_hindcast_1846_error, plot_hindcast_rms, plot_posterior_predictive_bands, plot_posterior_predictive_summaries, plot_sbc_coverage,)from discoverneptune.simulation import MASS_NEPTUNE_APPROX, NeptuneCandidate, StateVectorfrom discoverneptune.solver import solvefrom discoverneptune.synthetic import build_synthetic_problemfrom discoverneptune.validation import ( HindcastTrial, SBCTrial, hindcast_trials_to_frame, run_hindcast_experiment, run_posterior_predictive_check, run_sbc_experiment, sbc_trials_to_frame, summarize_sbc_trials,)# plot_style imports added by transform scriptfrom discoverneptune.plot_style import ( apply_style, palette, category_cycle, dual_render, SEQUENTIAL_CMAP, truth_line, annotate_stat, fig_size,)apply_style()logging.basicConfig( level=logging.INFO,format="%(asctime)s%(levelname)-8s%(name)s%(message)s", datefmt="%H:%M:%S",)logger = logging.getLogger("validation_nb")cwd = Path.cwd()candidate_repo_roots = [cwd, cwd.parent]repo_root =next( (candidate for candidate in candidate_repo_roots if (candidate /"pyproject.toml").exists()),None,)if repo_root isNone:raiseFileNotFoundError("Could not find the repository root from the current working directory." )CACHE_DIR = repo_root /"data"/"cache"/"validation"CACHE_DIR.mkdir(parents=True, exist_ok=True)def _sv_from_df(df: pd.DataFrame) -> StateVector:return StateVector( x=float(df["x"].to_numpy()[0]), y=float(df["y"].to_numpy()[0]), z=float(df["z"].to_numpy()[0]), vx=float(df["vx"].to_numpy()[0]), vy=float(df["vy"].to_numpy()[0]), vz=float(df["vz"].to_numpy()[0]), )def _hindcast_trial_from_row(row: pd.Series) -> HindcastTrial:return HindcastTrial( mode=str(row["mode"]), cutoff_year=int(row["cutoff_year"]), n_fit=int(row["n_fit"]), n_holdout=int(row["n_holdout"]), in_sample_rms_arcsec=float(row["in_sample_rms_arcsec"]), holdout_rms_arcsec=float(row["holdout_rms_arcsec"]), holdout_max_abs_arcsec=float(row["holdout_max_abs_arcsec"]), longitude_1846_error_deg=float(row["longitude_1846_error_deg"]), best_params=np.array( [row["best_mass"], row["best_a"], row["best_e"], row["best_l_rad"]], dtype=float, ), best_sse=float(row["best_sse"]), converged=bool(row["converged"]), )def _sbc_trial_from_row(row: pd.Series) -> SBCTrial: params = ["mass", "a", "e", "l_rad"]return SBCTrial( noise_arcsec=float(row["noise_arcsec"]), truth_params=np.array([row[f"truth_{name}"] for name in params], dtype=float), mle_params=np.array([row[f"mle_{name}"] for name in params], dtype=float), converged=bool(row["converged"]), rank_by_param={name: int(row[f"rank_{name}"]) for name in params}, coverage_by_level={ name: {0.5: bool(row[f"cov_{name}_50"]),0.8: bool(row[f"cov_{name}_80"]),0.95: bool(row[f"cov_{name}_95"]), }for name in params }, interval_width_by_level={ name: {0.5: float(row[f"width_{name}_50"]),0.8: float(row[f"width_{name}_80"]),0.95: float(row[f"width_{name}_95"]), }for name in params }, max_rhat=float(row["max_rhat"]), min_ess=float(row["min_ess"]), )
Detectability: Is the Anomaly Real?
Before validating any particular Neptune fit, we should verify the Uranus anomaly is statistically distinguishable from noise. If it were not, our whole inverse problem would be fitting sky-position noise.
We simulate Uranus’s longitude under a no-Neptune 4-body model (Sun + Jupiter + Saturn + Uranus, all pinned at JPL truth at 1781-01-01), compare to the bundled historical observations, and ask two questions of the residual sequence:
Is the RMS larger than plausible measurement noise? We score \(\chi^2_\mathrm{red} = \sum r_i^2 / (\sigma^2 (N - k))\) with \(\sigma = 1\,\mathrm{arcsec}\) (a generous estimate of 19th-century transit precision) and \(k = 0\) free parameters. A value near 1 would mean the no-Neptune model is consistent with noise; values far above 1 rule it out.
Are the residuals structured, not random? Lag-1 autocorrelation is near 0 for white noise (std \(\approx 1/\sqrt{N} \approx 0.12\) for \(N=66\)). A value near \(+1\) means a smoothly drifting signal — the fingerprint of a missing perturbing body.
No-Neptune residuals over 66 observations (1781–1846):
RMS = 31.5 arcsec
Peak-to-peak = 131.4 arcsec
chi2_red (sigma=1 arcsec) = 994 (expected ~1 under noise-only)
Lag-1 autocorrelation = +0.900 (expected ~0 +/- 0.12 under white noise)
Hindcast Validation
We fit the simplified Neptune model on an early prefix of the synthetic observation window and then score its predictions on the held-out suffix. The key question is how quickly the holdout RMS and 1846-position error improve as more years accumulate.
SBC repeatedly simulates synthetic problems from a narrow prior near the true Neptune regime, reruns the usual solve + MCMC pipeline, and checks whether posterior intervals cover the truth at roughly the advertised rates. The notebook run below is lightweight: it uses two noise levels and a small replicate count so the surface is reviewable and cacheable, not statistically definitive.
Posterior predictive checks ask what the fitted model says Uranus residuals should look like under repeated draws from the posterior. In synthetic mode the residual band should stay narrow and centered near zero; in JPL mode the same machinery exposes the residual structure that the simplified 4-parameter model cannot capture.
This cell only runs if cached posterior samples are available. It uses the usual JPL Uranus/Jupiter/Saturn state-vector path, then feeds the cached Neptune posterior draws through the same posterior-predictive machinery as the synthetic case.
No cached JPL posterior samples; skipping JPL predictive check.
To generate them, run `uv run python -m discoverneptune --jpl --mcmc` and save the flat samples to data/cache/validation/jpl_posterior_samples.npy
Interpretation
Detectability. The no-Neptune residuals are grossly inconsistent with measurement noise — chi2_red runs into the hundreds under any plausible sigma, and lag-1 autocorrelation is near +1 (smoothly drifting signal rather than random scatter). Both statistics confirm that the 19th-century Uranus anomaly is real and structured; the inverse problem we attempt in the rest of this project is warranted, not a search for a signal inside noise.
Hindcast. The interesting quantity is the holdout error: if a fit learned a real Neptune-like signal, holdout RMS and the 1846-position error should improve as later cutoff years include more of the historical window.
SBC. With only a handful of replicates per noise level, empirical coverage is noisy. The point of this notebook is to show the calibration mechanism and produce a reusable result surface, not to claim definitive frequentist coverage.
Posterior predictive. Synthetic mode should look nearly self-consistent. JPL mode is where model misspecification becomes visible: residual structure that sits persistently outside the predictive envelope is evidence that the 4-parameter Neptune model cannot reproduce the richer ephemeris exactly.