Validation: Detectability, Hindcast, SBC, and Posterior Predictive Checks

This notebook asks four questions about the Neptune inverse problem:

  1. Detectability — before we validate any particular Neptune fit, is the Uranus anomaly itself large enough and structured enough to rule out noise? This is the precondition that motivates solving an inverse problem at all.
  2. Hindcast validation — fit on 1781..cutoff-1, predict the cutoff..1846 holdout. This is the most important scientific check: does an early Neptune fit actually generalize out of sample?
  3. Simulation-based calibration (SBC) — draw synthetic problems from a narrow prior, fit each one, and check whether posterior intervals behave sensibly. The checked-in run is intentionally lightweight and demonstrative, not statistically rigorous.
  4. Posterior predictive checks — draw posterior samples, replicate the Uranus longitude series, and compare the implied residual structure against what was observed.

Runtime notes for a fresh run: - detectability: under a second - hindcast: about 30 seconds - SBC: a few minutes - posterior predictive (synthetic): under a minute - posterior predictive (JPL): optional, cache-guarded

Setup

The notebook works from either the repository root or the docs/ directory. We keep cache artifacts under data/cache/validation/ and rebuild the lightweight trial objects from cached parquet files when present.

Code
from IPython.display import display
import logging
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from discoverneptune.bayesian import run_mcmc
from discoverneptune.data import PLANET_IDS, fetch_planet_vectors
from discoverneptune.plotting import (
    plot_hindcast_1846_error,
    plot_hindcast_rms,
    plot_posterior_predictive_bands,
    plot_posterior_predictive_summaries,
    plot_sbc_coverage,
)
from discoverneptune.simulation import MASS_NEPTUNE_APPROX, NeptuneCandidate, StateVector
from discoverneptune.solver import solve
from discoverneptune.synthetic import build_synthetic_problem
from discoverneptune.validation import (
    HindcastTrial,
    SBCTrial,
    hindcast_trials_to_frame,
    run_hindcast_experiment,
    run_posterior_predictive_check,
    run_sbc_experiment,
    sbc_trials_to_frame,
    summarize_sbc_trials,
)

# plot_style imports added by transform script
from discoverneptune.plot_style import (
    apply_style, palette, category_cycle, dual_render,
    SEQUENTIAL_CMAP, truth_line, annotate_stat, fig_size,
)
apply_style()


logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s  %(levelname)-8s  %(name)s  %(message)s",
    datefmt="%H:%M:%S",
)
logger = logging.getLogger("validation_nb")

cwd = Path.cwd()
candidate_repo_roots = [cwd, cwd.parent]
repo_root = next(
    (candidate for candidate in candidate_repo_roots if (candidate / "pyproject.toml").exists()),
    None,
)
if repo_root is None:
    raise FileNotFoundError(
        "Could not find the repository root from the current working directory."
    )

CACHE_DIR = repo_root / "data" / "cache" / "validation"
CACHE_DIR.mkdir(parents=True, exist_ok=True)

def _sv_from_df(df: pd.DataFrame) -> StateVector:
    return StateVector(
        x=float(df["x"].to_numpy()[0]),
        y=float(df["y"].to_numpy()[0]),
        z=float(df["z"].to_numpy()[0]),
        vx=float(df["vx"].to_numpy()[0]),
        vy=float(df["vy"].to_numpy()[0]),
        vz=float(df["vz"].to_numpy()[0]),
    )

def _hindcast_trial_from_row(row: pd.Series) -> HindcastTrial:
    return HindcastTrial(
        mode=str(row["mode"]),
        cutoff_year=int(row["cutoff_year"]),
        n_fit=int(row["n_fit"]),
        n_holdout=int(row["n_holdout"]),
        in_sample_rms_arcsec=float(row["in_sample_rms_arcsec"]),
        holdout_rms_arcsec=float(row["holdout_rms_arcsec"]),
        holdout_max_abs_arcsec=float(row["holdout_max_abs_arcsec"]),
        longitude_1846_error_deg=float(row["longitude_1846_error_deg"]),
        best_params=np.array(
            [row["best_mass"], row["best_a"], row["best_e"], row["best_l_rad"]],
            dtype=float,
        ),
        best_sse=float(row["best_sse"]),
        converged=bool(row["converged"]),
    )

def _sbc_trial_from_row(row: pd.Series) -> SBCTrial:
    params = ["mass", "a", "e", "l_rad"]
    return SBCTrial(
        noise_arcsec=float(row["noise_arcsec"]),
        truth_params=np.array([row[f"truth_{name}"] for name in params], dtype=float),
        mle_params=np.array([row[f"mle_{name}"] for name in params], dtype=float),
        converged=bool(row["converged"]),
        rank_by_param={name: int(row[f"rank_{name}"]) for name in params},
        coverage_by_level={
            name: {
                0.5: bool(row[f"cov_{name}_50"]),
                0.8: bool(row[f"cov_{name}_80"]),
                0.95: bool(row[f"cov_{name}_95"]),
            }
            for name in params
        },
        interval_width_by_level={
            name: {
                0.5: float(row[f"width_{name}_50"]),
                0.8: float(row[f"width_{name}_80"]),
                0.95: float(row[f"width_{name}_95"]),
            }
            for name in params
        },
        max_rhat=float(row["max_rhat"]),
        min_ess=float(row["min_ess"]),
    )

Detectability: Is the Anomaly Real?

Before validating any particular Neptune fit, we should verify the Uranus anomaly is statistically distinguishable from noise. If it were not, our whole inverse problem would be fitting sky-position noise.

We simulate Uranus’s longitude under a no-Neptune 4-body model (Sun + Jupiter + Saturn + Uranus, all pinned at JPL truth at 1781-01-01), compare to the bundled historical observations, and ask two questions of the residual sequence:

  1. Is the RMS larger than plausible measurement noise? We score \(\chi^2_\mathrm{red} = \sum r_i^2 / (\sigma^2 (N - k))\) with \(\sigma = 1\,\mathrm{arcsec}\) (a generous estimate of 19th-century transit precision) and \(k = 0\) free parameters. A value near 1 would mean the no-Neptune model is consistent with noise; values far above 1 rule it out.
  2. Are the residuals structured, not random? Lag-1 autocorrelation is near 0 for white noise (std \(\approx 1/\sqrt{N} \approx 0.12\) for \(N=66\)). A value near \(+1\) means a smoothly drifting signal — the fingerprint of a missing perturbing body.
Code
from discoverneptune.data import find_bundled_data_dir
from discoverneptune.simulation import simulate_uranus_longitudes_no_neptune

ARCSEC_PER_RAD = 3600.0 * 180.0 / np.pi

data_dir = find_bundled_data_dir()
obs = pd.read_csv(data_dir / 'uranus_observations_1781_1846.csv')
svs = pd.read_csv(data_dir / 'outer_planet_state_vectors_1781_01_01.csv')
state = {
    row.planet: StateVector(row.x, row.y, row.z, row.vx, row.vy, row.vz)
    for row in svs.itertuples(index=False)
}

jd_times = obs['datetime_jd'].to_numpy()
obs_lon = obs['longitude_rad'].to_numpy()
start_jd = float(jd_times[0])

sim_lon = simulate_uranus_longitudes_no_neptune(
    jd_times, start_jd, state['jupiter'], state['saturn'], state['uranus']
)

# Angle-wrapped residuals in arcsec
wrapped = ((obs_lon - sim_lon + np.pi) % (2 * np.pi)) - np.pi
residuals_arcsec = wrapped * ARCSEC_PER_RAD

sigma_assumed_arcsec = 1.0
n_obs = len(residuals_arcsec)
chi2_red = float(np.sum(residuals_arcsec**2)) / (sigma_assumed_arcsec**2 * n_obs)

r = residuals_arcsec - residuals_arcsec.mean()
lag1_autocorr = float(np.sum(r[:-1] * r[1:]) / np.sum(r**2))
white_noise_std = 1.0 / np.sqrt(n_obs)

years = 1781.0 + (jd_times - start_jd) / 365.25

def _build_no_neptune(theme="light"):
    apply_style(theme)
    pal = palette(theme)
    fig, ax = plt.subplots(figsize=fig_size("wide"))
    ax.plot(years, residuals_arcsec, 'o-', lw=0.8, ms=3.0, color=pal.before)
    truth_line(ax, 0, theme=theme)
    ax.set_xlabel('Year')
    ax.set_ylabel('Uranus residual (arcsec)')
    ax.set_title('No-Neptune residuals — 19th-century historical observations')
    return fig

display(dual_render(_build_no_neptune, alt="Uranus longitude residuals without Neptune perturbation"))

print(f'No-Neptune residuals over {n_obs} observations (1781–1846):')
print(f'  RMS                       = {np.sqrt(np.mean(residuals_arcsec**2)):.1f} arcsec')
print(f'  Peak-to-peak              = {np.ptp(residuals_arcsec):.1f} arcsec')
print(f'  chi2_red (sigma=1 arcsec) = {chi2_red:.0f}   (expected ~1 under noise-only)')
print(f'  Lag-1 autocorrelation     = {lag1_autocorr:+.3f}  (expected ~0 +/- {white_noise_std:.2f} under white noise)')
No-Neptune residuals over 66 observations (1781–1846):
  RMS                       = 31.5 arcsec
  Peak-to-peak              = 131.4 arcsec
  chi2_red (sigma=1 arcsec) = 994   (expected ~1 under noise-only)
  Lag-1 autocorrelation     = +0.900  (expected ~0 +/- 0.12 under white noise)

Hindcast Validation

We fit the simplified Neptune model on an early prefix of the synthetic observation window and then score its predictions on the held-out suffix. The key question is how quickly the holdout RMS and 1846-position error improve as more years accumulate.

Code
TRUE_NEPTUNE = NeptuneCandidate(
    mass_solar=MASS_NEPTUNE_APPROX,
    a=30.07,
    e=0.009,
    l_rad=np.radians(132.0),
)
HINDCAST_CACHE = CACHE_DIR / "hindcast_synthetic.parquet"

if HINDCAST_CACHE.exists():
    logger.info("Loading cached hindcast results from %s", HINDCAST_CACHE)
    hindcast_df = pd.read_parquet(HINDCAST_CACHE)
else:
    logger.info("Building synthetic benchmark problem...")
    problem = build_synthetic_problem(TRUE_NEPTUNE)
    logger.info("Running hindcast experiment...")
    hindcast_trials = run_hindcast_experiment(
        jd_times=problem["jd_times"],
        truth_longs=problem["truth_longs"],
        start_jd=problem["start_jd"],
        jupiter_sv=problem["jupiter_sv"],
        saturn_sv=problem["saturn_sv"],
        uranus_sv=problem["uranus_sv"],
        cutoff_years=[1800, 1810, 1820, 1830, 1840],
        mode="synthetic",
        solve_kwargs={"de_maxiter": 30, "de_popsize": 5, "seed": 42},
    )
    hindcast_df = hindcast_trials_to_frame(hindcast_trials)
    hindcast_df.to_parquet(HINDCAST_CACHE)

hindcast_trials = [_hindcast_trial_from_row(row) for _, row in hindcast_df.iterrows()]
problem = build_synthetic_problem(TRUE_NEPTUNE)
hindcast_df

display(dual_render(plot_hindcast_rms, hindcast_trials, alt="Hindcast RMS vs cutoff year"))
display(dual_render(plot_hindcast_1846_error, hindcast_trials, alt="1846 longitude error vs training cutoff year"))

Simulation-Based Calibration

SBC repeatedly simulates synthetic problems from a narrow prior near the true Neptune regime, reruns the usual solve + MCMC pipeline, and checks whether posterior intervals cover the truth at roughly the advertised rates. The notebook run below is lightweight: it uses two noise levels and a small replicate count so the surface is reviewable and cacheable, not statistically definitive.

Code
noise_levels = [0.0, 2.0]
sbc_frames = []
sbc_trials = []

for noise_arcsec in noise_levels:
    cache_path = CACHE_DIR / f"sbc_noise_{noise_arcsec:g}.parquet"
    if cache_path.exists():
        logger.info("Loading cached SBC results from %s", cache_path)
        noise_df = pd.read_parquet(cache_path)
    else:
        logger.info(
            "Running SBC experiment for noise=%s arcsec (4 replicates)...",
            noise_arcsec,
        )
        noise_trials = run_sbc_experiment(
            n_replicates=4,
            noise_arcsec=noise_arcsec,
            mcmc_steps=400,
            mcmc_walkers=12,
            de_maxiter=20,
            de_popsize=4,
            seed=42,
        )
        noise_df = sbc_trials_to_frame(noise_trials)
        noise_df.to_parquet(cache_path)
    sbc_frames.append(noise_df)
    sbc_trials.extend(_sbc_trial_from_row(row) for _, row in noise_df.iterrows())

sbc_df = pd.concat(sbc_frames, ignore_index=True)
sbc_summary = summarize_sbc_trials(sbc_trials)
print(sbc_summary.to_string(index=False))

display(dual_render(plot_sbc_coverage, sbc_summary, alt="SBC coverage diagnostic"))
 noise_arcsec param  n_trials  converged_rate  median_rank  cov_50  width_median_50  cov_80  width_median_80  cov_95  width_median_95
          0.0  mass         4            0.75         77.0    0.75         0.000009    0.75         0.000014    0.75         0.000017
          0.0     a         4            0.75         20.0    0.25         1.828814    0.50         2.864647    0.75         3.263406
          0.0     e         4            0.75          0.0    0.00         0.020420    0.00         0.032099    0.00         0.038397
          0.0 l_rad         4            0.75         19.5    0.00         0.044133    0.25         0.069407    0.25         0.081105
          2.0  mass         4            0.75          8.0    0.25         0.000013    0.25         0.000029    0.50         0.000042
          2.0     a         4            0.75         14.0    0.25         1.613499    0.25         3.374696    0.50         5.788712
          2.0     e         4            0.75         16.0    0.00         0.025829    0.25         0.046738    1.00         0.075322
          2.0 l_rad         4            0.75        210.0    0.25         0.046195    1.00         0.092593    1.00         0.144568

Posterior Predictive Checks

Posterior predictive checks ask what the fitted model says Uranus residuals should look like under repeated draws from the posterior. In synthetic mode the residual band should stay narrow and centered near zero; in JPL mode the same machinery exposes the residual structure that the simplified 4-parameter model cannot capture.

Code
SYNTHETIC_POSTERIOR_CACHE = CACHE_DIR / "synthetic_posterior.npy"

if SYNTHETIC_POSTERIOR_CACHE.exists():
    logger.info(
        "Loading cached synthetic posterior from %s", SYNTHETIC_POSTERIOR_CACHE
    )
    synthetic_samples = np.load(SYNTHETIC_POSTERIOR_CACHE)
else:
    logger.info("Running synthetic MLE + MCMC for posterior predictive check...")
    synthetic_mle = solve(
        jd_times=problem["jd_times"],
        truth_longs=problem["truth_longs"],
        start_jd=problem["start_jd"],
        jupiter_sv=problem["jupiter_sv"],
        saturn_sv=problem["saturn_sv"],
        uranus_sv=problem["uranus_sv"],
        de_maxiter=30,
        de_popsize=5,
        seed=42,
    )
    synthetic_mcmc = run_mcmc(
        mle_result=synthetic_mle,
        jd_times=problem["jd_times"],
        truth_longs=problem["truth_longs"],
        start_jd=problem["start_jd"],
        jupiter_sv=problem["jupiter_sv"],
        saturn_sv=problem["saturn_sv"],
        uranus_sv=problem["uranus_sv"],
        n_walkers=16,
        n_steps=800,
        seed=42,
        progress=False,
    )
    synthetic_samples = synthetic_mcmc.flat_samples
    np.save(SYNTHETIC_POSTERIOR_CACHE, synthetic_samples)

synthetic_ppc = run_posterior_predictive_check(
    flat_samples=synthetic_samples,
    jd_times=problem["jd_times"],
    truth_longs=problem["truth_longs"],
    start_jd=problem["start_jd"],
    jupiter_sv=problem["jupiter_sv"],
    saturn_sv=problem["saturn_sv"],
    uranus_sv=problem["uranus_sv"],
    mode="synthetic",
    max_draws=100,
)

display(dual_render(plot_posterior_predictive_bands, synthetic_ppc, alt="Posterior predictive bands — synthetic mode"))
display(dual_render(plot_posterior_predictive_summaries, synthetic_ppc, alt="Posterior predictive summaries — synthetic mode"))

JPL mode (optional, cache-guarded)

This cell only runs if cached posterior samples are available. It uses the usual JPL Uranus/Jupiter/Saturn state-vector path, then feeds the cached Neptune posterior draws through the same posterior-predictive machinery as the synthetic case.

Code
JPL_POSTERIOR_CACHE = CACHE_DIR / "jpl_posterior_samples.npy"

if JPL_POSTERIOR_CACHE.exists():
    logger.info("Loading cached JPL posterior from %s", JPL_POSTERIOR_CACHE)
    jpl_samples = np.load(JPL_POSTERIOR_CACHE)

    df_uranus = fetch_planet_vectors(PLANET_IDS["uranus"], use_cache=True)
    df_jupiter = fetch_planet_vectors(PLANET_IDS["jupiter"], use_cache=True)
    df_saturn = fetch_planet_vectors(PLANET_IDS["saturn"], use_cache=True)

    jpl_ppc = run_posterior_predictive_check(
        flat_samples=jpl_samples,
        jd_times=df_uranus["datetime_jd"].to_numpy(dtype=float),
        truth_longs=df_uranus["longitude_rad"].to_numpy(dtype=float),
        start_jd=float(df_uranus["datetime_jd"].to_numpy(dtype=float)[0]),
        jupiter_sv=_sv_from_df(df_jupiter),
        saturn_sv=_sv_from_df(df_saturn),
        uranus_sv=_sv_from_df(df_uranus),
        mode="jpl",
        max_draws=100,
    )

    display(dual_render(plot_posterior_predictive_bands, jpl_ppc, alt="Posterior predictive bands — JPL mode"))
    display(dual_render(plot_posterior_predictive_summaries, jpl_ppc, alt="Posterior predictive summaries — JPL mode"))
else:
    print("No cached JPL posterior samples; skipping JPL predictive check.")
    print(
        "To generate them, run `uv run python -m discoverneptune --jpl --mcmc` "
        "and save the flat samples to data/cache/validation/jpl_posterior_samples.npy"
    )
No cached JPL posterior samples; skipping JPL predictive check.
To generate them, run `uv run python -m discoverneptune --jpl --mcmc` and save the flat samples to data/cache/validation/jpl_posterior_samples.npy

Interpretation

Detectability. The no-Neptune residuals are grossly inconsistent with measurement noise — chi2_red runs into the hundreds under any plausible sigma, and lag-1 autocorrelation is near +1 (smoothly drifting signal rather than random scatter). Both statistics confirm that the 19th-century Uranus anomaly is real and structured; the inverse problem we attempt in the rest of this project is warranted, not a search for a signal inside noise.

Hindcast. The interesting quantity is the holdout error: if a fit learned a real Neptune-like signal, holdout RMS and the 1846-position error should improve as later cutoff years include more of the historical window.

SBC. With only a handful of replicates per noise level, empirical coverage is noisy. The point of this notebook is to show the calibration mechanism and produce a reusable result surface, not to claim definitive frequentist coverage.

Posterior predictive. Synthetic mode should look nearly self-consistent. JPL mode is where model misspecification becomes visible: residual structure that sits persistently outside the predictive envelope is evidence that the 4-parameter Neptune model cannot reproduce the richer ephemeris exactly.