Skip to contents

Goodness-of-fit calibration

The martingale-residual goodness-of-fit family (gof_univariate(), gof_global(), gof_auxiliary(), martingale_residuals()) reports p-values against an analytic null. This article documents an empirical study of when those p-values are exact and when they should be read as diagnostic. In short: the tests are well calibrated under the null and for bounded statistics, and become anti-conservative for unbounded count statistics in the presence of a real effect. This is a finite-sample property of the test, not a coding error — but it changes how the tests should be used.

Method. A type-I error-rate study under correct specification: simulate from a model, fit the same model, and collect GoF p-values. Under a calibrated test these are Uniform(0, 1), so the empirical rejection rate at level α should equal α. Settings below use a 20-actor one-mode network, the Gillespie simulator, 1{,}500 events and 150 replicates per cell unless stated, and a single control per case (the m = 2 design the GoF uses).

library(amorem)
actors <- paste0("a", 1:20)

# one replicate under correct specification: simulate with a known effect,
# fit the same linear model, return the gof_univariate p-value
gof_rep <- function(i, beta, stat = "reciprocity_count", half_life = NULL) {
  set.seed(70000L + i)
  ev <- simulate_relational_events(
    n_events = 1500, senders = actors, receivers = actors,
    endogenous_stats = stat, endogenous_effects = setNames(beta, stat),
    half_life = half_life, method = "gillespie")
  gof_univariate(ev, model = setNames("linear", stat), covariate = stat,
                 half_life = half_life, seed = 90000L + i)$p_value
}

type_I <- function(beta, ...) {
  p <- vapply(1:150, gof_rep, numeric(1), beta = beta, ...)
  c(rejection_05 = mean(p < 0.05), ks_uniformity = ks.test(p, "punif")$p.value)
}

Calibrated under the null

With no true effect the test is calibrated (indeed mildly conservative), and the p-values pass a uniformity check:

true β type-I @ .05 p-value uniformity (KS)
0.0 0.013 0.61
0.1 0.027

Anti-conservative for counts with a real effect

Once a real moderate effect is present, the rejection rate climbs above nominal — and it is not driven by numerical separation in the fit (at β = 0.2 essentially no strata are separated):

true β (reciprocity_count) type-I @ .05 mean separation
0.0 0.013 0.00
0.2 0.147 0.01
0.3 0.213 0.33
0.5 0.107 0.68

The empirical CDF of the p-values bows above the diagonal at small p — the signature of over-rejection:

GoF p-value calibration: empirical CDF vs uniform
GoF p-value calibration: empirical CDF vs uniform

The driver is the nature of the covariate, not its scale. At a matched moderate effect (β = 0.2, negligible separation):

covariate nature type-I @ .05
recency bounded, resets 0.053
reciprocity_exp_decay bounded by decay 0.093
reciprocity_count unbounded accumulating count 0.147
type_I(0.2, stat = "recency")
type_I(0.2, stat = "reciprocity_exp_decay", half_life = 5)
type_I(0.2, stat = "reciprocity_count")

A bounded statistic calibrates; an unbounded count does not. The cumulative-residual-process asymptotics require each event’s increment to be negligible relative to the whole, and an accumulating count becomes heavy-tailed late in the stream (a few late, high-count events dominate the process), which violates that condition.


What does not fix it

attempted remedy type-I @ β = 0.2 outcome
scale() the covariate 0.147 no change — the test normalises by the residual variance, so any linear rescaling cancels exactly
log1p() / sqrt() the covariate 0.88 / 0.88 worse — transforming the term misspecifies a linear-in-count truth, and the test correctly rejects
multiplier-bootstrap null 0.113 partial only
parametric bootstrap (re-simulate under β̂, refit) 0.158 no improvement (≈ analytic 0.158)

The parametric bootstrap is the decisive test: it is the most principled reference distribution available — it re-simulates under the fitted null and re-fits, fully accounting for the estimation effect. Because it over-rejects identically to the analytic null (R = 120: both 0.158, 95% CI 0.098–0.236; p-values fail uniformity, KS p = 0.008), the problem is not the null-distribution approximation. The observed statistic is genuinely larger than the fitted model predicts — a real finite-sample lack-of-fit of the case-1-control (m = 2) construction for count covariates with a true effect.


Guidance and outlook

  • Use under the null and for bounded statistics (recency, the decayed variants): the p-values are trustworthy.
  • For raw, unbounded count statistics with a real effect, read the GoF as a diagnostic — inspect the cumulative residual process and martingale_residuals() — rather than relying on the exact p-value, which is anti-conservative (type-I ≈ 0.15 at the settings above).
  • Open direction. A proper fix is a methods question rather than a one-line patch. The most promising candidate is a multi-control (risk-set / softmax) GoF in place of the single-control m = 2 degenerate binomial (the GoF currently fixes n_controls = 1), or a Khmaladze martingale transform that removes the estimation effect from the residual process. Quantifying the rejection rate as the number of controls grows is the natural next experiment.

See the Estimation guide for the GoF API and Validation experiments for the recovery and parity studies (E1–E7).