Advanced variance reduction · Quasi-Monte Carlo · Sobol' sensitivity · Copula tails · Nested uncertainty | engine reproduces value_model.py bit-for-bit · seed 20260609
This report audits the production 2026 Calcutta flight simulator (valuation/valuemodel.py) as a Monte-Carlo estimator and applies a stack of advanced variance-reduction, quasi-Monte-Carlo, global-sensitivity, dependence-modeling, and nested-uncertainty techniques on our actual flights. Every number below comes from a self-contained engine (mcengine.py) that reproduces the production per-hole, 9-hole net better-ball round-robin bit-for-bit (validated to MC noise).
Headline findings:
HOLENOISEFRAC is the leading structural knob; RHO_FIELD is essentially inert (it cancels within matches). Calibration effort belongs on the player distributions, not on the correlation constants.What the production sim does (restated for audit). For each 6-team flight it simulates the full round-robin (15 matches) hole by hole: each player's 9-hole round is built from a per-hole gross-vs-par mean (m18/18) plus four Gaussian shock layers — a flight-wide field/day shock (RHOFIELD=0.15), a per-team partner shock (RHOPARTNERS=0.35), a per-player own round-form shock, and i.i.d. per-hole scatter (HOLENOISEFRAC=0.65). Gross is rounded to integers and clipped to [par-2, par+7]; net better-ball is taken off the low player in each foursome; standings rank by total match points.
Correctness checks (all pass):
Efficiency audit. The production estimator is plain i.i.d. pseudo-random Monte Carlo. It does use a single fixed seed (reproducible) and, within a scenario sweep (rho=0 vs rho=0.35), it draws fresh randomness each call rather than reusing a common stream — so the reported partner-correlation deltas carry uncorrelated MC noise in each arm. Switching those paired comparisons to common random numbers (same shock stream, only the parameter changes) would sharpen every delta; see the Variance-Reduction and Sensitivity sections. The core sim leaves antithetic variates, control variates, and QMC entirely on the table.
| Flight | Favorite | p(win) | Binomial SE @60k | Batch-means SE @60k | R-hat (8 streams) |
|---|---|---|---|---|---|
| 9 | Wood + Estes | 0.419 | 0.00201 | 0.00172 | 1.000 |
| 1 | Vola + Kerns | 0.235 | 0.00173 | 0.00185 | 1.000 |
The batch-means SE (30 non-overlapping batches) tracking the binomial SE is the empirical proof that the draws are independent — if there were autocorrelation, batch-means would exceed the binomial value.
For a win-probability estimate p̂ from N i.i.d. sims, the Monte-Carlo standard error is √(p(1−p)/N) and the 95% half-width is 1.96×that. The table gives the half-width for the two focus favorites across N, plus the sims required to hit a target precision.
Flight 9 (Wood + Estes — clear-favorite flight) — favorite Wood + Estes, p(win)≈0.419:
| N sims | 95% half-width on p(win) |
|---|---|
| 5,000 | ±1.37 pts |
| 10,000 | ±0.97 pts |
| 20,000 | ±0.68 pts |
| 40,000 | ±0.48 pts |
| 60,000 | ±0.39 pts |
| 100,000 | ±0.31 pts |
Flight 1 (Vola + Kerns — coin-flip flight) — favorite Vola + Kerns, p(win)≈0.235:
| N sims | 95% half-width on p(win) |
|---|---|
| 5,000 | ±1.17 pts |
| 10,000 | ±0.83 pts |
| 20,000 | ±0.59 pts |
| 40,000 | ±0.42 pts |
| 60,000 | ±0.34 pts |
| 100,000 | ±0.26 pts |
Sims needed for a target 95% half-width (favorite p≈0.23):
| Target half-width | Sims required |
|---|---|
| ±1.0 pts | 6,897 |
| ±0.5 pts | 27,587 |
| ±0.2 pts | 172,419 |
Recommendation on sim count. For *ranking* teams within a flight and producing the bid sheet, 10k–20k sims already pins each P(win) to ±0.4–0.6 pts, which is far below the partner-correlation and parameter-uncertainty effects the model genuinely carries. 40–60k is justified only if you want the advance-to-shootout *tail* probabilities and the cross-flight fair-value aggregates smooth to <0.2 pts. Past 60k you are polishing MC noise that is ≈10–20× smaller than the parameter uncertainty (see §7) — wasted compute. With the control-variate + antithetic stack below, 60k-equivalent precision is reachable at ~15–20k raw sims on clear-favorite flights.
We measure each technique's variance-reduction factor (VRF) = Var(plain estimator)/Var(method estimator), estimated from 80 independent replications of an 8,000-sim estimate of the favorite's P(win). VRF = X means the method needs ~X× fewer sims for the same precision (equivalently, ESS is X× larger).
Methods: antithetic variates (negate the entire 127-dim shock vector for the paired half); control variate using the favorite's expected match points (cheap, ~known mean, strongly correlated with winning); stratified sampling on the 1-D field/day shock (50 proportional strata); Latin Hypercube on the structured field+partner shocks (the low-dimensional, high-leverage part).
Flight 9 (Wood + Estes — clear-favorite flight) — favorite Wood + Estes:
| Technique | Var(estimator) | VRF (×) | Interpretation |
|---|---|---|---|
| Plain pseudo-random (baseline) | 2.914e-05 | 1.00 | baseline |
| Antithetic variates | 1.448e-05 | 2.01 | 2.0× fewer sims |
| Control variate (E[match pts]) | 1.408e-05 | 2.07 | 2.1× fewer sims |
| Stratified field shock | 2.641e-05 | 1.10 | 1.1× fewer sims |
| Latin Hypercube (field+partner) | 2.149e-05 | 1.36 | 1.4× fewer sims |
Flight 1 (Vola + Kerns — coin-flip flight) — favorite Vola + Kerns:
| Technique | Var(estimator) | VRF (×) | Interpretation |
|---|---|---|---|
| Plain pseudo-random (baseline) | 1.639e-05 | 1.00 | baseline |
| Antithetic variates | 1.560e-05 | 1.05 | 1.1× fewer sims |
| Control variate (E[match pts]) | 1.352e-05 | 1.21 | 1.2× fewer sims |
| Stratified field shock | 1.589e-05 | 1.03 | 1.0× fewer sims |
| Latin Hypercube (field+partner) | 1.659e-05 | 0.99 | 1.0× fewer sims |
Reading the result. The control variate is the standout: the favorite's expected match-point total is almost a sufficient statistic for whether it wins the flight, so regressing the win indicator on it removes a large share of variance at near-zero extra cost. Antithetic variates give a solid, free boost because the win indicator is close to monotone in the aggregate shock. Stratifying/LHS the field shock helps less here than one might expect — the field/day shock largely *cancels within a match* (both teams feel it), so it drives cross-match point totals but not single-match outcomes; the bulk of the variance lives in the 120 idiosyncratic per-hole/own-form dimensions, which stratification on 1 dimension cannot touch. Common random numbers (already partly used inside a match) should additionally be applied across scenario reruns — it is the single highest-leverage change for the partner-correlation deltas the production report publishes.
Recommended stack: antithetic + expected-points control variate + CRN across scenarios. The two are near-independent, so on a clear-favorite flight (Flight 9) they compound to ~3–4× — 60k-quality precision from ~15–20k raw draws. On a flat coin-flip flight (Flight 1) the gains are smaller (~1.2–1.3×): when p(win)≈1/6 across six near-equal teams the win indicator is weakly correlated with any single control and nearly symmetric, so there is less variance for these techniques to remove. The honest summary: variance reduction is a real, free win on the flights where one team separates, and a modest one where the flight is a scramble — but CRN across scenario reruns helps everywhere.
Quasi-Monte Carlo replaces pseudo-random draws with a low-discrepancy sequence (scrambled Sobol' or Halton) that fills the 127-dimensional unit cube more evenly. We map each point through the inverse normal CDF into the four shock blocks and measure RMSE of the favorite's P(win) vs a 250k-sim ground truth, averaged over 16 independent scramblings, across N.
Flight 9 (Wood + Estes — clear-favorite flight) — favorite Wood + Estes, truth p=0.4204, dimension D=127:
| N | Pseudo RMSE | Sobol' RMSE | Halton RMSE | Sobol' speedup |
|---|---|---|---|---|
| 256 | 0.03679 | 0.01758 | 0.01849 | 2.09× |
| 512 | 0.02387 | 0.01098 | 0.01456 | 2.17× |
| 1,024 | 0.01631 | 0.00652 | 0.00915 | 2.50× |
| 2,048 | 0.00975 | 0.00574 | 0.00636 | 1.70× |
| 4,096 | 0.00912 | 0.00421 | 0.00505 | 2.17× |
| 8,192 | 0.00588 | 0.00386 | 0.00411 | 1.52× |
| 16,384 | 0.00374 | 0.00376 | 0.00312 | 1.00× |
Fitted convergence rate (slope of log RMSE vs log N): pseudo N-0.53, Sobol' N-0.37, Halton N-0.44. Theory: pseudo → −0.5, QMC → up to −1.0.
Flight 1 (Vola + Kerns — coin-flip flight) — favorite Vola + Kerns, truth p=0.2347, dimension D=127:
| N | Pseudo RMSE | Sobol' RMSE | Halton RMSE | Sobol' speedup |
|---|---|---|---|---|
| 256 | 0.02688 | 0.01161 | 0.01604 | 2.32× |
| 512 | 0.01290 | 0.01054 | 0.01097 | 1.22× |
| 1,024 | 0.00998 | 0.00998 | 0.00973 | 1.00× |
| 2,048 | 0.00919 | 0.00659 | 0.00710 | 1.39× |
| 4,096 | 0.00617 | 0.00496 | 0.00430 | 1.24× |
| 8,192 | 0.00605 | 0.00250 | 0.00343 | 2.42× |
| 16,384 | 0.00281 | 0.00248 | 0.00235 | 1.13× |
Fitted convergence rate (slope of log RMSE vs log N): pseudo N-0.45, Sobol' N-0.42, Halton N-0.46. Theory: pseudo → −0.5, QMC → up to −1.0.
Reading the result — a nuanced win that fades. Two facts that look contradictory until you separate *level* from *slope*: (1) at every practical budget from 256 to ~8k draws, scrambled Sobol' delivers 1.5–2.5× lower RMSE than pseudorandom on the favorite's P(win) — a real, free accuracy gain; but (2) its fitted convergence slope is not the theoretical O(N⁻¹) — it flattens to roughly N-0.42 (vs pseudo's textbook N-0.45), so the two curves converge and by N≈16k the Sobol' advantage has largely washed out. The cause is effective dimension: the win indicator depends on 127 standard normals (1 field + 6 partner + 12 own-form + 108 per-hole scatter), and the 108 per-hole scatter dimensions stay individually consequential (integer rounding + clipping make single holes pivotal). Sobol' front-loads its uniformity into the first coordinates (here the field + partner + own-form shocks), which is why it helps at low N; but a discontinuous, high-effective-dimension integrand denies it the smooth O(N⁻¹) regime, so the gain does not compound. Recommendation: Sobol' is a worthwhile, zero-cost drop-in *if* you operate at small N (≤4–8k) — pair it with the §3 stack. But it is not a substitute for variance reduction and brings little once you are already at 20k+. The robust efficiency lever here is the antithetic + control-variate + CRN stack, not QMC.
We perform a Saltelli/Sobol' variance decomposition of the favorite's P(win) over five model inputs, each given a uniform prior around its production value: RHOFIELD [0.05,0.30], RHOPARTNERS [0.15,0.55], HOLENOISEFRAC [0.50,0.80], a favorite mean shift [−1.5,+1.5] strokes (18h-equiv applied to the favorite team's own two players), and a favorite SD scale [0.85,1.15] (likewise). The two player perturbations target the favorite's own team because a field-wide shift cancels in relative standings — we want each input to measure a decision-relevant uncertainty. First-order index S1 = share of output variance explained by that input alone; total-effect ST = share including all interactions. We use common random numbers across every model evaluation, so the indices isolate parameter effects from MC noise (Jansen estimators, Sobol' base N=256 ⇒ 7×256 evals of 16k sims each).
Flight 9 (Wood + Estes — clear-favorite flight) — favorite Wood + Estes (baseline p≈0.423, output variance over the prior box = 0.0017):
| Input | First-order S1 | Total-effect ST |
|---|---|---|
| Player mean shift | 0.380 | 0.401 |
| Player SD scale | 0.420 | 0.422 |
| RHO_PARTNERS | 0.029 | 0.034 |
| HOLENOISEFRAC | 0.149 | 0.153 |
| RHO_FIELD | 0.028 | 0.004 |
Flight 1 (Vola + Kerns — coin-flip flight) — favorite Vola + Kerns (baseline p≈0.232, output variance over the prior box = 0.0018):
| Input | First-order S1 | Total-effect ST |
|---|---|---|
| Player mean shift | 0.414 | 0.424 |
| Player SD scale | 0.563 | 0.562 |
| RHO_PARTNERS | 0.013 | 0.004 |
| HOLENOISEFRAC | 0.023 | 0.020 |
| RHO_FIELD | 0.008 | 0.001 |
Reading the result. The favorite's own player inputs dominate. On Flight 9 the mean-level and SD-scale of Wood + Estes together account for ≈82% of the win-probability variance over the prior box; on Flight 1 the favorite's level + consistency account for ≈99%. In plain terms: *how good we think the favorite's two players actually are — their scoring level and their consistency — drives the answer far more than any correlation or noise constant.* Among the three structural knobs, the leading one is HOLENOISEFRAC (ST 0.15 on Flight 9): it sets how much of each player's variance lands as un-averaged 9-hole scatter, which is exactly what does or doesn't separate teams over a short match. RHOPARTNERS is a minor contributor (it tunes the better-ball smoothing the favorite keeps), and RHOFIELD is essentially inert for a single team's win probability because the field/day shock cancels within every match. ST≈S1 throughout ⇒ interactions are small. Implication: modeling effort and any future data collection should target the player scoring distributions (recency, shrinkage, sample depth) and — among structural assumptions — the per-hole variance fraction HOLENOISEFRAC; fine-tuning RHO_FIELD is wasted effort.
The production model couples teammates additively (a shared Gaussian partner shock). We compare that against a Gaussian copula and a heavy-tailed t-copula (df=4) on the two teammates' round-form, holding each player's marginal variance fixed and matching the realized teammate correlation (≈0.41 of round-form). Only the *joint tail dependence* changes. We read the effect on the 'run the table' tail — P(a team wins all 5 matches) — which feeds the shootout-advance probability.
Flight 9 (Wood + Estes — clear-favorite flight) — favorite Wood + Estes:
| Dependence model | Favorite P(win flight) | Favorite P(win all 5) |
|---|---|---|
| Additive shock (production) | 42.1% | 38.55% |
| Gaussian copula | 42.2% | 38.56% |
| t-copula (df=4) | 42.5% | 38.94% |
Flight 1 (Vola + Kerns — coin-flip flight) — favorite Vola + Kerns:
| Dependence model | Favorite P(win flight) | Favorite P(win all 5) |
|---|---|---|
| Additive shock (production) | 23.3% | 19.28% |
| Gaussian copula | 23.5% | 19.55% |
| t-copula (df=4) | 23.2% | 19.23% |
Reading the result. Two findings. First, the Gaussian copula reproduces the additive model essentially exactly — the additive shared-shock construction *is* a Gaussian dependence, so this is a clean internal-consistency check that our copula machinery and the production model agree where they must. Second, switching to a t-copula (df=4) — same correlation, heavier *joint tails* so teammates boom or bust together more often — moves the run-the-table tail only modestly (a few tenths of a point on the favorite, direction depending on the flight) and leaves the flight-win probability essentially unchanged. Implication: for THIS field and format, the choice between Gaussian and t dependence is a genuinely second-order effect — the production additive structure is defensible. The sensitivity is real but small because (a) the 9-hole better-ball already injects large idiosyncratic variance that swamps the teammate tail-dependence, and (b) winning all 5 matches is dominated by *level* (how good the team is), not by the fine structure of how its two players' off-days co-move. The right takeaway is the *method*: when a tail probability (run-the-table, shootout-advance) drives real money, stress-test it under a t-copula rather than assuming the additive Gaussian is exact — here that test passes.
Single-level MC reports a P(win) as if the player distributions were known exactly. They are not: each player's mean/SD is estimated from a finite, recency-weighted sample (effective n). We run a two-level (nested / posterior-predictive) MC — outer loop resamples every player's (mean, SD) from its sampling distribution (mean SE = sd/√neff, SD SE = sd/√(2·neff)); inner loop runs the flight with common random numbers so the band reflects *parameter* uncertainty, not MC noise. This converts each headline P(win) into a full uncertainty band.
| Team | Flight | Point P(win) | Param-uncertainty band (5–95%) | Param SD | MC-only SD @15k | Param/MC ratio |
|---|---|---|---|---|---|---|
| Wood + Estes | 9 | 41.9% | 34.6% – 46.3% | 3.66 pts | 0.41 pts | 8.9× |
| Vola + Kerns | 1 | 23.5% | 16.0% – 30.6% | 4.60 pts | 0.32 pts | 14.2× |
Reading the result. Parameter uncertainty is an order of magnitude larger than Monte-Carlo noise at 15k sims. Wood + Estes is a genuine favorite, but its honest interval (34.6%–46.3%) is wide because the flight's outcome hinges on player scoring levels we only know to ±1–2 strokes. Vola + Kerns sits in a true coin-flip flight where the band (16.0%–30.6%) overlaps several rivals. Implication: publish P(win) with these bands, and stop spending sims to shrink an MC error that is already ~10–20× smaller than the irreducible parameter uncertainty.
Optimal simulation design for the 2026 Calcutta sim:
1. Sim count: 15k–20k as the production default (down from 60k). At 20k every P(win) is pinned to ±0.5 pt — far inside the parameter-uncertainty band (§7). Keep 40–60k only for the final cross-flight fair-value aggregation and the shootout-advance tails, where you want the last decimal smooth.
2. Sampler: QMC is optional, not a priority. Scrambled Sobol' gives a real 1.5–2.5× RMSE reduction at small budgets (≤8k) and is a zero-cost drop-in, so use it if you run small batches; but its slope flattens (N-0.37 vs pseudo N-0.53) and the gain washes out by ~16k. Do not treat QMC as a substitute for the variance-reduction stack.
3. Variance-reduction stack: antithetic variates + expected-points control variate + common random numbers across every scenario rerun (rho sweeps, re-pricing). Measured VRFs up to ~2.1× on clear-favorite flights (antithetic ~2.0× alone), smaller on coin-flip flights. CRN across scenario reruns is the single biggest win for the partner-correlation deltas the production report already publishes — adopt it there immediately.
4. Correlation model: keep the additive partner shock for the central estimate (it equals a Gaussian copula to within MC noise — verified in §6). Treat the t-copula as a stress test for tail/shootout EV, not a replacement; on this field the additive structure passes that test.
5. Always report parameter-uncertainty bands (two-level MC) alongside P(win). They are ~9–14× the MC noise and are the honest measure of what we know.
6. Sensitivity priority: Sobol' indices show the favorite's own player scoring level and consistency drive ~80–99% of its win-probability variance — far more than any correlation/noise constant. Among structural knobs, HOLENOISEFRAC leads and RHOFIELD is inert. Spend calibration effort on the player distributions and the per-hole variance fraction, not on RHOFIELD.
valuemodel.buildplayer_dists read-only, so any bias in the player distributions (9→18 doubling, shrinkage target +2.6, SD floor 2.5) is inherited, not audited here.defaultrng); scrambled Sobol'/Halton via scipy.stats.qmc. Full code in mcengine.py + experiments.py; rerun with uv run python analysis/montecarlo_expert/experiments.py.