Expert Monte-Carlo Analysis
2026 Member-Member Calcutta

Advanced variance reduction · Quasi-Monte Carlo · Sobol' sensitivity · Copula tails · Nested uncertainty | engine reproduces value_model.py bit-for-bit · seed 20260609

Executive summary

This report audits the production 2026 Calcutta flight simulator (valuation/value_model.py) as a Monte-Carlo estimator and applies a stack of advanced variance-reduction, quasi-Monte-Carlo, global-sensitivity, dependence-modeling, and nested-uncertainty techniques to our actual flights. Every number below comes from a self-contained engine (mcengine.py) that reproduces the production per-hole, 9-hole net better-ball round-robin logic, with P(win) validated against production to within Monte-Carlo noise (§1).

Headline findings:

The simulator is correct and unbiased. Independent-stream Gelman-Rubin R-hat = 1.000 (Flight 9) and 1.000 (Flight 1); batch-means standard errors agree with the binomial prediction √(p(1-p)/N) to within sampling noise. Draws are genuinely i.i.d.; there is no hidden autocorrelation to correct.
60k sims is more than enough for ranking; 10k–20k is enough for most decisions. A favorite at p≈0.42 has a 95% half-width of ±0.39 pts at 60k, ±0.68 pts at 20k. Hitting ±0.5 pt on a p≈0.23 win-prob needs ~27,587 sims.
Variance reduction roughly doubles the effective sample size on a clear-favorite flight, essentially for free. On Flight 9 the best single technique (the expected-points control variate) gives a variance-reduction factor of 2.1×, i.e. it reaches the same precision as plain Monte Carlo with ~2.1× fewer sims; antithetic variates alone give 2.0×. On the coin-flip Flight 1 the gains are smaller (1.2× and 1.05×).
Quasi-Monte Carlo helps at small N but the gain fades — don't rely on it. Scrambled Sobol' over the 127-dim shock vector gives 1.5–2.5× lower RMSE than pseudorandom at budgets ≤8k, but its fitted convergence slope flattens to N^-0.37 (vs pseudo's textbook N^-0.53), so by ~16k the advantage washes out — the discontinuous, high-effective-dimension integrand denies QMC its O(N⁻¹) regime. The durable efficiency win is variance reduction, not QMC.
Sensitivity: player quality, not the correlation knobs, drives the answer. A Sobol'/Saltelli variance decomposition attributes ≈82% of the favorite's win-probability variance to its own players' assumed scoring level and consistency. HOLENOISEFRAC is the leading structural knob; RHO_FIELD is essentially inert (it cancels within matches). Calibration effort belongs on the player distributions, not on the correlation constants.
Two-level (parameter) uncertainty dwarfs Monte-Carlo noise. Propagating the player scoring-distribution uncertainty gives Wood + Estes a win-probability band of 34.6%–46.3% (point 41.9%) and Vola + Kerns 16.0%–30.6% (point 23.5%). Parameter uncertainty is ~9–14× the MC noise at 15k sims, so spending sims past ~20k to shrink MC error is false precision.

1. Correctness & efficiency audit

What the production sim does (restated for audit). For each 6-team flight it simulates the full round-robin (15 matches) hole by hole: each player's 9-hole round is built from a per-hole gross-vs-par mean (m18/18) plus four Gaussian shock layers: a flight-wide field/day shock (RHO_FIELD=0.15), a per-team partner shock (RHO_PARTNERS=0.35), a per-player own round-form shock, and i.i.d. per-hole scatter (HOLENOISEFRAC=0.65). Gross is rounded to integers and clipped to [par-2, par+7]; net better-ball is taken off the low player in each foursome; standings rank by total match points.
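To make the four shock layers concrete, here is a minimal sketch of how a 127-dimensional standard-normal vector (1 field + 6 partner + 12 own-form + 108 per-hole) could be turned into integer gross scores. It is not the code in the production model. The variance split is an assumption made for illustration: a fraction HOLENOISEFRAC of each player's 9-hole variance is treated as per-hole scatter, and the remainder is divided between the field, partner, and own-form shocks in proportion to RHO_FIELD, RHO_PARTNERS, and what is left over. The function name gross_scores and the example inputs (m18, sd9, par) are hypothetical.

```python
import numpy as np

RHO_FIELD, RHO_PARTNERS, HOLENOISEFRAC = 0.15, 0.35, 0.65
N_TEAMS, N_HOLES = 6, 9
N_PLAYERS = 2 * N_TEAMS
DIM = 1 + N_TEAMS + N_PLAYERS + N_PLAYERS * N_HOLES   # 127


def gross_scores(z, m18, sd9, par):
    """Map an (n, 127) shock array to integer gross scores of shape (n, 12, 9).

    ASSUMED loadings (not taken from the production model): each player's
    9-hole total has variance sd9**2, split into a round-level part
    (1 - HOLENOISEFRAC) and i.i.d. per-hole scatter (HOLENOISEFRAC).
    """
    n = z.shape[0]
    field = z[:, :1]                                        # one shock per sim
    partner = np.repeat(z[:, 1:1 + N_TEAMS], 2, axis=1)     # shared by teammates
    own = z[:, 1 + N_TEAMS:1 + N_TEAMS + N_PLAYERS]         # one per player
    hole = z[:, 1 + N_TEAMS + N_PLAYERS:].reshape(n, N_PLAYERS, N_HOLES)

    # Unit-variance round form built from the three round-level layers.
    form = (np.sqrt(RHO_FIELD) * field
            + np.sqrt(RHO_PARTNERS) * partner
            + np.sqrt(1.0 - RHO_FIELD - RHO_PARTNERS) * own)

    # Spread the round-level shock evenly over 9 holes, add per-hole scatter.
    round_part = np.sqrt(1.0 - HOLENOISEFRAC) * form[:, :, None] / N_HOLES
    hole_part = np.sqrt(HOLENOISEFRAC / N_HOLES) * hole
    raw = par + (m18 / 18.0)[None, :, None] + sd9[None, :, None] * (round_part + hole_part)

    # Integer scores, clipped to [par-2, par+7] as described in the text.
    return np.clip(np.rint(raw), par - 2, par + 7).astype(int)


rng = np.random.default_rng(20260609)
z = rng.standard_normal((1_000, DIM))
m18 = np.linspace(8.0, 20.0, N_PLAYERS)          # hypothetical 18-hole means vs par
sd9 = np.full(N_PLAYERS, 3.0)                    # hypothetical 9-hole SDs
par = np.array([4, 4, 3, 5, 4, 4, 3, 5, 4])      # hypothetical card
gross = gross_scores(z, m18, sd9, par)
print(gross.shape, gross.min(), gross.max())
```

Because the scores are a deterministic function of z, the same array can be negated (antithetic), replaced by a low-discrepancy point set (QMC), or reused across parameter settings (common random numbers).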

Correctness checks (all pass):

Our independent engine reproduces production P(win) to within ±0.0005 (pure MC noise at the test size), confirming the standings/tie-break logic.
Scoring is deterministic given the shocks (verified): identical shock arrays yield identical standings, so the estimator is a clean function of the RNG, a prerequisite for antithetic variates, QMC, and common random numbers (CRN).
Per-flight ΣP(win)=ΣP(2nd)=1 and P(advance)≥P(win) hold by construction.
Gelman-Rubin R-hat ≈ 1.000 across 8 independent streams on both focus flights ⇒ no between-stream disagreement beyond sampling noise.

Efficiency audit. The production estimator is plain i.i.d. pseudo-random Monte Carlo. It uses a single fixed seed (reproducible), but within a scenario sweep (rho=0 vs rho=0.35) it draws fresh randomness on each call rather than reusing a common stream, so the reported partner-correlation deltas carry independent MC noise in each arm. Switching those paired comparisons to common random numbers (same shock stream, only the parameter changes) would sharpen every delta; see §3 and §5. The core sim leaves antithetic variates, control variates, and QMC entirely on the table.

Flight	Favorite	p(win)	Binomial SE @60k	Batch-means SE @60k	R-hat (8 streams)
9	Wood + Estes	0.419	0.00201	0.00172	1.000
1	Vola + Kerns	0.235	0.00173	0.00185	1.000

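The batch-means SE (30 non-overlapping batches) tracks the binomial SE, which is the empirical evidence that the draws are independent: positive autocorrelation would push the batch-means value above the binomial one. An SE estimated from 30 batches itself carries a relative error of about 1/√(2·29) ≈ 13%, so the 0.00172 vs 0.00201 gap on Flight 9 is within sampling noise.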
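The two diagnostics in the table can be sketched in a few lines. The example below runs on synthetic Bernoulli win indicators rather than on the flight simulator, and the function names batch_means_se, binomial_se, and gelman_rubin are hypothetical, so treat it as an illustration of the checks rather than the audited code.

```python
import numpy as np


def batch_means_se(indicators, n_batches=30):
    """SE of the mean estimated from non-overlapping batch means."""
    usable = len(indicators) - len(indicators) % n_batches
    batches = indicators[:usable].reshape(n_batches, -1).mean(axis=1)
    return float(batches.std(ddof=1) / np.sqrt(n_batches))


def binomial_se(p_hat, n):
    """SE predicted for n independent Bernoulli(p_hat) draws."""
    return float(np.sqrt(p_hat * (1.0 - p_hat) / n))


def gelman_rubin(streams):
    """Classic R-hat for an (m streams, n draws) array."""
    _, n = streams.shape
    within = streams.var(axis=1, ddof=1).mean()
    between = n * streams.mean(axis=1).var(ddof=1)
    pooled = (n - 1) / n * within + between / n
    return float(np.sqrt(pooled / within))


# Synthetic stand-in for 60k win indicators of a p = 0.419 favorite.
rng = np.random.default_rng(20260609)
wins = (rng.random(60_000) < 0.419).astype(float)
se_bm = batch_means_se(wins)
se_bin = binomial_se(wins.mean(), wins.size)

# Eight independent streams of 7,500 draws each.
streams = (np.random.default_rng(1).random((8, 7_500)) < 0.419).astype(float)
rhat = gelman_rubin(streams)

print(f"batch-means SE {se_bm:.5f}, binomial SE {se_bin:.5f}, R-hat {rhat:.4f}")
```

For independent draws the two standard errors should agree up to the roughly 13% noise of a 30-batch estimate, and R-hat should sit at 1.000 to three decimals.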

2. Convergence diagnostics & error bars

For a win-probability estimate p̂ from N i.i.d. sims, the Monte-Carlo standard error is √(p(1−p)/N) and the 95% half-width is 1.96×that. The table gives the half-width for the two focus favorites across N, plus the sims required to hit a target precision.

Flight 9 (Wood + Estes — clear-favorite flight) — favorite Wood + Estes, p(win)≈0.419:

N sims	95% half-width on p(win)
5,000	±1.37 pts
10,000	±0.97 pts
20,000	±0.68 pts
40,000	±0.48 pts
60,000	±0.39 pts
100,000	±0.31 pts

Flight 1 (Vola + Kerns — coin-flip flight) — favorite Vola + Kerns, p(win)≈0.235:

N sims	95% half-width on p(win)
5,000	±1.17 pts
10,000	±0.83 pts
20,000	±0.59 pts
40,000	±0.42 pts
60,000	±0.34 pts
100,000	±0.26 pts

Sims needed for a target 95% half-width (favorite p≈0.23):

Target half-width	Sims required
±1.0 pts	6,897
±0.5 pts	27,587
±0.2 pts	172,419
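The half-width and sims-required tables follow directly from the binomial formula. A minimal sketch, with hypothetical helper names half_width_pts and sims_required:

```python
import math

Z95 = 1.96  # two-sided 95% normal quantile used in the text


def half_width_pts(p, n):
    """95% half-width of a win-probability estimate, in percentage points."""
    return 100.0 * Z95 * math.sqrt(p * (1.0 - p) / n)


def sims_required(p, target_pts):
    """Smallest N whose 95% half-width is at most target_pts points."""
    h = target_pts / 100.0
    return math.ceil(Z95 ** 2 * p * (1.0 - p) / h ** 2)


print(round(half_width_pts(0.419, 60_000), 2))   # Flight 9 favorite at 60k
print(round(half_width_pts(0.419, 20_000), 2))   # Flight 9 favorite at 20k
print(sims_required(0.235, 0.5))                 # Flight 1 favorite, ±0.5 pt target
```

With the rounded p = 0.235 the last call returns about 27.6k, a few dozen sims above the table's 27,587; the small gap is consistent with the table having been computed from an unrounded p slightly below 0.235.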

Recommendation on sim count. For *ranking* teams within a flight and producing the bid sheet, 10k–20k sims already pins each P(win) to ±0.6–1.0 pts, which is far below the partner-correlation and parameter-uncertainty effects the model genuinely carries. 40–60k is justified only if you want the advance-to-shootout *tail* probabilities and the cross-flight fair-value aggregates smooth to <0.2 pts. Past 60k you are polishing MC noise that is already ≈9–14× smaller than the parameter uncertainty at 15k sims (see §7), which is wasted compute. With the control-variate + antithetic stack below, 60k-equivalent precision is reachable at ~15–20k raw sims on clear-favorite flights.

3. Variance reduction (with measured VRFs)

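We measure each technique's variance-reduction factor (VRF) = Var(plain estimator)/Var(method estimator), estimated from 80 independent replications of an 8,000-sim estimate of the favorite's P(win). VRF = X means the method needs ~X× fewer sims for the same precision (equivalently, the effective sample size, ESS, is X× larger). With 80 replications each variance estimate has a relative standard error of roughly √(2/79) ≈ 16%, so a VRF within about 0.2 of 1.0 is indistinguishable from no gain.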

Methods: antithetic variates (negate the entire 127-dim shock vector for the paired half); control variate using the favorite's expected match points (cheap, ~known mean, strongly correlated with winning); stratified sampling on the 1-D field/day shock (50 proportional strata); Latin Hypercube on the structured field+partner shocks (the low-dimensional, high-leverage part).

Flight 9 (Wood + Estes — clear-favorite flight) — favorite Wood + Estes:

Technique	Var(estimator)	VRF (×)	Interpretation
Plain pseudo-random (baseline)	2.914e-05	1.00	baseline
Antithetic variates	1.448e-05	2.01	2.0× fewer sims
Control variate (E[match pts])	1.408e-05	2.07	2.1× fewer sims
Stratified field shock	2.641e-05	1.10	1.1× fewer sims
Latin Hypercube (field+partner)	2.149e-05	1.36	1.4× fewer sims

Flight 1 (Vola + Kerns — coin-flip flight) — favorite Vola + Kerns:

Technique	Var(estimator)	VRF (×)	Interpretation
Plain pseudo-random (baseline)	1.639e-05	1.00	baseline
Antithetic variates	1.560e-05	1.05	1.1× fewer sims
Control variate (E[match pts])	1.352e-05	1.21	1.2× fewer sims
Stratified field shock	1.589e-05	1.03	1.0× fewer sims
Latin Hypercube (field+partner)	1.659e-05	0.99	1.0× fewer sims

Reading the result. On Flight 9 the control variate is the best single technique (2.07×), narrowly ahead of antithetic variates (2.01×): the favorite's expected match-point total is almost a sufficient statistic for whether it wins the flight, so regressing the win indicator on it removes about half the variance at near-zero extra cost. Antithetic variates give a comparable, free boost because the win indicator is close to monotone in the aggregate shock. Stratifying or Latin-Hypercube sampling the field shock helps less than one might expect: the field/day shock largely *cancels within a match* (both teams feel it), so it drives cross-match point totals but not single-match outcomes, and the bulk of the variance lives in the 120 idiosyncratic per-hole/own-form dimensions, which stratification on 1 dimension cannot touch. Common random numbers (already partly used inside a match) should additionally be applied across scenario reruns; it is the single highest-leverage change for the partner-correlation deltas the production report publishes.
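The mechanics of the two leading techniques can be shown on a toy problem. The sketch below is not the flight simulator: toy_flight is a hypothetical stand-in in which a linear score over the 19 structured shocks plays the role of the expected-points control (known mean 0) and rounded per-hole scatter adds a discontinuity. The threshold and weights are arbitrary choices that put p(win) near 0.42, so the VRFs it prints illustrate the method and are not expected to match the tables above.

```python
import numpy as np

D_STRUCT, D_HOLE = 19, 108      # 1 field + 6 partner + 12 own-form; 108 per-hole
THRESH = 0.25                   # arbitrary cutoff, gives p(win) near 0.42


def toy_flight(z):
    """Return (win indicator, control variate) for an (n, 127) shock array."""
    control = z[:, :D_STRUCT].sum(axis=1) / np.sqrt(D_STRUCT)      # mean 0 by design
    scatter = np.round(z[:, D_STRUCT:]).sum(axis=1) / np.sqrt(D_HOLE)
    win = (control + 0.8 * scatter > THRESH).astype(float)
    return win, control


def one_replication(rng, n):
    # Plain and control-variate estimates share the same n draws.
    z = rng.standard_normal((n, D_STRUCT + D_HOLE))
    win, control = toy_flight(z)
    plain = win.mean()
    beta = np.cov(win, control)[0, 1] / control.var(ddof=1)
    cv = plain - beta * (control.mean() - 0.0)    # subtract known mean of control

    # Antithetic: n/2 draws, each evaluated at z and at -z (same total cost).
    half = rng.standard_normal((n // 2, D_STRUCT + D_HOLE))
    anti = 0.5 * (toy_flight(half)[0].mean() + toy_flight(-half)[0].mean())
    return plain, cv, anti


rng = np.random.default_rng(20260609)
est = np.array([one_replication(rng, 1_000) for _ in range(200)])
var_plain, var_cv, var_anti = est.var(axis=0, ddof=1)
vrf_cv = var_plain / var_cv
vrf_anti = var_plain / var_anti
print(f"VRF control variate {vrf_cv:.2f}, antithetic {vrf_anti:.2f}")
```

Estimating beta from the same sample introduces a bias of order 1/n, which is negligible at these sample sizes; a separate pilot run would remove it entirely.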

Recommended stack: antithetic + expected-points control variate + CRN across scenarios. The tables above measure each technique alone; if the antithetic and control-variate gains are near-independent, as assumed here, then on a clear-favorite flight (Flight 9) they compound to ~3–4×, i.e. 60k-quality precision from ~15–20k raw draws. On a flat coin-flip flight (Flight 1) the gains are smaller (~1.2–1.3×): when p(win)≈1/6 across six near-equal teams the win indicator is weakly correlated with any single control and nearly symmetric, so there is less variance for these techniques to remove. The honest summary: variance reduction is a real, free win on the flights where one team separates, and a modest one where the flight is a scramble, but CRN across scenario reruns helps everywhere.
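Why common random numbers sharpen a scenario delta can be seen on a toy better-ball match. In the sketch below, toy_pwin is a hypothetical stand-in for a rerun of the sim at a given partner correlation: two teammates share a shock with weight √rho and play against an uncorrelated pair. The delta between rho=0.35 and rho=0 is estimated once with fresh shocks per arm and once with the same shock array in both arms.

```python
import numpy as np


def toy_pwin(rho, z):
    """Toy P(team better-ball beats opponents) for partner correlation rho."""
    shared, own_a, own_b, opp_a, opp_b = z.T
    a = np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * own_a
    b = np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * own_b
    # Lower score wins; opponents are an uncorrelated pair.
    return float((np.minimum(a, b) < np.minimum(opp_a, opp_b)).mean())


def delta(rng, n, common):
    """Estimate P(win | rho=0.35) - P(win | rho=0) from n sims per arm."""
    z0 = rng.standard_normal((n, 5))
    z1 = z0 if common else rng.standard_normal((n, 5))   # CRN reuses z0
    return toy_pwin(0.35, z1) - toy_pwin(0.0, z0)


rng = np.random.default_rng(20260609)
indep = np.array([delta(rng, 4_000, common=False) for _ in range(200)])
crn = np.array([delta(rng, 4_000, common=True) for _ in range(200)])
ratio = indep.var(ddof=1) / crn.var(ddof=1)
print(f"mean delta {crn.mean():+.4f}, variance ratio (independent / CRN) {ratio:.1f}")
```

Both estimators target the same delta; the CRN version is less noisy because most simulated matches have the same winner under both settings and contribute exactly zero to the difference.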

4. Quasi-Monte Carlo convergence

Quasi-Monte Carlo replaces pseudo-random draws with a low-discrepancy sequence (scrambled Sobol' or Halton) that fills the 127-dimensional unit cube more evenly. We map each point through the inverse normal CDF into the four shock blocks and measure RMSE of the favorite's P(win) vs a 250k-sim ground truth, averaged over 16 independent scramblings, across N.

Flight 9 (Wood + Estes — clear-favorite flight) — favorite Wood + Estes, truth p=0.4204, dimension D=127:

N	Pseudo RMSE	Sobol' RMSE	Halton RMSE	Sobol' speedup
256	0.03679	0.01758	0.01849	2.09×
512	0.02387	0.01098	0.01456	2.17×
1,024	0.01631	0.00652	0.00915	2.50×
2,048	0.00975	0.00574	0.00636	1.70×
4,096	0.00912	0.00421	0.00505	2.17×
8,192	0.00588	0.00386	0.00411	1.52×
16,384	0.00374	0.00376	0.00312	1.00×

Fitted convergence rate (slope of log RMSE vs log N): pseudo N^-0.53, Sobol' N^-0.37, Halton N^-0.44. Theory: pseudo → −0.5, QMC → up to −1.0.

Flight 1 (Vola + Kerns — coin-flip flight) — favorite Vola + Kerns, truth p=0.2347, dimension D=127:

N	Pseudo RMSE	Sobol' RMSE	Halton RMSE	Sobol' speedup
256	0.02688	0.01161	0.01604	2.32×
512	0.01290	0.01054	0.01097	1.22×
1,024	0.00998	0.00998	0.00973	1.00×
2,048	0.00919	0.00659	0.00710	1.39×
4,096	0.00617	0.00496	0.00430	1.24×
8,192	0.00605	0.00250	0.00343	2.42×
16,384	0.00281	0.00248	0.00235	1.13×

Fitted convergence rate (slope of log RMSE vs log N): pseudo N^-0.45, Sobol' N^-0.42, Halton N^-0.46. Theory: pseudo → −0.5, QMC → up to −1.0.

Reading the result: a nuanced win that fades. Two facts look contradictory until you separate *level* from *slope*. (1) On Flight 9, at every budget from 256 to ~8k draws, scrambled Sobol' delivers 1.5–2.5× lower RMSE than pseudorandom on the favorite's P(win), a real, free accuracy gain; on Flight 1 the gain is less regular (1.0–2.4× over the same budgets). (2) Its fitted convergence slope is nowhere near the theoretical O(N⁻¹): on Flight 9 it flattens to roughly N^-0.37 (vs pseudo's near-textbook N^-0.53) and on Flight 1 it is N^-0.42 (vs pseudo's N^-0.45), so the curves converge and by N≈16k the Sobol' advantage has largely washed out. The cause is effective dimension: the win indicator depends on 127 standard normals (1 field + 6 partner + 12 own-form + 108 per-hole scatter), and the 108 per-hole scatter dimensions stay individually consequential (integer rounding + clipping make single holes pivotal). Sobol' front-loads its uniformity into the first coordinates (here the field + partner + own-form shocks), which is why it helps at low N; but a discontinuous, high-effective-dimension integrand denies it the smooth O(N⁻¹) regime, so the gain does not compound. Recommendation: Sobol' is a worthwhile, zero-cost drop-in *if* you operate at small N (≤4–8k), so pair it with the §3 stack. But it is not a substitute for variance reduction and brings little once you are already at 20k+. The robust efficiency lever here is the antithetic + control-variate + CRN stack, not QMC.
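The Sobol'-to-shock mapping described in this section can be sketched with scipy.stats.qmc. The block layout (field first, then partner, own-form, and per-hole scatter) follows the dimension count given above; the function name sobol_shocks and the clipping constant are illustrative assumptions, not the report's code.

```python
import numpy as np
from scipy.stats import norm, qmc

N_TEAMS, N_PLAYERS, N_HOLES = 6, 12, 9
DIM = 1 + N_TEAMS + N_PLAYERS + N_PLAYERS * N_HOLES   # 127


def sobol_shocks(m, seed):
    """Draw 2**m scrambled Sobol' points and map them to the four shock blocks."""
    sampler = qmc.Sobol(d=DIM, scramble=True, seed=seed)
    u = sampler.random_base2(m=m)                  # shape (2**m, 127), values in [0, 1)
    u = np.clip(u, 1e-12, 1.0 - 1e-12)             # guard the inverse CDF at the edges
    z = norm.ppf(u)                                # uniform -> standard normal

    # Structured shocks occupy the leading (best-distributed) coordinates.
    field = z[:, 0]
    partner = z[:, 1:1 + N_TEAMS]
    own = z[:, 1 + N_TEAMS:1 + N_TEAMS + N_PLAYERS]
    hole = z[:, 1 + N_TEAMS + N_PLAYERS:].reshape(-1, N_PLAYERS, N_HOLES)
    return field, partner, own, hole


field, partner, own, hole = sobol_shocks(m=10, seed=20260609)
print(field.shape, partner.shape, own.shape, hole.shape)
```

Independent scramblings are obtained by changing the seed; averaging the squared error over several scramblings gives an RMSE comparable to the replication-based RMSE of the pseudorandom estimator.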

5. Global sensitivity analysis (Sobol' indices)

We perform a Saltelli/Sobol' variance decomposition of the favorite's P(win) over five model inputs, each given a uniform prior around its production value: RHO_FIELD [0.05,0.30], RHO_PARTNERS [0.15,0.55], HOLENOISEFRAC [0.50,0.80], a favorite mean shift [−1.5,+1.5] strokes (18h-equiv applied to the favorite team's own two players), and a favorite SD scale [0.85,1.15] (likewise). The two player perturbations target the favorite's own team because a field-wide shift cancels in relative standings; we want each input to measure a decision-relevant uncertainty. First-order index S1 = share of output variance explained by that input alone; total-effect ST = share including all interactions. We use common random numbers across every model evaluation, so the indices isolate parameter effects from MC noise (Jansen estimators, Sobol' base N=256 ⇒ 7×256 evals of 16k sims each).

Flight 9 (Wood + Estes — clear-favorite flight) — favorite Wood + Estes (baseline p≈0.423, output variance over the prior box = 0.0017):

Input	First-order S1	Total-effect ST
Player mean shift	0.380	0.401
Player SD scale	0.420	0.422
RHO_PARTNERS	0.029	0.034
HOLENOISEFRAC	0.149	0.153
RHO_FIELD	0.028	0.004

Flight 1 (Vola + Kerns — coin-flip flight) — favorite Vola + Kerns (baseline p≈0.232, output variance over the prior box = 0.0018):

Input	First-order S1	Total-effect ST
Player mean shift	0.414	0.424
Player SD scale	0.563	0.562
RHO_PARTNERS	0.013	0.004
HOLENOISEFRAC	0.023	0.020
RHO_FIELD	0.008	0.001

Reading the result. The favorite's own player inputs dominate. On Flight 9 the mean-level and SD-scale of Wood + Estes together account for ≈82% of the win-probability variance over the prior box (summed ST); on Flight 1 the favorite's level + consistency account for ≈99%. In plain terms: *how good we think the favorite's two players actually are (their scoring level and their consistency) drives the answer far more than any correlation or noise constant.* Among the three structural knobs, the leading one is HOLENOISEFRAC (ST 0.15 on Flight 9): it sets how much of each player's variance lands as un-averaged 9-hole scatter, which is exactly what does or doesn't separate teams over a short match. RHO_PARTNERS is a minor contributor (it tunes the better-ball smoothing the favorite keeps), and RHO_FIELD is essentially inert for a single team's win probability because the field/day shock cancels within every match. ST≈S1 for every material input, so interactions are small; the near-zero indices (e.g. RHO_FIELD on Flight 9, where the S1 estimate of 0.028 exceeds the ST estimate of 0.004) sit at the estimator's noise floor. Implication: modeling effort and any future data collection should target the player scoring distributions (recency, shrinkage, sample depth) and, among structural assumptions, the per-hole variance fraction HOLENOISEFRAC; fine-tuning RHO_FIELD is wasted effort.
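The Saltelli sampling scheme and the Jansen estimators can be sketched as follows. The prior box is the one stated above; everything else is an assumption for illustration. In particular toy_model is a hypothetical linear function with arbitrary weights, chosen only because its Sobol' indices are known in closed form. In the real analysis each model call would be a full flight simulation evaluated on a fixed shock array (CRN).

```python
import numpy as np
from scipy.stats import qmc

NAMES = ["mean_shift", "sd_scale", "RHO_PARTNERS", "HOLENOISEFRAC", "RHO_FIELD"]
LOWER = np.array([-1.5, 0.85, 0.15, 0.50, 0.05])
UPPER = np.array([+1.5, 1.15, 0.55, 0.80, 0.30])
WEIGHTS = np.array([3.0, 2.0, 1.0, 1.0, 0.0])    # arbitrary toy weights


def toy_model(x):
    """Hypothetical stand-in for P(win): linear in the rescaled inputs."""
    u = (x - LOWER) / (UPPER - LOWER)
    return u @ WEIGHTS


def jansen_indices(model, n_base, seed):
    """First-order and total-effect indices from n_base * (k + 2) model calls."""
    k = len(LOWER)
    u = qmc.Sobol(d=2 * k, scramble=True, seed=seed).random(n_base)
    a = qmc.scale(u[:, :k], LOWER, UPPER)         # sample matrix A
    b = qmc.scale(u[:, k:], LOWER, UPPER)         # sample matrix B
    f_a, f_b = model(a), model(b)
    var = np.var(np.concatenate([f_a, f_b]), ddof=1)

    s1, st = np.empty(k), np.empty(k)
    for i in range(k):
        ab = a.copy()
        ab[:, i] = b[:, i]                        # A with column i taken from B
        f_ab = model(ab)
        s1[i] = (var - 0.5 * np.mean((f_b - f_ab) ** 2)) / var
        st[i] = 0.5 * np.mean((f_a - f_ab) ** 2) / var
    return s1, st


s1, st = jansen_indices(toy_model, n_base=1024, seed=20260609)
for name, first, total in zip(NAMES, s1, st):
    print(f"{name:14s} S1={first:6.3f}  ST={total:6.3f}")
```

For this linear toy the exact indices are the squared weights divided by their sum, and S1 equals ST because there are no interactions. With five inputs the scheme costs 7 × n_base evaluations, matching the 7×256 count quoted above.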

6. Dependence modeling: copulas & the run-the-table tail

The production model couples teammates additively (a shared Gaussian partner shock). We compare that against a Gaussian copula and a heavy-tailed t-copula (df=4) on the two teammates' round-form, holding each player's marginal variance fixed and matching the realized teammate correlation (≈0.41 of round-form). Only the *joint tail dependence* changes. We read the effect on the 'run the table' tail — P(a team wins all 5 matches) — which feeds the shootout-advance probability.

Flight 9 (Wood + Estes — clear-favorite flight) — favorite Wood + Estes:

Dependence model	Favorite P(win flight)	Favorite P(win all 5)
Additive shock (production)	42.1%	38.55%
Gaussian copula	42.2%	38.56%
t-copula (df=4)	42.5%	38.94%

Flight 1 (Vola + Kerns — coin-flip flight) — favorite Vola + Kerns:

Dependence model	Favorite P(win flight)	Favorite P(win all 5)
Additive shock (production)	23.3%	19.28%
Gaussian copula	23.5%	19.55%
t-copula (df=4)	23.2%	19.23%

Reading the result. Two findings. First, the Gaussian copula reproduces the additive model essentially exactly — the additive shared-shock construction *is* a Gaussian dependence, so this is a clean internal-consistency check that our copula machinery and the production model agree where they must. Second, switching to a t-copula (df=4) — same correlation, heavier *joint tails* so teammates boom or bust together more often — moves the run-the-table tail only modestly (a few tenths of a point on the favorite, direction depending on the flight) and leaves the flight-win probability essentially unchanged. Implication: for THIS field and format, the choice between Gaussian and t dependence is a genuinely second-order effect — the production additive structure is defensible. The sensitivity is real but small because (a) the 9-hole better-ball already injects large idiosyncratic variance that swamps the teammate tail-dependence, and (b) winning all 5 matches is dominated by *level* (how good the team is), not by the fine structure of how its two players' off-days co-move. The right takeaway is the *method*: when a tail probability (run-the-table, shootout-advance) drives real money, stress-test it under a t-copula rather than assuming the additive Gaussian is exact — here that test passes.
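The difference between the two dependence models can be seen by sampling teammate round-form under each. The sketch below keeps standard-normal marginals in both cases and uses the correlation parameter 0.41 and df=4 from the text. It does not re-tune the t-copula parameter to match the realized Pearson correlation exactly, which a faithful replication would do; the function names are hypothetical.

```python
import numpy as np
from scipy.stats import norm, t as student_t

RHO, DF = 0.41, 4


def teammate_form(rng, n, copula):
    """Sample (n, 2) round-form shocks with N(0, 1) marginals."""
    cov = np.array([[1.0, RHO], [RHO, 1.0]])
    z = rng.multivariate_normal(np.zeros(2), cov, size=n)
    if copula == "gaussian":
        return z
    # t-copula: scale both coordinates by one shared chi-square mixing variable,
    # then push each margin back to a standard normal.
    w = rng.chisquare(DF, size=n) / DF
    t_draws = z / np.sqrt(w)[:, None]
    u = student_t.cdf(t_draws, df=DF)
    return norm.ppf(u)


def joint_bad_day(x, cut=1.5):
    """Share of rounds in which both teammates are worse than +cut SDs."""
    return float(((x[:, 0] > cut) & (x[:, 1] > cut)).mean())


rng = np.random.default_rng(20260609)
gauss = teammate_form(rng, 200_000, "gaussian")
heavy = teammate_form(rng, 200_000, "t")
print(f"P(both bad): gaussian {joint_bad_day(gauss):.4f}, t(df=4) {joint_bad_day(heavy):.4f}")
```

The marginals are identical by construction, so any gap between the two printed numbers comes from joint tail dependence alone, which is the quantity the stress test isolates.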

7. Two-level uncertainty propagation (headline teams)

Single-level MC reports a P(win) as if the player distributions were known exactly. They are not: each player's mean/SD is estimated from a finite, recency-weighted sample (effective n). We run a two-level (nested / posterior-predictive) MC — outer loop resamples every player's (mean, SD) from its sampling distribution (mean SE = sd/√neff, SD SE = sd/√(2·neff)); inner loop runs the flight with common random numbers so the band reflects *parameter* uncertainty, not MC noise. This converts each headline P(win) into a full uncertainty band.

Team	Flight	Point P(win)	Param-uncertainty band (5–95%)	Param SD	MC-only SD @15k	Param/MC ratio
Wood + Estes	9	41.9%	34.6% – 46.3%	3.66 pts	0.41 pts	8.9×
Vola + Kerns	1	23.5%	16.0% – 30.6%	4.60 pts	0.32 pts	14.2×

Reading the result. Parameter uncertainty is roughly an order of magnitude larger than Monte-Carlo noise at 15k sims. Wood + Estes is a genuine favorite, but its honest interval (34.6%–46.3%) is wide because the flight's outcome hinges on player scoring levels we only know to ±1–2 strokes. Vola + Kerns sits in a true coin-flip flight where the band (16.0%–30.6%) overlaps several rivals. Implication: publish P(win) with these bands, and stop spending sims to shrink an MC error that is already ~9–14× smaller than the parameter uncertainty, which more sims cannot reduce.
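The nested scheme can be sketched on a toy two-team match. The outer loop uses the sampling model stated above (mean SE = sd/√neff, SD SE = sd/√(2·neff)); the inner sim, the player inputs, and the use of an absolute value to keep resampled SDs positive are assumptions made for this illustration, and the function names are hypothetical.

```python
import numpy as np


def inner_pwin(means, sds, z):
    """Toy inner sim: P(better ball of players 0/1 beats that of players 2/3)."""
    scores = means + sds * z                      # z has shape (n_inner, 4)
    return float((scores[:, :2].min(axis=1) < scores[:, 2:].min(axis=1)).mean())


def two_level(means, sds, n_eff, n_outer, n_inner, seed):
    rng = np.random.default_rng(seed)
    # One shock array shared by every outer draw (CRN), so the spread of the
    # outer draws reflects parameter uncertainty rather than MC noise.
    z = rng.standard_normal((n_inner, len(means)))
    draws = np.empty(n_outer)
    for k in range(n_outer):
        m = rng.normal(means, sds / np.sqrt(n_eff))
        s = np.abs(rng.normal(sds, sds / np.sqrt(2.0 * n_eff)))   # keep SDs positive
        draws[k] = inner_pwin(m, s, z)
    point = inner_pwin(means, sds, z)
    return point, np.percentile(draws, [5, 95]), float(draws.std(ddof=1))


# Hypothetical inputs: strokes over par per 9 holes (lower is better).
means = np.array([2.0, 3.0, 3.0, 4.0])
sds = np.full(4, 3.0)
n_eff = np.full(4, 12.0)

point, band, param_sd = two_level(means, sds, n_eff, n_outer=200, n_inner=15_000, seed=20260609)
mc_sd = float(np.sqrt(point * (1.0 - point) / 15_000))
print(f"point {point:.3f}, 5-95% band {band[0]:.3f} to {band[1]:.3f}, "
      f"param SD / MC SD = {param_sd / mc_sd:.1f}")
```

The ratio printed at the end is the toy analogue of the Param/MC column in the table; its value depends entirely on the hypothetical inputs.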

8. Recommendation: the optimal simulation design

Optimal simulation design for the 2026 Calcutta sim:

1. Sim count: 15k–20k as the production default (down from 60k). At 20k every P(win) is pinned to within ±0.7 pt, far inside the parameter-uncertainty band (§7). Keep 40–60k only for the final cross-flight fair-value aggregation and the shootout-advance tails, where you want the last decimal smooth.

2. Sampler: QMC is optional, not a priority. Scrambled Sobol' gives a real 1.5–2.5× RMSE reduction at small budgets (≤8k) and is a zero-cost drop-in, so use it if you run small batches; but its slope flattens (N^-0.37 vs pseudo N^-0.53) and the gain washes out by ~16k. Do not treat QMC as a substitute for the variance-reduction stack.

3. Variance-reduction stack: antithetic variates + expected-points control variate + common random numbers across every scenario rerun (rho sweeps, re-pricing). Measured VRFs up to ~2.1× on clear-favorite flights (antithetic ~2.0× alone), smaller on coin-flip flights. CRN across scenario reruns is the single biggest win for the partner-correlation deltas the production report already publishes — adopt it there immediately.

4. Correlation model: keep the additive partner shock for the central estimate (it equals a Gaussian copula to within MC noise — verified in §6). Treat the t-copula as a stress test for tail/shootout EV, not a replacement; on this field the additive structure passes that test.

5. Always report parameter-uncertainty bands (two-level MC) alongside P(win). They are ~9–14× the MC noise and are the honest measure of what we know.

6. Sensitivity priority: Sobol' indices show the favorite's own player scoring level and consistency drive ~80–99% of its win-probability variance, far more than any correlation/noise constant. Among structural knobs, HOLENOISEFRAC leads and RHO_FIELD is inert. Spend calibration effort on the player distributions and the per-hole variance fraction, not on RHO_FIELD.

Assumptions & limitations

Inputs are the production inputs. We import value_model.buildplayer_dists read-only, so any bias in the player distributions (9→18 doubling, shrinkage target +2.6, SD floor 2.5) is inherited, not audited here.
Two-level MC uses a Gaussian sampling model for (mean, SD) with SEs from effective n; it does not capture model-form uncertainty (e.g. non-normal per-hole scores) or correlated estimation error across teammates who share rounds.
Sobol' priors are uniform boxes chosen around production values; indices are conditional on those ranges. Widening a range would raise that input's share.
The copula experiment matches the realized teammate correlation (~0.41 of round-form) and changes only tail dependence; it is illustrative of *sensitivity*, not a fitted dependence model (we did not estimate the empirical teammate copula from shared-round data).
QMC behavior is integrand-specific: the effective dimension is high here (§4), and the Sobol' advantage shrinks further for deep-tail quantities. All figures are for the favorite's P(win) on two representative flights, chosen to bracket a coin-flip flight and a clear-favorite flight.
Reproducibility: fixed master seed 20260609; NumPy PCG64 (default_rng); scrambled Sobol'/Halton via scipy.stats.qmc. Full code in mcengine.py + experiments.py; rerun with uv run python analysis/montecarlo_expert/experiments.py.

Generated by analysis/montecarlo_expert/make_report.py from results.json. NumPy PCG64 · scipy.stats.qmc · compute 297.5s. No external dependencies.