Expert Monte-Carlo Analysis
2026 Member-Member Calcutta

Advanced variance reduction · Quasi-Monte Carlo · Sobol' sensitivity · Copula tails · Nested uncertainty  |  engine reproduces value_model.py bit-for-bit · seed 20260609

Executive summary

This report audits the production 2026 Calcutta flight simulator (valuation/value_model.py) as a Monte-Carlo estimator and applies a stack of advanced variance-reduction, quasi-Monte-Carlo, global-sensitivity, dependence-modeling, and nested-uncertainty techniques to our actual flights. Every number below comes from a self-contained engine (mcengine.py) that reproduces the production per-hole, 9-hole net better-ball round-robin bit-for-bit (validated to within MC noise).

Headline findings:

1. Correctness & efficiency audit

What the production sim does (restated for audit). For each 6-team flight it simulates the full round-robin (15 matches) hole by hole. Each player's 9-hole round is built from a per-hole gross-vs-par mean (m18/18) plus four Gaussian shock layers: a flight-wide field/day shock (RHO_FIELD=0.15), a per-team partner shock (RHO_PARTNERS=0.35), a per-player own round-form shock, and i.i.d. per-hole scatter (HOLENOISEFRAC=0.65). Gross is rounded to integers and clipped to [par-2, par+7]; net better-ball is taken off the low player in each foursome; standings rank by total match points.
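The four shock layers can be sketched as follows. This is a minimal illustration, not the production code: the way the variance is split between layers (9-hole variance taken as half the 18-hole variance, HOLENOISEFRAC as the per-hole share, RHO_FIELD and RHO_PARTNERS as shares of the remaining round-level variance) is an assumption made here, and the function name, card, and player values are hypothetical. The shock count it draws (1 + 6 + 12 + 108 = 127) matches the dimension quoted later in the report.

```python
import numpy as np

RHO_FIELD = 0.15       # value quoted in the report
RHO_PARTNERS = 0.35    # value quoted in the report
HOLENOISEFRAC = 0.65   # value quoted in the report


def simulate_gross(m18, sd18, par, rng):
    """Integer gross scores, shape (teams, 2, 9), for one simulated flight-day.

    m18, sd18: (teams, 2) arrays of 18-hole gross-vs-par mean and SD per player.
    par: (9,) integer array of hole pars.
    """
    teams = m18.shape[0]
    var9 = sd18**2 / 2.0                              # ASSUMPTION: 9-hole variance = half of 18-hole
    round_sd = np.sqrt((1 - HOLENOISEFRAC) * var9)    # ASSUMPTION: round-level share of variance
    hole_sd = np.sqrt(HOLENOISEFRAC * var9 / 9.0)     # ASSUMPTION: i.i.d. per-hole share

    field = rng.standard_normal()                     # 1 flight-wide shock
    partner = rng.standard_normal((teams, 1))         # 1 shock per team
    own = rng.standard_normal((teams, 2))             # 1 shock per player
    scatter = rng.standard_normal((teams, 2, 9))      # 1 shock per player-hole

    # ASSUMPTION: field, partner and own shares sum to 1, so round_z has unit variance.
    rho_own = 1.0 - RHO_FIELD - RHO_PARTNERS
    round_z = (np.sqrt(RHO_FIELD) * field
               + np.sqrt(RHO_PARTNERS) * partner
               + np.sqrt(rho_own) * own)

    # Spread the round-level shock evenly over the 9 holes, then add per-hole scatter.
    per_hole = (m18 / 18.0 + round_sd * round_z / 9.0)[:, :, None] + hole_sd[:, :, None] * scatter
    gross = np.rint(par + per_hole).astype(int)
    return np.clip(gross, par - 2, par + 7)           # clip range stated in the report


rng = np.random.default_rng(20260609)
par = np.array([4, 4, 3, 5, 4, 4, 3, 5, 4])           # hypothetical card
m18 = np.full((6, 2), 14.0)                           # hypothetical player means
sd18 = np.full((6, 2), 4.0)                           # hypothetical player SDs
gross = simulate_gross(m18, sd18, par, rng)
```

Net scoring, better-ball selection, and match points are left out; the sketch only shows how the shared and idiosyncratic shocks enter each hole before rounding and clipping.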

Correctness checks (all pass):

Efficiency audit. The production estimator is plain i.i.d. pseudo-random Monte Carlo. It uses a single fixed seed, so runs are reproducible, but within a scenario sweep (rho=0 vs rho=0.35) it draws fresh randomness on each call rather than reusing a common stream, so the reported partner-correlation deltas carry independent MC noise from both arms. Switching those paired comparisons to common random numbers (same shock stream, only the parameter changes) would sharpen every delta; see §3 (variance reduction) and §5 (sensitivity). The core sim uses no antithetic variates, control variates, or QMC.
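The effect of common random numbers on a paired delta can be illustrated with a toy better-ball comparison. Everything in this sketch is a stand-in: `p_win`, its `edge` parameter, and the two-team setup are hypothetical and much simpler than the production flight sim. Only the idea carries over: both arms of the rho=0 vs rho=0.35 comparison are evaluated on the same shock array.

```python
import numpy as np


def p_win(rho, z, edge=-0.3):
    """Toy better-ball: team A (mean `edge`, partner correlation rho) vs team B (mean 0).

    z: (n, 5) standard normals = [A partner shock, A own 1, A own 2, B player 1, B player 2].
    Lower score wins.
    """
    a = edge + np.sqrt(rho) * z[:, [0]] + np.sqrt(1 - rho) * z[:, 1:3]   # (n, 2)
    b = z[:, 3:5]
    return np.mean(a.min(axis=1) < b.min(axis=1))


def delta_sd(n, reps, crn, seed=0):
    """SD across replications of the estimated delta P(win | rho=0.35) - P(win | rho=0)."""
    rng = np.random.default_rng(seed)
    deltas = []
    for _ in range(reps):
        z0 = rng.standard_normal((n, 5))
        z1 = z0 if crn else rng.standard_normal((n, 5))   # CRN reuses the same shocks
        deltas.append(p_win(0.35, z1) - p_win(0.0, z0))
    return np.std(deltas, ddof=1)


sd_fresh = delta_sd(n=2000, reps=200, crn=False)
sd_crn = delta_sd(n=2000, reps=200, crn=True)
print(f"SD of delta: fresh streams {sd_fresh:.4f}, common random numbers {sd_crn:.4f}")
```

With shared shocks, most simulated matches resolve the same way in both arms, so their noise cancels in the difference and only the matches that flip contribute variance.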

| Flight | Favorite | p(win) | Binomial SE @60k | Batch-means SE @60k | R-hat (8 streams) |
|---|---|---|---|---|---|
| 9 | Wood + Estes | 0.419 | 0.00201 | 0.00172 | 1.000 |
| 1 | Vola + Kerns | 0.235 | 0.00173 | 0.00185 | 1.000 |

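The batch-means SE (30 non-overlapping batches) tracks the binomial SE on both flights, which is the empirical signature of independent draws: positive autocorrelation would push batch-means well above the binomial value. With only 30 batches the batch-means figure is itself noisy (roughly ±13% relative), so the gaps in the table (0.00172 vs 0.00201, 0.00185 vs 0.00173) are consistent with that noise.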
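A minimal sketch of the batch-means check, using synthetic win indicators rather than production output. The "sticky" series is a deliberately autocorrelated stand-in (each draw repeated 10 times) included only to show what a failure would look like.

```python
import numpy as np


def binomial_se(wins):
    """SE of a win-probability estimate assuming independent draws."""
    p = wins.mean()
    return np.sqrt(p * (1 - p) / wins.size)


def batch_means_se(wins, n_batches=30):
    """SE estimated from the spread of non-overlapping batch means."""
    usable = wins.size - wins.size % n_batches
    means = wins[:usable].reshape(n_batches, -1).mean(axis=1)
    return means.std(ddof=1) / np.sqrt(n_batches)


rng = np.random.default_rng(20260609)
iid = (rng.random(60_000) < 0.419).astype(float)                    # independent draws
sticky = np.repeat((rng.random(6_000) < 0.419).astype(float), 10)   # autocorrelated draws

for name, wins in (("iid", iid), ("sticky", sticky)):
    print(name, binomial_se(wins), batch_means_se(wins))
```

For independent draws the two numbers agree up to the noise of a 30-batch estimate; for the sticky series the batch-means SE is inflated because the binomial formula overstates the effective sample size.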

2. Convergence diagnostics & error bars

For a win-probability estimate p̂ from N i.i.d. sims, the Monte-Carlo standard error is √(p(1−p)/N) and the 95% half-width is 1.96×that. The table gives the half-width for the two focus favorites across N, plus the sims required to hit a target precision.

Flight 9 (Wood + Estes — clear-favorite flight) — favorite Wood + Estes, p(win)≈0.419:

| N sims | 95% half-width on p(win) |
|---|---|
| 5,000 | ±1.37 pts |
| 10,000 | ±0.97 pts |
| 20,000 | ±0.68 pts |
| 40,000 | ±0.48 pts |
| 60,000 | ±0.39 pts |
| 100,000 | ±0.31 pts |

Flight 1 (Vola + Kerns — coin-flip flight) — favorite Vola + Kerns, p(win)≈0.235:

| N sims | 95% half-width on p(win) |
|---|---|
| 5,000 | ±1.17 pts |
| 10,000 | ±0.83 pts |
| 20,000 | ±0.59 pts |
| 40,000 | ±0.42 pts |
| 60,000 | ±0.34 pts |
| 100,000 | ±0.26 pts |

Sims needed for a target 95% half-width (favorite p≈0.23):

| Target half-width | Sims required |
|---|---|
| ±1.0 pts | 6,897 |
| ±0.5 pts | 27,587 |
| ±0.2 pts | 172,419 |
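The half-widths and sim counts in the tables above follow from the binomial standard-error formula. A small sketch (function names are illustrative) that reproduces them to rounding:

```python
import math

Z95 = 1.96  # two-sided 95% normal quantile used in the report


def half_width_pts(p, n):
    """95% half-width on p(win), in percentage points, from n i.i.d. sims."""
    return 100 * Z95 * math.sqrt(p * (1 - p) / n)


def sims_required(p, target_pts):
    """Smallest n whose 95% half-width is at most target_pts percentage points."""
    return math.ceil(Z95**2 * p * (1 - p) / (target_pts / 100) ** 2)


print(half_width_pts(0.419, 5_000))    # Flight 9 favorite at 5k sims
print(half_width_pts(0.419, 60_000))   # Flight 9 favorite at 60k sims
print(sims_required(0.2347, 1.0))      # Flight 1 favorite, target of ±1.0 pts
```

Because the half-width scales as 1/√N, halving it costs four times the sims, which is why the ±0.5 pt row is four times the ±1.0 pt row. The required-sims figures depend on the exact p used, so they match the table only to within a few sims.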

Recommendation on sim count. For *ranking* teams within a flight and producing the bid sheet, 10k–20k sims already pins the favorites' P(win) to ±0.6–1.0 pts (95% half-width, tables above; tighter for lower-probability teams), which is far below the partner-correlation and parameter-uncertainty effects the model genuinely carries. 40–60k is justified only if you want the advance-to-shootout *tail* probabilities and the cross-flight fair-value aggregates smooth to <0.2 pts. Past 60k you are polishing MC noise that is already ~9–14× smaller than the parameter uncertainty at 15k sims, and about twice that ratio at 60k (see §7): wasted compute. With the control-variate + antithetic stack below, 60k-equivalent precision is projected to be reachable at ~15–20k raw sims on clear-favorite flights.

3. Variance reduction (with measured VRFs)

We measure each technique's variance-reduction factor (VRF) = Var(plain estimator)/Var(method estimator), estimated from 80 independent replications of an 8,000-sim estimate of the favorite's P(win). VRF = X means the method needs ~X× fewer sims for the same precision (equivalently, ESS is X× larger).

Methods: antithetic variates (negate the entire 127-dim shock vector for the paired half); control variate using the favorite's expected match points (cheap, ~known mean, strongly correlated with winning); stratified sampling on the 1-D field/day shock (50 proportional strata); Latin Hypercube on the structured field+partner shocks (the low-dimensional, high-leverage part).
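The VRF measurement can be sketched on a toy integrand. The win indicator below (a threshold on the normalized sum of 127 shocks) is a hypothetical stand-in that is exactly monotone in the shocks, so it flatters antithetic variates relative to the real flight sim; the replication harness is the part that mirrors the procedure described above.

```python
import numpy as np

DIM = 127  # shock dimension quoted in the report


def win_indicator(z, threshold):
    """Toy win rule: lower aggregate shock wins. Not the production rule."""
    return (z.sum(axis=1) / np.sqrt(z.shape[1]) < threshold).astype(float)


def plain_estimate(n, rng, threshold):
    return win_indicator(rng.standard_normal((n, DIM)), threshold).mean()


def antithetic_estimate(n, rng, threshold):
    """Same budget of n evaluations: n/2 shock vectors plus their negations."""
    z = rng.standard_normal((n // 2, DIM))
    return 0.5 * (win_indicator(z, threshold).mean() + win_indicator(-z, threshold).mean())


def vrf(method, n=8000, reps=80, threshold=-0.2, seed=20260609):
    """Var(plain) / Var(method), each estimated from `reps` independent replications."""
    rng = np.random.default_rng(seed)
    plain = [plain_estimate(n, rng, threshold) for _ in range(reps)]
    other = [method(n, rng, threshold) for _ in range(reps)]
    return np.var(plain, ddof=1) / np.var(other, ddof=1)


print("antithetic VRF on the toy integrand:", vrf(antithetic_estimate, n=2000))
```

Because both variances are estimated from a finite number of replications, the ratio itself is noisy; repeating the measurement with a different seed gives a sense of how much a single VRF figure can move.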

Flight 9 (Wood + Estes — clear-favorite flight) — favorite Wood + Estes:

| Technique | Var(estimator) | VRF (×) | Interpretation |
|---|---|---|---|
| Plain pseudo-random (baseline) | 2.914e-05 | 1.00 | baseline |
| Antithetic variates | 1.448e-05 | 2.01 | 2.0× fewer sims |
| Control variate (E[match pts]) | 1.408e-05 | 2.07 | 2.1× fewer sims |
| Stratified field shock | 2.641e-05 | 1.10 | 1.1× fewer sims |
| Latin Hypercube (field+partner) | 2.149e-05 | 1.36 | 1.4× fewer sims |

Flight 1 (Vola + Kerns — coin-flip flight) — favorite Vola + Kerns:

| Technique | Var(estimator) | VRF (×) | Interpretation |
|---|---|---|---|
| Plain pseudo-random (baseline) | 1.639e-05 | 1.00 | baseline |
| Antithetic variates | 1.560e-05 | 1.05 | 1.1× fewer sims |
| Control variate (E[match pts]) | 1.352e-05 | 1.21 | 1.2× fewer sims |
| Stratified field shock | 1.589e-05 | 1.03 | 1.0× fewer sims |
| Latin Hypercube (field+partner) | 1.659e-05 | 0.99 | 1.0× fewer sims |

Reading the result. The control variate is the standout: the favorite's expected match-point total is almost a sufficient statistic for whether it wins the flight, so regressing the win indicator on it removes a large share of variance at near-zero extra cost. Antithetic variates give a solid, free boost on Flight 9 because the win indicator is close to monotone in the aggregate shock. Stratification and LHS on the field shock help less than one might expect: the field/day shock largely *cancels within a match* (both teams feel it), so it drives cross-match point totals but not single-match outcomes, and the bulk of the variance lives in the 120 idiosyncratic per-hole/own-form dimensions, which stratification on 1 dimension cannot touch. With 80 replications per arm, each VRF carries roughly ±20% relative sampling error, so values between about 0.8 and 1.2 (most of the Flight 1 column) are not distinguishable from 1. Common random numbers (already partly used inside a match) should additionally be applied across scenario reruns; this is the single highest-leverage change for the partner-correlation deltas the production report publishes.
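The regression-adjusted (control-variate) estimator can be sketched as below. The control here is a synthetic standard-normal quantity with known mean 0 that is strongly correlated with the toy win indicator; it stands in for the favorite's expected match points, whose actual construction in mcengine.py is not shown in this report. The weight `w` and threshold are illustrative assumptions.

```python
import numpy as np


def cv_estimate(y, c, c_mean):
    """Control-variate estimate of E[y] given a control c with known mean c_mean."""
    beta = np.cov(y, c, ddof=1)[0, 1] / np.var(c, ddof=1)   # fitted regression slope
    return y.mean() - beta * (c.mean() - c_mean)


def replicate(n=2000, reps=300, w=0.9, threshold=-0.2, seed=20260609):
    """Var(plain) / Var(control-variate) across independent replications of a toy problem."""
    rng = np.random.default_rng(seed)
    plain, adjusted = [], []
    for _ in range(reps):
        c = rng.standard_normal(n)                          # control, known mean 0
        noise = rng.standard_normal(n)
        y = (w * c + np.sqrt(1 - w**2) * noise < threshold).astype(float)
        plain.append(y.mean())
        adjusted.append(cv_estimate(y, c, 0.0))
    return np.var(plain, ddof=1) / np.var(adjusted, ddof=1)


print("control-variate VRF on the toy problem:", replicate())
```

Estimating the slope from the same sample introduces a small bias of order 1/n, which is negligible at the sample sizes used here. The achievable VRF is 1/(1 − ρ²), where ρ is the correlation between the win indicator and the control, so the technique pays off only when that correlation is high.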

Recommended stack: antithetic + expected-points control variate + CRN across scenarios. The first two are near-independent, so on a clear-favorite flight (Flight 9) they should compound to ~3–4× if the gains multiply (the combined stack is not measured separately in the tables above), which would give 60k-quality precision from ~15–20k raw draws. On a flat coin-flip flight (Flight 1) the gains are smaller (~1.2–1.3×): with six near-equal teams each p(win) sits near 1/6, so the win indicator is weakly correlated with any single control and nearly symmetric, leaving less variance for these techniques to remove. The honest summary: variance reduction is a real, free win on the flights where one team separates, and a modest one where the flight is a scramble, but CRN across scenario reruns helps everywhere.

4. Quasi-Monte Carlo convergence

Quasi-Monte Carlo replaces pseudo-random draws with a low-discrepancy sequence (scrambled Sobol' or Halton) that fills the 127-dimensional unit cube more evenly. We map each point through the inverse normal CDF into the four shock blocks and measure RMSE of the favorite's P(win) vs a 250k-sim ground truth, averaged over 16 independent scramblings, across N.
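A sketch of the sampling step using SciPy's scrambled Sobol' generator (`scipy.stats.qmc.Sobol`) and the inverse normal CDF. The toy win rule and the helper names are placeholders, so the printed RMSE values say nothing about the production flights; the block only shows how low-discrepancy points are mapped into the shock vector and how an RMSE over independent scramblings is formed.

```python
import numpy as np
from scipy.stats import norm, qmc

DIM = 127  # 1 field + 6 partner + 12 own-form + 108 per-hole shocks


def sobol_normals(m, seed):
    """2**m scrambled Sobol' points mapped to standard normals, shape (2**m, DIM)."""
    sampler = qmc.Sobol(d=DIM, scramble=True, seed=seed)
    u = sampler.random_base2(m)
    u = np.clip(u, 1e-12, 1 - 1e-12)      # guard the inverse CDF against exact 0 or 1
    return norm.ppf(u)


def pseudo_normals(m, seed):
    """Pseudo-random baseline with the same sample size."""
    return np.random.default_rng(seed).standard_normal((2**m, DIM))


def toy_win_prob(z, threshold=-0.2):
    """Placeholder integrand: win when the normalized shock sum is below threshold."""
    return np.mean(z.sum(axis=1) / np.sqrt(DIM) < threshold)


def rmse(draw, m, truth, n_rep=16):
    """RMSE against a known truth over n_rep independent seeds or scramblings."""
    errs = [toy_win_prob(draw(m, seed)) - truth for seed in range(n_rep)]
    return float(np.sqrt(np.mean(np.square(errs))))


truth = norm.cdf(-0.2)                     # exact answer for the toy integrand
for m in (8, 10, 12):
    print(2**m, rmse(pseudo_normals, m, truth), rmse(sobol_normals, m, truth))
```

Sobol' sample sizes are kept at powers of two (`random_base2`) because the sequence's balance properties hold at those sizes. The order in which shock blocks are assigned to Sobol' coordinates matters, since the earliest coordinates are the most uniformly covered.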

Flight 9 (Wood + Estes — clear-favorite flight) — favorite Wood + Estes, truth p=0.4204, dimension D=127:

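| N | Pseudo RMSE | Sobol' RMSE | Halton RMSE | Sobol' speedup (pseudo RMSE / Sobol' RMSE) |
|---|---|---|---|---|
| 256 | 0.03679 | 0.01758 | 0.01849 | 2.09× |
| 512 | 0.02387 | 0.01098 | 0.01456 | 2.17× |
| 1,024 | 0.01631 | 0.00652 | 0.00915 | 2.50× |
| 2,048 | 0.00975 | 0.00574 | 0.00636 | 1.70× |
| 4,096 | 0.00912 | 0.00421 | 0.00505 | 2.17× |
| 8,192 | 0.00588 | 0.00386 | 0.00411 | 1.52× |
| 16,384 | 0.00374 | 0.00376 | 0.00312 | 1.00× |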

Fitted convergence rate (slope of log RMSE vs log N): pseudo N^−0.53, Sobol' N^−0.37, Halton N^−0.44. Theory: pseudo → −0.5, QMC → up to −1.0.
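The fitted rates are ordinary least-squares slopes on log-log axes. A short sketch using the Flight 9 RMSE values from the table above (the function name is illustrative):

```python
import numpy as np

n = np.array([256, 512, 1024, 2048, 4096, 8192, 16384])
rmse_flight9 = {
    "pseudo": [0.03679, 0.02387, 0.01631, 0.00975, 0.00912, 0.00588, 0.00374],
    "sobol":  [0.01758, 0.01098, 0.00652, 0.00574, 0.00421, 0.00386, 0.00376],
    "halton": [0.01849, 0.01456, 0.00915, 0.00636, 0.00505, 0.00411, 0.00312],
}


def fitted_rate(n, err):
    """Slope of log(RMSE) against log(N); -0.5 is the plain Monte-Carlo rate."""
    slope, _intercept = np.polyfit(np.log(n), np.log(err), 1)
    return slope


for name, err in rmse_flight9.items():
    print(f"{name}: N^{fitted_rate(n, err):.2f}")
```

With only seven points per curve and RMSE values that are themselves noisy (16 scramblings each), the second decimal of each slope should be read loosely.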

Flight 1 (Vola + Kerns — coin-flip flight) — favorite Vola + Kerns, truth p=0.2347, dimension D=127:

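| N | Pseudo RMSE | Sobol' RMSE | Halton RMSE | Sobol' speedup (pseudo RMSE / Sobol' RMSE) |
|---|---|---|---|---|
| 256 | 0.02688 | 0.01161 | 0.01604 | 2.32× |
| 512 | 0.01290 | 0.01054 | 0.01097 | 1.22× |
| 1,024 | 0.00998 | 0.00998 | 0.00973 | 1.00× |
| 2,048 | 0.00919 | 0.00659 | 0.00710 | 1.39× |
| 4,096 | 0.00617 | 0.00496 | 0.00430 | 1.24× |
| 8,192 | 0.00605 | 0.00250 | 0.00343 | 2.42× |
| 16,384 | 0.00281 | 0.00248 | 0.00235 | 1.13× |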

Fitted convergence rate (slope of log RMSE vs log N): pseudo N^−0.45, Sobol' N^−0.42, Halton N^−0.46. Theory: pseudo → −0.5, QMC → up to −1.0.

Reading the result: a nuanced win that fades. Two facts look contradictory until you separate *level* from *slope*. (1) On Flight 9, at every budget from 256 to ~8k draws, scrambled Sobol' delivers 1.5–2.5× lower RMSE than pseudo-random on the favorite's P(win), a real, free accuracy gain; on Flight 1 the gain is less consistent (1.0–2.4× over the same range). (2) Its fitted convergence slope is nowhere near the theoretical O(N⁻¹): it flattens to roughly N^−0.37 (Flight 9) and N^−0.42 (Flight 1), shallower than pseudo-random's N^−0.53 and N^−0.45 (textbook N^−0.5), so the two curves converge and by N≈16k the Sobol' advantage has largely washed out. The cause is effective dimension: the win indicator depends on 127 standard normals (1 field + 6 partner + 12 own-form + 108 per-hole scatter), and the 108 per-hole scatter dimensions stay individually consequential (integer rounding + clipping make single holes pivotal). Sobol' front-loads its uniformity into the first coordinates (here the field + partner + own-form shocks), which is why it helps at low N; but a discontinuous, high-effective-dimension integrand denies it the smooth O(N⁻¹) regime, so the gain does not compound. Recommendation: Sobol' is a worthwhile, zero-cost drop-in *if* you operate at small N (≤4–8k); pair it with the §3 stack. But it is not a substitute for variance reduction and brings little once you are already at 20k+. The robust efficiency lever here is the antithetic + control-variate + CRN stack, not QMC.

5. Global sensitivity analysis (Sobol' indices)

We perform a Saltelli/Sobol' variance decomposition of the favorite's P(win) over five model inputs, each given a uniform prior around its production value: RHO_FIELD [0.05,0.30], RHO_PARTNERS [0.15,0.55], HOLENOISEFRAC [0.50,0.80], a favorite mean shift [−1.5,+1.5] strokes (18h-equiv applied to the favorite team's own two players), and a favorite SD scale [0.85,1.15] (likewise). The two player perturbations target the favorite's own team because a field-wide shift cancels in relative standings, and we want each input to measure a decision-relevant uncertainty. First-order index S1 = share of output variance explained by that input alone; total-effect ST = share including all interactions. We use common random numbers across every model evaluation, so the indices isolate parameter effects from MC noise (Jansen estimators, Sobol' base N=256 ⇒ 7×256 evals of 16k sims each).
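A minimal sketch of the Jansen estimators in the Saltelli layout, checked on a linear toy model whose indices are known analytically (S1 = ST = 0.8, 0.2, 0 for f = 2·x0 + x1 with uniform inputs). Two simplifications are assumptions of this sketch: the base matrices are pseudo-random rather than a Sobol' sequence, and the model is deterministic, whereas the report evaluates a 16k-sim flight estimate under common random numbers at each point.

```python
import numpy as np


def jansen_indices(model, lower, upper, n_base, rng):
    """First-order (S1) and total-effect (ST) indices via the Jansen estimators.

    model: maps an (n, k) input array to n outputs.
    Cost: n_base * (k + 2) model evaluations.
    """
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    k = lower.size
    a = lower + (upper - lower) * rng.random((n_base, k))   # base matrix A
    b = lower + (upper - lower) * rng.random((n_base, k))   # base matrix B
    f_a, f_b = model(a), model(b)
    total_var = np.var(np.concatenate([f_a, f_b]), ddof=1)

    s1, st = np.empty(k), np.empty(k)
    for i in range(k):
        ab = a.copy()
        ab[:, i] = b[:, i]                                  # A with column i taken from B
        f_ab = model(ab)
        s1[i] = 1.0 - np.mean((f_b - f_ab) ** 2) / (2.0 * total_var)
        st[i] = np.mean((f_a - f_ab) ** 2) / (2.0 * total_var)
    return s1, st


def toy_model(x):
    """Linear toy: the third input is inert by construction."""
    return 2.0 * x[:, 0] + x[:, 1]


rng = np.random.default_rng(20260609)
s1, st = jansen_indices(toy_model, [0, 0, 0], [1, 1, 1], n_base=4096, rng=rng)
print("S1:", np.round(s1, 3), "ST:", np.round(st, 3))
```

With five inputs the same layout costs 7 × n_base evaluations, matching the 7×256 count quoted above. The estimators are noisy at small n_base, so an estimated ST can land slightly below the estimated S1 even though the true total effect can never be smaller than the first-order effect.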

Flight 9 (Wood + Estes — clear-favorite flight) — favorite Wood + Estes (baseline p≈0.423, output variance over the prior box = 0.0017):

| Input | First-order S1 | Total-effect ST |
|---|---|---|
| Player mean shift | 0.380 | 0.401 |
| Player SD scale | 0.420 | 0.422 |
| RHO_PARTNERS | 0.029 | 0.034 |
| HOLENOISEFRAC | 0.149 | 0.153 |
| RHO_FIELD | 0.028 | 0.004 |

Flight 1 (Vola + Kerns — coin-flip flight) — favorite Vola + Kerns (baseline p≈0.232, output variance over the prior box = 0.0018):

| Input | First-order S1 | Total-effect ST |
|---|---|---|
| Player mean shift | 0.414 | 0.424 |
| Player SD scale | 0.563 | 0.562 |
| RHO_PARTNERS | 0.013 | 0.004 |
| HOLENOISEFRAC | 0.023 | 0.020 |
| RHO_FIELD | 0.008 | 0.001 |

Reading the result. The favorite's own player inputs dominate. On Flight 9 the mean-level and SD-scale of Wood + Estes together account for ≈82% of the win-probability variance over the prior box; on Flight 1 the favorite's level + consistency account for ≈99%. In plain terms: *how good we think the favorite's two players actually are (their scoring level and their consistency) drives the answer far more than any correlation or noise constant.* Among the three structural knobs, the leading one is HOLENOISEFRAC (ST 0.15 on Flight 9): it sets how much of each player's variance lands as un-averaged 9-hole scatter, which is exactly what does or doesn't separate teams over a short match. RHO_PARTNERS is a minor contributor (it tunes the better-ball smoothing the favorite keeps), and RHO_FIELD is essentially inert for a single team's win probability because the field/day shock cancels within every match. ST≈S1 for the inputs that matter, so interactions are small; where the table shows ST below S1 (e.g. RHO_FIELD), that is estimator noise at base N=256, since in theory ST ≥ S1. Implication: modeling effort and any future data collection should target the player scoring distributions (recency, shrinkage, sample depth) and, among structural assumptions, the per-hole variance fraction HOLENOISEFRAC; fine-tuning RHO_FIELD is wasted effort.

6. Dependence modeling: copulas & the run-the-table tail

The production model couples teammates additively (a shared Gaussian partner shock). We compare that against a Gaussian copula and a heavy-tailed t-copula (df=4) on the two teammates' round-form, holding each player's marginal variance fixed and matching the realized teammate correlation (≈0.41 of round-form). Only the *joint tail dependence* changes. We read the effect on the 'run the table' tail — P(a team wins all 5 matches) — which feeds the shootout-advance probability.
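The two copulas can be sketched for a single pair of teammates as follows. This is an illustration under stated assumptions, not the report's engine: the function name is hypothetical, the same copula parameter rho=0.41 is used for both models without recalibrating the t-copula to match the realized correlation exactly, and the "both teammates better than 2 SD" event is just a convenient way to see joint tail dependence.

```python
import numpy as np
from scipy.stats import norm, t as student_t


def teammate_form(n, rho, rng, df=None):
    """(n, 2) round-form shocks with N(0,1) margins.

    df=None gives a Gaussian copula; otherwise a t-copula with `df` degrees of freedom.
    """
    cov = np.array([[1.0, rho], [rho, 1.0]])
    z = rng.multivariate_normal(np.zeros(2), cov, size=n)
    if df is None:
        return z
    w = rng.chisquare(df, size=(n, 1)) / df          # shared mixing variable
    u = student_t.cdf(z / np.sqrt(w), df)            # t-copula uniforms
    u = np.clip(u, 1e-12, 1 - 1e-12)                 # keep the inverse CDF finite
    return norm.ppf(u)                               # back to N(0,1) margins


def joint_good_day(x, cutoff=-2.0):
    """Share of draws where both teammates beat `cutoff` (lower is better)."""
    return np.mean((x < cutoff).all(axis=1))


rng = np.random.default_rng(20260609)
gauss = teammate_form(200_000, 0.41, rng)
heavy = teammate_form(200_000, 0.41, rng, df=4)
print("P(both < -2 SD): Gaussian", joint_good_day(gauss), " t(df=4)", joint_good_day(heavy))
```

Each player's marginal distribution is identical under both models; only the frequency of joint extremes changes, which is the quantity a run-the-table probability could in principle be sensitive to.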

Flight 9 (Wood + Estes — clear-favorite flight) — favorite Wood + Estes:

| Dependence model | Favorite P(win flight) | Favorite P(win all 5) |
|---|---|---|
| Additive shock (production) | 42.1% | 38.55% |
| Gaussian copula | 42.2% | 38.56% |
| t-copula (df=4) | 42.5% | 38.94% |

Flight 1 (Vola + Kerns — coin-flip flight) — favorite Vola + Kerns:

| Dependence model | Favorite P(win flight) | Favorite P(win all 5) |
|---|---|---|
| Additive shock (production) | 23.3% | 19.28% |
| Gaussian copula | 23.5% | 19.55% |
| t-copula (df=4) | 23.2% | 19.23% |

Reading the result. Two findings. First, the Gaussian copula reproduces the additive model essentially exactly — the additive shared-shock construction *is* a Gaussian dependence, so this is a clean internal-consistency check that our copula machinery and the production model agree where they must. Second, switching to a t-copula (df=4) — same correlation, heavier *joint tails* so teammates boom or bust together more often — moves the run-the-table tail only modestly (a few tenths of a point on the favorite, direction depending on the flight) and leaves the flight-win probability essentially unchanged. Implication: for THIS field and format, the choice between Gaussian and t dependence is a genuinely second-order effect — the production additive structure is defensible. The sensitivity is real but small because (a) the 9-hole better-ball already injects large idiosyncratic variance that swamps the teammate tail-dependence, and (b) winning all 5 matches is dominated by *level* (how good the team is), not by the fine structure of how its two players' off-days co-move. The right takeaway is the *method*: when a tail probability (run-the-table, shootout-advance) drives real money, stress-test it under a t-copula rather than assuming the additive Gaussian is exact — here that test passes.

7. Two-level uncertainty propagation (headline teams)

Single-level MC reports a P(win) as if the player distributions were known exactly. They are not: each player's mean/SD is estimated from a finite, recency-weighted sample (effective n). We run a two-level (nested / posterior-predictive) MC — outer loop resamples every player's (mean, SD) from its sampling distribution (mean SE = sd/√neff, SD SE = sd/√(2·neff)); inner loop runs the flight with common random numbers so the band reflects *parameter* uncertainty, not MC noise. This converts each headline P(win) into a full uncertainty band.
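The two-level loop can be sketched as below. The inner simulator is a deliberately crude stand-in (one normal score per team, lowest score wins) and the inputs are per team rather than per player, both assumptions of this sketch; the sampling-distribution formulas for the mean and SD are the ones quoted above, and the inner shocks are drawn once and reused so the band reflects parameter uncertainty only.

```python
import numpy as np


def inner_p_win(means, sds, z):
    """Toy inner sim: the team with the lowest simulated score wins; team 0 is the favorite.

    z: (n_inner, teams) standard normals, held fixed across outer draws (CRN).
    """
    scores = means + sds * z
    return np.mean(scores.argmin(axis=1) == 0)


def two_level_band(mean_hat, sd_hat, n_eff, n_outer=200, n_inner=15_000, seed=20260609):
    """5-95% band and SD of P(win) induced by uncertainty in (mean, SD)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n_inner, mean_hat.size))         # drawn once, reused throughout
    draws = np.empty(n_outer)
    for j in range(n_outer):
        mu = rng.normal(mean_hat, sd_hat / np.sqrt(n_eff))     # mean SE = sd / sqrt(n_eff)
        sd = rng.normal(sd_hat, sd_hat / np.sqrt(2 * n_eff))   # SD SE = sd / sqrt(2 n_eff)
        draws[j] = inner_p_win(mu, np.abs(sd), z)              # abs() guards a rare negative SD
    return np.percentile(draws, [5, 95]), draws.std(ddof=1)


mean_hat = np.array([-1.0, 0.0, 0.0, 0.0, 0.0, 0.0])   # hypothetical: favorite 1 stroke better
sd_hat = np.full(6, 3.0)                                # hypothetical SDs
n_eff = np.full(6, 12.0)                                # hypothetical effective sample sizes
(band_lo, band_hi), param_sd = two_level_band(mean_hat, sd_hat, n_eff, n_outer=100, n_inner=4000)
print(f"5-95% band: {band_lo:.3f} to {band_hi:.3f}, param SD {param_sd:.3f}")
```

Comparing `param_sd` with the binomial MC standard error at the same inner sample size gives the Param/MC ratio reported in the table below.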

| Team | Flight | Point P(win) | Param-uncertainty band (5–95%) | Param SD | MC-only SD @15k | Param/MC ratio |
|---|---|---|---|---|---|---|
| Wood + Estes | 9 | 41.9% | 34.6% – 46.3% | 3.66 pts | 0.41 pts | 8.9× |
| Vola + Kerns | 1 | 23.5% | 16.0% – 30.6% | 4.60 pts | 0.32 pts | 14.2× |

Reading the result. Parameter uncertainty is an order of magnitude larger than Monte-Carlo noise at 15k sims. Wood + Estes is a genuine favorite, but its honest interval (34.6%–46.3%) is wide because the flight's outcome hinges on player scoring levels we only know to ±1–2 strokes. Vola + Kerns sits in a true coin-flip flight where the band (16.0%–30.6%) overlaps several rivals. Implication: publish P(win) with these bands, and stop spending sims to shrink an MC error that is already ~9–14× smaller than the parameter uncertainty, which no number of additional sims can reduce.

8. Recommendation: the optimal simulation design

Optimal simulation design for the 2026 Calcutta sim:

1. Sim count: 15k–20k as the production default (down from 60k). At 20k every P(win) is pinned to within about ±0.7 pt (95% half-width; ±0.59–0.68 for the two focus favorites in §2), far inside the parameter-uncertainty band (§7). Keep 40–60k only for the final cross-flight fair-value aggregation and the shootout-advance tails, where you want the last decimal smooth.

2. Sampler: QMC is optional, not a priority. Scrambled Sobol' gives a real 1.5–2.5× RMSE reduction at small budgets (≤8k) on Flight 9 (less consistently on Flight 1) and is a zero-cost drop-in, so use it if you run small batches; but its slope flattens (N^−0.37 vs pseudo N^−0.53 on Flight 9) and the gain washes out by ~16k. Do not treat QMC as a substitute for the variance-reduction stack.

3. Variance-reduction stack: antithetic variates + expected-points control variate + common random numbers across every scenario rerun (rho sweeps, re-pricing). Measured VRFs reach ~2.1× on the clear-favorite flight (control variate ~2.1×, antithetic ~2.0×, each measured alone) and are smaller on the coin-flip flight. CRN across scenario reruns is the single biggest win for the partner-correlation deltas the production report already publishes; adopt it there immediately.

4. Correlation model: keep the additive partner shock for the central estimate (it equals a Gaussian copula to within MC noise — verified in §6). Treat the t-copula as a stress test for tail/shootout EV, not a replacement; on this field the additive structure passes that test.

5. Always report parameter-uncertainty bands (two-level MC) alongside P(win). They are ~9–14× the MC noise and are the honest measure of what we know.

6. Sensitivity priority: Sobol' indices show the favorite's own player scoring level and consistency drive ~80–99% of its win-probability variance, far more than any correlation/noise constant. Among structural knobs, HOLENOISEFRAC leads and RHO_FIELD is inert. Spend calibration effort on the player distributions and the per-hole variance fraction, not on RHO_FIELD.

Assumptions & limitations