Independent Statistical Audit

2026 Member-Member Calcutta — Expert Statistical Review

Independent statistical audit of the valuation model. All numbers are computed from the project's own data (data/processed/, data/raw/history/, valuation/). Scripts: prep.py, models.py, figures.py, build_report.py in this folder. Reproduce with uv run python analysis/stats_expert/{prep,models,figures,build_report}.py.

Executive summary

The single most important empirical fact in this dataset: net match-play results in this event are almost entirely noise, not skill. Three independent tests agree:

No year-over-year persistence. Teams that finished above their flight average in 2024 did not tend to repeat in 2025. The correlation is r = -0.12 (n = 33 repeat teams; bootstrap 95% CI [-0.41, +0.17], which straddles zero). For individual players it is r = -0.14 (n = 121). A team's finish one year tells you essentially nothing about the next.
Handicap does not predict the net outcome. Within a flight, the team's locked handicap explains 0.6% of the variance in match points (R² = 0.006, p = 0.39). That is by design - the net format neutralizes handicap - but it means handicap-driven win-probability spreads must be treated with great caution.
A leave-one-year-out handicap model loses to a uniform guess. Predicting each flight's winner from handicap scores worse than a flat 1-in-6 guess on both log-loss (1.93 vs 1.79) and Brier (0.79 vs 0.76). Handicap offers no exploitable signal for who wins a flight.

What this means for the auction. The live model's win probabilities are already appropriately humble - within-flight they sit at 97% of maximum entropy (nearly uniform), and even the strongest 2026 team is only a ~43% favorite in its flight. That is the right posture. But it also means edge does not come from "we know who wins." It comes from two places the data does support:

A small, real track-record signal. A properly shrunk (empirical-Bayes) player-ability estimate has a between-player SD of about 1.0 points on a points scale whose noise SD is ~4.4 - tiny, but the best signal available, and nearly uncorrelated with handicap (r = 0.05). A 2026 field ability table built from it is in section 7. Use it as a tie-breaker, not a thesis.
The price side, done right. Historical prices are right-censored at the bid cap (14% of 2024 prices were pinned at \$1,000). OLS on censored prices attenuates slopes and understates dispersion; a Tobit model corrects both (σ inflates ~13% on 2024). The live model's hand-tuned PRICE_SPREAD = 1.85 is a crude patch for exactly this - a principled Tobit replaces the fudge.

Bottom line for a sharp partner: Don't pay up for "favorites" inside a flight - the format makes flights close to a lottery, and the two years of results on hand (2024-25) bear that out. Spend the analytical effort on (a) not overpaying relative to the censored-corrected price curve, and (b) leaning, gently, toward teams with a genuine multi-year track record of beating their flight. Everything else is variance.

1. Critical assessment of the current methodology

The live model (valuation/value_model.py) is a careful, well-documented hole-by-hole Monte-Carlo simulation of the confirmed format (round-robin 9-hole net better-ball, points accumulation). Its engineering is sound and its instinct - that 9-hole net match play is high-variance - is correct. Three critiques, in priority order:

(a) It is calibrated to handicap-derived scoring, but the outcome is nearly handicap-independent. Every player's scoring distribution flows from GHIN differentials and the locked handicap. Yet in the two years of actual results (2024-25), handicap carries no information about the net finish (section 3). The simulation's win-probability spread is therefore driven by an axis the historical record says is close to noise. To its credit, the model's output is already very flat (entropy ratio 0.97), so the practical damage is limited - but any temptation to trust a "43% favorite" as if it were real should be resisted. The honest prior is much closer to 1/6, widened only slightly by track record.

(b) The price model uses OLS on censored data, then re-inflates by hand. Prices are capped at the bid limit (\$1,000 in 2024, \$1,800 in 2025, \$2,000 in 2026). OLS treats a \$1,000 censored price as if the team were truly worth exactly \$1,000, which flattens the slope and shrinks the dispersion - precisely the symptom the code comments describe ("raw OLS is far too compressed"). The fix is a textbook Tobit (censored-normal MLE), fit in section 8. PRICE_SPREAD = 1.85 and PRICE_LEVEL_PER_TEAM = 700 are reasonable hand-corrections, but they are unidentified knobs; Tobit estimates the same correction from the data.

(c) The elaborate variance structure (partner/field/hole correlations) is set by assumption, not fit. RHO_PARTNERS, RHO_FIELD, HOLE_NOISE_FRAC are judgment calls. Given the finding in section 3 - that we cannot even detect skill in the results - we certainly cannot estimate these second-order correlations from data. This isn't wrong, but it should be labeled as a reasonable prior, not an inference. The model's existing sensitivity analysis is the right way to handle it.

Net assessment: the model is over-engineered relative to the information content of the data. That is not a knock on the craft - it is a statement that the data has very little to say about who wins, and a model that produces sharp-looking 43% favorites can give a false sense of precision. Its saving grace is that its outputs are nearly flat anyway.

2. Data and methods

Asset	What it is	Used for
`hist_team.csv`	198 team-flight results (2024-25), match points (0-50) + finishing position	repeatability, Plackett-Luce, calibration backtest
`hist_player.csv`	the same exploded to 396 player-seasons, joined to locked handicap (71% matched)	empirical-Bayes player abilities
`teams26.csv`	2026 field (120 teams), locked handicaps, GHIN ids	field ability table
`prices.csv`	2024-25 auction prices, flagged censored at the per-year cap	Tobit price model

Response variable. Within each 6-team flight the round-robin distributes exactly 150 match points, so a team's points are zero-sum within its flight. We center points within flight-year (pts_centered) so "points above the flight average" is the comparable, mean-removed response. Typical within-flight points SD ~ 4.37.
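As a minimal sketch of the centering step, assuming each result row carries a year, a flight and a points value (the field names `year`, `flight` and `pts` and the sample flight below are illustrative, not taken from prep.py):

```python
from collections import defaultdict

def center_within_flight(rows):
    """Add pts_centered = pts minus the mean of that row's flight-year."""
    groups = defaultdict(list)
    for r in rows:
        groups[(r["year"], r["flight"])].append(r["pts"])
    means = {key: sum(v) / len(v) for key, v in groups.items()}
    return [dict(r, pts_centered=r["pts"] - means[(r["year"], r["flight"])])
            for r in rows]

# Hypothetical flight: six teams whose points sum to the 150 available.
rows = [{"year": 2024, "flight": 1, "team": t, "pts": p}
        for t, p in zip("ABCDEF", [31.0, 28.5, 26.0, 24.0, 21.5, 19.0])]
centered = center_within_flight(rows)

for r in centered:
    print(r["team"], r["pts_centered"])
```

Because every flight hands out the same 150 points, the flight mean is always 25 and the centered values sum to zero within each flight-year.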

Methods applied (each justified against our data, not a textbook tour):

Variance components / repeatability - measure how much signal exists before fitting anything elaborate.
Hierarchical empirical-Bayes (partial pooling) for player abilities - the modern shrinkage estimator; tells us how far to trust a player's track record.
Plackett-Luce (the multi-competitor generalization of Bradley-Terry) on within-flight finishing orders - the natural paired-comparison frame for match play, fit to actual results, with an out-of-sample test.
Proper scoring rules (log-loss, Brier) with a leave-one-year-out backtest and a reliability chart - validate, honestly, whether any predictor beats a uniform baseline.
Tobit (censored-normal MLE) vs OLS for prices - the correct estimator under a bid cap.
Bootstrap and Wilson intervals - uncertainty on the repeatability correlation and the Monte-Carlo win probabilities.

We deliberately did not reach for PyMC/Stan: with this little signal, a full Bayesian hierarchy adds machinery without changing conclusions. Empirical-Bayes shrinkage gives the same partial-pooling answer in closed form.

3. Is there any signal? Variance components and repeatability

Figure: year-over-year repeatability scatter. Each point is a team that played in both 2024 and 2025; the axes are its points relative to its flight average in each year. If finishing well were a durable property, the cloud would slope up. It doesn't.

Quantity	Value
Within-flight points SD	4.37
Team year-over-year repeatability r	-0.119 (n=33, p=0.51)
Player year-over-year repeatability r	-0.143 (n=121, p=0.12)
Bootstrap 95% CI on team r	[-0.41, +0.17] (includes 0)
Implied skill share of points variance	~ 0%

Under the standard signal+noise decomposition, the between-year correlation equals the skill share, var_skill / (var_skill + var_noise). With r slightly negative and not significantly different from zero, the estimated skill variance is essentially zero (a variance share cannot be negative, so the estimate sits at its floor): net match-play points behave like draws from a common distribution. A permutation test for repeatability returns p ~ 0.50, which is entirely consistent with pure noise.
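The bootstrap interval on the repeatability correlation can be sketched as below. This is an illustration under stated assumptions, not the code in models.py: it uses a simple percentile bootstrap over team pairs, and the data are simulated pure noise (33 pairs drawn with the reported 4.37 SD), not the project's results.

```python
import math
import random

def pearson_r(x, y):
    """Pearson correlation; returns None if either series has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)

def bootstrap_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for r, resampling (x, y) pairs with replacement."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        r = pearson_r([x[i] for i in idx], [y[i] for i in idx])
        if r is not None:          # skip degenerate resamples
            stats.append(r)
    stats.sort()
    m = len(stats)
    return stats[int(alpha / 2 * m)], stats[min(m - 1, int((1 - alpha / 2) * m))]

# Simulated no-skill world: centered points in two years are independent draws.
rng = random.Random(42)
pts_2024 = [rng.gauss(0, 4.37) for _ in range(33)]
pts_2025 = [rng.gauss(0, 4.37) for _ in range(33)]

r = pearson_r(pts_2024, pts_2025)
lo, hi = bootstrap_ci(pts_2024, pts_2025)
print(f"r = {r:+.3f}, 95% CI [{lo:+.2f}, {hi:+.2f}]")
```

With only 33 pairs the interval is wide whatever the seed, which is why the table reports a CI rather than leaning on the point estimate.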

This does not mean players have no golf skill. It means the net, better-ball, 9-hole, points-accumulation format successfully equalizes the field - the point of a member-member. The residual that decides finishes is dominated by which day you caught lightning.

4. Hierarchical empirical-Bayes player abilities (partial pooling)

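We fit the one-way random-effects model y_ij = mu + a_i + e_ij, where y_ij is player i's centered points in season j, with player effects a_i ~ N(0, τ²) and residuals e_ij ~ N(0, σ²). We then shrink each player's mean toward zero by the optimal factor n_i / (n_i + σ²/τ²), where n_i is the number of seasons observed for player i.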

Component	Estimate
Within-player residual SD (σ)	4.24
Between-player SD (τ)	1.03
Shrinkage constant k = σ²/τ²	17.1
Mean shrinkage weight	0.087

With most players having only 1-2 seasons, the shrinkage weight is ~0.09 - i.e., we pull raw player means ~91% of the way back to the field average. This is the statistically correct response to a tiny between-player variance: believe almost none of a one-year result.
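To make the shrinkage arithmetic concrete, here is a small sketch using the table's k = 17.1 and τ = 1.03. The posterior-SD formula is the standard normal-normal result and is assumed here rather than taken from models.py; the +6.0 raw mean is a made-up player.

```python
import math

K = 17.1      # shrinkage constant sigma^2 / tau^2 from the table above
TAU = 1.03    # between-player SD from the table above

def shrinkage_weight(n_seasons, k=K):
    """Weight placed on a player's own raw mean: n / (n + k)."""
    return n_seasons / (n_seasons + k)

def eb_ability(raw_mean, n_seasons, k=K):
    """Shrunk ability: raw mean pulled toward the field average of zero."""
    return shrinkage_weight(n_seasons, k) * raw_mean

def posterior_sd(n_seasons, k=K, tau=TAU):
    """Posterior SD under the normal-normal model: tau * sqrt(1 - weight)."""
    return tau * math.sqrt(1.0 - shrinkage_weight(n_seasons, k))

for n in (1, 2, 5):
    print(n, round(shrinkage_weight(n), 3))   # 0.055, 0.105, 0.226

# Hypothetical player who finished +6.0 points above his flight in one season.
print(round(eb_ability(6.0, 1), 2))           # 0.33
print(round(posterior_sd(1), 2))              # 1.0
```

A one-season +6.0 is credited with about a third of a point, and the posterior SD of roughly 1.0 is three times that estimate, which is why the table is a tie-breaker rather than a ranking.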

The shrunk abilities collapse toward the origin: a player who "won" his flight by a mile in one year is credited with only a fraction of a point of durable ability. This is the honest version of a power ranking. The resulting 2026 field ability table is in section 7.

5. Plackett-Luce latent abilities from match results (paired-comparison frame)

Match play is paired comparison, so we fit a Plackett-Luce model (Bradley-Terry generalized to full finishing orders): within each flight, the finishing order is a sequential "pick the best remaining" draw governed by latent team strengths θ. Teams in multiple flight-years share a strength; an L2 ridge regularizes toward equal strengths.

Result	Value
Teams / races	163 / 33
In-sample log-lik (fit vs null)	-181.6 vs -217.1
Out-of-sample (2024-learned θ -> predict 2025 order)	-100.3
Out-of-sample with equal strengths	-98.7
Learned strengths beat equal out-of-sample?	No

The in-sample likelihood improves - but that is overfitting: 163 free strength parameters on 33 short races will always fit the noise. The decisive test is out-of-sample, and there the strengths learned from 2024 fail to beat assuming all teams equal when predicting 2025 (-100.3 vs -98.7). This corroborates section 3 from a completely different modeling family: the latent team strength match play would estimate is, here, indistinguishable from noise. An out-of-sample comparison against the equal-strength null is the sound way to discover that a paired-comparison model has nothing to grip.
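The sequential "pick the best remaining" likelihood can be sketched in a few lines. This is an illustrative implementation, not the one in models.py: it takes θ as log-strengths, ignores ties in the finishing order, and the ridge weight `lam` is a hypothetical value. With equal strengths each 6-team race contributes -ln(6!), which reproduces the null figures in the table.

```python
import math

def pl_loglik(order, theta):
    """Plackett-Luce log-likelihood of one finishing order (best first).

    theta maps team -> log-strength. At each stage the next finisher is
    chosen from the remaining teams with probability proportional to
    exp(theta).
    """
    ll = 0.0
    remaining = list(order)
    for team in order[:-1]:            # the last place is determined
        denom = math.log(sum(math.exp(theta[t]) for t in remaining))
        ll += theta[team] - denom
        remaining.remove(team)
    return ll

def penalized_nll(orders, theta, lam=1.0):
    """Negative log-likelihood plus an L2 ridge toward equal strengths."""
    nll = -sum(pl_loglik(o, theta) for o in orders)
    return nll + lam * sum(v * v for v in theta.values())

teams = list("ABCDEF")
equal = {t: 0.0 for t in teams}
order = teams                          # any order gives the same null value

null_all = 33 * pl_loglik(order, equal)    # all 33 flights, 2024-25
null_2025 = 15 * pl_loglik(order, equal)   # the 15 flights of 2025
print(round(null_all, 1), round(null_2025, 1))   # -217.1 -98.7
```

The null values are therefore fixed by the flight size alone; any fitted model has to beat -98.7 on the 2025 flights to show out-of-sample skill.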

6. Validation with proper scoring rules

We score three predictors of the flight winner against the 33 actual flight winners (2024-25), using multiclass log-loss and Brier. Ties for first split the winner mass.

Predictor	Log-loss (lower=better)	Brier (lower=better)
Uniform 1/6 (no-signal baseline)	1.792	0.758
Handicap softmax - best in-sample temperature	1.792 (β*=0.00)	0.758
Handicap softmax - honest leave-one-year-out	1.933	0.789

Two things stand out. First, the in-sample optimizer drives the handicap temperature to zero - the best the handicap model can do is ignore handicap and predict uniform. Second, the honest leave-one-year-out handicap model is worse than uniform on both rules. A predictor that loses to a uniform 1-in-6 guess out-of-sample carries no usable information about the winner.
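The two scoring rules and the handicap softmax can be sketched as follows. The sign convention (lower handicap favored when β > 0) and the handicap values are assumptions for illustration; the scoring functions accept a split target so that a tie for first can share the winner mass.

```python
import math

def handicap_softmax(handicaps, beta):
    """p_i proportional to exp(-beta * h_i); beta = 0 gives the uniform guess."""
    base = min(handicaps)
    w = [math.exp(-beta * (h - base)) for h in handicaps]
    total = sum(w)
    return [v / total for v in w]

def log_loss(p, y):
    """Multiclass log-loss for one flight; y may split mass across tied winners."""
    return -sum(yi * math.log(pi) for pi, yi in zip(p, y) if yi > 0)

def brier(p, y):
    """Multiclass Brier score for one flight: sum of squared errors over teams."""
    return sum((pi - yi) ** 2 for pi, yi in zip(p, y))

hcps = [8.0, 10.0, 11.5, 13.0, 15.0, 18.0]     # hypothetical flight
uniform = handicap_softmax(hcps, beta=0.0)
outright = [1, 0, 0, 0, 0, 0]                  # single winner
tied = [0.5, 0.5, 0, 0, 0, 0]                  # two teams tie for first

print(round(log_loss(uniform, outright), 3))   # 1.792
print(round(brier(uniform, outright), 3))      # 0.833
print(round(brier(uniform, tied), 3))          # 0.333
```

The uniform log-loss is ln 6 = 1.792 whether or not there is a tie, while the uniform Brier score drops from 0.833 to 0.333 for a two-way tie. A reported baseline of 0.758 is therefore consistent with a handful of tied flights pulling the average below 5/6.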

The reliability view is blunt: bin every historical team by its within-flight handicap rank (1 = lowest handicap = nominal favorite) and plot the realized win rate. Every bin hovers around the 1/6 baseline. The nominal favorite wins about as often as the nominal longshot. There is no monotone "favorites win more" gradient to calibrate against.

(We cannot score the live 2026 Monte-Carlo model directly on history - it only produces probabilities for 2026 teams - so we test the axis it relies on: handicap-implied favoritism. That axis is flat. The live model's own outputs are, appropriately, also nearly flat: entropy ratio 0.97, median flight-favorite p_win just 0.23.)
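For reference, an entropy ratio for a single flight can be computed as below. This is a sketch under the assumption that the ratio is Shannon entropy divided by ln(number of teams); how the report aggregates across flights is not shown here, and the lopsided probabilities are invented.

```python
import math

def entropy_ratio(p):
    """Shannon entropy of p divided by its maximum, ln(len(p))."""
    h = -sum(pi * math.log(pi) for pi in p if pi > 0)
    return h / math.log(len(p))

uniform = [1 / 6] * 6
lopsided = [0.43, 0.15, 0.13, 0.11, 0.10, 0.08]   # hypothetical flight

print(round(entropy_ratio(uniform), 3))    # 1.0
print(round(entropy_ratio(lopsided), 2))
```

A ratio of 1.0 means a pure lottery; even a flight with a 43% favorite stays well above 0.8 on this scale, so a field-wide 0.97 indicates that most flights are far flatter than that.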

7. A defensible 2026 field "ability" table

Given sections 3-6, the most defensible team ranking is not the handicap-driven win probability - it's the shrunk track record. For each 2026 team we sum its two players' empirical-Bayes abilities (section 4). This is a small signal (team-level SD ~ 0.48 points, against a ~4.4 noise SD) and should be read as a gentle lean with wide uncertainty, not a prediction. Critically, it is nearly uncorrelated with the live model's p_win (r = 0.05) and with team handicap (r = 0.09) - genuinely orthogonal information.

Top of the field by historical track record (shrunk):

Flight	Team	Team hcp	EB ability	Live p_win
20	Ballard + Armstrong	47.4	+1.29	14%
13	Panessa + Ratliff	23.1	+1.25	17%
9	Wood + Estes	19.3	+1.22	43%
19	Vaniman + Leingang	35.5	+1.15	12%
19	Truitt + Fetter	36.0	+1.14	11%
10	Gelinas + Chafin	19.7	+0.97	16%
4	Green + Green	11.5	+0.91	19%
5	Wright + Copeland	12.2	+0.91	20%

Bottom of the field by track record (have historically under-performed their flight):

Flight	Team	Team hcp	EB ability	Live p_win
5	Pearson + Ferguson	12.6	-1.12	21%
10	Shirley + Aronson	20.8	-0.93	23%
12	Collins + Minter	22.5	-0.86	20%
15	Jaillet + Jaillet	26.0	-0.81	17%
3	Embleau + Loewenthal	9.1	-0.81	18%

Note the disagreements: Wood + Estes is both a track-record leader and the live model's strongest favorite - a rare case where two independent signals align (lean in). Conversely several teams the live model likes (high p_win) have negative track records - their favoritism rests on handicap, which section 6 shows is not predictive. Full table: field_ability_2026.csv.

8. The price model, done right: Tobit vs OLS

Auction prices are right-censored at the bid cap. The censoring is what makes OLS misbehave.

In the 2024 price histogram, the red bar is the spike of teams pinned at the \$1,000 cap - 14% of the field. OLS treats each as a precise \$1,000 observation, biasing the slope toward zero and shrinking the fitted dispersion.

2024 (severe censoring, 15/108 capped), log-price ~ flight seed:

	OLS	Tobit
Seed slope	0.0448	0.0485
Residual σ	0.361	0.407
σ inflation (Tobit/OLS)	-	x1.13

Tobit recovers a steeper seed effect and 13% more dispersion - the strongest teams were worth more than the censored OLS fit implies, and the price spread is wider. This is the compression that PRICE_SPREAD = 1.85 is meant to undo - but Tobit estimates the correction from the data instead of guessing.

2025 (mild censoring, 1/90), log-price ~ handicap + within-flight rank: here Tobit ~ OLS (σ ratio 0.99, slopes within 0.0002), because with almost no censoring there is nothing to correct. The honest read: use Tobit and let the data decide how much correction is needed - it self-deactivates when the cap doesn't bind (2025) and engages when it does (2024, and likely 2026 at the top).
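The "self-deactivating" behavior follows directly from the censored-normal likelihood, sketched here for a single predictor. The function names and the four sample prices are illustrative assumptions; the project's fit may use different covariates and an optimizer that is not shown.

```python
import math

def norm_logpdf(z):
    """Log density of a standard normal at z."""
    return -0.5 * z * z - 0.5 * math.log(2 * math.pi)

def norm_logsf(z):
    """log P(Z > z) for a standard normal (fine for moderate z)."""
    return math.log(0.5 * math.erfc(z / math.sqrt(2)))

def gaussian_loglik(a, b, sigma, x, y):
    """Ordinary normal log-likelihood of y = a + b*x + noise (what OLS maximizes)."""
    return sum(norm_logpdf((yi - a - b * xi) / sigma) - math.log(sigma)
               for xi, yi in zip(x, y))

def tobit_loglik(a, b, sigma, x, y, cap):
    """Right-censored normal log-likelihood; y and cap are log prices."""
    ll = 0.0
    for xi, yi in zip(x, y):
        mu = a + b * xi
        if yi >= cap:   # capped: we only know the latent price was >= cap
            ll += norm_logsf((cap - mu) / sigma)
        else:           # uncapped: usual density term
            ll += norm_logpdf((yi - mu) / sigma) - math.log(sigma)
    return ll

# Hypothetical mini-auction: seed 1 sold at exactly 1,000.
x = [1, 2, 3, 4]
y = [math.log(p) for p in (1000.0, 800.0, 650.0, 500.0)]
params = (7.0, -0.2, 0.4)    # arbitrary intercept, slope, sigma

binding = tobit_loglik(*params, x, y, cap=math.log(1000.0))
slack = tobit_loglik(*params, x, y, cap=math.log(2000.0))
print(binding != slack, abs(slack - gaussian_loglik(*params, x, y)) < 1e-12)
```

When no price reaches the cap, every observation takes the density branch and the Tobit likelihood is identical to the Gaussian one, so the two estimators coincide. The tail-probability term only enters for capped teams, which is what lets the fitted σ and slope widen in 2024.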

Recommendation. Replace the OLS-plus-PRICE_SPREAD-plus-PRICE_LEVEL stack with a single Tobit fit on pooled 2024-25 data with a year fixed effect and each year's cap, then project to 2026 at the \$2,000 cap. It removes two unidentified knobs and is the standard estimator for capped prices.

9. Uncertainty quantification

Win probabilities. The Monte-Carlo p_win for even the strongest 2026 team (~0.426) is estimated tightly - Wilson 95% interval [0.422, 0.430] at N=60,000 sims. But that precision is about the simulator, not the world: it answers "what does the model output," not "how often will they actually win." Section 3 says the real distribution is much closer to 1/6 with wide irreducible spread.
Repeatability / is there skill. Bootstrap 95% CI on the year-over-year team correlation is [-0.41, +0.17] - consistent with zero and even mildly negative. We can rule out a large skill component, not a tiny one.
Player abilities. Posterior SDs in player_abilities_eb.csv are large relative to the abilities themselves for nearly every player (most are within ~2 SD of zero) - formal confirmation that individual rankings are unreliable.
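The Wilson interval quoted for the strongest team's simulated p_win can be reproduced with the standard formula. The sketch below uses z = 1.96 and the rounded inputs from the text (0.426, N = 60,000).

```python
import math

def wilson_interval(p_hat, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

lo, hi = wilson_interval(0.426, 60_000)
print(f"[{lo:.3f}, {hi:.3f}]")   # [0.422, 0.430]
```

The interval is tight only because 60,000 simulations were run; it measures Monte-Carlo error in the simulator's output, not uncertainty about the real-world win rate.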

10. Assumptions and limitations

Tiny samples. Two years of results (33 flights with a repeat team; 121 repeat players). Absence of detectable signal is not proof of absence - but it caps how confidently anyone can claim to "know who wins."
Surname linkage. Historical teams join to current handicaps by surname (71% of player-seasons matched). Common surnames take a median handicap; a few mismatches are possible and would only add noise, not create false signal.
Points vs true match outcomes. We have flight standings (points, finish), not hole-by-hole cards. Enough for finishing-order and points models, but not to estimate the live model's hole-level correlation parameters from data - those remain assumptions.
Prices != value. Tobit models price formation (what the room paid). Right tool for the auction-price curve, but it inherits the room's inefficiencies.
Format stability. Assumes the 2026 net better-ball points format behaves like 2024-25. A material format or field change would require refreshing the repeatability estimates.

11. Recommendations

Treat within-flight win probabilities as close to uniform. Don't pay a premium for a "favorite" - the 2024-25 results say flights are nearly lotteries. The live model already leans this way; trust that humility over its sharpest numbers.
Use the shrunk track-record table (section 7) as the tie-breaker, not handicap. Small, but the only signal that survives an honest out-of-sample test, and orthogonal to handicap. Favor teams where track record and the model agree (e.g., Wood + Estes); discount model favorites with negative track records.
Replace the OLS+spread price stack with a pooled Tobit (year FE + per-year cap -> project to \$2,000). Same intent as PRICE_SPREAD, but estimated, identified, and self-deactivating when the cap doesn't bind.
Hunt for price edges, not outcome edges. Because outcomes are near-random, value comes from buying teams below the censored-corrected price curve, not from forecasting winners. Define bargains/traps against the Tobit price; down-weight win-prob differences.
Keep collecting results. The fastest way to settle whether any skill signal exists is more years of data; with five or more, the repeatability CI should tighten enough to decide it.

Generated from the project's own data. Methods: variance components, empirical-Bayes partial pooling, Plackett-Luce, proper scoring rules (log-loss / Brier), Tobit censored regression, bootstrap & Wilson intervals. Scripts in analysis/stats_expert/.