Independent Statistical Audit

2026 Member-Member Calcutta — Expert Statistical Review

Independent statistical audit of the valuation model. All numbers are computed from the project's own data (data/processed/, data/raw/history/, valuation/). Scripts: prep.py, models.py, figures.py, build_report.py in this folder. Reproduce by running them in that order, one at a time: uv run python analysis/stats_expert/prep.py, then models.py, figures.py, and build_report.py.


Executive summary

The single most important empirical fact in this dataset: net match-play results in this event are almost entirely noise, not skill. Three independent tests agree:

  1. No year-over-year persistence. Teams that finished above their flight average in 2024 did not tend to repeat in 2025. The correlation is r = -0.12 (n = 33 repeat teams; bootstrap 95% CI [-0.41, +0.17], which straddles zero). For individual players it is r = -0.14 (n = 121). A team's finish one year tells you essentially nothing about the next.
  2. Handicap does not predict the net outcome. Within a flight, the team's locked handicap explains 0.6% of the variance in match points (R² = 0.006, p = 0.39). That is by design - the net format neutralizes handicap - but it means handicap-driven win-probability spreads must be treated with great caution.
  3. A leave-one-year-out handicap model loses to a coin flip. Predicting each flight's winner from handicap scores worse than a flat 1-in-6 guess on both log-loss (1.93 vs 1.79) and Brier (0.79 vs 0.76). There is no exploitable signal for who wins a flight.

What this means for the auction. The live model's win probabilities are already appropriately humble - within-flight they sit at 97% of maximum entropy (nearly uniform), and even the strongest 2026 team is only a ~43% favorite in its flight. That is the right posture. But it also means edge does not come from "we know who wins." It comes from two places the data does support: price discipline against the censoring-corrected price curve (section 8), and a gentle lean toward teams with a multi-year track record of beating their flight (section 7).

Bottom line for a sharp partner: Don't pay up for "favorites" inside a flight - the format makes flights close to a lottery, and the 2024-25 results confirm it. Spend the analytical effort on (a) not overpaying relative to the censoring-corrected price curve, and (b) leaning, gently, toward teams with a genuine multi-year track record of beating their flight. Everything else is variance.


1. Critical assessment of the current methodology

The live model (valuation/value_model.py) is a careful, well-documented hole-by-hole Monte-Carlo simulation of the confirmed format (round-robin 9-hole net better-ball, points accumulation). Its engineering is sound and its instinct - that 9-hole net match play is high-variance - is correct. Three critiques, in priority order:

(a) It is calibrated to handicap-derived scoring, but the outcome is nearly handicap-independent. Every player's scoring distribution flows from GHIN differentials and the locked handicap. Yet in the two years of actual results (2024-25), handicap carries no detectable information about the net finish (sections 3 and 6). The simulation's win-probability spread is therefore driven by an axis the historical record says is close to noise. To its credit, the model's output is already very flat (entropy ratio 0.97), so the practical damage is limited - but any temptation to trust a "43% favorite" as if it were real should be resisted. The honest prior is much closer to 1/6, widened only slightly by track record.

(b) The price model uses OLS on censored data, then re-inflates by hand. Prices are capped at the bid limit (\$1,000 in 2024, \$1,800 in 2025, \$2,000 in 2026). OLS treats a \$1,000 censored price as if the team were truly worth exactly \$1,000, which flattens the slope and shrinks the dispersion - precisely the symptom the code comments describe ("raw OLS is far too compressed"). The fix is a textbook Tobit (censored-normal MLE), fit in section 8. PRICE_SPREAD = 1.85 and PRICE_LEVEL_PER_TEAM = 700 are reasonable hand-corrections, but they are unidentified knobs; Tobit estimates the same correction from the data.

(c) The elaborate variance structure (partner/field/hole correlations) is set by assumption, not fit. RHO_PARTNERS, RHO_FIELD, HOLE_NOISE_FRAC are judgment calls. Given the finding in section 3 - that we cannot even detect skill in the results - we certainly cannot estimate these second-order correlations from data. This isn't wrong, but it should be labeled as a reasonable prior, not an inference. The model's existing sensitivity analysis is the right way to handle it.

Net assessment: the model is over-engineered relative to the information content of the data. That is not a knock on the craft - it is a statement that the data has very little to say about who wins, and a model that produces sharp-looking 43% favorites can give a false sense of precision. Its saving grace is that its outputs are nearly flat anyway.


2. Data and methods

Asset            What it is                                                                        Used for
hist_team.csv    198 team-flight results (2024-25), match points (0-50) + finishing position       repeatability, Plackett-Luce, calibration backtest
hist_player.csv  the same exploded to 396 player-seasons, joined to locked handicap (71% matched)  empirical-Bayes player abilities
teams26.csv      2026 field (120 teams), locked handicaps, GHIN ids                                field ability table
prices.csv       2024-25 auction prices, flagged censored at the per-year cap                      Tobit price model

Response variable. Within each 6-team flight the round-robin distributes exactly 150 match points, so a team's points are zero-sum within its flight. We center points within flight-year (pts_centered) so "points above the flight average" is the comparable, mean-removed response. Typical within-flight points SD ~ 4.37.
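The centering step can be sketched as follows. This is a minimal illustration, not the audit's prep.py: the row layout and the field names year, flight, team and points are assumptions, and the six-team flight is made up so that its points sum to 150.

```python
from collections import defaultdict

def center_within_flight(rows):
    """Return copies of rows with pts_centered = points - mean of the (year, flight) group."""
    totals = defaultdict(lambda: [0.0, 0])  # (year, flight) -> [sum of points, team count]
    for r in rows:
        key = (r["year"], r["flight"])
        totals[key][0] += r["points"]
        totals[key][1] += 1
    centered_rows = []
    for r in rows:
        total, count = totals[(r["year"], r["flight"])]
        centered_rows.append({**r, "pts_centered": r["points"] - total / count})
    return centered_rows

# Hypothetical flight (not real results): six teams sharing exactly 150 points.
rows = [
    {"year": 2024, "flight": 1, "team": name, "points": pts}
    for name, pts in zip("ABCDEF", [31, 28, 26, 24, 22, 19])
]
centered = center_within_flight(rows)
for r in centered:
    print(r["team"], r["pts_centered"])
```

Because every flight shares the same 150-point total, the flight mean is always 25 and the centered values sum to zero within each flight-year.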

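Methods applied (each justified against our data, not a textbook tour): variance components and year-over-year repeatability (section 3), hierarchical empirical-Bayes shrinkage (section 4), Plackett-Luce paired-comparison modeling (section 5), validation with proper scoring rules (section 6), and Tobit censored regression for prices (section 8).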

We deliberately did not reach for PyMC/Stan: with this little signal, a full Bayesian hierarchy adds machinery without changing conclusions. Empirical-Bayes shrinkage gives the same partial-pooling answer in closed form.


3. Is there any signal? Variance components and repeatability

[Figure: scatter of 2024 points vs flight average against 2025 points vs flight average, one point per repeat team. r = -0.12 (n = 33 repeat teams): no persistence.]

Each point is a team that played in both 2024 and 2025; axes are its points relative to its flight average. If finishing well were a durable property, the cloud would slope up. It doesn't.

Quantity                                  Value
Within-flight points SD                   4.37
Team year-over-year repeatability r       -0.119 (n=33, p=0.51)
Player year-over-year repeatability r     -0.143 (n=121, p=0.12)
Bootstrap 95% CI on team r                [-0.41, +0.17] (includes 0)
Implied skill share of points variance    ~ 0%

Under the standard signal+noise decomposition, the between-year correlation equals the skill share, var_skill / (var_skill + var_noise). With r slightly negative and not significantly different from zero, the estimated skill share is floored at zero (a variance share cannot be negative): net match-play points behave like draws from a common distribution. A permutation test for repeatability returns p ~ 0.50 - exactly what pure noise gives.
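The repeatability check can be sketched as below. This is an illustration under stated assumptions, not the code in models.py: the data are synthetic pure-noise draws (33 pairs with SD 4.37), the function names are hypothetical, the bootstrap is a plain percentile bootstrap over teams, and the permutation test is taken to be two-sided.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation between two equal-length 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])

def bootstrap_ci(x, y, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for r, resampling teams (pairs) with replacement."""
    rng = np.random.default_rng(seed)
    n = len(x)
    rs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        rs[b] = pearson_r(x[idx], y[idx])
    lo, hi = np.quantile(rs, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)

def permutation_p(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value: shuffling y breaks any team-level pairing."""
    rng = np.random.default_rng(seed)
    r_obs = abs(pearson_r(x, y))
    hits = sum(abs(pearson_r(x, rng.permutation(y))) >= r_obs for _ in range(n_perm))
    return (hits + 1) / (n_perm + 1)

# Synthetic stand-in for centered points in two consecutive years (no real skill).
rng = np.random.default_rng(42)
year1 = rng.normal(0.0, 4.37, size=33)
year2 = rng.normal(0.0, 4.37, size=33)

r = pearson_r(year1, year2)
ci_lo, ci_hi = bootstrap_ci(year1, year2)
p_perm = permutation_p(year1, year2)
skill_share = max(r, 0.0)  # a variance share cannot be negative
print(f"r={r:.3f}  CI=[{ci_lo:.2f}, {ci_hi:.2f}]  perm p={p_perm:.2f}  skill share={skill_share:.2f}")
```

With only 33 pairs the interval is wide by construction, which is why an r near zero here cannot rule out a small amount of skill, only a large one.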

This does not mean players have no golf skill. It means the net, better-ball, 9-hole, points-accumulation format successfully equalizes the field - the point of a member-member. The residual that decides finishes is dominated by which day you caught lightning.


4. Hierarchical empirical-Bayes player abilities (partial pooling)

We fit the one-way random-effects model y_ij = μ + a_i + e_ij, with player effects a_i ~ N(0, τ²) and residuals e_ij ~ N(0, σ²), then shrink each player's mean toward zero by the optimal factor n_i / (n_i + σ²/τ²), where n_i is the number of seasons observed for player i.

Component                        Estimate
Within-player residual SD (σ)    4.24
Between-player SD (τ)            1.03
Shrinkage constant k = σ²/τ²     17.1
Mean shrinkage weight            0.087

With most players having only 1-2 seasons, the shrinkage weight is ~0.09 - i.e., we pull raw player means ~91% of the way back to the field average. This is the statistically correct response to a tiny between-player variance: believe almost none of a one-year result.
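As a worked example of the shrinkage arithmetic, the sketch below plugs the rounded σ and τ from the table into the weight n / (n + σ²/τ²). The function name and the +8.0 one-season result are hypothetical. Note that the rounded inputs give k of about 16.9 rather than the reported 17.1, which was presumably computed from unrounded estimates.

```python
def shrink_weight(n_seasons, sigma, tau):
    """Empirical-Bayes weight on a player's raw mean: n / (n + sigma^2 / tau^2)."""
    k = sigma ** 2 / tau ** 2
    return n_seasons / (n_seasons + k)

sigma, tau = 4.24, 1.03          # rounded estimates from the table above
k = sigma ** 2 / tau ** 2        # shrinkage constant from rounded inputs
w_one = shrink_weight(1, sigma, tau)
w_two = shrink_weight(2, sigma, tau)

# Hypothetical player: finished 8.0 points above the flight average in a single season.
raw_mean = 8.0
eb_ability = w_one * raw_mean    # shrunk toward the field average of zero

print(f"k = {k:.2f}")
print(f"weight with 1 season = {w_one:.3f}, with 2 seasons = {w_two:.3f}")
print(f"raw +{raw_mean:.1f} -> EB ability +{eb_ability:.2f}")
```

A one-season weight near 0.06 and a two-season weight near 0.11 bracket the reported mean weight of 0.087, as expected when most players have one or two seasons.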

[Figure: EB-shrunk ability plotted against raw mean centered points, with a y = x (no shrink) reference line. Abilities collapse to ~0; mean shrink weight 0.09.]

The shrunk abilities collapse toward the origin: a player who "won" his flight by a mile in one year is credited with only a fraction of a point of durable ability. This is the honest version of a power ranking. The resulting 2026 field ability table is in section 7.


5. Plackett-Luce latent abilities from match results (paired-comparison frame)

Match play is paired comparison, so we fit a Plackett-Luce model (Bradley-Terry generalized to full finishing orders): within each flight, the finishing order is a sequential "pick the best remaining" draw governed by latent team strengths θ. Teams in multiple flight-years share a strength; an L2 ridge regularizes toward equal strengths.

Result                                                 Value
Teams / races                                          163 / 33
In-sample log-lik (fit vs null)                        -181.6 vs -217.1
Out-of-sample (2024-learned θ -> predict 2025 order)   -100.3
Out-of-sample with equal strengths                     -98.7
Learned strengths beat equal out-of-sample?            No

The in-sample likelihood improves - but that is overfitting: 163 free strength parameters on 33 short races will always fit the noise. The decisive test is out-of-sample, and there the strengths learned from 2024 fail to beat the assumption that all teams are equal when predicting 2025. This corroborates section 3 from a completely different modeling family: the latent team strength that a paired-comparison model would estimate is, here, indistinguishable from noise. Comparing out-of-sample likelihood against the equal-strength null is the appropriate way to discover that a paired-comparison model has nothing to grip.
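For readers who want to see the likelihood being compared, here is a minimal sketch of the Plackett-Luce log-likelihood for a single finishing order. It is not the fitting code from models.py: the function name is hypothetical, strengths are on the log scale, ties in the finishing order are ignored, and the ridge penalty is omitted.

```python
import math

def pl_loglik(order, theta):
    """Log-likelihood of one finishing order (best first) under Plackett-Luce.

    theta maps team -> log-strength. At each stage the next finisher is drawn
    from the teams still remaining, with probability proportional to exp(theta).
    """
    ll = 0.0
    remaining = list(order)
    while len(remaining) > 1:
        top = max(theta[t] for t in remaining)  # stabilise the log-sum-exp
        log_denom = top + math.log(sum(math.exp(theta[t] - top) for t in remaining))
        ll += theta[remaining[0]] - log_denom
        remaining.pop(0)
    return ll

order = ["A", "B", "C", "D", "E", "F"]       # hypothetical six-team flight
equal = {team: 0.0 for team in order}        # null model: all strengths equal
null_ll = pl_loglik(order, equal)            # equals -log(6!) for any order

# Strengths that happen to agree with the observed order fit it better than the null.
aligned = {team: -0.3 * i for i, team in enumerate(order)}
aligned_ll = pl_loglik(order, aligned)

print(f"null per flight: {null_ll:.4f}")
print(f"33 flights: {33 * null_ll:.1f}   15 flights: {15 * null_ll:.1f}")
```

Under equal strengths every six-team order has probability 1/720, so the null log-likelihood is about -6.579 per flight. That reproduces the -217.1 null over 33 races, and the -98.7 equal-strength out-of-sample figure is what 15 such flights would give.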


6. Validation with proper scoring rules

We score three predictors of the flight winner against the 33 actual flight winners (2024-25), using multiclass log-loss and Brier. Ties for first split the winner mass.

Predictor                                        Log-loss (lower=better)   Brier (lower=better)
Uniform 1/6 (no-signal baseline)                 1.792                     0.758
Handicap softmax - best in-sample temperature    1.792 (β*=0.00)           0.758
Handicap softmax - honest leave-one-year-out     1.933                     0.789

Two things stand out. First, the in-sample optimizer drives the handicap temperature to zero - the best the handicap model can do is ignore handicap and predict uniform. Second, the honest leave-one-year-out handicap model is worse than uniform on both rules. A predictor that loses to a coin flip out-of-sample carries no usable information about the winner.
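The two scoring rules, including the tie-splitting convention, can be sketched as follows. The function names are hypothetical, and the softmax parameterization (probability proportional to exp(-β × handicap)) is an assumption for illustration; the report does not spell out the exact form used.

```python
import math

def log_loss(probs, outcome):
    """Multiclass log-loss for one flight; outcome holds the winner mass per team."""
    return -sum(y * math.log(p) for p, y in zip(probs, outcome) if y > 0)

def brier(probs, outcome):
    """Multiclass Brier score for one flight: squared error summed over teams."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcome))

def handicap_softmax(handicaps, beta):
    """Assumed form: lower handicap gets more probability when beta > 0."""
    scores = [-beta * h for h in handicaps]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

uniform = [1 / 6] * 6
solo_win = [1, 0, 0, 0, 0, 0]          # one outright flight winner
shared_win = [0.5, 0.5, 0, 0, 0, 0]    # two teams tie for first and split the mass

print(f"log-loss, outright winner: {log_loss(uniform, solo_win):.3f}")
print(f"log-loss, shared win:      {log_loss(uniform, shared_win):.3f}")
print(f"Brier, outright winner:    {brier(uniform, solo_win):.3f}")
print(f"Brier, shared win:         {brier(uniform, shared_win):.3f}")

# With beta = 0 the handicap model reduces to the uniform baseline.
flat = handicap_softmax([10, 12, 14, 16, 18, 20], beta=0.0)
```

The uniform log-loss is ln 6 (about 1.792) whether or not the win is shared, while the uniform Brier score is 0.833 for an outright winner and 0.333 for a two-way tie. The reported baseline of 0.758 lies between the two, consistent with a mix of outright and shared wins across the 33 flights.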

[Figure: realized win rate by within-flight handicap rank (1 = lowest hcp), against the no-signal baseline of 1/6. Bar labels in extraction order: 36%, 16%, 19%, 25%, 0%, 0%.]

The reliability view is blunt: bin every historical team by its within-flight handicap rank (1 = lowest handicap = nominal favorite) and plot the realized win rate. Every bin hovers around the 1/6 baseline. The nominal favorite wins about as often as the nominal longshot. There is no monotone "favorites win more" gradient to calibrate against.

(We cannot score the live 2026 Monte-Carlo model directly on history - it only produces 2026 teams - so we test the axis it relies on: handicap-implied favoritism. That axis is flat. The live model's own outputs are, appropriately, also nearly flat: entropy ratio 0.97, median flight-favorite p_win just 0.23.)
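The entropy ratio quoted above can be read as Shannon entropy divided by its maximum, log(6) for a six-team flight. The sketch below assumes that definition; the function name and both example flights are hypothetical, not model output.

```python
import math

def entropy_ratio(probs):
    """Shannon entropy of a win-probability vector as a fraction of the uniform maximum."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(probs))

uniform_flight = [1 / 6] * 6
# Hypothetical flights: one nearly flat, one with a clear 43% favorite.
flat_flight = [0.23, 0.19, 0.17, 0.15, 0.14, 0.12]
favorite_flight = [0.43, 0.15, 0.12, 0.11, 0.10, 0.09]

ratio_uniform = entropy_ratio(uniform_flight)
ratio_flat = entropy_ratio(flat_flight)
ratio_favorite = entropy_ratio(favorite_flight)

print(f"uniform:  {ratio_uniform:.3f}")
print(f"flat:     {ratio_flat:.3f}")
print(f"favorite: {ratio_favorite:.3f}")
```

A ratio of 1.0 means the model is indistinguishable from a 1-in-6 guess; the further below 1.0, the more the model is claiming to know.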


7. A defensible 2026 field "ability" table

Given sections 3-6, the most defensible team ranking is not the handicap-driven win probability - it's the shrunk track record. For each 2026 team we sum its two players' empirical-Bayes abilities (section 4). This is a small signal (team-level SD ~ 0.48 points, against a ~4.4 noise SD) and should be read as a gentle lean with wide uncertainty, not a prediction. Critically, it is nearly uncorrelated with the live model's p_win (r = 0.05) and with team handicap (r = 0.09) - genuinely orthogonal information.

Top of the field by historical track record (shrunk):

Flight  Team                   Team hcp  EB ability  Live p_win
20      Ballard + Armstrong    47.4      +1.29       14%
13      Panessa + Ratliff      23.1      +1.25       17%
9       Wood + Estes           19.3      +1.22       43%
19      Vaniman + Leingang     35.5      +1.15       12%
19      Truitt + Fetter        36.0      +1.14       11%
10      Gelinas + Chafin       19.7      +0.97       16%
4       Green + Green          11.5      +0.91       19%
5       Wright + Copeland      12.2      +0.91       20%

Bottom of the field by track record (have historically under-performed their flight):

Flight  Team                   Team hcp  EB ability  Live p_win
5       Pearson + Ferguson     12.6      -1.12       21%
10      Shirley + Aronson      20.8      -0.93       23%
12      Collins + Minter       22.5      -0.86       20%
15      Jaillet + Jaillet      26.0      -0.81       17%
3       Embleau + Loewenthal   9.1       -0.81       18%

Note the disagreements: Wood + Estes is both a track-record leader and the live model's strongest favorite - a rare case where two independent signals align (lean in). Conversely, several teams the live model likes (high p_win) have negative track records - their favoritism rests on handicap, which section 6 shows is not predictive. Full table: field_ability_2026.csv.


8. The price model, done right: Tobit vs OLS

Auction prices are right-censored at the bid cap. The censoring is what makes OLS misbehave.

[Figure: histogram of 2024 prices (\$) by number of teams, with the \$1,000 cap marked: 15 teams pinned (14%).]

The red bar is the spike of 2024 teams pinned at the \$1,000 cap - 14% of the field. OLS treats each as a precise \$1,000 observation, biasing the slope toward zero and shrinking the fitted dispersion.

2024 (severe censoring, 15/108 capped), log-price ~ flight seed:

                           OLS      Tobit
Seed slope                 0.0448   0.0485
Residual σ                 0.361    0.407
σ inflation (Tobit/OLS)    -        x1.13

Tobit recovers a steeper seed effect and 13% more dispersion - the strongest teams were worth more than the censored OLS fit implies, and the price spread is wider. This is the correction that PRICE_SPREAD = 1.85 approximates by hand - but Tobit estimates it from the data instead of guessing.
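The mechanics of the comparison can be sketched on synthetic data. This is an illustration, not the fit reported in the table: the data-generating values, the function names, and the single strength covariate are all assumptions, and SciPy's general-purpose minimize is used in place of whatever optimizer models.py relies on.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def tobit_negloglik(params, X, y, cap, censored):
    """Negative log-likelihood of a right-censored normal regression."""
    beta, sigma = params[:-1], np.exp(params[-1])  # log-sigma keeps sigma positive
    mu = X @ beta
    # Uncensored rows contribute the normal density at the observed log-price.
    ll_obs = norm.logpdf(y[~censored], loc=mu[~censored], scale=sigma).sum()
    # Capped rows contribute P(latent log-price >= cap), not a density at the cap.
    ll_cen = norm.logsf(cap[censored], loc=mu[censored], scale=sigma).sum()
    return -(ll_obs + ll_cen)

def fit_ols(X, y):
    """Least-squares coefficients and residual SD, treating capped prices as exact."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, float((y - X @ beta).std())

def fit_tobit(X, y, cap, censored):
    """Censored-normal MLE, started from the OLS solution."""
    beta0, sigma0 = fit_ols(X, y)
    start = np.append(beta0, np.log(sigma0))
    res = minimize(tobit_negloglik, start, args=(X, y, cap, censored), method="BFGS")
    return res.x[:-1], float(np.exp(res.x[-1]))

# Synthetic data only (assumed values, not the auction prices).
rng = np.random.default_rng(0)
n = 2000
strength = rng.integers(1, 21, size=n).astype(float)
x = strength - strength.mean()                 # centered for better conditioning
X = np.column_stack([np.ones(n), x])
true_slope, true_sigma = 0.05, 0.40
latent = 6.525 + true_slope * x + rng.normal(0.0, true_sigma, size=n)
cap = np.full(n, np.log(1000.0))               # per-row cap on the log scale
censored = latent >= cap
y = np.where(censored, cap, latent)            # what the auction lets us observe

ols_beta, ols_sigma = fit_ols(X, y)
tobit_beta, tobit_sigma = fit_tobit(X, y, cap, censored)
print(f"share censored: {censored.mean():.2f}")
print(f"slope  OLS {ols_beta[1]:.4f}   Tobit {tobit_beta[1]:.4f}   true {true_slope}")
print(f"sigma  OLS {ols_sigma:.3f}   Tobit {tobit_sigma:.3f}   true {true_sigma}")
```

Passing the cap as a per-row array is a deliberate choice: it lets one pooled fit carry a different cap for each auction year. When no rows are censored the second likelihood term is empty and the estimator reduces to ordinary least squares.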

2025 (mild censoring, 1/90), log-price ~ handicap + within-flight rank: here Tobit ~ OLS (σ ratio 0.99, slopes within 0.0002), because with almost no censoring there is nothing to correct. The honest read: use Tobit and let the data decide how much correction is needed - it self-deactivates when the cap doesn't bind (2025) and engages when it does (2024, and likely 2026 at the top).

Recommendation. Replace the OLS-plus-PRICE_SPREAD-plus-PRICE_LEVEL stack with a single Tobit fit on pooled 2024-25 data with a year fixed effect and each year's cap, then project to 2026 at the \$2,000 cap. It removes two unidentified knobs and is the standard estimator for capped prices.


9. Uncertainty quantification


10. Assumptions and limitations


11. Recommendations

  1. Treat within-flight win probabilities as close to uniform. Don't pay a premium for a "favorite" - the 2024-25 results say flights are nearly lotteries. The live model already leans this way; trust that humility over its sharpest numbers.
  2. Use the shrunk track-record table (section 7) as the tie-breaker, not handicap. The signal is small and orthogonal to handicap, but it has not itself been shown to persist out-of-sample (sections 3 and 5), so size it as a gentle lean. Favor teams where track record and the model agree (e.g., Wood + Estes); discount model favorites with negative track records.
  3. Replace the OLS+spread price stack with a pooled Tobit (year FE + per-year cap -> project to \$2,000). Same intent as PRICE_SPREAD, but estimated, identified, and self-deactivating when the cap doesn't bind.
  4. Hunt for price edges, not outcome edges. Because outcomes are near-random, value comes from buying teams below the censoring-corrected price curve, not from forecasting winners. Define bargains/traps against the Tobit price; down-weight win-probability differences.
  5. Keep collecting results. The fastest way to settle whether any skill signal exists is more years; at 5+ the repeatability CI would tighten enough to decide it.