Independent statistical audit of the valuation model. All numbers are computed from the project's own data (data/processed/, data/raw/history/, valuation/). Scripts: prep.py, models.py, figures.py, build_report.py in this folder. Reproduce by running each script in turn with uv run python analysis/stats_expert/<script>.py (prep, models, figures, build_report).
The single most important empirical fact in this dataset: net match-play results in this event are almost entirely noise, not skill. Three independent tests agree:
What this means for the auction. The live model's win probabilities are already appropriately humble - within-flight they sit at 97% of maximum entropy (nearly uniform), and even the strongest 2026 team is only a ~43% favorite in its flight. That is the right posture. But it also means edge does not come from "we know who wins." It comes from two places the data does support:
PRICE_SPREAD = 1.85 is a crude patch for exactly this - a principled Tobit replaces the fudge.
Bottom line for a sharp partner: Don't pay up for "favorites" inside a flight - the format makes flights close to a lottery, and five years of results confirm it. Spend the analytical effort on (a) not overpaying relative to the censoring-corrected price curve, and (b) leaning, gently, toward teams with a genuine multi-year track record of beating their flight. Everything else is variance.
The live model (valuation/value_model.py) is a careful, well-documented hole-by-hole Monte-Carlo simulation of the confirmed format (round-robin 9-hole net better-ball, points accumulation). Its engineering is sound and its instinct - that 9-hole net match play is high-variance - is correct. Three critiques, in priority order:
(a) It is calibrated to handicap-derived scoring, but the outcome is nearly handicap-independent. Every player's scoring distribution flows from GHIN differentials and the locked handicap. Yet in five years of actual results, handicap carries no information about the net finish (section 3). The simulation's win-probability spread is therefore driven by an axis the historical record says is close to noise. To its credit, the model's output is already very flat (entropy ratio 0.97), so the practical damage is limited - but any temptation to trust a "43% favorite" as if it were real should be resisted. The honest prior is much closer to 1/6, widened only slightly by track record.
(b) The price model uses OLS on censored data, then re-inflates by hand. Prices are capped at the bid limit (\$1,000 in 2024, \$1,800 in 2025, \$2,000 in 2026). OLS treats a \$1,000 censored price as if the team were truly worth exactly \$1,000, which flattens the slope and shrinks the dispersion - precisely the symptom the code comments describe ("raw OLS is far too compressed"). The fix is a textbook Tobit (censored-normal MLE), fit in section 8. PRICE_SPREAD = 1.85 and PRICE_LEVEL_PER_TEAM = 700 are reasonable hand-corrections, but they are unidentified knobs; Tobit estimates the same correction from the data.
(c) The elaborate variance structure (partner/field/hole correlations) is set by assumption, not fit. RHO_PARTNERS, RHO_FIELD, HOLE_NOISE_FRAC are judgment calls. Given the finding in section 3 - that we cannot even detect skill in the results - we certainly cannot estimate these second-order correlations from data. This isn't wrong, but it should be labeled as a reasonable prior, not an inference. The model's existing sensitivity analysis is the right way to handle it.
Net assessment: the model is over-engineered relative to the information content of the data. That is not a knock on the craft - it is a statement that the data has very little to say about who wins, and a model that produces sharp-looking 43% favorites can give a false sense of precision. Its saving grace is that its outputs are nearly flat anyway.
| Asset | What it is | Used for |
|---|---|---|
| hist_team.csv | 198 team-flight RESULTS (2024-25), match points (0-50) + finishing position | repeatability, Plackett-Luce, calibration backtest |
| hist_player.csv | the same exploded to 396 player-seasons, joined to locked handicap (71% matched) | empirical-Bayes player abilities |
| teams26.csv | 2026 field (120 teams), locked handicaps, GHIN ids | field ability table |
| prices.csv | 2024-25 auction prices, flagged censored at the per-year cap | Tobit price model |
Response variable. Within each 6-team flight the round-robin distributes exactly 150 match points, so a team's points are zero-sum within its flight. We center points within flight-year (pts_centered) so "points above the flight average" is the comparable, mean-removed response. Typical within-flight points SD ~ 4.37.
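A minimal sketch of the centering step, assuming a simple list-of-dicts layout. The column names and the function `center_within_flight` are hypothetical and are not taken from prep.py.

```python
from collections import defaultdict

def center_within_flight(rows):
    """Add pts_centered = points minus the mean of the team's flight-year.

    rows: list of dicts with keys 'year', 'flight', 'team', 'points'
    (an assumed layout, not necessarily that of hist_team.csv).
    """
    groups = defaultdict(list)
    for r in rows:
        groups[(r["year"], r["flight"])].append(r["points"])
    means = {key: sum(pts) / len(pts) for key, pts in groups.items()}
    return [
        {**r, "pts_centered": r["points"] - means[(r["year"], r["flight"])]}
        for r in rows
    ]

# Made-up flight: six teams whose points sum to 150, so the flight mean is 25.
toy = [
    {"year": 2024, "flight": 1, "team": t, "points": p}
    for t, p in zip("ABCDEF", [31, 28, 26, 24, 22, 19])
]
centered = center_within_flight(toy)
```

Because each flight's points sum to the same total, the centered values sum to zero within every flight-year, which is what makes them comparable across flights.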
Methods applied (each justified against our data, not a textbook tour):
We deliberately did not reach for PyMC/Stan: with this little signal, a full Bayesian hierarchy adds machinery without changing conclusions. Empirical-Bayes shrinkage gives the same partial-pooling answer in closed form.
Each point is a team that played in both 2024 and 2025; axes are its points relative to its flight average. If finishing well were a durable property, the cloud would slope up. It doesn't.
| Quantity | Value |
|---|---|
| Within-flight points SD | 4.37 |
| Team year-over-year repeatability r | -0.119 (n=33, p=0.51) |
| Player year-over-year repeatability r | -0.143 (n=121, p=0.12) |
| Bootstrap 95% CI on team r | [-0.41, +0.17] (includes 0) |
| Implied skill share of points variance | ~ 0% |
Under the standard signal+noise decomposition, the between-year correlation equals the skill share, var_skill / (var_skill + var_noise). With r ~ 0 (and not significantly different from it), the estimated skill variance is essentially zero: net match-play points behave like draws from a common distribution. A permutation test for repeatability returns p ~ 0.50 - exactly what pure noise gives.
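The repeatability check can be sketched as follows. This is an illustration only: the two-sided form of the test, the function names, and the toy numbers are assumptions, not the code in models.py.

```python
import math
import random

def pearson_r(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def permutation_pvalue(x, y, n_perm=5000, seed=0):
    """Two-sided p-value: share of shuffles whose |r| is at least the observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    y_shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_shuffled)  # break any year-to-year pairing
        if abs(pearson_r(x, y_shuffled)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction avoids p = 0

# Made-up centered points for eight teams in two consecutive years.
year1 = [6.0, 3.0, 1.0, -1.0, -3.0, -6.0, 2.0, -2.0]
year2 = [-1.0, 2.0, -4.0, 3.0, 0.0, 1.0, -2.0, 1.0]
r = pearson_r(year1, year2)
p = permutation_pvalue(year1, year2)
```

Under the signal+noise model, r is itself the estimate of the skill share, so a permutation p-value near 0.5 says the observed pairing of years is no more correlated than a random re-pairing.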
This does not mean players have no golf skill. It means the net, better-ball, 9-hole, points-accumulation format successfully equalizes the field - the point of a member-member. The residual that decides finishes is dominated by which day you caught lightning.
We fit the one-way random-effects model y_ij = μ + a_i + e_ij, with player effects a_i ~ N(0, τ²) and residuals e_ij ~ N(0, σ²), then shrink each player's mean toward zero by the optimal factor n_i / (n_i + σ²/τ²), where n_i is the number of seasons observed for player i.
| Component | Estimate |
|---|---|
| Within-player residual SD (σ) | 4.24 |
| Between-player SD (τ) | 1.03 |
| Shrinkage constant k = σ²/τ² | 17.1 |
| Mean shrinkage weight | 0.087 |
With most players having only 1-2 seasons, the shrinkage weight is ~0.09 - i.e., we pull raw player means ~91% of the way back to the field average. This is the statistically correct response to a tiny between-player variance: believe almost none of a one-year result.
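As a worked illustration of the shrinkage factor, using the rounded σ and τ from the table above (function names are hypothetical; the +8 raw mean is a made-up input):

```python
def shrinkage_weight(n_seasons, sigma, tau):
    """Weight n / (n + sigma^2 / tau^2) placed on a player's raw mean."""
    k = sigma ** 2 / tau ** 2
    return n_seasons / (n_seasons + k)

def shrunk_ability(raw_mean, n_seasons, sigma=4.24, tau=1.03):
    """Empirical-Bayes estimate: the raw centered mean pulled toward zero."""
    return shrinkage_weight(n_seasons, sigma, tau) * raw_mean

k = 4.24 ** 2 / 1.03 ** 2              # about 16.9 from the rounded table values
one_year = shrunk_ability(8.0, 1)      # +8 points above flight average, one season
two_years = shrunk_ability(8.0, 2)     # same raw mean sustained over two seasons
```

With the rounded inputs k comes out near 16.9 rather than the reported 17.1, presumably because the table rounds σ and τ. Either way, a single +8 season is credited with well under one point of durable ability, and a second season at the same level roughly doubles that.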
The shrunk abilities collapse toward the origin: a player who "won" his flight by a mile in one year is credited with only a fraction of a point of durable ability. This is the honest version of a power ranking. The resulting 2026 field ability table is in section 7.
Match play is paired comparison, so we fit a Plackett-Luce model (Bradley-Terry generalized to full finishing orders): within each flight, the finishing order is a sequential "pick the best remaining" draw governed by latent team strengths θ. Teams in multiple flight-years share a strength; an L2 ridge regularizes toward equal strengths.
| Result | Value |
|---|---|
| Teams / races | 163 / 33 |
| In-sample log-lik (fit vs null) | -181.6 vs -217.1 |
| Out-of-sample (2024-learned θ -> predict 2025 order) | -100.3 |
| Out-of-sample with equal strengths | -98.7 |
| Learned strengths beat equal out-of-sample? | No |
The in-sample likelihood improves - but that is overfitting: 163 free strength parameters on 33 short races will always fit the noise. The decisive test is out-of-sample, and there the strengths learned from 2024 fail to beat assuming all teams equal when predicting 2025. This corroborates section 3 from a completely different modeling family: the latent team strength that a paired-comparison model would estimate is, here, indistinguishable from noise. Comparing out-of-sample likelihood against the equal-strength null is the appropriate way to discover that such a model has nothing to grip.
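For reference, the Plackett-Luce likelihood of a single finishing order can be sketched as below. This is a simplified illustration that ignores ties and the ridge penalty; the function name and the team labels are hypothetical.

```python
import math

def plackett_luce_loglik(order, theta):
    """Log-likelihood of one finishing order under Plackett-Luce.

    order: team ids from first place to last place.
    theta: dict mapping team id -> latent log-strength.
    """
    remaining = list(order)
    ll = 0.0
    while len(remaining) > 1:
        # Probability that the next finisher is picked from those still remaining.
        log_denom = math.log(sum(math.exp(theta[t]) for t in remaining))
        ll += theta[remaining[0]] - log_denom
        remaining.pop(0)
    return ll

# Equal strengths: every one of the 6! orderings is equally likely.
equal = {t: 0.0 for t in "ABCDEF"}
null_ll = plackett_luce_loglik("ABCDEF", equal)   # equals -log(720)
```

With equal strengths each six-team flight contributes -log(6!), about -6.58, so 33 flights give roughly -217.1, which matches the null log-likelihood in the table. The equal-strength out-of-sample figure of -98.7 is likewise about what 15 flights at -6.58 each would give.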
We score three predictors of the flight winner against the 33 actual flight winners (2024-25), using multiclass log-loss and Brier. Ties for first split the winner mass.
| Predictor | Log-loss (lower=better) | Brier (lower=better) |
|---|---|---|
| Uniform 1/6 (no-signal baseline) | 1.792 | 0.758 |
| Handicap softmax - best in-sample temperature | 1.792 (β*=0.00) | 0.758 |
| Handicap softmax - honest leave-one-year-out | 1.933 | 0.789 |
Two things stand out. First, the in-sample optimizer drives the handicap temperature to zero - the best the handicap model can do is ignore handicap and predict uniform. Second, the honest leave-one-year-out handicap model is worse than uniform on both rules. A predictor that loses to the uniform 1/6 baseline out-of-sample carries no usable information about the winner.
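The two scoring rules and the handicap softmax can be sketched as follows. Assumptions: the Brier score is summed over the six teams in a flight (not averaged), the softmax sign convention favors lower handicaps, and all names and handicap values are illustrative.

```python
import math

def log_loss(probs, outcome):
    """Multiclass log-loss; outcome holds the winner mass (ties split it)."""
    return -sum(y * math.log(p) for p, y in zip(probs, outcome) if y > 0)

def brier(probs, outcome):
    """Multiclass Brier score, summed over the teams in the flight."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcome))

def handicap_softmax(handicaps, beta):
    """Win probabilities proportional to exp(-beta * handicap)."""
    m = min(handicaps)  # subtract the minimum for numerical stability
    w = [math.exp(-beta * (h - m)) for h in handicaps]
    total = sum(w)
    return [x / total for x in w]

uniform = [1 / 6] * 6
sole_winner = [1, 0, 0, 0, 0, 0]
two_way_tie = [0.5, 0.5, 0, 0, 0, 0]

ll_uniform = log_loss(uniform, sole_winner)      # log(6), about 1.792
brier_sole = brier(uniform, sole_winner)         # 5/6
brier_tie = brier(uniform, two_way_tie)          # 1/3
flat = handicap_softmax([9.1, 11.5, 12.2, 12.6, 19.3, 19.7], beta=0.0)
```

The uniform log-loss is log(6) whether or not there is a tie, which is the 1.792 in the table. Under this Brier definition a sole winner scores 5/6 and a two-way tie scores 1/3, so an average of 0.758 is consistent with a few tied flights. At β = 0 the softmax collapses to the uniform prediction.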
The reliability view is blunt: bin every historical team by its within-flight handicap rank (1 = lowest handicap = nominal favorite) and plot the realized win rate. Every bin hovers around the 1/6 baseline. The nominal favorite wins about as often as the nominal longshot. There is no monotone "favorites win more" gradient to calibrate against.
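The binning behind the reliability view can be sketched like this. The input layout and the function name are assumptions; handicap ties are broken by input order here, which may differ from the audit's handling.

```python
from collections import defaultdict

def win_rate_by_handicap_rank(flights):
    """Realized win rate per within-flight handicap rank.

    flights: list of flights, each a list of (handicap, win_mass) pairs,
    where win_mass is 1 for a sole winner, 0.5 for a two-way tie, else 0.
    Rank 1 is the lowest handicap (the nominal favorite).
    """
    wins = defaultdict(float)
    counts = defaultdict(int)
    for flight in flights:
        ranked = sorted(flight, key=lambda team: team[0])
        for rank, (_, win_mass) in enumerate(ranked, start=1):
            wins[rank] += win_mass
            counts[rank] += 1
    return {rank: wins[rank] / counts[rank] for rank in sorted(counts)}

# Made-up three-team flights, only to show the mechanics.
toy_flights = [
    [(10.0, 1), (12.0, 0), (15.0, 0)],
    [(9.0, 0), (11.0, 0), (14.0, 1)],
]
rates = win_rate_by_handicap_rank(toy_flights)
```

If handicap carried information, the rate for rank 1 would sit clearly above the rate for the last rank; a flat profile around 1/6 is what the audit reports.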
(We cannot score the live 2026 Monte-Carlo model directly on history - it only produces 2026 teams - so we test the axis it relies on: handicap-implied favoritism. That axis is flat. The live model's own outputs are, appropriately, also nearly flat: entropy ratio 0.97, median flight-favorite p_win just 0.23.)
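The entropy ratio quoted here can be read as Shannon entropy divided by its maximum, log(6) for a six-team flight. A minimal sketch, assuming that definition (how the audit aggregates across flights is not stated, and the lopsided probabilities below are invented for illustration):

```python
import math

def entropy_ratio(probs):
    """Shannon entropy of probs divided by the maximum, log(len(probs))."""
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))

uniform_ratio = entropy_ratio([1 / 6] * 6)   # a perfectly flat flight
# Invented flight with one 43% favorite; not a real 2026 flight.
lopsided_ratio = entropy_ratio([0.43, 0.15, 0.12, 0.11, 0.10, 0.09])
```

A ratio of 1 means the model treats the flight as a pure lottery, and a ratio of 0 means it is certain of the winner. Even a flight with a pronounced favorite stays fairly close to 1.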
Given sections 3-6, the most defensible team ranking is not the handicap-driven win probability - it's the shrunk track record. For each 2026 team we sum its two players' empirical-Bayes abilities (section 4). This is a small signal (team-level SD ~ 0.48 points, against a ~4.4 noise SD) and should be read as a gentle lean with wide uncertainty, not a prediction. Critically, it is nearly uncorrelated with the live model's p_win (r = 0.05) and with team handicap (r = 0.09) - genuinely orthogonal information.
Top of the field by historical track record (shrunk):
| Flight | Team | Team hcp | EB ability | Live p_win |
|---|---|---|---|---|
| 20 | Ballard + Armstrong | 47.4 | +1.29 | 14% |
| 13 | Panessa + Ratliff | 23.1 | +1.25 | 17% |
| 9 | Wood + Estes | 19.3 | +1.22 | 43% |
| 19 | Vaniman + Leingang | 35.5 | +1.15 | 12% |
| 19 | Truitt + Fetter | 36.0 | +1.14 | 11% |
| 10 | Gelinas + Chafin | 19.7 | +0.97 | 16% |
| 4 | Green + Green | 11.5 | +0.91 | 19% |
| 5 | Wright + Copeland | 12.2 | +0.91 | 20% |
Bottom of the field by track record (have historically under-performed their flight):
| Flight | Team | Team hcp | EB ability | Live p_win |
|---|---|---|---|---|
| 5 | Pearson + Ferguson | 12.6 | -1.12 | 21% |
| 10 | Shirley + Aronson | 20.8 | -0.93 | 23% |
| 12 | Collins + Minter | 22.5 | -0.86 | 20% |
| 15 | Jaillet + Jaillet | 26.0 | -0.81 | 17% |
| 3 | Embleau + Loewenthal | 9.1 | -0.81 | 18% |
Note the disagreements: Wood + Estes is both a track-record leader and the live model's strongest favorite - a rare case where two independent signals align (lean in). Conversely several teams the live model likes (high p_win) have negative track records - their favoritism rests on handicap, which section 6 shows is not predictive. Full table: field_ability_2026.csv.
Auction prices are right-censored at the bid cap. The censoring is what makes OLS misbehave.
The red bar is the spike of 2024 teams pinned at the \$1,000 cap - 14% of the field. OLS treats each as a precise \$1,000 observation, biasing the slope toward zero and shrinking the fitted dispersion.
2024 (severe censoring, 15/108 capped), log-price ~ flight seed:
| | OLS | Tobit |
|---|---|---|
| Seed slope | 0.0448 | 0.0485 |
| Residual σ | 0.361 | 0.407 |
| σ inflation (Tobit/OLS) | - | x1.13 |
Tobit recovers a steeper seed effect and 13% more dispersion - the strongest teams were worth more than the censored OLS fit implies, and the price spread is wider. This is exactly what PRICE_SPREAD = 1.85 approximates - but Tobit estimates it from the data instead of guessing.
2025 (mild censoring, 1/90), log-price ~ handicap + within-flight rank: here Tobit ~ OLS (σ ratio 0.99, slopes within 0.0002), because with almost no censoring there is nothing to correct. The honest read: use Tobit and let the data decide how much correction is needed - it self-deactivates when the cap doesn't bind (2025) and engages when it does (2024, and likely 2026 at the top).
Recommendation. Replace the OLS-plus-PRICE_SPREAD-plus-PRICE_LEVEL stack with a single Tobit fit on pooled 2024-25 data with a year fixed effect and each year's cap, then project to 2026 at the \$2,000 cap. It removes two unidentified knobs and is the standard estimator for capped prices.
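The core of such a fit is the right-censored normal log-likelihood. The sketch below assumes a single covariate and hypothetical argument names. It is not the code in models.py, and a year fixed effect would add further terms to the mean.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used for pdf and cdf

def tobit_loglik(params, x, log_price, log_cap, censored):
    """Right-censored normal log-likelihood for log_price = a + b*x + eps.

    params: (a, b, log_sigma); sigma is kept on the log scale so it stays positive.
    log_cap[i] is that sale's year-specific cap; censored[i] is True if it bound.
    """
    a, b, log_sigma = params
    sigma = math.exp(log_sigma)
    ll = 0.0
    for xi, yi, ci, is_cens in zip(x, log_price, log_cap, censored):
        mu = a + b * xi
        if is_cens:
            # Capped sale: we only know the latent price is at least the cap,
            # so it contributes P(latent >= cap) = Phi(-(cap - mu) / sigma).
            ll += math.log(_STD_NORMAL.cdf(-(ci - mu) / sigma))
        else:
            # Uncapped sale: ordinary normal log-density.
            z = (yi - mu) / sigma
            ll += math.log(_STD_NORMAL.pdf(z)) - log_sigma
    return ll

# A capped sale whose predicted mean sits exactly at the cap contributes log(0.5).
at_cap = tobit_loglik((0.0, 0.0, 0.0), [0.0], [0.0], [0.0], [True])
```

Maximizing this over (a, b, log_sigma) with a general-purpose optimizer gives the Tobit estimates. When no observation is flagged as censored, every term is the ordinary normal density and the fit reduces to the Gaussian (OLS) likelihood, which is the self-deactivating behavior described above.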
player_abilities_eb.csv are large relative to the abilities themselves for nearly every player (most are within ~2 SD of zero) - formal confirmation individual rankings are unreliable.
PRICE_SPREAD, but estimated, identified, and self-deactivating when the cap doesn't bind.