How Good Was Our Model?

There are a number of ways to evaluate the various bracketologies out there. One way is to simply look for the largest outliers between a final projection and the eventual bracket. Another is to look at the average difference between a team and its projected seed. Still another is to count up how many teams were correctly labeled as “in the field,” or correctly placed on a specific seed line.

Most scoring of bracketologies (and yes, people do score such things) is a combination of these approaches. It isn’t a perfect measure. It neglects every projection made prior to the final one, and it fails to evaluate the ease of access for fans: whether a bracketology is difficult to navigate, what explanations are made available, and so on. This isn’t to knock it. It’s a difficult thing to evaluate, and people understandably want scores and rankings of things. It’s competitive! It’s fun! In the end, though, most scoring looks for perfect bracket projections on Selection Sunday.

Our goal this year was not to produce perfect bracket projections. Down the line, yes, we’d love to reach that point, but for this year, our goal was to give you a model that was correct as often as it said it would be correct. Meaning, 100% of the things the model said were 100% likely to happen should have happened, 25% of the things the model said were 25% likely to happen should have happened, and so on. The term I’ve seen used for this aspect of similar models is how “well-calibrated” they are.
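If it helps to picture that check concretely, here’s a rough sketch in Python. The numbers and structure are invented for illustration, not pulled from our actual model: group every prediction by its stated probability, then compare each group’s stated probability to how often those events actually happened.

```python
from collections import defaultdict

# Hypothetical (stated probability, did it happen?) pairs.
predictions = [(0.90, True), (0.90, True), (0.90, False), (0.25, False), (0.25, True)]

buckets = defaultdict(lambda: [0, 0])  # stated probability -> [hits, total]
for prob, happened in predictions:
    bucket = round(prob, 1)            # group to the nearest 10%
    buckets[bucket][0] += happened
    buckets[bucket][1] += 1

for bucket, (hits, total) in sorted(buckets.items()):
    print(f"said {bucket:.0%}: happened {hits / total:.0%} of the time ({hits}/{total})")
```

A well-calibrated model is one where, bucket by bucket, the “happened” rate lands close to the stated probability.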

We’ll be evaluating this for all iterations of our bracketology, but for today, we’re looking only at the final projection, entering the NCAAT Selection Show. Part of why we’re doing this is that it’s a manageable quantity to go over in an hour or two. Part of it is that it’s less overwhelming for us to explain in a blog post. And part of it is that because we’re already fairly confident in how our model simulates games, the calibration of the final projection will have a high correlation with the strength of previous projections.

Let’s begin, starting with the projections our model was most confident in:

100%, <1%, and 0%’s:

Excluding automatic NCAAT and NIT bids, where our model wasn’t really projecting anything, there were 8,152 events our model said were or weren’t going to happen with certainty or near-certainty. I include the <1%’s here because, given how our model rounds, a “<1%” corresponds to a raw probability of less than half a percent; flip it to the positive end of the spectrum and that’s a probability between 99.5% and 99.9%, which gets spat out as “100%.”
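If that’s confusing, the display rule amounts to something like the following. This is a simplified sketch, not our literal code, and the exact cutoffs are my shorthand for the rounding described above:

```python
def displayed_probability(p: float) -> str:
    """Roughly how a raw probability (0.0 to 1.0) shows up in our tables."""
    if p >= 0.995:
        return "100%"      # 99.5% and up rounds to 100
    if p == 0.0:
        return "0%"
    if p < 0.005:
        return "<1%"       # nonzero, but rounds to 0
    return f"{round(p * 100)}%"
```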

These events ranged from “Vanderbilt won’t make the NIT” to “Duke won’t receive a 16-seed” to “North Carolina will make the NCAA Tournament” to “North Carolina Central will receive a 16-seed.”

Of these 8,152 events, our model was correct 8,152 times. At the top line, our error bars were at least large enough, and possibly too large.

95%-99%, 1%-5%:

There were 518 events that fell into this bucket, our most confident non-certain bucket. Of these, our model missed two: it had UNC-Greensboro as only 4% likely to receive a 1-seed in the NIT, and it had Nebraska as only 5% likely to receive a 4-seed in the NIT.

Even if we include the implicit probabilities (for example, if a team is 29% likely to make the NCAAT and 69% likely to make the NIT, as Penn State was, we’ve implicitly said they’re 98% likely to make one field or the other, and roughly 97% likely to make the NIT once they’re clearly not in the NCAAT), we only missed three of these, with Penn State not qualifying being the main outlier.
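To spell out that implicit math, under the assumption that the NCAAT and NIT fields are mutually exclusive:

```python
p_ncaat = 0.29  # our stated chance Penn State made the NCAAT
p_nit = 0.69    # our stated chance Penn State made the NIT

p_either = p_ncaat + p_nit                    # 0.98: NCAAT or NIT
p_nit_given_no_ncaat = p_nit / (1 - p_ncaat)  # ~0.97: NIT, once the NCAAT says no
print(f"{p_either:.0%} to make a field; {p_nit_given_no_ncaat:.0%} to make the NIT if not the NCAAT")
```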

Still, using only the 518 explicit ones (to avoid going through and adding the implicit ones, and going back and editing the last section to include them), our model was correct on 516 of 518, or roughly 99.6% of the time, a figure that wouldn’t change much if we included the implicit probabilities.

This is an indication our error bars may have been too large. Moving on.

85%-94%, 6%-15%:

387 events fell into this bucket, centered near a 90% probability. We were wrong on 31 of them, for an overall success rate of 92%. Again, a sign our error bars may have been too large, though we’d have to look at the shape of the distribution of these probabilities to get a better idea.
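A crude way to pressure-test that, for the curious: treat the bucket as 387 independent events at roughly 90% each and see how far 31 misses sits from the expected count. (This rounds every event in the bucket to 90%, which the real list doesn’t do, so take it as a ballpark.)

```python
import math

n, p = 387, 0.90              # events in the bucket, approximate stated probability
observed_misses = 31

expected_misses = n * (1 - p)                     # ~38.7
spread = math.sqrt(n * p * (1 - p))               # binomial standard deviation, ~5.9
z = (observed_misses - expected_misses) / spread  # about -1.3
print(f"expected ~{expected_misses:.0f} misses, saw {observed_misses} (z ≈ {z:.1f})")
```

By that measure the shortfall is a bit over one standard deviation: consistent with error bars that are slightly too wide, but not proof of it.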

75%-84%, 16%-25%:

This bucket, centered around roughly an 80% probability, had 182 events. We were wrong on 45, for an overall success rate of 75%. This changes my impression—it’s possible our error bars used the right standard deviation, but should have been using something other than a normal distribution.

One more bucket to go.

26%-75%:

This bucket has the widest range, but it’s our smallest, holding only 77 events. This is mostly because the majority of the events in our overall model were seedings, but it’s also a product of how few things are really uncertain heading into the selection shows.

43% of these events happened, which is well within a normal error range.
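When we do the fuller write-up, the cleaner way to check a mixed bucket like this is to compare the number of events that actually happened against the sum of the stated probabilities, with the spread coming from the probabilities themselves. A sketch, using made-up numbers in place of the real 77:

```python
import math

stated = [0.30, 0.45, 0.55, 0.60, 0.72]      # hypothetical stated probabilities
happened = [False, True, True, False, True]  # what actually occurred

expected_hits = sum(stated)                           # what a calibrated model predicts
spread = math.sqrt(sum(p * (1 - p) for p in stated))  # one standard deviation of noise
print(f"expected {expected_hits:.1f} ± {spread:.1f} hits, observed {sum(happened)}")
```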

In Sum

All in all, it appears our model was, on the whole, pretty well-calibrated. It missed significantly on Penn State, Nebraska, and UNC-Greensboro, but it hit the remaining teams in a fairly normal distribution. (Another way we’ll be examining this going forward is to place each team’s eventual result on the bell curve of our model’s expectations for them, and then see whether we hit the right shape in the aggregate, but that’s more than we could do quickly.)
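For the curious, that bell-curve check would look something like the following: for each team, find where their actual seed landed within our simulated range of seeds, then see whether those landing spots spread out evenly across teams. The seed lists here are invented for illustration.

```python
# Hypothetical simulated seeds per team, plus the seed each actually received.
simulated_seeds = {
    "Team A": [1, 1, 2, 2, 3],
    "Team B": [8, 9, 9, 10, 11],
}
actual_seed = {"Team A": 2, "Team B": 11}

for team, sims in simulated_seeds.items():
    # Share of simulations in which the simulated seed was no better than the actual seed.
    percentile = sum(s >= actual_seed[team] for s in sims) / len(sims)
    print(f"{team}: actual seed landed at the {percentile:.0%} mark of our simulations")
```

If the model is calibrated, those marks should scatter roughly evenly between 0% and 100% across the whole field rather than piling up at either end.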

There were other significant misses, of course. In that second bucket, we had Gonzaga as only 9% likely to receive a 1-seed, a miss made more significant by the public consensus that Gonzaga would, as they did, receive that 1-seed.

Still, if we include Gonzaga, three of our four most significant misses came from outlier situations.

Nebraska’s level of injuries was expected to knock it down, and it did. Our model didn’t have an “Is this team dealing with so many injuries they might not be able to field a team” variable, but perhaps it could going forward.

The disconnect between Penn State’s overall winning percentage (they went 14-18) and the strength of the rest of their résumé (they played only four Quadrant 4 teams, won three Quadrant 1 games, went 4-4 against Quadrant 2, and had NET/KPI/BPI SOR ratings of 50/67/71 to go with even stronger predictive ratings) similarly made them an outlier. Frankly, I’m of the opinion they were treated unfairly by the NIT Selection Committee, which, as the NCAAT Selection Committee did by excluding Texas, incentivized power conference teams to feast on cupcakes in nonconference play, with just a few fellow power conference foes to keep their nonconference strength of schedule out of NC State’s territory. But that’s beside the point: the goal is to predict the behavior of these committees, whether I agree with their stances or not. And, to be fair, nonconference scheduling is made tricky by the uncertainty of whether a team will end up good or not. NC State likely wasn’t trying to end up with the weakest nonconference schedule by the NCAA’s metric. Anyway, I’ve digressed; more to come on all this, probably in November.

Gonzaga is also an outlier. For better or worse, they seem to receive the benefit of the doubt from the committee thanks to their past success, and are forgiven for their conference schedule more than fellow non-power conference programs.

This fall, once our college football model is up and running and we’re back to preparing for college basketball, we’ll look back at these misses, other misses, and general trends, and we’ll start building the model anew. We’re hoping to step up the overall strength of our model, tightening those error bars, running 10,000 simulations at a time instead of 1,000, and adjusting for predictable outlier situations (if they exist to a degree that enables predictability). Frankly, we should have had more events in those higher-confidence buckets, including things like “Will Duke be a 1-seed” where our model said yes but with only 49% confidence. Beyond that, we’ll also be trying to make it a more user-friendly model, making as much information available to you, the fan, as possible.
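On the simulation count specifically: the statistical noise in a simulated probability shrinks with the square root of the number of runs, so ten times the simulations buys roughly a three-times-tighter estimate. A quick back-of-the-envelope:

```python
import math

def simulation_noise(p: float, n_sims: int) -> float:
    """Standard error of a probability estimated from n_sims independent simulations."""
    return math.sqrt(p * (1 - p) / n_sims)

for n in (1_000, 10_000):
    print(f"{n:>6} sims: a true 50% event is pinned down to within ±{simulation_noise(0.5, n):.1%}")
```

That only addresses the simulation noise, of course, not whether the underlying probabilities are right, which is the bigger project.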

We’ll still be updating it with our Final Four and Championship probabilities for each tournament, so if you’re looking for those (especially for the NIT, where they’re generally less available), we’ve got ‘em.

Thanks for your interest in the model. Enjoy the tournaments.

The Barking Crow's resident numbers man. Was asked to do NIT Bracketology in 2018 and never looked back. Fields inquiries on Twitter: @joestunardi.