How Our Bracketology Model Works

I’ve been holding off on writing this piece this season, out of concern that a coronavirus-related sea change in the college basketball landscape would force a major revamp of the model, and with it a rewrite of this description of how the model works.

For now, though, we seem to be headed in a straightforward direction: Most conferences seem to be trying to get all their games in, at least for the time being, and the NCAA Tournament is evidently going to be 68 teams, though we still await a decision on the NIT. (Aside: Waiting to make a decision on the NIT is smart of the NCAA. Things, as we know, can change very quickly with the coronavirus. In two months’ time, a much wider portion of the population should be vaccinated, and at the very least we’ll certainly know more about infection rates and how many teams are available.) For the time being, it’s not all that abnormal a season for our model to simulate, though there were fewer nonconference games, and the upkeep of the model (tracking postponements, cancelations, and schedule changes) is much more significant than in years past.

So let’s go through how the model works, acknowledging up front that the model is imperfect, especially for the situation at hand, and that we’re working to improve it.

Purpose

We’ve done away with the probabilities page for the time being, as we await a more concrete plan for each conference tournament and continue to gauge how common postponements are (numerically, I mean; we know the non-numerical answer is “extremely common”). So the model’s purpose, every time we run it, is to produce the median projected bracket for both the NCAA Tournament and the NIT: something a basketball fan can look at and understand rather broadly where a certain team lies in the postseason picture.

Inputs

The model relies on eight and a half inputs. I’ll get to the half.

The first six are the ratings systems the NCAA puts on the selection committee’s team sheets: KenPom, BPI, Sagarin, NET, BPI SOR, and KPI. Our model, as we’ll get to, bases its entire simulation of the selection process on these six systems, combining them via a weighted average fit to past brackets.
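
Sketched in code, a composite of that sort works roughly like this. This is a minimal illustration, not the actual implementation; the weights below are invented, since the real ones are fit to past brackets.

```python
# Minimal sketch of a weighted average across the six team-sheet
# systems. The weights are illustrative placeholders, not the model's
# actual fitted values.

def composite_rank(ranks, weights):
    """Weighted average of a team's rankings across the six systems
    (KenPom, BPI, Sagarin, NET, BPI SOR, KPI); lower is better."""
    return sum(r * w for r, w in zip(ranks, weights)) / sum(weights)

# A hypothetical team's ranking in each system:
ranks = [12, 15, 14, 10, 18, 20]
weights = [0.25, 0.15, 0.15, 0.20, 0.15, 0.10]  # invented for illustration
score = composite_rank(ranks, weights)  # one number to seed teams by
```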

The seventh is the schedule, which we take from ESPN every time we run the model (update: we’re in the process of switching to getting the data directly from the conferences—it shouldn’t affect the model’s readouts right now, but we wanted to be transparent). The half-input is the schedule shown on KenPom, which we reference when there’s a clear error in the ESPN schedule (we’ll get to this whole process in a bit).

The eighth is the winner of each game. That’s it. Those are the only inputs.

Ratings Proxies

Early in the season, NET and KPI aren’t available, and when they first become available, they and BPI SOR aren’t wholly indicative of where they’ll end up. There’s a lot of noise early in the season, so rather than taking the current NET/KPI/SOR rankings at face value as stand-ins for their final state, we use an Elo-like system for each. For NET, the system is based on KenPom, given the similarities between the two. For SOR, the system is based mostly on BPI, just as BPI SOR is naturally, with a bit of KenPom and Sagarin mixed in. For KPI, the system is based on an average of KenPom, Sagarin, and BPI.

Two notes on this:

First, this is an area ripe for improvement, and a big summer wishlist item is to build better approximations of these systems themselves so we don’t have to use such clunky proxies (which come with a wide range of uncertainty due to their clunkiness).

Second, this will change around the end of January, once KPI, SOR, and NET have had a chance to find their level. We’ll still use an Elo-based system to simulate where the rankings will go in conjunction with specific simulated results, but at that point we’ll have enough data in those systems to base each proxy on the system itself. NET’s Elo will be based on NET and KenPom. SOR’s will be based on a BPI/SOR combination. KPI’s will be based on KPI alone. (Update: This is done now.)
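
For a rough illustration of the Elo mechanics involved, here’s a minimal sketch. The K-factor and scale below are standard textbook Elo values, not the model’s actual parameters, and the real proxies blend multiple systems as described above.

```python
# Hypothetical sketch of an Elo-like proxy: nudging a team's simulated
# rating after each simulated result. K and the scale are textbook
# Elo defaults, chosen for illustration only.

def expected_score(rating_a, rating_b, scale=400.0):
    """Classic Elo expected score for team A against team B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))

def elo_update(rating, opp_rating, won, k=20.0):
    """Move a rating toward the outcome of one simulated game."""
    exp = expected_score(rating, opp_rating)
    return rating + k * ((1.0 if won else 0.0) - exp)
```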

Simulating Individual Games

We use KenPom’s ratings to simulate each individual game. We don’t know his exact system, but we built ours off his published game projections, so we believe it to be quite similar.
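
For a sense of what an efficiency-based game simulation can look like, here’s a minimal sketch. The home-court bonus and the standard deviation of the scoring margin are assumptions for illustration; we don’t know KenPom’s actual values, and this isn’t necessarily how our version works either.

```python
import math
import random

# Minimal sketch: win probability from KenPom-style adjusted efficiency
# margins. MARGIN_SD and HOME_EDGE are assumed values, not KenPom's.

MARGIN_SD = 11.0   # assumed spread of final margins around the projection
HOME_EDGE = 3.5    # assumed home-court advantage in points

def win_probability(adj_em_a, adj_em_b, a_is_home=False):
    """P(team A beats team B), via a normal CDF on the projected margin."""
    margin = adj_em_a - adj_em_b + (HOME_EDGE if a_is_home else 0.0)
    return 0.5 * (1.0 + math.erf(margin / (MARGIN_SD * math.sqrt(2))))

def simulate_game(adj_em_a, adj_em_b, a_is_home=False, rng=random):
    """Return True if team A wins one simulated game."""
    return rng.random() < win_probability(adj_em_a, adj_em_b, a_is_home)
```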

Scheduling

This is unique to this year’s model.

As I said earlier, every time we run the model, we take the latest schedule from ESPN (see the update parenthetical above). We then go through and even it out so each team has the same number of conference games. In other words, if we were to look at the SEC and see that both Auburn and Mississippi State have 17 conference games while the rest of the league has 18, we would add a presumed game between Auburn and Mississippi State. Sometimes these adjustments involve more than adding a single game in a conference. This is another source of uncertainty, and it’s why we’re holding off on publishing any probabilities from our model, but it doesn’t sizably influence bracketology at this point in the season. (Why: this early in the year, our bracketology reflects how good a team is and its general schedule environment more than how that specific schedule will play out. The disparity within conferences isn’t wide enough at this point for our presumptions to push the model far enough from the true median to be noteworthy.)
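
The evening-out step can be sketched like this. It’s a deliberately simplified, hypothetical version: real conference schedules need messier matching logic, and the function below just pairs off the short teams with each other, as in the Auburn/Mississippi State example.

```python
from collections import Counter

# Simplified sketch of schedule evening: find teams with fewer scheduled
# conference games than the rest and pair them off with presumed games.
# Assumes an even number of short teams, as in the two-team example.

def even_out(schedule, conference_teams, target_games):
    """schedule: list of (team_a, team_b) pairs within one conference.
    Returns the schedule plus presumed games so everyone hits the target."""
    counts = Counter()
    for a, b in schedule:
        counts[a] += 1
        counts[b] += 1
    short = [t for t in conference_teams if counts[t] < target_games]
    added = []
    while len(short) >= 2:  # pair off the teams that are a game short
        a, b = short.pop(), short.pop()
        added.append((a, b))
    return schedule + added
```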

As the season approaches its end, we’ll comb through conferences for their final answers on how they’ll weigh winning percentage against games played, and use that in their standings. We’ll also comb through for conference tournament plans, which for now, given all the uncertainty, we model as identical to last year’s, except held on neutral courts (in conferences with fewer teams, we simply shrank the tournament).

We leave all games marked “postponed” in the model and remove all those marked “canceled.” But with conference play underway, we now remove nonconference games marked “postponed” as well, even though they ostensibly could still be made up (we just don’t see it happening).

Running the Model

When our model runs, it goes through every game ahead of it, simulating each based on current KenPom ratings (we don’t adjust for teams getting better or worse over time; instead, we account for that in our broader uncertainty variable). Then, for each simulated array of results, it simulates the selection process: assigning automatic bids for each tournament based on regular-season and conference-tournament results, then seeding each team based on its projected ratings (the weighted average we spoke of earlier). We may get to updating the selection process by the end of the year to account for how the committee treats outliers (there’s some evidence a single outlying variable on a team sheet can pull a team down, or possibly up, even though it’s already baked into those systems), but we can’t guarantee we’ll get there. More to come.
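
One pass of that loop might look like the sketch below. The `win_prob` and `seed_teams` functions are stand-ins for the game-simulation and selection pieces described elsewhere in this post; everything here is illustrative, not our actual code.

```python
import random
from collections import defaultdict

# Illustrative sketch of the simulation loop: play out every remaining
# game, hand the results to a selection-process stand-in, and record
# each team's overall seeding for that simulation.

def run_simulations(games, n_sims, win_prob, seed_teams, rng=None):
    """games: list of (team_a, team_b); win_prob(a, b) -> P(a wins);
    seed_teams(results) -> {team: overall seed}. Returns seed samples
    per team across all simulations."""
    rng = rng or random.Random()
    seed_samples = defaultdict(list)
    for _ in range(n_sims):
        results = []
        for a, b in games:
            winner = a if rng.random() < win_prob(a, b) else b
            results.append((a, b, winner))
        for team, seed in seed_teams(results).items():
            seed_samples[team].append(seed)
    return seed_samples
```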

Once we have all those seedings, our model converts them into a median and mean for each team, while also tracking how often each team receives its league’s automatic bid. From there, we construct our brackets.

For the NCAA Tournament’s automatic bids, we take the most likely recipient in each conference. For the NIT’s, we take the median number of automatic bids awarded and assign those to the most likely recipients overall, limiting ourselves to one per conference (update: since the NIT has been condensed to 16 teams for this year and automatic bids have been eliminated, our model now puts conference tournament favorites into the NIT only if their conference tournament win probability is below 50%). For the NCAA Tournament, we then fill in at-large bids by going in order through teams’ median seedings, using mean seeding as a tiebreaker. For the NIT’s at-large bids, we follow the same process, but we begin immediately after the median cut line in our model’s simulations, which is how we account for bid thieves (if the median last team into the NCAAT field is seeded 47th overall but, by the nature of the thing, we have 48 at-large teams in our projected NCAAT bracket, the 48th team will appear in both brackets: we project it to land in the NIT, but we need to fill out our NCAAT bracket too).
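
The at-large fill, sorting by median seeding with mean as the tiebreaker, can be sketched as follows. Team names, seed samples, and field sizes below are invented for illustration.

```python
from statistics import mean, median

# Sketch of at-large selection: order the candidates by median overall
# seeding across simulations, break ties with mean seeding, and take
# teams until the field is full. Inputs here are invented examples.

def fill_at_large(candidates, auto_bid_teams, field_size):
    """candidates: {team: list of simulated overall seedings}.
    auto_bid_teams: set of teams already in via automatic bids."""
    pool = [t for t in candidates if t not in auto_bid_teams]
    pool.sort(key=lambda t: (median(candidates[t]), mean(candidates[t])))
    open_spots = field_size - len(auto_bid_teams)
    return pool[:open_spots]
```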

That’s it. If you have questions, send them our way and we’ll try to answer them directly or here.

The Barking Crow's resident numbers man. Was asked to do NIT Bracketology in 2018 and never looked back. Fields inquiries on Twitter: @joestunardi.
