We’re Going Beyond Excel

I’ve mentioned this in various places around here, at various times, but in an effort to hold myself accountable on the predictive models, I’m trying to post a progress update roughly once a week. Which brings me, this week, to a confession:

We’ve been using Excel this whole time.

I don’t know whether this is embarrassing, because I don’t know enough to know whether this is embarrassing. But I have the impression that Excel is far less efficient than other tools we could be using to set up and run our Monte Carlo simulations. Last year, running the 4,000 daily simulations of our college basketball probability model took Excel two hours, and that was after roughly fifteen minutes a day of prep work on our end. Again, I don’t know enough to know how much faster this should be, especially since we run these from a laptop (yes, I’d imagine a desktop would be faster, so that too is on our radar), but it seems we could gain enough speed to run more simulations in the same amount of time, which would improve our precision. And in general, since we want our overall quality to be a level higher than it is now, it seems like a good idea to explore options other than Excel.

Which brings us to Python.

I told a data scientist once that I was working on improving my own data science skills, then explained why, as well as what I’m currently doing. He suggested learning Python, and recommended using pandas to do so. I’ve heard anecdotes (yes, we’re at that level of my not knowing what I’m doing technologically) about Python being useful for web scraping, so as I attempt to build out an NHL model over the next week, I’m going to give it a shot, as an entry point. Installed pandas today. Found all the data tables we need on Hockey Reference. We’ll see how it goes.
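For a sense of what a Monte Carlo simulation looks like outside a spreadsheet, here’s a minimal sketch in plain Python. It is an illustration, not our actual model: the function names, the best-of-seven setup, and the fixed single-game win probability are all assumptions made up for the example.

```python
import random


def simulate_series(p_win, games=7, rng=random):
    """Play out one best-of-`games` series; return True if our team wins it.

    p_win is an assumed, fixed probability of winning any single game.
    """
    needed = games // 2 + 1
    wins = losses = 0
    while wins < needed and losses < needed:
        if rng.random() < p_win:
            wins += 1
        else:
            losses += 1
    return wins == needed


def estimate_series_probability(p_win, n_sims=4000, seed=0):
    """Estimate the series win probability as the share of simulated series won."""
    rng = random.Random(seed)  # seeded so repeated runs give the same estimate
    series_won = sum(simulate_series(p_win, rng=rng) for _ in range(n_sims))
    return series_won / n_sims
```

With a single-game win probability of 0.6, the exact best-of-seven answer is about 0.71, so `estimate_series_probability(0.6)` should land close to that, with some noise. Running more simulations shrinks that noise, which is why speed matters for precision.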

Now, the NHL model.

I’m aiming to have a full explanation of how our finished model works next week, but this week, I’d like to explain why we chose the NHL and outline our approach to building the model.

First, why the NHL:

We make no secret of the fact that we try to emulate FiveThirtyEight in a lot of ways with these models. The concept (Monte Carlo simulations yielding published probabilities) is their thing, and while it exists elsewhere on the internet, it’s hard to find another site that does it with the same levels of professionalism and transparency. They’re worth emulating because they do a good job. They’re easy to try to emulate because they’re so transparent.

FiveThirtyEight, in the sports world, has models for the MLB, the NFL, the NBA, and various club soccer leagues. They do not model the NHL, golf, tennis, or NASCAR. We don’t think we can do as good a job as they do with the NBA. We don’t think we can do as good a job as ESPN’s FPI does with the NFL. We don’t think we can do as good a job as FanGraphs does with the MLB. But with FiveThirtyEight not participating in the NHL picture, with other playoff probability models either not particularly transparent or hidden on back pages and obscure blogs (like ours, to be fair), and with the NHL a major sport even if not as major as the NFL, the NBA, and the MLB, there might be an opportunity here, and we think it’s worthwhile to find out.

Compared to modeling collegiate sports, modeling a professional league is easy. There isn’t a subjective committee to predict. There is only one set of tiebreaker rules, and they’re widely available, and they don’t change year-to-year (God bless anyone digging around for football tiebreakers in the Group of Five and the FCS). The talent disparity is narrow enough that single-game probabilities don’t get as close to 100% as some do in college sports, which leaves much more cushion for error. These are our perceptions, and they’re why we’re trying to break into this space.

Now, for how we’re trying to do that:

Like FiveThirtyEight, we’re going to start with an Elo-like system as a basis for our model. Elo, for those not familiar, is a rating system used in chess in which players gain and lose points by winning and losing games, with the number of points gained by the winner equaling the number of points lost by the loser, making the system zero-sum in nature. If something expected happens (a highly rated player beating a lowly rated player), the number of points exchanged is small. If something unexpected happens (a lowly rated player beating a highly rated player), the number of points exchanged is large. We don’t know if this is a perfect place to start, but we know how to do it (we use some Elo-like systems in our college basketball model), and if it’s good enough for FiveThirtyEight to use as a starting point, we trust that it’s a good enough starting point for us.
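The point exchange described above can be sketched in a few lines of Python. This is the textbook Elo update, not necessarily what our finished model will use: the 400-point scale is the chess convention, and the K-factor of 20 is an arbitrary value picked for illustration.

```python
def expected_score(rating_a, rating_b, scale=400.0):
    """Probability that side A beats side B under the standard Elo curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


def update_elo(winner_rating, loser_rating, k=20.0):
    """Return the (winner, loser) ratings after one game.

    The winner gains exactly what the loser gives up, so the system is zero-sum.
    k is an assumed K-factor; larger values make ratings react faster.
    """
    surprise = 1.0 - expected_score(winner_rating, loser_rating)
    delta = k * surprise  # small when the favorite wins, large in an upset
    return winner_rating + delta, loser_rating - delta
```

With these settings, a 1600-rated side beating a 1400-rated side moves about 4.8 points, while the reverse result moves about 15.2. Either way, the two ratings still add up to 3000 afterward.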

We want to adjust this, though, to account for more than just wins and losses, so we’ll be testing different methods of adjusting the Elo we build by goals and shots on goal. For offseason adjustments, we plan on using Vegas Stanley Cup futures odds to pull teams towards their new quality level (rather than trying to build our own system, or just applying a standard regression to the mean). We also hope to explore the impact of streakiness, and whether there’s any way to use recent results to augment our Elo in a way that strengthens its predictiveness, but that might be too deep a dive to manage this week.
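To make those two ideas concrete, here’s one way they could look in Python. Everything here is a hypothetical sketch, since we haven’t settled on a method: the logarithmic goal-margin multiplier, the K-factor of 6, the 50/50 blending weight, and the idea of a `market_rating` already derived from futures odds are all assumptions for illustration. How to turn futures odds into a rating is left open.

```python
import math


def expected_score(rating_a, rating_b, scale=400.0):
    """Probability that side A beats side B under the standard Elo curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


def margin_multiplier(goal_diff):
    """Hypothetical scaling: 1.0 for a one-goal win, growing slowly after that."""
    return 1.0 + math.log(max(goal_diff, 1))


def update_elo_with_margin(winner_rating, loser_rating, goal_diff, k=6.0):
    """Elo update where bigger wins exchange more points (still zero-sum)."""
    surprise = 1.0 - expected_score(winner_rating, loser_rating)
    delta = k * margin_multiplier(goal_diff) * surprise
    return winner_rating + delta, loser_rating - delta


def offseason_pull(rating, market_rating, weight=0.5):
    """Blend last season's rating toward an assumed market-implied rating.

    weight=0 keeps the old rating; weight=1 replaces it with the market's view.
    """
    return (1.0 - weight) * rating + weight * market_rating
```

The logarithm is there so a 6-1 blowout counts for more than a 2-1 win, but not five times more. Shots on goal could be folded in the same way, with a second multiplier, if testing shows they help.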

More to come, with hopefully a full model published at this time next week.

Joe Stunardi
