This post originally appeared on Hockey Graphs.
Note: This was originally intended to be a tweet-thread which grew far too long and unmanageable, so you’re getting a poorly-written post instead. Apologies in advance.
Recently, David Johnson, owner of the awesome puckalytics.com, has been on a bit of a warpath (pun intended) against the use of WAR/GAR. Most of David’s arguments can be found here and here, but there are some other comments in this thread.
I consider myself a bit of a WAR skeptic. I think Dawson’s work is great, but I think there are limitations/issues with it. A good summary of some of my concerns can be found in another ill-advised and long tweet thread.
With that being said, I still think it’s extremely useful as a first pass to start discussion. WAR can be broken down into 5 useful components to see where a player’s impact derives from.
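The value of that decomposition is that two players with similar totals can get there in very different ways. Here’s a minimal sketch of the idea; the component names and numbers are invented for illustration and are not Dawson’s actual model outputs:

```python
from typing import Dict

def total_gar(components: Dict[str, float]) -> float:
    """Total GAR is just the sum of its component parts."""
    return sum(components.values())

# Hypothetical component values, chosen only to show the shape of the argument.
scheifele = {"EV offence": 8.0, "EV defence": 0.5, "PP": 2.0,
             "penalties": 0.5, "faceoffs": 0.0}
perreault = {"EV offence": 4.0, "EV defence": 4.0, "PP": 2.5,
             "penalties": 0.5, "faceoffs": 0.0}

# Two very different profiles can land on the same total.
print(total_gar(scheifele), total_gar(perreault))
```

The point of looking at the breakdown rather than the headline number is exactly this: the total alone hides whether a player’s value comes from offence, defence, or special teams.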
For example, it’s clear in the Scheifele/Perreault example that the metric likes Perreault’s defence a lot more than Scheifele’s. David’s argument tends to focus on “How can Scheifele be the same as Perreault when he has almost 2x the points?”
But the answer to that question is pretty evident – GAR views Scheifele’s offense as being much more valuable than Perreault’s. Where Perreault gets his advantage is in his own end, and a bit on the PP. Is this fair?
I’ve already stated that I think there are issues with how GAR measures defence, but it is worth noting that Perreault’s CA60 Rel TM is 3.9 better than Scheifele’s.
Should that produce such a wide difference in GAR? Probably not, but it’s also not completely unreasonable.
On the PP, the Jets’ shot rate is a bit higher with Perreault on the ice and their goal rate is much higher, so maybe that’s not so far off either. So on the Scheifele vs. Perreault issue, there may be some statistical backing for the comparison if you dig into the details. (As an aside, I’m team Scheifele > Perreault, but I’d read the numbers as suggesting Perreault has more value than some might think.)
Moving on – next up is “Saad & Foligno don’t deserve to be ranked near the top of the league”. This is a bit of a curious argument to me. Columbus is currently 4th in the league, so I’d imagine at least some of their players should be good. Whether they are “long run Crosby-good” is irrelevant – GAR is measuring, to a degree, their contributions this year.
And this year, Columbus has been a very good team, presumably powered by a few very good player-seasons. Foligno and Saad are 1-2 in EV TOI, and both play significant minutes on a very good power play. It’s not at all unreasonable to think two of the top contributors to one of the best teams would be among the top 30 F in the NHL.
Third point: David lists off a series of players whose rankings he disagrees with. This, I think, is a bit lazy. Yes, there are going to be players whose rankings are clearly wrong, but that’s true of whatever metric you use. Again, much of the point of analytics is to challenge conventional thinking. It’s not designed to replace common sense, but rather to provide a means to challenge the assumptions that go into our traditional rankings.
As an example, Curtis McElhinney is playing out of his mind this year and his save % is ahead of Cam Talbot’s. No one would take him over Talbot.
But no one is going to discount save percentage because of this small inconsistency – there’s nuance involved.
Next point: What stats David uses to evaluate players. Here he provides a list of 17 stats he’d use for player evaluation. First, all of these stats/ideas get built into WAR (with the exception of Sv% Rel).
Second, this is probably too many stats for one person to combine reasonably in a consistent manner.
The only difference between David’s method and Dawson’s is the aggregation. Dawson uses an algorithm, David does it manually. Personally, I’d lean towards using the algorithm. I’ve contradicted myself within tweet threads before and don’t trust my brain to handle all that info the same way each time.
Dawson has provided evidence that the algorithm he’s designed is better at identifying talented players than many stats we have today. That’s much more difficult to provide for a personal subjective evaluation of individual players. The algorithm may not be perfect, but it does appear that in a lot of cases it’s pretty good.
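To make the consistency point concrete, here’s a toy version of algorithmic aggregation: standardize each stat, then apply the same fixed weights to every player. The stats, players, and weights are all invented for illustration and have nothing to do with Dawson’s actual model; the point is only that the same inputs always produce the same ranking, which is exactly what a manual evaluation can’t guarantee.

```python
import statistics

# Invented per-60 stats for three hypothetical players.
players = {
    "A": {"pts60": 2.1, "cf60_rel": 3.0, "xga60_rel": -1.0},
    "B": {"pts60": 1.4, "cf60_rel": 5.5, "xga60_rel": -4.0},
    "C": {"pts60": 2.8, "cf60_rel": -2.0, "xga60_rel": 2.5},
}
# Invented weights; negative weight because lower xGA against is better.
weights = {"pts60": 0.5, "cf60_rel": 0.3, "xga60_rel": -0.2}

def zscores(stat):
    """Standardize one stat across all players."""
    vals = [p[stat] for p in players.values()]
    mu, sd = statistics.mean(vals), statistics.stdev(vals)
    return {name: (p[stat] - mu) / sd for name, p in players.items()}

z = {stat: zscores(stat) for stat in weights}
score = {name: sum(w * z[stat][name] for stat, w in weights.items())
         for name in players}
ranking = sorted(score, key=score.get, reverse=True)
print(ranking)
```

Run it twice, or a thousand times, and the ranking never changes. A human combining 17 stats by eye has no such guarantee.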
Lastly, David’s final critique today dealt with the “missing” inputs into WAR.
He claims that GAR doesn’t appropriately account for on-ice shooting percentage – but Dawson’s model does! His expected goals model explicitly takes the shooter into account, and BPM includes goals and assists.
He claims that GAR doesn’t include on-ice save percentage – and it doesn’t, because skaters have no significant impact on it. This article that he offers as proof is a brilliant exercise in binning to falsely show correlation. And this article shows that Sv% Rel varies with TOI, which is correlation rather than causation. (As an aside, I do believe that players can impact shot quality against to a degree, but what happens after that is full of randomness.)
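To see how binning can manufacture an impressive-looking relationship, here’s a minimal sketch on fully invented data (it has nothing to do with the linked article’s actual numbers). Averaging within bins washes out the individual-level noise, so a tiny underlying effect produces a near-perfect correlation between bin means:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000
x = rng.normal(size=n)
y = 0.1 * x + rng.normal(size=n)  # true effect is tiny; mostly noise

# Individual-level correlation: weak, honest.
r_raw = np.corrcoef(x, y)[0, 1]

# Bin x into deciles and correlate the bin means: the noise averages
# out within each bin, so the relationship suddenly looks very strong.
edges = np.quantile(x, np.linspace(0, 1, 11))
idx = np.digitize(x, edges[1:-1])
bin_x = np.array([x[idx == i].mean() for i in range(10)])
bin_y = np.array([y[idx == i].mean() for i in range(10)])
r_binned = np.corrcoef(bin_x, bin_y)[0, 1]

print(round(r_raw, 2), round(r_binned, 2))
```

The binned plot looks like a tight linear relationship even though the player-level signal is nearly drowned in randomness, which is exactly the trap when someone shows you a chart of binned Sv% Rel.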
Finally, David claims that there’s an over-emphasis of shot metrics over goal metrics. I find this point most confusing.
Goals are a result, and there is some value in that result. But we know that factors in expected goals models explain some of those results. If you have two shots by equally talented shooters from the same spot and one goes in while the other doesn’t, why credit one more than the other?
GAR is meant to be descriptive, but not purely descriptive – there’s some value in crediting the process over the result. I view the Corsi-to-goal scale as a continuum: goals are one extreme, Corsi is the other, and Dawson’s xG sits between them. I think xG does a pretty good job of managing the balance between descriptive and predictive. There are factors it misses (pre-shot movement, screens, etc.), but it’s better than either extreme on its own.
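The continuum is easy to see with two identical shots, one of which happens to go in. Goal-based credit splits them apart; Corsi- and xG-based credit treats them alike (the 0.08 xG value here is an invented number for illustration):

```python
# Two identical chances from the same spot by equal shooters; one goes in.
shots = [
    {"player": "A", "xg": 0.08, "goal": True},
    {"player": "B", "xg": 0.08, "goal": False},
]

goal_credit = {s["player"]: (1.0 if s["goal"] else 0.0) for s in shots}
corsi_credit = {s["player"]: 1.0 for s in shots}   # every attempt counts equally
xg_credit = {s["player"]: s["xg"] for s in shots}  # credit the quality of the chance

print(goal_credit)   # only the scorer gets credit
print(xg_credit)     # identical chances, identical credit
```

Goals sit at one end (all credit to the result), Corsi at the other (all attempts equal), and xG splits the difference by crediting the quality of the process.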
I appreciate that David is trying to help push us forward and improve in the analytics community, but I don’t find the arguments that he’s made convincing. It is very important to know (even at a high level) the details of how a metric is calculated when making a critique, because it allows you to make useful suggestions for where the model may be going wrong.
Again, none of this is to say that GAR/WAR is the best metric we have available today or should be the only thing we use, but there are clearly strong arguments for using it to start a conversation, or as a sanity check on subjective evaluations. And the more productive discussions we have using models like GAR, the more clarity we’ll get on where its strengths and weaknesses as a metric lie, and the more avenues for other research we’ll open up.