Advanced Stats Primer: Or, what we know and how we know it

This is a post that I've been adding to over the course of the season. What you may not realize is that in October, I consciously made a decision to get to know about advanced stats in hockey and to see if I could learn anything from them. I wanted to see if they deepened my understanding of the game or distracted me.

I learned a lot over the season, not least of which is that this stuff is really hard for me. I am not a mathematical person by nature. Another thing I learned is that no one really explains this stuff very well to newbies, and that to some extent, that's by design. It's like there's a test you have to go through to be able to get your hands on this info, to prove you're worthy to have it and are smart and dedicated enough to use it well. And that test is that you go to the trouble of learning about it and finding it in all of the scattered corners of the interwebz.

I also learned that advanced stats are darned useful little things, as long as you don't get out beyond the evidence. These are stats that have changed the way we understand the game and they're going to change the way teams play the game, from coaching to strategies to drafting. That hasn't happened yet, but it's coming.

Anyway, since there's a bit of a lull right now in Lightning-land, I figured now would be as good a time as any to post some notes about advanced hockey stats, in the hopes that someone who, like me, is just trying to figure this stuff out, might get some use out of it.

1. What exactly are we talking about when we say "Advanced stats?"

Advanced stats include: Corsi in all its permutations (there are at least eight), Fenwick in all its permutations (there are at least seven), Zone Starts, Scoring Chances, PDO, and a couple of those generally uncommon goalie stats I talked about in March. There are probably others I've forgotten, but really, that's enough to start with, don't you think?

Not included: Goals, Assists, Points, Plus-Minus, Shots For, Shots Against, Hits, Takeaways, Giveaways, Blocked Shots, Missed Shots, PIMs, Save Percentage, Goals Against Average.

2. What do Advanced Stats measure?

They measure shots for and shots against (Corsi and Fenwick), faceoff locations (Zone Starts), the number of times play enters a specific area of the ice (Scoring Chances), and randomness (PDO). Goalie stats try to measure goaltender performance in various ways, generally relying on saves as a percentage of shots faced as their basis.

3. Why do hockey stats folks measure those things?

Because those are the things that are measurable right now. Stats people can take the raw data of the traditional stats, estimate the degree of reliability of the raw data, and analyze it to discover trends. They don't use Hits, Takeaways, or Giveaways because there is just an incredible amount of unreliability in the collection of that data, and garbage in >> garbage out.

And before you start yelling about how unreliable raw shot data is, consider that as bad as arena bias is in counting shots of all varieties, the bias is worse for the other stats. Stats people hate the existence of arena bias in shot counting, and they've been working to account for it in various ways. In the meantime, they're doing the best they can with what they have to work with.

4. How does each stat work?

The shot metrics are Corsi and Fenwick. They indicate something about puck possession and where play occurs. In general, because power play and penalty kill time varies so much from team to team and player to player, stats people like to use 5 on 5 stats instead of total stats.

Corsi: All shots for (on goal, missed, or blocked) while Player is on the ice minus all shots against while Player is on the ice. In other words, on-ice shot differential. [also called CorsiOn]

CorsiOff: The team's Corsi [all shots for-all shots against] while Player is not on the ice.

CorsiRel: Player's Corsi relative to his teammates. CorsiOn-CorsiOff. This controls for general team or line shot generation and helps you determine if a player's Corsi is actually high or low in relation to the rest of the team. In other words, it puts a guy's raw Corsi number in some context.
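
Here's a rough sketch of the arithmetic behind Corsi, CorsiOff, and CorsiRel as defined above. The class name, function name, and all the numbers are made up for illustration. Published versions are often expressed as a rate per 60 minutes so that on-ice and off-ice numbers are comparable; this sketch sticks to raw counts to match the definitions in this post.

```python
from dataclasses import dataclass


@dataclass
class ShotAttempts:
    """Shot attempts (on goal + missed + blocked) for and against.

    Hypothetical container for illustration only.
    """
    attempts_for: int
    attempts_against: int

    def corsi(self) -> int:
        # Corsi = all shot attempts for minus all shot attempts against.
        return self.attempts_for - self.attempts_against


def corsi_rel(on_ice: ShotAttempts, off_ice: ShotAttempts) -> int:
    # CorsiRel = CorsiOn - CorsiOff.
    return on_ice.corsi() - off_ice.corsi()


# Made-up totals for one player: team attempts while he is on the ice,
# and team attempts while he is on the bench.
on_ice = ShotAttempts(attempts_for=60, attempts_against=50)
off_ice = ShotAttempts(attempts_for=150, attempts_against=160)

print(on_ice.corsi())              # 10
print(off_ice.corsi())             # -10
print(corsi_rel(on_ice, off_ice))  # 20
```

In this made-up case the team is out-attempted when Player sits and out-attempts the opposition when he plays, so his CorsiRel is positive even though his raw Corsi looks modest.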

CorsiQoC: The sum of Corsi of all of Player's opponents, weighted for how much time Player is on the ice with them. This indicates who Player faces on a regular basis, and helps to determine if Player is facing shot-generating lines or "checking" lines.

CorsiQoT: The sum of the Corsi of all of Player's teammates, weighted for how much time Player is on the ice with them. This indicates who Player plays with on a regular basis, and helps to determine if Player is playing with shot-generating teammates.

CorsiRelQoC: The sum of the CorsiRel of all of Player's opponents, weighted for how much time Player is on the ice with them. This refines CorsiQoC the same way that CorsiRel refines raw Corsi (i.e., controls for team and line play of opponents).

CorsiRelQoT: The sum of the CorsiRel of all of Player's teammates, weighted for how much time Player is on the ice with them. Same thing. Helps control for team/line play.
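
The four "quality" stats above all share one shape: take a Corsi or CorsiRel number for each opponent or teammate and weight it by shared ice time. A minimal sketch of that weighting, assuming it works as a time-weighted average (the exact formula used by any particular stats site may differ), with a hypothetical function name and invented numbers:

```python
def weighted_quality(matchups):
    """Time-weighted average of other players' Corsi (or CorsiRel).

    matchups: list of (corsi_value, shared_toi_minutes) pairs, one per
    opponent (for QoC) or teammate (for QoT). Hypothetical helper.
    """
    total_toi = sum(toi for _, toi in matchups)
    if total_toi == 0:
        return 0.0  # no shared ice time, nothing to average
    return sum(value * toi for value, toi in matchups) / total_toi


# Made-up example: three opponents with their Corsi values and the
# minutes Player spent on the ice against each of them.
opponents = [(8.0, 30.0), (-4.0, 10.0), (0.0, 20.0)]
print(round(weighted_quality(opponents), 2))  # 3.33
```

Feed it opponents' Corsi and you get something like CorsiQoC; feed it teammates' CorsiRel and you get something like CorsiRelQoT.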

Fenwick: Shots on goal for plus missed shots for (blocked shots are left out), as a percentage of all such unblocked shots taken by both teams. Used more for teams than players. Often divided by game situation: Score-tied, 1-up, 1-down, 2-up, 2-down, 3-up, 3-down. This is because teams that are behind tend to shoot more than teams that are ahead, and the further behind they are the more pronounced this "score effect" is. It's so pronounced after falling 3 behind that everything goes out the window and there's no real point in separating it out any further.
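
To make the score-state split concrete, here's a small sketch that computes a Fenwick percentage for each game situation. It assumes the usual reading of Fenwick as unblocked shot attempts (on goal plus missed), and the event format, state labels, and numbers are all invented for the example.

```python
from collections import defaultdict

# Assumption: Fenwick counts only unblocked attempts.
UNBLOCKED = {"on_goal", "missed"}


def fenwick_pct_by_state(events):
    """Return {score_state: Fenwick percentage} for one team.

    events: iterable of (score_state, is_for, shot_type) tuples, where
    is_for is True for the team's own attempts and shot_type is one of
    "on_goal", "missed", or "blocked". Hypothetical format.
    """
    counts = defaultdict(lambda: [0, 0])  # state -> [for, against]
    for state, is_for, shot_type in events:
        if shot_type not in UNBLOCKED:
            continue  # blocked shots are ignored
        counts[state][0 if is_for else 1] += 1
    return {s: 100.0 * f / (f + a) for s, (f, a) in counts.items()}


# Made-up shot log.
events = [
    ("tied", True, "on_goal"),
    ("tied", True, "missed"),
    ("tied", False, "on_goal"),
    ("tied", True, "blocked"),   # dropped from Fenwick
    ("down_1", True, "on_goal"),
    ("down_1", True, "missed"),
    ("down_1", True, "on_goal"),
    ("down_1", False, "missed"),
]
result = fenwick_pct_by_state(events)
print(round(result["tied"], 1))  # 66.7
print(result["down_1"])          # 75.0
```

The higher number when trailing by one is the "score effect" in miniature: the same team looks better by Fenwick simply because it is behind and shooting more.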

Fenwick is also occasionally divided by period, as shooting tendencies change the further into a game one gets.

Fenwick measures tend to be more predictive of win-loss records for a season than Corsi measures, but Corsi is better for short-term analysis of puck possession, as it includes more events and so accounts for outliers (randomness) better. Just remember Fenwick = teams, long-term; Corsi = players and teams, shorter-term.

A Scoring Chance occurs when Player controls the puck in the slot area, where most goals are scored from. Teams have been tracking this for a while, but the data isn't available to the public. There is a Scoring Chance Project underway involving volunteers from about half the teams in the league who are taking the time to count scoring chances in every game played by their team (not the Lightning, as far as I know.) Once that data is collected, there'll be some initial analysis, and we'll see what we've got.

Zone Starts is a territorial measure, and indicates what situations Player's coach uses him in. It is the percentage of Player's non-neutral-zone shifts that begin in the offensive zone. [Offensive zone FOs / (Offensive zone FOs + Defensive zone FOs).] It ignores on-the-fly changes and neutral zone faceoffs. 45-55% is about average, and it ranges from about 25% to about 80%. The higher the number, the more Player's coach uses him in offensive situations. Very low numbers indicate that Player's coach uses him in largely defensive situations. There is also some effort to use this at a team level to see how well teams do at gaining a territorial advantage.

Zone Finishes is the percentage of Player's shifts that end in the offensive zone. Used exclusively in conjunction with Zone Starts. As a ratio or differential, it can indicate direction of play, and having a higher Zone Finish rate than Zone Start rate indicates that play tends to shift towards the offensive zone when Player is on the ice.
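
A quick sketch of the Zone Start and Zone Finish arithmetic. The function name and counts are made up, and I'm assuming Zone Finishes uses the same offensive-over-(offensive plus defensive) calculation as Zone Starts, with neutral-zone events left out.

```python
def zone_pct(offensive: int, defensive: int) -> float:
    """Offensive-zone share of non-neutral-zone events, as a percentage.

    Hypothetical helper: pass faceoff counts for Zone Starts, or
    shift-ending counts for Zone Finishes.
    """
    total = offensive + defensive
    if total == 0:
        raise ValueError("no offensive or defensive zone events to count")
    return 100.0 * offensive / total


# Made-up season totals for one player.
zone_start = zone_pct(offensive=120, defensive=180)
zone_finish = zone_pct(offensive=150, defensive=150)

print(zone_start)                # 40.0
print(zone_finish)               # 50.0
print(zone_finish - zone_start)  # 10.0
```

This imaginary player starts more shifts in his own end than the other team's, but finishes them evenly, which suggests play tends to move up the ice while he is out there.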

PDO is the team's shooting percentage plus the team's save percentage while Player is on the ice. Indicates degree of randomness. Regresses over a season towards 1 (or 100%, or 1000, depending on how it's written). High PDOs mean "good luck" is affecting the outcome; low PDOs mean "bad luck" is affecting the outcome. Player PDO has a much wider range than team PDO. Very accurate shooters and their linemates can sustain high PDOs longer; checking lines can sustain low PDOs longer.
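
The PDO sum is simple enough to sketch directly. The function name and inputs below are invented for illustration, and "shots" here means shots on goal, since that is what shooting and save percentages are based on.

```python
def pdo(goals_for: int, shots_for: int,
        goals_against: int, shots_against: int) -> float:
    """On-ice shooting percentage plus on-ice save percentage.

    Hypothetical helper; returns PDO on the "regresses towards 1" scale.
    """
    shooting_pct = goals_for / shots_for
    save_pct = 1 - goals_against / shots_against
    return shooting_pct + save_pct


# Made-up on-ice totals: 10 goals on 100 shots for,
# 8 goals allowed on 100 shots against.
value = pdo(goals_for=10, shots_for=100, goals_against=8, shots_against=100)
print(round(value, 3))         # 1.02
print(round(value * 1000))     # 1020
```

A 1.02 (or 1020) like this one sits a bit above 1, so by the reasoning above you'd expect some of this player's results to cool off over time.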

Some people have begun to look at an On/Off PDO to see if it yields any insights.

5. What can we conclude from advanced stats?

When used well, advanced stats can tell us something about underlying long-term trends, can help isolate individual performance from teammate performance, and can indicate something about how randomness is or is not affecting current performance.

They help us describe how much Player and his team are controlling the puck in ways that are simply not available to us through the eye test. They have also been used to suggest something about what sorts of on-ice shooting goes along with wins.

6. What can we not conclude from advanced stats?

No responsible stat-head is going to tell you that they can predict the future using advanced stats. Nor can they tell you the causes of wins using advanced stats. That's actually an uncomfortable grey area in the way stats people talk about their work right now. They know some correlations, but certainly not all of them. And the caveat that correlation does not equal causation often gets forgotten when these debates go on.

Right now, the impact of Hits, Takeaways, and Giveaways is unknown. The data is considered too unreliable to allow for solid conclusions to be drawn from it.

In addition, advanced stats cannot give teams and players much help in making shift-by-shift decisions. You really shouldn't use these shot metrics exclusively to decide your line combinations, because what happened over the course of a season is no indication of what will happen in this next game or on this next shift. As the field stands right now, these are really long-term tools.

What's long-term? More than a month, in almost all cases. Closer to the full six-month season or even longer. In other words, using a LOT of data (as in, multiple seasons' worth of data) can help you predict the outcome of a full season. It doesn't do much to predict the outcome of a single game, or even a month of games.

In general, I try to stay away from predictions and stick with evaluating past performance.

7. So how do we use advanced stats responsibly?

First of all, use them together, not alone. No single stat is sufficient by itself to draw conclusions from. If you want to evaluate Player's performance, you need to use enough different stats to understand the context within which he plays and how that will affect his numbers. You also need to know how other players are doing to know what's good, average, or bad.

Second, don't draw conclusions beyond what the data is designed to measure. Corsi does not equal performance. Neither does any other stat. In fact, puck possession doesn't equal performance.

Third, be aware of what you don't know. Don't dismiss something as being unimportant because it hasn't been adequately measured. Just say it's not known.

Fourth, be aware that randomness and luck play enormous roles in hockey, more so than in baseball or (American) football. Account for that in your conclusions.

Finally, don't be a dick. Be nice when arguing for or against the use of a stat or series of stats. There are enough mean people in the world and sometimes they all seem to be involved in this debate.

8. Where to go from here:

The following websites are good places to get data and/or articles about hockey stats written by people way smarter than I am.

behind the net, but you have to start with how to use timeonice (part II, here)

driving play

arctic ice hockey

brodeur is a fraud

broad street hockey

There are many, many more places to go, and as you explore, you'll find websites that you like better than others.

Anyway, I hope this kind of helps introduce these stats to those of you who haven't really dealt much with them before. If you have any questions, I'll do my best to answer them. If you spot an error I've made, let me know so I can fix it.