As the debate rages among football analysts over who should be the NFL MVP this season, our rcon14 makes the case that Rodgers is the clear choice, even amid discussions of WAR and total value added. Check out that piece here as a companion to this examination of one argument recently made against Rodgers’ case.
Pro Football Focus is both famous and infamous for their player grades. Many football analysts use them, especially for players without obvious objective counting stats like offensive linemen, but we don’t always feel good about it. PFF grades carry the veneer of scientific rigor and come from a source that provides data to all NFL teams, but what NFL teams get is a far cry from what the public sees, even through most levels of the PFF paywall.
We’ve all had our moments with PFF grading, from Aaron Rodgers’ 5-touchdown game against the Chiefs in 2015 to several oddly low grades for Kenny Clark throughout the season (though he’s shot up the charts lately), and so we view their player grades with at least some level of skepticism. Eric Eager may be doing great analytical work behind the scenes, but PFF is not, fundamentally, an analytics site as we make use of it in everyday life. It is a scouting site, which often provides useful data but is still subject to the limits and flaws of any scouting endeavor.
Today, PFF’s Steve Palazzolo wrote an article advocating for Tom Brady as the MVP over Aaron Rodgers. While I write for a Packer-focused website, I have no special affinity for Aaron Rodgers, and if there is a good statistical case for a different MVP, I’m willing to be persuaded by it. Tom Brady is likely the greatest quarterback in the history of the league, and Brady being in the MVP mix is hardly a novel take.
This article also prominently features PFF WAR, a statistic that PFF has been working on seemingly forever. Their methodology seems generally sound, and a better understanding of positional value in football is certainly worthwhile. That said, I have some issues with their WAR construction. Before we get to that, however, let’s explain what WAR is.
What is WAR and why is it important?
WAR (Wins Above Replacement) is originally a baseball statistic and, more accurately, a framework for aggregating various baseball statistics into a single number that gives an approximate value to that player in terms of Wins. Importantly, WAR judges players against a hypothetical “replacement player,” usually defined as the value that would be provided by a freely available player at that position. There are three major versions of WAR (Baseball Prospectus’ WARP, Fangraphs’ fWAR, and Baseball Reference’s bWAR), though they all use roughly the same framework. (In the interest of full disclosure, I used to write for Baseball Prospectus covering the Brewers.)
The three sites in question all do great work measuring the various aspects of baseball, from obvious skills like hitting and pitching to runs generated and lost by baserunning to more esoteric things like pitch framing, and they all use their own inputs within their WAR calculations. This occasionally leads to differing opinions on player value between the sites. In the highest-profile example of this offseason, Baseball Reference had Cincinnati Reds (now Chicago Cubs) pitcher Wade Miley as a 5.6 win player, while Fangraphs saw him as only a 2.9 WAR player, and Baseball Prospectus had him as a barely over replacement 0.4 WARP.
So, how can three sites, all using a similar system and fairly rigorous statistical inputs, vary so much? There are two major reasons the systems disagree on a given player. First, in the case of Miley, it comes down to actual on-field performance versus projected on-field performance. It’s hard to say that Miley had anything other than a good year for the Reds based on his conventional baseball stats. He threw 163 innings with a 3.37 ERA, which provides quite a bit of value, and that is what Baseball Reference sees and gives credit for. Fangraphs and Prospectus, to varying degrees, do some work to control for certain factors like the defense playing behind Miley and luck on flyballs, in addition to more nuanced and more complicated adjustments. They focus more on the aspects of the game that pitchers directly control, like their strikeouts, walks, and home runs. They also adjust for myriad other factors, and so instead of seeing Miley’s good season, they instead make the case that in a vacuum, (A) he was not as valuable as his results would indicate, (B) going forward, you can expect him to be worse, and (C) that other players on his team may have contributed to run prevention more. Miley especially suffers for not striking out a ton of batters (under 7 per 9 innings last year) and allowing 17 home runs.
The other big differentiator between the WARs is defense. It is still difficult to measure individual contributions to defense in baseball. Each site uses different defensive models, and they all vary to some extent. I won’t get into the esoterica here, except to say that we are extremely good at measuring offensive baseball and merely good at measuring defensive baseball. Defense is a component of WAR, and large WAR differences between position players among the three sites are almost always due to how defense is measured. In a sense, a player’s WAR will always be a “less accurate” number than a metric that strictly measures offense, like wOBA, because of the inclusion of useful but more questionable defensive metrics.
Of the three, no site is “correct,” and all of their approaches have value. Helpfully, all have in-depth explainers on how their WAR is calculated, and baseball writers have spilled tons of ink comparing the WARs, explaining the WARs, and improving the WARs. They provide an excellent glimpse into the value of an individual baseball player, but even with each site’s sophistication and rigor in creating their various WARs, any representative from any site will still tell you that WAR is just a starting point, and that there are nuances to each player that may create or deplete value without being captured by the metric. Fortunately, because WAR and its inputs are available and transparent, it is easy to check on what makes up the WAR for each player and to pick out or analyze potential issues at a more granular level. For more on the various WARs, check out each site, or this helpful article on Wikipedia.
Back to Football
Baseball has been working on WAR, literally, for decades, and while it’s quite precise, it’s still not perfect and never will be. Baseball, compared to football, is much easier to track, as it’s composed of discrete events, with few interacting parts, that are relatively simple to quantify.
The WAR concept has always struck me as a poor fit for football for several reasons, the biggest of which is that the inputs are so much harder to quantify. For example, assessing the value that an offensive lineman created by properly blocking on a given running play is extremely difficult. You are already starting with an extremely low-value play, and with 22 players on the field, you will be dealing with extremely small numbers that demand extreme precision. Blocking, in terms of the value it provides, is also highly subjective. Let’s say that a tackle executes a perfect block on the backside of a run, but, because the linebacker he blocked stumbled or got off to a slow start, or because the running back was just extremely fast and got around the edge without issue, the block didn’t matter. What do you do with that?
It’s not a small question. Remember the Miley example above. In terms of actual contribution to the play, that lineman contributed nothing. He could have blown his block and the play would have been unchanged. In that sense, he deserves no credit. But that hardly seems fair. He executed a perfect block, and on most occasions the linebacker doesn’t stumble, or the running back is slightly impeded getting around the edge. Most of the time the block matters. So what do you do? Do you assign credit based on the outcome? Or based on what the lineman controls? Neither is right or wrong, but it’s important to know how that value is assigned so that we know what we’re talking about.
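The two credit philosophies can be made concrete with a toy example. The rules and every number here are our own invention, purely to illustrate the choice; no grading site works exactly this way:

```python
# Two hypothetical credit rules for the "perfect block that didn't matter."
# The rules and numbers are our own illustration, not any site's method.

PLAY_VALUE = 0.02  # tiny hypothetical per-play win value

def outcome_credit(executed_well, affected_play, play_value=PLAY_VALUE):
    # Outcome-based: credit only if the block actually changed the play.
    # Note that execution quality is ignored entirely.
    return play_value if affected_play else 0.0

def process_credit(executed_well, matters_rate, play_value=PLAY_VALUE):
    # Process-based: a well-executed block earns the *expected* value,
    # i.e. play value scaled by how often a block like this one matters.
    return play_value * matters_rate if executed_well else 0.0

# Perfect backside block, but the linebacker stumbled and it didn't matter:
print(outcome_credit(True, False))          # 0.0
print(round(process_credit(True, 0.8), 3))  # 0.016
```

The outcome rule pays the lineman nothing; the process rule pays him most of the play’s value because blocks like his usually matter. Neither answer is wrong, but the two rules produce different season-long totals for the same film.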
Football is also different from baseball in having extremely skewed positional value. In baseball, every player is, generally speaking, capable of providing about as much value as any other player, be they a pitcher, shortstop, catcher, or first baseman. The only real exception would be relief pitchers, and that is simply a consequence of not playing as much. WAR is a counting stat, and volume matters. Football is completely different. The value of the quarterback swamps everything else, and the rest of the team is essentially battling over win-scraps. That asymmetry has real consequences: WAR, especially in football, lacks the precision to be trusted with such small numbers. While it may be relatively easy to assign value to quarterbacks, the values for the rest of the team are just going to be noise.
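A quick toy calculation (our own made-up numbers, not anyone’s real model) shows why those win-scraps are a problem:

```python
# A toy illustration (our numbers, nobody's real model) of why tiny
# win-shares are fragile. Suppose every grade carries about half a win
# of measurement error, regardless of position.
ERROR = 0.5  # hypothetical measurement error, in wins

qb_war = 4.0     # hypothetical quarterback value
guard_war = 0.3  # hypothetical guard value

qb_range = (qb_war - ERROR, qb_war + ERROR)           # (3.5, 4.5)
guard_range = (guard_war - ERROR, guard_war + ERROR)  # (-0.2, 0.8)

# The QB is clearly valuable either way; the guard's estimate can't even
# settle on whether he's above or below replacement.
print(qb_range, guard_range)
```

The same amount of noise that barely dents the quarterback’s number completely swallows the guard’s.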
Pro Football Focus does, at least, give a general overview of their WAR framework, which leads to a different, troubling issue:
Broadly, the PFF WAR model does these things, in order:
Determine how good a given player was during a period of time (generally a season) using PFF grades;
Map a player’s production to a “wins” value for his team using the relative importance of each facet of play;
Simulate a team’s expected performance with a player of interest and with an average player participating identically in his place. Take the difference in expected wins (e.g., Wins Above Average);
Determine the average player with a given participation profile’s wins above replacement player, assuming a team of replacement-level players is a 3-13 team;
Add the terms in the last two calculations to get that player’s WAR.
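As a rough illustration, the five steps above can be sketched in a few lines of code. To be clear, everything here (the function names, the participation share, the example numbers) is our own hedged guess at the shape of the calculation, not PFF’s actual model:

```python
# A toy sketch of the WAR framework PFF describes; all names and
# numbers below are our own illustrative assumptions, not PFF's model.

REPLACEMENT_TEAM_WINS = 3.0  # PFF assumes an all-replacement team goes 3-13

def wins_above_average(wins_with_player, wins_with_average_player):
    """Step 3: difference in simulated expected wins (WAA)."""
    return wins_with_player - wins_with_average_player

def average_over_replacement(avg_team_wins, participation_share):
    """Step 4: the average player's wins over replacement, scaled by
    how much of the team's play he participates in (our assumption)."""
    return (avg_team_wins - REPLACEMENT_TEAM_WINS) * participation_share

def war(wins_with_player, wins_with_average_player, avg_team_wins, share):
    """Step 5: WAR = WAA + the average player's value over replacement."""
    return (wins_above_average(wins_with_player, wins_with_average_player)
            + average_over_replacement(avg_team_wins, share))

# Hypothetical QB: his team simulates to 10 wins with him and 8 with an
# average QB, on an 8-win league-average baseline, with a 0.3 share.
print(war(10.0, 8.0, 8.0, 0.3))  # 3.5
```

Notice that every input to steps 2 through 5 is downstream of step 1’s grades, which is exactly where the trouble starts.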
You can read more about PFF’s WAR calculation here, and they’ve obviously put a lot of time, thought and energy into it.
That said, this still seems highly problematic, and that is because their WAR leans so heavily on their own grades. In baseball, most of the inputs to WAR are objective measurements of on-field events. Even the famously wobbly defensive metrics of baseball are objective measurements of actual events. PFF grades are extremely subjective, as previously mentioned, so adding up and manipulating subjective grading into some kind of derivation of wins seems like piling potential error on potential error. PFF’s goals in their grading may be noble, as described here:
While there are several advanced metrics that have come to the forefront and improved our understanding of football in recent years, PFF grade remains the best at isolating individual performance, particularly at quarterback.
Stats like EPA (expected points added), ESPN’s Total QBR, or CPOE (completion percentage over expectation) all have their value, but they also rely heavily upon the quarterback’s supporting cast. From a receiver catching the ball to the defense dropping turnover-worthy plays, not every play receives the proper context, as the PFF grade seeks to isolate the quarterback from his supporting cast and the opposing defense.
But the adjustments used to isolate performance involve a ton of subjective judgment, and if they are wrong, they can heavily skew things. For example, this season, PFF asserts that Aaron Rodgers has made turnover-worthy plays 2% of the time, while Brady has done so 1.9% of the time. They claim, based on film study, that Brady has had 14 turnover-worthy plays on the year on 743 dropbacks while Rodgers has 12 turnover-worthy plays on 587 dropbacks. They have a few clips in support, and please, read it yourself. The problem with this kind of “hypothetical turnover” analysis is that it goes against years of outcome-based data. Rodgers has led the league in interception percentage for four consecutive seasons, and for his career has an INT% of 1.3%. Over the last four seasons, Rodgers’ highest INT% is 1.0%. Over that same time period, Brady’s lowest is 1.3%, and over his career, Brady’s INT% is 1.8%. Their fumbles are about the same in recent years, by the way.
If your subjective film study concludes that Rodgers is secretly more turnover-prone than Brady, may I suggest that you are missing something, or getting something wrong, because the outcomes for Rodgers are quite consistent on this front. And interceptions have a HUGE impact on value. Adding interceptions to Rodgers while removing them from Brady isn’t some small adjustment to overall value, and based on Rodgers’ history with interceptions, it isn’t warranted.
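For what it’s worth, the turnover-worthy rates quoted above do check out against the raw counts; here’s the arithmetic, using only the figures from the article:

```python
# Sanity-checking PFF's quoted turnover-worthy-play (TWP) rates against
# the raw counts given in the article.
brady_twp, brady_dropbacks = 14, 743
rodgers_twp, rodgers_dropbacks = 12, 587

brady_rate = 100 * brady_twp / brady_dropbacks  # percent of dropbacks
rodgers_rate = 100 * rodgers_twp / rodgers_dropbacks

print(round(brady_rate, 1), round(rodgers_rate, 1))  # 1.9 2.0
```

The division matches PFF’s stated 1.9% and 2.0%, so the dispute isn’t with the arithmetic; it’s with the subjective judgment of which plays count as “turnover-worthy” in the first place.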
Baseball has also taken great care in determining what exactly constitutes “replacement level.” Remember from above that PFF has set the win total of an all-replacement team at 3-13.
That is FAR too optimistic a win total for a team that is going to be outclassed at every position against every opponent, and I wonder whether setting a reasonable replacement value for the non-QB positions is even possible given how small the win-shares for those positions are.
Taking an imprecise grade, applying an imprecise positional valuation, and converting it all to a WAR scale seems highly problematic. It becomes even more so when attempting to correct for interceptable passes, receiver drops, and the like, especially when not also adjusting for other factors, like makeshift offensive lines, that are harder to quantify but still significant.
In any given baseball MVP discussion, sabermetricians will always bring up WAR, but when two players are separated by 1 WAR or less, you will rarely hear anyone argue that one of those players is a clear-cut MVP. We understand there is enough noise in the numbers to allow for some good-natured debate on the subject. Palazzolo brings up several good points regarding the volume difference between Brady and Rodgers, and some early Rodgers struggles, but the inclusion of WAR is an attempt to add scientific-sounding, objective heft to their case, when it’s really just a dressed-up, less accurate version of their grades.