
More on WAR

For a few years now, Bill James has had a problem with WAR. He has mostly stayed quiet on this because, well, he knows that he’s Bill James. He remembers how the people who held the power in baseball punched down hard on him as a young analyst. He has some power now, being a legend and one of Time Magazine’s most influential people and the Godfather of Moneyball and a three-time World Series winner with the Boston Red Sox. He does not want to punch down hard at the young analysts today. He absolutely wants to encourage people to advance baseball thought.

But, like I say, he has a real problem with WAR. And Thursday night, armed with strong feelings about the Jose Altuve-Aaron Judge MVP race, Bill let it rip.

Now, that article is perfectly accessible — the most underrated part of Bill is that he really is a wonderful writer — so there’s no need for me to explain it here. But I want to make a couple of points about it, points that have been bothering me for a long time, and so I will explain what I see as Bill’s biggest beef with WAR and then get to my own thing.

When Bill and I have discussed “Wins Above Replacement” the last few years, Bill has made clear that the problem he has with WAR is that it is not nearly as complex or elegant a statistic as he had assumed. He figured that because we have so much more data to work with today, and the new analysts are so much more proficient at working with that data, the new systems would be mind-blowing in their depth and breadth.

I’ll include this quote from my story “Vanguard After the Revolution.”

“My math skills are limited and my data-processing skills are essentially nonexistent. The younger guys are way, way beyond me in those areas. I’m fine with that, and I don’t struggle against it, and I hope that I don’t deny them credit for what they can do that I can’t.

“But because that is true, I ASSUMED that these were complex, nuanced, sophisticated systems. I never really looked; I just assumed that the details were out of my depth. But sometime in the last year I was doing some research that relied on these WAR systems, so I took a look at them, and … they’re not very impressive. They’re not well thought through; they haven’t made a convincing effort to address many of the inherent difficulties that the undertaking presents. They tend to get so far into the data, throw up their arms and make a wild guess. I don’t know if I’m going to get the time to do better of it, or if it will be left to others, but … we’re not at anything like an end point here. I assumed that these systems were a lot better than they actually are.”

There was some backlash when Bill said that five or so years ago, but even after the backlash Bill still wasn’t ready to go into detail. He now has, and his big complaint (I hope I’m summing this up effectively) is that WAR does not connect directly to wins. The name “Wins Above Replacement” suggests that it is connected to wins, but it is, in fact, connected to RUNS. The wins part is an afterthought.

Many of us have known that WAR’s connection to actual wins is tenuous, but we never thought much about it. And now that Bill put words to that thought … it’s actually kind of jarring.

Look: Baseball Reference WAR and Fangraphs WAR go to great care figuring out how many runs a player is worth. They calculate (in different ways) what a positional player’s value is as a hitter, as a base runner, as a fielder. They make a positional adjustment because, as mentioned here a couple of days ago, some positions are more valuable than others. They make ballpark adjustments. They make a league-wide adjustment, based on the run-scoring atmosphere of the league (1968 being very different from 1999, for example). Pitchers have their value translated to runs; Fangraphs and Baseball Reference take very different routes to the same goal of separating a pitcher from his defense. Then, yes, they again adjust for ballparks and the run-scoring atmosphere of the season.

This all takes a great deal of calculation and thought and bold viewpoints. WAR is a wonderful formula in so many ways. And when the calculations are done, we are left with a number of runs a player/pitcher is worth, a number that can then be compared with the run value of a replacement player.

And after all this very intense math, how do they get from RAR (Runs Above Replacement) to WAR (Wins Above Replacement)?

They basically just divide the total by 10.

Yep, that’s pretty much it. Well, it is a bit more complicated than that — “You simply take that sum and divide it by the runs per win value of that season to find WAR,” Fangraphs explains — but really, yeah, you mainly just divide by 10.

Aaron Judge (Fangraphs): 82.9 Runs Above Replacement, 8.2 WAR.

Jose Altuve (Fangraphs): 75.4 Runs Above Replacement, 7.5 WAR.

Joey Votto (Baseball Reference): 77 Runs Above Replacement, 7.5 WAR

Giancarlo Stanton (Baseball Reference):  78 Runs Above Replacement, 7.6 WAR

I think this is what Bill meant when he said, “They tend to get so far into the data, throw up their arms and make a wild guess.”  Both WAR systems work so hard to determine how many RUNS a player is worth. And then, after that, the work is pretty well done. “If you had to pick one number over the history of baseball to convert runs into wins,” Baseball Reference writes, “it would be 10.”
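To see how close to 10 the conversion really is, here is a minimal sketch that backs out the runs-per-win figure implied by each of the four player lines above. The function name is just illustrative, and the division only recovers an approximate value because the published WAR numbers are rounded.

```python
# (runs above replacement, WAR) pairs exactly as quoted in the post
players = {
    "Aaron Judge (Fangraphs)": (82.9, 8.2),
    "Jose Altuve (Fangraphs)": (75.4, 7.5),
    "Joey Votto (Baseball Reference)": (77.0, 7.5),
    "Giancarlo Stanton (Baseball Reference)": (78.0, 7.6),
}


def implied_runs_per_win(rar, war):
    """Runs-per-win value implied by a published RAR/WAR pair (hypothetical helper)."""
    return rar / war


for name, (rar, war) in players.items():
    print(f"{name}: about {implied_runs_per_win(rar, war):.2f} runs per win")
```

Every one of the four comes out a little above 10, which is the point: once the run total is settled, the last step is a single division.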

What’s wrong with just dividing the runs by 10? Isn’t 10 runs about what a win is worth? Yes, I believe it is in a very general way. But this gets me to something that has frustrated me for years now but I’ve never had the words to explain my gripe. Let’s see if we can find the words here.

Let’s begin by using Baseball Reference to compare the Houston Astros and New York Yankees.

The Houston Astros players, added together, are worth 53.2 wins above replacement. The position players are worth 39.8 WAR; the pitchers are worth 13.4 WAR. The Astros won 101 games in 2017, so this suggests a team of replacement players would win 48 games — 101 minus 53. That’s reasonable.

The New York Yankees players, added together, are worth, hey, what do you know, 53.2 wins above replacement. Amazing! The Yankees’ split is different, though: 29.5 WAR for position players, 23.7 WAR for pitchers, but it adds up to exactly the same WAR as the Astros.

But the Yankees won only 91 games in 2017. So again, doing the math, 91 minus 53, huh, the Yankees replacement team only wins 38 games. This is not reasonable. Why are the Yankees replacement players so much worse than the Astros replacement players?*

*If you want to do something similar with Fangraphs, you can look at the Yankees and Diamondbacks. The Yankees won 91 games and were 43 wins above replacement, meaning a replacement team would win 48 games. Arizona won 93 games but were just 34 wins above replacement, meaning their replacement team would win 59 games.
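The replacement-team arithmetic above is simple enough to check in a few lines. This is only a sketch using the win totals and team WAR figures quoted in this post; the function name is made up for illustration.

```python
def replacement_team_wins(actual_wins, team_war):
    """Wins left over once a team's total WAR is subtracted from its actual wins."""
    return actual_wins - team_war


# (actual 2017 wins, team WAR) as quoted in the post
teams = {
    "Astros (Baseball Reference)": (101, 53.2),
    "Yankees (Baseball Reference)": (91, 53.2),
    "Yankees (Fangraphs)": (91, 43),
    "Diamondbacks (Fangraphs)": (93, 34),
}

for name, (wins, war) in teams.items():
    print(f"{name}: replacement team wins about {replacement_team_wins(wins, war):.0f}")
```

If WAR were tied directly to actual wins, every team's leftover figure would land near the same replacement baseline. Here the gap is 10 wins in one system and 11 in the other.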

The answer, as Bill explains, is that WAR does not have anything to do with actual wins. It is about runs. The Yankees’ expected record, their Pythagorean record, based on how many runs they scored and allowed, is 100-62. The Astros’ expected record, based on how many runs they scored and allowed, is 99-63. By runs, they were the same team. And so they have the same WAR.
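For readers who have not seen the Pythagorean record computed, here is a rough sketch. Two things in it are assumptions rather than anything stated in this post: the exponent of 1.83 (a commonly used variant of the formula) and the 2017 run totals, which are filled in from memory and should be checked against a reference site before anyone relies on them.

```python
def pythagorean_wins(runs_scored, runs_allowed, games=162, exponent=1.83):
    """Expected wins from runs scored and allowed (Pythagorean expectation).

    The 1.83 exponent is an assumption; some versions of the formula use 2.
    """
    rs = runs_scored ** exponent
    ra = runs_allowed ** exponent
    return games * rs / (rs + ra)


# Assumed 2017 totals: (runs scored, runs allowed, actual wins).
# The run totals are NOT given in the post; verify them independently.
teams = {
    "Yankees": (858, 660, 91),
    "Astros": (896, 700, 101),
}

for name, (scored, allowed, actual) in teams.items():
    expected = pythagorean_wins(scored, allowed)
    print(f"{name}: expected {expected:.1f} wins, actual {actual}, "
          f"difference {actual - expected:+.1f}")
```

Under those assumptions the formula lands on roughly 100 expected wins for the Yankees and 99 for the Astros, in line with the records quoted above, while the actual win totals sit about 9 below and 2 above those marks.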

But they were NOT the same team. Why don’t the Astros players have more WAR when they so clearly won more games?

This gets to the heart of my longstanding uneasiness with some of the advanced statistical thinking: I sometimes have wondered if maybe we’re so busy adjusting some stuff and dismissing other stuff as luck that we might be straying too far from what’s actually happening on the field. If we can adjust for the fact that Yankee Stadium was a great hitters park and Minute Maid was a great pitchers park, how can we not adjust for the fact that the Astros won 10 more games than the Yankees? How can we not find those 10 wins in our analysis?

A few years ago, Bill James came up with his Win Shares system, and a lot of people didn’t like it for various reasons. I don’t like ever quoting Wikipedia, but in this case I think they do a nice job expressing one of the bigger complaints about Win Shares:

“One criticism of this metric is that players who play for teams that win more games than expected, based on the Pythagorean expectation, receive more win shares than players whose team wins fewer games than expected. Since a team exceeding or falling short of its Pythagorean expectation is generally acknowledged as chance, some believe that credit should not be assigned purely based on team wins.”

There it is: Is a team winning or losing more games than expectation “chance?” I’ve always thought that’s mostly true, but I will just say: It’s a copout to just stop there. The object of baseball is to win games. Scoring runs, preventing runs, that’s all well and good. But the object is to win. Are we really ready to concede here, ready to just throw away X number of wins every year without a fight?

And even if we believe that the fight is over, even if we believe that those extra wins are chance, how can we not include chance in our stats? Look, in the end EVERYTHING IN SPORTS AND LIFE has some chance involved. We would love to adjust chance out of our baseball stats, but at some point we are altering what really happened. Maybe the Yankees “should” have won 100 games. But they did not. And to give Aaron Judge 8.2/7.2 WAR on the assumption that they did isn’t good enough. We have to be better than this.

This is especially true in this specific situation because Judge was not the same player in high leverage situations as he was the rest of the time. Again, maybe that’s chance, but it’s reality. He hit .215/.380/.380 in late and close situations, exactly the situations where the Yankees underperformed in 2017. If you want to compare him to Altuve, it seems ridiculous not to point out that Altuve hit .441/.529/.661 in late and close situations. It seems ridiculous to not give Judge ANY of the culpability for the Yankees not winning as many games as the runs scored/allowed suggest they should have won. It seems ridiculous not to give Altuve any credit for the Astros outperforming their expectation.

Over the next couple of days, I’ll delve into Tom Tango’s fascinating reimagining of WAR, something he calls “The Indis.” But for now, let’s just say that I am glad Bill made his thoughts clear on what he believes is wrong with WAR. If someone would like to make their case for why WAR should not be attached to wins, I’m happy to post that here … but I have to say that until such a compelling argument is made, I think Bill is right.








62 Responses to More on WAR

  1. Rob says:

    For me, I think each stat should have one of two goals: either it is explanatory or it is predictive. In the former, the stat should be used to explain the results that actually occurred, while in the latter it is attempting to measure the underlying true skill shown by the player regardless of what the result was. The explanatory stats should include things like luck and be used for measuring how good a player was. I think a problem with WAR, and the root of what I think you are describing, is that it tries to remove luck. But it is impossible to completely remove it, and why should you do that when you are measuring what has already happened?

  2. LK says:

    I have to say, when I read that Bill James had posted his critique of WAR, I sort of expected something more profound.

    To me, this is a relatively simple situation in which people are asking two different questions.

    1. How many wins was Aaron Judge worth to the 2017 Yankees?
    2. If you were starting a team from scratch, made up of replacement players, and added 2017 Aaron Judge to it, how many wins would you expect him to add to your team?

    WAR, as currently constituted, attempts to answer question #2. Bill’s critique, as I see it, is that WAR isn’t answering question #1.

    If you want to know the answer to question #1, it’s incredibly important that Aaron Judge did not perform well in clutch situations for the 2017 Yankees. But for question #2, you can’t know for your hypothetical team which of Aaron Judge’s plate appearances will turn out to be in clutch situations. Implicit in this is an assumption, pointed out by Bill, that Aaron Judge’s lack of clutch in 2017 is a result of chance and not skill; this seems like a reasonable assumption to me in answering question #2, though I can appreciate why not everyone would automatically agree with it.

    For me, both of these questions are interesting and worth answering, though obviously question #2 is more abstract. It can simultaneously be true that:

    -Jose Altuve provided more value to the 2017 Astros than Aaron Judge did to the 2017 Yankees [Bill’s point]
    -If I were starting a team from scratch, I would want 2017 Aaron Judge more than 2017 Jose Altuve [Fangraphs WAR’s point]

    From my perspective, Bill has not identified a flaw in WAR as such, he has pointed out that it doesn’t answer a question it wasn’t designed for. Maybe he thinks question #2 is dumb and any effort spent on it has been wasted and should’ve been put toward question #1. Maybe he thinks the designers of WAR were attempting to answer question #1 and have completely botched it. Personally, I feel pretty confident that the Fangraphs and B-R crews set out to answer question #2 – and WAR does a pretty good job of that, in my opinion.

    • meh says:

      I think specifically in this context, Bill James isn’t suggesting whether question 1 or 2 is more important. But rather, if we’re going to give the MVP to a player for his performance in 2017, we should be asking question 1 and not question 2.

    • Cooper Nielson says:

      I don’t disagree with you, LK, but I think part of the problem is that a lot of people *do* now use WAR to answer question #1 — even actual MVP voters. WAR is such a tremendous concept that many people think it means more than its creators ever intended it to mean.

    • Stephen says:

      Excellent points, LK. Like you, I read James’s critique and was kind of underwhelmed.

      I completely agree with you that there are two separate questions, and that WAR is better equipped to answer the question of “which year would be better in a vacuum” than it is to answer the question of “Who actually had the better season?” Like you, I think that’s deliberate. I always think of WAR as being designed to look more at probability and prediction than at the details of a particular year’s performance. I don’t have a problem with that.

      The big thing, though, is that any statistic should be used as a jumping-off point, not as the be-all and end-all of “who was better.” WAR says, Altuve and Judge provided essentially the same value this year. We shouldn’t accept that this makes the two players actually equal in value, any more than we should accept that a pitcher with a 2.68 ERA was necessarily better than a pitcher with a 3.02 ERA. We start with stats, and then we look to see where those stats seem accurate and where those stats mislead.

      In this case, we need to look at the question of clutch hitting, we need to look at the question of whether the Yankees are being credited with “too many” wins…these are legit questions. To the extent that people don’t ask them, it is not the fault of the folks who came up with WAR (except possibly in overselling it); it’s the fault of people who don’t think critically about stats and just shout BUT (INSERT NAME) HAD A WAR TOTAL ONE POINT HIGHER THAN (INSERT NAME) SO OF COURSE HE WAS BETTER!

      And not so incidentally, it seems that in the voting, people did that, and pretty successfully too. So I’m not really sure why James seems so very angry about this.

      For the record, I would’ve voted for Altuve.

      • invitro says:

        “So I’m not really sure why James seems so very angry about this.” — I don’t think he seems all that angry in the article, but anyway… I think there’s a valid reason for being upset that Judge was as high as #2, when his “clutch” performance was -so- bad in 2017.

    • “. . . Implicit in this is an assumption, pointed out by Bill, that Aaron Judge’s lack of clutch in 2017 is a result of chance and not skill . . .”

      Actually James specifically did not say this. I think he tentatively suggested 70% chance and 30% skill, but mostly he’s saying we still don’t know, even though we thought we knew.

      See “Underestimating the Fog” from 2004 to see that James has had some issues with the direction of sabermetric assumptions in this regard for some time now.

      • LK says:

        I was trying to say that Bill pointed out that WAR assumes clutch (or lack thereof) is due to chance, not that Bill himself agrees with that assumption.

  3. invitro says:

    I used to talk a lot on this site about bb-ref’s “Clutch” statistic. You find it on the Advanced Batting page for a player, in the Win Probability table. It’s the difference between the context-dependent WPA and the context-independent WPA. It measures how well a batter did in clutch situations (roughly); it doesn’t include fielding since (I suppose) there’s no measure of clutch fielding yet (though there really should be).

    I think WAR + Clutch is a very good stat for measuring MVP candidates. I used this to argue here for Donaldson for MVP in 2013. He had 7.7 WAR in 2013, 3rd among AL MVP candidates, behind Trout’s 9.3 WAR and Cano’s 7.8 WAR. But Donaldson had a +1.3 Clutch, Trout had a -2.4 Clutch, and Cano had +0.9. So Donaldson had a 9.0 WAR+Clutch, Cano had 8.7, and Trout had only 6.9. I still think Donaldson was the 2013 AL MVP. (The actual winner, Miguel Cabrera, had 7.3-0.2=7.1 WAR+Clutch, well behind Donaldson & Cano).

    For 2017, Altuve had 8.3-0.6=7.7 WAR+Clutch, and Judge had 8.1-3.3=4.8 WAR+Clutch. (That -3.3 seems extremely low; I wonder how often a player does that.) So I probably wouldn’t even consider Judge for the 2017 AL MVP. (I would consider Kluber, and look at the other WAR leaders’ Clutch.)

    I think it’s very important to have both context-independent and context-dependent measures, and use them appropriately. Bill’s argument is basically that using the -independent measure is wrong when picking the MVP; who can argue with that?

    • Bikechess says:

      “I think it’s very important to have both context-independent and context-dependent measures, and use them appropriately. Bill’s argument is basically that using the -independent measure is wrong when picking the MVP; who can argue with that?”

      This is well stated!

    • Gary says:

      I am not familiar with bb-ref’s “Clutch” statistic or its derivation. But I find it curious that Altuve in 2017 had a negative “Clutch” statistic (“Altuve had 8.3-0.6=7.7 WAR+Clutch”) while at the same time, per Joe, “Altuve hit .441/.529/.661 in late and close situations”. They are not the same statistic but it is interesting how little they seem to overlap.

      • invitro says:

        I suppose there are a lot more high-leverage situations than “late and close”. I’d be very surprised if the Clutch number wasn’t a far better measure of clutch hitting than just late and close averages.

  4. JayJay says:

    Great comments by all –

    With regard to Judge, I felt all season long that he wasn’t as valuable as his numbers implied because of his inability/failure to produce in high-leverage situations, most of which I’d actually watched. I felt that he had an especially tough time against tough righties or good relievers in general (so I don’t think the failure is necessarily a result of chance).

    But I would never use the case of 2017 Aaron Judge as a general argument against non-win-dependent WAR. Judge and the 2017 Yankees were an extreme case of a team whose best hitter (and I think second-best hitter – Sanchez) was truly awful in the clutch and whose relievers blew a high number of games. I don’t think extreme cases should be used as an argument for wins and against runs, which remain the best predictor of future production.

    I never understood Bill James’ fixation with wins and I still don’t.

    That said, it would be nice if individual statistics took into account things like WPA. It appears that the awards voters already do…

  5. DjangoZ says:

    I enjoy angry Bill James, he’s fun to read. But his indignation on this one isn’t matched by his argument.

    It’s a valid debate to have, but it’s quite far from obvious that Bill is right. In fact, there are better arguments that taking a team’s wins into account when evaluating the value of a player in any team sport will lead to analysis that is inaccurate. Player analysis and team wins are separated on purpose. I think his argument is ultimately a step backward, not a correction.

    • J Hench says:

      Wins are important because the only valuable thing in baseball is wins. That’s what teams play for; that’s what makes for a successful/unsuccessful season. Wins in the season, which hopefully lead to more wins in the post-season.

      Runs are important because they have a relationship to wins that is steady, robust, and consistent. They are the currency that is exchanged for wins. But runs are not the ultimate goal. Wins are.

      I’m not sure what the argument is against measuring the value of a player’s performance on how much he helps his team win. Yes, you want to be able to compare his performance on an even playing field against other players, so you neutralize the variables that do not have to do with the player (which stadium he plays in, which league, etc.), but it doesn’t seem to me that team wins is an external variable that needs to be held constant in the same way.

      • Karyn says:

        My concern with this angle is that great players on terrible teams are downgraded. If they have a great season, and do everything we’d expect of a great player both in terms of WAR and however you measure clutch–but their team is so bad that they don’t accumulate many wins, I don’t see why they should be discounted in MVP balloting.

  6. Caveatbettor says:

    so a panel of statistical experts, without baseball biases, could create an experiment (i.e. series) between a win shares heavy team and a WAR team (Altuve would play on the former, and Judge the latter), and we would see how that goes.

    Probably at least 15 games in at least 5 different parks, with certain environmental and travel factors.

    To me, one question is going to be the manager and coaching/trainer staffs. Once that gets moved into the experiment, it is probably with more than a season of games. It is going to be difficult as well to explain interpersonal relationships on wins. For example, we know that growing up in a single-parent household vs a two-parent household is a factor for children, but there are many reasons why some families have one parent and a wide diversity in the quality of marriages in the 2-parent.

    The r-squared (simple explanatory coverage for all of this) will still be disappointing. Or so I predict, like a pre-James biased scout.

  7. EnzoHernandez11 says:

    Much has been made of the fact that baseball’s highest individual honor is Most Valuable Player and not Player of the Year. This has led self-styled traditionalists to insist, albeit inconsistently, that there should be an MVP bias in favor of players whose teams won–or at least competed for–pennants and championships.

    That was always kind of a silly argument. But it now seems like we may have a data-driven basis for actually making that distinction. Perhaps WAR can tell us who the best player was, but something like Win Shares can help us determine the Most Valuable Player in the context of the unique circumstances in a given year.

    In 2017, Aaron Judge didn’t come through in the clutch. That’s probably just a matter of luck, and he’ll probably reverse these figures going forward. But for this one year, his inability to perform in clutch situations–however fluky–made him less valuable than Altuve, even though he is probably just as good, if not better, of a player.

  8. Play-Doh the philosopher says:

    We “innumerate” old-school fans have argued for years that clutch hitting is a real thing, and were mocked for it by people like Joe and his most avid readers.

    Now the same idea is taken seriously because it comes from Bill James. Oy.

    The Yankees “should” have won 102 games so Aaron Judge is a choker? Really? If you knew the Yankees “should” have won 102 games, why didn’t you bet a few grand on the Yanks to win the AL East?

    PS Yeah, I’m a biased New Yorker, but Altuve is a fine choice who won by a landslide. So what’s Bill bleaching about? The MVP vote went exactly the way he thinks it should have, and it wasn’t close. So why write this column at all?

    Well, because Bill James is still a petty egomaniac, but besides that.

    • EnzoHernandez11 says:

      James’s argument wasn’t about the MVP vote; it was about the relative usefulness of WAR. The vote between Altuve and Judge was not close, but their WAR scores are. Also, I didn’t read James’s essay as concluding that clutch hitting is a real thing; rather, he simply said we should have remained “agnostic” about it pending a more complete study. I remain skeptical about clutch hitting because we still lack a persuasive demonstration that it’s a persistent ability (rather than a concept that relies far too much on anecdotal evidence). Heck, we don’t even have a universally agreed-upon definition of those situations that clearly qualify as “clutch.” And is there such a thing as clutch pitching?

      In general, it seems far more likely that there are good players, mediocre players, and bad players, and that during “clutch” moments, you should prefer to have a good player at the plate or on the mound, regardless of any notion of clutch-ness. Barry Bonds was regarded as a choke artist for much of his career, until he fell an inning short of becoming the World Series MVP in 2002.

      I mean, I believe that people can choke under pressure in team sports. All of us lesser mortals have done it. I just tend to think that the enormous pressure of making one’s way up the pyramid from the minors to the majors tends to weed out players who crack under pressure. After that, almost everything else is just talent and preparation.

      Also, James’s willingness not to attack young analysts for 20 years certainly isn’t consistent with the actions of a “petty egomaniac.”

      • Marc Schneider says:

        I agree with you. People talk about close and late situations as being clutch or RISP. But there are a lot of other situations that might be considered clutch. If you hit a grand slam in the third inning to break a game open, that might be just as “clutch” as hitting one in the 8th. And there’s too much randomness in when hitters get hits. If a pitcher makes a great pitch, he will probably get the hitter out regardless of how clutch the hitter is. And then there’s the nature of the hit. In the NLDS, Anthony Rizzo in the 8th inning of game three came up with a runner on base and hit a pop up that fell in between three outfielders and drove in the winning run. On the other hand, Willie McCovey hit a rope in the bottom of the 9th in Game 7 1962 that was caught. Who is more “clutch?”

        • invitro says:

          “But there are a lot of other situations that might be considered clutch. If you hit a grand slam in the third inning to break a game open, that might be just as “clutch” as hitting one in the 8th.” — Absolutely. That’s why I favor the Clutch number (which I’ve been blahing about), which is based on WPA. Instead of saying all at-bats are either clutch or not (i.e., black or white), why not assign each at-bat a number on a continuous scale from 0 to 100, based on past data that says a hit in this particular situation makes the hitter’s team 24% more likely to win the game?

          • Marc Schneider says:

            I think that makes sense. One of the problems is that people assume that a clutch hit only occurs in the context of a close game. But any hit that increases the teams’ chance of winning substantially is a clutch hit in my view regardless of when it occurs or the final score.

    • invitro says:

      “We “innumerate” old-school fans have argued for years that clutch hitting is a real thing, and were mocked for it by people like Joe and his most avid readers.
      Now the same idea is taken seriously because it comes from Bill James.” – I’m as avid a reader of this site as anyone, and I’ve argued for using clutch performance when picking the MVP for at least four years. And Bill isn’t saying that clutch hitting is a real skill, only that it should be considered when voting for MVP.

      “If you knew the Yankees “should” have won 102 games, why didn’t you bet a few grand in the Yanks to win the AL East?” – Because we didn’t know that until the season was over.

      “The MVP vote went exactly the way he thinks it should have” — No, it didn’t. He thinks Judge shouldn’t have finished second, but instead probably around 8th.

      • Aaron Judge finished second the way Sham finished second at the 1973 Belmont Stakes- light years behind the winner.

        So if that’s what inspired James to criticize WAR, then “petty” is an apt word.

  9. E.H says:

    Gotta have a stat for clutch players, period.

    • invitro says:

      It’s there, the number called “Clutch” on bb-ref that I talked about above. George Brett has a very high career Clutch, FWIW.

  10. Richard says:

    What irks this guy with some scientific background (Master’s degree in Astronomy) is that all these statistical calculations completely ignore the mathematical tools developed in the field of statistics! Where are your error bars in your numbers? What’s the standard deviation in the population? Pretty much any scientist will tell you that in any complex calculation involving a large set of data, you really shouldn’t trust the last digit in the result.

    With regard to Bill James’s excellent essay where he discusses chance and luck, go back and re-watch Game 7 of the 2001 World Series (or re-read “The Last Night of the Yankee Dynasty” by Buster Olney).

    With the game tied at 1 with two out in the top of the 7th, and a runner on second, the Yankees’ Shane Spencer hits a fly ball to deep right center. But thanks to the weather (a truly random factor!), the wind coming in from that direction keeps the ball in the park. The game stays tied instead of giving the Yankees a 3-1 lead. And in the bottom of the ninth, the Diamondbacks’ decision to have an old-style dirt path from the plate to the mound meant that Mariano Rivera couldn’t field a bunt cleanly, and an error on a throw was a key play in the D-Backs’ rally. Again, a chance factor….

    And by the way, Bill James for the Hall of Fame…..

    • Mike C. says:

      Agreed on error bars and std dev. I just assume a 10% error when looking at numbers and create a mental error bar based on it. So Judge on BB-Ref has 8.1 WAR, with an error range of 7.3-8.9 WAR. Altuve has 8.3 WAR, with an error range of 7.5-9.1 WAR. This takes away the ridiculous notion of certainty on decimal places in WAR.

    • MikeN says:

      Read some global warming papers and you will really get angry!

  11. Mark Daniel says:

    Here’s one problem with WAR:
    In late September, Player A hits a solo HR in the 8th inning of a 10-0 game between 2 last place teams. The HR was hit in a pitcher’s park.
    Also in late Sep, Player B hits a 3-run HR in the 8th inning of a game in which his team is down 5-3. His team is also tied for first place in the division with 7 games left to play. The HR was hit in a hitter’s park.

    According to WAR, Player A’s home run is worth more than Player B’s home run.

    • invitro says:

      That’s not a problem if you’re using WAR to predict future performance. And WAR happens to be designed to predict future performance, not measure clutchiness. 🙂

  12. Scott says:

    Fangraphs has a smart response.

    Also a million times yes to the commenter above who discusses the need for error bars/variance. 8.1 (±0.6) would be a better way to present these statistics.

  13. Mike C. says:

    Let’s try to find a happy compromise. When we get to runs above replacement (according to Fangraphs, 82.9 for Judge and 75.4 for Altuve), it seems everyone is relatively happy. James wants to relate those numbers to actual wins; WAR relates them to expected wins. If we divide those numbers by the runs each player’s team scored per win, we get closer to what James is arguing.

    The Yankees scored 9.43 runs per win, which gives Judge 8.8 wins above replacement using this technique (82.9/9.43). The Astros scored 8.87 runs per win, giving Altuve 8.5 wins above replacement (75.4/8.87). That significantly closes the gap, and I think moves towards answering Bill’s critiques. Using BB-Ref’s numbers, Altuve takes a more sizable lead, 9.6 WAR to 8.7 WAR.
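
The arithmetic above can be sketched as follows. This is only an illustration of the commenter's compromise, not how Fangraphs or Baseball-Reference compute WAR. The function name is hypothetical, and the flat 10 runs per win used for comparison is the common rule of thumb, taken here as an assumption.

```python
def wins_from_runs(runs_above_replacement, runs_per_win):
    """Convert runs above replacement to wins using a given runs-per-win rate."""
    return runs_above_replacement / runs_per_win


# Fangraphs runs above replacement and team runs per win, as quoted above.
players = {
    "Judge": {"rar": 82.9, "team_runs_per_win": 9.43},
    "Altuve": {"rar": 75.4, "team_runs_per_win": 8.87},
}

FLAT_RUNS_PER_WIN = 10.0  # assumption: rule-of-thumb conversion, for comparison

team_based = {}
flat_based = {}
for name, p in players.items():
    team_based[name] = wins_from_runs(p["rar"], p["team_runs_per_win"])
    flat_based[name] = wins_from_runs(p["rar"], FLAT_RUNS_PER_WIN)
    print(f"{name}: team-based {team_based[name]:.1f}, flat {flat_based[name]:.1f}")

gap_team = team_based["Judge"] - team_based["Altuve"]
gap_flat = flat_based["Judge"] - flat_based["Altuve"]
print(f"Judge's lead: {gap_team:.2f} wins (team-based) vs {gap_flat:.2f} (flat)")
```

With each team's own runs-per-win rate, Judge's lead shrinks from roughly three quarters of a win to roughly three tenths.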

    • invitro says:

      Again, using WAR+Clutch is much easier, and much more accurate, as it doesn’t assume that Judge was equally responsible for the Yankees’ poor performance in close games as the other Yankees. (He was much more responsible for it than at least most of the other Yankees.)

      • Mike C. says:

        WAR and Clutch aren’t measured in the same units though. So combining them is bad math, right? If you found z-scores for each, and combined them, that would be acceptable I suppose.

        • invitro says:

          I believe both are measured in wins. I hope I’m not wrong! 🙂

          • Mike C. says:

            Nope. From BBRef:
            Clutch: (WPA / aLI) – WPA/LI
            WPA: Sum of the differences in win expectancies for each play the player is credited with – in wins
            aLI: Average leverage is 1.00 – not in wins
            WPA/LI: WPA in wins, LI not

    • Mike C. says:

      Although the more I think about Bill’s critique, the more I think it would give unnecessary boosts in value to hitters on teams with good pitching, and unfairly hurt hitters on teams with bad pitching. Not sure why pitching performance would become the responsibility of the hitters….

      • MikeN says:

        No it wouldn’t.

        The issue is that the Yankees underperformed their runs scored and allowed, while the Astros matched it. Teams with good pitchers that matched their Pythagorean would not get a double boost just because their team won 120 and another team won 60.

        • Mike C. says:

          A run scored on a team with good pitching is more valuable (leads to more wins) than a run scored on a team with bad pitching.

  14. Thanks Joe.

    In 2002-2003ish when I was being introduced to Fangraphs/analytics I was told that clutch is nonexistent and/or meaningless, and I always had an issue with this on a basic logical level. Any time there is academic consensus on something that is strictly non-factual, it is (A) almost certainly wrong and (B) wrong in such a way that it invalidates much of the profession’s basis on that consensus.

    Of course I wouldn’t suggest RBIs are important or anything, but that clutch is something that is useful to have and potentially trackable from person to person given enough statistical analysis.

    • invitro says:

      You seem to be implying that clutch hitting is non-factual, and this most certainly is not the case.

      Many people have looked to see if clutch hitting is a skill, and they have all failed to find evidence for such a skill, except perhaps for a couple of players like David Ortiz and Mariano Rivera in the playoffs. This doesn’t mean that clutch hitting IS NOT a skill, but to me, it makes it highly unlikely that it is to any significant degree (with exceptions for players like Ortiz & Rivera).

    • MikeN says:

      Have they located clutch pitching?
      My impression was they had, but then when they saw hitters performing equally well against regular pitching and clutch pitching, they concluded there is no clutch hitting, when the default should have been to expect a decline.

      • invitro says:

        Pitchers do have a “Clutch” number on bb-ref (and Fangraphs, I think). But this includes team defense, so there is an improvement to be made, once someone figures out how to measure clutch defense…

  15. John Autin says:

    While the runs-vs.-wins complaint is valid in the specific case of Judge vs. Altuve, the rest of this complaint is massively overblown.

    [Yankees+Judge] is a near-perfect set of conditions to show the flaw of basing WAR solely on runs:
    — New York’s 9-win shortfall from Pythagorean Wins is one of the 40 largest of all time, placing in the 98th percentile for absolute distance from expected wins.
    — Judge’s “clutch” shortfall is also extreme. For instance, his high-leverage OPS was .188 less than his overall mark, which ranks near the 4th percentile in the last 5 years (of those with 100 hi-lev PAs in a season).

    Since Judge was the biggest Yankee offender in clutch spots, it’s easy to say that his WAR deserves to be docked AT LEAST a proportional share of their wins shortfall.

    But there are two giant problems in using such methods broadly to figure WAR:
    — Precisely how to weight the various clutch stats; and
    — Given the rarity of such large team+player shortfalls, rejiggering WAR to account for them is a mountain of work for a mere handful of meaningful improvements.

    Among the difficulties in weighing clutch stats is whether to gauge the player relative to his own overall mark, or relative to other players in clutch spots. For instance, Judge’s .861 hi-lev OPS was .188 less than his overall mark, but the raw number was still one of the better marks this year, just above Altuve’s.

    Beyond all this, there’s a slight whiff of “I was duped!” in Joe’s complaint. There’s no real excuse for that. WAR has always been openly based on runs, not wins, so anyone using WAR as an evaluative tool has a duty to know and adjust for the rare flaws it may cause. Baseball-Reference even has an explicit and longstanding caveat that differences of less than 1 WAR should not be considered meaningful.

    So, don’t blame the tools. And don’t throw the baby out with the bathwater. For the vast majority of comparative purposes, WAR is much better than any competing metric.

  17. Paul Schroeder says:

    Why is 10 runs a win? In most games, a team does not need to score 10 runs to win. Maybe I am just a simpleton, but I have been reading about this stuff for years, and this is the first time I have heard that 10 runs = 1 win.

  18. MikeN says:

    Real reason why Joe agrees with this- James says Judge isn’t second, but third behind Eric Hosmer.

    How do the Royals of recent years stack up under this new regime?

  19. TWolf says:

    I wonder how Judge and Altuve would fare under Bill’s win shares system. Maybe this is his opening to resurrect win shares, the analysis of which ties in closely to actual team wins. I remember first reading about win shares when I purchased his second historical abstract almost two decades ago. Soon after, he and Jim Henzler released a book which went deep into the win share analysis and calculation. I was fascinated by it even though there were limits to my understanding. In subsequent years he published further win share findings. I was surprised that the win shares analysis was never adopted by his sabermetric disciples and clearly has been replaced by WAR. On his website I asked him about this, but he never responded. Maybe Bill is just like many great individual thinkers who see their ideas dismissed by others over time. I remember reading that Bill grew pessimistic over future baseball analysis because he thought that many were capitalizing on his original work by just publishing numbers without any indication of what these numbers were supposed to prove.

  20. AJ says:

    Based upon James’ premise, no good performance should count if it comes in a loss. If player X hits 3 homers and the team loses, those 3 homers shouldn’t contribute to a win share. It seems to me that James wants to have his cake and eat it too: 1) Make sure WAR correlates precisely to wins. 2) But do so in a way that credits players with stats racked up in losses.

  21. Carl Marxz says:

    Joe, this seems very inconsistent on your part. As far as I can tell, you have always argued for removing context from individual statistics, and dismissing context-driven stats when determining MVP. I see no difference between arguing that Judge’s WAR should be lower because the Yankees won x amount of games and arguing that Trout shouldn’t win the MVP because he was on a losing team and didn’t have enough RBI. WAR should not be affected by the performance of a player’s teammates; that’s antithetical to the point of the stat.

  22. DDMe says:

    If you do a low-ball error analysis on WAR, you get a ± of at least 0.4 on the face of it. Extrapolating out, you get a ± that grows each year by similar factors. So for a player with a career WAR of 50 who played 14 seasons and averaged 145 games and 1,250 innings a year, the true value could be as high as 62 or as low as 38. And that’s with me using a simple method that lowballs reality.
