
Meta-Dork: Ranking the Dorks from 1->N



uh_no
04-08-2017, 12:32 AM
So a month ago, before the tournament started, I began a project to grade the dork polls on how they did in the tournament. It's taken me a week to get the write-up done, but here is what I thought, did, and wrote up. I hope you find it interesting.


The contenders:
Kenpom: The man himself. Twelfth son of the Pom. I caddied for him once. He stiffed me on the tip.
538: Trying to rebuild their legacy after a horrid election season….where they STILL beat everyone else
BPI: As much as I hate to give ESPN any recognition, especially since this rating only exists because they figured out that some people like stats and don't want to pay attention to the BS they usually spout….we'll see if it's actually valuable
Sagarin: has three rating systems...we’ll see if they are any good!

So first, a few notes about what I could and couldn’t do:
I couldn't possibly have extrapolated predictions from before the tournament. That would require knowing exactly how each system does its calculations, and I simply don't. Given that limitation, I took each system's prediction ON THE MORNING of the game. This means that if any games on previous days affected the ratings, those changes were taken into account. While it would be cool to see how they did based solely on their pre-tournament evaluations (since that's when we have to fill out brackets), it simply wasn't feasible.
The systems provide different predictions. Kenpom and BPI provide both a score and a win %, 538 provides just a %, and Sagarin provides just a point differential. I decided to use win percentage as my ultimate analysis tool, since we can ultimately do a bit more with it. I converted Sagarin's point differential to a % using a regression based on KP's win probabilities and predicted scores for regular-season games. A bit more on why I think that's okay in a second.
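For the curious, here's a minimal sketch of what that kind of margin-to-probability conversion could look like in Python. The data points and the logistic shape of the curve are my assumptions for illustration; all the original post says is that the regression was based on KP's regular-season win % and scores, not what form it took.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical stand-ins for KP's regular-season predictions: his predicted
# margin (points) and his predicted win probability for the favorite.
kp_margin   = np.array([1.0, 2.5, 4.0, 6.0, 9.0, 12.0, 16.0, 20.0])
kp_win_prob = np.array([0.53, 0.58, 0.63, 0.70, 0.79, 0.86, 0.92, 0.96])

# Assume a logistic curve p = 1 / (1 + exp(-k * margin)); k is the only free
# parameter because a 0-point margin has to map to a 50% win probability.
def margin_to_prob(margin, k):
    return 1.0 / (1.0 + np.exp(-k * margin))

(k_fit,), _ = curve_fit(margin_to_prob, kp_margin, kp_win_prob, p0=[0.1])

# Sagarin's point-differential predictions can now be mapped onto the same scale.
sagarin_margin = 6.5  # example Sagarin predicted spread
print(margin_to_prob(sagarin_margin, k_fit))  # implied win probability
```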

So with that bit out of the way, which dork is the dungeon master?

Pick 'Em

Even though it's largely and utterly useless, the first thing people want to know is "well, who picked the most games right?" I considered not even posting that, but as I would have expected, each system was within a game or two of the others. Turns out they all pick the favorites, so they all pretty much get the matchups with favorites right and the upsets wrong, and the only thing that separates the systems is who got one or two more close games right than the others. As I claimed in my preview a few weeks ago, that likely has no predictive value and is right in the realm of random chance. In case you REALLY want to know:



Correct picks   System
48              BPI
48              Sagarin Combo
48              Sagarin Blue
47              KP
47              Sagarin ELO
46              538




So if you want to base your predictions next year on a system just because it got a play-in game between Mount St. Mary's and New Orleans right, more power to you. So congratulations to the three winners. You managed to flip heads more than the other guys.

Bucketizing Wins

More interesting: each system provides a win %...so how accurate ARE those win percentages? A good system should be right about a game it puts at 75% three out of every four times.

In order to look at this, I bucketized each system's picks into sensible buckets of 10-15 games and evaluated how close the predicted % came to the fraction of games the favored teams actually won. The bucketing here may have slight impacts on the results. I chose buckets solely based on where decent-sized groupings or large breaks in % occurred...and only then looked at how many they got right. It's not perfect, but it certainly wasn't biased.
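As an illustration, here is a minimal sketch of that kind of bucketed calibration check. The game results and bucket boundaries below are made-up placeholders, not the actual tournament numbers.

```python
import numpy as np

# Hypothetical per-game data for one system: the predicted win % for the
# favorite and whether that favorite actually won (1) or lost (0).
pred_pct = np.array([51, 53, 58, 62, 66, 71, 75, 80, 88, 93, 54, 60, 68, 77, 85])
won      = np.array([ 1,  0,  1,  1,  0,  1,  1,  1,  1,  1,  0,  1,  0,  1,  1])

# Bucket boundaries chosen by eye so each bucket holds a sensible number of
# games, mirroring the hand-picked ranges in the tables below.
buckets = [(50, 57), (58, 67), (68, 79), (80, 100)]

for lo, hi in buckets:
    mask = (pred_pct >= lo) & (pred_pct <= hi)
    if mask.sum() == 0:
        continue
    picks = won[mask]
    print(f"{lo}->{hi}: {picks.sum()}/{len(picks)} correct, "
          f"avg predicted {pred_pct[mask].mean():.0f}%, actual {100 * picks.mean():.0f}%")
```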

KP:

% range    picks    % correct
51->54      9/16       56
57->63      6/10       60
64->67      5/12       42
69->79     10/12       83

Perfect above 77%.

538:

% range    picks    % correct
51->55      6/12       50
56->63      4/12       33
66->70     10/13       77
74->84      9/13       69

Perfect above 82%.

BPI:

% range    picks    % correct
50->57      8/13       61
58->65      6/12       50
67->75      6/10       60
76->86      8/12       66

Perfect above 86%.

Sag combo:

% range    picks    % correct
50->56     11/16       69
57->67     10/17       59
69->75      6/12       50

Perfect above 73%.

Sag elo:

% range    picks    % correct
50->55      8/11       73
56->61      9/15       60
64->70      5/11       45
73->85     11/16       69

Perfect above 86%.

Sag blue:

% range    picks    % correct
50->54     10/16       63
55->65      6/13       46
67->73      6/11       55

Perfect above 78%.

So what can we say here? It's really hard to make any hard-and-fast conclusions, since SO much depends on the bucketing. It seems the Sagarin picks might have looked better than they actually were by getting lucky on their 50/50 games, but that was offset by doing rather poorly in their upper ranges. At a glance, it would seem KP came the closest, with only one of his buckets especially out of whack. While this is all an interesting tidbit, I don't put too much stock in the % above which each system picked perfectly (listed below each table), since that effectively just represents whatever win probability the system assigned to some big upset. That's a bit too much sway for a single data point.

But anyway, how can we do better than this?

Linear Regressions

Up until now, all we've considered is the win-loss result. We have more information than that, though, namely the point spread at the end of the game. Surely if a team wins by 1 point, the system that gave them a 49% chance of victory did better than the one that gave them 90%!

To try to capture that information, I plotted final point spread vs. predicted win % for each game and each system, and had Excel pump out a regression to see how well the outcomes fit the predictions.

Note that I mirrored each game outcome across the (50, 0) point to force the fit through that point, and for some other reasons as well, chiefly that it lets me plot all the systems on the same graph without worrying that which team each system happened to call the "favorite" could unfairly bias the comparison. The graph looks like this:

[attachment 7336: final point spread vs. predicted win %, all systems]


Interesting, but in the end there is a major flaw. As I had found out previously, the relationship between win % and point spread is not linear. This must obviously be the case, since one is bounded and the other is not. Therefore, what our regression here would tell us is not who had the best predictions, but whose % predictions were closest to linear in the point outcomes. That's no good. It's not apples to apples. So we have to convert our predicted win % into points so that we CAN run a linear regression (note, technically it's still a linear regression even though we're transforming one of the variables….just a weird quirk of how analysts use the term). Easier than that, though, I'll just use the same transformation (in reverse) to convert the final point spread into what the expected win % should have been.

So what formula should we use to transform between the two? There's no universal table that says "a team that wins by 6 points should win 55% of the time." I wanted to just use the same KP-based formula I used earlier to calculate win probabilities for the Sagarin predictions….but I had major concerns that using his data as the basis for the transformation would unfairly bias the outcome. After thinking about it, though, it seems that EVERY system's transformation between points and win probability SHOULD converge to about the same curve. If that's not the case, it means that either their win % prediction or their point prediction is sub-optimal...and obviously they want both to be as accurate as possible!

So for better or for worse, I decided that the KP point/probability transformation equation I derived would be representative enough of any system’s transformation to use it.
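To make the procedure concrete, here's a rough sketch of the mirroring and the reverse transformation. The per-game numbers and the value of the logistic constant are placeholders; in the actual analysis the transformation came from the KP-based regression described above.

```python
import numpy as np

# Hypothetical per-game data: the system's predicted win % for the team it
# favored, and that team's final margin (negative if the favorite lost).
pred_pct     = np.array([66.0, 79.0, 55.0, 90.0])
final_margin = np.array([-5.0, 12.0,  3.0, 20.0])

# Logistic constant from the KP-based margin<->probability fit (placeholder value).
k_fit = 0.11

# Mirror every game across the (50%, 0 points) point so that which team a
# system happened to call the "favorite" can't bias one system against another:
# each game contributes both orientations.
pred_both   = np.concatenate([pred_pct, 100.0 - pred_pct])
margin_both = np.concatenate([final_margin, -final_margin])

# Convert the observed final margin back into the win % a "perfect" prediction
# would have assigned, using the same transformation in reverse.
implied_pct = 100.0 / (1.0 + np.exp(-k_fit * margin_both))

# These (pred_both, implied_pct) pairs are what the linear regressions are run on.
```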

Here is the new graph, comparing predicted win % to actual win % based on the final point spread:

[attachment 7337: predicted win % vs. win % implied by the final point spread, all systems]

So after I had that, I took a linear regression of each system, and looked at 2 things:

Accuracy

The closer the regression slope is to 1, the better that system's predictions: it more accurately evaluated how a team would perform relative to its opponent. This is trying to divine the same information as the bucket analysis above, but in a far more robust way, since it does not depend on the buckets and is not susceptible to changes in the bucket boundaries. The observed slopes were:



System    Slope
KP         .80
538        .74
BPI        .65
Combo      .74
ELO        .65
Blue       .72
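For reference, here's a minimal sketch of how a slope like those above (and the R^2 values used in the precision section below) could be pulled out of the mirrored, transformed data. The arrays are hypothetical stand-ins, not the real game data.

```python
import numpy as np
from scipy.stats import linregress

# Hypothetical mirrored data for one system (see the earlier sketch):
# predicted win % vs. the win % implied by the actual final margin.
pred_both   = np.array([66.0, 79.0, 55.0, 90.0, 34.0, 21.0, 45.0, 10.0])
implied_pct = np.array([37.0, 79.0, 58.0, 90.0, 63.0, 21.0, 42.0, 10.0])

fit = linregress(pred_both, implied_pct)
print(fit.slope)         # closer to 1.0 = more accurate on average
print(fit.rvalue ** 2)   # R^2, used for the precision comparison below
```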




As we can see, KP blew everybody out of the water here. On average, KP did better than everyone else at predicting which teams were better, and by how much. Ultimately, though, this outcome made me question my methodology….maybe it's just a coincidence that the system whose data I used to derive the transformation came out on top….maybe it's not. Hard to know.

What we CAN say conclusively, however, is that KP blew all the Sagarin systems to bits. Since I used the same transformation to convert Sagarin's point predictions into probabilities in the first place as I used on the final outcomes, that comparison doesn't depend on the transformation at all. Seems ELO REALLY stinks. I'd also say that the spread between 538 and BPI is large enough that both 538 and KP are likely better than BPI. I can't imagine their ideal transformations being THAT different (if they differ at all) that it could possibly account for a 15% difference.

So maybe 538 would be as good as KP here if I used a slightly different transformation? Maybe they wouldn't be? I doubt it, for two reasons: one I mentioned before, about how the transformation should be close for all systems, and the other is that KP still had the "best" slope even on the un-transformed data.

Moving on

Precision

The precision of the data. The average accuracy of a system (evaluated above) is useless if its individual predictions deviate wildly and merely average out well in the end. To evaluate this, we look at the R^2 values of the individual regressions:



System    R^2
KP         .39
538        .40
BPI        .35
Combo      .37
ELO        .29
Blue       .39




So we see here that the top three systems (538, KP, and Blue) are really tightly clustered. I'd imagine it's hard to do much better than that. BPI and Combo are slightly off the mark, and ELO is a dumpster fire out back.

Given that 538 WAS the most precise, though slightly less accurate, we SHOULD be able to adjust their win % slightly (somewhere around 10% lower than they estimate), after which their predictions should be at least as good as KP's. This also helps counter the fact that I was forced to use the KP points/% transformation, which may have thrown 538 off slightly in the accuracy comparison.
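One plausible way to do that kind of adjustment (this is my reading of "adjust their win % slightly," not necessarily what was actually intended) is to shrink each prediction toward 50% by the system's regression slope:

```python
def recalibrate(pred_pct, slope):
    """Shrink a predicted win % toward 50% by the system's regression slope.

    This is one hypothetical adjustment, not an established recalibration
    for any of these systems.
    """
    return 50.0 + slope * (pred_pct - 50.0)

print(recalibrate(79.0, 0.74))  # a 79% prediction becomes roughly 71.5%
```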

Some other fun things I noticed that likely contributed significantly to the overall results:

In some of the "bigger" upsets, KP had a lower win % for the team that ended up losing than some of the others. He gave Nova only a 66% chance to win over Wisconsin, which was the lowest of the bunch and far lower than 538's 79%. He also came in lowest on the Xavier-FSU blowout upset (66%). He was also lowest on Duke's % over USC, at, wait for it….66%, again far lower than the others and better than 538's 78%. He also gave Baylor a 66% chance over USC, and KU a 66% chance over Oregon…..so if you have just ONE takeaway from this, it's that next year, if KP picks someone to win with a 66% chance? Pick the upset.

So after all that, it is my opinion that the meta-dork rankings are:


King dork: Kenpom
Prince dork: 538 (though if adjusted properly, could approach KP)
Court members: Combo and Blue
Court jesters: ELO and BPI



For full disclosure, I had no idea how this would come out, and I didn’t TRY to make KP win and ESPN carry the lanterne rouge, but I can’t say I’m unhappy it came out that way!

And now you all know how I won the bracket challenge and received eternal consciousness...The rest of you can give up now, you’ve been out-meta-dorked :)


(If anyone wants all the data, feel free to PM me)

Indoor66
04-08-2017, 07:18 AM
The ultimate in dorkness to evaluate dork polls. Can it possibly be done in brief prose paragraphs?

OldPhiKap
04-08-2017, 08:47 AM
I thought this was going to be a post ranking posters . . . .

(I self-identify as a nerd and dweeb more than as a dork. Geek on various issues. When it comes to this type of analysis, I greatly appreciate the work but generally find myself scrolling to the conclusion and trusting the methodology. Great work, uh_no!)

superdave
04-08-2017, 09:28 AM
Why do you square the R's? I roll mine..... I kid, I kid.

Thanks for doing this. Very interesting read. I have to say the results are about what I would have expected as far as the quality of the systems goes. At some point, NCAA games are a crapshoot. I guess it's the volatility in outcomes of inconsistent teenagers competing.

fidel
04-08-2017, 11:26 AM
The ultimate in dorkness to evaluate dork polls. Can it possibly be done in brief prose paragraphs?

Iambic Pentameter.

Vote for discussion on next podcast. Really great analysis.