Wednesday, March 05, 2014

ELO DWP? II

A couple of quotes from ELO DWP?:-

In the days when the ratings were cut off at 2000, a player of a 2100 standard couldn't play a rated game against a player of 1800 standard. Once you extend that range, rated games become possible, but the 2100 player maintains their rating provided they make the necessary expected score.
RdC

Yes. Well, they do if the player of 1800 standard actually has a rating of 1800.


... it is obvious that they cannot all be under performing.
Phile

There are a lot of ways that sentence could be interpreted. Let me take you by the hand and lead you through the streets of Hampstead. I’ll show you proof that, in one sense at least, they can.







Elo to ECF.
ECF x 8 + 650 = Elo.
That’s the traditional conversion formula. Or (Elo - 650) / 8 if you want to go the other way, I suppose.

As it happens my first Elo rating and my ECF at the time were pretty much in sync. 172 x 8 = 1376. Add 650 and you get 2026, which is 23 points (equivalent to a couple of ECF points) less than my actual rating back in 2011.
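For anyone who wants to check the sums, here is a minimal sketch in Python. The function names are invented for illustration; the only figures taken from the post are the x 8 + 650 formula, the grade of 172 and the 23-point gap.

```python
def ecf_to_elo_old(ecf):
    """Traditional conversion quoted in the post: ECF x 8 + 650."""
    return ecf * 8 + 650


def elo_to_ecf_old(elo):
    """Going the other way: subtract 650, then divide by 8."""
    return (elo - 650) / 8


converted = ecf_to_elo_old(172)   # 1376 + 650
actual = converted + 23           # the post says the real rating was 23 points higher
print(converted, actual, elo_to_ecf_old(actual))
```

Running the actual rating back through the inverse gives a grade just under three points above 172, which is the "couple of ECF points" mentioned above.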

Things are somewhat different now.  I’ll spare you the maths. The discrepancy between my converted Elo and actual rating is now about 167 points.





Yes, I know. The conversion formula is different now. Click ‘Help’ on the ECF’s Grading Database and scroll to the bottom and you’ll see that as of January the calculation is now ECF x 7.5 + 700 = FIDE. Which brings my personal rating/grade discrepancy down to about 136 points.
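A rough sketch of how the two formulas compare, assuming nothing beyond the two formulas quoted in the post (the function names are hypothetical). The gap between them works out to 0.5 x ECF - 50, so they agree at a grade of 100 and the old formula gives the higher number above that.

```python
def ecf_to_elo_old(ecf):
    """Old formula from the post: ECF x 8 + 650."""
    return ecf * 8 + 650


def ecf_to_elo_new(ecf):
    """Formula in use as of January, per the post: ECF x 7.5 + 700."""
    return ecf * 7.5 + 700


def formula_gap(ecf):
    """Old conversion minus new conversion, i.e. 0.5 x ECF - 50."""
    return ecf_to_elo_old(ecf) - ecf_to_elo_new(ecf)


for ecf in (100, 140, 180, 220):
    print(ecf, ecf_to_elo_old(ecf), ecf_to_elo_new(ecf), formula_gap(ecf))
```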

Yes, I know. One person’s numbers don’t mean anything. True enough, but, leaving the plenty that could be said at this point for another day, let me end by showing you a few more.

Number of games played at Hampstead since August 2013: 22
Number of the 22 against rated opponents: 16
Number of the 16 against opponents with ECF grades: 15



n=15
Average ECF to Elo discrepancy using the old formula: 105.27 points
Lowest difference: 13 points
Highest difference: 183 points
Number of opponents who are over-rated using the old formula: 0
Average ECF to Elo discrepancy using the new formula: 71.7 points
Lowest difference: -5 points
Highest difference: 169 points
Number of opponents who are over-rated using the new formula: 1
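The figures above are the kind of summary that could be produced with something like the sketch below. It assumes each opponent is stored as an (ECF grade, Elo rating) pair and that "discrepancy" means converted grade minus actual Elo, so a negative number is an over-rated opponent. The function names and the three sample pairs are made up for illustration; they are not the 15 opponents in the post.

```python
from statistics import mean


def ecf_to_elo(ecf, slope, intercept):
    """Generic linear conversion: ECF x slope + intercept."""
    return ecf * slope + intercept


def summarise(opponents, slope, intercept):
    """Summarise converted-grade minus actual-Elo gaps for (ecf, elo) pairs."""
    diffs = [ecf_to_elo(ecf, slope, intercept) - elo for ecf, elo in opponents]
    return {
        "n": len(diffs),
        "average": mean(diffs),
        "lowest": min(diffs),
        "highest": max(diffs),
        "over_rated": sum(1 for d in diffs if d < 0),
    }


# Made-up pairs purely to exercise the function; NOT the data behind the post.
sample = [(150, 1800), (160, 1900), (170, 1980)]
print(summarise(sample, 8, 650))     # old formula
print(summarise(sample, 7.5, 700))   # new formula
```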

That’s why I have a hard time believing in Elo any more. It’s not just me, it under-rates everyone. Or everyone I play, at least.

Nearly. There was that one guy, over-rated to the tune of 5 points. In case anybody is still tempted to blame all this on the kids, it’s only fair to mention that he was a junior.

15 comments:

Anonymous said...

Actually the traditional formula was BCF x 8 + 600. The problem isn't with the ELO system, it's with you trying futilely to fit a straight line to a curve. They are completely different systems - one a historical record, the other an exponentially decaying moving average scheme. The ELO system accurately predicts the results (in a statistical sense) and is a good measure of relative strength.

Anonymous said...

"it's not just me, it under-rates everyone"

The only point of any grading system is to show a player's strength relative to another. If it under-rates everyone, then logically it under-rates nobody at all?

So the ECF:ELO conversion is wrong; so what?

Adam B.

Dewi said...

Anonymous at 10.40 is correct. You are starting from an overly anglocentric position. The problem isn't with elo. It's merely that you cannot accurately convert your ecf grade to an elo grade.

Matt Fletcher said...

Jonathan - based on anecdote, I had thought that there was some systematic under-rating too, but my analysis (http://ecforum.org.uk/viewtopic.php?f=4&t=3994&sid=e717fc7240fd1b029d7200fdad99fa06&start=60) suggests otherwise - most people are within +/- 50 points of their conversion and roughly equally likely to be over- or under-rated. Of course there may be something I'm missing in my analysis? For example it may be under-rating improving players and over-rating static / deteriorating players? (that would be an interesting one to try to calculate)

Jonathan B said...

Thanks for your comments gentlemen.

You are starting from an overly anglocentric position

Actually I’m starting from a Mecentric position. I suspect


"it's not just me, it under-rates everyone” ... So the ECF:ELO conversion is wrong; so what?

Well it’s the next line: Or everyone I play, at least.


most people are within +/- 50 points of their conversion and roughly equally likely to be over- or under-rated.

I’m sure this is true, Matt. After all, the new conversion formula was generated by comparing what the relationship between ECFs and Elos actually was.

But ...

Where I play there is no way it’s equally likely that somebody would be over-rated or under-rated. Never has been since 2011. It’s *all* - or very nearly all - under-rated.

Which means the over-rateds must be elsewhere.

Which means that while the Elo system might be doing a decent job of showing relative strengths in my pool of opponents, it breaks down when comparing my pool of opponents against people who play in other tournaments/environments.

Unless I just happen to be exceptionally unlucky with who I’ve been paired with over the past couple of years. Which I accept is a theoretical possibility, but one that’s becoming less likely with each passing tournament.

Jonathan B said...

Oh, one more thing.
Actually the traditional formula was BCF x 8 + 600

Yes, thanks for that.

I did know that it was originally 600 and changed to 650 relatively recently. I decided against including that detail in the post simply because I thought that it just muddied the waters a little bit and didn’t add anything to the underlying central argument.

Jonathan B said...

And one more one more thing, regarding the ‘so what’ argument.

When titles are awarded on the basis of elo rating achieved (not relevant to me of course) and tournament entries are defined by elo rating (very much relevant to me), I think it does matter if there’s systematic under-valuing of everybody.

Even if ‘everybody’ means ‘everybody in one specific circumstance’.

Matt Fletcher said...

most people are within +/- 50 points of their conversion and roughly equally likely to be over- or under-rated.

I’m sure this is true, Matt. After all, the new conversion formula was generated by comparing what the relationship between ECFs and Elos actually was.


Well yes, but the "new" conversion is pretty similar to the "old" one for most grades - eg 8 x 180 + 600 = 2040 whilst 7.5 x 180 + 700 = 2050 (actually an addition of 675 fits the data better to give 2025). Which to me suggests that (across everyone) it works now in much the same way as it did before - and I think you're trying to suggest something is more broken than a handful of ELO points?
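The three conversions in this comment can be laid side by side with a short sketch. The dictionary and function names are hypothetical; the slopes, intercepts and the grade of 180 are the ones quoted in the comment.

```python
# (slope, intercept) pairs as quoted in the comment
FORMULAS = {
    "old (x 8 + 600)": (8, 600),
    "new (x 7.5 + 700)": (7.5, 700),
    "alternative fit (x 7.5 + 675)": (7.5, 675),
}


def convert(ecf, slope, intercept):
    """Linear ECF-to-Elo conversion."""
    return ecf * slope + intercept


results = {name: convert(180, s, i) for name, (s, i) in FORMULAS.items()}
for name, elo in results.items():
    print(name, elo)
```

At a grade of 180 the three results span 25 points, which is the "handful" being referred to.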

Jonathan B said...

Which to me suggests that (across everyone) it works now in much the same way as it did before

I’m not sure if that’s an observation or saying, ‘it still works’, but anyway the issue here is the bracketed ‘across everyone’. I dare say that on the whole the system is more or less ok with most/many within 50 points of their ‘expected conversion’ (to invent a term) and with an equal likelihood of being + or -. But I’m not talking across everyone. I’m talking *where I play*.


and I think you're trying to suggest something is more broken than a handful of ELO points?

I don’t think that is what I’m saying. Using your conversion formula the average rating discrepancy falls to 46.7 points (with 2 over-rated). If I knock out the juniors that leaves n=12 and an average discrepancy of 41.25 points.
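The 46.7 figure follows directly from the 71.7 in the post. A minimal sketch, assuming the slope stays at 7.5 and only the intercept moves from 700 to 675 (the function name is made up for illustration):

```python
def shifted_average(average_discrepancy, old_intercept, new_intercept):
    """With the slope unchanged, lowering the intercept lowers every
    converted rating, and therefore the average gap, by the same amount."""
    return average_discrepancy - (old_intercept - new_intercept)


print(round(shifted_average(71.7, 700, 675), 1))
```

The count of over-rated opponents cannot be derived this way; it depends on the individual numbers.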

41.25 points under-rated means nothing when you’re talking about one person. But when you’re talking about a population it’s different, I think. When *everybody* (or everybody other than one person) is that much off on the under-rated side then something’s up, I think.

You could argue that my sub-sample of opponents is not representative of the general pool of opponents in the tournaments in which I play. That’s entirely possible (although I’m not convinced it’s likely). But when the group neatly divides, not into ‘over-rated’ and ‘under-rated’, but ‘more than 50 points under-rated’ and ‘less than 50 points under-rated’, I have a hard time believing that there’s not a bias in the system.

Unknown said...

Anonymous at 2:04:00 wrote:

"The ELO system accurately predicts the results (in a statistical sense) and is a good measure of relative strength."

Mr Hogg asked how come when the ELO ratings were put to the tests in the Kaggle competitions they were found to be inferior to many other submissions?

Jack Rudd said...

What are these Kaggle competitions? What other entries were submitted? What data was used to judge the methods' predictive power? And, perhaps most crucially, were any methods constructed to fit the given dataset?

Matt Fletcher said...

I tried to submit a comment earlier but it seems to have been swallowed - the point I was trying to make was that presumably one of these 3 things is true about ELO vs ECF:

1) it doesn't work for Jonathan - I think Jonathan has given evidence to support this, it's annoying for him but it doesn't affect anyone else. Perhaps he's just been unlucky?

2) it doesn't work for anyone, in which case it's potentially a problem for reasons given above - I think my analysis and that of others suggests that this isn't the case or at least that it's not too bad on an overall basis (though you may disagree)

3) it doesn't work for specified subgroups (juniors, grading bands, improvers, geographic areas, very active players), one or more of which Jonathan finds himself in. This could give rise to a potential issue where people in these categories play each other regularly (/exclusively) because they will always end up with the 'wrong' rating. I've done some analysis (as have others) on juniors and grading bands which suggests that it doesn't really work for juniors (and by implication for other fast-improving players) but that it does work OK in different grading bands. I haven't seen anything (though I haven't really looked hard) on other categories - is anyone aware of any analysis here? Also Jonathan, I'm interested to know what subgroup(s) you think isn't (aren't) working.

Also, I'm interested to know what the experience is of UK players playing rated chess abroad - do they tend to outperform their ELO or underperform?

Unknown said...

Jack Rudd wrote:
"What are these Kaggle competitions? What other entries were submitted? What data was used to judge the methods' predictive power? And, perhaps most crucially, were any methods constructed to fit the given dataset?"

Mr Hogg replied: Oh! Please google sumpin' like "kaggle chess rating" and take it from there.

IIRC one of the better if not the best entry has been submitted to FIDE by Jeff Sonas and Alec Stephenson, no?

Alec Stephenson's efforts are now available as an R pkg that I use for ratings other than chess.

The Kaggle people have many nice comps if you fancy yourself as a statistician or machine learning expert. The talent pool in their comps is 'awesome' as they say in USA, chasing some huge prizemoney!

If you are interested in this area I suggest you look at Back to Basics in Rating Theory at:
www.ratingtheory.com/index.htm
especially the Conclusions link but the whole site is worth a read. IMHO, essential reading for anyone debating ELO and ratings in general.

Cheers!

an ordinary chessplayer said...

Unknown wrote: If you are interested in this area I suggest you look at Back to Basics in Rating Theory at:
www.ratingtheory.com/index.htm

Yes, Jones was hot in Chess Life about 20 years ago. Cooler heads know that he hasn't fixed anything.

From the index page: "The implication is that every rating system is based on a probability distribution and that the accuracy of a system is to be judged by the suitability of this distribution."

Jones disparages this (according to Jones) idea of Elo's, implicitly on the index page and explicitly on the "Skeptical Conclusions" page. But, on the "A Revised Elo System" page, the ONLY modification he suggests is to the probability distribution.

Let me outline the junior problem as I understand it:

(1) Johnny learns the moves at his school club. Johnny plays many games there against other new players. He's a true beginner, as we all were at one point.

(2) Johnny plays his first rated tournament, a junior tournament naturally, and loses all four of his games against average 1200 opposition. Johnny's first Elo is the average of his opponents' ratings minus 400, thus 800. He's actually lucky to have an 800 rating, because he would have gone 0-4 against just about ANY average rating. (His only chance to score a point would have been against an unrated player like himself, which of course would not count for rating.)

(3) Johnny plays non-rated chess against his dad, learns Scholar's mate. Against his friends, learns the Fried Liver. Against his computer, learns not much at this stage. Johnny reads a book, learns the Sicilian. With his new chess knowledge, if Johnny3 were to play Johnny2 in a four-game match, Johnny3 would win 4-0. It's a little unfortunate that Johnny3 and Johnny2 have the same rating, because that rating (those ratings?) predicts a 2-2 result. Please note that there is NO "suitable" probability distribution that could correctly predict a 4-0 result here.

(4) Johnny plays his second rated tournament, an open tournament this time. Johnny scores 1-3 against average 1200 opposition, and bingo, K rating points just left the adult pool. Johnny's new rating is 850 and at this very moment he is STILL under-rated by 150 Elo points. The logician knows his 1st tournament can be thrown away as being for Johnny2, while his 2nd tournament performance rating reflects his "true" current rating. (Not for long, though.) Unfortunately, the statistician counts both tournaments.
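The numbers in steps (3) and (4) can be checked with a rough sketch. It assumes the usual logistic approximation to the Elo expectation curve (federations may use lookup tables instead) and treats the 1200 average opposition as a single opponent; the function names are hypothetical.

```python
import math


def expected_score(rating, opponent_rating):
    """Logistic Elo expectation for one game (a common approximation)."""
    return 1 / (1 + 10 ** ((opponent_rating - rating) / 400))


def performance_rating(average_opponent, score_fraction):
    """Rating at which the logistic curve predicts this score fraction."""
    return average_opponent - 400 * math.log10(1 / score_fraction - 1)


# Johnny3 v Johnny2: same rating, so the prediction over four games is 2-2.
print(4 * expected_score(800, 800))

# What an 800 rating predicts over four games against 1200 opposition.
print(4 * expected_score(800, 1200))

# Scoring 1 out of 4 against 1200 opposition.
print(performance_rating(1200, 0.25))
```

Under these assumptions the 1-out-of-4 performance comes out a little over 1000, in line with a rating of 850 being about 150 points too low.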

This same rating problem exists, to one degree or another, for any player who ever learns anything practical about chess. For example, if two players meet in a club match, they BOTH improve (a little) by virtue of the game experience, but the rating system shows a net gain of zero.

@Matt Fletcher - Traveling players always underperform. All masters are wary of the local player who gets to sleep in the correct timezone, in his own bed, eat his usual breakfast, etc.

Jonathan B said...

Interesting comment OCP. I don’t doubt it’s true but what’s happened in my local area is that it doesn’t even matter now if you play juniors or not because everybody’s rating has been knackered. Only 4 of my 16 opponents mentioned in this post were juniors for example.

Also, “Traveling players always underperform”. Perhaps, although that makes it even more notable that my initial rating was gained with 9 of 17 games played in Spain.


Matt: I'm interested to know what subgroup(s) you think isn't (aren't) working.

Well, I’m just talking about the tournaments I play really. But I’d guess anybody who plays in an environment which attracts (a) a lot of unrated players and (b) a lot of young juniors is also likely to notice similar effects.