Do You Want to Know a Secret?
A riff on writing style and rating systems
[ Cropped from source ]
Mark Glickman is a statistician at Harvard University. With Jason Brown of Dalhousie University and Ryan Song also of Harvard—we’ll call them GBS—he has used musical stylometry to resolve questions about which Beatle wrote which parts of which songs. He is also a nonpareil designer of rating systems for chess and other games and sports.
Today we discuss wider issues and challenges arising from this kind of work.
In fact, we’ll pose a challenge right away. Let’s call it The GLL Challenge. Many posts on this blog have both our names. In most of them the writing is split quite evenly. Others like this are by just one of us. Can you find regularities in the style of the single-author ones and match them up to parts of the joint ones?
Most Beatles songs have single authors, but some were joint. Almost all the joint ones were between John Lennon and Paul McCartney, and in a number of those there are different accounts of who wrote what and how much. Here are examples of how GBS weighed in:
- Although the 1962 song “Do You Want to Know a Secret?” was credited as “Lennon/McCartney” (and even as “McCartney/Lennon” by a band who covered it in 1963), it has long been agreed to be mostly by Lennon, as labeled on this authorship list. GBS confirm this.
- The two composers differed, however, in their accounts of “In My Life” and it has taken GBS to credit it all to Lennon with over 98% confidence.
- The song “And I Love Her” is mainly by McCartney, but GBS support Lennon’s claim to have written the 16-syllable bridge verse.
- Lennon said “The Word” was mainly his, but GBS found McCartney’s tracks all over it.
Tell Me Why Baby It’s You
To convey how it works, let’s go back to the GLL Challenge. I tend to use longer words and sentences, often chaining further thoughts within a sentence when I could have stopped it at the comma. The simplest approach is just to treat my sole posts as “bags of words” and average their length. Do the same for Dick’s, and then compare blocks of the joint posts. The wider the gap between our sole writings, the more confidently you can ascribe blocks of our joint posts according to which of our word-length means they approach.
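As a rough illustration of this bag-of-words comparison (a minimal sketch, not the method of the Beatles study; the post texts and author names are placeholders), here is how it might look in Python:

```python
import re

def mean_word_length(text):
    """Average word length, treating the text as a bag of words."""
    words = re.findall(r"[A-Za-z']+", text)
    return sum(len(w) for w in words) / len(words)

def ascribe_block(block, mean_a, mean_b, author_a="Ken", author_b="Dick"):
    """Ascribe a block of a joint post to whichever sole-author mean it is closer to."""
    m = mean_word_length(block)
    return author_a if abs(m - mean_a) < abs(m - mean_b) else author_b

# Hypothetical usage: sole_posts_ken and sole_posts_dick are lists of post texts.
# mean_ken = mean_word_length(" ".join(sole_posts_ken))
# mean_dick = mean_word_length(" ".join(sole_posts_dick))
# print(ascribe_block("Some block of a joint post...", mean_ken, mean_dick))
```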
For greater sophistication, you might count cases of two consecutive multisyllabic words, especially when a simple word like “long” could have replaced the second one. Then you are bagging the pairs of words while discarding information about sentence structure and sequencing. An opposite approach would be to model the probability of a word of length $\ell_{k+1}$ following a whole sequence of words of lengths $\ell_1, \ell_2, \dots, \ell_k$. This retains sequencing information even if $k$ is small because one sequence is chained to the previous one.
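Here is a minimal sketch of that sequence-based alternative, assuming we track only word lengths and first-order transitions between them; the smoothing constants are arbitrary choices for illustration:

```python
import math
import re
from collections import defaultdict

def length_sequence(text):
    """Reduce a text to the sequence of its word lengths."""
    return [len(w) for w in re.findall(r"[A-Za-z']+", text)]

def transition_counts(lengths):
    """Count how often a word of length a is followed by one of length b."""
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(lengths, lengths[1:]):
        counts[a][b] += 1
    return counts

def log_likelihood(lengths, counts, smooth=1.0, max_len=30):
    """Score a length sequence under an author's transition model (add-one style smoothing)."""
    total = 0.0
    for a, b in zip(lengths, lengths[1:]):
        row = counts.get(a, {})
        denom = sum(row.values()) + smooth * max_len
        total += math.log((row.get(b, 0) + smooth) / denom)
    return total

# Hypothetical usage: fit counts on each author's sole posts, then compare
# log_likelihood(block_lengths, counts_ken) with log_likelihood(block_lengths, counts_dick).
```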
GBS counted pairs—that is, transitions from one note or chord to another—but did not analyze whole musical phrases. The foremost factor, highlighted in lots of popular coverage this past month, is that McCartney’s transitions jump around whereas Lennon’s stay closer to medieval chant. Although GBS covered songs from 1962–1966 only, the contrast survives in post-1970 songs such as Lennon’s “Imagine” and “Woman” versus McCartney’s “Live and Let Die” and the refrain of “Band on the Run.”
To my ears, the verses of the last creep like Lennon, whereas Lennon’s “Watching the Wheels” has swoops like McCartney. Back when they collaborated they may have taken leaves from each other, as I sometimes channel Dick. The NPR segment ended with a query by Scott Simon about collaborative imitation to Keith Devlin, who replied:
For sure. And that’s why it’s hard for the human ear to tell the thing apart. It’s also hard for them to realize who did it and this is why actually the only reliable answer is the mathematics because no matter how much people collaborate, they’re still the same people, and they have their preferences without realizing it. [Lennon’s and McCartney’s] things come together—that works—but they were still separate little bits. The mathematics isolates those little bits that are unique to the two people.
GBS isolated 149 bits that built a confident distinguisher of Lennon versus McCartney. This raises the specter of AI revealing more about us than we ourselves can plumb, let alone already know. It leads to the wider matter of models for personnel evaluation—rating the quality of performance—and keeping them explainable.
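The GBS pipeline is more elaborate than anything shown here, but as a toy sketch of how many 0/1 musical features can be turned into a probabilistic distinguisher, a penalized logistic model suffices; the feature matrix below is random stand-in data, not the GBS dataset:

```python
# Toy sketch: a penalized logistic distinguisher over binary features.
# X is a hypothetical matrix of 0/1 indicators (e.g., presence of particular
# chord or melodic transitions); y marks the known author of each song part.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(60, 149))   # 149 binary features, matching the count above
y = rng.integers(0, 2, size=60)          # 0 = Lennon, 1 = McCartney (stand-in labels)

clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000).fit(X, y)
print(clf.predict_proba(X[:1]))          # authorship probabilities for one disputed part
```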
A Paradox of Projections
Glickman created the rating system Glicko and partnered in the design of URS, the Universal Rating System. Rather than present them in detail we will talk about the problems they intend to solve.
The purpose is to predict how a player $P$ will do against an opponent $Q$ from the difference $x = R_P - R_Q$ in their ratings $R_P$ and $R_Q$:

$$p = f(x) = f(R_P - R_Q).$$

Here $f(x)$ gives the probability for $P$ to win, or more generally the percentage score expectation over a series of games. The function $f$ should obey the following axioms:

- $f(0) = \frac{1}{2}$ and $f(-x) = 1 - f(x)$;
- $f$ is increasing, with $f(x) \to 1$ as $x \to \infty$;
- $f$ is concave for $x \ge 0$.

The last says that the marginal value of extra skill tails off the more one is already superior to one’s opponent. Together these say $f$ is some kind of sigmoidal curve, like the red or green curve in the graphic on the “Elo Win Probability Calculator” page.
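For concreteness, here is the standard logistic form of such a curve in Python; the 400-point scale constant is the usual Elo convention, used here only as an illustration of a curve obeying the axioms:

```python
def expected_score(x):
    """Logistic Elo-style expectation for a rating difference of x points."""
    return 1.0 / (1.0 + 10.0 ** (-x / 400.0))

print(round(expected_score(0), 3))     # 0.5, as the first axiom requires
print(round(expected_score(200), 3))   # about 0.76 for the stronger player
print(round(expected_score(-200), 3))  # about 0.24, mirroring the above
```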
To use the calculator, pop in the difference $x$, choose the red curve (for US ratings) or green curve (for international ratings), and out pops the expectation $f(x)$. What could be simpler? Such simplicity and elegance go together. But the paradox—a kind of “Murphy’s Law”—is:
Unless the players are equally rated, the projection is certainly wrong. It overestimates the chances of the stronger player. Moreover, every projection system that obeys the above axioms has the same defect.
Here’s why: We do not know each rating exactly. Hence their difference $x$ likewise comes with a $\pm\epsilon$ component. Thus our projection really needs to average $f(x + \epsilon)$ and $f(x - \epsilon)$ over a range of $\epsilon$ values. However, because $f$ is concave for $x \ge 0$, all such averages will be below $f(x)$.
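A quick numerical check of this concavity effect, using the logistic curve above and an arbitrary uncertainty of 100 points:

```python
def expected_score(x):
    return 1.0 / (1.0 + 10.0 ** (-x / 400.0))

x, eps = 200.0, 100.0
pinpoint = expected_score(x)
averaged = 0.5 * (expected_score(x + eps) + expected_score(x - eps))
print(round(pinpoint, 4), round(averaged, 4))  # about 0.7597 vs 0.7445: the average is lower
```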
We might think we can evade this issue by using the curves

$$f_\epsilon(x) = \frac{1}{2}\bigl(f(x + \epsilon) + f(x - \epsilon)\bigr).$$

This shifts the original curve left and right and averages them. Provided $\epsilon$ is not too big, $f_\epsilon$ is another sigmoid curve. Now define $g$ by aggregating the functions $f_\epsilon$, say over $\epsilon$ normally distributed around $0$. Have we solved the problem? No: $g$ still needs to obey the axioms. It still has sigmoid shape concave above $0$. Thus $g(x)$ will still be too high for $x > 0$ and too low for $x < 0$. The following “Law”—whom to name it for?—tries not to be hyperbolic:
All simple and elegant prediction models are overconfident.
Indeed, Glickman’s own explanation on page 11 of his survey paper, “A Comprehensive Guide to Chess Ratings,” is philosophically general:
At first, this consistent overestimation of the expected score formula may seem surprising [but] it is actually a statistical property of the expected score formula.
To paraphrase what he says next: In a world with total ignorance of playing skill, we would have to put the expectation at $\frac{1}{2}$ for every game. Any curve $f$ comes from a model purporting pinpoint knowledge of playing skill. Our real world is somewhere between such knowledge and ignorance. Hence we always get some interpolation of $f(x)$ and the flat line $y = \frac{1}{2}$. In chess this is really an issue: although both the red and green curves project a 200-point difference to give almost 76% expectation to the stronger player, observed results are about 72% (see Figure 6 in the survey).
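In symbols, taking the logistic curve above as a stand-in for the red/green curves, the observed figure sits partway between the pinpoint projection and the ignorance line:

$$f(200) = \frac{1}{1 + 10^{-200/400}} \approx 0.76, \qquad 0.72 \approx \lambda \cdot 0.76 + (1 - \lambda)\cdot\tfrac{1}{2} \ \text{ with } \ \lambda \approx 0.85.$$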
Newtonian Ratings and Grothendieck Nulls
The Glicko system solves this problem by giving every player $P$ a rating $R_P$ and an uncertainty parameter $\epsilon_P$. Instead of creating $f_\epsilon$’s and $g$ (or the like) it keeps $\epsilon$ as a separate parameter. This solves the problem by making the prediction $p$ a function of $\epsilon_P$ and $\epsilon_Q$ as well as $R_P - R_Q$, with optional further dependence on how the $(R_P, \epsilon_P)$ “glob” may skew as $R_P$ grows into the tail of high outliers and on other dynamics of the population of rated players.
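As a sketch of how the uncertainty can enter the prediction, here is an expected-score formula along the lines of Glicko-1, written from memory; treat the constants and the use of only the opponent's deviation as approximations rather than a definitive statement of the system:

```python
import math

Q = math.log(10) / 400.0

def g_factor(rd):
    """Attenuation factor: a larger rating deviation flattens the curve toward 1/2."""
    return 1.0 / math.sqrt(1.0 + 3.0 * (Q ** 2) * (rd ** 2) / (math.pi ** 2))

def glicko_expected(r, r_opp, rd_opp):
    """Expected score of a player rated r against an opponent with rating r_opp and deviation rd_opp."""
    return 1.0 / (1.0 + 10.0 ** (-g_factor(rd_opp) * (r - r_opp) / 400.0))

print(round(glicko_expected(1900, 1700, 30), 3))   # near the pinpoint 0.76
print(round(glicko_expected(1900, 1700, 300), 3))  # flattened toward 1/2
```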
However, Newton’s laws behave as though bodies have pinpoint mass values at their centers of gravity, no matter how the mass may “glob” around them. Trying to capture an inverse-square law for chess ratings leads to a curious calculation. Put an inverse-square form such as $f(x) = 1 - \frac{1}{2(1 + cx)^2}$ for $x \ge 0$, with $c > 0$ small. Taking $f(x) = 1 - f(-x)$ for $x < 0$ gives the other half of the curve and allows gluing the two halves at $x = 0$. Simplifying the discrepancy $\frac{1}{2}\bigl(f(x + \epsilon) + f(x - \epsilon)\bigr) - f(x)$ over a common denominator makes its two biggest terms, the ones free of $\epsilon$, cancel, leaving a numerator with only $\epsilon^2$ and $\epsilon^4$ terms.
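For concreteness, with the illustrative form above (my choice of parametrization, so the exact expression should be read as a sketch of the calculation rather than a definitive formula), the simplification comes out as

$$\Delta(x, \epsilon) \;=\; \frac{f(x+\epsilon) + f(x-\epsilon)}{2} - f(x) \;=\; \frac{c^2\epsilon^2\bigl(c^2\epsilon^2 - 3(1+cx)^2\bigr)}{2(1+cx)^2\bigl((1+cx)^2 - c^2\epsilon^2\bigr)^2},$$

which is negative whenever $c\epsilon < \sqrt{3}\,(1+cx)$, as the concavity argument requires, and whose numerator indeed carries only $\epsilon^2$ and $\epsilon^4$ terms.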
David Mumford and John Tate, in their 2015 obituary for Alexander Grothendieck, motivated Grothendieck’s use of nilpotent elements via situations where one can consider $\epsilon^2$ to be truly negligible—that is, to put $\epsilon^2 = 0$.
Here we have an ostensibly better situation: In our original expression for $f(x)$, the coefficient $c$ of $x$ has to stay pretty small. In the discrepancy, the linear term in $\epsilon$ has coefficient $0$ and the $\epsilon^2$ term has a coefficient proportional to $c^2$. Thus if we could work in an algebra where $\epsilon^2 = 0$, then the pinpoint value $f(x)$ and all averages over the uncertainty $\epsilon$ would exactly agree. No separate parameter $\epsilon$ would be needed.
Alas, insofar as the real world runs on real algebra rather than Grothendieck algebra, we have to keep the numerator and the denominator in full. One can choose $c$ to approximate the above green or red chess rating curves in various ways, and then compare the discrepancy for various combinations of $x$ and $\epsilon$. The discrepancies for my “Newtonian” $f$ tend to be about twice as great as for the standard curves. That is too bad. But I still wonder whether the above calculation of the prediction discrepancy—and its curious cancellation feature—has further uses.
Open Problems
What will AI be able to tell from our “track records” that we cannot?
Several theories of test-taking postulate a sigmoid relationship between a student’s ability $\theta$ and his/her likelihood $p(\theta)$ of getting a given exam question right. Changing the difficulty of the question shifts the curve left or right. For a multiple-choice question with $m$ choices the floor might be $\frac{1}{m}$ rather than $0$ to allow for “guessing,” but otherwise similar axioms hold. Inverting the various $p(\theta)$ gives a grading rubric for the exam. Do outcomes tend to be bunched toward the middle more than predicted? Are exam “ratings” (that is, grades) robust enough—as chess ratings are—to tell?
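The standard three-parameter logistic model of item response theory is one concrete instance of such a curve; here $a$ and $b$ are the usual discrimination and difficulty parameters (my notation, not the post’s):

$$p(\theta) \;=\; \frac{1}{m} + \left(1 - \frac{1}{m}\right)\frac{1}{1 + e^{-a(\theta - b)}}.$$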
Aggregating the curves for various questions on an exam involves computing weighted averages of logistic curves. Is there literature on mathematical properties of the space of such averaged curves? Is there a theory of handling discrepancy terms like the one I computed above?
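As a small handle on the first question, here is a sketch of such a weighted average in Python; the item parameters and weights are arbitrary placeholders, and one can probe numerically how close the aggregate stays to any single logistic curve:

```python
import math

def logistic(theta, a, b):
    """A single logistic item curve with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def averaged_curve(theta, items, weights):
    """Weighted average of logistic curves, e.g., the expected exam score at ability theta."""
    total_w = sum(weights)
    return sum(w * logistic(theta, a, b) for (a, b), w in zip(items, weights)) / total_w

items = [(1.0, -1.0), (1.5, 0.0), (0.7, 1.5)]  # hypothetical (a, b) per question
weights = [1.0, 2.0, 1.0]                      # hypothetical question weights
for theta in (-2, 0, 2):
    print(theta, round(averaged_curve(theta, items, weights), 3))
```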
[some word tweaks and typo fixes]




Ken, do you still think God’s chess rating should be around 3600, given recent developments (several engines above 3400, Leela Chess Zero rated almost as high with a very different playing style, etc.)? Have you compared Leela moves to Stockfish moves in your cheating studies? Thanks.
Exactly what I think is that a slightly randomized version of AlphaZero can score 25% against any strategy, i.e., hold a draw every other game. Then all strategies would have Elo no more than 200 points higher from games against it. (Appropriate technical quip: “God” is a universal quantifier.) 🙂
Thanks! Do you think the rating model breaks down at such high strength, e.g. could there be a strategy S (say Stockfish 20 on futuristic hardware) that also holds A0 to 25%, but that in turn scores less than 50% against “God”?
Re the technical quip, my first thought was that (generalized) chess is PSPACE-complete, so “God” would be nested quantifiers of arbitrary depth. But of course, a single universal quantifier over the natural numbers (i.e. a halting oracle) could also solve chess.
More generally, I wonder if there are well-documented cases where Engine A beats Engine B, B beats Engine C, yet C beats A in matches, owing to strategic (stylistic) differences. I don’t know whether CCRL or anyone has measured such “triangles” in the rating system. One point, however, is that the randomness in my hypothetical version (call it RalphaZero?) offsets the effectiveness of particular strategies against others. One could in theory apply the randomness to whole strategies (i.e., mappings from positions to moves) rather than just the best yea-many moves in a given position—this would bring in how matrix games achieve a guarantee regardless of opposing strategy.
The universal quantifier I have in mind is over strategies—one exponential higher than the quantifiers over moves and game histories involved in the PSPACE arguments. What I mean is that there is no single “perfect” player—rather one defines a quantity such as an opposing rating ceiling by quantifying over all possible opposing strategies.
Additional comment: I don’t know if you’re following it or are interested, but LC0 (Leela Chess Zero) now seems to be approaching Alpha Zero’s strength, or has possibly even reached it (I’m not convinced of the latter yet). See recent forum posts at lczero.org for more info.
In early June my first indications didn’t point to more than 2500 for LC0, but by mid-July and the WCCC it was at least 2950 by my IPR measure. That’s also where my measure seems to start losing distinguishing power—I think my values up to 3100 are OK but past that it’s mush. Indeed, if they make an easier-to-install non-GPU Linux/UNIX version of LC0 just for analysis not training, then I will add it to my panel of reference engines for large-scale data-taking.
I meant to emphasize, too, one aspect of the near-perfect linear fits in my “When Data Serves Turkey” post. You could get the same fit just from analyzing the games in the Elo 1800–2200 range, say, where the engine depths I use are far enough superior to be clearly authoritative. The charts in that post demonstrate that extrapolating this out to 2800 is perfect. Thus one cannot simply say that my 3400 intercept is because of my system losing mojo when analyzing games at 2800+. (There are of course other issues shown in that post, and my ironic upshot is that although IMHO either the “3200” or “3400” answer could pass peer review, both are wrong. Indeed if one used the first-line matching figures alone then 100% would be the perfection point and that would be above 4000. I believe a “true” intrinsic measure needs to blend those two with a notion of “challenge created” based on the depth-of-thinking and playing-on-impulse ideas set out here and here, but I am right now having to resort to a very heavy-handed approach to taming the resulting model.)
Ken, I wonder if you’ve tried play.lczero.org in easy mode: that means you’re playing directly against the neural network policy evaluation with no search at all. In other words pure positional judgment, yet it easily beats masters if they’re not careful. Sort of the opposite of what we expect from an engine, fantastic positional sense but lousy tactics instead of the other way around. Someone a while back said its likely true strength is in the 2000 range. I’d be interested in an IPR measurement if you’re up for that sometime. It might be a good “impulse vs thinking” test. It beats me easily but I’ve never been anywhere near 2000.
I was able to get CPU Leela running with some effort on an i5 linux box some months back, but got only around 30 nps, way too slow to play really well. Can you get hold of a GPU machine for your chess stuff? Leela is important, I think. It is already influencing high level human play from what I hear.
The GLL challenge would be quite an easy task for a careful human reader. Having read several posts by both, I can say that the writing styles of Dick and Ken are vastly different. Here is another fun observation (in addition to Ken’s observation on sentence lengths): All the “Oh well”s have been written by Dick, and all the “ostensibly”s have been written by Ken 🙂
I can claim every “ostensibly” but a few of the “Oh well”s have been mine as well, often to finish a thought expressed by phone.
Re weighted averages of logistic curves, there’s a large literature on Bayesian approaches to logistic regression that might be relevant. That includes both finite averaging over model choices, and simulation-based integration over the continuous posterior for a single model. I don’t off hand know of something that talks about the properties of the space of averaged models though.
And to continue the theme, Bayesian approaches to measurement error might be something to look at re the overconfidence problem.
I appreciate the simplifying you do, but surely you’ve oversimplified here. The fact that ratings are uncertain doesn’t mean they are the average of c+eps and c-eps. The distribution is obviously more complicated, and this just shows that it mustn’t be symmetric around its mean.
Fair enough. I did write in the last main section about “optional further dependence on how the (R_P,\epsilon_P) “glob” may skew as R_P grows into the tail of high outliers and on other dynamics of the population of rated players.” Added: The handling of epsilon resembles how concepts like border rank are defined.
More use of stylometry: https://slate.com/news-and-politics/2018/09/new-york-times-op-ed-anonymous-writer-trump.html