
Pandemic Lag

July 30, 2021


In chess ratings, and in what other measures of cognitive development?


Henri Didon was a French priest and promoter of youth sports in the late 1800s. He coined the phrase Citius, Altius, Fortius, meaning faster-higher-stronger, which served as the motto of the Olympic Games from their revival in 1896, becoming official when the Games were held in Paris in 1924. For the 2020 Games, being held now in 2021, the word Communiter, meaning together, has been added; it is said to express solidarity during the pandemic.

Today we review how the official measure of being faster, higher, and stronger at chess has been impacted by the pandemic.

Didon spoke of his words as “the foundation and raison d’être of athletics” amid the progress of humanity. They have been borne out by the steady progression of athletic records over the Games’ 125-year history. Whether the Tokyo Games will continue that trend is an open question. Besides the year’s delay and the pandemic’s impact on qualifying competitions and on athletic conditioning in general, there has emerged a question of mental effects amid the lack of spectators and the straitened atmosphere. The one example I’ll quote is the claim by the Hungarian swimmer Kristof Milak that a pre-race mishap with his favorite swimming trunks cost him a world record in an event he still won:

“They split 10 minutes before I entered the pool and in that moment I knew the world record was gone. I lost my focus and knew I couldn’t do it.”

At least the means of measuring athletic performances have not been disrupted. For psychometrics—a word meaning the science of measuring mental capacities and processes—the standardized tests most often used to measure aptitude have themselves been curtailed. This makes all the more open the question of how our youth have progressed during the pandemic in education on the whole. We will examine the special case of chess, where the official instrument has been almost entirely frozen for 15 months, but my own work carries both the ability and the responsibility to make up the difference.

Chess Ratings and Lag

The Elo rating system is simple but accurate enough to be used by sporting federations besides chess. In chess, 1000 is a typical rating for a novice player, 1600 means a good club player, 2200 is the threshold for “master,” and 2800 is world championship standard. A player’s rating measures skill in such a way that the difference from the opponent’s rating yields probabilities for predicting the outcomes of games between them. Elo is the main prediction engine of FiveThirtyEight for basketball, baseball, and football (but not soccer).
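To make “yields probabilities” concrete, here is the standard logistic form of the Elo expectation formula (the exact curve varies slightly between federations): a player rated {R_A} facing one rated {R_B} has expected score

{\displaystyle E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}}

per game. Equal ratings give {E_A = 0.5}, a 200-point edge gives about {0.76}, and a 400-point edge about {0.91}.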

Although the prediction formula uses only differences, so that an additive shift in all ratings would not affect the chances, I have shown that the ratings administered by the International Chess Federation (FIDE) have stayed stable in absolute terms relative to the objective quality of moves played, as measured by my own predictive model via my Intrinsic Performance Ratings (IPRs), which are geared to the FIDE rating scale. Having stable numbers is vital not only to my cheating tests but to the public understanding of the system on the whole. This goes for FIDE, for Internet gaming federations, and even for the use of Elo by Tinder.

Thus it is all the more sad for me to see things like this happen not only to FIDE’s Elo ratings but also to those of the US Chess Federation (USCF), which adopted Arpad Elo’s formulas in 1960:

[Figure: FIDE Rating Progress Chart of Annie Wang.]

This is the FIDE Rating Progress Chart of Annie Wang, who just won the US Junior Women’s Championship played in-person at the Saint Louis Chess Club last week. Her FIDE rating has been stuck at 2384 ever since the April 2020 rating list. One glance at the chart suffices to project her rating into the neighborhood of 2500 by now. Her USCF rating is closer at 2457, but this is offset by a long-known inflation of USCF ratings relative to FIDE, measured at about 75 points at that level in May 2020. Wang’s USCF rating has been similarly frozen. You can find the same for a plethora of young players, down to aspiring kids of single-digit age blasting out of three-digit ratings, as Wang did. They have a flat line like the ones circled in blue, but located where she had a sharp rise (circled in green):
[Figure: rating charts of young players, with pandemic-era flat lines circled in blue and a sharp rise like Wang’s circled in green.]

The Need to Adjust

The lag mattered immediately for me as I gave daily statistical reports to the tournament’s chief arbiter last week. Using Wang’s official rating would have underestimated her true strength and biased my reports in the direction of false positives. Instead, having developed a formula that I won’t claim is anything more than Fermi-estimated, I calculated her effective FIDE rating as 2482, adding almost 100 points. I would have upped her USCF rating to 2543 by the same formula.

Wang was both the highest rated among the ten competitors and the oldest, with a long enough record of international play to have her FIDE K-factor reduced from 40 to 20. My formula adds more points for lower ratings, higher K-factor, and younger age—all reflecting the arc of many improving junior players. My average increase to the women’s ratings was 199.1 points, versus 57.4 to the ten players in the junior men’s/mixed championship, who had mostly higher ratings to begin with.

Also playing in St. Louis were ten in the US Senior Championship, including last year’s winner Joel Benjamin, whom I knew and played in the 1970s when we were kids. Their ratings have likewise been frozen. Rating points in chess are zero-sum, so the triple-digit gains I have credited to the young would in normal times have been taken out of other players—most plausibly, us geezers. There are more of us than keen juniors, so the presumed individual losses would be smaller.

Did that prove out? My IPRs furnish a way to verify. They differ from other deployed quality metrics by organically involving the difficulty of the positions a player faces, in several ways besides the complexity and temptation factors I incorporated two years ago. Here are the results—but bear in mind that these three 10-player tournaments are small data: their two-sigma error bars on the average IPRs are about {\pm 80} Elo points.

  • US Jr. W: Avg. rating 2101, adjusted 2300, avg. IPR 2337 (+37).

  • US Jr. M: Avg. rating 2492, adjusted 2550, avg. IPR 2527 (-23).

  • US Sr. M: Avg. rating 2494 (no adjustment), avg. IPR 2459 (-35).

The truly significant result is that the women performed much closer to my adjustment than to their official ratings. The men were only slightly closer, amid general insignificance that applies also to the seniors. The juniors combined came quite close to my projections.

Right now I am gathering data from larger Open tournaments in this first month of widespread in-person play. There have been some hits and misses, and I have not yet evaluated all (un-)controllable factors. But gathering the original large data for my adjustment formula required coping with a major factor: the 100–200x higher evident cheating rate I’ve observed in online chess.

How To Be Not Very Wrong

I first perceived the phenomenon when monitoring the European Youth Online Rapid Chess Championship last September. I compiled full analysis on all 689 competitors in women’s and men’s/mixed sections ranging from Under-12 to Under-18. Besides four particular cases, my results said that probably at least four of another five were cheating, but without the confidence needed to flag any one of them. Removing the high outliers did not, however, bring either the IPRs or my sharper test of conformance to the bell curve into line with my projections. The Under-12 M and W and Under-14 M sections had IPRs averaging 83, 235, and 125 points higher, respectively. The Under-14 W and U16 and U18 sections were close to my projections, so I did not suspect general modeling issues.

The online World Youth Rapid Championships in November-December, which added an under-10 division, brought the lag phenomenon out in force, on all continents. The correction I postulated even before that tournament finished was:

15 Elo {\times} (months since April 2020), higher for those under 13 (50% to 2x higher).

There are several reasons I have not tried to be more precise. There is uncertainty about how many high outliers to remove, about faster time controls, and about geographical drifts in ratings. The effect depends on how much a junior player is disposed to improve in the first place; I found it absent in the lower divisions of the UK’s junior leagues played online last winter. In an individual cheating case I take a more particular fix on the appropriate rating. The purpose of the equation is to show the fairness of my baseline relative to the field on the whole. There are also non-cheating purposes, which should come to the fore as FIDE and other federations emerge from the pandemic, and which I discuss next.

I have been using essentially this formula ever since. From large scholastic tournaments across the globe this spring, I settled on fixing the adjustment for those with birth year 2008 or later as 25 Elo {\times} (months since April 2020). For players with official rating {R > 2000} I apply the rough multiplier {(3000 - R)/1000}, and for those with {K < 40} I (also) multiply by {\sqrt{K/40}}.
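As a concrete check, here is a minimal Perl sketch of the formula exactly as just stated; the only assumption beyond this paragraph is that Wang’s birth year falls before 2008. On her July 2021 numbers (official rating 2384, K-factor reduced to 20, and 15 months elapsed since April 2020) it reproduces the 2482 figure given above:

sub lagAdjust2021 {
  my ($rating,$lagMonths,$birthYear,$kfactor) = @_;
  my $perMonth = ($birthYear >= 2008 ? 25 : 15);         # Elo points per month of lag
  my $add = $perMonth*$lagMonths;
  $add *= sqrt($kfactor/40.0) if ($kfactor < 40);        # reduced-K discount
  $add *= (3000.0 - $rating)/1000.0 if ($rating > 2000); # taper for higher ratings
  return $rating + $add;
}

# Annie Wang in July 2021: rating 2384, 15 months of lag, K = 20,
# birth year before 2008 (exact value immaterial here):
printf "%.0f\n", lagAdjust2021(2384, 15, 2002, 20);      # prints 2482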

I won’t claim the ’15’ and ’25’ are exactly right compared to multipliers that are 1 or 2 higher or lower. But the results I have been getting all year say that my 15 and 25 are most often closer than factors of 10 or 20 or 30 would be. In almost all cases, as for the US Jr. W above, the error left by my pre-set rating calibration has been an order of magnitude smaller than the adjustments themselves. Taking a cue from the title of Jordan Ellenberg’s predecessor to the book of his I previewed last month, my main concern is to be not very wrong.

A Dilemma Moving Forward

Providing an accurate and stable rating system has long been recognized as a prime service of FIDE. A legal dimension has been added insofar as evaluating cheating allegations requires a prior assessment of the natural skill of the accused player. The pandemic has made me take over much of the latter responsibility, but the former presents a wider dilemma doubtless faced in some form by other impacted sporting federations and educational assessment agencies on the whole:

Is it a higher responsibility to provide the most accurate assessment of current ability obtainable now, or to maintain continuity of the official assessment mechanism?

I could go even wider and analogize this to the US Census debate over whether statistical estimation, presuming its greater accuracy could be demonstrated, should be used in preference to the conducted count. The latter is enshrined in the US Constitution, while the principle that chess rating points should be won or lost only in actual combat is similarly hallowed. But I have certainly “demonstrated the obvious”: that the current official ratings of almost all the keenest young players are very wrong.

Mathematically, the rating system will re-establish equilibrium if the current discrepancy is left alone. The trouble is that the mathematical nature of the update and the relative paucity of chess games also guarantee that the process will be slow, measured in years. FiveThirtyEight has remarked in several recent articles on the long update times in baseball as measured by Elo ratings. My cheating tests often cannot wait a day. I have to use my cross-check and validation features to detect and remove a huge amount of what is mathematically the same kind of bias believed to afflict other currently-deployed predictive models less transparently.
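To see the slowness concretely, here is a toy Perl simulation, not FIDE’s actual update procedure (which pools games into rating periods): assume a player whose true strength sits 100 points above their official rating, facing opposition rated equal to that official rating, with K = 20. Each game is expected to close the gap by K times the difference between the true and nominal expected scores, so the gap decays geometrically; the loop reports roughly 80 games to get under 10 points, which at typical classical playing rates is indeed a matter of years:

my $gap   = 100;    # true strength minus official rating, in Elo points
my $K     = 20;
my $games = 0;
while ($gap > 10) {
  my $expected = 1.0/(1.0 + 10.0**(-$gap/400.0)); # true score expectation vs. equal-rated foe
  $gap -= $K*($expected - 0.5);                   # expected rating gain per game
  $games++;
}
print "$games games to shrink a 100-point lag below 10 points\n";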

There is precedent for a large-scale adjustment of ratings by FIDE. Women’s chess used to be even more segregated from men’s than it is today. In 1986, Arpad Elo himself—as secretary of FIDE’s Qualifications Commission—reported that women’s ratings had drifted down by about “one half of a class interval.” FIDE added 100 points to the rating of every active female player except Susan Polgar, whose rating was already ‘well-mixed’ according to the report, since she had faced many more male players than the others.

Attempting to resolve that historical controversy by computing IPRs for Polgar and the other players in Elo’s study has never reached my front burner. But the point remains that my work is uniquely capable of informing the state of ratings in a radical manner. The pandemic has created both a need and an opportunity for a reset that could also solve other issues previously noted—while ensuring that ratings on all continents are on a common scale.

Open Problems

How pronounced is the lag of assessment in education and other competitive arenas, both physical and in mind-sports?

I had not noticed that Tyler Cowen had already used the term “psychometric test” in a post on the Marginal Revolution blog at the beginning of the pandemic, until he repeated it just today.

I have hinted at some other issues in chess but stopped short of addressing them here. One is whether online play, in which 5-minute “Blitz” down to 1-minute “Bullet” time controls predominate even over “Rapid” beginning at 10 minutes, has a similar effect on development in the absence of any in-person “Classical” chess. Another is whether the observed increase in the ranks of players with 2700+ elite ratings is really Fortius or merely rating inflation. A third is whether the current conditions for in-person chess will last long enough to get a good fix on the ‘post-pandemic’ state of skill, and a fourth—coming back to what I quoted about the current Olympics—is whether they are truly “normal” enough even now.


Update 9/15/21: The generally higher relative accuracy has continued through the FIDE World Youth Rapid in July/August and the FIDE Online Olympiad in August through today. The game analyses continue to be taken using resources of the UB Center for Computational Research (CCR).



Update 8/2/23: The formula has evolved in just the following ways: inserting a 20x case for young teens, tapering off gains above Elo 2000 by treating the formula as a differential from there on, and a potential adjustment for female players mentioned here. Here is my current code in Perl—note that most youngsters have a K-factor of 40, so for gains under 2000 by kids not born in 2008 it is completely unchanged:


sub round { return int($_[0] + 0.5); }  # helper assumed elsewhere; $add is never negative here

sub lagAdjust {
  my ($rating,$lagMonths,$birthYear,$kfactor,$gender,$noReturnIfOlder) = @_;
  my $add = 0;
  # Potential female-player adjustment, currently disabled:
  # my $genderFactor = (($gender eq "F" && $birthYear <= 2007) ? 0.5 :
  #                     ($gender eq "F" && $birthYear <= 2009) ? 0.75 : 1.0);
  my $genderFactor = 1.0;
  if ($birthYear >= 2009) {
    $add = &round(25*$lagMonths*sqrt($kfactor/40.0)*$genderFactor);
  } elsif ($birthYear >= 2008) {   # the 20x case for young teens
    $add = &round(20*$lagMonths*sqrt($kfactor/40.0)*$genderFactor);
  } elsif ($birthYear >= 2000) {
    $add = &round(15*$lagMonths*sqrt($kfactor/40.0)*$genderFactor);
  } elsif ($noReturnIfOlder) {
    return 0;
  }
  my $b = $rating + $add;  # adjust $add less if into super-2000 region
  my $a = ($rating >= 2000.0 ? $rating : 2000.0);
  if ($b > 2000.0) {       # integrate adjustment as differential from a to b
    $add = ($b - $a)*(3.0 - ($b + $a)/2000.0);
    if ($rating < 2000.0) {  # gains up to 2000 still count in full
      $add += 2000.0 - $rating;
    }
  }
  if ($rating >= 3000) { $add = 0; }
  return ($rating + $add);
}
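To spell out the tapering step: treating the old multiplier {(3000 - r)/1000} as a differential rate of adjustment gives

{\displaystyle \int_a^b \left(3 - \frac{r}{1000}\right) dr \;=\; (b - a)\left(3 - \frac{a+b}{2000}\right),}

which is the expression in the code. One consequence, continuing the example above: Annie Wang’s July 2021 numbers (2384, 15 months, K = 20) would now come out near 2469 rather than 2482, since the rate tapers continuously as the adjusted rating climbs above 2384.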


[changed first figure to show the March 2020 pandemic start accurately; some minor word changes; added updates.]

5 Comments
  1. José de Jesús García Ruvalcaba
    November 9, 2021 9:53 pm

    Hi.
    Sorry for posting this in a tangentially related entry.
    I thought you might not read replies to older entries.

    My questions are about the Intrinsic Performance Ratings (IPR).
    In an interview, Nona Gaprindashvili stated:
    [begin Quote]
    “When I started playing with men, there was a bit of a sceptical attitude towards women … men didn’t perceive (women) as strong enough,” she said.
    While everyone treated her politely, male opponents put in an extra effort to see her off, as losing to a woman was considered “humiliating”, she said.
    “Of course, they didn’t want to lose a game with a woman, especially when I was a skinny little girl,” she said, adding it took her about a year and a half of competing to be considered “a fully-fledged member of the circle”.
    [end Quote]
    I took this from https://news.trust.org/item/20201130170326-owkh6
    But I think I have read it somewhere else too.

    I think this can be stated in terms of IPR:
    The aggregated IPR of Gaprindashvili’s opponents against her was higher than their IPR among themselves. I guess this can be tested: the null hypothesis would be that both IPRs are the same.

    My second question is about computing IPRs from my own over-the-board games. I have read your articles and understand most of them (I have an M.Sc. in math). I can analyze the games in multi-PV mode with Stockfish, Komodo, Crafty, Lc0, or another engine, to a fixed depth. Most likely I can do a regression on the test set T from my games to fit the parameters of your model as described in https://cse.buffalo.edu/~regan/papers/pdf/RMH11b.pdf
    (However, I have not written any significant amount of code in several years, there is a big chance of me making a mistake, or having a very inefficient solution).
    Also, I see in your blog entries that you now use newer versions of the engines, deeper searches, mention “depth of satisfying”, etcetera.
    How can I compute the IPR from my games (or the games of my friends)? What I want to avoid is the following: letting my computer analyze for hours or days, then realizing I had a wrong setting… a big waste of time and electricity.

    Greetings.

    • November 9, 2021 10:30 pm

      In GM Gaprindashvili’s case, one obstacle would be that the rating system did not exist in the 1960s when she started playing in men’s events. By the time I played with her at Lone Pine 1977 (I can say “with” since I finished tied for 7th to her tie for 1st) she was long since “a regular.”

      Regarding IPR, here is a shortcut you can take and then the main principle that sets IPR apart from all the shortcuts. The shortcut is to compute the first-line match (aka T1 match) and some version of “average centipawn loss” that scales down according to the position value on the whole (see here for why), regress those against rating, and combine them into one index. The issue is that you will get skewed results if you happen to play either much more pacifistically or more risky-tactically than the average player. Put another way, the simple indices (like Chess.com’s CAPS) do not account for the difficulty of the positions you face. The principle is to standardize this by using the same “test set” for all players—regressing on the player’s own games only to get the player-profile parameters to use for the simulated test. Doing this requires full predictive analytics to pull off—and these two posts highlight the vicissitudes. But if you’ve played 30-50 tournament games in a single year and are a mainstream player, the shortcut will give a good enough indication.
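      In rough Perl, the shortcut might look like the following sketch; the damping form and the regression coefficients are placeholders that you would fit from reference games of known-rated players, not values from my model:

      # Per-move records assumed already extracted from engine analysis:
      #   match -> 1 if the played move equals the engine's first line (T1), else 0
      #   loss  -> centipawn loss of the played move
      #   eval  -> engine evaluation of the position in centipawns
      sub shortcutStats {
        my @moves = @_;
        my ($t1, $scaled) = (0, 0);
        for my $m (@moves) {
          $t1 += $m->{match};
          # Damp the loss in lopsided positions; this 1/(1 + |eval|/200) form is only illustrative.
          $scaled += $m->{loss}/(1.0 + abs($m->{eval})/200.0);
        }
        my $n = scalar(@moves) or return (0, 0);
        return ($t1/$n, $scaled/$n);  # (T1 match rate, average scaled loss)
      }

      # Hypothetical linear fit of rating against the two indices; the
      # coefficients below are placeholders, NOT values from the IPR model.
      sub ratingFromStats {
        my ($t1rate, $avgLoss) = @_;
        return 1000.0 + 3000.0*$t1rate - 15.0*$avgLoss;
      }

      # Made-up example: 52% T1 matching, 18 scaled centipawns lost per move:
      printf "%.0f\n", ratingFromStats(0.52, 18);  # prints 2290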

      • José de Jesús García Ruvalcaba
        November 10, 2021 4:03 pm

        Hi.

        I thought we would not need actual ratings to compute IPRs. For Gaprindashvili, I was thinking about a tournament against men where all the games are available (like Reykjavik 1964). Compute the IPR of the men against Gaprindashvili, and the IPR of the men amongst themselves, and see if there is any significant difference. Probably more events in which she played against men can be aggregated, but finding complete records is difficult for older tournaments.

        As for myself, I have played 183 tournament games at standard time controls from 2004 to 2019. In this period, my most active year was 2008, with 24 games. Very far from 50 games, but perhaps close enough to 30. I would need to check whether I have played more games in some 12 consecutive months (not necessarily January to December). When I was a teenager I easily played 50 games in a year, but that was a long time ago (and possibly teenagers cannot be considered mainstream players, since they can improve a lot in a few months).
        Unfortunately, I lost my old score sheets (I remember there were slightly over 400 games there); and I have not played the last two years due to the pandemic.
        I will try the shortcut you suggest for all my recorded games. I will also try to semi-manually do the full analysis for the 13 games I played in 2019.

        Greetings.

Trackbacks

  1. Best To Dean Mynatt | Gödel's Lost Letter and P=NP
  2. Should These Quantities Be Linear? | Gödel's Lost Letter and P=NP
