Stop Cheating—Again
More than just stopping carelessness
Sholto David is featured in this past Tuesday’s New York Times Science section for his work uncovering cheating in medical-research papers.
The article is titled “Catching Too Many Errors by Sloppy Scientists” but goes beyond that. It follows a similar exposé in the Guardian the previous week. The Times author, Matt Richtel, has written a flurry of columns on recent medical and social science findings, while the Guardian’s Ian Sample is their science editor.
David, who has a PhD in molecular biology from Newcastle University, has been fascinated by errors in science for many years. What caught my attention is that much of what he finds is a kind of cheating. We have discussed Ken’s long-term research on cheating in chess, and David’s work struck me as similar to Ken’s, except that the arena is not chess but the writing of scientific papers. Since these papers often report medical results, it is possible that the cheating now raises safety issues. Very interesting.
Twixt Carelessness and Cheating
The two newspaper articles use kid gloves compared to the original work by David that they reference. The Guardian piece opens by speaking of “flaws in scientific papers” being “rectified.” It then recounts that Harvard’s Dana-Farber Cancer Institute is “seeking to retract six papers and correct 31 more.” In our view, the eye should stop at “retract six papers.”
The Times article has the same six-and-31 sentence after linking an earlier Times story under “flawed and manipulated data,” and later speaks of his finding “mistakes, or malfeasance.” The interview portion of the article gets to the crux:
More generally, do the mistakes you find seem innocent to you?
I think we can all understand that sometimes an image may have been copied and pasted by mistake. But there are more complicated examples where images are being rotated, transformed or stretched. Those kinds of examples are less savory. There are other examples […where…] the image has been assembled in a way that you couldn’t reconstruct a sensible experiment from. Sometimes there are cases where people are using Photoshop to more extensively edit images.
David’s original Dana-Farber article has a lot of examples. He doesn’t mince words, saying at the outset, “Far worse skeletons than plagiarism lurk in the archives…” It’s snarky but on-point and, frankly, scary. Here is one example, showing large identical regions in scans from two different patients out of six in a study:
This reminds Ken of space photos with multiple images of the same galaxy owing to relativistic gravitational lensing—but here the duplication is much more precise. Elisabeth Bik, a Dutch microbiologist and scientific integrity consultant, has identified thousands of such cases. She co-won the 2021 John Maddox prize for her work.
Telling When Images Are The Same
David used a simple criterion: if two images are the same, whether from different papers with no cross-reference or from two different tests in the same paper, then this is cheating.
The trouble, of course, is: how hard is it to tell whether two images are essentially the same? There are a ton of papers and thus a lot of images. What David did was clever: he found a cheap way to tell quickly, on a mass scale, whether images are the same. This is a kind of hashing result that I found cool.
For how he does this, see here. Roughly, he reduced each image to a list of distances between objects in the image. These distances formed a type of hash that encoded the image. The hashes were much easier to keep track of than whole images, so testing whether two images were the same became fast. This cheap test clearly made it easy for him to discover whether a paper cheated.
See this for more examples of hashing of images.
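The articles do not spell out David’s exact hash, but the general idea can be illustrated with a standard perceptual-hashing scheme in the same spirit, the “difference hash”: reduce each image to a short bit string so that near-duplicates land at small Hamming distance, then compare bit strings instead of pixel grids. This is a minimal sketch, not David’s method; the images here are just small grayscale grids.

```python
# Sketch of a perceptual "difference hash" (dHash) as one cheap way to
# fingerprint images at scale.  This is an illustration of the general
# hashing idea, not Sholto David's actual procedure.

def dhash(pixels):
    """pixels: 2D list of grayscale values (e.g. 8 rows x 9 columns).
    Returns an integer whose bits record, for each horizontally
    adjacent pair of pixels, whether the left one is brighter than
    the right.  Near-duplicate images yield nearly identical bits."""
    bits = 0
    for row in pixels:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (left > right)
    return bits

def hamming(h1, h2):
    """Number of differing bits; a small distance flags likely duplicates."""
    return bin(h1 ^ h2).count("1")
```

A slightly brightened copy of an image hashes to the same bits (distance 0), while a genuinely different image lands far away, so a database of hashes supports mass-scale duplicate screening without storing or comparing full images.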
The real question is, what could cause people to do such copy-and-pasting? When is it not inadvertence? Ken is no more a biologist than I, but he can contribute a temptation he faced that might lend some insight. Over to Ken now.
Oh, Scrubbitt!!—?
In the second half of 2019, I completed an update to my chess model that took four agonizing years. I then turned to measuring in greater detail the quality of my model’s predictions of chess moves, in particular what I had gained from the update. This post from late 2019 explored error models and prediction metrics. Per an update in January 2020, the best model for the true probability of a move in terms of my model’s probability
and a fittable error term
turned out to be
This makes projections of low-probability moves—such as blunders—have smaller absolute error but higher relative error than probable moves. Fitting against the Elo chess ratings of thousands of players would tell how sharp the projections are and also how sharpness tails off when estimating the world’s best players. For reasons described earlier here, I’ve felt compelled to build my model separately for every chess program (and major “engine” version) employed to test players. My update used the then-current versions 11 of the open-source Stockfish engine and 13.3 of the commercial Komodo engine, plus the earlier versions 7 of Stockfish and 10 of Komodo.
Here are the results I obtained for the fitted error term against rating, graphed using Andrew Que’s online regression applet:
Cue the ditty from the TV show Sesame Street:
“One of these things is not like the others. One of these things just doesn’t belong.”
The first three diagrams gave a highly consistent picture: my model is uniformly sharp except as ratings go above the 2600 “super-grandmaster” level. Stockfish 11, however, went off the rails—as if to say its rendition of my model became infinitely accurate by that level.
Mixing with another metric in a 90:10 split at least removed the infinities, though it still left the curve turning in the wrong direction:
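The 90:10 blend is just a convex combination of two metrics. A hypothetical sketch, with placeholder names since the post does not specify the two quantities being mixed:

```python
# Hypothetical sketch of the 90:10 mix: a convex combination of a
# primary fit metric with a better-behaved secondary one.  If the
# secondary metric stays bounded away from degenerate values, the
# mixture does too.  Names are illustrative, not Ken's actual metrics.

def mixed_metric(primary, secondary, weight=0.9):
    """Weighted 90:10 average of two fit metrics."""
    return weight * primary + (1 - weight) * secondary
```

Even when the primary metric collapses to zero, the 10% contribution from the secondary metric keeps the mixture nonzero, which is how such a blend can tame degenerate fits.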
The pandemic hit (driving me into the world of online chess, with its 100–200x higher cheating rate) before I resolved this issue. Because the computer code for using all four engines was the same, and because this plot told me the issue with Stockfish 11 should be fixable, I filled in conservative coefficients along the lines of the other programs. I had an all-important fail-safe: verifying in myriad randomized resampling trials that the test’s outputs over (presumably non-cheating) players conform to the normal distribution. The new test using Stockfish 11 passed this check much the same as the tests developed with the other three chess programs.
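The fail-safe can be pictured with a simple resampling sketch: draw many random samples from a (presumed clean) pool, compute a test statistic on each, and check that the statistics behave like a standard normal. The z-score here is a stand-in, not Ken’s actual cheating test.

```python
# Minimal sketch of a resampling normality check: resample a presumed
# clean player pool many times, compute a z-score on each sample, and
# verify the z-scores have mean ~0 and standard deviation ~1.
# The statistic below is a generic stand-in, not Ken's actual test.

import random
import statistics

def resampling_check(population, trials=2000, sample_size=50, seed=1):
    rng = random.Random(seed)
    mu = statistics.fmean(population)
    sigma = statistics.pstdev(population)
    zs = []
    for _ in range(trials):
        sample = rng.choices(population, k=sample_size)  # resample w/ replacement
        # z-score of the sample mean under the pool's parameters
        zs.append((statistics.fmean(sample) - mu) / (sigma / sample_size ** 0.5))
    return statistics.fmean(zs), statistics.stdev(zs)
```

If the returned mean drifts from 0 or the spread from 1, the test statistic is not conforming to the normal distribution over clean players, and any cheating verdicts built on normal-tail p-values would be suspect.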
Later in 2020 I adjusted a key hyperparameter to be even more conservative than what I’d used in the first diagram here, and the expected kind of curve emerged in numerically stable fashion at last:
Thenceforth I calibrated the test according to Hoyle. It all ultimately mattered little, because the new cheating test—using whichever engine—turns out to be no more or less sensitive than my regular one. I still report its results as “provisional.”
My point is, I temporarily behaved as though I had mentally copied and pasted in one of the other graphs. It was “OK” because I had other guardrails. I am still largely a one-man band. It would be enlightening to learn why this sort of thing happens in large labs with teams of helpers.
Open Problems
What are the best image hashing methods? Should journals test all images in submitted papers to see if they have ever been used before?
Regarding the issue of detecting cheating with statistics, let me mention several sources. The first is a 2012 blog post by Omer Reingold, “Rigged Lottery, Bible Codes, and Spinning Globes: What Would Kolmogorov Say?” I commented there on the question of whether statistics can be used (after the fact) to raise suspicions, or even to prove those suspicions, about the integrity of a statistical experiment, especially in connection with the familiar problem of biased data selection. I mentioned some methods we (McKay, Bar-Natan, Bar-Hillel, and I) used in our study from the late 90s on work claiming hidden codes in the Bible. I expressed the view that, in general, statistical tests applied after an experiment is conducted have rather small power for examining the integrity of experiments, and speculated about whether some sort of “interactive protocols” (like interactive proofs in TCS) could be helpful.
The second source is a 2006 PowerPoint presentation entitled “How to Detect Lies with Statistics” by Maya Bar-Hillel. This was a talk Maya gave at the conference honoring Prof. Ester Samuel-Cahn in Jerusalem, December 18–20, 2006, and it described a research project that Maya Bar-Hillel planned with Yossi Rinott, David Budescu, and myself. In the end we did not pursue it, mainly because each of us was involved in various other projects (but also because we were skeptical about some aspects of it). Maya’s title is a variant of the title of Huff’s book How to Lie with Statistics. Maya ended her presentation with a famous quip by Fred Mosteller, “It is easy to lie with statistics, but easier to lie without them,” and wrote: “Likewise, we should say: ‘It is possible to detect (some) lies with statistics, but easier to detect them with other means.’”
The third source is Uri Simonsohn’s homepage. Simonsohn has several works on detecting scientific flaws with statistics.
There are also classic examples, like Ronald Fisher’s claims against Gregor Mendel’s experiments and the claims against Cyril Burt’s research on IQ. Some controversy about even these cases remains today.
Indeed, Simonsohn is a co-host of the blog Data Colada, which was at the forefront of discussions over Francesca Gino in particular. Dick and I have intended to blog about that, but it’s something where if one wants to get in, one must get in quite deep.