
‘Comically bad’ datasets used to train clinical models for stroke and diabetes


Scrolling through an online image dataset, Adrian Barnett, a statistician at the Queensland University of Technology in Australia, pointed out a few familiar faces. Sylvester Stallone as Rambo, and then again on the red carpet. “This is just ridiculous,” Barnett said. George Clooney, Angelina Jolie and Daniel Craig all appear more than once, often with the same image.

This particular dataset, collected in a folder titled “droopy” and hosted on an open-source repository called Kaggle, underpins a paper published in Scientific Reports – not as a find-the-celebrity game, but as a training set for a predictive clinical model for early detection of strokes.

The paper is the most recent example of a much wider problem that Barnett and his Ph.D. student Alexander Gibson have documented with Kaggle. By examining two other Kaggle datasets on stroke and diabetes, both of which included tabular patient data, Gibson and Barnett traced how the data move through the scientific literature and, in some cases, into clinical use. Their work, described in a preprint posted to medRxiv in February 2026, has already led to several retractions of papers that used these dubious datasets.


Source
https://retractionwatch.com/2026/05/18/kaggle-dataset-clinical-models-stroke-diabetes/



