‘Comically bad’ datasets used to train clinical models for stroke and diabetes

26 May 2026

Scrolling through an online image dataset, Adrian Barnett, a statistician at the Queensland University of Technology in Australia, pointed out a few familiar faces. Sylvester Stallone as Rambo, and then again on the red carpet. “This is just ridiculous,” Barnett said. George Clooney, Angelina Jolie and Daniel Craig all appear more than once, often with the same image.

This particular dataset, collected in a folder titled “droopy” and hosted on an open-source repository called Kaggle, underpins a paper published in Scientific Reports – not as a find-the-celebrity game, but as a training set for a predictive clinical model for early detection of strokes.

The paper is the most recent example of a much wider problem that Barnett and his Ph.D. student Alexander Gibson have documented with Kaggle. By examining two other Kaggle datasets on stroke and diabetes, both of which included tabular patient data, Gibson and Barnett traced how the data move through the scientific literature and, in some cases, into clinical use. Their work, described in a preprint posted to medRxiv in February 2026, has already led to several retractions of papers that used these dubious datasets.

Source: https://retractionwatch.com/2026/05/18/kaggle-dataset-clinical-models-stroke-diabetes/
