For Doctors in a Hurry
- Researchers investigated whether machine learning radiomics models in high-impact journals possess sufficient training data to support their algorithmic complexity.
- The study evaluated 28 assessable research articles that developed binary prediction models and reported external validation results.
- Only three studies (10.7%) met minimum stability criteria; across the assessable studies, the median shortfall was 195.5 training instances per model.
- The authors concluded that most models are trained on statistically insufficient datasets, which increases the risk of overfitting and instability.
- Clinicians should exercise caution as these unreliable predictions may misinform medical decision-making and contribute to the ongoing reproducibility crisis.
The Statistical Integrity of Radiomic Biomarkers in Clinical Practice
Radiomics has emerged as a potential tool for the non-invasive characterization of malignancies, offering clinicians quantitative insights into tumor grading and nodal metastasis in conditions such as endometrial cancer [1]. By extracting high-dimensional data from standard imaging like CT and MRI, these models aim to predict molecular biomarkers in glial tumors [2] and monitor early chemotherapy response in breast cancer [3]. While these applications suggest a path toward more personalized oncology, the transition from research to bedside is often hindered by significant methodological heterogeneity and a lack of uniform standards [4, 5]. The reliability of these predictive tools depends heavily on the statistical rigor of their development, yet many studies continue to report high performance metrics despite small cohort sizes [6]. A recent analysis now scrutinizes the foundational data requirements of these models to determine if they are truly ready for clinical integration.
Benchmarking Model Complexity Against Sample Size
The researchers conducted a systematic evaluation of training sample size adequacy within machine learning-based radiomics models that had undergone external validation. To ensure the analysis reflected the highest current standards in the field, the study focused on original research articles published between January 2023 and August 2025 in first quartile journals, which represent the top 25 percent of publications by impact factor. The investigation followed a prespecified and publicly archived protocol to maintain methodological transparency. To determine the final cohort of studies for analysis, the authors employed a randomized dynamic screening protocol (a method of selecting studies for review that minimizes selection bias) combined with an a priori power-calculated stopping rule, a threshold fixed before screening began that specified how many studies had to be assessed for the review's own conclusions to be adequately powered. The analysis specifically targeted studies that developed binary prediction models, such as those distinguishing benign from malignant tissue, using machine learning algorithms other than standard logistic regression. To assess whether these models were trained on enough data, the researchers applied a sample size framework originally developed for logistic regression. This framework served as a conservative lower-bound benchmark: it establishes the minimum number of patients required for a simpler model and therefore likely underestimates the data needs of more complex machine learning algorithms. The minimum required sample size for each study was calculated from three parameters: reported training performance, outcome prevalence (the proportion of patients in the cohort with the condition of interest), and feature dimensionality, that is, the number of distinct imaging characteristics, such as tumor texture, shape, or density, that the algorithm processes to make a prediction. By comparing these calculated requirements against the actual number of patients used in the published studies, the researchers quantified the gap between current academic practice and the statistical requirements for stable, reliable clinical tools. This comparison is vital for clinicians, as models trained on insufficient data are prone to overfitting (a statistical error in which an algorithm learns the random noise of a specific dataset rather than the underlying pathology) and therefore produce unreliable predictions when applied to new patients.
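To make this kind of benchmark concrete, the sketch below loosely follows minimum sample size criteria of the sort proposed for logistic regression prediction models (for example, by Riley and colleagues), combining a shrinkage-based criterion, a criterion for estimating the overall outcome proportion precisely, and the classic events-per-predictor heuristic. The feature count, outcome prevalence, assumed Cox-Snell R-squared, and training size are hypothetical values chosen for illustration, not figures from the review, and the formulas are a simplified approximation rather than the authors' exact calculation.

```python
import math

def min_n_shrinkage(p: int, r2_cs: float, shrinkage: float = 0.9) -> float:
    """Shrinkage criterion: sample size so that expected shrinkage of
    predictor effects is no worse than `shrinkage` (default 0.9)."""
    return p / ((shrinkage - 1) * math.log(1 - r2_cs / shrinkage))

def min_n_intercept_precision(prevalence: float, margin: float = 0.05) -> float:
    """Estimate the overall outcome proportion (the model intercept)
    to within +/- `margin` with 95% confidence."""
    return (1.96 / margin) ** 2 * prevalence * (1 - prevalence)

def min_n_events_per_predictor(p: int, prevalence: float, epp: float = 10.0) -> float:
    """Classic heuristic: at least `epp` outcome events per candidate feature."""
    return epp * p / prevalence

# Hypothetical study: 20 radiomic features, 30% outcome prevalence,
# an assumed Cox-Snell R^2 of 0.2, and 150 training patients.
p, prevalence, r2_cs, n_actual = 20, 0.30, 0.20, 150

required = max(
    min_n_shrinkage(p, r2_cs),
    min_n_intercept_precision(prevalence),
    min_n_events_per_predictor(p, prevalence),
)
print(f"minimum required n ~ {math.ceil(required)}, "
      f"actual n = {n_actual}, deficit = {math.ceil(required) - n_actual}")
```

Because the most demanding of the three criteria sets the requirement, even a modest feature set can push the minimum sample size into the high hundreds, which is exactly the kind of gap the review quantifies.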
Widespread Deficits in Training Data and Reporting
The systematic review initially identified 64 full-text records for evaluation, but a significant portion of these high-impact publications lacked the transparency necessary for statistical verification. Specifically, 16 of the 64 records (25%) were unassessable because they failed to report essential parameters, such as feature counts, that are required to estimate the minimum sample size needed for a stable model. This lack of reporting prevents clinicians and peer reviewers from determining whether a model is mathematically sound or merely a product of statistical noise. For the remaining 28 assessable studies that provided sufficient data for analysis, the researchers found that training sample sizes were consistently and substantially inadequate to support the complexity of the machine learning algorithms employed. The gap between the data used and the data required was large: the assessable cohort showed a median deficit of 195.5 training instances, meaning the typical study fell nearly 200 patients or imaging sets short of even a conservative threshold for model stability. Furthermore, most studies failed to meet basic heuristics used in clinical prediction modeling, such as the 10 events per predictor rule. This guideline suggests that for every variable or feature included in a model, at least ten outcome events, such as confirmed diagnoses, are needed to prevent the model from becoming overly tailored to a specific, small dataset. The analysis revealed a median events per predictor deficit of 5.8, indicating that many models operate with less than half of the minimum recommended data, a shortfall the authors link to the ongoing reproducibility crisis in radiomics research.
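As a simple illustration of what a shortfall of this size means in practice, the snippet below works through the events-per-predictor arithmetic with hypothetical numbers chosen so that the gap matches the review's reported median deficit of 5.8 events per predictor; it does not reproduce any individual study's data.

```python
# Hypothetical events-per-predictor (EPP) audit for a single study.
n_train, prevalence, n_features = 120, 0.35, 10

events = n_train * prevalence          # 42 outcome events
epp = events / n_features              # 4.2 events per predictor
epp_deficit = 10 - epp                 # 5.8 short of the 10-EPP heuristic
extra_events_needed = 10 * n_features - events
extra_patients_needed = extra_events_needed / prevalence

print(f"EPP = {epp:.1f}, deficit = {epp_deficit:.1f}, "
      f"~{extra_patients_needed:.0f} more training patients needed")
```

In this illustrative scenario, closing the gap would require roughly 170 additional training patients, which conveys the scale of the shortfall the review describes.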
Clinical Implications of Model Instability
The systematic review reveals a stark disparity between the mathematical requirements for stable machine learning and current publication standards in high-impact medical journals. Even when applying charitable assumptions and a conservative lower-bound benchmark, the researchers found that only three studies (10.7%) met all criteria for stable prediction model development. This indicates that nearly 90% of externally validated radiomics models in top-tier journals are trained on datasets that are statistically insufficient to support their algorithmic complexity. For the practicing clinician, this means that the vast majority of published models, despite having undergone external validation, lack the foundational data volume necessary to ensure that their predictions are reproducible or reliable in a real-world clinical setting. This systemic data deficit renders models highly prone to overfitting, a phenomenon where a model learns the idiosyncratic variations of a specific training dataset rather than the true underlying biological signal. When a model is overfitted, it may demonstrate high accuracy on its initial data but fails to generalize to new patients, leading to instability in its predictive performance. The researchers suggest that this widespread instability potentially explains the ongoing reproducibility crisis in the field of radiomics, where models that appear robust in a research environment fail to deliver consistent results when applied to independent clinical cohorts. For physicians, these findings serve as a critical warning that a validated model may still generate unreliable predictions that could misinform diagnosis or treatment planning if the underlying training sample size was inadequate. Ensuring that a model is trained on a statistically sufficient population is not merely a technical requirement; it is a prerequisite for clinical safety and the delivery of accurate, evidence-based care.
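For readers who want to see why a data-starved, high-dimensional model can look impressive yet be clinically unreliable, the short simulation below (a synthetic illustration, not data from the review) fits a flexible classifier to pure noise: performance on its own training data appears excellent while performance on held-out cases collapses to chance.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Pure-noise "radiomic" features: no feature carries any real signal.
X = rng.normal(size=(120, 100))          # 120 patients, 100 features
y = rng.integers(0, 2, size=120)         # random binary outcome

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

auc_train = roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1])
auc_test = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

# Expect a near-perfect training AUC but chance-level (~0.5) held-out AUC.
print(f"train AUC = {auc_train:.2f}, held-out AUC = {auc_test:.2f}")
```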
References
1. Donato VD, Kontopantelis E, Cuccu I, et al. Magnetic resonance imaging-radiomics in endometrial cancer: a systematic review and meta-analysis. International Journal of Gynecological Cancer. 2023. doi:10.1136/ijgc-2023-004313
2. Danilov GV, Agrba SB, Strunina YV, et al. Radiomics and Machine Learning in Diagnostics of Glial Brain Tumors: a Systematic Review and Meta-Analysis. Sovremennye Tekhnologii v Meditsine. 2025. doi:10.17691/stm2025.17.6.07
3. Putin R, Stana LG, Ilie AC, Tanase E, Cotoraci C. Quantitative Ultrasound Radiomics for Predicting and Monitoring Neoadjuvant Chemotherapy Response in Breast Cancer: A Systematic Review. Diagnostics (Basel). 2026. doi:10.3390/diagnostics16030425
4. Tran K, Ginzburg D, Hong W, Attenberger U, Ko HS. Post-radiotherapy stage III/IV non-small cell lung cancer radiomics research: a systematic review and comparison of CLEAR and RQS frameworks. European Radiology. 2024. doi:10.1007/s00330-024-10736-1
5. Spadarella G, Ugga L, Calareso G, Villa R, D'Aniello S, Cuocolo R. The impact of radiomics for human papillomavirus status prediction in oropharyngeal cancer: systematic review and radiomics quality score assessment. Neuroradiology. 2022. doi:10.1007/s00234-022-02959-0
6. Fiz F, Viganò L, Gennaro N, et al. Radiomics of Liver Metastases: A Systematic Review. Cancers. 2020. doi:10.3390/cancers12102881