Benign Calcifications Drive False-Positive AI Risk Scores in Mammography

A study of 130,031 mammograms reveals that amorphous calcifications often trigger high AI risk scores in non-cancerous cases.

For Doctors in a Hurry

Researchers investigated which specific mammographic features trigger high artificial intelligence risk scores in both cancerous and non-cancerous screening cases.
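This retrospective study analyzed 130,031 screening mammograms from 42,371 women and reviewed 240 cases with high artificial intelligence risk scores: 120 without cancer and 120 with screen-detected cancer.
In non-cancer cases, calcifications alone were the most frequently marked feature (72% for Model A, 68% for Model B), while spiculated masses were the most frequent feature in screen-detected cancers (29%).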
The researchers concluded that mammograms with high risk scores show distinct radiological features depending on the actual screening outcome.
Identifying these features may help refine artificial intelligence thresholds to improve specificity and reduce unnecessary patient recalls.

The challenge of specificity in automated mammography screening

Screening mammography remains the cornerstone of early breast cancer detection, yet the high volume of examinations and the subtle nature of early malignancies place a significant cognitive burden on radiologists [1]. Artificial intelligence has emerged as a potential tool to enhance diagnostic accuracy and streamline workflows, with some models demonstrating sensitivity and specificity comparable to human readers in controlled settings [2]. However, the integration of these tools into population-based screening programs is often hindered by high false-positive rates and a lack of transparency regarding which imaging features trigger high-risk alerts [3, 4]. While artificial intelligence can effectively triage patients for supplemental imaging such as magnetic resonance imaging, its tendency to flag benign findings can lead to unnecessary recalls [5]. To address this clinical bottleneck, researchers recently investigated the specific mammographic features that drive high artificial intelligence risk scores in both malignant and non-cancerous cases, aiming to clarify why algorithms flag certain healthy breasts as highly suspicious.

Analyzing high-risk alerts in a large screening cohort

To understand the imaging characteristics that trigger high-risk alerts, the researchers conducted a retrospective study of 130,031 screening mammograms from 42,371 women who participated in BreastScreen Norway between 2008 and 2018. The study utilized two distinct artificial intelligence models, identified as Model A and Model B, which were designed to detect signs of malignancy on screening mammograms. The researchers focused their analysis on the highest tier of risk, performing an informed radiological review of all mammograms that fell within the highest 5% of artificial intelligence risk scores for both models. This threshold allowed the team to isolate the specific visual triggers that the algorithms prioritized as most suspicious for cancer.
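The top-5% selection described above can be illustrated with a minimal sketch. It is not the study's code: the function names, the toy scores, and the choice to require a high score from both models at once are assumptions made purely for illustration, since the summary does not specify how the two models' thresholds were combined.

```python
def top_fraction_cutoff(scores, fraction=0.05):
    """Return the lowest score still inside the top `fraction` of `scores`."""
    if not scores:
        raise ValueError("scores must not be empty")
    ranked = sorted(scores, reverse=True)
    # Number of exams in the top tier, rounded to a whole exam (at least one).
    k = max(1, round(len(ranked) * fraction))
    return ranked[k - 1]


def flagged_by_both(scores_a, scores_b, fraction=0.05):
    """Indices of exams at or above the cutoff for both hypothetical models.

    Ties at the cutoff are all kept, so slightly more than `fraction`
    of exams can be flagged per model.
    """
    cut_a = top_fraction_cutoff(scores_a, fraction)
    cut_b = top_fraction_cutoff(scores_b, fraction)
    return [
        i
        for i, (a, b) in enumerate(zip(scores_a, scores_b))
        if a >= cut_a and b >= cut_b
    ]


# Toy data: 100 exams, scores are made up for illustration only.
scores_a = list(range(100))
scores_b = list(range(100))
scores_b[99] = 0  # Model B disagrees with Model A on the last exam.

print(flagged_by_both(scores_a, scores_b))  # [95, 96, 97, 98]
```

A percentile cutoff of this kind is relative to the cohort being scored, so the absolute score that counts as "high risk" would shift with a different screening population.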

The investigation utilized two distinct patient groups to compare how artificial intelligence interprets benign versus malignant features. Sample 1 consisted of 120 cases that received high artificial intelligence risk scores but had no breast cancer detected within 6 years of the initial screening, representing false-positive or high-risk benign findings. Sample 2 included 120 cases with high artificial intelligence risk scores and screen-detected cancers, serving as the malignant baseline. For each case, the researchers evaluated mammographic density using the Breast Imaging-Reporting and Data System (BI-RADS), a standardized classification that categorizes breast tissue from level a (entirely fatty) to level d (extremely dense).

Beyond tissue density, the study analyzed specific mammographic features including mass, spiculated mass (a suspicious lesion characterized by radiating lines of tissue), asymmetry, architectural distortion, calcification alone, and density with calcification. To correlate these findings with clinical judgment, radiologists assigned interpretation scores on a scale of 1 to 5, where 1 indicates a negative finding and 5 indicates a high suspicion of malignancy. By comparing these radiologists’ interpretation scores and specific imaging features across both samples, the study aimed to identify why certain non-cancerous features frequently trigger high-risk scores in automated detection systems.

Distinct imaging features in false-positive versus malignant cases

The researchers identified a clear divergence in the imaging features that triggered high risk scores in benign versus malignant cases. In Sample 1, the non-cancer cases, calcifications alone were the most frequent feature marked by the artificial intelligence, occurring in 72% of cases for Model A and 68% for Model B. These calcifications predominantly showed amorphous morphology (small, hazy, or indistinct shapes lacking a clearly defined form) and a clustered distribution. In contrast, in Sample 2, the screen-detected cancers, the most frequent mammographic feature was a spiculated mass, which occurred in 29% of cases.

Breast density also differed between the two samples. Mammographic density was higher in Sample 1 (non-cancer) than in Sample 2 (cancer). Specifically, BI-RADS d density (extremely dense tissue that can obscure underlying lesions) was present in 11% of the non-cancer group versus 3% of the cancer group. Among high-scoring mammograms, extremely dense tissue was therefore nearly four times as common (11/3 ≈ 3.7) in the non-cancer sample as in the cancer sample. These findings suggest that both amorphous calcifications and high parenchymal density contribute to artificial intelligence-driven false positives in screening programs, often producing high risk scores in cases that human readers correctly interpret as benign.
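The density comparison rests on simple arithmetic, sketched below. The helper name is hypothetical, and the inputs are the rounded percentages quoted in this summary rather than raw counts, so with 120 cases per sample the resulting ratio should be read as approximate.

```python
def prevalence_ratio(pct_group_1, pct_group_2):
    """Ratio of two prevalences given as percentages."""
    if pct_group_2 == 0:
        raise ValueError("comparison prevalence must be non-zero")
    return pct_group_1 / pct_group_2


# BI-RADS d prevalence: 11% in the non-cancer sample, 3% in the cancer sample.
ratio = prevalence_ratio(11, 3)
print(f"{ratio:.2f}")  # 3.67
```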

Clinical implications for reducing unnecessary recalls

The discrepancy between automated risk assessment and clinical judgment highlights an opportunity for refining screening workflows. In Sample 1, the 120 cases with high artificial intelligence risk scores but no cancer detected within six years, radiologists gave 76% of cases an interpretation score of 1 (a negative finding). This suggests that human readers frequently recognize the benign nature of features that trigger high-risk alerts in current algorithms. The study shows that mammograms assigned high artificial intelligence risk scores exhibit distinct features depending on whether the outcome is cancer or non-cancer, with non-cancer cases often characterized by amorphous calcifications and high tissue density that do not necessarily warrant clinical concern. For practicing clinicians, this means that a high artificial intelligence risk score driven primarily by calcifications in dense breasts should be interpreted with caution, as it may well be a false positive.

The systematic characterization of these features may help refine artificial intelligence thresholds and improve specificity in the screening environment. By identifying the specific imaging patterns that lead to false-positive alerts, developers can adjust algorithmic weights to better distinguish hazy calcifications from malignant spiculated masses. Ultimately, understanding these features may reduce artificial intelligence false-positive findings and decrease the recall rate in screening programs. Lowering the recall rate is essential for minimizing patient anxiety, reducing unnecessary biopsies, and easing the clinical burden of redundant follow-up examinations. This knowledge provides a framework for integrating artificial intelligence as a more precise decision-support tool rather than a source of distracting alerts.

Study Info

High risk score of breast cancer by artificial intelligence (AI) on screening mammograms: a review of negative and cancer cases

Marit A. Martiniussen, Marie B. Bergan, Merete U. Kristiansen, Nataliia Moshina, et al.

Journal European Radiology

Published May 06, 2026

References

1. Abeelh EA, Abuabeileh Z. Screening Mammography and Artificial Intelligence: A Comprehensive Systematic Review. Cureus. 2025. doi:10.7759/cureus.79353

2. Albinsaad L, Almubarak EA, Balkhair R, et al. Diagnostic Accuracy of Artificial Intelligence-Assisted Mammography Interpretation Vs. Radiologist Alone: A Systematic Review and Meta-Analysis. Current Topics in Nutraceutical Research. 2025. doi:10.37290/ctnr.v23i2.24

3. Jassim G, Otoom O, Nair B, Hashem J. Performance of artificial intelligence in breast cancer screening programmes: a systematic review. BMJ Open. 2025. doi:10.1136/bmjopen-2025-111360

4. Freeman K, Geppert J, Stinton C, et al. Use of artificial intelligence for image analysis in breast cancer screening programmes: systematic review of test accuracy. BMJ. 2021. doi:10.1136/bmj.n1872

5. Salim M, Liu Y, Sorkhei M, et al. AI-based selection of individuals for supplemental MRI in population-based breast cancer screening: the randomized ScreenTrustMRI trial. Nature Medicine. 2024. doi:10.1038/s41591-024-03093-5
