PHQ-9 Scores Conflate Symptom Frequency and Subjective Burden

Longitudinal data show that the depression screening tool captures a general severity signal but does not distinguish how often symptoms occur from how much distress they cause.

For Doctors in a Hurry

Clinicians often question whether the Patient Health Questionnaire-9 measures symptom frequency or the subjective burden of depression.
The researchers analyzed 1,652 days of daily symptom reports and 236 weekly questionnaires from 23 patients with depression.
Frequency and burden each predicted scores when modeled separately, but neither showed a unique effect when modeled together, indicating a shared severity signal.
The authors concluded that weekly scores primarily reflect a composite measure of symptom severity rather than distinct frequency or burden.
Physicians should recognize that the questionnaire captures a unified severity metric, though individual patient interpretation of these items may vary.

The Ambiguity of Standardized Depression Screening

The global prevalence of mental health disorders has intensified following the COVID-19 pandemic, with meta-analytic data confirming significant increases in stress, anxiety, and depression across diverse populations [1]. For the practicing clinician, identifying these conditions is complicated by the long-term impact of childhood maltreatment [2] and systemic social determinants such as racism, which shows a significant correlation with negative mental health outcomes (r = -0.23; 95% confidence interval, -0.24 to -0.21) [3]. Precise diagnosis is essential because psychiatric symptoms frequently co-occur with post-viral sequelae, which are long-term medical complications following a viral infection; notably, 80% of patients recovering from COVID-19 report at least one persistent symptom, including fatigue (58%) and headache (44%) [4, 5]. However, the clinical utility of standard screening instruments relies on the patient's ability to accurately translate internal states into fixed numerical scales. A recent study evaluates whether the most widely used depression screening tool effectively differentiates between the frequency of symptoms and the actual level of subjective distress experienced by the patient.

Longitudinal Tracking of Frequency and Distress

The Patient Health Questionnaire-9 (PHQ-9) serves as a cornerstone of primary care depression screening, yet its internal logic contains a potential psychometric conflict that may obscure clinical assessment. The researchers investigated concerns that the instrument's instructions, which ask patients how often they have been 'bothered by' specific symptoms, may conflict with the frequency-based response options ranging from 'not at all' to 'nearly every day.' This phrasing creates a measurement ambiguity: it is unclear whether a patient is reporting the objective recurrence of a symptom or the subjective distress it causes. The study hypothesized that patients may have difficulty mapping their lived experiences onto these fixed response options, potentially conflating symptom frequency with symptom burden (the perceived emotional or functional weight of a symptom).

To evaluate this discrepancy, the researchers employed an intensive longitudinal design (a method involving frequent, repeated measurements over time to capture real-time fluctuations in health states) to determine whether weekly PHQ-9 scores aligned more closely with frequency or burden. The study sample consisted of 23 patients with depression who provided high-resolution data through daily reporting. Over a total of 1,652 days, these participants recorded both the daily presence of symptoms and the daily burden associated with them. Alongside these daily entries, the participants completed the weekly PHQ-9 a total of 236 times. This granular data collection allowed the authors to decompose within-person and between-person effects and to examine how the standard screening tool captures the nuances of the depressive experience.
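To make the two competing quantities concrete, the following minimal sketch shows one way a week of daily reports could be summarized into a frequency value and a burden value. It is an illustration under stated assumptions, not the authors' procedure: the function name `weekly_summary`, the 0 to 10 burden scale, and the example week are all hypothetical, and the article does not describe how the study aggregated its daily data.

```python
from statistics import mean

def weekly_summary(days):
    """Summarize one week of daily reports.

    `days` is a list of (symptom_present, burden) pairs, one per reported day.
    The 0-10 burden scale is an assumption made for this sketch.
    Returns (frequency, burden): the share of days with the symptom present
    and the mean daily burden rating.
    """
    if not days:
        raise ValueError("need at least one daily report")
    frequency = sum(1 for present, _ in days if present) / len(days)
    burden = mean(rating for _, rating in days)
    return frequency, burden

# Hypothetical week: symptom present on 5 of 7 days, with varying burden.
week = [(True, 6), (True, 7), (False, 0), (True, 5),
        (False, 1), (True, 8), (True, 6)]
freq, burden = weekly_summary(week)
print(f"frequency = {freq:.2f}, mean burden = {burden:.2f}")
```

Under this framing, a patient with a symptom on most days but low ratings and a patient with rare but severe episodes would produce different frequency and burden values, which is the distinction the study asks whether the weekly PHQ-9 can recover.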

Statistical Decomposition of Severity Signals

To analyze the complex relationship between how often a symptom occurs and how much it distresses a patient, the researchers utilized mixed-effects models (a statistical method used to analyze data that has multiple levels of variability, such as differences between different patients and fluctuations within a single patient over time). This analytical framework allowed the team to decompose within-person and between-person effects, effectively separating individual clinical changes from broader group trends. Within these models, the researchers estimated patient-specific frequency–burden weights to determine which factor more heavily influenced individual PHQ-9 scores. This granular approach was necessary to identify whether the screening tool consistently prioritizes the number of symptomatic days or the subjective weight of those symptoms for each person.

The analysis revealed that symptom frequency and symptom burden were highly correlated, suggesting that for most patients, an increase in the recurrence of depressive symptoms is inextricably linked to an increase in perceived distress. When these factors were modeled alone, both symptom frequency and symptom burden significantly predicted PHQ-9 scores. Specifically, within-person rises in frequency predicted higher PHQ-9 scores in separate models, and similarly, within-person rises in burden predicted higher PHQ-9 scores when evaluated independently. These findings indicate that the PHQ-9 is sensitive to changes in both the quantity and the quality of the patient's depressive experience when each is viewed in isolation.

However, the clinical precision of the instrument was challenged when both variables were integrated into a single analysis. When frequency and burden were modeled jointly, neither factor showed unique effects on PHQ-9 scores. This lack of independent predictive power suggests that the instrument does not distinguish between the two constructs in a way that provides additional diagnostic utility. Instead, the researchers concluded that weekly PHQ-9 change mainly reflected a shared symptom-severity signal rather than providing distinct measures of frequency or burden. For the clinician, this means that while a rising PHQ-9 score reliably indicates worsening depression, it may not clarify whether the patient is suffering from more frequent episodes or a heightened emotional reaction to existing symptoms.
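The within-person decomposition described above is commonly done by person-mean centering: each observation is split into the patient's own average (the between-person part) and the deviation from that average (the within-person part). The sketch below illustrates the idea and shows why two highly correlated predictors can each look predictive alone yet show no unique effect together. All values are hypothetical, the helper names `within_person` and `pearson` are invented for this example, and the variance inflation factor is a standard collinearity diagnostic that the article does not say the authors used.

```python
from statistics import mean

# Hypothetical weekly values for two patients (illustrative only, not study data).
frequency = {"p1": [0.2, 0.4, 0.6], "p2": [0.7, 0.9, 0.8]}
burden = {"p1": [2.0, 4.5, 6.0], "p2": [6.0, 8.0, 7.5]}

def within_person(values_by_person):
    """Pool each observation's deviation from that person's own mean."""
    pooled = []
    for values in values_by_person.values():
        person_mean = mean(values)  # between-person component
        pooled.extend(v - person_mean for v in values)  # within-person component
    return pooled

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

freq_within = within_person(frequency)
burden_within = within_person(burden)

# Correlation between the within-person parts of the two predictors.
r = pearson(freq_within, burden_within)

# With two predictors, the variance inflation factor is 1 / (1 - r^2).
# A large value means a joint model has little information with which
# to separate the two effects.
vif = 1 / (1 - r ** 2)
print(f"within-person r = {r:.3f}, VIF = {vif:.1f}")
```

When the within-person correlation is this high, the standard errors of both coefficients in a joint model grow sharply, which is consistent with the pattern the study reports: each predictor is significant alone, and neither is unique when entered together.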

Clinical Implications of Patient Heterogeneity

The longitudinal analysis of 23 patients with depression, who together completed 236 weekly questionnaires, suggests that the clinical utility of the PHQ-9 may be influenced by how individual patients interpret the conflict between frequency-based response options and the instruction to report how much they are bothered by symptoms. The researchers observed that the data suggested person-level heterogeneity, which refers to differences between individual patients in whether their scores primarily reflected symptom frequency or subjective burden. For the practicing clinician, this indicates that a PHQ-9 score may not represent a uniform clinical signal across a patient panel. For some individuals, a high score may be driven by the persistence of symptoms throughout the week, while for others, the same score may reflect a high level of distress associated with relatively infrequent symptomatic episodes. This variation in how patients map their lived experience onto the scale complicates the interpretation of severity, although the researchers noted that this specific finding of heterogeneity was uncertain and requires further validation.

Because the study relied on a small sample of 23 participants providing 1,652 days of data, the authors state that these findings are preliminary and should be interpreted with caution. While the results highlight a potential lack of precision in distinguishing between frequency and distress, the researchers emphasize that these insights should be integrated alongside the existing evidence base for the PHQ-9. The tool remains a validated instrument with a robust history of clinical use in identifying depressive symptoms and monitoring treatment progress. Until larger studies can more definitively characterize the impact of individual reporting styles on diagnostic outcomes, clinicians should continue to use the PHQ-9 as a general severity indicator while remaining mindful that the underlying drivers of a patient's score (whether frequency of occurrence or the weight of the burden) may vary from one individual to the next.

Study Info

An intensive longitudinal study on interpretation issues with the PHQ instructions

Lennart Seizer, Günter Schiepek

Journal: Journal of Affective Disorders

Published: April 1, 2026

References

1. Salari N, Hosseinian‐Far A, Jalali R, et al. Prevalence of stress, anxiety, depression among the general population during the COVID-19 pandemic: a systematic review and meta-analysis. Globalization and Health. 2020. doi:10.1186/s12992-020-00589-w

2. Norman R, Byambaa M, De R, Butchart A, Scott JG, Vos T. The Long-Term Health Consequences of Child Physical Abuse, Emotional Abuse, and Neglect: A Systematic Review and Meta-Analysis. PLoS Medicine. 2012. doi:10.1371/journal.pmed.1001349

3. Paradies Y, Ben J, Denson N, et al. Racism as a Determinant of Health: A Systematic Review and Meta-Analysis. PLoS ONE. 2015. doi:10.1371/journal.pone.0138511

4. Rogers J, Chesney E, Oliver D, et al. Psychiatric and neuropsychiatric presentations associated with severe coronavirus infections: a systematic review and meta-analysis with comparison to the COVID-19 pandemic. The Lancet Psychiatry. 2020. doi:10.1016/s2215-0366(20)30203-0

5. López‐León S, Wegman-Ostrosky T, Perelman C, et al. More than 50 long-term effects of COVID-19: a systematic review and meta-analysis. Scientific Reports. 2021. doi:10.1038/s41598-021-95565-8
