Due to No Consensus.
The Breast Cancer Screening Recall Rates measure calculates the percentage of beneficiaries with mammography or digital breast tomosynthesis (DBT) screening studies that are followed by a diagnostic mammography, DBT, ultrasound, or magnetic resonance imaging (MRI) of the breast in an outpatient or office setting on the same day or within 45 days.
-
-
1.5 Measure Type
1.6 Composite Measure: No
1.7 Electronic Clinical Quality Measure (eCQM)
1.8 Level Of Analysis
1.9 Care Setting
1.10 Measure Rationale
The Breast Cancer Screening Recall Rates measure calculates the percentage of beneficiaries with mammography or digital breast tomosynthesis (DBT) screening studies that are followed by a diagnostic mammography, DBT, ultrasound, or magnetic resonance imaging (MRI) of the breast in an outpatient or office setting on the same day or within 45 days.
1.11 Measure Webpage
1.20 Testing Data Sources
1.25 Data Sources
CBE #4220e is calculated using data from final claims that facilities submit for Medicare beneficiaries enrolled in fee-for-service (FFS) Medicare. The data are calculated only for facilities paid through the Outpatient Prospective Payment System (OPPS) for mammography and DBT screening in the hospital outpatient setting. Data are pulled from the hospital outpatient and carrier files to identify eligible cases for inclusion in the initial patient population and numerator (e.g., a mammography follow-up study can occur in any location and be included in the measure’s numerator). Due to claims adjudication, there is a lag between when an imaging study is performed and when it is reported on the public reporting website.
-
1.14 Numerator
Medicare beneficiaries who had a diagnostic mammography study, DBT, ultrasound, or MRI of the breast following a screening mammography or DBT study on the same day or within 45 days of the screening study in any location.
1.14a Numerator Details
CBE #4220e calculates the percentage of mammography and digital breast tomosynthesis (DBT) screening studies that are followed by a diagnostic mammography, DBT, ultrasound, or magnetic resonance imaging (MRI) of the breast in an outpatient or office setting on the same day or within 45 days. The measure’s denominator contains any Medicare beneficiary who underwent a screening mammography or DBT study at a facility subject to OPPS regulation during the measurement period. From these beneficiaries, the numerator contains beneficiaries who had a diagnostic mammography study, DBT, ultrasound, or MRI of the breast following a screening mammography or DBT study on the same day or within 45 days of the screening study.
The Current Procedural Terminology (CPT) and Healthcare Common Procedure Coding System (HCPCS) codes used to identify beneficiaries with a diagnostic mammography study, DBT, ultrasound, or MRI can be found in the submitted Excel file.
-
1.15 Denominator
Medicare beneficiaries who underwent a screening mammography or DBT study at a facility reimbursed through the Outpatient Prospective Payment System (OPPS).
1.15a Denominator Details
The CBE #4220e denominator contains any Medicare beneficiary who underwent a screening mammography or screening DBT study performed at a facility subject to Outpatient Prospective Payment System (OPPS) regulation during the measurement period. The CPT and HCPCS codes used to identify beneficiaries who underwent a screening mammography or screening DBT can be found in the submitted Excel file.
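For context, the following is a minimal, illustrative sketch (in Python with pandas) of how a recall rate of this form could be computed from claim-level data. It is not CMS's production measure logic: the code sets and claim layout are hypothetical placeholders, the facility/OPPS filter is omitted, and the measure's actual CPT/HCPCS value sets are those in the submitted Excel file.

```python
# Illustrative sketch only: derive a denominator of screening studies and a
# numerator of follow-up diagnostic studies on the same day or within 45 days.
# SCREENING_CODES and DIAGNOSTIC_CODES are hypothetical placeholders.
import pandas as pd

SCREENING_CODES = {"SCREEN_MAMMO", "SCREEN_DBT"}               # placeholder codes
DIAGNOSTIC_CODES = {"DX_MAMMO", "DX_DBT", "DX_US", "DX_MRI"}   # placeholder codes

def recall_rate(claims: pd.DataFrame) -> float:
    """claims columns: beneficiary_id, service_date (datetime64), hcpcs_code."""
    screens = claims[claims.hcpcs_code.isin(SCREENING_CODES)]
    diagnostics = claims[claims.hcpcs_code.isin(DIAGNOSTIC_CODES)]

    # Denominator: screening studies (OPPS facility filtering omitted here).
    denominator = len(screens)

    # Numerator: screening studies followed by a diagnostic study for the same
    # beneficiary on the same day or within 45 days.
    merged = screens.merge(diagnostics, on="beneficiary_id", suffixes=("_scr", "_dx"))
    days = (merged.service_date_dx - merged.service_date_scr).dt.days
    recalled = merged[(days >= 0) & (days <= 45)]
    numerator = recalled[["beneficiary_id", "service_date_scr"]].drop_duplicates().shape[0]

    return numerator / denominator if denominator else float("nan")
```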
-
1.15b Denominator Exclusions
None.
1.15c Denominator Exclusions Details
None.
-
OLD 1.12 MAT output not attached: Not attached
1.13 Attach Data Dictionary
1.13a Data dictionary not attached: Yes
1.16 Type of Score
1.17 Measure Score Interpretation: Better quality = Score within a defined interval
1.18 Calculation of Measure Score
Please see attached measure score calculation diagram within the attachment under the 'Measure Score Calculation Diagram' question.
1.18a Attach measure score calculation diagram, if applicable
1.19 Measure Stratification Details: Not applicable; CBE #4220e is not stratified.
1.26 Minimum Sample Size
CBE #4220e uses a relative precision model to determine the minimum necessary number of cases; similar approaches are used for three other Outpatient Imaging Efficiency measures. In this precision model, the minimum case count is determined by the acceptable level of precision, the level of confidence required for the measure, and the number of cases needed to meet both. Precision depends on the facility’s observed performance rate. In general, stricter levels of precision are necessary for scores that are closer to the tail ends of the possible range of the measure score (i.e., 0.05 or 0.95), whereas scores toward the middle of the possible range (e.g., 0.50) do not require as strict a level of precision. The level of significance is 0.10; thus, the minimum case counts (see Table 1 within the attachment under the 'Logic Model' question) ensure 90 percent confidence that the observed score reflects the true score. Facilities would need at least 31 cases to qualify for public reporting; this number can vary from 31 to 67, depending on a facility’s performance rate. A generic sketch of this type of precision-based calculation appears below.
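As an illustration of a precision-based minimum case count for a proportion measure, the sketch below computes the smallest n for which a two-sided interval at a chosen confidence level has half-width no larger than a stated relative precision. It is a generic construction for context only and does not reproduce the measure's Table 1 thresholds or the exact relative precision model used for CBE #4220e; the example inputs are hypothetical.

```python
# Generic sketch: minimum n such that z * sqrt(p * (1 - p) / n) <= r * p,
# i.e., the normal-approximation half-width is within relative precision r.
import math
from scipy.stats import norm

def min_case_count(p: float, rel_precision: float, confidence: float = 0.90) -> int:
    """Smallest n with z * sqrt(p * (1 - p) / n) <= rel_precision * p."""
    z = norm.ppf(1 - (1 - confidence) / 2)   # two-sided critical value (about 1.645 at 90%)
    return math.ceil((z ** 2) * (1 - p) / (rel_precision ** 2 * p))

# Hypothetical example: an observed rate of 0.50 with 25 percent relative
# precision at 90 percent confidence.
print(min_case_count(p=0.50, rel_precision=0.25))
```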
-
Most Recent Endorsement Activity: Initial Recognition and Management Fall 2023
-
Steward: Centers for Medicare & Medicaid Services
Steward Organization POC Email
Steward Organization URL
Steward Organization Copyright
N/A
Measure Developer Secondary Point Of Contact Email
-
-
-
2.1 Attach Logic Model
2.2 Evidence of Measure Importance
From the perspective of both clinical quality and efficiency, there are potentially negative consequences if the mammography and DBT recall rate is either too high or too low. A high cumulative dose of low-energy radiation can be a consequence of too many false-positive mammography and DBT follow-up studies. Radiation received from mammography or DBT may induce more cancers in younger people or those carrying deleterious gene mutations, such as BRCA-1 and BRCA-2 (Berrington de Gonzalez et al., 2009).
Societies and guidelines provide inconsistent recommendations on appropriate recall rates for breast cancer screening. The ACR recommends a target recall rate for mammography screening between 5 percent and 12 percent (American College of Radiology, 2013); European research, via the International Agency for Research on Cancer, sets a target recall rate of 5 percent.
References:
Berrington de Gonzalez, A., Berg, C., Visvanathan, K., & Robson, M. (2009). Estimated risk of radiation-induced breast cancer from mammographic screening for young BRCA mutation carriers. JNCI: Journal of the National Cancer Institute, 101(3), 205–209. https://doi.org/10.1093/jnci/djn440
D’Orsi, C. J., Sickles, E. A., Mendelson, E. B., Morris, E. A., et al. (2013). ACR BI-RADS® atlas, breast imaging reporting and data system. Reston, VA: American College of Radiology.
-
2.3 Anticipated Impact
This measure will guide breast cancer screening decision making in hospital outpatient departments as there are potentially negative consequences if the mammography and DBT recall rate is either too high or too low.
The measure has the potential to reduce radiation received from mammography or DBT, which may induce more cancers in younger people or those carrying deleterious gene mutations, as well as to decrease unnecessary imaging and biopsies. Conversely, underuse of follow-up for screening mammography or DBT may result in missed cases of cancer.
CMS calculates performance for its Outpatient Imaging Efficiency measures using data from final claims that facilities submit for Medicare beneficiaries enrolled in FFS Medicare. The data are calculated only for facilities paid through the OPPS for mammography and DBT screening studies in the hospital outpatient setting. Data from the hospital outpatient and carrier files are used to determine beneficiary inclusion (e.g., a mammography follow-up study can occur in any location and be included in the measure’s numerator).
Results reported are for the public reporting period based on data collected from July 1, 2021, through June 30, 2022 (referred to as 2023 public reporting or PR 2023). In PR 2023, 3,652 facilities had at least 1 eligible case in the measure denominator. A total of 3,391 facilities met the minimum case count requirement, making them eligible for public reporting.
The analysis of the performance gap is presented in Table 2 and Table 3, within the attachment under the 'Logic Model' question. Table 2 presents the distribution of performance scores and denominator counts for facilities meeting MCC and for all facilities with at least one case in the denominator. Table 3 presents measure performance scores by patient biological sex, racial or ethnic identity, age group, and dual eligibility status, including chi-square values and probabilities used to assess whether differences in performance are statistically significant. For these analyses, only cases from facilities meeting minimum case count requirements for public reporting were used.
Table 2 shows that the mean measure performance for facilities meeting MCC (8.5 percent; standard deviation [S.D.] 6.7 percent) falls within the targeted recall rate range of 5 percent to 12 percent; however, analysis of performance across deciles demonstrates variability across facilities during the measurement period, with more than 30 percent (33.4 percent) of facilities having scores outside of the targeted recall rate range. Scores for all eligible facilities (i.e., those with at least one case in the denominator) add an additional 261 facilities and 7,475 patients; these facilities display a similar distribution with slightly higher mean performance (8.9 percent; S.D. 8.7 percent).
Performance by patient characteristics, displayed in Table 3, shows statistically significant differences in performance by biological sex, racial or ethnic identity, age band, and dual eligibility status. Care should be taken in interpretation of these results as some categories make up a small percentage of the total for each characteristic. For example, only 0.01 percent of patients in the measure sample are male (as would be expected, given the clinical scope of the measure), although the chi-square probability (<0.0001) indicates the difference in performance (24.3 percent for males, 9.2 percent for females) is significant. Racial identity also provides a similar chi-square probability (<0.0001), with white patients making up the majority of cases (86.4 percent of the total initial patient population, with a performance rate of 9.2 percent) followed by Black patients (7.6 percent of the initial patient population, with a performance rate of 8.5 percent). The next largest category is unknown race, comprising 2.1 percent of the initial patient population, with performance at 10.7 percent. While comprising a small percentage of the initial patient population, performance scores for patients of other race (9.8 percent), Asian or Pacific Islander (10.0 percent), and American Indian or Alaska Native (6.7 percent) show significant variation between race categories. Similarly, patients of Hispanic or Latino (9.4 percent) ethnicity also vary substantially from non-Hispanic or non-Latino populations.
Age band categories show consistent trends of lower scores as age increases, ranging from 17.6 percent for patients aged 18 to 34, to 8.2 percent for those aged 85 or older. Younger patients make up a small percentage of the overall testing population, with the categories including those aged 18 to 54 comprising about 2.5 percent of the initial patient population. Those aged 55 to 64 make up 4.9 percent of the initial patient population, with a performance score of 9.3 percent. Patients over the age of 65 make up 92.6 percent of the initial patient population, with scores ranging from 9.3 percent (for ages 65 to 74) to 8.2 percent (for patients aged 85 or older).
Finally, performance by dual eligibility was examined, with 92.6 percent of the initial patient population having only Medicare FFS coverage, and the remaining 7.4 percent enrolled in both Medicare FFS and Medicaid (dually eligible). The difference in performance was slight—9.2 percent for Medicare only versus 9.3 percent for dual eligible—but significant at the 0.05 level (p=0.0161).
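The sketch below illustrates the type of chi-square test described above, applied to a contingency table of recalled versus not-recalled cases by patient group. The counts are invented for demonstration only and are not the measure's testing data.

```python
# Illustrative chi-square test of independence between patient group and
# recall status, using hypothetical counts.
import numpy as np
from scipy.stats import chi2_contingency

# Rows: hypothetical patient groups; columns: [recalled, not recalled]
table = np.array([
    [9_200, 90_800],   # group A
    [850, 9_150],      # group B
    [210, 1_790],      # group C
])
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi-square = {chi2:.1f}, dof = {dof}, p = {p_value:.4g}")
```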
2.5 Health Care Quality Landscape
While other measures evaluate rates of breast cancer imaging, CBE #4220e monitors the rate of recall following screening imaging. This measure provides valuable information for facilities, clinicians, administrators, policy makers, patients, researchers, and others to identify facilities recalling the appropriate number of patients for follow-up screening each year.
2.6 Meaningfulness to Target Population
Among the target population, additional imaging and biopsies after a screening mammography or DBT can result in over-diagnosis among patients who do not have breast cancer, increasing their anxiety and distress. Alternatively, inappropriately low recall rates may lead to delayed diagnoses or undetected cases of breast cancer (Nelson et al., 2019). Inclusion of DBT when evaluating follow-up care may improve recall rates and positive predictive values compared to metrics that focus on mammography alone (Aujero et al., 2017; Bian et al., 2016; Chong et al., 2019; Conant et al., 2016; Pozzi et al., 2016; Skaane, 2016).
References:
Aujero, M., Gavenonis, S., Benjamin, R., Zhang, Z., & Holt, J. (2017). Clinical performance of synthesized two-dimensional mammography combined with tomosynthesis in a large screening population. Radiology, 283(1), 70–76. https://doi.org/10.1148/radiol.2017162674
Berrington de Gonzalez, A., Berg, C., Visvanathan, K., & Robson, M. (2009). Estimated risk of radiation-induced breast cancer from mammographic screening for young BRCA mutation carriers. JNCI: Journal of the National Cancer Institute, 101(3), 205–209. https://doi.org/10.1093/jnci/djn440
Bian, T., Lin, Q., Cui, C., Li, L., Qi, C., Fei, J., & Su, X. (2016). Digital breast tomosynthesis: A new diagnostic method for mass-like lesions in dense breasts. The Breast Journal, 22(5), 535–540. https://doi.org/10.1111/tbj.12622
Chong, A., Weinstein, S., McDonald, E., & Conant, E. (2019). Digital breast tomosynthesis: Concepts and clinical practice. Radiology, 292(1), 1–14. https://doi.org/10.1148/radiol.2019180760
Conant, E., Beaber, E., Sprague, B., Herschorn, S., Weaver, D., Onega, T., Tosteson, A., McCarthy, A., Poplack, S., Haas, J., Armstrong, K., Schnall, M., & Barlow, W. (2016). Breast cancer screening using tomosynthesis in combination with digital mammography compared to digital mammography alone: A cohort study within the PROSPR Consortium. Breast Cancer Research and Treatment, 156(1), 109–116. https://doi.org/10.1007/s10549-016-3695-1
Pozzi, A., Corte, A., Lakis, M., & Jeong, H. (2016). Digital breast tomosynthesis in addition to conventional 2D-mammography reduces recall rates and is cost effective. Asian Pacific Journal of Cancer Prevention, 17(7), 3521–3526. Retrieved January 11, 2023, from https://pubmed.ncbi.nlm.nih.gov/27510003
Skaane, P. (2016). Breast cancer screening with Digital Breast Tomosynthesis. Digital Breast Tomosynthesis, 11–28. https://doi.org/10.1007/978-3-319-28631-0_2
-
-
-
3.1 Feasibility Assessment
CBE #4220e was assessed via qualitative survey of a multi-stakeholder group of 32 individuals. The measure developer previously seated a technical expert panel of 12 individuals with extensive experience in clinical care (7 physicians) and healthcare administration (3 payers, purchasers, or hospital administration staff), as well as patients (2 patients who act in an advocacy role). To supplement the information gathered from the technical expert panel, the measure developer also reached out to the American College of Radiology, which provided contact information for clinicians, healthcare administration staff, patients, and caregivers. In total, 25 physicians of various specialties, 5 healthcare administration or management staff, 1 patient, and 2 caregivers responded to the survey (which contained questions about measure face validity, feasibility, and usability). Results from this survey are presented throughout the full measure submission form.
For the question related to feasibility for CBE #4220e, the results indicate that 75 percent of the respondents agree that the measure does not place an undue burden on hospitals to collect the data. For the individuals who responded either Disagree or Strongly Disagree, one stated that burden would depend on who is reporting the measure and how it is reported (additional information on measure use appears in the Use section below); another felt the measure would be difficult to track without specific Current Procedural Terminology (CPT) codes to identify diagnostic studies that count as follow-up care in the measure’s numerator (which has been resolved); a third respondent felt that burden would be high if exclusion of high-risk individuals was added to the technical specifications (which did not happen); finally, a fourth respondent felt that the measure would have significant burden but did not explain why.
Results from the qualitative survey related to measure feasibility for CBE #4220e appear in Table 4, within the attachment under the ‘Measure Logic’ question.
3.3 Feasibility Informed Final Measure
No changes were made to the final measure specifications in response to the feasibility assessment. There was high agreement that the measure does not place an undue burden on hospitals to collect the data.
-
3.4a Fees, Licensing, or Other Requirements
There are no fees, licensing, or other requirements to use any aspect of this measure as specified.
3.4 Proprietary Information: Not a proprietary measure and no proprietary components
-
-
-
4.1.3 Characteristics of Measured Entities
A total of 3,652 facilities were included in the testing population, with 3,315,335 imaging studies included in the measure’s denominator. Table 2 above shows the distribution of performance scores and denominator counts for all facilities as well as for the subset (3,391) of facilities meeting MCC requirements. These include all facilities for which relevant Medicare claims data were available; no sampling strategy was employed.
Distributions for location (i.e., urban versus rural), bed size, teaching status, and ownership status of facilities meeting MCC requirements are shown in Table 5, within the attachment under the 'Logic Model' question. The majority of facilities were urban (59.6 percent), non-teaching (83.5 percent), and non-profit (65.8 percent). Distribution by bed size shows a plurality of facilities to be small (0–50 beds, 32.5 percent), with substantive proportions in each subsequent bed size category.
4.1.1 Data Used for Testing
As described above, CMS calculates its Outpatient Imaging Efficiency measures using data from final claims that facilities submit for Medicare beneficiaries enrolled in FFS Medicare. The data are calculated only for facilities paid through the OPPS for mammography and DBT screening in the hospital outpatient setting. Data from the hospital outpatient and carrier files are used to determine beneficiary inclusion (e.g., a mammography follow-up study can occur in any location and be included in the measure’s numerator).
All reported testing results are for the public reporting period based on data collected from July 1, 2021, through June 30, 2022 (PR 2023). In PR 2023, 3,652 facilities had at least 1 eligible case in the measure denominator. A total of 3,391 facilities met the minimum case count requirement, making them eligible for public reporting.
4.1.4 Characteristics of Units of the Eligible Population
Table 3, within the attachment under the 'Logic Model' question, displays the distribution of cases included in the denominator from facilities meeting MCC requirements by patient characteristic, including biological sex, racial or ethnic identification, age band, and dual eligibility status.
4.1.2 Differences in Data
The same data were used for all aspects of testing.
-
4.2.1 Level(s) of Reliability Testing Conducted
4.2.2 Method(s) of Reliability Testing
Reliability was calculated in accordance with the methods described in The Reliability of Provider Profiling: A Tutorial (Adams, 2009). This approach calculates the ability of the measure to distinguish between the performances of different facilities. Specifically, the testing calculated the signal-to-noise ratio for each facility meeting the minimum case count, with higher scores indicating greater reliability. The reliability score is estimated using a beta-binomial model and is a function of the facility’s sample size and score on the measure, as well as the variance across facilities. An illustrative sketch of this calculation appears after the reference below.
References:
Adams, J. (2009). The Reliability of Provider Profiling: A Tutorial. https://doi.org/10.7249/tr653
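The following is an illustrative sketch of signal-to-noise reliability under a standard beta-binomial formulation consistent with Adams (2009). It is not the developer's code: the facility counts are hypothetical, and the within-facility sampling variance is approximated using the overall mean rate for simplicity.

```python
# Illustrative sketch of beta-binomial signal-to-noise reliability for a
# proportion measure. Facility counts below are hypothetical.
import numpy as np
from scipy import stats
from scipy.optimize import minimize

def fit_beta_binomial(numerators, denominators):
    """Maximum-likelihood estimates of the beta-binomial alpha and beta."""
    def neg_log_lik(log_params):
        a, b = np.exp(log_params)  # keep parameters positive
        return -np.sum(stats.betabinom.logpmf(numerators, denominators, a, b))
    result = minimize(neg_log_lik, x0=[0.0, 2.0], method="Nelder-Mead")
    return np.exp(result.x)

def snr_reliability(numerators, denominators):
    """Per-facility reliability = between-facility variance /
    (between-facility variance + within-facility sampling variance)."""
    a, b = fit_beta_binomial(numerators, denominators)
    p_mean = a / (a + b)                                   # overall mean rate
    var_between = (a * b) / ((a + b) ** 2 * (a + b + 1))   # variance of Beta(a, b)
    var_within = p_mean * (1 - p_mean) / denominators      # binomial noise per facility
    return var_between / (var_between + var_within)

# Hypothetical facilities: recall counts and screening volumes
numerators = np.array([5, 40, 120])
denominators = np.array([60, 400, 1300])
print(snr_reliability(numerators, denominators))
```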
4.2.3 Reliability Testing Results
See next section.
Table 2. Accountable Entity-Level Reliability Testing Results by Denominator-Target Population Size
Statistic: Overall, Minimum, Decile 1, Decile 2, Decile 3, Decile 4, Decile 5, Decile 6, Decile 7, Decile 8, Decile 9, Decile 10, Maximum
Reliability: 0.92, 0.41, 0.81, 0.88, 0.91, 0.94, 0.95, 0.97, 0.98, 0.98, 0.99, 1.00, 1.00
Mean Performance Score: N/A, N/A, 339, 339, 339, 341, 337, 340, 339, 339, 338, 340, 340
N of Entities: N/A, N/A, 41,335, 65,887, 96,318, 124,514, 179,137, 243,097, 319,001, 434,845, 620,705, 1,183,001, 1,183,001
4.2.4 Interpretation of Reliability Results
As shown above, reliability scores for CBE #4220e ranged from 0.41 to 1.00, with a median reliability score of 0.95. This median score is indicative of very strong measure reliability and suggests that this measure is able to identify true differences in performance between individual facilities.
-
4.3.1 Level(s) of Validity Testing Conducted
4.3.2 Type of accountable entity-level validity testing conducted
4.3.3 Method(s) of Validity Testing
Feedback received from external stakeholders during a listening session about CBE #4220e indicates that a diverse group of stakeholders supports its validity. Stakeholders were in agreement that screening mammography and DBT are appropriate imaging modalities that should be used to capture the initial patient population of the measure; some stakeholders recommended the measure consider the addition of other types of screening modalities, such as MRI and ultrasound, to the specifications. Stakeholders also reached consensus that guidance for the measure should include a target screening recall rate of 5 percent to 12 percent, in alignment with American College of Radiology guidelines, noting that the addition of DBT may shift that range downward (because DBT provides more precise imaging). Lastly, stakeholders suggested that facility characteristics could impact recall rates (e.g., underserved areas may have higher recall rates because patients in those areas have limited engagement with the healthcare system and tend to experience fragmented care). Differences in facility characteristics could be corrected using an administrative cancer detection rate.
In addition to the public listening session, the face validity of the measure was systematically assessed via qualitative survey of a multi-stakeholder group of 32 individuals, including one patient/patient advocate (the composition of which is described in the Feasibility section, above).
4.3.4 Validity Testing Results
The results shown in Table 7, within the attachment under the 'Logic Model' question, indicate that 75 percent of the respondents support the measure’s intent—to assess recall rates to determine appropriate diagnostic imaging for breast cancer detection. Furthermore, 71 percent of the respondents strongly agreed or agreed that the measure addresses quality of care (Table 8, within the attachment under the 'Logic Model' question).
For individuals who responded Disagree or Strongly Disagree to the question summarized in Table 7, suggestions included updating the measure name to reference recall in lieu of follow-up (the change was made) and using BI-RADS assessment values instead of documentation from CPT codes (which is not feasible for measures calculated using administrative claims data).
For responses to the question summarized in Table 8, one person noted that “there are many additional factors to consider” (without sharing additional context); four individuals suggested removal of magnetic resonance as a follow-up imaging modality (the suggestion was not accepted); and three people encouraged removal of information about breast density (which was addressed).
4.3.5 Interpretation of Validity Results
Please see Table 7 and Table 8, within the attachment under the 'Logic Model' question.
-
4.4.1 Methods used to address risk factors
4.4.1b If an outcome or resource use measure is not risk adjusted or stratified
CBE #4220e is a process measure for which the measure steward provides no risk adjustment or risk stratification. It was determined that risk adjustment and risk stratification were not appropriate for the measure based on the measure evidence base and the measure construct. Stakeholder feedback received during the listening session for the measure suggests that facility characteristics could potentially impact measure scores because patients in underserved areas who have limited engagement with the healthcare system may have higher rates of recall. Additionally, the prevalence of cancer in underserved areas is typically higher because their populations tend to have limited or no primary care prevention. As a process-of-care measure, the decision to image a patient should not be influenced by sociodemographic factors; rather, adjustment would risk masking important inequities in care delivery. Variation across populations is reflective of differences in the quality of care provided to the disparate populations included in the measure’s denominator. The measure steward will continue to assess the need for risk adjustment throughout the measure’s lifespan.
-
-
-
5.1 Contributions Towards Advancing Health Equity
As shown above in Table 3, some potential social risk factors were examined to identify performance gaps. These factors include biological sex, racial or ethnic identity, age band, and dual eligibility status. Statistically significant differences in performance have been identified which demonstrate an opportunity for improving health equity based on these risk factors.
-
-
-
6.1.2 Current or Planned Use(s)
6.1.4 Program Details
-
6.2.1 Actions of Measured Entities to Improve Performance
Usability of the measure was assessed via qualitative survey of a multi-stakeholder group of 32 individuals, including one patient/patient advocate. The results indicate that 77.4 percent of the respondents agree the measure can be used by hospital outpatient departments to guide decision-making and improve healthcare quality and health outcomes. One respondent suggested that the measure be used in conjunction with a breast cancer detection rate. Furthermore, 80.6 percent of the respondents agreed that the measurement of mammography and DBT follow-up rates for breast cancer detection is highly important because reporting the measure results can supply meaningful information to consumers and healthcare providers.
-
-
-
CBE #4220 Staff Assessment
Importance
Strengths:
- The developer cites evidence and guidelines from the American College of Radiology, which highlight the potentially negative consequences if the mammography and digital breast tomosynthesis (DBT) recall rate is either too high or too low. The developer posits that this measure “will guide breast cancer screening decision making in hospital outpatient departments.” In addition, this measure will potentially “reduce radiation received from mammography or DBT that may induce more cancers in younger people or those carrying deleterious gene mutations, as well as decreasing unnecessary imaging and biopsies. Alternatively, underuse of follow-up for screening mammography or DBT may result in missed cases of cancer.” The developer provides a logic model for this measure, but it does not capture the statements shared under “Anticipated Impact.” Across 3,391 facilities from July 1, 2021, through June 30, 2022, the mean measure performance for facilities meeting MCC (8.5%; standard deviation [S.D.] 6.7%) falls within the targeted recall rate range of 5% to 12%. However, variability exists across the distribution of facility scores, with more than 30% (33.4%) of facilities having scores outside of the targeted recall rate range. The developer also reported statistically significant differences in performance by biological sex, racial or ethnic identity, age band, and dual eligibility status.
- The developer states that “while other measures evaluate rates of breast cancer imaging, this measure provides valuable information to facilities, consumers, researchers, clinicians, and policy-makers, with respect to recalling the appropriate number of patients for follow-up screening each year.”
Limitations:
- The developer states that there are inconsistencies in the target recall rate for breast cancer screening. However, guidelines from the American College of Radiology recommend a target recall rate for mammography screening between 5% and 12%. In Europe, the International Agency for Research on Cancer sets a target recall rate at 5%. The developer does not describe the actions providers can take to ensure appropriate recall rates. The developer does not provide direct patient input on this measure. However, it does cite evidence of the negative consequences associated with additional screening, which include anxiety and stress to patients, as well as the risk of a delayed cancer diagnosis.
Rationale:
- The developer cites evidence and guidelines from the American College of Radiology, which highlight the potentially negative consequences if the mammography and digital breast tomosynthesis (DBT) recall rate is either too high or too low. The developer posits that this measure will "potentially reduce radiation received from mammography or DBT that may induce more cancers in younger people or those carrying deleterious gene mutations, as well as decreasing unnecessary imaging and biopsies.”
- However, the developer states that there are inconsistencies in the target recall rate for breast cancer screening, although guidelines from the American College of Radiology recommend a target recall rate for mammography screening between 5% and 12%. The committee should consider whether the range in recall rates is appropriate and perhaps have the developer further justify the range in light of the associated benefits and harms.
- A gap exists across the distribution of facility scores, with more than 30% (33.4%) of facilities having scores outside of the targeted recall rate range. The developer also reported statistically significant differences in performance by biological sex, racial or ethnic identity, age band, and dual eligibility status.
- Lastly, the developer does not provide direct patient input on this measure. However, it does cite evidence of the negative consequences associated with additional screening, which include anxiety and stress to patients, as well as the risk of a delayed cancer diagnosis.
Feasibility Acceptance
Strengths:
- The developer conducted a feasibility assessment by engaging a multistakeholder panel of experts and the American College of Radiology. It did not disclose any data availability or feasibility issues. Rather, members of the panel provided feedback on burden, indicating a 75% agreement that the measure does not place an undue burden on hospitals to collect the data. For the 25% who found the measure burdensome, the developer resolved most concerns.
- The developer notes that it did not make any changes to the measure as a result of the feedback. This measure uses claims data, and the Centers for Medicare & Medicaid Services calculates its Outpatient Imaging Efficiency measures using data from final claims that facilities submit for Medicare beneficiaries enrolled in fee-for-service Medicare. The developer states that there are no fees, licensing, or other requirements to use any aspect of this measure as specified.
Limitations:
None
Rationale:
- This measure uses the electronic data source of claims. The developer conducted a feasibility assessment by engaging a multistakeholder panel of experts and the American College of Radiology. It did not disclose any data availability issues. Rather, members of the panel provided feedback on burden, indicating a 75% agreement that the measure does not place an undue burden on hospitals to collect the data. For the 25% who found the measure burdensome, the developer resolved most concerns.
The developer notes that it did not make any changes to the measure as a result of the feedback.
Scientific Acceptability
Scientific Acceptability Reliability
Strengths:
- The measure specifications are well defined and precise. The measure could be implemented consistently across organizations and allows for comparisons.
- The vast majority of facilities have a reliability well above the threshold of 0.6 with the first decile having mean reliability of 0.81 and an overall median of 0.95.
- A total of 3,391 facilities met the minimum case count requirement, making them eligible for public reporting, with millions of imaging studies between them included in the denominator.
- This measure uses a relative precision model to determine the minimum necessary number of cases. Facilities need at least 31 cases to qualify for public reporting; this number can vary from 31 to 67, depending on a facility’s performance rate.
- All reported testing results are for the public reporting period based on data collected from July 1, 2021, through June 30, 2022 (PR 2023).
Limitations:
- There appears to be at least one outlier (reliability of 0.41). It would be useful to know more about this outlier and see why the reliability was so much lower than the median (0.95) and even the mean reliability of the first decile (0.81).
Rationale:
- Measure score reliability testing (accountable entity-level reliability) was performed. The vast majority of facilities have a reliability that exceeds the accepted threshold of 0.6, with only one facility below the threshold (minimum of 0.41). The sample size for each year and accountable entity level analyzed is sufficient.
Scientific Acceptability Validity
Strengths:
- The developer conducted face validity testing of the measure score via qualitative survey of a multi-stakeholder group of 32 individuals. In total, 25 physicians of various specialties, 5 healthcare administration or management staff, 1 patient, and 2 caregivers responded to the survey (which contained questions about measure face validity, feasibility, and usability).
- The developer states that this group reached consensus that guidance for the measure should include a target screening recall rate of 5% to 12%, in alignment with American College of Radiology guidelines.
- 75% of the respondents support the measure’s intent, and 71% strongly agreed or agreed that the measure addresses quality of care.
- For individuals who did not agree, suggestions were made to update the measure name to reference recall in lieu of follow-up (the change was made by the developer) and to use BI-RADS assessment values instead of documentation from CPT codes (which is not feasible for measures calculated using administrative claims data).
- One person noted that “there are many additional factors to consider” (without sharing additional context); four individuals suggested removal of magnetic resonance as a follow-up imaging modality (the suggestion was not accepted by the developer); and three people encouraged removal of information about breast density (which was addressed by the developer).
- Lastly, the measure is not risk-adjusted, as it is a process measure.
Limitations:
None
Rationale:
- The developer conducted face validity testing of the measure score via qualitative survey of a multi-stakeholder group of 32 individuals. In total, 25 physicians of various specialties, 5 healthcare administration or management staff, 1 patient, and 2 caregivers responded to the survey (which contained questions about measure face validity, feasibility, and usability).
- The developer states that this group reached consensus that guidance for the measure should include a target screening recall rate of 5% to 12%, in alignment with American College of Radiology guidelines.
- 75% of the respondents support the measure’s intent, and 71% strongly agreed or agreed that the measure addresses quality of care.
- For individuals who did not agree, suggestions were made to update the measure name to reference recall in lieu of follow-up (the change was made by the developer) and to use BI-RADS assessment values instead of documentation from CPT codes (which is not feasible for measures calculated using administrative claims data).
- One person noted that “there are many additional factors to consider” (without sharing additional context); four individuals suggested removal of magnetic resonance as a follow-up imaging modality (the suggestion was not accepted by the developer); and three people encouraged removal of information about breast density (which was addressed by the developer).
- Lastly, the measure is not risk-adjusted, as it is a process measure.
Equity
Strengths:
- Disparities are evaluated by analyzing differences in performance scores by sex, race/ethnicity, age group, and dual eligibility, and the developer found significant overall differences (chi-square tests) for each factor.
Limitations:
- Regarding the observed disparity by sex (i.e., men have a recall rate of 24.3% compared with 9.2% among women): regardless of the sample size, one may expect a higher rate of positive results requiring follow-up, since men would probably only get a mammogram if they are symptomatic and its purpose would be to rule out breast cancer; as such, the recall rate should be higher than the target range for this measure. This may be the same for women under age 40/50 (depending on the guideline followed).
- Significance tests are limited to chi-square tests; i.e., there are no t-tests to evaluate differences by race/ethnicity category (e.g., relative to a reference category or the target recall range). In addition, all groups but men and women under 45 have values within the target range, so the clinical importance of these findings is unclear; it might be most appropriate to conclude that no disparities were identified.
Rationale:
- Developers use performance data to calculate chi-square statistics for the rate of recall by sex, race/ethnicity, age group, and dual eligibility, and find overall significance for each of these factors, concluding there is opportunity for improving health equity based on these factors.
- The method chosen does not report differences between specific groups, and because most rates are within the target range the clinical importance of the findings is unclear. In addition, the groups for which rates outside the range were found are groups that are not generally recommended for screening mammography (men, women under 45), so the target range may not be appropriate for them.
Use and Usability
Strengths:
- Developer indicates the measure is planned for use in public reporting and internal quality improvement (QI).
- After the measure was submitted to Battelle, the developer added more information based on its review of the staff assessment: The measure is intended for use in CMS's Hospital Outpatient Quality Reporting (HOQR) Program, which provides financial incentives for performance. The developer suggests facilities wanting help with improving on this measure can consult with their QIN-QIO or consult reports available on the CMS QualityNet site discussing how they can align rates of recall with the guidance from the American College of Radiology.
- Developer asked a multi-stakeholder group (n=32) to assess usability of the measure; 77.4% agreed that the measure could be used by entities to guide decision making and for QI, and 80.6% agreed that the information about follow-up rates for breast CA detection can provide consumers and providers with actionable information.
Limitations:
- No explicit articulation of the actions measured entities could take to improve performance, but in their comments the developer listed sources facilities can consult to develop a QI initiative.
Rationale:
- Developers plan for the measure to be used in CMS's Hospital Outpatient Quality Reporting (HOQR) Program, a pay-for-quality program. The majority of a multi-stakeholder group agreed the measure could be used by entities for QI and decision-making (77.4%) and that it would provide consumers and providers with actionable information (80.6%) and the developer recommends facilities work with their QIN-QIO or consult resources available through QualityNet to develop a QI program.
Summary
-
Summary
Importance
The developer does not address the significance of this population at risk of poor outcomes related to overuse of recall and/or underuse of recall post mammography or DBT.
Feasibility Acceptance
The focus group found that data collection was not overly burdensome on facilities.
Scientific Acceptability
Scientific Acceptability Reliability
This measure uses the same data process as other Outpatient Imaging Efficiency measures that use claims data. This has provided reliable results for this measure.
Scientific Acceptability Validity
The validity of the data shows a wide range, from a minimum of 0.41 to a first-decile value of 0.81.
Equity
This measure is equitable across all care and patient types.
Use and Usability
This measure provides valid information and is not burdensome to facilities.
Summary
I would recommend this measure with minor modifications.
Breast Cancer Screening Recall Rates
Importance
Agree with the input from the PQM folks and their rationale on this one.
Feasibility Acceptance
This should be easy enough to track and doesn't appear overly burdensome to input or submit data.
Scientific Acceptability
Scientific Acceptability Reliability
No additional comments.
Scientific Acceptability Validity
No additional comments.
Equity
No additions, I agree with the PQM assessment.
Use and Usability
Concern around patients lost to follow-up and disparities in the ability to notify patients of results/follow-up. What are the steps to change behavior, improve radiology reading, etc., if a facility falls outside the 5-12% range?
Summary
Ultimately, I think this measure probably meets the acceptability rating.
Overall I am in agreement
Importance
The authors provided clear rationale about how this could be helpful moving forward.
Feasibility Acceptance
Since this is coming from claims data, collection should not be burdensome.
Scientific Acceptability
Scientific Acceptability Reliability
While there was some variation among sites, with some having very low scores in the 0.40s, the overall summary is supportive.
Scientific Acceptability Validity
I have some concerns about including men in the denominator, since they are typically screened only if they have significant risk factors or symptoms. Both situations would increase the likelihood of needing subsequent testing. Also note that some cancer facilities may have a higher subsequent testing rate because their population is a higher-risk group in general.
Equity
The authors did look at ethnic groups when reviewing the data.
Use and Usability
I considered this met, yet I am unclear what follow-up action medical facilities should take when their scores are higher than expected. The author did not include this in the background materials, but I still consider it met because the focus groups surveyed were in agreement.
Summary
I recommend removal of men from the denominator, but otherwise I am in agreement with the metric.
I look forward to a discussion of Use and Usability.
Importance
Agree with staff assessment.
Feasibility Acceptance
Agree with staff assessment.
Scientific Acceptability
Scientific Acceptability Reliability
Agree with staff assessment.
Scientific Acceptability Validity
Agree with staff assessment.
Equity
Agree with staff assessment.
Use and Usability
I think this is a great measure for internal quality measurement but fundamentally inappropriate for pay for performance programs.
This is an "indicator" measure that tracks the frequency that follow-up breast imaging studies are performed after a screening mamogram or DBT. Guidelines and large studies suggest the follow-up study range should be 5-12%, but adverse patient events would occur on both ends of that spectrum due to over or under-diagnosis. The main issue I see is that this measure does not track individual patient outcomes in relation to this measure. Due to patient population variation, or variation in the quality of imaging or radiologist interpretation, you could imagine lots of appropriate variation in line with the goal of optimal patient care. For example, a lower risk population with optimal image quality and interpretation, a follow-up imaging rate of 3-4% may be appropriate. Conversely, a near-future example with a clinic serving a high-risk population using new AI tools (shown to increase sensitivity) may have a follow-up range of 15%.
The use of a "range" in pay for performance also has potential to encourage unintended consequences which were not addressed by the developer. Follow up imaging is likely to be performed within the same radiology center. Thus, if a clinic that sees a low-risk population has a follow-up study rate of 5%, they may be incentivized to inappropriately bring patients back for diagnostic studies (that would increase reimbursement) as long as it stays "within range".
This is akin to "did the police officer meet their monthly quota for speeding tickets" rather than a measure of how much drivers are actually speeding.
Summary
See my comments on Use and Usability, and concerns for use of this measure in Pay for Performance.
Breast Cancer Screening Recall
Importance
Potential negative consequences are mentioned, but from a patient standpoint I would rather be called back than not. If the recall rate is high, is there a problem with the equipment, technician competencies, or the radiologists reading the mammograms?
Feasibility Acceptance
No additional burden.
Scientific Acceptability
Scientific Acceptability Reliability
The measure as outlined was well designed.
Scientific Acceptability Validity
Validity testing was completed and consensus reached. It is unclear what the discussion really was about removing the MRI testing. MRI testing is recommended for high-risk patients, but what is concerning is the contrast media utilized: gadolinium. This is a known toxic metal. In 2018, the FDA issued a drug safety warning relative to gadolinium retention issues. The FDA suggests first-time patients should receive a medication guide.
Equity
Planned for public reporting.
Use and Usability
Public reporting is planned, and the measure is to be part of the HOQR Program.
Summary
This is an important measure. It is very important and reassuring for patients even though it may increase their anxiety. Risk was noted to be off. I do question this because patients at high risk for breast cancer are really never addressed. For example, those patients exposed in utero to the drug diethylstilbestrol (DES) are never acknowledged. Radiologists should know about these high-risk patients. I recommend that mammogram intake forms include two questions: Are you DES exposed? Do you have a family history of DES exposure?
Measure seems to be important, look forward to further review
Importance
There is strong evidence presented about the benefits of screening for breast cancer and follow-up evaluations when needed. It is further presented that there are negative consequences when the mammography and digital breast tomosynthesis (DBT) recall rate is either too high or too low. The measure developer establishes that this measure will support the reduction in radiation received. Nonetheless, there is no clear established connection between this measure and reduction of radiation. It is unclear how this measure will support improved outcomes in patients. The benchmark to compare to from the ACR is the rate of recall, but there is no substantive research to support that this is still a current benchmark to compare the results to. Also, has this target rate been adjusted for different ethnic and racial parameters? In addition, it is unclear if this measure is unique and if there is meaningfulness perceived by the patient population.
Feasibility Acceptance
The burden of reporting this measure was directly addressed by the developer. No proprietary data are required, and all information is generated from the EHR utilizing value sets, supporting appropriate data collection.
Scientific Acceptability
Scientific Acceptability Reliability
Reliability of the measure was established with the data provided. It does not seem to be of high burden to implementers, supported by data being readily available in EHRs with the use of value sets.
Agree with staff preliminary assessment comment related to limitations of the reliability results:
- "There appears to be at least one outlier (reliability of 0.41). It would be useful to know more about this outlier and see why the reliability was so much lower than the median (0.95) and even the mean reliability of the first decile (0.81)."
Scientific Acceptability Validity
It is unclear how this measure will address the patient outcomes described of decreasing unnecessary radiation while improving further screening when needed. The validity study does not address how the addition of DBT "may shift the range downward" when referring to the ACR screening target recall rate.
Equity
The measure developer is able to establish a difference in the measure across different patient groups. It does not, however, clearly establish how the measure supports addressing these differences. I would like to know this further.
Use and Usability
The measure, although important in nature, is hard to see as a way to improve patient outcomes. The developer does not establish any pathways to address any numbers outside of the range of 5-12%. Not only this, but it is also concerning what facilities could do in communities with high health inequities, where resources are not available to correctly address results outside of the desired range.
Summary
Although I see the benefit of a measure that supports appropriate follow-up for breast cancer evaluation after mammograms, I am unable to see how this measure addresses the differences in population and their health equity markers as part of a pay for reporting program. I am also unable to see the need for knowing the recall rate when there is no process to address any outlier results. I look forward to further discussion of this measure.
important measure
Importance
Setting a range to guide additional screening measures so that overuse of radiation and underdiagnosis don't occur is very important. I don't understand the rationale for the 5%-12% range other than its consistency with measures already adopted. How do we know that this is the appropriate range going forward? Might current conditions warrant a shift in the range or having different ranges for special circumstances (e.g., rural areas, demographic groups that have traditionally been over- or under-examined in follow-up care)?
Also - while additional screening can elicit anxiety, there is also anxiety around uncertainty of diagnosis.
Feasibility Acceptance
Agree with staff assessment.
Question: what about Medicare beneficiaries enrolled in Medicare Advantage plans (which is about half of all Medicare beneficiaries)? Will this measure apply to them? Might there be differences in diagnostic follow-up for those in MA plans that would be important to recognize?
Scientific Acceptability
Scientific Acceptability Reliability
Agree with staff assessments.
Scientific Acceptability Validity
Agree with staff assessments.
Appreciate that patients and caregivers were included.
Equity
Developers use performance data to calculate the rate of recall by sex, race/ethnicity, age group, and dual eligibility, and find overall significance for each of these factors, concluding there is opportunity for improving health equity based on these factors.
Use and Usability
Agree with staff assessment.
There is some concern about whether directing facilities wanting help with improving on this measure to consult with their QIN-QIO will be sufficient for improving and sustaining the measure. There may be rural/urban, size, and resource differences that could require other approaches.
Summary
My main question is about maintaining the 5%-12% range. Might there be reasons to reexamine this range at this time? How do we know if this range is still the best one for the purposes of this measure for all populations?
Overall, I agree with the…
Importance
The developer cites a target recall rate of 5% to 12%, depending on the facility. This rate appears arbitrary, albeit inconsistent, and does not account for US healthcare dynamics (rural vs. urban) or technology. The developers do not describe the actionable quality improvement strategies to develop. Even though CBE #4220e is important, particularly as it focuses on monitoring "rates of recall following screening imaging instead of rates of breast cancer imaging," from a patient perspective, the developers did not describe how patients will be informed about this measure.
Feasibility Acceptance
The developers conducted a focus group of professionals and patients with lived experiences, which was commendable. However, the developers did not report the geographic characteristics of the individuals with the most agreement that the measure will not add undue burden on hospitals to collect data; professionals in rural areas may report an undue burden with collecting these data. The patient representation relative to healthcare staff/professionals could undermine the feasibility assessment result. It would be appropriate for the developers to consider a patient-only group to ensure accurate representation. A question that comes to mind is whether there were differences in the agreement scores between the patients and healthcare professionals.
Scientific Acceptability
Scientific Acceptability Reliability
The reliability rating is sound; it used large, standardized data that are available to different facilities.
Scientific Acceptability Validity
Although the scientific measures are sound, the developers did not describe the reliability of 0.41 compared to the mean of 0.9 and the next lowest value of 0.81. It would be ideal to describe this steep difference in value.
Equity
The developers considered key variables in their measure. However, as noted by the developers, care must be taken in interpreting the significant differences between males and females and between racial/ethnic groups. I would caution against framing the findings as though disparities by sex exist. Additionally, considering that there are male-female differences in breast cancer, which can lead to biased estimates when data are not disaggregated, it would be worthwhile for the developers to include additional language acknowledging this.
Use and Usability
Like I mentioned earlier, patient representation on this measure is a limitation to be addressed. Based on the sample, the positive sentiments are reflective of providers.
Summary
Overall, I agree with the staff assessment and my fellow independent reviewers. The measure is important, but some nuances still need clarity. There's room for continuous improvement of the measure.
Breast Cancer Screening Recall
Importance
There is strong evidence presented about the benefits of breast cancer screening and follow-up.
The developer suggests the measure will support a reduction in radiation received, but I was unable to identify a clear connection or understand how they correlate. It is also unclear how tracking this measure would actually improve patient outcomes. Does the benchmark mentioned below take into account different ethnic/racial/rural parameters?
Limitations - The developer does not provide guidance to sites on the actions providers can take to ensure appropriate recall rates. It is also unclear, depending on the population, what sites should do if their recall rates fall outside the 5%-12% range when their populations are diverse.
Feasibility Acceptance
Feasibility AcceptanceBecause the measure uses claims data, it should not be overly burdensome.
Scientific Acceptability
Scientific Acceptability ReliabilityWould like to see the information related to the outlier (reliability of 0.41) for review.
Agree with staff preliminary assessment comment related to limitations of the reliability results:
- "There appears to be at least one outlier (reliability of 0.41). It would be useful to know more about this outlier and see why the reliability was so much lower than the median (0.95) and even the mean reliability of the first decile (0.81)."
Scientific Acceptability ValidityI would like to hear discussion on whether men should be in the denominator, as they are typically only screened if at risk or having symptoms. Are there any provisions for facilities serving populations in areas at high risk for cancer?
Equity
EquityAgree with staff assessment.
Use and Usability
Use and UsabilityThe developer did not address how an organization could take action to improve performance if rates fell outside the 5%-12% range.
Summary
Although I see the benefit of a measure supporting follow-up evaluation after initial breast cancer screenings, I would like to hear discussion on different health equity aspects and population drivers (rural/city) and how these screenings could improve patient outcomes overall. I look forward to discussion about modifying this measure so that it helps organizations perform follow-up exams when needed.
Breast Cancer Screening Recall
Importance
ImportanceThe citations provided for the clinical guidelines do not appear appropriate, and no grading is provided for the recommendations on recall rates. The recommendations on recall rates are inconsistent, with a wide range.
Feasibility Acceptance
Feasibility AcceptanceNo changes were made in response to the feasibility assessment, but the survey indicated high agreement that the measure would not impose undue burden. The measure uses electronic claims data, and no proprietary information is needed.
Scientific Acceptability
Scientific Acceptability ReliabilityData suggest the measure is reliable across all deciles, well above the 0.6 threshold. Specifications appear precise.
Scientific Acceptability ValidityFace validity of 75% agreement from the panel. Additional validity testing was not conducted.
Equity
EquityAgree with staff assessment. Interested to learn more about the developer's perspective on the clinical significance of the differences.
Use and Usability
Use and UsabilityPlanned use is in the OQR program. Usability results indicated 77.4 percent of respondents agree the measure can be used by hospital outpatient departments to guide decision-making and improve healthcare quality and health outcomes. As indicated by staff, it is unclear what interventions can be undertaken to establish appropriate recall rates.
Summary
Important topic but some concerns about strength of recommendations related to the recall rates and interventions that would result in clinically significant improvements.
Breast Cancer Recall Rate- Agree
Importance
ImportanceThere is not enough information to determine whether the patient population finds the process measure meaningful. The developer mentioned hosting a listening session with a multi-stakeholder group of 32 individuals that included one patient. I particularly would have appreciated feedback from those most impacted, especially high-risk patients in underserved areas.
Feasibility Acceptance
Feasibility AcceptanceData would be readily available without burden.
Scientific Acceptability
Scientific Acceptability ReliabilityA reliability median score of 0.95 suggests the measure can identify true differences in performance between individual facilities.
Scientific Acceptability ValidityDeveloper facilitated qualitative and quantitative assessments of the measure's validity.
Equity
EquityThe developer draws attention to how the process measure addresses equity, as shown in Table 3, which presents measure performance scores by patient biological sex, racial or ethnic identity, age group, and dual eligibility status.
Use and Usability
Use and UsabilityConcerned about facilities in underserved areas. The stakeholder group's composition does not indicate whether members' perspectives are based on rural or urban experience, or on living in and providing care to resource-restricted populations.
Summary
Agree overall with the process measure.
Support this measure
Importance
ImportanceThis is an important measure, as Chicago followed the action of NY in identifying the lack of call backs -- and successful call backs. The variation among Black communities was low, but the mortality rate was higher than in the white population. There were issues in the testing centers associated with a lack of call backs and with call backs not happening in a timely manner. While hospitals performed quite well, some of the outpatient centers and City-operated centers performed poorly in terms of call backs.
If there are variations in mortality among populations, it is helpful for performance improvement to assess call-back outcomes, which are really processes of care.
Equal Hope (formerly the Metro Chicago Breast Cancer Task Force) is the organization that has led the initiative and actually closed the gap by conducting detailed analyses, resulting in some centers closing and others receiving more support.
If there are disparities in communities and among populations, this is a very important measure.
Feasibility Acceptance
Feasibility AcceptanceIt is already being done, and most centers track their call backs.
Scientific Acceptability
Scientific Acceptability ReliabilityThe systems are established to measure across centers with a high degree of reliability.
Scientific Acceptability ValidityIt has met the validity testing requirements.
Equity
EquityAlmost all data are provided by gender, age, race, ethnicity, and payer type.
Use and Usability
Use and UsabilityIt is a proven approach to increasing performance, and it would be ideal to have public data by geographic area. Also, as more stand-alone organizations evolve to provide diagnostic services, it would be ideal to be able to track performance and identify any disparities.
Summary
From my patient perspective, and knowing the improvement in breast cancer mortality that resulted from the NY and Chicago experiences, I fully support this measure.
I see value in the measure…
Importance
ImportanceThere are two sides to the coin regarding the importance of this measure. One is unnecessary radiation exposure; currently, there is no mechanism capturing cumulative radiation exposure for each individual. The other is the potential for misdiagnosis and patient harm.
Why is Europe's target rate 5%, and is there something we can learn from them?
Feasibility Acceptance
Feasibility AcceptanceThere is no burden with collecting the data.
They did have a TEP, and it had two patients. Patient representation should be higher. They did capture multi-stakeholder opinions; however, was there diversity within this group?
Scientific Acceptability
Scientific Acceptability ReliabilityThe analysis looked at different settings (urban and rural) and facility sizes.
The data provided were sound and corresponded to the literature.
Scientific Acceptability ValidityValidity criteria were met.
Equity
EquityIt would be good to see the ethnic/race data broken down by rural and urban settings and facility size.
Use and Usability
Use and UsabilityI am not sure patients will use this data.
The end goal is to reduce the 5%-12% number, but what is the target percentage or number?
Summary
I see value in the measure for patients. Reducing radiation exposure is important, as is not missing a potential cancer diagnosis.
I think that measuring…
Importance
ImportanceWhile it is important to identify facilities that over- or under-recall patients, the measure does not address how the screening recall rate would be used to improve performance.
Feasibility Acceptance
Feasibility AcceptanceNo comment
Scientific Acceptability
Scientific Acceptability ReliabilityNo comment
Scientific Acceptability ValidityNo comment
Equity
EquityNo comment
Use and Usability
Use and UsabilityThe developers aim to use the screening recall rate to improve quality of care. This measure can work when rates outside the acceptable zone are due to facility issues. If unacceptable recall rates are due to the population's access to health care or the location of the facility (rural vs. urban), it will be very hard to improve results. How do the developers aim to address these issues?
Summary
I think that measuring breast screening quality is important. Breast cancer screening recall rates could be a good tool to measure quality, but the measure can place unnecessary burden on facilities where poor rates are due to location, population, etc.
Breast screening recall CBE ID 4220e
Importance
ImportanceThis is a new measure under PQM. However, it is a respecification of the “Mammography Follow Up Rates” measure under OQR. It is evaluated under the initial/new rather than the maintenance standard. The submission, however, relies significantly on the previous existence of the OQR measure.
A logic model is provided that describes the pathway to biopsy and surveillance. The clinical, patient-oriented goal would be accurate identification of mammographic abnormalities. As an initial/new submission, it would be important to have clarity about whether this is simply a consensus statement/guideline process measure or a patient-oriented clinical measure.
The information provided assumes that there is a correct rate of recall in a population (5-12%, based on professional guidelines) regardless of the prevalence of breast abnormalities within the population. This promotes precision and homogeneity in high-performing entities but does not, prima facie, support clinical accuracy or address under- or overutilization. In low-risk populations, recall would be too high. In high-risk populations, one could conclude that recall is too low. This is problematic because maximizing diagnosis while minimizing injury in breast cancer is critically important.
The developer presents performance gaps, which are only required for maintenance submissions. The developer provides a table with the current distribution of performance across 3,391 facilities. While the developer uses the appropriate recall range of 5-12%, the gaps are reported in deciles, with 4.9% and 13.0% as the closest decile cut points. This does not allow precise assessment of the gap relative to the suggested 5-12%. In addition, the high recall rate category (over 13%) has a mean performance level of 13.7%, which demonstrates a measurable but not clearly meaningful performance gap in overutilization based on consensus guidelines. Similarly, the “denominator count” for less than 4.9% seems to be 171/3,307,860 patients, or 0.005%. The narrative provides facility counts that do not seem to appear in the data presented. Clarity in performance measures, and maintaining consistent categories and variable definitions, would help in interpreting importance.
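For illustration of the kind of gap reporting this comment asks for, here is a minimal sketch, entirely my own and using invented rates rather than the developer's data, that categorizes facility recall rates directly against the 5-12% guideline range instead of by decile:

```python
# Hypothetical sketch: categorize facility recall rates against the 5-12%
# consensus range rather than reporting deciles. Rates are invented.
facility_recall_rates = {"A": 3.8, "B": 7.2, "C": 10.9, "D": 13.7, "E": 18.2}

def categorize(rate_pct, low=5.0, high=12.0):
    # Report whether a recall rate (in percent) is below, within, or above range.
    if rate_pct < low:
        return "below range (possible under-recall)"
    if rate_pct > high:
        return "above range (possible over-recall)"
    return "within the 5-12% range"

for facility, rate in facility_recall_rates.items():
    print(f"Facility {facility}: {rate:.1f}% -> {categorize(rate)}")
```

Reporting facility counts in these three categories, alongside the decile table, would make the size of the gap relative to the guideline range explicit.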
The developer also provides gap analyses by sex, race, age, and dual eligibility as a proxy for SES. They describe these data, however, as unreliable. In addition, while there are differences by race, all performance is within the recommended range (5-12%), so no gap in performance by race would be identified, because this is a binary yes/no measure of quality. The 5-12% range is not a scaled measure of quality.
This was also seen with SES and those eligible for both Medicare and Medicaid. Younger patients (18-34, 35-44), as a broad category, have a higher rate of recall, potentially creating risk based on the importance of the measure, but these patients do not have an indication for screening mammography and are likely receiving diagnostic mammography based on a complaint or an issue, though patients 40-44 (no subgroup analysis) have the option for screening. Clearly delineating the difference between screening and diagnostic mammography and follow-up is important. Males, and women under 40, could be excluded because they are receiving diagnostic mammography based on a complaint in almost all cases.
The developer does not comment on the adequacy of existing measures.
Feasibility Acceptance
Feasibility AcceptanceThe developer evaluated perception of feasibility with 32 people, including only one patient. The description of how these 32 people were sampled is not evident, though it may be available in some supporting material. It is not clear whether these 32 were representative of clinicians, system administrators, payors, and patients. The conclusion was that the measure is feasible, though 25% were undecided or disagreed that it was feasible. No adjustments to feasibility were made, and no costs or requirements were described. In another section, one adjustment is described as changing “follow up” to “recall.”
Scientific Acceptability
Scientific Acceptability ReliabilityThe reliability method is not fully described. This is currently proposed as a process measure at the facility level, but much of the justification treats it as a clinical measure at the patient level. This may require chart abstraction rather than the use of claims data (though claims data could be used with the ICD-10 code for abnormal mammography findings).
The data presented in Table 6 do not describe misclassification or the possibility of it. Misclassification would not simply be about performance between entities but also about the prevalence of abnormal mammography findings in the population presenting at each facility.
The beta-binomial model may be a good model, but it is not clear how it is being used here.
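For context, here is a minimal sketch of how beta-binomial signal-to-noise reliability is commonly computed for a facility-level proportion measure such as a recall rate. This is my own illustration with invented counts; the developer's implementation may differ.

```python
# Hypothetical sketch of beta-binomial signal-to-noise reliability for a
# facility-level proportion measure (recall rate). Counts are invented.
import numpy as np
from scipy import stats
from scipy.optimize import minimize

numerators = np.array([40, 55, 12, 300, 8])          # recalled screenings per facility
denominators = np.array([500, 610, 150, 2800, 90])   # total screenings per facility

def neg_log_lik(params):
    # Beta-binomial likelihood across facilities; exp() keeps alpha, beta positive.
    a, b = np.exp(params)
    return -np.sum(stats.betabinom.logpmf(numerators, denominators, a, b))

# Fit the between-facility beta distribution by maximum likelihood.
fit = minimize(neg_log_lik, x0=np.log([2.0, 20.0]), method="Nelder-Mead")
alpha, beta = np.exp(fit.x)

# "Signal": variance of true facility rates implied by the fitted beta distribution.
signal_var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))

# "Noise": sampling variance of each facility's observed proportion.
p_hat = (numerators + alpha) / (denominators + alpha + beta)  # shrunken rates
noise_var = p_hat * (1 - p_hat) / denominators

# Reliability = signal / (signal + noise), computed per facility.
reliability = signal_var / (signal_var + noise_var)
print(np.round(reliability, 2))
```

Under this kind of formulation, a low value such as the 0.41 outlier noted elsewhere typically reflects a small facility denominator (few screenings), since the noise term scales with 1/n.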
Scientific Acceptability ValidityThe narrative description of the validity methods focuses on the consensus of 32 individuals. It is not clear from the materials that these 32 individuals represent the broad range of stakeholders in breast cancer care. While a focus group can be used for face and likely construct validity, it is not clear that the current claims-based measure addresses the need for appropriate clinical follow-up without overuse. The developer suggests that using the administrative cancer detection rate may have value but does not follow up on this suggestion.
The developer argues that no risk adjustment is necessary. This may be true if the outcome is simply “recall.” However, recall is being used as a proxy measurement for “appropriate.” In that case, adjusting for the risk in the population served by a facility would be appropriate. If there is a high risk of cancer in the population, then the appropriate recall rate may be higher and even outside the suggested 5-12% range. Similarly, in a low-risk population, not adjusting the pass rate for the measure would unnecessarily increase exposure to radiation or follow-up testing (biopsy).
Equity
EquityThis is an optional measure. Feedback is provided for future improvement.
The developer suggests that equity was considered within the performance gap data. While statistical testing indicates a statistically significant difference in recall rates, there is no indication if this is clinically significant, appropriate based on risk, or equitable/inequitable. It is also testing recall based on a scaled measure of recall rather than the binary (yes/no=5-12%) which the measure is geared toward. As described in the performance gaps, one of the subpopulations most at risk quantitatively is males and 18-34 year old females. This quantitative inequity does not follow a clinical or equity logic model.
Use and Usability
Use and UsabilityUse and usability is minimally described.
There are no identified suggestions for improving performance.
This is a previously existing measure under a different name and program.
Summary
This is an important issue with significant advocacy around improving care and minimizing harm. The previously existing measure, and the measure proposed in this submission, continue a proxy measure that uses a range without regard to clinical indication or population prevalence of breast cancer. While a measure should be implemented, it is not clear that perpetuating the past measure will impact quality of care for women with abnormal findings on breast cancer screening.
could highlight not only real but also spurious anomalies
Importance
ImportanceThe information provided sufficiently supports the importance of a measure.
Feasibility Acceptance
Feasibility AcceptanceInformation concerning feasibility points to it being satisfactory.
Scientific Acceptability
Scientific Acceptability ReliabilityReliability looks good. Of course, understanding the outlier would be helpful, but it wouldn't discount the overall reliability observed.
Scientific Acceptability ValidityFace validity results are acceptable.
Equity
EquityAppreciate the equity analysis; however, it would be a stretch to conclude that this measure helps address inequities in health care.
Use and Usability
Use and UsabilityComparing oneself to a target range can be helpful for identifying an opportunity for improvement. However, upon closer inspection one might find that one's recall rate is entirely appropriate even though it happens to be outside of the target range. What types of actions are suggested to improve performance in that scenario? Will consulting reports available on CMS QualityNet or working with a QIN-QIO help to "fix" a recall rate outside of the target range when the rate is appropriate given the patients served by a hospital during a measurement period?
Summary
It's fine to have a target range and to investigate when one falls outside that range, but you wouldn't want to be unfairly judged anytime you were outside the expected range. This measure would be more useful if it allowed for explainable performance outside of the goal or had further criteria specified for characteristics of a hospital that indicate the goal is appropriate.
Breast Cancer Screening Recall Rates
Importance
ImportanceDeveloper should summarize the evidence linking recall rates that are too high or too low to potentially negative consequences (i.e., overuse of imaging/biopsy or less accurate/timely cancer diagnosis), not just state it. This information may have been included in the supplementary materials, but if so, it was not obvious.
Developer should solicit patient input about the meaningfulness of the facility recall rate for patients.
Feasibility Acceptance
Feasibility AcceptanceThe measure is calculated using Medicare FFS procedure code data (outpatient and carrier).
Scientific Acceptability
Scientific Acceptability ReliabilityTesting data were adequate, with good representation across facilities/patients. The developer conducted a signal-to-noise analysis, with reliability of .92 overall and high (>.8) values across all deciles.
Scientific Acceptability ValidityFace validity assessment was done systematically, with transparent methods and results. The quality question asked of the TEP was not as explicit as it could have been. Results were fair (71% agreed the measure reflects care quality, or 79% if you drop the undecided/don’t know responses from the calculation). The decision not to risk-adjust seems reasonable, although I wondered about the much higher rates for younger women (though these made up a relatively small percentage of women in the testing dataset).
Equity
EquityIt is good to see stratification by at least some variables, although I’m not sure how to interpret the results given that the differences (for the most part) still fall within the guideline range. It would have been nice to see a rural/urban breakdown.
Use and Usability
Use and UsabilityI believe the measure has been adopted into the Hospital OQR Program (although that isn’t completely clear from the online information presented). A fairly large proportion of their TEP thinks it will be useful for decision making and improving quality. It is a new measure, so developers were not required to respond to questions about feedback, improvement, or unintended consequences of the measure.
Summary
For the most part, this measure appears to meet most of the requirements for initial endorsement. However, the developer should provide more evidence to support the measure and also solicit patient input about the meaningfulness of the facility recall rate for patients.
N/A
Importance
ImportanceThe developer does not provide direct patient input on this measure. However, it does cite evidence of the negative consequences associated with additional screening, which include anxiety and stress to patients, as well as the risk of a delayed cancer diagnosis.
Feasibility Acceptance
Feasibility AcceptanceUtilizes claims data, which are required and routinely generated on patient encounters. There was 75% agreement that the measure will not place an undue burden on hospitals to collect the data.
Scientific Acceptability
Scientific Acceptability ReliabilityMeasure score reliability testing (accountable entity level reliability) was performed. The vast majority of facilities have a reliability that exceeds the accepted threshold of 0.6, with only one facility below the threshold (minimum of 0.41). The sample size for each year and accountable entity level analyzed is sufficient.
Scientific Acceptability ValidityNot risk-adjusted, as it is a process measure.
Equity
EquityDeveloper conducted sufficient assessment of equity for sex, race/ethnicity, age, and dual eligibility status.
Use and Usability
Use and UsabilityNo explicit articulation of the actions measured entities could take to improve performance, but in their comments the developer listed sources facilities can consult to develop a QI initiative.
Summary
N/A
no major concerns
Importance
ImportanceThis is 99% met, but I agree with the staff assessment's concern about limited patient engagement on importance. Although the developer doesn’t provide input from patients in this section, I’d be curious to hear whether the developer got any information on the measure's importance from the face validity testing and use survey they did.
Feasibility Acceptance
Feasibility AcceptanceFrom Table 4 it looked like there were six people who responded either Disagree or Strongly Disagree. I assume the two others did not provide any feedback and so were not mentioned? Assuming this is the case, and because 75% agreed, I have no major concerns and would consider this met.
Scientific Acceptability
Scientific Acceptability ReliabilityMet based on the same rationale as the staff assessment, but one question on reliability: why are the facility and denominator counts n/a for the 0.41 reliability minimum? Perhaps this would answer questions about the outlier noted in the staff assessment.
Scientific Acceptability ValidityNo concerns.
Equity
EquityAlthough met for the purposes of this measure as specified, I have concerns that this Medicare FFS measure will not capture many of the patients in populations at risk for receiving inequitable recall rates or the care associated with this process. What proportion of patients isn't captured by this measure, for example? That would be a limitation on this measure's ability to measure inequities. However, that is outside the scope of this measure evaluation. I would encourage the developer to explore options for assessing other populations in the future.
Use and Usability
Use and UsabilityI think the developer makes points about how facilities with rates that are too low or too high might interpret their results. However, this is most clearly written in the Importance section. I think for this new measure the standards are met, though.
Summary
No major concerns but some areas where clarification or more information would be helpful.
Breast Cancer Screening Recall Rates
Importance
ImportanceThis seems like an important measure for all who get preventive and diagnostic mammograms, and for those who need another test when a lump is discovered.
Feasibility Acceptance
Feasibility AcceptanceFrom the data presented it is feasible to do this and will provide important data for clinical staff to know.
Scientific Acceptability
Scientific Acceptability ReliabilityIt seems like the scientific reliability rating does not totally agree with the data submitted. As more data come in, a standard of care can be written to discuss variations in care.
Scientific Acceptability ValidityIt seems like more data are needed to convince the scientific community the measurement is needed.
Equity
EquityEquity seems to be covered well in the data presented.
Use and Usability
Use and UsabilityHopefully, the measurement will increase use and usability. This is important, as people who have abnormal mammograms often wait a long time for further testing and can then have advanced disease.
Summary
This is an important standard for improving follow-up care for breast cancer screening. Getting this information out to physicians, radiologists, and other stakeholders is important for the benefit of the patient and care coordination.
Nice to see a balancing measure
Importance
ImportanceThe scores describe a considerable performance gap, but crucially a remedy for underperforming facilities is not apparent.
Feasibility Acceptance
Feasibility AcceptanceNo meaningful issues.
Scientific Acceptability
Scientific Acceptability ReliabilityThe noted outlier does not concern me; median reliability is excellent, and even the decile 1 statistic of .81 is great.
Scientific Acceptability ValidityFace validity concerns addressed, although I don’t understand the rationale for removing MRI as a follow-up imaging modality and why that wasn’t accepted. I would be happy to hear more discussion on this.
Equity
EquityAssessment of equity was conducted, and measure would more likely than not address inequities in healthcare (by race and insurance status), although this could have been demonstrated with stronger statistical evidence.
Use and Usability
Use and UsabilityIt would be really helpful to understand what interventions would lead to an improvement in performance, and understand from research or a case study what implementing one of those interventions would look like.
Summary
This is a well-designed “balancing” measure focused on improving rates of diagnosis of breast cancer while reducing burdensome overuse. The performance gap established by the data to date is considerable.
Delicate Balance
Importance
ImportanceThe developer cites evidence and guidelines from the American College of Radiology, which highlight the potentially negative consequences if the mammography and digital breast tomosynthesis (DBT) recall rate is either too high or too low. The developer posits that this measure will "potentially reduce radiation received from mammography or DBT that may induce more cancers in younger people or those carrying deleterious gene mutations, as well as decreasing unnecessary imaging and biopsies.”
The developer states that there are inconsistencies in the target recall rate for breast cancer screening; guidelines from the American College of Radiology, however, recommend a target recall rate for mammography screening between 5% and 12%. The committee should consider whether the range in recall rates is appropriate and perhaps have the developer further justify the benefits vs. harms associated with the range.
Feasibility Acceptance
Feasibility AcceptanceThe measure is currently being used in the CMS Outpatient Imaging Efficiency program. The source data is entirely electronic and sent in on claims.
Scientific Acceptability
Scientific Acceptability ReliabilityThe measure specifications are well-defined and precise. The measure could be implemented consistently across organizations and allows for comparisons.
Scientific Acceptability ValidityThe developer conducted face validity testing of the measure score via a qualitative survey of a multi-stakeholder group of 32 individuals. In total, 25 physicians of various specialties, 5 healthcare administration or management staff, 1 patient, and 2 caregivers responded to the survey (which contained questions about measure face validity, feasibility, and usability).
The developer states that this group reached a consensus that guidance for the measure should include a target screening recall rate of 5% to 12%, in alignment with American College of Radiology guidelines.
75% of the respondents support the measure’s intent and 71% strongly agreed or agreed that the measure addresses quality of care.
Equity
EquityDisparities are evaluated by analyzing differences in performance scores by sex, race/ethnicity, age group, and dual eligibility, and the developer found significant overall differences (chi-squares) for each factor.
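For reference, a minimal sketch of the kind of chi-square comparison described here, using invented counts rather than the developer's data. With denominators this large, even small absolute differences in recall rates reach statistical significance, which is why the clinical significance of the differences deserves separate discussion.

```python
# Hypothetical sketch of a chi-square test of recall rates across subgroups
# (e.g., dual-eligibility status). Counts are invented for illustration.
import numpy as np
from scipy.stats import chi2_contingency

# Rows: subgroups; columns: [recalled within 45 days, not recalled]
table = np.array([
    [9_500, 90_500],    # dual-eligible
    [21_000, 229_000],  # non-dual-eligible
])

chi2, p_value, dof, expected = chi2_contingency(table)
recall_rates = 100 * table[:, 0] / table.sum(axis=1)
print(f"recall rates: {np.round(recall_rates, 1)}%  chi2={chi2:.1f}  p={p_value:.3g}")
```

In this invented example, both subgroup rates fall within the 5%-12% guideline range even though the chi-square p-value is tiny.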
Use and Usability
Use and UsabilityDevelopers plan for the measure to be used in CMS's Hospital Outpatient Quality Reporting (HOQR) Program, a pay-for-quality program. The majority of a multi-stakeholder group agreed the measure could be used by entities for QI and decision-making (77.4%) and that it would provide consumers and providers with actionable information (80.6%). The developer recommends facilities work with their QIN-QIO or consult resources available through QualityNet to develop a QI program.
Summary
I believe this is a good measure that has an opportunity to improve patient outcomes. Determining levels of appropriate follow-up exams while balancing that level with the potential downsides of over-follow-up is an important quality improvement objective for diagnostics. It goes without saying that patients do not want to have their possible breast tumor ignored and not further evaluated, but patients also do not respond positively when asked for follow-up that may not be necessary. Finding an appropriate balance, reimaging patients who need it, and limiting overexposure to radiation are excellent goals for which hospitals should strive. Extending this measure to the HOQR program with the target of 5%-12% performance is a good way for organizations to begin to find this balance and home in on the skills needed to optimize care for this sensitive population.
Comments on 4220
Importance
ImportanceThe developer did not provide direct patient input on this measure. However, it does cite evidence of the negative consequences associated with additional screening, which include anxiety and stress to patients, as well as the risk of a delayed cancer diagnosis.
There are inconsistencies in the target recall rate for breast cancer screening: guidelines from the American College of Radiology recommend a target recall rate for mammography screening between 5% and 12%, while in Europe the International Agency for Research on Cancer sets a target recall rate of 5%. The developer does not describe the actions providers can take to ensure appropriate recall rates.
Feasibility Acceptance
Feasibility AcceptanceMeasure calculated using claims data, so very feasible.
Scientific Acceptability
Scientific Acceptability ReliabilityNot all scores met the 0.6 threshold.
Scientific Acceptability ValidityFace validity survey using a multi-disciplinary panel.
Equity
EquityDisparities are evaluated by analyzing differences in performance scores by sex, race/ethnicity, age group, and dual eligibility, and the developer found significant overall differences (chi-squares) for each factor.
Use and Usability
Use and UsabilityNot clear on plans for how the measure will be used.
Not sure how HOPDs would know that they need to improve (is 10% a good rate?).
Summary
Comments on 4220
Important measure
Importance
ImportanceThis is an important measure to have as a balancing measure to tracking rates of screening alone. I do have concerns about the population being measured. USPSTF recommends screening in women aged 40-74. Medicare beneficiaries overlap only slightly with this population, meaning that a sizeable proportion of patients included in this measure are likely to be high-risk or being screened for diagnostic purposes. To me, this raises questions about the import of the results. Men, for example, and women younger than the recommended screening ages are likely to have been referred for suspicion of anomalies. Not surprisingly, these groups, though small, show a high rate of recall. It would also be helpful to know the follow-up of the recalls - how many patients continue on to treatment and how many turn out to be benign?
Feasibility Acceptance
Feasibility AcceptanceSimple measure does not present burden.
Scientific Acceptability
Scientific Acceptability ReliabilityAgree with staff assessment
Scientific Acceptability ValidityAgree with staff assessment
Equity
EquityThis measure presents an opportunity to track variability by patient group. This may be a reason behind some of the wide variability noted.
Use and Usability
Use and UsabilityNeed more discussion about how findings will be addressed. It is a concern that detection of variability or clusters due to geographic, cultural, or socioeconomic factors could be discouraged or misinterpreted.
Summary
Overuse and underuse of breast cancer screening is an important subject to address, but this measure may be taking a bit of a broad brush to the problem.
-
N/A