

E&M Guidebook Version 3.0

Comment Status: Closed
Description

Battelle annually reviews the PQM Guidebooks that outline its approach to Endorsement and Maintenance (E&M) activities. These reviews and updates ensure our processes and procedures are continually refined and reflect feedback received throughout the year.

The draft 2025 edition of the E&M Guidebook is open for comment from June 10 to June 29, 2025. 

During the Spring 2025 Cycle, we implemented several changes: 

  • Reduced fatigue by scheduling meetings with five or more measures over two days instead of one.
  • Adjusted the consensus threshold to 70% for groups with fewer than 20 voting members, while keeping a 75% threshold for groups with 20 or more members to ensure process integrity.
  • Implemented a reconsideration process for maintenance measures that do not reach the 75% consensus threshold but receive between 60% and 74% of votes to retain endorsement.
  • Required committee members who vote for "Do Not Endorse/Remove Endorsement" to provide reasons for their vote as part of a refined voting approach.

Additional enhancements will take effect during the Spring 2026 Cycle, including:

  • Requiring the Closing Care Gaps domain as part of measure submission.
  • Adding measure-specific guidance for pediatric clinical quality measures, cost measures, entity-level reliability and validity testing, and electronic clinical quality measures.
  • Establishing percentages for the number of entities needed to meet established reliability estimates for accountable entity-level testing.
  • Adding expectations for conducting correlation analyses for accountable entity-level validity testing.
  • Adding risk adjustment expectations for new vs. maintenance measures.
  • Adding a requirement for measure use that includes a description of the characteristics of an accountability application that this measure would support.

Comments

Submitted on Tue, 06/24/2025 - 17:51


The Federation of American…

The Federation of American Hospitals (FAH) appreciates the opportunity to comment on the proposed updates to the Partnership for Quality Measurement (PQM) Endorsement & Maintenance (E&M) Guidebook.  We commend Battelle’s effort to refine the endorsement framework, but believe several proposed changes, particularly those outlined in Appendix H, raise significant methodological and implementation concerns.  Our comments are drawn from published research and consensus guidelines to strengthen our position.

 

H1. CBE Policy on Pediatric Clinical Quality Measures

The FAH does not support the proposed approach of applying measures developed for adult populations to pediatric populations through extrapolation.  Children differ physiologically, developmentally, and epidemiologically from adults, and quality measures must reflect these differences to be valid and meaningful.  Numerous studies, including work by Mangione-Smith et al. (2011) and the Pediatric Quality Measures Program (PQMP), highlight the need for pediatric-specific development and testing due to variations in disease presentation, care delivery, and outcomes in children (Mangione-Smith et al., 2011).

 

Moreover, we were alarmed to see that generative artificial intelligence (AI) was used in drafting this guidance without transparent expert vetting.  While AI may support literature synthesis, qualified professionals must develop clinical quality standards, especially for vulnerable populations like children, using peer-reviewed evidence and stakeholder consensus (Krittanawong et al., 2020).

 

H2. CBE Requirements for Electronic Clinical Quality Measures (eCQMs)

Page 62 of the draft guidebook states, "A new eCQM version of an existing endorsed [non-eCQM] measure is automatically considered to be endorsed."

 

We strongly oppose the assertion that an eCQM version of an existing measure should receive automatic endorsement. While the measure's importance will not change due to differences in data sources, the feasibility, scientific acceptability, and use/usability evaluations will differ because of the existing challenges with collecting data from electronic health record systems (EHRs) (Chen et al., 2021; Adler-Milstein et al., 2014). We ask that the PQM reconsider this statement.

 

We support the proposal to further broaden the number of sites in which an eCQM is tested; however, we do not support the current proposed process of focusing only on increasing the number of sites, particularly requiring data element reliability and validity testing in only one vendor system for initial endorsement, and gap and measure score reliability and validity testing in a minimum of two systems for maintenance reviews. Last year, the FAH and our members alerted the Centers for Medicare & Medicaid Services (CMS) to the challenges with data collection and submission for measures that leverage data from EHRs. Specifically, hospitals identified deficits in the data related to the timing of vital signs, patient body weight, and various laboratory tests for 3502e Hybrid Hospital-Wide (All-Condition, All-Procedure) Risk-Standardized Mortality Measure with Claims and Electronic Health Record Data. Specific examples were outlined in FAH's comments here (https://assets.fah.org/uploads/2024/06/FAH-2025-IPPS-LTCH-letter-6-10-2…).

 

Initial testing of the EHR-derived data elements was completed in only 21 hospitals using one vendor system, which is similar to the current proposed approach for initial endorsement. Based on what our members encountered, it is clear that this limited testing was insufficient when the measure was implemented more broadly.

 

These limitations in data completeness apply to all measures that use EHR data. As a result, we believe that the validity of each data element should be tested across a wider set of vendor systems and facilities, including small and rural entities. Thorough assessments of each data element and the required calculations and logic must be vetted across more sites and vendor systems to understand whether an eCQM is ready for implementation and whether the performance scores truly evaluate the quality of care provided.

 

We urge PQM to reconsider these proposed changes and solicit input from groups actively working to collect these data (e.g., hospitals, practices), EHR vendors, and other subject matter experts.  Additional thought is needed to ensure that any eCQM that is endorsed can be collected with minimal effort and produce reliable and valid results. 

 

H3. Instrument-Derived Measure Set Submission Framework

PQM must clearly state that only validated instruments are acceptable for use in clinical quality measures.  The consensus from psychometrics literature is that instrument-derived measures require validation to ensure construct validity, reliability, and generalizability (DeVellis, 2016; AERA, APA, & NCME, 2014).  Using unvalidated instruments risks producing spurious results and undermining trust in the measure.  We ask that this requirement be clearly stated in the guidance. 

 

H4. CBE Guidance on Entity-Level Reliability

We do not support recommendations for which we cannot determine why the change was necessary, which experts were consulted, and how those discussions resulted in these specific methods. Transparency and broad stakeholder input are necessary components of the endorsement process and of its underlying criteria and guidance. In addition, the appropriate method of testing should be at the discretion of the measure developer, and limitations should not be placed on what is accepted; instead, whether the testing and results meet the minimum criteria should be left to the endorsement committee to determine.

 

H5. CBE Guidance on Entity-Level Validity

Similar to our reliability concerns, we urge PQM to revise this guidance based on established methods and public expert consultation.  The current guidance lacks transparency on how these standards were defined.

 

H6. CBE Guidance on Cost Measures

The FAH supports the requirement for measure developers to demonstrate a cost measure’s usefulness and accuracy within the context of quality and appreciates its inclusion.

 

Thank you for the opportunity to comment. 

 

References:

Adler-Milstein, J., DesRoches, C. M., & Jha, A. K. (2014). Health information exchange among US hospitals. American Journal of Managed Care, 20(11), 917–922.

AERA, APA, & NCME. (2014). Standards for educational and psychological testing. American Educational Research Association.

Chen, J., Ryskina, K. L., & Jung, H. Y. (2021). Data Quality in Electronic Health Records: Implications for Performance Measurement. Health Affairs, 40(2), 242–248.

DeVellis, R. F. (2016). Scale development: Theory and applications (4th ed.). Sage Publications.

Krittanawong, C., Johnson, K. W., Rosenson, R. S., Wang, Z., Aydar, M., & Halperin, J. L. (2020). Deep learning for cardiovascular medicine: A practical primer. Journal of the American College of Cardiology, 75(10), 1310–1328.

Mangione-Smith, R., Schiff, J., Dougherty, D., et al. (2011). Identifying children’s health care quality measures for Medicaid and CHIP: An evidence-informed, publicly transparent expert process. Academic Pediatrics, 11(3 Suppl), S11–S21.

Name or Organization
Tilithia McBride

Submitted on Wed, 06/25/2025 - 11:07


Comment on reliability requirements

Which aspect(s) of the guidebook are you commenting on?
Accountable-entity level testing guidance - reliability and validity

Please see attached for our comment. 

Name or Organization
University of Michigan Kidney Epidemiology and Cost Center

Submitted on Thu, 06/26/2025 - 12:25


PQA Comments on E&M Guidebook 3.0

Which aspect(s) of the guidebook are you commenting on?
General E&M processes and policies
Committee structure and voting outcomes
Measure-specific guidance - pediatric, cost, or electronic clinical quality measures
Accountable-entity level testing guidance - reliability and validity
Measure use – accountability application characteristics

PQA appreciates the opportunity to provide public comments on the proposed updates to the E&M Guidebook.

 

Endorsed with Conditions Timelines

PQA appreciates the concrete examples provided in this section. PQA urges consideration of available steward resources when contemplating timelines for change and engaging in conversations with the steward to understand the amount of effort necessary to meet conditions. Additionally, some conditions may inherently require longer amounts of time to feasibly complete (e.g., due to availability of data).

 

Updates to the Endorsement Voting Process

PQA supports the requirement that committee members voting not to endorse a measure must provide a rationale for their vote. 

PQA supports the reconsideration for maintenance measures that do not initially reach consensus on endorsement but have considerable support, as described in the draft Guidebook.

 

Reliability Criterion Met/Not Met Guidance in the Evaluation Rubric

PQA appreciates additional clarity on recommended thresholds for meeting the reliability criterion, with the understanding that these thresholds are advisory and measure reliability should be evaluated on a case-by-case basis.

 

PQA is curious if Battelle has evaluated how recent measure submissions perform against the new recommended thresholds. For example, has Battelle performed any analysis of how many recently endorsed measures meet the new reliability requirements?

 

Guidance on Entity-Level Reliability in Appendix H4

PQA notes that the guidance in H4 and the criterion on p. 49 appear to conflict. PQA requests clarification on whether descriptive statistics are required at the 0.4 or 0.6 threshold. PQA also notes that a requirement for justification is mentioned on p. 49 but not within H4.

 

While PQA understands the desire to better understand the nature of entities that do not meet a reliability threshold of 0.6/0.4, PQA notes that additional analyses of these entities add effort to a submission process that is already highly burdensome. PQA also requests additional clarity on the specific analyses required (i.e., what specific person- and entity-level descriptive statistics are required?). PQA notes that certain person-level data may be unavailable for some data sources.

 

PQA notes that the method for calculating reliability proposed by Nieser and Harris may have advantages over the current industry-standard Adams approach. However, PQA strongly recommends that the Battelle Scientific Methods Panel provide a further detailed explanation and rationale for this suggested change before implementing it. PQA also requests clarity as to whether continued use of the Adams methodology is permissible: is the change to Nieser and Harris required or simply recommended?

 

While Nieser provides sample code for their methodology in R, it is unclear whether this code has been validated as part of the peer review process. If Battelle intends for this methodology to be adopted as standard, PQA suggests additional evaluation and validation of the code to provide developers and stewards with confidence.

 

Additionally, many measure developers and stewards use other analytic tools such as SAS (the programming language in which the betabin macro, widely used to calculate Adams' signal-to-noise reliability, is provided). Translating these methods into SAS will entail systemwide costs and a risk of calculation errors. Battelle may consider developing and providing accompanying SAS code to reduce these concerns.
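For context, the calculation at issue can be stated compactly. Below is a minimal, illustrative Python sketch of an Adams-style beta-binomial signal-to-noise reliability computation of the general form described in Adams (2009); it is not Battelle's or the betabin macro's actual code, and the alpha/beta parameters are assumed to come from a beta-binomial model already fit to the entity-level data.

import numpy as np

def adams_snr_reliability(numerators, denominators, alpha, beta):
    """Adams-style signal-to-noise reliability, one value per entity.

    Assumes a beta-binomial fit has produced shape parameters alpha and
    beta for the distribution of true entity-level rates. Illustrative only.
    """
    numerators = np.asarray(numerators, dtype=float)
    denominators = np.asarray(denominators, dtype=float)

    # "Signal": between-entity variance of true rates under Beta(alpha, beta).
    signal_var = (alpha * beta) / ((alpha + beta) ** 2 * (alpha + beta + 1.0))

    # "Noise": binomial sampling variance of each entity's observed rate.
    # (Implementations often substitute a smoothed or model-based rate here.)
    rates = numerators / denominators
    noise_var = rates * (1.0 - rates) / denominators

    return signal_var / (signal_var + noise_var)

# Hypothetical entities and fitted parameters, purely for illustration.
rel = adams_snr_reliability(numerators=[40, 12, 150],
                            denominators=[200, 30, 600],
                            alpha=2.0, beta=8.0)
print(rel)                 # per-entity reliability estimates
print((rel > 0.6).mean())  # share of entities above the 0.6 threshold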

 

Guidance on Entity-Level Validity

PQA appreciates improved clarity in this area. 

 

PQA recommends adjusting the correlation coefficient in the example of assessing correlation included in H5 to -0.25 in order to more clearly indicate a low correlation. Guidance on correlation coefficients in the literature (e.g., Schober et al.) often recommends interpreting +/-0.5 as moderate rather than low (although interpretation is always highly dependent on context).

 

In the example, -0.5 may indicate an expected correlation between associated (but not overlapping) measures. Very high correlation between the measures in the example may be unexpected due to the influence of other factors on falls (structural factors like staff ratios, for example), as well as the variability and characteristics of the patient safety protocol measure itself.

 

PQA believes that adjusting the correlation coefficient will improve the clarity of the example.

 

Providing Attributes of Accountability Programs

 

PQA notes that the question regarding attributes of accountability programs may be duplicated on p. 52 and p. 53. PQA also notes that for maintenance, some attributes discussed in the questions may be duplicative of information requested in other portions of the Use and Usability section and recommends evaluating for further streamlining.

Name or Organization
Pharmacy Quality Alliance
First Name
Paula
Last Name
Farrell

Submitted by Paula Farrell on Thu, 06/26/2025 - 13:11


Guidebook Feedback

Please see the attached feedback from NHSN. 

Name or Organization
Paula Farrell

Submitted on Thu, 06/26/2025 - 14:27


PQM Proposed Changes

Which aspect(s) of the guidebook are you commenting on?
General E&M processes and policies
Measure-specific guidance - pediatric, cost, or electronic clinical quality measures
Accountable-entity level testing guidance - reliability and validity
Measure use – accountability application characteristics

The American Medical Association (AMA) appreciates the opportunity to comment on the proposed changes to the endorsement and maintenance process, particularly Appendix H, Measure-Specific Guidance. The AMA is extremely concerned to see some of these revisions in the absence of a process that is transparent about how input was obtained, which individuals were consulted, and how recommendations were reached. Our comments on Appendix H are attached.

Name or Organization
American Medical Association

Submitted on Fri, 06/27/2025 - 14:21


Comments on E&M Guidebook Version 3.0

Which aspect(s) of the guidebook are you commenting on?
General E&M processes and policies
Committee structure and voting outcomes
Measure-specific guidance - pediatric, cost, or electronic clinical quality measures
Accountable-entity level testing guidance - reliability and validity
Measure use – accountability application characteristics
PQM Measure Evaluation Rubric

Please see comments in the attached document, 2025-06-27-em-guidebook-v3.0-comments-acumen.pdf

Name or Organization
Acumen LLC

Submitted on Fri, 06/27/2025 - 15:41


Review of and Commentary on the E&M Guidebook

Which aspect(s) of the guidebook are you commenting on?
General E&M processes and policies
Measure-specific guidance - pediatric, cost, or electronic clinical quality measures
Accountable-entity level testing guidance - reliability and validity
Measure use – accountability application characteristics

The American Society of Clinical Oncology (ASCO) welcomes the opportunity to provide feedback on the E&M Guidebook. Our detailed comments, offered in a spirit of partnership to help create rigorous and practical guidance for measure developers, are attached for your review.

Name or Organization
American Society of Clinical Oncology (ASCO)

Submitted on Fri, 06/27/2025 - 21:06


Concerns with Conditions for Endorsement

Which aspect(s) of the guidebook are you commenting on?
General E&M processes and policies
Measure-specific guidance - pediatric, cost, or electronic clinical quality measures

Thank you for providing the opportunity to comment on proposed changes to the endorsement and maintenance (E&M) guidebook. After reviewing the updates, we have a few concerns about the modifications.

 

Endorsed with Conditions

Unlike in the previous version of the guidebook, the conditions placed on endorsement are now described in much greater detail. These conditions (i.e., residual uncertainties) would and should be considered fatal flaws, including but not limited to the following:

  • Scientific Acceptability
    • Reliability: The developer/steward performed accountable entity-level reliability testing. Most entities have reliability estimates that exceed the accepted threshold of 0.6, but around 30% of facilities are below the threshold.
  • Feasibility
    • Electronic Health Record (EHR) Integration: Concerns about clinical burden due to a survey for an instrument-derived measure being outside the EHR.
  • Unintended Consequences
    • Interested parties have voiced concern with respect to a specific unintended consequence due to the measure’s use. 

 

Measures submitted to the CBE for endorsement are intended to be used in accountability programs that make judgments and decisions as a consequence of performance results, such as reward, recognition, punishment, payment, or selection. Many of the “conditions” and corresponding examples would not be appropriate for such high-stakes uses, nor would it be acceptable for the developer to resolve the conditions within 3 to 5 years.

 

Given the frequency of E&M review cycles (i.e., every six months), it would be more appropriate to have a measure sent back to the developer to address these flaws and have the measure reviewed again at an upcoming cycle. 

 

CBE Requirements for Electronic Clinical Quality Measures

While practical considerations are important, these should not prevent an eCQM or any measure from being developed and tested in a rigorous manner. Measure development is a long and costly endeavor, and there are no shortcuts, given the impact on patients, accountable entities, and measure implementers.

 

Toward that end, a new eCQM version of an existing endorsed [non-eCQM] measure SHOULD NOT automatically be considered endorsed. eCQMs are distinct from other CQMs in that they have been developed for use in an EHR or other electronic system rather than relying on administrative data. Thus, apart from perhaps importance, every other domain in the evaluation rubric should be evaluated as if the eCQM were a new measure.

 

It is disappointing to see that the assessment of feasibility and testing is proposed to require just one EHR system. The results would be quite limited and far from generalizable across health systems.

 

Thank you.

 

Name or Organization
Sam Tierney

Submitted on Sat, 06/28/2025 - 14:30


Comments for draft V3 E&M Manual - Material changes to measures

Which aspect(s) of the guidebook are you commenting on?
Other

Page 34 of the draft V3 of the E&M Manual refers to "material changes" to endorsed measures and defines a material change as "any modification to the measure specifications that significantly affects the measure results." Examples of material changes provided in the manual are:

  • Changes to the population being measured (e.g., changes in age inclusions, changes in diagnoses or other inclusion criteria, changes in excluded populations, change from one type of insured population to another population);
  • Changes to what is being measured (e.g., changes in target values such as blood pressure or lipid values);
  • Inclusion of new data source(s); or
  • Expansion of the level of analysis, or a change in the unit of analysis or care setting(s) (e.g., adding the clinician level to a measure currently endorsed at the practice level). Note: Expansion of a measure’s level of analysis falls under the same CBE ID.

 

Our comments on this section are:

  1. Some of the examples provided may not "significantly affect the measure results," which is the definition of a "material change." For example, updates or changes to ICD codes (diagnoses) may not have a significant impact on quality measure results or on rankings of providers by those results. Should it be clarified that such changes are considered "material changes" only if they significantly affect the measure results?
  2. When an endorsed measure has a significant change, including a name change, would the measure ever be considered a new measure when reviewed for endorsement rather than reviewed for maintenance of endorsement? The current description seems to suggest that, for all changes, the measure would be treated as undergoing maintenance of endorsement (not a new measure endorsement review).
Name or Organization
RTI International

Submitted on Sat, 06/28/2025 - 15:48


Page 42: [For initial…

Which aspect(s) of the guidebook are you commenting on?
Measure-specific guidance - pediatric, cost, or electronic clinical quality measures
Accountable-entity level testing guidance - reliability and validity

Page 42: [For initial endorsement] Identify and describe known and existing variations in health care and health outcomes related to the measure focus area. Include factors such as demographic characteristics and groups that have historically experienced barriers to accessing health care. Describe any challenges and how the measure contributes to efforts to improve health care delivery and outcomes for these identified groups.

[For maintenance] Provide a description of your methodology and approach to empirical testing of differences in performance scores across identified groups (e.g., demographic, geographic variables).

[For maintenance] Provide the results and an interpretation of those results, including explanations for any variations in performance scores across different groups. Discuss how these results relate to existing evidence, any limitations found in the results, and the potential impact of these variations on the identified groups.

[For maintenance] Describe or provide evidence indicating how accountable entities can utilize these results to close gaps in health care delivery and outcomes for the identified groups.

Response: We recognize and support PQM’s commitment to alleviating care gaps among patient subpopulations. While we appreciate that specific demographic characteristics for performing care gap analysis are not required, we note that some maintenance quality measures, such as claims-based measures, collect limited information about patient demographics. Analyses of claims-based measures will be limited to information collected on claim forms. Therefore, the measure developer may need to create an analytic file specific to measure testing that contains more data elements than needed for measure production.

As another example, chart-abstracted measures are also restricted to the data elements in the collection manual, and unlike claims-based measures, measure developers do not have data elements beyond those required for measure calculation. For example, the sex data element is optional for some chart-abstracted measures based on current reporting specifications.

We ask that Closing Care Gaps domain criteria account for available data capture in the context of each specific measure. 

Page 51: For Accountable Entity-Level Reliability – Requirement that signal-to-noise > 0.6 for 70% or more of the accountable entities.

Response: 

We appreciate and support PQM’s ongoing commitment to endorsing highly reliable quality measures. However, we note that the recent update requiring a signal-to-noise ratio greater than 0.6 for at least 70% of accountable entities appears to diverge from previous guidance issued by the CBE, which emphasized flexibility in interpreting reliability thresholds.

This revised standard effectively raises the expected median signal-to-noise ratio for a given sample to a level well above 0.6—likely closer to 0.75—which represents a notable increase from the current benchmark.
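To make the mechanics of this criterion concrete, a minimal Python sketch follows; the reliability values are hypothetical and chosen only to show how the 70%-above-0.6 rule constrains the distribution, not to characterize any real measure.

import numpy as np

# Hypothetical entity-level signal-to-noise reliability estimates.
reliability = np.array([0.45, 0.55, 0.58, 0.62, 0.66,
                        0.70, 0.74, 0.78, 0.83, 0.90])

share_above = (reliability > 0.6).mean()
print(f"Share of entities above 0.6: {share_above:.0%}")    # 70%
print(f"Median reliability: {np.median(reliability):.2f}")  # 0.68

# Passing requires roughly the 30th percentile to sit at or above 0.6,
# so the median of any passing distribution necessarily exceeds 0.6.
print("Meets the 70%-above-0.6 criterion:", share_above >= 0.70)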

Given the sizable change in this requirement to meet the standard for entity-level reliability, we would welcome the opportunity to better understand the evidence base supporting this proposed update, and we respectfully request that PQM consider greater flexibility in this criterion. 

Page 63: For initial endorsement, developers/stewards must conduct feasibility and person- or encounter-level testing at a minimum of three sites [distinct health care systems or organizations with separate ownership] within at least one EHR vendor. Beyond this minimum requirement, developers/stewards should test on the number of health systems they deem appropriate.

Response: We recognize and support PQM’s commitment to ensuring that quality measures are tested with representative and diverse patient populations. However, the requirement to engage three distinct healthcare systems or organizations for initial eCQM endorsement poses a substantial challenge for many measure developers.

In practice, developers often collaborate with a single medium- to large-sized healthcare system that offers considerable variation across geography, urbanicity, case mix, and safety-net status. Many of these systems include dozens or more hospitals, enabling broad and meaningful representation within a single organizational structure.

Requiring partnerships with three separate entities significantly increases resource demands without necessarily enhancing the scientific validity of the testing. Based on our experience, sufficient variability in performance and reliability can be demonstrated through a well-selected single health system. We respectfully recommend that PQM consider greater flexibility in this criterion, particularly for initial testing and feasibility assessment, to better align with the practical realities of eCQM development while maintaining methodological rigor.

Page 63: For maintenance evaluations, performance gap and accountable entity-level reliability and validity must include data from a minimum of five sites [distinct health care systems or organizations with separate ownership] within at least two EHR vendors. A minimum of five sites supports meaningful distributional analysis, enables the fitting of a quadratic curve, and reduces uncertainty with statistically valid insights—as demonstrated by the Rule of Five.

Response: We recognize and support PQM’s commitment to ensuring that quality measures are tested with representative and diverse patient populations. However, the requirement to engage at least five distinct healthcare systems or organizations for eCQM testing to support maintenance endorsement poses a substantial challenge for many measure developers.

In practice, developers often collaborate with a single medium- to large-sized healthcare system that offers considerable variation across geography, urbanicity, case mix, and safety-net status. Many of these systems include dozens or more hospitals, enabling broad and meaningful representation within a single organizational structure.

Requiring partnerships with five separate entities significantly increases resource demands without necessarily enhancing the scientific validity of the testing. Based on our experience, sufficient variability in performance and reliability can be demonstrated through a well-selected single health system. We respectfully recommend that PQM consider greater flexibility in this criterion to better align with the practical realities of eCQM testing while maintaining methodological rigor.

 

Name or Organization
Mathematica Inc.

Submitted on Sun, 06/29/2025 - 23:54


Researcher comments regarding 2025 E&M Guidebook

Which aspect(s) of the guidebook are you commenting on?
General E&M processes and policies
Committee structure and voting outcomes
Measure-specific guidance - pediatric, cost, or electronic clinical quality measures
Accountable-entity level testing guidance - reliability and validity
PQM Measure Evaluation Rubric

First, I am concerned that the CBE Scientific Methods Panel has been largely sidelined from its role as a key forum for discussing and making recommendations regarding E&M policy, specifically in the domain of Scientific Acceptability. The SMP is not mentioned anywhere on the PQM site as a currently active committee, the members of the committee are not publicly identified, there is no opportunity to nominate or recruit new SMP members, and the process of SMP discussion is opaque. This is not an optimal way to manage the scientific advisory process for a CBE. The SMP is mentioned briefly in footnote 14 and on pages 62 and 68 of the Guidebook, but only the footnote 14 reference is verifiable through accessible documents. 

 

Second, the Appendix H1 policy on pediatric clinical quality measures raises several logical concerns. Foundationally, this policy appears to be based on the false premise that all existing quality measures are specified as either adult measures, starting at 18 years of age, or pediatric measures, ending at 18 years of age. PQM fails to distinguish between CQMs that appropriately include both adults and children and CQMs that are appropriately targeted to children. For example, measures of prenatal, delivery, and postpartum care may include all persons who are birthing a child, recognizing that the majority of these persons are adults, but a variable percentage may be children. In this case, the PQM “policy for extrapolating adult clinical quality measures to pediatric populations” seems to fit, in that “the target population, measure focus, context, mechanism complex, and material outcomes demonstrate sufficient similarity.” However, for most conditions, treatment practices (and even relevant outcomes) change at least somewhat across the age spectrum, but these changes are not necessarily aligned with the 18-year threshold. The Guidebook relies heavily on FDA guidance related to use of regulated drugs and devices, even suggesting “physiologically based pharmacokinetic modelling (PBPK) modeling (to) help validate extrapolation assumptions.” This suggestion seems absurd in the context of CQMs, which rarely (if ever) hinge on pharmacokinetics and precise medication dosing. Perhaps this was a product of “generative artificial intelligence” retrieving irrelevant information?

 

From the clinical perspective, each measure should be specified with the appropriate age window to define its denominator population – some measures are inherently designed for neonates, others for infants, others for young children, others for adolescents, others for young adults, others for older adults. Rather than evaluating whether “adult CQMs” (whatever those are?) can be extrapolated to children, the E&M process must consider whether the denominator population is appropriately defined for that measure, based on the best available evidence. My recommendation is to scrap this entire policy and start over, abandoning the artificial and illogical distinction between “adult CQMs” and “pediatric CQMs.” Instead, the policy question should be how developers should justify, and how the PQM should evaluate, claims related to the age spectrum for a measure, no matter what portion of the age spectrum a measure covers. Extrapolation from adolescents to young children, or from young adults to senior adults, is more problematic than extrapolating from 19-year-olds to 17-year-olds.

 

Third, the Appendix H2 policy raises several questions. These two sentences appear to contradict each other: “A new eCQM version of an existing endorsed [non-eCQM] measure is automatically considered to be endorsed. Given the distinct characteristics of an eCQM, the new measure must be separately evaluated and endorsed by Battelle.” The following sentence is completely unsupported by citations or empirical analyses: “For maintenance evaluations, performance gap and accountable entity-level reliability and validity must include data from a minimum of five sites within at least two EHR vendors. A minimum of five sites supports meaningful distributional analysis, enables the fitting of a quadratic curve, and reduces uncertainty with statistically valid insights—as demonstrated by the Rule of Five.” I have no idea what “rule of five” is being referred to here, but there is no evidence that five points “support meaningful distributional analysis,” let alone robustly estimate a beta-binomial ICC. In our extensive experience, at least 10 sites are necessary for these purposes. Many statisticians recommend at least 30 entities to generate robust ICC estimates.
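The instability described above can be illustrated with a small simulation. The sketch below (Python) uses a simple continuous-outcome, one-way ANOVA ICC as a stand-in for the beta-binomial case, with hypothetical variance parameters; it is illustrative only, not an analysis of any real measure.

import numpy as np

rng = np.random.default_rng(0)

def icc_estimate_interval(true_sd, noise_sd, n_entities, n_per_entity, n_sims=2000):
    """Middle 95% of one-way ANOVA ICC estimates for a given number of entities."""
    estimates = []
    for _ in range(n_sims):
        true_means = rng.normal(0.0, true_sd, size=n_entities)
        data = true_means[:, None] + rng.normal(0.0, noise_sd,
                                                size=(n_entities, n_per_entity))
        msb = n_per_entity * ((data.mean(axis=1) - data.mean()) ** 2).sum() / (n_entities - 1)
        msw = ((data - data.mean(axis=1, keepdims=True)) ** 2).sum() / (n_entities * (n_per_entity - 1))
        estimates.append((msb - msw) / (msb + (n_per_entity - 1) * msw))
    return np.percentile(estimates, [2.5, 97.5])

# True per-observation ICC is 1.0**2 / (1.0**2 + 2.0**2) = 0.20 in every scenario;
# only the number of entities changes.
for k in (5, 10, 30):
    lo, hi = icc_estimate_interval(true_sd=1.0, noise_sd=2.0,
                                   n_entities=k, n_per_entity=50)
    print(f"{k:>2} entities: middle 95% of ICC estimates spans [{lo:.2f}, {hi:.2f}]")

In this toy setup the interval is markedly wider with five entities than with thirty, consistent with the point that more entities are needed for stable estimates.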

 

Fourth, I support the proposed use of the Nieser and Harris (2024) method, which our team has extensively tested and validated, over the Adams (2009) method, for the reasons stated in Table H4 and acknowledged by Adams et al. (Hwang J, Adams JL, Paddock SM. Defining and estimating the reliability of physician quality measures in hierarchical logistic regression models. Health Serv Outcomes Res Methodol. 2021;21(1):111-130. doi:10.1007/s10742-020-00226-4) and others (https://www.medrxiv.org/content/10.1101/2023.01.07.22283371v1 ). However, this statement is unsupported: “Note: For any accountable entities that have less than 0.6 reliability estimates, developers must report both entity- and person-level descriptive statistics for those entities.” When applied to hundreds or thousands of accountable entities, the Nieser and Harris estimator often generates ICC estimates a bit below 0.6 for a small proportion of facilities with reasonable sample sizes. The CBE’s standard practice has been to evaluate the distribution of ICCs and accept measures for which at least 70% exceed 0.6 (see page 49). It is unclear what purpose would be achieved by requiring “person-level descriptive statistics” for dozens of entities in the 0.5-0.6 range (for example).

 

Fifth, in Appendix H5, this example seems problematic: “Example: A developer assesses patient falls measure with another indicator of patient safety, use of patient safety protocols. The developer found a correlation coefficient of -0.5… Although in the expected direction, this correlation is lower than expected based on the theoretical constructs that both measures aim to improve patient safety.” On what basis is this correlation “lower than expected”? In fact, this correlation can be interpreted as suggesting that 25% of all the variation in fall rates across entities can be attributed to just one process factor – i.e., the use of patient safety protocols. Considering that many processes, or process failures, contribute to patient safety outcomes such as falls, a correlation coefficient of 0.5 seems implausibly high. Observed correlations, based on empirical experience across many measures, are more typically in the 0.2-0.4 range, suggesting that any single process factor only explains about 10-15% of observed variation in outcome rates.
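Restating the variance-shared arithmetic above in standard notation (the 0.35 value is simply an illustrative midpoint of the 0.2-0.4 range cited):

\[
r = -0.5 \;\Rightarrow\; r^{2} = 0.25 \quad (\approx 25\% \text{ of variation shared}), \qquad
r = 0.35 \;\Rightarrow\; r^{2} \approx 0.12 \quad (\approx 12\%).
\]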

 

Sixth, the risk-adjustment sections of the Guidebook need comprehensive reassessment and updating to reflect currently accepted approaches to selecting features and estimating and validating risk-adjustment models. For example, there is no requirement for a set-aside test or validation data set for demonstrating “acceptable model performance,” and there are no standards for what level of performance is considered “acceptable.” The following language is problematic because it is unclear what justification (IF ANY) could possibly be acceptable: “If adjusting for differences in patient characteristics, a conceptual model is provided; AND the conceptual model includes patient characteristics that influence the measured outcome; AND are present at the start of care; AND the model does not include factors that are associated with differences or inequities in care unless justification is provided.” Specifically, the practice of including entity characteristics, such as provider volume, bed size, or geographic location, is essentially NEVER justifiable if the purpose of measurement is to compare performance across these entities. I encourage the use of directed acyclic graphs (DAGs) or similar methods drawn from causal inference to elucidate conceptual models for risk-adjustment.
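As a purely hypothetical illustration of the DAG suggestion (the variable names are invented for this sketch and are not drawn from the Guidebook), a conceptual risk-adjustment model can be written as a small directed graph and screened so that only patient factors present at the start of care are used as adjustors:

# Toy conceptual model for risk adjustment, expressed as a directed graph.
# Edges point from candidate causes to the measured outcome.
dag = {
    "age_at_start_of_care":   ["outcome"],
    "comorbidity_burden":     ["outcome"],
    "admission_acuity":       ["outcome"],
    "entity_quality_of_care": ["outcome"],  # the signal the measure is meant to capture
    "entity_bed_size":        [],           # entity characteristic, not an adjustor
}

# Adjustors = patient factors that point into the outcome; entity characteristics
# are deliberately excluded when the goal is to compare performance across entities.
adjustors = [v for v, edges in dag.items()
             if "outcome" in edges and not v.startswith("entity_")]
print(adjustors)  # ['age_at_start_of_care', 'comorbidity_burden', 'admission_acuity']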

 

Finally, the Appendix F argument for reducing the consensus threshold from 75% to 70% when the number of voters is less than 20 seems poorly justified for two reasons: (1) it should be a rather simple matter to ensure that each voting committee has at least 20 voting members, allowing a consistent approach to be applied across all committees; and (2) the actual threshold suggested in Table F1 appears to be 73.3%, which is only nominally less than 75%. In other words, the preferred solution to unreliable decision-making by small committees is to expand the committee to bring in additional voices, not to alter the threshold for consensus.

 

Thank you for the opportunity to offer these comments.

Name or Organization
Patrick S. Romano, MD MPH; University of California Davis Health

Submitted on Thu, 07/03/2025 - 13:31


Yale/CORE’s Response to…

Which aspect(s) of the guidebook are you commenting on?
General E&M processes and policies
Measure-specific guidance - pediatric, cost, or electronic clinical quality measures
Accountable-entity level testing guidance - reliability and validity

Yale/CORE’s Response to Battelle Guidebook Public Comment

Thank you for the opportunity to provide comments on the updated E&M Guidebook (version 3.0). https://p4qm.org/sites/default/files/Del-3-6-Endorsement-and-Maintenanc…

We appreciate Battelle’s approach of iterative refinement and improvement, for both processes and the technical requirements for measure endorsement. We feel that Battelle has been very responsive to developers’ concerns while at the same time improving the technical merits of endorsement. We also appreciate that Battelle extended the deadline for this public comment period to July 6th.

Below we provide comments on the E&M Guidebook as well as on the public comment process.

 

1. Webinar or Q&A before or during the public comment period.

We suggest that Battelle add a step in the public comment process that allows developers to ask clarifying questions prior to or during the public comment period. This would enhance the public comment process because developers would have the information they needed to provide meaningful feedback. An alternative to a webinar would be a Q&A session, which might reduce the burden on Battelle staff.

 

2. Provide 30 days for public comment.

We suggest that Battelle always allow 30 days of public comment following posting of the updated E&M Guidebook. This gives developers, who may also be managing measure-specific public comment and PA review during the current CBE cycle, sufficient time to review and consider the Guidebook changes.

 

3. eCQM Testing Requirements. On page 63 of the pdf, under eCQM requirements, the Guidebook states:

“For initial endorsement, developers/stewards must conduct feasibility and person- or encounter-level testing at a minimum of three sites within at least one EHR vendor.”  The E&M Guidebook, in footnote 19 defines sites as “distinct health care systems or organizations with separate ownership.” 

CORE’s Public Comment:

We appreciate and agree with the requirement for only one EHR vendor, as this reduces the burden on measure developers and testing partners without compromising testing integrity. 

We are concerned, however, about the requirement of three “sites” for reliability and particularly data element validity testing, with sites defined by Battelle based on ownership. While we agree the overall goal is to capture diversity of testing sites, consolidation of health systems will make it difficult and costly for developers to obtain data from three (or five) separate health systems/organizations with separate ownership. Ownership does not define hospital processes or quality of care; our own empirical analyses show that hospitals within the same system can have vastly different performance on eCQMs, reflecting their different underlying administrative and clinical processes.

We instead suggest that, rather than requiring a strict definition of “sites,” Battelle allow developers to justify empirically how the sites differ and how these differences provide the necessary diversity in testing data, which could vary by measure type (e.g., a hospital-wide measure vs. a narrower measure). Because most health care systems have grown by purchasing other hospital sites, Battelle could also consider defining the sites as two locations with different deployments of the same EHR vendor for reliability.

Requiring testing across three sites as currently defined has implications for measure development timelines, feasibility, and costs. The complexities of contracting with three separate health care systems or entities with different ownership will extend development timelines without increasing or improving measure validity. For data element validity testing specifically, our eCQM experts’ assessment is that it will be nearly impossible for developers to perform validity testing across three “sites” as currently defined. Because the gold standard is chart review compared against EHR data pulls, this is an extremely costly and time-consuming process for both measure developers and testing partners.

 

4. Closing Care Gaps: now required for all measures

Page 7 of the E&M Guidebook  states that the new “Closing Care Gaps” domain will be required for measures submitted in the Spring 2026 cycle.

CORE’s Public Comment:

CORE supports Battelle’s efforts to close gaps in care across subpopulations.  However, due to potential limitations on developer resources, we recommend that this section remain optional for future cycles.

We also have the following clarifications on this domain:

  • Is testing defined by, and limited to, populations identified in the conceptual model?
  • If there are no meaningful care gaps by subpopulation, does that mean the measure is un-endorsable?
  • What specific testing of measure scores does Battelle expect, in particular, for risk-adjusted measures without an existing stratification approach?

 

5. Reliability Guidance

Table H4 on page 66 of the E&M Guidebook provides recommendations for the specific reliability testing method for different types of measures. CORE has the following questions and requests for clarification on this guidance.

 

  • For the binomial and non-binomial categories:
    • Can you please clarify whether you are suggesting that developers using permutation sampling do the following: (1) calculate the ICC for all measured entities together, then (2) assume the ICC is the signal-to-noise reliability for an entity with median volume (from step 1), and then (3) use the Spearman-Brown prophecy formula to calculate entity-level signal-to-noise reliability? (Our reading of these steps is sketched illustratively after this list.)
    • Can you please provide a formula and an example for these calculations?
    • Can you please provide a citation for this approach?
  • For the non-binomial category:
    • The Nieser and Harris paper only refers to the use of the ICC. When using the SR instead of the ICC, do you recommend developers use the same sampling approach and derive signal-to-noise reliability in the same way as described by Nieser and Harris for the ICC? If so, do you have a recommendation for how developers should choose between the ICC and the SR?
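Our reading of these three steps, pending Battelle's clarification, is sketched below in illustrative Python; the function name, the overall ICC, the median volume, and the entity volumes are all hypothetical.

import numpy as np

def spearman_brown(r_ref, n_ref, n_entity):
    """Scale a reference reliability r_ref (observed at volume n_ref) to another
    entity volume n_entity using the Spearman-Brown prophecy formula."""
    k = n_entity / n_ref
    return (k * r_ref) / (1.0 + (k - 1.0) * r_ref)

# Step 1 (assumed): an overall ICC estimated across all measured entities,
# e.g., via a permutation/split-sample approach.
icc_overall = 0.55            # hypothetical value

# Step 2 (assumed): treat that ICC as the signal-to-noise reliability of an
# entity with the median case volume.
median_volume = 120           # hypothetical median denominator size

# Step 3 (assumed): project reliability to each entity's own volume.
entity_volumes = np.array([30, 80, 120, 250, 600])
print(np.round(spearman_brown(icc_overall, median_volume, entity_volumes), 2))
# Smaller entities yield lower projected reliability; the median-volume entity
# recovers the overall ICC by construction.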

       

 

Name or Organization
Yale/CORE

Submitted on Thu, 07/03/2025 - 17:52


Comments on E&M Guidebook Version 3.0

Which aspect(s) of the guidebook are you commenting on?
General E&M processes and policies
Measure-specific guidance - pediatric, cost, or electronic clinical quality measures
Accountable-entity level testing guidance - reliability and validity
PQM Measure Evaluation Rubric

The Outpatient Measures team at Acumen, LLC (hereafter referred to as Acumen) thanks the CBE entity for the opportunity to provide feedback on the Endorsement and Maintenance (E&M) Guidebook Version 3.0, published by the Partnership for Quality Measurement (PQM) in June 2025. Our comment (see the included attachment) is structured into two sections: 1) a discussion of the improvements made to the E&M process covered in the latest Guidebook, and 2) a discussion of several remaining areas for improvement that we believe the CBE entity should consider addressing prior to finalization.

Name or Organization
Acumen, LLC - OP Measures Support Team