Explainable AI for ECGs

Explainable Artificial Intelligence for 12-Lead Electrocardiogram Interpretation

Read the full paper @ JAMA Cardiology 

 

J. Weston Hughes, Jeffrey E Olgin, Robert Avram, Sean A Abreau, Taylor Sittler, Kaahan Radia, Henry Hsia, Tomos Walters, Byron Lee, Joseph E Gonzalez, Geoffrey H Tison

Tison Lab @ UCSF  |  RISE Lab @ UC Berkeley

 

Electrocardiograms (ECGs) are the most common cardiovascular test worldwide. Millions of clinicians rely every day on automated preliminary ECG interpretation to assist in diagnosing a wide range of cardiac diseases, from urgent heart attacks to abnormalities of cardiac rhythm, electrical conduction, or structure.

In a study published in JAMA Cardiology, we developed a convolutional neural network (CNN) that could be trained using commonly available ECG data and diagnosis labels, and we implemented an explainability method that assists physicians to understand why the algorithm makes its diagnosis.

 

Our work shows that readily available ECG data can be used to train a CNN that outperforms a common commercial algorithm and is comparable to expert cardiologists for many diagnoses, with some exceptions. The LIME explainability technique also allows the CNN to highlight physiologically-relevant ECG segments that contribute to each CNN diagnosis.
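As a rough illustration of the kind of model involved (not the study's actual architecture; the layer sizes and names below are assumptions), a 1-D CNN for multilabel 12-lead ECG classification might look like:

```python
import torch
import torch.nn as nn

class ECGConvNet(nn.Module):
    """Illustrative 1-D CNN: 12 input leads -> 38 diagnosis logits."""
    def __init__(self, n_leads=12, n_classes=38):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(n_leads, 32, kernel_size=7, padding=3),
            nn.BatchNorm1d(32), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(32, 64, kernel_size=5, padding=2),
            nn.BatchNorm1d(64), nn.ReLU(), nn.MaxPool1d(4),
            nn.AdaptiveAvgPool1d(1),   # collapse the time axis
        )
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x):              # x: (batch, 12, n_samples)
        h = self.features(x).squeeze(-1)
        return self.classifier(h)      # one logit per diagnostic class

# One 10-second ECG sampled at 500 Hz -> 5000 samples per lead
logits = ECGConvNet()(torch.randn(1, 12, 5000))
```

Because one ECG can carry several diagnoses at once, each of the 38 output logits would be passed through a sigmoid and thresholded independently rather than through a softmax.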


Hundreds of millions of ECGs are obtained annually worldwide. Every day, millions of healthcare providers and patients rely on automated ECG interpretation algorithms.

Most existing commercial 12-lead ECG algorithms work by applying known disease-specific criteria to ECGs.

 

Machine learning algorithms instead take a “data-driven” approach, learning from large amounts of labeled data. This enables algorithms to be trained for ECG diagnoses where explicit criteria do not currently exist, can achieve higher accuracy, and enables continual improvement as more data becomes available.

Machine learning algorithms could powerfully complement existing ECG automated analysis with a data-driven paradigm, but to date we have lacked comparisons against clinical standards-of-care.


To develop our CNN algorithm, we specifically designed it to accept and train on ECG data that is readily available in most institutions: namely, XML-format ECG waveform data and cardiologist-confirmed text diagnosis labels.
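As a sketch of what ingesting such data can look like, base64-encoded waveforms can be decoded into one array per lead. The tag names below are hypothetical; real MUSE-style XML exports vary by vendor and version and also carry scaling metadata:

```python
import base64
import numpy as np
import xml.etree.ElementTree as ET

def parse_ecg_xml(xml_text):
    """Parse a (hypothetical) MUSE-style XML export into a (n_leads, n_samples) array.

    Tag names here are illustrative assumptions, not a real schema.
    """
    root = ET.fromstring(xml_text)
    leads = []
    for wf in root.iter("LeadData"):
        raw = base64.b64decode(wf.findtext("WaveFormData"))
        leads.append(np.frombuffer(raw, dtype=np.int16))
    return np.stack(leads)

# Build a tiny synthetic 12-lead example to exercise the parser
lead_xml = "".join(
    "<LeadData><WaveFormData>%s</WaveFormData></LeadData>"
    % base64.b64encode(np.arange(4, dtype=np.int16).tobytes()).decode()
    for _ in range(12)
)
ecg = parse_ecg_xml("<ECG>%s</ECG>" % lead_xml)
```

The returned integer samples would still need to be rescaled to millivolts using the export's calibration fields before analysis.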

 


By using commonly available ECG data, we aspired to demonstrate both what can be achieved in most institutions and what could be eventually achieved by combining cross-institutional data.


 

We obtained ~1 million adult 12-lead ECGs from the University of California, San Francisco (UCSF) from 2003-2018.

As part of routine care, all ECGs had undergone preliminary analysis with a commercial ECG algorithm called MUSE (GE Healthcare) and also received a cardiologist clinical diagnosis—the final clinical ECG interpretation in most healthcare workflows.

 

In the first large-scale validation (in 91,440 ECGs from 32,576 patients) against the clinically-accepted standard of cardiologist-confirmed ECG diagnoses, the CNN had high area under the receiver operating characteristic curve (AUC) of ≥0.960 for 32/38 (84%) diagnostic classes.

Exceptions included ectopic atrial rhythm, nonspecific interventricular conduction delay, prolonged QT and posterior infarct (AUCs 0.937-0.959); ST elevation and lead misplacement had AUCs of 0.870 and 0.841, respectively.

Sensitivity in Table 1 is shown at a fixed threshold where specificity = 0.90, and vice versa.
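These metrics are standard and easy to reproduce on synthetic data. A sketch using scikit-learn (assumed available) of computing AUC and sensitivity at the threshold where specificity is closest to 0.90:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 1000)            # synthetic binary labels
y_score = y_true * 0.6 + rng.random(1000)    # mildly informative scores

auc = roc_auc_score(y_true, y_score)

# Sensitivity (TPR) at the operating point where specificity (1 - FPR)
# is closest to 0.90
fpr, tpr, thresholds = roc_curve(y_true, y_score)
idx = np.argmin(np.abs((1 - fpr) - 0.90))
sens_at_spec90 = tpr[idx]
```

Reporting sensitivity at a fixed specificity (and vice versa) picks a single point on the ROC curve, which is what makes per-class numbers comparable across diagnoses.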


We then performed a second validation of the algorithm against a Consensus Committee Diagnosis, in which a committee of 3 electrophysiologists provided diagnoses by consensus.

In addition to providing a second validation, this enabled a relative comparison, by F1 score, between the CNN, cardiologist clinical, and (unedited) MUSE diagnoses.

In all 5 diagnostic categories, the CNN had higher frequency-weighted mean F1 scores than both cardiologist clinical and MUSE diagnoses. The CNN demonstrated AUCs of ≥0.910 for 32/38 (84%) individual diagnostic classes.
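The frequency-weighted mean F1 score averages each class's F1, weighted by how often that class appears in the reference labels. A toy multilabel sketch with scikit-learn (labels invented for illustration):

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy multilabel setup: 5 ECGs x 3 diagnostic classes,
# reference (e.g. consensus) labels vs. predicted labels
y_true = np.array([[1,0,0],[1,1,0],[0,0,1],[1,0,0],[0,1,0]])
y_pred = np.array([[1,0,0],[1,0,0],[0,0,1],[1,0,1],[0,1,0]])

# 'weighted' averages per-class F1 scores, weighted by each class's
# frequency (support) in y_true
weighted_f1 = f1_score(y_true, y_pred, average="weighted")
```

Frequency weighting keeps rare classes from dominating the summary metric, which matters when diagnostic classes are highly imbalanced, as in clinical ECG data.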


We examined the co-occurrence of diagnoses between cardiologist clinical and CNN-predicted diagnoses to understand the relationship between the various diagnostic classes predicted by the CNN.

► Overall, the CNN mirrored patterns of diagnosis co-occurrence exhibited by cardiologists.
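A co-occurrence matrix of this kind can be computed directly from a binary label matrix; a toy sketch (labels invented for illustration):

```python
import numpy as np

# Rows = ECGs, columns = diagnostic classes (1 if diagnosis present)
labels = np.array([[1, 1, 0],
                   [0, 1, 1],
                   [1, 0, 0],
                   [1, 1, 0]])

# Class-by-class counts of how often two diagnoses appear on the same ECG;
# the diagonal holds each class's total count
cooccurrence = labels.T @ labels
```

Computing this matrix once from cardiologist labels and once from CNN predictions, then comparing the two, is a simple way to check whether the model reproduces clinically expected diagnostic relationships.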


We applied an explainability technique to the CNN called Local Interpretable Model-agnostic Explanations (LIME).

LIME highlights which ECG segments in which specific leads drive particular CNN diagnoses, as learned by the CNN in a data-driven manner from the training data.
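The core LIME idea is to perturb the input, query the model, and fit a local linear surrogate whose coefficients score each region's contribution. A simplified single-lead sketch (the real method also weights perturbed samples by proximity to the original and uses sparse regression; this toy version omits both):

```python
import numpy as np

def lime_segments(signal, predict_fn, n_segments=10, n_samples=200, seed=0):
    """LIME-style segment attribution for a 1-D signal (illustrative sketch).

    Randomly masks contiguous segments, queries the model on each
    perturbed signal, then fits a linear surrogate whose coefficients
    score each segment's contribution to the prediction.
    """
    rng = np.random.default_rng(seed)
    seg = np.array_split(np.arange(signal.size), n_segments)
    masks = rng.integers(0, 2, (n_samples, n_segments))
    preds = np.empty(n_samples)
    for i, m in enumerate(masks):
        x = signal.copy()
        for j, on in enumerate(m):
            if not on:
                x[seg[j]] = 0.0        # "remove" this segment
        preds[i] = predict_fn(x)
    # Least-squares linear surrogate: segment mask -> model output
    coef, *_ = np.linalg.lstsq(
        np.column_stack([masks, np.ones(n_samples)]), preds, rcond=None)
    return coef[:n_segments]           # one importance score per segment

# Toy "model" that only looks at the middle of the signal
signal = np.random.default_rng(1).standard_normal(500)
importances = lime_segments(signal, lambda x: x[200:300].sum())
```

Because the toy model depends only on samples 200–300, the surrogate assigns essentially all importance to the two segments covering that span, which is exactly the behavior used to highlight diagnosis-driving ECG segments.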

ECG leads not shown were not highlighted by LIME

In many cases, LIME highlighted ECG segments that correspond to well-understood physiologic correlates of the disease underlying each diagnosis. This shows that the CNN learned patterns in the data that "make sense."

LIME explainability for CNN ECG analysis could improve clinician confidence in automated analysis while supporting an optimal human-machine interaction that provides greater context for incorporating algorithmic diagnoses into clinical decisions.


The enormous potential of data-driven machine learning in medicine lies, in part, in its ability to help drive discovery from large quantities of data. CNN explainability techniques like LIME help accomplish this by identifying patterns too subtle to be visually recognized, or drawn from datasets too large for manual review.

This becomes especially powerful when combined with large cross-institutional datasets and for diseases whose recognized ECG correlates are incomplete or where ECG criteria are unknown. 

This presents a compelling motivation to start working towards overcoming the formidable administrative obstacles that currently limit large-scale, multi-institutional data sharing in medicine.


In our present medical era, where most medical data is digitized, the barriers to developing algorithms that can truly support a learning health system are no longer technical, but instead are largely administrative and regulatory.

 

Here we specifically prioritized developing a CNN that accepts and trains on digital ECG data “as is” without requiring additional annotation, which is often prohibitive to perform at large scale.

This unlocks the potential to use existing data from hundreds of millions of ECGs in institutions worldwide to continually improve CNN performance. 

ECG analysis systems that consistently achieve expert-level performance could allow delegation of certain ECG diagnoses to algorithms, focusing human expertise on difficult diagnoses.

As AI-augmented automated analysis improves, the common practice of having cardiologists confirm all ECGs regardless of diagnosis—which is subject to human fatigue—could be revisited.

The data-driven paradigm of machine learning makes it possible to predict new diagnoses beyond what physicians can currently perform with ECGs, as we have previously demonstrated.

This raises the possibility to expand the diagnostic utility of ECGs beyond their present scope, while also potentially learning new physiologic correlates of disease through AI explainability.

 

Read the full paper @ JAMA Cardiology

Email questions or inquiries here.