Appl Clin Inform 2024; 15(05): 1074-1079
DOI: 10.1055/a-2407-1272
Special Topic on Teaching and Training Future Health Informaticians

Teaching Data Science through an Interactive, Hands-On Workshop with Clinically Relevant Case Studies

Authors

  • Alvin D. Jeffery

    1   Department of Biomedical Informatics, Vanderbilt University, Nashville, Tennessee, United States
  • Patricia Sengstack

    2   Department of Informatics, Vanderbilt University School of Nursing, Nashville, Tennessee, United States
 

Abstract

Background In this case report, we describe the development of an innovative workshop to bridge the gap in data science education for practicing clinicians (and particularly nurses). In the workshop, we emphasize the core concepts of machine learning and predictive modeling to increase understanding among clinicians.

Objectives Health care providers have had limited exposure to leveraging and critiquing data science methods. This interactive workshop aims to provide clinicians with foundational knowledge in data science, enabling them to contribute effectively to teams focused on improving care quality.

Methods The workshop focuses on topics meaningful to clinicians, such as model performance evaluation, and introduces machine learning through hands-on exercises using free, interactive Python notebooks. Clinical case studies on sepsis recognition and opioid overdose death provide relatable contexts for applying data science concepts.

Results Positive feedback from over 300 participants across various settings highlights the workshop's effectiveness in making complex topics accessible to clinicians.

Conclusion Our approach prioritizes engaging content delivery and practical application over extensive programming instruction, aligning with adult learning principles. This initiative underscores the importance of equipping clinicians with data science knowledge to navigate today's data-driven health care landscape, offering a template for integrating data science education into health care informatics programs or continuing professional development.


Background and Significance

Educating clinicians who have a desire to learn from and enhance their work through data is an ongoing and evolving process. As technology advances and data analytics skills become increasingly needed, our clinical workforce must possess foundational knowledge and skills to critically evaluate whether and how data science applications could inform care. With the current influx and popularity of artificial intelligence (AI), it is important to prepare a workforce that can influence how these tools are integrated into clinical practice. Preparing the next generation of clinicians to understand core concepts of data science requires training that addresses emerging methods for analyzing large amounts of data to understand phenomena and make actionable predictions that can improve clinical outcomes.[1] Because the vast majority of our health care workforce completed academic training before these recent advances, a significant barrier to equipping clinicians with the necessary knowledge lies in the need to teach data science concepts to clinicians who are actively practicing. Crafting meaningful and engaging learning content for an audience with no background in data science or machine learning (ML) methods requires a carefully executed strategy that aligns with principles of adult learning.[2] To address this need, we created and implemented a data science workshop using the iterative instructional design concepts of the ADDIE model (analyze, design, develop, implement, and evaluate) and have offered it to nurses (and other clinical disciplines) at conferences and in academic settings.[3] Leveraging the ADDIE model, we use formal and informal evaluations to continuously improve the learning activity's content and format and provide an engaging and informative experience for learners.


Objective

Existing data science education and training for clinicians is limited, particularly for nurses. While nursing school curricula may include basic statistics or the use of common spreadsheet tools (e.g., Microsoft Excel), few academic programs train health care providers in the concepts of ML, predictive modeling, AI, and other data science methods. We sought to develop an interactive experience where nurses with an interest in informatics could become familiar with key data science concepts and be exposed to introductory scientific computing skills. The intent of this training is not to transform clinicians into data scientists but rather to give them a solid foundation in the field that allows them to ask the right questions as part of a team driving clinical transformation. In this case report, we describe a data science workshop that we developed and have conducted with clinicians, most of whom are nurses with no background in software programming or data science.


Methods

Content

Because the field of data science is quite broad, we wanted to focus on the topics that would be most meaningful to practicing nurses, who are unlikely to routinely build ML models in the real world. To this end, we emphasized evaluation of model performance more than any other topic. Topics included how to calculate and evaluate sensitivity, positive predictive value, the area under the receiver operating characteristic curve, and F1 scores. We also introduced the bias/variance trade-off and strategies for preventing overfitting.
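To make these calculations concrete, the short sketch below computes the headline metrics from a confusion matrix; the counts are illustrative numbers chosen for this example, not workshop data.

```python
# Illustrative confusion-matrix counts for a hypothetical sepsis alert
# (made-up numbers for this example, not workshop data)
tp, fp, fn, tn = 40, 60, 10, 890

sensitivity = tp / (tp + fn)   # recall: share of true sepsis cases flagged
specificity = tn / (tn + fp)   # share of non-sepsis cases correctly left alone
ppv = tp / (tp + fp)           # precision: share of alerts that were true sepsis
f1 = 2 * ppv * sensitivity / (ppv + sensitivity)  # harmonic mean of precision and recall

print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f} "
      f"ppv={ppv:.2f} f1={f1:.2f}")
# sensitivity=0.80 specificity=0.94 ppv=0.40 f1=0.53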

After the in-depth discussion of evaluating model performance, we guided learners to train and tune multiple ML models on the data. For those interested in learning more, particularly about data preprocessing (the most time-consuming aspect of a data science project), we provided an additional notebook for self-study. [Table 1] contains an outline of the content covered in the workshop. While this order of content is reversed from an actual data science project, we found that starting with preprocessing was so tedious and unexciting that many learners lost motivation and became disengaged (see the Lessons Learned column in [Table 2]).
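As an illustration of what this train-and-tune step can look like in a notebook, the following scikit-learn sketch fits and tunes two candidate models on a synthetic stand-in dataset; the model choices and hyperparameter grids are ours for illustration, not necessarily those used in the workshop.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the workshop data: 10 features, ~10% positive class
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.9], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

candidates = {
    "logistic regression": (LogisticRegression(max_iter=1000),
                            {"C": [0.01, 0.1, 1, 10]}),
    "random forest": (RandomForestClassifier(random_state=42),
                      {"max_depth": [3, 5, None]}),
}

for name, (model, grid) in candidates.items():
    # 5-fold cross-validated grid search tunes hyperparameters while
    # guarding against overfitting to any single train/validation split
    search = GridSearchCV(model, grid, cv=5, scoring="f1")
    search.fit(X_train, y_train)
    print(name, search.best_params_,
          f"test F1={search.score(X_test, y_test):.2f}")
```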

Table 1

Workshop content outline

Synchronous content:

• Introduction:

 o Using interactive notebooks

 o Basics of Python programming

• Evaluating model performance:

 o Review the performance of a “simple” sepsis prediction model

 o Calculating sensitivity (recall), positive predictive value (precision), and F1 scores

• Improving model performance:

 o How manually adjusting alert thresholds influences performance

 o How manually adjusting parameters of the “simple” sepsis prediction model influences performance

• From manual learning to machine learning:

 o Visual overview of how algorithms work conceptually

 o Building models with Python's scikit-learn package

 o Bias versus variance and validation approaches

 o Hyperparameter tuning

Asynchronous content:

• Preworkshop (required):

 o Two brief videos covering introductory concepts

• Postworkshop (optional):

 o Data cleaning: reading in data files; categorical vs. continuous variables; quality checks; feature engineering; descriptive statistics; data visualization, including geographic mapping

 o Modeling techniques: the synchronous modeling content, applied to the opioid case study

Table 2

Overview of venues, audiences, and lessons learned over time

June 2018 | NKBDS Pre-Conference Workshop, Minneapolis, MN (one 6-hour day) | ∼50 nurse informaticists
Lessons learned: Installing software on attendees' personal computers is time-consuming. Using web-based platforms requires robust internet connections with large audience sizes.

June 2019 | NKBDS Pre-Conference Workshop, Minneapolis, MN (one 6-hour day) | ∼50 nurse informaticists
Lessons learned: Comprehending model development and evaluation concepts is cognitively challenging when they are introduced in the second half of the workshop.

November 2020 | AMIA Annual Symposium Workshop, virtual (one 6-hour day) | ∼40 nurse informaticists
Lessons learned: Having too many users on a single, shared compute instance results in excessive memory usage. Duplicating the data and notebooks into one's own compute instance is more robust.

February 2021 | VUSN's Nursing Informatics MSN Workshop, virtual (two 4-hour days) | 10 graduate nursing informatics students
Lessons learned: A platform like DeepNote works better for virtual workshops because facilitators can easily view and edit the learner's notebook. Asking participants to have two monitors available improves the experience because they can view the facilitator's screen and their own simultaneously. While the sepsis case study was beneficial, the intuition/motivation behind the need for machine learning (compared with a simple scoring system) was missing.

June 2021 | NKBDS Pre-Conference Workshop, virtual (one 3-hour day) | ∼50 nurse informaticists
Lessons learned: Focusing on model development and evaluation first (and making data preprocessing techniques optional) improved learners' engagement. The second workshop using DeepNote confirmed for us that a web-based platform allows users to participate more easily without being limited by technical challenges in downloading/installing software.

December 2021 | VUSN's PhD cohort, virtual (one 2-hour session) | 5 PhD students in their second year
Lessons learned: Even though these students were in an advanced quantitative methods course, 2 hours is grossly insufficient to cover the content in this format.

February 2022 | VUSN's Nursing Informatics MSN Workshop, Nashville, TN (two 4-hour days) | 10 graduate nursing informatics students
Lessons learned: Asking participants to manipulate the “simple” sepsis prediction model in small groups provided them with the intuition needed to understand how machine learning has significant advantages over simple systems.

February 2023 | VUSN's Nursing Informatics MSN Workshop, Nashville, TN (two 4-hour days) | 11 graduate nursing informatics students
Lessons learned: An additional workshop day would be helpful to allow for more questions as content is covered.

February 2024 | VUSN's Nursing Informatics MSN Workshop, Nashville, TN (three 4-hour days) | 9 graduate nursing informatics students
Lessons learned: Three 4-hour days is a great amount of time for nurses new to machine learning, as it allows for covering concepts at a pace where learners can ask questions and have ample time for hands-on practice. Additionally, there is time to informally discuss topics related to clinical implementation and conceptual aspects of biased data.

Abbreviations: MSN, Master of Science in Nursing; NKBDS, Nursing Knowledge: Big Data Science; VUSN, Vanderbilt University School of Nursing.



Format

All workshop events have used interactive Python programming “notebooks” ([Fig. 1]) that allow learners to read content, take notes, and execute software code simultaneously. Currently, we offer the workshop through a freely available interactive Python notebook platform (specifically, DeepNote, though several other platforms are available) that allows remote interaction between individuals (similar to a Google Drive document) and does not require downloading or installing any software. We have offered the synchronous workshop both in person and remotely, and for the remote version we recommend learners have at least two monitors so they can watch the facilitator's demonstrations on one monitor and practice on the other. To help learners explore performance metrics interactively, we also provide an Excel file containing a confusion matrix that can be manipulated with new values; as values are updated, the change in performance metrics can be seen instantaneously. Finally, we provide a “Key Takeaways” document (see Supplemental Materials for the student version and answer key) to help learners focus on the most essential concepts and retain a reference for the future. [The most recent DeepNote notebook from February 2024, containing all scripts, data, and supplementary files, is available at http://tinyurl.com/dsw24.]

Fig. 1 Partial view of an interactive programming notebook.

Clinical Case Studies

We developed two case studies to provide relatable clinical contexts: sepsis recognition and opioid overdose death. For sepsis recognition, we created a completely synthetic dataset with known distributions and associations, ensuring that learners would make the discoveries we intended and that those discoveries made clinical sense. For opioid overdose death, we created realistic but synthetic data using MITRE's Synthea program.[4]
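The sketch below illustrates the general idea of generating synthetic data with known associations; the variable names, distributions, and effect sizes are hypothetical stand-ins, not the workshop's actual generating process.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000

# Draw vital signs and labs from chosen distributions
data = pd.DataFrame({
    "heart_rate": rng.normal(85, 15, n).round(),
    "wbc_count": rng.normal(9, 3, n).round(1),
    "temperature": rng.normal(37.2, 0.8, n).round(1),
})

# Bake in the associations learners should rediscover: tachycardia,
# leukocytosis, and fever each raise the odds of the sepsis label
logit = (-6 + 0.03 * data["heart_rate"]
         + 0.15 * data["wbc_count"]
         + 0.50 * (data["temperature"] - 37))
data["sepsis"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)
```

Because the generating process is known, a model trained on such data should rediscover exactly the relationships that were baked in, keeping the exercise clinically sensible and predictable for teaching.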

In preparation for the sepsis case study, we encourage learners to watch two 15-minute videos prior to the workshop. The videos introduce learners to a “sepsis prediction model” (based on an arbitrary point system) so they encounter some concepts (e.g., sensitivity, specificity) before the workshop. During the workshop, the concepts are reinforced as the learners work to build a better sepsis prediction model. First, learners work in small groups to manually modify the 10-variable (e.g., heart rate, white blood cell count, temperature) sepsis model by assigning more or fewer points (i.e., weight) to each variable and/or changing the thresholds at which each variable is assigned a point; we refer to this as Manual Learning. We use this exercise to illustrate that it would be impossible to manually evaluate every possible combination and that the benefit of ML is leveraging the computer to build a better model.
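A minimal sketch of what such a point-based Manual Learning model can look like appears below; the variables, cutoffs, and point values are illustrative, not the exact model used in the workshop.

```python
# One rule per variable: (threshold, points awarded when exceeded).
# Variables, cutoffs, and points are illustrative, not the exact model.
rules = {
    "heart_rate": (100, 1),
    "wbc_count": (12, 1),
    "temperature": (38.0, 1),
}
alert_threshold = 2  # total points at which the "model" fires an alert

def sepsis_score(patient, rules):
    """Sum the points for every rule whose threshold the patient exceeds."""
    return sum(points for var, (cutoff, points) in rules.items()
               if patient[var] > cutoff)

patient = {"heart_rate": 112, "wbc_count": 14.5, "temperature": 37.4}
score = sepsis_score(patient, rules)
print(score, "ALERT" if score >= alert_threshold else "no alert")  # 2 ALERT
```

Learners adjust the entries in rules and the alert threshold by hand and observe how sensitivity and PPV shift, which quickly reveals how enormous the space of possible weight and threshold combinations is.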

In the opioid overdose case study, a greater breadth of data preprocessing steps is covered, with less depth on model development and validation than in the sepsis case study. We used the opioid overdose case study for the first several iterations of the workshop because we wanted to expose learners to all aspects of the prediction-focused data science process. However, workshop participants told us that when the entire process was condensed into such a brief period, the model performance evaluation concepts presented at the end were difficult to comprehend. Therefore, we now provide the opioid overdose case study as a postworkshop learning activity for participants who would like to learn more concepts and/or apply the model development and evaluation learnings from the sepsis case study to a new clinical condition.
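For readers curious about the preprocessing content, the sketch below shows the kinds of pandas steps the optional notebook walks through; the file name and column names are hypothetical placeholders.

```python
import pandas as pd

# Reading in data files (file and column names are hypothetical placeholders)
df = pd.read_csv("opioid_overdose_synthetic.csv")

# Categorical vs. continuous variables
df["sex"] = df["sex"].astype("category")

# Quality checks: flag implausible values before modeling
assert df["age"].between(0, 120).all(), "implausible ages found"

# Feature engineering: e.g., any opioid prescription in the prior 90 days
df["recent_opioid_rx"] = (df["days_since_last_rx"] <= 90).astype(int)

# Descriptive statistics
print(df.describe(include="all"))
```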


Evaluation

Feedback from each workshop has been used to iteratively improve the content and delivery. Because of the diversity of evaluation methods used across the multiple formats and audiences ([Table 2]), we are unable to provide longitudinal evaluation data. However, for the workshops offered to graduate nursing informatics students from 2022 through 2024 (30 students), we conducted a survey at the workshop's completion asking how they might use the information learned and how we could improve next time.



Results

The workshop has been offered to at least 250 attendees across 4 national scientific meetings with nurses from multiple backgrounds (∼4–6 hours in duration), 4 Master's-level Nursing Informatics cohorts at Vanderbilt University School of Nursing (6–12 hours), and 1 PhD cohort at Vanderbilt University School of Nursing (2 hours).

In the earliest versions of the workshop (starting in 2018), fewer easily accessible notebook platforms were available, and we spent the first 1 to 2 hours of a workshop ensuring everyone could download and install software. The use of a free online platform (starting in 2020) that requires only creating an account, with no software download, has significantly improved learners' experiences by reducing the time needed to prepare for active participation.

Overall, the general feedback from all settings and audiences has been positive. Learners share that after participating in the workshop, they now possess a better understanding of the concepts and plan to use the newly gained knowledge in their work settings. Responses by nursing informatics students to the postworkshop evaluation included comments such as, “I find this information immensely valuable and intend to apply it upon returning to work tomorrow, particularly in the context of the ongoing [quality improvement] processes in which I am actively engaged” and “This data workshop helps me gain a lot of clarity about processing in our day-to-day health care information. Now I know how to think critically when assessing a predictive model, understand the pathway of data,” and “Blown away from all the content and can't wait to get back home to further explore.” Overall, the majority of students articulated that they now feel more confident in understanding how predictive models work and more comfortable in knowing the right questions to ask when evaluating effectiveness. There is widespread gratitude for the opportunity to learn these emerging topics in a way that makes sense to nurses and gives them the ability to ask the right questions when assessing the performance and potential benefits of a predictive model being used in their health care systems.

At the completion of the workshops conducted for the nursing informatics students from 2021 to 2024 (n = 40), the feedback again centered on appreciation for making the concepts understandable and for being able to use the knowledge to evaluate models in use or under consideration for adoption. The sepsis case study was very well received because learners are asked to “build your own algorithm” based on clinical knowledge, after which we illustrate how ML can simplify the task while remaining interpretable. This exercise increased engagement because it was meaningful to their daily practice while immediately demonstrating the benefit of ML. Additional quantitative descriptive information and accompanying free-text comments from respondents are provided in [Table 3].

Table 3

Summary of responses from the structured survey provided to participants in the Nursing Informatics MSN program from 2022 to 2024

Did you feel that we prepared you with enough baseline knowledge for this workshop?
• Yes, I felt I was prepared with enough baseline knowledge (n = 15)
• I had some knowledge, but still felt a bit lost (n = 15)
Related comments from respondents:
• “Having a glossary of terms right at the beginning would be helpful - going from learning about sensitivity/PPV to referring to them under different terms was a little confusing.” (2022)
• “My brain comprehended Rapid Minder much easier with visuals. Maybe more break out groups to discuss and figure things out on our own would have been helpful rather than being told to run code.” (2022)
• “[P]roviding a cheat sheet containing key Python terms prior to the session could be helpful for participants to familiarize themselves with the language beforehand.” (2024)

Were you able to access and download the tools required for the workshop?
• Yes (n = 29)
• No (n = 1)
Related comments from respondents: none provided

Was your computer able to adequately process the exercises?
• Yes (n = 27)
• No (n = 3)
Related comments from respondents:
• “Running Zoom and RapidMiner at the same time was tough on my computer and slowed down my audio/video settings; I had to watch along with Dr. Jeffrey instead of doing the exercises on my computer.” (2022)

How was the pace of the workshop?
• About right (n = 28)
• Too fast (n = 2)
Related comments from respondents:
• “I think it could be spaced out over 3 days instead of 2. Give the students some more time to digest.” (2023)


Discussion

Our interactive workshop to teach data science knowledge and skills fills a gap in current informatics education. Many informatics students will not become data scientists, but we believe all of them should grasp key data science concepts and, at a minimum, be exposed to how data scientists use scientific computing to elicit information from data. A foundational knowledge of data science empowers nurses in informatics roles to assess clinical decision support systems and other predictive tools embedded within electronic health records, with the ultimate aim of improving patient care and outcomes. With many workshop attendees highlighting the immediate applicability of the material, we are confident that we have fostered a foundation for ongoing data science learning among our participants. While some learners have been inspired to pursue more advanced data science training, the vast majority become equipped to critically evaluate data science activities and imagine new possibilities for how data science methods could help them solve health care challenges.

Many data science courses for general audiences begin with teaching computer science and programming basics, progress to preprocessing data, and conclude with model development and evaluation.[5] [6] Other than starting with some programming basics so that learners can use the notebooks, our approach mostly reverses this sequence. Our approach aligns with the adult learning principle that learners benefit from understanding the rationale for what they are trying to learn. By starting with model evaluation in the context of how this could influence one's clinical care activities, our learners stay more engaged and willing to explore some of the upstream activities that would precede model development. By making our materials freely available for others to use, we hope to empower continued training in the tools that will drive clinical transformation.

Our approach to teaching data science in this way also has its limitations. While we have provided extensive annotation and description, for the workshop to be fully reproducible, a facilitator with data science knowledge is necessary to assist with more complex questions and to perform occasional troubleshooting. While the material should be relevant and approachable for all health care professionals, we have only facilitated workshops for nurses. Finally, while the platforms we use are currently available for free, we have no guarantee they will remain free; however, our content and approach can be easily adapted to other notebook platforms.


Conclusion

An interactive approach to introduce nurses to data science concepts and skills is needed within health informatics educational programs. We have iteratively developed a hands-on workshop that leverages two clinically relevant case studies to facilitate learners becoming more comfortable discussing data science methods and critiquing others' results. Applying data science concepts to clinical practice is key to enhancing care delivery and ultimately improving patient outcomes. By equipping clinicians and nurses with these skills, we aim to foster a health care environment that is data driven, leading to improved health outcomes and efficient care delivery.


Clinical Relevance Statement

Clinicians should understand fundamental data science concepts in today's health care environment if they hope to improve health and health care within systems or populations. Our interactive, relatively brief workshop leverages adult learning principles to provide direct care nurses with key data science information that fills a gap in current academic programs.


Multiple-Choice Questions

  1. Practicing clinicians who want to learn about data science are most likely to pursue which learning format?

    • (a) Pursuing a Doctor of Philosophy (PhD) degree

    • (b) Attending a brief workshop with clinical case studies

    • (c) Shadowing a data scientist for 1 month

    • (d) Online learning module covering ML theory

    Correct Answer: The correct answer is option b. While all these options are opportunities to learn more about data science, (b) is the most likely to be of interest to practicing clinicians. (a) would require years of investment and would likely result in a career change. (c) is a significant time commitment that might not be feasible for most clinicians; additionally, it would require knowing a data scientist who is also a decent educator. (d) focuses on theory when many clinicians prefer to engage in clinically relevant learning activities.

  2. There are several datasets commonly used to teach data science methods. Which of the following datasets is most likely to keep clinicians engaged in learning?

    • (a) Titanic—who will survive?

    • (b) Credit Card Fraud—which transaction is fraudulent?

    • (c) DIGITS—which number is present in the image?

    • (d) Sepsis—which patient will develop sepsis?

    Correct Answer: The correct answer is option d. One principle of adult learning is that learners should feel the tasks in which they're participating are relevant to them. Even though the Titanic (a), Credit Card Fraud (b), and DIGITS (c) datasets can be helpful to learn and practice data science skills, many clinicians will not feel the activities are relevant to their work. Therefore, the Sepsis (d) dataset is more likely to facilitate learning for a clinical audience.



Conflict of Interest

None declared.

Acknowledgments

We received support for this work from the Agency for Healthcare Research & Quality (AHRQ) and the Patient-Centered Outcomes Research Institute (PCORI) under Award Number K12HS026395; and the Gordon and Betty Moore Foundation through Grant GBMF9048. The content is solely the responsibility of the authors and does not necessarily represent the official views of AHRQ, PCORI, the U.S. Government, or the Gordon and Betty Moore Foundation.

Protection of Human and Animal Subjects

We received a determination from the Vanderbilt University Medical Center's Institutional Review Board that our evaluation approach did not qualify as research (IRB number: 241036).



Address for correspondence

Alvin D. Jeffery, PhD, RN, CCRN, FNP-BC
Department of Biomedical Informatics, Vanderbilt University
Nashville, TN 37203
United States   

Publication History

Received: 25 March 2024

Accepted: 29 August 2024

Accepted Manuscript online:
30 August 2024

Article published online:
11 December 2024

© 2024. Thieme. All rights reserved.

Georg Thieme Verlag KG
Rüdigerstraße 14, 70469 Stuttgart, Germany

