CC BY 4.0 · Appl Clin Inform 2025; 16(03): 556-568
DOI: 10.1055/a-2544-3117
Research Article

Using Electronic Health Records to Classify Cancer Site and Metastasis

Kurt Kroenke
1   Department of Medicine, Indiana University School of Medicine, Indianapolis, Indiana, United States
2   Regenstrief Institute, Inc., Indianapolis, Indiana, United States
,
Kathryn J. Ruddy
3   Division of Medical Oncology, Mayo Clinic, Rochester, Minnesota, United States
,
Deirdre R. Pachman
4   Division of Community Internal Medicine, Geriatrics, and Palliative Care, Mayo Clinic, Rochester, Minnesota, United States
,
Veronica Grzegorczyk
5   Department of Physical Medicine and Rehabilitation, Mayo Clinic, Rochester, Minnesota, United States
,
Jeph Herrin
6   Department of Internal Medicine, Yale University School of Medicine, New Haven, Connecticut, United States
,
Parvez A. Rahman
7   Robert D. and Patricia E. Kern Center for the Science of Health Care Delivery, Mayo Clinic, Rochester, Minnesota, United States
,
Kyle A. Tobin
7   Robert D. and Patricia E. Kern Center for the Science of Health Care Delivery, Mayo Clinic, Rochester, Minnesota, United States
,
Joan M. Griffin
7   Robert D. and Patricia E. Kern Center for the Science of Health Care Delivery, Mayo Clinic, Rochester, Minnesota, United States
8   Division of Health Care Delivery Research, Mayo Clinic, Rochester, Minnesota, United States
,
Linda L. Chlan
9   Division of Nursing Research, Department of Nursing, Mayo Clinic College of Medicine and Science, Mayo Clinic, Rochester, Minnesota, United States
,
Jessica D. Austin
10   Department of Epidemiology, Mayo Clinic College of Medicine and Science, Scottsdale, Arizona, United States
,
Jennifer L. Ridgeway
7   Robert D. and Patricia E. Kern Center for the Science of Health Care Delivery, Mayo Clinic, Rochester, Minnesota, United States
8   Division of Health Care Delivery Research, Mayo Clinic, Rochester, Minnesota, United States
,
Sandra A. Mitchell
11   Outcomes Research Branch, Healthcare Delivery Research Program, Division of Cancer Control and Population Sciences, National Cancer Institute, Rockville, Maryland, United States
,
Keith A. Marsolo
12   Department of Population Health Sciences, Duke University School of Medicine, Durham, North Carolina, United States
,
Andrea L. Cheville
5   Department of Physical Medicine and Rehabilitation, Mayo Clinic, Rochester, Minnesota, United States
7   Robert D. and Patricia E. Kern Center for the Science of Health Care Delivery, Mayo Clinic, Rochester, Minnesota, United States

Funding E2C2 was funded as part of the Improving the Management of symPtoms during and following Cancer Treatment (IMPACT) Consortium, a National Cancer Institute Cancer Moonshot™ Research Initiative under the authorization of the 2016 United States 21st Century Cures Act. This research was supported by the National Cancer Institute of the NIH, UM1CA233033 (PI Cheville, Mayo Clinic, Rochester, MN). The findings expressed in this manuscript do not necessarily reflect the opinion of the NIH.
 

Abstract

Background

The Enhanced EHR-facilitated Cancer Symptom Control (E2C2) Trial is a pragmatic trial testing a collaborative care approach for managing common cancer symptoms. There were challenges in identifying cancer site and metastatic status.

Objectives

This study compares three different approaches to determine cancer site and six strategies for identifying the presence of metastasis using EHR and cancer registry data.

Methods

The E2C2 cohort included 50,559 patients seen in the medical oncology clinics of a large health system. SPPADE symptoms (sleep disturbance, pain, physical function impairment, anxiety, depression, and energy deficit/fatigue) were assessed with 0 to 10 numeric rating scales (NRS). A multistep process was used to develop three approaches for representing cancer site: the single most prevalent International Statistical Classification of Diseases and Related Health Problems, 10th Revision (ICD-10) code, the two most prevalent codes, and any diagnostic code. Six approaches for identifying metastatic disease were compared: ICD-10 codes, natural language processing (NLP), cancer registry, medications typically prescribed for incurable disease, treatment plan, and evaluation for phase I trials.

Results

The approach counting the two most prevalent ICD-10 cancer site diagnoses per patient detected a median of 92% of the cases identified by counting all cancer site diagnoses, whereas the approach counting only the single most prevalent cancer site diagnosis identified a median of 65%. However, agreement among the three approaches was very good (kappa > 0.80) for most cancer sites. ICD and NLP methods could be applied to the entire cohort and had the highest agreement (kappa = 0.53) for identifying metastasis. Cancer registry data was available for less than half of the patients.

Conclusion

Identification of cancer site and metastatic disease using EHR data was feasible in this large and diverse cohort of patients with common cancer symptoms. The methods were pragmatic and may be acceptable for covariates, but likely require refinement for key dependent and independent variables.



Background and Significance

Symptoms and functional impairment are a prominent source of distress and diminished quality of life in persons with cancer.[1] [2] [3] [4] Additionally, cancer-related symptoms contribute to decreased work productivity along with increased health care utilization and costs.[5] [6] [7] Sleep disturbance, pain, physical function impairment, anxiety, depression, and energy deficit/fatigue (SPPADE) are a particularly common set of symptoms that have high rates of co-occurrence, undertreatment, and persistence in cancer survivors.[5] [8] [9] [10]

The Enhanced EHR-facilitated Cancer Symptom Control (E2C2) trial is a stepped wedge, cohort cluster randomized, pragmatic clinical trial designed to evaluate the effectiveness of routine symptom surveillance and guideline-informed symptom management targeting six SPPADE symptoms (clinicaltrials.gov identifier: NCT03892967).[11] A large, population-level cohort of patients was assembled from all medical oncology clinics of a large regional healthcare system. Baseline patient characteristics and outcomes were extracted from a common Epic electronic health record (EHR). Challenges of using the EHR for clinical research in cancer and other conditions have been described.[12] Extracting and operationalizing key clinical variables for E2C2 required the development and comparison of multistep algorithmic approaches.

Assembly of the E2C2 trial cohort revealed important challenges in using EHR data to classify two essential cancer characteristics: cancer site and metastatic status. First, individuals often have more than one site of cancer, in which case methods used to study single sites of cancer may require modification. Second, a common assumption is that cancer registries are the most reliable and comprehensive source of data, particularly at the time of diagnosis. However, many patients with cancer seen at a tertiary center are not necessarily included in that center's registry and, importantly, registry data reflects the stage of cancer at initial diagnosis and is typically not updated to capture the longitudinal trajectory of the disease, including progression to more advanced stages.[13] [14] [15] [16] Third, the increasing emphasis on using real-world data for pragmatic trials and other types of clinically embedded research requires informatics-based strategies for classifying important constructs such as cancer site and metastatic status.[17] [18]



Objectives

Data were derived from a cohort of more than 50,000 patients with cancer participating in the E2C2 pragmatic trial of symptom management. The objectives of this study are: (1) to compare three approaches for using EHR data to classify cancer site; and (2) to compare six strategies for determining the likelihood of cancer metastasis. The ultimate aim is to balance data precision with pragmatism, consider the data's fitness for use, and offer means of optimizing EHR variable specification in large pragmatic trials as well as other types of embedded research, including learning health care systems.[17] [18]



Methods

Cohort clusters from four adjacent health system regions and oncology care teams at a large academic medical center, all part of a large, multi-state health system, were randomly assigned to implement the E2C2 intervention at sequential 8-month intervals. Participants in this stepped wedge trial were identified using an EHR algorithm that required patients to have been assigned a diagnostic code included in the Epic foundation system's “All Cancers Grouper,” a clinical encounter with a medical oncology clinician, and an encounter visit type corresponding to an initial or follow-up medical oncology evaluation. Patients with cancer receiving care at the study sites were administered patient-reported outcome measures (PROM) to report SPPADE symptoms electronically, the results of which were used to determine the level of intervention among clusters that had gone live with the intervention. The intervention included automated self-management support for patients reporting moderate (≥4 on a 0–10 scale) levels of ≥1 SPPADE symptom and collaborative care model (CCM)-based management for patients reporting severe (≥7) symptoms. E2C2 was approved as a minimal risk study by the Mayo Clinic Institutional Review Board; the requirement for informed consent was waived. Additional details of the E2C2 trial protocol are reported elsewhere.[11] [19]



Determining Cancer Site

Three methods were used to classify cancer site into 14 specific categories using the combined dataset of patient diagnoses from medical oncology and other clinical encounters, hospital visits, and the EHR problem list. Excluded from all approaches were diagnoses in the categories “Other solid tumor with distant metastatic,” “Neoplasm of uncertain or unspecific behavior,” “Metastatic,” “Other solid tumor,” “Benign neoplasm,” “Unspecified,” and “Lymph node disease.” The algorithm for determining cancer site is detailed in [Fig. 1]. The All Cancers Grouper in Step 3 is a list of every cancer diagnosis inside Mayo's instance of Epic that a provider might use. These diagnoses are used in office visits, admissions, and problem lists to record that a patient has a particular condition and are a vital and routine part of care. The grouper is built using the SNOMED CT hierarchy, specifically the “Malignant neoplastic disease” concept and all of its child concepts. Any diagnosis in Mayo's instance of Epic associated with this concept or one of its branches is included in the grouper.

Fig. 1 Algorithm for using electronic health record (EHR) data to determine cancer site.

Method A (most sensitive) counted all categories coded for a patient, with no limit on the number of cancer sites. Method B (most specific) allowed only one primary diagnosis category per patient. For patients with two or more diagnosis categories, the one that made up >50% of diagnoses was assigned. If there was no majority diagnosis, the category assigned was “multiple.” If there were no specific valid diagnosis categories, the category assigned was “nonspecific.” Method C (intermediate sensitivity/specificity) allowed up to two primary diagnosis categories (the two most common of all that were coded). For patients with three or more cancer sites, the two most commonly coded cancer site diagnoses were assigned.
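The three counting rules lend themselves to a compact sketch. The code below is illustrative only: the function names and the per-patient `dx_counts` tally of valid diagnosis categories are hypothetical, and tie-breaking details of the production algorithm are not specified in the text.

```python
from collections import Counter

def method_a(dx_counts: Counter) -> set:
    """Method A (most sensitive): every valid cancer-site category coded for the patient."""
    return set(dx_counts)

def method_b(dx_counts: Counter) -> str:
    """Method B (most specific): the single category accounting for >50% of the
    patient's cancer-site diagnoses; otherwise 'multiple' or 'nonspecific'."""
    if not dx_counts:
        return "nonspecific"
    total = sum(dx_counts.values())
    site, n = dx_counts.most_common(1)[0]
    return site if n / total > 0.5 else "multiple"

def method_c(dx_counts: Counter) -> set:
    """Method C (intermediate): up to the two most commonly coded categories."""
    return {site for site, _ in dx_counts.most_common(2)}
```

For a patient with, say, six breast, three gastrointestinal, and one lung diagnosis, Method A yields all three sites, Method B yields breast (6 of 10 diagnoses, a majority), and Method C yields breast and gastrointestinal.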



Determining Metastatic Status

Six approaches to classifying cancer as metastatic are detailed in [Table 1]. Two methods of classification that could be applied to all patients were considered primary approaches. Four other methods were considered secondary approaches.

Table 1

Details of methods used to determine metastatic status in E2C2 trial cohort

| Method | N[a] | Metastatic, n (%) |
|---|---|---|
| ICD diagnoses | 50,559 | 22,461 (44.4%) |
| Natural language processing (NLP) | 50,559 | 22,336 (44.2%) |
| Cancer registry | 24,160 | 5,914 (24.5%) |
| Treatment plan | 15,678 | 10,328 (65.9%) |
| Medications | 48,647 | 5,300 (10.9%) |
| Clinical trial | 1,050 | 1,050 (100%) |

ICD diagnoses

• Includes a group of diagnoses that are indicative of metastatic illness.

• All patients in the cohort possess diagnosis data because a positive cancer diagnosis is required for inclusion in the cohort.

• Diagnoses indicating regional lymph node involvement were not included.

• Diagnoses were filtered to the time period of 1 year before inception of the study to the end of the study.

• Diagnoses on the problem list that were listed as “Deleted” were removed, as this indicates that they were entered in error.

• Patients with metastatic diagnoses had between 1 and 1,004 diagnoses indicating metastatic disease (mean = 28.43, median = 8).

Natural language processing (NLP)

• All patients were run through a natural language processing (NLP) program.

• Details of the NLP program are provided in footnote [b].

Cancer registry

• Patients typically have metastatic disease assessed during intake to the registry at the time of diagnosis, but the registry does not consistently include follow-up information about progression to metastatic disease after diagnosis.

• Therefore, the registry is reasonably accurate regarding metastatic disease at the time of diagnosis, but does not reliably identify patients who subsequently develop metastatic disease.

Treatment plan

• 27,460 patients have a Beacon module treatment plan; of these, 15,678 had a treatment goal listed.

• Of the patients with a treatment goal listed, 10,328 (65.9%; 37.6% of the 27,460 patients with a Beacon plan) had at least one treatment plan with the goal stated as “Palliative” or “Control.” Specification of either of these treatment goals was considered to indicate metastatic disease.

• However, having other treatment goals specified in the Beacon treatment plan is not an indicator that metastatic disease is not present.

Medications

• 48,647 of the sample had medication prescription orders and medication administration data (including outpatient and inpatient data).

• Of these, information was available on certain medications used to treat incurable types of cancers (brain, breast, gastrointestinal, genitourinary, hematologic, lung, leukemia, lymphoma, sarcoma); other types of cancer do not have this data. Some of the commentary on medications indicated that the cancer was always considered incurable, or pointed to procedures that are performed for that specific type of cancer.

• 42,863 had at least one diagnosis of a valid type of cancer for this analysis; of these, 5,300 were taking a medication that indicated metastatic disease.

Clinical trial

• Encounter descriptions with the phrase “Trial” were identified; of these, 1,050 had a phase I trial clinic encounter.

a N is the number of individuals for which data was available for each method.


b A natural language processing (NLP) algorithm was applied to EHR data for all E2C2 patients. The NLP algorithm included structured data elements but was additionally applied to unstructured data including all clinical notes, pathology reports, and inpatient documentation from allied health and other staff, among others. The algorithm's construction involved developing rule-based algorithms by integrating expert knowledge engineering and sublanguage analysis based on the Open Health Natural Language Processing (OHNLP) Toolkit implemented across the Mayo Clinic Enterprise. EHR clinical notes and pathology reports were used for metastatic disease information extraction. Specifically, 48 cancer types and corresponding staging knowledge from the American Joint Committee on Cancer (AJCC) Cancer Staging Manual were collected. NLP rules were developed to run on the OHNLP Toolkit. The algorithms on top of the OHNLP Toolkit checked the assertion status of each identified text mention, including certainty (i.e., positive, negative, or possible), temporality (historical or current), and experiencer (i.e., associated with the patient or someone else). Post-processing explored the context and structure of clinical texts grammatically and semantically to learn specialized rules, i.e., “sublanguage.” Finally, a patient-level summary algorithm was developed to derive the presence or absence of patient-level metastatic disease. A review of a subset comprising 500 randomly selected patients from the E2C2 cohort, excluding those with hematological malignancies, revealed a recall of 0.85.
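As a loose illustration of the assertion-status idea, the sketch below uses a few regular-expression cues to label a metastasis mention as positive, negated, or historical. It is a deliberate simplification: the trigger lists are hypothetical stand-ins, and the actual OHNLP rules handle certainty, temporality, and experiencer far more thoroughly.

```python
import re

# Hypothetical trigger terms; the production OHNLP rule set is far more extensive.
METASTASIS_TERMS = r"(metastas[ei]s|metastatic)"
NEGATION_CUES = r"\b(no evidence of|negative for|without|denies)\b"
HISTORICAL_CUES = r"\b(history of|prior|previous)\b"

def assert_mention(sentence):
    """Return 'positive', 'negated', or 'historical' for a metastasis mention,
    or None if the sentence contains no mention at all."""
    s = sentence.lower()
    m = re.search(METASTASIS_TERMS, s)
    if m is None:
        return None
    window = s[: m.start()]  # crude pre-mention context window
    if re.search(NEGATION_CUES, window):
        return "negated"
    if re.search(HISTORICAL_CUES, window):
        return "historical"
    return "positive"

def patient_level(sentences):
    """Patient-level roll-up: any current, positive mention implies metastatic disease."""
    return any(assert_mention(s) == "positive" for s in sentences)
```

A real system would also handle post-mention negation ("metastasis was not seen"), section context, and experiencer ("mother had metastatic disease"), none of which this sketch attempts.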


Primary Approaches

  1. ICD-10 codes included a group of diagnoses that are indicative of metastatic disease. Diagnoses indicating regional lymph node involvement were not included in this designation. Diagnoses were filtered to the time period of 1 year before enrollment to the end of the study (February 2023).

  2. A natural language processing (NLP) algorithm was applied to EHR data for all E2C2 patients, drawing upon prior research.[20] [21] [22] [23] The NLP algorithm included structured data elements but was additionally applied to unstructured data including all clinical notes, pathology reports, and inpatient documentation from allied health and other staff, among others. Details of the NLP algorithm are provided in the [Table 1] footnote.
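The article does not enumerate the ICD-10 code group it used, so the sketch below substitutes the ICD-10-CM secondary malignant neoplasm families C78 and C79, excludes C77 (secondary and unspecified malignant neoplasm of lymph nodes) to mirror the regional lymph node rule, and applies the 1-year look-back window. Treat the code set and function names as assumptions rather than the study's actual specification.

```python
from datetime import date

def is_metastatic_code(icd10):
    """Illustrative code set only: ICD-10-CM C78-C79 cover secondary (metastatic)
    malignant neoplasms; C77 (lymph node involvement) is deliberately not matched."""
    return icd10.upper().startswith(("C78", "C79"))

def flag_metastatic(diagnoses, enrollment, study_end):
    """diagnoses: iterable of (icd10_code, diagnosis_date) pairs.
    Flag the patient if any qualifying code falls between 1 year before
    enrollment and the end of the study."""
    window_start = enrollment.replace(year=enrollment.year - 1)
    return any(
        is_metastatic_code(code) and window_start <= dt <= study_end
        for code, dt in diagnoses
    )
```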

Table 2

E2C2 study cohort

| Characteristic | Total sample[a] (N = 50,559), N | Completed ≥1 PROM (N = 40,005), N (%) | Did not complete a PROM (N = 10,554), N (%) | SMD[b] |
|---|---|---|---|---|
| Age | | | | 0.12 |
|  <40 | 3,183 | 2,570 (6.4) | 613 (5.8) | |
|  40–64 | 21,386 | 17,367 (43.4) | 4,019 (38.1) | |
|  ≥65 | 25,990 | 20,068 (50.2) | 5,922 (56.1) | |
| Sex | | | | 0.09 |
|  Female | 28,507 | 22,926 (57.3) | 5,581 (52.9) | |
|  Male | 22,051 | 17,078 (42.7) | 4,973 (47.1) | |
| Ethnicity | | | | 0.06 |
|  Not Hispanic or Latino | 49,411 | 39,318 (98.3) | 10,093 (95.6) | |
|  Hispanic or Latino | 644 | 456 (1.1) | 188 (1.8) | |
|  Unknown/Not reported | 504 | 231 (0.6) | 273 (2.6) | |
| Race | | | | 0.15 |
|  White | 46,952 | 37,724 (94.3) | 9,228 (87.4) | |
|  African-American | 836 | 512 (1.3) | 324 (3.1) | |
|  American Indian/Alaska Native | 262 | 195 (0.5) | 67 (0.6) | |
|  Asian/Pacific Islander | 1,054 | 778 (1.9) | 276 (2.6) | |
|  Other/Unknown | 1,455 | 796 (2.0) | 659 (6.2) | |
| Marital status | | | | 0.19 |
|  Married/Partnered | 34,850 | 28,386 (71.0) | 6,464 (61.3) | |
|  Divorced/Separated | 4,553 | 3,477 (8.7) | 1,076 (10.2) | |
|  Widowed | 4,746 | 3,541 (8.9) | 1,205 (11.4) | |
|  Single | 6,126 | 4,519 (11.3) | 1,607 (15.2) | |
|  Unknown | 284 | 82 (0.2) | 202 (1.9) | |
| Employment | | | | 0.19 |
|  Employed | 19,199 | 15,918 (39.8) | 3,281 (31.1) | |
|  Retired | 24,530 | 19,188 (48.0) | 5,342 (50.6) | |
|  Not employed/student/military | 4,411 | 3,216 (8.0) | 1,195 (11.3) | |
|  Disabled | 2,054 | 1,556 (3.9) | 498 (4.7) | |
|  Unknown | 365 | 127 (0.3) | 238 (2.3) | |
| Payor | | | | 0.11 |
|  Government | 30,773 | 24,031 (60.1) | 6,742 (63.9) | |
|  Non-government | 19,281 | 15,751 (39.4) | 3,530 (33.5) | |
|  Unknown | 505 | 223 (0.6) | 282 (2.7) | |
| Education | | | | 0.24[e] |
|  Less than high school | 9,173 | 7,593 (19.0) | 1,580 (15.0) | |
|  High school | 1,333 | 1,114 (2.8) | 219 (2.1) | |
|  Some college or Associate degree | 10,452 | 9,182 (23.0) | 1,270 (12.0) | |
|  Bachelor's degree | 7,502 | 6,707 (16.8) | 795 (7.5) | |
|  Master's or Doctoral degree | 5,513 | 4,934 (12.3) | 579 (5.5) | |
|  Unknown | 16,586 | 10,475 (26.2) | 6,111 (57.9) | |
| RUCA | | | | 0.02 |
|  Urban (1–3) | 27,931 | 22,338 (55.8) | 5,593 (53.0) | |
|  Micropolitan (4–6) | 8,647 | 6,867 (17.2) | 1,780 (16.9) | |
|  Small town (7–9) | 6,929 | 5,477 (13.7) | 1,452 (13.8) | |
|  Rural (10) | 6,366 | 5,061 (12.7) | 1,305 (12.4) | |
|  Unknown | 686 | 262 (0.7) | 424 (4.0) | |
| Metastatic status[c] | | | | 0.28[e] |
|  Not metastatic | 22,333 | 16,704 (41.8) | 5,629 (53.3) | |
|  Metastatic | 16,554 | 14,084 (35.2) | 2,470 (23.4) | |
|  Uncertain | 11,672 | 9,217 (23.0) | 2,455 (23.3) | |
| Cancer type[d] | | | | |
|  Breast | 10,611 | 8,760 (21.9) | 1,851 (17.5) | 0.11 |
|  Endocrine | 1,386 | 1,123 (2.8) | 263 (2.5) | 0.02 |
|  Gastrointestinal | 12,774 | 9,522 (23.8) | 3,252 (30.8) | 0.16 |
|  Genitourinary | 4,556 | 3,645 (9.1) | 911 (8.6) | 0.02 |
|  Gynecologic | 3,113 | 2,540 (6.4) | 573 (5.4) | 0.04 |
|  Head and neck | 2,144 | 1,711 (4.3) | 433 (4.1) | 0.01 |
|  Hematologic | 3,524 | 2,669 (6.7) | 855 (8.1) | 0.06 |
|  Lung | 4,835 | 3,881 (9.7) | 954 (9.0) | 0.02 |
|  Melanoma | 1,426 | 1,234 (3.1) | 192 (1.8) | 0.08 |
|  Nervous system | 2,003 | 1,701 (4.3) | 302 (2.9) | 0.08 |
|  Sarcoma | 2,175 | 1,783 (4.5) | 392 (3.7) | 0.04 |
|  Skin | 860 | 600 (1.5) | 260 (2.5) | 0.07 |
|  Multiple | 849 | 648 (1.6) | 201 (1.9) | 0.02 |
|  Uncertain | 303 | 188 (0.5) | 115 (1.1) | 0.07 |
| Portal use ≥1 in 365 days before encounter/enrollment | 38,546 | 33,348 (83.4) | 5,198 (49.3) | 0.77[e] |

Abbreviations: PROM, patient-reported outcome measures; RUCA, rural–urban commuting area; SMD, standardized mean difference.


a Missing data was <1% for all variables except RUCA (1.4%), race (2.9%), and education (32.8%).


b Standardized differences ≥0.20 and ≥0.50 suggest small and moderate imbalances, respectively.


c Metastatic = ICD-10 code and natural language processing (NLP) methods agreed the cancer was metastatic. Not metastatic = both methods agreed the cancer was not metastatic. Uncertain = the two methods disagreed.


d Method B (single most prevalent type of cancer) is used in this table.


e Denotes SMD ≥ 0.20.




Secondary Approaches

  1. Stage data from the Mayo Clinic Cancer Registry were evaluated for the 24,160 patients (47.8% of the cohort) included in this registry.

  2. Treatment plan was assessed using Beacon (Epic's oncology module), within which clinicians create treatment plans for cancer-directed therapies and order supportive care regimens based on standardized protocols. Within each cancer treatment plan, clinicians must specify a treatment goal. If at least one treatment plan had a goal stated as “Palliative” or “Control,” the cancer was classified as metastatic.

  3. Certain medications are used in managing specific cancers when that cancer is considered incurable. The linkage of these medications to specific cancers offers another potential method of identifying metastatic disease (see [Supplementary Table S1], available in the online version).

  4. Participation in a phase I clinical trial may indicate that a patient's cancer is metastatic/incurable. Encounter visits that contained the phrase “Trial” in the encounter description were captured to identify patients seen in the phase I clinical trials clinic.



Other Variables[8] [24]

Sociodemographic variables included age, sex, race, ethnicity, education, employment status, marital status, and payor. Rural–urban commuting area (RUCA) codes classify U.S. census tracts using measures of population density, urbanization, and daily commuting. RUCA codes range from 1 to 10, with 1–3, 4–6, 7–9, and 10 representing urban, micropolitan, small town, and rural locations, respectively. Clinical location was coded as tertiary (Mayo Clinic Rochester) or community (all other Mayo Clinic Health System sites). Portal use was defined as the patient having accessed the Epic health care portal at least once in the 12 months prior to first E2C2 contact.
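The RUCA groupings used in [Table 2] (urban 1–3, micropolitan 4–6, small town 7–9, rural 10) amount to a simple mapping, sketched here for illustration with a hypothetical function name:

```python
def ruca_category(code):
    """Collapse RUCA primary codes (1-10) into the groupings used in the cohort table."""
    if code is None:
        return "unknown"
    if 1 <= code <= 3:
        return "urban"
    if 4 <= code <= 6:
        return "micropolitan"
    if 7 <= code <= 9:
        return "small town"
    if code == 10:
        return "rural"
    return "unknown"
```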

Statistical Analysis

We summarized the results of all approaches to both cancer site and metastasis determination. The kappa statistic assessed agreement among methods to ascertain cancer site and metastatic disease. Agreement is considered fair for a kappa of 0.21 to 0.40, moderate for a kappa of 0.41 to 0.60, and substantial for a kappa of ≥0.61.[25] In interpreting kappa it is important to note that kappa is not simple agreement but rather the percentage of agreement beyond chance. Thus, a kappa of 0.50 is 50% agreement beyond chance and is therefore considered a moderate level of agreement.
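The distinction between simple and chance-corrected agreement can be made concrete with a minimal Cohen's kappa for two classification methods, written directly from its definition. This sketch is illustrative only; the study's analyses used SAS and Stata.

```python
def cohens_kappa(pairs):
    """Cohen's kappa for two raters; pairs is a list of (rating1, rating2) tuples."""
    n = len(pairs)
    categories = {rating for pair in pairs for rating in pair}
    p_observed = sum(a == b for a, b in pairs) / n
    # Chance agreement: for each category, the product of the two raters'
    # marginal proportions, summed over categories.
    p_chance = sum(
        (sum(a == c for a, _ in pairs) / n) * (sum(b == c for _, b in pairs) / n)
        for c in categories
    )
    return (p_observed - p_chance) / (1 - p_chance)
```

With 80% raw agreement and 50% chance agreement, kappa is (0.80 - 0.50)/(1 - 0.50) = 0.60, a moderate value despite high raw concordance.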

Patient characteristics were described for the full cohort of eligible participants in this stepped wedge trial. Between-group differences were analyzed for symptom report completers versus non-completers (≥1 vs. 0 surveys completed). Imbalance in sociodemographic and clinical characteristics between groups was assessed by examining the standardized mean difference (SMD), which is calculated as the difference in means or proportions divided by the pooled standard deviation; imbalance was defined as an absolute value greater than 0.20[26] or 0.25.[27] SMD thresholds were used to assess variable imbalance because SMDs are robust to sample size (unlike statistical significance, which could reflect either meaningful absolute differences or merely a large sample size), and because we had no hypotheses regarding differences between groups.
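For a binary characteristic, the SMD reduces to the difference in proportions over the pooled standard deviation. The sketch below (illustrative function names, not the study's code) reproduces the portal-use imbalance reported in [Table 2]:

```python
from math import sqrt

def smd_proportions(p1, p2):
    """Standardized mean difference for a binary characteristic:
    difference in proportions divided by the pooled standard deviation."""
    pooled_sd = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / 2)
    return (p1 - p2) / pooled_sd

def smd_means(m1, m2, sd1, sd2):
    """SMD for a continuous characteristic with group standard deviations sd1, sd2."""
    return (m1 - m2) / sqrt((sd1 ** 2 + sd2 ** 2) / 2)
```

For portal use (83.4% vs. 49.3%), smd_proportions(0.834, 0.493) is approximately 0.77, matching the tabulated SMD.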

All analyses were performed in SAS version 9.4 (SAS Institute) and Stata version 18.0 (StataCorp, College Station, Texas, United States). This trial was approved by the Mayo Clinic IRB.



Results

Study Sample

[Table 2] summarizes the characteristics of the E2C2 eligible cohort (n = 50,559). Most were 40 to 64 years old (42.3%) or ≥65 (51.4%), and 56.4% were women. The sample was predominantly white and non-Hispanic. Most patients had completed education beyond high school, were currently employed or retired, and were married or partnered. About a quarter resided in small town or rural areas (RUCA 7–10). There was a broad distribution of cancer sites with the most common being gastrointestinal (25.3%), breast (21.0%), lung (9.6%), genitourinary (9.0%), hematologic (7.0%), gynecologic (6.2%), sarcoma (4.3%), and head and neck (4.2%). The cancer was determined to be metastatic at some point during the year preceding cohort enrollment or during the 4-year trial interval in 16,554 (32.7%) patients.

Except for a few characteristics, the 40,005 (79.1%) patients who completed at least one symptom report were generally similar to the 10,554 (20.9%) patients who did not complete a report. There were small imbalances (SMD ≥ 0.20) between the two groups for just a few variables, with symptom report completers more likely to have metastatic cancer, higher education, and be receiving antineoplastic therapy. The only moderate imbalance was greater portal use among the symptom report completers (83.4% vs. 49.3%).



Cancer Site

[Table 3] summarizes results of the three operational approaches to categorizing cancer site. Among the total sample, two-thirds (65%; n = 33,004) had only one cancer site identified, one quarter (25%; n = 12,673) had two sites, and 9% (n = 4,579) had ≥3 cancer sites identified in the EHR. The site of cancer was nonspecific (not a valid site) in only 303 patients (0.6%). As shown in [Supplementary Table S2] (available in the online version), the approach allowing the two most prevalent ICD-10 cancer site diagnoses per patient (Method C) detected a median of 92% (range, 77% to 98% across cancer sites) of the cases identified by counting all cancer sites (Method A), whereas the approach allowing only the single most prevalent cancer site diagnosis (Method B) identified a median of 65% (range, 13% to 89%). However, the rank order of frequency distributions of specific cancer sites was relatively similar among the three approaches. Moreover, agreement among the three approaches was very good (kappa >0.80) for most cancer sites. Only endocrine and skin cancers had any inter-method kappa <0.7.

Table 3

Comparing three approaches for determining type of cancer from EHR

| Cancer type | Method A, N (any number of diagnoses)[a] | Method B, N (1 diagnosis) | Method C, N (1 to 2 diagnoses) | Kappa: A and B[b] | Kappa: A and C | Kappa: B and C | Kappa: A, B, and C |
|---|---|---|---|---|---|---|---|
| Breast | 11,898 | 10,611 | 11,684 | 0.93 | 0.99 | 0.94 | 0.95 |
| Endocrine | 3,474 | 1,386 | 3,031 | 0.55 | 0.93 | 0.61 | 0.72 |
| Gastrointestinal | 15,248 | 12,774 | 14,735 | 0.88 | 0.98 | 0.90 | 0.92 |
| Genitourinary | 7,007 | 4,556 | 6,491 | 0.76 | 0.96 | 0.80 | 0.85 |
| Gynecologic | 3,919 | 3,113 | 3,744 | 0.88 | 0.98 | 0.90 | 0.92 |
| Head and neck | 2,954 | 2,144 | 2,710 | 0.83 | 0.95 | 0.88 | 0.89 |
| Hematologic | 5,848 | 3,524 | 5,348 | 0.73 | 0.95 | 0.78 | 0.83 |
| Lung | 6,669 | 4,835 | 6,209 | 0.82 | 0.96 | 0.86 | 0.88 |
| Melanoma | 2,331 | 1,426 | 2,010 | 0.75 | 0.92 | 0.82 | 0.84 |
| Nervous system | 3,669 | 2,003 | 3,152 | 0.69 | 0.92 | 0.77 | 0.80 |
| Sarcoma | 3,380 | 2,175 | 3,060 | 0.77 | 0.95 | 0.82 | 0.85 |
| Skin, other | 6,905 | 860 | 5,334 | 0.20 | 0.85 | 0.26 | 0.50 |
| Multiple[c] | | 849 | | | | | |
| Nonspecific[d] | 303 | 303 | 303 | | | | |
| Total | 73,605 | 50,559 | 67,811 | | | | |

a Method A counted all categories coded for a patient with no limit on the number of cancer types. Method B allowed only one primary diagnosis category per patient. For patients with two or more diagnosis categories, the one that made up >50% of diagnoses was assigned. If there was no majority diagnosis, the category assigned was “multiple.” If there were no specific valid diagnosis categories, the category assigned was “nonspecific.” Method C allowed up to two primary diagnosis categories (the two most common out of all that were coded). For those with three or more cancer types, the two most commonly coded cancer type diagnoses were assigned.


b Kappa thresholds of 0.2, 0.4, 0.6, and 0.8 indicate fair, moderate, substantial, and almost perfect agreement.


c When applying Method B, “multiple” was used if there was no majority diagnosis (i.e., one cancer type accounting for >50% of the cancer diagnoses in the EHR).


d Cancer was diagnosed but the cancer type was not specified in the electronic health record.




Metastatic Disease

[Table 4] summarizes the data available for each of the six methods used to classify the patient as having metastatic disease and the pairwise agreement among the methods. Because the phase 1 trial criterion applied to only a small number of patients, we examined in further detail all combinations of the other five methods for determining metastatic disease. [Supplementary Table S3] (available in the online version) shows that the number of patients who could be classified as metastatic by one, two, three, four, and all five methods was 10,857, 10,028, 6,824, 3,001, and 590, respectively.

Table 4

Comparison of methods to determine metastatic disease

ICD-10 and NLP were the primary methods; the cancer registry, treatment plan, medications, and phase I trial were secondary methods.

| Metastatic | ICD-10 | NLP | Cancer registry | Treatment plan | Medications | Phase I trial[a] |
|---|---|---|---|---|---|---|
| Yes | 22,461 | 22,336 | 5,914 | 10,328 | 5,300 | 1,050 |
| No | 28,098 | 28,223 | 18,246 | 5,350 | 43,347 | 0 |
| Missing | 0 | 0 | 26,399 | 34,881 | 1,912 | 49,509 |

Agreement, kappa[b]

| | ICD-10 codes | NLP | Cancer registry | Treatment plan |
|---|---|---|---|---|
| NLP | 0.533 | | | |
| Cancer registry | 0.217 | 0.218 | | |
| Treatment plan | 0.400 | 0.376 | 0.230 | |
| Medications | 0.100 | 0.103 | 0.090 | 0.195 |

Abbreviations: EHR, electronic health record; ICD-10, International Statistical Classification of Diseases and Related Health Problems, 10th Revision; NLP, natural language processing.


a Because the completeness and accuracy of EHR data about phase I trial participation was uncertain, agreement with other methods was not calculated.


b Kappa determined in cases for which metastatic status could be classified as “yes” or “no” (excluding “missing”).


ICD-10 codes and NLP data were available for all patients in the trial cohort. Using the ICD-10 approach, 22,461 (44.4%) patients were classified as having metastatic disease compared to 22,336 (44.2%) using the NLP approach. Regarding agreement, 16,554 (32.7%) patients were classified as metastatic by both approaches, 22,333 (44.2%) were classified as non-metastatic by both approaches, and 11,672 (23.1%) were classified as metastatic by one of the approaches but not the other. This resulted in moderate agreement (kappa = 0.53). Notably, this kappa value represents not merely simple agreement but instead 53% agreement over and above chance agreement. Thus, the final method for determining metastasis was agreement between ICD-10 and NLP.
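The resulting decision rule (call the status either way only when both primary methods agree; disagreement leaves it uncertain) can be stated in a few lines. The function name below is illustrative, not the study's.

```python
def metastatic_status(icd_flag, nlp_flag):
    """Three-level status from the two primary methods: classify as metastatic or
    not metastatic only when the ICD-10 and NLP flags agree; otherwise uncertain."""
    if icd_flag and nlp_flag:
        return "metastatic"
    if not icd_flag and not nlp_flag:
        return "not metastatic"
    return "uncertain"
```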

Of the E2C2 cohort, approximately half (47.8%; n = 24,160) were in the Mayo Clinic Cancer Registry, and of these 5,914 (24.5%) were classified as having metastatic disease. A Beacon treatment plan with a specific treatment goal was documented for 15,678 (31.0%) patients in the cohort, of whom 10,328 (65.9%) had “Palliative” or “Control” selected as the goal. Of the 48,647 patients with medication prescription data, 5,300 (10.9%) had a medication typically used for incurable cancer. Only 1,050 (2.1%) patients had EHR documentation of a phase I trial clinic visit. After ICD–NLP concordance (kappa = 0.533), the next highest agreement was for the Beacon treatment plan, which had a kappa of 0.400 with ICD-10, 0.376 with NLP, and 0.230 with the cancer registry.



Discussion

In our cohort of 50,559 patients with cancer participating in a stepped wedge trial of an EHR-facilitated symptom surveillance and management intervention, we compared multiple approaches for determining cancer site and metastasis from EHR data. In addition to the large sample size, strengths of our E2C2 cohort include a diversity of cancer sites, longitudinal measurement of frequently co-occurring symptoms, and the inclusion of participants drawn from both tertiary and community oncology clinics.

Operational rules are necessary for categorizing cancer site in patients with ICD codes for more than one site. Not surprisingly, the number of cases for specific cancer sites declined when only the one or two most prevalent sites were counted, with the greatest decline when only the single most prevalent site was counted. However, the relative rank ordering of cancer site frequencies was similar across the three methods. Moreover, inter-method agreement was high (>0.70) for all sites except endocrine and skin cancers; these two are less often the most frequently coded cancer when patients have cancer at more than one site. The EHR approach chosen may depend on whether one is studying a single versus multiple cancer sites, the size of the patient sample, and the feasibility of manual chart review for confirmation. Additionally, investigators focusing on subgroups defined by cancer site may prefer to begin with the most sensitive approach (including all qualifying diagnoses), recognizing that adjudication will be required to eliminate false positives.
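The trade-off among these operational rules can be sketched as follows. This is an illustrative toy example, not the study's actual algorithm; the site labels are hypothetical stand-ins for grouped ICD codes.

```python
from collections import Counter

def all_sites(coded_sites):
    """Most sensitive rule: every site with at least one qualifying code."""
    return set(coded_sites)

def top_two_sites(coded_sites):
    """Intermediate rule: the two most frequently coded sites."""
    return {site for site, _ in Counter(coded_sites).most_common(2)}

def top_site(coded_sites):
    """Least sensitive rule: only the single most frequently coded site."""
    return {Counter(coded_sites).most_common(1)[0][0]}

# Hypothetical patient with codes for three cancer sites
codes = ["breast", "breast", "breast", "skin", "breast", "skin", "endocrine"]
print(sorted(all_sites(codes)))      # ['breast', 'endocrine', 'skin']
print(sorted(top_two_sites(codes)))  # ['breast', 'skin']
print(sorted(top_site(codes)))       # ['breast']
```

As in the cohort, sites such as skin and endocrine cancer tend to be dropped by the most restrictive rule because they are rarely the most frequently coded site when another cancer is present.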

Determining metastatic status from EHR data involves complex and evolving methodology. The two most common approaches are ICD codes and NLP, used either individually or in combination.[13] [14] [15] [28] [29] [30] [31] This is partly because, as we observed in our study, both approaches can be applied to essentially all patients. Also, the ICD–NLP kappa of 0.53 represents reasonably good inter-method agreement, since kappa measures not absolute agreement but agreement beyond chance (in this case, 53% beyond chance).

We chose a conservative approach by classifying status as metastatic only where both ICD-10 and NLP agreed. Where one approach but not the other indicated metastasis, status was classified as "uncertain," which applied to 23% of our sample. Of interest, ICD-10 and NLP identified a similar number of patients as metastatic, and each had a similar level of agreement with the other methods, as shown in [Table 4].
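The conservative combination rule described above can be expressed as a simple three-way classification (a sketch with hypothetical argument names, not the study's implementation):

```python
def metastatic_status(icd_positive: bool, nlp_positive: bool) -> str:
    """Classify metastatic status from ICD-10 and NLP indicators.

    Metastatic only when both methods agree "yes"; non-metastatic only
    when both agree "no"; any disagreement is left as "uncertain".
    """
    if icd_positive and nlp_positive:
        return "metastatic"
    if not icd_positive and not nlp_positive:
        return "non-metastatic"
    return "uncertain"

print(metastatic_status(True, True))    # metastatic
print(metastatic_status(False, False))  # non-metastatic
print(metastatic_status(True, False))   # uncertain
```

The design choice here favors specificity over sensitivity: a patient is labeled metastatic only on concordant evidence, at the cost of leaving the discordant 23% unclassified.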

Previous studies evaluating methods for ascertaining metastatic status are detailed in [Supplementary Table S4] (available in the online version). Most focused on a single cancer site, while three studies focused on two to four cancer sites.[28] [29] [32] Four studies focused principally on detecting cancer recurrence,[28] [29] [32] [33] which can have substantial but not complete overlap with metastasis. All studies used at least two of the four methods we studied (ICD, NLP, cancer registries, and/or metastatic-specific drugs), and several also used claims data. Some studies used medical record review as a gold standard, but this could only be applied to smaller samples or subsamples of patients. Unlike previous studies, our cohort included a much wider range of cancer sites and compared six different methods.

Cancer registries would seem to be the most reliable and comprehensive source of data, particularly at the time of diagnosis. However, as we observed, many patients with cancer seen at a tertiary center are not included in that center's registry, and, importantly, registry data reflect the stage of cancer at initial diagnosis and are typically not updated to capture the longitudinal trajectory of the disease, including progression to more advanced stages.[13] [14] [15] [16] Although Beacon treatment plans can be useful for the subset of patients for whom the clinician enters a treatment goal, this Epic module is not used for a substantial proportion of patients and, even when it is, a treatment goal may not be listed. Moreover, the absence of a "Palliative" or "Control" treatment goal does not mean that metastatic disease is absent. Finally, this approach is limited to health systems using Epic, though modules with comparable information exist in other EHR systems. Searching prescription data for medications potentially indicative of metastatic disease, or attempting to capture phase I trial participation, has important limitations for identifying metastatic disease: many patients with metastatic disease are neither taking these medications nor enrolled in a phase I trial. Other strategies might include claims data,[28] [29] information trackers collating data from multiple EHR sources,[34] and machine learning.[31]

Using real-world data for research is gaining traction due to large sample sizes, widespread use of EHRs, and the more inclusive nature of the cancer population being investigated.[35] However, problems include incomplete or inaccurate data, varying quality of clinical documentation, challenges of tracking patients longitudinally to assess cancer trajectory, and intricacies of operationalizing clinical variables.[36] The current literature neither outlines EHR data best practices for pragmatic trialists nor describes in much detail the methods used in previous studies.[17] [18] [37] Potential solutions include EHR software that enables rapid and standardized reporting of recurrence, common data elements suited to a range of oncology study designs, use of electronic pathology reports to facilitate collection of recurrence data by cancer registries, and insurer mandates requiring that recurrence be reported along with billing codes on medical claims.[16] All require coordination across stakeholders and are potentially resource-intensive.

E2C2 data are drawn from heterogeneous sources. Advanced statistical approaches (e.g., normalization, batch effect removal, imputation) may refine categorization, particularly for variables that are the primary focus of a study rather than covariates. Large language model advances are also promising[21] [38] [39] and may be integrated into future specification of EHR-derived cancer characteristics, although some models are limited to specific cancer sites or data sources. The challenges of using real-world EHR data have been well articulated and call for the examination of new strategies.[17] [36]

Deciding what is “good enough” covariate adjudication, i.e., fitness for use,[18] is a highly pragmatic consideration as research migrates along the translational continuum to population-level deployment. For example, the approaches described in this paper may be sufficient when describing characteristics of a sample or when adjusting for covariates. However, a pragmatic trial evaluating treatments to decrease progression to metastasis within a single type of cancer may warrant more rigorous methods including detailed chart review to ascertain the presence of metastasis as a primary outcome.

An important caveat is the predominantly white, non-Hispanic sample, which may limit generalizability of our results to racially and socioeconomically diverse populations. At the same time, our cohort's demographic characteristics are representative of the upper Midwest general population. A salient advantage is that we are able to generalize to rural patients (who face care access barriers) given that a quarter of the cohort resided in rural areas.



Conclusion

Strengths of this pragmatic trial include the large sample size, broad range of cancer sites and treatments, inclusion of a substantial portion of patients with metastatic disease and notable symptom burden, and the use of robust strategies for clinical characterization. Our findings can inform future use of EHR data for classifying cancer site and metastatic status in research, including pragmatic trials of novel healthcare delivery interventions as well as clinical epidemiological studies.



Clinical Relevance Statement

The combination of ICD diagnostic codes and natural language processing is a more sensitive approach to detecting the potential presence of metastasis than cancer registry, treatment plan, or medication data.



Multiple-Choice Questions

  1. Using electronic health records (EHR), which two methods in combination allow classification of metastatic status for the greatest proportion of patients with cancer and with the highest agreement?

    • (a) ICD diagnostic codes and cancer registry data

    • (b) Cancer registry data and natural language processing

    • (c) ICD diagnostic codes and natural language processing

    • (d) Natural language processing and Beacon treatment plans

    Correct Answer: The correct answer is option c. Nearly all patients have ICD diagnostic codes in the EHR, and natural language processing is likewise applicable to all patients.

  2. For patients with more than one cancer site coded in the EHR, counting only the most frequently coded cancer would have the lowest sensitivity for patients with:

    • (a) Lung cancer

    • (b) Skin cancer, non-melanoma

    • (c) Sarcoma

    • (d) Melanoma

    Correct Answer: The correct answer is option b. Compared to counting all cancer diagnoses in the EHR, counting only the most frequently coded cancer would miss 80 to 90% of non-melanoma skin cancers. Another cancer site with a high non-detection rate would be endocrine cancer.



Conflict of Interest

None declared.

Protection of Human and Animal Subjects

The study was performed in compliance with the World Medical Association Declaration of Helsinki on Ethical Principles for Medical Research Involving Human Subjects, and was reviewed by the Mayo Clinic Institutional Review Board.


Supplementary Material

  • References

  • 1 Van Lancker A, Velghe A, Van Hecke A. et al. Prevalence of symptoms in older cancer patients receiving palliative care: a systematic review and meta-analysis. J Pain Symptom Manage 2014; 47 (01) 90-104
  • 2 Cleeland CS. Symptom burden: multiple symptoms and their impact as patient-reported outcomes. J Natl Cancer Inst Monogr 2007; 2007 (37) 16-21
  • 3 Esther Kim JE, Dodd MJ, Aouizerat BE, Jahan T, Miaskowski C. A review of the prevalence and impact of multiple symptoms in oncology patients. J Pain Symptom Manage 2009; 37 (04) 715-736
  • 4 Kjaer TK, Johansen C, Ibfelt E. et al. Impact of symptom burden on health related quality of life of cancer survivors in a Danish cancer rehabilitation program: a longitudinal study. Acta Oncol 2011; 50 (02) 223-232
  • 5 Tan CJ, Yip SYC, Chan RJ, Chew L, Chan A. Investigating how cancer-related symptoms influence work outcomes among cancer survivors: a systematic review. J Cancer Surviv 2022; 16 (05) 1065-1078
  • 6 Hirpara D, Eskander A, Coburn N, Sutradhar R, Chan W, Hallet J. Association between patient-reported symptoms and health care resource utilization: a first step to develop patient-centred value measures in cancer care. Can J Surg 2022; 65: S99-S100
  • 7 Farrell MM, Jiang C, Moss G. et al. Associations between symptoms with healthcare utilization and death in advanced cancer patients. Support Care Cancer 2023; 31 (03) 183
  • 8 Kroenke K, Lam V, Ruddy KJ. et al. Prevalence, severity, and co-occurrence of SPPADE symptoms in 31,866 patients with cancer. J Pain Symptom Manage 2023; 65 (05) 367-377
  • 9 Barbera L, Seow H, Howell D. et al. Symptom burden and performance status in a population-based cohort of ambulatory cancer patients. Cancer 2010; 116 (24) 5767-5776
  • 10 Harrington CB, Hansen JA, Moskowitz M, Todd BL, Feuerstein M. It's not over when it's over: long-term symptoms in cancer survivors—a systematic review. Int J Psychiatry Med 2010; 40 (02) 163-181
  • 11 Finney Rutten LJ, Ruddy KJ, Chlan LL. et al. Pragmatic cluster randomized trial to evaluate effectiveness and implementation of enhanced EHR-facilitated cancer symptom control (E2C2). Trials 2020; 21 (01) 480
  • 12 Kim E, Rubinstein SM, Nead KT, Wojcieszynski AP, Gabriel PE, Warner JL. The Evolving Use of Electronic Health Records (EHR) for Research. Elsevier; 2019: 354-361
  • 13 Ling AY, Kurian AW, Caswell-Jin JL, Sledge Jr GW, Shah NH, Tamang SR. Using natural language processing to construct a metastatic breast cancer cohort from linked cancer registry and electronic medical records data. JAMIA Open 2019; 2 (04) 528-537
  • 14 Alba PR, Gao A, Lee KM. et al. Ascertainment of veterans with metastatic prostate cancer in electronic health records: demonstrating the case for natural language processing. JCO Clin Cancer Inform 2021; 5: 1005-1014
  • 15 Warner JL, Levy MA, Neuss MN, Warner JL, Levy MA, Neuss MN. ReCAP: feasibility and accuracy of extracting cancer stage information from narrative electronic health record data. J Oncol Pract 2016; 12 (02) 157-158 , e169–e7
  • 16 Warren JL, Yabroff KR. Challenges and opportunities in measuring cancer recurrence in the United States. J Natl Cancer Inst 2015; 107 (08) djv134
  • 17 Richesson RL, Marsolo KS, Douthit BJ. et al. Enhancing the use of EHR systems for pragmatic embedded research: lessons from the NIH Health Care Systems Research Collaboratory. J Am Med Inform Assoc 2021; 28 (12) 2626-2640
  • 18 Raman SR, O'Brien EC, Hammill BG. et al. Evaluating fitness-for-use of electronic health records in pragmatic clinical trials: reported practices and recommendations. J Am Med Inform Assoc 2022; 29 (05) 798-804
  • 19 Herrin J, Finney Rutten LJ, Ruddy KJ, Kroenke K, Cheville AL. Pragmatic cluster randomized trial to evaluate effectiveness and implementation of EHR-facilitated collaborative symptom control in cancer (E2C2): addendum. Trials 2023; 24 (01) 21
  • 20 Wang L, Fu S, Wen A. et al. Assessment of electronic health record for cancer research and patient care through a scoping review of cancer natural language processing. JCO Clin Cancer Inform 2022; 6: e2200006
  • 21 Liu K, Kulkarni O, Witteveen-Lane M, Chen B, Chesla D. MetBERT: A Generalizable and Pre-trained Deep Learning Model for the Prediction of Metastatic Cancer from Clinical Notes. American Medical Informatics Association; 2022: 331
  • 22 Soysal E, Warner JL, Denny JC, Xu H. Identifying metastases-related information from pathology reports of lung cancer patients. AMIA Jt Summits Transl Sci Proc 2017; 2017: 268-277
  • 23 Liu S, Wen A, Wang L. et al. An open natural language processing (NLP) framework for EHR-based clinical research: a case demonstration using the National COVID Cohort Collaborative (N3C). J Am Med Inform Assoc 2023; 30 (12) 2036-2040
  • 24 Kroenke K, Stump TE, Kean J, Talib TL, Haggstrom DA, Monahan PO. PROMIS 4-item measures and numeric rating scales efficiently assess SPADE symptoms compared with legacy measures. J Clin Epidemiol 2019; 115: 116-124
  • 25 Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977; 33 (01) 159-174
  • 26 Yang D, Dalton JE. A unified approach to measuring the effect size between two groups using SAS. SAS Global Forum; 2012. 335. 1-6
  • 27 Stuart EA, Lee BK, Leacy FP. Prognostic score-based balance measures can be a useful diagnostic for propensity score methods in comparative effectiveness research. J Clin Epidemiol 2013; 66 (8, Suppl): S84-S90 , 90.e1
  • 28 Hassett MJ, Uno H, Cronin AM, Carroll NM, Hornbrook MC, Ritzwoller D. Detecting lung and colorectal cancer recurrence using structured clinical/administrative data to enable outcomes research and population health management. Med Care 2017; 55 (12) e88-e98
  • 29 Carroll NM, Ritzwoller DP, Banegas MP. et al. Performance of cancer recurrence algorithms after coding scheme switch from international classification of diseases 9th Revision to International Classification of Diseases 10th Revision. JCO Clin Cancer Inform 2019; 3: 1-9
  • 30 He J, Mark L, Hilton C. et al. A comparison of structured data query methods versus natural language processing to identify metastatic melanoma cases from electronic health records. Int J Comput Med Healthcare 2019; 1 (01) 101-111
  • 31 Seneviratne MG, Banda JM, Brooks JD, Shah NH, Hernandez-Boussard TM. Identifying cases of metastatic prostate cancer using machine learning on electronic health records. AMIA Annu Symp Proc 2018; 2018: 1498-1504
  • 32 Hassett MJ, Ritzwoller DP, Taback N. et al. Validating billing/encounter codes as indicators of lung, colorectal, breast, and prostate cancer recurrence using 2 large contemporary cohorts. Med Care 2014; 52 (10) e65-e73
  • 33 Ritzwoller DP, Hassett MJ, Uno H. et al. Development, validation, and dissemination of a breast cancer recurrence detection and timing informatics algorithm. J Natl Cancer Inst 2018; 110 (03) 273-281
  • 34 Ping X-O, Tseng Y-J, Chung Y. et al. Information extraction for tracking liver cancer patients' statuses: from mixture of clinical narrative report types. Telemed J E Health 2013; 19 (09) 704-710
  • 35 Lerro CC, Bradley MC, Forshee RA, Rivera DR. The bar is high: evaluating fit-for-use oncology real-world data for regulatory decision making. JCO Clin Cancer Inform 2024; 8: e2300261
  • 36 Cooper JD, Shou K, Sunderland K, Pham K, Thornton JA, DeStefano CB. Real-world pitfalls of analyzing real-world data: a cautionary note and path forward. JCO Clin Cancer Inform 2023; 7 (07) e2300097
  • 37 NIH Pragmatic Trials Collaboratory: Rethinking Clinical Trials. Accessed June 29, 2024 at: https://rethinkingclinicaltrials.org/chapters/design/using-electronic-health-record-data-pragmatic-clinical-trials-top/using-electronic-health-record-data-in-pragmatic-clinical-trials-introduction/
  • 38 Li C, Zhang Y, Weng Y, Wang B, Li Z. Natural language processing applications for computer-aided diagnosis in oncology. Diagnostics (Basel) 2023; 13 (02) 286
  • 39 Sangariyavanich E, Ponthongmak W, Tansawet A. et al. Systematic review of natural language processing for recurrent cancer detection from electronic medical records. Inform Med Unlocked 2023; 41 (03) 101326
  • 40 Choi YC, Zhang D, Tyczynski JE. Comparison between health insurance claims and electronic health records (EHRs) for metastatic non-small-cell lung cancer (NSCLC) patient characteristics and treatment patterns: a retrospective cohort study. Drugs Real World Outcomes 2021; 8 (04) 577-587
  • 41 Stafkey-Mailey D, Landsman-Blumberg P, Farrelly E, Eaddy M. Comparison of methods to identify stage IIIB or IV metastatic lung cancer patients from electronic medical records. Value Health 2013; 16 (07) A582

Address for correspondence

Kurt Kroenke, MD
Regenstrief Institute
1101 W. 10th Street, Indianapolis, IN 46202
United States   

Publication History

Received: 11 January 2025

Accepted: 18 February 2025

Article published online:
18 June 2025

© 2025. The Author(s). This is an open access article published by Thieme under the terms of the Creative Commons Attribution License, permitting unrestricted use, distribution, and reproduction so long as the original work is properly cited. (https://creativecommons.org/licenses/by/4.0/)

Georg Thieme Verlag KG
Oswald-Hesse-Straße 50, 70469 Stuttgart, Germany


Fig. 1 Algorithm for using electronic health record (EHR) data to determine cancer site.