Appl Clin Inform 2017; 08(03): 731-741
DOI: 10.4338/ACI-2017-02-RA-0029
Research Article
Schattauer GmbH

Extracting autism spectrum disorder data from the electronic health record

Ruth A. Bush
1  Hahn School of Nursing and Health Science, Beyster Institute for Nursing Research, University of San Diego, San Diego, USA
2  Clinical Research Informatics, Rady Children’s Hospital-San Diego, San Diego, USA
Cynthia D. Connelly
1  Hahn School of Nursing and Health Science, Beyster Institute for Nursing Research, University of San Diego, San Diego, USA
Alexa Pérez
1  Hahn School of Nursing and Health Science, Beyster Institute for Nursing Research, University of San Diego, San Diego, USA
Halsey Barlow
1  Hahn School of Nursing and Health Science, Beyster Institute for Nursing Research, University of San Diego, San Diego, USA
George J. Chiang
3  Rady Children’s Institute for Genomic Medicine, Rady Children’s Hospital San Diego, San Diego, CA, USA
4  Department of Surgery, University of California-San Diego, San Diego, USA
Funding: This project was also supported in part by grant number K99/R00 HS022404 from the Agency for Healthcare Research and Quality. The content is solely the responsibility of the authors and does not necessarily represent the official views of the Agency for Healthcare Research and Quality.

Correspondence to:

Ruth A. Bush PhD, MPH
Hahn School of Nursing and Health Science
Beyster Institute for Nursing Research
University of San Diego, San Diego, USA

Publication History

received: 07 February 2016

accepted: 07 May 2017

Publication Date:
20 December 2017 (online)



Background: Little is known about the health care utilization patterns of individuals with pediatric autism spectrum disorder (ASD).

Objectives: Electronic health record (EHR) data provide an opportunity to study medical utilization and track outcomes among children with ASD.

Methods: Using a pediatric, tertiary, academic hospital’s Epic EHR, search queries were built to identify individuals aged 2–18 with International Classification of Diseases, Ninth Revision (ICD-9) codes 299.00, 299.10, and 299.80 in their records. Codes were entered in the EHR using four different workflows: (1) during an ambulatory visit, (2) abstracted by Health Information Management (HIM) for an encounter, (3) recorded on the patient problem list, or (4) added as a chief complaint during an Emergency Department visit. Once individuals were identified, demographics, scheduling, procedures, and prescribed medications were extracted for all patient-related encounters for the period October 2010 through September 2012.

Results: There were nearly 100,000 encounters for more than 4,800 unique individuals. Individuals were most frequently identified with an HIM abstracted code (82.6%) and least likely to be identified by a chief complaint (45.8%). Categorical frequency for reported race (χ² = 816.5, p < 0.001); payor type (χ² = 354.1, p < 0.001); encounter type (χ² = 1497.0, p < 0.001); and department (χ² = 3722.8, p < 0.001) differed by search query. Challenges encountered included locating available discrete data elements and missing data.

Conclusions: This study identifies challenges inherent in designing inclusive algorithms for identifying individuals with ASD and demonstrates the utility of employing multiple extractions to improve the completeness and quality of EHR data when conducting research.

Citation: Bush RA, Connelly CD, Pérez A, Barlow H, Chiang GJ. Extracting autism spectrum disorder data from the electronic health record. Appl Clin Inform 2017; 8: 731–741


1. Background and Significance

Autism spectrum disorder (ASD) is characterized by impairments in social interaction and communication along with restricted, repetitive, and stereotyped patterns of behaviour [[1]], affecting as many as 11.3 per 1,000 (one in 88) children [[2]]. The presentation of ASD can vary widely among affected individuals and within an individual over the lifespan [[3], [4]]. Among the conditions associated with ASD are intellectual disability, seizure disorders, hyperactivity, disorders of the gastrointestinal and immune systems, and anxiety [[5]–[11]]. Medical treatments for children with ASD are primarily directed toward alleviating the co-morbid symptoms rather than the core symptoms [[11]]. The evidence for the utility of involving specialty treatment (particularly gastroenterology, dieticians/nutritionists, and allergy/immunology) and prescribed medication is based on results from meta-analyses of heterogeneous methodology or pooled data. Limited research examines the comparative effectiveness of the adopted techniques in daily practice, or the effectiveness of the adjunct medical or outpatient development treatments (e.g., neurology, speech therapy, dietary modification) that are used daily.

Comparative effectiveness research (CER) is “designed to inform health care decisions by providing evidence on the effectiveness, benefits, and harms of different treatment options. The evidence is generated from research studies that compare drugs, medical devices, tests, surgeries, or ways to deliver health care” [[12]]. Since CER analyzes the health care delivery system as a whole and includes heterogeneous populations, therapy effectiveness is measured in a natural practice setting, and results can be more easily generalized than those of the more tightly controlled randomized controlled trial (RCT) [[13], [14]]. CER can identify interventions that are most effective under various circumstances and can take into account provider variability, institutional volume, and regional characteristics when providing information for patients, providers, and policy makers [[15]–[19]].


2. Objectives

Previously, given the reliance on administrative billing databases, data capture limitations have prevented the longitudinal tracking needed across health care delivery systems and across time to examine the impact of ASD treatment. The use of the Electronic Health Record (EHR) is an improvement. Although EHR data elements vary from system to system, certified ambulatory systems contain the following data in a discrete format: age, gender, diagnosis, medical history, medications prescribed, lab and procedure orders and results, allergies, immunizations, and vital signs [[20]]. Studies in adult populations demonstrate success in extracting EHR data for research with populations of several thousand individuals with conditions such as diabetes and cardiomyopathy [[21], [22]]. A recently conducted claims-based case identification of ASD compared against clinical review of medical charts demonstrates a positive predictive value of almost 90% [[23]]. It is important to determine if pediatric EHR data can provide the data needed to examine treatment effectiveness and to conduct CER with individuals with ASD.


3. Methods

The study was performed in compliance with the World Medical Association Declaration of Helsinki on Ethical Principles for Medical Research involving Human Subjects, and was reviewed and approved by the University of California, San Diego Institutional Review Board. The study was conducted at an academic, pediatric hospital and its affiliated network, which draws from three counties in Southern California. The institution uses the Epic (Madison, WI) EHR system, which incorporates emergency department (ED), inpatient, outpatient (including satellite clinics), laboratory, and radiology input into an integrated system, which shares records within the organization. The EHR system has been fully operational at this location since 2010.

To develop and pilot test methods for using EHR data to capture and measure medical treatment utilization patterns among patients with ASD, several data query techniques were employed to draw data from multiple potential sources, including primary care, specialty, and urgent care/ED use. Search queries were built to create a patient list based on the presence of the related International Classification of Diseases, Ninth Revision (ICD-9) codes in the patient record in four different ways: (1) clinician-assigned during an ambulatory visit (Encounter Dx), (2) abstracted by health information management (HIM) for an encounter after review, research, and verification of patient information and clinical data, (3) recorded on the patient problem list (Prob List), or (4) added as a chief complaint during an ED visit (Chief Complaint).

Queries were executed for all patients who were part of the EHR system, aged 2–18, with ICD-9 codes 299.00, 299.10, and 299.80 as part of a record. Children younger than 2 years of age were excluded since children are infrequently diagnosed with ASD before age 2. Initially, the query techniques, using Business Objects Crystal Reports and the EHR system’s Clarity database, were applied to patients treated during the period 1 October through 31 December 2010 and validated before expanding to the period 1 October 2010 through 30 September 2012. Once a list of patients was created, data for all encounters related to the patient, including demographics (age, race/ethnicity, gender, and payor type); encounter date; type of care (e.g., outpatient, inpatient, medication refill, etc.); provider type (e.g., physician, nurse, occupational therapist, etc.); primary ICD-9 code; secondary ICD-9 codes; ICD-9 procedure codes; and prescribed medications, were extracted for the two-year time period. For 10% of the individuals identified, the data contained in the report were manually compared against the information in the electronic health record to ensure that (1) the query captured all of the patient encounters during the time period, (2) demographic information matched, and (3) the procedures and prescribed medications were complete. The comparison determined that the data pull matched the EHR records.
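The patient-list step can be illustrated with a minimal Python sketch. It is not the Crystal Reports or Clarity query used in the study: the `CodedRecord` layout, its field names, and the `build_patient_list` function are hypothetical, and the sketch assumes each diagnosis record has already been exported with its source workflow, ICD-9 code, patient age, and date.

```python
from dataclasses import dataclass
from datetime import date

# ICD-9 codes used in the study to identify ASD.
ASD_CODES = {"299.00", "299.10", "299.80"}


@dataclass(frozen=True)
class CodedRecord:
    """One diagnosis entry (hypothetical layout, not the Clarity schema)."""
    patient_id: str
    source: str        # "Encounter Dx", "HIM", "Prob List", or "Chief Complaint"
    icd9_code: str
    age_years: int
    record_date: date


def build_patient_list(records, start, end, min_age=2, max_age=18):
    """Map each qualifying patient to the set of sources that identified them."""
    patients = {}
    for rec in records:
        if rec.icd9_code not in ASD_CODES:
            continue  # not one of the three ASD codes
        if not (min_age <= rec.age_years <= max_age):
            continue  # outside the 2-18 age range
        if not (start <= rec.record_date <= end):
            continue  # outside the extraction window
        patients.setdefault(rec.patient_id, set()).add(rec.source)
    return patients


# Made-up records for illustration only.
example_records = [
    CodedRecord("p1", "HIM", "299.00", 7, date(2010, 11, 3)),
    CodedRecord("p1", "Prob List", "299.00", 7, date(2011, 1, 5)),
    CodedRecord("p2", "Chief Complaint", "299.80", 1, date(2011, 2, 1)),
]
patient_list = build_patient_list(
    example_records, start=date(2010, 10, 1), end=date(2012, 9, 30)
)
```

Keeping the set of identifying sources per patient, rather than a flat list of identifiers, is a design choice that makes later comparisons by source possible without rerunning the queries.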

Using SPSS® version 23 [[24]], descriptive frequencies were run for categorical variables, and statistics such as the mean, median, mode, standard deviation, skewness, and kurtosis were computed for continuous variables to identify outliers, as well as to ascertain the type and impact of missing data. Analysis was conducted to identify the number of patients with short or minimal association with the healthcare system and to guide the approach to definitions of loss to follow-up and approaches for censored data. Analysis of variance and chi-square analyses, as appropriate for variable type, were used to examine group differences.
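As a rough illustration of the descriptive screening described here, the following Python sketch computes the same summary statistics for one continuous variable. It is not the SPSS procedure: the function names are hypothetical, skewness and kurtosis use simple moment-based formulas (SPSS applies sample-size adjustments, so values will differ slightly), and the z-score rule for flagging outliers is an assumption rather than the study's criterion.

```python
import statistics


def describe(values):
    """Summary statistics for a continuous variable (needs >= 2 distinct values)."""
    n = len(values)
    mean = statistics.fmean(values)
    # Central moments about the mean (population form, dividing by n).
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return {
        "n": n,
        "mean": mean,
        "median": statistics.median(values),
        "modes": statistics.multimode(values),
        "sd": statistics.stdev(values),        # sample SD (divides by n - 1)
        "skewness": m3 / m2 ** 1.5,            # moment-based, unadjusted
        "excess_kurtosis": m4 / m2 ** 2 - 3.0  # 0 for a normal distribution
    }


def flag_outliers(values, z_threshold=3.0):
    """Return values whose absolute z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [x for x in values if abs(x - mean) / sd > z_threshold]


# Made-up encounter counts per patient, for illustration only.
encounters_per_patient = [2, 3, 3, 4, 2, 3, 60]
summary = describe(encounters_per_patient)
suspect = flag_outliers(encounters_per_patient, z_threshold=2.0)
```

With very small samples a z-score can never exceed (n − 1)/√n, so the threshold in the example is lowered to 2.0; a real data set of this study's size would not have that constraint.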


4. Results

The extraction identified nearly 100,000 encounters for more than 4,800 unique individuals. The demographic variables are presented in ►[Table 1].

Table 1 Encounter Demographic Information

Native American/Eskimo
Pac Islander/Hawaiian

ASD patient encounters were most frequently identified through an HIM abstracted code: 82,450 encounters (82.6%) had one, of which 17,754 were identified solely by an HIM code. Encounters were least likely to be identified using a chief complaint applied during an ED visit; 45,741 (45.8%) were captured using that methodology (►[Figure 1]). A total of 21,585 encounters (21.6%) were identified by all four methods, and the majority were captured using at least two methods. Of note, 32,201 encounters were identified through only one source. The sources of identification are enumerated in ►[Table 2].
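The tallies reported in this paragraph can be reproduced in outline with a short Python sketch. It assumes a hypothetical mapping from encounter identifier to the set of sources that flagged it; the `tally_sources` and `percent` functions are illustrative names, not part of the study's tooling. The final lines check the reported percentages against the reported counts and the n = 99,847 encounters given in the Fig. 1 caption.

```python
from collections import Counter

ALL_SOURCES = frozenset({"Enc Dx", "HIM", "Problem List", "Chief Complaint"})


def tally_sources(encounter_sources):
    """Count encounters by exact source combination and by individual source."""
    combos = Counter(frozenset(s) for s in encounter_sources.values())
    return {
        "total": sum(combos.values()),
        "by_combination": combos,
        # An encounter counts toward every source that flagged it.
        "by_source": {
            src: sum(n for combo, n in combos.items() if src in combo)
            for src in ALL_SOURCES
        },
        "single_source": sum(n for combo, n in combos.items() if len(combo) == 1),
        "all_four": combos.get(ALL_SOURCES, 0),
    }


def percent(count, total):
    """Percentage rounded to one decimal place, as reported in the text."""
    return round(100.0 * count / total, 1)


# Made-up encounters for illustration only.
toy = {
    "e1": {"HIM"},
    "e2": {"HIM", "Problem List"},
    "e3": set(ALL_SOURCES),
    "e4": {"Enc Dx"},
}
toy_tally = tally_sources(toy)

# Consistency check on the counts reported in the text.
N_ENCOUNTERS = 99_847
him_pct = percent(82_450, N_ENCOUNTERS)
chief_complaint_pct = percent(45_741, N_ENCOUNTERS)
all_four_pct = percent(21_585, N_ENCOUNTERS)
```

Counting exact combinations first and deriving per-source totals from them keeps the two views consistent, which matters when the same encounter is flagged by several workflows.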

Table 2 Source of Encounter Identification

Cumulative Percent

All four sources
HIM, Problem List
HIM, Problem List, Chief Complaint
Enc Dx, HIM, Problem List
Problem List
Chief Complaint
Enc Dx, HIM
HIM, Chief Complaint
Enc Dx, HIM, Chief Complaint
Problem List, Chief Complaint
Enc Dx, Problem List
Enc Dx
Enc Dx, Chief Complaint

HIM: Health Information Management, Enc Dx: Encounter Diagnosis
Fig. 1 Encounter Identification Source (Oct 2010 to Sep 2012; n = 99,847)

The most frequent encounter types were office visits (34.5%), development services (which includes speech, occupational, and physical therapy) (27.9%), and clinicians recording emails or telephone calls with patient/parents (14.7%). The departments with the most frequent encounters were pediatrics (23.6%), speech therapy (13.3%), occupational therapy (10.0%) and neurology (7.7%), and the most common provider types were physicians (44.7%), speech therapists (14.7%) and occupational therapists (10.7%).

Based on the noted differences by source type, chi-squared analysis of those encounters captured using HIM assigned codes versus codes assigned by the other three methodologies (Prob List, Chief Complaint, Encounter Dx) was used to determine if there were differences in patients captured by query type. Developmental services (Dev Services) and hospital encounters (Hospital) were over-represented by using only HIM coding (χ² = 3722.8, p < 0.001); office visits and communication with patients/parents (Communication) were more likely to be identified through a query of non-HIM sources (χ² = 1497.0, p < 0.001) (►[Figure 2]).
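For readers unfamiliar with the test, a minimal Python sketch of the Pearson chi-square statistic for a contingency table follows. The study used SPSS; this function is only an illustration, its name is hypothetical, and the counts in the example are invented and are not study data. It assumes every row and column total is positive, and it returns the statistic and degrees of freedom only (a p-value would require the chi-square distribution, available for instance through SciPy's `scipy.stats.chi2_contingency`).

```python
def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table.

    `table` is a list of rows of observed counts, for example rows for
    HIM-only versus non-HIM identification and columns for encounter types.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of row and column variables.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Invented counts for illustration only:
# rows = identification source, columns = two encounter types.
made_up_counts = [
    [10, 20],  # HIM only
    [20, 10],  # non-HIM sources
]
statistic, dof = chi_square_independence(made_up_counts)
```

A large statistic relative to the degrees of freedom indicates that the distribution of encounter types differs between the two identification routes, which is the pattern the reported results describe.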

Fig. 2 Encounter Type by Source

There were also noted differences in race depending on query type (►[Figure 3]). Black patients and patients who refused to identify their race were underrepresented by using only HIM coding (χ² = 816.5, p < 0.001). Whether the payor was private insurance, government reimbursement, self-pay, or indigent also differed depending on the source of the ASD coding (χ² = 354.1, p < 0.001). There were no significant differences in gender by source of the coding.

Fig. 3 Race by Source of Identification


5. Discussion

Four different data queries extracting data from the same integrated pediatric EHR system yielded substantially different results. The differences demonstrated that the workflow by which a diagnosis enters a patient’s record varies notably. Although administrative or billing data provide the majority of information, the information gained from clinic documentation, such as records generated in the ED, is also an important source of patient identification, particularly for those individuals lacking private insurance, and adds to the diversity of patients captured. EHR-derived data may not be comprehensive enough for research unless multiple sources capturing several workflows are queried.

A significant strength of this project was the ability to employ different queries within a large heterogeneous healthcare delivery system that is the primary referral source for ASD in the geographic area and to have the statistical power to compare ASD capture within the EHR. This project demonstrated it is possible to identify patients with ASD and to capture the data needed to identify and quantify associated medical conditions. Such data are critical if medical interventions for ASD are to be studied with sufficient strength of evidence to evaluate either their potential benefits or adverse effects [[11]]. Data such as these will add to the growing body of clinical guidelines on the Agency for Healthcare Research and Quality (AHRQ) National Guideline Clearinghouse available to advise clinicians and administrators about the organization, financing, and delivery of services to children with ASD [[25]], as well as provide the desired patient-centered approaches to treatment, which recognize family dynamics and other social factors and demand an outcomes-based analysis.

The process of assigning ICD codes is complicated. There are numerous potential sources of error affecting ICD code accuracy, including the amount and quality of information at admission, communication among patients and providers, the clinician’s knowledge of and experience with the illness, and the clinician’s attention to detail [[26]]. Querying one source is not enough. For example, when using an algorithm designed to identify type 2 diabetes cases in the EHR, Pacheco et al. found just over half of patients were identified by searching the problem list, and Kahn and Ranade found significantly different rates in safety source data from one hospital, resulting from differences in workflow practice [[22], [27]]. A Canadian study analyzing administrative health data found that the condition of only 7% of obese children was correctly identified with this information source, which relied primarily on inpatient hospital data. The child’s weight was not noted during the inpatient stay and outpatient visits were not included in the analysis, so the administrative data grossly underestimated the true population prevalence of obesity [[28]]. Similarly, a significant discrepancy has been demonstrated among a small cohort of pediatric asthma patients: clinical features compatible with a diagnosis of asthma were present in the EHR, but no ICD-9 code reflected the condition [[29]]. Underdiagnosis of health conditions has tremendous implications for health planning.

This study demonstrated that using a variety of data sources within the EHR may improve the accuracy and representativeness of the information captured. While the patient’s medical history is generally captured in a narrative format, tools such as “smart notes” and history templates capture the information as discrete data elements. The EHR incorporates a computerized physician order entry (CPOE) system allowing providers to manage and communicate orders and results, which are recorded electronically. Among the benefits of using these data are the data validation programs currently in place. Additionally, EHR clinical users undergo substantial training in order to have access to the system and to enter data. There are numerous programmed validation checks of the data, which provide uniformity to the data captured, in addition to detailed data dictionaries and documentation of the definitions applied to the captured data. Multiple source extraction illustrated overlap of data, greater inclusiveness of data capture than from a sole source, and the ability to crosscheck when multiple sources are used. The findings support the capture of multiple workflows for greater patient and condition identification.

The overall utility of extracting such patient data for adult cohort studies is supported by the findings of Wells et al., who extracted echocardiographic data from the EHR and suggest it is possible to create EHR-based cohorts for use in the study of epidemiologic and genotype-phenotype associations in diverse populations [[21]]. The successful methodologic approach of building queries in this pediatric population had similar results to the work of Davis et al., who also used four algorithms based on ICD-9 codes and text keywords to identify adult individuals with multiple sclerosis [[30]]. Their approach, however, began with a training database of a smaller set of known individuals and also used medications as a query approach. Lawrence et al. also noted the value of patient identification using a combination query approach of individuals with one or more outpatient diagnosis codes of diabetes or a prescription for insulin [[31]].

The health care system studied is the sole pediatric referral health care center for two large Southern California counties and part of a third, as well as being the highest volume autism services provider in the region. While it is estimated the EHR system used captures 80% of pediatric patients in the area, there are patients who do not seek treatment from the integrated delivery system provider or for whom only part of their treatment is captured in the EHR. In conducting cross-verification of the encounters captured against a sample of 150 known ASD patients to verify the completeness of data, it became clear that many of the patients obtain clinical care outside the integrated delivery system because of limitations of insurance, additional school-based programs, and other reasons. The limitation in the ability to access and combine data across multiple platforms has also been noted in other studies, in which the EHR data may be less comprehensive than claimed if derived from only one of the clinical practices from which a patient seeks care, unless the practice is part of an integrated delivery system that is providing the patient with all of his or her ambulatory and inpatient care [[20], [31], [32]].

This project demonstrates it was possible to leverage routine data entry by pediatric care providers via the EHR within a diverse, regional referral pediatric healthcare system to construct a large clinical data set without the burden of manual data collection in the clinical setting. Additionally, the data collected contained extensive health care data with which to analyze utilization patterns and characterize current medical treatment practices within an ethnically and socioeconomically diverse population. This particular study population will allow for better understanding of potential subgroups in what is known to be a heterogeneous condition. The volume and validity of data suggest it is possible to use the EHR data to address relevant CER questions in a timely manner, thereby avoiding the expense, extended follow-up period, and potential reluctance of patients and their families to be randomized, which are associated with an RCT.


6. Conclusion

This study examined the availability and utility of detailed clinical and administrative data contained in the EHR system. Its methodology recognizes the interrelatedness of child health domains and, most importantly, addresses the paucity of available data sources related to the medical treatment of ASD patients. Data extracted from the EHR system are a potentially rich resource for conducting comparative effectiveness research and epidemiologic surveillance, including longitudinal analyses, of medical utilization among children with ASD, as well as potential changes in clinical practice patterns among ASD patients. It is important to employ a variety of data extraction methods to capture patients who enter the EHR through different clinical workflows.


Multiple Choice Questions

Using which of the following data extractions was most likely to identify an autism spectrum disorder patient?

  A. Chief complaint during Emergency Department visit

  B. Clinician assigned during ambulatory visit

  C. Abstracted by health information management for an encounter

  D. Recorded on the patient problem list

The correct answer is C. Autism Spectrum Disorder patients were most frequently identified with an HIM abstracted code from an encounter (82.6%) and least likely to be identified by a chief complaint during an ED visit (45.8%).

Clinical Relevance Statement

There are few available data sources related to the medical treatment of autism spectrum disorder patients. Electronic health records offer detailed clinical and administrative data with the potential for use in comparative effectiveness research. This study evaluates the extracted EHR data and demonstrates that a variety of extraction methods are needed to capture a robust profile of ASD clinical data.


Conflict of Interest

The authors have no conflicts of interest to disclose.


The authors appreciate the report-writing assistance of Cynthia Sepulveda and Anouk Bellengi.

Protection of Human Subjects

The study was performed in compliance with the World Medical Association Declaration of Helsinki on Ethical Principles for Medical Research involving Human Subjects, and was reviewed by the University of California, San Diego Institutional Review Board.
