Methods Inf Med 2021; 60(01/02): 049-061
DOI: 10.1055/s-0041-1731388
Original Article

Semi-automated Conversion of Clinical Trial Legacy Data into CDISC SDTM Standards Format Using Supervised Machine Learning

Takuma Oda¹, Shih-Wei Chiu¹, Takuhiro Yamaguchi¹
¹ Division of Biostatistics, Tohoku University Graduate School of Medicine, Sendai-city, Miyagi Prefecture, Japan
Funding This study is based on research using information obtained from www.projectdatasphere.org, which is maintained by Project Data Sphere. Neither Project Data Sphere nor the owner(s) of any information from the website have contributed to, approved, or are in any way responsible for the contents of this study.

Abstract

Objective This study aimed to develop a semi-automated process for converting legacy data into the Clinical Data Interchange Standards Consortium (CDISC) Study Data Tabulation Model (SDTM) format by combining human verification with three methods: data normalization; feature extraction by distributed representation of dataset names, variable names, and variable labels; and supervised machine learning.

Materials and Methods Variable labels, dataset names, variable names, and values of legacy data were used as machine learning features. Because most of these data are strings, they were converted to a distributed representation to make them usable as features. Three methods were used for this conversion: Gestalt pattern matching, cosine similarity after vectorization by Doc2vec, and vectorization by Doc2vec. We examined five algorithms (decision tree, random forest, gradient boosting, neural network, and an ensemble combining the four) to identify the one that generated the best prediction model.
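
As an illustration of the string-similarity step only, the following minimal sketch turns a legacy variable label into numeric features. It assumes Python's `difflib.SequenceMatcher`, whose matching is closely related to Ratcliff/Obershelp Gestalt pattern matching; the abstract does not state which implementation the authors used. The `SDTM_LABELS` subset, the `normalize` helper, and the example label `Patient_Age` are hypothetical and chosen for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical, illustrative subset of SDTM target variables and labels.
SDTM_LABELS = {
    "AGE": "Age",
    "SEX": "Sex",
    "RACE": "Race",
    "AESTDTC": "Start Date/Time of Adverse Event",
}


def normalize(text):
    """Lowercase the text, turn underscores into spaces, collapse whitespace."""
    return " ".join(text.lower().replace("_", " ").split())


def gestalt_features(legacy_label):
    """Return one similarity ratio in [0, 1] per SDTM target variable."""
    src = normalize(legacy_label)
    return {
        var: SequenceMatcher(None, src, normalize(label)).ratio()
        for var, label in SDTM_LABELS.items()
    }


# Assumed example of a legacy variable label.
features = gestalt_features("Patient_Age")
best = max(features, key=features.get)
print(best, features)
```

For this assumed label, `AGE` receives the highest ratio. In a pipeline of the kind described, such ratios would be one group of inputs to the classifier alongside the Doc2vec-based features, not a mapping decision on their own.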

Results The accuracy rate was highest for the neural network, and the distribution of its prediction probabilities also separated correct from incorrect predictions. By combining human verification and the three methods, we were able to semi-automatically convert legacy data into the CDISC SDTM format.
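
One way such a separation in prediction probabilities could support human verification is sketched below: predictions above a cutoff are accepted provisionally, and the rest are routed to a reviewer. This is an assumption about how the two pieces might be combined, not the authors' documented procedure; the `triage` function, the 0.9 threshold, and the sample predictions are hypothetical.

```python
def triage(predictions, threshold=0.9):
    """Split (legacy_variable, predicted_sdtm_variable, probability) tuples.

    Items at or above the threshold are accepted provisionally; all others
    are queued for human verification. The threshold is a hypothetical value.
    """
    accepted, review = [], []
    for item in predictions:
        probability = item[2]
        if probability >= threshold:
            accepted.append(item)
        else:
            review.append(item)
    return accepted, review


# Made-up example predictions, for illustration only.
sample = [
    ("PT_AGE", "AGE", 0.98),
    ("GENDER", "SEX", 0.95),
    ("ONSET", "AESTDTC", 0.55),
]
accepted, review = triage(sample)
print(len(accepted), "accepted;", len(review), "sent to human review")
```

A stricter threshold sends more variables to review and a looser one sends fewer, so the cutoff trades reviewer workload against the risk of accepting a wrong mapping.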

Conclusion By combining human verification and the three methods, we have successfully developed a semi-automated process to convert legacy data into the CDISC SDTM format; this process is more efficient than the conventional fully manual process.

Note

This study does not include individual human subject data.




Publication History

Received: 02 January 2021

Accepted: 22 May 2021

Article published online: 08 July 2021

© 2021. Thieme. All rights reserved.

Georg Thieme Verlag KG
Rüdigerstraße 14, 70469 Stuttgart, Germany