Semi-automated Conversion of Clinical Trial Legacy Data into CDISC SDTM Standards Format Using Supervised Machine Learning

Takuma Oda; Shih-Wei Chiu; Takuhiro Yamaguchi

doi:10.1055/s-0041-1731388

Methods Inf Med 2021; 60(01/02): 049-061

Original Article

Authors

Takuma Oda¹
Shih-Wei Chiu¹
Takuhiro Yamaguchi¹

¹Division of Biostatistics, Tohoku University Graduate School of Medicine, Sendai-city, Miyagi Prefecture, Japan

Funding This study is based on research using information obtained from www.projectdatasphere.org, which is maintained by Project Data Sphere. Neither Project Data Sphere nor the owner(s) of any information from the website have contributed to, approved, or are in any way responsible for the contents of this study.

Abstract

Objective This study aimed to develop a semi-automated process to convert legacy data into the Clinical Data Interchange Standards Consortium (CDISC) Study Data Tabulation Model (SDTM) format by combining human verification with three methods: data normalization; feature extraction by distributed representation of dataset names, variable names, and variable labels; and supervised machine learning.

Materials and Methods Variable labels, dataset names, variable names, and values of legacy data were used as machine learning features. Because most of these data are strings, they were converted to a distributed representation to make them usable as features. Three methods were used for this purpose: Gestalt pattern matching, cosine similarity after vectorization by Doc2vec, and vectorization by Doc2vec. We examined five algorithms (decision tree, random forest, gradient boosting, neural network, and an ensemble combining these four) to identify the one that generated the best prediction model.
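As an illustration of the first of these representations, the following minimal sketch scores a legacy variable label against a handful of SDTM variable labels using Gestalt pattern matching as implemented in Python's standard difflib module. The label subset, the normalization step, and the function names are assumptions made for this example and are not taken from the study.

```python
from difflib import SequenceMatcher

# Illustrative subset of SDTM variables and labels (an assumption for this
# sketch, not the label set used in the study).
SDTM_LABELS = {
    "USUBJID": "Unique Subject Identifier",
    "AETERM": "Reported Term for the Adverse Event",
    "AESTDTC": "Start Date/Time of Adverse Event",
    "SEX": "Sex",
}


def normalize(text):
    """Lowercase and collapse whitespace so casing and spacing do not affect the score."""
    return " ".join(text.lower().split())


def gestalt_similarity(a, b):
    """Ratcliff/Obershelp-style similarity in [0, 1], computed by difflib."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def similarity_features(legacy_label):
    """One similarity score per candidate SDTM variable, usable as model features."""
    return {
        variable: gestalt_similarity(legacy_label, label)
        for variable, label in SDTM_LABELS.items()
    }


def best_match(legacy_label):
    """Candidate SDTM variable whose label is most similar to the legacy label."""
    features = similarity_features(legacy_label)
    return max(features, key=features.get)


if __name__ == "__main__":
    for label in ["Unique subject identifier", "Reported term for adverse event"]:
        print(label, "->", best_match(label), similarity_features(label))
```

In a setting like the one described, such scores would be one group of features alongside the Doc2vec-based ones rather than a mapping rule on their own.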

Results The accuracy rate was highest for the neural network, and the distribution of prediction probabilities also separated correct from incorrect predictions. By combining human verification and the three methods, we were able to semi-automatically convert legacy data into the CDISC SDTM format.
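When prediction probabilities separate correct from incorrect predictions, one plausible way to combine a model with human verification is to accept high-probability mappings automatically and send the rest to a reviewer. The abstract does not state how verification was routed, so the sketch below is an assumption throughout: the threshold of 0.9, the record layout, and the function name triage are hypothetical.

```python
def triage(predictions, threshold=0.9):
    """Split model predictions into auto-accepted and human-review lists.

    Each prediction is a dict with a "source" (legacy variable) and
    "probabilities" (candidate SDTM variable -> predicted probability).
    The 0.9 default is a hypothetical cut-off, not a value from the study.
    """
    accepted, review = [], []
    for item in predictions:
        # Take the candidate with the highest predicted probability.
        target, probability = max(
            item["probabilities"].items(), key=lambda pair: pair[1]
        )
        record = {
            "source": item["source"],
            "target": target,
            "probability": probability,
        }
        if probability >= threshold:
            accepted.append(record)
        else:
            review.append(record)
    return accepted, review


if __name__ == "__main__":
    sample = [
        {"source": "AE.AESTART", "probabilities": {"AESTDTC": 0.97, "AEENDTC": 0.03}},
        {"source": "DM.GENDER", "probabilities": {"SEX": 0.55, "RACE": 0.45}},
    ]
    accepted, review = triage(sample)
    print("accepted:", accepted)
    print("needs review:", review)
```

Lowering the threshold reduces reviewer workload and increases the risk of accepting a wrong mapping, so any cut-off would need to be chosen from the observed probability distributions.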

Conclusion By combining human verification and the three methods, we have successfully developed a semi-automated process to convert legacy data into the CDISC SDTM format; this process is more efficient than the conventional fully manual process.

Keywords

data conversion - clinical trial - supervised machine learning - database

Note

This study does not include individual human subject data.

Publication History

Received: January 2, 2021

Accepted: May 22, 2021

Article published online:
July 8, 2021

Georg Thieme Verlag KG
Rüdigerstraße 14, 70469 Stuttgart, Germany

