DOI: 10.1055/s-0041-1731388
Semi-automated Conversion of Clinical Trial Legacy Data into CDISC SDTM Standards Format Using Supervised Machine Learning
Funding This study is based on research using information obtained from www.projectdatasphere.org, which is maintained by Project Data Sphere. Neither Project Data Sphere nor the owner(s) of any information from the website have contributed to, approved, or are in any way responsible for the contents of this study.
Abstract
Objective This study aimed to develop a semi-automated process to convert legacy data into the Clinical Data Interchange Standards Consortium (CDISC) Study Data Tabulation Model (SDTM) format by combining human verification with three methods: data normalization; feature extraction by distributed representation of dataset names, variable names, and variable labels; and supervised machine learning.
Materials and Methods Variable labels, dataset names, variable names, and values of legacy data were used as machine learning features. Because most of these data are strings, they were converted to numerical representations so that they could be used as machine learning features. For this purpose, we used three methods: Gestalt pattern matching, cosine similarity after vectorization by Doc2vec, and vectorization by Doc2vec. We examined five algorithms (decision tree, random forest, gradient boosting, neural network, and an ensemble combining the four) to identify the one that generated the best prediction model.
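As an illustration of how string metadata can become numeric features, the following is a minimal sketch, not the authors' implementation. It uses Python's `difflib.SequenceMatcher`, which implements a Gestalt-style (Ratcliff/Obershelp-like) similarity, together with a plain cosine similarity. The target list `SDTM_TARGETS`, the helpers `normalize` and `similarity_features`, and the example vectors are hypothetical; in the study the vectors come from Doc2vec, which is not reproduced here.

```python
import math
from difflib import SequenceMatcher

# Hypothetical subset of SDTM target variables and labels (illustration only).
SDTM_TARGETS = {
    "USUBJID": "Unique Subject Identifier",
    "AESTDTC": "Start Date/Time of Adverse Event",
    "AETERM": "Reported Term for the Adverse Event",
}

def normalize(text):
    """Lowercase, turn underscores into spaces, and collapse whitespace."""
    return " ".join(text.replace("_", " ").lower().split())

def gestalt_similarity(a, b):
    """Gestalt-style similarity in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def similarity_features(legacy_name, legacy_label):
    """One name similarity and one label similarity per hypothetical SDTM target."""
    features = []
    for name, label in SDTM_TARGETS.items():
        features.append(gestalt_similarity(legacy_name, name))
        features.append(gestalt_similarity(legacy_label, label))
    return features

# Example: a legacy variable described only by its name and label.
features = similarity_features("AE_TERM", "Reported term for adverse event")

# Placeholder vectors standing in for Doc2vec embeddings of two labels.
label_similarity = cosine_similarity([0.2, 0.7, 0.1], [0.1, 0.8, 0.0])
```

The resulting `features` list could be fed to any of the classifiers named above; which features the authors actually combined is described only at the level of detail given in this abstract.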
Results The accuracy rate was highest for the neural network, and the prediction probabilities of correct and incorrect predictions formed separate distributions. By combining human verification and the three methods, we were able to semi-automatically convert legacy data into the CDISC SDTM format.
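One way a separation in prediction probabilities could support human verification is to route predictions by confidence. The sketch below is an assumption for illustration only: the abstract does not state how reviewers used the probabilities, and the function `triage`, the threshold of 0.9, and the example records are all hypothetical.

```python
def triage(predictions, threshold=0.9):
    """Split (legacy_variable, predicted_sdtm_variable, probability) records.

    Records at or above the hypothetical threshold go to a light confirmation
    queue; the rest go to full manual mapping.
    """
    confirm, manual = [], []
    for record in predictions:
        probability = record[2]
        if probability >= threshold:
            confirm.append(record)
        else:
            manual.append(record)
    return confirm, manual

# Hypothetical model output for three legacy variables.
predictions = [
    ("SUBJ_ID", "USUBJID", 0.98),
    ("AE_TERM", "AETERM", 0.95),
    ("ONSET", "AESTDTC", 0.41),
]
confirm, manual = triage(predictions)
```

Under these assumptions, a reviewer would confirm the first two mappings quickly and map the third by hand.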
Conclusion By combining human verification and the three methods, we have successfully developed a semi-automated process to convert legacy data into the CDISC SDTM format; this process is more efficient than the conventional fully manual process.
Note
This study does not include individual human subject data.
Publication History
Received: 02 January 2021
Accepted: 22 May 2021
Article published online: 08 July 2021
© 2021. Thieme. All rights reserved.
Georg Thieme Verlag KG
Rüdigerstraße 14, 70469 Stuttgart, Germany