Summary
Objectives: To develop an adaptive approach to mine frequent semantic tags (FSTs) from heterogeneous
clinical research texts.
Methods: We developed a “plug-n-play” framework that integrates replaceable unsupervised kernel
algorithms with formatting, functional, and utility wrappers for FST mining. Temporal
information identification and semantic equivalence detection were two example functional
wrappers. We first compared this approach’s recall and efficiency for mining FSTs
from ClinicalTrials.gov with those of a recently published tag-mining algorithm. We
then assessed its adaptability to two other types of clinical research text,
clinical data requests and clinical trial protocols, by comparing the prevalence trends
of FSTs across the three text types.
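
The component-based design described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation: every name here (`ngram_kernel`, `equivalence_wrapper`, `mine_fsts`, `min_doc_freq`) is hypothetical, the bigram kernel is only a stand-in for a replaceable unsupervised kernel, the synonym table is a toy, and counting frequency as the number of documents containing a tag is an assumption.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List

# Hypothetical interfaces; the source does not specify an API.
Kernel = Callable[[str], List[str]]          # text -> candidate tags
Wrapper = Callable[[List[str]], List[str]]   # tags -> transformed tags


def ngram_kernel(text: str) -> List[str]:
    """Stand-in unsupervised kernel: lowercase word bigrams."""
    words = text.lower().split()
    return [" ".join(pair) for pair in zip(words, words[1:])]


def equivalence_wrapper(tags: List[str]) -> List[str]:
    """Toy semantic-equivalence wrapper: map known synonyms to one form."""
    synonyms = {"heart attack": "myocardial infarction"}
    return [synonyms.get(tag, tag) for tag in tags]


def mine_fsts(
    docs: Iterable[str],
    kernel: Kernel,
    wrappers: List[Wrapper],
    min_doc_freq: int,
) -> Dict[str, int]:
    """Run the kernel, then each wrapper in order, and keep frequent tags.

    Frequency is assumed to mean the number of documents containing a tag.
    """
    doc_freq: Counter = Counter()
    for doc in docs:
        tags = kernel(doc)
        for wrap in wrappers:
            tags = wrap(tags)
        doc_freq.update(set(tags))  # count each tag once per document
    return {tag: n for tag, n in doc_freq.items() if n >= min_doc_freq}


docs = [
    "prior heart attack",
    "history of myocardial infarction",
    "no heart attack",
]
with_wrapper = mine_fsts(docs, ngram_kernel, [equivalence_wrapper], min_doc_freq=2)
without_wrapper = mine_fsts(docs, ngram_kernel, [], min_doc_freq=2)
```

Because the kernel and the wrappers are passed in as plain callables, either can be swapped without touching the mining loop, which is the property the “plug-n-play” description relies on. In this toy run, the equivalence wrapper merges two surface forms into one tag, so the tag is counted in all three documents rather than two.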
Results: Our approach improved average recall and speed by 12.8% and 47.02%, respectively,
over the baseline when mining FSTs from ClinicalTrials.gov, and maintained an overlap
in relevant FSTs with the baseline ranging between 76.9% and 100% for varying FST
frequency thresholds. The mined FSTs saturated once the data size reached 200 documents.
Consistent trends in the prevalence of FSTs were observed across the three text types as
the data size or frequency threshold changed.
Conclusions: This paper contributes a tag-mining framework that is scalable and adaptable
without sacrificing recall. Its component-based architectural design is potentially
generalizable and could improve the adaptability of other clinical text mining methods.
Keywords
Medical informatics - text mining - clinical trials - semantic tags - component-based
architecture