Methods Inf Med 2017; 56(05): 370-376
DOI: 10.3414/ME17-01-0028
Paper
Schattauer GmbH

A Bag of Concepts Approach for Biomedical Document Classification Using Wikipedia Knowledge[*]

Spanish-English Cross-language Case Study
Marcos A. Mouriño-García [1], Roberto Pérez-Rodríguez [1], Luis E. Anido-Rifón [1]
[1] Department of Telematics Engineering, University of Vigo, Vigo, Spain

Publication History

received: 13 March 2017

accepted in revised form: 07 July 2017

Publication Date: 24 January 2018 (online)

Summary

Objectives: The ability to efficiently review the existing literature is essential for the rapid progress of research. This paper describes a classifier of text documents, represented as vectors in spaces of Wikipedia concepts, and analyses its suitability for classification of Spanish biomedical documents when only English documents are available for training. We propose the cross-language concept matching (CLCM) technique, which relies on Wikipedia interlanguage links to convert concept vectors from the Spanish to the English space.

Methods: The performance of the classifier is compared to several baselines: a classifier based on machine translation, a classifier that represents documents after performing Explicit Semantic Analysis (ESA), and a classifier that uses a domain-specific semantic annotator (MetaMap). The corpus used for the experiments (Cross-Language UVigoMED) was purpose-built for this study and comprises 12,832 English and 2,184 Spanish MEDLINE abstracts.

Results: Our approach outperforms every other state-of-the-art classifier in the benchmark, with performance increases of up to 124% over classical machine translation, 332% over MetaMap, and 60 times over the classifier based on ESA. The results are statistically significant (p-values < 0.0001).

Conclusion: By using knowledge mined from Wikipedia to represent documents as vectors in a space of Wikipedia concepts, and by translating those vectors between language-specific concept spaces, a cross-language classifier can be built that performs better than several state-of-the-art classifiers.
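
The vector translation step described in this summary can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that a concept vector is a mapping from Wikipedia article titles to weights, that the interlanguage links are available as a simple Spanish-to-English dictionary, that concepts without a link are dropped, and that weights are summed when several Spanish concepts map to the same English concept. The function name `clcm_translate` and the sample link table are hypothetical.

```python
from typing import Dict

# Hypothetical interlanguage-link table (Spanish title -> English title).
# In practice it would be derived from Wikipedia interlanguage links.
INTERLANGUAGE_LINKS: Dict[str, str] = {
    "Corazón": "Heart",
    "Miocardio": "Cardiac muscle",
    "Músculo cardíaco": "Cardiac muscle",
}


def clcm_translate(es_vector: Dict[str, float],
                   links: Dict[str, str]) -> Dict[str, float]:
    """Map a Spanish concept vector into the English concept space.

    Assumptions (not stated in the source): unlinked concepts are
    discarded, and weights are added when two Spanish concepts share
    the same English target.
    """
    en_vector: Dict[str, float] = {}
    for concept, weight in es_vector.items():
        target = links.get(concept)
        if target is None:
            continue  # no interlanguage link, so the concept is dropped
        en_vector[target] = en_vector.get(target, 0.0) + weight
    return en_vector


if __name__ == "__main__":
    spanish_doc = {"Corazón": 2.0, "Miocardio": 1.0,
                   "Músculo cardíaco": 0.5, "Sin enlace": 3.0}
    print(clcm_translate(spanish_doc, INTERLANGUAGE_LINKS))
```

Under these assumptions, the translated vector can be passed directly to a classifier trained on English concept vectors, since both now share the same feature space.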

* Supplementary material published on our website https://doi.org/10.3414/ME17-01-0028


 