Open Access
CC BY-NC-ND 4.0 · Yearb Med Inform 2019; 28(01): 208-217
DOI: 10.1055/s-0039-1677918
Section 10: Natural Language Processing
Survey
Georg Thieme Verlag KG Stuttgart

Recent Advances in Using Natural Language Processing to Address Public Health Research Questions Using Social Media and Consumer-Generated Data

Authors

  • Mike Conway

    1   Department of Biomedical Informatics, University of Utah, Salt Lake City, Utah, United States
  • Mengke Hu

    1   Department of Biomedical Informatics, University of Utah, Salt Lake City, Utah, United States
  • Wendy W. Chapman

    1   Department of Biomedical Informatics, University of Utah, Salt Lake City, Utah, United States
Further Information

Correspondence to

Mike Conway

Publication History

Publication Date:
16 August 2019 (online)

 

Summary

Objective: We present a narrative review of recent work on the utilisation of Natural Language Processing (NLP) for the analysis of social media (including online health communities) specifically for public health applications.

Methods: We conducted a literature review of NLP research that utilised social media or online consumer-generated text for public health applications, focussing on the years 2016 to 2018. Papers were identified in several ways, including PubMed searches and the inspection of recent conference proceedings from the Association of Computational Linguistics (ACL), the Conference on Human Factors in Computing Systems (CHI), and the International AAAI (Association for the Advancement of Artificial Intelligence) Conference on Web and Social Media (ICWSM). Popular data sources included Twitter, Reddit, various online health communities, and Facebook.

Results: In the recent past, communicable diseases (e.g., influenza, dengue) have been the focus of much social media-based NLP health research. However, mental health and substance use and abuse (including the use of tobacco, alcohol, marijuana, and opioids) have been the subject of an increasing volume of research in the 2016-2018 period. Associated with this trend, the use of lexicon-based methods remains popular given the availability of psychologically validated lexical resources suitable for mental health and substance abuse research. Finally, we found that in the period under review “modern” machine learning methods (i.e. deep neural-network-based methods), while increasing in popularity, remain less widely used than “classical” machine learning methods.


1 Introduction

Social media is a valuable source of data for public health research. It is estimated that 75% of Internet users have read or watched online health information content, and 26% of Internet users have posted (or shared) their personal health information online [1]. This large-scale sharing of health information makes social media and Online Health Communities (OHC) a valuable and abundant source of data for addressing public health questions. Social media – including online consumer generated OHC data – provide a ready source of timely, abundant data that can serve as a valuable resource for several broad types of public health applications, including surveillance, health communication, sentiment analysis, and understanding the natural history of a disease, injury, or health behaviour. Research on utilising social media in conjunction with Natural Language Processing (NLP) for public health applications is a robust and growing area of study, with dedicated meetings[1] and a now well-established research community [2]. Regarding surveillance, the importance of mental health and substance abuse surveillance is increasingly recognised [3]. This growth is unsurprising given that it is estimated that mental health and substance abuse constitute approximately 10.4% of the global burden of disease and are the leading cause of years lived with disability, imposing direct and indirect costs on the world economy of around US$2.5 trillion [4]. The study of health communication is another area of research that uses social media in conjunction with NLP methods, particularly in the area of understanding and quantifying vaccine hesitancy and refusal. NLP can support public health researchers in identifying common health-related misconceptions, and in turn, devising more effective health communication methods [5]. Similarly, sentiment analysis with respect to products relevant to public health (e.g. marijuana-related products, e-cigarettes) and the health behaviours that they facilitate is a further area of research [6]. Finally, social media provide a valuable data source for studies focussed on understanding and analysing the natural history of a disease, illness or injury, especially in the context of new and re-emerging diseases and rapid changes in health behaviour [7].

The key changes we have observed since 2016 – apart from the growth in research related to mental health and substance abuse and the increasing interest in “modern” machine learning methods – include a move towards integrating social media analysis with the Electronic Health Record (EHR) [8], in part as a means of obtaining valuable diagnostic “ground truth”. A further shift of note is the increased interest in elucidating ethical issues in the application of NLP (and machine learning more generally) to social media for public health applications, particularly with respect to protecting the rights of those users suffering from potentially stigmatising conditions [9].

Challenges in developing high-performance NLP methods for social media have been extensively enumerated, but in summary, major outstanding problems include the use of non-standard grammar, the use of rapidly changing and often non-standard slang terms, spelling variation in informal consumer-generated text, the rapidly changing nature of social media language, and finally the identification (and filtering) of jokes, memes, and advertising [2].

In this paper, we review literature from the period 2016-2018 regarding the application of NLP methods to social media data as a means of addressing public health research questions, focussing specifically on new application areas and the adoption of new methods. A distinctive feature of this review is an emphasis on the increasing volume of research focussed on ethics-related issues involved in using consumer-generated data for public health research.


2 Methods

Our paper selection process involved the following steps. First, we searched PubMed, the Association for Computational Linguistics Anthology, the Proceedings of the Conference on Human Factors in Computing Systems (CHI), and the Proceedings of the International AAAI (Association for the Advancement of Artificial Intelligence) Conference on Web and Social Media (ICWSM) using a variety of social media and NLP-related keywords. Second, we manually inspected Tables of Contents for the Journal of the American Medical Informatics Association, the Journal of Biomedical Informatics, and the Journal of Medical Internet Research. In this first pass, over 1,800 papers were identified. After reviewing abstracts, we reduced the number of papers reviewed to 130. In order to increase the tractability of the reviewing task, we further winnowed the papers to 71. This winnowing process was designed to capture a large swathe of both application areas and methods, and cannot be interpreted as a comment on the quality of research.

Only the papers that both demonstrated a clear public health focus and explicitly utilised NLP or text mining methods were retained. Papers that reported on the results of qualitative content analysis or professional standards for health communication using social media without reference to NLP were excluded. Papers that discussed ethical issues pertaining to the use of social media for public health applications and research were retained. References dated outside the period 2016-2018 have been included in order to provide important context. The use of these references does not imply that they form part of the document set defined by the inclusion criteria.

The papers reviewed utilise social media from several different sources, including Twitter, Reddit, Weibo, Facebook, and online discussion forums (see [Figure 1] and [Tables 1] & [2]).

Fig. 1 Social media data sources. Note that this list is not exhaustive.
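Table 1

Number of papers by topic and data source. Note that papers can occur in several categories.

| Data Source | Vac[a] | Comm[b] | Cancer[c] | SA[d] | Pharmaco[e] | STI[f] | MH[g] | Total |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Reddit | - | 1 | - | 3 | - | 1 | 13 | 18 |
| Twitter | 3 | 3 | 1 | 17 | 7 | 1 | 9 | 41 |
| Instagram | - | - | - | - | - | - | 1 | 1 |
| Facebook | 1 | - | - | - | - | - | 3 | 4 |
| OHC[h] | 1 | - | 2 | 2 | 1 | - | 6 | 12 |
| Weibo | - | 1 | - | - | - | - | 1 | 2 |
| WhatsApp | - | - | - | 1 | - | - | - | 1 |
| Youtube | - | - | - | 1 | - | - | - | 1 |
| Yik-Yak | - | - | - | 1 | - | - | - | 1 |
| Tumblr | - | - | - | - | - | - | 1 | 1 |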

a Vaccination hesitancy and refusal;


b Communicable diseases;


c Cancer;


d Substance Abuse;


e Pharmacovigilance;


f Sexually transmitted infections;


g Mental health;


h Online Health Communities


Table 2

Data Sources and Topics [Note that ethics-related papers are excluded from this table as they are frequently concerned with social media in general.]

| Data Source | Vac[a] | Comm[b] | Cancer | SA[c] | Pharmaco[d] | STI[e] | MH[f] |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Reddit | - | [10] | - | [11-13] | - | [14] | [15-27] |
| Twitter | [28-30] | [31-33] | [34] | [6, 12, 35-49] | [50-56] | [57] | [18, 58-65] |
| Instagram | - | - | - | - | - | - | [18] |
| Facebook | [66] | - | - | - | - | - | [8, 18, 67] |
| OHC[g] | [5] | - | [68, 69] | [12, 13] | [50] | - | [70-75] |
| Weibo | - | [32] | - | - | - | - | [76] |
| Tumblr | - | - | - | - | - | - | [18] |

a Vaccination hesitancy and refusal;


b Communicable diseases;


c Substance Abuse;


d Pharmacovigilance;


e Sexually transmitted infections;


f Mental health;


g Online Health Communities


The vast majority of the papers reviewed focussed on analysing English language text (68 papers), with two papers focussing on Chinese text [76], [77] and one paper focussing on Japanese text [31]. With respect to the geographical location of first authors, most of the articles emerged from North America (55), with Europe (7), and Asia (including Australasia and Turkey) (6) all represented.

The reviewed papers can be grouped into several health-related categories, including vaccine hesitancy and refusal, communicable diseases surveillance (including sexually transmitted infections [STIs]), cancer, substance abuse, pharmacovigilance, and mental health (see [Table 2]). A wide range of methods were used, including “classical” machine learning (e.g., Random Forests, Support Vector Machines [SVM]), “modern” machine learning (e.g., Convolutional Neural Networks [CNN], Recurrent Neural Networks [RNN][2]), and lexicon-based approaches. Among the lexicon-based approaches, the Linguistic Inquiry and Word Count (LIWC) lexicon, a dictionary of words arranged into numerous psychological dimensions, is used extensively in many of the papers reviewed, especially in the areas of mental health and substance abuse [79].
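To make the general idea of a lexicon-based approach concrete, the following minimal Python sketch computes, for each category in a small dictionary, the proportion of a post's tokens that belong to that category. The lexicon, category names, and function names are all hypothetical and invented for illustration; they are not drawn from LIWC or from any of the reviewed papers.

```python
import re
from collections import Counter

# Hypothetical toy lexicon: category -> set of words. These entries are
# invented for illustration and are NOT taken from LIWC.
TOY_LEXICON = {
    "negative_emotion": {"sad", "hopeless", "angry", "worried"},
    "positive_emotion": {"happy", "glad", "hopeful"},
    "first_person": {"i", "me", "my"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def category_proportions(text, lexicon=TOY_LEXICON):
    """Return, per category, the share of tokens found in that category."""
    tokens = tokenize(text)
    if not tokens:
        return {category: 0.0 for category in lexicon}
    counts = Counter()
    for token in tokens:
        for category, words in lexicon.items():
            if token in words:
                counts[category] += 1
    return {category: counts[category] / len(tokens) for category in lexicon}


scores = category_proportions("I feel sad and worried about my health")
print(scores)
```

In this example two of the eight tokens fall in each of the negative-emotion and first-person categories. Real lexicons are far larger and typically handle word stems and multiple category membership, which this sketch does not attempt to model.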


3 Results

3.1 Vaccine Hesitancy and Refusal

Vaccine hesitancy – defined by the World Health Organisation as referring to a “delay in acceptance or refusal of vaccines despite availability of vaccination services”[3] – has been a growing subject of research during the review period, with NLP methods applied to social media data in an attempt to develop insights into how best to understand and improve health communication as well as quantifying the degree of vaccine hesitancy in a community.

Of the five papers reviewed in this section (see [Table 3]), three utilised Twitter data [28], [29], [30], one utilised Facebook data [66], and one further paper utilised data derived from an online health community, in this case mothering.com [5]. Supervised machine learning [30] and unsupervised machine learning [5], [28], [29] were both represented. Three of the papers reviewed used classical machine learning methods [5], [29], [30], and one used modern machine learning methods [30], with surveillance [28], [29], [30], health communication [5], [28], [29], [30], [66], and sentiment analysis [28], [29], [30], [66], all frequently studied topics. The LIWC lexicon has been used either to characterise public attitudes towards vaccination in general [66], or as a tool to explore the purported link between autism and the Measles, Mumps, and Rubella vaccine [28]. This last study aimed at investigating key differences between users who are longstanding vaccination advocates, longstanding anti-vaccination advocates, or users who had recently adopted an anti-vaccination orientation. Vaccination to protect against the Human Papillomavirus (HPV) – a vaccine typically administered to adolescent boys and girls to prevent future sexual transmission of the disease – was also the subject of reviewed research, with high performance sentiment classifiers developed (AUC: 0.92) [30], and LDA (Latent Dirichlet Allocation) topic modeling used to identify a number of vaccine-hesitancy-related topics, including clinical evidence and vaccination harms [29].
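For readers less familiar with the AUC figure quoted for the sentiment classifiers, the short Python sketch below shows one standard way of computing the area under the ROC curve: as the proportion of positive/negative example pairs in which the positive example receives the higher classifier score, with ties counted as one half. The labels and scores are hypothetical and are not taken from the reviewed study; the function name is likewise illustrative.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive example is
    scored higher than a randomly chosen negative one (ties count half).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical gold labels (1 = positive class) and classifier scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

Here eight of the nine positive/negative pairs are ranked correctly, giving an AUC of roughly 0.89. The quadratic pairwise loop is chosen for clarity; rank-based formulations are preferable for large data sets.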

Table 3

Summary of vaccine-related papers

| Data Source | SML[a] | UML[b] | CML[c] | MML[d] | Surv[e] | HC[f] | Senti[g] | Lexicon[h] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Twitter | [30] | [28, 29] | [29, 30] | [30] | [28-30] | [28-30] | [28-30] | [28] |
| Facebook | - | - | - | - | - | [66] | [66] | [66] |
| OHC[i] | - | [5] | [5] | - | - | [5] | - | - |

a Supervised machine learning (e.g., Support Vector Machines, Random Forests);


b Unsupervised machine learning (e.g., Latent Dirichlet Allocation, K-means);


c Classical machine learning (e.g., Random Forests, Support Vector Machines);


d Modern machine learning (e.g., Convolutional Neural Networks);


e Surveillance;


f Health communication;


g Sentiment analysis;


h Lexicon-based methods;


i Online health communities


In a further example of novel research, Tangherlini et al. produced a statistical-mechanical network model representing relationships between “actants” (actors) that is used to automatically extract typical narratives and “story fragments” related to vaccination issues, evidencing a narrative framework related to a pronounced distrust of government and medical authority [5].


3.2 Communicable Diseases and Sexually Transmitted Infections

Systems designed to use social media data for pandemic public health surveillance have existed for almost 13 years [80], [81], and approaches that are variously referred to as infodemiology [82], digital disease detection [83], and digital epidemiology [84] are by now well established, particularly for dengue, influenza, and more recently, Ebola. In addition, significant research efforts have centered on the study of STIs, despite some methodological concerns regarding the willingness of users with STIs to disclose their status on social media.

Investigating the changing prevalence of a number of health-related topics, Park et al. [10] observed that Ebola discussions were characterised by concerns about risks and symptoms, while influenza was associated with terms like “CDC” and “H1N1”. Another study focussed on influenza misdiagnoses [33], achieving an F-score of 0.76. Regarding STIs, one study demonstrated statistically significant associations between Twitter data from 2012 and official Centers for Disease Control syphilis prevalence data from 2013 [57], with a related study discovering that the most frequent STIs discussed were intermediate (non-reportable) STIs like genital herpes and HPV, with more serious (reportable) diseases like syphilis and gonorrhoea discussed less frequently [14].
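The F-score reported above combines precision and recall into a single figure. As a brief illustration, the Python sketch below computes the general F-beta score from raw counts of true positives, false positives, and false negatives; with the default beta of 1 it reduces to the harmonic mean of precision and recall. The counts in the example are hypothetical, chosen only so that the result equals 0.76, and are not taken from the reviewed study [33].

```python
def f_score(true_positives, false_positives, false_negatives, beta=1.0):
    """F-beta score from raw counts; beta=1 gives the usual F1 score."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    beta_sq = beta ** 2
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Hypothetical counts: 38 correct detections, 12 false alarms, 12 misses.
f1 = f_score(true_positives=38, false_positives=12, false_negatives=12)
print(round(f1, 2))
```

Because precision and recall are both 0.76 in this example, the F1 score is also 0.76. In practice the two quantities usually differ, and the harmonic mean penalises a large gap between them.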

Of the six papers reviewed (see [Table 4]), four used Twitter data [31], [32], [33], [57], and two used Reddit data [10], [14], while Al-Garadi et al. provided a review that concentrated on Twitter and Weibo, the Chinese language microblog service [32]. Two of the papers reviewed described the use of supervised machine learning methods [31], [32], three papers used unsupervised machine learning methods [10], [14], [32], and one used a lexicon-based approach [57]. Machine learning methods were used to perform a variety of tasks, including surveillance [10], [14], [31], [32], [33], [57], health communication [32], and sentiment analysis [32]. Several studies concentrated on influenza surveillance using English [10], [33] and Japanese [31] Twitter data.

Table 4

Summary of communicable diseases and STI-related papers

| Data Source | SML[a] | UML[b] | CML[c] | MML[d] | Surv[e] | HC[f] | Senti[g] | Lexicon[h] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Reddit | - | [10, 14] | [10, 14] | - | [10, 14] | - | - | - |
| Twitter | [31, 32] | [32] | [31-33] | - | [31-33, 57] | [32] | [32] | [57] |
| Weibo | [32] | [32] | [32] | - | [32] | [32] | [32] | - |

a Supervised machine learning;


b Unsupervised machine learning;


c Classical machine learning;


d Modern machine learning;


e Surveillance;


f Health communication;


g Sentiment analysis;


h Lexicon-based methods



3.3 Cancer

Work on using NLP and text-mining methods to understand issues directly related to cancer (diagnosis, treatment, and management) is less well developed than some of the other areas considered in this review (e.g., mental health and substance abuse). Of the three cancer-related papers reviewed (see [Table 5]), one utilised Twitter data [34], and two utilised data derived from an online health community [68], [69]. All the papers discussed used both classical and modern machine learning methods, with modern machine learning methods performing better than classical machine learning methods, albeit by a narrow margin in the case of Zhang et al.’s work on identifying chemotherapy-related Twitter accounts by account type [34]. Zhang et al. observed that Twitter accounts belonging to individuals focussed on “personal chemotherapy experience and emotions”, whereas professional accounts typically provided a neutral presentation of chemotherapy side effects [34]. Two of the papers were centred on health communication, broadly conceived [68], [69], with one paper focussing on sentiment analysis [34]. Concentrating specifically on the patient experience of breast cancer, one study [68] aimed at characterising how forum topics changed over time depending on the individual’s time since diagnosis and cancer state, and found that diagnosis is the most frequent class in the early stages of cancer treatment, with diagnosis (and treatment) related discussions declining over the course of a user’s cancer journey.

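Table 5

Summary of cancer-related papers

| Data Source | SML[a] | UML[b] | CML[c] | MML[d] | Surv[e] | HC[f] | Senti[g] | Lexicon[h] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Twitter | [34] | [34] | [34] | [34] | - | - | [34] | - |
| OHC[i] | [68, 69] | [68] | [68, 69] | [68, 69] | - | [68, 69] | - | - |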

a Supervised machine learning;


b Unsupervised machine learning;


c Classical machine learning;


d Modern machine learning;


e Surveillance;


f Health communication;


g Sentiment analysis;


h Lexicon;


i Online Health Communities



3.4 Substance Abuse

This section is concerned with reviewing work centred on the use of social media, in conjunction with NLP methods, to address substance abuse research questions, focussing on opioid abuse, tobacco, e-cigarette and marijuana use, and alcohol abuse. Interesting work on drug abuse – particularly new and emerging products – is increasingly evident in the literature. NLP methods are needed to deal with ambiguity and colloquial expressions used on social media (such as “bath salts”, “kitty cat”, or “miaow miaow” for mephedrone [44]).

Of the twenty-two papers discussed in this section, three are focussed on opioid abuse [35, 41, 42], eight on tobacco and marijuana use [6, 12, 13, 40, 43, 45, 46, 49], one on alcohol abuse [36], and one on the street drug, mephedrone [44]. Twitter is the most popular source of data (18 papers) [6, 11, 12, 35-49], with Reddit [11-13], and online health communities [12], [13], both represented. Supervised machine learning (8 papers - all utilising Twitter data) and unsupervised machine learning (11 papers) were both evident in the reviewed papers, with classical machine learning approaches more common than modern neural-network-based approaches (17 and 2 papers, respectively). Two of the papers reviewed utilised a rule-based approach. [Table 6] summarises the reviewed substance abuse-related papers.

Table 6

Summary of substance abuse-related papers

| Data Source | SML[a] | UML[b] | CML[c] | MML[d] | Surv[e] | HC[f] | Senti[g] | Lexicon[h] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Reddit | - | [11-13] | [11-13] | - | [12] | - | - | [13] |
| Twitter | [6, 36, 40, 45-49] | [6, 12, 35, 37, 39, 41, 42, 43, 45] | [6, 12, 35, 36, 38-43, 45-49] | [6, 37] | [11, 12, 35, 36, 38, 39, 42, 44, 47-49] | [43] | [46-48] | [44] |
| OHC[i] | - | [12, 13] | [12, 13] | - | [12] | - | - | [13] |

a Supervised machine learning;


b Unsupervised machine learning;


c Classical machine learning;


d Modern machine learning;


e Surveillance;


f Health communication;


g Sentiment analysis;


h Lexicon;


i Online Health Communities


3.4.1 Opioid Abuse

Opioid abuse is now recognised as one of the leading public health problems in the United States[4], and an important – albeit slightly less pressing – concern in many developed and developing countries. The crisis in the US is due to historical changes in drug prescription policies and practices that have encouraged both the licit and illicit use of highly addictive opioid-based painkillers[5]. Every year in the United States, over 72,000 people die as a direct consequence of using opioids[6], making the need to understand emerging opioid-related behaviours and user trajectories especially pressing. One study concentrated on identifying public reactions to the opioid epidemic by identifying the most popular opioid-related topics tweeted by users [41]. Topics identified included discussions related to the possibility of promoting marijuana as a substitute for opioids, discussions related to the growing opioid market in North America, and discussions related to news reports advocating the use of buprenorphine – a narcotic used to treat opioid addiction – for adolescents experiencing opioid use disorders. Another study [35] aimed at detecting marketing and sale of opioids by illicit online sellers. The authors observed that the frequency of tweets directly related to illegal activity was relatively low when compared with other kinds of opioid mentions. A similar observation was made for tweets promoting the illegal online sale of fentanyl [42]. In this context, unsupervised approaches are of significant value for understanding changes in a rapidly developing online environment.


3.4.2 Tobacco, E-Cigarette, and Marijuana Use and Abuse

Tobacco use is declining in popularity in much of the developed world (the proportion of smokers in the US has declined by over half since 1964 and now stands at 16.8% among adults, and approximately half that among high school students [85]). However, despite this decrease in tobacco use, there has been a dramatic increase – now plateauing – in the use of e-cigarettes since their introduction to developed world markets in around 2007 [86]. This increase has occurred in the context of a lack of consensus regarding both the safety of the product [87] and its potential efficacy as a smoking cessation device [88]. In addition to these shifts in tobacco use, there have also been substantial changes in the regulation of marijuana products, particularly in the US context, and these changes have led – it has been suggested [89] – to an increase in marijuana use [90]. Given these public health concerns, using NLP to investigate tobacco, e-cigarette, and marijuana use, has become an active research area, especially to classify discussions [6, 12, 43, 45, 46] or to determine whether a particular user is above or below 21 years of age [40]. Reported findings included evidence that Twitter users frequently discussed ways in which e-cigarettes can be used in the workplace in a bid to circumvent smoking bans [43], and evidence that hookah was discussed more frequently at the weekend, indicating its use is associated with leisure activities, while reported tobacco use tends to be more consistent across the week [40]. In addition, authors observed that different social media services manifested distinctly different cultures regarding e-cigarette use, e.g., sensory experiences vs. psychological factors associated with quitting [13]. Rule-based approaches were used to identify where people reported using e-cigarettes, with 39% of posts referring to e-cigarette use in the classroom [49]. Other studies aimed at describing strategies for marketing Little Cigars & Cigarillos (LCC) and observed that 83% of identified LCC tweets referred to marijuana, and 29% of LCC tweets referenced memes [45].


3.4.3 Alcohol Abuse

Alcohol abuse was the seventh leading risk-factor worldwide for both death and disability in 2016. In the same year, among males aged 15-49, alcohol was a causal factor in 12% of deaths [91]. One of the reviewed studies [36] yielded the surprising result that – in the US at least – a positive correlation exists between excessive county-level alcohol consumption and higher education, suggesting that highly educated counties drink more, or at least tweet more about their drinking.



3.5 Pharmacovigilance

Pharmacovigilance – i.e. the post-market surveillance of drugs – was an early health-related focus for social media NLP [92], [93] and has remained an important subject of research, with applications including the identification of mentions of Adverse Drug Reactions (ADRs) [51], [55]. One recent study focussed on topics related to Thyroid Hormone Replacement Therapy (THRT), particularly on the identification of side effects [50]. It was discovered that male and female users of THRT had different experiences and concerns regarding side effects, with women primarily concerned about the effect of the drug on personal appearance and men more concerned about potential pain symptoms associated with the drug.

A recent significant development in pharmacovigilance research was the instigation of the SMM4H 2017 shared task. The shared task consisted of three subtasks: automatic identification of ADRs, automatic classification of tweets that explicitly mentioned medication consumption, and normalization of ADR mentions. Important outputs of this effort included a publicly available corpus [51] and language models [55] for future research. In addition to this work on ADR identification and normalization, the identification of semantic relationships – chiefly causal relationships – between drug and symptom mentions has been a focus of research [52], [53]. A key challenge associated with this task is the difficulty involved in distinguishing between drug use as a response to a particular symptom (“I have a horrible headache and just took some ibuprofen”) and the existence of a symptom as a side effect of a drug (“Ever since I started taking Sertraline I’ve felt like crap”). Despite the difficulty of this task, Bollegala et al. achieved a moderately high F-score (0.74) using a skip-gram based method [52].

Six of the pharmacovigilance papers reviewed used Twitter as a data source [51]–[56], while one used an online health community (see [Table 7]). Four of the papers used supervised methods [51]–[54] and five used unsupervised methods [50], [53]–[56], with five using classical machine learning methods [50]–[53], [56] and three using modern machine learning methods [51], [54], [55]. Unsurprisingly, given the topic of pharmacovigilance, surveillance was the main application area.

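Table 7

Summary of pharmacovigilance-related papers

| Data Source | SML[a] | UML[b] | CML[c] | MML[d] | Surv[e] | HC[f] | Senti[g] | Lexicon[h] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Twitter | [51-54] | [53-56] | [51-53, 56] | [51, 54, 55] | [51-54, 56] | - | - | - |
| OHC[i] | - | [50] | [50] | - | - | - | - | - |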

a Supervised machine learning;


b Unsupervised machine learning;


c Classical machine learning;


d Modern machine learning;


e Surveillance;


f Health communication;


g Sentiment analysis;


h Lexicon-based methods;


i Online Health Communities



3.6 Mental Health

Mental health problems are estimated to account for 13% of the global burden of disease, as measured in Disability Adjusted Life Years [95]. Using social media as a resource to understand mental health is a research area that has experienced substantial growth in recent years [96], given the burden of disease associated with mental health problems and the fact that social media provides ready access to first person reports of behaviour, thoughts, and feelings. Reviewed studies covered a range of mental health topics, including predicting depression diagnosis [8], assessing suicide risk [16, 18, 24, 74-76, 98, 99], and developing a better understanding of users’ experiences of eating disorders [15], schizophrenia [59], [61], grief processes among gang-involved youth [58], relaxation [62], stress [63], pathological empathy [67], [72], and negative emotional effects associated with campus-based mass murders [64]. Related to this, a range of metrics have been used to characterise language use associated with specific mental health conditions, with lexical diversity, readability scores, sentence complexity, negation, uncertainty, and degree of repetition, all used during the review period [23, 26, 27, 60]. In novel work focussing on the relationship between clinical guidelines and actual treatments, Zhang et al. [71] created a catalogue of real-world treatments used – as opposed to merely discussed – by parents of children with autistic spectrum disorder, and then automatically identified their frequency of mention in two online autism forums.
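Of the language-use metrics listed above, lexical diversity is perhaps the simplest to compute. One common operationalisation, assumed here for illustration rather than taken from the reviewed papers, is the type-token ratio: the number of distinct word forms divided by the total number of word tokens. The Python sketch below uses a deliberately simple tokeniser, and both example sentences are invented.

```python
import re


def type_token_ratio(text):
    """Distinct word forms divided by total word tokens (0.0 if empty)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Invented examples: a highly repetitive post and a more varied one.
repetitive = type_token_ratio("I can't sleep I can't eat I can't think")
varied = type_token_ratio("Today the weather improved and walking felt easier")
print(round(repetitive, 2), round(varied, 2))
```

The repetitive example has five distinct forms among nine tokens, whereas every token in the second example is distinct. The type-token ratio is sensitive to text length, so comparisons are usually made over samples of similar size.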

With a view to improving how mental health forums are designed, one study applied textual cluster analysis to forums related to the conditions anxiety, depression, and post-traumatic stress disorder (PTSD) [19], showing that – consistent with current thinking regarding the relationship between PTSD and anxiety [97] – anxiety and PTSD forums shared more similarities to each other than to the depression forum. Related to this, another study found that different communities provided different degrees of emotional and informational support [20], with some communities (e.g., depression forums) focussed primarily on emotional support, and other communities (e.g. obsessive compulsive disorder forums) offering a greater proportion of informational support. Furthermore, the same study found that at the user level, the provision of social support was correlated with demonstrated linguistic accommodation, suggesting that those users who were able to “match” the linguistic culture of a particular community were likely to receive a greater volume of social support. Finally, a further study [100] involved the development of a classifier capable of identifying respectful uses of a mental-health-related term (e.g. “I’m fuming. How dare a TV show portray folks suffering from mental health issues so unfairly”) and less-respectful usage.

Of the thirty-one mental health-related papers reviewed (see [Table 8]), thirteen involved the use of Reddit data [15-27], ten used Twitter data [18, 24, 58-65], one used Instagram [18], three used Facebook [8, 18, 67], six used OHC data [70-75], and one used data derived from Weibo [76], with twenty-two of the papers utilising supervised machine learning methods [8, 16, 18, 20-22, 24, 25, 58-62, 65, 67, 70-76], and twelve papers utilising unsupervised machine learning [8, 15, 18-22, 27, 59, 60, 70, 72]. The majority of the papers reported on the use of classical machine learning approaches [8, 15, 16, 18-20, 22, 24, 25, 27, 58-62, 65, 67, 71, 73-76], with a minority using modern machine learning methods [18, 21, 22, 67, 70, 72]. Four of the mental health papers reviewed utilised primarily lexicon-based methods [17, 23, 63, 64].

Table 8

Summary of mental health-related papers

| Datasource | SML[a] | UML[b] | CML[c] | MML[d] | Surv[e] | HC[f] | Senti[g] | Lexicon[h] |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Reddit | [16, 18, 20-22, 24, 25] | [15, 18-22, 27] | [15, 16, 18-20, 22, 24, 25, 27] | [21, 22] | - | - | [26] | [17, 23] |
| Twitter | [18, 58-62, 65] | [18, 59, 60] | [58-62, 65] | [18] | - | - | [24, 63, 64] | [63, 64] |
| Instagram | [18] | [18] | - | [18] | - | - | - | - |
| Facebook | [8, 18, 67] | [8, 18] | [8, 67] | [18, 67] | - | - | - | - |
| OHC[i] | [70-75] | [70, 72] | [71, 73-75] | [70, 72] | - | - | - | - |
| Weibo | [76] | - | [76] | - | - | - | - | - |

a Supervised machine learning; b Unsupervised machine learning; c Classical machine learning; d Modern machine learning; e Surveillance; f Health communication; g Sentiment analysis; h Lexicon-based methods; i Online Health Communities



3.7 Ethical Issues

Two types of ethics-related papers are discussed in this section: those that are focussed on empirical ethics (i.e. the empirical investigation of ethical beliefs and practices) [101, 102], and those that are focussed on ethical guideline development (i.e. the generation of theoretical frameworks and practical guidelines for conducting health-related NLP research with social media) [9, 103, 104]. Reviewed studies highlighted the need for both transparency in the development of algorithms and an ethical framework to guide the appropriate use of social media for computational public health research.

Focussing specifically on research ethics from the perspective of social media users, one study [102] pointed to a generally favourable view of the use of computational methods for public health research among social media users, provided that data was highly aggregated, and the goal of the work was of significant public health value (e.g. opioid abuse surveillance was acceptable in a public health context, but not when used for employment screening). However, among some users, concerns remained regarding the robustness of both the data and the research methods, because the data was not representative of the general population and was subject to impression management (i.e. many users did not tweet about stigmatising health problems [105]). Related to this work, a systematic review of attitudes towards the ethics of computational social media research [106] found a range of different views on appropriate research ethics, depending on the particular research topic discussed. This suggests that a “blanket” approach to research ethics is currently not appropriate, and that ethical deliberations ought instead to take into account the particular context of the research under review.

As noted by Vayena et al. [104], the research regulation infrastructure in most jurisdictions was developed in the period prior to social media, and hence is not well-equipped to manage the review of computational social media research. This point is reinforced by a qualitative study conducted with Research Ethics Committee (Institutional Review Board) members in the United Kingdom. This study outlines the challenges faced by ethics committees in the application of existing research ethics regulation to computational work and emphasises the need to protect research participants (i.e. social media users), even in the context of research using publicly available data [101].

Finally, practical guidelines have recently been developed to guide NLP research using social media data [103]. Eight principles are outlined, including the stipulations that, as most social media-based NLP research can be defined as human subjects research [107], ethical approval or exemption ought to be gained from an Institutional Review Board or Research Ethics Committee; that data ought to be de-identified for use in publications and presentations; and that caution ought to be exercised in linking data.
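As an illustration of the de-identification principle, the following minimal Python sketch masks explicit identifiers in a post before it is quoted. It is not the procedure described in [103]; the function name `mask_identifiers`, the regular expressions, and the placeholder tokens are all assumptions made for this example. Masking explicit identifiers does not by itself guarantee de-identification, so any output would still need human review.

```python
import re

# Hypothetical patterns, chosen for illustration only.
URL_RE = re.compile(r"https?://\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
HANDLE_RE = re.compile(r"(?<!\w)@\w+")


def mask_identifiers(text: str) -> str:
    """Replace URLs, e-mail addresses, and @handles with placeholder tokens."""
    text = URL_RE.sub("<URL>", text)      # links can point back to a profile
    text = EMAIL_RE.sub("<EMAIL>", text)  # mask e-mails before @handles
    text = HANDLE_RE.sub("<USER>", text)  # user mentions
    return text


if __name__ == "__main__":
    post = "@jane_doe see https://example.com/x or mail jane@example.org"
    print(mask_identifiers(post))
```

E-mail addresses are masked before handles so that the domain part of an address is not partially rewritten as a user mention.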

In recent years there has been a move away from the commonly held view that in social media research “anything goes”, towards a more sophisticated perspective that acknowledges both the existence and importance of the ethical and regulatory issues involved in the application of NLP to social media for health research. Further, the provision of ethical guidelines developed specifically for NLP researchers (as described above [103]) is a new and welcome development in the period since 2016.



4 Discussion and Conclusion

In this survey, we have presented recent advances in the application of NLP to social media to address public health research questions. We observed a substantial growth in the area of mental health and substance abuse research, and a continuing sustained interest in the use of social media for studying communicable diseases (particularly in the area of vaccine hesitancy). The widespread use of lexical resources developed in the psychology research communities – specifically, LIWC – is also notable, as is the relatively low frequency of “modern” (as opposed to “classical”) machine learning approaches.

While predicting future trends is not a straightforward task, we tentatively suggest four directions in which current work is evolving. First, linking data – with appropriate consent – from the EHR and social media, both in the context of public health research and clinical care. Examples of this type of work in the research context already exist (e.g. [8]), and will likely be a focus of considerable research effort over the next few years.

Second, further utilisation of social media in public health surveillance. Currently, while advances have been made in research using NLP and social media, substantial barriers still exist to implementing social media health surveillance in the context of public health practice. These barriers include costs (public health agencies are frequently underfunded), limited expertise in NLP, and difficulties in integrating social media analysis with existing surveillance methods and pipelines. However, even given these challenges, considerable strides have been made, particularly in the area of pharmacovigilance (e.g. the Food & Drug Administration Center for Drug Evaluation and Research).

Third, much social media research relies on the identification of appropriate keywords to construct a data sample suitable for the research question at hand. This keyword selection process has typically relied on intuition. However, recently there has been a move towards a more data-driven means of iteratively identifying and evaluating keywords (and their associated synonyms), with word embeddings and other empirical synonym discovery methods (e.g. [108]). This shift towards a more principled method of selecting keywords for data sampling is to be welcomed.
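The data-driven keyword selection described above can be sketched as a nearest-neighbour search in an embedding space. The example below is a minimal illustration, not the method of [108]: the three-dimensional vectors are invented toy values, and the function `expand_keywords` and its `threshold` parameter are hypothetical. In practice the embeddings would be trained on (or loaded for) the target social media corpus, and every candidate would be reviewed by a researcher before being added to the keyword list.

```python
import math

# Toy vectors invented for illustration; not derived from any real corpus.
EMBEDDINGS = {
    "opioid": [0.9, 0.1, 0.0],
    "oxycodone": [0.8, 0.2, 0.1],
    "percocet": [0.85, 0.15, 0.05],
    "heroin": [0.7, 0.3, 0.0],
    "coffee": [0.1, 0.9, 0.2],
    "football": [0.0, 0.1, 0.95],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_keywords(seeds, embeddings, threshold=0.9):
    """Return (word, similarity) pairs for non-seed words whose highest
    similarity to any seed meets the threshold, most similar first."""
    known_seeds = [s for s in seeds if s in embeddings]
    if not known_seeds:
        return []
    candidates = []
    for word, vec in embeddings.items():
        if word in seeds:
            continue
        best = max(cosine(vec, embeddings[s]) for s in known_seeds)
        if best >= threshold:
            candidates.append((word, best))
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for word, score in expand_keywords({"opioid"}, EMBEDDINGS):
        print(f"{word}\t{score:.3f}")
```

Accepted candidates can be added to the seed set and the search repeated, which mirrors the iterative identify-and-evaluate loop; the threshold trades recall of new terms against the amount of manual review required.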

Fourth, while we believe that Twitter will remain a valuable (and popular) data source for NLP research, we suspect that Reddit will become increasingly popular as a research resource, partly due to its “research-friendly” terms and conditions and its increasing user base. Related to this, the dynamism of the social media ecosystem should not be underestimated, with new services (e.g. TikTok) attracting users – especially new adolescent users – away from existing services. Given this rapidly changing social media environment, there is little reason to believe that currently popular social media platforms will maintain their current level of popularity.



Acknowledgements

This work was partially supported by the National Institute on Drug Abuse of the United States National Institutes of Health under award number R21DA043775. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

1 For example, the Social Media Mining for Health Applications (SMM4H) Workshop or the Computational Linguistics and Clinical Psychology (CLPsych) Workshop


2 Note that the terms “classical” and “modern” machine learning are, from a historical perspective, misnomers, given the roots of neural network theory in the mid-twentieth century [78].


3 https://www.who.int/immunization/programmes_systems/vaccine_hesitancy/en/


4 https://www.cdc.gov/drugoverdose/epidemic/index.html


5 https://www.drugabuse.gov/drugs-abuse/opioids/opioid-overdose-crisis


6 https://www.cdc.gov/drugoverdose/data/statedeaths.html



Fig. 1 Social media data sources. Note that this list is not exhaustive.