Summary
Objectives: Detecting indications of public health threats as early as possible is crucial to prevent
harm to the population. However, many disease surveillance strategies rely upon
data whose collection requires explicit reporting (data transmitted from hospitals,
laboratories or physicians). Collecting these reports takes time, which delays the
response. Moreover, contextual information on individual cases is often lost in the collection
process. This paper describes a system that tries to address these limitations by
processing social media to identify information on public health threats. The
primary objective is to study the usefulness of the approach for supporting the monitoring
of a population's health status.
Methods: The developed system works in three main steps: Data from Twitter, blogs, and forums
as well as from TV and radio channels are continuously collected and filtered by means
of keyword lists. Sentences of relevant texts are classified as relevant or irrelevant
using a binary classifier based on support vector machines. By means of statistical
methods known from biosurveillance, the relevant sentences are further analyzed and
signals are generated automatically when unexpected behavior is detected. A subset of the
generated signals is selected for presentation to a user by matching them against
user queries or profiles. In a set of evaluation experiments, public health experts
assessed the generated signals with respect to correctness and relevance. In particular,
it was assessed how many relevant and irrelevant signals are generated during a specific
time period.
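
The processing steps described above can be sketched roughly as follows. This is a minimal illustration, not the system's actual implementation: the keyword list, the function names, and the `is_relevant` callable (standing in for the trained support vector machine) are hypothetical, and the mean-plus-k-standard-deviations rule is only one simple assumption about what a biosurveillance-style detection method could look like.

```python
from statistics import mean, stdev
from typing import Callable, Iterable, List

# Hypothetical keyword list; the system's real lists are not given here.
KEYWORDS = ("measles", "fever", "outbreak")


def keyword_filter(texts: Iterable[str], keywords=KEYWORDS) -> List[str]:
    """Keep texts that contain at least one keyword (case-insensitive)."""
    return [t for t in texts if any(k in t.lower() for k in keywords)]


def classify_relevant(sentences: Iterable[str],
                      is_relevant: Callable[[str], bool]) -> List[str]:
    """Keep sentences that a binary classifier marks as relevant.

    `is_relevant` is a placeholder for the trained SVM classifier.
    """
    return [s for s in sentences if is_relevant(s)]


def detect_signals(daily_counts: List[int],
                   baseline_days: int = 7,
                   k: float = 3.0) -> List[int]:
    """Return indices of days whose count exceeds the baseline threshold.

    Assumed rule: threshold = mean + k * sample standard deviation of the
    preceding `baseline_days` counts (baseline_days must be at least 2).
    """
    signals = []
    for day in range(baseline_days, len(daily_counts)):
        window = daily_counts[day - baseline_days:day]
        threshold = mean(window) + k * stdev(window)
        if daily_counts[day] > threshold:
            signals.append(day)
    return signals
```

In such a sketch, the daily counts of relevant sentences per keyword would be fed to `detect_signals`, and the resulting signals would then be matched against user queries or profiles before presentation.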
Results: The experiments show that the system provides information on health events identified
in social media. Signals are mainly generated from Twitter messages posted by news
agencies. Personal tweets, i.e. tweets from persons observing some symptoms, play
only a minor role in signal generation because the volume of relevant messages is limited.
Relevant signals referring to real world outbreaks were generated by the system and
monitored by epidemiologists, for example during the European football championship.
However, the number of relevant signals among the generated signals is still very small: the
different experiments yielded a proportion between 5% and 20% of signals regarded as
“relevant” by the users. Vaccination or education campaigns communicated via Twitter,
as well as the use of medical terms in contexts other than outbreak reporting, led
to the generation of irrelevant signals.
Conclusions: The aggregation of information into signals results in a reduction of monitoring
effort compared to other existing systems. Contrary to expectations, only a few messages
are of a personal nature, reporting on personal symptoms. Instead, media reports are
distributed over social media channels. Despite the high percentage of irrelevant
signals generated by the system, the users reported that the effort in monitoring
aggregated information in the form of signals is less demanding than monitoring huge social-media
data streams manually. It remains for the future to develop strategies for reducing
false alarms.
Keywords
Text mining - Web science - public health - population surveillance - epidemic intelligence
- Medicine 2.0