Keywords
machine learning - classification - psychiatry - patient injury
Introduction
Adverse events are common in health care. Large international reviews have estimated
that around 10% of hospital patients experience an adverse event, and half of the
adverse events are preventable.[1] Patient harm has been estimated to be the 14th leading cause of disease burden globally
and up to 15% of total hospital expenditure in OECD (Organization for Economic Co-operation
and Development) countries results from adverse events.[2] A better understanding of the nature of adverse events in different health care
settings is needed to improve the quality and safety of care.[3]
In Finland, all health care providers are obliged to have patient insurance. Patients
can claim compensation for injuries incurred in connection with treatment by filing
a notice of injury. The notice must be filed within 3 years of the date on which the patient
became aware of the injury. All notices are handled by the Patient Insurance Centre (PIC)
in accordance with the legislation. The PIC obtains all necessary clarifications, including
patient documents, from the relevant health care providers. Experienced medical experts
evaluate the cases, and legal experts are consulted when necessary. The PIC maintains
an extensive patient injury database, which has been widely used for medical research.
Research has concentrated on surgical specialties such as orthopaedics,[4][5] otorhinolaryngology,[6] and dental care.[7][8]
A recent article[9] drew attention to psychiatric patient injuries, which had not been investigated
earlier.
Psychiatric treatment does not always go as planned, but compared with many other specialties,
claims for patient injury appear to be less common in psychiatry.[10][11][12]
Common claims for patient injury in psychiatry include misdiagnosis and delayed
diagnosis, unprevented suicide, involuntary treatment deemed wrongful, and medication
deemed harmful.[9] However, there is so far little data on the likelihood of certain types of injuries
in psychiatric care and no international comparisons despite existing large coverage
statistics in many countries. An accurate classification of individual cases according
to the type of injury helps better understand the types of injuries and their distributions
in psychiatric care. This type of classification could further help to establish a
monitoring system detecting trends in patient injuries with a goal of improving patient
safety and preventing adverse outcomes in psychiatric treatment. To the best of our
knowledge, this is the first study applying machine learning methods to the data associated
with patient compensation claims.
Currently, the statistical data from patient injury claims and compensation decisions
made in Finland include information such as the nature of the disease treated, medical
specialty, and event descriptions in a free-text form, and there is no specific coding
system referring to the type of the injury or contents of the treatment. Classifying
such data requires laborious and time-consuming manual work. Machine learning algorithms
can classify past and future data efficiently. The application of machine learning
in psychiatry has already been studied for the prediction of treatment,[13][14] prognosis,[15] and diagnosis.[16]
This study aimed to develop and test an accurate machine learning algorithm, which
could not only help in a classification process but also potentially improve treatment
outcomes in the future.
The current study addresses two problems using psychiatric data: the classification
of data associated with compensation claim evaluations for patient injuries into six
predefined categories and the binary classification of compensation claim decisions
into two classes (accepted or declined claim). The original data contained 328 compensation
claims and their medical evaluations written by specialists in the specialty of psychiatry.
The data used for machine learning originated from specialists' evaluations, including
argumentation to support the decisions. In addition, other information was available
for specialists such as an applicant's age, sex, and claim decision (accepted or declined,
i.e., positive or negative).
Methods
The data for the study were collected from the claims register of the PIC, which approved
the use of the data on May 7, 2020. The evaluations were made for all psychiatric patient
injury claim decisions between 2012 and 2016, and the corresponding specialists' evaluations
formed the basis of the original data. The first preprocessing task was a light cleaning
of the data, in which some cases were removed because they contained too few psychiatric
or other medical phrases. Cases with 3 to 15 phrases were included in the final data.
Some phrases could be split into parts, such as “falling serious concussion” into “falling”
and “serious concussion” (all texts were originally written in Finnish; the
phrases quoted here are translations). Some phrases were quite similar, for example,
“clinical research and treatment procedure” and “clinical research or treatment procedure.”
Three investigators (authors J.N. and O.K., and J.V.; see Acknowledgments) considered
all the complicated phrases and categorized them into six classes. The categorization
into classes according to phrases was based on the first 50 cases, which two investigators
(J.N. and J.V.) classified independently. The inter-rater reliability for these 50
cases was 100%. The hypothesis of six classes was based on one investigator's (O.K.)
clinical experience with an earlier sample of approximately 80 cases with compensation
claims. Information on the applicant's illness, treatment
descriptions, and injury details was used as indicative phrases. After this first preprocessing task, 308
cases remained in the dataset of the patient compensation claim evaluations.
As the second preprocessing task, all psychiatric or neurological terms or phrases
were extracted from the evaluation documents. The phrases chosen from the documents contained,
for example, diagnoses, symptoms, or otherwise meaningful issues such as “inappropriate
medical treatment,” “appropriate care during hospitalization,” “anxiety,” and “medication
discontinuation.” They were categorized into phrase groups such as “nursing,” “hospital care,” “depression,”
or “drugs and medication (not psychosis).” The phrases were first divided into different groups,
and groups of closely related phrases were later combined. In this way, all phrases were
grouped.
Altogether, 1,591 phrases were manually categorized into 35 phrase groups. These groups
are shown in [Table 1]. As an example, the phrase group “hospital care” is described in [Table 2], where some words, for example “hospitalization,” appear more than once because
of the declension of Finnish nouns: the Finnish term “osastohoito” (literally “ward care”)
and its genitive “osastohoidon” were both translated as “hospitalization.” Also,
the synonyms “osastohoito” and “sairaalahoito” (literally “hospital care”) were both translated
as “hospitalization.”
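The grouping of inflected forms and synonyms described above can be illustrated with a small lookup table. The following is a minimal sketch, not the procedure actually used: the dictionary `PHRASE_TO_GROUP` and the function `count_phrase_groups` are hypothetical names, only the three Finnish terms come from the text, and the English entry is an assumed example. The real mapping was built manually from 1,591 phrases.

```python
from collections import Counter

# Hypothetical excerpt of a phrase-to-group lookup table.
PHRASE_TO_GROUP = {
    "osastohoito": "Hospitalization",
    "osastohoidon": "Hospitalization",    # genitive of "osastohoito"
    "sairaalahoito": "Hospitalization",   # synonym of "osastohoito"
    "anxiety": "Anxiety, anxiousness",    # assumed example entry
}

def count_phrase_groups(phrases):
    """Count how many phrases of one document fall into each phrase group."""
    counts = Counter()
    for phrase in phrases:
        group = PHRASE_TO_GROUP.get(phrase.lower())
        if group is not None:  # phrases missing from the table are skipped
            counts[group] += 1
    return counts

document = ["Osastohoito", "osastohoidon", "anxiety"]
print(count_phrase_groups(document))
```

Different surface forms of the same concept end up in one counter, which is why a word such as “hospitalization” can legitimately appear several times within a single phrase group.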
Table 1
Numbers of phrases in phrase groups (translated from Finnish) as classification attributes
Phrase group | Category | Number of phrases
1 | Patient's demeanor or state | 200
2 | Psychosis, delusions | 12
3 | ADHD and other neurological diseases | 52
4 | The patient's behavior | 39
5 | Interaction in a treatment setting | 22
6 | Brain tumors and other organic neurological diseases and symptoms | 23
7 | Intoxicants | 22
8 | Bipolar disorder | 19
9 | Other organic diseases, symptoms | 17
10 | Electroconvulsive therapy | 154
11 | Neuroleptics and neuroleptic treatment | 20
12 | Suicide | 38
13 | Hospitalization | 40
14 | Suicidality | 44
15 | Therapy | 16
16 | Imaging | 21
17 | Monitoring | 9
18 | Tests and examination | 21
19 | Tests and treatment together | 31
20 | Diagnostics | 43
21 | Medicines and medication (not psychosis) | 64
22 | Other psychiatric diagnoses and symptoms | 216
23 | Depression | 54
24 | Death, decease | 26
25 | Anxiety, anxiousness | 18
26 | Treatment | 12
27 | Involuntary | 156
28 | Patient harm | 53
29 | Procedure | 23
30 | Adverse effects | 14
31 | Accident | 25
32 | Medicine in general | 20
33 | Other, unclassified | 26
34 | Compensation, damages | 30
35 | Otherwise related to patient treatment | 11
Abbreviation: ADHD, attention deficit hyperactivity disorder.
Table 2
Phrase group “hospital care” containing 40 phrases
Phrase | Phrase | Phrase
A short hospital observation period | Acting like this would not have completely avoided hospitalization, but the duration would have been shorter | Acting like this would not have prevented hospitalization
After being discharged from the hospital | After hospitalization | Appropriate care during hospitalization
Appropriate medical treatment | Being left untreated at the psychiatric ward | Dispatchment to the hospital
(1) During hospitalization | (2) During hospitalization | Felt unsafe at the hospital ward
(1) Hospitalization | (2) Hospitalization | (3) Hospitalization
Hospitalization at the psych. ward | Hospitalization at the psychiatric ward | Hospitalization period
Inpatient stay and entitled to compensation | In hospital care | In respite care
In the acute psychiatric ward | In the hospital | In the rehabilitation ward
Inpatient stay to maintain general condition | More inpatient stays | On-call hospital care
Psychiatric hospitalization | Psychiatric hospitalization for depression | Psychiatric hospitalization was justified
Psychiatric inpatient stay | Referral to psychiatric hospitalization was justified | Several inpatient stays
(1) Treatment at a psychiatric ward | (2) Treatment at a psychiatric ward | Was hospitalized
Was immediately taken to crisis therapy period | Was not admitted to the hospital | Was not given appropriate treatment for shortness of breath during hospitalization
When alone in a hospital room | |
Finally, each attribute was normalized to the interval [0, 1]: the minimum of the attribute was first subtracted from
each of its values, and the result was then divided by the difference between the maximum and the minimum
of that attribute. This was done attribute by attribute.
This was particularly important for classification with the k-nearest neighbor searching method, which depends on distances between cases. Here, an attribute is the same as a phrase group.
The attribute value of a document equals the number of phrases of the corresponding
phrase group present in the document.
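The normalization described above is ordinary min-max scaling. A minimal sketch in plain Python follows; the function name `min_max_normalize` and the toy counts are illustrative assumptions, and mapping a constant attribute to 0.0 is also an assumption, since the text does not say how such attributes were handled.

```python
def min_max_normalize(rows):
    """Scale every attribute (column) of `rows` to the interval [0, 1]."""
    n_attrs = len(rows[0])
    mins = [min(row[j] for row in rows) for j in range(n_attrs)]
    maxs = [max(row[j] for row in rows) for j in range(n_attrs)]
    scaled = []
    for row in rows:
        scaled.append([
            # A constant attribute has max == min; it is mapped to 0.0 here
            # to avoid division by zero (an assumption, not from the text).
            (row[j] - mins[j]) / (maxs[j] - mins[j]) if maxs[j] > mins[j] else 0.0
            for j in range(n_attrs)
        ])
    return scaled

# Toy phrase-group counts: three documents, three attributes.
counts = [[0, 2, 5],
          [1, 2, 3],
          [4, 2, 1]]
print(min_max_normalize(counts))
# [[0.0, 0.0, 1.0], [0.25, 0.0, 0.5], [1.0, 0.0, 0.0]]
```

Without this step, a phrase group with large counts would dominate any distance computed between two documents.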
Since supervised machine learning methods were applied, all cases
were first manually divided into six classes. The classes were formed according
to the types or contents of medical or otherwise relevant phrases found in the psychiatric
evaluation documents. The six classes are characterized in [Table 3].
Table 3
Distribution of the classes
Class | Description | Number of cases
1 | Psychosis, involuntary treatment; care or medication deemed unwarranted or harmful in the complaint | 84
2 | A complaint about a suicide attempt or completed suicide; care is deemed to be insufficient or faulty | 38
3 | A complaint about diagnostic error or a prolonged diagnostic process | 40
4 | Harm due to medication or another form of biological treatment, or incorrect medication (not related to psychosis) | 87
5 | Harm due to some other aspect of treatment, e.g., therapy, problems in communication | 32
6 | Incidents during hospitalization, e.g., falling down, errors in administering medication | 27
For binary classification, data cases were distributed into two classes: accepted
(1 or positive) or declined (0 or negative) decisions of compensation claims. There
were 36 positive and 272 negative cases.
Since the number of cases (308) was small in the sense of machine learning, and the
smallest class consisted of only 27 cases, K-fold cross-validation with K = 5 and leave-one-out (LOO) were applied to divide the cases into training and
test sets for constructing models. Several classification methods were used:
the k-nearest neighbor searching method with different distance or similarity functions
and k-values, linear and quadratic discriminant analysis, Naïve Bayes,[17][18][19] and random forests.[20]
Random forests were run with 10 to 100 trees; numbers of trees
above 100 did not improve the results. In the following, the results produced by 10, 30,
and 100 trees are given. For k-nearest neighbor searching (k-NN), values of k from 3 to 25 were computed using LOO only.
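The two validation schemes differ only in the number of folds, because LOO is K-fold cross-validation with K equal to the number of cases. The sketch below shows this with a plain Euclidean k-NN classifier on toy data. All function names and the data are illustrative assumptions, and the folds are formed without shuffling, which may differ from the procedure actually used.

```python
import math
from collections import Counter

def k_fold_indices(n, k):
    """Split indices 0..n-1 into k folds of near-equal size (no shuffling)."""
    return [list(range(i, n, k)) for i in range(k)]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, x, k, dist=euclidean):
    """Majority vote among the k training cases nearest to x."""
    order = sorted(range(len(train_X)), key=lambda i: dist(train_X[i], x))
    return Counter(train_y[i] for i in order[:k]).most_common(1)[0][0]

def cross_validate(X, y, folds, k_neighbors):
    """Accuracy when each fold in turn serves as the test set."""
    correct = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train = [i for i in range(len(X)) if i not in held_out]
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        for i in test_idx:
            correct += knn_predict(train_X, train_y, X[i], k_neighbors) == y[i]
    return correct / len(X)

# Toy data: two well-separated classes of three cases each.
X = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1], [0.9, 1.0], [1.0, 0.9], [0.8, 1.0]]
y = [1, 1, 1, 2, 2, 2]

loo_folds = k_fold_indices(len(X), len(X))  # LOO: one case per fold
five_folds = k_fold_indices(len(X), 5)      # K-fold with K = 5
print(cross_validate(X, y, loo_folds, k_neighbors=3))
print(cross_validate(X, y, five_folds, k_neighbors=3))
```

Passing a different `dist` function to `knn_predict` would correspond to trying the other distance or similarity measures mentioned above.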
We chose these machine learning methods because they are appropriate for small datasets
such as the present one, with only 308 cases but as many as six classes. More complicated classification
algorithms, e.g., neural networks, could require more data to build good
models. The chosen methods follow different principles: random forests, nearest neighbor
searching with various distance measures, Naïve Bayes based on probabilities, and
discriminant analysis. We did not include single decision trees, since random forests,
an ensemble method built on sets of several decision trees, typically perform better.
Results
The classification accuracies given by the listed methods are presented in [Table 4], where each k-NN result is shown with the k-value that gave the best result for that k-NN method. The best results were given by random forests with 100 decision trees.
Thus, only their results are given in the form of a confusion matrix,
which is presented in [Table 5]. Next, the SMOTE algorithm[21] was applied to balance the classes by generating artificial cases for all classes
other than Class 4, the largest class with 87 cases. SMOTE first searches for a
sufficient number of nearest neighbors of the original
cases in the classes other than the majority class. It then generates each artificial case
randomly on the line between an original case and one of its nearest neighbors. For example, the smallest class of
27 cases was extended with 60 artificial cases. Thereafter,
all classes consisted of 87 cases. This improved the classification accuracy of random
forests with 100 trees (LOO) to 88%. This modeling increased the true positive
rates of Class 2 to 93%, Class 3 to 92%, Class 5 to 91%, and Class 6 to 89%, but decreased
those of Class 1 to 85% and Class 4 to 76%. Compared with [Table 5], the improved results concerned the classes that were originally small, whereas the
slightly worsened results concerned the two largest classes.
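The interpolation step of SMOTE can be sketched in a few lines. This is a simplified variant for illustration only: it draws the base case at random instead of iterating over all minority cases, it assumes the class has at least two cases, and the function name `smote`, the parameter `k`, and the random toy data are assumptions rather than details taken from the study.

```python
import random

def smote(minority, n_new, k=5, rng=None):
    """Generate n_new synthetic cases for one minority class."""
    if rng is None:
        rng = random.Random(0)

    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` within the same class, itself excluded
        neighbours = sorted((c for c in minority if c is not base),
                            key=lambda c: sqdist(base, c))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # random position on the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Toy minority class of 27 cases, balanced up to a majority size of 87.
rng = random.Random(42)
minority = [[rng.random(), rng.random()] for _ in range(27)]
new_cases = smote(minority, 87 - len(minority), rng=rng)
print(len(minority) + len(new_cases))  # 87
```

Because every synthetic case lies on a segment between two original cases of the same class, the new cases stay inside the region already occupied by that class.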
Table 4
Classification accuracies in decreasing order given by the classifiers built with leave-one-out (LOO) and K-fold cross-validation with K equal to 5
Method | Classification accuracy %
Random forests, LOO, 100 trees | 77
Random forests, K = 5, 100 trees | 76
Random forests, LOO, 30 trees | 74
Random forests, K = 5, 30 trees | 74
Random forests, LOO, 10 trees | 73
Random forests, K = 5, 10 trees | 72
Linear discriminant analysis, LOO | 71
Spearman k-NN, k = 9, LOO | 71
Cosine k-NN, k = 7, LOO | 71
Correlation k-NN, k = 7, LOO | 69
Linear discriminant analysis, K = 5 | 69
Jaccard k-NN, k = 7, LOO | 69
Chi-squared distance k-NN, k = 7, LOO | 66
Mahalanobis k-NN, k = 25, LOO | 66
Hamming k-NN, k = 7, LOO | 65
Manhattan (city block) k-NN, k = 25, LOO | 63
Euclidean k-NN, k = 5, LOO | 63
Minkowski distance k-NN, dimension 3, k = 5, LOO | 63
Minkowski distance k-NN, dimension 35, k = 5, LOO | 62
Quadratic discriminant analysis, LOO | 56
Naïve Bayes, K = 5 | 51
Quadratic discriminant analysis, K = 5 | 50
Naïve Bayes, LOO | 49
Chebyshev k-NN, k = 3, LOO | 46
Table 5
Results of random forest with 100 trees for the original data; the numbers of correctly classified cases are on the diagonal. Rows are correct classes and columns are predicted classes.
Class | 1 | 2 | 3 | 4 | 5 | 6 | True % | False %
1 | 75 | 1 | 5 | 2 | 1 | 0 | 89 | 11
2 | 3 | 31 | 1 | 2 | 0 | 1 | 82 | 18
3 | 8 | 0 | 21 | 6 | 3 | 2 | 53 | 47
4 | 4 | 3 | 4 | 67 | 7 | 2 | 77 | 23
5 | 0 | 1 | 1 | 9 | 20 | 1 | 63 | 37
6 | 7 | 1 | 1 | 3 | 5 | 10 | 37 | 63
True % | 77 | 84 | 64 | 75 | 56 | 63 | |
False % | 23 | 16 | 36 | 25 | 44 | 37 | |
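The True % figures in Table 5 can be recomputed from the counts alone. The sketch below assumes that rows are correct classes and columns are predicted classes, which is consistent with the row sums matching the class sizes in Table 3, and that percentages are rounded half up. The function names are illustrative.

```python
# Counts from Table 5 (rows: correct class 1-6, columns: predicted class 1-6).
TABLE_5 = [
    [75, 1, 5, 2, 1, 0],
    [3, 31, 1, 2, 0, 1],
    [8, 0, 21, 6, 3, 2],
    [4, 3, 4, 67, 7, 2],
    [0, 1, 1, 9, 20, 1],
    [7, 1, 1, 3, 5, 10],
]

def pct(part, whole):
    """Percentage rounded half up, using integer arithmetic only."""
    return (200 * part + whole) // (2 * whole)

def row_true_pct(cm):
    """Share of each correct class that was predicted correctly (row-wise)."""
    return [pct(cm[i][i], sum(cm[i])) for i in range(len(cm))]

def col_true_pct(cm):
    """Share of each predicted class that was correct (column-wise)."""
    return [pct(cm[j][j], sum(row[j] for row in cm)) for j in range(len(cm))]

print([sum(row) for row in TABLE_5])  # [84, 38, 40, 87, 32, 27]
print(row_true_pct(TABLE_5))          # [89, 82, 53, 77, 63, 37]
print(col_true_pct(TABLE_5))          # [77, 84, 64, 75, 56, 63]
```

The row-wise values correspond to the True % column of the table and the column-wise values to its True % row; each False % entry is 100 minus the matching True % entry.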
Finally, the binary classification of accepted versus declined compensation claims
was run. The class distribution was very imbalanced, as the great majority (272 of
308) of the cases had been declined (Class 0). Because random forests with 100 trees
(LOO) gave the best result in [Table 4], we also used random forests for the classification of the decisions of compensation
claims. The class-specific results of this binary classification are presented in [Table 6]. Random forests lost almost all cases of the minority
class, but those of the majority class were classified almost fully correctly.
Modelling with nearest neighbor searching gave rather similar results. Evidently,
the very imbalanced class distribution meant that the minority class could
not be separated from the majority class. Thus, the SMOTE algorithm was also run for this
classification, increasing the size of Class 1 to 272 cases. After the minority Class 1
had been balanced, its cases were separated much better from those of Class 0.
For Classes 0 and 1, 88% and 89% of cases, respectively, were classified correctly in the extended dataset.
Nonetheless, the share of correctly classified cases of the original majority
class was lower than before balancing, which is rather common for binary classification
where two classes are “opposing” each other.
Table 6
Results of random forest with 100 trees for the binary classification of the original data
Correct class | Predicted class 0 | Predicted class 1 | True % | False %
0 | 270 | 2 | 99 | 1
1 | 34 | 2 | 6 | 94
True % | 89 | 50 | |
False % | 11 | 50 | |
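The counts in Table 6 also show why overall accuracy says little when one class dominates. The sketch below uses only the four counts from the table; the function names are illustrative, and the "always declined" baseline is an added comparison that the study itself does not report.

```python
# Counts from Table 6: rows are correct classes (0 = declined, 1 = accepted),
# columns are predicted classes.
TABLE_6 = [[270, 2],
           [34, 2]]

def accuracy(cm):
    """Share of all cases that lie on the diagonal."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total

def class_true_rates(cm):
    """Share of each correct class that was predicted correctly."""
    return [cm[i][i] / sum(cm[i]) for i in range(len(cm))]

def majority_baseline(cm):
    """Accuracy obtained by always predicting the largest correct class."""
    sizes = [sum(row) for row in cm]
    return max(sizes) / sum(sizes)

print(round(accuracy(TABLE_6), 3))                       # 0.883
print(round(majority_baseline(TABLE_6), 3))              # 0.883
print([round(r, 3) for r in class_true_rates(TABLE_6)])  # [0.993, 0.056]
```

With these counts the overall accuracy equals that of labelling every claim as declined, while only 2 of the 36 accepted claims are recognized, which is why class-specific rates are the more informative figures here.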
The machine learning classification method showed accurate results in comparison with
the clinical judgement. The original data source was a set of psychiatrists' evaluations
of the compensation claims for patient injuries in association with psychiatric diseases
and disorders. All in all, 35 phrase groups were formed from 1,591 phrases by combining
almost fully or at least somewhat conceptually or semantically similar phrases. This
was necessary to create suitable attributes (phrase groups) for machine learning,
because many phrases occurred only once or a few times in the dataset, which would not
have made a reasonable basis for computation. In addition, some phrase pairs
were completely or virtually identical. We designed six different classes of
patient types or characterizations.
Random forests produced the highest classification accuracy of 77% based on the LOO
technique for dividing the data into training sets of n − 1 cases and test sets of single cases. Furthermore, we modified the SMOTE algorithm:
instead of generating multiples of the minority class, or of the classes other than the majority class, as in the basic
SMOTE, we balanced these classes up to the size of the majority class. This increased
the classification accuracy by approximately 10 percentage points. Finally, the binary classification
of the declined and accepted claims of the same data was performed. Since 272 of the 308 cases were
in the class “declined” (0), the binary class distribution was very skewed, and the
random forests lost almost all cases of Class 1. Running the modified SMOTE
algorithm first, however, leveled out the two classes and raised the classification accuracy
to 89%.
Finally, in association with random forests, we computed receiver operating characteristic (ROC)
curves and area under the curve (AUC) values, presented in [Fig. 1] for the classification of six classes before applying the SMOTE algorithm and in [Fig. 2] after its use. The AUC values ranged from 0.899 to 0.962 before SMOTE and were higher after
it. These were also computed for the binary classification, reaching AUC values
of 0.685 for both classes before the use of SMOTE and 0.992 after it. All these results
were computed with random forests of 100 trees following the LOO principle.
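For readers unfamiliar with AUC, the value for one class can be computed directly as the probability that a randomly chosen case of that class receives a higher score than a randomly chosen case of any other class. The sketch below is a minimal one-vs-rest illustration. The scores are invented toy values standing in for the share of trees voting for a class, and the function names are assumptions.

```python
def one_vs_rest_labels(y, target_class):
    """1 for cases of the target class, 0 for all other cases."""
    return [1 if label == target_class else 0 for label in y]

def auc(scores, labels):
    """AUC as the share of positive-negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: true classes and hypothetical vote shares for Class 2.
y_true = [2, 1, 2, 3, 1]
score_class_2 = [0.9, 0.8, 0.4, 0.35, 0.1]
labels = one_vs_rest_labels(y_true, 2)
print(auc(score_class_2, labels))  # 5 of 6 pairs ranked correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the class from the rest.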
Fig. 1 ROC curves and AUC values for the classification of six classes. AUC, area under
the curve; ROC, receiver operating characteristic.
Fig. 2 After generating artificial cases for balancing the class distribution, ROC curves
and AUC values for the classification of six classes. AUC, area under the curve; ROC,
receiver operating characteristic.
Discussion
According to our literature search, computational methods other than statistical ones have thus far hardly ever
been applied to psychiatric data, as the following
examples show. Health care claims were studied by applying knowledge discovery to massive
data to find fraudulent health care providers by using text mining, social network
analysis, and particularly temporal analysis.[22] However, the main computational results presented concerned
only straightforward statistical results such as log-likelihood scores. The types
of data were clinical data without describing specialties, patient behavior data,
pharmaceutical research data, and health insurance data. Medical malpractice claims
of an extensive dataset were studied statistically, with logistic regression, to predict
whether a claim is closed with no compensation.[23] In addition, conditionally on the cases of accepted compensations their covariates
were studied statistically. Their eight specialties (not psychiatry) were named for
only 27% of all 3,179 claims. Claims, liabilities, injuries, and compensation payments
of medical malpractice were described with numbers of cases and associated with drugs,
different diseases, and different types of hospitals,[24] but no statistical or other computational results were shown. Psychiatry was not
mentioned. Workers' compensation claims and payments were studied and described with
descriptive statistics containing numbers of cases and their means, without any psychiatric
cases.[25] Compensation data research on population-based injury data was conducted in which the term
data analytics was mentioned.[26] Nevertheless, it consisted merely of two estimations of probabilities of work-related
injury claims calculated for a period of approximately 7 years. Compensation claims
for psychiatric injury and the severity of physical injuries associated with motor vehicle
accidents were considered statistically; 19.5% of all 522 cases included a claim
for psychiatric injury.[27] This small dataset of 105 patients was analyzed with multivariate logistic regression,
computing odds ratios for five different categories, e.g., injury severity score
and hospital stay days. Compensation claims have been studied only infrequently in the field
of psychiatry. As for computational means, only statistical methods have been applied.
The results of the current study are in line with earlier reports in which
compensation claims related to malpractice in psychiatric treatment have been rare
compared with other medical specialties. In an American study, the annual rate of
compensation claims for psychiatrists was only 2.6%, whereas in neurosurgery the corresponding
rate was almost 20%.[11] In Spain, the annual rate among psychiatrists in Catalonia was found to be 0.9%.[12]
Despite the relatively low claim rates, treatment flaws might be more common
in psychiatric treatment than the claim rates suggest. For example, in both a Swedish and an American study, adverse
events were found in approximately one-fifth of treatments.[28][29]
Strengths and Limitations
The comprehensive national data, with coverage from the very beginning of the electronic
database of the Finnish Patient Insurance Centre, can be regarded as a strength of the study.
The clinician-based classification that was used as a comparison had a 100% agreement
rate between researchers, so it can be considered a good validation tool for the
algorithm. Since the database used in the study was completely encrypted and it was
not possible to use the entire database for, e.g., text mining, we searched the database
for as comprehensive a selection of phrases related to treatment focus and content as
possible. The researcher who selected the phrases was trained to use the database,
and an experienced psychiatrist acted as a backup in this process. However, it
is possible that text mining would have yielded a wider sample
of phrases, which might have further improved the performance of the algorithm.
Nevertheless, we believe that the most important text contents were included by extracting
the phrases.
To our knowledge, the current study is among the first to use machine learning for psychiatric
data.
Adverse events in health care are a global concern. Although patient safety improvement
efforts have increased in the past 20 years, new ways to enhance the safety of care
are needed. Learning from patient injuries requires an understanding of injury types
and causes. Traditionally, this has had to be done manually, case by case, and emerging
trends in the patient injury data may go unrecognized. The use of machine learning
in the classification of data can help solve these problems, sustain an up-to-date classification
of injuries, and be applied in prospective risk analyses for developing processes in
health care systems.
Natural language processing was not used, because this was our first classification
study for the current data. In the future, it would be reasonable to apply it
at least to the preprocessing of phrases. Nevertheless, the final consideration,
e.g., how to form phrase groups, requires deep psychiatric expertise that can hardly
be automated. In the future, it is important to collect more corresponding
data, since this would possibly produce better classification results. It could also
be possible to extend this type of classification study to other medical
specialties.
Conclusion
It can be concluded that the classification into six classes as such is reasonable
and possibly useful. Further, particularly with the modified SMOTE algorithm, the
classification task with the six present classes was successful. The binary classification
task of the compensation claim decision data was more complex because of its skewed
class distribution. Nevertheless, binary classification could also be a reasonable approach,
but only after the modified SMOTE algorithm has been used as described to balance the two
classes of the current data.
The machine learning classification appears to be a promising method for detecting
different types of patient claims and injuries. This kind of modelling could be used
in larger long-term data for monitoring and predicting temporal trends and developing
indicators of quality for different dimensions in clinical treatment.