Keywords: focal liver lesions – artificial intelligence – systematic review – ultrasound – deep learning
Introduction
Focal liver lesions (FLLs) are a common finding in abdominal ultrasound examinations, with a prevalence of approximately 15% [1]. As incidental findings, FLLs are frequently benign (e.g., cysts, hemangiomas). Sonography is particularly well suited for the classification of FLLs due to its wide availability, lack of invasiveness, and low cost [2]. B-mode, in combination with Doppler sonography, is sufficient for a definitive diagnosis of lesions such as cysts, hemangiomas, and focal fatty changes in the non-cirrhotic liver. In unclear cases, contrast-enhanced ultrasound (CEUS) has a high diagnostic value (diagnostic accuracy: 90%, sensitivity: 92–95%, specificity: 83–90%) for correctly classifying tumor dignity [2] [3] [4]. Nevertheless, in some cases, the assessment of malignancy or the specific tumor entity is not possible. In such cases, a biopsy is carried out to provide a final diagnosis based on histology. Although image-guided biopsy is a low-risk tool, as an invasive procedure it might be accompanied by pain, bleeding, infection, or injury to other organs [5].
Artificial intelligence (AI) generally describes computational methods that emulate
human
intelligence, at least in partial areas, such as decision-making. Machine learning
is a
subfield of AI in which a program is designed to learn from experience using training
data. On
a research level, this process has already been evaluated in a variety of medical
fields
(e.g., detection of polyps during colonoscopy) [6 ]
[7 ]. Support vector machines (SVM) and artificial neural networks (ANN) are machine
learning methods that can be applied to evaluate image data. In detail, an SVM is
a
mathematical method for dividing a set of objects into classes by maximizing margins
between
groups. ANNs use a structure that is similar to biological neural networks to classify
data.
Deep learning (DL) represents a subfield of ANN-based machine learning with complex
neural
network architectures with multiple layers of artificial neurons. Large amounts of data are used to train DL algorithms, and in the case of image data, feature extraction is often done implicitly by the DL network. DL is regarded as the current state-of-the-art approach for AI-based image analysis.
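To make this distinction concrete, the following minimal sketch (our illustration with synthetic placeholder data, not taken from any of the reviewed studies) trains both classifier types on the same feature vectors using the open-source scikit-learn library:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Synthetic stand-in for extracted image features: 200 lesions with
# 16 features each, labeled 0 = benign, 1 = malignant.
X = rng.normal(size=(200, 16))
y = (X[:, :4].sum(axis=1) > 0).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# SVM: separates the two classes with a maximum-margin decision boundary.
svm = SVC(kernel="rbf").fit(X_train, y_train)

# ANN: a small multilayer perceptron, i.e., fully connected layers of
# artificial neurons (DL architectures stack many more layers and, for
# images, usually learn features directly from the pixels).
ann = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=2000,
                    random_state=0).fit(X_train, y_train)

print("SVM accuracy:", svm.score(X_test, y_test))
print("ANN accuracy:", ann.score(X_test, y_test))
```

In this toy setting, both models behave similarly; the practical differences emerge with real image data, where deep networks such as CNNs can learn discriminative features directly from the pixels.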
AI could potentially improve the assessment of FLL dignity and entity by sonography. In such a scenario, the investigator could benefit from a more objective AI assessment, and the comparability of results might improve. AI algorithms with good discriminatory ability regarding entities that are difficult for humans to distinguish would be particularly helpful.
In recent years, several papers have been published on this topic [6]. In this systematic review, we summarize the available data on the assessment of dignity and entity of FLLs by B-mode and CEUS using AI methods. The diagnostic value of those methods is discussed from a clinician's perspective. A special focus was placed on diagnostic accuracies and whether these can be improved by adding clinical parameters. In addition, comparative data between AI-based approaches and physicians were collected and assessed. Although individual reviews on this topic already exist, our study represents the first systematic and most comprehensive review [6] [7].
Methods
Search strategy
For this systematic review, articles on the characterization of FLLs by sonography
were
selected in the Scopus, Web of Science, PubMed, and IEEE databases. The literature
search
was conducted on 12/31/2021 according to the a priori defined search criteria. The
following
search terms were used: “artificial intelligence”, “machine learning”, “neural network”,
“deep learning”, “computer-assisted”, “computer-aided”, “ultrasound”, “sonograph”,
“ultrasonography”, “liver”, “hepat”, “lesion”, “tumor”, “carcinoma”, “mass”, “focal.”
The
detailed and complete search terms can be found in the supplemental data. The inclusion
and
exclusion criteria were determined a priori. Only articles from the years 2000 to
2021 were
considered, as older articles mostly used algorithms that are outdated from today's
perspective. Only articles that addressed liver tumor classification and/or diagnosis
of a
specific liver tumor entity either by B-mode and/or CEUS using artificial intelligence
in
humans were considered. Articles that did not report the diagnostic accuracy of AI-based
classification of images were excluded. In addition, only original English language
full-text articles or congress contributions with sufficient information were
included.
Data extraction
Two authors (MV and DJ) independently performed the data extraction and quality
assessment. Any disagreements were discussed and clarified in consensus with a third
author.
The extracted data included authors, title, year of publication, study design (mono- or multicentric), number of cases, and mode of ultrasound (B-mode and/or CEUS). Diagnostic accuracy, sensitivity, specificity, and AUC (area under the curve) for lesion dignity and/or specific tumor entities were recorded. Regarding AI, information on the extracted image features and the algorithm used was collected. If multiple AI algorithms were used in one study, only the one with the best performance with respect to overall diagnostic accuracy was considered.
Quality assessment
The quality of the studies was evaluated using the Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) tool [8]. Four scenarios were assessed by QUADAS-2: 1) differentiation of benign and malignant liver lesions (tumor dignity) using AI on B-mode ultrasound; 2) diagnosis of specific tumor entities using AI on B-mode ultrasound; 3) differentiation of benign and malignant liver lesions (tumor dignity) using AI on CEUS; 4) diagnosis of specific tumor entities using AI on CEUS. The focus of the QUADAS-2 tool is the detection of a possible bias and the applicability of studies. For the assessment of a potential bias, the articles were examined using 12 signaling questions regarding patient selection, index test, reference standard, and flow and timing [8]. According to the recommendations of QUADAS-2, these signaling questions were
adapted with regard to our research question. For concerns of applicability, the following
criteria were added: 1) Are at least the diagnoses of cyst (B-mode only), FNH, hemangioma,
HCC, and metastasis included? (Patient selection) 2) Has the AI been verified using
an
independent data set? (Index test) 3) Were diagnoses based on pathology or CT/MRI
or
clinical follow-up for more than 6 months? (Reference standard). The full adjustments
and
detailed results can be found in the supplemental data (Supplemental Tab. 1–4).
Results
Literature search
A total of 660 articles were found during the literature search in PubMed, Web of Science, Scopus, and IEEE, including 152 duplicates. During the screening of abstracts, 184 articles were excluded because they were not original works (e.g., reviews) ([Fig. 1]). An additional 260 studies were not included because our research question was not addressed. During the full-text analysis of 64 studies, 7 further articles were removed because of the imaging modality used (computed tomography, endosonography, or shear wave elastography). One study investigated the detection of tumors only, one article investigated splenic lesions in dogs, and one study did not report the accuracy of AI-based image classification alone (only in combination with clinical data). Additionally, two duplicate studies were removed. Finally, 52 articles remained for the final analysis. Of these, 32 studies investigated FLLs using B-mode ultrasound (10x dignity, 25x diagnosis) and 21 studies used CEUS (8x dignity, 13x diagnosis).
Fig. 1 Flowchart of the identification and selection process of studies. IEEE=Institute of Electrical and Electronics Engineers, US=ultrasound, CT=computed tomography, EUS=endoscopic ultrasound, SWE=shear wave elastography, CEUS=contrast-enhanced US.
General approach of the identified studies using artificial intelligence
All studies followed a similar pattern ([Fig. 2 ]). The first step comprised image optimization followed by manual or automated
segmentation. Subsequently, while some studies used raw image data, others extracted
specific image features to be analyzed by the AI algorithm. In the case of CEUS, some
studies extracted time intensity curves (TIC). Examples of extracted B-mode data are
contour
properties and gray level features. Most often, a whole array of different features
was
extracted automatically by specific algorithms. Afterwards, feature selection was performed to reduce the number of collected features (several thousand in some studies) to a level the AI algorithm could work with efficiently. Few studies (B-mode only) considered
additional clinical data for the classification process. Finally, the actual classification
algorithm was applied, whereby ANNs and SVMs were used most often. AI was usually
trained
with the majority of images (about 80%), followed by validation and testing with the
remaining images from a database or cohort of patients. Many studies used x-fold
cross-validation, a method in which the data are split into different training and
validation sets repeatedly. External testing cohorts were rarely used.
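As a simplified, hypothetical illustration of this pipeline (not a reproduction of any individual study; lesion patches and labels are random placeholders), the following sketch extracts gray-level co-occurrence (GLCM) texture features from segmented patches, applies a feature selection step, and evaluates an SVM with 5-fold cross-validation:

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops  # scikit-image >= 0.19
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def glcm_features(patch):
    """Gray-level co-occurrence (texture) features of one lesion patch."""
    glcm = graycomatrix(patch, distances=[1, 2], angles=[0, np.pi / 2],
                        levels=256, symmetric=True, normed=True)
    props = ("contrast", "homogeneity", "energy", "correlation")
    return np.concatenate([graycoprops(glcm, p).ravel() for p in props])

rng = np.random.default_rng(0)
# Placeholder data: 60 grayscale lesion patches with binary labels.
patches = rng.integers(0, 256, size=(60, 64, 64), dtype=np.uint8)
labels = rng.integers(0, 2, size=60)

X = np.array([glcm_features(p) for p in patches])

# Feature selection (keep the 8 most discriminative of 16 features),
# followed by an SVM, evaluated with 5-fold cross-validation.
model = make_pipeline(StandardScaler(),
                      SelectKBest(f_classif, k=8),
                      SVC(kernel="rbf"))
scores = cross_val_score(model, X, labels, cv=5)
print("cross-validated accuracy: %.2f +/- %.2f" % (scores.mean(), scores.std()))
```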
Fig. 2 General schematic of studies investigating AI-based classification of FLLs.
Artificial intelligence for the differentiation of benign and malignant liver lesions
on B-mode ultrasound
We found ten studies using AI classification of B-mode ultrasound images to differentiate benign from malignant FLLs [9] [10] [11] [12] [13] [14] [15] [16] [17] [18]. Studies were published between 2003 and 2021 (90% from 2015 or later). Two studies were multicentric [15] [17]. Case numbers ranged from 101 to 23,756. Most studies (9 out of 10) included patients with hemangiomas and HCCs. FNHs were only considered in three studies. Where indicated, multiple different ultrasound machines were employed. Four studies extracted image features. All ten studies used an ANN to classify data (mainly convolutional neural networks (CNN), n=8). The diagnostic accuracy for the assessment of tumor dignity ranged from 68.5% to 94.8% ([Fig. 3]). Yang et al. was the only group to conduct external testing with an independent patient cohort; the results did not substantially differ from internal testing [15]. Data are summarized in [Table 1].
Fig. 3 Overview of diagnostic accuracies. Each dot represents the reported diagnostic
accuracy of a single study. For the study by Sritunyarat et al. (B-mode – entity),
only values of external testing were available. Therefore, these are not shown
here.
Table 1 Summary of B-mode studies on lesion dignity. Studies are sorted alphabetically. Only the best diagnostic accuracy within one study without the consideration of clinical parameters is shown. Diagnostic accuracies are only comparable to a limited extent due to different testing measures and selection of diagnoses. ^1: When the number of patients was not available, the number of images was used. ^2: Value was estimated from a graph. ^3: Values for external testing. ^4: Retrospectively calculated diagnostic accuracy from sensitivity, specificity, and prevalence or positive/negative predictive values. ABS=abscess, AML=angiomyolipoma, ANN=artificial neural network, BEN=benign lesions, CCC=cholangiocarcinoma, CINO=cirrhotic nodule, FFD=focal fat deposition, FFS=focal fatty sparing, FNH=focal nodular hyperplasia, HCC=hepatocellular carcinoma, HEM=hemangioma, MAL=malignant lesions, MET=metastasis, N/A=not available, OBL=other benign lesions, OML=other malignant lesions.

Author | Cases | Diagnoses | Feature extraction | AI | Accuracy (in %) | Sensitivity (in %) | Specificity (in %) | AUC
Acharya et al. 2018 | 101 | ABS, CYST, HCC, HEM, MET | yes | ANN | 93.0 | 90.8 | 97.4 | N/A
Hassan et al. 2021 | 352^1 | BEN, MAL | yes | ANN | ~92^2 | N/A | N/A | N/A
Ryu et al. 2021 | 3873 | CYST, HCC, HEM, MET | no | ANN | 90.4 | 95.0 | 86.0 | 0.970
Sato et al. 2021 | 1080 | ABS, AML, CCC, CYST, FFD, FFS, FNH, HCC, HEM, MET, OBL | no | ANN | 68.5 | 67.3 | 69.8 | 0.721
Tiyarattanachai et al. 2019 | 683 | CYST, HCC, HEM, FFD, FFS | no | ANN | 81 | 76 | 85 | 0.890
Xi et al. 2021 | 596 | ABS, ADEN, CINO, CYST, FFD, FNH, HCC, HEM, OBL, OML | no | ANN | 84 | N/A | N/A | 0.830
Yang et al. 2020 | 20625^1 | ABS, AML, CCC, ECH, FFS, FNH, HCC, HEM, MET, OBL, OML | yes | ANN | 76.7^4 (75.1)^3,4 | 80.5 (77.4)^3 | 60.1 (67.4)^3 | 0.779 (0.805)^3
Yamakawa et al. 2019 | 324 | CYST, HCC, HEM, MET | no | ANN | 94.8 | 93.8 | 95.2 | N/A
Yamakawa et al. 2021 | 23756 | CYST, HCC, HEM, MET | no | ANN | 94.3 | 82.9 | 96.7 | N/A
Yoshida et al. 2003 | 44 | HCC, HEM, MET | yes | ANN | N/A | N/A | N/A | 0.92
Artificial intelligence for the differentiation of specific tumor entities in B-mode
ultrasound
The database search revealed 25 studies using AI on B-mode images to diagnose specific tumor entities [11] [13] [16] [17] [19] [20] [21] [22] [23] [24] [25] [26] [27] [28] [29] [30] [31] [32] [33] [34] [35] [36] [37] [38] [39]. Studies were published between 2003 and 2021, including six with a multicentric design [17] [20] [21] [29] [31] [32]. Case numbers ranged from 51 to 3,873. Tumor entities differed substantially between the studies. Regarding ultrasound devices, a wide range from a single machine to multiple devices from different manufacturers was found. In 16 studies, texture features were extracted, while in 9 the AI received raw image data. The most common algorithm used was an ANN (n=17, mainly CNNs (n=8)). Furthermore, SVMs (n=7) and logistic regression (n=1) were applied to classify data. Diagnostic accuracies ranged from 69.0% to 98.6% ([Fig. 3]). Two studies used an internal and an external testing cohort: Tiyarattanachai et al. observed hardly any differences between the two [32], whereas Ren et al. saw a noticeable decrease in diagnostic accuracy (internal testing: 79%, external testing: 69%) [29]. Data are summarized in [Table 2].
Table 2 Summary of B-mode studies on the assessment of tumor entity. Studies are sorted alphabetically. Only the best diagnostic accuracy within one study without the consideration of clinical parameters is shown. Diagnostic accuracies are only comparable to a limited extent due to different testing measures and selection of diagnoses. ^1: When the number of patients was not available, the number of images was used. ^2: Values for external testing. ^3: Retrospectively calculated diagnostic accuracy from sensitivity, specificity, and prevalence or positive/negative predictive values. ABS=abscess, AML=angiomyolipoma, ANN=artificial neural network, BEN=benign lesions, CCC=cholangiocarcinoma, CINO=cirrhotic nodule, FFD=focal fat deposition, FFS=focal fatty sparing, FNH=focal nodular hyperplasia, HCC=hepatocellular carcinoma, HEM=hemangioma, ICC=intrahepatic cholangiocarcinoma, MAL=malignant lesions, MET=metastasis, N/A=not available, OBL=other benign lesions, OML=other malignant lesions, SVM=support vector machine.

Author | Cases | Diagnoses | Feature extraction | AI | Accuracy (in %) | Sensitivity (in %) | Specificity (in %) | AUC
Balasubramanian et al. 2017 | 160^1 | N/A | yes | ANN | 84.6^3 | N/A | N/A | N/A
Hassan et al. 2015^4 | 110^1 | CYST, HCC, HEM | yes | SVM | 96.5 | 97.6 | 92.5 | N/A
Hassan et al. 2017^4 | 110 | CYST, HCC, HEM | no | ANN | 97.2 | 98 | 95.7 | N/A
Hwang et al. 2015 | 115 | CYST, HEM, MAL | yes | ANN | 98.1^3 | N/A | N/A | N/A
Lee et al. 2011 | 102 | CYST, HEM, MAL | yes | SVM | 83.3 | 66.7 | 83.3 | 0.77
Mao et al. 2021 | 114 | HCC, ICC, MET | yes | Other | 84.3 | 76.8 | 88.0 | 0.816
Mitrea et al. 2019 | 300 | HCC, HEM | yes | ANN | 85.4 | 78.0 | 82.9 | 0.805
Mittal et al. 2011^4 | 176 | CYST, HCC, HEM, MET | yes | ANN | 86.4 | N/A | N/A | N/A
Peng et al. 2022 | 589 | INF, MAL | yes | SVM | 79.1 | 86.3 | 45.2 | 0.745
Qiu et al. 2011^4 | 256^1 | HCC, HEM | yes | SVM | 96.9^3 | N/A | N/A | N/A
Ren et al. 2021 | 188 | CCC, HCC | yes | SVM | 79.0 (69.2)^2 | 90.0 (66.7)^2 | 75.0 (70.0)^2 | 0.843 (0.730)^2
Ryu et al. 2021 | 3873 | CYST, HCC, HEM, MET | no | ANN | 82.2 | 86.7 | 89.7 | 0.947
Schmauch et al. 2019^4 | 544 | CYST, FNH, HCC, HEM, MET | no | ANN | N/A | N/A | N/A | (0.891)^2
Sritunyarat et al. 2020 | 157 | CYST, HCC, HEM, FFD, FFS | no | ANN | (95.0)^2 | (87.0)^2 | (97.0)^2 | N/A
Tiyarattanachai et al. 2019 | 683 | CYST, HCC, HEM, FFD, FFS | no | ANN | 69 | N/A | N/A | N/A
Tiyarattanachai et al. 2021 | 3872 | CYST, HCC, HEM, FFD, FFS | no | ANN | 95.4 (95.3)^2 | 83.9 (84.9)^2 | 97.1 (97.1)^2 | N/A
Virmani et al. 2013^4 | 108^1 | CYST, HCC, HEM, MET | yes | ANN | 87.7 | N/A | N/A | N/A
Virmani et al. 2013 | 51 | HCC, MET | yes | SVM | 91.6 | N/A | N/A | N/A
Virmani et al. 2013^4 | 108^1 | CYST, HCC, HEM, MET | yes | SVM | 87.2 | N/A | N/A | N/A
Virmani et al. 2014 | 108^1 | CYST, HCC, HEM, MET | yes | ANN | 95.0 | N/A | N/A | N/A
Xu et al. 2020 | 79 | ABS, HCC | yes | ANN | 83.8 | N/A | N/A | N/A
Yamakawa et al. 2019 | 324 | CYST, HCC, HEM, MET | no | ANN | 88.0 | 80.4 | 96.0 | N/A
Yamakawa et al. 2021 | 23756 | CYST, HCC, HEM, MET | no | ANN | 91.1 | N/A | N/A | N/A
Zhang et al. 2010^4 | 280^1 | CYST, HCC, HEM | yes | ANN | 98.6^3 | N/A | N/A | N/A
Zhou et al. 2021 | 172 | HCC, OML | no | ANN | 78.4^3 | 57.1 | 91.3 | 0.74
Artificial intelligence for the differentiation between benign and malignant liver
lesions on CEUS
We found eight studies, published between 2014 and 2021, that used AI classification of CEUS data to differentiate benign from malignant FLLs [40] [41] [42] [43] [44] [45] [46] [47]. Only one study was multicentric [45]. Most had a small sample size, ranging from 26 to 363 cases, and all but one performed their examinations with a single ultrasound device; the remaining study used two machines from the same manufacturer [43]. Feature extraction was applied in all but one study, and two studies used TIC data exclusively. Half of the studies employed an SVM and three an ANN to classify lesions. The reported overall diagnostic accuracy ranged from 81.1% to 91.6% ([Fig. 3]). Data are summarized in [Table 3].
Table 3 Summary of CEUS studies on tumor dignity. Studies are sorted alphabetically. Only the best diagnostic accuracy within one study without the consideration of clinical parameters is shown (values for sensitivity, specificity, and AUC are reported for the method with the best diagnostic accuracy). Diagnostic accuracies are only comparable to a limited extent due to different testing measures and selection of diagnoses. ^1: Including clips which could not be analyzed by AI. ABS=abscess, ANN=artificial neural network, BEN=benign lesions, FFS=focal fatty sparing, FNH=focal nodular hyperplasia, HCC=hepatocellular carcinoma, HEM=hemangioma, MAL=malignant lesions, MET=metastasis, N/A=not available, SVM=support vector machine, TIC=time intensity curve.

Author | Cases | Diagnoses | Feature extraction | AI | Accuracy (in %) | Sensitivity (in %) | Specificity (in %) | AUC
Guo et al. 2017 | 93 | BEN, MAL | yes | Other | 90.4 | 93.6 | 89.3 | 0.95
Guo et al. 2018 | 83 | CCC, FNH, HCC, HEM, MET | yes | Other | 90.4 | 93.6 | 86.9 | 0.97
Hu et al. 2021 | 363 | BEN, MAL | no | ANN | 91.0 | 92.7 | 85.1 | 0.93
Kondo et al. 2017 | 94 | FNH, HCC, HEM, MET | yes (TIC) | SVM | 91.6 | 94.0 | 90.3 | N/A
Qian et al. 2017 | 93 | BEN, MAL | yes | SVM | 89.4 | 89.7 | 89.8 | 0.96
Ta et al. 2018 | 105 | BEN, MAL | yes (+TIC) | ANN & SVM | 81.1 (73.3)^1 | 90.0 (83.3)^1 | 71.1 (62.7)^1 | 0.88
Wu et al. 2014 | 26 | ABS, FFS, HCC, HEM, MET | yes (TIC) | ANN | 86.4 | 83.3 | 87.5 | N/A
Zhang et al. 2021 | 153 | BEN, MAL | yes | SVM | 88.2 | 86.9 | 89.4 | 0.9
Artificial intelligence for the differentiation of specific tumor entities on
CEUS
Thirteen studies evaluated AI-based classification of different FLL entities with CEUS data [48] [49] [50] [51] [52] [53] [54] [55] [56] [57] [58] [59] [60]. The number of cases ranged from 37 to 527; the majority of studies (9/13) included more than 100. Only three studies included FNHs, hemangiomas, HCCs, and metastases in their analysis. Six studies used a single ultrasound device, three used multiple devices, and the remaining four did not disclose this information. CEUS features were extracted in all studies; two of them analyzed TICs only. ANNs were most commonly used for FLL classification (58%), followed by SVMs (30%). Diagnostic accuracy ranged from 64.0% to 98.3% ([Fig. 3]). Data are summarized in [Table 4].
Table 4 Summary of CEUS studies on the assessment of tumor entity. Studies are sorted alphabetically. Only the best diagnostic accuracy within one study without the consideration of clinical parameters is shown (values for sensitivity, specificity, and AUC are reported for the method with the best diagnostic accuracy). Diagnostic accuracies are only comparable to a limited extent due to different testing measures and selection of diagnoses. ^1: Values for external testing. ADEN=adenoma, ANN=artificial neural network, BEN=benign lesions, FFC=focal fatty change, FNH=focal nodular hyperplasia, HCC=hepatocellular carcinoma, HEM=hemangioma, MAL=malignant lesions, MET=metastasis, N/A=not available, SVM=support vector machine, TIC=time intensity curve.

Author | Cases | Diagnoses | Feature extraction | AI | Accuracy (in %) | Sensitivity (in %) | Specificity (in %) | AUC
Căleanu et al. 2014 | 37 | FNH, HCC, HEM, MET | yes | SVM | 64.0 | N/A | N/A | N/A
Căleanu et al. 2021 | 91 | FNH, HCC, HEM, MET | yes (TIC) | ANN | 95.7 | N/A | N/A | N/A
De Senneville et al. 2020 | 47 | ADEN, FNH | yes | Other | 95.9 | 93.4 | 97.6 | 0.97
Hu et al. 2019 | 527 | N/A | yes | ANN | 85–88 | 89–94 | 67–70 | 0.89
Huang et al. 2020 | 342 | FNH, HCC | yes | SVM | 94.4 | 94.8 | 93.6 | N/A
Li et al. 2021 | 226 | FNH, HCC | yes | SVM | N/A | 76.6 | 80.5 | 0.86
Liang et al. 2016 | 353 | FNH, HCC, HEM | yes | Other | 84.8 | N/A | N/A | N/A
Shiraishi et al. 2008 | 103 | HCC, HEM, MET | yes | ANN | 88.3 | N/A | N/A | N/A
Sîrbu et al. 2020 | 95 | FNH, HCC, HEM, MET | N/A | ANN | 95.7 | N/A | N/A | N/A
Streba et al. 2012 | 112 | FFC, HCC, HEM, MET | yes (TIC) | ANN | 87.1 | 93.2 | 89.7 | N/A
Sugimoto et al. 2009 | 137 | HCC, HEM, MET | yes | ANN | 94.2 | N/A | N/A | N/A
Sugimoto et al. 2010 | 137 | HCC, HEM, MET | yes | ANN | 88.3 | N/A | N/A | N/A
Zhou et al. 2021 | 186 | FNH, HCC | yes | SVM | 98.3 (96.7)^1 | 98.1 (98.7)^1 | 98.6 (94.7)^1 | N/A
Impact of the inclusion of clinical data on the diagnostic accuracy of artificial
intelligence
Four studies investigated whether the additional consideration of clinical parameters alongside B-mode images can increase the diagnostic accuracy of AI-based classification [12] [15] [29] [39]. There were no data on this for CEUS. In all four studies, diagnostic accuracy improved. The effect was particularly pronounced in the study by Sato et al., in which the diagnostic accuracy increased from 68.5% to 96.3% [12]. Yang et al. showed that knowledge of the presence of hepatitis or tumor disease significantly improves the differentiation between benign and malignant lesions [15]. Zhou et al. showed that the consideration of CA19-9 (OR 24.85) enhances the differentiation between HCC and other malignant processes of the liver almost as well as the AI algorithm itself (OR 29.52) [39]. Data are summarized in [Table 5].
Table 5 Summary of B-mode studies adding clinical data to AI analysis. ^1: Retrospectively calculated diagnostic accuracy from sensitivity, specificity, and prevalence or positive/negative predictive values.

Study | Mode | ACC without clinical data | ACC with clinical data | Clinical and sonographic parameters (odds ratio)
Sato et al. 2022 | B-mode / dignity | 68.5% | 96.3% | Clinical parameters: age, gender, AST, ALT, platelet count, albumin
Yang et al. 2020 | B-mode / dignity | 76.7%^1 | 87.0%^1 | OR for malignant lesions: hypoechoic halo (18.389 [9.921–34.084]); history of extrahepatic tumor (16.17 [9.311–28.065]); history of hepatitis (11.736 [7.857–17.529]); age > 65 y (3.323 [2.096–5.269]); male gender (2.303 [1.629–3.256]); intratumoral vascularity (1.911 [1.344–2.717])
Ren et al. 2021 | B-mode / entity | 78.95% | 86.8% | Clinical parameters: age, gender, history of hepatitis, AFP, ALT, AST, TB, CB, UCB, size of lesion
Zhou et al. 2021 | B-mode / entity (HCC vs. other malignancies) | 57.1% | 78.6% | OR for non-HCC malignancies: CA19-9 (24.85 [6.10–101.25]); female gender (3.72 [1.17–11.9])
Diagnostic performance of artificial intelligence in comparison to ultrasound
professionals
A total of seven B-mode and CEUS studies compared the diagnostic accuracy of AI algorithms to radiologists interpreting the same cases [14] [15] [42] [45] [51] [53] [57].
Additional clinical information was available to radiologists in some of the studies.
AI
matched the diagnostic performance of experts in five studies and significantly
outperformed beginners in two studies and experts in one study. Hu et al. reported
that
the diagnostic accuracy of less experienced examiners improved when combined with AI, while the diagnostic accuracy of experts worsened [51]. Data are summarized in [Table 6].
Table 6 Summary of studies comparing the performance of AI with physician-based decisions. ^1: TIC analysis. ^2: Additional clinical information. N/A=not available.

Author | Mode | ACC (expert) | p | ACC (beginner) | p | ACC (AI) | Conclusion
Hu et al. 2019 | CEUS | N/A | N/A | N/A | N/A | 85–88 | AI was a setback for experts
Hu et al. 2021^2 | CEUS | 87.5 | 0.256 | 83.0 | 0.021 | 91.0 | AI matched experts
Li et al. 2021 | CEUS | 0.84 (AUC) | N/A | N/A | N/A | 0.86 (AUC) | AI matched experts
Streba et al. 2012^1 | CEUS | N/A | 0.225 | N/A | N/A | 87.1 | AI matched experts
Ta et al. 2018 | CEUS | 81.4 | N/A | 72.0 | N/A | 81.1 | AI matched experts, better than beginners
Xi et al. 2021^2 | B-mode | 80.0 (1x), 73.0 (1x) | 0.18 | N/A | N/A | 84.0 | AI matched experts
Yang et al. 2020^2 | B-mode | 69.5 | <0.01 | 64.7 | <0.01 | 84.7 | AI better than experts
Quality assessment using QUADAS-2
All studies were reviewed for potential bias and applicability concerns using QUADAS-2.
In general, most studies did not provide all the information needed to assess the
risk of
bias. For example, the domain “patient selection” remained unclear for most studies,
as it
was not evident from the articles whether patients were recruited consecutively or
not.
Using all available information, the risk of bias was considered to be low ([Fig. 4]a). In contrast, applicability was a concern for most studies ([Fig. 4]b). In the domain “patient selection”, it was noticeable that the majority of studies
did not include FNHs in their analysis. Furthermore, only a few studies validated
the
diagnostic accuracy of their AI algorithm with an independent data set, which decreases
the
applicability of the index test. There were similar results concerning bias and
applicability for the subgroups B-mode, CEUS, tumor dignity and tumor entity. More
detailed
information and the assessments of individual studies are included in the supplemental
data
(Supplemental Fig. 1 and Supplemental Tab. 1–4).
Fig. 4 QUADAS-2 overview. a) Risk of bias for all studies (light
gray: low risk, dark gray: high risk, white: unclear risk). b)
Applicability concerns for all studies (light gray: low level of concerns, dark gray:
high level of concerns, white: unclear level of concerns).
Discussion
Sonography can be used to reliably determine the dignity and entity of many focal
liver
lesions. However, even with the use of CEUS, not every lesion can be classified correctly.
Since AI-based applications have found their way into many scientific fields, there is reasonable hope that AI could also help to improve ultrasound-based diagnosis of
FLLs and
potentially avoid the need for additional imaging and invasive procedures. The aim
of this
systematic review was to analyze studies in which the dignity or entity of FLLs was
assessed
by AI, using B-mode or CEUS data. For this purpose, 52 articles found using a structured
literature search approach were analyzed systematically in order to answer the following
questions:
How powerful is artificial intelligence for the classification of liver tumors?
Diagnostic accuracy describes the fraction of cases that are assigned the correct diagnosis by the test procedure. Typically, a diagnostic accuracy of more than 80% is considered good and more than 90% excellent [61].
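Where studies reported only sensitivity, specificity, and prevalence (or predictive values), we retrospectively calculated diagnostic accuracy (see table footnotes). For sensitivity and specificity, the relationship is a prevalence-weighted average, sketched below with assumed example values:

```python
def diagnostic_accuracy(sensitivity, specificity, prevalence):
    """Fraction of all cases classified correctly:
    accuracy = sensitivity * prevalence + specificity * (1 - prevalence)."""
    return sensitivity * prevalence + specificity * (1.0 - prevalence)

# Assumed example values: sensitivity 92%, specificity 85%,
# prevalence of malignancy 40% -> accuracy 87.8%.
print(diagnostic_accuracy(0.92, 0.85, 0.40))  # 0.878
```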
Half of the B-mode studies assessing FLL dignity reported excellent diagnostic accuracy,
and a further 20% of the studies showed good performance (range: 68.5% to 94.8%).
The impact
of lesion size on diagnostic accuracy, sensitivity, specificity, and AUC was investigated
in
one study with no significant differences between sizes 1.1–2.0 cm, 2.1–5.0 cm, and
>5.0
cm [15]. Yamakawa et al. reported higher accuracies for cysts (99.0%) and hemangiomas (91.0%) in comparison to HCCs (67.5%) and liver metastases (62.8%) on B-mode [17]. A different study did not observe differences between these entities [31]. Studies that analyzed CEUS data to classify FLL dignity all showed good (50%) or excellent (50%) diagnostic performance.
When assessing specific tumor entities based on the B-mode image, accuracies ranged
from
69.0% to 98.6%. 40% of studies reported good and a further 40% reported excellent
diagnostic
accuracy. In CEUS studies regarding the differentiation of FLL entities, all but one
(92%)
showed at least good performance with six reporting excellent accuracy.
In order to measure diagnostic accuracy as exactly as possible, AI-based classification algorithms should ideally be evaluated by means of external validation. This requires the use of an independent test set of patients on which the AI has not been trained (even partially). Only five B-mode studies and one CEUS-based study performed external validation. Two of these studies compared the diagnostic accuracies with their internal set (drawn from the same cohort the AI had been trained on) and found no significant differences [15] [32]. The other two studies found a deterioration of diagnostic accuracy when using an external set [29] [60]. The remaining two B-mode studies did not test on the internal set, and, therefore, a comparison was not possible [30] [31]. These differing results for the external validation cohort might be due to a considerable variation in case numbers (3,872 and 20,625 cases [32] [15] vs. 188 and 186 cases [29] [60]). Alternatively, the conflicting results could originate from unknown random or systematic differences between the internal and the validation data sets.
Can the potency of artificial intelligence be improved by adding clinical
parameters?
Clinical data indicate pre-test probability and should, therefore, always be considered by physicians when making a diagnosis. Somewhat surprisingly, only four B-mode studies considered this approach for their AI algorithms. All of them showed that the diagnostic accuracy of AI-based FLL classification can be improved by adding clinical parameters. Among other things, gender, age, and a positive history of hepatitis or cancer had a significant impact on diagnostic accuracy. In the multivariate analysis, some parameters (e.g., CA19-9) were almost as relevant for the correct classification as the interpretation of the image data itself [39]. Sato et al. achieved the highest diagnostic accuracy among the aforementioned studies with the combination of B-mode image data and clinical parameters [12].
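How such a combination can be realized is shown by the following sketch (our simplified illustration with synthetic placeholder data and hypothetical clinical variables, not the models of the cited studies): image-derived features are concatenated with clinical parameters, and a joint classifier is compared against an image-only model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 300
image_features = rng.normal(size=(n, 32))  # e.g., an embedding of the lesion image
clinical = np.column_stack([
    rng.integers(20, 90, n),               # age in years (hypothetical variable)
    rng.integers(0, 2, n),                 # history of hepatitis (0/1)
    rng.integers(0, 2, n),                 # history of extrahepatic tumor (0/1)
])
y = rng.integers(0, 2, n)                  # 0 = benign, 1 = malignant

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
X_combined = np.hstack([image_features, clinical])
print("image features only:", cross_val_score(model, image_features, y, cv=5).mean())
print("plus clinical data: ", cross_val_score(model, X_combined, y, cv=5).mean())
```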
Artificial intelligence vs. human intelligence – which is better?
A total of seven studies (2x B-mode, 5x CEUS) compared physicians’ diagnostic
performance with that of their AI algorithms. According to the results, AI performed
as well
as experienced radiologists in five studies and better in one study. However, in the
latter
study, the human diagnostic accuracy was low (ACC 69.5%) [15]. Another study reported that the availability of AI-based classification improved
the diagnostic accuracy of less experienced examiners but was a setback for experts
[51]. These results are remarkable, even more so knowing that physicians had an advantage
by having insight into the clinical parameters in three of the studies.
Ta et al. observed that radiologists were able to successfully analyze (not classify)
CEUS data more often (inexperienced: 95.2%, experienced 97.1%) than their AI algorithms
(90.5%) [45]. An inability to analyze cases was the result of poor image quality, insufficient contrast agent
enhancement, or small size of the FLL (<1 cm). When taking these unclassifiable lesions
into account, the diagnostic accuracy of the AI-based approach dropped from 81.1%
to 73.3%
(for radiologists: inexperienced: 68.6%, experienced 79.0%). Whether the accuracies
reported
in other studies were calculated with this consideration in mind is doubtful.
It can be concluded that AI classification of FLLs is able to achieve diagnostic
accuracies comparable to experienced human observers under rather artificial study
conditions. There is not enough data to make reasonable conclusions about the differences
in
diagnostic performance between AI and humans in a real-world setting.
What are the concerns and limitations for the use of artificial intelligence to
classify FLLs?
General concerns about the use of AI in the medical field include the protection of
patients’ individual rights and personal information, especially if data are not being
analyzed on site. Another critical aspect of the methods discussed in this article
is their
black-box nature. There is often no easy way to interpret or explain the produced
results.
Providers of AI-based classification systems will need to ensure that their technological
approach is as transparent and reliable as possible. A recent research topic called
explainable AI is trying to resolve this issue [62]. Liability concerns will probably be the biggest obstacle keeping AI from
implementation in clinical practice.
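Regarding the black-box problem mentioned above, one simple, model-agnostic representative of such explainability techniques is occlusion sensitivity: parts of the image are masked one after another, and the change in the predicted malignancy probability indicates which regions the decision depends on. A minimal sketch (the "model" here is a placeholder function, not a trained network):

```python
import numpy as np

def model(image):
    """Placeholder for a trained classifier returning P(malignant)."""
    return float(image.mean())  # stand-in for a real network's output

def occlusion_map(image, patch=16):
    """Drop in predicted probability when each image patch is masked."""
    baseline = model(image)
    h, w = image.shape
    heat = np.zeros((h // patch, w // patch))
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            occluded = image.copy()
            occluded[i:i + patch, j:j + patch] = image.mean()
            heat[i // patch, j // patch] = baseline - model(occluded)
    return heat  # high values mark regions the prediction depends on

img = np.random.default_rng(0).random((64, 64))
print(occlusion_map(img).round(4))
```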
A limitation of the studies included in this systematic review is that image acquisition
was often performed on only one type of ultrasound machine, raising doubts about a
possible
transfer to general clinical usage. Furthermore, most studies included a limited spectrum
of
different FLL entities in their analysis, which reduces the applicability for clinical
practice. For example, FNHs were included only in a minority of B-mode studies (11%),
even
though they are one of the most common FLLs. While this might seem understandable,
since the
diagnosis of FNH is not based on B-mode ultrasound, but rather is a domain of CEUS,
it
certainly leads to a selection bias and raises doubts about the significance of the
reported
diagnostic accuracies.
A major limitation of almost all studies we reviewed is the lack of a sufficiently
large
database. The number of images an AI method is trained on directly affects its diagnostic
performance. Most studies, therefore, used augmentation techniques, such as mirroring
or
rotation of images, which cannot fully compensate for a lack of real data. In addition,
these small data sets lead to limitations concerning the testing process. As mentioned
above, an independent patient cohort was not used for testing in the majority of studies.
Testing can be performed by splitting all images into a training/validation and test
data
set. This can lead to images from one patient ending up in both data sets, therefore
resulting in an overestimation of testing accuracy. The issue can be addressed by
splitting
patients (and not images) into groups. A CEUS-based study, which compared the two
approaches, observed a drop in diagnostic accuracy from 95.7% to 56% [49]. Although this pronounced deterioration of accuracy certainly cannot be generalized, it must be assumed that some of the reported results are overestimated. This is especially true for small studies with a homogeneous set of data or patients.
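Group-aware cross-validation is one way to implement such a patient-level split; a minimal sketch with placeholder data (using scikit-learn, which provides GroupKFold for exactly this purpose):

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_images = 400
X = rng.normal(size=(n_images, 16))           # placeholder image features
y = rng.integers(0, 2, n_images)              # placeholder labels
patient_id = rng.integers(0, 80, n_images)    # ~5 images per patient

clf = SVC()
# Image-level split: images of the same patient can end up in both the
# training and the test fold, which risks overestimating accuracy.
acc_images = cross_val_score(clf, X, y,
                             cv=KFold(n_splits=5, shuffle=True, random_state=0))
# Patient-level split: folds never share a patient.
acc_patients = cross_val_score(clf, X, y, groups=patient_id, cv=GroupKFold(5))
print(acc_images.mean(), acc_patients.mean())
```

With strongly correlated images per patient, the image-level estimate is typically optimistic compared to the patient-level one.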
Another key issue is that all studies needed intervention by healthcare professionals
not only to perform ultrasound scans, but also to process the collected data further
(e.g.,
demarcation of the regions of interest (ROI)). This puts a significant part of diagnostic
ultrasound, i.e., differentiating the FLL from the liver parenchyma, back into human
hands.
Some studies have tried to solve this problem by developing algorithms that are able
to
identify the ROI. Liang et al. trained an AI algorithm to track FLLs and their corresponding
ROIs in CEUS clips automatically [54]. Nonetheless, they needed a physician with CEUS experience to identify the ROI at
the start of the clip. With this approach, they were able to achieve diagnostic accuracies
similar to studies with manual ROI placement (84.8% for the differentiation of FLL
entities
and 92.7% to distinguish benign from malignant FLLs). Also, there are studies that
are
solely focused on the detection of FLLs and not their classification (and were therefore
disregarded for the purpose of this review). They have shown promising results, indicating
that solutions addressing this issue seem possible [63]. For the implementation of AI techniques in the clinical routine, a combination
of
both techniques (detection and classification) would be ideal, as this would eliminate
possible bias introduced by the examiner.
In summary, although the reported diagnostic capabilities of AI for the diagnosis of FLLs are almost all good or excellent, concerns such as the lack of independent test sets and the exclusion of common FLL entities in almost all studies severely limit the real-world applicability of these data. Therefore, the pathway towards the implementation
of
AI in clinical ultrasound of the liver has many hurdles to overcome. User-friendly
AI-based
tools, which are built into ultrasound devices for specific questions such as “is
this a
malignant liver lesion?” could be a starting point. Ideally, real-world data from
the
application of these tools would be used to further improve AI performance in a continuous
learning approach. Data protection concerns will limit this kind of feedback loop
to
clinical trials. Therefore, large multicenter cohorts will be necessary to improve
AI-based
ultrasound techniques before a significant impact on clinical practice seems feasible.
In
the long term, AI-based approaches will need to integrate data from multiple sources
such as
ultrasound, radiology, histopathology, laboratory tests, and clinical information
to make a
diagnosis [64]. For now and the near future, the only viable field of use for AI in clinical
ultrasound seems to be to support (especially inexperienced) physicians in their decision
making.
A limitation of our review is the heterogeneity of the studies. Heterogeneity was
observed in all study parts, starting with the selection of patients or image databases.
Differences continued with respect to the pre-processing of images, extracted image
features, and types of AI that were used (e.g., CNN or SVM). Finally, as outlined
above,
testing of the diagnostic performance varied significantly. These differences severely
limit
the comparability of studies included in this systematic review.
Conclusion and Outlook
Data on the AI-based classification of ultrasound imaging of FLLs are promising. The diagnostic performance of AI-based classification can be improved by adding clinical data.
AI could serve as a supportive system for ultrasound examinations of the liver, especially
for
inexperienced examiners. The main weaknesses of the available studies are the limited
spectrum
of FLL entities and the lack of external validation. Moreover, in addition to technical
hurdles, regulatory hurdles must be overcome for a successful transfer of the technology
to
clinical practice. Large, cross-center ultrasound image databases could help to improve
the
diagnostic capabilities of AI-based classification systems.