Key words: gadoxetate disodium, Gd-EOB-DTPA, motion artifact, respiratory, interrater agreement, interrater reliability
Introduction
Gadoxetate disodium (Gd-EOB-DTPA, Primovist, Eovist, Bayer Healthcare) is a liver-specific
contrast agent that is taken up by hepatocytes, with subsequent biliary
excretion of approximately 50 % in patients with normal liver and kidney function
[1]. Based on its specific pharmacodynamic and pharmacokinetic properties, the use of
this contrast agent results in improved detection and characterization of focal liver
lesions not only in the non-cirrhotic but also in the cirrhotic liver. In this context,
proper arterial phase imaging is crucial, especially for lesion characterization.
Recently, an association has been described between the intravenous injection of gadoxetate
disodium and motion artifacts in the arterial phase of the contrast dynamic, which
has been termed acute transient severe motion (TSM) [2] [3]. This phenomenon is typically
self-limiting, lasts for about 10 to 20 seconds and may be accompanied by a subjective
feeling of transient dyspnea [4]. TSM-induced artifacts may severely degrade arterial
phase MR image quality, resulting in non-diagnostic images in the worst case. The exact
pathophysiology of this phenomenon is still unknown, and several patient-related as
well as MR-specific risk factors are being discussed [5]. More importantly, the reported
incidence of TSM is not consistent throughout the literature, covering a wide range
from 2.4 % up to 18 % [3] [6].
One possible explanation for this discrepancy might be the difficulty of differentiating
motion artifacts from other sources of image degradation, such as truncation artifacts
[7]. To the best of our knowledge, there is no prior study in the literature specifically
addressing the matter of interrater agreement and reliability in this context. In
order to reliably evaluate respiratory motion artifacts and TSM in larger studies
comprising multiple institutions with multiple readers, artifact scoring must be consistent
and robust.
To address this problem, the working group for abdominal imaging within the German
Roentgen Society (AG Gastrointestinal- und Abdominaldiagnostik, Deutsche Röntgengesellschaft)
initiated a multicenter study in which MRI examinations of more than 2000 patients
are being evaluated. As a prerequisite, the purpose of this study is to assess interrater
agreement and reliability among expert abdominal radiologists with respect to the
grading of arterial phase respiratory motion artifacts in gadoxetate disodium-enhanced
MRI by means of a 5-point score. The null hypothesis was that there is no significant
difference between multiple readers regarding the scoring of severe arterial phase
respiratory motion artifacts.
Materials and Methods
This multicenter study was approved by the local institutional review board (IRB)
of each participating center, with a waiver of informed patient consent granted for
the prospective analysis of retrospective data. Our pilot study was conducted in order
to test the robustness of a scoring system intended to be used in a large European
multicenter study, assessing the incidence and underlying risk factors of TSM on gadoxetate
disodium-enhanced MRI.
Selection and preparation of datasets
Two radiologists of the coordinating study center selected 40 gadoxetate disodium-enhanced
liver MRI datasets from 40 different patients (25 male, 15 female; mean age: 59.4 ± 15.9
years). The datasets were chosen to include examinations without as well as with respiratory
motion artifacts of varying severity. A single axial image in the arterial phase,
encompassing the upper abdomen at the level of the suprarenal aorta, was generated
from each dataset. Images were merged in random order into a single file for further
reading. In addition, an exemplary set of images (not including the study datasets)
demonstrating motion artifacts of varying degrees was presented to the readers. All
images were acquired on a 1.5 T scanner using the bolus detection technique and standard
dosing of 0.025 mmol/kg gadoxetate disodium injected at a flow rate of 1.5 ml/sec,
followed by a saline flush of 25 ml.
Characterization of readers
Eleven radiologists from 11 different institutions in Germany and Switzerland participated
in this trial. All readers were board-certified radiologists with substantial experience
in abdominal MR imaging, in order to ensure homogeneity of our study findings. Readers
were at least 5 years post board certification and had a minimum of 8 years of experience
regarding the interpretation of abdominal MRI. Notably, every reader had knowledge
of the appearance of respiratory motion artifacts and differentiation from other sources
of image degradation, such as truncation. To preserve anonymity, the order of the
readers’ appearance in the figures and tables is neither consistent throughout the
manuscript nor is it consistent with the authorship order.
Image evaluation
All readers independently assessed the prepared image datasets with regard to respiratory
motion-related artifacts using a 5-point scale. Readers were asked to ignore any other
artifacts they observed (e.g., truncation, pulsation). The scale was defined as follows.
Score 1: no motion-related artifact; Score 2: minimal motion-related artifact with no
effect on diagnostic quality; Score 3: moderate motion-related artifact with some, but
not severe effect on diagnostic quality; Score 4: severe motion-related artifact, but
images are still interpretable; Score 5: extensive motion-related artifact resulting in
non-diagnostic image quality ([Fig. 1]). This scoring system has been used in previous
studies [2] [8], but to our knowledge has not yet been validated in a large multicenter
and multireader setting. All radiologists were blinded to the ratings of the other radiologists. In
addition, four readers performed a second assessment of all datasets in order to evaluate
intrarater agreement. The interval between both reading sessions was longer than two
months in order to avoid any recall bias.
Fig. 1 Demonstration of motion-related artifacts in the arterial phase on gadoxetate disodium-enhanced
MRI. Motion-related artifacts were evaluated by means of a 5-point scale. Score 1:
no motion-related artifact (A); Score 2: minimal motion-related artifact with no effect
on diagnostic quality (B); Score 3: moderate motion-related artifact with some, but not
severe effect on diagnostic quality (C); Score 4: severe motion-related artifact, but
images are still interpretable (D); Score 5: extensive motion-related artifact resulting
in non-diagnostic image quality (E).
Statistical analysis
Statistical analysis was performed using SPSS software (version 22; SPSS; Chicago,
Illinois). Interrater agreement was defined as the extent to which different readers
assigned the same precise motion score on MRI datasets. The general trend in ratings
was addressed by means of interrater reliability, assessing the extent to which readers
could consistently distinguish between different motion scores [9]. For evaluation of
interrater agreement and interrater reliability, the intraclass correlation coefficient
(ICC) was calculated according to McGraw and Wong [10], applying a two-way mixed model.
In addition, the Kendall coefficient of concordance (W) was calculated for further
evaluation of the interrater agreement. The intrarater agreement was calculated
similarly. The ICC and Kendall W were interpreted as follows:
a value of 0.20 or less indicated poor agreement; a value of 0.21 – 0.40 fair agreement;
a value of 0.41 – 0.60 moderate agreement; a value of 0.61 – 0.80 substantial agreement;
and a value of 0.81 – 1.00 almost perfect agreement [11]. For all measurements, p < 0.05
was considered statistically significant.
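The distinction between agreement and reliability can be illustrated with a minimal Python sketch of two-way ICC point estimates derived from ANOVA mean squares. This is not the SPSS procedure used in the study; the function name `two_way_icc` and the toy ratings are assumptions made for illustration, and because the text does not state whether single or average measures were reported, both are returned. The absolute-agreement forms, ICC(A,·), penalize systematic offsets between readers, whereas the consistency forms, ICC(C,·), do not.

```python
def two_way_icc(ratings):
    """Two-way ICC point estimates from ANOVA mean squares.

    ratings: list of n rows (one per dataset), each a list of k reader scores.
    Returns single- and average-measures ICCs for absolute agreement (A)
    and consistency (C). Confidence intervals and p-values are not computed.
    """
    n = len(ratings)          # number of datasets (rows)
    k = len(ratings[0])       # number of readers (columns)
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares for datasets, readers, and the residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_err = sum(
        (ratings[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n)
        for j in range(k)
    )

    msr = ss_rows / (n - 1)               # between-dataset mean square
    msc = ss_cols / (k - 1)               # between-reader mean square
    mse = ss_err / ((n - 1) * (k - 1))    # residual mean square

    return {
        "ICC(C,1)": (msr - mse) / (msr + (k - 1) * mse),
        "ICC(A,1)": (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n),
        "ICC(C,k)": (msr - mse) / msr,
        "ICC(A,k)": (msr - mse) / (msr + (msc - mse) / n),
    }


# Toy example (hypothetical data): the second reader always scores one point
# higher, so the ordering is perfectly consistent but the scores never match.
toy = [[1, 2], [2, 3], [3, 4]]
for name, value in two_way_icc(toy).items():
    print(name, round(value, 3))
```

In the toy example, the consistency ICCs equal 1 while the absolute-agreement ICCs are lower, which mirrors the reason for reporting the two quantities separately.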
Results
Scoring of motion artifacts
All readers assigned motion scores ranging from 1 to 5. The median motion score assigned
by each reader over all 40 datasets was either 2 or 3. In only one case (2.5 %),
with extensive motion artifacts and non-diagnostic image quality (score 5), did all readers
assign the same motion score. In 6 cases (15 %), 10 out of 11 readers assigned the
same motion score. Clinically irrelevant motion artifacts, defined as a mean motion
score ≤ 3 on arterial phase images, were observed in 28 patients. Among these cases,
motion artifacts were rated with a score ≤ 3 by all readers in 79 % of cases (n = 22
out of 28 cases). Severe or extensive motion artifacts, defined as a mean motion score
≥ 4 in the arterial phase, were observed in 12 patients. In these specific cases,
a motion score ≥ 4 was assigned by all readers in 75 % of cases (n = 9 out of 12 cases)
([Table 1 ], [Fig. 2 ]).
Table 1
Rating results of motion artifacts on gadoxetate disodium-enhanced arterial phase
MRI, as assessed individually by 11 radiologists (R01 – 11) on a 5-point scale. In
addition, the median motion score of each reader is provided, as well as the median
motion score for each dataset including the percentage agreement for that specific
score (% Ag). Data are sorted according to the median.
dataset  R01  R02  R03  R04  R05  R06  R07  R08  R09  R10  R11  median  % Ag
40       5    5    5    5    5    5    5    5    5    5    5    5       100
14       5    5    4    4    5    5    5    5    5    5    5    5       81.8
21       4    5    5    4    4    4    5    5    5    5    5    5       63.6
11       4    5    4    4    4    5    4    5    5    5    5    5       54.5
38       5    5    4    4    4    4    5    4    5    5    5    5       54.5
15       4    4    4    4    4    4    4    4    4    5    4    4       90.9
26       4    4    4    4    4    4    4    4    4    4    5    4       90.9
10       3    4    4    4    4    4    4    4    4    5    5    4       72.7
23       4    5    4    4    4    5    4    4    5    4    4    4       72.7
1        3    4    4    3    2    4    4    4    3    4    4    4       63.6
13       4    4    4    3    3    4    5    4    4    5    4    4       63.6
5        4    4    3    4    4    4    4    3    3    5    3    4       54.5
4        3    4    3    3    3    3    3    3    3    4    3    3       81.8
18       3    3    2    3    2    2    3    3    3    3    3    3       72.7
12       3    3    2    3    2    2    3    3    3    3    4    3       63.6
25       3    4    3    3    3    4    5    3    3    3    4    3       63.6
9        3    2    3    3    2    4    3    2    3    3    4    3       54.5
32       2    3    2    2    2    3    3    2    3    3    4    3       45.5
34       2    3    3    3    2    2    3    2    3    2    4    3       45.5
27       2    3    2    2    2    2    2    2    2    2    2    2       90.9
31       2    2    2    2    2    2    2    1    2    2    2    2       90.9
17       2    3    2    2    2    3    2    2    2    2    2    2       81.8
35       2    2    1    2    2    2    2    2    2    2    3    2       81.8
39       2    3    1    2    2    2    2    2    2    2    2    2       81.8
16       2    2    1    2    2    2    2    2    2    1    1    2       72.7
6        2    2    1    1    2    2    2    2    1    1    2    2       63.6
7        2    3    2    2    2    3    4    2    3    2    2    2       63.6
29       2    2    2    1    2    2    2    3    2    3    3    2       63.6
33       2    2    2    3    2    1    2    2    3    2    3    2       63.6
8        2    2    2    3    2    3    3    1    2    2    3    2       54.5
22       1    2    1    1    2    1    2    1    2    2    2    2       54.5
36       1    2    1    1    1    2    1    2    2    2    2    2       54.5
37       2    3    2    1    1    3    2    1    1    2    3    2       36.4
2        1    2    1    1    1    1    1    1    1    1    1    1       90.9
3        1    1    1    1    1    1    1    2    1    1    1    1       90.9
30       1    1    1    2    1    1    1    2    1    1    1    1       81.8
24       1    1    1    1    1    1    2    2    1    1    2    1       72.7
20       1    2    1    1    1    1    1    1    2    2    2    1       63.6
19       1    2    1    2    1    1    1    2    2    1    2    1       54.5
28       1    2    1    2    1    1    2    1    2    2    1    1       54.5
median   2    3    2    3    2    3    3    2    3    2    3
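The median and percentage agreement (% Ag) columns of Table 1 can be reproduced per dataset with a few lines of Python. The sketch below is illustrative only; the function name `median_and_agreement` is an assumption, and % Ag is taken to be the share of readers whose score equals the dataset median, which matches the tabulated values for the two datasets used as examples.

```python
from statistics import median


def median_and_agreement(scores):
    """Return (median score, % of readers who assigned exactly that score).

    With an odd number of readers (11 here) the median is always one of the
    assigned scores; with an even number it could fall between two scores,
    in which case no reader would match it.
    """
    med = median(scores)
    matching = sum(1 for s in scores if s == med)
    return med, round(100 * matching / len(scores), 1)


# Reader scores R01..R11 copied from Table 1.
dataset_37 = [2, 3, 2, 1, 1, 3, 2, 1, 1, 2, 3]
dataset_40 = [5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5]

print(median_and_agreement(dataset_37))  # (2, 36.4)
print(median_and_agreement(dataset_40))  # (5, 100.0)
```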
Fig. 2 Scoring of motion artifacts as assessed by 11 radiologists on a 5-point scale. Presented
are the mean values and the range of motion scores separately for all 40 datasets. The
horizontal line indicates the cut-off (≥ 4) at or above which an artifact is considered severe.
Interrater agreement and reliability
The interrater agreement, defined as the extent to which different readers assigned
the same precise motion score and as assessed by means of the ICC, was 0.983 (95 %
confidence interval 0.973 – 0.990; p < 0.0001). The Kendall W for assessment of interrater
agreement was 0.865 (p < 0.0001). Both values indicated almost perfect interrater
agreement regarding the rating of motion artifacts on arterial phase gadoxetate
disodium-enhanced MRI. The interrater reliability, assessing the extent to which readers
could consistently distinguish between different motion scores, was very high as well,
with an ICC of 0.985 (95 % confidence interval 0.978 – 0.991; p < 0.0001). Image
examples are presented in [Fig. 3].
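For readers who wish to see how a coefficient of concordance is obtained from a ratings matrix, the following Python sketch computes a tie-corrected Kendall W and maps a coefficient to the verbal categories given in the statistical analysis section. It is a sketch under stated assumptions rather than the SPSS implementation used here: the helper names (`average_ranks`, `kendall_w`, `interpret`) and the toy matrices are hypothetical, the tie correction is the textbook form, and values at or below 0.20 are assumed to fall into the "poor" category.

```python
from collections import Counter


def average_ranks(values):
    """Rank values from 1..n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks


def kendall_w(ratings):
    """Tie-corrected Kendall W; rows are datasets, columns are readers."""
    n = len(ratings)       # datasets being ranked
    m = len(ratings[0])    # readers
    rank_sums = [0.0] * n
    tie_term = 0
    for j in range(m):
        column = [row[j] for row in ratings]
        for i, r in enumerate(average_ranks(column)):
            rank_sums[i] += r
        # Sum of (t^3 - t) over groups of tied scores for this reader.
        tie_term += sum(t ** 3 - t for t in Counter(column).values())
    mean_sum = m * (n + 1) / 2
    s = sum((r - mean_sum) ** 2 for r in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n) - m * tie_term)


def interpret(value):
    """Verbal category per the thresholds listed in the statistical analysis."""
    if value <= 0.20:
        return "poor"
    if value <= 0.40:
        return "fair"
    if value <= 0.60:
        return "moderate"
    if value <= 0.80:
        return "substantial"
    return "almost perfect"


# Toy matrices (hypothetical data): identical vs. exactly reversed orderings.
same_order = [[1, 1], [1, 1], [2, 2]]
reversed_order = [[1, 3], [2, 2], [3, 1]]
print(kendall_w(same_order), kendall_w(reversed_order))
print(interpret(0.865), "/", interpret(0.983))
```

Because a 5-point score applied to 40 datasets necessarily produces many tied ranks within each reader, the tie correction is not a minor detail in this setting.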
Fig. 3 Image examples of different degrees and scoring of respiratory motion artifacts.
(A) Case of extensive motion-related artifact (dataset #40) that was scored by all readers
with “5”, indicating perfect agreement and no variability. (B) Case (dataset #37) in which
readers assigned scores between “1” and “3”, demonstrating higher variability and less agreement.
Intrarater agreement
The intrarater agreement among all four radiologists was almost perfect, with a mean
ICC of 0.935 (range: 0.886 – 0.980) and a mean 95 % confidence interval of 0.873 – 0.966
(range: 0.781 – 0.989; p < 0.0001 for all readers). Similarly, Kendall W for assessment
of intrarater agreement was very good with a mean of 0.935 (range: 0.912 – 0.975;
p ≤ 0.001 for all readers).
Discussion
In this multicenter study, we observed high interrater agreement and reliability for
the assessment of TSM on arterial phase gadoxetate disodium-enhanced MRI. Results
were substantiated by an almost perfect intrarater agreement, which has, to the best
of our knowledge, not been specifically evaluated in the context of arterial phase
motion artifacts. Due to the possible detrimental effects of respiratory motion on
dynamic liver MRI, robust characterization and scoring in large multicenter studies
are essential for the evaluation of this poorly understood phenomenon. It needs to be
emphasized that we assessed interrater agreement and reliability separately, two terms
that are often incorrectly used interchangeably throughout the literature. While agreement
is defined as the degree to which ratings given by different judges (here: motion
artifact scores assigned by different readers) are identical, reliability refers to
the consistency of ratings and the extent of variability [9]. Our findings could thus
contribute to better interpretation and understanding of
motion artifact scoring in multireader and multicenter studies.
The scoring system for the assessment of motion artifacts used in our study has been
described in previous smaller studies with two to five readers only, with a high interrater
agreement and reliability. Davenport et al. reported good agreement for the scoring
of motion in the arterial phase between two readers with an ICC of 0.90 [12]. Kim et al.
presented comparable results in a two-reader setting with an ICC ranging
from 0.87 to 0.97 for different phases of the contrast dynamic [13]. In the initial
study conducted by Davenport and colleagues, excellent reliability
among 5 readers for the scoring of motion was reported with an ICC between 0.85 and
0.95 for different contrast phases. Results regarding interrater or intrarater agreement
were not presented [2]. Pietryga et al., in contrast, calculated interrater agreement,
and not reliability, among five readers. The ICCs for motion scores ranged from moderate
for the pre-contrast phases (ICC = 0.53) to excellent for the second arterial phase
(ICC = 0.90) [8]. The results of these previous studies are in line with those of our present study.
However, in most of these earlier studies readers were from the same institution evaluating
their own datasets, which constitutes a potential bias.
Looking at the motion scores in our study in detail, all readers assigned the same
score in only one case. Specifically, this was a case with extensive motion artifacts
and non-diagnostic image quality. Taking this into account, one could hypothesize
that a coarser score with fewer categories (e.g., 1 – 3 instead of 1 – 5) could be
sufficiently robust to evaluate motion artifacts on gadoxetate disodium-enhanced MRI.
On the other hand, we were able to demonstrate that the applied scoring system is robust
and practical, and that high interrater agreement and reliability can be achieved in a
multicenter setting if a standardized scoring system is used.
Nonetheless, our study also has limitations. First, only one phase of the contrast
dynamic, namely the late arterial phase, was evaluated. We chose to focus on this specific
phase as it is the most important phase when it comes to evaluating transient severe
motion on gadoxetate disodium-enhanced MRI. On the other hand, rating a single
phase can also be considered a strength of this study, as the reader does not have
any other phases or images for comparison, which would otherwise facilitate image evaluation.
Second, readers were asked to score motion artifacts only. Other artifacts, which
may also cause image degradation, were not scored specifically. Motion artifacts need
to be differentiated especially from truncation or ringing artifacts (also known as
Gibbs artifacts), which originate from vessels and decay with distance from the
source. Motion artifacts, however, are located randomly throughout the image, extending
into the noise outside the body [7] [14]. One possible explanation for the discrepancy
regarding the reported incidence of
TSM within the literature might be the difficulty in differentiating between these
types of artifacts. The results of our present study, however, show that motion artifacts
may be differentiated and graded reliably if experienced radiologists perform the
assessment.
In conclusion, our results support the null hypothesis that there is no significant difference
between multiple readers from different institutions regarding the assessment of severe
respiratory motion artifacts. The consistency of rating, as demonstrated by our study
results, may have implications for future studies, especially those in which subjective
assessment of image quality and artifacts is part of the evaluation process. Our results
will strengthen the scientific value of an envisaged large European multicenter
study aiming to assess the incidence of and underlying risk factors for transient
severe motion artifacts on gadoxetate disodium-enhanced MRI.