Methods Inf Med 2012; 51(06): 489-494
DOI: 10.3414/ME12-01-0005
Original Articles
Schattauer GmbH

Measuring Inter-observer Agreement in Contour Delineation of Medical Imaging in a Dummy Run Using Fleiss’ Kappa[*]

G. Rücker 1, T. Schimek-Jasch 2, U. Nestle 2

1 Institute of Medical Biometry and Medical Informatics, University Medical Center Freiburg, Freiburg, Germany
2 Department of Radiology, University Medical Center Freiburg, Freiburg, Germany

Publication History

Received: 11 January 2012
Accepted: 03 July 2012
Publication Date: 20 January 2018 (online)

Summary

Background: In medical imaging used for planning of radiation therapy, observers delineate contours of a treatment volume in a series of images of uniform slice thickness.

Objective: To summarize agreement in contouring among an arbitrary number of observers by a single number, we generalized the kappa index proposed by Zijdenbos et al. (1994).
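
For orientation, the two-observer index being generalized can be sketched as follows (notation ours, not the paper’s): for the contoured voxel sets A and B of two observers, the kappa index of Zijdenbos et al. corresponds to the Dice-type similarity index

κ ≈ 2 |A ∩ B| / (|A| + |B|),

an approximation that holds when the background volume is much larger than the contoured volumes.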

Methods: Observers characterized voxels by allocating them to one of two categories, inside or outside the contoured region. Fleiss’ kappa was used to measure agreement among n indistinguishable observers. Given the number V_i of voxels contoured by exactly i observers (i = 1, …, n), the resulting overall kappa is representable as a ratio of weighted sums of the V_i.
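
As an illustration of this representation, the following is a minimal Python sketch of Fleiss’ kappa computed directly from the counts V_i. It assumes the voxel universe is restricted to the union of all contoured voxels, so every voxel carries at least one “inside” rating; the function name and the example counts are illustrative, not taken from the paper.

```python
import numpy as np

def overall_kappa(v, n):
    """Fleiss' kappa for two categories (inside/outside a contour),
    computed from v[i-1] = V_i, the number of voxels contoured by
    exactly i of the n observers (i = 1, ..., n).

    Assumption (not from the paper): voxels contoured by no observer
    are excluded from the voxel universe.
    """
    v = np.asarray(v, dtype=float)
    i = np.arange(1, n + 1)   # possible numbers of "inside" ratings
    N = v.sum()               # total number of voxels considered

    # Marginal proportion of "inside" ratings among all n*N ratings
    p_in = (i * v).sum() / (n * N)

    # Observed agreement: proportion of agreeing observer pairs per
    # voxel, averaged over voxels (Fleiss' formula with k = 2)
    agree = (i * (i - 1) + (n - i) * (n - i - 1)) / (n * (n - 1))
    p_obs = (v * agree).sum() / N

    # Chance agreement from the marginal proportions; kappa is
    # undefined if p_exp == 1 (all ratings in one category)
    p_exp = p_in ** 2 + (1.0 - p_in) ** 2

    return (p_obs - p_exp) / (1.0 - p_exp)

# Hypothetical counts for n = 3 observers: 120 voxels contoured by
# exactly one observer, 80 by exactly two, 300 by all three.
print(overall_kappa([120, 80, 300], n=3))  # ~0.21
```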

Results: Overall kappa was applied to analyze inter-center variations in a multicenter trial on radiotherapy planning in patients with locally advanced lung cancer. A contouring dummy run was performed within the quality assurance program. Contouring was done twice, once before and once after a training program. Observer agreement increased from 0.59 (95% confidence interval (CI) 0.51–0.67) to 0.69 (95% CI 0.59–0.78).

Conclusion: In contrast to averaged pairwise indices, overall kappa measures observer agreement for more than two observers using the full information about overlapping volumes, while not distinguishing between observers. It is particularly suitable for measuring observer agreement when identification of observers is not possible or not desirable and when there is no gold standard.

* Supplementary material published on our website www.methods-online.com


 
References

1. Michalski JM, Lawton C, El Naqa I, Ritter M, O’Meara E, Seider MJ, et al. Development of RTOG consensus guidelines for the definition of the clinical target volume for postoperative conformal radiation therapy for prostate cancer. International Journal of Radiation Oncology, Biology, Physics 2010; 76 (02): 361-368.
2. Lim K, Small W Jr, Portelance L, Creutzberg C, Jürgenliemk-Schulz IM, Mundt A, et al. Consensus guidelines for delineation of clinical target volume for intensity-modulated pelvic radiotherapy for the definitive treatment of cervix cancer. International Journal of Radiation Oncology, Biology, Physics 2011; 79 (02): 348-355.
3. Zijdenbos AP, Dawant BM, Margolin RA, Palmer AC. Morphometric analysis of white matter lesions in MR images: method and validation. IEEE Transactions on Medical Imaging 1994; 13 (04): 716-724.
4. Gibon D, Viard R, Rodrigues C. Contour comparison metrics. Loos-les-Lille, France: AQUILAB SAS; 2009. Available from: http://www.aquilab.com
5. Nestle U, Fleckenstein J, Kremp S, Schaefer-Schuler A, Hellwig D, Groeschel A, et al. PET-Plan NSCLC: first impressions from the pilot study. European Journal of Nuclear Medicine and Molecular Imaging 2007; 34: 141.
6. Schimek-Jasch T, Gibon D, Astner ST, Badakhshi H, Bultel YP, Hehr T, et al. Target volume delineation in locally advanced NSCLC: multicenter contouring dummy run. Strahlentherapie und Onkologie 2010; 186: 144.
7. Schimek-Jasch T, Rücker G, Vach W, König J, Jacob V, Götz T, et al. Contour comparison metrics for two or more observers: optimizing the evaluation of agreement in a contouring dummy run by identifying and resolving statistical pitfalls in the framework of a multicenter study. Strahlentherapie und Onkologie 2011; 187: 132.
8. Cohen J. A coefficient of agreement for nominal scales. Educational and Psychological Measurement 1960; 20 (01): 37-46.
9. Cohen J. Weighted kappa: nominal scale agreement with provision for scaled disagreement or partial credit. Psychological Bulletin 1968; 70: 213-220.
10. Gefeller O, Brenner H. How to correct for chance agreement in the estimation of sensitivity and specificity of diagnostic tests. Methods Inf Med 1994; 33: 180-186.
11. Neveu D, Aubas P, Seguret F, Kramar A, Dujols P. Measuring agreement for ordered ratings in 3×3 tables. Methods Inf Med 2006; 45 (05): 541-547.
12. Fleiss JL. Measuring nominal scale agreement among many raters. Psychological Bulletin 1971; 76 (05): 378-382.
13. Fleiss JL, Cohen J, Everitt BS. Large sample standard errors of kappa and weighted kappa. Psychological Bulletin 1969; 72: 323-327.
14. Fleiss JL, Nee JCM, Landis JR. Large sample variance of kappa in the case of different sets of raters. Psychological Bulletin 1979; 86 (05): 974-977.
15. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977; 33 (01): 159-174.
16. Fleiss' kappa. Wikipedia; 2011. Accessed October 18, 2011. Available from: http://en.wikipedia.org/wiki/Fleiss_kappa
17. Krummenauer F. Methoden zur Evaluation bildgebender Verfahren von begrenzter Reproduzierbarkeit [Methods for the evaluation of imaging procedures with limited reproducibility]. Aachen: Shaker; 2005.
18. Albert PS, Dodd LE. A cautionary note on the robustness of latent class models for estimating diagnostic error without a gold standard. Biometrics 2004; 60 (02): 427-435.
19. Pepe MS, Janes H. Insights into latent class analysis of diagnostic test performance. Biostatistics 2007; 8 (02): 474-484.
20. Warfield SK, Zou KH, Wells WM. Simultaneous truth and performance level estimation (STAPLE): an algorithm for the validation of image segmentation. IEEE Transactions on Medical Imaging 2004; 23 (07): 903-921.
21. Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological) 1977; 39 (01): 1-38.