Inequalities between multi-rater kappas. (English) Zbl 1284.62338

Summary: The paper presents inequalities between four descriptive statistics that have been used to measure the nominal agreement between two or more raters. Each of the four statistics is a function of the pairwise information. Light’s kappa and Hubert’s kappa are multi-rater versions of Cohen’s kappa. Fleiss’ kappa is a multi-rater extension of Scott’s pi, whereas Randolph’s kappa generalizes the \(S\) of Bennett et al. to multiple raters. While a consistent ordering between the numerical values of these agreement measures has frequently been observed in practice, there has thus far been no theoretical proof of a general ordering inequality among these measures. It is proved that Fleiss’ kappa is a lower bound of Hubert’s kappa and Randolph’s kappa, and that Randolph’s kappa is an upper bound of Hubert’s kappa and Light’s kappa if all pairwise agreement tables are weakly marginal symmetric or if all raters assign a certain minimum proportion of the objects to a specified category.
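
To make "a function of the pairwise information" concrete, the following is a minimal Python sketch of the four statistics. It assumes the commonly used pairwise formulations (every rater classifies every object; chance agreement for Light and Hubert is the Cohen-type product of the two raters' marginals, for Fleiss the squared pooled marginals, and for Randolph \(1/k\)). These formulas, the function name `multirater_kappas`, and the toy data are illustrative assumptions and are not taken from this record.

```python
from itertools import combinations


def multirater_kappas(ratings, categories):
    """Return Light's, Hubert's, Fleiss' and Randolph's kappa.

    ratings[g][o] is the category rater g assigned to object o; every
    rater is assumed to have rated every object (hypothetical layout).
    """
    m, n, k = len(ratings), len(ratings[0]), len(categories)
    # Marginal proportions: share of objects each rater put in each category.
    marg = [{c: row.count(c) / n for c in categories} for row in ratings]
    pairs = list(combinations(range(m), 2))

    # Pairwise observed agreement and Cohen-type chance agreement.
    obs = [sum(a == b for a, b in zip(ratings[g], ratings[h])) / n
           for g, h in pairs]
    exp = [sum(marg[g][c] * marg[h][c] for c in categories)
           for g, h in pairs]
    mean_obs = sum(obs) / len(pairs)
    mean_exp = sum(exp) / len(pairs)

    # Light: average of the pairwise Cohen's kappas
    # (undefined if some pair has chance agreement exactly 1).
    light = sum((o - e) / (1 - e) for o, e in zip(obs, exp)) / len(pairs)
    # Hubert: correct the averaged observed agreement by the averaged
    # pairwise chance agreement.
    hubert = (mean_obs - mean_exp) / (1 - mean_exp)
    # Fleiss: chance agreement from marginals pooled over all raters.
    pooled = [sum(mg[c] for mg in marg) / m for c in categories]
    exp_fleiss = sum(p * p for p in pooled)
    fleiss = (mean_obs - exp_fleiss) / (1 - exp_fleiss)
    # Randolph: chance agreement fixed at 1/k (free marginals).
    randolph = (mean_obs - 1 / k) / (1 - 1 / k)
    return {"light": light, "hubert": hubert,
            "fleiss": fleiss, "randolph": randolph}


# Toy data (made up): 3 raters, 6 objects, 2 categories.
ratings = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 1],
]
example = multirater_kappas(ratings, [0, 1])
for name, value in example.items():
    print(f"{name:9s} {value:.4f}")
```

On this toy input Fleiss' kappa does not exceed Hubert's or Randolph's, in line with the unconditional bounds stated in the summary. The bounds involving Randolph's kappa as an upper bound are conditional on the marginal assumptions named there, so the sketch does not check them.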


62H17 Contingency tables
62H20 Measures of association (correlation, canonical correlation, etc.)
62P25 Applications of statistics to social sciences


[1] Artstein R, Poesio M (2005) Kappa\(^3\) = Alpha (or Beta). NLE Technical Note 05-1, University of Essex
[2] Banerjee M, Capozzoli M, McSweeney L, Sinha D (1999) Beyond kappa: a review of interrater agreement measures. Can J Stat 27: 3–23 · Zbl 0929.62117
[3] Bennett EM, Alpert R, Goldstein AC (1954) Communications through limited response questioning. Public Opin Q 18: 303–308
[4] Berry KJ, Mielke PW (1988) A generalization of Cohen’s kappa agreement measure to interval measurement and multiple raters. Educ Psychol Meas 48: 921–933
[5] Brennan RL, Prediger DJ (1981) Coefficient kappa: some uses, misuses, and alternatives. Educ Psychol Meas 41: 687–699
[6] Cohen J (1960) A coefficient of agreement for nominal scales. Educ Psychol Meas 20: 37–46
[7] Cohen J (1968) Weighted kappa: nominal scale agreement with provision for scaled disagreement or partial credit. Psychol Bull 70: 213–220
[8] Conger AJ (1980) Integration and generalization of kappas for multiple raters. Psychol Bull 88: 322–328
[9] Craig RT (1981) Generalization of Scott’s index of intercoder agreement. Public Opin Q 45: 260–264
[10] Davies M, Fleiss JL (1982) Measuring agreement for multinomial data. Biometrics 38: 1047–1051 · Zbl 0501.62045
[11] De Mast J (2007) Agreement and kappa-type indices. Am Stat 61: 148–153 · Zbl 05680729
[12] Di Eugenio B, Glass M (2004) The kappa statistic: a second look. Comput Linguist 30: 95–101 · Zbl 1234.68406
[13] Dou W, Ren Y, Wu Q, Ruan S, Chen Y, Bloyet D, Constans J-M (2007) Fuzzy kappa for the agreement measure of fuzzy classifications. Neurocomputing 70: 726–734
[14] Fleiss JL (1971) Measuring nominal scale agreement among many raters. Psychol Bull 76: 378–382
[15] Gwet KL (2008) Variance estimation of nominal-scale inter-rater reliability with random selection of raters. Psychometrika 73: 407–430 · Zbl 1301.62120
[16] Heuvelmans APJM, Sanders PF (1993) Beoordelaarsovereenstemming. In: Eggen TJHM, Sanders PF (eds) Psychometrie in de Praktijk. Cito Instituut voor Toetsontwikkeling, Arnhem, pp 443–470
[17] Hsu LM, Field R (2003) Interrater agreement measures: comments on kappa\(_n\), Cohen’s kappa, Scott’s \(\pi\) and Aickin’s \(\alpha\). Underst Stat 2: 205–219
[18] Hubert L (1977) Kappa revisited. Psychol Bull 84: 289–297
[19] Janes CL (1979) An extension of the random error coefficient of agreement to \(N \times N\) tables. Br J Psychiatry 134: 617–619
[20] Janson H, Olsson U (2001) A measure of agreement for interval or nominal multivariate observations. Educ Psychol Meas 61: 277–289
[21] Janson S, Vegelius J (1979) On generalizations of the G index and the Phi coefficient to nominal scales. Multivar Behav Res 14: 255–269
[22] Kraemer HC (1979) Ramifications of a population model for \(\kappa\) as a coefficient of reliability. Psychometrika 44: 461–472 · Zbl 0425.62088
[23] Kraemer HC (1980) Extensions of the kappa coefficient. Biometrics 36: 207–216 · Zbl 0463.62103
[24] Kraemer HC, Periyakoil VS, Noda A (2002) Tutorial in biostatistics: kappa coefficients in medical research. Stat Med 21: 2109–2129
[25] Krippendorff K (1987) Association, agreement, and equity. Qual Quant 21: 109–123
[26] Landis JR, Koch GG (1977) The measurement of observer agreement for categorical data. Biometrics 33: 159–174 · Zbl 0351.62039
[27] Light RJ (1971) Measures of response agreement for qualitative data: some generalizations and alternatives. Psychol Bull 76: 365–377
[28] Mitrinović DS (1964) Elementary inequalities. P. Noordhoff, Groningen
[29] O’Malley FP, Mohsin SK, Badve S, Bose S, Collins LC, Ennis M, Kleer CG, Pinder SE, Schnitt SJ (2006) Interobserver reproducibility in the diagnosis of flat epithelial atypia of the breast. Mod Pathol 19: 172–179
[30] Popping R (1983) Overeenstemmingsmaten voor nominale data. PhD thesis, Rijksuniversiteit Groningen, Groningen
[31] Randolph JJ (2005) Free-marginal multirater kappa (multirater \(\kappa_{\text{free}}\)): an alternative to Fleiss’ fixed-marginal multirater kappa. Paper presented at the Joensuu Learning and Instruction Symposium, Joensuu, Finland
[32] Schouten HJA (1980) Measuring agreement among many observers. Biom J 22: 497–504 · Zbl 0452.62087
[33] Schouten HJA (1982) Measuring pairwise agreement among many observers. Biom J 24: 431–435 · Zbl 0491.62093
[34] Schouten HJA (1986) Nominal scale agreement among observers. Psychometrika 51: 453–466
[35] Scott WA (1955) Reliability of content analysis: the case of nominal scale coding. Public Opin Q 19: 321–325
[36] Vanbelle S, Albert A (2009) A note on the linearly weighted kappa coefficient for ordinal scales. Stat Methodol 6: 157–163 · Zbl 1220.62172
[37] Warrens MJ (2008a) On similarity coefficients for \(2 \times 2\) tables and correction for chance. Psychometrika 73: 487–502 · Zbl 1301.62125
[38] Warrens MJ (2008b) Bounds of resemblance measures for binary (presence/absence) variables. J Classif 25: 195–208 · Zbl 1276.62044
[39] Warrens MJ (2008c) On association coefficients for \(2 \times 2\) tables and properties that do not depend on the marginal distributions. Psychometrika 73: 777–789 · Zbl 1284.62762
[40] Warrens MJ (2008d) On the equivalence of Cohen’s kappa and the Hubert-Arabie adjusted Rand index. J Classif 25: 177–183 · Zbl 1276.62043
[41] Warrens MJ (2008e) On the indeterminacy of resemblance measures for (presence/absence) data. J Classif 25: 125–136 · Zbl 1260.62052
[42] Warrens MJ (2010a) Inequalities between kappa and kappa-like statistics for \(k \times k\) tables. Psychometrika 75: 176–185 · Zbl 1272.62138
[43] Warrens MJ (2010b) A formal proof of a paradox associated with Cohen’s kappa. J Classif (in press) · Zbl 1337.62143
[44] Warrens MJ (2010c) Cohen’s kappa can always be increased and decreased by combining categories. Stat Methodol 7: 673–677 · Zbl 1232.62161
[45] Warrens MJ (2010d) A Kraemer-type rescaling that transforms the odds ratio into the weighted kappa coefficient. Psychometrika 75: 328–330 · Zbl 1234.62088
[46] Zwick R (1988) Another look at interrater agreement. Psychol Bull 103: 374–378