Cohen's kappa
Cohen's kappa coefficient (κ) is a statistic that measures inter-rater reliability for qualitative (categorical) items. It is generally thought to be a more robust measure than a simple percent-agreement calculation, because κ takes into account the possibility of agreement occurring by chance. Cohen's kappa measures the agreement between two raters who each classify N items into C mutually exclusive categories. The formula for κ is:
\[ \kappa = \frac{p_o - p_e}{1 - p_e} \]
where \(p_o\) is the relative observed agreement among raters (proportion of items where both raters agree), and \(p_e\) is the hypothetical probability of chance agreement, using the observed data to calculate the probabilities of each observer randomly saying each category. If the raters are in complete agreement then κ = 1. If there is no agreement among the raters other than what would be expected by chance (as given by \(p_e\)), κ = 0.
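As a purely illustrative example with hypothetical values: if the raters agree on 70% of the items, so \(p_o = 0.70\), and the chance agreement computed from their marginal totals is \(p_e = 0.50\), then
\[ \kappa = \frac{0.70 - 0.50}{1 - 0.50} = 0.40 \]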
Background
The kappa statistic was introduced by Jacob Cohen in 1960 as a measure of agreement for nominal scales. It is used in various fields such as healthcare, where it helps to assess the reliability of diagnostic tests, and in machine learning, where it is used to measure the performance of classification algorithms.
Calculation
To calculate Cohen's kappa, the number of categories into which assignments can be made must be fixed, and the assignments of each item into these categories by the two raters must be known. The formula involves the calculation of several probabilities:
- \(p_o\), the observed agreement, is calculated by summing the proportions of items that both raters agree on.
- \(p_e\), the expected agreement by chance, is calculated by considering the agreement that would occur if both raters assign categories randomly, based on the marginal totals of the categories.
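The calculation can be sketched directly from two raters' label lists. The following Python is a minimal illustrative sketch, not a reference implementation; the data and names are hypothetical.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Compute Cohen's kappa for two equal-length lists of categorical labels."""
    n = len(rater_a)

    # Observed agreement p_o: proportion of items on which the raters agree.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Marginal category counts for each rater.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)

    # Chance agreement p_e: for each category, the product of the two raters'
    # marginal proportions, summed over all categories.
    categories = set(rater_a) | set(rater_b)
    p_e = sum((counts_a[c] / n) * (counts_b[c] / n) for c in categories)

    return (p_o - p_e) / (1 - p_e)

# Hypothetical ratings of 10 items by two raters.
rater_1 = ["yes", "yes", "no", "yes", "no", "no", "yes", "yes", "no", "yes"]
rater_2 = ["yes", "no",  "no", "yes", "no", "yes", "yes", "yes", "no", "no"]
print(round(cohens_kappa(rater_1, rater_2), 2))  # 0.4 (p_o = 0.7, p_e = 0.5)
```

Note that the formula is undefined when \(p_e = 1\), i.e. when both raters always assign the same single category.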
Interpretation
The value of κ can be interpreted as follows:
- A κ of 1 indicates perfect agreement.
- A κ less than 1 but greater than 0 indicates partial agreement.
- A κ of 0 indicates no agreement better than chance.
- A κ less than 0 indicates less agreement than expected by chance.
Landis and Koch (1977) provided a commonly used interpretation of the kappa statistic, characterizing values ≤ 0 as indicating no agreement, 0.01–0.20 as slight, 0.21–0.40 as fair, 0.41–0.60 as moderate, 0.61–0.80 as substantial, and 0.81–1.00 as almost perfect agreement.
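For reporting purposes, these bands can be expressed as a small lookup function. The sketch below is only illustrative; how values falling exactly on a boundary are assigned is a choice made here, not part of the original scale.

```python
def landis_koch_label(kappa):
    """Map a kappa value to the Landis and Koch (1977) descriptive band."""
    if kappa <= 0:
        return "no agreement"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"

print(landis_koch_label(0.40))  # "fair"
```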
Limitations
While Cohen's kappa is a widely used and informative statistic for measuring inter-rater reliability, it has its limitations. It assumes that the two raters have equal status and that the categories are mutually exclusive. Furthermore, kappa may be affected by several factors including the number of categories, the distribution of observations across these categories, and the prevalence of the condition being rated.
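As a hypothetical numerical illustration of the prevalence effect: suppose two raters each label 94 of 100 items as positive, agreeing on 90 positive and 2 negative items and splitting the 8 disagreements evenly. Then \(p_o = 0.92\) but \(p_e = 0.94 \times 0.94 + 0.06 \times 0.06 = 0.8872\), so
\[ \kappa = \frac{0.92 - 0.8872}{1 - 0.8872} \approx 0.29 \]
Near-identical ratings can therefore still produce a modest kappa when one category dominates.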
Applications
Cohen's kappa is used in a variety of settings to assess the reliability of categorical assignments. In healthcare, it is used to evaluate the consistency of diagnostic tests between different raters. In psychology, it helps in assessing the reliability of categorical diagnoses. In content analysis and machine learning, kappa provides a measure of the agreement between human annotators or between an algorithm and a human annotator.
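In a machine-learning setting, for example, kappa between model predictions and human labels can be computed with scikit-learn's cohen_kappa_score; the labels below are hypothetical, and the snippet assumes scikit-learn is installed.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical human annotations and model predictions for 8 items.
human = ["cat", "dog", "dog", "cat", "bird", "cat", "dog", "bird"]
model = ["cat", "dog", "cat", "cat", "bird", "dog", "dog", "bird"]

print(cohen_kappa_score(human, model))  # ~0.62 for this toy data
```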