Inter-Rater Reliability Calculator

Enter your confusion matrix data into the Inter-Rater Reliability Calculator to compute Cohen's Kappa, a widely used chance-corrected measure of agreement between two raters or observers. Select the number of categories (2–5), fill in the observed frequency cells of your contingency table, and get back the kappa coefficient, percent agreement, and a strength-of-agreement interpretation. Categories may be nominal or ordered, though ordered categories are treated as unordered; see the weighted Kappa question below.

Select how many rating categories your classification system uses.

Both raters assigned Category 1.

Rater A assigned Cat 1, Rater B assigned Cat 2.

Rater A assigned Cat 2, Rater B assigned Cat 1.

Both raters assigned Category 2.

Only used when 3+ categories are selected.

Results

Cohen's Kappa (κ)

--

Percent Agreement

--

Expected Agreement (by Chance)

--

Total Observations

--

Strength of Agreement

--

Observed vs. Chance Agreement

Results Table

Frequently Asked Questions

What is Cohen's Kappa and why is it used instead of percent agreement?

Cohen's Kappa (κ) measures the agreement between two raters while accounting for the agreement that would occur purely by chance. Simple percent agreement can be misleadingly high when one category is very common, so Kappa corrects for this by comparing observed agreement to expected chance agreement: κ = (pₒ − pₑ) / (1 − pₑ), where pₒ is the observed proportion of agreement and pₑ is the proportion expected by chance. A Kappa of 0 means agreement is no better than chance; a Kappa of 1 means perfect agreement.
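
To make the formula concrete, here is a minimal Python sketch of the computation (an illustration, not this calculator's actual source code), deriving pₒ and pₑ from a square confusion matrix:

```python
# Minimal sketch of the Kappa computation, assuming rows hold Rater A's
# categories and columns hold Rater B's (not the calculator's own code).
def cohens_kappa(matrix):
    k = len(matrix)
    n = sum(sum(row) for row in matrix)                  # total items rated
    p_o = sum(matrix[i][i] for i in range(k)) / n        # observed agreement
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(row[j] for row in matrix) for j in range(k)]
    # chance agreement expected from the marginal proportions
    p_e = sum(row_totals[i] * col_totals[i] for i in range(k)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Example: 80 of 100 items agree, but the margins pull kappa lower
print(round(cohens_kappa([[45, 10], [10, 35]]), 3))      # 0.596
```

Note how 80% raw agreement drops to κ ≈ 0.60 once chance agreement is accounted for.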

How many categories should I select for my data?

Select the number of categories that matches your rating scale. For example, if your raters classify items as simply 'Yes' or 'No', choose 2 categories. If they rate severity as 'None', 'Mild', 'Moderate', or 'Severe', choose 4 categories. The number of categories determines the size of the confusion matrix you'll fill in.

How do I fill in the confusion matrix cells?

Each cell represents how many items both raters assigned to a specific pair of categories. For example, the cell 'Rater A: Cat 1 — Rater B: Cat 1' is the count of items where both raters chose Category 1 (diagonal cells = agreements). Off-diagonal cells represent disagreements. The sum of all cells equals your total number of rated items.
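
To make the layout concrete, here is a small hypothetical 2-category example in Python; all counts are invented for illustration:

```python
# Hypothetical 2-category matrix: rows = Rater A, columns = Rater B.
matrix = [
    [40, 5],    # A said Cat 1: 40 times B agreed, 5 times B said Cat 2
    [8, 47],    # A said Cat 2: 8 times B said Cat 1, 47 times B agreed
]
agreements = matrix[0][0] + matrix[1][1]          # diagonal cells = 87
total = sum(sum(row) for row in matrix)           # all cells = 100 items
print(f"{agreements / total:.0%} raw agreement")  # 87% raw agreement
```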

What is a good value for Cohen's Kappa?

A common interpretation scale (Landis & Koch, 1977) is: κ < 0 = Less than chance agreement; 0.00–0.20 = Slight; 0.21–0.40 = Fair; 0.41–0.60 = Moderate; 0.61–0.80 = Substantial; 0.81–1.00 = Almost perfect. Many research fields consider κ ≥ 0.61 acceptable, though requirements vary by discipline and stakes.
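
A strength-of-agreement label like the one this calculator reports can be derived from that scale with a simple lookup. The sketch below is one possible mapping of the bands quoted above, not the calculator's exact code:

```python
# One possible mapping from kappa to the Landis & Koch bands above.
def strength_of_agreement(kappa):
    if kappa < 0:
        return "Less than chance"
    for upper, label in [(0.20, "Slight"), (0.40, "Fair"),
                         (0.60, "Moderate"), (0.80, "Substantial")]:
        if kappa <= upper:
            return label
    return "Almost perfect"

print(strength_of_agreement(0.596))  # Moderate
```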

Can Cohen's Kappa be negative?

Yes. A negative Kappa means the two raters agreed less than would be expected by chance alone, which typically indicates a systematic disagreement or a data entry error. Values below 0 are rarely seen in practice and usually prompt researchers to review their rating procedures.
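
For instance, a matrix in which the raters systematically swap two categories yields a strongly negative Kappa; the numbers below are invented for illustration:

```python
# Invented example: raters almost always swap the two categories,
# so nearly all counts sit on the off-diagonal cells.
matrix = [[5, 45],
          [45, 5]]
n = 100
p_o = (5 + 5) / n                        # only 10% observed agreement
p_e = (50 * 50 + 50 * 50) / (n * n)      # 50% expected by chance
print((p_o - p_e) / (1 - p_e))           # -0.8
```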

What is the difference between Cohen's Kappa and Krippendorff's Alpha?

Cohen's Kappa is designed specifically for two raters and nominal or ordinal data. Krippendorff's Alpha is more general — it handles any number of raters, missing data, and various levels of measurement (nominal, ordinal, interval, ratio). For straightforward two-rater nominal agreement, Kappa is the standard choice.

Does this calculator support weighted Kappa for ordinal data?

This calculator computes standard (unweighted) Cohen's Kappa, which treats all disagreements equally regardless of how far apart the categories are. Weighted Kappa, which penalizes larger disagreements more heavily using linear or quadratic weights, is appropriate when your categories are ordered (e.g., severity scales). For ordinal data, consider a dedicated weighted kappa tool.
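
For readers who need the weighted variant, here is a minimal sketch under the assumption of ordered, equally spaced categories; the function name and the `scheme` parameter are illustrative, not part of this calculator:

```python
# Weighted-Kappa sketch with linear or quadratic disagreement weights,
# assuming ordered, equally spaced categories (e.g. None..Severe).
def weighted_kappa(matrix, scheme="quadratic"):
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(row[j] for row in matrix) for j in range(k)]

    def weight(i, j):                  # 0 on the diagonal, grows with |i - j|
        d = abs(i - j) / (k - 1)
        return d if scheme == "linear" else d * d

    observed = sum(weight(i, j) * matrix[i][j]
                   for i in range(k) for j in range(k)) / n
    expected = sum(weight(i, j) * row_totals[i] * col_totals[j]
                   for i in range(k) for j in range(k)) / (n * n)
    # With 0/1 disagreement weights this expression reduces exactly to
    # the unweighted Kappa formula.
    return 1 - observed / expected
```

Quadratic weights penalize a 'None' vs 'Severe' disagreement far more than a 'Mild' vs 'Moderate' one, which is usually the desired behavior for severity scales.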

What is the minimum sample size needed for a reliable Kappa estimate?

There is no universal minimum, but most methodologists recommend at least 30–50 observations per cell of the confusion matrix, or a total N of at least 100–200 subjects. Very small samples produce unstable Kappa estimates with wide confidence intervals; larger samples give more precise and trustworthy results.
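
One way to see that instability directly is to bootstrap the Kappa estimate. The sketch below is a rough illustration on hypothetical paired ratings, not a feature of this calculator:

```python
import random

# Rough bootstrap sketch of Kappa's sampling variability. `pairs` is a
# hypothetical list of (rater_a, rater_b) labels, one tuple per item.
def kappa_from_pairs(pairs):
    cats = sorted({c for pair in pairs for c in pair})
    idx = {c: i for i, c in enumerate(cats)}
    k, n = len(cats), len(pairs)
    m = [[0] * k for _ in range(k)]
    for a, b in pairs:
        m[idx[a]][idx[b]] += 1
    p_o = sum(m[i][i] for i in range(k)) / n
    p_e = sum(sum(m[i]) * sum(row[i] for row in m)
              for i in range(k)) / (n * n)
    return (p_o - p_e) / (1 - p_e)       # assumes p_e < 1 (non-degenerate)

def bootstrap_kappa_ci(pairs, reps=2000, alpha=0.05, seed=1):
    """Percentile bootstrap interval: resample items with replacement."""
    rng = random.Random(seed)
    est = sorted(kappa_from_pairs([rng.choice(pairs) for _ in pairs])
                 for _ in range(reps))
    return est[int(reps * alpha / 2)], est[int(reps * (1 - alpha / 2)) - 1]
```

On a small sample the resulting interval is typically wide, which is exactly the instability the recommendation above guards against.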
