Objectives: To evaluate the interobserver reliability of the Kellgren–Lawrence (KL) classification among orthopedic surgeons and to compare their assessments with artificial intelligence (AI) systems.

Methods: One hundred anteroposterior weight-bearing knee radiographs from patients aged 65 years and older were retrospectively analyzed. Four orthopedic surgeons and two AI systems independently graded all radiographs according to the KL classification, blinded to clinical information and to each other's evaluations. Interobserver agreement was assessed using quadratically weighted Cohen's kappa (κ) and intraclass correlation coefficients (ICC).

Results: Interobserver agreement among the orthopedic surgeons demonstrated good reliability (mean weighted κ = 0.780; ICC = 0.784). Agreement between the orthopedic consensus and ChatGPT was moderate (κ = 0.481), whereas Gemini showed moderate-to-good agreement (κ = 0.561). Agreement between the two AI systems was also moderate (κ = 0.484).

Conclusion: The KL classification demonstrated good reliability among orthopedic surgeons. The AI systems showed only moderate agreement with orthopedic experts and may serve as supportive screening tools rather than as diagnostic replacements.

Keywords: Artificial intelligence
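For readers unfamiliar with the statistic, the quadratically weighted kappa used in the Methods penalizes rater disagreements by the squared distance between ordinal grades, so a KL 1 vs. KL 2 disagreement costs far less than KL 0 vs. KL 4. The sketch below is illustrative only and is not taken from the study; the rater data are hypothetical, and the computation uses scikit-learn's cohen_kappa_score.

```python
# Minimal sketch: quadratically weighted Cohen's kappa for two raters
# assigning ordinal KL grades (0-4). Data are hypothetical.
from sklearn.metrics import cohen_kappa_score

rater_a = [0, 1, 2, 2, 3, 4, 1, 2, 3, 4]  # hypothetical KL grades, rater A
rater_b = [0, 1, 2, 3, 3, 4, 2, 2, 3, 3]  # hypothetical KL grades, rater B

# weights="quadratic" applies the (i - j)^2 penalty to off-diagonal cells
kappa = cohen_kappa_score(rater_a, rater_b, weights="quadratic")
print(f"Quadratically weighted kappa: {kappa:.3f}")
```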