Introduction
With recent developments in artificial intelligence (AI) using deep learning, AI-related studies in digital pathology are expanding. A key consideration in AI-based pathology image analysis is that high reproducibility and accuracy of the underlying pathology diagnoses are basic premises of AI training. If there is no consensus on what constitutes the "gold standard" diagnosis, and different pathologists render different diagnoses for the same pathology image, the basis for AI training is compromised and the reliability of the results is low [1,2]. Therefore, a high level of diagnostic agreement is a prerequisite for AI-powered analysis.
Factors that affect diagnostic agreement among pathologists include the clarity of definitions of diagnostic terms, cognitive biases, the method of evaluating pathological findings (quantitative or qualitative), institutional variability, cross-training, and the level of experience or training of the pathologists [1]. Several studies have investigated or sought to improve agreement in the diagnosis of renal disease [3–10]. In particular, agreement on the diagnosis of lupus nephritis, which has a major impact on clinical prognosis and treatment decisions, is poor among pathologists [4,6,7,9].
Two previous studies on diagnostic agreement have been conducted by the Renal Pathology Study Group of the Korean Society of Pathologists (RPS-KSP) [11,12]. These studies standardized the terminology of renal pathology and improved diagnostic agreement through training with a virtual slide atlas. In the Nephrotic Syndrome Study Network (NEPTUNE) study, intra- and interobserver variability were significantly reduced after two rounds of web-based cross-training [5]. Web-based cross-training allows pathologists to train together regardless of time and location, which can markedly improve diagnostic agreement.
In this study, we assessed interobserver variability among pathologists in the diagnosis and classification of lupus nephritis and attempted to improve agreement through educational training. In doing so, we aimed to improve the quality and accuracy of the diagnosis of lupus nephritis and to provide a gold standard for future AI-powered studies.
Results
Forty-three RPS-KSP members responded to at least one of the three surveys. There were 31 respondents to Survey 1, 28 to Survey 2, and 19 to Survey 3. Of these, 16, 14, and 12, respectively, had more than 10 years of experience reporting renal biopsies for medical renal disease.
The number of renal biopsies reported per year varied among the respondents. Seven respondents reported more than 300 biopsies per year, one reported 200 to 300, 11 reported 100 to 200, 12 reported 51 to 100, and seven reported 50 or fewer. Among the highly experienced pathologists, six reported more than 300 renal biopsies per year, six reported 100 to 200 per year, four reported 51 to 100 per year, and four reported 50 or fewer per year.
The κ-values for each question are presented in Supplementary Fig. 1 and Supplementary Table 1 (available online), and the κ-values by item in Fig. 2 and Supplementary Table 2 (available online). The mean ± standard deviation (SD) κ-values across the 25 questions were as follows: Survey 1, 0.417 ± 0.011; Survey 2, 0.412 ± 0.010; and Survey 3, 0.472 ± 0.013. The overall κ-value of Survey 3 was significantly higher than that of Surveys 1 and 2 (p < 0.001 and p = 0.001, respectively). The κ-values for highly experienced pathologists, defined as those who had practiced renal pathology for more than 10 years, were generally higher than those for all pathologists (Fig. 3). The mean ± SD κ-values for highly experienced pathologists were as follows: Survey 1, 0.475 ± 0.019; Survey 2, 0.427 ± 0.011; and Survey 3, 0.474 ± 0.015. The κ-value of Survey 3 for highly experienced pathologists was significantly higher than that of Survey 2 (p = 0.009). Question 8 showed poor agreement, with a κ-value of 0.2 or less in all three concordance assessments (Supplementary Fig. 1, available online), but substantial agreement, 0.6 or more, by Gwet's AC1 (Supplementary Fig. 2, available online).
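To make the two coefficients concrete, the sketch below computes Fleiss' κ and Gwet's AC1 for a toy response matrix skewed toward one answer, mimicking a prevalence-imbalanced question such as Question 8. This is a minimal illustration, not the authors' analysis code: the data layout (questions × raters), the simulated responses, and the hand-rolled gwet_ac1 helper are assumptions, while aggregate_raters and fleiss_kappa are existing statsmodels functions.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

def gwet_ac1(counts):
    """Gwet's AC1 from an (items x categories) table of rating counts."""
    n = counts.sum(axis=1)                      # raters per item
    # mean pairwise observed agreement across items
    p_o = ((counts * (counts - 1)).sum(axis=1) / (n * (n - 1))).mean()
    pi = counts.sum(axis=0) / counts.sum()      # marginal category proportions
    q = counts.shape[1]                         # number of categories
    p_e = (pi * (1 - pi)).sum() / (q - 1)       # AC1 chance-agreement term
    return (p_o - p_e) / (1 - p_e)

# Toy data: 25 questions x 31 raters, heavily skewed toward one answer
rng = np.random.default_rng(0)
responses = (rng.random((25, 31)) < 0.96).astype(int)
counts, _ = aggregate_raters(responses)         # items x categories count table
print("Fleiss' kappa:", fleiss_kappa(counts))
print("Gwet's AC1:", gwet_ac1(counts))
```

On such skewed data, observed agreement is high, yet Fleiss' κ collapses toward zero while AC1 remains high, which is the divergence reported for Question 8.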
The mean ± SD κ-values across the 14 lupus nephritis items were as follows: Survey 1, 0.251 ± 0.033; Survey 2, 0.276 ± 0.042; and Survey 3, 0.309 ± 0.015. The κ-values for highly experienced pathologists were higher than those for all pathologists. Overall item agreement among all pathologists did not differ significantly after the educational sessions. For highly experienced pathologists, however, agreement in Survey 3 increased over the preceding survey after the educational sessions. Agreement was "fair" for endocapillary hypercellularity and for neutrophils and/or karyorrhexis, with values of less than 0.4 in both the Fleiss' κ and Gwet's AC1 analyses, among all pathologists and among highly experienced pathologists (Table 1, Figs. 2–4). Mesangial hypercellularity showed poor agreement, with both Fleiss' κ and Gwet's AC1 values of 0.2 or less, in all three surveys. Agreement on identifying mesangial hypercellularity and endocapillary hypercellularity increased after the two educational sessions, whereas agreement on identifying neutrophils and/or karyorrhexis decreased. For highly experienced pathologists, only agreement on identifying mesangial hypercellularity increased after the two educational sessions, while agreement on identifying endocapillary hypercellularity and neutrophils and/or karyorrhexis fell below pre-education levels (Fig. 3; Supplementary Fig. 3, available online).
In Survey 3, segmental sclerosis and adhesion between the tuft and capsule had lower κ-values than in Survey 1 (0.231 and 0.345 vs. 0.289 and 0.361, respectively). However, Gwet's AC1 values for these two items were higher in Survey 3 (0.722 and 0.722, respectively) than in Survey 1 (0.685 and 0.688, respectively). Items such as normal, global sclerosis, spike or intramembranous hole formation, fibrous crescent, and double contour showed highly unbalanced marginal distributions, so their κ-values were uninformative. By Gwet's AC1, these items showed almost perfect agreement, with values of at least 0.8 (Fig. 4; Supplementary Fig. 4, available online).
It is possible that the diligence of respondents who dropped out of one of the three surveys differed from that of respondents who completed all three. For a rigorous comparison of pre- and post-education agreement, it is therefore necessary to analyze agreement only among those who completed all three surveys (Supplementary Table 3, available online). In this subgroup, agreement for each item was slightly higher than among participants who responded to one or more of the surveys, and the overall trend was similar (Fig. 5). The increase in agreement from pre- to post-education varied by item. Of the three items with the lowest agreement, two (mesangial hypercellularity and endocapillary hypercellularity) increased in agreement after education (Gwet's AC1, 0.184 and 0.329 to 0.194 and 0.334, respectively) and one (neutrophils and/or karyorrhexis) decreased (0.574 to 0.357). Both the κ and Gwet's AC1 values of Survey 2 for highly experienced pathologists were significantly lower than those of Survey 1 (p = 0.015 and p = 0.004, respectively).
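The text does not specify how the p-values comparing agreement across surveys were obtained. One plausible approach, sketched below purely as an assumption, is a paired nonparametric test over the per-item coefficients; the numbers are placeholders, not study data.

```python
# Hypothetical sketch: comparing per-question agreement between two surveys
# with a paired nonparametric test. Placeholder values only; the study's
# actual per-question coefficients and chosen test are not stated.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
kappa_survey1 = rng.normal(0.417, 0.05, size=25)  # 25 per-question kappas (fake)
kappa_survey3 = rng.normal(0.472, 0.05, size=25)  # 25 per-question kappas (fake)

stat, p = wilcoxon(kappa_survey3, kappa_survey1)  # paired by question
print(f"Wilcoxon signed-rank statistic = {stat:.1f}, p = {p:.4g}")
```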
Agreement between experienced and inexperienced pathologists was also compared among respondents who completed all three surveys. The Gwet's AC1 values of the experienced group varied from item to item relative to those of the inexperienced group (Supplementary Fig. 5, available online). For this comparison, the definition of experienced was narrowed to more than 10 years of renal pathology practice and at least 100 renal biopsies diagnosed per year. The difference in agreement between the experienced and inexperienced groups varied by item. Before the education, the experienced group (n = 6) had six items with higher AC1 values than the inexperienced group (n = 8; mesangial hypercellularity, endocapillary hypercellularity, fibrous crescent, wire loop lesion and/or hyaline thrombi, and double contour), but after the education, this decreased to four items (endocapillary hypercellularity, spike or intramembranous hole formation, and fibrocellular and fibrous crescent) (Fig. 6). There was no significant difference between the two groups in overall agreement.
Discussion
There are few studies on concordance between pathologists in the diagnosis of lupus nephritis, and the reported concordance is low [6]. To the best of our knowledge, this is the first study to assess concordance in the identification of the pathological lesions of lupus nephritis in Korea. In the 2018 International Society of Nephrology/Renal Pathology Society (ISN/RPS) revision of the classification of lupus nephritis, some histopathological descriptors that comprise the activity and chronicity indices were modified or redefined [14]. With this revision, the definitions of mesangial hypercellularity, crescent, adhesion, and fibrinoid necrosis were revised, and endocapillary proliferation was renamed endocapillary hypercellularity. This is also the first study to evaluate concordance using the new histopathological descriptors from the 2018 ISN/RPS revision, along with other histopathological features used in the diagnosis of lupus nephritis.
Dasari et al. [6] systematically reviewed inter-pathologist agreement on lupus nephritis and concluded that concordance was "poor" to "moderate." In their review, leukocyte infiltration, a term similar to neutrophils in the modified activity/chronicity index, exhibited "poor" agreement, in line with our results (κ-value for neutrophils and/or karyorrhexis, <0.4). However, the agreement for endocapillary hypercellularity in our study was lower than in previous studies, which showed "moderate" agreement (intraclass correlation coefficient [ICC] or κ-value, >0.4) [6,7,9,15], despite two educational sessions. This is likely due to the inclusion of mesangial hypercellularity as a response option, unlike in previous studies, or to unclear definitions. Most studies used a crude assessment, scoring the percentage of involved glomeruli on the slide against a cutoff [7,9,15], whereas this study evaluated endocapillary hypercellularity more rigorously, per glomerulus. Although mesangial hypercellularity and endocapillary hypercellularity often coexist, the 2018 revision does not provide criteria for distinguishing between them. The Oxford Working Group reported that concordance for segmental endocapillary hypercellularity was "fair" [3]. They also reported that mesangial cellularity was difficult to score in segments with endocapillary hypercellularity; therefore, they scored glomeruli as "indeterminate" for mesangial cellularity in the presence of global endocapillary hypercellularity. Agreement for cellular and fibrous crescents improved from "poor" to "moderate" in previous studies (cellular ICC, 0.5 and 0.55 ± 0.07; fibrous ICC, 0.25 ± 0.09 and 0.58) to "good" to "almost perfect" in this study (cellular κ, >0.6; fibrous Gwet's AC1, >0.9) [15,16]. We hypothesize that this is attributable, first, to the lowering of the cutoff for extracapillary proliferation from 25% to 10% [14], which reduced uncertainty by allowing previously borderline lesions to be classified as crescentic, thereby improving agreement. Second, a more detailed definition of the fibrocellular/fibrous crescent [14], which was not previously available, may have helped improve concordance. Although fibrinoid necrosis was defined in detail for the first time in the revision [14], its agreement here ("poor" by κ-value, 0.32 to 0.47; "substantial" by Gwet's AC1, 0.61 to 0.76) was similar to that previously reported for fibrinoid necrosis/karyorrhexis (ICC: 0.26, 0.48, and 0.45 ± 0.09) [6], possibly because it is now assessed separately rather than combined with karyorrhexis. In both the NEPTUNE and the NEPTUNE Digital Pathology Scoring System studies, agreement was higher than before after individual descriptors were grouped [5,10].
Mesangial hypercellularity is not a component of the activity or chronicity index, but it is a key feature that can be diagnostic of class II lupus nephritis when present with appropriate immunofluorescence or electron microscopic findings, and it has not been addressed in previous lupus nephritis concordance studies [17]. The definition of mesangial hypercellularity in the ISN/RPS revision was taken from the Oxford classification of immunoglobulin A (IgA) nephropathy, with the cutoff increased from three cells to four, a point emphasized in the educational sessions of this study. Despite the more detailed definition, concordance increased only minimally after two educational sessions, and mesangial hypercellularity had the lowest agreement among the items; this has also been observed frequently in other studies [18–20]. In concordance studies on IgA nephropathy, there was "moderate" to "poor" agreement in determining whether mesangial hypercellularity was present in more than half of the biopsied glomeruli, suggesting that agreement on its presence in a single glomerulus would be even lower. Furthermore, it is not yet known whether a clear-cut distinction between mesangial hypercellularity and endocapillary hypercellularity can be made in class III and IV lesions [14]. It is also unclear whether the cutoff of four cells for mesangial hypercellularity refers to mesangial cells alone or also includes inflammatory cells [14]. More specific definitions will be required in the future (Supplementary Table 4, available online).
Some items had low κ-values despite high observed agreement, a consequence of the "prevalence paradox" of Fleiss' κ [13,21,22]: when responses are distributed very unevenly across categories, the chance-corrected value can be far lower than the observed agreement. To compensate for this uneven distribution of responses, we also performed Gwet's AC1 analysis. Given that the limitations of κ noted in previous studies were evident in some items of this study, Gwet's AC1 is a more appropriate measure of agreement than Fleiss' κ, especially when observed agreement is high [23–25].
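For concreteness, both statistics correct the same observed agreement for chance but estimate chance agreement differently; the formulas and the worked binary example below use hypothetical numbers, not study data:

\[
\kappa = \frac{p_o - p_e}{1 - p_e}, \qquad p_e = \sum_{k=1}^{K} \pi_k^{2}, \qquad \mathrm{AC1} = \frac{p_o - p_e^{\gamma}}{1 - p_e^{\gamma}}, \qquad p_e^{\gamma} = \frac{1}{K-1} \sum_{k=1}^{K} \pi_k (1 - \pi_k),
\]

where \(p_o\) is the observed agreement and \(\pi_k\) the marginal proportion of category \(k\). For a binary item with \(\pi = (0.95, 0.05)\) and \(p_o = 0.92\), \(p_e = 0.95^2 + 0.05^2 = 0.905\), giving \(\kappa \approx 0.16\) ("poor"), whereas \(p_e^{\gamma} = 0.95 \times 0.05 + 0.05 \times 0.95 = 0.095\), giving \(\mathrm{AC1} \approx 0.91\) ("almost perfect") on the same data.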
It is noteworthy that, even with the narrower definition of experienced pathologists, the experienced group had higher agreement than the inexperienced group for fewer than half of the items, the difference was not significant, and it became even smaller after education. This differs from previous studies that reported higher concordance among experts [5,6], and it suggests that, at least among Korean renal pathologists, the level of experience does not necessarily correlate with higher concordance for lupus nephritis glomerular lesions. However, agreement did increase for some items after the educational sessions, underscoring the importance of regular training of pathologists, at least for those items.
This study is more detailed and systematic than previous work: it used digital images to assess agreement on the components of the activity and chronicity indices of lupus nephritis for each glomerulus, and it is the first concordance study to use the definitions of the 2018 ISN/RPS revision. Because it included a relatively large number of pathologists with a high response rate, it is also more objective and generalizable than agreement assessments based on a small number of pathologists. This study included four stain images (H&E, PAS, trichrome, and PAMS) to represent the routine diagnostic setting. The educational sessions succeeded in improving agreement, and the benefits were immediately applicable in the clinic, as the majority of the pathologists worked at multiple institutions.
This study has some limitations. It included only glomeruli and did not evaluate agreement for tubulointerstitial and vascular lesions. Glomerular selection bias was unavoidable. Few glomeruli showed global sclerosis or spikes; therefore, the reliability of the agreement estimates for these two items is questionable. A post hoc review of the glomerular images revealed no typical images in which spikes or global sclerosis were easily identifiable, so additional images should be included in future assessments. The education consisted of one-way lectures, which appear to be less effective than interactive open-round meetings. Especially for experienced pathologists, an interactive open-round meeting, in which attendees could comment on one another's assessments and discuss problematic points in depth, might lead to better agreement. Finally, the study was limited to Korean patients and pathologists.
The treatment of lupus nephritis is based on the histopathological classification and the activity/chronicity indices, and appropriate treatment affects patient prognosis. To train a machine-learning model effectively, the training data must be highly reliable, which is difficult to achieve when histopathological diagnostic agreement among pathologists is low. This study showed improvement in agreement after two educational sessions. These results are immediately applicable in clinical practice and provide a basis for the development of accurate AI models.