Predictive models for posttransplant diabetes mellitus in kidney transplant recipients using machine learning and deep learning approach: a nationwide cohort study from South Korea
Abstract
Background
Posttransplant diabetes mellitus (PTDM) increases morbidity and mortality in kidney transplant recipients (KTRs). This study aimed to predict PTDM risk in KTRs using machine learning and deep learning models.
Methods
Data were obtained from the Korea Organ Transplantation Registry, a nationwide cohort study of KTRs. Four machine learning algorithms, including eXtreme Gradient Boosting (XGBoost), CatBoost, light gradient boosting machine and logistic regression, and deep learning were implemented on 41 pretransplant and 31 posttransplant variables to predict PTDM. Model performance was assessed using the area under the curve (AUC) of the receiver operating characteristic curve, accuracy, precision, recall, and F1 score.
Results
Among 3,213 KTRs, 497 patients (15.5%) developed PTDM within 1 year. The PTDM group had higher age, body mass index (BMI), triglyceride level, and prevalence of hypertension and cardiovascular disease, and lower total cholesterol level at baseline than the No-PTDM group. The XGBoost model showed the highest AUC (0.738) and F1 score (0.42), and modest accuracy (0.86), while the CatBoost model exhibited the highest accuracy (0.87) and precision (0.79). Feature importance in XGBoost was highest for recipient age, followed by baseline BMI, triglyceride level at posttransplant 6 months, baseline glycated hemoglobin and high-density lipoprotein cholesterol level, white blood cell (WBC) count and serum uric acid level at 6 months, baseline WBC count, and tacrolimus trough level at discharge.
Conclusion
The XGBoost model demonstrated the best performance for predicting PTDM within 1 year, offering an accurate tool for early identification and personalized care of high-risk KTRs for PTDM.
Introduction
Posttransplant diabetes mellitus (PTDM) is a common metabolic complication after kidney transplantation (KT), with an incidence ranging from 7% to 40% [1,2]. PTDM affects the outcomes of KT recipients (KTRs) in terms of graft failure, cardiovascular disease, and mortality [3]. In addition, PTDM is associated with poor quality of life and increased healthcare costs [3]. Known risk factors for PTDM include age, obesity, acute rejection, viral infections such as cytomegalovirus and hepatitis B and C, and immunosuppressants [4–7]. Identifying KTRs at high risk of developing PTDM is crucial, since early diagnosis and management help to reduce PTDM-related morbidity and mortality. A risk scoring system would therefore allow robust pre- and posttransplant assessment of PTDM risk. A generalizable scoring system requires a large cohort drawn from a multicenter database. In addition, a relatively recent database is more useful because it reflects current immunosuppressive strategies.
Machine learning is used to create predictive models from data. Owing to high-performance computing, data availability, and algorithmic innovations, it effectively analyzes a large dataset [8]. Deep learning is an advanced subset of machine learning that uses artificial neural networks and big data [9]. Machine learning has the potential to detect possible interactions and new relationships between variables from a large dataset, which may provide more accurate prognostic models in medicine. Therefore, the aim of this study was to build a predictive model for PTDM using machine learning and deep learning algorithms from a nationwide multicenter cohort of KTRs.
Methods
Study population and data collection
Data were obtained from the Korea Organ Transplantation Registry (KOTRY), a prospective multicenter nationwide cohort study of KTRs in Korea. Forty-one transplantation centers participated in the KOTRY. Following the establishment of KOTRY in 2014, data for this study were requested in 2020, and multifaceted analyses have been conducted since. The dataset included all 6,455 KTRs aged 18 years or older who underwent KT between May 2014 and August 2020. The KOTRY provided patient demographics and clinical and laboratory data at the time of transplantation. Follow-up data collected 6 months and 1 year after KT included laboratory data, antirejection treatment, and complications of KT including PTDM. Data on medications taken posttransplantation, including vitamin D analogs, tacrolimus, cyclosporine, mycophenolic acid, mammalian target of rapamycin inhibitors, and corticosteroids, were also collected. Estimated glomerular filtration rate was calculated using the CKD-EPI (Chronic Kidney Disease Epidemiology Collaboration) equation [10]. Body mass index (BMI) was calculated as the patient’s weight in kilograms divided by height in meters squared (kg/m²). The original dataset consisted of 236 variables related to the recipient and 125 variables related to the donor.
All patients provided written informed consent before KOTRY enrollment. The study was performed in line with the principles of the Declaration of Helsinki and approved by the Institutional Review Board of The Catholic University of Korea, Incheon St. Mary’s Hospital (OC19ONDI0034).
Study flow chart and endpoint
Based on the criteria of the American Diabetes Association/World Health Organization [11], patients were classified as having diabetes mellitus (DM) when they met at least one of the following criteria: 1) fasting plasma glucose level ≥126 mg/dL, 2) random plasma glucose level ≥200 mg/dL, 3) 2-hour glucose after a 75-g oral glucose tolerance test of more than 200 mg/dL, or 4) hemoglobin A1c (HbA1c) ≥6.5%. Because the KOTRY data included patients whose HbA1c was ≥6.5% or whose random plasma glucose was ≥200 mg/dL but who were not recorded as having DM, these DM classification errors were corrected first. Missing values and outliers of HbA1c and fasting plasma glucose at baseline were substituted with the median value. Missing values and outliers of follow-up plasma glucose 6 months and 1 year after KT were filled in using the carry-forward method.
KTRs who had been diagnosed with pretransplant DM (n = 2,006) and those additionally classified as having pretransplant DM according to the criteria of the American Diabetes Association/World Health Organization [11] (n = 771) were excluded, leaving 3,678 patients in the cohort. After excluding patients without post-KT data at 6 months and 1 year (n = 465), a total of 3,213 patients were included in the final analysis (Fig. 1).

Flow chart of the study.
DM, diabetes mellitus; HbA1c, hemoglobin A1c; KOTRY, Korea Organ Transplantation Registry; PTDM, posttransplant diabetes mellitus.
The endpoint of this study was newly diagnosed PTDM within 1 year after KT. Data on the development of PTDM were recorded in the database 6 months and 1 year after KT. Additionally, PTDM was diagnosed from the fasting plasma glucose level [12]. Four hundred ninety-seven KTRs (15.5%) developed PTDM within 1 year.
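The cohort flow and the 1-year incidence can be checked with simple arithmetic. The sketch below uses only the counts reported in the text; the variable names are illustrative and are not taken from the study code.

```python
# Cohort flow, using only the counts reported in the text.
enrolled = 6455           # adult KTRs transplanted May 2014 to August 2020
pretransplant_dm = 2006   # recorded pretransplant DM
reclassified_dm = 771     # additionally classified as pretransplant DM (ADA/WHO criteria)
no_followup = 465         # no post-KT data at 6 months and 1 year
ptdm_cases = 497          # PTDM within 1 year

after_dm_exclusion = enrolled - pretransplant_dm - reclassified_dm
final_cohort = after_dm_exclusion - no_followup
incidence_pct = round(100 * ptdm_cases / final_cohort, 1)

print(after_dm_exclusion, final_cohort, incidence_pct)
```

The three printed values correspond to the 3,678 patients remaining after the DM exclusions, the 3,213 patients in the final analysis, and the 15.5% incidence.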
Missing data and outlier preprocessing
For every feature, outliers were replaced with blanks so that value distributions and missing percentages could be analyzed correctly. The missing value percentage (the number of patients missing a value for the feature divided by the total number of patients) was calculated for every feature. For binary features, the proportion of patients with the value 0, hereafter referred to as the zero ratio, was calculated. Missing values in all selected features were substituted with representative values: the mean for continuous features, the mode for binary and categorical features, and forward-fill interpolation for follow-up features. Derivative binary features were merged into categorical features.
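The three substitution rules can be illustrated with a minimal standard-library sketch. The function names, the use of None for a blank value, and the toy inputs are assumptions made for illustration; the study's own preprocessing code is not described in the text and may differ in detail.

```python
from statistics import mean, mode

def impute_mean(values):
    """Continuous feature: replace None with the mean of the observed values."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

def impute_mode(values):
    """Binary or categorical feature: replace None with the most frequent observed value."""
    fill = mode(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

def carry_forward(visits):
    """Follow-up feature for one patient, ordered in time:
    replace None with the most recent earlier value (left as None if there is none)."""
    filled, last = [], None
    for v in visits:
        if v is None:
            v = last
        filled.append(v)
        last = v
    return filled

# Hypothetical toy values, not study data.
bmi = impute_mean([22.0, None, 26.0])
hypertension = impute_mode([0, 1, None, 1])
glucose = carry_forward([95, None, 110])
```

Carrying forward only from earlier visits keeps later measurements from leaking into earlier time points, which matters when follow-up values are used as sequential inputs.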
Feature selection
Among the 236 variables, features to be used for training were selected from the following groups: baseline characteristics, laboratory values, medications, and complications relevant to allograft function and DM. Features with a missing percentage higher than 50% and binary features with a zero ratio higher than 90% or lower than 10% were eliminated.
As exceptions, features for patients’ human leukocyte antigen alleles were eliminated because of their excessive number of categories. Features for delayed graft function and anti-thymoglobulin dosage were included regardless of the missing percentage threshold, considering their clinical relevance to DM.
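The elimination thresholds and the exceptions can be expressed as a small filter. This is a sketch under stated assumptions: the zero ratio is computed over non-missing values (the text does not specify the denominator), None marks a missing value, and all feature names and toy columns are hypothetical.

```python
def missing_pct(values):
    """Percentage of patients with a missing (None) value for the feature."""
    return 100 * sum(v is None for v in values) / len(values)

def zero_ratio(values):
    """Percentage of observed values equal to 0 (assumption: missing values excluded)."""
    observed = [v for v in values if v is not None]
    return 100 * sum(v == 0 for v in observed) / len(observed)

def select_features(columns, binary, keep_always=frozenset()):
    """Drop features with >50% missing values and binary features whose
    zero ratio is >90% or <10%; features in keep_always bypass both checks."""
    selected = []
    for name, values in columns.items():
        if name in keep_always:
            selected.append(name)
        elif missing_pct(values) > 50:
            continue
        elif name in binary and not 10 <= zero_ratio(values) <= 90:
            continue
        else:
            selected.append(name)
    return selected

# Hypothetical toy columns for 20 patients, not study data.
columns = {
    "age": list(range(30, 50)),
    "hypertension": [1] * 12 + [0] * 8,                      # zero ratio 40%
    "rare_flag": [0] * 19 + [1],                             # zero ratio 95%
    "sparse_lab": [None] * 12 + [1.0] * 8,                   # 60% missing
    "delayed_graft_function": [None] * 12 + [0] * 6 + [1] * 2,
}
binary = {"hypertension", "rare_flag", "delayed_graft_function"}
selected = select_features(columns, binary, keep_always={"delayed_graft_function"})
```

Here `keep_always` plays the role of the clinically motivated exceptions, so a feature such as delayed graft function survives despite exceeding the missing-value threshold.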
The 41 baseline variables and 31 follow-up variables included in the final analysis are shown in Table 1.
Training-machine learning
Fig. 2 shows the schema of machine learning. KTRs were classified into two classes based on the occurrence of PTDM within 1 year. The dataset was split into training, validation, and test sets in a 7.5 to 1.5 to 1.5 ratio. Baseline and follow-up features were spread out and input into the eXtreme Gradient Boosting (XGBoost), CatBoost, light gradient boosting machine (GBM), and logistic regression models. Each model was optimized by searching for the optimal hyperparameters over 300 trials using the Optuna library. StratifiedKFold cross-validation was implemented to increase validity, splitting the dataset into five validation sets and assessing the model on each. Algorithm performance was evaluated by the area under the curve (AUC) of the receiver operating characteristic (ROC) curve with 95% confidence intervals (CIs). The importance of each feature was determined using the SHAP (SHapley Additive exPlanations) method.
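The idea behind stratified five-fold cross-validation is that each fold preserves the overall PTDM proportion, which matters when only about 15% of patients are positive. The following standard-library sketch illustrates that idea only; it is not the StratifiedKFold implementation named in the text, and the function name, seed, and use of the whole-cohort class counts as labels are assumptions for illustration.

```python
import random

def stratified_folds(labels, k=5, seed=0):
    """Assign sample indices to k folds so that each class is spread
    as evenly as possible across the folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)   # deal indices out round-robin
    return folds

# Class counts taken from the text: 497 PTDM and 2,716 No-PTDM patients.
labels = [1] * 497 + [0] * 2716
folds = stratified_folds(labels, k=5)

for fold in folds:
    positives = sum(labels[i] for i in fold)
    print(len(fold), positives, round(100 * positives / len(fold), 1))
```

Each fold serves once as the validation set while the remaining four are used for fitting, so every patient contributes to validation exactly once.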

Schema of machine learning.
(A) Baseline and follow-up variables including rejection data were included in the analysis. (B) Data preparation includes data cleaning, data transformation, feature engineering, data normalization and standardization, data splitting, and data augmentation. (C) Baseline and follow-up features were spread out and input into eXtreme Gradient Boosting (XGBoost), CatBoost, light gradient boosting machine (GBM), and logistic regression model. The model was optimized by hyperparameter optimization and StratifiedKFold. Five-fold cross validation was performed by splitting the data set into five validation sets and assessing the model on each.
A confusion matrix was generated from the prediction results, with four categories for correct and incorrect predictions. The categories for correct predictions were true positive (TP) and true negative (TN), whereas those for incorrect predictions were false positive (FP) and false negative (FN). The four categories were used to calculate metrics for assessing model performance: accuracy, precision, recall (sensitivity), and F1 score.
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall (sensitivity) = TP / (TP + FN)
F1 score = 2 × (precision × recall) / (precision + recall)
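These metrics follow directly from the four confusion matrix counts, with the F1 score being the harmonic mean of precision and recall, 2 × (precision × recall) / (precision + recall). The sketch below is illustrative; the function name and the example counts are hypothetical and do not come from the study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1 score from confusion matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts for an imbalanced test set, not study results.
metrics = classification_metrics(tp=20, tn=400, fp=10, fn=50)
print(metrics)
```

With these counts accuracy is 0.875 while the F1 score is only 0.40, which shows why accuracy alone can look favorable on imbalanced data even when many positive cases are missed.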
Training-deep learning
Fig. 3 shows the schema of deep learning. KTRs were classified into two classes based on the occurrence of PTDM within 1 year. The dataset was split into training, validation, and test sets in a 7.5 to 1.5 to 1.5 ratio. A multilayer perceptron (consisting of linear transformations and nonlinear activation functions such as the rectified linear unit) was applied to the baseline and rejection features to generate the first output. Long short-term memory (used for handling dynamic data and learning sequential patterns) was applied to the follow-up features to generate the second output. The outputs were passed through a linear layer to derive the final output. Weighted CrossEntropyLoss was used as the loss function to mitigate the class imbalance problem, and weight decay was added to the Adam optimizer. Algorithm performance was evaluated by the AUC of the ROC curve with 95% CI, accuracy, recall, precision, and F1 score.
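How a class-weighted cross-entropy loss counteracts imbalance can be sketched without a deep learning framework. The weights actually used in the study are not reported; inverse-frequency weighting is assumed here purely for illustration, and the function names are hypothetical. The loss is averaged by the sum of the target weights, which is the usual convention for a weighted mean.

```python
import math

def inverse_frequency_weights(labels):
    """Class weights proportional to 1 / class frequency (an assumed scheme)."""
    n, classes = len(labels), sorted(set(labels))
    return {c: n / (len(classes) * labels.count(c)) for c in classes}

def weighted_cross_entropy(logits, labels, weights):
    """Weighted mean of -log softmax(logits)[label] over the samples."""
    total, weight_sum = 0.0, 0.0
    for z, y in zip(logits, labels):
        m = max(z)                                   # subtract max for numerical stability
        log_norm = m + math.log(sum(math.exp(v - m) for v in z))
        total += weights[y] * (log_norm - z[y])
        weight_sum += weights[y]
    return total / weight_sum

# Class counts from the text: 2,716 No-PTDM (class 0) and 497 PTDM (class 1).
labels = [0] * 2716 + [1] * 497
weights = inverse_frequency_weights(labels)

# Hypothetical logits for three patients: two No-PTDM, one PTDM.
loss = weighted_cross_entropy([[2.0, 0.0], [1.0, 0.5], [0.5, 0.2]], [0, 0, 1], weights)
print(weights, loss)
```

Under this scheme an error on a PTDM patient costs roughly five times as much as an error on a No-PTDM patient, which pushes the network toward higher recall at the expense of precision.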

Schema of deep learning.
A multilayer perceptron (MLP) was applied to the baseline and rejection features to generate the first output. Long short-term memory (LSTM) was applied to the follow-up features to generate the second output. The outputs were passed through a linear layer to derive the final output.
Statistical analysis
Statistical analysis was performed using SAS version 9.4 (SAS Institute). Continuous variables were presented as mean and standard deviation for normally distributed data and as median and interquartile range for non-normally distributed data. After the distribution of data in the PTDM and No-PTDM groups was determined, the groups were compared using the independent t test or the Wilcoxon rank sum test. Categorical data were presented as percentages, and the two groups were compared using the chi-square test or Fisher exact test. A p-value of <0.05 was considered significant.
Results
Baseline characteristics
Among 3,213 KTRs, 497 patients (15.5%) developed PTDM during 1 year after KT. Baseline characteristics of PTDM group versus No-PTDM group are shown in Table 2. The PTDM group was older and had a higher BMI and triglyceride level and a lower total cholesterol level at baseline than the No-PTDM group. The prevalence of hypertension and cardiovascular disease and the proportion of hypertension or others as primary renal disease were higher in the PTDM group compared to the No-PTDM group.
Model performance of machine learning and deep learning
To analyze the statistical performance of the four machine learning models and deep learning model for PTDM prediction, we assessed AUC of ROC, accuracy, precision, recall (sensitivity), and F1 score (Table 3, Fig. 4). The XGBoost model showed the highest AUC (0.738; 95% CI, 0.677–0.798), which was followed by CatBoost (AUC = 0.727; 95% CI, 0.667–0.785), logistic regression (AUC = 0.719; 95% CI, 0.651–0.784), light GBM (AUC = 0.716; 95% CI, 0.654–0.777), and deep learning (AUC = 0.699; 95% CI, 0.639–0.759).
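The AUC can be read as the probability that a randomly chosen PTDM patient receives a higher predicted score than a randomly chosen No-PTDM patient. The sketch below computes it that way and attaches a percentile bootstrap interval. The text does not state how the 95% CIs were obtained, so the bootstrap is an assumption, and the function names and toy scores are hypothetical.

```python
import random

def roc_auc(labels, scores):
    """AUC as the fraction of positive-negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the AUC (assumed method, not stated in the text)."""
    rng = random.Random(seed)
    n, aucs = len(labels), []
    while len(aucs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:          # a resample needs both classes
            continue
        aucs.append(roc_auc(ys, [scores[i] for i in idx]))
    aucs.sort()
    lower = aucs[int(alpha / 2 * n_boot)]
    upper = aucs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Hypothetical toy predictions, not study data.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
lower, upper = bootstrap_ci(labels, scores, n_boot=200)
```

In the toy example three of the four positive-negative pairs are ranked correctly, giving an AUC of 0.75; the pairwise loop is quadratic and is meant for clarity rather than speed.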

ROC of machine learning and deep learning models.
(A) ROC of four machine learning models. (B) ROC of deep learning model.
AUC, area under the curve; GBM, gradient boosting machine; ROC, receiver operating characteristic; XGBoost, eXtreme Gradient Boosting.
While the CatBoost model exhibited the highest accuracy (0.87) and precision (0.79), the XGBoost model showed the highest AUC (0.738) and F1 score (0.42), with a modest accuracy (0.86). Overall, the XGBoost model showed the best performance among the five models.
The incorporation of transplant-related risk factors was trialed, with hepatitis B and C virus infection and cytomegalovirus immunity additionally evaluated as machine learning variables. This approach led to an overall decline in performance, including reductions in AUC across all models, resulting in the exclusion of these variables from the final model. The performance metrics from this assessment are detailed in Supplementary Table 1 (available online).
Feature importance of the XGBoost model
Fig. 5 shows the importance of each feature of the XGBoost model determined by SHAP method. Feature importance was highest for recipient age, which was followed by baseline BMI, triglyceride level at post-KT 6 months, HbA1c and high-density lipoprotein cholesterol (HDL-C) level at baseline, white blood cell count and serum uric acid level at 6 months, white blood cell count at baseline, tacrolimus trough level at discharge, and total cholesterol level at post-KT 6 months.
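SHAP values are based on Shapley values, which divide the difference between a patient's prediction and a reference prediction among the features. The brute-force sketch below computes exact Shapley values for a tiny hypothetical risk function by replacing absent features with baseline values. It illustrates the principle only; it is not the algorithm used for tree models in the study, and the function, coefficients, and feature values are invented for illustration.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values: features outside a coalition take their baseline value."""
    n = len(x)

    def value(coalition):
        return predict([x[i] if i in coalition else baseline[i] for i in range(n)])

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi

def toy_risk(z):
    """Hypothetical score with an age-BMI interaction; not the study model."""
    age, bmi, triglyceride = z
    return 0.01 * age + 0.02 * bmi + 0.0005 * age * bmi + 0.001 * triglyceride

x = [60, 30, 200]          # hypothetical patient: age, BMI, triglyceride
baseline = [45, 23, 120]   # hypothetical reference patient
phi = shapley_values(toy_risk, x, baseline)
print(phi, sum(phi), toy_risk(x) - toy_risk(baseline))
```

The contributions sum to the difference between the patient's score and the reference score, and averaging their absolute values over many patients gives a global importance ranking of the kind shown for the XGBoost model.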

Feature importance of XGBoost model.
(A) The average impact of each feature on the output of the XGBoost model. The vertical axis shows the features included in the model. The horizontal axis depicts the mean SHAP (SHapley Additive exPlanations) value, which represents the average impact of each feature. (B) The impact of each feature on the output of the XGBoost model. The vertical axis shows the features included in the model. The horizontal axis depicts the SHAP value, which represents the impact of each feature. Color indicates the feature value, from blue (low value) to red (high value).
BMI, body mass index; eGFR, estimated glomerular filtration rate; HbA1c, hemoglobin A1c; HDL-C, high-density lipoprotein cholesterol; hsCRP, high sensitivity C-reactive protein; KT, kidney transplantation; MPA, mycophenolic acid; PD, peritoneal dialysis; SBP, systolic blood pressure; WBC, white blood cell; Tac, tacrolimus; XGBoost, eXtreme Gradient Boosting.
Discussion
We developed a machine learning-based prediction model for PTDM within 1 year after KT using a nationwide multicenter cohort of KTRs. The model incorporated a total of 72 pre- and posttransplant variables. These variables are used in clinical practice and can be extracted from electronic medical records. Among four machine learning models and a deep learning model, XGBoost showed the best performance with an AUC of 0.738.
In this study, the incidence of PTDM within 1 year posttransplantation was 15.5%. The incidence of PTDM in KTRs is reported to range between 7% and 40% [1,2]. This wide range reflects varying diagnostic criteria and different timepoints of diagnosis after KT [13]. In contrast to the diagnostic criteria for type 2 DM, which use HbA1c, HbA1c thresholds are not recommended for diagnosing PTDM [12]. This is because the diagnostic threshold for HbA1c (≥6.5%) in type 2 DM is not related to the risk of diabetic retinopathy in KTRs [14], and because the HbA1c level is affected by high red blood cell turnover, anemia, and inhibition of red cell proliferation in the bone marrow due to immunosuppressants in the early posttransplant period [15]. Therefore, the fasting plasma glucose level or the oral glucose tolerance test is recommended for diagnosing PTDM [12]. Since the fasting plasma glucose level is collected at 6 months and annually after KT in KOTRY, PTDM was diagnosed according to the fasting glucose criteria. The incidence of PTDM shows a biphasic pattern, with a peak in the first few months after transplantation and a second surge over the next 2 to 3 years [16]. This study focused on the development of PTDM within 1 year of transplantation because most KTRs had not yet reached 2 or 3 years of posttransplant follow-up. An analysis of PTDM occurrence beyond 1 year posttransplantation was attempted but did not yield a significant conclusion. The small number of additional PTDM cases identified during the 2- to 4-year follow-up period, coupled with the increased number of missing values, diminished the predictive power at the current sample size. Hence, these cases were not included in the analysis and are left for future studies with a larger sample size.
Various risk factors are known to contribute to the development of PTDM: age, obesity, acute rejection, viral infections such as cytomegalovirus and hepatitis B and C, and immunosuppressants [4–7]. However, few studies have focused on building a prediction model for PTDM. Chakkera et al. [17] developed a risk prediction score from seven pretransplant variables using multivariable regression models among 316 KTRs; the AUCs were 0.70 to 0.72. Rodrigo et al. [18] applied the San Antonio diabetes prediction model and the Framingham Offspring Study–Diabetes Mellitus algorithm, both originally developed in nontransplant populations, to predict PTDM among 191 KTRs; they yielded AUCs of 0.807 and 0.756, respectively. These two studies included a relatively small number of subjects and did not consider factors related to immunosuppressive drugs. Recently, Cheng et al. [19] reported a risk prediction model among 495 KTRs using six variables in logistic regression. The variables in their model were age, BMI, tacrolimus level, transient hyperglycemia, delayed graft function, and acute rejection, and the AUC was 0.916. Our study differs in that we used machine learning algorithms and included a larger number of subjects and variables. The advantage of machine learning over traditional statistical methods is that it can analyze a large dataset and model complex, nonlinear relationships and interactions among many variables [8]. Therefore, machine learning may provide more accurate prediction models by capturing hidden interactions between features. In this study, 41 pretransplant variables and 31 posttransplant variables were selected from 236 variables. A PTDM prediction model was built using four machine learning algorithms and deep learning, a more advanced machine learning technique. Among the five models, XGBoost demonstrated the highest performance, with an AUC of 0.738 and an accuracy of 0.86.
Although the deep learning model achieved the highest recall/sensitivity (0.64), its precision (0.27) was substantially lower than that of the other models. PTDM is not an acute emergency requiring immediate intervention but rather a chronic condition requiring long-term management of elevated blood glucose. Accurate diagnosis and gradual treatment are therefore essential for PTDM. Given this nature of PTDM, it is more important to maintain diagnostic precision at an acceptable level than to focus solely on high sensitivity. The XGBoost model demonstrated the highest AUC and F1 score and high accuracy while also maintaining robust sensitivity and precision, making it the optimal choice.
In our XGBoost model, the 10 features with the highest importance were, in order, recipient age, baseline BMI, triglyceride level at post-KT 6 months, HbA1c and HDL-C level at baseline, white blood cell count and serum uric acid level at 6 months, white blood cell count at baseline, tacrolimus trough level at discharge, and total cholesterol level at post-KT 6 months. These features are consistent with previous reports, reflecting age, obesity, metabolic syndrome, and inflammation [6]. Among variables related to immunosuppression, tacrolimus level at discharge, rejection episode within 6 months, mycophenolic acid dosage at 6 months, and tacrolimus dosage at discharge were among the top 30 variables by importance. Calcineurin inhibitors and glucocorticoids are well-known predisposing factors for PTDM, whereas mycophenolic acid has not been reported to increase the risk of PTDM [6]. Since the cumulative dosage of corticosteroids was not collected in the KOTRY database, the effect of corticosteroids may have been underestimated. It is unclear why mycophenolic acid dosage at post-KT 6 months ranked highly in the model. One possible explanation is that variables with higher cardinality tend to receive higher feature importance.
There are limitations to this study. First, PTDM was diagnosed only according to the fasting plasma glucose level and not by the oral glucose tolerance test as recommended [12], which might have led to a lower incidence of PTDM. Furthermore, HbA1c, which could serve as a valuable criterion for diagnosing PTDM along with the fasting plasma glucose level, especially for patients with unknown oral glucose tolerance, was not included in the analysis because its follow-up data were unavailable in the KOTRY dataset. The inclusion of this variable could potentially have resulted in a higher AUC. Second, factors that might have affected hyperglycemia, such as the cumulative dosage of corticosteroids, time-averaged tacrolimus levels, or viral infection, were not included in the analysis because of a lack of data. We attempted to include hepatitis B and C virus infection and cytomegalovirus immunity in the models, but this did not improve the models’ performance. Third, external validation in different countries, races, and ethnicities is needed to make the prediction model generalizable. However, our study has advantages in that it used a nationwide multicenter database and employed machine learning and deep learning in building the prediction model. Variables before and after transplantation were included in the model, which makes our model useful in clinical practice. Moreover, immunosuppressants used in the current era were included in the analysis, which makes our model more practical than previous models in the literature.
In conclusion, we implemented machine learning to predict the development of PTDM within 1 year after KT in KTRs. Our model could aid clinicians in the early diagnosis, prevention, and counseling of PTDM. Risk factors for PTDM should be evaluated and individualized risk assessment should be performed during the pretransplantation work-up. This can lead to early diagnosis and prompt management of PTDM, thereby improving the clinical outcomes of KTRs.
Supplementary Materials
Supplementary data are available at Kidney Research and Clinical Practice online (https://doi.org/10.23876/j.krcp.24.113).
Notes
Conflicts of interest
All authors have no conflicts of interest to declare.
Funding
This research was supported by a grant from the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health and Welfare, Republic of Korea (grant number: HI23C047600). Additional funding was provided by a grant from the Patient-Centered Clinical Research Coordinating Center (PACEN), also supported by the Ministry of Health and Welfare, Republic of Korea (grant numbers: HI19C0481 and HC20C0054). Furthermore, this research was supported by the National Institute of Health (NIH) research project (2014-ER6301-00, 2014-ER6301-01, 2014-ER6301-02, 2017-ER6301-00, 2017-ER6301-01, 2017-ER6301-02, 2020-ER7201-00, 2020-ER7201-01, 2020-ER7201-02, 2023-ER0805-00, and 2023-ER0805-01).
Acknowledgments
The authors thank the members of the KOTRY Study Group (Appendix) for their contribution.
Data sharing statement
The data presented in this study are available from the corresponding author upon reasonable request.
Authors’ contributions
Conceptualization, Project administration: HEY, Sejoong Kim
Data curation, Formal analysis: SC, Sangwoong Kim
Funding acquisition: Sangwoong Kim, HEY, Sejoong Kim
Investigation: JCJ, HEY, Sejoong Kim
Methodology: Sangwoong Kim, HEY, Sejoong Kim
Resources: YHL, HM, JHL, JY, MSK
Software: Sangwoong Kim
Writing–original draft: SC, MRP, Sangwoong Kim, HEY
Writing–review & editing: All authors
All authors read and approved the final manuscript.