A novel approach to the relation of multi-pollutant effect and kidney dysfunction: data analysis from the Korean National Environmental Health Survey Cycle 3 (2015–2017)

Article information

Korean J Nephrol. 2025;.j.krcp.24.173
Publication date (electronic) : 2025 February 19
doi : https://doi.org/10.23876/j.krcp.24.173
1Graduate School of Public Health, Seoul National University, Seoul, Republic of Korea
2College of Artificial Intelligence, Ewha Womans University, Seoul, Republic of Korea
3Department of Internal Medicine, Keimyung University School of Medicine, Daegu, Republic of Korea
4Department of Internal Medicine, Hallym University Sacred Heart Hospital, Hallym University College of Medicine, Anyang, Republic of Korea
5Department of Internal Medicine, Dongguk University Ilsan Hospital, Dongguk University College of Medicine, Goyang, Republic of Korea
6Research Center for Chronic Disease and Environmental Medicine, Dongguk University College of Medicine, Gyeongju, Korea
7Department of Internal Medicine, Seoul National University Hospital, Seoul, Republic of Korea
8Department of Internal Medicine, SMG-SNU Boramae Medical Center, Seoul, Republic of Korea
9Department of Internal Medicine, Seoul National University College of Medicine, Seoul, Republic of Korea
10Department of Internal Medicine, Ulsan University Hospital, University of Ulsan College of Medicine, Ulsan, Republic of Korea
11Basic-Clinical Convergence Research Institute, University of Ulsan, Ulsan, Republic of Korea
Correspondence: Kyung Don Yoo Division of Nephrology, Department of Internal Medicine, Ulsan University Hospital, University of Ulsan College of Medicine, 25 Daehakbyeongwon-ro, Dong-gu, Ulsan 44030, Republic of Korea. E-mail: ykd9062@uuh.ulsan.kr, ykd9062@gmail.com
Kyungho Choi Department of Environmental Health Sciences, Graduate School of Public Health, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul 08826, Republic of Korea. E-mail: kyungho@snu.ac.kr
*Inae Lee and Junhyug Noh contributed equally to this study as co-first authors. †Kyungho Choi and Kyung Don Yoo contributed equally to this study as co-corresponding authors.
Received 2024 July 1; Revised 2024 December 6; Accepted 2024 December 23.

Abstract

Background

Traditional statistical models for estimating the impact of multiple environmental chemicals on kidney outcomes have limitations. This study aimed to evaluate the risk prediction of kidney disease in the general population using innovative methodologies.

Methods

Serum persistent organic pollutant (POP), urinary chemical, serum creatinine, and urinary albumin levels were measured in a subpopulation of adults (n = 1,266) drawn from the Korean National Environmental Health Survey Cycle 3 (n = 3,787). Various machine learning (ML) models, including bagging, ridge, lasso, and random forest, were used to predict chronic kidney disease (CKD) risk, and their results were compared with those of conventional logistic regression methods. Furthermore, the weighted quantile sum (WQS) approach, which assigns weights to mixture components, was employed to evaluate multi-pollutant effects. A presplit approach, in which the first split of the decision tree is fixed in advance, was used to incorporate existing domain knowledge.

Results

A total of 42 variables, including baseline characteristics and laboratory findings, were analyzed during the ML modeling process. The decision tree algorithm generally outperformed logistic regression in risk prediction. Based on the decision tree models, lipid-corrected polychlorinated biphenyl 153 (PCB153) emerged as the strongest predictor of CKD. PCB153 remained a significant predictor of CKD in middle-aged adults (<50 years; p = 0.01) following age stratification. Particularly among middle-aged adults with hemoglobin levels >13.25 g/dL, CKD risk was predicted to be 71.4% in the high serum PCB153 group.

Conclusion

Current observations showed that utilizing both WQS regression and ML-based predictions offers valuable insights. In the models, POPs, particularly PCB153, were identified as important risk factors for CKD in Korean adults.

Introduction

Increasing evidence suggests that environmental exposure to harmful chemicals, such as air pollutants and environmental chemicals, contributes to a decline in kidney function and incident chronic kidney disease (CKD) in humans [1,2]. In the general population, exposure to heavy metal chemicals, such as lead (Pb) and cadmium (Cd), has been associated with decreased glomerular filtration [3]. In an environment-wide association study, several chemicals, including Cd, Pb, volatile compounds, perfluorooctanoic acid, and polycyclic aromatic hydrocarbons (PAHs), were identified as potential risk factors for CKD [4]. Although cross-sectional studies on chemical exposure and renal function have accumulated [2], experimental studies to elucidate the underlying mechanisms are limited. Environmental chemicals, though not extensively studied for their direct impacts on renal function, are known to induce oxidative stress, which has been well-documented to disrupt podocyte structure and contribute to albuminuria, podocyte loss, and tubular injury [2,5,6]. This oxidative stress accelerates the progression of kidney dysfunction, particularly through the development of tubulointerstitial fibrosis [7]. Most epidemiologic studies investigating the associations of chemicals with CKD have generally been focused on a limited number of chemical risk factors; however, everyday life involves close and continuous contact with a diverse range of pollutants with different physicochemical and toxicological properties. Failure to include important chemical risk factors in the model could result in false conclusions, potentially because of commonalities of exposure sources or interactions among chemicals exposed together [8]. For example, in the presence of a confounding co-pollutant, effect estimates may be skewed, and studying linked exposures in different models may result in false-positive results [9,10]. 
Consequently, several studies have used statistical approaches that include multiple chemicals in the models and reported significant associations of environmental chemicals with CKD in different populations [4,11]. In recent studies on chemical exposure and outcomes, several statistical models, including weighted quantile sum (WQS) regression, quantile g-computation, and Bayesian kernel machine regression, were used to consider multi-pollutant exposure [12–14]. Several association models with multi-pollutant mixtures have been shown to address some of the shortcomings of the existing models using a single or limited number of chemicals [13]. These models have their strengths and limitations; therefore, they have been used simultaneously and compared in recent studies. Complex mixtures of environmental chemicals include components that may interact, possibly resulting in effects that cannot be reliably predicted based on the influence of individual chemicals [15]. However, the existing multi-pollutant models may have limitations in predicting such interactions. Machine learning (ML) is a type of artificial intelligence that employs algorithms to analyze data; however, its application in the context of environmental exposure remains limited [16]. Nevertheless, ML approaches have increasingly been utilized in risk assessment and epidemiology, particularly in studies involving multi-pollutant exposure [17]. Toxic substances usually interact in complex ways, resulting in multi-pollutant effects. These interactions can be effectively studied using models that combine various types of environmental chemical data. Our novel approach aims to address this complexity and provide a better understanding of the relationships between multi-pollutants and kidney dysfunction [18,19].

In this study, we aimed to determine whether the models, including decision tree and WQS regression, can identify complex patterns in the Korean National Environmental Health Survey (KoNEHS) Cycle 3 (2015–2017) data and make more accurate predictions regarding the relationship between environmental exposures and kidney dysfunction. Since multi-pollutant models have their strengths and weaknesses, the results of these approaches were subsequently interpreted together.

Methods

Study population and data source

A subset of the adult participants from the KoNEHS Cycle 3 (2015–2017) was included in this study. The KoNEHS is a population-based cross-sectional survey conducted in a 3-year cycle that is nationally representative of the Korean population [20]. In the KoNEHS, a two-stage proportionately stratified sampling strategy was used to recruit individuals (n = 3,787) across the nation (Fig. 1); the demographic characteristics of the survey participants are described elsewhere [20]. The adult participants were stratified by sex and age, and 1,295 participants were randomly selected based on this stratification by the National Institute of Environmental Research [20,21]. Furthermore, the subset of the population with data on kidney function markers and persistent organic pollutant (POP) levels was included in our study (n = 1,266). The study population was further stratified according to kidney function following the KDIGO (Kidney Disease: Improving Global Outcomes) definition [22] into the G1A1 (estimated glomerular filtration rate [eGFR] ≥90 mL/min/1.73 m2 and albumin-to-creatinine ratio [ACR] <3 mg/mmol), G1A2 (eGFR ≥90 mL/min/1.73 m2 and ACR of 3–30 mg/mmol), G3 (eGFR of 30–59 mL/min/1.73 m2), and A2 (ACR of 3–30 mg/mmol) groups. Blood and urine samples were collected, and chemicals and several clinical indicators were measured.

Figure 1.

Study flow chart for the inclusion of participants from the KoNEHS.

BMI, body mass index; CKD, chronic kidney disease; DEHP, di(2-ethylhexyl) phthalate; DM, diabetes mellitus; eGFR, estimated glomerular filtration rate; HTN, hypertension; KoNEHS, Korean National Environmental Health Survey; MECPP, mono(2-ethyl-5-carboxypentyl) phthalate; MEHHP, mono(2-ethyl-5-hydroxyhexyl) phthalate; MEOHP, mono(2-ethyl-5-oxohexyl) phthalate; POPs, persistent organic pollutants; UACR, urine albumin-to-creatinine ratio; ∑DEHP metabolites, molar sum of DEHP metabolites.

aThe study population was stratified by kidney function according to the KDIGO (Kidney Disease: Improving Global Outcomes) definition [22] into the G1A1 (eGFR ≥90 mL/min/1.73 m2 and UACR <3 mg/mmol), G1A2 (eGFR ≥90 mL/min/1.73 m2 and UACR of 3–30 mg/mmol), G3 (eGFR of 30–59 mL/min/1.73 m2), and A2 (UACR of 3–30 mg/mmol) groups. bThe concentrations of POPs were adjusted for lipid content, calculated as POPs concentration (pg/mL) divided by the total lipid concentration (mg/dL) and multiplied by 100, yielding values in ng/g. Consequently, the total cholesterol level was excluded from the input variables, as it is highly correlated with this adjusted POPs measure in model setting 2.

The Institutional Review Board (IRB) of Seoul National University exempted IRB approval of this subset (IRB No. E1911/002–008).

Chemicals

Among the measured chemicals, only those with >70% detection frequency (n = 29, including the sum of di(2-ethylhexyl) phthalate [DEHP] metabolites [∑DEHPm]; n = 31, including three individual DEHP metabolites) were evaluated in this study. These chemicals included phthalates (n = 8; three DEHP metabolites: mono(2-ethyl-5-hydroxyhexyl) phthalate [MEHHP], mono(2-ethyl-5-oxohexyl) phthalate, and mono(2-ethyl-5-carboxypentyl) phthalate; mono-n-butyl phthalate; monobenzyl phthalate [MBzP]; monocarboxyoctyl phthalate; monocarboxynonyl phthalate; and mono(3-carboxypropyl) phthalate); parabens (n = 3; methylparaben, ethylparaben, and propylparaben); bisphenol A (BPA) (n = 1); PAHs (n = 4; 1-hydroxypyrene, 2-naphthol, 2-hydroxyfluorene, and 1-hydroxyphenanthrene); volatile organic compounds (n = 2; trans, trans-muconic acid and benzylmercapturic acid); 3-phenoxybenzoic acid (n = 1); metals (n = 4; Pb and mercury [Hg] in blood, and Hg and Cd in urine); and POPs (n = 8; hexachlorobenzene, p,p’-dichlorodiphenyl trichloroethane, p,p’-dichlorodiphenyl dichloroethylene, and polychlorinated biphenyl [PCB] congeners [PCB52, PCB118, PCB138, PCB153, and PCB180]). Our analysis utilized two settings: setting 1 used the three individual DEHP metabolites, and setting 2 used the molar sum of DEHP metabolites (Fig. 1). Quality control for the analysis was performed following the protocols provided by the National Institute of Environmental Research of Korea [21,23].

Statistical analysis

First, serum and urinary chemical levels were standardized. Serum POP concentrations were corrected for blood lipid concentration. For urinary chemicals, covariate-adjusted standardization [24] was used because of the potential issues related to the use of urinary creatinine (Ucr) level for adjustment when the health outcome is eGFR [25–27]. Predicted Ucr ($\hat{U}_{cr}$) levels were calculated accounting for relevant covariates: age, sex, body mass index (BMI), and eGFR. Each urinary chemical concentration was then standardized by multiplying it by the ratio of the predicted to the measured Ucr, using the following equation:

Covariate-adjusted standardized concentration = urinary chemical concentration × $R_{Ucr}$,

where $R_{Ucr}$ denotes $\hat{U}_{cr}/U_{cr}$.
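The two standardization steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the example values are hypothetical, the regression model used to obtain the predicted Ucr is not reproduced (the predicted value is passed in as an input), and the lipid-correction formula is the one given in the footnote to Fig. 1.

```python
import math

def lipid_corrected(pop_pg_per_ml, total_lipid_mg_per_dl):
    """Lipid-corrected POP level in ng/g lipid: (pg/mL) / (mg/dL) * 100."""
    return pop_pg_per_ml / total_lipid_mg_per_dl * 100.0

def covariate_adjusted(conc, ucr_measured, ucr_predicted):
    """Urinary concentration multiplied by R_Ucr = predicted Ucr / measured Ucr."""
    return conc * (ucr_predicted / ucr_measured)

# Hypothetical values for one participant (not taken from the survey data).
pcb153 = lipid_corrected(pop_pg_per_ml=300.0, total_lipid_mg_per_dl=600.0)
mehhp = covariate_adjusted(conc=20.0, ucr_measured=2.0, ucr_predicted=1.0)

# Natural log transformation applied before modeling.
ln_pcb153 = math.log(pcb153)
ln_mehhp = math.log(mehhp)
```

In this toy case the measured Ucr is twice the predicted value, so the urinary concentration is halved, which is the intended correction for a more concentrated urine sample.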

Covariate-adjusted standardized and lipid-corrected chemicals were natural log-transformed (ln-transformed). To identify chemical risk factors for CKD, both WQS regression and ML techniques were used. The ML models include single and ensemble models. Single models include classification and regression trees (CART) and logistic regression, whereas ensemble models include bagging and random forest [28–30]. The performances of these methods were compared using the area under the curve (AUC) metric on a held-out test set. Regardless of the method, we split our data into a training (70%) and test (30%) set for the experiments using the ML approach. Owing to the limited quantity of data, we performed a five-fold cross-validation to prevent the overfitting of our model. After cross-validation, the model was reevaluated using the test set. The following two model settings were constructed for the association analyses: model 1 with multiple clinical and chemical variables (n = 41) and model 2 with a presplit approach employing domain knowledge (Fig. 1).
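The evaluation protocol (70/30 split, five-fold cross-validation on the training portion, AUC on the held-out set) can be illustrated with a minimal Python sketch. The helper names, the random seed, and the use of a simple unstratified shuffle are assumptions made for illustration; the text does not state how the split was drawn.

```python
import random

def train_test_split_indices(n, test_frac=0.3, seed=0):
    """Shuffle indices 0..n-1 and hold out a test fraction."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_frac)
    return idx[n_test:], idx[:n_test]

def k_fold(indices, k=5):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]

def auc(scores, labels):
    """Area under the ROC curve by pairwise comparison (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Split the 1,266 participants, then build five folds from the training part.
train_idx, test_idx = train_test_split_indices(1266)
folds = list(k_fold(train_idx, k=5))
```

The pairwise AUC is quadratic in the number of participants, which is acceptable at this sample size; a rank-based formula would be preferable for much larger datasets.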

Models for machine learning approach

The performance of each ML method in predicting CKD risk was compared based on the test results obtained by calculating the AUC under various parameter settings. Here, the selected model parameters included the validation techniques and ratios, test set size, training and test performance of the dataset, and validation of the dataset.

Logistic regression and regularization

Logistic regression is a widely utilized ML algorithm for classification tasks, operating as a specialized form of a generalized linear model (GLM). Unlike linear regression, which assumes that the dependent variable follows a normal distribution, logistic regression models the probability that the dependent variable belongs to a particular category (usually binary) based on a Bernoulli distribution. This algorithm uses the logit link, whose inverse maps a linear combination of independent variables to a probability between 0 and 1. The logistic regression equation is given by $\pi(X) = 1 / (1 + e^{-(\beta_0 + \beta^{\top} X)})$, where $\pi(X)$ indicates the probability of the dependent variable $y$ being in category 1, given the independent variables $X$ [31]. The training of a logistic regression model focuses on minimizing a cost function, $\mathrm{cost}(\hat{y}, y) = -y \log \hat{y} - (1 - y) \log(1 - \hat{y})$, where $\hat{y}$ is equivalent to $\pi(X)$. To mitigate overfitting, regularization techniques, such as the least absolute shrinkage and selection operator (lasso) and ridge regularization, were incorporated. These techniques adjust the cost function by adding a penalty term related to the magnitude of the coefficients ($\beta$). Lasso regularization uses the L1 norm, $\|\beta\|_1$, whereas ridge regularization uses the squared L2 norm, $\|\beta\|_2^2$. Furthermore, the influence of the regularization term is controlled by a hyperparameter ($\lambda$), allowing for the balance between the fit to the training data and the magnitude of the coefficients ($\beta$).
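As an illustration of the penalized cost described above, the following Python sketch evaluates the mean cross-entropy plus a lasso or ridge penalty for given coefficients. It is a minimal sketch, not the fitting procedure used in the study: the function names are hypothetical, the intercept is left unpenalized by assumption, and software packages may scale the penalty term differently.

```python
import math

def sigmoid(z):
    """Logistic function mapping a real number to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, beta0, beta):
    """pi(X): logistic transform of the linear predictor."""
    return sigmoid(beta0 + sum(b * xi for b, xi in zip(beta, x)))

def penalized_cost(X, y, beta0, beta, lam=0.0, penalty="ridge"):
    """Mean cross-entropy plus lam * penalty on beta (intercept unpenalized)."""
    loss = 0.0
    for xi, yi in zip(X, y):
        p = predict_proba(xi, beta0, beta)
        loss += -yi * math.log(p) - (1 - yi) * math.log(1 - p)
    if penalty == "lasso":
        reg = sum(abs(b) for b in beta)   # L1 norm
    else:
        reg = sum(b * b for b in beta)    # squared L2 norm
    return loss / len(y) + lam * reg
```

For the same coefficients, the lasso penalty grows linearly and the ridge penalty quadratically with coefficient size, which is why lasso tends to set small coefficients exactly to zero whereas ridge only shrinks them.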

Decision tree

Decision trees are popular and intuitive classification algorithms that are easy to implement and interpret compared with many other ML methods. Within the realm of decision trees, the CART [30] is particularly notable. CART constructs a binary tree, progressively branching out its nodes to enhance purity within the dataset. The Gini index serves as the primary metric for evaluating impurities at each node, guiding the tree’s development. The tree expansion is governed by specific stopping criteria encapsulated within hyperparameters, such as the complexity parameter and the maximum depth of the tree.
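The role of the Gini index in choosing a split can be shown with a short Python sketch for a single continuous predictor and a binary outcome. The function names and the midpoint rule for candidate thresholds are illustrative assumptions; stopping criteria such as the complexity parameter and maximum depth are not implemented here.

```python
def gini(labels):
    """Gini impurity of a set of binary labels (0/1)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_split(values, labels):
    """Return (threshold, weighted Gini) for the best rule 'value >= threshold'."""
    best = (None, gini(labels))
    uniq = sorted(set(values))
    for lo, hi in zip(uniq, uniq[1:]):
        thr = (lo + hi) / 2.0   # candidate threshold between adjacent values
        left = [y for v, y in zip(values, labels) if v < thr]
        right = [y for v, y in zip(values, labels) if v >= thr]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (thr, score)
    return best
```

A full tree repeats this search over every predictor at every node and recurses into the two child nodes until a stopping criterion is met.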

Bagging and random forest

We applied two ensemble techniques as follows: bootstrap aggregation (bagging) [29] and random forest [28]. Bagging trains multiple models on bootstrapped subsets of the data, combining their outputs through averaging (regression) or majority voting (classification). In contrast, random forest builds on bagging by randomly selecting features for each tree, enhancing model diversity and robustness. We used CART as a base learner for both bagging and random forest to improve the accuracy of the output predictions [30]. Finally, the complexity of these ensemble models was adjusted by varying the number of trees.
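The difference between the two ensembles can be sketched as follows. This is a simplified illustration, not the implementation used in the study: a one-split stump stands in for the CART base learner, the feature subset is drawn once per learner as described in the text (many random forest implementations instead draw a subset at every split), and all names are hypothetical.

```python
import random
from collections import Counter

def fit_stump(X, y, features):
    """Stand-in base learner: best single-feature threshold rule by training accuracy."""
    best = None
    for f in features:
        for thr in sorted({row[f] for row in X}):
            for sign in (1, -1):
                preds = [int(sign * (row[f] - thr) >= 0) for row in X]
                acc = sum(p == t for p, t in zip(preds, y))
                if best is None or acc > best[0]:
                    best = (acc, f, thr, sign)
    _, f, thr, sign = best
    return lambda row: int(sign * (row[f] - thr) >= 0)

def fit_ensemble(X, y, n_trees=25, max_features=None, seed=0):
    """Bagging if max_features is None; otherwise each learner also gets a random feature subset."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    models = []
    for _ in range(n_trees):
        rows = [rng.randrange(n) for _ in range(n)]   # bootstrap sample with replacement
        feats = list(range(p)) if max_features is None else rng.sample(range(p), max_features)
        models.append(fit_stump([X[i] for i in rows], [y[i] for i in rows], feats))
    return models

def predict(models, row):
    """Majority vote across the ensemble."""
    return Counter(m(row) for m in models).most_common(1)[0][0]
```

Model complexity is controlled here only through `n_trees`, mirroring the statement that the ensembles were tuned by varying the number of trees.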

Weighted quantile sum regression

WQS regression is a statistical method predominantly employed in the analysis of high-dimensional datasets, such as those arising from environmental exposure studies. The primary objective of this technique is variable selection, which allows the identification of relevant predictor variables in the context of GLMs with a specified link function. WQS has several advantages, including robustness to multicollinearity and higher accuracy, sensitivity, and specificity, over shrinkage methods or penalized regression techniques. Despite its benefits, WQS has some limitations that should be considered during its implementation. One potential drawback is the loss of information owing to quantile settings, which can lead to a reduced representation of the data. Additionally, because weights are assigned to each variable, determining the level of importance of each weight is necessary.

In the WQS regression model, each predictor variable is assigned a weight (ωi) with the range of 0–1, such that the sum of all weights (ωi) equals 1. The values of the predictor variables are divided into quartiles (qi), with each quartile representing a specific range, including the 1st, 2nd, 3rd, or 4th quartile (qi = 0, 1, 2, or 3). When implementing WQS regression in high-dimensional datasets, researchers should carefully evaluate the advantages and limitations of the method as well as the appropriate weighting scheme and quantile selection for their specific context. Covariates, including age, sex, BMI, medical history of diabetes mellitus (DM) or hypertension, average monthly household income in the past year, smoking status, and alcohol consumption status, were included.
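The construction of the WQS index from quartile scores and weights can be sketched in Python. The weights below are hypothetical and fixed by hand purely for illustration; in WQS regression they are estimated from the data. The quartile cut points use the default method of Python's `statistics.quantiles`, which is an assumption and may differ from the cut points used in the study.

```python
import statistics

def quartile_scores(values):
    """Score each value 0-3 according to the quartile of its own distribution."""
    cuts = statistics.quantiles(values, n=4)   # three cut points
    return [sum(v > c for c in cuts) for v in values]

def wqs_index(scores_by_chemical, weights):
    """Per-participant weighted quantile sum; weights lie in [0, 1] and sum to 1."""
    if any(w < 0 or w > 1 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must lie in [0, 1] and sum to 1")
    n = len(scores_by_chemical[0])
    return [sum(w * q[i] for w, q in zip(weights, scores_by_chemical)) for i in range(n)]

# Two hypothetical chemicals measured in eight participants.
chem_a = [1, 2, 3, 4, 5, 6, 7, 8]
chem_b = [8, 7, 6, 5, 4, 3, 2, 1]
scores = [quartile_scores(chem_a), quartile_scores(chem_b)]
index = wqs_index(scores, weights=[0.75, 0.25])
```

The resulting index would then enter the regression model as a single exposure term alongside the covariates, so that its coefficient reflects the overall mixture effect and the weights indicate each chemical's contribution.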

Presplit decision tree

The decision tree algorithm determines the split rules in a data-driven, greedy manner, aiming to maximize node purity at each step without considering the implications for lower-level nodes, which can result in suboptimal decisions. To mitigate this issue, we proposed a presplit decision tree approach that incorporates existing domain knowledge into the CART model. Because the risks associated with environmental substances vary depending on factors such as sex and age, we selected these two variables as factors affecting the associations between chemical exposure and CKD risk and enforced their use at the root node of the decision tree. Specifically, we first split the data based on one of the following two criteria: 1) whether age was ≥50 years or 2) whether the sex was male. Subsequently, we applied the CART algorithm [30] to each subset of data created by these presplits to build a decision tree model.
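The presplit idea amounts to fixing the root rule by hand and fitting a separate model on each branch. The Python sketch below illustrates only this routing logic: the placeholder learner simply returns the observed CKD proportion in its subgroup and stands in for the CART model, and the record fields and toy data are hypothetical.

```python
def fit_rate(rows):
    """Placeholder learner: predicts the subgroup's observed CKD proportion."""
    rate = sum(r["ckd"] for r in rows) / len(rows)
    return lambda row: rate

def fit_presplit(rows, rule, fit_fn=fit_rate):
    """Force `rule` as the root split, then fit one model per branch."""
    yes = [r for r in rows if rule(r)]
    no = [r for r in rows if not rule(r)]
    model_yes, model_no = fit_fn(yes), fit_fn(no)
    return lambda row: model_yes(row) if rule(row) else model_no(row)

def age_rule(r):
    return r["age"] >= 50

def sex_rule(r):
    return r["sex"] == "male"

# Hypothetical toy records, not survey data.
rows = [
    {"age": 35, "sex": "male", "ckd": 0},
    {"age": 42, "sex": "female", "ckd": 0},
    {"age": 47, "sex": "female", "ckd": 1},
    {"age": 48, "sex": "male", "ckd": 0},
    {"age": 55, "sex": "male", "ckd": 1},
    {"age": 63, "sex": "female", "ckd": 1},
    {"age": 70, "sex": "male", "ckd": 0},
    {"age": 71, "sex": "female", "ckd": 1},
]
by_age = fit_presplit(rows, age_rule)
by_sex = fit_presplit(rows, sex_rule)
```

Replacing `fit_rate` with a tree-fitting function would give one subtree per branch, which is the structure described in the text.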

Implementation details

The methods were implemented using R version 3.4 (R Foundation for Statistical Computing), utilizing the glmnet, rpart, randomForestSRC, and ipred packages to develop the ML models.

Results

Study population

When stratified by kidney function, the mean ages of the participants were 42.9 ± 13.4, 61.0 ± 12.7, and 55.9 ± 15.4 years in the G1A1 (n = 953), G2A1 (n = 215), and CKD (G3 or A2 [n = 98]) groups, respectively (Table 1). The CKD group exhibited a significantly higher prevalence of hypertension and DM and a lower household income than the non-CKD group. Male participants constituted 52.0% (n = 51) of the CKD group, and 38.8% of patients with CKD had DM (Table 1). BMI was significantly different among the groups, with values of 24.1 ± 3.6, 24.4 ± 3.0, and 25.5 ± 3.3 kg/m2 for the G1A1, G2A1, and CKD groups, respectively (p < 0.001). No significant differences in hemoglobin and total cholesterol levels were found between the groups.

Table 1.

Baseline characteristics of the enrolled participants from the KoNEHS

Comparison of machine learning algorithms for predicting chronic kidney disease risk

The performance of various ML models on the prediction of CKD risk is shown in Table 2 (setting 1: using three individual DEHP metabolites) and Table 3 (setting 2: using the molar sum of DEHP metabolites). Across both settings, the logistic regression model, which is a GLM, performed worst with AUCs of 0.6214 and 0.6240 for settings 1 and 2, respectively. In contrast, the random forest model consistently outperformed other methods, achieving AUCs of 0.6954 and 0.7204 in settings 1 and 2, respectively. Notably, random forest under setting 2, which used the molar sum of DEHP metabolites, provided the highest predictive performance with an AUC of 0.7204.

Table 2.

Performance of classification models for chronic kidney disease in the KoNEHS participants using model setting 1

Table 3.

Performance of classification models for chronic kidney disease in the KoNEHS participants using model setting 2

The superior performance of ensemble methods, including random forest and bagging, can be attributed to several factors. Random forest, as an ensemble model, constructs multiple decision trees and aggregates their predictions, which helps to reduce overfitting, a common issue in decision trees. Furthermore, it introduces randomness not only in the data used to build each tree (bootstrapping) but also in the features considered for splitting nodes, which improves robustness and generalization compared to simpler models like logistic regression. Logistic regression is a GLM that assumes a fixed relationship between the features and outcome, which may not capture complex interactions between the environmental pollutants and CKD risk. Conversely, random forest is nonparametric and can model intricate, nonlinear interactions between predictors, giving it a significant advantage in datasets with diverse features such as ours.

Another important observation is the performance of our presplit decision tree when participants were stratified by sex in setting 2. This model yielded an AUC of 0.7054, the second-best performance overall, surpassing the traditional decision tree model, which splits based solely on a data-driven approach. The presplit decision tree method demonstrates competitive performance with ensemble models including bagging and random forest, despite being a single model.

Performance of decision tree modeling for predicting chronic kidney disease risk

Given the benefit of interpretability, we examined the constructed decision tree, which achieved an AUC of 0.654 (Fig. 2); each leaf node indicates the predicted CKD risk. In the model, the lipid-corrected PCB153 value was the most relevant risk factor for CKD. The estimated CKD rate reached as high as 71.4% among the population with lipid-corrected PCB153 (ln-transformed) ≥4.0 and hemoglobin level ≥13.25 g/dL. The subsequent node in the decision tree was MEHHP, a DEHP metabolite: among adults with lipid-corrected PCB153 (ln-transformed) <4.0, CKD risk was predicted at 62.5% in the population with covariate-adjusted standardized MEHHP levels (ln-transformed) ≥4.9 (Fig. 2). Following MEHHP, urinary Cd appeared as the next node in the decision tree. Here, each substance underwent an appropriate standardized adjustment, which varied based on the measuring medium and substance group.

Figure 2.

Decision tree modeling for chronic kidney disease in the Korean National Environmental Health Survey participants.

Urinary chemical levels were covariate-adjusted standardized, serum polychlorinated biphenyl 153 (PCB153) was lipid-corrected, and both were then ln-transformed. The percentiles of these chemicals are indicated in parentheses.

AUC, area under the curve; CKD, chronic kidney disease; MEHHP, mono(2-ethyl-5-hydroxyhexyl) phthalate.

Results of the weighted quantile sum regression analysis

The WQS regression analyses showed that among all participants (n = 1,266), the chemical mixture was inversely associated with eGFR in the unadjusted analysis (β = –16.2; 95% confidence interval [CI], –17.8 to –14.6; p < 0.001), with PCB180 and PCB153 contributing high weights (Figs. 3, 4). Urinary Cd was one of the variables with the highest weight among the metals included in the model (Fig. 3). After age stratification, the WQS regression analyses revealed that serum PCB153 was a significant determinant of reduced kidney function in the younger age group (<50 years) (Fig. 5). In both age groups, WQS showed significant negative associations of the chemical mixture with eGFR (younger age group: β = –3.43 [95% CI, –6.1 to –0.70], p = 0.01 and older age group: β = –4.03 [95% CI, –7.9 to –0.20], p = 0.04) (Fig. 6).

Figure 3.

Weights of all measured pollutants in the Korean National Environmental Health Survey participants.

1-OHP, 1-hydroxypyrene; 1OHPhe, 1-hydroxyphenanthrene; 2-NAP, 2-naphthol; 2OH-Flu, 2-hydroxyfluorene; 3-PBA, 3-phenoxybenzoic acid; BMA, N-acetyl-S-(benzyl)-L-cysteine; BPA, bisphenol A; Cd, cadmium; EtP, ethylparaben; HCB, hexachlorobenzene; Hg, mercury; MBzP, monobenzyl phthalate; MCNP, monocarboxynonyl phthalate; MCOP, monocarboxyoctyl phthalate; MCPP, mono(3-carboxypropyl) phthalate; MeP, methylparaben; MnBP, mono-n-butyl phthalate; PCB118, polychlorinated biphenyl 118; PCB138, polychlorinated biphenyl 138; PCB153, polychlorinated biphenyl 153; PCB180, polychlorinated biphenyl 180; PCB52, polychlorinated biphenyl 52; Pb, lead; PrP, propylparaben; p,p′-DDE, p,p′-dichlorodiphenyldichloroethylene; p,p′-DDT, p,p′-dichlorodiphenyltrichloroethane; t,t-MA, trans, trans-muconic acid; ∑DEHPm, molar sum of DEHP metabolites.

Figure 4.

Unadjusted association analysis between WQS and eGFR.

β-coefficient = –16.2 (95% confidence interval, –17.8 to –14.6), p < 0.001.

WQS, weighted quantile sum; eGFR, estimated glomerular filtration rate.

Figure 5.

Weights of all measured pollutants in each age group.

(A) Age <50 years. (B) Age ≥50 years.

1-OHP, 1-hydroxypyrene; 1OHPhe, 1-hydroxyphenanthrene; 2-NAP, 2-naphthol; 2OH-Flu, 2-hydroxyfluorene; 3-PBA, 3-phenoxybenzoic acid; BMA, N-acetyl-S-(benzyl)-L-cysteine; BPA, bisphenol A; Cd, cadmium; EtP, ethylparaben; HCB, hexachlorobenzene; Hg, mercury; MBzP, monobenzyl phthalate; MCNP, monocarboxynonyl phthalate; MCOP, monocarboxyoctyl phthalate; MCPP, mono(3-carboxypropyl) phthalate; MeP, methylparaben; MnBP, mono-n-butyl phthalate; PCB118, polychlorinated biphenyl 118; PCB138, polychlorinated biphenyl 138; PCB153, polychlorinated biphenyl 153; PCB180, polychlorinated biphenyl 180; PCB52, polychlorinated biphenyl 52; Pb, lead; PrP, propylparaben; p,p′-DDE, p,p′-dichlorodiphenyldichloroethylene; p,p′-DDT, p,p′-dichlorodiphenyltrichloroethane; t,t-MA, trans, trans-muconic acid; ∑DEHPm, molar sum of DEHP metabolites.

Figure 6.

Adjusted association analysis between WQS and eGFR stratified by age group.

(A) Age <50 years. β-coefficient = –3.43 (95% confidence interval [CI], –6.1 to –0.7), p = 0.01. (B) Age ≥50 years. β-coefficient = –4.03 (95% CI, –7.9 to –0.2), p = 0.04. Models were adjusted for age, sex, body mass index, history of diabetes mellitus, hypertension, smoking status, alcohol consumption status, and household income.

WQS, weighted quantile sum; eGFR, estimated glomerular filtration rate.

Results of presplit decision tree for predicting chronic kidney disease risk

Here, we introduce a method that employs a presplit decision tree based on age or sex. As shown in Fig. 7, when age was set as the primary variable in the presplit decision tree model, the lipid-corrected PCB153 value emerged as the single most significant risk factor for CKD in the younger age group. The predicted CKD rate reached 71.4% if the lipid-corrected PCB153 (ln-transformed) was ≥4.0 and the hemoglobin level was >13.25 g/dL. The next node was identified as urinary MBzP, a phthalate metabolite. For participants aged ≥50 years with DM, the higher MBzP group (21.2%) had a reduced risk of CKD compared to the lower MBzP group (64.2%). The same variable (i.e., urine MBzP) appeared in three nodes within the decision tree.

Figure 7.

Presplit decision tree model for CKD in the Korean National Environmental Health Survey participants according to age group.

Urinary chemical levels were covariate-adjusted standardized, serum polychlorinated biphenyl 153 (PCB153) was lipid-corrected, and both were subsequently ln-transformed. The percentiles of these chemicals are indicated in brackets.

AUC, area under the curve; CKD, chronic kidney disease; DM, diabetes mellitus; MBzP, monobenzyl phthalate.

Discussion

While studies have suggested many chemicals as risk factors for CKD, epidemiological studies have rarely utilized association models incorporating multiple chemicals. Most existing epidemiological studies on chemical exposure and kidney disease are limited to a single or few chemicals in the models [9,10]. Although nonchemical factors that may influence the association are generally adjusted in the model, their interactions with chemical parameters cannot be easily addressed using the existing approaches. Here, we demonstrated that the decision tree model, when combined with existing domain knowledge (i.e., presplit decision tree model), can be applied to the general population to identify chemical risk factors for CKD. Additionally, the WQS regression model provides a robust method for identifying key chemical risk factors by assigning weights to each chemical based on its contribution to the outcome. This enables the incorporation of multiple chemicals in the analysis and helps overcome the limitations of traditional single-chemical models, which may fail to account for the combined effects of multiple exposures.

The results of multiple models, including the decision tree and WQS regression analyses, support the relationship of PCB153 with CKD risk in the general population (Figs. 2, 3). Specifically, PCB153 was identified as an important factor for predicting CKD risk in both the decision tree and WQS regression analyses. Information on PCB153 is limited in both epidemiological and experimental studies. Recent studies have also reported negative correlations of PCB153 and PCB180 with eGFR in the KoNEHS Cycle 3 (2015–2017), analyzing the same population as that of our study [4]. While Lee et al. [4] focused solely on the effects of POPs on kidney function, our study expands upon this by analyzing a broader range of environmental pollutants, allowing the assessment of the combined effects of multi-pollutants on kidney dysfunction. By examining a wider range of pollutants, our study provides a more nuanced understanding of the environmental factors contributing to kidney health and identifies new associations not apparent in previous studies limited to POPs. Apoptosis induced by PCB153 exposure in kidney tubular cells has been reported [32]. In our study, when age and sex were used to presplit the population in the decision tree analysis, PCB153 was identified as the most important chemical risk factor for CKD. Among nonchemical parameters, DM and hemoglobin levels emerged as important variables in the decision tree analysis.

Our observations show that age, sex, DM, and hemoglobin level are important variables that determine CKD risk in the decision tree analysis. Urinary Cd was also identified as a significant chemical substance in the decision tree and WQS regression analyses (Figs. 2, 3). In particular, when the WQS was set as the main variable in the decision tree model, Cd exposure was found to be a significant factor. The association between Cd exposure and kidney toxicity has been well established in population-based and experimental studies over a long period [33]. The underlying mechanism involves the generation of reactive oxygen species, leading to mitochondrial damage [33]. Compared with studies on other substances, a relatively substantial body of research exists on the relationship between Cd exposure and kidney disease; further studies are nonetheless needed in this domain. Identifying consistently significant nonpersistent urinary chemicals in decision tree analysis is challenging, which makes it difficult to draw clear conclusions from our analysis. Urinary BPA was also identified as a candidate chemical in the decision tree and WQS regression analyses (Figs. 2, 3). It showed a negative correlation with eGFR in the lasso regression model, appeared in the decision tree, and had relatively high importance in the WQS analysis. Although urinary MBzP appeared to be an important variable in the decision tree analysis, it exhibited a protective effect (Fig. 7). In participants aged ≥50 years with DM, the higher MBzP group had a reduced risk of CKD (64.2%). Moreover, the same variable (i.e., urinary MBzP) appeared in three nodes within the decision tree, which complicates interpretation. This could be a limitation when using a classification approach to interpret the results. Therefore, additional research is required to better understand the relationship between these chemicals and kidney function.

The association between BPA exposure and CKD has been inconsistent [34]. A recent systematic review found that while the association with kidney function indicators may be influenced by the urine dilution adjustment method, a reanalysis of the National Health and Nutrition Examination Survey data, along with other studies, supported the association between BPA and kidney disease [34]. Therefore, studies examining the associations between BPA and kidney function indicators may need to use urine dilution adjustment methods that are only weakly related to the outcome, such as 24-hour urine volume, or outcomes that are largely independent of the dilution adjustment method, such as cystatin C–based eGFR [35].

Several studies have reported significant health effects even at exposures to mixtures of substances below standard levels. Woodruff et al. [36] emphasized the importance of considering both existing or persistent exposure to environmental chemicals and inherent biological or disease susceptibility, which independently contribute to the risk of overt diseases. The demand for evaluating exposure to harmful substances has increased, resulting in studies using advanced statistical techniques, such as supervised principal component analysis, lasso, partial least-squares regression, and Bayesian model averaging.

In conclusion, to identify chemical risk factors for CKD in the general population, we employed several frequently used ML approaches. Our study shows that the decision tree model, after incorporating existing knowledge (e.g., the influence of demographic factors), can be used to identify chemical predictors of CKD risk. Based on the presplit decision tree model and WQS analysis, serum PCB153 was identified as the most important chemical risk factor for CKD, particularly among participants aged <50 years with hemoglobin levels >13.25 g/dL. Further validation in other populations is warranted to confirm the utility of this ML approach in association studies on CKD risk.

Notes

Conflicts of interest

All authors have no conflicts of interest to declare.

Funding

This survey was supported by grants from the National Institute of Environmental Research funded by the Ministry of Environment (MOE) of Korea (NIER-2019-01-02-082) and the National Research Foundation (NRF) of Korea (NRF-2022R1C1C2006982). Junhyug Noh was partly supported by Ewha Womans University research grant of 2023 and the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (MSIT) (No. RS-2022-00155966).

Data sharing statement

The data presented in this study are available from the corresponding author upon reasonable request.

Authors’ contributions

Conceptualization, Data curation, Methodology, Visualization: IL, JN, KC, KDY

Formal analysis: IL, JN

Investigation: All authors

Supervision: KC, KDY

Writing–original draft: IL, JN, KC, KDY

Writing–review & editing: All authors

All authors read and approved the final manuscript.

References

1. Chang PY, Li YL, Chuang TW, et al. Exposure to ambient air pollutants with kidney function decline in chronic kidney disease patients. Environ Res 2022;215:114289. 10.1016/j.envres.2022.114289. 36116493.
2. Kataria A, Trasande L, Trachtman H. The effects of environmental chemicals on renal function. Nat Rev Nephrol 2015;11:610–625. 10.1038/nrneph.2015.94. 26100504.
3. Kim Y, Lee BK. Associations of blood lead, cadmium, and mercury with estimated glomerular filtration rate in the Korean general population: analysis of 2008-2010 Korean National Health and Nutrition Examination Survey data. Environ Res 2012;118:124–129. 10.1016/j.envres.2012.06.003. 22749111.
4. Lee J, Lee I, Park JY, et al. Exposure to several polychlorinated biphenyls (PCBs) is associated with chronic kidney disease among general adults: Korean National Environmental Health Survey (KoNEHS) 2015-2017. Chemosphere 2022;303:134998. 10.1016/j.chemosphere.2022.134998. 35597461.
5. Piwkowska A, Rogacka D, Jankowski M, Kocbuch K, Angielski S. Hydrogen peroxide induces dimerization of protein kinase G type Iα subunits and increases albumin permeability in cultured rat podocytes. J Cell Physiol 2012;227:1004–1016. 10.1002/jcp.22810. 21520075.
6. Barisoni L, Schnaper HW, Kopp JB. A proposed taxonomy for the podocytopathies: a reassessment of the primary nephrotic diseases. Clin J Am Soc Nephrol 2007;2:529–542. 10.2215/CJN.04121206. 17699461.
7. Nath KA. Tubulointerstitial changes as a major determinant in the progression of renal damage. Am J Kidney Dis 1992;20:1–17. 10.1016/s0272-6386(12)80312-x. 1621674.
8. Kim S, Kim S, Won S, Choi K. Considering common sources of exposure in association studies: urinary benzophenone-3 and DEHP metabolites are associated with altered thyroid hormone balance in the NHANES 2007-2008. Environ Int 2017;107:25–32. 10.1016/j.envint.2017.06.013. 28651165.
9. Lazarevic N, Barnett AG, Sly PD, Knibbs LD. Statistical methodology in studies of prenatal exposure to mixtures of endocrine-disrupting chemicals: a review of existing approaches and new alternatives. Environ Health Perspect 2019;127:26001. 10.1289/ehp2207. 30720337.
10. Feron VJ, Groten JP. Toxicological evaluation of chemical mixtures. Food Chem Toxicol 2002;40:825–839. 10.1016/s0278-6915(02)00021-2. 11983277.
11. Lee J, Oh S, Kang H, et al. Environment-wide association study of CKD. Clin J Am Soc Nephrol 2020;15:766–775. 10.2215/cjn.06780619.
12. Di D, Zhang R, Zhou H, et al. Joint effects of phenol, chlorophenol pesticide, phthalate, and polycyclic aromatic hydrocarbon on bone mineral density: comparison of four statistical models. Environ Sci Pollut Res Int 2023;30:80001–80013. 10.1007/s11356-023-28065-z. 37289393.
13. Zhang Y, Dong T, Hu W, et al. Association between exposure to a mixture of phenols, pesticides, and phthalates and obesity: comparison of three statistical models. Environ Int 2019;123:325–336. 10.1016/j.envint.2018.11.076. 30557812.
14. Weng X, Tan Y, Fei Q, et al. Association between mixed exposure of phthalates and cognitive function among the U.S. elderly from NHANES 2011-2014: three statistical models. Sci Total Environ 2022;828:154362. 10.1016/j.scitotenv.2022.154362. 35259385.
15. De Falco M, Laforgia V. Combined effects of different Endocrine-Disrupting Chemicals (EDCs) on prostate gland. Int J Environ Res Public Health 2021;18:9772. 10.3390/ijerph18189772. 34574693.
16. European Commission, Directorate-General for Health and Consumers. Toxicity and assessment of chemical mixtures. European Commission; 2012.
17. Miller TH, Gallidabino MD, MacRae JI, et al. Machine learning for environmental toxicology: a call for integration and innovation. Environ Sci Technol 2018;52:12953–12955. 10.1021/acs.est.8b05382. 30338686.
18. Coull BA, Bobb JF, Wellenius GA, et al. Part 1: statistical learning methods for the effects of multiple air pollution constituents. Res Rep Health Eff Inst 2015;183 Pt 1-2:5–50. 26333238.
19. Park ES, Symanski E, Han D, Spiegelman C. Part 2: development of enhanced statistical methods for assessing health effects associated with an unknown number of major sources of multiple air pollutants. Res Rep Health Eff Inst 2015;183 Pt 1-2:51–113. 26333239.
20. Jung SK, Choi W, Kim SY, et al. Profile of environmental chemicals in the Korean population: results of the Korean National Environmental Health Survey (KoNEHS) cycle 3, 2015-2017. Int J Environ Res Public Health 2022;19:626. 10.3390/ijerph19020626. 35055445.
21. National Institute of Environmental Research (NIER). Manual for analysis of environmental pollutants in biological samples (organic chemicals). NIER; 2018.
22. Levin A, Ahmed SB, Carrero JJ, et al. Executive summary of the KDIGO 2024 Clinical Practice Guideline for the Evaluation and Management of Chronic Kidney Disease: known knowns and known unknowns. Kidney Int 2024;105:684–701. 10.1016/j.kint.2023.10.016. 38519239.
23. Jeon HL, Hong S, Choi K, Lee C, Yoo J. First nationwide exposure profile of major persistent organic pollutants among Korean adults and their determinants: Korean National Environmental Health Survey Cycle 3 (2015-2017). Int J Hyg Environ Health 2021;236:113779. 10.1016/j.ijheh.2021.113779. 34119853.
24. O’Brien KM, Upson K, Cook NR, Weinberg CR. Environmental chemicals in urine and blood: improving methods for creatinine and lipid adjustment. Environ Health Perspect 2016;124:220–227. 10.1289/ehp.1509693. 26219104.
25. Kang H, Lee J, Lee JP, Choi K. Urinary metabolites of organophosphate esters (OPEs) are associated with chronic kidney disease in the general US population, NHANES 2013-2014. Environ Int 2019;131:105034. 10.1016/j.envint.2019.105034. 31374441.
26. Bulka CM, Mabila SL, Lash JP, Turyk ME, Argos M. Arsenic and obesity: a comparison of urine dilution adjustment methods. Environ Health Perspect 2017;125:087020. 10.1289/ehp1202. 28858828.
27. Lee I, Park JY, Kim S, et al. Association of exposure to phthalates and environmental phenolics with markers of kidney function: Korean National Environmental Health Survey (KoNEHS) 2015-2017. Environ Int 2020;143:105877. 10.1016/j.envint.2020.105877. 32645486.
28. Breiman L. Random forests. Mach Learn 2001;45:5–32. 10.1023/A:1010933404324.
29. Breiman L. Bagging predictors. Mach Learn 1996;24:123–140. 10.1007/bf00058655.
30. Breiman L, Friedman J, Olshen RA, Stone CJ. Classification and regression trees. CRC Press; 1984.
31. Dobson AJ. An introduction to generalized linear models. Chapman & Hall; 1990.
32. Ghosh S, De S, Chen Y, Sutton DC, Ayorinde FO, Dutta SK. Polychlorinated biphenyls (PCB-153) and (PCB-77) absorption in human liver (HepG2) and kidney (HK2) cells in vitro: PCB levels and cell death. Environ Int 2010;36:893–900. 10.1016/j.envint.2010.06.010. 20723988.
33. Gobe G, Crane D. Mitochondria, reactive oxygen species and cadmium toxicity in the kidney. Toxicol Lett 2010;198:49–55. 10.1016/j.toxlet.2010.04.013. 20417263.
34. Moreno-Gómez-Toledano R, Arenas MI, Vélez-Vélez E, et al. Bisphenol A exposure and kidney diseases: systematic review, meta-analysis, and NHANES 03-16 study. Biomolecules 2021;11:1046. 10.3390/biom11071046. 34356670.
35. Weaver VM, Kotchmar DJ, Fadrowski JJ, Silbergeld EK. Challenges for environmental epidemiology research: are biomarker concentrations altered by kidney function or urine concentration adjustment? J Expo Sci Environ Epidemiol 2016;26:1–8. 10.1038/jes.2015.8. 25736163.
36. Woodruff TJ, Zeise L, Axelrad DA, et al. Meeting report: moving upstream-evaluating adverse upstream end points for improved risk assessment and decision-making. Environ Health Perspect 2008;116:1568–1575. 10.1289/ehp.11516. 19057713.

Figure 1.

Study flow chart for the inclusion of participants from the KoNEHS.

BMI, body mass index; CKD, chronic kidney disease; DEHP, di(2-ethylhexyl) phthalate; DM, diabetes mellitus; eGFR, estimated glomerular filtration rate; HTN, hypertension; KoNEHS, Korean National Environmental Health Survey; MECPP, mono(2-ethyl-5-carboxypentyl) phthalate; MEHHP, mono(2-ethyl-5-hydroxyhexyl) phthalate; MEOHP, mono(2-ethyl-5-oxohexyl) phthalate; POPs, persistent organic pollutants; UACR, urine albumin-to-creatinine ratio; ∑DEHP metabolites, molar sum of DEHP metabolites.

ᵃThe study population was stratified by kidney function according to the KDIGO (Kidney Disease: Improving Global Outcomes) definition [22] into the G1A1 (eGFR ≥90 mL/min/1.73 m² and UACR <3 mg/mmol), G1A2 (eGFR ≥90 mL/min/1.73 m² and UACR of 3–30 mg/mmol), G3 (eGFR of 30–59 mL/min/1.73 m²), and A2 (UACR of 3–30 mg/mmol) groups.

ᵇThe concentrations of POPs were adjusted for lipid content, calculated as POPs concentration (pg/mL) divided by the total lipid concentration (mg/dL) and multiplied by 100, yielding values in ng/g. Consequently, the total cholesterol level was excluded from the input variables, as it is highly correlated with this adjusted POPs measure in model setting 2.
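
The lipid adjustment described above can be written as a short Python function. This is a sketch for illustration; the function name `lipid_adjust` and the example values are hypothetical, and the comments show why the factor of 100 yields ng/g.

```python
def lipid_adjust(pop_pg_per_ml, total_lipid_mg_per_dl):
    """Return a lipid-adjusted POP concentration in ng/g lipid.

    Unit check: 1 pg/mL = 1e-12 g/mL and 1 mg/dL = 1e-5 g/mL, so the raw
    ratio is in units of 1e-7 g/g = 100 ng/g; multiplying by 100 gives ng/g.
    """
    if total_lipid_mg_per_dl <= 0:
        raise ValueError("total lipid concentration must be positive")
    return pop_pg_per_ml / total_lipid_mg_per_dl * 100.0


# Hypothetical example: 150 pg/mL of a POP with 600 mg/dL total lipids.
adjusted = lipid_adjust(150.0, 600.0)  # 25.0 ng/g lipid
```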

Figure 2.

Decision tree modeling for chronic kidney disease in the Korean National Environmental Health Survey participants.

Urinary chemical levels were covariate-adjusted standardized, serum polychlorinated biphenyl 153 (PCB153) was lipid-corrected, and both were then ln-transformed. The percentiles of these chemicals are indicated in parentheses.

AUC, area under the curve; CKD, chronic kidney disease; MEHHP, mono(2-ethyl-5-hydroxyhexyl) phthalate.

Figure 3.

Weights of all measured pollutants in the Korean National Environmental Health Survey participants.

1-OHP, 1-hydroxypyrene; 1OHPhe, 1-hydroxyphenanthrene; 2-NAP, 2-naphthol; 2OH-Flu, 2-hydroxyfluorene; 3-PBA, 3-phenoxybenzoic acid; BMA, N-acetyl-S-(benzyl)-L-cysteine; BPA, bisphenol A; Cd, cadmium; EtP, ethylparaben; HCB, hexachlorobenzene; Hg, mercury; MBzP, monobenzyl phthalate; MCNP, monocarboxynonyl phthalate; MCOP, monocarboxyoctyl phthalate; MCPP, mono(3-carboxypropyl) phthalate; MeP, methylparaben; MnBP, mono-n-butyl phthalate; PCB118, polychlorinated biphenyl 118; PCB138, polychlorinated biphenyl 138; PCB153, polychlorinated biphenyl 153; PCB180, polychlorinated biphenyl 180; PCB52, polychlorinated biphenyl 52; Pb, lead; PrP, propylparaben; p,p′-DDE, p,p′-dichlorodiphenyldichloroethylene; p,p′-DDT, p,p′-dichlorodiphenyltrichloroethane; t,t-MA, trans, trans-muconic acid; ∑DEHPm, molar sum of DEHP metabolites.

Figure 4.

Unadjusted association analysis between WQS and eGFR.

β-coefficient = –16.2 (95% confidence interval, –17.8 to –14.6), p < 0.001.

WQS, weighted quantile sum; eGFR, estimated glomerular filtration rate.

Figure 5.

Weights of all measured pollutants in each age group.

(A) Age <50 years. (B) Age ≥50 years.

1-OHP, 1-hydroxypyrene; 1OHPhe, 1-hydroxyphenanthrene; 2-NAP, 2-naphthol; 2OH-Flu, 2-hydroxyfluorene; 3-PBA, 3-phenoxybenzoic acid; BMA, N-acetyl-S-(benzyl)-L-cysteine; BPA, bisphenol A; Cd, cadmium; EtP, ethylparaben; HCB, hexachlorobenzene; Hg, mercury; MBzP, monobenzyl phthalate; MCNP, monocarboxynonyl phthalate; MCOP, monocarboxyoctyl phthalate; MCPP, mono(3-carboxypropyl) phthalate; MeP, methylparaben; MnBP, mono-n-butyl phthalate; PCB118, polychlorinated biphenyl 118; PCB138, polychlorinated biphenyl 138; PCB153, polychlorinated biphenyl 153; PCB180, polychlorinated biphenyl 180; PCB52, polychlorinated biphenyl 52; Pb, lead; PrP, propylparaben; p,p′-DDE, p,p′-dichlorodiphenyldichloroethylene; p,p′-DDT, p,p′-dichlorodiphenyltrichloroethane; t,t-MA, trans, trans-muconic acid; ∑DEHPm, molar sum of DEHP metabolites.

Figure 6.

Adjusted association analysis between WQS and eGFR stratified by age group.

(A) Age <50 years. β-coefficient = –3.43 (95% confidence interval [CI], –6.1 to –0.7), p = 0.01. (B) Age ≥50 years. β-coefficient = –4.03 (95% CI, –7.9 to –0.2), p = 0.04. Models were adjusted for age, sex, body mass index, history of diabetes mellitus, hypertension, smoking status, alcohol consumption status, and household income.

WQS, weighted quantile sum; eGFR, estimated glomerular filtration rate.

Figure 7.

Presplit decision tree model for CKD in the Korean National Environmental Health Survey participants according to age group.

Urinary chemical levels were covariate-adjusted standardized, serum polychlorinated biphenyl 153 (PCB153) was lipid-corrected, and both were subsequently ln-transformed. The percentiles of these chemicals are indicated in brackets.

AUC, area under the curve; CKD, chronic kidney disease; DM, diabetes mellitus; MBzP, monobenzyl phthalate.

Table 1.

Baseline characteristics of the enrolled participants from the KoNEHS

| Characteristic | G1A1 group | G2A1 group | CKDᵃ group (above G3 or A2) | p-value |
|---|---|---|---|---|
| No. of patients | 953 | 215 | 98 | |
| Age (yr) | 42.9 ± 13.4 | 61.0 ± 12.7 | 55.9 ± 15.4 | <0.001 |
| Male sex | 458 (48.1) | 120 (55.8) | 51 (52.0) | 0.11 |
| Body mass index (kg/m²) | 24.1 ± 3.6 | 24.4 ± 3.0 | 25.5 ± 3.3 | <0.001 |
| Proportion of obeseᵇ participants | 321 (33.9) | 85 (42.5) | 50 (53.8) | <0.001 |
| Hypertension | 83 (8.7) | 72 (33.5) | 34 (34.7) | <0.001 |
| Diabetes mellitus | 52 (5.5) | 38 (17.7) | 38 (38.8) | <0.001 |
| Smoking status | | | | 0.002 |
|   Never-smoker | 587 (61.6) | 124 (57.7) | 61 (62.2) | |
|   Past smoker | 158 (16.6) | 52 (24.2) | 27 (27.6) | |
|   Current smoker | 208 (21.8) | 39 (18.1) | 10 (10.2) | |
| Alcohol consumption status | | | | |
|   Current alcohol consumer | 823 (86.4) | 157 (73.0) | 86 (87.8) | <0.001 |
| Household income (Korean won/mo) | | | | <0.001 |
|   Under 1,000,000 | 70 (7.3) | 56 (26.0) | 23 (23.5) | |
|   ~2,000,000 | 139 (14.6) | 45 (20.9) | 21 (21.4) | |
|   ~3,000,000 | 214 (22.5) | 34 (15.8) | 23 (23.5) | |
|   ~5,000,000 | 308 (32.3) | 58 (27.0) | 19 (19.4) | |
|   ~7,000,000 | 142 (14.9) | 13 (6.0) | 9 (9.2) | |
|   Over 7,000,000 | 74 (7.8) | 8 (3.7) | 3 (3.1) | |
|   Unknown | 6 (0.6) | 1 (0.5) | 0 (0) | |
| Hemoglobin (g/dL) | 13.7 ± 1.4 | 13.7 ± 1.4 | 13.6 ± 1.4 | 0.65 |
| Total cholesterol (mg/dL) | 186.2 ± 35.0 | 181.6 ± 36.1 | 182.8 ± 36.8 | 0.17 |
| Serum creatinine (mg/dL) | 0.72 ± 0.15 | 0.91 ± 0.16 | 0.95 ± 0.53 | <0.001 |
| eGFR (mL/min/1.73 m²) | 108.9 ± 11.8 | 80.9 ± 7.9 | 85.6 ± 26.6 | <0.001 |
| Urinary ACR (mg/g) | 5.5 ± 5.2 | 6.9 ± 6.42 | 125.6 ± 223.0 | <0.001 |

Data are expressed as mean ± standard deviation or numbers (%).

ACR, albumin-to-creatinine ratio; CKD, chronic kidney disease; eGFR, estimated glomerular filtration rate; KoNEHS, Korean National Environmental Health Survey.

ᵃCKD composite is defined as reduced eGFR group (G3, G4, G5) or urinary ACR group (A2, A3).

ᵇBody mass index ≥25 kg/m².

Table 2.

Performance of classification models for chronic kidney disease in the KoNEHS participants using model setting 1

| Model | Hyperparameter | AUC | 95% CI |
|---|---|---|---|
| Random forest | ntree = 200 | 0.6954 | 0.5990–0.7918 |
| Decision tree | cp = –1, maxdepth = 10 | 0.6752 | 0.5761–0.7743 |
| Ridge | lambda = 0.02 | 0.6657 | 0.5600–0.7714 |
| Bagging | nbagg = 120 | 0.6469 | 0.5394–0.7545 |
| Lasso | lambda = 3e-04 | 0.6292 | 0.5287–0.7297 |
| Logistic regression | None | 0.6214 | 0.5218–0.7210 |

Names of all hyperparameters correspond directly to those used in their respective R packages. We allocated 30% of the dataset as the test set (n = 379) and 25% of the remaining dataset as the validation set. We eliminated the observations with missing values (n = 7).

AUC, area under the curve; CI, confidence interval; KoNEHS, Korean National Environmental Health Survey.
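
The allocation described in the table note (30% held out for testing, then 25% of the remainder for validation) can be sketched in Python as follows. The function name, the random seed, and the truncating rounding rule are assumptions, as is the use of simple unstratified shuffling; the total of 1,266 is the sum of the group sizes in Table 1, and whether observations with missing values were removed before or after splitting is not specified here.

```python
import random


def three_way_split(n, test_frac=0.30, val_frac_of_rest=0.25, seed=0):
    """Shuffle indices 0..n-1 and split them into train/validation/test sets."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # assumed: simple random shuffling
    n_test = int(n * test_frac)  # assumed: fractional sizes are truncated
    n_val = int((n - n_test) * val_frac_of_rest)
    test = idx[:n_test]
    val = idx[n_test:n_test + n_val]
    train = idx[n_test + n_val:]
    return train, val, test


# 953 + 215 + 98 participants across the three groups in Table 1.
train, val, test = three_way_split(1266)
```

Under these assumptions, the test set contains 379 observations, which agrees with the size reported in the table note; the validation and training sizes produced by the sketch are not reported in the article and should not be read as the study's values.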

Table 3.

Performance of classification models for chronic kidney disease in the KoNEHS participants using model setting 2

| Model | Hyperparameter | AUC | 95% CI |
|---|---|---|---|
| Random forest | ntree = 100 | 0.7204 | 0.6151–0.8258 |
| Presplit decision tree (male sex) | cp.true = –1, md.true = 2, cp.false = –1, md.false = 8 | 0.7054 | 0.6214–0.7894 |
| Decision tree | cp = –1, maxdepth = 6 | 0.6750 | 0.5633–0.7867 |
| Presplit decision tree (age ≥50 yr) | cp.true = –1, md.true = 6, cp.false = –1, md.false = 8 | 0.6683 | 0.5874–0.7491 |
| Ridge | lambda = 0.01 | 0.6567 | 0.5523–0.7612 |
| WQS | adjust = 1 / bootstrap = 100 | 0.6389 | 0.5445–0.7333 |
| Bagging | nbagg = 100 | 0.6323 | 0.5231–0.7416 |
| Lasso | lambda = 3e-04 | 0.6290 | 0.5277–0.7302 |
| Logistic regression | None | 0.6240 | 0.5231–0.7250 |

Names of all hyperparameters correspond directly to those used in their respective R packages. The hyperparameter set labeled as “true” corresponds to the tree whose condition (in parentheses) is true. We allocated 30% of the dataset as the test set (n = 379) and 25% of the remaining dataset as the validation set. We eliminated the observations with missing values (n = 7).

AUC, area under the curve; CI, confidence interval; KoNEHS, Korean National Environmental Health Survey; WQS, weighted quantile sum.
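
For readers unfamiliar with the metric, the AUC equals the probability that a randomly chosen case receives a higher predicted score than a randomly chosen non-case. The Python sketch below computes it by pairwise comparison and adds an approximate 95% CI using the Hanley and McNeil standard error. The CI method used for the tables is not stated in this section, so this formula is a swapped-in assumption, and the scores and labels are synthetic.

```python
import math


def auc_score(scores, labels):
    """AUC as the fraction of case/non-case pairs ranked correctly (ties = 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def hanley_mcneil_ci(auc, n_pos, n_neg, z=1.96):
    """Approximate CI from the Hanley-McNeil standard error (assumed method)."""
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)
    var = (auc * (1.0 - auc)
           + (n_pos - 1) * (q1 - auc ** 2)
           + (n_neg - 1) * (q2 - auc ** 2)) / (n_pos * n_neg)
    se = math.sqrt(var)
    return auc - z * se, auc + z * se


# Synthetic predictions, for illustration only (not KoNEHS data).
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 0, 1, 0, 0]
auc = auc_score(scores, labels)  # 5 of 6 case/non-case pairs ranked correctly
lower, upper = hanley_mcneil_ci(auc, n_pos=2, n_neg=3)
```

With a test set that contains relatively few CKD cases, intervals of this kind are wide, which is consistent with the overlapping CIs across models in the tables.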