A novel approach to the relation of multi-pollutant effect and kidney dysfunction: data analysis from the Korean National Environmental Health Survey Cycle 3 (2015–2017)
Abstract
Background
Traditional statistical models for estimating the impact of multiple environmental chemicals on kidney outcomes have limitations. This study aimed to evaluate the risk prediction of kidney disease in the general population using innovative methodologies.
Methods
Serum persistent organic pollutant (POP), urinary chemical, serum creatinine, and urinary albumin levels were measured in a subpopulation of adults (n = 1,266) drawn from the Korean National Environmental Health Survey Cycle 3 (n = 3,787). Various machine learning (ML) models, including bagging, ridge, lasso, and random forest, were used to predict chronic kidney disease (CKD) risk, and their results were compared with those of conventional logistic regression methods. Furthermore, the weighted quantile sum (WQS) approach, which assigns weights to mixture components, was employed to evaluate multi-pollutant effects. Presplit was attempted to incorporate existing domain knowledge.
Results
A total of 42 variables, including baseline characteristics and laboratory findings, were analyzed during the ML modeling process. The decision tree algorithm generally outperformed logistic regression in risk prediction. Based on the decision tree models, lipid-corrected polychlorinated biphenyl 153 (PCB153) emerged as the strongest predictor of CKD. PCB153 remained a significant predictor of CKD in middle-aged adults (<50 years; p = 0.01) following age stratification. Particularly among middle-aged adults with hemoglobin levels >13.25 g/dL, CKD risk was predicted to be 71.4% in the high serum PCB153 group.
Conclusion
Current observations showed that utilizing both WQS regression and ML-based predictions offers valuable insights. In the models, POPs, particularly PCB153, were identified as important risk factors for CKD in Korean adults.
Introduction
Increasing evidence suggests that environmental exposure to harmful chemicals, such as air pollutants and environmental chemicals, contributes to a decline in kidney function and incident chronic kidney disease (CKD) in humans [1,2]. In the general population, exposure to heavy metal chemicals, such as lead (Pb) and cadmium (Cd), has been associated with decreased glomerular filtration [3]. In an environment-wide association study, several chemicals, including Cd, Pb, volatile compounds, perfluorooctanoic acid, and polycyclic aromatic hydrocarbons (PAHs), were identified as potential risk factors for CKD [4]. Although cross-sectional studies on chemical exposure and renal function have accumulated [2], experimental studies to elucidate the underlying mechanisms are limited. Environmental chemicals, though not extensively studied for their direct impacts on renal function, are known to induce oxidative stress, which has been well-documented to disrupt podocyte structure and contribute to albuminuria, podocyte loss, and tubular injury [2,5,6]. This oxidative stress accelerates the progression of kidney dysfunction, particularly through the development of tubulointerstitial fibrosis [7]. Most epidemiologic studies investigating the associations of chemicals with CKD have generally been focused on a limited number of chemical risk factors; however, everyday life involves close and continuous contact with a diverse range of pollutants with different physicochemical and toxicological properties. Failure to include important chemical risk factors in the model could result in false conclusions, potentially because of commonalities of exposure sources or interactions among chemicals exposed together [8]. For example, in the presence of a confounding co-pollutant, effect estimates may be skewed, and studying linked exposures in different models may result in false-positive results [9,10]. 
Consequently, several studies have employed statistical approaches that incorporate multiple chemicals in the models and reported significant associations of environmental chemicals with CKD in different populations [4,11]. In recent studies on chemical exposure and outcomes, several statistical models, including weighted quantile sum (WQS) regression, the quantile g-computation model, and Bayesian kernel machine regression, were used to consider multi-pollutant exposure [12–14]. Several association models with multi-pollutant mixtures have been shown to address some of the shortcomings of the existing models using a single or limited number of chemicals [13]. These models have their strengths and limitations; therefore, they have been used simultaneously and compared in recent studies. Complex mixtures of environmental chemicals include components that may interact, possibly resulting in effects that cannot be reliably predicted based on the influence of individual chemicals [15]. However, the existing multi-pollutant models may have limitations in predicting such interactions. Machine learning (ML) is a type of artificial intelligence that employs algorithms to analyze data. Although its application in the context of environmental exposure remains relatively scarce [16], the ML approach has increasingly been utilized in risk assessment and epidemiology, particularly in studies involving multi-pollutant exposure [17]. Toxic substances usually interact in complex ways, resulting in multi-pollutant effects. These interactions can be effectively studied using models that combine various types of environmental chemical data. Our novel approach aims to address this complexity and provide a better understanding of the relationships between multi-pollutants and kidney dysfunction [18,19].
In this study, we aimed to determine whether the models, including decision tree and WQS regression, can identify complex patterns in the Korean National Environmental Health Survey (KoNEHS) Cycle 3 (2015–2017) data and make more accurate predictions regarding the relationship between environmental exposures and kidney dysfunction. Since multi-pollutant models have their strengths and weaknesses, the results of these approaches were subsequently interpreted together.
Methods
Study population and data source
A subset of the adult participants from the KoNEHS Cycle 3 (2015–2017) was included in this study. The KoNEHS is a population-based cross-sectional survey conducted in a 3-year cycle that is nationally representative of the Korean population [20]. In the KoNEHS, a two-stage proportionately stratified sampling strategy was used to recruit individuals (n = 3,787) across the nation (Fig. 1); the demographic characteristics of the survey participants are described elsewhere [20]. The adult participants were stratified by sex and age, and 1,295 participants were randomly selected based on this stratification by the National Institute of Environmental Research [20,21]. Furthermore, the subset of the population with data on kidney function markers and persistent organic pollutant (POP) levels was included in our study (n = 1,266). The study population was further stratified according to kidney function following the KDIGO (Kidney Disease: Improving Global Outcomes) definition [22] into the G1A1 (estimated glomerular filtration rate [eGFR] ≥90 mL/min/1.73 m2 and albumin-to-creatinine ratio [ACR] <3 mg/mmol), G1A2 (eGFR ≥90 mL/min/1.73 m2 and ACR of 3–30 mg/mmol), G3 (eGFR of 30–59 mL/min/1.73 m2), and A2 (ACR of 3–30 mg/mmol) groups. Blood and urine samples were collected, and chemicals and several clinical indicators were measured.

Figure 1. Study flow chart for the inclusion of participants from the KoNEHS.
BMI, body mass index; CKD, chronic kidney disease; DEHP, di(2-ethylhexyl) phthalate; DM, diabetes mellitus; eGFR, estimated glomerular filtration rate; HTN, hypertension; KoNEHS, Korean National Environmental Health Survey; MECPP, mono(2-ethyl-5-carboxypentyl) phthalate; MEHHP, mono(2-ethyl-5-hydroxyhexyl) phthalate; MEOHP, mono(2-ethyl-5-oxohexyl) phthalate; POPs, persistent organic pollutants; UACR, urine albumin-to-creatinine ratio; ∑DEHP metabolites, molar sum of DEHP metabolites.
a) The study population was stratified by kidney function according to the KDIGO (Kidney Disease: Improving Global Outcomes) definition [22] into the G1A1 (eGFR ≥90 mL/min/1.73 m2 and UACR <3 mg/mmol), G1A2 (eGFR ≥90 mL/min/1.73 m2 and UACR of 3–30 mg/mmol), G3 (eGFR of 30–59 mL/min/1.73 m2), and A2 (UACR of 3–30 mg/mmol) groups. b) The concentrations of POPs were adjusted for lipid content, calculated as POPs concentration (pg/mL) divided by the total lipid concentration (mg/dL) and multiplied by 100, yielding values in ng/g. Consequently, the total cholesterol level was excluded from the input variables, as it is highly correlated with this adjusted POPs measure in model setting 2.
The Institutional Review Board (IRB) of Seoul National University exempted IRB approval of this subset (IRB No. E1911/002–008).
Chemicals
Among the measured chemicals, only those with >70% detection frequency (n = 29, including the sum of di(2-ethylhexyl) phthalate [DEHP] metabolites [∑DEHPm]; n = 31, including three individual DEHP metabolites) were evaluated in this study. These chemicals included phthalates (n = 8; three DEHP metabolites: mono(2-ethyl-5-hydroxyhexyl) phthalate [MEHHP], mono(2-ethyl-5-oxohexyl) phthalate, and mono(2-ethyl-5-carboxypentyl) phthalate; mono-n-butyl phthalate; monobenzyl phthalate [MBzP]; monocarboxyoctyl phthalate; monocarboxynonyl phthalate; and mono(3-carboxypropyl) phthalate); parabens (n = 3; methylparaben, ethylparaben, and propylparaben); bisphenol A (BPA) (n = 1); PAHs (n = 4; 1-hydroxypyrene, 2-naphthol, 2-hydroxyfluorene, and 1-hydroxyphenanthrene); volatile organic compounds (n = 2; trans, trans-muconic acid and benzylmercapturic acid); 3-phenoxybenzoic acid (n = 1); metals (n = 4; Pb and mercury [Hg] in blood, and Hg and Cd in urine); and POPs (n = 8; hexachlorobenzene, p,p’-dichlorodiphenyl trichloroethane, p,p’-dichlorodiphenyl dichloroethylene, and polychlorinated biphenyl [PCB] congeners [PCB52, PCB118, PCB138, PCB153, and PCB180]). Our analysis utilized the following two settings: one with the three individual DEHP metabolites and one with the molar sum of DEHP metabolites (Fig. 1). Quality control for the analysis was performed following the protocols provided by the National Institute of Environmental Research of Korea [21,23].
Statistical analysis
First, serum and urinary chemical levels were standardized. Specifically, blood lipid concentration was used for evaluating serum POPs. For urinary chemicals, covariate-adjusted standardization [24] was used because of the potential issues related to the use of urinary creatinine (Ucr) level for adjustment when the health outcome is eGFR [25–27]. Predicted Ucr was estimated for each participant from a regression of Ucr on covariates [24], and the standardized concentration was calculated as follows:
Covariate-adjusted standardized concentration = urinary chemical concentration × RUcr,
where RUcr denotes the ratio of the predicted Ucr to the measured Ucr.
Covariate-adjusted standardized and lipid-corrected chemicals were natural log-transformed (ln-transformed). To identify chemical risk factors for CKD, both WQS regression and ML techniques were used. The ML models include single and ensemble models. Single models include classification and regression trees (CART) and logistic regression, whereas ensemble models include bagging and random forest [28–30]. The performances of these methods were compared using the area under the curve (AUC) metric on a held-out test set. Regardless of the method, we split our data into a training (70%) and test (30%) set for the experiments using the ML approach. Owing to the limited quantity of data, we performed a five-fold cross-validation to prevent the overfitting of our model. After cross-validation, the model was reevaluated using the test set. The following two model settings were constructed for the association analyses: model 1 with multiple clinical and chemical variables (n = 41) and model 2 with a presplit approach employing domain knowledge (Fig. 1).
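The standardization and transformation steps described in this section can be illustrated with a minimal sketch. It is written in Python rather than the R packages used in the study, and the function names are hypothetical. The lipid correction follows the stated rule (pg/mL divided by total lipid in mg/dL, multiplied by 100, giving ng/g). The definition of RUcr as predicted Ucr divided by measured Ucr is an assumption based on the covariate-adjusted standardization method cited in the text, and the predicted Ucr is taken as a given input here rather than fitted.

```python
import math


def lipid_correct(pop_pg_per_ml, total_lipid_mg_per_dl):
    """Lipid-corrected POP level in ng/g lipid.

    pg/mL divided by mg/dL equals 0.01 ng/g, hence the factor of 100.
    """
    return pop_pg_per_ml / total_lipid_mg_per_dl * 100.0


def covariate_adjusted(concentration, measured_ucr, predicted_ucr):
    """Urinary concentration multiplied by RUcr.

    ASSUMPTION: RUcr = predicted Ucr / measured Ucr; predicted_ucr would come
    from a separate regression of Ucr on covariates, which is not shown.
    """
    return concentration * (predicted_ucr / measured_ucr)


def ln_transform(value):
    """Natural-log transform applied after standardization or lipid correction."""
    return math.log(value)


# Hypothetical values, for illustration only.
pcb153_ng_per_g = lipid_correct(300.0, 600.0)
ln_pcb153 = ln_transform(pcb153_ng_per_g)
```

A urine sample that is more concentrated than predicted (measured Ucr above predicted Ucr) has its chemical concentration scaled down, and a more dilute sample is scaled up.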
Models for machine learning approach
The performance of each ML method on the prediction of CKD risk was compared based on the test results obtained by calculating the AUC using various parameters. Here, the selected model parameters included the validation techniques and ratios, test set size, training and test performance of the dataset, and validation of the dataset.
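The evaluation protocol described under Statistical analysis (a 70%/30% split, five-fold cross-validation on the training portion, and AUC on the held-out test set) can be sketched in plain Python. The function names, the random seed, and the round-robin fold assignment are illustrative assumptions and do not reproduce the study's R code. The AUC is computed from its rank interpretation: the probability that a randomly chosen case is scored higher than a randomly chosen non-case.

```python
import random


def train_test_split(n, test_frac=0.3, seed=0):
    """Shuffle indices 0..n-1 and return (train_indices, test_indices)."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = round(n * test_frac)
    return indices[n_test:], indices[:n_test]


def k_folds(indices, k=5):
    """Partition indices into k folds of near-equal size (round-robin)."""
    return [indices[i::k] for i in range(k)]


def auc(labels, scores):
    """AUC as the share of (case, non-case) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


train_idx, test_idx = train_test_split(1266)
folds = k_folds(train_idx, k=5)
for held_out in range(5):
    validation = folds[held_out]
    fitting = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
    # A model would be fitted on `fitting` and scored on `validation` here.
```

Keeping the test indices out of the cross-validation loop is what allows the final AUC to serve as an estimate of out-of-sample performance.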
Logistic regression and regularization
Logistic regression is a widely utilized ML algorithm for classification tasks, operating as a specialized form of a generalized linear model (GLM). Unlike linear regression, which assumes that the dependent variable follows a normal distribution, logistic regression models the probability that the dependent variable belongs to a particular category (usually binary) based on a Bernoulli distribution. This algorithm employs the logistic (inverse logit) function to map a linear combination of independent variables to a probability between 0 and 1. The logistic regression equation is given by $P(y = 1 \mid x) = 1 / (1 + e^{-(\beta_0 + \beta^{\top} x)})$, where $P(y = 1 \mid x)$ indicates the probability of the dependent variable $y$ being in category 1, given the independent variables $x$ [31]. The training of a logistic regression model focuses on minimizing a cost function, namely the negative log-likelihood of the training data; in the regularized variants, a ridge (L2) or lasso (L1) penalty on the coefficients is added to this cost function.
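As an illustration of regularized logistic regression, the sketch below fits the model by batch gradient descent on the mean negative log-likelihood plus a ridge penalty of the form (l2/2)·‖w‖². This is a toy implementation under stated assumptions, not the glmnet procedure used in the study: the function names, learning rate, number of epochs, and the toy data are all hypothetical, and the lasso penalty is omitted because it requires a different update rule.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, l2=0.0, lr=0.1, epochs=2000):
    """Gradient descent on mean negative log-likelihood + (l2 / 2) * ||w||^2.

    X is a list of feature lists, y a list of 0/1 labels.
    The intercept b is not penalized.
    """
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err / n
            for j in range(p):
                grad_w[j] += err * xi[j] / n
        w = [wj - lr * (gj + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


# Hypothetical one-feature data: larger x goes with the positive class.
X_toy = [[-2.0], [-1.0], [1.0], [2.0]]
y_toy = [0, 0, 1, 1]
w_plain, b_plain = fit_logistic(X_toy, y_toy, l2=0.0)
w_ridge, b_ridge = fit_logistic(X_toy, y_toy, l2=1.0)
```

On this toy data the ridge fit yields a smaller coefficient than the unpenalized fit, which is the shrinkage behavior that regularization is meant to provide.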
Decision tree
Decision trees are popular and intuitive classification algorithms that are easy to implement and interpret compared with many other ML methods. Within the realm of decision trees, the CART [30] is particularly notable. CART constructs a binary tree, progressively branching out its nodes to enhance purity within the dataset. The Gini index serves as the primary metric for evaluating impurities at each node, guiding the tree’s development. The tree expansion is governed by specific stopping criteria encapsulated within hyperparameters, such as the complexity parameter and the maximum depth of the tree.
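The role of the Gini index in choosing a split can be shown with a short sketch for a binary outcome and a single numeric predictor. It is an illustration of the general CART idea rather than the rpart implementation; the function names and the example values are hypothetical, and stopping criteria such as the complexity parameter and maximum depth are not modeled.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels: 1 - p1^2 - p0^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 1.0 - p1 ** 2 - (1.0 - p1) ** 2


def best_split(values, labels):
    """Return (threshold, weighted_gini) for the rule `value >= threshold`.

    Candidate thresholds are midpoints between consecutive distinct values.
    If no split lowers the impurity, the threshold is None.
    """
    n = len(values)
    best = (None, gini(labels))
    distinct = sorted(set(values))
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2.0
        left = [y for v, y in zip(values, labels) if v < threshold]
        right = [y for v, y in zip(values, labels) if v >= threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best


# Hypothetical ln-transformed exposure values and CKD labels.
threshold, impurity = best_split([3.1, 3.5, 4.2, 4.6], [0, 0, 1, 1])
```

A full tree repeats this search over all predictors at every node, which is why each split is locally optimal but the tree as a whole need not be.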
Bagging and random forest
We applied two ensemble techniques as follows: bootstrap aggregation (bagging) [29] and random forest [28]. Bagging trains multiple models on bootstrapped subsets of the data, combining their outputs through averaging (regression) or majority voting (classification). In contrast, random forest builds on bagging by randomly selecting features for each tree, enhancing model diversity and robustness. We used CART as a base learner for both bagging and random forest to improve the accuracy of the output predictions [30]. Finally, the complexity of these ensemble models was adjusted by varying the number of trees.
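The difference between bagging and random forest can be sketched with one-split trees (stumps) as a simplified stand-in for full CART base learners. This is an assumption made to keep the example short; the study used the ipred and randomForestSRC packages, and all names and parameters below are hypothetical. With `max_features=None` the ensemble is plain bagging; with a smaller `max_features`, each tree only sees a random subset of predictors, as in random forest.

```python
import random
from collections import Counter


def fit_stump(rows, labels, feature_ids):
    """Pick the (feature, threshold, class_if_above) with the fewest training errors."""
    best = None
    for j in feature_ids:
        for t in sorted({r[j] for r in rows}):
            for above in (1, 0):  # class predicted when value >= t
                errors = sum(
                    (above if r[j] >= t else 1 - above) != y
                    for r, y in zip(rows, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, j, t, above)
    return best[1:]


def predict_stump(stump, row):
    j, t, above = stump
    return above if row[j] >= t else 1 - above


def fit_ensemble(rows, labels, n_trees=25, max_features=None, seed=0):
    """Train stumps on bootstrap samples; optionally restrict features per tree."""
    rng = random.Random(seed)
    n, p = len(rows), len(rows[0])
    stumps = []
    for _ in range(n_trees):
        boot = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        if max_features is None:
            feats = list(range(p))                   # bagging: all features
        else:
            feats = rng.sample(range(p), max_features)  # random-forest style
        stumps.append(
            fit_stump([rows[i] for i in boot], [labels[i] for i in boot], feats)
        )
    return stumps


def predict_ensemble(stumps, row):
    """Majority vote over the individual stump predictions."""
    votes = Counter(predict_stump(s, row) for s in stumps)
    return votes.most_common(1)[0][0]
```

The number of trees (`n_trees`) plays the role of the complexity setting mentioned above: more trees stabilize the vote at the cost of computation.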
Weighted quantile sum regression
WQS regression is a statistical method predominantly employed in the analysis of high-dimensional datasets, such as those arising from environmental exposure studies. The primary objective of this technique is variable selection, which allows the identification of relevant predictor variables in the context of GLMs with a specified link function. WQS has several advantages, including robustness to multicollinearity and higher accuracy, sensitivity, and specificity, over shrinkage methods or penalized regression techniques. Despite its benefits, WQS has some limitations that should be considered during its implementation. One potential drawback is the loss of information owing to quantile settings, which can lead to a reduced representation of the data. Additionally, because weights are assigned to each variable, determining the level of importance of each weight is necessary.
In the WQS regression model, each predictor variable is assigned a weight ($\omega_i$) in the range of 0–1, such that the sum of all weights ($\sum_i \omega_i$) equals 1.
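The construction of a WQS index can be sketched as follows: each chemical is converted to quantile scores, and the index for a participant is the weighted sum of those scores, with weights between 0 and 1 that sum to 1. In an actual WQS analysis the weights are estimated from the data; in this sketch they are supplied by hand, and the function names, the quartile scoring rule, and the example values are hypothetical.

```python
def quantile_scores(values, q=4):
    """Score each value 0..q-1 according to q rank-based cut points."""
    ordered = sorted(values)
    n = len(ordered)
    cuts = [ordered[(n * k) // q] for k in range(1, q)]
    return [sum(v >= c for c in cuts) for v in values]


def wqs_index(scores_per_chemical, weights):
    """Weighted sum of quantile scores for each participant.

    scores_per_chemical: one list of quantile scores per chemical.
    weights: one weight per chemical, each in [0, 1], summing to 1.
    """
    if any(w < 0.0 or w > 1.0 for w in weights):
        raise ValueError("each weight must lie between 0 and 1")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    n = len(scores_per_chemical[0])
    return [
        sum(w * scores[i] for w, scores in zip(weights, scores_per_chemical))
        for i in range(n)
    ]


# Hypothetical concentrations for two chemicals in eight participants.
chem_a = quantile_scores([1, 2, 3, 4, 5, 6, 7, 8])
chem_b = quantile_scores([8, 7, 6, 5, 4, 3, 2, 1])
index = wqs_index([chem_a, chem_b], [0.75, 0.25])
```

The resulting index is then entered as a single predictor in the regression model, so its coefficient describes the mixture effect while the weights indicate each chemical's relative contribution.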
Presplit decision tree
The decision tree algorithm determines the split rules in a data-driven, greedy manner, aiming to maximize node purity without considering the implications for lower-level nodes, which can result in suboptimal decisions. To mitigate this issue, we proposed a presplit decision tree approach that incorporates existing domain knowledge into the CART model. Because the risks associated with environmental substances vary depending on factors such as sex and age, we selected these two variables as the factors affecting the associations between chemical exposure and CKD risk and enforced their use at the root node of the decision tree. Specifically, we first split the data based on one of the following two criteria: 1) whether age was ≥50 years or 2) whether the sex was male. Subsequently, we applied the CART algorithm [30] to each subset of data created by these presplits to build a decision tree model.
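The presplit idea can be sketched as a forced root rule followed by an independent, data-driven tree in each branch. To stay short, the sketch uses a one-split tree chosen by weighted Gini impurity as a stand-in for a full CART model; the function names, the variable names in the records, and the toy data are all hypothetical, and both branches are assumed to be non-empty.

```python
def gini(labels):
    """Gini impurity of 0/1 labels, written as 2 * p * (1 - p)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def fit_stump(rows, labels, features):
    """One-level tree: (feature, threshold, risk_below, risk_at_or_above)."""
    overall = sum(labels) / len(labels)
    best = (len(labels) * gini(labels), None, None, overall, overall)
    for f in features:
        for t in sorted({r[f] for r in rows})[1:]:  # skip the minimum: left stays non-empty
            left = [y for r, y in zip(rows, labels) if r[f] < t]
            right = [y for r, y in zip(rows, labels) if r[f] >= t]
            score = len(left) * gini(left) + len(right) * gini(right)
            if score < best[0]:
                best = (score, f, t, sum(left) / len(left), sum(right) / len(right))
    return best[1:]


def predict_stump(stump, row):
    f, t, risk_below, risk_above = stump
    if f is None:
        return risk_below
    return risk_above if row[f] >= t else risk_below


def fit_presplit(rows, labels, rule, features):
    """Force `rule` at the root, then fit a separate tree in each branch."""
    model = {}
    for side in (True, False):
        idx = [i for i, r in enumerate(rows) if rule(r) == side]
        model[side] = fit_stump([rows[i] for i in idx], [labels[i] for i in idx], features)
    return model


def predict_presplit(model, rule, row):
    return predict_stump(model[rule(row)], row)


def aged_50_or_older(record):
    """Presplit rule corresponding to criterion 1 in the text."""
    return record["age"] >= 50
```

Because each branch is fitted separately, the two age groups are free to select different predictors and thresholds, which a single greedy tree would only discover if age happened to be chosen at the root.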
Implementation details
The methods were implemented using R version 3.4 (R Foundation for Statistical Computing), utilizing the glmnet, rpart, randomForestSRC, and ipred packages to develop the ML models.
Results
Study population
When stratified by kidney function, the mean ages of the participants were 42.9 ± 13.4, 61.0 ± 12.7, and 55.9 ± 15.4 years in the G1A1 (n = 953), G2A1 (n = 215), and CKD (G3 or A2 [n = 98]) groups, respectively (Table 1). The CKD group exhibited a significantly higher prevalence of hypertension and DM and a lower household income than the non-CKD group. Male participants constituted 52.0% (n = 51) of the CKD group, and 38.8% of patients with CKD had DM (Table 1). BMI was significantly different among the groups, with values of 24.1 ± 3.6, 24.4 ± 3.0, and 25.5 ± 3.3 kg/m2 for the G1A1, G2A1, and CKD groups, respectively (p < 0.001). No significant differences in hemoglobin and total cholesterol levels were found between the groups.
Comparison of machine learning algorithms for predicting chronic kidney disease risk
The performance of various ML models on the prediction of CKD risk is shown in Table 2 (setting 1: using three individual DEHP metabolites) and Table 3 (setting 2: using the molar sum of DEHP metabolites). Across both settings, the logistic regression model, which is a GLM, performed worst with AUCs of 0.6214 and 0.6240 for settings 1 and 2, respectively. In contrast, the random forest model consistently outperformed other methods, achieving AUCs of 0.6954 and 0.7204 in settings 1 and 2, respectively. Notably, random forest under setting 2, which used the molar sum of DEHP metabolites, provided the highest predictive performance with an AUC of 0.7204.

Table 2. Performance of classification models for chronic kidney disease in the KoNEHS participants using model setting 1

Table 3. Performance of classification models for chronic kidney disease in the KoNEHS participants using model setting 2
The superior performance of ensemble methods, including random forest and bagging, can be attributed to several factors. Random forest, as an ensemble model, constructs multiple decision trees and averages their predictions, which helps to reduce overfitting, a common issue in decision trees. Furthermore, it introduces randomness not only in the data used to build each tree (bootstrapping) but also in the features considered for splitting nodes, which improves robustness and generalization compared to simpler models like logistic regression. Logistic regression is a GLM that assumes a fixed relationship between the features and outcome, which may not capture complex interactions between the environmental pollutants and CKD risk. Conversely, random forest is nonparametric and can model intricate, nonlinear interactions between predictors, giving it a significant advantage in datasets with diverse features similar to ours.
Another important observation is the performance of our presplit decision tree when participants were stratified by sex in setting 2. This model yielded an AUC of 0.7054, the second-best performance overall, surpassing the traditional decision tree model, which splits based solely on a data-driven approach. The presplit decision tree method demonstrates competitive performance with ensemble models including bagging and random forest, despite being a single model.
Performance of decision tree modeling for predicting chronic kidney disease risk
Given the benefit of interpretability, we examined the constructed decision tree, which achieved an AUC of 0.654 (Fig. 2). Fig. 2 presents these findings, with each leaf node indicating CKD risk. In the model, the lipid-corrected PCB153 value was the most relevant risk factor in the CKD group. The estimated CKD rate reached as high as 71.4% among the population with lipid-corrected PCB153 (ln-transformed) ≥4.0 and hemoglobin level ≥13.25 g/dL. In contrast, among adults with lipid-corrected PCB153 (ln-transformed) <4.0, CKD risk was predicted at 62.5% in the population with covariate-adjusted standardized MEHHP levels (ln-transformed) ≥4.9 (Fig. 2). The subsequent node in the decision tree was MEHHP, a DEHP metabolite. Following MEHHP, urinary Cd appeared as the next node in the decision tree. Here, each substance underwent an appropriate standardized adjustment, which varied based on the measuring medium and substance group.

Figure 2. Decision tree modeling for chronic kidney disease in the Korean National Environmental Health Survey participants.
Urinary chemical levels were covariate-adjusted standardized, serum polychlorinated biphenyl 153 (PCB153) was lipid-corrected, and both were then ln-transformed. The percentiles of these chemicals are indicated in parentheses.
AUC, area under the curve; CKD, chronic kidney disease; MEHHP, mono(2-ethyl-5-hydroxyhexyl) phthalate.
Results of the weighted quantile sum regression analysis
The WQS regression analyses showed that among all participants (n = 1,266), the chemical mixture was inversely associated with eGFR (β = –16.2; 95% confidence interval [CI], –17.8 to –14.6; p < 0.001), with PCB180 and PCB153 contributing high weights (Figs. 3, 4). Urinary Cd was one of the variables with the highest weight among the metals included in the model (Fig. 3). After age stratification, the WQS regression analyses revealed that serum PCB153 was a significant determinant of reduced kidney function in the younger age group (<50 years) (Fig. 5). In both age groups, WQS showed significant negative associations of the chemical mixture with eGFR (younger age group: β = –3.43 [95% CI, –6.1 to –0.70], p = 0.01 and older age group: β = –4.03 [95% CI, –7.9 to –0.20], p = 0.04) (Fig. 6).

Figure 3. Weights of all measured pollutants in the Korean National Environmental Health Survey participants.
1-OHP, 1-hydroxypyrene; 1OHPhe, 1-hydroxyphenanthrene; 2-NAP, 2-naphthol; 2OH-Flu, 2-hydroxyfluorene; 3-PBA, 3-phenoxybenzoic acid; BMA, N-acetyl-S-(benzyl)-L-cysteine; BPA, bisphenol A; Cd, cadmium; EtP, ethylparaben; HCB, hexachlorobenzene; Hg, mercury; MBzP, monobenzyl phthalate; MCNP, monocarboxynonyl phthalate; MCOP, monocarboxyoctyl phthalate; MCPP, mono(3-carboxypropyl) phthalate; MeP, methylparaben; MnBP, mono-n-butyl phthalate; PCB118, polychlorinated biphenyl 118; PCB138, polychlorinated biphenyl 138; PCB153, polychlorinated biphenyl 153; PCB180, polychlorinated biphenyl 180; PCB52, polychlorinated biphenyl 52; Pb, lead; PrP, propylparaben; p,p′-DDE, p,p′-dichlorodiphenyldichloroethylene; p,p′-DDT, p,p′-dichlorodiphenyltrichloroethane; t,t-MA, trans, trans-muconic acid; ∑DEHPm, molar sum of DEHP metabolites.

Figure 4. Unadjusted association analysis between WQS and eGFR.
β-coefficient = –16.2 (95% confidence interval, –17.8 to –14.6), p < 0.001.
WQS, weighted quantile sum; eGFR, estimated glomerular filtration rate.

Figure 5. Weights of all measured pollutants in each age group.
(A) Age <50 years. (B) Age ≥50 years.
1-OHP, 1-hydroxypyrene; 1OHPhe, 1-hydroxyphenanthrene; 2-NAP, 2-naphthol; 2OH-Flu, 2-hydroxyfluorene; 3-PBA, 3-phenoxybenzoic acid; BMA, N-acetyl-S-(benzyl)-L-cysteine; BPA, bisphenol A; Cd, cadmium; EtP, ethylparaben; HCB, hexachlorobenzene; Hg, mercury; MBzP, monobenzyl phthalate; MCNP, monocarboxynonyl phthalate; MCOP, monocarboxyoctyl phthalate; MCPP, mono(3-carboxypropyl) phthalate; MeP, methylparaben; MnBP, mono-n-butyl phthalate; PCB118, polychlorinated biphenyl 118; PCB138, polychlorinated biphenyl 138; PCB153, polychlorinated biphenyl 153; PCB180, polychlorinated biphenyl 180; PCB52, polychlorinated biphenyl 52; Pb, lead; PrP, propylparaben; p,p′-DDE, p,p′-dichlorodiphenyldichloroethylene; p,p′-DDT, p,p′-dichlorodiphenyltrichloroethane; t,t-MA, trans, trans-muconic acid; ∑DEHPm, molar sum of DEHP metabolites.

Figure 6. Adjusted association analysis between WQS and eGFR stratified by age group.
(A) Age <50 years. β-coefficient = –3.43 (95% confidence interval [CI], –6.1 to –0.7), p = 0.01. (B) Age ≥50 years. β-coefficient = –4.03 (95% CI, –7.9 to –0.2), p = 0.04. Models were adjusted for age, sex, body mass index, history of diabetes mellitus, hypertension, smoking status, alcohol consumption status, and household income.
WQS, weighted quantile sum; eGFR, estimated glomerular filtration rate.
Results of presplit decision tree for predicting chronic kidney disease risk
Here, we introduce a method that employs a presplit decision tree based on age or sex. As shown in Fig. 7, when age was set as the primary variable in the presplit decision tree model, the lipid-corrected PCB153 value emerged as the single most significant risk factor for CKD in the younger age group. The predicted CKD rate reached 71.4% if the lipid-corrected PCB153 (ln-transformed) was ≥4.0 and the hemoglobin level was >13.25 g/dL. The next node was identified as urinary MBzP, a phthalate metabolite. For participants aged ≥50 years with DM, the higher MBzP group (21.2%) had a reduced risk of CKD compared to the lower MBzP group (64.2%). The same variable (i.e., urine MBzP) appeared in three nodes within the decision tree.

Figure 7. Presplit decision tree model for CKD in the Korean National Environmental Health Survey participants according to age group.
Urinary chemical levels were covariate-adjusted standardized, serum polychlorinated biphenyl 153 (PCB153) was lipid-corrected, and both were subsequently ln-transformed. The percentiles of these chemicals are indicated in brackets.
AUC, area under the curve; CKD, chronic kidney disease; DM, diabetes mellitus; MBzP, monobenzyl phthalate.
Discussion
While studies have suggested many chemicals as risk factors for CKD, epidemiological studies have rarely utilized association models incorporating multiple chemicals. Most existing epidemiological studies on chemical exposure and kidney disease are limited to a single or few chemicals in the models [9,10]. Although nonchemical factors that may influence the association are generally adjusted in the model, their interactions with chemical parameters cannot be easily addressed using the existing approaches. Here, we demonstrated that the decision tree model, when combined with existing domain knowledge (i.e., presplit decision tree model), can be applied to the general population to identify chemical risk factors for CKD. Additionally, the WQS regression model provides a robust method for identifying key chemical risk factors by assigning weights to each chemical based on its contribution to the outcome. This enables the incorporation of multiple chemicals in the analysis and helps overcome the limitations of traditional single-chemical models, which may fail to account for the combined effects of multiple exposures.
The results of multiple models, including the decision tree and WQS regression analyses, support the relationship of PCB153 with CKD risk in the general population (Figs. 2, 3). Specifically, PCB153 was identified as an important factor for predicting CKD risk in both the decision tree and WQS regression analyses. Information on PCB153 is limited in both epidemiological and experimental studies. Recent studies have also reported negative correlations of PCB153 and PCB180 with eGFR in the KoNEHS Cycle 3 (2015–2017), analyzing the same population as that of our study [4]. While Lee et al. [4] focused solely on the effects of POPs on kidney function, our study expands upon this by analyzing a broader range of environmental pollutants, allowing the assessment of the combined effects of multi-pollutants on kidney dysfunction. By examining a wider range of pollutants, our study provides a more nuanced understanding of the environmental factors contributing to kidney health and identifies new associations not apparent in previous studies limited to POPs. Apoptosis induced by PCB153 exposure in kidney tubular cells has been reported [32]. In our study, when age and sex were used to presplit the population in the decision tree analysis, PCB153 was identified as the most important chemical risk factor for CKD. Among nonchemical parameters, DM and hemoglobin levels emerged as important variables in the decision tree analysis.
Our observation shows that age, sex, DM, and hemoglobin level are important variables that determine CKD risk in the decision tree analysis. Urinary Cd was also identified as a significant chemical substance in the decision tree and WQS regression analyses (Figs. 2, 3). In particular, when the WQS was set as the main variable in the decision tree model, Cd exposure was found to be a significant factor. The association between Cd exposure and kidney toxicity has been well-established in population-based and experimental studies over a long period [33]. The underlying mechanism involves the generation of reactive oxygen species, leading to mitochondrial damage [33]. Compared with studies on other substances, a relatively substantial body of research exists on the relationship between Cd exposure and kidney disease, suggesting that further studies are needed in this domain. Identifying consistently significant nonpersistent urinary chemicals in decision tree analysis is challenging, which makes clear interpretation difficult in our analysis. Urinary BPA was also identified as a possible chemical substance in the decision tree and WQS regression analyses (Figs. 2, 3). It showed a negative correlation with eGFR in the lasso regression model, appeared in the decision tree, and had relatively high importance in the WQS analysis. Although urinary MBzP appeared to be an important variable in the decision tree analysis, it exhibited an apparently protective effect (Fig. 7). In participants aged ≥50 years with DM, the higher MBzP group had a lower predicted risk of CKD than the lower MBzP group (21.2% vs. 64.2%). Moreover, the same variable (i.e., urine MBzP) appeared in three nodes within the decision tree, which complicates interpretation. This could be a limitation when using a classification approach to interpret the results. Therefore, additional research is required to better understand the relationship between these chemicals and kidney function.
The association between BPA exposure and CKD has been inconsistent [34]. A recent systematic review found that while the association with kidney function indicators may be influenced by the urine dilution adjustment method, a reanalysis of the National Health and Nutrition Examination Survey data, along with other studies, supported the association between BPA and kidney disease [34]. Therefore, in studies examining the associations between BPA and kidney function indicators, employing research methods that utilize urine dilution adjustment techniques with a relatively low relationship to outcomes, such as 24-hour urine volume, or outcomes with limited association with dilution adjustment methods, such as cystatin-C–based eGFR, may be necessary [35].
Several studies have reported significant health effects even when exposed to mixtures of substances below standard levels. Woodruff et al. [36] emphasized the importance of considering both existing or persistent exposure to environmental chemicals and inherent biological or disease susceptibility, which independently contribute to the risk of overt diseases. The demand for evaluating exposure to harmful substances has increased, resulting in studies using advanced statistical techniques, such as supervised principal component analysis, lasso, partial least-squares regression, and Bayesian model averaging.
In conclusion, to identify chemical risk factors of CKD in the general population, we employed several frequently used ML approaches. Our study shows that the decision tree model, following the adoption of existing knowledge (e.g., influence of demographic factors), can be used to identify chemical predictors of CKD risk. Based on the presplit decision tree model and WQS analysis, serum PCB153 was identified as the most important chemical risk factor for CKD, particularly among the population aged <50 years and with hemoglobin levels >13.25 g/dL. Nevertheless, further validation in other populations is warranted to confirm the utility of this ML approach in association studies on CKD risk.
Notes
Conflicts of interest
All authors have no conflicts of interest to declare.
Funding
This survey was supported by grants from the National Institute of Environmental Research funded by the Ministry of Environment (MOE) of Korea (NIER-2019-01-02-082) and the National Research Foundation (NRF) of Korea (NRF-2022R1C1C2006982). Junhyug Noh was partly supported by Ewha Womans University research grant of 2023 and the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (MSIT) (No. RS-2022-00155966).
Data sharing statement
The data presented in this study are available from the corresponding author upon reasonable request.
Authors’ contributions
Conceptualization, Data curation, Methodology, Visualization: IL, JN, KC, KDY
Formal analysis: IL, JN
Investigation: All authors
Supervision: KC, KDY
Writing–original draft: IL, JN, KC, KDY
Writing–review & editing: All authors
All authors read and approved the final manuscript.