Kidney Res Clin Pract > Epub ahead of print
Xu and Kang: Statistical consideration in nephrology research

Abstract

Nephrology research plays an important role in advancing our understanding of kidney disease and improving patient outcomes. However, the complexity of nephrology data and the application of advanced statistical methods present significant challenges. This review highlights key statistical considerations in nephrology research, focusing on common errors such as violations of statistical assumptions, multicollinearity, missing data, overfitting, and the integration of machine learning tools. It emphasizes the importance of applying appropriate statistical approaches to ensure the reliability of study findings. Additionally, the review underscores the need for transparency and reproducibility in nephrology research, particularly the importance of open access to data, code, and study protocols. By utilizing tools like R, RStudio, Git, and GitHub, researchers can integrate their code, results, and data into a transparent workflow, enhancing the reproducibility of their research. This review also presents a practical checklist for promoting reproducible research practices, which can help improve the quality, transparency, and reliability of nephrology studies. This review aims to contribute to the advancement of nephrology research and, ultimately, to support the long-term goal of improving patient care and outcomes.

Introduction

Nephrology research is increasingly important, as chronic kidney disease (CKD) affects over 35 million adults in the United States, according to the Kidney Disease Surveillance System from the U.S. Centers for Disease Control and Prevention [1]. This represents a significant public health issue, with Medicare spending on CKD surpassing $130 billion annually [2]. Given the scale and widespread impact of this challenge, utilizing accurate statistical methods is critical. Advancements in nephrology research and the broader medical field are increasingly driven by sophisticated statistical methods, enabling researchers to extract meaningful insights from complex data [3]. From exploratory studies to clinical trials, the correct application of statistical techniques plays a crucial role in ensuring reliable and reproducible results. Despite this critical importance, common statistical errors, such as violating statistical assumptions, applying inappropriate analytic approaches, and neglecting sample size considerations, continue to undermine the validity of many medical studies across all phases [4]. As innovative approaches such as machine learning, deep learning, and artificial intelligence become widely used in nephrology research, the need for a thorough understanding of statistical principles, their strengths and limitations, and their appropriate application becomes more urgent [5]. These techniques hold great promise for personalized medicine and predictive modeling in nephrology, but they require careful application to avoid overfitting, misinterpretation, and bias [6].
This growing reliance on advanced statistical methods, however, introduces additional challenges for nephrology data analysis. Data from nephrology studies are diverse and multimodal. Many nephrology studies involve complex, time-dependent data, such as time-to-event data, censored observations, and heterogeneous patient populations [7,8]. Furthermore, the increasing availability of large-scale datasets, such as electronic health records (EHRs), biobank data, and multi-omics studies, offers unprecedented opportunities to explore kidney disease from molecular, genetic, and population perspectives [3,9]. Together, these characteristics require specialized statistical approaches that account for the distinct features of nephrology data, help avoid misleading conclusions, and ensure the robustness and generalizability of study findings.
The need for statistical precision and expert guidance in nephrology research is increasingly critical. This review provides practical research guidance, outlining key statistical considerations to help researchers navigate the complexities of nephrology data. Its long-term goal is to equip researchers with the tools and knowledge necessary to conduct high-quality studies that deepen our understanding of nephrology research and improve patient care and outcomes. By emphasizing the importance of methodologies, the review also offers practical advice on overcoming statistical challenges in study design and data analysis, helping researchers to contribute meaningfully to the broader nephrology field.

Commonly ignored statistical assumptions

Statistical errors commonly occur in the scientific literature, with approximately 50% of published articles containing at least one error [10]. In nephrology research, several common statistical issues require careful attention to ensure the validity and reliability of results. These issues range from improper handling of missing data, which is particularly prevalent in clinical studies where patients may drop out or have incomplete records, to incorrect assumptions about data distribution, such as assuming normality for urinary biomarkers like albumin and other markers of kidney injury when they may follow non-normal distributions [11,12]. A summary of commonly overlooked statistical assumptions is provided in Table 1.

Independence of observations

Most statistical methods, such as t tests, analysis of variance, and regression, assume that observations are independent or unrelated [13]. In practice, however, this assumption is frequently violated due to clustered, repeated-measures, or temporally dependent data structures in nephrology research, where some observations are more similar to one another than independence would imply. For instance, in clustered data, observations within the same group, such as patients within a hospital, often exhibit within-cluster correlation due to shared characteristics. Similarly, longitudinal and time-series data often display temporal dependence, where measurements taken closer in time are more strongly correlated. Ignoring these dependencies can lead to inflated type I error rates and incorrect standard errors [13]. To address these issues, researchers should use statistical methods that account for such dependencies, including mixed-effects models [14], generalized estimating equations (GEE) [15], or time-series analysis [16].
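To see how ignoring within-cluster correlation inflates the type I error rate, consider a minimal Monte Carlo sketch in Python. All numbers here are illustrative (hypothetical hospitals as clusters, no true treatment effect); the paper's own code is in R, so this is only a conceptual illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

n_sims = 300
n_clusters = 10   # e.g., hospitals
per_cluster = 10  # patients per hospital
rejections = 0

for _ in range(n_sims):
    # Random cluster effects induce within-cluster correlation (ICC = 0.5 here)
    cluster_effects = rng.normal(0, 1, n_clusters)
    y = np.repeat(cluster_effects, per_cluster) + rng.normal(0, 1, n_clusters * per_cluster)
    # Treatment assigned at the cluster level; there is NO true treatment effect
    group = np.repeat(np.arange(n_clusters) < n_clusters // 2, per_cluster)
    # Naive two-sample t test that wrongly treats all observations as independent
    _, p = stats.ttest_ind(y[group], y[~group])
    rejections += p < 0.05

type1_rate = rejections / n_sims
print(f"Empirical type I error rate: {type1_rate:.2f}")  # far above the nominal 0.05
```

A mixed-effects model or GEE with cluster-aware standard errors would bring the error rate back toward the nominal level.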

Homoscedasticity

Linear regression is widely used in nephrology research [15]. An important assumption of linear regression is homoscedasticity, which refers to the constant variance of residuals across different levels of an independent variable. When the homoscedasticity assumption is violated, it can result in higher type I error rates or reduced statistical power. This violation may negatively impact the accuracy of conclusions, and failing to identify and address heteroscedasticity can have significant consequences for theory, research, and practice. In general, detecting violations of homoscedasticity and addressing its biasing effects can enhance the validity of inferences [17,18].
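The Breusch-Pagan test is one common diagnostic for heteroscedasticity. The Python sketch below implements it from its definition (regress squared residuals on the predictors; the Lagrange multiplier statistic n·R² is compared to a chi-squared distribution) on simulated data whose error spread grows with the predictor; the data and numbers are purely illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 500

# Simulated predictor and heteroscedastic outcome: the error spread grows with x
x = rng.uniform(0, 10, n)
y = 1 + 2 * x + rng.normal(0, 0.5 + 0.5 * x, n)

# Fit OLS and obtain residuals
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Breusch-Pagan test: regress squared residuals on x; LM = n * R^2 ~ chi2(1)
u = resid ** 2
g, *_ = np.linalg.lstsq(X, u, rcond=None)
fitted = X @ g
r2 = 1 - np.sum((u - fitted) ** 2) / np.sum((u - u.mean()) ** 2)
lm_stat = n * r2
p_value = stats.chi2.sf(lm_stat, df=1)
print(f"Breusch-Pagan LM = {lm_stat:.1f}, p = {p_value:.2g}")
```

A small p-value indicates heteroscedasticity, suggesting remedies such as variance-stabilizing transformations or robust standard errors.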

Normality of residuals

Normality assumptions are required for many statistical procedures such as correlation, regression, t tests, and other parametric tests, implying that the populations from which samples are drawn are normally distributed [19,20]. However, a review found that among 30 researchers examining their own studies, the assumptions of these statistical techniques were rarely checked for violations in the data [21]. Although some scientific literature suggests that linear models are generally robust to violations of the normality assumption for hypothesis testing and parameter estimation if outliers are properly handled, there are still certain scenarios that require careful attention to avoid significant bias [22–25]. A study conducted by Knief and Forstmeier [26] analyzed 100 simulated scenarios to evaluate the impact of violations of the normality assumption and found that violations are generally not problematic when sample sizes are moderate (n = 100) to large (n = 1,000) or when there are no extreme outliers, as Gaussian models remain robust and produce reliable p-values. However, issues arise with very small sample sizes (n = 10) or when extreme outliers are present in either the independent or dependent variables. Sample size plays a critical role in the reliability of statistical results because, for small sample sizes, the normality assumption helps ensure unbiased estimation of standard errors, which in turn affects the accuracy of confidence intervals and p-values [27]. In nephrology studies, many researchers address this issue by normalizing urinary biomarker concentrations to urinary creatinine (uCr) [28]. In research focused on potential biomarkers for detecting acute kidney injury, Blaikley et al. [29] normalized a tubular marker, neutral endopeptidase, to uCr to account for variations in urinary flow rates. Normalization is also commonly used in studies related to CKD and the progression of kidney disease. For instance, in a study conducted by Kamijo et al. [30], the ratio of human liver-type fatty acid-binding protein (L-FABP) transcript intensity to glyceraldehyde-3-phosphate dehydrogenase intensity was used to normalize the amount of human L-FABP transcripts.
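As a quick illustration of checking and addressing non-normality, the Python sketch below applies the Shapiro-Wilk test to hypothetical right-skewed biomarker concentrations before and after a logarithmic transformation; the lognormal values are simulated for illustration only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical right-skewed urinary biomarker concentrations (lognormal)
biomarker = rng.lognormal(mean=2.0, sigma=1.0, size=100)

_, p_raw = stats.shapiro(biomarker)          # normality clearly rejected for raw values
_, p_log = stats.shapiro(np.log(biomarker))  # the log scale is much closer to normal

print(f"Shapiro-Wilk p: raw = {p_raw:.2g}, log-transformed = {p_log:.2g}")
```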

Absence of multicollinearity

Multicollinearity occurs when independent variables in a regression model are highly correlated. For example, in a nephrology study investigating the impact of patient characteristics on glomerular filtration rate (GFR), it is important to carefully evaluate the correlations among covariates. A study conducted by Kamal [31] found a strong correlation between blood urea nitrogen (BUN) and serum creatinine levels in patients with renal disorders. Both BUN and serum creatinine are indicators of kidney function, and their high correlation can result in multicollinearity when both are included as explanatory variables in a single regression model. When multicollinearity exists, it becomes challenging to determine the individual impact of serum creatinine and BUN on GFR because their effects overlap substantially [32]. Diagnostic tools for multicollinearity include the variance inflation factor (VIF), the condition index and condition number, and the variance decomposition proportion; the VIF is one of the most widely used [32,33]. The VIF measures the strength of linear dependencies and quantifies how much the variance of each regression coefficient is inflated due to collinearity, compared to a scenario in which the independent variables are not linearly related. Researchers are highly encouraged to use the VIF to identify multicollinearity and to consider removing or combining variables when appropriate.
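The VIF can be computed directly from its definition (regress each predictor on the others; VIF = 1 / (1 − R²)). The Python sketch below does this with hypothetical covariates; the "BUN ≈ 10 × creatinine" relationship is invented purely to mimic a strongly correlated pair like the one described above.

```python
import numpy as np

def vif(X):
    """VIF for each column of X: 1 / (1 - R^2) from regressing it on the other columns."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = []
    for j in range(p):
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ beta
        r2 = 1 - resid.var() / X[:, j].var()
        out.append(1 / (1 - r2))
    return np.array(out)

rng = np.random.default_rng(3)
n = 200
creatinine = rng.normal(1.0, 0.3, n)
bun = 10 * creatinine + rng.normal(0, 0.5, n)  # strongly correlated with creatinine
age = rng.normal(60, 10, n)                    # unrelated covariate

vifs = vif(np.column_stack([creatinine, bun, age]))
print(vifs)  # creatinine and BUN show inflated VIFs; age stays near 1
```

A common rule of thumb flags VIF values above 5 or 10 as problematic.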
Violating statistical assumptions in nephrology research compromises the validity and reproducibility of findings. When one or more assumptions are violated, researchers should assess the impact and consider alternative methods. Nonparametric tests, such as the Wilcoxon rank-sum test or the Kruskal-Wallis test, offer robust alternatives that do not require normality or homogeneity of variances [34]. Data transformations, such as logarithmic or square-root transformations, can also help normalize data and stabilize variance. Researchers should always perform diagnostic tests, such as computing the VIF, to ensure that assumptions hold before proceeding with their analyses and model adjustments.
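Both nonparametric alternatives mentioned above are available in standard scientific libraries. The Python sketch below applies the Wilcoxon rank-sum test (Mann-Whitney U) and the Kruskal-Wallis test to hypothetical skewed biomarker values for three patient groups; the group sizes and shifts are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

# Hypothetical skewed biomarker values for three patient groups
g1 = rng.lognormal(1.0, 0.8, 50)
g2 = rng.lognormal(1.8, 0.8, 50)  # clearly shifted upward on the log scale
g3 = rng.lognormal(1.0, 0.8, 50)

# Wilcoxon rank-sum (Mann-Whitney U) test for two groups
_, p_wilcoxon = stats.mannwhitneyu(g1, g2, alternative="two-sided")

# Kruskal-Wallis test for three or more groups
_, p_kw = stats.kruskal(g1, g2, g3)

print(f"Wilcoxon rank-sum p = {p_wilcoxon:.3g}, Kruskal-Wallis p = {p_kw:.3g}")
```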

Other analysis issues

Missing data

In addition to violations of statistical assumptions, there are other common analysis challenges in nephrology research. High-quality data is crucial for reliable research, but the issue of missing data is prevalent in nearly all studies and can significantly impact the conclusions drawn from the data [35]. Proper handling of missing data ensures that missing values are addressed in a way that does not bias the results. This includes using suitable single or multiple imputation approaches to estimate missing values. Montez-Rath et al. [36] proposed methods for addressing missing data in clinical studies of kidney diseases, categorizing the missing data into three scenarios: 1) missing completely at random (MCAR), where missing data has no relationship to either observed or unobserved factors, and patients included in the analysis are no different from those excluded; 2) missing at random (MAR), where missing data is related only to observed factors but not to unobserved ones; and 3) not missing at random (NMAR), where missing data depends on both observed and unobserved factors. The authors evaluated these scenarios and recommended the most appropriate approaches for each type of missing data. For MCAR, complete case analysis (CCA), which uses only the subset of data with complete information, is valid when the MCAR assumption holds. However, since MCAR is rare in practice, relying solely on CCA often leads to biased and inefficient estimates. For MAR, both maximum likelihood-based and multiple imputation (MI)-based methods are valid, with MI having the advantage of being supported by accessible and increasingly user-friendly software such as SAS (SAS Institute Inc.), Stata (StataCorp LLC), and R (R Foundation for Statistical Computing). For NMAR, more advanced methods are necessary, as both maximum likelihood and MI assume MAR. In this case, directly modeling the missing data mechanism under NMAR conditions is required.
Improper handling of missing data can distort the analysis and lead to inaccurate conclusions. Thus, researchers should document and validate all preprocessing steps to ensure the robustness of their findings.
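To make the MAR scenario concrete, the Python sketch below simulates an outcome whose missingness depends only on an observed covariate. Complete case analysis is biased, while an imputation model that conditions on the observed covariate is approximately unbiased. The data are entirely hypothetical, and the single regression imputation shown here is only for illustration; in practice, multiple imputation is preferred because it also propagates imputation uncertainty [36].

```python
import numpy as np

rng = np.random.default_rng(11)
n = 2000

# x is fully observed; y depends on x; y is MAR (missingness depends only on x)
x = rng.normal(0, 1, n)
y = x + rng.normal(0, 0.5, n)
missing = (x > 0) & (rng.random(n) < 0.8)  # y is mostly missing when x is large

obs = ~missing
cc_mean = y[obs].mean()  # complete-case analysis: biased under MAR

# Single regression imputation: fit y ~ x on complete cases, fill in predictions
X_obs = np.column_stack([np.ones(obs.sum()), x[obs]])
beta, *_ = np.linalg.lstsq(X_obs, y[obs], rcond=None)
y_imp = y.copy()
y_imp[missing] = beta[0] + beta[1] * x[missing]
imp_mean = y_imp.mean()  # approximately unbiased under MAR

print(f"True mean ~ 0 | complete-case: {cc_mean:.2f} | imputed: {imp_mean:.2f}")
```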

Sample size

When designing a study, accurate sample size calculation is essential for producing reliable and meaningful findings in nephrology research. A small sample size can lead to underpowered studies, especially in rare kidney diseases or specific patient groups, which may cause false negative results or miss important effects. Many treatments have been tested in small randomized trials that were not large enough to detect realistic treatment effects [37]. Using a small sample size to create a prediction model can lead to inaccurate estimates and overfitting. This results in unreliable predictions and poor model performance when the model is tested on new individuals from the same population, making it less applicable to other settings [38]. One approach to mitigate overfitting in predictive models is to use optimism-corrected summary measures, such as the R2 for linear models or the area under the receiver operating characteristic curve for logistic regression models [39,40].
A simulation study demonstrating the impact of a small sample size is presented. Assuming a sample size of n = 100, the underlying linear model is specified as y = β0 + β1x1 + β2x2 + β3x3 + ε, where ε follows a standard normal distribution. A summary of this simulation is provided in Table 2. The R code used to develop this simulation is available at the following GitHub repository: https://github.com/hakmook/KRCP_review.git.
We then introduced additional noise to x3 by doubling the first 10 observations, which resulted in a discrepancy between the estimated β3 with and without the extra noise. We examined how this discrepancy varied as a function of sample size by employing 300 Monte Carlo simulations. As illustrated in Fig. 1, it is evident that a larger sample size results in significantly smaller bias (or discrepancy between estimates with and without the extra noise). Conversely, smaller sample sizes are more sensitive to unexpected noise. Given that it is impossible to ensure that collected data are free from unexpected noise or errors, paying extra attention to sample size is crucial when designing a study.
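The published simulation code is written in R and available at the repository above; the Python sketch below mirrors the same setup (the model and true values in Table 2, with the first 10 observations of x3 doubled) to show how the discrepancy in the estimated β3 shrinks as the sample size grows. The specific sample sizes compared here (n = 50 vs. n = 500) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2025)

def beta3_discrepancy(n):
    """One replicate: |beta3 estimated with vs. without doubling the first 10 x3 values|."""
    x1 = rng.normal(0, 1, n)
    x2 = rng.integers(0, 2, n).astype(float)  # P(x2=0) = P(x2=1) = 0.5
    x3 = rng.normal(2, 1, n)
    y = 1 + 2 * x1 + 1 * x2 + 1 * x3 + rng.normal(0, 1, n)  # true model from Table 2

    X_clean = np.column_stack([np.ones(n), x1, x2, x3])
    x3_noisy = x3.copy()
    x3_noisy[:10] *= 2  # extra noise in the first 10 observations
    X_noisy = np.column_stack([np.ones(n), x1, x2, x3_noisy])

    b_clean, *_ = np.linalg.lstsq(X_clean, y, rcond=None)
    b_noisy, *_ = np.linalg.lstsq(X_noisy, y, rcond=None)
    return abs(b_clean[3] - b_noisy[3])

n_reps = 300  # Monte Carlo replicates
bias_small = np.mean([beta3_discrepancy(50) for _ in range(n_reps)])
bias_large = np.mean([beta3_discrepancy(500) for _ in range(n_reps)])
print(f"Mean |discrepancy| in beta3: n=50 -> {bias_small:.3f}, n=500 -> {bias_large:.3f}")
```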

Goodness of fit

Goodness-of-fit tests are essential statistical tools for evaluating how well a model represents the observed data. The Bayesian information criterion (BIC), the Akaike information criterion (AIC), and other indicators derived from them are widely used for model selection [41]. By comparing goodness-of-fit statistics such as these, analysts can select the models that better represent the underlying data. Cheng et al. [42] evaluated the performance of sequential models for the progression of diabetic kidney disease to kidney failure with different combinations of predictive variables in derivation and validation cohorts, emphasizing the importance of goodness of fit in determining a model’s performance in accurately predicting kidney disease progression.
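For ordinary least squares models, AIC and BIC can be computed directly from the residual sum of squares via the Gaussian log-likelihood. The Python sketch below compares an underspecified model against the true two-predictor model on simulated data; the effect sizes are illustrative, and lower criterion values indicate a better trade-off between fit and complexity.

```python
import numpy as np

def aic_bic(y, X):
    """AIC and BIC for an OLS fit, using the maximized Gaussian log-likelihood."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    ll = -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)
    n_params = k + 1  # regression coefficients plus the error variance
    return 2 * n_params - 2 * ll, n_params * np.log(n) - 2 * ll

rng = np.random.default_rng(8)
n = 300
x1, x2 = rng.normal(0, 1, n), rng.normal(0, 1, n)
y = 1 + 2 * x1 + 1.5 * x2 + rng.normal(0, 1, n)  # true model uses both predictors

aic_small, bic_small = aic_bic(y, np.column_stack([np.ones(n), x1]))
aic_full, bic_full = aic_bic(y, np.column_stack([np.ones(n), x1, x2]))
print(f"x1 only: AIC={aic_small:.1f} | x1 + x2: AIC={aic_full:.1f} (lower is better)")
```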

Machine learning: applications, improvements, and key considerations

The integration of machine learning (ML) into nephrology research has revolutionized data analysis in recent years, enabling deep insights from structured and unstructured datasets. These approaches have been applied to predict disease progression, identify biomarkers, and enhance decision-making in nephrology clinical care. However, while these tools are powerful, researchers must remain mindful of when to use them and how to enhance their effectiveness by refining the models or integrating them with other methods. Novice researchers often lack the experience needed to run a data mining project effectively, which can lead to incorrect practices, common mistakes, or overly optimistic results [43].
Data in nephrology are high-dimensional due to their large volume and variety, collected from multiple sources such as patient registries, EHRs, and clinical trials [44]. To account for this, feature selection is recommended to remove irrelevant and redundant variables when using machine learning [45]. A study conducted by Ebiaredoh-Mienye et al. [46] showed that machine learning models trained with a reduced feature set outperformed those trained with the complete feature set, comparing logistic regression, decision tree, XGBoost, random forest, support vector machine, and conventional AdaBoost models for the detection of CKD. Multi-omics data, including genomics (DNA), transcriptomics (RNA), proteomics (proteins), and metabolomics (metabolites), are also widely used in nephrology [47]. Integrating multi-omics data with machine learning for kidney disease research enables novel disease classification, reclassifying patients into molecularly defined subgroups and uncovering the underlying molecular mechanisms and biological pathways of various diseases [47,48]. Feature selection on multi-omics data often involves merging different data types without considering their sources, which can lead to the loss of information specific to individual omics datasets. Several approaches have been developed to account for this group structure, such as the lasso-based feature selection methods sparse group lasso [49] and integrative lasso with penalty factors [50], the random forest extension block forest [51], and the partial least squares (PLS)-based adaptive sparse multi-block PLS (asmbPLS) [52] and asmbPLS discriminant analysis [53].
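To illustrate the basic idea behind feature selection, the Python sketch below applies a simple filter-style screen (ranking candidate features by absolute correlation with the outcome) to simulated data in which only 3 of 50 features are informative. This toy univariate screen is not one of the group-structured methods cited above; it only demonstrates why removing irrelevant features helps.

```python
import numpy as np

rng = np.random.default_rng(13)
n, p = 200, 50

# High-dimensional data: 50 candidate features, only 3 truly informative
X = rng.normal(0, 1, (n, p))
informative = [4, 17, 33]
y = X[:, 4] + X[:, 17] - X[:, 33] + rng.normal(0, 0.5, n)

# Filter-style feature selection: rank features by |correlation| with y
corrs = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(p)])
top3 = set(np.argsort(corrs)[-3:])

print(f"Selected features: {sorted(top3)}")  # recovers the informative set
```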
Another challenge in applying machine learning to nephrology research data is the limitation of sample size. Compared to other medical specialties, nephrology has reported relatively few clinical trials with large sample sizes [54,55]. When ML models are applied to small datasets, their apparent accuracy may be exaggerated due to overfitting or random effects [37]. Thus, it is essential to ensure that the sample size is sufficient relative to the complexity of the model and the number of features in the data. In addition, robust evaluation of ML classification is important when training and testing sample sizes are small, as it facilitates meaningful comparisons across studies and methods [56–58].
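The exaggerated accuracy that small samples produce can be made concrete. In the Python sketch below, an ordinary least squares model with 25 predictors is fit to only 30 training observations whose outcome is pure noise: in-sample fit looks excellent, while performance on new data collapses. The sample sizes and predictor count are illustrative.

```python
import numpy as np

rng = np.random.default_rng(21)
n_train, n_test, p = 30, 1000, 25

def r2(y, yhat):
    return 1 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

# Outcome is pure noise: no feature carries any real signal
X_train, y_train = rng.normal(0, 1, (n_train, p)), rng.normal(0, 1, n_train)
X_test, y_test = rng.normal(0, 1, (n_test, p)), rng.normal(0, 1, n_test)

# Fit OLS with an intercept on the small training set
A = np.column_stack([np.ones(n_train), X_train])
beta, *_ = np.linalg.lstsq(A, y_train, rcond=None)

train_r2 = r2(y_train, A @ beta)
test_r2 = r2(y_test, np.column_stack([np.ones(n_test), X_test]) @ beta)
print(f"Train R^2 = {train_r2:.2f}, test R^2 = {test_r2:.2f}")  # large optimism gap
```

Cross-validation or a held-out test set, as used here, exposes the gap that in-sample metrics hide.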

Reproducible research

Reproducibility is essential for scientific accuracy, ensuring that findings are reliable, verifiable, and transparent. Integrating computations, code, and data into documents enables readers to verify, adapt, and reproduce results. However, in biomedical research, where complex datasets and advanced statistical models are frequently used, studies often lack transparency and reproducibility. A study by Iqbal et al. [59] analyzed 441 biomedical publications and found that only one provided access to study protocols, none provided raw data, and only four were reproducible. Similarly, in nephrology research, a study by Fladie et al. [60] evaluated the nephrology literature for reproducible practices by analyzing 300 randomly sampled publications and found that 123 of them lacked the empirical data necessary for reproducibility.
Statistical software like R and RStudio are commonly used among biostatisticians, and they are powerful tools that provide user-friendly packages for incorporating reproducible research into statistical workflows [61]. They allow researchers to combine code, results, and descriptive comments into a single document, making it easier to generate comprehensive reports that can be shared and reproduced by others. Through RStudio, researchers can create dynamic, literate programming documents that integrate statistical analysis with explanatory text and visualizations. These reports can be generated in various formats, such as HTML, PDF, and Word, making it easier to share the findings. The inclusion of the code and data used in the analysis ensures transparency and enables other researchers to understand the analytical process, replicate the results, and adapt the analysis if necessary. Beyond RStudio, several other tools and platforms also support reproducibility. Platforms such as Git and GitHub play important roles in improving the reproducibility of research [61]. By using these version control systems, researchers can effectively track changes to their code over time, enhancing collaboration and transparency. As such, maintaining a traceable research process is essential for enabling rigorous scrutiny and validation, which is crucial for ensuring the long-term reliability of findings. A link to the GitHub repository, including all necessary files such as programming code and descriptive instructions, is recommended for publication.
To maximize the reproducibility of research, this review proposes a checklist that researchers can follow (Fig. 2). The checklist divides the research process into three stages: before data analysis, during data analysis, and after data analysis. Before data analysis, researchers should outline their study’s goals, methodology, and design in a clear proposal, and create standardized protocols for data collection and analysis. A comprehensive data management plan should define procedures for data storage, backup, and sharing while identifying necessary software, tools, and libraries. Version control systems, such as Git and GitHub, should be used to track code changes. During data analysis, researchers must document standardized data cleaning steps, thoroughly annotate the code, and keep track of changes with meaningful commit messages. Using reproducible scripts (e.g., R Markdown) is essential to ensure transparency. After analysis, researchers should verify that the results can be replicated with the documented code and data. Data and analysis scripts should be shared for replication, and the final version of the code should be well-documented. In publications, all supplementary materials should be included or linked. By adhering to these practices, researchers can promote transparency, improve reproducibility, reduce the potential for errors, and create a foundation for future studies to build upon.

Conclusions and discussion

Nephrology research involves complex data, which requires advanced statistical methods to reach reliable conclusions. This review highlights several common challenges in the field, including violations of statistical assumptions, such as non-independence of observations, heteroscedasticity, and non-normality of residuals, which can lead to misleading results. In practice, these assumptions are frequently overlooked, but using appropriate methods, like mixed-effects models, GEE, and nonparametric alternatives, can help address these issues and improve the reliability of findings [14,15,34]. Another concern in nephrology research is multicollinearity, where predictor variables are highly correlated, making it difficult to determine their independent effects. Tools like the VIF are useful for detecting multicollinearity and refining models [32,33]. Additionally, missing data is a prevalent issue in clinical studies, and improper handling of missing values can introduce significant biases. Researchers should consider using methods such as MI or maximum likelihood estimation to address missing data and avoid distorting the analysis [36].
Overfitting is another common problem, particularly in studies with small sample sizes, where models may appear to perform well on training data but fail to generalize to new datasets. Ensuring sufficient sample sizes and using cross-validation can help reduce this risk [62]. Reporting optimism-corrected summary measures such as the R2 can also help. Using ML methods in nephrology research offers great opportunities for predictive modeling and biomarker identification. However, ML models require careful handling, especially when dealing with high-dimensional data or small sample sizes. Feature selection and proper validation approaches can prevent overfitting and promote generalizability [49–53].
Reproducibility is another cornerstone of quality research. Many studies in nephrology lack transparency, which undermines their reliability [60]. Researchers should consider open access to data, code, and study protocols, enabling others to verify and replicate their work. Tools such as R, RStudio, R Markdown, Git, and GitHub support reproducible research by integrating code, results, and data into a transparent workflow. The checklist proposed in this review provides a structured framework for researchers to follow, ensuring that all aspects of the research process, from data collection and analysis to documentation and sharing, are conducted transparently and reproducibly.
In conclusion, nephrology researchers must think deeply about addressing common statistical challenges to ensure the robustness of their findings. Promoting reproducibility will strengthen the reliability of nephrology research and contribute to the long-term goal of enhancing patient care and outcomes.

Notes

Conflicts of interest

All authors have no conflicts of interest to declare.

Data sharing statement

The data presented in this study are available from the corresponding author upon reasonable request.

Authors’ contributions

Conceptualization, Data curation, Supervision: HK

Investigation, Methodology: KX, HK

Resources: KX

Writing–original draft: KX

Writing–review & editing: HK

All authors read and approved the final manuscript.

Figure 1.

A plot illustrating the trend of bias (i.e., discrepancy of estimated β3 between with and without extra noise) as a function of sample size.

Figure 2.

A checklist guiding researchers to maximize the reproducibility of research.

Table 1.
Commonly ignored statistical assumptions
Assumption Brief description
Independence of observations This is the assumption that each observation is independent of the others, meaning the value of one observation does not influence or depend on the value of another. This assumption is violated if the same subject is measured multiple times (e.g., blood pressure taken every hour for the same patient), as repeated measurements from the same individual are not independent.
Homoscedasticity This is the assumption that the variance of a variable remains constant across all levels of the independent variables. In other words, the spread of the dependent variable around the regression line is consistent for every value of the predictors. This assumption is often violated in medical studies; for example, younger patients might show more consistent responses to a drug or treatment than older patients due to better overall health. As a result, one group may have a higher variance than another.
Normality of residuals This assumption implies that the residuals follow a normal distribution. It is particularly important for inference in small samples, as many statistical tests rely on the normality of errors to determine the significance of the results.
Absence of multicollinearity The absence of multicollinearity assumption in regression is that there is no perfect or exact relationship between the explanatory variables. Multicollinearity occurs when two or more predictors in a regression model are highly correlated with each other.
Table 2.
Parameters and true values used in the simulation study
Parameter True value Note
(β0, β1, β2, β3) (1, 2, 1, 1)
x1 Normal (0,1) Random number from standard normal distribution
x2 0 or 1 P(x2 = 0) = P(x2 = 1) = 0.5
x3 Normal (2, 1) Random number from normal distribution of mean = 2 and variance = 1

References

1. Kidney Disease Surveillance System. Awareness of chronic kidney disease remains low among US Adults [Internet]. Kidney Disease Surveillance System, c2024 [cited 2025 Jan 21]. Available from: https://nccd.cdc.gov/ckd/AreYouAware.aspx?emailDate=June_2024
2. Johansen KL, Chertow GM, Foley RN, et al. US renal data system 2020 annual data report: epidemiology of kidney disease in the United States. Am J Kidney Dis 2021;77(4 Suppl 1):A7–A8.
3. Saez-Rodriguez J, Rinschen MM, Floege J, Kramann R. Big science and big data in nephrology. Kidney Int 2019;95:1326–1337.
4. Borg DN, Lohse KR, Sainani KL. Ten common statistical errors from all phases of research, and their fixes. PM R 2020;12:610–614.
5. Fayos De Arizon L, Viera ER, et al. Artificial intelligence: a new field of knowledge for nephrologists? Clin Kidney J 2023;16:2314–2326.
6. Lemley KV. Machine learning comes to nephrology. J Am Soc Nephrol 2019;30:1780–1781.
7. McCullough K, Sharma P, Ali T, et al. Measuring the population burden of chronic kidney disease: a systematic literature review of the estimated prevalence of impaired kidney function. Nephrol Dial Transplant 2012;27:1812–1821.
8. Streja E, Goldstein L, Soohoo M, Obi Y, Kalantar-Zadeh K, Rhee CM. Modeling longitudinal data and its impact on survival in observational nephrology studies: tools and considerations. Nephrol Dial Transplant 2017;32(suppl_2):ii77–ii83.
9. Zeng XX, Liu J, Ma L, Fu P. Big data research in chronic kidney disease. Chin Med J (Engl) 2018;131:2647–2650.
10. Curran-Everett D, Benos DJ; American Physiological Society. Guidelines for reporting statistics in journals published by the American Physiological Society. Am J Physiol Endocrinol Metab 2004;287:E189–E191.
11. Waikar SS, Sabbisetti VS, Bonventre JV. Normalization of urinary biomarkers to creatinine during changes in glomerular filtration rate. Kidney Int 2010;78:486–494.
12. Blazek K, van Zwieten A, Saglimbene V, Teixeira-Pinto A. A practical guide to multiple imputation of missing data in nephrology. Kidney Int 2021;99:68–74.
13. Kim N, Fischer AH, Dyring-Andersen B, Rosner B, Okoye GA. Research techniques made simple: choosing appropriate statistical methods for clinical research. J Invest Dermatol 2017;137:e173–e178.
14. Vonesh E, Tighiouart H, Ying J, et al. Mixed-effects models for slope-based endpoints in clinical trials of chronic kidney disease. Stat Med 2019;38:4218–4239.
15. Boucquemont J, Heinze G, Jager KJ, Oberbauer R, Leffondre K. Regression methods for investigating risk factors of chronic kidney disease outcomes: the state of the art. BMC Nephrol 2014;15:45.
16. Gupta AK, Udrea A. Beyond linear methods of data analysis: time series analysis and its applications in renal research. Nephron Physiol 2013;124:14–27.
17. Rosopa PJ, Schaffer MM, Schroeder AN. Managing heteroscedasticity in general linear models. Psychol Methods 2013;18:335–351.
18. Yang K, Tu J, Chen T. Homoscedasticity: an overlooked critical assumption for linear regression. Gen Psychiatr 2019;32:e100148.
19. Altman DG, Bland JM. Statistics notes: the normal distribution. BMJ 1995;310:298.
20. Curran-Everett D. Explorations in statistics: the assumption of normality. Adv Physiol Educ 2017;41:449–453.
21. Hoekstra R, Kiers HA, Johnson A. Are assumptions of well-known statistical techniques checked, and why (not)? Front Psychol 2012;3:137.
22. Ali MM, Sharma SC. Robustness to nonnormality of regression F-tests. J Econom 1996;71:175–205.
23. Box GEP, Watson GS. Robustness to non-normality of regression tests. Biometrika 1962;49:93–106.
24. Lumley T, Diehr P, Emerson S, Chen L. The importance of the normality assumption in large public health data sets. Annu Rev Public Health 2002;23:151–169.
25. Schielzeth H, Dingemanse NJ, Nakagawa S, et al. Robustness of linear mixed-effects models to violations of distributional assumptions. Methods Ecol Evol 2020;11:1141–1152.
26. Knief U, Forstmeier W. Violating the normality assumption may be the lesser of two evils. Behav Res Methods 2021;53:2576–2590.
27. Schmidt AF, Finan C. Linear regression and the normality assumption. J Clin Epidemiol 2018;98:146–151.
28. Tang KW, Toh QC, Teo BW. Normalisation of urinary biomarkers to creatinine for clinical practice and research: when and why. Singapore Med J 2015;56:7–10.
29. Blaikley J, Sutton P, Walter M, et al. Tubular proteinuria and enzymuria following open heart surgery. Intensive Care Med 2003;29:1364–1367.
30. Kamijo A, Sugaya T, Hikawa A, et al. Urinary excretion of fatty acid-binding protein reflects stress overload on the proximal tubules. Am J Pathol 2004;165:1243–1255.
31. Kamal A. Estimation of blood urea (BUN) and serum creatinine level in patients of renal disorder. Indian J Fundam Appl Life Sci 2014;4:199–202.
32. Yoo W, Mayberry R, Bae S, Singh K, He QP, Lillard JW. A study of effects of multicollinearity in the multivariable analysis. Int J Appl Sci Technol 2014;4:9–19.
33. Kim JH. Multicollinearity and misleading statistical results. Korean J Anesthesiol 2019;72:558–569.
34. Nahm FS. Nonparametric statistical tests for the continuous data: the basic concept and the practical use. Korean J Anesthesiol 2016;69:8–14.
35. Graham JW. Missing data analysis: making it work in the real world. Annu Rev Psychol 2009;60:549–576.
36. Montez-Rath ME, Winkelmayer WC, Desai M. Addressing missing data in clinical studies of kidney diseases. Clin J Am Soc Nephrol 2014;9:1328–1335.
37. Baigent C, Herrington WG, Coresh J, et al. Challenges in conducting clinical trials in nephrology: conclusions from a Kidney Disease-Improving Global Outcomes (KDIGO) controversies conference. Kidney Int 2017;92:297–305.
38. Riley RD, Collins GS. Stability of clinical prediction models developed using statistical or machine learning methods. Biom J 2023;65:e2200302.
39. Harrell FE. Regression modeling strategies: with applications to linear models, logistic regression, and survival analysis. Springer; 2001.
40. Kang H, Kim EE, Shokouhi S, Tokita K, Shin HW. Texture analysis of F-18 Fluciclovine PET/CT to predict biochemically recurrent prostate cancer: initial results. Tomography 2020;6:301–307.
41. Kuha J. AIC and BIC: comparisons of assumptions and performance. Sociol Methods Res 2004;33:188–229.
42. Cheng Y, Shang J, Liu D, Xiao J, Zhao Z. Development and validation of a predictive model for the progression of diabetic kidney disease to kidney failure. Ren Fail 2020;42:550–559.
43. Chicco D. Ten quick tips for machine learning in computational biology. BioData Min 2017;10:35.
44. Kaur N, Bhattacharya S, Butte AJ. Big data in nephrology. Nat Rev Nephrol 2021;17:676–687.
45. Cai J, Luo J, Wang S, Yang S. Feature selection in machine learning: a new perspective. Neurocomputing 2018;300:70–79.
46. Ebiaredoh-Mienye SA, Swart TG, Esenogho E, Mienye ID. A machine learning method with filter-based feature selection for improved prediction of chronic kidney disease. Bioengineering (Basel) 2022;9:350.
47. Rhee EP. How omics data can be used in nephrology. Am J Kidney Dis 2018;72:129–135.
48. Eddy S, Mariani LH, Kretzler M. Integrated multi-omics approaches to improve classification of chronic kidney disease. Nat Rev Nephrol 2020;16:657–668.
49. Simon N, Friedman J, Hastie T, Tibshirani R. A sparse-group lasso. J Comput Graph Stat 2013;22:231–245.
50. Boulesteix AL, De Bin R, Jiang X, Fuchs M. IPF-LASSO: integrative L1-penalized regression with penalty factors for prediction based on multi-omics data. Comput Math Methods Med 2017;2017:7691937.
51. Hornung R, Wright MN. Block Forests: random forests for blocks of clinical and omics covariate data. BMC Bioinformatics 2019;20:358.
52. Zhang R, Datta S. asmbPLS: biomarker identification and patient survival prediction with multi-omics data. Front Genet 2024;15:1444054.
53. Zhang R, Datta S. Adaptive sparse multi-block PLS discriminant analysis: an integrative method for identifying key biomarkers from multi-omics data. Genes (Basel) 2023;14:961.
54. Strippoli GF, Craig JC, Schena FP. The number, quality, and coverage of randomized controlled trials in nephrology. J Am Soc Nephrol 2004;15:411–419.
55. Inrig JK, Califf RM, Tasneem A, et al. The landscape of clinical trials in nephrology: a systematic review of Clinicaltrials.gov. Am J Kidney Dis 2014;63:771–780.
56. Vabalas A, Gowen E, Poliakoff E, Casson AJ. Machine learning algorithm validation with a limited sample size. PLoS One 2019;14:e0224365.
57. Raudys SJ, Jain AK. Small sample size effects in statistical pattern recognition: recommendations for practitioners. IEEE Trans Pattern Anal Mach Intell 1991;13:252–264.
58. Kanal L, Chandrasekaran B. On dimensionality and sample size in statistical pattern classification. Pattern Recognit 1971;3:225–234.
59. Iqbal SA, Wallach JD, Khoury MJ, Schully SD, Ioannidis JP. Reproducible research practices and transparency across the biomedical literature. PLoS Biol 2016;14:e1002333.
60. Fladie IA, Adewumi TM, Vo NH, Tritz DJ, Vassar MB. An evaluation of nephrology literature for transparency and reproducibility indicators: cross-sectional review. Kidney Int Rep 2019;5:173–181.
61. Horton NJ, Kleinman K. Using R and RStudio for data management, statistical analysis, and graphics. 2nd ed. CRC Press; 2015.
62. Montesinos López OA, Montesinos López A, Crossa J. Multivariate statistical machine learning methods for genomic prediction. Springer; 2022.
ORCID iDs

Ke Xu
https://orcid.org/0000-0002-6153-5055

Hakmook Kang
https://orcid.org/0000-0001-6876-4021


Copyright © 2025 by The Korean Society of Nephrology.
