Statistical considerations in nephrology research
Abstract
Nephrology research plays an important role in advancing our understanding of kidney disease and improving patient outcomes. However, the complexity of nephrology data and the application of advanced statistical methods present significant challenges. This review highlights key statistical considerations in nephrology research, focusing on common errors such as violations of statistical assumptions, multicollinearity, missing data, overfitting, and the integration of machine learning tools. It emphasizes the importance of applying appropriate statistical approaches to ensure the reliability of study findings. Additionally, the review underscores the need for transparency and reproducibility in nephrology research, particularly the importance of open access to data, code, and study protocols. By utilizing tools like R, RStudio, Git, and GitHub, researchers can integrate their code, results, and data into a transparent workflow, enhancing the reproducibility of their research. This review also presents a practical checklist for promoting reproducible research practices, which can help improve the quality, transparency, and reliability of nephrology studies. This review aims to contribute to the advancement of nephrology research and, ultimately, to support the long-term goal of improving patient care and outcomes.
Introduction
Nephrology research is increasingly important, as chronic kidney disease (CKD) affects over 35 million adults in the United States, according to the Kidney Disease Surveillance System from the U.S. Centers for Disease Control and Prevention [1]. This represents a significant public health issue, with Medicare spending on CKD surpassing $130 billion annually [2]. Given the scale and widespread impact of this challenge, using accurate statistical methods is critical. Advancements in nephrology research and the broader medical field are increasingly driven by sophisticated statistical methods, enabling researchers to extract meaningful insights from complex data [3]. From exploratory studies to clinical trials, the correct application of statistical techniques plays a crucial role in ensuring reliable and reproducible results. Despite this critical importance, common statistical errors, such as violating statistical assumptions, applying inappropriate statistical approaches, and neglecting adequate sample size considerations, continue to undermine the validity of many medical studies across all phases [4]. As innovative approaches such as machine learning, deep learning, and artificial intelligence become widely used in nephrology research, the need for a thorough understanding of statistical principles, their strengths and limitations, and appropriate applications becomes more urgent [5]. These techniques hold great promise for personalized medicine and predictive modeling in nephrology, but they require careful application to avoid overfitting, misinterpretation, or bias [6].
This growing reliance on advanced statistical methods, however, introduces additional challenges for nephrology data analysis. Data from nephrology studies are diverse and multimodal. Many nephrology studies involve complex, time-dependent data, such as time-to-event data, censored observations, and heterogeneous patient populations [7,8]. Furthermore, the increasing availability of large-scale datasets, such as electronic health records (EHRs), biobank data, and multi-omics studies, offers unprecedented opportunities to explore kidney disease from molecular, genetic, and population perspectives [3,9]. These complexities require specialized statistical approaches that account for the distinct characteristics of nephrology data, help avoid misleading conclusions, and ensure the robustness and generalizability of study findings.
The need for statistical precision and expert guidance in nephrology research is increasingly critical. This review provides practical research guidance, outlining key statistical considerations to help researchers navigate the complexities of nephrology data. Its long-term goal is to equip researchers with the tools and knowledge necessary to conduct high-quality studies that deepen our understanding of nephrology research and improve patient care and outcomes. By emphasizing the importance of methodologies, the review also offers practical advice on overcoming statistical challenges in study design and data analysis, helping researchers to contribute meaningfully to the broader nephrology field.
Commonly ignored statistical assumptions
Statistical errors commonly occur in the scientific literature, with approximately 50% of published articles containing at least one error [10]. In nephrology research, several common statistical issues require careful attention to ensure the validity and reliability of results. These issues range from improper handling of missing data, which is particularly prevalent in clinical studies where patients may drop out or have incomplete records, to incorrect assumptions about data distribution, such as assuming normality for urinary biomarkers like albumin and other markers of kidney injury when they may in fact follow non-normal distributions [11,12]. A summary of commonly overlooked statistical assumptions is provided in Table 1.
Independence of observations
Most statistical methods, such as t tests, analysis of variance, and regression, assume that observations are independent of one another [13]. However, in real-world settings, some observations should be treated as more similar to each other than as fully independent. This assumption is frequently violated in practice due to clustered, repeated-measures, or temporally dependent data structures in nephrology research. For instance, in clustered data, observations within the same group, such as patients within a hospital, often exhibit within-cluster correlation due to shared characteristics. Similarly, longitudinal and time-series data often display temporal dependence, where measurements taken closer in time are more strongly correlated. Ignoring these dependencies can lead to inflated type I error rates and incorrect standard errors [13]. To address these issues, researchers should use statistical methods that account for such dependencies, including mixed-effects models [14], generalized estimating equations (GEE) [15], or time-series analysis [16].
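To illustrate the consequence of ignoring clustering, consider a minimal simulation. This is a hypothetical sketch (Python standard library only; the hospital counts, means, and variances are invented for illustration): treating clustered observations as independent yields a standard error that is far too small, which is what inflates type I error rates.

```python
import random
import statistics

random.seed(1)

# Hypothetical example: a blood-pressure-like outcome measured on
# patients nested within hospitals; each hospital has a shared effect.
n_hospitals, n_per = 20, 10
values = []
for h in range(n_hospitals):
    hospital_effect = random.gauss(0, 5)   # shared within-cluster shift
    for _ in range(n_per):
        values.append(120 + hospital_effect + random.gauss(0, 5))

# Naive SE treats all 200 observations as independent
naive_se = statistics.stdev(values) / (len(values) ** 0.5)

# Cluster-aware SE: average within each hospital first, then use the
# 20 hospital means as the effective independent units
means = [statistics.mean(values[h * n_per:(h + 1) * n_per])
         for h in range(n_hospitals)]
cluster_se = statistics.stdev(means) / (n_hospitals ** 0.5)

print(f"naive SE: {naive_se:.2f}, cluster-aware SE: {cluster_se:.2f}")
```

The cluster-aware standard error is several times larger than the naive one; mixed-effects models and GEE formalize this correction while still using all observations.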
Homoscedasticity
Linear regression is widely used in nephrology research [15]. An important assumption of linear regression is homoscedasticity, which refers to the constant variance of residuals across different levels of an independent variable. When the homoscedasticity assumption is violated, it can result in higher type I error rates or reduced statistical power. This violation may negatively impact the accuracy of conclusions, and failing to identify and address heteroscedasticity can have significant consequences for theory, research, and practice. In general, detecting violations of homoscedasticity and addressing its biasing effects can enhance the validity of inferences [17,18].
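A simple way to screen for heteroscedasticity is to compare residual spread across the range of the predictor. The sketch below is a hypothetical illustration (Python standard library only; the model, coefficients, and the error structure are invented), using a crude split-sample variance ratio rather than a formal test such as Breusch-Pagan:

```python
import random
import statistics

random.seed(2)

def ols_fit(x, y):
    """Ordinary least squares for a single predictor (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
         sum((xi - mx) ** 2 for xi in x)
    return my - b1 * mx, b1

# Hypothetical data whose residual spread grows with the predictor,
# i.e., the homoscedasticity assumption is violated by construction
x = [i / 10 for i in range(1, 201)]
y = [2 + 0.5 * xi + random.gauss(0, 0.2 * xi) for xi in x]

b0, b1 = ols_fit(x, y)
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Crude diagnostic: residual variance in the lower vs. upper half of x;
# a large ratio suggests heteroscedasticity
half = len(x) // 2
var_low = statistics.variance(resid[:half])
var_high = statistics.variance(resid[half:])
print(f"residual variance ratio (upper/lower half of x): {var_high / var_low:.1f}")
```

In practice, residual-versus-fitted plots and formal tests implemented in standard statistical software are preferable; the point of the sketch is only that the violation is easy to detect once one looks for it.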
Normality of residuals
Normality assumptions are required for many statistical procedures, such as correlation, regression, t tests, and other parametric tests; that is, the populations from which the samples are drawn are assumed to be normally distributed [19,20]. However, a review found that, across 30 studies examined, the assumptions of these statistical techniques were rarely checked for violations in the data [21]. Although some of the scientific literature suggests that linear models are generally robust to violations of the normality assumption for hypothesis testing and parameter estimation if outliers are properly handled, certain scenarios still require careful attention to avoid significant bias [22–25]. A study conducted by Knief and Forstmeier [26] analyzed 100 simulated scenarios to evaluate the impact of violations of the normality assumption and found that violations are generally not problematic when sample sizes are moderate (n = 100) to large (n = 1,000) or when there are no extreme outliers, as Gaussian models remain robust and produce reliable p-values. However, issues arise with very small sample sizes (n = 10) or when extreme outliers are present in either the independent or dependent variables. Sample size plays a critical role in the reliability of statistical results because, for small sample sizes, the normality assumption helps ensure unbiased estimation of standard errors, which in turn affects the accuracy of confidence intervals and p-values [27]. In nephrology studies, many researchers address this issue by normalizing urinary biomarker concentrations to urinary creatinine (uCr) [28]. In research focused on potential biomarkers for detecting acute kidney injury, Blaikley et al. [29] normalized a tubular marker, neutral endopeptidase, to uCr to account for variations in urinary flow rates. Normalization is also commonly used in studies of CKD and the progression of kidney disease. For instance, in a study conducted by Kamijo et al.
[30], the ratio of human liver-type fatty acid-binding protein (L-FABP) transcript intensity to glyceraldehyde-3-phosphate dehydrogenase transcript intensity was used to normalize the amount of human L-FABP transcripts.
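Beyond normalization to uCr, a logarithmic transformation is a common remedy for the right-skewed distributions typical of urinary biomarkers. The following hypothetical sketch (Python standard library only; the lognormal-like "biomarker" values are simulated, not real data) shows how the log transform removes most of the skewness:

```python
import math
import random
import statistics

random.seed(3)

def skewness(data):
    """Adjusted Fisher-Pearson sample skewness coefficient."""
    n = len(data)
    m = statistics.mean(data)
    s = statistics.stdev(data)
    return sum(((v - m) / s) ** 3 for v in data) * n / ((n - 1) * (n - 2))

# Hypothetical right-skewed biomarker concentrations (lognormal-like),
# as is common for urinary albumin and similar markers
raw = [math.exp(random.gauss(0, 1)) for _ in range(500)]
logged = [math.log(v) for v in raw]

sk_raw = skewness(raw)
sk_logged = skewness(logged)
print(f"skewness raw: {sk_raw:.2f}, after log transform: {sk_logged:.2f}")
```

The raw values show strong positive skew, while the log-transformed values are close to symmetric, making normality-based procedures far more defensible.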
Absence of multicollinearity
Multicollinearity occurs when independent variables in a regression model are highly correlated. For example, in a nephrology study investigating the impact of patient characteristics on glomerular filtration rate (GFR), it is important to carefully evaluate the correlations among covariates. A study conducted by Kamal [31] found a strong correlation between blood urea nitrogen (BUN) and serum creatinine levels in patients with renal disorders. Both BUN and serum creatinine are indicators of kidney function, and their high correlation can result in multicollinearity when included together as explanatory variables in a single regression model. When multicollinearity exists, it becomes challenging to determine the individual impact of serum creatinine and BUN on GFR because their effects overlap substantially and the variables are interrelated [32]. Diagnostic tools for multicollinearity include the variance inflation factor (VIF), the condition index and condition number, and the variance decomposition proportion; the VIF is one of the most widely used [32,33]. The VIF measures the strength of linear dependencies and quantifies how much the variance of each regression coefficient is inflated due to collinearity, compared to a scenario where the independent variables are not linearly related. Researchers are highly encouraged to use the VIF to identify multicollinearity and to consider removing or combining variables when appropriate.
Violating statistical assumptions in nephrology research compromises the validity and reproducibility of findings. When one or more assumptions are violated, estimates, standard errors, and p-values can be substantially distorted. In such cases, approaches like variable transformations or nonparametric methods are commonly used to address these issues. Researchers should assess whether their data meet these assumptions and consider alternative methods when violations occur. Nonparametric tests, such as the Wilcoxon rank-sum test or the Kruskal-Wallis test, offer robust alternatives that do not require normality or homogeneity of variances [34]. Data transformations, such as logarithmic or square-root transformations, can help normalize data and stabilize variance. Researchers should always perform diagnostic tests, such as computing the VIF, to ensure that assumptions hold before proceeding with their analyses and model adjustments.
Other analysis issues
Missing data
In addition to violations of statistical assumptions, there are other common analysis challenges in nephrology research. High-quality data is crucial for reliable research, but the issue of missing data is prevalent in nearly all studies and can significantly impact the conclusions drawn from the data [35]. Proper handling of missing data ensures that missing values are addressed in a way that does not bias the results. This includes using suitable single or multiple imputation approaches to estimate missing values. Montez-Rath et al. [36] proposed methods for addressing missing data in clinical studies of kidney diseases, categorizing the missing data into three scenarios: 1) missing completely at random (MCAR), where missing data has no relationship to either observed or unobserved factors, and patients included in the analysis are no different from those excluded; 2) missing at random (MAR), where missing data is related only to observed factors but not to unobserved ones; and 3) not missing at random (NMAR), where missing data depends on both observed and unobserved factors. The authors evaluated these scenarios and recommended the most appropriate approaches for each type of missing data. For MCAR, complete case analysis (CCA), which uses only the subset of data with complete information, is valid when the MCAR assumption holds. However, since MCAR is rare in practice, relying solely on CCA often leads to biased and inefficient estimates. For MAR, both maximum likelihood-based and multiple imputation (MI)-based methods are valid, with MI having the advantage of being supported by accessible and increasingly user-friendly software such as SAS (SAS Institute Inc.), Stata (StataCorp LLC), and R (R Foundation for Statistical Computing). For NMAR, more advanced methods are necessary, as both maximum likelihood and MI assume MAR. In this case, directly modeling the missing data mechanism under NMAR conditions is required.
Improper handling of missing data can distort the analysis and lead to inaccurate conclusions. Thus, researchers should document and validate all preprocessing steps to ensure the robustness of their findings.
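The bias of complete case analysis under MAR can be made concrete with a small simulation. This is a hypothetical sketch (Python standard library only; the eGFR-like outcome, the age effect, and the missingness rule are all invented for illustration): the outcome is more often missing for older patients, so CCA over-represents younger patients, while a simple regression-based imputation recovers the true mean.

```python
import random
import statistics

random.seed(5)

# Hypothetical MAR scenario: an eGFR-like outcome y depends on age x,
# and y is more often missing for older patients (missingness depends
# only on the observed x, not on y itself)
n = 2000
x = [random.gauss(60, 10) for _ in range(n)]
y = [100 - 0.8 * xi + random.gauss(0, 5) for xi in x]
observed = [yi if random.random() > min(0.9, max(0.0, (xi - 50) / 40)) else None
            for xi, yi in zip(x, y)]

true_mean = statistics.mean(y)
cca_mean = statistics.mean(v for v in observed if v is not None)

# Regression-based (single) imputation fitted on the observed pairs
pairs = [(xi, v) for xi, v in zip(x, observed) if v is not None]
mx = statistics.mean(p[0] for p in pairs)
my = statistics.mean(p[1] for p in pairs)
b1 = sum((a - mx) * (b - my) for a, b in pairs) / \
     sum((a - mx) ** 2 for a, _ in pairs)
b0 = my - b1 * mx
imputed = [v if v is not None else b0 + b1 * xi
           for xi, v in zip(x, observed)]
imp_mean = statistics.mean(imputed)

print(f"true {true_mean:.1f}, complete-case {cca_mean:.1f}, imputed {imp_mean:.1f}")
```

In real analyses, multiple imputation is preferred over this single-imputation sketch because it also propagates the uncertainty of the imputed values into standard errors.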
Sample size
When designing a study, accurate sample size calculation is essential for producing reliable and meaningful findings in nephrology research. A small sample size can lead to underpowered studies, especially in rare kidney diseases or specific patient groups, which may cause false negative results or miss important effects. Many treatments have been tested in small randomized trials that were not large enough to detect realistic treatment effects [37]. Using a small sample size to develop a prediction model can lead to inaccurate estimates and overfitting. This results in unreliable predictions and poor model performance when tested on new individuals from the same population, making the model less applicable to other settings [38]. One approach to mitigate overfitting in predictive models is to use optimism-corrected summary measures, such as the R2 for linear models or the area under the receiver operating characteristic curve for logistic regression models [39,40].
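Overfitting in small samples is easy to demonstrate. The following hypothetical sketch (Python standard library only; the classifier, sample sizes, and pure-noise features are chosen purely for illustration) trains a 1-nearest-neighbor classifier on features that carry no signal at all: it fits the training set perfectly yet is no better than chance on new cases, which is the signature of an overfitted model.

```python
import random

random.seed(6)

def one_nn_predict(train, point):
    """Return the label of the training point closest to `point`."""
    return min(train,
               key=lambda t: sum((a - b) ** 2 for a, b in zip(t[0], point)))[1]

def make_data(n):
    """Features are pure noise: they carry no information about labels."""
    return [([random.gauss(0, 1) for _ in range(5)], random.randint(0, 1))
            for _ in range(n)]

train, test = make_data(30), make_data(500)
train_acc = sum(one_nn_predict(train, f) == lab for f, lab in train) / len(train)
test_acc = sum(one_nn_predict(train, f) == lab for f, lab in test) / len(test)
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

Training accuracy is exactly 1.0 (every point is its own nearest neighbor), while test accuracy hovers near 0.5; only out-of-sample evaluation or optimism correction exposes the gap.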
A simulation study demonstrating the impact of a small sample size is presented. Assuming a sample size of n = 100, the underlying linear model is specified as y = β0 + β1x1 + β2x2 + β3x3 + ε, where ε follows a standard normal distribution. A summary of this simulation is provided in Table 2. The R code used to develop this simulation is available at the following GitHub repository: https://github.com/hakmook/KRCP_review.git.
We then introduced additional noise to x3 by doubling the first 10 observations, which resulted in a discrepancy between the estimated β3 with and without the extra noise. We examined how this discrepancy varied as a function of sample size by employing 300 Monte Carlo simulations. As illustrated in Fig. 1, it is evident that a larger sample size results in significantly smaller bias (or discrepancy between estimates with and without the extra noise). Conversely, smaller sample sizes are more sensitive to unexpected noise. Given that it is impossible to ensure that collected data are free from unexpected noise or errors, paying extra attention to sample size is crucial when designing a study.
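The full three-predictor simulation is available in R at the GitHub repository above. A simplified, single-predictor analogue of the same idea can be sketched as follows (Python standard library only; here "doubling" is taken to mean multiplying the first 10 predictor values by 2, and the coefficients, sample sizes, and replicate count are illustrative choices, not those of the original simulation):

```python
import random
import statistics

random.seed(7)

def slope(x, y):
    """OLS slope estimate for a single predictor."""
    mx, my = statistics.mean(x), statistics.mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / \
           sum((a - mx) ** 2 for a in x)

def mean_abs_bias(n, reps=300):
    """Mean absolute discrepancy in the slope with vs. without
    contaminating the first 10 observations, over Monte Carlo reps."""
    biases = []
    for _ in range(reps):
        x = [random.gauss(0, 1) for _ in range(n)]
        y = [1 + 2 * xi + random.gauss(0, 1) for xi in x]
        x_noisy = [2 * v for v in x[:10]] + x[10:]
        biases.append(abs(slope(x_noisy, y) - slope(x, y)))
    return statistics.mean(biases)

for n in (50, 200, 1000):
    print(f"n = {n:5d}: mean |bias| = {mean_abs_bias(n):.3f}")
```

Consistent with Fig. 1, the same fixed amount of contamination produces a much larger discrepancy at small n than at large n, because the 10 corrupted observations make up a larger fraction of the sample.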
Goodness of fit
Goodness-of-fit measures are essential statistical tools for evaluating how well a model represents the observed data. The Bayesian information criterion (BIC), the Akaike information criterion (AIC), and other indicators derived from them are widely used for model selection [41]. By comparing goodness-of-fit statistics across candidate models, analysts can select the models that better represent the underlying data. Cheng et al. [42] evaluated the performance of sequential models for the progression of diabetic kidney disease to kidney failure with different combinations of predictive variables in derivation and validation cohorts, emphasizing the importance of goodness of fit in determining a model’s performance in accurately predicting kidney disease progression.
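For ordinary least squares, the AIC can be computed (up to an additive constant) as n·ln(RSS/n) + 2k, with the BIC replacing 2k by k·ln(n); lower values indicate a better trade-off between fit and complexity. The hypothetical sketch below (Python standard library only; the data-generating model is invented) compares an intercept-only model with a one-predictor model when the predictor truly matters:

```python
import math
import random
import statistics

random.seed(8)

def fit_rss(x, y, use_predictor=True):
    """Residual sum of squares for intercept-only or one-predictor OLS."""
    if not use_predictor:                      # intercept-only model
        my = statistics.mean(y)
        return sum((yi - my) ** 2 for yi in y)
    mx, my = statistics.mean(x), statistics.mean(y)
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / \
         sum((a - mx) ** 2 for a in x)
    b0 = my - b1 * mx
    return sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))

n = 200
x = [random.gauss(0, 1) for _ in range(n)]
y = [3 + 1.5 * xi + random.gauss(0, 1) for xi in x]   # predictor truly matters

def aic(rss, k):
    # AIC up to an additive constant; k counts estimated mean parameters
    return n * math.log(rss / n) + 2 * k

aic_null = aic(fit_rss(x, y, use_predictor=False), k=1)
aic_full = aic(fit_rss(x, y, use_predictor=True), k=2)
print(f"AIC intercept-only: {aic_null:.1f}, AIC with predictor: {aic_full:.1f}")
```

The model that includes the informative predictor achieves a markedly lower AIC, which is exactly the comparison model-selection procedures automate across larger candidate sets.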
Machine learning: applications, improvements, and key considerations
The integration of machine learning (ML) into nephrology research has revolutionized data analysis in recent years, enabling deep insights from structured and unstructured datasets. These approaches have been applied to predict disease progression, identify biomarkers, and enhance decision-making in nephrology clinical care. However, while these tools are powerful, researchers must remain mindful of when to use them and how to enhance their effectiveness by refining the models or integrating them with other methods. Novice researchers often lack the experience needed to run a data mining project effectively, which can lead to incorrect practices, common mistakes, or overly optimistic results [43].
Data in nephrology are high-dimensional due to their large volume and variety, collected from multiple sources such as patient registries, EHRs, and clinical trials [44]. To account for this, feature selection is recommended to remove irrelevant and redundant variables when using machine learning [45]. A study conducted by Ebiaredoh-Mienye et al. [46] showed that machine learning models trained with a reduced feature set outperformed those trained with the complete feature set, comparing logistic regression, decision tree, XGBoost, random forest, support vector machine, and conventional AdaBoost models for the effective detection of CKD. Multi-omics data, including genomics (DNA), transcriptomics (RNA), proteomics (proteins), and metabolomics (metabolites), are also widely used in nephrology [47]. Integrating multi-omics data with machine learning for kidney disease research enables novel disease classification, reclassifying patients into molecularly defined subgroups and uncovering the underlying molecular mechanisms and biological pathways of various diseases [47,48]. Feature selection on multi-omics data often involves merging different data types without considering their sources, which can lead to the loss of information specific to individual omics datasets. Several approaches have been developed to account for this group structure, including lasso-based feature selection methods such as the sparse group lasso [49] and the integrative lasso with penalty factors [50], the random forest extension block forest [51], and the partial least squares (PLS)-based adaptive sparse multi-block PLS (asmbPLS) [52] and asmbPLS discriminant analysis [53].
Another challenge in applying machine learning to nephrology research data is the limitation of sample size. Compared to other medical specialties, nephrology has reported relatively few clinical trials with large sample sizes [54,55]. When machine learning models are applied to small datasets, the results may exaggerate the accuracy of ML due to overfitting or random variation [37]. Thus, it is essential to ensure that the sample size is sufficient relative to the complexity of the model and the number of features in the data. In addition, robust evaluation of ML classification is important when training and testing sample sizes are small, as it facilitates meaningful comparisons across studies and methods [56–58].
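K-fold cross-validation is a standard way to obtain a more robust performance estimate when samples are scarce. The following minimal sketch (Python standard library only; the label distribution, fold count, and the deliberately trivial majority-class "model" are hypothetical choices for illustration) shows the mechanics of splitting, fitting on the training folds, and scoring on the held-out fold:

```python
import random

random.seed(9)

def k_fold_indices(n, k):
    """Shuffle indices 0..n-1 and partition them into k folds."""
    idx = list(range(n))
    random.shuffle(idx)
    return [idx[i::k] for i in range(k)]

# Hypothetical binary labels with a 70% positive class
labels = [random.random() < 0.7 for _ in range(100)]

accs = []
for fold in k_fold_indices(len(labels), k=5):
    fold_set = set(fold)
    train = [labels[i] for i in range(len(labels)) if i not in fold_set]
    majority = sum(train) >= len(train) / 2   # "fit" on the training folds
    accs.append(sum(labels[i] == majority for i in fold) / len(fold))

cv_acc = sum(accs) / len(accs)
print(f"cross-validated accuracy: {cv_acc:.2f}")
```

The same loop structure accommodates any model in place of the majority-class rule; the essential point is that every performance number comes from data the model never saw during fitting.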
Reproducible research
Reproducibility is essential for scientific accuracy, ensuring that findings are reliable, verifiable, and transparent. Integrating computations, code, and data into documents enables readers to verify, adapt, and reproduce results. However, in biomedical research, where complex datasets and advanced statistical models are frequently used, studies often lack transparency and reproducibility. A study by Iqbal et al. [59] analyzed 441 biomedical publications and found that only one provided access to study protocols, none provided raw data, and only four were reproducible. Similarly, in nephrology research, a study by Fladie et al. [60] evaluated the nephrology literature for reproducible practices by analyzing 300 randomly sampled publications and found that 123 of them lacked the empirical data necessary for reproducibility.
Statistical software such as R and RStudio is commonly used among biostatisticians; these are powerful tools that provide user-friendly packages for incorporating reproducible research into statistical workflows [61]. They allow researchers to combine code, results, and descriptive comments into a single document, making it easier to generate comprehensive reports that can be shared and reproduced by others. Through RStudio, researchers can create dynamic, literate programming documents that integrate statistical analysis with explanatory text and visualizations. These reports can be generated in various formats, such as HTML, PDF, and Word, making it easier to share findings. The inclusion of the code and data used in the analysis ensures transparency and enables other researchers to understand the analytical process, replicate the results, and adapt the analysis if necessary. Beyond RStudio, several other tools and platforms also support reproducibility. Platforms such as Git and GitHub play important roles in improving the reproducibility of research [61]. With version control, researchers can effectively track changes to their code over time, enhancing collaboration and transparency. Maintaining a traceable research process is essential for enabling rigorous scrutiny and validation, which is crucial for ensuring the long-term reliability of findings. Including a link to a GitHub repository containing all necessary files, such as programming code and descriptive instructions, is recommended for publication.
To maximize the reproducibility of research, this review proposes a checklist that researchers can follow (Fig. 2). The checklist divides the research process into three phases: before data analysis, during data analysis, and after data analysis. Before data analysis, researchers should outline their study’s goals, methodology, and design in a clear proposal, and create standardized protocols for data collection and analysis. A comprehensive data management plan should define procedures for data storage, backup, and sharing while identifying necessary software, tools, and libraries. Version control systems, such as Git and GitHub, should be used to track code changes. During data analysis, researchers must document standardized data cleaning steps, thoroughly annotate the code, and keep track of changes with meaningful commit messages. Using reproducible scripts (e.g., R Markdown) is essential to ensure transparency. After analysis, researchers should verify that the results can be replicated with the documented code and data. Data and analysis scripts should be shared for replication, and the final version of the code should be well documented. In publications, all supplementary materials should be included or linked. By adhering to these practices, researchers can promote transparency, improve reproducibility, reduce the potential for errors, and create a foundation for future studies to build upon.
Conclusions and discussion
Nephrology research involves complex data, which requires advanced statistical methods to draw reliable conclusions. This review highlights several common challenges in the field, including violations of statistical assumptions, such as non-independence of observations, homoscedasticity, and normality of residuals, which can lead to misleading results. In practice, these assumptions are frequently overlooked, but using appropriate methods, like mixed-effects models, GEE, and nonparametric alternatives, can help address these issues and improve the reliability of findings [14,15,34]. Another concern in nephrology research is multicollinearity, where predictor variables are highly correlated, making it difficult to determine their independent effects. Tools like the VIF are useful for detecting multicollinearity and refining models [32,33]. Additionally, missing data is a prevalent issue in clinical studies, and improper handling of missing values can introduce significant biases. Researchers should consider using methods such as MI or maximum likelihood estimation to address missing data and avoid distorting the analysis [36].
Overfitting is another common problem, particularly in small sample size studies, where models may appear to perform well on training data but fail to generalize to new datasets. Ensuring sufficient sample sizes and using cross-validation can help reduce this risk [62]. Reporting optimism-corrected summary measures such as the R2 can also help reduce the risk. Using ML methods in nephrology research offers great opportunities for predictive modeling and biomarker identification. However, ML models require careful handling, especially when dealing with high-dimensional data or small sample sizes. Feature selection and proper validation approaches can prevent overfitting and promote generalizability [49–53].
Reproducibility is another cornerstone of quality research. Many studies in nephrology lack transparency, which undermines their reliability [60]. Researchers should consider open access to data, code, and study protocols, enabling others to verify and replicate their work. Tools such as R, RStudio, R Markdown, Git, and GitHub support reproducible research by integrating code, results, and data into a transparent workflow. The checklist proposed in this review helps researchers improve the accuracy and reliability of studies by providing a structured framework to follow, ensuring that all aspects of the research process, from data collection and analysis to documentation and sharing, are conducted transparently and reproducibly.
In conclusion, nephrology researchers must think deeply about addressing common statistical challenges to ensure the robustness of their findings. Promoting reproducibility will strengthen the reliability of nephrology research and contribute to the long-term goal of enhancing patient care and outcomes.
Notes
Conflicts of interest
All authors have no conflicts of interest to declare.
Data sharing statement
The data presented in this study are available from the corresponding author upon reasonable request.
Authors’ contributions
Conceptualization, Data curation, Supervision: HK
Investigation, Methodology: KX, HK
Resources: KX
Writing–original draft: KX
Writing–review & editing: HK
All authors read and approved the final manuscript.