Mortality data, or mortality rates, have been collected since the 1750s [1] and are used across research fields such as epidemiology, biostatistics, and biomedical and biopharmaceutical research. Excess mortality, meaning mortality above the expected rate, typically demands deeper investigation because it can signal fatal diseases or infections. In biomedical research, unraveling the causes of excess mortality often opens new lines of inquiry, such as the development of vaccines for coronavirus disease 2019. Consequently, instances of increased or excess mortality consistently attract significant attention in biomedical research.
In the era of big data, encompassing electronic health records (EHRs), multi-modal magnetic resonance imaging data, and multi-omics data, understanding data structures and integrating diverse information sources to address critical scientific questions is paramount. In large-scale datasets, missing data are inevitable, whether arising at random or from systematic issues [2]. EHR data, in particular, are prone to missing observations [3,4].
One strategy for tackling missing data is to impute the missing values from non-missing information within the dataset, provided that the non-missing information predicts the missing values. For example, missing body weight values could be imputed from height, sex, and age via a linear regression model [5]. Here, understanding the associations among variables in a dataset is pivotal for successful imputation.
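As a rough illustration of such regression-based imputation, the following sketch fills in missing body weight from height, sex, and age. The DataFrame and column names (weight, height, sex, age) are hypothetical, and the predictor columns are assumed to be fully observed.

```python
# A minimal sketch of regression-based imputation with scikit-learn; the
# column names are hypothetical and the predictors are assumed to be complete.
import pandas as pd
from sklearn.linear_model import LinearRegression

def impute_weight(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing body weight using a linear model on height, sex, and age."""
    df = df.copy()
    # Encode sex as an indicator variable so it can enter the regression.
    X = pd.get_dummies(df[["height", "sex", "age"]], drop_first=True)

    observed = df["weight"].notna()
    model = LinearRegression().fit(X[observed], df.loc[observed, "weight"])

    # Predict weight only for the rows where it is missing.
    df.loc[~observed, "weight"] = model.predict(X[~observed])
    return df
```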
Nevertheless, when observations are missing systematically, such as the absence of death information in the Health Insurance Review and Assessment Service database, imputation becomes more limited and complex. Linking a dataset with such systematic missingness to another dataset that contains the missing information can supply the necessary data without constructing an imputation model. When obtaining another dataset with the missing information is not feasible, imputation methods such as single or multiple imputation may need to be considered [5].
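Because a single imputed value can understate uncertainty, multiple imputation generates several completed datasets whose analyses are later pooled. The sketch below, assuming a numeric pandas DataFrame, uses scikit-learn's experimental IterativeImputer with posterior sampling as one possible implementation; it is illustrative only and not the specific method of any study cited here.

```python
# A rough sketch of multiple imputation: draw m completed copies of the data,
# analyze each, and pool the estimates (e.g., with Rubin's rules).
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

def multiply_impute(df: pd.DataFrame, m: int = 5) -> list[pd.DataFrame]:
    completed = []
    for seed in range(m):
        # sample_posterior=True adds draw-to-draw variability across the copies.
        imputer = IterativeImputer(sample_posterior=True, random_state=seed)
        filled = imputer.fit_transform(df)
        completed.append(pd.DataFrame(filled, columns=df.columns, index=df.index))
    return completed
```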
Mortality is a cornerstone parameter in biomedical research [6,7]. Some critical scientific questions cannot be addressed without knowledge of mortality rates. In such cases, operational definitions of mortality can be established, as outlined in [8,9]. A recent article by Lee et al. [10], titled “Validation of operational definitions of mortality in a nationwide hemodialysis population using the Health Insurance Review and Assessment Service databases of Korea,” validates several operational definitions of mortality based on intervals of 30, 60, 90, 120, 150, and 180 days without health insurance claims. The study found that the definition requiring 150 claim-free days yielded the most accurate results.
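To make the idea of a claim-free interval concrete, the sketch below flags a patient as presumed dead when no claim is observed within a chosen number of days before the end of follow-up. The input layout, the column names (patient_id, claim_date), and the 150-day default are illustrative assumptions and do not reproduce the exact algorithm of Lee et al. [10].

```python
# A simplified sketch of an operational definition of mortality based on a
# claim-free interval; input layout and column names are hypothetical.
import pandas as pd

def presumed_dead(claims: pd.DataFrame, study_end: pd.Timestamp,
                  gap_days: int = 150) -> pd.Series:
    """Return a boolean Series indexed by patient_id: True if the patient has
    had no health insurance claim for at least `gap_days` days."""
    last_claim = claims.groupby("patient_id")["claim_date"].max()
    # Days elapsed between each patient's last claim and the end of follow-up.
    gap = (study_end - last_claim).dt.days
    return gap >= gap_days
```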
From a statistical standpoint, whether a single definition performs uniformly across all age and sex groups is an intriguing question. If a single operational definition yields systematically more conservative or more lenient mortality estimates within a particular age group, it could bias study outcomes. Statistical models that account for potential confounders such as age or sex may therefore be essential. If one definition consistently overestimates mortality within a specific age and sex combination, proposing an alternative definition for that combination could be beneficial. Similarly, proposing multiple operational definitions for different strata, such as specific age and sex combinations, could help mitigate bias. However, the trade-off between bias and variance needs careful consideration, as too many definitions may lead to overfitting and reduced generalizability.
The performance of such estimates can be assessed through external or internal validation. While external validation using independent datasets is the gold standard, internal validation can be achieved through k-fold cross-validation combined with bootstrapping when external data are unavailable.
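As one way to operationalize such internal validation, the sketch below computes the fold-wise accuracy of an operational definition against linked death records; the arrays predicted_death and true_death are hypothetical inputs aligned by patient, and the bootstrapping described in the next paragraph can be layered on top of this fold structure.

```python
# A minimal sketch of fold-wise internal validation; the input arrays are
# hypothetical and assumed to be aligned by patient.
import numpy as np
from sklearn.model_selection import KFold

def fold_accuracy(predicted_death: np.ndarray, true_death: np.ndarray,
                  k: int = 5, seed: int = 0) -> np.ndarray:
    """Accuracy of the operational definition within each of k patient folds."""
    kf = KFold(n_splits=k, shuffle=True, random_state=seed)
    scores = []
    for _, test_idx in kf.split(predicted_death):
        scores.append((predicted_death[test_idx] == true_death[test_idx]).mean())
    return np.array(scores)
```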
In this context, examining the distribution of deviations from the true values within each age stratum via bootstrapping could be illuminating. Repeatedly resampling the data within each age group and recalculating the deviation from the true value generates a bootstrap distribution of deviations, which reflects the variability of the mortality estimate within that group. By proposing multiple operational definitions and generating the corresponding bootstrap distributions of deviations, one can identify the definitions with the smallest variance, which are likely to generalize better to similar datasets. Beyond proposing definitions, assessing the variance associated with each definition helps strike an optimal balance between bias (the mean deviation) and variance (the bootstrap variance).
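A minimal sketch of this bootstrap scheme follows, under hypothetical inputs: one row per patient with an age group, a death indicator under a candidate operational definition, and the true death status from linked records. The returned bias is the mean deviation and the variance is the bootstrap variance discussed above.

```python
# A sketch of per-stratum bootstrapping of deviations; the column names
# ("age_group", "op_death", "true_death") are hypothetical.
import numpy as np
import pandas as pd

def bootstrap_deviation(df: pd.DataFrame, n_boot: int = 1000,
                        seed: int = 0) -> pd.DataFrame:
    """Bias (mean deviation) and bootstrap variance of the deviation between
    the operationally defined and true mortality rates, per age group."""
    rng = np.random.default_rng(seed)
    rows = []
    for group, g in df.groupby("age_group"):
        op = g["op_death"].to_numpy(dtype=float)
        truth = g["true_death"].to_numpy(dtype=float)
        devs = []
        for _ in range(n_boot):
            # Resample patients with replacement within this age stratum.
            b = rng.integers(0, len(g), size=len(g))
            devs.append(op[b].mean() - truth[b].mean())
        rows.append({"age_group": group,
                     "bias": float(np.mean(devs)),
                     "variance": float(np.var(devs))})
    return pd.DataFrame(rows)
```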