Internal and external validation of predictive models: A simulation study of bias and precision in small samples

https://doi.org/10.1016/S0895-4356(03)00047-7

Abstract

We performed a simulation study to investigate the accuracy of bootstrap estimates of optimism (internal validation) and the precision of performance estimates in independent validation samples (external validation). We combined two data sets containing children presenting with fever without source (n = 376 + 179 = 555; 120 bacterial infections). Random samples were drawn from this combined data set for the development (n = 376) and validation (n = 179) of logistic regression models. The models included statistically significant predictors for infection selected from a set of 57 candidate predictors. Model development, including the selection of predictors, and validation were repeated in a bootstrapping procedure. The resulting expected optimism estimate in the receiver operating characteristic (ROC) area was compared with the observed optimism according to independent validation samples. The average apparent ROC area was 0.74, which was expected (based on bootstrapping) to decrease by 0.07 to 0.67, whereas the observed decrease in the validation samples was 0.09 to 0.65. Omitting the selection of predictors from the bootstrap procedure led to a severe underestimation of the optimism (decrease 0.006). The standard error of the observed ROC area in the independent validation samples was large (0.05). We recommend bootstrapping for internal validation because it gives reasonably valid estimates of the expected optimism in predictive performance provided that any selection of predictors is taken into account. For external validation, substantial sample sizes should be used for sufficient power to detect clinically important changes in performance as compared with the internally validated estimate.

Introduction

Optimism is a well-known problem of predictive models: Their performance in new patients is often worse than expected based on performance estimated from the development data set (“apparent performance”) [1], [2], [3]. The extent of optimism of pre-specified models can be estimated for similar patient populations using internal validation techniques such as bootstrapping [4], [5]. Predictive models are, however, usually not pre-specified but are constructed in an iterative way. Model specification may include decisions on coding of variables (e.g., [re-]grouping categorized or continuous variables) and decisions on the inclusion of main effect, nonlinear, and interaction terms in the final model. However, when model specification, such as stepwise selection of predictor variables, can be formulated in a systematic way, it may be replayed entirely in every bootstrap sample [6]. Such a procedure should provide an honest estimate of the optimism of the final model [3], [7]. Because empirical evidence for this claim is limited, our first aim was to study the accuracy of the bootstrap estimate of optimism of a prediction model that is developed using variable selection techniques.
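To make the procedure concrete, the following minimal sketch (in Python, using statsmodels and scikit-learn) repeats a simplified predictor-selection step inside every bootstrap sample before averaging the optimism in the ROC area, in the spirit of the approach described above. The univariable p-value screen, the significance threshold, and the data set (X, y) are illustrative assumptions only; the study itself applied univariable and multivariable stepwise logistic regression to 57 candidate predictors.

# Sketch of bootstrap optimism estimation with predictor selection replayed
# in every bootstrap sample (assumed data: X as a numeric array, y as 0/1).
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def select_predictors(X, y, alpha=0.05):
    # Keep candidates with a univariable p-value below alpha
    # (a simplified stand-in for the stepwise selection in the paper).
    keep = []
    for j in range(X.shape[1]):
        try:
            fit = sm.Logit(y, sm.add_constant(X[:, j])).fit(disp=0)
            if fit.pvalues[1] < alpha:
                keep.append(j)
        except Exception:  # e.g., perfect separation in a bootstrap sample
            continue
    return keep

def apparent_auc(X, y):
    # Selection plus model fitting on the same data gives the "apparent" ROC area.
    cols = select_predictors(X, y)
    if not cols:
        return 0.5, cols, None  # no predictor selected: intercept-only model
    model = LogisticRegression(max_iter=1000).fit(X[:, cols], y)
    return roc_auc_score(y, model.predict_proba(X[:, cols])[:, 1]), cols, model

def bootstrap_optimism(X, y, n_boot=1000):
    # Optimism = mean over bootstrap samples of
    # (apparent AUC in the bootstrap sample) - (AUC of that model in the original data).
    auc_orig, _, _ = apparent_auc(X, y)
    optimism = []
    n = len(y)
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)  # draw a bootstrap sample with replacement
        auc_boot, cols, model = apparent_auc(X[idx], y[idx])
        if model is None:
            continue
        auc_test = roc_auc_score(y, model.predict_proba(X[:, cols])[:, 1])
        optimism.append(auc_boot - auc_test)
    return auc_orig, float(np.mean(optimism))

# Example use (hypothetical data):
# auc_apparent, opt = bootstrap_optimism(X, y)
# print(f"apparent AUC {auc_apparent:.3f}, optimism-corrected {auc_apparent - opt:.3f}")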

Of more interest than internal validity is external validity, or generalizability [8]. External validity is typically studied in independent validation samples with patients from a different but “plausibly related” population [9]. In a previous study, a diagnostic model developed to estimate the presence of a serious bacterial infection in children with fever without apparent source showed surprisingly poor external validity in another sample of 179 children [10]. Although a sample size of 179 subjects is not uncommon in diagnostic (validation) research, it raises the question of how large a validation set needs to be. Our second aim was to study the precision of performance estimates in relatively small validation samples and to explore the consequences for the power of validation studies when comparing model performance between development and validation sets.
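To get a feel for the precision problem, one can apply the Hanley-McNeil (1982) approximation to the standard error of the ROC area for a validation sample of this size; the inputs below (an AUC near 0.65 and roughly 39 infections among 179 children) are assumptions based on the figures reported in this study, not exact counts.

# Approximate SE of a validation-sample ROC area (Hanley & McNeil, 1982).
import math

def auc_se_hanley_mcneil(auc, n_pos, n_neg):
    # n_pos = number of events, n_neg = number of non-events.
    q1 = auc / (2 - auc)
    q2 = 2 * auc ** 2 / (1 + auc)
    var = (auc * (1 - auc)
           + (n_pos - 1) * (q1 - auc ** 2)
           + (n_neg - 1) * (q2 - auc ** 2)) / (n_pos * n_neg)
    return math.sqrt(var)

# With ~39 infections and ~140 non-infected children, the SE is about 0.05,
# so a validation AUC of 0.65 carries a 95% CI of roughly 0.55 to 0.75.
print(round(auc_se_hanley_mcneil(0.65, 39, 140), 3))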

Section snippets

Patients

We combined two previously described data sets of children presenting with fever without apparent source: a development set from Rotterdam, The Netherlands, diagnosed between 1988 and 1992 (n = 376), and a validation set from Rotterdam and The Hague, diagnosed between 1997 and 1998 (n = 179) [10]. Of these 555 children, 120 (22%) had a serious bacterial infection, which was defined as the presence of bacterial meningitis, sepsis or bacteremia, pneumonia, urinary tract infection, bacterial

Expected optimism in the full data set

In the full data set of 555 children, four statistically significant predictors were selected after univariable and multivariable stepwise analyses: duration of fever at presentation (days), presence of chest-wall retractions, poor peripheral circulation, and presence of crepitations. The apparent ROC area was 0.727, the R2 was 15.7%, and the calibration slope was unity (Table 1).

According to 1000 bootstrap samples, the expected optimism was 0.056 for the ROC area (0.761−0.706) and 9.8%

Discussion

We found that internally validated estimates of model performance could be obtained accurately with bootstrapping when a stepwise selection strategy was followed in the construction of the predictive model, provided that this strategy was systematically replayed in every bootstrap sample. The expected optimism was close to that observed in independent random validation samples for a number of performance measures, including the ROC area. However, the variability (SEs) of these performance

Acknowledgements

This study was inspired by the comments of an anonymous reviewer regarding the sampling variability of external validation studies. We gratefully acknowledge the contributions of all medical students and clinicians involved in data collection, especially Dr. G. Derksen-Lubsen (Juliana Children's Hospital, The Hague, The Netherlands). This work was supported by a fellowship from the Royal Netherlands Academy of Arts and Sciences (EWS).

References (25)

  • Justice AC, et al. Assessing the generalizability of prognostic information. Ann Intern Med (1999)
  • Bleeker SE, Moll HA, Steyerberg EW, et al. External validation is necessary in prediction research: a clinical example....