Original Article
Development and validation of a prediction model with missing predictor data: a practical approach

https://doi.org/10.1016/j.jclinepi.2009.03.017

Abstract

Objective

To illustrate the sequence of steps needed to develop and validate a clinical prediction model, when missing predictor values have been multiply imputed.

Study Design and Setting

We used data from consecutive primary care patients suspected of deep venous thrombosis (DVT) to develop and validate a diagnostic model for the presence of DVT. Missing values were imputed 10 times with the MICE conditional imputation method. After the selection of predictors and transformations for continuous predictors according to three different methods, we estimated regression coefficients and performance measures.

Results

The three methods to select predictors and transformations of continuous predictors showed similar results. Rubin's rules could easily be applied to estimate regression coefficients and performance measures, once predictors and transformations were selected.

Conclusion

We provide a practical approach for model development and validation with multiply imputed data.

Introduction

Interest in multivariable prediction models for diagnostic and prognostic research has grown over the past decade. Prediction models enable physicians to explicitly convert combinations of multiple predictor values into an estimated absolute risk of disease presence (in case of diagnosis) or of the occurrence of a disease-related event (in case of prognosis). Prediction models are developed with data of patients from a development set, often using multivariable regression analysis. The models are subsequently validated in new, similar patients (validation set) [1], [2].
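To make the conversion from predictor values to an absolute risk concrete, the following minimal sketch applies a logistic regression model to one patient. The function name, the predictor names, and all coefficient values are hypothetical and chosen only for illustration; they are not taken from the DVT model discussed in this article.

```python
import math


def predicted_risk(intercept, coefficients, predictors):
    """Return the absolute risk implied by a logistic regression model.

    coefficients and predictors are dicts keyed by predictor name;
    coefficients are on the log odds scale.
    """
    linear_predictor = intercept + sum(
        coefficients[name] * value for name, value in predictors.items()
    )
    # Inverse logit: maps the linear predictor to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-linear_predictor))


# Hypothetical coefficients and patient values (illustration only).
example_coefficients = {"male_sex": 0.5, "calf_difference_cm": 0.3}
example_patient = {"male_sex": 1, "calf_difference_cm": 3.0}

risk = predicted_risk(-2.0, example_coefficients, example_patient)
print(f"Estimated absolute risk: {risk:.3f}")
```

The intercept sets the baseline risk, and each coefficient shifts the log odds in proportion to the value of its predictor.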

Missing observations are almost universally encountered in clinical data sets, no matter how strictly studies have been designed or how hard investigators try to prevent them. The easiest way to deal with missing values is to exclude all patients with a missing value on any of the considered variables. Such a complete case analysis may sacrifice useful information and may cause biased results [3], [4]. Imputation based on observed patient characteristics (conditional imputation) has been advocated to deal with the missing values [3]. To take the uncertainty of the imputed values into account, missing values should be imputed multiple (m) times, for which several iterative algorithms are available. The resulting m completed data sets are each analyzed separately by standard methods and the m results are combined into one final point estimate and variance, with the standard error equal to the square root of the variance [3], [5].

Combining the m results is straightforward when a single analysis is considered. The m point estimates are averaged, and the m variances are combined with a components-of-variance argument that takes the variability between the m data sets into account (Rubin's rules) [3].
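As a sketch of how Rubin's rules combine a single estimate across imputed data sets, the function below averages the m point estimates and adds the within-imputation and between-imputation variance components. The function name and the example numbers are assumptions made for illustration; the formula is the standard one, in which the total variance equals the mean within-imputation variance plus (1 + 1/m) times the between-imputation variance.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m estimates and their variances with Rubin's rules.

    Returns (pooled estimate, total variance, standard error).
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need at least two estimates, each with a variance")
    pooled_estimate = mean(estimates)
    within_variance = mean(variances)
    # Sample variance of the m estimates (divisor m - 1).
    between_variance = variance(estimates)
    total_variance = within_variance + (1.0 + 1.0 / m) * between_variance
    return pooled_estimate, total_variance, math.sqrt(total_variance)


# Hypothetical regression coefficients from m = 3 completed data sets.
coefficient_estimates = [0.42, 0.47, 0.40]
coefficient_variances = [0.010, 0.012, 0.011]
pooled, total_var, standard_error = pool_rubin(
    coefficient_estimates, coefficient_variances
)
print(f"pooled estimate {pooled:.3f}, standard error {standard_error:.3f}")
```

The standard error is the square root of the total variance, so greater disagreement between the imputed data sets widens the resulting confidence interval.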

The development of a prediction model follows a sequence of steps [6], including selection of predictors, selection of transformations for continuous predictors [7], [8], and estimation of the regression coefficients. Hence, model development with multiply imputed data is not straightforward and is seldom illustrated. Here, we demonstrate the development and validation of a prediction model obtained with logistic regression in the presence of multiply imputed data. Continuous predictors are modeled with transformations if necessary. In a second model, the continuous predictors are dichotomized. Further, three different methods to select predictors and transformations are applied. We also encountered another practical problem that is typical of real-life data: two continuous predictors were recorded partly as dichotomous and partly as continuous. We impute the continuous values by using the observed value for the dichotomous variable and the distribution of the continuous variable where available. Two empirical data sets on the diagnosis of deep venous thrombosis (DVT) are used, with minor to major percentages of missing predictor values: one data set to develop the model [9] and one to validate the model [10].
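The idea of imputing a continuous value from a recorded dichotomous result can be illustrated with a deliberately simplified sketch. It is not the procedure used in this study, which relies on conditional imputation with MICE; it only shows how the dichotomous observation restricts the plausible continuous values. The function name, the threshold, and the example values are hypothetical, and other patient characteristics are ignored here.

```python
import random


def impute_from_dichotomy(is_above, observed_values, threshold, rng):
    """Draw a continuous value consistent with a known dichotomous result.

    The draw is taken from the continuous values observed in other
    patients on the same side of the threshold.
    """
    if is_above:
        candidates = [v for v in observed_values if v >= threshold]
    else:
        candidates = [v for v in observed_values if v < threshold]
    if not candidates:
        raise ValueError("no observed values on the required side of the threshold")
    return rng.choice(candidates)


# Hypothetical continuous measurements and threshold (illustration only).
observed = [120.0, 310.0, 480.0, 650.0, 900.0, 1500.0]
rng = random.Random(2009)

# Repeating the draw m times gives one imputed value per completed data set.
imputed_values = [impute_from_dichotomy(True, observed, 500.0, rng) for _ in range(10)]
print(imputed_values)
```

Drawing repeatedly rather than filling in a single value keeps the uncertainty about the unobserved continuous measurement visible across the completed data sets.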

Empirical data

We used the data of 2,086 consecutive primary care patients suspected of DVT. The data originated from a large cross-sectional diagnostic study that was performed between January 1, 2002 and January 1, 2006 among over 100 primary care physicians in The Netherlands. For specific details and main results of the diagnostic study, we refer to the literature [9], [10]. In brief, suspicion of DVT was based on swelling, redness, or pain of the lower extremities. Information was systematically

Model development in general

When developing a prediction model, various issues and choices need to be addressed. We briefly discuss three common steps in the development of prediction models. First, the number of candidate predictors is commonly too large to include them all in the prediction model. The data to hand can be used to select predictors, for instance with a backward elimination procedure.

Second, the shape of the relation of continuous predictors with the outcome variable can be studied with nonlinear functions

Model development

We first examine only patients with complete data (complete cases) and then the completed data obtained with multiple imputation. We illustrate the model development methods in the presence of multiply imputed data as described in Section 3.2 with the data sets on the diagnosis of DVT. We consider a model that could contain continuous predictors that are modeled with the MFP algorithm and a second model with only dichotomous predictors. Continuous predictors (calf circumference, d-dimer, age, and duration of symptoms)

Discussion

Missing data are commonplace in clinical studies. The main message we wish to bring out here is that good statistical methods are available to enable credible, practical analyses of such data sets. It is often unclear from reports whether a prediction model was developed or validated in the presence of missing data. Authors usually ignore cases with missing observations and perform complete case analyses. More recently, awareness has been growing of the usefulness of multiple imputation

References (41)

  • W. Sauerbrei et al. Building multivariable prognostic and diagnostic models: transformation of the predictors by using fractional polynomials. J R Stat Soc A (1999)
  • R. Oudega et al. Ruling out deep venous thrombosis in primary care. A simple diagnostic algorithm including D-dimer testing. Thromb Haemost (2005)
  • D. Toll et al. A new diagnostic rule for deep vein thrombosis: safety and efficiency in clinically relevant subgroups. Fam Pract (2008)
  • J.L. Schafer. Multiple imputation: a primer. Stat Methods Med Res (1999)
  • S. van Buuren et al. Multiple imputation of missing blood pressure covariates in survival analysis. Stat Med (1999)
  • P. Royston et al. Regression using fractional polynomials of continuous covariates: parsimonious parametric modelling (with discussion). Appl Stat (1994)
  • F.E. Harrell. Regression modeling strategies. With applications to linear models, logistic regression, and survival analysis (2001)
  • D.J. Spiegelhalter. Probabilistic prediction in patient management and clinical trials. Stat Med (1986)
  • J.B. Copas. Regression, prediction and shrinkage. J R Stat Soc B (1983)
  • H.C. van Houwelingen et al. Construction, validation and updating of a prognostic model for kidney graft survival. Stat Med (1995)

This work is supported by the Netherlands Organization for Scientific Research Grant ZON-MW 917.46.360 (Y. Vergouwe and K.G.M. Moons); UK Medical Research Council (P. Royston); and Cancer Research UK (D.G. Altman).
