
Original article
Prediction of Chlamydia trachomatis infection to facilitate selective screening on population and individual level: a cross-sectional study of a population-based screening programme
  1. David van Klaveren (1),
  2. Hannelore M Götz (1,2),
  3. Eline LM Op de Coul (3),
  4. Ewout W Steyerberg (1),
  5. Yvonne Vergouwe (1)

  1. Department of Public Health, Erasmus MC—University Medical Center Rotterdam, Rotterdam, The Netherlands
  2. Department of Infectious Disease Control, Municipal Public Health Service Rotterdam-Rijnmond, Rotterdam, The Netherlands
  3. Epidemiology and Surveillance Unit, Centre for Infectious Disease Control, National Institute for Public Health and the Environment, Bilthoven, The Netherlands

Correspondence to David van Klaveren, Department of Public Health, Erasmus MC, P.O. Box 2040, Rotterdam 3000 CA, The Netherlands; d.vanklaveren.1{at}


Objectives To develop prediction models for Chlamydia trachomatis (Ct) infection with different levels of detail in information, that is, from readily available data in registries and from additional questionnaires.

Methods All inhabitants of Rotterdam and Amsterdam aged 16–29 were invited yearly from 2008 until 2011 for home-based testing. Their registry data included gender, age, ethnicity and neighbourhood-level socioeconomic status (SES). Participants were asked to fill in a questionnaire on education, sexually transmitted infection history, symptoms, partner information and sexual behaviour. We developed prediction models for Ct infection using first-time participant data—including registry variables only and with additional questionnaire variables—by multilevel logistic regression analysis to account for clustering within neighbourhoods. We assessed the discriminative ability by the area under the receiver operating characteristic curve (AUC).

Results Four per cent (3540/80 385) of the participants were infected. The strongest registry predictors for Ct infection were young age (especially for women) and Surinamese, Antillean or sub-Saharan African ethnicity. Neighbourhood-level SES was of minor importance. Strong questionnaire predictors were low to intermediate education level, ethnicity of the partner (non-Dutch) and having sex with casual partners. When using a prediction model including questionnaire risk factors (AUC 0.74, 95% CI 0.736 to 0.752) for selective screening, 48% of the participating population needed to be screened to find 80% (95% CI 78.4% to 81.0%) of Ct infections. The model with registry risk factors only (AUC 0.67, 95% CI 0.656 to 0.675) required 60% to be screened to find 78% (95% CI 76.6% to 79.4%) of Ct infections.

Conclusions A registry-based prediction model can facilitate selective Ct screening at population level, with further refinement at the individual level by including questionnaire risk factors.




Introduction

Chlamydia trachomatis (Ct) infection is the most common bacterial sexually transmitted infection (STI) in Europe and other Western countries, especially in young people.1 Repeated infections occur due to no or limited development of immunity, untreated sexual partners or new sexual exposure after treatment. This mostly asymptomatic infection is a public health threat, in some cases leading to serious adverse events, such as pelvic inflammatory disease, tubal pathology and infertility,2 and premature labour.3 To detect asymptomatic infections for the prevention of potential adverse events, screening is the intervention of choice, although good evidence to support the cost-effectiveness of screening is still lacking.4 In the USA, opportunistic screening is advised for women under the age of 25; in the UK, the National Chlamydia Screening Programme advises screening for men and women under 25. Screening trials have been performed or are ongoing in other countries.5–7 Selective systematic screening, that is, screening of subjects identified to be at high risk, may be favourable for cost-effectiveness and results in fewer individuals undergoing an unnecessary test.7,8 To prevent unacceptably high proportions of missed infections, it is crucial that the prediction model used to identify these subjects performs well.

In 2005, a prediction rule was developed as a tool for selective screening of Ct.9 In 2008, this model was used in a large population-based screening programme to select participants at high risk in a less urban region.7,10–12 In urban areas, the programme invited all sexually active individuals in the target group (men and women aged 16–29 years) to participate without further selection because of the high Ct prevalence that was previously found in highly urban areas (4.2%).9

In the current study, we validate the Ct prediction model with the data from the screening programme as a benchmark for predictive performance in urban areas.12,13 Furthermore, we aim to develop improved Ct prediction models with different levels of detail in information, that is, with readily available registry data only and with additional detailed questionnaires.


Methods

Study population

The data of this study were collected in the Chlamydia Screening Implementation (CSI) programme.12 In summary, all inhabitants aged 16–29 of Rotterdam, Amsterdam, and selected municipalities of South Limburg were invited yearly from 2008 until 2011 for home-based testing (men: urine sample; women: vaginal swab or urine sample). In the first screening round, 261 025 individuals were invited of whom 41 638 effectively participated. We selected all first-time participants from Rotterdam and Amsterdam, resulting in 80 385 unique participants of whom 3540 were infected with Ct. All participants were asked to fill in a questionnaire on education, STI history, symptoms, partner information and sexual behaviour (variable definitions in online supplementary table S1). Information on gender, age, country of birth of both the participants and their parents and residential postcode was gathered from communal registries. Ethnicity was based on the country of birth of the participant and the participant's parents, consistent with ethnicity definitions used in Dutch STI clinics. In case of a regular partner, we defined the variable ‘ethnicity mixing’ by all four combinations of Dutch and non-Dutch ethnicity of the participant and the regular partner. Neighbourhood-level socioeconomic status (SES) scores were based on 4-digit postcode as provided by the Netherlands Institute for Social Research. We used the SES score of 2010.

Multiple imputation of missing values

We used multiple imputation by chained equations to account for missing values.14 General questionnaire information (education, partner information and sexual behaviour) was available in approximately 57% of the participants and more specific questionnaire information on STI history and symptoms in approximately 20% (table 1). We compared results without and with imputation of missing values. We used the R package mice for multiple imputation.15

Table 1

Univariable associations between Chlamydia trachomatis infection and risk factors for the Chlamydia Screening Implementation

Development of improved chlamydia prediction models

We used logistic regression analysis for the development of three prediction models: registry risk factors gender, age and ethnicity only (model 1); model 1 with neighbourhood-level SES (model 2); and model 1 with additional questionnaire variables (model 3; see online supplementary table S1). Regular partnership status, ethnicity mixing with the regular partner and condom use with either regular or casual partners were newly studied variables in comparison with the pilot study.9 In the pilot study, lifetime sexual partners were included, whereas in the current study only partners of the last 6 months were assessed. An interaction of gender with age was included in the analysis based on the a priori hypothesis that the age effect on Ct prevalence may be different for males and females. We modelled possible non-linearity of the age effect with restricted cubic splines.16 We evaluated the contribution of each predictive factor by its multivariable OR together with its likelihood ratio χ2 test statistic minus twice the degrees of freedom (number of regression coefficients used to model a predictive factor), which balances the goodness of fit of a model with its complexity and gives a fair assessment of a factor's predictiveness.16 We applied a backward selection approach to delete variables without predictive contribution, that is, variables for which the χ2 test statistic minus twice the degrees of freedom was negative.
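
The deletion rule described above can be sketched in a few lines. This is a minimal illustration only: the function names and the χ2 values are hypothetical and not taken from the study, and a real backward selection would refit the model after each deletion.

```python
def predictive_contribution(lr_chi2, df):
    """Likelihood ratio chi-square penalised by twice the degrees of freedom."""
    return lr_chi2 - 2 * df


def factors_to_drop(candidates):
    """Return the names of factors whose penalised chi-square is negative.

    `candidates` maps a factor name to a (chi-square, degrees of freedom) pair.
    """
    return [
        name
        for name, (lr_chi2, df) in candidates.items()
        if predictive_contribution(lr_chi2, df) < 0
    ]


# Hypothetical values for illustration, not results from the study.
example = {"factor_a": (25.3, 4), "factor_b": (1.2, 1)}
print(factors_to_drop(example))  # ['factor_b']
```
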
Since there may be differences in Ct prevalence across neighbourhoods that cannot be fully explained by individual and neighbourhood-level predictors, we added an extra neighbourhood level to the logistic regression models.17 For quantification of a model's unexplained heterogeneity in Ct prevalence across neighbourhoods, we used the median OR (MOR), that is, the median of the OR of the neighbourhood at higher risk versus the neighbourhood at lower risk, when randomly picking out two neighbourhoods.18 We assessed the discriminative ability of each model by the area under the curve (AUC) of the receiver operating characteristic (ROC) curve. Although we developed prediction models with a high number of events per variable, we used a bootstrap procedure to correct the AUC for optimism, that is, for a too favourable impression of model performance in new settings.19 For easy calculation of an individual's risk score and the corresponding probability of having a Ct infection, we present the prediction models in score charts.16,20 For multilevel regression analysis and construction of prediction models, we used R packages lme4 and rms, respectively.15
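
For a random-intercept logistic model with normally distributed neighbourhood effects, the MOR is commonly computed from the neighbourhood variance as exp(sqrt(2 × variance) × z), where z is the 75th percentile of the standard normal distribution. The sketch below assumes that standard formula; the function names are illustrative and the analysis itself was done in R.

```python
import math
from statistics import NormalDist

# 75th percentile of the standard normal distribution (about 0.6745).
Z75 = NormalDist().inv_cdf(0.75)


def median_odds_ratio(neighbourhood_variance):
    """MOR for a normally distributed random intercept with the given variance."""
    return math.exp(math.sqrt(2.0 * neighbourhood_variance) * Z75)


def variance_from_mor(mor):
    """Invert the MOR formula to recover the random-intercept variance."""
    return (math.log(mor) / Z75) ** 2 / 2.0


# A MOR of 1.27 corresponds to a small variance on the log-odds scale.
print(round(variance_from_mor(1.27), 3))  # 0.063
```

A MOR of 1 means no unexplained variation between neighbourhoods, so values close to 1 indicate that little heterogeneity is left after accounting for the predictors.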

Potential of new chlamydia prediction models

The practical potential of the prediction models with and without questionnaire information can be assessed by their usefulness for selective screening strategies. A selective screening strategy is defined by a chosen risk score threshold: only the individuals with a risk score equal to or above the threshold will be screened for Ct infection. To estimate the impact of using the prediction models for selective screening, we present for all possible risk score thresholds: the proportion of the population that would be screened (screening eligibility); the proportion of the Ct-positive population that would be screened (sensitivity); the proportion of the Ct-negative population that would not be screened (specificity); and the proportion of the screened population that would be Ct positive (positive predictive value).
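
The four quantities defined above follow directly from counting individuals on either side of the threshold. The following sketch uses made-up toy data and a hypothetical function name; it is meant only to make the definitions concrete.

```python
def screening_metrics(scores, infected, threshold):
    """Screening eligibility, sensitivity, specificity and PPV at one threshold.

    An individual is screened when the risk score is equal to or above the threshold.
    """
    tp = fp = fn = tn = 0
    for score, is_infected in zip(scores, infected):
        screened = score >= threshold
        if screened and is_infected:
            tp += 1  # infected and screened
        elif screened:
            fp += 1  # not infected but screened
        elif is_infected:
            fn += 1  # infected but not screened (missed)
        else:
            tn += 1  # not infected and not screened

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else float("nan")

    return {
        "eligibility": ratio(tp + fp, tp + fp + fn + tn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
    }


# Toy data for illustration only, not study data.
toy_scores = [12, 9, 15, 7, 10, 11, 8, 14]
toy_infected = [True, True, True, False, False, True, False, False]
print(screening_metrics(toy_scores, toy_infected, threshold=10))
```

Repeating the calculation for every possible threshold gives one row per threshold, which is the layout of the impact table described in the text.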


Results

Chlamydia prevalence

Overall prevalence in the Rotterdam and Amsterdam population was 4.4% (95% CI 4.26% to 4.55%) compared with a prevalence of 4.2% in highly urban regions of the pilot study.
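
The reported prevalence and interval can be reproduced from the counts given in the abstract (3540 infections among 80 385 participants). The sketch below assumes a normal-approximation (Wald) interval; the article does not state which interval method was used, so this is an assumption that happens to match the reported figures.

```python
import math


def wald_interval(events, n, z=1.96):
    """Proportion with a normal-approximation 95% confidence interval."""
    p = events / n
    se = math.sqrt(p * (1.0 - p) / n)
    return p, p - z * se, p + z * se


p, lower, upper = wald_interval(3540, 80385)
print(f"{100 * p:.1f}% ({100 * lower:.2f}% to {100 * upper:.2f}%)")
# prints: 4.4% (4.26% to 4.55%)
```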

Development of improved chlamydia prediction models

Strong registry-based predictors for Ct infection were young age, especially for women, and either Surinamese, Antillean or sub-Saharan African ethnicity (table 2). The non-linear interaction between age and gender indicated that the risk decreased similarly for men and women after the age of 24 (see online supplementary figure S1). Below the age of 24, the risk was stable for men but increased further with younger age for women. Neighbourhood-level SES was of minor importance. From the individual questionnaires, low to intermediate education level, ethnicity mixing with the regular partner (non-Dutch) and having sex with casual partners showed strong associations with Ct infection. Note that the additional risk for a non-Dutch participant with a non-Dutch regular partner was lower than for a Dutch participant with a non-Dutch regular partner since the non-Dutch ethnicity of the participant was already an important risk factor. The MOR of the random neighbourhood effect decreased from 1.27 to 1.22 when neighbourhood-level SES was added and to 1.15 when questionnaire information was added to the model (table 2). The AUC at internal validation was 0.67 (95% CI 0.656 to 0.675) based on registry risk factors only (model 1), stayed at 0.67 (95% CI 0.657 to 0.677) when neighbourhood-level SES was added (model 2) and increased substantially to 0.74 (95% CI 0.736 to 0.752) when questionnaire risk factors were added (model 3). Model 3 also performed substantially better than the Ct prediction model developed in the pilot study in 2005 (see online supplementary figure S2). Results were fairly similar when only complete cases were analysed (see online supplementary table S2).

Table 2

Multivariable associations between Chlamydia trachomatis and risk factors for three levels of information: registry data; registry + neighbourhood-level socioeconomic status (SES) data; and registry + individual questionnaire data (n=80 385 participants)

We presented the newly developed prediction models 1 and 3 by score charts (table 3). To calculate an individual's probability of having a Ct infection, first determine her or his risk factor values (eg, female, 17 years of age; Surinamese ethnicity), second, look up risk points for each risk factor (6 and 4 points in table 3, respectively), and, third, add the points up—including an optional constant (7 points for the registry information-based model)—to obtain a risk sum score (17 points). The probability of having a Ct infection is subsequently read from the two right-hand columns of table 3 (14%; see online supplementary table S3). The score charts were visualised by nomograms in online supplementary figure S3.
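
The worked example above can be sketched as follows. The point values (6, 4 and the constant 7) are the ones quoted in the text. The conversion from sum score to probability is only an approximation for illustration: it assumes the score is linear on the log-odds scale and is anchored on the two (score, probability) pairs quoted in this article (10 points, 2.8%; 17 points, 14%). The published score chart in table 3 remains the authoritative lookup, and all function names here are hypothetical.

```python
import math


def risk_sum_score(points, constant=0):
    """Add the risk points of all risk factors plus the model constant."""
    return sum(points) + constant


def logit(p):
    return math.log(p / (1.0 - p))


# Two (sum score, probability) pairs quoted in the text. Values in between
# are interpolated under an assumed linear relation on the log-odds scale.
ANCHORS = ((10, 0.028), (17, 0.14))


def approx_probability(score):
    """Approximate Ct probability for a sum score (illustrative interpolation)."""
    (s0, p0), (s1, p1) = ANCHORS
    slope = (logit(p1) - logit(p0)) / (s1 - s0)
    log_odds = logit(p0) + slope * (score - s0)
    return 1.0 / (1.0 + math.exp(-log_odds))


# Female, 17 years of age (6 points), Surinamese ethnicity (4 points), constant 7.
score = risk_sum_score([6, 4], constant=7)
print(score)  # 17
print(round(approx_probability(score), 2))  # 0.14
```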

Table 3

Rounded score charts based on registry data and registry plus questionnaire data

Potential of new chlamydia prediction models

The estimated impact of using the prediction models for selective screening is reported in table 4 for all possible risk sum score thresholds. With a sum score threshold of 10 (predicted Ct probability of 2.8%), the registry-based prediction model leads to 87% (95% CI 85.4% to 87.6%) sensitivity with 73% of the population eligible for screening (positive predictive value 5.2%; specificity 28%), while the prediction model including questionnaire information reaches 80% (95% CI 78.4% to 81.0%) sensitivity with 48% of the population eligible for screening (positive predictive value 7.3%; specificity 53%). The difference in predictive performance of the two prediction models was visualised with decision curves (see online supplementary figure S4), with plots of screening eligibility by sensitivity (see online supplementary figure S5), and with ROC curves (see online supplementary figure S6).
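
The figures quoted for the threshold of 10 are tied together by simple arithmetic: the positive predictive value and the specificity follow from the prevalence, the sensitivity and the screening eligibility. The sketch below is a consistency check under that identity, using the rounded percentages from the text as inputs, so the results are approximate. The function name is illustrative.

```python
def ppv_and_specificity(prevalence, sensitivity, eligibility):
    """Derive PPV and specificity from prevalence, sensitivity and eligibility."""
    true_positive_share = sensitivity * prevalence        # screened and infected
    false_positive_share = eligibility - true_positive_share  # screened, not infected
    ppv = true_positive_share / eligibility
    specificity = 1.0 - false_positive_share / (1.0 - prevalence)
    return ppv, specificity


prevalence = 3540 / 80385  # overall prevalence among participants

# Registry-based model: 87% sensitivity, 73% eligible.
print(ppv_and_specificity(prevalence, 0.87, 0.73))
# Questionnaire-based model: 80% sensitivity, 48% eligible.
print(ppv_and_specificity(prevalence, 0.80, 0.48))
```

Rounded to the precision used in the text, the two calls give about 5.2% and 28% for the registry-based model and about 7.3% and 53% for the questionnaire-based model.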

Table 4

Implications of applying the prediction models


Discussion

Main findings

We developed easily applicable chlamydia prediction models with data from a large chlamydia screening project. The prediction model based on readily available registry data may serve as a simple tool for selective screening at the population level. With detailed questionnaire information, the predictive performance increased substantially and was better than the performance of a previously proposed Ct prediction model. Because less than half of the participating population needs to be screened to find 80% of the infections, the detailed prediction model allows for better screening decisions at the individual level.

Risk factors identified in relation to other studies

Most of the predictors for Ct infection found previously were also associated with Ct infection in our study population: young age; Surinamese or Antillean ethnicity; low or intermediate education; urogenital symptoms, especially for men; multiple sexual partners; new partner in previous 2 months; and no condom use at last sexual contact. With respect to these risk factors, we found additionally that the age effect was stronger for women and that sub-Saharan Africans have a similar risk to Surinamese and Antillean individuals. The number of sexual partners in the last 6 months was included in our model instead of the previously modelled lifetime partners since it is easier to assess in practice, is less dependent on age and is less prone to recall bias. We could not include address density in our model as our study population was almost entirely very highly urban (AAD 1). Application of our prediction model in less urban areas probably requires recalibration, which could be based on the lower risk for lower address density regions in the previously reported prediction model.9 Furthermore, we were able to add some additional predictors: sexual contact with casual partners in the last 6 months, non-Dutch ethnicity of the regular partner, especially for Dutch participants, and history of self-reported STI. The latter two are in line with previous findings.21–23 Finally, neighbourhood-level SES was of minor importance, especially when questionnaire predictors were added. Apparently, the registry and questionnaire data at the individual level were more predictive than the SES at neighbourhood level.

Strengths and limitations

Main strengths of this study are the large study population of 80 385 screening participants (3540 positive) living in two large urban areas; the availability of high-quality outcome data and objective registry data (gender, age, ethnicity derived from country of birth of participants or their parents, neighbourhood-level SES); extensive questionnaire information on education, STI history, symptoms, partner information and sexual behaviour; and advanced analysis allowing for targeting of screening with only registry data or the combination of registry and questionnaire data.

The prediction models were developed in a specific population sample, that is, first-time responders to a Ct test invitation of all individuals aged 16–29 living in the Dutch cities of Rotterdam and Amsterdam. However, we expect the models to be useful for selecting high-risk individuals in other populations as well since most of the predictors are universal. Still, recalibration of the prediction models may be necessary to match the overall prevalence of a particular population, for example, individuals in other cities or countries, or individuals who are specifically seeking care. Similarly, recalibration to match the Ct risk for repeatedly participating individuals may be required, although we included self-reported STI history as a risk factor in the questionnaire-based model. Furthermore, the absence of large Surinamese and Antillean communities may attenuate the discriminative ability of the models in other populations. Local data would be required to extend our models with risk levels of specific ethnic groups.

The self-assessed questionnaire variables may be considered a limitation of this study. However, history of STI and STI complaints are commonly used risk factors for Ct infection.24 Recent partner change and number of partners in the last 6 months as well as condom use at last sexual contact may be expected to be remembered well.

A high number of missing values in questionnaires, possibly more often for subjects with low education, is a severe limitation of this study. However, with the substantial amount of available data we were able to develop prediction models based on multiple imputation of the missing values. Fairly similar but less reliable results were noted in a complete case analysis. Furthermore, the prediction model's coefficients hardly changed when we forced an extra one-third of the imputations to an intermediate or low education.

Ethnicity mixing with the partner was only available in case of regular partners. Although the effect of ethnicity mixing could only be quantified for regular partners, we suspect this effect applies to casual partners as well. This should, however, be validated in future studies.

Ethnicity of the participant and of the regular partner were confirmed to be important predictors for Ct. A cross-sectional study among adolescents in the USA also found that race–ethnicity (either of tested individual or of partner) affected algorithm performance.25 Presently, triaging systems in Dutch Public Health STI clinics include assessing ethnicity, and this is well accepted. However, ethnicity may be hard to obtain in future practice when both the participants and their parents were born in the Netherlands (third generation) or when ethnicity is considered to be sensitive information. Although Ct prediction models without ethnicity show lower predictive performance (the AUC decreases from 0.665 to 0.610 when using registry data only and from 0.744 to 0.721 when using questionnaire data), they could still be useful, especially when using detailed questionnaire information.

Application of prediction models

We presented the Ct prediction models by risk score charts, which can be implemented either in paper forms or in internet-based apps. The score charts may be useful to select high-risk individuals as part of systematic screening programmes in urban areas, similar to the selection of individuals at high risk in less urban regions.10 Furthermore, the score charts may be used to guide STI clinicians and GPs on whom to offer opportunistic Ct testing, more selectively than using age group alone. Although the prediction models are developed from general population data, we anticipate the risk factors to hold for those at higher risk. The score chart based on questionnaire data in particular would allow for better identification of individuals at high risk for Ct infection and can be used for (internet) triaging systems. Scoring questionnaires may encourage test uptake by increasing risk awareness in those who may be reluctant to be tested.26

One may argue that selective screening is less effective or even unethical since it will miss Ct infections that would have been detected in a screen-all strategy. Although prediction models are imperfect, using them for selective screening may still be very helpful since the harm of missing infections needs to be balanced with the burden—including costs—of unnecessary diagnostic testing. We illustrated the helpfulness of our prediction model by choosing a particular risk threshold that implies a huge benefit (and cost-savings) of screening only half instead of the full population against the burden of missing 20% of the Ct infections. The choice of the appropriate risk threshold for a selective screening strategy—balancing the benefits and the harms—is up to decision-makers.

Implications for future research

We recommend further validation studies of our chlamydia prediction models—with recalibration and updating of predictors or predictor effects when necessary—in different countries and in different selective screening settings, both for first-time participants and for repeatedly tested participants. Moreover, we encourage studies that focus on the impact on clinical practice, ideally by trials incorporating chlamydia prediction models in selective screening settings.27 Other types of studies can also be used to analyse the impact of targeted screening versus untargeted screening, as was done for HIV screening in an emergency department based on a validated HIV risk score.28 Finally, the hypothesis that scoring questionnaires encourages test uptake deserves further analysis.


Conclusions

A registry-based prediction model can facilitate selective Ct screening at population level, with further refinement at the individual level by including questionnaire risk factors.

Key messages

  • We developed easily applicable Chlamydia trachomatis prediction models with data from a large chlamydia screening project in urban areas (>80 000 participants).

  • The newly developed Ct prediction models performed substantially better than a previously proposed Ct prediction tool.

  • A registry-based prediction model can facilitate selective Ct screening at population level, with further refinement at the individual level by including questionnaire risk factors.


The authors express their gratitude to all of the principal investigators (IVF van den Broek, JEAM van Bergen, JSA Fennema, HM Götz, CJPA Hoebe, ELM Op de Coul) of the Chlamydia Screening Implementation for providing the data.



Supplementary materials

  • Supplementary Data

    This web only file has been produced by the BMJ Publishing Group from an electronic file supplied by the author(s) and has not been edited for content.


  • DVK and HMG are equal contributors

  • Handling editor Jackie A Cassell

  • Contributors DvK and HMG contributed equally to this research paper. DvK, HMG, EWS and YV designed the study. EOdC participated in the collection of data and organisation of the databases from which this manuscript was developed. DvK, HMG and YV analysed the data and wrote the first draft of the paper. All authors contributed to writing the paper and approved the final version.

  • Funding DvK and YV were supported by the Netherlands Organisation for Scientific Research (grant 917.11.383).

  • Competing interests None declared.

  • Ethics approval The CSI programme was approved by a Medical Ethics Committee of the VUmc in Amsterdam (METc number: 2007/239).

  • Provenance and peer review Not commissioned; externally peer reviewed.
