Capture-recapture methods and respondent-driven sampling: their potential and limitations
Yakir Berchenko, Simon D W Frost
Department of Veterinary Medicine, University of Cambridge, Cambridge, UK
Correspondence to Dr Simon D W Frost, Department of Veterinary Medicine, University of Cambridge, Madingley Road, Cambridge CB3 0ES, UK; sdf22{at}cam.ac.uk

Measuring the size of populations most at risk of HIV, such as men who have sex with men (MSM), female sex workers (FSWs) and injection drug users, presents significant challenges, as these populations are often hidden or hard to reach.1 2 Among the most popular methods for estimating population size are those based on capture-recapture, in which multiple samples (or data sources) are combined and the population size is estimated from the principle that individuals are more likely to be sampled multiple times from small populations than from large populations. This principle, originating as far back as the 17th century,3 was embraced later by wildlife researchers4 and epidemiologists5 and is advocated in the WHO/UNAIDS ‘Guideline on Estimating the Size of Populations Most at Risk to HIV’.6

Respondent-driven sampling (RDS)7 8 is increasingly being used to estimate HIV prevalence within hidden populations, by utilising social networks of the target population to facilitate sampling.9 Although RDS does not directly estimate population size, by weighting individuals according to their number of acquaintances, sampling bias can be controlled for to produce estimates of quantities such as disease prevalence that are similar to simple random (uniform) sampling (SRS), albeit with larger variance.7 8 10

In this issue, Paz-Bailey et al11 describe the combined use of RDS and capture-recapture in order to estimate the number of FSWs and MSM in El Salvador. Relatively few studies have combined RDS with capture-recapture (cf,12 13), and the study by Paz-Bailey et al is unusual in the use of an active ‘capture’ stage, where key chains were distributed to the members of the target population, followed by RDS to ‘recapture’ the key chains. The underlying model they used to estimate population size considers two samples. Let $N$ denote the (unknown) size of the population to be estimated. If, after sampling the population twice, there are $a_{1,1}$ individuals who appeared in both samples, $a_{1,0}$ individuals who appeared only in the first sample and $a_{0,1}$ individuals who appeared only in the second, an estimate of the total population can be obtained as follows.

$$\hat{N}_{\text{naive}} = \frac{(a_{1,0} + a_{1,1})\,(a_{0,1} + a_{1,1})}{a_{1,1}} \qquad (1)$$

For a fixed sample size in both the capture and recapture phases, the above (rounded down to the nearest whole number) is the maximum likelihood estimate of N.14 This estimate is naive in that the model makes the following assumptions. First, there is assumed to be no change to the population between samples; this may not be too unrealistic in the study of Paz-Bailey et al, as the capture and recapture stages were performed close together in time. Second, the samples are assumed to be independent; this would not be the case, for example, if participation in the capture stage affected participation in the recapture stage. The use of different recruitment approaches may help to alleviate this problem. Lastly, each individual is assumed to have the same chance of being sampled; however, recruitment through RDS is assumed to be biased towards individuals with more acquaintances, and this may well be true of the capture stage in Paz-Bailey et al as well.
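As a concrete illustration of equation (1), the two-sample estimator (often called the Lincoln-Petersen estimator) can be sketched in a few lines of Python. The function name and the counts below are hypothetical and chosen only for illustration; they are not taken from Paz-Bailey et al.

```python
def naive_population_estimate(a10: int, a01: int, a11: int) -> int:
    """Two-sample estimate of population size, following equation (1).

    a10: individuals seen only in the first (capture) sample
    a01: individuals seen only in the second (recapture) sample
    a11: individuals seen in both samples
    """
    if a11 <= 0:
        raise ValueError("at least one recaptured individual is needed")
    n_first = a10 + a11   # size of the capture sample
    n_second = a01 + a11  # size of the recapture sample
    # Floor division rounds down to the nearest whole number.
    return (n_first * n_second) // a11


# Hypothetical counts: 100 captured, 95 recaptured, 20 seen in both.
print(naive_population_estimate(a10=80, a01=75, a11=20))  # 475
```

The fewer individuals that appear in both samples, the larger the estimate, which is the intuition that repeat sightings are rarer in large populations.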

The WHO/UNAIDS guidelines6 strongly advocate against breaking the assumptions of this simple model. However, for quantities such as degree that can adopt a wide range of values, more biologically realistic models have been described in the statistical ecology literature, which consider the probability of sampling.15 In RDS, sampling is assumed to be proportional to degree, and so sampling bias can be controlled for by weighting the number of recaptured individuals by the network sizes of the marked and unmarked individuals.11 13 We can rewrite the naive estimator to allow unequal sampling probabilities, as follows.

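$$\hat{N}_{\text{adj}} = \frac{\left(\sum_k I_k^{1,0} + \sum_k I_k^{1,1}\right)\left(\sum_k \dfrac{I_k^{1,1}}{\pi_k} + \sum_k \dfrac{I_k^{0,1}}{\pi_k}\right)}{\sum_k \dfrac{I_k^{1,1}}{\pi_k}} \qquad (2)$$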

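Here $I_k^{i,j}$ is the indicator of whether individual $k$ was sampled, with the superscripts $i$ and $j$ marking presence (1) or absence (0) in the first and second samples as in equation (1), and $\pi_k$ is the sampling probability of individual $k$. Paz-Bailey et al noted a large difference between $\hat{N}_{\text{naive}}$ and $\hat{N}_{\text{adj}}$ for the population of MSM, but a small difference between $\hat{N}_{\text{naive}}$ and $\hat{N}_{\text{adj}}$ for the population of FSWs, and hypothesised that this discrepancy is due to differences in the ‘capturing’ stage in the two populations: MSM are sampled according to their degrees, while FSWs are sampled uniformly at random.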
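The adjusted estimator in equation (2) can be sketched as follows. This is a minimal illustration, not the implementation used by Paz-Bailey et al; the function name and the example data are hypothetical. Because the weights appear in both numerator and denominator, any quantity proportional to the sampling probability (for example, reported degree) can stand in for it.

```python
from typing import Sequence


def adjusted_population_estimate(
    n_capture: int,
    recapture_pi: Sequence[float],
    recapture_marked: Sequence[bool],
) -> float:
    """Weighted two-sample estimate, following equation (2).

    n_capture:        number of individuals in the capture sample
    recapture_pi:     sampling probability (or a proportional quantity,
                      such as degree) of each recaptured-stage individual
    recapture_marked: True if that individual was also in the capture sample
    """
    weights = [1.0 / p for p in recapture_pi]
    marked_weight = sum(w for w, m in zip(weights, recapture_marked) if m)
    if marked_weight == 0:
        raise ValueError("at least one marked individual is needed")
    return n_capture * sum(weights) / marked_weight


# Hypothetical recapture sample: marked individuals have high degree.
degrees = [10, 10, 2, 2]
marked = [True, True, False, False]
print(adjusted_population_estimate(50, degrees, marked))
```

With these made-up numbers the naive estimate would be 50 × 4 / 2 = 100, whereas the weighted estimate is about 300, because the marked individuals are high-degree and therefore over-represented. When all sampling probabilities are equal, the weights cancel and the result matches the unrounded form of equation (1).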

To investigate this further, we simulated a capture-recapture design in which the probability of sampling a random individual, $k$, is correlated between the capture and recapture stages, because the sampling probability is proportional to his/her degree, $d_k$. We calculated the naive and adjusted estimators for a population with a power-law degree distribution, over a range of values of $\lambda$, which is a measure of the heterogeneity in degree between individuals. For each $\lambda$, we sampled the population twice: in the capture phase, individuals were sampled in relation to their degree, and in the recapture phase, individuals were sampled by an RDS-like process; $\hat{N}_{\text{naive}}$ and $\hat{N}_{\text{adj}}$ were then calculated (figure 1). For large values of $\lambda$, corresponding to homogeneous degrees, $\hat{N}_{\text{naive}}$ and $\hat{N}_{\text{adj}}$ were similar and close to the true population size. However, as $\lambda$ decreases, the naive estimate decreases and moves further from the true population size, while $\hat{N}_{\text{adj}}$ remains relatively constant across the range of $\lambda$. If individuals have equal capture probabilities (but recapture is by RDS), then $\hat{N}_{\text{naive}}$ and $\hat{N}_{\text{adj}}$ are similar, irrespective of $\lambda$ (figure 1, inset). Hence, similar naive and adjusted estimates can arise when sampling during the capture stage is uniform, as suggested by Paz-Bailey et al for FSWs, and when the degree distribution is relatively uniform.
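The qualitative behaviour described above can be explored with a simplified simulation. The sketch below is not the simulation used for figure 1: it replaces the RDS-like recapture with direct degree-proportional sampling without replacement, and it assumes a truncated power law of the form P(d) ∝ d^(−λ) with a hypothetical maximum degree of 100. All function names are illustrative.

```python
import random


def power_law_degrees(n, lam, rng, d_min=3, d_max=100):
    """Draw n degrees with P(d) proportional to d**(-lam) (assumed form)."""
    support = range(d_min, d_max + 1)
    return rng.choices(support, weights=[d ** -lam for d in support], k=n)


def sample_by_weight(weights, k, rng):
    """Draw k distinct indices, each successive draw proportional to weight.

    Uses exponential 'clocks': the k smallest Exp(weight) values win.
    """
    keys = sorted((rng.expovariate(w), i) for i, w in enumerate(weights))
    return {i for _, i in keys[:k]}


def simulate(lam, n=1000, s1=100, s2=100, reps=200,
             degree_biased_capture=True, seed=0):
    """Return the mean naive and mean adjusted estimates over reps runs."""
    rng = random.Random(seed)
    naive, adjusted = [], []
    for _ in range(reps):
        deg = power_law_degrees(n, lam, rng)
        capture_w = deg if degree_biased_capture else [1.0] * n
        first = sample_by_weight(capture_w, s1, rng)
        second = sample_by_weight(deg, s2, rng)  # stand-in for RDS
        both = first & second
        if not both:
            continue  # estimators are undefined without recaptures
        naive.append(s1 * s2 / len(both))
        w_all = sum(1.0 / deg[i] for i in second)
        w_marked = sum(1.0 / deg[i] for i in both)
        adjusted.append(s1 * w_all / w_marked)
    return sum(naive) / len(naive), sum(adjusted) / len(adjusted)


for lam in (2.0, 3.0, 5.0):
    biased = simulate(lam, degree_biased_capture=True)
    uniform = simulate(lam, degree_biased_capture=False)
    print(f"lambda={lam}: degree-biased capture naive/adj = "
          f"{biased[0]:.0f}/{biased[1]:.0f}, "
          f"uniform capture naive/adj = {uniform[0]:.0f}/{uniform[1]:.0f}")
```

Under these assumptions, the naive estimate should fall well below the true size of 1000 when capture is degree-biased and λ is small, while the weighted estimate stays much closer to it. Exact values depend on the random seed and on the simplifications noted above.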

Figure 1

Mean and 95% intervals of naive and respondent-driven sampling (RDS)-adjusted estimates of population size across a range of networks with different levels of degree heterogeneity given by the parameter λ, with high values corresponding to networks with relatively uniform degree, assuming sampling either proportional to degree (main figure) or uniformly (inset) during the capture stage. We assumed a total population size N=1000 and equal numbers of individuals in the capture and recapture stages, S1=S2=100. Random networks were generated according to a power-law degree distribution with parameter λ and a minimum degree of 3. The recapture phase was simulated by an RDS-like process, in which a single ‘seed’ individual was sampled, and one of its network neighbours sampled; this was followed by random sampling of the combined neighbours of these two nodes and so on, until the desired sample size was obtained. To avoid recruitment rates being strongly affected by degree, we limited the number of recruits per individual to 3. For each value of λ, 20 networks were constructed and on each one, the process was run 50 times (ie, 1000 repetitions per λ).
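The RDS-like recruitment process described in the caption can be sketched as below. This is one possible reading of that description, not the authors' code. The network construction (a configuration-model style pairing of degree 'stubs'), the exponent 2.5, the maximum degree of 100 and all function names are assumptions made for illustration.

```python
import random


def configuration_graph(degrees, rng):
    """Build a random simple graph by pairing degree 'stubs' (assumed method).

    Self-loops and repeated edges are dropped, so realised degrees can be
    slightly lower than requested.
    """
    stubs = [node for node, d in enumerate(degrees) for _ in range(d)]
    if len(stubs) % 2:
        stubs.pop()  # need an even number of stubs to pair them up
    rng.shuffle(stubs)
    adj = {node: set() for node in range(len(degrees))}
    for u, v in zip(stubs[::2], stubs[1::2]):
        if u != v:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def rds_like_sample(adj, size, rng, max_recruits=3):
    """Grow a sample from one seed by recruiting neighbours of sampled nodes."""
    seed = rng.choice(sorted(adj))
    sampled, in_sample = [seed], {seed}
    recruits_left = {seed: max_recruits}
    while len(sampled) < size:
        # Unsampled neighbours of sampled nodes that can still recruit.
        candidates = sorted({v for u in sampled if recruits_left[u] > 0
                             for v in adj[u] if v not in in_sample})
        if not candidates:
            break  # the recruitment chain died out early
        new = rng.choice(candidates)
        recruiters = sorted(u for u in adj[new]
                            if u in in_sample and recruits_left[u] > 0)
        recruits_left[rng.choice(recruiters)] -= 1
        recruits_left[new] = max_recruits
        sampled.append(new)
        in_sample.add(new)
    return sampled


rng = random.Random(1)
support = range(3, 101)  # minimum degree 3; maximum of 100 is an assumption
degrees = rng.choices(support, weights=[d ** -2.5 for d in support], k=1000)
network = configuration_graph(degrees, rng)
sample = rds_like_sample(network, size=100, rng=rng)
print(len(sample), "individuals recruited from seed", sample[0])
```

Choosing the next recruit uniformly from the combined set of eligible neighbours, rather than letting each recruiter pick independently, follows the wording of the caption; other recruitment rules would give a different relationship between degree and sampling probability.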

Use of RDS within a capture-recapture framework is not without its limitations. RDS itself makes a number of strong assumptions16 and may present challenges to implement,17 18 and the claim that RDS can generate representative samples has been criticised.19 20 If the probability of participation is not related to the reported degree, then the adjusted estimate may show increased bias. The relationship between degree and recruitment probability may also be different for the capture and recapture phases, for example if receiving a key chain makes someone more (or less) likely to participate in the follow-up study. Like other quantities derived from samples collected by RDS,21 the estimates of population size are more variable than those obtained under SRS (a phenomenon overlooked by Paz-Bailey et al, who used a naive estimator of variance). However, by distributing the key chains widely in the capture stage, the tendency of individuals with key chains to recruit others with key chains, one potential cause of high variability in RDS estimates, may be low. These limitations aside, there are also several possible improvements to the design, including gathering information on degree during the capture stage and performing an additional stage of recapture following RDS, both of which, combined with further statistical developments, may help to provide better estimates of the size of populations at risk of HIV and other sexually transmitted infections.

Footnotes

  • Linked article: 045633.

  • Funding This research was supported by the National Institute of Nursing Research (grant NR10961), the National Institute on Drug Abuse (grant DA24998) and by a Royal Society Wolfson Research Merit Award to SDWF.

  • Competing interests None.

  • Provenance and peer review Commissioned; not externally peer reviewed.
