Measuring the size of populations most at risk of HIV, such as men who have sex with men (MSM), female sex workers (FSWs), and injection drug users, presents significant challenges, as these populations are often hidden or hard to reach.1 2 Among the most popular methods for estimating population size are those based on capture-recapture, in which multiple samples (or data sources) are combined and the population size is estimated on the principle that individuals are more likely to be sampled multiple times from small populations than from large populations. This principle, originating as far back as the 17th century,3 was embraced later by wildlife researchers4 and epidemiologists5 and is advocated in the WHO/UNAIDS ‘Guideline on Estimating the Size of Populations Most at Risk to HIV’.6

Respondent-driven sampling (RDS)7 8 is increasingly being used to estimate HIV prevalence *within* hidden populations, by utilising social networks of the target population to facilitate sampling.9 Although RDS does not directly estimate population size, weighting individuals according to their number of acquaintances (their degree) controls for sampling bias, producing estimates of quantities such as disease prevalence that are similar to those obtained under simple random (uniform) sampling (SRS), albeit with larger variance.7 8 10
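The weighting idea can be illustrated with a minimal sketch. It assumes the inverse-degree weighted proportion (often called the RDS-II estimator), which is one common choice and not necessarily the estimator used in the studies cited here. The function name and the example data are hypothetical.

```python
def rds_weighted_prevalence(sample):
    """Inverse-degree weighted prevalence from a list of (degree, infected) pairs.

    Assumption: each respondent is sampled with probability proportional to
    degree, so a weight of 1/degree offsets the over-representation of
    well-connected individuals.
    """
    if not sample:
        raise ValueError("sample must not be empty")
    if any(degree <= 0 for degree, _ in sample):
        raise ValueError("degrees must be positive")
    total_weight = sum(1.0 / degree for degree, _ in sample)
    infected_weight = sum(1.0 / degree for degree, infected in sample if infected)
    return infected_weight / total_weight


# Hypothetical respondents: (reported degree, infection status).
respondents = [(2, True), (10, True), (5, False), (1, False)]
crude = sum(1 for _, infected in respondents if infected) / len(respondents)
weighted = rds_weighted_prevalence(respondents)
print(f"crude prevalence: {crude:.3f}, weighted prevalence: {weighted:.3f}")
```

In this toy sample the infected respondents have comparatively high degrees, so the weighted prevalence is lower than the crude proportion.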

In this issue, Paz-Bailey *et al*11 describe the combined use of RDS and capture-recapture to estimate the number of FSWs and MSM in El Salvador. Relatively few studies have combined RDS with capture-recapture (cf,12 13), and the study by Paz-Bailey *et al* is unusual in its use of an active ‘capture’ stage, in which key chains were distributed to members of the target population, followed by RDS to ‘recapture’ the key chains. The underlying model they used to estimate population size considers two samples. Let *N* denote the (unknown) size of the population to be estimated. If, after sampling the population twice, there are *a*_{11} individuals who appeared in both samples, *a*_{10} individuals who appeared only in the first sample and *a*_{01} individuals who appeared only in the second, an estimate of the total population can be obtained as follows.

$$\hat{N} = \frac{(a_{11} + a_{10})(a_{11} + a_{01})}{a_{11}}$$

For a fixed sample size in both the capture and recapture phases, the above (rounded down to the nearest whole number) is the maximum likelihood estimate of *N*.14 This estimate is naive in that the model makes the following assumptions. First, there is assumed to be no change to the population between samples; this may not be too unrealistic in the study of Paz-Bailey *et al*, as the capture and recapture stages were performed close together in time. Second, the samples are assumed to be independent; this would not be the case, for example, if participation in the capture stage affected participation in the recapture stage. The use of different recruitment approaches may help to alleviate this problem. Lastly, each individual is assumed to have the same chance of being sampled; however, recruitment through RDS is assumed to be biased towards individuals with more acquaintances, and this may well be true of the capture stage in Paz-Bailey *et al* as well.
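As a minimal sketch of the two-sample calculation, the function below computes the standard Lincoln-Petersen form of the naive estimate: the product of the two sample totals divided by the number of individuals seen in both samples, rounded down. The function name and the example counts are hypothetical and are not taken from Paz-Bailey *et al*.

```python
def naive_population_estimate(a11, a10, a01):
    """Two-sample capture-recapture estimate of population size.

    a11: individuals seen in both samples
    a10: individuals seen only in the first (capture) sample
    a01: individuals seen only in the second (recapture) sample
    """
    if a11 <= 0:
        raise ValueError("at least one recaptured individual is required")
    if a10 < 0 or a01 < 0:
        raise ValueError("counts must be non-negative")
    n_capture = a11 + a10      # size of the first sample
    n_recapture = a11 + a01    # size of the second sample
    # Integer floor division rounds down to the nearest whole number.
    return (n_capture * n_recapture) // a11


# Hypothetical counts: 200 people marked, 100 recaptured, 20 seen in both.
print(naive_population_estimate(a11=20, a10=180, a01=80))  # prints 1000
```

The estimate is undefined when no individual appears in both samples, which is why the sketch raises an error in that case.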

The WHO/UNAIDS guidelines6 strongly advise against violating the assumptions of this simple model. However, for quantities such as degree that can adopt a wide range of values, more biologically realistic models have been described in the statistical ecology literature, which consider the probability of sampling.15 In RDS, sampling is assumed to be proportional to degree, and so sampling bias can be controlled for by weighting the number of recaptured individuals by the network sizes of the marked and unmarked individuals.11 13 We can rewrite the naive estimator to allow unequal sampling probabilities, as follows.

[…] *k* was sampled and π_{k} is the sampling probability. Paz-Bailey *et al* noted a large difference between […]

To investigate this further, we simulated a capture-recapture design in which the probability of sampling a random individual, *k*, is correlated between the capture and recapture stages, because the sampling probability is proportional to his/her degree, *d*_{k}. We calculated the naive and adjusted estimators for a population with a power-law degree distribution, over a range of values of *λ*, which is a measure of the heterogeneity in degree between individuals. For each *λ*, we sampled the population twice: in the capture phase, individuals were sampled in relation to their degree, and in the recapture phase, individuals were sampled by an RDS-like process, and […] *λ*, corresponding to homogeneous degrees, […] *λ* decreases, the naive estimate decreases and worsens, while […] *λ*. If individuals have equal capture probabilities (but recapture is by RDS), then […] *λ* (figure 1, inset). Hence, similar naive and adjusted estimates can arise when sampling during the capture stage is uniform, as suggested by Paz-Bailey *et al* for FSWs, and when the degree distribution is relatively uniform.
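A simulation of this kind can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: degrees are drawn with probability proportional to *d*^{−λ} on 1 to `d_max` (so larger *λ* gives more homogeneous degrees), the capture stage samples without replacement with weights equal to degree, the RDS-like recapture stage is approximated by draws with replacement proportional to degree, and the adjusted estimate divides the number marked by the inverse-degree weighted proportion of marked individuals in the recapture sample. All names and parameter values are hypothetical.

```python
import random


def power_law_degrees(n, lam, d_max, rng):
    """Draw n degrees with P(d) proportional to d**(-lam), d = 1..d_max."""
    support = list(range(1, d_max + 1))
    weights = [d ** -lam for d in support]
    return rng.choices(support, weights=weights, k=n)


def weighted_sample_without_replacement(weights, k, rng):
    """Efraimidis-Spirakis scheme: keep the k largest keys u**(1/w)."""
    keys = [(rng.random() ** (1.0 / w), i) for i, w in enumerate(weights)]
    keys.sort(reverse=True)
    return {i for _, i in keys[:k]}


def simulate_once(degrees, n_capture, n_recapture, rng):
    """Return (naive, adjusted) population size estimates for one replicate."""
    # Capture: degree-biased marking (for example, key chain distribution).
    marked = weighted_sample_without_replacement(degrees, n_capture, rng)
    # Recapture: simplified RDS-like draws, proportional to degree.
    draws = rng.choices(range(len(degrees)), weights=degrees, k=n_recapture)
    recaptured = [k for k in draws if k in marked]
    if not recaptured:
        return float("inf"), float("inf")
    naive = n_capture * n_recapture / len(recaptured)
    # Adjusted: weight each recapture draw by 1/degree.
    weight_marked = sum(1.0 / degrees[k] for k in recaptured)
    weight_all = sum(1.0 / degrees[k] for k in draws)
    adjusted = n_capture * weight_all / weight_marked
    return naive, adjusted


def mean_estimates(lam, n=10_000, n_capture=1_000, n_recapture=1_000,
                   d_max=50, reps=20, seed=1):
    """Average the naive and adjusted estimates over several replicates."""
    rng = random.Random(seed)
    degrees = power_law_degrees(n, lam, d_max, rng)
    results = [simulate_once(degrees, n_capture, n_recapture, rng)
               for _ in range(reps)]
    naive = sum(r[0] for r in results) / reps
    adjusted = sum(r[1] for r in results) / reps
    return naive, adjusted


if __name__ == "__main__":
    for lam in (1.5, 2.5, 10.0):
        naive, adjusted = mean_estimates(lam)
        print(f"lambda={lam}: naive={naive:.0f}, adjusted={adjusted:.0f}")
```

Under these assumptions, the naive mean should fall well below the true size of 10 000 when *λ* is small, because high-degree individuals are over-represented in both stages, whereas the two estimates should roughly agree when degrees are nearly homogeneous.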

Use of RDS within a capture-recapture framework is not without its limitations. RDS itself makes a number of strong assumptions16 and may present challenges to implement,17 18 and the claim that RDS can generate representative samples has been criticised.19 20 If the probability of participation is not related to the reported degree, then the adjusted estimate may show increased bias. The relationship between degree and recruitment probability may also be different for the capture and recapture phases, for example if receiving a key chain makes someone more (or less) likely to participate in the follow-up study. Like other quantities derived from samples collected by RDS,21 estimates of population size have a larger variance than they would under SRS (a phenomenon overlooked by Paz-Bailey *et al*, who used a naive estimator of variance). However, by distributing the key chains widely in the capture stage, the tendency of individuals with key chains to recruit others with key chains, one potential cause of high variability in RDS estimates, may be low. These limitations aside, there are also several potential improvements to the design, including gathering information on degree during the capture stage and performing an additional stage of recapture following RDS, both of which, combined with further statistical developments, may help to provide better estimates of the size of populations at risk of HIV and other sexually transmitted infections.

## References

## Footnotes

Linked article: 045633.

Funding This research was supported by the National Institute of Nursing Research (grant NR10961), the National Institute on Drug Abuse (grant DA24998) and by a Royal Society Wolfson Research Merit Award to SDWF.

Competing interests None.

Provenance and peer review Commissioned; not externally peer reviewed.
