Evaluating the use of the reproduction number as an epidemiological tool, using spatio-temporal trends of the Covid-19 outbreak in England

The time-varying reproduction number (Rt: the average number of secondary infections caused by each infected person) may be used to assess changes in transmission potential during an epidemic. Since new infections usually are not observed directly, Rt can only be estimated from delayed and potentially biased data. We estimated Rt using a model that mapped unobserved infections to observed test-positive cases, hospital admissions, and deaths with confirmed Covid-19, in seven regions of England over March through August 2020. We explored the sensitivity of Rt estimates of Covid-19 in England to different data sources, and investigated the potential of using differences in the estimates to track epidemic dynamics in population sub-groups. Our estimates of transmission potential varied for each data source. The divergence between estimates from each source was not consistent within or across regions over time, although estimates based on hospital admissions and deaths were more spatio-temporally synchronous than estimates from all test-positives. We compared differences in Rt with the demographic and social context of transmission, and found the differences between Rt may be linked to biased representations of sub-populations in each data source: from uneven testing rates, or increasing severity of disease with age, seen via outbreaks in care home populations and changing age distributions of cases. We highlight that policy makers should consider the source populations of Rt estimates. Further work should clarify the best way to combine and interpret Rt estimates from different data sources based on the desired use.


Background
Since its emergence in 2019, the novel coronavirus SARS-CoV-2 has caused over six million cases of disease worldwide within six months [1]. Its rapid initial spread and high death rate prompted global policy interventions to prevent continued transmission, with widespread temporary bans on social interaction outside the household [2]. Introducing and adjusting such policy measures depends on a judgement in balancing continued transmission potential with the multidimensional consequences of interventions. It is therefore critical to inform the implementation of policy measures with a clear and timely understanding of ongoing epidemic dynamics [3,4].
In principle, transmission could be tracked by directly recording all new infections. In practice, real-time monitoring of the Covid-19 epidemic relies on surveillance of indicators that are subject to different levels of bias and delay. In England, widely available surveillance data across the population includes: 1) the number of positive tests, biased by changing test availability and practice, and delayed by the time from infection to symptom onset (if testing is symptom-based), from symptom onset to a decision to be tested and from test to test result; 2) the number of new hospital admissions, biased by differential severity that triggers care seeking and hospitalisation, and additionally delayed by the time to develop severe disease; and 3) the number of new deaths due to Covid-19, biased by differential risk of death and the exact definition of a Covid-19 death, and further delayed by the time to death.
Each of these indicators provides a different view on the epidemic and therefore contains potentially useful information. However, any interpretation of their behaviour needs to reflect these biases and lags and is best done in combination with the other indicators. One approach that allows this in a principled manner is to use the different data sets to separately track the time-varying reproduction number, Rt, the average number of secondary infections generated by each new infected person [5]. Because Rt quantifies changes in infection levels, it is independent of the level of overall ascertainment as long as this does not change over time or is explicitly accounted for [6]. At the same time, the underlying observations in each data source may result from different lags from infection to observation. However, if these delays are correctly specified then transmission behaviour over time can be consistently compared via estimates of Rt.
Different methods exist to estimate the time-varying reproduction number, and in the UK a number of mathematical and statistical methods have been used to produce estimates used to inform policy [7,8].
Empirical estimates of Rt can be achieved by estimating time-varying patterns in transmission events from mapping to a directly observed time-series indicator of infection such as reported symptomatic cases. This can be based on the probabilistic assignment of transmission pairs [9], the exponential growth rate [10], or the renewal equation [11,12]. Alternatively, Rt can be estimated via mechanistic models which explicitly compartmentalise the disease transmission cycle into stages from susceptible through exposed, infectious and recovered [13,14]. This can include accounting for varying population structures and context-specific biases in observation processes, before fitting to a source of observed cases. Across all methods, key parameters include the time after an infection to the onset of symptoms in the infecting and infected, and the source of data used as a reference point for earlier transmission events [15,16].
In this study, we used a modelling framework based on the renewal equation, adjusting for delays in observation to estimate regional and national reproduction numbers of SARS-CoV-2 across England. The same method was repeated for each of three sources of data that are available in real time. After assessing differences in Rt by data source, we explored why this variation may exist. We compared the divergence between Rt estimates with spatio-temporal variation in case detection, and the proportion at risk of severe disease, represented by the age distribution of test-positive cases and hospital admissions and the proportion of deaths in care homes.

Data management
Three sources of data provided the basis for our Rt estimates. Time-series case data were available by specimen date of test. This was a de-duplicated dataset of Covid-19 positive tests notified from all National Health Service (NHS) settings (Pillar One of the UK Government's testing strategy, [17]) and by commercial partners in community settings outside of healthcare (Pillar Two). Hospital admissions were also available by date of admission if a patient had tested positive prior to admission, or by the day preceding diagnosis if they were tested after admission. Death data were available by date of death and included only those which occurred within 28 days of a positive Covid-19 test in any setting. All data were publicly available and taken from the UK government source [18,19], and were aggregated to the seven English regions used by the NHS.
To provide context for Rt estimates, we sourced weekly data on regional and national test positivity (percentage positive tests of all tests conducted) from Public Health England [20], available as weekly average percentages from 10th May. From the same source, we also identified the age distributions of cases admitted to hospital and all test-positive cases. Hospital admissions by age were available as age bands with rates per 100,000, so we used regional population data from 2019 [21] to approximate the raw count. We separately sourced daily data on the number of deaths in care homes by region from March, available from 12th April [22]. Care homes are defined as supported living facilities (residential homes, nursing homes, rehabilitation units and assisted living units). Data were available by date of notification, which included an average 2-3 day lag after the date of death. We also drew on a database which tracked Covid-19 UK policy updates by date and area [23].
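The conversion from age-banded rates to approximate raw counts described above is a simple rescaling by population. A minimal sketch in Python follows; the function name and the example rate and population are hypothetical illustrations, not values from the study data.

```python
def rate_to_count(rate_per_100k: float, population: int) -> float:
    """Approximate a raw count from a rate per 100,000 people."""
    return rate_per_100k * population / 100_000


# Hypothetical age band in a hypothetical region (not study data).
example_rate = 12.5            # admissions per 100,000 in the age band
example_population = 800_000   # people in that age band
approx_admissions = rate_to_count(example_rate, example_population)
print(approx_admissions)
```

Because published rates are usually rounded, counts recovered this way are approximations rather than exact figures.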

Rt estimation
We estimated Rt using EpiNow2 version 1.2.0, an open-source package in R [24,25]. This package implements a Bayesian latent variable approach using the probabilistic programming language Stan [26], which works as follows. The initial number of infections was estimated as a free parameter with a prior based on the initial number of cases, hospital admissions or deaths, respectively. For each subsequent time step, previous imputed infections were summed, weighted by an uncertain generation time probability mass function, and combined with an estimate of Rt to give the prevalence at time t [11]. These infection trajectories were mapped to reported case counts by convolving over an uncertain incubation period and report delay distribution [24], and a negative binomial observation model combined with a multiplicative day of the week effect (with an independent effect for each day of the week). Temporal variation was controlled using an approximate Gaussian process [27] with a squared exponential kernel. The length scale and magnitude of the kernel were estimated during model fitting. Each region was fitted independently using Markov-chain Monte Carlo (MCMC). Eight chains were used with a warmup of 1,000 samples and 2,000 samples post warmup. Convergence was assessed using the R hat diagnostic [24,26].
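The two deterministic steps in this description, the renewal equation and the convolution from infections to reports, can be illustrated with a short forward simulation. This is only a sketch in Python under strong simplifying assumptions: it is not the EpiNow2 implementation, it treats the generation time and delay distributions as fixed rather than uncertain, and it omits the Gaussian process, the negative binomial observation model and the day of the week effect. All function names and numeric values are hypothetical.

```python
def renewal_infections(rt, gen_pmf, seed_infections):
    """Simulate infections forward with the renewal equation.

    rt: reproduction numbers, one per simulated time step.
    gen_pmf: gen_pmf[k] = P(generation time = k + 1 days).
    seed_infections: initial infections used to start the recursion.
    """
    infections = list(seed_infections)
    for r in rt:
        # Past infections weighted by the generation time distribution.
        force = sum(
            w * infections[-(k + 1)]
            for k, w in enumerate(gen_pmf)
            if k + 1 <= len(infections)
        )
        infections.append(r * force)
    return infections


def convolve_delay(infections, delay_pmf):
    """Expected reports per day; delay_pmf[d] = P(delay = d days)."""
    reports = []
    for t in range(len(infections)):
        expected = sum(
            p * infections[t - d]
            for d, p in enumerate(delay_pmf)
            if t - d >= 0
        )
        reports.append(expected)
    return reports


# Hypothetical inputs chosen for easy arithmetic, not fitted values.
gen_pmf = [0.2, 0.5, 0.3]
delay_pmf = [0.0, 0.5, 0.5]
infections = renewal_infections([2.0, 2.0, 1.0], gen_pmf, [10.0, 10.0, 10.0])
reports = convolve_delay(infections, delay_pmf)
print(infections)
print(reports)
```

In the fitted model the direction is reversed: the reports are observed, and infections and Rt are inferred as latent quantities consistent with them.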
We used a gamma distributed generation time with mean 3.6 days (standard deviation (SD) 0.7), and SD of 3.1 days (SD 0.8), sourced from [28]. Instead of the incubation period used in the original study (which was based on fewer data points), we refitted using a log-normal incubation period with a mean of 5.2 days (SD 1.1) and SD of 1.52 days (SD 1.1) [29]. This incubation period was also used to convolve from unobserved infections to unobserved symptom onsets (or a corresponding viral load in asymptomatic cases) in the model. We estimated both the delay from onset to positive test (either in the community or in hospital) and the delay from onset to death as log-normal distributions using a subsampled Bayesian bootstrapping approach (with 100 subsamples each using 250 samples) from given data on these delays. Our delay from date of onset to date of positive test (either in the community or in hospital) was taken from a publicly available linelist of international cases [30]. We removed countries with outlying delays (Mexico and the Philippines). The resulting delay data had a mean of 4.4 days and standard deviation (SD) 5.6. For the delay from onset to death we used data taken from a large observational UK study [31]. We re-extracted the delay from confidential raw data, with a mean delay of 14.3 days (SD 9.5). There were insufficient data available on the various reporting delays to estimate spatially or temporally varying delays, so they were considered to be static over the course of the epidemic. See [12,24] for further details on the approach.
medRxiv preprint, this version posted October 20, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.
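To relate summary statistics such as these to distribution parameters, one option is moment matching. The Python sketch below is an illustration only: the study estimated delays by subsampled bootstrapping rather than moment matching, the sketch uses central values and ignores their stated uncertainty, and it assumes the quoted mean and SD are on the natural (day) scale. The function names are hypothetical.

```python
import math


def gamma_from_mean_sd(mean, sd):
    """Shape and rate of a gamma distribution with the given mean and SD."""
    shape = (mean / sd) ** 2
    rate = mean / sd ** 2
    return shape, rate


def lognormal_from_mean_sd(mean, sd):
    """Log-scale mu and sigma of a log-normal with the given mean and SD."""
    sigma2 = math.log(1 + (sd / mean) ** 2)
    mu = math.log(mean) - sigma2 / 2
    return mu, math.sqrt(sigma2)


# Generation time central values quoted in the text (mean 3.6, SD 3.1 days).
gt_shape, gt_rate = gamma_from_mean_sd(3.6, 3.1)

# Onset-to-death delay summary quoted in the text (mean 14.3, SD 9.5 days).
death_mu, death_sigma = lognormal_from_mean_sd(14.3, 9.5)

print(gt_shape, gt_rate)
print(death_mu, death_sigma)
```

A bootstrap fit to the raw delay data, as used in the study, additionally propagates uncertainty in these parameters, which a single moment-matched pair does not.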

Comparison of Rt estimates
We compared Rt estimates by data source, plotting each by region over time. To avoid the first epidemic wave obscuring visual differences, all plots were limited to the earliest date that any Rt estimate for England crossed below 1 after the peak. We also identified the time at which each Rt estimate fell below 1, the local minima and maxima of median Rt estimates, and the number of times in the time-series that each Rt estimate crossed its own median, and compared these across regions and against the total count of the raw data.
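Two of these summaries, the first time an estimate falls below 1 and the number of times a series crosses its own median, are straightforward to compute from a median Rt time series. The Python sketch below shows one way to do so; the function names and the toy series are hypothetical, and the choice to skip values exactly equal to the median is an assumption rather than a documented detail of the analysis.

```python
from statistics import median


def count_median_crossings(series):
    """Count sign changes of (value - median), skipping exact ties."""
    m = median(series)
    signs = [1 if x > m else -1 for x in series if x != m]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)


def first_index_below(series, threshold=1.0):
    """Index of the first value strictly below the threshold, or None."""
    for i, x in enumerate(series):
        if x < threshold:
            return i
    return None


# Toy series of median Rt values (hypothetical, not study estimates).
toy_rt = [1.4, 1.1, 0.9, 0.8, 1.0, 0.85, 1.05, 0.95]
print(count_median_crossings(toy_rt))
print(first_index_below(toy_rt))
```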
We investigated correlations between Rt and the demographic and social context of transmission. We used linear regression to assess whether the level of raw data count influenced oscillations in Rt. We assessed the influence of local outbreaks using test positivity. We used a 5% threshold for positivity as the level at which testing is either insufficient to keep pace with widespread community transmission [32], or where outbreaks have already been detected and tests targeted to those more likely to be positive. We plotted this against raw data and Rt, and also used linear regression to test the association. We interpreted results in light of known outbreaks and policy changes. We plotted and qualitatively assessed variation in Rt against the age distribution of cases over time, and similarly explored patterns in Rt against the qualitative proportion of cases to all deaths. The latter was not assessed quantitatively due to differences in reference dates [22]. With the exception of sampling the delay from onset to death (held confidentially), end-to-end code to reproduce this analysis is available [33].
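The positivity threshold and the linear regression step can be sketched as follows. This is a hedged Python illustration with hypothetical function names and made-up weekly counts; the exact regression specification used in the analysis is not given in the text, so the example fits a plain one-variable ordinary least squares line.

```python
def positivity_percent(positive_tests, total_tests):
    """Percentage of all tests conducted that were positive."""
    return 100.0 * positive_tests / total_tests


def ols_fit(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical weekly counts for one region (not study data).
weekly_positive = [30, 60, 140]
weekly_total = [1000, 1000, 1000]
positivity = [positivity_percent(p, n) for p, n in zip(weekly_positive, weekly_total)]
over_threshold = [p > 5.0 for p in positivity]  # 5% threshold from the text
print(positivity, over_threshold)

# Hypothetical predictor and outcome to show the regression call.
intercept, slope = ols_fit([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(intercept, slope)
```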

Results
Across England, the Covid-19 epidemic peaked at 4,798 reported test-positive cases (on the 22nd April), 3,099 admissions (1st April), and 975 deaths (8th April) per day (figure 1A). Following the peak, a declining trend continued for daily counts of admissions and deaths, while daily case counts from all reported test-positive cases increased from July and had more than tripled by August (from 571 on 30th June to 1,929 on 1st September). Regions followed similar patterns over time to national trends. However, in the North East and Yorkshire, Midlands, and North West, incidence of test-positive cases did not decline to near the count of admissions as in other regions, and also saw a small temporary increase during the overall rise in case counts in early August.
[Figure 1]
Following the initial epidemic peak in mid-March, the date at which Rt crossed below 1 varied by both data source and geography (figure 1B, figure 2). The first region to cross into a declining epidemic was London, on the 26th March according to an Rt estimated from deaths (where the lower 90% CrI crossed below 1 on the 24th and the upper CrI on the 28th March). However, as much as spatial variation, the data sources used to estimate Rt influenced the earliest date of epidemic decline. Rt estimated from hospital admissions gave the earliest estimate of a declining epidemic, while using all test-positive cases to estimate Rt took the longest time to reach a declining epidemic, in all but one region (East of England). This difference by data source varied by up to 21 [...] April), but the median Rt from test-positive cases crossed 1 only on the 22nd April (90%CrIs 1st April, 25th April).
[Figure 2]
When not undergoing a clear state change, Rt estimates from all data sources appeared to oscillate, with oscillations damped when Rt estimates were transitioning to what appeared to be new levels. In England and all NHS regions, test-positive cases showed evidence of larger damped oscillations from July when a state change occurred to Rt > 1. In England, Rt from test-positive cases increased from 0.99 (90%CrI 0.94-1.04) on 30th June to 1.37 (90%CrI 1.31-1.44) on 27th August. Meanwhile, the timing and duration of oscillations did not align between Rt estimates (figure 1B). In some regions, the difference between Rt estimates held consistently over time, such as between Rt from admissions and deaths in the South East. In other regions such as the Midlands this was not the case, with the divergence between the Rt from test-positive cases, admissions, and deaths each varying over time. Rt estimates from test-positive cases were the most likely to diverge from estimates derived from other data sources across all regions. Across all regions, Rt estimates from deaths had larger damped oscillations compared to estimates from test-positive cases or hospital admissions. However, oscillations in Rt did not appear to be linked to the level of raw data counts in each source (SI figure 2).
More rapid oscillations in Rt from test-positive cases appeared to be linked to targeted testing of case clusters, seen in high test positivity (table SI2). Both the North East and Yorkshire and the Midlands saw more frequent oscillations in Rt from test-positive cases than other regions. The Rt from cases crossed its own median 10 times over the time-series in both regions, while in all other NHS regions this averaged 6 times, and oscillations in Rt from cases also had a shorter duration in the North East and Yorkshire and the Midlands compared to other regions (table SI1). Across all regions, 84% of weeks with over 5% positivity (N=19) were in the North East and Yorkshire and the Midlands (figure 2A). In these regions, positivity peaked on the week of 9th May at 14% and 12% respectively, and overall averaged 6% (95%CI 4.4-7.6%) and 5.9% (95%CI 4.6-7.2%, weeks of 10th May to 22nd August) respectively. High test positivity is likely to have resulted from targeted testing among known local outbreaks in these regions. In the Midlands, these included local restrictions and increased testing across Leicester and in a Luton factory (restrictions between 4th and 25th July [34]). In Yorkshire case clusters were detected with local restrictions in Bradford, Calderdale, and Kirklees (with restrictions from 5 August [35]).
In England, a divergence between Rt from cases versus Rts from deaths and admissions coincided with a shift in the age distribution of all test-positive cases in England towards a younger population (figure SI1A). From mid-April to June, national estimates of Rt from test-positive cases remained around the same level as those from admissions or deaths, while after this, cases diverged to a higher steady state (figure 1A). On the 23rd May, the median Rt from cases matched that of deaths at 0.83 (both with 90% CrIs 0.78-0.89), but this was followed by a 78-day period before the two estimates were again comparable, on 8th August. Over this period the median Rt from cases was on average 14% higher (95%CI 12-15%). Meanwhile, the share of test-positive cases under age 50 increased from under one-quarter of cases in the week of 28 March (24%, N=16,185), to accounting for nearly three-quarters of cases by 22nd August (77%, N=6,733). While the percentage of test-positive cases aged 20-49 increased consistently from April to August, the 0-19 age group experienced a rapid increase over mid-May through July, increasing by a mean 1% each week over May 9th through August 1st (from 4% of 18,774 cases to 14.8% of 5,017 cases).
medRxiv preprint doi: https://doi.org/10.1101/2020.10.18.20214585
Similarly, Rt from admissions in England oscillated over June through July, potentially linked to the age distribution of hospital admissions. From 0.92 (90%CrI 0.87-0.98) on the 11th June, Rt from admissions fell to 0.8 (90%CrI 0.75-0.85) on the 27th June. In contrast, this transition was not observed in the Rt estimate based on test-positive cases (figure 1A).
Older age groups dominated Covid-19 hospital admissions, where 0-44 years never accounted for more than 12.8% of hospital-based cases (a maximum in the week of 22nd August, N=690; figure SI2B). While the proportion of hospital admissions aged 75+ remained steady over May through mid-June, this proportion appeared to oscillate over July through August (standard deviation of weekly percentage at 6.1 over June-August, compared to 5.4 in months March-May). These variations were not seen in the proportion aged 70+ in the test-positive case data, which saw a continuous decline from 30% at the start of June to 7% by August.
Rt estimated from either admissions or deaths experienced near-synchronous local peaks across regions over April and May. We compared this Rt from deaths with its source data and a separate regional dataset of deaths in care homes. In the South East and South West, the Rt from deaths rose over April, with a peak of Rt in early May. In the South West, the median Rt estimate from deaths increased by 0.04 from 22nd April to 7th May (from 0.8 (90%CrI 0.72-0.88) to 0.84 (90%CrI 0.76-0.95)); and by 0.06 from the 17th April to 4th May in the South East (from 0.82 (90%CrI 0.77-0.9) to 0.88 (90%CrI 0.72-0.88)). In both these regions, this early May peak in Rt from deaths coincided with similarly rising Rts from hospital admissions, while the reverse trend was seen in Rt from cases. In all regions, care home deaths peaked over the 22nd-29th April (by date of notification; figure SI3). This was later than regional peaks in the raw count of all deaths in any setting (which peaked between the 8th-16th April, by date of death), even accounting for a 2-3 day reporting lag. This meant that the proportion of deaths from care homes varied over time, where in the South East and South West, deaths in care homes appeared to account for nearly all deaths for at least the period mid-May to July.

Discussion
We estimated the time-varying reproduction number for Covid-19 over March through August across England and English NHS regions, using test-positive cases, hospital admissions, and deaths with confirmed Covid-19. Our estimates of transmission potential varied for each of these sources of infections, and the divergence between estimates from each data source was not consistent within or across regions over time, although estimates based on hospital admissions and deaths were more spatio-temporally synchronous than estimates from cases. We compared differences in Rt estimates to the extent and context of transmission, and found that the difference between Rt from cases, admissions, and deaths may be linked to uneven rates of testing, the changing age distribution of cases, and outbreaks in care home populations.
Rt varied by data source, and the extent of variation itself differed by region and over time. Following the initial epidemic peak in mid-March, the date at which Rt crossed below 1 varied by both data source and geography, following which Rt estimates from all data sources varied when not undergoing a clear state change. The differences in these oscillations by data source may indicate different underlying causes. This implies that each data source was influenced differently by changes in sub-populations over time.
Increasingly rapid oscillations in Rt from test-positive cases were associated with higher test-positivity rates. Increasing test-positivity rates could be an indication of inconsistent community testing, with the observation of an initial rise in transmission amplified by expanded testing and local interventions where a cluster of new, mild cases has been identified [17]. This targeted testing may drive regionally localised instability in case detection and resulting Rt estimates but may not reflect changes in underlying transmission. This is a limitation of monitoring epidemic dynamics using test-positive surveillance data in areas where testing rates vary across the population and over time. This also suggests that Rt from admissions may be more reliable than that from all test-positive cases for indicating the relative intensity of an epidemic over time [36].
We hypothesised that variations in Rt estimates based on data reflecting more severe outcomes (hospital admissions and deaths) were related to changes in the age distribution of cases over time, because age is associated with severity [37]. We found that from June onwards, Rt from all test-positive cases appeared to increasingly diverge away from Rt from admissions and deaths, transitioning into a separate, higher, steady state. This was followed by the observed age distribution of all test-positive cases becoming increasingly younger, while the age distribution of admissions remained approximately level. Because of the severity gradient, this suggested the Rts from all test-positive cases and admissions were more biased by the relative proportion of younger cases and older cases respectively than the Rt from admissions or deaths.
Similarly, all regions saw a near-synchronous local peak in Rt from hospital admissions over spring, which was not observed in Rt from test-positive cases. This may have reflected the known widespread regional outbreaks in care homes, with an older population who are more likely to experience fatal outcomes, and less likely to appear for community testing or in hospital admissions, than the general population [38-40].
Our analysis was limited where data or modelling assumptions did not reflect underlying differences in transmission. Rt estimates can become increasingly uncertain and unstable with lower case counts. Further, estimated unobserved infections were mapped to reported cases or deaths using two uncertain delay distributions: the time from infection to test in the community or hospital, and a longer delay from infection to death. A mis-specified delay distribution would have created bias in the temporal distribution of all resulting Rt estimates, with estimated dates of infection and Rt incorrectly shifted too much or too little in time compared to the true infection curve. This may have been observed in the Rt variation from admissions and deaths, which often showed comparable levels and patterns in oscillations over time but were out of phase with each other. This is likely because, lacking England-specific data, we used delays from a worldwide linelist to specify the delay to any positive test as the same as that to admission, while the delay to death was estimated separately from UK data.
We may also have mis-specified delays where delay distributions may themselves be spatially or temporally dependent. This was not accounted for but could have increased the accuracy of Rt estimation [41]. However, there is currently very little UK data on the time from case onset to confirmation of Covid-19 in any of a positive test, hospital admission, or death, from any point in the epidemic. We mitigated this by using a subsampled bootstrap, which incorporates our lack of certainty in the data.
The data sources themselves may also have been inaccurate or biased, which would change the representation of the population we have assumed here. For example, we excluded data from other nations of the UK (Wales, Scotland and Northern Ireland) in our analysis, as these differed in both availability over time and in data collection and reporting practices [18,42]. English regional data may also contain bias where new parts of the population might be under focus for testing efforts, or the population characteristics of hospital admissions from Covid-19 may have changed over time with changes in clinical criteria or hospital capacity for admission. This would mean that an Rt estimate from these data sources would represent different source populations over time, limiting our ability to reliably compare against Rt estimates from other data sources. Where possible we highlighted this by comparing Rt estimates to known biases and changes in case detection and reporting.
Our approach is unable to make strong or causal conclusions about varying transmission, and assumptions about sampling and the representation of sub-populations remain implicit. Alternatively, varying epidemics in sub-populations could have been addressed with mechanistic models that explicitly consider transmission in different settings and are fitted to multiple data sources. However, these require additional assumptions, detailed data to parameterise, and may be time-consuming to develop. In the absence of data, the number of assumptions required for these models can introduce inherent structural biases. Our approach contains few structural assumptions and therefore may be more robust when data are sparse, or information is required in real-time.
This work highlights the sensitivity of Rt estimates to assumptions about the underlying population of each data source. There is no clearly superior choice of data for estimating Rt, meaning choice should be guided by minimising relevant biases in the context in which the estimates will be used. Each data source is generated by separate sampling strategies, such as by community case detection or patient severity. Despite the robust reconstruction of the underlying transmission process from the reporting processes, there is additional information added by considering Rt estimates from different data sources in tandem. Although this can be difficult to interpret without specific knowledge of population structure and dynamics, this would be useful to clarify and target policy depending on transmission intensity by sub-population where access to high quality disaggregated data may not be available in real time.
In contrast, if policy were to be based on either a single or an averaged Rt estimate, it would be unclear what any recommendation should be and for whom.
Future work could explore systematic differences in the influence of data source on Rt estimate by extending the comparison of Rt by data source to other countries or infectious diseases. Additionally, work should also clarify the potential for comparing Rt estimates in real-time tracking of outbreaks, and explore the inconsistencies in case detection over time and space, where a cluster of cases leads to a highly localised expansion of community testing, creating an uneven spatial bias in transmission estimates. These findings may be used to improve Rt estimation and identify findings of use for epidemic control. Based on the work presented here we now provide Rt estimates, updated each day, for test positive cases, admissions and deaths in each NHS region and in England. Our estimates are visualised on our website, are available for download, and are produced using publicly accessible code [43,44].
Tracking differences by data source can improve understanding of variation in testing bias in data collection, highlight outbreaks in new sub-populations and indicate differential rates of transmission among vulnerable populations, and clarify the strengths and limitations of each data source. Our approach can quickly identify such patterns in developing epidemics that might require further investigation and early policy intervention. Our method is simple to deploy and scale over time and space using existing open-source tools, and all code and estimates used in this work are available to be used or re-purposed by others.

Funding
The following funding sources are acknowledged as providing funding for the named authors. Wellcome Trust (210758/Z/18/Z: JDM, JH, KS, NIB, SA, SFunk, SRM). This research was partly funded by the Bill & Melinda Gates Foundation (INV-003174: MJ). This project has received funding from the European Union's Horizon 2020 research and innovation programme, project EpiPose (101003688: MJ).
Figure 1. Epidemic dynamics across (A) England and (B-H)