Data to Track the Prevalence and Impact of CoVID-19

A guest blog by Tony Kwasnica (Penn State) on CoVID-19 analysis, with some plots by me.

Tracking the progress of CoVID-19 and our fight against it has become an all-consuming exercise for many people as they wait with bated breath for daily press briefings from local or national public health officials about the status of the disease. To date, much of the discussion has focused on two pieces of data: the number of positive (confirmed) cases of CoVID-19 and the number of fatalities, with other statistics such as hospitalizations and ventilator usage also discussed occasionally. One important data point that has begun to garner attention is the rate of positivity (i.e. the fraction of all tests that are positive) of the testing conducted (see www.theatlantic.com/technology/archive/2020/04/us-coronavirus-outbreak-out-control-test-positivity-rate/610132/). This number has even been suggested as a key criterion for determining whether particular locations are ready to lessen mitigation measures (www.nytimes.com/interactive/2020/04/17/us/coronavirus-testing-states.html). Unfortunately, good visualizations of this data, and of how states compare over time, are hard to find. Chris Parker (chrisparker.io) and I have recently used the data provided by the CoVID Tracking Project (www.covidtracking.com) to provide a state-by-state analysis of two important numbers. Below in small form, or directly at https://plotly.com/~christopherdparker/5/#/, you can find graphs where we plot the rate of positivity for each state over a five-day running window as well as the five-day average number of tests conducted per day per 100,000 people. Below I discuss some of the advantages and weaknesses of this data.

While numbers such as the absolute number of CoVID cases and the number of fatalities are important data points, they have some limitations. Changes in the number of new positive cases in a day are often cited as signs of progress against the disease and of our efforts to “flatten the curve”, but they can be misleading since, even at the state level, the number of test results being returned can vary substantially. For example, Pennsylvania announced on April 20, 2020 that there were only 948 new cases, the first time the number had dipped below 1,000 since April 1. This number, however, is almost certainly lower in part because only 4,098 test results were returned on April 20. In comparison, 6,592 tests (1,628 positives) were returned the previous day. While this lower count may be good news, one guaranteed way to increase or decrease the number of positive cases is to test more or less. If, as has been suggested is necessary to remove restrictions, we were to suddenly triple the amount of testing, the number of new positive cases would jump, but that would not mean the disease was any more or less prevalent.[1] In the extreme, if we did no testing then we’d have no new cases.

Similarly, fatalities have become a common number used to measure the extent and cost of the disease, with the popular IHME projections relying on fatalities as the primary forecast variable (www.healthdata.org). This data too can suffer from any number of data quality issues that make its use somewhat misleading. First, particularly when a region is in the midst of a large outbreak, many fatalities may go uncounted, either because they were unrecognized (e.g. the deceased never tested positive) or because, due to limitations in hospital resources, they happened somewhere other than a hospital. Recent reports of excess deaths suggest that uncounted fatalities may be quite high (www.nytimes.com/2020/04/21/world/coronavirus-live-world-cases-global.html). Second, even if accurately measured, fatalities likely lag infection by 14-21 days, so they paint a picture of where we were in terms of infection rather than where we currently are.

Now that most states are conducting a fairly high number of tests (e.g. anything more than a hundred or so per day), one under-reported data point is the rate of positivity.[2]

What it is: The rate of positivity is the number of positive test results in a time period (5 days in our figures) divided by the total number of tests returned in that time period.
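
To make the calculation concrete, here is a minimal sketch of how the two plotted series could be computed, assuming daily state-level data in the shape provided by the CoVID Tracking Project API. The column names `positiveIncrease` and `totalTestResultsIncrease` and the helper below are illustrative, not the exact code behind our figures:

```python
import pandas as pd

def five_day_metrics(df: pd.DataFrame, population: int) -> pd.DataFrame:
    """Add five-day positivity and testing-volume columns for one state.

    Expects one row per day, with columns 'date', 'positiveIncrease'
    (new positives reported that day), and 'totalTestResultsIncrease'
    (new test results reported that day).
    """
    out = df.sort_values("date").copy()
    # Five-day running totals of positives and of all test results.
    out["positives_5day"] = out["positiveIncrease"].rolling(5).sum()
    out["tests_5day"] = out["totalTestResultsIncrease"].rolling(5).sum()
    # Rate of positivity: positives in the window / tests in the window.
    out["positivity_5day"] = out["positives_5day"] / out["tests_5day"]
    # Average daily tests in the window, per 100,000 residents.
    out["tests_per_100k_5day"] = (out["tests_5day"] / 5) / population * 100_000
    return out
```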

What it is not: The rate of positivity is NOT the percentage of the local population that is infected. Rather, it is the percentage of the people given a test who were found to have CoVID. The actual percentage of the population currently infected (the prevalence) is hopefully much, much lower, since we are testing people whom we presume, due to their symptoms, are likely to have CoVID. Testing could be used to estimate the prevalence of the disease, but that would require a large random sample of the population. Such efforts are likely underway, and this is at least one reason public health officials argue more testing resources are needed. For now, most tests are confined to those individuals for whom we have some concern that they might be infected.

What it is: While the rate of positivity is not the prevalence of the disease, it is correlated with prevalence, meaning that when prevalence goes up (down), all else being equal, the rate of positivity will go up (down). Of course, the devil is in that little phrase “all else being equal.” In theory, if one knew a few additional pieces of data, the prevalence could be calculated directly from the positivity rate. The required data relate to the quality of the test (its specificity and sensitivity), the proportion of people with CoVID and other afflictions that meet the current diagnostic criteria to obtain a test, and the prevalence of other illnesses that might be mistaken for CoVID. Especially given the novel nature of this disease, many of these numbers are still speculative, but we are garnering improved data all the time about some of these variables. To be useful, however, it is not necessary to know all these factors; it is sufficient that they are stable, or at least that changes in them can be tracked. Given that most testing is driven by CDC clinical criteria (www.cdc.gov/coronavirus/2019-ncov/hcp/clinical-criteria.html), it is reasonable to assume that many of these factors are stable across states and within the time frame we are examining (since March 15). Unless you have reason to believe that one state (say NY) is handing out tests like candy, or that the flu is running rampant in Pennsylvania but not in New Jersey, the “all else being equal” assumption is not particularly troubling here in my opinion. Further data evaluation with respect to known variations could help explain differences in positivity rates and thus prevalence.[3]
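
To illustrate the arithmetic (this is not the method behind our figures), one standard way to back the true infection rate among those tested out of the observed positivity rate, once the quality of the test is known, is the Rogan-Gladen correction. Extrapolating from the tested group to the whole population would additionally require the other factors listed above; all numbers below are hypothetical.

```python
def rogan_gladen(positivity: float, sensitivity: float, specificity: float) -> float:
    """Infection rate among those tested, corrected for test quality."""
    return (positivity + specificity - 1) / (sensitivity + specificity - 1)

# Hypothetical example: 20% of tests return positive with a test that is
# 90% sensitive and 99% specific.
print(rogan_gladen(0.20, sensitivity=0.90, specificity=0.99))  # ~0.213
```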

What it is not: This data does not provide policy prescriptions. It simply suggests where we are in terms of prevalence of the disease. Any change in policy relies on deeper knowledge of disease dynamics that is not captured here. In the long run, data like this may be used to track changes in prevalence over time, which might then be used to estimate the infectivity rate of the disease, but that is beyond the scope of this simple project.

What it is: The data can be used to compare states across time and location. For example, the significantly higher positivity rate in New Jersey versus Pennsylvania suggests that the disease is more widespread in New Jersey. The increase in positivity in Pennsylvania may indicate that Pennsylvania has not yet reached its peak in terms of disease prevalence. Given that a positivity rate of less than 10% has been cited as a target for relaxing restrictions, this data can be useful for seeing where particular states stand on this dimension. The danger of simply using new case counts for such decisions is that it might result in excessively cautious or risky choices; a state that really wants to ‘reopen’ might decide to test less and thereby show lower case counts, while a state that invests heavily in testing might be scared away from relaxing some measures because of increased case counts.

Examining the data, a few interesting and surprising results emerge.

  • Many states have consistently low (less than 10%) positivity rates. An incomplete list is Alaska, Hawaii, Montana, North Dakota, Oklahoma, Oregon, Tennessee, Utah, West Virginia, and Wyoming. These may be states where mitigation measures can be lessened in the near future.
  • Some states show trends consistent with other data suggesting a decline in prevalence. The best (but somewhat different) examples would be Louisiana (a brief peak but substantial decrease since then), New York (still high positivity rates but they have begun falling), Florida (a more gradual increase and decrease that does not seem to match up with the dire predictions for the state given policy choices and demographics), and California (fairly low and stable positivity numbers).
  • Some states show worrying upward trends in positivity. Examples include Connecticut, Iowa, Massachusetts, and Pennsylvania. States with current upward trends appear close to the locations with the most severe outbreaks (e.g. NY), possibly indicating the dynamic nature of the disease.
  • Data issues make some of these numbers hard to utilize. It is clear that some states’ reporting of test results is uneven and frequently did not start until well into April. Despite using a five-day window to visualize the data, there are spikes of positivity, usually related to long periods with few tests reported. This problem appears to be decreasing as states improve their data provision, but large changes in the positivity rate should be compared to the changes in the number of tests being reported (a simple heuristic for flagging such spikes is sketched after this list).
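
As one illustrative way to implement that comparison (with arbitrary thresholds, and assuming the columns computed in the earlier sketch), the snippet below flags windows where positivity jumped sharply while the number of reported tests fell, a pattern that often reflects uneven reporting rather than a real change in prevalence:

```python
import pandas as pd

def flag_suspect_spikes(df: pd.DataFrame,
                        rate_jump: float = 0.05,
                        test_drop: float = 0.5) -> pd.Series:
    """True where positivity rose by more than `rate_jump` (absolute)
    while the five-day test total fell below `test_drop` times its
    previous value."""
    rate_change = df["positivity_5day"].diff()
    test_ratio = df["tests_5day"] / df["tests_5day"].shift(1)
    return (rate_change > rate_jump) & (test_ratio < test_drop)
```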

Accounting for uncertainty. As with any statistic, it is important to remember that the number may be wrong. The data is simply the estimated positivity rate given the observed test results. A conservative margin of error can easily be calculated, just as in survey data.[4] The margin of error depends on both the estimated positivity rate and the number of tests. As an example, a state with a positivity rate of 20% will have a margin of error of plus or minus 2.5% if it reports only 1,000 tests, whereas the margin of error falls to plus or minus 1.1% with 5,000 tests. While we may add margin-of-error figures in the future, given the nature of the data, a margin of error in the 1-2% range is a reasonable rule of thumb; it would not be wise to make much of small shifts in the positivity rate, but shifts bigger than a few percentage points are likely significant.[5]
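
For the curious, the figures above come from the standard survey-style formula: a 95% margin of error of plus or minus 1.96 × √(p(1−p)/n), treating each of the n test results as an independent draw with positivity p. A quick sketch:

```python
from math import sqrt

def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """95% margin of error for an observed proportion p out of n tests."""
    return z * sqrt(p * (1 - p) / n)

print(margin_of_error(0.20, 1_000))  # ~0.025, i.e. plus or minus 2.5%
print(margin_of_error(0.20, 5_000))  # ~0.011, i.e. plus or minus 1.1%
```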

That said, it does seem important to report figures on the volume of testing alongside the positivity rate. First, like positivity, these numbers are rarely reported and presented publicly but are still the subject of much debate (are we doing enough testing?). Second, there is interesting variation between states and over time. Third, while I am unsure exactly how to account for it, the volume of testing does seem like an important variable for tracking prevalence. Testing here is not exogenous and random; instead, there is testing demand based upon the number of patients satisfying the diagnostic criteria. One explanation for reduced testing may be decreased prevalence, resulting in fewer sick people needing to be tested. I’d be happy to hear from others on how one might more seriously consider this number. One informal inference I believe makes sense is that if a state has both a rising positivity rate and a rising number of tests, that is a “double whammy” of bad news in terms of the disease. On the other hand, if testing and positivity are both going down, it might be a sign that the disease is declining to the point that testing capacity is not being fully utilized.

As in any decision-making problem, I encourage you to consider all available data sources. This data should be just one piece of the puzzle of our fight to understand and then defeat this disease.

Please feel free to contact me at kwasnica@psu.edu or Chris Parker at chris.parker@american.edu.


Footnotes

[1] Proportional changes in the number of tests conducted will not necessarily result in a proportional increase in the number of positive cases. While the direction of the change is guaranteed, the size of the change depends upon the way in which testing is increased and other factors such as the fraction of the population with certain symptoms associated with CoVID.

[2] I am making no claims here about the “right” number of tests. When I say “high” I am simply referring to the fact that with a large enough sample the observed proportion of positive tests will likely be fairly stable and close to the actual rate of positivity.

[3] A large panel data study of how various factors such as temperature, population density, the timing of non-pharmaceutical interventions, etc. impact positivity might yield interesting insights. Further study of similar data at the county or zip code level might be even more enlightening, but low levels of testing in some areas might create complications.

[4] The margin of error here corresponds to a 95% confidence interval, i.e., plus or minus roughly two standard errors of the observed proportion.

[5] In truth, these measures of uncertainty rely on statistical methods that assume the sample is small relative to the whole population (e.g., we survey 1,000 likely voters out of millions). Here it might be possible to say more since, hopefully, nearly all people with symptoms are being tested, but those technical details merit further discussion and are beyond the scope of this note.