COVID-19 and Wastewater

Tracking new cases via sewage water

October 02, 2021

Predicting Daily COVID-19 Cases in British Columbia with Regression and Time Series Models

Key Highlights

Here’s a quick summary of the models I explored and what the results showed:

Curiosity

During the pandemic, I got curious about how wastewater data could help predict the number of new COVID-19 cases. So I decided to build a model to forecast daily cases in British Columbia (B.C.), using data from April 1st, 2020 to August 27th, 2021.

The challenge? To predict the number of new cases using only data from 1, 3, 5, and 7 days earlier. Along the way, I explored case counts, recovery data, wastewater surveillance, and even rainfall.

Here’s a breakdown of what I did.

Collecting the Data

I worked with two main datasets:

  1. COVID case data for B.C.
    This included daily new cases, hospitalizations, ICU numbers, active cases, and wastewater readings from five sewage plants.

  2. Rainfall data
    Pulled from weatherstats.ca, specifically for Vancouver.

The wastewater data was reported weekly, so I forward-filled it to create a daily dataset. I also created a new variable I called Adj_ww (Adjusted Wastewater), which took into account recoveries and deaths — the idea was to make it reflect the number of truly active cases better.
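For anyone who wants to try the forward-fill step, here is a minimal sketch. It assumes pandas, and the readings and dates are made up for illustration; the real dataset's column names and sampling days may differ.

```python
import pandas as pd

# Hypothetical weekly readings for a single plant (illustrative values only).
weekly = pd.Series(
    [120.0, 95.0, 140.0],
    index=pd.to_datetime(["2021-01-04", "2021-01-11", "2021-01-18"]),
    name="wastewater",
)

# Upsample to one row per calendar day. Each day without a reading
# carries the most recent weekly value forward.
daily = weekly.resample("D").ffill()

print(daily.head(8))
```

Forward-filling only ever uses past readings, which matters later on when checking for data leakage.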

Feature Engineering

Some interesting things popped out while exploring correlations:

Building the Models

Model 1: Regression

I started with an exponential regression model of the form

$$
\text{New cases} = e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots}
$$

In this model, the previous day’s new cases and recovered cases were both statistically significant, while the adjusted wastewater data (adjusted for new deaths, recoveries, and known active cases) did not improve the model. The model did an adequate job of predicting new cases from data up to 1 day prior, but it performed rather poorly for multi-day predictions.
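One simple way to fit a model of this exponential form is to take logs and run ordinary least squares on log(new cases). The sketch below does that with NumPy. This is an assumption about the fitting method, not necessarily how the model above was estimated (a Poisson GLM would be another reasonable choice), and the function names are hypothetical.

```python
import numpy as np

def fit_exponential(X, y):
    """Fit y ~ exp(b0 + X @ b) by least squares on log(y).

    Requires y > 0; days with zero cases would need handling first
    (for example, by modelling log(y + 1) instead).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(X)), X])  # prepend intercept column
    coef, *_ = np.linalg.lstsq(A, np.log(y), rcond=None)
    return coef  # [b0, b1, b2, ...]

def predict_exponential(coef, X):
    """Apply the fitted coefficients and map back to the original scale."""
    X = np.asarray(X, dtype=float)
    A = np.column_stack([np.ones(len(X)), X])
    return np.exp(A @ coef)

# Synthetic check: data generated from known coefficients b0=1.0, b1=0.2.
x = np.arange(10, dtype=float).reshape(-1, 1)
y = np.exp(1.0 + 0.2 * x[:, 0])
coef = fit_exponential(x, y)
```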

Next, I switched to a plain linear regression (with several predictors) of the form

$$
\begin{array}{l} \text{New cases}_t =\; \beta_0 + \beta_1 \, \text{New cases}_{t-1} \\ + \beta_2 \, \text{Weighted MA7}_{t-1} \\ + \beta_3 \, \text{Adjusted wastewater}_{t-1} \end{array}
$$

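It incorporates the previous day’s new cases, a weighted 7-day moving average (Weighted MA7) as of the previous day, and the previous day’s adjusted wastewater reading.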

The model performed decently and all the variables were statistically significant. I liked this model because it was easy to understand — great for interpreting what’s actually influencing case numbers.
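To make the structure of this regression concrete, here is a rough sketch of how the lag-1 features could be built and fitted with NumPy. The weighting scheme for the moving average is an assumption (linearly increasing weights, most recent day highest), since the exact weights are not stated here, and both function names are hypothetical.

```python
import numpy as np

def weighted_ma7(series, weights=None):
    """Weighted average of the 7 days ending at each index (NaN for the first 6)."""
    series = np.asarray(series, dtype=float)
    if weights is None:
        # Assumption: weights 1..7, with the most recent day weighted highest.
        weights = np.arange(1, 8, dtype=float)
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    out = np.full(len(series), np.nan)
    for t in range(6, len(series)):
        out[t] = series[t - 6 : t + 1] @ weights
    return out

def fit_lag1_regression(new_cases, adj_ww):
    """Least-squares fit of new_cases[t] on values known at day t-1."""
    new_cases = np.asarray(new_cases, dtype=float)
    adj_ww = np.asarray(adj_ww, dtype=float)
    ma7 = weighted_ma7(new_cases)
    # Start at t=7 so that ma7[t-1] is defined for every row.
    t = np.arange(7, len(new_cases))
    A = np.column_stack(
        [np.ones(len(t)), new_cases[t - 1], ma7[t - 1], adj_ww[t - 1]]
    )
    coef, *_ = np.linalg.lstsq(A, new_cases[t], rcond=None)
    return coef  # [beta0, beta1, beta2, beta3]
```

Every column on the right-hand side is indexed at `t - 1`, so a row for day `t` never sees same-day information.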

Next, I tried a transfer function time series model, which is a kind of ARIMA model that can include external predictors (like active cases). The transfer function was in the form of

$$
y_t = \frac{\beta(b)}{\mathcal{V}(b)}(x_t) + \frac{\Theta(b)}{\phi(b)}(z_t)
$$

Where

Models like this can be unstable when the underlying time series is non-stationary, but I saw no evidence of spurious regression, so I moved forward with it. The final specification was:

$$
\begin{array}{l} \text{New cases}_t = \beta_0 + \beta_1 (\text{active cases}_{t-1}) + \beta_2 (\text{new cases}_{t-1}) \\ + \beta_3 (\text{new cases}_{t-2}) + \beta_4 (\text{new cases}_{t-3}) \\ + \beta_5 (\text{new cases}_{t-4}) + \beta_6 (\text{new cases}_{t-5}) \end{array}
$$

Although this time series model is more complicated and excludes noisy features like wastewater and recoveries, it performed slightly better in terms of accuracy (lower RMSE). That said, it was also more fragile when dealing with imperfect data.
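The final equation has the shape of an autoregression on five lags of new cases plus one lagged external input. As a simplified illustration, the sketch below builds that design matrix and fits it by ordinary least squares with NumPy. That is an assumption made to keep the example short: a full transfer function or ARIMA fit would also model the error structure, and the function names here are hypothetical.

```python
import numpy as np

def build_lagged_design(new_cases, active_cases, n_lags=5):
    """Rows: [1, active[t-1], new[t-1], ..., new[t-n_lags]]; target: new[t]."""
    new_cases = np.asarray(new_cases, dtype=float)
    active_cases = np.asarray(active_cases, dtype=float)
    t = np.arange(n_lags, len(new_cases))  # first day with all lags available
    cols = [np.ones(len(t)), active_cases[t - 1]]
    cols += [new_cases[t - k] for k in range(1, n_lags + 1)]
    return np.column_stack(cols), new_cases[t]

def fit_lagged_model(new_cases, active_cases, n_lags=5):
    """Least-squares estimate of [beta0, beta1, ..., beta6] for n_lags=5."""
    A, y = build_lagged_design(new_cases, active_cases, n_lags)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef
```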

Here’s the comparison based on my results:

| Model       | RMSE  |
|-------------|-------|
| Regression  | 86.24 |
| Time Series | 84.99 |
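For reference, RMSE is the square root of the mean squared difference between actual and predicted values, so it is expressed in the same units as the target (daily cases here). A minimal standard-library sketch, with a hypothetical function name and toy inputs:

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Toy example: errors of 3 and 4 give sqrt((9 + 16) / 2).
print(rmse([0, 0], [3, 4]))
```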

The difference was small. I ended up liking both for different reasons:

Does Wastewater Data Actually Help?

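I was really curious about this! And the answer is: yes, at least in the linear regression, where the adjusted wastewater term was statistically significant. It did not improve the exponential model, though.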

Backtesting (to Avoid Data Leakage)

For each prediction window (1, 3, 5, 7 days), I made sure I wasn’t cheating by accidentally training on future data. I created different datasets where the features only included data available before the prediction date.

It was a bit manual, but I confirmed it was working by checking some specific index ranges — and everything looked good.
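The leakage check described above can be sketched in a few lines. The idea is that for a horizon of h days, the feature for target day t may only come from day t - h or earlier, and the train/test split must be chronological rather than random. The helper names and the 80/20 split below are assumptions for illustration, not the exact procedure used here.

```python
def make_horizon_pairs(series, horizon):
    """Pair each target day t with the value observed `horizon` days earlier."""
    return [(series[t - horizon], series[t]) for t in range(horizon, len(series))]

def chronological_split(pairs, train_fraction=0.8):
    """Split without shuffling, so the test set lies strictly after the training set."""
    cut = int(len(pairs) * train_fraction)
    return pairs[:cut], pairs[cut:]

# Using day indices as values makes it easy to eyeball which day each feature came from.
days = list(range(10))
for horizon in (1, 3, 5, 7):
    pairs = make_horizon_pairs(days, horizon)
    # Every feature must come from exactly `horizon` days before its target.
    assert all(target - feature == horizon for feature, target in pairs)
```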

Does Rain Dilute Wastewater RNA?

I looked at whether rainfall affects wastewater viral load — and it does seem to.

The theory is that rainwater entering the sewer system dilutes the virus concentration. So if it rains a lot, RNA levels might appear lower even if the infection rate hasn’t changed.

I didn’t end up correcting for it in my final models, but it’s definitely something I’d like to explore further.
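A first step toward exploring this could be to check how strongly daily rainfall and wastewater readings move together. The sketch below computes a Pearson correlation with the standard library. The numbers are invented purely to show the mechanics and are not results from this project; a strongly negative value on real data would be consistent with the dilution theory, though it would not prove it.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values: rainfall in mm and a wastewater RNA reading per day.
rain_mm = [0.0, 2.0, 10.0, 25.0, 5.0]
rna_reading = [100.0, 98.0, 80.0, 60.0, 90.0]
r = pearson(rain_mm, rna_reading)
```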

Thanks for reading! If you’ve got ideas for how to improve these models or just want to geek out about wastewater epidemiology (it’s a thing now), feel free to reach out.