Realtime $R_t$

In "The Metric We Need to Manage COVID-19", Kevin Systrom, who co-founded Instagram, wrote about $R_t$, a measure of the effective reproductive rate for COVID-19. You might have heard about $R_0$, the basic reproductive rate, which measures how many secondary infections are produced by a single primary infection. If $R_0$ is greater than 1, the disease spreads, while if it is less than 1, it will eventually peter out.

$R_t$ is an improvement over $R_0$, since it changes in response to health interventions, which makes it a metric we can actually track to see how well we are doing. The techniques for calculating it are based on Bayesian probability, which is best explained by referring to the original blog post, his notebook, or his follow-up posts on the topic.

This post extends his notebook, which calculated $R_t$ for the states in the US, by calculating $R_t$ for Canada. What follows is mostly his original notebook, with his explanations, except where I have made some changes to swap in Canadian data or to make editorial comments.

Technical Setup

import pandas as pd
import numpy as np

from matplotlib import pyplot as plt
from matplotlib.dates import date2num, num2date
from matplotlib import dates as mdates
from matplotlib import ticker
from matplotlib.colors import ListedColormap
from matplotlib.patches import Patch

from scipy import stats as sps
from scipy.interpolate import interp1d

from IPython.display import clear_output
k = np.array([20, 40, 55, 90])  # example daily case counts from the original notebook's walkthrough (not used in this post)

# We create an array for every possible value of Rt
R_T_MAX = 12
r_t_range = np.linspace(0, R_T_MAX, R_T_MAX*100+1)
GAMMA = 1/7  # reciprocal of the serial interval (assumed to be 7 days)

FILTERED_REGIONS_CANADA = ['Grand Princess',
                           'Diamond Princess', 
                           'Recovered', 
                           'Northwest Territories',
                           'Prince Edward Island',
                           'Yukon']
MIN_CASES = 10
MIN_CASES_BY_PROVINCE = {'New Brunswick': 4, 'Prince Edward Island': 2}

%config InlineBackend.figure_format = 'retina'

Get Canadian Data

To start, we will load the data for Canadian provinces from the Johns Hopkins CSSE repository, and transform it to match the data format in Kevin's notebook.

CSSE_URL = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/'
CONFIRMED_URL = f'{CSSE_URL}/time_series_covid19_confirmed_global.csv'
confirmed = pd.read_csv(CONFIRMED_URL)
canada_wide = confirmed[confirmed["Country/Region"] == 'Canada']

From wide to long

The data is originally in wide format, which means that each new day is a new column, so the data stretches as wide as there are days. Converting from wide to long format is a data processing step you will frequently have to do as a data scientist. Here we show the first 10 columns of the dataset.

You will also notice that it contains Grand Princess and Diamond Princess, which are two cruise ships that were early sources of COVID-19 infections in Canada. We will remove those later, along with Recovered, which we don't need since we are only looking at provinces.

canada_wide.iloc[:, :10].style.hide_index()
Province/State Country/Region Lat Long 1/22/20 1/23/20 1/24/20 1/25/20 1/26/20 1/27/20
Alberta Canada 53.933300 -116.576500 0 0 0 0 0 0
British Columbia Canada 49.282700 -123.120700 0 0 0 0 0 0
Grand Princess Canada 37.648900 -122.665500 0 0 0 0 0 0
Manitoba Canada 53.760900 -98.813900 0 0 0 0 0 0
New Brunswick Canada 46.565300 -66.461900 0 0 0 0 0 0
Newfoundland and Labrador Canada 53.135500 -57.660400 0 0 0 0 0 0
Nova Scotia Canada 44.682000 -63.744300 0 0 0 0 0 0
Ontario Canada 51.253800 -85.323200 0 0 0 0 1 1
Prince Edward Island Canada 46.510700 -63.416800 0 0 0 0 0 0
Quebec Canada 52.939900 -73.549100 0 0 0 0 0 0
Saskatchewan Canada 52.939900 -106.450900 0 0 0 0 0 0
Diamond Princess Canada 0.000000 0.000000 0 0 0 0 0 0
Recovered Canada 0.000000 0.000000 0 0 0 0 0 0
Northwest Territories Canada 64.825500 -124.845700 0 0 0 0 0 0
Yukon Canada 64.282300 -135.000000 0 0 0 0 0 0
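
Before melting the real data, here is a minimal toy sketch of what melt does, using a tiny made-up wide-format frame (the values are hypothetical, not real case counts):

toy = pd.DataFrame({'Province/State': ['Alberta', 'Ontario'],
                    '1/22/20': [0, 0],
                    '1/23/20': [0, 1]})

toy.melt(id_vars=['Province/State'],
         var_name='date',
         value_name='positive')
# One row per (province, day) pair, with the daily count in the 'positive' column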

The following code performs a melt on the data, which converts from wide to long format.

dates = canada_wide.columns[4:]
provinces = canada_wide.melt(id_vars=['Province/State'], 
                                  value_vars=dates, 
                                  var_name='date', 
                                  value_name='positive').rename(columns={'Province/State': 'province'})
provinces.date = pd.to_datetime(provinces.date)
provinces = provinces.groupby(['province', 'date']).sum().squeeze()

We are left with the data for the Canadian provinces, with the province and date being the index.

provinces
province  date      
Alberta   2020-01-22     0
          2020-01-23     0
          2020-01-24     0
          2020-01-25     0
          2020-01-26     0
                        ..
Yukon     2020-04-19     9
          2020-04-20    11
          2020-04-21    11
          2020-04-22    11
          2020-04-23    11
Name: positive, Length: 1395, dtype: int64

Apply Gaussian smoothing

This is a step that Kevin does in his notebook, which produces a smoother, more Gaussian data series. Here are his original notes:

Taking a look at the state, we need to start the analysis when there are a consistent number of cases each day. Find the last zero new case day and start on the day after that.

Also, case reporting is very erratic based on testing backlogs, etc. To get the best view of the 'true' data we can, I've applied a gaussian filter to the time series. This is obviously an arbitrary choice, but you'd imagine the real world process is not nearly as stochastic as the actual reporting.

And here is this process applied to Ontario's data.

province_name = 'Ontario'

def prepare_cases(cases, min_cases=MIN_CASES):
    new_cases = cases.diff()

    smoothed = new_cases.rolling(9,
        win_type='gaussian',
        min_periods=1,
        center=True).mean(std=3).round()
    
    idx_start = np.searchsorted(smoothed, min_cases)
    
    smoothed = smoothed.iloc[idx_start:]
    original = new_cases.loc[smoothed.index]
    
    return original, smoothed

cases = provinces.xs(province_name).rename(f"{province_name} cases")

original, smoothed = prepare_cases(cases)

original.plot(title=f"{province_name} New Cases per Day",
              c='k',
              linestyle=':',
              alpha=.5,
              label='Actual',
              legend=True,
              figsize=(500/72, 400/72))

ax = smoothed.plot(label='Smoothed',
                   legend=True)

ax.get_figure().set_facecolor('w')

Running the Algorithm

Choosing the Gaussian $\sigma$ for $P(R_t|R_{t-1})$

Note: you can safely skip this section if you trust that we chose the right value of $\sigma$ for the process below. Otherwise, read on.

The original approach simply selects yesterday's posterior as today's prior. While intuitive, doing so doesn't allow for our belief that the value of $R_t$ has likely changed from yesterday. To allow for that change, we apply Gaussian noise to the prior distribution with some standard deviation $\sigma$. The higher $\sigma$ is, the more noise, and the more we expect the value of $R_t$ to drift each day.

Interestingly, applying noise on noise iteratively means that there will be a natural decay of distant posteriors. This has a similar effect to windowing, but is more robust and doesn't arbitrarily forget posteriors after a certain time like my previous approach. Specifically, windowing computed a fixed $R_t$ at each time $t$ that explained the surrounding $w$ days of cases, while the new approach computes a series of $R_t$ values that explains all the cases, assuming that $R_t$ fluctuates by about $\sigma$ each day.
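
To make the drift step concrete, here is a short sketch of how the Gaussian noise is applied in practice: we build a "process matrix" from a normal distribution over the r_t_range grid defined in the setup cell (using a hypothetical $\sigma$ of 0.15) and multiply yesterday's posterior by it. This is the same construction used in get_posteriors below.

# Each column j holds a Gaussian centred on r_t_range[j]: the distribution of
# today's R_t given that yesterday's R_t was r_t_range[j]
process_matrix = sps.norm(loc=r_t_range, scale=0.15).pdf(r_t_range[:, None])
process_matrix /= process_matrix.sum(axis=0)

# A prior concentrated entirely at R_t = 2.0 drifts, after one day, into a
# Gaussian centred near 2.0 with standard deviation of roughly 0.15
prior = np.zeros_like(r_t_range)
prior[np.searchsorted(r_t_range, 2.0)] = 1.0
drifted = process_matrix @ prior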

However, there's still an arbitrary choice: what should $\sigma$ be? Adam Lerer pointed out that we can use the process of maximum likelihood to inform our choice. Here's how it works:

Maximum likelihood says that we'd like to choose a $\sigma$ that maximizes the likelihood of seeing our data $k$: $P(k|\sigma)$. Since $\sigma$ is a fixed value, let's leave it out of the notation, so we're trying to maximize $P(k)$ over all choices of $\sigma$.

Since $P(k)=P(k_0,k_1,\ldots,k_t)=P(k_0)P(k_1)\ldots P(k_t)$, we need to define $P(k_t)$. It turns out this is the denominator of Bayes' rule:

$$P(R_t|k_t) = \frac{P(k_t|R_t)P(R_t)}{P(k_t)}$$

To calculate it, we notice that the numerator is actually just the joint distribution of $k$ and $R$:

$$ P(k_t,R_t) = P(k_t|R_t)P(R_t) $$

We can marginalize the distribution over $R_t$ to get $P(k_t)$:

$$ P(k_t) = \sum_{R_{t}}{P(k_t|R_t)P(R_t)} $$

So, if we sum the distribution of the numerator over all values of $R_t$, we get $P(k_t)$. And since we're calculating that anyway as we're calculating the posterior, we'll just keep track of it separately.

Since we're looking for the value of $\sigma$ that maximizes $P(k)$ overall, we actually want to maximize:

$$\prod_{t,i}{p(k_{ti})}$$

where $t$ ranges over all times and $i$ over each province.

Since we're multiplying lots of tiny probabilities together, it can be easier (and less error-prone) to take the $\log$ of the values and add them together. Remember that $\log{ab}=\log{a}+\log{b}$. And since logarithms are monotonically increasing, maximizing the sum of the $\log$ of the probabilities is the same as maximizing the product of the non-logarithmic probabilities for any choice of $\sigma$.
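
As a quick sanity check with made-up numbers, the sum of the logs matches the log of the product:

probs = np.array([1e-5, 2e-7, 3e-6])  # made-up tiny probabilities

np.log(probs).sum()     # -39.65...
np.log(np.prod(probs))  # same value; with many more terms the raw product would underflow to 0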

Function for Calculating the Posteriors

To calculate the posteriors we follow these steps:

  1. Calculate $\lambda$ - the expected arrival rate for every day's Poisson process
  2. Calculate each day's likelihood distribution over all possible values of $R_t$
  3. Calculate the process matrix based on the value of $\sigma$ we discussed above
  4. Calculate our initial prior because our first day does not have a previous day from which to take the posterior
  5. Loop from day 1 to the end, doing the following:
    • Calculate the prior by applying the Gaussian to yesterday's prior.
    • Apply Bayes' rule by multiplying this prior and the likelihood we calculated in step 2.
    • Divide by the probability of the data (also Bayes' rule)

def get_posteriors(sr, sigma=0.15):

    # (1) Calculate Lambda
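    # lambda_t = k_{t-1} * exp(GAMMA * (R_t - 1)), evaluated for every R_t in r_t_range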
    lam = sr[:-1].values * np.exp(GAMMA * (r_t_range[:, None] - 1))

    
    # (2) Calculate each day's likelihood
    likelihoods = pd.DataFrame(
        data = sps.poisson.pmf(sr[1:].values, lam),
        index = r_t_range,
        columns = sr.index[1:])
    
    # (3) Create the Gaussian Matrix
    process_matrix = sps.norm(loc=r_t_range,
                              scale=sigma
                             ).pdf(r_t_range[:, None]) 

    # (3a) Normalize all rows to sum to 1
    process_matrix /= process_matrix.sum(axis=0)
    
    # (4) Calculate the initial prior
    prior0 = sps.gamma(a=4).pdf(r_t_range)
    prior0 /= prior0.sum()

    # Create a DataFrame that will hold our posteriors for each day
    # Insert our prior as the first posterior.
    posteriors = pd.DataFrame(
        index=r_t_range,
        columns=sr.index,
        data={sr.index[0]: prior0}
    )
    
    # We said we'd keep track of the sum of the log of the probability
    # of the data for maximum likelihood calculation.
    log_likelihood = 0.0

    # (5) Iteratively apply Bayes' rule
    for previous_day, current_day in zip(sr.index[:-1], sr.index[1:]):

        #(5a) Calculate the new prior
        current_prior = process_matrix @ posteriors[previous_day]
        
        #(5b) Calculate the numerator of Bayes' Rule: P(k|R_t)P(R_t)
        numerator = likelihoods[current_day] * current_prior
        
        #(5c) Calculate the denominator of Bayes' Rule P(k)
        denominator = np.sum(numerator)
        
        # Execute full Bayes' Rule
        posteriors[current_day] = numerator/denominator
        
        # Add to the running sum of log likelihoods
        log_likelihood += np.log(denominator)
    
    return posteriors, log_likelihood

# Note that we're fixing sigma to a value just for the example
posteriors, log_likelihood = get_posteriors(smoothed, sigma=.25)

The Result

Below you can see every day (row) of the posterior distribution plotted simultaneously. The posteriors start without much confidence (wide) and become progressively more confident (narrower) about the true value of $R_t$.

ax = posteriors.plot(title=f'{province_name} - Daily Posterior for $R_t$',
           legend=False, 
           lw=1,
           c='k',
           alpha=.3,
           xlim=(0.4,6))

ax.set_xlabel('$R_t$');

Plotting in the Time Domain with Credible Intervals

Since our results include uncertainty, we'd like to be able to view the most likely value of $R_t$ along with its highest-density interval.

def highest_density_interval(pmf, p=.9):
    # If we pass a DataFrame, just call this recursively on the columns
    if(isinstance(pmf, pd.DataFrame)):
        return pd.DataFrame([highest_density_interval(pmf[col], p=p) for col in pmf],
                            index=pmf.columns)
    
    cumsum = np.cumsum(pmf.values)
    best = None
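    # Brute-force search over all (low, high) index pairs: for each starting
    # index, take the first interval whose cumulative mass exceeds p and keep
    # the narrowest one found so far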
    for i, value in enumerate(cumsum):
        for j, high_value in enumerate(cumsum[i+1:]):
            if (high_value-value > p) and (not best or j<best[1]-best[0]):
                best = (i, i+j+1)
                break
            
    low = pmf.index[best[0]]
    high = pmf.index[best[1]]
    return pd.Series([low, high], index=[f'Low_{p*100:.0f}', f'High_{p*100:.0f}'])

# Note that this takes a while to execute - it's not the most efficient algorithm
hdis = highest_density_interval(posteriors, p=.9)

most_likely = posteriors.idxmax().rename('ML')

result = pd.concat([most_likely, hdis], axis=1)

result.tail()
ML Low_90 High_90
date
2020-04-19 0.99 0.65 1.33
2020-04-20 0.95 0.61 1.30
2020-04-21 0.68 0.34 1.04
2020-04-22 0.89 0.54 1.25
2020-04-23 0.99 0.65 1.36

Plotting $R_t$ for Ontario

def plot_rt(result, ax, state_name):
    
    ax.set_title(f"{state_name}")
    
    # Colors
    ABOVE = [1,0,0]
    MIDDLE = [1,1,1]
    BELOW = [0,0,0]
    cmap = ListedColormap(np.r_[
        np.linspace(BELOW,MIDDLE,25),
        np.linspace(MIDDLE,ABOVE,25)
    ])
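    # Map R_t onto [0, 1] for the colormap: values at or below 0.5 clip to the
    # 'below' color (black), 1.0 sits at white, and values at or above 1.5 clip
    # to the 'above' color (red)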
    color_mapped = lambda y: np.clip(y, .5, 1.5)-.5
    
    index = result['ML'].index.get_level_values('date')
    values = result['ML'].values
    
    # Plot dots and line
    ax.plot(index, values, c='k', zorder=1, alpha=.25)
    ax.scatter(index,
               values,
               s=40,
               lw=.5,
               c=cmap(color_mapped(values)),
               edgecolors='k', zorder=2)
    
    # Aesthetically, extrapolate credible interval by 1 day either side
    lowfn = interp1d(date2num(index),
                     result['Low_90'].values,
                     bounds_error=False,
                     fill_value='extrapolate')
    
    highfn = interp1d(date2num(index),
                      result['High_90'].values,
                      bounds_error=False,
                      fill_value='extrapolate')
    
    extended = pd.date_range(start=pd.Timestamp('2020-03-01'),
                             end=index[-1]+pd.Timedelta(days=1))
    
    ax.fill_between(extended,
                    lowfn(date2num(extended)),
                    highfn(date2num(extended)),
                    color='k',
                    alpha=.1,
                    lw=0,
                    zorder=3)

    ax.axhline(1.0, c='k', lw=1, label='$R_t=1.0$', alpha=.25);
    
    # Formatting
    ax.xaxis.set_major_locator(mdates.MonthLocator())
    ax.xaxis.set_major_formatter(mdates.DateFormatter('%b'))
    ax.xaxis.set_minor_locator(mdates.DayLocator())
    
    ax.yaxis.set_major_locator(ticker.MultipleLocator(1))
    ax.yaxis.set_major_formatter(ticker.StrMethodFormatter("{x:.1f}"))
    ax.yaxis.tick_right()
    ax.spines['left'].set_visible(False)
    ax.spines['bottom'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.margins(0)
    ax.grid(which='major', axis='y', c='k', alpha=.1, zorder=-2)
    ax.margins(0)
    ax.set_ylim(0.0, 5.0)
    ax.set_xlim(pd.Timestamp('2020-03-01'), result.index.get_level_values('date')[-1]+pd.Timedelta(days=1))
    fig.set_facecolor('w')

    
fig, ax = plt.subplots(figsize=(600/72,400/72))

plot_rt(result, ax, province_name)
ax.set_title(f'Real-time $R_t$ for {province_name}')
ax.xaxis.set_major_locator(mdates.WeekdayLocator())
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b %d'))

Choosing the optimal $\sigma$

In the previous section we described how to choose an optimal $\sigma$, but we just assumed a value. Now that we can evaluate each province with any $\sigma$, we have the tools to choose the optimal one.

Above we said we'd choose the value of $\sigma$ that maximizes the likelihood of the data $P(k)$. Since we don't want to overfit to any one province, we choose the $\sigma$ that maximizes $P(k)$ over every province. To do this, we add up all the log likelihoods per province for each value of $\sigma$, then choose the maximum.

Note: this takes a while!

sigmas = np.linspace(1/20, 1, 20)

targets = ~provinces.index.get_level_values('province').isin(FILTERED_REGIONS_CANADA)
provinces_to_process = provinces.loc[targets]

results = {}

for province_name, cases in provinces_to_process.groupby(level='province'):
    
    print(province_name)
    new, smoothed = prepare_cases(cases, MIN_CASES_BY_PROVINCE.get(province_name, MIN_CASES))
    
    result = {}
    
    # Holds all posteriors with every given value of sigma
    result['posteriors'] = []
    
    # Holds the log likelihood across all k for each value of sigma
    result['log_likelihoods'] = []
    
    for sigma in sigmas:
        posteriors, log_likelihood = get_posteriors(smoothed, sigma=sigma)
        result['posteriors'].append(posteriors)
        result['log_likelihoods'].append(log_likelihood)
    
    # Store all results keyed off of province name
    results[province_name] = result
    clear_output(wait=True)

print('Done.')
Done.

Sum for each value of $\sigma$ across provinces

Now that we have all the log likelihoods, we can sum them for each value of $\sigma$ across provinces, graph the result, then choose the maximum.

# Each index of this array holds the total of the log likelihoods for
# the corresponding index of the sigmas array.
total_log_likelihoods = np.zeros_like(sigmas)

# Loop through each province's results and add the log likelihoods to the running total.
for province_name, result in results.items():
    total_log_likelihoods += result['log_likelihoods']

# Select the index with the largest log likelihood total
max_likelihood_index = total_log_likelihoods.argmax()

# Select the value that has the highest log likelihood
sigma = sigmas[max_likelihood_index]

# Plot it
fig, ax = plt.subplots()
ax.set_title(f"Maximum Likelihood value for $\sigma$ = {sigma:.2f}");
ax.plot(sigmas, total_log_likelihoods)
ax.axvline(sigma, color='k', linestyle=":")

Compile Final Results

Given that we've selected the optimal $\sigma$, let's grab the precalculated posterior corresponding to that value of $\sigma$ for each province. Let's also calculate the 90% and 50% highest density intervals (this takes a little while) and the most likely value.

final_results = None

for province_name, result in results.items():
    print(province_name)
    posteriors = result['posteriors'][max_likelihood_index]
    hdis_90 = highest_density_interval(posteriors, p=.9)
    hdis_50 = highest_density_interval(posteriors, p=.5)
    most_likely = posteriors.idxmax().rename('ML')
    result = pd.concat([most_likely, hdis_90, hdis_50], axis=1)
    if final_results is None:
        final_results = result
    else:
        final_results = pd.concat([final_results, result])
    clear_output(wait=True)

print('Done.')
Done.

Plot All Canadian Provinces

ncols = 3
nrows = int(np.ceil(len(results) / ncols))

fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=(15, nrows*3))

for i, (province_name, result) in enumerate(final_results.groupby('province')):
    plot_rt(result, axes.flat[i], province_name)

fig.tight_layout()
fig.set_facecolor('w')
plt.savefig('books_read.png')

Standings

FULL_COLOR = [.7,.7,.7]
NONE_COLOR = [179/255,35/255,14/255]
PARTIAL_COLOR = [.5,.5,.5]
ERROR_BAR_COLOR = [.3,.3,.3]

filtered = final_results.index.get_level_values(0).isin(FILTERED_REGIONS_CANADA)
mr = final_results.loc[~filtered].groupby(level=0)[['ML', 'High_90', 'Low_90']].last()

def plot_standings(mr, figsize=None, title='Most Recent $R_t$ by Province'):
    if not figsize:
        figsize = ((15.9/50)*len(mr)+.1,2.5)
        
    fig, ax = plt.subplots(figsize=figsize)

    ax.set_title(title)
    err = mr[['Low_90', 'High_90']].sub(mr['ML'], axis=0).abs()
    bars = ax.bar(mr.index,
                  mr['ML'],
                  width=.825,
                  color=FULL_COLOR,
                  ecolor=ERROR_BAR_COLOR,
                  capsize=2,
                  error_kw={'alpha':.5, 'lw':1},
                  yerr=err.values.T)

    labels = mr.index.to_series().replace({'District of Columbia':'DC'})
    ax.set_xticklabels(labels, rotation=90, fontsize=11)
    ax.margins(0)
    ax.set_ylim(0,2.)
    ax.axhline(1.0, linestyle=':', color='k', lw=1)

    fig.set_facecolor('w')
    return fig, ax

mr.sort_values('ML', inplace=True)
plot_standings(mr, figsize=(8,4));
mr.sort_values('High_90', inplace=True)
plot_standings(mr, figsize=(8,4));