Stacking Covid predictions under CRPS
Someone emailed me
We are currently working on forecasting the trajectory of Covid case numbers and ICU bed capacities… We would like to combine different models into an ensemble… using CRPS. . Do you know of an existing implementation or have an thoughts on how to do model stacking with just the predictive samples? The implementation you presented in your excellent paper relied on Stan, but I am not sure we are going to fit all our models using Stan.
Yes, I think it is very reasonable to aggregate forecasts not only to improve the final prediction but also to understand the reliability of individual models.
The CRPS for forecast CDFs with a finite first moment can be written as
\[crps(F,y)=E_X|X−y|− E_{X,X′}|X−X'|.\]Given K prediction models and T observations $y_1, \dots, y_T$ (number of observed icu needs) at time $1, \dots, T$, we could generate one day ahead prediction for each model, and draw $K$ predictive samples $x_{1kt}, \dots, x_{Skt}$ for the $k-th$ model and $t$-th day. The prediction is using all information before $t$ to eliminate over-fitting.
In particular, we can estimate the crps of the k-th model on $t$-th day by
\[\hat crps_{kt}= \hat crps(x_{1kt}, \dots, x_{skt},y_t)= \frac{1}{S} \sum_s |x_{skt}-y_t| - \frac{1}{2S^2} \sum_{s, j=1}^S |x_{skt}- x_{jkt}|.\]We may also average over t to summarize the predictive performance over time.
Now the goal is to aggregate these K models. It is easy to show that, when prediction is a mixture with weights $w_1, \dots, w_s$, the CRPS can be expressed as
\[\hat crps_{agg, t} (w_1, \dots, w_K) = \frac{1}{S} \sum_{k=1}^K w_k \sum_s |x_{skt}-y_t| - \frac{1}{2S^2} (\sum_{k=1}^K w_k^2 \sum_{s, j=1}^S |x_{skt}- x_{jkt}|+ \sum_{k, k'=1, k\neq k' }^K w_k w_{k'} \sum_{s, j=1}^S |x_{skt}- x_{jk't}| )\]| This expression is quadric on w. We only need to compute $\sum_s | x_{skt}-y_t | $, $\sum_{s, j=1}^S | x_{skt}- x_{jkt} | $, and |
| $\sum_{s, j=1}^S | x_{skt}- x_{jk’t} | $ for all k and k’ once and store them for all possible weight values. Now in the data we have multiple counties at the same time, we simply average over these countries. |
The final step is to optimize over $w_k$ with respect to a simplex constraint (which can be done in stan for example.)
A degree of freedom is how we weight the prediction over time. In the early days we have very few data so all predictions are noisy and we might want to downweight the early phase in the final model evaluation. Mathematically we can pre-fixed introduce a time-varying weight $\lambda_1, \dots, \lambda_T$, say $\lambda_t = 1.5-(1-t/T)^2$, it penalizes early estimates (smaller t). Finally we solve a quadric optimization:
\[\max \sum_{t=1}^T \lambda_t \hat crps_{agg, t} (w), \quad s.t. ~{0\leq w_1, \dots, w_T \leq 1, \sum_{k}w_k=1}\]