Bayesian prediction and the likelihood principle

Posted by Yuling Yao on May 19, 2024.       Tag: theory  

Bertrand and I have recently finished our review paper on Bayesian prediction. One of my favorite parts there is the distinction between inferential and predictive Bayes.

A Bayesian procedure treats observed data as a realization of random variables and computes the conditional distribution of unknown quantities given the observed data. In this sense, there is no difference between posterior inference $p(\theta \vert y)$ and posterior prediction $p(y_{n+1} \vert y)$, as both the parameter $\theta$ and the next unseen data point $y_{n+1}$ can be viewed as random variables.
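Indeed, the two are linked by the usual decomposition (assuming, as the model typically does, that $y_{n+1}$ is conditionally independent of $y$ given $\theta$):

$$p(y_{n+1} \vert y) = \int p(y_{n+1} \vert \theta) \, p(\theta \vert y) \, d\theta,$$

so prediction is just inference pushed forward through the data model.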

Bayesian prediction makes statements about future/unseen data: $p(y_{n+1} \vert y_{1:n})$. However, the innocent index notation $n+1$ already masks one additional task in prediction: to define the replication. In a frequentist calculation, the repeated-sampling model $y_i \vert \theta$ needs to be specified for any inference, while Bayesian *inference* does not require repeated sampling of $y_i \vert \theta$: we can define a Bayesian inference with a single data point. Bayesian prediction, including its formulation, validation, and evaluation, needs to specify the sampling procedure. As a bold statement, Bayesian prediction goes beyond the likelihood principle. To explain, consider four examples:

Example 1: Binomial vs. negative binomial likelihood. The likelihood principle states that when two data generating experiments both involve $\theta$ with outcomes $y_1$ and $y_2$, and the outcome-experiment pairs have the same likelihood $p_1(y_1 \vert \theta) \propto p_2(y_2 \vert \theta)$ (viewed as functions of $\theta$), then the two experiments and two observations provide the same inferential information about $\theta$. As a classic example, when we observe $y=9$ positives from $n=10$ trials, it does not matter whether the data are generated from $y \sim \mathrm{Binomial}(\theta, n=10)$ or $n-y \sim \mathrm{NegativeBinomial}(\theta, y=9)$, as they provide the same posterior inference of $\theta \vert y, n$. But these two sampling procedures define two distinct replications: we can either fix $n$ and predict a new $y$, or fix $y$ and predict a new $n$, and this choice impacts the evaluation of the prediction. The marginal likelihoods of the two procedures are different and cannot be compared directly (as they involve different sample spaces); the posterior predictive check will result in different tail probabilities: for example, given $\theta=0.7$, the tail probability of $n-y = 1$ in the negative binomial predictive distribution is .15, while the tail probability of $y = 9$ in the binomial predictive distribution is .03.
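To make the tail computation concrete, here is a minimal sketch with scipy. The inequality conventions ($y > 9$ for the binomial tail, $n-y \le 1$ for the negative binomial tail) are my reading of the numbers above; whether the observed value is included in the tail is itself part of defining the check.

```python
from scipy import stats

theta = 0.7

# Binomial replication: n = 10 trials fixed, predict the number of successes y
tail_binom = stats.binom.sf(9, n=10, p=theta)   # P(y > 9) ~= 0.03

# Negative binomial replication: y = 9 successes fixed, predict failures n - y
# (scipy's nbinom counts failures before the 9th success)
tail_nbinom = stats.nbinom.cdf(1, 9, theta)     # P(n - y <= 1) ~= 0.15

print(tail_binom, tail_nbinom)
```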

Example 2: sequential prediction. Consider a sequence of observed data $y_{1}, \dots, y_n$ from the process $y_1 \sim \mathrm{normal}(0, \sigma)$, $y_{i+1} \sim \mathrm{normal}(x_i, \sigma)$, $i \geq 1$, where $x_i = \frac{1}{i} \sum_{j=1}^{i} y_j$ is the running mean and $\sigma$ is the only unknown parameter to infer. When making predictions for a new dataset given a parameter value $\sigma$, there are two ways to define the replication process $(y_{1}^{\mathrm{rep}}, y_{2}^{\mathrm{rep}}, \ldots, y_{n}^{\mathrm{rep}})$: (a) treat $\{x_1, \dots, x_n\}$ as a fixed design matrix, and for each $i$ sample $y_{i+1}^{\mathrm{rep}}$ from $\mathrm{normal}(x_i, \sigma)$ as in a usual regression model; (b) treat $\{x_1, \dots, x_n\}$ as functions of the outcome $y$, and for each $i$ compute $x_i^{\mathrm{rep}} = \frac{1}{i} \sum_{j=1}^{i} y_j^{\mathrm{rep}}$ and then sample $y_{i+1}^{\mathrm{rep}}$ from $\mathrm{normal}(x_i^{\mathrm{rep}}, \sigma)$. These two sampling processes entail the same likelihood given the existing data, and thereby do not impact the Bayesian inference of $\sigma \vert y$, but this sampling difference is relevant to Bayesian model checking.
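A minimal numpy sketch of the two replication schemes (the function name and scheme labels are mine):

```python
import numpy as np

def replicate(y, sigma, scheme, rng):
    """One replicated dataset under scheme (a) fixed design or (b) regenerated design."""
    n = len(y)
    y_rep = np.empty(n)
    y_rep[0] = rng.normal(0.0, sigma)
    for i in range(1, n):
        if scheme == "a":
            x = np.mean(y[:i])       # running mean of the *observed* data, held fixed
        else:
            x = np.mean(y_rep[:i])   # running mean recomputed from the replication
        y_rep[i] = rng.normal(x, sigma)
    return y_rep

rng = np.random.default_rng(0)
y = replicate(np.zeros(100), 1.0, "b", rng)  # scheme (b) on dummy input = the true process
rep_a = replicate(y, 1.0, "a", rng)
rep_b = replicate(y, 1.0, "b", rng)
```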

Example 3: two-by-two contingency table. It is a classic task to test independence in a two-by-two table, where we observe counts in four cells spanned by a binary treatment and a binary outcome indicator. A frequentist test needs to specify which margins are fixed; for example, Fisher's exact test is only exact when both margins are fixed. Bayesian inference takes a shortcut, as the posterior does not change when we additionally condition on a constraint that the observed data already satisfy. That is, $p(\theta \vert Y = y^{\mathrm{obs}}) = p(\theta \vert Y = y^{\mathrm{obs}}, T(Y) = T(y^{\mathrm{obs}}))$, where $T(\cdot)$ is any function of the data to be constrained. Clearly, the resulting sampling distributions of future replications will generally differ: $p(Y^{\mathrm{rep}} \vert y^{\mathrm{obs}}, T(Y^{\mathrm{rep}}) = T(y^{\mathrm{obs}})) \neq p(Y^{\mathrm{rep}} \vert y^{\mathrm{obs}})$. Once again, the sampling aspect is ignorable in Bayesian inference but relevant in Bayesian prediction and model checking.
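A sketch of the two replication schemes, with hypothetical counts and a flat Dirichlet prior of my choosing. Under independence, the table conditional on both margins is hypergeometric and does not depend on the cell probabilities at all, which is exactly what makes Fisher's test exact:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
y_obs = np.array([[12, 8], [5, 15]])          # hypothetical observed 2x2 counts
N = y_obs.sum()
row1, col1 = y_obs[0].sum(), y_obs[:, 0].sum()

# One posterior draw of cell probabilities (flat Dirichlet prior, for illustration)
p = rng.dirichlet(y_obs.ravel() + 1)

# Replication 1: only the grand total N is fixed
rep_free = rng.multinomial(N, p).reshape(2, 2)

# Replication 2: both margins fixed; under independence the (1,1) cell is
# hypergeometric and the remaining cells are determined by the margins
a = stats.hypergeom.rvs(M=N, n=row1, N=col1, random_state=rng)
rep_fixed = np.array([[a, row1 - a],
                      [col1 - a, N - row1 - col1 + a]])
```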

Example 4: data-dependent stopping time. Bayesian inference has a key advantage in handling data-dependent stopping: as long as the parameters of the data model are a priori independent of the parameters of the stopping rule, and all data used to make stopping decisions are recorded for analysis, the stopping rule is ignorable and we can carry out standard Bayesian inference. Suppose we collect data $y_{1}, \dots, y_n$ from the generating process $y_i \sim \mathrm{normal}(\mu, 1)$ with prior $\mu \sim \mathrm{normal}(0,1)$. Whether the observed data come from IID sampling with fixed $n$ or from a sequential collection with stopping time $n = \min \{n: \frac{1}{n}\sum_{i=1}^n y_i > 2\}$ is not relevant to the Bayesian inference of $\mu \vert y$, but it does matter for forming future replications $y^{\mathrm{rep}}$ and for examining such replications, which is all the more needed under data-dependent stopping because of its sensitivity to model misspecification (Rosenbaum and Rubin, 1984).
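A sketch of the two replication schemes given a posterior draw of $\mu$ (the cap `n_max` is my guard, since for most $\mu$ the running mean may never exceed 2):

```python
import numpy as np

def replicate_fixed_n(mu, n, rng):
    """IID replication with the sample size fixed at the observed n."""
    return rng.normal(mu, 1.0, size=n)

def replicate_sequential(mu, rng, n_max=100_000):
    """Replication honoring the stopping rule: sample until the running mean exceeds 2."""
    y, total = [], 0.0
    while len(y) < n_max:
        y.append(rng.normal(mu, 1.0))
        total += y[-1]
        if total / len(y) > 2:
            break
    return np.array(y)   # note: the replicated sample size is itself random
```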

In all these examples, and in applied Bayesian modeling at large, the definition of replication is never automatic; it is often guided by the realistic prediction task. Even in a simple linear regression with exchangeable observations $D = \{y_{1}, \ldots, y_{n}\}$, the sampling procedure requires us to specify whether the sample size is fixed or random and whether the design matrix is fixed. And even after these aspects are decided, the definition of replication or prediction can still differ between (a) predicting the next unseen data point, $y_{n+1} \vert D$, and (b) replicating the whole dataset, $(y^{\mathrm{rep}}_1, \ldots, y^{\mathrm{rep}}_n) \vert D$; such sampling assumptions do have different implications in practice. To name a few:

  1. Model checking, model evaluation, and model averaging typically depend on how the replication is defined (see the sketch after this list);
  2. Even Bayesian computation, a realm that is traditionally only "inferential", can depend on the sampling distribution: in BUGS and JAGS, you need to define it in the program; in ABC and simulation-based inference, an appropriate choice of sampling distribution can in theory make your computation more efficient.
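To illustrate point 1, here is a minimal sketch contrasting (a) and (b) for the conjugate normal model of Example 4; the test statistic $T(y) = \max_i y_i$ is an arbitrary choice of mine:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(1.0, 1.0, size=50)      # stand-in "observed" data
n = len(y)

# Conjugate posterior of mu under y_i ~ normal(mu, 1), mu ~ normal(0, 1)
post_mean, post_sd = y.sum() / (n + 1), 1.0 / np.sqrt(n + 1)
S = 4000
mu = rng.normal(post_mean, post_sd, size=S)

# (a) predict the single next observation y_{n+1} | D
y_next = rng.normal(mu, 1.0)

# (b) replicate the whole dataset and compare a test statistic, as in a
# posterior predictive check; this requires replicating all n points
y_rep = rng.normal(mu[:, None], 1.0, size=(S, n))
ppc_pvalue = np.mean(y_rep.max(axis=1) >= y.max())
```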