1 The problem

The hierarchical model has been the working horse for Bayesian modeling, and its sampling algorithm has been comprehensively studied in computation literature. However there are still a few questions remaining unclear:

  1. How to choose an efficient parametrization? Heuristically, the efficiency of the the non-centered parameterization depends on the “strength of the data” (Betancourt and Girolami 2015; Gorinova, Moore, and Hoffman 2019). Yet data can be differently informative for different parameters.

  2. Most comparisons between parameterization are focused on computation efficiency. When there is metastability, centered and non centered parameterization are implictly two models and have different predictive power. Are the metrics on computation efficiency and predictive power consistent?

  3. The concepts of metastability and multimodality are often used interchangeably in practice. In hierarchical models, the bimodality is common (but not always). What is relation between the posterior bimodality and the funnel, both of which seem to harm the computation efficiency? Do they occur simultaneously, and do we have to worry about the bimodality in sampling?

  4. And does the last question provide new insights on prior choice, if we are not concerned with bimodality?

  5. Combining the centered and non-centered parameterization is likely to render better result. Just like mixture model versus model ensemble, such combination can take place within each HMP jump, which effectively leads to the “automatic reparameterization”, or can be conducted once after the entire inference. How useful is stacking in the existence of a large number of divergence due to poor parameterization?

2 When does the non-centered parametrization fail

2.1 The funnel in the likelihood

It is known that the non-centered parameterization is not always better. We first show that there is a room for a funnel shape posterior in the non-centered parameterization.

Throughout the article, we are focused on a balanced one-way model. The data are generated from a normal likelihood:

\[y_{i,j} \sim N(\theta_i, \sigma^2), \quad \theta_i \sim N(\mu, \tau^2),\quad 1\leq i\leq N, 1\leq j \leq J \]

We use \(1\leq i\leq N\) to denote group index, and \(1\leq j \leq J\) to denote individuals in each group.

In this subsection I am generating \(N=8\) schools with \(J=10\) observations in each. I intentionally increase the between-group variance \(\sum_{i=1}^{N}(\bar y_{i,}- \bar{y}_{..})^2\) by adding some student-t noise to the true group mean \(\theta_{i}\).

We also fix the prior to be invGamma(0,1, 0.1) for both \(\tau^2\) and \(\sigma^2\).

How are the data informative of \(\sigma\) and \(\tau\)? It can crudely summarized by MLE, as shown by the red spike in the following graph. There is no prior-data conflict as an inv-gamma has a thick tail so as not to exclude the possibility of any large \(\sigma^2\) and \(\tau^2\).

Now we sample from both the centered and the non-centered parameterization (\(\theta_i= \xi_i \tau + \mu\)).

## Warning: There were 11 divergent transitions after warmup. Increasing adapt_delta above 0.8 may help. See
## http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup
## Warning: There were 143 transitions after warmup that exceeded the maximum treedepth. Increase max_treedepth above 10. See
## http://mc-stan.org/misc/warnings.html#maximum-treedepth-exceeded
## Warning: Examine the pairs() plot to diagnose sampling problems

The dimension of the parameter space is only \(8+3=11\) but there were 11 divergent transitions and 143 transitions that exceeded the maximum treedepth in the non-centered parameterization. We can check the posterior draws and the divergence using bayesplot:

The centered-parameterization works fine. No divergence, no warnings, and R hat=1.

And by contrast, the non-centered ones reveal pathological geometry:

What is the problem with the non-centered parameterization? We first notice the divergence happens mostly, if not exclusively, when the between group variation is large. There is a funnel shape posterior in the \(\mu - \xi_i\) pair plot.

In the non-centered parameterization, apriori there is a harmless additive dependency between \(\xi_i\) and \(\mu\): \(\mu = \theta - \tau \xi_i\). It is only linear dependant.

However, \(y_{ij} \sim N(\theta_i,\sigma)\) implies \(\theta_i \sim N( \bar y_{i.},\sigma/\sqrt{J} )\) that is
\[\xi_i \vert \tau, \mu, y \sim N(\frac{1}{\tau}(\bar y_{i.} - \mu) , \sigma \tau^{-1} J^{-1/2} )\] Further replace \(\tau\) by a plug-in estimate: \(\tau \approx \sqrt{N^{-1} \sum_i^{N} (\theta_i-\mu)^2}\) and \(\theta_i \approx \bar{y_{i.}}= 1/J \sum_{l=1}^J y_{i,l}\), we can derive the conditional variance in the likelihood as \[V_l \overset{\Delta}{=} \mathrm{Var}(\xi_i | \mu, \sigma) = \frac{J^{-1} \sigma^2}{N^{-1} \sum_i (\bar{y_{i.}}-\mu)^2} \]

This is perfectly a funnel! When \(\mu\) is far away from the sample mean of \(\theta_i\), the conditional variance of \(\xi_i\) is very small.

The full posterior is the product of the likelihood term and the prior \(\xi_i \sim \mathrm{N}(0,1)\). The conditional posterior in the non-centered parameterization is then

\[\xi_i \vert \tau, \mu, y \sim \mathrm{N}( (V_l +1)^{-1} \bar y_{i.} , (V_l^{-1}+ 1)^{-1} ) \]

The likelihood funnel will be more significant when \(V_l\) is small, such that the likelihood term weights more in the the posterior. The plug-in estimate is \(V_l\) is

\[\frac{J^{-1} \hat \sigma^2}{N^{-1} \sum_i (\bar y_{i.}- \bar y{..} )^2} \approx F^{-1}\]

where F is the \(F\)-statistics (between group variance divided by within group variance).

Therefore such likelihood funnel in non-centered parameterization should be amplified when the likelihood variance is small, i.e., F-statistics in the data is large. It might be a more rigorous description than “the data are strong”. We confirm this intuition with the following simulations.

2.2 Non-centered parameterization with varying F-statistics

We vary both true \(\tau\) and \(\sigma\) from \(0.01\) to \(20\) in the true data generating mechanism, with additionally student-t noise added to true \(\theta\) in order to achieve a high F-statistics (from nearly \(0\) to \(10^3\)). We generate the true date from \(N=8\) schools and \(J=10\) in each school and keep a inv_gamma(0.1, 0.1) prior on both \(\tau^2\) and \(\sigma^2\) (again we pick inv_gamma for easier analysis in the multimodal part).

We see the trade off between centered and non-centered parameterization from the graph above: when the between-group variation is large or the within group variation is small, the non-centered parameterization is less efficient, measured either by the pairwise correlation, or the relative effective sample size (n_eff/n, and the contrast will be more striking if I use n_eff/time).

It is worthwhile to mention that these two parameterizations essentially give almost an identical prediction. The ratio of elpd_loo is nearly constant 1 in various settings, regardless of the alarming warning of divergence. That is because the model is relatively simple and therefore it is the computation efficiency matters more. We will increase the dimension in the next two sections and compare the predictive power there.

Unlike in the ideal Neal’s funnel where the conditional variance will vary from nearly zero to infinity, the non-centered posterior conditional variance of \(\xi_i\): \((V_l^{-1}+ 1)^{-1}<1\) is always constrained from above (prior variance=1), just as in the centered case \(\mathrm{Var}(\theta_i\vert \tau, y, \sigma)= (\sigma^{-2}J + \tau^{-2} )^{-1} < \sigma^{2}/J\). That said, the funnel shape can still be challenging if the number of groups J is high (admittedly I do not know any theoretical result on the mixing time as a function as the group size in this model).

3 The Metastablility and the Multimodality may evolve in opposite directions

The byproduct of the previous simulation is to make a distinction between the metastability versus the multimodality.

It is understandable why these two concepts are often used interchangeably: in part one main type of metastability comes from multimodality– when the posterior distribution has two modes and lacks enough density connected in between.

However, there is also fundamental difference between these two concepts. To name a few:

  1. When the two modes are effectively connected through enough mass, the bimodality does not lead to metastability, or indeed not to anything regarding how quick the mixing time will be (though is true that a bimodal distribution is always non-log-concave, making theory results on rapid mixing do not apply).

  2. The metastability can also be a result of an entropic barrier. e.g., Neal’s funnel has no mode at all and is monotonic along the direction of \(\tau\).

  3. Intuitively, the metastability describes the slow mixing that may come from a large local curvature, while the mode only requires the gradient being zero. Both the gradient and the curvature are not invariant under reparameterization, but also subject to different evolution.

This normal-invgamma model is an example where the metastability and the multimodality are evolved in the “opposite” directions.

According to Liu and Hodges (2003), we can analytically determine the posterior bimodality in the centered \((N+3)\)-dimensional parameter space \((\theta_1,\dots ,\theta_N, \mu, \tau^2, \tau^2)\). The criteria depend on both the data \((y_{i,j})\) and the prior( \(\tau\sim\) inv-gamma\((a, b), \sigma \sim\) inv-gamma\((\alpha, \beta)\)), as summarized by the following discriminant function:

We compute the bimodality for each generated dataset in the previous simulation. In that graph we use red and blue colors to label the bimodal and unimodal situation. Notably, when the between-group variation increases, the posterior in the centered parameterization eventually becomes bimodal – even though the the sampling under the same parameterization becomes actually more efficient! That is the increasing red solid line in terms of effective sample size, or the decreasing red solid line in terms of pairwise correlation.

How is that possible?

To illustrate this counter-intuitive fact, we consider the unnormalized profile density, which is to replace \((\theta_1,\dots, \theta_J, \mu)\) by their modes in \(p(\theta_1,\dots, \theta_J, \mu \vert y, \sigma^2 ,\tau^2)\). Viewing \(\tau\) and \(\sigma\) as given constant, \(\theta_1,\dots, \theta_J, \mu \vert y \sigma^2 ,\tau^2\) forms a normal-normal model, hence the conditional mode is always unique.

The log profile density, as a function of \(\sigma^2 ,\tau^2\), is:

We use the following function to make a grid search and find local modes (local minimum of the negative log posterior profile density,excluding all boundaries). In the setting in section 2.1, we firstly found one mode at \((\sigma= 0.91, \tau= 28)\), which is in turn the MAP estimate.

## local mode found: sigma= 0.91   tau= 28.3

But does it only have one mode? It does not match the result from Liu and Hodges (2003)!

After a careful check I caught the second mode by enlarging the searching area. The second mode is far away from the typical set though.

## First local mode found: sigma= 0.91   tau= 28.5
## Second local mode found: sigma= 31.79   tau= 0.1

We put the posterior sample (from the non-centered parameterization in stan), the MLE, and these two modes altogether.

Yes, there does exist a second mode – but it has a extremely low density. The ratio of the heights of these two local modes is:

## [1] 1.819449e+103

and it can be confirmed by the comparison between the attraction region around these two modes:

The second mode is not even close to the typical set and can be completely ignored if the purpose is to compute any integral under the posterior distribution, as most Bayesian computation tasks require. Indeed the sampling distribution \((\theta_s)\sim P_s\) converges in distribution to a target density \(p\) only requires \(\sum_{s=1}^S f(\theta_s) \to E_p{f(x)}\) when \(S\to\infty\). We may even arbitrarily modify \(P_s\) up to the zero-measure set with respect to \(p\) – so in principle we can create an arbitrarily number of modes.

The analysis above only applies to inv-gamma due to conjugacy. But the bimodality may generally exist in hierarchical models. For example, when the prior on \(\tau\) is flat, \(\tau=0\) is always a mode.

Granted, the mode itself is not statistically interesting to begin with. And computationally if the only goal is to find the modes, we can simply run an optimization. It is challenging to find this second mode through sampling. Even when I initialize the chain right at the mode, which is never known in practice, the sampler still jumps to the typical set surrounding the global mode– after all the density is too low so it might be a numerical issue.

## Inference for Stan model: random_effect.
## 4 chains, each with iter=1000; warmup=500; thin=1; 
## post-warmup draws per chain=500, total post-warmup draws=2000.
## 
##        mean se_mean    sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
## sigma  0.98    0.00  0.08  0.83  0.92  0.98  1.04  1.16  3371    1
## tau   38.29    0.33 12.39 22.35 29.93 35.70 43.60 68.38  1438    1
## 
## Samples were drawn using NUTS(diag_e) at Fri Aug 16 12:02:51 2019.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at 
## convergence, Rhat=1).

From the modelling perspective, the second mode refers to the “complete pooling”, while \(\tau\) is slightly larger than \(0\) because of the zero-avoiding behavior of the prior inv-gamma\((0.1,0.1)\). We can construct the true zero model by \[y_{i,j}\sim \mathrm{N}(\mu, \sigma), \sigma\sim \mathrm{invgamma}(\alpha, \beta).\] The posterior is analytic: \[\sigma^2\sim \mathrm{invGamma}(\alpha+NJ/2, \beta+NJ/2 \mathrm{Var}(y) ), \mu \vert y, \sigma \sim \mathrm{N} (\bar y, \sigma/\sqrt{NJ}).\] Of course I will still use stan to fit the model since I do not trust my math.

## data{
##   real alpha;
##   real beta;
##   int N;
##   int J;
##   matrix[N,J] y;
##   
## }
## parameters{
##   real mu;
##   real<lower=0> sigma;
## }
## model{
##   for(i in 1:N)
##     for(j in 1:J)
##        y[i,j]~normal(mu, sigma);
##   target += inv_gamma_lpdf(sigma^2 | alpha, beta)+ log(sigma); // jacobian
## }
## 
## generated quantities{
##   matrix[N,J] log_lik;
##   for(i in 1:N)
##     for(j in 1:J)
##        log_lik[i,j]=normal_lpdf(y[i,j]|mu, sigma);
## }

We evaluate its predictive performance by LOO.

## elpd_diff        se 
##     277.3      10.4
## Method: stacking
## ------
##        weight
## model1 0.000 
## model2 1.000

Either by model comparison or by model averaging, the complete pooling model in this setting is not useful. It effectively explains all the variation by \(\sigma\) and hence ignores the hierarchical structure.

Right, we have just spent too much time emphasizing on a noninteresting concept (the mode), and even worse it is a bad mode!

I close this section by two questions.

  1. The modality analysis is almost useless for Bayesian computation – the mode is irreverent to mixing time. Is there any better concept we can use to quantify the mixing time in hierarchical models, and therefore the relative advantage of reparameterization? Ideally it should be a replication of Liu and Hodges (2003), with the focus shiftted towards metastability.

  2. In line with my earlier blog post, the existence of multiple modes– no matter how bad each of them is– can still improve/expand the model. Does it really help if we stack the complete pooling model (since it is never excluded by data)?

I will answer question (B) in the following section.

4 Using stacking to average possibly non-convergent chains

Both the centered and the non-centered parameterization can be problematic. A user can always run both parameterization and detect which one is better afterwards, or he can implement Gorinova, Moore, and Hoffman (2019)’s automatic reparameterization. But these procedures are not automated and require extra labor cost.

Alternatively we have a simple heuristic: in cases where \(\tau\) is estimated to be near zero, the resulting inferences will be sensitive to prior and parameterization, and creates the funnel shape in centered ones. Can we utilize the divide and conquer strategy, separately treat small \(\tau\) in the centered parameterization (additionally the Markov chain never fully explore small \(\tau\) anyway), and remedy such partial ignorance by averaging over a complete pooling model?

I am echoing Andrew’s old post on zero-avoing prior, which was known (to Dan) to bias the posterior distribution significantly. However, the non-centered parameterization should not be taken for granted as the ground truth, and therefore the actual effect of such “bias” based on the benchmark of ncp is questionable.

I will focus on two metrics 1) the computation efficiency of MCMC, with which has been slightly dealt in the last section. 2) the posterior predictive distribution.

To the second end, when there are only \(N=8\) schools, the data is so simple such that essentially any model will fit (the hierarchical model is essentially overparameterized, if all \(\theta_i\)s are allowed to disobey the Bayes rule, all hyper-parameters can be estimated arbitrarily), and the dimension is also low such that any sampler can explore the typical set sooner or later.

Therefore we see the centered parameterization and non-centered parameterization yields almost the identical elpd in section 2.2.

Now we increase the dimension N (number of groups) to \(N=300\) so as to further amplify the potential funnel. We keep J (observations in each group) as 20. Throughout the simulation, we vary the following settings:

  1. We change the within-group standard deviation \(\sigma\) from \(0.01\) to \(100\).

  2. We add the between group variance by adding a student-t noise. Namely \(\theta_{n}\sim \mathrm{N}(0, 0.01)+ B t(1)\), and the t-noise scale B varies from 0 to 50.

  3. We fix the prior on \(\sigma\) to be Inv-Gamma\((0.1,0.1)\), and \(\tau \sim\) Inv-Gamma\((0.1, \beta)\). We vary the scale parameters \(\beta\) from 0.1 to 10. A larger \(\beta\) avoids zero more strongly. Inv-gamma is not well-calibrated (Gelman 2006). The reason for keeping this unexciting inv-gamma prior is (i) its connection to the earlier theoretical analysis on bimodality (ii) the inv-gamma renders zero-avoiding effect on \(\tau\), which can be combined with stacking.

(Annoyingly Rstan has some problems loading shared objects after repeatedly fitting the same model a large number of simulations (>100), and prints error Error in dyn.load(libLFile). It seems to be an old issue and did not get fixed. The recommendation to use dyn.unload() is vulnerable.)

4.1 cp+0 \(>\) ncp

For each generated observation \(y_{i,j}\), we sample three posterior distributions

  1. The complete pooling (1 chain, 1000 iters). It enforces \(\tau=0\).

  2. The centered-parameterized hierarchical model (cp, 8 chains, 1000 iters).

  3. the non-centered parameterized hierarchical model (ncp, 8 chains, 1000 iters).

  4. the stacked distribution: we use stacking to average 8 chains from centered parameterization, plus 1 chain from the complete pooling model.We then compute the leave-one-out log predictive density from all four inferences. The sample size 300*20=6000 is large enough that the loo quantity approaches its expected value.

In general when the variation increases, no matter from within group or between group group variance, the data generating mechanism contains more uncertainty, and therefore elpd decreases. When the between group variation is very large (50 times student-t (1) is really large!), the centered parameterization reveals inferior predictive power compared with the non-centered one, as the prior funnel \((\theta-\tau)\) will be amplified when \(\tau\) is to be estimated large. The complete pooling model is always worse, and should never be used individually. However, when we average 8 possibly non-mixed chains of centered parameterization (orange line) plus an additional chain from complete pooling (grey line), the stacked (red line) predictive performance is almost identical to the non-centered parameterization (green line). They almost always overlap except when the between group variance is very large (50 student-t\((1)\)) and within group variance is very small (N\((0,0.01)\) ).

Those situations (\(\sigma=0.01, \tau=50 t(1)\)) are rare if the observations are unit-scaled.

The pattern is robust under different prior. A stronger zero avoiding prior, such as inv-gamma (0.1, 5) does not change the result a lot.

Now let’s zoom into the head-to-head comparison between the non-centered parameterization and the stacking (weighted average of centered 8 chains of parameterization and 1 chain of complete pooling) . The graph below is the same as above, except we only display these two inferences results– visualizing elpd is difficult when it varies in the order of magnitude, it cannot be compared in the log scale as elpd is already a log quantity.

We see more comparison details from this graph: except for extreme cases, the stacked result almost always has a larger log predictive density than ncp!

It is another question to ask: how much of the gain in stacking’s predictive ability is from the complete pooling model (for it approaches true zero that is not previously seen due to the funnel), and how much is from the internal-heterogeneity of the 8 possibly non-mixed cp chains?

The graph below answers the question: even though the complete pooling distribution always has a much smaller elpd than cp hierchrical models (second row), it is still sometimes assigned a non-zero stacking weight (last row). The final stacked distribution is always better than the uniformly mixed cp (first row).

The last row records the stacking weight for the complete pooling, when it is zero, the stacked distribution is purely an average of 8 chains from cp. Those chains may have plenty of divergences and a large \(R\) hat as well. A weighted mixture of them compensates the short-sightedness of individual chains and makes identical or even better prediction than the ncp.

4.2 The tradeoff between n_eff and elpd

That is not the end of the story. We always have a trade-off between computation efficiency and predictive power. Even a super slow but ergodic sampler can eventually explore the whole target space given enough time. On the other hand, a good prediction does not require an ergodic sampler, such as in quasi-Monte Carlo methods.

In section 2, we compared the computation efficiency between centered and non centered parameterization, and concluded that ncp suffers from a likelihood funnel when the F-statistics (between group variance divided by within group variance) are large. Does it contradict the result in section 4.1, where we see ncp outperforms cp when between group variance is large?

In fact these two metrics (elpd vs n_eff) are effectively independent, just as metastability vs multimodality.

The following graph compares the sampling time, and the effective sample size from the previous experiments. The first row is the system time (in seconds) for (i) sampling 1 chain \(\times\) 500 iters in the complete pooling model, (ii) sampling 8 chains \(\times\) 500 iters in cp, (ii) sampling 8 chains \(\times\) 500 iters in ncp and (iv) the additional optimization in stacking. The number of iterations does not include the first half warmups. The number of groups are fixed as 300 and sample size 6000. I have to acknowledge this might not be the fairest comparison, as I run the simulation on a cluster and it is possible that each node has different computation resource. The general patter is that: cp and ncp is comparable in terms of sampling time, both of which are much larger than the additional optimization time in stacking. We only show the result with prior \(\tau \sim\) invGamma\((0.1,0.1)\) yet other priors and sample sizes have similiar results.

The lower two rows display the arithmetic and the harmonic mean of the relative effective size (n_eff/n) among all sampling parameters in cp, ncp, and complete pooling. The non-centered parameterization in general suffers from a lower relative effective sample size, especially when the between-group variance is large– albeit it is also the scenario when ncp has a larger winning margin than cp in terms of prediction. This is an example of the tension between outcome-prediction and parameter-estimation.

In conclusion, there are two applications for stacking in hierarchical models:

  1. (To improve prediction) When we have noticed a large number of divergences from cp, but we do not want to rewrite the code into non-centered parameterization for it takes extra time and labor, and the extra gain of reparameterization is not always positive, we can utilize stacking to achieve almost the same or even better predictive power by reweighting the divergent cp chains and complete pooling model.

  2. (To remedy low n_eff) When we run both the cp chains and ncp chains, and if we notice the ncp ones have a low effective sample size such that any parameter estimation becomes unreliable, while cp has a high effective sample size but also has potential drawbacks on non-mixing, we can use stacking to average cp chains and therefore increase the effective sample size without hurting the predictive power.

Acknowledge

Work in progress. [ sourcecode ]

Joint work with Aki Vehtari and Andrew Gelman and others.

Reference

Betancourt, Michael, and Mark Girolami. 2015. “Hamiltonian Monte Carlo for Hierarchical Models.” Current Trends in Bayesian Methodology with Applications 79. CRC Press Boca Raton, FL: 30.

Gelman, Andrew. 2006. “Prior Distributions for Variance Parameters in Hierarchical Models (Comment on Article by Browne and Draper).” Bayesian Analysis 1 (3). International Society for Bayesian Analysis: 515–34.

Gorinova, Maria I, Dave Moore, and Matthew D Hoffman. 2019. “Automatic Reparameterisation of Probabilistic Programs.” arXiv Preprint arXiv:1906.03028.

Liu, Jiannong, and James S Hodges. 2003. “Posterior Bimodality in the Balanced One-Way Random-Effects Model.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 65 (1). Wiley Online Library: 247–55.