Three Examples of Invariance under Reweighting

Do we necessarily sacrifice from the extrapolation?

The more I think about weighting methods, the more confused I am. Here is the starting point of this post: Suddenly there are several methods that are invariant under re-weighting.

By convention throughout this article I will use $\theta$ to denote a possibly high-dimensional parameter to sample, $p(\theta)$ the target distribution, and $q(\theta)$ the proposal.

To begin with, using importance sampling to approximate an integral does not have to sacrifice efficiency. Occasionally importance sampling can even be more efficient than direct sampling. For example, suppose the sampling variable $\theta$ is binary, the target distribution is $p(\theta) =Ber(0.1)$, and the function under the integral is $f(\theta)= 0.1 \times 1(\theta=0)+ 0.9 \times 1(\theta=1)$. Clearly $E_{p} f=0.9\times 0.1 \times 2=0.18$.

Now sampling from the uniform proposal $q(\theta) =Ber(0.5)$ yields the importance sampling (IS) estimate

\[I_{IS}=\frac{0.18}{S} \sum_{s=1}^S \big( 1(\theta_s=0)+ 1(\theta_s=1) \big),\]

which is a constant with zero variance: I will obtain the exact answer with a single sample point.

In particular, it is well known that without self-normalization, the optimal proposal is $q(\theta) \propto p(\theta) \vert f(\theta) \vert$. In other words, we gain some efficiency by (importance) sampling from $q(\theta)$, rather than directly sampling from $p(\theta)$.
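Here is a minimal R sketch of this binary example (the seed and sample size are arbitrary choices of mine): direct Monte Carlo from $p$ is noisy, while importance sampling from the uniform proposal returns exactly 0.18 on every draw.

set.seed(1)
S=100
f=function(theta) 0.1*(theta==0)+0.9*(theta==1)
# direct Monte Carlo from the target p = Ber(0.1): a noisy estimate of 0.18
theta_p=rbinom(S, size=1, prob=0.1)
mean(f(theta_p))
# importance sampling from the proposal q = Ber(0.5): every term w*f equals 0.18
theta_q=rbinom(S, size=1, prob=0.5)
w=ifelse(theta_q==1, 0.1/0.5, 0.9/0.5)   # importance ratio p(theta)/q(theta)
mean(w*f(theta_q))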

Now let’s view it in a modeling framework: we collect data $(x,y)$ from the observational distribution $q(x,y)$, but at some point we need to make predictions over the distribution $p(x,y)$. The prediction means the conditional model $\hat p(y\mid x)$. In the most general situation $p(y \mid x )$ and $q(y \mid x )$ can even be different. When $q$ and $p$ are not identical, the prediction essentially relies on extrapolation.

Here is our first question:

  1. To what extent do we sacrifice from the extrapolation?

Sadly, it seems unlikely that there could be a tight bound for the above question. The motivation comes from the reasoning above: weighting methods may even achieve super-efficiency, in which case extrapolation might benefit the prediction.

Furthermore, the effect of extrapolation depends on the gap between $p$ and $q$, and it should also depend on the learning method I use.

  2. What quantity describes the sensitivity of the extrapolation effect to the distribution gap between $p$ and $q$?

There are so many theoretical results in active learning and transfer learning. But in this post I just want to focus on one small subpoint: there are some methods that are invariant under reweighting.

A really toy example

If we know the data generating mechanism is $y=\beta x+ \alpha$ without noise, then clearly any two data points will give identical inference on $\beta$ and $\alpha$.

Now consider a regression $y=\beta x+ \alpha+ N(0, \sigma)$, where we know the population covariates come from $p(x)$ but we can only collect data from $q(x)$. Then we will run a weighted linear regression \(\arg \min_{\beta, \alpha, \sigma} \sum_i \left( w_i(y_i-\beta x_i- \alpha)^2 \sigma^{-2}+ 2\log \sigma \right),\) where $w_i= p(x_i)/q(x_i)$ are the importance weights. The weighted OLS estimate, under the true model, has the same expectation as the unweighted one.
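To see this numerically, here is a minimal R sketch (the particular $p$, $q$, and coefficients below are my own choices for illustration): under the correctly specified linear model, the weighted and unweighted fits estimate the same $(\alpha, \beta)$ up to noise.

set.seed(1)
n=2000
x=rnorm(n, 0, 1)                    # covariates observed under q(x) = N(0,1)
y=0.5+1.5*x+rnorm(n, 0, 1)          # true model: alpha = 0.5, beta = 1.5
w=dnorm(x, 1, 1)/dnorm(x, 0, 1)     # importance weights p(x)/q(x) with target p(x) = N(1,1)
coef(lm(y~x))                       # unweighted OLS
coef(lm(y~x, weights=w))            # weighted OLS: roughly the same alpha and beta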

Another toy example

Now let’s consider SVM. SVM itself assigns weights to data points, so any extra data weights are masked. Imagine a linearly separable problem (i.e., there exists a linear boundary that perfectly separates the binary outcomes $y=1$ and $y=0$): a training point is either on the support (SVM weight = 1) or not (weight = 0). In that case, I would not change my inference result even if I knew my input $x$ was non-representative of the prediction problem.

A separable SVM amounts to saying “under the true model.” However, in general, a true model does not grant finite-sample invariance: a weighted SVM gives a different answer if the problem is not separable.
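As a numerical sanity check of the separable case, here is a small sketch with the e1071 package (my own illustration, not from the original post; per-observation weights are not exposed there, so I rescale the class weights instead, which is the same kind of reweighting when the weights depend only on $y$): for a clearly separable problem with a large cost, the fitted linear boundary barely moves after reweighting.

library(e1071)
set.seed(1)
n=50
x1=matrix(rnorm(2*n), ncol=2)+3      # class "1", centered at (3, 3)
x0=matrix(rnorm(2*n), ncol=2)-3      # class "0", centered at (-3, -3)
dat=data.frame(rbind(x1, x0), y=factor(rep(c(1, 0), each=n)))
fit1=svm(y~., data=dat, kernel="linear", cost=1e3, scale=FALSE)
fit2=svm(y~., data=dat, kernel="linear", cost=1e3, scale=FALSE,
         class.weights=c("0"=1, "1"=10))   # upweight class "1" by a factor of 10
# recover the linear boundary w'x - rho = 0 from each fit
rbind(c(t(fit1$coefs)%*%fit1$SV, -fit1$rho),
      c(t(fit2$coefs)%*%fit2$SV, -fit2$rho))   # (nearly) identical hyperplanes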

Now, this is concerning: if my SVM kernel is so flexible that it can essentially separate any training data, then it will always be weighting-invariant, no matter how strong a regularization I impose.

Deep neural networks seem to be such a case.

A less toy example: Logistic Regression, Case-Control

Dating back to Prentice and Pyke (1979), the invariance of logistic regression between case-control studies and cohort studies has been thoroughly studied.

In a cohort study, researchers collect data randomly from $p(x, y)$. For a rare disease (only a small fraction of the population has the disease, $p(y=1) \ll p(y=0)$), such sampling may result in too few case samples.

In a case-control study, researchers collect data from cases ($y=1$) and controls ($y=0$) separately until fixed numbers of cases and controls are reached. In the sampling language, the only random variable is $x$.

Prentice and Pyke (1979) established the theory that the odds ratio from the cohort study can be estimated from the case-control study, even in finite samples. This is due to the fact that a logistic regression essentially models the odds ratio: \(\frac{p(y=1\mid x) / p(y=0\mid x) }{p(y=1\mid x_0) / p(y=0\mid x_0)}.\)

In a cohort study, modeling $p(y\mid x)$ as softmax$(\alpha + \beta x)$ is equivalent to modeling the log odds ratio by $\beta (x-x_0)$.
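To spell out that step: with $p(y=1\mid x) = \mathrm{logit}^{-1}(\alpha+\beta x)$, the log odds are $\alpha+\beta x$, so

\[\log \frac{p(y=1\mid x)/p(y=0\mid x)}{p(y=1\mid x_0)/p(y=0\mid x_0)} = (\alpha+\beta x) - (\alpha+\beta x_0) = \beta(x-x_0).\]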

In a case-control study, the same odds ratio implies \(p(x\mid y=k) \propto c_k{\exp(\gamma(x) + \beta_k x)},\) where $\gamma(x)= \log\big( p(x\mid y=0)/ p(x_0\mid y=0)\big)$.
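To see where this form comes from (a one-line sketch, writing $p(x)$ for the population marginal of $x$ and using $p(y=k\mid x)\propto \exp(\alpha_k+\beta_k x)$): Bayes' rule gives

\[p(x\mid y=k)=\frac{p(y=k\mid x)\,p(x)}{p(y=k)} =\frac{e^{\alpha_k}}{p(y=k)}\,\exp(\beta_k x)\cdot\frac{p(x)}{\sum_j \exp(\alpha_j+\beta_j x)},\]

so taking $c_k = e^{\alpha_k}/p(y=k)$ and absorbing the last factor, which does not depend on $k$, into $\exp(\gamma(x))$ (up to a constant) recovers the displayed form.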

Logistic regression can be viewed as a semiparametric model that leaves $\gamma(x)$ unmodeled. The MLE of $\beta$ turns out to be the same under case-control and cohort sampling, which implies we can treat a case-control study as if it were a cohort study.

Let me sketch the proof. To ease the notation, I will consider the general situation with $y=0,\dots, K$ and $p(y=k\mid x)=\frac{\exp(\alpha_k + \beta_k x)}{\sum_{j=0}^K \exp(\alpha_j + \beta_j x)}$.

In a case-control study, the (retrospective) log likelihood is

\[\sum_{k=0}^K \sum_{i=1}^{n_k} \log p(x_{ik}\mid y=k) =\sum_{k=0}^K \sum_{i=1}^{n_k} \log \big(c_k \exp(\gamma(x_{ik}) + \beta_k x_{ik}) \big).\]

Rewrite the marginal of $x$ in the case-control sample as \(q(x)= \exp(\gamma(x)) \sum_{k=0}^K c_k \frac{n_k}{n} \exp(\beta_k x).\) Then the log likelihood above is, up to an additive constant,

\[L=\sum_{k=0}^K \sum_{i=1}^{n_k} \left[ \log\frac{\exp(\delta_k + \beta_k x_{ik})} {\sum_{j=0}^K \exp(\delta_j + \beta_j x_{ik})} + \log q(x_{ik}) \right],\]

where $\delta_k = \log(c_k n_k / n)$.

Consider the profile likelihood, where $q(x)$ is replaced by its empirical distribution $\hat q(x)=\frac{1}{n}\sum_{i=1}^n 1(x=x_i)$, the nonparametric MLE. Then $\frac{\partial L}{\partial \beta}$ has the same expression as in the cohort likelihood $L=\sum_i \log p(y_i\mid x_i)$.

By the way, I think most of the ML literature compares logistic regression with LDA and claims that logistic regression does not model $p(x)$. It does.

A more Bayesian version of this analysis was recently developed by Byrne and Dawid (2018), which shows that the posterior marginal law of the odds ratio, or of any parameter that depends only on it, is invariant under retrospective and prospective sampling.

wait, why can’t I reweight a case-control study?

I mean, it is fine to stick to the clear distinction between retrospective and prospective sampling, but after all, a retrospective study is sampling from

\[q(x, y)= q(y)p(x\mid y).\]

Now we want to optimize the empirical risk such that it is expected to have optimal predictive performance under the population on which I will do prediction:

\[E_{p(x,y)} l(x,y)=E_{q(x,y)}\left[ l(x,y)\frac{p(x,y)}{q(x,y)}\right].\]

So if a black-box machine learning researcher wants to run a retrospective epidemiology study, they would use the weighted loss function

\[-L=\sum_i w_i\, l\big(y_i, f(x_i)\big)=-\sum_{k=0}^K w_k \sum_{i=1}^{n_k} \log\frac{\exp(\alpha_k + \beta_k x_{ik})} {\sum_j \exp(\alpha_j + \beta_j x_{ik})},\]

where $f(x_i)_j=\frac{\exp(\alpha_j+\beta_j x_i)}{\sum_{j'} \exp(\alpha_{j'}+\beta_{j'} x_i)}$, $l(j,p)=-\log(p_j)$ is the cross-entropy loss, and $w_i = p(y_i)/q(y_i)$ are importance ratios that depend only on $y_i$. If $q$ is balanced/uniform, then $w(y_i=k) \propto p(y=k)$, i.e., the population rate.

This is a totally different expression from the retrospective likelihood, and we will get a different answer by solving $\partial L/\partial \beta_k=0$.
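To make the difference concrete in the binary case: the weighted fit solves the score equations

\[\sum_i w_i\Big(y_i-\mathrm{logit}^{-1}(\alpha+\beta x_i)\Big)\begin{pmatrix}1\\ x_i\end{pmatrix}=0,\]

while the unweighted retrospective MLE solves the same equations with $w_i\equiv 1$; unless the weights are constant, the two systems generally have different roots $(\hat\alpha,\hat\beta)$.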

To be honest, at the risk of exposing my ignorance, I was totally shocked by this difference. So I ran a simple R simulation:

library(arm)   # for invlogit()
set.seed(123)
# simulate a population of 10,000 with a rare outcome
x=rnorm(10000,0,1)
p_sim=invlogit(x-3)
y=rbinom(n=length(x),size=1, p=p_sim)
n1=sum(y==1)
n0=sum(y==0)
# case-control sampling: 500 cases and 500 controls
retro_1=sample(which(y==1),size=500)
retro_0=sample(which(y==0),size=500)
retro_data=data.frame(y=y[c(retro_1, retro_0)], x=x[c(retro_1, retro_0)])
# importance weights p(y)/q(y); q is balanced, so the weights are proportional to the population rates
ip_weights_data=c(n0, n1)[retro_data$y+1]/sum(c(n0, n1))
# weighted vs unweighted logistic regression (glm warns about non-integer weights; the fit is still the weighted MLE)
L1=glm(y~x,data=retro_data, family=binomial(link = "logit"), weights = ip_weights_data)
L2=glm(y~x,data=retro_data, family=binomial(link = "logit"))

So the unweighted and weighted logistic regressions are different. A weighted case-control study is equivalent to rescaling the loss function.

My understanding is that, under the true data generating mechanism (the odds ratio model), the unweighted logistic regression is correct in the sense that it gives the same estimate of $\beta$, which is essentially the MLE under retrospective sampling. A retrospective sampling scheme does not see the population $p$.

The weighted model is also valid in that it gives the asymptotically optimal prediction. Indeed, when I ran the above simulation, the weighted fit outperformed the unweighted one as measured by MSE.
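For concreteness, here is one way to make that comparison on the simulated population (a sketch of mine; I measure squared error against the true simulated probabilities p_sim, since the post does not pin down the exact metric):

pred_w=predict(L1, newdata=data.frame(x=x), type="response")   # weighted fit
pred_u=predict(L2, newdata=data.frame(x=x), type="response")   # unweighted fit
mean((pred_w-p_sim)^2)   # population MSE of the weighted fit
mean((pred_u-p_sim)^2)   # population MSE of the unweighted fit (its intercept absorbs the case-control sampling fractions)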

Under model misspecification, the weighted one is then doubly robust, but that may come at the cost of a slower convergence rate.