OLS under covariate shift
Introduction
In statistical learning theory the goal of a prediction model is to learn the regularities of a data generating process from a sample and then obtain the best possible generalization error, which can be thought of as the error that will occur “in the wild” when the trained model is asked to make predictions on data it has never seen. More formally, define a domain \(\D=\{\X,P(\bX)\}\) as a feature space \(\X\), with \(\bX\in\X\), together with a marginal probability distribution \(P(\bX)\) over that space, and a task as a response space \(\Y\), with \(\by \in \Y\), together with a conditional probability distribution \(P(\by|\bX)\). We are trying to learn either \(P(\by|\bX)\) (for classification) or \(E(\by|\bX)\) (for regression) from the training data. We call the function we learn from the sample data the prediction estimator \(\hat{f}\).
Since we can imagine various estimators/algorithms that could fit the data, \(f_1,\dots,f_M\), we want to know the expected generalization error of each when it is given training data from a source domain \((\by_S,\bX_S)\in\Y_S\times \X_S\) and asked to make predictions on a target domain \((\by_T,\bX_T) \in \Y_T \times \X_T\), for some loss function \(L\):
\[\begin{align} \Err &= E[L(\by_T,\hat{f_S}(\bX_T))] \tag{1}\label{eq:err1} \end{align}\]There are several things to note about equation \eqref{eq:err1}. First, we are conditioning \(\hat{f_S}(x)=\hat{f}(x|\by_S,\bX_S)\) on the source features and responses, since these data determine the functional mapping properties of \(\hat{f}\). Second, our expectation is over everything that is random: \(\bX_S,\by_S,\bX_T,\by_T\).[1] Third, we are allowing for the possibility that \(S\neq T\), which differs from the classical “single domain” assumption that \(S=T\). Transfer learning (TL) is the field of machine learning interested in the case \(S\neq T\); Pan and Yang’s (2010) Survey on Transfer Learning provides a very useful overview. Specifically, transductive transfer learning, or covariate shift, is the scenario where \(P_S(\bX)\neq P_T(\bX)\) but \(E_S(y|X=x)=E_T(y|X=x)\) or \(P_S(y|X=x)=P_T(y|X=x)\), meaning the fundamental classification/regression relationship \(y \sim x\) is the same.
In the next section, we will see what happens to the usual OLS prediction error when the assumption of source/target domain equivalence is relaxed.
Linear regression
Consider the simplest model one could conceive of: a linear regression model with one covariate and a homoskedastic Gaussian error term.
\[\begin{align*} &\text{Source domain} \\ y_{iS} &= \beta x_{iS} + e_i, \hspace{3mm} i=1,\dots,N_S \\ &\text{Target domain} \\ y_{kT} &= \beta x_{kT} + e_k, \hspace{3mm} k=1,\dots,N_T \\ &e_i \sim N(0,\sigma^2_e), \hspace{3mm} x_{iS} \sim N(0,\sigma^2_S), \hspace{3mm} x_{kT} \sim N(\mu_T,\sigma^2_T) \\ \end{align*}\]The only twist is that the distributions of \(\bx_S\) and \(\bx_T\) differ. Note that \(\bx_S\) is mean zero, whereas the target has a general mean \(\mu_T\): this is done to simplify the math later, and it is reasonable since we can always de-mean any training data that is given.[2][3] We could conceptualize this data as having some observations from the third quadrant (the source), but wanting to make predictions about data in the first quadrant (the target).
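To make the setup concrete, here is a minimal sketch in Python of this data generating process; the specific numeric values are assumptions for illustration only. The two domains share the same \(\beta\) and error variance, but the target covariates are shifted by \(\mu_T\).

```python
import numpy as np

rng = np.random.default_rng(1)

# hypothetical parameter values (illustrative assumptions only)
beta, sigma_e = 2.0, 1.0      # shared slope and error standard deviation
N_S, N_T = 100, 100           # source and target sample sizes
sigma_S = 1.0                 # sd of the mean-zero source covariates
mu_T, sigma_T = 3.0, 1.0      # mean and sd of the shifted target covariates

# source domain: covariates centred at zero
x_S = rng.normal(0.0, sigma_S, N_S)
y_S = beta * x_S + rng.normal(0.0, sigma_e, N_S)

# target domain: same conditional mean E[y|x] = beta*x, shifted covariates
x_T = rng.normal(mu_T, sigma_T, N_T)
y_T = beta * x_T + rng.normal(0.0, sigma_e, N_T)
```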
The inference case?
From a statistical decision theory perspective, it should be noted that if \(N_S=N_T\) and \(\sigma^2_S=\sigma^2_T\) then the statistical efficiency of the least squares \(\hat\beta\) is identical between the source and target domains. In other words, even if the covariates are shifted, the variance of the estimator of the linear relationship (\(\beta\)) is identical. This is intuitive and is easy to show.
\[\begin{align*} \hat{\beta} &= \frac{\langle \bx_S,\by_S \rangle }{\|\bx_S\|_2^2} = \frac{\langle \bx_S,\beta\bx_S + \be_S \rangle }{\|\bx_S\|_2^2} = \beta + \frac{\bx_S^T\be_S}{\|\bx_S\|_2^2} \\ E[\hat\beta|\bx_S] &= \beta \hspace{3mm} \longleftrightarrow \hspace{3mm} E[\hat\beta]=\beta \hspace{3mm} \text{by Law of I.E.} \\ \Var(\hat\beta|\bx_S) &= \frac{\sigma^2_e}{\|\bx_S\|_2^2} = \frac{\sigma^2_e}{\sigma^2_S}\cdot\frac{1}{\|\bx_S\|_2^2/\sigma^2_S} = \frac{\sigma^2_e}{\sigma^2_S} z_S, \hspace{3mm} z_S \sim \text{Inv-}\chi^2(N_S) \\ \Var(\hat\beta) &= \Var(E[\hat\beta|\bx_S]) + E[\Var(\hat\beta|\bx_S)] \\ &= \frac{\sigma^2_e}{\sigma^2_S}\frac{1}{N_S-2} \\ \MSE(\hat\beta) &= \Bias(\hat\beta)^2 + \Var(\hat\beta) = \Var(\hat\beta) \\ &= \frac{\sigma^2_e}{\sigma^2_S}\frac{1}{N_S-2} \end{align*}\]Here we use the fact that the mean of an inverse-chi-squared distribution with \(N_S\) degrees of freedom is simply \(1/(N_S-2)\). Because the variance, and hence the mean-squared error (MSE), of the unbiased coefficient estimate is a function only of the number of samples, the variance of the error, and the variance of \(\bx\), the statistical efficiency of \(\hat\beta\), as measured by MSE, is the same whenever the source and target variances and sample sizes are identical.
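A quick Monte Carlo check of this closed form is easy to write down. The sketch below (Python, with assumed parameter values) repeatedly draws a mean-zero source sample, refits the no-intercept least-squares slope, and compares the empirical MSE of \(\hat\beta\) against \(\sigma^2_e/(\sigma^2_S(N_S-2))\).

```python
import numpy as np

rng = np.random.default_rng(2)

# assumed parameter values for illustration
beta, sigma_e, sigma_S, N_S = 2.0, 1.0, 1.5, 50
nsim = 200_000

bhat = np.empty(nsim)
for i in range(nsim):
    x = rng.normal(0.0, sigma_S, N_S)              # mean-zero source covariates
    y = beta * x + rng.normal(0.0, sigma_e, N_S)   # responses from the shared model
    bhat[i] = (x @ y) / (x @ x)                    # no-intercept OLS slope

mse_mc = np.mean((bhat - beta) ** 2)               # empirical MSE of beta-hat
mse_theory = sigma_e**2 / (sigma_S**2 * (N_S - 2)) # closed form derived above
print(mse_mc, mse_theory)                          # the two should agree closely
```

Because the formula depends only on \(N_S\), \(\sigma^2_e\), and \(\sigma^2_S\), a de-meaned target training sample of the same size and covariate variance would yield the same value, which is the efficiency equivalence claimed above.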
Back to regression
Now let’s consider the MSE of the predictor, \(L(y,\hat{f_S}(x))=(y-\hat\beta x)^2\), when it makes a prediction on some new \(x \sim N(\mu_x,\sigma^2_x)\) where \(E[y|x]=\beta x\).
\[\begin{align*} E[\MSE(\hat{f_S})|\bx_S,x] &= E[L(y,\hat{f_S}(x))|\bx_S,x] \\ &= E[y^2-2y\hat{f_S}(x)+\hat{f_S}^2(x) |\bx_S,x] \\ &= (\sigma^2_e + \beta^2x^2) - 2\beta^2x^2 + x^2 E[\hat{\beta}^2|\bx_S,x] \\ &= \sigma^2_e - \beta^2x^2 + x^2\Bigg(\beta^2 + \frac{\sigma^2_e}{\|\bx_S\|_2^2}\Bigg) \\ &= \sigma^2_e\Bigg( 1 + \frac{x^2}{\|\bx_S\|_2^2} \Bigg) \end{align*}\]The conditional expectation of the MSE is the irreducible error (\(\sigma^2_e\)) scaled by a factor that grows with \(x^2\) and shrinks with \(\|\bx_S\|_2^2 \approx N_S \sigma_{S}^2\). This result makes sense: the error depends on how far \(x\) is from zero and on the number of training observations. Because we are assuming the domains are Gaussian, \(x^2/ \sigma_{x}^2 \sim \chi_{NC}^2(1,\lambda)\) has a non-central chi-squared distribution with one degree of freedom and non-centrality parameter \(\lambda=(\mu_x/\sigma_x)^2\), whereas \(\|\bx_S\|_2^2/\sigma^2_S\) has a (central) chi-squared distribution with \(N_S\) degrees of freedom. Interestingly, the ratio of these two random variables, each divided by its degrees of freedom, has a non-central F-distribution with 1 and \(N_S\) degrees of freedom and non-centrality parameter \((\mu_x/\sigma_x)^2\).
\[\begin{align*} \frac{\sigma^2_x}{N_S \sigma^2_S} \underbrace{\Bigg( \frac{(x^2/\sigma^2_x)}{(\|\bx_S\|_2^2/\sigma^2_S)/N_S} \Bigg)}_{F_{NC}(1,N_S,(\mu_x/\sigma_x)^2)} \end{align*}\]The known expected value of this non-central F random variable, \(N_S(1+(\mu_x/\sigma_x)^2)/(N_S-2)\), makes the unconditional expectation easy to calculate.
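As a quick check on that expectation (a standard property of the non-central F distribution), the snippet below compares scipy's built-in mean of a non-central F random variable against the closed form, for an assumed \(N_S\) and non-centrality parameter.

```python
from scipy import stats

N_S = 50                    # assumed source sample size
lam = (3.0 / 1.0) ** 2      # assumed non-centrality (mu_x / sigma_x)^2

print(stats.ncf.mean(1, N_S, lam))   # non-central F with dfn=1, dfd=N_S, nc=lam
print(N_S * (1 + lam) / (N_S - 2))   # closed form used in the text
```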
\[\begin{align} E[\MSE(\hat{f_S})] &= E_{\bx_S,x}[E[\MSE(\hat{f_S})|\bx_S,x]] \nonumber \\ &= \sigma^2_e\Bigg( 1 + \frac{\sigma^2_x}{\sigma^2_S}\cdot \frac{1+(\mu_x/\sigma_x)^2}{N_S-2}\Bigg) \tag{2}\label{eq:err2} \end{align}\]Equation \eqref{eq:err2} shows that the expected error of the linear predictor grows linearly with the squared mean of the target distribution, since \((\sigma^2_x/\sigma^2_S)(1+(\mu_x/\sigma_x)^2) = (\mu_x^2+\sigma^2_x)/\sigma^2_S\). This is quite different from the inference case! A motivating simulation is shown below.
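A minimal sketch of such a simulation (Python, with assumed parameter values for \(\beta\), \(\sigma_e\), \(N_S\), \(\mu_T\), and the variances) might look like the following: it redraws both the source training set and a single target point, averages the squared prediction error, and compares against equation \eqref{eq:err2}.

```python
import numpy as np

rng = np.random.default_rng(4)

# assumed parameter values for illustration
beta, sigma_e = 2.0, 1.0
N_S, sigma_S = 50, 1.0           # source sample size and covariate sd
mu_T, sigma_T = 3.0, 1.0         # target covariate mean and sd
nsim = 200_000

sq_err = np.empty(nsim)
for i in range(nsim):
    x_S = rng.normal(0.0, sigma_S, N_S)               # fresh source covariates
    y_S = beta * x_S + rng.normal(0.0, sigma_e, N_S)  # fresh source responses
    bhat = (x_S @ y_S) / (x_S @ x_S)                  # refit the slope on the source
    x_new = rng.normal(mu_T, sigma_T)                 # a single target point
    y_new = beta * x_new + rng.normal(0.0, sigma_e)
    sq_err[i] = (y_new - bhat * x_new) ** 2           # squared prediction error

mse_mc = sq_err.mean()
mse_theory = sigma_e**2 * (1 + (sigma_T**2 / sigma_S**2)
                           * (1 + (mu_T / sigma_T)**2) / (N_S - 2))  # equation (2)
print(mse_mc, mse_theory)
```

Increasing \(\mu_T\) while holding everything else fixed pushes the Monte Carlo MSE up in line with equation \eqref{eq:err2}, while the MSE of \(\hat\beta\) itself is unaffected.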
Why is there this discrepancy between the inference and prediction cases? Because prediction requires extrapolation: the farther \(x_T\) is from the source observations, the larger the squared error produced by even a small error in the estimate of \(\beta\). Consider landing a rocket on the moon versus on a planet in the Andromeda galaxy. The slightest error in the trajectory of a rocket headed to a galaxy 2.5 million light years away means that the rocket will almost surely miss its mark, and by a huge margin, whereas the same margin of error for a moon-landing mission could be effectively irrelevant.
References
1. The fitting procedure of \(f\) could also be random (if it used stochastic gradient descent, for example), but we will assume a deterministic algorithm. ↩
2. Recall that in any ML setup, any filters applied to the training data must be kept and applied identically to the test data. For example, if a PCA decomposition is performed, the eigenvector weights must be remembered from the training stage and applied to the test data. In this case, all that is done is that the source mean is remembered. ↩
3. Technically this de-meaning is itself a noisy process and should therefore impact our downstream calculations, but we will ignore this issue in this post. ↩