The Bias-Variance Tradeoff
I originally wrote this as a standalone, but have updated it to tie in with notes on the Linear Regression Assumptions. There’s a nice connection between certain assumptions (and the consequences of their violation) and the tradeoff between bias and variance in statistics and machine learning, and I wanted to make that connection more explicit.
The bias-variance tradeoff, in short, shows that when fitting a statistical model we can decompose the expected squared prediction error into a squared bias component, a variance component, and an irreducible noise term (the derivation of this can be found on the Wikipedia page). So our measurement of predictive error can be attributed to those components. The challenge comes from the fact that bias tends to decrease with model complexity, while variance tends to increase with it.
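As a reminder of what that decomposition looks like, here is the standard pointwise version, with \(f\) the true function, \(\hat{f}\) our fitted model, and \(\sigma^2\) the noise variance:

\[
\mathbb{E}\big[(Y - \hat{f}(x))^2\big] = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}\Big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\Big]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{irreducible error}}
\]

The expectation is taken over both the noise and the training samples we could have drawn, which is why the variance term captures how much the fitted model moves from sample to sample.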
I’ve often seen the tradeoff illustrated by increasing the degree of a polynomial fit to data: the higher the degree, the more closely the function tracks the training data. But I wanted to motivate the idea a little more through the lens of linear regression. When we build a standard linear model, we can throw in a lot of variables to explain as much of the variance in our response variable as we can. But that might not be a good idea (and in fact, beyond a certain point, it’s almost certainly going to be a bad idea!).
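To make the polynomial version concrete, here’s a minimal sketch in Python. The sine data-generating function, noise level, and degrees are all made up for illustration; the point is just that training error keeps falling as the degree grows, while test error eventually gets worse.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data-generating process: a smooth function plus noise.
def f(x):
    return np.sin(2 * np.pi * x)

n = 30
x_train = rng.uniform(0, 1, n)
y_train = f(x_train) + rng.normal(0, 0.3, n)
x_test = rng.uniform(0, 1, 200)
y_test = f(x_test) + rng.normal(0, 0.3, 200)

for degree in [1, 3, 9]:
    # Fit a polynomial of the given degree to the training data.
    coefs = np.polyfit(x_train, y_train, deg=degree)
    train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```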
Given that we don’t know the correct specification for the linear model \(Y= \beta X + \epsilon\), we need to find the balance of bias and variance that allows us to obtain the most accurate prediction. One way we can misspecify the model is to omit relevant variables. Omitting a relevant variable induces bias, because we’re now capturing some of the effect of the omitted variable in the coefficient on the included variable. If the included and excluded variables are uncorrelated, then the omitted variable won’t bias the coefficient; in that case, the only loss is the explanatory power we would have gained from another way to explain the response variable Y. But most variables in observational data will have some correlation. This becomes even more complex when we have multiple included and omitted variables, because we now have to consider the relationships between all of the included and excluded variables.
There’s a derivation of the formula for the bias induced by an omitted variable in section 2 of this handout.
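The standard result is that, with a true model \(Y = \beta_{1}X_{1} + \beta_{2}X_{2} + \epsilon\) and \(X_{2}\) left out, the OLS estimate of \(\beta_{1}\) picks up an extra term \(\beta_{2}\,\mathrm{Cov}(X_{1},X_{2})/\mathrm{Var}(X_{1})\). Here’s a minimal simulation sketch to check that; the coefficients and correlation below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Hypothetical data-generating process with correlated regressors.
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=n)  # correlated with x1
y = 2.0 * x1 + 1.5 * x2 + rng.normal(size=n)   # true beta1 = 2.0, beta2 = 1.5

# "Short" regression: omit x2 and regress y on x1 (plus an intercept).
X_short = np.column_stack([np.ones(n), x1])
beta_short = np.linalg.lstsq(X_short, y, rcond=None)[0]

# Expected bias: beta2 * Cov(x1, x2) / Var(x1) = 1.5 * 0.8 = 1.2 here,
# so the estimate should come out near 3.2 instead of 2.0.
print("estimated coefficient on x1:", beta_short[1])
```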
A second source of misspecification that causes problems when applying the linear model is the inclusion of too many variables. This relates to the multicollinearity assumption in the post here, and specifically to the idea of adding highly correlated features to our model. When we do so, we increase the variance of our coefficient estimates and make them more sensitive to small changes in the data. From the perspective of inference, we will be less likely to correctly reject a null hypothesis, because standard errors are inflated. In the context of prediction, our out-of-sample performance is likely to be weaker, because our estimates depend more heavily on the specific sample we trained on.
Let’s suppose we have a model of the form \(Y = \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} + \epsilon\), where \(X_{1}\) and \(X_{2}\) are highly collinear. We can imagine that (plot to come) the data points will lie in a broadly cylindrical region around a line in the feature space. Because most of the data is concentrated in that narrow region, the plane of best fit is not anchored the way it would be if the features had low correlation: it can pivot around that line with little change in fit. Thus, the variance of our coefficient estimates will be larger in the case with highly correlated features.
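A rough simulation sketch of that effect (the correlation level, coefficients, and sample size are all arbitrary choices for illustration): draw many samples, fit OLS each time, and compare how much the estimate of \(\beta_{1}\) moves when \(X_{1}\) and \(X_{2}\) are highly correlated versus uncorrelated.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 100, 2000

def coef_std(rho):
    """Std. dev. of the OLS estimate of beta1 across repeated samples,
    with corr(X1, X2) approximately equal to rho."""
    estimates = []
    for _ in range(reps):
        x1 = rng.normal(size=n)
        x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)
        y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)
        X = np.column_stack([np.ones(n), x1, x2])
        beta = np.linalg.lstsq(X, y, rcond=None)[0]
        estimates.append(beta[1])
    return np.std(estimates)

print("sd of beta1 estimate, rho = 0.0: ", coef_std(0.0))
print("sd of beta1 estimate, rho = 0.95:", coef_std(0.95))
```

With \(\rho = 0.95\) the spread of the estimates comes out roughly three times larger, which matches the usual variance inflation factor of \(1/(1-\rho^{2})\).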
The general approach to dealing with the bias-variance tradeoff is to accept a little bias in return for a reduction in variance. This is the basis of regularization (whether LASSO, Ridge, or some other technique). While I think I will try to make some notes on these in a future post, the purpose of these methods is to remove features or shrink their coefficients toward zero, in order to remove or reduce their influence on our estimates. This often has the effect of improving the out-of-sample MSE of the model relative to unregularized least squares.
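As a small illustration of that idea, here’s a sketch of ridge regression via its closed form, \(\hat{\beta} = (X^{\top}X + \lambda I)^{-1}X^{\top}y\), applied to collinear data like the example above. The penalty \(\lambda\) here is an arbitrary value for illustration, not a tuned choice.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 60

# Two highly correlated features, as in the earlier example.
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + np.sqrt(1 - 0.95**2) * rng.normal(size=n)
X = np.column_stack([x1, x2])
y = 1.0 * x1 + 1.0 * x2 + rng.normal(size=n)

# Ridge: shrink coefficients toward zero by penalizing their squared size.
lam = 5.0  # arbitrary illustrative penalty; in practice chosen by cross-validation
ols = np.linalg.solve(X.T @ X, X.T @ y)
ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)

print("OLS coefficients:  ", ols)    # noisy, can move a lot between samples
print("Ridge coefficients:", ridge)  # pulled toward zero, more stable across samples
```

Across repeated samples the ridge estimates move around much less than the OLS ones, at the cost of being pulled slightly away from the true coefficients, which is exactly the bias-for-variance trade described above.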
Greene has a treatment of these ideas (bias and variance in the linear model) in his textbook on econometrics. The relevant chapter is here, sections 4.3.2 and 4.3.3.
As always, I appreciate any corrections or feedback to feedback@finlaymcalpine.com