#machine-learning #bias-variance #statistics #regression #model-selection #ensemble-learning #regularization #frequentist #blog_post

One of the first things you learn when entering the world of machine learning is the idea of bias vs. variance. The bias-variance tradeoff is one of the most frequently discussed concepts in machine learning because of its important role in model selection. Despite this, most online resources avoid exploring the finer details, opting instead to rely on intuitions aided by charts. While this is effective for learning the basic ideas, it fails to capture the essence of what bias and variance really represent. In this article, my goal is to help the reader develop a mathematical understanding of the bias-variance tradeoff. I hope to achieve this by deriving the concepts from first principles and then expanding on them with clear examples.

# Bias and variance as products of error

Imagine that you are trying to fit a function $f(x)$ with a given algorithm for a regression problem where we want to minimize a loss function. If you had unlimited data (and computational resources), you could in theory fit a function $h(x)$ that is arbitrarily close to the ground truth; such a function would be the optimal choice for $f(x)$.

Suppose instead that we have an ensemble of datasets $\mathbb{D}$ containing many datasets $D$, each of size $N$. For a given dataset $D_{i}$ we can fit a function $f(x;D_{i})$ that we can use to run inference on a different dataset in $\mathbb{D}$. Each function $f(x;D_{i})$ that we fit will be different, and will obtain different errors on its own training dataset $D_{i}$ and on other datasets $D_{j}$ where $i \neq j$. Following a frequentist approach, in order to measure the performance of our algorithm we need to evaluate the performance of the different functions $f(x;D)$ by taking their average over the ensemble of datasets.

Consider the squared error, at an input $\mathrm{x}$, of a model trained on a particular dataset $D'$:

$(f(x;D')-h(x))^2$

Because the error from this expression depends on $D'$, we want to take its average over the ensemble of datasets $\mathbb{D}$. We define $\mathbb{E}_{\mathbb{D}}[f(x;D)]$ as the consensus prediction from training on different datasets, which for a finite ensemble can be written as:

$\mathbb{E}_{\mathbb{D}}[f(x;D)] = \frac{ \sum_ {D' \in \mathbb{D} } f(x;D') } {| \mathbb{D} |}$

Adding and subtracting $\mathbb{E}_{\mathbb{D}}[f(x;D)]$ inside our original error expression and expanding the square, we obtain

$\begin{aligned} & \left\{f(\mathrm{x} ; D')-\mathbb{E}_{\mathbb{D}}[f(\mathrm{x} ; D)]+\mathbb{E}_{\mathbb{D}}[f(\mathrm{x} ; D)]-h(\mathrm{x})\right\}^2 \\ &= \left\{f(\mathrm{x} ; D')-\mathbb{E}_{\mathbb{D}}[f(\mathrm{x} ; D)]\right\}^2 + \left\{\mathbb{E}_{\mathbb{D}}[f(\mathrm{x} ; D)]-h(\mathrm{x})\right\}^2 \\ &\quad + 2\left\{f(\mathrm{x} ; D')-\mathbb{E}_{\mathbb{D}}[f(\mathrm{x} ; D)]\right\}\left\{\mathbb{E}_{\mathbb{D}}[f(\mathrm{x} ; D)]-h(\mathrm{x})\right\} \end{aligned}$

We now take the expectation of this expression with respect to our ensemble of datasets $\mathbb{D}$.
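Writing out this expectation term by term (this is just linearity of expectation applied to the expansion above; the particular dataset $D'$ now plays the role of the random dataset $D$ we average over):

$\begin{aligned} \mathbb{E}_{\mathbb{D}}\left[\{f(\mathrm{x};D)-h(\mathrm{x})\}^2\right] =\ & \mathbb{E}_{\mathbb{D}}\left[\left\{f(\mathrm{x};D)-\mathbb{E}_{\mathbb{D}}[f(\mathrm{x};D)]\right\}^2\right] + \left\{\mathbb{E}_{\mathbb{D}}[f(\mathrm{x};D)]-h(\mathrm{x})\right\}^2 \\ & + 2\,\mathbb{E}_{\mathbb{D}}\left[\left\{f(\mathrm{x};D)-\mathbb{E}_{\mathbb{D}}[f(\mathrm{x};D)]\right\}\left\{\mathbb{E}_{\mathbb{D}}[f(\mathrm{x};D)]-h(\mathrm{x})\right\}\right] \end{aligned}$

The second term contains no dependence on $D$, so the expectation leaves it unchanged.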
Starting with the cross term: since the factor $\left\{\mathbb{E}_{\mathbb{D}}[f(\mathrm{x} ; D)]-h(\mathrm{x})\right\}$ does not depend on $D$, it can be pulled out of the expectation:

$\begin{aligned} & 2\,\mathbb{E}_{\mathbb{D}}\left[\left\{f(\mathrm{x} ; D)-\mathbb{E}_{\mathbb{D}}[f(\mathrm{x} ; D)]\right\}\left\{\mathbb{E}_{\mathbb{D}}[f(\mathrm{x} ; D)]-h(\mathrm{x})\right\}\right] \\ =\ & 2\left\{\mathbb{E}_{\mathbb{D}}[f(\mathrm{x} ; D)]-h(\mathrm{x})\right\}\mathbb{E}_{\mathbb{D}}\left[f(\mathrm{x} ; D)-\mathbb{E}_{\mathbb{D}}[f(\mathrm{x} ; D)]\right] \end{aligned}$

The remaining expectation vanishes:

$\mathbb{E}_{\mathbb{D}}\left[f(\mathrm{x} ; D)-\mathbb{E}_{\mathbb{D}}[f(\mathrm{x} ; D)]\right] = \mathbb{E}_{\mathbb{D}}[f(\mathrm{x} ; D)]-\mathbb{E}_{\mathbb{D}}[f(\mathrm{x} ; D)] = 0$

so the cross term is zero. We are therefore left with only the first two terms, and we get:

$\begin{aligned} & \mathbb{E}_{\mathbb{D}}\left[\{f(\mathrm{x} ; D)-h(\mathrm{x})\}^2\right] \\ & =\underbrace{\left\{\mathbb{E}_{\mathbb{D}}[f(\mathrm{x} ; D)]-h(\mathrm{x})\right\}^2}_{(\text{bias})^2}+\underbrace{\mathbb{E}_{\mathbb{D}}\left[\left\{f(\mathrm{x} ; D)-\mathbb{E}_{\mathbb{D}}[f(\mathrm{x} ; D)]\right\}^2\right]}_{\text{variance}} \end{aligned}$

We define the first term as the squared bias and the second one as the variance. Now we will discuss what these terms mean.

# What is bias?

## *First Principles Meaning*

The word *bias* generally refers to a systematic preference or inclination towards something. In statistics and machine learning, it describes how much an estimator systematically deviates from the true value.

## Why bias?

Imagine that you are an archer trying to hit the center of a target, but your shots keep landing a little to the left; you could say your shots are biased to the left. Similarly, if your model underrepresents the complexity of the problem by returning similar outputs for different inputs, you could say that your model is biased towards simple outputs.

## Mathematical formula for Bias

Bias is the difference between the expected value of our estimator and the true value:

$\text{Bias} = \mathbb{E}_{\mathbb{D}}[f(\mathrm{x} ; D)]-h(\mathrm{x})$

The first part of this expression can be thought of as the average prediction of the models fitted over the datasets of $\mathbb{D}$, so we can conceptualize the squared bias as the squared distance between the best possible function and the average prediction across different training datasets. Integrating over the inputs, for our ensemble $\mathbb{D}$ we have

$\text{Bias}^2 = \int{\{\mathbb{E}_{\mathbb{D}}[f(\mathrm{x} ; D)]-h(\mathrm{x})\}^2}\space dx$

where:
- $h(x)$ is the true function,
- $\mathbb{E}_{\mathbb{D}}[f(\mathrm{x} ; D)]$ represents the average prediction over different training datasets.
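To make the formula concrete, here is a minimal sketch (not the code linked at the end of this post) of how the squared bias could be estimated numerically: average the predictions across the ensemble and take the mean squared distance to the true function on a grid, as a discrete stand-in for the integral. The function name and array shapes are illustrative assumptions.

```python
import numpy as np

def squared_bias(predictions: np.ndarray, true_values: np.ndarray) -> float:
    """Discrete estimate of Bias^2 over a grid of inputs.

    predictions: shape (num_datasets, num_grid_points), one row of
                 predictions f(x; D) per training dataset D.
    true_values: shape (num_grid_points,), the true function h(x) on the grid.
    """
    average_prediction = predictions.mean(axis=0)   # E_D[f(x; D)] at each x
    return float(np.mean((average_prediction - true_values) ** 2))
```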
## What does bias look like?

High bias is associated with underfitting the data; this generally means that your model lacks expressivity, which causes it to return similar outputs for differing inputs. In regression, this might look like a model that predicts a line while you are trying to fit a curve. In classification, high bias might be reflected in a model that always predicts the majority class.

# What is variance?

## *First Principles Meaning*

The word **variance** refers to how much something changes or fluctuates. In statistics, it measures how much a value deviates from its mean.

## Why variance?

Imagine shooting at the same target on different days. On calm days you hit near the center; on windy days your shots are scattered everywhere. Your technique is too sensitive to small changes in conditions, which means your aim is highly **variable**.

In machine learning, a high-variance model is too sensitive to small changes in its input; this happens because the model has overfitted the data and is capturing noise instead of the general patterns.

## Mathematical formula for Variance

You can think of variance as the spread between an individual fitted function and the average across all fitted functions; mathematically, it is the expected squared deviation of the estimator from its mean, integrated over the inputs:

$\text{variance} = \int \mathbb{E}_{\mathbb{D}}\left[\left\{f(\mathrm{x} ; D)-\mathbb{E}_{\mathbb{D}}[f(\mathrm{x} ; D)]\right\}^2\right] dx$

where:
- $f(\mathrm{x} ; D)$ is the fitted function for a dataset $D$,
- $\mathbb{E}_{\mathbb{D}}[f(\mathrm{x} ; D)]$ represents the average prediction over different training datasets.

## What does variance look like?

High variance is associated with overfitting the training data; this means the model has fitted the noise in the data. This can result in the model scoring perfectly on its own training dataset while underperforming on testing data. In regression, this might lead to jagged, erratic prediction curves that swing wildly between data points. In classification tasks, high variance might result in unstable decision boundaries where similar inputs receive wildly different classifications.

# Visualizing Bias and Variance

As an example, we present the following experiment: first we create an ensemble of 100 datasets where our features are within $[0,1]$ and our target is the function $\sin(2\pi x)$ with added noise distributed as $N(0,0.25)$; each dataset consists of 25 samples. Here's what one of these datasets looks like:

![[example4.png]]

For our regression model we will fit a polynomial function of degree 15, $f(x)$, where

$f(x)= a_{15}x^{15}+ a_{14}x^{14}+ a_{13}x^{13}+\dots+ a_{1}x+ a_{0}$

and $a_{i}$ for $i \in \{0,1,\dots,15\}$ is a learnable parameter. For our loss function we will use the squared error with L2 regularization:

$\frac{1}{2} \sum^{N}_{n=1} (f(x_{n})-y_{n})^{2}+ \frac{\lambda}{2} \lVert \mathbf{a} \rVert^{2}$

where $\mathbf{a}$ is the vector containing all learnable parameters and $\lambda$ is a hyperparameter. If you are not familiar with L2 regularization, it's a regularization technique used to avoid overfitting. It works by incorporating the size of the model's weights into the optimization objective: the larger $\lambda$ becomes, the more heavily large weights are penalized, which in practice means that increasing $\lambda$ reduces overfitting.

With these in mind, let's first fit our model across our ensemble using $\lambda=0$.

![[example6.png]]

As you can see, without any regularization our models have high variance, with many of the functions finely tuning themselves to the noise and not just to the sine function. We can tell because, in general, a function fitted on a given dataset looks noticeably different from a function fitted on a different dataset. Note, however, that despite this, the average of the high-variance models results in a very good fit to the original sine function; ML techniques that leverage this behavior to create high-quality predictions are referred to as **ensemble learning**. For example, the random forest model works by training decision trees (a type of model prone to overfitting) on multiple different datasets and then averaging the final output.

![[example7.png]]

With $\ln(\lambda)=-6$, we see that the average across all functions is a slightly worse fit, but the individual functions are more stable.
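Before looking at stronger regularization, here is a minimal sketch of how this kind of ensemble experiment, and the discretized bias² and variance estimates reported below, could be set up with NumPy and scikit-learn. It is an assumption-laden stand-in for the actual script linked at the end of the post: the seed, the use of `Ridge` and `PolynomialFeatures`, the interpretation of 0.25 as the noise standard deviation, and the mapping of $\lambda$ onto `alpha` may all differ from the original code, so the printed numbers will not exactly reproduce the table.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
n_datasets, n_samples, degree, lam = 100, 25, 15, np.exp(-6)

# Grid used for the discrete version of the bias^2 and variance integrals.
x_grid = np.linspace(0, 1, 100)
h_grid = np.sin(2 * np.pi * x_grid)                     # true function h(x)

poly = PolynomialFeatures(degree=degree, include_bias=False)
grid_features = poly.fit_transform(x_grid.reshape(-1, 1))

predictions = []
for _ in range(n_datasets):
    # One dataset D: 25 samples of sin(2*pi*x) plus Gaussian noise.
    x = rng.uniform(0, 1, n_samples)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.25, n_samples)

    # Degree-15 polynomial fitted with an L2 (ridge) penalty.
    # Note: Ridge's alpha does not exactly match the lambda/2 convention above.
    model = Ridge(alpha=lam)
    model.fit(poly.transform(x.reshape(-1, 1)), y)
    predictions.append(model.predict(grid_features))

predictions = np.asarray(predictions)                   # shape (100, 100)
average_prediction = predictions.mean(axis=0)           # E_D[f(x; D)]

bias_squared = np.mean((average_prediction - h_grid) ** 2)
variance = np.mean((predictions - average_prediction) ** 2)
print(f"bias^2 ~ {bias_squared:.4f}, variance ~ {variance:.4f}")
```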
A model with this intermediate amount of regularization is likely the one you would select if you had to pick a single model trained on an individual dataset.

![[example5.png]]

For $\ln(\lambda)=-2$, we can see that the bias is completely out of control and that our model is now incapable of representing our target function. This is surprising, since a degree-15 polynomial has enormous flexibility - even a degree-3 polynomial can create S-curves. Despite this expressive power, we now see the model failing to represent the training data because the optimization objective encourages a simpler model. If we were to increase $\lambda$ even further, our model would eventually converge to a horizontal line equal to the mean of the targets, as the simplest possible polynomial has its constant term $a_{0}$ set to the mean and every other coefficient set to 0, meaning $f(x) = \text{constant}$.

Since we have formulas for both **Bias** and **Variance**, we can also calculate their values in order to validate our assumptions numerically. Note, however, that these numbers were calculated using the discrete version of the integrals over 100 points across the interval (0,1). For more details you can **derive the formula yourself** or check the [code](https://github.com/aysaac/blog/blob/master/understanding_bais_variance/script.py) that was used for this experiment.

| Regularization | Bias²  | Variance |
| -------------- | ------ | -------- |
| λ = 0          | 0.0002 | 0.0276   |
| ln(λ) = -6.0   | 0.0044 | 0.0041   |
| ln(λ) = -2.0   | 0.0710 | 0.0028   |

Overall, our assumptions were correct. It's important to remark that, despite how cool™ the **Bias** and **Variance** decomposition is, there is very little practical value in doing these calculations, as they rely on averages taken across multiple datasets; in real life, combining them into one large dataset would result in a better model. Nevertheless, the decomposition is very cool™ and provides insight into model complexity and model fit; besides, while the example uses regression, the intuition derived from it transfers well across different problems.

# Acknowledgements

The idea for this blog was inspired by Chapter 4 of the [book](https://www.bishopbook.com/). The experiment also comes from chapter 4 of the same book.