AN INTUITIVE SUMMARY OF MODEL SELECTION — PART 2: BIAS VARIANCE TRADE-OFF

DataScience  ·  May 16, 2021



 

Hello everyone,

After some delay, let’s continue the series as planned with the topic of the bias-variance trade-off. In the first part, in a nutshell, we said that you have to pay for your lunch, right? No? Okay, not “LUNCH” lunch.. We said that there is no algorithm that is generically superior to all others and that you should never FALL IN LOVE with an algorithm. Algorithms don’t have FANS ONLY accounts, you know.. 😊

Now let’s continue from where we left off in the first part. We said okay, we do not favor one algorithm over another and we love all our children equally, but then how could a deep neural network perform worse than a single-layer one? Or how could augmenting the number of features (e.g. using basis functions) in linear regression yield worse performance than before?

MORE VISUALLY


How could this be worse than this??

Or, on the other hand, why can we not even come close to our desired performance?

 

The answer to the former is HIGH VARIANCE and to the latter HIGH BIAS. Before briefly and intuitively explaining these phenomena, let’s squeeze in a little formula that should be easy enough to understand.

Assume that we have a dataset D with N samples, each observed as t = f(x) + ε, where ε is zero-mean noise. We are trying to find a model y(x) that approximates, as well as possible, the true function f(x) that generated our data. The expected squared error on an unseen sample x can then be split into 3 terms: two deterministic components plus the noise generated by the data itself:

E[(t − y(x))²] = (E_D[y(x; D)] − f(x))² + E_D[(y(x; D) − E_D[y(x; D)])²] + Var[t]
              = Bias² + Variance + Irreducible Error
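To make the decomposition concrete, here is a minimal Monte Carlo sketch (the true function sin(2πx), the noise level, and the sample sizes are my own illustrative choices, not from this post): we fit a straight line to many fresh training sets and estimate the three terms at a single unseen point.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):                      # hypothetical true function (my choice)
    return np.sin(2 * np.pi * x)

sigma = 0.3                    # noise std, so Var[t] = sigma**2
x0 = 0.25                      # a fixed unseen input
n_train, n_trials = 20, 2000

preds = np.empty(n_trials)
for i in range(n_trials):
    x = rng.uniform(0, 1, n_train)             # a fresh realization of D
    t = f(x) + rng.normal(0, sigma, n_train)   # noisy observations
    coefs = np.polyfit(x, t, 1)                # fit a straight line
    preds[i] = np.polyval(coefs, x0)

bias_sq  = (preds.mean() - f(x0)) ** 2   # (E_D[y(x0)] - f(x0))^2
variance = preds.var()                   # E_D[(y(x0) - E_D[y(x0)])^2]
noise    = sigma ** 2                    # irreducible Var[t]

# expected squared error at x0, estimated directly
mse = np.mean((preds - f(x0)) ** 2) + noise
print(bias_sq, variance, noise, mse)     # the three terms add up to the MSE
```

Running this, the sum of the three estimated terms matches the directly estimated expected squared error, which is exactly what the formula above claims.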

 

The 3 terms represent the following:

IRREDUCIBLE ERROR

The Var[t] term is the irreducible error, i.e. the intrinsic noise present in the data itself; no choice of model can remove it.

BIAS


Bias represents how far the true model is from our hypothesis space. We can put “hypothesis space” in more human words as the expressive capacity of our model. For example, if our algorithm is linear regression and we use only the raw features (so no basis functions), we can only fit straight lines to our data; our hypothesis space contains only these lines. If the true function is actually more like a 3rd-order polynomial, we can never find it within our hypothesis space. Bias measures how close we can get to the true function with our choice of hypothesis space. Hence, if our hypothesis space is too small, the bias will be high. So, in the latter case from the beginning, this is the reason why we cannot achieve the desired performance with our model.

NOTE that the bias term cannot be measured, since WE DO NOT KNOW the true model; THAT IS WHAT WE ARE SEARCHING FOR. But we know that it decreases with more complex models, i.e. with an increased hypothesis space. Simple logic: the larger the hypothesis space, the higher the probability that the true model is in that space. So why don’t we keep it as large as possible? That is where the variance and the trade-off come in.
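As a small illustration of high bias (the cubic true function, noise level, and sample sizes here are my own made-up choices): if the true function is a cubic but the hypothesis space contains only straight lines, throwing more data at the problem does not reduce the error — the bias stays.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):                       # hypothetical cubic "true" function
    return x ** 3 - x

errors = {}
for n in (50, 500, 5000):       # growing training-set sizes
    x = rng.uniform(-2, 2, n)
    t = f(x) + rng.normal(0, 0.1, n)
    line = np.polyfit(x, t, 1)  # hypothesis space: straight lines only
    grid = np.linspace(-2, 2, 401)
    # squared distance to the true function, averaged over the input range
    errors[n] = float(np.mean((np.polyval(line, grid) - f(grid)) ** 2))

print(errors)                   # stays large: more data cannot shrink bias
```

The error barely moves between 50 and 5000 samples, because no straight line is close to x³ − x; only enlarging the hypothesis space can help here.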

VARIANCE


Note that the expectations above are taken over different realizations of the training set D, i.e. different samples drawn from the same data-generating process.

Variance, on the other hand, represents how much our model differs across these different realizations. If it differs a lot on different samples that are actually generated by the same true function f(x), it means that our model fits the noise as well: it will perform very well on the training set, since it tries to fit every point, and poorly on the unseen test set. In other words, this time our hypothesis space is too large. This is the reason for the former problem discussed at the beginning. Variance can be reduced with more samples, which will be discussed in Part 3 together with PAC-Learning and the VC-Dimension. For now, you can recall that the sample mean approximates the true mean better as the sample size increases.
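A quick sketch of this effect (the function, degrees, and sizes are illustrative assumptions of mine): fit a low-degree and a high-degree polynomial to many different realizations of the same data and compare how much their predictions wobble.

```python
import numpy as np

rng = np.random.default_rng(2)

def f(x):                               # hypothetical true function
    return np.sin(2 * np.pi * x)

n_train, n_trials = 15, 500
grid = np.linspace(0.05, 0.95, 19)      # points where we compare predictions
preds = {1: [], 9: []}                  # degree-1 vs degree-9 hypothesis space

for _ in range(n_trials):
    x = rng.uniform(0, 1, n_train)              # a fresh realization of D
    t = f(x) + rng.normal(0, 0.2, n_train)      # same f(x), fresh noise
    for deg in preds:
        coefs = np.polyfit(x, t, deg)
        preds[deg].append(np.polyval(coefs, grid))

var_simple  = np.mean(np.var(preds[1], axis=0))   # small hypothesis space
var_complex = np.mean(np.var(preds[9], axis=0))   # large hypothesis space
print(var_simple, var_complex)          # the complex model varies far more
```

The degree-9 model chases the noise in each particular sample, so its predictions swing wildly from one training set to the next, while the straight line barely moves — high variance versus low variance in action.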

 

So the problems we are left with are:

“How do we make the trade-off to get a good performance?”

“How do we choose our Model Complexity, in other words the size of our hypothesis space, so that we can hit the target?”


“Do we use the train error?”

“Do we use the test error? HINT: NO NO NO!”

There are several methods to manage the bias-variance trade-off, which we will present in the last part after we take a look at the relation between sample size and variance in Part 3. But for now, I can leave the main topics here:

  • Model Selection
      • Feature Selection
      • Regularization
      • Dimension Reduction
  • Model Ensemble
      • Bagging
      • Boosting
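As a small teaser for the regularization item above (a hedged sketch with made-up numbers, not the method of any particular library): adding an L2 penalty to a flexible polynomial fit shrinks its variance across training sets, which is exactly how regularization buys us a better spot in the trade-off.

```python
import numpy as np

rng = np.random.default_rng(3)

def f(x):                                  # hypothetical true function
    return np.sin(2 * np.pi * x)

def ridge_fit(x, t, degree, lam):
    """Polynomial least squares with an L2 penalty on the weights."""
    X = np.vander(x, degree + 1)           # columns x^degree ... x^0
    # closed-form ridge solution: (X^T X + lam * I)^(-1) X^T t
    return np.linalg.solve(X.T @ X + lam * np.eye(degree + 1), X.T @ t)

grid = np.linspace(0.05, 0.95, 19)
preds = {0.0: [], 1e-3: []}                # lam = 0 is plain least squares
for _ in range(500):
    x = rng.uniform(0, 1, 15)              # a fresh realization of D
    t = f(x) + rng.normal(0, 0.2, 15)
    for lam in preds:
        preds[lam].append(np.polyval(ridge_fit(x, t, 9, lam), grid))

var_unreg = np.mean(np.var(preds[0.0], axis=0))
var_reg   = np.mean(np.var(preds[1e-3], axis=0))
print(var_unreg, var_reg)                  # the penalty shrinks the variance
```

Even a tiny penalty tames the degree-9 fit: the hypothesis space is still large, but the penalty makes the extreme, noise-chasing members of it expensive to reach.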

 

Thank you for reading. If you came all the way to the end, here is a little alien..

[image: a little alien drawing]

IF YOU HAVE ANY DOUBT OR QUESTION, OR SEE ANY MISTAKE SHOOT ME AN EMAIL THROUGH THE CONTACT PAGE.

 



