AN INTUITIVE SUMMARY OF MODEL SELECTION — PART 1: NO FREE LUNCH THEOREMS

DataScience  ·  May 2, 2021


Hello everyone,

This is my first post on the blog under the Data Science category. I thought it would be nice to start with a series of posts in which I briefly and intuitively (well, I will try my best.. 😊) explain hard-to-grasp theoretical topics in Data Science that are indispensable in practice.

For the first series, I decided to dive into Model Selection, since it is one of the most important steps of the machine learning pipeline and also, I believe, the step most prone to rookie mistakes. As a first-week bonus I will start with multiple posts this week: Part 1 covers the most general concept, the “No Free Lunch Theorems”, and Part 2 the notorious “Bias–Variance Trade-off”. Later I will continue with the “RUN AND DON’T LOOK BACK” topics of “VC-Dimension” and “PAC-Learning”, and end the series with the more practical applications of different methods in “Feature Selection”, “Regularization”, “Model Ensembles” etc. that trace back to the previous theoretical aspects.

Okay, so.. Just in case, let’s put it all together and check what the Model Selection step consists of:

1. We want to estimate the generalization performance, i.e. the predictive performance of our model on future (unseen) data.

2. We want to increase the predictive performance by tweaking the learning algorithm and selecting the best performing model from a given hypothesis space.

3. We want to identify the machine learning algorithm that is best-suited for the problem at hand; thus, we want to compare different algorithms, selecting the best-performing one as well as the best performing model from the algorithm’s hypothesis space. [1]
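To make these three steps concrete, here is a minimal sketch in plain Python. Everything in it is a made-up toy (the 1-D threshold problem, the split sizes, the two candidate grids), so treat it as an illustration of the workflow, not a recipe: two “algorithms” are trained on the training set, the better model is selected on the validation set, and its generalization performance is estimated on a held-out test set.

```python
import random

random.seed(0)

# toy 1-D data: the true rule is "label 1 when x > 0.6", with ~10% label noise
data = [(x := random.random(), int(x > 0.6) ^ (random.random() < 0.1))
        for _ in range(300)]
train, val, test = data[:180], data[180:240], data[240:]

def accuracy(threshold, split):
    # accuracy of the classifier "predict 1 iff x > threshold" on a data split
    return sum(int(x > threshold) == y for x, y in split) / len(split)

def fit(grid):
    # "training": pick the threshold in the grid that best fits the training set
    return max(grid, key=lambda t: accuracy(t, train))

# two "algorithms" = two hypothesis spaces: a coarse grid and a fine grid
model_a = fit([i / 5 for i in range(1, 5)])      # thresholds 0.2, 0.4, 0.6, 0.8
model_b = fit([i / 100 for i in range(1, 100)])  # thresholds 0.01 .. 0.99

# steps 2-3: compare the trained models on the validation set, keep the winner
best = max((model_a, model_b), key=lambda m: accuracy(m, val))

# step 1: estimate generalization performance on data never touched so far
print("chosen threshold:", best)
print("estimated generalization accuracy:", accuracy(best, test))
```

Note that only the test-set number is an honest estimate of generalization performance; the validation score is optimistically biased because we used it to pick the winner.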

Seems easy enough!! But we face a lot of “Wait! NO! WHAT’S GOING ON!!” moments like:

  • Why is my model performing soo well on the training data but it sucks on the test data?
  • Why does my model suck on both training and test data?
  • Pheeww.. That’s done! I tuned everything and handed in the desired final product to the customer. Wait.. why are they pissed now? Everything looked so good on my test data when I last checked..

Aand the list goes on.. So let’s see the big WHY, step by step.

For the first part, let’s start with the No Free Lunch Theorem (this one is stated for binary classification, but its siblings tell us the same thing for the other problems in Machine Learning), which actually just tells us:

“NEVER BECOME A FAN OF AN ALGORITHM!!”

The theorem assumes that ACC_G(L) is the generalization accuracy of some learning algorithm L, in other words, the accuracy of L on unseen examples, and that F is the set of all possible hypotheses (all possible models of the learner: different complexities, different hyperparameters etc.). The theorem says that:

         For any learner L (i.e. any algorithm), in other words, independent of the learner (let it be the one that stole your heart) and of any technique, the generalization accuracy averaged over all hypotheses in F is 0.5. So, in the binary classification case, any technique, on average, is actually no better than RANDOM GUESSING.

WELL OKAY THEN, END OF DISCUSSION. LET’S CLOSE THE COMPANY AND GO. BYE, BYE..

OF COURSE NOT! The trick is in the words “on average”. It means that every machine learning task is based on a bet: we are hoping that the true hypothesis behind our problem is not just an arbitrary member of the full space of possible hypotheses. Because if it is, then no matter the size of our dataset and no matter which algorithm we use, we have no chance of doing better than random guessing!

The bright side is that this generic formulation of the theorem assumes all possible hypotheses are equally likely, meaning that what you see in the training set is useless for predicting what you haven’t seen, since every function is equally likely to be the true one..
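We can actually verify this averaging claim with a tiny brute-force experiment (my own toy sketch, not the theorem’s original proof): enumerate every possible binary labeling of a 5-point domain, let some fixed learner observe the first 3 labels, and average its accuracy on the 2 unseen points over all labelings.

```python
from itertools import product

DOMAIN = 5     # five input points
OBSERVED = 3   # the learner only sees the labels of the first three

def learner(train_labels):
    # any fixed rule works the same on average; here, a "majority vote" guess
    guess = 1 if sum(train_labels) * 2 > len(train_labels) else 0
    return (guess,) * (DOMAIN - OBSERVED)

total_acc = 0.0
targets = list(product([0, 1], repeat=DOMAIN))  # all 2^5 = 32 possible "true" labelings
for target in targets:
    preds = learner(target[:OBSERVED])
    unseen = target[OBSERVED:]
    total_acc += sum(p == y for p, y in zip(preds, unseen)) / len(unseen)

average = total_acc / len(targets)
print(average)  # → 0.5, exactly
```

Swap in any deterministic learner you like: as long as every labeling is counted equally, the average stays exactly 0.5, which is the whole point of the theorem.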

More intuitively: imagine an illustration where the black X points are the observed data points and the red one is the unseen data point we want to predict. Consider all the hypotheses that agree with the observed values up to the point to be predicted; at that point they can still take either possible value, ZERO or ONE. So half of the consistent hypotheses will say ZERO and half will say ONE.

What we are really saying while doing Machine Learning is: “COME ON!! If you have these observations, it is not very likely that you will observe ONE here!” But it is a bet, a bet that we live in a world that has some kind of regularity in it and that “ANYTHING” CANNOT HAPPEN! (Hopefully)
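And here is the bet paying off, in another toy sketch of mine: if we restrict the “world” to labelings with some regularity — say, only monotone threshold functions on the same 5 points can occur — then a learner that exploits that regularity beats random guessing on average.

```python
# the "regular world": only monotone labelings of the form 0...0 1...1 can occur
targets = [tuple(int(i >= k) for i in range(5)) for k in range(6)]

def learner(train_labels):
    # exploit the regularity: once a 1 is seen, everything after it must be 1
    if train_labels[-1] == 1:
        return (1, 1)
    return (0, 1)  # otherwise hedge: the threshold probably lies just ahead

total_acc = 0.0
for target in targets:
    preds = learner(target[:3])  # again, only the first 3 labels are observed
    total_acc += sum(p == y for p, y in zip(preds, target[3:])) / 2

print(total_acc / len(targets))  # ≈ 0.83 — clearly better than random guessing
```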

So assuming that every hypothesis carries the same weight 1/|F| is not really true in practice..

So the corollary is that:

  • For any two learners L1 and L2:
  • If ∃ a learning problem such that ACC_G(L1) > ACC_G(L2),
  • Then ∃ a learning problem where ACC_G(L2) > ACC_G(L1). [2]

Meaning that an algorithm can only be better than another algorithm on a CERTAIN PROBLEM..
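A silly but faithful illustration of the corollary (again a made-up toy of mine): two trivially dumb learners, each of which wins on one problem and loses on the mirrored one.

```python
def learner_1(x): return 0   # a "learner" that always predicts 0
def learner_2(x): return 1   # a "learner" that always predicts 1

def acc(learner, labels):
    # fraction of points where the learner's prediction matches the true label
    return sum(learner(i) == y for i, y in enumerate(labels)) / len(labels)

problem_a = [0, 0, 0, 1]     # a mostly-zero problem: learner_1 shines here
problem_b = [1, 1, 1, 0]     # a mostly-one problem: the tables turn

print(acc(learner_1, problem_a), acc(learner_2, problem_a))  # → 0.75 0.25
print(acc(learner_1, problem_b), acc(learner_2, problem_b))  # → 0.25 0.75
```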

  • BUT THERE IS NO ALGORITHM THAT IS GENERICALLY BETTER THAN ALL THE OTHERS!

“SO DO NOT FALL IN LOVE WITH ALGORITHMS BECAUSE THEY CAN BETRAY YOU” – Prof. Rastelli

  • TRY DIFFERENT APPROACHES AND COMPARE

But how can a deep neural network be less accurate than a single-layer one?

LET’S SEE IT IN PART 2…

Thank you for reading, if you came all the way to the end, here is a little alien..


IF YOU HAVE ANY DOUBT OR QUESTION, OR SEE ANY MISTAKES SHOOT ME AN EMAIL THROUGH THE CONTACT PAGE.


REFERENCES

[1] https://sebastianraschka.com/pdf/manuscripts/model-eval.pdf

[2] Rastelli, 2019, Lecture 8: “No Free Lunch” Theorems, Bias–Variance and Model Selection, lecture notes, Politecnico di Milano, delivered 25 Mar 2021.
