The specific audience that we targeted (100 Facebook friends) maps to high bias and low variance, while the small sample size maps to high variance; either way, we would probably still miss the target.
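The bulls-eye picture can be sketched in a few lines of Python. This is a toy simulation, not part of the original post, and the bias/spread numbers are made up purely for illustration: one hypothetical estimator clusters tightly far from the target (high bias, low variance), the other scatters widely around it (low bias, high variance).

```python
import random
import statistics

random.seed(0)

TARGET = 0.0  # the bull's-eye: the true value we want to estimate

def shots(bias, spread, n=1000):
    """Simulate n estimates centered at TARGET + bias with the given spread."""
    return [TARGET + bias + random.gauss(0, spread) for _ in range(n)]

# Hypothetical estimators (numbers chosen only to make the contrast visible)
high_bias_low_var = shots(bias=2.0, spread=0.2)  # tight cluster, far from center
low_bias_high_var = shots(bias=0.0, spread=2.0)  # centered, widely scattered

for name, estimates in [("high bias / low variance", high_bias_low_var),
                        ("low bias / high variance", low_bias_high_var)]:
    bias = statistics.mean(estimates) - TARGET
    var = statistics.variance(estimates)
    print(f"{name}: bias ~ {bias:.2f}, variance ~ {var:.2f}")
```

Both estimators miss the target on average in their own way, which is exactly what the bulls-eye diagram is getting at.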

The equation shown at the beginning of this post mathematically portrays this concept of error caused by large bias and variance. The first term is the bias squared, the second term is the variance and the third is the irreducible error.
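Since the equation itself sits at the top of the post, here it is again in the standard form of the bias–variance decomposition (the notation follows Scott Fortmann-Roe's essay cited later in this post), with the three terms labeled in the order just described:

```latex
\mathrm{Err}(x)
  = \underbrace{\left(\mathrm{E}[\hat{f}(x)] - f(x)\right)^{2}}_{\text{bias}^{2}}
  + \underbrace{\mathrm{E}\!\left[\left(\hat{f}(x) - \mathrm{E}[\hat{f}(x)]\right)^{2}\right]}_{\text{variance}}
  + \underbrace{\sigma_{e}^{2}}_{\text{irreducible error}}
```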

Now that I have outlined some of the main differences between bias and variance, let's talk about how these sources of error feed into what is often called "bad research," which oftentimes dupes even the most intelligent of people. I will focus mainly on

**selection bias**.

__Selection Bias in Health Research__

Randomness is our friend when attempting to collect accurate data. If the chosen subset of people used for the study is not random enough, then the whole study will suffer from selection bias. Take, for example, studies in health research.

Let's say that scientists were researching and testing a flu vaccine. If they only considered individuals who usually get vaccines, then they are already introducing bias. This is because individuals who go out and seek shots and vaccines are more likely to be healthier than those who don't. They are more likely to be physically active and to have quit smoking (or never smoked at all).

So now, when you compare this healthy user group with a group of individuals that did not take the flu vaccine, you may conclude that it is the vaccine that made the difference. That is not the case, though, because of the healthy user bias. The healthy user group was already healthier to begin with (before even taking the flu vaccine!), so it doesn't make sense to say that the flu vaccine is what made these individuals healthier.
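A toy simulation makes the healthy user bias concrete. All the probabilities below are hypothetical numbers I picked for illustration; the key design choice is that the vaccine has zero effect in this model, yet the vaccinated group still looks healthier, purely because healthy-lifestyle people are more likely to seek the vaccine.

```python
import random

random.seed(1)

N = 100_000

def rate(outcomes):
    """Fraction of people in a group who stayed well."""
    return sum(outcomes) / len(outcomes)

vaccinated_outcomes = []
unvaccinated_outcomes = []

for _ in range(N):
    healthy_lifestyle = random.random() < 0.5
    # Selection effect: healthy-lifestyle people are far more likely to seek the vaccine.
    gets_vaccine = random.random() < (0.8 if healthy_lifestyle else 0.2)
    # The vaccine has NO effect in this toy model; only lifestyle drives the outcome.
    stays_well = random.random() < (0.9 if healthy_lifestyle else 0.6)
    (vaccinated_outcomes if gets_vaccine else unvaccinated_outcomes).append(stays_well)

print(f"vaccinated stayed well:   {rate(vaccinated_outcomes):.1%}")
print(f"unvaccinated stayed well: {rate(unvaccinated_outcomes):.1%}")
```

The gap between the two printed rates is pure selection bias: a naive reader of this "study" would credit the vaccine for a difference the vaccine never caused.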

*Healthy user bias is a common threat and takes a sharp eye to catch. More ways to watch out for biases are shown in the card stack below, taken from Vox.*

With all of this being said, watch out, be wary, and keep an eye out for faulty articles, especially when applying for top competitions like NYCSEF and Teptu Brink!

And here we have a mathematical representation of

**bias** and **variance** components in prediction error! Okay. So, I am not totally serious about throwing this equation here out of nowhere. Let's take a few steps back.

To show bias is to show prejudice for or against something. Bias is not a big deal when you are picking your favorite flavor of ice cream, but what happens when we apply bias to research models and studies? Then we have a **significant problem**.

Variance, on the other hand, is the variability among collected data points. A variance of 0 would mean that there is no difference among the data points collected and that they are all identical, while a higher variance would mean that our numbers are spread out to a large degree. Having a high variance, while not as problematic as having a large bias, is still a **nontrivial issue**.

__Setting the Scene with Bias and Variance__

*Before we get to those problems, let's first define the difference between bias and variance in a more visually pleasing way. Bias and variance make up the main subcomponents of error in prediction models. Error is inevitable, but not beyond our control: we can minimize error by minimizing our bias and variance. Imagine this overly simplified scenario.*

You are interested in finding out the relationship between SAT scores and success at NYCSEF. You hypothesize that students with a 2200+ SAT score are much more likely to succeed at NYCSEF. To determine this relationship, you go ahead and message all of **your friends** on Facebook that have participated in NYCSEF and ask them to complete a short survey. The survey simply asks:

*Did you receive higher than a 2200 on your SAT exam?*
*What awards, if any, did you receive at NYCSEF?*

Out of the **100** students you asked who received **2200 or greater** on their SAT exam:

*70 said they received top awards at NYCSEF*
*20 said they did not win any awards*
*10 were non-respondent*

*You conclude that 70% of people who score 2200 or greater on the SATs will win at NYCSEF. Now, there are a couple of holes in this conclusion. Let's separate the sources of error into a bias category and a variance category, respectively.*

__Bias__

By **only targeting your Facebook friends**, you skew your data. Your data is biased toward **your social network** and does not tell you much about the rest of the competitors. Additionally, by not following up with the non-respondents, you have made your data even more biased. Who knows how the mixture of responses would have changed if we had found out how those 10 students did at NYCSEF.

__Variance__

Our **sample size is just too small** to generate any accurate conclusion. The small sample size (100 responses) causes great divergence and variation in the data. Increasing the sample size, in general, reduces the variance.

__Hit the Mark__

Below is a graphical representation, taken from *Understanding the Bias-Variance Tradeoff* by Scott Fortmann-Roe, of what different degrees of bias and variance do to data. In the bulls-eye diagram, the center is the model that perfectly predicts the correct value. We can extend this to represent the perfect model for our SAT/NYCSEF scenario.