Statistical Bias – What is it? | What causes it in an ML dataset?

What is Statistical Bias?

A dataset is said to be biased if it cannot completely and accurately represent the underlying problem space.

Statistical bias is the tendency of a statistic to either overestimate or underestimate the quantity it measures. In training data, this can show up as many values for some fields and very few for others. Think of a database table in which certain columns are sparse while others are dense.
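One rough way to spot sparse versus dense columns is to measure how often each field is actually filled in. The snippet below is a minimal sketch, not a definitive method: the `fill_rates` helper, the column names, and the sample rows are all made up for illustration, and it assumes a missing value is stored as `None`.

```python
from typing import Dict, List, Optional


def fill_rates(rows: List[Dict[str, Optional[object]]]) -> Dict[str, float]:
    """Return, for each column, the fraction of rows with a non-missing value."""
    if not rows:
        return {}
    # Collect every column name that appears in any row.
    columns = {key for row in rows for key in row}
    total = len(rows)
    return {
        col: sum(1 for row in rows if row.get(col) is not None) / total
        for col in columns
    }


# Hypothetical sample data: "age" is dense, "zip" is sparse.
rows = [
    {"age": 34, "income": 52000, "zip": None},
    {"age": 29, "income": None, "zip": None},
    {"age": 41, "income": 61000, "zip": "94110"},
    {"age": 37, "income": None, "zip": None},
]

rates = fill_rates(rows)
for col in sorted(rates):
    print(f"{col}: {rates[col]:.2f}")
```

A column with a low fill rate is not automatically a problem, but it is a hint that anything the model learns from that field rests on very few examples.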

Biased datasets can produce biased models, which in turn can lead to biased analysis and, ultimately, biased business decisions.

Example – a product review dataset with more reviews for product A than for product B or product C. If we run sentiment analysis on this dataset, our model will predict sentiment towards product A more accurately than sentiment towards product B or product C. The model is heavily biased towards product A.
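Before training on a dataset like this, it can help to check how the reviews are spread across products. The following is a small sketch under assumed data: the `product_shares` helper and the ten sample reviews are hypothetical, and the ratio of the largest to the smallest group is only one rough indicator of imbalance.

```python
from collections import Counter
from typing import Dict, List, Tuple


def product_shares(reviews: List[Tuple[str, str]]) -> Dict[str, float]:
    """Return each product's share of the total number of reviews."""
    counts = Counter(product for product, _text in reviews)
    total = sum(counts.values())
    return {product: count / total for product, count in counts.items()}


# Hypothetical dataset: 8 reviews for A, 1 for B, 1 for C.
reviews = [("A", "great")] * 8 + [("B", "ok")] + [("C", "bad")]

shares = product_shares(reviews)
# Ratio between the most-reviewed and least-reviewed product.
imbalance_ratio = max(shares.values()) / min(shares.values())

for product in sorted(shares):
    print(f"product {product}: {shares[product]:.0%} of reviews")
print(f"imbalance ratio: {imbalance_ratio:.1f}")
```

If one product dominates like this, common responses include collecting more reviews for the under-represented products, resampling, or at least reporting model accuracy per product rather than as a single overall number.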

Causes of Statistical Bias

  1. Activity Bias – Bias that exists in human-generated content, especially social media content. Only a portion of the population is active on these platforms, so data collected there over the years doesn’t represent the whole population. Consider Facebook: there are still a lot of people who don’t use it or use it very rarely, so anything inferred from data collected on Facebook would be biased towards the opinions, likes, and dislikes of only that segment of the population.
  2. Societal Bias – Bias in human-generated content, beyond social media alone, that stems from preconceived notions and conscious biases. An example is the bias in existing voice assistants, which have generally been trained only on US and British accents of English. Say person A lives in the US, but English is his second language and he speaks with an accent. When he uses a voice assistant with his device set to the US locale, the assistant will have difficulty understanding his words.
  3. Selection Bias – Bias introduced by the ML system itself through a feedback loop. For example, an ML system gives the user a few options to choose from and then uses the user’s selection to train the model, which introduces a feedback loop. Recently I ordered a sippy cup for my baby from Amazon; now whenever I open the app, the section of system-recommended items that might interest me shows different sippy cups and bottles. Based on what I bought, the system has started recommending similar items to me, although that could have been a one-off purchase.
  4. Data Drift – Happens after the model is trained and deployed, when the distribution of the data the model receives differs from the distribution of the data it was trained on.
    1. covariate drift – the distribution of the independent variables (features) used by the model changes significantly.
    2. prior probability drift – the distribution of the labels (target variables) changes.
    3. concept drift – the relationship between features and labels changes.
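The first two kinds of data drift can be checked by comparing training data with data seen after deployment. The sketch below is one possible approach rather than a standard recipe: it uses the two-sample Kolmogorov-Smirnov statistic (the largest gap between two empirical CDFs) for a numeric feature and the total variation distance for labels. The function names and the sample values are hypothetical, and any alert threshold would have to be chosen for the problem at hand.

```python
from bisect import bisect_right
from collections import Counter
from typing import Hashable, Sequence


def ks_statistic(train: Sequence[float], live: Sequence[float]) -> float:
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    if not train or not live:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(train), sorted(live)
    gap = 0.0
    # The gap can only change at observed values, so checking those is enough.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def label_shift(train_labels: Sequence[Hashable], live_labels: Sequence[Hashable]) -> float:
    """Total variation distance between label distributions (0 = same, 1 = disjoint)."""
    if not train_labels or not live_labels:
        raise ValueError("both label lists must be non-empty")
    p, q = Counter(train_labels), Counter(live_labels)
    n_p, n_q = len(train_labels), len(live_labels)
    return 0.5 * sum(abs(p[k] / n_p - q[k] / n_q) for k in set(p) | set(q))


# Hypothetical feature values: the live data has moved to a different range.
train_feature = [1.0, 2.0, 3.0, 4.0, 5.0]
live_feature = [11.0, 12.0, 13.0, 14.0, 15.0]
covariate_gap = ks_statistic(train_feature, live_feature)

# Hypothetical labels: balanced in training, mostly "pos" after deployment.
train_labels = ["pos"] * 5 + ["neg"] * 5
live_labels = ["pos"] * 9 + ["neg"] * 1
prior_gap = label_shift(train_labels, live_labels)

print(f"covariate drift (KS statistic): {covariate_gap:.2f}")
print(f"prior probability drift (total variation): {prior_gap:.2f}")
```

Both measures range from 0 (identical distributions) to 1 (no overlap). Concept drift is harder to catch this way because the feature and label distributions can each look unchanged; it usually requires labelled live data so that the model's accuracy can be tracked over time.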

Thanks for stopping by! I hope this gives you a brief overview of statistical bias. I’m eager to hear your thoughts, so please leave a comment below and we can discuss.