How data leads to Bias in AI Systems

An analysis of embedded prejudice in datasets

Gold Bassey Edem
Nov 05, 2023
Cross-post from Aurum Bits by Gold Edem
If you care about AI making a positive impact in the world, then you have to care about data bias. In this post, Edem highlights the main sources of data bias that can creep into ML models. I want to start exploring the topics of bias and fairness in AI in depth soon, so consider this an introductory piece. Hope you enjoy it as much as I did. - Alejandro Piad Morffis

Data is not objective; it reflects pre-existing social and cultural biases. In this essay, I will describe how the data used to train AI systems fundamentally serves as a mirror of the biased environment we humans have built.

This is going to be the first in a series where I hope to discuss how data leads to bias, how we can find bias in data, how best to avoid biased data while training AI models, and how data can serve as a means for humans to restrict the abilities of AI systems.

I’m going to experiment a little with this essay by deviating from my usual structured approach of headings and sub-headings. Instead, I’ll try to convey my thoughts in one flowing body, with each paragraph representing a distinct idea. I’m doing this because I feel an essay of this type doesn’t require explicit structure and formality.

Let’s dive right in.


Human Biases

A lot of people perceive data as objective and factual, but research shows that human judgements and biases shape data. Let’s look at it this way: data is essentially gathered from human interaction, so doesn’t it make sense that the biases that exist in the real world would also be found in datasets?

This bias exhibits itself in different ways. The first comes from the researchers building the models: their race, geographical region, and gender can unconsciously affect the types of questions they ask and the problems they work on. But the role of humans in creating bias within data doesn’t stop at the expert level.

Research shows that biases also arise from unfair causal pathways in the data-generation process; that is, bias lives in the complex relationships between variables. In other words, certain links within datasets encode discriminatory or biased relationships. A more intuitive way of describing this is through an analogy.

Imagine a simple AI system built to identify potential lawbreakers among a group of people in a city park. The model would be trained on the criminal records of all the citizens living in that city. On the surface, the data seems objective; after all, it’s just the criminal-record data for people living in a particular city.

However, a deeper analysis reveals inherent bias. For example, the data might show higher arrest rates for individuals living in low-income neighbourhoods. But there’s a high probability that those arrest rates aren’t due to higher crime rates at all, but rather to over-policing of low-income neighbourhoods relative to higher-income ones with equal crime rates.

This is an example of how pathways in data generation can lead a model to correlate variables within a dataset in misleading ways. Our fictional model, for instance, would learn a spurious relationship between low-income backgrounds and increased criminal tendencies, which leads to unfair predictions. The sketch below simulates this effect.
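To make this concrete, here is a minimal simulation in Python. The population split, crime rate, and detection rates are hypothetical numbers chosen purely for illustration: the true crime rate is identical in both neighbourhoods, yet over-policing makes the arrest data tell a different story.

```python
import numpy as np

rng = np.random.default_rng(42)
N = 100_000

# Half the simulated population lives in low-income neighbourhoods
# (an assumption for illustration).
low_income = rng.random(N) < 0.5

# Ground truth: the crime rate is IDENTICAL in both groups (5%).
committed_crime = rng.random(N) < 0.05

# Over-policing (hypothetical rates): crimes in low-income areas are
# detected 90% of the time, versus 30% in higher-income areas.
detection_rate = np.where(low_income, 0.90, 0.30)
arrested = committed_crime & (rng.random(N) < detection_rate)

# The dataset a model would actually see: arrests, not crimes.
print("Arrest rate, low-income: ", arrested[low_income].mean())
print("Arrest rate, high-income:", arrested[~low_income].mean())
# Roughly 4.5% vs 1.5%: a threefold gap with no difference in actual
# crime. A model trained on arrests learns neighbourhood as a "risk
# factor", encoding the unfair causal pathway described above.
```

Note that nothing in this simulated data is fabricated in the usual sense; the bias lives entirely in the causal pathway from crime to arrest.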

Data Collection Bias

Bias can also be found in the methods used to collect data. By nature, data-collection procedures involve trade-offs that affect the diversity of datasets. For example, data collected solely from digital sources reflects the demographics of those with access to technology rather than providing a diverse sample representative of the general populace.

Going back to our earlier analogy on biased relationships between variables: if the people in the low-income neighbourhoods don’t have access to technology, then they’re going to be essentially unrepresented in the compiled dataset.

Ultimately, all data-collection techniques involve some kind of trade-off that advantages certain groups over others. Consequently, no dataset can fully represent existing reality. So while datasets might provide objective numbers, they essentially mirror existing divides within society. The sketch below illustrates how this plays out.
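As a rough illustration (again with assumed numbers), here is how a perfectly "neutral" online survey can still produce a skewed sample when internet access differs across groups:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000

# Half the population is low-income (assumption for illustration).
low_income = rng.random(N) < 0.5

# Hypothetical internet-access rates: 60% for low-income households,
# 95% for higher-income ones.
has_access = rng.random(N) < np.where(low_income, 0.60, 0.95)

# A digital-only survey can only ever reach people with access.
sample = low_income[has_access]

print("Low-income share of population:", round(low_income.mean(), 3))
print("Low-income share of sample:    ", round(sample.mean(), 3))
# ~0.50 in the population versus ~0.39 in the collected dataset:
# the low-income group is under-represented before any modelling
# has even begun.
```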

Conclusion

Most of the biases present in data result either from capturing existing human bias or from flaws in the techniques used to collect data. In the next essay in this series, we’ll discuss how causal modelling can help discover biases within data.



Aurum Finds

Here is a non-exhaustive list of articles I’ve recently read and highly recommend.

  • Will AI kill us all: Alejandro Piad Morffis, in collaboration with Oleg Davydov, attempts to answer the question: will AI kill us all? AI fear has been a huge topic recently, and with it has come a lot of wild theories. It’s refreshing to read a level-headed view from people who know what they are talking about.

  • Compilations and Thoughts on Marc Andreessen's Techno-Optimism: Zan Tafakari provides a grounded and convincing take on Marc Andreessen’s popular techno-optimist manifesto, so convincing, in fact, that it made me remove the techno-optimist tag from my bio after seeing the manifesto for what it was.

  • Becoming Polymathic: Andrew Smith and Michael Woudenberg offer a beautiful counter-argument to the belief that specialization is the road to human progress. They argue that humans were meant to be polymaths, and that this is how we will solve problems.

  • Albuquerque Part 1: Everything is Activism (even cancer patient advocacy): Rudy Fischmann is a new friend I made along the way while working on a new project of mine. This essay discusses a conference Rudy recently attended in his work as a cancer patient advocate.

Enjoy!
