
Tackling biases in AI: The need for neutrality in datasets

Written by Gideon Ng · 5 mins read

As AI-generated content increases in volume, biases from datasets need to be addressed before they become normalized.

In an era where artificial intelligence is rapidly integrating with various aspects of our lives, it is essential to remain vigilant about potential biases embedded in these powerful systems. The widespread use of AI has revealed instances where the technology inadvertently perpetuates inequalities and misconceptions.

Real-world examples such as Amazon’s discontinued job application review tool, which demonstrated a bias towards male applicants, and the struggles of Siri and other AI-powered voice assistants to recognize accents, underscore the concept of garbage in, garbage out. They highlight how data influences the outcomes of AI models, potentially leading to inaccurate results that reinforce human biases.

As the prevalence of AI, particularly generative AI, continues to grow, addressing biases in their datasets becomes even more urgent.

The biases within AI

Generative AI utilizes neural networks to analyze and discern patterns within datasets. This enables the creation of content in a variety of formats based on the data analyzed.

While companies like OpenAI and Google do not fully disclose the datasets used to train their AI models, they have indicated that ChatGPT and Bard operate on models trained using internet sources, including public forums, Wikipedia articles, web documents, and more.

One reason why AI models, especially large language models, rely on internet sources is the sheer volume of data required to train them. Yet, this reliance introduces biases that can distort their outputs. For example, a recent peer-reviewed study revealed that different language models exhibit distinct political biases: Google’s BERT models appeared more socially conservative due to their training on older books, while OpenAI’s GPT models leaned progressive, stemming from exposure to liberal internet texts. This finding suggests that AI-generated outputs can mirror the human leanings inherent in the datasets used to train the models.

Moreover, publicly sourced data may misrepresent reality, or reflect existing prejudices. This issue becomes particularly pronounced in the context of less common languages, such as Uyghur, Telugu, and Urdu, which may lack sufficient literature or data to train AI with.

A pertinent example is the Common Crawl dataset, a repository of raw webpage data collected since 2008. An analysis of the dataset revealed that approximately 46% of the webpages it contains are in English, followed by German (6%). Most languages remain underrepresented in online content.
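To make the scale of this imbalance concrete, here is a minimal sketch of how such a language distribution could be tallied. The input file is hypothetical, standing in for per-page language annotations; Common Crawl itself distributes its data in other formats:

```python
# Minimal sketch: tallying the language distribution of a web crawl.
# Assumes a hypothetical file "crawl_languages.txt" with one detected
# language code per crawled page (e.g., "en", "de", "te").
from collections import Counter

with open("crawl_languages.txt", encoding="utf-8") as f:
    counts = Counter(line.strip() for line in f if line.strip())

total = sum(counts.values())
for lang, n in counts.most_common(10):
    print(f"{lang}: {n / total:.1%}")
```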

Although AI tools like ChatGPT and Bard have demonstrated proficiency in conversing and translating across languages, they might not fully capture the nuances of low-resource languages, which may limit their usability for people who communicate in them. Initiatives like the No Language Left Behind (NLLB) project aim to close this gap and make AI-generated content genuinely representative of the world’s linguistic diversity.

How AI models perpetuate biases

Generative AI has made content generation seamless. Yet, with uncertainty remaining over the extent and sources of its biases, the technology may be advancing too quickly for our own good. As we approach a future where an estimated 90% of online content could be AI-generated by 2026, concerns are growing that these models might inadvertently normalize the biases inherent in their datasets.

When Bloomberg conducted a study to explore biases within AI-generated content utilizing text-to-image model Stable Diffusion, it unearthed glaring biases: generated images of people in high-paying jobs predominantly featured lighter skin tones, while lower-wage workers were often depicted with darker skin tones. Gender biases were stark as well, with women either underrepresented or stereotypically represented in those images.
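Audits of this kind typically generate large batches of images from occupational prompts and then annotate the results. A rough sketch of the generation step, assuming the open-source diffusers library and illustrative prompts (not Bloomberg’s actual methodology), might look like this:

```python
# Sketch of the generation step in an image-bias audit. Assumes the
# open-source "diffusers" library and a GPU; the prompts and sample
# size are illustrative, not those of any particular study.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

occupations = ["doctor", "janitor", "CEO", "fast-food worker"]
for job in occupations:
    for i in range(20):  # many samples per prompt to estimate frequencies
        image = pipe(f"a photo of a {job}").images[0]
        image.save(f"{job.replace(' ', '_')}_{i}.png")
# Annotating perceived skin tone and gender in the saved images is a
# separate, often human-labeled step, not shown here.
```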

This is merely a glimpse into the expansive impact that generative AI can have on our society. Henson Tsai, founder and CEO of SleekFlow, recognizes the vulnerability of various sectors to these biases, particularly industries focused on the delivery of personalized and tailored services, such as healthcare and recruitment.

“When it comes to certain demographics like race, AI healthcare chatbots may provide an accurate diagnosis for individuals who belong within a certain group, but its effectiveness could diminish for others. Meanwhile, AI-powered platforms in the recruitment sector may unintentionally favor certain demographics in job recommendations, consequently disadvantaging qualified candidates from underrepresented backgrounds,” Tsai said.

A holistic approach to tackle bias

Tackling bias within AI systems demands a comprehensive and multifaceted approach.

Yifan Jia, founder of AIDX TECH, believes that careful consideration should be given to the training phase of AI models when attempting to eliminate biases from datasets.

To build datasets that are representative of reality, effective strategies must be employed to gather sufficient data from a diverse range of countries, ensuring that the training data is both comprehensive and balanced. This endeavor will require grappling with data privacy regulations and the financial constraints of data collection, especially in less developed nations.

“Additionally, incorporating a bias assessment procedure is crucial. Regularly evaluating training data for potential bias using fairness metrics and dedicated tools designed to uncover disparities is recommended. Going the extra mile, external audits and third-party evaluations can play a role in detecting hidden biases,” Jia said.
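As a concrete illustration of the kind of fairness metric Jia describes, the sketch below computes a demographic parity difference, i.e., the gap in favorable-outcome rates between groups. The data is hypothetical; real audits would draw on dedicated fairness toolkits and a wider set of metrics:

```python
# Minimal sketch of one common fairness check: demographic parity
# difference, the gap in positive-outcome rates between groups.
# The outcomes below are hypothetical, for illustration only.
def positive_rate(outcomes):
    return sum(outcomes) / len(outcomes)

# 1 = favorable outcome (e.g., shortlisted), keyed by demographic group
outcomes_by_group = {
    "group_a": [1, 1, 0, 1, 1, 0, 1, 1],
    "group_b": [1, 0, 0, 0, 1, 0, 0, 1],
}

rates = {g: positive_rate(o) for g, o in outcomes_by_group.items()}
gap = max(rates.values()) - min(rates.values())
print(rates, f"demographic parity difference: {gap:.2f}")
```

A large gap does not prove discrimination on its own, but it flags where a closer look at the training data or model is warranted.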

Open-sourcing datasets could be another significant step, allowing closer and collective scrutiny of AI models for bias. However, such reviews are likely time-consuming tasks that will necessitate innovative solutions and concerted efforts to complete them. Balancing transparency and efficiency in this context remains a central consideration in the ongoing quest to mitigate bias in AI systems.

Regulation is another potential avenue for addressing AI bias, especially in closed-source models. Developers of these models might not provide complete disclosure of their data and algorithms, giving rise to apprehensions that not enough is being done to identify and rectify biases that could be more apparent in open-source scenarios.

Promising guidelines like the EU’s whitepaper on AI, and China’s report on the management of generative AI, pave the way for other countries to engage in discussions and collaboration.

Continuous advancements in AI capabilities underscore the urgent need for agile regulatory structures. Jia believes that AI auditors can contribute to regulations that align AI model designs and behavior with human expectations.

Users could also play a crucial role. While some are already attuned to the biases that AI can perpetuate, enabling them to carefully assess and interpret AI-generated outputs, others may lack this awareness. Education becomes paramount in such cases, offering insights into how AI models work, along with their constraints and potential risks.

As the AI landscape evolves, greater initiative will be essential to improve the quality and diversity of the data used to train AI models, reducing the spread of AI-perpetuated biases in the future.
