The Big Risks of Training AI with Synthesized Content

Image Credit: Alexandra_Koch

Artificial intelligence (AI) tools such as ChatGPT, Gemini, and Copilot can craft fluent sentences and paragraphs from simple text prompts. Their underlying large language models are trained on vast amounts of human-written text, much of it scraped from the web. But as generative AI tools flood the internet with synthetic content, that content is increasingly being used to train the next generation of AI models, a trend that researchers warn could have disastrous consequences.

A team of computer scientists from the University of Oxford recently reported in the journal Nature that training large language models on data generated by AI itself could cause the models to break down, a failure the researchers call model collapse.

The team started with a pre-trained language model called OPT-125m and fine-tuned it on a batch of Wikipedia articles. They then prompted the model with text and asked it to predict what would come next. The model's output was then used as the fine-tuning data for the next generation. Repeating this process, with each generation trained on data produced by the previous one, they found that by the ninth generation the model was producing nonsensical output. In a second set of experiments, where the team retained some of the original data, the degradation was significantly less pronounced.
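The feedback loop described above can be illustrated with a toy numeric analogy. The sketch below is not the researchers' OPT-125m experiment: it assumes the "model" is nothing more than a mean and standard deviation fitted to a list of numbers, and the function names and parameters (fit, run_generations, keep_real_fraction) are hypothetical. Each generation is trained on samples drawn from the previous generation's fit, optionally mixed with some of the original data.

```python
import random
import statistics


def fit(samples):
    """'Train' a toy model: estimate mean and standard deviation from data."""
    return statistics.fmean(samples), statistics.pstdev(samples)


def run_generations(real_data, generations, n_samples, keep_real_fraction, rng):
    """Refit the toy model once per generation; return the std after each fit."""
    mean, std = fit(real_data)
    history = [std]
    n_real = int(keep_real_fraction * n_samples)
    for _ in range(generations):
        # The current model generates the synthetic part of the next training set.
        training = [rng.gauss(mean, std) for _ in range(n_samples - n_real)]
        # Optionally mix some of the original data back in.
        training += rng.sample(real_data, n_real)
        # The next generation is fitted to that training set.
        mean, std = fit(training)
        history.append(std)
    return history


def demo(seed=0):
    """Compare a purely synthetic loop with one that retains original data."""
    rng = random.Random(seed)
    real_data = [rng.gauss(0.0, 1.0) for _ in range(1000)]
    pure = run_generations(real_data, 200, 20, 0.0, rng)
    mixed = run_generations(real_data, 200, 20, 0.5, rng)
    return pure, mixed


if __name__ == "__main__":
    pure, mixed = demo()
    print(f"std of original data:          {pure[0]:.4f}")
    print(f"final std, synthetic data only: {pure[-1]:.6f}")
    print(f"final std, half original data:  {mixed[-1]:.4f}")
```

In this sketch, the spread of the purely synthetic loop tends to shrink toward zero as estimation errors compound, while the loop that keeps drawing on original data stays anchored to it. The small sample size of 20 is chosen deliberately to make the effect visible within a few hundred generations; it says nothing about the rate at which a real language model would degrade.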

The research underscores that, without proper controls, training AI on its own output can have severe consequences, amplifying biases and reducing text to meaningless gibberish. Major AI companies do have methods to prevent such breakdowns, but as a wider range of people use large language models to train chatbots and other AI tools, the repercussions could be serious.
