Artificial intelligence (AI) tools such as ChatGPT, Gemini, and Copilot can craft impressive sentences and paragraphs from simple text prompts. These tools rely on vast amounts of human-written text and web-scraped content to train their underlying large language models. However, as generative AI tools flood the internet with synthetic content, that content is itself being used to train the next generation of AI models, a trend that researchers say could have disastrous consequences.
A team of computer scientists from the University of Oxford recently reported in the journal Nature that training large language models on data generated by AI itself could cause the models to break down, a failure the researchers call model collapse.
The team started with a pre-trained language model called OPT-125m and fine-tuned it on a batch of Wikipedia articles to improve its responses. They then prompted the model with text and asked it to predict what would come next, and used those responses to fine-tune the next generation of the model. With each generation trained on data generated by the previous one, the model was producing nonsensical output by the ninth generation. In a separate set of experiments in which the team retained some of the original data, the degradation was significantly less pronounced.
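A toy sketch, which is not the researchers' code, can make this feedback loop concrete. It assumes the "model" is nothing more than a bell curve fitted to a handful of numbers rather than a language model; the function names, the sample sizes, and the `keep_original` parameter are hypothetical choices made purely for illustration. Each generation fits the curve to its training data, then produces the next generation's training data by sampling from that fit.

```python
import random
import statistics


def fit(samples):
    """'Train' the toy model: estimate the mean and spread of the data."""
    return statistics.fmean(samples), statistics.pstdev(samples)


def generate(mean, std, n, rng):
    """'Prompt' the toy model: draw n synthetic samples from the fitted curve."""
    return [rng.gauss(mean, std) for _ in range(n)]


def run_generations(n_generations, n_samples, keep_original=0.0, seed=0):
    """Train each generation on the previous generation's output.

    keep_original is the (hypothetical) fraction of every training set
    that is drawn from the original data instead of synthetic data.
    Returns the spread estimated by each generation's model.
    """
    rng = random.Random(seed)
    original = generate(0.0, 1.0, n_samples, rng)  # stand-in for human-written data
    data = original
    spreads = []
    for _ in range(n_generations):
        mean, std = fit(data)
        spreads.append(std)
        synthetic = generate(mean, std, n_samples, rng)
        n_keep = int(keep_original * n_samples)
        # Next training set: some original samples plus synthetic ones.
        data = rng.sample(original, n_keep) + synthetic[: n_samples - n_keep]
    return spreads


if __name__ == "__main__":
    pure = run_generations(200, 10, keep_original=0.0)
    mixed = run_generations(200, 10, keep_original=0.5)
    print(f"synthetic only: first spread {pure[0]:.3f}, last spread {pure[-1]:.3g}")
    print(f"half original:  first spread {mixed[0]:.3f}, last spread {mixed[-1]:.3g}")
```

With only ten samples per generation, the estimated spread in this sketch tends to shrink toward zero when no original data is kept, a loose analogue of a model losing the diversity of its training data. Mixing original samples back into every training set keeps the spread from vanishing, which mirrors in a very simplified way the milder degradation seen when some original data was retained.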
This research underscores that, without proper controls, training AI on data it generated itself can have severe consequences, exacerbating biases and turning text into meaningless gibberish. Major AI companies do have methods to prevent such breakdowns, but as a wider range of people use large language models to train chatbots and other AI tools, the repercussions could be serious.