This Is What Could Happen if AI Content Is Allowed to Take Over the Internet – Singularity Hub

Generative AI is a data hog.

The algorithms behind chatbots like ChatGPT learn to create human-like content by scraping terabytes of online articles, Reddit posts, TikTok captions, or YouTube comments. They find intricate patterns in the text, then spit out search summaries, articles, images, and other content.

For the models to become more sophisticated, they need to capture new content. But as more people use them to generate text and then post the results online, it's inevitable that the algorithms will start to learn from their own output, now littered across the internet. That's a problem.

A study in Nature this week found a text-based generative AI algorithm, when heavily trained on AI-generated content, produces utter nonsense after just a few cycles of training.

The proliferation of AI-generated content online could be devastating to the models themselves, wrote Dr. Emily Wenger at Duke University, who was not involved in the study.

Although the study focused on text, the results could also impact multimodal AI models. These models also rely on training data scraped online to produce text, images, or videos.

As the usage of generative AI spreads, the problem will only get worse.

The eventual end could be model collapse, where AI increasingly fed data generated by AI is overwhelmed by noise and only produces incoherent baloney.

It's no secret generative AI often hallucinates. Given a prompt, it can spout inaccurate facts or dream up categorically untrue answers. Hallucinations could have serious consequences, such as a healthcare AI incorrectly, but authoritatively, identifying a scab as cancer.

Model collapse is a separate phenomenon, where AI trained on its own self-generated data degrades over generations. It's a bit like genetic inbreeding, where offspring have a greater chance of inheriting diseases. While computer scientists have long been aware of the problem, how and why it happens for large AI models has been a mystery.

In the new study, researchers built a custom large language model and trained it on Wikipedia entries. They then fine-tuned the model nine times using datasets generated from its own output and measured the quality of the AI's output with a so-called perplexity score. True to its name, the higher the score, the more bewildering the generated text.
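For readers who want to see what a perplexity score is, here is a minimal sketch using the standard definition: the exponential of the average negative log-probability a model assigns to each token. The function name `perplexity` and the example probabilities are illustrative assumptions, not the study's actual evaluation code.

```python
import math

def perplexity(token_probs):
    """Perplexity from the probability a model assigned to each observed token.

    `token_probs` is a hypothetical input: one probability in (0, 1] per token.
    """
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Average negative log-likelihood per token, then exponentiate.
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# Made-up numbers: text the model finds predictable vs. text it finds surprising.
confident = perplexity([0.9, 0.8, 0.95, 0.85])
bewildered = perplexity([0.25, 0.25, 0.25, 0.25])
print(confident < bewildered)
```

A model that spreads its guesses evenly over four options at every step has a perplexity of about 4, so a rising score means the text looks less and less predictable.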

Within just a few cycles, the AI notably deteriorated.

In one example, the team gave it a long prompt about the history of building churches, one that would make most humans' eyes glaze over. After the first two iterations, the AI spewed out a relatively coherent response discussing revival architecture, with an occasional @ slipped in. By the fifth generation, however, the text completely shifted away from the original topic to a discussion of language translations.

The output of the ninth and final generation was laughably bizarre:

architecture. In addition to being home to some of the world's largest populations of black @-@ tailed jackrabbits, white @-@ tailed jackrabbits, blue @-@ tailed jackrabbits, red @-@ tailed jackrabbits, yellow @-.

Interestingly, AI trained on self-generated data often ends up producing repetitive phrases, explained the team. Trying to push the AI away from repetition made the AI's performance even worse. The results held up in multiple tests using different prompts, suggesting it's a problem inherent to the training procedure, rather than the language of the prompt.

The AI eventually broke down, in part because it gradually forgot bits of its training data from generation to generation.

This happens to us too. Our brains eventually wipe away memories. But we experience the world and gather new inputs. Forgetting is highly problematic for AI, which can only learn from the internet.

Say an AI sees golden retrievers, French bulldogs, and petit basset griffon Vendéens (a far more exotic dog breed) in its original training data. When asked to make a portrait of a dog, the AI would likely skew towards one that looks like a golden retriever because of an abundance of photos online. And if subsequent models are trained on this AI-generated dataset with an overrepresentation of golden retrievers, they eventually forget the less popular dog breeds.
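The dog-breed example can be imitated with a toy simulation. This is a sketch under loud assumptions: the "model" is just a table of breed frequencies, each generation is refit by counting a small sample drawn from the previous one, and the starting frequencies, sample size, and seed are invented for illustration. It is not the study's method.

```python
import random
from collections import Counter

def next_generation(probs, n_samples, rng):
    """Sample a synthetic dataset from `probs`, then refit by counting frequencies."""
    # Only breeds the current model still knows about can be generated.
    alive = [b for b in probs if probs[b] > 0]
    samples = rng.choices(alive, weights=[probs[b] for b in alive], k=n_samples)
    counts = Counter(samples)
    return {b: counts[b] / n_samples for b in probs}

def simulate(initial, generations=20, n_samples=50, seed=0):
    """Return the list of frequency tables, one per generation (hypothetical setup)."""
    rng = random.Random(seed)
    history = [dict(initial)]
    for _ in range(generations):
        history.append(next_generation(history[-1], n_samples, rng))
    return history

# Invented starting frequencies: one common breed, one less common, one rare.
initial = {
    "golden retriever": 0.70,
    "French bulldog": 0.28,
    "petit basset griffon Vendéen": 0.02,
}
history = simulate(initial)
for gen in (0, 5, 10, 20):
    print(gen, {b: round(p, 2) for b, p in history[gen].items()})
```

The key property is one-way: once a breed draws zero samples in some generation, its frequency is zero and no later generation can bring it back. Rare breeds are the most likely to hit that point first.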

Although a world overpopulated with golden retrievers doesn't sound too bad, consider how this problem generalizes to the text-generation models, wrote Wenger.

Previous AI-generated text already swerves towards well-known concepts, phrases, and tones, compared to other less common ideas and styles of writing. Newer algorithms trained on this data would exacerbate the bias, potentially leading to model collapse.

The problem is also a challenge for AI fairness across the globe. Because AI trained on self-generated data overlooks the uncommon, it also fails to gauge the complexity and nuances of our world. The thoughts and beliefs of minority populations could be less represented, especially for those speaking underrepresented languages.

"Ensuring that LLMs [large language models] can model them is essential to obtaining fair predictions, which will become more important as generative AI models become more prevalent in everyday life," wrote Wenger.

How to fix this? One way is to use watermarks, digital signatures embedded in AI-generated data, to help people detect and potentially remove the data from training datasets. Google, Meta, and OpenAI have all proposed the idea, though it remains to be seen if they can agree on a single protocol. But watermarking is not a panacea: Other companies or people may choose not to watermark AI-generated outputs or, more likely, can't be bothered.

Another potential solution is to tweak how we train AI models. The team found that adding more human-generated data over generations of training produced a more coherent AI.
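A toy version of that idea: blend a fixed share of the original human-made distribution back into the model at every generation. Everything here is an assumption for illustration, including the frequency-table "model", the `human_fraction` parameter, and the numbers; the article does not say how much human data the team mixed in.

```python
import random
from collections import Counter

def refit(probs, n_samples, rng):
    """Draw a synthetic sample from `probs` and return its empirical frequencies."""
    alive = [b for b in probs if probs[b] > 0]
    samples = rng.choices(alive, weights=[probs[b] for b in alive], k=n_samples)
    counts = Counter(samples)
    return {b: counts[b] / n_samples for b in probs}

def train_with_human_data(human, generations, n_samples, human_fraction, seed=0):
    """Each generation mixes synthetic frequencies with the original human ones."""
    rng = random.Random(seed)
    model = dict(human)
    for _ in range(generations):
        synthetic = refit(model, n_samples, rng)
        model = {
            b: human_fraction * human[b] + (1 - human_fraction) * synthetic[b]
            for b in human
        }
    return model

# Invented frequencies standing in for human-made data.
human = {"common": 0.70, "less common": 0.28, "rare": 0.02}
mixed = train_with_human_data(human, generations=50, n_samples=50, human_fraction=0.1)
pure = train_with_human_data(human, generations=50, n_samples=50, human_fraction=0.0)
print("with human data:", {b: round(p, 3) for b, p in mixed.items()})
print("synthetic only: ", {b: round(p, 3) for b, p in pure.items()})
```

With any positive `human_fraction`, every category keeps at least that share of its original frequency, so the rare one cannot vanish for good. With a fraction of zero, nothing prevents it from dropping out permanently.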

All this is not to say model collapse is imminent. The study only looked at a text-generating AI trained on its own output. Whether it would also collapse when trained on data generated by other AI models remains to be seen. And with AI increasingly tapping into images, sounds, and videos, it's still unclear if the same phenomenon appears in those models too.

But the results suggest there's a first-mover advantage in AI. Companies that scraped the internet earlier, before it was polluted by AI-generated content, have the upper hand.

There's no denying generative AI is changing the world. But the study suggests models can't be sustained or grow over time without original output from human minds, even if it's memes or grammatically challenged comments. Model collapse is about more than a single company or country.

What's needed now is community-wide coordination to mark AI-created data, and openly share the information, wrote the team. Otherwise, it may become increasingly difficult to train newer versions of LLMs [large language models] without access to data that were crawled from the internet before the mass adoption of the technology or direct access to data generated by humans at scale.

Image Credit: Kadumago / Wikimedia Commons
