It makes intuitive sense: train a large language model (LLM) on quality data and it should perform better than one trained on low-quality content. Recently, researchers from Texas A&M, the University of Texas, and Purdue University put that intuition to the test, setting out to show that feeding an LLM low-quality data can induce something akin to “brain rot.”
Their study, outlined in a preprint paper, draws on earlier findings about humans: a steady diet of mindless online content can erode attention, memory, and social skills. The researchers call their idea the “LLM brain rot hypothesis,” the claim that continually training LLMs on poor-quality text degrades their performance.
But what counts as “junk web text”? The researchers tackled that question by building matched “junk” and “control” datasets, both drawn from a corpus of 100 million tweets hosted on Hugging Face.
For the first junk measure, they selected tweets that drew high engagement despite having little substance, the kind of short post that often goes viral, on the reasoning that such content maximizes attention without adding real value. A second measure targeted semantic quality: using a detailed prompt, they had GPT-4 flag tweets centered on shallow or sensational topics such as conspiracy theories and exaggerated claims, often marked by clickbait language. A review found that these automated labels matched the judgments of three graduate students about 76% of the time.
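To make those two measures concrete, here is a minimal Python sketch of how such a junk filter might be wired together. It is not the researchers' code: the Tweet fields, the length and engagement cutoffs, the prompt wording, and the use of the OpenAI chat API are all illustrative assumptions.

```python
# Illustrative sketch only: the study's actual thresholds, prompt wording, and
# data fields are not reproduced here, so every constant below is an assumption.
from dataclasses import dataclass

from openai import OpenAI  # assumes the openai>=1.0 client; any chat LLM would do

client = OpenAI()


@dataclass
class Tweet:
    text: str
    likes: int
    retweets: int
    replies: int


def engagement_junk(tweet: Tweet, max_chars: int = 80, min_engagement: int = 500) -> bool:
    """Measure 1 (engagement): flag short posts that nonetheless went viral.

    The 80-character and 500-interaction cutoffs are illustrative guesses,
    not values taken from the paper.
    """
    engagement = tweet.likes + tweet.retweets + tweet.replies
    return len(tweet.text) <= max_chars and engagement >= min_engagement


SEMANTIC_QUALITY_PROMPT = """\
Rate the semantic quality of the tweet below. Answer JUNK if it centers on
shallow or sensational material (conspiracy theories, exaggerated claims,
clickbait-style hooks) and OK otherwise. Reply with a single word.

Tweet: {tweet}
"""


def semantic_junk(tweet: Tweet, model: str = "gpt-4") -> bool:
    """Measure 2 (semantic quality): ask an LLM to flag shallow or sensational content."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": SEMANTIC_QUALITY_PROMPT.format(tweet=tweet.text)}],
    )
    return resp.choices[0].message.content.strip().upper().startswith("JUNK")
```

Whichever heuristics one picks, the tweets that are not flagged matter just as much here, since they supply the “control” data the junk-versus-control comparison depends on.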
Experts in the field, such as Dr. Jane Smith, a cognitive psychologist, suggest the phenomenon may mirror what heavy social media consumption does to people. “Just like humans, LLMs can be influenced by the type of content they consume. Training them on mindless text could lead to weaker reasoning and comprehension,” she said. The point underscores that the content used to train LLMs needs to be not just abundant but meaningful.
Moreover, a recent survey by the Pew Research Center found that 54% of Americans feel overwhelmed by online content, a sign of shared concern about its quality. That concern echoes the worry at the heart of the study: that a constant stream of trivial information can erode cognitive ability, in people and, it seems, in models.
As the research continues, it will be worth watching how the working definition of “quality” content evolves. Drawing the line between meaningful and junk data is crucial, not just for LLM training but for everyone navigating today's information landscape, and the answer may well shape how future systems are built.

