As content providers increasingly block automated data collection, AI companies are facing a critical shortage of the high-quality data needed to train their models. (Illustration by Kim Sung-kyu)

The escalating restrictions on web scraping are posing significant challenges for AI companies, which depend on vast amounts of data to develop and enhance their generative artificial intelligence (AI) models.

AI developers have traditionally relied on automated programs, or bots, to crawl the internet and collect the vast amounts of data needed to train their models. Recently, however, content providers, including news organizations, have begun blocking these bots from accessing their websites. AI companies have previously faced lawsuits over the unauthorized use of content, but now the collection of data itself is being fundamentally restricted.
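For a sense of the mechanics: a crawler of this kind typically downloads a site's robots.txt file, checks whether its user agent is permitted, and only then fetches pages for a training corpus. The sketch below is a minimal, hypothetical illustration in Python using only the standard library; the bot name "ExampleAIBot" and the target URL are placeholders, not any company's actual crawler.

```python
# Minimal sketch of a crawler that honors robots.txt before collecting a page.
# Hypothetical example: "ExampleAIBot" and the target URL are placeholders.
import urllib.robotparser
import urllib.request
from urllib.parse import urlsplit

USER_AGENT = "ExampleAIBot/1.0"          # assumed bot name, for illustration only
TARGET = "https://example.com/article"   # placeholder page to collect

def allowed_by_robots(url, user_agent):
    """Fetch the site's robots.txt and ask whether this bot may crawl the URL."""
    parts = urlsplit(url)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()
    return rp.can_fetch(user_agent, url)

def fetch_for_training(url):
    """Download the page text only if robots.txt permits it."""
    if not allowed_by_robots(url, USER_AGENT):
        return None  # publisher has opted out; skip this page
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8", errors="replace")

if __name__ == "__main__":
    page = fetch_for_training(TARGET)
    print("collected" if page else "blocked by robots.txt")
```

Compliance with robots.txt is voluntary, which is why services that block non-compliant bots at the network level have become part of publishers' defenses.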

Big tech companies prioritize collecting as much data as possible to enhance AI model performance. However, there are growing concerns that both the quantity and quality of available data are declining.

The American AI research institute Epoch AI projects a dire scenario: if current trends continue, sourcing new AI training data could become nearly impossible between 2026 and 2032, because acquiring new training data will grow increasingly difficult without proper compensation for copyright holders.

Recently, the American cybersecurity company Cloudflare released a free tool to block unauthorized data scraping from websites. The tool keeps bots from companies such as OpenAI, Google, and Apple from accessing websites without the owner’s consent. A Cloudflare official said the company will provide tools to stop malicious actors from scraping websites at scale.

Reddit, the largest online community in the U.S. and a heavily used source of AI training data, has also strengthened its anti-crawling measures. Earlier this year, it signed paid content provision contracts with Google and OpenAI, reinforcing its stance against unauthorized scraping of its content.

Notably, media organizations with high-quality data have already begun blocking data collection by AI companies. According to Reuters, as of the end of last year, more than half of 1,165 media outlets examined had blocked OpenAI, Google, and the nonprofit data collection organization Common Crawl from accessing their sites.
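In practice, many of these blocks are expressed as robots.txt rules naming the crawlers that gather AI training data. The snippet below is an illustrative example rather than any specific outlet's file; GPTBot, Google-Extended, and CCBot are the publicly documented user agents for OpenAI's crawler, Google's AI-training opt-out, and Common Crawl, respectively.

```text
# Illustrative robots.txt entries blocking AI-training crawlers
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /
```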

The MIT-led Data Provenance Initiative (DPI) reported that 5% of the 14,000 websites used for AI data collection had blocked crawler access last year. Among high-quality content sources such as media outlets, the figure rises to 25%. The DPI noted that measures prohibiting data collection are spreading rapidly across the web.

With restrictions on web scraping mounting, AI model developers are struggling to secure the data they need. While new data continues to be generated online, it falls short of meeting the demand for AI training.

OpenAI’s GPT-3, released in 2020, was trained on approximately 300 billion tokens, the smallest units of text that an AI model learns from. Three years later, GPT-4 was trained on 12 trillion tokens, 40 times as many.
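For a rough sense of what a token is, OpenAI's open-source tiktoken library shows how a tokenizer splits text into the integer IDs a model trains on. The example below is illustrative only; the cl100k_base encoding is used here as a representative tokenizer, not necessarily the one behind any particular model's training run.

```python
# Minimal illustration of tokenization with OpenAI's tiktoken library
# (pip install tiktoken). The choice of encoding is illustrative only.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
text = "AI companies are running out of training data."
tokens = enc.encode(text)

print(len(tokens), "tokens:", tokens)        # the integer IDs the model sees
print([enc.decode([t]) for t in tokens])     # the text piece behind each ID

# Scale of the training corpora cited above:
gpt3_tokens = 300e9    # ~300 billion tokens (GPT-3, 2020)
gpt4_tokens = 12e12    # ~12 trillion tokens (GPT-4, per the figures above)
print(f"GPT-4 used about {gpt4_tokens / gpt3_tokens:.0f}x as many tokens as GPT-3")
```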

Meta’s generative AI model Llama 3, released this year, was trained on over 15 trillion tokens. According to Epoch AI, GPT-5 is expected to be trained on around 60 trillion tokens, but the high-quality data currently available could fall 10 to 20 trillion tokens short.