Data is essential for AI training, and AI companies often gather it by scraping the web. But scraping frequently amounts to unauthorized use, and site owners aren’t paid for the pilfered data.
Enterprises are increasingly demanding compensation, even as some prominent AI leaders say their data isn’t even all that valuable, and that they’re happy to walk away rather than pay up.
Cloudflare hopes to have enterprises covered either way. The company this week announced a new AI Audit tool that gives site owners the ability to control — and just as importantly, get paid for — data scraped by AI bots without the need to enter into content licensing agreements.
“We think that sites of any size should be fairly compensated for the use of their content,” Sam Rhea, who works on the emerging technology and incubation team at Cloudflare, wrote in a blog post.
To license, to walk away, or something in between?
The AI training and copyright question has been top of mind for many site owners and publishers as genAI has rapidly evolved over the last two years. Some disputes have led to high-profile lawsuits that are working their way through the court system, far more slowly than crawlers are gobbling up data.
Facing the reality of the groundbreaking technology and its staying power, some site owners and publishers have struck content licensing deals with prominent AI platforms. These give them attribution rights and sometimes include terms about what content can be accessed and how often. For instance, OpenAI has made agreements with companies including Shutterstock, Reddit, Vox Media, Time, Universal Music Group and the Financial Times.
However, Rhea points out, “not everyone has the time or contacts to negotiate deals with AI companies.”
Some tech heavy hitters are pushing back against such contracts, introducing a whole new twist. Meta CEO Mark Zuckerberg, for one, boldly asserted this week that publishers tend to “overestimate the value of their specific content.” If “push comes to shove,” he said, his company won’t use content when a publisher demands payment. “It’s not like that’s going to change the outcome of this stuff that much,” he told The Verge.
Cloudflare’s answer to this conundrum is a new feature that will allow publishers to set a price for the use of content on their site (or sections of it), essentially creating a new marketplace.
“Cloudflare doesn’t want publishers to lock down their work altogether; they’re trying to be the middleman in facilitating a more controlled and potentially profitable exchange between content creators and AI companies,” says Dev Nag, CEO and founder at query execution platform QueryPal. “This approach represents a significant shift in how web content might be valued and accessed in the age of AI.”
Bringing clarity to the ‘murkiness’ in AI scraping
AI crawlers and bots rarely drive traffic to sites, Rhea pointed out. Data scraper bots request content on a page, capture the responses, then store that data.
“Your material is then put into a kind of blender, mixed up with other content, and used to answer questions from users without attribution or the need for users to visit your site,” Rhea wrote.
Meanwhile, AI search crawler bots scan content that they can then cite in AI search engine responses without proper attribution. The downside is that users will often stay on that engine rather than visit sites directly, because answers are created in real time in front of them.
This creates a “murkiness” where the value exchange is unclear, Rhea noted. “We believe this poses a risk to an open internet. Both sides lack the tools to create a healthy, transparent exchange of permissions and value.”
Cloudflare’s new AI Audit first provides detailed analytics on the AI services that crawl a customer’s site and the specific content they’re accessing. Activity is broken down by provider and type of bot. Site owners then have a one-click option to block any crawlers. They can keep that block in place permanently, or use filters to grant access to certain providers. Eventually, they will be able to earn income from those exchanges.
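To make the audit-then-filter idea concrete, here is a minimal sketch in Python of how requests might be grouped by provider and bot type, and how an allowlist of providers might be applied. It is not Cloudflare's implementation. The lookup table, the bot-type labels and the function names (`classify`, `audit`, `is_allowed`) are all assumptions made for illustration, and matching on the user-agent string alone is a simplification, since that header can be spoofed.

```python
from collections import Counter

# Hypothetical lookup table: user-agent token -> (provider, bot type).
# The bot-type labels are illustrative assumptions, not official categories.
KNOWN_AI_BOTS = {
    "GPTBot": ("OpenAI", "data scraper"),
    "OAI-SearchBot": ("OpenAI", "search crawler"),
    "ClaudeBot": ("Anthropic", "data scraper"),
    "PerplexityBot": ("Perplexity", "search crawler"),
    "CCBot": ("Common Crawl", "data scraper"),
}


def classify(user_agent):
    """Return (provider, bot type) for a recognized AI bot, else None."""
    ua = user_agent.lower()
    for token, info in KNOWN_AI_BOTS.items():
        if token.lower() in ua:
            return info
    return None


def audit(requests):
    """Count AI bot requests per (provider, bot type).

    `requests` is an iterable of (user_agent, path) pairs, for example
    parsed from a server access log.
    """
    counts = Counter()
    for user_agent, _path in requests:
        info = classify(user_agent)
        if info is not None:
            counts[info] += 1
    return counts


def is_allowed(user_agent, allowed_providers):
    """Block recognized AI bots unless their provider is on the allowlist."""
    info = classify(user_agent)
    if info is None:
        return True  # not a recognized AI bot, so the filter does not apply
    return info[0] in allowed_providers
```

Under these assumptions, `is_allowed(ua, set())` models the block-everything setting for recognized AI bots, while passing a set such as `{"OpenAI"}` models granting access to a single provider.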
Meanwhile, for customers who have struck deals with AI companies, Cloudflare provides a “single click” report to audit activity allowed in those contracts.
“AI will fundamentally change online content, and we all need to decide together what that future will look like,” Cloudflare’s co-founder and CEO Matthew Prince said in a statement. “Content creators and website owners of all sizes deserve to own and regain control of their content. If not, the quality of online information will either deteriorate or be available only to those who pay for it.”
Cloudflare as both protector and broker
Cloudflare is essentially positioning itself as “both protector and broker,” Nag noted, which could create a “new ecosystem” where creators have more control over how their work is used by AI while still giving AI companies access to training data.
“This marketplace approach suggests that the era of the open web, which flourished from around 1993 to 2020, may have been a transitional phase in the internet’s evolution,” he said. “We’re likely moving towards a future where access to online content is increasingly segmented and monetized, especially for AI consumption.”
Ultimately, this could “fundamentally alter” how information is distributed and accessed online, potentially creating distinct tiers of content availability based on commercial agreements and AI training needs, Nag said.
This can provide new revenue streams for content creators, he said, but cautioned: “It also raises concerns about equity of access to information and the potential stratification of the digital landscape.”