OpenAI searches for an answer to its copyright problems

August 30, 2024

The huge leaps in OpenAI’s GPT model probably came from sucking down the entire written web. That includes entire archives of major publishers such as Axel Springer, Condé Nast, and The Associated Press — without their permission. But for some reason, OpenAI has announced deals with many of these conglomerates anyway.

At first glance, this doesn’t entirely make sense. Why would OpenAI pay for something it already had? And why would publishers, some of whom are lawsuit-style angry about their work being stolen, agree?

I suspect if we squint at these deals long enough, we can see one possible shape of the future of the web forming. Google has been referring less and less traffic outside itself — which threatens the existence of the entire rest of the web. That’s a power vacuum in search that OpenAI may be trying to fill.

The deals

Let’s start with what we know. The deals give OpenAI access to publications in order to, for instance, “enrich users’ experience with ChatGPT by adding recent and authoritative content on a wide variety of topics,” according to the press release announcing the Axel Springer deal. The “recent content” part is clutch. Scraping the web means there’s a date beyond which ChatGPT can’t retrieve information. The closer OpenAI is to real-time access, the closer its products are to real-time results.

The terms around the deals have remained murky, I assume because everyone has been thoroughly NDA’d. Certainly I am in the dark about the specifics of the deal with Vox Media, the parent company of this publication. In the case of the publishers, keeping details private gives them a stronger hand when they pivot to, let’s say, Google and AI startup Anthropic — in the same way that not disclosing your previous salary lets you ask for more money from a new would-be employer.

OpenAI has been offering as little as $1 million to $5 million a year to publishers, according to The Information. There’s been some reporting on the deals with publishers such as Axel Springer, the Financial Times, News Corp, Condé Nast, and the AP. My back-of-the-envelope math based on publicly reported figures suggests that the ceiling on these deals is $10 million per publication per year.

On the one hand, this is peanuts, just embarrassingly small amounts of money. (The company’s former top researcher Ilya Sutskever made $1.9 million in 2016 alone.) On the other hand, OpenAI has already scraped all these publications’ data anyway. Unless and until it is prohibited by courts from doing so, it can just keep doing that. So what, exactly, is it paying for?

Maybe it’s API access, to make scraping easier and more current. As it stands, ChatGPT can’t answer up-to-the-moment queries; API access might change that.

But these payments can be thought of, also, as a way of ensuring publishers don’t sue OpenAI for the stuff it’s already scraped. One major publication has already filed suit, and the fallout could be much more expensive for OpenAI. The legal wrangling will take years.

The New York Times is prepared to litigate

If OpenAI ingested the entirety of the text-based internet, that means a couple things. First, that there’s no way to generate that volume of data again anytime soon, so that may limit any further leaps in usefulness from ChatGPT. (OpenAI notably has not yet released GPT-5.) Second, that a lot of people are pissed.

Many of those people have filed lawsuits, and the most important was filed by The New York Times. The Times’ lawsuit alleges that when OpenAI ingested its work to train its LLMs, it engaged in copyright infringement. Moreover, the product OpenAI created by doing this now competes with the Times and is meant to “steal audiences away from it.”

The Times’ lawsuit says that it tried to negotiate with OpenAI to permit the use of its work, but those negotiations failed. I’m going to take a wild guess based on the math I did above and say it’s because OpenAI offered insultingly low sums of money to the Times. Its excuse? Fair use — a provision that allows the unlicensed use of copyrighted material under certain circumstances.

If the Times wins its lawsuit, it may be entitled to statutory damages, which start at $750 per work. (I know those figures because — as you may have guessed from my use of “statutory” — they are dictated by law. The paper is also asking for compensatory damages, restitution, and attorneys’ fees.) The Times says that OpenAI ingested 10 million total works — so that’s an absolute minimum of $7.5 billion in statutory damages alone. No wonder the Times wasn’t going to cut a deal in the single-digit millions.

So when OpenAI makes its deals with publishers, they are, functionally, settlements that guarantee the publishers won’t sue OpenAI as the Times is doing. They are also structured so that OpenAI can maintain that its previous use of the publishers’ work is fair use, because OpenAI is going to have to argue that in multiple court cases, most notably the one with the Times.

“I do have every reason to believe that they would like to preserve their rights to use this under fair use,” says Danielle Coffey, the CEO of the News Media Alliance. “They wouldn’t be arguing that in a court if they didn’t.”

It seems like OpenAI is hoping to clean up its reputation a little. If you’re introducing a new product you want people to pay for, it simply can’t come with a ton of baggage and uncertainty. And OpenAI does have baggage: to make its fair use defense, it must admit to taking The New York Times’ copyrighted material without permission — which implicitly suggests it’s taken a lot of other copyrighted material without permission, too. Its argument is just that it is legally entitled to do that.

There’s also a question of accuracy. At this point, we all know generative AI makes stuff up. The publisher deals don’t just provide legitimacy — they may also help feed generative AI information that is less likely to result in embarrassing errors.

Google

There’s more at play than just lawsuit prevention and reputation management. Remember how the deals also give OpenAI up-to-date information? OpenAI recently announced SearchGPT, its very own search engine. AI-native web searching is still nascent, but being able to filter out AI-generated SEO glurge in favor of real sources of reliable information would be a leg up.

Google Search has seriously degraded over the last several years, and the AI chatbot Google has slapped on top of its results hasn’t exactly helped matters. It sometimes gives inaccurate answers while burying links with real information farther down the page. If you want to build a product to upend web search as we know it, now’s the time.

Google has also managed to piss off publishers — not just by ingesting all their data for its large language models, but also by repurposing itself. Once upon a time, Google Search was a major source of traffic for publishers and a way of directing people to primary sources. But then, Google introduced “snippets,” which meant that people didn’t have to click through to a link in order to find out, for instance, how much to dilute coconut cream to make it a coconut milk equivalent. Because people didn’t go to the original source, publishers didn’t get as many impressions on their ads. Various other changes to Search over the years have meant that Google has referred less traffic to publishers, especially smaller ones.

Now, Google’s AI chatbot sidelines publishers further. But the OpenAI deals give publishers a little more leverage and may eventually force Google to the negotiating table.

Google is not generally in the habit of making paid deals for search; until recently, the arrangement was that publishers got traffic referrals. But for its chatbot, Google did make a deal: with Reddit. For $60 million a year, Google has access to Reddit, cutting off every search engine that didn’t make a similar deal. This is significantly more money than OpenAI is paying publishers, and has cracked open a door that it seems publishers intend to walk through.

Google has been getting less useful to the average person for years now. Generative AI threatens to make that worse, by creating sites full of junk text that serve ads. Google doesn’t treat all the sites it crawls the same, of course. But if someone can come up with an alternative that promises higher quality information, the search engine that lost its way may be in real trouble. After all, that’s how Google itself unseated the search engines that came before it, such as AltaVista.

OpenAI burns money, and may lose $5 billion this year. It’s currently in talks for yet another funding round, one that would value the company at over $100 billion. To justify anything close to this valuation, it needs a path to profitability. Taking over the search market is the kind of thing that could justify all that investment.

OpenAI’s SearchGPT isn’t a serious threat yet. It’s still a “prototype,” which means that if it makes an error on the order of telling people to put glue on their pizza, that’s easier to explain away. Unlike Google, a utility for almost every person online, SearchGPT has a limited number of users — so a lot fewer people will see any early mistakes.

The deals with publishers also provide SearchGPT with another reputational cushion. Its competitor Perplexity is under fire for scraping sites that have explicitly banned it. SearchGPT, by contrast, is a collaboration with the publishers who inked deals.

It’s not totally clear what the pivot to “answer engines” means for publishers’ bottom lines. Maybe some people will continue to click through to see original sources, especially if it isn’t possible to remove hallucinations from large language models. Another possible model comes from Perplexity, which belatedly introduced a revenue-sharing program.

The revenue sharing program makes it a little easier for Perplexity to claim its scraping is fair use (sound familiar?). Perplexity’s situation is a little different than ChatGPT’s; it has created a “Pages” product that has an unfortunate tendency to plagiarize copyrighted material. Forbes and Condé Nast have already sent Perplexity legal nastygrams.

So here’s the big question: what happens when the courts actually rule? Part of the reason these publisher deals exist at all is to reduce the threat of legal action. But their very existence may cut against the argument that scraping copyrighted material for AI is fair use.

Copywrong

A ruling in favor of The New York Times can potentially help both Google and OpenAI, as well as Microsoft, which is backing OpenAI. Maybe this was what Eric Schmidt, former Google CEO, meant when he said entrepreneurs should do whatever they want with copyrighted work and “hire a whole bunch of lawyers to go clean the mess up.”

Courts are unpredictable when it comes to copyright law because it kind of works like porn — judges know a violation when they see it. Plus, if there is indeed a trial between The New York Times and OpenAI, there will almost certainly be an appeal on the verdict, no matter who wins.

Court cases take time, and appeals take more time. It will be years before the courts sort all this out. And that’s plenty of time for a player like OpenAI to develop a dominant business.

Let’s say OpenAI eventually loses. That means all creators of large language models have to pay out. That can get very expensive, very fast, meaning that only the biggest players will be able to compete. It ensconces every established player and potentially destroys a number of open-source LLMs. That makes Google, Microsoft, Amazon, and Meta even more dominant in the ecosystem than they already are, along with OpenAI and Anthropic, both of which have deals with some of the major players.

There’s also some precedent in how big tech companies navigate the rulings against them, says the News Media Alliance’s Coffey. She specifically cites Google as being so big that it can force publishers into its terms; as if to underscore her point, a few weeks after our interview, Google was legally declared a monopoly in an antitrust case.

Here’s an example of Google’s outsize power: In 2019, the EU gave digital publishers the right to demand payment when Google used snippets of their work. This law, first implemented in France, resulted in Google telling publishers it would use only headlines from their work rather than pay. “And so they sent a bunch of letters to French publications, saying waive your copyright protection if you want to be found,” Coffey said. “They’re almost above the law in that sense” because Google Search is so dominant.

Google is currently using its search dominance to squeeze publishers in a similar way. Blocking its AI from summarizing people’s work means that Google simply won’t list them at all, because it uses the same tool to scrape for web search and AI training.

So if the Times wins, it seems possible that Google and other major AI players could still demand deals that don’t benefit publishers much — while also destroying competing LLMs. “I’m incredibly worried about the possibility that we are setting up an ecosystem where the only people who are going to be able to afford training data are the biggest companies,” says Nicholas Garcia, policy counsel at Public Knowledge.

In fact, the existence of the suit may be enough to discourage some players from using publicly accessible data to train their models. People might perceive that they can’t train on publicly available data, narrowing competition even further than the existing bottlenecks in the supply of compute and experts already do. “That would be a real anticompetitive tragedy at the beginning of the ecosystem,” Garcia says.

OpenAI isn’t the only defendant in the Times case; the other one is its partner, Microsoft. And if OpenAI does have to pay out a settlement that is, at minimum, hundreds of millions of dollars, that might open it up to an acquisition from Microsoft — which then has all the licensing deals that OpenAI already negotiated, in a world where the licensing deals are required by copyright law. Pretty big competitive advantage. Granted, right now, Microsoft is pretending it doesn’t really know OpenAI because of the government’s newfound interest in antitrust, but that could change by the time the copyright cases have rolled through the system.

And OpenAI may lose because of the licensing deals it negotiated. Those deals created a market for the publishers’ data, and under copyright law, if you’re disrupting such a market, well, that’s not fair use. This particular line of argument most recently came up in a Supreme Court case about an Andy Warhol painting that was found to unfairly compete with the original photograph used to create the painting.

The legal questions aren’t the only ones, of course. There’s something even more basic I’ve been wondering about: do people want answer engines, and if so, are they financially sustainable? Search isn’t just about finding answers — Google is a way of finding a specific website without having to memorize or bookmark the URL. Plus, AI is expensive. OpenAI might fail because it simply can’t turn a profit. As for Google, it could be broken up by regulators because of that monopoly finding.

In that case, maybe the publishers are the smart ones after all: getting the money while the money’s still good.