Can AI run out of fuel or kill the web?

Artificial intelligence feeds on human knowledge and gives it back to us in an effortless, directly actionable form, much to our delight. But what would happen if this knowledge dried up?

Artificial intelligence and human intelligence have one thing in common: they need knowledge to work and to train. The more an AI learns, the more it has to work with, and the more it works, the more it progresses within its limits.

In this sense, it is not an inexhaustible science. Take the example of consumer generative AI, the big topic of the moment: it devours everything it finds on the web and then applies a probabilistic model to respond to our requests. Basically, depending on what it has understood from the content it has ingested, it returns what is most likely to correspond to your request.
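To make “most likely” concrete, here is a deliberately toy sketch in Python; it assumes nothing about how any real model is implemented. It counts which word follows which in a tiny corpus and samples the continuation in proportion to those counts. Real generative AIs are vastly more sophisticated, but the principle, probabilities learned from ingested content, is the same.

```python
import random
from collections import Counter, defaultdict

# A tiny corpus standing in for "everything the AI found on the web".
corpus = "the web feeds the model and the model feeds the user".split()

# Count how often each word follows each other word (a bigram model).
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def next_word(prev: str) -> str:
    """Sample the next word in proportion to how often it was observed."""
    counts = following[prev]
    words, weights = zip(*counts.items())
    return random.choices(words, weights=weights)[0]

# The answer is simply the most probable continuation given the corpus:
# biased or erroneous training content shifts these probabilities directly.
print(next_word("the"))  # e.g. "model", "web" or "user", per corpus statistics
```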

AI needs knowledge in quantity and quality

“Most likely” matters a great deal here. If you feed it erroneous or biased content, it can give you an answer that is completely wrong in substance but looks like the most plausible one.

If I know very little or have learned it from fake sources, I may be convinced of what I’m saying, but that doesn’t make it any more true or relevant.

Everything comes from the body of knowledge used to train it and, in this case, from the content available on the web.

This is excellent news, as the web is a seemingly infinite source of constantly updated, improved and renewed content, aided in this by a groundswell that dates back more than 20 years: Web 2.0 and User Generated Content (UGC).

While this may seem like a prehistoric era to some, there was a time when only the media and corporations occupied the web (badly), apart from a few forums and personal sites. It wasn’t until the mid-2000s that blogs, wikis and then social networks gave everyone the right to speak out on the web, for better or for worse.

This may seem trivial, but it is worth saying: without the technologies that made it so easy for everyone to publish online, technologies that media and companies have since appropriated, and without the wave of UGC, AIs, however powerful they may be, would today be like students studying in a library with nearly empty shelves.

But never mind the history: after all, we have an infinite amount of material with which to educate AIs.

Can this virtuous model be challenged?

AIs running out of data?

Before we talk about data quality, let’s talk about quantity.

AI doesn’t merely swallow data; it gluttonously devours it at a speed we can hardly imagine. Good news for its training?

Yes, as long as it has enough to eat, but not when its reserves run dry.

According to a study published on arXiv, the preprint server hosted by Cornell University, AIs could run out of accessible human-generated data between 2026 and 2032 (Will AI run out of public data in the coming years?).

Why? Not only because it consumes data faster than data is produced, but also, as we shall see, because more and more sources of knowledge and information will want to make their content inaccessible to AIs.

Why take knowledge away from AI?

I deliberately use the term knowledge rather than content. I’ve always found that the latter word degrades the value of the former, but it is a fair reflection of the current situation, and even sheds some light on today’s society.

For those who publish knowledge, valuable information, their “content” has value in itself: it is the fruit of learning, experience and an intellectual process, which they decide to disseminate free of charge or for a fee (paid access or advertising).

For others, it’s raw material to be exploited to create value, by charging someone who doesn’t have the knowledge, or the time to mobilize it.

Content is anything published on an accessible platform, regardless of its value; knowledge is content that enriches the reader.

Because content without value does exist: all those sites that do everything they can to grab your attention without teaching you anything, with a catchy title to expose you to a ton of advertising.

Well, those who think in terms of value will tend to keep their publications out of the hands of AI.

AI vs. media

At the top of the list are the media, which monetize their output in two ways, advertising or subscription, or both at once, depending on the value they attribute to a given publication.

Let’s set aside subscription-based content, which is safe for the moment, and talk about advertising-financed content.

Whether AI is used as a substitute for a search engine to answer a simple question or to generate content, the outcome is the same: end users won’t visit the sites in question, and audiences and advertising revenues will drop. Worse still, in the long term, brand awareness will decline.

But the same reasoning applies to all companies that publish studies, a marketing tool which, while it enriches the audience’s knowledge and thinking, is above all intended to demonstrate their know-how, make them known and establish their reputation.

If, instead of taking the trouble to find and read a study, you ask an AI to write a note on the subject, you get the expected end result without caring about its origin, and you kill these companies’ marketing.

And then there are the “web volunteers”, expert bloggers and the like, who earn no income from their work but do it for the love of the craft, or nearly so. Their audience is their reward, and their reputation an asset they can monetize indirectly on the job market.

Same punishment as for the others: their work will be taken advantage of without the slightest recognition.

The media have bargaining power, and some will manage, for a time, to force AI publishers to pay to use their content. The AI publishers see things quite differently, though, and will most likely end up forcing their way in, even if governments try to legislate (Journalism & AI: War is declared!).

As for the “volunteers”, they may simply stop publishing for lack of an audience, or they may add a tag to their publications forbidding indexing, in the hope that it will be respected.
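For the record, that “tag” is typically a robots.txt directive. Below is a minimal sketch using real crawler names (GPTBot, Google-Extended, CCBot), though the exact policy shown, blocking AI training while staying open to classic search, is an assumption for illustration:

```
# robots.txt: ask AI training crawlers to stay out.
# Compliance is voluntary; well-behaved bots honor it, others may not.

# OpenAI's training crawler
User-agent: GPTBot
Disallow: /

# Opts content out of Google's AI training without affecting Search
User-agent: Google-Extended
Disallow: /

# Common Crawl, whose corpus is widely used to train AIs
User-agent: CCBot
Disallow: /

# Everyone else, including classic search engine bots, remains welcome
User-agent: *
Allow: /
```

As the author says, this works only “in the hope that it will be respected”: robots.txt is a convention, not an enforcement mechanism.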

AI vs. fake news

Another danger facing AI is the proliferation of fake news. As the adage “shit in, shit out” goes, if an AI is trained on fake information, or even if such information merely pollutes an otherwise reliable corpus, it will inevitably lose relevance.

The risks of a shortage of available knowledge

The stagnation of available knowledge, its rarefaction or, worse still, its failure to be updated would have a consequence as inevitable as it is dramatic: AI’s loss of relevance. The same goes for a corpus of poor quality or one polluted by fake news.

Let’s not forget that generative AI operates on a probabilistic model: the more information it has with which to cross-check its work, the more relevant it will be. Conversely, starved of information, it risks taking hasty, irrelevant shortcuts.

And it doesn’t take much to tip over into the absurd, as David Fayon recently discovered (Quand l’IA générative déraille ou les risques de “soleil vert des données”, i.e. when generative AI goes off the rails, or the risk of a data “Soylent Green”). A generative AI credited him as the author of a book he hadn’t written and which, for that matter, didn’t even exist.

Seeing this kind of thing happen today on an unimportant subject gives us an idea of what might happen tomorrow if, as David says, AIs were ever to operate in a vacuum.

Transposition to enterprise AI

Enterprise AI is more my subject than consumer AI, so I can’t help but wonder whether there are transposable lessons to be learned from all this.

By enterprise AI I mean business-oriented AI, using data from business applications and internal content.

A priori, as this is a closed ecosystem, the company is safe from fake news.

It is not, however, immune to data quality problems, a major issue, and for two completely contradictory reasons.

The first is that it is not unusual, and not only in large organizations, for different versions of the same document to exist in different parts of the intranet. The proliferation of personal drives doesn’t help either.

And if we imagine that in CRM or ERP-type business applications all data is clean and up to date, we’re making a big mistake.
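To illustrate the first reason, here is a minimal sketch in Python that flags byte-identical copies of documents scattered across folders, assuming a hypothetical local mount of the intranet’s file shares. Genuinely diverging versions of the same document would need fuzzier near-duplicate detection, which this does not attempt.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def find_exact_duplicates(root: str) -> dict[str, list[Path]]:
    """Group files under `root` by the SHA-256 hash of their content."""
    by_hash: dict[str, list[Path]] = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            by_hash[digest].append(path)
    # Keep only hashes shared by more than one file: exact duplicates.
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}

# "/mnt/intranet" is a hypothetical path for the sake of the example.
for digest, copies in find_exact_duplicates("/mnt/intranet").items():
    print(f"{len(copies)} identical copies:", *map(str, copies))
```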

The second is exactly the opposite, in companies that have implemented strict information governance. In this case, production and validation cycles can be long, so information takes time to become available, and in the case of an update the AI won’t have the latest version (AI moves fast, content moves slow).

The last point to bear in mind is the limited amount of data available. This depends, of course, on the company’s size and the use case, but the volume of internal data usable to train an AI can be very small.

Bottom line

The decline in the quantity and quality of publicly available data is a real potential danger for consumer AIs.

Enterprise AIs, for their part, face similar problems, but for different reasons, linked to the company’s size and the information governance it has been able to implement.

Bertrand Duperrin
https://www.duperrin.com/english
Head of People and Business Delivery @Emakina / Former consulting director / Crossroads of people, business and technology / Speaker / Compulsive traveler