Business

When A.I. Chatbots Hallucinate

When did The New York Times first report on “artificial intelligence”?

According to ChatGPT, it was July 10, 1956, in an article titled “Machines Will Be Capable of Learning, Solving Problems, Scientists Predict” about a seminal conference at Dartmouth College. The chatbot went on to describe the article as if it were real.

The 1956 conference was real. The article was not. ChatGPT simply made it up. ChatGPT doesn’t just get things wrong at times; it can fabricate information. Names and dates. Medical explanations. The plots of books. Internet addresses. Even historical events that never happened.

When ChatGPT was recently asked how James Joyce and Vladimir Lenin first met — there is no evidence they ever did — it responded with a confident, fabricated account of their meeting.

Fabrications like these are common. Figuring out why chatbots make things up and how to solve the problem has become one of the most pressing issues facing researchers as the tech industry races toward the development of new A.I. systems.

Chatbots like ChatGPT are used by hundreds of millions of people for an increasingly wide array of tasks, including email services, online tutors and search engines. And they could change the way people interact with information. But there is no way of ensuring that these systems produce information that is accurate.

The technology, called generative A.I., relies on a complex algorithm that analyzes the way humans put words together on the internet. It does not decide what is true and what is not. That uncertainty has raised concerns about the reliability of this new kind of artificial intelligence and calls into question how useful it can be until the issue is solved or controlled.

The tech industry often refers to the inaccuracies as “hallucinations.” But to some researchers, “hallucinations” is too much of a euphemism. Even researchers within tech companies worry that people will rely too heavily on these systems for medical and legal advice and other information they use to make daily decisions.

“If you don’t know an answer to a question already, I would not give the question to one of these systems,” said Subbarao Kambhampati, a professor and researcher of artificial intelligence at Arizona State University.

ChatGPT wasn’t alone in erring on the first reference to A.I. in The Times. Google’s Bard and Microsoft’s Bing chatbots both repeatedly provided inaccurate answers to the same question. Though false, the answers seemed plausible as they blurred and conflated people, events and ideas.

Microsoft’s Bing attributed its findings to a realistic-looking web address on The Times’s website.

According to The Times’s archives, all the chatbots were wrong. They cited articles that did not exist. And while coverage of early research on thinking machines dated to the 1930s, it wasn’t until 1963 that The Times first published an article with the phrase “artificial intelligence.”

“We released Bard as an experiment and want to be as transparent as possible about well documented limitations,” Jennifer Rodstrom, a spokeswoman for Google, said. “These are top of mind for us as we continue to fine tune Bard.”

Like Google, Microsoft and OpenAI say they are working to reduce hallucinations.

The new A.I. systems are “built to be persuasive, not truthful,” an internal Microsoft document said. “This means that outputs can look very realistic but include statements that aren’t true.”

The chatbots are driven by a technology called a large language model, or L.L.M., which learns its skills by analyzing massive amounts of digital text culled from the internet.

By pinpointing patterns in that data, an L.L.M. learns to do one thing in particular: guess the next word in a sequence of words. It acts like a powerful version of an autocomplete tool. Given the sequence “The New York Times is a ____,” it might guess “newspaper.”
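
In rough terms, that guessing can be sketched in a few lines of code. The toy Python example below is a deliberately simplified illustration, not how any production chatbot works: it picks the word that most often followed a phrase in a tiny sample of text, whereas real large language models replace the counting with neural networks trained on vast amounts of data.

```python
from collections import Counter

# Toy illustration of next-word guessing: count which word most often
# follows a given phrase in a tiny sample of text. Real large language
# models use neural networks trained on enormous datasets, but the
# underlying objective -- predict the next word -- is the same in spirit.
sample_text = (
    "the new york times is a newspaper . "
    "the new york times is a company . "
    "the new york times is a newspaper ."
).split()

def guess_next_word(prompt_words, corpus):
    """Return the word that most often follows the prompt in the corpus."""
    n = len(prompt_words)
    followers = Counter(
        corpus[i + n]
        for i in range(len(corpus) - n)
        if corpus[i:i + n] == prompt_words
    )
    return followers.most_common(1)[0][0] if followers else None

print(guess_next_word(["is", "a"], sample_text))  # prints "newspaper"
```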

Because the internet is filled with untruthful information, the technology learns to repeat the same untruths. And sometimes the chatbots make things up. They produce new text, combining billions of patterns in unexpected ways. This means even if they learned solely from text that is accurate, they may still generate something that is not.

Because these systems learn from more data than humans could ever analyze, even A.I. experts cannot understand why they generate a particular sequence of text at a given moment. And if you ask the same question twice, they can generate different text.

That compounds the challenges of fact-checking and improving the results.

Bard, for example, gave different responses to the same question in two separate chats.

Companies like OpenAI, Google and Microsoft have developed ways to improve the accuracy. OpenAI, for instance, tries to refine the technology with feedback from human testers.

As people test ChatGPT, they rate the chatbot’s responses, separating useful and truthful answers from those that are not. Then, using a technique called reinforcement learning, the system spends weeks analyzing the ratings to better understand what is fact and what is fiction.
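
That feedback loop can be pictured with a simplified sketch. The Python below is a hypothetical illustration, not OpenAI’s actual pipeline: it uses the testers’ ratings directly as a stand-in for a learned reward signal and simply prefers whichever candidate answer humans scored higher, whereas a real reinforcement-learning setup would train the model itself to produce such answers.

```python
from collections import defaultdict

# Hypothetical sketch of learning from human ratings. Real systems train a
# separate reward model and use reinforcement learning to update the chatbot;
# this toy version just averages the ratings each answer received and
# re-ranks candidate answers accordingly.
human_ratings = [
    ("The Times first printed the phrase in 1963.", +1),  # rated truthful
    ("The Times first reported on A.I. in 1956.", -1),    # rated fabricated
    ("The Times first printed the phrase in 1963.", +1),
]

ratings_by_answer = defaultdict(list)
for answer, rating in human_ratings:
    ratings_by_answer[answer].append(rating)

def score(answer):
    """Average human rating, standing in for a learned reward model."""
    ratings = ratings_by_answer.get(answer, [0])
    return sum(ratings) / len(ratings)

candidates = [
    "The Times first printed the phrase in 1963.",
    "The Times first reported on A.I. in 1956.",
]
print(max(candidates, key=score))  # the higher-rated answer wins
```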

A newer version of ChatGPT called ChatGPT Plus, which is available for a $20 monthly subscription, consistently avoided answering the question about the first mention of artificial intelligence in The Times. This could be the result of reinforcement learning or other changes to the system applied by OpenAI.

Microsoft built its Bing chatbot on top of OpenAI’s underlying technology, called GPT-4, and has layered on other ways to improve accuracy. The company uses GPT-4 to compare the chatbot’s responses with the underlying data and rate how the model is performing. In other words, Microsoft uses the A.I. to make the A.I. better.
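
A rough sketch of that idea follows. The ask_grader_model helper is a hypothetical placeholder, not a real Microsoft or OpenAI interface; the point is simply that one model is asked to judge whether another model’s answer is supported by the underlying data.

```python
# Sketch of using one model to grade another's answer against source data.
# ask_grader_model is a placeholder; a real system would call a model such
# as GPT-4 here and parse its verdict.

def ask_grader_model(prompt: str) -> str:
    # Placeholder: always says "unsupported". A real call would return the
    # grading model's actual judgment.
    return "unsupported"

def grade_response(source_text: str, chatbot_answer: str) -> str:
    prompt = (
        "Source material:\n" + source_text + "\n\n"
        "Chatbot answer:\n" + chatbot_answer + "\n\n"
        "Is every claim in the answer supported by the source? "
        "Reply 'supported' or 'unsupported'."
    )
    return ask_grader_model(prompt)

verdict = grade_response(
    "The Times first printed the phrase 'artificial intelligence' in 1963.",
    "The Times first reported on A.I. on July 10, 1956.",
)
print(verdict)  # with a real grading model, the fabricated date would be flagged
```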

The company also tries to improve the chatbot’s responses with help from its traditional internet search engine. When you type a query into the Bing chatbot, Microsoft runs an internet search on the same subject and then folds the results into the query before sending it on to the bot. By editing the query, said Sarah Bird, a leader in Microsoft’s responsible A.I. efforts, the company can push the system to produce better results.
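
In outline, that grounding step looks something like the sketch below. The web_search and ask_chatbot helpers are hypothetical placeholders rather than real Bing or OpenAI interfaces; the essential move is folding retrieved text into the prompt before it reaches the model.

```python
# Sketch of grounding a chatbot query with search results. Both helpers are
# placeholders for real systems: a search engine and a large language model.

def web_search(query: str) -> list[str]:
    # Placeholder: a real implementation would query a search engine.
    return [
        "1963: The Times publishes its first article containing the "
        "phrase 'artificial intelligence'."
    ]

def ask_chatbot(prompt: str) -> str:
    # Placeholder: a real implementation would call a language model.
    return "(model response to: " + prompt[:60] + "...)"

def grounded_answer(question: str) -> str:
    snippets = web_search(question)
    prompt = (
        "Answer using only the sources below. If the sources do not "
        "contain the answer, say you do not know.\n\n"
        "Sources:\n"
        + "\n".join("- " + s for s in snippets)
        + "\n\nQuestion: " + question
    )
    return ask_chatbot(prompt)

print(grounded_answer("When did The Times first mention artificial intelligence?"))
```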

Google uses similar methods to improve the accuracy of its Bard chatbot. It uses human feedback to hone the system’s behavior, and it “grounds” the system using information from the company’s search engine, said Eli Collins, a vice president of research at Google.

Microsoft does not check the bot’s responses for accuracy in real time, Ms. Bird said, though it is researching how to do that. It checks the accuracy of a small portion of results after the fact and then uses that analysis to improve the system.

But becoming more accurate may also have a downside, according to a recent research paper from OpenAI. If chatbots become more reliable, users may become too trusting.

“Counterintuitively, hallucinations can become more dangerous as models become more truthful, as users build trust in the model when it provides truthful information in areas where they have some familiarity,” the paper said.

Steve Lohr and Nico Grant contributed reporting. Jack Begg and Susan C. Beachy contributed research.

Source: New York Times
