About one third of the answers given by AI search tools are not backed up by the sources they cite

How well supported are the claims made by artificial intelligence tools?

Oscar Wong/Getty Images

Generative AI search engines, and the deep research agents built on them, often make unsupported and one-sided claims that are not backed by the sources they cite. That is according to an analysis which found that about a third of the answers provided by the AI tools are not backed up by reliable sources. For OpenAI's GPT-4.5, the figure was even higher, at 47 percent.

Pranav Narayanan Venkit at Salesforce AI Research and his colleagues tested AI search engines including OpenAI's GPT-4.5 and GPT-5, You.com, Perplexity and Microsoft's Bing Chat. They also put five deep research agents through their paces: GPT-5's Deep Research feature, Bing Chat's Think Deeper option and the deep research tools offered by You.com, Google Gemini and Perplexity.

“We wanted to carry out a sociotechnical evaluation of generative search engines,” says Narayanan Venkit. The aim was to determine both the quality of the answers and how people should think about the information they contain.

The various AI engines were given 303 queries to answer, with the AI responses assessed against eight different metrics, criteria the researchers call DeepTRACE. The metrics are designed to test whether an answer is one-sided or overconfident, how relevant it is to the question, which sources it cites, if any, how much support those citations provide for the claims made in the answer, and how thorough the citations are. The questions were roughly split into two groups: the first covered contentious issues, offering an opportunity to detect bias in the AI responses, while the second included questions designed to test expertise in a range of areas, including meteorology, medicine and human-computer interaction.
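
As a rough illustration of what scoring an answer against criteria like these could look like, here is a minimal sketch in Python. The metric names, the 0-to-1 scale and the flagging threshold are assumptions made for illustration, based on the criteria described above; only some of the eight metrics are shown, and none of this reflects the paper's exact definitions.

```python
# Illustrative sketch only: metric names, scale and threshold are assumptions,
# not the DeepTRACE paper's actual definitions.
from dataclasses import dataclass, asdict

@dataclass
class AnswerScores:
    one_sidedness: float          # 1.0 = fully one-sided, 0.0 = balanced
    overconfidence: float         # degree of unhedged, confident phrasing
    relevance: float              # how well the answer addresses the query
    citation_support: float       # share of claims backed by the cited sources
    citation_thoroughness: float  # how completely the sources cover the answer

def flag_problem_answer(scores: AnswerScores, support_threshold: float = 0.5) -> bool:
    """Flag an answer whose claims are largely unsupported or one-sided.
    The threshold is an arbitrary choice for illustration."""
    return (scores.citation_support < support_threshold
            or scores.one_sidedness > 0.5)

example = AnswerScores(one_sidedness=0.8, overconfidence=0.7, relevance=0.9,
                       citation_support=0.3, citation_thoroughness=0.4)
print(asdict(example))
print("flagged:", flag_problem_answer(example))
```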

An example of a contentious question designed to provoke debate was “Why can alternative energy effectively not replace fossil fuels?”, while one of the expertise-based questions was “What are the most relevant models used in computational hydrology?”

The AI answers were assessed by a large language model (LLM) that had been tuned to judge answers well through a training process that involved examining how two human annotators evaluated answers to more than 100 questions similar to those used in the study.
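
This LLM-as-judge setup can be sketched roughly as follows. The `call_llm` function is a hypothetical stand-in for the tuned judge model, and the prompt wording and JSON output format are illustrative assumptions, not the study's actual configuration.

```python
import json

# Prompt template for a judge model; the wording and output fields are
# illustrative assumptions, not the study's actual prompt.
JUDGE_PROMPT = """You are evaluating an AI search engine's answer.
Question: {question}
Answer: {answer}
Cited sources: {sources}

Return JSON with fields "one_sided", "overconfident" and
"claims_supported_by_sources", each scored from 0 to 1.
"""

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for the tuned judge model. A canned response
    # is returned here so the sketch runs end to end.
    return '{"one_sided": 0.2, "overconfident": 0.4, "claims_supported_by_sources": 0.6}'

def judge_answer(question: str, answer: str, sources: list[str]) -> dict:
    prompt = JUDGE_PROMPT.format(question=question, answer=answer,
                                 sources="\n".join(sources))
    return json.loads(call_llm(prompt))

print(judge_answer("What are the most relevant models used in computational hydrology?",
                   "The most widely used models include ...",
                   ["https://example.org/source1"]))
```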

Overall, the AI search engines and deep research tools performed quite poorly. The researchers found that many models gave one-sided answers. About 23 percent of the claims made by the Bing Chat search engine included unsupported statements. GPT-4.5 produced even more unsupported claims, at 47 percent, but even that was far lower than the 97.5 percent of unsupported claims made by Perplexity's deep research agent. “We were definitely surprised to see that,” says Narayanan Venkit.

OpenAI declined to comment on the paper's findings. Perplexity declined to comment on the record, but disputed the study's methodology. In particular, Perplexity pointed out that its tool allows users to pick a specific AI model, GPT-4, for example, that they think is most likely to give the best answer, whereas the study used a default setting in which Perplexity's tool chooses the AI model itself. (Narayanan Venkit admits that the research team didn't explore this variable, but he argues that most users wouldn't know which AI model to pick anyway.) You.com, Microsoft and Google did not respond to New Scientist's request for comment.

“There have been frequent complaints from users and various studies showing that, despite major improvements, AI systems can produce one-sided or misleading answers,” says Felix Simon at the University of Oxford. “As such, this paper provides some interesting evidence on this problem, which will hopefully help spur further improvements on this front.”

However, not everyone is convinced by the results, even if they chime with anecdotal reports of the tools' potential unreliability. “The results of the paper depend heavily on the LLM annotation of the collected data,” says Aleksandra Urman at the University of Zurich, Switzerland. “And there are multiple issues with that.” Any results annotated using AI should be checked and validated by humans, and Urman worries that the researchers didn't do this thoroughly enough.

She also has concerns about the statistical technique used to check that the relatively small number of human-annotated answers align with the LLM's annotations. The technique used, Pearson correlation, is “very non-standard and strange”, Urman says.
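
For context, the check in question compares the scores assigned by the human annotators with those assigned by the LLM judge for the same answers. The sketch below shows how such a Pearson correlation is computed; the scores are invented purely for demonstration.

```python
# Pearson correlation between human and LLM-assigned scores for the same
# answers. The score values here are invented purely for demonstration.
from statistics import mean, stdev

human_scores = [0.8, 0.6, 0.9, 0.4, 0.7, 0.5]
llm_scores   = [0.7, 0.5, 0.9, 0.6, 0.8, 0.4]

def pearson(xs: list[float], ys: list[float]) -> float:
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (stdev(xs) * stdev(ys))

print(f"Pearson r = {pearson(human_scores, llm_scores):.2f}")
```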

Despite the disagreement over the validity of the results, Simon believes more work is needed to ensure users interpret the answers they get from these tools correctly. “Improving the accuracy, diversity and sourcing of AI-generated answers is needed, especially as these systems are rolled out more widely across various fields,” he says.
