
The ARC-AGI-2 benchmark is designed to be a difficult test for artificial intelligence models
Just_super/Getty Images
The most sophisticated artificial intelligence models in existence today have scored poorly on a new benchmark designed to measure their progress towards artificial general intelligence (AGI), and brute-force computing power will not be enough to improve their scores, because evaluators now take into account the cost of running the model.
There are many competing definitions of AGI, but it is generally taken to refer to an artificial intelligence that can perform any cognitive task that humans can do. To measure this, the ARC Prize Foundation previously launched a test of reasoning abilities called ARC-AGI-1. Last December, OpenAI announced that its o3 model had scored highly on the test, which led some to ask whether the company was close to achieving AGI.
But now a new test, ARC-AGI-2, has raised the bar. It is difficult enough that no current AI system on the market can achieve more than a single-digit score out of 100 on the test, while each question has been solved by at least two humans within two attempts.
In a blog post announcing ARC-AGI-2, ARC president Greg Kamradt said that the new benchmark was required to test different skills from the previous iteration. “To beat it, you must demonstrate both a high level of adaptability and high efficiency,” he wrote.
The ARC-AGI-2 benchmark differs from other AI benchmark tests in that it focuses on the ability of AI models to complete simplistic tasks, such as replicating changes in a new image based on past examples of symbolic interpretation, rather than their ability to match world-leading PhD-level performance. Current models are good at “deep learning”, which ARC-AGI-1 measured, but they are not as good at the seemingly simpler tasks in ARC-AGI-2, which require more challenging thinking and interaction. OpenAI's o3-low model, for example, scores 75.7 percent on ARC-AGI-1, but only 4 percent on ARC-AGI-2.
The benchmark also adds a new dimension to measuring an AI's capabilities by looking at its efficiency in solving problems, as measured by the cost required to complete a task. For example, while ARC paid its human testers $17 per task, it estimates that o3-low costs OpenAI $200 in fees for the same work.
“I think the new iteration of ARC-AGI, now focusing on balancing performance with efficiency, is a big step towards a more realistic evaluation of AI models,” says Joseph Imperial at the University of Bath, UK. “This is a sign that we're moving from one-dimensional evaluation tests solely focusing on performance, but also considering less compute power.”
Any model that is able to pass ARC-AGI-2 would need to be not just highly competent, but also smaller and more lightweight, says Imperial, with the efficiency of the model being a key component of the new benchmark. This could help address concerns that AI models are becoming more energy-intensive, sometimes to the point of wastefulness, to achieve ever-greater results.
However, not everyone is convinced that the new measure is useful. “The whole framing of this as testing intelligence is not the right framing,” says Catherine Flick at Staffordshire University, UK. Instead, she says, these benchmarks merely assess an AI's ability to complete a single task or set of tasks well, which is then extrapolated to mean general capabilities across a series of tasks.
Performing well on these benchmarks should not be seen as a major moment for AGI, says Flick: “You see the media pick up that these models are passing these human-level intelligence tests, where actually they're not; what they're doing is really just responding to a particular prompt.”
Exactly what happens if or when ARC-AGI-2 is passed is another question: will we need yet another benchmark? “If they were to develop ARC-AGI-3, I'm guessing they would add another axis in the graph denoting [the] minimum number of humans, whether expert or not, it would take to solve the tasks, on top of performance and efficiency,” says Imperial.