
Scientific laboratories can be dangerous places
People Pictures/Shutterstock
Researchers have warned that the use of artificial intelligence models in scientific laboratories threatens to enable dangerous experiments that could cause fires or explosions. Such models offer a convincing illusion of understanding, but are prone to ignoring basic and vital safety precautions. In tests of 19 advanced AI models, every one of them made potentially fatal errors.
Serious accidents in university laboratories are rare but certainly not unheard of. In 1997, chemist Karen Wetterhahn was killed by dimethylmercury that seeped through her protective gloves; in 2016, an explosion cost one researcher her arm; and in 2014, a scientist was partially blinded.
Now, AI models are being put to work in a variety of industries and fields, including research laboratories, where they can be used to design experiments and procedures. AI models built for specialized tasks have been used successfully in a number of scientific fields, such as biology, meteorology and mathematics. But large, general-purpose models tend to make things up and answer questions even when they don’t have access to the data needed to form a correct answer. That can be a nuisance when searching for holiday destinations or recipes, but it could be fatal when designing a chemistry experiment.
To investigate the risks, Xiangliang Zhang and her colleagues at the University of Notre Dame in Indiana created a test called LabSafety Bench, which measures whether an AI model can identify potential hazards and harmful consequences. It includes 765 multiple-choice questions and 404 illustrated laboratory scenarios that may involve safety issues.
The team tested 19 state-of-the-art large language models (LLMs) and vision language models on LabSafety Bench and found that none scored more than 70 percent accuracy overall. In the multiple-choice tests, some models, such as Vicuna, scored little better than random guessing, while GPT-4o reached 86.55 percent accuracy and DeepSeek-R1 reached 84.49 percent. When tested with images, some models, such as InstructBlip-7B, scored less than 30 percent accuracy.
Zhang is optimistic about the future of AI in science, even in so-called autonomous laboratories where robots work on their own, but she says the models are not yet ready to design experiments. “Now? In the lab? I don’t think so. They’ve often been trained for general-purpose tasks: rewriting an email, polishing a paper or summarizing a research paper. They do a pretty good job at that kind of task. [But] they don’t have the domain knowledge about these [laboratory] risks.”
“We welcome research that helps make AI in science safe and reliable, especially in high-risk laboratory environments,” says an OpenAI spokesperson, noting that the researchers did not test its newest model. “GPT-5.2 is our most capable science model to date, with much more robust reasoning, planning and error detection than the model discussed in this paper, to better support researchers. It is designed to accelerate scientific work while humans and existing safety systems remain responsible for safety-critical decisions.”
Google, DeepSeek, Meta, Mistral, and Anthropic did not respond to a request for comment.
AI models can be invaluable when used to help humans design new experiments, but there are risks and humans need to be kept in the loop, says Alan Tucker at Brunel University in London. “The behavior of these [LLMs] is certainly not well understood in any typical scientific sense,” he says. “I think the new class of AI that mimics language, and not much else, is clearly being used in inappropriate settings because people trust it too much. There is already evidence that humans are starting to sit back and let AI do the hard work, but without proper scrutiny.”
Craig Merlik at the University of California, Los Angeles, says he has run a simple test in recent years, asking AI models what to do if you spill sulfuric acid on yourself. The correct answer is to rinse with water, but Merlik says the AI always used to warn against that, incorrectly applying the unrelated advice about never adding water to acid during experiments because of heat build-up. In recent months, however, he says the models have begun to give the correct answer.
Merlik says instilling good safety practices on campuses is vital, because there is a constant influx of new students with little experience. But he is less pessimistic than other researchers about the place of artificial intelligence in experimental design.
“Are they worse than humans? It’s one thing to criticize all these large language models, but they haven’t been tested against a representative group of humans,” Merlik says. “There are humans who are very careful and there are humans who are not. It is possible that large language models are better than a certain percentage of novice graduates, or even experienced researchers. Another factor is that large language models get better every month, so the numbers in this paper will likely be completely out of date in another six months.”