Artificial intelligence could one day replace teachers, but its reliability still lags behind

Artificial intelligence has become an integral part of daily life for many people. Large language models (LLMs) such as ChatGPT, Gemini or Copilot write letters and term papers for them, provide tips for vacation trips or answer questions on every conceivable topic.

The use of artificial intelligence has also become routine at universities in many areas. To what extent can large language models support students in the natural sciences as unsupervised tutors? A research team at Julius-Maximilians-Universität Würzburg (JMU) has investigated this question. The team's results are published on the arXiv preprint server.

A freely available evaluation tool

The physical chemistry research group, which has so far conducted research on the spectroscopy of nanoparticles, has developed a tool that tests the thermodynamic understanding of modern LLMs, in particular whether their skills go beyond mere factual knowledge. The tool, called UTQA (Undergraduate Thermodynamics Question Answering), is freely accessible and aims to support teachers and researchers in evaluating LLMs in a fair and subject-specific way and in making progress measurable.

“Our wish is that artificial intelligence will one day be able to support us as an unsupervised partner in teaching, for example in the form of specialized chatbots that respond individually to the needs of each student when preparing for and following up on lectures,” says project leader Tobias Hertel.

“With UTQA, we show where current language models are already convincing and where they fail systematically. This is exactly what lecturers need so that they can plan their use in teaching responsibly.”

Teaching

Hertel's team has been using LLMs in a thermodynamics lecture with more than 150 students for weekly knowledge checks since the 2023 winter semester. Models such as ChatGPT-3.5 and ChatGPT-4 have shown their strengths, but also clear weaknesses.

This led to the desire for a dedicated benchmark. “UTQA thus comprises 50 challenging single-choice tasks from the basic thermodynamics lecture, a third of them with diagrams and sketches, as is typical of tutorial exercises,” Hertel explained.

The goal was to test not only factual knowledge and definitions, but also the ability of language models to combine different boundary conditions in a targeted manner and to understand the sequence of complex processes.

Results: solid, but not (yet) reliable enough

According to Hertel, the best-performing models tested in 2025 paint a clear picture: on UTQA, none reaches the 95% success rate that the research group requires for unsupervised AI tutoring. Even the GPT-o3 model, which leads in many benchmarks, achieved only 82% overall accuracy.

“Two weaknesses stood out,” the scientist says. “First, the models consistently struggled with so-called irreversible processes, in which the speed of the change of state affects the result. Second, there was a clear deficit in tasks that require image interpretation.”

A look back at history shows that this is not surprising. About 100 years ago, the French physicist Pierre Duhem already described the phenomenon of irreversibility as one of the most difficult in thermodynamics. That LLMs have problems interpreting graphs is not surprising either, because visualizing and processing visual content is one of the distinctive cognitive strengths of humans.

Not yet good enough for unsupervised use

“In practice, this means that LLMs can already be helpful in supervised teaching, but they are not yet good enough to be used as unsupervised tutors,” says Hertel. “At the same time, we have seen tremendous progress in the past two years. So we are confident that, provided development does not stop suddenly, the expertise required for teaching assistants in our field can be achieved soon.”

Hertel is especially pleased that two student teachers played a major part in the research project, contributing their perspectives from subject didactics. Luca-Sophie Bien created a preliminary German version of many of the tasks; Anna Geißler translated and expanded the set for international use.

Why thermodynamics?

According to Hertel, thermodynamics is ideal for testing the competence of models and their ability to reason.

“It is essential to our understanding of nature, and it has few basic laws, but in application it requires careful distinction between state and process variables, heat and work, reversible and irreversible processes. This is precisely where reasoning ability is separated from mere memorization.”

As a next step, the team now plans to expand the tool to include real gases, mixtures, phase diagrams and standard cycles. The goal is to cover more concepts that are essential in teaching.

“The better models can handle multimodal tasks, that is, a mixture of text and images, as well as irreversible scenarios, the closer we come to reliable AI tutoring,” says Hertel.

More information:
Anna Geißler et al, From Canonical to Complex: Benchmarking LLM Capabilities in Undergraduate Thermodynamics, arXiv (2025). DOI: 10.48550/arXiv.2508.21452

Journal information:
arXiv


Provided by Julius-Maximilians-Universität Würzburg


Citation: Artificial intelligence could one day replace teachers, but its reliability still lags behind (2025, September 6).

This document is subject to copyright. Regardless of any fair dealing for the purpose of study or private research, no part may be reproduced without written permission. The content is provided for information purposes only.
