Why extracting data from PDFs is still a nightmare for data experts

However, these promotional claims do not always match real-world performance, according to recent tests. “I usually admire Mistral's models, but the new OCR model they released last week performed really poorly,” Willis noted.

“A colleague sent this PDF and asked if I could help him parse the table it contained,” says Willis. “It's an old document with a table that has some complex layout elements. The new [Mistral] OCR-specific model performed really poorly, repeating the names of cities and getting many of the numbers wrong.”

An AI app developer also pointed out on X a flaw in Mistral OCR's ability to understand handwriting, writing, “Unfortunately, Mistral-OCR still has the usual VLM curse: with difficult manuscripts, it hallucinates completely.”

According to Willis, Google currently leads the field in AI models that can read documents: “For me, the clear leader is Google's Gemini 2.0 Flash Pro Experimental. It handled the PDF that Mistral could not, with only a few errors, and I have run multiple PDFs through it with success, including ones with handwritten content.”

Gemini's performance stems largely from its ability to process extensive documents (in a type of short-term memory called a “context window”), which Willis specifically notes as a key advantage: “The size of the context window also helps, since I can upload large documents and work through them in parts.” This capability, along with more robust handling of handwritten content, appears to give Google's model a practical edge over competitors in real-world document-processing tasks for now.
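Working through a large document “in parts” can be sketched roughly as follows. This is a minimal illustration, not Willis's workflow or any Gemini API: the function names `estimate_tokens` and `chunk_pages`, the token budget, and the rough four-characters-per-token estimate are all assumptions made for the example.

```python
from typing import Iterable, List


def estimate_tokens(text: str) -> int:
    """Very rough token estimate (assumption: about 4 characters per token)."""
    return max(1, len(text) // 4)


def chunk_pages(pages: Iterable[str], max_tokens: int) -> List[List[str]]:
    """Group consecutive pages into chunks that each fit a token budget.

    A single page larger than the budget is placed in its own chunk
    rather than being split.
    """
    chunks: List[List[str]] = []
    current: List[str] = []
    used = 0
    for page in pages:
        cost = estimate_tokens(page)
        # Start a new chunk when adding this page would exceed the budget.
        if current and used + cost > max_tokens:
            chunks.append(current)
            current, used = [], 0
        current.append(page)
        used += cost
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    pages = ["a" * 400, "b" * 400, "c" * 400]  # about 100 estimated tokens each
    for i, chunk in enumerate(chunk_pages(pages, max_tokens=250), start=1):
        print(f"chunk {i}: {len(chunk)} page(s)")
```

A larger context window simply means a larger budget per chunk, so fewer chunks are needed and less context is lost at the boundaries between them.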

The drawbacks of LLM-based OCR

Despite their promise, LLMs introduce several new problems to document processing. Among them, they can introduce confabulations or hallucinations (plausible-sounding but incorrect information), accidentally follow instructions in the text (mistaking them for part of the user's prompt), or simply misinterpret the data.
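Some of these failures, such as the repeated city names and mangled numbers Willis describes, can be flagged mechanically before a person reviews the output. The following is a minimal sketch, assuming the extracted table arrives as a list of rows whose first cell is a label and whose remaining cells should be numeric. The function names `is_number` and `flag_suspect_rows` are hypothetical, and the check cannot catch a number that is plausible but wrong.

```python
from collections import Counter
from typing import List


def is_number(cell: str) -> bool:
    """Return True if the cell parses as a number once thousands separators are removed."""
    try:
        float(cell.replace(",", ""))
        return True
    except ValueError:
        return False


def flag_suspect_rows(rows: List[List[str]]) -> List[str]:
    """Return human-readable warnings for repeated labels and non-numeric cells."""
    warnings: List[str] = []
    # Repeated first-column labels can indicate the model duplicated a row.
    counts = Counter(row[0] for row in rows if row)
    for label, n in counts.items():
        if n > 1:
            warnings.append(f"label {label!r} appears {n} times")
    # Cells that should be numeric but do not parse are likely misreads.
    for i, row in enumerate(rows):
        for cell in row[1:]:
            if not is_number(cell):
                warnings.append(f"row {i}: {cell!r} is not a number")
    return warnings
```

A check like this only narrows down where a human should look; comparing the flagged rows against the original scan remains necessary.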
