Presentation on an mLLM-Based OCR System at the Digital Digesta Workshop

2/8/2026

At the recently held “Digital Digesta Workshop”, Naoya Iwata from the Center for Digital Humanities and Social Sciences at Nagoya University (and the National Institute of Informatics) delivered a presentation titled “Toward Digital Digesta: An mLLM-Based OCR System for Legal Code Digitization,” sharing the latest technological advancements from the Humanitext project.

Overcoming Challenges in Digitizing Classical Texts

Western classical texts such as the Digesta (Pandects) are known for complex layouts that combine intricate footnotes, marginal line numbers, and multi-column formatting. Traditional OCR and HTR engines such as Tesseract and Transkribus are widely adopted in the humanities, but they often struggle to separate body text from layout elements and to preserve structure. This is a significant bottleneck for scholars attempting to create clean, machine-readable corpora.

The mLLM-Driven OCR Revolution

To address these limitations, Iwata and the research team turned to the emerging capabilities of multimodal Large Language Models (mLLMs), which can jointly process images and text. The presentation introduced the new Humanitext OCR system, built on Google’s Gemini 2.5 Flash.

The key feature of this system is its ability to follow natural language instructions. Prompting the model with commands like “remove footnotes and extract body text only” or “output in JSON format” removes the need for complex, rigid layout-parsing algorithms.
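As an illustration of instruction-driven OCR, the following minimal Python sketch assembles a prompt from the two example commands and parses a JSON reply. It is not the Humanitext implementation: the function names, the `{"body": ...}` reply shape, and the `model` callable (a stand-in for whatever client sends the prompt and page image to an mLLM) are all assumptions made for this example.

```python
import json
from typing import Callable

# Hypothetical type: any function that takes (prompt, image bytes) and
# returns the model's raw text reply. A real client would wrap an mLLM API.
Model = Callable[[str, bytes], str]


def build_prompt(remove_footnotes: bool = True, as_json: bool = True) -> str:
    """Assemble natural-language OCR instructions (wording is illustrative)."""
    parts = ["Transcribe the text on this page image."]
    if remove_footnotes:
        parts.append("Remove footnotes and extract body text only.")
    if as_json:
        # The {"body": ...} schema is an assumption, not a documented format.
        parts.append('Output in JSON format as {"body": "<text>"}.')
    return " ".join(parts)


def ocr_page(image_bytes: bytes, model: Model) -> str:
    """Send the instructions and the image to the model; return the body text."""
    raw_reply = model(build_prompt(), image_bytes)
    return json.loads(raw_reply)["body"]


def fake_model(prompt: str, image_bytes: bytes) -> str:
    """Stub used only to make the sketch runnable without a network call."""
    return json.dumps({"body": "lorem ipsum dolor"})


if __name__ == "__main__":
    print(ocr_page(b"", fake_model))
```

The layout handling lives entirely in the prompt text, so changing what is extracted means editing a sentence rather than a parsing routine.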

In evaluation tests using a Latin corpus (Manilius, Astronomica), first-pass word accuracy exceeded 99%. After an auto-correction step using the same model, the Word Error Rate (WER) fell to approximately 0.07%, or roughly 7 word errors per 10,000 words. Leakage of footnotes into the main text and symbol recognition errors were also substantially reduced, suggesting that the system is ready for demanding scholarly applications.
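For readers unfamiliar with the metric, WER is conventionally the word-level edit distance (substitutions, insertions, and deletions) between the OCR output and a reference transcription, divided by the number of reference words. The sketch below shows that standard definition in Python. The presentation summary does not state how tokenization or normalization was handled, so the whitespace splitting used here is an assumption.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref_words)


if __name__ == "__main__":
    # One substituted word out of four reference words gives a WER of 0.25.
    print(word_error_rate("a b c d", "a x c d"))
```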

A Highly Accurate, No-Code Tool for Researchers

Designed with accessibility in mind, Humanitext OCR offers a user-friendly, no-code web interface. The service is currently planned to be free to use for up to 20 pages per day. Furthermore, the operational cost remains extremely low, estimated at about $0.005 per page, which suggests that mLLM-based processing is not only highly accurate but also economically viable for large-scale digitization.
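To put these figures in perspective, the short Python sketch below computes what a digitization job would imply under the two numbers quoted above. The per-page cost and the daily free allowance come from the presentation. The function names and the 1,000-page example are illustrative assumptions, and actual costs will depend on page content and model pricing.

```python
import math

COST_PER_PAGE_USD = 0.005   # estimated operational cost quoted above
FREE_PAGES_PER_DAY = 20     # planned free allowance quoted above


def estimated_cost_usd(pages: int) -> float:
    """Estimated operational cost of processing the given number of pages."""
    return pages * COST_PER_PAGE_USD


def days_on_free_tier(pages: int) -> int:
    """Days needed to process the pages using only the daily free allowance."""
    return math.ceil(pages / FREE_PAGES_PER_DAY)


if __name__ == "__main__":
    pages = 1000  # hypothetical job size
    print(f"{pages} pages: about ${estimated_cost_usd(pages):.2f} in operational cost")
    print(f"{pages} pages: {days_on_free_tier(pages)} days on the free allowance")
```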

Future Outlook

The bulk OCR processing of the Digesta is currently underway. Looking ahead, the Humanitext team plans to implement TEI/XML output with structural markup (such as divisions, page breaks, and line breaks), alongside Named Entity Recognition (NER) tagging for persons, places, and legal terminology. The Humanitext project continues to move forward in its mission to establish a comprehensive, machine-readable edition of vital historical legal texts.
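As a rough picture of what such structural markup might look like, the following Python sketch builds a small fragment using the standard TEI elements `div`, `pb`, and `lb` with the standard-library `xml.etree.ElementTree` module. The planned Humanitext schema has not been published in this summary, so the attribute choices, the function name, and the placeholder text are assumptions for illustration only. NER tagging is omitted.

```python
import xml.etree.ElementTree as ET
from typing import List


def build_tei_fragment(division_n: str, page_n: str, lines: List[str]) -> str:
    """Wrap OCR lines in a TEI-style division with page and line breaks."""
    div = ET.Element("div", {"type": "book", "n": division_n})
    ET.SubElement(div, "pb", {"n": page_n})  # page-break milestone
    paragraph = ET.SubElement(div, "p")
    for number, line in enumerate(lines, start=1):
        line_break = ET.SubElement(paragraph, "lb", {"n": str(number)})
        # In TEI, <lb/> marks the start of a line; the line's text follows it.
        line_break.tail = line
    return ET.tostring(div, encoding="unicode")


if __name__ == "__main__":
    print(build_tei_fragment("1", "12", ["first line", "second line"]))
```

Because `pb` and `lb` are empty milestone elements, the running text stays contiguous inside the paragraph while the physical page layout remains recoverable.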