News

Over 32,000 medieval manuscripts transcribed in four months using AI

Medievalists can now access automated transcriptions of 32,763 digitised medieval manuscripts, produced in just four months as part of a project called CoMMA—a large-scale corpus designed to make manuscript texts searchable and analysable at a scale that would be impossible to tackle by hand.

The work was carried out by researchers in computational humanities at Inria (Institut national de recherche en sciences et technologies du numérique), working with partners in France and Switzerland. Inria is France’s national institute for research in digital science and technology, supporting research in areas such as computer science and applied mathematics through centres and teams across the country.

Why medieval manuscripts have been so difficult to automate

France, Paris. Bibliothèque Sainte-Geneviève, Ms. 10 fol. 1v

Transcribing medieval handwriting is hard enough at the level of a single codex. At scale, the problem becomes even more complex because medieval writing does not conform to modern expectations about spelling, punctuation, or consistent letterforms. As the researchers note, in the Middle Ages most European vernaculars were still evolving, spelling was not standardised, new letterforms emerged, and manuscript pages were filled with abbreviations, symbols, and annotations.

“When it came to transcribing medieval manuscripts, individual specialists went about it in their own way,” explains Thibault Clérice, a computational humanities researcher within the ALMANACH project team at Inria. “But automating manuscript transcription requires machine learning, and for this you need standards.”

CATMuS: building a standardised training corpus

To tackle this, a first initiative—CATMuS—was launched in 2022 at the initiative of Ariane Pinche, a CNRS research fellow in medieval studies and digital humanities. The team led by Pinche and Clérice aimed to create a large, uniform dataset that could serve as reliable training material.

The researchers first collected 300 medieval manuscripts that were already transcribed or partially transcribed—about 200,000 lines in total—and brought them into alignment using well-established standards intended to respect medieval spelling and abbreviation practices.

“The documents in question ranged from the 8th to the 16th centuries and were written in a dozen or so different languages – mostly in Old French and Latin, but also in Spanish languages, Italian, Venetian, Dutch, and so on,” Clérice explains.

With this standardised corpus in place, the project could train a handwriting text recognition model. The approach relies on transcription tools developed at EPHE–Université PSL, including eScriptorium and Kraken—a workflow designed for managing transcription campaigns and using Kraken as a segmentation/transcription engine.

Inria’s description emphasises two practical priorities: efficiency and restraint. The model is presented as energy-efficient, and—importantly—more focused on recognising what is on the page than on attempting broad language “understanding,” which can lead to confident but wrong extrapolations when dealing with medieval spelling, abbreviations, or unusual hands.

CoMMA: applying the model at massive scale

Bibliothèque Sainte-Geneviève, Ms. 10 fol. 1v in CoMMA

CATMuS produced the foundation, but the team wanted to use the trained model for real-world, high-volume transcription. “After spending more than two years collecting and transcribing manuscripts and then training the model, all we wanted to do was to put it to use ourselves!” recalls Clérice.

That next phase became CoMMA (Corpus of Multilingual Medieval Archives), launched in 2024 with an expanded team. Clérice continued to manage modelling and computation, joined by Benoît Sagot (head of the Almanach team), while Hassen Aguili contributed to the interface side.

A key enabler was access to large catalogues of digitised manuscripts. The team turned to Biblissima+, which aggregates links to digitised manuscript holdings and metadata from multiple institutions, including the Bibliothèque nationale de France. With that infrastructure in place, they could scale up from hundreds to a total of 32,763 manuscripts, mostly in Old French and Latin, which we transcribed in four months.

Inria explains that the system combines two components: one algorithm identifies and separates different elements of the manuscript page (main text, notes, illustrations, and so on), while the other—developed during CATMuS—handles transcription of the text itself.

Checking accuracy and understanding the limits

The team manually checked three consecutive lines in 670 manuscripts, and found an error rate of 9.7%. Some errors arise when manuscripts were older than the material used for training; others come from the difficulty of recognising some hands, especially when scribes use cursive scripts.

A paper outlining the process and its limitations is on the way and the team hasn’t ruled out further reducing the model’s error rate. “As long as it’s worthwhile”, says Thibault Clérice. “Doubling the processing time just to reduce the error rate by 1% wouldn’t really be worth it.”

For Clérice, the broader point is also methodological: this kind of work depends on medievalists and technologists working together. In the project’s wording: “Digital expertise alone would not have allowed us to understand as well the manuscripts we were processing and the processes that needed to be applied to them.”

What can medievalists do with CoMMA?

CoMMA is framed as more than a technical demonstration. Because the corpus is intended to be searchable and consistently transcribed (while still respecting original spellings and abbreviations), it offers a platform for large-scale study of medieval writing habits, layout practices, abbreviation systems, and linguistic variation.

For example the researchers point to a huge growth to the number of pseudo-words – groups of characters – that have been recorded. Previously, the largest corpus of manuscripts transcribed in Old French contained 11 million pseudo-words, while CoMMA has reached 516 million. For Latin, it has gone 226 million words to 2.7 billion.

Elena Pierazzo, professor of digital humanities at the University of Tours, is also excited about the possibilities on offer. “This corpus will change how we process textual data,” she says, “having such a vast quantity of data that respects original spelling and abbreviations opens up all sorts of avenues for studying writing habits. CoMMA will help us to understand the evolution of languages, including dialects, through the use of statistical data. This corpus also shines a spotlight on texts previously overlooked by researchers which are now easy to access by searching either by period or by theme.”

A cross-disciplinary tool—and plans to expand

The CoMMA website – https://comma.inria.fr/homepage

From a digital perspective, CoMMA also shifts what is possible for AI research on pre-modern sources. The corpus can now be used to train AI customised for the analysis of ancient texts, something which was previously impossible owing to insufficient data. And as Elena Pierazzo is keen to emphasise: “CoMMA will reshape the borders between disciplines within the humanities. Specialists in the history of art, medicine or philosophy whose paths would previously never have crossed will now be able to work together using this cross-disciplinary tool, which covers practically all of the knowledge there is available on the Middle Ages in Old French and Latin.”

The team is already looking beyond Old French and Latin. Plans are in place to open the corpus up to other languages by obtaining new texts from Biblissima+. “There is no reason why Spanish or Italian languages, and the researchers studying them, can’t take advantage of transcriptions from our model”, concludes Thibault Clérice. For medievalists, the next wave of discoveries may come not from finding new manuscripts, but from finally being able to search and compare them at scale.

Click here to access CoMMA