Project map
The cards move because the work does too: between archives, models, images, evaluation, and strange documents that do not behave.
- Noisy Documents: OCR, HTR, damaged text, digitization errors, and semantic damage.
- Semantic Extraction: Entities, events, relations, linking, and structured access to collections.
- Historical AI: Time-aware models, historical language, multilingual collections, unstable meanings.
- Cultural Heritage: Epigraphy, contested memory, endangered heritage, structured vocabularies.
- Multimodal Archives: Text, image, layout, typography, photographs, and visual evidence.
- Applied Document Intelligence: Fraud, misinformation, emergency events, and production-scale pipelines.
Subject waves over time
A quick visual map of the curated papers below. The lines show when the main research subjects become more visible across years; it is a readable map, not a full bibliography.
Noisy Documents
I keep returning to documents that arrive already damaged: OCR errors, HTR errors, broken segmentation, missing accents, strange line breaks, degraded scans, and transcripts that look clean until a model has to reason over them.
What this contains
- Measuring how OCR/HTR noise changes downstream semantic tasks, not only character accuracy.
- Designing post-correction and robustness experiments for historical and degraded text.
- Testing whether models remain reliable when the input is damaged, fragmented, or historically unstable.
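As a rough illustration of the first point, here is a minimal sketch of how character accuracy and a downstream task can diverge. Everything in it is an assumption made for illustration: the example sentence and names are invented, `entity_recall` is a toy exact-match lookup rather than a real NER system, and none of it is taken from the papers listed below.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance normalised by reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)


def entity_recall(text: str, gold_entities: list[str]) -> float:
    """Toy downstream task: share of gold entity strings found verbatim in the text."""
    if not gold_entities:
        return 1.0
    found = sum(1 for entity in gold_entities if entity in text)
    return found / len(gold_entities)


# Invented example: one OCR-style confusion ("u" read as "n") inside a place name.
clean = "Marie Duval wrote from Strasbourg in 1871."
noisy = "Marie Duval wrote from Strasbonrg in 1871."
gold = ["Marie Duval", "Strasbourg"]

print(f"CER: {cer(clean, noisy):.3f}")
print(f"Entity recall on clean text: {entity_recall(clean, gold):.2f}")
print(f"Entity recall on noisy text: {entity_recall(noisy, gold):.2f}")
```

A single substituted character keeps the character error rate low, yet the toy lookup loses one of the two entities. A real experiment would replace the lookup with an actual model and the invented sentence with aligned clean and noisy transcriptions.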
Selected high-impact / most relevant papers
- Investigating OCR-Sensitive Neurons to Improve Entity Recognition in Historical Documents (ICADL 2024)
- Assessing the Impact of OCR Noise on Multilingual Event Detection over Digitised Documents (IJDL 2022)
- Alleviating Digitization Errors in Named Entity Recognition for Historical Documents (CoNLL 2020)
- A Comparison of Sequential and Combined Approaches for Named Entity Recognition in a Corpus of Handwritten Medieval Charters (ICFHR 2020)
Semantic Extraction
I work on extracting the small structures that make collections searchable and arguable: people, places, organizations, events, relations, roles, mentions, links, and evidence. I care less about clean labels and more about whether the extracted structure survives noisy documents and real use.
What this contains
- Named entity recognition, entity linking, relation extraction, and event extraction.
- Entity-centered ways of making large document collections searchable, comparable, and arguable.
- Evaluation setups where semantic structure matters more than a clean-looking label.
Selected high-impact / most relevant papers
- A Multilingual Dataset for Named Entity Recognition, Entity Linking and Stance Detection in Historical Newspapers (SIGIR 2021)
- MELHISSA: A Multilingual Entity Linking Architecture for Historical Press Articles (IJDL 2021)
- Exploring Entities in Event Detection as Question Answering (ECIR 2022)
- Event Detection with Entity Markers (ECIR 2021)
- Event Role Extraction using Domain-Relevant Word Representations (EMNLP 2014)
Historical AI
Historical language is not stable. Names move. Borders move. Political words change meaning. A system that treats all periods as one flat present will make confident mistakes. I work on temporal modelling, long-horizon representations, and evaluation setups that make these failures visible.
What this contains
- Temporal modelling for documents where names, roles, borders, and meanings do not stay fixed.
- Experiments with temporal knowledge injection, temporal fusion, and long-horizon representations.
- Critiques of fluent AI answers when evidence is weak, noisy, or historically misplaced.
Selected high-impact / most relevant papers
- Apertus: Democratizing Open and Compliant LLMs for Global Language Environments (ACL 2026)
- A Study of Temporal Fusion Strategies for Named Entity Recognition in Historical Texts (TPDL 2025)
- Injecting Temporal-Aware Knowledge in Historical Named Entity Recognition (ECIR 2023)
- Yes but.. Can ChatGPT Identify Entities in Historical Documents? (JCDL 2023)
Cultural Heritage
I work with cultural heritage as data, evidence, memory, and conflict. This includes digital epigraphy, Armenian and Ukrainian inscriptions, structured vocabularies, cultural weaponization, contested narratives, and computational methods that must remain accountable to domain experts.
What this contains
- Structured vocabularies and computational methods for underrepresented cultural heritage material.
- Digital epigraphy, inscription corpora, and standards-oriented encoding.
- Computational analysis of cultural heritage narratives, memory, and contested heritage.
Selected high-impact / most relevant papers
- A Dataset of Cultural Heritage Manipulation on English Wikipedia in the Russo–Ukrainian Context (SIGIR 2026)
- Building a SKOS Vocabulary for Armenian Epigraphy: A Corpus-Driven and NLP-Assisted Approach (LaTeCH-CLfL 2026)
- Armenian AutoEpiDoc: Automated Extraction and Encoding of Armenian Inscriptions into EpiDoc TEI/XML (LaTeCH-CLfL 2026)
- Developing a Standardised Vocabulary for Ukrainian Epigraphy and Expanding Digital Epigraphic Resources (TPDL 2024)
Multimodal Archives
Documents are not only text. Layout, typography, images, tables, margins, page structure, photographs, and visual noise often carry the evidence. I am interested in models that connect these signals without pretending that the page is just a bag of words.
What this contains
- Document images, page structure, layout, photographs, and visual evidence.
- Connections between image processing, document analysis, and semantic modelling.
- Work that treats the page as a visual object, not only as text after OCR.
Selected high-impact / most relevant papers
- Integrated Interdisciplinary Workflows for Research on Historical Newspapers (JASIST 2021)
- Automatic Page Classification in a Large Collection of Manuscripts Based on the International Image Interoperability Framework (ICDAR 2019)
- Targeting a Practical Approach for Robot Vision with Ensembles of Visual Features (ICCSW 2012)
Applied Document Intelligence
The same problems reappear outside archives: forged receipts, insurance claims, fake news, epidemic monitoring, emergency events, administrative records, and production workflows where models must be robust enough to be useful and transparent enough to be questioned.
What this contains
- Document fraud detection, forged receipts, insurance claims, and applied document reasoning.
- Multilingual epidemic monitoring and emergency event detection.
- Fake news, misinformation, and workflows where models have to survive real constraints.
Selected high-impact / most relevant papers
- Detecting Forged Receipts with Domain-Specific Ontology-Based Entities & Relations (ICDAR 2023)
- Receipt Dataset for Document Forgery Detection (ICDAR 2023)
- Analyzing the Impact of Tokenization on Multilingual Epidemic Surveillance in Low-Resource Languages (ICDAR 2023)
- Multilingual Epidemiological Text Classification: A Comparative Study (COLING 2020)
- Token-Level Multilingual Epidemic Dataset for Event Extraction (TPDL 2021)