Things I keep returning to

tools datasets demos infrastructures unfinished obsessions

The cards move because the work does too: between archives, models, images, evaluation, and strange documents that do not behave.

Subject waves over time

A quick visual map of the curated papers below. The lines show when the main research subjects become more visible across years; it is a readable map, not a full bibliography.

Figure: Subject waves over time. A line-based visual map of selected papers grouped by main research themes over publication years (axis: 2012, 2014, 2019, 2020, 2021, 2022, 2023, 2024, 2025, 2026). Marked years per theme:

  • Noisy Documents: 2020, 2022, 2024
  • Semantic Extraction: 2014, 2021, 2022
  • Historical AI: 2023, 2025, 2026
  • Cultural Heritage: 2024, 2026
  • Multimodal Archives: 2012, 2019, 2021
  • Applied Document Intelligence: 2020, 2021, 2023

Noisy Documents

I keep returning to documents that arrive already damaged: OCR errors, HTR errors, broken segmentation, missing accents, strange line breaks, degraded scans, and transcripts that look clean until a model has to reason over them.

OCR · HTR · post-correction · evaluation · robustness

What this contains

  • Measuring how OCR/HTR noise changes downstream semantic tasks, not only character accuracy.
  • Designing post-correction and robustness experiments for historical and degraded text.
  • Testing whether models remain reliable when the input is damaged, fragmented, or historically unstable.

Selected high-impact / most relevant papers

Semantic Extraction

I work on extracting the small structures that make collections searchable and arguable: people, places, organizations, events, relations, roles, mentions, links, and evidence. I care less about clean labels and more about whether the extracted structure survives noisy documents and real use.

NER · entity linking · relations · events · evidence

What this contains

  • Named entity recognition, entity linking, relation extraction, and event extraction.
  • Entity-centered ways of making large document collections searchable, comparable, and arguable.
  • Evaluation setups where semantic structure matters more than a clean-looking label.

Selected high-impact / most relevant papers

Historical AI

Historical language is not stable. Names move. Borders move. Political words change meaning. A system that treats all periods as one flat present will make confident mistakes. I work on temporal modelling, long-horizon representations, and evaluation setups that make these failures visible.

temporality · diachronic language · historical collections · long horizon

What this contains

  • Temporal modelling for documents where names, roles, borders, and meanings do not stay fixed.
  • Experiments with temporal knowledge injection, temporal fusion, and long-horizon representations.
  • Critiques of fluent AI answers when evidence is weak, noisy, or historically misplaced.

Selected high-impact / most relevant papers

Cultural Heritage

I work with cultural heritage as data, evidence, memory, and conflict. This includes digital epigraphy, Armenian and Ukrainian inscriptions, structured vocabularies, cultural weaponization, contested narratives, and computational methods that must remain accountable to domain experts.

epigraphy · heritage · SKOS · EpiDoc/TEI · memory

What this contains

  • Structured vocabularies and computational methods for underrepresented cultural heritage material.
  • Digital epigraphy, inscription corpora, and standards-oriented encoding.
  • Computational analysis of cultural heritage narratives, memory, and contested heritage.

Selected high-impact / most relevant papers

Multimodal Archives

Documents are not only text. Layout, typography, images, tables, margins, page structure, photographs, and visual noise often carry the evidence. I am interested in models that connect these signals without pretending that the page is just a bag of words.

layout · vision-language · document images · photographs · visual evidence

What this contains

  • Document images, page structure, layout, photographs, and visual evidence.
  • Connections between image processing, document analysis, and semantic modelling.
  • Work that treats the page as a visual object, not only as text after OCR.

Selected high-impact / most relevant papers

Applied Document Intelligence

The same problems reappear outside archives: forged receipts, insurance claims, fake news, epidemic monitoring, emergency events, administrative records, and production workflows where models must be robust enough to be useful and transparent enough to be questioned.

fraud detection · misinformation · event monitoring · pipelines · deployment

What this contains

  • Document fraud detection, forged receipts, insurance claims, and applied document reasoning.
  • Multilingual epidemic monitoring and emergency event detection.
  • Fake news, misinformation, and workflows where models have to survive real constraints.

Selected high-impact / most relevant papers
