Indexing Libraries – Indexing Heritage

Supervisors: Florian Kleber
Status: open

Before digital indexing, documentation in Vienna was kept on handwritten paper records (typewriters saw little use before the 1930s). The University Library (former Central Library of Austria-Hungary) has

– documents over 250 years of science history

These are currently inaccessible because of their sheer volume and the lack of indexing, so ~1.14 million books remain invisible to the public.

The thesis has 3 major tasks:

Index matching based on existing OCR text available from Transkribus (clean up and structure the OCR text)
“Older” index matching. Older indices have a mapping written on top of the text, which requires separating it from the underlying text, followed by OCR
Keyword cards. Match cards using the previous information
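
To give a feel for the matching tasks above, here is a minimal sketch of fuzzy matching between a noisy OCR line and a list of index entries. It uses only Python's standard-library difflib; the function names, the example entries and the similarity threshold are illustrative assumptions, not part of the project or of Transkribus.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def best_match(ocr_line, index_entries, threshold=0.8):
    """Return (entry, score) for the most similar index entry.

    Returns (None, score) if no entry reaches the threshold.
    The threshold value is an assumption and would need tuning on real data.
    """
    query = normalize(ocr_line)
    best_entry, best_score = None, 0.0
    for entry in index_entries:
        score = SequenceMatcher(None, query, normalize(entry)).ratio()
        if score > best_score:
            best_entry, best_score = entry, score
    if best_score < threshold:
        return None, best_score
    return best_entry, best_score


# Hypothetical index entries and an OCR line with a typical "m" -> "rn" confusion.
entries = ["Botanik", "Chemie", "Physik"]
print(best_match("Chernie", entries, threshold=0.7))
```

A character-based similarity ratio is only one possible choice; the thesis is expected to evaluate which matching strategy actually works on the historical material.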

The thesis is in cooperation with Thomas Kohlwein, Department of German Studies.

The research consists of

Literature Review – getting to know the methods
Implementation & Evaluation
- Evaluate state-of-the-art methods on the provided datasets
- Develop and apply your processing pipeline for text layer segmentation, OCR and matching
- Comparison and thorough evaluation (e.g., improvement of CER/WER)
Written Thesis and final presentation
Summarize your work in a publication (optional).
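
The evaluation metrics mentioned above, character error rate (CER) and word error rate (WER), are commonly computed as the Levenshtein edit distance between a recognized text and its ground truth, divided by the length of the ground truth. The following is a minimal sketch under that definition; the function names are illustrative, and established evaluation tools may normalize text differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance over reference length in characters."""
    # max(..., 1) avoids division by zero for an empty reference.
    return edit_distance(reference, hypothesis) / max(len(reference), 1)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: edit distance over reference length in words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / max(len(ref_words), 1)


print(cer("Wien", "Wlen"))                           # one of four characters wrong
print(wer("Universität Wien", "Universitat Wien"))   # one of two words wrong
```

An improvement of the pipeline would then show up as a lower CER/WER on the same ground-truth set.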

Helpful experience

Python
Good understanding of deep learning
Machine Learning frameworks (preferably PyTorch)
Interest in deep learning, document analysis, historical documents and/or handwritten text recognition