Indexing Libraries – Indexing Heritage

Supervisors: Florian Kleber
Status: open

Before digital indexing, documentation in Vienna was kept on handwritten paper (typewriters saw little use before the 1930s). The University Library (former Central Library of Austria-Hungary) holds documents covering over 250 years of science history.

These are currently inaccessible due to their size and the lack of indexing, so about 1.14 million books are invisible to the public.

The thesis has 3 major tasks:

  • Index matching based on existing OCR text available from Transkribus (clean up and structure the OCR text)
  • “Older” index matching. Older indexes have the mapping written on top, which requires separating it from the underlying text, followed by OCR
  • Keyword cards. Match cards using the previous information
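
As a rough illustration of what index matching on noisy OCR text can look like, the sketch below compares one OCR line against a list of index entries using a string similarity score from Python's standard library. It is only a starting point under stated assumptions, not the method expected in the thesis: the function names, the 0.8 threshold and the sample entries are made up for illustration, and a real pipeline would read the OCR text from the Transkribus export.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace so layout noise does not affect the score."""
    return " ".join(text.lower().split())


def best_match(ocr_line, index_entries, threshold=0.8):
    """Return (entry, score) for the most similar index entry.

    If no entry reaches the (assumed) threshold, return (None, best_score).
    """
    query = normalize(ocr_line)
    best_entry, best_score = None, 0.0
    for entry in index_entries:
        score = SequenceMatcher(None, query, normalize(entry)).ratio()
        if score > best_score:
            best_entry, best_score = entry, score
    if best_score < threshold:
        return None, best_score
    return best_entry, best_score


# Made-up sample entries, for illustration only.
entries = ["Goethe, Johann Wolfgang von", "Schiller, Friedrich"]
match, score = best_match("Goethe,  Johann Wolfgang v0n", entries)
print(match, round(score, 2))
```

A character-level similarity score tolerates typical OCR confusions such as "o" read as "0". Returning no match below the threshold avoids linking a card to the wrong entry.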

The thesis is in cooperation with Thomas Kohlwein, Department of German Studies.

The research consists of

  • Literature Review – getting to know the methods
  • Implementation & Evaluation
    • Evaluate state-of-the-art methods on the provided datasets
    • Develop and apply your processing pipeline for text layer segmentation, OCR and matching
    • Comparison and thorough evaluation (e.g., improvement of CER/WER)
  • Written Thesis and final presentation
  • Summarize your work in a publication (optional)
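
The evaluation step refers to CER and WER (character and word error rate). As a minimal sketch, both can be computed as the Levenshtein edit distance between a reference transcription and the OCR output, divided by the length of the reference. The function names below are illustrative assumptions, and the thesis may well use an established evaluation tool instead.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits divided by number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Two swapped characters in a four-character word count as two substitutions.
print(cer("Wien", "Wein"))  # 0.5
```

An improvement of a pipeline over a baseline is then a lower CER or WER on the same reference transcriptions. Both rates can exceed 1.0 when the OCR output contains many insertions.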

Helpful experience

  • Python
  • Good understanding of deep learning
  • Machine Learning frameworks (preferably PyTorch)
  • Interest in deep learning, document analysis, historical documents and/or handwritten text recognition