Supervisors: Florian Kleber
Status: open
Before digital indexing any documentation in Vienna was stored on handwritten papers (little use of typewriters before 1930s). The University Library (Former Central Library of Austria-Hungary) has
– documents over 250 years of science history
These are currently inaccessible due to size and lack of indexing, and thus ~ 1.14 million books are invisible to the public.
The thesis has 3 major tasks:
- Index matching based on existing OCR text available from Transkribus (Cleanup, structure OCR text)
- “Older” Index matching. Older index have mapping written on top which requires separation from underlying text + OCR
- Keyword Cards. March Cards using the previous information
The thesis is in cooperation with Thomas Kohlwein, Department of German Studies.
The research consists of
- Literature Review – getting to know the methods
- Implementation & Evaluation
- Evaluate state-of-the-art methods on the provided datasets
- Develop and apply yourprocessing pipeline for text layer segmentation, OCR and Matching
- Comparison and thorough evaluation (e.g., improvement of CER/WER)
- Written Thesis and final presentation
- Summarize your work in a publication (optionally).
Helpful experience
- Python
- Good understanding of deep learning
- Machine Learning frameworks (preferably PyTorch)
- Interest in deep learning, document analysis, historical documents and/or handwritten text recognition