Exhibition catalogues contain information such as accompanying texts, the names of the paintings, and the owners of the paintings. The example image on this page shows a catalogue page comprising consecutive numbers, the names of the paintings and their owners, and the hall number.
The main goal of the master's thesis is to adapt the available layout analysis and OCR modules to automatically extract the catalogue information. The extracted information will be stored in text files or a database. The work will be carried out in cooperation with the University of Vienna in the context of the FWF project “Exhibitions of Modern European Painting 1905-1915”.
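To make the extraction goal concrete, the following minimal sketch turns already recognized text lines into structured records and serializes them as CSV. The line layouts (`HALL 3`, `17. Title (Owner: Name)`), the `Entry` record, and the function names are hypothetical and chosen only for illustration; real catalogue pages will need patterns derived from the actual scans.

```python
import csv
import io
import re
from dataclasses import asdict, dataclass
from typing import Iterable, List, Optional

# Hypothetical line layouts, assumed for illustration only:
#   "HALL 3"                              -> hall heading
#   "17. Still Life (Owner: A. Example)"  -> numbered entry, owner optional
HALL_RE = re.compile(r"^\s*HALL\s+(\S+)\s*$", re.IGNORECASE)
ENTRY_RE = re.compile(r"^\s*(\d+)\.\s+(.+?)(?:\s+\(Owner:\s*(.+?)\))?\s*$")


@dataclass
class Entry:
    number: int
    title: str
    owner: Optional[str]
    hall: Optional[str]


def parse_catalogue(lines: Iterable[str]) -> List[Entry]:
    """Turn recognized text lines into entries; unmatched lines are skipped."""
    entries: List[Entry] = []
    hall: Optional[str] = None
    for line in lines:
        hall_match = HALL_RE.match(line)
        if hall_match:
            # A hall heading applies to all entries that follow it.
            hall = hall_match.group(1)
            continue
        entry_match = ENTRY_RE.match(line)
        if entry_match:
            number, title, owner = entry_match.groups()
            entries.append(Entry(int(number), title, owner, hall))
    return entries


def to_csv(entries: Iterable[Entry]) -> str:
    """Serialize entries as CSV text; a missing owner becomes an empty field."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["number", "title", "owner", "hall"],
        lineterminator="\n",
    )
    writer.writeheader()
    for entry in entries:
        writer.writerow(asdict(entry))
    return buffer.getvalue()
```

Calling `to_csv(parse_catalogue(ocr_lines))` on a list of recognized lines yields text that can be written to a file or imported into a database. Keeping parsing separate from serialization means the storage target can change without touching the patterns.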
Document layout analysis deals with the layout structure of document images, segmenting a page into homogeneous image regions. Within the READ project, a framework for layout analysis is currently being developed. The layout analysis detects text regions (text lines, text blocks). Additionally, an OCR module is available, which uses the Tesseract OCR engine. Both can and should be used for the recognition of the catalogues.
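As a rough illustration of what detecting text lines can mean, the sketch below groups word bounding boxes into lines by checking whether a box's vertical center falls inside the vertical span of the current line. This is a deliberately simple heuristic under assumed inputs, not the method used by the READ framework; the `Box` type and `group_into_lines` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of a word, in pixels (assumed input format)."""
    x: int
    y: int
    w: int
    h: int

    @property
    def center_y(self) -> float:
        return self.y + self.h / 2.0


def group_into_lines(boxes: List[Box]) -> List[List[Box]]:
    """Group word boxes into text lines, each sorted left to right."""
    lines: List[List[Box]] = []
    # Visit boxes from top to bottom by vertical center.
    for box in sorted(boxes, key=lambda b: b.center_y):
        if lines:
            current = lines[-1]
            top = min(b.y for b in current)
            bottom = max(b.y + b.h for b in current)
            # Same line if the center lies within the line's vertical span.
            if top <= box.center_y <= bottom:
                current.append(box)
                continue
        lines.append([box])
    return [sorted(line, key=lambda b: b.x) for line in lines]
```

The heuristic assumes roughly horizontal, non-overlapping lines; skewed scans or multi-column pages would need deskewing or a column split first.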
An adapted layout analysis methodology will be implemented to analyze exhibition catalogues. Additionally, the OCR module will be used to recognize the text of the catalogues.
In case of success, funding by the FWF is possible (project “Exhibitions of Modern European Painting 1905-1915”).
- C++ knowledge
- Machine Learning/Computer Vision knowledge
- Ideally VU Document Analysis