Layout Analysis and OCR of art exhibition catalogues

Status: open
Supervisors: Markus Diem, Florian Kleber, Christina Bartosch (University of Vienna)

Exhibition catalogues contain information such as accompanying texts, the names of the paintings, and the owners of the paintings. The example image on this page shows a catalogue page that lists consecutive numbers, the names of the paintings and their owners, and the hall number.


The main goal of the master's thesis is to adapt the available layout analysis and OCR modules so that the catalogue information can be extracted automatically. The extracted information will be stored in text files or a database. The work will be done in cooperation with the University of Vienna in the context of the FWF project “Exhibitions of Modern European Painting 1905-1915”.
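As a rough illustration of the storage step, the following Python sketch parses already recognized text lines into records and writes them to an SQLite database. The entry format ("number. title, owner"), the hall heading pattern, the table layout and all names (`Entry`, `parse_lines`, `store`) are assumptions made for this example; real catalogues will need patterns derived from the actual pages.

```python
import re
import sqlite3
from dataclasses import dataclass
from typing import Iterable, List, Optional

# Assumed entry format: "<number>. <title>, <owner>" (owner optional).
ENTRY_RE = re.compile(r"^\s*(\d+)\.\s+(.+?)(?:,\s*(.+))?\s*$")
# Assumed hall heading, e.g. "Hall 3" or "Saal II".
HALL_RE = re.compile(r"^\s*(?:Hall|Saal)\s+(\w+)\s*$", re.IGNORECASE)


@dataclass
class Entry:
    number: int
    title: str
    owner: Optional[str]
    hall: Optional[str]


def parse_lines(lines: Iterable[str]) -> List[Entry]:
    """Turn OCR text lines into entries; lines matching no pattern are skipped."""
    entries: List[Entry] = []
    hall: Optional[str] = None
    for line in lines:
        hall_match = HALL_RE.match(line)
        if hall_match:
            # A hall heading applies to all entries that follow it.
            hall = hall_match.group(1)
            continue
        entry_match = ENTRY_RE.match(line)
        if entry_match:
            owner = entry_match.group(3)
            entries.append(Entry(
                number=int(entry_match.group(1)),
                title=entry_match.group(2).strip(),
                owner=owner.strip() if owner else None,
                hall=hall,
            ))
    return entries


def store(entries: Iterable[Entry], conn: sqlite3.Connection) -> None:
    """Write the entries to a simple (hypothetical) table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entries "
        "(number INTEGER, title TEXT, owner TEXT, hall TEXT)"
    )
    conn.executemany(
        "INSERT INTO entries VALUES (?, ?, ?, ?)",
        [(e.number, e.title, e.owner, e.hall) for e in entries],
    )
    conn.commit()
```

Note that this simple pattern splits at the first comma, so a title that itself contains a comma would be cut short; handling such cases is part of adapting the approach to the real data.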

Document layout analysis deals with the layout structure of document images and segments a page into homogeneous image regions. Within the READ project, a framework for layout analysis is currently being developed. The layout analysis detects text regions (text lines, text blocks). Additionally, an OCR module is available that uses the Tesseract OCR engine. Both can and should be used for the recognition of the catalogues.
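To give an idea of what text line detection involves, the sketch below groups word bounding boxes into text lines by their vertical overlap. This is a deliberately simple heuristic chosen for illustration, not the method used in the READ framework; the `Box` type, the function names and the `min_overlap` threshold are assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    x: int  # left edge in pixels
    y: int  # top edge in pixels
    w: int  # width
    h: int  # height
    text: str = ""


def vertical_overlap(a: Box, b: Box) -> float:
    """Overlap of the vertical extents, relative to the smaller box height."""
    smaller = min(a.h, b.h)
    if smaller <= 0:
        return 0.0
    top = max(a.y, b.y)
    bottom = min(a.y + a.h, b.y + b.h)
    return max(0, bottom - top) / smaller


def group_into_lines(boxes: List[Box], min_overlap: float = 0.5) -> List[List[Box]]:
    """Assign each box to the first line whose last box it overlaps vertically."""
    lines: List[List[Box]] = []
    for box in sorted(boxes, key=lambda b: (b.y, b.x)):
        for line in lines:
            if vertical_overlap(line[-1], box) >= min_overlap:
                line.append(box)
                break
        else:
            # No existing line fits, so this box starts a new one.
            lines.append([box])
    for line in lines:
        line.sort(key=lambda b: b.x)  # reading order within a line
    return lines
```

A heuristic like this assumes roughly horizontal, non-skewed text; multi-column pages or skewed scans would need the more robust region detection that a full layout analysis framework provides.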

Objectives

A layout analysis methodology adapted to exhibition catalogues will be implemented. Additionally, the OCR module will be used to recognize the text of the catalogues.

Financing

On success, funding by the FWF is possible (project “Exhibitions of Modern European Painting 1905-1915”).

Requirements