DIR – Document Information Retrieval

Project Details

Funding

Fraunhofer IPK

Duration

2009/03/01 – 2013/12/31

Contact

Robert Sablatnig

Persons

Florian Kleber
Markus Diem
Angelika Garz
Stefan Fiel

The goal of this project is to analyze snippets of “manually” torn documents. Reconstructing fragmented writing materials makes it possible to retrieve and analyze the lost content. This is done for objects of cultural and historic value, or for criminal investigations, for example.
Reassembling algorithms use the shape of the fragments, the content of the fragments, or a combination of shape and content as features. If only the border regions are taken into account, the main information that could be used for reconstruction (the content printed on the snippet) is lost. Within this project, document analysis techniques are applied to calculate the following features:

  • Skew estimation of a snippet
  • Paper type (checked, lined, blank) and the frequency of the ruling
  • Text color, paper color
  • Writing type (handwritten or printed text)
  • Line segmentation (e.g. for form analysis)
  • Layout analysis
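
As an illustration of one feature from the list above, the following is a minimal sketch of how the paper type and the frequency of the ruling could be estimated. It is not the method used in this project. It assumes a grayscale snippet given as nested lists of 0 to 255 intensities, looks for a repeating pattern in the row and column darkness profiles by autocorrelation, and treats rulings in both directions as checked paper. All function names and the `min_strength` threshold are hypothetical choices made for this example.

```python
from typing import List, Optional, Sequence


def row_darkness_profile(gray: Sequence[Sequence[int]]) -> List[float]:
    """Mean darkness (255 - intensity) of every pixel row."""
    return [sum(255 - p for p in row) / len(row) for row in gray]


def estimate_ruling_period(profile: Sequence[float],
                           min_strength: float = 0.3) -> Optional[int]:
    """Return the dominant line spacing in pixels, or None if none is found.

    min_strength is an assumed threshold on the autocorrelation peak
    relative to the value at lag 0; it is not taken from the project.
    """
    n = len(profile)
    mean = sum(profile) / n
    centered = [v - mean for v in profile]

    def autocorr(lag: int) -> float:
        return sum(centered[i] * centered[i + lag] for i in range(n - lag))

    energy = autocorr(0)
    if energy <= 1e-9:
        return None  # flat profile, e.g. blank paper

    # Skip the lobe around lag 0: advance until the correlation turns
    # non-positive, so that the line thickness is not mistaken for a period.
    lag = 1
    while lag <= n // 2 and autocorr(lag) > 0:
        lag += 1
    if lag > n // 2:
        return None

    # The strongest remaining peak is taken as the ruling period.
    best = max(range(lag, n // 2 + 1), key=autocorr)
    return best if autocorr(best) / energy >= min_strength else None


def classify_paper(gray: Sequence[Sequence[int]]) -> str:
    """Classify a snippet as 'checked', 'lined' or 'blank' (simplified)."""
    horizontal = estimate_ruling_period(row_darkness_profile(gray))
    # Columns of the image are the rows of its transpose.
    vertical = estimate_ruling_period(row_darkness_profile(list(zip(*gray))))
    if horizontal and vertical:
        return "checked"
    if horizontal or vertical:
        return "lined"
    return "blank"


def synthetic_snippet(height: int, width: int, period: int,
                      horizontal: bool, vertical: bool) -> List[List[int]]:
    """Build a white test image with 2-pixel-wide dark ruling lines."""
    return [[0 if (horizontal and y % period < 2) or
                  (vertical and x % period < 2) else 255
             for x in range(width)]
            for y in range(height)]
```

On real scans the snippet would first have to be deskewed (see the skew estimation feature) and separated from handwriting or print, since text lines also produce periodic row profiles; the sketch ignores both issues.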

Documents can be fragmented to make the information (writings, drawings) inscribed on writing materials (paper, parchment, papyrus) unreadable. Although parts of the information still exist on single fragments, the entire text, and therefore the context of the document, is destroyed. Reasons for intentionally tearing writing materials are either criminal intentions (business crime, tax fraud investigation, secret service documents) or, for example, the protection of sensitive data and personal information (bank details, credit card numbers). Unintended fragmentation of documents concerns ancient manuscripts that are fragmented either due to environmental effects (influence of mold, water) or due to catastrophes like the collapse of the Historical Archive of the City of Cologne (a total of 18 shelf kilometers of books has been destroyed).

This project will be performed in cooperation with:

who are dealing with the reconstruction of destroyed “Stasi Documents” (approximately 600 million pieces). See:

Publications

F. Kleber, M. Diem and R. Sablatnig, “Document Reconstruction by Layout Analysis of Snippets”, In Proceedings of IS&T SPIE Conference on Computer Image Analysis in the Study of Art, San Jose, USA, 2010.

M. Diem, F. Kleber and R. Sablatnig, “Document Analysis Applied to Fragments: Feature Set for the Reconstruction of Torn Documents”, In Proceedings of the 9th International Workshop on Document Analysis Systems (DAS), Boston, USA, pp. 393-400, 2010.

F. Kleber, M. Diem and R. Sablatnig, “Reconstruction of Torn Manuscripts/Notes: Determination of Snippet Features”, In Proceedings of EVA Berlin – Electronic Media and Visual Arts, Berlin, Germany, 2009.

Detailed Information