Enabling Deep Document Image Analysis with Generative Models

Abstract: Historical documents are a priceless treasure trove of cultural and historical knowledge, but their preservation and analysis pose significant challenges due to the unique characteristics of handwritten scripts, their variability, and the inevitable degradation caused by centuries of use. In the Deep Learning era, enormous amounts of annotated data are required to train large models that perform well on unseen data. Digital libraries now provide high-quality digitized images for the analysis and processing of historical documents. However, collecting and annotating this data is expensive and requires considerable expertise from historians and other humanities scholars. Hence, generating synthetic data to enhance the performance of deep learning frameworks is a common approach in Computer Vision and, specifically for this thesis, in Document Image Analysis and Recognition (DIAR). This thesis focuses on leveraging generative models to facilitate DIAR tasks for historical and handwritten documents by generating realistic synthetic images that resemble the real data distribution and enhance the training of downstream DIAR tasks. The contributions of the thesis include a systematic literature review of existing historical document image datasets, which identified limitations and promising resources for future research, as well as a comparative evaluation of synthetic data generated by two existing methods on a historical image task. Furthermore, the thesis introduces a new method for generating styled handwritten text images based on Denoising Diffusion Probabilistic Models (DDPM), a previously unexplored approach in DIAR. The method captures the stylistic and content characteristics of a standard multi-writer handwriting dataset and achieves state-of-the-art performance in enhancing writer identification and handwritten text recognition compared to Generative Adversarial Network (GAN)-based methods.
As a final contribution, this method is further extended to operate in a few-shot setting for the writer style condition, generating synthetic images of unseen writer styles. The results demonstrate the potential of the generative method for enabling deep document image analysis and pave the way for further research in the field. As a future direction, this work aims to progress from generating word images to generating sentence and full document images by conditioning on the content, style, and layout of historical documents. Furthermore, future work will aim to leverage important features from pre-training with synthetic and real data in order to generalize to historical documents, which are a scarce resource, and to adapt the text encoding components to different languages and scripts. Finally, the last step of the future work is to generate a massive synthetic historical document image database for reading systems to fill the existing benchmark gap.
