The HBA dataset contains 2435 printed pages and 2001 manuscript pages, collected from 11 books (6 manuscripts and 5 printed books).
The documents of the HBA dataset are grayscale or color images, digitized at 300 or 400 dpi and saved as high-resolution TIFF files.
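The figures above can be tallied in a few lines. The following is only a minimal sketch: the `Subset` class, the `summarize` and `is_expected_scan` helpers, and the file names are hypothetical and are not part of any official HBA tooling. Only the page counts, book counts, resolutions, and file format come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subset:
    """Hypothetical record for one part of the dataset."""
    book_type: str  # "printed" or "manuscript"
    pages: int
    books: int


# Counts as stated in the dataset description.
SUBSETS = [
    Subset("printed", pages=2435, books=5),
    Subset("manuscript", pages=2001, books=6),
]

# Scanning resolutions stated in the description, in dpi.
STATED_DPI = {300, 400}


def summarize(subsets):
    """Return total pages, total books, and each subset's share of pages."""
    total_pages = sum(s.pages for s in subsets)
    total_books = sum(s.books for s in subsets)
    shares = {s.book_type: s.pages / total_pages for s in subsets}
    return total_pages, total_books, shares


def is_expected_scan(file_name: str, dpi: int) -> bool:
    """Check a (hypothetical) file name and resolution against the stated format."""
    is_tiff = file_name.lower().endswith((".tif", ".tiff"))
    return is_tiff and dpi in STATED_DPI


total_pages, total_books, shares = summarize(SUBSETS)
print(total_pages, total_books)  # 4436 11
```

A check such as `is_expected_scan("page_0001.tif", 300)` could serve as a quick sanity test when iterating over downloaded images, assuming the resolution has already been read from the file metadata by some other means.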
The following table details the characteristics of the HBA dataset. Each “Book Id.” links to the corresponding historical book in the French digital library Gallica (only low-resolution images are publicly available online).
| Book Id. | Publishing date | Number of pages | Book type | Image type |
The main characteristic of the HBA dataset is its strong heterogeneity: the books differ in typography, illustration style, and historic fonts, have complex layouts (e.g. dense printing, irregular spacing, varying text column widths, marginal notes), and exhibit ink shining through and historical spelling variants. Beyond this heterogeneity, the dataset also covers the issues that commonly affect document image layout analysis, such as degradation (e.g. yellowed pages, ink stains, back-to-front interference) and scanning defects (e.g. page curvature and uneven lighting). The historical document images were selected to be as realistic as possible, so that the competition can determine whether the participating methods are sufficiently robust to the particularities of historical document images.