Description

The HBA dataset contains 4436 pages: 2435 manuscript pages and 2001 printed pages. These pages were collected from 11 books (5 manuscripts and 6 printed books).

The documents of the HBA dataset are gray-scale or color images that were digitized at 300 or 400 dpi and saved in the TIFF format, which preserves the high resolution of the digitized images.

The following table details the characteristics of the HBA dataset. The link associated with each “Book Id.” points to the corresponding historical book in the French digital library Gallica (only low-resolution images are publicly available online).

| Book Id. | Publishing date | Number of pages | Book type  | Image type |
|----------|-----------------|-----------------|------------|------------|
| Book 1   | 1743-1774       | 730             | Manuscript | Color      |
| Book 2   | 1342            | 486             | Manuscript | Color      |
| Book 3   | 1285            | 350             | Manuscript | Color      |
| Book 4   | 1201-1300       | 813             | Manuscript | Gray-scale |
| Book 5   | 1758            | 56              | Manuscript | Color      |
| Book 6   | 1596            | 322             | Printed    | Color      |
| Book 7   | 1711            | 64              | Printed    | Gray-scale |
| Book 8   | 1478-1480       | 403             | Printed    | Gray-scale |
| Book 9   | 1481            | 341             | Printed    | Color      |
| Book 10  | 1782-1822       | 440             | Printed    | Color      |
| Book 11  | 1839            | 431             | Printed    | Color      |
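As a quick consistency check, the per-book page counts in the table can be summed to recover the manuscript and printed totals quoted in the description. The following is only a minimal sketch: the `HBA_BOOKS` list and the `pages_by` helper are hypothetical names introduced for illustration, with the rows transcribed by hand from the table, and they are not part of any official HBA tooling.

```python
from collections import defaultdict

# Rows transcribed by hand from the table above (hypothetical structure):
# (book id, publishing date, number of pages, book type, image type)
HBA_BOOKS = [
    (1, "1743-1774", 730, "Manuscript", "Color"),
    (2, "1342", 486, "Manuscript", "Color"),
    (3, "1285", 350, "Manuscript", "Color"),
    (4, "1201-1300", 813, "Manuscript", "Gray-scale"),
    (5, "1758", 56, "Manuscript", "Color"),
    (6, "1596", 322, "Printed", "Color"),
    (7, "1711", 64, "Printed", "Gray-scale"),
    (8, "1478-1480", 403, "Printed", "Gray-scale"),
    (9, "1481", 341, "Printed", "Color"),
    (10, "1782-1822", 440, "Printed", "Color"),
    (11, "1839", 431, "Printed", "Color"),
]

PAGES, BOOK_TYPE, IMAGE_TYPE = 2, 3, 4  # column indices in each row


def pages_by(books, column):
    """Sum the page counts of `books`, grouped by the value in `column`."""
    totals = defaultdict(int)
    for row in books:
        totals[row[column]] += row[PAGES]
    return dict(totals)


pages_per_book_type = pages_by(HBA_BOOKS, BOOK_TYPE)
pages_per_image_type = pages_by(HBA_BOOKS, IMAGE_TYPE)
total_pages = sum(row[PAGES] for row in HBA_BOOKS)

print(pages_per_book_type)
print(pages_per_image_type)
print(total_pages)
```

Grouping by book type reproduces the 2435 manuscript pages and 2001 printed pages stated in the description. Grouping by image type additionally shows how the pages split between color and gray-scale scans.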

The HBA dataset is characterized primarily by strong heterogeneity: differences in typography, illustration style and historic fonts; complex layouts (e.g. dense printing, irregular spacing, varying text column widths, marginal notes); ink shining through; and historical spelling variants. In addition, the issues that commonly affect document image layout analysis are well represented, including degradation (e.g. yellowed pages, ink stains, back-to-front interference) and scanning defects (e.g. curvature and lighting defects). The historical document images in the dataset were selected to be as realistic as possible, so that the competition can determine whether the participating methods are sufficiently robust to the particularities of historical document images.