Description

The HBA 1.0 dataset contains 2435 manuscript pages and 2001 printed pages, collected from 11 books (5 manuscripts and 6 printed books).

The documents of the HBA 1.0 dataset are gray-scale or color images, digitized at 300 or 400 dpi and saved in the TIFF format to preserve the high resolution of the digitized images.

The following table details the characteristics of the HBA 1.0 dataset. The link associated with each “Book Id.” points to the corresponding historical book in the French digital library Gallica (only low-resolution images are publicly available online).

Book Id.	Publishing date	Number of pages	Book type	Image type
Book 01	1743-1774	730	Manuscript	Color
Book 02	1342	486	Manuscript	Color
Book 03	1285	350	Manuscript	Color
Book 04	1201-1300	813	Manuscript	Gray-scale
Book 05	1758	56	Manuscript	Color
Book 06	1596	322	Printed	Color
Book 07	1711	64	Printed	Gray-scale
Book 08	1478-1480	403	Printed	Gray-scale
Book 09	1481	341	Printed	Color
Book 10	1782-1822	440	Printed	Color
Book 11	1839	431	Printed	Color
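
The page totals quoted in the description can be cross-checked against the table. The following is a minimal Python sketch of that tally; the rows are transcribed from the table above, and the names `BOOKS` and `pages_by_type` are illustrative assumptions, not part of any tooling distributed with the dataset.

```python
from collections import Counter

# Rows transcribed from the table above:
# (book id, number of pages, book type, image type).
BOOKS = [
    ("Book 01", 730, "Manuscript", "Color"),
    ("Book 02", 486, "Manuscript", "Color"),
    ("Book 03", 350, "Manuscript", "Color"),
    ("Book 04", 813, "Manuscript", "Gray-scale"),
    ("Book 05", 56, "Manuscript", "Color"),
    ("Book 06", 322, "Printed", "Color"),
    ("Book 07", 64, "Printed", "Gray-scale"),
    ("Book 08", 403, "Printed", "Gray-scale"),
    ("Book 09", 341, "Printed", "Color"),
    ("Book 10", 440, "Printed", "Color"),
    ("Book 11", 431, "Printed", "Color"),
]


def pages_by_type(books):
    """Sum the page counts per book type (hypothetical helper)."""
    totals = Counter()
    for _book_id, pages, book_type, _image_type in books:
        totals[book_type] += pages
    return dict(totals)


print(pages_by_type(BOOKS))  # {'Manuscript': 2435, 'Printed': 2001}
```

The two sums match the 2435 manuscript pages and 2001 printed pages stated in the description, for 4436 pages in total.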

The HBA 1.0 dataset is primarily characterized by strong heterogeneity: differences in layout, typography, illustration style and historic fonts; complex layouts (e.g. dense printing, irregular spacing, varying text column widths, marginal notes); ink shining through the page; and historical spelling variants. In addition, the issues that affect document image layout analysis are adequately covered, such as degradations (e.g. yellowed pages, ink stains, back-to-front interference) and scanning defects (e.g. curvature and lighting defects). The historical document images in the HBA 1.0 dataset were selected to be as realistic as possible, so that the competition can determine whether the participating methods are sufficiently robust to the particularities of historical document images.