HBA dataset division

The HBA 1.0 dataset is divided into two sub-datasets: the sample dataset and the evaluation dataset. The sample dataset contains 2 books, while the evaluation dataset is composed of the 9 remaining books. Each sub-dataset is split into a set of training images and a set of test images. The training set contains a reduced number of book pages, along with their ground truth in the TXT/PNG format; these training images are representative of the different contents and layouts of the book pages. The test set, on the other hand, is composed of the images of the remaining book pages.

All pages of the 2 books selected for the sample dataset will be provided, along with their ground truths. A few pages will be selected from each of these 2 books to constitute the training set; the selected pages should cover all content classes of the analyzed books. For the evaluation dataset, only the ground truths of a few pages of the remaining 9 books will be provided, to constitute its training set.
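As a concrete illustration of this partitioning, the sketch below splits one book's pages into training and test subsets given a hand-picked set of training pages. The directory layout, file naming, and the `split_book` helper are hypothetical conveniences for illustration, not the official HBA 1.0 distribution format.

```python
from pathlib import Path

# Hypothetical layout: one directory per book, one PNG image per page.
# The actual HBA 1.0 packaging may differ; this only mirrors the logic
# described above (a few annotated pages for training, the rest for test).
HBA_ROOT = Path("HBA_1.0")


def split_book(book_dir: Path, training_pages: set):
    """Partition one book's page images into training and test lists.

    `training_pages` is the small, manually chosen set of page name stems
    whose TXT/PNG ground truth is released with the training data.
    """
    train, test = [], []
    for page in sorted(book_dir.glob("*.png")):
        (train if page.stem in training_pages else test).append(page)
    return train, test


# Example usage with made-up page identifiers:
train_pages, test_pages = split_book(HBA_ROOT / "book_01", {"page_003", "page_017"})
```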

It is worth pointing out that the content classes in the HBA 1.0 dataset have very different frequencies. Indeed, textual content is predominant in monographs compared to graphical content. Moreover, within the textual content, the great majority belongs to the body text, while other character fonts are more marginal.

This class imbalance varies from one book to another in the HBA 1.0 dataset. Nevertheless, our goal consists in evaluating methods that automatically annotate a large number of book pages based on a limited number of manually annotated pages of the same book.
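To make this imbalance measurable per book, the sketch below counts pixels per class over a folder of PNG ground-truth maps. It assumes each ground-truth PNG is a single-channel label image in which each pixel value encodes one content class; this encoding is an assumption for illustration, not a documented HBA 1.0 convention.

```python
from collections import Counter
from pathlib import Path

import numpy as np
from PIL import Image


def class_pixel_counts(gt_dir: Path) -> Counter:
    """Count pixels per label value across all PNG ground-truth maps in gt_dir.

    Assumes single-channel (grayscale or palette) label images where each
    pixel value stands for one content class, e.g. background, body text,
    graphics, or another character font.
    """
    counts = Counter()
    for gt_path in sorted(gt_dir.glob("*.png")):
        labels = np.asarray(Image.open(gt_path))
        values, freqs = np.unique(labels, return_counts=True)
        counts.update(dict(zip(values.tolist(), freqs.tolist())))
    return counts


# Example usage: per-class pixel totals for one book's ground truths.
print(class_pixel_counts(Path("HBA_1.0/book_01/ground_truth")))
```

Comparing the resulting counters across books would quantify how strongly the class distribution shifts from one book to another.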