Sample dataset

The sample dataset will contain 2 books. It will be composed of a set of training images and a set of test images. The training dataset will contain a small number of book pages, along with their ground truth in TXT format. The training images are representative of the different contents and layouts of the book pages. The test dataset, in turn, will be composed of the images of the remaining book pages.

All pages of the 2 books selected for the sample dataset will be provided, along with their ground truths. A few pages will be selected from each of the 2 books to constitute the training dataset. These selected pages should contain all content classes of the analyzed books.

The number of training images for each book of the sample dataset is as follows.

Book ID    Number of training images
Book 4     30
Book 6     42

The participants are free to use the sample dataset for training, testing or any other purpose related to the HBA competition.

The two books selected to form the sample dataset are Book 4, a manuscript, and Book 6, a printed book. The training datasets of Book 4 and Book 6 are composed of 30 and 42 files, respectively.

Click on the two links below to download the files you need to use the sample dataset. Each link points to a compressed file containing the data of one book of the sample dataset.
After decompressing a file, 4 folders will be created, namely “images”, “train”, “test” and “gt”.

  • “images”: It contains all TIFF images of the book.
  • “train”: It is composed of the TXT files that form the training dataset. The training dataset is representative of the different contents and layouts of the analyzed book pages. Each line of a TXT file is composed of three values: the two coordinates of a selected foreground pixel and its label, which represents the content class of that pixel in the analyzed book. The label value ranges from 1 to 6. A label value of 1 denotes graphical content; any other value denotes textual content.
  • “test”: It contains the remaining TXT files of the book, that is, those not included in the training dataset, which form the test dataset. Each line of a TXT file is composed of only the coordinates of a selected foreground pixel. The participants should fill out these files with the predicted class label for each foreground pixel.
  • “gt”: It is composed of the ground truth files of the test dataset in TXT format. A ground truth file has the same structure as a training TXT file.
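The file layout described above can be sketched in Python. This is only an illustrative sketch under several assumptions not stated in the description: the values on a line are integers separated by whitespace, commas or semicolons; the two coordinates come first and the label last; and a filled-out test line uses a single space as separator. The function names (parse_labeled_line, parse_coordinate_line, is_graphic, fill_predictions) and the predict callback are hypothetical and are not part of any official HBA tooling.

```python
import re
from typing import Callable, Iterable, List, Tuple

# From the description: labels range from 1 to 6, and 1 denotes graphical content.
GRAPHIC_LABEL = 1
MIN_LABEL, MAX_LABEL = 1, 6

# Assumption: values are separated by whitespace, commas or semicolons.
_SEPARATOR = re.compile(r"[,;\s]+")


def _split_ints(line: str, expected: int) -> List[int]:
    """Split a line into exactly `expected` integer fields."""
    fields = [f for f in _SEPARATOR.split(line.strip()) if f]
    if len(fields) != expected:
        raise ValueError(f"expected {expected} values, got {len(fields)}: {line!r}")
    return [int(f) for f in fields]


def parse_labeled_line(line: str) -> Tuple[int, int, int]:
    """Parse a "train" or "gt" line: two pixel coordinates and a class label."""
    c1, c2, label = _split_ints(line, 3)
    if not MIN_LABEL <= label <= MAX_LABEL:
        raise ValueError(f"label {label} is outside {MIN_LABEL}..{MAX_LABEL}")
    return c1, c2, label


def parse_coordinate_line(line: str) -> Tuple[int, int]:
    """Parse a "test" line: two pixel coordinates, no label yet."""
    c1, c2 = _split_ints(line, 2)
    return c1, c2


def is_graphic(label: int) -> bool:
    """Label 1 is graphical content; labels 2 to 6 are textual content."""
    return label == GRAPHIC_LABEL


def fill_predictions(
    test_lines: Iterable[str],
    predict: Callable[[int, int], int],
) -> List[str]:
    """Append a predicted label to every non-empty test line.

    `predict` is a hypothetical callback standing in for a participant's model.
    The single-space output separator is an assumption.
    """
    filled = []
    for line in test_lines:
        if not line.strip():
            continue  # skip blank lines
        c1, c2 = parse_coordinate_line(line)
        label = predict(c1, c2)
        if not MIN_LABEL <= label <= MAX_LABEL:
            raise ValueError(f"predicted label {label} is outside the valid range")
        filled.append(f"{c1} {c2} {label}")
    return filled
```

A participant could, for instance, read a file from the “test” folder line by line, pass the lines to fill_predictions together with a classifier wrapped as predict, and write the returned lines back out. The actual separator and coordinate order should be checked against the downloaded files before relying on this sketch.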

Download links

The two books composing the sample dataset are available for download from the following two links:

– Book 4
– Book 6