Sample dataset

The sample dataset will contain two books. For each book, it will be composed of a set of training images and a set of test images. The training dataset will contain a reduced number of book pages, along with their ground truth in TXT or PNG format. The training images are representative of the different contents and layouts of the book pages. The test dataset, on the other hand, will be composed of images of the remaining book pages.

All pages of the two books selected for the sample dataset will be provided, along with their ground truths. A few pages will be selected from each of the two books to constitute the training dataset. These selected pages should contain all content classes of the analyzed books.

The training dataset of each book of the sample dataset is structured as follows.

Book Id.    Number of training images
Book 04     30
Book 06     42

The two books selected to form the sample dataset are Book 04, a manuscript, and Book 06, a printed book. The training datasets of Book 04 and Book 06 are composed of 30 and 42 files, respectively.

Two versions of the sample dataset are available:

1- TXT

2- PNG

The two versions differ in the format of the training, test and ground truth files (TXT and PNG).

Participants are free to choose either the TXT or the PNG version, and to use the sample dataset for training, testing or any other purpose related to the HBA competition.

Click on the links below to download the files you need to use the sample dataset. Each link points to a compressed file containing the data of one book of the sample dataset.

After decompressing the file, 4 folders will be created, namely, “images”, “train”, “test” and “gt”.

1- “images”: It contains all TIFF images of a book.

2- “train”: It is composed of a number of TXT/PNG files that form the training dataset. The training dataset is representative of the different contents and layouts of the analyzed book pages. Each line of a TXT file is composed of three values: the two coordinates of a selected foreground pixel and the class label representing its content type in the analyzed book. The label value varies between 1 and 6. A label value of 1 denotes graphical content; any other value denotes textual content. In the PNG version, pixel-labeled images are provided. At most 6 BGR values are used to encode the 6 different content classes in the pixel-labeled images.

3- “test”: It contains the remaining TXT/PNG files of the book, that is, those not included in the training dataset, which form the test dataset. Each line of a TXT file is composed of only the coordinates of a selected foreground pixel. Participants should fill in these files with the predicted class label for each foreground pixel. In the PNG version, pixel-labeled images with the selected foreground pixels colored in white are provided. For the PNG version, participants should output a pixel-labeled image that uses the BGR values defined in the training files.

4- “gt”: It is composed of the ground truth files of the test dataset in the TXT/PNG format. A ground truth file has a similar structure to a training file.
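For the TXT version, the file handling described above can be sketched as follows. This is only a minimal illustration, not an official tool of the competition. It assumes that the values on a line are integers separated by whitespace, commas or semicolons, that the two coordinates come before the label, and that a filled-in test file uses the same three-value layout as a training file; none of these details is stated above. The function names are hypothetical.

```python
import re
from pathlib import Path

# Assumption: fields are separated by whitespace, commas or semicolons.
_SEPARATOR = re.compile(r"[,\s;]+")


def _fields(line):
    """Split one line into its non-empty fields."""
    return [f for f in _SEPARATOR.split(line.strip()) if f]


def read_training_file(path):
    """Read a "train" or "gt" TXT file.

    Returns a list of (coord_a, coord_b, label) tuples, keeping the
    coordinates in the order in which they appear in the file.
    """
    samples = []
    for line in Path(path).read_text().splitlines():
        fields = _fields(line)
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        coord_a, coord_b, label = (int(f) for f in fields)
        if not 1 <= label <= 6:
            raise ValueError(f"label out of range 1..6: {label}")
        samples.append((coord_a, coord_b, label))
    return samples


def read_test_file(path):
    """Read a "test" TXT file: coordinates only, no label."""
    coords = []
    for line in Path(path).read_text().splitlines():
        fields = _fields(line)
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        coords.append((int(fields[0]), int(fields[1])))
    return coords


def write_predictions(path, coords, labels):
    """Write one "coord_a coord_b label" line per foreground pixel.

    Assumption: the expected output layout matches a training file.
    """
    if len(coords) != len(labels):
        raise ValueError("need exactly one label per foreground pixel")
    lines = [f"{a} {b} {label}" for (a, b), label in zip(coords, labels)]
    Path(path).write_text("\n".join(lines) + "\n")


def is_graphical(label):
    """Label 1 is graphical content; labels 2 to 6 are textual content."""
    return label == 1
```

A typical use would be to call read_test_file on a file from the “test” folder, predict one label per returned coordinate pair, and pass both lists to write_predictions. The result can then be compared with the matching file in “gt”, loaded with read_training_file.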

Download links

The two versions of the two selected books composing the sample dataset are available for download from the following links:

TXT        PNG
Book 04    Book 04
Book 06    Book 06