Database for Amharic OCR

Sample original images from the Amharic database: (a) character-level images; (b) text-line images.

Authors: Birhanu Hailu Belay, Tewodros Habtegebrial, Marcus Liwicki, Gebyehu Belay & Didier Stricker

Current version: 1.0

Year of publication: 2019

The Amharic database available on this page is organized into two datasets:

  • Amharic character image dataset:

    This dataset contains 80,000 Amharic character images with their corresponding ground truth (GT) in the first version and 79,994 Amharic character images in the second version (about 2,006 distorted Amharic character images were removed from the first version). Images and their corresponding labels are stored in numpy format, so you need to write a short program to read the numpy files. The images are grey-level, with a size of 32 by 32 pixels. This dataset covers only 231 basic Amharic characters.

    1. First version:
          train_character_data Not available!
          test_character_data Not available!
    2. Second version:
          train_character_data
          test_character_data
  • Amharic text-line image dataset:

    This dataset contains 337,332 Amharic text-line images, written in the Visual Geez and Power Geez fonts using 280 unique Amharic characters, for training and testing Amharic text recognizers. All images are grey-level and normalized to 48 by 128 pixels. Of the total, 40,929 are printed text-line images in the Power Geez font, while 197,484 and 98,919 are synthetically generated text-line images in the Power Geez and Visual Geez fonts, respectively.

    The training set combines all text-line images and their GT texts in a single numpy file, ordered as printed images first, then synthetic Power Geez images, then synthetic Visual Geez images. The test set instead keeps each of these three groups in a separate numpy file with its corresponding GT texts. The test set contains 2,907 printed text-line images, plus 9,245 and 6,479 synthetically generated text-line images in the Power Geez and Visual Geez fonts, respectively.

      train_data.tar.gz
      test_data.tar.gz
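Since both datasets are distributed in numpy format, a minimal loading sketch may help. It assumes that images and GT labels sit in two separate .npy files and that image arrays are shaped (N, height, width), optionally with a trailing channel axis of size 1. Neither assumption is confirmed by this page, and the helper names load_npy_pair and has_image_size are illustrative only.

```python
import numpy as np


def load_npy_pair(image_path, label_path):
    """Load an image array and its ground-truth array from two .npy files.

    Both paths are placeholders: use the file names you actually find
    after downloading and extracting the archives.
    """
    images = np.load(image_path)
    labels = np.load(label_path)
    # Every image needs exactly one ground-truth entry.
    if len(images) != len(labels):
        raise ValueError(
            f"{len(images)} images but {len(labels)} labels; files do not match"
        )
    return images, labels


def has_image_size(images, height, width):
    """Return True if the array holds images of height x width pixels.

    Accepts arrays shaped (N, height, width) or (N, height, width, 1).
    This layout is an assumption, not something the dataset page states.
    """
    shape = images.shape
    if len(shape) == 4 and shape[-1] == 1:
        shape = shape[:-1]  # ignore a trailing single-channel axis
    return len(shape) == 3 and shape[1:] == (height, width)
```

After loading, has_image_size(images, 32, 32) would be the expected check for the character dataset and has_image_size(images, 48, 128) for the text-line dataset, assuming the stated sizes are given as rows by columns. If the labels were saved as Python objects, np.load refuses to read them unless allow_pickle=True is passed, which should only be done for files you trust.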

    Citation


    If you need any additional information, please don't hesitate to contact us at the following address:

      Birhanu Hailu:  birhanu.hailub@gmail.com