OCR Recognition Languages

  • ABBYY products support more than 190 OCR languages.
  • There are different types of languages:
    • Natural languages, like English, Russian or German
    • Artificial languages: Esperanto, Interlingua, Ido, Occidental
    • Formal languages: Basic, C/C++, COBOL, Fortran, Java, Pascal, simple chemical formulas
  • Languages contain special language units/data types, e.g.:
    • addresses
    • date and time
    • human names, etc.
    • For some natural languages: City, village, settlement (English, United Kingdom); Currency in words (English, United States), etc.
  • Hieroglyphic languages: Chinese (PRC and Taiwan), Japanese, Korean and Korean/Hangul
  • Thai
  • Hebrew
  • Arabic

The languages mentioned above are so-called predefined languages. In addition to them, it is possible to define your own language and use it for recognition.
The screenshot below shows the “Language Editor” implementation of FineReader 10, which is shipped with FineReader Engine 10.

Structure of a Recognition Language

Every recognition language has the following properties:

  • Name
  • Set of allowed characters:
    • alphabet
    • list of prefixes
    • list of suffixes
    • alphabet for subscripts
    • alphabet for superscripts
    • list of ignored characters
  • Dictionaries (optional): a language can have one, but recognition will also work without one.
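
The properties listed above can be pictured as a simple data structure. The following Python sketch is only an illustration: the class `RecognitionLanguage`, its field names and the `PartNumbers` example are hypothetical and are not part of the FineReader Engine API. How prefixes and suffixes are applied during recognition is not modeled here; they are only stored.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple


@dataclass(frozen=True)
class RecognitionLanguage:
    """Hypothetical container mirroring the properties listed above."""
    name: str
    alphabet: FrozenSet[str]
    prefixes: Tuple[str, ...] = ()
    suffixes: Tuple[str, ...] = ()
    subscript_alphabet: FrozenSet[str] = frozenset()
    superscript_alphabet: FrozenSet[str] = frozenset()
    ignored_characters: FrozenSet[str] = frozenset()
    # Dictionaries are optional: None means "no dictionary attached".
    dictionary: Optional[FrozenSet[str]] = None

    def allows(self, word: str) -> bool:
        """True if every non-ignored character of `word` is in the alphabet."""
        core = [ch for ch in word if ch not in self.ignored_characters]
        return bool(core) and all(ch in self.alphabet for ch in core)

    def in_dictionary(self, word: str) -> Optional[bool]:
        """None if the language has no dictionary, else a membership test."""
        if self.dictionary is None:
            return None
        return word.lower() in self.dictionary


# A user-defined language for part numbers (assumed example, no dictionary).
part_numbers = RecognitionLanguage(
    name="PartNumbers",
    alphabet=frozenset("ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"),
    prefixes=("PN",),
    ignored_characters=frozenset("-"),
)
```

In this sketch `part_numbers.allows("PN-4471")` is true because the hyphen is ignored and all remaining characters are in the alphabet, while the missing dictionary simply makes `in_dictionary` return `None` instead of blocking the check.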

Language Auto-Detection

  • ABBYY technologies are able to detect the language of a document automatically.
  • The product chooses the best matching language from a predefined group of languages.
  • This group can be set/edited by the user/developer.


Note. It is recommended to use no more than 5 languages in a group. The more languages are selected, the more CPU time is needed, and the number of recognition hypotheses also goes up. This will most likely decrease the OCR quality.
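
The idea of choosing the best match from a limited group can be illustrated with a toy example. The sketch below is not ABBYY's detection method: it simply counts how many words of a text appear in a small word list per language, and the function name `detect_language`, the word lists and the warning are all assumptions made for illustration. The limit of 5 languages is taken from the note above.

```python
import warnings
from typing import Dict, List, Set

MAX_RECOMMENDED_LANGUAGES = 5  # recommendation from the note above


def detect_language(words: List[str], group: Dict[str, Set[str]]) -> str:
    """Return the language in `group` whose word list matches the most words.

    `group` maps a language name to a set of lowercase known words.
    Ties are resolved in favor of the language listed first.
    """
    if not group:
        raise ValueError("the language group must not be empty")
    if len(group) > MAX_RECOMMENDED_LANGUAGES:
        # Every extra language adds one more candidate to score per word.
        warnings.warn("more than 5 languages in the group may slow down "
                      "processing and reduce quality")
    lowered = [w.lower() for w in words]
    scores = {name: sum(w in vocab for w in lowered)
              for name, vocab in group.items()}
    return max(scores, key=scores.get)


# Hypothetical two-language group with tiny word lists.
group = {
    "English": {"the", "and", "is", "of"},
    "German": {"der", "und", "ist", "von"},
}
result = detect_language("Der Hund ist gross und alt".split(), group)
```

Here `result` is `"German"`, since three of the six words match the German list and none match the English one. The loop over `group` also makes the cost visible: each additional language means another full pass over the words.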

