Difference between revisions of "Tibetan OCR"

Revision as of 10:58, 29 April 2017

Tesseract 4 alpha (by Google, with neural networks)

Note: At the time of writing (2017-04), Tesseract 4 is still in early development and mostly supports Linux. Using it requires technical computer skills.

Tesseract 4.0 alpha supports OCR (optical character recognition) for Tibetan.

Version 4 adds a new OCR engine based on LSTM neural networks. It initially works well on x86/Linux. Model data for 101 languages (including Tibetan and Dzongkha) is available in the tessdata repository.
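As a rough illustration of how the LSTM engine might be invoked on a scanned page, here is a minimal Python sketch that wraps the tesseract command-line program. It assumes that tesseract is installed and on the PATH, that the language code for Tibetan is bod (Dzongkha is dzo, as in dzo.traineddata), and that this build accepts the --oem 1 option to select the LSTM engine. The function names and file names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_tesseract_command(image, output_base, lang="bod", tessdata_dir=None):
    """Build the argument list for one Tesseract run.

    Assumptions: "bod" is the Tibetan language code, and "--oem 1"
    selects the LSTM engine in this Tesseract 4 build.
    """
    cmd = ["tesseract", str(image), str(output_base), "-l", lang, "--oem", "1"]
    if tessdata_dir is not None:
        # Point Tesseract at a custom folder holding the *.traineddata files.
        cmd += ["--tessdata-dir", str(tessdata_dir)]
    return cmd


def ocr_image(image, output_base, **options):
    """Run Tesseract and return the path of the text file it writes."""
    cmd = build_tesseract_command(image, output_base, **options)
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return Path(str(output_base) + ".txt")


if __name__ == "__main__":
    # Hypothetical input file; replace with a real scan.
    result = ocr_image("page.png", "page_ocr", lang="bod")
    print(result.read_text(encoding="utf-8"))
```

Keeping the command construction separate from the call to subprocess.run makes it easy to print and inspect the exact command before running it on a large batch of pages.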

Installation

Refer to the Tesseract repository for detailed installation instructions.

In addition to Tesseract itself, you will need the trained language data for each language you want to recognize.

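Language training sets

* Tibetan: https://github.com/tesseract-ocr/tessdata/blob/master/bod.traineddata
* Dzongkha: https://github.com/tesseract-ocr/tessdata/blob/master/dzo.traineddata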
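Fetching the trained data can also be scripted. The sketch below is only one possible approach: it assumes the files can be downloaded from the tessdata repository on GitHub through a /raw/master/ URL, that bod and dzo are the codes for Tibetan and Dzongkha, and that the target folder is the one Tesseract will later be pointed at. The helper names and the folder name are hypothetical.

```python
import urllib.request
from pathlib import Path

# Assumed download location: the "raw" view of the tessdata repository.
TESSDATA_RAW = "https://github.com/tesseract-ocr/tessdata/raw/master"


def traineddata_url(lang):
    """Return the assumed download URL for one language code, e.g. "bod"."""
    return f"{TESSDATA_RAW}/{lang}.traineddata"


def download_traineddata(lang, tessdata_dir):
    """Download <lang>.traineddata into tessdata_dir unless it already exists."""
    target_dir = Path(tessdata_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{lang}.traineddata"
    if not target.exists():
        urllib.request.urlretrieve(traineddata_url(lang), str(target))
    return target


if __name__ == "__main__":
    # Hypothetical local folder; the files are large, so this may take a while.
    for code in ("bod", "dzo"):
        print(download_traineddata(code, "tessdata"))
```

The download is skipped when the file is already present, so the script can be rerun safely.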

Results

Recognition works quite well for printed Tibetan texts; however, the recognition rate for wood-block pechas is still poor.
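One common way to put a number on recognition quality is the character error rate: the edit distance between the OCR output and a hand-checked transcription, divided by the length of the transcription. The sketch below is a generic illustration and not the method used to obtain the observations above; the function names are hypothetical.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: insertions, deletions and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # delete a reference character
                current[j - 1] + 1,      # insert a hypothesis character
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance divided by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Python strings are sequences of Unicode code points, so a stacked Tibetan syllable counts as several characters here. A lower value means better recognition, and 0.0 means the OCR output matches the transcription exactly.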

OCR at dharmabook.ru

dharmabook.ru offers a free OCR service; uploaded texts are converted within a few days:

* http://www.dharmabook.ru/ocr/