Difference between revisions of "Tibetan OCR"

From Digital Tibetan
* https://escholarship.org/uc/item/6d5781k5#page-4
* https://github.com/zmr/namsel
* https://github.com/thubtenrigzin/namsel-ocr (Forked from zmr/namsel)
* https://hub.docker.com/r/thubtenrigzin/docker-namsel-ocr/ (Docker image)

Revision as of 19:21, 27 October 2018

Namsel Ocr

Available for Windows, Mac, and Linux, and now on Docker.

Works great on Docker: Docker Namsel Ocr. For the main version: Namsel Ocr.

Sources are available on GitHub: https://github.com/thubtenrigzin/namsel-ocr for the main project and https://github.com/thubtenrigzin/docker-namsel-ocr for the Docker version.

OCR at dharmabook.ru

dharmabook.ru offers a free OCR service; uploaded texts are converted within a few days.

This service seems to work well with wood-block pechas.

See the Buddhist Library Project for more information about the project.

Tesseract 4 alpha (by Google, with neural networks)

Note: At the time of writing (2017-04), Tesseract 4 is still in early development and mostly supports Linux. Technical computer skills are required to use it.

Tesseract 4.0 alpha supports OCR (optical character recognition) for Tibetan.

The new version adds an OCR engine based on LSTM neural networks. For now it works well on x86/Linux. Model data for 101 languages (including Tibetan and Dzongkha) is available in the tessdata repository.


Refer to the Tesseract repository for detailed installation instructions.

In addition to Tesseract itself, you will need the trained language data for Tibetan.

Language training sets
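
Once Tesseract and the language data are installed, a run on a single page image can be sketched as below. This is only a minimal sketch, not an official recipe: it assumes the tesseract executable is on your PATH, that the Tibetan model is installed under the language code bod (dzo for Dzongkha), and the file names page.png and page are made-up placeholders. The helper names build_tesseract_command and ocr_image are hypothetical and not part of Tesseract.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional


def build_tesseract_command(image: str, output_base: str, lang: str = "bod",
                            tessdata_dir: Optional[str] = None) -> List[str]:
    """Build the argument list for one Tesseract run.

    Assumption: "bod" is the code of the installed Tibetan model
    ("dzo" would be Dzongkha). Tesseract appends ".txt" to output_base.
    """
    cmd = ["tesseract", image, output_base, "-l", lang]
    if tessdata_dir is not None:
        # Point Tesseract at a non-default folder holding the *.traineddata files.
        cmd += ["--tessdata-dir", tessdata_dir]
    return cmd


def ocr_image(image: str, output_base: str, lang: str = "bod",
              tessdata_dir: Optional[str] = None) -> str:
    """Run Tesseract on one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    cmd = build_tesseract_command(image, output_base, lang, tessdata_dir)
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return Path(output_base + ".txt").read_text(encoding="utf-8")


if __name__ == "__main__":
    # Hypothetical input file; replace with a scan of a printed Tibetan page.
    print(ocr_image("page.png", "page"))
```

Running tesseract --list-langs on the command line shows which language models are actually installed. Several models can be combined with a plus sign, for example bod+dzo, if a text mixes both.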


Recognition works quite well for printed Tibetan texts; however, the recognition rate for wood-block pechas is still poor.


A good overview of different endeavors in Tibetan OCR is given at the Namsel project: