4 days ago
Training Tesseract OCR on Kurdish Historical Documents
This article documents how a team digitized Kurdish historical publications and trained Tesseract OCR to recognize the language. The team sourced rare archives from the Zheen Center, turned scans of the fragile originals into clean line-by-line images, and created a ground-truth dataset of over 1,200 files. Working on Ubuntu with tesstrain, they set up a training environment, corrected image skew, cropped the scans, and built image-transcription pairs to teach the model to recognize Kurdish text. The results show how open-source OCR tools and machine learning can help preserve cultural heritage.
Source: HackerNoon →
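The transcription pairs mentioned in the summary can be sanity-checked before training. The sketch below is not taken from the article. It assumes the usual tesstrain layout, in which each line image (`name.png` or `name.tif`) sits next to a transcription file named `name.gt.txt`. The function name `find_unpaired` and the example directory name are hypothetical.

```python
from pathlib import Path

# Assumption: line images are stored as .png or .tif, as tesstrain expects.
IMAGE_SUFFIXES = (".png", ".tif")
# Assumption: transcriptions follow the tesstrain "<name>.gt.txt" convention.
TEXT_SUFFIX = ".gt.txt"


def find_unpaired(ground_truth_dir):
    """Return (images_without_text, texts_without_image) as sorted name lists."""
    image_names = set()
    text_names = set()
    for path in Path(ground_truth_dir).iterdir():
        if not path.is_file():
            continue
        name = path.name
        if name.endswith(TEXT_SUFFIX):
            # Strip the full ".gt.txt" suffix to recover the shared base name.
            text_names.add(name[: -len(TEXT_SUFFIX)])
        elif path.suffix.lower() in IMAGE_SUFFIXES:
            image_names.add(path.stem)
    return sorted(image_names - text_names), sorted(text_names - image_names)


if __name__ == "__main__":
    # Hypothetical directory name; replace with the real ground-truth folder.
    folder = Path("data/kurdish-ground-truth")
    if folder.is_dir():
        missing_text, missing_image = find_unpaired(folder)
        print("Images without transcription:", missing_text)
        print("Transcriptions without image:", missing_image)
```

Running a check like this over a dataset of a thousand or more files is a cheap way to catch a renamed or missing file before a long training run starts.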