Aug 18, 2025
Can AI Save Centuries of Kurdish History?
This study tackles the challenge of digitizing fragile historical Kurdish publications, which current OCR systems fail to process because of damaged pages, non-standard fonts, and a lack of training datasets. Working with Google’s open-source Tesseract 5.0, the researchers built a custom dataset of more than 1,200 annotated lines from pre-1950 Kurdish documents provided by the Zheen Center. The adapted Arabic model reached a promising 84% character recognition accuracy, and the team developed a user-friendly web app for text extraction. The project highlights the need for larger public datasets and further technical innovation to preserve low-resource languages like Kurdish.
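To make the 84% figure more concrete, here is a minimal sketch of how character recognition accuracy is commonly computed for OCR output. The summary does not say which formula the study used, so this assumes the usual definition: one minus the character error rate, where the error rate is the Levenshtein edit distance between the OCR output and the ground-truth line, divided by the length of the ground truth. The function names `levenshtein` and `character_accuracy` and the sample strings are illustrative only.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """Assumed metric: 1 - (edit distance / reference length), floored at 0."""
    if not reference:
        raise ValueError("reference must be non-empty")
    cer = levenshtein(reference, hypothesis) / len(reference)
    return max(0.0, 1.0 - cer)


if __name__ == "__main__":
    # Illustrative strings, not data from the study: one wrong character out of eight.
    truth = "کوردستان"
    ocr_output = "کوردستام"
    print(character_accuracy(truth, ocr_output))  # 0.875
```

Under this definition, 84% accuracy would mean roughly 16 character-level edits are needed per 100 ground-truth characters, which is why larger annotated datasets matter for bringing the error rate down.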
Source: HackerNoon