
Improved Access to Online Content at National Library Ivan Vazov Plovdiv
Discover how the National Library Ivan Vazov Plovdiv is enhancing access to online content through a digitalization project focused on optical character recognition (OCR). Learn about overcoming challenges with Cyrillic texts and achieving success in improving OCR accuracy. Explore the methods and achievements in processing OCR and PDF files, including training software and developing tools with partners from the Institute of Information and Communication Technology. Join the journey towards preserving and making historical Bulgarian texts more accessible for learners and scholars.
Presentation Transcript
Useful in Times of Crisis: Improved Access to Online Content. National Library Ivan Vazov, Plovdiv
Introduction to OCR-related activities
The National Library Ivan Vazov, Plovdiv, has embarked on a digitalization project whose ultimate purpose is to provide learners and scholars with several types of content, including periodicals and books published from the Bulgarian National Revival (1840s) until the 1940s. The project involves optical character recognition (OCR) and requires proper handling of Cyrillic texts.
The problem: By 1945 the Bulgarian language had undergone three major orthographic reforms. Numerous letter symbols were gradually removed from the modern written language, eventually reducing the alphabet to its current 30 letters. These wide variations in the officially accepted orthography seriously hinder the OCR success rate, so minimizing and correcting errors in the machine-readable text produced by OCR software is essential to improving access to the text.
The solution: Joint activities with partners from the Institute of Information and Communication Technology at the Bulgarian Academy of Sciences (IICT-BAS), within the CLaDA-BG project, to develop relevant tools and methodologies.
Introduction to OCR-related activities
Achievements
1. Successful adoption of a new workflow model: Original source -> Scanning -> High-quality master file (long-term storage) -> Web-ready PDF file with recognized text.
2. Completion of a one-time migration of all existing collections in the Digital Library, replacing the existing images with corresponding machine-readable PDFs (April 2020).
3. Participation in the CLaDA-BG project: preparation and testing of a dictionary of old Bulgarian word forms, to be used to assist OCR.
Specifics of OCR and PDF file processing
Methods for improving OCR success rate
OCR: The purpose of OCR is to turn letter symbols in image form into machine-readable text. The ultimate goal is to make the text more usable: searchable and open to copying, editing, etc.
Texts printed before the Orthographic Reform of 1945 are processed with ABBYY FineReader. Two methods may be applied to improve the OCR success rate:
1) Training the software to recognize the old letter symbols no longer in use.
2) Adding a dictionary of the word forms in use before the Orthographic Reform of 1945.
Specifics of OCR and PDF file processing
Methods for improving OCR success rate
1) Training the software to recognize the old letter symbols
The purpose of training the software is to establish a match between the graphical images of the letters and their corresponding character encodings.
Specifics of OCR and PDF file processing
Methods for improving OCR success rate
2) Adding a dictionary of the word forms in use before the Orthographic Reform of 1945
FineReader relies on dictionaries to improve recognition quality: word hypotheses that are found in a dictionary are given preference. FineReader has a built-in dictionary only for the modern Bulgarian language. A dictionary named CLADABG-MODEL, containing word forms in use before the Orthographic Reform of 1945, was therefore developed by our partners from the Institute of Information and Communication Technology at the Bulgarian Academy of Sciences (IICT-BAS) within the CLaDA-BG project, with the aim of bringing the benefit of dictionary use to the recognition of older texts.
To test the extent of that benefit, 20 uniformly text-filled pages from issue no. 1/1881 of the magazine "Science" were selected and scanned in color at a resolution of 300 ppi, 24-bit, in uncompressed TIFF format.
CLaDA-BG PROJECT
Test OCR with dictionary CLADABG-MODEL
To comply with FineReader's requirements, the words in the dictionary are presented as a list, in a plain .TXT format with Unicode encoding.
Minimal training was done, mainly to aid recognition of one traditionally problematic letter, which is always recognized as a different letter if no training has been done beforehand.
Recognition hindrance: in letters with dominant vertical lines, the horizontal lines close up and adjacent letters merge, which confuses the OCR.
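The word-list format described above (a plain .TXT list with Unicode encoding) can be sketched in Python. This is only an illustration, not the tool used in the project: the function name `write_word_list`, the one-word-per-line layout, the UTF-8 default, and the three sample word forms are all assumptions. The source does not say which Unicode encoding was used, so the encoding is left as a parameter.

```python
from pathlib import Path


def write_word_list(words, path, encoding="utf-8"):
    """Write unique, non-empty word forms to a plain text file, one per line.

    The encoding default is an assumption; pass e.g. "utf-16" if the
    OCR software expects a different Unicode encoding.
    """
    # Strip surrounding whitespace, drop empty entries, remove duplicates.
    cleaned = sorted({w.strip() for w in words if w.strip()})
    Path(path).write_text("\n".join(cleaned) + "\n", encoding=encoding)
    return cleaned


if __name__ == "__main__":
    # Illustrative pre-1945 spellings; NOT taken from CLADABG-MODEL.
    sample = ["хлѣбъ", "вѣра", "градъ", "вѣра"]
    written = write_word_list(sample, "old_word_forms_sample.txt")
    print(f"{len(written)} unique word forms written")
```

Sorting and de-duplicating are conveniences for maintaining the list by hand; nothing in the source indicates that FineReader requires either.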
CLaDA-BG PROJECT
Test OCR with dictionary CLADABG-MODEL
Method
The main indicator was the percentage of incorrectly recognized words relative to the total number of words. The incorrectly recognized words were counted manually. The total number of words is 5485, an average of 274.25 per page.
When the two dictionaries, the built-in Bulgarian one and CLADABG-MODEL, are used together, recognition is performed with two languages: (1) Bulgarian with the standard, present-day character set and the FineReader built-in Bulgarian dictionary, and (2) Bulgarian before 1945, with a character set extended by the old letter symbols and with the CLADABG-MODEL dictionary.
In parallel, the unrecognized words split across line breaks were counted, and a check was made to see whether recognition improved when grayscale images were used.
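The main indicator described above is a simple ratio, which the following Python sketch makes explicit. The function name `word_error_percentage` and the error count used in the example call are hypothetical; only the total of 5485 words and the 20 test pages come from the presentation.

```python
def word_error_percentage(incorrect_words, total_words):
    """Return incorrectly recognized words as a percentage of all words."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return 100.0 * incorrect_words / total_words


TOTAL_WORDS = 5485  # from the presentation
PAGES = 20          # from the presentation

AVERAGE_WORDS_PER_PAGE = TOTAL_WORDS / PAGES  # 274.25

if __name__ == "__main__":
    print(f"Average words per page: {AVERAGE_WORDS_PER_PAGE}")
    # Hypothetical manual count, for illustration only.
    hypothetical_errors = 100
    rate = word_error_percentage(hypothetical_errors, TOTAL_WORDS)
    print(f"{hypothetical_errors} errors -> {rate:.2f}% of words")
```

Because the errors were counted manually, the function takes the count as an input and makes no attempt to detect errors itself.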
CLaDA-BG PROJECT
Test OCR with dictionary CLADABG-MODEL
Results
1) Percentage of incorrectly recognized words
   Built-in dictionary: 4.90%
   CLADABG-MODEL: 4.40%
   Combined: 4.50%
2) Average number of unrecognized line-break split words per page
   Built-in dictionary: 6.90
   CLADABG-MODEL: 5.55
   Combined: 6.05
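The reported figures can be compared against the built-in dictionary as a baseline. The short Python sketch below does that arithmetic; the dictionary keys and the function name `gains_over_baseline` are illustrative choices, while the numbers are the ones reported in the results. `Decimal` is used so the subtraction of two-decimal values stays exact.

```python
from decimal import Decimal

# (percentage of incorrect words, unrecognized split words per page)
RESULTS = {
    "built-in": (Decimal("4.90"), Decimal("6.90")),
    "CLADABG-MODEL": (Decimal("4.40"), Decimal("5.55")),
    "combined": (Decimal("4.50"), Decimal("6.05")),
}


def gains_over_baseline(results, baseline="built-in"):
    """Return each configuration's reduction relative to the baseline."""
    base_errors, base_split = results[baseline]
    return {
        name: (base_errors - errors, base_split - split)
        for name, (errors, split) in results.items()
        if name != baseline
    }


GAINS = gains_over_baseline(RESULTS)

if __name__ == "__main__":
    for name, (error_gain, split_gain) in GAINS.items():
        print(
            f"{name}: {error_gain} percentage points fewer incorrect words, "
            f"{split_gain} fewer unrecognized split words per page"
        )
```

For CLADABG-MODEL alone this yields 0.50 percentage points and 1.35 split words per page; for the combined configuration, 0.40 and 0.85.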
Conclusion
OCR with the CLADABG-MODEL dictionary is improved. The improvement, however, is not substantial: on average, 0.5 percentage points fewer incorrectly recognized words. Using CLADABG-MODEL also yields an average of 1.35 fewer unrecognized line-break split words per page.
OCR of pages in grayscale leads to a negligible improvement in recognition, which cannot justify scanning in that mode or the unnecessary file conversion.
The work on the dictionary will continue. There are ideas on how to improve and streamline the process to achieve higher recognition success.
THANK YOU!
Dimitar Minev, Director, National Library Ivan Vazov, Plovdiv, Bulgaria, dimin@libplovdiv.com
Ivan Kratchanov, Head of Digitization Centre, National Library Ivan Vazov, Plovdiv, Bulgaria, digitization@libplovdiv.com