Towards the Creation of a Robust Search Index for Digitalized Documents

by László Kovács, Máté Pataki, Tamás Füzessy and Zoltán Tóth

The simultaneous support of electronic and paper-based document handling is a natural demand of current filing and document management systems. To support better management of search and retrieval functions and to reduce the high costs of digitizing, the Department of Distributed Systems of SZTAKI analysed the different kinds of error that emerged during the digitization of Hungarian documents, and examined how these errors affect the searchability of the digitized items. To this end, a testbed was set up for the automatic analysis of digitized texts in a large corpus, and the conclusions and statistics obtained from the analysis were employed in the development of new content management products. The primary beneficiaries of these are civil service and higher-education bodies.

Today the ‘almost paperless office’ can be realized via post-digitization, or more precisely via scanning and OCR, as a huge number of documents still need to be digitized. For various reasons, errors may occur during the digitization process, so accuracy is an important issue when seeking the highest quality of full-text search. A search engine with high fault tolerance would therefore make texts more suitable for search and retrieval and would enhance their usability in practice, while considerably reducing the costs of digitizing, primarily because human post-processing to make corrections would be unnecessary.

The primary goal of the project was to build a metric for the errors introduced during the OCR process, particularly for those resulting in the loss or alteration of characters or accents, and to build a robust search index for digital repositories containing automatically digitized, error-prone documents.
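To make the idea of an OCR error metric concrete, the following minimal sketch computes a character error rate from the Levenshtein (edit) distance between a reference text and its OCR output. This is a common, generic measure and is an assumption for illustration only; the article does not specify the metric used in the project, and the function names are hypothetical.

```python
def edit_distance(reference: str, recognized: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    previous = list(range(len(recognized) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, rec_char in enumerate(recognized, start=1):
            # Cost of turning reference[:i] into recognized[:j].
            substitution = previous[j - 1] + (ref_char != rec_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, recognized: str) -> float:
    """Edit distance normalized by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, recognized) / len(reference)
```

Under this measure a lost accent (kör read as kor) counts as one edit, while ‘m’ read as 'rn' counts as two, since one character is substituted and one is inserted.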
Figure 1: Architecture of the testbed.

Testbed for the Evaluation of Digitalization Error Types

Actual Findings

Another example is the letter ‘m’, which was often recognized as 'rn' ('r' followed by 'n'). Hence, when searching for a word containing the letter ‘m’, one could also search for the same word with the 'm' replaced by 'rn'. The same substitution is used in the opposite direction by spammers to evade dictionary-based spam filters.

Further, our results confirm the hypothesis that errors related to accented characters such as é, á, ő and ö occur quite often. For example, the character 'o' has three accented variants in the Hungarian language (ö, ő, ó); together with the capital equivalents, this makes eight different but barely distinguishable characters for the OCR software. Even during post-processing, it is hard to tell which variant is the correct one, as there are many meaningful words that differ only in a single accent (e.g. kor, kór, kör). Complete statistics were gathered for the most common accented character identification errors.

The fault-tolerant search algorithm that was developed based on these findings has been integrated into the new versions of the Contentum content management product, and may also be used for further collaboration in European projects related to data repositories. In addition, together with the list of the most common character substitutions, the analysis and the algorithm may provide a good basis in the future for building a robust search index for digital repositories comprising digitized documents.
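The two findings above, the m/rn confusion and the frequent accent errors, suggest two simple fault-tolerance techniques: expanding a query term with known OCR confusions, and folding accents so that variants share one index key. The sketch below illustrates both. It is not the algorithm integrated into Contentum, which the article does not describe; the table `OCR_CONFUSIONS` and the function names are hypothetical, and only the m/rn pair is taken from the text.

```python
import itertools
import unicodedata

# Hypothetical confusion table: each character maps to the forms it may
# take in OCR output. Only the m/rn pair comes from the article.
OCR_CONFUSIONS = {"m": ["m", "rn"]}


def fold_accents(text: str) -> str:
    """Strip diacritics so that kor, kór and kör share one index key."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def expand_query(term: str) -> list[str]:
    """Return the term plus every variant with known confusions applied."""
    choices = [OCR_CONFUSIONS.get(ch, [ch]) for ch in term]
    return ["".join(parts) for parts in itertools.product(*choices)]


if __name__ == "__main__":
    print(expand_query("alma"))
    print(fold_accents("kör"))
```

Accent folding trades precision for recall: a search for kór would also match kor and kör, which may be acceptable when the alternative is missing a document whose accent was lost during OCR. The number of expanded variants grows exponentially with the number of confusable characters in a term, so a real index would likely limit the confusion table to the most frequent substitutions.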









