Publication: Analyzing the Content of Business Documents Recognized with a Large Number of Errors Using Modified Levenshtein Distance
Date
2022
Authors
Slavin, O.
Farsobina, V.
Myshev, A.
Abstract
© 2022, The Author(s), under exclusive license to Springer Nature Switzerland AG.
The chapter discusses methods for analyzing the textual content of business documents and lists the tasks that rely on word comparison. It outlines the specific features of analyzing recognized texts and describes a mechanism for identifying recognized words based on textual feature points. The advantages and disadvantages of the Levenshtein distance are listed, and other distances between strings are described: Jaro-Winkler similarity, the multiset metric, and the MFKC metric. The standard Levenshtein distance is compared with these other distances between two strings. A modification of the Levenshtein distance is proposed that takes the features of recognized characters into account. Experimental results demonstrate the effect of the proposed distance in comparison with normalized Levenshtein distances. The experiments cover data extraction from documents and document classification. The time spent computing the modified Levenshtein metric is also compared with that of the multiset metric. The proposed method can be applied in the recognition component of a modern CAD system to analyze the information in recognized text documents. It can also be used in a system that analyzes recognized text with the methods of computational linguistics.
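As an illustration of the general idea behind a recognition-aware edit distance, the following minimal Python sketch computes a Levenshtein distance with a pluggable substitution cost. It is not the authors' modification, which this record does not specify. The `CONFUSABLE` pairs, the 0.5 cost, the function names, and normalization by the longer string's length are all assumptions made for this example.

```python
from typing import Callable

# Hypothetical character pairs that OCR engines often confuse (illustrative only).
CONFUSABLE = {frozenset("O0"), frozenset("l1"), frozenset("I1"),
              frozenset("S5"), frozenset("B8")}


def unit_sub_cost(a: str, b: str) -> float:
    """Standard Levenshtein substitution cost: 0 for a match, 1 otherwise."""
    return 0.0 if a == b else 1.0


def ocr_sub_cost(a: str, b: str) -> float:
    """Assumed cost model: confusable pairs are cheaper to substitute."""
    if a == b:
        return 0.0
    if frozenset((a, b)) in CONFUSABLE:
        return 0.5  # assumed penalty, not taken from the chapter
    return 1.0


def weighted_levenshtein(s: str, t: str,
                         sub_cost: Callable[[str, str], float] = unit_sub_cost) -> float:
    """Edit distance with unit insert/delete costs and a custom substitution cost."""
    prev = [float(j) for j in range(len(t) + 1)]
    for i, cs in enumerate(s, start=1):
        curr = [float(i)]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1.0,                   # delete cs
                            curr[j - 1] + 1.0,               # insert ct
                            prev[j - 1] + sub_cost(cs, ct)))  # substitute
        prev = curr
    return prev[-1]


def normalized(s: str, t: str,
               sub_cost: Callable[[str, str], float] = unit_sub_cost) -> float:
    """Distance divided by the longer length (one common normalization)."""
    longest = max(len(s), len(t))
    return weighted_levenshtein(s, t, sub_cost) / longest if longest else 0.0
```

Under these assumptions, comparing `"INVOICE"` with the misrecognized `"INV0ICE"` costs 1.0 with the standard substitution cost and 0.5 with `ocr_sub_cost`, so a likely recognition error is penalized less than an arbitrary character change.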
Citation
Slavin, O. Analyzing the Content of Business Documents Recognized with a Large Number of Errors Using Modified Levenshtein Distance / Slavin, O., Farsobina, V., Myshev, A. // Studies in Systems, Decision and Control. - 2022. - 417. - P. 267-279. - 10.1007/978-3-030-95116-0_22