CN113420766B - Low-resource language OCR method fusing language information

Low-resource language OCR method fusing language information

Info

Publication number
CN113420766B
Authority
CN
China
Prior art keywords
language
ocr
low
resource language
resource
Prior art date
Legal status
Active
Application number
CN202110756557.8A
Other languages
Chinese (zh)
Other versions
CN113420766A (en)
Inventor
Feng Chong (冯冲)
Teng Jiahao (滕嘉皓)
Current Assignee
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN202110756557.8A priority Critical patent/CN113420766B/en
Publication of CN113420766A publication Critical patent/CN113420766A/en
Application granted granted Critical
Publication of CN113420766B publication Critical patent/CN113420766B/en
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/217Validation; Performance evaluation; Active pattern learning techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Character Discrimination (AREA)

Abstract

The invention discloses a low-resource language OCR method fusing language information, and belongs to the technical field of OCR. The method comprises the following steps: acquiring open-source text of a low-resource language and generating pictures from it, then enhancing the low-resource OCR training data based on image and character characteristics; selecting, based on inter-language similarity, a high-resource language highly similar to the low-resource language and migrating its OCR model to the low-resource language with a mixed fine-tuning migration strategy; scoring the OCR recognition result with a language model and using the score to judge whether the result contains errors; performing word-list detection on low-scoring sentences to locate and identify the wrong words; generating possible correction schemes through multi-strategy fusion, on the basis of the word list and the edit distance; and finally scoring each correction scheme of the OCR recognition sequence and selecting the optimal one. The method improves the OCR recognition accuracy that would otherwise be degraded by the scarcity of data resources in low-resource languages.

Description

Low-resource language OCR method fusing language information
Technical Field
The invention relates to a low-resource language OCR method fusing language information, in particular to a training method based on a mixed fine-tuning strategy and a text correction method fusing language information.
Background
Optical character recognition (OCR) technology simulates human visual intelligence: by processing and analyzing an image, it recognizes the character information contained in the image, combining the research fields of computer vision and natural language processing. The technology builds a bridge between the image and text information carriers, can quickly extract the text information in an image, and replaces manual re-entry.
As research results in the OCR field accumulate and the technology matures, the imbalance in research volume and technical capability among different languages is clearly widening. The reason is that high-performing OCR is realized with deep learning methods, which require the support of large-scale training data. Because the scarcity of low-resource language data resources cannot meet the requirements of deep learning methods, the shortfall in OCR capability for low-resource languages becomes increasingly prominent.
For OCR research on low-resource languages, overcoming the limitation of small data resources and exploring processing approaches that fuse language information are both worth pursuing. Deep learning fits data features well and has stronger feature expression capability than traditional methods, making it the mainstream approach of current OCR technology. However, deep learning is a big-data method: if the training set is small, the network model cannot accurately fit the characteristics of the data set and cannot perform well on the test set. At present, the common ways to address data scarcity are data enhancement and transfer learning. Mainstream data-enhancement methods improve data scale and diversity mainly from the computer-vision perspective, but cannot realize more effective enhancement from the characteristics of the language itself; existing transfer-learning methods lack cross-language knowledge sharing based on language similarity.
Furthermore, the recognition accuracy of current methods depends heavily on the input image and cannot fully exploit language-level information. OCR errors caused by poor picture quality, irregular characters, high character similarity and the like therefore cannot be optimized well. Applying natural language processing methods to correct the text in the OCR post-processing stage can effectively improve the final recognition performance.
In summary, a low-resource language OCR processing method fusing language information remains one of the problems to be solved urgently in the OCR field, and no system or related technical disclosure with good recognition performance for low-resource languages has been seen so far.
Disclosure of Invention
The invention aims to solve the problem that, limited by the lack of training data resources, existing low-resource language OCR lags far behind high-resource language OCR in recognition capability. It provides a low-resource language OCR method fusing language information, which first enhances the low-resource OCR training data set, then migrates the OCR model of a high-resource language to the low-resource language through a mixed fine-tuning migration strategy based on transfer learning, then constructs a word list of the low-resource language to find errors in the OCR recognition result and to serve as the basis for generating correction options, and finally performs OCR recognition based on the mixed fine-tuning strategy together with text correction on the pictures in the test set, improving the accuracy of the low-resource OCR result by fusing language information.
In order to achieve the purpose, the invention adopts the following technical scheme:
the low-resource language OCR method fusing the language information comprises the following steps:
step 1: acquiring open source text data of low resource languages, generating pictures, enhancing an OCR (optical character recognition) training data set of the low resource languages based on image characteristics and character characteristics, and improving the robustness of a model;
step 2: based on migration learning and similarity between languages, comprehensively analyzing the similarity of the selected high resource language and the low resource language from the linguistic perspective, selecting the high resource language with high similarity to the low resource language, and migrating the OCR model of the high resource language to the OCR task of the low resource language through a mixed fine tuning migration strategy;
step 3: training an OCR model of the low-resource language and constructing a word list of the low-resource language, used to find errors in the OCR recognition result and as a basis for generating correction options. Specifically: the score that a language model of the low-resource language gives the OCR recognition result is used to judge whether the result contains errors; meanwhile, word-list detection is performed on low-scoring sentences to locate the wrong words, which serve as the basis for correction options;
step 4: for the basis of the correction options, namely the categories of the identified error words, adopting multi-strategy fusion and generating the possible correction options corresponding to each error word for the common error types in OCR results, on the basis of the edit distance and the word list;
step 5: scoring each correction scheme of the OCR recognition sequence with the language model, and selecting the optimal correction scheme from the possible correction options of step 4 according to the language-model scores.
Advantageous effects
Compared with the prior art, the low-resource language OCR method fusing language information has the following beneficial effects:
1. the method designs data enhancement that combines image characteristics and text characteristics; such enhancement fits OCR more closely and addresses the data-resource shortage of minority languages in a targeted manner;
2. the method designs a learning-cost quantification for low-resource language samples and a transfer-learning strategy based on mixed fine-tuning; combined, they realize effective transfer of OCR knowledge from high-resource to low-resource languages, let the model better fit the characteristics of the data set, and improve OCR performance under scarce data;
3. the method corrects the text in the OCR post-processing step by fusing language information, and simply discovers and locates wrong words in the recognition result by matching a language model against a word list;
4. when generating correction options, a multi-strategy mixture is used for the different common OCR error types, producing finer-grained correction options with more comprehensive coverage; the screening step combines the language model to improve correction accuracy.
Drawings
FIG. 1 is a flow chart of the mixed fine-tuning migration strategy in the low-resource language OCR method fusing language information according to the present invention;
FIG. 2 is a flow chart of the multi-strategy-fusion text correction method in the low-resource language OCR method fusing language information according to the present invention.
Detailed Description
The low-resource language OCR method fusing language information of the present invention is further described below with reference to the accompanying drawings.
Example 1
The low-resource language OCR method fusing language information of the invention comprises: acquiring open-source text of a low-resource language, generating pictures from it, and enhancing the low-resource OCR training data based on image and character characteristics; selecting, based on inter-language similarity, a high-resource language highly similar to the low-resource language and applying a mixed fine-tuning migration strategy to migrate its OCR model to the low-resource language; scoring the recognition result of the OCR model and using the score as the basis for judging whether the result contains errors; performing word-list detection on low-scoring sentences to locate the misrecognized words; generating possible correction schemes through multi-strategy fusion on the basis of the word list and the edit distance; and finally scoring each correction scheme of the OCR recognition sequence and selecting the optimal one. In a specific implementation, the method comprises the following steps:
step 1: the method comprises the steps of obtaining open source text data of low resource languages, generating pictures, enhancing an OCR training data set of the low resource languages based on image characteristics and character characteristics, and improving model robustness.
Specifically, an open-source text corpus of the low-resource language is first obtained with crawler technology, and pictures are generated from the text. The generated pictures are then data-enhanced with respect to text characteristics and image characteristics.
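As an illustration of the picture-generation step, the following sketch renders a line of crawled text into a grayscale training image with the Pillow library; the font file name and the sample sentences are illustrative assumptions, not part of the patent.

```python
# Illustrative sketch: rendering a crawled text line into a grayscale training
# image with Pillow. The font path and the sample sentences are placeholders.
from PIL import Image, ImageDraw, ImageFont

def render_text_image(text, font_path="NotoSans-Regular.ttf", font_size=32,
                      padding=10, bg=255, fg=0):
    """Render one line of text onto a light background image."""
    font = ImageFont.truetype(font_path, font_size)
    left, top, right, bottom = font.getbbox(text)   # tight bounding box of the text
    size = (right - left + 2 * padding, bottom - top + 2 * padding)
    img = Image.new("L", size, color=bg)
    ImageDraw.Draw(img).text((padding - left, padding - top), text, font=font, fill=fg)
    return img

if __name__ == "__main__":
    for i, line in enumerate(["Magandang umaga", "Salamat po"]):
        render_text_image(line).save(f"ocr_sample_{i}.png")
```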
The data enhancement includes the following modes (an illustrative code sketch follows this list):
(1) Translation transformation: a data-enhancement means targeting the image itself, i.e. moving the input image in the horizontal and vertical directions. Random translations by different pixel values in the two directions are applied, while ensuring that the number of characters in the translated image is unchanged. Since part of the image becomes empty after translation, the empty part is filled with the background of the original image.
(2) Background noise: a data-enhancement means targeting the image itself, i.e. when generating the image from text, a noisy background is used instead of a blank one, simulating a document picture of poor paper quality so that the model becomes insensitive to the text background.
(3) Text distortion and blurring: data-enhancement means targeting both character and image characteristics, i.e. distortion transformation or blurring of the text in the image. Distortion simulates the deformation of characters caused by factors such as angle inclination when a character image is captured; blurring simulates a low-quality picture or a small picture enlarged. Simulating situations that may be encountered in actual recognition improves the generalization capability of the model.
(4) Font transformation: a data-enhancement means targeting the characters themselves, i.e. multiple fonts are used when generating images from text, so that the writing styles of the characters are more diverse and the model's ability to fit and recognize each character is strengthened.
(5) Linguistics-based character-level enhancement: for the special characters of the low-resource language, data enhancement is performed by raising their frequency of occurrence, and the word-formation rules of the language can be combined during enhancement.
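The sketch below illustrates the image-level enhancements listed above (random translation with background filling, background noise, and blurring) using Pillow and NumPy; all parameter values are illustrative and not taken from the patent.

```python
# Illustrative image-level enhancements: random translation with background
# fill, Gaussian background noise, and blur. Parameter values are placeholders.
import numpy as np
from PIL import Image, ImageFilter

def random_translate(img, max_shift=5):
    """Shift the image by random pixel offsets, filling gaps with the corner background colour."""
    dx, dy = np.random.randint(-max_shift, max_shift + 1, size=2)
    bg = img.getpixel((0, 0))                     # assume the corner pixel is background
    canvas = Image.new(img.mode, img.size, color=bg)
    canvas.paste(img, (int(dx), int(dy)))
    return canvas

def add_background_noise(img, sigma=12.0):
    """Add Gaussian noise to simulate poor-quality paper backgrounds."""
    arr = np.asarray(img).astype(np.float32)
    noisy = arr + np.random.normal(0.0, sigma, arr.shape)
    return Image.fromarray(np.clip(noisy, 0, 255).astype(np.uint8))

def blur(img, radius=1.5):
    """Gaussian blur to mimic low-quality or up-scaled pictures."""
    return img.filter(ImageFilter.GaussianBlur(radius))
```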
Step 2: based on the similarity between languages, a high-resource language with high similarity to the low-resource language is selected, and its OCR model is migrated to the low-resource OCR task through a mixed fine-tuning migration strategy.
Specifically, the high-resource language with the highest similarity to the low-resource language to be recognized is selected first. An industry-leading OCR model is then pre-trained on the OCR data set of the high-resource language to obtain a source model. Finally, the high-resource OCR model is migrated to the low-resource language with the mixed fine-tuning migration strategy.
The mixed fine-tuning migration strategy comprises the following steps:
As shown in fig. 1, the special characters specific to the low-resource language are first counted by comparing the alphabets of the high-resource and low-resource languages. The learning cost of each low-resource training sample is then calculated to measure how difficult the sample is to learn; the quantification is as follows:
[Learning-cost formula, given as an image in the original publication.]
wherein L_C denotes the learning cost of the training sample, L is the total length of the text sequence in a picture, Avr_L is the average word length in the low-resource training set, C_S is the number of special text characters in the single picture that differ from the high-resource language, and α is the difficulty weight of the special text characters. In this embodiment, the difficulty weight α is 5.
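For illustration only, the following sketch quantifies the learning cost from the quantities defined above; since the exact formula is given only as an image in the source, the combination used here (sequence length normalized by Avr_L plus the α-weighted special-character count) is an assumption.

```python
# Hedged sketch of the learning-cost quantification; the exact formula appears
# only as an image in the source, so this particular combination is assumed.
def learning_cost(text, special_chars, avg_word_len, alpha=5.0):
    """Estimate how hard one training sample is to learn.

    text          -- ground-truth text of one training picture
    special_chars -- set of characters specific to the low-resource language
    avg_word_len  -- average word length in the low-resource training set (Avr_L)
    alpha         -- difficulty weight of special characters (5 in the embodiment)
    """
    L = len(text)                                        # total sequence length
    C_S = sum(1 for ch in text if ch in special_chars)   # special-character count
    return L / avg_word_len + alpha * C_S                # assumed combination
```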
Next, part of the high-resource language data used in pre-training is randomly selected and mixed with the low-resource OCR training data in proportion; the mixing ratio x:(100−x) is satisfied by randomly oversampling the low-resource data. In this embodiment, x is 30.
All samples in the low-resource training data set are then sorted in ascending order of learning cost. During training, low-resource data are selected batch by batch in order of increasing learning cost, mixed with the selected high-resource images at the ratio x%:(100−x)%, and the model is trained with the mixed fine-tuning strategy.
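A rough sketch of this data selection and mixing, assuming each sample is an (image_path, label, learning_cost) tuple; the per-batch oversampling scheme below is an illustrative simplification of the x:(100−x) mixing.

```python
# Sketch of mixing logic: low-resource samples sorted by ascending learning
# cost, each batch mixed with high-resource samples at roughly x:(100-x).
import random

def build_mixed_batches(low_samples, high_samples, x=30, batch_size=32, seed=0):
    rng = random.Random(seed)
    low_sorted = sorted(low_samples, key=lambda s: s[2])   # ascending learning cost
    n_low = max(1, round(batch_size * x / 100))
    n_high = batch_size - n_low
    batches, i = [], 0
    while i < len(low_sorted):
        low_part = low_sorted[i:i + n_low]
        # Oversample by repetition if the tail of the sorted list is short.
        while len(low_part) < n_low:
            low_part.append(rng.choice(low_sorted))
        high_part = rng.sample(high_samples, n_high)       # assumes enough high-resource data
        batches.append(low_part + high_part)
        i += n_low
    return batches
```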
Step 3: the score that the low-resource language model gives the OCR recognition result is used as the basis for judging whether the result contains errors. Meanwhile, word-list detection is performed on low-scoring sentences to locate the misrecognized words.
Specifically, open-source text of the low-resource language is first obtained with crawler technology, a language model LowRe_{N-gram} is trained on this data with a statistical language-model tool, and the average conditional probability value P of the resulting language model on low-resource text is counted. Meanwhile, a word list WordDict of the language is obtained with crawler technology.
Then the low-resource OCR test set is recognized with the low-resource OCR model obtained in step 2 to obtain the text corresponding to each image. The conditional probability value P_OCR of the text recognized by the OCR model is computed with LowRe_{N-gram}: if P ≤ P_OCR, the recognition result is considered correct; if P > P_OCR, the recognition result may contain errors and needs to be corrected.
The possibly erroneous text sequence is then segmented into words and checked word by word against the word list: if the word list contains the word, the next word is checked; if not, the word is regarded as an error word.
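An illustrative sketch of this detection step, assuming a KenLM n-gram model (the tool used in the embodiment below) and plain-text word-list and model files; the file names and the reference score P are placeholders.

```python
# Sketch of error detection with a KenLM n-gram model and a word list.
# kenlm.Model.score returns a log10 probability; file names are illustrative.
import kenlm

lm = kenlm.Model("lowre_3gram.arpa")                     # LowRe_{N-gram}
vocab = set(line.strip().lower() for line in open("worddict.txt", encoding="utf-8"))
P = -2.1   # illustrative average per-word log10 probability on held-out text

def detect_errors(ocr_text):
    words = ocr_text.split()
    p_ocr = lm.score(ocr_text, bos=True, eos=True) / max(len(words), 1)
    if p_ocr >= P:                                       # P <= P_OCR: accept the result
        return []
    # Otherwise locate candidate error words by vocabulary matching.
    return [w for w in words if w.lower() not in vocab]
```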
Step 4: for the different categories of error words, a multi-strategy fusion method is adopted, and possible correction schemes are generated on the basis of the word list and the edit distance.
According to their common manifestations, OCR recognition errors fall into two types: word spelling errors and word adhesion errors. Spelling errors appear as a few characters within a word being replaced, inserted, repeated or deleted; adhesion errors appear when the space between two words is not recognized and several words are concatenated.
First, it is judged whether the error word can be divided into several words of the word list WordDict; if it can, all division schemes are searched.
Specifically, the low-resource word list WordDict is stored in a dictionary tree (trie); this data structure effectively prunes the search space and speeds up the search. Then, as shown in fig. 2, a dynamic-programming method is used on the trie-structured WordDict to determine, one by one, whether each error word in the sequence can be segmented into several words of WordDict. If it can, a recursive backtracking method searches all segmentation schemes, and every feasible scheme is added to the set of possible correction schemes W_k-set of that error word; if it cannot be segmented, the word is considered not to belong to the adhesion error type.
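The following sketch shows one way to implement the trie storage, the dynamic-programming divisibility check, and the recursive backtracking enumeration of segmentations; it is an illustration, not the patented implementation.

```python
# Illustrative trie + dynamic programming + backtracking for adhesion errors.
class TrieNode:
    __slots__ = ("children", "is_word")
    def __init__(self):
        self.children, self.is_word = {}, False

def build_trie(words):
    root = TrieNode()
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root

def segmentations(s, root):
    """Return every way of splitting s into dictionary words ([] if none)."""
    n = len(s)
    # reachable[i] is True if the suffix s[i:] can be fully segmented.
    reachable = [False] * (n + 1)
    reachable[n] = True
    for i in range(n - 1, -1, -1):
        node = root
        for j in range(i, n):
            node = node.children.get(s[j])
            if node is None:
                break
            if node.is_word and reachable[j + 1]:
                reachable[i] = True
                break
    if not reachable[0]:
        return []
    results = []
    def backtrack(i, path):
        if i == n:
            results.append(list(path))
            return
        node = root
        for j in range(i, n):
            node = node.children.get(s[j])
            if node is None:
                break
            if node.is_word and reachable[j + 1]:
                path.append(s[i:j + 1])
                backtrack(j + 1, path)
                path.pop()
    backtrack(0, [])
    return results
```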
Then, regardless of whether the word is divisible, a correction scheme is also generated by searching the word list for words whose Levenshtein distance to the error word is within a threshold α.
Specifically, the low-resource word list WordDict is stored in a BK-tree; this data structure effectively prunes the search space and speeds up the search. Then, as shown in fig. 2, for each error word in the sequence, the BK-tree-structured WordDict is searched for all words within the threshold α, and every such word is added to the set of possible correction schemes W_k-set of that error word. In this embodiment, the threshold α is 2.
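The sketch below shows a BK-tree over WordDict with Levenshtein-distance queries; the triangle inequality lets whole subtrees whose edge label lies outside [d − α, d + α] be pruned. The implementation details are illustrative.

```python
# Illustrative BK-tree for finding words within Levenshtein distance alpha.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

class BKTree:
    def __init__(self, words):
        it = iter(words)
        self.root = (next(it), {})            # node = (word, {distance: child})
        for w in it:
            self.add(w)

    def add(self, word):
        node = self.root
        while True:
            d = levenshtein(word, node[0])
            if d == 0:
                return                         # already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def query(self, word, alpha=2):
        """Return all stored words within Levenshtein distance alpha of word."""
        found, stack = [], [self.root]
        while stack:
            w, children = stack.pop()
            d = levenshtein(word, w)
            if d <= alpha:
                found.append(w)
            # Triangle inequality: only children keyed in [d-alpha, d+alpha] can match.
            for k, child in children.items():
                if d - alpha <= k <= d + alpha:
                    stack.append(child)
        return found
```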
Step 5: each correction scheme of the OCR recognition sequence is scored with the language model, and the optimal correction scheme is selected according to the language-model scores.
As shown in FIG. 2, the possible correction schemes W_k-set of each error word in the sequence from step 4 are combined by full permutation, and the low-resource language model LowRe_{N-gram} is used to compute the conditional probability value P_k-set of the new sequence corresponding to each correction scheme; the maximum is recorded as Max_P_k-set. Max_P_k-set is then compared with P_OCR:
(1) if Max_P_k-set is greater than P_OCR, the correction scheme improves the score of the sentence and should be adopted;
(2) if Max_P_k-set is less than or equal to P_OCR, no correction scheme improves the score of the sentence; all correction schemes are considered insufficient and none is adopted.
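An illustrative sketch of this selection step: itertools.product enumerates the full permutation of per-word candidates, each rewritten sequence is re-scored, and a rewrite is returned only if its score exceeds P_OCR. The lm_score callable and the data layout are assumptions.

```python
# Sketch of candidate combination and language-model re-scoring.
from itertools import product

def best_correction(tokens, error_candidates, lm_score, p_ocr):
    """tokens: OCR word sequence; error_candidates: {position: [candidate strings]}."""
    positions = sorted(error_candidates)
    best_text, best_score = None, p_ocr
    for combo in product(*(error_candidates[p] for p in positions)):
        fixed = list(tokens)
        for pos, replacement in zip(positions, combo):
            fixed[pos] = replacement           # a replacement may itself contain spaces
        candidate = " ".join(fixed)
        score = lm_score(candidate)            # running maximum plays the role of Max_P_k-set
        if score > best_score:
            best_text, best_score = candidate, score
    return best_text                           # None means no scheme beats P_OCR
```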
Example verification: experiments were conducted on a Filipino OCR task.
(1) Experimental procedure: Filipino is selected as the low-resource language and, after comprehensive linguistic analysis, English is selected as the corresponding high-resource language. Because no publicly available Filipino OCR data set exists in academia or industry, Filipino OCR data are constructed and enhanced according to the method in step 1, and the PPOCRLabel annotation tool is used to complete the construction of the Filipino OCR data set. The data set is divided into a training set, a development set and a test set containing 20,800, 2,600 and 2,600 images respectively, 26,000 images in total.
Following the fine-tuning method in step 2, an English PaddleOCR model pre-trained on a large-scale English OCR data set is first used as the source model, and its character recognition network is fine-tuned to obtain a PaddleOCR model capable of recognizing Filipino. Then, by comparing the English and Filipino alphabets, the characters specific to Filipino are identified as the language-specific letters shown in the source figure, together with combinations of the vowels A/a, E/e, I/i, O/o and U/u with an acute accent (´), a grave accent (`) or a circumflex (^). Based on this analysis, the learning cost of each image in the Filipino OCR training set is calculated with the formula in step 2, and the images are sorted in ascending order of learning cost. Next, 60,000 English OCR images are introduced, the Filipino training set is randomly oversampled so that Filipino and English images are mixed at a ratio of 30:70, and the oversampled mixed data are fed into the model in ascending order of learning cost to train the character recognition network.
Following step 3, 5,000 items of Filipino text are first collected from the web with crawler technology; 1,000 of them are randomly extracted as a test set and the remaining text forms the training set. The KenLM statistical language-model tool is used to train a Filipino trigram model LowRe_{N-gram}, and the average score of LowRe_{N-gram} over the test-set text is recorded as P. Meanwhile, a Filipino word list WordDict is obtained with crawler technology. The Filipino PaddleOCR model obtained in step 2 is then used to recognize the Filipino OCR test set and obtain the text corresponding to each image. The conditional probability value P_OCR of the text recognized by the OCR model is computed with the Filipino trigram LowRe_{N-gram}: if P ≤ P_OCR, the recognition result is considered correct; if P > P_OCR, it may contain errors and needs correction. The possibly erroneous text sequences are then segmented into words and checked word by word against the word list: if the word list contains the word, the next word is checked; otherwise the word is regarded as an error word.
Following steps 4 and 5, to correct the erroneous Filipino words in the recognition result, the conditional probability of each generated correction scheme is computed with the Filipino trigram model LowRe_{N-gram}, and the scheme with the maximum value Max_P_k-set is selected. Whether the correction scheme generated by the algorithm is adopted is then decided by comparing Max_P_k-set with the conditional probability value P_OCR of the text recognized by the OCR model.
(2) Experimental evaluation index: accuracy (Accuracy) is used as the evaluation index, calculated as follows:
[Accuracy formula, given as an image in the original publication.]
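Since the accuracy formula appears only as an image in the source, the sketch below assumes the common sequence-level exact-match definition (correctly recognized test samples divided by the total number of test samples).

```python
# Hedged sketch: sequence-level exact-match accuracy (assumed definition).
def accuracy(predictions, references):
    assert len(predictions) == len(references)
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

# Example: accuracy(["kumusta ka"], ["kumusta ka"]) == 1.0
```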
(3) Baseline system experimental setup
Tesseract OCR: an open-source OCR engine supporting OCR in over 60 languages including Filipino; it provides a method for self-training character sets and supports extending the set of recognized characters.
Faster RCNN + CRNN + CTC: the industry-leading text-detection model Faster RCNN and the character-recognition model CRNN + CTC are fused in series. Faster RCNN consists of a convolutional layer, an RPN layer, a RoI Pooling layer and a classification layer, and determines text candidate boxes based on the idea of object detection.
SegLink + CRNN + CTC: the industry-leading text-detection model SegLink (linking segments) and the character-recognition model CRNN + CTC are fused in series. SegLink uses the texture characteristics of text to divide a word into several small segments that are easier to detect, connects adjacent segments with directed links and merges them into a complete text line; the model can recognize inclined text, which improves its performance.
PaddleOCR: the open-source OCR model of Baidu, which fuses the industry-leading text-detection model DBNet and the character-recognition model CRNN + CTC in series and applies several lightweight strategies that effectively reduce the number of model parameters.
(4) The main experimental results are shown in table 1.
TABLE 1 Experimental results of different OCR recognition methods
Experimental method    Accuracy
Tesseract OCR    0.9133
Faster RCNN+CRNN+CTC    0.9311
SegLink+CRNN+CTC    0.9244
PaddleOCR    0.9233
PaddleOCR-Fine-tune CRNN    0.9478
PaddleOCR-Mixed Fine-tune    0.9489
PaddleOCR-Mixed Fine-tune + Data Selection    0.9533
PaddleOCR-Mixed Fine-tune + Data Selection + Spelling Correction    0.9589
PaddleOCR-Mixed Fine-tune + Data Selection + Spelling Correction + Adhesion Correction    0.9611
As shown in Table 1, the multiple strategies designed in the method significantly improve the OCR performance for the low-resource language Filipino. First, PaddleOCR, Tesseract OCR, Faster RCNN + CRNN + CTC and SegLink + CRNN + CTC are all industry-leading OCR systems or models and reach a Filipino recognition accuracy above 0.91, but still leave room for further optimization. The PaddleOCR model achieves accuracy similar to the other three baselines while clearly reducing the number of model parameters, so choosing it to verify the performance of the method is reasonable.
Second, the results show that realizing cross-language sharing of character-recognition knowledge through transfer learning, based on the linguistic similarity between the low-resource and high-resource languages, is feasible. Migrating the English character-recognition model to Filipino with conventional transfer learning raises the recognition accuracy on the Filipino OCR test set from 0.9233 to 0.9478; applying the mixed fine-tuning migration strategy fits the characteristics of the Filipino OCR data set better and further improves the Filipino OCR model, reaching 0.9533 on the test set.
Third, the method corrects the text of the OCR result in a post-processing step, fusing language information through the edit distance, word-list matching and language models, so that errors in the Filipino OCR result can be located and corrected. Applying the multi-strategy fusion correction, spelling errors and adhesion errors in the recognition result are corrected separately, further raising the final recognition accuracy to 0.9611.
The above embodiments are only preferred embodiments of the present invention and are not intended to limit it; those skilled in the art can easily make various changes and modifications according to the main concept and spirit of the present invention, so the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (5)

1. A low-resource language OCR method fusing language information is characterized by comprising the following steps:
step 1: acquiring open source text data of low resource languages, generating pictures, enhancing an OCR (optical character recognition) training data set of the low resource languages based on image characteristics and character characteristics, and improving the robustness of a model;
step 2: based on migration learning and similarity between languages, comprehensively analyzing the similarity of the selected high resource language and the low resource language from the linguistic perspective, selecting the high resource language with high similarity to the low resource language, and migrating the OCR model of the high resource language to the OCR task of the low resource language through a mixed fine tuning migration strategy;
step 3: training an OCR model of the low-resource language and constructing a word list of the low-resource language, used to find errors in the OCR recognition result and as a basis for generating correction options, specifically: judging whether the recognition result contains errors by taking the score of the low-resource language model on the OCR recognition result as the judgment basis; meanwhile, performing word-list detection on low-scoring sentences, locating and identifying the wrong words, and using them as the basis for correction options, wherein step 3 comprises the following steps:
firstly, obtaining open-source text of the low-resource language with crawler technology, training a language model LowRe_{N-gram} on this data with a statistical language-model tool, and counting the average conditional probability value P of the obtained language model on the low-resource text; meanwhile, obtaining a word list WordDict of the language with crawler technology;
then, recognizing the low-resource OCR test set with the low-resource OCR model obtained in step 2 to obtain the text corresponding to each image; computing with the language model LowRe_{N-gram} the conditional probability value P_OCR of the text recognized by the OCR model; if P ≤ P_OCR, the recognition result is considered correct, and if P > P_OCR, the recognition result may contain errors and needs to be corrected;
then, performing word segmentation on the text sequence possibly with errors, judging word by word in a word list matching mode, if the word list contains the word, continuing to check the next word, and if the word list does not contain the word, considering the word as an error word;
step 4: for the correction-option basis, namely the categories of the identified error words, adopting multi-strategy fusion and generating the possible correction options corresponding to each error word for the common error types in OCR results, on the basis of the edit distance and the word list, wherein step 4 comprises the following steps:
firstly, judging whether a wrong word can be divided into a plurality of words in a word list WordDict or not, and searching out all division schemes if the word can be divided;
then, regardless of whether the word can be divided, generating a correction scheme by searching the word list for words whose Levenshtein distance to the error word is within a threshold α;
step 5: scoring each correction scheme of the OCR recognition sequence with the language model, and selecting the optimal correction scheme from the possible correction options of step 4 according to the scoring result of the language model.
2. The OCR method of low resource language fused with language information according to claim 1, wherein step 1 is to use crawler technology to obtain open source text of low resource language and generate pictures according to text information; and combining a character level enhancement mode based on linguistics, aiming at special characters in a low-resource language, performing data enhancement by improving the occurrence frequency, and generating a picture by combining word formation rules of the language during enhancement.
3. A language information-fused low-resource language OCR method as recited in claim 2, wherein the step 2 includes the steps of:
step 2.1: selecting the high-resource language with the highest similarity to the low-resource language to be recognized;
step 2.2: selecting an industry-leading OCR model and pre-training it on the OCR data set of the high-resource language to obtain a source model;
step 2.3: migrating the high-resource-language OCR model to the low-resource language with a mixed fine-tuning migration strategy.
4. A language information-fused low-resource language OCR method as claimed in claim 3, wherein said hybrid fine-tuning migration strategy in step 2.3 includes the following steps:
step 3.1: counting the special characters specific to the low-resource language by comparing the alphabets of the high-resource and low-resource languages; further calculating the learning cost (Learning Cost) of each low-resource training sample to measure its learning difficulty, wherein the quantification method is as follows:
[Learning-cost formula, given as an image in the original publication.]
wherein L_C represents the learning cost of the training sample, L represents the total length of the text sequence in a picture, Avr_L is the average word length in the low-resource training set, C_S represents the number of special text characters in the single picture that differ from the high-resource language, and α is the difficulty weight of the special text characters;
step 3.2: randomly selecting part of the high-resource language data used in pre-training and mixing it with the low-resource OCR training data in proportion, the mixing ratio x:(100−x) being satisfied by randomly oversampling the low-resource data;
step 3.3: arranging all samples in the low-resource training data set in ascending order of learning cost; during training, selecting low-resource data batch by batch in order of increasing learning cost, mixing them with the selected high-resource images at the ratio x%:(100−x)%, and training the model with the mixed fine-tuning strategy.
5. The low resource language OCR method of fusing language information according to claim 4, characterized in that step 5 includes the following steps:
combining the possible correction schemes of each error word in the sequence from step 4 by full permutation, computing with the low-resource language model LowRe_{N-gram} the conditional probability value P_k-set of the new sequence corresponding to each correction scheme, and recording the maximum as Max_P_k-set; further comparing the numerical relationship between Max_P_k-set and P_OCR:
(1) if Max_P_k-set is greater than P_OCR, the correction scheme improves the score of the sentence and should be adopted;
(2) if Max_P_k-set is less than or equal to P_OCR, no correction scheme improves the score of the sentence; all correction schemes are considered insufficient and none is adopted.
CN202110756557.8A 2021-07-05 2021-07-05 Low-resource language OCR method fusing language information Active CN113420766B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110756557.8A CN113420766B (en) 2021-07-05 2021-07-05 Low-resource language OCR method fusing language information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110756557.8A CN113420766B (en) 2021-07-05 2021-07-05 Low-resource language OCR method fusing language information

Publications (2)

Publication Number Publication Date
CN113420766A CN113420766A (en) 2021-09-21
CN113420766B (en) 2022-09-16

Family

ID=77720151

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110756557.8A Active CN113420766B (en) 2021-07-05 2021-07-05 Low-resource language OCR method fusing language information

Country Status (1)

Country Link
CN (1) CN113420766B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116740688B (en) * 2023-08-11 2023-11-07 武汉市中西医结合医院(武汉市第一医院) Medicine identification method and system
CN116977436B (en) * 2023-09-21 2023-12-05 小语智能信息科技(云南)有限公司 Burmese text image recognition method and device based on Burmese character cluster characteristics

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6205261B1 (en) * 1998-02-05 2001-03-20 At&T Corp. Confusion set based method and system for correcting misrecognized words appearing in documents generated by an optical character recognition technique
CN109753968A (en) * 2019-01-11 2019-05-14 北京字节跳动网络技术有限公司 Generation method, device, equipment and the medium of character recognition model
CN109766879A (en) * 2019-01-11 2019-05-17 北京字节跳动网络技术有限公司 Generation, character detection method, device, equipment and the medium of character machining model
CN112149678A (en) * 2020-09-17 2020-12-29 支付宝实验室(新加坡)有限公司 Character recognition method and device for special language and recognition model training method and device
CN112232337A (en) * 2020-10-09 2021-01-15 支付宝实验室(新加坡)有限公司 Matching method of special language characters and information verification method and device
CN112613502A (en) * 2020-12-28 2021-04-06 深圳壹账通智能科技有限公司 Character recognition method and device, storage medium and computer equipment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Low-resource Languages: A Review of Past Work and Future Challenges; Alexandre Magueresse et al.; arXiv; 2020-06-12; pp. 1-14 *
Multilingual text recognition and translation APP based on the Android platform; Tang Ruihan et al.; Journal of Xiamen University of Technology; 2017-10-30 (No. 05); pp. 67-72 *
A review of neural machine translation for scarce-resource languages; Li Hongzheng et al.; Acta Automatica Sinica; 2021-06-30; Vol. 47, No. 6; pp. 1217-1231 *

Also Published As

Publication number Publication date
CN113420766A (en) 2021-09-21

Similar Documents

Publication Publication Date Title
WO2021135444A1 (en) Text error correction method and apparatus based on artificial intelligence, computer device and storage medium
CN109800414B (en) Method and system for recommending language correction
CN111062376A (en) Text recognition method based on optical character recognition and error correction tight coupling processing
CN105279149A (en) Chinese text automatic correction method
CN113420766B (en) Low-resource language OCR method fusing language information
CN111062397A (en) Intelligent bill processing system
WO2009035863A2 (en) Mining bilingual dictionaries from monolingual web pages
CN110853625B (en) Speech recognition model word segmentation training method and system, mobile terminal and storage medium
CN116244445B (en) Aviation text data labeling method and labeling system thereof
CN109684928B (en) Chinese document identification method based on internet retrieval
CN111259151A (en) Method and device for recognizing mixed text sensitive word variants
CN111737968A (en) Method and terminal for automatically correcting and scoring composition
CN110276069A (en) A kind of Chinese braille mistake automatic testing method, system and storage medium
CN111241824B (en) Method for identifying Chinese metaphor information
CN110457715B (en) Method for processing out-of-set words of Hanyue neural machine translation fused into classification dictionary
CN112989806A (en) Intelligent text error correction model training method
CN114817570A (en) News field multi-scene text error correction method based on knowledge graph
CN109255117A (en) Chinese word cutting method and device
CN110751234A (en) OCR recognition error correction method, device and equipment
CN113221542A (en) Chinese text automatic proofreading method based on multi-granularity fusion and Bert screening
Pal et al. OCR error correction of an inflectional indian language using morphological parsing
CN111460147A (en) Title short text classification method based on semantic enhancement
CN113657122A (en) Mongolian Chinese machine translation method of pseudo-parallel corpus fused with transfer learning
CN112765977A (en) Word segmentation method and device based on cross-language data enhancement
Aliwy et al. Corpus-based technique for improving Arabic OCR system

Legal Events

Code    Description
PB01    Publication
SE01    Entry into force of request for substantive examination
GR01    Patent grant