CN113269192B

CN113269192B - OCR post-processing method based on word matching and grammar matching

Info

Publication number: CN113269192B
Application number: CN202110567957.4A
Authority: CN
Inventors: 薛翔天; 孔祥龙
Original assignee: Southeast University
Current assignee: Southeast University
Priority date: 2021-05-24
Filing date: 2021-05-24
Publication date: 2024-04-30
Anticipated expiration: 2041-05-24
Also published as: CN113269192A

Abstract

The invention discloses an OCR post-processing method based on word matching and grammar matching. Single-character recognition yields, for each character, a set of the top K recognition results. For each text segment, the highest-probability recognition result of each character is taken to form a preliminary sequence, which is then segmented into words. A corpus-based word matching operation is applied to the multi-character words obtained by segmentation, and the word combination with the highest probability, fused with the probabilities from the pre-recognition module, is selected to update the word. A grammar matching operation is applied to the single-character words obtained by segmentation: the K recognition results are each checked by grammatical analysis, and the most probable remaining result is used to update the word. The fused results of the two operations are output as the post-processing result. The invention fully exploits the syntactic information of the text, processes single-character and multi-character words separately, and shows good adaptability. Compared with traditional techniques based on word matching alone, it has clearer advantages and higher application value, especially on lower-quality text.

Description

OCR post-processing method based on word matching and grammar matching

Technical Field

The invention relates to an OCR post-processing method based on word matching and grammar matching, belonging to the technical field of OCR processing.

Background

OCR (Optical Character Recognition) uses optical and computer technology to read text printed or written on paper and convert it into a format that a computer can accept and a person can understand. It is a relatively broad problem whose requirements, standards, and fault tolerance vary across specific scenarios. The OCR workflow is generally divided into three steps: text detection, text recognition, and post-processing. Post-processing is an important component of OCR because recognition errors caused by environmental noise or by visually similar characters are very common, and it is often desirable to correct them using a corpus and context information. Classical algorithms are of two types: 1) an improved BK-tree based on a prior dictionary; 2) an error correction mechanism based on a language model.

Current post-processing mainly targets multi-character words, and few methods handle single-character words. To address this problem, the invention integrates a word matching technique for multi-character words with a grammar matching technique for single-character words, thereby improving the effect of OCR post-processing.

Disclosure of Invention

The invention discloses an OCR (Optical Character Recognition) post-processing method based on word matching and grammar matching. Single-character recognition yields, for each character, a set of the top K recognition results. For each text segment, the highest-probability recognition result of each character is taken to form a preliminary sequence, which is then segmented into words. A corpus-based word matching operation is applied to the multi-character words obtained by segmentation, and the word combination with the highest probability, fused with the probabilities from the pre-recognition module, is selected to update the word. A grammar matching operation is applied to the single-character words obtained by segmentation: the K recognition results are each checked by grammatical analysis, and the most probable remaining result is used to update the word. The fused results of the two operations are output as the post-processing result.

To achieve the above object, the technical scheme of the invention is an OCR post-processing method based on word matching and grammar matching, comprising the following steps:

Step 1) Pre-OCR single-character recognition: the pre-OCR module locates the text in the scene and recognizes each character, and the top K most likely recognition results and their corresponding probabilities are stored;

Step 2) Word segmentation: the highest-probability recognition result of each character is taken as the initial result, and a mainstream word segmentation tool is used to segment the text sequence;

Step 3) Forward maximum word matching based on a Chinese dictionary: for each multi-character word obtained by segmentation, the recognized word is compared with its possible similar candidate words, the most plausible word is found according to the preceding and following recognized words, and the initial result is corrected;

Step 4) Multi-material lexical segmentation: according to the result of step 2), the K recognition results of each single-character word are substituted into the text, and grammar segmentation is carried out on each and saved;

Step 5) Grammar matching based on a Chinese grammar library: for each single-character word obtained by segmentation, the K different grammar segmentation results are screened according to prior grammatical knowledge, and the remaining result with the highest probability value from step 1) is then selected for correction;

Step 6) After the single-character words and the multi-character words have been processed separately, the recognition result is output.

In a preferred scheme of the method, in step 1), let the input text be $X = (x_1, x_2, \dots, x_n)$ and the output of the single-character recognizer be $Y = (y_1, y_2, \dots, y_n)$, where $y_i$ is the recognition result for character $x_i$ of the input text $X$. Each $y_i = \{(y_{i_1}, p_{i_1}), (y_{i_2}, p_{i_2}), \dots, (y_{i_K}, p_{i_K})\}$ contains the K results with the largest probability values in the recognition classification result, and each tuple contains a recognition result and its corresponding probability.
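The top-K structure described above can be sketched as follows. This is a minimal illustration, not the recognizer itself: the score dictionaries, the helper names `top_k` and `initial_sequence`, and the placeholder characters are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

Candidate = Tuple[str, float]   # (recognized character, probability)
TopK = List[Candidate]          # y_i: the K best candidates for one character


def top_k(scores: Dict[str, float], k: int) -> TopK:
    """Keep the k most probable (character, probability) pairs."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]


def initial_sequence(y: List[TopK]) -> str:
    """Concatenate the most probable candidate at every position (input to step 2)."""
    return "".join(max(cands, key=lambda c: c[1])[0] for cands in y)


# Hypothetical classifier scores for a two-character text (placeholder letters).
scores_1 = {"a": 0.70, "o": 0.20, "e": 0.08, "u": 0.02}
scores_2 = {"t": 0.50, "l": 0.30, "f": 0.15, "i": 0.05}

Y = [top_k(scores_1, 3), top_k(scores_2, 3)]
print(Y[0])                 # [('a', 0.7), ('o', 0.2), ('e', 0.08)]
print(initial_sequence(Y))  # at
```

Keeping the probabilities alongside each candidate is what allows the later steps to fuse recognizer confidence with corpus and grammar evidence.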

In a preferred scheme of the method, the word segmentation tool in step 2) is jieba. The jieba algorithm performs efficient word-graph scanning based on a prefix dictionary to generate a directed acyclic graph (DAG) of all possible words that the Chinese characters in a sentence can form. It then uses dynamic programming to search for the maximum-probability path, which gives the best segmentation based on word frequency. For out-of-vocabulary words, it uses an HMM model based on the word-forming capability of Chinese characters, decoded with the Viterbi algorithm. The Viterbi algorithm backtracks through back pointers to determine which hidden states belong to the most likely hidden-state sequence, which effectively isolates noise in the sequence.
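The DAG plus dynamic-programming idea can be illustrated with a toy example. The sketch below is not jieba's code: the dictionary `FREQ`, its frequencies, and the function names are hypothetical, and the HMM/Viterbi handling of out-of-vocabulary words is omitted (unknown characters simply become single-character words).

```python
import math

# Toy prefix dictionary: word -> frequency (hypothetical values).
FREQ = {"a": 5, "b": 5, "c": 5, "ab": 20, "bc": 8, "abc": 2}
TOTAL = sum(FREQ.values())


def build_dag(text):
    """Map each start index to the end indices of all dictionary words starting there."""
    dag = {}
    for i in range(len(text)):
        ends = [j for j in range(i, len(text)) if text[i:j + 1] in FREQ]
        dag[i] = ends or [i]  # unknown character: fall back to a one-character word
    return dag


def segment(text):
    """Pick the segmentation with the highest product of word frequencies."""
    dag = build_dag(text)
    n = len(text)
    # route[i] = (best log-probability of text[i:], end index of its first word)
    route = {n: (0.0, 0)}
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(FREQ.get(text[i:j + 1], 1)) - math.log(TOTAL) + route[j + 1][0], j)
            for j in dag[i]
        )
    words, i = [], 0
    while i < n:
        j = route[i][1]
        words.append(text[i:j + 1])
        i = j + 1
    return words


print(segment("abc"))  # ['ab', 'c']
```

Here "ab" + "c" wins because its frequency product (20/45 × 5/45) exceeds that of the single word "abc" (2/45) and of the other splits.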

In a preferred scheme of the method, the specific flow of the forward maximum word matching based on the Chinese dictionary in step 3) is as follows:

(a) Assume the multi-character word contains n characters, each with K recognition results. Combining them arbitrarily gives $K^n$ candidate word combinations. The probabilities of the characters in each combination are multiplied and the products are normalized to obtain the combination recognition probability $P_1 = \{p'_1, p'_2, \dots, p'_{K^n}\}$. The combinations are then matched against a corpus, and the number of occurrences of each combination in the corpus is counted and recorded as $\{c_1, c_2, \dots, c_{K^n}\}$. Normalizing these counts gives the occurrence probability of the multi-character word, $P_2 = \{c_1/\sum_i c_i,\ c_2/\sum_i c_i,\ \dots,\ c_{K^n}/\sum_i c_i\}$.

(b) The combination recognition probability is fused with the occurrence probability of the multi-character word using a weight factor $\alpha$, and the final combination probability is $P = \alpha P_1 + (1-\alpha) P_2$. The word combination with the highest probability value is taken as the final result. This step considers both discrimination results together, thereby effectively reducing errors.
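The two-part flow above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function name `fuse`, the placeholder characters, and the corpus counts are hypothetical, and the choice to treat P2 as all zeros when no combination appears in the corpus is an assumption the source does not address.

```python
import math
from itertools import product


def fuse(candidates, corpus_counts, alpha=0.5):
    """candidates: one list of (char, prob) per character of the multi-character word.
    corpus_counts: word -> number of occurrences in the corpus.
    Returns (best word, {word: fused probability})."""
    combos = list(product(*candidates))                 # all K^n combinations
    words = ["".join(ch for ch, _ in combo) for combo in combos]

    raw = [math.prod(p for _, p in combo) for combo in combos]
    z1 = sum(raw)
    p1 = [r / z1 for r in raw]                          # normalized recognition probability

    counts = [corpus_counts.get(w, 0) for w in words]
    z2 = sum(counts)
    p2 = [c / z2 if z2 else 0.0 for c in counts]        # assumption: zeros if nothing matches

    fused = [alpha * a + (1 - alpha) * b for a, b in zip(p1, p2)]
    best = max(range(len(words)), key=fused.__getitem__)
    return words[best], dict(zip(words, fused))


# Hypothetical two-character word with K = 2 candidates per character.
cands = [[("n", 0.6), ("h", 0.4)], [("o", 0.5), ("a", 0.5)]]
counts = {"no": 1, "ha": 9}
best, scores = fuse(cands, counts, alpha=0.5)
print(best)  # ha
```

In this toy case the recognizer alone slightly prefers "no" or "na", but the corpus evidence pulls the fused score of "ha" to the top, which is the kind of correction the weight factor is meant to allow.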

In the preferred scheme of the method, the specific flow of grammar matching based on the Chinese grammar library in the step 5) is as follows:

The method adopts the corpus of the Modern Chinese Grammar Information Dictionary of Peking University. Taking the word to be recognized as the center, the character under test is identified by a grammar matching check against its context. Grammar matching mainly uses the grammatical and semantic knowledge provided by the word libraries in the corpus. Taking the noun library as an example, the items examined during grammar matching include the following: several, individual, metric, container, collective, category, shaping, indefinite, move-time, front, rear, front generation, front-to-back. Single-character recognition results that do not conform to the grammar matching rules are deleted, and the remaining recognition result with the highest probability is then used to update the single-character word.
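The filter-then-select logic of grammar matching can be sketched as below. The rule table is a deliberately tiny stand-in: the part-of-speech lexicon `POS`, the slot rule `ALLOWED_BETWEEN`, the English placeholder words, and the fallback when every candidate is rejected are all assumptions for illustration and do not reproduce the grammar dictionary named in the text.

```python
# Hypothetical part-of-speech lexicon and slot rule, for illustration only.
POS = {"run": "verb", "red": "adj", "rod": "noun"}
ALLOWED_BETWEEN = {("det", "noun"): {"adj", "noun"}}


def is_grammatical(candidate, context):
    """context is (POS tag of the previous word, POS tag of the next word)."""
    allowed = ALLOWED_BETWEEN.get(context)
    if allowed is None:
        return True                      # no rule for this context: accept
    return POS.get(candidate) in allowed


def grammar_update(candidates, context):
    """Drop candidates that fail the grammar check, then take the most probable one."""
    kept = [(w, p) for w, p in candidates if is_grammatical(w, context)]
    if not kept:                         # assumption: keep all if every candidate fails
        kept = candidates
    return max(kept, key=lambda c: c[1])[0]


# The recognizer's top choice "run" does not fit between a determiner and a noun.
candidates = [("run", 0.5), ("red", 0.3), ("rod", 0.1)]
print(grammar_update(candidates, ("det", "noun")))  # red
```

The recognizer probability is only consulted after the grammar check, so a lower-ranked candidate can replace the top one when the top one is grammatically implausible.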

The invention fully exploits the syntactic information of the text, processes single-character and multi-character words separately, and shows good adaptability. Compared with traditional techniques based on word matching alone, it has clearer advantages and higher application value, especially on lower-quality text. Compared with the prior art, the invention has the following advantages:

(1) The classification probability of single-character recognition is combined with the post-processing matching probability, which improves the degree of information association and the utilization of information between the earlier and later stages of the structure. Traditional post-processing methods do not use the probability values of the recognition classification module, which greatly reduces information utilization and makes the result depend entirely on the accuracy of the post-processing method. In the invention, the probability values of the single-character recognition module are fused with those of the post-processing module, which effectively reduces the false detection rate of the post-processing module and improves the adaptability of the overall structure.

(2) The accuracy and completeness of the post-processing scheme are improved. Traditional methods perform word matching analysis on the segmented words and ignore the processing and grammatical analysis of single-character words. Based on word matching and grammar matching, the invention performs semantic and grammatical optimization on multi-character words and single-character words respectively, so that every word in the text is covered by the post-processing module, improving the accuracy and performance of post-processing.

Drawings

FIG. 1 is a flow chart of the method of the present invention;

FIG. 2 is a schematic diagram of lexical segmentation.

Detailed Description

To aid understanding of the invention, the embodiment is described in detail below with reference to the accompanying drawings.

Example: the OCR post-processing method based on word matching and grammar matching obtains, through single-character recognition, a set of the top K recognition results for each character. For each text segment, the highest-probability recognition result of each character is taken to form a preliminary sequence, which is then segmented into words. A corpus-based word matching operation is applied to the multi-character words obtained by segmentation, and the word combination with the highest probability, fused with the probabilities from the pre-recognition module, is selected to update the word. A grammar matching operation is applied to the single-character words obtained by segmentation, in which the K recognition results are each fed into grammatical analysis.

1. System structure:

Fig. 1 shows the architecture of the OCR post-processing method based on word matching and grammar matching, and a detailed description of two main parts is given below.

1. Forward maximum word matching based on Chinese dictionary:

(B) The combination recognition probability is fused with the occurrence probability of the multi-character word using a weight factor $\alpha$, and the final combination probability is $P = \alpha P_1 + (1-\alpha) P_2$. The word combination with the highest probability value is taken as the final result.

2. Grammar matching based on Chinese grammar library:

2. The specific process comprises the following steps:

Referring to FIG. 1, the OCR post-processing method based on word matching and grammar matching includes the following steps:

Step 1) Pre-OCR single-character recognition: the pre-OCR module locates the text in the scene and recognizes each character, and the top K most likely recognition results and their corresponding probabilities are stored. Let the input text be $X = (x_1, x_2, \dots, x_n)$ and the output of the single-character recognizer be $Y = (y_1, y_2, \dots, y_n)$, where $y_i$ is the recognition result for character $x_i$ of the input text $X$. Each $y_i = \{(y_{i_1}, p_{i_1}), (y_{i_2}, p_{i_2}), \dots, (y_{i_K}, p_{i_K})\}$ contains the K results with the largest probability values in the recognition classification result, and each tuple contains a recognition result and its corresponding probability.

Step 2) Word segmentation: the highest-probability recognition result of each character is taken as the initial result, and a mainstream word segmentation tool is used to segment the text sequence.

Step 3) Forward maximum word matching based on a Chinese dictionary: for each multi-character word obtained by segmentation, the recognized word is compared with its possible similar candidate words, the most plausible word is found according to the preceding and following recognized words, and the initial result is corrected.

(A) Assume the multi-character word contains n characters, each with K recognition results. Combining them arbitrarily gives $K^n$ candidate word combinations. The probabilities of the characters in each combination are multiplied and the products are normalized to obtain the combination recognition probability $P_1 = \{p'_1, p'_2, \dots, p'_{K^n}\}$. The combinations are then matched against a corpus, and the number of occurrences of each combination in the corpus is counted and recorded as $\{c_1, c_2, \dots, c_{K^n}\}$. Normalizing these counts gives the occurrence probability of the multi-character word, $P_2 = \{c_1/\sum_i c_i,\ c_2/\sum_i c_i,\ \dots,\ c_{K^n}/\sum_i c_i\}$;

Step 4) Multi-material lexical segmentation: according to the result of step 2), the K recognition results of each single-character word are substituted into the text, and grammar segmentation is carried out on each and saved. According to the word segmentation result, the multi-character words to be segmented are the updated results of step 3), while each single-character word takes its K recognition results from step 1), and grammar segmentation is carried out on each;

Step 5) Grammar matching based on a Chinese grammar library: for each single-character word obtained by segmentation, the K different grammar segmentation results are screened according to prior grammatical knowledge, and the remaining result with the highest probability value from step 1) is then selected for correction. Taking the word to be recognized as the center, the character under test is identified by a grammar matching check against its context. Grammar matching mainly uses the grammatical and semantic knowledge provided by the word libraries in the corpus. Single-character recognition results that do not conform to the grammar matching rules are deleted, and the remaining recognition result with the highest probability is then used to update the single-character word.
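The substitution in step 4) can be pictured as generating, for every single-character word, K variants of the segmented text that differ only at that position. The sketch below rests on an assumption the source does not spell out, namely that one position is varied at a time while the other words keep their current best value. The function name `candidate_variants` and the placeholder tokens are hypothetical.

```python
def candidate_variants(words, topk):
    """words: segmented text after step 3 (multi-character words already corrected).
    topk: index of a single-character word -> its list of (char, prob) from step 1.
    Returns index -> list of (variant word list, recognizer probability)."""
    variants = {}
    for idx, cands in topk.items():
        variants[idx] = [
            (words[:idx] + [ch] + words[idx + 1:], prob)  # replace only position idx
            for ch, prob in cands
        ]
    return variants


# Placeholder tokens: two single-character words followed by one two-character word.
words = ["x", "y", "zw"]
topk = {
    0: [("x", 0.7), ("u", 0.2), ("q", 0.08)],
    1: [("y", 0.5), ("v", 0.3), ("k", 0.1)],
}

for variant, prob in candidate_variants(words, topk)[1]:
    print(variant, prob)
# ['x', 'y', 'zw'] 0.5
# ['x', 'v', 'zw'] 0.3
# ['x', 'k', 'zw'] 0.1
```

Each stored variant would then be passed to the grammar matching check of step 5), which discards the variants that fail and keeps the most probable survivor.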

3. Specific application examples:

For convenience of description, assume the following simplified application example. The text to be detected is

$X = (x_1, x_2, x_3, x_4)$

According to the calculation steps described above, the procedure is as follows:

First, the pre-OCR single-character recognition module stores the top K = 3 most likely recognition results and their corresponding probabilities. The corresponding result is $Y = (y_1, y_2, y_3, y_4)$, where $y_1$ = {("me", 0.7), ("Russian", 0.2), ("zedoary", 0.08)}, $y_2$ = {("moxa", 0.5), ("love", 0.3), ("ir", 0.1)}, $y_3$ = {("mid", 0.9), ("string", 0.02), ("-", 0.01)}, and $y_4$ = {("country", 0.3), ("threshold", 0.2), ("basket", 0.2)}.

Second, word segmentation is carried out: the highest-probability recognition results from the first step are taken as the initial result, and a mainstream word segmentation tool is used to segment the text. The word segmentation result is "me" / "moxa" / "Chinese".

Third, forward maximum word matching based on the Chinese dictionary is carried out:

(A) The word segmentation result contains only one two-character word. Each of its characters has 3 recognition results, so arbitrary combination gives 9 candidate word combinations ('Chinese', 'middle threshold', 'middle frame', 'string country', 'string threshold', 'string frame', 'furo' and 'furframe'). The probabilities of the characters in each combination are multiplied and normalized to obtain the combination recognition probability $P_1$ = (0.35, 0.23, 0.23, 0.08, 0.05, 0.05, 0.003, 0.003, 0.003). The combinations are then matched against the corpus, and the number of occurrences of each combination in the corpus is counted as (9, 5, 0, 3, 0, 0, 0, 0, 0). Normalizing these counts gives the occurrence probability of the multi-character word, $P_2$ = (0.53, 0.29, 0, 0.18, 0, 0, 0, 0, 0).

(B) The combination recognition probability is fused with the occurrence probability of the multi-character word using the weight factor $\alpha = 0.5$, so the final combination probability is $P = 0.5 \times P_1 + 0.5 \times P_2$ = (0.44, 0.26, 0.12, 0.13, 0.025, 0.025, 0.0015, 0.0015, 0.0015). The word combination "Chinese", which corresponds to the highest probability value, is taken as the final result.
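The arithmetic of this step can be checked term by term from the stated counts and the stated $P_1$, with $\alpha = 0.5$. Only the first four entries are written out; values are rounded as in the text, and the subscripts in parentheses simply index the candidate combinations in the order listed.

```latex
\begin{align*}
\textstyle\sum_i c_i &= 9 + 5 + 0 + 3 + 0 + \dots + 0 = 17 \\
P_2 &= \left(\tfrac{9}{17},\ \tfrac{5}{17},\ 0,\ \tfrac{3}{17},\ 0,\ \dots\right)
      \approx (0.53,\ 0.29,\ 0,\ 0.18,\ 0,\ \dots) \\
P_{(1)} &= 0.5 \times 0.35 + 0.5 \times 0.53 = 0.44 \\
P_{(2)} &= 0.5 \times 0.23 + 0.5 \times 0.29 = 0.26 \\
P_{(3)} &= 0.5 \times 0.23 + 0.5 \times 0 = 0.115 \approx 0.12 \\
P_{(4)} &= 0.5 \times 0.08 + 0.5 \times 0.18 = 0.13
\end{align*}
```

The first combination has the largest fused value, 0.44, so it is the one selected.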

Fourth, multi-material lexical segmentation is carried out, as shown in FIG. 2. According to the word segmentation result, the multi-character word to be segmented is the updated result of step 3), while each single-character word takes its 3 recognition results obtained in step 1), and grammar segmentation is carried out on each.

Fifth, grammar matching based on the Chinese grammar library is carried out. Taking the word to be recognized as the center, the character under test is identified by a grammar matching check against its context. Grammar matching mainly uses the grammatical and semantic knowledge provided by the word libraries in the corpus. Single-character recognition results that do not conform to the grammar matching rules are deleted, and the remaining recognition result with the highest probability is then used to update the single-character word. Here the "fres" candidate of $x_1$ and the "moxa" candidate of $x_2$ are eliminated because their grammar does not pass the matching check. The remaining recognition result with the highest probability is selected to update each word: $x_1$ is identified as "me" and $x_2$ is updated to "love".

Sixth, the post-processed result is output: the initial recognition result is updated to "I love Chinese".

It should be noted that the above-mentioned embodiments are not intended to limit the scope of the present invention, and equivalent changes or substitutions made on the basis of the above-mentioned technical solutions fall within the scope of the present invention as defined in the claims.

Claims

1. An OCR post-processing method based on word matching and grammar matching, the method comprising the steps of:

Step 5) screening K different grammar segmentation results according to the prior knowledge of the grammar for single word after word segmentation based on grammar matching of a Chinese grammar library, and then selecting the result with the highest probability value in step 1) for correction;

2. The OCR post-processing method based on word matching and grammar matching according to claim 1, wherein in the step 1), the method of the pre-OCR single-character recognition module is:

Let the input text be $X = (x_1, x_2, \dots, x_n)$ and the output of the single-character recognizer be $Y = (y_1, y_2, \dots, y_n)$, where $y_i$ is the recognition result for character $x_i$ in the input text $X$, and $y_i = \{(y_{i_1}, p_{i_1}), (y_{i_2}, p_{i_2}), \dots, (y_{i_K}, p_{i_K})\}$ includes the K results with the largest probability values in the recognition classification result, each tuple including a recognition result and its corresponding probability.

3. The OCR post-processing method based on word matching and grammar matching according to claim 1, wherein the word segmentation tool in step 2) adopts jieba word segmentation, the jieba word segmentation algorithm uses efficient word-graph scanning based on a prefix dictionary to generate a directed acyclic graph (DAG) composed of all possible words that the Chinese characters in a sentence can form, then uses dynamic programming to find the maximum-probability segmentation based on word frequency, and for out-of-vocabulary words adopts an HMM model based on the word-forming capability of Chinese characters together with the Viterbi algorithm.

4. The OCR post-processing method based on word matching and grammar matching according to claim 1, wherein in the step 3), the specific flow of the forward maximum word matching based on the chinese dictionary is:

(a) Assuming that the multi-character word contains n characters and each character has K recognition results, obtaining $K^n$ candidate word combinations by arbitrary combination, multiplying the corresponding character probabilities in each combination and normalizing to obtain the combination recognition probability $P_1 = \{p'_1, p'_2, \dots, p'_{K^n}\}$, then matching the combinations against a corpus, counting the number of occurrences of each combination in the corpus and recording the counts as $\{c_1, c_2, \dots, c_{K^n}\}$, thereby obtaining the occurrence probability of the multi-character word, and normalizing to obtain $P_2 = \{c_1/\sum_i c_i,\ c_2/\sum_i c_i,\ \dots,\ c_{K^n}/\sum_i c_i\}$;

(b) Fusing the combination recognition probability with the occurrence probability of the multi-character word using a weight factor $\alpha$, wherein the final combination probability is $P = \alpha P_1 + (1-\alpha) P_2$, and the word combination corresponding to the highest probability value is taken as the final result.

5. The OCR post-processing method based on word matching and grammar matching according to claim 1, wherein in the step 4), the method of multi-material lexical segmentation is as follows:

according to the word segmentation result, the multi-character words to be segmented are the updated results of step 3), and each single-character word takes its K recognition results obtained in step 1), with grammar segmentation carried out on each.

6. The OCR post-processing method based on word matching and grammar matching according to claim 1, wherein in the step 5), the specific flow of grammar matching based on the Chinese grammar library is:

taking the word to be recognized as the center, the character under test is identified by a grammar matching check against its context, the grammar matching mainly uses the grammatical and semantic knowledge provided by the word libraries in the corpus, single-character recognition results that do not conform to the grammar matching rules are deleted, and the remaining recognition result with the highest probability is then used to update the single-character word.