CN113011406A

CN113011406A - Single-template workflow optimization method

Info

Publication number: CN113011406A
Application number: CN202110312418.6A
Authority: CN
Inventors: 玄洪升; 李明明; 潘心冰; 郭保荣; 冷静
Original assignee: Inspur Cloud Information Technology Co Ltd
Current assignee: Inspur Cloud Information Technology Co Ltd
Priority date: 2021-03-24
Filing date: 2021-03-24
Publication date: 2021-06-22

Abstract

The invention relates to the field of deep learning and image character recognition, and in particular provides a single-template workflow optimization method comprising the following steps: S1, constructing an N-Gram knowledge base for the template picture domain; S2, preprocessing the OCR text to be corrected into a sequence to be corrected; S3, identifying suspected erroneous words; S4, constructing a list of correct candidate words based on an error rule inference model; S5, generating a comprehensive weight for each candidate with a candidate character screening algorithm and ranking the weights, the character with the highest weight being taken as the correct character for the erroneous character in the current text to be corrected. Compared with the prior art, the method improves the usability of text obtained by OCR conversion of images in the single-template workflow. It applies to the stages of framing reference fields, framing recognition areas, and evaluation and application, thereby improving the precision of the character recognition model of the single-template workflow and ensuring the quality of structured information extraction.

Description

Single-template workflow optimization method

Technical Field

The invention relates to the field of deep learning and image character recognition, and particularly provides a single-template workflow optimization method.

Background

A single-template workflow allows a character recognition template to be built independently and the characters in a template picture to be recognized; it provides a high-precision character recognition model and ensures the precision of structured information extraction. As shown in FIG. 4, the usability of the text recognized from the image by OCR is a key indicator of the single-template workflow in the stages of framing reference fields, framing recognition areas, and evaluation and application.

Text information is extracted from pictures mainly by means of Optical Character Recognition (OCR) technology. A well-optimized OCR system can correctly recognize most text content, but recognition errors still occur in part of the text.

Text error correction has developed alongside computer technology, and many researchers in China and abroad are currently working in related fields. Text error correction is divided into two main steps: error detection followed by error correction. In the current mainstream method, the error detection step first segments the sentence with the jieba Chinese word segmenter. Because the sentence contains wrongly written characters, the segmentation result often contains wrong cuts; errors are therefore detected at both character granularity and word granularity, and the suspected errors from the two granularities are merged into a candidate set of suspected error positions.

The error correction step traverses all suspected error positions, replaces the word at each position using dictionaries of similar characters, computes the sentence perplexity with a language model, and compares and ranks the results of all candidates to obtain the best correction.

In this context, improving the usability of the text obtained after OCR of the image has become one of the bottlenecks of the single-template workflow.
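For reference, the sentence perplexity used for ranking in such prior-art methods is conventionally defined as follows for a sentence W = w_1 ... w_N scored by a language model P. The symbols W, w_i, N and P are introduced here for illustration only and do not appear in the original text. A lower perplexity indicates a more fluent candidate sentence.

```latex
\mathrm{PP}(W) = P(w_1 w_2 \cdots w_N)^{-\frac{1}{N}}
= \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N} \log P(w_i \mid w_1, \ldots, w_{i-1})\right)
```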

Disclosure of Invention

Aiming at the defects of the prior art, the invention provides a single-template workflow optimization method with strong practicability.

The technical scheme adopted by the invention for solving the technical problems is as follows:

A single-template workflow optimization method comprises the following steps:

S1, constructing an N-Gram knowledge base for the template picture domain;

S2, preprocessing the OCR text to be corrected into a sequence to be corrected;

S3, identifying suspected erroneous words;

S4, constructing a list of correct candidate words based on an error rule inference model;

S5, generating a comprehensive weight for each candidate with a candidate character screening algorithm and ranking the weights, the character with the highest weight being taken as the correct character for the erroneous character in the current text to be corrected.

Further, in step S1, BiGram segmentation is performed on the collected template picture domain text set, and the word collocations in the domain text set are analyzed and counted to generate a word collocation probability matrix for the template picture domain. An N-Gram knowledge base of the template picture domain corpus is thereby constructed and serves as the background knowledge base of the text error judgment mechanism, which judges error positions in the text to be corrected.

Preferably, in step S2, the text to be corrected is preprocessed, including removing numbers, symbols and English letters from the sentence, and the sentence is BiGram-segmented using the N-Gram mechanism to obtain the text sequence to be corrected.

Further, in step S3, the text to be corrected is preprocessed to obtain its BiGram segmentation sequence, and each BiGram is looked up in the domain word collocation probability matrix of the N-Gram knowledge base of the template picture domain text set; if the probability is lower than a threshold, it is determined that a text error exists in that BiGram group.

Further, in step S4, after the error position of a word in the text to be corrected has been determined, OCR conversion error rule inference is performed on the word using an error rule inference model.

Further, in step S4, the Chinese character structure attribute knowledge graph is mainly composed of the Chinese characters themselves, the shape code corresponding to each Chinese character, the strokes of each Chinese character, and the error rules that occur between Chinese characters during OCR conversion. The graph assigns a unique number to each Chinese character, and the character numbers serve as the main nodes of all relations, thereby forming a complete Chinese character structure attribute knowledge graph.

Further, in step S4, a candidate interval of correct words for the erroneous word is generated according to the predicted OCR conversion errors of the word, for extraction by the subsequent candidate word screening algorithm, and a candidate word list is generated according to the inference result for the erroneous word in the OCR conversion error rule inference model.

Further, in step S5, the screening of the candidate word list is performed in two parts.

In the first part, the word collocations that each candidate word in the candidate word list forms with the context BiGram segmentation around the suspected erroneous word in the text to be corrected are traversed and checked against the BiGram domain word collocation knowledge base of the template pictures. If a candidate word does not form a reasonable sentence with the context before and after the position of the suspected erroneous word, it is eliminated directly and is not kept in the candidate word list for the weight calculation of the screening algorithm.

Further, in step S5, the screening algorithm for the candidate word list uses the TF-IDF algorithm and cosine similarity to compute a score for each candidate word at the context position of the suspected erroneous word; the score represents how reasonable the sentence formed by the candidate word and the context of the error position in the text to be corrected is. The degrees of similarity between the candidate words remaining after the BiGram template picture knowledge base comparison and the erroneous word in the text to be corrected are then compared, so as to select the most reasonable candidate word.

Compared with the prior art, the single-template workflow optimization method has the following outstanding beneficial effects:

The single-template workflow optimization method improves the usability of text obtained by OCR conversion of images in the single-template workflow. It applies to the stages of framing reference fields, framing recognition areas, and evaluation and application, thereby improving the precision of the character recognition model of the single-template workflow and ensuring the quality of structured information extraction.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

FIG. 1 is a diagram of a text error correction framework in a single-template workflow optimization method;

FIG. 2 is a flow chart of an error determination mechanism in a single-template workflow optimization method;

FIG. 3 is a diagram of an error rule inference model in a single-template workflow optimization method;

FIG. 4 is a flow chart of a single-template workflow in the prior art.

Detailed Description

The present invention will be described in further detail with reference to specific embodiments in order to better understand the technical solutions of the present invention. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

A preferred embodiment is given below:

As shown in FIG. 1, the single-template workflow optimization method of this embodiment includes the following steps:

S1, constructing an N-Gram knowledge base for the template picture domain:

First, BiGram segmentation is performed on the collected template picture domain text set, and the word collocations in the domain text set are analyzed and counted to generate a word collocation probability matrix for the template picture domain. An N-Gram knowledge base of the template picture domain text set is thereby constructed and serves as the background knowledge base of the text error judgment mechanism, which judges error positions in the text to be corrected.

BiGram (two-element grouping) is a segmentation granularity widely used in N-Gram technology. Compared with TriGram, 5-Gram and the like, BiGram offers finer granularity for the statistical analysis of word collocations in text, yields a more accurate domain word collocation probability matrix, and provides higher precision when judging text error positions.
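The construction of the knowledge base in S1 can be sketched as follows. This is a minimal illustration, not the implementation of the invention: it assumes that BiGrams are taken over single characters and that each matrix entry is the relative frequency of the second character given the first. The names bigrams and build_knowledge_base and the three sample sentences are hypothetical.

```python
from collections import Counter


def bigrams(text):
    """Split a string into overlapping character BiGrams."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def build_knowledge_base(corpus):
    """Estimate P(second char | first char) from a list of domain sentences.

    The result is a sparse collocation probability matrix stored as a dict
    keyed by the two-character BiGram (an assumed representation).
    """
    pair_counts = Counter()
    first_counts = Counter()
    for sentence in corpus:
        for gram in bigrams(sentence):
            pair_counts[gram] += 1
            first_counts[gram[0]] += 1
    return {g: n / first_counts[g[0]] for g, n in pair_counts.items()}


# Hypothetical domain text set.
corpus = ["发票代码", "发票号码", "开票日期"]
kb = build_knowledge_base(corpus)
print(kb["发票"])  # "发" is always followed by "票" in this corpus
print(kb["票代"])  # "票" is followed by "代" in one of three cases
```

A dict keeps the matrix sparse, which matters because most character pairs never occur in a narrow template domain.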

S2, preprocessing the OCR text to be corrected into a sequence to be corrected:

The text to be corrected is preprocessed, including removing numbers, symbols, English letters and the like from the sentence, and the sentence is BiGram-segmented using the N-Gram mechanism to obtain the text sequence to be corrected.
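A possible form of the preprocessing in S2 is sketched below. It assumes that "numbers, symbols, English letters and the like" can be approximated by discarding everything outside the basic CJK Unified Ideographs range U+4E00 to U+9FFF; the function name preprocess and the sample string are hypothetical.

```python
import re

# Anything that is not a basic CJK ideograph is treated as noise (assumption).
_NON_CJK = re.compile(r"[^\u4e00-\u9fff]")


def preprocess(text):
    """Strip digits, symbols and Latin letters, then cut character BiGrams."""
    cleaned = _NON_CJK.sub("", text)
    return [cleaned[i:i + 2] for i in range(len(cleaned) - 1)]


print(preprocess("发票No.123代码:ABC"))
```

Note that removing characters makes originally non-adjacent characters adjacent, so a real system would probably need to track the original offsets of the kept characters.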

S3, identifying the suspected erroneous word:

The text to be corrected is preprocessed to obtain its BiGram segmentation sequence, and each BiGram is looked up in the domain word collocation probability matrix of the N-Gram knowledge base of the template picture domain text set; if the probability is lower than a threshold, it is determined that a text error exists in that BiGram group.

As shown in FIG. 2, W1, W2, W3 and W4 denote four words in the N-Gram sequence of the text to be corrected. BiGram segmentation yields three BiGram groups: W1W2, W2W3 and W3W4. Word collocation probability analysis is performed on each BiGram group; when the word collocation probability of W2W3 is lower than the threshold, it is determined that a text error exists in the W2W3 combination. Word collocation probability analysis is then performed on W1W2 and W3W4, the neighboring BiGram combinations that contain W2 and W3 respectively; the combination with the higher probability is regarded as correct and the one with the lower probability as erroneous, which pinpoints the erroneous word within the W2W3 combination.
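The localization rule for W2W3 can be sketched as follows. The sketch assumes character-level BiGrams, an arbitrary threshold of 0.05, and a knowledge base given as a dict from BiGram to probability. The handling of BiGrams at the start or end of the sentence, where only one neighboring BiGram exists, is not described in the text and is an assumption of this sketch, as is the function name locate_errors.

```python
def locate_errors(text, kb, threshold=0.05):
    """Return sorted indices of characters judged erroneous in text."""
    grams = [text[i:i + 2] for i in range(len(text) - 1)]
    probs = [kb.get(g, 0.0) for g in grams]
    positions = set()
    for i, p in enumerate(probs):
        if p >= threshold:
            continue  # this BiGram looks like a normal collocation
        left = probs[i - 1] if i > 0 else None
        right = probs[i + 1] if i + 1 < len(probs) else None
        if left is None and right is None:
            positions.update({i, i + 1})  # two-character text: cannot decide
        elif left is None:
            # Sentence start (assumed rule): blame the shared character
            # only if the right neighbor is also improbable.
            positions.add(i + 1 if right < threshold else i)
        elif right is None:
            # Sentence end (assumed rule), mirrored.
            positions.add(i if left < threshold else i + 1)
        else:
            # Rule from the text: the neighbor with the lower probability
            # contains the erroneous character.
            positions.add(i if left < right else i + 1)
    return sorted(positions)


# Hypothetical knowledge base and OCR output ("票" misread as "栗").
kb = {"发票": 1.0, "票代": 1 / 3, "代码": 1.0}
print(locate_errors("发栗代码", kb))  # index of the misread character
print(locate_errors("发票代码", kb))  # no error found
```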

S4, constructing a list of correct candidate words based on the error rule inference model:

As shown in FIG. 3, after the error position of a word in the text to be corrected has been determined, OCR conversion error rule inference is performed on the word using an error rule inference model.

The Chinese character structure attribute knowledge graph is mainly composed of the Chinese characters themselves, the shape code corresponding to each Chinese character, the strokes of each Chinese character, and the error rules that occur between Chinese characters during OCR conversion. The graph assigns a unique number to each Chinese character, and the character numbers serve as the main nodes of all relations, forming a complete Chinese character structure attribute knowledge graph.
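One way to hold such a graph in memory is sketched below. The class and field names are hypothetical, the shape codes are placeholders rather than values from any real encoding, strokes are reduced to a stroke count, and OCR error rules are stored as symmetric edges; all of these are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CharNode:
    char_id: int       # unique number, used as the main node of all relations
    char: str
    shape_code: str    # placeholder, not a real shape encoding
    strokes: int       # stroke count (assumed representation of strokes)
    confused_with: set = field(default_factory=set)  # ids of OCR look-alikes


class CharGraph:
    def __init__(self):
        self.nodes = {}  # char_id -> CharNode
        self.ids = {}    # char -> char_id

    def add_char(self, char, shape_code, strokes):
        """Register a character once and return its unique number."""
        if char not in self.ids:
            char_id = len(self.nodes) + 1
            self.ids[char] = char_id
            self.nodes[char_id] = CharNode(char_id, char, shape_code, strokes)
        return self.ids[char]

    def add_error_rule(self, intended, observed):
        """Record that OCR may output `observed` for `intended` (symmetric)."""
        a, b = self.ids[intended], self.ids[observed]
        self.nodes[a].confused_with.add(b)
        self.nodes[b].confused_with.add(a)

    def look_alikes(self, char):
        node = self.nodes[self.ids[char]]
        return sorted(self.nodes[i].char for i in node.confused_with)


graph = CharGraph()
graph.add_char("票", "code-1", 11)
graph.add_char("栗", "code-2", 10)
graph.add_char("粟", "code-3", 12)
graph.add_error_rule("票", "栗")
graph.add_error_rule("粟", "栗")
print(graph.look_alikes("栗"))
```

Keying every relation by the character number rather than by the character itself matches the description of the number as the main node.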

A candidate interval of correct characters for the erroneous character is generated according to the predicted OCR conversion errors of the character, for extraction by the subsequent candidate word screening algorithm. A candidate word list is then generated according to the inference result for the erroneous word in the OCR conversion error rule inference model.
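The generation of the candidate word list can be illustrated with a simple lookup. The text does not specify how the inference model is realized; this sketch assumes the error rules are available as (intended, observed, count) triples collected from past OCR errors and orders candidates by how often each confusion was seen. The function names and the counts are invented for illustration.

```python
from collections import defaultdict


def build_rule_index(rules):
    """Index rules by the character OCR actually produced."""
    index = defaultdict(dict)
    for intended, observed, count in rules:
        index[observed][intended] = index[observed].get(intended, 0) + count
    return index


def candidate_list(wrong_char, index):
    """Return possible intended characters, most frequent confusion first."""
    scored = index.get(wrong_char, {})
    return sorted(scored, key=lambda c: (-scored[c], c))


# Hypothetical error rules: (intended, observed by OCR, times seen).
rules = [("票", "栗", 12), ("粟", "栗", 3), ("日", "曰", 7)]
index = build_rule_index(rules)
print(candidate_list("栗", index))
print(candidate_list("好", index))  # no rule known for this character
```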

S5, generating a comprehensive weight for each candidate with a candidate character screening algorithm and ranking the weights, the character with the highest weight being taken as the correct character for the erroneous character in the current text to be corrected:

The screening of the candidate word list is performed in two parts.

First, the word collocations that each candidate word in the candidate word list forms with the context BiGram segmentation around the suspected erroneous word in the text to be corrected are traversed and checked against the BiGram domain word collocation knowledge base of the template pictures. If a candidate word does not form a reasonable sentence with the context before and after the position of the suspected erroneous word, it is excluded directly and is not included in the candidate word list used for the weight calculation of the screening algorithm.
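This first screening pass can be sketched as follows. The sketch assumes that "forming a reasonable sentence with the context" means that both BiGrams created by substituting the candidate reach a probability threshold in the knowledge base; the threshold value and the function name filter_candidates are assumptions.

```python
def filter_candidates(text, pos, candidates, kb, threshold=0.05):
    """Keep candidates whose left and right BiGrams are known collocations."""
    kept = []
    for cand in candidates:
        # A missing neighbor (sentence start or end) is not held against
        # the candidate.
        left_ok = pos == 0 or kb.get(text[pos - 1] + cand, 0.0) >= threshold
        right_ok = (pos == len(text) - 1
                    or kb.get(cand + text[pos + 1], 0.0) >= threshold)
        if left_ok and right_ok:
            kept.append(cand)
    return kept


# Hypothetical knowledge base; character 1 of the text was flagged as wrong.
kb = {"发票": 1.0, "票代": 1 / 3, "代码": 1.0}
print(filter_candidates("发栗代码", 1, ["票", "粟"], kb))
```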

Then, the screening algorithm for the candidate word list uses the TF-IDF algorithm and cosine similarity to compute a score for each candidate word at the context position of the suspected erroneous word; the score represents how reasonable the sentence formed by the candidate word and the context of the error position in the text to be corrected is. The degrees of similarity between the candidate words remaining after the BiGram template picture knowledge base comparison and the erroneous word in the text to be corrected are compared, and the most reasonable candidate word is selected.
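The text does not state which vectors the TF-IDF weights and the cosine similarity are computed over. The sketch below shows one possible reading: each sentence is a bag of character BiGrams, IDF is estimated from the domain text set with add-one smoothing, and a candidate is scored by the highest cosine similarity between the corrected sentence and any domain sentence. All function names and the sample data are hypothetical, and the comprehensive weight of the invention may combine further factors.

```python
import math
from collections import Counter


def bigrams(text):
    return [text[i:i + 2] for i in range(len(text) - 1)]


def idf_table(corpus):
    """Smoothed IDF per BiGram, plus the IDF used for unseen BiGrams."""
    n = len(corpus)
    df = Counter(g for s in corpus for g in set(bigrams(s)))
    table = {g: math.log((1 + n) / (1 + c)) + 1.0 for g, c in df.items()}
    return table, math.log(1 + n) + 1.0


def tfidf(text, idf, default_idf):
    tf = Counter(bigrams(text))
    return {g: c * idf.get(g, default_idf) for g, c in tf.items()}


def cosine(a, b):
    dot = sum(w * b.get(g, 0.0) for g, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def rank_candidates(text, pos, candidates, corpus):
    """Return (candidate, score) pairs, best score first."""
    idf, default_idf = idf_table(corpus)
    corpus_vecs = [tfidf(s, idf, default_idf) for s in corpus]
    scores = {}
    for cand in candidates:
        fixed = text[:pos] + cand + text[pos + 1:]
        vec = tfidf(fixed, idf, default_idf)
        scores[cand] = max((cosine(vec, cv) for cv in corpus_vecs), default=0.0)
    return sorted(scores.items(), key=lambda kv: -kv[1])


corpus = ["发票代码", "发票号码", "开票日期"]
ranked = rank_candidates("发栗代码", 1, ["票", "粟"], corpus)
print(ranked[0][0])  # candidate with the highest weight
```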

The above embodiments are only specific embodiments of the present invention. The scope of the present invention includes but is not limited to these embodiments, and any suitable change or substitution that is consistent with the claims of the single-template workflow optimization method of the present invention and is made by a person of ordinary skill in the art shall fall within the scope of the present invention.

Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims

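1. A single-template workflow optimization method, characterized by comprising the following steps:

S1, constructing an N-Gram knowledge base for the template picture domain;

S2, preprocessing the OCR text to be corrected into a sequence to be corrected;

S3, identifying suspected erroneous words;

S4, constructing a list of correct candidate words based on an error rule inference model;

S5, generating a comprehensive weight for each candidate with a candidate character screening algorithm and ranking the weights, the character with the highest weight being taken as the correct character for the erroneous character in the current text to be corrected.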

2. The method of claim 1, wherein in step S1, BiGram segmentation is performed on the collected template picture domain text set, word collocations in the domain text set are analyzed and counted, a template picture domain word collocation probability matrix is generated, and a template picture domain text set N-Gram knowledge base is constructed as the background knowledge base of the text error judgment mechanism for judging error positions in the text to be corrected.

3. The method according to claim 1, wherein in step S2, the text to be corrected is preprocessed, including removing numbers, symbols and English letters from sentences, and the sentences are BiGram-segmented using an N-Gram mechanism to obtain the text sequence to be corrected.

4. The method according to claim 1, wherein in step S3, the text to be corrected is preprocessed to obtain a BiGram segmentation sequence of the text to be corrected, each BiGram is looked up in the domain word collocation probability matrix of the template picture domain text set N-Gram knowledge base, and if the probability is lower than a threshold, it is determined that a text error exists in the BiGram group.

5. The single-template workflow optimization method of claim 1, wherein in step S4, after the error position of a word in the text to be corrected has been determined, an OCR conversion error rule inference model is used to perform OCR conversion error rule inference on the word.

6. The single-template workflow optimization method of claim 5, wherein in step S4, the Chinese character structure attribute knowledge graph is mainly composed of the Chinese characters themselves, the shape codes corresponding to the Chinese characters, the strokes of the Chinese characters and the error rules that occur between Chinese characters during OCR conversion, wherein the graph assigns a unique number to each Chinese character, and the character numbers serve as the main nodes of all relations to form a complete Chinese character structure attribute knowledge graph.

7. The single-template workflow optimization method of claim 6, wherein in step S4, a candidate interval of correct words for the erroneous word is generated according to the predicted OCR conversion errors of the word, for extraction by a subsequent candidate word screening algorithm, and a candidate word list is generated according to the inference result for the erroneous word in the OCR conversion error rule inference model.

8. The single-template workflow optimization method according to claim 1, wherein in step S5, the screening for the candidate word list is divided into two parts,

9. The single-template workflow optimization method of claim 8, wherein in step S5, the screening algorithm for the candidate word list uses the TF-IDF algorithm and cosine similarity to compute a score for each candidate word at the context position of the suspected erroneous word, the score representing how reasonable the sentence formed by the candidate word and the context of the error position in the text to be corrected is, compares the degrees of similarity between the candidate words remaining after the BiGram template picture knowledge base comparison and the erroneous word in the text to be corrected, and selects the most reasonable candidate word.