CN117333893A

CN117333893A - OCR-based custom template image recognition method, system and storage medium

Info

Publication number: CN117333893A
Application number: CN202311231362.7A
Authority: CN
Inventors: 孙觉予; 宋卫平; 李欢欢; 徐小云; 杨帆; 阮正平; 佘文魁; 邓大建; 王红蕾; 叶鑫平; 李军
Original assignee: Sichuan Zhongdian Aostar Information Technologies Co ltd
Current assignee: Sichuan Zhongdian Aostar Information Technologies Co ltd
Priority date: 2023-09-22
Filing date: 2023-09-22
Publication date: 2024-01-02

Abstract

The invention discloses an OCR-based custom template image recognition method, system and storage medium. OCR recognition is performed on an image, the recognized text is converted into structured data, and the structured data is then error-corrected using natural language processing technology. Specifically, the recognized text is segmented into words by natural language processing, suspected errors are detected at both character granularity and word granularity, and all error positions are traversed; the characters at each error position are replaced with candidates from a shape-near (visually similar) character dictionary, and a language model recalculates the score to obtain the best correction for replacement. By segmenting and checking the structured data, the invention determines its correctness and completeness and corrects recognition errors, thereby verifying the structured data, and has good practicability.

Description

OCR-based custom template image recognition method, system and storage medium

Technical Field

The invention belongs to the technical field of image recognition, and particularly relates to a custom template image recognition method, system and storage medium based on OCR.

Background

OCR (Optical Character Recognition) refers to the process in which an electronic device (e.g., a scanner or digital camera) examines characters printed on paper, determines their shapes by detecting patterns of dark and light, and then translates the shapes into computer text using a character recognition method. For printed characters, the characters in a paper document are optically converted into a black-and-white dot-matrix image file, and recognition software converts the characters in the image into text format for further editing and processing by word processing software.

Structured information output is a technique that uses a template to reprocess the key information obtained by recognition into a document conforming to a given format. Information obtained by OCR recognition is naturally suited to such reprocessing: recognizing a document yields the text and position of every identifiable character string, large volumes of information can be obtained by batch-scanning paper files or electronic images, and secondary processing of that information can output large amounts of structured information in a fixed, easily accessible format.

The existing OCR recognition method for text comprises the following steps:

(1) The paper document is converted into an electronic image by photographing, scanning, or the like.

(2) Electronic image preprocessing: the main purpose of image preprocessing is to eliminate irrelevant information in an image, recover useful real information, enhance the detectability of relevant information and simplify the data as far as possible, thereby improving the reliability of feature extraction, image segmentation, matching and recognition. For example, graying commonly uses four methods: the component method, the maximum value method, the average value method and the weighted average method.

(3) Electronic image character recognition output: this step mainly uses OCR technology to recognize the processed image and obtain the key character information. Common OCR platforms include Tesseract, the open-source OCR engine from Google; PaddleOCR, the open platform from Baidu; and CNN character recognition frameworks based on deep learning.
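The four graying methods named in step (2) can be sketched as follows. This is a minimal per-pixel illustration, not the implementation used by any particular OCR platform; the function names are hypothetical, and the weights in the weighted average method are the common ITU-R BT.601 luma coefficients, which the source does not specify.

```python
from typing import Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each in 0..255


def gray_component(p: Pixel, channel: int = 1) -> int:
    """Component method: take one channel (green by default) as the gray value."""
    return p[channel]


def gray_maximum(p: Pixel) -> int:
    """Maximum value method: take the brightest of the three channels."""
    return max(p)


def gray_average(p: Pixel) -> int:
    """Average value method: arithmetic mean of the three channels."""
    return round(sum(p) / 3)


def gray_weighted(p: Pixel, weights=(0.299, 0.587, 0.114)) -> int:
    """Weighted average method; the default weights are an assumption (BT.601)."""
    r, g, b = p
    return round(weights[0] * r + weights[1] * g + weights[2] * b)
```

Applied to every pixel of an RGB image, each function yields a single-channel gray image; the weighted average is usually preferred because it approximates perceived brightness.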

The prior patent 202011588284.2 discloses a text and picture recognition system based on OCR technology, which recognizes fixed-format or free-format pictures through a custom template and returns a structured recognition result. The system calls a designated OCR recognition model and version from a built-in model library for recognition and analysis. After the analysis result is obtained, and where anchor points exist, pixel coordinates are converted between the actual input bill and the template bill according to the anchor point content, the content of the region to be recognized is extracted, and format conversion is performed according to the data type at the corresponding position in the template bill; finally, the structured recognition content is obtained and displayed on the bill template. However, when the template file cannot be fully matched with the recognized image, or when the image is blurred and a good recognition result cannot be achieved even after preprocessing, the structured information may be misaligned or misrecognized.

The prior patent 202110756421.7 discloses a generation method of a custom recognition template, a recognition method of a certificate and a device, wherein a target sample image of a target object is obtained; responding to the selected operation, respectively obtaining a first area of a field to be identified in the target sample image and a second area of a reference field in the target sample image; establishing a relative position relation between the first area and the second area in the target sample image; based on the relative positional relationship, and the reference field, an identification template of the target object is generated. Therefore, the custom recognition template can be flexibly generated, and the user can customize the fields to be recognized, so that the limitation that the universal template cannot recognize all types of images is eliminated, and the method has stronger flexibility.

However, when structured information is currently output through a custom recognition template, research shows that content misalignment or text extraction errors occur when image quality is low. For poor-quality images, OCR may still fail to recognize characters correctly even after preprocessing such as binarization, smoothing and denoising, and tilt correction, which introduces a certain misrecognition rate into the structuring of the recognized data.

Disclosure of Invention

The invention aims to provide an OCR-based custom template image recognition method, system and storage medium that solve the above problems by locating and correcting errors in the recognized structured data based on natural language processing technology.

The invention is realized mainly by the following technical scheme:

An OCR-based custom template image recognition method comprises the following steps:

step S1: acquiring an image and performing image preprocessing;

step S2: OCR recognition is carried out on the image, and all texts and position information are obtained;

step S3: positioning and matching the identified text and the position information based on the positioning text and the position information in the configuration file;

step S4: after all positioning information has been matched, obtaining the scaling ratio between the template and the image to be recognized from the distance between positioning items read from the template file and the distance between the corresponding positioning items obtained by OCR recognition, and obtaining the position of the information to be recognized from the scaling ratio and the center point of the image;

step S5: matching with OCR recognition information of a corresponding positioning position, and converting the recognition text into structured data;

step S6: performing error correction on the structured data based on natural language processing technology: the recognized text is segmented into words by natural language processing, suspected errors are detected at both character granularity and word granularity, all error positions are traversed, the characters at each error position are replaced with candidates from a shape-near (visually similar) character dictionary, and a language model recalculates the score to obtain the best correction for replacement.

In order to better implement the present invention, further, the step S1 includes the steps of:

step S101: the image is preprocessed and then OCR-recognized for the first time;

step S102: all text angles are traversed and sorted;

step S103: the median text angle and the angles immediately before and after it are selected, summed and averaged to obtain the rotation angle of the image;

step S104: the original picture is then angle-corrected according to the rotation angle to facilitate positioning of the subsequent text.

In order to better implement the present invention, further, in step S101 the image is enhanced: an image with low contrast is enhanced by histogram equalization, or an overexposed image is enhanced by gamma transformation.

In order to better implement the present invention, further, in step S3 the first positioning information matching is performed: the positioning texts in the configuration file and the texts recognized in the picture are traversed, and a match is considered successful if the similarity of two texts is greater than a set threshold; if several positioning texts each have a similarity greater than the set threshold with the same recognized text, that recognized text is discarded; finally, the recognized texts whose similarity exceeds the threshold and which have not been discarded are saved for subsequent use.

In order to better implement the present invention, further, in step S6, if a character has been misrecognized, the recognized text is segmented into words by natural language processing after recognition; if extra information has been recognized because of image interference, the recognized text is likewise segmented, and when no valid replacement can be found in the shape-near character dictionary, the erroneous part is dropped, thereby performing error correction.

The invention is realized mainly by the following technical scheme:

An OCR-based custom template image recognition system, which carries out the recognition method described above, comprises an image preprocessing module, an image OCR recognition extraction module, a positioning matching module, a positioning position extraction module, a structuring module and a correction module. The image preprocessing module preprocesses the image; the image OCR recognition extraction module performs extraction on the preprocessed image to obtain the recognized text and position information; the positioning matching module positions and matches the recognized text and position information against the positioning text and position information; the positioning position extraction module processes the positioning information from the template file together with the recognized positioning information to obtain the position of each text in the recognized image; the structuring module converts the recognized text into structured data according to that position; and the correction module corrects the structured data based on a natural language processing method.

A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the identification method described above.

The beneficial effects of the invention are as follows:

The invention locates and corrects errors in the recognized structured data based on natural language processing technology. By segmenting and checking the structured data, it determines the correctness and completeness of the data, corrects recognition errors and reports them to a human for judgment, thereby verifying the structured data. The invention thus realizes automatic checking and correction of structured data recognized from digital images and reduces the associated labor cost.

Drawings

FIG. 1 is an overall flow chart of the OCR-based custom template image recognition method of the present invention;

FIG. 2 is a flow chart of image preprocessing;

FIG. 3 is a flow chart of matching positioning text;

FIG. 4 is a flow chart for structured output of identification text information;

FIG. 5 is a flow chart of error correction of structured data information.

Detailed Description

Example 1:

An OCR-based custom template image recognition method, as shown in fig. 1, comprises the following steps:

step S1: acquiring an image and performing image preprocessing;

step S6: and carrying out error correction on the structured data information based on natural language processing technology.

Preferably, misrecognition of similar-looking characters, such as (sorghum-sorghum), occurs easily in OCR (optical character recognition), especially when image quality is poor. After the characters are recognized, the recognized text is segmented into words by natural language processing; if the text contains wrong characters, the segmentation result will often be abnormal. Suspected errors are therefore detected at both character granularity and word granularity, all error positions are traversed, the characters at each error position are replaced with candidates from a shape-near (visually similar) character dictionary, and a language model then recalculates the score to obtain the best correction for replacement.

For information that should not have been recognized but was picked up because of image interference, the resulting text is meaningless in most cases and can be detected by the same method; when no valid replacement can be found in the shape-near character dictionary, the erroneous part is dropped, thereby performing error correction.

Preferably, in standard OCR text recognition, recognizing an image returns the recognized text and the corresponding coordinate information, but not structured data. As shown in fig. 1, the positioning text and position information form the positioning part of the configuration file to be written, which contains the character strings of the positioning texts to be matched and the position of each positioning text. Throughout the positioning process, unknown text information and its position information are fixed in a positioning template according to the required robustness thresholds. The size ratio between the recognized image and the template file is calculated from the distances between the positions of the different positioning texts. Once the required positions have been calculated, the text to be recognized is matched with the recognized text at each positioning position, and all the information is integrated and output as structured information.

Preferably, as shown in fig. 2, the image must be preprocessed when the image file is obtained in order to improve the recognition rate, and it must also be enhanced. Because images are generally captured as electronic images by scanning or photographing, the resulting image may be affected by blur, shadow, distortion, occlusion, reflection, overexposure, background interference (such as an overlapping seal) and the like. During preliminary image processing, different enhancement methods are therefore applied: histogram equalization for images with low contrast, and gamma transformation for overexposed images. The first OCR recognition is then performed on the enhanced image. Either the best result is kept, or a single best enhancement method is determined for each application scenario so as to reduce system running time and improve performance.
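The two enhancement methods mentioned above can be sketched as follows. This is a simplified illustration on a flat list of gray values rather than a production image pipeline; the function names and the choice of a gamma greater than 1 to darken an overexposed image are assumptions, not details given in the source.

```python
from typing import List


def equalize_histogram(pixels: List[int], levels: int = 256) -> List[int]:
    """Histogram equalization: spread gray values over the full range
    using the cumulative distribution of the input values."""
    if not pixels:
        return []
    hist = [0] * levels
    for v in pixels:
        hist[v] += 1
    cdf, total = [], 0
    for count in hist:
        total += count
        cdf.append(total)
    cdf_min = next(c for c in cdf if c > 0)  # CDF at the darkest value present
    n = len(pixels)
    if n == cdf_min:  # constant image: nothing to equalize
        return list(pixels)
    return [round((cdf[v] - cdf_min) / (n - cdf_min) * (levels - 1)) for v in pixels]


def gamma_transform(pixels: List[int], gamma: float, levels: int = 256) -> List[int]:
    """Gamma transformation: out = top * (in / top) ** gamma.
    gamma > 1 darkens mid-tones, which suits overexposed images."""
    top = levels - 1
    return [round(top * (v / top) ** gamma) for v in pixels]
```

A low-contrast strip such as `[100, 100, 110, 120]` is stretched by `equalize_histogram` so that its darkest value maps to 0 and its brightest to 255, while `gamma_transform(pixels, 2.0)` leaves pure black and pure white unchanged and pulls mid-tones down.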

The first OCR recognition yields an initial recognition result. Because the output information must be formatted, the angle of the picture needs to be corrected so that the required positioning information can be located correctly. In this step, the angles of all text items are sorted, the three middle angles (the median and the values immediately before and after it) are selected and averaged to obtain the rotation angle of the image, and the image is then corrected to facilitate positioning of the subsequent text.
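The rotation angle estimate described above (sort, take the median and its two neighbours, average) can be sketched as follows. The function name is hypothetical, and the handling of lists with fewer than three angles is an assumption the source does not address.

```python
from typing import List


def estimate_rotation_angle(text_angles: List[float]) -> float:
    """Average of the median angle and its immediate neighbours.

    Using only the middle of the sorted list keeps a few badly
    detected text boxes from skewing the estimate.
    """
    if not text_angles:
        raise ValueError("at least one text angle is required")
    ordered = sorted(text_angles)
    mid = len(ordered) // 2
    # Slice clamps itself at the list ends, so short lists still work.
    window = ordered[max(0, mid - 1): mid + 2]
    return sum(window) / len(window)
```

For example, with detected angles `[1.0, 2.0, 3.0, 40.0, -30.0]` the outliers 40.0 and -30.0 fall outside the middle window, so only 1.0, 2.0 and 3.0 contribute to the estimate.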

Preferably, as shown in fig. 3, after the image correction, the preprocessing work for the picture is finished, and then the corrected picture is subjected to the second OCR recognition to obtain all text information in the current picture and the position information thereof.

The first positioning information matching is then performed: the positioning texts and the texts recognized in the picture are traversed, and a match is considered successful if the similarity of two texts is greater than a set threshold; if several texts each have a similarity greater than the set threshold with the same text, those texts are discarded. Finally, the texts whose similarity exceeds the threshold and which have not been discarded are saved for subsequent use.
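A minimal sketch of this matching step is given below. The source does not name a similarity measure, so the ratio from Python's standard `difflib.SequenceMatcher` is used here as a stand-in; the function name, the threshold of 0.8 and the rule of keeping the best-scoring recognized text per positioning text are likewise assumptions.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple

Box = Tuple[int, int]                 # simplified position: (x, y)
Recognized = Tuple[str, Box]          # (recognized text, position)


def match_anchors(anchor_texts: List[str],
                  recognized: List[Recognized],
                  threshold: float = 0.8) -> Dict[str, Recognized]:
    """Map each positioning text to the recognized text it matches."""
    # recognized index -> [(positioning text, similarity)] above the threshold
    hits: Dict[int, List[Tuple[str, float]]] = {}
    for anchor in anchor_texts:
        for idx, (text, _box) in enumerate(recognized):
            sim = SequenceMatcher(None, anchor, text).ratio()
            if sim > threshold:
                hits.setdefault(idx, []).append((anchor, sim))

    matches: Dict[str, Recognized] = {}
    best: Dict[str, float] = {}
    for idx, found in hits.items():
        if len(found) > 1:
            continue  # several positioning texts claim this text: discard it
        anchor, sim = found[0]
        if sim > best.get(anchor, 0.0):  # assumption: keep the closest match
            best[anchor] = sim
            matches[anchor] = recognized[idx]
    return matches
```

With positioning texts `["Invoice No", "Date"]`, a recognized string `"Invoice N0"` still matches because one wrong character leaves the similarity above the threshold, whereas a recognized `"Name"` claimed by both `"Name"` and `"Name:"` would be discarded as ambiguous.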

Preferably, as shown in fig. 4, after all positioning information has been located, the scaling ratio between the template and the image to be recognized can be obtained from the distance between positioning items in the template and the distance between the corresponding positioning items located in the picture. The position of each text to be recognized can then be obtained from the scaling ratio and the center point of the image, and the recognized text can be converted into structured data according to that position.
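The source does not give the exact formulas for this step, so the following is only one plausible reading: the scaling ratio is the ratio of the distance between two positioning points in the image to the distance between the same two points in the template, and a template position is mapped into the image by scaling its offset from the template center and adding it to the image center. All function and parameter names are hypothetical, and the sketch assumes uniform scaling with no residual rotation.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def scale_ratio(template_a: Point, template_b: Point,
                image_a: Point, image_b: Point) -> float:
    """Image-to-template scale from one pair of matched positioning points."""
    template_dist = math.dist(template_a, template_b)
    if template_dist == 0:
        raise ValueError("template positioning points must be distinct")
    return math.dist(image_a, image_b) / template_dist


def map_to_image(template_point: Point, template_center: Point,
                 image_center: Point, scale: float) -> Point:
    """Scale the offset from the template center and re-apply it at the image center."""
    dx = template_point[0] - template_center[0]
    dy = template_point[1] - template_center[1]
    return (image_center[0] + scale * dx, image_center[1] + scale * dy)
```

For instance, if two positioning texts are 200 pixels apart in the template and 400 pixels apart in the photographed image, the ratio is 2.0, and a field located 50 pixels right of and below the template center is expected 100 pixels right of and below the image center. In practice the ratio could be averaged over several pairs of positioning points to reduce the effect of a single badly located box.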

Preferably, as shown in fig. 5, after the structured data conversion is completed, the recognized data is traversed and natural language processing is performed. The text is segmented into words, suspected errors are detected at both character granularity and word granularity, all error positions are traversed, the characters at each error position are replaced with candidates from the shape-near character dictionary, and a language model then recalculates the score to obtain the best correction for replacement. When no valid replacement can be found in the shape-near character dictionary, the erroneous part is removed, thereby performing error correction.
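The correction loop can be illustrated with a deliberately small toy. Everything here is an assumption made for illustration: the tokens are taken as already segmented, a word-frequency table stands in for the language model, a handful of Latin look-alike characters stand in for the shape-near dictionary, and only one character is replaced per candidate. A real system for Chinese text would use a proper segmenter, a shape-near character dictionary and a trained language model.

```python
from typing import Dict, List, Optional

# Hypothetical shape-near dictionary: character -> visually similar characters.
SHAPE_NEAR: Dict[str, List[str]] = {
    "0": ["o", "O"],
    "1": ["l", "I"],
    "5": ["s", "S"],
    "l": ["1", "I"],
    "O": ["0"],
}

# Hypothetical stand-in for a language model: known words and their frequencies.
WORD_FREQ: Dict[str, int] = {"total": 50, "invoice": 40, "date": 30, "bill": 10}


def lm_score(word: str) -> int:
    """Toy language model score: frequency of the word, 0 if unknown."""
    return WORD_FREQ.get(word.lower(), 0)


def correct_token(token: str) -> Optional[str]:
    """Return the token, its best correction, or None if it should be dropped."""
    if lm_score(token) > 0:
        return token  # not a suspected error
    best, best_score = None, 0
    for i, ch in enumerate(token):            # traverse every character position
        for repl in SHAPE_NEAR.get(ch, []):   # try each shape-near candidate
            candidate = token[:i] + repl + token[i + 1:]
            score = lm_score(candidate)
            if score > best_score:            # keep the highest-scoring candidate
                best, best_score = candidate, score
    return best


def correct_field(tokens: List[str]) -> List[str]:
    """Correct each token; tokens with no valid replacement are removed."""
    corrected = (correct_token(t) for t in tokens)
    return [t for t in corrected if t is not None]
```

Here `correct_field(["t0tal", "#@%", "date"])` repairs the first token through the `0` to `o` substitution, drops the noise token because no substitution produces a known word, and leaves the last token unchanged.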

For structured data extracted from a recognized digital image, characters are often misrecognized, or non-existent characters are spuriously recognized, when image quality is poor or other interference is present; this interferes with the structured data, which is why the structured data error correction technique is provided. The invention determines the correctness and completeness of the structured data by segmenting and checking it, corrects recognition errors and reports them to a human for judgment, thereby verifying the structured data. The invention thus realizes automatic checking and correction of structured data recognized from digital images and reduces the associated labor cost.

The foregoing description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way, and any simple modification, equivalent variation, etc. of the above embodiment according to the technical matter of the present invention fall within the scope of the present invention.

Claims

1. The method for identifying the custom template image based on OCR is characterized by comprising the following steps:

step S1: acquiring an image and performing image preprocessing;

2. The OCR-based custom template image recognition method according to claim 1, wherein the step S1 comprises the steps of:

step S102: traversing and sorting through all text angles,

3. The OCR-based custom template image recognition method according to claim 2, wherein in step S101, the image is strengthened by a histogram equalization method, or the overexposed image is strengthened by a gamma conversion method.

4. The OCR-based custom template image recognition method according to claim 1, wherein in the step S3, the first positioning information matching is performed, the positioning text information in the configuration file and the text information of the recognized picture are traversed, and if the similarity of the two texts is greater than a set threshold, the matching is considered to be successful; if the similarity between the plurality of positioning text information and the same identification text is larger than a set threshold value, discarding the text information of the identified picture; and finally, saving the text information of the identified pictures with the similarity greater than the threshold value and which are not discarded for subsequent use.

5. The OCR-based custom template image recognition method according to any one of claims 1 to 4, wherein in the step S6, if a character has been misrecognized, the recognized text is segmented into words by natural language processing after recognition; if extra information has been recognized because of image interference, the recognized text is likewise segmented, and when no valid replacement can be found in the shape-near character dictionary, the erroneous part is dropped, thereby performing error correction.

6. An OCR-based custom template image recognition system which is carried out by adopting the recognition method as claimed in any one of claims 1 to 5 and is characterized by comprising an image preprocessing module, an image OCR recognition extraction module, a positioning matching module, a positioning position extraction module, a structuring module and a correction module; the image preprocessing module is used for preprocessing an image, the image OCR recognition extraction module is used for extracting the preprocessed image and obtaining recognition text and position information, the positioning matching module is used for positioning and matching the recognition text and the position information based on the positioning text and the position information, the positioning position extraction module is used for processing the positioning information based on the template file and the recognition positioning information to obtain the positioning position of the text of the recognition image, the structuring module is used for converting the recognition text into structured data according to the positioning position, and the correction module is used for correcting the structured data based on a natural language processing method.

7. A computer-readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the identification method as claimed in any one of claims 1-5.