CN113469029A

CN113469029A - Text recognition method and device for financial pdf scanned piece

Info

Publication number: CN113469029A
Application number: CN202110735367.8A
Authority: CN
Inventors: 金鑫; 李鹏辉
Original assignee: Shanghai Alphainsight Technology Co ltd
Current assignee: Shanghai Alphainsight Technology Co ltd
Priority date: 2021-06-30
Filing date: 2021-06-30
Publication date: 2021-10-01

Abstract

The invention discloses a text recognition method for financial pdf scanned pieces, comprising: creating an image generation template; inserting template information into the image generation template and generating training images with it; training a text recognition model using the generated training images as training data; and recognizing the pdf scanned piece using the text recognition model. The invention also discloses a text recognition device for financial pdf scanned pieces, comprising a template creating module, a training image generating module, a text recognition model training module, a text recognition service module and a verification module. The method and device do not require a large amount of manual labeling, can automatically recognize pdf scanned pieces under complex conditions such as blurred fonts, tilted text and watermarks, have high recognition efficiency, and improve the recognition accuracy of pdf scanned pieces.

Description

Text recognition method and device for financial pdf scanned piece

Technical Field

The invention belongs to the technical field of text recognition, and particularly relates to a text recognition method and device for a financial pdf scanned piece.

Background

In recent years, deep learning techniques have been widely applied in many fields such as graphics and images, natural language processing and automatic driving, and their performance is clearly superior to that of traditional methods.

In text information processing, there are a large number of images of different styles, and the prior art still has many problems in extracting information from them. For example, a large labeled corpus is needed that covers massive numbers of character permutations and combinations in different fonts and sizes; the background colors and layout types of the images also vary widely; and complex conditions such as font blurring, tilted text and watermarks exist. Relying on manual labeling alone requires a large amount of manpower, is error-prone, and is not cost-effective.

Disclosure of Invention

1. Problems to be solved

In the prior art, when text recognition is performed on a pdf scanned piece, image information is difficult to extract, the workload is large, and errors occur easily. To address these problems, the invention provides a text recognition method and device for financial pdf scanned pieces. Template creation is used to avoid the time consumption and low efficiency of manual labeling, and recent deep learning results are used to further improve the recognition effect.

2. Technical scheme

In order to solve the above problems, the present invention adopts the following technical solutions.

A text recognition method for a financial pdf scan comprises the following steps:

step 1, creating an image generation template;

step 2, inserting template information into the image generation template, and generating a training image by using the image generation template;

step 3, training a text recognition model by using the generated training image as training data;

step 4, recognizing the pdf scanned piece by using the text recognition model.

The preferable technical scheme is as follows:

In the text recognition method for a financial pdf scanned piece as described above, the template information includes a layout, a font, a background and a watermark pattern.

In the text recognition method for a financial pdf scanned piece as described above, the template information comes from a scanned piece, a non-scanned piece or random generation.

In the text recognition method for a financial pdf scanned piece as described above, the text recognition model comprises:

the text gradient detection model, which is used for detecting the text gradient (tilt angle) of the whole image page and correcting the tilted text;

the character detection model, which is used for detecting the coordinates of the text box containing each line of characters after the tilted text has been corrected;

the character recognition model, which is used for recognizing each character inside the text box located at the given text box coordinates;

and the text structuring model, which is used for converting the lines of characters recognized by the character recognition model into structured data.

The text recognition method for a financial pdf scanned piece as described above further includes, after step 4: checking the recognition result of the text recognition model.

In the text recognition method for a financial pdf scanned piece as described above, the text recognition check comprises the following steps:

step 31, according to the detection result of the text gradient detection model, comparing the detected text gradient with the actual angle, and judging that the detection is wrong if the error exceeds 5 degrees;

step 32, according to the detection result of the character detection model, computing the IOU error between the detected text region and the actual region, and judging that the detection is wrong if the error exceeds 20%;

step 33, according to the recognition result of the character recognition model, comparing whether the recognized text content is the same as the actual content, and if not, judging that the recognition is wrong;

and step 34, according to the processing result of the text structuring model, comparing whether the generated structured data is the same as the actual structured data, and if not, judging that the structuring is wrong.
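
Steps 31 to 34 can be sketched in Python as follows. This is a minimal illustration rather than the implementation of the invention: the function names, the dictionary keys (`angle`, `box`, `text`, `fields`), the use of a single axis-aligned box per sample, and the reading of "IOU error" as 1 minus the IOU are all assumptions made for the example. Only the 5 degree and 20% thresholds come from the text.

```python
from typing import Dict, Tuple

# (x_min, y_min, x_max, y_max); a single axis-aligned box is an assumption
Box = Tuple[float, float, float, float]

ANGLE_TOLERANCE_DEG = 5.0   # threshold stated in step 31
IOU_ERROR_TOLERANCE = 0.20  # threshold stated in step 32


def angle_error(detected_deg: float, actual_deg: float) -> float:
    """Smallest absolute difference between two angles, in degrees."""
    diff = abs(detected_deg - actual_deg) % 360.0
    return min(diff, 360.0 - diff)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def check_sample(pred: dict, truth: dict) -> Dict[str, bool]:
    """Run the four checks on one sample; True means the stage passed."""
    return {
        # step 31: angle error must not exceed 5 degrees
        "gradient": angle_error(pred["angle"], truth["angle"]) <= ANGLE_TOLERANCE_DEG,
        # step 32: IOU error (assumed to be 1 - IOU) must not exceed 20%
        "detection": (1.0 - iou(pred["box"], truth["box"])) <= IOU_ERROR_TOLERANCE,
        # step 33: recognized text must match exactly
        "recognition": pred["text"] == truth["text"],
        # step 34: structured data must match exactly
        "structuring": pred["fields"] == truth["fields"],
    }
```

Because the training images are generated from templates, the "actual" angle, region, text and structured data used as `truth` are known at generation time, so a check of this kind needs no manual annotation.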

As another aspect of the present application, there is provided an apparatus for implementing the method for text recognition of a pdf scan according to any one of the above paragraphs, comprising a template creating module, a training image generating module, a text recognition model training module, a text recognition service module, and a verification module;

the template creating module is used for creating an image generating template;

the training image generation module is used for generating a training image according to the image generation template;

the text recognition model training module is used for training a text recognition model by using the generated training image as training data;

the text recognition service module is used for recognizing the pdf scanning piece according to the trained text recognition model;

and the verification module is used for checking the recognition result of the text recognition service module.

The preferable technical scheme is as follows:

in the above-described apparatus, the template information of the image generation template comes from a scanned piece, a non-scanned piece or random generation.

3. Advantageous effects

Compared with the prior art, the invention has the beneficial effects that:

(1) when the image generation template is created, template information is inserted into it, where the template information can come from a pdf scanned piece, from a non-scanned piece, or be randomly generated; this ensures that the data prepared before training is comprehensive and safeguards the recognition accuracy of the subsequently trained text recognition model;

(2) when performing text recognition, the text recognition model first detects the text gradient of the image and corrects it to remove text tilt, then determines the coordinates of the text box for each line of characters through the character detection model, and then recognizes each character within each text box based on the character recognition model; finally, the recognized lines of characters are converted into structured data through the text structuring model, completing text recognition of the pdf scanned piece; the text recognition method for financial pdf scanned pieces does not need a large amount of manual labeling, can automatically recognize pdf scanned pieces under complex conditions such as blurred fonts, tilted text and watermarks, has high recognition efficiency, and improves the recognition accuracy of pdf scanned pieces;

(3) when the text recognition model is trained on training images generated with the image generation template, a verification mechanism can be established in the recognition process of the text recognition model, with verification criteria for excessive text gradient, excessive IOU error and inconsistent recognized characters, so that the model can automatically recognize pdf scanned pieces under complex conditions such as tilted text, blurred fonts and watermarks; this provides a new solution for text recognition of pdf scanned pieces, with wide applicability and good prospects for use.

Drawings

FIG. 1 is a flow chart of a method for text recognition of a pdf scan according to the present invention;

FIG. 2 is a block diagram of a text recognition model according to the present invention;

FIG. 3 is a flow chart of text recognition verification of the present invention;

FIG. 4 is a block diagram of a text recognition apparatus for a pdf scan according to the present invention;

in the figure: 1. a template creation module; 2. a training image generation module; 3. a text recognition model training module; 4. a text recognition service module; 5. a verification module.

Detailed Description

The invention is further described with reference to specific embodiments and the accompanying drawings.

Example 1

As shown in fig. 1-2, a method for text recognition of a pdf scan of finance type includes:

step 1, creating an image generation template. In this embodiment, the purpose of creating the image generation template is to expand the text image in each text image sample through the template, so as to obtain the extended text image samples corresponding to each text image sample.

Step 2, inserting template information into the image generation template, and generating a training image by using the image generation template.

The template information comprises a layout, a font, a background and a watermark pattern, and comes from a scanned piece, a non-scanned piece or random generation. In step 2, training images are generated from the template together with randomly generated characters or characters randomly sampled from other data sources, and operations such as blurring, rotation and noise addition are applied to the images so that they fit real-scene data and complete the sample expansion. This ensures that the data prepared before training is comprehensive and safeguards the recognition accuracy of the subsequently trained text recognition model.
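
The generation step can be sketched as follows, using only the Python standard library. The sketch stops at a sample specification: it records which template, which random text and which blur, rotation and noise parameters a renderer would use, and leaves the actual drawing to an imaging library. The class names, the character set and all numeric ranges are hypothetical choices for illustration, not values from the invention.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Template:
    """The four kinds of template information named in the text."""
    layout: str
    font: str
    background: str
    watermark: str


@dataclass
class SampleSpec:
    """Everything needed to render one training image, plus its labels."""
    template: Template
    lines: List[str]      # ground-truth text, known by construction
    rotation_deg: float   # ground-truth tilt angle, known by construction
    blur_radius: float
    noise_level: float


# Hypothetical character set; a real one would cover the target documents.
CHARSET = "0123456789.,%-ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def random_line(rng: random.Random, length: int) -> str:
    """Randomly generated characters, one of the text sources in step 2."""
    return "".join(rng.choice(CHARSET) for _ in range(length))


def make_sample(template: Template, rng: random.Random, n_lines: int = 3) -> SampleSpec:
    """Draw random text and augmentation parameters for one image."""
    lines = [random_line(rng, rng.randint(4, 12)) for _ in range(n_lines)]
    return SampleSpec(
        template=template,
        lines=lines,
        rotation_deg=rng.uniform(-15.0, 15.0),  # assumed tilt range
        blur_radius=rng.uniform(0.0, 2.0),      # assumed blur range
        noise_level=rng.uniform(0.0, 0.1),      # assumed noise range
    )
```

Since the text and the rotation angle are chosen before rendering, every generated image comes with its labels, which is why this approach avoids manual annotation.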

Step 3, training the text recognition model by using the generated training image as training data.

In this embodiment, training starts from a model trained on general data (such as text images and labels downloaded from the network), for example an OCR pre-training model. For the OCR pre-training model, more than 100 pieces of training data are needed, and training is carried out on a GPU for one week, until the accuracy of the model on the test set reaches more than 95%.

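In this embodiment, the OCR pre-training model includes:

the text gradient detection model, which is used for detecting the text gradient of the whole image page and correcting the tilted text;

the character detection model, which is used for detecting the coordinates of the text box containing each line of characters after the tilted text has been corrected;

the character recognition model, which is used for recognizing each character inside the text box located at the given text box coordinates;

and the text structuring model, which is used for converting the lines of characters recognized by the character recognition model into structured data.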

Step 4, recognizing the pdf scanned piece by using the text recognition model.

When performing text recognition, the text recognition model first detects the text gradient of the image and corrects it to remove text tilt, then determines the coordinates of the text box for each line of characters through the character detection model, and then recognizes each character within each text box based on the character recognition model; finally, the recognized lines of characters are converted into structured data through the text structuring model, completing text recognition of the pdf scanned piece. The text recognition method for financial pdf scanned pieces does not need a large amount of manual labeling, can automatically recognize pdf scanned pieces under complex conditions such as blurred fonts, tilted text and watermarks, has high recognition efficiency, and improves the recognition accuracy of pdf scanned pieces.
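
The order in which the four models are applied can be sketched as a simple pipeline. The class, the type aliases and the stub functions below are hypothetical: the stubs stand in for trained models and treat a "page" as a list of text lines purely so that the example runs without any model. Only the sequence of stages is taken from the description.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Page = List[str]                 # stand-in for an image page (assumption)
Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max)


@dataclass
class TextRecognitionPipeline:
    """Chains the four models in the order given in the description."""
    correct_gradient: Callable[[Page], Page]         # text gradient detection model
    detect_lines: Callable[[Page], List[Box]]        # character detection model
    recognize_line: Callable[[Page, Box], str]       # character recognition model
    structure: Callable[[List[str]], Dict[str, str]] # text structuring model

    def run(self, page: Page) -> Dict[str, str]:
        upright = self.correct_gradient(page)           # 1. detect and correct tilt
        boxes = self.detect_lines(upright)              # 2. one box per text line
        lines = [self.recognize_line(upright, box) for box in boxes]  # 3. read each box
        return self.structure(lines)                    # 4. lines -> structured data


# --- Stub stages, for illustration only ---

def stub_correct_gradient(page: Page) -> Page:
    return page  # pretend the page is already upright


def stub_detect_lines(page: Page) -> List[Box]:
    return [(0, i, len(line), i + 1) for i, line in enumerate(page)]


def stub_recognize_line(page: Page, box: Box) -> str:
    return page[box[1]]  # the row index selects the line


def stub_structure(lines: List[str]) -> Dict[str, str]:
    result: Dict[str, str] = {}
    for line in lines:
        key, sep, value = line.partition(":")
        if sep:  # keep only lines that look like "key: value"
            result[key.strip()] = value.strip()
    return result


pipeline = TextRecognitionPipeline(
    stub_correct_gradient, stub_detect_lines, stub_recognize_line, stub_structure
)
```

Keeping each stage behind its own callable mirrors the description, in which each model has a separate responsibility and its output can be checked separately.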

Example 2

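Substantially the same as Example 1. In order to further ensure the accuracy and efficiency of text recognition on the pdf scanned piece, in this embodiment, when training the text recognition model, the method further includes, after step 4: checking the recognition result of the text recognition model. As shown in fig. 3, the text recognition check includes:

step 31, according to the detection result of the text gradient detection model, comparing the detected text gradient with the actual angle, and judging that the detection is wrong if the error exceeds 5 degrees;

step 32, according to the detection result of the character detection model, computing the IOU error between the detected text region and the actual region, and judging that the detection is wrong if the error exceeds 20%;

step 33, according to the recognition result of the character recognition model, comparing whether the recognized text content is the same as the actual content, and if not, judging that the recognition is wrong;

and step 34, according to the processing result of the text structuring model, comparing whether the generated structured data is the same as the actual structured data, and if not, judging that the structuring is wrong.

When the text recognition model of this embodiment is trained on training images generated with the image generation template, a verification mechanism can be established in the recognition process of the text recognition model, with verification criteria for excessive text gradient, excessive IOU error and inconsistent recognized characters, so that the model can automatically recognize pdf scanned pieces under complex conditions such as tilted text, blurred fonts and watermarks. This provides a new solution for text recognition of pdf scanned pieces, with wide applicability and good prospects for use.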

Example 3

A text recognition apparatus for a financial pdf scan, which is used to implement the text recognition method for a financial pdf scan described in embodiment 1, as shown in fig. 4, includes a template creation module 1, a training image generation module 2, a text recognition model training module 3, a text recognition service module 4, and a verification module 5;

the template creating module 1 is used for creating an image generating template, wherein the template information of the image generating template is from a scanning piece, a non-scanning piece or random generation;

the training image generation module 2 is used for generating a training image according to the image generation template;

the text recognition model training module 3 is used for training a text recognition model by using the generated training image as training data;

the text recognition service module 4 is used for recognizing the pdf scanned piece according to the trained text recognition model;

and the verification module 5 is used for checking the recognition result of the text recognition service module 4.

In the text recognition apparatus provided in this embodiment, only the division of the functional modules is exemplified, and in practical applications, the functions may be allocated to different functional modules according to needs, that is, the internal structure of the apparatus may be divided into different functional modules, so as to complete all or part of the functions described above. In addition, the text recognition apparatus of this embodiment and the text recognition method embodiment in the above embodiments belong to the same concept, and specific implementation processes and beneficial effects thereof are described in detail in the text recognition method embodiment, and are not described herein again.

The examples described herein are merely illustrative of the preferred embodiments of the present invention and do not limit the spirit and scope of the present invention, and various modifications and improvements made to the technical solutions of the present invention by those skilled in the art without departing from the design concept of the present invention shall fall within the protection scope of the present invention.

Claims

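1. A text recognition method for a financial pdf scanned piece, characterized in that the method comprises the following steps:

step 1, creating an image generation template;

step 2, inserting template information into the image generation template, and generating a training image by using the image generation template;

step 3, training a text recognition model by using the generated training image as training data;

step 4, recognizing the pdf scanned piece by using the text recognition model.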

2. The method of claim 1, wherein the template information comprises a layout, a font, a background and a watermark pattern.

3. The method of claim 1, wherein the template information comes from a scanned piece, a non-scanned piece or random generation.

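4. The method of claim 1, wherein the text recognition model comprises:

the text gradient detection model, which is used for detecting the text gradient of the whole image page and correcting the tilted text;

the character detection model, which is used for detecting the coordinates of the text box containing each line of characters after the tilted text has been corrected;

the character recognition model, which is used for recognizing each character inside the text box located at the given text box coordinates;

and the text structuring model, which is used for converting the lines of characters recognized by the character recognition model into structured data.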

5. The method of claim 4, further comprising, after step 4: checking the recognition result of the text recognition model.

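6. The method of claim 5, wherein the text recognition check comprises:

step 31, according to the detection result of the text gradient detection model, comparing the detected text gradient with the actual angle, and judging that the detection is wrong if the error exceeds 5 degrees;

step 32, according to the detection result of the character detection model, computing the IOU error between the detected text region and the actual region, and judging that the detection is wrong if the error exceeds 20%;

step 33, according to the recognition result of the character recognition model, comparing whether the recognized text content is the same as the actual content, and if not, judging that the recognition is wrong;

and step 34, according to the processing result of the text structuring model, comparing whether the generated structured data is the same as the actual structured data, and if not, judging that the structuring is wrong.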

7. An apparatus for implementing the method of text recognition of a pdf scan according to any one of claims 1 to 6, comprising a template creation module, a training image generation module, a text recognition model training module, a text recognition service module, and a verification module;

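the template creating module is used for creating an image generating template;

the training image generation module is used for generating a training image according to the image generation template;

the text recognition model training module is used for training a text recognition model by using the generated training image as training data;

the text recognition service module is used for recognizing the pdf scanned piece according to the trained text recognition model;

and the verification module is used for checking the recognition result of the text recognition service module.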

8. The apparatus of claim 7, wherein the template information of the image generation template is from a scanned piece, a non-scanned piece, or a random generation.