CN113469029A - Text recognition method and device for financial pdf scanned piece - Google Patents

Publication number
CN113469029A
CN113469029A CN202110735367.8A CN202110735367A CN113469029A CN 113469029 A CN113469029 A CN 113469029A CN 202110735367 A CN202110735367 A CN 202110735367A CN 113469029 A CN113469029 A CN 113469029A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110735367.8A
Other languages
Chinese (zh)
Inventor
金鑫
李鹏辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Alphainsight Technology Co ltd
Original Assignee
Shanghai Alphainsight Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2021-06-30
Filing date
2021-06-30
Publication date
2021-10-01
Application filed by Shanghai Alphainsight Technology Co ltd
Priority to CN202110735367.8A
Publication of CN113469029A

Landscapes

  • Character Input (AREA)
  • Character Discrimination (AREA)

Abstract

The invention discloses a text recognition method for a financial pdf scanned piece, which comprises: creating an image generation template; inserting template information into the image generation template and generating training images by using the image generation template; training a text recognition model by using the generated training images as training data; and recognizing the pdf scanned piece by using the text recognition model. The invention also discloses a text recognition device for the financial pdf scanned piece, which comprises a template creating module, a training image generating module, a text recognition model training module, a text recognition service module and a verification module. The method and device do not require a large amount of manual labeling, can automatically recognize pdf scanned pieces under complex conditions such as blurred fonts, skewed orientation and watermarks, have high recognition efficiency, and improve the recognition accuracy of pdf scanned pieces.

Description

Text recognition method and device for financial pdf scanned piece
Technical Field
The invention belongs to the technical field of text recognition, and particularly relates to a text recognition method and device for a financial pdf scanned piece.
Background
In recent years, deep learning techniques have been widely applied in many fields such as graphics and images, natural language processing and automatic driving, and their performance is clearly superior to that of traditional methods.
In text information processing, there are a large number of images of different styles, and the prior art still has many problems in extracting information from such images. For example, a large labeled corpus is needed that covers massive permutations and combinations of characters in different fonts and sizes; the background colors and layout types of the images are varied; and complicated conditions such as font blurring, skewed orientation and watermarks exist. Relying on manual labeling alone requires a large amount of manpower, is error-prone, and is not cost-effective.
Disclosure of Invention
1. Problems to be solved
Aiming at the problems in the prior art that, when text recognition is performed on a pdf scanned piece, image information is difficult to extract, the workload is large and errors occur easily, the invention provides a text recognition method and a text recognition device for a financial pdf scanned piece. A template creation technique effectively solves the problem that manual labeling is time-consuming and inefficient, and the recognition effect is further improved by using recent achievements in deep learning.
2. Technical scheme
In order to solve the above problems, the present invention adopts the following technical solutions.
A text recognition method for a financial pdf scan comprises the following steps:
step 1, creating an image generation template;
step 2, inserting template information into the image generation template, and generating a training image by using the image generation template;
step 3, training a text recognition model by using the generated training image as training data;
step 4, recognizing the pdf scanned piece by using the text recognition model.
The preferable technical scheme is as follows:
In the text recognition method for a financial pdf scanned piece described above, the template information includes a layout, a font, a background and a watermark pattern.
In the text recognition method for a financial pdf scanned piece described above, the template information comes from a scanned piece, comes from a non-scanned piece, or is randomly generated.
In the text recognition method for a financial pdf scanned piece described above, the text recognition model comprises:
a text gradient detection model, used for detecting the text inclination of the whole image page and correcting the inclined text;
a character detection model, used for detecting, after the inclined text has been aligned, the coordinates of the text box in which each line of characters is located;
a character recognition model, used for recognizing each character inside the text box given by the text box coordinates;
and a text structuring model, used for converting the multiple lines of characters recognized by the character recognition model into structured data.
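The way the four sub-models hand results to one another can be sketched as a simple chain. This is a minimal illustration under stated assumptions, not the patent's implementation: the class name `TextRecognitionModel`, the callable signatures, the pixel-grid image type and the toy stand-in functions are all hypothetical, and real sub-models would be trained networks.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Image = List[List[int]]          # grayscale page as rows of pixel values (assumed representation)
Box = Tuple[int, int, int, int]  # (left, top, right, bottom) of one text line


@dataclass
class TextRecognitionModel:
    """Chains the four sub-models in the order the text describes."""
    detect_skew: Callable[[Image], float]             # text gradient detection: page angle in degrees
    deskew: Callable[[Image, float], Image]           # corrects the inclined page
    detect_boxes: Callable[[Image], List[Box]]        # character detection: one box per text line
    recognize: Callable[[Image, Box], str]            # character recognition inside one box
    structure: Callable[[List[str]], Dict[str, str]]  # text structuring: lines -> structured data

    def run(self, page: Image) -> Dict[str, str]:
        angle = self.detect_skew(page)
        aligned = self.deskew(page, angle)
        boxes = self.detect_boxes(aligned)
        lines = [self.recognize(aligned, box) for box in boxes]
        return self.structure(lines)


# Toy stand-ins so the sketch runs end to end (hypothetical values only).
model = TextRecognitionModel(
    detect_skew=lambda page: 0.0,
    deskew=lambda page, angle: page,
    detect_boxes=lambda page: [(0, 0, 4, 1), (0, 1, 4, 2)],
    recognize=lambda page, box: "Revenue: 100" if box[1] == 0 else "Cost: 60",
    structure=lambda lines: dict(line.split(": ", 1) for line in lines),
)
result = model.run([[0] * 4, [0] * 4])
```

Keeping each stage behind its own callable mirrors the text's separation of the sub-models, so each one can be trained and checked on its own.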
The text recognition method for a financial pdf scanned piece described above further includes, after step 4: verifying the recognition result of the text recognition model.
In the text recognition method for a financial pdf scanned piece described above, the verification comprises the following steps:
step 31, according to the detection result of the text gradient detection model, comparing the detected text inclination with the actual angle, and judging that the detection is wrong if the error exceeds 5 degrees;
step 32, according to the detection result of the character detection model, computing the IOU error between the detected text region and the actual region, and judging that the detection is wrong if the error exceeds 20%;
step 33, according to the recognition result of the character recognition model, comparing whether the recognized text content is the same as the actual content, and if not, judging that the recognition is wrong;
and step 34, according to the processing result of the text structuring model, comparing whether the generated structured data is the same as the actual structured data, and if not, judging that the structuring is wrong.
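The two numeric checks in steps 31 and 32 can be sketched as follows. The thresholds (5 degrees, 20%) come from the text; everything else is an assumption made for illustration: the function names are hypothetical, text regions are taken to be axis-aligned boxes, and "IOU error" is interpreted as 1 minus the intersection over union, which the source does not define explicitly.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom), axis-aligned (assumed)

ANGLE_TOLERANCE_DEG = 5.0   # threshold from step 31
IOU_ERROR_TOLERANCE = 0.20  # threshold from step 32


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def skew_ok(detected_deg: float, actual_deg: float) -> bool:
    """Step 31: the detection is wrong only if the angle error exceeds 5 degrees."""
    return abs(detected_deg - actual_deg) <= ANGLE_TOLERANCE_DEG


def box_ok(detected: Box, actual: Box) -> bool:
    """Step 32, assuming IOU error = 1 - IOU."""
    return (1.0 - iou(detected, actual)) <= IOU_ERROR_TOLERANCE
```

For instance, a detected box `(0, 0, 10, 9)` against an actual box `(0, 0, 10, 10)` has an IOU of 0.9 and passes, while a box shifted by half its width has an IOU of one third and fails.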
As another aspect of the present application, there is provided an apparatus for implementing the method for text recognition of a pdf scan according to any one of the above paragraphs, comprising a template creating module, a training image generating module, a text recognition model training module, a text recognition service module, and a verification module;
the template creating module is used for creating an image generating template;
the training image generation module is used for generating a training image according to the image generation template;
the text recognition model training module is used for training a text recognition model by using the generated training image as training data;
the text recognition service module is used for recognizing the pdf scanned piece according to the trained text recognition model;
and the verification module is used for verifying the recognition result of the text recognition service module.
The preferable technical scheme is as follows:
in the above-described apparatus, the template information of the image generation template comes from a scanned piece, comes from a non-scanned piece, or is randomly generated.
3. Advantageous effects
Compared with the prior art, the invention has the beneficial effects that:
(1) when the image generation template is created, template information is inserted into it; the template information can come from a pdf scanned piece, come from a non-scanned piece, or be randomly generated, which ensures that the data prepared before training the recognition model is comprehensive and safeguards the recognition accuracy of the subsequently trained text recognition model;
(2) when the text recognition model performs text recognition, the text gradient detection model first detects and corrects the text inclination of the image, the character detection model then determines the coordinates of the text box of each line, and the character recognition model recognizes each character inside the text box given by those coordinates; finally, the text structuring model converts the multiple lines of recognized characters into structured data, which completes text recognition of the pdf scanned piece; the method does not require a large amount of manual labeling, can automatically recognize pdf scanned pieces under complex conditions such as blurred fonts, skewed orientation and watermarks, has high recognition efficiency, and improves the recognition accuracy of pdf scanned pieces;
(3) when the text recognition model is trained on training images generated by the image generation template, a verification mechanism can be built into the recognition process, with verification criteria for excessive text inclination error, excessive IOU error and inconsistent recognized characters, so that the model can automatically recognize pdf scanned pieces under complex conditions such as skewed orientation, font blurring and watermarks; this provides a new solution for text recognition of pdf scanned pieces, with wide applicability and good prospects for use.
Drawings
FIG. 1 is a flow chart of a method for text recognition of a pdf scan according to the present invention;
FIG. 2 is a block diagram of a text recognition model according to the present invention;
FIG. 3 is a flow chart of text recognition verification of the present invention;
FIG. 4 is a block diagram of a text recognition apparatus for a pdf scan according to the present invention;
in the figure: 1. a template creation module; 2. a training image generation module; 3. a text recognition model training module; 4. a text recognition service module; 5. a verification module.
Detailed Description
The invention is further described with reference to specific embodiments and the accompanying drawings.
Example 1
As shown in fig. 1-2, a method for text recognition of a pdf scan of finance type includes:
Step 1, creating an image generation template. In this embodiment, specifically, the image generation template is created in order to expand the text image in each text image sample, so as to obtain the extended text image samples corresponding to each text image sample.
Step 2, inserting template information into the image generation template, and generating a training image by using the image generation template.
The template information comprises a layout, a font, a background and a watermark pattern, and comes from a scanned piece, comes from a non-scanned piece, or is randomly generated. In step 2, a training image is generated from the template together with randomly generated characters or characters randomly sampled from other data sources, and operations such as blurring, rotation and noise addition are applied to the image so that it approximates data from real scenes. This completes the sample expansion, ensures that the data prepared before training the recognition model is comprehensive, and safeguards the recognition accuracy of the subsequently trained text recognition model.
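As a rough illustration of step 2, the sketch below draws labeled training-sample specifications from a template and shows one of the named perturbations (noise addition) on a tiny grayscale grid. It is a sketch under assumptions, not the patent's generator: the class and function names, the parameter ranges, the template field values (including the font and watermark strings) and the pixel-grid representation are all hypothetical, and actual rendering of text onto the page, rotation and blurring are left to an imaging library and are not shown.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ImageTemplate:
    """Template information named in the text: layout, font, background, watermark."""
    layout: str
    font: str
    background: int  # grayscale background level, 0-255 (assumed encoding)
    watermark: str


@dataclass(frozen=True)
class TrainingSample:
    template: ImageTemplate
    text: str            # ground-truth label, known because the text was generated
    rotation_deg: float  # simulated skew
    blur_radius: int     # simulated font blurring
    noise_sigma: float   # simulated scanner noise


def make_samples(template: ImageTemplate, charset: str, count: int,
                 line_length: int, seed: int = 0) -> List[TrainingSample]:
    rng = random.Random(seed)  # seeded so the generated set is reproducible
    samples = []
    for _ in range(count):
        text = "".join(rng.choice(charset) for _ in range(line_length))
        samples.append(TrainingSample(
            template=template,
            text=text,
            rotation_deg=rng.uniform(-10.0, 10.0),  # ranges are illustrative
            blur_radius=rng.randint(0, 2),
            noise_sigma=rng.uniform(0.0, 15.0),
        ))
    return samples


def add_noise(pixels: List[List[int]], sigma: float, rng: random.Random) -> List[List[int]]:
    """Adds Gaussian noise to a grayscale image and clamps values to 0-255."""
    return [[min(255, max(0, round(p + rng.gauss(0.0, sigma)))) for p in row]
            for row in pixels]


template = ImageTemplate(layout="balance_sheet", font="SimSun",
                         background=255, watermark="CONFIDENTIAL")
samples = make_samples(template, charset="0123456789.,%", count=3, line_length=8)
noisy = add_noise([[template.background] * 4 for _ in range(2)],
                  sigma=10.0, rng=random.Random(1))
```

Because the text of every sample is generated rather than transcribed, the label comes for free, which is the property that lets the method avoid manual labeling.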
Step 3, training the text recognition model by using the generated training image as training data.
In this embodiment, specifically, training starts from a model that has been trained on general data (such as text images and labels downloaded from the network), for example an OCR pre-training model. For the OCR pre-training model, more than 100 pieces of training data are needed, and training on a GPU takes about one week, continuing until the accuracy of the model on the test set reaches more than 95%.
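The stopping rule in this paragraph (train until test-set accuracy exceeds 95%) can be sketched as below. The 95% target is from the text; the rest is assumed for illustration: accuracy is taken to be the fraction of test samples whose predicted text matches the label exactly, and the function names, the epoch budget and the simulated accuracy numbers are hypothetical.

```python
from typing import Callable, Sequence, Tuple

TARGET_ACCURACY = 0.95  # threshold quoted in the text


def accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of test samples whose predicted text equals the label exactly."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        return 0.0
    return sum(p == t for p, t in zip(predictions, labels)) / len(labels)


def train_until_target(train_one_epoch: Callable[[], None],
                       evaluate: Callable[[], float],
                       max_epochs: int) -> Tuple[int, float]:
    """Runs epochs until test accuracy exceeds the target or the budget runs out."""
    score = evaluate()
    epochs = 0
    while score <= TARGET_ACCURACY and epochs < max_epochs:
        train_one_epoch()
        epochs += 1
        score = evaluate()
    return epochs, score


# Simulated run: each "epoch" lifts accuracy by three points (made-up numbers).
state = {"accuracy": 0.90}


def fake_epoch() -> None:
    state["accuracy"] += 0.03


epochs_run, final_score = train_until_target(fake_epoch, lambda: state["accuracy"],
                                             max_epochs=10)
```

The epoch budget is a safeguard added in this sketch so the loop ends even if the target is never reached; the source only states the accuracy target.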
In this embodiment, the OCR pre-training model includes:
a text gradient detection model, used for detecting the text inclination of the whole image page and correcting the inclined text;
a character detection model, used for detecting, after the inclined text has been aligned, the coordinates of the text box in which each line of characters is located;
a character recognition model, used for recognizing each character inside the text box given by the text box coordinates;
and a text structuring model, used for converting the multiple lines of characters recognized by the character recognition model into structured data.
Step 4, recognizing the pdf scanned piece by using the text recognition model.
When the text recognition model performs text recognition, the text gradient detection model first detects and corrects the text inclination of the image, the character detection model then determines the coordinates of the text box of each line, and the character recognition model recognizes each character inside the text box given by those coordinates. Finally, the text structuring model converts the multiple lines of recognized characters into structured data, which completes text recognition of the pdf scanned piece. The text recognition method for a financial pdf scanned piece does not require a large amount of manual labeling, can automatically recognize pdf scanned pieces under complex conditions such as blurred fonts, skewed orientation and watermarks, has high recognition efficiency, and improves the recognition accuracy of pdf scanned pieces.
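The last stage, turning recognized lines into structured data, might look like the following for the simplest case. The source does not say what the structured data looks like, so this is only a sketch under an assumed line shape (an item name followed by a single amount); the pattern, the function name and the sample lines are hypothetical.

```python
import re
from decimal import Decimal
from typing import Dict, List

# Assumed line shape: an item name followed by one amount, e.g. "Total assets 1,234.50".
LINE_PATTERN = re.compile(r"^(?P<item>.+?)\s+(?P<amount>-?\d[\d,]*(?:\.\d+)?)$")


def structure_lines(lines: List[str]) -> Dict[str, Decimal]:
    """Converts recognized lines into an item -> amount mapping, skipping lines that do not fit."""
    record: Dict[str, Decimal] = {}
    for line in lines:
        match = LINE_PATTERN.match(line.strip())
        if match:
            amount = match.group("amount").replace(",", "")  # drop thousands separators
            record[match.group("item")] = Decimal(amount)
    return record


recognized = ["Balance Sheet", "Total assets 1,234.50", "Total liabilities 734.50"]
structured = structure_lines(recognized)
```

`Decimal` is used instead of `float` so that amounts keep their exact printed value, which matters when the result is later compared against ground truth for equality.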
Example 2
Substantially the same as Example 1. In order to further ensure the accuracy and efficiency of text recognition on the pdf scanned piece, in this embodiment, when training the text recognition model, the method further includes, after step 4: verifying the recognition result of the text recognition model. As shown in fig. 3, the text recognition verification includes:
step 31, according to the detection result of the text gradient detection model, comparing the detected text inclination with the actual angle, and judging that the detection is wrong if the error exceeds 5 degrees;
step 32, according to the detection result of the character detection model, computing the IOU error between the detected text region and the actual region, and judging that the detection is wrong if the error exceeds 20%;
step 33, according to the recognition result of the character recognition model, comparing whether the recognized text content is the same as the actual content, and if not, judging that the recognition is wrong;
and step 34, according to the processing result of the text structuring model, comparing whether the generated structured data is the same as the actual structured data, and if not, judging that the structuring is wrong.
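Steps 33 and 34 are exact-equality comparisons against ground truth. A minimal sketch, assuming recognized text is a list of lines and structured data is a flat dictionary (neither representation is specified in the source, and the function names are hypothetical), reports where the mismatches are rather than only a pass or fail flag:

```python
from typing import Any, Dict, List


def text_mismatches(recognized: List[str], actual: List[str]) -> List[int]:
    """Step 33: indices of lines whose recognized text differs from the ground truth."""
    length = max(len(recognized), len(actual))
    return [i for i in range(length)
            if i >= len(recognized) or i >= len(actual) or recognized[i] != actual[i]]


def structure_mismatches(generated: Dict[str, Any], actual: Dict[str, Any]) -> List[str]:
    """Step 34: keys that are missing, unexpected, or carry a different value."""
    return sorted(k for k in generated.keys() | actual.keys()
                  if k not in generated or k not in actual or generated[k] != actual[k])


bad_lines = text_mismatches(["Total assets 100", "Tota1 debt 40"],
                            ["Total assets 100", "Total debt 40"])
bad_keys = structure_mismatches({"Total assets": "100"},
                                {"Total assets": "100", "Total debt": "40"})
```

An empty list from either function corresponds to the check passing; any entry means the stage is judged wrong in the sense of the steps above.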
When the text recognition model of this embodiment is trained on training images generated by the image generation template, a verification mechanism can be built into the recognition process, with verification criteria for excessive text inclination error, excessive IOU error and inconsistent recognized characters, so that the model can automatically recognize pdf scanned pieces under complex conditions such as skewed orientation, font blurring and watermarks. This provides a new solution for text recognition of pdf scanned pieces, with wide applicability and good prospects for use.
Example 3
A text recognition apparatus for a financial pdf scan, which is used to implement the text recognition method for a financial pdf scan described in embodiment 1, as shown in fig. 4, includes a template creation module 1, a training image generation module 2, a text recognition model training module 3, a text recognition service module 4, and a verification module 5;
the template creating module 1 is used for creating an image generation template, wherein the template information of the image generation template comes from a scanned piece, comes from a non-scanned piece, or is randomly generated;
the training image generation module 2 is used for generating a training image according to the image generation template;
the text recognition model training module 3 is used for training a text recognition model by using the generated training image as training data;
the text recognition service module 4 is used for recognizing the pdf scanned piece according to the trained text recognition model;
and the verification module 5 is used for verifying the recognition result of the text recognition service module 4.
In the text recognition apparatus provided in this embodiment, only the division of the functional modules is exemplified, and in practical applications, the functions may be allocated to different functional modules according to needs, that is, the internal structure of the apparatus may be divided into different functional modules, so as to complete all or part of the functions described above. In addition, the text recognition apparatus of this embodiment and the text recognition method embodiment in the above embodiments belong to the same concept, and specific implementation processes and beneficial effects thereof are described in detail in the text recognition method embodiment, and are not described herein again.
The examples described herein are merely illustrative of the preferred embodiments of the present invention and do not limit the spirit and scope of the present invention, and various modifications and improvements made to the technical solutions of the present invention by those skilled in the art without departing from the design concept of the present invention shall fall within the protection scope of the present invention.

Claims (8)

1. A text recognition method for a financial pdf scanned piece, characterized in that the method comprises the following steps:
step 1, creating an image generation template;
step 2, inserting template information into the image generation template, and generating a training image by using the image generation template;
step 3, training a text recognition model by using the generated training image as training data;
step 4, recognizing the pdf scanned piece by using the text recognition model.
2. The method of claim 1, wherein the template information comprises a layout, a font, a background and a watermark pattern.
3. The method of claim 1, wherein the template information comes from a scanned piece, comes from a non-scanned piece, or is randomly generated.
4. The method of claim 1, wherein the text recognition model comprises:
a text gradient detection model, used for detecting the text inclination of the whole image page and correcting the inclined text;
a character detection model, used for detecting, after the inclined text has been aligned, the coordinates of the text box in which each line of characters is located;
a character recognition model, used for recognizing each character inside the text box given by the text box coordinates;
and a text structuring model, used for converting the multiple lines of characters recognized by the character recognition model into structured data.
5. The method of claim 4, further comprising, after step 4: verifying the recognition result of the text recognition model.
6. The method of claim 5, wherein the verification comprises:
step 31, according to the detection result of the text gradient detection model, comparing the detected text inclination with the actual angle, and judging that the detection is wrong if the error exceeds 5 degrees;
step 32, according to the detection result of the character detection model, computing the IOU error between the detected text region and the actual region, and judging that the detection is wrong if the error exceeds 20%;
step 33, according to the recognition result of the character recognition model, comparing whether the recognized text content is the same as the actual content, and if not, judging that the recognition is wrong;
and step 34, according to the processing result of the text structuring model, comparing whether the generated structured data is the same as the actual structured data, and if not, judging that the structuring is wrong.
7. An apparatus for implementing the method of text recognition of a pdf scan according to any one of claims 1 to 6, comprising a template creation module, a training image generation module, a text recognition model training module, a text recognition service module, and a verification module;
the template creating module is used for creating an image generating template;
the training image generation module is used for generating a training image according to the image generation template;
the text recognition model training module is used for training a text recognition model by using the generated training image as training data;
the text recognition service module is used for recognizing the pdf scanned piece according to the trained text recognition model;
and the verification module is used for verifying the recognition result of the text recognition service module.
8. The apparatus of claim 7, wherein the template information of the image generation template comes from a scanned piece, comes from a non-scanned piece, or is randomly generated.
CN202110735367.8A 2021-06-30 2021-06-30 Text recognition method and device for financial pdf scanned piece Pending CN113469029A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110735367.8A CN113469029A (en) 2021-06-30 2021-06-30 Text recognition method and device for financial pdf scanned piece

Publications (1)

Publication Number Publication Date
CN113469029A true CN113469029A (en) 2021-10-01

Family

ID=77874387

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110735367.8A Pending CN113469029A (en) 2021-06-30 2021-06-30 Text recognition method and device for financial pdf scanned piece

Country Status (1)

Country Link
CN (1) CN113469029A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109919014A (en) * 2019-01-28 2019-06-21 平安科技(深圳)有限公司 OCR recognition methods and its electronic equipment
CN111353497A (en) * 2018-12-21 2020-06-30 顺丰科技有限公司 Identification method and device for identity card information
CN111639566A (en) * 2020-05-19 2020-09-08 浙江大华技术股份有限公司 Method and device for extracting form information
CN111798360A (en) * 2020-06-30 2020-10-20 百度在线网络技术(北京)有限公司 Watermark detection method, watermark detection device, electronic equipment and storage medium
CN112651340A (en) * 2020-12-28 2021-04-13 上海商米科技集团股份有限公司 Character recognition method, system, terminal device and storage medium for shopping receipt
CN112733639A (en) * 2020-12-28 2021-04-30 贝壳技术有限公司 Text information structured extraction method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20211001