CN115828874A - Industry table digital processing method based on image recognition technology

Info

Publication number: CN115828874A
Application number: CN202211571003.1A
Authority: CN
Inventors: 李炯梅; 李婵一; 薛龙江
Original assignee: Golden Tax Bridge Big Data Technology Co ltd
Current assignee: Golden Tax Bridge Big Data Technology Co ltd
Priority date: 2019-08-05
Filing date: 2019-08-05
Publication date: 2023-03-21
Also published as: CN110413979A

Abstract

The invention discloses an industry table digital processing method based on image recognition technology. The method loads a category detection model and a character recognition model trained by deep learning, together with industry calculation model data for the tables of various industries, and distinguishes between ordinary picture files and PDF files. For a PDF file, it automatically determines from the content of each page which report type the table on that page belongs to. It then detects the character areas in the table and predicts the table format; for a file without a table, it predicts the arrangement rule of the text and automatically generates a new table according to the type designated by the user or the type inferred by the system's built-in model. Character recognition is then performed on the detected table content, and keywords and numerical values are extracted to form a digitized output result. The method thereby provides reliable data support for subsequent applications, can be applied to the highly complex reports of the finance and tax industry, and can greatly improve working efficiency.

Description

Industry table digital processing method based on image recognition technology

The present application is a divisional application of the application entitled 'An industry table digital processing method based on an image recognition technology', filed on August 5, 2019, with application number 201910715902.6.

Technical Field

The invention relates to the technical field of computer information processing, and in particular to a digital processing method for tables.

Background

As informatization accelerates across industries, large volumes of multi-source, heterogeneous, multi-dimensional business data are being generated. Much historical data exists only on paper and contains tables of many kinds. In other cases the file itself contains no table, but industry analysts must still process its data according to the layout rules of a table in later steps. In both cases the paper files must be recognized, further processed according to the particular characteristics of each industry's file types, and have their keywords and corresponding values extracted, so that the files are finally digitized and available for subsequent analysis. Digital processing of tables therefore has great practical significance and a wide range of applications.

Related technologies include the Alibaba Cloud general OCR (optical character recognition) interface, the Baidu OCR interface, the Youdao table recognition interface, and the like. However, these interfaces offer very limited table processing. Some cannot recognize files that contain tables at all. Others support dedicated table recognition but can handle only very simple two-dimensional row-and-column tables and fail as soon as cells are merged. None provides support for specialized industry tables, so most such files do not return correct recognition results and cannot enter the next stage of digital processing and analysis, in which data are matched to key values.

Chinese patent CN105589841B discloses a method for identifying tables in a PDF document. The method first obtains the character set of a page and merges the characters into rows to establish a row set; it then extracts the horizontal and vertical lines in the page path to establish a line set; it then detects suspected table titles in the row set and suspected table lines in the line set. If a suspected table title and suspected table lines both exist, the table is identified by a region growing method based on the table title and the line set. If only suspected table lines exist, full-line tables are detected first and three-line tables second, using the line set and the row set. If only a suspected table title exists, the table is identified by a region growing method based on the table title and the row set. If the page contains neither suspected table lines nor a suspected table title, the page is judged to have no table. Finally, the attached elements of the table, namely the table header and table notes, are detected and the page table identification result is output.

Chinese patent application CN109522816A provides a table identification method and device, and computer storage medium. The method comprises the following steps: detecting a table structure of a first table in an image to be processed to obtain table structure information, and identifying table contents of the first table to obtain text information corresponding to the table contents; drawing a second table according to the table structure information; and filling the text information into a second table.

Although both of the above documents deal specifically with tables, processing a PDF document requires extracting the character set from its pages, so image-only PDF documents cannot be processed. The table processing applies only to ordinary tables and cannot classify and handle complex business tables, such as the many complex table types used in finance and taxation. The recognized results also cannot be further processed to match keywords to values according to industry information, so for files with complex tables it is difficult to output data that fully meets requirements, and subsequent digital applications cannot proceed. In addition, for a file without a table a 'no table' result is output directly, without considering the more complicated case in which the table is absent only in form while the data are still arranged according to the rules of a table of some industry type.

Disclosure of Invention

The invention aims to provide a digital processing method for industry tables that can process different types of tables, produce a digitized result, and provide a basis for subsequent work.

In order to achieve the purpose, the invention provides the following scheme:

an industry table digital processing method based on an image recognition technology is characterized by comprising the following steps:

acquiring an OCR character detection model, a character recognition model, an industry report data model and an industry standard data model;

inputting a file to be processed;

judging the type of the file to be processed;

if the file to be processed is a PDF file, splitting the PDF file into a plurality of pictures, one per page, identifying the table type in each picture based on the industry report data model and the industry standard data model, detecting the character areas by adopting the OCR character detection model, and predicting the table format;

if the file to be processed is a picture file, identifying the table type in the picture file and predicting the table format;

judging whether the file to be processed contains a table according to the prediction result of the table format; if so, loading table model data and recognizing the text by adopting the character recognition model; if not, recognizing the text in the picture by adopting the character recognition model and predicting an arrangement rule according to the text;

automatically generating a new table according to the type designated by the user or the built-in type of the system, filling the recognized text into the new table, and performing table repair and correction to obtain a corrected table;

and performing character recognition on the content of the corrected table by adopting the character recognition model, extracting keywords and values, generating an Excel table, and outputting a digitized result.

Optionally, before judging the type of the file to be processed, the industry table digital processing method based on image recognition technology further includes: sequentially performing watermark removal, rotation correction, and noise removal on the file to be processed.

Optionally, the OCR character detection model comprises the CTPN and PIXEL_LINK models; the character recognition model comprises the CRNN and DENSENET models; the industry report data model is a self-developed mathematical calculation model that conforms to the industry's calculation methods and is based on the four arithmetic operations.

According to the specific embodiment provided by the invention, the invention discloses the following technical effects:

The invention can process both ordinary pictures and all kinds of PDF files, and can classify and handle complex and varied table types. For ordinary tables it can fully restore the table structure. For files without a table it can predict the arrangement rule from the text and automatically generate a new table according to the type specified by the user or a type built into the system. It then matches keywords to numerical values to form a digitized output result, thereby providing reliable data support for subsequent applications.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings required by the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flow chart of the industry table digital processing method based on image recognition technology;

FIG. 2 is an original image obtained by splitting an input file according to an embodiment of the present invention;

FIG. 3 is a diagram of a processed document according to an embodiment of the present invention;

FIG. 4 is a diagram of the result produced in an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.

The industry table digital processing method based on image recognition technology digitizes the table data in paper documents, produces an electronic Excel table that contains all of the table's associated styles, and provides data support for subsequent work. The flow of the method is shown in FIG. 1 and mainly comprises the following steps.

A. Initialize the system and load the OCR character detection model, the character recognition model, the industry report data model, and the industry standard data model. The OCR character detection model uses the CTPN and PIXEL_LINK models, the character recognition model uses the CRNN and DENSENET models, and the industry report data model is a self-developed mathematical calculation model that conforms to the industry's calculation methods and is based on the four arithmetic operations.
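The patent does not disclose the internals of the industry report data model beyond saying that it is built on the four arithmetic operations. As a rough illustration only, the sketch below checks extracted values against signed-sum rules (addition and subtraction only). The `Rule` class, the `check_rules` function, and the `gross_profit` example rule are hypothetical names introduced here, not part of the patent.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule: target == sum(sign * values[name]) over all terms."""
    target: str
    terms: Tuple[Tuple[int, str], ...]


def check_rules(values: Dict[str, float], rules: List[Rule], tol: float = 0.01) -> List[str]:
    """Return the targets of all rules that the extracted values violate."""
    failed = []
    for rule in rules:
        names = [rule.target] + [name for _, name in rule.terms]
        if any(name not in values for name in names):
            continue  # skip rules whose inputs were not extracted
        expected = sum(sign * values[name] for sign, name in rule.terms)
        if abs(values[rule.target] - expected) > tol:
            failed.append(rule.target)
    return failed


# Assumed example rule: gross_profit = revenue - cost_of_sales
RULES = [Rule("gross_profit", ((1, "revenue"), (-1, "cost_of_sales")))]
```

A rule set of this kind could, for instance, flag a page whose recognized totals do not add up, which is one plausible way such a model might support both table type identification and error correction.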

B. Input a file and preprocess it. The preprocessing comprises watermark removal, rotation correction, and noise removal.
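The patent names noise removal as a preprocessing step but does not say which algorithm is used. One common choice for isolated speckles is a median filter; the following is a minimal pure-Python sketch on a grayscale image stored as nested lists. The function name `median_filter_3x3` and the nested-list image representation are assumptions made for illustration.

```python
from statistics import median
from typing import List

Image = List[List[int]]  # grayscale pixels, row-major


def median_filter_3x3(img: Image) -> Image:
    """Replace each pixel with the median of its 3x3 neighbourhood.

    Windows are clipped at the image border, so edge pixels use fewer samples.
    """
    h = len(img)
    w = len(img[0]) if h else 0
    out = [row[:] for row in img]
    for y in range(h):
        for x in range(w):
            window = [
                img[yy][xx]
                for yy in range(max(0, y - 1), min(h, y + 2))
                for xx in range(max(0, x - 1), min(w, x + 2))
            ]
            out[y][x] = int(median(window))
    return out
```

A single bright pixel surrounded by dark background is removed by this filter, while uniform regions pass through unchanged. A production system would more likely use an image library for this step.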

C. Judge the file type, namely whether the input file is a PDF file or an ordinary picture file. If it is a PDF file, perform step D; if it is a picture file, first predict the table type and then perform step E.

D. Split the PDF file, predict the positions of the characters, and crop a small part of each picture to recognize and determine the table type.
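The file type check in step C can be sketched by inspecting the leading bytes of the file. The patent does not say how the check is done; the sketch below assumes signature sniffing, using the standard `%PDF-` header and the PNG and JPEG signatures. The function names `detect_file_type` and `next_step` are hypothetical.

```python
def detect_file_type(data: bytes) -> str:
    """Classify raw file bytes as 'pdf', 'picture', or 'unknown' by signature."""
    if data.startswith(b"%PDF-"):
        return "pdf"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):  # PNG signature
        return "picture"
    if data.startswith(b"\xff\xd8\xff"):  # JPEG start-of-image marker
        return "picture"
    return "unknown"


def next_step(file_type: str) -> str:
    """Map the detected type to the step letter used in the description."""
    if file_type == "pdf":
        return "D"  # split the PDF into per-page pictures
    if file_type == "picture":
        return "E"  # after table type prediction, check for a table
    raise ValueError(f"unsupported file type: {file_type}")
```

Checking the content rather than the file extension avoids misrouting files that have been renamed, which is a design choice of this sketch rather than something the patent specifies.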

In this embodiment, a PDF file containing 3 pages is input. In step D the PDF is split into separate pictures, one per page, and each picture is then processed separately.
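The patent identifies the table type of each page with its own industry data models, which are not disclosed. As a simplified stand-in, the sketch below scores the text recognized from a cropped region against keyword lists. The report types, the keywords in `REPORT_KEYWORDS`, and the function `classify_page` are all hypothetical and serve only to show the per-page classification idea.

```python
from typing import Dict, List

# Hypothetical keyword lists; a real system would use trained or curated models.
REPORT_KEYWORDS: Dict[str, List[str]] = {
    "balance_sheet": ["total assets", "total liabilities", "owner's equity"],
    "income_statement": ["operating revenue", "operating costs", "net profit"],
}


def classify_page(text: str, keywords: Dict[str, List[str]] = REPORT_KEYWORDS) -> str:
    """Return the report type whose keywords occur most often, or 'unknown'."""
    lowered = text.lower()
    scores = {
        report_type: sum(kw in lowered for kw in kws)
        for report_type, kws in keywords.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"
```

Applied to each page picture in turn, a classifier of this shape would yield one report type per page, matching the page-by-page processing described in this embodiment.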

E. Judge whether a table is contained. If so, load the table model data and perform text recognition; if not, recognize the text in the picture and predict the arrangement rule according to the text.

In this embodiment, 2 of the pictures contain no table themselves, but the user can specify that they should be recognized as tables, so table lines are added to the picture according to the file type. For example, the original image is shown in FIG. 2, and the preview image after the lines are automatically added is shown in FIG. 3.
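The patent does not specify how the arrangement rule is predicted for text without table lines. One simple heuristic, sketched below as an assumption, is to group detected text boxes into rows by vertical position and then check whether every row has the same number of cells. The `Box` type, `group_rows`, `looks_tabular`, and the tolerance `y_tol` are hypothetical.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    text: str
    x: float  # left edge of the detected text region
    y: float  # vertical centre of the detected text region


def group_rows(boxes: List[Box], y_tol: float = 5.0) -> List[List[str]]:
    """Group boxes whose vertical centres are within y_tol into rows."""
    rows: List[List[Box]] = []
    for box in sorted(boxes, key=lambda b: b.y):
        if rows and abs(box.y - rows[-1][-1].y) <= y_tol:
            rows[-1].append(box)
        else:
            rows.append([box])
    # Order the cells of each row from left to right.
    return [[b.text for b in sorted(row, key=lambda b: b.x)] for row in rows]


def looks_tabular(rows: List[List[str]]) -> bool:
    """Treat text as table-like if all rows share one cell count of at least 2."""
    counts = {len(row) for row in rows}
    return len(counts) == 1 and counts.pop() >= 2
```

For example, four boxes reading "Cash", "100", "Inventory", and "250" at two distinct heights group into two rows of two cells each, which this heuristic would accept as table-like even though no ruling lines are present.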

F. Automatically generate a new table according to the type designated by the user or the built-in type of the system, fill the text from step E into the table, and repair and correct the table.

G. Perform character recognition on the contents of the table from step F, extract keywords and values, generate an Excel table, and output the digitized result.
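How the new table is generated and repaired is not detailed in the patent. The sketch below shows one possible reading under stated assumptions: the table type supplies a header row, recognized rows are filled in beneath it, short rows are padded, and overflow cells are merged into the last column. The function `build_table` and this repair policy are hypothetical.

```python
from typing import List


def build_table(header: List[str], rows: List[List[str]]) -> List[List[str]]:
    """Fill recognized rows into a grid whose width is set by the header."""
    width = len(header)
    if width == 0:
        raise ValueError("header must contain at least one column")
    table = [list(header)]
    for row in rows:
        if len(row) > width:
            # Merge overflow cells into the last column instead of dropping them.
            cells = list(row[: width - 1]) + [" ".join(row[width - 1:])]
        else:
            # Pad missing cells so every row has the same width.
            cells = list(row) + [""] * (width - len(row))
        table.append(cells)
    return table
```

Keeping overflow text rather than discarding it means a later correction pass still has all of the recognized content to work with.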

In this step, the new picture with the table is recognized, keywords and values are extracted, and a recognition result in Excel format is generated, as shown in FIG. 4.
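The keyword and value extraction in step G can be illustrated with a small sketch. It assumes that each numeric cell belongs to the nearest text cell to its left in the same row, which the patent does not state. It also writes CSV using the standard library as a stand-in for the Excel output, since producing a real Excel file needs a third-party library. The names `parse_value`, `extract_pairs`, and `to_csv` are hypothetical.

```python
import csv
import io
import re
from typing import List, Optional, Tuple

# Plain or comma-grouped numbers with an optional sign and decimal part.
NUMBER = re.compile(r"-?(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d+)?")


def parse_value(cell: str) -> Optional[float]:
    """Return the numeric value of a cell, or None if it is not a number."""
    s = cell.strip()
    if not NUMBER.fullmatch(s):
        return None
    return float(s.replace(",", ""))


def extract_pairs(rows: List[List[str]]) -> List[Tuple[str, float]]:
    """Pair each numeric cell with the nearest text cell to its left."""
    pairs = []
    for row in rows:
        keyword = None
        for cell in row:
            value = parse_value(cell)
            if value is None:
                keyword = cell.strip()
            elif keyword is not None:
                pairs.append((keyword, value))
    return pairs


def to_csv(pairs: List[Tuple[str, float]]) -> str:
    """Serialize keyword/value pairs as CSV text (stand-in for Excel output)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["keyword", "value"])
    writer.writerows(pairs)
    return buf.getvalue()
```

Malformed numbers such as "12,34" are deliberately left as text so that recognition errors surface for correction instead of being silently converted.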

The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims

1. An industry table digital processing method based on an image recognition technology is characterized by comprising the following steps:

inputting a file to be processed;

judging the type of the file to be processed;

if the file to be processed is a PDF file, automatically splitting the PDF file into a plurality of pictures, one per page, identifying the table type in each picture based on a self-developed industry report data model and a self-developed industry standard data model, detecting the character areas by adopting an OCR character detection model, and predicting the table format;

automatically judging whether the file to be processed contains a table according to the prediction result of the table format; if so, loading table model data and recognizing the text by adopting a character recognition model; if not, recognizing the text in the picture file to be processed by adopting the character recognition model, and predicting an arrangement rule according to the text;

2. The industry table digital processing method based on image recognition technology according to claim 1, wherein before judging the type of the file to be processed, the method further comprises:

sequentially performing watermark removal, rotation correction, and noise removal on the file to be processed.

3. The industry table digital processing method based on image recognition technology according to claim 1, wherein the OCR character detection model comprises the CTPN and PIXEL_LINK models;

the character recognition model comprises the CRNN and DENSENET models;

the industry report data model is a self-developed mathematical calculation model that conforms to the industry's calculation methods and is based on the four arithmetic operations.