CN115828874A - Industry table digital processing method based on image recognition technology - Google Patents

Industry table digital processing method based on image recognition technology

Info

Publication number
CN115828874A
Authority
CN
China
Prior art keywords
model
industry
file
type
processed
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211571003.1A
Other languages
Chinese (zh)
Inventor
李炯梅
李婵一
薛龙江
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Golden Tax Bridge Big Data Technology Co ltd
Original Assignee
Golden Tax Bridge Big Data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Golden Tax Bridge Big Data Technology Co ltd filed Critical Golden Tax Bridge Big Data Technology Co ltd
Priority to CN202211571003.1A priority Critical patent/CN115828874A/en
Publication of CN115828874A publication Critical patent/CN115828874A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/14Image acquisition
    • G06V30/148Segmentation of character regions
    • G06V30/153Segmentation of character regions using recognition of characters or words
    • G06V30/40Document-oriented image-based pattern recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Character Discrimination (AREA)
  • Character Input (AREA)

Abstract

The invention discloses an industry table digital processing method based on image recognition technology. The method loads a class detection model and a character recognition model trained by deep learning, together with industry calculation model data for the tables used in various industries, and distinguishes ordinary picture files from PDF files; for a PDF file, it automatically determines from the content of each page which report type that page's table belongs to. It then detects the character areas in the table and predicts the form of the table; for a file without a table, it predicts the arrangement rule from the text and automatically generates a new table according to the type designated by the user or the type inferred by the system's built-in model. Character recognition is performed on the detected table content, and keywords and numerical values are extracted to form a digitized output. The method thereby provides reliable data support for subsequent applications, can be applied to scenarios with extremely complex report forms such as the finance and tax industry, and can greatly improve working efficiency.

Description

Industry table digital processing method based on image recognition technology
This application is a divisional application of the original application titled 'An industry table digital processing method based on image recognition technology', filed on August 5, 2019, with application number 201910715902.6.
Technical Field
The invention relates to the technical field of computer information processing, in particular to a digital processing method of a form.
Background
As informatization accelerates across industries, large volumes of multi-source, heterogeneous, multi-dimensional business data are generated. Much historical data exists only on paper and contains various tables; in other cases the file itself has no table, but in subsequent processing an industry analyst must still handle the data according to a tabular layout. The paper files have to be recognized, processed further according to the particular properties of each industry's file types, their keywords and corresponding values extracted, and finally digitized for subsequent analysis. Digitizing tables therefore has great practical significance and a wide range of applications.
Related technologies include an aristoloc general OCR (optical character recognition) recognition interface, the Baidu OCR recognition interface, a channel table recognition interface, and the like. However, these interfaces offer very limited table processing. Some cannot recognize files containing tables at all. Others offer dedicated table recognition but handle only very simple two-dimensional row-and-column tables and fail as soon as cells are merged. None provides interface support for professional industry tables, so most such files do not return correct recognition results and therefore cannot enter the next stage, in which the data are mapped to key-value pairs for digitized analysis.
Chinese patent CN105589841B discloses a method for identifying tables in a PDF document. The method first obtains the character set of a page and merges the characters into rows to establish a row set; it then extracts the horizontal and vertical lines in the page paths to establish a line set, and detects suspected table titles in the row set and suspected table lines in the line set. If both a suspected table title and suspected table lines exist, the table is identified by a region growing method based on the table title and the line set; if only suspected table lines exist, full-line tables are detected first and three-line tables next, using the line set and the row set; if only a suspected table title exists, the table is identified by a region growing method based on the table title and the row set; if the page has neither suspected table lines nor a suspected table title, the page is judged to contain no table. Finally, the elements attached to the table (the table header and table notes) are detected and the page table identification result is output.
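The four-way case analysis described for CN105589841B can be summarized as a small decision function. This is only an illustrative sketch: the function name and the strategy labels are hypothetical and do not come from the cited patent.

```python
def choose_table_strategy(has_title: bool, has_lines: bool) -> str:
    """Pick a detection strategy for one page (labels are illustrative only)."""
    if has_title and has_lines:
        # Both cues present: grow the table region from the title and the line set.
        return "region_growing_on_title_and_lines"
    if has_lines:
        # Only ruling lines: look for full-line tables first, then three-line tables.
        return "full_line_table_then_three_line_table"
    if has_title:
        # Only a title: grow the table region from the title and the text rows.
        return "region_growing_on_title_and_rows"
    # Neither cue: the page is treated as having no table.
    return "no_table"
```

The last branch is the case the present application criticizes: a page with neither cue is simply reported as table-free, even if its text is arranged like a table.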
Chinese patent application CN109522816A provides a table identification method and device, and computer storage medium. The method comprises the following steps: detecting a table structure of a first table in an image to be processed to obtain table structure information, and identifying table contents of the first table to obtain text information corresponding to the table contents; drawing a second table according to the table structure information; and filling the text information into a second table.
Although the two documents above handle tables specifically, PDF processing relies on extracting the character set from the page, so a PDF that consists purely of images cannot be processed. The table processing also applies only to ordinary tables: it cannot classify and summarize complex business tables such as the many varied table types used in finance and taxation, and the recognized result cannot be further mapped to keywords and values according to industry information. As a result, it is difficult to output data that fully meets requirements for documents with complex tables, and subsequent digitized applications cannot proceed. In addition, for a file without a table a 'no table' result is output directly, without considering the more complicated case in which the file lacks a table only in form while its data are still arranged according to the table rules of a particular industry.
Disclosure of Invention
The invention aims to provide an industry table digital processing method that can process different types of tables, produce a digitized result, and provide a basis for subsequent work.
In order to achieve the purpose, the invention provides the following scheme:
an industry table digital processing method based on an image recognition technology is characterized by comprising the following steps:
acquiring an OCR character detection model, a character recognition model, an industry report data model and an industry standard data model;
inputting a file to be processed;
judging the type of the file to be processed;
if the file to be processed is a PDF file, splitting the PDF file into a plurality of pictures by page, identifying the table type in each picture based on the industry report data model and the industry standard data model, detecting character areas by adopting the OCR character detection model, and predicting the form of the table;
if the file to be processed is a picture file, identifying the table type in the picture file and predicting the form of the table;
judging, according to the prediction result for the form of the table, whether the file to be processed contains a table; if so, loading table model data and recognizing the text by adopting the character recognition model; if not, recognizing the text in the picture by adopting the character recognition model and predicting an arrangement rule according to the text;
automatically generating a new table according to the type designated by the user or the built-in type of the system, filling the recognized text into the new table, and performing table repair and correction to obtain a corrected table;
and performing character recognition on the content of the corrected table by adopting the character recognition model, extracting keywords and values, generating an Excel table, and outputting a digitized result.
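The control flow of the steps above can be sketched as follows. This is a minimal skeleton under stated assumptions, not the applicant's implementation: the models are passed in as plain callables with hypothetical names (`count_pdf_pages`, `classify_table_type`, `has_table`, `recognize_text`), the file type is judged from the file extension, and page images are represented only by string identifiers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class PageResult:
    page: str            # identifier of the page image
    table_type: str      # user-designated type, else the type predicted by the model
    layout_source: str   # "table_model" or "predicted_from_text"
    texts: List[str]     # recognized text fragments


def split_into_pages(path: str, count_pdf_pages: Callable[[str], int]) -> List[str]:
    """A PDF yields one page image per page; a picture file is its own single page."""
    if path.lower().endswith(".pdf"):  # assumption: type judged by extension
        return [f"{path}#page={n}" for n in range(1, count_pdf_pages(path) + 1)]
    return [path]


def process_file(path: str, models: Dict[str, Callable],
                 user_type: Optional[str] = None) -> List[PageResult]:
    results = []
    for page in split_into_pages(path, models["count_pdf_pages"]):
        detected_type = models["classify_table_type"](page)
        texts = models["recognize_text"](page)
        # With a table: use stored table model data. Without: infer layout from text.
        if models["has_table"](page):
            layout_source = "table_model"
        else:
            layout_source = "predicted_from_text"
        results.append(PageResult(page, user_type or detected_type, layout_source, texts))
    return results
```

Passing the models in as callables keeps the skeleton independent of any particular detection or recognition library; table generation, repair, and Excel export would follow on each `PageResult`.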
Optionally, before the type of the file to be processed is determined, the industry table digital processing method based on image recognition technology further comprises: sequentially performing watermark removal, rotation correction, and noise removal on the file to be processed.
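The fixed order of the three preprocessing operations can be illustrated with a toy example on a grayscale image stored as a list of rows. All three functions are simplified stand-ins chosen for illustration, not the techniques used by the applicant: the watermark is assumed to be lighter than the text, rotation is assumed to be a known multiple of 90 degrees, and noise is removed with a 3x3 median filter.

```python
from statistics import median
from typing import List

Image = List[List[int]]  # grayscale pixels, 0 = black, 255 = white


def remove_watermark(img: Image, light_threshold: int = 180) -> Image:
    """Treat faint (light gray) pixels as watermark and turn them white."""
    return [[255 if px >= light_threshold else px for px in row] for row in img]


def rotate_90_clockwise(img: Image) -> Image:
    return [list(col) for col in zip(*img[::-1])]


def correct_rotation(img: Image, quarter_turns: int) -> Image:
    """Undo a known clockwise rotation given in quarter turns (toy deskew)."""
    for _ in range((-quarter_turns) % 4):
        img = rotate_90_clockwise(img)
    return img


def remove_noise(img: Image) -> Image:
    """3x3 median filter; border pixels are left unchanged."""
    height, width = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            window = [img[y + dy][x + dx] for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            out[y][x] = int(median(window))
    return out


def preprocess(img: Image, quarter_turns: int = 0) -> Image:
    # Same order as in the text: watermark removal, rotation correction, noise removal.
    return remove_noise(correct_rotation(remove_watermark(img), quarter_turns))
```

Running the denoising step last means the median filter operates on an upright image from which the watermark has already been cleared.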
Optionally, the OCR character detection models are the CTPN and PIXEL_LINK models, and the character recognition models are the CRNN and DENSENET models; the industry report data model is a self-developed mathematical calculation model that conforms to the industry's calculation methods and is based on the four arithmetic operations.
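The text does not disclose the contents of the industry report data model, so the following is only a guess at what an arithmetic-based model could look like: a list of rules stating that one report item should equal a signed sum of other items, used to cross-check recognized values. The rule format, the item names, and the tolerance are all hypothetical, and only addition and subtraction are shown.

```python
from decimal import Decimal
from typing import Dict, List, Tuple

# (result item, [(sign, operand item), ...]); hypothetical rule format
Rule = Tuple[str, List[Tuple[int, str]]]

BALANCE_SHEET_RULES: List[Rule] = [
    ("total assets", [(+1, "current assets"), (+1, "non-current assets")]),
    ("total assets", [(+1, "total liabilities"), (+1, "owners' equity")]),
]


def check_rules(values: Dict[str, Decimal], rules: List[Rule],
                tolerance: Decimal = Decimal("0.01")) -> List[str]:
    """Return one message per rule that the recognized values violate."""
    violations = []
    for target, terms in rules:
        needed = [target] + [name for _, name in terms]
        if any(name not in values for name in needed):
            continue  # rule cannot be evaluated on this page
        expected = sum((sign * values[name] for sign, name in terms), Decimal("0"))
        if abs(values[target] - expected) > tolerance:
            violations.append(f"{target}: recognized {values[target]}, computed {expected}")
    return violations
```

A violated rule would point to a likely recognition error in one of the participating cells, which is one plausible use of such a model during table correction.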
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
The invention can process not only ordinary pictures and PDF files of all kinds, but can also classify and summarize complex and varied table types. For ordinary tables it can completely restore the table structure; for files without tables it can predict the arrangement rule from the text and automatically generate a new table according to the type specified by the user or the type built into the system. It then maps the content to keywords and numerical values to form a digitized output, thereby providing reliable data support for subsequent applications.
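How an arrangement rule might be predicted for a file without table lines is not detailed in the text. One simple possibility, shown here purely as an assumption, is to group the detected text boxes into rows by vertical position and order each row from left to right. The `TextBox` type and the tolerance value are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextBox:
    text: str
    x: float       # left edge of the detected text region
    y: float       # top edge of the detected text region
    height: float  # height of the detected text region


def group_into_rows(boxes: List[TextBox], tolerance: float = 0.5) -> List[List[TextBox]]:
    """Boxes whose tops differ by at most tolerance * height share a row."""
    rows: List[List[TextBox]] = []
    for box in sorted(boxes, key=lambda b: b.y):
        if rows and abs(box.y - rows[-1][0].y) <= tolerance * rows[-1][0].height:
            rows[-1].append(box)
        else:
            rows.append([box])
    # Within each row, read cells from left to right.
    return [sorted(row, key=lambda b: b.x) for row in rows]


def to_grid(boxes: List[TextBox]) -> List[List[str]]:
    return [[box.text for box in row] for row in group_into_rows(boxes)]
```

For example, four boxes holding "Cash", "1,200.00", "Inventory" and "850.00" at two slightly uneven heights would be grouped into two rows of two cells each, which is the grid into which table lines could then be drawn.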
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flow chart of the industry form digital processing method based on image recognition technology;
FIG. 2 is an original image after splitting an input file according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating a processed document according to an embodiment of the present invention;
FIG. 4 is a graph of the results produced in the example of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
An industry table digital processing method based on image recognition technology digitizes the table data in a paper document and produces an electronic Excel table containing all associated table styles, providing data support for subsequent work. The flow of the method is shown in FIG. 1 and mainly comprises the following steps.
A. Initialize the system and load the OCR character detection model, the character recognition model, the industry report data model, and the industry standard data model. The OCR character detection models are the CTPN and PIXEL_LINK models, the character recognition models are the CRNN and DENSENET models, and the industry report data model is a self-developed mathematical calculation model that conforms to the industry's calculation methods and is based on the four arithmetic operations.
B. Input a file and preprocess it. Preprocessing comprises watermark removal, rotation correction, and noise removal.
C. Determine the file type, that is, whether the input file is a PDF file or an ordinary picture file. If it is a PDF file, perform step D; if it is a picture file, first predict the table type and then perform step E.
D. Split the PDF file, predict the positions of the characters, and crop a small part of each picture to identify and judge the table type.
In this embodiment, a PDF file containing 3 pictures is input. In step D the PDF is split into separate pictures by page, and each picture is then processed separately.
E. Judge whether a table is contained. If so, load the table model data and perform text recognition; if not, recognize the text in the picture and predict the arrangement rule from the text.
In this embodiment, 2 of the pictures contain no table themselves, but the user can specify that they be recognized as tables, so table lines are added to the picture according to the file type. For example, the original image is shown in FIG. 2, and the preview image after automatic line addition is shown in FIG. 3.
F. Automatically generate a new table according to the type designated by the user or the built-in type of the system, fill the text from step E into the table, and repair and correct the table.
G. Perform character recognition on the table contents from step F, extract keywords and values, generate an Excel table, and output the digitized result.
In this step, the new picture with a table is recognized, keywords and values are extracted, and a recognition result in Excel format is generated, as shown in FIG. 4.
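Step G, pairing keywords with numerical values and exporting them, might look like the sketch below. It assumes the recognized table is already available as rows of cell strings, pairs each number with the nearest preceding text cell in its row, and writes CSV with the standard library as a stand-in for the Excel output described in the text. The function names and the number pattern are hypothetical.

```python
import csv
import io
import re
from decimal import Decimal
from typing import List, Optional, Tuple

# Plain or comma-grouped decimal numbers, e.g. "850", "-12.5", "1,200.00"
NUMBER = re.compile(r"-?\d{1,3}(?:,\d{3})*(?:\.\d+)?|-?\d+(?:\.\d+)?")


def parse_amount(cell: str) -> Optional[Decimal]:
    cell = cell.strip()
    if not NUMBER.fullmatch(cell):
        return None
    return Decimal(cell.replace(",", ""))


def extract_pairs(rows: List[List[str]]) -> List[Tuple[str, Decimal]]:
    """Pair every numeric cell with the nearest preceding text cell in its row."""
    pairs = []
    for row in rows:
        keyword = None
        for cell in row:
            if not cell.strip():
                continue  # skip empty cells
            amount = parse_amount(cell)
            if amount is None:
                keyword = cell.strip()
            elif keyword is not None:
                pairs.append((keyword, amount))
    return pairs


def to_csv(pairs: List[Tuple[str, Decimal]]) -> str:
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["keyword", "value"])
    for keyword, value in pairs:
        writer.writerow([keyword, str(value)])
    return buffer.getvalue()
```

Header rows that contain only text produce no pairs, so they drop out of the result without special handling.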
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (3)

1. An industry table digital processing method based on an image recognition technology is characterized by comprising the following steps:
acquiring an OCR character detection model, a character recognition model, an industry report data model and an industry standard data model;
inputting a file to be processed;
judging the type of the file to be processed;
if the file to be processed is a PDF file, automatically splitting the PDF file into a plurality of pictures by page, identifying the table type in each picture based on a self-developed industry report data model and a self-developed industry standard data model, detecting character areas by adopting the OCR character detection model, and predicting the form of the table;
if the file to be processed is a picture file, identifying the table type in the picture file and predicting the form of the table;
automatically judging, according to the prediction result for the form of the table, whether the file to be processed contains a table; if so, loading table model data and recognizing the text by adopting the character recognition model; if not, recognizing the text in the picture file to be processed by adopting the character recognition model and predicting an arrangement rule according to the text;
automatically generating a new table according to the type designated by the user or the built-in type of the system, filling the recognized text into the new table, and performing table repair and correction to obtain a corrected table;
and performing character recognition on the content of the corrected table by adopting the character recognition model, extracting keywords and values, generating an Excel table, and outputting a digitized result.
2. The industry table digital processing method based on image recognition technology according to claim 1, wherein before the type of the file to be processed is determined, the method further comprises:
and sequentially carrying out watermark removal, rotation correction and noise point removal on the file to be processed.
3. The industry table digital processing method based on image recognition technology according to claim 1, wherein the OCR character detection models are the CTPN and PIXEL_LINK models;
the character recognition models are the CRNN and DENSENET models;
the industry report data model is a self-developed mathematical calculation model that conforms to the industry's calculation methods and is based on the four arithmetic operations.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211571003.1A CN115828874A (en) 2019-08-05 2019-08-05 Industry table digital processing method based on image recognition technology

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910715902.6A CN110413979A (en) 2019-08-05 2019-08-05 Industry table digitalized processing method based on image recognition technology
CN202211571003.1A CN115828874A (en) 2019-08-05 2019-08-05 Industry table digital processing method based on image recognition technology

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201910715902.6A Division CN110413979A (en) 2019-08-05 2019-08-05 Industry table digitalized processing method based on image recognition technology

Publications (1)

Publication Number Publication Date
CN115828874A (en) 2023-03-21

Family

ID=68365805

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202211571003.1A Pending CN115828874A (en) 2019-08-05 2019-08-05 Industry table digital processing method based on image recognition technology
CN201910715902.6A Pending CN110413979A (en) 2019-08-05 2019-08-05 Industry table digitalized processing method based on image recognition technology

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN201910715902.6A Pending CN110413979A (en) 2019-08-05 2019-08-05 Industry table digitalized processing method based on image recognition technology

Country Status (1)

Country Link
CN (2) CN115828874A (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111507230A (en) * 2020-04-11 2020-08-07 创景未来(北京)科技有限公司 Method and system for identifying and extracting document and table data
CN112528599B (en) * 2020-12-15 2024-05-10 信号旗智能科技(上海)有限公司 XML-based multi-page document processing method, device, computer equipment and medium
CN112905733A (en) * 2021-02-02 2021-06-04 嘉应学院 Book storage method, system and device based on OCR recognition technology
CN116935396B (en) * 2023-06-16 2024-02-23 北京化工大学 OCR college entrance guide intelligent acquisition method based on CRNN algorithm

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101976232B (en) * 2010-09-19 2012-06-20 深圳市万兴软件有限公司 Method for identifying data form in document and device thereof
CN105589841B (en) * 2016-01-15 2018-03-30 同方知网(北京)技术有限公司 A kind of method of PDF document Table recognition
CN108416279B (en) * 2018-02-26 2022-04-19 北京阿博茨科技有限公司 Table analysis method and device in document image
CN109271613B (en) * 2018-09-25 2022-12-06 四川译讯信息科技有限公司 PDF file analysis method
CN109670477B (en) * 2018-12-28 2021-02-26 上海大智慧财汇数据科技有限公司 PDF table-oriented automatic identification system and method
CN109840519B (en) * 2019-01-25 2023-05-05 青岛盈智科技有限公司 Self-adaptive intelligent bill identification and input device and application method thereof
CN109993112B (en) * 2019-03-29 2021-04-09 杭州睿琪软件有限公司 Method and device for identifying table in picture

Also Published As

Publication number Publication date
CN110413979A (en) 2019-11-05

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination