CN115713775B - Method, system and computer equipment for extracting form from document - Google Patents


Info

Publication number
CN115713775B
CN115713775B CN202310010871.0A
Authority
CN
China
Prior art keywords
image
line
detection model
document
file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310010871.0A
Other languages
Chinese (zh)
Other versions
CN115713775A (en)
Inventor
高翔
李瀚清
杨慧宇
朱耀邦
曾丹梦
李巍豪
赵业辉
岳小龙
纪达麒
陈运文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Datagrand Information Technology Shanghai Co ltd
Original Assignee
Datagrand Information Technology Shanghai Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Datagrand Information Technology Shanghai Co ltd filed Critical Datagrand Information Technology Shanghai Co ltd
Priority to CN202310010871.0A
Publication of CN115713775A
Application granted
Publication of CN115713775B
Legal status: Active

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Character Input (AREA)

Abstract

The invention relates to a method for extracting tables from a document, comprising an offline processing stage and an online processing stage. The offline processing stage annotates data and trains machine learning models for detecting image table regions, image table types and image table lines, and outputs the corresponding detection models to the online processing stage for use. The online processing stage extracts the tables in a document in real time, covering both spreadsheets and image tables: spreadsheets can be parsed with format-protocol code for the given document type, while image tables are extracted by chaining the models trained in the offline stage, yielding spreadsheet files with the structure restored and the content filled in. A system and computer device for extracting tables are also provided. The method, system and device offer a unified way of extracting tables from common document types, pulling out all table contents in a document in one step, which is of great value for real office scenarios.

Description

Method, system and computer equipment for extracting form from document
Technical Field
The present invention relates to the field of intelligent text processing, and in particular, to a method, system, and computer device for extracting a form from a document.
Background
The table is an important way of carrying and presenting information. It has a clear structure and a high information density, and is widely used in daily office work, data files and other documents; common examples include personnel information tables, product attribute tables and financial reports.
In practice, table data is rarely presented alone; it usually appears inside document material mixed with other elements such as paragraphs, titles and pictures. Common document formats include Word, PDF and images. A table in Word can be read directly if it follows the spreadsheet protocol, but a table may also be inserted into Word as an image. In PDF and picture files, tables are stored as images, which are difficult to process: because of complex capture environments, paper flatness and printing clarity, image tables suffer from distortion, perspective and blur, which makes their subsequent use very difficult. Automatically extracting the structures and contents of all table types from different document types is therefore in high demand in real production work, and highly challenging. A spreadsheet in a document is a table whose structure and content can be edited directly, for example a table object in a Word file opened with Office or WPS software. An image table is a table stored as an image and cannot be edited, for example a table in a PDF document or in a picture file; a table can also be inserted into Word as a picture. For image tables, automatic extraction of the table structure and content is difficult, which hurts real working efficiency.
Because table styles used in practice are complex, tables can generally be divided by the completeness of their lines into full-line tables, few-line tables and wireless (line-free) tables. The full-line table is the most common: lines surround every cell on all four sides, and the table structure is clear. A few-line table usually has only a rough structure of horizontal or vertical dividing lines, and the cell divisions must be inferred with the help of text alignment information. A wireless table has no lines at all, and its structure can only be understood correctly through text alignment information.
Beyond the difficulties caused by these table types, the quality of captured image tables is uneven: printing quality, photographing equipment and paper flatness introduce shadows, perspective, line distortion, overly faint line colors and so on. The invention provides a method and a device for extracting tables from a document, aiming to support the parsing of all these types of image tables and to save them as independent spreadsheet files such as xlsx and csv.
Disclosure of Invention
The present invention aims to overcome the above shortcomings in the prior art by providing a method, a system and a computer device for extracting tables from a document. They automatically find every table in a document, parse its structure and content, and export it as a spreadsheet file such as xlsx or csv, which facilitates subsequent manual processing or automatic processing by other systems.
In order to achieve the above object, the present invention provides the following technical solutions:
The method of the invention is divided into an offline system and an online system. The offline system mainly comprises: 1. training an image table region detection model; 2. training an image table type classification model; 3. training an image table line detection model. The online system mainly comprises: 1. judging the document type; 2. judging the table type; 3. detecting and recognizing the text content in the image table; 4. detecting image table regions; 5. classifying image table types; 6. detecting image table lines; 7. constructing image cells; 8. exporting tables to spreadsheet files.
The offline system trains the relevant machine learning models on a certain amount of annotated data. The models detect image table regions, image table types and image table lines, and are output for use by the online system.
Image table region detection model training. Detection of the image table region is based on object detection: the position of each table region in an image is annotated, the original images and position information are used as training data, and a machine learning method trains an object detection model that can detect table regions. Common object detection algorithms include classics such as YOLO and Faster RCNN. A table region is represented by its top-left xy coordinates and bottom-right xy coordinates.
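A table region in this scheme is just a pair of corner coordinates. As a minimal, hypothetical sketch (the class and function names below are my own, not the patent's), the snippet shows that representation together with the intersection-over-union score that object detection pipelines such as YOLO or Faster RCNN commonly use to match a predicted region against an annotated one:

```python
from dataclasses import dataclass

@dataclass
class TableRegion:
    # A table region as described in the patent: top-left (x1, y1)
    # and bottom-right (x2, y2) corner coordinates in image pixels.
    x1: float
    y1: float
    x2: float
    y2: float

    def area(self) -> float:
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)

def iou(a: TableRegion, b: TableRegion) -> float:
    """Intersection-over-union: the standard score for matching a
    predicted table region against a labeled ground-truth region."""
    ix1, iy1 = max(a.x1, b.x1), max(a.y1, b.y1)
    ix2, iy2 = min(a.x2, b.x2), min(a.y2, b.y2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = a.area() + b.area() - inter
    return inter / union if union > 0 else 0.0
```

A detection is typically counted as correct when its IoU with some annotated region exceeds a threshold such as 0.5.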
Image table type detection model training. Image table type detection is based on image classification. Table region images are annotated, dividing the table types into full-line, few-line and wireless tables. The original images and table type labels are used as training data, and a machine learning method trains a model that can detect the table type in an image; common image classification algorithms include SVM and network algorithms such as ResNet.
Image table line detection model training. The image table line detection model detects all lines in a table; common line detection models are based on image instance segmentation techniques such as UNet and Spatial CNN. A table grid line is represented by an ordered set of pixel points: horizontal lines are ordered left to right, vertical lines top to bottom, and each pixel point is given by its xy coordinates. Since the table types are full-line, few-line and wireless, different line detection models are designed per type and trained separately to improve accuracy. Table lines are divided into physical lines and virtual lines according to whether they can be observed. By this standard, all grid lines in a full-line table are physical lines, grid lines in a wireless table are virtual lines, and grid lines in a few-line table are a combination of physical and virtual lines. The table line detection stage therefore needs to train two models: physical line detection and virtual line detection.
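The ordered-pixel-set convention for grid lines can be checked mechanically. The helper below is an illustrative sketch of that convention only (horizontal lines ordered left to right, vertical lines top to bottom); the function name is my own, not the patent's:

```python
def is_valid_gridline(points, orientation):
    """Check that a detected table line matches the stated
    representation: an ordered list of (x, y) pixel points, sorted
    left-to-right for horizontal lines and top-to-bottom for
    vertical lines."""
    if len(points) < 2:
        return False
    if orientation == "horizontal":
        xs = [p[0] for p in points]
        return all(a <= b for a, b in zip(xs, xs[1:]))
    if orientation == "vertical":
        ys = [p[1] for p in points]
        return all(a <= b for a, b in zip(ys, ys[1:]))
    raise ValueError("unknown orientation: " + orientation)
```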
In an online system, the following operations are performed:
document preprocessing. Forms in documents are divided into two major categories, electronic forms and image forms. Electronic forms are typically found in Word, while image forms may be found in any type of document, and therefore require different preprocessing depending on the document type, and then different types of form extraction depending on the preprocessing result. For example, the electronic form object in Word is directly extracted after being taken out, and the image object is exported as an image file for image form extraction. And the PDF file needs to convert each page of content into an image file for image table extraction processing.
Spreadsheet extraction. Spreadsheet extraction mainly uses the relevant file-protocol parsing library to extract the table objects in the document. For Word, all table objects can be read with the officially provided SDK, taking out the table structure and content directly.
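For illustration, the file-protocol parsing idea can be shown against the OpenXML markup inside a .docx: in word/document.xml, a table is a w:tbl element holding w:tr rows and w:tc cells, with text in w:t runs. The sketch below is a simplified stand-in for an official SDK, uses only the Python standard library, and ignores complications such as merged cells and nested tables:

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by .docx table markup.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def parse_docx_tables(document_xml: str):
    """Extract tables from a word/document.xml string:
    w:tbl -> rows (w:tr) -> cells (w:tc), collecting cell text
    from the w:t runs inside each cell."""
    root = ET.fromstring(document_xml)
    tables = []
    for tbl in root.iter(W + "tbl"):
        rows = []
        for tr in tbl.iter(W + "tr"):
            cells = []
            for tc in tr.iter(W + "tc"):
                cells.append("".join(t.text or "" for t in tc.iter(W + "t")))
            rows.append(cells)
        tables.append(rows)
    return tables

# usage: a minimal two-by-two table fragment
SAMPLE = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/'
    'wordprocessingml/2006/main"><w:body><w:tbl>'
    '<w:tr><w:tc><w:p><w:r><w:t>Name</w:t></w:r></w:p></w:tc>'
    '<w:tc><w:p><w:r><w:t>Age</w:t></w:r></w:p></w:tc></w:tr>'
    '<w:tr><w:tc><w:p><w:r><w:t>Ann</w:t></w:r></w:p></w:tc>'
    '<w:tc><w:p><w:r><w:t>30</w:t></w:r></w:p></w:tc></w:tr>'
    '</w:tbl></w:body></w:document>'
)
tables = parse_docx_tables(SAMPLE)
```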
Image table extraction. First, detect and recognize the text content in the image table. Complete table content comprises the table structure plus the cell text, so all text in the image must first be recognized and the position of each piece of text recorded; if a text's coordinates fall inside a cell region identified later, that text is the content of that cell. Text in the image is recognized with mature optical character recognition (Optical Character Recognition, OCR) technology, with the requirement that the OCR system output both the text content and its region coordinates. A text region is represented by the text's top-left xy and bottom-right xy coordinates.
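The patent assigns a piece of OCR text to the cell whose region contains its coordinates. One common concrete rule, assumed here for illustration (the patent does not specify it), is to test whether the text box's center point lies inside the cell box; both boxes use the same top-left/bottom-right representation:

```python
def text_in_cell(text_box, cell_box):
    """Decide whether an OCR'd text belongs to a cell. Both boxes are
    (x1, y1, x2, y2) tuples; the text is assigned to the cell that
    contains the center point of its bounding box."""
    cx = (text_box[0] + text_box[2]) / 2
    cy = (text_box[1] + text_box[3]) / 2
    return cell_box[0] <= cx <= cell_box[2] and cell_box[1] <= cy <= cell_box[3]
```

Using the center point rather than the full box tolerates text that slightly overhangs a cell border.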
Table region detection. Table region detection uses the region detection model trained by the offline system to locate table regions. The original image is input, and the regions of all tables in the image are output, each region comprising top-left xy and bottom-right xy coordinates.
Table type classification. The image of each table region is sent to the table type classification module, which classifies the region with the offline-trained type model into a full-line, few-line or wireless table and outputs the type.
Table line detection. The image and type of each table region are sent to the table line detection module, which outputs the table lines using the offline-trained line detection model for that type: the full-line table uses the physical line detection model, the wireless table uses the virtual line detection model, and the few-line table uses both the physical and the virtual line models.
Cell construction. From the line detection result in each table region, the xy coordinates of the four corners of every cell can be obtained from the series of intersection points of the horizontal and vertical lines. All obtained cells are then arranged by their four-corner coordinates, left to right and top to bottom, giving the structure of the whole table. Once the structure is known, the cell text is filled in by matching the text content and coordinates recognized by the OCR system against each constructed cell's coordinates: if a text's coordinate range lies inside a cell, it becomes that cell's text. The constructed table is output as a two-dimensional row-column matrix.
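For a full-line table whose detected lines happen to be straight, the intersection-and-ordering logic above reduces to a small grid computation. The sketch below makes that simplifying assumption (real detected lines are ordered pixel sets that may bend) and emits cells as top-left/bottom-right corner pairs in the patent's left-to-right, top-to-bottom order:

```python
def build_cells(h_lines, v_lines):
    """Construct cells from straight grid lines: h_lines are the y
    coordinates of horizontal lines, v_lines the x coordinates of
    vertical lines. Each adjacent pair of intersections yields one
    cell ((x1, y1), (x2, y2)), ordered left to right within a row
    and rows top to bottom."""
    ys, xs = sorted(h_lines), sorted(v_lines)
    cells = []
    for r in range(len(ys) - 1):
        for c in range(len(xs) - 1):
            cells.append(((xs[c], ys[r]), (xs[c + 1], ys[r + 1])))
    return cells
```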
Table export to spreadsheet files. After the structure of every detected table in the document has been restored and its content filled in, each table is saved in turn as a spreadsheet file by the table storage program; the supported formats are csv and xlsx.
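Saving a reconstructed row-column matrix in csv format needs only the standard library; xlsx would require a third-party library such as openpyxl, so only the csv half is sketched here (the function name is my own):

```python
import csv
import io

def export_table_to_csv(matrix, fileobj):
    """Save one reconstructed table (a row-by-column matrix of cell
    strings) in csv format, one of the two output formats named in
    the patent."""
    writer = csv.writer(fileobj)
    writer.writerows(matrix)

# usage: write to an in-memory buffer instead of a file on disk
buf = io.StringIO()
export_table_to_csv([["Name", "Age"], ["Ann", "30"]], buf)
```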
Based on document parsing, image deep learning and related technologies, the invention designs a complete offline-plus-online system and provides a method, a system and a computer device for extracting all tables and their structures from a document, which has great application value in real business. The invention has the following technical characteristics:
1. The method and system separate image elements and spreadsheet elements from a Word document using document parsing technology, convert PDF files into multiple image files page by page, and, through data annotation, train an image table region detection model, an image table type classification model and an image table line detection model with image deep learning algorithms.
2. The method and the system of the invention provide the concepts of physical table lines and virtual table lines, and complete the line detection work of various image type tables through the combination of the two table lines.
3. The method and the system provided by the invention provide a complete online system flow, use a trained offline model to detect and analyze the image form, use OCR technology to identify text contents and areas, fill and construct the form cells, and finally output all forms extracted from the document by an online processing unit and store the forms in an electronic form file, such as csv or xlsx, so that the method and the system are convenient for manual or software system use.
4. The method and the system of the invention do not need to manually sort and distinguish the table types in the document, uniformly provide a table extraction mode in common document types to extract all table contents in the document in one step, and have great significance for actual office scenes.
Drawings
FIG. 1 is a schematic overall flow chart of a method of extracting forms from documents according to the present invention.
FIG. 2 is a schematic diagram of an offline model training process for image table region detection in a method for extracting tables from documents according to the present invention.
FIG. 3 is a flow chart of training a form type detection model in a method of extracting a form from a document according to the present invention.
FIG. 4 is a schematic diagram of an offline model training process for detecting image form lines in a method for extracting forms from documents according to the present invention.
FIG. 5 is a schematic diagram of an image document preprocessing flow in a method for extracting forms from documents according to the present invention.
FIG. 6 is a schematic diagram of an online flow of image text content detection in a method for extracting forms from documents according to the present invention.
FIG. 7 is a schematic diagram of an online flow of image form region detection in a method of extracting forms from documents according to the present invention.
FIG. 8 is a schematic view of an on-line flow chart of image form type partitioning in a method of extracting forms from documents according to the present invention.
FIG. 9 is a schematic diagram of an online flow of image form line detection in a method of extracting forms from documents according to the present invention.
FIG. 10 is a schematic flow chart of image cell construction on line in a method for extracting a form from a document according to the present invention.
FIG. 11 is a schematic diagram of an online process for exporting forms to a spreadsheet file in a method of extracting forms from a document according to the present invention.
FIG. 12 is a schematic diagram of the composition of a system for extracting forms from documents in accordance with the present invention.
Detailed Description
The following describes in further detail a method, system and computer device for extracting forms from documents in conjunction with the accompanying drawings and detailed examples, in order to provide a clearer understanding of the structural composition and manner of operation, but not to limit the scope of the invention.
The invention first provides a method for extracting tables from a document, comprising an offline processing stage and an online processing stage, wherein:
the off-line processing link is to label a certain amount of image form labeling data, the range of forms, the types of forms and all lines in the forms, wherein a base model is to label ten thousand form images, scene optimization is to label about thousand, train a machine learning model to detect image form areas, image form types and image form lines, and output a corresponding detection model to the on-line processing link for application.
The online processing stage extracts all types of tables in a document in real time, including spreadsheets and image tables: spreadsheets can be parsed with format-protocol code for the given document type, while image tables are extracted by chaining the models trained in the offline stage, yielding spreadsheet files with the structure restored and the content filled in.
As shown in fig. 1, the offline processing stage includes the following steps:
s11, training an image table area detection model, detecting an image table area based on a target detection technology, marking position information of the image table area in an image, taking an original image and the position information as training data, and training out a target detection model capable of detecting the table area, wherein the table area is represented by an xy coordinate of an upper left corner and an xy coordinate of a lower right corner;
s12, training an image form type detection model, carrying out data annotation on a form region image based on an image classification technology, dividing form types into a wired form, a few-line form and a wireless form, taking original image and form type information as training data, and training out a model capable of detecting the form types in the image;
s13, training an image table line detection model, representing table grid lines by using an ordered pixel point set based on an image example segmentation technology, wherein the horizontal line arrangement sequence is from left to right, the vertical line arrangement sequence is from top to bottom, each pixel point is represented by xy coordinates, different separate training line detection models are designed according to the table type, the table lines are divided into physical lines and virtual lines according to whether the table lines can be observed as standards, and the table line detection model needs to be used for training the physical line detection model and the virtual line detection model.
The online processing stage comprises the following steps:
s14, preprocessing the document, judging the type of the document and judging the type of the table, dividing the table in the document into two major types, namely an electronic table and an image table, wherein the electronic table usually appears in Word, the image table possibly appears in any type of document, preprocessing the document differently according to the type of the document, extracting the table of different types according to the preprocessing result, executing S15 if the electronic table is the electronic table, and executing S16 if the electronic table is the image table;
s15, extracting a spreadsheet, wherein the spreadsheet mainly uses a relevant file protocol analysis library to read all table objects, uses an OpenXML format protocol to analyze the spreadsheet in the docx format of a Microsoft Word document and the WPS format of a Jinshan WPS document, extracts the table objects in the document, and uses an officially provided sdk (software development kit ) if the Word document is used, directly takes out the table structure and the content;
s16, extracting an image table, and firstly detecting and identifying text contents in the image table; judging form areas by using a trained image form area detection model, and outputting all form areas in the image, wherein each area comprises an upper left corner xy coordinate and a lower right corner xy coordinate; performing type division on the table area by using the trained image table type detection model to obtain a wired table, a wireless table and a half-line table, and outputting the types; sending the images and types in the table area into an image table line detection model, and outputting related table grid lines, wherein a physical table line detection model is used for a wired table, a virtual table line detection model is used for a wireless table, and a physical table grid line and a virtual table line model are simultaneously used for a half-line table; finally, aiming at a table line detection result in each table area, acquiring xy coordinates of four corners of each cell according to a plurality of intersection points of the transverse and vertical lines, and arranging all the acquired cells according to the four corner coordinates in a sequence from left to right and from top to bottom to acquire a structure of the whole table;
s17, leading out the forms to the electronic form file, carrying out structure restoration and content filling on all the detected forms in the file, and sequentially storing each form into the electronic form file by using a form storage program code.
As shown in fig. 2, in S11 the object detection technique includes algorithms such as YOLO and Faster RCNN, and a machine learning method trains an object detection model capable of detecting table regions; the flow includes:
inputting an original image dataset;
marking data in the form area;
generating model training data by the labeling data;
training a table target detection model;
and outputting the table area detection model.
As shown in fig. 3, in S12 the image classification technique includes SVM and network algorithms such as ResNet, and a machine learning method trains a model capable of detecting the table type in an image; the flow includes:
collecting the table area images;
labeling form type data;
generating model training data by the labeling data;
training a form type detection model;
and outputting the form type detection model.
As shown in fig. 4, in S13 the image segmentation technique includes algorithms such as UNet and Spatial CNN, and the table line detection step trains two models, physical line detection and virtual line detection; the flow includes:
a set of table region images;
judging whether each table's type is full-line, few-line or wireless;
for full-line tables, the physical table lines are annotated and used to train the physical line detection model;
for wireless tables, the virtual table lines are annotated and used to train the virtual line detection model;
for few-line tables, the physical grid lines are annotated as physical-line data and merged into the physical line model's training, while the virtual grid lines are annotated as virtual-line data and merged into the virtual line model's training;
the trained physical line detection model and the trained virtual line detection model are both output as table line detection models.
In S14, the preprocessing differs by document type: spreadsheet objects in a Word file are extracted directly, image objects are exported as image files for image table extraction, and a PDF file has each page of content converted into an image file for image table extraction. As shown in fig. 5, the preprocessing flow includes:
inputting a document and judging whether it is a Word file; if so, its elements are extracted: spreadsheet elements are taken out directly as the spreadsheet element set, while image elements are exported as picture files to form the picture file set; if not, judging whether it is a PDF file;
if it is a PDF file, the PDF is disassembled and converted into one image per page to form the picture file set;
if it is not a PDF file, judging whether it is an image file; if so, it also joins the picture file set, and if not, it cannot be processed and is discarded.
In S16, text in the image is recognized using optical character recognition and output by the OCR system; the text content and region coordinates of each piece of text are represented by the text's top-left xy and bottom-right xy coordinates. As shown in fig. 6, the image text recognition flow includes:
inputting an original image;
calling an OCR system to perform character recognition;
outputting all the characters and the coordinates thereof, namely the OCR result.
In S16, the table area detection uses the trained area detection model to determine the table area, and as shown in fig. 7, the table area detection flow includes:
inputting an original image;
calling a table region detection model to predict;
and outputting the coordinates of all the table areas.
In S16, as shown in fig. 8, the image of each table region is sent to the table type classification model, which classifies the region according to the offline-trained type model and outputs full-line, few-line or wireless.
In S16, as shown in fig. 9, the table line online detection process includes:
inputting a form area image and a form type;
judging whether the table is a full-line table; if so, predicting with the physical line detection model, otherwise judging whether it is a few-line table;
if it is a few-line table, predicting the physical lines with the physical line detection model and the virtual lines with the virtual line detection model;
if it is a wireless table, predicting with the virtual line detection model only;
the physical and virtual line detection models output the set of all lines in the table, i.e. the table line set.
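The branching above amounts to choosing which trained models to run for each table type. A minimal sketch (the names are my own; "full-line" stands for the wired table and "few-line" for the half-line table elsewhere in the text):

```python
def select_line_models(table_type):
    """Pick the line detection models to run, per the online flow:
    full-line tables use only the physical-line model, wireless
    tables only the virtual-line model, few-line tables use both."""
    if table_type == "full-line":
        return ["physical"]
    if table_type == "wireless":
        return ["virtual"]
    if table_type == "few-line":
        return ["physical", "virtual"]
    raise ValueError("unknown table type: " + table_type)
```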
as shown in fig. 10, the cell construction process includes:
inputting an OCR result and a table line set;
calculating all intersection points of the table lines;
obtaining four-corner coordinates of the cell according to all the intersection points;
arranging all the cells in the sequence from top to bottom and from left to right;
filling cell characters according to the character coordinates and the cell coordinates;
and outputting a row-column two-dimensional matrix table structure.
In S16, as shown in fig. 11, the export table to spreadsheet file flow includes:
inputting the table structure contents, the save format and the output path;
invoking a table storage program;
and outputting all the electronic form files to the appointed position, and finishing the extraction of the document form.
The invention also relates to a system for extracting tables from a document, comprising an offline processing unit and an online processing unit. The offline processing unit comprises an image table region detection model, an image table type classification model and an image table line detection model, all trained through data annotation; the online processing unit comprises a document preprocessing module, a spreadsheet extraction module, an image table extraction module and a table export module, with its composition shown in fig. 12.
In the online processing unit, the document preprocessing module judges and identifies the input document, divides the form in the document into an electronic form and an image form, and then extracts the electronic form element set and the image file set respectively according to different forms.
The electronic form extraction module extracts electronic forms in the documents by using a file protocol analysis library to obtain an electronic form element set, wherein the element set comprises a form structure and form contents;
The image table extraction module has a text content detection and recognition sub-module, a table region detection sub-module, a table type classification sub-module, a table line detection sub-module and a cell construction sub-module. The text content detection and recognition sub-module recognizes the text information in an image file and records the position of every piece of text. The table region detection sub-module calls the trained image table region detection model in the offline processing unit to output the region coordinates of all tables in the image. The table type classification sub-module calls the trained image table type classification model in the offline processing unit, obtaining a full-line, few-line or wireless table and outputting the type. The table line detection sub-module calls the trained image table line detection model in the offline processing unit to obtain the combined set of all lines in the table. The cell construction sub-module arranges all obtained cells by their four-corner coordinates, left to right and top to bottom, to obtain the structure of the whole table, then fills in the cell text by matching the recognized text contents and coordinates against each constructed cell's coordinates, and outputs the table as a two-dimensional matrix.
The table export module performs structure restoration and content filling on all tables detected in the document, and stores each table in turn into a spreadsheet file using the table storage program; the saved spreadsheet file format is csv format or xlsx format.
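A minimal sketch of the table export step, writing each two-dimensional table matrix to a csv-format spreadsheet file (one of the two formats named above; xlsx export would typically use a third-party library such as openpyxl instead). The function name and path scheme are hypothetical:

```python
import csv

def export_tables(tables, base_path):
    """Store each recovered row-by-column table matrix in turn as its
    own csv-format spreadsheet file; return the paths written."""
    paths = []
    for i, matrix in enumerate(tables):
        path = f"{base_path}_table{i}.csv"
        with open(path, "w", newline="", encoding="utf-8") as f:
            csv.writer(f).writerows(matrix)
        paths.append(path)
    return paths
```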
A computer device provided with a system for extracting forms from documents, the system being a computer program or storage medium that performs the above method of automatically extracting forms from documents, extracting table files from input documents as spreadsheet files in csv or xlsx format. The computer device is provided with a table analysis online module serving as the online processing unit and a table analysis offline module serving as the offline processing unit; the table analysis offline module implements table region detection model training, table type classification model training and table line detection model training. The table analysis online module implements table region detection prediction, table type classification prediction and table line detection prediction, as well as document-to-image conversion for image tables, spreadsheet parsing and extraction, and result data storage. The computer employs mature OpenCV, image classification, object detection and image segmentation algorithms, applied on a deep learning framework with high-performance parallel computing as the computing framework; resource scheduling management uses both CPU and GPU to complete computation, and data storage includes both local disk and network storage space.
Of course, the above describes only a limited set of implementations of the method, system and computer device for extracting forms from documents; other possible implementations and system components are also encompassed. In summary, the scope of the invention also covers other variations and alternatives that will be apparent to those skilled in the art.

Claims (16)

1. A method of extracting forms from a document, the method comprising an offline processing stage and an online processing stage, wherein:
the offline processing stage trains machine learning models by annotating table image data so as to detect image table regions, image table types and image table lines, and outputs the trained detection models to the online processing stage for use;
the online processing stage extracts all types of tables in the document in real time, including spreadsheets and image tables, wherein spreadsheets are parsed according to the format protocol code of the given document, while image tables are extracted by chaining the models trained in the offline processing stage, yielding spreadsheet files after structure restoration and content filling; the online processing stage comprises the following steps:
s14, preprocessing a document, judging the type of the document and judging the type of the table, dividing the table in the document into two major types, namely an electronic table and an image table, wherein files appearing in the electronic table comprise Word and WPS, the image table can appear in any type of document, different preprocessing is carried out according to the type of the document, then different types of table extraction is carried out according to the preprocessing result, if the electronic table is used for executing S15, and if the electronic table is used for executing S16;
s15, extracting a spreadsheet, wherein the spreadsheet uses a corresponding file protocol analysis library to extract a table object in a document, and uses an OpenXML format protocol to analyze the spreadsheet in the docx format of the Microsoft Word document and the WPS format of the gold mountain WPS document to directly take out the table structure and the content;
s16, extracting an image table, and firstly detecting and identifying text contents in the image table; judging form areas by using a trained image form area detection model, and outputting all form areas in the image, wherein each area comprises an upper left corner xy coordinate and a lower right corner xy coordinate; performing type division on the table area by using the trained image table type detection model to obtain a wired table, a wireless table and a half-line table, and outputting the types; sending the images and types in the table area into an image table line detection model, and outputting related table grid lines, wherein a physical table line detection model is used for a wired table, a virtual table line detection model is used for a wireless table, and a physical table grid line and a virtual table line model are simultaneously used for a half-line table; finally, aiming at a table line detection result in each table area, acquiring xy coordinates of four corners of each cell according to a plurality of intersection points of the transverse and vertical lines, and arranging all the acquired cells according to the four corner coordinates in a sequence from left to right and from top to bottom to acquire a structure of the whole table;
s17, leading out the forms to the electronic form file, carrying out structure restoration and content filling on all the detected forms in the file, and sequentially storing each form into the electronic form file by using a form storage program code.
2. The method of claim 1, wherein the offline processing stage comprises the steps of:
s11, training an image table area detection model, detecting an image table area based on a target detection technology, marking position information of the image table area in an image, taking an original image and the position information as training data, and training out a target detection model capable of detecting the table area, wherein the table area is represented by an xy coordinate of an upper left corner and an xy coordinate of a lower right corner;
s12, training an image form type detection model, carrying out data annotation on a form region image based on an image classification technology, dividing form types into a wired form, a few-line form and a wireless form, taking original image and form type information as training data, and training out a model capable of detecting the form types in the image;
s13, training an image table line detection model, representing table grid lines by using an ordered pixel point set based on an image example segmentation technology, wherein the horizontal line arrangement sequence is from left to right, the vertical line arrangement sequence is from top to bottom, each pixel point is represented by xy coordinates, different separate training line detection models are designed according to the table type, the table lines are divided into physical lines and virtual lines according to whether the table lines can be observed as standards, and the table line detection model needs to be used for training the physical line detection model and the virtual line detection model.
3. The method as claimed in claim 2, wherein in S11, the object detection technique includes algorithms such as YOLO and Faster RCNN, and a machine learning method is used to train the target detection model capable of detecting table regions; the process includes:
inputting an original image dataset;
marking data in the form area;
generating model training data by the labeling data;
training a table target detection model;
and outputting the table area detection model.
4. The method according to claim 2, wherein in S12, the image classification technique includes algorithms such as SVM and ResNet networks, and a machine learning method is used to train the model capable of detecting the table type in an image; the process includes:
collecting the table area images;
labeling form type data;
generating model training data by the labeling data;
training a form type detection model;
and outputting the form type detection model.
5. The method of claim 2, wherein in S13, the image segmentation technique includes algorithms such as UNet and Spatial CNN, and the table line detection model requires training two models, physical line detection and virtual line detection; the flow includes:
inputting the set of table region images;
judging whether the table type is a wired table, a half-line table or a wireless table;
after physical table lines are annotated on the wired tables, the physical table line detection model is trained to obtain the physical table line detection model;
after virtual table lines are annotated on the wireless tables, the virtual table line detection model is trained to obtain the virtual table line detection model;
the physical grid lines in the half-line tables are used as physical table line labels and merged into the training and output of the physical table line detection model, and the virtual grid lines in the half-line tables are used as virtual table line labels and merged into the training and output of the virtual table line detection model;
the trained physical table line detection model is output as a table line detection model, and the trained virtual table line detection model is likewise output as a table line detection model.
6. The method according to claim 1, wherein in S14, when different preprocessing is performed according to the document type, the spreadsheet objects in a Word file are taken out and extracted directly and the image objects are exported as image files for image table extraction, while a PDF file requires converting each page of content into an image file for image table extraction.
7. The method for extracting forms from documents as claimed in claim 6, wherein the preprocessing process comprises:
inputting the document and judging whether it is a Word file; if so, extracting the document elements: spreadsheet elements are extracted directly as the spreadsheet element set, and image elements are exported as a collection of graphic files to obtain the picture file set; if not, judging whether it is a PDF file;
if it is a PDF file, the PDF file is disassembled and converted into multiple page images to form the picture file set;
if it is not a PDF file, judging whether it is an image file; if so, the image file is taken as the picture file set, and if not, the file is discarded and not processed.
8. The method according to claim 1, wherein in S16, the characters in the image are recognized using an optical character recognition technique, the OCR system outputs the character content and region coordinates, the region coordinates of a character being represented by the xy coordinates of its upper left corner and lower right corner, and the image character recognition process includes:
inputting an original image;
calling an OCR system to perform character recognition;
outputting all the characters and their coordinates, namely the OCR result.
9. The method according to claim 1, wherein in S16, the table region detection uses a trained region detection model to determine the table region, and the table region detection process includes:
inputting an original image;
calling a table region detection model to predict;
and outputting the coordinates of all the table areas.
10. The method for extracting a form from a document according to claim 9, wherein in S16, the image in the table region is fed into the table type classification model, the table region is classified according to the offline-trained type model, and a wired table, a wireless table or a half-line table is obtained and output.
11. The method according to claim 10, wherein in S16, the online table line detection process includes:
inputting a form area image and a form type;
judging whether the table is a wired table; if so, the physical line detection model is adopted for prediction; otherwise, judging whether it is a half-line table;
if it is a half-line table, the physical line detection model is adopted to predict the physical lines and the virtual line detection model is adopted to predict the virtual lines;
if it is a wireless table, only the virtual line detection model is adopted for prediction;
the physical line detection model and the virtual line detection model output all the line sets in the table, namely the table line set.
12. The method according to claim 11, wherein in S16, the cell construction process includes:
inputting an OCR result and a table line set;
calculating all intersection points of the table lines;
obtaining four-corner coordinates of the cell according to all the intersection points;
arranging all the cells in the sequence from top to bottom and from left to right;
filling cell characters according to the character coordinates and the cell coordinates;
and outputting a row-column two-dimensional matrix table structure.
13. The method of claim 12, wherein in S16, the flow of exporting tables to spreadsheet files comprises:
inputting the table structure contents, the saving format and the path;
invoking the table storage program;
and outputting all the spreadsheet files to the designated location, completing the document table extraction.
14. A system for extracting forms from documents, characterized by comprising an offline processing unit and an online processing unit, wherein the offline processing unit comprises an image table region detection model, an image table type classification model and an image table line detection model, each trained through data annotation; the online processing unit comprises a document preprocessing module, a spreadsheet extraction module, an image table extraction module and a table export module;
the document preprocessing module judges and recognizes the input document, divides the tables in the document into two classes, spreadsheets and image tables, and then extracts according to the two different table classes to obtain the spreadsheet element set and the picture file set respectively;
the spreadsheet extraction module extracts the spreadsheets in the document using a file protocol parsing library to obtain the spreadsheet element set, the element set comprising the table structure and table contents;
the image table extraction module is provided with a text content detection and recognition sub-module, a table region detection sub-module, a table type classification sub-module, a table line detection sub-module and a cell construction sub-module; the text content detection and recognition sub-module recognizes the text information in the image file and marks the position of each word; the table region detection sub-module calls the trained image table region detection model in the offline processing unit to output the region coordinates of all tables in the image; the table type classification sub-module calls the trained image table type classification model in the offline processing unit to classify each region as a wired table, a wireless table or a half-line table and output the type; the table line detection sub-module calls the trained image table line detection model in the offline processing unit to obtain the set of all lines in the table; the cell construction sub-module arranges all obtained cells by their four-corner coordinates from left to right and from top to bottom to obtain the structure of the whole table, then matches the recognized text content and coordinates against each constructed cell's coordinates to fill the cells with text, obtaining a two-dimensional table matrix which is output as the table;
and the table export module performs structure restoration and content filling on all tables detected in the document, and stores each table in turn as a spreadsheet file using the table storage program.
15. The system for extracting forms from documents as claimed in claim 14, wherein the saved spreadsheet file format is csv format or xlsx format.
16. A computer device, characterized in that a system for extracting forms from documents is provided in the computer device, the system being a computer program or a storage medium which performs the method of any one of claims 1-13 to extract forms from an input document.
CN202310010871.0A 2023-01-05 2023-01-05 Method, system and computer equipment for extracting form from document Active CN115713775B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310010871.0A CN115713775B (en) 2023-01-05 2023-01-05 Method, system and computer equipment for extracting form from document


Publications (2)

Publication Number Publication Date
CN115713775A CN115713775A (en) 2023-02-24
CN115713775B true CN115713775B (en) 2023-04-25

Family

ID=85236169

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310010871.0A Active CN115713775B (en) 2023-01-05 2023-01-05 Method, system and computer equipment for extracting form from document

Country Status (1)

Country Link
CN (1) CN115713775B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116052193B (en) * 2023-04-03 2023-06-30 杭州实在智能科技有限公司 RPA interface dynamic form picking and matching method and system

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020143831A1 (en) * 2001-03-28 2002-10-03 Bennett Paul W. System and method for calculation using spreadsheet lines and vertical calculations in a single document
GB2574608B (en) * 2018-06-11 2020-12-30 Innoplexus Ag System and method for extracting tabular data from electronic document
CN110390269B (en) * 2019-06-26 2023-08-01 平安科技(深圳)有限公司 PDF document table extraction method, device, equipment and computer readable storage medium
CN113221743B (en) * 2021-05-12 2024-01-12 北京百度网讯科技有限公司 Table analysis method, apparatus, electronic device and storage medium
CN113688688A (en) * 2021-07-28 2021-11-23 达观数据(苏州)有限公司 Completion method of table lines in picture and identification method of table in picture
CN114565927A (en) * 2022-03-03 2022-05-31 上海恒生聚源数据服务有限公司 Table identification method and device, electronic equipment and storage medium
CN114782970B (en) * 2022-06-22 2022-09-16 广州市新文溯科技有限公司 Table extraction method, system and readable medium

Also Published As

Publication number Publication date
CN115713775A (en) 2023-02-24

Similar Documents

Publication Publication Date Title
CN110363102B (en) Object identification processing method and device for PDF (Portable document Format) file
US8718364B2 (en) Apparatus and method for digitizing documents with extracted region data
CN111027297A (en) Method for processing key form information of image type PDF financial data
CN105260751B (en) A kind of character recognition method and its system
CN112883926B (en) Identification method and device for form medical images
US11341319B2 (en) Visual data mapping
CN115713775B (en) Method, system and computer equipment for extracting form from document
CN113780229A (en) Text recognition method and device
CN111652232A (en) Bill identification method and device, electronic equipment and computer readable storage medium
CN112418812A (en) Distributed full-link automatic intelligent clearance system, method and storage medium
CN116052193B (en) RPA interface dynamic form picking and matching method and system
CN109726369A (en) A kind of intelligent template questions record Implementation Technology based on normative document
CN115828874A (en) Industry table digital processing method based on image recognition technology
KR20210116371A (en) Image processing method, device, electronic equipment, computer readable storage medium and computer program
Yuan et al. An opencv-based framework for table information extraction
CN111241329A (en) Image retrieval-based ancient character interpretation method and device
CN116052195A (en) Document parsing method, device, terminal equipment and computer readable storage medium
CN111241955B (en) Bill information extraction method and system
CN114332866A (en) Document curve separation and coordinate information extraction method based on image processing
JP7470264B1 (en) LAYOUT ANALYSIS SYSTEM, LAYOUT ANALYSIS METHOD, AND PROGRAM
CN114202761B (en) Information batch extraction method based on picture information clustering
CN116303237A (en) Image data structure capable of backtracking errors and labeling method
WO2024047764A1 (en) Layout analysis system, layout analysis method, and program
CN114663414B (en) Rock and ore recognition and extraction system and method based on UNET convolutional neural network
CN114596577A (en) Image processing method, image processing device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant