CN116311299A - Method, device and system for identifying structured data of table


Info

Publication number
CN116311299A
CN116311299A (application number CN202310080299.5A)
Authority
CN
China
Prior art keywords: processed, document, recognition, target, type
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310080299.5A
Other languages
Chinese (zh)
Inventor
张一�
艾韬
毛景羡
陈灿伟
马鹏开
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan Jiuli Supply Chain Co ltd
Original Assignee
Hunan Jiuli Supply Chain Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan Jiuli Supply Chain Co ltd filed Critical Hunan Jiuli Supply Chain Co ltd
Priority to CN202310080299.5A
Publication of CN116311299A
Legal status: Pending


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 - Document-oriented image-based pattern recognition
    • G06V30/41 - Analysis of document content
    • G06V30/412 - Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 - Character recognition
    • G06V30/14 - Image acquisition
    • G06V30/146 - Aligning or centring of the image pick-up or image-field
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 - Character recognition
    • G06V30/16 - Image preprocessing
    • G06V30/162 - Quantising the image signal
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 - Character recognition
    • G06V30/18 - Extraction of features or characteristics of the image
    • G06V30/1801 - Detecting partial patterns, e.g. edges or contours, or configurations, e.g. loops, corners, strokes or intersections

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a data identification method, device and system for a form. The method comprises the following steps: inputting a document to be processed into a preset neural network model for identification, and determining the form position and form type in the document to be processed; determining a corresponding cell identification method according to the determined form type, and identifying the form position by utilizing the corresponding cell identification method to obtain a target cell; performing optical character recognition on all the target cells to obtain target characters; and associating the target character with the corresponding target cell to serve as a recognition result. The accuracy of form data recognition may be improved based on the present application.

Description

Method, device and system for identifying structured data of table
Technical Field
The present disclosure relates to the field of form data identification technologies, and in particular, to a method, an apparatus, and a system for identifying structured data of a form.
Background
In many business scenarios, it is necessary to identify and extract data in a table. For example, in international trade, import and export goods are typically handed over to professional third-party companies for warehousing, transfer, and onward shipment. In this process, the third-party company needs to obtain essential information such as the shipper, the recipient, and the names of goods from forms in bills of lading, shipping bills, orders, and the like.
At present, table data are mainly identified by detecting lines to locate each cell in a table and then recognizing the content of each cell. However, this method has low recognition accuracy when the lines in the table are blurred, broken, or otherwise incomplete.
For this reason, it is highly desirable to provide a new method that can improve the accuracy of recognition of form data.
Disclosure of Invention
In order to solve the defects in the prior art, the main purpose of the application is to provide a method, a device and a system for identifying data of a form so as to improve the accuracy of data identification of the form.
In order to achieve the above purpose, the technical scheme of the application is as follows:
a method of data identification of a form, the method comprising:
inputting a document to be processed into a preset neural network model for identification, and determining the form position and form type in the document to be processed;
determining a corresponding cell identification method according to the determined form type, and identifying the form position by utilizing the corresponding cell identification method to obtain a target cell;
performing optical character recognition on all the target cells to obtain target characters;
and associating the target character with the corresponding target cell to serve as a recognition result.
Preferably, the document to be processed is obtained by:
acquiring a document to be processed;
when the document to be processed is in an image format, binarizing the document to be processed;
and carrying out alignment processing on the binarized image to obtain the document to be processed.
Preferably, the aligning the binarized image includes:
performing contour analysis on the binarized image to determine vertex coordinates;
calculating a homography matrix according to the vertex coordinates;
and performing perspective transformation on the binarized image according to the homography matrix to obtain the document to be processed.
Preferably, the neural network model is based on a Cascade R-CNN model fused with a region proposal network (RPN).
Preferably, the inputting the document to be processed into a preset neural network model for recognition, and determining the form position and the form type in the document to be processed includes:
inputting the document to be processed into the region proposal network and the backbone network of the Cascade R-CNN model;
extracting candidate boxes of the document to be processed by using the region proposal network;
inputting the candidate boxes into the candidate regions of the Cascade R-CNN model;
and obtaining the table position and the table type of the document to be processed by using the candidate regions of the Cascade R-CNN model, a pooling layer connected with the backbone network, and a fully connected layer connected with the pooling layer.
Preferably, when the table type is the lined type, the identifying the target cell at the table position by using the corresponding cell identification method includes:
performing line detection by using the Hough transform to confirm the horizontal and vertical lines within the table position;
and confirming the position of each target cell through the intersection points of the horizontal and vertical lines.
Preferably, when the table type is the unlined type, the identifying the target cell at the table position by using the corresponding cell identification method includes:
applying dilation and erosion to the characters within the table position by using a morphological image processing method;
and determining each target cell according to the spacing between the processed characters.
Preferably, before the target cell is obtained by applying the corresponding cell identification method to the table position, the method further comprises cropping the document to be processed according to the table position;
applying the corresponding cell identification method to the table position then comprises applying it to the cropped document portion to obtain the target cell.
Preferably, the method further comprises:
determining the recognition accuracy according to the recognition result;
and when the recognition accuracy is lower than a preset value, updating and retraining the neural network model, and performing the next data identification by using the updated neural network model.
Preferably, the method further comprises:
taking the document to be processed as a sample set, updating and storing the sample set in an image database;
the training of the neural network model comprises the following steps:
and performing incremental learning training and parameter adjustment on the neural network model by using the updated image database.
Another aspect discloses a data recognition apparatus of a form, the apparatus comprising:
the form position and type recognition unit is used for inputting the document to be processed into a preset neural network model for recognition and determining the form position and the form type in the document to be processed;
the target cell identification unit is used for determining a corresponding cell identification method according to the determined form type and identifying the position of the form by utilizing the corresponding cell identification method to obtain a target cell;
the target character recognition unit is used for carrying out optical character recognition on all the target cells to obtain target characters;
and the recognition result unit is used for associating the target character with the corresponding target cell as a recognition result.
Yet another aspect provides a computer system comprising:
one or more processors; and
a memory associated with the one or more processors, the memory storing program instructions that, when read and executed by the one or more processors, perform any one of the methods described above.
The beneficial effects of the application are that:
according to the method and the device, the position and the type of the form are obtained by identifying the document to be processed by using the neural network model, and compared with the prior art, the accuracy of position and type identification is improved. Further, different types of tables are processed differently, so that the accuracy of cell identification is further improved.
Furthermore, the Cascade R-CNN model fused with the region proposal network (RPN) is used in the application; compared with the original Cascade R-CNN model, the recognition result is more accurate and is obtained more rapidly.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of the method of the present application;
FIG. 2 is a diagram of a neural network model of the present application;
FIGS. 3A-3D are schematic diagrams of form identification of the present application;
FIG. 4 is a schematic diagram of a cell identification method of the present application;
FIG. 5 is a block diagram of the apparatus of the present application;
FIG. 6 is a schematic diagram of a computer system architecture according to the present application.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the present application more apparent, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
In the description of the present application, it should be understood that the terms "X-axis," "Y-axis," "Z-axis," "vertical," "parallel," "up," "down," "front," "back," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," etc. indicate or are based on the orientation or positional relationship shown in the drawings, merely for convenience of description and to simplify the description, and do not indicate or imply that the devices or elements referred to must have a particular orientation, be configured and operated in a particular orientation, and thus should not be construed as limiting the present application. Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. In the description of the present application, unless otherwise indicated, the meaning of "a plurality" is two or more.
In the description of the present application, it should be noted that, unless explicitly specified and limited otherwise, the terms "mounted," "connected," and "connected" are to be construed broadly, and may be either fixedly connected, detachably connected, or integrally connected, for example; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the terms in this application will be understood by those of ordinary skill in the art in a specific context.
This application identifies the table position and table type of the document to be processed through a neural network model, and then identifies the cells with different methods according to the table type, so as to improve the accuracy of table data recognition.
Example 1
As shown in fig. 1, there is provided a data identification method of a table, the method including:
and inputting the document to be processed into a preset neural network model for recognition, and determining the form position and form type in the document to be processed.
The table position generally refers to the top-left vertex coordinates and the length and width of the table on the image, and the table types can be divided as required, for example into lined tables and unlined tables. Tables whose lines are incomplete are forcibly classified by the neural network model into one of these categories according to the degree of incompleteness; the specific processing is described in detail later.
Post-processing into structured data is then carried out according to the identified table position and table type, specifically:
determining a corresponding cell identification method according to the determined form type, and identifying the form position by utilizing the corresponding cell identification method to obtain a target cell;
performing optical character recognition (OCR) on all the target cells to obtain target characters; any of various mainstream OCR engines may be used to recognize the target characters of all the cells.
Associating the target characters with the corresponding target cells then forms the recognition result. The recognition result and the cell position information are output together as the final result of the system.
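For illustration only, the following is a minimal Python sketch of this cell-OCR association step, assuming the open-source pytesseract engine as one possible mainstream OCR engine; the (x, y, w, h) cell-box format and the function name are assumptions of the sketch, not prescribed by the application.

    import pytesseract

    def recognize_cells(image, cell_boxes):
        # Associate OCR text with each target cell. `image` is a numpy image of
        # the (cropped) table; `cell_boxes` is a hypothetical list of (x, y, w, h).
        results = []
        for (x, y, w, h) in cell_boxes:
            crop = image[y:y + h, x:x + w]            # crop one target cell
            text = pytesseract.image_to_string(crop)  # optical character recognition
            results.append({"cell": (x, y, w, h), "text": text.strip()})
        return results  # target characters associated with their cell positions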
The neural network model achieves different accuracies on different images, so the accuracy of the recognition result needs to be checked to determine whether the model should be further updated and improved. The method therefore further comprises:
determining the recognition accuracy according to the output recognition result;
and when the recognition accuracy is lower than a preset value, updating and retraining the neural network model, and performing the next data identification by using the updated neural network model.
Updating and retraining the neural network model means adding the document to be processed to a sample set that is updated and stored in an image database, and performing incremental learning training and parameter adjustment on the neural network model by using the updated image database.
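A minimal sketch of this accuracy-gated update loop follows; the threshold value, the toy sample database, and the assumption that the model returns a training loss are all placeholders for illustration, since the application specifies only the control flow.

    ACCURACY_THRESHOLD = 0.95  # the "preset value" (assumed for illustration)

    class SampleDB:
        """Toy stand-in for the image database holding the sample set."""
        def __init__(self):
            self.samples = []

        def append(self, image, target):
            self.samples.append((image, target))

    def maybe_update_model(model, optimizer, image, target, accuracy, db):
        # Store the processed document as a new sample, then retrain the model
        # incrementally when the measured accuracy falls below the preset value.
        db.append(image, target)
        if accuracy < ACCURACY_THRESHOLD:
            model.train()
            for img, tgt in db.samples:      # incremental learning pass
                optimizer.zero_grad()
                loss = model(img, tgt)       # assumed: model returns a loss in train mode
                loss.backward()
                optimizer.step()
        return model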
Example two
In an actual business scenario, the form may exist in various file formats, such as PDF, spreadsheet format, image format, and the like. An image-format file may be photographed by a camera, a mobile phone, or the like, and the shooting angle may deviate to some extent. Therefore, as shown in FIG. 1, the application further includes a step of preprocessing the file to obtain a document to be processed in a standard format:
acquiring a document to be processed, namely a new document;
when the document to be processed is in an image format, binarizing the document;
and carrying out alignment processing on the binarized image to obtain the document to be processed.
Preferably, the aligning the binarized image includes:
performing contour analysis on the binarized image to determine vertex coordinates;
calculating a homography matrix according to the vertex coordinates;
and performing perspective transformation on the binarized image according to the homography matrix to obtain the document to be processed.
The document to be processed after alignment processing is obtained through binarization processing and homography perspective transformation, so that the accuracy of subsequent recognition is further improved.
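A minimal OpenCV sketch of this preprocessing pipeline (binarization, contour analysis for vertex coordinates, homography calculation, and perspective transformation) is given below; the Otsu thresholding and the fixed output size are illustrative assumptions rather than parameters fixed by the application.

    import cv2
    import numpy as np

    def preprocess(image_bgr, out_w=1000, out_h=1400):
        gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
        _, binary = cv2.threshold(gray, 0, 255,
                                  cv2.THRESH_BINARY | cv2.THRESH_OTSU)

        # Contour analysis: approximate the largest contour by four vertices.
        contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        page = max(contours, key=cv2.contourArea)
        quad = cv2.approxPolyDP(page, 0.02 * cv2.arcLength(page, True), True)
        if len(quad) != 4:
            return binary  # no clean quadrilateral found; skip alignment

        # Homography from the vertex correspondences, then perspective warp.
        # (Corner ordering is omitted for brevity; real code should first sort
        # the four vertices into a consistent order.)
        src = quad.reshape(4, 2).astype(np.float32)
        dst = np.float32([[0, 0], [out_w, 0], [out_w, out_h], [0, out_h]])
        H, _ = cv2.findHomography(src, dst)
        return cv2.warpPerspective(binary, H, (out_w, out_h))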
Example III
The neural network model in this application may be chosen in a variety of ways. In a third embodiment, shown in FIG. 2, the model is based on the Cascade R-CNN model and incorporates a region proposal network (RPN).
The Cascade R-CNN model comprises a backbone network, cascaded candidate-region heads Bn, an ROI pooling layer, a fully connected (FC) layer, and the like.
The document to be processed is input into both the backbone network and the region proposal network (RPN).
Backbone network: may be based on various mainstream convolutional network structures, including but not limited to AlexNet, VGG, ResNet, and the like.
RPN, i.e. region proposal network: consistent with the RPN in Faster R-CNN, it extracts candidate boxes during feature extraction.
Candidate-region heads Bn: a three-stage structure is used; each stage outputs a category and a candidate region, and the candidate-region information of each stage is used as the input of the next stage. In addition, following the Cascade R-CNN design, each stage outputs category information alongside the candidate-region position, but only the category result of the last stage is used.
ROI pooling layer: a conventional structure in deep learning. It maps RoIs of different sizes onto the convolutional feature maps to extract feature maps of the same size.
FC, fully connected layer: a conventional structure in deep learning. Each node of the fully connected layer is connected to all nodes of the preceding layer and integrates the features extracted upstream.
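For illustration, the data flow just described (backbone features, RPN-style proposals, three cascaded heads, and the last stage's category output) can be summarized in the following PyTorch sketch. The ResNet-18 backbone, head dimensions, simplified box update, and the dummy proposal tensor standing in for a real RPN are all assumptions of the sketch, not the application's actual network.

    import torch
    import torch.nn as nn
    import torchvision
    from torchvision.ops import roi_align

    class CascadeTableDetector(nn.Module):
        def __init__(self, num_classes=3):  # e.g. background, lined, unlined
            super().__init__()
            resnet = torchvision.models.resnet18(weights=None)
            self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # C5 feature map
            self.stages = nn.ModuleList(nn.ModuleDict({
                "cls": nn.Linear(512 * 7 * 7, num_classes),  # per-stage category branch
                "reg": nn.Linear(512 * 7 * 7, 4),            # per-stage box refinement
            }) for _ in range(3))                            # three cascade stages B1..B3

        def forward(self, image, proposals):
            # `proposals` plays the role of the RPN output: a (K, 4) box tensor.
            feats = self.backbone(image)                     # stride-32 feature map
            boxes, logits = proposals, None
            for stage in self.stages:
                pooled = roi_align(feats, [boxes], output_size=7, spatial_scale=1 / 32)
                flat = pooled.flatten(1)                     # (K, 512*7*7)
                logits = stage["cls"](flat)                  # category output
                boxes = boxes + stage["reg"](flat)           # simplified refinement feeds next stage
            return boxes, logits  # table positions; table types from the last stage only

    # Usage with a dummy image and two RPN-style proposals.
    model = CascadeTableDetector()
    img = torch.randn(1, 3, 512, 512)
    props = torch.tensor([[10., 10., 200., 150.], [30., 220., 480., 400.]])
    table_boxes, table_types = model(img, props)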
Example IV
To improve recognition accuracy, different cell identification methods are provided for different table types. In this application the table types are divided into a lined type and an unlined type. It should be noted that "lined" and "unlined" here are the classification results of the neural network model: during identification, a table with incomplete lines is forcibly classified into one of these two categories according to the degree of incompleteness, and the unlined type does not require the complete absence of lines.
As shown in FIGS. 3A and 3B, FIG. 3A shows a document to be processed containing a lined table, and FIG. 3B shows the table position output by the model (the shaded portion in FIG. 3B);
as shown in FIGS. 3C and 3D, FIG. 3C shows a document to be processed containing an unlined table, and FIG. 3D shows the table position output by the model (the shaded portion in FIG. 3D).
After the model outputs the table position and table type, the cell identification is performed as shown in fig. 4:
When the table type is the lined type, the identifying the target cell at the table position by using the corresponding cell identification method includes:
performing line detection by using the Hough transform to confirm the horizontal and vertical lines within the table position;
and confirming the position of each target cell through the intersection points of the horizontal and vertical lines. The content of each target cell is then sequentially output to the next stage, optical character recognition (OCR).
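For illustration, a minimal OpenCV sketch of this lined-table step follows; the Hough transform parameters and the pixel tolerances for deciding that a detected line is horizontal or vertical are assumptions.

    import cv2
    import numpy as np

    def detect_cell_grid(table_binary):
        # table_binary: binarized crop of the table region, with lines in white.
        lines = cv2.HoughLinesP(table_binary, 1, np.pi / 180, threshold=80,
                                minLineLength=40, maxLineGap=5)
        if lines is None:
            return []
        xs, ys = [], []
        for x1, y1, x2, y2 in lines.reshape(-1, 4):
            if abs(int(y1) - int(y2)) <= 3:      # near-horizontal line: row boundary
                ys.append((int(y1) + int(y2)) // 2)
            elif abs(int(x1) - int(x2)) <= 3:    # near-vertical line: column boundary
                xs.append((int(x1) + int(x2)) // 2)
        # Nearby duplicates should be merged in practice; set() is a shortcut here.
        xs, ys = sorted(set(xs)), sorted(set(ys))
        # A cell lies between consecutive vertical and horizontal boundaries.
        return [(x0, y0, x1 - x0, y1 - y0)
                for y0, y1 in zip(ys, ys[1:])
                for x0, x1 in zip(xs, xs[1:])]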
When the table type is the unlined type, the identifying the target cell at the table position by using the corresponding cell identification method includes:
applying dilation and erosion to the characters within the table position by using a morphological image processing (Morphological Image Processing) method, and determining each target cell according to the spacing between the processed characters. Generally, text whose spacing is smaller than a preset value is grouped into one cell; each cell position is determined accordingly, and each cell's content is output to the next stage, optical character recognition (OCR).
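A minimal OpenCV sketch of this unlined-table step follows; the structuring-element size, which stands in for the preset spacing threshold, is an assumption.

    import cv2

    def detect_unlined_cells(table_binary, gap=15):
        # table_binary: binarized crop of the table (black text on white).
        text = cv2.bitwise_not(table_binary)  # make text white for morphology
        kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (gap, 3))
        blobs = cv2.dilate(text, kernel)      # dilation merges glyphs closer than `gap`
        blobs = cv2.erode(blobs, kernel)      # erosion restores the blob extents
        contours, _ = cv2.findContours(blobs, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        cells = [cv2.boundingRect(c) for c in contours]
        return sorted(cells, key=lambda b: (b[1], b[0]))  # row-major (x, y, w, h)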
Preferably, before the target cell is obtained by applying the corresponding cell identification method to the table position, the method further comprises cropping the document to be processed according to the table position by using a filter;
applying the corresponding cell identification method to the table position then comprises applying it to the cropped document portion to obtain the target cell.
Example five
As shown in fig. 5, the fifth embodiment of the present application further discloses a data identifying apparatus for a table corresponding to the methods of embodiments 1 to 4, where the apparatus includes:
a table position and type identifying unit 11, configured to input a document to be processed into a preset neural network model for identification, and determine a table position and a table type in the document to be processed;
the table position generally refers to the top left vertex coordinates and length and width of the table on the image, and the table type can be divided according to requirements, such as a wired table and a wireless table. The incomplete corresponding lines are forced to be classified into a certain category by the neural network model according to the incomplete degree, and detailed description will be made later on in specific processing.
A target cell identification unit 12, configured to determine a corresponding cell identification method according to the determined table type, and identify a target cell at the table position by using the corresponding cell identification method;
a target character recognition unit 13, configured to perform optical character recognition on all the target cells to obtain target characters; the target character recognition unit can recognize target characters of all cells by using various mainstream optical character recognition engines.
And a recognition result unit 14, configured to associate the target character with the corresponding target cell, as a recognition result.
The neural network model achieves different accuracies on different images, so the accuracy of the recognition result needs to be checked to determine whether the model should be further updated and improved. The apparatus therefore further comprises:
the accuracy rate determining unit is used for determining the identification accuracy rate according to the identification result;
and the model updating unit is used for updating and training the neural network model and carrying out next data identification by using the neural network model after updating and training when the identification accuracy is lower than a preset value.
Updating and retraining the neural network model means adding the document to be processed to a sample set that is updated and stored in an image database, and performing incremental learning training and parameter adjustment on the neural network model by using the updated image database.
In an actual business scenario, the form may exist in various file formats, such as PDF, spreadsheet format, image format, and the like. The apparatus therefore further comprises a preprocessing unit for preprocessing the file to obtain a document to be processed in a standard format, the preprocessing unit being configured for:
acquiring a document to be processed;
when the document to be processed is in an image format, binarizing the document to be processed;
and carrying out alignment processing on the binarized image to obtain the document to be processed.
The preprocessing unit is specifically configured for:
performing contour analysis on the binarized image to determine vertex coordinates;
calculating a homography matrix according to the vertex coordinates;
and performing perspective transformation on the binarized image according to the homography matrix to obtain the document to be processed.
The document to be processed after alignment processing is obtained through binarization processing and homography perspective transformation, so that the accuracy of subsequent recognition is further improved.
To improve recognition accuracy, the target cell identification unit provides different cell identification methods for different table types. In this application the table types are divided into a lined type and an unlined type. It should be noted that "lined" and "unlined" here are the classification results of the neural network model: during identification, a table with incomplete lines is forcibly classified into one of these two categories according to the degree of incompleteness, and the unlined type does not require the complete absence of lines.
After the model outputs the form location and form type, the form at the form location is cropped by the filter. Then, the identification of the cells is performed:
the target cell identification unit is specifically used for carrying out line detection by utilizing Hough transformation to confirm the transverse and vertical lines in the table position when the table type is the line type; and confirming the position of each target cell through the intersection point of the transverse and vertical lines. And then sequentially outputting the contents of each target cell to a target character recognition unit of the next stage.
The target cell identification unit is further configured to, when the table type is the unlined type, apply dilation and erosion to the characters within the table position by using a morphological image processing (Morphological Image Processing) method, and determine each target cell according to the spacing between the processed characters. Generally, text whose spacing is smaller than a preset value is grouped into a single cell; each cell position is determined accordingly, and each cell's content is output to the target character recognition unit of the next stage.
Preferably, the apparatus further comprises a filter, which is used for cropping the document to be processed according to the table position before the target cell is obtained by applying the corresponding cell identification method to the table position.
Example six
Corresponding to the above method and apparatus, the sixth embodiment of the present application further provides a computer system, as shown in FIG. 6, which may include one or more processors and a memory, where the memory may store one or more application programs or data. The memory may be transient or persistent storage. An application program stored in the memory may include one or more modules (not shown), each of which may include a series of computer-executable instructions for the data recognition processing device. Further, the processor may be configured to communicate with the memory and execute the series of computer-executable instructions in the memory on the data recognition processing device. The data recognition processing device may also include one or more power supplies, one or more wired or wireless network interfaces, one or more input/output interfaces, and one or more keyboards.
In particular, in this embodiment, the data recognition processing device includes a memory and one or more programs, where the one or more programs are stored in the memory, each program may include one or more modules, and each module may include a series of computer-executable instructions for the data recognition processing device; execution of the one or more programs by the one or more processors includes computer-executable instructions for:
inputting a document to be processed into a preset neural network model for identification, and determining the form position and form type in the document to be processed;
determining a corresponding cell identification method according to the determined form type, and identifying the form position by utilizing the corresponding cell identification method to obtain a target cell;
performing optical character recognition on all the target cells to obtain target characters;
and associating the target character with the corresponding target cell to serve as a recognition result.
Further, the computer system configured to execute the one or more programs by the one or more processors includes computer-executable instructions for performing the methods described in embodiments one through four.
The foregoing describes only preferred embodiments of the present application and is not intended to limit the invention; any modification or equivalent substitution made within the spirit and principles of the present application shall fall within its scope of protection.

Claims (10)

1. A method of data identification of a form, the method comprising:
inputting a document to be processed into a preset neural network model for identification, and determining the form position and form type in the document to be processed;
determining a corresponding cell identification method according to the determined form type, and identifying the form position by utilizing the corresponding cell identification method to obtain a target cell;
performing optical character recognition on all the target cells to obtain target characters;
and associating the target character with the corresponding target cell to serve as a recognition result.
2. The data recognition method according to claim 1, wherein the document to be processed is obtained by:
acquiring a document to be processed;
when the document to be processed is in an image format, binarizing the document to be processed;
performing contour analysis on the binarized image to determine vertex coordinates;
calculating a homography matrix according to the vertex coordinates;
and performing perspective transformation on the binarized image according to the homography matrix to obtain the document to be processed.
3. The data recognition method of claim 1, wherein the neural network model is based on a Cascade R-CNN model fused with a region proposal network.
4. The data recognition method of claim 3, wherein the inputting the document to be processed into a predetermined neural network model for recognition, determining the form location and form type in the document to be processed comprises:
inputting the document to be processed into the region proposal network and the backbone network of the Cascade R-CNN model;
extracting candidate boxes of the document to be processed by using the region proposal network;
inputting the candidate boxes into the candidate regions of the Cascade R-CNN model;
and obtaining the table position and the table type of the document to be processed by using the candidate regions of the Cascade R-CNN model, a pooling layer connected with the backbone network, and a fully connected layer connected with the pooling layer.
5. The data recognition method according to claim 1, wherein when the table type is the lined type, the identifying the target cell at the table position by using the corresponding cell identification method comprises:
performing line detection by using the Hough transform to confirm the horizontal and vertical lines within the table position;
and confirming the position of each target cell through the intersection points of the horizontal and vertical lines.
6. The data identification method as claimed in claim 1, wherein when the table type is the unlined type, the identifying the target cell at the table position by using the corresponding cell identification method comprises:
applying dilation and erosion to the characters within the table position by using a morphological image processing method;
and determining each target cell according to the spacing between the processed characters.
7. The data identification method of claim 1, wherein the method further comprises:
determining the recognition accuracy according to the recognition result;
and when the recognition accuracy is lower than a preset value, updating and retraining the neural network model, and performing the next data identification by using the updated neural network model.
8. The data identification method of claim 7, wherein the method further comprises:
taking the document to be processed as a sample set, updating and storing the sample set in an image database;
the training of the neural network model comprises the following steps:
and performing incremental learning training and parameter adjustment on the neural network model by using the updated image database.
9. A data recognition apparatus of a form, the apparatus comprising:
the form position and type recognition unit is used for inputting the document to be processed into a preset neural network model for recognition and determining the form position and the form type in the document to be processed;
the target cell identification unit is used for determining a corresponding cell identification method according to the determined form type and identifying the position of the form by utilizing the corresponding cell identification method to obtain a target cell;
the target character recognition unit is used for carrying out optical character recognition on all the target cells to obtain target characters;
and the recognition result unit is used for associating the target character with the corresponding target cell as a recognition result.
10. A computer system, comprising:
one or more processors; and
a memory associated with the one or more processors, the memory storing program instructions that, when read and executed by the one or more processors, perform the method of any one of claims 1 to 8.
CN202310080299.5A | Filed: 2023-02-08 | Priority: 2023-02-08 | Title: Method, device and system for identifying structured data of table | Status: Pending | Publication: CN116311299A

Priority Applications (1)

CN202310080299.5A | Priority date: 2023-02-08 | Filing date: 2023-02-08 | Title: Method, device and system for identifying structured data of table

Applications Claiming Priority (1)

CN202310080299.5A | Priority date: 2023-02-08 | Filing date: 2023-02-08 | Title: Method, device and system for identifying structured data of table

Publications (1)

Publication number: CN116311299A | Publication date: 2023-06-23

Family

Family ID: 86833239

Family Applications (1)

CN202310080299.5A (Pending, published as CN116311299A) | Title: Method, device and system for identifying structured data of table | Priority date: 2023-02-08 | Filing date: 2023-02-08

Country Status (1)

CN: CN116311299A


Cited By (2)

* Cited by examiner, † Cited by third party

CN116628128A * | Priority date: 2023-07-13 | Publication date: 2023-08-22 | Assignee: Hunan Jiuli Supply Chain Co ltd | Title: Method, device and equipment for standardization of supply chain data and storage medium thereof
CN116628128B * | Priority date: 2023-07-13 | Publication date: 2023-10-03 | Assignee: Hunan Jiuli Supply Chain Co ltd | Title: Method, device and equipment for standardization of supply chain data and storage medium thereof


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination