CN117746437B

CN117746437B - Document data extraction system and method thereof

Info

Publication number: CN117746437B
Application number: CN202410187687.8A
Authority: CN
Inventors: 李哲洙; 李洪金
Original assignee: Shenyang Zhehang Information Technology Co ltd
Current assignee: Shenyang Zhehang Information Technology Co ltd
Priority date: 2024-02-20
Filing date: 2024-02-20
Publication date: 2024-04-30
Anticipated expiration: 2044-02-20
Also published as: CN117746437A

Abstract

The application relates to the technical field of data processing and provides a document data extraction system and a method thereof. The document data extraction system comprises an image acquisition unit for acquiring a document image to be extracted, a text detection unit for obtaining text areas in the document image and the corner coordinates of each text area, a corrosion expansion unit for acquiring corner coordinates of the document cells in the document image, a document data extraction unit for determining document data corresponding to the document image, and a data management console; the data management console is respectively connected with the image acquisition unit, the text detection unit, the corrosion expansion unit and the document data extraction unit and is used for controlling and managing all the units. The embodiment of the application overcomes the defects of the document data extraction methods in traditional schemes, namely poor compatibility, inability to handle documents with changeable forms, and poor extraction effect. It realizes the extraction of various types of document data, improves extraction accuracy and extraction efficiency, is easy to implement and deploy, and has strong practicability and good compatibility.

Description

Document data extraction system and method thereof

Technical Field

The application relates to the technical field of data processing, in particular to a document data extraction system and a document data extraction method.

Background

Document processing is a common task in daily life. However, most documents exist in non-editable forms, such as pictures and scanned files, so document data is difficult to extract and document information is difficult to digitize.

At present, extracting documents from non-editable files, entering the information and proofreading it are done manually, which inevitably consumes a great deal of time and effort and carries considerable operational risk. Automatic document data extraction methods have emerged in response, but current methods cannot handle documents with varied structures and forms, have poor compatibility, cannot extract various types of documents, and give poor extraction results.

Disclosure of Invention

The application provides a document data extraction system and a document data extraction method, which are used for solving the defects of document data extraction methods in the prior art, namely poor compatibility, inability to adapt to documents with changeable forms, and poor extraction effect. By removing the restriction on document types, the application realizes the extraction of various types of document data while ensuring the extraction effect.

In a first aspect, the application provides a document data extraction system, which comprises an image acquisition unit, a text detection unit, a corrosion expansion unit, a document data extraction unit and a data management console; the data management console is respectively connected with the image acquisition unit, the text detection unit, the corrosion expansion unit and the document data extraction unit and is used for controlling and managing all the units;

the image acquisition unit is used for: acquiring a document image to be extracted;

the text detection unit is used for: performing text detection on the document image to obtain text areas in the document image and corner coordinates of each text area;

the corrosion expansion unit is used for: generating a mask image of the document image based on the corner coordinates of each text area, and performing corrosion expansion (that is, morphological erosion and dilation) on the mask image to obtain the corner coordinates of the document cells in the document image;

the document data extraction unit is used for: determining document data corresponding to the document image based on the corner coordinates of each text area, the corner coordinates of each document cell and the text content of each text area.

In a second aspect, the present application provides a document data extraction method, including:

acquiring a document image to be extracted;

performing text detection on the document image to obtain text areas in the document image and corner coordinates of each text area;

generating a mask image of the document image based on the corner coordinates of each text area, and corroding and expanding the mask image to obtain the corner coordinates of the document cells in the document image;

determining document data corresponding to the document image based on the corner coordinates of each text area, the corner coordinates of each document cell and the text content of each text area.

In a third aspect, the present application also provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the document data extraction method according to any one of the above second aspects when executing the program.

In a fourth aspect, the present application also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a document data extraction method as in any of the above second aspects.

In a fifth aspect, the present application also provides a computer product having stored thereon a computer program which, when executed by a processor, implements a document data extraction method as in any of the above second aspects.

According to the document data extraction system and the document data extraction method provided by the application, a mask image is generated from the corner coordinates of the text areas in the document image, and the mask image is corroded and expanded to obtain the corner coordinates of each document cell. This breaks the limitation that document types place on document data extraction and realizes document cell extraction for various documents. On this basis, the corner coordinates of each text area are combined and the text content of each text area is backfilled to obtain the document data corresponding to the document image. The defects of the traditional schemes (poor compatibility, inability to handle documents with changeable forms, and poor extraction effect) are thereby overcome: various types of document data can be extracted, extraction accuracy and extraction efficiency are improved, and the scheme is easy to implement and deploy, with strong practicability and good compatibility.

Drawings

In order to more clearly illustrate the application or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the application, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic diagram of a document data extraction system provided by the present application;

FIG. 2 is a flow chart of a document data extraction method provided by the application;

FIG. 3 is an exemplary diagram of a document image provided by the present application;

FIG. 4 is an exemplary diagram of document data corresponding to a document image provided by the present application;

FIG. 5 is a general flow chart of a document data extraction method provided by the present application;

FIG. 6 is a schematic structural diagram of an electronic device provided by the present application.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present application, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

In recent years, neural networks and deep learning have made major breakthroughs and are widely applied in many fields, which provides a theoretical guarantee for putting neural networks into practice. Even so, extracting documents from non-editable files remains a complex problem, mainly because documents vary widely in structure and form. A document data extraction method must therefore handle such varied documents to ensure the extraction effect; current extraction methods struggle to do so, which results in poor adaptability and poor results.

In short, in practical use the frame lines of documents are diverse, and frame lines are an important feature of a document. Different frame-line styles (such as full frame lines, half frame lines and no frame lines) therefore lead to documents with varied structures and forms. Such documents place high demands on the compatibility of an extraction method, which current methods find difficult to meet.

In this regard, the application provides a document data extraction system. The system extracts the corner coordinates of each document cell from the corner coordinates of the text areas in the document image, which breaks the restriction of document type on document data extraction and realizes document cell extraction for all kinds of documents. On this basis, text content is backfilled to obtain the document data corresponding to the document image, so that all kinds of document data can be extracted, extraction accuracy and extraction efficiency are improved, and the system has strong practicability and good compatibility. FIG. 1 is a schematic structural diagram of the document data extraction system provided by the application. As shown in FIG. 1, the document data extraction system includes an image acquisition unit, a text detection unit, a corrosion expansion unit, a document data extraction unit and a data management console; the data management console is respectively connected with the image acquisition unit, the text detection unit, the corrosion expansion unit and the document data extraction unit and is used for controlling and managing all the units.

Optionally, the image acquisition unit acquires a document image to be extracted, and transmits the document image to be extracted to the data management console.

Optionally, the text detection unit may acquire the document image to be extracted from the data management console, perform text detection on the document image to obtain the text regions in the document image and the corner coordinates of each text region, and transmit the corner coordinates of each text region to the data management console.

Optionally, the corrosion expansion unit may acquire the corner coordinates of each text region from the data management console, generate a mask image of the document image according to the corner coordinates of each text region, perform corrosion expansion on the mask image to obtain the corner coordinates of the document cells in the document image, and transmit the corner coordinates of the document cells to the data management console.

Optionally, the document data extraction unit may acquire the corner coordinates of each text region, the corner coordinates of each document cell and the text content of each text region from the data management console, and determine the document data corresponding to the document image based on them.
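The data flow described above, in which each unit exchanges data only through the data management console, can be illustrated with a minimal sketch. All class, function and key names below are hypothetical, the units are stubs that store placeholder values, and the text does not specify which unit performs text recognition, so the sketch stores the recognized text alongside the detection result as an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple

Unit = Callable[[Any], None]  # a unit receives the console and reads/writes its store


@dataclass
class DataManagementConsole:
    """Hypothetical shared store that controls and manages all the units."""
    store: Dict[str, Any] = field(default_factory=dict)
    units: List[Tuple[str, Unit]] = field(default_factory=list)

    def register(self, name: str, unit: Unit) -> None:
        self.units.append((name, unit))

    def run(self) -> Any:
        # The console drives the units in order; units never call each other.
        for _name, unit in self.units:
            unit(self)
        return self.store.get("document_data")


def image_acquisition_unit(console: DataManagementConsole) -> None:
    console.store["image"] = "page-1"  # placeholder for real pixel data


def text_detection_unit(console: DataManagementConsole) -> None:
    _image = console.store["image"]
    # Placeholder result: one text region given by its four corners.
    console.store["text_boxes"] = [[(0, 0), (10, 0), (10, 5), (0, 5)]]
    console.store["texts"] = ["Total"]  # assumption: recognition result stored here


def corrosion_expansion_unit(console: DataManagementConsole) -> None:
    # Stand-in: a real unit would build a mask and erode/dilate it.
    console.store["cell_boxes"] = list(console.store["text_boxes"])


def document_data_extraction_unit(console: DataManagementConsole) -> None:
    cells = console.store["cell_boxes"]
    texts = console.store["texts"]
    console.store["document_data"] = [
        {"cell": cell, "text": text} for cell, text in zip(cells, texts)
    ]


def build_console() -> DataManagementConsole:
    console = DataManagementConsole()
    console.register("image_acquisition", image_acquisition_unit)
    console.register("text_detection", text_detection_unit)
    console.register("corrosion_expansion", corrosion_expansion_unit)
    console.register("document_data_extraction", document_data_extraction_unit)
    return console
```

Calling `build_console().run()` executes the four stub units in order and returns whatever the last unit stored under `document_data`.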

According to the document data extraction system provided by the application, a mask image is generated from the corner coordinates of the text areas in the document image, and the mask image is corroded and expanded to obtain the corner coordinates of each document cell. This breaks the limitation that document types place on document data extraction and realizes document cell extraction for various documents. On this basis, the corner coordinates of each text area are combined and the text content of each text area is backfilled to obtain the document data corresponding to the document image. The defects of the traditional schemes (poor compatibility, inability to handle documents with changeable forms, and poor extraction effect) are thereby overcome: various types of document data can be extracted, extraction accuracy and extraction efficiency are improved, and the system is easy to implement and deploy, with strong practicability and good compatibility.

Further, the application provides a document data extraction method. The method extracts the corner coordinates of each document cell from the corner coordinates of the text regions in the document image, which breaks the limitation of document type on document data extraction and realizes document cell extraction for various documents. On this basis, text content is backfilled to obtain the document data corresponding to the document image, so that various types of document data can be extracted, extraction accuracy and extraction efficiency are improved, and the method has strong practicability and good compatibility. FIG. 2 is a schematic flow chart of the document data extraction method provided by the application. As shown in FIG. 2, the execution subject of the method is the document data extraction system, and the method comprises the following steps:

Step 110, obtaining a document image to be extracted;

Step 120, performing text detection on the document image to obtain text areas in the document image and corner coordinates of each text area.

Specifically, before document data is extracted, the document image to be extracted must first be acquired; the document image contains the extraction object, that is, the document to be extracted. To keep document data extraction targeted and accurate and to avoid errors, garbled text and the like, in the embodiment of the application each document image contains only a single document and contains no text information outside the document.

The document image to be extracted may be an image in a document form, that is, a document image, an image obtained by splitting/separating from a scanned file, or a picture, which is not particularly limited in the embodiment of the present application.

It can be understood that, in the actual document data extraction process, after the user uploads a file containing the extraction object, the document data extraction system receives the file, analyzes its file information and determines the file type. If the file is an image, it is taken as the document image for the subsequent document data extraction process. Otherwise, if the uploaded file is not a picture but a document, the file can be parsed to obtain the document image: a scanned file (such as a PDF (Portable Document Format) file) can be split into single-page images, and the images containing the extraction object are selected from the single-page images to obtain the document image.

Since a file often contains more than one document, and the documents are not necessarily located together, there may be multiple document images to be extracted. In that case, the document images are processed one by one to extract the document in each image, thereby completing the document data extraction for the whole file.
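As a minimal sketch of the file-type branch described above, the following assumes that the file type is decided from the leading bytes of the upload (PDF files begin with `%PDF-`, PNG and JPEG files with their standard signatures). The function names, the `split_pdf_pages` callable that renders a PDF into single-page images, and the `contains_document` predicate are hypothetical and stand in for components the text does not specify.

```python
from typing import Callable, Iterable, List

PDF_MAGIC = b"%PDF-"
IMAGE_MAGICS = {
    "png": b"\x89PNG\r\n\x1a\n",
    "jpeg": b"\xff\xd8\xff",
}


def classify_upload(data: bytes) -> str:
    """Return 'pdf', 'image' or 'unknown' based on the leading magic bytes."""
    if data.startswith(PDF_MAGIC):
        return "pdf"
    if any(data.startswith(magic) for magic in IMAGE_MAGICS.values()):
        return "image"
    return "unknown"


def to_document_images(
    data: bytes,
    split_pdf_pages: Callable[[bytes], Iterable[bytes]],
    contains_document: Callable[[bytes], bool] = lambda page: True,
) -> List[bytes]:
    """Turn an upload into the list of document images to be extracted.

    split_pdf_pages and contains_document are hypothetical hooks: the first
    cuts a scanned file into single-page images, the second selects the
    pages that actually contain an extraction object.
    """
    kind = classify_upload(data)
    if kind == "image":
        return [data]  # the upload itself is the document image
    if kind == "pdf":
        return [page for page in split_pdf_pages(data) if contains_document(page)]
    raise ValueError("unsupported upload type")
```

A caller would pass its own page renderer as `split_pdf_pages`; the returned images are then processed one by one.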

The document image may be acquired by a user through an image acquisition device, may be downloaded by the user from the internet, or may be received through network communication, or may be acquired by the user from a history file, for example, may search/find a financial report from a report file of an enterprise in the past year, and separate or acquire a document image from the financial report.

After the document image is obtained, text detection can be performed on the document image to locate text areas in the document image, and corner coordinates of each text area are determined, namely, the area where the text in the document image is located can be determined through the text detection, the text areas are segmented, and accordingly corner positions of the text areas are determined, namely, the positions of four corner points of each text area in the document image are determined, so that the coordinates of the four corner points of the text area are obtained.

A text detection model may be used to perform text detection on the document image, so as to obtain the text regions in the document image and determine the corner coordinates of each text region. That is, the document image is input into the text detection model, which performs text detection on it and outputs each text region; the corner coordinates of each text region are then determined. The text detection model may use the DB (Differentiable Binarization) text detection algorithm, which directly locates the regions where text is found and segments the text regions; the coordinates of the four corners of each text region, that is, the corner coordinates of each text region, can then be determined through OpenCV.
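To make the notion of corner coordinates concrete, the following minimal sketch reduces the outline of a detected text region to four corner points. It assumes that the detector returns each region as a list of (x, y) points and that an axis-aligned rectangle is an acceptable approximation; the text itself determines the corners through OpenCV, so this pure-Python helper, including its name and corner order, is only an illustrative stand-in.

```python
from typing import List, Sequence, Tuple

Point = Tuple[int, int]


def corner_coordinates(polygon: Sequence[Point]) -> List[Point]:
    """Return the four corners of the axis-aligned box around a text region.

    Corners are ordered top-left, top-right, bottom-right, bottom-left,
    with the image origin at the top-left and y growing downwards.
    """
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    left, right = min(xs), max(xs)
    top, bottom = min(ys), max(ys)
    return [(left, top), (right, top), (right, bottom), (left, bottom)]
```

For a slightly skewed region such as `[(12, 8), (40, 9), (41, 20), (11, 19)]`, the helper returns the enclosing upright rectangle.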

Step 130, generating a mask image of the document image based on the corner coordinates of each text region, and corroding and expanding the mask image to obtain the corner coordinates of the document cells in the document image;

Step 140, determining document data corresponding to the document image based on the corner coordinates of each text area, the corner coordinates of each document cell and the text content of each text area.

Specifically, after the corner coordinates of each text region are obtained in step 120, the position of the document cell can be determined accordingly, so as to obtain the corner coordinates of each document cell, and on the basis, the corner coordinates of each text region and the text content of each text region are combined to determine the extracted document, namely, the document corresponding to the document to be extracted in the document image.

It can be understood that after the corner coordinates of each text region are obtained, the corner coordinates of each document cell in the document image can be determined according to the corner coordinates, that is, a MASK image corresponding to the document image can be generated by taking the corner coordinates of each text region in the document image as a reference, and then the document separation line can be extracted by using a corrosion and expansion method on the basis, so as to determine the document structure, thereby obtaining the corner coordinates of each document cell in the document image.

Specifically, the frame lines that divide a document, and in particular the intersections of those frame lines, contain no text. Based on this characteristic, the embodiment of the application uses the corner positions of each text area to determine the positions without text content, connects those positions into dividing frame lines, and determines the intersections of the dividing frame lines, thereby obtaining the corner coordinates of each document cell.

Specifically, a mask map of the document image is generated according to the corner coordinates of each text region, that is, the corner positions of each text region in the document image can be used to generate a mask map of the document size, so that the frame line information of the document in the document image can be determined according to the mask map, and the corner coordinates of each document cell in the document can be extracted.

That is, after obtaining the mask image of the document image, in the embodiment of the application, the mask image may be processed by adopting a corrosion expansion method to obtain the corner coordinates of each document cell, specifically, the mask image may be subjected to corrosion expansion to determine the frame line of the document in the mask image, thereby obtaining the frame line information of the document in the document image, and then the intersection point of the frame line of the document and the coordinates at the intersection point may be determined according to the frame line information, thereby obtaining the corner coordinates of each document cell.

It is noted that, in the embodiment of the application, the document structure is analyzed with a method based on image morphology: a mask image is generated from the corner coordinates of each text region, and the mask image is then corroded and expanded to obtain the corner coordinates of each document cell. This process depends on the positions of the text regions in the document image and is independent of the document type. However varied the frame lines of a document may be, the process can analyze the document structure, obtain the frame line information and determine the corner coordinates of each document cell. Structural analysis and document cell extraction are thus realized for various documents, the dependence on the document contour is removed, and the limitation of document type on document data extraction is broken. The process has strong adaptability and compatibility and improves the accuracy and feasibility of document data extraction.

Further, after the corner coordinates of each document cell are extracted, the corner coordinates of each text region and the text content can be extracted to obtain a document in a document image, namely, the text content of each text region can be determined firstly, then the matching relationship between each text region and each document cell region is determined according to the corner coordinates of each text region and the corner coordinates of each document cell, so that the text content can be filled into the corresponding document cell according to the matching relationship, and finally the document to be extracted is generated.

Specifically, the text content of each text region is obtained first by performing text recognition on each text region, for example with a text recognition model.

The text recognition model may be an existing trained model. Alternatively, an initial model may be built in advance and trained with samples and labels to obtain the trained text recognition model, where the initial model may be built on the basis of a convolutional recurrent neural network (Convolutional Recurrent Neural Network, CRNN).

Then, the document data corresponding to the document image is determined according to the recognized text content of each text region, the corner coordinates of each text region and the corner coordinates of each document cell. This takes into account that every document cell in the document corresponds to text content: even if some document cells contain no text, the corresponding regions still have text content, and that text content is simply empty.

In the embodiment of the application, a top-down approach is adopted when backfilling the text content of the document. The correspondence between the corner coordinates of each document cell and the corner coordinates of each text area is determined by matching, and the text is backfilled according to this correspondence to obtain the document data corresponding to the document image. This ensures that text content is placed accurately and improves the efficiency of text backfilling, with the gain being more pronounced on larger documents. In addition, the approach occupies few computing resources, is easy to operate and is highly practical.

According to the document data extraction method provided by the application, a mask image is generated from the corner coordinates of the text areas in the document image, and the mask image is corroded and expanded to obtain the corner coordinates of each document cell. This breaks the limitation that document types place on document data extraction and realizes document cell extraction for various documents. On this basis, the corner coordinates of each text area are combined and the text content of each text area is backfilled to obtain the document data corresponding to the document image. The defects of the traditional schemes (poor compatibility, inability to handle documents with changeable forms, and poor extraction effect) are thereby overcome: various types of document data can be extracted, extraction accuracy and extraction efficiency are improved, and the method is easy to implement and deploy, with strong practicability and good compatibility.
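The backfilling step can be sketched as follows, under several assumptions that are not stated in the text: "top-down" is read as filling rows from the top of the document to the bottom, each cell is given as four corners in the order top-left, top-right, bottom-right, bottom-left, cells in one row share the same top edge, and the correspondence is supplied as a hypothetical mapping from cell index to the recognized texts matched to that cell. Cells without a match are filled with an empty string.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[int, int]
Box = Sequence[Point]  # assumed order: top-left, top-right, bottom-right, bottom-left


def backfill(cells: Sequence[Box], texts_by_cell: Dict[int, List[str]]) -> List[List[str]]:
    """Place text into cells row by row, from top to bottom and left to right."""
    # Sort cell indices by the top-left corner: first by y (row), then by x (column).
    order = sorted(range(len(cells)), key=lambda i: (cells[i][0][1], cells[i][0][0]))
    table: List[List[str]] = []
    current_top = None
    for i in order:
        top = cells[i][0][1]
        if top != current_top:  # a new top edge starts a new row
            table.append([])
            current_top = top
        # A cell with no matched text still gets content: the empty string.
        table[-1].append(" ".join(texts_by_cell.get(i, [])))
    return table
```

With four cells arranged two by two and texts matched to only three of them, the result is a 2 x 2 table whose last entry is an empty string.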

Based on the above embodiment, step 130 includes:

generating a single-channel image based on the corner coordinates of the document frame in the document image;

Determining a mask map of the document image based on the corner coordinates of each text region and the single-channel image;

determining a target size based on the image size of the mask map;

and corroding and expanding the mask pattern through a convolution kernel of the target size to obtain the corner coordinates of the document cells in the document image.

Specifically, in step 130, a mask map of the document image is generated according to the corner coordinates of each text region, and the mask map is corroded and expanded to obtain the corner coordinates of the document cells in the document image, which is essentially an image morphology method, and the position information of each document cell is found through the position information (corner coordinates) of each text region, which specifically includes:

First, a mask map corresponding to the document image may be generated according to the position information of each text region, where the region corresponding to each text region in the mask map is black, that is, the pixel value is 0, and the region corresponding to each non-text region is white, that is, the pixel value is 255.

Specifically, a single-channel image is generated according to the corner coordinates of the document frame in the document image. That is, with reference to the corner coordinates of the document frame, an image of the document's size with every pixel value set to 255 is generated, giving a pure white single-channel image of the same size as the document;

The pure white single-channel image is then processed so that the pixel values of the regions delimited by the corner coordinates of each text region are set to zero. That is, the region corresponding to each text region is located in the single-channel image through its corner coordinates, and the pixel values of that region are zeroed to fill it with black, thereby obtaining the mask image.
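The mask construction described above can be sketched in a few lines. The sketch represents the single-channel image as a list of rows in pure Python so that it has no dependencies; an actual implementation would more likely use an image array. The function name is hypothetical, and treating each text region as the axis-aligned rectangle spanned by its corners is an assumption.

```python
from typing import List, Sequence, Tuple

Point = Tuple[int, int]
WHITE, BLACK = 255, 0


def make_mask(width: int, height: int, text_boxes: Sequence[Sequence[Point]]) -> List[List[int]]:
    """Build a white single-channel image and blacken every text region."""
    # Pure white single-channel image of the document's size.
    mask = [[WHITE] * width for _ in range(height)]
    for box in text_boxes:
        xs = [x for x, _ in box]
        ys = [y for _, y in box]
        # Clip the region to the image so out-of-range corners are harmless.
        x0, x1 = max(0, min(xs)), min(width - 1, max(xs))
        y0, y1 = max(0, min(ys)), min(height - 1, max(ys))
        for y in range(y0, y1 + 1):
            for x in range(x0, x1 + 1):
                mask[y][x] = BLACK  # zero the pixels covered by text
    return mask
```

In the resulting mask, white pixels mark positions without text, which is the information the subsequent corrosion expansion step relies on.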

Then, the mask map can be corroded and expanded to obtain frame line information of the document in the document image, and then the intersection point of the document frame line and the coordinates of the intersection point can be determined according to the frame line information, so that the corner point coordinates of each document cell are obtained.

After the mask image is obtained, it can be corroded (eroded) with a convolution kernel, so that the frame lines of the document are determined in the mask image and the frame line information of the document in the document image is obtained.

Specifically, the size of the convolution kernel used for the corrosion operation is determined first. To ensure the accuracy of the resulting document frame lines, in the embodiment of the application the corrosion operation uses a convolution kernel matched to the size of the document. Since the size of the mask image is consistent with the size of the document, the kernel size can be determined from the size of the mask image; that is, the size of the mask image is taken as the target size, and the mask image is corroded with a convolution kernel of the target size. Specifically, the mask image is corroded with a long convolution kernel whose length equals the length of the mask image and with a wide convolution kernel whose length equals the width of the mask image, so as to obtain the horizontal frame lines and the vertical frame lines of the document. The corner positions (corner coordinates) of each document cell, that is, the coordinates of the four corners of each document cell, are then determined on this basis using OpenCV.

For example, when the mask image has a length W and a width H, the mask image may be corroded with a convolution kernel of length W to obtain the horizontal frame lines of the document, and corroded with a convolution kernel of length H to obtain the vertical frame lines of the document. Overlapping the horizontal frame lines and the vertical frame lines gives the entire frame line of the document, that is, the frame line information of the document.

According to the embodiment of the application, the corner coordinates of each document cell are extracted from the corner coordinates of each text region with a method based on image morphology. Document cell extraction is thus realized for various documents, the limitation of document type on document cell extraction is broken, and the dependence on the document contour is removed. The position of a document cell depends only on the positions of the text regions and is independent of the document type. However varied the frame lines of a document may be, the document structure can be analyzed, the frame line information extracted and the corner coordinates of each document cell determined. The method therefore has strong adaptability and compatibility and improves the accuracy and feasibility of document data extraction.
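The example with kernels of length W and H can be sketched as follows. The sketch rests on a simplifying assumption: a 1 x W kernel fits inside the white area only where an entire row is white, so erosion followed by dilation with that kernel keeps exactly the fully white rows (and likewise the fully white columns for an H x 1 kernel). Border handling in a real image library may differ from this idealization. The function names are hypothetical, the mask is a list of rows with 255 for white and 0 for black, and the cell construction assumes a regular grid with blank margins and no merged cells.

```python
from typing import List, Sequence, Tuple

Point = Tuple[int, int]
WHITE, BLACK = 255, 0


def horizontal_lines(mask: Sequence[Sequence[int]]) -> List[int]:
    """Rows kept by a 1 x W kernel: rows that contain no text at all."""
    return [y for y, row in enumerate(mask) if all(p == WHITE for p in row)]


def vertical_lines(mask: Sequence[Sequence[int]]) -> List[int]:
    """Columns kept by an H x 1 kernel: columns that contain no text at all."""
    height, width = len(mask), len(mask[0])
    return [x for x in range(width) if all(mask[y][x] == WHITE for y in range(height))]


def frame_line_image(mask: Sequence[Sequence[int]]) -> List[List[int]]:
    """Overlap horizontal and vertical frame lines into one frame-line image."""
    rows, cols = set(horizontal_lines(mask)), set(vertical_lines(mask))
    return [
        [WHITE if (y in rows or x in cols) else BLACK for x in range(len(mask[0]))]
        for y in range(len(mask))
    ]


def group_runs(indices: Sequence[int]) -> List[Tuple[int, int]]:
    """Collapse consecutive indices into (start, end) bands."""
    bands: List[Tuple[int, int]] = []
    for i in indices:
        if bands and i == bands[-1][1] + 1:
            bands[-1] = (bands[-1][0], i)
        else:
            bands.append((i, i))
    return bands


def cell_corners(mask: Sequence[Sequence[int]]) -> List[List[Point]]:
    """Intersect band centers to get four corners per cell, row by row."""
    ys = [(a + b) // 2 for a, b in group_runs(horizontal_lines(mask))]
    xs = [(a + b) // 2 for a, b in group_runs(vertical_lines(mask))]
    cells = []
    for top, bottom in zip(ys, ys[1:]):
        for left, right in zip(xs, xs[1:]):
            cells.append([(left, top), (right, top), (right, bottom), (left, bottom)])
    return cells
```

Each blank band between text rows or columns is collapsed to its center line, and adjacent center lines delimit one cell. In an OpenCV-based implementation the same steps would typically be expressed with `cv2.erode` and `cv2.dilate` using rectangular kernels, which is an assumption about tooling rather than something the sketch depends on.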

Based on the above embodiment, step 140 includes:

Matching each text region with each document cell based on the corner coordinates of each text region and the corner coordinates of each document cell to obtain the corresponding relation between each text region and each document cell;

And based on the corresponding relation and the text content of each text region, text content is placed into each document cell, and document data corresponding to the document image is obtained.

Specifically, in step 140, the process of determining the document data corresponding to the document image according to the corner coordinates of each document cell, the corner coordinates of each text region, and the text content includes:

It can be understood that, after the document structure has been analyzed by the image morphology method to obtain the corner coordinates of each document cell, these can be combined with the corner coordinates and text content of each text region to extract the document in the document image.

Specifically, the text content of each text region is determined first; that is, text recognition is performed on each text region to obtain its text content. The recognition can be carried out by a deep-learning text recognition model. Next, the matching relationship between each text region and each document cell is determined from the corner coordinates of the text regions and the corner coordinates of the document cells. The text content is then filled into the corresponding document cells according to this matching relationship, finally generating the document to be extracted.

That is, each text region is matched against each document cell with reference to their corner coordinates, so as to determine the document cell corresponding to each text region. In other words, the correspondence between text regions and document cells is obtained through region matching based on corner coordinates.

The correspondence makes clear which document cell each text region belongs to; in other words, it determines the document cell into which the text content of each text region is to be filled, so that the subsequent text backfilling can proceed in an orderly manner.

Then, the document data corresponding to the document image is determined from the correspondence and the text content of each text region; that is, the text content is placed into the document cells according to the correspondence, yielding the document data corresponding to the document image.

Based on the above embodiment, the correspondence between any text region and document cell is determined based on the following steps:

determining the area of an overlapping area between the text area and each document cell based on the corner coordinates of the text area and the corner coordinates of each document cell;

And screening an overlapping region with the largest area from the overlapping regions as a target overlapping region, and taking a document cell corresponding to the target overlapping region as a document cell corresponding to the text region to obtain the corresponding relation between the text region and the document cell.

Specifically, the process of determining the correspondence between any text region and a document cell includes:

In the embodiment of the application, an intersection method is used: matching follows the principle of maximum intersection to obtain the correspondence between each text region and each document cell.

Based on this, in the embodiment of the present application, the corner coordinates of the text region and the corner coordinates of each document cell are used to compute the overlap, so as to determine whether the text region overlaps each document cell. If an overlapping region exists, its area is calculated; if not, the overlap area is recorded as 0. This gives the area of the overlapping region between the text region and each document cell.

Then, following the principle of maximum intersection, the overlapping region with the largest area is selected as the target overlapping region, and the document cell corresponding to it is taken as the document cell corresponding to the text region. In short, the document cell with the largest intersection is chosen as the cell into which the text content of the text region is to be filled. After the correspondence between each text region and each document cell has been obtained, text backfilling is carried out from top to bottom according to the correspondence, filling each piece of text content into its document cell and thereby obtaining the document data corresponding to the document image.
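The maximum-intersection matching and top-down backfilling described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patent's code: regions and cells are simplified to axis-aligned boxes `(x1, y1, x2, y2)` rather than four separate corner points, and the function names and sample data are hypothetical.

```python
def overlap_area(a, b):
    """Area of the intersection of two axis-aligned boxes (x1, y1, x2, y2)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    if width <= 0 or height <= 0:
        return 0  # no overlap: the area is recorded as 0
    return width * height


def match_regions_to_cells(text_boxes, cell_boxes):
    """Map each text-region index to the cell index with the largest overlap.

    A region that overlaps no cell is mapped to None.
    """
    matches = {}
    for t_idx, t_box in enumerate(text_boxes):
        areas = [overlap_area(t_box, c_box) for c_box in cell_boxes]
        best_area = max(areas, default=0)
        matches[t_idx] = areas.index(best_area) if best_area > 0 else None
    return matches


def backfill(text_boxes, texts, cell_boxes):
    """Fill recognized text into cells, visiting regions top to bottom."""
    matches = match_regions_to_cells(text_boxes, cell_boxes)
    cells = [[] for _ in cell_boxes]
    # Sort by top edge, then left edge, to backfill from top to bottom.
    order = sorted(range(len(text_boxes)),
                   key=lambda i: (text_boxes[i][1], text_boxes[i][0]))
    for i in order:
        if matches[i] is not None:
            cells[matches[i]].append(texts[i])
    return [" ".join(parts) for parts in cells]


# Hypothetical sample: three cells and four detected text regions.
cell_boxes = [(0, 0, 100, 50), (100, 0, 200, 50), (0, 50, 100, 100)]
text_boxes = [(10, 10, 60, 40), (90, 12, 180, 40),
              (5, 60, 80, 90), (300, 300, 310, 310)]
texts = ["Name", "Amount", "Alice", "stray"]
filled = backfill(text_boxes, texts, cell_boxes)
```

The second text region straddles two cells and is assigned to the one it overlaps more; the last region overlaps no cell and is left unassigned.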

In the embodiment of the application, the text content of the document is backfilled from top to bottom according to the maximum intersection principle: the document cell corresponding to each text region is determined and the text is filled in accordingly to obtain the document data corresponding to the document image. This not only ensures that text content is placed accurately but also greatly improves backfilling efficiency, and the gain is more pronounced on larger documents. In addition, this backfilling approach is fast, uses few resources, and is easy to operate and highly practical.

Based on the above embodiment, step 110 includes:

Acquiring an initial document image;

performing seal detection on the initial document image to obtain a seal area and seal color in the initial document image;

and carrying out channel filtration on a seal area in the initial document image based on seal color to obtain the document image.

Specifically, in step 110, the process of obtaining the document image to be extracted specifically includes:

Most documents, especially those containing reports, are stamped with a company seal to certify the validity of the information, so most documents carry a bright red or blue seal. The seal interferes with both the analysis of the document structure and text recognition, reducing the accuracy of the recognized text content.

In view of this, in the embodiment of the present application, an initial document image is acquired first when obtaining the document image to be extracted. The initial document image can be understood as an unprocessed document image: either a document image uploaded directly by the user and received by the document data extraction system, or an image containing a document that has been parsed and separated directly from a scanned file uploaded by the user.

Then, seal erasure can be performed on the initial document image to fade or remove the seal, yielding a document image with the seal faded or erased. To this end, seal detection is first performed on the initial document image to detect the area where the seal is located and the color of the seal, giving the detection results, namely the seal area and the seal color.

Specifically, a target detection model may be used to detect the position and color of the seal in the initial document image. The model outputs the seal color (red or blue) and the seal position (the upper left and lower right corner coordinates of the seal), from which the seal area is determined. The target detection model may be a general high-performance detection model, for example, the open source model yolov.

Channel filtering is then performed on the seal area to erase or fade the seal, yielding the document image after seal erasure, namely the document image to be extracted. In other words, the color of the seal is filtered out by a channel filtering method so as to erase or fade the seal.

Specifically, color filtering is applied to the seal area (an image can be represented by the three channels R, G and B). If the seal is red, the G and B channels are filtered out and only the R channel is kept; correspondingly, if the seal is blue, the R and G channels are filtered out and only the B channel is kept. This yields a filtered single-channel region image corresponding to the seal area. The single-channel region image is then restored to three channels (R, G, B) and used to replace the original seal area. That is, the seal area in the initial document image is backfilled with the region image that has been channel-filtered and restored to three channels, which gives the document image to be extracted.
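A minimal sketch of this channel filtering follows, assuming the image is held as rows of `(R, G, B)` tuples and the seal area is an axis-aligned box with exclusive lower-right bounds. The function name `fade_seal`, the data layout, and the sample pixel values are illustrative assumptions, not part of the original scheme.

```python
def fade_seal(image, box, seal_color):
    """Fade a red or blue seal inside `box` by keeping one color channel.

    image: list of rows, each a list of (R, G, B) tuples.
    box: (x1, y1, x2, y2) with x2 and y2 exclusive (assumed convention).
    seal_color: "red" keeps the R channel, "blue" keeps the B channel.
    """
    channel = {"red": 0, "blue": 2}[seal_color]
    x1, y1, x2, y2 = box
    result = [list(row) for row in image]  # copy; pixels outside the box stay as-is
    for y in range(y1, y2):
        for x in range(x1, x2):
            value = image[y][x][channel]
            # Restore the single kept channel to three identical channels.
            result[y][x] = (value, value, value)
    return result


# Hypothetical 2 x 3 image: white paper, red seal ink, dark text.
sample = [
    [(255, 255, 255), (230, 30, 40), (10, 10, 10)],
    [(230, 30, 40), (0, 0, 255), (255, 255, 255)],
]
cleaned = fade_seal(sample, (0, 0, 3, 1), "red")
```

Red ink has a high R value, so after filtering it becomes nearly as bright as the paper, while dark text, which is low in every channel, stays dark.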

After seal erasure, the seal in the document image is essentially erased or markedly faded, which minimizes its influence on the subsequent document structure analysis and text recognition.

In the embodiment of the application, practical deployment is taken into account: any seal carried in the initial document image is erased or faded, and document data extraction is performed on the resulting document image. This avoids cell extraction errors and text misrecognition caused by the seal, and safeguards the accuracy of both the document structure analysis and the subsequent text recognition during document data extraction.

Based on the above embodiment, based on the seal color, channel filtering is performed on the seal area in the initial document image to obtain the document image, including:

Based on seal color, carrying out channel filtration on seal areas in the initial document image to obtain a target document image;

carrying out document detection on the target document image to obtain the original corner coordinates of the document area in the target document image;

Determining a target angular point coordinate based on the original angular point coordinate, and performing perspective transformation on a document area in the target document image based on the target angular point coordinate;

and (3) carrying out direction recognition on the document area obtained by perspective transformation, and carrying out angle correction on the document area obtained by perspective transformation based on the recognition result to obtain a document image.

In actual operation, improper shooting, faulty scanning equipment and similar causes may leave the acquired document image distorted or with a cluttered background. In other words, abnormal conditions during scanning or shooting inevitably lead to blurred, skewed or miscut document images. These problems make the subsequent document data extraction harder, reduce the accuracy of document structure analysis and text recognition, and degrade the extraction result.

For this reason, in the embodiment of the application, after the seal has been erased or faded, the resulting document image is further processed to correct distortion and remove the cluttered background, giving the final document image to be extracted.

Specifically, the seal area in the initial document image is channel-filtered according to the seal color to filter out the color of the seal, thereby erasing or fading it. The resulting image, with the seal erased or faded, is the target document image.

Then, document detection can be performed on the target document image to locate the document region therein, the corner positions of the document region, namely the coordinates of four corners of the document region, are determined, and the coordinates of the four corners detected at this time are regarded as original corner coordinates of the document region.

Specifically, because a file is generally composed of headers, footers, body text, images, documents and the like, a document image separated from such a file often carries some of these other elements. In the embodiment of the application, to remove this background interference, document detection is performed on the target document image obtained in the previous step to locate the document area and extract the coordinates of its four corners. A corner detection model can be used for this purpose; it returns the coordinates of the four corners of the document in the image (starting from the upper left corner and proceeding clockwise), which are taken as the original corner coordinates.

The backbone and neck of the corner detection model use the original yolov-tiny structure, but the detection head is modified: instead of regressing a center coordinate together with length and width values, it regresses the coordinates of the four corners of the document. The reason is that the subsequent distortion correction requires the four corner coordinates of the document area rather than its center coordinate, so regressing the corners directly ensures that the distortion correction can be carried out. The loss function of the modified corner detection model uses Wing Loss and BCE Loss (binary cross entropy loss).
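For reference, the commonly published form of Wing Loss, applied to the eight regressed corner values, can be sketched as follows. The patent names the loss but does not give its parameters or say how it is weighted against BCE Loss, so the values `w=10.0` and `epsilon=2.0`, the averaging, and the function name are assumptions of this sketch.

```python
import math


def wing_loss(pred, target, w=10.0, epsilon=2.0):
    """Mean Wing Loss over paired coordinate values.

    For an absolute error d:
      d <  w : w * ln(1 + d / epsilon)   (amplifies small errors)
      d >= w : d - C                     (behaves like L1 for large errors)
    where C = w - w * ln(1 + w / epsilon) makes the two pieces meet at d = w.
    """
    c = w - w * math.log(1.0 + w / epsilon)
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        if d < w:
            total += w * math.log(1.0 + d / epsilon)
        else:
            total += d - c
    return total / len(pred)


# Hypothetical prediction and ground truth: four corners as
# (x1, y1, x2, y2, x3, y3, x4, y4), clockwise from the upper left.
predicted = [12.0, 20.0, 410.0, 8.0, 420.0, 300.0, 5.0, 310.0]
ground_truth = [10.0, 22.0, 412.0, 8.0, 418.0, 305.0, 5.0, 308.0]
loss_value = wing_loss(predicted, ground_truth)
```

The logarithmic branch gives small and medium localization errors more influence than a plain L1 or L2 loss would, which is the usual motivation for using Wing Loss in point regression.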

Then, document distortion correction is performed according to the extracted original corner coordinates to correct the deformation of the area and obtain a corrected document image. Specifically, the target corner coordinates are first determined from the original corner coordinates: with reference to the original corner coordinates, the extreme value in each direction is taken to determine the smallest rectangle that encloses the document area, and the corner coordinates of that rectangle (with its upper left corner at (0, 0)) are taken as the target corner coordinates. The document area is then corrected according to the target corner coordinates by applying a perspective transformation, which produces and extracts the transformed document area. In the extracted document area the frame lines are horizontal and vertical; the perspective transformation corrects the distortion and eliminates the interference caused by the deformation.
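One possible reading of how the target corner coordinates are derived is sketched below: the enclosing rectangle is taken as the axis-aligned bounding box of the four original corners, shifted so that its upper left corner lies at (0, 0). This interpretation, the function name and the sample corners are assumptions. The perspective warp itself is not implemented here; with OpenCV it would typically be done with `cv2.getPerspectiveTransform` and `cv2.warpPerspective`.

```python
def target_corners(original_corners):
    """Compute target corners for perspective correction.

    original_corners: four (x, y) points, clockwise from the upper left.
    Returns the corners of the smallest axis-aligned rectangle enclosing
    them, translated so that the upper left corner is (0, 0).
    """
    xs = [x for x, _ in original_corners]
    ys = [y for _, y in original_corners]
    width = max(xs) - min(xs)
    height = max(ys) - min(ys)
    # Same ordering as the input: upper left, then clockwise.
    return [(0, 0), (width, 0), (width, height), (0, height)]


# Hypothetical detected corners of a slightly skewed document.
detected = [(12, 20), (410, 8), (420, 300), (5, 310)]
targets = target_corners(detected)
```

The pairs `detected[i]` and `targets[i]` are the four point correspondences that a perspective transformation needs.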

In the embodiment of the application, corner detection is performed with the improved target detection model to obtain the original corner coordinates, and the target corner coordinates are determined from them. A perspective transformation based on the target corner coordinates then yields a corrected image, eliminating the interference of image deformation with the subsequent document data extraction and improving its accuracy and efficiency.

Further, owing to shooting errors, document size and similar factors, the document in the acquired document image may not be upright (0 degrees) but rotated by some angle (for example, an oversized document may be placed rotated by 90 degrees). To further improve the efficiency and accuracy of subsequent text recognition, the embodiment of the application therefore also corrects the orientation of the document.

Specifically, after obtaining the document area after perspective transformation, that is, the document area corrected and extracted through perspective transformation, direction recognition can be performed on the document area to identify the rotation angle of the text in the document area, so as to obtain a recognition result.

The direction recognition can be performed by an angle detection model. This is a classification model, such as a mobilenetv model, and the loss function used in training is a cross entropy loss with label smoothing.
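The standard form of cross entropy with label smoothing, applied to the four orientation classes, can be sketched as follows. The patent does not state the smoothing factor, so `smoothing=0.1`, the function name and the sample logits are illustrative assumptions.

```python
import math


def label_smoothing_cross_entropy(logits, true_class, smoothing=0.1):
    """Cross entropy against a smoothed one-hot target.

    The target puts (1 - smoothing) on the true class and spreads
    `smoothing` uniformly over all K classes.
    """
    k = len(logits)
    # Numerically stable log-softmax.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    log_probs = [z - log_sum for z in logits]
    loss = 0.0
    for i, log_p in enumerate(log_probs):
        target = (1.0 - smoothing) * (1.0 if i == true_class else 0.0) + smoothing / k
        loss -= target * log_p
    return loss


# Hypothetical logits for the classes 0, 90, 180 and 270 degrees.
angle_logits = [0.2, 3.1, -0.5, 0.4]
smoothed_loss = label_smoothing_cross_entropy(angle_logits, true_class=1)
```

Smoothing keeps the classifier from becoming overconfident about a single orientation; setting `smoothing=0.0` recovers ordinary cross entropy.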

After the rotation angle output by the model (one of 0, 90, 180 and 270 degrees) is obtained, the document area produced by the perspective transformation is angle-corrected accordingly to obtain the document image. Because the rotation angle detected by the model is a clockwise angle, the document area is rotated counterclockwise by the same angle. This completes the angle correction, and the corrected document area is taken as the final document image to be extracted.
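The correction rule (rotate counterclockwise by the detected clockwise angle) can be sketched on a small pixel grid. The grid representation and function names are assumptions made for illustration; a real system would rotate the image array with an image library.

```python
def rotate_cw_90(grid):
    """Rotate a 2-D grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def rotate_ccw_90(grid):
    """Rotate a 2-D grid 90 degrees counterclockwise."""
    return [list(row) for row in zip(*grid)][::-1]


def correct_orientation(grid, detected_angle):
    """Undo a clockwise rotation of 0, 90, 180 or 270 degrees."""
    if detected_angle not in (0, 90, 180, 270):
        raise ValueError("angle must be one of 0, 90, 180, 270")
    corrected = [list(row) for row in grid]
    # Rotate counterclockwise by the same angle the model reported.
    for _ in range(detected_angle // 90):
        corrected = rotate_ccw_90(corrected)
    return corrected


# Hypothetical upright "document" and a copy rotated 90 degrees clockwise.
upright = [[1, 2, 3],
           [4, 5, 6]]
rotated = rotate_cw_90(upright)
restored = correct_orientation(rotated, 90)
```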

Based on the above embodiment, determining the document data corresponding to the document image further includes:

Displaying the document image and the document on the same screen;

And, in the case that a proofreading operation on any document cell in the displayed document is received, jumping from that document cell to the corresponding document cell in the document image based on the corner coordinates of each document cell.

Specifically, after document data extraction is completed, manual verification is usually required to ensure that the extracted document is consistent with the document in the original document image; only then is the extraction and entry of the document from the non-editable file complete and the document digitized. However, current processing of documents in non-editable files covers only the data extraction and neglects post-processing. Faced with a large volume of document data, the subsequent manual verification often takes a great deal of time and effort, which lowers the efficiency of entering document information and slows the whole process. In short, without a suitable proofreading mechanism, manual checking after extraction consumes a great deal of time.

For this reason, after the document data corresponding to the document image has been extracted, the embodiment of the application can display the document image and the document on the same screen. The proofreader sees the document image and the extracted document side by side on one display, which makes it easy to compare them and confirm the accuracy of the information in the extracted document.

Further, after the document image and the document are displayed, if the proofreader needs to confirm whether the text content of any document cell in the extracted document is wrong, a proofreading operation can be input; the operation may be any of a click, a slide, a check operation and the like. The document data extraction system receives the proofreading operation and jumps accordingly. Because the corner coordinates of each document cell were obtained during document data extraction, the jump can use them as a reference: based on the corner coordinates of each document cell, the view jumps from the document cell on which the proofreader input the operation to the corresponding document cell in the document image. The proofreader can then compare the content of the cell before and after the jump to confirm whether the recognized text content is wrong, completing the proofreading of that document cell.

In the embodiment of the application, after the document in the document image has been obtained through document data extraction, a display mode convenient for proofreading is provided: the document image and the extracted document are displayed on the same screen, and when a proofreading operation is received, the view jumps from the current proofreading position to the corresponding document cell in the document image. This helps the proofreader quickly locate each document cell, check the content recognized in it, confirm whether errors exist, and make corrections where needed. The document can thus be entered quickly while the accuracy and validity of the entered information are ensured.

The following will exemplify the above procedure by taking the document data extraction system processing flow as an example:

Fig. 3 is an exemplary diagram of a document image provided by the application, and Fig. 4 is an exemplary diagram of the document data corresponding to that document image. As shown in Fig. 3 and Fig. 4, document image preprocessing (seal erasure, distortion correction and angle correction), text detection, structural analysis based on image morphology, and text backfilling based on the maximum intersection principle together yield the document data corresponding to the document image, converting the document image into structured document data. In practical deployment, however, usability raises a serious problem: the recognized document must still be reviewed manually. If the process merely converts the document in the document image into Excel, the review burden on the proofreader remains very high, especially for documents with a large volume of data. For very complex documents such as enterprise annual reports and enterprise statements, the proofreader has to compare every document cell one by one, which is very error-prone.

For this reason, an effective post-processing mechanism is proposed in order to achieve real cost reduction and efficiency gains. In short, the manual workload is only truly reduced when both the recognition and the proofreader's review after recognition are made convenient. For example, suppose that manually extracting the document from a document image takes 30 minutes. Automatic extraction followed by manual review might reduce this to 10 minutes, but if the proofreader can review conveniently through an effective proofreading mechanism, the whole flow might be reduced to 1 minute. Only then are the cost reduction and efficiency gains fully realized and the whole flow used to its full potential.

That is, after the document data extraction described above, the document data corresponding to the document image is obtained in the form of json information for the recognition result of each document image; the json information includes the corner coordinates of each document cell and the corresponding text content. The back end of the document data extraction system can then integrate the document image with the corresponding document and pass the integrated information to the front end, which displays it, for example in a split window with the document image on the left and the extracted document on the right. (The document image shown in Fig. 3 can be displayed on the left of the screen and the document shown in Fig. 4 on the right; all the information recognized from the document image can be viewed by dragging the scroll bar.) The proofreader can then review the result. After the proofreader clicks any document cell in the document on the front end, the view jumps to the corresponding document cell in the document image, so that the proofreader can compare the information before and after the jump, confirm whether the recognized text content is wrong, and confirm or correct it.
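The json information and the jump target lookup might look like the sketch below. The source states only that the json contains the corner coordinates of each cell and its text content; the field names (`cells`, `id`, `corners`, `text`), the function names and the sample data are hypothetical.

```python
import json


def build_result_json(cell_corners, cell_texts):
    """Serialize cell corner coordinates and text content as json."""
    cells = [
        {"id": idx, "corners": [list(point) for point in corners], "text": text}
        for idx, (corners, text) in enumerate(zip(cell_corners, cell_texts))
    ]
    return json.dumps({"cells": cells}, ensure_ascii=False)


def locate_cell(result_json, cell_id):
    """Return the box (x_min, y_min, x_max, y_max) to jump to in the image."""
    for cell in json.loads(result_json)["cells"]:
        if cell["id"] == cell_id:
            xs = [x for x, _ in cell["corners"]]
            ys = [y for _, y in cell["corners"]]
            return (min(xs), min(ys), max(xs), max(ys))
    raise KeyError(f"no cell with id {cell_id}")


# Hypothetical extraction result with two cells.
corners = [
    [(0, 0), (100, 0), (100, 50), (0, 50)],
    [(100, 0), (200, 0), (200, 50), (100, 50)],
]
result_json = build_result_json(corners, ["Name", "Amount"])
jump_box = locate_cell(result_json, 1)
```

A front end could scroll the image pane to `jump_box` and outline it when the proofreader clicks the matching cell in the extracted document.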

It should be noted that the text content of each document cell in the displayed document can be edited, confirmed and exported. In addition, the text content of the document cells can be displayed differently according to recognition confidence: text regions recognized with lower confidence can be highlighted, so that the proofreader sees at a glance which text is doubtful and needs closer checking. This further saves proofreading time while ensuring the efficiency and validity of the proofreading.
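Confidence-based highlighting can be sketched as a simple threshold test. The source does not specify a threshold or a data layout, so the `confidence` and `highlight` field names, the default threshold of 0.8 and the sample cells are assumptions.

```python
def flag_low_confidence(cells, threshold=0.8):
    """Return copies of the cells with a `highlight` flag added.

    A cell is highlighted when its recognition confidence is below
    the threshold, so the proofreader knows to check it first.
    """
    return [dict(cell, highlight=cell["confidence"] < threshold) for cell in cells]


# Hypothetical recognized cells with per-cell confidence scores.
recognized = [
    {"id": 0, "text": "Name", "confidence": 0.99},
    {"id": 1, "text": "Am0unt", "confidence": 0.42},
    {"id": 2, "text": "Alice", "confidence": 0.80},
]
flagged = flag_low_confidence(recognized)
to_review = [cell["id"] for cell in flagged if cell["highlight"]]
```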

In the embodiment of the application, the pipeline is comprehensive and covers every stage from document input to actual use. It effectively improves the accuracy, validity and compatibility of document data extraction, simplifies the subsequent review work, and makes checking convenient for the proofreader. It thereby achieves cost reduction and efficiency gains in practical applications and solves the deployment problem of document data extraction schemes: the scheme can be put into use directly, is easy to implement and deploy, reduces the labor cost of enterprises, improves their operating efficiency and lowers operating risk.

FIG. 5 is a general flow chart of a document data extraction method provided by the application, as shown in FIG. 5, the method comprising:

step 400, obtaining an initial document image;

Step 411, seal detection is carried out on the initial document image, and a seal area and seal color in the initial document image are obtained;

Step 412, performing channel filtration on the stamp area in the initial document image based on the stamp color to obtain a target document image;

step 413, performing document detection on the target document image to obtain original corner coordinates of a document area in the target document image;

Step 414, determining the target angular point coordinates based on the original angular point coordinates, and performing perspective transformation on the document region in the target document image based on the target angular point coordinates;

Step 415, direction recognition is carried out on the document area obtained by perspective transformation, and angle correction is carried out on the document area obtained by perspective transformation based on the recognition result, so as to obtain a document image;

Step 420, performing text detection on the document image to obtain text regions in the document image and corner coordinates of each text region;

step 431, generating a single-channel image based on the corner coordinates of the document frame in the document image;

Step 432, determining a mask map of the document image based on the corner coordinates of each text region and the single-channel image;

Step 433, determining a target size based on the image size of the mask map; eroding and dilating the mask map with a convolution kernel of the target size to obtain the corner coordinates of the document cells in the document image;

Step 441, matching each text region with each document cell based on the corner coordinates of each text region and the corner coordinates of each document cell, so as to obtain a corresponding relationship between each text region and each document cell;

step 442, text content is placed into each document cell based on the corresponding relationship and the text content of each text region, so as to obtain document data corresponding to the document image.

Wherein, the corresponding relation between any text area and the document cell is determined based on the following steps: determining the area of an overlapping area between the text area and each document cell based on the corner coordinates of the text area and the corner coordinates of each document cell; and screening an overlapping region with the largest area from the overlapping regions as a target overlapping region, and taking a document cell corresponding to the target overlapping region as a document cell corresponding to the text region to obtain the corresponding relation between the text region and the document cell.

Determining the document data corresponding to the document image, and then further comprising:

Displaying the document image and the document on the same screen; and, in the case that a proofreading operation on any document cell in the displayed document is received, jumping from that document cell to the corresponding document cell in the document image based on the corner coordinates of each document cell.

In the embodiment of the application, a mask map is generated from the corner coordinates of the text regions in the document image, and the mask map is eroded and dilated to obtain the corner coordinates of each document cell. This removes the restriction that document type places on document data extraction and enables cell extraction for many kinds of documents. On this basis, the corner coordinates of each text region are used to backfill the text content of each text region, yielding the document data corresponding to the document image. The scheme overcomes the defects of traditional document data extraction methods, namely poor compatibility, unsuitability for documents of varying form, and poor extraction results. It supports extraction from many types of documents, improves extraction accuracy and efficiency, is easy to implement and deploy, and is highly practical and compatible.

Fig. 6 illustrates a physical schematic diagram of an electronic device, as shown in fig. 6, which may include: processor 610, communication interface (Communications Interface) 620, memory 630, and communication bus 640, wherein processor 610, communication interface 620, memory 630 communicate with each other via communication bus 640. Processor 610 may invoke logic instructions in memory 630 to perform a document data extraction method comprising: acquiring a document image to be extracted; text detection is carried out on the document image, so that text areas in the document image and corner coordinates of the text areas are obtained; generating a mask image of the document image based on the corner coordinates of each text region, and corroding and expanding the mask image to obtain the corner coordinates of the document cells in the document image; and determining document data corresponding to the document image based on the corner coordinates of each text area, the corner coordinates of each document cell and the text content of each text area.

Further, the logic instructions in the memory 630 may be implemented in the form of software functional units and, when sold or used as a stand-alone product, stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and comprises several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.

In another aspect, the present application also provides a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the document data extraction method provided above, the method comprising: acquiring a document image to be extracted; performing text detection on the document image to obtain text areas in the document image and corner coordinates of each text area; generating a mask map of the document image based on the corner coordinates of each text area, and corroding and expanding the mask map to obtain the corner coordinates of the document cells in the document image; and determining document data corresponding to the document image based on the corner coordinates of each text area, the corner coordinates of each document cell and the text content of each text area.

In yet another aspect, the present application also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the document data extraction method provided above, the method comprising: acquiring a document image to be extracted; performing text detection on the document image to obtain text areas in the document image and corner coordinates of each text area; generating a mask map of the document image based on the corner coordinates of each text area, and corroding and expanding the mask map to obtain the corner coordinates of the document cells in the document image; and determining document data corresponding to the document image based on the corner coordinates of each text area, the corner coordinates of each document cell and the text content of each text area.

The system embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the embodiments without creative effort.

From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general hardware platform, or of course by means of hardware. Based on such understanding, the foregoing technical solutions, in essence or in the part that contributes to the prior art, may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and which includes several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform the methods described in the various embodiments or in some parts of the embodiments.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and are not limiting; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims

1. A document data extraction system, characterized in that the document data extraction system comprises an image acquisition unit, a text detection unit, a corrosion expansion unit, a document data extraction unit and a data management console; the data management console is respectively connected with the image acquisition unit, the text detection unit, the corrosion expansion unit and the document data extraction unit and is used for controlling and managing all units;

the text detection unit is used for performing text detection on the document image to obtain text areas in the document image and corner coordinates of each text area;

the corrosion expansion unit is used for generating a mask image of the document image based on the corner coordinates of each text region, and carrying out corrosion expansion on the mask image to obtain the corner coordinates of the document cells in the document image;

the document data extraction unit is used for determining document data corresponding to the document image based on the corner coordinates of each text region, the corner coordinates of each document cell and the text content of each text region;

wherein the determining the document data corresponding to the document image based on the corner coordinates of the text areas, the corner coordinates of the document cells, and the text content of the text areas includes:

matching each text region with each document cell based on the corner coordinates of each text region and the corner coordinates of each document cell to obtain a corresponding relation between each text region and each document cell;

based on the corresponding relation and the text content of each text region, placing the text content of each text region into its corresponding document cell to obtain document data corresponding to the document image;

the corresponding relation between any text region and a document cell is determined based on the following steps:

determining the area of the overlapping region between said text region and each document cell based on the corner coordinates of said text region and the corner coordinates of each document cell;

and selecting the overlapping region with the largest area from the overlapping regions as a target overlapping region, and taking the document cell corresponding to the target overlapping region as the document cell corresponding to said text region, to obtain the corresponding relation between said text region and the document cell.
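The matching and placement steps recited in claim 1 can be sketched as follows. This is a minimal illustration, assuming axis-aligned boxes given as (x_min, y_min, x_max, y_max). The function names `overlap_area`, `match_cell` and `backfill`, the handling of text areas that overlap no cell, and the joining of several text areas in one cell with a space are all assumptions of this sketch rather than features stated in the claims.

```python
from typing import Dict, List, Optional, Tuple

# Assumed layout: (x_min, y_min, x_max, y_max), axis-aligned.
Box = Tuple[float, float, float, float]


def overlap_area(a: Box, b: Box) -> float:
    """Area of the intersection of two boxes, or 0 if they do not overlap."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return w * h if w > 0 and h > 0 else 0.0


def match_cell(text_box: Box, cells: List[Box]) -> Optional[int]:
    """Index of the cell with the largest overlap, or None if none overlaps."""
    best_index, best_area = None, 0.0
    for i, cell in enumerate(cells):
        area = overlap_area(text_box, cell)
        if area > best_area:
            best_index, best_area = i, area
    return best_index


def backfill(text_regions: List[Tuple[Box, str]], cells: List[Box]) -> Dict[int, str]:
    """Place each text area's content into its matched cell."""
    filled: Dict[int, List[str]] = {}
    for box, content in text_regions:
        idx = match_cell(box, cells)
        if idx is not None:  # assumption: unmatched text is skipped
            filled.setdefault(idx, []).append(content)
    # Assumption: several text areas in one cell are joined with a space.
    return {i: " ".join(parts) for i, parts in filled.items()}


# Usage: two adjacent cells; the second text area straddles the border
# but overlaps the right-hand cell more, so it is assigned there.
cells = [(0, 0, 10, 10), (10, 0, 20, 10)]
data = backfill([((1, 1, 5, 5), "Name"), ((8, 2, 15, 8), "Alice")], cells)
```

Choosing the largest overlap rather than requiring full containment tolerates text boxes that spill slightly over a cell border, which is the case the straddling example exercises.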

2. A document data extraction method, characterized by comprising:

acquiring a document image to be extracted;

performing text detection on the document image to obtain text areas in the document image and corner coordinates of each text area;

generating a mask map of the document image based on the corner coordinates of each text region, and corroding and expanding the mask map to obtain the corner coordinates of the document cells in the document image;

determining document data corresponding to the document image based on the corner coordinates of each text region, the corner coordinates of each document cell and the text content of each text region;

3. The method for extracting document data according to claim 2, wherein generating a mask map of the document image based on the corner coordinates of the text regions, and performing corrosion expansion on the mask map to obtain the corner coordinates of the document cells in the document image, comprises:

generating a single-channel image based on corner coordinates of a document frame in the document image;

determining a target size based on an image size of the mask map;

and corroding and expanding the mask map through a convolution kernel of the target size to obtain corner coordinates of the document cells in the document image.
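The kernel sizing and the corrosion and expansion recited in claim 3 can be illustrated with the sketch below, which applies a rectangular kernel to a binary image (erosion followed by dilation, i.e. a morphological opening). The rule in `target_size` (a fixed fraction of the image width, forced to an odd value) is purely hypothetical, since the claim states only that the size depends on the image size. How cell corner coordinates are read off the result is not specified in this section and is not shown.

```python
from typing import List

Image = List[List[int]]  # binary image: 1 = foreground, 0 = background


def target_size(image_width: int, ratio: int = 20) -> int:
    """Hypothetical rule: kernel length is a fraction of the width, made odd."""
    return max(1, image_width // ratio) | 1


def _morph(img: Image, kw: int, kh: int, erode: bool) -> Image:
    h, w = len(img), len(img[0])
    ax, ay = kw // 2, kh // 2  # kernel anchored at its center (odd sizes assumed)
    border = 1 if erode else 0  # out-of-image pixels never change the result
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [
                img[yy][xx] if 0 <= yy < h and 0 <= xx < w else border
                for yy in range(y - ay, y - ay + kh)
                for xx in range(x - ax, x - ax + kw)
            ]
            out[y][x] = min(window) if erode else max(window)
    return out


def erode(img: Image, kw: int, kh: int) -> Image:
    return _morph(img, kw, kh, erode=True)


def dilate(img: Image, kw: int, kh: int) -> Image:
    return _morph(img, kw, kh, erode=False)


def opening(img: Image, kw: int, kh: int) -> Image:
    """Erosion followed by dilation with the same rectangular kernel."""
    return dilate(erode(img, kw, kh), kw, kh)


# Usage: a horizontal line survives a 3x1 opening; an isolated pixel does not.
img = [[0] * 9 for _ in range(5)]
for x in range(1, 8):
    img[2][x] = 1
img[0][4] = 1
out = opening(img, 3, 1)
```

With a wide, one-pixel-high kernel the opening keeps long horizontal runs and discards short ones; a tall, one-pixel-wide kernel does the same for vertical runs.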

4. A document data extraction method according to claim 2 or 3, wherein the acquiring a document image to be extracted includes:

acquiring an initial document image;

and carrying out channel filtering on the seal area in the initial document image based on the seal color to obtain the document image.
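One common way to read the channel filtering of claim 4 is to keep only the color channel that matches the seal color: in that channel the seal ink is nearly as bright as the paper, while dark text stays dark. The sketch below illustrates this reading. It is an assumption, since this section does not state how the filtering is performed, and the name `filter_seal` and the RGB pixel layout are hypothetical.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # assumed layout: (R, G, B), each 0-255

CHANNEL_INDEX = {"red": 0, "green": 1, "blue": 2}


def filter_seal(image: List[List[Pixel]], seal_color: str = "red") -> List[List[int]]:
    """Return a single-channel image holding only the seal-colored channel."""
    channel = CHANNEL_INDEX[seal_color]
    return [[pixel[channel] for pixel in row] for row in image]


# Usage: white paper, red seal ink, and black text in one row.
row = [(255, 255, 255), (220, 30, 30), (20, 20, 20)]
gray = filter_seal([row], "red")
```

In the result the red seal pixel (220) sits close to the paper (255), whereas the text pixel (20) stays dark, so a later threshold can separate text from seal.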

5. The method for extracting document data according to claim 4, wherein said carrying out channel filtering on the seal area in said initial document image based on the seal color to obtain said document image comprises:

based on the seal color, carrying out channel filtering on the seal area in the initial document image to obtain a target document image;

carrying out document detection on the target document image to obtain original corner coordinates of a document area in the target document image;

and carrying out direction recognition on the document area obtained by perspective transformation, and carrying out angle correction on the document area obtained by perspective transformation based on a recognition result to obtain the document image.

6. A document data extraction method according to claim 2 or 3, wherein, after said determining the document data corresponding to the document image, the method further comprises:

displaying the document image and the document data on the same screen;

and, in a case where a checking operation on any document cell in the displayed document data is received, jumping from that document cell in the document data to the corresponding document cell in the document image based on the corner coordinates of each document cell.
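The jump recited in claim 6 amounts to a lookup from a cell in the displayed data to that cell's position in the image. A minimal sketch follows, assuming each cell is keyed by an integer identifier and stored with its four corner points. The name `locate_cell` and the returned bounding rectangle are hypothetical, and the display logic itself is not shown.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def locate_cell(cell_id: int, cell_corners: Dict[int, List[Point]]) -> Rect:
    """Bounding rectangle in the image for the cell selected in the data view."""
    corners = cell_corners[cell_id]
    xs = [p[0] for p in corners]
    ys = [p[1] for p in corners]
    return (min(xs), min(ys), max(xs), max(ys))


# Usage: the viewer would scroll to and highlight this rectangle.
corners = {3: [(10, 20), (110, 20), (110, 60), (10, 60)]}
rect = locate_cell(3, corners)
```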

7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the document data extraction method of any one of claims 2 to 6 when executing the program.

8. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the document data extraction method according to any one of claims 2 to 6.