CN117746437A - Document data extraction system and method thereof

Info

Publication number: CN117746437A
Application number: CN202410187687.8A
Authority: CN
Inventors: 李哲洙; 李洪金
Original assignee: Shenyang Zhehang Information Technology Co ltd
Current assignee: Shenyang Zhehang Information Technology Co ltd
Priority date: 2024-02-20
Filing date: 2024-02-20
Publication date: 2024-03-22
Anticipated expiration: 2044-02-20
Also published as: CN117746437B

Abstract

The document data extraction system comprises an image acquisition unit for acquiring a document image to be extracted, a text detection unit, a corrosion expansion unit (performing morphological erosion and dilation) for obtaining the corner coordinates of the document cells in the document image, a document data extraction unit for determining the document data corresponding to the document image, and a data management console. The data management console is connected to the image acquisition unit, the text detection unit, the corrosion expansion unit and the document data extraction unit, and controls and manages all of these units. The application overcomes the drawbacks of conventional schemes, in which document data extraction methods have poor compatibility, cannot adapt to documents of varying form and extract poorly. It enables extraction of various types of document data, improves extraction accuracy and efficiency, is easy to implement and deploy, and offers strong practicability and good compatibility.

Description

Document data extraction system and method thereof

Technical Field

The present disclosure relates to the field of data processing technologies, and in particular, to a document data extraction system and a method thereof.

Background

Document processing is a routine task in daily life. However, most documents exist in non-editable forms, such as pictures and scanned files, so extracting their contents and digitizing the document information is difficult.

At present, extracting documents from non-editable files, entering the information and proofreading it are carried out manually, which inevitably consumes a great deal of time and effort and carries considerable operational risk. Automatic document data extraction methods have emerged in response, but current methods cannot handle documents of varied structure and form, have poor compatibility, cannot extract multiple types of documents, and often yield unsatisfactory results.

Disclosure of Invention

The application provides a document data extraction system and a document data extraction method to overcome the drawbacks of the prior art, in which document data extraction methods have poor compatibility, cannot adapt to documents of varying form and extract poorly. By removing the restriction on document type, the application enables extraction of various types of document data while ensuring the extraction effect.

In a first aspect, the present application provides a document data extraction system, including an image acquisition unit, a text detection unit, a corrosion expansion unit, a document data extraction unit, and a data management console; the data management console is respectively connected with the image acquisition unit, the text detection unit, the corrosion expansion unit and the document data extraction unit and is used for controlling and managing all the units;

the image acquisition unit is used for: acquiring a document image to be extracted;

the text detection unit is used for: text detection is carried out on the document image, so that text areas in the document image and corner coordinates of the text areas are obtained;

the corrosion expansion unit is used for: generating a mask image of the document image based on the corner coordinates of each text area, and performing corrosion expansion (morphological erosion and dilation) on the mask image to obtain the corner coordinates of the document cells in the document image;

the document data extraction unit is used for: and determining document data corresponding to the document image based on the corner coordinates of each text area, the corner coordinates of each document cell and the text content of each text area.

In a second aspect, the present application provides a document data extraction method, including:

acquiring a document image to be extracted;

text detection is carried out on the document image, so that text areas in the document image and corner coordinates of the text areas are obtained;

generating a mask image of the document image based on the corner coordinates of each text region, and performing corrosion expansion (erosion and dilation) on the mask image to obtain the corner coordinates of the document cells in the document image;

and determining document data corresponding to the document image based on the corner coordinates of each text area, the corner coordinates of each document cell and the text content of each text area.

In a third aspect, the present application further provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the document data extraction method according to any one of the above second aspects when executing the program.

In a fourth aspect, the present application also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a document data extraction method as in any of the above second aspects.

In a fifth aspect, the present application also provides a computer product having stored thereon a computer program which, when executed by a processor, implements a document data extraction method as in any of the above second aspects.

With the document data extraction system and method provided by the application, a mask map is generated from the corner coordinates of the text areas in the document image, and the mask map is eroded and dilated to obtain the corner coordinates of each document cell. This removes the restriction that document type places on document data extraction and allows document cells to be extracted from various kinds of documents. On this basis, the text content of each text area is backfilled using the corner coordinates of each text area to obtain the document data corresponding to the document image. The application thereby overcomes the poor compatibility, the inability to adapt to documents of varying form and the poor extraction effect of conventional document data extraction methods; it extracts various types of document data, improves extraction accuracy and efficiency, is easy to implement and deploy, and offers strong practicability and good compatibility.

Drawings

To describe the technical solutions of the present application or of the prior art more clearly, the drawings used in the description of the embodiments or of the prior art are briefly introduced below. The drawings described below show some embodiments of the present application, and a person skilled in the art may obtain other drawings from them without inventive effort.

FIG. 1 is a schematic diagram of a document data extraction system provided herein;

FIG. 2 is a flow chart of a document data extraction method provided by the present application;

FIG. 3 is an exemplary diagram of a document image provided herein;

FIG. 4 is an exemplary diagram of document data corresponding to a document image provided herein;

FIG. 5 is a general flow chart of a document data extraction method provided herein;

FIG. 6 is a schematic structural diagram of an electronic device provided in the present application.

Detailed Description

For the purposes of making the objects, technical solutions and advantages of the present application more apparent, the technical solutions in the present application will be clearly and completely described below with reference to the drawings in the present application, and it is apparent that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.

In recent years, neural networks and deep learning have achieved major breakthroughs and are widely applied in many fields, which provides a theoretical basis for deploying neural networks in practice. However, extracting a document from a non-editable file remains a complex problem, mainly because documents vary widely in structure and form. A document data extraction method must cope with this variety to ensure the extraction effect, which current methods struggle to do, so their adaptability and results are poor.

In short, the frame lines of documents encountered in practice are varied, and frame lines are an important feature of a document. Different frame-line styles (such as full frame lines, half frame lines and no frame lines) therefore give documents varied structures and forms. Such documents place high demands on the compatibility of current extraction methods, which are difficult to meet at present.

In this regard, the present application provides a document data extraction system. The system obtains the corner coordinates of each document cell from the corner coordinates of the text regions in a document image, which removes the restriction that document type places on document data extraction and allows document cells to be extracted from various kinds of documents. On this basis, text content is backfilled to obtain the document data corresponding to the document image, so that various types of document data can be extracted with improved accuracy and efficiency, strong practicability and good compatibility. FIG. 1 is a schematic structural diagram of the document data extraction system provided by the present application. As shown in FIG. 1, the document data extraction system includes an image acquisition unit, a text detection unit, a corrosion expansion unit, a document data extraction unit and a data management console; the data management console is connected to the image acquisition unit, the text detection unit, the corrosion expansion unit and the document data extraction unit, and controls and manages all of these units.

Optionally, the image acquisition unit acquires a document image to be extracted and transmits it to the data management console.

Optionally, the text detection unit obtains the document image to be extracted from the data management console, performs text detection on it to obtain the text regions in the document image and the corner coordinates of each text region, and transmits the corner coordinates of each text region to the data management console.

Optionally, the corrosion expansion unit obtains the corner coordinates of each text region from the data management console, generates a mask image of the document image from them, performs corrosion expansion (erosion and dilation) on the mask image to obtain the corner coordinates of the document cells in the document image, and transmits the corner coordinates of the document cells to the data management console.

Optionally, the document data extraction unit obtains the corner coordinates of each text region, the corner coordinates of each document cell and the text content of each text region from the data management console, and determines the document data corresponding to the document image based on them.
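The hub arrangement described above, in which each unit exchanges data only with the data management console, could be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: the class name `DataManagementConsole`, its field names, the `store` dictionary and the separate `recognize_text` callable are all hypothetical, and the units themselves are replaced by stubs.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple

Corners = List[Tuple[int, int]]  # four (x, y) corner points


@dataclass
class DataManagementConsole:
    """Hypothetical hub that stores intermediate results and invokes each unit in turn."""

    acquire_image: Callable[[], Any]                            # image acquisition unit
    detect_text: Callable[[Any], List[Corners]]                 # text detection unit
    recognize_text: Callable[[Any, List[Corners]], List[str]]   # assumption: not a named unit
    find_cells: Callable[[Any, List[Corners]], List[Corners]]   # corrosion expansion unit
    extract_data: Callable[[List[Corners], List[Corners], List[str]], Any]
    store: Dict[str, Any] = field(default_factory=dict)

    def run(self) -> Any:
        s = self.store
        s["image"] = self.acquire_image()
        s["text_corners"] = self.detect_text(s["image"])
        s["text_content"] = self.recognize_text(s["image"], s["text_corners"])
        s["cell_corners"] = self.find_cells(s["image"], s["text_corners"])
        s["document_data"] = self.extract_data(
            s["text_corners"], s["cell_corners"], s["text_content"]
        )
        return s["document_data"]


# Stub units standing in for real implementations.
box = [(0, 0), (10, 0), (10, 5), (0, 5)]
console = DataManagementConsole(
    acquire_image=lambda: "image-placeholder",
    detect_text=lambda image: [box],
    recognize_text=lambda image, regions: ["Total"],
    find_cells=lambda image, regions: [box],
    extract_data=lambda texts, cells, contents: list(zip(cells, contents)),
)
result = console.run()
```

Keeping every intermediate result in the console's store mirrors the description, where each unit reads its inputs from the console and writes its outputs back to it.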

With the document data extraction system provided by the application, a mask map is generated from the corner coordinates of the text areas in the document image, and the mask map is eroded and dilated to obtain the corner coordinates of each document cell. This removes the restriction that document type places on document data extraction and allows document cells to be extracted from various kinds of documents. On this basis, the text content of each text area is backfilled using the corner coordinates of each text area to obtain the document data corresponding to the document image. The system thereby overcomes the poor compatibility, the inability to adapt to documents of varying form and the poor extraction effect of conventional document data extraction methods; it extracts various types of document data, improves extraction accuracy and efficiency, is easy to implement and deploy, and offers strong practicability and good compatibility.

Further, the present application provides a document data extraction method. The method obtains the corner coordinates of each document cell from the corner coordinates of the text regions in a document image, which removes the restriction that document type places on document data extraction and allows document cells to be extracted from various kinds of documents. On this basis, text content is backfilled to obtain the document data corresponding to the document image, so that various types of document data can be extracted with improved accuracy and efficiency, strong practicability and good compatibility. FIG. 2 is a schematic flow diagram of the document data extraction method provided by the present application. As shown in FIG. 2, the method is executed by the document data extraction system and includes:

Step 110, obtaining a document image to be extracted;

Step 120, performing text detection on the document image to obtain text areas in the document image and corner coordinates of each text area.

Specifically, before document data can be extracted, the document image to be extracted must first be acquired; the document image contains the extraction object, namely the document to be extracted. To keep the extraction targeted and accurate and to avoid errors, garbled characters and the like, in the embodiment of the application each document image contains only a single document and no text information outside the document.

The document image to be extracted may be an image in document form, an image split or separated from a scanned file, or a picture; this is not specifically limited in the embodiment of the present application.

It can be understood that, in an actual document data extraction process, after the user uploads a file containing the extraction object, the document data extraction system receives the file, analyzes its file information and determines its type. If the file is an image, it is taken as the document image for the subsequent extraction process. Otherwise, if the uploaded file is not a picture but a multi-page file, it is parsed to obtain the document image; for example, a scanned file such as a PDF (Portable Document Format) file can be split into single-page images, and the images containing the extraction object are selected from them as document images.
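The type check described above might look like the following sketch. It only decides how an upload should be routed; the suffix list, the function names `classify_upload` and `plan_document_images`, and the generated page file names are assumptions made for illustration, and the actual splitting of a PDF into page images is left out because the text does not name a tool for it.

```python
from pathlib import Path
from typing import List

# Assumed set of suffixes treated as images; the source does not list any.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".bmp", ".tif", ".tiff"}


def classify_upload(filename: str) -> str:
    """Return 'image', 'pdf' or 'unsupported' based on the file suffix."""
    suffix = Path(filename).suffix.lower()
    if suffix in IMAGE_SUFFIXES:
        return "image"
    if suffix == ".pdf":
        return "pdf"
    return "unsupported"


def plan_document_images(filename: str, page_count: int = 1) -> List[str]:
    """List the images to process: the upload itself, or one image name per PDF page."""
    kind = classify_upload(filename)
    if kind == "image":
        return [filename]
    if kind == "pdf":
        stem = Path(filename).stem
        # Hypothetical naming scheme for the single-page images cut from the PDF.
        return [f"{stem}_page{n}.png" for n in range(1, page_count + 1)]
    raise ValueError(f"unsupported upload type: {filename}")


pages = plan_document_images("report.pdf", page_count=3)
```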

Since an uploaded file often contains more than one document and the documents are not necessarily located together, there may be multiple document images to be extracted. In that case, the document images are processed one by one to extract the document in each, thereby completing document data extraction for the whole file.

The document image may be acquired by a user through an image acquisition device, may be downloaded by the user from the internet, or may be received through network communication, or may be acquired by the user from a history file, for example, may search/find a financial report from a report file of an enterprise in the past year, and separate or acquire a document image from the financial report.

After the document image is obtained, text detection can be performed on it to locate the text areas and determine the corner coordinates of each. That is, text detection determines where the text in the document image is located and segments the text areas, from which the positions of the four corner points of each text area in the document image, and hence their coordinates, are obtained.

A text detection model may be used to perform text detection on the document image. That is, the document image is input into the text detection model, which detects and outputs each text region; the corner coordinates of each text region are then determined. The text detection model may use the DB (Differentiable Binarization) text detection algorithm, which directly locates the areas where text is present and segments the text regions; the coordinates of the four corners of each text region can then be determined through OpenCV.
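As a small illustration of the last step, the sketch below derives four corner coordinates from the outline of one detected text region. It assumes the detector returns each region as a list of (x, y) outline points and that an axis-aligned box is sufficient; the function name `bounding_corners` is hypothetical, and the text only states that OpenCV is used, where a helper such as `cv2.boundingRect` plays a comparable role.

```python
from typing import List, Tuple

Point = Tuple[int, int]


def bounding_corners(points: List[Point]) -> List[Point]:
    """Return axis-aligned corners: top-left, top-right, bottom-right, bottom-left."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    left, right, top, bottom = min(xs), max(xs), min(ys), max(ys)
    return [(left, top), (right, top), (right, bottom), (left, bottom)]


# Slightly skewed outline of one text region, as a detector might return it.
region_outline = [(12, 8), (58, 9), (59, 21), (11, 20)]
corners = bounding_corners(region_outline)
```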

Step 130, generating a mask image of the document image based on the corner coordinates of each text region, and corroding and expanding the mask image to obtain the corner coordinates of the document cells in the document image;

Step 140, determining document data corresponding to the document image based on the corner coordinates of each text area, the corner coordinates of each document cell and the text content of each text area.

Specifically, once the corner coordinates of each text region have been obtained in step 120, the positions of the document cells can be determined from them to obtain the corner coordinates of each document cell. On this basis, the corner coordinates and text content of each text region are combined to determine the extracted document, that is, the document in the document image that is to be extracted.

It can be understood that, once the corner coordinates of each text region are obtained, the corner coordinates of each document cell in the document image can be determined from them. That is, a mask map corresponding to the document image is generated with the corner coordinates of each text region as a reference, and the document's dividing lines are then extracted by erosion and dilation (corrosion and expansion) to determine the document structure, which gives the corner coordinates of each document cell in the document image.

Specifically, the embodiment of the application exploits a property of the frame lines that divide a document: no text lies on the frame lines or at their intersections. The corner positions of the text regions can therefore be used to identify the positions that contain no text content, and these positions can be connected into dividing frame lines. The intersections of the dividing frame lines then give the corner coordinates of each document cell.

Specifically, a mask map of the document image is generated according to the corner coordinates of each text region, that is, the corner positions of each text region in the document image can be used to generate a mask map of the document size, so that the frame line information of the document in the document image can be determined according to the mask map, and the corner coordinates of each document cell in the document can be extracted.

That is, after the mask map of the document image is obtained, the embodiment of the present application processes it by erosion and dilation to obtain the corner coordinates of each document cell. Specifically, the mask map is eroded and dilated to determine the frame lines of the document in the mask map, which gives the frame line information of the document in the document image; the intersections of the frame lines and the coordinates at those intersections are then determined from the frame line information, which gives the corner coordinates of each document cell.

It is noted that the embodiment of the present application analyzes the document structure with a method based on image morphology: a mask map is generated from the corner coordinates of each text region and is then eroded and dilated to obtain the corner coordinates of each document cell. This process depends only on the positions of the text regions in the document image and not on the document type. However varied the frame lines of the document are, the process can analyze the document structure, extract the frame line information and determine the corner coordinates of each document cell. It thus supports structural analysis and cell extraction for various documents, removes the dependence on document contours and the restriction that document type places on document data extraction, offers strong adaptability and compatibility, and improves the accuracy and feasibility of document data extraction.

Further, after the corner coordinates of each document cell are extracted, they can be combined with the corner coordinates and text content of each text region to extract the document in the document image. That is, the text content of each text region is determined first; the matching relationship between text regions and document cells is then determined from the corner coordinates of each text region and of each document cell; the text content is filled into the corresponding document cells according to the matching relationship; and the document to be extracted is finally generated.

Specifically, the text content of each text region may first be obtained by recognition; that is, text recognition is performed on each text region, for example by a text recognition model, to obtain the text content corresponding to each text region.

The text recognition model can be an existing trained model, or an initial model that is built in advance and then trained with samples and labels; the initial model can be built on the basis of a convolutional recurrent neural network (Convolutional Recurrent Neural Network, CRNN).

Then, the document data corresponding to the document image is determined from the recognized text content of each text region, the corner coordinates of each text region and the corner coordinates of each document cell. This takes into account that the document cells generally correspond to text content; where a document cell contains no text, no text region corresponds to it and its text content is empty.

In the embodiment of the application, the text content of the document is backfilled in a top-down manner. The correspondence between the corner coordinates of each document cell and the corner coordinates of each text area is determined by matching, and the text is backfilled according to this correspondence to obtain the document data corresponding to the document image. This ensures that text content is placed accurately and makes text backfilling more efficient, with the efficiency gain being most noticeable on larger documents; the approach also uses few computing resources and is practical and easy to operate.
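One possible reading of the top-down backfilling is sketched below: cells are visited from top to bottom and left to right, and each receives the text matched to it or an empty string. The function name `backfill`, the `(left, top, right, bottom)` cell format and the `cell_text` mapping are assumptions for illustration; the sketch also assumes that cells in the same row share the same top coordinate.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


def backfill(cells: List[Box], cell_text: Dict[int, str]) -> List[List[str]]:
    """Build a row-by-row table; cells without matched text stay empty."""
    # Visit cells top to bottom, then left to right within a row.
    order = sorted(range(len(cells)), key=lambda i: (cells[i][1], cells[i][0]))
    table: List[List[str]] = []
    current_top = None
    for i in order:
        top = cells[i][1]
        if top != current_top:  # a new top coordinate starts a new row
            table.append([])
            current_top = top
        table[-1].append(cell_text.get(i, ""))
    return table


cells = [(3, 0, 6, 2), (0, 0, 3, 2), (0, 2, 3, 4), (3, 2, 6, 4)]
cell_text = {0: "Amount", 1: "Item", 2: "Paper"}  # cell index -> recognized text
table = backfill(cells, cell_text)
```

The last cell has no matched text region, so it appears as an empty string, in line with the handling of empty cells described above.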

With the document data extraction method provided by the application, a mask map is generated from the corner coordinates of the text areas in the document image, and the mask map is eroded and dilated to obtain the corner coordinates of each document cell. This removes the restriction that document type places on document data extraction and allows document cells to be extracted from various kinds of documents. On this basis, the text content of each text area is backfilled using the corner coordinates of each text area to obtain the document data corresponding to the document image. The method thereby overcomes the poor compatibility, the inability to adapt to documents of varying form and the poor extraction effect of conventional document data extraction methods; it extracts various types of document data, improves extraction accuracy and efficiency, is easy to implement and deploy, and offers strong practicability and good compatibility.

Based on the above embodiment, step 130 includes:

generating a single-channel image based on the corner coordinates of the document frame in the document image;

determining a mask map of the document image based on the corner coordinates of each text region and the single-channel image;

determining a target size based on the image size of the mask map;

and eroding and dilating the mask map with a convolution kernel of the target size to obtain the corner coordinates of the document cells in the document image.

Specifically, in step 130, a mask map of the document image is generated according to the corner coordinates of each text region, and the mask map is corroded and expanded to obtain the corner coordinates of the document cells in the document image, which is essentially an image morphology method, and the position information of each document cell is found through the position information (corner coordinates) of each text region, which specifically includes:

First, a mask map corresponding to the document image is generated from the position information of each text region. In the mask map, the areas corresponding to text regions are black (pixel value 0) and the areas corresponding to non-text regions are white (pixel value 255).

Specifically, a single-channel image is generated from the corner coordinates of the document frame in the document image: an image of the document's size with every pixel set to 255, that is, a pure white single-channel image the same size as the document.

The pure white single-channel image is then processed so that the pixel values inside the area bounded by the corner coordinates of each text region are set to zero. That is, the corner coordinates of each text region identify the corresponding area in the single-channel image, and setting the pixels of that area to zero fills it with black, which yields the mask map.
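The mask construction just described can be sketched in a few lines. This is an illustrative version using nested lists rather than an image library; the function name `make_mask` and the inclusive `(left, top, right, bottom)` box format are assumptions, and text regions are taken to be axis-aligned.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # inclusive (left, top, right, bottom)


def make_mask(width: int, height: int, text_boxes: List[Box]) -> List[List[int]]:
    """Start from a pure white single-channel image and blacken every text region."""
    mask = [[255] * width for _ in range(height)]
    for left, top, right, bottom in text_boxes:
        # Clip each box to the image so out-of-range corners cannot raise errors.
        for y in range(max(top, 0), min(bottom, height - 1) + 1):
            for x in range(max(left, 0), min(right, width - 1) + 1):
                mask[y][x] = 0
    return mask


mask = make_mask(width=8, height=5, text_boxes=[(1, 1, 2, 1), (5, 3, 6, 3)])
for row in mask:
    print(row)
```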

Then, the mask map can be corroded and expanded to obtain frame line information of the document in the document image, and then the intersection point of the document frame line and the coordinates of the intersection point can be determined according to the frame line information, so that the corner point coordinates of each document cell are obtained.

After the mask map is obtained, it can be eroded (corroded) with a convolution kernel so that the frame lines of the document are determined in the mask map, which yields the frame line information of the document in the document image.

Specifically, the size of the convolution kernel used for the erosion operation is determined first. To ensure the accuracy of the resulting document frame lines, the embodiment of the present application uses a convolution kernel matched to the size of the document. Because the mask map has the same size as the document, the kernel size can be derived from the size of the mask map; that is, the size of the mask map serves as the target size, and the mask map is eroded with a convolution kernel of the target size. Concretely, the mask map is eroded with a kernel as long as the mask map's length and with a kernel as long as the mask map's width, which yields the horizontal and vertical frame lines of the document respectively. On this basis, OpenCV is then used to determine the corner positions (corner coordinates) of each document cell, that is, the coordinates of the four corner points of each document cell.

For example, when the mask map has a length W and a width H, eroding the mask map with a convolution kernel of length W yields the horizontal frame lines of the document, and eroding it with a convolution kernel of length H yields the vertical frame lines. Superimposing the horizontal and vertical frame lines gives the complete frame lines of the document, that is, the frame line information of the document.
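The effect of full-length kernels can be illustrated with a minimal pure-Python sketch. It assumes that pixels outside the image count as black during erosion, so a kernel as long as the mask fits only on rows or columns that are entirely white; erosion followed by dilation then keeps exactly those rows and columns. The function names are hypothetical, and a production system would more likely call OpenCV's morphology routines, whose border handling may differ.

```python
from typing import List

Mask = List[List[int]]


def erode_line(line: List[int], k: int) -> List[int]:
    """Erosion: a pixel stays 255 only if the k-long kernel centered on it
    lies fully inside the line and covers only 255 pixels."""
    n, half = len(line), k // 2
    out = []
    for i in range(n):
        lo, hi = i - half, i - half + k - 1
        fits = lo >= 0 and hi < n and all(v == 255 for v in line[lo:hi + 1])
        out.append(255 if fits else 0)
    return out


def dilate_line(line: List[int], k: int) -> List[int]:
    """Dilation: a pixel becomes 255 if the kernel placed on any 255 pixel covers it."""
    n, half = len(line), k // 2
    out = []
    for j in range(n):
        lo, hi = max(j + half - k + 1, 0), min(j + half, n - 1)
        out.append(255 if 255 in line[lo:hi + 1] else 0)
    return out


def open_lines(lines: Mask) -> Mask:
    """Erode then dilate every line with a kernel as long as the line itself."""
    return [dilate_line(erode_line(line, len(line)), len(line)) for line in lines]


def frame_lines(mask: Mask) -> Mask:
    horizontal = open_lines(mask)                                 # kernel length W
    columns = [list(col) for col in zip(*mask)]
    vertical = [list(row) for row in zip(*open_lines(columns))]  # kernel length H
    # Superimpose horizontal and vertical frame lines.
    return [[max(h, v) for h, v in zip(h_row, v_row)]
            for h_row, v_row in zip(horizontal, vertical)]


# 7 x 5 mask: three cells hold text (0), the top-right cell is empty.
mask = [
    [255, 255, 255, 255, 255, 255, 255],
    [255, 0, 0, 255, 255, 255, 255],
    [255, 255, 255, 255, 255, 255, 255],
    [255, 0, 0, 255, 0, 0, 255],
    [255, 255, 255, 255, 255, 255, 255],
]
lines = frame_lines(mask)
```

In this example the empty top-right cell is still enclosed by frame lines, because its row and its columns each contain text elsewhere and so do not survive the erosion.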

With the method based on image morphology, the embodiment of the application extracts the corner coordinates of each document cell from the corner coordinates of each text region. This allows document cells to be extracted from various kinds of documents, removes the restriction that document type places on cell extraction and eliminates the dependence on document contours. The positions of the document cells depend only on the positions of the text regions, not on the document type: however varied the frame lines of a document are, its structure can be analyzed to obtain the frame line information and determine the corner coordinates of each document cell. The method therefore offers strong adaptability and compatibility and improves the accuracy and feasibility of document data extraction.
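Once a frame line image is available, the cell corners follow from the intersections of the horizontal and vertical lines. The sketch below assumes a regular grid without merged cells and represents each thick line by its center index; the function names `band_centers` and `cell_corners` and the sample data are hypothetical, and the text itself only states that OpenCV is used for this step.

```python
from typing import Iterable, List, Tuple

Point = Tuple[int, int]


def band_centers(flags: Iterable[bool]) -> List[int]:
    """Collapse each run of consecutive True flags into the run's center index."""
    centers, start = [], None
    for i, flag in enumerate(list(flags) + [False]):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            centers.append((start + i - 1) // 2)
            start = None
    return centers


def cell_corners(lines: List[List[int]]) -> List[List[Point]]:
    """Return four corners per cell: top-left, top-right, bottom-right, bottom-left."""
    row_flags = [all(v == 255 for v in row) for row in lines]
    col_flags = [all(v == 255 for v in col) for col in zip(*lines)]
    ys, xs = band_centers(row_flags), band_centers(col_flags)
    cells = []
    for top, bottom in zip(ys, ys[1:]):
        for left, right in zip(xs, xs[1:]):
            cells.append([(left, top), (right, top), (right, bottom), (left, bottom)])
    return cells


# Frame line image (255 = frame line) for a 2 x 2 grid of cells.
lines = [
    [255, 255, 255, 255, 255, 255, 255],
    [255, 0, 0, 255, 0, 0, 255],
    [255, 255, 255, 255, 255, 255, 255],
    [255, 0, 0, 255, 0, 0, 255],
    [255, 255, 255, 255, 255, 255, 255],
]
cells = cell_corners(lines)
```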

Based on the above embodiment, step 140 includes:

matching each text region with each document cell based on the corner coordinates of each text region and the corner coordinates of each document cell to obtain the corresponding relation between each text region and each document cell;

and placing the text content into each document cell based on the corresponding relation and the text content of each text region, to obtain the document data corresponding to the document image.

Specifically, in step 140, the process of determining the document data corresponding to the document image according to the corner coordinates of each document cell, the corner coordinates of each text region, and the text content includes:

it can be understood that, after document structure analysis by the image morphology method has yielded the corner coordinates of each document cell, the document in the document image can be extracted by combining them with the corner coordinates and the text content of each text region.

Specifically, the text content of each text region is determined first: text recognition is performed on each text region to obtain its text content, which can be done by a text recognition model among deep learning models. The matching relationship between each text region and each document cell is then determined according to the corner coordinates of each text region and the corner coordinates of each document cell, so that the text content can be filled into the corresponding document cell according to the matching relationship, finally generating the extracted document.

That is, each text region and each document cell may be matched with reference to the corner coordinates of each text region and the corner coordinates of each document cell, so as to determine the document cell corresponding to each text region. In other words, the correspondence between each text region and each document cell is obtained through region matching based on the corner coordinates.

Here, the correspondence makes clear which document cell each text region corresponds to; in other words, it determines which document cell the text content of each text region should be filled into, so that the subsequent text backfilling process can proceed in an orderly manner.

Then, according to the corresponding relation and the text content of each text area, determining the document data corresponding to the document image, namely, carrying out text content placement according to the corresponding relation, thereby obtaining the document data corresponding to the document image.

Based on the above embodiment, the correspondence between any text region and document cell is determined based on the following steps:

determining the area of an overlapping area between the text area and each document cell based on the corner coordinates of the text area and the corner coordinates of each document cell;

and screening an overlapping region with the largest area from the overlapping regions as a target overlapping region, and taking a document cell corresponding to the target overlapping region as a document cell corresponding to the text region to obtain the corresponding relation between the text region and the document cell.

Specifically, the process of determining the correspondence between any text region and a document cell includes:

in the embodiment of the application, an intersection method is adopted to match based on the principle of maximum intersection so as to obtain the corresponding relation between each text region and each document cell.

Based on this, in the embodiment of the present application, the corner coordinates of the text region and the corner coordinates of each document cell may be used to calculate the overlapping region, so as to determine whether the text region overlaps each document cell. If there is an overlapping region, its area is calculated; if there is none, the area is recorded as 0. In this way the area of the overlapping region between the text region and each document cell is obtained.

And then, according to the principle of maximum intersection, selecting the overlapping area with the largest area from the overlapping areas as a target overlapping area, taking the document cell corresponding to the target overlapping area as the document cell corresponding to the text area, and in short, selecting the document cell with the largest intersection as the document cell corresponding to the text area, namely the document cell to be filled in by the text content of the text area. After the corresponding relation between each text region and each document cell is obtained, text backfilling can be carried out from top to bottom according to the corresponding relation so as to fill each text content into the corresponding document cell, thereby obtaining document data corresponding to the document image.
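As a non-limiting illustration, the maximum-intersection matching and the top-to-bottom backfilling may be sketched as follows. The sketch assumes axis-aligned boxes given as `(x_min, y_min, x_max, y_max)`. Returning `None` for a text region that overlaps no cell, and joining several texts that fall into the same cell with a space, are assumptions of this sketch rather than features stated in the application.

```python
from typing import Dict, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def overlap_area(a: Box, b: Box) -> float:
    """Area of the intersection of two boxes, or 0 if they do not overlap."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return width * height if width > 0 and height > 0 else 0.0


def match_cell(text_box: Box, cells: List[Box]) -> Optional[int]:
    """Index of the cell with the largest overlap, or None if nothing overlaps."""
    areas = [overlap_area(text_box, cell) for cell in cells]
    if not areas or max(areas) == 0:
        return None
    return areas.index(max(areas))


def backfill(texts: List[Tuple[Box, str]], cells: List[Box]) -> Dict[int, str]:
    """Fill text content into cells, visiting text regions from top to bottom."""
    filled: Dict[int, str] = {}
    for box, content in sorted(texts, key=lambda t: (t[0][1], t[0][0])):
        index = match_cell(box, cells)
        if index is None:
            continue  # assumption: unmatched text is skipped
        filled[index] = f"{filled[index]} {content}" if index in filled else content
    return filled
```

For example, a text region that straddles the border between two cells is assigned to the cell that contains the larger part of it.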

In the method provided by the embodiment of the application, the document cell corresponding to each text region is determined according to the principle of maximum intersection, and text backfilling is carried out from top to bottom to obtain the document data corresponding to the document image. This guarantees the accuracy of text placement and greatly improves the efficiency of text backfilling, and the efficiency gain is more pronounced on larger documents. In addition, this text backfilling approach is fast, consumes few resources, and is highly operable and practical.

Based on the above embodiment, step 110 includes:

acquiring an initial document image;

performing seal detection on the initial document image to obtain a seal area and seal color in the initial document image;

and carrying out channel filtration on a seal area in the initial document image based on seal color to obtain the document image.

Specifically, in step 110, the process of obtaining the document image to be extracted specifically includes:

most documents, particularly those containing reports, are stamped with the seal of an enterprise to ensure the validity of the information; that is, most documents bear a bright red or blue seal. The presence of the seal affects not only the analysis of the document structure but also text recognition, which reduces the accuracy of the recognized text content.

In view of this, in the embodiment of the present application, an initial document image is acquired first when obtaining the document image to be extracted. The initial document image can be understood as an unprocessed document image, that is, either a document image uploaded directly by the user and received by the document data extraction system, or an image containing a document that is parsed and separated directly from a scan file uploaded by the user.

The seal in the initial document image can then be erased or faded to obtain a document image in which the seal is removed or faded. To this end, seal detection is first performed on the initial document image to detect the area where the seal is located and the color of the seal, yielding the detection result, namely the seal area and the seal color.

Specifically, the stamp in the initial document image may be detected by using the target detection model, so as to obtain the stamp color (red or blue) and the stamp position (upper left corner coordinate and lower right corner coordinate) output by the target detection model, and determine the stamp area according to the stamp color (red or blue) and the stamp position, where the target detection model may be a general high-performance detection model, for example, an open source model yolov8.

Channel filtering is then performed on the seal area to erase or fade the seal, giving the document image after seal erasure, that is, the document image to be extracted. In other words, the color of the seal is filtered out by a channel filtering method so that the seal is erased or faded.

Specifically, color filtering may be applied to the seal area. If the seal color is red, the G and B channels are filtered out and only the R channel is retained; correspondingly, if the seal color is blue, the R and G channels are filtered out and only the B channel is retained (an image can be represented by the three channels R, G and B). This yields a filtered single-channel region image corresponding to the seal area. The single-channel region image is then restored to three channels (R, G, B) and used to replace the original seal area. In other words, the seal area in the initial document image is backfilled with the region image that has been channel-filtered and restored to three channels, giving the document image to be extracted.
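As a non-limiting illustration, the channel filtering may be sketched as follows, assuming that the image is a nested list of `(R, G, B)` tuples and that the seal area is an axis-aligned box with exclusive maximum coordinates. Restoring three channels by copying the retained channel into all three is one plausible reading of the step and is an assumption of this sketch; the function name is hypothetical.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each in 0..255
Image = List[List[Pixel]]


def filter_seal(image: Image, box: Tuple[int, int, int, int], seal_color: str) -> Image:
    """Return a copy of the image with the seal area reduced to one channel.

    box is (x_min, y_min, x_max, y_max) with exclusive maxima.
    For a red seal only R is kept; for a blue seal only B is kept.
    """
    channel = {"red": 0, "blue": 2}[seal_color]
    x_min, y_min, x_max, y_max = box
    out = [list(row) for row in image]  # leave the input untouched
    for y in range(y_min, y_max):
        for x in range(x_min, x_max):
            value = image[y][x][channel]
            out[y][x] = (value, value, value)  # restore to three channels
    return out
```

Red ink has a high R value, so in the R channel it becomes nearly as bright as the paper, while dark text stays dark because all of its channel values are low.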

In the document image after seal erasure, the seal is essentially erased or noticeably faded, so that its influence on the subsequent document structure analysis and text recognition is avoided as far as possible.

In the embodiment of the application, the seal that the initial document image may carry is taken into account from the standpoint of practical deployment. A document image with the seal faded or erased is obtained, and document data extraction is performed on that image. This avoids document cell extraction errors and text recognition errors caused by the seal, and safeguards the accuracy of document structure analysis and of subsequent text recognition during document data extraction.

Based on the above embodiment, based on the seal color, channel filtering is performed on the seal area in the initial document image to obtain the document image, including:

based on seal color, carrying out channel filtration on seal areas in the initial document image to obtain a target document image;

carrying out document detection on the target document image to obtain the original corner coordinates of the document area in the target document image;

determining target corner coordinates based on the original corner coordinates, and performing perspective transformation on the document area in the target document image based on the target corner coordinates;

and performing direction recognition on the document area obtained by perspective transformation, and performing angle correction on the document area obtained by perspective transformation based on the recognition result, to obtain the document image.

In actual operation, improper shooting, faulty scanning equipment and similar problems can leave the acquired document image distorted or with a cluttered background. In other words, it is unavoidable that abnormal conditions during scanning or shooting produce document images that are blurred, skewed or wrongly cropped. These problems make subsequent document data extraction harder, reduce the accuracy of document structure analysis and text recognition, and lead to poor extraction results.

Based on the above, in the embodiment of the application, after the seal is erased or faded, the resulting document image needs further processing to correct distortion and remove the cluttered background, thereby obtaining the final document image to be extracted.

Specifically, channel filtering can be applied to the seal area in the initial document image according to the seal color to filter out the color of the seal, so that the seal is erased or faded. The resulting document image is the target document image.

Then, document detection can be performed on the target document image to locate the document region therein, the corner positions of the document region, namely the coordinates of four corners of the document region, are determined, and the coordinates of the four corners detected at this time are regarded as original corner coordinates of the document region.

Specifically, a file is generally composed of a header, a footer, text, images, documents and the like, and a document image separated from such a file often still carries some of these other components. Therefore, in the embodiment of the present application, in order to remove background interference, document detection may be performed on the target document image obtained in the previous step to determine the document area in it and to extract the coordinates of the four corner points of the document area. Specifically, a corner detection model may be used to perform document detection on the target document image and return the coordinates of the four corner points of the document in the image (in clockwise order starting from the upper-left corner), which are used as the original corner coordinates.

The backbone and the neck of the corner detection model use the original structure of yolov8-tiny, but the detection head is adjusted: instead of regressing the center coordinates and the length and width values, it regresses the coordinates of the four corners of the document. This is because the subsequent distortion correction requires the coordinates of the four corners of the document area rather than its center coordinates, so regressing the corners directly ensures that distortion correction can be carried out. The loss function of the adjusted corner detection model uses Wing Loss and binary cross entropy loss (BCE Loss).
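As a non-limiting illustration, the two loss terms named above may be sketched in their commonly published forms. The parameter values `w` and `eps`, the averaging over coordinates, and the function names are illustrative assumptions. The application does not state which outputs each loss is applied to or how the two terms are weighted, so the sketch leaves them uncombined.

```python
import math
from typing import Sequence


def wing_loss(pred: Sequence[float], target: Sequence[float],
              w: float = 10.0, eps: float = 2.0) -> float:
    """Mean Wing Loss over coordinates.

    For an error d = |pred - target|:
      w * ln(1 + d / eps)  if d < w   (amplifies small errors)
      d - c                otherwise  (behaves like L1 for large errors)
    where c = w - w * ln(1 + w / eps) joins the two pieces continuously.
    """
    c = w - w * math.log(1.0 + w / eps)
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += w * math.log(1.0 + d / eps) if d < w else d - c
    return total / len(pred)


def bce_loss(prob: float, label: float) -> float:
    """Binary cross entropy for one predicted probability and a 0/1 label."""
    p = min(max(prob, 1e-7), 1.0 - 1e-7)  # clamp to avoid log(0)
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))
```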

Then, document distortion correction can be performed according to the extracted original corner coordinates to correct the deformation of the area and obtain a corrected document image. Specifically, the original corner coordinates are first used to determine the corresponding target corner coordinates: with reference to the original corner coordinates, the maximum extent in each direction is taken to determine the smallest rectangle that can enclose the document area, and the corner coordinates of this rectangle, with its upper-left corner at (0, 0), are taken as the target corner coordinates. The document area is then corrected according to the target corner coordinates by means of a perspective transformation, which yields the extracted, corrected document area. The frame lines of the extracted document area are horizontal and vertical; the perspective transformation corrects the distortion and eliminates the interference caused by the deformation.
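As a non-limiting illustration, the determination of the target corner coordinates may be sketched as follows, assuming that the four original corners are given as `(x, y)` pairs in clockwise order starting from the upper-left corner. The function name is hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def target_corners(original: List[Point]) -> List[Point]:
    """Corners of the smallest axis-aligned rectangle enclosing the document.

    The rectangle is shifted so that its upper-left corner is (0, 0), and the
    corners are returned clockwise starting from the upper-left corner.
    """
    xs = [x for x, _ in original]
    ys = [y for _, y in original]
    width = max(xs) - min(xs)
    height = max(ys) - min(ys)
    return [(0.0, 0.0), (width, 0.0), (width, height), (0.0, height)]
```

The pairs of original and target corners can then be passed to any perspective transformation routine; with OpenCV, for instance, `cv2.getPerspectiveTransform` followed by `cv2.warpPerspective` would be one option.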

In the embodiment of the application, corner detection is performed with the improved target detection model to obtain the original corner coordinates, from which the target corner coordinates are determined. Perspective transformation can then be performed based on the target corner coordinates to obtain a corrected image, which eliminates the interference of image deformation with the subsequent document data extraction process and improves the accuracy and efficiency of document data extraction.

Further, because of shooting errors, document size and similar factors, the document in the acquired document image may not be upright (0 degrees) but rotated by a certain angle (for example, an oversized document may be placed rotated by 90 degrees). To further improve the efficiency and accuracy of subsequent text recognition, this rotation also needs to be corrected.

Specifically, after obtaining the document area after perspective transformation, that is, the document area corrected and extracted through perspective transformation, direction recognition can be performed on the document area to identify the rotation angle of the text in the document area, so as to obtain a recognition result.

The angle detection model is a classification model, for example a MobileNetV3 model, and the loss function used in model training is a cross entropy loss function with label smoothing.
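As a non-limiting illustration, a cross entropy loss with label smoothing may be sketched as follows for a single sample. The smoothing factor of 0.1 and the convention of spreading it uniformly over all classes are assumptions of this sketch, as the application does not specify them.

```python
import math
from typing import List


def smoothed_cross_entropy(logits: List[float], label: int,
                           smoothing: float = 0.1) -> float:
    """Cross entropy against a smoothed one-hot target.

    The target puts (1 - smoothing) on the true class and spreads
    `smoothing` uniformly over all classes, including the true one.
    """
    k = len(logits)
    peak = max(logits)  # subtract the maximum for numerical stability
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    log_probs = [x - log_z for x in logits]
    targets = [smoothing / k] * k
    targets[label] += 1.0 - smoothing
    return -sum(t * lp for t, lp in zip(targets, log_probs))
```

With four classes, the smoothed target for the true class is 0.925 and each of the other three classes receives 0.025, which discourages over-confident predictions.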

After the rotation angle output by the model (one of 0, 90, 180 and 270 degrees) is obtained, angle correction can be performed on the document area obtained by perspective transformation. Because the rotation angle detected by the model is a clockwise rotation angle, the document area is rotated counterclockwise by the same angle to realize the angle correction. The corrected document area is then used as the final document image to be extracted.
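As a non-limiting illustration, the angle correction may be sketched as follows, assuming that the image is a nested list of rows and that the detected angle is one of the four values above. The function names are hypothetical.

```python
from typing import List

Grid = List[List[int]]


def rotate_ccw_90(image: Grid) -> Grid:
    """Rotate the image by 90 degrees counterclockwise."""
    return [list(row) for row in zip(*image)][::-1]


def correct_angle(image: Grid, clockwise_angle: int) -> Grid:
    """Undo a detected clockwise rotation by rotating counterclockwise."""
    if clockwise_angle not in (0, 90, 180, 270):
        raise ValueError("angle must be one of 0, 90, 180, 270")
    for _ in range(clockwise_angle // 90):
        image = rotate_ccw_90(image)
    return image
```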

Based on the above embodiment, after the document data corresponding to the document image is determined, the method further includes:

displaying the document image and the document on the same screen;

and, when a proofreading operation on any document cell in the displayed document is received, jumping from that document cell to the corresponding document cell in the document image based on the corner coordinates of each document cell.

Specifically, after document data extraction is completed, manual verification is often required to ensure that the extracted document is consistent with the document in the original document image; only then is the extraction and entry of the document from the non-editable file complete and the document digitized. However, current processing of documents in non-editable files covers only the extraction of document data and omits post-processing. When faced with huge volumes of document data, the subsequent manual verification often consumes a great deal of time and effort, which lowers the efficiency of document entry and slows the process. In short, the lack of a suitable proofreading mechanism for manual proofreading after extraction makes verification very time-consuming.

Based on the above, in the embodiment of the application, after the document data corresponding to the document image is extracted, the document image and the document are displayed on the same screen, so that a proofreading person can see the displayed document image and the extracted document on the same display screen, and the proofreading person can check and correct the information in the extracted document conveniently.

Further, after the document image and the document are displayed, if a proofreader needs to confirm whether the text content of any document cell in the extracted document is wrong, the proofreader can input a proofreading operation, which may be any of a click operation, a slide operation, a check operation and the like. The document data extraction system receives the proofreading operation and performs a jump accordingly. Because the corner coordinates of each document cell were obtained during document data extraction, the jump can use these corner coordinates as a reference: the system jumps from the document cell on which the proofreader input the proofreading operation to the corresponding document cell in the document image. The proofreader can then compare the contents of the document cell before and after the jump to confirm whether the recognized text content is wrong, which completes the proofreading of that document cell.

According to the embodiment of the application, after the document in the document image is obtained through document data extraction, a display mode convenient for proofreaders is provided: the document image and the extracted document are displayed on the same screen, and when a proofreading operation is received, the display jumps from the current proofreading position to the corresponding document cell in the document image. This helps proofreaders quickly locate each document cell, check the recognized content in each cell, confirm whether errors exist and make the corresponding adjustments, so that documents can be entered quickly while the accuracy and validity of the entered document information are ensured.

The following will exemplify the above procedure by taking the document data extraction system processing flow as an example:

Fig. 3 is an exemplary diagram of a document image provided by the present application, and Fig. 4 is an exemplary diagram of the document data corresponding to that document image. As shown in Fig. 3 and Fig. 4, the document data corresponding to the document image can be obtained through document image preprocessing (seal erasure, distortion correction and angle correction), text detection on the document image, structural analysis based on image morphology, and text backfilling based on the maximum intersection principle, which realizes the conversion from a document image to structured document data. In practical deployment, however, there is a serious usability problem: the recognized document must still go through a manual review. If the process merely converts the document in the document image into Excel, the review burden on proofreaders remains heavy, especially for documents with a huge amount of data, such as enterprise annual reports and enterprise reports, where proofreaders must compare the document cells one by one, which is laborious and highly error-prone.

Based on this, in order to achieve real cost reduction and efficiency gains, an effective post-processing mechanism is proposed. In short, the manual workload is truly reduced only when both the recognition and the proofreader's subsequent checking are made convenient. For example, suppose that manually extracting the document in a document image takes 30 minutes. If the document is extracted automatically and then checked manually, this can be reduced to 10 minutes; but if an effective proofreading mechanism makes the checking convenient, the whole process can be reduced to 1 minute. Real cost reduction and efficiency gains are thereby achieved, and the whole process delivers its maximum benefit.

That is, after the foregoing document data extraction, the document data corresponding to the document image is obtained, namely JSON information for the recognition result of each document image, which includes the corner coordinates of each document cell and the corresponding text content. The back end of the document data extraction system can then integrate the document image with the corresponding document and pass the integrated information to the front end, which displays it, for example in a split window with the document image on the left and the extracted document on the right (the document image shown in Fig. 3 may be displayed on the left of the display screen and the document shown in Fig. 4 on the right, and all the information recognized from the document image can be viewed by dragging the scroll bar), so that proofreaders can check it. After a proofreader clicks any document cell of the document in the front end, the display jumps to the corresponding document cell in the document image, so that the proofreader can compare the information before and after the jump, determine whether the recognized text content is wrong, and confirm or correct it.
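As a non-limiting illustration, the JSON information and the lookup behind the jump may be sketched as follows. The field names `cells`, `corners` and `text` are hypothetical, since the application states only that the JSON information contains the corner coordinates of each document cell and the corresponding text content.

```python
import json
from typing import Dict, List, Tuple


def build_result(cells: List[Dict]) -> str:
    """Serialize the recognition result of one document image as JSON."""
    return json.dumps({"cells": cells}, ensure_ascii=False)


def locate_cell(result_json: str, cell_index: int) -> Tuple[int, int, int, int]:
    """Bounding box (x_min, y_min, x_max, y_max) of a cell in the document image.

    A front end could scroll to and highlight this box when the proofreader
    clicks the cell with the same index in the extracted document.
    """
    cell = json.loads(result_json)["cells"][cell_index]
    xs = [x for x, _ in cell["corners"]]
    ys = [y for _, y in cell["corners"]]
    return (min(xs), min(ys), max(xs), max(ys))
```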

Among other things, it is noted that text content in each document cell in a displayed document may be edited, validated, and exported. In addition, text contents in each document cell in the displayed document can be distinguished and displayed, namely, a text region with lower confidence in text recognition can be highlighted when being displayed, so that a proofreading person can know that the text is in doubt at a glance and needs to check with emphasis, the proofreading time can be further saved, and the proofreading efficiency and the validity are ensured.
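As a non-limiting illustration, selecting the cells to highlight may be sketched as follows, assuming that each cell record carries a recognition confidence between 0 and 1. The `confidence` field and the threshold of 0.8 are hypothetical.

```python
from typing import Dict, List


def cells_to_highlight(cells: List[Dict], threshold: float = 0.8) -> List[int]:
    """Indices of cells whose recognition confidence is below the threshold."""
    return [i for i, cell in enumerate(cells) if cell["confidence"] < threshold]
```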

In the embodiment of the application, the pipeline is comprehensive and covers every aspect from document input to actual use. It effectively improves the accuracy, effectiveness and compatibility of document data extraction, simplifies the subsequent review work and makes checking convenient for proofreaders. It thus achieves cost reduction and efficiency gains in practical application, resolves the difficulty of putting document data extraction schemes into practice, can be deployed directly, is easy to implement, and can reduce the labor costs of enterprises, improve their operating efficiency and lower their operating risk.

FIG. 5 is a general flow chart of a document data extraction method provided herein, as shown in FIG. 5, the method comprising:

Step 400, obtaining an initial document image;

step 411, seal detection is carried out on the initial document image, and a seal area and seal color in the initial document image are obtained;

step 412, performing channel filtration on the stamp area in the initial document image based on the stamp color to obtain a target document image;

step 413, performing document detection on the target document image to obtain original corner coordinates of a document area in the target document image;

step 414, determining target corner coordinates based on the original corner coordinates, and performing perspective transformation on the document region in the target document image based on the target corner coordinates;

step 415, direction recognition is carried out on the document area obtained by perspective transformation, and angle correction is carried out on the document area obtained by perspective transformation based on the recognition result, so as to obtain a document image;

step 420, performing text detection on the document image to obtain text regions in the document image and corner coordinates of each text region;

step 431, generating a single-channel image based on the corner coordinates of the document frame in the document image;

step 432, determining a mask map of the document image based on the corner coordinates of each text region and the single-channel image;

step 433, determining a target size based on the image size of the mask map, and eroding and dilating the mask map with a convolution kernel of the target size to obtain the corner coordinates of the document cells in the document image;

step 441, matching each text region with each document cell based on the corner coordinates of each text region and the corner coordinates of each document cell, so as to obtain a corresponding relationship between each text region and each document cell;

step 442, text content is placed into each document cell based on the corresponding relationship and the text content of each text region, so as to obtain document data corresponding to the document image.

Wherein, the corresponding relation between any text area and the document cell is determined based on the following steps: determining the area of an overlapping area between the text area and each document cell based on the corner coordinates of the text area and the corner coordinates of each document cell; and screening an overlapping region with the largest area from the overlapping regions as a target overlapping region, and taking a document cell corresponding to the target overlapping region as a document cell corresponding to the text region to obtain the corresponding relation between the text region and the document cell.

After the document data corresponding to the document image is determined, the method further includes:

displaying the document image and the document on the same screen; and, when a proofreading operation on any document cell in the displayed document is received, jumping from that document cell to the corresponding document cell in the document image based on the corner coordinates of each document cell.

According to the method of the application, a mask map is generated from the corner coordinates of the text regions in the document image, and the mask map is eroded and dilated to obtain the corner coordinates of each document cell. This removes the restriction that the document type places on document data extraction and enables document cell extraction for many kinds of documents. On this basis, the text content of each text region is backfilled using the corner coordinates of each text region to obtain the document data corresponding to the document image. The method overcomes the defects of traditional schemes, in which document data extraction methods have poor compatibility, cannot adapt to documents of variable form and give poor extraction results. It realizes extraction of various types of document data, improves extraction accuracy and efficiency, is easy to implement and deploy, and is highly practical and compatible.

Fig. 6 illustrates a physical schematic diagram of an electronic device, as shown in fig. 6, which may include: processor 610, communication interface (Communications Interface) 620, memory 630, and communication bus 640, wherein processor 610, communication interface 620, and memory 630 communicate with each other via communication bus 640. Processor 610 may invoke logic instructions in memory 630 to perform a document data extraction method comprising: acquiring a document image to be extracted; text detection is carried out on the document image, so that text areas in the document image and corner coordinates of the text areas are obtained; generating a mask image of the document image based on the corner coordinates of each text region, and corroding and expanding the mask image to obtain the corner coordinates of the document cells in the document image; and determining document data corresponding to the document image based on the corner coordinates of each text area, the corner coordinates of each document cell and the text content of each text area.

Further, the logic instructions in the memory 630 may be implemented in the form of software functional units and, when sold or used as a stand-alone product, stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.

In another aspect, the present application also provides a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions which, when executed by a computer, cause the computer to perform the document data extraction method provided above, the method comprising: acquiring a document image to be extracted; performing text detection on the document image to obtain text areas in the document image and corner coordinates of each text area; generating a mask image of the document image based on the corner coordinates of each text area, and corroding and expanding the mask image to obtain the corner coordinates of the document cells in the document image; and determining document data corresponding to the document image based on the corner coordinates of each text area, the corner coordinates of each document cell and the text content of each text area.

In yet another aspect, the present application also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the document data extraction method provided above, the method comprising: acquiring a document image to be extracted; performing text detection on the document image to obtain text areas in the document image and corner coordinates of each text area; generating a mask image of the document image based on the corner coordinates of each text area, and corroding and expanding the mask image to obtain the corner coordinates of the document cells in the document image; and determining document data corresponding to the document image based on the corner coordinates of each text area, the corner coordinates of each document cell and the text content of each text area.

The system embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the embodiments without creative effort.

From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or of course by means of hardware. Based on this understanding, the foregoing technical solutions, in essence or in the part that contributes to the prior art, may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, and which includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform the methods described in the embodiments or in some parts of the embodiments.

Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents, and that such modifications and substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims

1. A document data extraction system, characterized in that the document data extraction system comprises an image acquisition unit, a text detection unit, a corrosion expansion unit, a document data extraction unit and a data management console; the data management console is respectively connected with the image acquisition unit, the text detection unit, the corrosion expansion unit and the document data extraction unit and is used for controlling and managing all units;

the text detection unit is used for: performing text detection on the document image to obtain text areas in the document image and corner coordinates of each text area;

the corrosion expansion unit is used for: generating a mask image of the document image based on the corner coordinates of each text region, and carrying out corrosion expansion on the mask image to obtain the corner coordinates of the document cells in the document image;

the document data extraction unit is used for: determining the document data corresponding to the document image based on the corner coordinates of each text region, the corner coordinates of each document cell and the text content of each text region.

2. A document data extraction method, characterized by comprising:

acquiring a document image to be extracted;

performing text detection on the document image to obtain text areas in the document image and corner coordinates of each text area;

generating a mask map of the document image based on the corner coordinates of each text region, and corroding and expanding the mask map to obtain the corner coordinates of the document cells in the document image;

and determining the document data corresponding to the document image based on the corner coordinates of each text region, the corner coordinates of each document cell and the text content of each text region.

3. The document data extraction method according to claim 2, wherein the generating a mask map of the document image based on the corner coordinates of each text region, and corroding and expanding the mask map to obtain the corner coordinates of the document cells in the document image, includes:

generating a single-channel image based on corner coordinates of a document frame in the document image;

determining a target size based on an image size of the mask map;

and corroding and expanding the mask map with a convolution kernel of the target size to obtain the corner coordinates of the document cells in the document image.
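The steps of claim 3 can be illustrated with a minimal pure-Python sketch. It reads "corroding and expanding" as erosion followed by dilation in standard image-morphology terms, which is an assumption about the order. The helper names (`fill_mask`, `kernel_size`, `erode`, `dilate`, `cell_boxes`), the `divisor` rule for deriving the kernel side from the image size, the axis-aligned `(x0, y0, x1, y1)` box format, and the use of connected-component bounding boxes to read off cell corners are all hypothetical choices made for illustration; the claims do not fix any of them.

```python
def fill_mask(width, height, boxes):
    """Single-channel mask: 1 inside any box, 0 elsewhere (x1/y1 exclusive)."""
    mask = [[0] * width for _ in range(height)]
    for x0, y0, x1, y1 in boxes:
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                mask[y][x] = 1
    return mask

def kernel_size(width, height, divisor=20):
    """Assumed rule: odd square kernel side derived from the shorter image side."""
    k = max(1, min(width, height) // divisor)
    return k if k % 2 == 1 else k + 1

def _morph(mask, k, reducer):
    # Apply min (erosion) or max (dilation) over a k x k window.
    # The window is clipped at the image border instead of being padded.
    h, w, r = len(mask), len(mask[0]), k // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            out[y][x] = reducer(
                mask[yy][xx]
                for yy in range(max(0, y - r), min(h, y + r + 1))
                for xx in range(max(0, x - r), min(w, x + r + 1)))
    return out

def erode(mask, k):
    return _morph(mask, k, min)

def dilate(mask, k):
    return _morph(mask, k, max)

def cell_boxes(mask):
    """Bounding box (x0, y0, x1, y1) of each 4-connected foreground blob."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for sy in range(h):
        for sx in range(w):
            if not mask[sy][sx] or seen[sy][sx]:
                continue
            stack, seen[sy][sx] = [(sx, sy)], True
            x0 = x1 = sx
            y0 = y1 = sy
            while stack:
                x, y = stack.pop()
                x0, x1, y0, y1 = min(x0, x), max(x1, x), min(y0, y), max(y1, y)
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if 0 <= nx < w and 0 <= ny < h and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        stack.append((nx, ny))
            boxes.append((x0, y0, x1 + 1, y1 + 1))
    return boxes

# Hypothetical input: one cell-sized box and a one-pixel speck of noise.
mask = fill_mask(20, 20, [(2, 2, 12, 8), (15, 15, 16, 16)])
k = kernel_size(20, 20, divisor=6)
cells = cell_boxes(dilate(erode(mask, k), k))
```

In this sketch the speck is smaller than the kernel and is removed by the erosion, while the larger rectangle is restored to its original extent by the dilation. A production system would more likely use an image-processing library for these operations.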

4. The document data extraction method according to claim 2, wherein the determining the document data corresponding to the document image based on the corner coordinates of the text regions, the corner coordinates of the document cells, and the text contents of the text regions includes:

matching each text region with each document cell based on the corner coordinates of each text region and the corner coordinates of each document cell to obtain a corresponding relation between each text region and each document cell;

and based on the corresponding relation and the text content of each text region, placing the text content of each text region into the corresponding document cell to obtain the document data corresponding to the document image.
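As a rough illustration of the placement step in claim 4, the sketch below takes an already computed correspondence and fills each cell with its text. The function name `place_text`, the dictionary form of the correspondence (text-region index to cell index), the top-to-bottom then left-to-right ordering, and the space separator are assumptions made for the example, not details stated in the claims.

```python
def place_text(correspondence, text_boxes, texts, cell_count):
    """Return one string per document cell, built from its matched text regions.

    correspondence: {text_region_index: cell_index}
    text_boxes:     (x0, y0, x1, y1) per text region, used only for ordering
    texts:          recognized text content per text region
    """
    grouped = [[] for _ in range(cell_count)]
    for region_idx, cell_idx in correspondence.items():
        x0, y0, _, _ = text_boxes[region_idx]
        # Sort key: top-to-bottom, then left-to-right (assumed reading order).
        grouped[cell_idx].append((y0, x0, texts[region_idx]))
    return [" ".join(text for _, _, text in sorted(items)) for items in grouped]

# Hypothetical data: two regions fall in cell 0, one in cell 1, cell 2 is empty.
boxes = [(10, 30, 60, 45), (110, 10, 150, 25), (10, 10, 60, 25)]
texts = ["Ltd.", "42", "Acme"]
document_data = place_text({0: 0, 1: 1, 2: 0}, boxes, texts, cell_count=3)
```

Cells with no matched text region come back as empty strings, so the result keeps one entry per cell.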

5. The document data extraction method according to claim 4, wherein the correspondence between any text region and a document cell is determined based on the steps of:

determining the area of an overlapping area between any text area and each document cell based on the corner coordinates of the any text area and the corner coordinates of each document cell;

and selecting an overlapping region with the largest area from the overlapping regions as a target overlapping region, and taking a document cell corresponding to the target overlapping region as a document cell corresponding to any text region to obtain a corresponding relation between any text region and the document cell.
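The largest-overlap rule of claim 5 can be sketched as follows. The sketch assumes that the corner coordinates of text regions and document cells have been reduced to axis-aligned rectangles `(x0, y0, x1, y1)`; the function names and the choice to leave a text region unmatched when it overlaps no cell are illustrative assumptions.

```python
def overlap_area(a, b):
    """Area of the intersection of two axis-aligned rectangles (0 if disjoint)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(0, width) * max(0, height)

def match_regions_to_cells(text_boxes, cell_boxes):
    """Map each text-region index to the index of the cell it overlaps most."""
    matches = {}
    for i, text_box in enumerate(text_boxes):
        areas = [overlap_area(text_box, cell) for cell in cell_boxes]
        if areas and max(areas) > 0:
            # index() returns the first cell with the largest overlap area.
            matches[i] = areas.index(max(areas))
    return matches

# Hypothetical layout: two side-by-side cells and three detected text regions.
cells = [(0, 0, 100, 50), (100, 0, 200, 50)]
regions = [(10, 10, 60, 40), (90, 10, 150, 40), (300, 300, 320, 320)]
matches = match_regions_to_cells(regions, cells)
```

The second region straddles the border between the two cells and is assigned to the right-hand cell, where most of its area lies; the third region overlaps nothing and stays unmatched.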

6. The document data extraction method according to any one of claims 2 to 5, wherein the acquiring a document image to be extracted includes:

acquiring an initial document image;

and performing channel filtering on the seal area in the initial document image based on the seal color to obtain the document image.
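One common reading of channel filtering by seal color is to keep only the color channel that matches the seal: a red seal is bright in the red channel, much like the white paper, so it fades into the background, while dark printed text stays dark. The sketch below shows that reading on plain RGB tuples. The function name `filter_seal`, the color-to-channel mapping, and the choice to filter the whole image rather than a detected seal area are assumptions for illustration only.

```python
# Index of each channel inside an (R, G, B) pixel tuple.
CHANNEL_INDEX = {"red": 0, "green": 1, "blue": 2}

def filter_seal(pixels, seal_color="red"):
    """Return a single-channel image holding only the channel named by seal_color.

    pixels: rows of (r, g, b) tuples with values in 0..255.
    """
    channel = CHANNEL_INDEX[seal_color]
    return [[pixel[channel] for pixel in row] for row in pixels]

# Hypothetical row: white paper, red seal ink, black printed text.
row = [(255, 255, 255), (220, 30, 40), (20, 20, 20)]
gray = filter_seal([row], seal_color="red")
```

In the resulting single-channel row the seal pixel has a value close to that of the paper, whereas the text pixel remains dark, so a later text detection step sees less interference from the seal.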

7. The document data extraction method according to claim 6, wherein the performing channel filtering on the seal area in the initial document image based on the seal color to obtain the document image includes:

performing channel filtering on the seal area in the initial document image based on the seal color to obtain a target document image;

carrying out document detection on the target document image to obtain original corner coordinates of a document area in the target document image;

and carrying out direction recognition on the document area obtained by perspective transformation, and carrying out angle correction on the document area obtained by perspective transformation based on a recognition result to obtain the document image.

8. The document data extraction method according to any one of claims 2 to 5, wherein, after the determining the document data corresponding to the document image, the method further includes:

displaying the document image and the document data on the same screen;

and when a checking operation on any document cell in the displayed document data is received, jumping from that document cell in the document data to the corresponding document cell in the document image based on the corner coordinates of each document cell.
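The jump in claim 8 relies only on the stored corner coordinates of each cell. A minimal sketch of one way to use them is to compute the scroll offset that centers the selected cell in the image viewer. The function name `scroll_target`, the viewport model, and the clamping behavior are hypothetical; the claims do not specify how the jump is rendered.

```python
def scroll_target(cell, viewport_w, viewport_h, image_w, image_h):
    """Top-left scroll offset that centers a cell (x0, y0, x1, y1) in the viewport.

    The offset is clamped so the viewport never scrolls past the image edges.
    """
    center_x = (cell[0] + cell[2]) // 2
    center_y = (cell[1] + cell[3]) // 2
    left = min(max(0, center_x - viewport_w // 2), max(0, image_w - viewport_w))
    top = min(max(0, center_y - viewport_h // 2), max(0, image_h - viewport_h))
    return left, top

# Hypothetical sizes: a 2000 x 3000 document image shown in an 800 x 600 viewer.
offset = scroll_target((900, 1400, 1100, 1460), 800, 600, 2000, 3000)
```

The viewer would then scroll to the returned offset and could highlight the rectangle given by the same corner coordinates.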

9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the document data extraction method of any one of claims 2 to 8 when executing the program.

10. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the document data extraction method according to any of claims 2 to 8.