CN111753727B - Method, apparatus, device and readable storage medium for extracting structured information

Info

Publication number
CN111753727B
CN111753727B
Authority
CN
China
Prior art keywords
image
processed
structured information
bill
information extraction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010588634.9A
Other languages
Chinese (zh)
Other versions
CN111753727A (en)
Inventor
冯博豪 (Feng Bohao)
庞敏辉 (Pang Minhui)
谢国斌 (Xie Guobin)
韩光耀 (Han Guangyao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202010588634.9A
Publication of CN111753727A
Application granted
Publication of CN111753727B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40: Document-oriented image-based pattern recognition
    • G06V30/41: Analysis of document content
    • G06V30/412: Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/23: Clustering techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/20: Image preprocessing
    • G06V10/26: Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267: Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/40: Extraction of image or video features
    • G06V10/46: Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462: Salient features, e.g. scale invariant feature transforms [SIFT]
    • G06V10/464: Salient features, e.g. scale invariant feature transforms [SIFT] using a plurality of salient features, e.g. bag-of-words [BoW] representations

Abstract

The embodiment of the application discloses a method, an apparatus, an electronic device and a computer-readable storage medium for extracting structured information, relating to the technical fields of deep learning, image processing, natural language processing and cloud computing. One embodiment of the method comprises the following steps: acquiring an image to be processed, and identifying the lineless table area (a table area without ruling lines) in the image to be processed; performing a semantic segmentation operation on the lineless table area using a DeepLab model that can extract multi-scale features for segmenting out text blocks, so as to obtain the segmented text blocks; and extracting the target structured information from the text blocks. The embodiment provides an automatic structured-information extraction scheme for itemized bills and other bills: in particular, for the lineless table area, using a DeepLab model that can extract multi-scale features to segment out the text blocks yields a better text-block segmentation result and improves the accuracy of the extracted structured information.

Description

Method, apparatus, device and readable storage medium for extracting structured information
Technical Field
Embodiments of the present application relate to the field of data processing, and in particular to the fields of image data processing and natural language processing.
Background
In reimbursement scenarios, consumption bills are handled frequently and the information on them must be recorded. As social activity increases rapidly, the number of bills increases with it, so how to enter bill and itemized-bill information into an electronic system quickly and accurately has become a key focus of research for those skilled in the art.
Conventionally, the various data in bills and itemized bills are entered manually.
Disclosure of Invention
Embodiments of the present application provide a method, apparatus, electronic device, computer-readable storage medium, and computer program product for extracting structured information.
In a first aspect, an embodiment of the present application proposes a method for extracting structured information, comprising: acquiring an image to be processed, and identifying the lineless table area in the image to be processed; performing a semantic segmentation operation on the lineless table area using the DeepLab model to obtain the segmented text blocks, the DeepLab model being able to extract multi-scale features for segmenting out each text block; and extracting the target structured information from the text blocks.
In a second aspect, an embodiment of the present application proposes an apparatus for extracting structured information, comprising: a lineless table area identification unit configured to acquire an image to be processed and identify the lineless table area in the image to be processed; a semantic segmentation operation execution unit configured to perform a semantic segmentation operation on the lineless table area using the DeepLab model, which can extract multi-scale features for segmenting out each text block, to obtain the segmented text blocks; and a target structured information extraction unit configured to extract the target structured information from the text blocks.
In a third aspect, an embodiment of the present application provides an electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor that, when executed, enable the at least one processor to implement the method for extracting structured information as described in any implementation of the first aspect.
In a fourth aspect, an embodiment of the present application provides a non-transitory computer-readable storage medium storing computer instructions that, when executed, enable a computer to implement the method for extracting structured information as described in any implementation of the first aspect.
In a fifth aspect, embodiments of the present application provide a computer program product comprising a computer program which, when executed by a processor, is capable of implementing a method for extracting structured information as described in any of the implementations of the first aspect.
The embodiments of the application provide a method, an apparatus, an electronic device, a computer-readable storage medium and a computer program product for extracting structured information. First, an image to be processed is acquired and the lineless table area in it is identified; then, a semantic segmentation operation is performed on the lineless table area using a DeepLab model that can extract multi-scale features for segmenting out text blocks, so as to obtain the segmented text blocks; finally, the target structured information is extracted from the text blocks. The technical solution provides an automatic structured-information extraction scheme for bills and itemized bills: in particular, for the lineless table area, using a DeepLab model that can extract multi-scale features to segment out the text blocks yields a better text-block segmentation result and improves the accuracy of the extracted structured information.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the detailed description of non-limiting embodiments, made with reference to the following drawings, in which:
FIG. 1 is an exemplary system architecture in which the present application may be applied;
FIG. 2 is a flow chart of one embodiment of a method for extracting structured information according to the present application;
FIG. 3 is a flow chart of another embodiment of a method for extracting structured information according to the present application;
FIG. 4 is a schematic diagram of the functional module architecture of an application scenario of a method for extracting structured information according to the present application;
FIG. 5 is an exemplary bill image;
FIG. 6 is a schematic representation of the bill image of FIG. 5 after processing by the erosion-dilation algorithm;
FIG. 7 is a schematic diagram of the text blocks in the bill image of FIG. 5 after the frame detection operation is performed;
FIG. 8 is a schematic view of the image of FIG. 6 after the row-column alignment operation is performed.
Detailed Description
The present application is described in further detail below with reference to the drawings and embodiments. It is to be understood that the specific embodiments described here merely illustrate the relevant invention and do not limit it. It should also be noted that, for convenience of description, only the portions related to the invention are shown in the drawings.
It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other. The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
FIG. 1 illustrates an exemplary system architecture 100 to which embodiments of the methods, apparatus, electronic devices, and computer-readable storage media for extracting structured information of the present application may be applied.
As shown in fig. 1, the system architecture 100 may include an image acquisition device 101, a network 102, and a server 103. The network 102 is a medium used to provide a communication link between the image capturing apparatus 101 and the server 103. Network 102 may include various connection types such as wired, wireless communication links, or fiber optic cables, among others.
A user may interact with the server 103 via the network 102 using the image acquisition device 101 to receive or send messages and the like. Various applications for communicating information between the image acquisition device 101 and the server 103 may be installed on them, such as a bill-uploading application, a structured information extraction application, an instant messaging application, and the like.
The image acquisition device 101 and the server 103 may be hardware or software. When the image capturing device 101 is hardware, it may be various electronic devices having a display screen and a camera, including but not limited to a smart phone, a tablet computer, a computer, and various independent camera devices, etc.; when the image capturing apparatus 101 is software, it may be installed in the above-listed electronic apparatus, which may be implemented as a plurality of software or software modules, or may be implemented as a single software or software module, which is not particularly limited herein. When the server 103 is hardware, it may be implemented as a distributed server cluster formed by a plurality of servers, or may be implemented as a single server; when the server 103 is software, it may be implemented as a plurality of software or software modules, or may be implemented as a single software or software module, which is not particularly limited herein.
The server 103 may provide various services through various built-in applications. Taking a bill-uploading application that can provide a structured information extraction service as an example, the server 103 can achieve the following effects when running that application: first, an image to be processed is acquired from the image acquisition device 101 through the network 102, and the server 103 identifies the lineless table area in the image to be processed; then, the server 103 performs a semantic segmentation operation on the lineless table area using a DeepLab model that can extract multi-scale features for segmenting out text blocks, obtaining the segmented text blocks; finally, the server 103 extracts the target structured information from the text blocks. That is, through the above processing steps the server 103 extracts the structured information contained in the input image to be processed and outputs the extracted target structured information as the result.
It should be noted that, besides being acquired from the image acquisition device 101 through the network 102, the image to be processed may be stored in the server 103 in advance in various ways. Thus, when the server 103 detects that such data is already stored locally (e.g., a pending structured-information extraction task whose image was retained before processing began), it may choose to obtain the data directly from local storage, in which case the exemplary system architecture 100 may omit the image acquisition device 101 and the network 102.
Since extracting structured information from the image to be processed requires considerable computing resources and computing power, the method for extracting structured information provided in the subsequent embodiments of the present application is generally performed by the server 103, which has stronger computing power and more computing resources, and accordingly the apparatus for extracting structured information is also generally disposed in the server 103. However, when the image acquisition device 101 also has the required computing power and resources, it may complete the above operations through the bill-uploading application installed on it and output the same result as the server 103. This applies especially when multiple image acquisition devices with different computing capabilities are present at the same time: when the bill-uploading application determines that the device it runs on has relatively strong computing power and relatively many idle computing resources, it can let that device perform the computation, appropriately relieving the computing pressure on the server 103. Accordingly, the apparatus for extracting structured information may also be provided in the image acquisition device 101, in which case the exemplary system architecture 100 may omit the server 103 and the network 102.
It should be understood that the number of image acquisition devices, networks and servers in fig. 1 is merely illustrative. There may be any number of image acquisition devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, there is shown an implementation flow 200 of one embodiment of a method for extracting structured information according to the present application, comprising the steps of:
step 201: acquiring an image to be processed, and identifying the lineless table area in the image to be processed;
this step aims at having the execution body of the method for extracting structured information (e.g., the server 103 shown in fig. 1) acquire the image to be processed and identify the lineless table area in it.
The image to be processed includes, but is not limited to, bill images, policy images and other images containing various tables. Tables come in two forms, wired tables and lineless tables: a wired table is formed by horizontal and vertical straight lines crossing into rectangles that can be filled with content, so its left-right and top-bottom structural relationships are displayed plainly; a lineless table lacks those ruling lines and therefore needs special handling when structured information is extracted.
Accordingly, to identify the lineless table areas in the image to be processed, they may be distinguished from wired table areas by whether crossing ruling lines exist. Of course, other characteristics that distinguish the two kinds of area may also be used: for example, the outermost border lines of a lineless table area usually extend beyond their crossings, whereas in a wired table area the border lines stop at the crossings to form a closed table; that is, whether a lineless table area exists can be judged by whether the border lines continue to extend after crossing.
Taking a bill image to be processed that contains only a wired table area and a lineless table area as an example, one implementation (among others) of identifying the lineless table area from the image to be processed comprises the following steps:
processing the bill image to be processed with an erosion-dilation algorithm to obtain the horizontal and vertical straight lines;
determining the areas of the bill image to be processed where horizontal and vertical straight lines intersect as the wired table area;
determining the areas of the bill image to be processed where horizontal and vertical straight lines do not intersect as the lineless table area.
This embodiment uses the erosion and dilation operations in the opencv library (a cross-platform computer vision library containing many general-purpose computer vision algorithms) to process the bill image to be processed, exploiting the principle of erosion and dilation to highlight the horizontal and vertical straight lines. After the bill image to be processed is binarized, the horizontal and vertical straight lines can be found by the erosion-dilation algorithm on the resulting image, which is either black or white. Since the bill image to be processed contains only a wired table area and a lineless table area, the two kinds of area can be determined in a binary fashion by whether horizontal and vertical straight lines intersect.
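The following is a minimal OpenCV sketch of this line-extraction step, not the patent's own code; the binarization parameters and the 40-pixel kernel lengths are illustrative assumptions.

```python
import cv2

def extract_table_lines(image_path):
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    # Binarize with an inverted adaptive threshold so dark lines become white on black.
    binary = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                                   cv2.THRESH_BINARY_INV, 15, -2)
    # Erode then dilate with a wide flat kernel: only long horizontal runs survive.
    h_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1))
    horizontal = cv2.dilate(cv2.erode(binary, h_kernel), h_kernel)
    # The same with a tall thin kernel keeps only vertical lines.
    v_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 40))
    vertical = cv2.dilate(cv2.erode(binary, v_kernel), v_kernel)
    # Pixels set in both maps are ruling-line crossings; areas containing such
    # crossings are wired-table candidates, areas without them are lineless.
    intersections = cv2.bitwise_and(horizontal, vertical)
    return horizontal, vertical, intersections
```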
It should be noted that the image to be processed may be obtained directly from a local storage device by the execution subject described above, or may be obtained from a non-local storage device (for example, the image capturing device 101 shown in fig. 1). The local storage device may be a data storage module, such as a server hard disk, disposed in the execution body, where the image to be processed may be quickly read locally; the non-local storage device may also be any other electronic device configured to store data, such as some user terminals, in which case the executing entity may obtain the image to be processed by receiving a request for extracting structured information containing the image to be processed from the electronic device.
Furthermore, the image to be processed shot by the image acquisition device is not free from deflection, and in order to avoid the influence of the problems on the structural information extracted later as much as possible, the tilting correction of the image to be processed can be performed by utilizing a Gliding Vertex algorithm and an RSDet algorithm after the image to be processed is acquired and before the wireless table area in the image to be processed is identified.
The Gliding Vertex algorithm and the RSDet algorithm are two algorithms used in the aspect of remote sensing target detection, and are essentially quadrilateral detectors which are designed for edge detection of a ground complex target by remote sensing of a remote satellite and have the capability of detecting the outline of the complex target. In order to better correct the inclined and inclined image to be processed to the correct direction, the application introduces a Gliding Vertex algorithm and an RSDet algorithm used by the method from the field of remote sensing target detection, so as to realize the aim of accurately determining the quadrangle of the edge of the image to be processed by means of the capability of detecting the quadrangle of the edge of the complex target, thereby realizing the integral and accurate correction effect.
Step 202: performing a semantic segmentation operation on the lineless table area using the DeepLab model to obtain the segmented text blocks;
On the basis of step 201, this step aims at having the execution body complete the semantic segmentation operation on the lineless table area using the DeepLab model, so as to obtain the segmented text blocks. The object of the semantic segmentation operation is the partial image corresponding to the lineless table area, and its result is an image block for each text block; that is, the segmentation is performed in units of text blocks, and each segmented image block is the image block in which a text block sits.
Common models for semantic segmentation are FCN (Fully Convolutional Network) and SegNet (Semantic Segmentation Network). FCN builds on CNN (Convolutional Neural Network) image classification by replacing the final fully connected layer with a convolutional layer, achieving pixel-level classification and serving as a basis for semantic segmentation; on top of FCN, SegNet uses deconvolution and unpooling operations to make the features extracted for classification more accurate, achieving a better segmentation effect than FCN.
However, existing FCN and SegNet models are generally used for semantic segmentation of images whose contents have clearly different visual features, for example separating a person from the background of a picture. The images targeted by the present application are bill images containing large amounts of structured and textual information; that is, the image to be processed contains many features at different scales, and the text features, which differ from one another only subtly, require finer feature extraction and recognition, so FCN and SegNet perform poorly here. For this reason, given that the image to be processed contains tables and large amounts of text within them, the DeepLab model, which handles multi-scale features and classification better, is selected so that its advantages can be used fully to meet the practical need of extracting structured information from bill images containing large amounts of table and text structure.
Compared with other models, the DeepLab model obtains a higher effective sampling rate by removing the downsampling and max pooling of the last few layers and using upsampling (atrous) filters, countering the loss of spatial resolution caused by the successive pooling operations of conventional classification CNNs and FCNs. By resampling the feature layers it obtains multi-scale image and text information, and by sampling with several parallel atrous convolution branches at multiple scales it addresses the scale-detection problem of conventional classification models. Specifically, across its V1, V2 and V3 versions the DeepLab model centers on the ASPP (Atrous Spatial Pyramid Pooling) structure: the initial ASPP performs atrous convolution at multiple scales and concatenates the results after a 1×1 convolution, realizing multi-scale feature extraction and finally obtaining both global and local features. In the V3 version, the improved ASPP consists of one 1×1 convolution and three 3×3 atrous convolutions, each with 256 filters and a BN (Batch Normalization) layer, together with image-level global average pooling of the features. In short, unlike conventional semantic segmentation models aimed mainly at visual image features, a DeepLab model that adopts upsampling filters, removes the last few layers of downsampling and max pooling, and uses the ASPP structure to extract and fuse multi-scale features is better suited to semantic segmentation of the images targeted by the present application, which contain large amounts of table and text structure, and its characteristics can be used fully to improve the accuracy of text-block segmentation in the lineless table area.
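As an illustration of this ASPP structure, the following is a condensed PyTorch sketch, not taken from the patent; the dilation rates (6, 12, 18) and the 256-channel width follow the commonly published DeepLabV3 configuration and are assumptions here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """One 1x1 convolution, three 3x3 atrous convolutions at different rates,
    and image-level global average pooling, each branch with 256 filters and
    a BN layer, concatenated and projected (the V3-style ASPP described above)."""
    def __init__(self, in_ch, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        def branch(k, d):
            return nn.Sequential(
                nn.Conv2d(in_ch, out_ch, k, padding=0 if k == 1 else d,
                          dilation=d, bias=False),
                nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.branches = nn.ModuleList([branch(1, 1)] + [branch(3, r) for r in rates])
        self.image_pool = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.project = nn.Sequential(
            nn.Conv2d(out_ch * 5, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [b(x) for b in self.branches]
        pooled = F.interpolate(self.image_pool(x), size=(h, w),
                               mode='bilinear', align_corners=False)
        # Concatenating the convolutional branches with the pooled branch
        # fuses global and local, multi-scale features.
        return self.project(torch.cat(feats + [pooled], dim=1))
```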
Step 203: extracting the target structured information from the text blocks.
Based on step 202, this step aims at having the execution body post-process each precisely segmented text block to extract the target structured information. The target structured information is obtained from the image block of each text block, and several processes are usually required, the most important being text content recognition and structured information extraction from the field of natural language processing, so that the recognized content of each text block is organized according to the correct structural relationships into valid target structured information.
The method for extracting structured information provides an automatic structured-information extraction scheme for itemized bills and other bills: in particular, for the lineless table area, using a DeepLab model that can extract multi-scale features to segment out the text blocks yields a better text-block segmentation result and improves the accuracy of the extracted structured information.
Based on the above embodiments, the present application further provides a flow 300 of another method for extracting structured information, comprising the following steps:
Step 301: acquiring an image to be processed, and identifying the lineless table area in the image to be processed;
step 301 corresponds to step 201 shown in fig. 2; for the identical content, refer to the corresponding part of the previous embodiment, which is not repeated here.
Step 302: performing a feature extraction operation on the lineless table area using the encoding module to obtain the first features;
step 303: pooling the first features with the atrous-convolution spatial pyramid pooling module to obtain the multi-scale features;
step 304: performing an upsampling operation on the multi-scale features using the decoding module, and taking each resulting segmented image as a text block;
aiming at a DeepLab model formed by an encoding module, an atrous-convolution spatial pyramid pooling module and a decoding module, steps 302 to 304 provide a specific implementation for segmenting out each text block, wherein the encoding module may adopt a CNN-based feature extraction model, and the atrous-convolution spatial pyramid pooling module may specifically include convolution kernels of several different specifications, so as to capture as many multi-scale features as possible; a condensed sketch of such a pipeline is given below.
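The patent's model is DeepLabV3+; torchvision ships DeepLabV3 (without the V3+ decoder refinement), which is used below as a stand-in, and the two-class setup (background vs. text block) and any fine-tuning on annotated bill images are assumptions.

```python
import torch
from torchvision import transforms
from torchvision.models.segmentation import deeplabv3_resnet50

# ResNet encoder + ASPP + upsampling head; 2 classes: background, text block.
model = deeplabv3_resnet50(num_classes=2)
model.eval()  # weights would come from fine-tuning on annotated bill images

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def segment_text_blocks(table_region):
    """table_region: a PIL image cropped to the lineless table area."""
    x = preprocess(table_region).unsqueeze(0)   # 1 x 3 x H x W
    with torch.no_grad():
        logits = model(x)['out']                # 1 x 2 x H x W, already upsampled
    return logits.argmax(1).squeeze(0).numpy()  # H x W mask; 1 marks text pixels
```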
Step 305: sequentially performing a frame detection operation, a row-column alignment operation and a text recognition operation on each text block to obtain the content of each text block;
Step 306: organizing the content of each text block to obtain the target structured information.
Steps 305 and 306 provide a scheme in which the frame detection, row-column alignment and text recognition operations are performed in sequence on each text block to complete the extraction of both the structure and the specific text content, after which the content of the text blocks is organized into the target structured information.
For ease of understanding, a specific implementation of the frame detection operation in step 305 is also presented here, including but not limited to the following steps:
obtaining the edge coordinates of each text block using connected-component analysis and the Canny edge detection algorithm;
determining the frame of the corresponding text block from the edge coordinates.
Connected-component analysis and the Canny edge detection algorithm are likewise general-purpose computer vision algorithms available in the opencv library and are used here to realize the edge detection.
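A minimal sketch of these two steps follows (Python with OpenCV assumed; the noise threshold is an illustrative assumption). Connected-component analysis splits the segmentation mask into one region per text block, and the frame of each block is read off the extremes of its edge coordinates.

```python
import cv2
import numpy as np

def text_block_frames(seg_mask):
    """seg_mask: 8-bit binary mask from the segmentation step, text-block
    pixels white (255) on black (0). Returns one frame per block."""
    n, labels, stats, _ = cv2.connectedComponentsWithStats(seg_mask, connectivity=8)
    frames = []
    for i in range(1, n):                      # label 0 is the background
        x, y, w, h, area = stats[i]
        if area < 20:                          # skip speckle noise (assumed threshold)
            continue
        frames.append((x, x + w, y, y + h))    # (x_min, x_max, y_min, y_max)
    return frames

def frame_from_edges(block_mask):
    """Equivalent frame for a single block via Canny edge pixels, taking the
    maxima and minima of the edge coordinates."""
    ys, xs = np.nonzero(cv2.Canny(block_mask, 50, 150))
    return int(xs.min()), int(xs.max()), int(ys.min()), int(ys.max())
```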
In addition to possessing all the beneficial effects of the above embodiment, this embodiment provides, through steps 302 to 304, a concrete and highly practicable way of semantically segmenting the lineless table area with the DeepLab model, and, through steps 305 and 306, a concrete way of completing the extraction of the target structured information from the text blocks. There is no causal or dependency relationship between the two improvements; either could form a separate embodiment together with the previous embodiment, and the present embodiment exists only as a preferred embodiment combining both.
Furthermore, bill images of the same type share common points in the structured-information extraction process, so extraction efficiency can be improved by consolidating the process that successfully extracts the structured information of an image to be processed of a given type into a structured information extraction template corresponding to that type.
One implementation (among others) may be:
acquiring an image to be processed, and judging whether a structured information extraction template corresponding to the type of the image to be processed is pre-stored;
if a structured information extraction template corresponding to the type of the image to be processed is pre-stored, calling the corresponding template to perform the structured information extraction operation on the image to be processed;
if no structured information extraction template corresponding to the type of the image to be processed is pre-stored, forming a new template for that type from the process by which its target structured information was obtained; a sketch of this dispatch logic is given below.
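The following sketch shows one way this dispatch might be orchestrated; classify_document, full_pipeline_extraction and the template_store interface are hypothetical names standing in for the components described above, not APIs defined by the patent.

```python
def extract_structured_info(image, template_store):
    doc_type = classify_document(image)        # e.g. text + image classification
    template = template_store.get(doc_type)    # None when no template is stored
    if template is not None:
        # A stored template already encodes the field layout: apply it directly.
        return template.apply(image)
    # No template yet: run the full extraction pipeline once, then persist the
    # resulting field layout as a new template for this document type.
    result = full_pipeline_extraction(image)
    template_store.save(doc_type, result.as_template())
    return result
```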
Furthermore, when determining whether the image to be processed matches any pre-stored structured information extraction template, the decision may be made with a text classification model built on the BERT model together with an image classification model built on the Inception model. The BERT model (Bidirectional Encoder Representations from Transformers) is chosen for text classification because, compared with conventional semantics-based text classification models, it masks a small fraction of words (replacing them with a mask token or, with lower probability, a random word) while training a bidirectional language model, strengthening its memory of context, and it adds a next-sentence-prediction loss; these two differences give the BERT model better semantic recognition and classification ability.
Similarly, the Inception model achieves the best possible image-classification performance within limited computing resources through continual improvement of the Inception structure, rather than through the usual routes of hardware upgrades or larger datasets. In general, the most direct way to improve network performance is to increase the depth and width of the network (depth being simply the number of layers, width the number of channels per layer), but this brings two drawbacks: 1) overfitting occurs easily, because as depth and width keep increasing, the number of parameters to be learned keeps increasing as well, and huge parameter sets overfit easily; 2) uniformly enlarging the network increases the computational cost. Biological neural connections, by contrast, are sparse, and if the probability distribution of a dataset can be described by a large, very sparse DNN (Deep Neural Network), the optimal network topology can be built layer by layer by analyzing the statistical properties of the activations of the preceding layer and clustering highly correlated output neurons. The Inception model follows this improved approach of introducing sparsity and converting the fully connected layers into sparse connections, thereby keeping filter-level sparsity while still exploiting the high computational performance of dense matrices, and avoiding the additional problems brought by the conventional ways of increasing performance.
In order to deepen understanding, the present application further provides, in combination with a specific application scenario, an itemized-bill electronic entry assistance system corresponding to the above method for extracting structured information, used to assist business personnel in entering itemized bills electronically.
The itemized-bill electronic entry assistance system consists of eight parts, namely: an image preprocessing module, a partition module, a lineless table processing module, a text detection and recognition module, a bill template matching module, a human-computer interaction interface, an information base and a storage module; its structural diagram is shown in fig. 4. These modules communicate with one another to share data, and the specific implementation of each functional module is described below with examples:
An image preprocessing module:
due to the shooting angle, the captured image may be tilted to some extent, or several bills may be pasted together in one image. In such cases, image segmentation and correction are required. The steps are as follows:
1) Complete the outer-frame detection of the bill using object detection, and crop the bill according to the outer-frame coordinates;
2) Perform tilt correction using the four corner coordinates of the detection frame.
The object detection here uses the Gliding Vertex and RSDet algorithms described above; both have shown remarkable effects in remote-sensing object detection and can accurately detect tilted targets. In this embodiment the networks of the two algorithms are trained with annotated training data so that they can detect tilted bills.
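Once the four corner coordinates of a tilted bill are available (e.g., from such a quadrilateral detector), the correction of step 2) can be done with a perspective warp. The sketch below is an assumed OpenCV implementation of that step, not code from the patent.

```python
import cv2
import numpy as np

def deskew_by_corners(image, corners):
    """corners: float32 array of shape (4, 2) ordered top-left, top-right,
    bottom-right, bottom-left. Warps the bill to an upright rectangle."""
    tl, tr, br, bl = corners
    width = int(max(np.linalg.norm(tr - tl), np.linalg.norm(br - bl)))
    height = int(max(np.linalg.norm(bl - tl), np.linalg.norm(br - tr)))
    dst = np.array([[0, 0], [width - 1, 0],
                    [width - 1, height - 1], [0, height - 1]], dtype=np.float32)
    M = cv2.getPerspectiveTransform(corners.astype(np.float32), dst)
    return cv2.warpPerspective(image, M, (width, height))
```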
Partition module:
this module partitions the itemized bill mainly by means of table detection and straight-line detection. The resulting areas are a horizontal key-value area (i.e., a part of the wired table area) and a lineless table area. The method mainly uses the erosion-dilation algorithm contained in the opencv library, through which the horizontal and vertical line segments can be obtained. An exemplary raw bill image is shown in fig. 5, and the bill image after processing by the erosion-dilation algorithm is shown in fig. 6. As can be seen, the coordinates of the straight lines are easily obtained from the binarized map shown in fig. 6, and those lines can then be used to partition the original image on the left into the horizontal key-value area and the lineless table area.
For the key-value area, the conventional text detection and recognition module is called directly to recognize the corresponding content and obtain the corresponding key-value pairs; for the lineless table area, the lineless table processing module described below is called to complete the extraction of the table_key and table_value entries.
A lineless table processing module:
this module completes the processing of the lineless table in the following three steps:
1) Extract the text blocks on the bill details using the semantic segmentation model. The model used here is DeepLabV3+ (an improved version of DeepLabV3), whose network is an encoder-decoder structure. The encoding module performs feature extraction with a DCNN (deep convolutional neural network), followed by the atrous-convolution spatial pyramid pooling module (i.e., the ASPP structure) for extracting and fusing the multi-scale features of the image. The decoding module obtains the segmentation result by upsampling. Because the DeepLabV3+ model introduces multi-scale information and, compared with other image segmentation models, further fuses the low-level features with the high-level features, the accuracy of boundary segmentation is greatly improved.
2) Obtain the frame of each text block using the semantic segmentation result.
the edge coordinates of each text block can be obtained using the connected components of the opencv library and the Canny edge detection algorithm. The four-point coordinates (x_min, x_max, y_min and y_max) of the text detection frame are then obtained by taking the maxima and minima of the edge coordinates; the result is shown in fig. 7;
3) Row-column alignment using the coordinates of the text blocks.
after the text blocks are obtained, the text recognition module can be called to recognize their content. However, to extract the content of the lineless table in structured form, the table_key entries must also be matched with the table_value entries; that is, the rows and columns of the lineless table must be aligned. To achieve row-column alignment, the header of the lineless table, which serves as the table_key, is acquired first: the text recognition module is called on the detected text blocks and the results are matched against the headers in the information base (headers can be added to the information base manually through the interactive interface). Once a header is determined and its coordinates obtained, extending downward from the header coordinates yields the corresponding table_value entries: all text whose detection-frame center coordinate lies between the left and right horizontal coordinates (x_min, x_max) of the header's detection frame belongs to the same column. Row alignment is completed in the same manner as column alignment, as shown schematically in fig. 8. A minimal sketch of this column-assignment rule follows.
The text detection and recognition module:
this module detects and recognizes the text content of the bill. In the text detection and recognition module, this embodiment applies the FOTS (Fast Oriented Text Spotting) algorithm, a fast end-to-end framework that integrates detection and recognition; compared with two-stage methods, FOTS is faster. The overall structure of FOTS consists of four parts: a shared convolution branch, a text detection branch, a RoIRotate operation branch and a text recognition branch. The backbone of the shared convolutional network is ResNet-50 (a residual network), and the purpose of the shared convolutions is to connect the low-level feature maps with the high-level semantic feature maps. The main function of the RoIRotate operation is to transform an angularly inclined text block into a horizontal text block by affine transformation. Compared with other text detection and recognition algorithms, this algorithm has the characteristics of a small model, high speed, high accuracy and support for multiple orientations.
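FOTS itself is a trained network, but the effect of its RoIRotate branch can be illustrated with a toy affine rotation; the sketch below is only a stand-in for the learned operation, and boundary clipping is ignored for brevity.

```python
import cv2

def roi_rotate(image, center, size, angle_deg):
    """Rotate the image about the text block's center so the inclined block
    becomes horizontal, then crop the block for the recognition branch."""
    M = cv2.getRotationMatrix2D(center, angle_deg, 1.0)
    rotated = cv2.warpAffine(image, M, (image.shape[1], image.shape[0]))
    w, h = size
    x0, y0 = int(center[0] - w / 2), int(center[1] - h / 2)
    return rotated[y0:y0 + h, x0:x0 + w]
```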
Human-computer interaction interface:
this module is mainly used for template modification and template configuration. In this embodiment, the template for structured extraction of itemized bills is generated automatically by the image preprocessing module, the partition module, the lineless table processing module, the text detection and recognition module and the other modules; an automatically generated template may, however, contain errors. In addition, automatic extraction extracts all key-value pairs, including some that business personnel do not care about. Business personnel can therefore modify the template through the human-computer interaction interface: the recognized key-value pairs can be corrected, and the key-value pairs of interest can be selected.
An itemized-bill template matching module:
this module mainly classifies the recognized itemized bills. If the itemized bill to be recognized matches a bill template already existing in the system, the system automatically calls the existing template to recognize the bill and extract its structure. If the itemized bill does not match any existing bill template, the system performs structured information extraction and forms a new template.
The template classification module comprises a text classification model and an image classification model. The text classification model applies the BERT model, a pre-trained model that can compute the similarity between the text content of a new bill and the text content of the system's existing templates. The image classification applies the Inception-v4 model, which performs remarkably well on image classification and can classify itemized bills accurately. The itemized-bill electronic entry system then combines the text classification and image classification results to match the itemized bill against the system templates; a hedged sketch of one way to fuse the two scores follows.
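One plausible fusion of the two classifiers is a weighted score, sketched below; the weighting, the threshold, and the text_sim and image_probs callables are assumptions, not specified by the patent.

```python
def match_template(bill_text, bill_image, templates,
                   text_sim, image_probs, w_text=0.5, threshold=0.7):
    """templates: dict template_id -> template (each with stored .text).
    text_sim: BERT-style similarity in [0, 1]; image_probs: Inception-style
    classifier returning a dict template_id -> probability."""
    img_scores = image_probs(bill_image)
    best_id, best_score = None, 0.0
    for tid, tpl in templates.items():
        score = (w_text * text_sim(bill_text, tpl.text)
                 + (1 - w_text) * img_scores.get(tid, 0.0))
        if score > best_score:
            best_id, best_score = tid, score
    # Below the threshold the bill is treated as a new type, for which a new
    # template is created, as described above.
    return best_id if best_score >= threshold else None
```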
Information base:
the information base stores a large number of itemized-bill templates for selection and invocation. It also stores the headers of itemized bills, which can be provided to the lineless table processing module, and it includes an interface for manual query and modification so that its contents can be maintained manually.
A result storage module:
this part mainly saves the bills processed by the itemized-bill electronic entry system. These bills can later be annotated to become training data for the system, improving the accuracy of its structuring and recognition.
As an implementation of the methods shown in the above figures, the present application provides an embodiment of an apparatus for extracting structured information. This apparatus embodiment corresponds to the method embodiment shown in fig. 2, and the apparatus may be applied to various electronic devices.
The apparatus for extracting structured information of this embodiment may include: a lineless table area identification unit, a semantic segmentation operation execution unit and a target structured information extraction unit. The lineless table area identification unit is configured to acquire an image to be processed and identify the lineless table area in the image to be processed; the semantic segmentation operation execution unit is configured to perform a semantic segmentation operation on the lineless table area using the DeepLab model, which can extract multi-scale features for segmenting out each text block, to obtain the segmented text blocks; and the target structured information extraction unit is configured to extract the target structured information from the text blocks.
In the apparatus for extracting structured information of this embodiment, the specific processing of the lineless table area identification unit, the semantic segmentation operation execution unit and the target structured information extraction unit, and the technical effects thereof, may refer to the descriptions of steps 201 to 203 in the embodiment corresponding to fig. 2, and are not repeated here.
In some optional implementations of this embodiment, the semantic segmentation operation execution unit may be further configured to: perform a feature extraction operation on the lineless table area using the encoding module to obtain the first features; pool the first features with the atrous-convolution spatial pyramid pooling module to obtain the multi-scale features; and perform an upsampling operation on the multi-scale features using the decoding module, taking each resulting segmented image as a text block; wherein the DeepLab model comprises the encoding module, the spatial pyramid pooling module and the decoding module.
In some optional implementations of this embodiment, the apparatus for extracting structured information may further include: a tilt correction unit configured to perform tilt correction on the image to be processed using the Gliding Vertex and RSDet algorithms before the lineless table area in the image to be processed is identified.
In some optional implementations of this embodiment, when the image to be processed is specifically a bill image to be processed, the lineless table area identification unit may be further configured to: process the bill image to be processed with the erosion-dilation algorithm to obtain the horizontal and vertical straight lines; determine the areas of the bill image to be processed where horizontal and vertical straight lines intersect as the wired table area; and determine the areas of the bill image to be processed where horizontal and vertical straight lines do not intersect as the lineless table area.
In some optional implementations of this embodiment, the target structured information extraction unit may include: a text block processing subunit configured to sequentially perform the frame detection operation, the row-column alignment operation and the text recognition operation on each text block to obtain the content of each text block; and a target structured information acquisition subunit configured to organize the content of each text block into the target structured information.
In some optional implementations of this embodiment, the text block processing subunit includes a frame detection module configured to perform the frame detection operation on each text block, the frame detection module being further configured to: obtain the edge coordinates of each text block using connected-component analysis and the Canny edge detection algorithm; and determine the frame of the corresponding text block from the edge coordinates.
In some optional implementations of this embodiment, the apparatus for extracting structured information may further include: an existing-template use unit configured to, when a structured information extraction template corresponding to the type of the image to be processed is pre-stored, call the corresponding template to perform the structured information extraction operation on the image to be processed; and a new-template forming unit configured to, when no such template is pre-stored, form a new structured information extraction template corresponding to the type of the image to be processed from the process by which its target structured information was obtained.
In some optional implementations of this embodiment, the apparatus for extracting structured information may further include: an existing-template matching unit configured to determine whether the image to be processed matches any pre-stored structured information extraction template using a text classification model constructed based on the BERT model and an image classification model constructed based on the Inception model.
This embodiment exists as the apparatus embodiment corresponding to the above method embodiment. Through the above technical solution, the apparatus for extracting structured information provides an automatic structured-information extraction scheme for itemized bills and other bills: in particular, for the lineless table area, using a DeepLab model that can extract multi-scale features to segment out the text blocks yields a better text-block segmentation result and improves the accuracy of the extracted structured information.
According to embodiments of the present application, there is also provided an electronic device, a computer-readable storage medium, and a computer program product.
Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the application described and/or claimed herein.
The electronic device includes: one or more processors, a memory, and interfaces for connecting the components, including high-speed interfaces and low-speed interfaces. The components are interconnected by different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executed within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output device (such as a display device coupled to an interface). In other embodiments, multiple processors and/or multiple buses may be used with multiple memories, if desired. Likewise, multiple electronic devices may be connected, each providing a portion of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system).
The memory is a non-transitory computer readable storage medium provided herein. Wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the methods provided herein for extracting structured information. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform the methods provided herein for extracting structured information.
The memory, as a non-transitory computer-readable storage medium, is used to store non-transitory software programs, non-transitory computer-executable programs and modules, such as the program instructions/modules corresponding to the method for extracting structured information in the embodiments of the present application (e.g., the lineless table area identification unit, the semantic segmentation operation execution unit and the target structured information extraction unit). By running the non-transitory software programs, instructions and modules stored in the memory, the processor executes the various functional applications and data processing of the server, i.e., implements the method for extracting structured information of the above method embodiments.
The memory may include a memory program area and a memory data area, wherein the memory program area may store an operating system, at least one application program required for a function; the storage data area may store various types of data created by the electronic device when executing the method for extracting structured information, and the like. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory optionally includes memory remotely located relative to the processor, which may be connected via a network to an electronic device adapted to perform the method for extracting structured information. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device adapted to perform the method for extracting structured information may further comprise: input means and output means. The processor, memory, input devices, and output devices may be connected by a bus or other means.
The input device may receive input numeric or character information and generate key signal inputs related to user settings and function control of an electronic device adapted to perform the method for extracting structured information, such as a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointer stick, one or more mouse buttons, a track ball, a joystick, etc. input devices. The output means may include a display device, auxiliary lighting means (e.g., LEDs), tactile feedback means (e.g., vibration motors), and the like. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device may be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also referred to as programs, software, software applications, or code) include machine instructions for a programmable processor, and may be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic disks, optical disks, memory, programmable logic devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
According to the technical scheme of the present application, an automatic structured-information extraction scheme for bills and receipts is provided. In particular, for a lineless table region, a deep model capable of extracting multi-scale features is used to segment the region into individual text blocks, which yields better text-block segmentation and thereby improves the accuracy of the extracted structured information.
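As an illustrative sketch only — not the patent's actual trained model — the encoder / atrous-spatial-pyramid-pooling / decoder pipeline recited in the claims below matches the structure of DeepLab-style segmentation networks. The following Python sketch uses torchvision's DeepLabV3 with a hypothetical two-class head (background vs. text block) and an assumed input size to show how such a model would segment a table crop into text blocks:

```python
# Illustrative stand-in, not the patent's model: torchvision's DeepLabV3
# has the same encoder -> ASPP (atrous convolutions at several dilation
# rates, yielding multi-scale features) -> upsampling-decoder structure
# described in claim 1 below.
import torch
from torchvision.models.segmentation import deeplabv3_resnet50

# Hypothetical two-class head: 0 = background, 1 = text block.
model = deeplabv3_resnet50(weights=None, num_classes=2).eval()

crop = torch.rand(1, 3, 512, 512)  # stand-in for a lineless-table crop
with torch.no_grad():
    logits = model(crop)["out"]    # (1, 2, 512, 512), upsampled to input size
mask = logits.argmax(dim=1)        # per-pixel text-block mask

# Each connected region of `mask == 1` would then be treated as one
# segmented text block and handed on to border detection and OCR.
```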
It should be appreciated that steps may be reordered, added, or deleted in the various flows shown above. For example, the steps described in the present application may be performed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions disclosed in the present application can be achieved; no limitation is imposed herein.
The above embodiments do not limit the scope of the application. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present application are intended to be included within the scope of the present application.

Claims (12)

1. A method for extracting structured information, comprising:
acquiring an image to be processed, and identifying a lineless table region (i.e., a table region without visible border lines) in the image to be processed;
performing a feature extraction operation on the lineless table region by using an encoding module of a deep model to obtain first features;
pooling the first features by using an atrous (dilated) convolution spatial pyramid pooling module of the deep model to obtain multi-scale features;
performing an up-sampling operation on the multi-scale features by using a decoding module of the deep model, and taking each resulting segmented image as a text block to obtain segmented text blocks;
sequentially performing a border detection operation, a row-column alignment operation, and a character recognition operation on each text block to obtain the content of each text block;
arranging the content of each text block to obtain the target structured information of the table;
wherein, when the image to be processed is specifically a bill image to be processed, identifying the lineless table region in the image to be processed comprises:
processing the bill image to be processed by using an erosion-dilation algorithm to obtain horizontal straight lines and vertical straight lines;
determining a region of the bill image to be processed in which the horizontal straight lines and the vertical straight lines intersect as a lined table region;
and determining a region of the bill image to be processed in which the horizontal straight lines do not intersect the vertical straight lines as a lineless table region.
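The erosion-dilation step in claim 1 corresponds to a standard morphological trick for isolating ruling lines. A minimal sketch, assuming OpenCV; the kernel sizes, binarization parameters, and file name are illustrative assumptions, not values from the patent:

```python
# Minimal sketch of the erosion/dilation line detection in claim 1,
# assuming OpenCV; kernel sizes and thresholds are illustrative.
import cv2

img = cv2.imread("bill.png", cv2.IMREAD_GRAYSCALE)
binary = cv2.adaptiveThreshold(~img, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                               cv2.THRESH_BINARY, 15, -2)

# Erode then dilate with a long horizontal kernel: only horizontal ruling
# lines survive; the same trick with a tall vertical kernel keeps verticals.
h_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1))
v_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 40))
horizontal = cv2.dilate(cv2.erode(binary, h_kernel), h_kernel)
vertical = cv2.dilate(cv2.erode(binary, v_kernel), v_kernel)

# Where the two line maps intersect, the region is a lined (bordered)
# table; a table-like region with no intersections is treated as lineless.
intersections = cv2.bitwise_and(horizontal, vertical)
has_lines = cv2.countNonZero(intersections) > 0
print("lined table region" if has_lines else "lineless table region")
```

In practice the intersection test would be run per candidate region rather than over the whole page, but the morphology is the same.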
2. The method of claim 1, wherein, before identifying the lineless table region in the image to be processed, the method further comprises:
performing tilt correction on the image to be processed by using a Gliding Vertex algorithm and an RSDet algorithm.
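Gliding Vertex and RSDet are published oriented-object-detection networks and are not reproduced here. As a deliberately simple stand-in that only illustrates what tilt correction does, this sketch estimates the skew of the inked content with OpenCV's minAreaRect and rotates the page upright; the thresholding scheme and file name are assumptions:

```python
# Simple deskew stand-in (NOT Gliding Vertex / RSDet): estimate the angle
# of the minimum-area rectangle around all ink pixels, then rotate back.
import cv2
import numpy as np

img = cv2.imread("tilted_bill.png", cv2.IMREAD_GRAYSCALE)
_, binary = cv2.threshold(~img, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)

coords = np.column_stack(np.where(binary > 0)).astype(np.float32)
angle = cv2.minAreaRect(coords)[-1]   # in (0, 90] on recent OpenCV
if angle > 45:                        # normalize to (-45, 45]
    angle -= 90

h, w = img.shape
M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
deskewed = cv2.warpAffine(img, M, (w, h), flags=cv2.INTER_CUBIC,
                          borderMode=cv2.BORDER_REPLICATE)
```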
3. The method of claim 1, wherein performing the border detection operation on each text block comprises:
obtaining the edge coordinates of each text block by using connected-domain analysis and a Canny edge detection algorithm;
and determining the border of the corresponding text block according to the edge coordinates.
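A hedged sketch of this border-detection step, assuming OpenCV: Canny finds edges inside a segmented text-block image, and connected-component statistics give candidate border coordinates. The Canny thresholds and file name are illustrative assumptions:

```python
# Sketch of claim 3: Canny edges + connected-component statistics give
# the bounding coordinates from which a text block's border is derived.
import cv2

block = cv2.imread("text_block.png", cv2.IMREAD_GRAYSCALE)
edges = cv2.Canny(block, 50, 150)

# stats rows hold (x, y, width, height, area) per component;
# row 0 is the background component.
n, labels, stats, _ = cv2.connectedComponentsWithStats(edges, connectivity=8)
for x, y, w, h, area in stats[1:]:
    print(f"candidate border: top-left=({x},{y}), size={w}x{h}")
```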
4. The method according to any one of claims 1 to 3, further comprising:
when a structured information extraction template corresponding to the type of the image to be processed is pre-stored, calling the corresponding structured information extraction template to perform a structured information extraction operation on the image to be processed;
and when no structured information extraction template corresponding to the type of the image to be processed is pre-stored, forming a new structured information extraction template corresponding to the type of the image to be processed according to the process by which the target structured information was obtained.
5. The method of claim 4, further comprising:
determining whether the image to be processed matches any pre-stored structured information extraction template by using a text classification model constructed based on the BERT model and an image classification model constructed based on the Inception model.
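A rough sketch of this template-matching check, not the patent's trained classifiers: a BERT text model (via Hugging Face transformers) scores the OCR text and an Inception-v3 image model (via torchvision) scores the image. The checkpoint names, template count, and logit-sum fusion rule are all assumptions:

```python
# Illustrative template matching: BERT text classifier + Inception-v3
# image classifier, fused by summing logits (a toy fusion rule).
import torch
from torchvision.models import inception_v3
from transformers import BertForSequenceClassification, BertTokenizer

NUM_TEMPLATES = 5  # hypothetical number of stored extraction templates

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
text_model = BertForSequenceClassification.from_pretrained(
    "bert-base-chinese", num_labels=NUM_TEMPLATES).eval()
image_model = inception_v3(weights=None, num_classes=NUM_TEMPLATES,
                           aux_logits=False).eval()

def match_template(ocr_text: str, image: torch.Tensor) -> int:
    """Return the index of the best-matching template."""
    tokens = tokenizer(ocr_text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        text_logits = text_model(**tokens).logits       # (1, NUM_TEMPLATES)
        image_logits = image_model(image.unsqueeze(0))  # (1, NUM_TEMPLATES)
    return int((text_logits + image_logits).argmax())

# e.g. match_template("some OCR text", torch.rand(3, 299, 299))
```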
6. An apparatus for extracting structured information, comprising:
a lineless-table region identification unit configured to acquire an image to be processed and identify a lineless table region in the image to be processed;
a semantic segmentation operation execution unit configured to perform a feature extraction operation on the lineless table region by using an encoding module of a deep model to obtain first features, pool the first features by using an atrous (dilated) convolution spatial pyramid pooling module of the deep model to obtain multi-scale features, and perform an up-sampling operation on the multi-scale features by using a decoding module of the deep model, taking each resulting segmented image as a text block to obtain segmented text blocks;
and a target structured information extraction unit configured to sequentially perform a border detection operation, a row-column alignment operation, and a character recognition operation on each text block to obtain the content of each text block, and to arrange the content of each text block to obtain the target structured information of the table;
wherein, when the image to be processed is specifically a bill image to be processed, the lineless-table region identification unit is further configured to:
process the bill image to be processed by using an erosion-dilation algorithm to obtain horizontal straight lines and vertical straight lines;
determine a region of the bill image to be processed in which the horizontal straight lines and the vertical straight lines intersect as a lined table region;
and determine a region of the bill image to be processed in which the horizontal straight lines do not intersect the vertical straight lines as a lineless table region.
7. The apparatus of claim 6, further comprising:
a tilt correction unit configured to perform tilt correction on the image to be processed by using a Gliding Vertex algorithm and an RSDet algorithm before the lineless table region in the image to be processed is identified.
8. The apparatus of claim 6, wherein the target structured information extraction unit comprises a border detection subunit configured to perform the border detection operation on each text block, the border detection subunit being further configured to:
obtain the edge coordinates of each text block by using connected-domain analysis and a Canny edge detection algorithm;
and determine the border of the corresponding text block according to the edge coordinates.
9. The apparatus according to any one of claims 6 to 8, further comprising:
an existing-template direct use unit configured to call the corresponding structured information extraction template to perform a structured information extraction operation on the image to be processed when a structured information extraction template corresponding to the type of the image to be processed is pre-stored;
and a new-template forming unit configured to form a new structured information extraction template corresponding to the type of the image to be processed, according to the process by which the target structured information was obtained, when no structured information extraction template corresponding to the type of the image to be processed is pre-stored.
10. The apparatus of claim 9, further comprising:
an existing-template matching unit configured to determine whether the image to be processed matches any pre-stored structured information extraction template by using a text classification model constructed based on the BERT model and an image classification model constructed based on the Inception model.
11. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method for extracting structured information of any one of claims 1-5.
12. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method for extracting structured information of any one of claims 1-5.
CN202010588634.9A 2020-06-24 2020-06-24 Method, apparatus, device and readable storage medium for extracting structured information Active CN111753727B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010588634.9A CN111753727B (en) 2020-06-24 2020-06-24 Method, apparatus, device and readable storage medium for extracting structured information

Publications (2)

Publication Number Publication Date
CN111753727A (en) 2020-10-09
CN111753727B (en) 2023-06-23

Family

ID=72677070

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010588634.9A Active CN111753727B (en) 2020-06-24 2020-06-24 Method, apparatus, device and readable storage medium for extracting structured information

Country Status (1)

Country Link
CN (1) CN111753727B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112541332B (en) * 2020-12-08 2023-06-23 北京百度网讯科技有限公司 Form information extraction method and device, electronic equipment and storage medium
CN112989921A (en) * 2020-12-31 2021-06-18 上海智臻智能网络科技股份有限公司 Target image information identification method and device
CN112712080B (en) * 2021-01-08 2021-09-28 北京匠数科技有限公司 Character recognition processing method for acquiring image by moving character screen
CN112949415B (en) * 2021-02-04 2023-03-24 北京百度网讯科技有限公司 Image processing method, apparatus, device and medium
CN112949450B (en) * 2021-02-25 2024-01-23 北京百度网讯科技有限公司 Bill processing method, device, electronic equipment and storage medium
CN113094523A (en) * 2021-03-19 2021-07-09 北京达佳互联信息技术有限公司 Resource information acquisition method and device, electronic equipment and storage medium
CN113435240A (en) * 2021-04-13 2021-09-24 北京易道博识科技有限公司 End-to-end table detection and structure identification method and system
CN113095267B (en) * 2021-04-22 2022-09-27 上海携宁计算机科技股份有限公司 Data extraction method of statistical chart, electronic device and storage medium
CN113221743B (en) * 2021-05-12 2024-01-12 北京百度网讯科技有限公司 Table analysis method, apparatus, electronic device and storage medium
CN113408446B (en) * 2021-06-24 2022-11-29 成都新希望金融信息有限公司 Bill accounting method and device, electronic equipment and storage medium
CN113392811B (en) * 2021-07-08 2023-08-01 北京百度网讯科技有限公司 Table extraction method and device, electronic equipment and storage medium
CN113537097B (en) * 2021-07-21 2023-08-22 泰康保险集团股份有限公司 Information extraction method and device for image, medium and electronic equipment
CN113627439A (en) * 2021-08-11 2021-11-09 北京百度网讯科技有限公司 Text structuring method, processing device, electronic device and storage medium
CN113902046B (en) * 2021-12-10 2022-02-18 北京惠朗时代科技有限公司 Special effect font recognition method and device
CN114332884B (en) * 2022-03-09 2022-06-21 腾讯科技(深圳)有限公司 Document element identification method, device, equipment and storage medium
CN116758578B (en) * 2023-08-18 2023-11-07 上海楷领科技有限公司 Mechanical drawing information extraction method, device, system and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105574513A (en) * 2015-12-22 2016-05-11 北京旷视科技有限公司 Character detection method and device
CN106446881A (en) * 2016-07-29 2017-02-22 北京交通大学 Method for extracting lab test result from medical lab sheet image
CN109961008A (en) * 2019-02-13 2019-07-02 平安科技(深圳)有限公司 Form analysis method, medium and computer equipment based on text location identification
CN110309824A (en) * 2019-07-02 2019-10-08 北京百度网讯科技有限公司 Character detecting method, device and terminal
CN110569846A (en) * 2019-09-16 2019-12-13 北京百度网讯科技有限公司 Image character recognition method, device, equipment and storage medium
CN110738092A (en) * 2019-08-06 2020-01-31 深圳市华付信息技术有限公司 invoice text detection method
CN110796031A (en) * 2019-10-11 2020-02-14 腾讯科技(深圳)有限公司 Table identification method and device based on artificial intelligence and electronic equipment
CN111241365A (en) * 2019-12-23 2020-06-05 望海康信(北京)科技股份公司 Table picture analysis method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8416468B2 (en) * 1999-09-17 2013-04-09 Silverbrook Research Pty Ltd Sensing device for subsampling imaged coded data


Similar Documents

Publication Publication Date Title
CN111753727B (en) Method, apparatus, device and readable storage medium for extracting structured information
CN109558832B (en) Human body posture detection method, device, equipment and storage medium
CN113657390B (en) Training method of text detection model and text detection method, device and equipment
WO2021203863A1 (en) Artificial intelligence-based object detection method and apparatus, device, and storage medium
US9633282B2 (en) Cross-trained convolutional neural networks using multimodal images
CN110610510B (en) Target tracking method and device, electronic equipment and storage medium
KR20220013298A (en) Method and device for recognizing characters
US11875510B2 (en) Generating refined segmentations masks via meticulous object segmentation
GB2555136A (en) A method for analysing media content
CN111639637B (en) Table identification method, apparatus, electronic device and storage medium
CN111626027B (en) Table structure restoration method, device, equipment, system and readable storage medium
CN113221743B (en) Table analysis method, apparatus, electronic device and storage medium
CN112149636A (en) Method, apparatus, electronic device and storage medium for detecting target object
CN111783620A (en) Expression recognition method, device, equipment and storage medium
CN111507354B (en) Information extraction method, device, equipment and storage medium
WO2021129466A1 (en) Watermark detection method, device, terminal and storage medium
CN111783645A (en) Character recognition method and device, electronic equipment and computer readable storage medium
Liu et al. Multi-scale iterative refinement network for RGB-D salient object detection
US10354161B2 (en) Detecting font size in a digital image
CN112115921A (en) True and false identification method and device and electronic equipment
CN113239807B (en) Method and device for training bill identification model and bill identification
CN112561053B (en) Image processing method, training method and device of pre-training model and electronic equipment
CN111563453B (en) Method, apparatus, device and medium for determining table vertices
CN110738261B (en) Image classification and model training method and device, electronic equipment and storage medium
CN114842482B (en) Image classification method, device, equipment and storage medium

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant