CN111753727A - Method, device, equipment and readable storage medium for extracting structured information - Google Patents

Method, device, equipment and readable storage medium for extracting structured information

Info

Publication number
CN111753727A
CN111753727A (application CN202010588634.9A); granted as CN111753727B
Authority
CN
China
Prior art keywords
image
processed
structured information
character block
table area
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010588634.9A
Other languages
Chinese (zh)
Other versions
CN111753727B (en)
Inventor
冯博豪
庞敏辉
谢国斌
韩光耀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202010588634.9A priority Critical patent/CN111753727B/en
Publication of CN111753727A publication Critical patent/CN111753727A/en
Application granted granted Critical
Publication of CN111753727B publication Critical patent/CN111753727B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/412 Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/23 Clustering techniques
    • G06F18/24 Classification techniques
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/267 Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
    • G06V10/464 Salient features, e.g. scale invariant feature transforms [SIFT], using a plurality of salient features, e.g. bag-of-words [BoW] representations

Abstract

The embodiment of the application discloses a method and a device for extracting structured information, an electronic device and a computer-readable storage medium, and relates to the technical fields of deep learning, image processing, natural language processing and cloud computing. One embodiment of the method comprises: acquiring an image to be processed, and identifying a wireless table area (i.e., a table without ruling lines) in the image to be processed; performing a semantic segmentation operation on the wireless table area by using a Deeplab model, which extracts multi-scale features, to segment the area into individual character blocks; and extracting target structured information from the character blocks. This implementation provides an automatic structured-information extraction scheme for detailed bills and invoices. In particular, for the wireless table area, the Deeplab model's ability to extract multi-scale features yields a better character-block segmentation, which in turn improves the accuracy of the extracted structured information.

Description

Method, device, equipment and readable storage medium for extracting structured information
Technical Field
The embodiment of the application relates to the field of data processing, in particular to the field of image data processing and natural language processing.
Background
In reimbursement scenarios, consumer detail bills are frequently encountered, and the information in them must be entered into an electronic system. As social activity grows, the number of such bills increases rapidly, so how to enter bill and invoice information quickly and accurately has become a key research topic for technicians in the field.
Conventionally, each item of data in bills and detailed bills is entered manually.
Disclosure of Invention
The embodiment of the application provides a method and a device for extracting structured information, electronic equipment and a computer-readable storage medium.
In a first aspect, an embodiment of the present application provides a method for extracting structured information, including: acquiring an image to be processed, and identifying a wireless table area in the image to be processed; performing a semantic segmentation operation on the wireless table area by using a Deeplab model to obtain segmented character blocks, where the Deeplab model extracts multi-scale features to segment out the individual character blocks; and extracting target structured information from the character blocks.
In a second aspect, an embodiment of the present application provides an apparatus for extracting structured information, including: a wireless table area identification unit configured to acquire an image to be processed and identify a wireless table area in the image to be processed; a semantic segmentation operation execution unit configured to perform a semantic segmentation operation on the wireless table area by using a Deeplab model to obtain segmented character blocks, where the Deeplab model extracts multi-scale features to segment out the individual character blocks; and a target structured information extraction unit configured to extract target structured information from the character blocks.
In a third aspect, an embodiment of the present application provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions, when executed, causing the at least one processor to perform the method for extracting structured information as described in any implementation of the first aspect.
In a fourth aspect, embodiments of the present application provide a non-transitory computer-readable storage medium storing computer instructions which, when executed, cause a computer to implement the method for extracting structured information as described in any implementation of the first aspect.
According to the method, apparatus, electronic device and computer-readable storage medium for extracting structured information provided by the embodiments of the present application, first an image to be processed is obtained and the wireless table area in it is identified; then a semantic segmentation operation is performed on the wireless table area using a Deeplab model, which extracts multi-scale features to segment out each character block, obtaining the segmented character blocks; finally, the target structured information is extracted from the character blocks. The embodiments of the present application provide an automatic structured-information extraction scheme for detailed bills and invoices. In particular, for the wireless table area, using the Deeplab model, which extracts multi-scale features to segment out each character block, yields a better character-block segmentation and improves the accuracy of the extracted structured information.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture to which the present application may be applied;
FIG. 2 is a flow diagram for one embodiment of a method for extracting structured information, according to the present application;
FIG. 3 is a flow diagram of another embodiment of a method for extracting structured information according to the present application;
FIG. 4 is a functional block diagram of an application scenario of a method for extracting structured information according to the present application;
FIG. 5 is an exemplary billing image;
FIG. 6 is a schematic representation of the billing image of FIG. 5 after being processed by the erosion dilation algorithm;
FIG. 7 is a schematic diagram of the text blocks in the bill image shown in FIG. 5 after performing border detection operations;
FIG. 8 is a diagram of the image shown in FIG. 6 after performing a row-column alignment operation.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Fig. 1 illustrates an exemplary system architecture 100 to which embodiments of the method, apparatus, electronic device, and computer-readable storage medium for extracting structured information of the present application may be applied.
As shown in fig. 1, system architecture 100 may include an image capture device 101, a network 102, and a server 103. Network 102 serves as a medium to provide a communication link between image capture device 101 and server 103. Network 102 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
A user may use image capture device 101 to interact with server 103 over network 102 to receive or send messages or the like. Various applications for realizing information communication between the image acquisition device 101 and the server 103, such as a bill/bill uploading application, a structured information extraction application, an instant messaging application, and the like, may be installed on the image acquisition device 101 and the server 103.
The image capturing apparatus 101 and the server 103 may be hardware or software. When the image capturing device 101 is a hardware device, it may be various electronic devices with a display screen and a camera, including but not limited to a smart phone, a tablet computer, a computer, various independent camera devices, and the like; when the image capturing device 101 is software, it may be installed in the electronic devices listed above, and it may be implemented as multiple software or software modules, or may be implemented as a single software or software module, and is not limited in this respect. When the server 103 is hardware, it may be implemented as a distributed server cluster formed by multiple servers, or may be implemented as a single server; when the server 103 is software, it may be implemented as a plurality of software or software modules, or may be implemented as a single software or software module, and is not limited in this respect.
The server 103 may provide various services through various built-in applications. Taking a bill/bill uploading application that can provide a structured-information extraction service as an example, the server 103 may achieve the following effects when running it: first, the image to be processed is acquired from the image capture device 101 through the network 102, and the server 103 identifies the wireless table area in the image; then the server 103 performs a semantic segmentation operation on the wireless table area using the Deeplab model, which extracts multi-scale features to segment out each character block, and obtains the segmented character blocks; finally, the server 103 extracts the target structured information from the character blocks. That is, through the above processing steps, the server 103 extracts the structured information contained in the input image to be processed and outputs the extracted target structured information as the result.
It should be noted that, besides being acquired from the image capture device 101 through the network 102, the image to be processed may also be pre-stored locally in the server 103 in various ways. Thus, when the server 103 detects that such data is already stored locally (e.g., an image left over from an earlier structured-information extraction task that has yet to be processed), it may choose to obtain the data directly from local storage, in which case the exemplary system architecture 100 may omit the image capture device 101 and the network 102.
Since extracting structured information from the image to be processed requires considerable computing resources and computing power, the method for extracting structured information provided in the following embodiments of the present application is generally executed by the server 103, which has stronger computing power and more computing resources, and accordingly the apparatus for extracting structured information is generally disposed in the server 103. However, when the image capture device 101 also has sufficient computing power and computing resources, it may itself complete the operations otherwise performed by the server 103, through the bill/bill uploading application installed on it, and then output the same result as the server 103. This is especially useful when multiple image capture devices with different computing capabilities are present at the same time: for example, when the bill/bill uploading application determines that the current image capture device has strong computing power and plenty of idle computing resources, it can let that device execute the above operations, thereby relieving some of the computing pressure on the server 103. Accordingly, the apparatus for extracting structured information may also be provided in the image capture device 101. In such a case, the exemplary system architecture 100 may omit the server 103 and the network 102.
It should be understood that the number of image capturing devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of image capture devices, networks, and servers, as desired for implementation.
With continuing reference to FIG. 2, an implementation flow 200 of one embodiment of a method for extracting structured information according to the present application is shown, comprising the steps of:
step 201: acquiring an image to be processed, and identifying to obtain a wireless table area in the image to be processed;
this step is intended to acquire an image to be processed by an execution subject (for example, the server 103 shown in fig. 1) executing the method for extracting structured information, and identify a wireless table area in the acquired image to be processed.
The images to be processed include, but are not limited to, bill images and policy images containing various types of table content. Tables come in two forms: wired tables and wireless tables. A wired table shows the structural relationships between its cells explicitly, through content-filled rectangles formed by intersecting horizontal and vertical straight lines; a wireless table lacks these lines, and therefore needs special treatment when structured information is extracted from it.
Accordingly, the wireless table area in the image to be processed can be distinguished from the wired table area by whether crossing lines are present. It can also be identified from other cues that distinguish the two: for example, the outermost boundary lines of a wireless table area usually extend past each other where they meet, whereas in a wired table area the boundary lines stop at the crossings and form a closed table. Whether a wireless table area exists can thus also be judged by whether the border lines continue to extend after crossing.
Taking a to-be-processed bill image that contains only a wired table area and a wireless table area as an example, one (non-limiting) implementation of identifying the wireless table area from the image comprises the following steps:
processing the bill image to be processed by using a corrosion expansion algorithm to obtain a horizontal direction straight line and a vertical direction straight line;
determining a region in which a horizontal direction straight line and a vertical direction straight line in a bill image to be processed are crossed as a wired table region;
and determining an area in which a horizontal direction straight line does not intersect with a vertical direction straight line in the bill image to be processed as a wireless table area.
In this embodiment, the bill to be processed is processed with the erosion-dilation algorithm in the opencv library (a cross-platform computer vision library containing various general-purpose computer vision algorithms), so that the horizontal and vertical straight lines in the bill are highlighted using the principles of erosion and dilation. After the bill image to be processed is binarized into a black-and-white image, the horizontal and vertical straight lines in it can be found with the erosion-dilation algorithm. Since the bill image to be processed usually contains a wireless table area in addition to a wired table area, the two kinds of area can then be separated simply by judging whether the horizontal and vertical lines intersect.
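As a rough illustration of the erosion-dilation idea (a minimal NumPy sketch with toy kernel sizes and a toy image, not the actual opencv implementation): eroding a binarized image with a wide flat kernel wipes out short ink runs such as characters, and dilating the result restores the long ruling lines:

```python
import numpy as np

def erode(img, kh, kw):
    """Binary erosion: a pixel stays 1 only if the whole kh x kw window is 1."""
    H, W = img.shape
    out = np.zeros_like(img)
    for y in range(H - kh + 1):
        for x in range(W - kw + 1):
            if img[y:y + kh, x:x + kw].all():
                out[y, x] = 1
    return out

def dilate(img, kh, kw):
    """Binary dilation: grow every 1-pixel into a kh x kw window."""
    H, W = img.shape
    out = np.zeros_like(img)
    for y in range(H):
        for x in range(W):
            if img[y, x]:
                out[y:y + kh, x:x + kw] = 1
    return out

# Toy binarized "document": a 10-pixel horizontal ruling line plus isolated "text" dots.
img = np.zeros((8, 12), dtype=np.uint8)
img[3, 1:11] = 1           # horizontal ruling line
img[1, 2] = img[5, 7] = 1  # text-like noise

# Erode with a wide flat (1 x 5) kernel: only runs of at least 5 pixels survive,
# i.e. the ruling line; dilating with the same kernel restores its full extent.
h_lines = dilate(erode(img, 1, 5), 1, 5)
print(h_lines[3].sum())  # the line is recovered; the dots are gone
```

A vertical-line pass works the same way with a tall (5 x 1) kernel; a region where the two resulting line maps intersect would then be classified as a wired table area.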
It should be noted that the image to be processed may be acquired by the execution subject directly from a local storage device, or may be acquired from a non-local storage device (for example, the image capturing device 101 shown in fig. 1). The local storage device may be a data storage module arranged in the execution main body, such as a server hard disk, in which case, the image to be processed can be read locally and quickly; the non-local storage device may also be any other electronic device configured to store data, for example, some user terminals, and in this case, the execution subject may obtain the image to be processed by receiving a structured information extraction request containing the image to be processed sent by the electronic device.
Furthermore, the image to be processed shot by the image capture device often suffers from skew. To minimize the influence of this problem on the subsequently extracted structured information, after the image to be processed is obtained and before the wireless table area in it is identified, the image can be deskewed using the Gliding Vertex algorithm and the RSDet algorithm.
The Gliding Vertex algorithm and the RSDet algorithm are two algorithms originally used for remote sensing target detection. Both are essentially quadrilateral detectors, designed to detect the edges of complex ground targets in satellite remote sensing imagery, and therefore capable of detecting complex target contours. To better correct skewed images to the upright orientation, the present application borrows these two algorithms from the field of remote sensing target detection: their ability to accurately determine the edge quadrilateral of a complex target is used to accurately determine the edge quadrilateral of the image to be processed, thereby achieving an accurate overall correction.
Step 202: performing semantic segmentation operation on the wireless table area by using a Deeplab model to obtain each segmented character block;
On the basis of step 201, this step is intended for the execution subject to perform semantic segmentation on the wireless table area using the Deeplab model, obtaining the segmented character blocks. The semantic segmentation operation is performed on the partial image corresponding to the wireless table area, and its result is one image block per character block; that is, segmentation is performed at the character-block level, and each segmented image block is the image region in which a character block sits.
Common models for implementing semantic segmentation are FCN (Fully Convolutional Networks) and SegNet (a semantic segmentation network). FCN starts from CNNs (Convolutional Neural Networks) that perform image classification with fully connected layers, and replaces the last fully connected layer with a convolutional layer to achieve pixel-level classification, which then serves as the basis of semantic segmentation. SegNet builds on FCN by adding deconvolution and unpooling operations, so that the features extracted for classification are more accurate, achieving a better segmentation effect than FCN.
However, FCN and SegNet are generally used to semantically segment images containing visually distinct features, for example separating a human body from the background in a picture. The images targeted by the present application, by contrast, are bill and invoice images containing a large amount of structured and textual information: the image to be processed contains many features at different scales, and the differences between text features are small, so a more refined feature extraction and recognition approach is required, for which FCN and SegNet perform poorly. For this reason, the present application, considering that the image to be processed contains tables with a large amount of text inside them, selects the Deeplab model, which can extract and fuse multi-scale features and has better classification and recognition performance, and makes full use of its advantages to meet the practical requirement of extracting structured information from bill and invoice images full of tabular and textual structured information.
Compared with other models, the Deeplab model removes the down-sampling and max pooling of the last few layers and uses up-sampled (atrous) filters to obtain feature maps at a higher sampling rate, thereby addressing the loss of spatial resolution caused by repeated pooling in traditional classification CNNs and FCNs. It resamples the feature layers to capture image and text information at multiple scales, and uses several parallel atrous convolution branches for multi-scale sampling, addressing the scale-detection problem of traditional classification models. Across the development of the Deeplab V1, V2 and V3 versions, the improvements have centered on the ASPP (Atrous Spatial Pyramid Pooling) structure: the initial ASPP structure applies atrous convolution at multiple scales and concatenates the branches after 1 × 1 convolutions, thereby extracting multi-scale features and finally obtaining both global and local features. In the V3 version, the improved ASPP structure comprises one 1 × 1 convolution and 3 × 3 atrous convolutions, each with 256 filters and batch normalization (BN) layers, together with global average pooling of the image features. In short, unlike traditional semantic segmentation models aimed mainly at natural image features, the Deeplab model adopted here uses up-sampled atrous filters, removes the last few layers of down-sampling and max pooling, and extracts and fuses multi-scale features through the ASPP structure. It is therefore better suited to semantic segmentation of an image to be processed containing a large amount of tabular and textual structured information, and its characteristics can be fully exploited to improve the accuracy of character-block segmentation in wireless table areas.
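The multi-scale sampling in ASPP rests on atrous (dilated) convolution. A minimal 1-D sketch (illustrative only; Deeplab applies 2-D atrous convolutions inside a full network) shows how the same three-tap kernel covers a wider input span as the dilation rate grows, enlarging the receptive field without adding parameters:

```python
import numpy as np

def atrous_conv1d(x, w, rate):
    """1-D atrous (dilated) convolution with 'valid' padding.
    With dilation rate r, a kernel of size k covers a span of
    (k - 1) * r + 1 input samples."""
    k = len(w)
    span = (k - 1) * rate + 1
    return np.array([
        sum(w[j] * x[i + j * rate] for j in range(k))
        for i in range(len(x) - span + 1)
    ])

x = np.arange(10, dtype=float)
w = np.array([1.0, 1.0, 1.0])  # the same 3-tap kernel for every branch

print(atrous_conv1d(x, w, rate=1))  # each output taps 3 adjacent samples
print(atrous_conv1d(x, w, rate=2))  # the same 3 taps now span 5 samples
```

An ASPP block runs several such branches with different rates in parallel and concatenates their outputs, which is how global and local context are captured at once.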
Step 203: and extracting target structured information according to each character block.
On the basis of step 202, this step aims for the execution subject to extract the target structured information by further processing each precisely segmented character block. The target structured information is obtained from the image blocks of the character blocks, and usually requires several kinds of processing, the most important being character recognition and structured-information extraction in the field of natural language processing, so that the recognized content of each character block is organized with the correct structure into valid target structured information.
The method for extracting structured information provided by this embodiment of the application offers an automatic structured-information extraction scheme for detailed bills and invoices. In particular, for the wireless table area, using the Deeplab model, which extracts multi-scale features to segment out each character block, yields a better character-block segmentation and improves the accuracy of the extracted structured information.
On the basis of the above embodiment, the present application also provides a flow 300 of another method for extracting structured information through fig. 3, including the following steps:
step 301: acquiring an image to be processed, and identifying to obtain a wireless table area in the image to be processed;
step 301 is the same as step 201 shown in fig. 2, and please refer to the corresponding parts in the previous embodiment for the same contents, which will not be described herein again.
Step 302: performing feature extraction operation on the wireless table area by using a coding module to obtain a first feature;
step 303: performing pooling processing on the first feature by using a spatial pyramid pooling module of the cavity convolution to obtain a multi-scale feature;
step 304: performing up-sampling operation on the multi-scale features by using a decoding module, and taking each obtained segmented image as a character block;
For the Deeplab model formed by the encoding module, the atrous-convolution spatial pyramid pooling module and the decoding module, steps 302 to 304 above provide a concrete implementation for segmenting out each character block. The encoding module may adopt a CNN-based feature extraction model, and the atrous-convolution spatial pyramid pooling module may include convolution kernels of several different specifications, so that features at as many scales as possible are obtained.
Step 305: sequentially performing a frame detection operation, a row-and-column alignment operation and a character recognition operation on each character block to obtain the content of each character block;
Step 306: organizing the content of each character block to obtain the target structured information.
Steps 305 and 306 provide a scheme in which a frame detection operation, a row-and-column alignment operation and a character recognition operation are performed in sequence on each character block to extract its specific text content, and the target structured information is finally obtained by organizing the content of all the character blocks.
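The row-and-column alignment step can be sketched as clustering character-block bounding boxes by their vertical position (a simplified illustration; `align_rows`, the `(x, y, w, h)` box format and the pixel tolerance are assumptions, not the patent's actual procedure):

```python
def align_rows(boxes, tol=10):
    """Group character-block bounding boxes (x, y, w, h) into table rows:
    boxes whose top edges differ by less than `tol` pixels land in the same
    row, and each row is then sorted left-to-right into columns."""
    rows = []
    for box in sorted(boxes, key=lambda b: b[1]):       # sort by y
        if rows and abs(box[1] - rows[-1][0][1]) < tol:
            rows[-1].append(box)                         # same row
        else:
            rows.append([box])                           # start a new row
    return [sorted(row, key=lambda b: b[0]) for row in rows]

# Toy blocks from a 2x2 lineless table (y-coordinates slightly jittered).
blocks = [(120, 52, 40, 20), (10, 50, 30, 20), (118, 101, 42, 20), (12, 99, 35, 20)]
table = align_rows(blocks)
print([[b[0] for b in row] for row in table])  # → [[10, 120], [12, 118]]
```

Once the blocks are arranged into a row/column grid like this, the recognized text of each cell can be mapped to its header to produce the structured output.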
For ease of understanding, a specific implementation, including but not limited to, is also presented herein for the border detection operation in step 305, including the following steps:
obtaining the edge coordinates of each character block by using a connected domain and a canny edge detection algorithm;
and determining the frame of the corresponding character block according to the edge coordinates.
Connected-component analysis and the Canny edge detection algorithm are likewise general-purpose computer vision algorithms in the opencv library, used here to realize edge detection.
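Once the edge coordinates of a character block are available, its frame is simply the bounding rectangle of those coordinates. A minimal sketch (using NumPy on a toy binary mask instead of the opencv connected-component and Canny routines the text names; `block_frame` is an illustrative helper, not part of the patent):

```python
import numpy as np

def block_frame(mask):
    """Given a binary mask of one character block (1 = ink), return the
    bounding frame (x_min, y_min, x_max, y_max) derived from its
    edge coordinates."""
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

mask = np.zeros((6, 8), dtype=np.uint8)
mask[2:4, 3:6] = 1          # a 2 x 3 blob of "ink"
print(block_frame(mask))    # → (3, 2, 5, 3)
```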
On the basis of all the advantages of the previous embodiment, this embodiment provides, through steps 302 to 304, a highly practicable implementation of semantic segmentation of the wireless table area with the Deeplab model, and, through steps 305 and 306, a concrete way of completing the extraction of the target structured information from the character blocks. The two improvements have no causal or dependency relationship between them, so each could form a separate improved embodiment on top of the previous one; the present embodiment exists merely as a preferred embodiment in which both preferred implementations are present at the same time.
Furthermore, detailed bill images and bill images of the same type share common points in the extraction process of the structured information. Therefore, the effective structured information of images to be processed of different types can be extracted by combining the above processes, and a structured information extraction template corresponding to each image type can be generated and stored, so that the extraction efficiency can subsequently be improved by directly calling the template.
One implementation, including but not limited to the following, may be:
acquiring an image to be processed, and judging whether a structured information extraction template corresponding to the type of the image to be processed is pre-stored;
if the structured information extraction template corresponding to the type of the image to be processed is prestored, calling the corresponding structured information extraction template to execute the structured information extraction operation on the image to be processed;
and if the structured information extraction template corresponding to the type of the image to be processed is not prestored, forming a new structured information extraction template corresponding to the type of the image to be processed according to the obtaining process of the target structured information.
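The branch logic above can be sketched as follows. The class name and the two callables are illustrative assumptions, not part of the disclosure; a real system would persist the templates rather than keep them in memory.

```python
class TemplateStore:
    """Toy cache of structured-information extraction templates, keyed by
    image type: reuse a pre-stored template if one exists, otherwise run the
    full extraction pipeline once and keep the resulting template."""

    def __init__(self):
        self._templates = {}

    def extract(self, image_type, image, build_template):
        # If a template for this image type is pre-stored, call it directly.
        if image_type in self._templates:
            return self._templates[image_type](image)
        # Otherwise derive a new template from the full extraction process
        # and store it for subsequent images of the same type.
        template = build_template(image)
        self._templates[image_type] = template
        return template(image)
```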
Furthermore, when determining whether the image to be processed matches any one of the pre-stored structured information extraction templates, the determination can also be made by a character classification model constructed based on a BERT model and an image classification model constructed based on an Inception model. The BERT model (Bidirectional Encoder Representations from Transformers) is selected to construct the character classification model because, compared with traditional models that perform character classification based on semantics, BERT masks a small proportion of words (replacing each with a mask token or, with a smaller probability, a random word) when training its bidirectional language model, which strengthens the model's memory of context, and additionally adds a next-sentence prediction loss. These two differences give the BERT model better semantic recognition and classification capabilities.
Similarly, the Inception model achieves the best possible image classification performance within limited computing resources by continuously improving the Inception structure, rather than relying on hardware upgrades or larger data sets. Generally speaking, the most direct way to improve network performance is to increase the depth and width of the network (the depth being the number of layers, and the width the number of channels per layer), but this approach has two disadvantages: 1) overfitting occurs easily, because as depth and width increase, the number of parameters to be learned also increases, and large parameter counts overfit easily; 2) uniformly increasing the size of the network leads to a sharp increase in the amount of computation. The Inception model therefore takes inspiration from the sparse connections of biological nervous systems: if the probability distribution of a data set can be described by a large and very sparse DNN (Deep Neural Network), the optimal network topology can be constructed layer by layer by analyzing the statistical properties of the activations of the preceding layer and clustering neurons with highly correlated outputs. Its own improvement is to introduce this sparse property and convert fully connected layers into sparse connections. In this way, the sparse characteristic at the filter level is retained, the high computational performance of dense matrices can still be exploited, and the additional problems caused by the conventional ways of increasing performance are avoided.
To deepen understanding, the present application also provides a detailed bill electronic auxiliary system corresponding to the above method for extracting structured information, which is used to assist business personnel in the electronic entry of detailed bills.
The detailed bill electronic auxiliary system consists of 8 parts, which are respectively: an image preprocessing module, a partition module, a wireless form processing module, a character detection and recognition module, a detailed bill template matching module, a human-computer interaction interface, an information base and a storage module; for a structural schematic diagram, please refer to fig. 4. The above modules are interconnected to realize data intercommunication, and the specific implementation of each functional module will be described below with reference to examples:
an image preprocessing module:
due to the problem of the shooting angle, the obtained image may have a certain inclination, or a plurality of pictures may be pasted together. In this case, image segmentation correction is required. The image segmentation and correction steps are as follows:
1) completing frame detection of the detail bill by using target detection, and cutting the detail bill according to frame coordinates;
2) performing inclination correction by using the coordinates of the four corners of the detection frame.
The Gliding Vertex algorithm and the RSDet algorithm are applied to target detection. The two algorithms have a marked effect in remote sensing target detection and can accurately detect inclined targets. The present embodiment uses labeled training data to fine-tune these algorithm networks so that they can detect skewed bills.
A partitioning module:
the module is mainly used for partitioning the detailed bill by using table detection and line detection. The divided areas are a key-value area (i.e., part of the wired table area) and a wireless table area. The method mainly utilizes the erosion-dilation algorithm contained in the OpenCV library, through which horizontal straight lines and vertical line segments can be obtained. An exemplary original bill image is shown in fig. 5 and the bill image processed by the erosion-dilation algorithm is shown in fig. 6. It can be seen that the coordinates of the straight lines can easily be acquired from the binarized map shown in fig. 6, and the original image can then be divided by these straight lines into the horizontal key-value area and the wireless table area.
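A simplified sketch of how the erosion-then-dilation ("corrosion expansion") opening can isolate horizontal rulings in a binarized image follows. This is a pure-Python stand-in for OpenCV's `cv2.erode`/`cv2.dilate` with a wide horizontal kernel; the function names, the binary list-of-lists image format and the kernel width are illustrative assumptions.

```python
def erode_h(img, k):
    """Horizontal erosion: out[y][x] is 1 only if the k pixels img[y][x..x+k-1]
    are all foreground (kernel anchored at its left end)."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w - k + 1):
            if all(img[y][x + i] for i in range(k)):
                out[y][x] = 1
    return out

def dilate_h(img, k):
    """Horizontal dilation with the reflected kernel: out[y][x] is 1 if any of
    the k pixels img[y][x-k+1..x] is foreground."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if any(img[y][x - i] for i in range(k) if x - i >= 0):
                out[y][x] = 1
    return out

def horizontal_lines(img, k=3):
    """Morphological opening with a 1xk kernel keeps only foreground runs at
    least k pixels wide, i.e. horizontal table rulings; shorter strokes
    (character fragments) are removed."""
    return dilate_h(erode_h(img, k), k)
```

Running the same opening with a tall vertical kernel yields the vertical line segments; the intersections of the two masks then delimit the wired table area.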
For the key-value area, a conventional character detection and recognition module is directly called to recognize the corresponding contents and obtain the corresponding key-value pairs; for the wireless table area, the wireless form processing module needs to be called to complete the extraction of table_key and table_value.
A wireless form processing module:
the module is mainly used for finishing the processing of the wireless form and comprises the following three steps:
1) extracting the text blocks on the detailed bill by using a semantic segmentation model. The semantic segmentation model used here is DeepLab V3+ (an improved version of V3), whose network is an "encode-decode" structure. The encoding module uses a DCNN to extract features and is followed by the atrous spatial pyramid pooling module (i.e., the ASPP structure) for extracting and fusing multi-scale features of the image. The decoding module obtains the segmentation result by upsampling. Because the DeepLab V3+ model introduces multi-scale information and, compared with other image segmentation models, further fuses low-level and high-level features, the accuracy of boundary segmentation is greatly improved.
2) Obtaining borders of text blocks by using semantic segmentation results
The edge coordinates of each character block can be obtained by using the connected-domain analysis and Canny edge detection algorithm of the OpenCV library. Then, by taking the maximum and minimum values of the edge coordinates, the four-point coordinates (x_min, x_max, y_min, y_max) of the text detection box are obtained; the result is shown in fig. 7;
3) row and column alignment using coordinates of text blocks
After the corresponding text block is obtained, a text recognition module can be called to recognize its content. However, if structured extraction of the content of the wireless table is required, table_key must also be matched with table_value, that is, row-column alignment of the wireless table is required. To realize the row-column alignment, the header of the wireless table, which is the table_key, needs to be obtained first. The character recognition module is called to recognize the detected text blocks, which are then matched against the headers in the information base; headers in the information base can be added manually through the interactive interface. Once the header is determined and its coordinates are obtained, the coordinates of the header are extended downward to acquire each corresponding table_value: all texts whose detection-box center coordinates lie between the left and right horizontal coordinates (x_min, x_max) of the header detection box belong to the same column. Row alignment can be accomplished in the same manner as column alignment, as shown schematically in fig. 8.
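The column-alignment rule described above can be sketched as follows. The box tuple formats and function name are illustrative assumptions; recognized header names and cell texts would come from the character recognition module.

```python
def align_columns(headers, cells):
    """headers: list of (name, x_min, x_max) for each table_key detection box.
    cells: list of (text, cx, cy), where (cx, cy) is the center of the text
    detection box. A cell belongs to the column whose header x-range contains
    the cell's center x coordinate."""
    columns = {name: [] for name, _, _ in headers}
    for text, cx, cy in cells:
        for name, x_min, x_max in headers:
            if x_min <= cx <= x_max:
                columns[name].append((cy, text))
                break
    # Sort each column top-to-bottom so entries line up by vertical order,
    # which is the same principle used for row alignment.
    return {name: [t for _, t in sorted(vals)] for name, vals in columns.items()}
```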
The character detection and identification module:
the module is mainly used for detecting and recognizing the text content of the detailed bill. In the text detection and recognition module, this embodiment applies the FOTS (Fast Oriented Text Spotting) algorithm, a fast end-to-end integrated detection and recognition framework; compared with other two-stage methods, FOTS is faster. The overall structure of FOTS is composed of a shared convolution branch, a text detection branch, a RoIRotate operation branch and a text recognition branch. The backbone of the shared convolution network is ResNet-50 (Residual Network), and the role of convolution sharing is to connect the low-level feature map with the high-level semantic feature map. The main function of the RoIRotate operation is to convert a text block with an inclined angle into a horizontal text block through affine transformation. Compared with other text detection and recognition algorithms, this method has the characteristics of a small model, high speed, high precision and support for multiple angles.
A human-computer interaction interface:
the module is mainly used for template modification and template configuration. In this embodiment, the template for structured extraction of the detailed bill is automatically generated by the image preprocessing module, the partition module, the wireless form processing module, the character detection and recognition module and other modules. However, the automatically generated template may contain errors. In addition, during automatic extraction all key-values are extracted, including key-value pairs that some business personnel do not care about. Business personnel can therefore modify the template through the human-computer interaction interface: the recognized key-value pairs can be corrected, and the key-value pairs of interest can be selected.
A detail bill template matching module:
the module is mainly used for classifying the identified detailed bills. If the detail bill used for identification is an existing bill template in the system, the system can automatically call the existing template to identify and structurally extract the detail bill. If the detail bill for identification is not the existing bill template in the system, the system will perform information structured extraction to form a new template.
The template classification module comprises a text classification model and an image classification model. The text classification model applies a BERT model, which is pre-trained and can calculate the similarity between the text content of a new detailed bill and the text content of an existing template in the system. The Inception-v4 model applied to image classification has a remarkable effect on image classification and can accurately classify detailed bills. The detailed bill electronization system integrates the text classification and image classification results to match and classify the detailed bill against the system templates.
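One simple way to integrate the text classification and image classification results is a weighted fusion of the two similarity scores per candidate template. The weights, threshold and function names below are illustrative assumptions, not values from the disclosure.

```python
def match_template(text_sim, image_sim, w_text=0.5, w_image=0.5, threshold=0.8):
    """Fuse the text-model similarity and the image-model similarity for one
    candidate template; declare a match when the fused score clears the
    threshold. Returns (matched, fused_score)."""
    score = w_text * text_sim + w_image * image_sim
    return score >= threshold, score

def best_template(candidates):
    """candidates: {template_name: (text_sim, image_sim)}. Returns the name of
    the best matching template, or None if no candidate clears the threshold
    (in which case a new template would be formed)."""
    best_name, best_score = None, 0.0
    for name, (t, i) in candidates.items():
        ok, score = match_template(t, i)
        if ok and score > best_score:
            best_name, best_score = name, score
    return best_name
```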
An information base:
the information base stores a large number of detailed bill templates for selection and calling. In addition, the information base stores the headers of detailed bills, which can be provided to the wireless form processing module when called; it also includes an interface for manual inquiry and modification, supporting manual maintenance of the contents of the information base.
A result storage module:
this section is primarily to keep bills processed through the itemized electronic system. The bills can become training data of the system after being labeled subsequently, and the training data is used for improving the accuracy of the structuring and the identification of the system.
As an implementation of the method shown in the above figures, the present application provides an embodiment of an apparatus for extracting structured information, which corresponds to the method embodiment shown in fig. 2, and which can be applied in various electronic devices.
The apparatus for extracting structured information of the present embodiment may include: a wireless table area identification unit, a semantic segmentation operation execution unit and a target structured information extraction unit. The wireless table area identification unit is configured to acquire an image to be processed and identify a wireless table area in the image to be processed; the semantic segmentation operation execution unit is configured to execute a semantic segmentation operation on the wireless table area by using a Deeplab model to obtain each segmented character block, wherein the Deeplab model can extract multi-scale features for segmenting to obtain each character block; and the target structured information extraction unit is configured to extract target structured information according to each character block.
In the apparatus for extracting structured information of the present embodiment, the detailed processing and the technical effects of the wireless table area identification unit, the semantic segmentation operation execution unit and the target structured information extraction unit can refer to the related descriptions of steps 201 to 203 in the embodiment corresponding to fig. 2, which are not described herein again.
In some optional implementations of this embodiment, the semantic segmentation operation execution unit may be further configured to: perform a feature extraction operation on the wireless table area by using a coding module to obtain a first feature; perform pooling processing on the first feature by using a spatial pyramid pooling module with atrous convolution to obtain multi-scale features; and perform an upsampling operation on the multi-scale features by using a decoding module, taking each obtained segmented image as a character block; wherein the Deeplab model comprises the encoding module, the spatial pyramid pooling module and the decoding module.
In some optional implementations of this embodiment, the apparatus for extracting structured information may further include: an inclination correction unit configured to perform inclination correction on the image to be processed by utilizing a Gliding Vertex algorithm and an RSDet algorithm before identifying the wireless table area in the image to be processed.
In some optional implementations of this embodiment, when the image to be processed is specifically a bill image to be processed, the wireless table area identification unit may be further configured to: process the bill image to be processed by using an erosion-dilation algorithm to obtain horizontal direction straight lines and vertical direction straight lines; determine a region in which a horizontal direction straight line and a vertical direction straight line in the bill image to be processed intersect as a wired table area; and determine an area in which a horizontal direction straight line does not intersect with a vertical direction straight line in the bill image to be processed as a wireless table area.
In some optional implementations of this embodiment, the target structured information extracting unit may include: the character block processing subunit is configured to execute frame detection operation, row and column alignment operation and character recognition operation on each character block in sequence to obtain the content of each character block; and the target structured information acquisition subunit is configured to obtain the target structured information according to the content arrangement of each text block.
In some optional implementations of this embodiment, the text block processing subunit includes a frame detection module configured to perform a frame detection operation on each text block, the frame detection module being further configured to: obtaining the edge coordinates of each character block by using a connected domain and a canny edge detection algorithm; and determining the frame of the corresponding character block according to the edge coordinates.
In some optional implementations of this embodiment, the apparatus for extracting structured information may further include: an existing template direct use unit configured to call the corresponding structured information extraction template to execute the structured information extraction operation on the image to be processed when a structured information extraction template corresponding to the type of the image to be processed is pre-stored; and a new template forming unit configured to form a new structured information extraction template corresponding to the type of the image to be processed according to the obtaining process of the target structured information when no structured information extraction template corresponding to the type of the image to be processed is pre-stored.
In some optional implementations of this embodiment, the apparatus for extracting structured information may further include: an existing template matching unit configured to determine whether the image to be processed matches any one of the pre-stored structured information extraction templates by using a character classification model constructed based on the BERT model and an image classification model constructed based on the Inception model.
The present embodiment exists as an apparatus embodiment corresponding to the method embodiment. Through the above technical solution, the apparatus for extracting structured information provided in this embodiment offers an automatic structured information extraction scheme for detailed bills and bills; in particular, for the wireless table area, by using a Deeplab model capable of extracting multi-scale features for segmenting each text block, the text block segmentation effect is better and the accuracy of the extracted structured information is improved.
According to an embodiment of the present application, an electronic device and a computer-readable storage medium are also provided.
Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
The electronic device includes: one or more processors, a memory, and interfaces for connecting the components, including high-speed interfaces and low-speed interfaces. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing portions of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system).
The memory is a non-transitory computer readable storage medium as provided herein. Wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the method for extracting structured information provided herein. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform the method for extracting structured information provided herein.
The memory, as a non-transitory computer-readable storage medium, may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as program instructions/modules corresponding to the method for extracting structured information in embodiments of the present application (e.g., the wireless table area identification unit, the semantic segmentation operation execution unit and the target structured information extraction unit). The processor executes various functional applications of the server and data processing by executing the non-transitory software programs, instructions, and modules stored in the memory, that is, implements the method for extracting structured information in the above method embodiments.
The memory may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store various types of data and the like created by the electronic device in executing the method for extracting the structured information. Further, the memory may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes a memory remotely located from the processor, and these remote memories may be connected over a network to an electronic device adapted to perform the method for extracting structured information. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device adapted to perform the method for extracting structured information may further comprise: an input device and an output device. The processor, memory, input device, and output device may be connected by a bus or other means.
The input means may receive input numeric or character information and generate key signal inputs related to user settings and function control of an electronic device suitable for performing the method for extracting structured information, such as input means like a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointer, one or more mouse buttons, a track ball, a joystick or the like. The output devices may include a display device, auxiliary lighting devices (e.g., LEDs), and haptic feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
The embodiment of the application provides an automatic structured information extraction scheme for detailed bills and bills. In particular, for the wireless table area, a Deeplab model capable of extracting multi-scale features for segmenting each character block is used, so that the character block segmentation effect is better and the accuracy of the extracted structured information is improved.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, and the present invention is not limited thereto as long as the desired results of the technical solutions disclosed in the present application can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (18)

1. A method for extracting structured information, comprising:
acquiring an image to be processed, and identifying and obtaining a wireless table area in the image to be processed;
performing semantic segmentation operation on the wireless table area by using a Deeplab model to obtain each segmented character block; the Deeplab model can extract multi-scale features for segmentation to obtain each character block;
and extracting target structured information according to each character block.
2. The method of claim 1, wherein performing semantic segmentation on the wireless table region using a Deeplab model to obtain segmented text blocks comprises:
performing feature extraction operation on the wireless table area by using a coding module to obtain a first feature;
performing pooling processing on the first feature by using a spatial pyramid pooling module with atrous convolution to obtain a multi-scale feature;
performing an upsampling operation on the multi-scale features by using a decoding module, and taking each obtained segmented image as a character block;
wherein the Deeplab model comprises the encoding module, the spatial pyramid pooling module, and the decoding module.
3. The method of claim 1, wherein before identifying the wireless table region in the image to be processed, further comprising:
and performing inclination correction on the image to be processed by utilizing a Gliding Vertex algorithm and an RSDet algorithm.
4. The method of claim 1, wherein identifying a wireless form area in the image to be processed when the image to be processed is specifically a bill image to be processed comprises:
processing the bill image to be processed by using an erosion-dilation algorithm to obtain a horizontal direction straight line and a vertical direction straight line;
determining an area where the horizontal direction straight line and the vertical direction straight line in the bill image to be processed intersect as a wired table area;
and determining the area, in the bill image to be processed, of which the horizontal direction straight line does not intersect with the vertical direction straight line as a wireless table area.
5. The method of claim 1, wherein extracting target structured information from each of the text blocks comprises:
sequentially performing frame detection operation, row and column alignment operation and character recognition operation on each character block to obtain the content of each character block;
and obtaining the target structured information according to the content arrangement of each character block.
6. The method of claim 5, wherein performing the bounding box detection operation on a character block comprises:
obtaining the edge coordinates of each character block by using connected component analysis and a Canny edge detection algorithm;
and determining the bounding box of the corresponding character block according to the edge coordinates.
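The connected-component half of claim 6 can be sketched as a flood fill that labels each blob of foreground pixels and records its bounding box; the Canny edge detection the claim pairs it with is omitted here. Function name and the 4-connectivity choice are ours, not the patent's:

```python
from collections import deque

def text_block_boxes(mask):
    """Label 4-connected foreground regions in a binary mask and return
    each region's bounding box as (top, left, bottom, right), inclusive."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for sy in range(h):
        for sx in range(w):
            if mask[sy][sx] and not seen[sy][sx]:
                # BFS flood fill collecting one component.
                top, left, bottom, right = sy, sx, sy, sx
                queue = deque([(sy, sx)])
                seen[sy][sx] = True
                while queue:
                    y, x = queue.popleft()
                    top, bottom = min(top, y), max(bottom, y)
                    left, right = min(left, x), max(right, x)
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                boxes.append((top, left, bottom, right))
    return boxes
```

Each returned box is one candidate character block; the subsequent row-and-column alignment of claim 5 would then snap these boxes into a table grid.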
7. The method of any of claims 1 to 6, further comprising:
when a structured information extraction template corresponding to the type of the image to be processed is pre-stored, invoking the corresponding structured information extraction template to perform the structured information extraction operation on the image to be processed;
and when no structured information extraction template corresponding to the type of the image to be processed is pre-stored, forming a new structured information extraction template corresponding to the type of the image to be processed from the process of obtaining the target structured information.
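The template reuse of claim 7 is, at its core, a cache keyed by image type: reuse a stored extraction template when one exists, otherwise run the full pipeline once and register the result as a new template. A hypothetical sketch, where `build_template` stands in for the entire claim-1 pipeline (segmentation plus structured information extraction):

```python
def extract_with_templates(image_type, templates, build_template):
    """Return the extraction template for image_type, reusing a
    pre-stored one when available and building + caching a new one
    otherwise (claim 7). `templates` is a mutable dict cache."""
    template = templates.get(image_type)
    if template is None:
        # No pre-stored template for this type: derive one from the
        # full extraction process and remember it for later images.
        template = build_template(image_type)
        templates[image_type] = template
    return template
```

Claim 8 adds the matching step that decides which cache key an incoming image corresponds to, using a BERT-based text classifier and an Inception-based image classifier.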
8. The method of claim 7, further comprising:
and determining whether the image to be processed matches any pre-stored structured information extraction template by using a text classification model built on the BERT model and an image classification model built on the Inception model.
9. An apparatus for extracting structured information, comprising:
the wireless table area identification unit is configured to acquire an image to be processed and identify a wireless table area in the image to be processed;
a semantic segmentation operation execution unit configured to perform a semantic segmentation operation on the wireless table area by using a Deeplab model to obtain segmented character blocks, wherein the Deeplab model extracts multi-scale features for segmentation to obtain each character block;
and the target structured information extraction unit is configured to extract target structured information according to each character block.
10. The apparatus of claim 9, wherein the semantic segmentation operation execution unit is further configured to:
perform a feature extraction operation on the wireless table area by using an encoding module to obtain a first feature;
perform pooling processing on the first feature by using a spatial pyramid pooling module based on atrous (dilated) convolution to obtain multi-scale features;
perform an upsampling operation on the multi-scale features by using a decoding module, and take each segmented image thus obtained as a character block;
wherein the Deeplab model comprises the encoding module, the spatial pyramid pooling module, and the decoding module.
11. The apparatus of claim 9, further comprising:
and the tilt correction unit is configured to perform tilt correction on the image to be processed by using a Gliding Vertex algorithm and an RSDet algorithm before the wireless table area in the image to be processed is identified.
12. The apparatus of claim 9, wherein, when the image to be processed is specifically a bill image to be processed, the wireless table area identification unit is further configured to:
process the bill image to be processed by using an erosion-dilation algorithm to obtain horizontal straight lines and vertical straight lines;
determine each area of the bill image to be processed in which the horizontal straight lines intersect the vertical straight lines as a wired table area;
and determine each area of the bill image to be processed in which the horizontal straight lines do not intersect the vertical straight lines as a wireless table area.
13. The apparatus of claim 9, wherein the target structured information extraction unit comprises:
the character block processing subunit is configured to sequentially perform a bounding box detection operation, a row-and-column alignment operation, and a character recognition operation on each character block to obtain the content of each character block;
and the target structured information acquisition subunit is configured to arrange the content of each character block to obtain the target structured information.
14. The apparatus of claim 13, wherein the character block processing subunit comprises a bounding box detection module configured to perform the bounding box detection operation on each of the character blocks, the bounding box detection module being further configured to:
obtain the edge coordinates of each character block by using connected component analysis and a Canny edge detection algorithm;
and determine the bounding box of the corresponding character block according to the edge coordinates.
15. The apparatus of any of claims 9 to 14, further comprising:
an existing-template direct use unit configured to, when a structured information extraction template corresponding to the type of the image to be processed is pre-stored, invoke the corresponding structured information extraction template to perform the structured information extraction operation on the image to be processed;
and a new template forming unit configured to, when no structured information extraction template corresponding to the type of the image to be processed is pre-stored, form a new structured information extraction template corresponding to the type of the image to be processed from the process of obtaining the target structured information.
16. The apparatus of claim 15, further comprising:
and the existing-template matching unit is configured to determine whether the image to be processed matches any pre-stored structured information extraction template by using a text classification model built on the BERT model and an image classification model built on the Inception model.
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method for extracting structured information of any one of claims 1-8.
18. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method for extracting structured information of any one of claims 1-8.
CN202010588634.9A 2020-06-24 2020-06-24 Method, apparatus, device and readable storage medium for extracting structured information Active CN111753727B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010588634.9A CN111753727B (en) 2020-06-24 2020-06-24 Method, apparatus, device and readable storage medium for extracting structured information

Publications (2)

Publication Number Publication Date
CN111753727A true CN111753727A (en) 2020-10-09
CN111753727B CN111753727B (en) 2023-06-23

Family

ID=72677070

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010588634.9A Active CN111753727B (en) 2020-06-24 2020-06-24 Method, apparatus, device and readable storage medium for extracting structured information

Country Status (1)

Country Link
CN (1) CN111753727B (en)

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112541332A (en) * 2020-12-08 2021-03-23 北京百度网讯科技有限公司 Form information extraction method and device, electronic equipment and storage medium
CN112712080A (en) * 2021-01-08 2021-04-27 北京匠数科技有限公司 Character recognition processing method for acquiring image by moving character screen
CN112949415A (en) * 2021-02-04 2021-06-11 北京百度网讯科技有限公司 Image processing method, apparatus, device and medium
CN112949450A (en) * 2021-02-25 2021-06-11 北京百度网讯科技有限公司 Bill processing method, bill processing device, electronic device and storage medium
CN112989921A (en) * 2020-12-31 2021-06-18 上海智臻智能网络科技股份有限公司 Target image information identification method and device
CN113094523A (en) * 2021-03-19 2021-07-09 北京达佳互联信息技术有限公司 Resource information acquisition method and device, electronic equipment and storage medium
CN113095267A (en) * 2021-04-22 2021-07-09 上海携宁计算机科技股份有限公司 Data extraction method of statistical chart, electronic device and storage medium
CN113221743A (en) * 2021-05-12 2021-08-06 北京百度网讯科技有限公司 Table analysis method and device, electronic equipment and storage medium
CN113392811A (en) * 2021-07-08 2021-09-14 北京百度网讯科技有限公司 Table extraction method and device, electronic equipment and storage medium
CN113408446A (en) * 2021-06-24 2021-09-17 成都新希望金融信息有限公司 Bill accounting method and device, electronic equipment and storage medium
CN113435240A (en) * 2021-04-13 2021-09-24 北京易道博识科技有限公司 End-to-end table detection and structure identification method and system
CN113537097A (en) * 2021-07-21 2021-10-22 泰康保险集团股份有限公司 Information extraction method, device, medium and electronic equipment for image
CN113553428A (en) * 2021-06-30 2021-10-26 北京百度网讯科技有限公司 Document classification method and device and electronic equipment
CN113627439A (en) * 2021-08-11 2021-11-09 北京百度网讯科技有限公司 Text structuring method, processing device, electronic device and storage medium
CN113902046A (en) * 2021-12-10 2022-01-07 北京惠朗时代科技有限公司 Special effect font recognition method and device
CN114332884A (en) * 2022-03-09 2022-04-12 腾讯科技(深圳)有限公司 Document element identification method, device, equipment and storage medium
CN116758578A (en) * 2023-08-18 2023-09-15 上海楷领科技有限公司 Mechanical drawing information extraction method, device, system and storage medium
CN117743627A (en) * 2024-02-19 2024-03-22 畅捷通信息技术股份有限公司 Automatic extraction and import method, system and medium for bank statement data

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100014784A1 (en) * 1999-09-17 2010-01-21 Silverbrook Research Pty Ltd. Sensing Device For Subsampling Imaged Coded Data
CN105574513A (en) * 2015-12-22 2016-05-11 北京旷视科技有限公司 Character detection method and device
CN106446881A (en) * 2016-07-29 2017-02-22 北京交通大学 Method for extracting lab test result from medical lab sheet image
CN109961008A (en) * 2019-02-13 2019-07-02 平安科技(深圳)有限公司 Form analysis method, medium and computer equipment based on text location identification
CN110309824A (en) * 2019-07-02 2019-10-08 北京百度网讯科技有限公司 Character detecting method, device and terminal
CN110569846A (en) * 2019-09-16 2019-12-13 北京百度网讯科技有限公司 Image character recognition method, device, equipment and storage medium
CN110738092A (en) * 2019-08-06 2020-01-31 深圳市华付信息技术有限公司 invoice text detection method
CN110796031A (en) * 2019-10-11 2020-02-14 腾讯科技(深圳)有限公司 Table identification method and device based on artificial intelligence and electronic equipment
CN111241365A (en) * 2019-12-23 2020-06-05 望海康信(北京)科技股份公司 Table picture analysis method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
谭致远 (TAN Zhiyuan); 刘丰威 (LIU Fengwei); 潘炜 (PAN Wei): "Research on Recognition Algorithms for Power-Supply Forms" (供电表单类识别算法研究), China High-Tech (中国高新科技), no. 16 *

Also Published As

Publication number Publication date
CN111753727B (en) 2023-06-23

Similar Documents

Publication Publication Date Title
CN111753727B (en) Method, apparatus, device and readable storage medium for extracting structured information
CN111860506B (en) Method and device for recognizing characters
CN113657390B (en) Training method of text detection model and text detection method, device and equipment
CN114155543B (en) Neural network training method, document image understanding method, device and equipment
WO2021203863A1 (en) Artificial intelligence-based object detection method and apparatus, device, and storage medium
CN112528976B (en) Text detection model generation method and text detection method
CN111488826A (en) Text recognition method and device, electronic equipment and storage medium
CN113221743B (en) Table analysis method, apparatus, electronic device and storage medium
CN111626027B (en) Table structure restoration method, device, equipment, system and readable storage medium
US20210209401A1 (en) Character recognition method and apparatus, electronic device and computer readable storage medium
CN111783645A (en) Character recognition method and device, electronic equipment and computer readable storage medium
CN113627439A (en) Text structuring method, processing device, electronic device and storage medium
CN111767858A (en) Image recognition method, device, equipment and computer storage medium
CN112380566A (en) Method, apparatus, electronic device, and medium for desensitizing document image
CN111753911A (en) Method and apparatus for fusing models
CN114863437A (en) Text recognition method and device, electronic equipment and storage medium
CN112115921A (en) True and false identification method and device and electronic equipment
JP2023543964A (en) Image processing method, image processing device, electronic device, storage medium and computer program
CN113239807B (en) Method and device for training bill identification model and bill identification
CN111563453B (en) Method, apparatus, device and medium for determining table vertices
CN114842482B (en) Image classification method, device, equipment and storage medium
CN111523292A (en) Method and device for acquiring image information
CN113657398B (en) Image recognition method and device
CN114708580A (en) Text recognition method, model training method, device, apparatus, storage medium, and program
CN113887394A (en) Image processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant