CN114241497A - Table sequence identification method and system based on context attention mechanism


Info

Publication number: CN114241497A
Application number: CN202111322144.5A
Authority: CN (China)
Legal status: Pending
Prior art keywords: sequence, context, attention, network model, convolutional layer
Other languages: Chinese (zh)
Inventors: 万洪林 (Wan Honglin), 仲宗峰 (Zhong Zongfeng), 孙景生 (Sun Jingsheng), 张理继 (Zhang Liji)
Current Assignee: Shandong Normal University
Original Assignee: Shandong Normal University
Application filed by Shandong Normal University

Classifications

    • G06N3/04 Architecture, e.g. interconnection topology (G PHYSICS; G06 COMPUTING; G06N computing arrangements based on specific computational models; G06N3/02 neural networks)
        • G06N3/044 Recurrent networks, e.g. Hopfield networks
        • G06N3/045 Combinations of networks
        • G06N3/047 Probabilistic or stochastic networks
        • G06N3/048 Activation functions


Abstract

The invention discloses a table sequence recognition method and system based on a context attention mechanism. The method comprises: acquiring a table image to be recognized; and processing the table image with a trained table sequence recognition network model to obtain the recognized table structure and the content of each cell. The table sequence recognition network model adopts an interconnected encoder and decoder, wherein the encoder extracts features and generates a feature sequence, and the decoder performs sequence recognition. By converting the table structure into structured tags and recognizing them in a sequence-to-sequence manner, the method finally achieves table structure recognition and cell content aggregation.

Description

Table sequence identification method and system based on context attention mechanism
Technical Field
The invention relates to the technical field of table sequence identification, in particular to a table sequence identification method and system based on a context attention mechanism.
Background
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
As the most common means of structuring information in daily life, tables play an extremely important role in every aspect of society, particularly in the structured representation of data such as statistical statements, bank transaction records, and laboratory sheets in medical examination reports. Tables can be classified into wired tables, wireless tables, and combinations of the two. A table represents structured data through a row-column layout that makes values easy to compare visually; tabulating data information helps in understanding, comparing, and extracting key data. Faced with large amounts of table data, digitizing table images and extracting their key information has become a problem to be solved urgently.
In recent years, with the progress of deep learning, OCR (Optical Character Recognition) technology has matured; text detection and recognition in particular are applied in a large number of everyday scenarios and have become increasingly practical.
With the development of OCR technology, the demand for table OCR is also increasing: large numbers of scanned import and export customs clearance documents, bank transaction and insurance notes, and medical examination report forms are processed every year. Table OCR mainly involves recognizing the table structure and detecting and recognizing the cell contents; there are two main existing approaches:
(1) Traditional methods extract table lines through morphological transformation, texture extraction, edge detection, and similar image operations, then generate the row-column structure from those lines. Such methods struggle to adapt to diverse table styles and can only handle fixed table layouts.
(2) Segmentation-based deep learning methods segment and classify the various table lines and then aggregate them into a table structure. These methods generalize poorly: a corresponding segmentation dataset must be labeled for each new type of table.
Disclosure of Invention
To address the defects of the prior art, the invention provides a table sequence recognition method and system based on a context attention mechanism. The table is described in a structured language such as html, so that table recognition becomes recognition of the structured tags. By converting the table structure into structured tags and recognizing them in a sequence-to-sequence manner, the method finally achieves table structure recognition and cell content aggregation.
In a first aspect, the invention provides a table sequence identification method based on a context attention mechanism;
the table sequence identification method based on the context attention mechanism comprises the following steps:
acquiring a form image to be identified;
processing the table image to be recognized by adopting the trained table sequence recognition network model to obtain a recognized table structure and the content of each cell;
the table sequence recognition network model is realized by adopting an encoder and a decoder which are connected with each other, wherein the encoder is used for extracting features and generating a feature sequence; the decoder is used to realize the identification of the sequence.
In a second aspect, the present invention provides a table sequence identification system based on a contextual attention mechanism;
a table sequence identification system based on a contextual attention mechanism, comprising:
an acquisition module configured to: acquiring a form image to be identified;
an identification module configured to: processing the table image to be recognized by adopting the trained table sequence recognition network model to obtain a recognized table structure and the content of each cell;
the table sequence recognition network model is realized by adopting an encoder and a decoder which are connected with each other, wherein the encoder is used for extracting features and generating a feature sequence; the decoder is used to realize the identification of the sequence.
In a third aspect, the present invention further provides an electronic device, including:
a memory for non-transitory storage of computer readable instructions; and
a processor for executing the computer readable instructions,
wherein the computer readable instructions, when executed by the processor, perform the method of the first aspect.
In a fourth aspect, the present invention also provides a storage medium storing non-transitory computer readable instructions, wherein the non-transitory computer readable instructions, when executed by a computer, perform the instructions of the method of the first aspect.
Compared with the prior art, the invention has the beneficial effects that:
The invention uses a sequence recognition network based on a context attention mechanism to achieve sequence-to-sequence recognition of the table structure from the table image, improving the efficiency and accuracy of table structure recognition in table OCR with stronger generalization.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and together with the description serve to explain, not limit, the invention.
FIG. 1 is a flowchart of a method according to a first embodiment of the present invention;
FIG. 2 is a schematic diagram of a sample and label of a data set according to a first embodiment of the present invention;
FIG. 3 is a diagram illustrating an overall network structure according to a first embodiment of the present invention;
FIG. 4 is a diagram illustrating a bottleneck residual module according to a first embodiment of the present invention;
FIG. 5 shows a CotNet50vd network structure according to a first embodiment of the present invention;
FIG. 6 is a Cot module according to a first embodiment of the invention;
FIG. 7 is an Attention-head according to a first embodiment of the present invention;
FIG. 8 illustrates a GRU in accordance with a first embodiment of the present invention;
FIGS. 9(a) -9 (f) are screenshots of the loss function and the acc change of the first embodiment of the present invention;
fig. 10 is a visualization of the first embodiment of the present invention.
Detailed Description
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
All data in the embodiments are obtained and applied in compliance with laws, regulations, and user consent.
Example one
The embodiment provides a table sequence identification method based on a context attention mechanism;
as shown in fig. 1, the table sequence identification method based on the context attention mechanism includes:
s101: acquiring a form image to be identified;
s102: processing the table image to be recognized by adopting the trained table sequence recognition network model to obtain a recognized table structure and the content of each cell;
the table sequence recognition network model is realized by adopting an encoder and a decoder which are connected with each other, wherein the encoder is used for extracting features and generating a feature sequence; the decoder is used to realize the identification of the sequence.
The model adopts an encoder-decoder structure overall. The encoding layer is mainly a CNN used to extract features and generate a feature sequence; the decoding layer is a recurrent neural network based on an attention mechanism that performs sequence-to-sequence recognition. The overall network structure is shown in fig. 3.
Further, the encoder adopts CotNet50vd, which improves on the ResNet50 network by replacing the 3 × 3 convolutional layer of the original ResNet50 with a context-based attention structure, the Cot module.
Further, the encoder comprises, connected in sequence: a convolutional layer Conv_1; a max pooling layer maxpool; convolutional layers Conv_21, Conv_22, Conv_23; convolutional layers Conv_31, Conv_32, Conv_33, Conv_34; convolutional layers Conv_41, Conv_42, Conv_43, Conv_44, Conv_45, Conv_46; convolutional layers Conv_51, Conv_52, Conv_53; an average pooling layer avgpool; and a softmax activation function layer.
Further, the internal structure is the same for each of the convolutional layers Conv_21 through Conv_53 listed above.
Further, the convolutional layer Conv_21 includes a 1 × 1 convolutional layer a1, a context-based attention structure Cot module, a 1 × 1 convolutional layer a2, and an adder J1, connected in sequence; the input of the 1 × 1 convolutional layer a1 is also connected to an input of the adder J1 through a serially connected average pooling layer and a 1 × 1 convolutional layer a3. The input of the 1 × 1 convolutional layer a1 is the input of convolutional layer Conv_21, and the output of the adder J1 is the output of convolutional layer Conv_21.
Illustratively, the coding layer is a CNN backbone network configured to extract features, producing a feature map and finally a feature sequence of the input image. The backbone adopts CotNet50vd, an improvement on the ResNet50 network: the 3 × 3 convolution in the original ResNet bottleneck is replaced with a context-based, transformer-like attention structure (Cot block), and a 2 × 2 average pooling layer is added before the downsampling 1 × 1 convolution of each bottleneck residual module. This yields the CotNet50vd network and its bottleneck residual module, as shown in fig. 4.
The network body comprises convolutional layers, pooling layers, and activation layers. The input image first passes through a 7 × 7 × 64 convolution, then a 3 × 3 max pooling layer; next come 16 building blocks arranged 3 + 4 + 6 + 3, each block containing 3 layers, for 16 × 3 = 48 layers; finally an average pooling layer and a fully connected layer. Counting only the convolutional and fully connected layers, the whole backbone has 1 + 48 + 1 = 50 layers; its structure is shown in fig. 5.
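The 50-layer count above can be checked with a few lines of arithmetic; a minimal sketch of the 3 + 4 + 6 + 3 stage layout described in this paragraph:

```python
# Counting the conv/fc layers of the backbone described above:
# 1 stem convolution + (3+4+6+3) bottleneck blocks of 3 conv layers each
# + 1 fully connected layer = 50.
blocks_per_stage = [3, 4, 6, 3]
layers_per_block = 3
block_layers = sum(blocks_per_stage) * layers_per_block
total_layers = 1 + block_layers + 1
print(block_layers, total_layers)  # 48 50
```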
Furthermore, the context-based attention structure Cot module comprises three parallel branches, wherein a convolutional layer b1 is arranged on the first branch; a connector concat, a convolutional layer b2, a convolutional layer b3, a multiplier f1, and an adder J2, connected in sequence, are arranged on the second branch; and a convolutional layer b3 is arranged on the third branch.
The input ends of the convolutional layer b1, the connector concat, and the convolutional layer b3 are all connected to the input end of the context-based attention structure Cot module;
the output end of the convolutional layer b1 is connected to an input end of the connector concat;
the output end of the convolutional layer b1 is also connected to an input end of the adder J2;
the output end of the convolutional layer b3 is connected to an input end of the multiplier f1;
the output of the adder J2 is the output of the context-based attention structure Cot module.
Further, the context-based attention structure Cot module encodes context information from the input value through a 3 × 3 convolutional layer b1, obtaining a static context representation of the input; it then concatenates this static context representation with the input value and learns a dynamic multi-head attention matrix through two consecutive 1 × 1 convolutional layers; the resulting dynamic multi-head attention matrix is multiplied with the input value to obtain a dynamic context representation of the input; finally, the static and dynamic context representations are fused to produce the output value.
Illustratively, the Cot module is an attention module with a Transformer-like structure; it makes full use of contextual information to guide the learning of a dynamic attention matrix and enhances visual representation ability. It first encodes the input keys through a 3 × 3 convolution to obtain a static context representation of the input; it then concatenates the encoded keys with the input query and learns a dynamic multi-head attention matrix through two consecutive 1 × 1 convolutions; the resulting attention matrix is multiplied with the input values to obtain a dynamic context representation of the input. The fusion of the static and dynamic context representations is the module output. The Cot module is shown in fig. 6.
First, for the input features, three variables are defined: K (keys), Q (query), and V (values).
The keys are encoded by a k × k (here 3 × 3) convolution, yielding a static representation of the local context information, denoted K1.
Then, K1 and Q are concat, and then the result y of concat is subjected to two consecutive 1 × 1 convolution operations, and the result is recorded as a 2:
y=concat(K1,Q);
a2=[K1,Q]W(1*1)W(1*1)
Unlike conventional Self-Attention, the a2 matrix here is derived from the interaction of the Q information with the local context information K1, rather than merely modeling the relationship between Q and K; that is, the self-attention mechanism is enhanced by being guided with local context modeling.
Then, multiplying a2 and V, we get dynamic context modeling:
Figure BDA0003345679350000081
Finally, the output is the sum of the local static context modeling K1 and the global dynamic context modeling K2:
Y = K1 + K2
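The Cot computation above can be sketched in a few lines of numpy. This is a simplified, single-head toy version under stated assumptions: the real module uses grouped convolutions, normalization, and activation layers, and here K = Q = V = the module input, with random weights purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, w):
    # x: (C_in, H, W), w: (C_out, C_in); a 1x1 convolution is a per-pixel linear map
    return np.einsum('oc,chw->ohw', w, x)

def conv3x3(x, w):
    # naive 3x3 'same' convolution; x: (C_in, H, W), w: (C_out, C_in, 3, 3)
    C, H, W = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], H, W))
    for i in range(H):
        for j in range(W):
            out[:, i, j] = np.tensordot(w, xp[:, i:i + 3, j:j + 3], axes=3)
    return out

def cot(x, C=8):
    K1 = conv3x3(x, rng.standard_normal((C, C, 3, 3)) * 0.1)      # static context
    y = np.concatenate([K1, x], axis=0)                           # concat(K1, Q)
    a2 = conv1x1(conv1x1(y, rng.standard_normal((C, 2 * C)) * 0.1),
                 rng.standard_normal((C, C)) * 0.1)               # two 1x1 convs
    K2 = x * a2                                                   # dynamic context: V x a2
    return K1 + K2                                                # Y = K1 + K2

x = rng.standard_normal((8, 5, 5))
out = cot(x)
print(out.shape)  # (8, 5, 5)
```

The sketch keeps the input and output shapes equal, matching the module's role as a drop-in replacement for a 3 × 3 convolution.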
Further, the decoder combines an attention mechanism model (Attention) with a GRU. The feature sequence extracted by the encoder serves as the decoder input; the attention model updates the output weights according to the output of each GRU layer and the current input sequence; the GRU recognizes the feature sequence into an html sequence.
The decoder comprises a plurality of GRU units which are connected in sequence;
the input end of the t-th attention mechanism model is connected to the output end of the encoder and to the output end of the (t-1)-th GRU unit, where t is a positive integer greater than 1;
the output end of the t-th attention mechanism model is connected to an input end of the t-th multiplier;
the other input end of the t-th multiplier is connected to the output end of the encoder;
the output end of the t-th multiplier is connected to the input end of the t-th GRU unit;
the output end of the t-th GRU unit outputs the html tag.
Illustratively, the decoding part adopts an Attention + GRU structure, constituting the Attention-Head for sequence recognition. The feature sequence generated by the coding layer's feature extraction serves as the input of the decoding part. The Attention part updates the output weights according to the output of each GRU layer and the current input sequence. The GRU, an improved variant of the RNN (recurrent neural network), recognizes the feature sequence into the html sequence. The Attention-Head is shown in fig. 7.
The output weight α_t is computed as:
α_t = Attention(h_{t-1}, x_t)
That is, the current output weight depends on the previous GRU output h_{t-1} and the current input sequence x_t; the specific relation is:
Attention(h_{t-1}, x_t) = softmax(linear(tanh(linear([h_{t-1}, x_t]))))
The GRU input is the current input sequence weighted by α_t:
g_t = x_t × α_t
and h_t is updated through the GRU recurrence:
h_t = GRU(g_t, h_{t-1})
Finally, the probability density is computed through softmax to output the table structure sequence prediction; after the outputs of all GRUs are concatenated, the cell coordinate prediction is output through a sigmoid function.
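One attention step of the decoder can be sketched in numpy. This is a toy version under stated assumptions: the dimension D and the random weight matrices W1, W2 are illustrative, not values from the patent.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 16  # feature / hidden size (illustrative)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

W1 = rng.standard_normal((D, 2 * D)) * 0.1  # inner linear layer
W2 = rng.standard_normal((D, D)) * 0.1      # outer linear layer

def attention(h_prev, x_t):
    # alpha_t = softmax(linear(tanh(linear([h_{t-1}, x_t]))))
    z = np.concatenate([h_prev, x_t])
    return softmax(W2 @ np.tanh(W1 @ z))

h_prev = np.zeros(D)
x_t = rng.standard_normal(D)
alpha_t = attention(h_prev, x_t)
g_t = x_t * alpha_t  # g_t = x_t x alpha_t, the input actually fed to the GRU
print(round(alpha_t.sum(), 6))  # 1.0, since the softmax normalizes the weights
```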
The GRU is an improved variant of the RNN that alleviates the vanishing-gradient problem of the RNN while requiring less computation than the LSTM; the GRU structure is shown in fig. 8, where g_t is the input at the current time, h_{t-1} is the hidden state at the previous time, and h_t is the hidden state at the current time.
When computing the current hidden state, a candidate state h̃_t is first computed. The candidate state depends on the value of the reset gate, which is computed as:
r_t = σ(W_r [h_{t-1}, g_t])
where the sigmoid σ limits the value to [0, 1]. The candidate state is then:
h̃_t = tanh(W_h [r_t × h_{t-1}, g_t])
After the candidate state is obtained, the updated gate value is used to control how much information of the previous hidden state can be transmitted to the current hidden state, and the similar weight coefficient. The update gate calculation function is as follows:
zt=σ(Wz[ht-1,gt])
the final output current hidden state is as follows:
Figure BDA0003345679350000092
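The gate equations above can be implemented from scratch in a few lines of numpy; a minimal sketch in which the dimension D and the random weight initialization are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
D = 8  # hidden size (illustrative)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

Wr = rng.standard_normal((D, 2 * D)) * 0.1  # reset-gate weights
Wz = rng.standard_normal((D, 2 * D)) * 0.1  # update-gate weights
Wh = rng.standard_normal((D, 2 * D)) * 0.1  # candidate-state weights

def gru_step(g_t, h_prev):
    hg = np.concatenate([h_prev, g_t])
    r_t = sigmoid(Wr @ hg)                                       # reset gate in (0, 1)
    z_t = sigmoid(Wz @ hg)                                       # update gate in (0, 1)
    h_cand = np.tanh(Wh @ np.concatenate([r_t * h_prev, g_t]))   # candidate state
    return (1 - z_t) * h_prev + z_t * h_cand                     # blend old and candidate

h = np.zeros(D)
for _ in range(5):
    h = gru_step(rng.standard_normal(D), h)
print(h.shape, bool(np.all(np.abs(h) <= 1.0)))  # (8,) True
```

Since the state starts at zero and each update is a convex combination of the previous state and a tanh-bounded candidate, every entry of h stays in [-1, 1].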
Further, the training of the table sequence recognition network model includes the following steps:
constructing a training set and a test set; the training set and the test set are both table images of known labels;
inputting the training set into a table sequence recognition network model for training, and stopping training when the loss function of the network model does not decrease any more;
and inputting the test set into the form sequence recognition network model for testing, stopping testing when the evaluation index of the network model meets the test requirement, and recording that the current network model is the trained form sequence recognition network model.
Further, for constructing the training set and the test set, a Chinese medical document dataset is selected; the training and test images each contain a header with patient information in five rows and four columns, and a table body containing the test results.
Illustratively, a CMDD dataset (Chinese Medical Documents Dataset) is selected to meet the needs of the application scenario. The CMDD is a medical laboratory report image dataset containing 238 document images; each image includes a header listing patient information in five rows and four columns, and a detailed table body reporting the test results.
Further, the known label is the structured html language, which includes the cell contents, the table coordinate information, the table structure, and the complete information of the table;
the table coordinate information comprises the coordinates of the top-left point and the bottom-right point of each cell frame.
Illustratively, the table body of each document image is cropped out, giving 238 medical laboratory sheet images in total as the dataset, of which 70% are used for training and 30% for validation. The label of each sample is the structured html language, mainly comprising the cell contents, coordinate information, table structure, and complete table information. The coordinates of a cell are [x1, y1, x2, y2], where [x1, y1] is the top-left corner of the cell frame and [x2, y2] is the bottom-right corner. The dataset samples and labels are shown in fig. 2.
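A label of the shape just described might look as follows. This is a hypothetical layout: the key names ("html", "cells", "bbox") and the sample values are illustrative, not taken from the actual CMDD annotation files.

```python
# Hypothetical structure of one training label, following the description above.
label = {
    "html": ["<table>", "<tr>", "<td>", "</td>", "</tr>", "</table>"],  # structure tokens
    "cells": [
        {"text": "WBC 6.2", "bbox": [34, 18, 120, 42]},  # [x1, y1, x2, y2]
    ],
}
x1, y1, x2, y2 = label["cells"][0]["bbox"]
print(x2 > x1 and y2 > y1)  # True: the top-left corner precedes the bottom-right corner
```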
Further, the loss function is equal to a weighted sum of the structural loss function and the coordinate loss function.
That is, the loss is divided into a structure loss and a coordinate loss, and the overall loss equals the weighted structure loss plus the weighted coordinate loss.
The structure loss adopts the cross-entropy loss, computed as:
structure_loss = -(1/N) Σ_{i=1}^{N} y_i log(ŷ_i)
The coordinate loss adopts the mean-squared loss, computed as:
loc_loss = (1/N) Σ_{i=1}^{N} (y_i - ŷ_i)²
The final loss function is:
total_loss=structure_loss×structure_weight+loc_loss×loc_weight
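A minimal numpy sketch of this weighted total loss; the example probabilities, boxes, and the unit weights are illustrative values, not taken from the patent.

```python
import numpy as np

def structure_loss(probs, targets):
    # cross-entropy over predicted structure-tag probabilities (integer class targets)
    return -np.mean(np.log(probs[np.arange(len(targets)), targets]))

def loc_loss(pred_boxes, gt_boxes):
    # mean-squared error over cell coordinates
    return np.mean((pred_boxes - gt_boxes) ** 2)

structure_weight, loc_weight = 1.0, 1.0  # illustrative weights

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
targets = np.array([0, 1])
pred_boxes = np.array([[0.1, 0.1, 0.5, 0.5]])
gt_boxes = np.array([[0.0, 0.1, 0.5, 0.6]])

total_loss = (structure_loss(probs, targets) * structure_weight
              + loc_loss(pred_boxes, gt_boxes) * loc_weight)
print(round(total_loss, 4))  # 0.2949
```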
Furthermore, the test evaluation index represents the table structure as a tree and measures the similarity between two trees by the distance between them.
Model training: the processed CMDD dataset images and label files are input into the network for training, with 500 epochs in total, learning rate lr = 0.001, and batch_size = 8. The computing environment is Nvidia driver 450.57, CUDA 11.0, and cuDNN 8.0; the table sequence recognition network is built with PaddlePaddle 2.1.2. The input training set is trained iteratively, the accuracy (acc) of the model on the validation set is checked, and the model with the highest validation acc is saved. During training, the learning rate, loss functions, and acc can be observed in real time with VisualDL: the structure loss and coordinate loss decrease rapidly and then stabilize, train_acc and eval_acc gradually increase, and best_acc reaches up to 0.95; the best model at that point is saved for testing. Screenshots of the loss functions and acc changes are shown in figs. 9(a)-9(f), where fig. 9(a) shows the training loss, fig. 9(b) the structure loss, fig. 9(c) the coordinate loss, fig. 9(d) the training acc, fig. 9(e) the validation acc, and fig. 9(f) the highest acc on the validation set.
Model testing: the training model is converted into an inference model and saved, and the evaluation-index test and the visualization test are performed. The evaluation index adopted is TEDS:
TEDS(T_a, T_b) = 1 - EditDist(T_a, T_b) / max(|T_a|, |T_b|)
The TEDS evaluation represents the table structure as a tree and measures the similarity between two trees by the distance between them, where EditDist denotes the tree edit distance and |T| denotes the number of nodes of T. The table structure TEDS on the validation set is 0.997.
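Given a precomputed tree edit distance, the TEDS score is a one-line formula; computing EditDist itself (e.g. with the Zhang-Shasha tree-edit-distance algorithm) is outside this sketch, and the example distances and node counts are illustrative.

```python
def teds(edit_dist, n_nodes_a, n_nodes_b):
    # TEDS(Ta, Tb) = 1 - EditDist(Ta, Tb) / max(|Ta|, |Tb|)
    return 1.0 - edit_dist / max(n_nodes_a, n_nodes_b)

print(teds(0, 10, 10))  # 1.0  (identical trees)
print(teds(3, 10, 12))  # 0.75 (3 edits against the larger, 12-node tree)
```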
The test table image is input into the test program, which outputs a json file containing the html structural representation of the table image. The html structure in the json file is then visualized to complete the visualization test; the visualization result is shown in fig. 10.
Example two
The embodiment provides a table sequence identification system based on a context attention mechanism;
a table sequence identification system based on a contextual attention mechanism, comprising:
an acquisition module configured to: acquiring a form image to be identified;
an identification module configured to: processing the table image to be recognized by adopting the trained table sequence recognition network model to obtain a recognized table structure and the content of each cell;
the table sequence recognition network model is realized by adopting an encoder and a decoder which are connected with each other, wherein the encoder is used for extracting features and generating a feature sequence; the decoder is used to realize the identification of the sequence.
It should be noted here that the above-mentioned acquiring module and the identifying module correspond to steps S101 to S102 in the first embodiment, and the above-mentioned modules are the same as the examples and application scenarios implemented by the corresponding steps, but are not limited to the disclosure of the first embodiment. It should be noted that the modules described above as part of a system may be implemented in a computer system such as a set of computer-executable instructions.
In the foregoing embodiments, the descriptions of the embodiments have different emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The proposed system can be implemented in other ways. For example, the above-described system embodiments are merely illustrative, and for example, the division of the above-described modules is merely a logical division, and in actual implementation, there may be other divisions, for example, multiple modules may be combined or integrated into another system, or some features may be omitted, or not executed.
EXAMPLE III
The present embodiment also provides an electronic device, including: one or more processors, one or more memories, and one or more computer programs; wherein, a processor is connected with the memory, the one or more computer programs are stored in the memory, and when the electronic device runs, the processor executes the one or more computer programs stored in the memory, so as to make the electronic device execute the method according to the first embodiment.
It should be understood that in this embodiment, the processor may be a central processing unit CPU, and the processor may also be other general purpose processors, digital signal processors DSP, application specific integrated circuits ASIC, off-the-shelf programmable gate arrays FPGA or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, and so on. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory may include both read-only memory and random access memory, and may provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store device type information.
In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or instructions in the form of software.
The method of the first embodiment may be implemented directly by a hardware processor, or by a combination of hardware and software modules in the processor. The software modules may reside in RAM, flash memory, ROM, PROM, EPROM, registers, or any other storage medium well known in the art. The storage medium is located in the memory; the processor reads the information in the memory and completes the steps of the method in combination with its hardware. To avoid repetition, details are not repeated here.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
Example four
This embodiment also provides a computer-readable storage medium storing computer instructions which, when executed by a processor, perform the method of the first embodiment.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A table sequence identification method based on a context attention mechanism, characterized by comprising:
acquiring a form image to be identified;
processing the table image to be recognized by adopting the trained table sequence recognition network model to obtain a recognized table structure and the content of each cell;
the table sequence recognition network model is realized by adopting an encoder and a decoder which are connected with each other, wherein the encoder is used for extracting features and generating a feature sequence, and the decoder is used for recognizing the sequence.
2. The method as claimed in claim 1, wherein the encoder is implemented by CotNet50vd; CotNet50vd is obtained by improving the ResNet50 network, with the context-based attention structure Cot module replacing the 3 × 3 convolutional layer of the original ResNet50 network.
3. The method as claimed in claim 2, wherein the context-based attention structure Cot module comprises three parallel branches, wherein a convolutional layer b1 is disposed on the first branch; a connector concat, a convolutional layer b2, a convolutional layer b3, a multiplier and an adder J2 are disposed on the second branch; and a convolutional layer b3 is disposed on the third branch;
the input ends of the convolutional layer b1, the connector concat and the convolutional layer b3 are all connected with the input end of the attention structure Cot module based on the context;
the output end of the convolution layer b1 is connected with the input end of the connector concat;
the output end of the convolutional layer b1 is connected with the input end of an adder J2;
the output end of the convolutional layer b3 is connected with the input end of the multiplier f1;
the output of the adder J2 is the output of the context-based attention structure Cot module.
4. The method according to claim 2, wherein the context-based attention structure Cot module encodes context information of the input value through a 3 × 3 convolutional layer b1 to obtain a static context expression of the input value; the static context expression is then concatenated with the input value, and a dynamic multi-head attention matrix is learned through two consecutive 1 × 1 convolutional layers; the obtained dynamic multi-head attention matrix is multiplied with the input value to obtain a dynamic context expression of the input; and the static context expression and the dynamic context expression are fused to obtain the final output value.
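The data flow of claim 4 can be sketched numerically. The following is a minimal NumPy sketch, not the patented implementation: all weights are random stand-ins, and the dynamic multi-head attention is simplified to a channel-wise softmax, so that the static path (3 × 3 convolution), the concatenation, the two 1 × 1 convolutions, the multiplication with the input, and the final fusion can be seen in order.

```python
import numpy as np

def conv3x3_same(x, w):
    # x: (C, H, W); w: (Cout, Cin, 3, 3); stride 1, zero padding 1
    C, H, W = x.shape
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((w.shape[0], H, W))
    for co in range(w.shape[0]):
        for i in range(H):
            for j in range(W):
                out[co, i, j] = np.sum(xp[:, i:i + 3, j:j + 3] * w[co])
    return out

def conv1x1(x, w):
    # pointwise (1x1) convolution: w is (Cout, Cin), applied over (Cin, H, W)
    return np.einsum('oc,chw->ohw', w, x)

def cot_block(x, rng):
    C, H, W = x.shape
    # 1) static context expression: 3x3 convolution over the input
    k_static = conv3x3_same(x, rng.standard_normal((C, C, 3, 3)) * 0.1)
    # 2) concatenate static context with the input, then two consecutive
    #    1x1 convolutions learn an attention map (simplified: channel softmax)
    cat = np.concatenate([k_static, x], axis=0)              # (2C, H, W)
    a = conv1x1(cat, rng.standard_normal((C, 2 * C)) * 0.1)
    attn = conv1x1(np.maximum(a, 0), rng.standard_normal((C, C)) * 0.1)
    attn = np.exp(attn) / np.exp(attn).sum(axis=0, keepdims=True)
    # 3) dynamic context expression: attention applied to the input values
    k_dynamic = attn * x
    # 4) fuse static and dynamic context for the final output
    return k_static + k_dynamic

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8))
y = cot_block(x, rng)
print(y.shape)  # (4, 8, 8)
```

The output has the same shape as the input, which is what allows the module to replace the 3 × 3 convolutional layer of ResNet50 in place.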
5. The method of claim 1, wherein the decoder is implemented by a combination of the Attention mechanism model Attention and a GRU; the feature sequence extracted by the encoder serves as the input of the decoder; the Attention mechanism model Attention updates the output weights according to the output of each layer of the GRU and the current input sequence; and the GRU is used to recognize the feature sequence into the html sequence.
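One decoding step of the Attention-plus-GRU scheme in claim 5 can be illustrated as follows. This is a hedged sketch with random parameters: a hand-rolled GRU cell and additive attention stand in for the Attention model, and all dimensions and parameter names (Wa, Ua, v, Wo) are hypothetical, not taken from the patent.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def gru_cell(x, h, W, U, b):
    # standard GRU cell; W: (3H, D), U: (3H, H), b: (3H,)
    Hd = h.shape[0]
    zr = W[:2 * Hd] @ x + U[:2 * Hd] @ h + b[:2 * Hd]
    z, r = 1 / (1 + np.exp(-zr[:Hd])), 1 / (1 + np.exp(-zr[Hd:]))
    n = np.tanh(W[2 * Hd:] @ x + U[2 * Hd:] @ (r * h) + b[2 * Hd:])
    return (1 - z) * n + z * h

def decode_step(feats, h, params):
    # additive attention: score each encoder feature against the hidden state
    scores = np.array([params['v'] @ np.tanh(params['Wa'] @ f + params['Ua'] @ h)
                       for f in feats])
    weights = softmax(scores)              # attention weights over the sequence
    context = weights @ feats              # weighted sum of encoder features
    h_new = gru_cell(context, h, params['W'], params['U'], params['b'])
    logits = params['Wo'] @ h_new          # scores over the html-token vocabulary
    return h_new, logits, weights

rng = np.random.default_rng(0)
D, H, V, T = 6, 8, 10, 5   # feature dim, hidden size, vocab size, sequence length
feats = rng.standard_normal((T, D))
params = dict(Wa=rng.standard_normal((H, D)) * 0.1,
              Ua=rng.standard_normal((H, H)) * 0.1,
              v=rng.standard_normal(H) * 0.1,
              W=rng.standard_normal((3 * H, D)) * 0.1,
              U=rng.standard_normal((3 * H, H)) * 0.1,
              b=np.zeros(3 * H),
              Wo=rng.standard_normal((V, H)) * 0.1)
h = np.zeros(H)
h, logits, w = decode_step(feats, h, params)
```

Running this step in a loop, feeding each emitted token back in, would produce the html sequence one token at a time; the attention weights are recomputed at every step, which is the "updates the output weights" behaviour the claim describes.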
6. The method of claim 1, wherein the training step of the trained table sequence recognition network model comprises:
constructing a training set and a test set; the training set and the test set are both table images of known labels;
inputting the training set into the table sequence recognition network model for training, and stopping training when the loss function of the network model no longer decreases;
and inputting the test set into the table sequence recognition network model for testing, stopping the test when the evaluation index of the network model meets the test requirement, and recording the current network model as the trained table sequence recognition network model.
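The stopping rule in the training step above ("stop when the loss no longer decreases") can be sketched with a toy model; the linear model, learning rate, tolerance, and patience value below are illustrative assumptions, not part of the claims.

```python
import numpy as np

# Toy stand-in for the stopping rule: fit a linear model by gradient
# descent and stop once the epoch loss stops improving for a few epochs.
rng = np.random.default_rng(0)
X = rng.standard_normal((64, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.standard_normal(64)

w = np.zeros(3)
best = np.inf
patience, bad_epochs = 3, 0
for epoch in range(500):
    pred = X @ w
    loss = float(np.mean((pred - y) ** 2))
    grad = 2 * X.T @ (pred - y) / len(y)
    w -= 0.05 * grad
    if loss < best - 1e-8:            # meaningful improvement: keep going
        best, bad_epochs = loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:    # loss no longer decreasing: stop training
            break
```

In practice the same logic wraps the table sequence recognition network's training loop, with the epoch loss of the real loss function in place of the toy mean squared error.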
7. The method of claim 6, wherein the loss function is equal to a weighted sum of a structure loss function and a coordinate loss function;
the test evaluation index represents the table structure as a tree structure and uses the tree edit distance to measure the similarity between two trees.
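The weighted loss and a tree-distance-style evaluation index of claim 7 can be sketched as follows. The weight alpha is a hypothetical choice, and a token-level edit distance over flattened html tokens is used as a simplified stand-in for a true tree edit distance over the table structure.

```python
import numpy as np

def total_loss(structure_loss, coord_loss, alpha=0.7):
    # weighted sum of the structure loss and the coordinate (cell-box) loss;
    # the weight alpha is an illustrative assumption, not fixed by the claims
    return alpha * structure_loss + (1 - alpha) * coord_loss

def edit_distance(a, b):
    # Levenshtein distance between two token sequences -- a simplified
    # stand-in for the tree edit distance used by the evaluation index
    d = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
    d[:, 0] = np.arange(len(a) + 1)
    d[0, :] = np.arange(len(b) + 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i, j] = min(d[i - 1, j] + 1, d[i, j - 1] + 1,
                          d[i - 1, j - 1] + (a[i - 1] != b[j - 1]))
    return int(d[-1, -1])

def similarity(pred, truth):
    # 1 minus the normalized distance: 1.0 means identical structures
    return 1.0 - edit_distance(pred, truth) / max(len(pred), len(truth))

pred  = ['<table>', '<tr>', '<td>', '</td>', '</tr>', '</table>']
truth = ['<table>', '<tr>', '<td>', '</td>', '<td>', '</td>', '</tr>', '</table>']
print(round(similarity(pred, truth), 3))  # 0.75
```

Here the predicted table is missing one cell relative to the ground truth, so two token insertions are needed and the similarity is 1 − 2/8 = 0.75.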
8. A table sequence identification system based on a context attention mechanism, characterized by comprising:
an acquisition module configured to: acquiring a form image to be identified;
an identification module configured to: processing the table image to be recognized by adopting the trained table sequence recognition network model to obtain a recognized table structure and the content of each cell;
the table sequence recognition network model is realized by adopting an encoder and a decoder which are connected with each other, wherein the encoder is used for extracting features and generating a feature sequence, and the decoder is used for recognizing the sequence.
9. An electronic device, comprising:
a memory for non-transitory storage of computer readable instructions; and
a processor for executing the computer readable instructions,
wherein the computer readable instructions, when executed by the processor, perform the method of any of claims 1-7.
10. A storage medium storing non-transitory computer-readable instructions, wherein the non-transitory computer-readable instructions, when executed by a computer, perform the method of any one of claims 1-7.
CN202111322144.5A 2021-11-09 2021-11-09 Table sequence identification method and system based on context attention mechanism Pending CN114241497A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111322144.5A CN114241497A (en) 2021-11-09 2021-11-09 Table sequence identification method and system based on context attention mechanism

Publications (1)

Publication Number Publication Date
CN114241497A true CN114241497A (en) 2022-03-25

Family

ID=80748938

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111322144.5A Pending CN114241497A (en) 2021-11-09 2021-11-09 Table sequence identification method and system based on context attention mechanism

Country Status (1)

Country Link
CN (1) CN114241497A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116071770A (en) * 2023-03-06 2023-05-05 深圳前海环融联易信息科技服务有限公司 Method, device, equipment and medium for general identification of form

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200043483A1 (en) * 2018-08-01 2020-02-06 Google Llc Minimum word error rate training for attention-based sequence-to-sequence models
CN109543667A (en) * 2018-11-14 2019-03-29 北京工业大学 A kind of text recognition method based on attention mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HUANG, JINXING; PAN, XIANG; ZHENG, HERONG: "End-to-end network structure optimization for scene text recognition based on residual connection", Computer Science, no. 08, 15 April 2020 (2020-04-15), pages 221 - 226 *

Similar Documents

Publication Publication Date Title
RU2691214C1 (en) Text recognition using artificial intelligence
CN110378383B (en) Picture classification method based on Keras framework and deep neural network
US11288324B2 (en) Chart question answering
CN109711409A (en) A kind of hand-written music score spectral line delet method of combination U-net and ResNet
CN113449801B (en) Image character behavior description generation method based on multi-level image context coding and decoding
US20180365594A1 (en) Systems and methods for generative learning
Srihari et al. Role of automation in the examination of handwritten items
CN115080766B (en) Multi-modal knowledge graph characterization system and method based on pre-training model
CN114612921B (en) Form recognition method and device, electronic equipment and computer readable medium
CN110188827A (en) A kind of scene recognition method based on convolutional neural networks and recurrence autocoder model
CN116311310A (en) Universal form identification method and device combining semantic segmentation and sequence prediction
CN115438215A (en) Image-text bidirectional search and matching model training method, device, equipment and medium
CN114255403A (en) Optical remote sensing image data processing method and system based on deep learning
CN113283336A (en) Text recognition method and system
CN111582506A (en) Multi-label learning method based on global and local label relation
CN115034200A (en) Drawing information extraction method and device, electronic equipment and storage medium
CN113449787A (en) Chinese character stroke structure-based font library completion method and system
Kumar et al. Pair wise training for stacked convolutional autoencoders using small scale images
CN114241497A (en) Table sequence identification method and system based on context attention mechanism
Wu CNN-Based Recognition of Handwritten Digits in MNIST Database
Hajihashemi et al. A pattern recognition based Holographic Graph Neuron for Persian alphabet recognition
CN115270792A (en) Medical entity identification method and device
CN111046934B (en) SWIFT message soft clause recognition method and device
CN114707017A (en) Visual question answering method and device, electronic equipment and storage medium
CN113901913A (en) Convolution network for ancient book document image binaryzation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination