CN117807968A - Form merging method, form merging prediction model training method and device

Form merging method, form merging prediction model training method and device

Info

Publication number
CN117807968A
Authority
CN
China
Prior art keywords
merging
semantic
segment
prediction
physical
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311863551.6A
Other languages
Chinese (zh)
Inventor
于业达
彭敬伟
刘奕晨
杨威
李杨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Hengsheng Juyuan Data Service Co ltd
Original Assignee
Shanghai Hengsheng Juyuan Data Service Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Hengsheng Juyuan Data Service Co ltd
Priority to CN202311863551.6A
Publication of CN117807968A
Status: Pending

Landscapes

  • Image Analysis (AREA)

Abstract

The application provides a form merging method, a form merging prediction model training method and a corresponding device. The form merging method comprises the following steps: acquiring an image to be processed, wherein the image to be processed comprises a first table segment and a second table segment which are distributed across pages; inputting the image to be processed into a form merging prediction model for prediction processing to obtain a physical merging prediction result and a semantic merging prediction sequence, wherein each item in the semantic merging prediction sequence corresponds to one cell of the first table segment and one cell of the second table segment, and the value of each item indicates whether the corresponding cell on the first table segment and the corresponding cell on the second table segment can be merged semantically; and carrying out merging processing according to the physical merging prediction result and the semantic merging prediction sequence. Because the physical structure and cell-level context semantic information are combined to judge whether the tables can be merged, the accuracy of the table merging result is significantly improved.

Description

Form merging method, form merging prediction model training method and device
Technical Field
The application relates to the technical field of computers, in particular to a form merging method, a form merging prediction model training method and a form merging prediction model training device.
Background
In a scenario involving document recognition and parsing, such as finance, it is often necessary to perform cross-page table merging. In particular, due to the limited space of a document page, the same table may be presented on different pages. When the document is recognized and parsed, the contents of such a table need to be merged.
In the prior art, some methods for merging cross-page tables are provided; these methods mainly judge whether tables can be physically merged by analyzing the physical structure of the tables.
However, because they are based solely on the physical structure of the table, prior-art methods may produce inaccurate cross-page table merging results.
Disclosure of Invention
The purpose of this application is to provide, in view of the above defects in the prior art, a form merging method, a form merging prediction model training method and a corresponding device, so as to solve the problem of inaccurate cross-page table merging results in the prior art.
In order to achieve the above purpose, the technical solution adopted in the embodiment of the present application is as follows:
in a first aspect, an embodiment of the present application provides a method for merging tables, including:
acquiring an image to be processed, wherein the image to be processed comprises a first table segment and a second table segment which are distributed across pages;
inputting the image to be processed into a form merging prediction model obtained by pre-training to perform prediction processing to obtain a physical merging prediction result and a semantic merging prediction sequence, wherein the physical merging prediction result is used for indicating whether the first form segment and the second form segment can be merged in a physical structure, each item in the semantic merging prediction sequence corresponds to one cell of the first form segment and the second form segment respectively, and each item in the semantic merging prediction sequence is used for indicating whether the corresponding cell of the first form segment and the corresponding cell of the second form segment can be merged semantically;
and combining the first table segment and the second table segment according to the physical combination prediction result and the semantic combination prediction sequence.
As one possible implementation, the table merging prediction model includes: a visual feature processing network, a semantic feature processing network and a linear processing layer;
inputting the image to be processed into a form merging prediction model obtained by pre-training for prediction processing to obtain a physical merging prediction result and a semantic merging prediction sequence, wherein the method comprises the following steps of:
inputting the image to be processed into the visual feature processing network, performing visual feature coding by the visual feature processing network to obtain visual features, and performing position coding on the visual features to obtain position coded features;
inputting the position-coded features into the semantic feature processing network, and carrying out semantic feature coding by the semantic feature processing network to obtain semantic-coded features;
inputting the characteristics after semantic coding into the linear processing layer for linear transformation to obtain the physical combination prediction result and the semantic combination prediction sequence.
As a possible implementation manner, the visual feature processing network comprises a plurality of coding layers connected in series in sequence;
inputting the image to be processed into the visual feature processing network, and performing visual feature coding by the visual feature processing network to obtain visual features, wherein the method comprises the following steps:
inputting the image to be processed into a first coding layer for coding processing, inputting the processed characteristics into a next coding layer for coding processing, and sequentially executing until the last coding layer finishes coding processing;
respectively carrying out maximum value pooling treatment on the treated characteristics of each coding layer except the last coding layer to obtain pooled characteristics of each coding layer;
and splicing the pooled features of each coding layer and the features of the last coding layer after coding treatment to obtain the visual features.
As a possible implementation manner, the performing position coding on the visual feature to obtain a position coded feature includes:
and respectively carrying out horizontal direction position coding and vertical direction position coding on the visual features based on a two-dimensional sine and cosine coding algorithm to obtain the coded features.
As a possible implementation manner, the merging processing of the first table segment and the second table segment according to the physical merging prediction result and the semantic merging prediction sequence includes:
and if the physical merging prediction result indicates that the first table segment and the second table segment can be merged in a physical structure, traversing each item in the semantic merging prediction sequence, and aiming at the traversed current item, if the value of the current item is a preset value, carrying out content merging on the corresponding cell of the current item in the first table segment and the corresponding cell in the second table segment.
As a possible implementation manner, before the obtaining the image to be processed, the method further includes:
according to page segmentation marks in the electronic document, a first initial table segment and a second initial table segment are intercepted from adjacent pages of the electronic document, so that an initial image is obtained;
and cutting the initial image and aligning the edges of the table to obtain the image to be processed.
In a second aspect, an embodiment of the present application provides a method for training a table merging prediction model, including:
constructing a training data set based on an original electronic document containing a cross-page table;
training the initial merging model based on the training data set to obtain a form merging prediction model, wherein the prediction result of the form merging prediction model comprises a physical merging prediction result and a semantic merging prediction sequence, the physical merging prediction result is used for indicating whether a first form segment and a second form segment in an input image to be processed can be merged in a physical structure, each item in the semantic merging prediction sequence corresponds to one cell of the first form segment and the second form segment respectively, and each item in the semantic merging prediction sequence is used for indicating whether the corresponding cell on the first form segment and the corresponding cell on the second form segment can be merged semantically.
As one possible implementation manner, the constructing a training data set based on the original electronic document containing the cross-page table includes:
intercepting, cutting and aligning the edges of the cross-page table in the original electronic document to obtain a test data set and a forward training data set;
splitting and splicing the tables in the original electronic document to obtain a negative training data set.
As a possible implementation manner, the training the initial merging model based on the training data set to obtain a form merging prediction model includes:
inputting the sample data in the training data set into the initial merging model to obtain a processing result of the initial merging model, wherein the initial merging model comprises the following steps: a visual feature processing network, a semantic feature processing network and a linear processing layer;
and carrying out loss calculation on the processing result of the initial merging model based on a target loss function to obtain the loss of the initial merging model, wherein the target loss function at least comprises: an error transfer loss function for calculating a deviation of a processing result of the visual feature processing network from a processing result of the semantic feature processing network;
and carrying out iterative correction on the initial merging model according to the loss of the initial merging model to obtain the form merging prediction model.
As a possible implementation manner, the objective loss function further includes: a physical merging loss function and a semantic merging loss function;
the physical merging loss function is used for calculating the loss of the physical merging prediction result, and the semantic merging loss function is used for calculating the loss of the semantic merging prediction sequence.
In a third aspect, an embodiment of the present application provides a table merging device, including:
the device comprises an acquisition module, a processing module and a merging module, wherein the acquisition module is used for acquiring an image to be processed, and the image to be processed comprises a first table segment and a second table segment which are distributed across pages;
the processing module is used for inputting the image to be processed into a form merging prediction model obtained through pre-training to be subjected to prediction processing, and obtaining a physical merging prediction result and a semantic merging prediction sequence, wherein the physical merging prediction result is used for indicating whether the first form segment and the second form segment can be merged in a physical structure, each item in the semantic merging prediction sequence corresponds to one cell of the first form segment and the second form segment respectively, and each item in the semantic merging prediction sequence is used for indicating whether the corresponding cell of the first form segment and the corresponding cell of the second form segment can be merged in a semantic manner;
and the merging module is used for merging the first table segment and the second table segment according to the physical merging prediction result and the semantic merging prediction sequence.
As one possible implementation, the table merging prediction model includes: a visual feature processing network, a semantic feature processing network and a linear processing layer;
the processing module is specifically configured to:
inputting the image to be processed into the visual feature processing network, performing visual feature coding by the visual feature processing network to obtain visual features, and performing position coding on the visual features to obtain position coded features;
inputting the position-coded features into the semantic feature processing network, and carrying out semantic feature coding by the semantic feature processing network to obtain semantic-coded features;
inputting the characteristics after semantic coding into the linear processing layer for linear transformation to obtain the physical combination prediction result and the semantic combination prediction sequence.
As a possible implementation manner, the visual feature processing network comprises a plurality of coding layers connected in series in sequence;
the processing module is specifically configured to:
inputting the image to be processed into a first coding layer for coding processing, inputting the processed characteristics into a next coding layer for coding processing, and sequentially executing until the last coding layer finishes coding processing;
respectively carrying out maximum value pooling treatment on the treated characteristics of each coding layer except the last coding layer to obtain pooled characteristics of each coding layer;
and splicing the pooled features of each coding layer and the features of the last coding layer after coding treatment to obtain the visual features.
As a possible implementation manner, the processing module is specifically configured to:
and respectively carrying out horizontal direction position coding and vertical direction position coding on the visual features based on a two-dimensional sine and cosine coding algorithm to obtain the coded features.
As a possible implementation manner, the merging module is specifically configured to:
and if the physical merging prediction result indicates that the first table segment and the second table segment can be merged in a physical structure, traversing each item in the semantic merging prediction sequence, and aiming at the traversed current item, if the value of the current item is a preset value, carrying out content merging on the corresponding cell of the current item in the first table segment and the corresponding cell in the second table segment.
As a possible implementation manner, the processing module is further configured to:
according to page segmentation marks in the electronic document, a first initial table segment and a second initial table segment are intercepted from adjacent pages of the electronic document, so that an initial image is obtained;
and cutting the initial image and aligning the edges of the table to obtain the image to be processed.
In a fourth aspect, an embodiment of the present application provides a table merging prediction model training apparatus, including:
the building module is used for building a training data set based on the original electronic document containing the cross-page table;
the training module is used for training the initial merging model based on the training data set to obtain a form merging prediction model, wherein the prediction result of the form merging prediction model comprises a physical merging prediction result and a semantic merging prediction sequence, the physical merging prediction result is used for indicating whether a first form segment and a second form segment in an input image to be processed can be merged in a physical structure, each item in the semantic merging prediction sequence corresponds to one cell of the first form segment and the second form segment respectively, and each item in the semantic merging prediction sequence is used for indicating whether the corresponding cell on the first form segment and the corresponding cell on the second form segment can be merged in a semantic manner.
As a possible implementation manner, the construction module is specifically configured to:
intercepting, cutting and aligning the edges of the cross-page table in the original electronic document to obtain a test data set and a forward training data set;
splitting and splicing the tables in the original electronic document to obtain a negative training data set.
As a possible implementation manner, the training module is specifically configured to:
inputting the sample data in the training data set into the initial merging model to obtain a processing result of the initial merging model, wherein the initial merging model comprises the following steps: a visual feature processing network, a semantic feature processing network and a linear processing layer;
and carrying out loss calculation on the processing result of the initial merging model based on a target loss function to obtain the loss of the initial merging model, wherein the target loss function at least comprises: an error transfer loss function for calculating a deviation of a processing result of the visual feature processing network from a processing result of the semantic feature processing network;
and carrying out iterative correction on the initial merging model according to the loss of the initial merging model to obtain the form merging prediction model.
As a possible implementation manner, the objective loss function further includes: a physical merging loss function and a semantic merging loss function;
the physical merging loss function is used for calculating the loss of the physical merging prediction result, and the semantic merging loss function is used for calculating the loss of the semantic merging prediction sequence.
In a fifth aspect, embodiments of the present application provide an electronic device, including: a processor and a memory, the memory storing machine-readable instructions executable by the processor; when the electronic device runs, the processor executes the machine-readable instructions to perform the steps of the form merging method according to the first aspect or the form merging prediction model training method according to the second aspect.
In a sixth aspect, the present application provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor performs the steps of the table merging method according to the first aspect or the steps of the table merging prediction model training method according to the second aspect.
According to the table merging method, the table merging prediction model training method and the table merging prediction model training device, an image to be processed is input into a table merging prediction model obtained through pre-training, and the table merging prediction model predicts a physical merging prediction result and a semantic merging prediction sequence. The physical merging prediction result indicates whether the first table segment and the second table segment can be merged in physical structure, and the semantic merging prediction sequence indicates whether each corresponding cell of the first table segment and the second table segment can be merged semantically; that is, the semantic merging prediction sequence predicts at the cell level, from a context semantic perspective, whether cells can be merged. Because the physical structure and the cell-level context semantic information are combined to judge whether the tables can be merged, the accuracy of the table merging result can be significantly improved. In addition, each item in the semantic merging prediction sequence corresponds to one cell of the first table segment and one cell of the second table segment, so the semantic merging prediction sequence is a variable-length sequence that does not need to be limited to a fixed length and can be flexibly applied to table merging scenarios with different numbers of columns.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the embodiments will be briefly described below. It should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and that other related drawings may be obtained from these drawings without inventive effort by a person skilled in the art.
FIG. 1 is an exemplary diagram of one scenario for cross-page table merging;
fig. 2 is a flow chart of a table merging method according to an embodiment of the present application;
FIG. 3 is a table merge example;
FIG. 4 is another table merge example;
FIG. 5 is a schematic diagram of an architecture of a table merge prediction model;
FIG. 6 is another flow chart of a table merging method according to an embodiment of the present disclosure;
FIG. 7 is a diagram of an example architecture of a visual characteristics processing network;
FIG. 8 is a further flow chart of a table merging method according to an embodiment of the present disclosure;
FIG. 9 is a flowchart of a method for training a table merge prediction model according to an embodiment of the present disclosure;
FIG. 10 is another flow chart of a method for training a table merge prediction model according to an embodiment of the present disclosure;
FIG. 11 is a schematic diagram of training an initial merge model using an error transfer loss function;
fig. 12 is a block diagram of a table merging device according to an embodiment of the present application;
FIG. 13 is a block diagram of a training device for a form merge prediction model according to an embodiment of the present application;
fig. 14 is a schematic structural diagram of an electronic device 140 according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it should be understood that the accompanying drawings in the present application are only for the purpose of illustration and description, and are not intended to limit the protection scope of the present application. In addition, it should be understood that the schematic drawings are not drawn to scale. A flowchart, as used in this application, illustrates operations implemented according to some embodiments of the present application. It should be understood that the operations of the flow diagrams may be implemented out of order and that steps without logical context may be performed in reverse order or concurrently. Moreover, one or more other operations may be added to the flow diagrams and one or more operations may be removed from the flow diagrams as directed by those skilled in the art.
In addition, the described embodiments are only some, but not all, of the embodiments of the present application. The components of the embodiments of the present application, which are generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, as provided in the accompanying drawings, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present application without making any inventive effort, are intended to be within the scope of the present application.
It should be noted that the term "comprising" will be used in the embodiments of the present application to indicate the presence of the features stated hereinafter, but not to exclude the addition of other features.
The existing cross-page table merging method mainly judges whether the tables can be merged or not through analysis of the physical structure of the tables. Several examples of prior art are listed below.
In a first example, a simulation generation mode is adopted: table1 and table2 from two different pages are selected, each is divided into an upper part and a lower part along a middle row, and the cell coordinates contained in the four parts are then combined in pairs to form an input of the format [SEP]+table1_cell1+table1_cell2+...+table1_cellm+[SEP]+table2_cell1+table2_cell2+...+table2_celln+[SEP], which is used to train a deep bidirectional Transformer; when the confidence of the output result is greater than 0.5, table1 and table2 can be combined, otherwise they are not combined. This approach only considers whether the tables are physically combinable and does not judge, from the logical relationship between upper and lower cells, whether they can be combined. In addition, the bidirectional deep Transformer in this approach suffers from error accumulation, which easily leads to prediction bias.
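For illustration only, the input format described above could be assembled as in the following sketch; the helper name and the reading of "+" as plain string concatenation are assumptions, since the prior-art description does not give exact code.

```python
# Hypothetical sketch of the prior-art [SEP]-delimited input construction.
def build_prior_art_input(table1_cells: list[str], table2_cells: list[str]) -> str:
    sep = "[SEP]"
    # "+" in the patent's notation is read here as concatenation of cell texts
    return sep + "".join(table1_cells) + sep + "".join(table2_cells) + sep

# e.g. build_prior_art_input(["2022", "10.5"], ["2023", "11.2"])
# -> "[SEP]202210.5[SEP]202311.2[SEP]"
```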
In a second example, a two-stage judgment is adopted to merge cross-page tables. After headers and footers are removed, the table regions of the upper and lower pages are obtained, and a table header detection model is used to detect whether the upper-page and lower-page tables have headers; whether the table bodies can be merged is then judged according to rules on the presence of headers on the upper and lower pages. If the table bodies can be merged, similarity calculation is performed on the cell groups at the joint position: the texts of the cell groups at the joint are merged into a new text, and the similarity between its semantic vector and the semantic vectors of the other cell groups in the same column is calculated; if this similarity is larger than the similarity calculated in the same way before merging, the cells are merged, and the final table merging and splicing is completed in combination with other prior knowledge. The merging result of this approach depends on a pre-trained header recognition model, so an incorrectly detected header leads to merging errors. In addition, this approach does not take context semantic information into account, and it is difficult for it to cover many table scenarios.
In summary, the prior art focuses on whether the tables can be merged in physical structure and does not consider the context semantic information of the upper and lower cells, so the table merging results of the prior-art schemes suffer from low accuracy.
Based on the above-mentioned problems, the embodiments of the present application provide a form merging method and a form merging prediction model training method, which utilize a training data set to train a form merging prediction model, where the form merging prediction model not only predicts whether a form can be merged from a physical structure perspective, but also predicts whether a cell can be merged from a context semantic perspective at a cell level, and since the physical structure and the context semantic information at the cell level are combined to determine whether the form can be merged, accuracy of a form merging result can be significantly improved.
The embodiment of the application can be applied to a scene of cross-page table merging. FIG. 1 is a diagram showing an example of a scenario in which cross-page tables are merged, as shown in FIG. 1, in an electronic document, the same table is displayed in adjacent pages due to the limitation of page space, i.e., cross-page display occurs. For the purpose of content analysis and the like for the electronic document, the forms displayed across pages need to be combined to obtain a complete form. For example, in fig. 1, the content of the last cell in the first row of the table of the lower page is substantially integral with the content of the last cell in the last row of the table of the upper page, and should be merged into one cell. By applying the method of the embodiment of the application, the combination of the cross-page tables can be accurately realized.
Before introducing the technical solution of the embodiments of the present application, it should first be noted that, before the form merging method and the form merging prediction model training method provided in the embodiments of the present application are executed, user consent should be obtained in an explicit manner for the personal information and important data collected and generated in the process. Meanwhile, the collection and use of the personal information and important data generated during execution of these methods need to follow the principles of legality, legitimacy and necessity; the collection and use rules should be disclosed, and the purpose, manner and scope of collecting and using the information should be clearly indicated. The relevant data in the above examples do not contain personal information unrelated to the relevant services provided in the above examples.
The embodiment of the application provides a form merging method and a form merging prediction model training method, wherein the form merging prediction model training method is used for describing a training process of a form merging prediction model used in the form merging method. The table merging method will be explained first.
Fig. 2 is a flowchart of a table merging method according to an embodiment of the present application, where an execution body of the method may be any electronic device with computing processing capability. As shown in fig. 2, the method may include:
S201, acquiring an image to be processed, wherein the image to be processed comprises a first table segment and a second table segment which are distributed across pages.
When content analysis is needed for an electronic document, adjacent pages containing a cross-page table can be located in the electronic document based on specific marks such as page separators and table lines, and all or part of the table segments in those adjacent pages are combined into one image, thereby obtaining the image to be processed.
It should be appreciated that, for the same electronic document, there may be one or more groups of adjacent pages containing a cross-page table, and accordingly one or more images to be processed may be obtained. When there are multiple images to be processed, table merging can be performed on each image to be processed using the method steps of the embodiments of the present application.
In addition, the image to be processed includes a first table segment and a second table segment that are distributed across pages: the first table segment may be, for example, part or all of the table segment in the upper page of the adjacent pages, and the second table segment may be, for example, part or all of the table segment in the lower page.
S202, inputting the image to be processed into a form merging prediction model obtained through pre-training to conduct prediction processing, and obtaining a physical merging prediction result and a semantic merging prediction sequence. The physical merging prediction result is used for indicating whether the first table segment and the second table segment can be merged in physical structure, each item in the semantic merging prediction sequence corresponds to one cell of the first table segment and the second table segment respectively, and each item in the semantic merging prediction sequence is used for indicating whether the corresponding cell of the first table segment and the second table segment can be merged semantically.
Alternatively, the physical merge prediction result may be a 1-bit value. Illustratively, the tag tab_merge is used to represent the physical merge prediction result, if the value of tab_merge is 1, it indicates that the first table segment and the second table segment can be physically merged, and if the value of tab_merge is 0, it indicates that the first table segment and the second table segment cannot be physically merged.
Optionally, the semantic merging prediction sequence may be a numerical sequence, where the length of the numerical sequence, that is, its number of entries, is the same as the minimum-granularity column number of the first table segment and of the second table segment, and each entry corresponds to one cell in the first table segment and one cell in the second table segment. It should be understood that the corresponding cells in the first table segment are the cells in the last row of the first table segment, and the corresponding cells in the second table segment are the cells in the first row of the second table segment. Accordingly, the value of each entry in the numerical sequence represents whether the corresponding cell of the first table segment can be semantically combined with the corresponding cell of the second table segment. Illustratively, the label col_merge is used to represent the semantic merging prediction sequence, whose length is the same as the number of columns of the first table segment and the second table segment. For example, assuming that the minimum-granularity column number of the first table segment and the second table segment is 8, the length of col_merge is 8; when the value of a certain item in col_merge is 1, it indicates that the cell of the first table segment corresponding to the item and the corresponding cell of the second table segment can be semantically combined, and when the value of a certain item in col_merge is 0, it indicates that they cannot be semantically combined.
Table 1 below is an explanation of the numerical meanings of tab_merge and col_merge. Table 2 below shows the rule of merging tab_merge with col_merge.
TABLE 1
tab_merge = 1: the first table segment and the second table segment can be merged in physical structure
tab_merge = 0: the first table segment and the second table segment cannot be merged in physical structure
col_merge item = 1: the corresponding cell of the first table segment and the corresponding cell of the second table segment can be merged semantically
col_merge item = 0: the corresponding cell of the first table segment and the corresponding cell of the second table segment cannot be merged semantically

TABLE 2
tab_merge = 1 and col_merge item = 1: merge the contents of the corresponding cells
tab_merge = 1 and col_merge item = 0: do not merge the contents of the corresponding cells
tab_merge = 0: do not merge the first table segment and the second table segment (col_merge is not evaluated)
The following is illustrated by two examples.
Fig. 3 is a table merging example, as shown in fig. 3, where the minimum granularity column number of the first table segment and the second table segment is 8, and based on the above-mentioned table merging prediction model, it is predicted that the first table segment and the second table segment are physically combinable, and that all 8 cells can be semantically merged, so that the value of tab_merge is 1, and the value of col_merge is 11111111. It should be appreciated that a value of 1 on each item 11111111 indicates that the corresponding cell semantics of the item may be merged, respectively. For example, a first value of 1 in 11111111 indicates that the first cell of the last row of the first table segment may be semantically merged with the first cell of the first row of the second table segment.
Fig. 4 is another table merging example, as shown in fig. 4, where the minimum granularity column number of the first table segment and the second table segment is 4, and based on the above-mentioned table merging prediction model, it is predicted that the first table segment and the second table segment are physically and structurally combinable, and none of the 4 cells are semantically combinable, so that the value of tab_merge is 1, and the value of col_merge is 0000. It should be appreciated that a value of 0 on each item of 0000 indicates that the corresponding cell semantics of the item are not combinable, respectively. For example, a first value of 0 in 0000 indicates that the last row first cell of the first table segment is semantically non-combinable with the first row first cell of the second table segment.
S203, merging the first table segment and the second table segment according to the physical merging prediction result and the semantic merging prediction sequence.
Because the physical merging prediction result indicates whether the first table segment and the second table segment can be merged in physical structure, and the semantic merging prediction sequence indicates whether each corresponding cell of the first table segment and the second table segment can be merged semantically, the first table segment and the second table segment can be merged accurately by combining the physical merging prediction result and the semantic merging prediction sequence.
In this embodiment, an image to be processed is input into a pre-trained form merging prediction model, and the form merging prediction model predicts a physical merging prediction result and a semantic merging prediction sequence, where the physical merging prediction result indicates whether the first table segment and the second table segment can be merged in physical structure, and the semantic merging prediction sequence indicates whether each corresponding cell of the first table segment and the second table segment can be merged semantically; that is, the semantic merging prediction sequence predicts at the cell level, from a context semantic perspective, whether cells can be merged. Because the physical structure and the cell-level context semantic information are combined to judge whether the tables can be merged, the accuracy of the table merging result can be significantly improved. In addition, each item in the semantic merging prediction sequence corresponds to one cell of the first table segment and one cell of the second table segment, so the semantic merging prediction sequence is a variable-length sequence that does not need to be limited to a fixed length and can be flexibly applied to table merging scenarios with different numbers of columns.
The following describes a prediction process of the table merge prediction model.
FIG. 5 is a schematic diagram of an architecture of a form-merging prediction model, as shown in FIG. 5, where the form-merging prediction model includes: visual feature processing network, semantic feature processing network and linear processing layer. The visual feature processing network, the semantic feature processing network and the linear processing layer are sequentially connected in series. As an example, the visual feature processing network may be a residual visual feature encoder, such as Resnet50. The semantic feature processing network may be a semantic feature encoder, such as a Transformer semantic feature encoder. The Linear treatment layer may be a Linear layer, and one or more fully connected layers may be included in the Linear layer.
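For concreteness, the following is a minimal sketch of this three-stage architecture in PyTorch. The feature width, head count, layer count, the 1x1 projection, and the mean pooling used to obtain the table-level logit are all assumptions, and add_2d_sincos refers to the position-coding sketch given under step S601 below.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class TableMergePredictor(nn.Module):
    # Sketch of Fig. 5: visual feature network -> position code ->
    # semantic feature network -> linear layer (hyperparameters assumed).
    def __init__(self, d_model: int = 256, nhead: int = 8, num_layers: int = 4):
        super().__init__()
        backbone = resnet50(weights=None)
        # keep the convolutional stages only; drop avgpool and fc
        self.visual = nn.Sequential(*list(backbone.children())[:-2])
        self.proj = nn.Conv2d(2048, d_model, kernel_size=1)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.semantic = nn.TransformerEncoder(layer, num_layers)
        self.tab_head = nn.Linear(d_model, 1)  # physical merge logit
        self.col_head = nn.Linear(d_model, 1)  # one semantic logit per position

    def forward(self, img: torch.Tensor):      # img: (B, 3, H, W)
        f = self.proj(self.visual(img))        # visual features F: (B, d, h, w)
        f = add_2d_sincos(f)                   # position-coded features V_in
        v = self.semantic(f.flatten(2).transpose(1, 2))  # V_out: (B, h*w, d)
        tab_logit = self.tab_head(v.mean(dim=1))         # (B, 1)
        col_logits = self.col_head(v).squeeze(-1)        # (B, h*w), variable length
        return tab_logit, col_logits
```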
Based on the model architecture shown in fig. 5, the above step S202 may be performed according to the following steps.
Fig. 6 is another flow chart of the table merging method provided in the embodiment of the present application, as shown in fig. 6, the step S202 may include:
s601, inputting the image to be processed into the visual feature processing network, performing visual feature coding by the visual feature processing network to obtain visual features, and performing position coding on the visual features to obtain position coded features.
Optionally, the visual feature processing network performs visual feature encoding on the image to be processed to obtain the visual features; during pre-training, the visual feature processing network can learn from the training data set whether table column lines are aligned. Thus, the visual features output by the visual feature processing network can characterize whether the column lines of the first table segment and the second table segment are aligned. The higher the degree of column line alignment, the higher the probability that the first and second table segments are physically combinable. Taking the example shown in fig. 3, all column lines of the first table segment and the second table segment can be aligned, and therefore it can be confirmed that the first table segment and the second table segment can be physically combined.
Optionally, after the visual features are obtained, the visual features are subjected to position coding, so that the contextual position relation between the first table segment and the second table segment is better represented, and further accurate prediction is realized.
As a possible implementation manner, the visual features may be respectively subjected to horizontal direction position coding and vertical direction position coding based on a two-dimensional sine and cosine coding algorithm, so as to obtain the coded features.
Specifically, the two-dimensional sine and cosine coding algorithm can be used for respectively carrying out position coding on the visual features in the horizontal direction and the vertical direction, so that the coded features can further represent the context position relationship between the first table section and the second table section, and further accurate semantic analysis can be carried out by taking the position relationship as auxiliary judgment information.
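A sketch of such a two-dimensional sine and cosine position code follows; the exact frequency schedule and the split of channels into horizontal and vertical halves are assumed here, following the common sinusoidal formulation, since the description above does not fix them.

```python
import torch

def add_2d_sincos(f: torch.Tensor) -> torch.Tensor:
    # Add a 2-D sine/cosine position code to a (B, d, h, w) feature map.
    # Half of the channels encode the vertical position, half the
    # horizontal, each as sin/cos at geometrically spaced frequencies.
    _, d, h, w = f.shape
    assert d % 4 == 0, "channel count must be divisible by 4"
    freqs = 1.0 / (10000 ** (torch.arange(d // 4, device=f.device) / (d // 4)))
    ys = torch.arange(h, device=f.device).float()[:, None] * freqs   # (h, d/4)
    xs = torch.arange(w, device=f.device).float()[:, None] * freqs   # (w, d/4)
    pe_y = torch.cat([ys.sin(), ys.cos()], dim=1)                    # (h, d/2)
    pe_x = torch.cat([xs.sin(), xs.cos()], dim=1)                    # (w, d/2)
    pe = torch.cat([
        pe_y[:, None, :].expand(h, w, d // 2),    # vertical position code
        pe_x[None, :, :].expand(h, w, d // 2),    # horizontal position code
    ], dim=-1)                                    # (h, w, d)
    return f + pe.permute(2, 0, 1).unsqueeze(0)   # broadcast over the batch
```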
S602, inputting the position-coded features into the semantic feature processing network, and carrying out semantic feature coding by the semantic feature processing network to obtain semantic-coded features.
Optionally, during pre-training the semantic feature processing network can learn from the training data set whether the semantics of upper and lower cells in the same column can be combined. Therefore, the semantically encoded features obtained after processing by the semantic feature processing network can represent whether the corresponding cells of the first table segment and the second table segment are semantically combinable. Taking the example shown in fig. 3, the 8 cells in the last row of the first table segment and the corresponding cells in the first row of the second table segment can be semantically combined, so it can be confirmed that the corresponding cells of the first table segment and the second table segment can be semantically combined.
It should be appreciated that the above-described semantically encoded features, while being able to characterize whether each corresponding cell in the first table segment and the second table segment is semantically combinable, are still able to characterize whether column lines of the first table segment and the second table segment are aligned, i.e., whether the first table segment and the second table segment are physically combinable.
And S603, inputting the features after semantic coding into the linear processing layer for linear transformation to obtain the physical combination prediction result and the semantic combination prediction sequence.
Optionally, the semantically encoded features obtained through the semantic feature processing network cannot directly represent the prediction result, so in this step, the semantically encoded features can be subjected to linear transformation through the linear processing layer, thereby obtaining the prediction result. The prediction result specifically comprises the physical combination prediction result and a semantic combination prediction sequence.
In this embodiment, the form merging prediction model sequentially includes a visual feature processing network, a semantic feature processing network and a linear processing layer. Whether the tables can be merged in physical structure can be learned through the visual feature processing network, whether the semantics of each corresponding cell in the first table segment and the second table segment can be merged can be learned through the semantic feature processing network, and the context position relationship between the first table segment and the second table segment can be learned through position coding; the learned features are then transformed into intuitive prediction results through linear transformation. Integrating these processing stages ensures that the form merging prediction model outputs accurate and intuitive prediction results.
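Combining the sketches above, a single forward pass could look as follows (input resolution assumed); thresholding the logits at 0.5 to obtain tab_merge and col_merge mirrors the label definitions in step S202.

```python
import torch

model = TableMergePredictor()                  # sketch defined above
img = torch.randn(1, 3, 512, 512)              # a pre-processed image
tab_logit, col_logits = model(img)
tab_merge = int(tab_logit.sigmoid().item() > 0.5)      # physical merge flag
col_merge = (col_logits.sigmoid() > 0.5).long()        # semantic merge sequence
```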
As an alternative embodiment, the visual feature processing network may include a plurality of coding layers connected in series. Fig. 7 is a schematic diagram of a visual feature processing network, as shown in fig. 7, where the visual feature processing network may specifically be a residual visual feature encoder, and the residual visual feature encoder includes 4 coding layers connected in series. It should be understood that the 4 coding layers shown in fig. 7 are only an example, and the number of coding layers is not limited thereto, and may be flexibly set as required. Accordingly, the visual feature processing network may perform visual feature encoding in accordance with the following procedure.
Optionally, an optional manner of performing the visual feature encoding by the visual feature processing network in the step S601 includes:
inputting the image to be processed into a first coding layer for coding processing, inputting the processed characteristics into a next coding layer for coding processing, and sequentially executing until the last coding layer finishes coding processing; respectively carrying out maximum value pooling treatment on the treated characteristics of each coding layer except the last coding layer to obtain pooled characteristics of each coding layer; and splicing the pooled features of each coding layer and the features of the last coding layer after coding treatment to obtain the visual features.
Referring to fig. 7, after the image to be processed is input into the first coding layer for coding, the first coding layer inputs the encoded features into the second coding layer for coding, and so on, until the last coding layer completes its coding process. In addition, the features output by each coding layer except the last can be max-pooled (Maxpooling), with the pooled feature size kept consistent with the size of the features output by the last coding layer. On this basis, the max-pooled features of the coding layers before the last layer are spliced with the features output by the last coding layer, yielding the visual features.
In this embodiment, the multiple coding layers are used for coding in sequence, so that whether the visual features output by the visual feature processing network can be combined with the physical structure or not can be represented more accurately.
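The following sketch illustrates this multi-layer encoding with max pooling and splicing, using the four stages of a ResNet50 to stand in for the four coding layers; pooling the intermediate features to the final spatial size with adaptive max pooling is an assumption, since the description only requires the pooled sizes to match the last layer's output.

```python
import torch
import torch.nn as nn
import torch.nn.functional as nnf
from torchvision.models import resnet50

class MultiScaleVisualEncoder(nn.Module):
    # Sketch of Fig. 7: four serial coding layers; every intermediate output
    # is max-pooled down to the last layer's spatial size and concatenated
    # channel-wise with the last layer's output.
    def __init__(self):
        super().__init__()
        r = resnet50(weights=None)
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)
        self.layers = nn.ModuleList([r.layer1, r.layer2, r.layer3, r.layer4])

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        x, feats = self.stem(img), []
        for layer in self.layers:
            x = layer(x)
            feats.append(x)
        h, w = feats[-1].shape[-2:]
        pooled = [nnf.adaptive_max_pool2d(f, (h, w)) for f in feats[:-1]]
        return torch.cat(pooled + [feats[-1]], dim=1)   # spliced visual features
```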
As an alternative embodiment, the step S203 may include:
If the physical merging prediction result indicates that the first table segment and the second table segment can be merged in physical structure, each item in the semantic merging prediction sequence is traversed; for the traversed current item, if its value is a preset value, the cell corresponding to the current item in the first table segment and the corresponding cell in the second table segment are merged in content.
The preset value may be, for example, the value 1 illustrated in the foregoing step S202. When the value of the current item in the semantic merging prediction sequence is 1, the cell semantics corresponding to the current item can be merged.
Alternatively, if the first table segment and the second table segment are combinable in physical structure, it may be further determined whether each cell is combinable semantically according to the above procedure one by one. When cells are physically and semantically combinable, the corresponding cells may be combined. For example, in the foregoing example of fig. 3, the semantics of the cells corresponding to the 8 column lines may be combined, and then the content of the upper and lower cells may be combined for each cell. When the cells are physically mergeable and semantically non-mergeable, then the corresponding cells are not merged. For example, in the example of fig. 4, if the semantics of the cells corresponding to the 4 column lines are not combinable, the content of the upper and lower cells is not combined for each cell.
In another case, if the first table segment and the second table segment are not combinable in physical structure, the first table segment and the second table segment are not combined, and the judgment on the semantic merge prediction sequence is not continued.
In this embodiment, on the premise that the physical structures of the first table segment and the second table segment can be combined, by judging whether each cell is semantically combinable one by one, the cell level can be accurately combined.
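As a sketch, the merging rule can be written as follows; representing the table segments as lists of rows of minimum-granularity cell texts, and the handling of sequences that mix mergeable and non-mergeable cells, are assumptions (the examples of fig. 3 and fig. 4 only show all-1 and all-0 sequences).

```python
Row = list[str]

def merge_segments(tab_merge: int, col_merge: list[int],
                   upper: list[Row], lower: list[Row]) -> list[Row] | None:
    # Apply the rule of S203; 1 is assumed to be the preset value.
    if tab_merge != 1:                    # not mergeable in physical structure
        return None                       # keep the two segments separate
    fused, leftover = [], []
    for flag, up, down in zip(col_merge, upper[-1], lower[0]):
        if flag == 1:                     # semantics mergeable: join contents
            fused.append(up + down)
            leftover.append("")
        else:                             # semantics not mergeable: keep both
            fused.append(up)
            leftover.append(down)
    merged = upper[:-1] + [fused]
    if any(leftover):                     # 0-flagged cells keep their own row
        merged.append(leftover)
    return merged + lower[1:]
```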
As an alternative embodiment, the above-described image to be processed may be generated by the following procedure.
Fig. 8 is a schematic flow chart of a table merging method according to an embodiment of the present application, as shown in fig. 8, further including, before the step S201:
s801, according to page segmentation marks in the electronic document, a first initial table segment and a second initial table segment are intercepted from adjacent pages of the electronic document, and an initial image is obtained.
Alternatively, the pages of the electronic document may be converted into images and image recognition performed to recognize whether a preset page division mark exists. When a page division mark exists, the pages on either side of it are adjacent pages, so it can be further judged whether a table column line or table row line exists above and below the page division mark; if so, the existence of a cross-page table can be determined. Part or all of the tables in the adjacent pages can then be cut out to obtain the initial image. Illustratively, the initial image may be obtained by cutting out the bottom half of the upper page and the top half of the lower page. It will be appreciated that the initial image includes the page division mark between the adjacent pages.
S802, cutting the initial image and performing table edge alignment processing to obtain the image to be processed.
Alternatively, based on the table edge lines in the initial image, the initial image may be cropped so that only the table portion remains and blank portions outside the table are removed. On this basis, the table of the lower page may be translated horizontally so that its leftmost column line is aligned with the leftmost column line of the upper page.
In this embodiment, by intercepting, cutting and aligning the tables in the electronic document, one-side alignment of the upper-page and lower-page tables in the obtained image to be processed can be achieved, which eliminates negative influences on the prediction of the form merging prediction model and ensures the accuracy of the prediction result.
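A minimal sketch of this pre-processing follows; it assumes the two table regions have already been cropped tightly to their table edge lines, so that left alignment reduces to pasting both crops at x = 0.

```python
from PIL import Image

def build_image_to_be_processed(upper_crop: Image.Image,
                                lower_crop: Image.Image) -> Image.Image:
    # Stack the two cropped table regions vertically with their leftmost
    # column lines aligned at x = 0 (sketch of S801/S802).
    w = max(upper_crop.width, lower_crop.width)
    canvas = Image.new("RGB", (w, upper_crop.height + lower_crop.height), "white")
    canvas.paste(upper_crop, (0, 0))                  # upper-page table segment
    canvas.paste(lower_crop, (0, upper_crop.height))  # lower-page, left-aligned
    return canvas
```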
The following describes a table merging prediction model training method according to an embodiment of the present application.
Fig. 9 is a flowchart of a method for training a table merge prediction model according to an embodiment of the present application, where an execution subject of the method may be any electronic device with computing processing capability. As shown in fig. 9, the method includes:
s901, constructing a training data set based on an original electronic document containing a cross-page table.
Alternatively, the number of original electronic documents may be one or more, and each original electronic document may include one or more groups of cross-page tables. A training data set of a certain scale can be obtained by processing these cross-page tables, and the training data set may include a test set and a training set. The specific process of constructing the training data set is described in detail in the following embodiments.
S902, training the initial merging model based on the training data set to obtain a form merging prediction model.
The prediction results of the form merging prediction model comprise a physical merging prediction result and a semantic merging prediction sequence, wherein the physical merging prediction result is used for indicating whether a first form segment and a second form segment in an input image to be processed can be merged in a physical structure, each item in the semantic merging prediction sequence corresponds to one cell of the first form segment and the second form segment respectively, and each item in the semantic merging prediction sequence is used for indicating whether the corresponding cell on the first form segment and the corresponding cell on the second form segment can be merged semantically.
Alternatively, the initial merging model may be built in advance, and since the form merging prediction model is trained by the initial merging model, the model structure of the initial merging model is the same as that of the form merging prediction model. The foregoing structures shown in fig. 5 and fig. 7 may be referred to specifically, and will not be described herein. The initial merging model can be subjected to multiple rounds of iterative training and parameter correction, when the result of certain iterative training meets the preset convergence condition or iterative training times, the training can be stopped, and the initial merging model trained for the last time is used as a form merging prediction model.
In this embodiment, a training data set is constructed and used to train a form merging prediction model. By training the initial merging model, the form merging prediction model obtained after training can predict a physical merging prediction result and a semantic merging prediction sequence, where the physical merging prediction result indicates whether the first table segment and the second table segment can be merged in physical structure, and the semantic merging prediction sequence indicates whether each corresponding cell of the first table segment and the second table segment can be merged semantically; that is, the semantic merging prediction sequence predicts at the cell level, from a context semantic perspective, whether cells can be merged. Because the physical structure and the cell-level context semantic information are combined to judge whether the tables can be merged, the accuracy of the table merging result can be significantly improved.
As an alternative embodiment, the step S901 may include:
intercepting and cutting a cross-page table in the original electronic document and aligning the edges of the table to obtain a test data set and a forward training data set; splitting and splicing the tables in the original electronic document to obtain a negative training data set.
For example, the page split markers described above are used to identify the upper- and lower-page table regions, whose coordinates are (t1_x1, t1_y1, t1_x2, t1_y2) and (t2_x1, t2_y1, t2_x2, t2_y2). The upper and lower table-region pictures t1 and t2 are cropped to sizes (H0, W0, 3) and (H1, W1, 3) respectively, where H0 = t1_y2 - t1_y1, W0 = t1_x2 - t1_x1, H1 = t2_y2 - t2_y1, and W1 = t2_x2 - t2_x1. The left edges of the two pictures are aligned and the pictures are spliced into a picture I of size H × W × 3, where H = H0 + H1 and W = max(W0, W1); pictures obtained in this way can be used in the test data set. In addition, for a cross-page table in the original electronic document, the bottom half of the upper page and the top half of the lower page are first spliced, centered on the page split marker, into a new picture Ipg; the table region in Ipg is then detected, and cropping and splicing are performed accordingly. Pictures obtained in this way can be used in the positive training data set. Furthermore, line detection is performed on a table in the original electronic document, and a line near the middle is randomly selected to split the same table into two parts, t_up_i and t_down_i, yielding a number of picture groups G. Parts taken from different groups G_i and G_j are spliced to obtain a set of spliced pictures, which are then cropped so that the splice lies on the horizontal center line of each picture. Pictures obtained in this way can be used in the negative training data set; a sketch of the left-aligned splicing step is given below.
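By way of illustration, the following is a minimal sketch of the left-aligned splicing step. The function name, the NumPy representation, and the white padding value are assumptions for illustration; the publication only specifies the output size H × W × 3 with H = H0 + H1 and W = max(W0, W1).

```python
import numpy as np

def splice_table_fragments(t1: np.ndarray, t2: np.ndarray) -> np.ndarray:
    """Left-align two cropped table-region pictures of shapes (H0, W0, 3)
    and (H1, W1, 3) and stack them into one picture I of shape
    (H0 + H1, max(W0, W1), 3)."""
    h0, w0, _ = t1.shape
    h1, w1, _ = t2.shape
    w = max(w0, w1)
    out = np.full((h0 + h1, w, 3), 255, dtype=np.uint8)  # assumed white fill
    out[:h0, :w0] = t1   # upper-page table region, left-aligned
    out[h0:, :w1] = t2   # lower-page table region, left-aligned
    return out
```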
The training data set may also be annotated according to how each sample was generated. For example, spliced pictures whose halves come from different groups, or from unrelated upper and lower pages, receive a tag_merge label of 0, meaning the fragments cannot be merged; at the same time, the tag sequence col_merge can be marked according to the minimum cell columns obtained by dividing with the combined column lines of the upper and lower pages, as illustrated in the sketch below.
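The following sketch shows one way such labels could be assigned. The exact column-matching rule is not spelled out in the text, so the rule used here (a column is mergeable only for genuine pairs whose fragments share that minimum column boundary) is an assumption, as are the function and variable names.

```python
def label_sample(is_same_table: bool, upper_cols: set, lower_cols: set):
    """Assign tag_merge (0 = not mergeable) and the col_merge 0/1 sequence
    over the minimum cell columns formed by the combined column lines."""
    tag_merge = 1 if is_same_table else 0
    all_cols = sorted(upper_cols | lower_cols)   # combined column lines
    col_merge = [
        1 if tag_merge and c in upper_cols and c in lower_cols else 0
        for c in all_cols
    ]
    return tag_merge, col_merge
```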
In this embodiment, the richness of the obtained training data set is higher by processing the table in the original electronic document, so that the prediction result of the table merging prediction model obtained by training is more accurate.
Fig. 10 is another flow chart of a table merging prediction model training method according to an embodiment of the present application, as shown in fig. 10, the step S902 may include:
S1001, inputting sample data in a training data set into an initial merging model to obtain a processing result of the initial merging model, wherein the initial merging model comprises: a visual feature processing network, a semantic feature processing network and a linear processing layer.
S1002, performing loss calculation on the processing result of the initial merging model based on a target loss function to obtain the loss of the initial merging model. The target loss function at least comprises an error transfer loss function for calculating the deviation between the processing result of the visual feature processing network and the processing result of the semantic feature processing network.
FIG. 11 is a schematic diagram of training the initial merging model using the error transfer loss function. As shown in fig. 11, the visual feature processing network is embodied as a residual visual feature encoder, and the semantic feature processing network is embodied as a Transformer semantic feature encoder. The visual feature output by the residual visual feature encoder is F; performing position coding on F yields the position-coded feature V_in. The Transformer semantic feature encoder encodes V_in and outputs the semantically coded feature V_out. Because the Transformer semantic feature encoder introduces a transfer error, the error transfer loss function described above can be set to ensure that the semantic features learned by the initial merging model stay as close as possible to the visual features: it calculates the deviation between the processing result of the visual feature processing network and that of the semantic feature processing network, i.e. the deviation between F and V_out. As shown in FIG. 11, the error transfer loss function yields a loss L_0, through which the transfer error of the Transformer semantic feature encoder can be alleviated or counteracted.
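A minimal sketch of this loss follows. The publication does not name the distance measure, so mean squared error is an assumption here, and F and V_out are assumed to have been brought to a common shape.

```python
import torch
from torch.nn import functional as nnf

def error_transfer_loss(f_visual: torch.Tensor, v_out: torch.Tensor) -> torch.Tensor:
    """L_0: deviation between the visual feature F and the Transformer
    output V_out (distance measure assumed to be MSE)."""
    return nnf.mse_loss(v_out, f_visual)
```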
As an optional implementation manner, the target loss function further includes a physical merging loss function and a semantic merging loss function. The physical merging loss function is used for calculating the loss of the physical merging prediction result, and the semantic merging loss function is used for calculating the loss of the semantic merging prediction result.
With continued reference to FIG. 11, the loss calculated by the physical merging loss function is L_tab_merge, and the loss calculated by the semantic merging loss function is L_col_merge. The physical merging loss can be computed with cross entropy; the semantic merging loss can be computed with a sequence-prediction loss such as ctc_loss, yielding a sequence consisting of 0s and 1s. The loss L of the target loss function can then be expressed as: L = L_tab_merge + L_col_merge + L_0.
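Putting the three terms together, a sketch of the total loss could look as follows. Tensor shapes follow PyTorch's cross_entropy and ctc_loss conventions, the unweighted sum mirrors the formula above, and the MSE used for L_0 remains an assumption.

```python
import torch
from torch.nn import functional as nnf

def total_loss(tab_logits, tag_merge, col_log_probs, col_targets,
               input_lengths, target_lengths, f_visual, v_out):
    """L = L_tab_merge + L_col_merge + L_0.

    tab_logits:    (N, 2) physical-merge logits        -> cross entropy
    tag_merge:     (N,) class indices (0 or 1)
    col_log_probs: (T, N, C) log-softmax outputs       -> CTC over 0/1 labels
    f_visual / v_out: features for L_0 (shapes assumed equal)
    """
    l_tab = nnf.cross_entropy(tab_logits, tag_merge)
    l_col = nnf.ctc_loss(col_log_probs, col_targets,
                         input_lengths, target_lengths)
    l_0 = nnf.mse_loss(v_out, f_visual)   # see the L_0 sketch above
    return l_tab + l_col + l_0
```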
S1003, carrying out iterative correction on the initial merging model according to the loss of the initial merging model to obtain the table merging prediction model.
The initial merging model may undergo multiple rounds of iterative training and parameter correction; when the result of an iteration meets a preset convergence condition or the preset number of iterations is reached, training stops, and the last-trained initial merging model is used as the form merging prediction model.
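For concreteness, a sketch of such an iteration loop is given below. The optimizer, learning rate, round limit, and tolerance are all illustrative assumptions, with the tolerance standing in for the preset convergence condition.

```python
import torch

def fit(model, loader, compute_loss, max_rounds=100, tol=1e-4, lr=1e-4):
    """Multi-round iterative training with parameter correction; stop when
    the round-to-round loss change drops below tol or max_rounds is hit."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    prev = float("inf")
    for _ in range(max_rounds):
        running = 0.0
        for batch in loader:
            opt.zero_grad()
            loss = compute_loss(model, batch)   # L = L_tab + L_col + L_0
            loss.backward()                     # parameter correction
            opt.step()
            running += loss.item()
        if abs(prev - running) < tol:           # preset convergence condition
            break
        prev = running
    return model   # the last-trained model is the form merge prediction model
```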
In this embodiment, an error transfer loss function is introduced into the loss function of the initial merging model. Because this function measures the deviation between the processing result of the visual feature processing network and that of the semantic feature processing network, correcting the initial merging model with its loss alleviates or counteracts the transfer error, making the prediction results of the trained form merging prediction model more accurate.
Based on the same inventive concept, the embodiment of the present application further provides a form merging device corresponding to the form merging method, and since the principle of solving the problem by the device in the embodiment of the present application is similar to that of the form merging method in the embodiment of the present application, the implementation of the device may refer to the implementation of the method, and the repetition is omitted.
Fig. 12 is a block diagram of a table merging device according to an embodiment of the present application, and as shown in fig. 12, the device includes:
an obtaining module 1201 is configured to obtain an image to be processed, where the image to be processed includes a first table segment and a second table segment distributed across pages.
The processing module 1202 is configured to input the image to be processed into a pre-trained form merge prediction model for prediction processing, to obtain a physical merge prediction result and a semantic merge prediction sequence, where the physical merge prediction result is used to indicate whether the first form segment and the second form segment can be merged in physical structure, each item in the semantic merge prediction sequence corresponds to a cell of the first form segment and a cell of the second form segment, and each item in the semantic merge prediction sequence is used to indicate whether the corresponding cell of the first form segment and the corresponding cell of the second form segment can be merged semantically.
And the merging module 1203 is configured to merge the first table segment and the second table segment according to the physical merge prediction result and the semantic merge prediction sequence.
As an alternative embodiment, the table merging prediction model includes: visual feature processing network, semantic feature processing network and linear processing layer.
The processing module 1202 is specifically configured to:
inputting the image to be processed into the visual feature processing network, performing visual feature coding by the visual feature processing network to obtain visual features, and performing position coding on the visual features to obtain position coded features.
Inputting the position-coded features into the semantic feature processing network, and carrying out semantic feature coding by the semantic feature processing network to obtain semantic-coded features.
Inputting the semantically coded features into the linear processing layer for linear transformation to obtain the physical merging prediction result and the semantic merging prediction sequence; a sketch of this three-stage flow is given below.
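The pipeline could be sketched as below. The backbone layers, dimensions, head sizes, and the mean-pooled table decision are illustrative assumptions; the publication specifies only the visual encoder → position encoding → Transformer encoder → linear layer flow.

```python
import torch
from torch import nn

class TableMergeModel(nn.Module):
    """Sketch: visual feature network -> position code -> Transformer
    semantic network -> linear heads for the two prediction outputs."""
    def __init__(self, d_model: int = 256):
        super().__init__()
        self.visual = nn.Sequential(     # stand-in for the residual encoder
            nn.Conv2d(3, d_model, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(d_model, d_model, 3, stride=2, padding=1),
        )
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.semantic = nn.TransformerEncoder(layer, num_layers=4)
        self.head_tab = nn.Linear(d_model, 2)   # physical merge logits
        self.head_col = nn.Linear(d_model, 3)   # 0/1 per step + CTC blank

    def forward(self, img: torch.Tensor, pos: torch.Tensor):
        # pos: (N, L, C) position code matching the flattened feature map
        f = self.visual(img)                        # (N, C, H', W')
        v_in = f.flatten(2).transpose(1, 2) + pos   # position-coded feature
        v_out = self.semantic(v_in)                 # semantically coded
        tab = self.head_tab(v_out.mean(dim=1))      # physical merge result
        col = self.head_col(v_out)                  # semantic merge sequence
        return tab, col, f, v_out                   # f and v_out feed L_0
```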
As an alternative embodiment, the visual feature processing network comprises a plurality of coding layers connected in series.
The processing module 1202 is specifically configured to:
Inputting the image to be processed into the first coding layer for coding processing, inputting the processed features into the next coding layer for coding processing, and so on, until the last coding layer completes its coding processing.
Performing maximum-value pooling on the processed features of each coding layer except the last coding layer, to obtain pooled features of each such coding layer.
Splicing the pooled features of each coding layer with the coded features of the last coding layer to obtain the visual features, as shown in the sketch below.
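A sketch of this multi-layer aggregation follows. Pooling each intermediate output down to the spatial size of the last layer (so that channel-wise concatenation is well defined) is an assumption, since the text specifies maximum-value pooling but not the target size.

```python
import torch
from torch import nn

def multiscale_visual_features(layers: nn.ModuleList, img: torch.Tensor):
    """Run the serial coding layers, max-pool every intermediate output to
    the last output's spatial size, then concatenate along channels."""
    feats, x = [], img
    for layer in layers:          # first layer ... last layer, in series
        x = layer(x)
        feats.append(x)
    h, w = feats[-1].shape[-2:]
    pooled = [nn.functional.adaptive_max_pool2d(f, (h, w))
              for f in feats[:-1]]                  # all but the last layer
    return torch.cat(pooled + [feats[-1]], dim=1)   # the visual features
```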
As an alternative embodiment, the processing module 1202 is specifically configured to:
Carrying out horizontal-direction position coding and vertical-direction position coding on the visual features, respectively, based on a two-dimensional sine and cosine coding algorithm, to obtain the coded features (see the sketch below).
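A sketch of such a two-dimensional sine-cosine code is shown below. Splitting the channel dimension evenly between the vertical and horizontal codes, and the base of 10000, follow the common convention and are assumptions here.

```python
import torch

def sincos_position_code(h: int, w: int, d: int) -> torch.Tensor:
    """Return a (h * w, d) position code: the first d/2 channels encode the
    vertical position, the last d/2 the horizontal position."""
    def axis_code(n: int, dim: int) -> torch.Tensor:
        pos = torch.arange(n, dtype=torch.float32).unsqueeze(1)   # (n, 1)
        idx = torch.arange(0, dim, 2, dtype=torch.float32)        # (dim/2,)
        angle = pos / (10000.0 ** (idx / dim))                    # (n, dim/2)
        return torch.cat([angle.sin(), angle.cos()], dim=1)       # (n, dim)
    pe_y = axis_code(h, d // 2)   # vertical-direction position code
    pe_x = axis_code(w, d // 2)   # horizontal-direction position code
    grid = torch.cat([
        pe_y.unsqueeze(1).expand(h, w, d // 2),
        pe_x.unsqueeze(0).expand(h, w, d // 2),
    ], dim=-1)
    return grid.reshape(h * w, d)
```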
As an alternative embodiment, the combining module 1203 is specifically configured to:
If the physical merging prediction result indicates that the first table segment and the second table segment can be merged in physical structure, traversing each item in the semantic merging prediction sequence; for the current item traversed, if its value is a preset value, merging the content of the cell corresponding to the current item in the first table segment with the corresponding cell in the second table segment. A sketch of this traversal is given below.
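The traversal could be sketched as follows. The preset value 1 and the string-concatenation rule for cell contents are assumptions for illustration.

```python
def apply_merge(tag_merge: int, col_merge: list,
                upper_cells: list, lower_cells: list):
    """Merge cell contents column by column when the physical prediction
    says the two fragments belong to one table."""
    if not tag_merge:            # fragments are separate physical tables
        return upper_cells, lower_cells
    for i, flag in enumerate(col_merge):
        if flag == 1:            # preset value: this cell pair is mergeable
            upper_cells[i] = upper_cells[i] + lower_cells[i]
            lower_cells[i] = ""  # content moved into the upper-page cell
    return upper_cells, lower_cells
```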
As an alternative embodiment, the processing module 1202 is further configured to:
According to the page segmentation marks in the electronic document, intercepting the first initial table segment and the second initial table segment from adjacent pages of the electronic document to obtain an initial image.
Cutting the initial image and aligning the edges of the table to obtain the image to be processed.
Based on the same inventive concept, the embodiment of the present application further provides a form merge prediction model training device corresponding to the form merge prediction model training method, and since the principle of solving the problem of the device in the embodiment of the present application is similar to that of the form merge prediction model training method in the embodiment of the present application, the implementation of the device can refer to the implementation of the method, and the repetition is omitted.
Fig. 13 is a block diagram of a table merging prediction model training apparatus according to an embodiment of the present application, as shown in fig. 13, where the apparatus includes:
a construction module 1301 is configured to construct a training dataset based on the original electronic document that contains the cross-page table.
The training module 1302 is configured to train the initial merging model based on the training data set to obtain a form merging prediction model, where a prediction result of the form merging prediction model includes a physical merging prediction result and a semantic merging prediction sequence, the physical merging prediction result is used to indicate whether a first form segment and a second form segment in an input image to be processed can be merged in physical structure, each item in the semantic merging prediction sequence corresponds to one cell of the first form segment and one cell of the second form segment, and each item in the semantic merging prediction sequence is used to indicate whether the corresponding cell on the first form segment and the corresponding cell on the second form segment can be merged semantically.
As a possible implementation manner, the construction module 1301 is specifically configured to:
Intercepting, cutting and edge-aligning the cross-page table in the original electronic document to obtain a test data set and a positive training data set.
Splitting and splicing the tables in the original electronic document to obtain a negative training data set.
As a possible implementation manner, the training module is specifically configured to:
Inputting the sample data in the training data set into the initial merging model to obtain a processing result of the initial merging model, wherein the initial merging model comprises: a visual feature processing network, a semantic feature processing network and a linear processing layer.
Carrying out loss calculation on the processing result of the initial merging model based on a target loss function to obtain the loss of the initial merging model, wherein the target loss function at least comprises an error transfer loss function for calculating the deviation between the processing result of the visual feature processing network and the processing result of the semantic feature processing network.
Carrying out iterative correction on the initial merging model according to the loss of the initial merging model to obtain the form merging prediction model.
As a possible implementation manner, the target loss function further includes a physical merging loss function and a semantic merging loss function.
The physical merging loss function is used for calculating the loss of the physical merging prediction result, and the semantic merging loss function is used for calculating the loss of the semantic merging prediction result.
The process flow of each module in the apparatus and the interaction flow between the modules may be described with reference to the related descriptions in the above method embodiments, which are not described in detail herein.
The embodiment of the present application further provides an electronic device 140, as shown in fig. 14, which is a schematic structural diagram of the electronic device 140 provided in the embodiment of the present application, including: processor 141, memory 142, and optionally bus 143. The memory 142 stores machine-readable instructions executable by the processor 141 (e.g., execution instructions corresponding to the acquiring module 1201, the processing module 1202, and the combining module 1203 in the apparatus of fig. 12, or execution instructions corresponding to the building module 1301 and the training module 1302 in the apparatus of fig. 13), and when the electronic device 140 is running, the processor 141 communicates with the memory 142 through the bus 143, and the machine-readable instructions when executed by the processor 141 perform the method steps in the foregoing method embodiments.
Embodiments of the present application also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the above-described form merging method or form merging prediction model training method.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working procedures of the system and apparatus described above may refer to the corresponding procedures in the method embodiments, which are not described in detail in this application. In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. The apparatus embodiments described above are merely illustrative; the division of the modules is merely a logical function division, and there may be other divisions in actual implementation: for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be implemented through some communication interfaces, and the indirect coupling or communication connection between devices or modules may be electrical, mechanical, or in other forms.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. If the functions are implemented in the form of software functional units and sold or used as a stand-alone product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The foregoing is merely a specific embodiment of the present application, but the protection scope of the present application is not limited thereto, and any person skilled in the art can easily think about changes or substitutions within the technical scope of the present application, and the changes or substitutions are covered in the protection scope of the present application.

Claims (14)

1. A method of merging tables, comprising:
acquiring an image to be processed, wherein the image to be processed comprises a first table segment and a second table segment which are distributed across pages;
inputting the image to be processed into a form merging prediction model obtained by pre-training to perform prediction processing to obtain a physical merging prediction result and a semantic merging prediction sequence, wherein the physical merging prediction result is used for indicating whether the first form segment and the second form segment can be merged in a physical structure, each item in the semantic merging prediction sequence corresponds to one cell of the first form segment and the second form segment respectively, and each item in the semantic merging prediction sequence is used for indicating whether the corresponding cell of the first form segment and the corresponding cell of the second form segment can be merged semantically;
and combining the first table segment and the second table segment according to the physical combination prediction result and the semantic combination prediction sequence.
2. The method of claim 1, wherein the table merge prediction model comprises: a visual feature processing network, a semantic feature processing network and a linear processing layer;
Inputting the image to be processed into a form merging prediction model obtained by pre-training for prediction processing to obtain a physical merging prediction result and a semantic merging prediction sequence comprises:
inputting the image to be processed into the visual feature processing network, performing visual feature coding by the visual feature processing network to obtain visual features, and performing position coding on the visual features to obtain position coded features;
inputting the position-coded features into the semantic feature processing network, and carrying out semantic feature coding by the semantic feature processing network to obtain semantic-coded features;
inputting the semantically coded features into the linear processing layer for linear transformation to obtain the physical merging prediction result and the semantic merging prediction sequence.
3. The method of claim 2, wherein the visual characteristics processing network comprises a plurality of coding layers in series;
inputting the image to be processed into the visual feature processing network, and performing visual feature coding by the visual feature processing network to obtain visual features, wherein the method comprises the following steps:
inputting the image to be processed into a first coding layer for coding processing, inputting the processed characteristics into a next coding layer for coding processing, and sequentially executing until the last coding layer finishes coding processing;
Respectively carrying out maximum value pooling treatment on the treated characteristics of each coding layer except the last coding layer to obtain pooled characteristics of each coding layer;
and splicing the pooled features of each coding layer and the features of the last coding layer after coding treatment to obtain the visual features.
4. The method of claim 2, wherein the position encoding the visual feature to obtain a position encoded feature comprises:
and respectively carrying out horizontal direction position coding and vertical direction position coding on the visual features based on a two-dimensional sine and cosine coding algorithm to obtain the coded features.
5. The method according to any one of claims 1-4, wherein the merging the first table segment and the second table segment according to the physical merging prediction result and the semantic merging prediction sequence comprises:
and if the physical merging prediction result indicates that the first table segment and the second table segment can be merged in a physical structure, traversing each item in the semantic merging prediction sequence, and aiming at the traversed current item, if the value of the current item is a preset value, carrying out content merging on the corresponding cell of the current item in the first table segment and the corresponding cell in the second table segment.
6. The method according to any one of claims 1-4, further comprising, prior to the acquiring the image to be processed:
according to page segmentation marks in the electronic document, a first initial table segment and a second initial table segment are intercepted from adjacent pages of the electronic document, so that an initial image is obtained;
and cutting the initial image and aligning the edges of the table to obtain the image to be processed.
7. A method for training a form merge prediction model, comprising:
constructing a training data set based on an original electronic document containing a cross-page table;
training the initial merging model based on the training data set to obtain a form merging prediction model, wherein the prediction result of the form merging prediction model comprises a physical merging prediction result and a semantic merging prediction sequence, the physical merging prediction result is used for indicating whether a first form segment and a second form segment in an input image to be processed can be merged in a physical structure, each item in the semantic merging prediction sequence corresponds to one cell of the first form segment and the second form segment respectively, and each item in the semantic merging prediction sequence is used for indicating whether the corresponding cell on the first form segment and the corresponding cell on the second form segment can be merged semantically.
8. The method of claim 7, wherein constructing a training dataset based on the original electronic document containing the cross-page form comprises:
intercepting, cutting and edge-aligning the cross-page table in the original electronic document to obtain a test data set and a positive training data set;
splitting and splicing the tables in the original electronic document to obtain a negative training data set.
9. The method of claim 7, wherein training the initial merge model based on the training dataset results in a tabular merge prediction model, comprising:
inputting the sample data in the training data set into the initial merging model to obtain a processing result of the initial merging model, wherein the initial merging model comprises: a visual feature processing network, a semantic feature processing network and a linear processing layer;
and carrying out loss calculation on the processing result of the initial merging model based on a target loss function to obtain the loss of the initial merging model, wherein the target loss function at least comprises: an error transfer loss function for calculating a deviation of a processing result of the visual feature processing network from a processing result of the semantic feature processing network;
And carrying out iterative correction on the initial merging model according to the loss of the initial merging model to obtain the form merging prediction model.
10. The method according to claim 9, wherein the target loss function further comprises: a physical merging loss function and a semantic merging loss function;
the physical merging loss function is used for calculating the loss of the physical merging prediction result, and the semantic merging loss function is used for calculating the loss of the semantic merging prediction result.
11. A form merge device, comprising:
the device comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring an image to be processed, and the image to be processed comprises a first table segment and a second table segment which are distributed across pages;
the processing module is used for inputting the image to be processed into a form merging prediction model obtained through pre-training to be subjected to prediction processing, and obtaining a physical merging prediction result and a semantic merging prediction sequence, wherein the physical merging prediction result is used for indicating whether the first form segment and the second form segment can be merged in a physical structure, each item in the semantic merging prediction sequence corresponds to one cell of the first form segment and the second form segment respectively, and each item in the semantic merging prediction sequence is used for indicating whether the corresponding cell of the first form segment and the corresponding cell of the second form segment can be merged in a semantic manner;
And the merging module is used for merging the first table segment and the second table segment according to the physical merging prediction result and the semantic merging prediction sequence.
12. A form merge prediction model training device, comprising:
the building module is used for building a training data set based on the original electronic document containing the cross-page table;
the training module is used for training the initial merging model based on the training data set to obtain a form merging prediction model, wherein the prediction result of the form merging prediction model comprises a physical merging prediction result and a semantic merging prediction sequence, the physical merging prediction result is used for indicating whether a first form segment and a second form segment in an input image to be processed can be merged in a physical structure, each item in the semantic merging prediction sequence corresponds to one cell of the first form segment and the second form segment respectively, and each item in the semantic merging prediction sequence is used for indicating whether the corresponding cell on the first form segment and the corresponding cell on the second form segment can be merged in a semantic manner.
13. An electronic device, comprising: a processor and a memory storing machine readable instructions executable by the processor to perform the steps of the form merge method of any one of claims 1 to 6 or the form merge prediction model training method of any one of claims 7 to 10 when the electronic device is run.
14. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program which, when executed by a processor, performs the steps of the form merging method according to any one of claims 1 to 6 or the steps of the form merging prediction model training method according to any one of claims 7 to 10.