CN116071770B - Method, device, equipment and medium for general table recognition

Info

Publication number: CN116071770B
Authority: CN (China)
Prior art keywords: longitudinal, transverse, gap information, sampling, information
Legal status: Active
Application number: CN202310203359.8A
Other languages: Chinese (zh)
Other versions: CN116071770A
Inventors: 赵驰煦, 王国鹏, 刘源超, 柏英杰
Current Assignee: Shenzhen Qianhai Huanrong Lianyi Information Technology Service Co Ltd
Original Assignee: Shenzhen Qianhai Huanrong Lianyi Information Technology Service Co Ltd
Application filed by Shenzhen Qianhai Huanrong Lianyi Information Technology Service Co Ltd; priority to CN202310203359.8A; publication of CN116071770A; application granted; publication of CN116071770B; legal status: active

Classifications

    • G06V 30/412: Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables
    • G06V 30/18: Extraction of features or characteristics of the image
    • G06V 30/19147: Obtaining sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 30/1918: Fusion techniques, i.e. combining data from various sources, e.g. sensor fusion
    • Y02P 90/30: Computing systems specially adapted for manufacturing


Abstract

The application relates to the technical field of artificial intelligence, and discloses a method, device, equipment and medium for general table recognition, used to accurately identify table structures. The method comprises the following steps: acquiring transverse gap information and longitudinal gap information of a table; determining a longitudinal sampling position in the transverse gap information according to the longitudinal gap information, and determining a transverse sampling position in the longitudinal gap information according to the transverse gap information; extracting whole-column information from the transverse gap information at the longitudinal sampling positions to obtain a longitudinal sub-feature map, and extracting whole-row information from the longitudinal gap information at the transverse sampling positions to obtain a transverse sub-feature map; and identifying the table structure of the table according to the transverse sub-feature map and the longitudinal sub-feature map.

Description

Method, device, equipment and medium for general table recognition
Technical Field
The application relates to the technical field of artificial intelligence, and in particular to a method, device, equipment and medium for general table recognition.
Background
Current mainstream table recognition algorithms are based on two main ideas: table-line segmentation and key-point detection.
Table-line segmentation extracts image features of the horizontal and vertical lines through a convolutional network, and then computes the intersection points of those lines in post-processing, thereby obtaining the position and coordinate values of every cell in the table.
Key-point detection computes the cell coordinates by detecting the positions of the cell vertices and center points in the table, and then restores the table structure from them.
The inventors found that the drawback of table-line segmentation is its excessive dependence on grid-line information: when the table lines are discontinuous because of imprecise printing, or when different colors are used instead of grid lines to divide the cells, the method cannot accurately recognize the table structure. Key-point detection relieves this limitation to some extent, but it cannot obtain accurate results when the image features at the intersection points are weak, when cell areas are very large, when there are many empty cells without text, or for borderless tables (wireless tables).
It can be seen that both methods are overly confined to local image information in the table, and cannot accurately recognize a table whose table lines or intersections are incomplete or even absent.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a method, apparatus, device, and medium for general table recognition capable of accurately identifying tables.
In a first aspect, a method for general table recognition is provided, including:
acquiring transverse gap information and longitudinal gap information of a table;
determining a longitudinal sampling position of the transverse gap information according to the longitudinal gap information, and determining a transverse sampling position of the longitudinal gap information according to the transverse gap information;
extracting whole column information of the transverse gap information according to the longitudinal sampling position to obtain a longitudinal sub-feature map, and extracting whole row information of the longitudinal gap information according to the transverse sampling position to obtain a transverse sub-feature map;
the table structure of the table is identified from the transverse and longitudinal sub-feature maps.
Optionally, determining the longitudinal sampling position of the transverse gap information according to the longitudinal gap information includes:
according to the longitudinal gap information, determining the positions of longitudinal reference points with preset reference quantity from the transverse gap information;
and taking the longitudinal reference point positions of a preset reference number as longitudinal sampling positions.
Optionally, determining the longitudinal reference point positions of the preset reference number from the transverse gap information according to the longitudinal gap information includes:
obtaining a transverse vector from the transverse gap information, wherein the transverse vector comprises the value at each position point of the table in the transverse direction;
suppressing position points in the transverse vector whose spacing from an adjacent position point is smaller than a preset distance, so as to keep only the point with the larger value among the adjacent position points;
and screening out a preset reference number of position points from the suppressed transverse vector as the longitudinal reference point positions.
Optionally, determining the longitudinal reference point positions of the preset reference number from the transverse gap information includes:
and determining the preset reference number of longitudinal reference point positions at equal intervals from the transverse gap information.
Optionally, identifying the table structure of the table according to the transverse sub-feature map and the longitudinal sub-feature map includes:
selecting a preset sampling number of position points from each column of the longitudinal sub-feature map as first sampling points, and selecting a preset sampling number of position points from each row of the transverse sub-feature map as second sampling points;
predicting the coordinate value of each first sampling point by using a first Transformer network, and predicting the coordinate value of each second sampling point by using a second Transformer network;
the coordinate values of all the first sampling points corresponding to each column are aggregated to obtain a longitudinal separation line, and the coordinate values of all the second sampling points corresponding to each row are aggregated to obtain a transverse separation line;
The table structure of the table is identified based on the transverse separation lines and the longitudinal separation lines.
Optionally, identifying the table structure of the table according to the transverse separation line and the longitudinal separation line includes:
combining the transverse separation lines and the longitudinal separation lines to obtain an initial cell area of the table;
scaling the coordinates of the initial cell area of the table so that the scaled initial cell area is the same size as the original feature map of the table;
dividing the original feature map according to the coordinate values of the scaled initial cell area to obtain a plurality of segmentation feature blocks;
fusing the text features of the table with the corresponding segmentation feature blocks to obtain fusion features;
and inputting each fusion feature into the graph neural network to obtain the table structure of the table.
Optionally, acquiring the transverse gap information and the longitudinal gap information of the table includes:
acquiring a table image of the table;
extracting the image information of the table image through a pre-trained target detection network to obtain an original feature map;
and respectively inputting the original feature map into a transverse feature extraction module and a longitudinal feature extraction module to obtain the transverse gap information and the longitudinal gap information of the table.
In a second aspect, a device for general table recognition is provided, the device comprising:
an acquisition module, used for acquiring the transverse gap information and the longitudinal gap information of the table;
a determining module, used for determining the longitudinal sampling positions of the transverse gap information according to the longitudinal gap information, and the transverse sampling positions of the longitudinal gap information according to the transverse gap information;
an extraction module, used for extracting the whole-column information of the transverse gap information according to the longitudinal sampling positions to obtain a longitudinal sub-feature map, and extracting the whole-row information of the longitudinal gap information according to the transverse sampling positions to obtain a transverse sub-feature map;
and an identification module, used for identifying the table structure of the table according to the transverse sub-feature map and the longitudinal sub-feature map.
In a third aspect, there is provided a computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of any of the methods described above when the computer program is executed.
In a fourth aspect, there is provided a computer readable storage medium storing a computer program which, when executed by a processor, performs the steps of any of the methods described above.
In the schemes provided above, because global typesetting features, namely gap features, are extracted from the image information of the table picture, the dependence on table lines or intersection points is very low: the structure can still be correctly recognized when such information is incomplete or even absent, and the method is equally applicable to borderless tables. It is therefore a general approach that can recognize the structure of both bordered and borderless tables at the same time, with high universality and application value.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a flow chart of the method for general table recognition in an embodiment of the present application;
FIG. 2 is a schematic diagram of the processing flow of the method for general table recognition in an embodiment of the present application;
FIG. 3 is a schematic diagram of the process of extracting the sub-feature maps in an embodiment of the present application;
FIG. 4 is a schematic diagram of the device for general table recognition in an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The following describes the embodiments of the present application clearly and completely with reference to the accompanying drawings. It is evident that the described embodiments are some, but not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments herein without inventive effort fall within the scope of the present application.
The embodiments of the present application provide a method, device, equipment, and storage medium for general table recognition. They depend only weakly on table lines or intersection points, can still recognize the structure correctly when such information is incomplete or even absent, and are applicable to borderless tables; that is, both borderless and bordered tables can be recognized, which improves the universality of table recognition. The method for general table recognition provided by the present application is described below.
In one embodiment, as shown in fig. 1, a method for universally identifying a table is provided, and the method includes the following steps:
s10: and acquiring transverse gap information and longitudinal gap information of the table.
In conventional schemes, table-line segmentation has the disadvantage of relying excessively on table-line information: when the table lines are discontinuous because of imprecise printing, or when different colors replace the lines to divide the cells, the table structure cannot be accurately recognized in this way. Key-point detection relieves this limitation to some extent, but it cannot produce accurate results when the image features at the intersection points are weak, when cell areas are very large, when many cells are empty of text, or for borderless tables. In this embodiment, to solve these problems, after the table image of the table to be recognized is acquired, the table image is first processed to obtain its image information, the original feature map S of the table is derived from that image information, and the transverse gap information R and the longitudinal gap information C are then extracted from the original feature map.
The transverse gap information R and the longitudinal gap information C characterize the features of the table in the transverse and longitudinal directions, respectively. For example, if the table is a bordered table, the transverse gap information R may include the recognized transverse-line position information and/or transverse-gap position information, and the longitudinal gap information C may include the recognized longitudinal-line position information and/or longitudinal-gap position information. If the table is borderless, the transverse gap information R may include the recognized transverse-gap position information, and the longitudinal gap information C the recognized longitudinal-gap position information. That is, whether the table is borderless or bordered, features reflecting its transverse and longitudinal characteristics can be extracted. Note that this differs from identifying intersection coordinates or table lines directly: only the transverse gap information R and the longitudinal gap information C of the table as a whole are extracted, without direct dependence on intersection coordinates or table lines.
S20: and determining the longitudinal sampling position of the transverse gap information according to the longitudinal gap information, and determining the transverse sampling position of the longitudinal gap information according to the transverse gap information.
S30: and extracting the whole column information of the transverse gap information according to the longitudinal sampling position to obtain a longitudinal sub-feature map, and extracting the whole row information of the longitudinal gap information according to the transverse sampling position to obtain a transverse sub-feature map.
S40: the table structure of the table is identified from the transverse and longitudinal sub-feature maps.
After the transverse gap information R and the longitudinal gap information C of the table are determined, the longitudinal sampling positions of the transverse gap information R can be determined according to the longitudinal gap information C, and the transverse sampling positions of the longitudinal gap information C according to the transverse gap information R; finally, sampling is performed at these two sets of positions to obtain a longitudinal sub-feature map R' and a transverse sub-feature map C'.
It can be understood that, if the table has table lines, the positions where the longitudinal gaps are most pronounced very likely correspond to positions in the original table image that carry transverse-line information, and if the table has no table lines, those positions very likely carry transverse-gap information. Similarly, if the table has table lines, the positions where the transverse gaps are most pronounced very likely carry longitudinal-line information in the original table image, and if it has no table lines, longitudinal-gap information. Based on this property, the embodiments of the present application determine the longitudinal sampling positions of the transverse gap information R according to the longitudinal gap information C and the transverse sampling positions of the longitudinal gap information C according to the transverse gap information R, and finally obtain the longitudinal sub-feature map R' and the transverse sub-feature map C' at these two different sets of sampling positions.
It can be seen that the longitudinal sub-feature map R' reflects the column information of the table and the transverse sub-feature map C' reflects its row information, so the cell areas of the table can be identified from R' and C', yielding the table structure.
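To make steps S20 to S40 concrete, the following minimal NumPy sketch (the shapes, the dimension order, and the equal spacing of the sampling positions are all illustrative assumptions; the later embodiments describe how the positions are actually chosen) shows how whole columns of R are gathered into the longitudinal sub-feature map and whole rows of C into the transverse one:

```python
# Illustrative slicing for S30; all shapes are assumptions.
import numpy as np

H, W, D = 64, 128, 256                          # feature-map height, width, channels
R = np.random.rand(H, W, D)                     # transverse gap information
C = np.random.rand(H, W, D)                     # longitudinal gap information
x_refs = np.linspace(0, W - 1, 10, dtype=int)   # 10 longitudinal sampling positions
y_refs = np.linspace(0, H - 1, 10, dtype=int)   # 10 transverse sampling positions
R_prime = R[:, x_refs, :]   # whole columns of R -> longitudinal sub-feature map
C_prime = C[y_refs, :, :]   # whole rows of C    -> transverse sub-feature map
```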
In summary, this embodiment provides a method for general table recognition: first, the transverse gap information R and the longitudinal gap information C of the table are acquired; the longitudinal/transverse sampling positions are determined according to C/R respectively, the longitudinal sub-feature map R' is extracted at the longitudinal sampling positions, and the transverse sub-feature map C' at the transverse sampling positions; finally, the table structure of the table is identified from C' and R'. Because the embodiments of the present application extract global typesetting features, namely gap features, from the image information of the table picture, the dependence on table lines or intersection points is very low: the structure can still be correctly recognized when such information is incomplete or even absent, and the method also applies to borderless tables. It is thus a general approach capable of recognizing the structure of both bordered and borderless tables, with high universality and application value.
It should be noted that, in step S20, i.e. determining the longitudinal sampling positions of the transverse gap information according to the longitudinal gap information and the transverse sampling positions of the longitudinal gap information according to the transverse gap information, the reference point positions may be selected arbitrarily, or selectively according to the values of the gap information; this is not specifically limited. The process of determining the longitudinal sampling positions is detailed below as an example.
To facilitate understanding of the above processes, a complete implementation is described in detail below in conjunction with FIG. 2 and FIG. 3, combining the embodiments above with those that follow.
In one embodiment, step S10, that is, acquiring the transverse gap information and the longitudinal gap information of the table, includes:
S11: acquiring a table image of the table;
S12: extracting the image information of the table image through a pre-trained target detection network to obtain an original feature map;
S13: inputting the original feature map into a transverse feature extraction module and a longitudinal feature extraction module respectively, to obtain the transverse gap information and the longitudinal gap information of the table.
In this embodiment, as shown in FIG. 2, the table image corresponding to the table to be recognized is acquired first, and its image information is then extracted through a pre-trained target detection network to obtain the original feature map. Illustratively, in one embodiment as shown in FIG. 2, the image information of the table image may be extracted through a ResNet-18 backbone network and feature pyramid network (Feature Pyramid Networks, FPN) layers to obtain an original feature map S of size W×H. It should be noted that, besides extraction by a ResNet-18 backbone and FPN layers, other feature extraction modules may be used, for example other backbone networks and/or feature layers; this is not limited in the embodiments of the present application. The original feature map S is then input into a transverse feature extraction module and a longitudinal feature extraction module respectively, to obtain the transverse gap information and the longitudinal gap information of the table; both modules may be implemented as convolutional networks.
This embodiment provides a concrete way of extracting the transverse and longitudinal gap information, improving the feasibility of the scheme. Moreover, the original image size is maintained while the ResNet-18 backbone and FPN layers extract the original feature map; the image is not compressed. Conventional schemes compress the feature map when predicting on the table image and then restore the prediction to the original size, a process that loses coordinate precision, so the predicted table structure deviates slightly from the true structure. In the present application the prediction is performed at the original image size, so the recognition result is more accurate.
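As an illustration of steps S11 to S13, the following hedged PyTorch sketch builds a ResNet-18 + FPN feature extractor with two convolutional heads. All module names, channel sizes, and the choice of FPN level are assumptions (in particular, the patent keeps the original image size, while this sketch works at the FPN's highest-resolution level), and the information exchange between the two feature maps described later is omitted:

```python
# A minimal sketch, not the patented implementation; assumes a recent torchvision.
import torch
import torch.nn as nn
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone

class GapFeatureExtractor(nn.Module):
    def __init__(self, channels: int = 256):
        super().__init__()
        self.backbone = resnet_fpn_backbone(backbone_name="resnet18", weights=None)
        def head() -> nn.Sequential:   # a small convolutional extraction module
            return nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, channels, 3, padding=1),
            )
        self.row_head = head()         # -> transverse gap information R
        self.col_head = head()         # -> longitudinal gap information C

    def forward(self, image: torch.Tensor):
        feats = self.backbone(image)   # OrderedDict of FPN levels
        s = feats["0"]                 # highest-resolution level, used as S here
        return s, self.row_head(s), self.col_head(s)

s, r, c = GapFeatureExtractor()(torch.randn(1, 3, 512, 512))
```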
In one embodiment, step S20, that is, determining the longitudinal sampling position of the transverse gap information according to the longitudinal gap information, includes:
S21: determining a preset reference number of longitudinal reference point positions from the transverse gap information according to the longitudinal gap information.
S22: taking the preset reference number of longitudinal reference point positions as the longitudinal sampling positions.
In this embodiment, a preset reference number of longitudinal reference point positions are determined from the transverse gap information R according to the longitudinal gap information C and used as the longitudinal sampling positions. The preset reference number is an empirical value, for example 8, 10 or 15, and is not specifically limited.
Thus, in this embodiment, the whole-column information of the transverse gap information is extracted using the determined longitudinal sampling positions as reference points, which reduces the amount of data the longitudinal sub-feature map passes to the next processing step and greatly improves processing efficiency. In other words, this step decides which columns should be taken out and spliced into a small feature map that best reflects the column information in the table, from which the correct result can be predicted.
Similarly, in an embodiment, step S20, that is, determining the transverse sampling position of the longitudinal gap information according to the transverse gap information, includes:
S23: determining a preset reference number of transverse reference point positions from the longitudinal gap information according to the transverse gap information.
S24: taking the preset reference number of transverse reference point positions as the transverse sampling positions.
It will be appreciated that steps S23 to S24 mirror steps S21 to S22. In this embodiment, a preset reference number of transverse reference point positions are determined from the longitudinal gap information C according to the transverse gap information R and used as the transverse sampling positions; the preset reference number is again an empirical value, for example 8, 10 or 15, and is not specifically limited. For example, there may be 10 transverse and 10 longitudinal reference point positions, i.e. 10 sampling reference points each. Thus, the whole-row information of the longitudinal gap information is extracted using the determined transverse sampling positions as reference points, which reduces the amount of data the transverse sub-feature map passes to the next processing step and greatly improves processing efficiency. In other words, this step decides which rows should be taken out and spliced into a small feature map that best reflects the row information in the table, from which the correct result can be predicted.
It should be noted that, in combination with the above embodiments, for step S21 and step S23, that is, the processes of determining the longitudinal reference point positions and the transverse reference point positions, the embodiments of the present application provide two different processing modes with corresponding characteristics. The two modes are described in detail below, taking the determination of the longitudinal reference point positions as an example.
First mode
According to the longitudinal gap information, determining a preset reference number of longitudinal reference point positions from the transverse gap information comprises the following steps:
S211: summing the transverse gap information along the width direction of the table to obtain a transverse vector, wherein the transverse vector comprises the value at each position point of the table in the transverse direction;
S212: suppressing position points in the transverse vector whose spacing from an adjacent position point is smaller than a preset distance, so as to keep only the point with the larger value among the adjacent position points;
S213: screening out a preset reference number of position points from the suppressed transverse vector as the longitudinal reference point positions.
In this embodiment, as shown in FIG. 2 and FIG. 3, the transverse gap information is summed along the width direction of the table to obtain a transverse vector containing the value at each position point of the table in the transverse direction. Position points whose spacing from an adjacent point is smaller than a preset distance are suppressed so that only the point with the larger value is retained; the preset distance is an empirical value and is not specifically limited. A preset reference number of position points are then screened out of the suppressed transverse vector as the longitudinal reference point positions. For example, in FIG. 3, the position points of the 10 largest values are selected as the reference points of R, and the whole columns of R at these 10 reference points are taken out, yielding a longitudinal sub-feature map R' of size 10×H.
Similarly, determining the transverse reference point positions proceeds like determining the longitudinal ones: the longitudinal gap information is summed along the height direction of the table to obtain a longitudinal vector containing the value at each position point of the table in the longitudinal direction; position points whose spacing from an adjacent point is smaller than a preset distance are suppressed so that only the point with the larger value is retained (the preset distance again being an empirical value, not specifically limited); and a preset reference number of position points are screened out of the suppressed longitudinal vector as the transverse reference point positions. For example, in FIG. 3, the position points of the 10 largest values are selected as the reference points of C, and the whole rows of C at these 10 reference points are taken out, yielding a transverse sub-feature map C' of size W×10.
It should be noted that this step mainly prepares the sub-feature maps for the next processing network (the Transformer in FIG. 3). Since the Transformer is computationally intensive, the data fed into it must be filtered to reduce the computational load; that is, we must decide which rows/columns of the original feature map S to take out and splice into a small feature map that best reflects the row and column information of the table, since the Transformer can predict the correct result from this information alone. As shown in FIG. 3, if the table has grid lines, then where the vertical-line features are most pronounced (the white column features in the lower-left diagram of FIG. 3), the original table very likely carries horizontal-line information at that position. Using these positions as references, the useful information (the vertical dotted lines at the left of FIG. 3, spliced into the small feature map at the right, namely the longitudinal sub-feature map) can be taken out of the row feature map (upper left of FIG. 3) more accurately. The transverse sub-feature map is processed analogously. In short, the purpose is to sample more accurately and to reduce the subsequent processing load of the Transformer.
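The suppression-and-screening procedure of the first mode amounts to a one-dimensional non-maximum suppression followed by a top-K selection; a hedged NumPy sketch follows (min_dist and num_refs stand in for the preset distance and the preset reference number, both assumed values):

```python
# An assumed implementation of steps S211-S213, not the patented code.
import numpy as np

def reference_points(gap_map, axis, min_dist=5, num_refs=10):
    vec = gap_map.sum(axis=axis)       # collapse the gap map to a 1-D vector
    order = np.argsort(vec)[::-1]      # position points, strongest value first
    kept = []
    for pos in order:
        # 1-D non-maximum suppression: among points closer than min_dist,
        # only the one with the larger value survives.
        if all(abs(pos - k) >= min_dist for k in kept):
            kept.append(int(pos))
        if len(kept) == num_refs:
            break
    return np.sort(np.array(kept))

# Transverse gap map R of shape (H, W): summing over axis 0 scores every
# transverse position, yielding the longitudinal reference point positions.
R = np.random.rand(64, 128)
x_refs = reference_points(R, axis=0)
```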
Second mode
It should be emphasized that the reference points exist only to make sampling more accurate. In practice, the preset reference number of longitudinal reference point positions may simply be determined at equal intervals from the transverse gap information, and the preset reference number of transverse reference point positions at equal intervals from the longitudinal gap information, e.g. 10 equally spaced columns/rows; this is not specifically limited.
For example, borderless tables have no table-line features at all, so in other implementations accuracy may indeed vary when 10 columns/rows are taken at equal intervals from the row or column feature map. The purpose of taking 10 reference points is that, during inference, the Transformer makes predictions only at these positions. For example, for the 10×H sub-feature map at the left of FIG. 3, the Transformer predicts 100 values on each column (K values after post-processing), finally yielding K polylines, each composed of 10 points, which serve as the separation lines. This adapts to deformation, missing regions and similar defects in the table picture, improving applicability.
In one embodiment, step S40, that is, identifying the table structure of the table according to the transverse sub-feature map and the longitudinal sub-feature map, includes:
S41: selecting a preset sampling number of position points from each column of the longitudinal sub-feature map as first sampling points, and selecting a preset sampling number of position points from each row of the transverse sub-feature map as second sampling points;
S42: predicting the coordinate value of each first sampling point by using a first Transformer network, and predicting the coordinate value of each second sampling point by using a second Transformer network;
S43: aggregating the coordinate values of all the first sampling points corresponding to each column to obtain a longitudinal separation line, and aggregating the coordinate values of all the second sampling points corresponding to each row to obtain a transverse separation line;
S44: identifying the table structure of the table based on the transverse separation lines and the longitudinal separation lines.
In this embodiment, the processing to obtain the transverse and longitudinal separation lines specifically uses two Transformer networks: the coordinate value of each determined sampling point is predicted first, and the sampling-point coordinate values are then aggregated into the corresponding separation lines. Specifically:
For the transverse separation lines, a preset sampling number of position points are selected from each column of the longitudinal sub-feature map R' as first sampling points; for example, 100 first sampling points are selected at equal intervals from each column of R'. The coordinates of the first sampling points in each column are then predicted by the first Transformer network (Transformer 1), yielding all the transverse separation lines. The number 100 is merely illustrative and does not limit the embodiments of the present application.
Similarly, for the longitudinal separation lines, a preset sampling number of position points are selected from each row of the transverse sub-feature map C' as second sampling points; for example, 100 second sampling points are selected at equal intervals from each row of C'. The coordinates of the second sampling points in each row are then predicted by the second Transformer network (Transformer 2), yielding all the longitudinal separation lines. Again, the number 100 is merely illustrative.
That is, during inference the Transformer networks predict only on the rows/columns determined by these sub-feature maps. For example, for the 10×H longitudinal sub-feature map at the top left of FIG. 3, the Transformer predicts 100 values on each column, which post-processing reduces to K, finally yielding K polylines, each composed of 10 points, i.e. the separation lines.
When the Transformer networks predict the coordinate values, the two sub-feature maps serve as the Key and Value of the Encoders of the two Transformer networks respectively. Then 100 sampling points are selected at equal intervals from each column/row of the longitudinal sub-feature map R' and the transverse sub-feature map C' respectively and used as the Query of the Transformer Decoder; the Query is fed into the Decoder, whose output layer is followed by a linear layer that predicts the coordinate value of each sampling point. For ease of understanding, take row prediction as an example (upper half of FIG. 2) and assume the original feature map S has D channels. The longitudinal sub-feature map R' then has size 10×H×D and serves as Key and Value. The 100 first sampling points are taken out and spliced into a 100×D matrix, the Query. The coordinates of the first sampling points are predicted as follows: the first two dimensions of Key and Value are merged into one ((10·H)×D); the Query matrix is multiplied by Key to give a matrix of size 100×(10·H); this is multiplied by Value to give a result of size 100×D, which is passed through a linear layer. These steps are repeated three times, and the final result is passed through a linear layer that predicts the y-direction coordinate value of each of the 100 points. The resulting 100 point coordinates are aggregated, adjacent coordinates are merged into one point, and a separation line in the horizontal direction, i.e. a transverse separation line, is finally obtained. The corresponding longitudinal separation lines are obtained in the same way.
It should be noted that the 100 sampling points above are only exemplary and do not limit the embodiments of the present application.
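A hedged PyTorch sketch of the Decoder computation described above follows (module names are assumptions, and a softmax is inserted where the text only specifies the two matrix products and the linear layers):

```python
# A sketch under assumptions, not the patented Transformer.
import torch
import torch.nn as nn

class SeparationLineDecoder(nn.Module):
    def __init__(self, d: int = 256, rounds: int = 3):
        super().__init__()
        self.linears = nn.ModuleList([nn.Linear(d, d) for _ in range(rounds)])
        self.out = nn.Linear(d, 1)          # predicts one y value per query point

    def forward(self, query: torch.Tensor, sub_map: torch.Tensor):
        # query: (100, D); sub_map: (10, H, D) -> Key = Value = (10*H, D)
        kv = sub_map.reshape(-1, sub_map.shape[-1])
        for linear in self.linears:         # repeated three times, as in the text
            attn = torch.softmax(query @ kv.T, dim=-1)   # (100, 10*H)
            query = linear(attn @ kv)                    # (100, D)
        return self.out(query).squeeze(-1)               # (100,) y coordinates

D, H = 256, 64
r_prime = torch.randn(10, H, D)   # longitudinal sub-feature map R'
queries = torch.randn(100, D)     # 100 equally spaced sampling points (the Query)
y_coords = SeparationLineDecoder(D)(queries, r_prime)
```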
Thus, in this embodiment, the coordinate values of the sampling points in the sub-feature maps are predicted by the two Transformer networks and the separation-line positions are then obtained by aggregation, so the separation lines are identified effectively and the accuracy of the table cell areas is improved.
It should also be noted that, in one embodiment, the two Transformer networks are trained first, so that the loss between the predicted separation lines and the label separation lines meets the requirement, before they are applied to separation-line identification. During training, after the predicted transverse and longitudinal separation lines are computed, the losses between them and the preset label values, namely L_horizontal and L_vertical, can be determined. Illustratively, the Smooth L1 loss may be used, computed as follows:

SmoothL1(f(xi), yi) = 0.5 * (f(xi) - yi)^2, if |f(xi) - yi| < 1; |f(xi) - yi| - 0.5, otherwise;

wherein f(xi) is the predicted value and yi is the true value.
For example, in row prediction, the Transformer predicts a number of y values on the columns located at the abscissa positions x of the 10 reference points (the 10 red dashed lines in the upper-left diagram of FIG. 3) in the original image; post-processing these yields the coordinates (xi, yi) of the points forming each separation line. The label is a manually annotated line, and the Smooth L1 loss is computed between the predicted yi (i.e. f(xi)) and the true ordinate of the label line at xi. For column prediction the roles of rows and columns are swapped, and the Smooth L1 loss is computed between the predicted and true abscissa values. In this way the Smooth L1 losses of the row/column predictions are driven to meet the required conditions.
The training of models such as the two Transformer networks is not described in detail here.
In addition, it should be noted that, after the table structure is identified from the transverse and longitudinal separation lines, the cell areas of the table can be identified. However, the separation lines divide the whole table at the finest granularity: even cells that should appear as one large cell formed by merging several cells are cut into minimal-unit cells by the processing above. Therefore, to identify the table structure accurately, another network must also learn which cells should be merged into one cell, followed by merging and text filling, as in the following embodiments:
In one embodiment, step S44, that is, identifying the table structure of the table according to the transverse separation line and the longitudinal separation line, includes:
S441: combining the transverse separation lines and the longitudinal separation lines to obtain the initial cell areas of the table;
S442: scaling the coordinates of the initial cell areas of the table so that the scaled initial cell areas are the same size as the original feature map of the table;
S443: dividing the original feature map according to the coordinate values of the scaled cell areas to obtain a plurality of segmentation feature blocks;
S444: fusing the text features of the table with the corresponding segmentation feature blocks to obtain fusion features;
S445: inputting each fusion feature into the graph neural network to obtain the table structure of the table.
In this embodiment, the transverse and longitudinal separation lines are combined to obtain the initial cell areas of the table, i.e. the separation lines are integrated into the individual cell areas. The cell coordinates are then scaled to the size of the original feature map S, and S is segmented according to the scaled coordinate values into a number of segmentation feature blocks; each segmentation feature block serves as one input of a node in the graph convolutional network (Graph Convolutional Network, GCN). As shown in FIG. 2, the text features in the table are recognized from the table image by a text detection network together with a convolutional recurrent neural network; the text detection network may be a differentiable binarization network (Real-time Scene Text Detection with Differentiable Binarization, DBNet) or another text detection network, which is not specifically limited. The text features of the table are fused with the corresponding cut image blocks and then fed into the GCN network.
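A hedged sketch of the fusion step follows (text recognition itself, e.g. DBNet plus CRNN, is assumed to have produced token ids elsewhere; the pooling, the embedding, the vocabulary size and the fusion layer are all assumptions):

```python
# Fusing one cell's image feature block with its text feature; illustrative only.
import torch
import torch.nn as nn

D, T = 256, 128                               # assumed image/text feature dims
pool = nn.AdaptiveAvgPool2d(1)
text_embed = nn.Embedding(5000, T)            # assumed vocabulary size
fuse = nn.Linear(D + T, D)

cell_patch = torch.randn(1, D, 12, 40)        # segmentation feature block cut from S
token_ids = torch.tensor([[17, 42, 9]])       # recognized text ids (illustrative)
img_vec = pool(cell_patch).flatten(1)         # (1, D) pooled image feature
txt_vec = text_embed(token_ids).mean(dim=1)   # (1, T) averaged text embedding
node_feat = fuse(torch.cat([img_vec, txt_vec], dim=-1))   # one GCN node input
```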
It should be noted that the GCN network is used to predict which two of the identified cells should be merged into one cell. The GCN network is used after training, and the training process requires computing a matching loss, namely L_group. It will be appreciated that the GCN network learns an adjacency matrix recording the relationship between each cell and the other cells in the identified cell areas: 1 means related, i.e. the cells should be merged into one, and 0 means independent, i.e. they should not be merged. L_group is the cross-entropy loss between the adjacency matrix predicted by the GCN and the label adjacency matrix. To compute this loss, an adjacency matrix of label data recording the relationship of each minimal cell to the other cells must therefore also be prepared. Finally, using the trained GCN network on the combined text features and segmented image blocks, the coordinates of the cells to be merged are combined, and the recognized text content is filled into the corresponding cells to obtain the final table structure.
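The merge prediction can be sketched as follows (a single linear graph-convolution layer and a pairwise scoring head are assumptions; the binary form of the cross-entropy is used because the adjacency entries are 0/1):

```python
# An assumed structure for the cell-merging GCN, not the patented network.
import torch
import torch.nn as nn

class CellMergeGCN(nn.Module):
    def __init__(self, d: int = 256):
        super().__init__()
        self.gcn = nn.Linear(d, d)        # one graph-convolution layer
        self.score = nn.Linear(2 * d, 1)  # pairwise merge score

    def forward(self, feats: torch.Tensor, adj: torch.Tensor):
        # feats: (N, D) fused cell features; adj: (N, N) normalized input graph.
        h = torch.relu(self.gcn(adj @ feats))
        n = h.shape[0]
        hi = h.unsqueeze(1).expand(n, n, -1)      # feature of cell i
        hj = h.unsqueeze(0).expand(n, n, -1)      # feature of cell j
        return self.score(torch.cat([hi, hj], dim=-1)).squeeze(-1)  # (N, N) logits

n, d = 6, 256
logits = CellMergeGCN(d)(torch.randn(n, d), torch.eye(n))
labels = torch.eye(n)                             # label adjacency (illustrative)
l_group = nn.BCEWithLogitsLoss()(logits, labels)  # the matching loss L_group
```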
Notably, the final loss at training time can be expressed as follows: L_total = λ1 * L_horizontal + (1 - λ1) * L_vertical + λ * L_group;
the whole network architecture is trained until this final loss meets the requirement, where λ1 and λ are weight coefficients that may be configured.
In summary, combining the above embodiments, the overall processing flow may be as shown in FIG. 2. The original table picture is first fed into the ResNet-18 + FPN network to extract and fuse image features at different scales. Since the rows and columns of the table are to be divided next and require different features, two convolutional networks extract the transverse and longitudinal features respectively and exchange information between the two feature maps; the sampling positions of the column/row feature maps are determined from the row/column feature maps. The 10 columns and rows with the most pronounced features in the vertical and horizontal feature maps are selected as reference points of the horizontal and vertical features respectively, and the whole columns/rows at those positions are finally taken out as new sub-feature maps. This design mainly reflects that the intersections of rows and columns are the main positions dividing the cells: where the column features are most pronounced in the image, the row features can serve as the basis for dividing the cells.
The two sub-feature maps are then used as the Key and Value of the Transformers, and 100 points are extracted at equal intervals in each column/row of each sub-feature map as the Query; together with the Key and Value they form the Decoder part, and a linear layer connected at the Decoder output predicts the ordinate/abscissa of each sampling point. Each Transformer outputs 10×100 ordinates/abscissas. Coordinates that are close to each other are merged, giving the final row and column separation lines, i.e. the transverse and longitudinal separation lines, from which the finely divided cells are obtained. The cell coordinates are scaled to the size of the original feature map S, the feature map is segmented according to the scaled cell coordinates, the text in the table is detected and recognized with DBNet and CRNN, the text information is embedded and fused with the segmented image features of the corresponding cells, and a graph convolutional network is built with each fused cell as a node to predict the relationships among the cells. The merged table recognition result is finally obtained.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an execution order; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
In one embodiment, a device for general table recognition is provided, corresponding one-to-one to the method for general table recognition in the embodiments above. As shown in FIG. 4, the device includes an acquisition module 101, a determining module 102, an extraction module 103, and an identification module 104. The functional modules are described in detail as follows:
the acquisition module 101 is used for acquiring the transverse gap information and the longitudinal gap information of the table;
the determining module 102 is used for determining the longitudinal sampling positions of the transverse gap information according to the longitudinal gap information, and the transverse sampling positions of the longitudinal gap information according to the transverse gap information;
the extraction module 103 is used for extracting the whole-column information of the transverse gap information according to the longitudinal sampling positions to obtain a longitudinal sub-feature map, and extracting the whole-row information of the longitudinal gap information according to the transverse sampling positions to obtain a transverse sub-feature map;
the identification module 104 is used for identifying the table structure of the table according to the transverse sub-feature map and the longitudinal sub-feature map.
In one embodiment, the determining module 102 is configured to:
according to the longitudinal gap information, determining the positions of longitudinal reference points with preset reference quantity from the transverse gap information;
and taking the longitudinal reference point positions of a preset reference number as longitudinal sampling positions.
In one embodiment, the determining module 102 is configured to:
obtaining a transverse vector from the transverse gap information, wherein the transverse vector comprises the value at each position point of the table in the transverse direction;
suppressing position points in the transverse vector whose spacing from an adjacent position point is smaller than a preset distance, so as to keep only the point with the larger value among the adjacent position points;
and screening out a preset reference number of position points from the suppressed transverse vector as the longitudinal reference point positions.
In one embodiment, the determining module 102 is configured to:
and determining the preset reference number of longitudinal reference point positions at equal intervals from the transverse gap information.
In one embodiment, the identification module 104 is configured to:
selecting a preset sampling number of position points from each column of the longitudinal sub-feature map as first sampling points, and selecting a preset sampling number of position points from each row of the transverse sub-feature map as second sampling points;
predicting the coordinate value of each first sampling point by using a first Transformer network, and predicting the coordinate value of each second sampling point by using a second Transformer network;
aggregating the coordinate values of all the first sampling points corresponding to each column to obtain a longitudinal separation line, and aggregating the coordinate values of all the second sampling points corresponding to each row to obtain a transverse separation line;
and identifying the table structure of the table based on the transverse separation lines and the longitudinal separation lines.
In one embodiment, the identification module 104 is configured to:
combining the transverse separation lines and the longitudinal separation lines to obtain an initial cell area of the table;
scaling the coordinates of the initial cell area of the table so that the scaled initial cell area is the same size as the original feature map of the table;
dividing the original feature map according to the coordinate values of the scaled cell area to obtain a plurality of segmentation feature blocks;
fusing the text features of the table with the corresponding segmentation feature blocks to obtain fusion features;
and inputting each fusion feature into the graph neural network to obtain the table structure of the table.
In an embodiment, the obtaining module 101 is configured to:
acquiring a table image of the table;
extracting the image information of the table image through a pre-trained target detection network to obtain an original feature map;
and respectively inputting the original feature map into a transverse feature extraction module and a longitudinal feature extraction module to obtain the transverse gap information and the longitudinal gap information of the table.
It can be seen that this embodiment provides a device for general table recognition. First, the transverse gap information R and the longitudinal gap information C of the table are acquired; the longitudinal/transverse sampling positions are determined according to C/R respectively, the longitudinal sub-feature map R' is extracted at the longitudinal sampling positions and the transverse sub-feature map C' at the transverse sampling positions; finally, the table structure of the table is identified from R' and C'. Because the embodiments of the present application extract global typesetting features, namely gap features, from the image information of the table picture, the dependence on table lines or intersection points is very low: the structure can still be correctly recognized when such information is incomplete or even absent, and the device also applies to borderless tables. It is thus a general approach capable of recognizing the structure of both bordered and borderless tables, with high universality and application value.
For specific limitations of the general table identification device, reference may be made to the above limitations of the general table identification method, which are not repeated here. The various modules in the above general table identification device may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory in the computer device in software form, so that the processor can call and execute the operations corresponding to the above modules.
In one embodiment, a computer device is provided, the internal structure of which may be as shown in FIG. 5. The computer device includes a processor, a memory, and a network interface connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a readable storage medium and an internal memory. The readable storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and the computer program in the readable storage medium. The network interface of the computer device is used for communicating with external devices through a network connection, for example to acquire table images. The computer program, when executed by the processor, implements the general table identification method of the above embodiments. The readable storage medium provided by this embodiment includes a non-volatile readable storage medium and a volatile readable storage medium.
In one embodiment, a computer device is provided, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:
acquiring transverse gap information and longitudinal gap information of a table;
determining a longitudinal sampling position of the transverse gap information according to the longitudinal gap information, and determining a transverse sampling position of the longitudinal gap information according to the transverse gap information;
extracting whole column information of the transverse gap information according to the longitudinal sampling position to obtain a longitudinal sub-feature map, and extracting whole row information of the longitudinal gap information according to the transverse sampling position to obtain a transverse sub-feature map;
the table structure of the table is identified from the transverse and longitudinal sub-feature maps.
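Chained end to end, the four steps above have the shape of the toy NumPy sketch below; the equidistant positions and the argmax stand in for the real sampling-position and coordinate-prediction modules and are assumptions, not the claimed implementation:

```python
import numpy as np

def recognize_table_structure(row_gap: np.ndarray, col_gap: np.ndarray,
                              n_cols: int = 8, n_rows: int = 8):
    """row_gap / col_gap: (H, W) transverse / longitudinal gap information."""
    h, w = row_gap.shape
    # step 2: sampling positions (equidistant placeholders)
    v_pos = np.linspace(0, w - 1, n_cols).round().astype(int)   # longitudinal
    h_pos = np.linspace(0, h - 1, n_rows).round().astype(int)   # transverse
    # step 3: whole columns / whole rows of the gap information
    col_submap = row_gap[:, v_pos]    # longitudinal sub-feature map, (H, n_cols)
    row_submap = col_gap[h_pos, :]    # transverse sub-feature map, (n_rows, W)
    # step 4: placeholder coordinate prediction per sampled line
    v_coords = col_submap.argmax(axis=0)   # one coordinate per sampled column
    h_coords = row_submap.argmax(axis=1)   # one coordinate per sampled row
    return v_coords, h_coords              # aggregated into separation lines upstream
```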
In one embodiment, one or more computer-readable storage media storing a computer program are provided. The readable storage media provided by this embodiment include non-volatile readable storage media and volatile readable storage media. The computer program, when executed by one or more processors, performs the following steps:
acquiring transverse gap information and longitudinal gap information of a table;
determining a longitudinal sampling position of the transverse gap information according to the longitudinal gap information, and determining a transverse sampling position of the longitudinal gap information according to the transverse gap information;
extracting whole column information of the transverse gap information according to the longitudinal sampling position to obtain a longitudinal sub-feature map, and extracting whole row information of the longitudinal gap information according to the transverse sampling position to obtain a transverse sub-feature map;
the table structure of the table is identified from the transverse and longitudinal sub-feature maps.
In an embodiment, a computer program product is also provided, comprising a computer program which, when executed by a processor, implements the following steps:
acquiring transverse gap information and longitudinal gap information of a table;
determining a longitudinal sampling position of the transverse gap information according to the longitudinal gap information, and determining a transverse sampling position of the longitudinal gap information according to the transverse gap information;
extracting whole column information of the transverse gap information according to the longitudinal sampling position to obtain a longitudinal sub-feature map, and extracting whole row information of the longitudinal gap information according to the transverse sampling position to obtain a transverse sub-feature map;
the table structure of the table is identified from the transverse and longitudinal sub-feature maps.
Those skilled in the art will appreciate that all or part of the above-described methods may be implemented by a computer program stored on a non-volatile or volatile readable storage medium which, when executed, may include the flows of the above method embodiments. Any reference to memory, storage, database, or other media used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above division of functional units and modules is illustrated. In practical applications, the above functions may be allocated to different functional units and modules as needed; that is, the internal structure of the apparatus may be divided into different functional units or modules to perform all or part of the functions described above.
The above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included within the scope of the present application.

Claims (9)

1. A method for universally identifying a form, comprising:
acquiring transverse gap information and longitudinal gap information of a table;
determining a longitudinal sampling position of the transverse gap information according to the longitudinal gap information, and determining a transverse sampling position of the longitudinal gap information according to the transverse gap information;
Extracting whole column information of the transverse gap information according to the longitudinal sampling position to obtain a longitudinal sub-feature map, and extracting whole row information of the longitudinal gap information according to the transverse sampling position to obtain a transverse sub-feature map;
selecting a preset sampling number of position points from each column of the longitudinal sub-feature map as first sampling points, and selecting a preset sampling number of position points from each row of the transverse sub-feature map as second sampling points;
predicting the coordinate value of each first sampling point by using a first Transformer network, and predicting the coordinate value of each second sampling point by using a second Transformer network;
aggregating the coordinate values of all the first sampling points corresponding to each column to obtain a longitudinal separation line, and aggregating the coordinate values of all the second sampling points corresponding to each row to obtain a transverse separation line;
and identifying the table structure of the table according to the transverse separation lines and the longitudinal separation lines.
2. The method for universally identifying a form according to claim 1, wherein the determining the longitudinal sampling position of the transverse gap information according to the longitudinal gap information comprises:
according to the longitudinal gap information, determining the longitudinal reference point positions of a preset reference number from the transverse gap information;
and taking the longitudinal reference point positions of the preset reference number as the longitudinal sampling positions.
3. The method for universally identifying a form according to claim 2, wherein the determining the longitudinal reference point positions of the preset reference number from the transverse gap information according to the longitudinal gap information comprises:
obtaining a transverse vector through the transverse gap information, wherein the transverse vector comprises coordinate values of each position point of the table in the transverse direction;
suppressing the position points in the transverse vector whose coordinate values are smaller than the preset distance, so as to keep the position points with larger coordinate values among adjacent position points;
and screening out a preset reference number of position points from the suppressed transverse vector as the longitudinal reference point positions.
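Purely as an illustrative reading of this suppression step (a greedy 1-D non-maximum-suppression-style pass; the ordering and tie handling are assumptions, not claim limitations):

```python
import numpy as np

def suppress_and_screen(vector: np.ndarray, min_dist: int, num_refs: int) -> np.ndarray:
    """Keep, among adjacent position points, the one with the larger
    coordinate value, suppressing neighbours within min_dist; then screen
    out num_refs survivors as longitudinal reference point positions."""
    order = np.argsort(vector)[::-1]                 # largest coordinate values first
    kept: list[int] = []
    for idx in order:
        if all(abs(int(idx) - k) >= min_dist for k in kept):
            kept.append(int(idx))                    # closer neighbours are suppressed
        if len(kept) == num_refs:
            break
    return np.sort(np.array(kept))
```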
4. The method for universally identifying a form according to claim 2, wherein the determining the longitudinal reference point positions of the preset reference number from the transverse gap information comprises:
and determining the longitudinal reference point positions of the preset reference number from the transverse gap information at equal intervals.
5. The method for universally identifying a form according to claim 1, wherein the identifying the table structure of the table according to the transverse separation lines and the longitudinal separation lines comprises:
combining the transverse separation lines and the longitudinal separation lines to obtain an initial cell area of the form;
scaling the coordinates of the initial cell area of the table so that the scaled initial cell area is the same as the original feature map of the table in size;
segmenting the original feature map according to the coordinate values of the scaled initial cell area to obtain a plurality of segmentation feature blocks;
fusing the text features of the form with the corresponding segmentation feature blocks to obtain fusion features;
and inputting each fusion characteristic into a graph neural network to obtain a table structure of the table.
6. The method for universally identifying a form according to any one of claims 1 to 4, wherein the acquiring transverse gap information and longitudinal gap information of the form comprises:
acquiring a form image of the form;
extracting image information of the table image through a pre-trained target detection network to obtain an original feature image;
and respectively inputting the original feature map into a transverse feature extraction module and a longitudinal feature extraction module to obtain the transverse gap information and the longitudinal gap information of the form.
7. A form generic identification device, the device comprising:
the acquisition module is used for acquiring transverse gap information and longitudinal gap information of the form;
the determining module is used for determining a longitudinal sampling position of the transverse gap information according to the longitudinal gap information, and determining a transverse sampling position of the longitudinal gap information according to the transverse gap information;
the extraction module is used for extracting whole column information of the transverse gap information according to the longitudinal sampling position to obtain a longitudinal sub-feature map, and extracting whole row information of the longitudinal gap information according to the transverse sampling position to obtain a transverse sub-feature map;
the identification module is used for:
selecting a preset sampling number of position points from each column of the longitudinal sub-feature map as first sampling points, and selecting a preset sampling number of position points from each row of the transverse sub-feature map as second sampling points;
predicting the coordinate value of each first sampling point by using a first Transformer network, and predicting the coordinate value of each second sampling point by using a second Transformer network;
aggregating the coordinate values of all the first sampling points corresponding to each column to obtain a longitudinal separation line, and aggregating the coordinate values of all the second sampling points corresponding to each row to obtain a transverse separation line;
and identifying the table structure of the table according to the transverse separation lines and the longitudinal separation lines.
8. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any one of claims 1 to 6 when executing the computer program.
9. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method according to any one of claims 1 to 6.
CN202310203359.8A 2023-03-06 2023-03-06 Method, device, equipment and medium for general identification of form Active CN116071770B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310203359.8A CN116071770B (en) 2023-03-06 2023-03-06 Method, device, equipment and medium for general identification of form

Publications (2)

Publication Number Publication Date
CN116071770A (en) 2023-05-05
CN116071770B (en) 2023-06-16

Family

ID=86178603

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310203359.8A Active CN116071770B (en) 2023-03-06 2023-03-06 Method, device, equipment and medium for general identification of form

Country Status (1)

Country Link
CN (1) CN116071770B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2006024171A1 (en) * 2004-09-01 2006-03-09 Ubitrak Inc. System for gaming chip identification and counting
CN112528863A * 2020-12-14 2021-03-19 Ping An Life Insurance Co of China Ltd Identification method and device of table structure, electronic equipment and storage medium
CN114463769A * 2022-02-11 2022-05-10 Beijing Youzhuju Network Technology Co Ltd Form recognition method and device, readable medium and electronic equipment
CN115601774A * 2022-12-12 2023-01-13 Shenzhen Qianhai Huanrong Lianyi Information Technology Service Co Ltd (CN) Table recognition method, apparatus, device, storage medium and program product

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110021133B * 2019-05-17 2020-11-20 Chongqing Fire Safety Technology Research Service Co Ltd All-weather fire-fighting fire patrol early-warning monitoring system and fire image detection method
JP7267854B2 * 2019-06-26 2023-05-02 Hitachi Channel Solutions Corp Form recognition device, form recognition method, and form recognition system
CN114241497B * 2021-11-09 2024-06-11 Shandong Normal University Table sequence identification method and system based on context attention mechanism
CN114155544A * 2021-11-15 2022-03-08 Shenzhen Qianhai Huanrong Lianyi Information Technology Service Co Ltd Wireless form identification method and device, computer equipment and storage medium
CN114677695A * 2022-04-01 2022-06-28 Industrial and Commercial Bank of China Ltd Table analysis method and device, computer equipment and storage medium
CN115273112A * 2022-07-29 2022-11-01 Beijing Kingsoft Digital Entertainment Co Ltd Table identification method and device, electronic equipment and readable storage medium
CN115546809A * 2022-11-29 2022-12-30 City Cloud Technology (China) Co Ltd Table structure identification method based on cell constraint and application thereof

Also Published As

Publication number Publication date
CN116071770A (en) 2023-05-05

Similar Documents

Publication Publication Date Title
CN110517278B (en) Image segmentation and training method and device of image segmentation network and computer equipment
CN110414507B (en) License plate recognition method and device, computer equipment and storage medium
CN111369581B (en) Image processing method, device, equipment and storage medium
CN109886330B (en) Text detection method and device, computer readable storage medium and computer equipment
CN112508975A (en) Image identification method, device, equipment and storage medium
CN112418278A (en) Multi-class object detection method, terminal device and storage medium
CN111723841A (en) Text detection method and device, electronic equipment and storage medium
CN112184687B (en) Road crack detection method based on capsule feature pyramid and storage medium
CN112287947B (en) Regional suggestion frame detection method, terminal and storage medium
CN113903022B (en) Text detection method and system based on feature pyramid and attention fusion
CN112749726B (en) Training method and device for target detection model, computer equipment and storage medium
CN111179270A (en) Image co-segmentation method and device based on attention mechanism
CN112348116A (en) Target detection method and device using spatial context and computer equipment
CN115601774B (en) Table recognition method, apparatus, device, storage medium and program product
CN112241646A (en) Lane line recognition method and device, computer equipment and storage medium
CN115526846A (en) Joint detection-based crack detection method and device and computer equipment
CN114419406A (en) Image change detection method, training method, device and computer equipment
CN116258859A (en) Semantic segmentation method, semantic segmentation device, electronic equipment and storage medium
CN118037692A (en) Steel surface defect detection method and system based on computer vision
CN114241360A (en) Video identification method and device based on self-adaptive reasoning
CN116071770B (en) Method, device, equipment and medium for general identification of form
CN113744280A (en) Image processing method, apparatus, device and medium
CN111027551B (en) Image processing method, apparatus and medium
CN110059696B (en) Image annotation method and device and computer readable storage medium
CN110796003B (en) Lane line detection method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant