CN114926851A - Method, system and storage medium for identifying table structure in table picture

Method, system and storage medium for identifying table structure in table picture

Info

Publication number
CN114926851A
CN114926851A
Authority
CN
China
Prior art keywords
table structure
cells
mask
horizontal
global
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210558928.6A
Other languages
Chinese (zh)
Inventor
喻晨曦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qizhidao Network Technology Co Ltd
Original Assignee
Qizhidao Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qizhidao Network Technology Co Ltd filed Critical Qizhidao Network Technology Co Ltd
Priority to CN202210558928.6A priority Critical patent/CN114926851A/en
Publication of CN114926851A publication Critical patent/CN114926851A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/412 Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables
    • G06V10/766 Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using regression, e.g. by projecting features on hyperplanes
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • G06V30/15 Cutting or merging image elements, e.g. region growing, watershed or clustering-based techniques
    • G06V30/191 Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/1918 Fusion techniques, i.e. combining data from various sources, e.g. sensor fusion
    • G06V30/414 Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to a method, a system and a storage medium for identifying a table structure in a table picture. The method comprises the following steps: acquiring a feature map corresponding to the table picture; for the feature map, performing global boundary segmentation and learning empty-cell segmentation information, that is, segmenting all aligned cells, including non-empty cells and empty cells, and generating the ground truth of empty cells according to the maximum height/width of the non-empty cells in the same row/column; performing global mask regression, and assigning soft labels to the pixels in all non-empty cells in the horizontal and vertical directions; and aligning the corresponding table structure in the horizontal and vertical directions according to the soft labels, so as to obtain adjusted boundary coordinate points and realize recovery of the table structure. The method and the device can identify empty cells during table structure detection and avoid the visual ambiguity caused by cells spanning rows/columns.

Description

Method, system and storage medium for identifying table structure in table picture
Technical Field
The present application relates to the field of table picture identification technologies, and in particular, to a method, a system, and a storage medium for identifying a table structure in a table picture.
Background
Accurate detection of the table structure in the table picture plays a crucial role in high-precision content identification of the table picture data.
In the prior art, detection of the table structure in a table picture is usually realized by detecting grid boundaries, but this method has an obvious limitation: it cannot process tables without grid boundaries. A method that first detects the positions of text blocks and then restores the relations between bounding boxes through a graph neural network can also handle tables without grid boundaries, but it not only requires the support of a huge number of data samples, it also has difficulty obtaining empty cells, so it easily falls into the problem of visual ambiguity for cells spanning rows or columns.
Disclosure of Invention
In order to solve the problems in the prior art, in particular that empty cells are difficult to obtain when detecting the table structure of a table picture, which easily leads to visual ambiguity for cells spanning rows/columns, the present application provides a method, a system and a storage medium for identifying the table structure in a table picture.
In a first aspect, the method for identifying a table structure in a table picture provided by the present application adopts the following technical solutions: a method for identifying a table structure in a table picture comprises the following steps:
acquiring a feature map corresponding to the table picture;
for the feature map, performing global boundary segmentation and learning empty-cell segmentation information, that is, segmenting all aligned cells, including non-empty cells and empty cells, and generating the ground truth of empty cells according to the maximum height/width of the non-empty cells in the same row/column;
performing global mask regression, and assigning soft labels to the pixels in all non-empty cells in the horizontal and vertical directions;
and aligning the corresponding table structure in the horizontal and vertical directions according to the soft labels, so as to obtain adjusted boundary coordinate points and realize recovery of the table structure.
With this technical scheme, all aligned cells, including non-empty cells and empty cells, are segmented; the ground truth of an empty cell is generated according to the maximum height/width of the non-empty cells in the same row/column; global mask regression is performed; and soft labels are assigned to the pixels in all non-empty cells in the horizontal and vertical directions within the proposed bounding box region. Empty cells can therefore be identified during table structure detection, and the visual ambiguity of cells spanning rows/columns is avoided.
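As an illustration of the ground-truth generation described above, the following minimal Python sketch builds an empty cell's base box from the non-empty cells in its row and column; the function name, the (x1, y1, x2, y2) box format and the anchor arguments are illustrative assumptions, not taken from the patent:

```python
# Illustrative sketch (not the patent's reference implementation): build a
# ground-truth box for an empty cell from the non-empty cells sharing its
# row/column, using the maximum height in the row and maximum width in the
# column. Box format: (x1, y1, x2, y2).

def empty_cell_ground_truth(row_cells, col_cells, anchor_x, anchor_y):
    """row_cells / col_cells: non-empty cell boxes in the empty cell's row
    and column; (anchor_x, anchor_y): top-left corner of the empty cell."""
    max_h = max(y2 - y1 for (x1, y1, x2, y2) in row_cells)  # max height in the row
    max_w = max(x2 - x1 for (x1, y1, x2, y2) in col_cells)  # max width in the column
    return (anchor_x, anchor_y, anchor_x + max_w, anchor_y + max_h)

# Example: a row of 20-px-high cells and a column of 80-px-wide cells
print(empty_cell_ground_truth([(0, 0, 50, 20), (60, 0, 120, 20)],
                              [(130, 0, 210, 20), (130, 30, 210, 55)],
                              anchor_x=130, anchor_y=0))
```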
Preferably, the method further comprises:
for the feature map, performing local mask alignment, that is, training a model to learn a binary segmentation task to align the cell regions, while a local mask regression task assigns soft labels to the pixels within the proposed bounding box region in the horizontal and vertical directions;
and performing mask re-scoring on the prediction results of the local mask and the global mask to obtain updated pixel-assigned soft labels.
With this method and the mask re-scoring strategy, the advantages of local and global features in object perception can be integrated and their predictions fused: local features predict a more reliable text-region mask, while global prediction provides more reliable long-range visual information, so the finally identified table structure is more accurate. In addition, compared with prior art that recovers tables simply by pixel segmentation such as UNet, the algorithm model of the present application locates empty cells by fusing local and global information, including binary segmentation and mask regression, which reduces bias and the risk of overfitting when fitting data, so the amount of supporting data samples required is relatively small.
Preferably, for mask regression, the pixels are assigned soft labels in the horizontal and vertical directions in the following way:
assume that the shape of a non-empty cell or proposed bounding box is rectangular, with its upper-left and lower-right corners denoted {(x_1, y_1), (x_2, y_2)}, where 0 ≤ x_1 ≤ x_2 ≤ X and 0 ≤ y_1 ≤ y_2 ≤ Y, and the shape of the mask is (2, Y, X); taking the midpoint of the text as the maximum regression target, with training targets in (0, 1], the local or global horizontal pyramid label prediction score F_h(x) and the local or global vertical pyramid label prediction score F_v(y) of any pixel (y, x) are respectively calculated as:
F_h(x) = (x - x_1) / (x_mid - x_1) for x_1 ≤ x ≤ x_mid; F_h(x) = (x_2 - x) / (x_2 - x_mid) for x_mid < x ≤ x_2; F_h(x) = 0 otherwise;
F_v(y) = (y - y_1) / (y_mid - y_1) for y_1 ≤ y ≤ y_mid; F_v(y) = (y_2 - y) / (y_2 - y_mid) for y_mid < y ≤ y_2; F_v(y) = 0 otherwise;
where 0 ≤ x < X, 0 ≤ y < Y, x_mid = (x_1 + x_2) / 2 and y_mid = (y_1 + y_2) / 2; x and y denote the horizontal and vertical coordinates of the pixel, (x_mid, y_mid) denotes the midpoint coordinates of the bounding box, and X and Y denote the maximum width and maximum height of the proposed bounding box, respectively.
By the method, local and global mask alignment can be accurately realized.
Preferably, mask re-scoring is performed on the prediction results of the local mask and the global mask by the following method: for a certain predicted aligned bounding box B = {(x_0, y_0), (x_1, y_1)}, the bounding box (bbox) of its text-region mask is first obtained and denoted B_1 = {(x'_0, y'_0), (x'_1, y'_1)}; the matching connected regions P = {P_1, P_2, ..., P_n} are then found in the global boundary segmentation map, where P_i = (x, y) denotes a pixel; let P_0 = {p | x_0 ≤ p·x ≤ x_1, y_0 ≤ p·y ≤ y_1} denote the overlap region; then, for a point (x, y) ∈ P_0, the re-scored pyramid mask labels F(x) and F(y) in the horizontal and vertical directions are computed by fusing the local horizontal, global horizontal, local vertical and global vertical pyramid label prediction values F_h^l(x), F_h^g(x), F_v^l(y) and F_v^g(y) (the closed-form fusion expressions are given as formula images in the original publication and are not reproduced here).
With this method, the soft-label prediction results of the local mask and the global mask undergo mask re-scoring, the local and global information is balanced, and the predicted boundaries are refined, so that the finally identified table structure is more accurate.
Preferably, the adjusted boundary coordinate points are obtained as follows:
for the proposed bounding box region, find the matching connected region in the global boundary segmentation map, and then fit two planes in three-dimensional space using the horizontal and vertical pyramid labels obtained from mask re-scoring, respectively;
and compute the intersection lines of the fitted planes with the zero plane to obtain the adjusted boundary coordinate points of the proposed bounding box.
With these steps, a more accurately aligned bounding box can be obtained, making the finally recovered table structure more accurate.
Preferably, the model is trained to learn a binary segmentation task; specifically, a prediction mark is obtained by dynamically adjusting a probability map of pixels being text, a threshold map of pixels, and an approximate binary map, and is used to distinguish text regions from non-text regions.
More preferably, the prediction mark B_{i,j} of the approximate binary map is calculated by the following formula:
B_{i,j} = 1 / (1 + e^(-k(P_{i,j} - T_{i,j})))
where P_{i,j} is the prediction mark of the probability map of pixels being text, and T_{i,j} is the prediction mark of the threshold map of pixels; k is an integer parameter for adjusting the gradient amplitude; when x = P_{i,j} - T_{i,j} < 0, the prediction mark B_{i,j} approaches 0, indicating a non-text region; when x = P_{i,j} - T_{i,j} > 0, the prediction mark B_{i,j} approaches 1, indicating a text region.
Setting the parameter k facilitates optimization, allows the three output maps to be optimized better, and yields a superior final segmentation result.
Preferably, the recovery of the table structure includes a cell matching step, specifically: if a pair of aligned bounding boxes overlap in the abscissa or the ordinate, they are matched in the corresponding vertical or horizontal direction; and the aligned bounding boxes are connected vertically or horizontally.
Preferably, the recovery of the table structure includes an empty cell positioning step, specifically:
after cell matching is completed, the connection relations between aligned bounding boxes are represented by edges, and all nodes in the same row or column form a complete subgraph;
searching all maximal cliques in the subgraph using a maximum clique search algorithm; during the search over rows or columns, all nodes belonging to the same row or column fall into the same clique, and a cell spanning multiple rows or columns appears multiple times in different cliques; the cliques are sorted by their average y-coordinate or x-coordinate, each node is labeled with the row index or column index of its clique, and nodes appearing in multiple cliques are labeled with multiple row or column indexes, thereby determining the vacant positions corresponding to empty cells.
By adopting the method, the empty cells can be more accurately positioned.
Preferably, the recovery of the table structure further includes an empty cell merging step, specifically:
designating the shape of a single empty cell's aligned bounding box as the maximum height/width of the cells in the same row/column;
and, according to the cues learned by the global segmentation task and using a pixel voting mechanism to determine the result: calculating the proportion of pixels predicted as 1 in the separating region of each pair of adjacent empty cells, and merging the adjacent empty cells if this proportion is greater than a preset probability threshold. In this way, the processing result of the original feature map, which contains global boundary information, is fused with the empty-cell positioning result to complete the empty cells.
Preferably, the probability threshold is obtained from a sensitivity analysis; sensitivity here refers to how the loss of each branch decreases under different probability thresholds, and the threshold yielding the lowest loss is taken as the optimal probability threshold.
Preferably, a VoVNetV2-39-FPN model is used as the backbone network to process the table picture and obtain its corresponding feature map. Under the condition of the same model inference speed, the accuracy is 5% higher than ResNet50-FPN, the inference speed is faster than ResNeXt-FPN, and small targets (i.e., where the target text font is small or the table occupies only a small part of the picture) can be detected with higher accuracy.
Preferably, the model is trained to learn the binary segmentation task for aligning the cell regions using a differentiable binarization loss statistic, enabling more accurate segmentation into content cells.
In a second aspect, the present application provides a system for identifying a table structure in a table picture, which adopts the following technical solutions: a system for identifying a table structure in a table picture, comprising:
the feature map acquisition module is used for acquiring a feature map corresponding to the table picture;
the global boundary segmentation module is used for performing global boundary segmentation on the feature map and learning empty-cell segmentation information, that is, segmenting all aligned cells, including non-empty cells and empty cells, and generating the ground truth of empty cells according to the maximum height/width of the non-empty cells in the same row/column;
the global mask regression module is used for performing global mask regression and distributing soft labels to pixels in all non-empty cells in the horizontal and vertical directions;
and the table structure aligning and recovering module is used for aligning the corresponding table structure in the horizontal and vertical directions according to the soft label so as to recover the table structure.
In a third aspect, the present application provides an electronic device, which adopts the following technical solution:
an electronic device comprising a memory and a processor, the memory having stored thereon a computer program that can be loaded by the processor and that executes the method according to any of the preceding claims.
In a fourth aspect, the present application provides a computer-readable storage medium, which adopts the following technical solutions:
a computer-readable storage medium storing a computer program that can be loaded by a processor and executed to perform a method as any one of the preceding.
In summary, the present application includes at least one of the following beneficial technical effects:
1. The application segments all aligned cells, including non-empty cells and empty cells, generates the ground truth of empty cells according to the maximum height/width of the non-empty cells in the same row/column, performs global mask regression, and assigns soft labels to the pixels in all non-empty cells in the horizontal and vertical directions within the proposed bounding box region, so that empty cells can be identified during table structure detection and the visual ambiguity of cells spanning rows/columns is avoided.
2. With this method, the table structure of most table pictures can be identified with a precision of more than 90%, and based on accurate table structure identification, table content can be detected and identified with a precision of 95%.
Drawings
FIG. 1 is a flow chart of a method of an embodiment of the present application.
Fig. 2 is a schematic diagram of a table structure in a table picture identified by the method of the present application.
Fig. 3 is a flow chart of a method of another embodiment of the present application.
Detailed Description
The present application is described in further detail below with reference to figures 1-3.
The embodiment of the application discloses a method for identifying a table structure in a table picture. Referring to fig. 1, a method for identifying a table structure in a table picture includes the following steps:
S1, acquiring a feature map corresponding to the table picture;
S2, for the feature map, performing global boundary segmentation and learning empty-cell segmentation information, that is, segmenting all aligned cells, including non-empty cells and empty cells, and generating the ground truth of empty cells according to the maximum height/width of the non-empty cells in the same row/column (only this task can learn the empty-cell segmentation information, and the most reasonable cell-splitting scheme is captured during global boundary segmentation);
S3, performing global mask regression, and assigning soft labels to the pixels in all non-empty cells in the horizontal and vertical directions;
S4, aligning the corresponding table structure in the horizontal and vertical directions according to the soft labels, thereby obtaining adjusted boundary coordinate points and realizing recovery of the table structure.
With these steps, the empty cells in FIG. 2 can be detected and identified, and the visual ambiguity of cells spanning rows/columns can be avoided.
To further improve the recognition accuracy of the table picture, before step S1 the method may further include preprocessing the table picture, such as adjusting its definition, size, and angle.
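The text does not fix a concrete preprocessing recipe; the sketch below shows one plausible realisation with OpenCV (upscaling plus Otsu-based deskew) of the "definition, size, angle" adjustments mentioned above. The function name and target size are assumptions:

```python
import cv2
import numpy as np

# Hedged sketch only: upscale for definition/size, then estimate and remove
# the dominant skew angle of the ink pixels.
def preprocess_table_picture(img_bgr, target_long_side=1600):
    h, w = img_bgr.shape[:2]
    scale = target_long_side / max(h, w)
    img = cv2.resize(img_bgr, None, fx=scale, fy=scale,
                     interpolation=cv2.INTER_CUBIC)            # adjust size
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    bw = cv2.threshold(gray, 0, 255,
                       cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
    coords = np.column_stack(np.where(bw > 0)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]                        # dominant skew
    if angle > 45:
        angle -= 90
    M = cv2.getRotationMatrix2D((img.shape[1] / 2, img.shape[0] / 2), angle, 1.0)
    return cv2.warpAffine(img, M, (img.shape[1], img.shape[0]),
                          flags=cv2.INTER_CUBIC, borderValue=(255, 255, 255))
```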
To prevent boxes from overlapping, all ground-truth labels of the aligned bounding boxes can be shrunk by 5%-9%; the specific shrink ratio that works best can be determined experimentally.
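For illustration, shrinking a box symmetrically about its centre by a ratio r in the stated 5%-9% range could look like the following sketch (the helper name is an assumption):

```python
# Shrink an (x1, y1, x2, y2) box about its centre by ratio r.
def shrink_box(box, r=0.07):
    x1, y1, x2, y2 = box
    dx, dy = (x2 - x1) * r / 2, (y2 - y1) * r / 2
    return (x1 + dx, y1 + dy, x2 - dx, y2 - dy)

print(shrink_box((0, 0, 100, 40)))   # (3.5, 1.4, 96.5, 38.6)
```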
To further improve the accuracy of identifying the table structure in the table picture, the method may further include the following steps (as shown in fig. 3):
S2', for the feature map, performing local mask alignment, that is, training a model to learn a binary segmentation task to align the cell regions, while a local mask regression task assigns soft labels to the pixels within the proposed bounding box region in the horizontal and vertical directions;
S3', performing mask re-scoring on the prediction results of the local mask and the global mask to obtain updated pixel-assigned soft labels.
In step S2', in order to make the obtained text-region mask information more accurate, the feature map may be processed by an FCN before the model is trained to learn the binary segmentation task.
In order to make the original mask positions more accurate, the feature map may be processed by RoI Align before the local mask regression task is performed.
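As a sketch of this step, torchvision's RoI Align can crop a fixed-size feature window per proposed box; the tensor shapes, feature stride and 14x14 output size below are illustrative choices, not values from the patent:

```python
import torch
from torchvision.ops import roi_align

feat = torch.randn(1, 256, 64, 64)                 # (N, C, H, W) feature map
boxes = [torch.tensor([[4.0, 4.0, 28.0, 12.0]])]   # per-image (x1, y1, x2, y2)
rois = roi_align(feat, boxes, output_size=(14, 14),
                 spatial_scale=0.25, aligned=True)  # feature stride 4 assumed
print(rois.shape)                                   # torch.Size([1, 256, 14, 14])
```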
For the mask regression in step S3 and step S2', the pixels are assigned soft labels in the horizontal and vertical directions as follows:
assume that the shape of a non-empty cell or proposed bounding box is rectangular, with its upper-left and lower-right corners denoted {(x_1, y_1), (x_2, y_2)}, where 0 ≤ x_1 ≤ x_2 ≤ X and 0 ≤ y_1 ≤ y_2 ≤ Y, and the shape of the mask is (2, Y, X); taking the midpoint of the text as the maximum regression target, with training targets in (0, 1], the local or global horizontal pyramid label prediction score F_h(x) and the local or global vertical pyramid label prediction score F_v(y) of any pixel (y, x) are respectively calculated as:
F_h(x) = (x - x_1) / (x_mid - x_1) for x_1 ≤ x ≤ x_mid; F_h(x) = (x_2 - x) / (x_2 - x_mid) for x_mid < x ≤ x_2; F_h(x) = 0 otherwise;
F_v(y) = (y - y_1) / (y_mid - y_1) for y_1 ≤ y ≤ y_mid; F_v(y) = (y_2 - y) / (y_2 - y_mid) for y_mid < y ≤ y_2; F_v(y) = 0 otherwise;
where 0 ≤ x < X, 0 ≤ y < Y, x_mid = (x_1 + x_2) / 2 and y_mid = (y_1 + y_2) / 2; x and y denote the horizontal and vertical coordinates of the pixel, (x_mid, y_mid) denotes the midpoint coordinates of the bounding box, and X and Y denote the maximum width and maximum height of the proposed bounding box, respectively.
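The piecewise-linear pyramid label above can be rendered as a (2, H, W) soft-label mask; the following numpy sketch (with assumed function and argument names) implements it for a single box:

```python
import numpy as np

# Channel 0: horizontal pyramid label; channel 1: vertical pyramid label.
# Values rise linearly from 0 at the box border to 1 at the midpoint.
def pyramid_labels(x1, y1, x2, y2, W, H):
    x_mid, y_mid = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    xs, ys = np.arange(W), np.arange(H)
    fx = np.where(xs <= x_mid, (xs - x1) / (x_mid - x1), (x2 - xs) / (x2 - x_mid))
    fy = np.where(ys <= y_mid, (ys - y1) / (y_mid - y1), (y2 - ys) / (y2 - y_mid))
    fx = np.clip(fx, 0.0, 1.0)          # zero outside [x1, x2]
    fy = np.clip(fy, 0.0, 1.0)          # zero outside [y1, y2]
    mask = np.zeros((2, H, W))
    mask[0] = np.broadcast_to(fx, (H, W))
    mask[1] = np.broadcast_to(fy[:, None], (H, W))
    return mask

print(pyramid_labels(2, 1, 8, 5, W=10, H=6)[0, 3])  # horizontal slice through the box
```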
In step S3', the prediction results of the local mask and the global mask may be re-scored specifically by the following method:
for a certain predicted aligned bounding box B = {(x_0, y_0), (x_1, y_1)}, the bounding box (bbox) of its text-region mask is first obtained and denoted B_1 = {(x'_0, y'_0), (x'_1, y'_1)}; the matching connected regions P = {P_1, P_2, ..., P_n} are then found in the global boundary segmentation map, where P_i = (x, y) denotes a pixel; let P_0 = {p | x_0 ≤ p·x ≤ x_1, y_0 ≤ p·y ≤ y_1} denote the overlap region; then, for a point (x, y) ∈ P_0, the re-scored pyramid mask labels F(x) and F(y) in the horizontal and vertical directions (i.e., the updated pixel-assigned soft labels described above) are computed by fusing the local horizontal, global horizontal, local vertical and global vertical pyramid label prediction values F_h^l(x), F_h^g(x), F_v^l(y) and F_v^g(y) (the closed-form fusion expressions are given as formula images in the original publication and are not reproduced here).
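Because the closed-form fusion expressions are not reproduced here, the following sketch only assumes one natural one-dimensional realisation of the re-scoring idea: trust the local prediction inside the text-region box and blend linearly toward the global prediction near the aligned-box border. It is explicitly not the patent's formula:

```python
# ASSUMED fusion rule, not taken from the patent: weight local vs. global
# predictions by a pixel's position between the text-region box B1 and the
# aligned box B, for one axis at a time.
def rescore_1d(t, lo, hi, lo_txt, hi_txt, local_val, global_val):
    """t: pixel coordinate; [lo, hi]: aligned box B; [lo_txt, hi_txt]:
    text-region box B1 (assumed to lie inside B)."""
    if lo_txt <= t <= hi_txt:                       # inside text region: local
        w_local = 1.0
    elif t < lo_txt:                                # left/top margin: blend
        w_local = (t - lo) / max(lo_txt - lo, 1e-6)
    else:                                           # right/bottom margin: blend
        w_local = (hi - t) / max(hi - hi_txt, 1e-6)
    return w_local * local_val + (1.0 - w_local) * global_val

print(rescore_1d(12, 10, 40, 15, 35, local_val=0.3, global_val=0.6))  # 0.48
```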
In step S3', the adjusted boundary coordinate points are obtained as follows:
S3'1, for the proposed bounding box region, find the matching connected region in the global boundary segmentation map, and then fit two planes in three-dimensional space using the horizontal and vertical pyramid labels obtained from mask re-scoring, respectively;
S3'2, compute the intersection lines of the fitted planes with the zero plane to obtain the adjusted boundary coordinate points of the proposed bounding box.
In step S2', the model is trained to learn a binary segmentation task; specifically, the prediction mark is obtained by dynamically adjusting the probability map of pixels being text, the threshold map of pixels and the approximate binary map, and is used to distinguish text regions from non-text regions, where text regions may be marked as 1 and other regions as 0.
Specifically, the prediction mark B_{i,j} of the approximate binary map can be calculated by the following formula:
B_{i,j} = 1 / (1 + e^(-k(P_{i,j} - T_{i,j})))
where P_{i,j} is the prediction mark of the probability map of pixels being text, and T_{i,j} is the prediction mark of the threshold map of pixels; k is an integer parameter for adjusting the gradient amplitude: during training, the gradient with, for example, k = 50 is much larger than with k = 1, and the gradient in wrongly predicted regions is amplified; when x = P_{i,j} - T_{i,j} < 0, the prediction mark B_{i,j} approaches 0, indicating a non-text region; when x = P_{i,j} - T_{i,j} > 0, the prediction mark B_{i,j} approaches 1, indicating a text region.
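The formula above is the standard differentiable binarization; a one-line PyTorch sketch with toy values:

```python
import torch

# B = 1 / (1 + exp(-k * (P - T))): large k (e.g. 50) makes the map nearly
# binary while keeping useful gradients near the decision boundary P = T.
def approx_binary_map(P, T, k=50):
    return torch.sigmoid(k * (P - T))

P = torch.tensor([[0.9, 0.2], [0.6, 0.4]])   # probability map
T = torch.tensor([[0.5, 0.5], [0.5, 0.5]])   # learned threshold map
print(approx_binary_map(P, T))               # ~1 where P > T, ~0 where P < T
```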
The loss function L_lm of the training model for the binary segmentation task is the sum of the loss L_s of the probability-map prediction mark of pixels being text, the loss L_b of the approximate binary map, and the loss L_t of the threshold-map prediction mark of pixels:
L_lm = L_s + α·L_b + β·L_t
where α and β are hyper-parameters; the probability-map prediction loss L_s can use BCE loss; the approximate-binary-map loss L_b can use dice loss, which alleviates the imbalance between positive and negative pixel samples; and the threshold-map prediction loss L_t can use L1 loss.
In step S4, the recovery of the table structure includes a cell matching step, specifically: if a pair of aligned bounding boxes overlap in the abscissa or the ordinate, they are matched in the corresponding vertical or horizontal direction, and the aligned bounding boxes are connected vertically or horizontally.
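A minimal sketch of this interval-overlap matching rule (box format and function names assumed):

```python
# Boxes are (x1, y1, x2, y2). Two boxes match vertically if their x-intervals
# overlap, and horizontally if their y-intervals overlap.
def overlap_1d(a0, a1, b0, b1):
    return min(a1, b1) - max(a0, b0) > 0

def match(box_a, box_b):
    edges = []
    if overlap_1d(box_a[0], box_a[2], box_b[0], box_b[2]):
        edges.append("vertical")     # same-column connection
    if overlap_1d(box_a[1], box_a[3], box_b[1], box_b[3]):
        edges.append("horizontal")   # same-row connection
    return edges

print(match((0, 0, 50, 20), (5, 30, 60, 50)))   # ['vertical']
```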
In step S4, the recovery of the table structure further includes an empty cell positioning step, specifically:
S41, after cell matching is completed, the connection relations between aligned bounding boxes are represented by edges, and all nodes in the same row or column form a complete subgraph;
S42, a maximum clique search algorithm is used to find all maximal cliques in the subgraph; during the search over rows or columns, all nodes belonging to the same row or column fall into the same clique, and a cell spanning multiple rows or columns appears multiple times in different cliques; the cliques are sorted by their average y-coordinate or x-coordinate, each node is labeled with the row index or column index of its clique, and nodes appearing in multiple cliques are labeled with multiple row or column indexes, thereby determining the vacant positions corresponding to empty cells.
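For illustration, the row-recovery side of steps S41-S42 using networkx's maximal-clique search; the toy boxes and edges are invented for the example:

```python
import networkx as nx

# Nodes are aligned boxes; edges connect horizontally matched boxes. Every
# maximal clique is a candidate row; cliques are sorted by mean y-centre to
# obtain row indices, and a node in several cliques spans several rows.
boxes = {0: (0, 0, 50, 20), 1: (60, 0, 120, 20),      # row 0
         2: (0, 30, 50, 50), 3: (60, 30, 120, 50),    # row 1
         4: (130, 0, 210, 50)}                         # spans rows 0 and 1

G = nx.Graph()
G.add_nodes_from(boxes)
G.add_edges_from([(0, 1), (0, 4), (1, 4), (2, 3), (2, 4), (3, 4)])

cliques = list(nx.find_cliques(G))                     # maximal cliques
cliques.sort(key=lambda c: sum((boxes[n][1] + boxes[n][3]) / 2 for n in c) / len(c))
row_ids = {}
for r, clique in enumerate(cliques):
    for n in clique:
        row_ids.setdefault(n, []).append(r)            # multi-row nodes
print(row_ids)   # node 4 receives both row indices 0 and 1
```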
In step S4, the recovery of the table structure further includes an empty cell merging step, specifically:
S441, designate the shape of a single empty cell's aligned bounding box as the maximum height/width of the cells in the same row/column;
S442, according to the cues learned by the global segmentation task and using a pixel voting mechanism to determine the result: calculate the proportion of pixels predicted as 1 in the separating region of each pair of adjacent empty cells, and merge the adjacent empty cells if this proportion is greater than a preset probability threshold. The probability threshold is obtained from a sensitivity analysis; sensitivity refers to how the loss of each branch decreases under different probability thresholds, and the threshold with the lowest loss is the optimal probability threshold, generally between 0.5 and 0.6.
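A small sketch of the pixel-voting merge test in step S442; the strip coordinates and the 0.55 threshold are illustrative:

```python
import numpy as np

# Look at the global-segmentation prediction in the strip between two
# adjacent empty cells; merge them if the fraction of pixels predicted as 1
# exceeds the threshold (0.5-0.6 per the text above).
def should_merge(global_seg, strip, threshold=0.55):
    """strip: (x1, y1, x2, y2) region separating two adjacent empty cells."""
    x1, y1, x2, y2 = strip
    patch = global_seg[y1:y2, x1:x2]
    return patch.mean() > threshold       # ratio of pixels voting "1"

seg = np.zeros((100, 100)); seg[10:40, 20:80] = 1
print(should_merge(seg, (30, 15, 60, 35)))   # True: strip lies inside one cell
```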
In step S1, the VoVNetV2-39-FPN model may be used as a backbone network to process the table pictures, so as to obtain the feature maps corresponding to the table pictures.
In step S2', the model may be trained to learn the binary segmentation task for aligning the cell regions using a differentiable binarization loss statistic. Other methods may also be employed to learn the binary segmentation task, such as Ridler-Calvard thresholding.
In this application, before the step S1 of obtaining the feature map corresponding to the table picture, a process of distinguishing foreground from background may also be performed; when performing table structure recovery training, box classification and box regression may also be included after step S4. The total loss function L is then:
L = L_rpn + γ_1·(L_cl + L_box) + γ_2·(L_lm + L_lp) + γ_3·(L_seg + L_gp)
where L_rpn is the binary cross entropy for distinguishing foreground from background; γ_1, γ_2 and γ_3 are adjustable hyper-parameters; L_cl is the binary cross entropy for box classification; L_box is the box regression loss; L_lm is the binary segmentation loss of the local mask; L_lp is the local mask regression loss in the horizontal and vertical directions; L_seg is the global mask binary segmentation loss; and L_gp is the global mask regression loss in the horizontal and vertical directions. The methods for obtaining L_rpn, L_cl and L_box are prior art.
The embodiment of the application also discloses a system for identifying the table structure in the table picture. A system for identifying a table structure in a table picture, comprising:
the feature map acquisition module is used for acquiring a feature map corresponding to the table picture;
the global boundary segmentation module is used for performing global boundary segmentation on the feature map and learning empty-cell segmentation information, that is, segmenting all aligned cells, including non-empty cells and empty cells, and generating the ground truth of empty cells according to the maximum height/width of the non-empty cells in the same row/column;
the global mask regression module is used for performing global mask regression and distributing soft labels to pixels in all non-empty cells in the horizontal and vertical directions;
and the table structure aligning and recovering module is used for aligning the corresponding table structure in the horizontal and vertical directions according to the soft label so as to recover the table structure.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules, so as to perform all or part of the functions described above.
The embodiment of the application further discloses the electronic equipment. An electronic device comprising a memory and a processor, said memory having stored thereon a computer program that can be loaded by the processor and that executes any of the methods described above.
The electronic device may be an electronic device such as a desktop computer, a notebook computer, or a cloud server, and the electronic device includes but is not limited to a processor and a memory, for example, the electronic device may further include an input/output device, a network access device, a bus, and the like.
A processor in the present application may include one or more processing cores. The processor executes or executes the instructions, programs, code sets, or instruction sets stored in the memory, calls data stored in the memory, performs various functions of the present application, and processes the data. The Processor may be at least one of an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a Central Processing Unit (CPU), a controller, a microcontroller, and a microprocessor. It is understood that the electronic devices for implementing the above processor functions may be other devices, and the embodiments of the present application are not limited in particular.
The memory may be an internal storage unit of the electronic device, such as a hard disk or internal memory of the electronic device, or an external storage device of the electronic device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card or a flash card provided on the electronic device; the memory may also be a combination of an internal storage unit and an external storage device of the electronic device. The memory is used for storing the computer program and other programs and data required by the electronic device, and may also be used for temporarily storing data that has been output or will be output, which is not limited in this application.
The embodiment of the application also discloses a computer readable storage medium. A computer readable storage medium storing a computer program capable of being loaded by a processor and performing any of the methods described above.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above may be implemented by a computer program instructing relevant hardware; the computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The above embodiments are preferred embodiments of the present application, and the protection scope of the present application is not limited by them; therefore, equivalent variations made according to the methods and principles of the present application are intended to fall within the protection scope of the present application.

Claims (16)

1. A method for identifying a table structure in a table picture, characterized in that the method comprises the following steps:
acquiring a feature map corresponding to the table picture;
for the feature map, performing global boundary segmentation and learning empty-cell segmentation information, that is, segmenting all aligned cells, including non-empty cells and empty cells, and generating the ground truth of empty cells according to the maximum height/width of the non-empty cells in the same row/column;
performing global mask regression, and assigning soft labels to the pixels in all non-empty cells in the horizontal and vertical directions;
and aligning the corresponding table structure in the horizontal and vertical directions according to the soft labels, so as to obtain adjusted boundary coordinate points and realize recovery of the table structure.
2. The method of claim 1, further comprising: for the feature map, performing local mask alignment, that is, training a model to learn a binary segmentation task to align the cell regions, while a local mask regression task assigns soft labels to the pixels within the proposed bounding box region in the horizontal and vertical directions;
and performing mask re-scoring on the prediction results of the local mask and the global mask to obtain updated pixel-assigned soft labels.
3. The method for identifying a table structure in a table picture according to claim 1 or 2, characterized in that, for mask regression, the pixels are assigned soft labels in the horizontal and vertical directions in the following way:
assuming that the shape of a non-empty cell or proposed bounding box is rectangular, with its upper-left and lower-right corners denoted {(x_1, y_1), (x_2, y_2)}, where 0 ≤ x_1 ≤ x_2 ≤ X and 0 ≤ y_1 ≤ y_2 ≤ Y, and the shape of the mask being (2, Y, X), taking the midpoint of the text as the maximum regression target, with training targets in (0, 1], the local or global horizontal pyramid label prediction score F_h(x) and the local or global vertical pyramid label prediction score F_v(y) of any pixel (y, x) are respectively calculated as:
F_h(x) = (x - x_1) / (x_mid - x_1) for x_1 ≤ x ≤ x_mid; F_h(x) = (x_2 - x) / (x_2 - x_mid) for x_mid < x ≤ x_2; F_h(x) = 0 otherwise;
F_v(y) = (y - y_1) / (y_mid - y_1) for y_1 ≤ y ≤ y_mid; F_v(y) = (y_2 - y) / (y_2 - y_mid) for y_mid < y ≤ y_2; F_v(y) = 0 otherwise;
where 0 ≤ x < X, 0 ≤ y < Y, x_mid = (x_1 + x_2) / 2 and y_mid = (y_1 + y_2) / 2; x and y denote the horizontal and vertical coordinates of the pixel, (x_mid, y_mid) denotes the midpoint coordinates of the bounding box, and X and Y denote the maximum width and maximum height of the proposed bounding box, respectively.
4. The method of claim 2, wherein mask re-scoring is performed on the prediction results of the local mask and the global mask by the following method:
for a certain predicted aligned bounding box B = {(x_0, y_0), (x_1, y_1)}, the bounding box (bbox) of its text-region mask is first obtained and denoted B_1 = {(x'_0, y'_0), (x'_1, y'_1)}; the matching connected regions P = {P_1, P_2, ..., P_n} are then found in the global boundary segmentation map, where P_i = (x, y) denotes a pixel; let P_0 = {p | x_0 ≤ p·x ≤ x_1, y_0 ≤ p·y ≤ y_1} denote the overlap region; then, for a point (x, y) ∈ P_0, the re-scored pyramid mask labels F(x) and F(y) in the horizontal and vertical directions are computed by fusing the local horizontal, global horizontal, local vertical and global vertical pyramid label prediction values F_h^l(x), F_h^g(x), F_v^l(y) and F_v^g(y) (the closed-form fusion expressions are given as formula images in the original publication and are not reproduced here).
5. The method of identifying a table structure in a table picture of claim 2, wherein the adjusted boundary coordinate points are obtained by:
for the proposed bounding box region, finding the matching connected region in the global boundary segmentation map, and then fitting two planes in three-dimensional space using the horizontal and vertical pyramid labels obtained from mask re-scoring, respectively;
and computing the intersection lines of the fitted planes with the zero plane to obtain the adjusted boundary coordinate points of the proposed bounding box.
6. The method of claim 2, characterized in that the model is trained to learn a binary segmentation task, and specifically a prediction mark is obtained by dynamically adjusting a probability map of pixels being text, a threshold map of pixels and an approximate binary map, for distinguishing text regions from non-text regions.
7. The method for identifying a table structure in a table picture according to claim 6, characterized in that the prediction mark B_{i,j} of the approximate binary map is calculated by the following formula:
B_{i,j} = 1 / (1 + e^(-k(P_{i,j} - T_{i,j})))
where P_{i,j} is the prediction mark of the probability map of pixels being text, and T_{i,j} is the prediction mark of the threshold map of pixels; k is an integer parameter for adjusting the gradient amplitude; when x = P_{i,j} - T_{i,j} < 0, the prediction mark B_{i,j} approaches 0, indicating a non-text region; when x = P_{i,j} - T_{i,j} > 0, the prediction mark B_{i,j} approaches 1, indicating a text region.
8. The method of claim 1, wherein the recovery of the table structure comprises a cell matching step, specifically: if a pair of aligned bounding boxes overlap in the abscissa or the ordinate, they are matched in the corresponding vertical or horizontal direction; and the aligned bounding boxes are connected vertically or horizontally.
9. The method of claim 8, wherein the recovery of the table structure comprises an empty cell positioning step, specifically:
after cell matching is completed, the connection relations between aligned bounding boxes are represented by edges, and all nodes in the same row or column form a complete subgraph;
searching all maximal cliques in the subgraph using a maximum clique search algorithm; during the search over rows or columns, all nodes belonging to the same row or column fall into the same clique, and a cell spanning multiple rows or columns appears multiple times in different cliques; the cliques are sorted by their average y-coordinate or x-coordinate, each node is labeled with the row index or column index of its clique, and nodes appearing in multiple cliques are labeled with multiple row or column indexes, thereby determining the vacant positions corresponding to empty cells.
10. The method of claim 9, wherein the recovery of the table structure further comprises an empty cell merging step, specifically:
designating the shape of a single empty cell's aligned bounding box as the maximum height/width of the cells in the same row/column;
and calculating the proportion of pixels predicted as 1 in the separating region of each pair of adjacent empty cells, and merging the adjacent empty cells if this proportion is greater than a preset probability threshold.
11. The method of claim 10, wherein the probability threshold is obtained from a sensitivity analysis; the sensitivity refers to how the loss of each branch decreases under different probability thresholds, and the threshold yielding the lowest loss is the optimal probability threshold.
12. The method for identifying the table structure in the table picture according to claim 1, wherein a VoVNetV2-39-FPN model is used as a backbone network to process the table picture, so as to obtain a feature map corresponding to the table picture.
13. The method of claim 2, wherein the model is trained to learn the binary segmentation task for aligning the cell regions using a differentiable binarization loss statistic.
14. A system for identifying a table structure in a table picture, comprising:
the feature map acquisition module is used for acquiring a feature map corresponding to the table picture;
the global boundary segmentation module is used for performing global boundary segmentation on the feature map and learning empty-cell segmentation information, that is, segmenting all aligned cells, including non-empty cells and empty cells, and generating the ground truth of empty cells according to the maximum height/width of the non-empty cells in the same row/column;
the global mask regression module is used for performing global mask regression and distributing soft labels to pixels in all non-empty cells in the horizontal and vertical directions;
and the table structure aligning and recovering module is used for aligning the corresponding table structure in the horizontal and vertical directions according to the soft label so as to recover the table structure.
15. An electronic device comprising a memory and a processor, the memory having stored thereon a computer program that can be loaded by the processor and that executes the method according to any of claims 1 to 13.
16. A computer-readable storage medium, characterized in that a computer program is stored which can be loaded by a processor and which executes a method according to any one of claims 1 to 13.
CN202210558928.6A 2022-05-21 2022-05-21 Method, system and storage medium for identifying table structure in table picture Pending CN114926851A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210558928.6A CN114926851A (en) 2022-05-21 2022-05-21 Method, system and storage medium for identifying table structure in table picture

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210558928.6A CN114926851A (en) 2022-05-21 2022-05-21 Method, system and storage medium for identifying table structure in table picture

Publications (1)

Publication Number Publication Date
CN114926851A true CN114926851A (en) 2022-08-19

Family

ID=82810762

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210558928.6A Pending CN114926851A (en) 2022-05-21 2022-05-21 Method, system and storage medium for identifying table structure in table picture

Country Status (1)

Country Link
CN (1) CN114926851A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115331245A (en) * 2022-10-12 2022-11-11 中南民族大学 Table structure identification method based on image instance segmentation
CN115331245B (en) * 2022-10-12 2023-02-03 中南民族大学 Table structure identification method based on image instance segmentation


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
Address after: 518051 2201, block D, building 1, bid section 1, Chuangzhi Yuncheng, Liuxian Avenue, Xili community, Xili street, Nanshan District, Shenzhen, Guangdong
Applicant after: Qizhi Technology Co.,Ltd.
Address before: 518051 2201, block D, building 1, bid section 1, Chuangzhi Yuncheng, Liuxian Avenue, Xili community, Xili street, Nanshan District, Shenzhen, Guangdong
Applicant before: Qizhi Network Technology Co.,Ltd.