CN115171135A - Hand-drawn chart identification method based on key point prediction - Google Patents


Info

Publication number
CN115171135A
CN115171135A (application CN202210615119.4A)
Authority
CN
China
Prior art keywords
arrow
primitive
nested
corner
pooling
Prior art date
Legal status
Pending
Application number
CN202210615119.4A
Other languages
Chinese (zh)
Inventor
蔡波
方佳琪
Current Assignee
Yunnan Hengyu Technology Co ltd
Wuhan University WHU
Original Assignee
Yunnan Hengyu Technology Co ltd
Wuhan University WHU
Priority date
Filing date
Publication date
Application filed by Yunnan Hengyu Technology Co ltd and Wuhan University WHU
Priority to CN202210615119.4A
Publication of CN115171135A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/413 Classification of content, e.g. text, photographs or tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G06V30/15 Cutting or merging image elements, e.g. region growing, watershed or clustering-based techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/18 Extraction of features or characteristics of the image
    • G06V30/18143 Extracting features based on salient regional features, e.g. scale invariant feature transform [SIFT] keypoints
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/191 Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19173 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/191 Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/1918 Fusion techniques, i.e. combining data from various sources, e.g. sensor fusion

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of hand-drawn chart recognition, and in particular to a hand-drawn chart recognition method based on key point prediction, which comprises the following steps: 1. detecting each primitive in the chart as a pair of key points, i.e., jointly determining the upper left and lower right corner points of the primitive's bounding box; 2. performing maximal cross corner pooling (MICP) and cumulative cross corner pooling (CICP) in parallel; 3. performing feature fusion on the respective pooled feature maps; 4. for each arrow connecting two primitives, predicting the head and tail key points of the arrow with an arrow direction prediction network, and enhancing the arrow key point information with a snowflake corner pooling (SCP) module; in this way, the identification of the entire chart structure is completed. The invention can better recognize hand-drawn charts.

Description

Hand-drawn chart identification method based on key point prediction
Technical Field
The invention relates to the technical field of hand-drawn chart recognition, in particular to a hand-drawn chart recognition method based on key point prediction.
Background
Hand-drawn charts are a simple, efficient and convenient graphical expression of human thinking and intention, and extend verbal and written communication from a graphical perspective. Generally speaking, one of the ultimate purposes of a hand-drawn chart is to save the manuscript of the chart, or even to identify the primitives in the chart by some method, understand the information it conveys, and then reason about, classify, reconstruct or archive it. Hand-drawn chart recognition refers to completing this process, including primitive recognition and chart understanding, by computer. Hand-drawn chart recognition is the basis of many digital processing tasks on hand-drawn charts, such as hand-drawn chart classification, reconstruction and digitization. Therefore, hand-drawn chart recognition is a key technology for providing powerful support to this graphical expression mode in practical applications.
However, for hand-drawn charts, the identification and analysis of many chart types, such as flow charts, finite state machines, circuit diagrams, chemical molecular structure diagrams and musical scores, remain challenging because of the complex two-dimensional structure and morphological variability of their graphical elements.
Disclosure of Invention
It is an object of the present invention to provide a hand-drawn chart identification method based on key point prediction that overcomes at least some of the disadvantages of the prior art.
The method for identifying the hand-drawn chart based on the key point prediction comprises the following steps:
1. detecting each primitive in the diagram as a pair of key points, namely determining an upper left corner point and a lower right corner point of a boundary box of the primitive together;
2. performing maximal cross corner pooling (MICP) and cumulative cross corner pooling (CICP) in parallel;
3. performing feature fusion on the respective pooled feature maps;
4. for each arrow connecting two primitives, predicting a head key point and a tail key point of the arrow by using an arrow direction prediction network, and enhancing the arrow key point information by using a snowflake corner pooling SCP module; thus, the identification of the entire chart structure is completed.
Preferably, in cumulative cross corner pooling (CICP), to determine whether an activation value belongs to a top-left corner, CICP searches horizontally to the right for the uppermost boundary of the target and vertically downward for the leftmost boundary of the target. Let $F_t$ and $F_l$ be the input feature maps for corner pooling, and let $f_t^{ij}$ and $f_l^{ij}$ be the responses at position $(i,j)$ in $F_t$ and $F_l$, respectively. On an $H\times W$ feature map, CICP decides whether the neuron activation at $(i,j)$ is a top-left corner by accumulating in parallel all responses of $F_t$ distributed horizontally between $(i,j)$ and $(i,H)$ as $T_{ij}$, and all responses of $F_l$ distributed vertically between $(i,j)$ and $(W,j)$ as $L_{ij}$; finally the two are added to produce the pooled feature map $F_{CICP}$. The computation can be expressed as:

$$T_{ij}=\sum_{k=j}^{H} f_t^{ik},\qquad L_{ij}=\sum_{k=i}^{W} f_l^{kj}$$

$$F_{CICP}^{ij}=T_{ij}+L_{ij}$$

The CICP pooling for the bottom-right corner is analogous to that for the top-left corner, i.e., all responses distributed vertically between $(0,j)$ and $(i,j)$ and all responses distributed horizontally between $(i,0)$ and $(i,j)$ are accumulated in parallel, and the two pooled results are then added.
Preferably, the arrow direction prediction network adopts a skip connection, and the SCP module is inserted into the skip connection.
Preferably, in the snowflake corner pooling SCP module:
let F be the profile of the SCP, F ij Is the response for position (i, j) in F. Using H x W feature maps, pooled feature maps F SCP The neuron activation value response to position (i, j) in the SCP is expressed by the following formula:
Figure RE-GDA0003803893160000031
preferably, in the arrow direction prediction network, thermodynamic diagrams, offsets and semantic embedding are used to predict the head key points and the tail key points of the arrows, specifically as follows:
let P cij Is the probability of class c at position (i, j) in the predictive thermodynamic diagram, y cij Is a ground-truth thermodynamic diagram, which has C channels; then, the category Loss of the arrow key points is estimated by the Focal local Loss function:
Figure RE-GDA0003803893160000032
where N is the number of objects in the image and α is a hyper-parameter that controls the contribution of each point;
the positions of the arrow key points are slightly corrected by a prediction offset, let (x, y) be the position in the image,
Figure RE-GDA0003803893160000033
is its down-sampling position in the thermodynamic diagram, where s is the down-sampling factor; the deviation of the arrow key point k between these two positions is estimated:
Figure RE-GDA0003803893160000034
wherein the prediction deviation and the ground-truth deviation can be calculated by a Smooth-L1 Loss function:
Figure RE-GDA0003803893160000035
determining which pair of head key points and rear key points belong to the same arrow, and matching the two key points with the maximum similarity together by using semantic embedding; let e hk Semantic embedding of head key points for arrow k, e tk The semantics of the tail key points are embedded and are four-dimensional vectors; and matching the key points by using a pull loss training network, and keeping away the key points which do not belong to the same object by using the pull loss:
Figure RE-GDA0003803893160000036
Figure RE-GDA0003803893160000041
Figure RE-GDA0003803893160000042
finally, linearly combining the loss functions of all network branches as the final loss function of the whole model, wherein the loss functions with the same task property share the same combination coefficient:
Figure RE-GDA0003803893160000043
here, the
Figure RE-GDA0003803893160000044
And
Figure RE-GDA0003803893160000045
is a loss function for the arrow key point prediction task; and L is det 、 L push 、L pull And L off Is a loss function for the corner key point prediction task; where α, β, γ, and λ are coefficient weights of the sub-loss functions.
Preferably, if the chart is a nested chart, each primitive is associated with multiple positions on multiple levels of the feature pyramid, so that each primitive is predicted by one or more positions, specifically:

For an arbitrary position $l=(u,v)$ on an arbitrary feature level $P_k$ with down-sampling stride $s_k$, the following formula

$$l'=\Big(\Big\lfloor\tfrac{s_k}{2}\Big\rfloor+u\,s_k,\ \Big\lfloor\tfrac{s_k}{2}\Big\rfloor+v\,s_k\Big)$$

maps it back to the input image; $l'$ is called the mapping position of $l$, and it is close to the center of the receptive field of $l$. For any two positions $l_k$ and $l_j$ from any two feature levels $P_k$ and $P_j$, the mapping positions $l'_k$ and $l'_j$ never coincide; thus each mapping position is globally unique in the input image. Secondly, for a ground-truth box $b^{(i)}=(x_0^{(i)},y_0^{(i)},x_1^{(i)},y_1^{(i)})$, if there exists a position $l$ from the feature pyramid such that $l'=(x',y')$ and $b^{(i)}$ satisfy

$$0<\xi(l',b^{(i)})\qquad(30)$$

where

$$\xi(l',b^{(i)})=\min\big(x'-x_0^{(i)},\ y'-y_0^{(i)},\ x_1^{(i)}-x',\ y_1^{(i)}-y'\big)$$

then $l$ is a candidate prediction position of the target $b^{(i)}$; obviously, $b^{(i)}$ can have multiple such positions, all of which constitute its candidate position set $\Theta^{(i)}$. Furthermore, if $l\in\Theta^{(i)}$ is such that $l'$ and $b^{(i)}$ satisfy

$$\gamma_{k-1}<\delta(l',b^{(i)})\le\gamma_k\qquad(32)$$

where

$$\delta(l',b^{(i)})=\max\big(x'-x_0^{(i)},\ y'-y_0^{(i)},\ x_1^{(i)}-x',\ y_1^{(i)}-y'\big)$$

then $l$ is a true prediction position of $b^{(i)}$ on feature level $P_k$, where $\gamma_k$ ($k=1,\dots,5$) is a hyper-parameter. All positions on $P_k$ that predict $b^{(i)}$ constitute the position set $\Omega_k^{(i)}\subseteq\Theta^{(i)}$, and $\Omega^{(i)}=\bigcup_k\Omega_k^{(i)}$ reflects how much prediction capacity $b^{(i)}$ is allotted over the whole feature pyramid.

In addition, for each $b^{(i)}$, the feature levels responsible for predicting $b^{(i)}$ are determined by the supremum and infimum of the discriminant function $\delta(l',b^{(i)})$, which reflect the regression range of $b^{(i)}$ and are computed as:

$$\sup_{l\in\Theta^{(i)}}\delta(l',b^{(i)})=\max\big(x_1^{(i)}-x_0^{(i)},\ y_1^{(i)}-y_0^{(i)}\big)$$

$$\inf_{l\in\Theta^{(i)}}\delta(l',b^{(i)})=\tfrac{1}{2}\max\big(x_1^{(i)}-x_0^{(i)},\ y_1^{(i)}-y_0^{(i)}\big)$$

Therefore, if the thresholds satisfy $\gamma_k\ge 2\gamma_{k-1}$, then the regression range of any primitive overlaps at most two adjacent intervals $(\gamma_{k-1},\gamma_k]$; that is, any primitive in the flow chart can be predicted on at most two adjacent feature levels. For primitive detection in the flow chart, this property effectively avoids the target accumulation caused by assigning a nesting primitive and its nested primitives to the same feature level at the same time.

For any two targets $b^{(i)}$ and $b^{(j)}$, whether $b^{(i)}$ nests $b^{(j)}$ is judged by whether

$$\eta(b^{(i)},b^{(j)})=\frac{\mathrm{Area}\big(b^{(i)}\cap b^{(j)}\big)}{\mathrm{Area}\big(b^{(j)}\big)}$$

equals 1. Now let $b^{(t)}$ be a nesting primitive; all of its nested primitives form its nested set $N^{(t)}=\{\,b^{(i)}\mid\eta(b^{(t)},b^{(i)})=1\,\}$. The purpose of multi-level nested primitive detection is to properly associate $b^{(t)}$ and all nested primitives in $N^{(t)}$ with positions on the feature pyramid while maintaining a high recall for each primitive.
For the identification of ordinary charts, the identification method provided by the invention identifies each primitive in the chart by key point prediction and determines the connection relation between arrows and primitives by predicting the head and tail key points of each arrow, thereby identifying the structure of the whole chart. Key point prediction is a distinctive object detection paradigm: objects are classified and localized by accurately predicting their key points in the image. In order to further strengthen the semantic information of object key points in the feature map, the invention provides two key point pooling modules, which embed the key point information represented by the geometric outline features of the primitives into the feature map as prior knowledge by means of pooling, effectively enhancing the key point information in the feature map. Finally, the invention also provides an arrow direction prediction branch network for the chart recognition task, which predicts the head and tail key point information of each arrow, so that the connection relation between arrows and other primitives can be deduced from the arrow key points, making chart-level understanding possible.
Drawings
FIG. 1 is a flow chart of a method for identifying a hand-drawn chart based on keypoint prediction in an embodiment;
FIG. 2 is a schematic diagram of the overall architecture of DrawnNet in an embodiment;
FIG. 3 is a graph showing the results of pooling the same feature map with CICP and MICP in the example;
FIG. 4 is a schematic diagram of the network structure of the top-left corner point prediction branch in the embodiment;
FIG. 5 is a schematic diagram of the arrow direction prediction network branch in the embodiment;
FIG. 6 is a schematic diagram of a snowflake corner pooling module in the embodiment;
FIG. 7 is a diagram showing a nested diagram in the embodiment;
FIG. 8 is a diagram illustrating nested primitive recognition using a feature pyramid in an embodiment;
FIG. 9 is a diagram illustrating the identification of DrawnNet on three data sets in one embodiment;
FIG. 10 is a sample diagram of some defects in the FC-A data set in the example;
FIG. 11 is a diagram showing the variation of mAP with the number of iterations when different λ are adopted in the embodiment.
Detailed Description
For a further understanding of the invention, reference should be made to the following detailed description taken in conjunction with the accompanying drawings and examples. It is to be understood that the examples are illustrative of the invention and not limiting.
Examples
As shown in fig. 1, the present embodiment provides a method for recognizing a hand-drawn chart based on keypoint prediction, which includes the following steps:
1. detecting each primitive in the diagram as a pair of key points, namely determining an upper left corner point and a lower right corner point of a boundary box of the primitive together;
2. performing maximal cross corner pooling (MICP) and cumulative cross corner pooling (CICP) in parallel;
3. performing feature fusion on the respective pooled feature maps;
4. for each arrow connecting two primitives, predicting a head key point and a tail key point of the arrow by using an arrow direction prediction network, and enhancing the arrow key point information by using a snowflake corner pooling SCP module; thus, the identification of the entire chart structure is completed.
DrawnNet model
This example presents DrawnNet, a keypoint-based detection model built on recent convolutional neural network techniques. DrawnNet is designed on the basis of CornerNet, a model that does not itself understand the structure of a diagram. New modules are introduced into DrawnNet to extend CornerNet's detection capability; these modules exploit the prior knowledge present in charts so that the improved architecture adapts to the hand-drawn chart recognition task. Specifically, this embodiment proposes two novel keypoint pooling modules for explicitly embedding prior knowledge, such as the geometric features present in the chart, into the feature map, and then fusing these features into the keypoint prediction. In addition, in order to understand the structure of the chart, an arrow direction prediction branch is proposed, which predicts the direction of an arrow by predicting its head and tail key points.
In DrawnNet, this embodiment detects each primitive in the chart as a pair of key points, i.e., the upper left and lower right corner points of the primitive bounding box are determined together. Furthermore, for each arrow connecting two primitives, DrawnNet has a branch network called arrow direction prediction that predicts the head and tail key points of the arrow; this pair of key points determines the direction of the arrow, so the structure of the chart can be completely understood through this branch. Fig. 2 shows the overall architecture of DrawnNet. DrawnNet uses the Hourglass model as its backbone. Hourglass is a typical encoder-decoder structure and is widely applied to key point detection and pixel-level prediction tasks such as pose estimation and image segmentation.
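To make the overall wiring concrete, the following sketch shows one possible skeleton of such a model: a backbone feeding three parallel keypoint branches (top-left corners, bottom-right corners, arrow head/tail). All class names, channel sizes, the number of primitive classes and the reduced stand-in backbone are assumptions for illustration; this is not the patented implementation itself.

```python
# A minimal structural sketch of a DrawnNet-style pipeline. The backbone is a
# small stand-in for the Hourglass encoder-decoder mentioned in the text.
import torch
import torch.nn as nn

class TinyBackbone(nn.Module):
    """Stand-in for the Hourglass backbone: just a few conv layers."""
    def __init__(self, channels=128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, channels, 7, stride=4, padding=3),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.body(x)

class KeypointHead(nn.Module):
    """Predicts per-class heatmaps, embeddings and offsets for one keypoint type."""
    def __init__(self, channels=128, num_classes=7, embed_dim=4):
        super().__init__()
        self.heat = nn.Conv2d(channels, num_classes, 1)
        self.embed = nn.Conv2d(channels, embed_dim, 1)
        self.offset = nn.Conv2d(channels, 2, 1)
    def forward(self, f):
        return self.heat(f).sigmoid(), self.embed(f), self.offset(f)

class DrawnNetSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = TinyBackbone()
        self.tl_head = KeypointHead()                   # top-left corner branch
        self.br_head = KeypointHead()                   # bottom-right corner branch
        self.arrow_head = KeypointHead(num_classes=2)   # arrow head/tail keypoints
    def forward(self, image):
        f = self.backbone(image)
        return {"top_left": self.tl_head(f),
                "bottom_right": self.br_head(f),
                "arrow": self.arrow_head(f)}

if __name__ == "__main__":
    out = DrawnNetSketch()(torch.randn(1, 3, 256, 256))
    print({k: [t.shape for t in v] for k, v in out.items()})
```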
Corner key point prediction
In DrawnNet, a graphical primitive is represented as a pair of key points, namely an upper left corner point and a lower right corner point. However, the feature map typically lacks distinct local visual cues indicating the potential locations where key corners may appear. To locate potential key corners, CornerNet proposes a pooling method referred to here as maximal cross corner pooling (MICP). The MICP module starts from one pixel, looks for the maximum neuron activation along the horizontal and the vertical direction respectively, and then adds the two maxima as the pooling result at that point. In charts, most primitives have rectangular outlines, whose corner points appear clearly where several boundary lines intersect. Therefore, in DrawnNet, this embodiment extends the original corner pooling module by introducing another pooling reduction method, and embeds the geometric features present in the chart as explicit prior knowledge into the prediction of key corners by pooling.
In DrawnNet, the pooling process of the proposed corner pooling module traverses each pixel in both the horizontal and vertical directions; if a neuron responds most strongly in its neighborhood of the pooled feature map, its location may be a potential key corner, lying at the intersection of the horizontal and vertical pooling vectors. This embodiment refers to such corner pooling as cross corner pooling (ICP). CornerNet uses max as the reduction function of the ICP module to compute the final response, called maximal cross corner pooling (MICP); in DrawnNet, however, this embodiment uses sum to accumulate all responses vertically and horizontally, referred to as cumulative cross corner pooling (CICP).
As shown in fig. 3, MICP and CICP are used to pool the same feature map, respectively. In fig. 3, MICP and CICP are expected to capture the upper left corner of a rectangular pattern consisting of responses with neuron activation 1; this rectangular visual pattern is undoubtedly a basic visual pattern that appears frequently in primitive objects in charts. Figure 3 (b) shows that MICP fails to effectively capture the upper left corner (circled with a solid line): in the feature map of the MICP pooling result, the other neurons in the neighborhood of the corner point respond nearly as strongly as the corner neuron itself, so the visual feature of the key point is not effectively highlighted in the pooled feature map. In contrast, the CICP module in fig. 3 (a) handles this situation by maximizing the response of the corresponding neuron relative to its neighborhood (e.g., 3 × 3 and 5 × 5 regions).
As described above, to determine whether an activation value belongs to a top-left corner, CICP searches horizontally to the right for the uppermost boundary of the target and vertically downward for the leftmost boundary of the target. Let $F_t$ and $F_l$ be the input feature maps for corner pooling, and let $f_t^{ij}$ and $f_l^{ij}$ be the responses at position $(i,j)$ in $F_t$ and $F_l$, respectively. On an $H\times W$ feature map, CICP decides whether the neuron activation at $(i,j)$ is a top-left corner by accumulating in parallel all responses of $F_t$ distributed horizontally between $(i,j)$ and $(i,H)$ as $T_{ij}$, and all responses of $F_l$ distributed vertically between $(i,j)$ and $(W,j)$ as $L_{ij}$; finally the two are added to produce the pooled feature map $F_{CICP}$. The computation can be expressed as:

$$T_{ij}=\sum_{k=j}^{H} f_t^{ik},\qquad L_{ij}=\sum_{k=i}^{W} f_l^{kj}$$

$$F_{CICP}^{ij}=T_{ij}+L_{ij}$$

The CICP pooling for the bottom-right corner is analogous to that for the top-left corner, i.e., all responses distributed vertically between $(0,j)$ and $(i,j)$ and all responses distributed horizontally between $(i,0)$ and $(i,j)$ are accumulated in parallel, and the two pooled results are then added.
Geometric feature fusion
The network structure of the top-left corner point prediction branch is shown in fig. 4. The improvements of this embodiment over CornerNet include extending the corner pooling module in the corner point prediction branch of DrawnNet and proposing a method for fusing the feature maps pooled by multiple pooling modules. Feature fusion is typically performed to exploit refined feature information from different channel or spatial feature maps, and many vision models are now equipped with feature fusion modules; for example, the feature pyramid network fuses features of multi-scale objects by connecting a pyramid of down-sampled convolutional features.
The feature fusion method employed in DrawnNet is based on pooling: the proposed keypoint pooling methods are first applied to the per-branch feature maps to fully utilize the geometric information in the image, and adaptive feature refinement is then performed by multiplying, adding or concatenating the feature maps pooled by the different pooling methods. As described above, this embodiment proposes the CICP module as a supplement to the MICP module, in particular to enrich the semantic features of corner information in the feature map. Therefore, in order for the subsequent detection task to effectively exploit the corner semantics encoded in the feature map, this embodiment follows the design of the residual network ResNet: first, two parallel 128-channel 3 × 3 Conv-BN-ReLU layers replace the original 3 × 3 convolution module in the residual module, building the whole corner pooling module that processes the features from the backbone network; MICP and CICP are then applied in parallel to the two resulting feature maps (e.g., $F_t$ and $F_l$ for the upper left corner), one pooled vertically and the other horizontally, and their respective pooled maps are added to obtain $F_{CICP}$ and $F_{MICP}$.
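A compact sketch of this pooling-plus-fusion idea is given below, using additive fusion (one of the options mentioned) and the 128-channel figure from the text; the helper definitions, residual wiring and everything else are illustrative assumptions rather than the exact module of the patent.

```python
# Sketch of a corner pooling + fusion block: two parallel Conv-BN-ReLU branches
# produce F_t and F_l, MICP and CICP pool them in parallel, and the two pooled
# maps are fused by addition before a residual connection.
import torch
import torch.nn as nn

def conv_bn_relu(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1),
                         nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

def cicp_top_left(f_t, f_l):
    # cumulative cross corner pooling (sum reduction)
    t = torch.flip(torch.cumsum(torch.flip(f_t, dims=[2]), dim=2), dims=[2])
    l = torch.flip(torch.cumsum(torch.flip(f_l, dims=[3]), dim=3), dims=[3])
    return t + l

def micp_top_left(f_t, f_l):
    # maximal cross corner pooling (running max instead of running sum)
    t = torch.flip(torch.cummax(torch.flip(f_t, dims=[2]), dim=2).values, dims=[2])
    l = torch.flip(torch.cummax(torch.flip(f_l, dims=[3]), dim=3).values, dims=[3])
    return t + l

class CornerPoolFusion(nn.Module):
    def __init__(self, channels=128):
        super().__init__()
        self.branch_t = conv_bn_relu(channels, channels)  # produces F_t
        self.branch_l = conv_bn_relu(channels, channels)  # produces F_l
        self.out = conv_bn_relu(channels, channels)
    def forward(self, x):
        f_t, f_l = self.branch_t(x), self.branch_l(x)
        fused = cicp_top_left(f_t, f_l) + micp_top_left(f_t, f_l)  # additive fusion
        return self.out(fused) + x   # residual connection

if __name__ == "__main__":
    m = CornerPoolFusion()
    print(m(torch.randn(1, 128, 64, 64)).shape)
```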
Arrow direction prediction
In an arrow-connected diagram like a flow chart, structure identification involves indicating which primitives are connected by each arrow and which direction each arrow points to. Although the object detection model may classify and locate the primitives of the diagram through bounding boxes, this information is not sufficient for diagram structure recognition. The present embodiment finds that this problem can be effectively solved by predicting the arrow key point information. The plane vector formed by the head key point and the tail key point of the arrow indicates the direction of the arrow, and the primitives connected with the arrow can be predicted according to the position relation between each arrow key point and the surrounding primitives. In order to predict arrow key points, the embodiment adds parallel arrow direction prediction network branches on a DrawnNet backbone model. Fig. 5 illustrates a network. The arrow network reuses the feature maps from the backbone network and uses the SCP module to enhance the arrow key point information.
The outputs of the arrow direction prediction branch are similar in function to those of the corner point prediction branch, comprising heatmap generation, semantic embedding and position offset prediction. The arrow direction prediction network adopts a skip connection similar to that of a residual network ResNet module, and an SCP module is inserted into the skip connection, which effectively enhances the semantic features of the key points in the feature map.
Snowflake corner pooling
The direction of an arrow is determined by locating its head and tail, which DrawnNet naturally handles as a key point detection task. To this end, this embodiment proposes a snowflake corner pooling (SCP) module to capture richer, more recognizable visual arrow patterns. Fig. 6 illustrates the principle of SCP.
Let $F$ be the input feature map to the SCP module and $f_{ij}$ the response at position $(i,j)$ in $F$. On an $H\times W$ feature map, the neuron activation response at position $(i,j)$ of the pooled feature map $F_{SCP}$ is obtained by accumulating the responses of $F$ along the straight directions passing through $(i,j)$, in the snowflake-shaped pattern illustrated in fig. 6.
arrow keypoint prediction
In the arrow direction prediction network, heatmaps, offsets and semantic embeddings are used to predict the head and tail key points of each arrow, specifically as follows:

Let $p_{cij}$ be the predicted probability of class $c$ at position $(i,j)$ in the predicted heatmap, and $y_{cij}$ the corresponding value in the ground-truth heatmap, which has $C$ channels. The category loss of the arrow key points is estimated with the focal loss:

$$L_{det}^{a}=-\frac{1}{N}\sum_{c=1}^{C}\sum_{i=1}^{H}\sum_{j=1}^{W}\begin{cases}\big(1-p_{cij}\big)^{\alpha}\log\big(p_{cij}\big) & \text{if } y_{cij}=1\\ \big(1-y_{cij}\big)^{\beta}\,p_{cij}^{\alpha}\log\big(1-p_{cij}\big) & \text{otherwise}\end{cases}$$

where $N$ is the number of objects in the image and $\alpha$ is a hyper-parameter that controls the contribution of each point.

The positions of the arrow key points are slightly corrected by a predicted offset. Let $(x,y)$ be a position in the image and $(\lfloor x/s\rfloor,\lfloor y/s\rfloor)$ its down-sampled position in the heatmap, where $s$ is the down-sampling factor. The offset of arrow key point $k$ between these two positions is estimated as

$$o_k=\Big(\frac{x_k}{s}-\Big\lfloor\frac{x_k}{s}\Big\rfloor,\ \frac{y_k}{s}-\Big\lfloor\frac{y_k}{s}\Big\rfloor\Big)$$

and the discrepancy between the predicted offset $\hat{o}_k$ and the ground-truth offset is measured with the Smooth-L1 loss:

$$L_{off}^{a}=\frac{1}{N}\sum_{k=1}^{N}\mathrm{SmoothL1}\big(o_k,\hat{o}_k\big)$$

A chart may contain multiple arrows, so the head and tail key points of several arrows are predicted at once; it is therefore necessary to determine which head key point and which tail key point belong to the same arrow. As in CornerNet, this embodiment uses semantic embeddings for this purpose and matches the two key points with the greatest similarity. Let $e_{hk}$ be the semantic embedding of the head key point of arrow $k$ and $e_{tk}$ that of its tail key point, both four-dimensional vectors, and let $e_k$ denote their mean. The network is trained with a pull loss that groups key points of the same arrow and a push loss that keeps apart key points that do not belong to the same object:

$$e_k=\frac{e_{hk}+e_{tk}}{2}$$

$$L_{pull}^{a}=\frac{1}{N}\sum_{k=1}^{N}\Big[\big(e_{hk}-e_k\big)^2+\big(e_{tk}-e_k\big)^2\Big]$$

$$L_{push}^{a}=\frac{1}{N(N-1)}\sum_{k=1}^{N}\sum_{\substack{j=1\\ j\neq k}}^{N}\max\big(0,\ \Delta-\lvert e_k-e_j\rvert\big)$$

Finally, the loss functions of all network branches are linearly combined as the final loss of the whole model, with losses of the same task nature sharing the same combination coefficient:

$$L=L_{det}+\alpha L_{pull}+\beta L_{push}+\gamma L_{off}+\lambda\big(L_{det}^{a}+\alpha L_{pull}^{a}+\beta L_{push}^{a}+\gamma L_{off}^{a}\big)$$

where $L_{det}^{a}$, $L_{pull}^{a}$, $L_{push}^{a}$ and $L_{off}^{a}$ are the loss functions of the arrow key point prediction task, $L_{det}$, $L_{push}$, $L_{pull}$ and $L_{off}$ are the loss functions of the corner key point prediction task, and $\alpha$, $\beta$, $\gamma$ and $\lambda$ are the coefficient weights of the sub-losses.
Nested graph recognition
In addition to the identification of ordinary charts discussed above, there is a special kind of chart, namely the nested chart. This type of chart is primarily the program flow chart, in which nested structures play an important role and usually represent program logic such as loops and selections. In the identification of a whole nested chart, the most troublesome point is the identification of the nesting primitives in it. A nesting primitive in a nested chart is a primitive that contains one or more other primitives, which are called its nested primitives; it reflects complex semantic logic in a nested manner. From the perspective of the digital image, a nesting primitive appears as multiple primitives "stacked" together, with the inner layers being nested primitives and the outer layer being the nesting primitive, as shown in fig. 7. No matter how the number, types and logical structure of the inner nested primitives change, the type of the outer nesting primitive remains unchanged, so the presence of inner primitives often makes it much harder for the model to identify the outer nesting primitive. The model therefore needs strong generalization ability to ensure that the nesting primitive is not disturbed by changes of its inner primitives during recognition, and a data enhancement algorithm is provided for the training stage to improve the generalization of the model.
The feature pyramid acts on the decoder part of the backbone Hourglass model, and the feature maps of its top three layers are used as the input of the feature pyramid. Let $P_k$ be the feature level of the $k$-th layer of the feature pyramid (as shown in fig. 8), with down-sampling stride $s_k$ to that layer. The ground-truth bounding boxes in the input nested chart are defined as $B=\{b^{(1)},b^{(2)},\dots\}$, where $(x_0^{(i)},y_0^{(i)})$ and $(x_1^{(i)},y_1^{(i)})$ are respectively the top-left and bottom-right corner coordinates of $b^{(i)}$.
In the method adopted in this embodiment, each primitive is associated with multiple positions on multiple levels of the feature pyramid, so that each primitive is predicted by one or more positions, specifically:

For an arbitrary position $l=(u,v)$ on an arbitrary feature level $P_k$, the following formula

$$l'=\Big(\Big\lfloor\tfrac{s_k}{2}\Big\rfloor+u\,s_k,\ \Big\lfloor\tfrac{s_k}{2}\Big\rfloor+v\,s_k\Big)$$

maps it back to the input image; $l'$ is called the mapping position of $l$, and it is close to the center of the receptive field of $l$. For any two positions $l_k$ and $l_j$ from any two feature levels $P_k$ and $P_j$, the mapping positions $l'_k$ and $l'_j$ never coincide; thus each mapping position is globally unique in the input image. Secondly, for a ground-truth box $b^{(i)}=(x_0^{(i)},y_0^{(i)},x_1^{(i)},y_1^{(i)})$, if there exists a position $l$ from the feature pyramid such that $l'=(x',y')$ and $b^{(i)}$ satisfy

$$0<\xi(l',b^{(i)})\qquad(48)$$

where

$$\xi(l',b^{(i)})=\min\big(x'-x_0^{(i)},\ y'-y_0^{(i)},\ x_1^{(i)}-x',\ y_1^{(i)}-y'\big)$$

then $l$ is a candidate prediction position of the target $b^{(i)}$; obviously, $b^{(i)}$ can have multiple such positions, all of which constitute its candidate position set $\Theta^{(i)}$. Furthermore, if $l\in\Theta^{(i)}$ is such that $l'$ and $b^{(i)}$ satisfy

$$\gamma_{k-1}<\delta(l',b^{(i)})\le\gamma_k\qquad(50)$$

where

$$\delta(l',b^{(i)})=\max\big(x'-x_0^{(i)},\ y'-y_0^{(i)},\ x_1^{(i)}-x',\ y_1^{(i)}-y'\big)$$

then $l$ is a true prediction position of $b^{(i)}$ on feature level $P_k$, where $\gamma_k$ ($k=1,\dots,5$) is a hyper-parameter. All positions on $P_k$ that predict $b^{(i)}$ constitute the position set $\Omega_k^{(i)}\subseteq\Theta^{(i)}$, and $\Omega^{(i)}=\bigcup_k\Omega_k^{(i)}$ reflects how much prediction capacity $b^{(i)}$ is allotted over the whole feature pyramid.

In addition, for each $b^{(i)}$, the feature levels responsible for predicting $b^{(i)}$ are determined by the supremum and infimum of the discriminant function $\delta(l',b^{(i)})$, which reflect the regression range of $b^{(i)}$ and are computed as:

$$\sup_{l\in\Theta^{(i)}}\delta(l',b^{(i)})=\max\big(x_1^{(i)}-x_0^{(i)},\ y_1^{(i)}-y_0^{(i)}\big)$$

$$\inf_{l\in\Theta^{(i)}}\delta(l',b^{(i)})=\tfrac{1}{2}\max\big(x_1^{(i)}-x_0^{(i)},\ y_1^{(i)}-y_0^{(i)}\big)$$

Therefore, if the thresholds satisfy $\gamma_k\ge 2\gamma_{k-1}$, then the regression range of any primitive overlaps at most two adjacent intervals $(\gamma_{k-1},\gamma_k]$; that is, any primitive in the flow chart can be predicted on at most two adjacent feature levels. For primitive detection in the flow chart, this property effectively avoids the target accumulation caused by assigning a nesting primitive and its nested primitives to the same feature level at the same time.

For any two targets $b^{(i)}$ and $b^{(j)}$, whether $b^{(i)}$ nests $b^{(j)}$ is judged by whether

$$\eta(b^{(i)},b^{(j)})=\frac{\mathrm{Area}\big(b^{(i)}\cap b^{(j)}\big)}{\mathrm{Area}\big(b^{(j)}\big)}$$

equals 1. Now let $b^{(t)}$ be a nesting primitive; all of its nested primitives form its nested set $N^{(t)}=\{\,b^{(i)}\mid\eta(b^{(t)},b^{(i)})=1\,\}$. The purpose of multi-level nested primitive detection is to properly associate $b^{(t)}$ and all nested primitives in $N^{(t)}$ with positions on the feature pyramid while maintaining a high recall for each primitive. Therefore, for $b^{(i)}\in N^{(t)}$, there are two possible scenarios between $\Omega^{(i)}$ and $\Omega^{(t)}$, as follows:

1) If $\Omega^{(i)}\cap\Omega^{(t)}=\varnothing$, then the positions predicting $b^{(i)}$ and $b^{(t)}$ come from two different feature levels, so both targets can be effectively predicted as positive samples and thus learned by the model during training.

2) If $\Omega^{(i)}\cap\Omega^{(t)}\neq\varnothing$, then the positions in $\Omega^{(i)}\cap\Omega^{(t)}$ are ambiguous samples because they predict both $b^{(i)}$ and $b^{(t)}$; this happens when the two targets do not differ greatly in size. In this case DrawnNet specifies that these positions predict the primitive of the smallest size, here $b^{(i)}$. Thus, the positive-sample set of $b^{(i)}$ is $\Omega^{(i)}$, and that of $b^{(t)}$ is $\Omega^{(t)}\setminus\Omega^{(i)}$. In addition, $|\Omega^{(i)}\cap\Omega^{(t)}|/|\Omega^{(t)}|$ may be very close to 1, leaving $b^{(t)}$ with a small number of positive samples $|\Omega^{(t)}\setminus\Omega^{(i)}|$, which would seriously affect the recall of $b^{(t)}$. Fortunately, in the nested chart data set of this embodiment there is a significant size difference between nesting primitives and their nested primitives, so this is unlikely to happen.
Thus, for both nesting and nested primitives, DrawnNet assigns the positions that predict them to the appropriate feature level according to their own size. Because the available prediction positions in the feature pyramid are very rich and each primitive can be predicted at multiple positions, DrawnNet achieves a high recall for multi-level target detection in the flow chart. In contrast, other non-dense prediction models such as YOLOv3 cannot handle such target clustering, resulting in a very low detection recall on dense images.
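Assuming the min/max side-distance reading of ξ and δ used above, the assignment of a ground-truth box to feature-pyramid positions can be sketched as follows; the strides and γ thresholds are illustrative values only, not the patent's settings.

```python
# Sketch of assigning a ground-truth box to feature-pyramid positions: xi
# (min distance to the four sides) is the inside-the-box test, and delta
# (max distance) selects the feature level via the gamma thresholds.
import math

STRIDES = {3: 8, 4: 16, 5: 32}                           # feature levels P3..P5
GAMMAS = {3: (0, 64), 4: (64, 128), 5: (128, math.inf)}  # (gamma_{k-1}, gamma_k]

def xi(px, py, box):
    x0, y0, x1, y1 = box
    return min(px - x0, py - y0, x1 - px, y1 - py)

def delta(px, py, box):
    x0, y0, x1, y1 = box
    return max(px - x0, py - y0, x1 - px, y1 - py)

def assign_positions(box, image_w, image_h):
    """Return {level: [(u, v), ...]} of positions responsible for `box`."""
    out = {k: [] for k in STRIDES}
    for k, s in STRIDES.items():
        lo, hi = GAMMAS[k]
        for v in range(image_h // s):
            for u in range(image_w // s):
                px, py = s // 2 + u * s, s // 2 + v * s   # mapping position l'
                if xi(px, py, box) > 0 and lo < delta(px, py, box) <= hi:
                    out[k].append((u, v))
    return out

if __name__ == "__main__":
    outer = (10, 10, 200, 150)   # a nesting primitive
    inner = (60, 60, 110, 100)   # one of its nested primitives
    for name, b in (("outer", outer), ("inner", inner)):
        counts = {k: len(p) for k, p in assign_positions(b, 256, 256).items()}
        print(name, counts)       # the two boxes land on different levels
```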
Nested graph-oriented data enhancement
In order to improve the diversity of the nested chart data set and identify the nested primitives in nested charts more accurately, this embodiment proposes a data enhancement algorithm (Algorithm 1). Intuitively, the algorithm discards each primitive in a nested chart with a certain probability, so that a large number of different nested charts can be derived from a single one. Specifically, the algorithm assumes that the discard decision for each primitive is governed by a random variable following a uniform distribution over the interval [0, 1]; a primitive is discarded when the random variable X ≥ λ. Therefore, from a chart with $n$ primitives, at most $\binom{n}{k}$ different charts containing only $k$ ($0\le k\le n$) primitives can be derived, for a total of $\sum_{k=0}^{n}\binom{n}{k}=2^n$ different charts. Assuming that each nested chart in the training set contains 10 primitives on average, at most 1024 different nested charts can be derived from each one. This is equivalent to expanding the training data set by a factor of 1024, which is a huge number: even if the data set contains only hundreds of training samples, hundreds of thousands of training samples can be derived after expansion by the proposed data enhancement algorithm, which is completely sufficient for training a convolutional object detection model, considering that the largest common object detection data set, COCO-2017, has no more than 120,000 training samples.
In addition, compared with other data enhancement algorithms in the image domain, the proposed algorithm has only one hyper-parameter, namely the probability threshold λ that decides whether each primitive is discarded. The setting of λ is closely related to how many primitives are retained after each enhancement of a chart: the larger λ is, the more primitives are discarded at each data enhancement, and vice versa. In the experiments of this embodiment, λ generally takes one of three values: 0.25, 0.50 and 0.75. If λ is too large (λ > 0.75), the enhanced chart appears too hollow because too many primitives are discarded, and detecting a chart with hollow content is undoubtedly a waste of computing resources; conversely, if λ is too small (λ < 0.25), too few primitives are discarded each time, which weakens the effect of data enhancement. Therefore, λ = 0.5 achieves a relatively good data enhancement effect. It can be shown theoretically that when λ = 0.5 the probability of discarding half of the primitives in each chart is the greatest, i.e., the probability of retaining half of the primitives after data enhancement is maximal.
Algorithm 1: nested-chart data enhancement by random primitive discarding
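A minimal sketch of this primitive-dropping augmentation follows. Only the keep/discard rule with threshold λ (a primitive is discarded when the uniform draw X ≥ λ, as stated above) comes from the text; the sample format is a placeholder, and re-rendering the derived chart image is out of scope here.

```python
# Sketch of the nested-chart data enhancement: each primitive is kept or
# discarded independently by comparing a uniform random draw X with the
# threshold lambda (discard when X >= lambda).
import random

def augment_nested_chart(primitives, lam=0.5, rng=random.Random(0)):
    """primitives: list of dicts like {"cls": "while", "box": (x0, y0, x1, y1)}.
    Returns a new, possibly smaller list of primitives for a derived chart."""
    kept = []
    for p in primitives:
        if rng.random() < lam:          # X < lambda -> keep; X >= lambda -> discard
            kept.append(dict(p))
    return kept

if __name__ == "__main__":
    sample = [{"cls": "begin", "box": (0, 0, 40, 20)},
              {"cls": "while", "box": (0, 30, 120, 150)},
              {"cls": "process", "box": (10, 50, 100, 80)},
              {"cls": "end", "box": (0, 160, 40, 180)}]
    for _ in range(3):
        print([p["cls"] for p in augment_nested_chart(sample)])
```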
Evaluation of experiments
This example evaluates DrawnNet on three open hand-drawn chart datasets, two of which describe flow charts (FC-A and FC-B), and one of which is a finite automaton dataset (FA).
FC-A was released in 2011 as a baseline database for hand-drawn flow charts; its 419 drawings were drawn by 35 authors from 28 pre-defined templates (248 for the training set and 171 for the test set). Its greatest deficiency is the lack of annotations about chart structure and temporal information; only individual primitive annotations are provided. Therefore, the quality of these charts is low, and it is difficult to evaluate online identification methods on them.
FC-B was released in 2016 as a complement to FC-A and contains 672 charts (280 in the training set, 196 in the test set, 196 in the validation set), derived from 28 pattern templates drawn by 24 authors. Some templates reference FC-A and retain the usual algorithmic functionality. In addition to chart structure labels, its labels also include the direction in which each arrow points.
FA and FC-B were published simultaneously, with a total of 300 charts (132 in training set, 84 in test set and 84 in validation set), generated from 12 templates drawn by 25 workers. There are four categories of data sets: state (single circle), final state (two concentric circles), test and arrow. The arrows are generally curved. As with FC-B, its label also contains the pointing direction of the arrow.
In addition, in order to evaluate the recognition effect of the DrawnNet model of this embodiment on nested charts, the authors constructed a nested chart data set of the program flow chart type. The authors' research group and a software company jointly developed a piece of software for drawing and labelling nested charts. This embodiment studied in advance the grammar of current mainstream programming languages including C, C++ and Java, as well as languages widely used in the last century such as Fortran, and summarized from these languages 9 basic program structures: begin, end, input/output, if-else, assign, process, while, for and do-while. These 9 predefined basic program structures serve as the basic primitives of the nested chart data set, i.e., only these 9 categories of primitives appear in the data set. Among them, if-else, while, for and do-while are the 4 nestable primitives, i.e., any of the 9 primitives can be nested as a child primitive inside these 4; while begin, end, input/output, assign and process are basic primitives, i.e., they cannot contain other primitives and can only be nested inside a nestable primitive.
In addition to defining the basic program structures, in order to construct a high-quality nested chart data set in which the program algorithms described by the charts effectively reflect the business logic of existing programs in the software engineering field, the authors of this embodiment carried out sample surveys of code segments from open source repositories such as GitHub, and computed three indexes characterizing code complexity: the number N of basic program structures, the nesting depth D, and the category distribution C of the basic program structures. Referring to the sampling results of these three indexes, this embodiment formulated corresponding quantitative standards as the overall distribution of the data set, to guide the drawing of the charts.
Finally, this embodiment constructed a nested chart data set of 600 samples in total. Each sample consists of two parts, namely the flowchart image and the label information of each primitive in the flowchart, expressed as a pair <image, label>, where label represents the class and position of each primitive. The nested chart data set is then split 7:3 into a training data set and a test data set.
Precision and Recall are used to evaluate the performance of the proposed method on the two tasks of primitive recognition and chart recognition. In addition, for the identification of nested charts, in order to measure the accuracy of the model in object detection, this embodiment uses two further indexes, AP and mAP, following general object detection practice, to evaluate the accuracy of the model on nested chart recognition.
Precision: the proportion of the model's predictions that are correct, i.e., the number of correctly predicted results compared to all predicted results.
Recall: the proportion of the ground-truth results that the model predicts correctly, i.e., the number of correctly predicted results compared to all results to be predicted.
AP: AP refers to the "Average Precision" first introduced in VOC 2007, defined as the average detection precision over different recall rates; it is usually evaluated per category.
mAP: mAP is the average of AP over all classes, and is widely used to compare performance across all object classes in general object detection evaluations.
Table 1 compares the performance of DrawnNet with other online and offline chart recognition systems. Online recognition systems carry out chart recognition by analysing and modelling temporal stroke information, relying heavily on primitive segmentation and primitive classification. These low-level representations and local understanding are sensitive to the reference data set: if the labelling quality of the data set is low or the labelling information is not accurate enough, recognition accuracy can be negatively affected. In contrast, recognition systems based on deep learning object detection can be purposefully designed so that the model robustly achieves superior performance on the chart recognition task.
TABLE 1 Chart identification accuracy on individual datasets
Fig. 9 shows the performance of DrawnNet on three benchmark sets, with the head and tail keypoints of the arrows labeled separately.
This embodiment also examines how DrawnNet performs the primitive recognition task on each reference data set. Tables 2, 3 and 4 show the primitive identification results on each reference data set. In general, DrawnNet performs well on the various primitive shapes, which can be explained by the fact that primitive shapes and sizes vary much less than arrows and text.
DrawnNet has high primitive recognition recall and precision on the FC-A data set (Table 2). However, DrawnNet performs slightly worse on the arrow class than on the other classes. By examining the training set of FC-A, the authors found that in some samples the arrows are marked with circles rather than crosses, as shown in fig. 10, which may mislead the classification decision of the model, causing it to confuse them with other classes such as text and thus misleading model learning. In addition, the resolution of some circles is too small; as shown in fig. 10, after a series of down-samplings of the feature map their resolution becomes smaller and smaller, which, as mentioned above, severely hinders feature extraction by the corner pooling module of this embodiment.
TABLE 2 accuracy and recall of FC-A primitive identification across categories
Table 3 shows that DrawnNet can accurately identify the primitives in the chart. Interestingly, Arrow R-CNN gives complementary results in precision and recall for the two categories Data and Process, namely 100 and 94.9 respectively for the Data category, and almost the opposite for the Process category. Part of the reason is that, despite using FPN, the underlying network Faster R-CNN of Arrow R-CNN has no corresponding mechanism to learn fine-grained discriminative features between similar primitives of different classes: both Data and Process primitives are quadrilaterals, except that the two parallel edges of the Data primitive are slightly oblique rather than vertical as in the Process primitive.
TABLE 3 accuracy and recall of FC-B primitive identification across categories
As shown in table 4, DrawnNet fully recognizes the State and Final State shapes in the FA test set. Because the number of categories is small and the characteristics of each category are stable, they are not difficult to recognize.
TABLE 4 FA primitive identification accuracy and recall in various categories
Ablation experiment
This embodiment performs an ablation study on each reference data set to further quantify the impact of the proposed keypoint pooling methods on chart identification. Table 5 shows that combining the CICP module and the SCP module significantly improves the accuracy of chart recognition. It should be noted that although this embodiment proposes an arrow direction branch network, that branch cannot be removed in the ablation study, because its responsibility is to predict arrow key points, which is indispensable for the chart recognition task. Therefore, in the ablation study, this embodiment ablates only the SCP module inside the arrow direction branch network, rather than the entire branch network.
Clearly, the use of the SCP module effectively improves chart recognition accuracy, because its pooling helps predict the arrow key points, and whether the arrow key points are predicted correctly is crucial for the subsequent chart identification. In addition, the CICP module also helps DrawnNet perceive where the corner key points of rectangular outlines may be located, but it does not give as good results when used alone. After all, the correctness of the arrow key point prediction is crucial to the correctness of chart understanding.
TABLE 5 ablation results
Nested chart recognition experiment
The accuracy of the method is evaluated from two aspects: overall performance and the generalization brought by the data enhancement algorithm. In order to evaluate the overall performance of the method, this embodiment trains FCOS for 100 epochs in each setting and evaluates the trained models on flowchart detection. Table 6 shows the results on the test split. As can be seen from table 6, data enhancement significantly improves the accuracy of flowchart detection: the maximum mAP reaches 77.98, which is 3.65% higher than without data enhancement, and the improvement is significant across APs with different IoU thresholds.
TABLE 6 Overall Performance on primitive recognition tasks
To further investigate the effectiveness of data enhancement, this example selects different λ values and trains the model with increasing numbers of iterations on the training set (at intervals of 10), and then observes the variation of mAP on the test split. Fig. 11 shows the results. As analysed earlier, the mAP does increase with more epochs and with the adoption of the data enhancement algorithm of this embodiment. This trend corresponds to an increasing generalization capability of the model: the continuous expansion of the training data continuously diversifies its morphology, and as the epochs increase the model sees more and more different training data, which is consistent with the expectation of this embodiment. In addition, as can be seen from fig. 11, different λ values do have a certain influence on the mAP; as analysed before, λ = 0.50 gives the best data enhancement effect, and the experimental results verify this.
Therefore, with the increase of the number of iterations and the adoption of data expansion, the method of this embodiment performs remarkably well on the flowchart detection task, with the corresponding metric continuously rising. This demonstrates the strength of the data enhancement algorithm of this embodiment, which greatly helps DrawnNet achieve striking results on both tasks.
Generalization capability brought by data enhancement
In order to evaluate the overall performance of the model of this embodiment in identifying each category and multi-level nested primitives, further experiments were performed. Table 7 provides the mAP for each category. It can be seen that without data enhancement the mAP for each class ranges between 55.45 and 88.89, with the mAP of the 4 nestable primitives significantly lower than that of the other 5 basic primitives. Clearly, the improvement from data enhancement is most significant for the four nestable primitives such as if-else and while, with gains of at least 5.49. This is because the identification of nestable primitives is more challenging than that of basic primitives. Thus, the gain from data enhancement is mainly concentrated on the nestable primitives, raising the average mAP of the four nestable primitive classes to 65.01, 6.77 higher than without enhancement (as shown in Table 7).
TABLE 7 mAP on each category
Table 8 shows the mAP of nestable primitives with different nesting depths. It can be seen that the model maintains a relatively high mAP when the nesting depth is 1 or 2, while the improvement from data enhancement is most significant when the nesting depth increases to 3, 4 or even 5. When the nesting depth is 3 or 4, the mAP gain on all four nestable classes stays above 6.42, and their average mAP over the four classes increases by 9.08 and 6.87, respectively.
TABLE 8 mAP comparison of nested classes of different nesting depths under data enhancement
(Table 8 is provided as an image in the original publication.)
Thus, the data enhancement algorithm proposed in this embodiment generates a large amount of distinct training data whose internal structures vary. By learning these variants, the model can accommodate multi-level control flows, in particular the four nestable primitives shown in the figure, which contain deeply nested child primitives.
The present invention and its embodiments have been described above schematically, and the description is not limiting; what is shown in the drawings is only one embodiment of the present invention, and the actual structure is not limited thereto. Therefore, structural modes and embodiments similar to the technical solution, which a person skilled in the art may design without inventive effort in light of the above teaching and without departing from the spirit of the invention, shall fall within the protection scope of the invention.

Claims (6)

1. A hand-drawn chart recognition method based on key point prediction, characterized by comprising the following steps:
1. detecting each primitive in the chart as a pair of key points, namely an upper left corner point and a lower right corner point that jointly determine the bounding box of the primitive;
2. performing maximal cross corner pooling (MICP) and cumulative cross corner pooling (CICP) operations in parallel;
3. performing feature fusion on the respective pooled feature maps;
4. for each arrow connecting two primitives, predicting the head key point and the tail key point of the arrow with an arrow direction prediction network, and enhancing the arrow key point information with a snowflake corner pooling (SCP) module; thus, the identification of the entire chart structure is completed.
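For orientation only, the following PyTorch-style sketch wires the four claimed steps together. Every name in it (DiagramKeypointDetector, the toy backbone, the one-directional scans standing in for MICP and CICP, the channel and class counts) is an illustrative assumption, not the implementation disclosed in this patent.

import torch
import torch.nn as nn

class DiagramKeypointDetector(nn.Module):
    """Toy wiring of the four claimed steps: corner keypoints, parallel
    MICP/CICP pooling, feature fusion, arrow head/tail prediction."""

    def __init__(self, channels: int = 64, num_classes: int = 9):
        super().__init__()
        self.backbone = nn.Sequential(                           # stand-in feature extractor
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU())
        self.fuse = nn.Conv2d(2 * channels, channels, 1)         # step 3: feature fusion
        self.corner_heat = nn.Conv2d(channels, num_classes, 1)   # step 1: corner heatmaps
        self.arrow_heat = nn.Conv2d(channels, 2, 1)              # step 4: head/tail heatmaps

    def forward(self, image: torch.Tensor):
        feat = self.backbone(image)
        # Step 2: MICP and CICP run in parallel; here a single right-to-left
        # max scan and sum scan stand in for the real cross-pooling modules.
        micp = torch.cummax(feat.flip(-1), dim=-1).values.flip(-1)
        cicp = feat.flip(-1).cumsum(dim=-1).flip(-1)
        fused = self.fuse(torch.cat([micp, cicp], dim=1))
        return self.corner_heat(fused), self.arrow_heat(fused)

if __name__ == "__main__":
    corners, arrows = DiagramKeypointDetector()(torch.randn(1, 3, 128, 128))
    print(corners.shape, arrows.shape)

In the real model the pooled maps would come from dedicated MICP and CICP modules operating along both axes; the scans above only indicate where those modules sit in the data flow.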
2. The method of claim 1, wherein: in the cumulative cross corner pooling (CICP), in order to determine whether an activation value belongs to the upper left corner of a primitive, the CICP searches rightward along the horizontal direction for the uppermost boundary of the target and searches along the vertical direction for the leftmost boundary of the target. Let $F_t$ and $F_l$ be the input feature maps for corner pooling, and let $f_{t_{ij}}$ and $f_{l_{ij}}$ be the responses of $F_t$ and $F_l$ at position $(i, j)$, respectively. Using an $H \times W$ feature map, the CICP determines whether the neuron activation value at position $(i, j)$ is an upper-left corner by accumulating, in parallel, all responses in $F_t$ horizontally distributed between $(i, j)$ and $(i, H)$ as $T_{ij}$, and all responses in $F_l$ vertically distributed between $(i, j)$ and $(W, j)$ as $L_{ij}$; finally, the two are added to generate the feature map $F_{CICP}$. The calculation process can be expressed by the following formulas:

$$T_{ij} = \sum_{k=j}^{H} f_{t_{ik}}, \qquad L_{ij} = \sum_{k=i}^{W} f_{l_{kj}}, \qquad F_{CICP_{ij}} = T_{ij} + L_{ij}$$

The CICP pooling calculation for the lower right corner is similar to that for the upper left corner, i.e. all responses vertically distributed between $(0, j)$ and $(i, j)$ and all responses horizontally distributed between $(i, 0)$ and $(i, j)$ are accumulated in parallel, and the pooled results are then added.
3. The method of claim 1, wherein: the arrow direction prediction network adopts skip connections, and an SCP module is inserted in the middle of the skip connection.
4. The method of claim 1, wherein, in the snowflake corner pooling (SCP) module:
Let F be the input feature map of the SCP and $f_{ij}$ the response at position (i, j) in F. Using an H × W feature map, the neuron activation value of the pooled feature map $F_{SCP}$ at position (i, j) is expressed by the following formula:
(The SCP pooling formula is provided as an image in the original publication.)
5. The method of claim 1, wherein: in the arrow direction prediction network, heatmaps, offsets and semantic embeddings are used to predict the head key points and tail key points of an arrow, specifically comprising the following steps:
Let $p_{cij}$ be the probability of class c at position (i, j) in the predicted heatmap and $y_{cij}$ the corresponding value in the ground-truth heatmap, which has C channels; the category loss of the arrow key points is then estimated with the Focal Loss function:
(The Focal Loss formula is provided as an image in the original publication.)
where N is the number of objects in the image and α is a hyper-parameter that controls the contribution of each point;
The positions of the arrow key points are slightly corrected by the predicted offsets. Let (x, y) be a position in the image and $(\lfloor x/s \rfloor, \lfloor y/s \rfloor)$ its downsampled position in the heatmap, where s is the downsampling factor; the deviation of arrow key point k between these two positions is estimated as

$$o_k = \left(\frac{x_k}{s} - \left\lfloor\frac{x_k}{s}\right\rfloor,\ \frac{y_k}{s} - \left\lfloor\frac{y_k}{s}\right\rfloor\right)$$

and the loss between the predicted deviation and the ground-truth deviation is computed with the Smooth-L1 loss function (formula provided as an image in the original publication).
To determine which pair of head and tail key points belongs to the same arrow, semantic embeddings are used to match the two key points with the greatest similarity. Let $e_{hk}$ be the semantic embedding of the head key point of arrow k and $e_{tk}$ that of its tail key point, both embeddings being four-dimensional vectors; the network is trained with a pull loss to group the key points belonging to the same arrow and with a push loss to separate key points that do not belong to the same object:
(The pull and push loss formulas are provided as images in the original publication.)
Finally, the loss functions of all network branches are linearly combined as the final loss function of the whole model, with loss functions of the same task nature sharing the same combination coefficient; the combined loss formula and the symbols of the arrow key point prediction losses are provided as images in the original publication, while $L_{det}$, $L_{push}$, $L_{pull}$ and $L_{off}$ are the loss functions of the corner key point prediction task, and α, β, γ and λ are the coefficient weights of the sub-loss functions.
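The loss terms of this claim follow a CornerNet-style keypoint recipe (focal heatmap loss, Smooth-L1 offsets, pull/push embedding losses). The sketch below is one plausible reading under that assumption; the exact formulas appear only as images in the original, so the penalty shape, the margin, and all function names are illustrative.

import torch
import torch.nn.functional as F

def heatmap_focal_loss(pred: torch.Tensor, gt: torch.Tensor, alpha: float = 2.0) -> torch.Tensor:
    """pred, gt: (N, C, H, W); gt is 1 at key points and 0 elsewhere.
    A simplified focal loss without the Gaussian penalty reduction."""
    pos = gt.eq(1).float()
    pred = pred.clamp(1e-6, 1 - 1e-6)
    pos_loss = pos * (1 - pred) ** alpha * torch.log(pred)
    neg_loss = (1 - pos) * pred ** alpha * torch.log(1 - pred)
    return -(pos_loss + neg_loss).sum() / pos.sum().clamp(min=1.0)

def offset_loss(pred_off: torch.Tensor, gt_off: torch.Tensor) -> torch.Tensor:
    """pred_off, gt_off: (N, K, 2) sub-pixel offsets of K key points."""
    return F.smooth_l1_loss(pred_off, gt_off)

def pull_push_loss(e_head: torch.Tensor, e_tail: torch.Tensor, margin: float = 1.0):
    """e_head, e_tail: (K, D) embeddings of the head/tail key points of K arrows.
    The pull term groups the two key points of the same arrow; the push term
    separates the mean embeddings of different arrows."""
    e_mean = (e_head + e_tail) / 2.0
    pull = ((e_head - e_mean) ** 2 + (e_tail - e_mean) ** 2).sum(dim=1).mean()
    k = e_mean.shape[0]
    dist = torch.cdist(e_mean, e_mean)
    off_diag = ~torch.eye(k, dtype=torch.bool)
    push = F.relu(margin - dist[off_diag]).mean() if k > 1 else dist.new_zeros(())
    return pull, push

if __name__ == "__main__":
    heat = torch.sigmoid(torch.randn(2, 2, 64, 64))
    gt = torch.zeros_like(heat)
    gt[:, :, 10, 10] = 1.0
    print(heatmap_focal_loss(heat, gt).item())
    print(offset_loss(torch.rand(2, 5, 2), torch.rand(2, 5, 2)).item())
    print([x.item() for x in pull_push_loss(torch.randn(4, 4), torch.randn(4, 4))])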
6. The method of claim 1, wherein: if the chart is a nested chart, each primitive is associated with a plurality of positions on a plurality of levels of the feature pyramid, so that each primitive is predicted by one or more positions, specifically:
for an arbitrary feature level $P_k$ and an arbitrary position $l_k$ on it, a mapping formula (provided as an image in the original publication) maps $l_k$ back to the input image to obtain $l'_k$, which is called the mapping position and is close to the center of the receptive field of $l_k$; for any two positions $l_k$ and $l_j$ from any two feature levels $P_k$ and $P_j$, the mapping positions $l'_k$ and $l'_j$ never overlap; thus, each mapping position is globally unique in the input image;
secondly, for a target $b^{(i)}$, if there exists a position $l$ from the feature pyramid such that $l'$ and $b^{(i)}$ satisfy the condition

$$0 < \xi(l', b^{(i)}) \qquad (12)$$

where $\xi(l', b^{(i)})$ is a discriminant function (its formula is provided as an image in the original publication), then $l$ is a candidate prediction position of the target $b^{(i)}$; obviously, $b^{(i)}$ may have multiple such positions, all of which constitute the candidate position set $\Theta^{(i)}$ of $b^{(i)}$; furthermore, if there exists a position such that $l'$ and $b^{(i)}$ satisfy the condition

$$\gamma_{k-1} < \delta(l', b^{(i)}) \le \gamma_k \qquad (14)$$

where $\delta(l', b^{(i)})$ is a discriminant function (its formula is provided as an image in the original publication), then $l$ is a true prediction position of $b^{(i)}$ at the feature level $P_k$, where $\gamma_k\ (k = 1, \dots, 5)$ is a hyper-parameter; all positions predicting $b^{(i)}$ at $P_k$ constitute the position set of $b^{(i)}$ at $P_k$, and these sets taken over the whole pyramid reflect the capacity of $b^{(i)}$ on the entire feature pyramid;
in addition, the feature levels responsible for predicting $b^{(i)}$ are determined by the supremum and the infimum of the discriminant function $\delta(l', b^{(i)})$, which reflect the regression range of $b^{(i)}$; the corresponding calculation formulas are provided as images in the original publication; therefore, any primitive in the flow chart can be predicted at two adjacent feature levels; for primitive detection in the flow chart, this property effectively avoids the target accumulation caused by a nesting primitive and its nested primitives being assigned to the same feature level simultaneously;
for any two targets $b^{(i)}$ and $b^{(j)}$, whether an indicator function (provided as an image in the original publication) equals 1 determines whether $b^{(i)}$ nests $b^{(j)}$; now, let $b^{(t)}$ be a nestable primitive, and let all of its nested child primitives form its nested set $N^{(t)}$; the purpose of multi-level nested primitive detection is to properly associate $b^{(t)}$ and all nested primitives in $N^{(t)}$ with positions on the feature pyramid while maintaining a high recall rate for each primitive.
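A small sketch of the position-assignment logic of this claim, under the assumption that ξ and δ behave like the minimum and maximum distances from a mapped position to the four sides of a box (an FCOS-like reading; the patent defines them only in image form). The thresholds and box values are illustrative.

from typing import List, Tuple

Box = Tuple[float, float, float, float]        # (x1, y1, x2, y2) in input-image pixels

def map_back(x: int, y: int, stride: int) -> Tuple[float, float]:
    """Map a position (x, y) on feature level P_k with the given stride back
    to the input image, landing near the centre of its receptive field."""
    return stride // 2 + x * stride, stride // 2 + y * stride

def candidate_and_level(px: float, py: float, b: Box, gammas: List[float]) -> Tuple[bool, int]:
    """xi is read as the minimum and delta as the maximum distance from the
    mapped position to the four sides of b. The position is a candidate when
    xi > 0, and it truly predicts b on the level k whose regression range
    (gamma_{k-1}, gamma_k] contains delta."""
    x1, y1, x2, y2 = b
    d = (px - x1, py - y1, x2 - px, y2 - py)   # left, top, right, bottom distances
    xi, delta = min(d), max(d)
    if xi <= 0:
        return False, -1
    lo = 0.0
    for k, hi in enumerate(gammas):
        if lo < delta <= hi:
            return True, k
        lo = hi
    return True, -1                            # candidate, but outside every range

if __name__ == "__main__":
    gammas = [64.0, 128.0, 256.0, 512.0, float("inf")]      # illustrative thresholds
    outer, inner = (10, 10, 300, 300), (60, 60, 120, 120)   # nesting and nested primitive
    px, py = map_back(10, 10, 8)
    print(candidate_and_level(px, py, outer, gammas))       # lands on a higher level
    print(candidate_and_level(px, py, inner, gammas))       # lands on a lower level

In this toy example, the nesting primitive and the primitive nested inside it land on different pyramid levels, which is the accumulation-avoidance property the claim describes.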
CN202210615119.4A 2022-05-31 2022-05-31 Hand-drawn chart identification method based on key point prediction Pending CN115171135A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210615119.4A CN115171135A (en) 2022-05-31 2022-05-31 Hand-drawn chart identification method based on key point prediction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210615119.4A CN115171135A (en) 2022-05-31 2022-05-31 Hand-drawn chart identification method based on key point prediction

Publications (1)

Publication Number Publication Date
CN115171135A true CN115171135A (en) 2022-10-11

Family

ID=83483283

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210615119.4A Pending CN115171135A (en) 2022-05-31 2022-05-31 Hand-drawn chart identification method based on key point prediction

Country Status (1)

Country Link
CN (1) CN115171135A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115375917A (en) * 2022-10-25 2022-11-22 杭州华橙软件技术有限公司 Target edge feature extraction method, device, terminal and storage medium
CN116071773A (en) * 2023-03-15 2023-05-05 广东电网有限责任公司东莞供电局 Method, device, medium and equipment for detecting form in power grid construction type archive
CN116071773B (en) * 2023-03-15 2023-06-27 广东电网有限责任公司东莞供电局 Method, device, medium and equipment for detecting form in power grid construction type archive

Similar Documents

Publication Publication Date Title
He et al. Which and how many regions to gaze: Focus discriminative regions for fine-grained visual categorization
Saba et al. Effects of artificially intelligent tools on pattern recognition
CN115171135A (en) Hand-drawn chart identification method based on key point prediction
Hongwei et al. Solder joint inspection method for chip component using improved AdaBoost and decision tree
Bera et al. Attend and guide (ag-net): A keypoints-driven attention-based deep network for image recognition
Lacerda et al. Segmentation of connected handwritten digits using Self-Organizing Maps
Bhuyan et al. An effective method for fingerprint classification
Lin et al. Saliency detection via multi-scale global cues
Luo et al. SFA: small faces attention face detector
Dey et al. A two-stage CNN-based hand-drawn electrical and electronic circuit component recognition system
Xue et al. Detection and rectification of arbitrary shaped scene texts by using text keypoints and links
Amirian et al. Trace and detect adversarial attacks on CNNs using feature response maps
Cheng et al. Leveraging semantic segmentation with learning-based confidence measure
Roy et al. Offline hand-drawn circuit component recognition using texture and shape-based features
Lin et al. Radical-based extract and recognition networks for Oracle character recognition
Song et al. Weakly supervised semantic segmentation via box-driven masking and filling rate shifting
Lee et al. Neuralfp: out-of-distribution detection using fingerprints of neural networks
CN107688822A (en) Newly-increased classification recognition methods based on deep learning
Charitidis et al. Operation-wise attention network for tampering localization fusion
de Oliveira et al. A new segmentation approach for handwritten digits
INTHIYAZ et al. YOLO (YOU ONLY LOOK ONCE) Making Object detection work in Medical Imaging on Convolution detection System.
Keyrouz et al. Enhanced chemical structure recognition and prediction using Bayesian fusion
Elitez Handwritten digit string segmentation and recognition using deep learning
Amraee et al. Handwritten logic circuits analysis using the Yolo network and a new boundary tracking algorithm
Vasudevan et al. Flowchart knowledge extraction on image processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Room 921, Floor 9, No. 2 Office Building, Phase II, Luolangwan International Trade City, Guandu District, Kunming District, Yunnan 650000

Applicant after: Yunnan Hengyu Technology Co.,Ltd.

Applicant after: WUHAN University

Address before: 430072 Hubei Province, Wuhan city Wuchang District of Wuhan University Luojiashan

Applicant before: WUHAN University

Applicant before: Yunnan Hengyu Technology Co.,Ltd.