CN115171135A - Hand-drawn chart identification method based on key point prediction - Google Patents


Info

Publication number
CN115171135A
CN115171135A (application CN202210615119.4A)
Authority
CN
China
Prior art keywords
arrow
primitive
nested
corner
pooling
Prior art date
Legal status
Pending
Application number
CN202210615119.4A
Other languages
Chinese (zh)
Inventor
蔡波
方佳琪
Current Assignee
Yunnan Hengyu Technology Co ltd
Wuhan University WHU
Original Assignee
Yunnan Hengyu Technology Co ltd
Wuhan University WHU
Priority date
Filing date
Publication date
Application filed by Yunnan Hengyu Technology Co ltd and Wuhan University WHU
Priority to CN202210615119.4A
Publication of CN115171135A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/413 Classification of content, e.g. text, photographs or tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G06V30/15 Cutting or merging image elements, e.g. region growing, watershed or clustering-based techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/18 Extraction of features or characteristics of the image
    • G06V30/18143 Extracting features based on salient regional features, e.g. scale invariant feature transform [SIFT] keypoints
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/191 Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19173 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/191 Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/1918 Fusion techniques, i.e. combining data from various sources, e.g. sensor fusion

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of hand-drawn chart recognition, and in particular to a hand-drawn chart recognition method based on key point prediction, which comprises the following steps: 1. detecting each primitive in the chart as a pair of key points, i.e., jointly determining the upper left and lower right corner points of the primitive's bounding box; 2. performing maximal cross corner pooling (MICP) and cumulative cross corner pooling (CICP) in parallel; 3. performing feature fusion on the respective pooled feature maps; 4. for each arrow connecting two primitives, predicting the head and tail key points of the arrow with an arrow direction prediction network, and enhancing the arrow key point information with a snowflake corner pooling (SCP) module; in this way, the identification of the entire chart structure is completed. The invention can better recognize hand-drawn charts.

Description

Hand-drawn chart identification method based on key point prediction
Technical Field
The invention relates to the technical field of hand-drawn chart recognition, in particular to a hand-drawn chart recognition method based on key point prediction.
Background
Hand-drawn charts are a simple, efficient and convenient graphical expression of human thinking and intention, and extend verbal and written communication from a graphical perspective. Generally speaking, one of the ultimate purposes of a hand-drawn chart is to save the manuscript of the chart, or even to identify the primitives in the chart by some method, understand the information it conveys, and then reason about, classify, reconstruct or archive it. Hand-drawn chart recognition refers to completing this process, including primitive recognition and chart understanding, by computer. Hand-drawn chart recognition is the basis of many digital processing tasks on hand-drawn charts, such as hand-drawn chart classification, reconstruction and digitization. Therefore, hand-drawn chart recognition is a key technology for providing powerful support to this graphical expression mode in practical applications.
However, for hand-drawn charts, the identification and analysis of many chart types, such as flow charts, finite state machines, circuit diagrams, chemical molecular structure diagrams and musical scores, remain challenging because of the complex two-dimensional structure and morphological variability of their graphical elements.
Disclosure of Invention
It is an object of the present invention to provide a hand-drawn chart identification method based on key point prediction that overcomes at least some of the disadvantages of the prior art.
The method for identifying the hand-drawn chart based on the key point prediction comprises the following steps:
1. detecting each primitive in the diagram as a pair of key points, namely determining an upper left corner point and a lower right corner point of a boundary box of the primitive together;
2. performing maximal cross corner pooling (MICP) and cumulative cross corner pooling (CICP) in parallel;
3. performing feature fusion on the respective pooled feature maps;
4. for each arrow connecting two primitives, predicting a head key point and a tail key point of the arrow by using an arrow direction prediction network, and enhancing the arrow key point information by using a snowflake corner pooling SCP module; thus, the identification of the entire chart structure is completed.
Preferably, in cumulative cross corner pooling (CICP), to determine whether an activation value belongs to a top-left corner, CICP searches horizontally to the right for the uppermost boundary of the target and vertically downward for the leftmost boundary of the target. Let $F_t$ and $F_l$ be the input feature maps for corner pooling, and let $f_t^{ij}$ and $f_l^{ij}$ be the responses at position $(i,j)$ in $F_t$ and $F_l$, respectively. On an $H\times W$ feature map, CICP decides whether the neuron activation at $(i,j)$ is a top-left corner by accumulating in parallel all responses of $F_t$ distributed horizontally between $(i,j)$ and $(i,H)$ as $T_{ij}$, and all responses of $F_l$ distributed vertically between $(i,j)$ and $(W,j)$ as $L_{ij}$; finally the two are added to produce the pooled feature map $F_{CICP}$. The computation can be expressed as:

$$T_{ij}=\sum_{k=j}^{H} f_t^{ik},\qquad L_{ij}=\sum_{k=i}^{W} f_l^{kj}$$

$$F_{CICP}^{ij}=T_{ij}+L_{ij}$$

The CICP pooling for the bottom-right corner is analogous to that for the top-left corner, i.e., all responses distributed vertically between $(0,j)$ and $(i,j)$ and all responses distributed horizontally between $(i,0)$ and $(i,j)$ are accumulated in parallel, and the two pooled results are then added.
Preferably, the arrow direction prediction network adopts a skip connection, and the SCP module is inserted into the skip connection.
Preferably, in the snowflake corner pooling SCP module:
let F be the profile of the SCP, F ij Is the response for position (i, j) in F. Using H x W feature maps, pooled feature maps F SCP The neuron activation value response to position (i, j) in the SCP is expressed by the following formula:
Figure RE-GDA0003803893160000031
preferably, in the arrow direction prediction network, thermodynamic diagrams, offsets and semantic embedding are used to predict the head key points and the tail key points of the arrows, specifically as follows:
let P cij Is the probability of class c at position (i, j) in the predictive thermodynamic diagram, y cij Is a ground-truth thermodynamic diagram, which has C channels; then, the category Loss of the arrow key points is estimated by the Focal local Loss function:
Figure RE-GDA0003803893160000032
where N is the number of objects in the image and α is a hyper-parameter that controls the contribution of each point;
the positions of the arrow key points are slightly corrected by a prediction offset, let (x, y) be the position in the image,
Figure RE-GDA0003803893160000033
is its down-sampling position in the thermodynamic diagram, where s is the down-sampling factor; the deviation of the arrow key point k between these two positions is estimated:
Figure RE-GDA0003803893160000034
wherein the prediction deviation and the ground-truth deviation can be calculated by a Smooth-L1 Loss function:
Figure RE-GDA0003803893160000035
determining which pair of head key points and rear key points belong to the same arrow, and matching the two key points with the maximum similarity together by using semantic embedding; let e hk Semantic embedding of head key points for arrow k, e tk The semantics of the tail key points are embedded and are four-dimensional vectors; and matching the key points by using a pull loss training network, and keeping away the key points which do not belong to the same object by using the pull loss:
Figure RE-GDA0003803893160000036
Figure RE-GDA0003803893160000041
Figure RE-GDA0003803893160000042
finally, linearly combining the loss functions of all network branches as the final loss function of the whole model, wherein the loss functions with the same task property share the same combination coefficient:
Figure RE-GDA0003803893160000043
here, the
Figure RE-GDA0003803893160000044
And
Figure RE-GDA0003803893160000045
is a loss function for the arrow key point prediction task; and L is det 、 L push 、L pull And L off Is a loss function for the corner key point prediction task; where α, β, γ, and λ are coefficient weights of the sub-loss functions.
Preferably, if the chart is a nested chart, each primitive is associated with multiple positions on multiple levels of the feature pyramid, so that each primitive is predicted by one or more positions, specifically:

For an arbitrary position $l=(u,v)$ on an arbitrary feature level $P_k$ with down-sampling stride $s_k$, the following formula

$$l'=\Big(\Big\lfloor\tfrac{s_k}{2}\Big\rfloor+u\,s_k,\ \Big\lfloor\tfrac{s_k}{2}\Big\rfloor+v\,s_k\Big)$$

maps it back to the input image; $l'$ is called the mapping position of $l$, and it is close to the center of the receptive field of $l$. For any two positions $l_k$ and $l_j$ from any two feature levels $P_k$ and $P_j$, the mapping positions $l'_k$ and $l'_j$ never coincide; thus each mapping position is globally unique in the input image. Secondly, for a ground-truth box $b^{(i)}=(x_0^{(i)},y_0^{(i)},x_1^{(i)},y_1^{(i)})$, if there exists a position $l$ from the feature pyramid such that $l'=(x',y')$ and $b^{(i)}$ satisfy

$$0<\xi(l',b^{(i)})\qquad(30)$$

where

$$\xi(l',b^{(i)})=\min\big(x'-x_0^{(i)},\ y'-y_0^{(i)},\ x_1^{(i)}-x',\ y_1^{(i)}-y'\big)$$

then $l$ is a candidate prediction position of the target $b^{(i)}$; obviously, $b^{(i)}$ can have multiple such positions, all of which constitute its candidate position set $\Theta^{(i)}$. Furthermore, if $l\in\Theta^{(i)}$ is such that $l'$ and $b^{(i)}$ satisfy

$$\gamma_{k-1}<\delta(l',b^{(i)})\le\gamma_k\qquad(32)$$

where

$$\delta(l',b^{(i)})=\max\big(x'-x_0^{(i)},\ y'-y_0^{(i)},\ x_1^{(i)}-x',\ y_1^{(i)}-y'\big)$$

then $l$ is a true prediction position of $b^{(i)}$ on feature level $P_k$, where $\gamma_k$ ($k=1,\dots,5$) is a hyper-parameter. All positions on $P_k$ that predict $b^{(i)}$ constitute the position set $\Omega_k^{(i)}\subseteq\Theta^{(i)}$, and $\Omega^{(i)}=\bigcup_k\Omega_k^{(i)}$ reflects how much prediction capacity $b^{(i)}$ is allotted over the whole feature pyramid.

In addition, for each $b^{(i)}$, the feature levels responsible for predicting $b^{(i)}$ are determined by the supremum and infimum of the discriminant function $\delta(l',b^{(i)})$, which reflect the regression range of $b^{(i)}$ and are computed as:

$$\sup_{l\in\Theta^{(i)}}\delta(l',b^{(i)})=\max\big(x_1^{(i)}-x_0^{(i)},\ y_1^{(i)}-y_0^{(i)}\big)$$

$$\inf_{l\in\Theta^{(i)}}\delta(l',b^{(i)})=\tfrac{1}{2}\max\big(x_1^{(i)}-x_0^{(i)},\ y_1^{(i)}-y_0^{(i)}\big)$$

Therefore, if the thresholds satisfy $\gamma_k\ge 2\gamma_{k-1}$, then the regression range of any primitive overlaps at most two adjacent intervals $(\gamma_{k-1},\gamma_k]$; that is, any primitive in the flow chart can be predicted on at most two adjacent feature levels. For primitive detection in the flow chart, this property effectively avoids the target accumulation caused by assigning a nesting primitive and its nested primitives to the same feature level at the same time.

For any two targets $b^{(i)}$ and $b^{(j)}$, whether $b^{(i)}$ nests $b^{(j)}$ is judged by whether

$$\eta(b^{(i)},b^{(j)})=\frac{\mathrm{Area}\big(b^{(i)}\cap b^{(j)}\big)}{\mathrm{Area}\big(b^{(j)}\big)}$$

equals 1. Now let $b^{(t)}$ be a nesting primitive; all of its nested primitives form its nested set $N^{(t)}=\{\,b^{(i)}\mid\eta(b^{(t)},b^{(i)})=1\,\}$. The purpose of multi-level nested primitive detection is to properly associate $b^{(t)}$ and all nested primitives in $N^{(t)}$ with positions on the feature pyramid while maintaining a high recall for each primitive.
For the identification of ordinary charts, the identification method provided by the invention identifies each primitive in the chart by key point prediction and determines the connection relation between arrows and primitives by predicting the head and tail key points of each arrow, thereby identifying the structure of the whole chart. Key point prediction is a distinctive object detection paradigm: objects are classified and localized by accurately predicting their key points in the image. In order to further strengthen the semantic information of object key points in the feature map, the invention provides two key point pooling modules, which embed the key point information represented by the geometric outline features of the primitives into the feature map as prior knowledge by means of pooling, effectively enhancing the key point information in the feature map. Finally, the invention also provides an arrow direction prediction branch network for the chart recognition task, which predicts the head and tail key point information of each arrow, so that the connection relation between arrows and other primitives can be deduced from the arrow key points, making chart-level understanding possible.
Drawings
FIG. 1 is a flow chart of a method for identifying a hand-drawn chart based on keypoint prediction in an embodiment;
FIG. 2 is a schematic diagram of the overall architecture of DrawnNet in an embodiment;
FIG. 3 is a graph showing the results of pooling the same feature map with CICP and MICP in the example;
FIG. 4 is a schematic diagram of the network structure of the top-left corner point prediction branch in the embodiment;
FIG. 5 is a schematic diagram of the arrow direction prediction network branch in the embodiment;
FIG. 6 is a schematic diagram of a snowflake corner pooling module in the embodiment;
FIG. 7 is a diagram showing a nested diagram in the embodiment;
FIG. 8 is a diagram illustrating nested primitive recognition using a feature pyramid in an embodiment;
FIG. 9 is a diagram illustrating the identification of DrawnNet on three data sets in one embodiment;
FIG. 10 is a sample diagram of some defects in the FC-A data set in the example;
FIG. 11 is a diagram showing the variation of mAP with the number of iterations when different λ are adopted in the embodiment.
Detailed Description
For a further understanding of the invention, reference should be made to the following detailed description taken in conjunction with the accompanying drawings and examples. It is to be understood that the examples are illustrative of the invention and not limiting.
Examples
As shown in fig. 1, the present embodiment provides a method for recognizing a hand-drawn chart based on keypoint prediction, which includes the following steps:
1. detecting each primitive in the diagram as a pair of key points, namely determining an upper left corner point and a lower right corner point of a boundary box of the primitive together;
2. performing maximal cross corner pooling (MICP) and cumulative cross corner pooling (CICP) in parallel;
3. performing feature fusion on the respective pooled feature maps;
4. for each arrow connecting two primitives, predicting a head key point and a tail key point of the arrow by using an arrow direction prediction network, and enhancing the arrow key point information by using a snowflake corner pooling SCP module; thus, the identification of the entire chart structure is completed.
DrawnNet model
This example presents DrawnNet, a keypoint-based detection model built on recent convolutional neural network techniques. DrawnNet is designed on the basis of CornerNet, a model that does not itself understand the structure of a diagram. New modules are introduced into DrawnNet to extend CornerNet's detection capability; these modules exploit the prior knowledge present in charts so that the improved architecture adapts to the hand-drawn chart recognition task. Specifically, this embodiment proposes two novel keypoint pooling modules for explicitly embedding prior knowledge, such as the geometric features present in the chart, into the feature map, and then fusing these features into the keypoint prediction. In addition, in order to understand the structure of the chart, an arrow direction prediction branch is proposed, which predicts the direction of an arrow by predicting its head and tail key points.
In DrawnNet, this embodiment detects each primitive in the chart as a pair of key points, i.e., the upper left and lower right corner points of the primitive bounding box are determined together. Furthermore, for each arrow connecting two primitives, DrawnNet has a branch network called arrow direction prediction that predicts the head and tail key points of the arrow; this pair of key points determines the direction of the arrow, so the structure of the chart can be completely understood through this branch. Fig. 2 shows the overall architecture of DrawnNet. DrawnNet uses the Hourglass model as its backbone. Hourglass is a typical encoder-decoder structure and is widely applied to key point detection and pixel-level prediction tasks such as pose estimation and image segmentation.
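To make the overall wiring concrete, the following sketch shows one possible skeleton of such a model: a backbone feeding three parallel keypoint branches (top-left corners, bottom-right corners, arrow head/tail). All class names, channel sizes, the number of primitive classes and the reduced stand-in backbone are assumptions for illustration; this is not the patented implementation itself.

```python
# A minimal structural sketch of a DrawnNet-style pipeline. The backbone is a
# small stand-in for the Hourglass encoder-decoder mentioned in the text.
import torch
import torch.nn as nn

class TinyBackbone(nn.Module):
    """Stand-in for the Hourglass backbone: just a few conv layers."""
    def __init__(self, channels=128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, channels, 7, stride=4, padding=3),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.body(x)

class KeypointHead(nn.Module):
    """Predicts per-class heatmaps, embeddings and offsets for one keypoint type."""
    def __init__(self, channels=128, num_classes=7, embed_dim=4):
        super().__init__()
        self.heat = nn.Conv2d(channels, num_classes, 1)
        self.embed = nn.Conv2d(channels, embed_dim, 1)
        self.offset = nn.Conv2d(channels, 2, 1)
    def forward(self, f):
        return self.heat(f).sigmoid(), self.embed(f), self.offset(f)

class DrawnNetSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = TinyBackbone()
        self.tl_head = KeypointHead()                   # top-left corner branch
        self.br_head = KeypointHead()                   # bottom-right corner branch
        self.arrow_head = KeypointHead(num_classes=2)   # arrow head/tail keypoints
    def forward(self, image):
        f = self.backbone(image)
        return {"top_left": self.tl_head(f),
                "bottom_right": self.br_head(f),
                "arrow": self.arrow_head(f)}

if __name__ == "__main__":
    out = DrawnNetSketch()(torch.randn(1, 3, 256, 256))
    print({k: [t.shape for t in v] for k, v in out.items()})
```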
Corner key point prediction
In DrawnNet, a graphical primitive is represented as a pair of key points, namely an upper left corner point and a lower right corner point. However, the feature map typically lacks distinct local visual cues indicating the potential locations where key corners may appear. To locate potential key corners, CornerNet proposes a pooling method referred to here as maximal cross corner pooling (MICP). The MICP module starts from one pixel, looks for the maximum neuron activation along the horizontal and the vertical direction respectively, and then adds the two maxima as the pooling result at that point. In charts, most primitives have rectangular outlines, whose corner points appear clearly where several boundary lines intersect. Therefore, in DrawnNet, this embodiment extends the original corner pooling module by introducing another pooling reduction method, and embeds the geometric features present in the chart as explicit prior knowledge into the prediction of key corners by pooling.
In DrawnNet, the pooling process of the proposed corner pooling module traverses each pixel in both the horizontal and vertical directions; if a neuron responds most strongly in its neighborhood of the pooled feature map, its location may be a potential key corner, lying at the intersection of the horizontal and vertical pooling vectors. This embodiment refers to such corner pooling as cross corner pooling (ICP). CornerNet uses max as the reduction function of the ICP module to compute the final response, called maximal cross corner pooling (MICP); in DrawnNet, however, this embodiment uses sum to accumulate all responses vertically and horizontally, referred to as cumulative cross corner pooling (CICP).
As shown in fig. 3, MICP and CICP are used to pool the same feature map, respectively. In fig. 3, MICP and CICP are expected to capture the upper left corner of a rectangular pattern consisting of responses with neuron activation 1; this rectangular visual pattern is undoubtedly a basic visual pattern that appears frequently in primitive objects in charts. Figure 3 (b) shows that MICP fails to effectively capture the upper left corner (circled with a solid line): in the feature map of the MICP pooling result, the other neurons in the neighborhood of the corner point respond nearly as strongly as the corner neuron itself, so the visual feature of the key point is not effectively highlighted in the pooled feature map. In contrast, the CICP module in fig. 3 (a) handles this situation by maximizing the response of the corresponding neuron relative to its neighborhood (e.g., 3 × 3 and 5 × 5 regions).
As described above, to determine whether an activation value belongs to a top-left corner, CICP searches horizontally to the right for the uppermost boundary of the target and vertically downward for the leftmost boundary of the target. Let $F_t$ and $F_l$ be the input feature maps for corner pooling, and let $f_t^{ij}$ and $f_l^{ij}$ be the responses at position $(i,j)$ in $F_t$ and $F_l$, respectively. On an $H\times W$ feature map, CICP decides whether the neuron activation at $(i,j)$ is a top-left corner by accumulating in parallel all responses of $F_t$ distributed horizontally between $(i,j)$ and $(i,H)$ as $T_{ij}$, and all responses of $F_l$ distributed vertically between $(i,j)$ and $(W,j)$ as $L_{ij}$; finally the two are added to produce the pooled feature map $F_{CICP}$. The computation can be expressed as:

$$T_{ij}=\sum_{k=j}^{H} f_t^{ik},\qquad L_{ij}=\sum_{k=i}^{W} f_l^{kj}$$

$$F_{CICP}^{ij}=T_{ij}+L_{ij}$$

The CICP pooling for the bottom-right corner is analogous to that for the top-left corner, i.e., all responses distributed vertically between $(0,j)$ and $(i,j)$ and all responses distributed horizontally between $(i,0)$ and $(i,j)$ are accumulated in parallel, and the two pooled results are then added.
Geometric feature fusion
The network structure of the top-left corner point prediction branch is shown in fig. 4. The improvements of this embodiment over CornerNet include extending the corner pooling module in the corner point prediction branch of DrawnNet and proposing a method for fusing the feature maps pooled by multiple pooling modules. Feature fusion is typically performed to exploit refined feature information from different channel or spatial feature maps, and many vision models are now equipped with feature fusion modules; for example, the feature pyramid network fuses features of multi-scale objects by connecting a pyramid of down-sampled convolutional features.
The feature fusion method employed in DrawnNet is based on pooling: the proposed keypoint pooling methods are first applied to the per-branch feature maps to fully utilize the geometric information in the image, and adaptive feature refinement is then performed by multiplying, adding or concatenating the feature maps pooled by the different pooling methods. As described above, this embodiment proposes the CICP module as a supplement to the MICP module, in particular to enrich the semantic features of corner information in the feature map. Therefore, in order for the subsequent detection task to effectively exploit the corner semantics encoded in the feature map, this embodiment follows the design of the residual network ResNet: first, two parallel 128-channel 3 × 3 Conv-BN-ReLU layers replace the original 3 × 3 convolution module in the residual module, building the whole corner pooling module that processes the features from the backbone network; MICP and CICP are then applied in parallel to the two resulting feature maps (e.g., $F_t$ and $F_l$ for the upper left corner), one pooled vertically and the other horizontally, and their respective pooled maps are added to obtain $F_{CICP}$ and $F_{MICP}$.
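A compact sketch of this pooling-plus-fusion idea is given below, using additive fusion (one of the options mentioned) and the 128-channel figure from the text; the helper definitions, residual wiring and everything else are illustrative assumptions rather than the exact module of the patent.

```python
# Sketch of a corner pooling + fusion block: two parallel Conv-BN-ReLU branches
# produce F_t and F_l, MICP and CICP pool them in parallel, and the two pooled
# maps are fused by addition before a residual connection.
import torch
import torch.nn as nn

def conv_bn_relu(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1),
                         nn.BatchNorm2d(c_out), nn.ReLU(inplace=True))

def cicp_top_left(f_t, f_l):
    # cumulative cross corner pooling (sum reduction)
    t = torch.flip(torch.cumsum(torch.flip(f_t, dims=[2]), dim=2), dims=[2])
    l = torch.flip(torch.cumsum(torch.flip(f_l, dims=[3]), dim=3), dims=[3])
    return t + l

def micp_top_left(f_t, f_l):
    # maximal cross corner pooling (running max instead of running sum)
    t = torch.flip(torch.cummax(torch.flip(f_t, dims=[2]), dim=2).values, dims=[2])
    l = torch.flip(torch.cummax(torch.flip(f_l, dims=[3]), dim=3).values, dims=[3])
    return t + l

class CornerPoolFusion(nn.Module):
    def __init__(self, channels=128):
        super().__init__()
        self.branch_t = conv_bn_relu(channels, channels)  # produces F_t
        self.branch_l = conv_bn_relu(channels, channels)  # produces F_l
        self.out = conv_bn_relu(channels, channels)
    def forward(self, x):
        f_t, f_l = self.branch_t(x), self.branch_l(x)
        fused = cicp_top_left(f_t, f_l) + micp_top_left(f_t, f_l)  # additive fusion
        return self.out(fused) + x   # residual connection

if __name__ == "__main__":
    m = CornerPoolFusion()
    print(m(torch.randn(1, 128, 64, 64)).shape)
```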
Arrow direction prediction
In an arrow-connected diagram like a flow chart, structure identification involves indicating which primitives are connected by each arrow and which direction each arrow points to. Although the object detection model may classify and locate the primitives of the diagram through bounding boxes, this information is not sufficient for diagram structure recognition. The present embodiment finds that this problem can be effectively solved by predicting the arrow key point information. The plane vector formed by the head key point and the tail key point of the arrow indicates the direction of the arrow, and the primitives connected with the arrow can be predicted according to the position relation between each arrow key point and the surrounding primitives. In order to predict arrow key points, the embodiment adds parallel arrow direction prediction network branches on a DrawnNet backbone model. Fig. 5 illustrates a network. The arrow network reuses the feature maps from the backbone network and uses the SCP module to enhance the arrow key point information.
The outputs of the arrow direction prediction branch are similar in function to those of the corner point prediction branch, comprising heatmap generation, semantic embedding and position offset prediction. The arrow direction prediction network adopts a skip connection similar to that of a residual network ResNet module, and an SCP module is inserted into the skip connection, which effectively enhances the semantic features of the key points in the feature map.
Snowflake corner pooling
The direction of an arrow is determined by locating its head and tail, which DrawnNet naturally handles as a key point detection task. To this end, this embodiment proposes a snowflake corner pooling (SCP) module to capture richer, more recognizable visual arrow patterns. Fig. 6 illustrates the principle of SCP.
Let $F$ be the input feature map to the SCP module and $f_{ij}$ the response at position $(i,j)$ in $F$. On an $H\times W$ feature map, the neuron activation response at position $(i,j)$ of the pooled feature map $F_{SCP}$ is obtained by accumulating the responses of $F$ along the straight directions passing through $(i,j)$, in the snowflake-shaped pattern illustrated in fig. 6.
arrow keypoint prediction
In the arrow direction prediction network, heatmaps, offsets and semantic embeddings are used to predict the head and tail key points of each arrow, specifically as follows:

Let $p_{cij}$ be the predicted probability of class $c$ at position $(i,j)$ in the predicted heatmap, and $y_{cij}$ the corresponding value in the ground-truth heatmap, which has $C$ channels. The category loss of the arrow key points is estimated with the focal loss:

$$L_{det}^{a}=-\frac{1}{N}\sum_{c=1}^{C}\sum_{i=1}^{H}\sum_{j=1}^{W}\begin{cases}\big(1-p_{cij}\big)^{\alpha}\log\big(p_{cij}\big) & \text{if } y_{cij}=1\\ \big(1-y_{cij}\big)^{\beta}\,p_{cij}^{\alpha}\log\big(1-p_{cij}\big) & \text{otherwise}\end{cases}$$

where $N$ is the number of objects in the image and $\alpha$ is a hyper-parameter that controls the contribution of each point.

The positions of the arrow key points are slightly corrected by a predicted offset. Let $(x,y)$ be a position in the image and $(\lfloor x/s\rfloor,\lfloor y/s\rfloor)$ its down-sampled position in the heatmap, where $s$ is the down-sampling factor. The offset of arrow key point $k$ between these two positions is estimated as

$$o_k=\Big(\frac{x_k}{s}-\Big\lfloor\frac{x_k}{s}\Big\rfloor,\ \frac{y_k}{s}-\Big\lfloor\frac{y_k}{s}\Big\rfloor\Big)$$

and the discrepancy between the predicted offset $\hat{o}_k$ and the ground-truth offset is measured with the Smooth-L1 loss:

$$L_{off}^{a}=\frac{1}{N}\sum_{k=1}^{N}\mathrm{SmoothL1}\big(o_k,\hat{o}_k\big)$$

A chart may contain multiple arrows, so the head and tail key points of several arrows are predicted at once; it is therefore necessary to determine which head key point and which tail key point belong to the same arrow. As in CornerNet, this embodiment uses semantic embeddings for this purpose and matches the two key points with the greatest similarity. Let $e_{hk}$ be the semantic embedding of the head key point of arrow $k$ and $e_{tk}$ that of its tail key point, both four-dimensional vectors, and let $e_k$ denote their mean. The network is trained with a pull loss that groups key points of the same arrow and a push loss that keeps apart key points that do not belong to the same object:

$$e_k=\frac{e_{hk}+e_{tk}}{2}$$

$$L_{pull}^{a}=\frac{1}{N}\sum_{k=1}^{N}\Big[\big(e_{hk}-e_k\big)^2+\big(e_{tk}-e_k\big)^2\Big]$$

$$L_{push}^{a}=\frac{1}{N(N-1)}\sum_{k=1}^{N}\sum_{\substack{j=1\\ j\neq k}}^{N}\max\big(0,\ \Delta-\lvert e_k-e_j\rvert\big)$$

Finally, the loss functions of all network branches are linearly combined as the final loss of the whole model, with losses of the same task nature sharing the same combination coefficient:

$$L=L_{det}+\alpha L_{pull}+\beta L_{push}+\gamma L_{off}+\lambda\big(L_{det}^{a}+\alpha L_{pull}^{a}+\beta L_{push}^{a}+\gamma L_{off}^{a}\big)$$

where $L_{det}^{a}$, $L_{pull}^{a}$, $L_{push}^{a}$ and $L_{off}^{a}$ are the loss functions of the arrow key point prediction task, $L_{det}$, $L_{push}$, $L_{pull}$ and $L_{off}$ are the loss functions of the corner key point prediction task, and $\alpha$, $\beta$, $\gamma$ and $\lambda$ are the coefficient weights of the sub-losses.
Nested graph recognition
In addition to the identification of ordinary charts discussed above, there is a special kind of chart, namely the nested chart. This type of chart is primarily the program flow chart, in which nested structures play an important role and usually represent program logic such as loops and selections. In the identification of a whole nested chart, the most troublesome point is the identification of the nesting primitives in it. A nesting primitive in a nested chart is a primitive that contains one or more other primitives, which are called its nested primitives; it reflects complex semantic logic in a nested manner. From the perspective of the digital image, a nesting primitive appears as multiple primitives "stacked" together, with the inner layers being nested primitives and the outer layer being the nesting primitive, as shown in fig. 7. No matter how the number, types and logical structure of the inner nested primitives change, the type of the outer nesting primitive remains unchanged, so the presence of inner primitives often makes it much harder for the model to identify the outer nesting primitive. The model therefore needs strong generalization ability to ensure that the nesting primitive is not disturbed by changes of its inner primitives during recognition, and a data enhancement algorithm is provided for the training stage to improve the generalization of the model.
The feature pyramid acts on the decoder part of the backbone Hourglass model, and the feature maps of its top three layers are used as the input of the feature pyramid. Let $P_k$ be the feature level of the $k$-th layer of the feature pyramid (as shown in fig. 8), with down-sampling stride $s_k$ to that layer. The ground-truth bounding boxes in the input nested chart are defined as $B=\{b^{(1)},b^{(2)},\dots\}$, where $(x_0^{(i)},y_0^{(i)})$ and $(x_1^{(i)},y_1^{(i)})$ are respectively the top-left and bottom-right corner coordinates of $b^{(i)}$.
In the method adopted in this embodiment, each primitive is associated with multiple positions on multiple levels of the feature pyramid, so that each primitive is predicted by one or more positions, specifically:

For an arbitrary position $l=(u,v)$ on an arbitrary feature level $P_k$, the following formula

$$l'=\Big(\Big\lfloor\tfrac{s_k}{2}\Big\rfloor+u\,s_k,\ \Big\lfloor\tfrac{s_k}{2}\Big\rfloor+v\,s_k\Big)$$

maps it back to the input image; $l'$ is called the mapping position of $l$, and it is close to the center of the receptive field of $l$. For any two positions $l_k$ and $l_j$ from any two feature levels $P_k$ and $P_j$, the mapping positions $l'_k$ and $l'_j$ never coincide; thus each mapping position is globally unique in the input image. Secondly, for a ground-truth box $b^{(i)}=(x_0^{(i)},y_0^{(i)},x_1^{(i)},y_1^{(i)})$, if there exists a position $l$ from the feature pyramid such that $l'=(x',y')$ and $b^{(i)}$ satisfy

$$0<\xi(l',b^{(i)})\qquad(48)$$

where

$$\xi(l',b^{(i)})=\min\big(x'-x_0^{(i)},\ y'-y_0^{(i)},\ x_1^{(i)}-x',\ y_1^{(i)}-y'\big)$$

then $l$ is a candidate prediction position of the target $b^{(i)}$; obviously, $b^{(i)}$ can have multiple such positions, all of which constitute its candidate position set $\Theta^{(i)}$. Furthermore, if $l\in\Theta^{(i)}$ is such that $l'$ and $b^{(i)}$ satisfy

$$\gamma_{k-1}<\delta(l',b^{(i)})\le\gamma_k\qquad(50)$$

where

$$\delta(l',b^{(i)})=\max\big(x'-x_0^{(i)},\ y'-y_0^{(i)},\ x_1^{(i)}-x',\ y_1^{(i)}-y'\big)$$

then $l$ is a true prediction position of $b^{(i)}$ on feature level $P_k$, where $\gamma_k$ ($k=1,\dots,5$) is a hyper-parameter. All positions on $P_k$ that predict $b^{(i)}$ constitute the position set $\Omega_k^{(i)}\subseteq\Theta^{(i)}$, and $\Omega^{(i)}=\bigcup_k\Omega_k^{(i)}$ reflects how much prediction capacity $b^{(i)}$ is allotted over the whole feature pyramid.

In addition, for each $b^{(i)}$, the feature levels responsible for predicting $b^{(i)}$ are determined by the supremum and infimum of the discriminant function $\delta(l',b^{(i)})$, which reflect the regression range of $b^{(i)}$ and are computed as:

$$\sup_{l\in\Theta^{(i)}}\delta(l',b^{(i)})=\max\big(x_1^{(i)}-x_0^{(i)},\ y_1^{(i)}-y_0^{(i)}\big)$$

$$\inf_{l\in\Theta^{(i)}}\delta(l',b^{(i)})=\tfrac{1}{2}\max\big(x_1^{(i)}-x_0^{(i)},\ y_1^{(i)}-y_0^{(i)}\big)$$

Therefore, if the thresholds satisfy $\gamma_k\ge 2\gamma_{k-1}$, then the regression range of any primitive overlaps at most two adjacent intervals $(\gamma_{k-1},\gamma_k]$; that is, any primitive in the flow chart can be predicted on at most two adjacent feature levels. For primitive detection in the flow chart, this property effectively avoids the target accumulation caused by assigning a nesting primitive and its nested primitives to the same feature level at the same time.

For any two targets $b^{(i)}$ and $b^{(j)}$, whether $b^{(i)}$ nests $b^{(j)}$ is judged by whether

$$\eta(b^{(i)},b^{(j)})=\frac{\mathrm{Area}\big(b^{(i)}\cap b^{(j)}\big)}{\mathrm{Area}\big(b^{(j)}\big)}$$

equals 1. Now let $b^{(t)}$ be a nesting primitive; all of its nested primitives form its nested set $N^{(t)}=\{\,b^{(i)}\mid\eta(b^{(t)},b^{(i)})=1\,\}$. The purpose of multi-level nested primitive detection is to properly associate $b^{(t)}$ and all nested primitives in $N^{(t)}$ with positions on the feature pyramid while maintaining a high recall for each primitive. Therefore, for $b^{(i)}\in N^{(t)}$, there are two possible scenarios between $\Omega^{(i)}$ and $\Omega^{(t)}$, as follows:

1) If $\Omega^{(i)}\cap\Omega^{(t)}=\varnothing$, then the positions predicting $b^{(i)}$ and $b^{(t)}$ come from two different feature levels, so both targets can be effectively predicted as positive samples and thus learned by the model during training.

2) If $\Omega^{(i)}\cap\Omega^{(t)}\neq\varnothing$, then the positions in $\Omega^{(i)}\cap\Omega^{(t)}$ are ambiguous samples because they predict both $b^{(i)}$ and $b^{(t)}$; this happens when the two targets do not differ greatly in size. In this case DrawnNet specifies that these positions predict the primitive of the smallest size, here $b^{(i)}$. Thus, the positive-sample set of $b^{(i)}$ is $\Omega^{(i)}$, and that of $b^{(t)}$ is $\Omega^{(t)}\setminus\Omega^{(i)}$. In addition, $|\Omega^{(i)}\cap\Omega^{(t)}|/|\Omega^{(t)}|$ may be very close to 1, leaving $b^{(t)}$ with a small number of positive samples $|\Omega^{(t)}\setminus\Omega^{(i)}|$, which would seriously affect the recall of $b^{(t)}$. Fortunately, in the nested chart data set of this embodiment there is a significant size difference between nesting primitives and their nested primitives, so this is unlikely to happen.
Thus, for both nesting and nested primitives, DrawnNet assigns the positions that predict them to the appropriate feature level according to their own size. Because the available prediction positions in the feature pyramid are very rich and each primitive can be predicted at multiple positions, DrawnNet achieves a high recall for multi-level target detection in the flow chart. In contrast, other non-dense prediction models such as YOLOv3 cannot handle such target clustering, resulting in a very low detection recall on dense images.
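Assuming the min/max side-distance reading of ξ and δ used above, the assignment of a ground-truth box to feature-pyramid positions can be sketched as follows; the strides and γ thresholds are illustrative values only, not the patent's settings.

```python
# Sketch of assigning a ground-truth box to feature-pyramid positions: xi
# (min distance to the four sides) is the inside-the-box test, and delta
# (max distance) selects the feature level via the gamma thresholds.
import math

STRIDES = {3: 8, 4: 16, 5: 32}                           # feature levels P3..P5
GAMMAS = {3: (0, 64), 4: (64, 128), 5: (128, math.inf)}  # (gamma_{k-1}, gamma_k]

def xi(px, py, box):
    x0, y0, x1, y1 = box
    return min(px - x0, py - y0, x1 - px, y1 - py)

def delta(px, py, box):
    x0, y0, x1, y1 = box
    return max(px - x0, py - y0, x1 - px, y1 - py)

def assign_positions(box, image_w, image_h):
    """Return {level: [(u, v), ...]} of positions responsible for `box`."""
    out = {k: [] for k in STRIDES}
    for k, s in STRIDES.items():
        lo, hi = GAMMAS[k]
        for v in range(image_h // s):
            for u in range(image_w // s):
                px, py = s // 2 + u * s, s // 2 + v * s   # mapping position l'
                if xi(px, py, box) > 0 and lo < delta(px, py, box) <= hi:
                    out[k].append((u, v))
    return out

if __name__ == "__main__":
    outer = (10, 10, 200, 150)   # a nesting primitive
    inner = (60, 60, 110, 100)   # one of its nested primitives
    for name, b in (("outer", outer), ("inner", inner)):
        counts = {k: len(p) for k, p in assign_positions(b, 256, 256).items()}
        print(name, counts)       # the two boxes land on different levels
```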
Nested graph-oriented data enhancement
In order to improve the diversity of the nested chart data set and identify the nested primitives in nested charts more accurately, this embodiment proposes a data enhancement algorithm (Algorithm 1). Intuitively, the algorithm discards each primitive in a nested chart with a certain probability, so that a large number of different nested charts can be derived from a single one. Specifically, the algorithm assumes that the discard decision for each primitive is governed by a random variable following a uniform distribution over the interval [0, 1]; a primitive is discarded when the random variable X ≥ λ. Therefore, from a chart with $n$ primitives, at most $\binom{n}{k}$ different charts containing only $k$ ($0\le k\le n$) primitives can be derived, for a total of $\sum_{k=0}^{n}\binom{n}{k}=2^n$ different charts. Assuming that each nested chart in the training set contains 10 primitives on average, at most 1024 different nested charts can be derived from each one. This is equivalent to expanding the training data set by a factor of 1024, which is a huge number: even if the data set contains only hundreds of training samples, hundreds of thousands of training samples can be derived after expansion by the proposed data enhancement algorithm, which is completely sufficient for training a convolutional object detection model, considering that the largest common object detection data set, COCO-2017, has no more than 120,000 training samples.
In addition, compared with other data enhancement algorithms in the image domain, the proposed algorithm has only one hyper-parameter, namely the probability threshold λ that decides whether each primitive is discarded. The setting of λ is closely related to how many primitives are retained after each enhancement of a chart: the larger λ is, the more primitives are discarded at each data enhancement, and vice versa. In the experiments of this embodiment, λ generally takes one of three values: 0.25, 0.50 and 0.75. If λ is too large (λ > 0.75), the enhanced chart appears too hollow because too many primitives are discarded, and detecting a chart with hollow content is undoubtedly a waste of computing resources; conversely, if λ is too small (λ < 0.25), too few primitives are discarded each time, which weakens the effect of data enhancement. Therefore, λ = 0.5 achieves a relatively good data enhancement effect. It can be shown theoretically that when λ = 0.5 the probability of discarding half of the primitives in each chart is the greatest, i.e., the probability of retaining half of the primitives after data enhancement is maximal.
Algorithm 1: nested-chart data enhancement by random primitive discarding
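A minimal sketch of this primitive-dropping augmentation follows. Only the keep/discard rule with threshold λ (a primitive is discarded when the uniform draw X ≥ λ, as stated above) comes from the text; the sample format is a placeholder, and re-rendering the derived chart image is out of scope here.

```python
# Sketch of the nested-chart data enhancement: each primitive is kept or
# discarded independently by comparing a uniform random draw X with the
# threshold lambda (discard when X >= lambda).
import random

def augment_nested_chart(primitives, lam=0.5, rng=random.Random(0)):
    """primitives: list of dicts like {"cls": "while", "box": (x0, y0, x1, y1)}.
    Returns a new, possibly smaller list of primitives for a derived chart."""
    kept = []
    for p in primitives:
        if rng.random() < lam:          # X < lambda -> keep; X >= lambda -> discard
            kept.append(dict(p))
    return kept

if __name__ == "__main__":
    sample = [{"cls": "begin", "box": (0, 0, 40, 20)},
              {"cls": "while", "box": (0, 30, 120, 150)},
              {"cls": "process", "box": (10, 50, 100, 80)},
              {"cls": "end", "box": (0, 160, 40, 180)}]
    for _ in range(3):
        print([p["cls"] for p in augment_nested_chart(sample)])
```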
Evaluation of experiments
This example evaluates DrawnNet on three open hand-drawn chart datasets, two of which describe flow charts (FC-A and FC-B), and one of which is a finite automaton dataset (FA).
FC-A was released in 2011 as a baseline database for hand-drawn flow charts; its 419 drawings were drawn by 35 authors from 28 pre-defined templates (248 for the training set and 171 for the test set). Its greatest deficiency is the lack of annotations about chart structure and temporal information; only individual primitive annotations are provided. Therefore, the quality of these charts is low, and it is difficult to evaluate online identification methods on them.
FC-B was released in 2016 as a complement to FC-A and contains 672 charts (280 in the training set, 196 in the test set, 196 in the validation set), derived from 28 pattern templates drawn by 24 authors. Some templates reference FC-A and retain the usual algorithmic functionality. In addition to chart structure labels, its labels also include the direction in which each arrow points.
FA and FC-B were published simultaneously, with a total of 300 charts (132 in training set, 84 in test set and 84 in validation set), generated from 12 templates drawn by 25 workers. There are four categories of data sets: state (single circle), final state (two concentric circles), test and arrow. The arrows are generally curved. As with FC-B, its label also contains the pointing direction of the arrow.
In addition, in order to evaluate the recognition effect of the DrawnNet model of this embodiment on nested charts, the authors constructed a nested chart data set of the program flow chart type. The authors' research group and a software company jointly developed a piece of software for drawing and labelling nested charts. This embodiment studied in advance the grammar of current mainstream programming languages including C, C++ and Java, as well as languages widely used in the last century such as Fortran, and summarized from these languages 9 basic program structures: begin, end, input/output, if-else, assign, process, while, for and do-while. These 9 predefined basic program structures serve as the basic primitives of the nested chart data set, i.e., only these 9 categories of primitives appear in the data set. Among them, if-else, while, for and do-while are the 4 nestable primitives, i.e., any of the 9 primitives can be nested as a child primitive inside these 4; while begin, end, input/output, assign and process are basic primitives, i.e., they cannot contain other primitives and can only be nested inside a nestable primitive.
In addition to defining the basic program structures, in order to construct a high-quality nested chart data set in which the program algorithms described by the charts effectively reflect the business logic of existing programs in the software engineering field, the authors of this embodiment carried out sample surveys of code segments from open source repositories such as GitHub, and computed three indexes characterizing code complexity: the number N of basic program structures, the nesting depth D, and the category distribution C of the basic program structures. Referring to the sampling results of these three indexes, this embodiment formulated corresponding quantitative standards as the overall distribution of the data set, to guide the drawing of the charts.
Finally, this embodiment constructed a nested chart data set of 600 samples in total. Each sample consists of two parts, namely the flowchart image and the label information of each primitive in the flowchart, expressed as a pair <image, label>, where label represents the class and position of each primitive. The nested chart data set is then split 7:3 into a training data set and a test data set.
Precision and Recall are used to evaluate the performance of the proposed method on the two tasks of primitive recognition and chart recognition. In addition, for the identification of nested charts, in order to measure the accuracy of the model in object detection, this embodiment uses two further indexes, AP and mAP, following general object detection practice, to evaluate the accuracy of the model on nested chart recognition.
Precision: the proportion of the model's predictions that are correct, i.e., the number of correctly predicted results compared to all predicted results.
Recall: the proportion of the ground-truth results that the model predicts correctly, i.e., the number of correctly predicted results compared to all results to be predicted.
AP: AP refers to the "Average Precision" first introduced in VOC 2007, defined as the average detection precision over different recall rates; it is usually evaluated per category.
mAP: mAP is the average of AP over all classes, and is widely used to compare performance across all object classes in general object detection evaluations.
Table 1 compares the performance of DrawnNet with other online and offline chart recognition systems. Online recognition systems carry out chart recognition by analysing and modelling temporal stroke information, relying heavily on primitive segmentation and primitive classification. These low-level representations and local understanding are sensitive to the reference data set: if the labelling quality of the data set is low or the labelling information is not accurate enough, recognition accuracy can be negatively affected. In contrast, recognition systems based on deep learning object detection can be purposefully designed so that the model robustly achieves superior performance on the chart recognition task.
TABLE 1 Chart identification accuracy on individual datasets
Fig. 9 shows the performance of DrawnNet on three benchmark sets, with the head and tail keypoints of the arrows labeled separately.
This embodiment also examines how DrawnNet performs the primitive recognition task on each reference data set. Tables 2, 3 and 4 show the primitive identification results on each reference data set. In general, DrawnNet performs well on the various primitive shapes, which can be explained by the fact that primitive shapes and sizes vary much less than arrows and text.
DrawnNet has high primitive recognition recall and precision on the FC-A data set (Table 2). However, DrawnNet performs slightly worse on the arrow class than on the other classes. By examining the training set of FC-A, the authors found that in some samples the arrows are marked with circles rather than crosses, as shown in fig. 10, which may mislead the classification decision of the model, causing it to confuse them with other classes such as text and thus misleading model learning. In addition, the resolution of some circles is too small; as shown in fig. 10, after a series of down-samplings of the feature map their resolution becomes smaller and smaller, which, as mentioned above, severely hinders feature extraction by the corner pooling module of this embodiment.
TABLE 2 accuracy and recall of FC-A primitive identification across categories
Table 3 shows that DrawnNet can accurately identify the primitives in the chart. Interestingly, Arrow R-CNN gives complementary results in precision and recall for the two categories Data and Process, namely 100 and 94.9 respectively for the Data category, and almost the opposite for the Process category. Part of the reason is that, despite using FPN, the underlying network Faster R-CNN of Arrow R-CNN has no corresponding mechanism to learn fine-grained discriminative features between similar primitives of different classes: both Data and Process primitives are quadrilaterals, except that the two parallel edges of the Data primitive are slightly oblique rather than vertical as in the Process primitive.
TABLE 3 accuracy and recall of FC-B primitive identification across categories
As shown in table 4, DrawnNet fully recognizes the State and Final State shapes in the FA test set. Because the number of categories is small and the characteristics of each category are stable, they are not difficult to recognize.
TABLE 4 FA primitive identification accuracy and recall in various categories
Ablation experiment
This embodiment performs an ablation study on each reference data set to further quantify the impact of the proposed keypoint pooling methods on chart identification. Table 5 shows that combining the CICP module and the SCP module significantly improves the accuracy of chart recognition. It should be noted that although this embodiment proposes an arrow direction branch network, that branch cannot be removed in the ablation study, because its responsibility is to predict arrow key points, which is indispensable for the chart recognition task. Therefore, in the ablation study, this embodiment ablates only the SCP module inside the arrow direction branch network, rather than the entire branch network.
Clearly, the use of the SCP module effectively improves chart recognition accuracy, because its pooling helps predict the arrow key points, and whether the arrow key points are predicted correctly is crucial for the subsequent chart identification. In addition, the CICP module also helps DrawnNet perceive where the corner key points of rectangular outlines may be located, but it does not give as good results when used alone. After all, the correctness of the arrow key point prediction is crucial to the correctness of chart understanding.
TABLE 5 ablation results
Nested chart recognition experiment
The accuracy of the method is evaluated from two aspects: overall performance and the generalization brought by the data enhancement algorithm. In order to evaluate the overall performance of the method, this embodiment trains FCOS for 100 epochs in each setting and evaluates the trained models on flowchart detection. Table 6 shows the results on the test split. As can be seen from table 6, data enhancement significantly improves the accuracy of flowchart detection: the maximum mAP reaches 77.98, which is 3.65% higher than without data enhancement, and the improvement is significant across APs with different IoU thresholds.
TABLE 6 Overall Performance on primitive recognition tasks
To further investigate the effectiveness of data enhancement, this example selects different λ values and trains the model with increasing numbers of iterations on the training set (at intervals of 10), and then observes the variation of mAP on the test split. Fig. 11 shows the results. As analysed earlier, the mAP does increase with more epochs and with the adoption of the data enhancement algorithm of this embodiment. This trend corresponds to an increasing generalization capability of the model: the continuous expansion of the training data continuously diversifies its morphology, and as the epochs increase the model sees more and more different training data, which is consistent with the expectation of this embodiment. In addition, as can be seen from fig. 11, different λ values do have a certain influence on the mAP; as analysed before, λ = 0.50 gives the best data enhancement effect, and the experimental results verify this.
Therefore, with the increase of the number of iterations and the adoption of data expansion, the method of this embodiment performs remarkably well on the flowchart detection task, with the corresponding metric continuously rising. This demonstrates the strength of the data enhancement algorithm of this embodiment, which greatly helps DrawnNet achieve striking results on both tasks.
Generalization capability brought by data enhancement
In order to evaluate the overall performance of the model of this embodiment in identifying each category and multi-level nested primitives, further experiments were performed. Table 7 provides the mAP for each category. It can be seen that without data enhancement the mAP for each class ranges between 55.45 and 88.89, with the mAP of the 4 nestable primitives significantly lower than that of the other 5 basic primitives. Clearly, the improvement from data enhancement is most significant for the four nestable primitives such as if-else and while, with gains of at least 5.49. This is because the identification of nestable primitives is more challenging than that of basic primitives. Thus, the gain from data enhancement is mainly concentrated on the nestable primitives, raising the average mAP of the four nestable primitive classes to 65.01, 6.77 higher than without enhancement (as shown in Table 7).
TABLE 7 mAP on each category
Table 8 shows the mAP of nestable primitives with different nesting depths. It can be seen that the model maintains a relatively high mAP when the nesting depth is 1 or 2, while the improvement from data enhancement is most significant when the nesting depth increases to 3, 4 or even 5. When the nesting depth is 3 or 4, the mAP gain on all four nestable classes stays above 6.42, and their average mAP over the four classes increases by 9.08 and 6.87, respectively.
TABLE 8 mAP comparison of nested classes of different nesting depths under data enhancement
(Table 8 is provided as an image in the original publication.)
Thus, the data enhancement algorithm proposed in this embodiment generates a large amount of distinct training data whose internal structures vary. By learning these variants, the model can accommodate multi-level control flows, in particular the four nestable primitives shown in the figure, which contain deeply nested child primitives.
The present invention and its embodiments have been described above schematically, and the description is not limiting; what is shown in the drawings is only one embodiment of the present invention, and the actual structure is not limited thereto. Therefore, structural modes and embodiments similar to the technical solution, which a person skilled in the art may design without inventive effort in light of the above teaching and without departing from the spirit of the invention, shall fall within the protection scope of the invention.

Claims (6)

1. A hand-drawn chart recognition method based on key point prediction, characterized by comprising the following steps:
1. detecting each primitive in the chart as a pair of key points, namely an upper left corner point and a lower right corner point that jointly determine the bounding box of the primitive;
2. performing maximal cross corner pooling (MICP) and cumulative cross corner pooling (CICP) operations in parallel;
3. performing feature fusion on the respective pooled feature maps;
4. for each arrow connecting two primitives, predicting the head key point and the tail key point of the arrow with an arrow direction prediction network, and enhancing the arrow key point information with a snowflake corner pooling (SCP) module; thus, the identification of the entire chart structure is completed.
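For orientation only, the following PyTorch-style sketch wires the four claimed steps together. Every name in it (DiagramKeypointDetector, the toy backbone, the one-directional scans standing in for MICP and CICP, the channel and class counts) is an illustrative assumption, not the implementation disclosed in this patent.

import torch
import torch.nn as nn

class DiagramKeypointDetector(nn.Module):
    """Toy wiring of the four claimed steps: corner keypoints, parallel
    MICP/CICP pooling, feature fusion, arrow head/tail prediction."""

    def __init__(self, channels: int = 64, num_classes: int = 9):
        super().__init__()
        self.backbone = nn.Sequential(                           # stand-in feature extractor
            nn.Conv2d(3, channels, 3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU())
        self.fuse = nn.Conv2d(2 * channels, channels, 1)         # step 3: feature fusion
        self.corner_heat = nn.Conv2d(channels, num_classes, 1)   # step 1: corner heatmaps
        self.arrow_heat = nn.Conv2d(channels, 2, 1)              # step 4: head/tail heatmaps

    def forward(self, image: torch.Tensor):
        feat = self.backbone(image)
        # Step 2: MICP and CICP run in parallel; here a single right-to-left
        # max scan and sum scan stand in for the real cross-pooling modules.
        micp = torch.cummax(feat.flip(-1), dim=-1).values.flip(-1)
        cicp = feat.flip(-1).cumsum(dim=-1).flip(-1)
        fused = self.fuse(torch.cat([micp, cicp], dim=1))
        return self.corner_heat(fused), self.arrow_heat(fused)

if __name__ == "__main__":
    corners, arrows = DiagramKeypointDetector()(torch.randn(1, 3, 128, 128))
    print(corners.shape, arrows.shape)

In the real model the pooled maps would come from dedicated MICP and CICP modules operating along both axes; the scans above only indicate where those modules sit in the data flow.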
2. The method of claim 1, wherein: in the cumulative cross corner pooling (CICP), in order to determine whether an activation value belongs to the upper left corner of a primitive, the CICP searches rightward along the horizontal direction for the uppermost boundary of the target and searches along the vertical direction for the leftmost boundary of the target. Let $F_t$ and $F_l$ be the input feature maps for corner pooling, and let $f_{t_{ij}}$ and $f_{l_{ij}}$ be the responses of $F_t$ and $F_l$ at position $(i, j)$, respectively. Using an $H \times W$ feature map, the CICP determines whether the neuron activation value at position $(i, j)$ is an upper-left corner by accumulating, in parallel, all responses in $F_t$ horizontally distributed between $(i, j)$ and $(i, H)$ as $T_{ij}$, and all responses in $F_l$ vertically distributed between $(i, j)$ and $(W, j)$ as $L_{ij}$; finally, the two are added to generate the feature map $F_{CICP}$. The calculation process can be expressed by the following formulas:

$$T_{ij} = \sum_{k=j}^{H} f_{t_{ik}}, \qquad L_{ij} = \sum_{k=i}^{W} f_{l_{kj}}, \qquad F_{CICP_{ij}} = T_{ij} + L_{ij}$$

The CICP pooling calculation for the lower right corner is similar to that for the upper left corner, i.e. all responses vertically distributed between $(0, j)$ and $(i, j)$ and all responses horizontally distributed between $(i, 0)$ and $(i, j)$ are accumulated in parallel, and the pooled results are then added.
3. The method of claim 1, wherein: the arrow direction prediction network adopts skip connections, and an SCP module is inserted in the middle of the skip connection.
4. The method of claim 1, wherein, in the snowflake corner pooling (SCP) module:
Let F be the input feature map of the SCP and $f_{ij}$ the response at position (i, j) in F. Using an H × W feature map, the neuron activation value of the pooled feature map $F_{SCP}$ at position (i, j) is expressed by the following formula:
(The SCP pooling formula is provided as an image in the original publication.)
5. The method of claim 1, wherein: in the arrow direction prediction network, heatmaps, offsets and semantic embeddings are used to predict the head key points and tail key points of an arrow, specifically comprising the following steps:
Let $p_{cij}$ be the probability of class c at position (i, j) in the predicted heatmap and $y_{cij}$ the corresponding value in the ground-truth heatmap, which has C channels; the category loss of the arrow key points is then estimated with the Focal Loss function:
(The Focal Loss formula is provided as an image in the original publication.)
where N is the number of objects in the image and α is a hyper-parameter that controls the contribution of each point;
The positions of the arrow key points are slightly corrected by the predicted offsets. Let (x, y) be a position in the image and $(\lfloor x/s \rfloor, \lfloor y/s \rfloor)$ its downsampled position in the heatmap, where s is the downsampling factor; the deviation of arrow key point k between these two positions is estimated as

$$o_k = \left(\frac{x_k}{s} - \left\lfloor\frac{x_k}{s}\right\rfloor,\ \frac{y_k}{s} - \left\lfloor\frac{y_k}{s}\right\rfloor\right)$$

and the loss between the predicted deviation and the ground-truth deviation is computed with the Smooth-L1 loss function (formula provided as an image in the original publication).
To determine which pair of head and tail key points belongs to the same arrow, semantic embeddings are used to match the two key points with the greatest similarity. Let $e_{hk}$ be the semantic embedding of the head key point of arrow k and $e_{tk}$ that of its tail key point, both embeddings being four-dimensional vectors; the network is trained with a pull loss to group the key points belonging to the same arrow and with a push loss to separate key points that do not belong to the same object:
(The pull and push loss formulas are provided as images in the original publication.)
Finally, the loss functions of all network branches are linearly combined as the final loss function of the whole model, with loss functions of the same task nature sharing the same combination coefficient; the combined loss formula and the symbols of the arrow key point prediction losses are provided as images in the original publication, while $L_{det}$, $L_{push}$, $L_{pull}$ and $L_{off}$ are the loss functions of the corner key point prediction task, and α, β, γ and λ are the coefficient weights of the sub-loss functions.
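The loss terms of this claim follow a CornerNet-style keypoint recipe (focal heatmap loss, Smooth-L1 offsets, pull/push embedding losses). The sketch below is one plausible reading under that assumption; the exact formulas appear only as images in the original, so the penalty shape, the margin, and all function names are illustrative.

import torch
import torch.nn.functional as F

def heatmap_focal_loss(pred: torch.Tensor, gt: torch.Tensor, alpha: float = 2.0) -> torch.Tensor:
    """pred, gt: (N, C, H, W); gt is 1 at key points and 0 elsewhere.
    A simplified focal loss without the Gaussian penalty reduction."""
    pos = gt.eq(1).float()
    pred = pred.clamp(1e-6, 1 - 1e-6)
    pos_loss = pos * (1 - pred) ** alpha * torch.log(pred)
    neg_loss = (1 - pos) * pred ** alpha * torch.log(1 - pred)
    return -(pos_loss + neg_loss).sum() / pos.sum().clamp(min=1.0)

def offset_loss(pred_off: torch.Tensor, gt_off: torch.Tensor) -> torch.Tensor:
    """pred_off, gt_off: (N, K, 2) sub-pixel offsets of K key points."""
    return F.smooth_l1_loss(pred_off, gt_off)

def pull_push_loss(e_head: torch.Tensor, e_tail: torch.Tensor, margin: float = 1.0):
    """e_head, e_tail: (K, D) embeddings of the head/tail key points of K arrows.
    The pull term groups the two key points of the same arrow; the push term
    separates the mean embeddings of different arrows."""
    e_mean = (e_head + e_tail) / 2.0
    pull = ((e_head - e_mean) ** 2 + (e_tail - e_mean) ** 2).sum(dim=1).mean()
    k = e_mean.shape[0]
    dist = torch.cdist(e_mean, e_mean)
    off_diag = ~torch.eye(k, dtype=torch.bool)
    push = F.relu(margin - dist[off_diag]).mean() if k > 1 else dist.new_zeros(())
    return pull, push

if __name__ == "__main__":
    heat = torch.sigmoid(torch.randn(2, 2, 64, 64))
    gt = torch.zeros_like(heat)
    gt[:, :, 10, 10] = 1.0
    print(heatmap_focal_loss(heat, gt).item())
    print(offset_loss(torch.rand(2, 5, 2), torch.rand(2, 5, 2)).item())
    print([x.item() for x in pull_push_loss(torch.randn(4, 4), torch.randn(4, 4))])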
6. The method of claim 1, wherein: if the chart is a nested chart, each primitive is associated with a plurality of positions on a plurality of levels of the feature pyramid, so that each primitive is predicted by one or more positions, specifically:
for an arbitrary feature level $P_k$ and an arbitrary position $l_k$ on it, a mapping formula (provided as an image in the original publication) maps $l_k$ back to the input image to obtain $l'_k$, which is called the mapping position and is close to the center of the receptive field of $l_k$; for any two positions $l_k$ and $l_j$ from any two feature levels $P_k$ and $P_j$, the mapping positions $l'_k$ and $l'_j$ never overlap; thus, each mapping position is globally unique in the input image;
secondly, for a target $b^{(i)}$, if there exists a position $l$ from the feature pyramid such that $l'$ and $b^{(i)}$ satisfy the condition

$$0 < \xi(l', b^{(i)}) \qquad (12)$$

where $\xi(l', b^{(i)})$ is a discriminant function (its formula is provided as an image in the original publication), then $l$ is a candidate prediction position of the target $b^{(i)}$; obviously, $b^{(i)}$ may have multiple such positions, all of which constitute the candidate position set $\Theta^{(i)}$ of $b^{(i)}$; furthermore, if there exists a position such that $l'$ and $b^{(i)}$ satisfy the condition

$$\gamma_{k-1} < \delta(l', b^{(i)}) \le \gamma_k \qquad (14)$$

where $\delta(l', b^{(i)})$ is a discriminant function (its formula is provided as an image in the original publication), then $l$ is a true prediction position of $b^{(i)}$ at the feature level $P_k$, where $\gamma_k\ (k = 1, \dots, 5)$ is a hyper-parameter; all positions predicting $b^{(i)}$ at $P_k$ constitute the position set of $b^{(i)}$ at $P_k$, and these sets taken over the whole pyramid reflect the capacity of $b^{(i)}$ on the entire feature pyramid;
in addition, the feature levels responsible for predicting $b^{(i)}$ are determined by the supremum and the infimum of the discriminant function $\delta(l', b^{(i)})$, which reflect the regression range of $b^{(i)}$; the corresponding calculation formulas are provided as images in the original publication; therefore, any primitive in the flow chart can be predicted at two adjacent feature levels; for primitive detection in the flow chart, this property effectively avoids the target accumulation caused by a nesting primitive and its nested primitives being assigned to the same feature level simultaneously;
for any two targets $b^{(i)}$ and $b^{(j)}$, whether an indicator function (provided as an image in the original publication) equals 1 determines whether $b^{(i)}$ nests $b^{(j)}$; now, let $b^{(t)}$ be a nestable primitive, and let all of its nested child primitives form its nested set $N^{(t)}$; the purpose of multi-level nested primitive detection is to properly associate $b^{(t)}$ and all nested primitives in $N^{(t)}$ with positions on the feature pyramid while maintaining a high recall rate for each primitive.
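A small sketch of the position-assignment logic of this claim, under the assumption that ξ and δ behave like the minimum and maximum distances from a mapped position to the four sides of a box (an FCOS-like reading; the patent defines them only in image form). The thresholds and box values are illustrative.

from typing import List, Tuple

Box = Tuple[float, float, float, float]        # (x1, y1, x2, y2) in input-image pixels

def map_back(x: int, y: int, stride: int) -> Tuple[float, float]:
    """Map a position (x, y) on feature level P_k with the given stride back
    to the input image, landing near the centre of its receptive field."""
    return stride // 2 + x * stride, stride // 2 + y * stride

def candidate_and_level(px: float, py: float, b: Box, gammas: List[float]) -> Tuple[bool, int]:
    """xi is read as the minimum and delta as the maximum distance from the
    mapped position to the four sides of b. The position is a candidate when
    xi > 0, and it truly predicts b on the level k whose regression range
    (gamma_{k-1}, gamma_k] contains delta."""
    x1, y1, x2, y2 = b
    d = (px - x1, py - y1, x2 - px, y2 - py)   # left, top, right, bottom distances
    xi, delta = min(d), max(d)
    if xi <= 0:
        return False, -1
    lo = 0.0
    for k, hi in enumerate(gammas):
        if lo < delta <= hi:
            return True, k
        lo = hi
    return True, -1                            # candidate, but outside every range

if __name__ == "__main__":
    gammas = [64.0, 128.0, 256.0, 512.0, float("inf")]      # illustrative thresholds
    outer, inner = (10, 10, 300, 300), (60, 60, 120, 120)   # nesting and nested primitive
    px, py = map_back(10, 10, 8)
    print(candidate_and_level(px, py, outer, gammas))       # lands on a higher level
    print(candidate_and_level(px, py, inner, gammas))       # lands on a lower level

In this toy example, the nesting primitive and the primitive nested inside it land on different pyramid levels, which is the accumulation-avoidance property the claim describes.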
CN202210615119.4A 2022-05-31 2022-05-31 Hand-drawn chart identification method based on key point prediction Pending CN115171135A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210615119.4A CN115171135A (en) 2022-05-31 2022-05-31 Hand-drawn chart identification method based on key point prediction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210615119.4A CN115171135A (en) 2022-05-31 2022-05-31 Hand-drawn chart identification method based on key point prediction

Publications (1)

Publication Number Publication Date
CN115171135A true CN115171135A (en) 2022-10-11

Family

ID=83483283

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210615119.4A Pending CN115171135A (en) 2022-05-31 2022-05-31 Hand-drawn chart identification method based on key point prediction

Country Status (1)

Country Link
CN (1) CN115171135A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115375917A (en) * 2022-10-25 2022-11-22 杭州华橙软件技术有限公司 Target edge feature extraction method, device, terminal and storage medium
CN116071773A (en) * 2023-03-15 2023-05-05 广东电网有限责任公司东莞供电局 Method, device, medium and equipment for detecting form in power grid construction type archive
CN116071773B (en) * 2023-03-15 2023-06-27 广东电网有限责任公司东莞供电局 Method, device, medium and equipment for detecting form in power grid construction type archive

Similar Documents

Publication Publication Date Title
He et al. Which and how many regions to gaze: Focus discriminative regions for fine-grained visual categorization
Saba et al. Effects of artificially intelligent tools on pattern recognition
CN115171135A (en) Hand-drawn chart identification method based on key point prediction
Hongwei et al. Solder joint inspection method for chip component using improved AdaBoost and decision tree
Bera et al. Attend and guide (ag-net): A keypoints-driven attention-based deep network for image recognition
Lacerda et al. Segmentation of connected handwritten digits using Self-Organizing Maps
Bhuyan et al. An effective method for fingerprint classification
Lin et al. Saliency detection via multi-scale global cues
Luo et al. SFA: small faces attention face detector
Dey et al. A two-stage CNN-based hand-drawn electrical and electronic circuit component recognition system
Xue et al. Detection and rectification of arbitrary shaped scene texts by using text keypoints and links
Amirian et al. Trace and detect adversarial attacks on CNNs using feature response maps
Cheng et al. Leveraging semantic segmentation with learning-based confidence measure
Roy et al. Offline hand-drawn circuit component recognition using texture and shape-based features
Lin et al. Radical-based extract and recognition networks for Oracle character recognition
Song et al. Weakly supervised semantic segmentation via box-driven masking and filling rate shifting
Lee et al. Neuralfp: out-of-distribution detection using fingerprints of neural networks
CN107688822A (en) Newly-increased classification recognition methods based on deep learning
Charitidis et al. Operation-wise attention network for tampering localization fusion
de Oliveira et al. A new segmentation approach for handwritten digits
INTHIYAZ et al. YOLO (YOU ONLY LOOK ONCE) Making Object detection work in Medical Imaging on Convolution detection System.
Keyrouz et al. Enhanced chemical structure recognition and prediction using Bayesian fusion
Elitez Handwritten digit string segmentation and recognition using deep learning
Amraee et al. Handwritten logic circuits analysis using the Yolo network and a new boundary tracking algorithm
Vasudevan et al. Flowchart knowledge extraction on image processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: Room 921, Floor 9, No. 2 Office Building, Phase II, Luolangwan International Trade City, Guandu District, Kunming District, Yunnan 650000

Applicant after: Yunnan Hengyu Technology Co.,Ltd.

Applicant after: WUHAN University

Address before: 430072 Hubei Province, Wuhan city Wuchang District of Wuhan University Luojiashan

Applicant before: WUHAN University

Applicant before: Yunnan Hengyu Technology Co.,Ltd.