CN104462414A

CN104462414A - Topological structure based flow chart similarity method

Info

Publication number: CN104462414A
Application number: CN201410768419.1A
Authority: CN
Inventors: 刘海亮; 邓伟财; 苏航
Original assignee: Shenzhen Research Institute of Sun Yat Sen University
Current assignee: Shenzhen Research Institute of Sun Yat Sen University
Priority date: 2014-12-12
Filing date: 2014-12-12
Publication date: 2015-03-25

Abstract

The invention discloses a topological structure-based flow chart similarity method and relates to the field of digital image retrieval. The method includes: converting a flow chart into a graph model; constructing a minimum BFS (breadth-first search) coding tree for the graph model; and, according to the hierarchical relationship of the minimum BFS coding tree, assigning weights to the nodes and measuring the similarity of the flow chart models. Because all nodes of the flow chart are weighted according to the hierarchical structure of the corresponding minimum BFS coding tree, the adverse impact of low-relevance nodes on the similarity measurement is effectively avoided and the precision of the measurement is improved. At the same time, the number of node matches is greatly reduced, which improves the time efficiency of the method and gives it high practical value.

Description

Topological structure-based flow chart similarity method

Technical Field

The invention relates to the field of digital image retrieval, in particular to a topological structure-based flow chart similarity method.

Background

A flow chart is a graphic description of a flow, process or algorithm, and is widely used in technical design, communication, scientific research, business briefings and similar fields. In scientific research in particular, the flow chart is one of the most popular qualitative analysis tools for drawing and describing research conclusions. It is intuitive and summarizes research results concisely, and it has become a main means for researchers to efficiently search, understand and describe the process behind research results. Given the massive amount of flow chart data, how to quickly and effectively retrieve flow charts of interest and their related information is a research hotspot in artificial intelligence and pattern recognition.

The basic principle of flow chart retrieval is to search for the flow charts most similar to the query flow chart provided by the user and to return the retrieval results to the user. The core of flow chart retrieval is therefore the technique for measuring flow chart similarity, and many scholars have carried out research on it. The flow chart similarity methods widely adopted at present mainly measure behavioral similarity, structural similarity and text similarity. Text similarity measures the similarity of the texts of flow chart models using string edit distance and semantics; behavioral similarity is computed from the execution semantics of the flow chart models; structural similarity is based on the graph edit distance method and computes the similarity of flow chart models in combination with text similarity.

At present, the graph edit distance method is widely applied to flow chart similarity measurement and achieves good results. It was proposed by S. Nejati and M. Sabetzadeh in "Matching and Merging of Statecharts Specifications", Proceedings of the 29th International Conference on Software Engineering, 2007, 54-64. The graph edit distance is the minimum number of edit operations required to transform one graph into the other, where the operations include node substitution and the deletion and insertion of edges or nodes. However, computing the graph edit distance is an NP-hard problem whose cost grows rapidly with the number of nodes of the graph; its polynomial-time lower and upper bounds are O(n³) and O(n⁷) respectively, so the retrieval efficiency is low.

To address the defects of the above methods, the invention provides a flow chart similarity method based on topological structure. A minimum coding tree is generated using breadth-first search coding, which effectively captures the hierarchical relationship of the nodes in the flow chart. Following the idea of topological structure, each node is given a weight according to its level in the hierarchy, which increases the influence of strongly relevant nodes on the flow chart similarity and effectively improves the efficiency of flow chart similarity measurement.

Disclosure of Invention

The invention aims to provide a similarity measurement method for a flow chart, which can solve the problems that the classification speed of the current supervision classification method is low, the complexity exponentially increases along with the increase of the number of features, and the classification precision is influenced by the features with weak correlation.

The invention provides a flow chart similarity method based on a topological structure, which comprises the following steps:

S1: converting the flow chart into a graph model;

S2: constructing a minimum breadth-first search (BFS) coding tree for the graph model;

S3: assigning weights to the nodes according to the hierarchical relationship of the minimum BFS coding tree, and measuring the similarity of the flow chart models.

In the above topological structure-based flow chart similarity method, step S1 of converting the flow chart into a graph model is performed as follows:

identifying elements of the flow chart such as rectangles, diamonds, straight lines and arrows, and recording the logical relationship of each element in the flow chart.
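As an illustration of what the graph model of step S1 might look like once the shapes and arrows have been recognized, the following minimal sketch stores nodes and directed edges. It is an assumption rather than part of the disclosed method: the names `FlowNode` and `FlowGraph` and the shape labels are hypothetical, and the image recognition itself is taken as already done.

```python
from dataclasses import dataclass, field


@dataclass
class FlowNode:
    node_id: int
    shape: str  # e.g. "rectangle" or "diamond" (hypothetical labels)
    text: str   # text recognized inside the shape


@dataclass
class FlowGraph:
    nodes: dict = field(default_factory=dict)  # node_id -> FlowNode
    edges: dict = field(default_factory=dict)  # node_id -> list of successor ids

    def add_node(self, node_id, shape, text):
        self.nodes[node_id] = FlowNode(node_id, shape, text)
        self.edges.setdefault(node_id, [])

    def add_edge(self, src, dst):
        # An arrow from src to dst records the logical relationship.
        self.edges[src].append(dst)


g = FlowGraph()
g.add_node(0, "rectangle", "Start")
g.add_node(1, "diamond", "x > 0?")
g.add_node(2, "rectangle", "Stop")
g.add_edge(0, 1)
g.add_edge(1, 2)
```

The adjacency-list layout is chosen here only because a breadth-first search can walk it directly.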

The above flowchart similarity method based on the topology structure, wherein the constructing a minimum Breadth First (BFS) coding tree in step S2 is performed according to the following steps:

S2.1: constructing BFS subscripts for the flow chart model;

A breadth-first search is performed on the flow chart model G to generate a breadth-first search tree T. The traversal order of the vertices of T forms a linear sequence: when i < j, vertex V_i is traversed before V_j. This linear sequence is recorded using subscripts, forming a subscripted graph G_T.
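The subscript assignment of step S2.1 can be sketched as an ordinary breadth-first traversal that numbers vertices in visiting order. The function name `bfs_subscripts`, the adjacency-dictionary input and the choice of root are assumptions made for illustration.

```python
from collections import deque


def bfs_subscripts(adjacency, root):
    """Return {vertex: subscript}, numbering vertices in breadth-first visiting order."""
    order = {root: 0}
    queue = deque([root])
    while queue:
        v = queue.popleft()
        for w in adjacency.get(v, []):
            if w not in order:
                order[w] = len(order)  # next unused subscript
                queue.append(w)
    return order


adjacency = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
subscripts = bfs_subscripts(adjacency, "a")
```

A different visiting order of the neighbours yields a different BFS tree and therefore different subscripts, which is why a canonical (minimum) code is selected later.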

S2.2: establishing BFS codes for the breadth-first search tree of the flow chart model;

For a breadth-first search (BFS) tree T of a given graph G, the corresponding BFS code BFSCode(G, T) is the sequence of edges e_i, i = 0, …, |E| − 1, of the subscripted graph G_T, where, writing each edge as e_k = (μ_k, ν_k), the sequence satisfies the partial order relation <_BT defined by: 1) if μ₁ < ν₁ and ν₁ < μ₂, then e₁ <_BT e₂; 2) if μ₁ = μ₂ and ν₁ < ν₂, then e₁ <_BT e₂; 3) if e₁ <_BT e₂ and e₂ <_BT e₃, then e₁ <_BT e₃.
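One way to obtain an edge sequence compatible with rules 1) and 2) is sketched below. It assumes that each edge is written as a pair (μ, ν) of endpoint subscripts and simply sorts the pairs; plain tuple sorting is one total order that respects both rules, not necessarily the exact ordering intended by the inventors. The function name `bfs_code` is hypothetical.

```python
def bfs_code(edges, subscripts):
    """Return edges as (mu, nu) subscript pairs in an order compatible with <_BT."""
    pairs = [(subscripts[u], subscripts[v]) for u, v in edges]
    # Sorting by (mu, nu) puts equal-mu edges in increasing nu (rule 2) and
    # never places an edge with a larger mu before one with a smaller mu (rule 1).
    return sorted(pairs)


subscripts = {"a": 0, "b": 1, "c": 2, "d": 3}
edges = [("c", "d"), ("a", "c"), ("b", "d"), ("a", "b")]
code = bfs_code(edges, subscripts)
```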

S2.3: determining the minimum BFS coding tree according to the lexicographic order;

For the code set S(G) = { BFSCode(G, T) | T is a BFS tree of graph G } of the flow chart model G, let A = BFSCode(G_A, T_A) = (a₀, a₁, …, a_m) and B = BFSCode(G_B, T_B) = (b₀, b₁, …, b_n) be two elements of S(G). A precedes B in the BFS lexicographic order if there exists k, 0 ≤ k ≤ min(m, n), such that a_r = b_r for all r < k and a_k < b_k. The minimum BFS code minBFSCode(G) is then the smallest element of S(G) under the BFS lexicographic order.
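The lexicographic comparison and the selection of the minimum code can be sketched as follows. This assumes that the candidate codes have already been generated, that edges are (μ, ν) pairs, and that two edges are compared as tuples; the names `precedes` and `min_bfs_code` and the two candidate codes are hypothetical.

```python
def precedes(a, b):
    """True if code a comes before code b: equal before position k, then a[k] < b[k]."""
    for x, y in zip(a, b):
        if x != y:
            return x < y
    return False  # no differing position within the common length


def min_bfs_code(codes):
    """Return the smallest code among the candidates under `precedes`."""
    best = codes[0]
    for code in codes[1:]:
        if precedes(code, best):
            best = code
    return best


candidates = [
    [(0, 1), (0, 2), (1, 3), (2, 3)],
    [(0, 1), (0, 2), (1, 2), (1, 3)],
]
smallest = min_bfs_code(candidates)
```

Enumerating every BFS tree of a graph can be expensive, so a practical implementation would likely prune candidates early; that is outside the scope of this sketch.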

In the above flowchart similarity method based on a topological structure, step S3 is to assign a weight to a node according to the hierarchical relationship of the minimum BFS coding tree, and measure the similarity of the flowchart model, and the method is performed according to the following steps:

S3.1: for the flow chart to be matched F_d, finding the node N_d with the smallest layer number;

S3.2: for the flow chart to be searched F_s, calculating, for all nodes N_s of the current active layer, the node similarity Sim_node(N_s, N_d) and the weighted similarity Sim_w(N_s, N_d);

For nodes N_s ∈ F_s and N_d ∈ F_d, the node similarity is

$$\mathrm{Sim}_{node}(N_s, N_d) = 1 - \frac{d(T(N_s), T(N_d))}{\max(|T(N_s)|, |T(N_d)|)}$$

where T(N_s) and T(N_d) denote the text in the nodes, |T(N_s)| and |T(N_d)| denote the lengths of the text strings in the nodes, and d(T(N_s), T(N_d)) denotes the edit distance between the two strings.

The weighted similarity of the nodes is

$$\mathrm{Sim}_w(N_s, N_d) = W_i \cdot \mathrm{Sim}_{node}(N_s, N_d)$$

where W_i is the weight of the i-th layer of the flow chart model.
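The two node-level formulas can be sketched as follows, assuming that the edit distance d is the Levenshtein distance and that two empty labels count as identical (the text does not cover that case). The function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def sim_node(text_s, text_d):
    """Sim_node = 1 - d / max(|T(N_s)|, |T(N_d)|)."""
    longest = max(len(text_s), len(text_d))
    if longest == 0:
        return 1.0  # assumption: two empty labels count as identical
    return 1.0 - edit_distance(text_s, text_d) / longest


def sim_weighted(text_s, text_d, layer_weight):
    """Sim_w = W_i * Sim_node, where layer_weight plays the role of W_i."""
    return layer_weight * sim_node(text_s, text_d)
```

For example, the labels "start" and "stark" differ by one substitution over a length of five characters, so their node similarity is 1 - 1/5.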

S3.3: selecting the node with the best weighted similarity Sim_w(N_s, N_d) as the best matching node;

S3.4: deleting the current node N_d, updating the active node layer, and re-executing step S3.1 until all nodes in F_d have been matched;
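The matching loop of steps S3.1 to S3.4 can be sketched as a greedy procedure. The text does not spell out what the "current active layer" is, so this sketch makes a strong assumption: each node of F_d is compared only with the still-unmatched nodes of F_s that lie in the layer with the same index. The function `greedy_match`, the per-layer lists of node texts and the `exact_match` stand-in similarity are all hypothetical.

```python
def greedy_match(layers_s, layers_d, weights, similarity):
    """Greedily pair each F_d node with the best unmatched F_s node of the same layer.

    layers_s, layers_d: lists of layers, each a list of node texts (layer 0 first).
    weights: one weight W_i per layer of F_d.
    Returns a list of (layer, text_s, text_d, sim_node) tuples.
    """
    matches = []
    for i, nodes_d in enumerate(layers_d):  # smallest layer number first
        candidates = list(layers_s[i]) if i < len(layers_s) else []
        for text_d in nodes_d:
            if not candidates:
                break  # nothing left to match in this layer
            # Weighted similarity W_i * Sim_node for every active candidate.
            best = max(candidates,
                       key=lambda text_s: weights[i] * similarity(text_s, text_d))
            matches.append((i, best, text_d, similarity(best, text_d)))
            candidates.remove(best)  # a node is matched at most once
    return matches


def exact_match(a, b):
    # Stand-in similarity for the demonstration only.
    return 1.0 if a == b else 0.0


layers_s = [["start"], ["read input", "check"]]
layers_d = [["start"], ["check", "read input"]]
matches = greedy_match(layers_s, layers_d, [0.6, 0.4], exact_match)
```

Restricting the comparison to one layer at a time is what keeps the number of node comparisons small relative to matching every node against every other node.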

S3.5: calculating the similarity of each layer of the two flow charts F_s and F_d;

For the flow chart model F_s, the similarity of the i-th layer is expressed as

$$\mathrm{Sim}_{layer}(i) = \begin{cases} \dfrac{\sum \mathrm{Sim}_{node}(N_s, N_d)}{|\mathrm{MatchedNode}|}, & |\mathrm{MatchedNode}| > 0, \\ 0, & |\mathrm{MatchedNode}| = 0. \end{cases}$$

where |MatchedNode| is the number of matched node pairs.
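The layer similarity is the mean node similarity over the matched pairs of that layer, with the empty case defined as zero. A minimal sketch, assuming the node similarities of the matched pairs in layer i have already been collected into a list (the function name `sim_layer` is hypothetical):

```python
def sim_layer(matched_similarities):
    """Average Sim_node over the matched node pairs of one layer; 0 if none matched."""
    if not matched_similarities:
        return 0.0
    return sum(matched_similarities) / len(matched_similarities)
```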

S3.6: calculating the similarity of the two flow charts F_s and F_d.

For two given flow charts F_s and F_d, the similarity is the weighted sum of the similarities of their layers, expressed as

$$\mathrm{Sim}_F(F_s, F_d) = \sum_{i=0}^{n-1} W_i \cdot \mathrm{Sim}_{layer}(i)$$
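The final score is a weighted sum over layers. The sketch below assumes one weight per layer; the weight values are invented for the example, since the text does not state how W_i is chosen, and the function name `sim_flowchart` is hypothetical.

```python
def sim_flowchart(layer_similarities, weights):
    """Sim_F = sum over layers i of W_i * Sim_layer(i)."""
    if len(layer_similarities) != len(weights):
        raise ValueError("one weight is required per layer")
    return sum(w * s for w, s in zip(weights, layer_similarities))


weights = [0.5, 0.3, 0.2]           # hypothetical weights, larger for earlier layers
layer_similarities = [1.0, 0.5, 0.0]
total = sim_flowchart(layer_similarities, weights)
```

With these numbers the contributions are 0.5, 0.15 and 0, so the overall similarity is approximately 0.65.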

Compared with the prior art, the technical scheme of the invention has the following advantages:

(1) the invention divides the hierarchical structure of the flow chart model by using the minimum BFS coding tree method, so that the hierarchy of the flow chart nodes is more obvious, and the measurement of the importance degree of the nodes is facilitated;

(2) according to the method, the weight is given to each layer of nodes according to the influence degree of the nodes on the flow chart similarity measurement, so that the negative influence of the nodes with low correlation degree on the similarity measurement effect is effectively avoided, and the precision of the flow chart similarity measurement method is improved;

(3) simulation experiments show that compared with the existing flow chart similarity method, the method has higher precision and better retrieval efficiency.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

FIG. 1 is a flowchart of a topology-based flowchart similarity method in an embodiment of the present invention;

FIG. 2 shows the results of a comparison experiment between the method of the present invention and the existing flow chart similarity methods: the A Star algorithm, the exhaustive method and the heuristic method.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to FIG. 1, the topological structure-based flow chart similarity method of this technical solution includes the following steps:

Step 1: converting the flow chart into a graph model;

Step 2: constructing a minimum Breadth First (BFS) coding tree for the graph model;

(1) constructing a BFS subscript for the flow chart model;

(2) Establishing BFS codes for the breadth-first search tree of the flow chart model;

(3) Determining a minimum BFS coding tree according to the lexicography order;

For the code set S(G) = { BFSCode(G, T) | T is a BFS tree of graph G } of the flow chart model G, let A = BFSCode(G_A, T_A) = (a₀, a₁, …, a_m) and B = BFSCode(G_B, T_B) = (b₀, b₁, …, b_n) be two elements of S(G). A precedes B in the BFS lexicographic order if there exists k, 0 ≤ k ≤ min(m, n), such that a_r = b_r for all r < k and a_k < b_k. The minimum BFS code minBFSCode(G) is then the smallest element of S(G) under the BFS lexicographic order.

Step 3: assigning weights to the nodes according to the hierarchical relationship of the minimum BFS coding tree, and measuring the similarity of the flow chart models.

(1) For the flow chart to be matched F_d, finding the node N_d with the smallest layer number;

(2) For the flow chart to be searched F_s, calculating, for all nodes N_s of the current active layer, the node similarity Sim_node(N_s, N_d) and the weighted similarity Sim_w(N_s, N_d);

For nodes N_s ∈ F_s and N_d ∈ F_d, the node similarity is

$$\mathrm{Sim}_{node}(N_s, N_d) = 1 - \frac{d(T(N_s), T(N_d))}{\max(|T(N_s)|, |T(N_d)|)}$$

The weighted similarity of the nodes is

$$\mathrm{Sim}_w(N_s, N_d) = W_i \cdot \mathrm{Sim}_{node}(N_s, N_d)$$

where W_i is the weight of the i-th layer of the flow chart model.

(3) Selecting the node with the best weighted similarity Sim_w(N_s, N_d) as the best matching node;

(4) Deleting the current node N_d, updating the active node layer, and re-executing sub-step (1) until all nodes in F_d have been matched;

(5) Calculating the similarity of each layer of the two flow charts F_s and F_d;

For the flow chart model F_s, the similarity of the i-th layer is expressed as

$$\mathrm{Sim}_{layer}(i) = \begin{cases} \dfrac{\sum \mathrm{Sim}_{node}(N_s, N_d)}{|\mathrm{MatchedNode}|}, & |\mathrm{MatchedNode}| > 0, \\ 0, & |\mathrm{MatchedNode}| = 0. \end{cases}$$

where |MatchedNode| is the number of matched node pairs.

(6) Calculating the similarity of the two flow charts F_s and F_d.

The effectiveness and the practicability of the method are verified through simulation experiments.

Simulation content: three representative flow chart similarity methods were selected and tested on the same flow chart set in comparative experiments to verify the effectiveness of the invention. Specifically, these are the A Star method proposed by B. Messmer et al. (reference: B. Messmer, "Efficient Graph Matching Algorithms for Preprocessed Model Graphs", PhD thesis, University of Bern, Switzerland, 1995), and the exhaustive and heuristic methods proposed by Remco M. Dijkman and Marlon Dumas (reference: "Graph Matching Algorithms for Business Process Model Similarity Search", Proceedings of the 7th International Conference on Business Process Management, 2009, 48-63).

Experiment: the flow chart data set used in the experiment comes from an image database (PubMed Central) in the fields of biomedicine and life sciences provided by the United States National Library of Medicine. It comprises 10 classes of flow chart models, each containing 10 flow charts. The experiment compares the accuracy of the A Star method, the exhaustive method, the heuristic method and the method of the invention; the classification results are shown in FIG. 2. The simulation results show that the method of the invention outperforms the A Star method, the exhaustive method and the heuristic method in precision, recall and time efficiency.

The experimental results show that the method of the invention outperforms the existing flow chart similarity methods in precision, recall and time efficiency.

The method for similarity of flowcharts based on topological structures provided by the embodiment of the present invention is described in detail above, and the principle and the implementation of the present invention are explained in detail herein by applying specific examples, and the description of the above embodiments is only used to help understanding the method and the core idea of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims

1. A flow chart similarity method based on a topological structure is characterized by comprising the following steps:

S1: converting the flow chart into a graph model;

S2: constructing a minimum breadth-first search (BFS) coding tree for the graph model;

2. The topology based flow chart similarity method according to claim 1, wherein the step S1 of converting the flow chart into the graph model comprises:

3. The topology based flow chart similarity method according to claim 1, wherein said step S2, constructing a minimum Breadth First (BFS) coding tree comprises:

S2.1: constructing BFS subscripts for the flow chart model;

S2.3: determining the minimum BFS coding tree according to the lexicographic order.

4. The topology-based flowchart similarity method according to claim 3, wherein the step S2.1 of constructing BFS subscripts for the flowchart model specifically comprises:

5. The topology-based flowchart similarity method according to claim 3, wherein the step S2.2 of establishing BFS coding for the breadth-first search tree of the flowchart model further comprises:

for a breadth-first search (BFS) tree T of a given graph G, the corresponding BFS code BFSCode(G, T) is the sequence of edges e_i, i = 0, …, |E| − 1, of the subscripted graph G_T, where, writing each edge as e_k = (μ_k, ν_k), the sequence satisfies the partial order relation <_BT defined by: 1) if μ₁ < ν₁ and ν₁ < μ₂, then e₁ <_BT e₂; 2) if μ₁ = μ₂ and ν₁ < ν₂, then e₁ <_BT e₂; 3) if e₁ <_BT e₂ and e₂ <_BT e₃, then e₁ <_BT e₃.

6. The topology-based flow chart similarity method according to claim 3, wherein the step S2.3 of determining the minimum BFS coding tree according to the lexicographical order specifically comprises:

7. The method according to claim 1, wherein the step S3 of assigning weights to the nodes according to the hierarchical relationship of the minimum BFS coding tree, and measuring the similarity of the flow graph model comprises:

S3.4: deleting the current node N_d, updating the active node layer, and re-executing step S3.1 until all nodes in F_d have been matched;

S3.5: calculating the similarity of each layer of the two flow charts F_s and F_d;

S3.6: calculating the similarity of the two flow charts F_s and F_d.

8. The topology based flow chart similarity method according to claim 7, wherein said step S3.2, calculating node similarity and weighted similarity comprises:

for nodes N_s ∈ F_s and N_d ∈ F_d, the node similarity is

$$\mathrm{Sim}_{node}(N_s, N_d) = 1 - \frac{d(T(N_s), T(N_d))}{\max(|T(N_s)|, |T(N_d)|)}$$

the weighted similarity of the nodes is

$$\mathrm{Sim}_w(N_s, N_d) = W_i \cdot \mathrm{Sim}_{node}(N_s, N_d)$$

where W_i is the weight of the i-th layer of the flow chart model.

9. The topology based flow chart similarity method according to claim 7, wherein said step S3.5, calculating the similarity of each layer in the flow chart, comprises:

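for the flow chart model F_s, the similarity of the i-th layer is expressed as

$$\mathrm{Sim}_{layer}(i) = \begin{cases} \dfrac{\sum \mathrm{Sim}_{node}(N_s, N_d)}{|\mathrm{MatchedNode}|}, & |\mathrm{MatchedNode}| > 0, \\ 0, & |\mathrm{MatchedNode}| = 0. \end{cases}$$

where |MatchedNode| is the number of matched node pairs.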

10. The topology based flow chart similarity method according to claim 7, wherein said step S3.6, calculating the similarity of the two flow charts, comprises:

$$\mathrm{Sim}_F(F_s, F_d) = \sum_{i=0}^{n-1} W_i \cdot \mathrm{Sim}_{layer}(i).$$