CN116628286A

CN116628286A - Graph similarity searching method and device and computer storage medium

Info

Publication number: CN116628286A
Application number: CN202310907506.XA
Authority: CN
Inventors: 郑朝晖; 王健翔; 邱珍
Original assignee: Suzhou Highguard Network Technology Co ltd
Current assignee: Suzhou Highguard Network Technology Co ltd
Priority date: 2023-07-24
Filing date: 2023-07-24
Publication date: 2023-08-22
Anticipated expiration: 2043-07-24
Also published as: CN116628286B

Abstract

The invention relates to a graph similarity search method, a graph similarity search device and a computer storage medium. The method comprises the following steps: providing a query graph and a data graph set comprising a plurality of data graphs; determining an edit distance threshold, determining the difference in the number of vertices and the number of edges between each data graph and the query graph, and filtering out of the data graph set the data graphs whose difference is greater than the edit distance threshold to obtain a pre-candidate data graph set; partitioning the query graph based on the expansion probability to obtain a query graph partition set consisting of non-overlapping partitions; determining the number of unmatched partitions between the query graph and each data graph in the pre-candidate data graph set, and filtering out of the pre-candidate data graph set the data graphs whose number of unmatched partitions is greater than the edit distance threshold to obtain a candidate data graph set; constructing a multi-layer index; dividing the index sequence; compressing the index; and calculating the graph edit distance between the data graph and the query graph, and adding the query graph into the result set and returning the result set when the graph edit distance is less than or equal to the edit distance threshold.

Description

Graph similarity searching method and device and computer storage medium

Technical Field

The present invention relates to the field of graph search technologies, and in particular to a graph similarity search method and apparatus and a computer storage medium.

Background

In recent years, with the rapid development of internet technology, the data volume is exponentially increased, and the realization of efficient storage and retrieval of data is very important. In the big data age, because data entities have respective characteristic attributes and there are complex relationships associated with each other between a large amount of data, the data entities and the relationships between the data are generally abstracted into graph structures. The graph similarity search algorithm has an important meaning in data analysis in the face of large-scale graph data sets, and has been widely used in various fields such as biochemical informatics, computer vision, pattern recognition, data retrieval, and the like.

In a graph dataset, for a given query graph and edit distance threshold, the process of retrieving all data graphs whose distance from the query graph, under a specified graph similarity metric, does not exceed the threshold is referred to as graph similarity search. Current metrics for evaluating the similarity between a query graph and a data graph include graph edit distance, maximum common subgraph and graph alignment. Graph edit distance (GED) is the most commonly used of these because it can be applied to essentially all types of graphs and accurately captures the structural differences between them. Because computing the graph edit distance is NP-hard (NP: non-deterministic polynomial), most existing methods adopt a "filtering-verification" approach to the graph similarity search problem, and their performance depends mainly on the size of the candidate set, the cost of the filtering that produces the candidate set, and the cost of computing the graph edit distance. In the filtering stage, an index construction algorithm and upper and lower bound pruning strategies are generally used to quickly filter out the data graphs that do not satisfy the threshold constraint, yielding a candidate set. However, a lower bound that is too loose results in a candidate set that is too large; a better index structure alleviates this problem but occupies more index space, and most studies do not take this performance bottleneck into account. In the verification stage, the graph edit distance between the query graph and each data graph in the candidate set must be computed exactly, which is computationally expensive. If the filtering stage yields a smaller candidate set, the time consumed by the verification stage is greatly reduced.

Disclosure of Invention

The invention aims to provide a graph similarity searching method, a graph similarity searching device and a computer storage medium, so as to solve the problems of incomplete searching result, large candidate set, large index space occupation, high calculation cost and the like in the prior art.

In a first aspect, the present invention provides a graph similarity searching method, including the steps of:

providing a query graph and a data graph set comprising a plurality of data graphs;

determining an edit distance threshold, determining the difference in the number of vertices and the number of edges between each data graph and the query graph, and filtering out of the data graph set the data graphs whose difference in the number of vertices and the number of edges is greater than the edit distance threshold, to obtain a pre-candidate data graph set;

partitioning the query graph based on the expansion probability to obtain a query graph partition set; the query graph partition set comprises a plurality of non-overlapping partitions; the number of non-overlapping partitions is the sum of the edit distance threshold and the lower bound parameter value;

determining the number of unmatched partitions between the query graph and each data graph in the pre-candidate data graph set, and filtering out data graphs with the number of unmatched partitions greater than an editing distance threshold from the pre-candidate data graph set to obtain a candidate data graph set;

constructing a multi-layer index, wherein each layer index is configured with a sub-candidate query graph set, each sub-candidate query graph set comprises a plurality of non-overlapping partitions, and the plurality of sub-candidate query graph sets form a candidate query graph set; the lower bound parameter value is the number of layers of the index where the non-overlapping partition is located;

dividing an index sequence, calculating element similarity difference values in the index sequence, and setting a compression threshold value of the index sequence;

compressing the index by adopting a partition compression method when the element similarity difference value is larger than a compression threshold value, and compressing the index by adopting a difference compression method when the element similarity difference value is smaller than or equal to the compression threshold value;

and calculating a graph editing distance between the data graph and the query graph, and adding the query graph into the result set and returning the result set when the graph editing distance is smaller than or equal to an editing distance threshold value.

Compared with the prior art, to solve the problem of the loose lower bound caused by fixed partitioning and to improve filtering performance, the invention provides a query graph partitioning algorithm based on expansion probability: by introducing the expansion probability, the matching of each vertex and each edge between the query graph and the data graph is calculated dynamically, and a query graph partition set satisfying the disjointness condition is finally obtained. To reduce the number of graph edit distance computations in the verification stage, to avoid the poor data adaptability of a single filter, and to optimize the filtering process, a hierarchical filtering mechanism based on global coarse filtering and subgraph matching is provided to reduce the size of the candidate data graph set. On this basis, a "zero" index is constructed based on a coding compression algorithm, which reduces the index space occupied and enables efficient querying in limited space.

In a second aspect, the present invention further provides a graph similarity searching apparatus, configured to implement the graph similarity searching method provided in the first aspect, including:

the storage module is used for storing a data graph set, wherein the data graph set comprises a plurality of data graphs;

the user interface is used for loading the query graph;

the first-level filtering module is used for determining an edit distance threshold, determining the difference in the number of vertices and the number of edges between each data graph and the query graph, and filtering out of the data graph set the data graphs whose difference is greater than the edit distance threshold, so as to obtain a pre-candidate data graph set;

the query graph partitioning module is used for partitioning the query graph based on the expansion probability to obtain a query graph partition set, wherein the query graph partition set comprises a plurality of non-overlapping partitions, and the number of the non-overlapping partitions is the sum of the edit distance threshold and a lower bound parameter value;

the secondary filtering module is used for filtering out the data graphs with the number of the unmatched partitions larger than the editing distance threshold value from the pre-candidate data graph set on the basis of determining the number of the unmatched partitions between the query graph and each data graph included in the pre-candidate data graph set so as to obtain a candidate data graph set;

the multi-layer index construction module is used for constructing a multi-layer index, wherein each layer index is configured with a sub-candidate query graph set, each sub-candidate query graph set comprises a plurality of non-overlapping partitions, and the plurality of sub-candidate query graph sets form a candidate query graph set; the lower bound parameter value is the number of layers of the index where the non-overlapping partition is located;

the index sequence dividing module is used for dividing the index sequence, calculating the element similarity difference in the index sequence and setting a compression threshold for the index sequence;

the index compression module compresses the index by adopting a partition compression method when the element similarity difference value is larger than a compression threshold value, and compresses the index by adopting a difference compression method when the element similarity difference value is smaller than or equal to the compression threshold value;

and the graph similarity search module is used for calculating the graph edit distance between the data graph and the query graph, and adding the query graph into the result set and returning the result set when the graph edit distance is less than or equal to the edit distance threshold.

Compared with the prior art, the beneficial effects of the graph similarity searching device provided by the invention are the same as those of the graph similarity searching method provided by the first aspect and/or any implementation manner of the first aspect, and the description is omitted here.

In a third aspect, the present invention also provides a computer storage medium having stored thereon a plurality of instructions adapted to be loaded by a processor and to carry out the method steps provided in the first aspect.

Compared with the prior art, the beneficial effects of the computer storage medium provided by the invention are the same as those of the graph similarity searching method provided by the first aspect and/or any implementation manner of the first aspect, and are not repeated here.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and do not constitute a limitation on the invention. In the drawings:

FIG. 1 is a flowchart of a graph similarity search method provided by an embodiment of the present invention;

FIG. 2 is a detailed process diagram of a graph similarity search method according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a query graph q and data graphs g₁ and g₂;

FIG. 4 is a diagram of the partitioning of query graph q;

FIG. 5 is a schematic diagram of a query graph partitioning process based on expanded probabilities;

FIGS. 6a-6c are graphs comparing the average candidate set sizes of the algorithms on different datasets;

FIG. 7 is a graph comparing index footprint conditions under different compression algorithms;

FIGS. 8a-8c are graphs comparing the query response times of the algorithms on different datasets;

FIG. 9a is a graph of candidate set size comparisons for different data set sizes;

FIG. 9b is a graph of algorithmic query response time versus different data set sizes.

Detailed Description

In order to make the technical problems, technical schemes and beneficial effects to be solved more clear, the invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.

It will be understood that when an element is referred to as being "mounted" or "disposed" on another element, it can be directly on the other element or be indirectly on the other element. When an element is referred to as being "connected to" another element, it can be directly connected to the other element or be indirectly connected to the other element.

Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. In the description of the present invention, the meaning of "a plurality" is two or more, unless explicitly defined otherwise. The meaning of "a number" is one or more than one unless specifically defined otherwise.

In the description of the present invention, it should be understood that the directions or positional relationships indicated by the terms "upper", "lower", "front", "rear", "left", "right", etc., are based on the directions or positional relationships shown in the drawings, are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention.

In the description of the present invention, it should be noted that, unless explicitly specified and limited otherwise, the terms "mounted," "connected," and "connected" are to be construed broadly, and may be either fixedly connected, detachably connected, or integrally connected, for example; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communicated with the inside of two elements or the interaction relationship of the two elements. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art according to the specific circumstances.

Referring to fig. 1 and fig. 2, the graph similarity searching method provided by the embodiment of the invention includes the following steps:

S10. Providing a query graph and a data graph set comprising a plurality of data graphs.

The data graph set may be denoted G, each data graph it contains may be denoted g, and the query graph may be denoted q. Before the data graph set G is provided, a graph database is also provided, in which the original data graphs may be stored in XML or another format. The original data graphs are then preprocessed: information such as the numbers of vertices and edges and the labels is extracted from each original data graph and converted into the input format required by the program.

As an example, a labeled data graph g may be defined as a triple g = (V, E, L), where V denotes the vertex set, E ⊆ V × V denotes the edge set, and L denotes the labeling function. For a data graph g ∈ G, V_g and E_g denote the vertex set and edge set of g, |V_g| and |E_g| denote the numbers of vertices and edges in g, and |G| denotes the number of data graphs in the data graph set G.
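The triple of vertex set, edge set and label function can be illustrated with a minimal Python sketch. The class name `LabeledGraph`, its field names, and the choice to treat edges as undirected are assumptions made for illustration only; the specification does not prescribe a concrete data structure.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class LabeledGraph:
    """A labeled graph (V, E, L); hypothetical structure for illustration."""
    # Label function restricted to vertices: vertex id -> label.
    vertex_labels: dict[int, str] = field(default_factory=dict)
    # Label function restricted to edges: (u, v) -> label.
    edge_labels: dict[tuple[int, int], str] = field(default_factory=dict)

    def add_vertex(self, v: int, label: str) -> None:
        self.vertex_labels[v] = label

    def add_edge(self, u: int, v: int, label: str) -> None:
        if u not in self.vertex_labels or v not in self.vertex_labels:
            raise KeyError("both endpoints must exist before adding an edge")
        # Assumption: edges are undirected, stored under a canonical key.
        self.edge_labels[(min(u, v), max(u, v))] = label

    @property
    def num_vertices(self) -> int:  # |V_g|
        return len(self.vertex_labels)

    @property
    def num_edges(self) -> int:  # |E_g|
        return len(self.edge_labels)


g = LabeledGraph()
for v, lab in enumerate("ABC"):
    g.add_vertex(v, lab)
g.add_edge(0, 1, "a")
g.add_edge(1, 2, "b")
print(g.num_vertices, g.num_edges)  # 3 2
```

Only the vertex and edge counts and the labels are kept, which is the information the preprocessing step is described as extracting.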

S11. Determining an edit distance threshold, determining the difference in the number of vertices and the number of edges between each data graph and the query graph, and filtering out of the data graph set the data graphs whose difference in the number of vertices and the number of edges is greater than the edit distance threshold, to obtain a pre-candidate data graph set.

For ease of understanding, the concept of edit distance (i.e., graph edit distance) will be explained below with specific examples, which are to be understood as illustrative only and not limiting.

The graph edit distance (GED) is the minimum number of edit operations required to convert a data graph g into a query graph q, and is used to measure the structural difference between the two graphs. The graph edit distance between the data graph g and the query graph q is denoted GED(g, q). There are six graph edit operations:

(1) Inserting a new isolated vertex u;

(2) Inserting a new edge e, e= (u, v) between existing vertices u and v;

(3) Deleting an isolated vertex u;

(4) Deleting the edge e, e= (u, v) connecting the vertices u and v;

(5) Modifying the label of the vertex v;

(6) The label of edge e is modified.

As an example, referring to FIG. 3, two data graphs g₁ and g₂ are given, with GED(g₁, g₂) = 5. The edit operations that convert g₁ into g₂ are as follows:

(1) Deleting the edges connecting the vertexes C and D;

(2) Deleting edges connecting the vertexes A and F;

(3) Deleting edges connecting the vertexes C and F;

(4) Deleting the vertex F;

(5) The vertex C is modified to E.

This step may be defined as primary filtering, or coarse filtering: coarse-grained filtering before partitioning. The difference LB₁ in the numbers of vertices and edges between a data graph g ∈ G and the query graph q is calculated, defined as follows:

$$\mathrm{LB}_1(g, q) = \bigl|\,|V_g| - |V_q|\,\bigr| + \bigl|\,|E_g| - |E_q|\,\bigr|$$

where |V_g| and |E_g| denote the numbers of vertices and edges in the data graph g, and |V_q| and |E_q| denote those of the query graph q. If LB₁(g, q) > τ, at least τ + 1 vertex or edge insertions or deletions are required to convert data graph g into query graph q, so GED(g, q) ≥ LB₁(g, q) > τ. The graph edit distance between g and q is then necessarily greater than the threshold τ, so g can be filtered out before partitioning. This yields the pre-candidate data graph set on which expansion-based partitioning is performed, and avoids unnecessary partition matching.
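The coarse filter can be sketched in a few lines of Python. The sketch assumes that LB₁ is the sum of the absolute vertex-count difference and the absolute edge-count difference, which agrees with the worked example given later in the description (6 vertices and 7 edges for q, 5 and 7 for g₁, 4 and 4 for g₂). The function names `lb1` and `coarse_filter` and the tuple representation are illustrative assumptions.

```python
def lb1(num_v_g, num_e_g, num_v_q, num_e_q):
    """Size-based lower bound on GED: vertex-count gap plus edge-count gap."""
    return abs(num_v_g - num_v_q) + abs(num_e_g - num_e_q)


def coarse_filter(data_graphs, query, tau):
    """Keep the data graphs whose size-based lower bound is at most tau.

    data_graphs: dict mapping a graph name to (|V_g|, |E_g|)
    query:       (|V_q|, |E_q|)
    """
    q_v, q_e = query
    return {
        name: (v, e)
        for name, (v, e) in data_graphs.items()
        if lb1(v, e, q_v, q_e) <= tau
    }


# Counts taken from the worked example in the description.
query = (6, 7)
data_graphs = {"g1": (5, 7), "g2": (4, 4)}
pre_candidates = coarse_filter(data_graphs, query, tau=2)
print(pre_candidates)  # {'g1': (5, 7)}
```

Because the bound uses only two integers per graph, this pass costs constant time per data graph and needs no structural comparison.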

S12, partitioning the query graph based on the expansion probability to obtain a query graph partition set, wherein the query graph partition set comprises tau+k non-overlapping partitions, k is a lower bound parameter value, and the lower bound parameter value is the layer number of indexes where the non-overlapping partitions are located.

Before describing this step in detail, the concept of non-overlapping partitions used in the partitioning process is explained with reference to a data graph g; the following example is illustrative only and not limiting.

A graph is divided, according to a specific rule, into mutually exclusive, independent substructures. The embodiment of the invention provides a Z-Index algorithm in which, for a given data graph g, a partition result P(g) = {p₁, p₂, ..., p_i} is one that satisfies the following conditions:

(1) every p_j in P(g) is a subgraph of g;

(2) p_j ∩ p_k = ∅ for any j ≠ k;

(3) p₁ ∪ p₂ ∪ ... ∪ p_i = g.

In the partitioning process of the Z-Index algorithm, to solve the problem of the loose lower bound caused by fixed partitioning and to improve filtering performance, a query graph partitioning algorithm based on expansion probability is provided: by introducing an expansion probability value, the matching of each vertex and each edge in the graph is calculated dynamically, and a query graph partition set P(q) satisfying the disjointness condition is finally obtained.

As an example, referring to FIG. 4, the query graph q is divided into 4 non-overlapping partitions, i.e., P(q) = {p₁, p₂, p₃, p₄}. No two partitions overlap, and the union of all the partitions is the complete query graph q.

A non-overlapping partition p_i is expanded on the basis of its size and the frequency with which vertex and edge labels occur in it. The size of p_i is its total number of vertices and edges, i.e. |p_i| = |V_{p_i}| + |E_{p_i}|. The vertex label frequency is the number of occurrences of the label of each vertex over all vertices of p_i, i.e. Σ_{v ∈ V_{p_i}} f(L_v); similarly, the edge label frequency is Σ_{e ∈ E_{p_i}} f(L_e). The larger a non-overlapping partition p_i of the query graph q is, the more likely it is to be affected by edit operations and the less easily it is matched. Conversely, the higher the frequency of the vertex and edge labels in p_i, the greater the probability that they appear in the data graph g and the more easily p_i is matched. Thus the expansion probability value s(p_i) makes it possible to judge quickly whether p_i matches the data graph g, which greatly improves the filtering effect. The expansion probability value s(p_i) of a non-overlapping partition p_i is defined as follows:

where f(L_v) denotes the number of vertices in the partition whose vertex label is L_v, and f(L_e) is defined likewise for edges. The larger the value of s(p_i), the more easily partition p_i is matched, i.e. the greater the probability that the data graph g is similar to the query graph q. For a vertex v, its contribution value Δp_i when added to partition p_i is defined as follows:

$$\Delta p_i = \bigl|\, s(p_i \cup \{v\}) - s(p_i) \,\bigr|$$

where p_i is a non-overlapping partition, v is a neighbor vertex to be added to p_i, and s(p_i) is the expansion probability value of that partition.

Partitioning the query graph based on the expansion probability includes:

S120. Providing a query graph q; see the query graph q shown in FIG. 3 and FIG. 4.

S121. Assigning the vertices included in the query graph q, which specifically comprises:

S1210. Randomly selecting τ + k initial vertices and expanding them into τ + k non-overlapping initial partitions;

S1211. Tentatively adding each neighbor vertex of the initial vertices to the non-overlapping initial partitions, calculating the contribution value Δp_i of the neighbor vertex to each non-overlapping initial partition, and adding the neighbor vertex to the non-overlapping initial partition with the largest contribution value Δp_i; when the contribution values Δp_i for two non-overlapping initial partitions are equal, the neighbor vertex is added at random to the smaller of the non-overlapping initial partitions;

S1212. Repeating the previous step, calculating the contribution value Δp_i of the neighbor vertices of each non-overlapping initial partition to each partition, until all neighbor vertices have been assigned.

S122. Assigning the cross-region edges included in the query graph q. After all neighbor vertices have been assigned, each cross-region edge is tentatively assigned to the non-overlapping initial partitions in which its endpoints are located, the contribution value Δp_i is calculated, and the edge is assigned to the non-overlapping initial partition with the largest contribution value Δp_i, thereby obtaining the non-overlapping partitions.
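The vertex assignment of steps S1210 to S1212 can be sketched as a greedy loop. This is only a sketch under stated assumptions: the formula for s(p_i) is not reproduced here, so the `score` argument is a hypothetical stand-in supplied by the caller, and the toy score `1 / len(p)` used in the example is not the expansion probability of the embodiment. The seeds are passed in rather than chosen at random, the order in which frontier vertices are visited (smallest first) is an assumption, and the edge assignment of step S122 is omitted.

```python
import random


def partition_vertices(adj, seeds, score, rng=None):
    """Greedily grow one vertex partition per seed vertex.

    adj:   dict mapping each vertex to the set of its neighbors
    seeds: the tau + k initial vertices
    score: callable(frozenset of vertices) -> float, stand-in for s(p_i)
    """
    rng = rng or random.Random(0)
    parts = [{s} for s in seeds]
    assigned = set(seeds)
    while True:
        # Unassigned vertices adjacent to at least one partition.
        frontier = {v for p in parts for u in p for v in adj[u]} - assigned
        if not frontier:
            break
        v = min(frontier)  # assumption: deterministic visiting order
        # Only partitions that v is adjacent to are candidates.
        cands = [i for i, p in enumerate(parts) if adj[v] & p]

        def gain(i):
            # Contribution value: |s(p_i U {v}) - s(p_i)|.
            grown = frozenset(parts[i] | {v})
            return abs(score(grown) - score(frozenset(parts[i])))

        best = max(gain(i) for i in cands)
        tied = [i for i in cands if gain(i) == best]
        # Tie-break: prefer the smaller partition, then pick at random.
        smallest = min(len(parts[i]) for i in tied)
        choice = rng.choice([i for i in tied if len(parts[i]) == smallest])
        parts[choice].add(v)
        assigned.add(v)
    return parts


# Path graph a-b-c-d-e with seeds at both ends and a toy score.
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"},
       "d": {"c", "e"}, "e": {"d"}}
parts = partition_vertices(adj, ["a", "e"], lambda p: 1 / len(p))
print(parts)
```

The resulting vertex sets are pairwise disjoint by construction, since each vertex is added to exactly one partition. Vertices that are not reachable from any seed stay unassigned in this sketch.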

As an example, referring to FIG. 5, let τ = 2 and k = 1, select vertices B, D and F as the initial vertices, and denote the partitions in which they are located as p₁, p₂ and p₃, respectively. It is calculated that s(p₁) = s(p₂) = s(p₃).

Vertex assignment is performed first. The unassigned neighbor vertices adjacent to the partitions are A and B. Since A is adjacent to B and F, an attempt is made to assign neighbor vertex A to p₁ and p₃; the calculation gives Δp₁ = 0.5 and Δp₃ = 0.5, so neighbor vertex A is added at random to partition p₁. Since neighbor vertex B is adjacent to A, D and F, it can be assigned to p₁, p₂ or p₃; the calculation gives Δp₁ = 0.03, Δp₂ = 0.17 and Δp₃ = 0.17, so neighbor vertex B is added to partition p₂. The unassigned vertex set is now {A}. Since neighbor vertex A is adjacent only to p₂, it is added to p₂. At this point p₁ = {B, A, a(B,A)}, p₂ = {D, B, A, b(D,B), d(B,A)} and p₃ = {F}, and vertex assignment ends.

Cross-region edges are then assigned. The cross-region edge set is {a(B,A), a(B,D), b(A,B), c(A,F), d(B,F)}. For edge a(B,A), the contribution values for its two candidate partitions are calculated as 0.75 − 0.67 = 0.08 and 0.4 − 0.33 = 0.07; because the contribution value for p₁ is the larger, edge a(B,A) is added to partition p₁. The other edges are added to their corresponding partitions in the same way. The final partition information is:

p₁ = {B, A, a(B,A), a(B,D), b(A,B)}, p₂ = {D, B, A, b(D,B), d(B,A)}, p₃ = {F, d(F,B), c(F,A)}.

S13. Determining the number LB₂(g, q) of unmatched partitions between the query graph and each data graph included in the pre-candidate data graph set, and filtering out of the pre-candidate data graph set the data graphs with LB₂(g, q) > τ to obtain a candidate data graph set.

This step may be defined as secondary filtering, or fine filtering: while the pre-candidate data graph set is processed against the partitions, the number of unmatched partitions is calculated to determine whether a data graph g can be filtered out. In the process of partitioning the query graph q, the number of unmatched partitions between the query graph q and each data graph g is calculated and recorded as LB₂(g, q). If the number of unmatched partitions for data graph g is greater than the edit distance threshold τ, i.e. LB₂(g, q) > τ, then g cannot be within the edit distance constraint and can be safely filtered out. By the pigeonhole principle, each unmatched partition needs at least one edit operation to reach a matched state, so if the number of unmatched partitions is greater than τ, at least τ + 1 operations are needed and the edit distance threshold constraint is no longer satisfied. As shown in FIG. 4, p₁ ⊆ g₁ and p₄ ⊆ g₁; that is, p₁ and p₄ are matching partitions of g₁ in FIG. 3, whereas p₂ and p₃ are unmatched partitions. The number of unmatched partitions is therefore 2 ≤ τ, so data graph g₁ may be a similarity search result for the query graph q and is put into the candidate set for the subsequent graph edit distance (GED) verification.
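The pigeonhole argument translates directly into a counting filter, sketched below. The matching test is passed in as a callable because the way a partition is matched against a data graph is described separately; the `label_subset` predicate and the label-set representation of partitions and graphs are toy assumptions chosen only to make the example runnable.

```python
def count_unmatched(partitions, graph, matches):
    """LB2(g, q): number of query partitions that do not match graph g."""
    return sum(1 for p in partitions if not matches(p, graph))


def fine_filter(partitions, pre_candidates, tau, matches):
    """Keep the graphs whose number of unmatched partitions is at most tau."""
    return {
        name: g
        for name, g in pre_candidates.items()
        if count_unmatched(partitions, g, matches) <= tau
    }


def label_subset(partition, graph):
    """Toy stand-in: a partition matches if all its labels occur in g."""
    return partition <= graph


# Four non-overlapping partitions, reduced here to sets of labels.
partitions = [{"A", "B"}, {"C"}, {"D"}, {"E", "F"}]
pre_candidates = {"g1": {"A", "B", "E", "F"}, "g2": {"A"}}
candidates = fine_filter(partitions, pre_candidates, tau=2,
                         matches=label_subset)
print(sorted(candidates))  # ['g1']
```

In the toy data, g1 leaves two partitions unmatched and is kept for verification with τ = 2, while g2 leaves all four unmatched and is filtered out.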

In summary, after the hierarchical filtering mechanism, the data graphs satisfying LB₁(g, q) > τ or LB₂(g, q) > τ have been filtered out, so a more compact candidate set is obtained and the number of graph edit distance computations in the verification stage is greatly reduced.

S14, constructing a multi-layer index, wherein each layer index is configured with a sub-candidate query graph set, each sub-candidate query graph set comprises a plurality of non-overlapping partitions, and the plurality of sub-candidate query graph sets form a candidate query graph set; the lower bound parameter value is the number of layers of the index where the non-overlapping partition is located.

As an example, an L-layer "zero" index structure is built for the query graph q as follows. In the i-th layer (1 ≤ i ≤ L), the query graph q is divided into τ + k partitions by the expansion-probability-based query graph partitioning algorithm, and the candidate set C_i corresponding to that layer is obtained through the hierarchical filtering mechanism; the final candidate set is formed from the candidate sets C_i of the layers. For each partition p of the query graph q, an inverted index table I(p) is maintained that stores all data graphs g containing that partition, the data graphs g together constituting the data set G. All data graphs g in G that contain partition p can thus be found quickly. When judging whether a data graph g matches a partition p, the Pars algorithm of the prior art must frequently perform subgraph isomorphism computations. To avoid this complex subgraph isomorphism computation, the Z-Index provided by the embodiment of the invention records, during the partitioning of the query graph q, the frequencies of the vertex and edge labels of the data graph g and of the partition p, denoted N(g) and N(p). If partition p is a matching partition of data graph g, this is indicated by N(g) ≤ N(p); otherwise p is regarded as an unmatched partition. According to the above description, the "zero" index structure in the i-th layer is defined as:
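Separately from that definition, the inverted index table I(p) described in this step can be illustrated with a minimal sketch that replaces subgraph isomorphism with a label-frequency comparison. The direction of the comparison is an assumption of this sketch: a partition is treated as a possible match when every label it needs occurs at least as often in the data graph, which is a necessary condition for the partition to occur in that graph. The names `may_contain` and `build_inverted_index` are hypothetical.

```python
from collections import Counter


def may_contain(n_g, n_p):
    """True if graph g has at least the label counts that partition p needs.

    Assumption of this sketch: the frequency test is a per-label
    comparison of the two label multisets.
    """
    return all(n_g[label] >= count for label, count in n_p.items())


def build_inverted_index(partitions, data_graphs):
    """Build I(p): partition id -> ids of data graphs that may contain p.

    partitions:  dict mapping a partition id to its label Counter N(p)
    data_graphs: dict mapping a graph id to its label Counter N(g)
    """
    return {
        pid: [gid for gid, n_g in data_graphs.items()
              if may_contain(n_g, n_p)]
        for pid, n_p in partitions.items()
    }


# Labels are single characters here; Counter("AABCD") counts A twice.
data_graphs = {"g1": Counter("AABCD"), "g2": Counter("ABF")}
partitions = {"p1": Counter("AB"), "p2": Counter("AA"), "p3": Counter("F")}
index = build_inverted_index(partitions, data_graphs)
print(index)  # {'p1': ['g1', 'g2'], 'p2': ['g1'], 'p3': ['g2']}
```

A frequency test of this kind can only rule matches out; graphs that pass it still go through exact verification, so no correct result is lost by using it in place of subgraph isomorphism.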

S15. Dividing the index sequence, calculating the element similarity difference s_gap in the index sequence, and setting a compression threshold L for the index sequence.

As an example, two efficient compressed storage schemes based on coding algorithms are proposed for constructing the "zero" index sequence ZIndex. First, the element similarity difference s_gap of the index sequence s is calculated and compared with the sequence compression threshold L. If s_gap is greater than the compression threshold L, the partition compression algorithm is used; otherwise, the difference compression algorithm is used. s_gap is defined as:

；

where |s| represents the length of the sequence s.

S16, compressing the index, when S _gap >And when L, compressing the index by adopting a partition compression method, and when S _gap And when L is less than or equal to L, compressing the index by adopting a difference compression method.

As an example, a partition compression algorithm, for index sequences with non-uniform data distribution, unified compression may reduce the compression effect. To solve this problem, the embodiment of the invention provides an index partition compression algorithm based on sequence division, which selects a division length d according to the sparse condition of data distribution and divides the division length d into a plurality of piecesThe sub-sequences are compressed separately. For example, for a given sequence s= {1,2,3,4,5,125,130,137,144,158}, if d=5 is set, the sequence s may be divided into sub-sequences s ₁ Sum s ₂ Wherein s is ₁ ={1,2,3,4,5}，s ₂ = {125,130,137,144,158}, then the problem is converted into a subsequence s ₁ Sum s ₂ And (5) performing compression treatment.

As a second example, the difference compression algorithm keeps the first element of each division unchanged and then computes in turn the difference between each pair of adjacent elements, i.e. s₁ → s₁′ = {1,1,1,1,1} and s₂ → s₂′ = {125,5,7,7,14}. Finally, the processed sequences s₁′ and s₂′ are compressed using a coding algorithm.
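The sequence division and difference steps described above can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment; the function names `split_sequence`, `delta_encode` and `delta_decode` are hypothetical, and fixed-length division by d is assumed.

```python
from typing import List


def split_sequence(seq: List[int], d: int) -> List[List[int]]:
    """Divide seq into consecutive sub-sequences of length d (the last may be shorter)."""
    if d <= 0:
        raise ValueError("division length d must be positive")
    return [seq[i:i + d] for i in range(0, len(seq), d)]


def delta_encode(sub: List[int]) -> List[int]:
    """Keep the first element and replace every other element by its
    difference from the preceding element."""
    if not sub:
        return []
    return [sub[0]] + [b - a for a, b in zip(sub, sub[1:])]


def delta_decode(deltas: List[int]) -> List[int]:
    """Invert delta_encode by taking running sums."""
    out: List[int] = []
    total = 0
    for i, x in enumerate(deltas):
        total = x if i == 0 else total + x
        out.append(total)
    return out


# The sequence from the text, divided with d = 5.
s = [1, 2, 3, 4, 5, 125, 130, 137, 144, 158]
s1, s2 = split_sequence(s, 5)
s1_prime = delta_encode(s1)  # [1, 1, 1, 1, 1]
s2_prime = delta_encode(s2)  # [125, 5, 7, 7, 14]
```

Because the differences within each sub-sequence are small, the values handed to the coding algorithm are smaller than the raw identifiers, which is what makes the subsequent variable-length coding effective.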

Common encoding algorithms include unary coding, Golomb coding, and exponential-Golomb coding. Different coding algorithms compress(·) yield different "zero" index sequences ZIndex, each obtained by applying compress(·) to the processed index sequence.
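As an illustration of two of the coding algorithms named above, the following sketch encodes non-negative integers as bit strings. It assumes the common conventions (unary: n ones followed by a terminating zero; order-k exponential-Golomb: the binary form of n + 2^k preceded by leading zeros); the embodiment may use different conventions, and the function names are hypothetical.

```python
from typing import Iterable


def unary_encode(n: int) -> str:
    """Unary code of n >= 0: n ones followed by a terminating zero."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return "1" * n + "0"


def exp_golomb_encode(n: int, k: int = 1) -> str:
    """Order-k exponential-Golomb code of n >= 0.

    Write x = n + 2**k in binary and prefix it with (bit length of x - k - 1) zeros.
    """
    if n < 0 or k < 0:
        raise ValueError("n and k must be non-negative")
    bits = bin(n + (1 << k))[2:]
    return "0" * (len(bits) - k - 1) + bits


def encode_sequence(values: Iterable[int], k: int = 1) -> str:
    """Concatenate the order-k exponential-Golomb codes of all values."""
    return "".join(exp_golomb_encode(v, k) for v in values)


# Encode the difference sequence s2' = {125, 5, 7, 7, 14} from the text
# with the first-order code.
encoded = encode_sequence([125, 5, 7, 7, 14], k=1)
```

Small values receive short codes under both schemes, but the exponential-Golomb code grows only logarithmically with the value, whereas the unary code grows linearly.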

S17, computing GED(g, q), and when GED(g, q) ≤ τ, adding the query graph into the result set and returning the result set, wherein GED(g, q) is the graph editing distance between the data graph and the query graph.

As an example, assume that the query graph q and its non-overlapping partition result are as shown in FIG. 4, that the data graphs g₁ and g₂ are as shown in FIG. 3, and that the edit distance threshold is τ = 2. In the first-level filtering stage, LB₁(q, g₁) = (6−5) + (7−7) = 1 < τ and LB₁(q, g₂) = (6−4) + (7−4) = 5 > τ are calculated, so g₂ is filtered out. In the secondary filtering stage, the number of unmatched partitions between q and g₁ is calculated to be 1 < τ, so g₁ is added to the candidate set C = {g₁}. In the verification stage, GED(q, g₁) = 3 > τ, so the final result set R is empty; that is, neither data graph g₁ nor data graph g₂ is within the threshold constraint of query graph q.
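The worked example above can be replayed with a short sketch of the filter-verification flow. It is illustrative only: the vertex and edge counts are read off the LB₁ arithmetic in the text, the unmatched-partition count and the GED value for g₁ are taken from the text as given inputs rather than computed, absolute differences are assumed for LB₁, and all names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Size = Tuple[int, int]  # (number of vertices, number of edges)


def lb1(q: Size, g: Size) -> int:
    """First-level lower bound: vertex-count difference plus edge-count
    difference (absolute values are an assumption)."""
    return abs(q[0] - g[0]) + abs(q[1] - g[1])


def filter_and_verify(
    q: Size,
    graphs: Dict[str, Size],
    tau: int,
    unmatched: Callable[[str], int],
    ged: Callable[[str], int],
) -> Tuple[List[str], List[str], List[str]]:
    """Return (pre-candidate set, candidate set, result set) as lists of graph names."""
    # First-level filtering: drop graphs whose size difference exceeds tau.
    pre = [name for name, size in graphs.items() if lb1(q, size) <= tau]
    # Secondary filtering: drop graphs with more than tau unmatched partitions.
    cand = [name for name in pre if unmatched(name) <= tau]
    # Verification: keep graphs whose graph edit distance is within tau.
    result = [name for name in cand if ged(name) <= tau]
    return pre, cand, result


q = (6, 7)
graphs = {"g1": (5, 7), "g2": (4, 4)}
unmatched_counts = {"g1": 1}  # value stated in the text
ged_values = {"g1": 3}        # value stated in the text

pre, cand, result = filter_and_verify(
    q, graphs, 2, unmatched_counts.__getitem__, ged_values.__getitem__
)
```

Only g₁ reaches the secondary filter and the verification stage, so the expensive GED computation is performed once instead of twice; g₂ is discarded using nothing more than its vertex and edge counts.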

The effectiveness of the graph similarity searching method provided by the embodiment of the invention is verified from two aspects of time complexity and space complexity. The method comprises the following steps:

First, before the program starts, the vertex and edge label frequencies of all graphs in the database are computed in advance; the time complexity of this process is O(|G|). In the filtering process, filtering is performed using the precomputed vertex and edge label frequencies, with time complexity O(|G| × |Q|). In the graph partitioning phase, the query graph is partitioned and the partition sizes are recorded, and then s(p) of each partition p is calculated, with time complexity O(|V_q| + |E_q|). Finally, an L-layer "zero" index is built and compressed, and the exact value of the graph editing distance is computed. The time complexity of the algorithm is therefore O(|Q|L × (O(|V_q| + |E_q|) + O(|G| + |Q|))).

Since the query graph must be partitioned and indexed, the space complexity of the algorithm is O(L × |P| × |Q|), where |P| represents the number of partitions.

The embodiment of the invention provides Z-Index, a graph similarity search algorithm based on query graph partitioning that combines multi-level filtering with low index space occupation. First, an expansion-probability-based query graph partitioning algorithm is provided: an expansion probability value, i.e. the likelihood that a vertex or edge is assigned to the current partition, is introduced for each partition, so that the complex structural partitioning process is converted into a simple numerical comparison. Based on this value, whether a partition matches a data graph can be judged more accurately, which improves the filtering effect. Second, a hierarchical filtering mechanism is proposed to reduce the candidate set size. To avoid unnecessary partition matching and index construction, a pre-candidate set is obtained by coarse-grained filtering before the query graph is partitioned, and the candidate set is then further reduced during partitioning by filtering based on a sub-graph matching method, which solves the problem of an excessively large candidate set. Third, the query graph is partitioned and indexes are built; an element similarity difference is introduced for each index sequence to represent the sparseness of its data distribution, and on this basis two coding compression algorithms, partition compression and difference compression, are provided to build the "zero" index structure. This greatly improves the filtering speed while reducing the index space, and relieves the space pressure of constructing indexes over massive data graphs.

Prior-art graph similarity search algorithms are implemented on a filter-verification computing framework, which is divided into a filtering stage and a verification stage. In the filtering stage, an index construction algorithm and upper and lower bound pruning strategies are generally used to quickly filter out the data graphs that do not satisfy the threshold constraint, yielding a candidate set. However, an overly loose filtering lower bound results in an overly large candidate set. Designing a better index structure alleviates this problem but leads to a larger index space footprint, and most studies do not take this performance bottleneck into account. In the verification stage, the graph editing distance between the query graph and each data graph in the candidate set must be computed exactly, which is computationally expensive. If the filtering stage yields a smaller candidate set, the time consumed by the verification stage is greatly reduced.

Experimental results and analysis:

1. data set

Three data sets are used in the experiments to verify the performance of the method provided by the embodiment of the invention, and 100 data graphs are randomly selected from each data set to form the query graph set Q. Statistical information is shown in Table 1, and each data set is described in detail below.

(1) AIDS: AIDS is an antiviral screening dataset from the NCI/NIH Developmental Therapeutics Program, used for the discovery of compounds active against the AIDS virus. The dataset consists of 42687 compounds.

(2) IMDB-MULTI: IMDB-MULTI comes from an interactive data and network data repository with real-time visual analysis functions; 1500 data graphs are selected for the experiments.

(3) GRAPHGEN: GRAPHGEN is a synthetic graph generator that can be used to create a large number of labeled data graphs; 10000 data graphs are generated with it.

Table 1 statistics of three data sets

2. Experimental environment

The experiments were run on an Intel(R) Core(TM) i7-10700 CPU @ 2.90GHz with 16GB of memory, under the Microsoft Windows 10 64-bit operating system. The development environment is Visual Studio 2019, and the development language is C++.

3. Evaluation index

In the experiments, the edit distance threshold was varied over τ = {1,2,3,4,5,6}, and experimental evaluation was performed from the following four aspects:

(1) Filtering capability analysis: the effectiveness and accuracy of the hierarchical filtering mechanism are evaluated using the average candidate set size |C|, the accuracy acc, and the recall rate recall, defined as follows:

where |Q| represents the size of the query set, |C_q| represents the candidate set size of query graph q, TP = |C ∩ R|, TP + TN represents the number of data graphs judged correctly, and FN represents the number of data graphs filtered out incorrectly. In essence, the fewer data graphs pass the filtering conditions, i.e. the smaller |C| is, the better the filtering performance.
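The defining formulas for these indexes are not reproduced in this text. The sketch below therefore assumes the textbook definitions (mean of |C_q| over the query set, (TP + TN) over all judged graphs, and TP / (TP + FN)), which may differ from the formulas of the embodiment; FP and all function names are assumptions introduced for illustration.

```python
from typing import Sequence


def average_candidate_size(candidate_sizes: Sequence[int]) -> float:
    """Mean of |C_q| over all query graphs q in the query set Q."""
    if not candidate_sizes:
        raise ValueError("the query set must not be empty")
    return sum(candidate_sizes) / len(candidate_sizes)


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Share of data graphs judged correctly (assumed textbook definition)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no data graphs were judged")
    return (tp + tn) / total


def recall(tp: int, fn: int) -> float:
    """Share of true results that survive filtering (assumed textbook definition)."""
    if tp + fn == 0:
        raise ValueError("there are no true results")
    return tp / (tp + fn)


# Made-up counts for illustration only; they are not experimental data.
avg_c = average_candidate_size([10, 20, 30])
acc = accuracy(tp=3, tn=5, fp=1, fn=1)
rec = recall(tp=3, fn=1)
```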

(2) "zero" index construction cost: including index build time and index size.

(3) Query response time T: the time the system takes to respond to a query request, one of the important indexes of a graph similarity search algorithm, defined as follows:

T = T_pindex + T_filter + T_ged

where T_pindex is the time cost of partitioning the query graph based on the expansion probability and constructing the "zero" index, T_filter is the time taken by hierarchical filtering to generate the candidate set, and T_ged is the computation time of the graph edit distance.

(4) Scalability: the change trend of the candidate set size and the query response time of the algorithm on different scale data sets is discussed to illustrate the scalability of the algorithm.

4. Experimental analysis

To better verify the performance of the method provided by the embodiment of the invention (the Z-Index algorithm), the existing mainstream algorithms Pars and ML-Partition are selected as comparison algorithms, and experimental verification is performed on the three data sets of different scales. To ensure the fairness of the experiments and rule out chance factors, 300 query calculations are executed for each evaluation index, and the average value is taken as the final experimental result.

4.1 Filter Capacity analysis

To verify the validity and accuracy of the hierarchical filtering mechanism, the candidate set size |C|, the accuracy acc, and the recall rate recall are used as evaluation indexes.

First, the validity is verified by testing the change of |C| under different thresholds, as shown in FIGS. 6a-6c, where the abscissa represents the threshold and the ordinate represents the number of data graphs in the candidate set. The 300 query calculations show that, on all data sets, the candidate set grows gradually as the threshold increases, sometimes approaching the entire data set. FIGS. 6a-6c show that, on the different data sets, the Z-Index algorithm yields the smallest candidate set, about 50% of the size of the ML-Partition candidate set; ML-Partition follows, and Pars yields the largest candidate set. Moreover, as the edit distance threshold increases, the candidate set of the Z-Index algorithm grows markedly more slowly than those of ML-Partition and Pars. This shows that the Z-Index algorithm greatly reduces the candidate set, reduces the number of graph editing distance computations in the verification stage, and avoids many invalid graph editing distance computations, which verifies the effectiveness of its hierarchical filtering mechanism.

Second, the accuracy and the recall are verified. In this part of the experiment the threshold is fixed at τ = 3. The candidate set sizes |C| generated by the Pars, ML-Partition and Z-Index algorithms are counted, the exact editing distance is then computed for the graphs in each candidate set, the number of data graphs not within the editing distance threshold is counted, and the accuracy acc and recall rate recall of each filtering algorithm are calculated. As shown in Table 2, the acc and recall obtained by the Z-Index algorithm are slightly higher than those of Pars and ML-Partition and reach up to 0.945, which indicates that the Z-Index algorithm achieves a better filtering effect while maintaining accuracy.

Table 2 Accuracy acc and recall rate recall (%)

4.2 "zero" index construction cost analysis

The index construction cost of the Z-Index algorithm is analyzed mainly from two aspects: index space occupation and index construction time. With the edit distance threshold τ = 3, the space occupation of the "zero" index under different coding algorithms was tested; the experimental results are shown in FIG. 7, where N represents the index size before encoding, and U and E represent the index sizes under the unary coding algorithm and the first-order exponential-Golomb coding algorithm, respectively. The experiments show that the first-order exponential-Golomb coding algorithm achieves a good compression effect on the different data sets, compressing the index to about 3% of its original size, and thus better reflects the performance of the "zero" index. The experiments therefore use the first-order exponential-Golomb coding algorithm as the coding algorithm for Z-Index.

As shown in Table 3, the experiments measured the average index size and index construction time of the three algorithms on the different data sets with the threshold τ = 3. The data in the table show that the Z-Index algorithm is significantly better than ML-Partition and Pars in both index size and index construction time. The index of Z-Index is much smaller than the indexes constructed by the ML-Partition and Pars algorithms, reducing the index space used by about an order of magnitude. The reasons are as follows: on the one hand, the number of query graphs is far smaller than the number of graphs in the graph data set, so partitioning the query graphs occupies less index space; on the other hand, the index compression algorithms further reduce the index size. In terms of index construction time, the Z-Index algorithm is the fastest because, unlike the ML-Partition and Pars algorithms, it does not require complex sub-graph isomorphism computations.

In summary, the Z-Index algorithm optimizes the filtering stage by partitioning the query graph and constructing the zero Index, so that the low Index space occupation of the algorithm is realized, the Index construction time is reduced, and better performance is shown.

Table 3 index space size (MB) and index build time(s) data statistics for three algorithms under different data sets

4.3 query response time analysis

Query response time is an important factor in measuring the performance of graph similarity search algorithms. In the Z-Index algorithm, the query response time includes the time for partitioning the query graph based on the expansion probability, the "zero" index construction time, the time for hierarchical filtering to produce the candidate set, and the computation time of the graph edit distance. According to the experimental results in sections 4.1 and 4.2, the Z-Index algorithm takes the shortest time to construct the index and yields the smallest candidate set, so its graph edit distance computation time in the verification stage is lower than that of ML-Partition and Pars. As shown in FIGS. 8a-8c, the query response time of the Z-Index algorithm is lower than that of ML-Partition and Pars under the different thresholds τ. The advantage of the Z-Index algorithm is greatest when the threshold is small; as the threshold increases, the improvement gradually levels off, and the improvement in query efficiency ranges from 9.1% to 78.8%. Moreover, the Z-Index algorithm achieves the shortest query response time on the sparse graph dataset AIDS, on IMDB-MULTI, and on the dense graph dataset GRAPHGEN, which verifies that the algorithm is applicable to various graph datasets.

4.4 scalability analysis

To test and compare the scalability of the Z-Index, Pars, and ML-Partition algorithms, this part of the experiment sets the edit distance threshold to τ = 3. FIG. 9a records the candidate set sizes of the three algorithms on random data sets of 500K to 20M, where the abscissa represents the data set size and the ordinate represents the number of data graphs in the candidate set. As can be seen from FIG. 9a, the candidate set sizes of Pars and ML-Partition increase significantly as the data set size increases, which results in a large number of graph edit distance computations in the verification stage. For the Pars algorithm in particular, when the data set size exceeds 5M the program runs into memory errors and cannot produce a correct result, so FIG. 9a gives experimental data for Pars only on 500K and 5M. The candidate set generated by Z-Index grows slowly as the data set size increases; in particular, on the large-scale data sets of 15M to 20M, the candidate set of Z-Index grows markedly more slowly than that of ML-Partition, which shows that the Z-Index algorithm scales to larger data sets while retaining a good filtering effect.

FIG. 9b compares the query response times of Z-Index, Pars, and ML-Partition on random data sets of 500K to 20M, where Pars is valid only on data sets of no more than 5M. As can be seen from FIG. 9b, both the query response time of the Z-Index algorithm and its rate of growth with the data set size are always lower than those of Pars and ML-Partition, indicating that the increase in data scale has less effect on the Z-Index algorithm and that Z-Index retains high query efficiency even on large data sets. The above experiments verify that the Z-Index algorithm has good scalability.

a user interface for inserting a query graph;

a first-level filtering module for determining the editing distance threshold τ and the difference LB₁ between the vertex number and the edge number of each data graph and the query graph, and on that basis filtering out the data graphs with LB₁ > τ from the data graph set to obtain a pre-candidate data graph set;

the query graph partitioning module is used for partitioning the query graph based on the expansion probability to obtain a query graph partitioning set, wherein the query graph partitioning set comprises tau+k non-overlapping partitions, and k is a lower bound parameter value;

the secondary filtering module is used for determining the number of unmatched partitions LB₂(g, q) between the query graph and each data graph included in the pre-candidate data graph set, and filtering out the data graphs with LB₂(g, q) > τ from the pre-candidate data graph set to obtain a candidate data graph set;

the multi-layer index construction module is used for configuring sub-candidate query graph sets in each layer index, each sub-candidate query graph set comprises a plurality of sub-areas, and the sub-candidate query graph sets form candidate query graph sets;

the index sequence dividing module is used for dividing an index sequence, calculating the element similarity difference s_gap in the index sequence and setting a sequence compression threshold L;

the index compression module compresses the index by adopting a partition compression method when s_gap is larger than L, and compresses the index by adopting a difference compression method when s_gap is smaller than or equal to L;

and the graph similarity searching module is used for calculating GED (g, q), and adding the query graph into the result set and returning the result set when the GED (g, q) is less than or equal to tau, wherein the GED (g, q) is the graph editing distance between the data graph and the query graph.

In the description of the above embodiments, particular features, structures, materials, or characteristics may be combined in any suitable manner in any one or more embodiments or examples.

The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A graph similarity search method, comprising the steps of:

determining an editing distance threshold, determining the difference value of the number of vertices and the number of edges between each data graph and the query graph, and filtering out the data graphs whose difference value of the number of vertices and the number of edges is larger than the editing distance threshold from the data graph set to obtain a pre-candidate data graph set;

partitioning the query graph based on the expansion probability to obtain a query graph partition set; the query graph partition set comprises a plurality of non-overlapping partitions; the number of the non-overlapping partitions is the sum of the editing distance threshold and a lower bound parameter value;

determining the number of unmatched partitions between the query graph and each data graph in the pre-candidate data graph set, and filtering the data graph with the number of unmatched partitions being larger than the editing distance threshold from the pre-candidate data graph set to obtain a candidate data graph set;

constructing a multi-layer index, wherein each layer of index is configured with a sub-candidate query graph set, each sub-candidate query graph set comprises a plurality of non-overlapping partitions, and the sub-candidate query graph sets form a candidate query graph set; the lower bound parameter value is the number of layers of the index where the non-overlapping partition is located;

dividing the index sequence, calculating element similarity difference values in the index sequence, and setting a compression threshold value of the index sequence;

compressing the index, when the element similarity difference value is larger than the compression threshold value, compressing the index by adopting a partition compression method, and when the element similarity difference value is smaller than or equal to the compression threshold value, compressing the index by adopting a difference compression method;

and calculating a graph editing distance between the data graph and the query graph, and adding the query graph into a result set and returning the result set when the graph editing distance is smaller than or equal to an editing distance threshold value.

2. The graph similarity search method of claim 1, wherein partitioning the query graph based on an expanded probability comprises:

assigning vertices included in the query graph;

and distributing cross-region edges included in the query graph.

3. The graph similarity search method of claim 2, wherein the assigning vertices included in the query graph includes:

randomly selecting initial vertexes of the non-overlapping partitions, and expanding the initial vertexes into the non-overlapping initial partitions;

adding neighbor vertexes of each initial vertex into the non-overlapping initial partition, calculating contribution values of the neighbor vertexes to each non-overlapping initial partition, and adding the neighbor vertexes into the non-overlapping initial partition with the largest contribution value; when the contribution values are equal, randomly adding the neighbor vertexes into a smaller non-overlapping initial partition;

and repeating the previous step, and calculating the contribution value of the neighbor vertexes of each non-overlapping initial partition to each non-overlapping initial partition until all the neighbor vertexes are distributed.

4. The graph similarity search method of claim 3, wherein the assigning cross-region edges included in the query graph includes:

after all the neighbor vertexes are distributed, the cross-region edges are distributed to the non-overlapping initial partition where the vertexes are located, the contribution value is calculated, and the cross-region edges are distributed to the non-overlapping initial partition with the largest contribution value, so that the non-overlapping partition is obtained.

5. The graph similarity search method according to claim 4, wherein the contribution value is defined as Δp_i, and Δp_i is determined by Δp_i = |s(p_i ∪ {v}) − s(p_i)|;

wherein Δp_i represents the contribution value; p_i is a certain partition; v is the initial vertex; s(p_i) is the expanded probability value of that partition; i represents the partition number, and i is greater than or equal to 1 and less than or equal to the edit distance threshold.

6. The graph similarity search method of claim 5, wherein the expanded probability value is defined as s(p_i) and is determined by the following method:

wherein f(Lv) represents the number of vertices with vertex label v in the non-overlapping partition; f(Le) is the number of edges with edge label e in the non-overlapping partition; |V_pi| is the total number of vertices in the non-overlapping partition; and |E_pi| is the total number of edges in the non-overlapping partition.

7. The graph similarity search method of claim 1, wherein the constructing a multi-layer index includes:

in the index of a certain layer, dividing the query graph into a plurality of non-overlapping partitions based on the expansion probability, wherein the number of the non-overlapping partitions is the sum of the editing distance threshold value and the lower bound parameter value; obtaining a sub-candidate query graph set corresponding to the index of the layer through a hierarchical filtering mechanism, wherein all the sub-candidate query graph sets included in the index of the layer jointly form the candidate query graph set;

for each of the non-overlapping partitions of the query graph, an inverted index table is maintained, and all of the data graphs including the non-overlapping partitions are stored.

8. The graph similarity search method of claim 7, wherein maintaining an inverted index table for each of the non-overlapping partitions of the query graph, storing all of the data graph including the non-overlapping partitions, comprises:

and recording the frequencies of the vertex labels and the edge labels of the data graph and the non-overlapping partition in the process of partitioning the query graph, wherein the frequencies are respectively marked as N (g) and N (p), and when N (g) is less than or equal to N (p), the non-overlapping partition is a matching partition of the data graph, and all the data graph containing the non-overlapping partition is saved.

9. A graph similarity search apparatus for implementing the graph similarity search method according to any one of claims 1 to 8, comprising:

the user interface is used for loading the query graph;

the first-level filtering module is used for filtering the data graphs with the difference value larger than the editing distance threshold value from the data graph set on the basis of determining the editing distance threshold value and determining the difference value of the vertex number and the edge number between each data graph and the query graph so as to obtain a pre-candidate data graph set;

the query graph partitioning module is used for partitioning the query graph based on the expansion probability to obtain a query graph partition set, wherein the query graph partition set comprises a plurality of non-overlapping partitions, and the number of the non-overlapping partitions is the sum of the editing distance threshold value and a lower bound parameter value;

the secondary filtering module is used for filtering the data graphs with the number of the unmatched partitions larger than the editing distance threshold value from the pre-candidate data graph set on the basis of determining the number of the unmatched partitions between the query graph and each data graph included in the pre-candidate data graph set so as to obtain a candidate data graph set;

the multi-layer index construction module is used for configuring sub-candidate query graph sets for each layer of index, each sub-candidate query graph set comprises a plurality of non-overlapping partitions, and the sub-candidate query graph sets form candidate query graph sets; the lower bound parameter value is the number of layers of the index where the non-overlapping partition is located;

the index compression module is used for compressing the index by adopting a partition compression method when the element similarity difference value is larger than the compression threshold value, and compressing the index by adopting a difference compression method when the element similarity difference value is smaller than or equal to the compression threshold value;

and the diagram similarity searching module is used for calculating the diagram editing distance between the data diagram and the query diagram, and adding the query diagram into a result set and returning the result set when the diagram editing distance is smaller than or equal to an editing distance threshold value.

10. A computer storage medium storing a plurality of instructions adapted to be loaded by a processor and to perform the method steps of any of claims 1-8.