CN111125375B

CN111125375B - Lineage graph summarization method based on node structure similarity and semantic proximity

Info

Publication number: CN111125375B
Application number: CN201911331390.XA
Authority: CN
Inventors: 卢暾; 周倍思; 于方玉; 张鹏; 顾宁
Original assignee: Fudan University
Current assignee: Fudan University
Priority date: 2019-12-21
Filing date: 2019-12-21
Publication date: 2023-04-07
Anticipated expiration: 2039-12-21
Also published as: CN111125375A

Abstract

The invention belongs to the technical field of lineage graphs, and in particular relates to a lineage graph summarization method based on node structural similarity and semantic proximity. The method comprises two stages. In the similar node set identification stage, similar nodes are grouped according to their structural similarity and semantic proximity, and a series of similar node sets is identified. In the node set replacement stage, because a lineage graph contains several types of nodes, such as data nodes, activity nodes and agent nodes, a different replacement strategy is adopted for each type of node set, so that the lineage graph remains valid after replacement. The method defines the semantic distance between activity nodes by combining their influence proximity and time proximity, and uses it to identify semantically adjacent activity node sets. By replacing node sets that are structurally similar and semantically close with super nodes, the method condenses similar nodes in the lineage graph, reduces its structural and semantic complexity, and improves its comprehensibility.

Description

Lineage graph summarization method based on node structure similarity and semantic proximity

Technical Field

The invention belongs to the technical field of lineage graphs, and in particular relates to a lineage graph summarization method based on node structural similarity and semantic proximity.

Background

Lineage data records the derivation history of data. It describes how data was produced and supports data lineage queries that help with result rendering, confidence enhancement, quality assessment and the like. However, lineage data accumulates over time, so the results of lineage queries can become very large. If a query result is presented as a lineage graph, the graph may contain thousands of nodes and is difficult for a reader to understand intuitively. Most existing lineage graph summarization algorithms require substantial manual assistance to identify similar node sets, for example identification based on a knowledge base built from a large number of interviews. The invention therefore provides a lineage graph summarization method based on node structural similarity and semantic proximity. It identifies node sets that are structurally similar and semantically close on the basis of three observations: data nodes with the same source and usage are sub-data of higher-level semantic data; activity nodes that cooperatively generate the same data are sub-activities of a higher-level semantic activity; and the activity nodes contained in a higher-level semantic activity have similar influence and similar activity times. The identified node sets are then replaced with super nodes, which condenses similar nodes in the lineage graph, reduces its structural and semantic complexity, and improves its comprehensibility.

Disclosure of Invention

The invention aims to provide a lineage graph summarization method based on node structural similarity and semantic proximity, so as to reduce the structural and semantic complexity of a lineage graph and improve its comprehensibility.

The lineage graph summarization method provided by the invention is based on three observations: data nodes with the same source and usage are sub-data of higher-level semantic data; activity nodes that cooperatively generate the same data are sub-activities of a higher-level semantic activity; and the activity nodes contained in a higher-level semantic activity have similar influence and similar activity times. The method comprises two stages. In the similar node set identification stage, similar nodes are grouped according to their structural similarity and semantic proximity, and a series of similar node sets is identified. In the node set replacement stage, because a lineage graph contains several types of nodes, such as data nodes, activity nodes and agent nodes, a different replacement strategy is adopted for each type of node set, so that the lineage graph remains valid after replacement.

(I) similar node set identification phase

This stage is based on the three observations stated above: data nodes with the same source and usage are sub-data of higher-level semantic data; activity nodes that cooperatively generate the same data are sub-activities of a higher-level semantic activity; and the activity nodes contained in a higher-level semantic activity have similar influence and similar activity times. Similar data node sets are identified according to the source and usage of the data nodes. Similar activity node sets are identified according to the output data of the activity nodes, and the agent sets associated with each activity node set are identified along with it. Finally, borrowing the notion of a similarity distance measure from traditional clustering algorithms, the semantic distance between activity nodes is defined by combining their influence proximity and time proximity, and semantically adjacent activity node sets are identified.

The stage relies on the following definitions of data source, data usage, activity input, activity output and activity controller.

Definition 1: For a data node d, if the lineage graph contains an activity node a and the relationship WasGeneratedBy(d, a), then activity a is a source of data d. Source(d) = {a1, a2, …, an} denotes all sources of data d, where ai ∈ Source(d) is one of the sources of d.

Definition 2: For a data node d, if the lineage graph contains an activity node a and the relationship Used(a, d), then activity a is a usage of data d. Usage(d) = {a1, a2, …, an} denotes all usages of data d, where ai ∈ Usage(d) is one of the usages of d.

Definition 3: For an activity node a, if the lineage graph contains a data node d and the relationship Used(a, d), then data d is an input of activity a. Input(a) = {d1, d2, …, dn} denotes all inputs of activity a, where di ∈ Input(a) is one of the inputs of a.

Definition 4: For an activity node a, if the lineage graph contains a data node d and the relationship WasGeneratedBy(d, a), then data d is an output of activity a. Output(a) = {d1, d2, …, dn} denotes all outputs of activity a, where di ∈ Output(a) is one of the outputs of a.

Definition 5: For an activity node a, if the lineage graph contains an agent c and the relationship WasControlledBy(a, c), then agent c is the controller of activity a.
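Definitions 1 to 5 can be read as lookups over three edge sets. The following Python sketch is illustrative only: the class name `LineageGraph`, its field names and the example node names are assumptions made here and do not appear in the patent text.

```python
from dataclasses import dataclass, field


@dataclass
class LineageGraph:
    """Hypothetical container for the three relationship types used above."""
    was_generated_by: set = field(default_factory=set)   # pairs (d, a)
    used: set = field(default_factory=set)               # pairs (a, d)
    was_controlled_by: set = field(default_factory=set)  # pairs (a, c)

    def source(self, d):
        """Definition 1: activities a with WasGeneratedBy(d, a)."""
        return frozenset(a for (x, a) in self.was_generated_by if x == d)

    def usage(self, d):
        """Definition 2: activities a with Used(a, d)."""
        return frozenset(a for (a, x) in self.used if x == d)

    def input(self, a):
        """Definition 3: data nodes d with Used(a, d)."""
        return frozenset(d for (x, d) in self.used if x == a)

    def output(self, a):
        """Definition 4: data nodes d with WasGeneratedBy(d, a)."""
        return frozenset(d for (d, x) in self.was_generated_by if x == a)

    def controller(self, a):
        """Definition 5: agents c with WasControlledBy(a, c)."""
        return frozenset(c for (x, c) in self.was_controlled_by if x == a)


# Toy graph: a1 reads d1 and writes d2, d3; a2 reads d2 and d3.
g = LineageGraph(
    was_generated_by={("d2", "a1"), ("d3", "a1")},
    used={("a1", "d1"), ("a2", "d2"), ("a2", "d3")},
    was_controlled_by={("a1", "alice"), ("a2", "bob")},
)
print(sorted(g.source("d2")), sorted(g.usage("d2")))
```

Returning frozensets makes the results directly comparable and hashable, which is convenient when node sets are later grouped by equal Source, Usage or Output.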

The specific steps of similar node set identification are as follows:

Step 1: Identify sets of similar data nodes according to the source and usage of the data nodes. The source of a data node is the activity node that generated the data, and the usage of a data node is the activity node that used the data as input.

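$$
\begin{aligned}
&\text{Initial } DC_k = \{d_i\} \\
&\text{if } \mathrm{Source}(d_i) = \mathrm{Source}(d_j) \text{ and } \mathrm{Usage}(d_i) = \mathrm{Usage}(d_j) \\
&\text{then } DC_k = DC_k \cup \{d_j\}
\end{aligned}
\tag{1}
$$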

where $d_i$ and $d_j$ denote any two data nodes in the lineage graph, and $DC_k$ denotes a set of data nodes.
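Formula 1 amounts to grouping data nodes by the pair (Source, Usage). A minimal Python sketch follows, under two assumptions that are not in the patent text: edges are given as plain pairs, and singleton groups are dropped because replacing one node by one node changes nothing. The function name `similar_data_node_sets` is hypothetical.

```python
from collections import defaultdict


def similar_data_node_sets(data_nodes, was_generated_by, used):
    """Group data nodes that share the same Source and the same Usage.

    was_generated_by: iterable of (data, activity) pairs
    used:             iterable of (activity, data) pairs
    """
    source = defaultdict(set)
    usage = defaultdict(set)
    for d, a in was_generated_by:
        source[d].add(a)
    for a, d in used:
        usage[d].add(a)

    groups = defaultdict(list)
    for d in data_nodes:
        # Nodes with equal (Source, Usage) keys land in the same DC_k.
        key = (frozenset(source[d]), frozenset(usage[d]))
        groups[key].append(d)
    # Assumption: only sets with at least two members are worth replacing.
    return [sorted(members) for members in groups.values() if len(members) > 1]


data_nodes = ["d1", "d2", "d3", "d4"]
was_generated_by = [("d2", "a1"), ("d3", "a1"), ("d4", "a2")]
used = [("a1", "d1"), ("a2", "d2"), ("a2", "d3")]
data_sets = similar_data_node_sets(data_nodes, was_generated_by, used)
print(data_sets)  # [['d2', 'd3']]
```

Hashing on the key avoids comparing every pair of data nodes explicitly; a pairwise comparison, as the formula is literally written, gives the same sets.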

Step 2: Identify sets of similar activity nodes according to the output data of the activity nodes. The agent set associated with each activity node set is identified along with it.

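$$
\begin{aligned}
&\text{Initial } AC_k = \{a_i\},\ CC_k = \{\mathrm{controller}(a_i)\} \\
&\text{if } \mathrm{output}(a_i) = \mathrm{output}(a_j) \\
&\text{then } AC_k = AC_k \cup \{a_j\},\ CC_k = CC_k \cup \{\mathrm{controller}(a_j)\}
\end{aligned}
\tag{2}
$$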

where $a_i$ and $a_j$ denote any two activity nodes in the lineage graph, $AC_k$ denotes a set of activity nodes, and $CC_k$ denotes a set of agent nodes.
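Formula 2 can be sketched the same way, grouping activities by their output set and collecting the controllers of each group. The function name `similar_activity_sets` is hypothetical. The sketch also skips activities with no output and drops singleton groups; both are assumptions made here, not rules stated in the patent.

```python
from collections import defaultdict


def similar_activity_sets(activities, was_generated_by, was_controlled_by):
    """Return (AC_k, CC_k) pairs for activities with identical output sets.

    was_generated_by:  iterable of (data, activity) pairs
    was_controlled_by: iterable of (activity, agent) pairs
    """
    output = defaultdict(set)
    for d, a in was_generated_by:
        output[a].add(d)
    controller = defaultdict(set)
    for a, c in was_controlled_by:
        controller[a].add(c)

    groups = defaultdict(list)
    for a in activities:
        if output[a]:  # assumption: activities without output are not grouped
            groups[frozenset(output[a])].append(a)

    result = []
    for members in groups.values():
        if len(members) > 1:  # assumption: singletons need no replacement
            ac = sorted(members)
            cc = sorted(set().union(*(controller[a] for a in members)))
            result.append((ac, cc))
    return result


activities = ["a1", "a2", "a3"]
was_generated_by = [("d1", "a1"), ("d1", "a2"), ("d2", "a3")]
was_controlled_by = [("a1", "alice"), ("a2", "bob"), ("a3", "carol")]
activity_sets = similar_activity_sets(activities, was_generated_by, was_controlled_by)
print(activity_sets)  # [(['a1', 'a2'], ['alice', 'bob'])]
```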

Step 3: Define the semantic distance between activity nodes by combining their influence proximity and time proximity, and identify semantically adjacent activity node sets together with their corresponding agent node sets.

$$
\mathrm{influence}(a_i) = \frac{\sum_{d \in Data} \mathrm{Exist}(\mathrm{out\_edge}(d, a_i)) - I_{min}}{I_{max} - I_{min}} \tag{3}
$$

$$
\mathrm{influence\_distance}(a_i, a_j) = \left|\mathrm{influence}(a_i) - \mathrm{influence}(a_j)\right| \tag{4}
$$

$$
\mathrm{time\_distance}(a_i, a_j) = \max\left(0,\ a_i.\mathrm{startTime} - a_j.\mathrm{endTime},\ a_j.\mathrm{startTime} - a_i.\mathrm{endTime}\right) \tag{5}
$$

$$
\mathrm{semantic\_distance}(a_i, a_j) = \mathrm{influence\_distance}(a_i, a_j) + \mathrm{time\_distance}(a_i, a_j) \tag{6}
$$

$$
\begin{aligned}
&\text{Initial } SAC_k = \{a_i\},\ SCC_k = \{\mathrm{controller}(a_i)\} \\
&\text{if } \mathrm{semantic\_distance}(a_i, a_j) < \sigma \\
&\text{then } SAC_k = SAC_k \cup \{a_j\},\ SCC_k = SCC_k \cup \{\mathrm{controller}(a_j)\}
\end{aligned}
\tag{7}
$$

where $a_i$ and $a_j$ denote any two activity nodes in the lineage graph; $Data$ denotes the set of all data nodes in the lineage graph; $\mathrm{Exist}(\mathrm{out\_edge}(d, a_i)) = 1$ when the lineage graph contains an edge from $d$ to $a_i$, and 0 otherwise; $I_{min}$ and $I_{max}$ denote the minimum and maximum influence over all activity nodes; $a_i.\mathrm{startTime}$, $a_i.\mathrm{endTime}$, $a_j.\mathrm{startTime}$ and $a_j.\mathrm{endTime}$ denote the start and end times of activities $a_i$ and $a_j$; $SAC_k$ denotes a set of activity nodes; $SCC_k$ denotes a set of agent nodes; and $\sigma$ is the semantic clustering threshold given by the user.
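Formulas 3 to 7 can be sketched in Python as follows. Several points are assumptions made for illustration and are not stated in the patent: the `Activity` class and all function names are hypothetical, $I_{min}$ and $I_{max}$ are taken over the raw edge counts, influence is set to 0 when all counts are equal, times are plain numbers in an arbitrary unit, and each set is grown greedily by comparing candidates with the seed activity only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Activity:
    name: str
    start_time: float
    end_time: float


def influence(activities, edge_count):
    """Formula 3: min-max normalised count of edges from data nodes to each activity."""
    raw = {a.name: edge_count.get(a.name, 0) for a in activities}
    i_min, i_max = min(raw.values()), max(raw.values())
    span = i_max - i_min
    # Assumption: if every count is equal, use 0.0 to avoid dividing by zero.
    return {n: (v - i_min) / span if span else 0.0 for n, v in raw.items()}


def time_distance(ai, aj):
    """Formula 5: gap between the two intervals, 0 if they overlap."""
    return max(0, ai.start_time - aj.end_time, aj.start_time - ai.end_time)


def semantic_distance(ai, aj, infl):
    """Formula 6: influence distance (formula 4) plus time distance."""
    return abs(infl[ai.name] - infl[aj.name]) + time_distance(ai, aj)


def semantic_clusters(activities, infl, sigma):
    """Formula 7, greedy variant: seed with a_i, add unassigned a_j within sigma."""
    clusters, assigned = [], set()
    for ai in activities:
        if ai.name in assigned:
            continue
        cluster = [ai.name]
        assigned.add(ai.name)
        for aj in activities:
            if aj.name not in assigned and semantic_distance(ai, aj, infl) < sigma:
                cluster.append(aj.name)
                assigned.add(aj.name)
        clusters.append(cluster)
    return clusters


a1 = Activity("a1", 0, 10)
a2 = Activity("a2", 5, 12)
a3 = Activity("a3", 100, 110)
acts = [a1, a2, a3]
infl = influence(acts, {"a1": 2, "a2": 2, "a3": 4})
clusters = semantic_clusters(acts, infl, sigma=0.5)
print(clusters)  # [['a1', 'a2'], ['a3']]
```

Here a1 and a2 overlap in time and have equal influence, so their semantic distance is 0, while a3 lies 90 time units away from a1 and differs in influence by 1, giving a distance of 91.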

(II) node set replacement phase

A lineage graph contains several types of nodes, such as data nodes, activity nodes and agent nodes. A different replacement strategy is adopted for each type of node set, so that the lineage graph remains valid after replacement.

The specific steps are as follows:

Step 1: Replace the data node set DC, comprising the following steps:

(1) Create and initialize a data node d, and replace DC with data node d;

(2) Create the WasGeneratedBy relationships for data node d in the lineage graph;

(3) Create the Used relationships for data node d in the lineage graph;

Step 2: Replace the activity node set AC and its corresponding agent node set CC, comprising the following steps:

(1) Create and initialize an activity node a, and replace AC with activity node a;

(2) Create the WasGeneratedBy relationships for activity node a in the lineage graph;

(3) Create the Used relationships for activity node a in the lineage graph;

(4) Create and initialize an agent node c, and replace CC with agent node c;

(5) Create a WasControlledBy relationship from activity node a to agent node c in the lineage graph;

create WasControlledBy from a to c (formula 15)

Step 3: Replace the activity node set SAC and its corresponding agent node set SCC, comprising the following steps:

(1) Find the intermediate data nodes generated and used by the activity set SAC;

$$
INTD = INTD_1 \cup INTD_2 \tag{16}
$$

(2) Create and initialize an activity node a, and replace SAC ∪ INTD with activity node a;

(3) Create the WasGeneratedBy relationships for activity node a in the lineage graph;

(4) Create the Used relationships for activity node a in the lineage graph;

(5) As in step 2, create and initialize an agent node c, replace SCC with agent node c, and create a WasControlledBy relationship from activity node a to agent node c.

INTD1 and INTD2 are removed to ensure the validity of the newly generated lineage graph, i.e., that the graph contains no cycles.
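Step 1 of the replacement stage (replacing a data node set DC with one data node d) can be sketched as an edge rewiring. This is an illustration under assumptions: the function name `replace_data_set` and the super node name `D23` are hypothetical, and the rewiring rule (every edge that touched a member of DC now touches d) is the natural reading of sub-steps (2) and (3), not a formula quoted from the patent. Steps 2 and 3 are not covered.

```python
def replace_data_set(dc, new_node, was_generated_by, used):
    """Replace the data node set dc by the single data node new_node.

    was_generated_by: set of (data, activity) pairs
    used:             set of (activity, data) pairs
    Returns the rewired (was_generated_by, used) edge sets.
    """
    dc = set(dc)
    # Sub-step (2): WasGeneratedBy edges of members now start at new_node.
    new_wgb = {(new_node if d in dc else d, a) for d, a in was_generated_by}
    # Sub-step (3): Used edges into members now point at new_node.
    new_used = {(a, new_node if d in dc else d) for a, d in used}
    return new_wgb, new_used


was_generated_by = {("d2", "a1"), ("d3", "a1"), ("d4", "a2")}
used = {("a1", "d1"), ("a2", "d2"), ("a2", "d3")}
new_wgb, new_used = replace_data_set({"d2", "d3"}, "D23", was_generated_by, used)
print(sorted(new_wgb))   # [('D23', 'a1'), ('d4', 'a2')]
print(sorted(new_used))  # [('a1', 'd1'), ('a2', 'D23')]
```

Because the edges are stored as sets, the duplicate edges that arise when several members shared a neighbour collapse automatically, so the example shrinks from six edges to four.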

The beneficial effects of the invention are as follows:

Addressing the problem that real lineage graphs are hard to read visually because they are overloaded with information, and building on observations about the structural similarity and semantic proximity between lineage graph nodes, the invention provides a lineage graph summarization method based on node structural similarity and semantic proximity. The method identifies node sets that are structurally similar and semantically close and replaces them with super nodes, which condenses similar nodes in the lineage graph, reduces its structural and semantic complexity, and improves its comprehensibility.

Drawings

FIG. 1 is a core schematic diagram of the method of the present invention.

Detailed Description

The pseudocode implementing the lineage graph summarization method based on node structural similarity and semantic proximity is given in Appendix 1.

The time complexity of the algorithm is $O(|V||E| + |C|^2 + |D|^2)$, where $|V|$ is the number of nodes in the lineage graph, $|E|$ is the number of edges in the lineage graph, $|C|$ is the size of the set of activities that may have the same output, and $|D|$ is the number of data nodes in the lineage graph. The algorithm therefore runs in polynomial time, which is acceptable.

Based on this algorithm, the lineage of 36 successfully executed Taverna scientific workflows was used to synthesize a lineage graph data set with 1502 nodes and 1598 edges, and the change in reduction gain as the clustering threshold σ varies was tested on this data set. The reduction gain evaluates how much the summarized lineage graph has been reduced compared with the original lineage graph, and the formula is as follows.

where G0 denotes the original lineage graph and Gσ denotes the lineage graph produced by summarization with σ as the clustering threshold.

In the experiment, the clustering threshold was set in turn to 0.0000001, 0.000001, 0.00001, 0.00005, 0.0001, 0.0002, 0.0003, 0.0004, 0.0005, 0.0006, 0.0007, 0.0008, 0.009 and 0.001. The results show that the reduction gain changes continuously from 1 to 1.84, which demonstrates that the method can effectively control the degree to which the lineage graph is reduced.

The present invention is not intended to be limited to the particular embodiments shown and described, and various modifications, equivalents and improvements made within the spirit and scope of the present invention are intended to be included therein.

Appendix 1


Claims

1. A lineage graph summarization method based on node structural similarity and semantic proximity, characterized by comprising two stages: a similar node set identification stage, in which similar nodes are grouped according to their structural similarity and semantic proximity, and a series of similar node sets is identified; and a node set replacement stage, in which, the lineage graph comprising multiple types of nodes including data nodes, activity nodes and agent nodes, different replacement strategies are adopted for different types of node sets, so that the lineage graph remains valid after replacement; wherein:

(I) similar node set identification phase

Data nodes with the same source and usage are sub-data of higher-level semantic data; activity nodes that cooperatively generate the same data are sub-activities of a higher-level semantic activity; and the activity nodes contained in a higher-level semantic activity have similar influence and similar activity times. Similar nodes are searched for on the basis of these three observations, with the following specific steps:

Step 1: identifying sets of similar data nodes according to the source and usage of the data nodes, wherein the source of a data node is the activity node that generated the data, and the usage of a data node is the activity node that used the data as input;

$$
\begin{aligned}
&\text{Initial } DC_k = \{d_i\} \\
&\text{if } \mathrm{Source}(d_i) = \mathrm{Source}(d_j) \text{ and } \mathrm{Usage}(d_i) = \mathrm{Usage}(d_j) \\
&\text{then } DC_k = DC_k \cup \{d_j\}
\end{aligned}
\tag{1}
$$

wherein $d_i$ and $d_j$ denote any two data nodes in the lineage graph, $DC_k$ denotes a set of data nodes, Source() returns the sources of the given data, and Usage() returns the usages of the given data;

Step 2: identifying sets of similar activity nodes according to the output data of the activity nodes, the correspondingly associated agent sets being identified together with the activity node sets;

$$
\begin{aligned}
&\text{Initial } AC_k = \{a_i\},\ CC_k = \{\mathrm{controller}(a_i)\} \\
&\text{if } \mathrm{output}(a_i) = \mathrm{output}(a_j) \\
&\text{then } AC_k = AC_k \cup \{a_j\},\ CC_k = CC_k \cup \{\mathrm{controller}(a_j)\}
\end{aligned}
\tag{2}
$$

wherein $a_i$ and $a_j$ denote any two activity nodes in the lineage graph, $AC_k$ denotes a set of activity nodes, $CC_k$ denotes a set of agent nodes, controller() returns the controller of the given activity, and output() returns the output data of the given activity;

Step 3: defining the semantic distance between activity nodes by combining their influence proximity and time proximity, and identifying semantically adjacent activity node sets and the corresponding agent node sets;

$$
\mathrm{influence}(a_i) = \frac{\sum_{d \in Data} \mathrm{Exist}(\mathrm{out\_edge}(d, a_i)) - I_{min}}{I_{max} - I_{min}} \tag{3}
$$

$$
\mathrm{influence\_distance}(a_i, a_j) = \left|\mathrm{influence}(a_i) - \mathrm{influence}(a_j)\right| \tag{4}
$$

$$
\mathrm{time\_distance}(a_i, a_j) = \max\left(0,\ a_i.\mathrm{startTime} - a_j.\mathrm{endTime},\ a_j.\mathrm{startTime} - a_i.\mathrm{endTime}\right) \tag{5}
$$

$$
\mathrm{semantic\_distance}(a_i, a_j) = \mathrm{influence\_distance}(a_i, a_j) + \mathrm{time\_distance}(a_i, a_j) \tag{6}
$$

$$
\begin{aligned}
&\text{Initial } SAC_k = \{a_i\},\ SCC_k = \{\mathrm{controller}(a_i)\} \\
&\text{if } \mathrm{semantic\_distance}(a_i, a_j) < \sigma \\
&\text{then } SAC_k = SAC_k \cup \{a_j\},\ SCC_k = SCC_k \cup \{\mathrm{controller}(a_j)\}
\end{aligned}
\tag{7}
$$

wherein $a_i$ and $a_j$ denote any two activity nodes in the lineage graph; $Data$ denotes the set of all data nodes in the lineage graph; $\mathrm{Exist}(\mathrm{out\_edge}(d, a_i)) = 1$ when the lineage graph contains an edge from $d$ to $a_i$, and 0 otherwise; $I_{min}$ denotes the minimum influence over all activity nodes and $I_{max}$ the maximum influence over all activity nodes; $a_i.\mathrm{startTime}$, $a_j.\mathrm{startTime}$, $a_i.\mathrm{endTime}$ and $a_j.\mathrm{endTime}$ denote the start times and end times of activities $a_i$ and $a_j$ respectively; $SAC_k$ denotes a set of activity nodes to be replaced; $SCC_k$ denotes a set of agent nodes to be replaced; and $\sigma$ is the semantic clustering threshold given by the user;

(II) node set replacement phase

The lineage graph contains multiple types of nodes, including data nodes, activity nodes and agent nodes, and different replacement strategies are adopted for different types of node sets, so that the lineage graph remains valid after replacement; the specific steps are as follows:

Step 1: replacing the data node set $DC_k$, the specific process comprising:

(1) Creating and initializing a data node d, and replacing $DC_k$ with data node d;

(2) Creating the WasGeneratedBy relationships for data node d in the lineage graph;

(3) Creating the Used relationships for data node d in the lineage graph;

Step 2: replacing the activity node set $AC_k$ and its corresponding agent node set $CC_k$, the specific process comprising:

(1) Creating and initializing an activity node a, and replacing $AC_k$ with activity node a;

(2) Creating the WasGeneratedBy relationships for activity node a in the lineage graph;

(3) Creating the Used relationships for activity node a in the lineage graph;

(4) Creating and initializing an agent node c, and replacing $CC_k$ with agent node c;

(5) Creating a WasControlledBy relationship from activity node a to agent node c in the lineage graph;

create WasControlledBy from a to c (formula 15)

Step 3: replacing the activity node set $SAC_k$ to be replaced and its corresponding agent node set $SCC_k$ to be replaced, the specific process comprising:

(1) Finding the intermediate data nodes generated and used by the activity set $SAC_k$;

$$
INTD = INTD_1 \cup INTD_2 \tag{16}
$$

(2) Creating and initializing an activity node a, and replacing $SAC_k \cup INTD$ with activity node a;

(3) Creating the WasGeneratedBy relationships for activity node a in the lineage graph;

(4) Creating the Used relationships for activity node a in the lineage graph;

(5) As in step 2, creating and initializing an agent node c, replacing $SCC_k$ with agent node c, and creating a WasControlledBy relationship from activity node a to agent node c.