CN111125375B - Lineage graph summarization method based on node structure similarity and semantic proximity - Google Patents
- Publication number
- CN111125375B (application number CN201911331390.XA)
- Authority
- CN
- China
- Prior art keywords
- nodes
- node
- data
- active
- lineage
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Abstract
The invention belongs to the technical field of lineage graphs, and particularly relates to a lineage graph summarization method based on node structure similarity and semantic proximity. The method comprises two stages. In the similar node set identification stage, similar nodes are grouped according to their structural similarity and semantic proximity, yielding a series of similar node sets. In the node set replacement stage, because a lineage graph contains several types of nodes (data nodes, activity nodes, agent nodes and the like), a different replacement strategy is adopted for each type of node set so that the lineage graph remains valid after replacement. The semantic distance between activity nodes is defined by combining their influence proximity and time proximity, and the sets of semantically adjacent activity nodes are identified on that basis. The method replaces node sets that are similar in structure and semantics with super nodes, condensing similar nodes in the lineage graph, which reduces the structural and semantic complexity of the lineage graph and makes it easier to understand.
Description
Technical Field
The invention belongs to the technical field of lineage graphs, and particularly relates to a lineage graph summarization method based on node structure similarity and semantic proximity.
Background
Lineage data records the derivation history of data. It describes how data was generated and supports data lineage queries that aid in result rendering, confidence enhancement, quality assessment and the like. However, lineage data accumulates over time, so the results of lineage queries can be very large. If a query result is presented as a lineage graph, the graph may contain thousands of nodes and is difficult for a reader to understand intuitively. Most existing lineage graph summarization algorithms rely on substantial manual assistance to identify similar node sets, for example identification based on a knowledge base built from a large number of interviews. The invention therefore provides a lineage graph summarization method based on node structure similarity and semantic proximity. It identifies node sets with similar structure and similar semantics based on three observations: (1) data nodes with the same source and usage are sub-data of a higher-level semantic data item; (2) activity nodes that cooperatively generate the same data are sub-activities of a higher-level semantic activity; (3) the activity nodes contained in a higher-level semantic activity have similar influence and similar activity times. The method then replaces these node sets with super nodes, condensing similar nodes in the lineage graph, reducing its structural and semantic complexity, and improving its comprehensibility.
Disclosure of Invention
The invention aims to provide a lineage graph summarization method based on node structure similarity and semantic proximity, so as to reduce the structural and semantic complexity of a lineage graph and improve its comprehensibility.
The lineage graph summarization method provided by the invention is based on three observations: data nodes with the same source and usage are sub-data of a higher-level semantic data item; activity nodes that cooperatively generate the same data are sub-activities of a higher-level semantic activity; and the activity nodes contained in a higher-level semantic activity have similar influence and similar activity times. The method comprises two stages. In the similar node set identification stage, similar nodes are grouped according to their structural similarity and semantic proximity, and a series of similar node sets is identified. In the node set replacement stage, because a lineage graph contains several types of nodes, such as data nodes, activity nodes and agent nodes, a different replacement strategy is adopted for each type of node set so that the lineage graph remains valid before and after replacement.
(I) Similar node set identification phase
The identification is based on three observations: data nodes with the same source and usage are sub-data of a higher-level semantic data item; activity nodes that cooperatively generate the same data are sub-activities of a higher-level semantic activity; and the activity nodes contained in a higher-level semantic activity have similar influence and similar activity times. Accordingly, sets of similar data nodes are identified from the source and usage of the data nodes, and sets of similar activity nodes are identified from the output data of the activity nodes; as each activity node set is identified, the agent set associated with it is identified as well. Finally, borrowing the notion of a similarity distance measure from traditional clustering algorithms, the semantic distance between activity nodes is defined by combining their influence proximity and time proximity, and sets of semantically adjacent activity nodes are identified.
The similar node set identification phase relies on the following definitions of data source, data usage, activity input, activity output and activity controller.
Definition 1: for a data node d, if the lineage graph contains an activity node a and the relationship WasGeneratedBy(d, a), then activity a is a source of data d. Source(d) = {a_1, a_2, …, a_n} denotes all sources of data d, where a_i ∈ Source(d) is one of the sources of d.
Definition 2: for a data node d, if the lineage graph contains an activity node a and the relationship Used(a, d), then activity a is a usage of data d. Usage(d) = {a_1, a_2, …, a_n} denotes all usages of data d, where a_i ∈ Usage(d) is one of the usages of d.
Definition 3: for an activity node a, if the lineage graph contains a data node d and the relationship Used(a, d), then data d is an input of activity a. Input(a) = {d_1, d_2, …, d_n} denotes all inputs of activity a, where d_i ∈ Input(a) is one of the inputs of a.
Definition 4: for an activity node a, if the lineage graph contains a data node d and the relationship WasGeneratedBy(d, a), then data d is an output of activity a. Output(a) = {d_1, d_2, …, d_n} denotes all outputs of activity a, where d_i ∈ Output(a) is one of the outputs of a.
Definition 5: for an activity node a, if the lineage graph contains an agent c and the relationship WasControlledBy(a, c), then agent c is the controller of activity a.
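Definitions 1 to 5 can be read as lookups over the three edge types of the lineage graph. The following is a minimal sketch only: the encoding of each relationship as a list of pairs, the helper `build_index` and all node names are illustrative assumptions, not part of the patent.

```python
from collections import defaultdict

# Hypothetical lineage graph, one list of pairs per relationship type.
was_generated_by = [("d1", "a1"), ("d2", "a1"), ("d3", "a2")]  # WasGeneratedBy(d, a)
used = [("a2", "d1"), ("a2", "d2")]                            # Used(a, d)
was_controlled_by = [("a1", "c1"), ("a2", "c1")]               # WasControlledBy(a, c)

def build_index(pairs, key_pos):
    """Map the element at key_pos of each pair to the set of elements at the other position."""
    index = defaultdict(set)
    for pair in pairs:
        index[pair[key_pos]].add(pair[1 - key_pos])
    return index

source = build_index(was_generated_by, 0)       # Definition 1: d -> activities that generated d
usage = build_index(used, 1)                    # Definition 2: d -> activities that used d
inputs = build_index(used, 0)                   # Definition 3: a -> data used by a
outputs = build_index(was_generated_by, 1)      # Definition 4: a -> data generated by a
controller = build_index(was_controlled_by, 0)  # Definition 5: a -> agents controlling a
```

Building each index once keeps later lookups such as `source["d1"]` or `outputs["a1"]` cheap, which is convenient for the grouping steps that follow.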
The specific steps of similar node set identification are as follows:
Step 1: identify sets of similar data nodes according to the source and usage of the data nodes. The source of a data node is the activity node that generated the data, and the usage of a data node is the activity node that used the data as input.
$$
\begin{aligned}
&\text{Initial: } DC_k = \{d_i\} \\
&\text{if } \mathrm{Source}(d_i) = \mathrm{Source}(d_j) \text{ and } \mathrm{Usage}(d_i) = \mathrm{Usage}(d_j) \\
&\text{then } DC_k = DC_k \cup \{d_j\}
\end{aligned}
\tag{1}
$$
where d_i and d_j denote any two data nodes in the lineage graph, and DC_k denotes a set of data nodes.
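Formula 1 can be sketched as grouping data nodes by their (Source, Usage) pair. This is a minimal illustration, not the patent's pseudo code: the function name `similar_data_sets`, the dictionary encoding of Source and Usage, and the choice to return only sets with at least two members are assumptions.

```python
from collections import defaultdict

def similar_data_sets(data_nodes, source, usage):
    """Group data nodes whose Source and Usage sets are both identical (Formula 1)."""
    groups = defaultdict(set)
    for d in data_nodes:
        # frozenset makes the (Source, Usage) pair hashable so it can serve as a dict key.
        key = (frozenset(source.get(d, ())), frozenset(usage.get(d, ())))
        groups[key].add(d)
    # Assumption: only sets with at least two members are worth replacing.
    return [members for members in groups.values() if len(members) > 1]

# Hypothetical example: d1 and d2 share both source and usage, d3 differs in usage.
source = {"d1": {"a1"}, "d2": {"a1"}, "d3": {"a1"}}
usage = {"d1": {"a2"}, "d2": {"a2"}, "d3": {"a3"}}
data_sets = similar_data_sets(["d1", "d2", "d3"], source, usage)
print(data_sets)
```

Grouping by a hashable key is one possible design; comparing every pair of data nodes directly, as the condition in Formula 1 is written, would give the same sets.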
Step 2: identify sets of similar activity nodes according to the output data of the activity nodes; as each activity node set is identified, the agent set associated with it is identified as well.
$$
\begin{aligned}
&\text{Initial: } AC_k = \{a_i\},\quad CC_k = \{\mathrm{controller}(a_i)\} \\
&\text{if } \mathrm{output}(a_i) = \mathrm{output}(a_j) \\
&\text{then } AC_k = AC_k \cup \{a_j\},\quad CC_k = CC_k \cup \{\mathrm{controller}(a_j)\}
\end{aligned}
\tag{2}
$$
where a_i and a_j denote any two activity nodes in the lineage graph, AC_k denotes a set of activity nodes, and CC_k denotes a set of agent nodes.
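Formula 2 can be sketched in the same way, grouping activities by their output set and collecting the controllers of each group. The function name `similar_activity_sets`, the dictionary encoding, and the decision to skip activities without outputs are illustrative assumptions.

```python
from collections import defaultdict

def similar_activity_sets(activities, outputs, controller):
    """Group activities with identical output sets and collect their agents (Formula 2)."""
    groups = defaultdict(set)
    for a in activities:
        out = frozenset(outputs.get(a, ()))
        if out:  # Assumption: activities without any output are never grouped.
            groups[out].add(a)
    result = []
    for members in groups.values():
        if len(members) > 1:
            agents = set()
            for a in members:
                agents |= set(controller.get(a, ()))
            result.append((members, agents))  # (AC_k, CC_k)
    return result

# Hypothetical example: a1 and a2 both generate d1, a3 generates d2.
outputs = {"a1": {"d1"}, "a2": {"d1"}, "a3": {"d2"}}
controller = {"a1": {"c1"}, "a2": {"c2"}, "a3": {"c1"}}
activity_sets = similar_activity_sets(["a1", "a2", "a3"], outputs, controller)
print(activity_sets)
```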
Step 3: define the semantic distance between activity nodes by combining their influence proximity and time proximity, and identify sets of semantically adjacent activity nodes together with the corresponding agent node sets.
$$\mathrm{influence}(a_i) = \frac{\sum_{d \in Data} \mathrm{Exist}(\mathrm{out\_edge}(d, a_i)) - I_{min}}{I_{max} - I_{min}} \tag{3}$$
$$\mathrm{influence\_distance}(a_i, a_j) = \left|\mathrm{influence}(a_i) - \mathrm{influence}(a_j)\right| \tag{4}$$
$$\mathrm{time\_distance}(a_i, a_j) = \max\left(0,\ a_i.\mathrm{startTime} - a_j.\mathrm{endTime},\ a_j.\mathrm{startTime} - a_i.\mathrm{endTime}\right) \tag{5}$$
$$\mathrm{semantic\_distance}(a_i, a_j) = \mathrm{influence\_distance}(a_i, a_j) + \mathrm{time\_distance}(a_i, a_j) \tag{6}$$
$$
\begin{aligned}
&\text{Initial: } SAC_k = \{a_i\},\quad SCC_k = \{\mathrm{controller}(a_i)\} \\
&\text{if } \mathrm{semantic\_distance}(a_i, a_j) < \sigma \\
&\text{then } SAC_k = SAC_k \cup \{a_j\},\quad SCC_k = SCC_k \cup \{\mathrm{controller}(a_j)\}
\end{aligned}
\tag{7}
$$
where a_i and a_j denote any two activity nodes in the lineage graph; Data denotes the set of all data nodes in the lineage graph; Exist(out_edge(d, a_i)) = 1 when there is an edge from d to a_i in the lineage graph, and 0 otherwise; I_min denotes the minimum influence over all activity nodes and I_max the maximum; a_i.startTime, a_j.startTime, a_i.endTime and a_j.endTime denote the start times and end times of activities a_i and a_j respectively; SAC_k denotes a set of activity nodes; SCC_k denotes a set of agent nodes; and σ is the semantic clustering threshold given by the user.
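Formulas 3 to 7 can be illustrated with a short sketch. Several points are assumptions rather than statements from the patent: edges from data to activities are passed in as (d, a) pairs, times are plain numbers in an unspecified unit, a zero denominator in Formula 3 is mapped to an influence of 0.0, and Formula 7 is read as seed-based grouping in which each unassigned activity starts a new set. The collection of the agent sets SCC_k is omitted for brevity.

```python
def influence(edges_to_activity, activities):
    """Formula 3: min-max normalised number of data nodes with an edge to each activity."""
    raw = {a: 0 for a in activities}
    for _data, activity in edges_to_activity:
        if activity in raw:
            raw[activity] += 1
    low, high = min(raw.values()), max(raw.values())
    span = high - low
    # Assumption: if every activity has the same count, use 0.0 to avoid dividing by zero.
    return {a: (raw[a] - low) / span if span else 0.0 for a in activities}

def time_distance(ti, tj):
    """Formula 5: gap between two (start, end) intervals, 0 if they overlap or touch."""
    return max(0, ti[0] - tj[1], tj[0] - ti[1])

def semantic_distance(ai, aj, infl, times):
    """Formula 6: influence distance (Formula 4) plus time distance (Formula 5)."""
    return abs(infl[ai] - infl[aj]) + time_distance(times[ai], times[aj])

def semantic_clusters(activities, infl, times, sigma):
    """Formula 7, read as seed-based grouping with threshold sigma."""
    clusters, assigned = [], set()
    for seed in activities:
        if seed in assigned:
            continue
        members = {seed} | {a for a in activities
                            if a not in assigned and a != seed
                            and semantic_distance(seed, a, infl, times) < sigma}
        assigned |= members
        clusters.append(members)
    return clusters

# Hypothetical example with three activities.
activities = ["a1", "a2", "a3"]
edges = [("d1", "a1"), ("d2", "a2"), ("d3", "a3"), ("d4", "a3")]
times = {"a1": (0.0, 1.0), "a2": (1.0, 2.0), "a3": (5.0, 6.0)}
infl = influence(edges, activities)
clusters = semantic_clusters(activities, infl, times, sigma=0.5)
print(infl, clusters)
```

In this example a1 and a2 have equal influence and adjacent time intervals, so their semantic distance is 0 and they fall into one set, while a3 differs in both influence and time and stays alone.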
(II) Node set replacement phase
A lineage graph contains several types of nodes, such as data nodes, activity nodes and agent nodes, and a different replacement strategy is adopted for each type of node set so that the lineage graph remains valid before and after replacement.
The specific steps are as follows:
Step 1: replace the data node set DC, comprising the following steps:
(1) Create and initialize a data node d, and replace DC with the data node d;
(2) Create a WasGeneratedBy relationship for the data node d in the lineage graph;
(3) Create a Used relationship for the data node d in the lineage graph;
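The replacement of a data node set can be sketched as redirecting edges to a new super node. The text does not spell out how the new relationships are derived, so this sketch assumes that the super node simply inherits every WasGeneratedBy and Used edge of the members of DC; the function name `replace_data_set` and the node names are hypothetical.

```python
def replace_data_set(dc, new_node, was_generated_by, used):
    """Redirect every edge that touches a member of dc to the super node new_node."""
    def swap(d):
        return new_node if d in dc else d

    # Sets remove the duplicate edges that appear once several members are merged.
    new_wgb = {(swap(d), a) for d, a in was_generated_by}  # WasGeneratedBy(d, a)
    new_used = {(a, swap(d)) for a, d in used}             # Used(a, d)
    return new_wgb, new_used

# Hypothetical example: d1 and d2 are merged into the super node "D".
was_generated_by = {("d1", "a1"), ("d2", "a1")}
used = {("a2", "d1"), ("a2", "d2"), ("a3", "d9")}
new_wgb, new_used = replace_data_set({"d1", "d2"}, "D", was_generated_by, used)
print(new_wgb, new_used)
```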
Step 2: replace the activity node set AC and its corresponding agent node set CC, comprising the following steps:
(1) Create and initialize an activity node a, and replace AC with the activity node a;
(2) Create a WasGeneratedBy relationship for the activity node a in the lineage graph;
(3) Create a Used relationship for the activity node a in the lineage graph;
(4) Create and initialize an agent node c, and replace CC with the agent node c;
(5) Create a WasControlledBy relationship from the activity node a to the agent node c in the lineage graph;
create WasControlledBy from a to c (Formula 15)
Step 3: replace the activity node set SAC and its corresponding agent node set SCC, comprising the following steps:
(1) Find the intermediate data nodes generated and used by the activity set SAC;
$$INTD = INTD_1 \cup INTD_2 \tag{16}$$
(2) Create and initialize an activity node a, and replace SAC ∪ INTD with the activity node a;
(3) Create a WasGeneratedBy relationship for the activity node a in the lineage graph;
(4) Create a Used relationship for the activity node a in the lineage graph;
(5) As in step 2, create and initialize an agent node c, replace SCC with the agent node c, and create a WasControlledBy relationship from the activity node a to the agent node c.
INTD_1 and INTD_2 are removed to ensure the validity of the newly generated lineage graph, i.e., that the graph contains no cycles.
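The role of the intermediate data nodes can be illustrated with a small sketch. INTD_1 and INTD_2 are not defined in this text, so the sketch makes a simplifying assumption: an intermediate data node is any data node that is both generated by and used by members of SAC. Keeping such a node after the merge would leave an edge pair between it and the new activity node, which is a cycle. The function name `collapse_activity_set` is hypothetical, and the agent replacement of sub-step (5) is omitted.

```python
def collapse_activity_set(sac, new_activity, was_generated_by, used):
    """Merge the activities in sac into new_activity and drop intermediate data nodes."""
    generated = {d for d, a in was_generated_by if a in sac}
    consumed = {d for a, d in used if a in sac}
    # Assumption: INTD is the data both produced and consumed inside the set.
    intd = generated & consumed

    def swap(a):
        return new_activity if a in sac else a

    new_wgb = {(d, swap(a)) for d, a in was_generated_by if d not in intd}
    new_used = {(swap(a), d) for a, d in used if d not in intd}
    return intd, new_wgb, new_used

# Hypothetical chain: a1 uses d0 and generates d1; a2 uses d1 and generates d2; a3 uses d2.
was_generated_by = {("d1", "a1"), ("d2", "a2")}
used = {("a1", "d0"), ("a2", "d1"), ("a3", "d2")}
intd, new_wgb, new_used = collapse_activity_set({"a1", "a2"}, "A", was_generated_by, used)
print(intd, new_wgb, new_used)
```

Here d1 is internal to the merged activity and disappears, while d0 remains an input and d2 remains an output of the new activity node A.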
The beneficial effects of the invention are as follows:
Real lineage graphs carry so much information that they are difficult to read intuitively. To address this problem, the invention provides a lineage graph summarization method based on node structure similarity and semantic proximity, building on observations about the structural similarity and semantic proximity between nodes of a lineage graph. The method identifies node sets with similar structure and similar semantics and replaces them with super nodes, condensing similar nodes in the lineage graph, reducing its structural and semantic complexity, and improving its comprehensibility.
Drawings
FIG. 1 is a core schematic diagram of the method of the present invention.
Detailed Description
The pseudo code for realizing the lineage diagram abstract method based on the node structure similarity and the semantic proximity is shown in appendix 1.
The time complexity of the algorithm is O(|V||E| + |C|^2 + |D|^2), where |V| is the number of nodes in the lineage graph, |E| is the number of edges in the lineage graph, |C| is the size of the set of activities that may have the same output, and |D| is the number of data nodes in the lineage graph. The time complexity is polynomial, which is acceptable.
Based on this algorithm, the provenance of 36 successfully executed Taverna scientific workflows was used to synthesize a lineage graph data set with 1502 nodes and 1598 edges, and the change in reduction gain with the clustering threshold σ was tested on this data set. The reduction gain evaluates how much the summarized lineage graph is reduced compared with the original lineage graph, and its formula is as follows.
where G_0 denotes the original lineage graph and G_σ denotes the lineage graph generated by summarization with σ as the clustering threshold.
In the experiment, the clustering threshold was set in turn to 0.0000001, 0.000001, 0.00001, 0.00005, 0.0001, 0.0002, 0.0003, 0.0004, 0.0005, 0.0006, 0.0007, 0.0008, 0.009 and 0.001. The results show that the reduction gain varies continuously from 1 to 1.84, which demonstrates that the method can effectively control the degree of reduction of the lineage graph.
The present invention is not intended to be limited to the particular embodiments shown and described, and various modifications, equivalents and improvements made within the spirit and scope of the present invention are intended to be included therein.
Claims (1)
1. A lineage graph summarization method based on node structure similarity and semantic proximity, characterized by comprising two stages: a similar node set identification stage, in which similar nodes are grouped according to their structural similarity and semantic proximity and a series of similar node sets is identified; and a node set replacement stage, in which, because a lineage graph comprises multiple types of nodes including data nodes, activity nodes and agent nodes, different replacement strategies are adopted for different types of node sets so that the validity of the lineage graph after replacement is guaranteed; wherein:
(I) Similar node set identification phase
Data nodes with the same source and usage are sub-data of a higher-level semantic data item; activity nodes that cooperatively generate the same data are sub-activities of a higher-level semantic activity; and the activity nodes contained in a higher-level semantic activity have similar influence and similar activity times. Similar nodes are searched for based on these three observations, and the specific steps are as follows:
Step 1: identify sets of similar data nodes according to the source and usage of the data nodes, wherein the source of a data node refers to the activity node that generated the data, and the usage of a data node refers to the activity node that used the data as input;
$$
\begin{aligned}
&\text{Initial: } DC_k = \{d_i\} \\
&\text{if } \mathrm{Source}(d_i) = \mathrm{Source}(d_j) \text{ and } \mathrm{Usage}(d_i) = \mathrm{Usage}(d_j) \\
&\text{then } DC_k = DC_k \cup \{d_j\}
\end{aligned}
\tag{1}
$$
wherein d_i and d_j denote any two data nodes in the lineage graph, DC_k denotes a set of data nodes, Source() returns the source of the given data, and Usage() returns the usage of the given data;
Step 2: identify sets of similar activity nodes according to the output data of the activity nodes, and identify the correspondingly associated agent sets as the activity node sets are identified;
$$
\begin{aligned}
&\text{Initial: } AC_k = \{a_i\},\quad CC_k = \{\mathrm{controller}(a_i)\} \\
&\text{if } \mathrm{output}(a_i) = \mathrm{output}(a_j) \\
&\text{then } AC_k = AC_k \cup \{a_j\},\quad CC_k = CC_k \cup \{\mathrm{controller}(a_j)\}
\end{aligned}
\tag{2}
$$
wherein a_i and a_j denote any two activity nodes in the lineage graph, AC_k denotes a set of activity nodes, CC_k denotes a set of agent nodes, controller() returns the controller of the given activity, and output() returns the output data of the given activity;
Step 3: define the semantic distance between activity nodes by combining their influence proximity and time proximity, and identify sets of semantically adjacent activity nodes and the corresponding agent node sets;
$$\mathrm{influence}(a_i) = \frac{\sum_{d \in Data} \mathrm{Exist}(\mathrm{out\_edge}(d, a_i)) - I_{min}}{I_{max} - I_{min}} \tag{3}$$
$$\mathrm{influence\_distance}(a_i, a_j) = \left|\mathrm{influence}(a_i) - \mathrm{influence}(a_j)\right| \tag{4}$$
$$\mathrm{time\_distance}(a_i, a_j) = \max\left(0,\ a_i.\mathrm{startTime} - a_j.\mathrm{endTime},\ a_j.\mathrm{startTime} - a_i.\mathrm{endTime}\right) \tag{5}$$
$$\mathrm{semantic\_distance}(a_i, a_j) = \mathrm{influence\_distance}(a_i, a_j) + \mathrm{time\_distance}(a_i, a_j) \tag{6}$$
$$
\begin{aligned}
&\text{Initial: } SAC_k = \{a_i\},\quad SCC_k = \{\mathrm{controller}(a_i)\} \\
&\text{if } \mathrm{semantic\_distance}(a_i, a_j) < \sigma \\
&\text{then } SAC_k = SAC_k \cup \{a_j\},\quad SCC_k = SCC_k \cup \{\mathrm{controller}(a_j)\}
\end{aligned}
\tag{7}
$$
wherein a_i and a_j denote any two activity nodes in the lineage graph; Data denotes the set of all data nodes in the lineage graph; Exist(out_edge(d, a_i)) = 1 when there is an edge from d to a_i in the lineage graph, and 0 otherwise; I_min denotes the minimum influence over all activity nodes and I_max the maximum; a_i.startTime, a_j.startTime, a_i.endTime and a_j.endTime denote the start times and end times of activities a_i and a_j respectively; SAC_k denotes a set of activity nodes to be replaced; SCC_k denotes a set of agent nodes to be replaced; and σ is the semantic clustering threshold given by the user;
(II) Node set replacement phase
The lineage graph contains multiple types of nodes, including data nodes, activity nodes and agent nodes, and different replacement strategies are adopted for different types of node sets so that the validity of the lineage graph before and after replacement is guaranteed; the specific steps are as follows:
Step 1: replace the data node set DC_k; the specific process comprises:
(1) Create and initialize a data node d, and replace DC_k with the data node d;
(2) Create a WasGeneratedBy relationship for the data node d in the lineage graph;
(3) Create a Used relationship for the data node d in the lineage graph;
Step 2: replace the activity node set AC_k and its corresponding agent node set CC_k; the specific process comprises:
(1) Create and initialize an activity node a, and replace AC_k with the activity node a;
(2) Create a WasGeneratedBy relationship for the activity node a in the lineage graph;
(3) Create a Used relationship for the activity node a in the lineage graph;
(4) Create and initialize an agent node c, and replace CC_k with the agent node c;
(5) Create a WasControlledBy relationship from the activity node a to the agent node c in the lineage graph;
create WasControlledBy from a to c (Formula 15)
Step 3: replace the activity node set SAC_k to be replaced and the corresponding agent node set SCC_k to be replaced; the specific process comprises:
(1) Find the intermediate data nodes generated and used by the activity set SAC_k;
$$INTD = INTD_1 \cup INTD_2 \tag{16}$$
(2) Create and initialize an activity node a, and replace SAC_k ∪ INTD with the activity node a;
(3) Create a WasGeneratedBy relationship for the activity node a in the lineage graph;
(4) Create a Used relationship for the activity node a in the lineage graph;
(5) As in step 2, create and initialize an agent node c, replace SCC_k with the agent node c, and create a WasControlledBy relationship from the activity node a to the agent node c.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911331390.XA CN111125375B (en) | 2019-12-21 | 2019-12-21 | Lineage graph summarization method based on node structure similarity and semantic proximity |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111125375A (en) | 2020-05-08 |
CN111125375B (en) | 2023-04-07 |
Family
ID=70500878
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911331390.XA (Active), CN111125375B (en) | Lineage graph summarization method based on node structure similarity and semantic proximity | 2019-12-21 | 2019-12-21 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111125375B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113535803B (en) * | 2021-06-15 | 2023-03-10 | 复旦大学 | Block chain efficient retrieval and reliability verification method based on keyword index |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106713313A (en) * | 2016-12-22 | 2017-05-24 | 河海大学 | Access control method based on origin graph abstractness |
CN108804582A (en) * | 2018-05-24 | 2018-11-13 | 天津大学 | Method based on the chart database optimization of complex relationship between big data |
CN110008306A (en) * | 2019-04-04 | 2019-07-12 | 北京易华录信息技术股份有限公司 | A kind of data relationship analysis method, device and data service system |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170111245A1 (en) * | 2015-10-14 | 2017-04-20 | International Business Machines Corporation | Process traces clustering: a heterogeneous information network approach |
US20180341701A1 (en) * | 2017-05-24 | 2018-11-29 | Ca, Inc. | Data provenance system |
US10514948B2 (en) * | 2017-11-09 | 2019-12-24 | Cloudera, Inc. | Information based on run-time artifacts in a distributed computing cluster |
Non-Patent Citations (2)
Title |
---|
Andreas Reisser et al. Utilizing Semantic Web Technologies for Efficient Data Lineage and Impact Analyses in Data Warehouse Environments. IEEE. 2009, pp. 1-5. * |
高明 (Gao Ming) et al. A survey of data lineage management techniques. 计算机学报 (Chinese Journal of Computers). 2010, pp. 373-389. * |
Also Published As
Publication number | Publication date |
---|---|
CN111125375A (en) | 2020-05-08 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||