CN116304220A - Multi-granularity tracing method for data integration - Google Patents

Multi-granularity tracing method for data integration

Info

Publication number
CN116304220A
CN116304220A
Authority
CN
China
Prior art keywords
data integration
data
granularity
tracing
traceability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211545898.1A
Other languages
Chinese (zh)
Inventor
杨斐斐
申德荣
聂铁铮
寇月
Original Assignee
Northeastern University (东北大学)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeastern University (东北大学)
Priority to CN202211545898.1A priority Critical patent/CN116304220A/en
Publication of CN116304220A publication Critical patent/CN116304220A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/901: Indexing; Data structures therefor; Storage structures
    • G06F16/9024: Graphs; Linked lists
    • G06F16/903: Querying
    • G06F16/90335: Query processing
    • G06F16/9038: Presentation of query results
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30: Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a multi-granularity tracing method for data integration. First, the shortcomings of existing tracing methods for data integration tasks are analyzed, and a multi-granularity tracing model is proposed according to the characteristics of the data integration task. Second, a data integration tracing process model is constructed, in which a user can select any number of data integration subunits from the data integration tracing toolbox to build a data integration workflow. A multi-granularity tracing graph is then constructed in a graph database from the tracing meta-information generated by the tracing model and the data integration workflow. Finally, multi-granularity tracing queries, comprising coarse-grained and fine-grained tracing queries, are designed to play back the data integration process at multiple granularities. The proposed method can play back the data integration process at both the activity level and the entity level, improving the interpretability, credibility and repeatability of data integration.

Description

Multi-granularity tracing method for data integration
Technical Field
The invention belongs to the technical field of data integration and data tracing, and particularly relates to a multi-granularity tracing method oriented to data integration.
Background
In the data-driven era, data integration, as a core technology of data science applications, plays an important role in enterprise decision-making. However, as data resources grow, the complexity, heterogeneity and low quality of multi-source data become more prominent, posing new challenges to data science applications built on data integration technology.
Data integration aims to fuse the data resources of multi-source heterogeneous datasets. The data integration process is complex, usually covering multiple activities such as data discovery, schema alignment, entity resolution and data repair, so the integration quality of multi-source data cannot be guaranteed. On the technology side, existing approaches based on rules or similarity matching can no longer meet enterprise requirements, while current deep-learning-based data integration techniques have poor interpretability, making the reliability of integration results hard to guarantee.
Provenance is a form of structured metadata that records the origin of information and helps judge whether the information is credible. Data tracing technology focuses on analyzing and replaying the evolution of data, mainly tracking and updating the derivation of raw data in real time, and can be used in business fields such as data quality evaluation, data verification and data recovery. Tracing provides the following benefits: 1) it explains to the user the contribution of each source and processing step to the final result; 2) it tracks the data processing procedure and enriches the data with information about how it was obtained, improving the credibility of the result; 3) ideally, it makes it possible to reconstruct the data integration process from the trace information embedded in the final result.
Because of the relevance of multi-source heterogeneous data and the complexity of the integration process, users place high demands on the interpretability, credibility and repeatability of the data integration process and its results. Interpretability emphasizes explaining results or processes transparently to some audience. Data credibility refers to the integrity, consistency and accuracy of the data: all stored data should be objective and real, which depends on how the data were collected, processed and analyzed. Tracing the data integration process records the relevant data operations, and enriching the data with information about how it was obtained improves the credibility of the integration result. Reproducibility of results means starting from the same materials and methods and checking whether previous results can be confirmed. Since tracing can record essentially everything that happens during processing, it is naturally suited to reproducibility: the provenance of a result may include the intermediate steps involved, which changes were made, how parameters were set, and how they were changed. Data quality covers the various dimensions and indicators for evaluating data, such as integrity, accuracy, timeliness and trustworthiness; here, tracing is typically used to monitor and debug applications in order to evaluate and improve particular quality dimensions.
Existing tracing techniques fall mainly into database query tracing and data science workflow tracing. Database query tracing replays data evolution at the data record level (fine granularity), using the derivation of data items under core algebraic operations to explain query results. Data science workflow tracing explains the derivation of a given data set (coarse granularity) in a workflow; mainstream workflow-oriented tracing methods take entities, activities and agents as the core and capture the steps of the workflow and the usage information of its data. Existing database tracing models and workflow tracing models can replay the data evolution process at only a single granularity and cannot meet the tracing requirements of data integration tasks, because tracing for data integration is far more complex than workflow tracing: the processing of objects at multiple granularities, such as data sets, data records and data attributes, must be traced, and the trace data must be queryable in a simple and efficient way. Early data tracing for data integration mainly captured provenance during database integration: Trio (Jennifer Widom. Trio: A System for Integrated Management of Data, Accuracy, and Lineage [C]// CIDR, 2005: 262-276.) manages the correctness and lineage of data together with the data itself, and Perm (B. Glavic and G. Alonso. Perm: Processing Provenance and Data on the Same Data Model through Query Rewriting [C]// ICDE, Shanghai, China, 29 March-2 April 2009. Washington, DC, USA: IEEE Computer Society, 2009: 174-185.) uses query rewriting in relational databases to capture provenance information.
Entity resolution (ER) is a key task in data integration. ERprov (Oppold S., Herschel M. Provenance for Entity Resolution [C]// Provenance and Annotation of Data and Processes. IPAW 2018. Lecture Notes in Computer Science, Springer, Cham. 2018: 226-230.) describes a provenance model of how data are handled during entity resolution, first abstracting the ER task as algebraic operations and then defining provenance on this abstract representation. BLAST (G. Simonini, L. Gagliardelli, S. Zhu, et al. Enhancing Loosely Schema-aware Entity Resolution with User Interaction [C]// 2018 International Conference on High Performance Computing & Simulation (HPCS). Orleans, France: IEEE, 2018: 860-864.) extends the entity resolution algorithm to capture provenance information, through which entity resolution results are visualized. ProvDB (Miao H., Deshpande A. ProvDB: Provenance-enabled Lifecycle Management of Collaborative Data Analysis Workflows [J]. ACM, 2018.) is a unified provenance and metadata management system that supports lifecycle management of complex collaborative data science workflows. The WebIsALOD provenance work (Sven Hertling, Heiko Paulheim. Provenance and Usage of Provenance Data in the WebIsALOD Knowledge Graph [C]// ISWC 2018.) supports tracing of ETL and matching computations, with database-style optimizability and on-demand computation potential. Although multiple tracing schemes exist for multiple applications, each existing scheme targets only a single data integration task rather than the whole process, and so cannot be fully applied to data integration.
Disclosure of Invention
Aiming at the shortcomings of existing data-integration-oriented data tracing methods, the invention provides a multi-granularity tracing method for data integration, which comprises the following steps:
Step 1: construct a multi-granularity tracing model DI_PROV for the data integration task, realizing playback of the data integration process at multiple granularities: coarse granularity takes a data integration subtask as the basic unit and plays back the integration process at the table level, while fine granularity takes data entities and attributes as the core and plays back the evolution of data during integration. The relations in the tracing model DI_PROV (start point → end point) include: Used (activity → object), the activity used the object; WasGeneratedBy (object → activity), the object was generated by the activity; WasIncludedBy (sub-activity → parent activity), the parent activity includes the sub-activity; WasContributedTo (activity → activity), the activity contributed to the other activity; WasDerivedFrom (object → object), the object was derived from the other object; WasAttributedTo (object → agent), the agent is responsible for the object; ActedOnBehalfOf (agent → agent), one agent acts on behalf of another; WasAssociatedWith (activity → agent), the agent takes responsibility in the activity; HasConflictAttribute (entity → attribute), the entity has a conflicting attribute; and HasTruth (entity → attribute), the attribute is the truth value of the entity's conflicting attribute. The specific construction process comprises tracing model node division and tracing model relation division;
The tracing model nodes are divided into three node types: object, activity and agent. Objects correspond to the data files, models, data entities and attributes used and generated by data integration; activities correspond to the subtasks of data integration, including schema alignment (SM), entity resolution (ER), data fusion (DF) and conflict resolution (CR); agents correspond to the users responsible for completing the data integration task, and are refined into a total agent and sub-agents;
The tracing model relation division adds the following relations on the basis of inheriting the relations among the objects, activities and agents of the PROV model:
1) Data integration comprises several subtasks; the inclusion relation and the contribution relation are introduced to reflect the structural characteristics of the workflow;
2) Agents are divided into sub-agents and a total agent, with a representative relation between the total agent and the sub-agents;
3) A data file comprises several data entities, so a sub-record relation is added between the data file and its data entities;
4) Attribute conflicts may occur when data entities from different data files are fused; the data entities then carry conflicting attributes and attribute truth values, so a truth-value relation and a conflicting-attribute relation are added;
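The node types and the extended relation vocabulary of DI_PROV can be sketched as plain Python enumerations. This is a minimal illustration: the class and member names are ours, not taken from the patent, and the derivation relation is named after the standard PROV WasDerivedFrom.

```python
from enum import Enum

class NodeType(Enum):
    """Node types of the DI_PROV tracing model."""
    OBJECT = "object"        # data files, models, data entities, attributes
    ACTIVITY = "activity"    # data integration subtasks (SM, ER, DF, CR)
    AGENT = "agent"          # users; refined into a total agent and sub-agents

class Relation(Enum):
    """Edge vocabulary: inherited PROV relations plus the DI_PROV extensions."""
    USED = "Used"                                    # activity -> object
    WAS_GENERATED_BY = "WasGeneratedBy"              # object -> activity
    WAS_INCLUDED_BY = "WasIncludedBy"                # sub-activity -> parent activity
    WAS_CONTRIBUTED_TO = "WasContributedTo"          # activity -> activity
    WAS_DERIVED_FROM = "WasDerivedFrom"              # object -> object
    WAS_ATTRIBUTED_TO = "WasAttributedTo"            # object -> agent
    ACTED_ON_BEHALF_OF = "ActedOnBehalfOf"           # sub-agent -> total agent
    WAS_ASSOCIATED_WITH = "WasAssociatedWith"        # activity -> agent
    WAS_RECORD_OF = "WasRecordOf"                    # data entity -> data file
    HAS_CONFLICT_ATTRIBUTE = "HasConflictAttribute"  # entity -> conflicting attribute
    HAS_TRUTH = "HasTruth"                           # entity -> attribute truth value
```

The last two members are the DI_PROV additions that have no counterpart in plain PROV; the comments record the start and end point of each relation as listed in step 1.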
Step 2: construct a data integration tracing process model, which comprises the data integration tracing toolbox ProvToolBox composed of data integration subunits DI_Blockbox, the data integration workflow model DI_Workflow, and the tracing meta-information model ProvMeta_in. According to the data integration task, the user builds a data integration workflow DI_Workflow from the data integration subunits DI_Blockbox in the tracing toolbox ProvToolBox; executing the workflow generates the data integration tracing meta-information ProvMeta_in;
ProvToolBox is a set of data integration subunits DI_Blockbox. A data integration subunit is expressed as DI_Blockbox(Activity, Api, Parameter_type, Output_type), where Activity = {activity_name: activity_type}, activity_name is the name of the data integration subunit and activity_type is the task type corresponding to the subunit's activity; Api = {api_1, api_2, …, api_j} is the function set of the subunit, with api_j the j-th function; Parameter_type = {pt_1, pt_2, …, pt_j} is the set of input parameter types of the Api functions registered by the user, i.e. the data types of the actual parameters in the data integration workflow model DI_Workflow, with pt_j the type of the j-th input parameter; Output_type = {ot_1, ot_2, …, ot_m} is the set of data types of the Api outputs, with ot_m the type of the m-th output;
The data integration workflow model DI_Workflow is denoted DI_Workflow(Workflow_Activity, Api_Para), where Workflow_Activity = {a_1, a_2, …, a_i} is the set of data integration subunits of the workflow, with a_i corresponding to the subunit DI_Blockbox_i in the tracing toolbox ProvToolBox; Api_Para = {api_para_1, api_para_2, …, api_para_j} is the function set of the data integration subunit DI_Blockbox, where api_para_j invokes the j-th function with concrete parameter values Para to obtain the corresponding output value Output. Here api denotes a processing function in the subunit DI_Blockbox, Para = {p_1, p_2, …, p_j} are the user parameters of the data integration workflow, and Output = {o_1, o_2, …, o_m} are the data generated by a function in the subunit DI_Blockbox;
The tracing meta-information model ProvMeta_in is expressed as ProvMeta_in(WfProv, Aprov, Sprov, Tprov) and represents the tracing meta-information generated during data integration, where WfProv corresponds to Workflow_Activity in the data integration workflow model DI_Workflow; Aprov = {aprov_1, aprov_2, …, aprov_j} corresponds to the function set Api of the workflow; Sprov = {sprov_1, sprov_2, …, sprov_j} is the set of source objects of the Api operations, corresponding to Para in DI_Workflow, where each source object is denoted {sprov_j: sprovt_j} and its type value sprovt_j is computed by a function from the Parameter_type of the subunit DI_Blockbox; Tprov = {tprov_1, tprov_2, …, tprov_m} is the set of target objects, corresponding to Output in the workflow, where each target object is denoted {tprov_m: tprovt_m} and its type value tprovt_m is computed by a function from the Output_type of the subunit DI_Blockbox. In the tracing meta-information, each source object sprov_j, target object tprov_m and function aprov_j is identified by an identifier Identify(id, name, type), which uniquely identifies an object or operation: name is the object or method name, and type identifies the type value of the object;
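A minimal sketch of the step 2 definitions, assuming Python dataclasses for DI_Blockbox and ProvMeta_in and a tiny interpreter that executes a workflow while recording tracing meta-information. All class, field and function names here are illustrative, not part of the patent.

```python
from dataclasses import dataclass, field

@dataclass
class DIBlockbox:
    """Data integration subunit: DI_Blockbox(Activity, Api, Parameter_type, Output_type)."""
    activity_name: str
    activity_type: str   # subtask type, e.g. "SM", "ER", "DF", "CR"
    api: dict            # function set: {function name: callable}
    parameter_type: dict # {function name: list of input type names}
    output_type: dict    # {function name: list of output type names}

@dataclass
class ProvMetaIn:
    """Tracing meta-information: ProvMeta_in(WfProv, Aprov, Sprov, Tprov)."""
    wfprov: list                               # activities of the executed workflow
    aprov: list = field(default_factory=list)  # functions invoked
    sprov: list = field(default_factory=list)  # (source object, type) pairs
    tprov: list = field(default_factory=list)  # (target object, type) pairs

def run_workflow(steps, meta):
    """Execute a workflow given as (blockbox, api_name, args) triples,
    recording tracing meta-information as a side effect."""
    result = None
    for box, api_name, args in steps:
        meta.aprov.append(api_name)                             # Aprov: function used
        for arg in args:
            meta.sprov.append((repr(arg), type(arg).__name__))  # Sprov: source objects
        result = box.api[api_name](*args)
        meta.tprov.append((repr(result), type(result).__name__))  # Tprov: target objects
    return result
```

For instance, a hypothetical ER subunit whose api holds a single deduplication function would, after one step, leave that function's name in meta.aprov and its input and output in meta.sprov and meta.tprov.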
step 3: constructing a multi-granularity traceability graph according to traceability meta information ProvMeta_in generated by execution of a multi-granularity traceability model DI_PROV and a data integration Workflow model DI_Workflow; the method specifically comprises the following steps:
step 3.1: initializing a multi-granularity traceability graph G (V, E);
step 3.2: construct the tracing relations between objects and activities according to the tracing model DI_PROV and the tracing meta-information ProvMeta_in, and store them in the tracing relation list Prov_doc; the nodes of the multi-granularity tracing graph are the source objects sprov_j, target objects tprov_m and functions aprov_j in ProvMeta_in, and the tracing relations between nodes are determined by the functions aprov_j in ProvMeta_in and the types of their parameters;
step 3.3: circularly accessing each data record stored in the traceability relation list Prov_doc, and creating coarse-granularity and fine-granularity traceability graph nodes and relations among the nodes according to the types of the data objects;
step 3.4: according to the relation between nodes, creating coarse granularity and fine granularity traceability graphs in a graph database Neo4j, and associating and returning the traceability graphs with different granularities;
The multi-granularity tracing graph G(V, E) is a directed graph, where V = {Node} is the node set, corresponding to the objects, activities and agents in the tracing model DI_PROV, and E = {Edge} is the edge set, corresponding to the edges in DI_PROV and representing the relations among activities, objects and agents. A node is represented as Node(node_id, node_name, node_type), where node_id uniquely identifies the node, node_name is its name, and node_type identifies its type; an edge is represented as Edge(edge_id, edge_value, node1_id, node2_id), where edge_id is the unique ID of the edge, edge_value its value, node1_id its start node and node2_id its end node. The relations between coarse-grained and fine-grained tracing graph nodes are created according to the types of the data objects; coarse-grained and fine-grained tracing graphs are then created from the relations between nodes, and the multi-granularity tracing graph is generated by associating the graphs of different granularities.
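Steps 3.1-3.4 can be illustrated by a small sketch that builds the node set V and edge set E from Prov_doc records and emits corresponding Cypher CREATE statements. Plain dataclasses stand in for Neo4j here, and the record layout and label names are our assumptions, not the patent's.

```python
from dataclasses import dataclass
from itertools import count

@dataclass(frozen=True)
class Node:
    node_id: int
    node_name: str
    node_type: str      # "object" | "activity" | "agent"

@dataclass(frozen=True)
class Edge:
    edge_id: int
    edge_value: str     # relation name, e.g. "Used", "WasGeneratedBy"
    node1_id: int       # start node of the edge
    node2_id: int       # end node of the edge

def build_graph(prov_doc):
    """Build V and E from Prov_doc records of the assumed form
    (source name, relation, target name, source type, target type)."""
    node_ids, edge_ids = count(), count()
    nodes, edges = {}, []

    def node_of(name, node_type):
        if name not in nodes:
            nodes[name] = Node(next(node_ids), name, node_type)
        return nodes[name]

    for src, rel, tgt, src_type, tgt_type in prov_doc:
        a, b = node_of(src, src_type), node_of(tgt, tgt_type)
        edges.append(Edge(next(edge_ids), rel, a.node_id, b.node_id))
    return list(nodes.values()), edges

def to_cypher(nodes, edges):
    """Emit Cypher CREATE statements for Neo4j (label names are assumed)."""
    stmts = [f'CREATE (n{n.node_id}:{n.node_type.capitalize()} '
             f'{{name: "{n.node_name}"}})' for n in nodes]
    stmts += [f'CREATE (n{e.node1_id})-[:{e.edge_value}]->(n{e.node2_id})'
              for e in edges]
    return stmts
```

In a real load the edge statements would first MATCH their endpoint nodes by id; the strings here only illustrate the shape of the stored graph.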
Step 4: storing the multi-granularity traceability graph in a Neo4j database, and realizing graph traversal query and aggregation query by using a Cypher query language, wherein the query comprises traceability query with coarse granularity and traceability query with fine granularity;
the tracing inquiry of the coarse granularity comprises the following steps:
step A1: initializing a tracing record;
step A2: when the query type is a tracing summary, first obtain the source data sets used in the whole data integration process, then obtain the main activities of the data integration, and finally traverse the multi-granularity tracing graph according to the source data sets and main activities to obtain and return the tracing record;
step A3: when the query type is a tracing segment, determine from the user's query condition which data integration subtask the user wants to query, then traverse the multi-granularity tracing graph and return the tracing relations of all activities and objects of that subtask;
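As an illustration of the two coarse-grained query types, hypothetical Cypher queries over the graph of step 3 might look as follows; the node labels, relationship types and property names are assumptions about the stored schema, not taken from the patent.

```python
# Coarse-grained tracing queries, written as Cypher strings.
# 'Object'/'Activity' labels and the 'type' property are assumed.

# Tracing summary (step A2): source data sets and the main activities
# that used them.
SUMMARY_QUERY = """
MATCH (src:Object {type: 'source_dataset'})
OPTIONAL MATCH (src)<-[:Used]-(a:Activity)
RETURN src.name AS source, collect(DISTINCT a.name) AS main_activities
"""

# Tracing segment (step A3): all relations of one chosen subtask,
# parameterized by the user's query condition.
SEGMENT_QUERY = """
MATCH (a:Activity {name: $subtask})-[r:Used|WasGeneratedBy|WasContributedTo]-(o)
RETURN a.name AS activity, type(r) AS relation, o.name AS object
"""
```

A client would run these with a Cypher session, passing the subtask name as the `$subtask` parameter of the second query.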
The tracing inquiry of the fine granularity comprises the following steps:
step B1: initialize the result set, convert the query result R into an entity e, and check whether entity e encountered an attribute conflict during the data integration process;
step B2: if the entity e generates attribute conflict, playing back the conflict resolution process;
step B3: if entity e has no attribute conflict, first obtain the result data set of entity e, then traverse the multi-granularity tracing graph starting from the result data set to obtain tracing records, taking the source data sets as the termination condition of the breadth-first traversal; finally query the tracing relation between the result entity and the entities in the source data sets, obtain the final tracing record and return it.
The beneficial effects of the invention are as follows:
The invention provides a multi-granularity tracing method for data integration, which constructs a multi-granularity tracing graph in the graph database Neo4j based on a multi-granularity tracing model and designs multiple tracing query methods; it supports replaying the data integration process from the activity level and the entity level, and can improve the interpretability, credibility and repeatability of data integration.
Drawings
FIG. 1 is a schematic diagram of a multi-granularity tracing method for data integration in the invention;
FIG. 2 is a conceptual diagram of a DI_PROV traceability model according to the present invention;
FIG. 3 is an example of a multi-granularity trace-source diagram in accordance with the present invention;
FIG. 4 is a coarse-granularity tracing summary query example of the present invention;
FIG. 5 is an example of coarse-grained traceable segmented query of the present invention;
FIG. 6 is an example of a fine granularity traceable query of the present invention.
Detailed Description
The invention will be further described with reference to the accompanying drawings and examples of specific embodiments.
The data tracing user first proposes a multi-granularity tracing model to support playback of the data integration process at multiple granularities, and then constructs a data integration tracing process model, which mainly comprises the data integration tracing toolbox ProvToolBox, the data integration workflow model DI_Workflow and the tracing meta-information model ProvMeta_in. The data integration user selects data integration subunits from the tracing toolbox ProvToolBox to build a data integration workflow DI_Workflow, which integrates the restaurant data sets Fodors and Zagats into a final integration result; sample data are shown in Tables 1, 2 and 3. The data tracing user executes the mapping between the tracing toolbox and the workflow built by the data integration user to acquire the tracing meta-information, and builds it into a multi-granularity tracing graph. A data quality management user who does not trust the final result given by the data integration user can issue multi-granularity tracing queries over the tracing data to obtain the evolution of the data during the integration process.
TABLE 1 Fodors sample data

ID             Name               Addr                 City      Phone      Type
restraurant10  Maggie             Riverside street-32  New York  328799414  snackery
restraurant11  Sunset_restaurant  Sunset Avenue        New York  237681123  french_cuisine

TABLE 2 Zagats sample data

ID             Name               Addr                 City      Phone      Type
Restraurant20  Maggie             Riverside street-32  New York  328799414  snackery
Restraurant21  Sunset_restaurant  Sunset Avenue        New York  237681123  german_cuisine

TABLE 3 Integration result sample data

ID             Name               Addr                 City      Phone      Type
Restraurant00  Maggie             Riverside street-32  New York  328799414  snackery
Restraurant01  Sunset_restaurant  Sunset Avenue        New York  237681123  german_cuisine
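In the sample data, the two sources disagree on the Type attribute of Sunset_restaurant (french_cuisine vs. german_cuisine), and the integration result keeps german_cuisine. A toy conflict detector over two matched sample rows can make this concrete; the field names follow the tables above, while the function itself is illustrative, not the patent's conflict resolution method.

```python
def conflicting_attributes(rec_a, rec_b, key="Name"):
    """Return the attributes on which two matched records disagree.
    The ID column is excluded, since matched records keep their own IDs."""
    assert rec_a[key] == rec_b[key], "records must refer to the same entity"
    return {k for k in rec_a if k not in (key, "ID") and rec_a[k] != rec_b.get(k)}

# Matched rows for Sunset_restaurant, copied from Tables 1 and 2.
fodors_row = {"ID": "restraurant11", "Name": "Sunset_restaurant",
              "Addr": "Sunset Avenue", "City": "New York",
              "Phone": "237681123", "Type": "french_cuisine"}
zagats_row = {"ID": "Restraurant21", "Name": "Sunset_restaurant",
              "Addr": "Sunset Avenue", "City": "New York",
              "Phone": "237681123", "Type": "german_cuisine"}
```

For this pair the only conflicting attribute is Type, which is exactly the attribute the HasConflictAttribute and HasTruth relations of DI_PROV would record.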
A multi-granularity tracing method for data integration is shown in figure 1, and specifically comprises the following steps:
Step 1: construct a multi-granularity tracing model DI_PROV for the data integration task. Building a data model is a key technology of data tracing; based on the PROV tracing model, and according to the characteristics of the data integration task and the multi-granularity tracing requirement, a tracing model DI_PROV supporting data integration is proposed, as shown in FIG. 2. The DI_PROV model supports playback of the data integration process at multiple granularities: coarse granularity plays back the integration process at the table level, taking the subtasks of data integration (such as schema alignment and entity resolution) as basic units, while fine granularity plays back the evolution of data during integration, taking data entities and attributes as the core. Table 4 explains the nodes and relations in the DI_PROV tracing model.
Table 4 meaning of relationship in DI_PROV traceability model
(Table 4, listing each DI_PROV relation together with its meaning, is reproduced as an image in the original publication.)
Tracing model node division: there are three node types in the tracing model: objects, activities and agents. Objects correspond to the data files, models, data entities and attributes used and generated by data integration; activities correspond to the subtasks of data integration, mainly including schema alignment, entity resolution and data fusion; agents correspond to the users responsible for completing the data integration task. Because the data integration process is complex, several groups of users are usually required to complete it together, with different tasks assigned according to the users' skills and specialties, so the agent is refined into a total agent and sub-agents;
tracing model relation division: on the basis of inheriting the relationships of PROV model objects, activities and agents, the following relationships are added:
1) The data integration comprises a plurality of subtasks, and a containment relation (WasIncludedBy) and a contribution relation (WasContributedTo) are introduced to embody the structural characteristics of the workflow.
2) The agents are divided into sub-agents and total agents, with a representative relationship (ActedOnBehalfOf) between the total agents and the sub-agents.
3) A data file contains several data entities, and a sub-record relationship (WasRecordOf) is added between the data file and the data entities.
4) Attribute conflicts may occur when data entities from different data files are fused; the data entities then carry conflicting attributes and attribute truth values, so a truth-value relationship (HasTruth) and a conflicting-attribute relationship (HasConflictAttribute) are added.
The data tracing user analyzes the characteristics of the data integration task and divides the nodes and edges of the multi-granularity tracing model to construct DI_PROV: the entities of the conventional workflow model PROV are expanded into objects, including the data files, models, data entities and attributes used and generated by data integration; activities are divided into the subtasks of data integration, including schema alignment, entity resolution, data fusion and conflict resolution; agents are divided into a total agent and sub-agents.
On the basis of inheriting the PROV relations of the traditional workflow model, the relations between nodes are divided: an inclusion relation and a contribution relation are added between activities, a representative relation between the total agent and the sub-agents, a sub-record relation between data files and data entities, and a truth-value relation and a conflict-attribute relation between data entities and attributes.
Step 2: the method comprises the steps of constructing a data integration traceability process model, wherein the data integration traceability process model comprises a data integration traceability toolbox ProvToolBox, a data integration Workflow model DI_Workflow and a traceability meta-information model ProvMeta_in, which are formed by data integration subunits DI_Blockbox; according to the data integration task, a user constructs a data integration Workflow DI_Workflow based on a data integration subunit DI_Blockbox in a traceability toolbox ProvToolBox, and the data integration Workflow executes and generates data integration traceability meta information ProvMeta_in;
Data integration tracing toolbox ProvToolBox: a typical data integration task includes subtasks such as schema alignment (SM), entity resolution (ER), data fusion (DF) and conflict resolution (CR). Each subtask is a data integration subunit DI_Blockbox, and the set of DI_Blockbox units constitutes the data integration tracing toolbox, i.e. ProvToolBox = {DI_Blockbox_1, DI_Blockbox_2, …, DI_Blockbox_i};
The data integration subunit DI_Blockbox is defined as DI_Blockbox(Activity, Api, Parameter_type, Output_type), where Activity = {activity_name: activity_type}: activity_name is the name of the data integration subunit and activity_type is the task type corresponding to the subunit's activity, i.e. a subtask of the data integration process (typically pattern alignment, entity resolution, entity fusion, etc.). Api = {api_1, api_2, …, api_j} denotes the set of processing functions in the subunit. Parameter_type = {pt_1, pt_2, …, pt_j} is the set of data types corresponding to the input parameters (data sets, entity attributes, etc.) of the APIs; the user registers the input parameters of the APIs in the data integration subunit as the data types corresponding to the actual parameters in the data integration Workflow model DI_Workflow. Output_type = {ot_1, ot_2, …, ot_m} is the set of data types corresponding to the outputs of the APIs;
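As a concrete reading of this definition, a DI_Blockbox can be modeled as a record holding the activity, its functions, and the registered input/output types. The dataclass below is a minimal sketch: the field names follow the text, while the example values mirror the blocking sub-module of the worked example later in the description and are otherwise hypothetical.

```python
from dataclasses import dataclass

@dataclass
class DIBlockbox:
    """Sketch of DI_Blockbox(Activity, Api, Parameter_type, Output_type)."""
    activity: dict        # {activity_name: activity_type}, e.g. {"py_entitymatching": "ER"}
    api: list             # processing functions api_1..api_j
    parameter_type: list  # registered data types of the API input parameters
    output_type: list     # data types of the API outputs

# Example registration resembling the blocking sub-module described later.
block_box = DIBlockbox(
    activity={"py_entitymatching": "ER"},
    api=["block_tables", "entity_matching"],
    parameter_type=["source_dataset", "source_dataset",
                    "blocking_traindata", "blocking_model"],
    output_type=["candidate_dataset"],
)
```

Registering a subunit online then amounts to adding such a record to the ProvToolBox set.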
Data integration Workflow model DI_Workflow: data integration typically encompasses subtasks such as pattern alignment, entity resolution, conflict resolution, and the like. The user can select any number of the data integration subunits DI_Blockbox_i in ProvToolBox to construct a data integration workflow;
The data integration Workflow model DI_Workflow is defined as DI_Workflow(Workflow_Activity, Api_Para), where Workflow_Activity = {a_1, a_2, …, a_n} is the set of data integration subunits of the workflow, a_i corresponding to the data integration subunit DI_Blockbox_i in the data integration traceability toolbox ProvToolBox. Api_Para = {api_para_1, api_para_2, …, api_para_j} is the function set of the data integration subunits DI_Blockbox; each api_para_i can be expressed as api_para_i: {Api: Para} → Output, where Api denotes a processing function in the subunit DI_Blockbox, Para = {p_1, p_2, …, p_j} are the user parameters of the data integration workflow (data sets, entity attributes, etc.), and Output = {o_1, o_2, …, o_m} is the data generated by the function in the subunit DI_Blockbox.
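Under these definitions, a toolbox is a set of registered subunits and a workflow is an ordered sequence of (subunit, function, Para, Output) steps. The sketch below strings together the example subunits used later in the description (Table 7); the list-of-tuples encoding is an assumption for illustration.

```python
# Minimal sketch: DI_Workflow as an ordered list of steps, each referencing
# a subunit (DI_Blockbox) registered in ProvToolBox and one of its functions.
prov_toolbox = {"flexmatch", "py_entitymatching", "Data_fusion", "Voting"}

# (subunit, function, Para, Output) per step, following the worked example.
di_workflow = [
    ("flexmatch",         "schema_matching",   ["S1", "S2"],             ["T1", "T2"]),
    ("py_entitymatching", "block_tables",      ["T1", "T2", "S3", "M1"], ["C1"]),
    ("py_entitymatching", "entity_matching",   ["C1"],                   ["T3"]),
    ("Data_fusion",       "entity_fusion",     ["T3"],                   ["T4", "T5"]),
    ("Voting",            "get_realattribute", ["A1", "A2", "E4"],       ["A3"]),
]

# Every step must use a subunit registered in the toolbox.
assert all(step[0] in prov_toolbox for step in di_workflow)
```

Executing such a workflow step by step is what produces one traceability meta-information record per step.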
The traceability meta-information model ProvMeta_in: the user constructs a data integration Workflow DI_Workflow from the subunits DI_Blockbox registered in the data integration traceability toolbox, and saves the constructed workflow and its related configuration (the subunits DI_Blockbox used, the data dependency set, user parameters and outputs) to generate traceability meta-information for each step performed.
The traceability meta-information is defined as ProvMeta_in(Wfprov, Aprov, Sprov, Tprov), representing the traceability meta-information generated by the data integration process, where Wfprov is the Workflow_Activity of the data integration workflow; Aprov = {aprov_1, aprov_2, …, aprov_j} corresponds to the function set Api in the data integration workflow; Sprov = {sprov_1, sprov_2, …, sprov_j} is the set of source objects operated on by the APIs, corresponding to Para in the data integration Workflow DI_Workflow, each source object denoted sprov_j{sprov_j: sprovt_j}, where the type value sprovt_j is computed by the function sprovt_j = ξ(sprov_j) from the Parameter_type of the data integration subunit DI_Blockbox; and Tprov = {tprov_1, tprov_2, …, tprov_m} is the target object set, corresponding to Output in the data integration workflow, each target object denoted tprov_m{tprov_m: tprovt_m}, where the type value tprovt_m is computed by the function tprovt_m = ξ(tprov_m) from the Output_type of the data integration subunit DI_Blockbox. The source objects sprov_j, target objects tprov_m and functions aprov_j in the traceability meta-information are identified with Identify(id, name, type), where id uniquely identifies an object or operation, name represents the object or method name, and type identifies the type value of the object.
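The meta-information record for one step can be derived mechanically from the workflow step and the subunit's registered types. In the sketch below, the typing function ξ is played by a simple lookup table; the dictionary encoding and the helper names are assumptions.

```python
# Typing function xi: maps a data instance id to its registered type,
# standing in for Parameter_type / Output_type lookups in DI_Blockbox.
TYPE_TABLE = {
    "T1": "source_dataset", "T2": "source_dataset",
    "S3": "blocking_traindata", "M1": "blocking_model",
    "C1": "candidate_dataset",
}

def xi(obj_id: str) -> str:
    return TYPE_TABLE[obj_id]

def make_provmeta(wfprov, aprov, para, output):
    """Build ProvMeta_in(Wfprov, Aprov, Sprov, Tprov) for one workflow step."""
    sprov = {p: xi(p) for p in para}    # source objects with their type values
    tprov = {o: xi(o) for o in output}  # target objects with their type values
    return (wfprov, aprov, sprov, tprov)

# Meta-information for the blocking stage of the worked example.
meta = make_provmeta(["py_entitymatching"], ["block_tables"],
                     ["T1", "T2", "S3", "M1"], ["C1"])
```

This reproduces the 4-tuple shape used for the stage-by-stage records shown later in the description.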
Collecting the traceability meta-information of the data integration process: the data integration user registers the data integration subunits DI_Blockbox online, explicitly describing the atomic tasks, functions, inputs and outputs of each subunit and its concrete implementation. In this example, the data integration subunits registered by the data integration user are DI_Blockbox{flexmatch, py_entitymatching, Voting, Data_fusion}. The data tracing user uses these subunits to construct the data traceability toolbox ProvToolBox, an example of which is shown in Table 5; it contains the data integration subunits DI_Blockbox{flexmatch, py_entitymatching, Voting, Data_fusion}. For example, the first sub-module in the data integration subunit py_entitymatching is denoted ({py_entitymatching, ER}, block_tables, {source_data, source_data, blocking_traindata, blocking_model}, candidate_entityset), meaning that when the entity resolution task is completed by py_entitymatching, the types of its first and second input parameters are the source data sets source_data, the third and fourth parameters are the training set blocking_traindata and the blocking model blocking_model used, and the data type corresponding to the output of the block_tables function is the candidate entity set candidate_entityset.
Table 5 traceability toolbox example
For example, another sub-module in the data integration subunit py_entitymatching is denoted ({py_entitymatching, ER}, entity_matching, candidate_dataset, matching_dataset), indicating that when entity resolution is completed using py_entitymatching with the entity_matching operation, the type of its input parameter is the candidate data set candidate_dataset, and the data type corresponding to the output of the function is the matching entity set matching_dataset.
The data integration user selects flexmatch for pattern alignment across the different data sources, which addresses the problem of matching multiple schemas to a single mediated schema. Records describing the same entity in different data sources are then grouped together using py_entitymatching to obtain the entity matching data set. Finally, Data_fusion is adopted to fuse the entities.
To facilitate the description of the data instances involved in the data integration workflow, such as data sets and entity attributes, variables are used in their place; the variables and their corresponding data instances are shown in Table 6. Assume there is a data integration Workflow DI_Workflow(Workflow_Activity, Api_Para), shown in Table 7, where Workflow_Activity = {flexmatch, py_entitymatching, Data_fusion, Voting}, i.e. the data integration Workflow DI_Workflow of this example mainly contains the following data integration subunits: pattern alignment flexmatch, entity resolution py_entitymatching, data fusion Data_fusion, and conflict resolution Voting. Api_Para = {schema_matching, {block_tables, entity_matching}, entity_fusion, get_realattribute} is the set of functions of the data integration subunits used in the data integration process; each function in Api_Para has a parameter list.
Table 6 data variable instance table
Variable | Definition
S1 | Source data set fodors.csv without pattern alignment
S2 | Source data set zagats.csv without pattern alignment
T1 | Pattern-aligned source data set fodors.csv
T2 | Pattern-aligned source data set zagats.csv
S3 | Data set traindata.csv used by the blocking model
M1 | Blocking model er
C1 | Entity resolution candidate data set candidatedata.csv
T3 | Entity resolution matching result set matchingdata.csv
T4 | Attribute conflict data set conflictattribute
T5 | Data fusion result data set resultdata.csv
E1 | Sunset restaurant (restaurant11, sunset_restaurant, Sunset Avenue, New York, 23768123, french_cuisine)
E2 | Sunset restaurant (restaurant21, sunset_restaurant, Sunset Avenue, New York, 23768123, german_cuisine)
E4 | Sunset restaurant (restaurant1, sunset_restaurant, Sunset Avenue, New York, 23768123, french_cuisine)
A1 | Type attribute french_cuisine of the Sunset restaurant in the fodors.csv data set
A2 | Type attribute german_cuisine of the Sunset restaurant in the zagats.csv data set
A3 | Attribute truth value french_cuisine of the Sunset restaurant after conflict resolution
Table 7 data integration workflow example
DI_Blockbox | Api_Para | Para | Output
flexmatch | schema_matching | {S1, S2} | {T1, T2}
py_entitymatching | block_tables | {T1, T2, S3, M1} | {C1}
py_entitymatching | entity_matching | {C1} | {T3}
Data_fusion | entity_fusion | {T3} | {T4, T5}
Voting | get_realattribute | {A1, A2, E4} | {A3}
Table 8 traceability meta information example
To integrate the Fodors restaurant data set S1 and the Zagats restaurant data set S2, the data integration subunit flexmatch is first selected to perform pattern alignment: its schema_matching function computes the similarity between strings, obtains the correspondence between the original schemas and the target schema, and modifies the schemas of the source data sets, producing the result data sets T1 and T2. Second, the data integration subunit py_entitymatching is selected to perform entity resolution, which mainly comprises blocking (block_tables) and entity matching (entity_matching): block_tables uses the blocking model M1, trained on the training data set S3, to partition the source data sets T1 and T2, and a similarity calculation algorithm computes the similarity of entities within each block to generate the candidate data set C1; the entity_matching function then groups records describing the same restaurant in different data sources to obtain the matching entity set T3. Finally, the data integration subunit Data_fusion is selected for entity fusion: its entity_fusion function fuses the data records describing the same restaurant in the matching data set, and if an attribute conflict occurs for an entity during fusion, the attribute conflict data set T4 is generated to record the conflicting attributes of the entities; the data integration subunit Voting then resolves the attribute conflicts, finally yielding the consistent and clean integrated data T5. T5 contains data for a number of restaurants. E4 is a french_cuisine restaurant named sunset_restaurant located on Sunset Avenue, New York; it is obtained by the data integration operation from restaurant E1 in the Fodors restaurant data set S1 and restaurant E2 in the Zagats restaurant data set S2. S1 records the cuisine of restaurant E1 as french_cuisine, while S2 records the cuisine of E2 as german_cuisine; when E1 and E2 are fused to obtain E4, the type attributes conflict, with conflicting values A1 = french_cuisine and A2 = german_cuisine. Using the conflict resolution strategy, the attribute truth value A3 = french_cuisine is obtained.
The data tracing user executes the mapping between the traceability toolbox and the data integration workflow constructed by the data integration user to collect the traceability meta-information. The traceability meta-information generated during the data integration process is represented as ProvMeta_in(Wfprov, Aprov, Sprov, Tprov); the meta-information generated in this example is shown in Table 8. The traceability meta-information generated by the pattern alignment stage is ({flexmatch}, {schema_matching}, {(S1: unsmdataset), (S2: unsmdataset)}, {(T1: source_dataset), (T2: source_dataset)}): after the schema matching operation obtains the target schema, the data sets produced by aligning the source data sets S1 and S2 to the target schema are {(T1: source_dataset), (T2: source_dataset)}. The traceability meta-information generated by the blocking stage of entity resolution is ({py_entitymatching}, {block_tables}, {(T1: source_dataset), (T2: source_dataset), (S3: blocking_traindata), (M1: blocking_model)}, {(C1: candidate_dataset)}), where {(T1: source_dataset), (T2: source_dataset), (S3: blocking_traindata), (M1: blocking_model)} is the set of source objects, {(C1: candidate_dataset)} is the target object, and block_tables denotes the blocking function. The traceability meta-information generated by the matching stage of entity resolution is ({py_entitymatching}, {entity_matching}, {(C1: candidate_dataset)}, {(T3: matching_dataset)}): entity matching is performed with entity_matching on the candidate data set C1 generated in the blocking stage, yielding the matching data set T3.
The traceability meta-information generated by the data fusion stage is ({Data_fusion}, {entity_fusion}, {(T3: matching_dataset)}, {(T4: conflict_attributeset), (T5: target_dataset)}), where resolving data conflicts with the conflict resolution policy yields the conflicting attribute set T4, and fusing the data records describing the same restaurant in the matching data set finally yields the consistent and clean integrated data T5. The traceability meta-information generated by the conflict resolution stage is ({Voting}, {get_realattribute}, {(A1: conflict_attribute), (A2: conflict_attribute), (E4: conflict_entity)}, {(A3: real_attribute)}). E4 is obtained by the data integration operation from restaurant E1 in the Fodors restaurant data set S1 and restaurant E2 in the Zagats restaurant data set S2; S1 records the cuisine as french_cuisine while S2 records it as german_cuisine, so the type attribute of E4 conflicts, with conflicting values A1 = french_cuisine and A2 = german_cuisine; using the conflict resolution strategy, the attribute truth value A3 = french_cuisine is obtained. The source objects sprov_j, target objects tprov_m and functions aprov_j in the traceability meta-information are identified with Identify(id, name, type), where id uniquely identifies an object or function, name represents the name of the object or method, and type identifies the type value of the object. For example, (T1, fodors.csv, source_dataset) represents the source data set fodors.csv in a data integration activity, the corresponding entity being identified as T1; (activity_01, block_tables, activity) indicates that block_tables is an activity in the data integration process, identified as activity_01.
Step 3: constructing a multi-granularity traceability graph based on the multi-granularity traceability model DI_PROV and the traceability meta-information ProvMeta_in generated by executing the user-constructed data integration workflow. Using the DI_PROV traceability model, the traceability relationships between objects and activities are constructed from ProvMeta_in. The nodes of the multi-granularity traceability graph are the source objects sprov_j, target objects tprov_m and functions aprov_j represented by Identify in the traceability meta-information ProvMeta_in; the traceability relationships between nodes are determined by the functions aprov_j in ProvMeta_in and the types of the function inputs and outputs. The method specifically comprises the following steps:
step 3.1: initializing a multi-granularity traceability graph G (V, E);
step 3.2: constructing the traceability relationships between objects and activities according to the traceability model DI_PROV and the traceability meta-information ProvMeta_in, and storing them in the traceability relationship list Prov_doc, wherein the nodes of the multi-granularity traceability graph are the source objects sprov_j, target objects tprov_m and functions aprov_j in ProvMeta_in, and the traceability relationships between nodes are determined by the functions aprov_j in ProvMeta_in and the types of the function parameters;
step 3.3: iterating over each data record stored in the traceability relationship list Prov_doc, and creating the coarse-granularity and fine-granularity traceability graph nodes and the relationships between them according to the types of the data objects;
step 3.4: and according to the relation between the nodes, creating coarse granularity and fine granularity traceability graphs in a graph database Neo4j, and associating and returning the traceability graphs with different granularities.
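Steps 3.1 to 3.4 can be sketched as a loop over the provenance records that emits Used edges for source objects and WasGeneratedBy edges for target objects. The in-memory node and edge collections below are an illustrative stand-in for the Neo4j writes of step 3.4, and the record encoding is an assumption.

```python
def build_trace_graph(prov_doc):
    """prov_doc: list of (activity_id, sprov dict, tprov dict) records."""
    nodes, edges = {}, []                       # step 3.1: initialize G(V, E)
    for activity_id, sprov, tprov in prov_doc:  # step 3.3: visit each record
        nodes[activity_id] = "activity"
        for obj, obj_type in sprov.items():     # source objects -> Used edges
            nodes[obj] = obj_type
            edges.append(("Used", obj, activity_id))
        for obj, obj_type in tprov.items():     # target objects -> WasGeneratedBy
            nodes[obj] = obj_type
            edges.append(("WasGeneratedBy", obj, activity_id))
    return nodes, edges

# The blocking-stage record of the worked example.
prov_doc = [("activity_01",
             {"T1": "source_dataset", "T2": "source_dataset",
              "S3": "blocking_traindata", "M1": "blocking_model"},
             {"C1": "candidate_dataset"})]
nodes, edges = build_trace_graph(prov_doc)
```

Step 3.4 would then persist `nodes` and `edges` into Neo4j and associate the coarse- and fine-granularity views.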
The multi-granularity traceability graph is a directed graph G(V, E), where V = {Node_i} is the node set of the graph, corresponding to the objects, activities and agents of the DI_PROV traceability model, and E = {Edge_j} is the edge set, corresponding to the edges of the DI_PROV model and representing the relationships among activities, objects and agents. Node(node_id, node_name, node_type) is defined to represent a node of the multi-granularity traceability graph: node_id uniquely identifies the node, node_name is the node's name, and node_type identifies the node's type. Edge(edge_id, edge_value, node1_id, node2_id) represents an edge of the graph: edge_id is the unique ID of the edge, edge_value is the value of the edge, node1_id is the starting node of the edge, and node2_id is the terminating node of the edge. The relationships between coarse-granularity and fine-granularity traceability graph nodes are created according to the types of the data objects; the coarse-granularity and fine-granularity traceability graphs are then created from the relationships between nodes, and the multi-granularity traceability graph is generated and returned by associating the traceability graphs of different granularities.
The construction of the multi-granularity traceability graph is illustrated with the integration of the Fodors restaurant data set S1 and the Zagats restaurant data set S2. First, the multi-granularity traceability graph G(V, E) is initialized. Then, according to the multi-granularity traceability model DI_PROV, the find_provobject function is called to obtain the objects in the traceability meta-information, find_provactivity is called to obtain the activities, and the objects and activities are added to the node set of the traceability graph; find_provrelation is then called on the objects and activities to obtain the traceability relationships between them, which are added to the edge set of the traceability graph. This process is repeated to obtain all objects and activities involved in the data integration process and to construct the traceability relationships between them. The coarse-granularity and fine-granularity traceability graphs are constructed in the graph database Neo4j from the node set and edge set of the traceability graph, and the traceability graphs of different granularities are associated and returned. The multi-granularity traceability graph of the framework is shown in Fig. 3. The data integration process involves the source objects {S1, S2, E1, E2, M1, S3, A1, A2}, the target objects {T1, T2, C1, T3, T4, T5, A3, E4}, and the activities {schema_matching, block_tables, entity_fusion, get_realattribute, entity_matching}.
For example, from the traceability meta-information generated by the blocking stage of entity resolution, ({py_entitymatching}, {block_tables}, {(T1: source_dataset), (T2: source_dataset), (S3: blocking_traindata), (M1: blocking_model)}, {(C1: candidate_dataset)}), the objects identified in DI_PROV are: the source data sets T1 and T2; S3 and M1, representing the training set and blocking model used by block_tables; and C1, the candidate entity set output by the block_tables function. The activity is the blocking function block_tables. The specific relationships between the activity and the objects are Used and WasGeneratedBy. The nodes and edges of the traceability graph for the blocking stage of entity resolution are shown in Table 9.
Table 9 entity resolution blocking stage tracing graph node and edge instance
Node 1 | Node 2 | Edge
Node(T1, fodors.csv, source_dataset) | Node(activity_01, block_tables, activity) | Edge(Used_01, Used, T1, activity_01)
Node(T2, zagats.csv, source_dataset) | Node(activity_01, block_tables, activity) | Edge(Used_02, Used, T2, activity_01)
Node(S3, traindata.csv, blocking_traindata) | Node(activity_01, block_tables, activity) | Edge(Used_03, Used, S3, activity_01)
Node(M1, er, blocking_model) | Node(activity_01, block_tables, activity) | Edge(Used_04, Used, M1, activity_01)
Node(C1, candidatedata.csv, candidate_dataset) | Node(activity_01, block_tables, activity) | Edge(WasGeneratedBy_01, WasGeneratedBy, C1, activity_01)
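Materializing rows like those of Table 9 in Neo4j amounts to emitting one CREATE statement per node and per relationship. The helper below only builds the Cypher text (it does not open a database connection); the label and property names are illustrative assumptions.

```python
def node_cypher(node_id, name, node_type):
    """Cypher CREATE for one traceability-graph node,
    e.g. Node(T1, fodors.csv, source_dataset)."""
    return (f"CREATE (:{node_type} {{node_id: '{node_id}', "
            f"node_name: '{name}'}})")

def edge_cypher(edge_id, value, start_id, end_id):
    """Cypher MATCH..CREATE for one edge,
    e.g. Edge(Used_01, Used, T1, activity_01)."""
    return (f"MATCH (a {{node_id: '{start_id}'}}), (b {{node_id: '{end_id}'}}) "
            f"CREATE (a)-[:{value} {{edge_id: '{edge_id}'}}]->(b)")

stmt = node_cypher("T1", "fodors.csv", "source_dataset")
rel = edge_cypher("Used_01", "Used", "T1", "activity_01")
```

In a real deployment, these strings would be passed to a Neo4j session for execution, one statement per row of the traceability relationship list.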
Step 4: the multi-granularity traceability graph is stored in a Neo4j database, and graph traversal query and aggregation query are realized by using a Cypher query language, wherein the query comprises traceability query with coarse granularity and traceability query with fine granularity.
Coarse-granularity traceability queries are divided into traceability summary queries and traceability segment queries. The traceability summary query first obtains the source data sets, the Fodors restaurant data set S1 and the Zagats restaurant data set S2; next it obtains the main activities of the data integration, including pattern alignment, entity resolution and conflict resolution, and the relationships among all activities of the data integration; and finally it obtains the unified and clean result data set T5. The coarse-granularity summary traceability query result for the data quality management user is shown in Fig. 4. The traceability segment query segments the multi-granularity traceability graph according to the data integration subunit of the data integration workflow given as user input, and returns the traceability records of the execution process of a single data integration subunit. The coarse-granularity segment traceability query result for the data quality management user is shown in Fig. 5.
Coarse-granularity traceability query: the multi-granularity traceability graph is summarized, or segmented to return activity-level traceability information, according to the conditions input by the user; the traceability graph G(V, E) is summarized or segmented according to the query type query_type and the query condition input by the user. A1) Initialize the traceability record. A2) When the query type is traceability summary, first obtain the source data sets used in the whole data integration process; next obtain the main activities of the data integration, i.e., the top-level activities such as entity resolution, excluding sub-activities such as training and matching; finally, traverse the multi-granularity traceability graph to obtain and return the traceability records according to the source data sets and main activities. A3) When the query type is traceability segment, find the data integration subtask that satisfies the query condition, e.g., the entity resolution subtask queried by the user, then traverse the multi-granularity traceability graph and return the traceability relationships of all activities and objects of that subtask.
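The A1–A3 branching can be sketched as a dispatch on query_type over an in-memory edge list; the triple encoding and the function name are assumptions, and the graph contents are taken from the running example.

```python
def coarse_query(edges, query_type, condition=None):
    """edges: (relation, obj, activity) triples of the traceability graph.

    query_type 'summary' returns the activities of the whole integration run;
    'segment' returns the edges touching one subtask's activity (condition)."""
    if query_type == "summary":                 # A2: activity-level overview
        return sorted({act for _, _, act in edges})
    if query_type == "segment":                 # A3: one subtask's records
        return [e for e in edges if e[2] == condition]
    raise ValueError(f"unknown query_type: {query_type}")

edges = [("Used", "T1", "activity_01"), ("Used", "M1", "activity_01"),
         ("WasGeneratedBy", "C1", "activity_01"),
         ("Used", "C1", "activity_02"), ("WasGeneratedBy", "T3", "activity_02")]
```

Against a real Neo4j store, the same two branches would be issued as Cypher traversals rather than list filters.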
The fine-granularity traceability query takes data entities and attributes as its core and replays the evolution of a data entity during the data integration process. B1) Initialize the result set, convert the query result R into an entity e, and check whether an attribute conflict occurred for entity e during data integration. B2) If entity e had an attribute conflict, replay the conflict resolution process. B3) If entity e had no attribute conflict, first obtain the result data set of entity e (most data processing during data integration is performed at the data set level), then traverse the multi-granularity traceability graph starting from the result data set to obtain the traceability records, with the source data sets as the termination condition of the breadth-first traversal; finally, query the traceability relationships between the result entity and the entities in the source data sets to obtain and return the final traceability record.
The fine-granularity traceability query converts the query result into an entity E and checks whether an attribute conflict occurred for E during data integration: if E had an attribute conflict, the conflict resolution process is replayed; if not, the result data set of entity E is first obtained, then the multi-granularity traceability graph G(V, E) is traversed starting from the result data set to obtain the traceability records, and the final traceability record is obtained and returned. The fine-granularity traceability query result for the data management user querying restaurant E4 is shown in Fig. 6. From the query result, the data management user learns that restaurant E1 in source data set S1 and restaurant E2 in S2 represent the same restaurant; E1 and E2 are fused using entity fusion (entity_fusion) to obtain restaurant E4, and an attribute conflict occurs during this entity fusion: E4 has the conflicting attributes A1 and A2. The type of restaurant E1 in source data set S1 is A1 = french_cuisine, while the type of restaurant E2 in S2 is A2 = german_cuisine; the attribute truth value A3 = french_cuisine is obtained using the conflict resolution policy get_realattribute.
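The fine-granularity replay is essentially a backward breadth-first traversal from the result data set until the source data sets are reached. The sketch below walks WasGeneratedBy/Used edges on an in-memory graph built from the running example; the triple encoding and helper name are assumptions.

```python
from collections import deque

def fine_trace(edges, start, sources):
    """Backward BFS from a result object to the source data sets.

    edges: (relation, obj, activity) triples, where Used means the activity
    consumed obj and WasGeneratedBy means the activity produced obj."""
    produced_by = {o: a for rel, o, a in edges if rel == "WasGeneratedBy"}
    consumed = {}
    for rel, o, a in edges:
        if rel == "Used":
            consumed.setdefault(a, []).append(o)

    trace, queue, seen = [], deque([start]), {start}
    while queue:
        obj = queue.popleft()
        if obj in sources:            # termination condition: a source data set
            continue
        act = produced_by.get(obj)
        if act is None:
            continue
        for src in consumed.get(act, []):
            trace.append((obj, act, src))  # obj was derived via act from src
            if src not in seen:
                seen.add(src)
                queue.append(src)
    return trace

# T3 (matching set) traced back through matching and blocking to source T1.
edges = [("Used", "T1", "block"), ("WasGeneratedBy", "C1", "block"),
         ("Used", "C1", "match"), ("WasGeneratedBy", "T3", "match")]
trace = fine_trace(edges, "T3", {"T1"})
```

Each returned triple is one replay step; reversing the list reads as the forward evolution of the entity's data set.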

Claims (10)

1. The multi-granularity tracing method for data integration is characterized by comprising the following steps of:
step 1: constructing a multi-granularity traceability model DI_PROV oriented to a data integration task, realizing playback of a data integration process by a plurality of granularities, wherein coarse granularity takes a subtask of data integration as a basic unit to play back the data integration process from a table level, and fine granularity takes a data entity and an attribute as a core to play back the evolution process of data in the data integration process;
step 2: constructing a data integration traceability process model, which comprises a data integration traceability toolbox ProvToolBox composed of data integration subunits DI_Blockbox, a data integration Workflow model DI_Workflow, and a traceability meta-information model ProvMeta_in; according to the data integration task, a user constructs a data integration Workflow DI_Workflow from the data integration subunits DI_Blockbox in the traceability toolbox ProvToolBox, and the data integration workflow executes and generates the data integration traceability meta-information ProvMeta_in;
Step 3: constructing a multi-granularity traceability graph according to traceability meta information ProvMeta_in generated by execution of a multi-granularity traceability model DI_PROV and a data integration Workflow model DI_Workflow;
step 4: the multi-granularity traceability graph is stored in a Neo4j database, and graph traversal query and aggregation query are realized by using a Cypher query language, wherein the query comprises traceability query with coarse granularity and traceability query with fine granularity.
2. The data integration-oriented multi-granularity tracing method according to claim 1, wherein the relationships in the multi-granularity tracing model DI_PROV in step 1 comprise: Used, with an activity as start point and an object as end point, meaning the activity uses the object; WasGeneratedBy, with an object as start point and an activity as end point, meaning the object is generated by the activity; WasIncludedBy, with an activity as start point and a parent activity as end point, meaning the parent activity contains the sub-activity; WasContributedTo, with an activity as start point and an activity as end point, meaning one activity contributes to another; WasDerivedBy, with an object as start point and an object as end point, meaning one object is derived from another; WasAttributedTo, with an object as start point and an agent as end point, meaning the agent is responsible for the object; ActedOnBehalfOf, with an agent as start point and an agent as end point, meaning one agent takes responsibility on behalf of a different agent; WasAssociatedWith, with an activity as start point and an agent as end point, meaning the agent bears responsibility in the activity; HasConflictAttribute, with an entity as start point and an attribute as end point, meaning the entity has a conflicting attribute; and HasTruthValue, with an entity as start point and an attribute as end point, meaning the attribute is the truth value of the entity; the specific construction process comprises the node division of the tracing model and the relationship division of the tracing model.
3. The data integration-oriented multi-granularity tracing method according to claim 2, wherein the tracing model nodes are divided into three node types: object, activity and agent; objects correspond to the data files, models, data entities and attributes used and generated by data integration; activities correspond to the subtasks of data integration, including the pattern alignment (Schema Matching, SM), entity resolution (Entity Resolution, ER), data fusion (Data Fusion, DF) and conflict resolution (Conflict Resolution, CR) processes; agents correspond to the users responsible for completing the data integration tasks, and are refined into a total agent and sub-agents.
4. The data integration-oriented multi-granularity tracing method of claim 2, wherein the tracing model relationship division adds the following relationship on the basis of inheriting the relationship of the PROV model object, the activity and the agent:
1) The data integration comprises a plurality of subtasks, and the inclusion relationship and the contribution relationship are introduced to embody the structural characteristics of the workflow;
2) The agent is divided into a sub-agent and a total agent, and a representative relationship exists between the total agent and the sub-agent;
3) A data file comprises a plurality of data entities, and sub-record relations are added between the data file and the data entities;
4) Attribute conflicts may occur when data entities from different data files are merged, where conflicting attributes and attribute truth values are present for the data entities, and truth relationships and conflicting attribute relationships are added.
5. The multi-granularity tracing method for data integration according to claim 1, wherein the data integration traceability toolbox ProvToolBox in step 2 is a set of data integration subunits DI_Blockbox, the data integration subunit DI_Blockbox being expressed as DI_Blockbox(Activity, Api, Parameter_type, Output_type), where Activity = {activity_name: activity_type}, activity_name represents the name of the data integration subunit, and activity_type represents the task type corresponding to the activity of the data integration subunit; Api = {api_1, api_2, …, api_j} is the function set of the subunit, where api_j denotes the j-th function in the subunit; Parameter_type = {pt_1, pt_2, …, pt_j} is the set of data types that the user registers for the input parameters of the APIs in the data integration subunit, corresponding to the actual parameters in the data integration Workflow model DI_Workflow, where pt_j denotes the type of the j-th input parameter; Output_type = {ot_1, ot_2, …, ot_m} is the set of data types corresponding to the outputs of the APIs, where ot_m denotes the type of the m-th output.
6. The data integration-oriented multi-granularity tracing method according to claim 1, wherein the data integration Workflow model DI_Workflow in step 2 is denoted as DI_Workflow(Workflow_Activity, Api_Para), where Workflow_Activity = {a_1, a_2, …, a_n} is the set of data integration subunits of the data integration workflow, a_i corresponding to the data integration subunit DI_Blockbox_i in the data integration traceability toolbox ProvToolBox; Api_Para = {api_para_1, api_para_2, …, api_para_j} is the function set of the data integration subunits DI_Blockbox, where api_para_j applies the j-th function to specific parameter values Para to obtain the corresponding output value Output; Api denotes a processing function in the data integration subunit DI_Blockbox, Para = {p_1, p_2, …, p_j} are the user parameters of the data integration workflow, and Output = {o_1, o_2, …, o_m} is the data generated by the function in the data integration subunit DI_Blockbox.
7. The data integration-oriented multi-granularity tracing method according to claim 1, wherein the tracing meta-information model ProvMeta_in in step 2 is denoted ProvMeta_in(Wfprov, Aprov, Sprov, Tprov) and represents the tracing meta-information generated during data integration; Wfprov corresponds to Workflow_activity in the data integration workflow model DI_Workflow; Aprov = {aprov_1, aprov_2, …, aprov_j} corresponds to the function set API in the data integration workflow; Sprov = {sprov_1, sprov_2, …, sprov_j} is the set of source objects of the API operations and corresponds to Para in the data integration workflow model DI_Workflow, each source object being denoted sprov_j{sprov_j: sprovt_j}, where the type value sprovt_j of sprov_j is computed by a function from Parameter_type of the data integration subunit DI_Blockbox; Tprov = {tprov_1, tprov_2, …, tprov_m} is the set of target objects and corresponds to Output in the data integration workflow, each target object being denoted tprov_m{tprov_m: tprovt_m}, where the type value tprovt_m is computed by a function from Output_type of the data integration subunit DI_Blockbox; in the tracing meta-information, each source object sprov_j, target object tprov_m and function aprov_j is identified by an identification Identify(id, name, type), where id uniquely identifies an object or operation, name represents the object name or method name, and type identifies the type value of the object.
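A sketch of how one invocation could emit a ProvMeta_in-style record, with an `identify` helper mirroring Identify(id, name, type); the record layout and helper names are assumptions for illustration:

```python
import itertools

_ids = itertools.count(1)

def identify(name, type_):
    """Identify(id, name, type): unique id plus object/method name and type."""
    return {"id": next(_ids), "name": name, "type": type_}

def record_prov(activity_name, fn, para):
    """Run one subunit function and emit a ProvMeta_in-style record:
    source objects (Sprov), the function (Aprov), target objects (Tprov)."""
    output = fn(*para)
    return {
        "wfprov": activity_name,
        "aprov": identify(fn.__name__, "function"),
        "sprov": [identify(repr(p), type(p).__name__) for p in para],
        "tprov": [identify(repr(output), type(output).__name__)],
        "output": output,
    }
```

Here the type values are taken from the runtime values for brevity; per the claim they would instead come from the subunit's declared Parameter_type and Output_type.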
8. The data integration-oriented multi-granularity tracing method according to claim 1, wherein the step 3 specifically comprises:
step 3.1: initializing a multi-granularity traceability graph G (V, E);
step 3.2: constructing a tracing relation between an object and an activity according to a tracing model DI-PROV and tracing meta information ProvMeta_in, and storing the tracing relation into a tracing relation list Prov_doc, wherein nodes in the multi-granularity tracing graph are source objects sprov in the tracing meta information ProvMeta_in j Target object tprov m And function aprov j The tracing relation among nodes in the multi-granularity tracing graph is formed by a function aprov in tracing meta information ProvMeta_in j And type determination of function parameters;
step 3.3: circularly accessing each data record stored in the traceability relation list Prov_doc, and creating coarse-granularity and fine-granularity traceability graph nodes and relations among the nodes according to the types of the data objects;
step 3.4: and according to the relation between the nodes, creating coarse granularity and fine granularity traceability graphs in a graph database Neo4j, and associating and returning the traceability graphs with different granularities.
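Steps 3.1 to 3.4 can be sketched as follows, with an in-memory node/edge pair standing in for the Neo4j graph and with record field names assumed for illustration:

```python
def build_trace_graph(prov_doc):
    """Sketch of steps 3.1-3.4: initialise G(V, E), then for each record
    in the tracing relation list create nodes for the source objects, the
    function/activity and the target objects, plus 'used' and
    'wasGeneratedBy' edges between them."""
    V, E = set(), []                      # step 3.1: initialise G(V, E)
    for rec in prov_doc:                  # step 3.3: iterate over Prov_doc
        act = rec["activity"]
        V.add(act)
        for src in rec["sources"]:        # activity used each source object
            V.add(src)
            E.append((act, "used", src))
        for tgt in rec["targets"]:        # each target was generated by it
            V.add(tgt)
            E.append((tgt, "wasGeneratedBy", act))
    return V, E
```

In the actual method these nodes and edges would be materialised as Neo4j nodes and relationships rather than Python tuples.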
9. The data integration-oriented multi-granularity tracing method according to claim 1, wherein the multi-granularity tracing graph G(V, E) in step 3 is a directed graph, where V = {Node} is the node set of the multi-granularity tracing graph, corresponding to the objects, activities and agents in the tracing model DI_PROV, and E = {Edge} is the edge set of the multi-granularity tracing graph, corresponding to the edges in the tracing model DI_PROV and representing the relations among activities, objects and agents; a node is represented as node_id(node_name, node_type), where node_id establishes the uniqueness of the node, node_name is the name of the node, and node_type identifies the type of the node; an edge is represented as Edge(edge_id, edge_value, node_1_id, node_2_id), where edge_id represents the unique ID of the edge, edge_value is the value of the edge, node_1_id represents the starting node of the edge, and node_2_id represents the terminating node of the edge; relations between coarse-granularity and fine-granularity tracing graph nodes are created according to the types of the data objects; the coarse-granularity and fine-granularity tracing graphs are then created from the relations between nodes, and the multi-granularity tracing graph is generated by associating the tracing graphs of different granularities.
10. The data integration-oriented multi-granularity tracing method according to claim 1, wherein the tracing query for coarse granularity in step 4 comprises:
step A1: initializing a tracing record;
step A2: when the query type is a tracing summary, first acquiring the source data set used in the whole data integration process, then acquiring the main activities of the data integration, and finally traversing the multi-granularity tracing graph according to the source data set and the main activities to obtain and return the tracing record;
step A3: when the query type is a tracing segment, determining from the user's query condition which data integration subtask the user wants to query, and then returning the tracing relations of all activities and objects of that subtask by traversing the multi-granularity tracing graph;
The tracing inquiry of the fine granularity comprises the following steps:
step B1: initializing a result set, converting the query result R into an entity e, and checking whether attribute conflicts were resolved for the entity e during the data integration process;
step B2: if the entity e produced an attribute conflict, playing back the conflict resolution process;
step B3: if the entity e produced no attribute conflict, first acquiring the result data set of the entity e, then traversing the multi-granularity tracing graph breadth-first with the result data set as the starting point to acquire tracing records, with membership in the source data set as the termination condition of the breadth-first traversal, and finally querying the tracing relation between the result entity and the entities in the source data set to obtain and return the final tracing record.
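The backward breadth-first traversal of steps B1 to B3 can be sketched as follows, with the graph encoded as a derivation map and all names assumed for illustration:

```python
from collections import deque

def trace_back(edges, start, source_set):
    """Sketch of step B3: breadth-first traversal of the multi-granularity
    tracing graph from a result object back towards the source data set.
    edges maps each node to the nodes it was derived from; expansion stops
    once a node in source_set (the termination condition) is reached."""
    records, seen, queue = [], {start}, deque([start])
    while queue:
        node = queue.popleft()
        for parent in edges.get(node, []):
            records.append((node, "derivedFrom", parent))
            if parent in source_set or parent in seen:
                continue            # stop expanding at source data
            seen.add(parent)
            queue.append(parent)
    return records
```

The returned `records` correspond to the tracing relations between the result entity and the entities of the source data set.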
CN202211545898.1A 2022-12-05 2022-12-05 Multi-granularity tracing method for data integration Pending CN116304220A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211545898.1A CN116304220A (en) 2022-12-05 2022-12-05 Multi-granularity tracing method for data integration


Publications (1)

Publication Number Publication Date
CN116304220A true CN116304220A (en) 2023-06-23

Family

ID=86817384

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211545898.1A Pending CN116304220A (en) 2022-12-05 2022-12-05 Multi-granularity tracing method for data integration

Country Status (1)

Country Link
CN (1) CN116304220A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117235153A (en) * 2023-10-08 2023-12-15 数安信(北京)科技有限公司 ProV-DM model-based compliance data evidence-storing and tracing method and system



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination