CN103745319A - Data provenance traceability system and method based on multi-state scientific workflow

Info

Publication number: CN103745319A
Application number: CN201410010013.7A
Authority: CN
Inventors: 黄雨; 井玉欣; 王捍贫; 张世琨
Original assignee: Peking University
Current assignee: Peking University
Priority date: 2014-01-09
Filing date: 2014-01-09
Publication date: 2014-04-23
Anticipated expiration: 2034-01-09
Also published as: CN103745319B

Abstract

The invention discloses a data provenance traceability system and method based on multi-state scientific workflow. The method comprises the following steps: extending a digraph-based scientific workflow process model to obtain an extended scientific workflow process model; enriching the data model part of the extended model by means of data provenance technology, so that the execution of the scientific workflow is described comprehensively from the two perspectives of process and data, thereby obtaining a unified process-data management model based on the multi-state scientific workflow; and describing and tracing data provenance with this model. The system and method can better describe the evolution and state of data in large-scale, complex scientific computing and collaborative research and development processes, which strengthens process monitoring, enables comprehensive management of the process, improves research and development efficiency, and promotes scientific and technical progress.

Description

A data provenance traceability system and method based on multi-state scientific workflow

Technical field

The invention provides a data provenance traceability system and method based on multi-state scientific workflow. In particular, it relates to a method for tracing the data provenance relations between process nodes in a scientific workflow instance with multiple task states, and to a storage scheme for those data provenance relations.

Background art

In the design and manufacture of large-scale, complex systems and in scientific experiments, such as spacecraft design and ship manufacturing, many collaborating units usually have to complete a large number of interdependent tasks at comparable levels. A prominent feature of this process is that both design and implementation involve a considerable number of tasks and a large amount of data, so the workflow is highly complex.

Regarding the management of complex workflows, on the process side, the design of a complex product usually involves a large number of activity nodes and parameters and must also support multidisciplinary collaborative design and optimization. Process management must therefore attend both to the data transfer between activity nodes and to the mapping between node parameters. In other words, process management needs to control complexity, handle control flow and data flow together, and support process optimization.

To solve the process management problem of complex workflows, researchers introduced workflow technology into scientific research and proposed the concept and model of the scientific workflow (Scientific Workflow, SWF). A workflow decomposes work into well-defined tasks and roles, executes them according to predefined rules and processes, and coordinates and monitors each task. The scientific workflow inherits these advantages: by analyzing the data dependencies between tasks, it provides a way to combine them optimally, controls each part so that it completes in order under given constraints, effectively controls and manages the data flow between activity nodes, and drives the process forward. To handle the iterative development operations that each node may undergo in scientific research and complex system design, node state management has been added on top of the scientific workflow to support operations such as execution and redo within the workflow.

On the data management side, as system complexity rises, more and more data are involved, and attending only to the integrated result data cannot guarantee the correctness and consistency of the data. Analyzing how data are generated and how they evolve is therefore very useful for assessing data quality and for ensuring data correctness and security. For this reason, the concept of data provenance was proposed in the computing field and has become a research hotspot. Its importance has been recognized by several scientific workflow projects, such as GridDB, Chimera, myGRID, and CMCS.

In a workflow, each activity node should start executing once its predecessor nodes have completed and its input data requirements are satisfied; after execution it should store the resulting data appropriately and drive the execution of its successor nodes. The workflow should support task redo: after an activity node is re-executed, all of its successor nodes should be notified and prompted about the change. The data from every execution of each activity node should be under effective version management, so that, given a data version of an activity node, one can trace which data versions of the predecessor nodes it was computed from and draw the result as a provenance graph.

However, these systems all differ in how they represent and query data provenance. For example, Kepler provides a provenance recorder that records information about created workflow instances in real time, including the workflow context, the data history, and the workflow definition and its evolution. Taverna uses Semantic Web technology to build four levels for representing provenance data: the process layer, the data layer, the organization layer, and the knowledge layer. The RDFProv system likewise uses Semantic Web technology, inheriting advantages such as interoperability and extensibility while providing the storage and query capabilities of a relational database management system. Chimera uses a virtual data catalog (VDC) technique, which maps a set of executable tasks to transformations, task invocations to data derivations, and inputs and outputs to data objects. VisTrails is the first scientific workflow management system that, in addition to supporting the history of data evolution, also supports tracking the evolution provenance of the workflow itself.

Summary of the invention

Current complex-system design and simulation processes and scientific experiments based on scientific workflows lack a unified method for describing and tracing data provenance that supports multi-state scientific workflows. To address this problem, the invention provides a data provenance tracing method based on multi-state scientific workflow, giving an organization mechanism and a mining method for data provenance during scientific workflow execution.

The principle of the invention is as follows. Taking the digraph-based scientific workflow process model as the basis, the model is extended to obtain an extended scientific workflow process model. Data provenance technology is then used to enrich its data model part, so that the execution of the scientific workflow is described comprehensively from the two perspectives of process and data. This yields a unified process-data management model based on the multi-state scientific workflow, which is used to describe and trace data provenance.

Technical scheme provided by the invention is as follows:

A data provenance traceability system based on multi-state scientific workflow, characterized by comprising: a system server, a user terminal, a relational database, a data operation unit, and a logic computation unit; wherein,

The system server is one or more computers located in a cloud or LAN environment, and is used to accept workflow execution requests from workflow users and respond to them;

The user terminal is a local terminal, serving as the input device with which a workflow user executes the workflow process;

The relational database is used to store workflow activity node information, logic node information, data dependencies, constraint sets, and so on; the relational database is deployed on the system server;

The data operation unit interacts with the relational database, including querying, adding, modifying, and deleting data;

The logic computation unit performs logic computation according to the current working state of the workflow user, interacts with the relational database through the data operation unit, and presents the computation result to the workflow user through the user terminal;

When a workflow user executes a task node for the first time, the logic computation unit first checks whether the relational database already contains an entry in which this task is the successor logic node, and if so, updates the current task version of that entry; it then checks whether the relational database contains an entry in which this task is the predecessor logic node, and if so, updates the current task version of that entry; if neither entry exists, it adds a new entry to the relational database, created from the current task, the current task version number, and the successor logic node of the current task, with the successor logic node version number to be filled in when the successor logic node is executed;

When a workflow user redoes a task node, the logic computation unit first checks whether the relational database already contains an entry in which this task is the successor logic node, and if so, adds a new entry that records the relation between the existing predecessor task version and the redone current task version; it then checks whether the relational database contains an entry in which this task is the predecessor node, and if such an entry exists and the successor node has not yet been executed, directly changes the predecessor task version number in that entry to the current task node version number; if neither entry exists, it adds a new entry created from the current task, the current task version number, and the successor logic node of the current task, with the successor logic node version number to be filled in when the successor logic node is executed.

The data provenance traceability system, characterized in that the data operation unit and the logic computation unit are located on the system server, the data provenance traceability system is a thin-client system, and the user interface of the user terminal is a browser or a user-defined system.

The data provenance traceability system, characterized in that the data operation unit and the logic computation unit are located on the client, the data provenance traceability system is a fat-client system, and the user interface of the user terminal is a user-defined system.

The data provenance traceability system, characterized in that, when a workflow user queries a data provenance relation, the entries containing the current task and the current version are queried directly in the relational database.

The invention also provides a data provenance tracing method based on multi-state scientific workflow, characterized by comprising:

Building a relational database to store the activity node information and logic node information of workflow nodes, data dependencies, constraint sets, and so on; the relational database is located on the server;

When a workflow user executes a task node for the first time: first, check whether the relational database already contains an entry in which this task is the successor logic node, and if so, update the current task version of that entry; then, check whether the relational database contains an entry in which this task is the predecessor logic node, and if so, update the current task version of that entry; if neither entry exists, add a new entry to the relational database, created from the current task, the current task version number, and the successor logic node of the current task, with the successor logic node version number to be filled in when the successor logic node is executed;

When a workflow user redoes a task node: first, check whether the relational database already contains an entry in which this task is the successor logic node, and if so, add a new entry that records the relation between the existing predecessor task version and the redone current task version; then, check whether the relational database contains an entry in which this task is the predecessor node, and if such an entry exists and the successor node has not yet been executed, directly change the predecessor task version number in that entry to the current task node version number; if neither entry exists, add a new entry created from the current task, the current task version number, and the successor logic node of the current task, with the successor logic node version number to be filled in when the successor logic node is executed.

The data provenance tracing method, characterized in that, when a workflow user queries a data provenance relation, the entries containing the current task and the current version are queried directly in the relational database.

The data provenance tracing method, characterized in that each item of workflow data is bound to a workflow node, and the resulting data provenance relations are updated automatically as the relations between workflow nodes are configured and managed.

The data provenance tracing method, characterized in that, in the relational database, the storage structure for each workflow node data entry stores at least the following node information: a reference to the current node <CurrentNode>, the data version of the current node <CurrentNodeVersion>, a reference to a successor node of the current node <NextNode>, and the data version of that successor node <NextNodeVersion>.
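As an illustration only, the four-field storage structure and the direct lookup of the entries containing a given task and version might be sketched with Python's built-in `sqlite3` module. The table name `provenance`, the column names, the use of text version numbers, and the helper `entries_for` are assumptions made for this sketch; the patent does not prescribe a schema.

```python
import sqlite3

# Hypothetical schema: one row per link from a node version to a successor node version.
SCHEMA = """
CREATE TABLE provenance (
    current_node         TEXT NOT NULL,  -- <CurrentNode>
    current_node_version TEXT NOT NULL,  -- <CurrentNodeVersion>
    next_node            TEXT NOT NULL,  -- <NextNode>
    next_node_version    TEXT NOT NULL   -- <NextNodeVersion>, "0" until the successor runs
)
"""

def entries_for(conn, task, version):
    """Return every row in which (task, version) appears on either side of a link."""
    sql = """
        SELECT current_node, current_node_version, next_node, next_node_version
        FROM provenance
        WHERE (current_node = ? AND current_node_version = ?)
           OR (next_node = ? AND next_node_version = ?)
        ORDER BY rowid
    """
    return conn.execute(sql, (task, version, task, version)).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.executemany(
    "INSERT INTO provenance VALUES (?, ?, ?, ?)",
    [("A", "1.0", "B", "1.0"), ("B", "1.0", "C", "0"), ("A", "1.0", "D", "0")],
)
rows = entries_for(conn, "B", "1.0")  # links into and out of version 1.0 of node B
```

Here the query returns the link from A to B and the still-pending link from B to C, which is the raw material for both directions of tracing.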

The data provenance tracing method, characterized in that the method for establishing and updating data provenance relations between scientific workflow nodes is as follows:

The inputs to the establishment and update method comprise the established scientific workflow model and a node in the scientific workflow that has just finished executing. The node is treated as an ordinary node of the scientific workflow; its predecessor and successor node information can be obtained from the scientific workflow model, along with its state, namely first execution or redo. For convenience of explanation, the following assumptions are made:

● Node is a node in the workflow;

● Node has m predecessor nodes, denoted pNode₁, pNode₂, …, pNodeₘ;

● Node has n successor nodes, denoted nNode₁, nNode₂, …, nNodeₙ;

When node Node has finished executing and produced new data:

1) Determine whether node Node has been executed for the first time; if so, perform step 2); otherwise node Node has been re-executed, and step 3) is performed;

2) Node has been executed for the first time; perform the following sub-steps:

2.1) Check whether a data record exists such that <NextNode> is a reference to Node and <NextNodeVersion> is 0; if such a record exists, update its <NextNodeVersion> to the data version number of node Node; if no such record exists, skip this update and continue with step 2.2);

2.2) For the first successor node nNode₁ of Node, add a data record to the storage structure in which <CurrentNode> stores a reference to node Node, <CurrentNodeVersion> stores the data version number of node Node, <NextNode> stores a reference to nNode₁, and <NextNodeVersion> is 0;

2.3) Repeat step 2.2) until corresponding data records have been created in the storage structure for all successor nodes of Node;

2.4) The method ends;

3) Node has been re-executed. Suppose the previous latest data version number of Node is OldVersion and the new data version number obtained after re-execution is NewVersion; the method performs the following sub-steps:

3.1) Check whether the storage structure contains a data record such that <NextNode> is a reference to Node and <NextNodeVersion> is 0; if such a record exists, the redo of this node was caused by a redo of its predecessor node, so update the <NextNodeVersion> of this record to NewVersion, the new data version number of node Node, and continue with step 3.3); if no such record exists, perform step 3.2);

3.2) Check whether the storage structure contains a data record such that <NextNode> is a reference to Node and <NextNodeVersion> is OldVersion, the previous data version number of Node; if such a record exists, the re-execution of this node was actively initiated, so copy the record, update the <NextNodeVersion> of the copy to NewVersion, the new data version number of node Node, and continue with step 3.3); if no such record exists, the method ends;

3.3) Check whether the storage structure contains a data record such that <CurrentNode> is a reference to Node, <CurrentNodeVersion> is OldVersion, the previous data version number of Node, and <NextNodeVersion> is 0; if such a record exists, update its <NextNodeVersion> to the End value, which prevents the successor nodes of Node from computing with the old data; if no such record exists, skip this step and continue with step 3.4);

3.4) For the first successor node nNode₁ of Node, add a data record to the storage structure in which <CurrentNode> stores a reference to node Node, <CurrentNodeVersion> stores NewVersion, the new data version number of node Node, <NextNode> stores a reference to nNode₁, and <NextNodeVersion> is 0;

3.5) Repeat step 3.4) until corresponding data records have been created in the storage structure for all successor nodes of Node;

3.6) The method ends;

This completes the establishment and update method. By applying the method to each node in the scientific workflow model, a data provenance relation model can be built for the whole scientific workflow model.
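The sub-steps 2.1 to 2.3 and 3.1 to 3.5 above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the in-memory list `table`, the `Record` class, the string version numbers, the names `PENDING` and `END`, and the choice to apply each step to every matching record (the text speaks of "a record") are all assumptions of this sketch.

```python
from dataclasses import dataclass, replace

PENDING = "0"   # <NextNodeVersion> placeholder: the successor has not produced data yet
END = "End"     # constant marking that no successor data was derived from this version

@dataclass
class Record:
    current_node: str     # <CurrentNode>
    current_version: str  # <CurrentNodeVersion>
    next_node: str        # <NextNode>
    next_version: str     # <NextNodeVersion>

def add_successor_records(table, node, version, successors):
    # Steps 2.2-2.3 and 3.4-3.5: one pending record per successor node.
    for nxt in successors:
        table.append(Record(node, version, nxt, PENDING))

def first_execution(table, node, version, successors):
    # Step 2.1: fill in this node's version on records created by its predecessors.
    for rec in table:
        if rec.next_node == node and rec.next_version == PENDING:
            rec.next_version = version
    add_successor_records(table, node, version, successors)

def re_execution(table, node, old_version, new_version, successors):
    pending = [r for r in table if r.next_node == node and r.next_version == PENDING]
    if pending:
        # Step 3.1: the redo was triggered by a redone predecessor.
        for rec in pending:
            rec.next_version = new_version
    else:
        # Step 3.2: actively initiated redo; copy the old links under the new version.
        old = [r for r in table if r.next_node == node and r.next_version == old_version]
        if not old:
            return  # per the description, the method ends here
        table.extend(replace(r, next_version=new_version) for r in old)
    # Step 3.3: close pending links that still start from the old version.
    for rec in table:
        if (rec.current_node == node and rec.current_version == old_version
                and rec.next_version == PENDING):
            rec.next_version = END
    add_successor_records(table, node, new_version, successors)

# Usage on a hypothetical chain A -> B -> C, where B is then redone on its own initiative.
table = []
first_execution(table, "A", "1.0", ["B"])
first_execution(table, "B", "1.0", ["C"])
re_execution(table, "B", "1.0", "2.0", ["C"])
```

After these calls the table holds the original link A:1.0 to B:1.0, its copy A:1.0 to B:2.0, the closed link B:1.0 to C:End, and a new pending link B:2.0 to C:0.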

The data provenance tracing method, characterized in that the tracing method for scientific workflow nodes, based on the data provenance relation model established above, can start from any data version of any node in the scientific workflow, trace forward to find which data versions of predecessor nodes that data version was derived from, and query backward to find which data versions of successor nodes have been derived from it; the method is divided into two sub-processes, forward tracing and backward derivation;

The inputs to the tracing method comprise the established scientific workflow model, together with a node in the scientific workflow that serves as the base point of the tracing operation and a data version of that node. Suppose the input node is Node and its version number is Version; the two form a two-tuple (Node, Version). For convenience of processing, a queue data structure Q is created to store such two-tuples. The tracing method is carried out as follows:

● Forward tracing process:

1) First add the two-tuple (Node, Version), formed by node Node and its version number Version, to the pending queue Q;

2) Using the established scientific workflow model, find all predecessor nodes of Node, denoted pNode₁, pNode₂, …, pNodeₘ; if there is no predecessor node, jump to step 5);

3) In the established provenance data structure, look for a data record such that <CurrentNode> is a reference to pNode₁, <NextNode> is a reference to Node, and <NextNodeVersion> is Version; if such a record exists, note its <CurrentNodeVersion> value, say pVersion, which shows that the Version data of Node was computed from the pVersion data of its predecessor node pNode₁; if the two-tuple (pNode₁, pVersion) is not yet in the pending queue Q, add it to Q; if no such record exists, continue with step 4);

4) Repeat step 3) to look up the version numbers of the predecessor nodes pNode₂, …, pNodeₘ that correspond to the Version data of Node, and add them to the pending queue Q;

5) Delete (Node, Version) from queue Q;

6) Select the next two-tuple from queue Q in order and repeat steps 2) to 5), thereby recursively tracing the data provenance relations forward until queue Q is empty;

● Backward derivation process:

1) First add the two-tuple (Node, Version), formed by node Node and its version number Version, to the pending queue Q;

2) Determine whether the version number Version is End or 0; if it is End or 0, jump to step 5); otherwise, using the established scientific workflow model, find all successor nodes of Node that have finished executing and are not in the to-be-redone state, denoted nNode₁, nNode₂, …, nNodeₙ; if there is no such successor node, jump to step 5);

3) In the established provenance data structure, look for a data record such that <CurrentNode> is a reference to Node, <CurrentNodeVersion> is Version, <NextNode> is a reference to nNode₁, and the version number in <NextNodeVersion> is the latest; if such a record exists, note its <NextNodeVersion> value, say nVersion, which shows that the latest data version of successor node nNode₁ computed from the Version data of Node is nVersion; if the two-tuple (nNode₁, nVersion) is not yet in the pending queue Q, add it to Q; if no such record exists, perform step 4);

4) Repeat step 3) to look up the latest version numbers of the successor nodes nNode₂, …, nNodeₙ that correspond to the Version data of Node, record them, and add them to the pending queue Q;

5) Delete (Node, Version) from queue Q;

6) Select the next two-tuple from queue Q in order and repeat steps 2) to 5), thereby recursively deriving the data provenance relations backward until queue Q is empty;

The predecessor and successor data versions of the Version data of node Node are thus obtained. Combined with the predecessor and successor relations between nodes defined in the scientific workflow model, each data version can be labeled on its corresponding node to form a data provenance relation model; by extending the output method of the scientific workflow model, this data provenance relation model can then be output.
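The two queue-based sub-processes can be sketched in Python as below. Several points are assumptions of this sketch rather than statements of the patent: records are plain named tuples held in a list in creation order, so the last matching record is taken as the "latest" successor version; a `seen` set replaces the check against the contents of Q; successors whose version is still 0 or End are treated as not executed; and the workflow model is reduced to two hypothetical dictionaries, `predecessors` and `successors`.

```python
from collections import deque, namedtuple

Record = namedtuple("Record", "current_node current_version next_node next_version")
PENDING, END = "0", "End"

def trace_forward(table, predecessors, node, version):
    """Collect the (node, version) pairs that (node, version) was derived from."""
    queue, seen = deque([(node, version)]), {(node, version)}
    while queue:
        cur, ver = queue.popleft()               # steps 5-6: remove, then move on
        for pre in predecessors.get(cur, []):    # step 2: every predecessor node
            for rec in table:                    # steps 3-4: matching link records
                if (rec.current_node == pre and rec.next_node == cur
                        and rec.next_version == ver):
                    pair = (pre, rec.current_version)
                    if pair not in seen:
                        seen.add(pair)
                        queue.append(pair)
    return seen - {(node, version)}

def derive_backward(table, successors, node, version):
    """Collect the latest (node, version) pairs derived from (node, version)."""
    queue, seen = deque([(node, version)]), {(node, version)}
    while queue:
        cur, ver = queue.popleft()
        if ver in (END, PENDING):                # step 2: nothing was derived
            continue
        for nxt in successors.get(cur, []):
            versions = [r.next_version for r in table
                        if r.current_node == cur and r.current_version == ver
                        and r.next_node == nxt
                        and r.next_version not in (END, PENDING)]
            if versions:
                pair = (nxt, versions[-1])       # step 3: keep only the latest link
                if pair not in seen:
                    seen.add(pair)
                    queue.append(pair)
    return seen - {(node, version)}

# Hypothetical chain A -> B -> C in which B was redone once (1.0 -> 2.0).
table = [
    Record("A", "1.0", "B", "1.0"),
    Record("B", "1.0", "C", END),
    Record("A", "1.0", "B", "2.0"),
    Record("B", "2.0", "C", "1.0"),
]
predecessors = {"B": ["A"], "C": ["B"]}
successors = {"A": ["B"], "B": ["C"]}
ancestors = trace_forward(table, predecessors, "C", "1.0")
descendants = derive_backward(table, successors, "A", "1.0")
```

In this example, version 1.0 of C traces back to B:2.0 and A:1.0, while version 1.0 of A derives forward only to the latest chain B:2.0 and C:1.0; the superseded B:1.0 has no derived data because its link to C was closed with End.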

Beneficial effects of the invention: the new data provenance tracing method proposed by the invention, namely the data provenance tracing method based on multi-state scientific workflow, integrates data provenance technology into the original scientific workflow model, adding the data states of the model at run time and the history of model data. The invention can better describe the evolution and state of data in large-scale, complex scientific computing and in collaborative design and development processes, thereby strengthening the ability to monitor the process, enabling comprehensive management of the process, improving scientific research efficiency, and promoting scientific development and technical progress.

Compared with the prior art, the method of the invention has the following advantages:

Clear logic: with the data storage method and tracing method provided by the invention, a network-shaped data provenance graph can be generated from front to back over the large amount of data produced during scientific workflow execution, following the natural computation order and the data input/output relations. In this graph, the position and state of each data item, and the state of its predecessor and successor data, can be located clearly, and parts or branches of the graph can be extracted for analysis as needed.

Easy to implement: the tracing method provided by the invention is implemented in software and uses a recursive algorithm to process data relations, so hardware requirements are low. In addition, depending on the actual situation, the method can either be built directly on the data store using a database language or be implemented on abstract data objects using a suitable programming language.

Strong adaptability: under the unified process-data management model of the multi-state scientific workflow, each data item is bound to a workflow process node, so the resulting data provenance relations are updated automatically as the relations between workflow process nodes are configured and managed.

Brief description of the drawings

When the following drawings describe the data version provenance relation between two associated nodes, for convenience the data of version v of a node n is written n:v, and the latest data version of a node n is written n:[Latest].

Fig. 1: Level-0 data flow diagram between the data provenance relation management function and the scientific workflow system

Fig. 2: Level-1 data flow diagram between the data provenance relation management function and the scientific workflow system

Fig. 3: Flowchart of the data provenance relation establishment and update function (first execution of a node)

Fig. 4: Flowchart of the data provenance relation establishment and update function (re-execution of a node)

Fig. 5: Flowchart of the data provenance tracing method (forward tracing)

Fig. 6: Flowchart of the data provenance tracing method (backward derivation)

Fig. 7: A simple scientific workflow model example

Fig. 8: Forward tracing process for the version v1.0 data of node 2

Fig. 9: Backward derivation process for the version v1.0 data of node 2

Fig. 10: Data provenance graph of the version v1.0 data of node 2

Fig. 11: Forward-traced data provenance graph of the version v1.0 data of node 4

Fig. 12: Backward-derived data provenance graph of the version v1.0 data of node 4

Fig. 13: Data provenance graph of the version v1.0 data of node 4

Fig. 1 shows, in the deployment environment of the invention, the data flow relation between the data provenance relation management function and the node operation function of the scientific workflow system. After the user inputs the data computed by a node into the scientific workflow system, the system sends the model, node, and version data to the data provenance relation management subsystem for processing and storage. When the user inputs a specified node name and version number to the data provenance relation management subsystem, the subsystem can query and trace, and output the data provenance relation data.

Refining the data provenance relation management function in Fig. 1 further yields Fig. 2. In Fig. 2, the function is divided into a data provenance relation establishment and update function and a data provenance relation query function.

Fig. 3 describes the specific execution flow, within the data provenance relation establishment and update function, for processing the data produced by a node executed for the first time. To store the data version provenance relation between two associated nodes, each relation record contains predecessor node information, predecessor node data version information, successor node information, and successor node data version information. For a node executed for the first time, processing its computed data record according to the flow in Fig. 3 adds the node's data information to the data provenance relations.

Fig. 4 describes the specific execution flow, within the data provenance relation establishment and update function, for processing the data produced by a re-executed node. If a node is re-executed for some reason, the cause of the re-execution must be determined and handled accordingly, after which new provenance relation records are created and stored.

Fig. 3 and Fig. 4 describe the execution flow of the data provenance relation establishment and update function; for processing details, refer to the explanation in the embodiments below.

Fig. 5 and Fig. 6 describe the processing flow of the data provenance tracing method of the invention. For a specified node and a specified data version of that node, the data provenance model can be queried and traced to find which predecessor node data this data version was derived from and which successor node data were subsequently computed from it. The tracing method is therefore divided into two sub-processes, forward tracing and backward derivation: Fig. 5 shows the processing flow of the forward tracing sub-process, and Fig. 6 shows the processing flow of the backward derivation sub-process. For processing details, refer to the explanation in the embodiments below.

Detailed description of the embodiments

The invention builds on the scientific workflow model to establish, analyze, and trace data provenance. The embodiments are as follows:

1. Establishing the storage structure

The data provenance tracing method of the invention does not depend on any fixed data storage format; depending on the implementation environment, the storage can be a database table, a data storage object, and so on. The method computes and stores the data version provenance relations between scientific workflow nodes, and therefore requires that each entry in the data storage structure (denoted Table for ease of description) store at least the following node information:

● a reference to the current node (denoted <CurrentNode>);

● the data version number of the current node (denoted <CurrentNodeVersion>);

● a reference to a successor node of the current node (denoted <NextNode>);

● the data version number of that successor node (denoted <NextNodeVersion>);

A data version number can be a reference to a particular data version and differs between implementation environments. Data version management is outside the scope of the invention, however, so a data version number is treated here simply as a value. A constant data version End is defined here to indicate that the corresponding node has not produced data.

2. Establishing and updating data provenance relations

The invention provides a method for establishing and updating data provenance relations between scientific workflow nodes. The method can be bound to the execute-operation event of a workflow node: once the node finishes executing and produces new data, the establishment and update method is invoked automatically to maintain the stored provenance data.

The inputs to the establishment and update method comprise the established scientific workflow model and a node in the scientific workflow that has just finished executing. The node should be treated as an ordinary node of the scientific workflow; its predecessor and successor node information can be obtained from the scientific workflow model, along with its state (first execution or redo). For convenience of explanation, the following assumptions are made:

● Node is a node in the workflow;

● Node has n successor nodes in total, denoted nNode₁, nNode₂, …, nNodeₙ.

When node Node finishes executing and produces new data, the establishment and update method is invoked. The method must first determine whether Node executed for the first time or was re-executed, and it handles the two cases differently. Note that when a node is re-executed and the data version it produces changes, all of its successor nodes enter the to-be-re-executed state, meaning that they must recompute; this follows from the characteristics of the scientific workflow model. A node may be re-executed actively, or passively because a predecessor node was re-executed, and the stored data pedigree records differ between the two cases. Therefore, when a node is re-executed, the method must determine whether the re-execution was caused by re-execution of a predecessor node. The method proceeds as follows:

1) Determine whether node Node is executing for the first time. If so, the method continues with step 2 below; otherwise Node is being re-executed, and the method continues with step 3 below.

2) Node is executing for the first time. The method performs the following sub-steps:

2.1) In storage structure Table, check whether a record exists in which <NextNode> is a reference to Node and <NextNodeVersion> is 0. If such a record exists, update its <NextNodeVersion> to the data version number of node Node; otherwise skip this step.

2.2) For the first successor node nNode₁ of Node, add a record to storage structure Table in which <CurrentNode> stores a reference to Node, <CurrentNodeVersion> stores the data version number of Node, <NextNode> stores a reference to nNode₁, and <NextNodeVersion> is 0.

2.3) Repeat step 2.2 until a corresponding record has been created in storage structure Table for every successor node of Node.

2.4) The method ends.

3) Node is being re-executed. Suppose the previous latest data version number of Node is OldVersion and the new data version number obtained after re-execution is NewVersion. The method performs the following sub-steps:

3.1) In storage structure Table, check whether a record exists in which <NextNode> is a reference to Node and <NextNodeVersion> is 0. If such a record exists, the redo of this node was caused by the redo of a predecessor node; update the <NextNodeVersion> of this record to the new data version number NewVersion of node Node and continue with step 3.3. Otherwise, go to step 3.2.

3.2) In storage structure Table, check whether a record exists in which <NextNode> is a reference to Node and <NextNodeVersion> is the previous data version number OldVersion of Node. If such a record exists, this node was re-executed on its own initiative; copy the record, update the <NextNodeVersion> of the copy to the new data version number NewVersion of node Node, and continue with step 3.3. Otherwise, the algorithm ends.

3.3) In storage structure Table, check whether a record exists in which <CurrentNode> is a reference to Node, <CurrentNodeVersion> is the previous data version number OldVersion of Node, and <NextNodeVersion> is 0. If such a record exists, update its <NextNodeVersion> to the value End, which prevents the successor nodes of Node from computing with the old data. Otherwise skip this step and continue with step 3.4.

3.4) For the first successor node nNode₁ of Node, add a record to storage structure Table in which <CurrentNode> stores a reference to Node, <CurrentNodeVersion> stores the new data version number NewVersion of Node, <NextNode> stores a reference to nNode₁, and <NextNodeVersion> is 0.

3.5) Repeat step 3.4 until a corresponding record has been created in storage structure Table for every successor node of Node.

3.6) The method ends.
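The first-execution and re-execution branches described above can be condensed into one function. The following is a minimal sketch under stated assumptions, not the implementation of the invention: the function name `update_pedigree`, the dictionary keys, the string encoding of versions, and the convention that `old_version=None` signals a first execution are all hypothetical. Where the text speaks of "a record", the sketch updates every matching record, since a node with several predecessors has several incoming records.

```python
END = "End"      # constant from the text
PENDING = "0"    # the placeholder 0 written into <NextNodeVersion>

def update_pedigree(table, successors, node, new_version, old_version=None):
    """Maintain `table` after `node` finishes; old_version=None means first execution."""
    pending = [r for r in table if r["next"] == node and r["next_ver"] == PENDING]
    for r in pending:                                   # step 2.1 / step 3.1
        r["next_ver"] = new_version
    if old_version is not None:                         # step 3: re-execution
        if not pending:                                 # step 3.2: active redo
            olds = [r for r in table
                    if r["next"] == node and r["next_ver"] == old_version]
            if not olds:
                return                                  # the algorithm ends
            for r in olds:
                table.append({**r, "next_ver": new_version})
        for r in table:                                 # step 3.3
            if (r["cur"] == node and r["cur_ver"] == old_version
                    and r["next_ver"] == PENDING):
                r["next_ver"] = END
    for nxt in successors.get(node, []):                # steps 2.2-2.3 / 3.4-3.5
        table.append({"cur": node, "cur_ver": new_version,
                      "next": nxt, "next_ver": PENDING})

# Hypothetical three-node chain 2 -> 4 -> 6, where node 6 never runs.
successors = {2: [4], 4: [6], 6: []}
table = []
update_pedigree(table, successors, 2, "1.0")
update_pedigree(table, successors, 4, "1.0")
update_pedigree(table, successors, 4, "2.0", old_version="1.0")  # active redo of node 4
for record in table:
    print(record)
```

In this usage the record for node 4 version 1.0 ends with <NextNodeVersion> set to End, because node 6 had not consumed that version before node 4 was redone.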

3. Data pedigree tracing method

The invention provides a data pedigree tracing method for scientific workflow nodes. Based on the data pedigree relation model established above, and starting from any data version of any node in the scientific workflow, the method can trace forward to find which data versions of predecessor nodes that version was derived from, and query backward to find which data versions of successor nodes have been derived from it. The method is therefore divided into two sub-processes: forward tracing and backward derivation.

In fact, under the establishment and maintenance method defined above, the predecessor data version of any given data version of a node is definite and unique: data derived from different versions of a predecessor node necessarily carry different version numbers. By contrast, a given data version of a node may have several successor data versions, because different versions of a successor node's data may be computed from the same version of the predecessor node's data.

The input of the data pedigree tracing method comprises the established scientific workflow model, together with a node in the scientific workflow and one data version of that node, which serve as the starting point of the tracing operation. Suppose the input node is Node and its version number is Version; the two form a two-tuple (Node, Version). For ease of processing, a queue data structure Q is created to store such two-tuples. The tracing method is carried out as follows:

● Forward tracing process:

1) Add the two-tuple (Node, Version), formed from node Node and its version number Version, to the pending queue Q.

2) From the established scientific workflow model, find all predecessor nodes of Node, denoted pNode₁, pNode₂, …, pNodeₘ. If there are no predecessor nodes, jump to step 5.

3) In the established pedigree data structure, search for a record in which <CurrentNode> is a reference to pNode₁, <NextNode> is a reference to Node, and <NextNodeVersion> is Version. If such a record exists, note its <CurrentNodeVersion> value (call it pVersion); this shows that version Version of Node's data was computed from version pVersion of the data of predecessor node pNode₁. If the two-tuple (pNode₁, pVersion) is not already in the pending queue Q, add it to Q. If no such record exists, continue with step 4.

4) Repeat step 3 to find the version numbers of predecessor nodes pNode₂, …, pNodeₘ that correspond to version Version of Node's data, and add them to the pending queue Q.

5) Delete (Node, Version) from queue Q.

6) Take the next two-tuple from queue Q in order and repeat steps 2 to 5, thereby tracing the data pedigree relations forward recursively until queue Q is empty.
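The forward tracing steps above amount to a breadth-first walk over the stored records. The sketch below is illustrative only: the function name `trace_forward`, the tuple layout `(CurrentNode, CurrentNodeVersion, NextNode, NextNodeVersion)`, and the `PRED` map are assumptions, and the sample rows are copied from Table 8 of Embodiment 1. It also keeps a `visited` list so that a tuple already processed is never queued again, which is slightly stricter than the "not already in Q" check in step 3.

```python
from collections import deque

# Rows copied from Table 8 of Embodiment 1 (ID column omitted).
TABLE8 = [
    (1, "1.0", 2, "1.0"), (2, "1.0", 3, "1.0"), (2, "1.0", 4, "1.0"),
    (3, "1.0", 5, "1.0"), (5, "1.0", 7, "1.0"), (4, "1.0", 6, "1.0"),
    (6, "1.0", 7, "End"), (2, "1.0", 4, "2.0"), (4, "2.0", 6, "2.0"),
    (6, "2.0", 7, "1.0"),
]
# Predecessor lists of the workflow model (assumed; inferred from the rows above).
PRED = {1: [], 2: [1], 3: [2], 4: [2], 5: [3], 6: [4], 7: [5, 6]}

def trace_forward(table, predecessors, node, version):
    """Return every (node, version) the given data version was derived from."""
    queue = deque([(node, version)])              # step 1
    visited = [(node, version)]
    while queue:                                  # step 6
        cur, ver = queue[0]
        for p in predecessors.get(cur, []):       # steps 2 to 4
            for r in table:
                if r[0] == p and r[2] == cur and r[3] == ver:
                    pair = (p, r[1])
                    if pair not in visited:
                        visited.append(pair)
                        queue.append(pair)
        queue.popleft()                           # step 5
    return visited

print(trace_forward(TABLE8, PRED, 2, "1.0"))
print(trace_forward(TABLE8, PRED, 7, "1.0"))
```

The first call corresponds to the forward query for node 2 in Embodiment 1; the second shows that version 1.0 of node 7 traces back through version 2.0 of nodes 6 and 4 rather than their discarded version 1.0.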

● Backward derivation process:

As noted above, the successor data versions derived from a given data version of a node may not be unique. The backward derivation process therefore looks up only the latest data version of each successor node by default.

1) Add the two-tuple (Node, Version) to the pending queue Q.

2) Determine whether the version number Version is End or 0. If it is, jump to step 5. Otherwise, from the established scientific workflow model, find all successor nodes of Node that have finished executing and are not in the to-be-redone state, denoted nNode₁, nNode₂, …, nNodeₙ. If there are no such successor nodes, jump to step 5.

3) In the established pedigree data structure, search for a record in which <CurrentNode> is a reference to Node, <CurrentNodeVersion> is Version, <NextNode> is a reference to nNode₁, and the version number in <NextNodeVersion> is the latest. If such a record exists, note its <NextNodeVersion> value (call it nVersion); this shows that the latest data version of successor node nNode₁ computed from version Version of Node's data is nVersion. If the two-tuple (nNode₁, nVersion) is not already in the pending queue Q, add it to Q. If no such record exists, go to step 4.

4) Repeat step 3 to find the latest version numbers of successor nodes nNode₂, …, nNodeₙ that correspond to version Version of Node's data, record them, and add them to the pending queue Q.

5) Delete (Node, Version) from queue Q.

6) Take the next two-tuple from queue Q in order and repeat steps 2 to 5, thereby deriving the data pedigree relations backward recursively until queue Q is empty.

In the backward derivation process, if all data versions of the successor nodes are to be derived, only the search in step 3) needs to be modified, by removing the restriction that the version number in <NextNodeVersion> be the latest. However, in a scientific workflow model where each node has many data versions, this can easily cause the number of queries to grow exponentially and seriously degrade system efficiency, so deriving only the latest data version of each successor node is recommended. Another feasible approach is to record all data versions of the successor nodes and derive the specified data version dynamically according to the user's selection.
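Backward derivation, including the optional "all versions" variant just discussed, can be sketched as follows. This is an illustration under assumptions, not the implementation of the invention: the names `derive_backward`, `version_key`, and `latest_only` are hypothetical, version numbers are assumed to be numeric strings so that "latest" can be decided by comparing them as floats, and the sketch does not track node execution state, so the "executed and not awaiting redo" filter of step 2 is omitted. The sample rows are copied from Table 8 of Embodiment 1.

```python
from collections import deque

PENDING, END = "0", "End"
TABLE8 = [  # (CurrentNode, CurrentNodeVersion, NextNode, NextNodeVersion)
    (1, "1.0", 2, "1.0"), (2, "1.0", 3, "1.0"), (2, "1.0", 4, "1.0"),
    (3, "1.0", 5, "1.0"), (5, "1.0", 7, "1.0"), (4, "1.0", 6, "1.0"),
    (6, "1.0", 7, "End"), (2, "1.0", 4, "2.0"), (4, "2.0", 6, "2.0"),
    (6, "2.0", 7, "1.0"),
]
SUCC = {1: [2], 2: [3, 4], 3: [5], 4: [6], 5: [7], 6: [7], 7: []}  # assumed edges

def version_key(v):
    """Order versions numerically; the placeholders 0 and End sort lowest."""
    return -1.0 if v in (PENDING, END) else float(v)

def derive_backward(table, successors, node, version, latest_only=True):
    """Return every (node, version) derived from the given data version."""
    queue = deque([(node, version)])                      # step 1
    seen, result = [(node, version)], []
    while queue:                                          # step 6
        cur, ver = queue[0]
        if ver not in (PENDING, END):                     # step 2
            result.append((cur, ver))
            for nxt in successors.get(cur, []):           # steps 3 and 4
                hits = [r[3] for r in table
                        if r[0] == cur and r[1] == ver and r[2] == nxt]
                if latest_only and hits:
                    hits = [max(hits, key=version_key)]
                for h in hits:
                    if (nxt, h) not in seen:
                        seen.append((nxt, h))
                        queue.append((nxt, h))
        queue.popleft()                                   # step 5
    return result

print(derive_backward(TABLE8, SUCC, 2, "1.0"))
print(derive_backward(TABLE8, SUCC, 4, "1.0"))
```

Passing `latest_only=False` follows every recorded successor version instead of only the latest one, which is the costlier variant the paragraph above advises against for workflows with many versions per node.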

To further illustrate the present invention, an example is given below:

Embodiment 1:

Fig. 7 shows a simple scientific workflow model with 7 nodes. The model obeys the following rules:

1) each execution of a node produces one version of data;

2) a node can start executing only after all of its predecessor nodes have finished executing and produced data;

3) a node that has been executed once can be re-executed;

4) if a node is re-executed, all of its successor nodes must also be re-executed.
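Rules 2 and 4 can be expressed in a few lines. This is an illustrative sketch only: the successor lists in `SUCC` are an assumption inferred from the node pairs recorded in Tables 5 to 8 of this embodiment (Fig. 7 itself is not reproduced here), and the helper names `ready` and `descendants` are hypothetical.

```python
# Successor lists for the 7-node model (assumed; inferred from Tables 5 to 8).
SUCC = {1: [2], 2: [3, 4], 3: [5], 4: [6], 5: [7], 6: [7], 7: []}

def ready(node, done):
    """Rule 2: a node may start only when every predecessor has produced data."""
    preds = [p for p, nexts in SUCC.items() if node in nexts]
    return all(p in done for p in preds)

def descendants(node):
    """Rule 4: all nodes downstream of `node`, in breadth-first order."""
    found, frontier = [], list(SUCC[node])
    while frontier:
        n = frontier.pop(0)
        if n not in found:
            found.append(n)
            frontier.extend(SUCC[n])
    return found

print(ready(7, {5}))      # node 6 has not produced data yet
print(descendants(4))     # nodes affected when node 4 is re-executed
```

Under these assumed edges, node 7 cannot start until both node 5 and node 6 have produced data, and a redo of node 4 affects nodes 6 and 7.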

For this model, the example performs the following sequence of operations:

1) execute nodes 1, 2, 3, 5, 4, 6 in order; each execution produces data of version 1.0;

2) re-execute node 4; the re-execution produces data of version 2.0;

3) re-execute all already-executed successor nodes of node 4, obtaining new data of version 2.0;

4) execute node 7; the data version obtained is 1.0;

5) query the forward and backward data pedigree relations of the version 1.0 data of node 2;

6) query the forward and backward data pedigree relations of the version 1.0 data of node 4.

The data pedigree processing method of the present invention is applied to this model as follows:

1. Create the storage structure

In a relational database (such as SQL Server), create table DataProvenance with the following structure:

Table 1: Storage structure of table DataProvenance
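A table of this kind could be created as sketched below. The column names are those shown in the header of Tables 2 to 8; everything else is an assumption made for the example, including the column types, the auto-incrementing ID, and the use of an in-memory SQLite database in place of SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # SQLite stands in for SQL Server here
conn.execute(
    """
    CREATE TABLE DataProvenance (
        ID                 INTEGER PRIMARY KEY AUTOINCREMENT,
        CurrentNode        INTEGER NOT NULL,
        CurrentNodeVersion TEXT    NOT NULL,
        NextNode           INTEGER NOT NULL,
        NextNodeVersion    TEXT    NOT NULL   -- a version, '0' (pending), or 'End'
    )
    """
)

# The first record of the example: node 1, version 1.0, successor node 2 pending.
conn.execute(
    "INSERT INTO DataProvenance "
    "(CurrentNode, CurrentNodeVersion, NextNode, NextNodeVersion) "
    "VALUES (?, ?, ?, ?)",
    (1, "1.0", 2, "0"),
)
rows = conn.execute("SELECT * FROM DataProvenance").fetchall()
print(rows)
```

Version numbers are stored as text in this sketch so that the placeholder 0 and the constant End can share a column with ordinary version values.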

2. Execute nodes 1 to 6 in order

The scientific workflow model starts executing. Node 1 is executed first and produces data of version 1.0. After node 1 finishes, the data pedigree establishment and update operation is triggered and, following the method of the present invention, proceeds as follows:

1) Node 1 is determined to be executing for the first time, so the data pedigree establishment operation is performed;

2) In table DataProvenance, check whether a record exists in which <NextNode> is 1 and <NextNodeVersion> is 0. The query finds no such record, so execution continues;

3) Insert the following record into table DataProvenance for the successor node of node 1 (node 2):

CurrentNode	CurrentNodeVersion	NextNode	NextNodeVersion
1	1.0	2	0

After the above steps, the data stored in database table DataProvenance are as follows:

Table 2: Contents of database table DataProvenance (1)

ID	CurrentNode	CurrentNodeVersion	NextNode	NextNodeVersion
1	1	1.0	2	0

The workflow model goes on to execute node 2, which likewise produces data with version number 1.0 and triggers the above procedure again:

1) Node 2 is determined to be executing for the first time, so the data pedigree establishment operation is performed;

2) In table DataProvenance, check whether a record exists in which <NextNode> is 2 and <NextNodeVersion> is 0. The query finds such a record, with primary key ID 1, and its <NextNodeVersion> is updated to 1.0. After this step, the data stored in database table DataProvenance are as follows:

Table 3: Contents of database table DataProvenance (2)

ID	CurrentNode	CurrentNodeVersion	NextNode	NextNodeVersion
1	1	1.0	2	1.0

3) Insert the following records into table DataProvenance for the successor nodes of node 2 (nodes 3 and 4):

CurrentNode	CurrentNodeVersion	NextNode	NextNodeVersion
2	1.0	3	0
2	1.0	4	0

Table 4: Contents of database table DataProvenance (3)

ID	CurrentNode	CurrentNodeVersion	NextNode	NextNodeVersion
1	1	1.0	2	1.0
2	2	1.0	3	0
3	2	1.0	4	0

The scientific workflow model goes on to execute nodes 3, 5, 4, 6 in turn, repeating the above process. After execution, the data stored in database table DataProvenance are as follows:

Table 5: Contents of database table DataProvenance (4)

ID	CurrentNode	CurrentNodeVersion	NextNode	NextNodeVersion
1	1	1.0	2	1.0
2	2	1.0	3	1.0
3	2	1.0	4	1.0
4	3	1.0	5	1.0
5	5	1.0	7	0
6	4	1.0	6	1.0
7	6	1.0	7	0

3. Re-execute node 4

Next, the scientific workflow model re-executes node 4 once: it recomputes on the basis of the input data from node 2 and obtains new data with version number 2.0. After node 4 is re-executed, the data pedigree establishment and update operation is again triggered and, following the method of the present invention, proceeds as follows:

1) Node 4 is determined to be re-executing, so the data pedigree update operation is performed;

2) In table DataProvenance, check whether a record exists in which <NextNode> is 4 and <NextNodeVersion> is 0. The query finds no such record, so execution continues;

3) In table DataProvenance, check whether a record exists in which <NextNode> is 4 and <NextNodeVersion> is 1.0. The query finds such a record, namely the record with primary key 3 in Table 5. Based on this record, insert the following record into table DataProvenance:

CurrentNode	CurrentNodeVersion	NextNode	NextNodeVersion
2	1.0	4	2.0

4) In table DataProvenance, check whether a record exists in which <CurrentNode> is 4, <CurrentNodeVersion> is 1.0, and <NextNodeVersion> is 0. The query finds no such record, so execution continues;

5) Insert the following record into table DataProvenance for the successor node of node 4 (node 6):

CurrentNode	CurrentNodeVersion	NextNode	NextNodeVersion
4	2.0	6	0

Table 6: Contents of database table DataProvenance (5)

ID	CurrentNode	CurrentNodeVersion	NextNode	NextNodeVersion
1	1	1.0	2	1.0
2	2	1.0	3	1.0
3	2	1.0	4	1.0
4	3	1.0	5	1.0
5	5	1.0	7	0
6	4	1.0	6	1.0
7	6	1.0	7	0
8	2	1.0	4	2.0
9	4	2.0	6	0

4. Re-execute all already-executed successor nodes of node 4

Because node 4 was re-executed, node 6 must also be re-executed, and it obtains new data of version 2.0. After node 6 finishes, the following data pedigree update operation is performed:

1) Node 6 is determined to be re-executing, so the data pedigree update operation is performed;

2) In table DataProvenance, check whether a record exists in which <NextNode> is 6 and <NextNodeVersion> is 0. The query finds such a record, namely the record with primary key 9 in Table 6, and its <NextNodeVersion> is changed to 2.0;

3) In table DataProvenance, check whether a record exists in which <CurrentNode> is 6, <CurrentNodeVersion> is 1.0, and <NextNodeVersion> is 0. The query finds such a record, namely the record with primary key 7 in Table 6, and its <NextNodeVersion> is changed to End;

4) Insert the following record into table DataProvenance for the successor node of node 6 (node 7):

CurrentNode	CurrentNodeVersion	NextNode	NextNodeVersion
6	2.0	7	0

Table 7: Contents of database table DataProvenance (6)

ID	CurrentNode	CurrentNodeVersion	NextNode	NextNodeVersion
1	1	1.0	2	1.0
2	2	1.0	3	1.0
3	2	1.0	4	1.0
4	3	1.0	5	1.0
5	5	1.0	7	0
6	4	1.0	6	1.0
7	6	1.0	7	End
8	2	1.0	4	2.0
9	4	2.0	6	2.0
10	6	2.0	7	0

5. Execute node 7

The scientific workflow model executes node 7 using the version 1.0 data of node 5 and the version 2.0 data of node 6, obtaining the version 1.0 data of node 7. Because node 7 is executing for the first time, its data pedigree establishment process is the same as the first-execution process of nodes 1 to 6 and is not repeated here. After execution, the data stored in database table DataProvenance are as follows:

Table 8: Contents of database table DataProvenance (7)

ID	CurrentNode	CurrentNodeVersion	NextNode	NextNodeVersion
1	1	1.0	2	1.0
2	2	1.0	3	1.0
3	2	1.0	4	1.0
4	3	1.0	5	1.0
5	5	1.0	7	1.0
6	4	1.0	6	1.0
7	6	1.0	7	End
8	2	1.0	4	2.0
9	4	2.0	6	2.0
10	6	2.0	7	1.0

6. Query the data pedigree relations of the version 1.0 data of node 2

The data pedigree tracing method of the present invention is invoked on the version 1.0 data of node 2 and proceeds as follows:

● Forward tracing process:

1) Create a queue Q and add (2, 1.0) to Q.

2) According to the scientific workflow model, the predecessor node of node 2 is found to be node 1.

3) In table DataProvenance, check whether a record exists in which <CurrentNode> is 1, <NextNode> is 2, and <NextNodeVersion> is 1.0. The query finds such a record, namely the record with primary key 1 in Table 8. Note the <CurrentNodeVersion> value 1.0 of this record and add (1, 1.0) to Q.

4) Delete (2, 1.0) from Q and select (1, 1.0) for processing.

5) Trace forward from the version 1.0 data of node 1; node 1 is found to have no predecessor nodes.

6) Delete (1, 1.0) from Q.

7) Q is empty, so forward tracing ends.

The 7 steps above are illustrated by the process in Fig. 8.

● Backward derivation process:

1) Create a queue Q and add (2, 1.0) to Q.

2) According to the scientific workflow model, the successor nodes of node 2 are found to be nodes 3 and 4; these two successor nodes are processed in turn.

3) First, for node 3, check in table DataProvenance whether a record exists in which <CurrentNode> is 2, <CurrentNodeVersion> is 1.0, and <NextNode> is 3. The query finds such a record, namely the record with primary key 2 in Table 8. Because there is only one such record, its <NextNodeVersion> value is also the latest. Note the <NextNodeVersion> value 1.0 of this record and add (3, 1.0) to Q.

4) Similarly, find the latest data version of node 4 derived from the version 1.0 data of node 2, note its version number 2.0, and add (4, 2.0) to Q.

5) Delete (2, 1.0) from Q and select (3, 1.0) for processing.

6) Obtain the data (5, 1.0) derived from (3, 1.0) and add it to Q.

7) Delete (3, 1.0) from Q and select (4, 2.0) for processing.

8) Obtain the data (6, 2.0) derived from (4, 2.0) and add it to Q.

9) Delete (4, 2.0) from Q and select (5, 1.0) for processing.

10) Obtain the data (7, 1.0) derived from (5, 1.0) and add it to Q.

11) Delete (5, 1.0) from Q and select (6, 2.0) for processing.

12) Obtain the data (7, 1.0) derived from (6, 2.0); because (7, 1.0) is already in Q, it need not be added again.

13) Delete (6, 2.0) from Q and select (7, 1.0) for processing.

14) (7, 1.0) has no successor nodes. Delete (7, 1.0) from Q; Q is now empty, so the derivation process ends.

The whole process is shown in Fig. 9.

Combining the results of the forward tracing and backward derivation sub-processes gives the data pedigree relations of the version 1.0 data of node 2, as shown in Fig. 10.

7. Query the data pedigree relations of the version 1.0 data of node 4

The data pedigree tracing method of the present invention is invoked on the version 1.0 data of node 4 and proceeds as follows:

● Forward tracing process:

Forward tracing from the version 1.0 data of node 4 uses the same method as the earlier forward tracing from the version 1.0 data of node 2, and the process is similar, so it is not repeated here. The forward tracing result obtained is shown in Fig. 11.

● Backward derivation process:

1) Create a queue Q and add (4, 1.0) to Q.

2) According to the scientific workflow model, the successor node of node 4 is found to be node 6.

3) For node 6, check in table DataProvenance whether a record exists in which <CurrentNode> is 4, <CurrentNodeVersion> is 1.0, and <NextNode> is 6. The query finds such a record, namely the record with primary key 6 in Table 8. Because there is only one such record, its <NextNodeVersion> value is also the latest. Note the <NextNodeVersion> value 1.0 of this record and add (6, 1.0) to Q.

4) Delete (4, 1.0) from Q and select (6, 1.0) for processing.

5) According to the scientific workflow model, the successor node of node 6 is found to be node 7.

6) For node 7, check in table DataProvenance whether a record exists in which <CurrentNode> is 6, <CurrentNodeVersion> is 1.0, and <NextNode> is 7. The query finds such a record, namely the record with primary key 7 in Table 8. Because there is only one such record, its <NextNodeVersion> value is also the latest. Note the <NextNodeVersion> value End of this record and add (7, End) to Q.

7) Delete (6, 1.0) from Q and select (7, End) for processing.

8) Because the version number of node 7 is End, delete (7, End) from Q; Q is now empty, so the backward derivation process ends.

The above derivation process is shown in Fig. 12. Its final step removes (7, End), which has no computed data version, from the pedigree relation. In fact, because node 4 was re-executed, its version 1.0 data were not used through to the end: they were discarded right after the version 1.0 data of node 6 had been derived from them.

Combining the results of the forward tracing and backward derivation sub-processes gives the data pedigree relations of the version 1.0 data of node 4, as shown in Fig. 13.

List of references

(1) Data acquisition and analysis system with traceability capability, 200810037137.9

(2) Tracing method for complex test data, 200810240154.2

(3) Dynamic workflow model partitioning method supporting distributed execution, 200810055620.X

(4) Method for acquiring services from a workflow system, 201110077073.7

(5) Workflow execution apparatus and workflow execution method, 200910138638.0

Claims

1. A data pedigree traceability system based on multi-state scientific workflow, characterized by comprising: a system server, a client, a relational database, a data operation unit, and a logic computation unit; wherein,

the relational database is used to store workflow activity node information, logic node information, and the dependency relations and constraint sets of data; the relational database is deployed on the system server;

2. The data pedigree traceability system as claimed in claim 1, characterized in that the data operation unit and the logic computation unit are located on the system server.

3. The data pedigree traceability system as claimed in claim 1, characterized in that the data operation unit and the logic computation unit are located on the client.

4. The data pedigree traceability system as claimed in claim 1, characterized in that, when a workflow user queries data pedigree relations, the table entry for the current task and current version is queried directly in the relational database.

5. A data pedigree tracing method based on multi-state scientific workflow, characterized by comprising:

building a relational database to store the activity node information and logic node information of workflow nodes, and the dependency relations and constraint sets of data; the relational database is located on the server;

6. The data pedigree tracing method as claimed in claim 5, characterized in that, when a workflow user queries data pedigree relations, the table entry for the current task and current version is queried directly in the relational database.

7. The data pedigree tracing method as claimed in claim 5, characterized in that each item of workflow data is bound to a workflow node, and the resulting data pedigree relations are updated automatically according to the configuration and management of the relations between workflow nodes.

8. The data pedigree tracing method as claimed in claim 5, characterized in that, in the relational database, the storage structure of each workflow node data record should store at least the following node information: a reference to the current node <CurrentNode>, the data version of the current node <CurrentNodeVersion>, a reference to a successor node of the current node <NextNode>, and the data version of that successor node <NextNodeVersion>.

9. The data pedigree tracing method as claimed in claim 8, characterized in that the method for establishing and updating the data pedigree relations between scientific workflow nodes is as follows:

● Node is a node in the workflow;

When node Node finishes executing and produces new data:

2) Node is executing for the first time; the following sub-steps are performed:

2.4) the method ends;

3.6) the method ends;

10. The data pedigree tracing method as claimed in claim 9, characterized in that the data pedigree tracing method for scientific workflow nodes, based on the data pedigree relation model established above and starting from any data version of any node in the scientific workflow, traces forward to find which data versions of predecessor nodes that version was derived from, and queries backward to find which data versions of successor nodes have been derived from it; the method is divided into two sub-processes, forward tracing and backward derivation;

● Forward tracing process:

5) delete (Node, Version) from queue Q;

● Backward derivation process:

5) delete (Node, Version) from queue Q;