CN103440553A - Workflow matching and finding system, based on provenance, facing proteomic data analysis - Google Patents


Info

Publication number
CN103440553A
CN103440553A CN2013103809827A CN201310380982A
Authority
CN
China
Prior art keywords
workflow
information
module
algorithm
task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2013103809827A
Other languages
Chinese (zh)
Inventor
翟广猛
卢暾
黄兴
陈昭灿
顾宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University filed Critical Fudan University
Priority to CN2013103809827A priority Critical patent/CN103440553A/en
Publication of CN103440553A publication Critical patent/CN103440553A/en
Pending legal-status Critical Current

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention belongs to the technical field of computer-supported scientific workflows and discloses a provenance-based workflow matching and discovery system for proteomic data analysis. When a proteomic data analysis process is being built, the system makes full use of the collected historical process information and its associated provenance information, enabling knowledge reuse, reducing the time and effort spent on process construction, and accelerating data analysis experiments. The internal processing consists of three steps: describing the experimental task with a task-based scientific workflow; using a process matching and provenance discovery algorithm to extract matching instantiated processes together with their provenance information; and integrating the extracted process information and presenting it to researchers, who can reuse or modify the historical process configuration.

Description

Provenance-based workflow matching and discovery system for proteomics data analysis
Technical field
The invention belongs to the technical field of computer-supported scientific workflows, and specifically relates to a provenance-based workflow matching and discovery system for proteomics data analysis.
Background technology
About scientific workflows in e-Science. Scientific workflow technology is receiving more and more attention in scientific research, and researchers have begun to apply a number of mature scientific workflow management systems in their work, which has greatly accelerated scientific discovery. Nowadays, as network technology matures further, some organizations publish scientific resources such as experimental data, computation and analysis tools in the form of web services in order to improve collaboration in research, and other scientists can apply these shared resources to their own work over the network. To organize these distributed resources effectively, researchers have designed scientific workflows and scientific workflow management systems. With the help of such systems, researchers can orchestrate computation-intensive tasks, analyze large batches of data, and integrate distributed and local resources, so that more energy can be devoted to solving domain problems rather than spending excessive time on computation and on the organization and management of data resources.
Typical scientific workflow management systems currently include Taverna, Kepler and Triana. Taverna is an open-source scientific workflow management system used mainly in bioinformatics. It can integrate many kinds of shared web services, including the following types: Arbitrary WSDL, Soaplab, Talisman, Nested workflow, String constant and Local processor. It provides a visual platform on which biologists can use these services to describe and execute data-centered scientific workflows, and it also offers web service discovery, exception handling and provenance collection. Similar to Taverna, Kepler is also an open-source scientific workflow management tool oriented to bioinformatics; besides ordinary shared web services, it can integrate resources such as databases and provides flexible control strategies for combining these resources. Triana is mainly used for the execution of services and workflows in distributed environments; unlike Taverna and Kepler, it can be applied to different fields.
As the complexity of scientific research grows, more and more shared resources are used in research. In order to track the origin of resources, verify the correctness of results and reproduce the steps of an experiment, scientists have begun to incorporate provenance into scientific workflows.
About provenance. Because of the uncertainty of scientific research, a mechanism must be provided that allows scientists to verify the validity of results and the correctness of experiments. In addition, as collaboration becomes more widespread in scientific research, the attribution of contributions also attracts attention. Data provenance (data origin/pedigree) records the history of the experimental data produced during workflow execution; it covers the original data from which a data item derives and all processing steps through which the original data evolved into the current data. As a new technique it has been applied in scientific workflows. This not only solves the problems mentioned above; because provenance supports recording data origin, reproducing and sharing data, explaining differences between results and supporting knowledge reuse, its important role in scientific workflows has been generally recognized.
About proteomics data analysis. Proteomics is a relatively new field of scientific research. Research on proteomics will help explain life science and reveal the secrets of life. In practice, however, proteomics data analysis is a multi-step and complicated process, and using scientific workflows in proteomics data analysis greatly reduces the researchers' burden. In recent years proteomics has developed rapidly; one of the most important driving factors is the growing number of open-source or free data analysis tools. But people often run into the following problems when using these tools: 1. the tools are difficult to install; 2. the tools cannot be used without guidance from professionals; 3. inconsistent result data formats make the tools' interfaces incompatible, so it is difficult to build a data analysis process. Scientific workflows were introduced into the field of proteomics data analysis to solve these problems.
The problems faced. Compared with business activities, scientific research is inherently exploratory and uncertain. Research processes are often tentative and change dynamically, so scientific workflows, unlike business workflows, have no fixed solutions or patterns. These characteristics make the construction of a scientific workflow a laborious job. For a research task, scientists not only have to select a suitable group of services from many candidates to carry out the experimental task, but also have to tune the parameters that control the services' internal behavior before execution. Finding a group of services and parameter values that meet the requirements and characteristics of an experiment often takes repeated trials. This problem is especially obvious in proteomics data analysis. It is therefore very necessary to have a simple and effective search method that helps scientists build the scientific workflows for their experiments. Knowledge reuse in scientific research can serve as a breakthrough point, but how to make full use of historical processes and the related provenance information to help scientists create scientific workflows is still a challenging job.
The present invention designs a provenance-based workflow matching and discovery system precisely to solve the above problems. When a data analysis process is created, the system can provide several groups of service processes, which the user reuses or modifies according to the requirements of the experiment.
Summary of the invention
The object of the invention is to propose a provenance-based workflow matching and discovery system that supports the construction of proteomics data analysis processes.
Workflow management systems such as Taverna, Kepler and Triana all support the collection of provenance information, but the collected provenance is mainly used to record the origin of experimental data and to support the reproduction and sharing of experimental data; they provide little support for knowledge reuse at the workflow creation stage. The system designed by the present invention adds provenance information as a reference factor to the workflow matching and discovery process, so that suitable historical processes can be found and used as references for new research tasks.
1. The processing flow and structure of the system
Fig. 1 shows the provenance-based workflow matching and discovery process, which is divided into three steps. (1) Use an abstract scientific workflow model to describe the data analysis task; (2) taking the workflow in (1) as a template, use the process matching algorithm to find the matching processes (or maximal matching paths) and their corresponding provenance information among the historical processes; (3) present the matching historical processes (or maximal matching paths) and their provenance information to the user in an understandable organizational form.
Based on the above processing flow, the present invention designs the framework of the system. As shown in Fig. 2, the framework comprises: a workflow module, an information discovery module, a database update module, a database, and a workflow engine. Wherein:
The workflow module contains two parts: the initial workflow and the instantiated workflow. The initial workflow is the scientific workflow that describes the data analysis task and consists of abstract tasks; it serves as the workflow template in the workflow matching process. The instantiated workflow is the scientific workflow obtained after a concrete service has been assigned to each abstract task in the initial workflow.
The database comprises HFK (the historical process knowledge repository) and PDC (the provenance data collection), which are used to record, respectively, the historical process information (including the corresponding abstract workflows and the instantiated workflows) and the provenance information generated when they are executed.
The information discovery module consists of three submodules: the process matching and information integration submodule, the process information mining submodule, and the provenance data mining submodule. The process information mining submodule finds, in the historical process knowledge repository, the task nodes that overlap with the workflow template and their corresponding concrete services. The provenance data mining submodule obtains from the provenance data collection the provenance information corresponding to the matching processes. The process matching and information integration submodule integrates the information obtained by the above two submodules and presents the integrated information to the user.
The workflow engine comprises a workflow execution monitoring module and a workflow execution module. The latter is responsible for executing instantiated workflows, and the former is responsible for collecting the provenance information produced during workflow execution.
The database update module updates HFK and PDC each time the workflow execution module finishes executing a workflow.
The data flow relationships between the above modules are as follows:
The user creates an initial workflow (including a description of the requirements on the workflow) according to the task description. Using the initial workflow as the workflow template, the information discovery module finds the matching historical processes in the historical process knowledge repository and the provenance information produced during their execution in the provenance data collection. The system presents the information obtained by the information discovery module to the user, and the user instantiates the initial workflow with this information, obtaining an instantiated workflow. The workflow execution submodule of the workflow engine executes the instantiated workflow, and at the same time the execution monitoring module collects the provenance information produced during execution. The database update module stores the provenance information obtained by the workflow engine and the instantiated workflow information, as historical process information, into the historical process knowledge repository and the provenance data collection of the database module.
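To make this data flow concrete, the following is a minimal Python sketch of one pass through the loop, assuming simplified in-memory stand-ins for HFK, PDC, the discovery step and the workflow engine; all identifiers (discover, run_workflow, the task and service names such as peak_picking or S_mascot) are invented for illustration and are not taken from the patent.

    # Minimal sketch of the module data flow (illustrative names, in-memory stores).

    def discover(template, hfk, pdc):
        """Information discovery: look up services and provenance for template tasks."""
        suggestions = {}
        for task in template["tasks"]:
            services = [s for (t, s), _ in hfk.items() if t == task]
            suggestions[task] = {
                s: [rec for rec in pdc if rec["service"] == s] for s in services
            }
        return suggestions

    def run_workflow(instantiated):
        """Workflow engine: execute and monitor, returning collected provenance."""
        provenance = []
        for task, service in instantiated["bindings"].items():
            provenance.append({"workflow": instantiated["id"],
                               "service": service,
                               "used": f"in_{task}",              # input data id (dummy)
                               "wasGeneratedBy": f"out_{task}"})  # output data id (dummy)
        return provenance

    def update_databases(instantiated, provenance, hfk, pdc):
        """Database update module: record the executed workflow and its provenance."""
        for task, service in instantiated["bindings"].items():
            hfk.setdefault((task, service), set()).add(instantiated["id"])
        pdc.extend(provenance)

    # One pass: template -> discovery -> instantiation -> execution -> update.
    hfk, pdc = {}, []
    template = {"id": "G_template", "tasks": ["peak_picking", "db_search"]}
    print(discover(template, hfk, pdc))          # empty on a fresh system
    instantiated = {"id": "G1", "bindings": {"peak_picking": "S_pp", "db_search": "S_mascot"}}
    prov = run_workflow(instantiated)
    update_databases(instantiated, prov, hfk, pdc)
    print(discover(template, hfk, pdc))          # now suggests S_pp / S_mascot with provenance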
2. Detailed description of each module in the system framework
(1) Abstract task set and concrete service set. When creating the initial workflow and the instantiated workflow, the workflow module of the system uses an abstract task set T and a concrete service set S. T denotes the set of abstract tasks used by the system; the user describes a data analysis process with these tasks. S denotes the concrete services integrated in the system; these services realize the abstract tasks in T. In addition, the processes built by the system are data-centered pipelines.
(2) Scientific workflow. The workflow module uses a task-based workflow modeling method. A scientific workflow is a simple directed graph G=(V, E, F). V is the finite set of abstract task nodes in the process. E is the set of data dependence relations between task nodes. F is the set of mappings from abstract tasks to concrete services. For example, let v_1 and v_2 be two task nodes in V; if v_2 depends on the data of v_1, there is an element e_1=<v_1, v_2> in E expressing this dependence. If there is an element f_i=<v_i, s_i> in F, it means that service s_i is responsible for executing task v_i when the workflow is executed. According to whether F is empty, scientific workflows are divided into two kinds: the initial workflow and the instantiated workflow. The initial workflow is the abstract workflow that describes the research task, and its F is empty. The instantiated workflow is the workflow obtained after the user instantiates the initial workflow through the knowledge reuse process; each task node now has an assigned concrete service, and F records the mappings between tasks and concrete services.
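A minimal Python sketch of this workflow model follows; it uses simple tuples for edges and a dictionary for F, and only illustrates the distinction between an initial workflow (F empty) and an instantiated workflow. The class and the example task/service names are illustrative assumptions, not code from the patent.

    from dataclasses import dataclass, field
    from typing import Dict, List, Tuple

    @dataclass
    class Workflow:
        """G = (V, E, F): task nodes, data-dependence edges, task->service mapping."""
        V: List[str]                                      # abstract task nodes
        E: List[Tuple[str, str]]                          # <v1, v2>: v2 depends on data of v1
        F: Dict[str, str] = field(default_factory=dict)   # task -> concrete service

        def is_initial(self) -> bool:
            # Initial workflow: describes the task abstractly, F is empty.
            return not self.F

        def instantiate(self, bindings: Dict[str, str]) -> "Workflow":
            # Instantiated workflow: every task node has an assigned concrete service.
            assert set(bindings) == set(self.V), "every task needs a service"
            return Workflow(self.V, self.E, dict(bindings))

    # Example: a two-step pipeline, first abstract, then instantiated.
    g0 = Workflow(V=["T1", "T2"], E=[("T1", "T2")])
    print(g0.is_initial())                         # True
    g1 = g0.instantiate({"T1": "S1", "T2": "S2"})
    print(g1.is_initial(), g1.F)                   # False {'T1': 'S1', 'T2': 'S2'}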
(3) Historical process knowledge repository module. Since the system uses a task-based workflow modeling method, the correspondence between tasks and services in historical processes must be recorded as part of the reusable knowledge, and the historical process knowledge repository module is responsible for recording it. Its information is organized as follows: (<task_i, service_i>, {G_1, G_2}) is a basic information unit of the historical process knowledge repository, and its meaning is: the abstract task task_i in G_1 and G_2 was executed by the service service_i when those processes were executed.
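As an illustration of this organization, the sketch below models an HFK unit as a ((task, service), set-of-workflow-ids) pair in a Python dictionary; this representation and the task/service names are assumptions made for clarity, not the patent's storage format.

    # HFK unit: (<task_i, service_i>, {G_1, G_2, ...})
    # meaning: in workflows G_1, G_2, ... the abstract task task_i was executed by service_i.
    HFK = {
        ("peptide_identification", "MascotService"): {"G1", "G3"},
        ("peptide_identification", "XTandemService"): {"G2"},
        ("spectrum_preprocessing", "MzConvertService"): {"G1", "G2", "G3"},
    }

    def record_execution(task, service, workflow_id, hfk):
        """Add one task/service/workflow observation to the repository."""
        hfk.setdefault((task, service), set()).add(workflow_id)

    record_execution("peptide_identification", "MascotService", "G4", HFK)
    print(HFK[("peptide_identification", "MascotService")])  # {'G1', 'G3', 'G4'}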
(4) Provenance data collection module. The provenance data collection module records the provenance information produced when workflows are executed in the system; this information is presented to the user together with the matching processes extracted during workflow matching. Taking into account the characteristics of the system itself, we improve the provenance information organization proposed in OPM (see Fig. 3). Since proteomics data analysis processes are mostly pipelines, the "wasDerivedFrom" and "wasTriggeredBy" dependencies can be derived from the "wasGeneratedBy" and "used" dependencies, so only the first two dependencies are recorded in the provenance data collection module in order to save space. The information in the provenance data collection module is organized as <object_1, object_2>, where there is a partial order relation between object_1 and object_2.
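The sketch below illustrates how the OPM "wasDerivedFrom" relation can be reconstructed from recorded "used" and "wasGeneratedBy" pairs in a pipeline. It assumes PDC is an in-memory list of (workflow_id, dependency, <object_1, object_2>) tuples; the layout and the file/service names are illustrative only.

    # Each PDC entry: (workflow_id, dependency_type, (object_1, object_2)).
    # Only "used" (process used data) and "wasGeneratedBy" (data generated by process) are stored.
    PDC = [
        ("G1", "used",           ("P_search", "spectra.mzML")),
        ("G1", "wasGeneratedBy", ("peptides.csv", "P_search")),
    ]

    def was_derived_from(pdc, workflow_id):
        """Derive OPM 'wasDerivedFrom' pairs: output <- input of the same process."""
        derived = []
        for wf, kind, (proc, data_in) in [e for e in pdc if e[1] == "used"]:
            if wf != workflow_id:
                continue
            for wf2, kind2, (data_out, proc2) in [e for e in pdc if e[1] == "wasGeneratedBy"]:
                if wf2 == workflow_id and proc2 == proc:
                    derived.append((data_out, data_in))   # data_out wasDerivedFrom data_in
        return derived

    print(was_derived_from(PDC, "G1"))   # [('peptides.csv', 'spectra.mzML')]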
(5) The information discovery module obtains, according to the workflow template, the instantiation information of the matching historical processes and the provenance information of their execution from the historical process knowledge repository module and the provenance data collection module. The algorithms in this module rely on the following basic concepts: the equivalence relation between tasks or services, the overlap relation between workflows, overlay paths and overlay path sets, and the maximal overlay path set. They are described in turn below; a small sketch of these notions follows the list.
(a) Equivalence relation between tasks or services. Given two elements v_1 and v_2 belonging to the abstract task set T, if v_1 and v_2 denote the same task, they are considered equivalent, written v_1 = v_2. Given two elements s_1 and s_2 belonging to the concrete service set S, if s_1 and s_2 denote the same service, they are considered equivalent, written s_1 = s_2.
(b) Overlap relation between workflows. Two workflows G_i=(V_i, E_i, F_i) and G_k=(V_k, E_k, F_k) have an overlap relation if and only if (1) there exists v_x ∈ T, and (2) v_x ∈ V_i and v_x ∈ V_k. Such a v_x is called an overlapping node of G_i and G_k.
(c) Overlay path and overlay path set. Let G_i=(V_i, E_i, F_i) and G_k=(V_k, E_k, F_k) be two workflows, and let E = E_i ∩ E_k. If e_x ∈ E, the ordered set obtained by arranging the two task nodes related by e_x according to the partial order determined by their data dependence relation is called an overlay path. If E is empty but the two workflows have an overlap relation, the set consisting of a single overlapping node is a special overlay path. If l_1 and l_2 are two overlay paths and the first node of l_2 is identical to the last node of l_1, they can be replaced by a new path l = l_1 ∪ l_2; this process is called merging of overlay paths. The set consisting of all overlay paths is called the overlay path set.
(d) Maximal overlay path set. Let OPS be an overlay path set of G_i and G_k. OPS is a maximal overlay path set if and only if any two paths l_i, l_j in OPS satisfy: (1) l_i and l_j are not the same path; (2) no node belongs to both l_i and l_j; and (3) for any node v_i in l_i and any node v_j in l_j, there is no partial order relation between v_i and v_j that can be determined by the transitivity of the data dependence relation. It can be proved that for two workflows G_i and G_k the maximal overlay path set between them is a finite set and is unique.
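The following minimal sketch makes these definitions concrete. It assumes workflows are given as (node list, edge list) pairs, computes the overlapping nodes and the overlay paths obtained from shared edges, and greedily merges adjacent overlay paths; the node names and the greedy merge loop are illustrative simplifications, not the patent's algorithm.

    # Workflows as (V, E): V a list of task nodes, E a list of <v1, v2> data-dependence edges.
    G_i = (["T1", "T2", "T3", "T5"], [("T1", "T2"), ("T2", "T3"), ("T3", "T5")])
    G_k = (["T1", "T2", "T3", "T4"], [("T1", "T2"), ("T2", "T3"), ("T3", "T4")])

    def overlapping_nodes(g1, g2):
        """Nodes belonging to both workflows (the overlap relation holds if non-empty)."""
        return set(g1[0]) & set(g2[0])

    def overlay_paths(g1, g2):
        """Overlay paths from shared edges, merged while one path's last node
        equals another path's first node."""
        shared = [e for e in g1[1] if e in g2[1]]
        paths = [[a, b] for (a, b) in shared]        # each shared edge is an overlay path
        merged = True
        while merged:
            merged = False
            for p in paths:
                for q in paths:
                    if p is not q and p[-1] == q[0]: # merge l1 and l2 into l1 U l2
                        p.extend(q[1:])
                        paths.remove(q)
                        merged = True
                        break
                if merged:
                    break
        if not paths and overlapping_nodes(g1, g2):  # special case: single overlapping node
            paths = [[n] for n in overlapping_nodes(g1, g2)]
        return paths

    print(overlapping_nodes(G_i, G_k))   # {'T1', 'T2', 'T3'}
    print(overlay_paths(G_i, G_k))       # [['T1', 'T2', 'T3']]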
3. The algorithms involved in the system modules and the data structures used by the algorithms
The implementation of this system mainly involves the following algorithms:
(1) Algorithm 1: FIF(task, HFK). The process information mining submodule contains this algorithm; it searches the historical process knowledge repository module for all concrete services corresponding to a given abstract task in the historical processes.
(2) Algorithm 2: PIF(G_x, service_x, PDC). The provenance data mining submodule contains this algorithm; it searches the provenance data collection module for the provenance information of the input and output data of the historical process G_x when it executed the concrete service service_x.
(3) Algorithm 3: Merge(PS_1, PS_2). The process matching and information integration submodule contains this algorithm; it merges two known overlay path sets when the maximal path set is generated.
(4) Algorithm 4: GMPS(G=(V, E, F), HFK, PDC). The process matching and information integration submodule contains this algorithm; it obtains the maximal overlay paths between the workflow template G=(V, E, F) and the historical processes in HFK, together with the input and output data in PDC corresponding to those maximal overlay paths. Algorithm 1, Algorithm 2 and Algorithm 3 are called during its execution.
(5) Algorithm 5: UpdateHFK(G_x, HFK). The HFK update submodule of the database update module contains this algorithm; it updates the historical process knowledge repository (HFK) database with the instantiation information of the process G_x obtained by the workflow execution monitoring module of the workflow engine.
(6) Algorithm 6: UpdatePDC(PInfo, PDC). The PDC update submodule of the database update module contains this algorithm; it updates the provenance data collection (PDC) database with the provenance information PInfo obtained by the workflow execution monitoring module during process execution.
In realizing the above algorithms, the present invention uses three data structures. DE (DElement) is a data structure for storing elements that have a partial order relation; it can hold two basic elements (such as a task node and a concrete service) with a partial order relation between them. CE (CElement) is a compound data structure that can hold both basic elements and DE elements. MList is a multidimensional linked list whose internal structure is shown in Fig. 4: TData in MList stores the task nodes of the workflow template; SData records the concrete services that realize the abstract tasks of TData in the historical processes; PData records the provenance information (including input data and output data) of the execution of the services in SData. For convenience of describing the algorithms, the present invention defines the following atomic operations on these data structures (a small sketch of the data structures follows the list):
1. π1(DE): obtain the first element of DE; π2(DE): obtain the second element of DE.
2. CE[i]: obtain the element with index i in CE; Set CE[i] = object: assign object to CE[i].
3. addTData(TData, MList): add TData to MList.
4. addSData(task, SData, MList): add SData to MList; the position of SData in MList is determined by task.
5. addPData(task, service, PData, MList): insert PData into MList; the position of PData in MList is determined jointly by task and service.
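To make the three data structures and the atomic operations concrete, here is a minimal Python sketch: DE is modeled as an ordered pair, CE as a small indexable container, and MList as nested dictionaries keyed by task and service. This is one possible reading of Fig. 4, stated as an assumption, not the patent's exact layout.

    class DE:
        """Ordered pair of two elements with a partial-order relation between them."""
        def __init__(self, first, second):
            self._items = [first, second]
        def pi1(self):            # π1(DE): first element
            return self._items[0]
        def pi2(self):            # π2(DE): second element
            return self._items[1]

    class CE(list):
        """Compound element: indexable container that may hold basic elements or DEs.
        ce[i] reads CE[i]; ce[i] = obj realizes Set CE[i] = object."""
        pass

    class MList:
        """Multidimensional list: task -> service -> list of provenance records."""
        def __init__(self):
            self._data = {}
        def addTData(self, tdata):                       # add a template task node
            self._data.setdefault(tdata, {})
        def addSData(self, task, sdata):                 # position determined by task
            self._data.setdefault(task, {}).setdefault(sdata, [])
        def addPData(self, task, service, pdata):        # position determined by task + service
            self._data.setdefault(task, {}).setdefault(service, []).append(pdata)

    # Tiny usage example with invented identifiers.
    ml = MList()
    ml.addTData("T1")
    ml.addSData("T1", "S1")
    ml.addPData("T1", "S1", CE(["G1", "input.mzML", "output.csv"]))
    de = DE("S1", "input.mzML")
    print(de.pi1(), de.pi2())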
Brief description of the drawings
Fig. 1 is the processing flow of the system.
Fig. 2 is the framework of the system.
Fig. 3 shows the dependencies involved in OPM provenance. In the figure, P denotes a concrete service and A denotes data used or produced by a service.
Fig. 4 is the data structure used in the workflow matching algorithm.
Embodiment
Specific implementation of the algorithms in each module
(1) The process information mining submodule searches the historical process knowledge repository (HFK) for the concrete services corresponding to a given task in the historical processes; the algorithm that realizes this is Algorithm 1: FIF(task, HFK). As stated above, a basic information unit in HFK is organized as CE = (<task_i, service_i>, {G_x, G_y}). The algorithm is implemented as follows: traverse all information units in HFK; if the first element task_i of CE_i is equivalent to task, i.e. task_i = task, add CE_i to the result list; repeat this process until the traversal ends, and finally return the result list. The pseudocode implementation is given in appendix (1) of the specification.
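A minimal Python sketch of this lookup, under the assumption that HFK is a list of ((task, service), {workflow ids}) units and that task equivalence reduces to string equality, could look as follows; the example entries are invented.

    def FIF(task, hfk):
        """Find in HFK all units whose abstract task is equivalent to `task`.
        hfk: list of ((task_i, service_i), {G_x, ...}) units."""
        result = []
        for ce in hfk:                      # traverse all information units
            (task_i, service_i), workflows = ce
            if task_i == task:              # equivalence reduced to equality here
                result.append(ce)
        return result

    HFK = [
        (("db_search", "MascotService"), {"G1", "G3"}),
        (("db_search", "XTandemService"), {"G2"}),
        (("peak_picking", "MzMineService"), {"G1"}),
    ]
    print(FIF("db_search", HFK))
    # [(('db_search', 'MascotService'), {'G1', 'G3'}), (('db_search', 'XTandemService'), {'G2'})]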
(2) The provenance data mining submodule searches the provenance data collection (PDC) for the provenance information of the execution of a given concrete service in a specified process; the algorithm that realizes this is Algorithm 2: PIF(G_x, service_x, PDC). As stated above, an information unit in PDC is organized as DE = <object_1, object_2>. The algorithm is implemented as follows: create a new object CE of type CE; find all information units in PDC identified by G_x and store them temporarily in the list DElementSet; traverse each element DE of DElementSet: if π1(DE) is equivalent to service_x, then Set CE[2] = π2(DE); otherwise, if π2(DE) is equivalent to service_x, then Set CE[1] = π1(DE). Set CE[0] = G_x. Return CE as the result. The pseudocode implementation is given in appendix (2) of the specification.
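A minimal Python sketch of this procedure follows, with PDC entries represented as (workflow_id, dependency, <object_1, object_2>) tuples written as in the UpdatePDC step below, so that CE[1] ends up holding the data the service generated and CE[2] the data it used. The tuple layout and the identifiers are assumptions made for illustration.

    def PIF(G_x, service_x, pdc):
        """Collect the data objects recorded for service_x during execution of G_x.
        pdc: list of (workflow_id, dependency, (object_1, object_2)) units, written as in
        UpdatePDC: 'used' units are (service, input_data),
                   'wasGeneratedBy' units are (output_data, service)."""
        ce = [None, None, None]                                   # CE with three slots
        delement_set = [de for (wf, _, de) in pdc if wf == G_x]   # units identified by G_x
        for de in delement_set:
            if de[0] == service_x:                     # π1(DE) equivalent to service_x
                ce[2] = de[1]                          # Set CE[2] = π2(DE): data the service used
            elif de[1] == service_x:                   # π2(DE) equivalent to service_x
                ce[1] = de[0]                          # Set CE[1] = π1(DE): data it generated
        ce[0] = G_x                                    # Set CE[0] = G_x
        return ce

    PDC = [
        ("G1", "used",           ("S_search", "spectra.mzML")),
        ("G1", "wasGeneratedBy", ("peptides.csv", "S_search")),
    ]
    print(PIF("G1", "S_search", PDC))   # ['G1', 'peptides.csv', 'spectra.mzML']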
(3) The process matching and information integration submodule is responsible for constructing the maximal overlay path set from the information obtained by the above two submodules. In the main algorithm, Algorithm 4: GMPS(G=(V, E, F), HFK, PDC), the algorithms of the process information mining submodule and the provenance data mining submodule, Algorithm 1 and Algorithm 2, are called to store temporarily in an MList the matching information in HFK and PDC relevant to the given workflow G. After this work is completed, the maximal overlay path set is constructed from the data in the MList in two steps: first, construct the overlay path sets; second, repeatedly merge the overlay path sets constructed in the first step until the paths can no longer be merged, at which point the resulting path set is a maximal path set.
The process of constructing the overlay paths for a given task is as follows: (1) take from the MList all SData entries relevant to the abstract task task of the given workflow G; (2) take all PData entries determined by the SData entries found in (1); (3) for any PData entry CE_i (= (G_x, inputdata, outputdata)), construct an overlay path L_i by: Set L_i[0] = CE[0], Set L_i[2] = <CE[1], CE[2]>, Set L_i[1] = <<task_i, SData_k>>. The set consisting of the overlay paths constructed from all SData and PData entries is the overlay path set corresponding to this abstract task.
The path merging algorithm is as follows: given two overlay path sets PS_1 and PS_2, where all task nodes in PS_2 come after those in PS_1 according to the data partial order. For each overlay path L_i^1 in PS_1, traverse every path L_k^2 in PS_2: (a) if the workflow identifier in L_k^2 equals the workflow identifier in L_i^1, go to step (b), otherwise go to step (c); (b) if π1(L_i^1[2]) = π2(L_k^2[2]), add all elements of L_i^1[1] to L_k^2[1] according to the partial order, otherwise go to step (c); (c) add L_i^1 to PS_2. Finally, return PS_2 as the new path set. The pseudocode implementation is given in appendix (3) of the specification.
The process of building the maximal path set by iterative path merging is as follows: construct a path set for each abstract task node in the set V of the given workflow G, obtaining PS_1, PS_2, ..., PS_n (where n is the number of nodes in V). Merge PS_n with PS_(n-1), then with PS_(n-2), ..., and finally with PS_1. The PS_n obtained after the merging ends is the maximal overlay path set between the given workflow G and the historical processes (each overlay path in it carries the corresponding input and output data). The pseudocode implementation is given in appendix (4) of the specification.
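The sketch below strings these steps together in simplified form: it builds one overlay path per (task, service, provenance) combination assumed to have been extracted by FIF/PIF-style lookups, then merges path sets from the last template task backwards, chaining two paths of the same historical workflow when the earlier step's output equals the later step's input. The data layout, the MATCHES table, the names and the equality test are simplifying assumptions, not the patent's exact pseudocode.

    # Simplified GMPS: template tasks in pipeline order, plus per-task match information
    # already extracted from HFK/PDC as (workflow_id, service, input_data, output_data).
    MATCHES = {
        "T1": [("G1", "S1", "raw.mzML", "peaks.mgf")],
        "T2": [("G1", "S2", "peaks.mgf", "peptides.csv"),
               ("G2", "S2b", "other.mgf", "other.csv")],
    }
    TEMPLATE_TASKS = ["T1", "T2"]          # V of the workflow template, in data order

    def build_path_set(task):
        """One overlay path per (service, provenance) record of this task."""
        return [{"wf": wf, "steps": [(task, srv)], "in": din, "out": dout}
                for (wf, srv, din, dout) in MATCHES.get(task, [])]

    def merge(ps_earlier, ps_later):
        """Merge path sets: extend a later path with an earlier one of the same
        historical workflow when the earlier output feeds the later input."""
        for le in ps_earlier:
            chained = False
            for lk in ps_later:
                if lk["wf"] == le["wf"] and le["out"] == lk["in"]:
                    lk["steps"] = le["steps"] + lk["steps"]   # prepend earlier steps
                    lk["in"] = le["in"]
                    chained = True
            if not chained:
                ps_later.append(le)                           # keep unmatched path as-is
        return ps_later

    def GMPS(template_tasks):
        path_sets = [build_path_set(t) for t in template_tasks]
        result = path_sets[-1]
        for ps in reversed(path_sets[:-1]):                   # PS_n with PS_(n-1), ..., PS_1
            result = merge(ps, result)
        return result

    for path in GMPS(TEMPLATE_TASKS):
        print(path)
    # {'wf': 'G1', 'steps': [('T1', 'S1'), ('T2', 'S2')], 'in': 'raw.mzML', 'out': 'peptides.csv'}
    # {'wf': 'G2', 'steps': [('T2', 'S2b')], 'in': 'other.mgf', 'out': 'other.csv'}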
(4) In the system framework diagram, the database update module is responsible for updating the historical process knowledge repository (HFK) and the provenance data collection (PDC). It has two submodules, HFK update and PDC update; the former updates HFK and the latter updates PDC.
The update of HFK is realized as follows: suppose G_x = ([T_2, T_3, T_4, T_5, T_7], [<T_2, T_3>, <T_3, T_4>, <T_4, T_5>, <T_5, T_7>], []) is a newly defined abstract workflow, and that after instantiation F in G_x is (<T_2, S_2^x>, <T_3, S_3^x>, <T_4, S_4^x>, <T_5, S_5^x>, <T_7, S_7^x>). For any element f in F, if there is a unit in HFK with CE[0] = f, add G_x to CE[1]; otherwise create a new element CE' = (f, {G_x}) and add CE' to HFK. The pseudocode implementation is given in appendix (5) of the specification.
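A minimal sketch of this update, using the same assumed ((task, service), {workflow ids}) representation of HFK units as in the earlier sketches, is as follows; the identifiers are illustrative.

    def UpdateHFK(G_x_id, F, hfk):
        """Record the task->service bindings F of an executed workflow G_x in HFK.
        hfk: list of [ (task, service), {workflow ids} ] units."""
        for f in F:                                   # f = (task, service)
            for unit in hfk:
                if unit[0] == f:                      # an identical element already exists
                    unit[1].add(G_x_id)               # add G_x to CE[1]
                    break
            else:
                hfk.append([f, {G_x_id}])             # otherwise create CE' = (f, {G_x})

    HFK = [[("T2", "S2"), {"G1"}]]
    UpdateHFK("Gx", [("T2", "S2x"), ("T3", "S3x")], HFK)
    UpdateHFK("Gy", [("T2", "S2x")], HFK)
    print(HFK)
    # [[('T2', 'S2'), {'G1'}], [('T2', 'S2x'), {'Gx', 'Gy'}], [('T3', 'S3x'), {'Gx'}]]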
The update of PDC is realized as follows: suppose the provenance information obtained by the system during the execution of G_x is PInfo = (<Data_x, S_2^x, Data_2>, <Data_2, S_3^x, Data_3>, <Data_3, S_4^x, Data_4>, <Data_4, S_5^x, Data_5>, <Data_5, S_7^x, Data_7>). For each element e in PInfo, create two objects DE_1 and DE_2 of type DE, where DE_1 is labeled with the "used" dependence and DE_2 with the "wasGeneratedBy" dependence, and assign them as follows: set DE_1[0] = e[1], set DE_1[1] = e[0]; set DE_2[0] = e[2], set DE_2[1] = e[1]. Use G_x as the process identifier of DE_1 and DE_2, and add them to PDC. The pseudocode implementation is given in appendix (6) of the specification.
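The corresponding sketch for the PDC update follows, with PInfo given as (input_data, service, output_data) triples and PDC entries stored as (workflow_id, dependency, <object_1, object_2>) tuples, the same assumed layout used in the earlier provenance sketches.

    def UpdatePDC(G_x_id, p_info, pdc):
        """Store the provenance triples collected during execution of G_x in PDC.
        p_info: list of (input_data, service, output_data) triples."""
        for e in p_info:
            de1 = (e[1], e[0])          # 'used':            <service, input data>
            de2 = (e[2], e[1])          # 'wasGeneratedBy':  <output data, service>
            pdc.append((G_x_id, "used", de1))
            pdc.append((G_x_id, "wasGeneratedBy", de2))

    PDC = []
    UpdatePDC("Gx", [("Data_x", "S2x", "Data_2"), ("Data_2", "S3x", "Data_3")], PDC)
    for entry in PDC:
        print(entry)
    # ('Gx', 'used', ('S2x', 'Data_x'))
    # ('Gx', 'wasGeneratedBy', ('Data_2', 'S2x'))
    # ('Gx', 'used', ('S3x', 'Data_2'))
    # ('Gx', 'wasGeneratedBy', ('Data_3', 'S3x'))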
It should also be noted that the time consumed by the system increases with the amount of data in HFK and PDC, so it is necessary to keep the amount of data in HFK and PDC within certain limits. The data volume can be controlled by clustering the information in HFK and PDC or by deleting historical processes that are too old.
Appendix explanation:
(1) Algorithm 1: FIF(task, HFK).
(2) Algorithm 2: PIF(G_x, service_x, PDC).
(3) Algorithm 3: Merge(PS_1, PS_2).
(4) Algorithm 4: GMPS(G=(V, E, F), HFK, PDC).
(5) Algorithm 5: UpdateHFK(G_x, HFK).
(6) Algorithm 6: UpdatePDC(PInfo, PDC).
The specification appendix
(1)-(6): the pseudocode listings of Algorithm 1 (FIF), Algorithm 2 (PIF), Algorithm 3 (Merge), Algorithm 4 (GMPS), Algorithm 5 (UpdateHFK) and Algorithm 6 (UpdatePDC) are provided as figures in the original filing.

Claims (7)

1. A provenance-based workflow matching and discovery system for proteomics data analysis, characterized by comprising: a workflow module, an information discovery module, a database update module, a database, and a workflow engine; wherein:
the workflow module contains two parts: the initial workflow and the instantiated workflow; the initial workflow is the scientific workflow that describes the data analysis task and consists of abstract tasks, and serves as the workflow template in the workflow matching process; the instantiated workflow is the scientific workflow obtained after a concrete service has been assigned to each abstract task in the initial workflow;
the database comprises a historical process knowledge repository (HFK) and a provenance data collection (PDC), which are used to record, respectively, the historical process information and the provenance information generated during its execution;
the information discovery module consists of three submodules: a process matching and information integration submodule, a process information mining submodule, and a provenance data mining submodule; the process information mining submodule finds, in the historical process knowledge repository, the task nodes that overlap with the workflow template and their corresponding concrete services; the provenance data mining submodule obtains from the provenance data collection the provenance information corresponding to the matching processes; the process matching and information integration submodule integrates the information obtained by the above two submodules and presents the integrated information to the user;
the workflow engine comprises a workflow execution monitoring module and a workflow execution module; the latter is responsible for executing instantiated workflows, and the former is responsible for collecting the provenance information produced during workflow execution;
the database update module updates HFK and PDC each time the workflow execution module finishes executing a workflow;
the data flow relations between the above modules are as follows:
the user creates an initial workflow according to the task description; using the initial workflow as the workflow template, the information discovery module finds the matching historical processes in the historical process knowledge repository and the provenance information produced during their execution in the provenance data collection; the information obtained by the information discovery module is presented to the user, and the user instantiates the initial workflow with this information, obtaining an instantiated workflow; the workflow execution submodule of the workflow engine executes the instantiated workflow, and at the same time the execution monitoring module of the workflow engine collects the provenance information produced during execution; the database update module stores the provenance information obtained by the workflow engine and the instantiated workflow information, as historical process information, into the historical process knowledge repository and the provenance data collection of the database module.
2. The provenance-based workflow matching and discovery system according to claim 1, characterized in that, when creating the initial workflow and the instantiated workflow, the workflow module of the system uses an abstract task set T and a concrete service set S; wherein T denotes the set of abstract tasks used by the system, with which the user describes a data analysis process; S denotes the concrete services integrated in the system, which realize the abstract tasks in T; in addition, the processes built by the system are data-centered pipelines.
3. The provenance-based workflow matching and discovery system according to claim 2, characterized in that the scientific workflow is a simple directed graph G=(V, E, F), wherein V is the finite set of abstract task nodes in the process, E is the set of data dependence relations between task nodes, and F is the set of mappings from abstract tasks to concrete services.
4. The provenance-based workflow matching and discovery system according to claim 3, characterized in that the information in the historical process knowledge repository module is organized as (<task_i, service_i>, {G_1, G_2}), denoting a basic information unit of the historical process knowledge repository module whose meaning is: the abstract task task_i in G_1 and G_2 was executed by the service service_i when those processes were executed;
the information in the provenance data collection module is organized as <object_1, object_2>, where there is a partial order relation between object_1 and object_2.
5. The provenance-based workflow matching and discovery system according to claim 4, characterized in that
the information discovery module involves the following key concepts: the equivalence relation between tasks or services, the overlap relation between workflows, overlay paths and overlay path sets, and the maximal overlay path set; they are respectively:
(a) the equivalence relation between tasks or services: given two elements v_1 and v_2 belonging to the abstract task set T, if v_1 and v_2 denote the same task, they are considered equivalent, written v_1 = v_2; given two elements s_1 and s_2 belonging to the concrete service set S, if s_1 and s_2 denote the same service, they are considered equivalent, written s_1 = s_2;
(b) the overlap relation between workflows: two workflows G_i=(V_i, E_i, F_i) and G_k=(V_k, E_k, F_k) have an overlap relation if and only if (1) there exists v_x ∈ T, and (2) v_x ∈ V_i and v_x ∈ V_k; such a v_x is called an overlapping node of G_i and G_k;
(c) overlay path and overlay path set: let G_i=(V_i, E_i, F_i) and G_k=(V_k, E_k, F_k) be two workflows and let E = E_i ∩ E_k; if e_x ∈ E, the ordered set obtained by arranging the two task nodes related by e_x according to the partial order determined by their data dependence relation is called an overlay path; if E is empty but the two workflows have an overlap relation, the set consisting of a single overlapping node is a special overlay path; if l_1 and l_2 are two overlay paths and the first node of l_2 is identical to the last node of l_1, they are replaced by a new path l = l_1 ∪ l_2, a process called merging of overlay paths; the set consisting of all overlay paths is called the overlay path set;
(d) maximal overlay path set: let OPS be an overlay path set of G_i and G_k; OPS is a maximal overlay path set if and only if any two paths l_i, l_j in OPS satisfy: (1) l_i and l_j are not the same path; (2) no node belongs to both l_i and l_j; and (3) for any node v_i in l_i and any node v_j in l_j, there is no partial order relation between v_i and v_j that can be determined by the transitivity of the data dependence relation.
6. The provenance-based workflow matching and discovery system according to claim 5, characterized in that the system comprises the following algorithms:
(1) Algorithm 1: FIF(task, HFK); the process information mining submodule contains this algorithm, which searches the historical process knowledge repository module for all concrete services corresponding to a given abstract task in the historical processes;
(2) Algorithm 2: PIF(G_x, service_x, PDC); the provenance data mining submodule contains this algorithm, which searches the provenance data collection module for the provenance information of the input and output data of the historical process G_x when it executed the concrete service service_x;
(3) Algorithm 3: Merge(PS_1, PS_2); the process matching and information integration submodule contains this algorithm, which merges two known overlay path sets when the maximal path set is generated;
(4) Algorithm 4: GMPS(G=(V, E, F), HFK, PDC); the process matching and information integration submodule contains this algorithm, which obtains the maximal overlay paths between the workflow template G=(V, E, F) and the historical processes in HFK, together with the input and output data in PDC corresponding to those maximal overlay paths; Algorithm 1, Algorithm 2 and Algorithm 3 are called during its execution;
(5) Algorithm 5: UpdateHFK(G_x, HFK); the HFK update submodule of the database update module contains this algorithm, which updates the historical process knowledge repository (HFK) database with the instantiation information of the process G_x obtained by the workflow execution monitoring module of the workflow engine;
(6) Algorithm 6: UpdatePDC(PInfo, PDC); the PDC update submodule of the database update module contains this algorithm, which updates the provenance data collection (PDC) database with the provenance information PInfo obtained by the workflow execution monitoring module during process execution.
7. The provenance-based workflow matching and discovery system according to claim 6, characterized in that the system algorithms use the following three data structures: DE, a data structure for storing elements that have a partial order relation, which holds two basic elements with a partial order relation between them; CE, a compound data structure, which holds both basic elements and DE elements; and MList, a multidimensional linked list whose internal structure comprises: TData, which stores the task nodes of the workflow template; SData, which records the concrete services that realize the abstract tasks of TData in the historical processes; and PData, which records the provenance information of the execution of the services in SData.
CN2013103809827A 2013-08-28 2013-08-28 Workflow matching and finding system, based on provenance, facing proteomic data analysis Pending CN103440553A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2013103809827A CN103440553A (en) 2013-08-28 2013-08-28 Workflow matching and finding system, based on provenance, facing proteomic data analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2013103809827A CN103440553A (en) 2013-08-28 2013-08-28 Workflow matching and finding system, based on provenance, facing proteomic data analysis

Publications (1)

Publication Number Publication Date
CN103440553A true CN103440553A (en) 2013-12-11

Family

ID=49694246

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2013103809827A Pending CN103440553A (en) 2013-08-28 2013-08-28 Workflow matching and finding system, based on provenance, facing proteomic data analysis

Country Status (1)

Country Link
CN (1) CN103440553A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103823885A (en) * 2014-03-07 2014-05-28 河海大学 Data provenance dependence relation analysis model-based data dependence analysis method
CN105912588A (en) * 2016-03-31 2016-08-31 中国农业银行股份有限公司 Visualization processing method and system for big data based on memory calculations
CN103745319B (en) * 2014-01-09 2017-01-04 北京大学 A kind of data provenance traceability system based on multi-state scientific workflow and method
CN109658765A (en) * 2019-03-04 2019-04-19 西安交通大学医学院第附属医院 A kind of digital medical images software teaching service system
CN112162737A (en) * 2020-10-13 2021-01-01 深圳晶泰科技有限公司 Universal description language data system of directed acyclic graph automatic task flow
CN112734189A (en) * 2020-12-30 2021-04-30 深圳晶泰科技有限公司 Method for establishing experimental workflow model
CN112948569A (en) * 2019-12-10 2021-06-11 中国石油天然气股份有限公司 Method and device for pushing scientific workflow diagram version based on active knowledge graph
WO2022077222A1 (en) * 2020-10-13 2022-04-21 深圳晶泰科技有限公司 Directed-acyclic-graph-type automatic common workflow description language data system
CN112162737B (en) * 2020-10-13 2024-06-28 深圳晶泰科技有限公司 General description language data system for automatic task flow of directed acyclic graph

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101694709A (en) * 2009-09-27 2010-04-14 华中科技大学 Service-oriented distributed work flow management system
CN102043625A (en) * 2010-12-22 2011-05-04 中国农业银行股份有限公司 Workflow operation method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
GUANGMENG ZHAI: "PWMDS: A system supporting provenance-based matching and discovery of workflows in proteomics data analysis", Proceedings of the 2012 IEEE 16th International Conference on Computer Supported Cooperative Work in Design *



Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20131211
