CN103440553A - Workflow matching and finding system, based on provenance, facing proteomic data analysis - Google Patents


Info

Publication number
CN103440553A
CN103440553A CN2013103809827A CN201310380982A
Authority
CN
China
Prior art keywords
workflow
information
module
algorithm
task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2013103809827A
Other languages
Chinese (zh)
Inventor
翟广猛
卢暾
黄兴
陈昭灿
顾宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University filed Critical Fudan University
Priority to CN2013103809827A priority Critical patent/CN103440553A/en
Publication of CN103440553A publication Critical patent/CN103440553A/en
Pending legal-status Critical Current

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention belongs to the technical field of computer-supported scientific workflows and discloses a provenance-based workflow matching and discovery system for proteomic data analysis. When a proteomic data analysis process is being built, the system makes full use of the collected historical process information and its associated provenance information, enabling knowledge reuse, reducing the time and effort spent on process construction, and accelerating data analysis experiments. The internal processing consists of three steps: describing the experimental task with a task-based scientific workflow; using a process matching and provenance discovery algorithm to extract matching instantiated processes together with their provenance information; and integrating the extracted process information and presenting it to researchers, who can reuse or modify the historical process configuration.

Description

Provenance-based workflow matching and discovery system for proteomics data analysis
Technical field
The invention belongs to the technical field of computer-supported scientific workflows, and specifically relates to a provenance-based workflow matching and discovery system for proteomics data analysis.
Background technology
About scientific workflows in e-Science. Scientific workflow technology is receiving more and more attention in scientific research, and researchers have begun to apply a number of mature scientific workflow management systems in their work, which has greatly accelerated scientific discovery. Nowadays, as network technology matures further, some organizations publish scientific resources such as experimental data, computation and analysis tools in the form of web services in order to improve collaboration in research, and other scientists can apply these shared resources to their own work over the network. To organize these distributed resources effectively, researchers have designed scientific workflows and scientific workflow management systems. With the help of such systems, researchers can orchestrate computation-intensive tasks, analyze large batches of data, and integrate distributed and local resources, so that more energy can be devoted to solving domain problems rather than spending excessive time on computation and on the organization and management of data resources.
Typical scientific workflow management systems currently include Taverna, Kepler and Triana. Taverna is an open-source scientific workflow management system used mainly in bioinformatics. It can integrate many kinds of shared web services, including the following types: Arbitrary WSDL, Soaplab, Talisman, Nested workflow, String constant and Local processor. It provides a visual platform on which biologists can use these services to describe and execute data-centered scientific workflows, and it also offers web service discovery, exception handling and provenance collection. Similar to Taverna, Kepler is also an open-source scientific workflow management tool oriented to bioinformatics; besides ordinary shared web services, it can integrate resources such as databases and provides flexible control strategies for combining these resources. Triana is mainly used for the execution of services and workflows in distributed environments; unlike Taverna and Kepler, it can be applied to different fields.
As the complexity of scientific research grows, more and more shared resources are used in research. In order to track the origin of resources, verify the correctness of results and reproduce the steps of an experiment, scientists have begun to incorporate provenance into scientific workflows.
About provenance. Because of the uncertainty of scientific research, a mechanism must be provided that allows scientists to verify the validity of results and the correctness of experiments. In addition, as collaboration becomes more widespread in scientific research, the attribution of contributions also attracts attention. Data provenance (data origin/pedigree) records the history of the experimental data produced during workflow execution; it covers the original data from which a data item derives and all processing steps through which the original data evolved into the current data. As a new technique it has been applied in scientific workflows. This not only solves the problems mentioned above; because provenance supports recording data origin, reproducing and sharing data, explaining differences between results and supporting knowledge reuse, its important role in scientific workflows has been generally recognized.
About proteomics data analysis. Proteomics is a relatively new field of scientific research. Research on proteomics will help explain life science and reveal the secrets of life. In practice, however, proteomics data analysis is a multi-step and complicated process, and using scientific workflows in proteomics data analysis greatly reduces the researchers' burden. In recent years proteomics has developed rapidly; one of the most important driving factors is the growing number of open-source or free data analysis tools. But people often run into the following problems when using these tools: 1. the tools are difficult to install; 2. the tools cannot be used without guidance from professionals; 3. inconsistent result data formats make the tools' interfaces incompatible, so it is difficult to build a data analysis process. Scientific workflows were introduced into the field of proteomics data analysis to solve these problems.
The problems faced. Compared with business activities, scientific research is inherently exploratory and uncertain. Research processes are often tentative and change dynamically, so scientific workflows, unlike business workflows, have no fixed solutions or patterns. These characteristics make the construction of a scientific workflow a laborious job. For a research task, scientists not only have to select a suitable group of services from many candidates to carry out the experimental task, but also have to tune the parameters that control the services' internal behavior before execution. Finding a group of services and parameter values that meet the requirements and characteristics of an experiment often takes repeated trials. This problem is especially obvious in proteomics data analysis. It is therefore very necessary to have a simple and effective search method that helps scientists build the scientific workflows for their experiments. Knowledge reuse in scientific research can serve as a breakthrough point, but how to make full use of historical processes and the related provenance information to help scientists create scientific workflows is still a challenging job.
The present invention designs a provenance-based workflow matching and discovery system precisely to solve the above problems. When a data analysis process is created, the system can provide several groups of service processes, which the user reuses or modifies according to the requirements of the experiment.
Summary of the invention
The object of the invention is to propose a provenance-based workflow matching and discovery system that supports the construction of proteomics data analysis processes.
Workflow management systems such as Taverna, Kepler and Triana all support the collection of provenance information, but the collected provenance is mainly used to record the origin of experimental data and to support the reproduction and sharing of experimental data; they provide little support for knowledge reuse at the workflow creation stage. The system designed by the present invention adds provenance information as a reference factor to the workflow matching and discovery process, so that suitable historical processes can be found and used as references for new research tasks.
1. The processing flow and structure of the system
Fig. 1 shows the provenance-based workflow matching and discovery process, which is divided into three steps. (1) Use an abstract scientific workflow model to describe the data analysis task; (2) taking the workflow in (1) as a template, use the process matching algorithm to find the matching processes (or maximal matching paths) and their corresponding provenance information among the historical processes; (3) present the matching historical processes (or maximal matching paths) and their provenance information to the user in an understandable organizational form.
Based on the above processing flow, the present invention designs the framework of the system. As shown in Fig. 2, the framework comprises: a workflow module, an information discovery module, a database update module, a database, and a workflow engine. Wherein:
The workflow module contains two parts: the initial workflow and the instantiated workflow. The initial workflow is the scientific workflow that describes the data analysis task and consists of abstract tasks; it serves as the workflow template in the workflow matching process. The instantiated workflow is the scientific workflow obtained after a concrete service has been assigned to each abstract task in the initial workflow.
The database comprises HFK (the historical process knowledge repository) and PDC (the provenance data collection), which are used to record, respectively, the historical process information (including the corresponding abstract workflows and the instantiated workflows) and the provenance information generated when they are executed.
The information discovery module consists of three submodules: the process matching and information integration submodule, the process information mining submodule, and the provenance data mining submodule. The process information mining submodule finds, in the historical process knowledge repository, the task nodes that overlap with the workflow template and their corresponding concrete services. The provenance data mining submodule obtains from the provenance data collection the provenance information corresponding to the matching processes. The process matching and information integration submodule integrates the information obtained by the above two submodules and presents the integrated information to the user.
The workflow engine comprises a workflow execution monitoring module and a workflow execution module. The latter is responsible for executing instantiated workflows, and the former is responsible for collecting the provenance information produced during workflow execution.
The database update module updates HFK and PDC each time the workflow execution module finishes executing a workflow.
The data flow relationships between the above modules are as follows:
The user creates an initial workflow (including a description of the requirements on the workflow) according to the task description. Using the initial workflow as the workflow template, the information discovery module finds the matching historical processes in the historical process knowledge repository and the provenance information produced during their execution in the provenance data collection. The system presents the information obtained by the information discovery module to the user, and the user instantiates the initial workflow with this information, obtaining an instantiated workflow. The workflow execution submodule of the workflow engine executes the instantiated workflow, and at the same time the execution monitoring module collects the provenance information produced during execution. The database update module stores the provenance information obtained by the workflow engine and the instantiated workflow information, as historical process information, into the historical process knowledge repository and the provenance data collection of the database module.
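To make this data flow concrete, the following is a minimal Python sketch of one pass through the loop, assuming simplified in-memory stand-ins for HFK, PDC, the discovery step and the workflow engine; all identifiers (discover, run_workflow, the task and service names such as peak_picking or S_mascot) are invented for illustration and are not taken from the patent.

    # Minimal sketch of the module data flow (illustrative names, in-memory stores).

    def discover(template, hfk, pdc):
        """Information discovery: look up services and provenance for template tasks."""
        suggestions = {}
        for task in template["tasks"]:
            services = [s for (t, s), _ in hfk.items() if t == task]
            suggestions[task] = {
                s: [rec for rec in pdc if rec["service"] == s] for s in services
            }
        return suggestions

    def run_workflow(instantiated):
        """Workflow engine: execute and monitor, returning collected provenance."""
        provenance = []
        for task, service in instantiated["bindings"].items():
            provenance.append({"workflow": instantiated["id"],
                               "service": service,
                               "used": f"in_{task}",              # input data id (dummy)
                               "wasGeneratedBy": f"out_{task}"})  # output data id (dummy)
        return provenance

    def update_databases(instantiated, provenance, hfk, pdc):
        """Database update module: record the executed workflow and its provenance."""
        for task, service in instantiated["bindings"].items():
            hfk.setdefault((task, service), set()).add(instantiated["id"])
        pdc.extend(provenance)

    # One pass: template -> discovery -> instantiation -> execution -> update.
    hfk, pdc = {}, []
    template = {"id": "G_template", "tasks": ["peak_picking", "db_search"]}
    print(discover(template, hfk, pdc))          # empty on a fresh system
    instantiated = {"id": "G1", "bindings": {"peak_picking": "S_pp", "db_search": "S_mascot"}}
    prov = run_workflow(instantiated)
    update_databases(instantiated, prov, hfk, pdc)
    print(discover(template, hfk, pdc))          # now suggests S_pp / S_mascot with provenance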
2. Detailed description of each module in the system framework
(1) Abstract task set and concrete service set. When creating the initial workflow and the instantiated workflow, the workflow module of the system uses an abstract task set T and a concrete service set S. T denotes the set of abstract tasks used by the system; the user describes a data analysis process with these tasks. S denotes the concrete services integrated in the system; these services realize the abstract tasks in T. In addition, the processes built by the system are data-centered pipelines.
(2) Scientific workflow. The workflow module uses a task-based workflow modeling method. A scientific workflow is a simple directed graph G=(V, E, F). V is the finite set of abstract task nodes in the process. E is the set of data dependence relations between task nodes. F is the set of mappings from abstract tasks to concrete services. For example, let v_1 and v_2 be two task nodes in V; if v_2 depends on the data of v_1, there is an element e_1=<v_1, v_2> in E expressing this dependence. If there is an element f_i=<v_i, s_i> in F, it means that service s_i is responsible for executing task v_i when the workflow is executed. According to whether F is empty, scientific workflows are divided into two kinds: the initial workflow and the instantiated workflow. The initial workflow is the abstract workflow that describes the research task, and its F is empty. The instantiated workflow is the workflow obtained after the user instantiates the initial workflow through the knowledge reuse process; each task node now has an assigned concrete service, and F records the mappings between tasks and concrete services.
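A minimal Python sketch of this workflow model follows; it uses simple tuples for edges and a dictionary for F, and only illustrates the distinction between an initial workflow (F empty) and an instantiated workflow. The class and the example task/service names are illustrative assumptions, not code from the patent.

    from dataclasses import dataclass, field
    from typing import Dict, List, Tuple

    @dataclass
    class Workflow:
        """G = (V, E, F): task nodes, data-dependence edges, task->service mapping."""
        V: List[str]                                      # abstract task nodes
        E: List[Tuple[str, str]]                          # <v1, v2>: v2 depends on data of v1
        F: Dict[str, str] = field(default_factory=dict)   # task -> concrete service

        def is_initial(self) -> bool:
            # Initial workflow: describes the task abstractly, F is empty.
            return not self.F

        def instantiate(self, bindings: Dict[str, str]) -> "Workflow":
            # Instantiated workflow: every task node has an assigned concrete service.
            assert set(bindings) == set(self.V), "every task needs a service"
            return Workflow(self.V, self.E, dict(bindings))

    # Example: a two-step pipeline, first abstract, then instantiated.
    g0 = Workflow(V=["T1", "T2"], E=[("T1", "T2")])
    print(g0.is_initial())                         # True
    g1 = g0.instantiate({"T1": "S1", "T2": "S2"})
    print(g1.is_initial(), g1.F)                   # False {'T1': 'S1', 'T2': 'S2'}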
(3) Historical process knowledge repository module. Since the system uses a task-based workflow modeling method, the correspondence between tasks and services in historical processes must be recorded as part of the reusable knowledge, and the historical process knowledge repository module is responsible for recording it. Its information is organized as follows: (<task_i, service_i>, {G_1, G_2}) is a basic information unit of the historical process knowledge repository, and its meaning is: the abstract task task_i in G_1 and G_2 was executed by the service service_i when those processes were executed.
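As an illustration of this organization, the sketch below models an HFK unit as a ((task, service), set-of-workflow-ids) pair in a Python dictionary; this representation and the task/service names are assumptions made for clarity, not the patent's storage format.

    # HFK unit: (<task_i, service_i>, {G_1, G_2, ...})
    # meaning: in workflows G_1, G_2, ... the abstract task task_i was executed by service_i.
    HFK = {
        ("peptide_identification", "MascotService"): {"G1", "G3"},
        ("peptide_identification", "XTandemService"): {"G2"},
        ("spectrum_preprocessing", "MzConvertService"): {"G1", "G2", "G3"},
    }

    def record_execution(task, service, workflow_id, hfk):
        """Add one task/service/workflow observation to the repository."""
        hfk.setdefault((task, service), set()).add(workflow_id)

    record_execution("peptide_identification", "MascotService", "G4", HFK)
    print(HFK[("peptide_identification", "MascotService")])  # {'G1', 'G3', 'G4'}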
(4) Provenance data collection module. The provenance data collection module records the provenance information produced when workflows are executed in the system; this information is presented to the user together with the matching processes extracted during workflow matching. Taking into account the characteristics of the system itself, we improve the provenance information organization proposed in OPM (see Fig. 3). Since proteomics data analysis processes are mostly pipelines, the "wasDerivedFrom" and "wasTriggeredBy" dependencies can be derived from the "wasGeneratedBy" and "used" dependencies, so only the first two dependencies are recorded in the provenance data collection module in order to save space. The information in the provenance data collection module is organized as <object_1, object_2>, where there is a partial order relation between object_1 and object_2.
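The sketch below illustrates how the OPM "wasDerivedFrom" relation can be reconstructed from recorded "used" and "wasGeneratedBy" pairs in a pipeline. It assumes PDC is an in-memory list of (workflow_id, dependency, <object_1, object_2>) tuples; the layout and the file/service names are illustrative only.

    # Each PDC entry: (workflow_id, dependency_type, (object_1, object_2)).
    # Only "used" (process used data) and "wasGeneratedBy" (data generated by process) are stored.
    PDC = [
        ("G1", "used",           ("P_search", "spectra.mzML")),
        ("G1", "wasGeneratedBy", ("peptides.csv", "P_search")),
    ]

    def was_derived_from(pdc, workflow_id):
        """Derive OPM 'wasDerivedFrom' pairs: output <- input of the same process."""
        derived = []
        for wf, kind, (proc, data_in) in [e for e in pdc if e[1] == "used"]:
            if wf != workflow_id:
                continue
            for wf2, kind2, (data_out, proc2) in [e for e in pdc if e[1] == "wasGeneratedBy"]:
                if wf2 == workflow_id and proc2 == proc:
                    derived.append((data_out, data_in))   # data_out wasDerivedFrom data_in
        return derived

    print(was_derived_from(PDC, "G1"))   # [('peptides.csv', 'spectra.mzML')]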
(5) The information discovery module obtains, according to the workflow template, the instantiation information of the matching historical processes and the provenance information of their execution from the historical process knowledge repository module and the provenance data collection module. The algorithms in this module rely on the following basic concepts: the equivalence relation between tasks or services, the overlap relation between workflows, overlay paths and overlay path sets, and the maximal overlay path set. They are described in turn below; a small sketch of these notions follows the list.
(a) Equivalence relation between tasks or services. Given two elements v_1 and v_2 belonging to the abstract task set T, if v_1 and v_2 denote the same task, they are considered equivalent, written v_1 = v_2. Given two elements s_1 and s_2 belonging to the concrete service set S, if s_1 and s_2 denote the same service, they are considered equivalent, written s_1 = s_2.
(b) Overlap relation between workflows. Two workflows G_i=(V_i, E_i, F_i) and G_k=(V_k, E_k, F_k) have an overlap relation if and only if (1) there exists v_x ∈ T, and (2) v_x ∈ V_i and v_x ∈ V_k. Such a v_x is called an overlapping node of G_i and G_k.
(c) Overlay path and overlay path set. Let G_i=(V_i, E_i, F_i) and G_k=(V_k, E_k, F_k) be two workflows, and let E = E_i ∩ E_k. If e_x ∈ E, the ordered set obtained by arranging the two task nodes related by e_x according to the partial order determined by their data dependence relation is called an overlay path. If E is empty but the two workflows have an overlap relation, the set consisting of a single overlapping node is a special overlay path. If l_1 and l_2 are two overlay paths and the first node of l_2 is identical to the last node of l_1, they can be replaced by a new path l = l_1 ∪ l_2; this process is called merging of overlay paths. The set consisting of all overlay paths is called the overlay path set.
(d) Maximal overlay path set. Let OPS be an overlay path set of G_i and G_k. OPS is a maximal overlay path set if and only if any two paths l_i, l_j in OPS satisfy: (1) l_i and l_j are not the same path; (2) no node belongs to both l_i and l_j; and (3) for any node v_i in l_i and any node v_j in l_j, there is no partial order relation between v_i and v_j that can be determined by the transitivity of the data dependence relation. It can be proved that for two workflows G_i and G_k the maximal overlay path set between them is a finite set and is unique.
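The following minimal sketch makes these definitions concrete. It assumes workflows are given as (node list, edge list) pairs, computes the overlapping nodes and the overlay paths obtained from shared edges, and greedily merges adjacent overlay paths; the node names and the greedy merge loop are illustrative simplifications, not the patent's algorithm.

    # Workflows as (V, E): V a list of task nodes, E a list of <v1, v2> data-dependence edges.
    G_i = (["T1", "T2", "T3", "T5"], [("T1", "T2"), ("T2", "T3"), ("T3", "T5")])
    G_k = (["T1", "T2", "T3", "T4"], [("T1", "T2"), ("T2", "T3"), ("T3", "T4")])

    def overlapping_nodes(g1, g2):
        """Nodes belonging to both workflows (the overlap relation holds if non-empty)."""
        return set(g1[0]) & set(g2[0])

    def overlay_paths(g1, g2):
        """Overlay paths from shared edges, merged while one path's last node
        equals another path's first node."""
        shared = [e for e in g1[1] if e in g2[1]]
        paths = [[a, b] for (a, b) in shared]        # each shared edge is an overlay path
        merged = True
        while merged:
            merged = False
            for p in paths:
                for q in paths:
                    if p is not q and p[-1] == q[0]: # merge l1 and l2 into l1 U l2
                        p.extend(q[1:])
                        paths.remove(q)
                        merged = True
                        break
                if merged:
                    break
        if not paths and overlapping_nodes(g1, g2):  # special case: single overlapping node
            paths = [[n] for n in overlapping_nodes(g1, g2)]
        return paths

    print(overlapping_nodes(G_i, G_k))   # {'T1', 'T2', 'T3'}
    print(overlay_paths(G_i, G_k))       # [['T1', 'T2', 'T3']]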
3. The algorithms involved in the system modules and the data structures used by the algorithms
The implementation of this system mainly involves the following algorithms:
(1) Algorithm 1: FIF(task, HFK). The process information mining submodule contains this algorithm; it searches the historical process knowledge repository module for all concrete services corresponding to a given abstract task in the historical processes.
(2) Algorithm 2: PIF(G_x, service_x, PDC). The provenance data mining submodule contains this algorithm; it searches the provenance data collection module for the provenance information of the input and output data of the historical process G_x when it executed the concrete service service_x.
(3) Algorithm 3: Merge(PS_1, PS_2). The process matching and information integration submodule contains this algorithm; it merges two known overlay path sets when the maximal path set is generated.
(4) Algorithm 4: GMPS(G=(V, E, F), HFK, PDC). The process matching and information integration submodule contains this algorithm; it obtains the maximal overlay paths between the workflow template G=(V, E, F) and the historical processes in HFK, together with the input and output data in PDC corresponding to those maximal overlay paths. Algorithm 1, Algorithm 2 and Algorithm 3 are called during its execution.
(5) Algorithm 5: UpdateHFK(G_x, HFK). The HFK update submodule of the database update module contains this algorithm; it updates the historical process knowledge repository (HFK) database with the instantiation information of the process G_x obtained by the workflow execution monitoring module of the workflow engine.
(6) Algorithm 6: UpdatePDC(PInfo, PDC). The PDC update submodule of the database update module contains this algorithm; it updates the provenance data collection (PDC) database with the provenance information PInfo obtained by the workflow execution monitoring module during process execution.
In realizing the above algorithms, the present invention uses three data structures. DE (DElement) is a data structure for storing elements that have a partial order relation; it can hold two basic elements (such as a task node and a concrete service) with a partial order relation between them. CE (CElement) is a compound data structure that can hold both basic elements and DE elements. MList is a multidimensional linked list whose internal structure is shown in Fig. 4: TData in MList stores the task nodes of the workflow template; SData records the concrete services that realize the abstract tasks of TData in the historical processes; PData records the provenance information (including input data and output data) of the execution of the services in SData. For convenience of describing the algorithms, the present invention defines the following atomic operations on these data structures (a small sketch of the data structures follows the list):
1. π1(DE): obtain the first element of DE; π2(DE): obtain the second element of DE.
2. CE[i]: obtain the element with index i in CE; Set CE[i] = object: assign object to CE[i].
3. addTData(TData, MList): add TData to MList.
4. addSData(task, SData, MList): add SData to MList; the position of SData in MList is determined by task.
5. addPData(task, service, PData, MList): insert PData into MList; the position of PData in MList is determined jointly by task and service.
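To make the three data structures and the atomic operations concrete, here is a minimal Python sketch: DE is modeled as an ordered pair, CE as a small indexable container, and MList as nested dictionaries keyed by task and service. This is one possible reading of Fig. 4, stated as an assumption, not the patent's exact layout.

    class DE:
        """Ordered pair of two elements with a partial-order relation between them."""
        def __init__(self, first, second):
            self._items = [first, second]
        def pi1(self):            # π1(DE): first element
            return self._items[0]
        def pi2(self):            # π2(DE): second element
            return self._items[1]

    class CE(list):
        """Compound element: indexable container that may hold basic elements or DEs.
        ce[i] reads CE[i]; ce[i] = obj realizes Set CE[i] = object."""
        pass

    class MList:
        """Multidimensional list: task -> service -> list of provenance records."""
        def __init__(self):
            self._data = {}
        def addTData(self, tdata):                       # add a template task node
            self._data.setdefault(tdata, {})
        def addSData(self, task, sdata):                 # position determined by task
            self._data.setdefault(task, {}).setdefault(sdata, [])
        def addPData(self, task, service, pdata):        # position determined by task + service
            self._data.setdefault(task, {}).setdefault(service, []).append(pdata)

    # Tiny usage example with invented identifiers.
    ml = MList()
    ml.addTData("T1")
    ml.addSData("T1", "S1")
    ml.addPData("T1", "S1", CE(["G1", "input.mzML", "output.csv"]))
    de = DE("S1", "input.mzML")
    print(de.pi1(), de.pi2())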
Brief description of the drawings
Fig. 1 is the processing flow of the system.
Fig. 2 is the framework of the system.
Fig. 3 shows the dependencies involved in OPM provenance. In the figure, P denotes a concrete service and A denotes data used or produced by a service.
Fig. 4 is the data structure used in the workflow matching algorithm.
Embodiment
Specific implementation of the algorithms in each module
(1) The process information mining submodule searches the historical process knowledge repository (HFK) for the concrete services corresponding to a given task in the historical processes; the algorithm that realizes this is Algorithm 1: FIF(task, HFK). As stated above, a basic information unit in HFK is organized as CE = (<task_i, service_i>, {G_x, G_y}). The algorithm is implemented as follows: traverse all information units in HFK; if the first element task_i of CE_i is equivalent to task, i.e. task_i = task, add CE_i to the result list; repeat this process until the traversal ends, and finally return the result list. The pseudocode implementation is given in appendix (1) of the specification.
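A minimal Python sketch of this lookup, under the assumption that HFK is a list of ((task, service), {workflow ids}) units and that task equivalence reduces to string equality, could look as follows; the example entries are invented.

    def FIF(task, hfk):
        """Find in HFK all units whose abstract task is equivalent to `task`.
        hfk: list of ((task_i, service_i), {G_x, ...}) units."""
        result = []
        for ce in hfk:                      # traverse all information units
            (task_i, service_i), workflows = ce
            if task_i == task:              # equivalence reduced to equality here
                result.append(ce)
        return result

    HFK = [
        (("db_search", "MascotService"), {"G1", "G3"}),
        (("db_search", "XTandemService"), {"G2"}),
        (("peak_picking", "MzMineService"), {"G1"}),
    ]
    print(FIF("db_search", HFK))
    # [(('db_search', 'MascotService'), {'G1', 'G3'}), (('db_search', 'XTandemService'), {'G2'})]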
(2) The provenance data mining submodule searches the provenance data collection (PDC) for the provenance information of the execution of a given concrete service in a specified process; the algorithm that realizes this is Algorithm 2: PIF(G_x, service_x, PDC). As stated above, an information unit in PDC is organized as DE = <object_1, object_2>. The algorithm is implemented as follows: create a new object CE of type CE; find all information units in PDC identified by G_x and store them temporarily in the list DElementSet; traverse each element DE of DElementSet: if π1(DE) is equivalent to service_x, then Set CE[2] = π2(DE); otherwise, if π2(DE) is equivalent to service_x, then Set CE[1] = π1(DE). Set CE[0] = G_x. Return CE as the result. The pseudocode implementation is given in appendix (2) of the specification.
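A minimal Python sketch of this procedure follows, with PDC entries represented as (workflow_id, dependency, <object_1, object_2>) tuples written as in the UpdatePDC step below, so that CE[1] ends up holding the data the service generated and CE[2] the data it used. The tuple layout and the identifiers are assumptions made for illustration.

    def PIF(G_x, service_x, pdc):
        """Collect the data objects recorded for service_x during execution of G_x.
        pdc: list of (workflow_id, dependency, (object_1, object_2)) units, written as in
        UpdatePDC: 'used' units are (service, input_data),
                   'wasGeneratedBy' units are (output_data, service)."""
        ce = [None, None, None]                                   # CE with three slots
        delement_set = [de for (wf, _, de) in pdc if wf == G_x]   # units identified by G_x
        for de in delement_set:
            if de[0] == service_x:                     # π1(DE) equivalent to service_x
                ce[2] = de[1]                          # Set CE[2] = π2(DE): data the service used
            elif de[1] == service_x:                   # π2(DE) equivalent to service_x
                ce[1] = de[0]                          # Set CE[1] = π1(DE): data it generated
        ce[0] = G_x                                    # Set CE[0] = G_x
        return ce

    PDC = [
        ("G1", "used",           ("S_search", "spectra.mzML")),
        ("G1", "wasGeneratedBy", ("peptides.csv", "S_search")),
    ]
    print(PIF("G1", "S_search", PDC))   # ['G1', 'peptides.csv', 'spectra.mzML']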
(3) The process matching and information integration submodule is responsible for constructing the maximal overlay path set from the information obtained by the above two submodules. In the main algorithm, Algorithm 4: GMPS(G=(V, E, F), HFK, PDC), the algorithms of the process information mining submodule and the provenance data mining submodule, Algorithm 1 and Algorithm 2, are called to store temporarily in an MList the matching information in HFK and PDC relevant to the given workflow G. After this work is completed, the maximal overlay path set is constructed from the data in the MList in two steps: first, construct the overlay path sets; second, repeatedly merge the overlay path sets constructed in the first step until the paths can no longer be merged, at which point the resulting path set is a maximal path set.
The process of constructing the overlay paths for a given task is as follows: (1) take from the MList all SData entries relevant to the abstract task task of the given workflow G; (2) take all PData entries determined by the SData entries found in (1); (3) for any PData entry CE_i (= (G_x, inputdata, outputdata)), construct an overlay path L_i by: Set L_i[0] = CE[0], Set L_i[2] = <CE[1], CE[2]>, Set L_i[1] = <<task_i, SData_k>>. The set consisting of the overlay paths constructed from all SData and PData entries is the overlay path set corresponding to this abstract task.
The path merging algorithm is as follows: given two overlay path sets PS_1 and PS_2, where all task nodes in PS_2 come after those in PS_1 according to the data partial order. For each overlay path L_i^1 in PS_1, traverse every path L_k^2 in PS_2: (a) if the workflow identifier in L_k^2 equals the workflow identifier in L_i^1, go to step (b), otherwise go to step (c); (b) if π1(L_i^1[2]) = π2(L_k^2[2]), add all elements of L_i^1[1] to L_k^2[1] according to the partial order, otherwise go to step (c); (c) add L_i^1 to PS_2. Finally, return PS_2 as the new path set. The pseudocode implementation is given in appendix (3) of the specification.
The process of building the maximal path set by iterative path merging is as follows: construct a path set for each abstract task node in the set V of the given workflow G, obtaining PS_1, PS_2, ..., PS_n (where n is the number of nodes in V). Merge PS_n with PS_(n-1), then with PS_(n-2), ..., and finally with PS_1. The PS_n obtained after the merging ends is the maximal overlay path set between the given workflow G and the historical processes (each overlay path in it carries the corresponding input and output data). The pseudocode implementation is given in appendix (4) of the specification.
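The sketch below strings these steps together in simplified form: it builds one overlay path per (task, service, provenance) combination assumed to have been extracted by FIF/PIF-style lookups, then merges path sets from the last template task backwards, chaining two paths of the same historical workflow when the earlier step's output equals the later step's input. The data layout, the MATCHES table, the names and the equality test are simplifying assumptions, not the patent's exact pseudocode.

    # Simplified GMPS: template tasks in pipeline order, plus per-task match information
    # already extracted from HFK/PDC as (workflow_id, service, input_data, output_data).
    MATCHES = {
        "T1": [("G1", "S1", "raw.mzML", "peaks.mgf")],
        "T2": [("G1", "S2", "peaks.mgf", "peptides.csv"),
               ("G2", "S2b", "other.mgf", "other.csv")],
    }
    TEMPLATE_TASKS = ["T1", "T2"]          # V of the workflow template, in data order

    def build_path_set(task):
        """One overlay path per (service, provenance) record of this task."""
        return [{"wf": wf, "steps": [(task, srv)], "in": din, "out": dout}
                for (wf, srv, din, dout) in MATCHES.get(task, [])]

    def merge(ps_earlier, ps_later):
        """Merge path sets: extend a later path with an earlier one of the same
        historical workflow when the earlier output feeds the later input."""
        for le in ps_earlier:
            chained = False
            for lk in ps_later:
                if lk["wf"] == le["wf"] and le["out"] == lk["in"]:
                    lk["steps"] = le["steps"] + lk["steps"]   # prepend earlier steps
                    lk["in"] = le["in"]
                    chained = True
            if not chained:
                ps_later.append(le)                           # keep unmatched path as-is
        return ps_later

    def GMPS(template_tasks):
        path_sets = [build_path_set(t) for t in template_tasks]
        result = path_sets[-1]
        for ps in reversed(path_sets[:-1]):                   # PS_n with PS_(n-1), ..., PS_1
            result = merge(ps, result)
        return result

    for path in GMPS(TEMPLATE_TASKS):
        print(path)
    # {'wf': 'G1', 'steps': [('T1', 'S1'), ('T2', 'S2')], 'in': 'raw.mzML', 'out': 'peptides.csv'}
    # {'wf': 'G2', 'steps': [('T2', 'S2b')], 'in': 'other.mgf', 'out': 'other.csv'}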
(4) In the system framework diagram, the database update module is responsible for updating the historical process knowledge repository (HFK) and the provenance data collection (PDC). It has two submodules, HFK update and PDC update; the former updates HFK and the latter updates PDC.
The update of HFK is realized as follows: suppose G_x = ([T_2, T_3, T_4, T_5, T_7], [<T_2, T_3>, <T_3, T_4>, <T_4, T_5>, <T_5, T_7>], []) is a newly defined abstract workflow, and that after instantiation F in G_x is (<T_2, S_2^x>, <T_3, S_3^x>, <T_4, S_4^x>, <T_5, S_5^x>, <T_7, S_7^x>). For any element f in F, if there is a unit in HFK with CE[0] = f, add G_x to CE[1]; otherwise create a new element CE' = (f, {G_x}) and add CE' to HFK. The pseudocode implementation is given in appendix (5) of the specification.
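A minimal sketch of this update, using the same assumed ((task, service), {workflow ids}) representation of HFK units as in the earlier sketches, is as follows; the identifiers are illustrative.

    def UpdateHFK(G_x_id, F, hfk):
        """Record the task->service bindings F of an executed workflow G_x in HFK.
        hfk: list of [ (task, service), {workflow ids} ] units."""
        for f in F:                                   # f = (task, service)
            for unit in hfk:
                if unit[0] == f:                      # an identical element already exists
                    unit[1].add(G_x_id)               # add G_x to CE[1]
                    break
            else:
                hfk.append([f, {G_x_id}])             # otherwise create CE' = (f, {G_x})

    HFK = [[("T2", "S2"), {"G1"}]]
    UpdateHFK("Gx", [("T2", "S2x"), ("T3", "S3x")], HFK)
    UpdateHFK("Gy", [("T2", "S2x")], HFK)
    print(HFK)
    # [[('T2', 'S2'), {'G1'}], [('T2', 'S2x'), {'Gx', 'Gy'}], [('T3', 'S3x'), {'Gx'}]]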
The update of PDC is realized as follows: suppose the provenance information obtained by the system during the execution of G_x is PInfo = (<Data_x, S_2^x, Data_2>, <Data_2, S_3^x, Data_3>, <Data_3, S_4^x, Data_4>, <Data_4, S_5^x, Data_5>, <Data_5, S_7^x, Data_7>). For each element e in PInfo, create two objects DE_1 and DE_2 of type DE, where DE_1 is labeled with the "used" dependence and DE_2 with the "wasGeneratedBy" dependence, and assign them as follows: set DE_1[0] = e[1], set DE_1[1] = e[0]; set DE_2[0] = e[2], set DE_2[1] = e[1]. Use G_x as the process identifier of DE_1 and DE_2, and add them to PDC. The pseudocode implementation is given in appendix (6) of the specification.
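The corresponding sketch for the PDC update follows, with PInfo given as (input_data, service, output_data) triples and PDC entries stored as (workflow_id, dependency, <object_1, object_2>) tuples, the same assumed layout used in the earlier provenance sketches.

    def UpdatePDC(G_x_id, p_info, pdc):
        """Store the provenance triples collected during execution of G_x in PDC.
        p_info: list of (input_data, service, output_data) triples."""
        for e in p_info:
            de1 = (e[1], e[0])          # 'used':            <service, input data>
            de2 = (e[2], e[1])          # 'wasGeneratedBy':  <output data, service>
            pdc.append((G_x_id, "used", de1))
            pdc.append((G_x_id, "wasGeneratedBy", de2))

    PDC = []
    UpdatePDC("Gx", [("Data_x", "S2x", "Data_2"), ("Data_2", "S3x", "Data_3")], PDC)
    for entry in PDC:
        print(entry)
    # ('Gx', 'used', ('S2x', 'Data_x'))
    # ('Gx', 'wasGeneratedBy', ('Data_2', 'S2x'))
    # ('Gx', 'used', ('S3x', 'Data_2'))
    # ('Gx', 'wasGeneratedBy', ('Data_3', 'S3x'))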
It should also be noted that the time consumed by the system increases with the amount of data in HFK and PDC, so it is necessary to keep the amount of data in HFK and PDC within certain limits. The data volume can be controlled by clustering the information in HFK and PDC or by deleting historical processes that are too old.
Appendix explanation:
(1) Algorithm 1: FIF(task, HFK).
(2) Algorithm 2: PIF(G_x, service_x, PDC).
(3) Algorithm 3: Merge(PS_1, PS_2).
(4) Algorithm 4: GMPS(G=(V, E, F), HFK, PDC).
(5) Algorithm 5: UpdateHFK(G_x, HFK).
(6) Algorithm 6: UpdatePDC(PInfo, PDC).
The specification appendix
(1)-(6): the pseudocode listings of Algorithm 1 (FIF), Algorithm 2 (PIF), Algorithm 3 (Merge), Algorithm 4 (GMPS), Algorithm 5 (UpdateHFK) and Algorithm 6 (UpdatePDC) are provided as figures in the original filing.

Claims (7)

1. A provenance-based workflow matching and discovery system for proteomics data analysis, characterized by comprising: a workflow module, an information discovery module, a database update module, a database, and a workflow engine; wherein:
the workflow module contains two parts: the initial workflow and the instantiated workflow; the initial workflow is the scientific workflow that describes the data analysis task and consists of abstract tasks, and serves as the workflow template in the workflow matching process; the instantiated workflow is the scientific workflow obtained after a concrete service has been assigned to each abstract task in the initial workflow;
the database comprises a historical process knowledge repository (HFK) and a provenance data collection (PDC), which are used to record, respectively, the historical process information and the provenance information generated during its execution;
the information discovery module consists of three submodules: a process matching and information integration submodule, a process information mining submodule, and a provenance data mining submodule; the process information mining submodule finds, in the historical process knowledge repository, the task nodes that overlap with the workflow template and their corresponding concrete services; the provenance data mining submodule obtains from the provenance data collection the provenance information corresponding to the matching processes; the process matching and information integration submodule integrates the information obtained by the above two submodules and presents the integrated information to the user;
the workflow engine comprises a workflow execution monitoring module and a workflow execution module; the latter is responsible for executing instantiated workflows, and the former is responsible for collecting the provenance information produced during workflow execution;
the database update module updates HFK and PDC each time the workflow execution module finishes executing a workflow;
the data flow relations between the above modules are as follows:
the user creates an initial workflow according to the task description; using the initial workflow as the workflow template, the information discovery module finds the matching historical processes in the historical process knowledge repository and the provenance information produced during their execution in the provenance data collection; the information obtained by the information discovery module is presented to the user, and the user instantiates the initial workflow with this information, obtaining an instantiated workflow; the workflow execution submodule of the workflow engine executes the instantiated workflow, and at the same time the execution monitoring module of the workflow engine collects the provenance information produced during execution; the database update module stores the provenance information obtained by the workflow engine and the instantiated workflow information, as historical process information, into the historical process knowledge repository and the provenance data collection of the database module.
2. The provenance-based workflow matching and discovery system according to claim 1, characterized in that, when creating the initial workflow and the instantiated workflow, the workflow module of the system uses an abstract task set T and a concrete service set S; wherein T denotes the set of abstract tasks used by the system, with which the user describes a data analysis process; S denotes the concrete services integrated in the system, which realize the abstract tasks in T; in addition, the processes built by the system are data-centered pipelines.
3. The provenance-based workflow matching and discovery system according to claim 2, characterized in that the scientific workflow is a simple directed graph G=(V, E, F), wherein V is the finite set of abstract task nodes in the process, E is the set of data dependence relations between task nodes, and F is the set of mappings from abstract tasks to concrete services.
4. The provenance-based workflow matching and discovery system according to claim 3, characterized in that the information in the historical process knowledge repository module is organized as (<task_i, service_i>, {G_1, G_2}), denoting a basic information unit of the historical process knowledge repository module whose meaning is: the abstract task task_i in G_1 and G_2 was executed by the service service_i when those processes were executed;
the information in the provenance data collection module is organized as <object_1, object_2>, where there is a partial order relation between object_1 and object_2.
5. The provenance-based workflow matching and discovery system according to claim 4, characterized in that
the information discovery module involves the following key concepts: the equivalence relation between tasks or services, the overlap relation between workflows, overlay paths and overlay path sets, and the maximal overlay path set; they are respectively:
(a) the equivalence relation between tasks or services: given two elements v_1 and v_2 belonging to the abstract task set T, if v_1 and v_2 denote the same task, they are considered equivalent, written v_1 = v_2; given two elements s_1 and s_2 belonging to the concrete service set S, if s_1 and s_2 denote the same service, they are considered equivalent, written s_1 = s_2;
(b) the overlap relation between workflows: two workflows G_i=(V_i, E_i, F_i) and G_k=(V_k, E_k, F_k) have an overlap relation if and only if (1) there exists v_x ∈ T, and (2) v_x ∈ V_i and v_x ∈ V_k; such a v_x is called an overlapping node of G_i and G_k;
(c) overlay path and overlay path set: let G_i=(V_i, E_i, F_i) and G_k=(V_k, E_k, F_k) be two workflows and let E = E_i ∩ E_k; if e_x ∈ E, the ordered set obtained by arranging the two task nodes related by e_x according to the partial order determined by their data dependence relation is called an overlay path; if E is empty but the two workflows have an overlap relation, the set consisting of a single overlapping node is a special overlay path; if l_1 and l_2 are two overlay paths and the first node of l_2 is identical to the last node of l_1, they are replaced by a new path l = l_1 ∪ l_2, a process called merging of overlay paths; the set consisting of all overlay paths is called the overlay path set;
(d) maximal overlay path set: let OPS be an overlay path set of G_i and G_k; OPS is a maximal overlay path set if and only if any two paths l_i, l_j in OPS satisfy: (1) l_i and l_j are not the same path; (2) no node belongs to both l_i and l_j; and (3) for any node v_i in l_i and any node v_j in l_j, there is no partial order relation between v_i and v_j that can be determined by the transitivity of the data dependence relation.
6. The provenance-based workflow matching and discovery system according to claim 5, characterized in that the system comprises the following algorithms:
(1) Algorithm 1: FIF(task, HFK); the process information mining submodule contains this algorithm, which searches the historical process knowledge repository module for all concrete services corresponding to a given abstract task in the historical processes;
(2) Algorithm 2: PIF(G_x, service_x, PDC); the provenance data mining submodule contains this algorithm, which searches the provenance data collection module for the provenance information of the input and output data of the historical process G_x when it executed the concrete service service_x;
(3) Algorithm 3: Merge(PS_1, PS_2); the process matching and information integration submodule contains this algorithm, which merges two known overlay path sets when the maximal path set is generated;
(4) Algorithm 4: GMPS(G=(V, E, F), HFK, PDC); the process matching and information integration submodule contains this algorithm, which obtains the maximal overlay paths between the workflow template G=(V, E, F) and the historical processes in HFK, together with the input and output data in PDC corresponding to those maximal overlay paths; Algorithm 1, Algorithm 2 and Algorithm 3 are called during its execution;
(5) Algorithm 5: UpdateHFK(G_x, HFK); the HFK update submodule of the database update module contains this algorithm, which updates the historical process knowledge repository (HFK) database with the instantiation information of the process G_x obtained by the workflow execution monitoring module of the workflow engine;
(6) Algorithm 6: UpdatePDC(PInfo, PDC); the PDC update submodule of the database update module contains this algorithm, which updates the provenance data collection (PDC) database with the provenance information PInfo obtained by the workflow execution monitoring module during process execution.
7. The provenance-based workflow matching and discovery system according to claim 6, characterized in that the system algorithms use the following three data structures: DE, a data structure for storing elements that have a partial order relation, which holds two basic elements with a partial order relation between them; CE, a compound data structure, which holds both basic elements and DE elements; and MList, a multidimensional linked list whose internal structure comprises: TData, which stores the task nodes of the workflow template; SData, which records the concrete services that realize the abstract tasks of TData in the historical processes; and PData, which records the provenance information of the execution of the services in SData.
CN2013103809827A 2013-08-28 2013-08-28 Workflow matching and finding system, based on provenance, facing proteomic data analysis Pending CN103440553A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2013103809827A CN103440553A (en) 2013-08-28 2013-08-28 Workflow matching and finding system, based on provenance, facing proteomic data analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2013103809827A CN103440553A (en) 2013-08-28 2013-08-28 Workflow matching and finding system, based on provenance, facing proteomic data analysis

Publications (1)

Publication Number Publication Date
CN103440553A true CN103440553A (en) 2013-12-11

Family

ID=49694246

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2013103809827A Pending CN103440553A (en) 2013-08-28 2013-08-28 Workflow matching and finding system, based on provenance, facing proteomic data analysis

Country Status (1)

Country Link
CN (1) CN103440553A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103823885A (en) * 2014-03-07 2014-05-28 河海大学 Data provenance dependence relation analysis model-based data dependence analysis method
CN105912588A (en) * 2016-03-31 2016-08-31 中国农业银行股份有限公司 Visualization processing method and system for big data based on memory calculations
CN103745319B (en) * 2014-01-09 2017-01-04 北京大学 A kind of data provenance traceability system based on multi-state scientific workflow and method
CN109658765A (en) * 2019-03-04 2019-04-19 西安交通大学医学院第附属医院 A kind of digital medical images software teaching service system
CN112162737A (en) * 2020-10-13 2021-01-01 深圳晶泰科技有限公司 Universal description language data system of directed acyclic graph automatic task flow
CN112734189A (en) * 2020-12-30 2021-04-30 深圳晶泰科技有限公司 Method for establishing experimental workflow model
CN112948569A (en) * 2019-12-10 2021-06-11 中国石油天然气股份有限公司 Method and device for pushing scientific workflow diagram version based on active knowledge graph
WO2022077222A1 (en) * 2020-10-13 2022-04-21 深圳晶泰科技有限公司 Directed-acyclic-graph-type automatic common workflow description language data system
CN112162737B (en) * 2020-10-13 2024-06-28 深圳晶泰科技有限公司 General description language data system for automatic task flow of directed acyclic graph

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101694709A (en) * 2009-09-27 2010-04-14 华中科技大学 Service-oriented distributed work flow management system
CN102043625A (en) * 2010-12-22 2011-05-04 中国农业银行股份有限公司 Workflow operation method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
GUANGMENG ZHAI: "PWMDS: A system supporting provenance-based matching and discovery of workflows in proteomics data analysis", Proceedings of the 2012 IEEE 16th International Conference on Computer Supported Cooperative Work in Design *



Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20131211
