CN112948569A - Method and device for pushing scientific workflow diagram version based on active knowledge graph - Google Patents

Publication number
CN112948569A
Authority
CN
China
Prior art keywords
workflow
sub
activity
candidate
scientific
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911258247.2A
Other languages
Chinese (zh)
Inventor
孙莎莎
施振生
周长兵
孙梦宇
董大忠
昌燕
马超
武瑾
芮昀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Petrochina Co Ltd
Original Assignee
Petrochina Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Petrochina Co Ltd filed Critical Petrochina Co Ltd
Priority to CN201911258247.2A priority Critical patent/CN112948569A/en
Publication of CN112948569A publication Critical patent/CN112948569A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The method and device for pushing a scientific workflow diagram version based on an activity knowledge graph select candidate activities or candidate sub-workflows from the candidate activity and sub-workflow set of each activity slot, based on semantic similarity and structural similarity, and generate scientific workflow diagram versions according to the fixed structural relationship among all activity slots in the scientific workflow requirement diagram version. Fragments of different granularity from different scientific workflows are thereby reused or shared, and the generated scientific workflow diagram versions are ranked by their similarity to the user requirement and recommended to the user, helping the user reuse or redevelop scientific workflows.

Description

Method and device for pushing scientific workflow diagram version based on activity knowledge graph
Technical Field
The invention relates to the technical field of scientific workflows, in particular to a method and a device for pushing a scientific workflow diagram version based on an activity knowledge graph.
Background
With the development and growing maturity of Web 2.0 technology, applications distributed across different nodes of the Internet are packaged as Web/REST services or hybrid sockets and invoke one another, overcoming differences in platform structure. In practice, a single atomic service provides only simple, limited functionality; complex user requirements call for a faster and more convenient way of developing applications through the reuse and composition of services.
As the number of Web services grows exponentially, dynamic service composition becomes a major challenge. To address this problem, Web services are integrated into scientific workflows through standard interfaces. Each interface is an executable program that implements its function by reading and writing files in a high-performance computing environment. In essence, a scientific workflow describes a multi-step, repeatable execution process comprising the Web services required to complete a task and the data connections between those services. Today, scientists build reconfigurable scientific experiments as scientific workflows. At the same time, the emergence of online scientific workflow databases facilitates the sharing, discovery, and reuse of scientific workflows. As the data in these databases grows richer, the requirements of researchers can be met efficiently and accurately by reusing a scientific workflow, or part of one, either directly or after modification.
A scientific workflow takes a graph as its basic entity, is small in scale but complex in structure, and explicitly describes the interdependencies among its activities, including data dependencies and control dependencies. A data dependency is the direction in which data flows between activities; a control dependency means that no data dependency or data flow exists between two activities, but a mandatory order of execution does.
Notably, a scientific workflow requirement may relate to components of several scientific workflows, which means that it is difficult for any single existing scientific workflow to satisfy it. In such cases, the requirement should be fulfilled by locating appropriate fragments in different scientific workflows and assembling these cross-workflow fragments according to certain principles. The main challenge in current research is therefore that cross-workflow fragments containing activities of different granularity are difficult to discover and to assemble.
Disclosure of Invention
To address at least one of the above disadvantages, an embodiment of the first aspect of the present application provides a method for pushing a scientific workflow diagram version based on an activity knowledge graph, including:
acquiring a scientific workflow requirement diagram version, wherein the scientific workflow requirement diagram version comprises a plurality of activity slots, all the activity slots have a fixed structural relationship, and each activity slot contains an activity or a sub-workflow; an activity is the minimum structural unit, and a sub-workflow comprises a plurality of activities with fixed structural relationships;
acquiring a candidate activity and sub-workflow set for each activity slot based on a preset activity knowledge graph, wherein the activity knowledge graph comprises a plurality of scientific workflows;
selecting candidate activities or candidate sub-workflows from the candidate activity and sub-workflow set of each activity slot based on semantic similarity and structural similarity, and generating a scientific workflow diagram version according to the fixed structural relationship among all activity slots in the scientific workflow requirement diagram version; and
pushing the scientific workflow diagram version.
In certain embodiments, the method further comprises:
establishing the activity knowledge graph.
In certain embodiments, establishing the activity knowledge graph comprises:
extracting the pre-stored scientific workflows and each of their activities and sub-workflows as named entities;
extracting the relationship attributes among the named entities;
supplementing information for each named entity by extracting its title and text description; and
converting the original scientific workflow data into an entity- and relationship-based activity knowledge graph according to the title and text description of each named entity.
In some embodiments, the scientific workflow includes an activity set, a sub-workflow set, and an edge set, where the edge set contains the structural relationships of all activities and sub-workflows, and acquiring a candidate activity and sub-workflow set for each activity slot based on a preset activity knowledge graph includes:
determining the semantic relevance of each sub-workflow and each activity in the activity knowledge graph;
acquiring the candidate activity and sub-workflow sets of the start-point activity slot and the end-point activity slot; and
sequentially determining the candidate activity and sub-workflow sets of the remaining activity slots according to the candidate activity and sub-workflow sets of the start-point and end-point activity slots and the edge set.
In some embodiments, determining the semantic relevance of each sub-workflow and each activity in the activity knowledge graph comprises:
representing each sub-workflow and each activity in the form of a first document, wherein the first document comprises the name and description information of the sub-workflow or activity it represents;
obtaining the representative words of each sub-workflow or activity from the description information;
adding each representative word to the name of the corresponding sub-workflow or activity to form a text fragment, wherein the text fragments of all sub-workflows and activities together form a second document;
converting the second document into the input format of a biterm topic model and inputting it into the biterm topic model;
extracting the representative words into topic units (biterms) based on the principle of the biterm topic model, and computing the probability of each topic unit;
generating the expected topic proportions of the second document from the probability of each topic unit;
determining the optimal number of topics by balancing the generalization ability of the biterm topic model according to perplexity and topic similarity;
for each topic, calculating the average probability with which all activities and sub-workflows generate that topic; and
retaining each topic whose average probability is not less than a set threshold, wherein all activities and sub-workflows corresponding to the retained topics are semantically relevant.
In some embodiments, selecting candidate activities or candidate sub-workflows from the candidate activity and sub-workflow set of each activity slot based on semantic similarity and structural similarity comprises:
calculating the structural similarity and the semantic similarity for the elements of the candidate activity and sub-workflow set;
ranking all activities or sub-workflows in the candidate activity and sub-workflow set according to the structural similarity and the semantic similarity to obtain a sequence ordered from highest to lowest similarity; and
selecting the first K activities or sub-workflows from the sequence as the candidate activities or sub-workflows of the corresponding activity slot, wherein K is a positive integer.
An embodiment of the second aspect of the present application provides a device for pushing a scientific workflow diagram version based on an activity knowledge graph, including:
a scientific workflow requirement diagram version acquisition module, configured to acquire a scientific workflow requirement diagram version, wherein the scientific workflow requirement diagram version comprises a plurality of activity slots, all the activity slots have a fixed structural relationship, and each activity slot contains an activity or a sub-workflow; an activity is the minimum structural unit, and a sub-workflow comprises a plurality of activities with fixed structural relationships;
a candidate activity and sub-workflow set acquisition module, configured to acquire a candidate activity and sub-workflow set for each activity slot based on a preset activity knowledge graph;
a scientific workflow diagram version generation module, configured to select candidate activities or candidate sub-workflows from the candidate activity and sub-workflow set of each activity slot based on semantic similarity and structural similarity, and to generate a scientific workflow diagram version according to the fixed structural relationship among all activity slots in the scientific workflow requirement diagram version; and
a pushing module, configured to push the scientific workflow diagram version.
In certain embodiments, the device further comprises:
an activity knowledge graph establishing module, configured to establish the activity knowledge graph.
In certain embodiments, the activity knowledge graph establishing module comprises:
a named-entity extraction unit, configured to extract the pre-stored scientific workflows and each of their activities and sub-workflows as named entities;
a relationship extraction unit, configured to extract the relationship attributes among the named entities;
an information supplementing unit, configured to supplement information for each named entity by extracting its title and text description; and
an activity knowledge graph conversion unit, configured to convert the original scientific workflow data into an entity- and relationship-based activity knowledge graph according to the title and text description of each named entity.
In some embodiments, the scientific workflow includes an activity set, a sub-workflow set, and an edge set, where the edge set contains the structural relationships of all activities and sub-workflows, and the candidate activity and sub-workflow set acquisition module includes:
a semantic relevance determining unit, configured to determine the semantic relevance of each sub-workflow and each activity in the activity knowledge graph;
an endpoint activity slot candidate set acquisition unit, configured to acquire the candidate activity and sub-workflow sets of the start-point activity slot and the end-point activity slot; and
an intermediate activity slot candidate set acquisition unit, configured to sequentially determine the candidate activity and sub-workflow sets of the remaining activity slots according to the candidate activity and sub-workflow sets of the start-point and end-point activity slots and the edge set.
In some embodiments, the semantic relevance determining unit comprises:
a first document representation unit, configured to represent each sub-workflow and each activity in the form of a first document, wherein the first document comprises the name and description information of the sub-workflow or activity it represents;
a representative word acquisition unit, configured to obtain the representative words of each sub-workflow or activity from the description information;
a second document representation unit, configured to add each representative word to the name of the corresponding sub-workflow or activity to form a text fragment, wherein the text fragments of all sub-workflows and activities together form a second document;
a model input unit, configured to convert the second document into the input format of a biterm topic model and input it into the biterm topic model;
a topic probability statistics unit, configured to extract the representative words into topic units (biterms) based on the principle of the biterm topic model and to compute the probability of each topic unit;
a proportion expectation generating unit, configured to generate the expected topic proportions of the second document from the probability of each topic unit;
an optimal topic number determining unit, configured to determine the optimal number of topics by balancing the generalization ability of the biterm topic model according to perplexity and topic similarity;
a probability average generating unit, configured to calculate, for each topic, the average probability with which all activities and sub-workflows generate that topic; and
a topic retaining unit, configured to retain each topic whose average probability is not less than a set threshold, wherein all activities and sub-workflows corresponding to the retained topics are semantically relevant.
In some embodiments, the scientific workflow diagram version generation module comprises:
a similarity calculation unit, configured to calculate the structural similarity and the semantic similarity for the elements of the candidate activity and sub-workflow set;
a ranking unit, configured to rank all activities or sub-workflows in the candidate activity and sub-workflow set according to the weights of the structural similarity and the semantic similarity to obtain a sequence ordered from highest to lowest similarity; and
a candidate selection unit, configured to select the first K activities or sub-workflows from the sequence as the candidate activities or sub-workflows of the corresponding activity slot, wherein K is a positive integer.
A third aspect of the present application provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method for pushing a scientific workflow diagram version based on an activity knowledge graph as described above.
A fourth aspect of the present application provides a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method for pushing a scientific workflow diagram version based on an activity knowledge graph as described above.
The beneficial effect of this application is as follows:
The method and device for pushing a scientific workflow diagram version based on an activity knowledge graph select candidate activities or candidate sub-workflows from the candidate activity and sub-workflow set of each activity slot, based on semantic similarity and structural similarity, and generate scientific workflow diagram versions according to the fixed structural relationship among all activity slots in the scientific workflow requirement diagram version. Fragments of different granularity from different scientific workflows are thereby reused or shared, and the generated scientific workflow diagram versions are ranked by their similarity to the user requirement and recommended to the user, helping the user reuse or redevelop scientific workflows.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 shows a flowchart of a method for pushing a scientific workflow diagram version based on an activity knowledge graph in an embodiment of the present application.
Fig. 2 shows a schematic view of a knowledge graph fragment of a scientific workflow hierarchy model in an embodiment of the present application.
Fig. 3 shows a schematic diagram of a scientific workflow requirement diagram version in an embodiment of the present application.
Fig. 4 shows a schematic structural diagram of a device for pushing a scientific workflow diagram version based on an activity knowledge graph in an embodiment of the present application.
Fig. 5 shows a schematic structural diagram of an electronic device suitable for implementing embodiments of the present application.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 shows a method for pushing a scientific workflow diagram version based on an activity knowledge graph in an embodiment of the first aspect of the present application, including:
S1: acquiring a scientific workflow requirement diagram version, wherein the scientific workflow requirement diagram version comprises a plurality of activity slots, all the activity slots have a fixed structural relationship, and each activity slot contains an activity or a sub-workflow; an activity is the minimum structural unit, and a sub-workflow comprises a plurality of activities with fixed structural relationships;
S2: acquiring a candidate activity and sub-workflow set for each activity slot based on a preset activity knowledge graph, wherein the activity knowledge graph comprises a plurality of scientific workflows;
S3: selecting candidate activities or candidate sub-workflows from the candidate activity and sub-workflow set of each activity slot based on semantic similarity and structural similarity, and generating a scientific workflow diagram version according to the fixed structural relationship among all activity slots in the scientific workflow requirement diagram version;
S4: pushing the scientific workflow diagram version.
With this method, candidate activities or candidate sub-workflows are selected from the candidate activity and sub-workflow set of each activity slot based on semantic similarity and structural similarity, and scientific workflow diagram versions are generated according to the fixed structural relationship among all activity slots in the scientific workflow requirement diagram version. Fragments of different granularity from different scientific workflows are thereby reused or shared, ranked by their similarity to the user requirement, and recommended to the user, helping the user reuse or redevelop scientific workflows.
In the present application, with reference to Figs. 2 and 3, a scientific workflow is defined as follows: a scientific workflow swf is a quintuple (tl, dsc, SWFsub, ACT, LNK), where:
tl is the title of swf;
dsc is the text description of swf;
SWFsub is the set of sub-workflows contained in swf;
ACT is the set of activities contained in swf;
LNK is the set of edge sets {LNKinv, LNKpch}, where LNKinv denotes the flat invocation relationships defined on SWFsub and ACT, and LNKpch denotes the hierarchical parent-child relationships between the sub-workflows in SWFsub and their corresponding activities in ACT.
As described above, each sub-workflow is a relatively coarse-grained activity. An activity in the myExperiment repository may be a REST/Web service or a shuffle interface and is typically represented by (1) a string name consisting of several keywords and (2) a plain-text description. A scientific workflow is transformed into a hierarchical model in which hierarchical parent-child relationships are explicitly specified between activities in successive layers.
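As an illustration only, the quintuple could be held in a small Python data structure. The class and field names (`Activity`, `ScientificWorkflow`, `swf_sub`, `lnk_inv`, `lnk_pch`), the choice to model a sub-workflow as a nested workflow, and the sample workflow are all assumptions made for this sketch and do not come from the application.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass(frozen=True)
class Activity:
    name: str         # string name made up of several keywords
    description: str  # plain-text description


@dataclass
class ScientificWorkflow:
    tl: str                                                             # title
    dsc: str                                                            # text description
    swf_sub: List["ScientificWorkflow"] = field(default_factory=list)  # SWFsub
    act: List[Activity] = field(default_factory=list)                  # ACT
    lnk_inv: Set[Tuple[str, str]] = field(default_factory=set)         # LNKinv: (caller, callee)
    lnk_pch: Set[Tuple[str, str]] = field(default_factory=set)         # LNKpch: (sub-workflow, activity)


# Hypothetical workflow: one sub-workflow wrapping two activities, plus one plain activity.
fetch = Activity("fetch sequence", "Retrieves a sequence by identifier.")
align = Activity("align sequence", "Aligns the retrieved sequence.")
plot = Activity("plot tree", "Draws a tree from the alignment.")

prepare = ScientificWorkflow(
    tl="prepare data",
    dsc="Fetches and aligns a sequence.",
    act=[fetch, align],
    lnk_inv={("fetch sequence", "align sequence")},
)

swf = ScientificWorkflow(
    tl="phylogeny pipeline",
    dsc="Builds and plots a tree from sequences.",
    swf_sub=[prepare],
    act=[fetch, align, plot],
    lnk_inv={("prepare data", "plot tree")},
    lnk_pch={("prepare data", "fetch sequence"), ("prepare data", "align sequence")},
)
```

Keeping LNKinv and LNKpch as separate sets of name pairs mirrors the two edge kinds in the definition: flat invocations stay within one layer, while parent-child pairs connect successive layers of the hierarchical model.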
In addition, the activity knowledge graph (AKG) is defined as a triple (E, R, S), where:
E is the set of entities, comprising the workflows in {swf} and their sub-workflows and activities.
R is the set of relationship types, comprising (i) PrtOf, which indicates that a sub-workflow or activity belongs to a workflow; (ii) Invok, which indicates a flat invocation relationship between a pair of sub-workflows or activities; and (iii) PrtCld, which indicates the hierarchical parent-child relationship between a sub-workflow and its corresponding activities.
S (given in the original publication as formula image BDA0002310900770000081) is the set of tuples that specify the relationships holding between entities.
It is to be understood that the structural relationships in this application are the flat invocation relationships between pairs of sub-workflows or activities described above.
The activity knowledge graph may be established online or offline; the present application places no limitation on this.
In some embodiments, the step of establishing the activity knowledge graph specifically comprises:
S01: extracting the pre-stored scientific workflows and each of their activities and sub-workflows as named entities;
S02: extracting the relationship attributes among the named entities;
S03: supplementing information for each named entity by extracting its title and text description;
S04: converting the original scientific workflow data into an entity- and relationship-based activity knowledge graph according to the title and text description of each named entity.
Specifically, in step S02, if a sub-workflow or activity belongs to a workflow, a PrtOf relationship holds between them; if a flat invocation exists between a pair of sub-workflows or activities, an Invok relationship holds between them; and where a sub-workflow and its corresponding activities form a hierarchical structure, a PrtCld relationship holds between them. For each activity slot, topic similarity and document similarity to the other activities are then calculated, scored, and ranked to obtain the top K1 candidate activities.
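Steps S01 to S04 can be sketched as a single pass over stored workflow records. This is a minimal sketch under stated assumptions: the input record layout (`title`, `description`, `sub_workflows`, `activities`, `invocations`, `parent_child`), the function name `build_akg`, the relation labels spelled `PrtOf`, `Invok`, and `PrtCld`, and the sample record are hypothetical and are not specified by the application.

```python
def build_akg(workflows):
    """Return (E, R, S) for a list of workflow records (hypothetical layout)."""
    E = {}     # entity name -> {"kind": ..., "description": ...}
    S = set()  # (head entity, relation type, tail entity)
    for wf in workflows:
        # S01 + S03: the workflow itself is a named entity with title and description.
        E[wf["title"]] = {"kind": "workflow", "description": wf["description"]}
        for kind, key in (("sub_workflow", "sub_workflows"), ("activity", "activities")):
            for name, description in wf[key].items():
                # S01 + S03: each sub-workflow and activity is a named entity.
                E[name] = {"kind": kind, "description": description}
                # S02: membership of the entity in the workflow.
                S.add((name, "PrtOf", wf["title"]))
        # S02: flat invocation between pairs of sub-workflows or activities.
        for caller, callee in wf["invocations"]:
            S.add((caller, "Invok", callee))
        # S02: hierarchical parent-child relationship.
        for parent, child in wf["parent_child"]:
            S.add((parent, "PrtCld", child))
    R = {"PrtOf", "Invok", "PrtCld"}
    # S04: the stored records are now expressed as entities and relationships.
    return E, R, S


# Hypothetical stored workflow record.
workflows = [{
    "title": "phylogeny pipeline",
    "description": "Builds and plots a tree from sequences.",
    "sub_workflows": {"prepare data": "Fetches and aligns a sequence."},
    "activities": {
        "fetch sequence": "Retrieves a sequence by identifier.",
        "align sequence": "Aligns the retrieved sequence.",
        "plot tree": "Draws a tree from the alignment.",
    },
    "invocations": [("prepare data", "plot tree")],
    "parent_child": [("prepare data", "fetch sequence"), ("prepare data", "align sequence")],
}]

E, R, S = build_akg(workflows)
```

Storing S as a set of (head, relation, tail) tuples keeps relationship lookups simple, for example when checking whether two candidates are connected by an Invok relationship.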
In step S1, the scientific workflow requirement diagram version is a diagram version with a known workflow template: the activity slots, the fixed relationships among them, and the activity or sub-workflow in each slot are all known. The requirement diagram version is then recombined using candidate activities or sub-workflows to obtain the final scientific workflow diagram version.
Furthermore, in some embodiments, step S2 specifically includes:
S21: determining the semantic relevance of each sub-workflow and each activity in the activity knowledge graph;
S22: acquiring the candidate activity and sub-workflow sets of the start-point activity slot and the end-point activity slot;
S23: sequentially determining the candidate activity and sub-workflow sets of the remaining activity slots according to the candidate activity and sub-workflow sets of the start-point and end-point activity slots and the edge set.
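The application does not spell out how the remaining slots are filled in step S23. One possible reading, offered only as a sketch, is a breadth-first walk over the requirement edges that starts from the two endpoint slots and uses the flat invocation pairs of the knowledge graph to find entities that can connect to candidates already found. The function name `propagate_candidates`, the data layout, and the sample names are hypothetical, and the semantic relevance filter of step S21 is left out for brevity.

```python
from collections import deque


def propagate_candidates(slot_edges, seeds, invok):
    """slot_edges: list of (slot_u, slot_v) edges of the requirement graph.
    seeds: dict mapping the start and end slots to their candidate sets.
    invok: set of (caller, callee) invocation pairs from the knowledge graph."""
    candidates = {slot: set(names) for slot, names in seeds.items()}
    queue = deque(seeds)
    while queue:
        u = queue.popleft()
        for a, b in slot_edges:
            if a == u and b not in candidates:
                # Forward step: entities invoked by some candidate of slot u.
                candidates[b] = {callee for caller, callee in invok if caller in candidates[u]}
                queue.append(b)
            elif b == u and a not in candidates:
                # Backward step: entities that invoke some candidate of slot u.
                candidates[a] = {caller for caller, callee in invok if callee in candidates[u]}
                queue.append(a)
    return candidates


# Hypothetical requirement graph with four slots in a chain.
slot_edges = [("s1", "s2"), ("s2", "s3"), ("s3", "s4")]
seeds = {"s1": {"load"}, "s4": {"plot"}}
invok = {("load", "align"), ("load", "filter"), ("align", "tree"), ("tree", "plot")}

result = propagate_candidates(slot_edges, seeds, invok)
```

In this example slot s2 is reached forward from the start slot and slot s3 backward from the end slot, so each intermediate slot receives only entities that can be wired to an already chosen neighbour.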
In some embodiments, step S21 specifically includes:
S211: representing each sub-workflow and each activity in the form of a first document, wherein the first document comprises the name and description information of the sub-workflow or activity it represents;
S212: obtaining the representative words of each sub-workflow or activity from the description information;
S213: adding each representative word to the name of the corresponding sub-workflow or activity to form a text fragment, wherein the text fragments of all sub-workflows and activities together form a second document;
S214: converting the second document into the input format of a biterm topic model and inputting it into the biterm topic model;
S215: extracting the representative words into topic units (biterms) based on the principle of the biterm topic model, and computing the probability of each topic unit;
S216: generating the expected topic proportions of the second document from the probability of each topic unit;
S217: determining the optimal number of topics by balancing the generalization ability of the biterm topic model according to perplexity and topic similarity;
S218: for each topic, calculating the average probability with which all activities and sub-workflows generate that topic;
S219: retaining each topic whose average probability is not less than a set threshold, wherein all activities and sub-workflows corresponding to the retained topics are semantically relevant.
Specifically, the expected topic proportion is a probability distribution over the document and represents the document-topic distribution matrix. The matching relationship is whether the connection required by the requirement graph exists between candidate activities, which is determined by querying the relationships in the knowledge graph. If such a connectable relationship exists between activities, the fragment is expanded by one more step, so that it grows continuously. The marks refer to labels placed on the activity slots and edges at the beginning, so that it is easy to know which edges and activities have been processed. Then, starting from the head, activities are added to expand the graph fragment structure step by step.
Activities and sub-workflows are first represented as short documents, using their names and description information. The words in the description are evaluated, and the relevant ones are merged with the name to form the short-document representation of the activity or sub-workflow. A word in the description is considered relevant when it is semantically similar to a word in the name, or when it frequently co-occurs with words that are semantically similar or equivalent to words in the name. The representative words of each activity and sub-workflow are selected and added to its name, generating a short document.
Topics are then discovered from the corpus of short documents using the biterm topic model (BTM). The short documents are digitized and converted into the input format required by the BTM. According to the BTM principle, each short text in the corpus is treated as a single text segment, each pair of distinct words in it is extracted as a biterm, and the biterms serve as the training data for the topic probability distributions in the BTM.
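The digitizing and biterm extraction steps can be illustrated with a short sketch. The helper names `digitize` and `extract_biterms`, the whitespace tokenization, and the sample short documents are assumptions for illustration; the application does not prescribe a tokenizer or an encoding, and no topic model is trained here.

```python
from itertools import combinations


def digitize(short_docs):
    """Map each word to an integer id and encode every short document as a list of ids."""
    vocab = {}
    encoded = []
    for doc in short_docs:
        encoded.append([vocab.setdefault(word, len(vocab)) for word in doc.lower().split()])
    return vocab, encoded


def extract_biterms(word_ids):
    """Return every unordered pair of distinct words in one short text as a biterm."""
    return [tuple(sorted(pair)) for pair in combinations(word_ids, 2) if pair[0] != pair[1]]


# Hypothetical short documents (name plus representative words).
short_docs = ["blast sequence search", "sequence alignment"]

vocab, encoded = digitize(short_docs)
biterms = [extract_biterms(ids) for ids in encoded]
```

Each short document is treated as one text segment, so a document of n distinct words yields n(n-1)/2 biterms; the pooled biterms of the whole corpus form the training data.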
The corpus of short documents is regarded as a mixture of topics, and each biterm is drawn independently from a particular topic. The expected topic proportions of the biterms generated by each short document are calculated, and from them the topics of the activities and sub-workflows are inferred. The generalization ability of the BTM is balanced according to perplexity and topic similarity, which determines the optimal number of topics.
Representative topics are then determined. For each topic, the average of the probabilities with which all activities and sub-workflows generate that topic is calculated. A threshold (typically a multiple of this average) is set to reflect the importance of the topic, and the topic probability is compared with the threshold. When the topic probability is not less than the threshold, the topic is retained and regarded as representative of the activity or sub-workflow.
Perplexity is a value computed by formula from the topic distributions; it indicates how clearly the language model expresses the meaning of the documents and how well it discriminates between them. Topic similarity is likewise computed by formula from the probability distributions. Generalization ability refers to whether the trained topic model (also called the language model) can be applied to distinguish all or most documents.
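The selection of representative topics can be sketched as follows, assuming the document-topic probabilities have already been inferred by a trained model. The function name `representative_topics`, the `multiple` parameter, and the sample probabilities are hypothetical. The sketch follows one reading of the description above, in which each entity's topic probability is compared with a threshold equal to a multiple of the per-topic average.

```python
def representative_topics(doc_topic, multiple=1.0):
    """doc_topic: dict mapping each activity or sub-workflow to its topic probabilities.
    Returns, for each entity, the indices of the topics retained as representative."""
    n_topics = len(next(iter(doc_topic.values())))
    # Average probability of each topic over all activities and sub-workflows.
    means = [
        sum(probs[z] for probs in doc_topic.values()) / len(doc_topic)
        for z in range(n_topics)
    ]
    # Retain a topic for an entity when its probability is not less than the threshold.
    return {
        entity: [z for z in range(n_topics) if probs[z] >= multiple * means[z]]
        for entity, probs in doc_topic.items()
    }


# Hypothetical document-topic distributions for three entities and two topics.
doc_topic = {
    "align sequence": [0.8, 0.2],
    "plot tree": [0.1, 0.9],
    "build tree": [0.6, 0.4],
}

retained = representative_topics(doc_topic)
```

Entities that share a retained topic, here "align sequence" and "build tree" for topic 0, would then be treated as semantically relevant to one another.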
It is understood that the activity described in the present application is the smallest structural unit, and each activity slot is an activity node. An activity node may contain a single activity, or a sub-workflow formed by encapsulating a plurality of activities. In each activity slot, the activity and the sub-workflow are mutually exclusive: the slot contains either exactly one activity or exactly one sub-workflow, and the two cannot coexist. A sub-workflow may be regarded as a coarse-grained complex activity (i.e., one formed by a plurality of activities). Activities and sub-workflows therefore have equal status in the activity knowledge graph, since both serve to fulfil a sub-requirement. Accordingly, regardless of whether an activity slot contains a single activity (i.e., a single smallest structural unit) or a sub-workflow, the candidate activity and sub-workflow set described in the present application may be a set including only activities, a set including only sub-workflows, or a set including both activities and sub-workflows.
Furthermore, in some embodiments, step S3, which operates on the structural units (i.e., the activities and sub-workflows) in the candidate activity and sub-workflow sets, specifically includes:
S31: calculating structural similarity and semantic similarity according to the elements in the candidate activity and sub-workflow set;
S32: ranking all the activities or sub-workflows in the candidate activity and sub-workflow set according to the structural similarity and the semantic similarity, to obtain a sequence ordered from high to low similarity; in some embodiments, the topic similarity and document similarity results are weighted by α and β, whose sum is 1, e.g., α is 0.3 and β is 0.7;
S33: selecting the first K activities or sub-workflows from the sequence as the candidate activities or sub-workflows of the corresponding activity slot, wherein K is a positive integer.
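Steps S31 to S33 amount to a weighted ranking followed by a top-K cut, which can be sketched as below. The sketch assumes that the two similarity values of each candidate are already computed; the function name `top_k_candidates` and the example scores are hypothetical, and assigning α to the first score and β to the second is an assumption made only for illustration.

```python
def top_k_candidates(scores, k, alpha=0.3, beta=0.7):
    """Rank candidates by alpha * first + beta * second and keep the top K.

    scores maps a candidate name to a pair of similarity values in [0, 1].
    Which similarity is weighted by alpha and which by beta is an
    assumption of this sketch.
    """
    if k <= 0:
        raise ValueError("K must be a positive integer")
    if abs(alpha + beta - 1.0) > 1e-9:
        raise ValueError("alpha and beta must sum to 1")
    ranked = sorted(
        scores,
        key=lambda name: alpha * scores[name][0] + beta * scores[name][1],
        reverse=True,
    )
    return ranked[:k]


# Hypothetical (first, second) similarity pairs for one activity slot.
slot_scores = {"a1": (0.9, 0.2), "a2": (0.4, 0.8), "sw1": (0.6, 0.6)}
best = top_k_candidates(slot_scores, k=2)
```

With α = 0.3 and β = 0.7 the combined scores are 0.41 for "a1", 0.68 for "a2", and 0.60 for "sw1", so the two retained candidates are "a2" and "sw1".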
It is to be understood that the present application expresses the structural relationship between a workflow and its sub-workflows and activities in a semantic manner. On the basis of the scientific workflows and a hierarchical model, a series of triples is constructed by defining the relevant entity and relationship types, each triple stating a relationship that holds between entities. The constructed activity knowledge graph captures both the flat invocation relationships between activities in a workflow and the hierarchical parent-child relationships between sub-workflows and their corresponding activities.
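The triple construction can be illustrated with a minimal sketch. The nested-dictionary workflow format, the function name `build_triples`, and the relation names `hasChild` and `invokes` are all hypothetical; the present application does not specify a concrete data format, so this only shows how hierarchical and invocation relationships may be flattened into triples.

```python
def build_triples(workflow):
    """Flatten a hierarchical workflow into (head, relation, tail) triples.

    Assumed input format: each node is a dict with a "name", an optional
    list of "children" (activities or sub-workflows), and an optional list
    of "edges" given as (source, target) name pairs at that level.
    """
    triples = set()

    def visit(node):
        # Parent-child relationships of the hierarchy.
        for child in node.get("children", []):
            triples.add((node["name"], "hasChild", child["name"]))
            visit(child)
        # Flat invocation relationships between elements at this level.
        for source, target in node.get("edges", []):
            triples.add((source, "invokes", target))

    visit(workflow)
    return triples


# Hypothetical workflow "wf1" with one activity and one sub-workflow.
wf1 = {
    "name": "wf1",
    "children": [
        {"name": "a1"},
        {
            "name": "sw1",
            "children": [{"name": "a2"}, {"name": "a3"}],
            "edges": [("a2", "a3")],
        },
    ],
    "edges": [("a1", "sw1")],
}
triples = build_triples(wf1)
```

Storing the triples in a set removes duplicates when the same relationship is reached from several workflows.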
A short document representation is obtained by extracting key information from each activity and sub-workflow; the topic representation of the activities and sub-workflows is inferred with the BTM, and representative topics are selected for each activity and sub-workflow according to the importance of each topic. In addition, the optimal number of topics of the model is determined according to metrics such as perplexity and topic similarity, which facilitates the discovery of correlations between activities.
For the workflow fragment that represents a given requirement, corresponding candidate activities or sub-workflows are found for each activity slot; the relationships between the candidate activities or sub-workflows are recombined by means of the constructed activity knowledge graph to form a series of candidate cross-workflow fragments; and the fragments are evaluated and recommended by balancing their structural and semantic similarity to the requirement.
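The recombination of candidates into cross-workflow fragments can be sketched as follows. Note that this sketch enumerates all candidate combinations exhaustively rather than expanding the fragment step by step from the head as described in the present application; it is a simplification chosen for brevity. The function name `candidate_fragments` and all slot and activity names are hypothetical.

```python
from itertools import product


def candidate_fragments(slot_candidates, requirement_edges, kg_edges):
    """Return every assignment of one candidate per activity slot such
    that each edge of the requirement graph is matched by a connection
    between the chosen candidates in the activity knowledge graph.

    slot_candidates: slot name -> list of candidate names
    requirement_edges: list of (slot, slot) pairs in the requirement graph
    kg_edges: set of (candidate, candidate) connections in the graph
    """
    slots = list(slot_candidates)
    fragments = []
    for combo in product(*(slot_candidates[slot] for slot in slots)):
        assignment = dict(zip(slots, combo))
        if all(
            (assignment[u], assignment[v]) in kg_edges
            for u, v in requirement_edges
        ):
            fragments.append(assignment)
    return fragments


# Hypothetical requirement with three slots in a chain.
slot_candidates = {"s1": ["a1", "a4"], "s2": ["sw1", "a5"], "s3": ["a3"]}
requirement_edges = [("s1", "s2"), ("s2", "s3")]
kg_edges = {("a1", "sw1"), ("sw1", "a3"), ("a4", "a5")}
fragments = candidate_fragments(slot_candidates, requirement_edges, kg_edges)
```

Exhaustive enumeration grows with the product of the candidate set sizes, which is one reason an incremental expansion that prunes unconnectable candidates early may be preferable in practice.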
The method for pushing a scientific workflow layout based on the activity knowledge graph converts scientific workflows with a hierarchical structure into a knowledge graph structure, and visualizes the relationships between the sub-workflows and activities in the scientific workflows. Based on the short document representations of the activities and sub-workflows, semantic relevance is quantified using their representative topics, which are generated by the BTM (Biterm Topic Model). In this way, cross-workflow fragments spanning different workflows are discovered, evaluated according to their structural and semantic similarity, and matched against the user's requirement, so that reusable cross-workflow fragments that are more likely to meet the requirement are recommended to the user.
Based on the same inventive concept, as shown in fig. 4, an embodiment of the second aspect of the present application provides an apparatus for pushing a scientific workflow layout based on an activity knowledge graph, including:
a scientific workflow requirement layout acquisition module 1, configured to acquire a scientific workflow requirement layout, wherein the scientific workflow requirement layout comprises a plurality of activity slots, all the activity slots have a fixed structural relationship, and each activity slot contains an activity or a sub-workflow; the activity is the minimum structural unit, and the sub-workflow comprises a plurality of activities with fixed structural relationships;
a candidate activity and sub-workflow set acquisition module 2, configured to acquire the candidate activity and sub-workflow set of each activity slot based on a preset activity knowledge graph;
a scientific workflow layout generation module 3, configured to select candidate activities or candidate sub-workflows from the candidate activity and sub-workflow set of each activity slot based on semantic similarity and structural similarity, and to generate a scientific workflow layout according to the fixed structural relationship among all the activity slots in the scientific workflow requirement layout;
and a pushing module 4, configured to push the scientific workflow layout.
Based on the same inventive concept, in some embodiments, the apparatus further comprises:
an activity knowledge graph establishing module, configured to establish the activity knowledge graph.
Based on the same inventive concept, in some embodiments, the activity knowledge graph establishing module comprises:
an entity extraction unit, configured to extract the pre-stored scientific workflows and each activity and sub-workflow as named entities;
a relationship extraction unit, configured to extract the relationship attributes among the named entities;
an information supplementing unit, configured to supplement information for each named entity and extract the title and text description of each named entity;
and an activity knowledge graph conversion unit, configured to convert the original scientific workflow data into an activity knowledge graph based on entities and relationships according to the title and text description of each named entity.
Based on the same inventive concept, in some embodiments, the scientific workflow includes an activity set, a sub-workflow set, and an edge set, where the edge set includes the structural relationships of all activities and sub-workflows, and the candidate activity and sub-workflow set acquisition module includes:
a semantic relevance determining unit, configured to determine the semantic relevance of each sub-workflow and each activity in the activity knowledge graph;
an endpoint activity slot candidate set acquisition unit, configured to acquire the candidate activity and sub-workflow sets of a start-point activity slot and an end-point activity slot;
and an intermediate activity slot candidate set acquisition unit, configured to sequentially determine the candidate activity and sub-workflow sets of the remaining activity slots according to the candidate activity and sub-workflow sets of the start-point activity slot and the end-point activity slot and the edge set.
Based on the same inventive concept, in some embodiments, the semantic relevance determining unit includes:
a first document representation unit, configured to represent each sub-workflow and each activity in the form of a first document, wherein the document comprises the name and description information of the sub-workflow or activity it represents;
a representative word acquisition unit, configured to acquire the representative words of each sub-workflow or activity according to the description information;
a second document representation unit, configured to add each representative word to the name of the corresponding sub-workflow or activity to form a text segment, wherein the text segments of all the sub-workflows and activities jointly form a second document;
a model input unit, configured to convert the second document into the input format of a biterm topic model and input it to the biterm topic model;
a topic probability statistics unit, configured to extract each representative word as a topic unit based on the principle of the biterm topic model and count the probability of each topic unit;
a proportion expectation generating unit, configured to generate the topic proportion expectation of the second document according to the probability of each topic unit;
an optimal topic number determining unit, configured to balance the generalization ability of the biterm topic model according to perplexity and topic similarity, and to determine the optimal number of topics;
a probability average generating unit, configured to calculate, for each topic, the average of the probabilities with which all the activities and sub-workflows generate the topic;
a topic retaining unit, configured to retain the topics whose probability average is not less than a set threshold; wherein all activities and sub-workflows corresponding to the retained topics have semantic relevance.
Based on the same inventive concept, in some embodiments, the scientific workflow layout generation module includes:
a similarity calculation unit, configured to calculate the structural similarity and the semantic similarity according to the elements in the candidate activity and sub-workflow set;
a ranking unit, configured to rank all the activities or sub-workflows in the candidate activity and sub-workflow set according to the weights of the structural similarity and the semantic similarity, to obtain a sequence ordered from high to low similarity;
and a candidate selecting unit, configured to select the first K activities or sub-workflows from the sequence as the candidate activities or sub-workflows of the corresponding activity slot, wherein K is a positive integer.
According to the apparatus for pushing a scientific workflow layout based on the activity knowledge graph, candidate activities or candidate sub-workflows are selected from the candidate activity and sub-workflow set of each activity slot based on semantic similarity and structural similarity, and a scientific workflow layout is generated according to the fixed structural relationship among all the activity slots in the scientific workflow requirement layout. Fragments of different granularity in different scientific workflows can thus be reused or shared, ranked according to their similarity to the user's requirement, and recommended to the user, which helps the user reuse or redevelop scientific workflows.
An embodiment of the present application further provides a specific implementation manner of an electronic device capable of implementing all steps in the method in the foregoing embodiment, and referring to fig. 5, the electronic device specifically includes the following contents:
a processor (processor)601, a memory (memory)602, a communication Interface (Communications Interface)603, and a bus 604;
the processor 601, the memory 602 and the communication interface 603 complete mutual communication through the bus 604;
the processor 601 is used to call the computer program in the memory 602, and when the processor executes the computer program, the processor implements all the steps of the method in the above embodiments.
Embodiments of the present application also provide a computer-readable storage medium capable of implementing all the steps of the method in the above embodiments, and the computer-readable storage medium stores thereon a computer program, which when executed by a processor implements all the steps of the method in the above embodiments.
The embodiments in the present specification are described in a progressive manner; the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on its differences from the other embodiments. In particular, since the hardware-plus-program embodiment is substantially similar to the method embodiment, its description is brief, and the relevant points can be found in the corresponding description of the method embodiment. Although the embodiments of the present description provide method steps as described in the embodiments or flowcharts, more or fewer steps may be included based on conventional or non-inventive means. The order of steps recited in the embodiments is merely one of many possible orders and does not represent the only order of execution. When an actual apparatus or end product executes, the steps may be executed sequentially or in parallel (e.g., in parallel-processor or multi-threaded environments, or even distributed data processing environments) according to the method shown in the embodiments or the figures. The terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, the presence of additional identical or equivalent elements in a process, method, article, or apparatus that comprises the recited elements is not excluded. For convenience of description, the above apparatus is described as being divided into various modules by function, each described separately.
Of course, in implementing the embodiments of the present description, the functions of each module may be implemented in one or more software and/or hardware, or a module implementing the same function may be implemented by a combination of multiple sub-modules or sub-units, and the like. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form. The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks. As will be appreciated by one skilled in the art, embodiments of the present description may be provided as a method, system, or computer program product. 
Accordingly, embodiments of the present description may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present description may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein. For the system embodiment, since it is substantially similar to the method embodiment, the description is brief, and for the relevant points, reference may be made to the corresponding description of the method embodiment. In the description herein, references to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the specification. In this specification, such uses of the above terms do not necessarily refer to the same embodiment or example. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined by one skilled in the art without contradiction. The above description is only an example of the embodiments of the present disclosure, and is not intended to limit the embodiments of the present disclosure. Various modifications and variations to the embodiments described herein will be apparent to those skilled in the art.
Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the embodiments of the present specification should be included in the scope of the claims of the embodiments of the present specification.

Claims (14)

1. A method for pushing a scientific workflow layout based on an activity knowledge graph, characterized by comprising the following steps:
acquiring a scientific workflow requirement layout, wherein the scientific workflow requirement layout comprises a plurality of activity slots, all the activity slots have a fixed structural relationship, and each activity slot contains an activity or a sub-workflow; the activity is the minimum structural unit, and the sub-workflow comprises a plurality of activities with fixed structural relationships;
acquiring a candidate activity and sub-workflow set of each activity slot based on a preset activity knowledge graph; the activity knowledge graph comprises a plurality of scientific workflows;
selecting candidate activities or candidate sub-workflows from the candidate activity and sub-workflow sets of each activity slot based on semantic similarity and structural similarity, and generating a scientific workflow layout according to the fixed structural relationship among all activity slots in the scientific workflow requirement layout;
and pushing the scientific workflow layout.
2. The scientific workflow layout pushing method according to claim 1, further comprising:
establishing the activity knowledge graph.
3. The scientific workflow layout pushing method according to claim 2, wherein the establishing the activity knowledge graph comprises:
extracting the pre-stored scientific workflows and each activity and sub-workflow as named entities;
extracting the relationship attributes among the named entities;
supplementing information for each named entity, and extracting the title and text description of each named entity;
converting the original scientific workflow data into an activity knowledge graph based on entities and relationships according to the title and text description of each named entity.
4. The scientific workflow layout pushing method according to claim 1, wherein the scientific workflow comprises an activity set, a sub-workflow set and an edge set, the edge set comprises the structural relationships of all activities and sub-workflows, and the acquiring a candidate activity and sub-workflow set of each activity slot based on a preset activity knowledge graph comprises:
determining the semantic relevance of each sub-workflow and each activity in the activity knowledge graph;
acquiring the candidate activity and sub-workflow sets of a start-point activity slot and an end-point activity slot;
and sequentially determining the candidate activity and sub-workflow sets of the remaining activity slots according to the candidate activity and sub-workflow sets of the start-point activity slot and the end-point activity slot and the edge set.
5. The scientific workflow layout pushing method according to claim 4, wherein the determining the semantic relevance of each sub-workflow and each activity in the activity knowledge graph comprises:
representing each sub-workflow and each activity in the form of a first document, wherein the document comprises the name and description information of the sub-workflow or activity it represents;
acquiring the representative words of each sub-workflow or activity according to the description information;
adding each representative word to the name of the corresponding sub-workflow or activity to form a text segment, wherein the text segments of all the sub-workflows and activities jointly form a second document;
converting the second document into the input format of a biterm topic model, and inputting it into the biterm topic model;
extracting each representative word as a topic unit based on the principle of the biterm topic model, and counting the probability of each topic unit;
generating the topic proportion expectation of the second document according to the probability of each topic unit;
balancing the generalization ability of the biterm topic model according to perplexity and topic similarity, and determining the optimal number of topics;
for each topic, calculating the average of the probabilities with which all the activities and sub-workflows generate the topic;
retaining the topics whose probability average is not less than a set threshold; wherein all activities and sub-workflows corresponding to the retained topics have semantic relevance.
6. The scientific workflow layout pushing method according to claim 5, wherein the selecting candidate activities or candidate sub-workflows from the candidate activity and sub-workflow set of each activity slot based on semantic similarity and structural similarity comprises:
calculating structural similarity and semantic similarity according to the elements in the candidate activity and sub-workflow set;
ranking all the activities or sub-workflows in the candidate activity and sub-workflow set according to the structural similarity and the semantic similarity, to obtain a sequence ordered from high to low similarity;
and selecting the first K activities or sub-workflows from the sequence as the candidate activities or sub-workflows of the corresponding activity slot, wherein K is a positive integer.
7. An apparatus for pushing a scientific workflow layout based on an activity knowledge graph, characterized by comprising:
a scientific workflow requirement layout acquisition module, configured to acquire a scientific workflow requirement layout, wherein the scientific workflow requirement layout comprises a plurality of activity slots, all the activity slots have a fixed structural relationship, and each activity slot contains an activity or a sub-workflow; the activity is the minimum structural unit, and the sub-workflow comprises a plurality of activities with fixed structural relationships;
a candidate activity and sub-workflow set acquisition module, configured to acquire the candidate activity and sub-workflow set of each activity slot based on a preset activity knowledge graph;
a scientific workflow layout generation module, configured to select candidate activities or candidate sub-workflows from the candidate activity and sub-workflow set of each activity slot based on semantic similarity and structural similarity, and to generate a scientific workflow layout according to the fixed structural relationship among all the activity slots in the scientific workflow requirement layout;
and a pushing module, configured to push the scientific workflow layout.
8. The scientific workflow layout pushing apparatus according to claim 7, further comprising:
an activity knowledge graph establishing module, configured to establish the activity knowledge graph.
9. The scientific workflow layout pushing apparatus according to claim 8, wherein the activity knowledge graph establishing module comprises:
an entity extraction unit, configured to extract the pre-stored scientific workflows and each activity and sub-workflow as named entities;
a relationship extraction unit, configured to extract the relationship attributes among the named entities;
an information supplementing unit, configured to supplement information for each named entity and extract the title and text description of each named entity;
and an activity knowledge graph conversion unit, configured to convert the original scientific workflow data into an activity knowledge graph based on entities and relationships according to the title and text description of each named entity.
10. The scientific workflow layout pushing apparatus according to claim 7, wherein the scientific workflow comprises an activity set, a sub-workflow set and an edge set, the edge set comprises the structural relationships of all activities and sub-workflows, and the candidate activity and sub-workflow set acquisition module comprises:
a semantic relevance determining unit, configured to determine the semantic relevance of each sub-workflow and each activity in the activity knowledge graph;
an endpoint activity slot candidate set acquisition unit, configured to acquire the candidate activity and sub-workflow sets of a start-point activity slot and an end-point activity slot;
and an intermediate activity slot candidate set acquisition unit, configured to sequentially determine the candidate activity and sub-workflow sets of the remaining activity slots according to the candidate activity and sub-workflow sets of the start-point activity slot and the end-point activity slot and the edge set.
11. The scientific workflow layout pushing apparatus according to claim 10, wherein the semantic relevance determining unit comprises:
a first document representation unit, configured to represent each sub-workflow and each activity in the form of a first document, wherein the document comprises the name and description information of the sub-workflow or activity it represents;
a representative word acquisition unit, configured to acquire the representative words of each sub-workflow or activity according to the description information;
a second document representation unit, configured to add each representative word to the name of the corresponding sub-workflow or activity to form a text segment, wherein the text segments of all the sub-workflows and activities jointly form a second document;
a model input unit, configured to convert the second document into the input format of a biterm topic model and input it to the biterm topic model;
a topic probability statistics unit, configured to extract each representative word as a topic unit based on the principle of the biterm topic model and count the probability of each topic unit;
a proportion expectation generating unit, configured to generate the topic proportion expectation of the second document according to the probability of each topic unit;
an optimal topic number determining unit, configured to balance the generalization ability of the biterm topic model according to perplexity and topic similarity, and to determine the optimal number of topics;
a probability average generating unit, configured to calculate, for each topic, the average of the probabilities with which all the activities and sub-workflows generate the topic;
a topic retaining unit, configured to retain the topics whose probability average is not less than a set threshold; wherein all activities and sub-workflows corresponding to the retained topics have semantic relevance.
12. The scientific workflow layout pushing apparatus according to claim 11, wherein the scientific workflow layout generation module comprises:
a similarity calculation unit, configured to calculate the structural similarity and the semantic similarity according to the elements in the candidate activity and sub-workflow set;
a ranking unit, configured to rank all the activities or sub-workflows in the candidate activity and sub-workflow set according to the weights of the structural similarity and the semantic similarity, to obtain a sequence ordered from high to low similarity;
and a candidate selecting unit, configured to select the first K activities or sub-workflows from the sequence as the candidate activities or sub-workflows of the corresponding activity slot, wherein K is a positive integer.
13. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of any of claims 1 to 6 when executing the program.
14. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method of any one of claims 1 to 6.
CN201911258247.2A 2019-12-10 2019-12-10 Method and device for pushing scientific workflow diagram version based on active knowledge graph Pending CN112948569A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911258247.2A CN112948569A (en) 2019-12-10 2019-12-10 Method and device for pushing scientific workflow diagram version based on active knowledge graph

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911258247.2A CN112948569A (en) 2019-12-10 2019-12-10 Method and device for pushing scientific workflow diagram version based on active knowledge graph

Publications (1)

Publication Number Publication Date
CN112948569A true CN112948569A (en) 2021-06-11

Family

ID=76225443

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911258247.2A Pending CN112948569A (en) 2019-12-10 2019-12-10 Method and device for pushing scientific workflow diagram version based on active knowledge graph

Country Status (1)

Country Link
CN (1) CN112948569A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103440553A (en) * 2013-08-28 2013-12-11 复旦大学 Workflow matching and finding system, based on provenance, facing proteomic data analysis
CN110502621A (en) * 2019-07-03 2019-11-26 平安科技(深圳)有限公司 Answering method, question and answer system, computer equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JINFENG WEN et al.: "Discovering Crossing-Workflow Fragments Based on Activity Knowledge Graph", Lecture Notes in Computer Science, vol. 11877, pages 516-525 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115146920A (en) * 2022-05-27 2022-10-04 电子科技大学 Multi-main-body workflow reconstruction method based on control flow and data dependence
CN115146920B (en) * 2022-05-27 2024-04-26 电子科技大学 Multi-main-body workflow reconstruction method based on control flow and data dependence
CN116089624A (en) * 2022-11-17 2023-05-09 昆仑数智科技有限责任公司 Knowledge graph-based data recommendation method, device and system
CN116089624B (en) * 2022-11-17 2024-02-27 昆仑数智科技有限责任公司 Knowledge graph-based data recommendation method, device and system

Similar Documents

Publication Publication Date Title
Chen et al. Gl2vec: Graph embedding enriched by line graphs with edge features
Bian et al. Network embedding and change modeling in dynamic heterogeneous networks
Li et al. A novel approach for API recommendation in mashup development
Soliman et al. Architectural knowledge for technology decisions in developer communities: An exploratory study with stackoverflow
Fu FCA based ontology development for data integration
CN103425740B (en) A kind of material information search method based on Semantic Clustering of internet of things oriented
Chua et al. Eff2Match results for OAEI 2010
Qamar et al. A majority vote based classifier ensemble for web service classification
US11620453B2 (en) System and method for artificial intelligence driven document analysis, including searching, indexing, comparing or associating datasets based on learned representations
CN112948569A (en) Method and device for pushing scientific workflow diagram version based on active knowledge graph
Stram et al. Weighted one mode projection of a bipartite graph as a local similarity measure
Levin et al. Towards software analytics: Modeling maintenance activities
de Moura et al. Extracting new metrics from version control system for the comparison of software developers
Goldberg et al. CASTLE: crowd-assisted system for text labeling and extraction
Cao et al. Unsupervised construction of knowledge graphs from text and code
Secer et al. Ontology mapping using bipartite graph
Surianarayanan et al. Towards quicker discovery and selection of web services considering required degree of match through indexing and decomposition of non–functional constraints
CN115168609A (en) Text matching method and device, computer equipment and storage medium
Rezende et al. Proposed application of data mining techniques for clustering software projects
Tossavainen et al. Implementing a system enabling open innovation by sharing public goals based on linked open data
Saouli et al. SaaS-DCS: software-as-a-service discovery and composition system-based existence degree
CN114372148A (en) Data processing method based on knowledge graph technology and terminal equipment
CN112988216A (en) Software architecture recovery method based on functional structure
Cerón-Figueroa et al. Instance-based ontology matching for open and distance learning materials
Al-Msie'Deen et al. Naming the identified feature implementation blocks from software source code

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination