CN112162737A - Universal description language data system of directed acyclic graph automatic task flow


Info

Publication number: CN112162737A
Authority: CN (China)
Prior art keywords: workflow, type, definition, layer, file
Legal status: Granted (Active)
Application number: CN202011091614.7A
Other languages: Chinese (zh)
Other versions: CN112162737B
Inventors: 姜子麒, 温书豪, 谈樑, 刘阳, 马健, 范陕姗, 赖力鹏
Current and original assignee: Shenzhen Jingtai Technology Co Ltd
Priority/filing date: 2020-10-13
Publication dates: 2021-01-01 (CN112162737A), 2024-06-28 (grant, CN112162737B)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/30 Creation or generation of source code
    • G06F8/31 Programming languages or programming paradigms
    • G06F8/315 Object-oriented languages


Abstract

The invention provides a universal description language data system for directed acyclic graph automatic task flows, comprising: a Step definition layer, a Workflow definition layer and a Template definition layer. The Step layer describes a single task: for the input and output declarations of a Docker image or other executor, it specifically declares the name, type, documentation, parameters and other information of each input and output item. The Workflow layer is a workflow composed of one or more Steps; the dependency topology of the Steps must be defined, and shared parameters may be defined. The Template presets parameters and describes, checks or defines data sources for supplementary parameters on the basis of a Workflow definition. The language is used together with an interpreter, a data center and a task execution tool, and the corresponding tools must be implemented in a programming language. The data center must be able to store each definition document and index to the corresponding document through reference links, and the interpreter must read all definition contents and assign the corresponding data to the definition structures according to the reference links.

Description

Universal description language data system of directed acyclic graph automatic task flow
Technical Field
The invention belongs to the field of cloud computing, and particularly relates to a universal description language data system for directed acyclic graph automatic task flows.
Background
The trend toward professional specialization in scientific computing is now obvious: developing algorithms for narrowly defined problems and building engineering applications guided by practical value have split into two directions of development. Combining the application of subdivided professional methods to accomplish an objective has become an indispensable requirement, and the ever-smaller granularity of methods raises both the cost of learning a large number of methods and the labor cost of combining them, so automated workflow technology is widely applied in many fields. In scientific computing: the SBP (Seven Bridges Platform) developed by the US company SBG (Seven Bridges Genomics) is regarded as the company's core technology; it integrates algorithms, data hosting and computing resources, and builds and executes flexible genome data analysis pipelines through a visual interface. CWL (Common Workflow Language) is aimed mainly at data-intensive scientific computing, describing task flows in a text format and chaining the execution of different tasks with command-line tools; mainstream cloud computing resource providers such as AWS, Google and AliCloud, and computing standards such as HPC and Spark, provide support for CWL.
A task flow description language has two core functions: defining specific tasks and defining the flow between tasks. Defining a specific task describes its inputs, outputs and execution mode; defining the inter-task flow specifies the execution order of tasks and the paths along which data flows. Together these provide the core information that a task flow engine parses, so that automated execution of the task flow can be completed.
A closed task flow platform lacks openness and usability: it can orchestrate only the algorithm tasks provided internally, can hardly meet rapidly developing computing requirements, and cannot access flexible computing resources. The present invention therefore focuses mainly on a universal task flow description language.
Existing universal description languages mainly have the following defects:
1. Coarse task description granularity:
Existing universal description languages describe a single task at very coarse granularity. The user must define the input parameter group and how it is obtained, but the language does not cover the type and detailed structure of the data, so the user needs deep knowledge of the task's specific characteristics, and no data check or verification can be provided when data of the wrong type or structure is supplied.
2. A high threshold of computer domain knowledge, which makes orchestration inconvenient for non-computer professionals:
Existing universal description languages are tightly coupled to computer programming technology and expose a large number of low-level details and computer-specific terms, so a user can write them only with a certain amount of computer domain knowledge. The requirement that algorithm writers (computer engineers) and algorithm users (scientists) use the algorithms at the same time cannot be met.
3. Numerous and complicated information, and no data reuse:
When an existing universal description language is applied to high-performance scientific computing, its generality forces a large number of parameters arising from the characteristics of the scientific computing field to be defined and entered repeatedly, and there is no data template or data overlay filling capability.
4. Lack of automatic parallel primitives:
In scientific computing, parallelism is an indispensable capability because of the enormous computing requirements. Existing universal description languages lack parallel description primitives such as map-reduce or scatter-gather and cannot provide automatic parallelization for single-point tasks.
Disclosure of Invention
Based on this, there is a need for a universal description language data system for directed acyclic graph automatic task flows, comprising:
a Step definition layer, a Workflow definition layer and a Template definition layer;
the Step layer describes a single task: for the input and output declarations of a Docker image or other executor, it specifically declares the name, type, documentation, parameters and other information of each input and output item;
the Workflow layer is a workflow composed of one or more Steps; the dependency topology of the Steps must be defined, and shared parameters may be defined;
the Template presets parameters and describes, checks or defines data sources for supplementary parameters on the basis of a Workflow definition.
In this technical scheme, custom types are defined through TypeDef. TypeDef supports keywords such as type, required, symbols, value, const, ref and embedded_value, and the details these keywords add to the data raise the granularity of the data description. The Step definition describing a single task references different TypeDefs in its input and output declarations to declare the Step's inputs and outputs, achieving a high-precision, complete task description. This solves the problem that existing universal description languages describe a single task at very coarse granularity.
The concrete steps by which a Template is built on a Workflow are as follows. The values in the Template are resolved; they contain entries such as $step2/$in_arg2: {const: 1.0}. The first field consists of the step name and the entry name separated by a slash; these two parameters locate the specific item in the Workflow. In the second field, the key before the colon indicates the nature of the data, which in the example is constant, and the part after it is the specific data. When the Template is applied, the property and data expressed by the second field overwrite the item located by the step name and entry name of the first field.
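As an illustration, the following minimal Python sketch (not part of the patent; the function name and dictionary shapes are assumptions) shows how such a Template entry could be located and applied to a parsed Workflow structure:

def apply_template_value(workflow, locator, payload):
    # locator, e.g. "$step2/$in_arg2": step name and entry name, slash-separated
    # payload, e.g. {"const": 1.0}: nature of the data plus the specific data
    step_name, entry_name = locator.split("/", 1)
    (nature, data), = payload.items()
    entry = workflow["steps"][step_name]["in"][entry_name]
    entry[nature] = data  # the located item is overwritten

workflow = {"steps": {"$step2": {"in": {"$in_arg2": {"type": "double"}}}}}
apply_template_value(workflow, "$step2/$in_arg2", {"const": 1.0})
assert workflow["steps"]["$step2"]["in"]["$in_arg2"]["const"] == 1.0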
References between the levels are implemented by urls. For example, (type: ^typedef/common/version/1.0.0) introduces, as the type of the variable, the TypeDef definition named common with version 1.0.0; the original text of the definition is obtained by requesting the data center with these parameters.
Preferably, the system further comprises a TypeDef layer. If a user needs to use a special custom type, a TypeDef layer must be written; this part mainly abstracts the definitions of general or complex compound types for convenient reference and management.
In a further technical scheme, the keywords in TypeDef such as required, type, value, const, serializer and symbols support, besides type declarations, detailed declarations such as default values, constants and enumerated values. By matching input data against these declarations, finer data checks and verification can be achieved.
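For illustration only, a Python sketch of such keyword-driven checking might look as follows (the helper name and the mapping of language types to Python types are assumptions, not the patent's implementation):

def check_value(define, data):
    # define is a TypeAndValueDefine-like dict with type/required/symbols/const keys
    type_map = {"int": int, "double": float, "boolean": bool, "string": str}
    if data is None:
        if define.get("required", True):  # required defaults to true
            raise ValueError("required value is missing")
        return data
    expected = type_map[define["type"]]
    if not isinstance(data, expected):
        data = expected(data)  # attempt forced conversion; raises on failure
    if "symbols" in define and data not in define["symbols"]:
        raise ValueError("value not in enumeration %r" % (define["symbols"],))
    if "const" in define and data != define["const"]:
        raise ValueError("conflict: a const value cannot be overwritten")
    return data

print(check_value({"type": "string"}, 123))                   # converted to "123"
print(check_value({"type": "int", "symbols": [1, 2, 3]}, 2))  # passes: 2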
The invention also provides a scheme for referencing data between the four layers of definitions, as follows:
1. A Step only references TypeDef, by filling in the reference url when defining the type of a datum, as in:
inputs:
  $in_arg1:
    type: ^typedef/common/jobArgs/version/1.0.0
which indicates that the Step references the type definition named jobArgs inside the TypeDef definition named common;
2. A Workflow only references Steps, by filling in the url when declaring the run field of each Step used, as in:
steps:
  $step1:
    run: ^step/demo/version/1.0.0
which indicates that the Workflow references the Step definition named demo with version 1.0.0;
3. A Template only references a Workflow, implemented by filling in the url in the workflow field declared in the metadata, as in:
workflow: ^workflow/some_workflow/version/1.0.0
which indicates that the Template applies to the Workflow definition named some_workflow.
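A small Python sketch of splitting these reference urls into data-center query parameters follows; the field layout is inferred from the examples above, and the function name is an assumption:

def parse_ref(url):
    if not url.startswith("^"):
        raise ValueError("not an external link: %r" % url)
    parts = url[1:].split("/")
    doc_class, name = parts[0], parts[1]
    # a five-field url carries a sub-definition name, e.g. jobArgs inside a TypeDef
    sub_name = parts[2] if len(parts) == 5 else None
    version = parts[-1]  # the segment after ".../version/"
    return doc_class, name, sub_name, version

assert parse_ref("^step/demo/version/1.0.0") == ("step", "demo", None, "1.0.0")
assert parse_ref("^typedef/common/jobArgs/version/1.0.0") == \
    ("typedef", "common", "jobArgs", "1.0.0")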
Correspondingly, the invention also provides a parsing method using the system, comprising the following steps:
Recursive analysis: the input document and all documents it depends on are pulled from the data center to the local machine. The parser recursively traverses each value of the first input document; if a value is an external link beginning with ^, the linked file is downloaded through the data center Client, and the step is repeated for the new file until all dependent links are ready.
Syntax tree analysis: because each layer of the description language has priority and overlay relations, the overlays must be constructed and applied layer by layer from the bottommost layer in order to realize the layer-by-layer overlay data logic;
the Template file is then parsed: the specific variables and values in the Template are traversed and indexed to an input/output value of a Step in the Workflow object, where the overlay operation is performed.
After parsing, a tree in object form is obtained: the Workflow object is the root node, it contains all Step objects through its steps attribute, and the Step objects contain all TypeDef objects through their inputs/outputs attributes;
besides building this hierarchy-explicit object tree and the hierarchical assignments, the second important algorithm of the parser is the topological sorting of the Workflow. The user defines the dependencies between different Steps, and the topological sorting algorithm solves for the most efficient execution plan of the Steps.
Preferably, the parsing method includes:
The first layer parses the type definition files whose Class is TypeDef, constructs all TypeDef objects from the file contents, and stores them in memory as a K:V mapping.
The second layer constructs Step objects, parsing all files whose class is Step. A Step object is built from the file content; its inputs/outputs attributes contain several TypeDef objects, and if the Step uses a variable of a custom type, the loaded object is taken from the TypeDef K:V mapping to replace the object inside the Step, and the value overlay operation is performed.
The third layer constructs the Workflow object, parsing the file whose class is Workflow. The Workflow object is built from the file content; its steps attribute contains all Steps involved in the Workflow, stored as a StepName:StepObject mapping. The Workflow takes all Step objects it depends on from the Step mapping, stores them in its steps attribute, and performs the overlay operation between the values in the Workflow definition and the values in the Step objects according to the file contents.
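The following condensed Python sketch illustrates these three layers on miniature documents; the overlay helper and the in-memory document shapes are assumptions made for the example:

def overlay(base, upper):
    # upper-layer values win; base supplies whatever upper leaves unset
    merged = dict(base)
    merged.update(upper)
    return merged

# first layer: all TypeDef objects are collected into a K:V mapping
typedef_docs = [{"class": "TypeDef", "name": "common",
                 "typeDefs": {"jobArgs": {"type": "record"}}}]
typedefs = {(d["name"], t): body
            for d in typedef_docs for t, body in d["typeDefs"].items()}

# second layer: Step objects; custom-typed variables are replaced from the mapping
step_doc = {"class": "Step", "name": "demo",
            "inputs": {"$in_arg1": {"type": "^typedef/common/jobArgs/version/1.0.0"}}}
for arg in step_doc["inputs"].values():
    if arg["type"].startswith("^typedef/"):
        _, td_name, sub, _, _ = arg["type"][1:].split("/")
        arg.update(overlay(typedefs[(td_name, sub)], arg))  # value overlay
steps = {step_doc["name"]: step_doc}

# third layer: the Workflow object stores its Steps as a StepName:StepObject mapping
workflow_doc = {"class": "Workflow",
                "steps": {"$step1": {"run": "^step/demo/version/1.0.0"}}}
workflow_doc["steps"] = {n: overlay(steps[d["run"][1:].split("/")[1]], d)
                         for n, d in workflow_doc["steps"].items()}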
Preferably, the topological sorting algorithm comprises the following steps:
Step A: a FollowMap is derived from the ref links marked in the inputs of each Step; the FollowMap is a mapping of <stepName: list of Steps that depend on this Step>;
Step B: after the FollowMap is obtained, the mapping is inverted to obtain the LeaderMap, i.e. the mapping of <stepName: list of Steps this Step depends on>;
Step C: the concept of Distance is introduced, abbreviated Dis in the flow chart, meaning the dependence distance from being runnable; it defaults to 1;
Step D: all Steps are traversed. If a Step has not been checked, its leader Steps are traversed; if a Step has no leader, it has no dependency, and it is marked checked with its Dis set to 1. If the LeaderMap shows it depends on other Steps, the Dis values of the leader Steps are added to the Dis of this Step, and so on.
The core of this recursive idea is to apply the topological sorting algorithm of mathematical graphs by declaring the topological dependencies connecting Steps in the Steps' inputs; FollowMap and LeaderMap are representations of the adjacency matrix. The starting points are determined through the LeaderMap; a starting Step's Dis is set to 1, and the Dis of an intermediate Step is the sum of the Dis values of the Steps on the path from the starting point to that Step. By sorting on Dis, the most efficient running order is obtained. When a Step finishes executing, the running order of the current state can be updated simply by recursively subtracting from the Dis of the following nodes according to the FollowMap.
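A runnable Python sketch of this ordering is given below, under the Dis convention as read from the text above (the function names are assumptions): invert builds the LeaderMap from the FollowMap, and resolve accumulates Dis recursively.

def invert(follow_map):
    # FollowMap: step -> steps that depend on it; inverted, it yields the LeaderMap
    leader_map = {}
    for step, followers in follow_map.items():
        leader_map.setdefault(step, [])
        for f in followers:
            leader_map.setdefault(f, []).append(step)
    return leader_map

def topo_order(leader_map):
    dis = {}

    def resolve(step):
        if step not in dis:  # not yet checked
            leaders = leader_map.get(step, [])
            # no leader: no dependency, Dis stays at the default 1;
            # otherwise the leaders' Dis values are added on top
            dis[step] = 1 + sum(resolve(l) for l in leaders)
        return dis[step]

    for step in leader_map:
        resolve(step)
    return sorted(leader_map, key=lambda s: dis[s]), dis

follow_map = {"step1": ["step2", "step3"], "step2": ["step3"], "step3": []}
order, dis = topo_order(invert(follow_map))
print(order, dis)  # ['step1', 'step2', 'step3'] {'step1': 1, 'step2': 2, 'step3': 4}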
A further technical scheme of the invention thereby solves the problem of the large number of parameters that must be defined because of the characteristics of the scientific computing field, and supplies the data template and data overlay filling capability.
The invention brings the following beneficial effects:
1. Detailed input-output description: the matching of inputs and outputs is described in detail by checking whether their type keys are identical.
2. Types and values are implemented by specifying, through the ref keyword, an output item as the data source of an input item (e.g. $arg1: {ref: $step1/$output_list1} in the Workflow example indicates that the input arg1 links to the output item named output_list1 of step1); custom types are supported.
3. This is implemented with the doc keyword contained in the TypeAndValueDefine substructure. The doc keyword can carry added description information and is displayed only as annotation; it is not parsed or computed, so explanatory text irrelevant to the computation can be added, and specific input checks and data conversions can be provided based on the type information.
4. Domain knowledge decoupling is realized through the separation into four layers. In a usage scenario, professionals in the computer field write the TypeDef definitions and the Step definitions; professionals in the scientific computing field write Workflow definitions by referencing the Step definitions; professionals in task operation use the Workflow definitions in combination and write Template definitions according to their accumulated submission experience. The result is domain knowledge decoupling: single tasks and workflow orchestration are completely decoupled, neither computer-related knowledge nor the details of specific algorithm tasks need to be known, and orchestration and chaining are possible as long as the input and output types defined by the tasks can be matched.
5. Automatic concurrency primitives: the scatter_gather keyword declares the specific distribution parameters; the input data list is split into several data groups according to the distribution parameters, several subtasks are created, and each data group is distributed to a subtask for parallel computation, supporting the declaration of scatter-gather automatically concurrent subtasks.
6. The parallel capability provided by automatically concurrent subtasks lets the workflows defined in this description language be applied in many scenarios such as computation acceleration, data analysis and streaming computation, rather than being limited to a single simple description that provides inputs and outputs for a computation. Domain decoupling is also strengthened: algorithm experts focus their development on solving the abstract problem without considering parallelism, which is a computer-domain concept; when the actual computing problem requires batch parallel processing, that capability is satisfied by the description language, expanding the usage modes and capabilities of tasks.
7. Data template application and data overlay: the data among the four layers of definitions have reference relations and can be overlaid according to the priority of the levels; by applying different data templates, one-click configuration or default parameter configuration can be realized.
The invention is only a set of language standards: the language provides the definition of all necessary information, and executing specific tasks requires it to be used together with an interpreter, a data center and a task execution tool, each of which must be implemented in a programming language. The data center must be able to store each definition document and index to the corresponding document through reference links, and the interpreter must read all definition contents and assign the corresponding data to the definition structures according to the reference links. The task execution tool takes the complete structured data obtained by the interpreter and schedules task submission according to that information.
Drawings
FIG. 1 shows the detailed hierarchy and reference relationships of the language in the data system of the present invention.
FIG. 2 shows the data template overlay of the language in the data system of the present invention.
FIG. 3 is a flow chart of the specific usage of the language in the data system of the present invention.
Fig. 4 shows a workflow of data center uploading and downloading.
Fig. 5 shows the main parsing flow.
Fig. 6 shows the main idea of the topology ranking algorithm.
Detailed Description
Example 1
FIG. 1 illustrates the specific hierarchy and reference relationships of the language. A Workflow is specifically described through four levels: the TypeDef definition layer, the Step definition layer, the Workflow definition layer and the Template definition layer. Examples and introductions of each definition layer follow:
The TypeDef layer is not mandatory; if a user needs to use a special custom type, a TypeDef layer must be written. This part mainly abstracts the definitions of general or complex compound types for convenient reference and management. The Step layer describes a single task: for the input and output declarations of a Docker image or other executor, it specifically declares the name, type, documentation, parameters and other information of each input and output item. The Workflow layer is a workflow composed of one or more Steps; the dependency topology of the Steps must be defined, and shared parameters may be defined. The Template layer must be based on a Workflow definition; a Template can preset parameters and describe, check or define data sources for supplementary parameters, and a Template must explicitly declare a unique Workflow. References between the levels are implemented by urls; for example, (type: ^typedef/common/version/1.0.0) introduces, as the type of the variable, the TypeDef definition named common with version 1.0.0.
Example 2
xwlVersion describes the version of the description language and distinguishes the version iterations brought by the continual addition of functions; class describes the type of this document and takes one of four values (TypeDef, Step, Workflow, Template); version describes the version of the definition; author describes the author information; doc carries the annotation explaining the document; name describes the name of the document, and authors must keep names unique when writing documents of the same type.
The description language defines a substructure named TypeAndValueDefine, which contains the type, name, value and several attributes that define a variable in detail. The following are three representative examples of TypeAndValueDefine:
$name:
  type: int[]
  const: 1
  value: 1
  ref: $xxx/$xxx
  required: true
  doc: this is a type and value define demo
  symbols: [1, 2, 3]
$oneDir:
  type: dir
  autoSyncInterval: 300  # time interval (unit: s) for automatic upload
  autoSyncIgnore: ["run_charm_[0-9]*/", "calc[0-9]*/"]
$oneObject:
  type: object
  serializer:  # the obj type requires a codec definition
    saveType: file
    fileExt: json
    encoder: ^py:json.dumps
    decoder: ^py:json.loads
The specific steps are as follows:
First step: define the name of the substructure at the outermost layer; the outermost $name is the name of the substructure and, following the language convention, begins with $;
Second step: define the general keywords for the required attributes of the described substructure: the type keyword is the type declaration, supporting the structures int, double, boolean, string, array, file, dir, dict, record and obj, and a [] suffix marks the type as a list; const and value are mutually exclusive keywords giving the value represented by the substructure definition, value being a variable value and const an immutable value; ref is a keyword mutually exclusive with value/const, identifying that the value is sourced by reference from one other TypeAndValueDefine substructure; the required keyword states whether the definition must have a value, defaulting to true; the doc keyword is a description; the symbols keyword is an enumeration value domain, used when the value domain needs to be restricted;
Third step: define the special keywords supported by the particular type of the described substructure:
In the second definition the substructure is of the folder (dir) type. An autoSyncInterval keyword can be defined as the interval of automatic synchronization; the autoSyncIgnore keyword is a list of file names ignored by default, supporting regular-expression syntax.
In the third definition the substructure is of a custom object type. The serializer keyword can be defined to give the codec definition the object needs: the saveType keyword is the storage mode and can be file/string; fileExt is the suffix of the stored file, used when saveType is file; the encoder keyword is an encoder url, and the linked encoder must be an executable method that receives an object and returns string data; the decoder keyword is the decoder url, linked to the executable method the decoder needs, which accepts string data and returns an object. The codec follows the external link criterion, using the ^py: prefix to identify it as a python method.
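As an illustration of the ^py: codec convention, the sketch below resolves a module.attribute link such as ^py:json.dumps into a callable and round-trips an object; the resolver itself is an assumption (only the prefix convention and the json codecs come from the language examples):

import importlib

def resolve_py_link(link):
    # "^py:json.dumps" -> the json.dumps callable
    assert link.startswith("^py:"), "codec links use the ^py: prefix"
    module_name, _, attr = link[len("^py:"):].rpartition(".")
    return getattr(importlib.import_module(module_name), attr)

serializer = {"saveType": "file", "fileExt": "json",
              "encoder": "^py:json.dumps", "decoder": "^py:json.loads"}
encode = resolve_py_link(serializer["encoder"])  # receives an object, returns a string
decode = resolve_py_link(serializer["decoder"])  # receives a string, returns an object

text = encode({"cores": 16, "memory": 24000})
assert decode(text) == {"cores": 16, "memory": 24000}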
Example 3
An example of a TypeDef definition named common is as follows (the general information section is not repeated):
xwlVersion: 1.0.0
class: TypeDef
doc: a structure type def
author: ziqi.jiang
name: common
version: 1.0.0
typeDefs:
  $jobArgs:
    doc: Contains some info about compute args
    type: record
    fields:
      :cores:
        type: int
      :memory:
        type: int
The specific steps are as follows:
First step: the typeDefs keyword is defined at the outermost layer and contains several TypeAndValueDefine substructures. This definition declares a record datum named jobArgs; fields is the sub-keyword declaration of the record type and contains the two attributes cores and memory;
Second step: in a type, a TypeAndValueDefine substructure that uses this TypeDef declares it through a link of the following fixed format:
$use_typedef_demo:
  type: ^typedef/common/jobArgs/version/1.0.0
Example 4
A Step definition contains the detailed description of a computational task. The following is an example of a Step definition (the general information section is omitted):
entryPoint: ^py:/home/job/run/loader.py
jobArgs:
  type: ^typedef/common/jobArgs/version/1.0.0
  value:
    cores: 16
    memory: 24000
inputs:
  $in_arg1:
    type: file
outputs:
  $out_arg1:
    type: double
First step: define the four primary keywords that describe the Step attributes: entryPoint, jobArgs, inputs, outputs. entryPoint is the execution entry of the Step; in the example it is the loader.py file located under the /home/job/run directory, executed by python. jobArgs holds the execution parameters of the Step; in the example the referenced TypeDef is used and given the default value of 16 cores and 24000 MB of memory;
Second step: define the input and output items: inputs/outputs are the input/output parameter lists of the Step, containing several TypeAndValueDefine substructures.
Example 5
A Workflow definition contains several Step declarations and the parameter dependencies between the Steps. The following is an example of a Workflow definition (the general information part has been omitted):
vars:
  $share_arg1:
    type: string
steps:
  $step1:
    run: ^step/demo/version/1.0.0
    jobArgs:
      cores: 2
      memory: 3000
    in:
      $arg1:
        ref: vars/$share_arg1
    out: [$output_list1, $output2]
  $step2:
    run: ^step/scatter_gather_demo/version/1.0.0
    scatter:
      maxJob: 100
      minJob: 0
      zip:
        $scatter_in_arg1: {ref: $step1/$output_list1}
      jobIn:
        $in_arg1: {ref: zip/$item}
        $in_arg2: ~
    gather:
      failTol: 0.1
      retryLimit: 1
      jobOut: [$output]
      unzip:
        $gather_outputs: {ref: jobOut/$output}
outputs:
  $out_wf_arg1: {ref: $step2/gather_outputs}
The specific steps are as follows:
First step: define at the outermost layer the shared variable pool vars to be reused: the vars keyword is a group of shared variable definitions used within the document; if several Steps in the workflow need to share a group of inputs, they can reference it through the ref keyword. The steps keyword holds the Step definitions used by the workflow and their dependency topology; inside it, key-value pairs of step name and step definition are declared;
Second step: define the Steps used and their topological relations:
Under the steps keyword there are two step declarations, named step1 and step2:
In the declaration of step1, run is the url of the step's specific definition, expressed by an external link following the convention of beginning with ^, meaning that the 1.0.0 version of the definition named demo is introduced; the jobArgs keyword maps to the jobArgs defined in the Step and is assigned a default value here; the in keyword declares the input parameters: a parameter named arg1 is declared whose value references the value of share_arg1 in the shared variables, and the names inside in must be consistent with the names of the input items inside inputs in the Step definition; the out keyword lists the output parameters enabled in the workflow, whose names must be consistent with the names of the output items inside outputs in the Step definition.
The declaration of step2 shows an automatically concurrent step declared with the scatter-gather primitive. jobArgs can be omitted when no default value is assigned; the scatter keyword declares this to be a concurrent step.
The scatter keyword distributes each element of the received input list, through the zip mapping, to as many subtasks as the list has elements. Under the scatter definition: the maxJob/minJob keywords bound the number of concurrent tasks; zip is the concurrent batch parameter mapping of the task, containing several TypeAndValueDefine substructures: because the task is defined to face a single input, a parameter mapping must be defined to show how the received parameters map onto the subtask input items to be made concurrent. This example declares an array type named scatter_in_arg1 that accepts the output of the step1 task named output_list1. The jobIn keyword is the original input of the Step, containing several TypeAndValueDefine substructures whose names must be consistent with the names of the Step definition's input items; in_arg1 declares that its value comes from scatter_in_arg1 in the zip mapping, meaning that at run time each element of the list received by scatter_in_arg1 is distributed to the in_arg1 item of one subtask.
The gather keyword aggregates the output results of the subtasks into an output list through the unzip mapping. Under the gather definition: failTol is the failure tolerance rate of the sub jobs, a decimal in the range 0-1; if the proportion of failed tasks exceeds it, the step is considered failed and retries are abandoned; retryLimit is the maximum number of failure retries allowed: if some subtasks fail and the failure proportion is below the tolerance rate, they are retried no more than retryLimit times; jobOut lists the enabled output items of the original Step definition, whose names must be consistent with the output items in the Step definition; unzip is the mapping for parameter aggregation: in this example unzip declares that a definition named gather_outputs aggregates the output items of all subtasks.
The outermost outputs keyword is the final output of the workflow; in this example it defines an output named out_wf_arg1 whose value comes from step2's aggregated result gather_outputs.
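To make the semantics concrete, here is a much-simplified Python sketch of scatter-gather (serial execution stands in for the real concurrent scheduler, and all names are assumptions): each element of the zip-mapped list becomes one subtask input, and gather aggregates the outputs while honoring failTol and retryLimit.

def run_scatter_gather(items, job, fail_tol=0.1, retry_limit=1):
    outputs, failed = [], 0
    for item in items:  # one subtask per element, as distributed by zip/$item
        for attempt in range(retry_limit + 1):
            try:
                outputs.append(job(item))
                break
            except Exception:
                if attempt == retry_limit:  # retries exhausted
                    failed += 1
    if failed / len(items) > fail_tol:  # failure ratio beyond the tolerance
        raise RuntimeError("step failed: too many subtasks failed")
    return outputs  # the unzip mapping aggregates these into gather_outputs

print(run_scatter_gather([1, 2, 3], lambda x: x * 2))  # [2, 4, 6]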
Example 6
A Template definition is used to specify a set of preset values applied to a workflow as a data template. An example of a Template definition is as follows:
workflow: ^workflow/some_workflow/version/1.0.0
values:
  vars/$share_arg1: {value: 233}
  $step2/$in_arg2: {const: 1.0}
The specific steps are as follows:
First step: define the workflow keyword giving the target Workflow to which the Template applies: workflow is the url of the workflow definition the Template is to be applied to;
Second step: define the values pre-filled for this Workflow: values is used to fix some of the values to be filled in, and only the two data forms value/const are supported. In the example above, the definition named share_arg1 in the shared variables vars is filled with the variable value 233, and the definition named in_arg2 in the step2 input is filled with the immutable value 1.0.
FIG. 2 illustrates the data template overlay and the parsing behavior table of the language. By property, data are classified into value (variable value) and const (immutable constant); by source, into typedef, step, workflow, template and inline. When a Workflow is executed, the final data must be resolved from the data sources of the multi-layer definitions, and when one definition has several data sources, three behaviors can occur: ignore, overwrite and conflict. The resolution of the data follows these principles: const data cannot be overwritten; the priority order is inline > template > workflow > step > typeDef; when two values meet, the higher priority overwrites the lower (an inline value may overwrite an inline value); when two consts meet, there is a conflict; when a higher-level value meets a lower-level const, there is a conflict.
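The following Python sketch encodes these principles (an illustration of the table with assumed data shapes, not the patent's code):

LEVELS = ["typeDef", "step", "workflow", "template", "inline"]

def resolve(sources):
    # sources: (level, kind, data) triples, kind being "value" or "const"
    chosen = None
    for level, kind, data in sorted(sources, key=lambda s: LEVELS.index(s[0])):
        if chosen is None:
            chosen = (level, kind, data)
        elif chosen[1] == "const":
            # two consts meet, or a higher-level value meets a lower-level const
            raise ValueError("conflict: const data cannot be overwritten")
        else:
            chosen = (level, kind, data)  # values are overwritten by priority
    return chosen

print(resolve([("step", "value", 1), ("template", "value", 2)]))  # template wins
# resolve([("step", "const", 1), ("template", "value", 2)])  would raise a conflict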
Fig. 3 illustrates a specific flow of use of the language. When the language is used for writing: computer engineers need to describe existing algorithms by the Step definition of the language. Firstly, writing a custom type definition TypeDef which can be needed according to the requirement of an existing algorithm and publishing the custom type definition to a data center, then writing a Step definition describing the existing algorithm and publishing the Step definition, and if the Step definition needs to be used, referring to the custom type Def through url. The scientific computing solution expert compiles a Workflow definition, references the required Step definitions by referencing url in the Workflow definition, and connects the output of each Step to the input of the next Step one by one in the Workflow for layout. Finally, writing a Template definition to fill in the default values for a specific usage scenario. The task executor only needs to select the corresponding Workflow and the template and transfer the Workflow and the template into the language interpreter, the language interpreter analyzes the Workflow layer by layer from top to bottom and obtains corresponding data from the data center through the url for analysis, and finally the analyzed complete data is transmitted to the task execution tool for task submission.
Example 7
The specific process of publishing a Step definition is as follows:
The data center:
The data center is a service with a simple C/S architecture: the indexes are managed through the Server-side database, and a file system manages the specific data contents; the Client performs simple parsing, uploading and downloading. Fig. 4 shows the workflow of data center uploading and downloading:
Upload workflow:
The user submits a description language file as a request to the client. The client reads the file content, obtains the specific type, name and version parameters by parsing the class, name and version fields, and requests the server, carrying the file content. The server indexes the database with the corresponding parameters; if a file with the same type, name and version exists, a parameter check failure is returned; if it does not exist, a new file address is generated and the detailed information is added to the database. The server then accesses the file system to store the file at the new file address, and returns the result to the client.
Download workflow:
The user accesses the server carrying the type, name and version parameters; the server indexes the database with the corresponding parameters, returning a NotFound error if the entry does not exist; if it exists, the server obtains the specific file address, accesses the file system through that address to obtain the file content, and returns the result to the client.
This scheme has the advantage that files are stored in a file system instead of directly in the database: the original granularity of the data is preserved and the integrity of the description language files is guaranteed; keeping the larger files in the file system improves database performance; and when files are requested in batches, the faster index addresses and multithreading can accelerate file reading.
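A minimal client-side sketch of this exchange is shown below; the endpoint path, server address and field names are assumptions (the patent fixes only the class/name/version index and the file-system storage), and the PyYAML and requests libraries are assumed available:

import yaml      # assumed available for parsing the description files
import requests  # assumed available for the HTTP exchange

SERVER = "http://datacenter.example/api"  # hypothetical server address

def upload(path):
    content = open(path).read()
    doc = yaml.safe_load(content)  # obtain the class, name and version fields
    params = {"class": doc["class"], "name": doc["name"], "version": doc["version"]}
    resp = requests.post(SERVER + "/files", params=params, data=content)
    resp.raise_for_status()  # a duplicate index yields a parameter check failure

def download(doc_class, name, version):
    resp = requests.get(SERVER + "/files",
                        params={"class": doc_class, "name": name, "version": version})
    if resp.status_code == 404:
        raise FileNotFoundError("NotFound: no such definition")
    resp.raise_for_status()
    return resp.text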
The parser:
The parser is an independent, offline data analysis tool. It mainly parses a complete definition through the steps of recursive analysis, syntax tree analysis, object loading, link application and layer-by-layer application of values. Fig. 5 shows the main parsing flow:
Before the parser starts parsing the content, the input document and all documents it depends on are first pulled from the data center to the local machine. The parser recursively traverses each value of the first input document; if a value is an external link beginning with ^, the linked file is downloaded through the data center Client, and the step is repeated for the new file until all dependent links are ready.
Because each layer of the description language has priority and overlay relations, the overlays must be constructed and applied layer by layer from the bottommost layer in order to realize the layer-by-layer overlay data logic. The first layer parsed is the type definition files whose Class is TypeDef: all TypeDef objects are constructed from the file contents and stored in memory as a K:V mapping.
The second layer constructs Step objects, parsing all files whose class is Step. A Step object is built from the file content; its inputs/outputs attributes contain several TypeDef objects, and if the Step uses a variable of a custom type, the loaded object is taken from the TypeDef K:V mapping to replace the object inside the Step, and the value overlay operation is performed.
The third layer constructs the Workflow object, parsing the file whose class is Workflow. The Workflow object is built from the file content; its steps attribute contains all Steps involved in the Workflow, stored as a StepName:StepObject mapping. The Workflow takes all Step objects it depends on from the Step mapping, stores them in its steps attribute, and performs the overlay operation between the values in the Workflow definition and the values in the Step objects according to the file contents.
Finally the Template file is parsed: the specific variables and values in the Template are traversed and indexed to an input/output value of a Step in the Workflow object, where the overlay operation is performed.
After parsing, a tree in object form is obtained: the Workflow object is the root node, it contains all Step objects through its steps attribute, and the Step objects contain all TypeDef objects through their inputs/outputs attributes.
Besides building this hierarchy-explicit object tree and the hierarchical assignments, the second important algorithm of the parser is the topological sorting of the Workflow. The user defines the dependencies between different Steps, and the topological sorting algorithm solves for the most efficient execution plan of the Steps. Fig. 6 shows the main ideas of the topological sorting algorithm:
A FollowMap is derived from the ref links marked in the inputs of each Step; the FollowMap is a mapping of <stepName: list of Steps that depend on this Step>.
After the FollowMap is obtained, the mapping is inverted to obtain the LeaderMap, i.e. the mapping of <stepName: list of Steps this Step depends on>.
The concept of Distance is introduced, abbreviated Dis in the flow chart, meaning the dependence distance from being runnable; it defaults to 1 (the Step can be run directly).
All Steps are traversed. If a Step has not been checked, its leader Steps are traversed; if a Step has no leader, it has no dependency, and it is marked checked with its Dis set to 1. If the LeaderMap shows it depends on other Steps, the Dis values of the leader Steps are added to the Dis of this Step, and so on.
The core of this recursive idea is the topological sorting algorithm of mathematical graphs; FollowMap and LeaderMap are representations of the adjacency matrix. The starting points are determined through the LeaderMap; a starting Step's Dis is set to 1, and the Dis of an intermediate Step is the sum of the Dis values of the Steps on the path from the starting point to that Step. By sorting on Dis, the most efficient running order is obtained. When a Step finishes executing, the running order of the current state can be updated simply by recursively subtracting from the Dis of the following nodes according to the FollowMap.
In light of the foregoing description of the preferred embodiments of the present application, it is to be understood that various changes and modifications may be made without departing from the spirit and scope of the invention. The technical scope of the present application is not limited to the contents of the specification and must be determined according to the scope of the claims.

Claims (8)

1. A universal description language data system for directed acyclic graph automatic task flow, comprising:
a Step definition layer, a Workflow definition layer and a Template definition layer;
wherein the Step layer describes a single task: for the input and output declarations of a Docker image or other executor, it specifically declares the name, type, documentation, parameters and other information of each input and output item;
the Workflow layer is a workflow composed of one or more Steps; the dependency topology of the Steps must be defined, and shared parameters may be defined;
the Template presets parameters and describes, checks or defines data sources for supplementary parameters on the basis of a Workflow definition.
2. The universal description language data system for directed acyclic graph automatic task flow according to claim 1, further comprising a TypeDef layer, wherein if a user needs to use a special custom type, a TypeDef layer must be written; this part mainly abstracts the definitions of general or complex compound types for convenient reference and management.
3. A parsing method using the universal description language data system for directed acyclic graph automatic task flow according to claim 1, comprising the following steps:
recursive analysis: the input document and all documents it depends on are pulled from the data center to the local machine; the parser recursively traverses each value of the first input document, and if a value is an external link beginning with ^, the linked file is downloaded through the data center Client, the step being repeated for each new file until all dependent links are ready;
syntax tree analysis: because each layer of the description language has priority and overlay relations, the overlays must be constructed and applied layer by layer from the bottommost layer in order to realize the layer-by-layer overlay data logic;
the Template file is parsed: the specific variables and values in the Template are traversed and indexed to an input/output value of a Step in the Workflow object, where the overlay operation is performed;
object loading: after parsing, a tree in object form is obtained, in which the Workflow object is the root node, the Workflow object contains all Step objects through its steps attribute, and the Step objects contain all TypeDef objects through their inputs/outputs attributes.
4. The parsing method according to claim 3, wherein the parsing method comprises:
the first layer parses the type definition files whose Class is TypeDef, constructs all TypeDef objects from the file contents, and stores them in memory as a K:V mapping;
the second layer constructs Step objects, parsing all files whose class is Step; a Step object is built from the file content, its inputs/outputs attributes containing several TypeDef objects, and if the Step uses a variable of a custom type, the loaded object is taken from the TypeDef K:V mapping to replace the object inside the Step, and the value overlay operation is performed;
the third layer constructs the Workflow object, parsing the file whose class is Workflow; the Workflow object is built from the file content, its steps attribute containing all Steps involved in the Workflow, stored as a StepName:StepObject mapping; the Workflow takes all Step objects it depends on from the Step mapping, stores them in its steps attribute, and performs the overlay operation between the values in the Workflow definition and the values in the Step objects according to the file contents.
5. The parsing method according to claim 3, wherein, besides building the hierarchy-explicit object tree and the hierarchical assignments, the second important algorithm of the parser is the topological sorting of the Workflow; the topological sorting algorithm comprises the following steps:
step A: a FollowMap is derived from the ref links marked in the inputs of each Step, the FollowMap being a mapping of <stepName: list of Steps that depend on this Step>;
step B: after the FollowMap is obtained, the mapping is inverted to obtain the LeaderMap, i.e. the mapping of <stepName: list of Steps this Step depends on>;
step C: the concept of Distance is introduced, abbreviated Dis in the flow chart, meaning the dependence distance from being runnable, which defaults to 1;
step D: all Steps are traversed; if a Step has not been checked, its leader Steps are traversed, and if a Step has no leader, it has no dependency and is marked checked with its Dis set to 1; if the LeaderMap shows it depends on other Steps, the Dis values of the leader Steps are added to the Dis of this Step.
6. The parsing method according to claim 3, wherein, in said recursive analysis, the type input by the user is obtained and compared with the type declared by the type keyword; when the input type is inconsistent, an attempt is made to forcibly convert the data input by the user to the declared type, and a type error is thrown if the conversion fails; the specific steps are:
(1) if the declaration is of str string type and the user inputs the data 123, whose type is int integer:
a. check the type: int and str are inconsistent;
b. attempt a forced conversion: the integer 123 can be converted to the string "123";
c. the input check passes, with a warning that a type conversion occurred;
(2) if the declaration is of int integer type and the user inputs the data 123, whose type is int integer:
a. check the type: int and int are consistent;
b. the check passes;
(3) if the declaration is of int integer type and the user inputs the data abc, whose type is str string:
a. check the type: int and str are inconsistent;
b. attempt a forced conversion: the string abc cannot be converted to an integer;
c. the check fails, and a type-check-failure error is thrown.
7. The parsing method according to claim 3, wherein the process of publishing the Step definition comprises the following:
the data center: the data center is a service with a simple C/S architecture, in which the indexes are managed through the Server-side database and a file system manages the specific data contents, while the Client performs simple parsing, uploading and downloading;
upload workflow: the user submits a description language file as a request to the client; the client reads the file content, obtains the specific type, name and version parameters by parsing the class, name and version fields, and requests the server, carrying the file content;
download workflow: the user accesses the server carrying the type, name and version parameters; the server indexes the database with the corresponding parameters, returning a NotFound error if the entry does not exist; if it exists, the server obtains the specific file address, accesses the file system through that address to obtain the file content, and returns the result to the client.
8. The parsing method according to claim 7, wherein, in the upload workflow, the server indexes the database with the corresponding parameters and returns a parameter check failure if a file with the same type, name and version exists; if it does not exist, a new file address is generated and the detailed information is added to the database; the server then accesses the file system to store the file at the new file address, and returns the result to the client.

Priority Applications (1)

Application Number: CN202011091614.7A; Priority/Filing Date: 2020-10-13; Title: General description language data system for automatic task flow of directed acyclic graph

Publications (2)

Publication Number: CN112162737A, Publication Date: 2021-01-01
Publication Number: CN112162737B (granted), Publication Date: 2024-06-28




Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant