CN112162737A - Universal description language data system of directed acyclic graph automatic task flow


Info

Publication number: CN112162737A
Authority: CN (China)
Prior art keywords: workflow, type, definition, layer, file
Legal status: Granted (Active)
Application number: CN202011091614.7A
Other languages: Chinese (zh)
Other versions: CN112162737B
Inventors: 姜子麒, 温书豪, 谈樑, 刘阳, 马健, 范陕姗, 赖力鹏
Current and original assignee: Shenzhen Jingtai Technology Co Ltd
Priority/filing date: 2020-10-13
Publication dates: 2021-01-01 (CN112162737A), 2024-06-28 (grant, CN112162737B)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/30 Creation or generation of source code
    • G06F8/31 Programming languages or programming paradigms
    • G06F8/315 Object-oriented languages


Abstract

The invention provides a universal description language data system for directed acyclic graph automatic task flows, comprising: a Step definition layer, a Workflow definition layer and a Template definition layer. The Step layer describes a single task: for the input and output declarations of a Docker image or other executor, it specifically declares the name, type, documentation, parameters and other information of each input and output item. The Workflow layer is a workflow composed of one or more Steps; the dependency topology of the Steps must be defined, and shared parameters may be defined. The Template presets parameters and describes, checks or defines data sources for supplementary parameters on the basis of a Workflow definition. The language is used together with an interpreter, a data center and a task execution tool, and the corresponding tools must be implemented in a programming language. The data center must be able to store each definition document and index to the corresponding document through reference links, and the interpreter must read all definition contents and assign the corresponding data to the definition structures according to the reference links.

Description

Universal description language data system of directed acyclic graph automatic task flow
Technical Field
The invention belongs to the field of cloud computing, and particularly relates to a universal description language data system for directed acyclic graph automatic task flows.
Background
The trend toward professional specialization in scientific computing is now obvious: developing algorithms for narrowly defined problems and building engineering applications guided by practical value have split into two directions of development. Combining the application of subdivided professional methods to accomplish an objective has become an indispensable requirement, and the ever-smaller granularity of methods raises both the cost of learning a large number of methods and the labor cost of combining them, so automated workflow technology is widely applied in many fields. In scientific computing: the SBP (Seven Bridges Platform) developed by the US company SBG (Seven Bridges Genomics) is regarded as the company's core technology; it integrates algorithms, data hosting and computing resources, and builds and executes flexible genome data analysis pipelines through a visual interface. CWL (Common Workflow Language) is aimed mainly at data-intensive scientific computing, describing task flows in a text format and chaining the execution of different tasks with command-line tools; mainstream cloud computing resource providers such as AWS, Google and AliCloud, and computing standards such as HPC and Spark, provide support for CWL.
A task flow description language has two core functions: defining specific tasks and defining the flow between tasks. Defining a specific task describes its inputs, outputs and execution mode; defining the inter-task flow specifies the execution order of tasks and the paths along which data flows. Together these provide the core information that a task flow engine parses, so that automated execution of the task flow can be completed.
A closed task flow platform lacks openness and usability: it can orchestrate only the algorithm tasks provided internally, can hardly meet rapidly developing computing requirements, and cannot access flexible computing resources. The present invention therefore focuses mainly on a universal task flow description language.
Existing universal description languages mainly have the following defects:
1. Coarse task description granularity:
Existing universal description languages describe a single task at very coarse granularity. The user must define the input parameter group and how it is obtained, but the language does not cover the type and detailed structure of the data, so the user needs deep knowledge of the task's specific characteristics, and no data check or verification can be provided when data of the wrong type or structure is supplied.
2. A high threshold of computer domain knowledge, which makes orchestration inconvenient for non-computer professionals:
Existing universal description languages are tightly coupled to computer programming technology and expose a large number of low-level details and computer-specific terms, so a user can write them only with a certain amount of computer domain knowledge. The requirement that algorithm writers (computer engineers) and algorithm users (scientists) use the algorithms at the same time cannot be met.
3. Numerous and complicated information, and no data reuse:
When an existing universal description language is applied to high-performance scientific computing, its generality forces a large number of parameters arising from the characteristics of the scientific computing field to be defined and entered repeatedly, and there is no data template or data overlay filling capability.
4. Lack of automatic parallel primitives:
In scientific computing, parallelism is an indispensable capability because of the enormous computing requirements. Existing universal description languages lack parallel description primitives such as map-reduce or scatter-gather and cannot provide automatic parallelization for single-point tasks.
Disclosure of Invention
Based on this, there is a need for a universal description language data system for directed acyclic graph automatic task flows, comprising:
a Step definition layer, a Workflow definition layer and a Template definition layer;
the Step layer describes a single task: for the input and output declarations of a Docker image or other executor, it specifically declares the name, type, documentation, parameters and other information of each input and output item;
the Workflow layer is a workflow composed of one or more Steps; the dependency topology of the Steps must be defined, and shared parameters may be defined;
the Template presets parameters and describes, checks or defines data sources for supplementary parameters on the basis of a Workflow definition.
In this technical scheme, custom types are defined through TypeDef. TypeDef supports keywords such as type, required, symbols, value, const, ref and embedded_value, and the details these keywords add to the data raise the granularity of the data description. The Step definition describing a single task references different TypeDefs in its input and output declarations to declare the Step's inputs and outputs, achieving a high-precision, complete task description. This solves the problem that existing universal description languages describe a single task at very coarse granularity.
The concrete steps by which a Template is built on a Workflow are as follows. The values in the Template are resolved; they contain entries such as $step2/$in_arg2: {const: 1.0}. The first field consists of the step name and the entry name separated by a slash; these two parameters locate the specific item in the Workflow. In the second field, the key before the colon indicates the nature of the data, which in the example is constant, and the part after it is the specific data. When the Template is applied, the property and data expressed by the second field overwrite the item located by the step name and entry name of the first field.
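As an illustration, the following minimal Python sketch (not part of the patent; the function name and dictionary shapes are assumptions) shows how such a Template entry could be located and applied to a parsed Workflow structure:

def apply_template_value(workflow, locator, payload):
    # locator, e.g. "$step2/$in_arg2": step name and entry name, slash-separated
    # payload, e.g. {"const": 1.0}: nature of the data plus the specific data
    step_name, entry_name = locator.split("/", 1)
    (nature, data), = payload.items()
    entry = workflow["steps"][step_name]["in"][entry_name]
    entry[nature] = data  # the located item is overwritten

workflow = {"steps": {"$step2": {"in": {"$in_arg2": {"type": "double"}}}}}
apply_template_value(workflow, "$step2/$in_arg2", {"const": 1.0})
assert workflow["steps"]["$step2"]["in"]["$in_arg2"]["const"] == 1.0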
References between the levels are implemented by urls. For example, (type: ^typedef/common/version/1.0.0) introduces, as the type of the variable, the TypeDef definition named common with version 1.0.0; the original text of the definition is obtained by requesting the data center with these parameters.
Preferably, the system further comprises a TypeDef layer. If a user needs to use a special custom type, a TypeDef layer must be written; this part mainly abstracts the definitions of general or complex compound types for convenient reference and management.
In a further technical scheme, the keywords in TypeDef such as required, type, value, const, serializer and symbols support, besides type declarations, detailed declarations such as default values, constants and enumerated values. By matching input data against these declarations, finer data checks and verification can be achieved.
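For illustration only, a Python sketch of such keyword-driven checking might look as follows (the helper name and the mapping of language types to Python types are assumptions, not the patent's implementation):

def check_value(define, data):
    # define is a TypeAndValueDefine-like dict with type/required/symbols/const keys
    type_map = {"int": int, "double": float, "boolean": bool, "string": str}
    if data is None:
        if define.get("required", True):  # required defaults to true
            raise ValueError("required value is missing")
        return data
    expected = type_map[define["type"]]
    if not isinstance(data, expected):
        data = expected(data)  # attempt forced conversion; raises on failure
    if "symbols" in define and data not in define["symbols"]:
        raise ValueError("value not in enumeration %r" % (define["symbols"],))
    if "const" in define and data != define["const"]:
        raise ValueError("conflict: a const value cannot be overwritten")
    return data

print(check_value({"type": "string"}, 123))                   # converted to "123"
print(check_value({"type": "int", "symbols": [1, 2, 3]}, 2))  # passes: 2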
The invention also provides a scheme for referencing data between the four layers of definitions, as follows:
1. A Step only references TypeDef, by filling in the reference url when defining the type of a datum, as in:
inputs:
  $in_arg1:
    type: ^typedef/common/jobArgs/version/1.0.0
which indicates that the Step references the type definition named jobArgs inside the TypeDef definition named common;
2. A Workflow only references Steps, by filling in the url when declaring the run field of each Step used, as in:
steps:
  $step1:
    run: ^step/demo/version/1.0.0
which indicates that the Workflow references the Step definition named demo with version 1.0.0;
3. A Template only references a Workflow, implemented by filling in the url in the workflow field declared in the metadata, as in:
workflow: ^workflow/some_workflow/version/1.0.0
which indicates that the Template applies to the Workflow definition named some_workflow.
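A small Python sketch of splitting these reference urls into data-center query parameters follows; the field layout is inferred from the examples above, and the function name is an assumption:

def parse_ref(url):
    if not url.startswith("^"):
        raise ValueError("not an external link: %r" % url)
    parts = url[1:].split("/")
    doc_class, name = parts[0], parts[1]
    # a five-field url carries a sub-definition name, e.g. jobArgs inside a TypeDef
    sub_name = parts[2] if len(parts) == 5 else None
    version = parts[-1]  # the segment after ".../version/"
    return doc_class, name, sub_name, version

assert parse_ref("^step/demo/version/1.0.0") == ("step", "demo", None, "1.0.0")
assert parse_ref("^typedef/common/jobArgs/version/1.0.0") == \
    ("typedef", "common", "jobArgs", "1.0.0")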
Correspondingly, the invention also provides a parsing method using the system, comprising the following steps:
Recursive analysis: the input document and all documents it depends on are pulled from the data center to the local machine. The parser recursively traverses each value of the first input document; if a value is an external link beginning with ^, the linked file is downloaded through the data center Client, and the step is repeated for the new file until all dependent links are ready.
Syntax tree analysis: because each layer of the description language has priority and overlay relations, the overlays must be constructed and applied layer by layer from the bottommost layer in order to realize the layer-by-layer overlay data logic;
the Template file is then parsed: the specific variables and values in the Template are traversed and indexed to an input/output value of a Step in the Workflow object, where the overlay operation is performed.
After parsing, a tree in object form is obtained: the Workflow object is the root node, it contains all Step objects through its steps attribute, and the Step objects contain all TypeDef objects through their inputs/outputs attributes;
besides building this hierarchy-explicit object tree and the hierarchical assignments, the second important algorithm of the parser is the topological sorting of the Workflow. The user defines the dependencies between different Steps, and the topological sorting algorithm solves for the most efficient execution plan of the Steps.
Preferably, the parsing method includes:
The first layer parses the type definition files whose Class is TypeDef, constructs all TypeDef objects from the file contents, and stores them in memory as a K:V mapping.
The second layer constructs Step objects, parsing all files whose class is Step. A Step object is built from the file content; its inputs/outputs attributes contain several TypeDef objects, and if the Step uses a variable of a custom type, the loaded object is taken from the TypeDef K:V mapping to replace the object inside the Step, and the value overlay operation is performed.
The third layer constructs the Workflow object, parsing the file whose class is Workflow. The Workflow object is built from the file content; its steps attribute contains all Steps involved in the Workflow, stored as a StepName:StepObject mapping. The Workflow takes all Step objects it depends on from the Step mapping, stores them in its steps attribute, and performs the overlay operation between the values in the Workflow definition and the values in the Step objects according to the file contents.
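The following condensed Python sketch illustrates these three layers on miniature documents; the overlay helper and the in-memory document shapes are assumptions made for the example:

def overlay(base, upper):
    # upper-layer values win; base supplies whatever upper leaves unset
    merged = dict(base)
    merged.update(upper)
    return merged

# first layer: all TypeDef objects are collected into a K:V mapping
typedef_docs = [{"class": "TypeDef", "name": "common",
                 "typeDefs": {"jobArgs": {"type": "record"}}}]
typedefs = {(d["name"], t): body
            for d in typedef_docs for t, body in d["typeDefs"].items()}

# second layer: Step objects; custom-typed variables are replaced from the mapping
step_doc = {"class": "Step", "name": "demo",
            "inputs": {"$in_arg1": {"type": "^typedef/common/jobArgs/version/1.0.0"}}}
for arg in step_doc["inputs"].values():
    if arg["type"].startswith("^typedef/"):
        _, td_name, sub, _, _ = arg["type"][1:].split("/")
        arg.update(overlay(typedefs[(td_name, sub)], arg))  # value overlay
steps = {step_doc["name"]: step_doc}

# third layer: the Workflow object stores its Steps as a StepName:StepObject mapping
workflow_doc = {"class": "Workflow",
                "steps": {"$step1": {"run": "^step/demo/version/1.0.0"}}}
workflow_doc["steps"] = {n: overlay(steps[d["run"][1:].split("/")[1]], d)
                         for n, d in workflow_doc["steps"].items()}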
Preferably, the topological sorting algorithm comprises the following steps:
Step A: a FollowMap is derived from the ref links marked in the inputs of each Step; the FollowMap is a mapping of <stepName: list of Steps that depend on this Step>;
Step B: after the FollowMap is obtained, the mapping is inverted to obtain the LeaderMap, i.e. the mapping of <stepName: list of Steps this Step depends on>;
Step C: the concept of Distance is introduced, abbreviated Dis in the flow chart, meaning the dependence distance from being runnable; it defaults to 1;
Step D: all Steps are traversed. If a Step has not been checked, its leader Steps are traversed; if a Step has no leader, it has no dependency, and it is marked checked with its Dis set to 1. If the LeaderMap shows it depends on other Steps, the Dis values of the leader Steps are added to the Dis of this Step, and so on.
The core of this recursive idea is to apply the topological sorting algorithm of mathematical graphs by declaring the topological dependencies connecting Steps in the Steps' inputs; FollowMap and LeaderMap are representations of the adjacency matrix. The starting points are determined through the LeaderMap; a starting Step's Dis is set to 1, and the Dis of an intermediate Step is the sum of the Dis values of the Steps on the path from the starting point to that Step. By sorting on Dis, the most efficient running order is obtained. When a Step finishes executing, the running order of the current state can be updated simply by recursively subtracting from the Dis of the following nodes according to the FollowMap.
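A runnable Python sketch of this ordering is given below, under the Dis convention as read from the text above (the function names are assumptions): invert builds the LeaderMap from the FollowMap, and resolve accumulates Dis recursively.

def invert(follow_map):
    # FollowMap: step -> steps that depend on it; inverted, it yields the LeaderMap
    leader_map = {}
    for step, followers in follow_map.items():
        leader_map.setdefault(step, [])
        for f in followers:
            leader_map.setdefault(f, []).append(step)
    return leader_map

def topo_order(leader_map):
    dis = {}

    def resolve(step):
        if step not in dis:  # not yet checked
            leaders = leader_map.get(step, [])
            # no leader: no dependency, Dis stays at the default 1;
            # otherwise the leaders' Dis values are added on top
            dis[step] = 1 + sum(resolve(l) for l in leaders)
        return dis[step]

    for step in leader_map:
        resolve(step)
    return sorted(leader_map, key=lambda s: dis[s]), dis

follow_map = {"step1": ["step2", "step3"], "step2": ["step3"], "step3": []}
order, dis = topo_order(invert(follow_map))
print(order, dis)  # ['step1', 'step2', 'step3'] {'step1': 1, 'step2': 2, 'step3': 4}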
A further technical scheme of the invention thereby solves the problem of the large number of parameters that must be defined because of the characteristics of the scientific computing field, and supplies the data template and data overlay filling capability.
The invention brings the following beneficial effects:
1. Detailed input-output description: the matching of inputs and outputs is described in detail by checking whether their type keys are identical.
2. Types and values are implemented by specifying, through the ref keyword, an output item as the data source of an input item (e.g. $arg1: {ref: $step1/$output_list1} in the Workflow example indicates that the input arg1 links to the output item named output_list1 of step1); custom types are supported.
3. This is implemented with the doc keyword contained in the TypeAndValueDefine substructure. The doc keyword can carry added description information and is displayed only as annotation; it is not parsed or computed, so explanatory text irrelevant to the computation can be added, and specific input checks and data conversions can be provided based on the type information.
4. Domain knowledge decoupling is realized through the separation into four layers. In a usage scenario, professionals in the computer field write the TypeDef definitions and the Step definitions; professionals in the scientific computing field write Workflow definitions by referencing the Step definitions; professionals in task operation use the Workflow definitions in combination and write Template definitions according to their accumulated submission experience. The result is domain knowledge decoupling: single tasks and workflow orchestration are completely decoupled, neither computer-related knowledge nor the details of specific algorithm tasks need to be known, and orchestration and chaining are possible as long as the input and output types defined by the tasks can be matched.
5. Automatic concurrency primitives: the scatter_gather keyword declares the specific distribution parameters; the input data list is split into several data groups according to the distribution parameters, several subtasks are created, and each data group is distributed to a subtask for parallel computation, supporting the declaration of scatter-gather automatically concurrent subtasks.
6. The parallel capability provided by automatically concurrent subtasks lets the workflows defined in this description language be applied in many scenarios such as computation acceleration, data analysis and streaming computation, rather than being limited to a single simple description that provides inputs and outputs for a computation. Domain decoupling is also strengthened: algorithm experts focus their development on solving the abstract problem without considering parallelism, which is a computer-domain concept; when the actual computing problem requires batch parallel processing, that capability is satisfied by the description language, expanding the usage modes and capabilities of tasks.
7. Data template application and data overlay: the data among the four layers of definitions have reference relations and can be overlaid according to the priority of the levels; by applying different data templates, one-click configuration or default parameter configuration can be realized.
The invention is only a set of language standards: the language provides the definition of all necessary information, and executing specific tasks requires it to be used together with an interpreter, a data center and a task execution tool, each of which must be implemented in a programming language. The data center must be able to store each definition document and index to the corresponding document through reference links, and the interpreter must read all definition contents and assign the corresponding data to the definition structures according to the reference links. The task execution tool takes the complete structured data obtained by the interpreter and schedules task submission according to that information.
Drawings
FIG. 1 shows the detailed hierarchy and reference relationships of the language in the data system of the present invention.
FIG. 2 shows the data template overlay of the language in the data system of the present invention.
FIG. 3 is a flow chart of the specific usage of the language in the data system of the present invention.
Fig. 4 shows a workflow of data center uploading and downloading.
Fig. 5 shows the main parsing flow.
Fig. 6 shows the main idea of the topology ranking algorithm.
Detailed Description
Example 1
FIG. 1 illustrates the specific hierarchy and reference relationships of the language. A Workflow is specifically described through four levels: the TypeDef definition layer, the Step definition layer, the Workflow definition layer and the Template definition layer. Examples and introductions of each definition layer follow:
The TypeDef layer is not mandatory; if a user needs to use a special custom type, a TypeDef layer must be written. This part mainly abstracts the definitions of general or complex compound types for convenient reference and management. The Step layer describes a single task: for the input and output declarations of a Docker image or other executor, it specifically declares the name, type, documentation, parameters and other information of each input and output item. The Workflow layer is a workflow composed of one or more Steps; the dependency topology of the Steps must be defined, and shared parameters may be defined. The Template layer must be based on a Workflow definition; a Template can preset parameters and describe, check or define data sources for supplementary parameters, and a Template must explicitly declare a unique Workflow. References between the levels are implemented by urls; for example, (type: ^typedef/common/version/1.0.0) introduces, as the type of the variable, the TypeDef definition named common with version 1.0.0.
Example 2
xwlVersion describes the version of the description language and distinguishes the version iterations brought by the continual addition of functions; class describes the type of this document and takes one of four values (TypeDef, Step, Workflow, Template); version describes the version of the definition; author describes the author information; doc carries the annotation explaining the document; name describes the name of the document, and authors must keep names unique when writing documents of the same type.
The description language defines a substructure named TypeAndValueDefine, which contains the type, name, value and several attributes that define a variable in detail. The following are three representative examples of TypeAndValueDefine:
$name:
  type: int[]
  const: 1
  value: 1
  ref: $xxx/$xxx
  required: true
  doc: this is a type and value define demo
  symbols: [1, 2, 3]
$oneDir:
  type: dir
  autoSyncInterval: 300  # time interval (unit: s) for automatic upload
  autoSyncIgnore: ["run_charm_[0-9]*/", "calc[0-9]*/"]
$oneObject:
  type: object
  serializer:  # the obj type requires a codec definition
    saveType: file
    fileExt: json
    encoder: ^py:json.dumps
    decoder: ^py:json.loads
The specific steps are as follows:
First step: define the name of the substructure at the outermost layer; the outermost $name is the name of the substructure and, following the language convention, begins with $;
Second step: define the general keywords for the required attributes of the described substructure: the type keyword is the type declaration, supporting the structures int, double, boolean, string, array, file, dir, dict, record and obj, and a [] suffix marks the type as a list; const and value are mutually exclusive keywords giving the value represented by the substructure definition, value being a variable value and const an immutable value; ref is a keyword mutually exclusive with value/const, identifying that the value is sourced by reference from one other TypeAndValueDefine substructure; the required keyword states whether the definition must have a value, defaulting to true; the doc keyword is a description; the symbols keyword is an enumeration value domain, used when the value domain needs to be restricted;
Third step: define the special keywords supported by the particular type of the described substructure:
In the second definition the substructure is of the folder (dir) type. An autoSyncInterval keyword can be defined as the interval of automatic synchronization; the autoSyncIgnore keyword is a list of file names ignored by default, supporting regular-expression syntax.
In the third definition the substructure is of a custom object type. The serializer keyword can be defined to give the codec definition the object needs: the saveType keyword is the storage mode and can be file/string; fileExt is the suffix of the stored file, used when saveType is file; the encoder keyword is an encoder url, and the linked encoder must be an executable method that receives an object and returns string data; the decoder keyword is the decoder url, linked to the executable method the decoder needs, which accepts string data and returns an object. The codec follows the external link criterion, using the ^py: prefix to identify it as a python method.
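As an illustration of the ^py: codec convention, the sketch below resolves a module.attribute link such as ^py:json.dumps into a callable and round-trips an object; the resolver itself is an assumption (only the prefix convention and the json codecs come from the language examples):

import importlib

def resolve_py_link(link):
    # "^py:json.dumps" -> the json.dumps callable
    assert link.startswith("^py:"), "codec links use the ^py: prefix"
    module_name, _, attr = link[len("^py:"):].rpartition(".")
    return getattr(importlib.import_module(module_name), attr)

serializer = {"saveType": "file", "fileExt": "json",
              "encoder": "^py:json.dumps", "decoder": "^py:json.loads"}
encode = resolve_py_link(serializer["encoder"])  # receives an object, returns a string
decode = resolve_py_link(serializer["decoder"])  # receives a string, returns an object

text = encode({"cores": 16, "memory": 24000})
assert decode(text) == {"cores": 16, "memory": 24000}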
Example 3
An example of a TypeDef definition named common is as follows (the general information section is not repeated):
xwlVersion: 1.0.0
class: TypeDef
doc: a structure type def
author: ziqi.jiang
name: common
version: 1.0.0
typeDefs:
  $jobArgs:
    doc: Contains some info about compute args
    type: record
    fields:
      :cores:
        type: int
      :memory:
        type: int
The specific steps are as follows:
First step: the typeDefs keyword is defined at the outermost layer and contains several TypeAndValueDefine substructures. This definition declares a record datum named jobArgs; fields is the sub-keyword declaration of the record type and contains the two attributes cores and memory;
Second step: in a type, a TypeAndValueDefine substructure that uses this TypeDef declares it through a link of the following fixed format:
$use_typedef_demo:
  type: ^typedef/common/jobArgs/version/1.0.0
Example 4
A Step definition contains the detailed description of a computational task. The following is an example of a Step definition (the general information section is omitted):
entryPoint: ^py:/home/job/run/loader.py
jobArgs:
  type: ^typedef/common/jobArgs/version/1.0.0
  value:
    cores: 16
    memory: 24000
inputs:
  $in_arg1:
    type: file
outputs:
  $out_arg1:
    type: double
First step: define the four primary keywords that describe the Step attributes: entryPoint, jobArgs, inputs, outputs. entryPoint is the execution entry of the Step; in the example it is the loader.py file located under the /home/job/run directory, executed by python. jobArgs holds the execution parameters of the Step; in the example the referenced TypeDef is used and given the default value of 16 cores and 24000 MB of memory;
Second step: define the input and output items: inputs/outputs are the input/output parameter lists of the Step, containing several TypeAndValueDefine substructures.
Example 5
A Workflow definition contains several Step declarations and the parameter dependencies between the Steps. The following is an example of a Workflow definition (the general information part has been omitted):
vars:
  $share_arg1:
    type: string
steps:
  $step1:
    run: ^step/demo/version/1.0.0
    jobArgs:
      cores: 2
      memory: 3000
    in:
      $arg1:
        ref: vars/$share_arg1
    out: [$output_list1, $output2]
  $step2:
    run: ^step/scatter_gather_demo/version/1.0.0
    scatter:
      maxJob: 100
      minJob: 0
      zip:
        $scatter_in_arg1: {ref: $step1/$output_list1}
      jobIn:
        $in_arg1: {ref: zip/$item}
        $in_arg2: ~
    gather:
      failTol: 0.1
      retryLimit: 1
      jobOut: [$output]
      unzip:
        $gather_outputs: {ref: jobOut/$output}
outputs:
  $out_wf_arg1: {ref: $step2/gather_outputs}
The specific steps are as follows:
First step: define at the outermost layer the shared variable pool vars to be reused: the vars keyword is a group of shared variable definitions used within the document; if several Steps in the workflow need to share a group of inputs, they can reference it through the ref keyword. The steps keyword holds the Step definitions used by the workflow and their dependency topology; inside it, key-value pairs of step name and step definition are declared;
Second step: define the Steps used and their topological relations:
Under the steps keyword there are two step declarations, named step1 and step2:
In the declaration of step1, run is the url of the step's specific definition, expressed by an external link following the convention of beginning with ^, meaning that the 1.0.0 version of the definition named demo is introduced; the jobArgs keyword maps to the jobArgs defined in the Step and is assigned a default value here; the in keyword declares the input parameters: a parameter named arg1 is declared whose value references the value of share_arg1 in the shared variables, and the names inside in must be consistent with the names of the input items inside inputs in the Step definition; the out keyword lists the output parameters enabled in the workflow, whose names must be consistent with the names of the output items inside outputs in the Step definition.
The declaration of step2 shows an automatically concurrent step declared with the scatter-gather primitive. jobArgs can be omitted when no default value is assigned; the scatter keyword declares this to be a concurrent step.
The scatter keyword distributes each element of the received input list, through the zip mapping, to as many subtasks as the list has elements. Under the scatter definition: the maxJob/minJob keywords bound the number of concurrent tasks; zip is the concurrent batch parameter mapping of the task, containing several TypeAndValueDefine substructures: because the task is defined to face a single input, a parameter mapping must be defined to show how the received parameters map onto the subtask input items to be made concurrent. This example declares an array type named scatter_in_arg1 that accepts the output of the step1 task named output_list1. The jobIn keyword is the original input of the Step, containing several TypeAndValueDefine substructures whose names must be consistent with the names of the Step definition's input items; in_arg1 declares that its value comes from scatter_in_arg1 in the zip mapping, meaning that at run time each element of the list received by scatter_in_arg1 is distributed to the in_arg1 item of one subtask.
The gather keyword aggregates the output results of the subtasks into an output list through the unzip mapping. Under the gather definition: failTol is the failure tolerance rate of the sub jobs, a decimal in the range 0-1; if the proportion of failed tasks exceeds it, the step is considered failed and retries are abandoned; retryLimit is the maximum number of failure retries allowed: if some subtasks fail and the failure proportion is below the tolerance rate, they are retried no more than retryLimit times; jobOut lists the enabled output items of the original Step definition, whose names must be consistent with the output items in the Step definition; unzip is the mapping for parameter aggregation: in this example unzip declares that a definition named gather_outputs aggregates the output items of all subtasks.
The outermost outputs keyword is the final output of the workflow; in this example it defines an output named out_wf_arg1 whose value comes from step2's aggregated result gather_outputs.
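To make the semantics concrete, here is a much-simplified Python sketch of scatter-gather (serial execution stands in for the real concurrent scheduler, and all names are assumptions): each element of the zip-mapped list becomes one subtask input, and gather aggregates the outputs while honoring failTol and retryLimit.

def run_scatter_gather(items, job, fail_tol=0.1, retry_limit=1):
    outputs, failed = [], 0
    for item in items:  # one subtask per element, as distributed by zip/$item
        for attempt in range(retry_limit + 1):
            try:
                outputs.append(job(item))
                break
            except Exception:
                if attempt == retry_limit:  # retries exhausted
                    failed += 1
    if failed / len(items) > fail_tol:  # failure ratio beyond the tolerance
        raise RuntimeError("step failed: too many subtasks failed")
    return outputs  # the unzip mapping aggregates these into gather_outputs

print(run_scatter_gather([1, 2, 3], lambda x: x * 2))  # [2, 4, 6]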
Example 6
A Template definition is used to specify a set of preset values applied to a workflow as a data template. An example of a Template definition is as follows:
workflow: ^workflow/some_workflow/version/1.0.0
values:
  vars/$share_arg1: {value: 233}
  $step2/$in_arg2: {const: 1.0}
The specific steps are as follows:
First step: define the workflow keyword giving the target Workflow to which the Template applies: workflow is the url of the workflow definition the Template is to be applied to;
Second step: define the values pre-filled for this Workflow: values is used to fix some of the values to be filled in, and only the two data forms value/const are supported. In the example above, the definition named share_arg1 in the shared variables vars is filled with the variable value 233, and the definition named in_arg2 in the step2 input is filled with the immutable value 1.0.
FIG. 2 illustrates the data template overlay and the parsing behavior table of the language. By property, data are classified into value (variable value) and const (immutable constant); by source, into typedef, step, workflow, template and inline. When a Workflow is executed, the final data must be resolved from the data sources of the multi-layer definitions, and when one definition has several data sources, three behaviors can occur: ignore, overwrite and conflict. The resolution of the data follows these principles: const data cannot be overwritten; the priority order is inline > template > workflow > step > typeDef; when two values meet, the higher priority overwrites the lower (an inline value may overwrite an inline value); when two consts meet, there is a conflict; when a higher-level value meets a lower-level const, there is a conflict.
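The following Python sketch encodes these principles (an illustration of the table with assumed data shapes, not the patent's code):

LEVELS = ["typeDef", "step", "workflow", "template", "inline"]

def resolve(sources):
    # sources: (level, kind, data) triples, kind being "value" or "const"
    chosen = None
    for level, kind, data in sorted(sources, key=lambda s: LEVELS.index(s[0])):
        if chosen is None:
            chosen = (level, kind, data)
        elif chosen[1] == "const":
            # two consts meet, or a higher-level value meets a lower-level const
            raise ValueError("conflict: const data cannot be overwritten")
        else:
            chosen = (level, kind, data)  # values are overwritten by priority
    return chosen

print(resolve([("step", "value", 1), ("template", "value", 2)]))  # template wins
# resolve([("step", "const", 1), ("template", "value", 2)])  would raise a conflict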
Fig. 3 illustrates a specific flow of use of the language. When the language is used for writing: computer engineers need to describe existing algorithms by the Step definition of the language. Firstly, writing a custom type definition TypeDef which can be needed according to the requirement of an existing algorithm and publishing the custom type definition to a data center, then writing a Step definition describing the existing algorithm and publishing the Step definition, and if the Step definition needs to be used, referring to the custom type Def through url. The scientific computing solution expert compiles a Workflow definition, references the required Step definitions by referencing url in the Workflow definition, and connects the output of each Step to the input of the next Step one by one in the Workflow for layout. Finally, writing a Template definition to fill in the default values for a specific usage scenario. The task executor only needs to select the corresponding Workflow and the template and transfer the Workflow and the template into the language interpreter, the language interpreter analyzes the Workflow layer by layer from top to bottom and obtains corresponding data from the data center through the url for analysis, and finally the analyzed complete data is transmitted to the task execution tool for task submission.
Example 7
The specific process of publishing a Step definition is as follows:
The data center:
The data center is a service with a simple C/S architecture: the indexes are managed through the Server-side database, and a file system manages the specific data contents; the Client performs simple parsing, uploading and downloading. Fig. 4 shows the workflow of data center uploading and downloading:
Upload workflow:
The user submits a description language file as a request to the client. The client reads the file content, obtains the specific type, name and version parameters by parsing the class, name and version fields, and requests the server, carrying the file content. The server indexes the database with the corresponding parameters; if a file with the same type, name and version exists, a parameter check failure is returned; if it does not exist, a new file address is generated and the detailed information is added to the database. The server then accesses the file system to store the file at the new file address, and returns the result to the client.
Download workflow:
The user accesses the server carrying the type, name and version parameters; the server indexes the database with the corresponding parameters, returning a NotFound error if the entry does not exist; if it exists, the server obtains the specific file address, accesses the file system through that address to obtain the file content, and returns the result to the client.
This scheme has the advantage that files are stored in a file system instead of directly in the database: the original granularity of the data is preserved and the integrity of the description language files is guaranteed; keeping the larger files in the file system improves database performance; and when files are requested in batches, the faster index addresses and multithreading can accelerate file reading.
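A minimal client-side sketch of this exchange is shown below; the endpoint path, server address and field names are assumptions (the patent fixes only the class/name/version index and the file-system storage), and the PyYAML and requests libraries are assumed available:

import yaml      # assumed available for parsing the description files
import requests  # assumed available for the HTTP exchange

SERVER = "http://datacenter.example/api"  # hypothetical server address

def upload(path):
    content = open(path).read()
    doc = yaml.safe_load(content)  # obtain the class, name and version fields
    params = {"class": doc["class"], "name": doc["name"], "version": doc["version"]}
    resp = requests.post(SERVER + "/files", params=params, data=content)
    resp.raise_for_status()  # a duplicate index yields a parameter check failure

def download(doc_class, name, version):
    resp = requests.get(SERVER + "/files",
                        params={"class": doc_class, "name": name, "version": version})
    if resp.status_code == 404:
        raise FileNotFoundError("NotFound: no such definition")
    resp.raise_for_status()
    return resp.text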
The parser:
The parser is an independent, offline data analysis tool. It mainly parses a complete definition through the steps of recursive analysis, syntax tree analysis, object loading, link application and layer-by-layer application of values. Fig. 5 shows the main parsing flow:
Before the parser starts parsing the content, the input document and all documents it depends on are first pulled from the data center to the local machine. The parser recursively traverses each value of the first input document; if a value is an external link beginning with ^, the linked file is downloaded through the data center Client, and the step is repeated for the new file until all dependent links are ready.
Because each layer of the description language has priority and overlay relations, the overlays must be constructed and applied layer by layer from the bottommost layer in order to realize the layer-by-layer overlay data logic. The first layer parsed is the type definition files whose Class is TypeDef: all TypeDef objects are constructed from the file contents and stored in memory as a K:V mapping.
The second layer constructs Step objects, parsing all files whose class is Step. A Step object is built from the file content; its inputs/outputs attributes contain several TypeDef objects, and if the Step uses a variable of a custom type, the loaded object is taken from the TypeDef K:V mapping to replace the object inside the Step, and the value overlay operation is performed.
The third layer constructs the Workflow object, parsing the file whose class is Workflow. The Workflow object is built from the file content; its steps attribute contains all Steps involved in the Workflow, stored as a StepName:StepObject mapping. The Workflow takes all Step objects it depends on from the Step mapping, stores them in its steps attribute, and performs the overlay operation between the values in the Workflow definition and the values in the Step objects according to the file contents.
Finally the Template file is parsed: the specific variables and values in the Template are traversed and indexed to an input/output value of a Step in the Workflow object, where the overlay operation is performed.
After parsing, a tree in object form is obtained: the Workflow object is the root node, it contains all Step objects through its steps attribute, and the Step objects contain all TypeDef objects through their inputs/outputs attributes.
Besides building this hierarchy-explicit object tree and the hierarchical assignments, the second important algorithm of the parser is the topological sorting of the Workflow. The user defines the dependencies between different Steps, and the topological sorting algorithm solves for the most efficient execution plan of the Steps. Fig. 6 shows the main ideas of the topological sorting algorithm:
A FollowMap is derived from the ref links marked in the inputs of each Step; the FollowMap is a mapping of <stepName: list of Steps that depend on this Step>.
After the FollowMap is obtained, the mapping is inverted to obtain the LeaderMap, i.e. the mapping of <stepName: list of Steps this Step depends on>.
The concept of Distance is introduced, abbreviated Dis in the flow chart, meaning the dependence distance from being runnable; it defaults to 1 (the Step can be run directly).
All Steps are traversed. If a Step has not been checked, its leader Steps are traversed; if a Step has no leader, it has no dependency, and it is marked checked with its Dis set to 1. If the LeaderMap shows it depends on other Steps, the Dis values of the leader Steps are added to the Dis of this Step, and so on.
The core of this recursive idea is the topological sorting algorithm of mathematical graphs; FollowMap and LeaderMap are representations of the adjacency matrix. The starting points are determined through the LeaderMap; a starting Step's Dis is set to 1, and the Dis of an intermediate Step is the sum of the Dis values of the Steps on the path from the starting point to that Step. By sorting on Dis, the most efficient running order is obtained. When a Step finishes executing, the running order of the current state can be updated simply by recursively subtracting from the Dis of the following nodes according to the FollowMap.
In light of the foregoing description of the preferred embodiments of the present application, it is to be understood that various changes and modifications may be made without departing from the spirit and scope of the invention. The technical scope of the present application is not limited to the contents of the specification and must be determined according to the scope of the claims.

Claims (8)

1. A universal description language data system for directed acyclic graph automatic task flow, comprising:
a Step definition layer, a Workflow definition layer and a Template definition layer;
wherein the Step layer describes a single task: for the input and output declarations of a Docker image or other executor, it specifically declares the name, type, documentation, parameters and other information of each input and output item;
the Workflow layer is a workflow composed of one or more Steps; the dependency topology of the Steps must be defined, and shared parameters may be defined;
the Template presets parameters and describes, checks or defines data sources for supplementary parameters on the basis of a Workflow definition.
2. The universal description language data system for directed acyclic graph automatic task flow according to claim 1, further comprising a TypeDef layer, wherein if a user needs to use a special custom type, a TypeDef layer must be written; this part mainly abstracts the definitions of general or complex compound types for convenient reference and management.
3. A parsing method using the universal description language data system for directed acyclic graph automatic task flow according to claim 1, comprising the following steps:
recursive analysis: the input document and all documents it depends on are pulled from the data center to the local machine; the parser recursively traverses each value of the first input document, and if a value is an external link beginning with ^, the linked file is downloaded through the data center Client, the step being repeated for each new file until all dependent links are ready;
syntax tree analysis: because each layer of the description language has priority and overlay relations, the overlays must be constructed and applied layer by layer from the bottommost layer in order to realize the layer-by-layer overlay data logic;
the Template file is parsed: the specific variables and values in the Template are traversed and indexed to an input/output value of a Step in the Workflow object, where the overlay operation is performed;
object loading: after parsing, a tree in object form is obtained, in which the Workflow object is the root node, the Workflow object contains all Step objects through its steps attribute, and the Step objects contain all TypeDef objects through their inputs/outputs attributes.
4. The parsing method according to claim 3, wherein the parsing method comprises:
the first layer parses the type definition files whose Class is TypeDef, constructs all TypeDef objects from the file contents, and stores them in memory as a K:V mapping;
the second layer constructs Step objects, parsing all files whose class is Step; a Step object is built from the file content, its inputs/outputs attributes containing several TypeDef objects, and if the Step uses a variable of a custom type, the loaded object is taken from the TypeDef K:V mapping to replace the object inside the Step, and the value overlay operation is performed;
the third layer constructs the Workflow object, parsing the file whose class is Workflow; the Workflow object is built from the file content, its steps attribute containing all Steps involved in the Workflow, stored as a StepName:StepObject mapping; the Workflow takes all Step objects it depends on from the Step mapping, stores them in its steps attribute, and performs the overlay operation between the values in the Workflow definition and the values in the Step objects according to the file contents.
5. The parsing method according to claim 3, wherein, besides building the hierarchy-explicit object tree and the hierarchical assignments, the second important algorithm of the parser is the topological sorting of the Workflow; the topological sorting algorithm comprises the following steps:
step A: a FollowMap is derived from the ref links marked in the inputs of each Step, the FollowMap being a mapping of <stepName: list of Steps that depend on this Step>;
step B: after the FollowMap is obtained, the mapping is inverted to obtain the LeaderMap, i.e. the mapping of <stepName: list of Steps this Step depends on>;
step C: the concept of Distance is introduced, abbreviated Dis in the flow chart, meaning the dependence distance from being runnable, which defaults to 1;
step D: all Steps are traversed; if a Step has not been checked, its leader Steps are traversed, and if a Step has no leader, it has no dependency and is marked checked with its Dis set to 1; if the LeaderMap shows it depends on other Steps, the Dis values of the leader Steps are added to the Dis of this Step.
6. The parsing method according to claim 3, wherein, in said recursive analysis, the type input by the user is obtained and compared with the type declared by the type keyword; when the input type is inconsistent, an attempt is made to forcibly convert the data input by the user to the declared type, and a type error is thrown if the conversion fails; the specific steps are:
(1) if the declaration is of str string type and the user inputs the data 123, whose type is int integer:
a. check the type: int and str are inconsistent;
b. attempt a forced conversion: the integer 123 can be converted to the string "123";
c. the input check passes, with a warning that a type conversion occurred;
(2) if the declaration is of int integer type and the user inputs the data 123, whose type is int integer:
a. check the type: int and int are consistent;
b. the check passes;
(3) if the declaration is of int integer type and the user inputs the data abc, whose type is str string:
a. check the type: int and str are inconsistent;
b. attempt a forced conversion: the string abc cannot be converted to an integer;
c. the check fails, and a type-check-failure error is thrown.
7. The parsing method according to claim 3, wherein the process of publishing the Step definition comprises the following:
the data center: the data center is a service with a simple C/S architecture, in which the indexes are managed through the Server-side database and a file system manages the specific data contents, while the Client performs simple parsing, uploading and downloading;
upload workflow: the user submits a description language file as a request to the client; the client reads the file content, obtains the specific type, name and version parameters by parsing the class, name and version fields, and requests the server, carrying the file content;
download workflow: the user accesses the server carrying the type, name and version parameters; the server indexes the database with the corresponding parameters, returning a NotFound error if the entry does not exist; if it exists, the server obtains the specific file address, accesses the file system through that address to obtain the file content, and returns the result to the client.
8. The parsing method according to claim 7, wherein, in the upload workflow, the server indexes the database with the corresponding parameters and returns a parameter check failure if a file with the same type, name and version exists; if it does not exist, a new file address is generated and the detailed information is added to the database; the server then accesses the file system to store the file at the new file address, and returns the result to the client.

Priority Applications (1)

Application Number: CN202011091614.7A; Priority/Filing Date: 2020-10-13; Title: General description language data system for automatic task flow of directed acyclic graph

Publications (2)

Publication Number: CN112162737A, Publication Date: 2021-01-01
Publication Number: CN112162737B (granted), Publication Date: 2024-06-28




Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant