CN114138446A

CN114138446A - Draggable machine learning workflow component scheduling method

Info

Publication number: CN114138446A
Application number: CN202111488423.9A
Authority: CN
Inventors: 张金磊; 许哲豪; 宋少鸿
Original assignee: Suzhou Yingtiandi Information Technology Co ltd
Current assignee: Suzhou Yingtiandi Information Technology Co ltd
Priority date: 2021-12-08
Filing date: 2021-12-08
Publication date: 2022-03-04

Abstract

The invention discloses a draggable machine learning workflow component scheduling method comprising the following steps: S1, designing a component configuration template for each component type, in which every node except the pseudo node Base corresponds one-to-one to a task required for machine learning modeling; S2, obtaining the tasks and the task sequence required for machine learning modeling in the current machine learning workflow, and passing the task sequence and the configuration template in as parameters; and S3, dynamically loading the configuration template parameters according to the task sequence and executing the machine learning workflow. The draggable machine learning workflow is designed to be modular and quickly reusable, and a machine learning workflow task can be created in any of three ways: by dragging on a Web front-end page, at the code layer, or from the command line.

Description

Draggable machine learning workflow component scheduling method

Technical Field

The invention relates to the technical field of machine learning workflows, and in particular to a draggable machine learning workflow component scheduling method.

Background

In recent years, with the rapid development of computer application technology, artificial intelligence, big data and cloud computing have become the focus of attention in the IT field. Machine learning (ML) algorithms, which give computers "intelligence", have achieved remarkable results on tasks such as object recognition and object detection and have been applied successfully in fields such as financial transactions, product recommendation and traffic prediction. When a machine learning algorithm is used to train a model, a machine learning workflow needs to be built for the actual business scenario so that steps such as raw data collection, data cleaning, missing value processing, feature extraction, sample generation and model evaluation do not take up excessive time.

Most existing machine learning algorithms are built and tested at the code layer, and a service can be deployed only after environment configuration, algorithm flow design, data interface design, program compilation, program debugging and similar procedures. The algorithm service becomes available to the upstream data source and the downstream business application only after each deployment, which creates a bottleneck for automated operation. In addition, for a junior developer in the machine learning field or a data analyst on the business side, developing and testing algorithms at the code layer is difficult, and the barrier to entry is too high.

At present, most machine learning workflow construction methods are based mainly on a scheduling method and a scheduling system, but building an easy-to-use machine learning workflow requires more than that. A new form of workflow construction is therefore proposed in light of actual business requirements. Specifically, the machine learning workflow is delivered in the standard form of a Python third-party package and supports subsequent function updates and version iteration; it supports standalone operation and can also be operated by dragging in conjunction with a Web front-end interface; and when run standalone, it interacts with the user through a friendly command line interface.

Existing machine learning workflow scheduling methods and systems have the advantage that steps such as raw data collection, data cleaning, missing value processing, feature extraction and sample generation are modularized, which makes integration convenient and saves time, but the construction approach also has disadvantages. This approach to building machine learning workflows is at odds with the practice of building and testing machine learning algorithms at the code layer, so the corresponding workflow scheduling methods are limited and cannot accommodate code-layer construction. To this end, a draggable machine learning workflow component scheduling method is proposed.

Disclosure of Invention

The present invention aims to provide a draggable machine learning workflow component scheduling method that solves the problems described in the background.

To achieve this purpose, the invention provides the following technical scheme: a draggable machine learning workflow component scheduling method comprising the following steps:

S1, designing a component configuration template for each component type, in which every node except the pseudo node Base corresponds one-to-one to a task required for machine learning modeling;

S2, obtaining the tasks and the task sequence required for machine learning modeling in the current machine learning workflow, and passing the task sequence and the configuration template in as parameters;

and S3, dynamically loading the configuration template parameters according to the task sequence and executing the machine learning workflow.
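Steps S1 to S3 can be illustrated with a minimal sketch. It assumes the configuration template is a dictionary keyed by component name and that each task is a plain Python function; the names TEMPLATE, REGISTRY and run_workflow are hypothetical and are not taken from the invention.

```python
from typing import Any, Callable, Dict, List

# S1: hypothetical configuration template, one entry per component type.
# "Base" is the pseudo node; every other key maps to one modeling task.
TEMPLATE: Dict[str, Dict[str, Any]] = {
    "Base": {"rows": [1.0, None, 3.0]},
    "Imputation": {"fill_value": 0.0},
    "FeatureScaling": {"factor": 10.0},
}


# Hypothetical task implementations: each takes (data, params) and returns data.
def impute(data: List[Any], params: Dict[str, Any]) -> List[float]:
    return [params["fill_value"] if x is None else x for x in data]


def scale(data: List[float], params: Dict[str, Any]) -> List[float]:
    return [x * params["factor"] for x in data]


REGISTRY: Dict[str, Callable[[List[Any], Dict[str, Any]], List[Any]]] = {
    "Imputation": impute,
    "FeatureScaling": scale,
}


def run_workflow(sequence: List[str], template: Dict[str, Dict[str, Any]]) -> List[Any]:
    """S2/S3: receive the task sequence and template as parameters, then
    look up each component's parameters only when its task is reached."""
    data = template["Base"]["rows"]
    for name in sequence:
        params = template[name]
        data = REGISTRY[name](data, params)
    return data


result = run_workflow(["Imputation", "FeatureScaling"], TEMPLATE)
print(result)
```

Because the sequence is an ordinary parameter, the same template can serve a shorter or reordered workflow without any change to the component code.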

Preferably, the method further comprises: S11, packaging each component using the component configuration template corresponding to its component type, and exposing the component's data input interface, data output interface and component parameter interface.
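One way to picture the packaging in S11 is a small wrapper class that is built from a template entry and exposes the three interfaces as attributes. This is a sketch under assumptions: the class name Component, the helper wrap and the template keys "inputs", "outputs" and "params" are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Component:
    """Hypothetical wrapper built from one entry of the configuration template."""
    name: str
    input_types: List[str]    # data input interface
    output_types: List[str]   # data output interface
    func: Callable[..., Any]  # the task the component performs
    params: Dict[str, Any] = field(default_factory=dict)  # component parameter interface

    def run(self, *inputs: Any) -> Any:
        if len(inputs) != len(self.input_types):
            raise ValueError(f"{self.name} expects {len(self.input_types)} input(s)")
        return self.func(*inputs, **self.params)


def wrap(name: str, entry: Dict[str, Any], func: Callable[..., Any]) -> Component:
    """Package a function as a component using its template entry."""
    return Component(
        name=name,
        input_types=list(entry["inputs"]),
        output_types=list(entry["outputs"]),
        func=func,
        params=dict(entry.get("params", {})),
    )


entry = {"inputs": ["table"], "outputs": ["table"], "params": {"fill_value": 0}}
imputation = wrap(
    "Imputation",
    entry,
    lambda rows, fill_value: [fill_value if r is None else r for r in rows],
)
out = imputation.run([1, None, 3])
print(out)
```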

Preferably, the method further comprises: S12, configuring the save path of the task run log through the pseudo node Base.
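As an illustration of S12, the log path held in the Base pseudo node can be handed to Python's standard logging module. The key name "log_path" and the function configure_logging are assumptions made for this sketch, and a temporary directory stands in for a real log location.

```python
import logging
import os
import tempfile


def configure_logging(base_config: dict) -> logging.Logger:
    """Create a task logger that writes to the path given in the Base pseudo node."""
    log_path = base_config["log_path"]
    os.makedirs(os.path.dirname(log_path), exist_ok=True)
    logger = logging.getLogger("workflow")
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(log_path, encoding="utf-8")
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


# Hypothetical Base entry; only the log path is shown here.
base = {"log_path": os.path.join(tempfile.mkdtemp(), "logs", "run.log")}
logger = configure_logging(base)
logger.info("DataLoad started")
```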

Preferably, the method further comprises: S21, after the components corresponding to the tasks required for machine learning modeling are connected according to the task sequence, checking whether each pair of adjacent components meets the connection specification, based on the specified input types, number of inputs, output types and number of outputs of the components;

if so, step S3 is performed; otherwise, the connection that does not meet the connection specification is rejected and an error is reported, and steps S11 to S21 are repeated until the component connections meet the connection specification, after which step S3 is performed.

Preferably, the error in S21 is reported either through a pop-up dialog or on the command line terminal, showing the connection error information for the components in the task flow required for machine learning modeling.
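The connection check in S21 can be sketched as a comparison of each upstream component's outputs with the next component's inputs. The SPECS table and the type labels in it are hypothetical; the length of each list stands for the input or output quantity.

```python
from typing import Dict, List

# Hypothetical specification of the input and output types of each component.
SPECS: Dict[str, Dict[str, List[str]]] = {
    "DataLoad": {"inputs": ["source"], "outputs": ["table"]},
    "Imputation": {"inputs": ["table"], "outputs": ["table"]},
    "DataSplit": {"inputs": ["table"], "outputs": ["table", "table"]},
    "Models": {"inputs": ["table", "table"], "outputs": ["model", "report"]},
}


def check_connections(sequence: List[str]) -> List[str]:
    """Return one error message per adjacent pair that violates the specification."""
    errors = []
    for upstream, downstream in zip(sequence, sequence[1:]):
        out_types = SPECS[upstream]["outputs"]
        in_types = SPECS[downstream]["inputs"]
        if len(out_types) != len(in_types):
            errors.append(
                f"{upstream} -> {downstream}: {len(out_types)} output(s) "
                f"but {len(in_types)} input(s)"
            )
        elif out_types != in_types:
            errors.append(
                f"{upstream} -> {downstream}: output types {out_types} "
                f"do not match input types {in_types}"
            )
    return errors


ok = check_connections(["DataLoad", "Imputation", "DataSplit", "Models"])
bad = check_connections(["DataLoad", "Models"])
for message in bad:
    print("connection error:", message)  # stands in for the command line prompt
```

An empty list means the workflow may proceed to S3; a non-empty list carries the messages that a pop-up dialog or terminal would display.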

Preferably, the method further comprises: S22, the user can also design a component run flow chart by dragging machine learning components and can control the direction of data flow between components; the output of one component can be used by several downstream components at the same time, but a component cannot run twice; depending on the type of data set the user loads, a component automatically selects the fitting, transformation, model training, evaluation or prediction process; and the user can continue building the flow and run tasks from any previously run component without rerunning the whole flow from the start.
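Two of the S22 properties, fan-out without rerunning and resuming from an already run component, can be sketched with a runner that caches each component's output. The class FlowRunner and the toy components are hypothetical.

```python
from typing import Any, Callable, Dict, List, Sequence


class FlowRunner:
    """Hypothetical runner over a directed acyclic graph of components."""

    def __init__(self) -> None:
        self.funcs: Dict[str, Callable[..., Any]] = {}
        self.upstream: Dict[str, List[str]] = {}
        self.results: Dict[str, Any] = {}    # outputs of components already run
        self.run_count: Dict[str, int] = {}  # how often each component executed

    def add(self, name: str, func: Callable[..., Any], upstream: Sequence[str] = ()) -> None:
        self.funcs[name] = func
        self.upstream[name] = list(upstream)

    def run(self, name: str) -> Any:
        if name in self.results:  # a component never runs twice
            return self.results[name]
        inputs = [self.run(u) for u in self.upstream[name]]
        self.results[name] = self.funcs[name](*inputs)
        self.run_count[name] = self.run_count.get(name, 0) + 1
        return self.results[name]


runner = FlowRunner()
runner.add("DataLoad", lambda: [1, 2, 3, 4])
runner.add("Mean", lambda rows: sum(rows) / len(rows), ["DataLoad"])
runner.add("Max", lambda rows: max(rows), ["DataLoad"])  # second consumer of DataLoad
mean_value, max_value = runner.run("Mean"), runner.run("Max")

# Continue building from a component that has already run; DataLoad is reused.
runner.add("Range", lambda rows: max(rows) - min(rows), ["DataLoad"])
range_value = runner.run("Range")
```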

A draggable machine learning workflow project packaging mechanism comprises:

a workflow project packaging module, used to package the whole workflow project of the current version with one click in the standard form of a Python third-party package, so that it can be installed into different Python runtime environments or virtual environments as required during deployment;

and a function iteration module, used to obtain the tasks required for machine learning modeling in the current machine learning workflow and to iteratively update the functions of the components corresponding to those tasks.
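For illustration, a project delivered as a Python third-party package is commonly described by arguments passed to setuptools.setup(). The sketch below only builds those arguments; the project name mlpipeline, the module path mlpipeline.cli:main and the version number are assumptions, not details of the invention.

```python
from typing import Any, Dict


def build_setup_kwargs(version: str) -> Dict[str, Any]:
    """Arguments for setuptools.setup(); project and module names are hypothetical."""
    return {
        "name": "mlpipeline",
        "version": version,  # raised on each function iteration
        "packages": ["mlpipeline", "mlpipeline.components"],
        "python_requires": ">=3.8",
        "entry_points": {
            # Installs an `mlpctl` executable that calls mlpipeline.cli:main
            "console_scripts": ["mlpctl = mlpipeline.cli:main"],
        },
    }


kwargs = build_setup_kwargs("0.2.0")
# In a real setup.py the final line would be:
#     from setuptools import setup; setup(**kwargs)
# after which `pip install .` installs the package into the active environment.
```

Installing the built package into a separate virtual environment per deployment keeps different workflow versions isolated from one another.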

A friendly command line interactive interface is highly configurable, automatically generates well-formatted help pages and supports sub-commands. After a machine learning workflow project is installed through the workflow project packaging mechanism of the second aspect of the disclosure, command line interaction is performed through the mlpctl instruction.

mlpctl was custom-developed while designing the machine learning workflow and is one implementation of the friendly command line interactive interface.

The tasks, task sequence and configuration template required for the current machine learning modeling are passed in as parameters, and the workflow program is executed by running an mlpctl instruction, thereby implementing the machine learning workflow scheduling method provided by the first aspect of the disclosure.

Whether the workflow task is created from the command line or by dragging on the Web front-end page, the workflow program is ultimately executed through an mlpctl instruction, which is also the key to resolving the either-or dilemma described in the background.
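A command line interface with sub-commands and generated help pages can be sketched with Python's standard argparse module. The source does not document how mlpctl is implemented or which options it accepts, so the sub-command create and the options --components and --config below are purely hypothetical.

```python
import argparse
import ast
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="mlpctl", description="Run machine learning workflows."
    )
    sub = parser.add_subparsers(dest="command", required=True)
    create = sub.add_parser("create", help="create and run a workflow task")
    # Hypothetical options; the real flags are not given in the source.
    create.add_argument("--components", required=True,
                        help="Python-style list of component names in run order")
    create.add_argument("--config", required=True,
                        help="path to the configuration template")
    return parser


def main(argv: Optional[List[str]] = None) -> dict:
    args = build_parser().parse_args(argv)
    components = ast.literal_eval(args.components)  # parse the list literal safely
    return {"command": args.command, "components": components, "config": args.config}


task = main(["create", "--components", "['DataLoad', 'Models']",
             "--config", "my_template.json"])
help_text = build_parser().format_help()  # help page generated by argparse
```

A Web front end could call the same main function with an argument list it assembles from the dragged flow chart, so both entry points share one execution path.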

Compared with the prior art, the invention has the beneficial effects that:

The draggable machine learning workflow is designed to be modular and quickly reusable, and a machine learning workflow task can be created in any of three ways: by dragging on a Web front-end page, at the code layer, or from the command line.

Drawings

FIG. 1 is a schematic diagram of the overall process of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Referring to FIG. 1, the present invention provides a technical solution: a draggable machine learning workflow component scheduling method comprising the following steps: S1, designing a component configuration template for each component type, in which every node except the pseudo node Base corresponds one-to-one to a task required for machine learning modeling; S2, obtaining the tasks and the task sequence required for machine learning modeling in the current machine learning workflow, and passing the task sequence and the configuration template in as parameters; and S3, dynamically loading the configuration template parameters according to the task sequence and executing the machine learning workflow.

The method further comprises: S11, packaging each component using the component configuration template corresponding to its component type, and exposing the component's data input interface, data output interface and component parameter interface.

The method further comprises: S12, configuring the save path of the task run log through the pseudo node Base.

The method further comprises: S21, after the components corresponding to the tasks required for machine learning modeling are connected according to the task sequence, checking whether each pair of adjacent components meets the connection specification, based on the specified input types, number of inputs, output types and number of outputs of the components.

The error in S21 is reported either through a pop-up dialog or on the command line terminal, showing the connection error information for the components in the task flow required for machine learning modeling.

The method further comprises: S22, the user can also design a component run flow chart by dragging machine learning components and can control the direction of data flow between components; the output of one component can be used by several downstream components at the same time, but a component cannot run twice; depending on the type of data set the user loads, a component automatically selects the fitting, transformation, model training, evaluation or prediction process; and the user can continue building the flow and run tasks from any previously run component without rerunning the whole flow from the start.
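The automatic choice of process by data set type can be illustrated with a scaling component that fits on a training set and only transforms other sets. The class MinMaxComponent and the type labels "train", "test" and "predict" are assumptions made for this sketch.

```python
from typing import List, Optional


class MinMaxComponent:
    """Hypothetical component that picks its process from the data set type."""

    def __init__(self) -> None:
        self.low: Optional[float] = None
        self.high: Optional[float] = None

    def fit(self, values: List[float]) -> None:
        self.low, self.high = min(values), max(values)

    def transform(self, values: List[float]) -> List[float]:
        span = (self.high - self.low) or 1.0  # avoid division by zero
        return [(v - self.low) / span for v in values]

    def run(self, values: List[float], dataset_type: str) -> List[float]:
        if dataset_type == "train":  # fit on training data, then transform it
            self.fit(values)
            return self.transform(values)
        if dataset_type in ("test", "predict"):  # reuse the fitted statistics
            if self.low is None:
                raise RuntimeError("component must see a training set first")
            return self.transform(values)
        raise ValueError(f"unknown data set type: {dataset_type}")


scaler = MinMaxComponent()
train_out = scaler.run([0.0, 5.0, 10.0], "train")
test_out = scaler.run([2.5, 20.0], "test")
```

Keeping the fitted statistics inside the component means test and prediction data are scaled with the training range, which is why a value outside that range can fall outside 0 to 1.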

The method can be implemented by dragging on a Web front-end page, at the code layer, or from the command line. The preparation process for a command-line machine learning workflow is given below; the components listed are not all of the available component types. The preparation steps are as follows:

1. The parameters of each component are specified in the configuration template as key-value pairs, grouped by component category; the template format may be .yml or .json. The configuration information of the components is required to start the workflow program. At the logical level there is exactly one data loading component; it produces output only and accepts no input from other components, whereas every other component must have an input. Fields with the same name have similar meanings across components, and file paths follow a uniform naming rule;

2. Configure the pseudo component Base. The pseudo component has no configuration items for upstream or downstream component categories and serves as the upstream component of the data loading component to keep the program code simple. The data source and the data format are specified in the configuration of the pseudo component; the data source can be an offline file or a file on a cluster, and the data format supports common formats such as csv and excel. The log save path is also specified in the Base pseudo component;

3. Configure the data loading component DataLoad, whose upstream component is the pseudo component Base and whose downstream component is the missing value processing component Imputation. The data loading component converts the data received from the pseudo component into the data format required by the machine learning algorithm and, where needed, undersamples or oversamples imbalanced data samples. It can also be configured with the proportion of data to load from the pseudo component Base and the path for saving the converted data;

4. Configure the missing value processing component Imputation, whose upstream component is the data loading component DataLoad and whose downstream component is the feature engineering component FeatureEngineering. The missing value processing component handles missing values; its configurable options include the missing value processing method and the format of the processed data. The processing method can be mean or median filling, random sampling, arbitrary-value filling, mode filling, deletion and so on. The fields to process and the missing value processing method are specified in the configuration template, and the data storage format is generally configured the same as in the previous component;

5. Configure the feature engineering component FeatureEngineering, whose upstream component is the missing value processing component Imputation and whose downstream component is the feature scaling component FeatureScaling. The feature engineering component performs feature construction, selection and transformation on the data; its configurable options include the category encoding method, discretization method, mathematical transformation method, outlier handling method, feature creation method, feature selection method and the storage format of the processed data. The fields to process and the feature engineering methods are specified in the configuration template, and the data storage format is generally configured the same as in the previous component;

6. Configure the feature scaling component FeatureScaling, whose upstream component is the feature engineering component FeatureEngineering and whose downstream component is the data partitioning component DataSplit. The feature scaling component brings numerical features onto the same scale; its configurable options include feature standardization, feature min-max scaling, feature normalization and so on, as well as the storage format of the processed data. The fields to transform (if none are given, all fields of the input data are transformed) and the feature scaling method are specified in the configuration template, and the data storage format is generally configured the same as in the previous component;

7. Configure the data partitioning component DataSplit, whose upstream component is the feature scaling component FeatureScaling and whose downstream component is the model component Models. The data partitioning component splits the data into a training set and a test set; its configurable options include the split ratio of the data set, the training set save path and the test set save path;

8. Configure the model component Models, whose upstream component is the data partitioning component DataSplit and which has no downstream component. Through an internal model management method, the model component loads an algorithm model from the model library according to the parameters provided by the configuration template; it models the data and outputs the trained model file and the model run results. Its configurable options include model selection, evaluation method selection, the model file save path and the model result output path.
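The way template values drive individual components can be sketched for three of the steps above: missing value processing, feature scaling and data partitioning. Everything here is a simplified assumption: the key names "method", "test_ratio" and "seed" are hypothetical, a single numeric column stands in for a data set, and feature engineering and modeling are left out.

```python
import random
import statistics
from typing import Any, Dict, List, Optional, Tuple

# Hypothetical key-value template covering three components only.
TEMPLATE: Dict[str, Dict[str, Any]] = {
    "Imputation": {"method": "median"},
    "FeatureScaling": {"method": "min-max"},
    "DataSplit": {"test_ratio": 0.25, "seed": 0},
}


def impute(column: List[Optional[float]], method: str) -> List[float]:
    """Fill missing values with the mean, median or mode of the known values."""
    known = [v for v in column if v is not None]
    chooser = {"mean": statistics.mean, "median": statistics.median,
               "mode": statistics.mode}[method]
    fill = chooser(known)
    return [fill if v is None else v for v in column]


def scale(column: List[float], method: str) -> List[float]:
    """Bring values onto one scale by min-max scaling or standardization."""
    if method == "min-max":
        low, span = min(column), (max(column) - min(column)) or 1.0
        return [(v - low) / span for v in column]
    if method == "standardize":
        mu, sigma = statistics.mean(column), statistics.pstdev(column) or 1.0
        return [(v - mu) / sigma for v in column]
    raise ValueError(f"unknown scaling method: {method}")


def split(rows: List[float], test_ratio: float, seed: int) -> Tuple[List[float], List[float]]:
    """Shuffle reproducibly, then cut off the test share."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


column = [4.0, None, 8.0, 2.0]
filled = impute(column, TEMPLATE["Imputation"]["method"])
scaled = scale(filled, TEMPLATE["FeatureScaling"]["method"])
train, test = split(scaled, **TEMPLATE["DataSplit"])
```

Switching the imputation method from median to mean, or the split ratio from 0.25 to another value, is then a change to the template only.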

After each component has been configured, the whole workflow task is executed with mlpctl create-configuration "['DataLoad', 'Imputation', 'FeatureEngineering', 'FeatureScaling', 'DataSplit', 'Models']" pipeline/configuration.
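Before such a task runs, the component list given on the command line could be checked against the upstream and downstream fields of the template. The sketch below assumes the list arrives as a Python-style list literal and that the template stores the links under the hypothetical keys "upstream" and "downstream"; the component spellings follow the configuration steps above.

```python
import ast
from typing import Dict, List, Optional

# Hypothetical template excerpt holding only the upstream/downstream fields.
LINKS: Dict[str, Dict[str, Optional[str]]] = {
    "DataLoad": {"upstream": "Base", "downstream": "Imputation"},
    "Imputation": {"upstream": "DataLoad", "downstream": "FeatureEngineering"},
    "FeatureEngineering": {"upstream": "Imputation", "downstream": "FeatureScaling"},
    "FeatureScaling": {"upstream": "FeatureEngineering", "downstream": "DataSplit"},
    "DataSplit": {"upstream": "FeatureScaling", "downstream": "Models"},
    "Models": {"upstream": "DataSplit", "downstream": None},
}


def parse_sequence(argument: str) -> List[str]:
    """Turn the quoted list argument into a list of component names."""
    sequence = ast.literal_eval(argument)
    if not (isinstance(sequence, list) and all(isinstance(s, str) for s in sequence)):
        raise ValueError("expected a list of component names")
    return sequence


def order_matches_template(sequence: List[str],
                           links: Dict[str, Dict[str, Optional[str]]]) -> bool:
    """Check that the sequence starts after Base, follows every link and ends the chain."""
    if not sequence or links[sequence[0]]["upstream"] != "Base":
        return False
    for current, following in zip(sequence, sequence[1:]):
        if links[current]["downstream"] != following:
            return False
        if links[following]["upstream"] != current:
            return False
    return links[sequence[-1]]["downstream"] is None


argument = ("['DataLoad', 'Imputation', 'FeatureEngineering', "
            "'FeatureScaling', 'DataSplit', 'Models']")
sequence = parse_sequence(argument)
valid = order_matches_template(sequence, LINKS)
invalid = order_matches_template(["DataLoad", "Models"], LINKS)
```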

A draggable machine learning workflow project packaging mechanism comprises: a workflow project packaging module, used to package the whole workflow project of the current version with one click in the standard form of a Python third-party package, so that it can be installed into different Python runtime environments or virtual environments as required during deployment; and a function iteration module, used to obtain the tasks required for machine learning modeling in the current machine learning workflow and to iteratively update the functions of the components corresponding to those tasks.

A friendly command line interactive interface is highly configurable, automatically generates well-formatted help pages and supports sub-commands. After a machine learning workflow project is installed through the workflow project packaging mechanism of the second aspect of the disclosure, command line interaction is performed through the mlpctl instruction. mlpctl was custom-developed while designing the machine learning workflow and is one implementation of the friendly command line interactive interface. The tasks, task sequence and configuration template required for the current machine learning modeling are passed in as parameters, and the workflow program is executed by running an mlpctl instruction, thereby implementing the machine learning workflow scheduling method provided by the first aspect of the disclosure.

Whether the workflow task is created from the command line or by dragging on the Web front-end page, the workflow program is ultimately executed through an mlpctl instruction, which is also the key to resolving the either-or dilemma mentioned above.

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.

Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims

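1. A draggable machine learning workflow component scheduling method, characterized in that the method comprises the following steps: S1, designing a component configuration template for each component type, in which every node except the pseudo node Base corresponds one-to-one to a task required for machine learning modeling; S2, obtaining the tasks and the task sequence required for machine learning modeling in the current machine learning workflow, and passing the task sequence and the configuration template in as parameters; and S3, dynamically loading the configuration template parameters according to the task sequence and executing the machine learning workflow.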

2. The draggable machine learning workflow component scheduling method of claim 1, wherein the method further comprises: S11, packaging each component using the component configuration template corresponding to its component type, and exposing the component's data input interface, data output interface and component parameter interface.

3. The draggable machine learning workflow component scheduling method of claim 2, wherein the method further comprises: S12, configuring the save path of the task run log through the pseudo node Base.

4. The draggable machine learning workflow component scheduling method of claim 3, wherein the method further comprises: S21, after the components corresponding to the tasks required for machine learning modeling are connected according to the task sequence, checking whether each pair of adjacent components meets the connection specification, based on the specified input types, number of inputs, output types and number of outputs of the components.

5. The draggable machine learning workflow component scheduling method of claim 4, wherein the error in S21 is reported either through a pop-up dialog or on the command line terminal, showing the connection error information for the components in the task flow required for machine learning modeling.

6. The draggable machine learning workflow component scheduling method of claim 5, wherein the method further comprises: S22, the user can also design a component run flow chart by dragging machine learning components and can control the direction of data flow between components.