CN114138446A - Towable machine learning workflow component scheduling method - Google Patents


Info

Publication number
CN114138446A
Authority
CN
China
Prior art keywords
component
machine learning
workflow
task
configuration template
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111488423.9A
Other languages
Chinese (zh)
Inventor
张金磊
许哲豪
宋少鸿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Yingtiandi Information Technology Co ltd
Original Assignee
Suzhou Yingtiandi Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2021-12-08
Filing date
2021-12-08
Publication date
2022-03-04
Application filed by Suzhou Yingtiandi Information Technology Co ltd
Priority to CN202111488423.9A
Publication of CN114138446A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching
    • G06F9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/048 Interaction techniques based on graphical user interfaces [GUI]
    • G06F3/0484 Interaction techniques based on graphical user interfaces [GUI] for the control of specific functions or operations, e.g. selecting or manipulating an object, an image or a displayed text element, setting a parameter value or selecting a range
    • G06F3/0486 Drag-and-drop
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/30 Creation or generation of source code
    • G06F8/38 Creation or generation of source code for implementing user interfaces
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/445 Program loading or initiating
    • G06F9/44521 Dynamic linking or loading; Link editing at or after load time, e.g. Java class loading
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/451 Execution arrangements for user interfaces
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Stored Programmes (AREA)

Abstract

The invention discloses a draggable machine learning workflow component scheduling method comprising the following steps: S1, designing a component configuration template for each component type, in which every node other than the pseudo node Base corresponds one-to-one to a task required by machine learning modeling; S2, obtaining the tasks required by machine learning modeling in the current machine learning workflow, together with their order, and passing the task sequence and the configuration template in as parameters; and S3, dynamically loading the configuration template parameters according to the task sequence and executing the machine learning workflow. The design idea of the draggable machine learning workflow is that the workflow is modular and can be quickly reused, and a machine learning workflow task can be created in any of three ways: by dragging on a Web front-end page, at the code layer, or from the command line.

Description

Towable machine learning workflow component scheduling method
Technical Field
The invention relates to the technical field of machine learning workflows, and in particular to a draggable machine learning workflow component scheduling method.
Background
In recent years, with the rapid development of computer application technology, artificial intelligence, big data and cloud computing have become the focus of attention in the IT field. Machine learning (ML) algorithms, which give computers "intelligence", have achieved remarkable results on tasks such as target recognition and target detection, and have been successfully applied in fields such as financial transactions, commodity recommendation and traffic prediction. When a machine learning algorithm is used to train a model, steps such as raw data collection, data cleaning, missing value processing, feature extraction, sample generation and model evaluation can take up excessive time; to avoid this, a machine learning workflow needs to be constructed for the actual business scenario.
Most existing machine learning algorithms are built and tested at the code layer, and a service can be deployed only after environment configuration, algorithm flow design, data interface design, program compilation, program debugging and similar procedures. For the upstream data source and the downstream business application, the algorithm service becomes available only after each such deployment, which creates a bottleneck for automated operation. In addition, for a junior developer in the machine learning field or a data analyst at the business layer, developing and testing algorithms at the code layer is difficult, and the barrier to entry is too high.
At present, most machine learning workflow construction methods are based mainly on a scheduling method and a scheduling system, but constructing an easy-to-use machine learning workflow requires more than that. A new form of workflow construction is therefore proposed in combination with specific business requirements. Specifically, the machine learning workflow is delivered in the standard form of a Python third-party package and supports subsequent function updates and version iteration; it supports standalone operation and can also be operated by drag and drop together with a Web front-end interface; and when running standalone, interaction takes place through a friendly command line interface.
Existing machine learning workflow scheduling methods and systems have the advantage that steps such as raw data collection, data cleaning, missing value processing, feature extraction and sample generation are modularized, which makes integration convenient and saves time, but their construction idea has a drawback. It is opposed to the approach of building and testing machine learning algorithms at the code layer, so the corresponding workflow scheduling methods are limited and cannot accommodate code-layer construction. To this end, a draggable machine learning workflow component scheduling method is proposed.
Disclosure of Invention
The object of the present invention is to provide a draggable machine learning workflow component scheduling method that solves the problems described in the background art.
To achieve this object, the invention provides the following technical solution: a draggable machine learning workflow component scheduling method comprising the following steps:
S1, designing a component configuration template for each component type, in which every node other than the pseudo node Base corresponds one-to-one to a task required by machine learning modeling;
S2, obtaining the tasks required by machine learning modeling in the current machine learning workflow, together with their order, and passing the task sequence and the configuration template in as parameters;
S3, dynamically loading the configuration template parameters according to the task sequence, and executing the machine learning workflow.
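Steps S1 to S3 can be pictured with a minimal Python sketch. It is only an illustration under stated assumptions: the registry, the `register` decorator, the `run_workflow` function and the two toy components are hypothetical names that do not come from the patent, and the real scheduler may load parameters quite differently.

```python
from typing import Any, Callable, Dict, List

# Hypothetical registry mapping a component name to the callable that runs it.
REGISTRY: Dict[str, Callable[[Any, Dict[str, Any]], Any]] = {}


def register(name: str):
    """Decorator that adds a component implementation to the registry."""
    def wrap(func):
        REGISTRY[name] = func
        return func
    return wrap


@register("DataLoad")
def data_load(_upstream: Any, params: Dict[str, Any]) -> List[float]:
    # The loader has no real upstream; it produces data from its own parameters.
    return list(params["values"])


@register("FeatureScaling")
def feature_scaling(data: List[float], params: Dict[str, Any]) -> List[float]:
    # Divide every value by a configured factor (illustrative only).
    return [x / params["factor"] for x in data]


def run_workflow(task_sequence: List[str], template: Dict[str, Dict[str, Any]]) -> Any:
    """S3: walk the task sequence, load each task's parameters, and execute it."""
    data = None
    for name in task_sequence:
        params = template.get(name, {})      # parameters looked up per task at run time
        data = REGISTRY[name](data, params)  # component resolved by name
    return data


# S1: one template entry per task. S2: the task sequence and template are the inputs.
template = {"DataLoad": {"values": [2, 4, 6]}, "FeatureScaling": {"factor": 2}}
result = run_workflow(["DataLoad", "FeatureScaling"], template)
```

Resolving components by name at run time is what lets the same task sequence come from a dragged flow chart, from code, or from a command line argument.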
Preferably, the method further comprises: S11, packaging each component using the component configuration template corresponding to its component type, and exposing the component's data input interface, data output interface and component parameter interface.
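One way to picture the three exposed interfaces of S11 is a small wrapper class. This is a sketch under assumptions: `Component`, `from_template` and the template keys `inputs`, `outputs` and `params` are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Tuple


@dataclass
class Component:
    """Wraps a task behind three exposed interfaces: inputs, outputs, parameters."""
    name: str
    input_types: Tuple[type, ...]   # data input interface
    output_types: Tuple[type, ...]  # data output interface
    func: Callable[..., Any]        # the wrapped task
    params: Dict[str, Any] = field(default_factory=dict)  # component parameter interface

    def run(self, *inputs: Any) -> Any:
        if len(inputs) != len(self.input_types):
            raise TypeError(f"{self.name} expects {len(self.input_types)} input(s)")
        for value, expected in zip(inputs, self.input_types):
            if not isinstance(value, expected):
                raise TypeError(f"{self.name} expects {expected.__name__}")
        return self.func(*inputs, **self.params)


def from_template(name: str, template: Dict[str, Any], func: Callable[..., Any]) -> Component:
    """Build a component from its (hypothetical) configuration template entry."""
    return Component(name, tuple(template["inputs"]), tuple(template["outputs"]),
                     func, dict(template.get("params", {})))


def fill_missing(rows: list, fill_value: float = 0.0) -> list:
    # Replace missing entries (None) with a constant.
    return [fill_value if r is None else r for r in rows]


imputation = from_template(
    "Imputation",
    {"inputs": [list], "outputs": [list], "params": {"fill_value": -1.0}},
    fill_missing,
)
cleaned = imputation.run([1.0, None, 3.0])
```

Keeping the parameters in the template entry rather than in the function means a component can be reconfigured without touching its code.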
Preferably, the method further comprises: S12, configuring the save path of the task running log through the pseudo node Base.
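As a rough sketch of S12, the log path held by the pseudo node Base could be applied with Python's standard `logging` module. The key name `log_path`, the logger name `workflow` and the function `configure_task_logger` are assumptions made for this example only.

```python
import logging
import os
import tempfile
from typing import Any, Dict


def configure_task_logger(base_config: Dict[str, Any]) -> logging.Logger:
    """Create a logger that writes to the path configured on the pseudo node Base."""
    log_path = base_config["log_path"]  # hypothetical key name
    os.makedirs(os.path.dirname(log_path), exist_ok=True)
    logger = logging.getLogger("workflow")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on reconfiguration
    handler = logging.FileHandler(log_path, encoding="utf-8")
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


# Use a temporary directory so the sketch does not write into the working tree.
log_dir = tempfile.mkdtemp()
base = {"log_path": os.path.join(log_dir, "logs", "run.log")}
logger = configure_task_logger(base)
logger.info("DataLoad finished")
```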
Preferably, the method further comprises: S21, after the components corresponding to the tasks required by machine learning modeling have been connected according to the task sequence, checking whether each pair of adjacent components conforms to the connection specification, based on the components' specified input type, input quantity, output type and output quantity;
if so, performing step S3; otherwise, rejecting the connection that does not meet the connection specification and reporting an error, and repeating steps S11 to S21 until the component connections meet the connection specification, and then performing step S3.
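The connection check in S21 can be sketched as a comparison of declared output and input signatures. The `Spec` class, the type labels such as `"table"` and the error message wording are hypothetical; the patent does not state how the specification is encoded.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Spec:
    """Declared interface of a component: ordered input and output types."""
    name: str
    inputs: Tuple[str, ...]
    outputs: Tuple[str, ...]


def check_connection(upstream: Spec, downstream: Spec) -> List[str]:
    """Return a list of error messages; an empty list means the link is valid."""
    errors = []
    if len(upstream.outputs) != len(downstream.inputs):
        errors.append(
            f"{upstream.name} -> {downstream.name}: "
            f"{len(upstream.outputs)} output(s) but {len(downstream.inputs)} input(s)"
        )
    for out_type, in_type in zip(upstream.outputs, downstream.inputs):
        if out_type != in_type:
            errors.append(
                f"{upstream.name} -> {downstream.name}: "
                f"output type {out_type!r} does not match input type {in_type!r}"
            )
    return errors


def check_sequence(specs: List[Spec]) -> List[str]:
    """Check every adjacent pair in the task sequence."""
    errors = []
    for up, down in zip(specs, specs[1:]):
        errors.extend(check_connection(up, down))
    return errors


load = Spec("DataLoad", inputs=(), outputs=("table",))
split = Spec("DataSplit", inputs=("table",), outputs=("table", "table"))
models = Spec("Models", inputs=("table", "table"), outputs=("model",))

ok_errors = check_sequence([load, split, models])  # valid chain
bad_errors = check_sequence([load, models])        # quantity mismatch
```

Returning messages instead of raising on the first problem lets a front end show every invalid link at once, whether in a pop-up or on the terminal.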
Preferably, the error in S21 is reported either through a pop-up dialog or on the command line terminal, showing the connection error information for the components in the task flow required by machine learning modeling.
Preferably, the method further comprises: S22, the user can also design a component run flow chart by dragging machine learning components, and so control the direction of data flow between components; the output of one component can be used by multiple downstream components at the same time, but a component cannot run twice; depending on the type of data set loaded by the user, a component automatically selects the fitting, transformation, model training, evaluation or prediction process; and the user can continue building the flow and run the task from any previously run component without rerunning the whole flow from the start.
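Two properties of S22, that one output can feed several downstream components while no component runs twice, and that a flow can be extended and continued from components that have already run, can be sketched with a cached graph runner. The `Flow` class and the node names `Sum`, `Max` and `Report` are hypothetical, and the sketch performs no cycle detection.

```python
from typing import Any, Callable, Dict, List, Sequence


class Flow:
    """Tiny DAG runner: each node runs at most once and its output is cached."""

    def __init__(self) -> None:
        self.funcs: Dict[str, Callable[..., Any]] = {}
        self.upstream: Dict[str, List[str]] = {}
        self.cache: Dict[str, Any] = {}
        self.run_count: Dict[str, int] = {}

    def add(self, name: str, func: Callable[..., Any], upstream: Sequence[str] = ()) -> None:
        self.funcs[name] = func
        self.upstream[name] = list(upstream)

    def run(self, name: str) -> Any:
        # A cached result means the component already ran; reuse its output.
        if name in self.cache:
            return self.cache[name]
        inputs = [self.run(dep) for dep in self.upstream[name]]
        self.cache[name] = self.funcs[name](*inputs)
        self.run_count[name] = self.run_count.get(name, 0) + 1
        return self.cache[name]


flow = Flow()
flow.add("DataLoad", lambda: [1, 2, 3, 4])
flow.add("Sum", sum, ["DataLoad"])  # two downstream consumers of one output
flow.add("Max", max, ["DataLoad"])
total = flow.run("Sum")
largest = flow.run("Max")

# Extend the flow later and continue from the components that already ran.
flow.add("Report", lambda s, m: {"sum": s, "max": m}, ["Sum", "Max"])
report = flow.run("Report")
```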
A draggable machine learning workflow project packaging mechanism comprises:
a workflow project packaging module, used to package the entire current version of the workflow project with one click in the standard form of a Python third-party package, so that at deployment it can be installed into different Python runtime environments or virtual environments as required;
and a function iteration module, used to obtain the tasks required by machine learning modeling in the current machine learning workflow and to iteratively update the functions of the components corresponding to those tasks.
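Delivering the project as a Python third-party package that also installs a command line tool is commonly done with a setuptools `console_scripts` entry point. The sketch below only builds the keyword arguments that a `setup.py` would pass to `setuptools.setup`; the package name `mlpipeline`, the module path `mlpipeline.cli:main` and the version string are placeholders, since the patent does not disclose the real ones.

```python
from typing import Any, Dict


def build_setup_kwargs(version: str) -> Dict[str, Any]:
    """Return keyword arguments for setuptools.setup (all names are placeholders)."""
    return {
        "name": "mlpipeline",
        "version": version,  # bumped on each function update or version iteration
        "packages": ["mlpipeline"],
        "python_requires": ">=3.8",
        "entry_points": {
            # Installing the package creates an `mlpctl` executable
            # that calls the main() function in mlpipeline/cli.py.
            "console_scripts": ["mlpctl = mlpipeline.cli:main"],
        },
    }


# In a real setup.py this would be followed by:
#     from setuptools import setup
#     setup(**build_setup_kwargs("0.1.0"))
kwargs = build_setup_kwargs("0.1.0")
```

A package built this way can be installed with pip into any Python environment or virtual environment, which matches the deployment described above.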
A friendly command line interactive interface is highly configurable, automatically generates attractively formatted help pages and supports subcommands. After a machine learning workflow project has been installed through the workflow project packaging mechanism of the second aspect of the disclosure, command line interaction can be performed through the mlpctl instruction.
mlpctl was custom developed while designing the machine learning workflow and is one implementation of the friendly command line interactive interface.
The tasks, task sequence and configuration template required by the current machine learning modeling are passed in as parameters, and the workflow program is executed by running an mlpctl instruction, thereby implementing the machine learning workflow scheduling method provided by the first aspect of the disclosure.
Whether the workflow task is created from the command line or by dragging on the Web front-end page, the workflow program is ultimately executed through an mlpctl instruction; this is the key to resolving the either-or dilemma described in the background art.
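A command line entry point with subcommands and generated help pages can be sketched with the standard library's `argparse`. The patent does not say which library mlpctl uses, and the subcommand `run` and the options `--tasks` and `--config` are hypothetical; they are not the real mlpctl interface.

```python
import argparse
import ast
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="mlpctl", description="Run machine learning workflows.")
    sub = parser.add_subparsers(dest="command", required=True)
    run = sub.add_parser("run", help="execute a workflow")  # hypothetical subcommand
    run.add_argument("--tasks", required=True, help="task sequence as a Python-style list literal")
    run.add_argument("--config", required=True, help="path to the configuration template")
    return parser


def main(argv: Optional[List[str]] = None) -> dict:
    args = build_parser().parse_args(argv)
    tasks = ast.literal_eval(args.tasks)  # parses the literal without executing code
    if not isinstance(tasks, list):
        raise SystemExit("--tasks must be a list")
    # A real implementation would hand these to the scheduler; here they are returned.
    return {"command": args.command, "tasks": tasks, "config": args.config}


request = main(["run", "--tasks", "['DataLoad', 'Models']", "--config", "configuration.json"])
```

Because a Web front end can build the same argument list from a dragged flow chart, both creation paths can end in one and the same command.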
Compared with the prior art, the invention has the following beneficial effects:
the design idea of the draggable machine learning workflow is that the workflow is modular and can be quickly reused, and a machine learning workflow task can be created in any of three ways: by dragging on a Web front-end page, at the code layer, or from the command line.
Drawings
FIG. 1 is a schematic diagram of the overall process of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to FIG. 1, the present invention provides a technical solution: a draggable machine learning workflow component scheduling method comprising the following steps: S1, designing a component configuration template for each component type, in which every node other than the pseudo node Base corresponds one-to-one to a task required by machine learning modeling; S2, obtaining the tasks required by machine learning modeling in the current machine learning workflow, together with their order, and passing the task sequence and the configuration template in as parameters; and S3, dynamically loading the configuration template parameters according to the task sequence and executing the machine learning workflow.
The method further comprises: S11, packaging each component using the component configuration template corresponding to its component type, and exposing the component's data input interface, data output interface and component parameter interface.
The method further comprises: S12, configuring the save path of the task running log through the pseudo node Base.
The method further comprises: S21, after the components corresponding to the tasks required by machine learning modeling have been connected according to the task sequence, checking whether each pair of adjacent components conforms to the connection specification, based on the components' specified input type, input quantity, output type and output quantity;
if so, performing step S3; otherwise, rejecting the connection that does not meet the connection specification and reporting an error, and repeating steps S11 to S21 until the component connections meet the connection specification, and then performing step S3.
In S21, the error is reported through a pop-up dialog or on the command line terminal, showing the connection error information for the components in the task flow required by machine learning modeling.
The method further comprises: S22, the user can also design a component run flow chart by dragging machine learning components, and so control the direction of data flow between components; the output of one component can be used by multiple downstream components at the same time, but a component cannot run twice; depending on the type of data set loaded by the user, a component automatically selects the fitting, transformation, model training, evaluation or prediction process; and the user can continue building the flow and run the task from any previously run component without rerunning the whole flow from the start.
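The automatic choice between fitting and transformation in S22 could work roughly as in the sketch below, where a component inspects a tag on the incoming data set. The `Dataset` class, its `kind` values and the `MinMaxScalerComponent` class are assumptions for illustration; the patent does not specify how the data set type is detected.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Dataset:
    values: List[float]
    kind: str  # hypothetical tag: "train", "test" or "predict"


class MinMaxScalerComponent:
    """Fits on training data, then only transforms any other kind of data set."""

    def __init__(self) -> None:
        self.low: Optional[float] = None
        self.high: Optional[float] = None

    def run(self, data: Dataset) -> Dataset:
        if data.kind == "train":
            # Fitting: learn the value range from the training data.
            self.low, self.high = min(data.values), max(data.values)
        elif self.low is None or self.high is None:
            raise RuntimeError("component must see a training set before other data")
        span = (self.high - self.low) or 1.0  # avoid division by zero
        # Transformation: applied to every kind of data set with the fitted range.
        scaled = [(v - self.low) / span for v in data.values]
        return Dataset(scaled, data.kind)


scaler = MinMaxScalerComponent()
train_out = scaler.run(Dataset([0.0, 5.0, 10.0], "train"))
test_out = scaler.run(Dataset([2.5, 20.0], "test"))
```

Reusing the fitted range on test data keeps the test set from influencing the scaling, which is the usual reason for separating the two processes.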
The method can be implemented by dragging on a Web front-end page, at the code layer, or from the command line. The preparation process for a command-line machine learning workflow is given below; the components listed do not cover all component types. The preparation steps are as follows:
1. The parameters of each component are specified in the configuration template as key-value pairs, grouped by component category; the template may be a .yml or .json file. The component configuration information is required to start the workflow program. At the logical level there is exactly one data loading component; it only produces output and does not accept input from other components, whereas every other component must have an input. Fields with the same name in the configuration template have similar meanings across components, and file paths follow a uniform naming rule;
2. Configure the pseudo component Base. The pseudo component has no configuration items for upstream or downstream component categories, and for simplicity of the program code it serves as the upstream component of the data loading component. The pseudo component's configuration specifies the data source and the data format: the data source can be an offline file or a file on a cluster, and the data format supports common formats such as csv and excel. The log save path is also specified in the Base pseudo component;
3. Configure the data loading component DataLoad. Its upstream component is the pseudo component Base and its downstream component is the missing value processing component Imputation. The data loading component converts the data received from the pseudo component into the data format required by the machine learning algorithm and, where needed, undersamples or oversamples imbalanced data samples. It can also configure the proportion of data to load from the pseudo component Base and the path where the converted data is saved;
4. Configure the missing value processing component Imputation. Its upstream component is the data loading component DataLoad and its downstream component is the feature engineering component FeatureEngineering. The missing value processing component handles missing values; its configurable options include the missing value processing method and the format of the processed data. The processing method can be mean or median filling, random sampling, arbitrary value filling, mode filling, deletion and so on. The relevant fields and the missing value processing method are specified in the configuration template; the data storage format is generally configured the same as for the previous component;
5. Configure the feature engineering component FeatureEngineering. Its upstream component is the missing value processing component Imputation and its downstream component is the feature scaling component FeatureScaling. The feature engineering component performs feature construction, selection and transformation on the data; its configurable options include the category encoding method, discretization method, mathematical transformation method, outlier handling method, feature creation method, feature selection method and the storage format of the processed data. The relevant fields and the feature engineering methods are specified in the configuration template; the data storage format is generally configured the same as for the previous component;
6. Configure the feature scaling component FeatureScaling. Its upstream component is the feature engineering component FeatureEngineering and its downstream component is the data partitioning component DataSplit. The feature scaling component brings numerical features onto the same scale; its configurable options include feature standardization, feature min-max scaling, feature normalization and so on, as well as the storage format of the processed data. The fields to be converted (if none are specified, the method is applied to all fields of the input data) and the feature scaling method are specified in the configuration template; the data storage format is generally configured the same as for the previous component;
7. Configure the data partitioning component DataSplit. Its upstream component is the feature scaling component FeatureScaling and its downstream component is the model component Models. The data partitioning component splits the data into a training set and a test set; its configurable options include the split ratio of the data set, the training set save path and the test set save path;
8. Configure the model component Models. Its upstream component is the data partitioning component DataSplit and it has no downstream component. Through an internal model management method, the model component loads an algorithm model from the model library according to the parameters provided by the configuration template; it models the data and outputs a trained model file and the model run results. Its configurable options include model selection, evaluation method selection, the model file save path and the model result output path.
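The eight steps above describe a chain of components linked by upstream and downstream fields in a key-value template. The sketch below shows what such a template might look like in JSON and how the task sequence could be derived from it. Every key name and value (`upstream`, `downstream`, `data_source`, `method` and so on) is a guess for illustration; the patent does not publish its template schema.

```python
import json
from typing import Dict, List

# Hypothetical template: Base has no upstream or downstream entries, as in step 2.
TEMPLATE_JSON = """
{
  "Base": {"data_source": "data/input.csv", "data_format": "csv",
           "log_path": "logs/run.log"},
  "DataLoad": {"upstream": "Base", "downstream": "Imputation", "load_ratio": 1.0},
  "Imputation": {"upstream": "DataLoad", "downstream": "FeatureEngineering",
                 "method": "median"},
  "FeatureEngineering": {"upstream": "Imputation", "downstream": "FeatureScaling",
                         "encoding": "one_hot"},
  "FeatureScaling": {"upstream": "FeatureEngineering", "downstream": "DataSplit",
                     "method": "min_max"},
  "DataSplit": {"upstream": "FeatureScaling", "downstream": "Models",
                "test_ratio": 0.2},
  "Models": {"upstream": "DataSplit", "downstream": null,
             "model": "logistic_regression"}
}
"""


def task_sequence(template: Dict[str, dict]) -> List[str]:
    """Follow the downstream links from the component whose upstream is Base."""
    start = [n for n, c in template.items() if c.get("upstream") == "Base"]
    if len(start) != 1:
        raise ValueError("exactly one data loading component is required")
    order: List[str] = []
    current = start[0]
    while current is not None:
        if current in order:
            raise ValueError(f"cycle detected at {current}")
        order.append(current)
        nxt = template[current].get("downstream")
        # Each link must be declared consistently on both sides.
        if nxt is not None and template[nxt].get("upstream") != current:
            raise ValueError(f"{nxt} does not list {current} as its upstream")
        current = nxt
    return order


template = json.loads(TEMPLATE_JSON)
sequence = task_sequence(template)
```

Checking each link from both sides catches a template in which one component was renamed or rewired without updating its neighbor.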
After each component has been configured, the whole workflow task is executed with a command of the form mlpctl create-configuration "['DataLoad', 'Imputation', 'FeatureEngineering', 'FeatureScaling', 'DataSplit', 'Models']" pipeline/configuration.
A draggable machine learning workflow project packaging mechanism comprises: a workflow project packaging module, used to package the entire current version of the workflow project with one click in the standard form of a Python third-party package, so that at deployment it can be installed into different Python runtime environments or virtual environments as required; and a function iteration module, used to obtain the tasks required by machine learning modeling in the current machine learning workflow and to iteratively update the functions of the components corresponding to those tasks.
A friendly command line interactive interface is highly configurable, automatically generates attractively formatted help pages and supports subcommands. After a machine learning workflow project has been installed through the workflow project packaging mechanism of the second aspect of the disclosure, command line interaction can be performed through the mlpctl instruction. mlpctl was custom developed while designing the machine learning workflow and is one implementation of the friendly command line interactive interface. The tasks, task sequence and configuration template required by the current machine learning modeling are passed in as parameters, and the workflow program is executed by running an mlpctl instruction, thereby implementing the machine learning workflow scheduling method provided by the first aspect of the disclosure.
Whether the workflow task is created from the command line or by dragging on the Web front-end page, the workflow program is ultimately executed through an mlpctl instruction; this is the key to resolving the either-or dilemma mentioned above.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (6)

1. A draggable machine learning workflow component scheduling method, characterized in that the method comprises the following steps:
S1, designing a component configuration template for each component type, in which every node other than the pseudo node Base corresponds one-to-one to a task required by machine learning modeling;
S2, obtaining the tasks required by machine learning modeling in the current machine learning workflow, together with their order, and passing the task sequence and the configuration template in as parameters;
S3, dynamically loading the configuration template parameters according to the task sequence, and executing the machine learning workflow.
2. The draggable machine learning workflow component scheduling method of claim 1, characterized in that the method further comprises: S11, packaging each component using the component configuration template corresponding to its component type, and exposing the component's data input interface, data output interface and component parameter interface.
3. The draggable machine learning workflow component scheduling method of claim 2, characterized in that the method further comprises: S12, configuring the save path of the task running log through the pseudo node Base.
4. The draggable machine learning workflow component scheduling method of claim 3, characterized in that the method further comprises: S21, after the components corresponding to the tasks required by machine learning modeling have been connected according to the task sequence, checking whether each pair of adjacent components conforms to the connection specification, based on the components' specified input type, input quantity, output type and output quantity;
if so, performing step S3; otherwise, rejecting the connection that does not meet the connection specification and reporting an error, and repeating steps S11 to S21 until the component connections meet the connection specification, and then performing step S3.
5. The draggable machine learning workflow component scheduling method of claim 4, characterized in that in S21 the error is reported through a pop-up dialog or on the command line terminal, showing the connection error information for the components in the task flow required by machine learning modeling.
6. The draggable machine learning workflow component scheduling method of claim 5, characterized in that the method further comprises: S22, the user can also design a component run flow chart by dragging machine learning components, and so control the direction of data flow between components.
CN202111488423.9A 2021-12-08 2021-12-08 Towable machine learning workflow component scheduling method Pending CN114138446A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111488423.9A CN114138446A (en) 2021-12-08 2021-12-08 Towable machine learning workflow component scheduling method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111488423.9A CN114138446A (en) 2021-12-08 2021-12-08 Towable machine learning workflow component scheduling method

Publications (1)

Publication Number Publication Date
CN114138446A (en) 2022-03-04

Family

ID=80384667

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111488423.9A Pending CN114138446A (en) 2021-12-08 2021-12-08 Towable machine learning workflow component scheduling method

Country Status (1)

Country Link
CN (1) CN114138446A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114996249A (en) * 2022-05-17 2022-09-02 苏州佳祺仕信息科技有限公司 Data processing method and device, electronic equipment, storage medium and product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination