CN114115857A

CN114115857A - Method and system for constructing automatic production line of machine learning model

Info

Publication number: CN114115857A
Application number: CN202111268941.XA
Authority: CN
Inventors: 鄂海红; 宋美娜; 邵明岩; 刘钟允; 朱云飞; 郑云帆; 吕晓东; 魏文定
Original assignee: Beijing University of Posts and Telecommunications
Current assignee: Beijing University of Posts and Telecommunications
Priority date: 2021-10-29
Filing date: 2021-10-29
Publication date: 2022-03-01
Anticipated expiration: 2041-10-29
Also published as: CN114115857B; WO2023071075A1

Abstract

The invention provides a method and a system for constructing an automatic production line of a machine learning model, wherein the method comprises the following steps: constructing an operator assembly according to the configuration of the operator assembly, and storing the operator assembly into an operator warehouse; visually arranging and reading operator structure data in an operator warehouse, and combining operator components through business processing logic to generate a model task flow; converting the model task flow into a cloud native workflow engine execution plan, and submitting the cloud native workflow engine execution plan to a container cluster for execution so as to output a model file; based on model packaging, performing model file conversion and model inference container mirror image construction operation, and storing data corresponding to the operation into a model warehouse; and reading model data in the model warehouse and analyzing to generate three operators, and combining the three operator components to form a model issuing task flow to be submitted to the container cluster execution model issuing process. The invention improves the construction efficiency of the model production line, and the constructed model production line can quickly train a new model and improve the production capacity of the model.

Description

Method and system for constructing automatic production line of machine learning model

Technical Field

The invention relates to the technical field of artificial intelligence, in particular to a method and a system for constructing an automatic production line of a machine learning model.

Background

As the development of artificial intelligence has entered the explosive development stage, artificial intelligence technology has been applied to various industries. Machine learning is the core of artificial intelligence, is the fundamental approach for computers to have intelligence, and is applied to all fields of artificial intelligence. In a practical sense, machine learning is a method that trains a model using data and then predicts using the model.

Training machine learning models is not always feasible, and a machine learning model production line is needed to update training models in the face of ever-increasing industry data and ever-changing industry standards. The machine learning model production line can solidify the steps of model training and model deployment, and achieve the purposes of training a new model and deploying the model on line. A traditional model production line construction mode is a pure manual mode, a plurality of scripts are compiled to process original data to obtain a training data set of a model, model training codes are compiled to train the model, and finally the model reasoning scripts are required to be compiled to deploy an online model. The traditional model production line construction mode needs manual configuration of a dependent environment, manual operation of scripts, collection of operation results, manual model deployment and maintenance of model services, so that the model development period is long, all steps of the model production line are difficult to upgrade and modify due to strong coupling, and the reusability is poor. The manual configuration of the environment also causes problems such as environment-dependent conflicts. The traditional model production line construction mode is difficult to adapt to the rapid model iteration requirement brought by industry change.

The existing technical scheme lacks a model deployment module and does not cover a complete model production line, namely, a complete process from data source import to model online. The system only develops a production line aiming at a deep learning model and lacks support for a general machine learning model. The system provides a high degree of packaging for the production line, provides only a few parameter options to change the production line, lacks flexibility, and each step of the production line is not reusable to other production lines.

Disclosure of Invention

The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.

Therefore, the invention provides a method and a system for constructing an automatic production line of a machine learning model, which aim at the problems. The method divides the machine learning model production line construction process into operator component development, operator arrangement, model task flow execution, model packing and model release. Specifically, firstly, a container technology is utilized to solidify the model production line steps into operator assemblies, and the problems of single machine environment dependence and environment conflict are solved. And then combining a plurality of computer components to form a model task flow through operator arrangement, wherein operators in the model task flow can be combined and replaced at will, and the reusability of steps of a model production line is improved. The model task flow is converted into a cloud native workflow execution plan through a cloud native workflow engine, submitted to a container cluster to be executed to obtain a model file, packaged and stored in a model warehouse through model packaging, and finally released into model application to provide model services for the outside. The five construction flows are mutually independent and closely connected, so that the construction efficiency of the model production line is improved, and meanwhile, the constructed model production line can quickly train a new model, shorten the online process of the model and improve the production capacity of the model.

Therefore, a first objective of the present invention is to provide a method for constructing an automatic production line of a machine learning model, including:

constructing an operator assembly according to the configuration of the operator assembly, and storing the operator assembly into an operator warehouse;

visually arranging and reading operator structure data in the operator warehouse, and logically combining the operator components through service processing to generate a model task flow;

converting the model task flow into a cloud native workflow engine execution plan, and submitting the cloud native workflow engine execution plan to a container cluster for execution so as to output a model file;

based on model packaging, performing model file conversion and model inference container mirror image construction operation, and storing data corresponding to the operation into a model warehouse;

and reading model data in the model warehouse and analyzing the model data to generate three operators, and combining the three operator assemblies to form a model release task flow to be submitted to the container cluster execution model release process.

According to the method for constructing the machine learning model automatic production line, an operator component is constructed according to the configuration of the operator component, and the operator component is stored in an operator warehouse; visually arranging and reading operator structure data in an operator warehouse, and combining operator components through business processing logic to generate a model task flow; converting the model task flow into a cloud native workflow engine execution plan, and submitting the cloud native workflow engine execution plan to a container cluster for execution so as to output a model file; based on model packaging, performing model file conversion and model inference container mirror image construction operation, and storing data corresponding to the operation into a model warehouse; and reading model data in the model warehouse and analyzing to generate three operators, and combining the three operator components to form a model issuing task flow to be submitted to the container cluster execution model issuing process. The five construction processes are mutually independent and closely connected, so that the construction efficiency of the model production line is improved, and meanwhile, the constructed model production line can quickly train a new model, shorten the online process of the model and improve the production capacity of the model.

In addition, the machine learning model automatic production line building method according to the above embodiment of the present invention may further have the following additional technical features:

further, in an embodiment of the present invention, the constructing an operator component according to an operator component configuration and storing the operator component in an operator repository includes: copying an operator file into a file memory special for an operator, solidifying the file used by the operator in operation, generating a Docker file according to an operator dependent environment and a basic mirror image, submitting the Docker file to Docker Daemon for constructing an operator operation mirror image, informing the Docker Daemon after the construction is finished to push the operator operation mirror image to a mirror image warehouse, writing the address of the operator file in the storage library and the operator operation mirror image information into operator component configuration, storing the operator component information into the operator warehouse to complete the operator construction, generating an operator test template according to the operator component configuration, displaying the operator test template at the front end, submitting the operator test template to generate a single-node task flow, converting the single-node task flow into a cloud primary workflow execution plan, submitting the single-node task flow to a container cluster for execution, and obtaining an operator execution log; the operator warehouse comprises a file storage, a relational database and a mirror image warehouse, and is respectively used for storing operator codes, operator structure data and container mirror images.

Further, in an embodiment of the present invention, the visually arranging reads operator structure data in the operator warehouse, and combines the operator components through business processing logic to generate a model task stream, including: the method comprises the steps of reading operator information of a current operator warehouse, displaying operator components in an operator list on the left side of a front-end task flow canvas according to configuration information of the operator components, placing operators needed for building a model task flow in a middle canvas, generating operator component connection end points according to the configuration of the operators, taking the upper end points of the operator components as input end points, taking the lower end points as output end points, selecting operators, arranging an operator configuration panel on the right side of the canvas, connecting the input end and the output end of each operator according to a model production line flow, configuring relevant parameters on the configuration panel of each operator, completing building of the model workflow, and storing the built model task flow after the building is completed.

Further, in one embodiment of the present invention, the method further comprises: the method comprises the steps that JSON configuration files with uniform formats are generated for operators of different types according to specific rules, a user connects the input end and the output end of each operator according to a specific sequence to construct a task stream, the input setting and the output setting of the operators are automatically configured according to the edge and the node of each connecting line, operator structure data in an operator warehouse are read and analyzed when the task stream is arranged, task stream configurations with the JSON formats are dynamically generated according to operation, and when the task stream storage operation is executed, the task stream configurations with the JSON formats are sent to the rear end by the front end to be stored.

Further, in an embodiment of the present invention, the converting the model task flow into a cloud native workflow engine execution plan and submitting the cloud native workflow engine execution plan to a container cluster for execution so as to output a model file includes: analyzing and converting the model task flow structure data to generate a cloud native workflow execution plan, submitting the cloud native workflow execution plan to a container cluster to execute the model task flow, and storing a model data file generated by executing the model task flow in an object storage server: the method comprises the following steps: when the model task flow is executed, verifying the JSON-format task flow configuration, analyzing the JSON-format task flow configuration after the verification is completed, converting the JSON-format task flow configuration into a cloud native workflow execution plan, and acquiring the operation log information of each node of the model workflow from the container cluster after the operation is completed; wherein the cloud native workflow execution plan comprises: and creating container cluster resource objects required by the operation of the operator components and the transfer operation of the operator operation container input and output files.

Further, in an embodiment of the present invention, the performing the operations of model file conversion and model inference container mirror image construction based on model packaging, and storing the data corresponding to the operations in a model repository includes: receiving model configuration information input by a user at the front end, performing templated model packaging through a model packaging process, analyzing the model configuration information to perform model file standardization and model inference container mirror image construction work, and storing a model inference code, a data file and a container mirror image into a model warehouse as model data, wherein the model warehouse is used for storing the model inference configuration data, the model structure data and the model inference container mirror image file; wherein the model repository comprises the relational database, an object storage server, and a mirror repository; in the model packing flow, selecting a model type, providing a model inference operator according to a corresponding rule, after determining the model type and the model inference operator type, providing specific data for a subsequent model data packet according to a specific strategy, packing the specific data into the model data and storing the model data in a model warehouse; the specific data comprises a data packet, a file address after model conversion and a model instance running mirror image address.

Further, in an embodiment of the present invention, the reading the model data in the model repository and analyzing the model data to generate three operators, and combining the three operator components to form a model publishing task stream to be submitted to the container cluster to execute the model publishing process includes: receiving model Service configuration information input by a user at the front end, reading model data in the model warehouse, analyzing and generating a model deployment operator, simultaneously generating a Service configuration operator and an Ingress configuration operator for model Service opening, automatically compiling task flows for model deployment and model Service opening, analyzing the task flows to generate a cloud native workflow execution plan, submitting the cloud native workflow execution plan to a container cluster for execution, and completing model Service release.

Further, in one embodiment of the invention, the operator component types include: a plurality of data reading operators, data processing operators, model training operators, data derivation operators, visualization operators, model deployment operators and cluster configuration operators; operator component configuration information, including: operator files, operator input and output settings, operator parameter settings, operator operation scripts, operator dependence environments, basic mirror images required by operator construction and resource configuration required by operator operation; the operator file comprises an operator running script and other files required by operator running, wherein the operator running script is an operator running inlet and is an executable binary file; the operator input and output settings are used for defining the data source and the data output position of an operator; the operator parameter setting is used for defining parameters required when the operator runs the script.

Further, in an embodiment of the present invention, the reading the model data in the model repository and analyzing the model data to generate three operators, and combining the three operator components to form a model publishing task stream to be submitted to the container cluster to execute the model publishing process, further includes: the cloud native workflow execution plan comprises a first node, a third node, a model deployment node, a fourth node, a Service object cleaning node, a fifth node, a third node and a fourth node, wherein the first node configures nodes for an Ingress object, the Ingress object is created, a request is routed to a model Service object, the second node configures nodes for the Service object, the Service object is created, the request flow load is balanced to each model deployment node, the third node is the model deployment node, the node configuration is generated by analyzing model data, a running container is generated by using a model running mirror image, a model file and a model inference code file are bound, the use of container resources is limited according to the running resource configuration, the fourth node is the Service object cleaning node, the fifth node is the Service object cleaning node, the cloud native workflow execution plan is submitted to a container cluster for execution, the container cluster deploys a model and develops a model Service, a model release process is completed, the former three nodes are sequentially operated during the workflow execution, and an end signal is waited at the third node, and triggering an exit event when the workflow is finished, operating the fourth node and the fifth node by using a callback mechanism, and clearing the Service object and the Ingress object.

In order to achieve the above object, a second embodiment of the present invention provides a machine learning model automatic production line building system, including:

the operator building module is used for building an operator assembly according to the configuration of the operator assembly and storing the operator assembly into an operator warehouse;

the operator arrangement module is used for visually arranging and reading operator structure data in the operator warehouse and combining the operator components through business processing logic to generate a model task flow;

the model task flow module is used for converting the model task flow into a cloud native workflow engine execution plan and submitting the cloud native workflow engine execution plan to the container cluster for execution so as to output a model file;

the model packing module is used for carrying out model file conversion and model inference container mirror image construction operation based on model packing and storing data corresponding to the operation into a model warehouse;

and the model issuing module is used for reading the model data in the model warehouse and analyzing the model data to generate three operators, and combining the three operator modules to form a model issuing task flow to be submitted to the container cluster to execute the model issuing process.

The machine learning model automatic production line construction system comprises an operator construction module, an operator warehouse and an operator storage module, wherein the operator construction module is used for constructing an operator assembly according to the configuration of the operator assembly and storing the operator assembly into the operator warehouse; the operator arrangement module is used for visually arranging and reading operator structure data in the operator warehouse and combining the operator components through business processing logic to generate a model task flow; the model task flow module is used for converting the model task flow into a cloud native workflow engine execution plan and submitting the cloud native workflow engine execution plan to the container cluster for execution so as to output a model file; the model packing module is used for carrying out model file conversion and model inference container mirror image construction operation based on model packing and storing data corresponding to the operation into a model warehouse; and the model issuing module is used for reading the model data in the model warehouse and analyzing the model data to generate three operators, and combining the three operator modules to form a model issuing task flow to be submitted to the container cluster to execute the model issuing process. The five construction processes are mutually independent and closely connected, so that the construction efficiency of the model production line is improved, and meanwhile, the constructed model production line can quickly train a new model, shorten the online process of the model and improve the production capacity of the model.

Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.

Drawings

The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

fig. 1 is a flowchart of a method for constructing an automatic production line of a machine learning model according to an embodiment of the present invention;

fig. 2 is a schematic diagram illustrating a machine learning model automation line according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a process for constructing an operator according to an embodiment of the present invention;

FIG. 4 is a schematic diagram illustrating an operator arrangement and a task flow execution process of a model according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of a model packaging and model publishing process provided by an embodiment of the present invention;

fig. 6 is a schematic structural diagram of a machine learning model automation line building system according to an embodiment of the present invention.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.

The following describes a method and a system for constructing an automatic production line of a machine learning model according to an embodiment of the present invention with reference to the drawings, and first, a method for constructing an automatic production line of a machine learning model according to an embodiment of the present invention will be described with reference to the drawings.

Fig. 1 is a flowchart of a method for constructing an automatic production line of a machine learning model according to an embodiment of the present invention.

As shown in fig. 1, the method for constructing an automatic production line of a machine learning model includes the following steps:

and step S1, constructing an operator assembly according to the operator assembly configuration, and storing the operator assembly in an operator warehouse.

Specifically, the operator construction mainly provides an operator development function, receives operator configuration information input by a user at the front end, forms an operator operation mirror image construction file by analyzing the operator configuration, submits the operator operation mirror image construction file to Docker Daemon for mirror image construction, and stores the mirror image information and the operator configuration together as operator structure data in an operator warehouse after the construction is completed. The operator warehouse comprises a file storage, a relational database and a mirror image warehouse, and is respectively used for storing operator codes, operator structure data and container mirror images. Meanwhile, an operator construction provides an operator test function, operator input and output configuration can be analyzed, a test template is generated, the test template is filled in and submitted to a system for operator test, and an operator operation result is obtained.

As an example, as shown in FIG. 2, the invention aims to design an automatic production line for developing machine learning models (including various AI models) efficiently by the method. In the method, a model application development process is disassembled into operator construction, operator arrangement, model task flow execution, model packing and model release.

It can be understood that the operator construction process mainly constructs an operator assembly according to the configuration of the operator assembly and stores the operator assembly in an operator warehouse, the operator assembly is an abstraction of one step of a machine learning model production line, the operator assembly can be freely combined under certain logic, the reusability of the machine learning model production line can be improved, for example, a database data reading operator and a model training operator can be used in different machine learning training scenes, and only corresponding SQL sentences or model training superparameters need to be adjusted. Meanwhile, the operator component packs the dependency environment into a container mirror image by using a container technology, and solves the problems of complicated configuration of application and script running environments, software package conflict and the like. The built operator can generate a test template according to the configuration, and the reliability of the operator is ensured by filling a test module and submitting the test module to a system test.

And step S2, visually arranging and reading operator structure data in the operator warehouse, and combining the operator components through business processing logic to generate a model task flow.

Specifically, operator arrangement mainly provides an operator visualization arrangement function, operator structure data in an operator warehouse are read and analyzed to form a front-end visualization node, and a user can connect input and output ends of operators in a dragging mode to form a model task flow. The parameters and usage resources of each operator are configurable. The model task flow may be configured with execution cycles, number of failed retries, and the like. And storing the model task flow structure data into a relational database after storing.

As an example, as shown in fig. 2, the operator arrangement process is to combine operator components into a model task stream through business processing logic, and since the operator components have explicit input/output and execution processes, the efficiency of constructing the model task stream can be improved. The model task flow comprises a complete model training process for deriving (including model data) from data input, data processing, model training and data, and is used for solidifying the process of producing the model in the model application development process.

And step S3, converting the model task flow into a cloud native workflow engine execution plan, and submitting the cloud native workflow engine execution plan to a container cluster for execution so as to output a model file.

Specifically, the model task flow mainly provides the functions of analyzing and converting the structural data of the model task flow, is used for generating a cloud native workflow execution plan, and is submitted to the container cluster to execute the model task flow. And storing a model data file generated by executing the model task flow in the object storage server.

As an example, as shown in fig. 2, in the model task flow execution process, the model task flow is first converted into a cloud native workflow engine execution plan, and then submitted to a container cluster for execution, each operator component operates as a container, and resources used by each operator operation container are specifically limited, so that the utilization efficiency of cluster resources is improved.

And step S4, performing model file conversion and model inference container mirror image construction operation based on model packaging, and storing data corresponding to the operation into a model warehouse.

Specifically, the model packaging provides a templated model building function, receives model configuration information input by a user at the front end, performs model file standardization (such as ONNX conversion) and model inference container mirror image building work by analyzing the model configuration information, and finally stores model inference codes, model data files and model container mirror images as model data into a model warehouse. The model warehouse comprises a relational database server, an object storage server and a mirror image warehouse and is used for storing model reasoning configuration data, model structure data and model reasoning container mirror image files.

As an example, as shown in fig. 2, after the execution of the model task flow is completed, a model file is output, a templated model package is performed by a model packaging module, operations such as model file conversion and packaging of a model operation dependent environment into a container mirror image are performed, and finally, a model data packet is packaged together with a model inference code and a model inference configuration and stored in a model warehouse.

And step S5, reading the model data in the model warehouse and analyzing to generate three operators, and combining the three operator modules to form a model issuing task flow to be submitted to the container cluster to execute the model issuing process.

Specifically, model publishing provides model deployment and model service opening functions. Model data in a model warehouse are read and analyzed to generate a model deployment operator by receiving model Service configuration information input by a user at the front end, a Service configuration operator and an Ingress configuration operator for model Service opening are generated at the same time, task flows of model deployment and model Service opening are automatically compiled, a cloud native workflow execution plan is generated by analyzing the task flows and submitted to a container cluster for execution, and model Service release is completed. And triggering a callback mechanism by using an exit event, and automatically cleaning the configuration of the container cluster Service and Ingress to prevent resource exhaustion.

As an example, as shown in fig. 2, the model publishing process abstracts model deployment and model Service opening into three operator components, including a model instance deployment operator, a Service configuration operator, and an Ingress configuration operator. The model packing can facilitate the model instance deployment operator to read the model data and operate the model deployment container, the Service configuration operator is used for creating the Service resource object, can provide a uniform entry address for the model applications in a group of model deployment containers, and load distribution of the request to each model application is performed, and meanwhile, the Ingress configuration operator is used for creating the Ingress resource object, so that access to specific model application services in the container cluster from the outside is realized. And combining the three operator components to form a model release task flow, converting the model release task flow into a cloud native workflow engine execution plan, and then submitting the cloud native workflow engine execution plan to a container cluster execution model release flow. The efficiency of model release can be improved by releasing the task flow through the model, and meanwhile, the configuration of the container cluster Service and the container cluster Ingress can be automatically cleaned by triggering a callback mechanism through the workflow exit event, so that the resource exhaustion of the container cluster can be prevented.

The embodiments of the present invention will be further explained with reference to the drawings.

As an example, as shown in fig. 3, in the operator construction process, an operator component is an abstraction of a production line step of the machine learning model, and is also an operation node in the task flow after instantiation. The operator component types comprise data reading operators, data processing operators, model training operators, data derivation operators, visualization operators, model deployment operators, cluster configuration operators and the like. Each operator has fixed input and output and running mirror images, and parameters and running resources can be adjusted.

The method comprises the steps that firstly, a user is required to fill in operator component configuration information at the front end when an operator is constructed, wherein the operator component configuration information comprises an operator file, operator input and output settings, operator parameter settings, an operator operation script, an operator dependence environment, a basic mirror image required by the construction of the operator and resource configuration required by the operation of the operator. Specifically, the operator file comprises an operator running script and other files required by operator running, wherein the operator running script is a running inlet of an operator, and the operator running script can be a Python script, a Shell script or other executable binary files; the operator input and output is arranged to define the data source and data output position of the operator, and the operator can have a plurality of inputs and outputs. Specifically, the operator input can be from other operators, a local file, an external database or the like, and the operator output position can be other operators or an external database or the like; the operator dependent environment and the basic mirror image are used for constructing an operator running mirror image to achieve the effect of solidifying the operator running environment; the operator parameter setting is used for defining parameters required when the operator running script is executed; the resource configuration required by the operator operation defines the lower limit of the resource used by the operator operation, and the abnormal operation of the operator due to lack of resources is prevented. And then, analyzing the configuration information of the operator assembly to execute the operation of solidifying the operator file data and operating the mirror image to construct. Specifically, the system copies the operator file to a file memory dedicated to the operator first, and the file is used for solidifying the file used by the operator to ensure the stability of the operator in operation. The file storage may be implemented using object storage or a network file system, etc. And then the system generates a Docker file according to the operator dependent environment and the basic mirror image and submits the Docker file to Docker Daemon for construction operation of the operator running mirror image, and the Docker Daemon is informed to push the operator running mirror image to the mirror image warehouse after construction is completed. And finally, writing the address of the operator file in the storage library and the operator operation mirror image information into the operator component configuration, and storing the operator component information into an operator warehouse by the system to complete operator construction.

According to the operator component configuration, the system can generate an operator test template and display the operator test template at the front end. Specifically, operator input can be input in an external database mode and a local file mode, operator output can be input in an external database mode, and operator parameters and operator operation resources can be changed at the front end. After the test template is submitted, the system generates a single-node task flow, converts the single-node task flow into a cloud native workflow execution plan, submits the cloud native workflow execution plan to a container cluster for execution, and finally obtains an operator execution log for checking the correctness and reliability of an operator.

As an example, as shown in fig. 4, in an operator arrangement and model task flow execution process, a machine learning training process may be abstracted to a model workflow formed by combining and arranging a plurality of operator components under a certain logic, where the model workflow generally includes starting with a data import operator, passing through a data processing operator, inputting into a model training operator, and finally outputting into a data export operator or a visualization operator. The purpose of quickly constructing a machine learning training production line can be achieved by arranging the combined operators. Meanwhile, the model workflow is analyzed by the model workflow module, a cloud native workflow execution plan is generated and submitted to the container cluster for execution, the container technology and the container arrangement technology can be fully utilized, and the resource utilization rate of the server is improved.

The operator arrangement sub-process is used for connecting operators with each other through certain logic to form a model task flow. Firstly, the system reads operator information of a current operator warehouse and displays operator components in an operator list on the left side of a front-end task flow canvas according to configuration information of the operator components. And the user places operators required by the construction of the model task flow into the intermediate canvas in a dragging mode. The operator is a rectangular block in the canvas, the system generates an operator component connecting end point according to the configuration of the operator, the upper end point of the operator component is used as an input end point, and the lower end point of the operator component is used as an output end point. Specifically, the corresponding end points are only shown if the input and output are selected to the operator front end in the input and output setting, and one output end point can output to a plurality of input end points, and one input end point can only be connected with one output end point. And an operator configuration panel is arranged on the right side of the canvas after the operator is selected, and comprises input setting, output setting, parameter setting and operation resource setting of the operator. The user connects the input end and the output end of each operator according to the model production line flow, and completes the construction of the model workflow after configuring the relevant parameters on the configuration panel of each operator, and meanwhile, the model workflow can configure the parameters such as the execution period, the number of failed retries and the like. After the construction is completed, the user can store the constructed model task flow, so that the subsequent change and operation are facilitated. In order to realize the functions, the system designs a set of rules which can generate JSON configuration files with uniform formats for operators of different types. The user connects the input end and the output end of each operator according to a certain sequence to construct a task flow, and the system can automatically configure the input setting and the output setting of the operators according to the edge and the node of each connecting line. When a user arranges the task flow in a dragging mode, the system reads and analyzes operator structure data in an operator warehouse, and the task flow configuration in the JSON format is dynamically generated according to the operation of the user. And when the user executes the task stream storage operation, the front end sends the task stream configuration in the JSON format to the system rear end for storage.

And the model task flow execution flow is used for analyzing the structural data of the model task flow, generating a cloud native workflow execution plan and submitting the cloud native workflow execution plan to the container cluster to execute the model task flow. When executing the model task flow, firstly, the task flow configuration in the JSON format is verified. Specifically, whether operator input and output settings are legal, whether running script parameters are legal, whether running resource configuration is in accordance with expectations and the like are checked. And then analyzing the JSON-format model task flow configuration, and converting the JSON-format model task flow configuration into a cloud native workflow execution plan, wherein the cloud native workflow execution plan comprises a Kubernetes container cluster resource object required by creating an operator operating component, a transfer operation of an operator operating container input/output file and the like. For example, a model task flow can be converted into a Workflow object of a Yaml-format cloud-native Workflow engine Argo Workflow, each operator is designed as a Template object, an Input Artifact and an Output Artifact are generated according to operator Input and Output configuration, an image parameter of a Container is set according to an operator running image, a command parameter and an args parameter of the Container are set according to an operator running script and parameter configuration, an env parameter of the Container is set according to an operator dependent environment, a resources parameter of the Container is configured according to an operator running resource, an Input Artifact is generated according to an address of an operator file in a repository, and the Input Artifact is used for placing the operator file into a work directory of the Container. Workflow sets a Main Template as an entry, the execution sequence between operators is converted into the configuration in Template Dag after being analyzed, and the Step in each Dag corresponds to the Template of one operator. And after the construction is finished, submitting the Workflow object to a cloud native Workflow engine Argo Workflow for execution, generating a cloud native Workflow execution plan by the cloud native Workflow engine Argo Workflow and submitting the cloud native Workflow execution plan to a container cluster Kubernets, and executing the model Workflow by the container cluster to obtain an operation result. For the step of parsing the JSON-formatted model task flow configuration and converting into the cloud native Workflow execution plan, a cloud native Workflow engine or a cloud native Workflow generation tool other than Argo Workflow may be used, which is only an example here. After the operation is finished, the system acquires the operation log information of each node of the model workflow from the container cluster, and simultaneously, the model file generated by the model workflow can be stored in an external database for the use of the model packing process.

As an example, as shown in FIG. 5, after training of the machine learning model is completed, deployment is required to provide the model application service, and the model application service production process is divided into two sub-processes of model packaging and model publishing.

The model packing sub-process provides the function of adapting various mainstream model reasoning frames to the models generated by various machine learning frames (including the deep learning frame), and packages and packs the model files, the model dependent environment and the model reasoning codes into a model data packet for the model issuing environment. In the model packing sub-flow, first, model types including, but not limited to, PyTorch model, TensorFlow model, Caffe model, XGboost model, Scikit-left model, etc. need to be selected. Next, available model inference operators are provided according to the correspondence rules, the model inference operators including templated inference codes and corresponding underlying running images. Including but not limited to, the pytorrech model may use a Torchserve model deployment operator, a TensorRT model deployment operator, a flash model deployment operator, the XGBoost model, the Scikit-left model may use a corresponding flash model deployment operator, etc., as shown in fig. 5. After the model type and the model reasoning algorithm type are determined, data required by a subsequent model data packet is provided according to a certain strategy. Specifically, the model data generally needs a model file, a model inference code, a model dependent environment and a model inference configuration, the model file is a file for describing a model structure and model parameters, the model inference code is used for describing model pre-inference processing and post-processing codes, the model dependent environment comprises operating environment configurations or software packages used for pre-processing and post-processing, and the model inference configuration comprises a model instance minimum operating resource quantity, an inference framework hyper-parameter and the like. For example, when the torchserver model deployment operator is used to deploy the PyTorch model, the serialization file of the PyTorch model, the Handler and the name of the software package required by the Handler to run need to be provided, and the running resources of the model instance need to be configured. Then, model conversion and model operation mirror image construction work are carried out. For model transformation work, for example, inference deployment of the PyTorch model using the TensorRT requires a model that is first transformed into the ONNX format. Specific operation images can be generated according to the model dependent environment aiming at the construction work of the model operation images. And finally, packaging the data packet, the file address after model conversion and the model instance operation mirror image address into model data and storing the model data into a model warehouse.

The model release process is designed into a model deployment production line and is formed by combining a model deployment operator, a Service configuration operator and an Ingress configuration operator. Firstly, selecting a model to be deployed from a model warehouse, setting the number of model instances and the operation resource quantity (not lower than the minimum operation resource quantity) of the model instances at the same time, and then constructing a cloud-native workflow execution plan used by a model deployment production line. Specifically, a first node of the cloud native workflow execution plan configures a node for an Ingress object, and the node creates the Ingress object for routing a request to a model Service object. The second node configures nodes for the Service objects, and the nodes can create one Service object for balancing the request traffic load to each model deployment node. The third node is a model deployment node, the number of the nodes is consistent with the number of configured model instances, the configuration of the nodes is generated by analyzing model data, the operation container is generated by using a model operation mirror image, and the model file and the model inference code file are bound and the use of container resources is limited according to the operation resource configuration. The fourth node is a Service object cleaning node. And the fifth node is a Service object cleaning node. And finally, submitting the cloud native workflow execution plan to a container cluster for execution, wherein the container cluster deploys the model and develops the model service to complete the model release process. The workflow execution runtime runs the first three nodes in sequence and waits for an end signal at the third node, at which point the model instance can provide model reasoning services. And triggering an exit event when the workflow is finished, and operating the fourth node and the fifth node by utilizing a callback mechanism to clear the Service object and the Ingress object, recovering the cluster resources and ensuring that the cluster resources are not exhausted.

Next, an automatic production line construction system of a machine learning model proposed according to an embodiment of the present invention is described with reference to the drawings.

As shown in fig. 6, the system 10 includes: the system comprises an operator building module 100, an operator arranging module 200, a model task flow module 300, a model packing module 400 and a model publishing module 500.

The operator building module 100 is used for building an operator assembly according to the configuration of the operator assembly and storing the operator assembly into an operator warehouse;

the operator arrangement module 200 is used for visually arranging and reading operator structure data in an operator warehouse and combining operator components through business processing logic to generate a model task flow;

the model task flow module 300 is configured to convert a model task flow into a cloud-native workflow engine execution plan, and submit the cloud-native workflow engine execution plan to the container cluster for execution so as to output a model file;

the model packing module 400 is used for carrying out model file conversion and model inference container mirror image construction operation based on model packing and storing data corresponding to the operation into a model warehouse;

and the model issuing module 500 is configured to read model data in the model warehouse and analyze the model data to generate three operators, and combine the three operator modules to form a model issuing task stream to be submitted to the container cluster to execute the model issuing process.

According to the machine learning model automatic production line construction system, the operator construction module is used for constructing the operator assembly according to the configuration of the operator assembly and storing the operator assembly into the operator warehouse; the operator arrangement module is used for visually arranging and reading operator structure data in the operator warehouse and combining the operator components through business processing logic to generate a model task flow; the model task flow module is used for converting the model task flow into a cloud native workflow engine execution plan and submitting the cloud native workflow engine execution plan to the container cluster for execution so as to output a model file; the model packing module is used for carrying out model file conversion and model inference container mirror image construction operation based on model packing and storing data corresponding to the operation into a model warehouse; and the model issuing module is used for reading the model data in the model warehouse and analyzing the model data to generate three operators, and combining the three operator modules to form a model issuing task flow to be submitted to the container cluster to execute the model issuing process. The five construction processes are mutually independent and closely connected, so that the construction efficiency of the model production line is improved, and meanwhile, the constructed model production line can quickly train a new model, shorten the online process of the model and improve the production capacity of the model.

It should be noted that the foregoing explanation of the embodiment of the method for calculating multiple indexes and predicting a trend of a technical subject is also applicable to the device for calculating multiple indexes and predicting a trend of a technical subject of the embodiment, and is not repeated herein.

Furthermore, the terms "first", "second" and "first" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.

In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.

Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims

1. A method for constructing an automatic production line of a machine learning model is characterized by comprising the following steps:

2. The method for constructing an automatic production line of machine learning models according to claim 1, wherein the constructing operator components according to operator component configuration and storing the operator components in an operator warehouse comprises:

copying an operator file into a file memory special for an operator, solidifying the file used by the operator in operation, generating a Docker file according to an operator dependent environment and a basic mirror image, submitting the Docker file to Docker Daemon for constructing an operator operation mirror image, informing the Docker Daemon after the construction is finished to push the operator operation mirror image to a mirror image warehouse, writing the address of the operator file in the storage library and the operator operation mirror image information into operator component configuration, storing the operator component information into the operator warehouse to complete the operator construction, generating an operator test template according to the operator component configuration, displaying the operator test template at the front end, submitting the operator test template to generate a single-node task flow, converting the single-node task flow into a cloud primary workflow execution plan, submitting the single-node task flow to a container cluster for execution, and obtaining an operator execution log; the operator warehouse comprises a file storage, a relational database and a mirror image warehouse, and is respectively used for storing operator codes, operator structure data and container mirror images.

3. The method for constructing an automatic production line of machine learning models according to claim 2, wherein the visualization arrangement reads operator structure data in the operator warehouse, and combines the operator components through business processing logic to generate a model task stream, and comprises:

the method comprises the steps of reading operator information of a current operator warehouse, displaying operator components in an operator list on the left side of a front-end task flow canvas according to configuration information of the operator components, placing operators needed for building a model task flow in a middle canvas, generating operator component connection end points according to the configuration of the operators, taking the upper end points of the operator components as input end points, taking the lower end points as output end points, selecting operators, arranging an operator configuration panel on the right side of the canvas, connecting the input end and the output end of each operator according to a model production line flow, configuring relevant parameters on the configuration panel of each operator, completing building of the model workflow, and storing the built model task flow after the building is completed.

4. The machine learning model automation line building method of claim 3, the method further comprising: the method comprises the steps that JSON configuration files with uniform formats are generated for operators of different types according to specific rules, a user connects the input end and the output end of each operator according to a specific sequence to construct a task stream, the input setting and the output setting of the operators are automatically configured according to the edge and the node of each connecting line, operator structure data in an operator warehouse are read and analyzed when the task stream is arranged, task stream configurations with the JSON formats are dynamically generated according to operation, and when the task stream storage operation is executed, the task stream configurations with the JSON formats are sent to the rear end by the front end to be stored.

5. The method for constructing the machine learning model automatic production line according to claim 4, wherein the converting the model task flow into the cloud native workflow engine execution plan and submitting the cloud native workflow engine execution plan to the container cluster for execution so as to output the model file comprises:

analyzing and converting the model task flow structure data to generate a cloud native workflow execution plan, submitting the cloud native workflow execution plan to a container cluster to execute the model task flow, and storing a model data file generated by executing the model task flow in an object storage server: the method comprises the following steps: when the model task flow is executed, verifying the JSON-format task flow configuration, analyzing the JSON-format task flow configuration after the verification is completed, converting the JSON-format task flow configuration into a cloud native workflow execution plan, and acquiring the operation log information of each node of the model workflow from the container cluster after the operation is completed; wherein the cloud native workflow execution plan comprises: and creating container cluster resource objects required by the operation of the operator components and the transfer operation of the operator operation container input and output files.

6. The method for constructing an automatic production line of a machine learning model according to claim 5, wherein the operations of converting the model file and constructing the mirror image of the model inference container based on model packaging and storing the data corresponding to the operations into a model warehouse comprise:

receiving model configuration information input by a user at the front end, performing templated model packaging through a model packaging process, analyzing the model configuration information to perform model file standardization and model inference container mirror image construction work, and storing a model inference code, a data file and a container mirror image into a model warehouse as model data, wherein the model warehouse is used for storing the model inference configuration data, the model structure data and the model inference container mirror image file; wherein the model repository comprises the relational database, an object storage server, and a mirror repository;

in the model packing flow, selecting a model type, providing a model inference operator according to a corresponding rule, after determining the model type and the model inference operator type, providing specific data for a subsequent model data packet according to a specific strategy, packing the specific data into the model data and storing the model data in a model warehouse; the specific data comprises a data packet, a file address after model conversion and a model instance running mirror image address.

7. The method for constructing the machine learning model automatic production line according to claim 6, wherein the reading of the model data in the model warehouse and the parsing generate three operators, and the combining of the three operator modules forms a model publishing task stream to be submitted to the container cluster to execute the model publishing process comprises:

receiving model Service configuration information input by a user at the front end, reading model data in the model warehouse, analyzing and generating a model deployment operator, simultaneously generating a Service configuration operator and an Ingress configuration operator for model Service opening, automatically compiling task flows for model deployment and model Service opening, analyzing the task flows to generate a cloud native workflow execution plan, submitting the cloud native workflow execution plan to a container cluster for execution, and completing model Service release.

8. The method for constructing an automatic production line of machine learning models according to claim 2, wherein the operator component types include: a plurality of data reading operators, data processing operators, model training operators, data derivation operators, visualization operators, model deployment operators and cluster configuration operators; operator component configuration information, including: operator files, operator input and output settings, operator parameter settings, operator operation scripts, operator dependence environments, basic mirror images required by operator construction and resource configuration required by operator operation; the operator file comprises an operator running script and other files required by operator running, wherein the operator running script is an operator running inlet and is an executable binary file; the operator input and output settings are used for defining the data source and the data output position of an operator; the operator parameter setting is used for defining parameters required when the operator runs the script.

9. The method as claimed in claim 4, wherein the method for constructing an automatic production line of machine learning model is characterized in that the method reads model data in the model warehouse and analyzes the model data to generate three operators, combines the three operators to form a model release task stream to be submitted to the container cluster to execute the model release process, and further comprises:

the cloud native workflow execution plan comprises a first node, a third node, a model deployment node, a fourth node, a Service object cleaning node, a fifth node, a third node and a fourth node, wherein the first node configures nodes for an Ingress object, the Ingress object is created, a request is routed to a model Service object, the second node configures nodes for the Service object, the Service object is created, the request flow load is balanced to each model deployment node, the third node is the model deployment node, the node configuration is generated by analyzing model data, a running container is generated by using a model running mirror image, a model file and a model inference code file are bound, the use of container resources is limited according to the running resource configuration, the fourth node is the Service object cleaning node, the fifth node is the Service object cleaning node, the cloud native workflow execution plan is submitted to a container cluster for execution, the container cluster deploys a model and develops a model Service, a model release process is completed, the former three nodes are sequentially operated during the workflow execution, and an end signal is waited at the third node, and triggering an exit event when the workflow is finished, operating the fourth node and the fifth node by using a callback mechanism, and clearing the Service object and the Ingress object.

10. The utility model provides a machine learning model automation line founds system which characterized in that includes: