WO2023071075A1 - Machine learning model automated production line construction method and system - Google Patents
Machine learning model automated production line construction method and system
- Publication number
- WO2023071075A1 (PCT/CN2022/087218)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- model
- operator
- task flow
- data
- warehouse
- Prior art date
Classifications
- G06F8/36—Software reuse
- G06F8/63—Image based installation; Cloning; Build to order
- G06F8/76—Adapting program code to run in a different environment; Porting
- G06N20/00—Machine learning
- Y02P90/30—Computing systems specially adapted for manufacturing
Definitions
- the present disclosure relates to the technical field of artificial intelligence, in particular to a method and system for constructing an automatic production line of a machine learning model.
- Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and its application pervades all fields of artificial intelligence. In a practical sense, machine learning is a method of using data to train a model and then using the model to make predictions.
- Training machine learning models is not a once-and-for-all task.
- faced with ever-growing industry data and constantly changing industry standards, a machine learning model production line is needed to keep training updated models.
- the machine learning model production line can solidify the steps of model training and model deployment to achieve the purpose of training new models and deploying the models online.
- the traditional model production line construction method is a purely manual method. By writing multiple scripts to process the raw data, the training data set of the model is obtained, and then the model training code is written to train the model. Finally, the model inference script needs to be written to deploy the online model.
- the traditional model production line construction method requires manually configuring the dependent environment, manually running scripts and collecting the results, and manually deploying the model and maintaining the model service. This makes the model development cycle long, leaves the steps of the model production line too tightly coupled to upgrade or modify, and results in poor reusability.
- the way of manually configuring the environment will also bring problems such as environment dependency conflicts.
- the traditional model production line construction method is difficult to adapt to the rapid model iteration requirements brought about by industry changes.
- the present disclosure aims to solve one of the technical problems in the related art at least to a certain extent.
- the embodiments of the present disclosure aim at the above problems and propose a method and system for constructing an automatic production line of a machine learning model.
- This disclosure divides the machine learning model production line construction process into operator component development, operator arrangement, model task flow execution, model packaging, and model release.
- container technology is used to solidify the model production line steps into operator components to solve the problems of stand-alone environment dependence and environment conflicts.
- multiple operator components are combined to form a model task flow through operator orchestration. Operators in the model task flow can be combined and replaced arbitrarily, improving the reusability of the steps of the model production line.
- the model task flow is converted into a cloud-native workflow execution plan through the cloud-native workflow engine, and submitted to the container cluster for execution to obtain the model file.
- the model is packaged and stored in the model warehouse through model packaging, and finally the model is published as a model application to provide model services to the outside world.
- these five construction processes are mutually independent yet closely connected, which improves the construction efficiency of the model production line.
- the constructed model production line can quickly train new models, shorten the process of model launch, and improve model production capacity.
- the first purpose of the present disclosure is to propose a method for building an automated production line of a machine learning model, including:
- constructing an operator component according to the operator component configuration, and storing the operator component in the operator warehouse;
- visual orchestration reads the operator structure data in the operator warehouse, and combines the operator components through business processing logic to generate a model task flow;
- the model task flow is converted into a cloud-native workflow engine execution plan, and submitted to the container cluster for execution to output a model file;
- based on model packaging, model file conversion and model inference container image construction operations are performed, and the data corresponding to the operations are stored in the model warehouse;
- the model data in the model warehouse is read and parsed to generate three types of operators, and the three types of operator components are combined to form a model release task flow to be submitted to the container cluster to execute the model release process.
- the machine learning model automated production line construction method of the embodiments of the present disclosure constructs operator components according to the operator component configuration and stores them in the operator warehouse; visual orchestration reads the operator structure data in the operator warehouse and combines the operator components through business processing logic to generate a model task flow; the model task flow is converted into a cloud-native workflow engine execution plan and submitted to the container cluster for execution to output a model file; based on model packaging, model file conversion and model inference container image construction operations are performed, and the corresponding data are stored in the model warehouse; and the model data in the model warehouse are read and parsed to generate three types of operators, which are combined to form a model release task flow submitted to the container cluster to execute the model release process.
- the disclosure improves the construction efficiency of the model production line through five mutually independent yet closely connected construction processes. At the same time, the constructed model production line can quickly train new models, shorten the model launch process, and improve model production capacity.
- constructing the operator component according to the operator component configuration and storing it in the operator warehouse includes: copying the operator files to operator-specific file storage to solidify the files used by the operator; generating a Dockerfile based on the operator's dependent environment and base image and submitting it to the Docker Daemon to build the operator running image; and, after the build is complete, notifying the Docker Daemon to push the operator image to the image registry.
- the address of the operator files in the repository and the operator running image information are written into the operator component configuration, and the operator component information is stored in the operator warehouse to complete the operator construction.
- according to the operator component configuration, an operator test template is generated and displayed on the front end; submitting the operator test template generates a single-node task flow, which is converted into a cloud-native workflow execution plan and submitted to the container cluster for execution to obtain the operator execution log;
- the operator warehouse includes file storage, a relational database, and an image registry, which store operator code, operator structure data, and container image files, respectively.
- the visual orchestration reads the operator structure data in the operator warehouse and combines the operator components through business processing logic to generate a model task flow, including: reading the operator information of the current operator warehouse and displaying the operator components in the operator list on the left side of the front-end task flow canvas according to their configuration information; placing the operators required to build the model task flow on the middle canvas; and generating the connection endpoints of each operator component according to its configuration.
- the upper endpoints of an operator component serve as input endpoints, and the lower endpoints serve as output endpoints.
- the right side of the canvas is the operator configuration panel. The input and output ends of the operators are connected, and the relevant parameters are configured in each operator's configuration panel to complete the construction of the model workflow, which is saved after construction is completed.
- the method further includes: generating JSON configuration files in a unified format for different types of operators according to specific rules; the user connects the input and output ends of each operator in a specific order to construct a task flow, and the input and output settings of the operators are configured automatically according to the edges and nodes of each connection; during task flow orchestration, the operator structure data in the operator warehouse are read and parsed, and the task flow configuration in JSON format is generated dynamically according to the user's operations; when the save operation is executed, the front end sends the task flow configuration in JSON format to the back end for saving.
- converting the model task flow into a cloud-native workflow engine execution plan and submitting it to the container cluster for execution to output the model file includes: parsing and converting the model task flow structure data to generate a cloud-native workflow execution plan, and submitting it to the container cluster to execute the model task flow, with the model data files produced by execution stored on the object storage server. When executing the model task flow, the task flow configuration in JSON format is first verified; after verification, the model task flow configuration in JSON format is parsed and converted into a cloud-native workflow execution plan.
- the cloud-native workflow execution plan includes: creating the container cluster resource objects required to run the operator components, and transfer operations for the input and output files of the operator running containers.
- performing the model file conversion and model inference container image construction operations based on model packaging, and storing the corresponding data in the model warehouse, includes: receiving the model configuration information entered by the user on the front end, performing templated model packaging through the model packaging process, parsing the model configuration information to standardize the model files and build the model inference container image, and storing the model inference code, data files, and container image in the model warehouse as model data.
- the model warehouse is used to store model inference configuration data, model structure data, and model inference container image files; the model warehouse includes the relational database, an object storage server, and an image registry. In the model packaging process, a model type is selected and model inference operators are provided according to the corresponding rules; after the model type and model inference operator type are determined, specific data are provided for the subsequent model data package according to a specific strategy and packaged into the model data stored in the model warehouse, where the specific data include a data package, the address of the converted model file, and the address of the model instance running image.
- reading the model data in the model warehouse, parsing them to generate three types of operators, and combining the three types of operator components to form a model release task flow submitted to the container cluster to execute the model release process includes: receiving the model service configuration information entered by the user on the front end; reading the model data in the model warehouse and parsing them to generate a model deployment operator, while generating a Service configuration operator and an Ingress configuration operator for model service exposure; automatically orchestrating these into a task flow for model deployment and model service exposure; and parsing the task flow to generate a cloud-native workflow execution plan submitted to the container cluster for execution, completing the model service release.
- operator component types include several of: data reading operators, data processing operators, model training operators, data export operators, visualization operators, model deployment operators, and cluster configuration operators. Operator component configuration information includes several of: operator files, operator input and output settings, operator parameter settings, the operator running script, the operator dependent environment, the base image required to build the operator, and the resource configuration required for the operator to run. The operator files include the operator running script and the other files required for the operator to run; the operator running script is the operator's execution entry point and is an executable binary file; the operator input and output settings define the operator's data sources and data output locations; and the operator parameter settings define the parameters required when the operator running script executes.
- reading the model data in the model warehouse, parsing them to generate three types of operators, and combining the three types of operator components to form a model release task flow submitted to the container cluster to execute the model release process further includes: the first node of the cloud-native workflow execution plan is the Ingress object configuration node, which creates an Ingress object that routes requests to the model service's Service object; the second node is the Service object configuration node, which creates a Service object that load-balances request traffic across the model deployment nodes.
- the third node is the model deployment node.
- the configuration of this node is generated by parsing the model data.
- the running container is generated from the model running image, the model file and model inference code file are bound, and container resource usage is limited according to the running resource configuration.
- the fourth node is the Service object cleanup node.
- the fifth node is the Ingress object cleanup node.
- the cloud-native workflow execution plan is submitted to the container cluster for execution.
- the container cluster deploys the model and exposes the model service to complete the model release process. When the workflow is executed, the first three nodes run in sequence and the third node waits for an end signal; when the workflow ends, an exit event is triggered, and the callback mechanism runs the fourth and fifth nodes to clear the Service object and the Ingress object.
- the embodiment of the second aspect of the present disclosure proposes a machine learning model automatic production line construction system, including:
- an operator construction module, configured to construct operator components according to the operator component configuration, and store the operator components in the operator warehouse;
- the operator orchestration module is used to read the operator structure data in the operator warehouse through visual orchestration, and combine the operator components through business processing logic to generate a model task flow;
- the model task flow module is used to convert the model task flow into a cloud-native workflow engine execution plan, and submit it to the container cluster for execution to output the model file;
- the model packaging module is used to perform, based on model packaging, the model file conversion and model inference container image construction operations, and to store the corresponding data in the model warehouse;
- the model release module is used to read the model data in the model warehouse, parse them to generate three types of operators, and combine the three types of operator components to form a model release task flow submitted to the container cluster to execute the model release process.
- in the machine learning model automated production line construction system of the embodiments of the present disclosure, the operator construction module constructs operator components according to the operator component configuration and stores them in the operator warehouse; the operator orchestration module reads the operator structure data in the operator warehouse through visual orchestration and combines the operator components through business processing logic to generate a model task flow; the model task flow module converts the model task flow into a cloud-native workflow engine execution plan and submits it to the container cluster for execution to output a model file; the model packaging module performs model file conversion and model inference container image construction operations based on model packaging and stores the corresponding data in the model warehouse; and the model release module reads the model data in the model warehouse, parses them to generate three types of operators, and combines the three types of operator components into a model release task flow submitted to the container cluster to execute the model release process.
- the disclosure improves the construction efficiency of the model production line through five mutually independent yet closely connected construction processes. At the same time, the constructed model production line can quickly train new models, shorten the model launch process, and improve model production capacity.
- the embodiment of the third aspect of the present disclosure provides a non-transitory computer-readable storage medium, wherein the non-transitory computer-readable storage medium stores computer-executable instructions; after the computer-executable instructions are executed by a processor, the method described in the first aspect of the present disclosure can be implemented.
- the embodiment of the fourth aspect of the present disclosure provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the program, the method described in the first aspect of the present disclosure is implemented.
- the embodiment of the fifth aspect of the present disclosure proposes a computer program product, wherein the computer program product includes computer program code; when the computer program code is run on a computer, the method described in the first aspect of the present disclosure is implemented.
- the embodiment of the sixth aspect of the present disclosure proposes a computer program, wherein the computer program includes computer program code; when the computer program code is run on a computer, the computer is caused to execute the method described in the first aspect of the present disclosure.
- FIG. 1 is a flowchart of a method for constructing a machine learning model automated production line provided by an embodiment of the present disclosure
- FIG. 2 is a schematic diagram of building an automated production line of a machine learning model provided by an embodiment of the present disclosure
- FIG. 3 is a schematic diagram of an operator construction process provided by an embodiment of the present disclosure.
- FIG. 4 is a schematic diagram of operator arrangement and model task flow execution flow provided by an embodiment of the present disclosure
- FIG. 5 is a schematic diagram of the model packaging and model publishing process provided by the embodiment of the present disclosure.
- FIG. 6 is a schematic structural diagram of a machine learning model automatic production line construction system provided by an embodiment of the present disclosure.
- FIG. 1 is a flowchart of a method for building an automated production line of a machine learning model according to an embodiment of the present disclosure.
- the machine learning model automatic production line construction method includes the following steps:
- Step S1: construct the operator component according to the operator component configuration, and store the operator component in the operator warehouse.
- operator construction mainly provides operator development functions: it receives the operator configuration information entered by the user on the front end, generates an operator running image build file by parsing the operator configuration, and submits it to the Docker Daemon for image construction. After construction is completed, the image information and the operator configuration are stored together in the operator warehouse as operator structure data.
- operator warehouses include file storage, relational databases, and image registries, which are used to store operator code, operator structure data, and container image files, respectively.
- operator construction also provides an operator testing function: it parses the operator input and output configuration to generate a test template; the user fills in the test template and submits it to the system for operator testing to obtain the operator running results.
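- as an illustrative sketch of this testing function, the Python code below derives a test template from an operator's input and output configuration; every field name (name, inputs, outputs, params, resources) is hypothetical, since the patent does not publish the configuration schema.

```python
# Hypothetical sketch: derive an editable test template from an operator's
# configuration. In a single-node test run there are no upstream operators,
# so operator-type inputs are swapped for external sources.

def build_test_template(operator_config: dict) -> dict:
    template = {
        "operator": operator_config["name"],
        "inputs": [],
        "outputs": [],
        "params": dict(operator_config.get("params", {})),        # editable on the front end
        "resources": dict(operator_config.get("resources", {})),  # editable on the front end
    }
    for inp in operator_config.get("inputs", []):
        template["inputs"].append({
            "name": inp["name"],
            # inputs that normally come from other operators fall back to an
            # external database or local file for the single-node test
            "source": "external_database" if inp.get("source") == "operator"
                      else inp.get("source", "local_file"),
            "uri": "",  # filled in by the tester before submission
        })
    for out in operator_config.get("outputs", []):
        template["outputs"].append({
            "name": out["name"],
            "target": "external_database",  # test output goes to an external DB
            "uri": "",
        })
    return template
```

- submitting such a filled-in template would then produce the single-node task flow described later in this section.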
- the purpose of this disclosure is to efficiently design an automated production line for the development of machine learning models (including various AI models) through this method.
- the model application development process is disassembled into operator construction, operator arrangement, model task flow execution, model packaging, and model release.
- the operator construction process mainly builds operator components based on the operator component configuration and stores them in the operator warehouse.
- the operator component is an abstraction of one of the steps in the machine learning model production line.
- operator components can be freely combined under certain logic, which improves the reusability of the machine learning model production line.
- for example, the database data reading operator and the model training operator can both be used in different machine learning training scenarios; only the corresponding SQL statement or the model training hyperparameters need to be adjusted.
- the operator component uses container technology to package the dependent environment into a container image to solve problems such as cumbersome configuration of the application and script running environment and software package conflicts.
- a completed operator can generate a test template according to its configuration, and the reliability of the operator is ensured by filling in the test template and submitting it to the system for testing.
- Step S2: visual orchestration reads the operator structure data in the operator warehouse, and combines the operator components through business processing logic to generate a model task flow.
- operator orchestration mainly provides a visual operator orchestration function: the operator structure data in the operator warehouse are read and parsed into visual nodes on the front end.
- the user can drag and drop to connect the input and output ends of the operators to form a model task flow.
- the parameters and resource usage of each operator are configurable.
- the model task flow can configure the execution cycle, the number of failed retries, and so on. After the model task flow structure data is saved, it is stored in a relational database.
- the operator orchestration process combines the operator components through business processing logic to form a model task flow. Since operator components have well-defined inputs, outputs, and execution processes, this improves the construction efficiency of model task flows.
- the model task flow includes a complete model training process from data input, data processing, model training, and data export (including model data), which is used to solidify the process of producing models in the model application development process.
- Step S3: convert the model task flow into a cloud-native workflow engine execution plan, and submit it to the container cluster for execution to output the model file.
- the model task flow mainly provides the analysis and conversion functions of the model task flow structure data, which is used to generate the cloud-native workflow execution plan, and submit it to the container cluster to execute the model task flow.
- the model data files generated by the execution of the model task flow are stored in the object storage server.
- the model task flow is first converted into a cloud-native workflow engine execution plan, and then submitted to the container cluster for execution.
- Each operator component runs as a container.
- the resources used by each operator's running container are explicitly limited, improving the utilization efficiency of cluster resources.
- Step S4: based on model packaging, perform model file conversion and model inference container image construction operations, and store the corresponding data in the model warehouse.
- model packaging provides a templated model-building function: it receives the model configuration information entered by the user on the front end, parses it to standardize the model files (e.g., ONNX conversion) and build the model inference container image, and finally stores the model inference code, model data files, and model container image in the model warehouse as model data.
- the model warehouse includes a relational database server, an object storage server, and an image registry, which are used to store model inference configuration data, model structure data, and model inference container image files.
- after the model task flow finishes executing, the model file is output; the model packaging module performs templated model packaging, converting the model file and packaging the model's runtime dependent environment into a container image, and finally packages these together with the model inference code and model inference configuration into a model data package stored in the model warehouse.
- Step S5: read the model data in the model warehouse, parse them to generate three types of operators, and combine the three types of operator components to form a model release task flow submitted to the container cluster to execute the model release process.
- model release provides model deployment and model service exposure functions.
- by receiving the model service configuration information entered by the user on the front end, the system reads the model data in the model warehouse, parses them to generate a model deployment operator, and at the same time generates a Service configuration operator and an Ingress configuration operator for model service exposure; these are automatically orchestrated into a task flow for model deployment and model service exposure, which is parsed to generate a cloud-native workflow execution plan submitted to the container cluster for execution, completing the release of the model service.
- Use the exit event to trigger the callback mechanism to automatically clean up the container cluster Service and Ingress configurations to prevent resource exhaustion.
- the model release process abstracts model deployment and model service exposure into three operator components: the model instance deployment operator, the Service configuration operator, and the Ingress configuration operator.
- model packaging makes it convenient for the model instance deployment operator to read the model data and run the model deployment container.
- the Service configuration operator is used to create a Service resource object, which provides a unified entry address for the model applications in a group of model deployment containers and load-balances requests across the model applications; the Ingress configuration operator is used to create an Ingress resource object, enabling external access to a specific model application service in the container cluster.
- the three operator components are combined to form a model release task flow, which is converted into a cloud-native workflow engine execution plan and then submitted to the container cluster to execute the model release process.
- the efficiency of model release can be improved through the model release task flow.
- the workflow exit event triggers the callback mechanism to automatically clean up the container cluster Service and Ingress configuration, which can prevent the container cluster from running out of resources.
- in the operator construction process, the operator component is an abstraction of a step of the machine learning model production line, and after instantiation it is also a running node in the task flow.
- Operator component types include data reading operators, data processing operators, model training operators, data export operators, visualization operators, model deployment operators, and cluster configuration operators. Each operator has fixed input and output and running images, and parameters and running resources can be adjusted.
- the user first needs to fill in the operator component configuration information on the front end, including the operator files, operator input and output settings, operator parameter settings, the operator running script, the operator dependent environment, the base image required to build the operator, and the resource configuration required for the operator to run.
- the operator file includes the operator running script and other files required for the operator to run.
- the operator running script is the entry point for the operator to run.
- the operator running script can be a Python script, a Shell script, or other executable binary files.
- operator input and output settings are used to define the data source and data output location of the operator, and an operator can have multiple inputs and outputs.
- operator input can come from other operators, local files, or external databases, etc.
- operator output locations can be other operators or external databases, etc.
- the operator dependent environment and the base image are used to build the operator running image, solidifying the operator's runtime environment; the resource configuration required for the operator to run defines the lower bound of resources used at runtime, preventing the operator from running abnormally due to insufficient resources.
- the system first copies the operator files to the operator-specific file storage, which is used to solidify the files used by the operator to ensure the stability of the operator operation.
- File storage can be implemented using object storage or network file systems.
- the system generates a Dockerfile based on the operator's dependent environment and the basic image and submits it to the Docker Daemon for the construction of the operator's running image.
- after the build is complete, the Docker Daemon is notified to push the operator's running image to the image registry.
- the address of the operator file in the repository and the operator running image information are written into the operator component configuration, and the system stores the operator component information in the operator warehouse to complete the operator construction.
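- the following Python sketch illustrates how this image build step might be assembled, assuming the dependent environment is expressed as a base image plus pip packages and environment variables; the configuration format and the registry address are assumptions, not taken from the patent.

```python
# Minimal sketch of operator running-image construction: render a Dockerfile
# from the operator's dependent environment, then hand it to the Docker daemon.

def render_dockerfile(base_image: str, pip_packages: list[str],
                      env_vars: dict[str, str]) -> str:
    lines = [f"FROM {base_image}"]
    for key, value in env_vars.items():
        lines.append(f"ENV {key}={value}")
    if pip_packages:
        lines.append("RUN pip install --no-cache-dir " + " ".join(pip_packages))
    # Operator files are injected at run time (e.g. as an input artifact),
    # so the image itself only solidifies the dependent environment.
    lines.append("WORKDIR /workspace")
    return "\n".join(lines) + "\n"

dockerfile = render_dockerfile(
    base_image="python:3.9-slim",
    pip_packages=["pandas==1.5.3", "scikit-learn==1.2.2"],
    env_vars={"PYTHONUNBUFFERED": "1"},
)
print(dockerfile)
# The rendered Dockerfile would then be built and pushed, e.g.:
#   docker build -t registry.example.com/operators/data-cleaner:v1 .
#   docker push registry.example.com/operators/data-cleaner:v1
```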
- the system can generate an operator test template and display it on the front end. Specifically, for operator input, you can use external databases and local files for input, and for operator output, you can use external databases. Operator parameters and operator running resources can be changed on the front end. After submitting the test template, the system will generate a single-node task flow, convert it into a cloud-native workflow execution plan and submit it to the container cluster for execution, and finally obtain the operator execution log to check the correctness and reliability of the operator.
- the machine learning training process can be abstracted into a model workflow composed of multiple operator components combined and orchestrated under certain logic.
- a model workflow generally starts from a data import operator, passes through data processing operators, feeds into a model training operator, and finally outputs to a data export operator or a visualization operator.
- the goal of quickly building a machine learning training production line can be achieved by orchestrating and combining operators.
- the model workflow is analyzed by the model workflow module to generate a cloud-native workflow execution plan and submit it to the container cluster for execution, which can make full use of container technology and container orchestration technology to improve server resource utilization.
- the operator orchestration sub-process is used to connect operators to each other through certain logic to form a model task flow.
- the system will read the operator information in the current operator warehouse, and display the operator components in the operator list on the left side of the front-end task flow canvas according to the configuration information of the operator components.
- the user places the operators needed to build the model task flow on the middle canvas by dragging and dropping.
- An operator is a rectangular block in the canvas.
- the system generates the connection endpoints of the operator component according to the configuration of the operator. The upper endpoint of the operator component is used as the input endpoint, and the lower endpoint is used as the output endpoint.
- a corresponding endpoint is displayed only when operator input or output is selected in the input and output settings; an output endpoint can feed multiple input endpoints, while an input endpoint can be connected to only one output endpoint.
- the right side of the canvas is the operator configuration panel, including the operator's input settings, output settings, parameter settings, and running resource settings.
- the user connects the input and output of each operator according to the model production line process, and configures the relevant parameters in the configuration panel of each operator to complete the construction of the model workflow.
- parameters of the model task flow itself, such as the execution cycle and the number of failed retries, can also be configured.
- the user can save the built model task flow for subsequent modification and operation.
- this system has designed a set of rules, which can generate JSON configuration files in a unified format for different types of operators.
- the user connects the input and output of each operator in a certain order to build a task flow, and the system automatically configures the input and output settings of the operator according to the edges and nodes of each connection.
- the system will read and parse the operator structure data in the operator warehouse, and dynamically generate the task flow configuration in JSON format according to the user's operation.
- the front end sends the task flow configuration in JSON format to the system backend for saving.
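- as a hedged illustration of such a unified configuration, the snippet below constructs a plausible task flow JSON with nodes and edges; every field name (nodes, edges, schedule, retry) is an assumption, since the patent does not disclose the actual schema.

```python
import json

# Hypothetical task flow configuration: nodes are placed operators, edges are
# the canvas connections from which the backend derives input/output settings.
task_flow = {
    "name": "churn-model-pipeline",
    "schedule": "0 2 * * *",   # execution cycle of the task flow
    "retry": 3,                # number of failed retries
    "nodes": [
        {"id": "read_db", "operator": "db-reader",
         "params": {"sql": "SELECT * FROM user_events"}},
        {"id": "clean", "operator": "data-cleaner", "params": {}},
        {"id": "train", "operator": "xgb-trainer",
         "params": {"max_depth": 6, "eta": 0.1},
         "resources": {"cpu": "2", "memory": "4Gi"}},
        {"id": "export", "operator": "model-exporter", "params": {}},
    ],
    "edges": [
        {"from": "read_db", "to": "clean"},
        {"from": "clean", "to": "train"},
        {"from": "train", "to": "export"},
    ],
}
print(json.dumps(task_flow, indent=2))  # what the front end would send to the backend
```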
- the model task flow execution process is used to parse the model task flow structure data, generate a cloud-native workflow execution plan, and submit it to the container cluster to execute the model task flow.
- when executing a model task flow, the task flow configuration in JSON format is first verified: specifically, whether the operator input and output settings are legal, whether the running script parameters are legal, whether the running resource configuration meets expectations, and so on. The model task flow configuration in JSON format is then parsed and converted into a cloud-native workflow execution plan.
- the cloud-native workflow execution plan includes the creation of the Kubernetes container cluster resource objects required to run the operator components, the transfer operations for the input and output files of the operator running containers, and so on.
- the model task flow can be converted into a Workflow object of the cloud-native workflow engine Argo Workflow in YAML format.
- Each operator is designed as a Template object, and Input Artifact and Output Artifact are generated according to the input and output configuration of the operator.
- the container's image parameter is set to the operator running image; the command and args parameters are set according to the operator running script and parameter configuration; the env parameter is set according to the environment variable configuration in the operator dependent environment; and the resources parameter is set according to the operator's running resource configuration.
- an Input Artifact is generated from the address of the operator file in the repository and is used to place the operator file into the container's working directory.
- the Workflow sets a Main Template as the entry point, and the execution order between operators is parsed and converted into the DAG configuration of that Template.
- each Step in the DAG corresponds to the Template of one operator.
- the Workflow object is submitted to the cloud-native workflow engine Argo Workflow for execution.
- the cloud-native workflow engine Argo Workflow generates a cloud-native workflow execution plan and submits it to the container cluster Kubernetes.
- the container cluster executes the model workflow to get the running results.
- the system obtains the running log information of each node of the model workflow from the container cluster, and the model files generated by the model workflow can be stored in an external database for use in the model packaging process.
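- the sketch below shows what this task-flow-to-Workflow conversion could look like for the hypothetical JSON configuration above; the Workflow layout (templates, dag.tasks, artifacts) follows the public Argo Workflows API, while the operator fields, artifact keys, and registry address remain assumptions.

```python
# Sketch: convert the hypothetical task flow JSON into an Argo Workflow object.
# Assumes a default artifact repository is configured for the s3 artifact keys.

def node_to_template(node: dict) -> dict:
    return {
        "name": node["id"],
        "inputs": {"artifacts": [
            # places the operator files into the container's working directory
            {"name": "operator-files", "path": "/workspace",
             "s3": {"key": f"operators/{node['operator']}/"}},
        ]},
        "container": {
            "image": f"registry.example.com/operators/{node['operator']}:v1",
            "command": ["python", "main.py"],
            "args": [f"--{k}={v}" for k, v in node.get("params", {}).items()],
            "resources": {"limits": node.get("resources", {})},
        },
    }

def task_flow_to_workflow(task_flow: dict) -> dict:
    deps: dict[str, list[str]] = {}  # node id -> upstream node ids, from the edges
    for edge in task_flow["edges"]:
        deps.setdefault(edge["to"], []).append(edge["from"])
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        "metadata": {"generateName": task_flow["name"] + "-"},
        "spec": {
            "entrypoint": "main",
            "templates": [node_to_template(n) for n in task_flow["nodes"]] + [{
                "name": "main",  # the Main Template holding the DAG
                "dag": {"tasks": [
                    {"name": n["id"], "template": n["id"],
                     "dependencies": deps.get(n["id"], [])}
                    for n in task_flow["nodes"]
                ]},
            }],
        },
    }
```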
- taking model packaging as an example, as shown in FIG. 5, after a machine learning model is trained, it needs to be deployed to provide model application services.
- the production process of the model application service is divided into two sub-processes: model packaging and model release.
- the model packaging sub-process provides the function of adapting models generated by various machine learning frameworks (including deep learning frameworks) to the mainstream model inference frameworks, packaging model files, model dependent environments, and model inference code into model data packages for use in the model publishing environment.
- in the model packaging sub-process, the model type is first selected, including but not limited to PyTorch models, TensorFlow models, Caffe models, XGBoost models, and Scikit-learn models. Next, the available model inference operators are provided according to the corresponding rules; a model inference operator includes templated inference code and a corresponding base running image.
- Model data generally requires model files, model inference codes, model-dependent environments, and model inference configurations.
- model files describe the model structure and model parameters; model inference code implements the pre-processing and post-processing of model inference; the model dependent environment includes the runtime environment configuration and software packages used in pre-processing and post-processing; and the model inference configuration includes the minimum running resources of a model instance and the hyperparameters of the inference framework.
- for example, for the TorchServe model deployment operator, the PyTorch model serialization file, the handler, and the package names required for the handler to run must be provided, and the model instance running resources must be configured.
- some models require model conversion work; for example, a PyTorch model deployed with TensorRT for inference must first be converted into a model in ONNX format.
- a dedicated running image can be generated according to the model's dependent environment.
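- as a concrete example of the conversion step mentioned above, the snippet below exports a placeholder PyTorch model to ONNX using the standard torch.onnx.export API; the model and input shape are illustrative only.

```python
import torch
import torchvision

# Placeholder model standing in for a trained production-line output.
model = torchvision.models.resnet18(weights=None)
model.eval()

dummy_input = torch.randn(1, 3, 224, 224)  # example inference-time input shape
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",                  # standardized model file for the model warehouse
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
    opset_version=13,
)
# model.onnx can then be served by TensorRT or another ONNX-compatible runtime.
```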
- the model release process is designed as a model deployment production line composed of model deployment operators, Service configuration operators, and Ingress configuration operators.
- the first node of the cloud-native workflow execution plan is the Ingress object configuration node, which creates an Ingress object for routing requests to the model service's Service object.
- the second node is the Service object configuration node, which creates a Service object to load-balance request traffic across the model deployment nodes.
- the third node is the model deployment node.
- the number of nodes is the same as the number of configured model instances.
- the configuration of the nodes is generated by model data analysis.
- the running container is generated from the model running image, the model file and model inference code file are bound, and container resource usage is limited according to the running resource configuration.
- the fourth node is the Service object cleanup node.
- the fifth node is the Ingress object cleanup node.
- the cloud-native workflow execution plan is submitted to the container cluster for execution.
- the container cluster deploys the model and exposes the model service to complete the model publishing process.
- when the workflow is executed, the first three nodes run sequentially, and the third node waits for the end signal.
- at this point, the model instances can provide model inference services.
- when the workflow ends, an exit event is triggered, and the callback mechanism runs the fourth and fifth nodes to clear the Service object and the Ingress object, reclaiming the cluster resources and ensuring that they will not be exhausted.
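- a minimal sketch of such a release plan is shown below, expressed as an Argo-style Workflow that mirrors the five nodes and uses the engine's onExit handler for the exit-event callback; the template names are hypothetical and the resource-object manifests are elided.

```python
# Sketch of the five-node model release plan with exit-event cleanup.
release_workflow = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Workflow",
    "metadata": {"generateName": "model-release-"},
    "spec": {
        "entrypoint": "release",
        "onExit": "cleanup",  # run when the workflow ends (the exit event)
        "templates": [
            {"name": "release", "dag": {"tasks": [
                {"name": "configure-ingress", "template": "ingress-config"},
                {"name": "configure-service", "template": "service-config",
                 "dependencies": ["configure-ingress"]},
                {"name": "deploy-model", "template": "model-deploy",
                 "dependencies": ["configure-service"]},  # waits for the end signal
            ]}},
            {"name": "cleanup", "dag": {"tasks": [
                {"name": "clean-service", "template": "service-cleanup"},
                {"name": "clean-ingress", "template": "ingress-cleanup"},
            ]}},
            # The five leaf templates (ingress-config, service-config,
            # model-deploy, service-cleanup, ingress-cleanup) would create or
            # delete the Ingress/Service objects and run the model inference
            # containers, e.g. via Argo resource templates:
            #   {"resource": {"action": "create", "manifest": "<Service YAML>"}}
        ],
    },
}
```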
- the machine learning model automated production line construction method of the embodiments of the present disclosure constructs operator components according to the operator component configuration and stores them in the operator warehouse; visual orchestration reads the operator structure data in the operator warehouse and combines the operator components through business processing logic to generate a model task flow; the model task flow is converted into a cloud-native workflow engine execution plan and submitted to the container cluster for execution to output a model file; based on model packaging, model file conversion and model inference container image construction operations are performed, and the corresponding data are stored in the model warehouse; and the model data in the model warehouse are read and parsed to generate three types of operators, which are combined to form a model release task flow submitted to the container cluster to execute the model release process.
- the present disclosure improves the construction efficiency of the model production line through five mutually independent yet closely connected construction processes. At the same time, the constructed model production line can quickly train new models, shorten the model launch process, and improve model production capacity.
- the system 10 includes: an operator construction module 100, an operator orchestration module 200, a model task flow module 300, a model packaging module 400, and a model release module 500.
- the operator construction module 100 is configured to construct operator components according to the operator component configuration, and store the operator components in the operator warehouse;
- the operator orchestration module 200 is used to read the operator structure data in the operator warehouse through visual orchestration, and combine the operator components through business processing logic to generate a model task flow;
- the model task flow module 300 is used to convert the model task flow into a cloud-native workflow engine execution plan, and submit it to the container cluster for execution to output the model file;
- the model packaging module 400 is used to perform model file conversion and model inference container image construction operations based on model packaging, and store corresponding data in the model warehouse;
- the model release module 500 is used to read the model data in the model warehouse, parse them to generate three types of operators, and combine the three types of operator components to form a model release task flow submitted to the container cluster to execute the model release process.
- in the system, the operator construction module constructs operator components according to the operator component configuration and stores them in the operator warehouse; the operator orchestration module reads the operator structure data in the operator warehouse through visual orchestration and combines the operator components through business processing logic to generate a model task flow; the model task flow module converts the model task flow into a cloud-native workflow engine execution plan and submits it to the container cluster for execution to output a model file; the model packaging module performs model file conversion and model inference container image construction operations based on model packaging and stores the corresponding data in the model warehouse; and the model release module reads the model data in the model warehouse, parses them to generate three types of operators, and combines the three types of operator components into a model release task flow submitted to the container cluster to execute the model release process.
- the disclosure improves the construction efficiency of the model production line through five mutually independent yet closely connected construction processes. At the same time, the constructed model production line can quickly train new models, shorten the model launch process, and improve model production capacity.
- the present disclosure also proposes a non-transitory computer-readable storage medium storing computer-executable instructions; after the computer-executable instructions are executed by a processor, the aforementioned machine learning model automated production line construction method can be realized.
- the present disclosure also proposes an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor; when the processor executes the program, the aforementioned machine learning model automated production line construction method is realized.
- the present disclosure also proposes a computer program product, wherein the computer program product includes computer program code; when the computer program code is run on a computer, the aforementioned machine learning model automated production line construction method is realized.
- the present disclosure also proposes a computer program, wherein the computer program includes computer program code; when the computer program code is run on a computer, the computer is caused to execute the aforementioned machine learning model automated production line construction method.
- the terms "first" and "second" are used for descriptive purposes only, and cannot be interpreted as indicating or implying relative importance or implicitly specifying the quantity of the indicated technical features.
- the features defined as “first” and “second” may explicitly or implicitly include at least one of these features.
- “plurality” means at least two, such as two, three, etc., unless otherwise specifically defined.
Abstract
Provided are a machine learning model automated production line construction method and system, wherein the method includes: constructing operator components according to the operator component configuration and storing the operator components in an operator warehouse; visual orchestration reading the operator structure data in the operator warehouse and combining the operator components through business processing logic to generate a model task flow; converting the model task flow into a cloud-native workflow engine execution plan and submitting it to a container cluster for execution to output a model file; based on model packaging, performing model file conversion and model inference container image construction operations and storing the corresponding data in a model warehouse; and reading the model data in the model warehouse, parsing them to generate three types of operators, and combining the three types of operator components to form a model release task flow submitted to the container cluster to execute the model release process.
Description
Cross-reference to related applications
This application is based on and claims priority to Chinese Patent Application No. 202111268941.X, filed on October 29, 2021, the entire contents of which are incorporated herein by reference.
The present disclosure relates to the technical field of artificial intelligence, and in particular to a machine learning model automated production line construction method and system.
As the development of artificial intelligence has entered a period of vigorous growth, artificial intelligence technology has been applied in all walks of life. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and its applications pervade all fields of artificial intelligence. In a practical sense, machine learning is a method of using data to train a model and then using the model to make predictions.
Training machine learning models is not a once-and-for-all task. Faced with ever-growing industry data and constantly changing industry standards, a machine learning model production line is needed to keep training updated models. A machine learning model production line can solidify the steps of model training and model deployment, so as to train new models and deploy them online. The traditional way of building a model production line is purely manual: multiple scripts are written to process the raw data and obtain the model's training data set, model training code is then written to train the model, and finally a model inference script must be written to deploy the model online. The traditional approach requires manually configuring the dependent environment, manually running scripts and collecting the results, and manually deploying the model and maintaining the model service, which makes the model development cycle long; the steps of the model production line are so tightly coupled that they are difficult to upgrade or modify, and their reusability is poor. Manually configuring the environment also brings problems such as environment dependency conflicts. The traditional model production line construction method therefore struggles to adapt to the rapid model iteration demands brought about by industry changes.
Related technical solutions lack a model deployment module and do not cover the complete model production line, that is, the complete process from data source import to model launch. Such a system targets only deep learning model development production lines and lacks support for general machine learning models. It also highly encapsulates the production line, offering only a few parameter choices to change it, which lacks flexibility, and the individual steps of the production line cannot be reused in other production lines.
Summary
The present disclosure aims to solve, at least to a certain extent, one of the technical problems in the related art.
Therefore, in view of the above problems, the embodiments of the present disclosure propose a machine learning model automated production line construction method and system. The present disclosure divides the machine learning model production line construction process into operator component development, operator orchestration, model task flow execution, model packaging, and model release. Specifically, container technology is first used to solidify the model production line steps into operator components, solving single-machine environment dependency and environment conflict problems. Multiple operator components are then combined into a model task flow through operator orchestration; the operators in a model task flow can be combined and replaced arbitrarily, improving the reusability of the model production line steps. The model task flow is converted into a cloud-native workflow execution plan by the cloud-native workflow engine and submitted to the container cluster for execution to obtain the model file; the model is packaged and stored in the model warehouse through model packaging; and finally the model is published as a model application to provide model services externally. These five construction processes are mutually independent yet closely connected, which improves the construction efficiency of the model production line; at the same time, the constructed model production line can quickly train new models, shortening the model launch process and improving model production capacity.
To this end, the first purpose of the present disclosure is to propose a machine learning model automated production line construction method, including:
constructing operator components according to the operator component configuration, and storing the operator components in an operator warehouse;
visual orchestration reading the operator structure data in the operator warehouse, and combining the operator components through business processing logic to generate a model task flow;
converting the model task flow into a cloud-native workflow engine execution plan, and submitting it to a container cluster for execution to output a model file;
based on model packaging, performing the model file conversion and model inference container image construction operations, and storing the corresponding data in a model warehouse;
reading the model data in the model warehouse and parsing them to generate three types of operators, and combining the three types of operator components to form a model release task flow to be submitted to the container cluster to execute the model release process.
In the machine learning model automated production line construction method of the embodiments of the present disclosure, operator components are constructed according to the operator component configuration and stored in the operator warehouse; visual orchestration reads the operator structure data in the operator warehouse and combines the operator components through business processing logic to generate a model task flow; the model task flow is converted into a cloud-native workflow engine execution plan and submitted to the container cluster for execution to output a model file; based on model packaging, model file conversion and model inference container image construction operations are performed, and the corresponding data are stored in the model warehouse; and the model data in the model warehouse are read and parsed to generate three types of operators, which are combined to form a model release task flow submitted to the container cluster to execute the model release process. The present disclosure improves the construction efficiency of the model production line through five mutually independent yet closely connected construction processes; at the same time, the constructed model production line can quickly train new models, shortening the model launch process and improving model production capacity.
In an embodiment of the present disclosure, constructing the operator components according to the operator component configuration and storing them in the operator warehouse includes: copying the operator files to operator-specific file storage to solidify the files used when the operator runs; generating a Dockerfile from the operator dependent environment and the base image, and submitting it to the Docker Daemon to build the operator running image; after the build is complete, notifying the Docker Daemon to push the operator running image to the image registry; writing the address of the operator files in the repository and the operator running image information into the operator component configuration, and storing the operator component information in the operator warehouse to complete the operator construction; and, according to the operator component configuration, generating an operator test template displayed on the front end, submitting the operator test template to generate a single-node task flow, converting it into a cloud-native workflow execution plan submitted to the container cluster for execution, and obtaining the operator execution log. The operator warehouse includes file storage, a relational database, and an image registry, which are used to store operator code, operator structure data, and container image files, respectively.
In an embodiment of the present disclosure, visual orchestration reading the operator structure data in the operator warehouse and combining the operator components through business processing logic to generate the model task flow includes: reading the operator information of the current operator warehouse and displaying the operator components in the operator list on the left side of the front-end task flow canvas according to their configuration information; placing the operators required to build the model task flow on the middle canvas; generating the connection endpoints of each operator component according to its configuration, with the upper endpoints serving as input endpoints and the lower endpoints as output endpoints; after an operator is selected, displaying the operator configuration panel on the right side of the canvas; connecting the input and output ends of each operator according to the model production line process and configuring the relevant parameters in each operator's configuration panel to complete the construction of the model workflow; and saving the constructed model task flow once construction is completed.
In an embodiment of the present disclosure, the method further includes: generating JSON configuration files in a unified format for different types of operators according to specific rules; the user connecting the input and output ends of each operator in a specific order to construct a task flow, with the input and output settings of the operators configured automatically according to the edges and nodes of each connection; during task flow orchestration, reading and parsing the operator structure data in the operator warehouse and dynamically generating the task flow configuration in JSON format according to the operations; and, when the save operation is executed, the front end sending the task flow configuration in JSON format to the back end for saving.
In an embodiment of the present disclosure, converting the model task flow into the cloud-native workflow engine execution plan and submitting it to the container cluster for execution to output the model file includes: parsing and converting the model task flow structure data, generating a cloud-native workflow execution plan, and submitting it to the container cluster to execute the model task flow, with the model data files produced by the execution stored on the object storage server. When executing a model task flow, the task flow configuration in JSON format is verified; after verification, the model task flow configuration in JSON format is parsed and converted into a cloud-native workflow execution plan, and after the run completes, the running log information of each node of the model workflow is obtained from the container cluster. The cloud-native workflow execution plan includes several of: creating the container cluster resource objects required to run the operator components, and transfer operations for the input and output files of the operator running containers.
In an embodiment of the present disclosure, performing the model file conversion and model inference container image construction operations based on model packaging and storing the corresponding data in the model warehouse includes: receiving the model configuration information entered by the user on the front end; performing templated model packaging through the model packaging process; parsing the model configuration information to standardize the model files and build the model inference container image; and storing the model inference code, data files, and container image in the model warehouse as model data. The model warehouse is used to store model inference configuration data, model structure data, and model inference container image files, and includes the relational database, an object storage server, and an image registry. In the model packaging process, a model type is selected and model inference operators are provided according to the corresponding rules; after the model type and model inference operator type are determined, specific data are provided for the subsequent model data package according to a specific strategy and packaged into the model data stored in the model warehouse, where the specific data include a data package, the address of the converted model file, and the address of the model instance running image.
In an embodiment of the present disclosure, reading the model data in the model warehouse, parsing them to generate three types of operators, and combining the three types of operator components to form a model release task flow submitted to the container cluster to execute the model release process includes: receiving the model service configuration information entered by the user on the front end; reading the model data in the model warehouse and parsing them to generate a model deployment operator, while generating a Service configuration operator and an Ingress configuration operator for model service exposure; automatically orchestrating these into a task flow for model deployment and model service exposure; and parsing the task flow to generate a cloud-native workflow execution plan submitted to the container cluster for execution, completing the model service release.
In an embodiment of the present disclosure, the operator component types include several of: data reading operators, data processing operators, model training operators, data export operators, visualization operators, model deployment operators, and cluster configuration operators. The operator component configuration information includes several of: operator files, operator input and output settings, operator parameter settings, the operator running script, the operator dependent environment, the base image required to build the operator, and the resource configuration required for the operator to run. The operator files include the operator running script and the other files required for the operator to run; the operator running script is the operator's execution entry point and is an executable binary file; the operator input and output settings define the operator's data sources and data output locations; and the operator parameter settings define the parameters required when the operator running script executes.
In an embodiment of the present disclosure, reading the model data in the model warehouse, parsing them to generate three types of operators, and combining the three types of operator components to form a model release task flow submitted to the container cluster to execute the model release process further includes: the first node of the cloud-native workflow execution plan is the Ingress object configuration node, which creates an Ingress object that routes requests to the model service's Service object; the second node is the Service object configuration node, which creates a Service object that load-balances request traffic across the model deployment nodes; the third node is the model deployment node, whose configuration is generated by parsing the model data, where the running container is generated from the model running image, the model file and model inference code file are bound, and container resource usage is limited according to the running resource configuration; the fourth node is the Service object cleanup node; and the fifth node is the Ingress object cleanup node. The cloud-native workflow execution plan is submitted to the container cluster for execution; the container cluster deploys the model and exposes the model service to complete the model release process. When the workflow is executed, the first three nodes run in sequence and the third node waits for an end signal; when the workflow ends, an exit event is triggered, and the callback mechanism runs the fourth and fifth nodes to clear the Service object and the Ingress object.
To achieve the above purposes, the embodiment of the second aspect of the present disclosure proposes a machine learning model automated production line construction system, including:
an operator construction module, configured to construct operator components according to the operator component configuration and store the operator components in an operator warehouse;
an operator orchestration module, configured to read the operator structure data in the operator warehouse through visual orchestration and combine the operator components through business processing logic to generate a model task flow;
a model task flow module, configured to convert the model task flow into a cloud-native workflow engine execution plan and submit it to a container cluster for execution to output a model file;
a model packaging module, configured to perform the model file conversion and model inference container image construction operations based on model packaging and store the corresponding data in a model warehouse;
a model release module, configured to read the model data in the model warehouse, parse them to generate three types of operators, and combine the three types of operator components to form a model release task flow submitted to the container cluster to execute the model release process.
In the machine learning model automated production line construction system of the embodiments of the present disclosure, the operator construction module constructs operator components according to the operator component configuration and stores them in the operator warehouse; the operator orchestration module reads the operator structure data in the operator warehouse through visual orchestration and combines the operator components through business processing logic to generate a model task flow; the model task flow module converts the model task flow into a cloud-native workflow engine execution plan and submits it to the container cluster for execution to output a model file; the model packaging module performs model file conversion and model inference container image construction operations based on model packaging and stores the corresponding data in the model warehouse; and the model release module reads the model data in the model warehouse, parses them to generate three types of operators, and combines the three types of operator components into a model release task flow submitted to the container cluster to execute the model release process. The present disclosure improves the construction efficiency of the model production line through five mutually independent yet closely connected construction processes; at the same time, the constructed model production line can quickly train new models, shortening the model launch process and improving model production capacity.
To achieve the above objectives, an embodiment of the third aspect of the present disclosure provides a non-transitory computer-readable storage medium storing computer-executable instructions which, when executed by a processor, can implement the method of the first aspect of the present disclosure.
To achieve the above objectives, an embodiment of the fourth aspect of the present disclosure provides an electronic device, including a memory, a processor and a computer program stored in the memory and runnable on the processor, the processor implementing the method of the first aspect of the present disclosure when executing the program.
To achieve the above objectives, an embodiment of the fifth aspect of the present disclosure provides a computer program product including computer program code which, when run on a computer, implements the method of the first aspect of the present disclosure.
To achieve the above objectives, an embodiment of the sixth aspect of the present disclosure provides a computer program including computer program code which, when run on a computer, causes the computer to execute the method of the first aspect of the present disclosure.
Additional aspects and advantages of the present disclosure will be set forth in part in the following description, and in part will become apparent from the following description or be learned through practice of the present disclosure.
The above and/or additional aspects and advantages of the present disclosure will become apparent and readily understood from the following description of the embodiments in conjunction with the accompanying drawings, in which:
FIG. 1 is a flowchart of the machine learning model automated production line construction method provided by an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of machine learning model automated production line construction provided by an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of the operator construction process provided by an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of the operator orchestration and model task flow execution process provided by an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of the model packaging and model publishing process provided by an embodiment of the present disclosure;
FIG. 6 is a schematic structural diagram of the machine learning model automated production line construction system provided by an embodiment of the present disclosure.
Embodiments of the present disclosure are described in detail below, examples of which are shown in the accompanying drawings, where identical or similar reference numerals throughout denote identical or similar elements or elements having identical or similar functions. The embodiments described below with reference to the drawings are exemplary; they are intended to explain the present disclosure and shall not be construed as limiting it.
The machine learning model automated production line construction method and system proposed according to the embodiments of the present disclosure are described below with reference to the accompanying drawings, beginning with the method.
FIG. 1 is a flowchart of a machine learning model automated production line construction method according to one embodiment of the present disclosure.
As shown in FIG. 1, the machine learning model automated production line construction method includes the following steps.
Step S1: construct operator components according to operator component configurations, and store the operator components in the operator warehouse.
Specifically, operator construction mainly provides operator development functionality. It receives the operator configuration information entered by the user on the front end, parses the operator configuration to form the operator runtime image build file, and submits it to the Docker Daemon for image construction; after the build completes, the image information and the operator configuration are stored together in the operator warehouse as operator structure data. The operator warehouse includes a file storage, a relational database and an image registry, which respectively store operator code, operator structure data and container image files. Operator construction also provides an operator testing function that parses the operator's input/output configuration to generate a test template; the filled-in test template is submitted to the system for operator testing, yielding the operator run results.
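The disclosure does not prescribe a client library for the interaction with the Docker Daemon; purely as an illustration, the build-and-push step might be driven with the Docker SDK for Python roughly as follows. The registry address, image tag scheme and function name are assumptions, not part of the disclosure.

```python
# Minimal sketch of the operator runtime image build step, using the
# Docker SDK for Python ("docker" package). Registry and tag are
# illustrative assumptions.
import docker

def build_operator_image(build_dir: str, operator_name: str, version: str) -> str:
    """Build the operator runtime image from a prepared build context
    (generated Dockerfile plus operator files) and push it to the registry."""
    client = docker.from_env()  # connects to the local Docker Daemon
    tag = f"registry.example.com/operators/{operator_name}:{version}"  # hypothetical
    image, _build_log = client.images.build(path=build_dir, tag=tag)
    client.images.push(tag)  # push the operator runtime image to the image registry
    return tag  # this reference is written back into the operator component config

if __name__ == "__main__":
    print(build_operator_image("./operator_build_ctx", "db-reader", "0.1.0"))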
As an example, as shown in FIG. 2, the aim of the present disclosure is to enable the efficient design of an automated production line for developing machine learning models (including all kinds of AI models). In the method of the embodiments of the present disclosure, the model application development process is decomposed into operator construction, operator orchestration, model task flow execution, model packaging and model publishing.
It can be understood that the operator construction process mainly constructs operator components according to operator component configurations and stores them in the operator warehouse. An operator component is an abstraction of one step of the machine learning model production line; operator components can be freely combined under certain logic, which improves the reusability of the production line. For example, a database data reading operator or a model training operator can be used in different machine learning training scenarios simply by adjusting the corresponding SQL statement or the model training hyperparameters. At the same time, operator components use container technology to package their dependency environment into a container image, solving problems such as the tedious configuration of application and script runtime environments and software package conflicts. A constructed operator can generate a test template from its configuration; filling in the test template and submitting it to the system for testing ensures the operator's reliability.
Step S2: visual orchestration reads the operator structure data in the operator warehouse and combines the operator components through business processing logic to generate a model task flow.
Specifically, operator orchestration mainly provides a visual operator orchestration function. It reads the operator structure data in the operator warehouse and parses it into visual nodes on the front end; the user connects the operators' input and output ends by drag-and-drop to form a model task flow. The parameters and resources of each operator are configurable, and the model task flow itself can be configured with settings such as an execution schedule and a failure retry count. Once saved, the model task flow structure data is stored in the relational database.
As an example, as shown in FIG. 2, the operator orchestration process combines operator components through business processing logic into a model task flow. Because operator components have well-defined inputs, outputs and execution procedures, this improves the efficiency of model task flow construction. A model task flow covers the complete model training pipeline from data input, data processing and model training to data export (including model data), and is used to solidify the model-producing part of the model application development process.
Step S3: convert the model task flow into a cloud-native workflow engine execution plan and submit it to the container cluster for execution to output a model file.
Specifically, the model task flow stage mainly provides parsing and conversion of the model task flow structure data, generating a cloud-native workflow execution plan that is submitted to the container cluster to execute the model task flow. The model data files produced by task flow execution are stored on an object storage server.
As an example, as shown in FIG. 2, in the model task flow execution process the model task flow is first converted into a cloud-native workflow engine execution plan and then submitted to the container cluster for execution. Each operator component runs as a container, and the resources used by each operator run container are explicitly limited, improving the utilization efficiency of cluster resources.
Step S4: based on model packaging, perform model file conversion and model inference container image construction operations, and store the data corresponding to the operations in the model warehouse.
Specifically, model packaging provides templated model construction functionality. It receives the model configuration information entered by the user on the front end, parses it to perform model file standardization (for example, ONNX conversion) and model inference container image construction, and finally stores the model inference code, model data files and model container image in the model warehouse as model data. The model warehouse includes a relational database server, an object storage server and an image registry, used to store model inference configuration data, model structure data and model inference container image files.
As an example, as shown in FIG. 2, after the model task flow finishes executing it outputs a model file; the model packaging module then performs templated model encapsulation, carrying out operations such as model file conversion and packaging the model's runtime dependency environment into a container image, and finally packs these together with the model inference code and model inference configuration into a model data package stored in the model warehouse.
Step S5: read the model data in the model warehouse, parse it to generate three types of operators, and combine the three types of operator components into a model publishing task flow to be submitted to the container cluster to execute the model publishing process.
Specifically, model publishing provides model deployment and model service exposure functionality. It receives the model service configuration information entered by the user on the front end, reads the model data in the model warehouse and parses it to generate a model deployment operator, while also generating a Service configuration operator and an Ingress configuration operator for exposing the model service; these are automatically orchestrated into a task flow for model deployment and model service exposure, and the task flow is parsed to generate a cloud-native workflow execution plan submitted to the container cluster for execution, completing model service publication. An exit event triggers a callback mechanism that automatically cleans up the container cluster's Service and Ingress configurations, preventing resource exhaustion.
As an example, as shown in FIG. 2, the model publishing process abstracts model deployment and model service exposure into three types of operator components: the model instance deployment operator, the Service configuration operator and the Ingress configuration operator. Model packaging makes it easy for the model instance deployment operator to read the model data and run the model deployment container. The Service configuration operator creates a Service resource object, which provides a unified entry address for the model applications in a group of model deployment containers and load-balances requests across them, while the Ingress configuration operator creates an Ingress resource object, enabling external access to a specific model application service inside the container cluster. The three types of operator components are combined into a model publishing task flow, which is converted into a cloud-native workflow engine execution plan and then submitted to the container cluster to execute the model publishing process. The model publishing task flow improves the efficiency of model publishing, and using the workflow exit event to trigger a callback mechanism that automatically cleans up the cluster's Service and Ingress configurations prevents container cluster resources from being exhausted.
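The exit-event callback described above maps naturally onto the exit-handler hook of a cloud-native workflow engine. Assuming Argo Workflow as the engine (it is named later in this description as one example), a sketch of attaching the cleanup nodes via the onExit hook might look as follows; the template names and overall structure are illustrative assumptions.

```python
# Sketch of wiring the exit-event callback via Argo Workflow's onExit hook.
# Assumption: Argo Workflow is the engine; template names are illustrative.
import yaml  # pip install pyyaml

workflow = {
    "apiVersion": "argoproj.io/v1alpha1",
    "kind": "Workflow",
    "metadata": {"generateName": "model-publish-"},
    "spec": {
        "entrypoint": "publish",  # nodes 1-3: Ingress, Service, model deployment
        "onExit": "cleanup",      # runs when the workflow ends, success or failure
        "templates": [
            {"name": "publish", "dag": {"tasks": [
                {"name": "ingress-config", "template": "ingress-config"},
                {"name": "service-config", "template": "service-config",
                 "dependencies": ["ingress-config"]},
                {"name": "model-deploy", "template": "model-deploy",
                 "dependencies": ["service-config"]},
            ]}},
            {"name": "cleanup", "dag": {"tasks": [
                {"name": "service-cleanup", "template": "service-cleanup"},
                {"name": "ingress-cleanup", "template": "ingress-cleanup",
                 "dependencies": ["service-cleanup"]},
            ]}},
            # concrete operator templates omitted for brevity
        ],
    },
}

print(yaml.safe_dump(workflow, sort_keys=False))
```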
The embodiments of the present disclosure are further elaborated below with reference to the accompanying drawings.
As an example, as shown in FIG. 3, in the operator construction process an operator component is an abstraction of a step of the machine learning model production line, and once instantiated it is also a run node in a task flow. Operator component types include data reading operators, data processing operators, model training operators, data export operators, visualization operators, model deployment operators, cluster configuration operators and so on. Each operator has fixed inputs, outputs and a runtime image, and its parameters and run resources can be adjusted.
To construct an operator, the user first fills in the operator component configuration information on the front end, including the operator files, operator input/output settings, operator parameter settings, operator run script, operator dependency environment, the base image required to build the operator, and the resource configuration required for the operator to run. Specifically, the operator files include the operator run script and the other files needed for the operator to run; the operator run script is the operator's execution entry point and may be a Python script, a Shell script or another executable binary file. The operator input/output settings define the operator's data sources and data output locations, and an operator may have multiple inputs and outputs: an operator's input may come from another operator, a local file or an external database, and its output location may be another operator or an external database. The operator dependency environment and base image are used to build the operator runtime image, thereby solidifying the operator's runtime environment. The operator parameter settings define the parameters required when the operator run script executes. The operator run resource configuration defines the lower bound of resources used when the operator runs, preventing the operator from failing due to a lack of resources. Next, the operator component configuration information is parsed to solidify the operator file data and build the operator runtime image. Specifically, the system first copies the operator files to a file storage dedicated to operators, solidifying the files used when the operator runs to guarantee run stability; the file storage may be implemented with object storage, a network file system or the like. The system then generates a Dockerfile from the operator dependency environment and base image and submits it to the Docker Daemon to build the operator runtime image; after the build completes it notifies the Docker Daemon to push the operator runtime image to the image registry. Finally, the address of the operator files in the storage and the operator runtime image information are written into the operator component configuration, and the system stores the operator component information in the operator warehouse, completing operator construction.
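As a concrete illustration of the Dockerfile-generation step, the file text might be rendered from the base image and dependency environment roughly as below; the configuration field names (`base_image`, `pip_packages`, `run_script`) are assumptions for the sketch, not fields defined by the disclosure.

```python
# Illustrative sketch: render a Dockerfile from an operator's base image
# and dependency environment. Field names are assumptions.
def render_dockerfile(cfg: dict) -> str:
    """Render Dockerfile text for the operator runtime image."""
    lines = [f"FROM {cfg['base_image']}", "WORKDIR /workspace"]
    if cfg.get("pip_packages"):  # operator dependency environment
        lines.append("RUN pip install --no-cache-dir " + " ".join(cfg["pip_packages"]))
    # The operator run script is the container's execution entry point.
    lines.append(f'ENTRYPOINT ["python", "{cfg["run_script"]}"]')
    return "\n".join(lines) + "\n"

print(render_dockerfile({
    "base_image": "python:3.9-slim",
    "pip_packages": ["pandas==1.5.3", "scikit-learn==1.2.2"],
    "run_script": "train.py",
}))
```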
From the operator component configuration the system can generate an operator test template and display it on the front end. Specifically, operator input can be supplied either from an external database or from a local file, operator output can use an external database, and the operator parameters and run resources can all be changed on the front end. After the test template is submitted, the system generates a single-node task flow, converts it into a cloud-native workflow execution plan and submits it to the container cluster for execution, finally obtaining the operator execution log, which is used to check the operator's correctness and reliability.
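A sketch of expanding a filled-in test template into the single-node task flow described above is given below; the field names and the nodes/edges structure are assumptions made for illustration.

```python
# Sketch: turn a filled-in operator test template into the minimal
# single-node task flow used for testing. Field names are assumptions.
def test_template_to_taskflow(operator_id: str, template: dict) -> dict:
    node = {
        "id": "test-node",
        "operator": operator_id,
        "inputs": template["inputs"],      # external database or local file
        "outputs": template["outputs"],    # external database
        "params": template.get("params", {}),
        "resources": template.get("resources", {"cpu": "1", "memory": "1Gi"}),
    }
    # One node and no edges: the minimal flow, executed only to test the operator.
    return {"nodes": [node], "edges": []}

flow = test_template_to_taskflow("db-reader", {
    "inputs": [{"type": "local_file", "path": "sample.csv"}],
    "outputs": [{"type": "database", "table": "test_out"}],
})
```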
As an example, as shown in FIG. 4, in the operator orchestration and model task flow execution process, a machine learning training procedure can be abstracted as a model workflow composed of multiple operator components orchestrated under certain logic: a model workflow generally starts from a data import operator, passes through data processing operators, feeds into a model training operator, and finally outputs to a data export operator or a visualization operator. Orchestrating and combining operators makes it possible to rapidly build a machine learning training production line. The model workflow is then parsed by the model workflow module into a cloud-native workflow execution plan submitted to the container cluster for execution, making full use of container and container orchestration technology to raise server resource utilization.
The operator orchestration sub-process connects operators to one another under certain logic to form a model task flow. First, the system reads the operator information currently in the operator warehouse and, based on the operator components' configuration information, displays the operator components in the operator list on the left side of the front-end task flow canvas. The user drags the operators needed to build the model task flow onto the central canvas. On the canvas an operator is a rectangular block; the system generates operator component connection endpoints from the operator's configuration, with the endpoints above the component serving as input endpoints and those below as output endpoints. Specifically, an endpoint is displayed on the operator's front-end block only if the corresponding input or output is selected in the input/output settings; one output endpoint may feed multiple input endpoints, whereas one input endpoint can connect to only a single output endpoint. When an operator is selected, an operator configuration panel appears on the right side of the canvas, containing the operator's input settings, output settings, parameter settings and run resource settings. The user connects the input and output ends of each operator according to the model production line flow and configures the relevant parameters in each operator's configuration panel to complete the model workflow; the model task flow itself can also be configured with parameters such as an execution schedule and a failure retry count. After construction the user can save the model task flow for later modification and execution. To realize the above functionality, the system is designed with a set of rules that can generate JSON configuration files in a unified format for different types of operators. The user connects the input and output ends of each operator in a certain order to build the task flow, and the system automatically configures each operator's input and output settings from the edges and nodes of each connection. While the user performs task flow orchestration by drag-and-drop, the system reads and parses the operator structure data in the operator warehouse and dynamically generates a task flow configuration in JSON format from the user's operations; when the user saves the task flow, the front end sends this JSON-format task flow configuration to the system back end for storage.
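To make the edge-driven auto-configuration concrete, a sketch of deriving each operator's input/output settings from the drawn edges is given below. The exact JSON layout (`nodes`/`edges`, port names) is an assumption based on the description, not a format specified by the disclosure.

```python
# Sketch: auto-fill operator input/output settings from canvas edges.
# The JSON structure is an assumption based on the description.
taskflow = {
    "nodes": [
        {"id": "n1", "operator": "db-reader", "params": {"sql": "SELECT * FROM t"}},
        {"id": "n2", "operator": "trainer",   "params": {"lr": 0.01}},
    ],
    "edges": [
        # one output endpoint may feed several inputs; an input takes one edge
        {"from": "n1", "from_port": "out0", "to": "n2", "to_port": "in0"},
    ],
}

def apply_edges(flow: dict) -> dict:
    """Derive each node's input/output settings from the drawn edges."""
    nodes = {n["id"]: n for n in flow["nodes"]}
    for e in flow["edges"]:
        nodes[e["from"]].setdefault("outputs", []).append(
            {"port": e["from_port"], "to": f'{e["to"]}.{e["to_port"]}'})
        nodes[e["to"]].setdefault("inputs", []).append(
            {"port": e["to_port"], "from": f'{e["from"]}.{e["from_port"]}'})
    return flow

taskflow = apply_edges(taskflow)  # what the front end would send on save
```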
The model task flow execution process parses the model task flow structure data, generates a cloud-native workflow execution plan and submits it to the container cluster to execute the model task flow. When a model task flow is executed, the JSON-format task flow configuration is first validated: the checks include whether the operator input/output settings are legal, whether the run script parameters are legal, and whether the run resource configuration is as expected. The JSON-format model task flow configuration is then parsed and converted into a cloud-native workflow execution plan, which includes creating the Kubernetes container cluster resource objects required to run the operator components, transfer operations for the input and output files of the operator run containers, and so on. For example, the model task flow can be converted into a Workflow object, in Yaml format, of the cloud-native workflow engine Argo Workflow: each operator is designed as a Template object; Input Artifacts and Output Artifacts are generated from the operator's input/output configuration; the Container's image parameter is set from the operator runtime image; the Container's command and args parameters are set from the operator run script and parameter configuration; the Container's env parameter is set from the environment variables in the operator dependency environment; the Container's resources parameter is configured from the operator run resources; and an Input Artifact is generated from the address of the operator files in the storage, used to place the operator files into the Container's working directory. The Workflow sets a Main Template as the entry point, the execution order between operators is parsed into the configuration of a Template Dag, and each Step in the Dag corresponds to one operator's Template. Once built, the Workflow object is submitted to Argo Workflow for execution; Argo Workflow generates the cloud-native workflow execution plan and submits it to the Kubernetes container cluster, which executes the model workflow and produces the run results. For the step of parsing the JSON-format model task flow configuration and converting it into a cloud-native workflow execution plan, cloud-native workflow engines or workflow generation tools other than Argo Workflow may be used; Argo Workflow is given here only as an example. After the run completes, the system obtains the run log information of each node of the model workflow from the container cluster, and the model files produced by the model workflow can be stored in an external database for use by the model packaging process.
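To make the mapping concrete, a sketch of converting one operator node into an Argo Workflow Template is given below. The operator-side field names are assumptions; the Template fields (inputs/outputs artifacts, container image/command/args/env/resources) follow the Workflow structure described above.

```python
# Sketch: map one operator node of the task flow to an Argo Workflow Template.
# Operator-side field names are assumptions based on the description.
def operator_to_template(node: dict) -> dict:
    return {
        "name": node["id"],
        "inputs": {"artifacts": [
            # operator files fetched from the file storage into the work dir;
            # bucket/endpoint would come from the configured artifact repository
            {"name": "operator-files", "path": "/workspace",
             "s3": {"key": node["files_key"]}},
        ]},
        "outputs": {"artifacts": [
            {"name": "out0", "path": "/workspace/output"},
        ]},
        "container": {
            "image": node["image"],                     # operator runtime image
            "command": ["python", node["run_script"]],  # run script entry point
            "args": [f"--{k}={v}" for k, v in node["params"].items()],
            "env": [{"name": k, "value": v} for k, v in node.get("env", {}).items()],
            "resources": {"requests": node.get("resources", {})},
        },
    }

# One such Template per operator; the DAG template then orders them by
# the edges of the task flow, and a Main Template serves as the entry point.
```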
As an example, as shown in FIG. 5, after a machine learning model is trained it must be deployed before it can provide model application services. The embodiments of the present disclosure divide the production of model application services into two sub-processes: model packaging and model publishing.
The model packaging sub-process adapts models produced by various machine learning frameworks (including deep learning frameworks) to the mainstream model inference frameworks, encapsulating the model file, the model dependency environment and the model inference code into a model data package supplied to the model publishing environment. In the model packaging sub-process, the model type is selected first, including but not limited to PyTorch models, TensorFlow models, Caffe models, XGBoost models and Scikit-learn models. Next, the available model inference operators are provided according to corresponding rules; a model inference operator comprises templated inference code and a corresponding base runtime image. For example, a PyTorch model may use a TorchServe model deployment operator, a TensorRT model deployment operator or a Flask model deployment operator, while XGBoost and Scikit-learn models may use corresponding Flask model deployment operators, as shown in FIG. 5. After the model type and model inference operator type are determined, the data needed by the subsequent model data package is provided according to a certain strategy. Specifically, model data generally requires the model file, the model inference code, the model dependency environment and the model inference configuration: the model file describes the model structure and model parameters; the model inference code describes the pre-processing and post-processing around model inference; the model dependency environment includes the runtime environment configuration or software packages used by that pre- and post-processing; and the model inference configuration includes the minimum run resources for a model instance, inference framework hyperparameters and the like. For example, deploying a PyTorch model with the TorchServe model deployment operator requires the serialized PyTorch model file, a Handler program and the names of the software packages the Handler needs at run time, along with the model instance run resource configuration. Next, model conversion and model runtime image construction are carried out. As an example of model conversion, a PyTorch model to be served with TensorRT must first be converted to an ONNX-format model; for model runtime image construction, a concrete runtime image can be generated from the model dependency environment. Finally, the data package, the file address after model conversion and the model instance runtime image address are packed into the model data and stored in the model warehouse.
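As a concrete illustration of the two conversion examples above: the ONNX export uses PyTorch's own `torch.onnx.export` API, and TorchServe packing is done with the `torch-model-archiver` CLI. The toy model, input shape and file names below are placeholders, not values from the disclosure.

```python
# Sketch of the model-standardization step for a PyTorch model that will
# be served with TensorRT: export to ONNX first. Model and shapes are toys.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2)).eval()
dummy = torch.randn(1, 4)  # example input used to trace the graph
torch.onnx.export(model, dummy, "model.onnx",
                  input_names=["input"], output_names=["output"])

# For a TorchServe deployment instead, the serialized model plus a Handler
# would be packed with the torch-model-archiver CLI, e.g.:
#   torch-model-archiver --model-name demo --version 1.0 \
#       --serialized-file model.pt --handler handler.py
```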
The model publishing process is designed as a model deployment production line composed of the model deployment operator, the Service configuration operator and the Ingress configuration operator. First, the model to be deployed is selected from the model warehouse, and the number of model instances and the run resources per model instance (no lower than the minimum run resources) are set; then the cloud-native workflow execution plan used by the model deployment production line is built. Specifically, the first node of the cloud-native workflow execution plan is the Ingress object configuration node, which creates an Ingress object that routes requests to the model service's Service object. The second node is the Service object configuration node, which creates a Service object that load-balances request traffic across the model deployment nodes. The third node is the model deployment node; the number of these nodes equals the configured number of model instances, and each node's configuration is generated by parsing the model data: its run container is created from the model runtime image, bound to the model file and model inference code file, with container resource usage limited according to the run resource configuration. The fourth node is the Service object cleanup node and the fifth node is the Ingress object cleanup node. Finally, the cloud-native workflow execution plan is submitted to the container cluster for execution; the container cluster deploys the model and opens the model service, completing the model publishing process. At run time the workflow executes the first three nodes in sequence and waits at the third node for an end signal, during which time the model instances can provide model inference services. When the workflow ends, an exit event is triggered and the callback mechanism runs the fourth and fifth nodes, which remove the Service object and the Ingress object and reclaim cluster resources, ensuring that cluster resources are not exhausted.
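For the Service and Ingress configuration nodes, one possible realization uses the official Kubernetes Python client; the namespace, names, ports, path and labels below are illustrative assumptions, and the selector is assumed to match the labels carried by the model deployment pods.

```python
# Sketch: create the Service and Ingress objects of the publishing flow
# with the official Kubernetes Python client. All names are assumptions.
from kubernetes import client, config

config.load_incluster_config()  # or config.load_kube_config() outside the cluster

svc = client.V1Service(
    metadata=client.V1ObjectMeta(name="model-svc"),
    spec=client.V1ServiceSpec(
        selector={"app": "model-instance"},  # matches the model deployment pods
        ports=[client.V1ServicePort(port=80, target_port=8080)],
    ),
)
client.CoreV1Api().create_namespaced_service(namespace="models", body=svc)

ing = client.V1Ingress(
    metadata=client.V1ObjectMeta(name="model-ing"),
    spec=client.V1IngressSpec(rules=[client.V1IngressRule(
        http=client.V1HTTPIngressRuleValue(paths=[client.V1HTTPIngressPath(
            path="/models/demo", path_type="Prefix",
            backend=client.V1IngressBackend(
                service=client.V1IngressServiceBackend(
                    name="model-svc",
                    port=client.V1ServiceBackendPort(number=80))),
        )]),
    )]),
)
client.NetworkingV1Api().create_namespaced_ingress(namespace="models", body=ing)
```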
In the machine learning model automated production line construction method of the embodiments of the present disclosure, operator components are constructed according to operator component configurations and stored in an operator warehouse; visual orchestration reads the operator structure data in the operator warehouse and combines the operator components through business processing logic to generate a model task flow; the model task flow is converted into a cloud-native workflow engine execution plan and submitted to the container cluster for execution to output a model file; based on model packaging, model file conversion and model inference container image construction operations are performed and the data corresponding to these operations is stored in a model warehouse; the model data in the model warehouse is read and parsed to generate three types of operators, which are combined into a model publishing task flow submitted to the container cluster to execute the model publishing process. Through five mutually independent yet closely connected construction processes, the present disclosure improves the construction efficiency of the model production line; at the same time, the production line so constructed can rapidly train new models, shortening the path to bringing a model online and increasing model production capacity.
The machine learning model automated production line construction system proposed according to the embodiments of the present disclosure is described below with reference to the accompanying drawings.
As shown in FIG. 6, the system 10 includes: an operator construction module 100, an operator orchestration module 200, a model task flow module 300, a model packaging module 400 and a model publishing module 500.
The operator construction module 100 is configured to construct operator components according to operator component configurations and store the operator components in the operator warehouse;
the operator orchestration module 200 is configured to read the operator structure data in the operator warehouse through visual orchestration and combine the operator components through business processing logic to generate a model task flow;
the model task flow module 300 is configured to convert the model task flow into a cloud-native workflow engine execution plan and submit it to the container cluster for execution to output a model file;
the model packaging module 400 is configured to perform, based on model packaging, the model file conversion and model inference container image construction operations and store the data corresponding to the operations in the model warehouse;
the model publishing module 500 is configured to read the model data in the model warehouse, parse it to generate three types of operators, and combine the three types of operator components into a model publishing task flow to be submitted to the container cluster to execute the model publishing process.
In the machine learning model automated production line construction system of the embodiments of the present disclosure, the operator construction module constructs operator components according to operator component configurations and stores them in the operator warehouse; the operator orchestration module reads the operator structure data in the operator warehouse through visual orchestration and combines the operator components through business processing logic to generate a model task flow; the model task flow module converts the model task flow into a cloud-native workflow engine execution plan and submits it to the container cluster for execution to output a model file; the model packaging module performs, based on model packaging, model file conversion and model inference container image construction operations and stores the corresponding data in the model warehouse; the model publishing module reads the model data in the model warehouse, parses it to generate three types of operators, and combines them into a model publishing task flow submitted to the container cluster to execute the model publishing process. Through five mutually independent yet closely connected construction processes, the present disclosure improves the construction efficiency of the model production line; at the same time, the production line so constructed can rapidly train new models, shortening the path to bringing a model online and increasing model production capacity.
It should be noted that the foregoing explanation of the embodiments of the machine learning model automated production line construction method also applies to the machine learning model automated production line construction apparatus of this embodiment, and is not repeated here.
To implement the above embodiments, the present disclosure further provides a non-transitory computer-readable storage medium storing computer-executable instructions which, when executed by a processor, can implement the aforementioned machine learning model automated production line construction method.
To implement the above embodiments, the present disclosure further provides an electronic device including a memory, a processor and a computer program stored in the memory and runnable on the processor, the processor implementing the aforementioned machine learning model automated production line construction method when executing the program.
To implement the above embodiments, the present disclosure further provides a computer program product including computer program code which, when run on a computer, implements the aforementioned machine learning model automated production line construction method.
To implement the above embodiments, the present disclosure further provides a computer program including computer program code which, when run on a computer, causes the computer to execute the aforementioned machine learning model automated production line construction method.
It should be noted that the foregoing explanation of the method embodiments also applies to the non-transitory computer-readable storage medium, electronic device, computer program product and computer program of the above embodiments, and is not repeated here.
In addition, the terms "first" and "second" are used for descriptive purposes only and shall not be understood as indicating or implying relative importance or implicitly specifying the number of the technical features indicated. Accordingly, a feature qualified by "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present disclosure, "multiple" means at least two, for example two or three, unless otherwise explicitly and specifically defined.
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "example", "specific example", "some examples" and the like means that a specific feature, structure, material or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present disclosure. In this specification, schematic references to these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in a suitable manner in any one or more embodiments or examples. In addition, where no contradiction arises, those skilled in the art may combine the different embodiments or examples described in this specification, and the features of those different embodiments or examples.
Although embodiments of the present disclosure have been shown and described above, it is to be understood that the above embodiments are exemplary and shall not be construed as limiting the present disclosure; within the scope of the present disclosure, a person of ordinary skill in the art may make changes, modifications, substitutions and variations to the above embodiments.
Claims (14)
- A machine learning model automated production line construction method, characterized by comprising the following steps: constructing operator components according to operator component configurations, and storing the operator components in an operator warehouse; reading the operator structure data in the operator warehouse through visual orchestration, and combining the operator components through business processing logic to generate a model task flow; converting the model task flow into a cloud-native workflow engine execution plan, and submitting it to a container cluster for execution to output a model file; based on model packaging, performing model file conversion and model inference container image construction operations, and storing the data corresponding to the operations in a model warehouse; and reading the model data in the model warehouse and parsing it to generate three types of operators, and combining the three types of operator components into a model publishing task flow to be submitted to the container cluster to execute the model publishing process.
- The machine learning model automated production line construction method according to claim 1, characterized in that constructing operator components according to operator component configurations and storing the operator components in an operator warehouse comprises: copying the operator files to a file storage dedicated to operators to solidify the files used when the operator runs; generating a Dockerfile from the operator dependency environment and base image and submitting it to the Docker Daemon to build the operator runtime image; after the build is completed, notifying the Docker Daemon to push the operator runtime image to the image registry; writing the address of the operator files in the storage and the operator runtime image information into the operator component configuration, and storing the operator component information in the operator warehouse to complete operator construction; and, according to the operator component configuration, generating an operator test template and displaying it on the front end, submitting the operator test template to generate a single-node task flow, and converting it into a cloud-native workflow execution plan submitted to the container cluster for execution to obtain the operator execution log; wherein the operator warehouse comprises a file storage, a relational database and an image registry, respectively used to store operator code, operator structure data and container image files.
- The machine learning model automated production line construction method according to claim 1 or 2, characterized in that reading the operator structure data in the operator warehouse through visual orchestration and combining the operator components through business processing logic to generate a model task flow comprises: reading the operator information currently in the operator warehouse and displaying the operator components, according to their configuration information, in the operator list on the left side of the front-end task flow canvas; placing the operators needed to build the model task flow onto the central canvas; generating operator component connection endpoints from the operator configuration, the endpoints above an operator component serving as input endpoints and those below as output endpoints; displaying an operator configuration panel on the right side of the canvas when an operator is selected; connecting the input and output ends of each operator according to the model production line flow and configuring the relevant parameters in each operator's configuration panel to complete the construction of the model workflow; and saving the constructed model task flow after construction is completed.
- The machine learning model automated production line construction method according to any one of claims 1 to 3, characterized in that the method further comprises: generating JSON configuration files in a unified format for different types of operators according to specific rules; the user connecting the input and output ends of each operator in a specific order to build the task flow, with each operator's input and output settings configured automatically according to the edges and nodes of each connection; during task flow orchestration, reading and parsing the operator structure data in the operator warehouse and dynamically generating a task flow configuration in JSON format according to the operations performed; and, when a save-task-flow operation is executed, the front end sending the JSON-format task flow configuration to the back end for storage.
- The machine learning model automated production line construction method according to any one of claims 1 to 4, characterized in that converting the model task flow into a cloud-native workflow engine execution plan and submitting it to the container cluster for execution to output a model file comprises: parsing and converting the model task flow structure data to generate a cloud-native workflow execution plan, submitting it to the container cluster to execute the model task flow, and storing the model data files produced by model task flow execution on an object storage server, including: when executing the model task flow, verifying the JSON-format task flow configuration; after verification, parsing the JSON-format model task flow configuration and converting it into a cloud-native workflow execution plan; and, after the run completes, obtaining the run log information of each node of the model workflow from the container cluster; wherein the cloud-native workflow execution plan comprises a plurality of: creating the container cluster resource objects required to run the operator components, and transfer operations for the input and output files of the operator run containers.
- The machine learning model automated production line construction method according to any one of claims 1 to 5, characterized in that performing the model file conversion and model inference container image construction operations based on model packaging and storing the data corresponding to the operations in a model warehouse comprises: receiving model configuration information entered by the user on the front end, performing templated model encapsulation through the model packaging process, parsing the model configuration information to carry out model file standardization and model inference container image construction, and storing the model inference code, data files and container images in the model warehouse as model data, the model warehouse being used to store model inference configuration data, model structure data and model inference container image files, wherein the model warehouse comprises the relational database, an object storage server and an image registry; and, in the model packaging process, selecting a model type, providing model inference operators according to corresponding rules, and, after the model type and model inference operator type are determined, providing specific data for the subsequent model data package according to a specific strategy and packing the specific data into the model data stored in the model warehouse, wherein the specific data includes the data package, the file address after model conversion, and the model instance runtime image address.
- The machine learning model automated production line construction method according to any one of claims 1 to 6, characterized in that reading the model data in the model warehouse and parsing it to generate three types of operators, and combining the three types of operator components into a model publishing task flow to be submitted to the container cluster to execute the model publishing process, comprises: receiving model service configuration information entered by the user on the front end, reading the model data in the model warehouse and parsing it to generate a model deployment operator while also generating a Service configuration operator and an Ingress configuration operator for exposing the model service, automatically orchestrating these into a task flow for model deployment and model service exposure, and parsing the task flow to generate a cloud-native workflow execution plan submitted to the container cluster for execution, completing model service publication.
- The machine learning model automated production line construction method according to claim 2, characterized in that the operator component types include a plurality of: data reading operators, data processing operators, model training operators, data export operators, visualization operators, model deployment operators and cluster configuration operators; the operator component configuration information includes a plurality of: operator files, operator input/output settings, operator parameter settings, an operator run script, an operator dependency environment, the base image required to build the operator, and the resource configuration required for the operator to run; the operator files include the operator run script and other files needed for the operator to run, the operator run script being the operator's execution entry point, an executable binary file; the operator input/output settings are used to define the operator's data sources and data output locations; and the operator parameter settings are used to define the parameters required when the operator run script executes.
- The machine learning model automated production line construction method according to claim 4, characterized in that reading the model data in the model warehouse and parsing it to generate three types of operators, and combining the three types of operator components into a model publishing task flow to be submitted to the container cluster to execute the model publishing process, further comprises: the first node of the cloud-native workflow execution plan being an Ingress object configuration node, which creates an Ingress object to route requests to the model service Service object; the second node being a Service object configuration node, which creates a Service object to load-balance request traffic across the model deployment nodes; the third node being a model deployment node, whose configuration is generated by parsing the model data, its run container being created from the model runtime image, bound to the model file and model inference code file, with container resource usage limited according to the run resource configuration; the fourth node being a Service object cleanup node and the fifth node being an Ingress object cleanup node; submitting the cloud-native workflow execution plan to the container cluster for execution, the container cluster deploying the model and opening the model service to complete the model publishing process; and, at run time, the workflow executing the first three nodes in sequence and waiting at the third node for an end signal, an exit event being triggered when the workflow ends and a callback mechanism being used to run the fourth and fifth nodes, clearing away the Service object and the Ingress object.
- A machine learning model automated production line construction system, characterized by comprising: an operator construction module, configured to construct operator components according to operator component configurations and store the operator components in an operator warehouse; an operator orchestration module, configured to read the operator structure data in the operator warehouse through visual orchestration and combine the operator components through business processing logic to generate a model task flow; a model task flow module, configured to convert the model task flow into a cloud-native workflow engine execution plan and submit it to a container cluster for execution to output a model file; a model packaging module, configured to perform, based on model packaging, the model file conversion and model inference container image construction operations and store the data corresponding to the operations in a model warehouse; and a model publishing module, configured to read the model data in the model warehouse, parse it to generate three types of operators, and combine the three types of operator components into a model publishing task flow to be submitted to the container cluster to execute the model publishing process.
- A non-transitory computer-readable storage medium, characterized in that the non-transitory computer-readable storage medium stores computer-executable instructions which, when executed by a processor, can implement the method according to any one of claims 1 to 9.
- An electronic device, characterized by comprising a memory, a processor and a computer program stored in the memory and runnable on the processor, the processor implementing the method according to any one of claims 1 to 9 when executing the program.
- A computer program product, characterized in that the computer program product comprises computer program code which, when run on a computer, implements the method according to any one of claims 1 to 9.
- A computer program, characterized in that the computer program comprises computer program code which, when run on a computer, causes the computer to execute the method according to any one of claims 1 to 9.
Applications Claiming Priority (2)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202111268941.X | 2021-10-29 | | |
| CN202111268941.XA (CN114115857B) | 2021-10-29 | 2021-10-29 | Machine learning model automated production line construction method and system |

Publications (1)

| Publication Number | Publication Date |
|---|---|
| WO2023071075A1 | 2023-05-04 |

Family ID: 80379330

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2022/087218 | Machine learning model automated production line construction method and system | 2021-10-29 | 2022-04-15 |

Country Status (2)

| Country | Link |
|---|---|
| CN | CN114115857B (zh) |
| WO | WO2023071075A1 (zh) |
Also Published As

| Publication Number | Publication Date |
|---|---|
| CN114115857A | 2022-03-01 |
| CN114115857B | 2024-04-05 |
Legal Events

| Code | Title | Description |
|---|---|---|
| 121 | Ep: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 22884991; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |