CN114518887A - Machine learning model management method, device, equipment and storage medium - Google Patents

Machine learning model management method, device, equipment and storage medium

Info

Publication number
CN114518887A
CN114518887A
Authority
CN
China
Prior art keywords
machine learning
server
learning model
target
deployment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210146941.0A
Other languages
Chinese (zh)
Inventor
李杨
杨晨
廖艺
冯彦明
黄龙飞
周锋
曹闯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Henan Zhongyuan Consumption Finance Co ltd
Original Assignee
Henan Zhongyuan Consumption Finance Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Henan Zhongyuan Consumption Finance Co ltd filed Critical Henan Zhongyuan Consumption Finance Co ltd
Priority to CN202210146941.0A
Publication of CN114518887A
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/60Software deployment
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/60Software deployment
    • G06F8/65Updates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • H04L67/1001Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers
    • H04L67/1004Server selection for load balancing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The application discloses a machine learning model management method, device, equipment and storage medium. The method comprises the following steps: acquiring server information of each server in a server cluster; acquiring a machine learning model to be deployed, and screening out a target server from all servers according to a deployment strategy corresponding to the machine learning model and the server information; and deploying the machine learning model to the target server. Models exported from different frameworks, as well as custom models, can be deployed on servers that meet their requirements, realizing one-click automatic deployment of models, effectively shortening the deployment and release time of machine learning and deep learning models, simplifying operation for large-scale model deployment and management, and saving manpower and management resources to reduce cost.

Description

Machine learning model management method, device, equipment and storage medium
Technical Field
The invention relates to the technical field of computers, and in particular to a machine learning model management method, apparatus, device, and storage medium.
Background
At present, machine learning and deep learning frameworks are proliferating. Models trained and exported from different frameworks each come with their own model publishing capability, but enterprises often use a variety of machine learning languages and frameworks to train traditional machine learning or deep learning models. Different models place different requirements on server configuration and have different deployment modes, making enterprise-level unified management difficult; as the number of models grows, this poses a serious challenge to development platforms and technical managers.
In the prior art, TF Serving from the TensorFlow framework is used for model deployment, and models in different formats can be custom-developed and deployed. However, to adapt to the deployment capability provided by another framework, each framework must perform model export and format conversion; that is, the deployment capability of TF Serving is provided by, and depends on, the TensorFlow framework, which is not necessarily the framework used in conventional machine learning modeling. Converting a model for deployment under TF Serving is difficult, format conversion between frameworks is required, and not all frameworks and formats are supported, so the approach is severely limited.
Disclosure of Invention
In view of this, the present invention aims to provide a machine learning model management method, apparatus, device, and medium that can implement one-click automatic deployment of models and effectively shorten the deployment and release time of machine learning and deep learning models. The specific scheme is as follows:
In a first aspect, the present application discloses a machine learning model management method, including:
acquiring server information of each server in a server cluster;
acquiring a machine learning model to be deployed, and screening out a target server from all servers according to a deployment strategy corresponding to the machine learning model and the server information;
deploying the machine learning model to the target server.
Optionally, the obtaining server information of each server in the server cluster includes:
acquiring server information reported by each server through a distributed coordinator corresponding to the server cluster;
the server information comprises server configuration information, server states and available resource conditions.
Optionally, the obtaining a machine learning model to be deployed, and screening out a target server from all the servers according to a deployment policy corresponding to the machine learning model and the server information includes:
acquiring a machine learning model to be deployed and a deployment strategy corresponding to the machine learning model through a preset model uploading interface, and storing the machine learning model to a shared storage; the deployment strategy comprises deployment node information, a processor running mode, a framework type of the machine learning model, and a format type of the machine learning model;
And screening out target servers from all the servers according to the deployment strategy and the server information.
Optionally, the deploying the machine learning model to the target server includes:
and generating a deployment instruction based on the server information corresponding to the target server, and sending the deployment instruction to the distributed coordinator, so that the target server in the server cluster acquires the machine learning model from the common storage after monitoring the deployment instruction from the distributed coordinator.
Optionally, after the deploying the machine learning model to the target server, the method further includes:
acquiring deployment strategy modification information;
sending the deployment strategy modification information to the distributed coordinator, so that each process of a server in the server cluster adjusts resource deployment of the corresponding machine learning model according to the deployment strategy modification information; wherein each server adopts a multi-process mode.
Optionally, after the deploying the machine learning model to the target server, the method further includes:
receiving a model calling request through a preset uniform calling interface provided by a dynamic gateway route;
Determining a target machine learning model to be called and server information corresponding to the target machine learning model according to the model calling request;
and adjusting a routing table of the dynamic gateway route according to the target machine learning model and the server information so as to call the target machine learning model.
Optionally, after the invoking the target machine learning model, the method further includes:
and recording the called information corresponding to the target machine learning model, and the input parameters and the output result of the target machine learning model through an independent log.
In a second aspect, the present application discloses a machine learning model management apparatus, comprising:
the server information acquisition module is used for acquiring the server information of each server in the server cluster;
the target server determining module is used for acquiring a machine learning model to be deployed and screening out a target server from all the servers according to a deployment strategy corresponding to the machine learning model and the server information;
a deployment module to deploy the machine learning model to the target server.
In a third aspect, the present application discloses an electronic device, comprising:
A memory for storing a computer program;
a processor for executing the computer program to implement the aforementioned machine learning model management method.
In a fourth aspect, the present application discloses a computer readable storage medium for storing a computer program; wherein the computer program when executed by a processor implements the aforementioned machine learning model management method.
In the application, server information of each server in a server cluster is obtained; a machine learning model to be deployed is obtained, and a target server is screened out from all servers according to a deployment strategy corresponding to the machine learning model and the server information; and the machine learning model is deployed to the target server. It can be seen that, after the machine learning model to be deployed and its corresponding deployment strategy are obtained, a matching server is selected as the target server according to that deployment strategy and the server information of each server, and the target server loads the model. Models exported from different frameworks, as well as custom models, can thus be deployed on servers that meet their requirements, realizing one-click automatic deployment of models, effectively shortening the deployment and release time of machine learning and deep learning models, simplifying operation for large-scale model deployment and management, and saving manpower and management resources to reduce cost.
Drawings
To more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only embodiments of the present invention, and those skilled in the art can obtain other drawings from the provided drawings without creative effort.
FIG. 1 is a flow chart of a machine learning model management method provided by the present application;
FIG. 2 is a block diagram of a machine learning model management system according to the present application;
FIG. 3 is a block diagram of another machine learning model management system provided herein;
FIG. 4 is a schematic structural diagram of a machine learning model management apparatus provided in the present application;
fig. 5 is a block diagram of an electronic device provided in the present application.
Detailed Description
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the drawings. It is obvious that the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments derived by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present invention.
In the prior art, TF Serving of the TensorFlow framework is used for model deployment; TF Serving must depend on the TensorFlow framework, format conversion is required between different frameworks, not all frameworks and formats are supported, and the approach is severely limited. Custom release of models in different formats, with independent deployment for each newly constructed model, carries very high deployment and maintenance costs and reduces the efficiency of machine learning model deployment and subsequent management. To overcome these technical problems, the application provides a machine learning model management method that can realize one-click automatic deployment of a model and effectively shorten the deployment and release time of machine learning and deep learning models.
The embodiment of the application discloses a machine learning model management method, and as shown in fig. 1, the method may include the following steps:
step S11: server information for each server in the server cluster is obtained.
In this embodiment, server information of each server in a server cluster is first obtained. The server information may be automatically reported by each server. The server cluster may be formed by networking a plurality of servers over a local area network, and may include CPU (central processing unit) servers and GPU (graphics processing unit) servers, each server being correspondingly marked as a CPU server or a GPU server.
In this embodiment, obtaining the server information of each server in the server cluster may include: acquiring the server information reported by each server through a distributed coordinator corresponding to the server cluster, where the server information comprises server configuration information, server status, and available resource conditions. For example, as shown in fig. 2, the distributed coordinator communicates with each server; that is, the servers are organized together to communicate with the distributed coordination framework and automatically upload their current server information, which includes, but is not limited to, server configuration information, server status, and available resources, so that hardware information such as the configuration and remaining available resources of each server can be seen through the management interface of the management end.
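As an illustrative sketch (not part of the patent), the reporting step can be modeled with an in-memory stand-in for the distributed coordinator; a real deployment would use a coordination framework such as ZooKeeper. All names here (`ServerInfo`, `Coordinator`, `report`) are hypothetical:

```python
import dataclasses

@dataclasses.dataclass
class ServerInfo:
    """Information a node reports: configuration, status, available resources."""
    host: str
    server_type: str        # "CPU" or "GPU" hardware tag
    cpu_cores_free: int
    mem_mb_free: int
    status: str = "ONLINE"

class Coordinator:
    """In-memory stand-in for the distributed coordinator (e.g. ZooKeeper)."""
    def __init__(self):
        self._registry = {}

    def report(self, info):
        # Each server periodically uploads its current information.
        self._registry[info.host] = info

    def cluster_view(self):
        # The management end reads the full cluster view from here.
        return dict(self._registry)

coord = Coordinator()
coord.report(ServerInfo("10.0.0.1", "GPU", cpu_cores_free=16, mem_mb_free=32768))
coord.report(ServerInfo("10.0.0.2", "CPU", cpu_cores_free=8, mem_mb_free=16384))
print(sorted(coord.cluster_view()))  # → ['10.0.0.1', '10.0.0.2']
```

The management interface described above would render `cluster_view()` so the operator can see each node's configuration and remaining resources.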
Step S12: and acquiring a machine learning model to be deployed, and screening out a target server from all the servers according to a deployment strategy corresponding to the machine learning model and the server information.
In this embodiment, a machine learning model to be deployed, uploaded by a user, and a deployment strategy corresponding to the machine learning model are obtained. The deployment strategy may include, but is not limited to, deployment node information, a processor running mode, a framework type of the machine learning model, and a format type of the machine learning model. The deployment node information specifies the one or more servers to which the machine learning model is expected to be deployed, and the processor running mode specifies whether GPU acceleration needs to be enabled. A target server meeting the deployment requirements of the machine learning model is then screened from all servers according to the deployment strategy and all the obtained server information. It can be understood that different models have different characteristics, such as framework and format, so the user uploads the corresponding deployment strategy together with the model so that automatic deployment can be performed according to that strategy. Moreover, both CPU and GPU server types can be managed: for a model that requires GPU acceleration, a GPU server can be automatically identified and the model automatically deployed to a GPU server node.
In this embodiment, obtaining the machine learning model to be deployed and screening out the target server from all the servers according to the corresponding deployment strategy and the server information may include: acquiring the machine learning model to be deployed and its corresponding deployment strategy through a preset model uploading interface, and storing the machine learning model to a shared storage, where the deployment strategy comprises deployment node information, a processor running mode, a framework type, and a format type of the machine learning model; and screening out the target server from all the servers according to the deployment strategy and the server information. For example, as shown in fig. 2, the user only needs to submit the machine learning model to be deployed and its deployment strategy through the preset model uploading interface provided by the management interface, and the system stores both in a shared storage that is visible to all servers. The deployment node information can correspond to one node or a plurality of nodes; that is, one model can be deployed on a plurality of server nodes, so that if a single server fails, the service can still be provided by other nodes, guaranteeing high availability.
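The screening step can be sketched as a simple filter over the cluster view. The field names (`processor_mode`, `node_count`, `min_mem_mb`) are hypothetical placeholders for whatever the deployment strategy actually carries:

```python
def select_targets(policy, cluster_info):
    """Screen servers against a deployment strategy (hypothetical field names).

    policy: {"node_count", "processor_mode", "framework", "format", "min_mem_mb"}
    cluster_info: {host: {"server_type", "status", "mem_mb_free"}}
    """
    needs_gpu = policy["processor_mode"] == "GPU"
    candidates = [
        host for host, info in cluster_info.items()
        if info["status"] == "ONLINE"
        and (not needs_gpu or info["server_type"] == "GPU")  # GPU nodes auto-identified
        and info["mem_mb_free"] >= policy.get("min_mem_mb", 0)
    ]
    # Deploy to several nodes so a single failure does not take the service down.
    return sorted(candidates)[: policy["node_count"]]

cluster = {
    "10.0.0.1": {"server_type": "GPU", "status": "ONLINE", "mem_mb_free": 32768},
    "10.0.0.2": {"server_type": "CPU", "status": "ONLINE", "mem_mb_free": 16384},
    "10.0.0.3": {"server_type": "GPU", "status": "OFFLINE", "mem_mb_free": 32768},
}
policy = {"node_count": 2, "processor_mode": "GPU", "framework": "sklearn",
          "format": "pickle", "min_mem_mb": 1024}
print(select_targets(policy, cluster))  # → ['10.0.0.1']
```

Only the online GPU node qualifies here; offline nodes and CPU nodes are filtered out when GPU acceleration is requested.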
Step S13: deploying the machine learning model to the target server.
After the target server is determined, the machine learning model is deployed to the target server. There can be a plurality of target servers; combined with the distributed management capability of the distributed coordinator, one-click automatic multi-machine deployment of a specified model can be realized, easily alleviating resource shortages. In this embodiment, deploying the machine learning model to the target server may include: generating a deployment instruction based on the server information corresponding to the target server, and sending the deployment instruction to the distributed coordinator, so that the target server in the server cluster acquires the machine learning model from the shared storage after monitoring the deployment instruction on the distributed coordinator.
It can be understood that after the user uploads the model, the model file is stored in the shared storage, visible to all servers. Meanwhile, according to the resources monitored by the distributed coordinator, the management end assigns a target server with idle resources that meets the requirements to load the model, and synchronizes the corresponding deployment instruction to the distributed coordinator. The listener on each server monitors information changes on the distributed coordinator throughout, so the latest data is obtained immediately; when a target server finds that it needs to load the model, it automatically pulls the corresponding machine learning model from the shared storage, loads it into local memory, and provides the service externally in the specified way. That is to say, the user only needs to upload the trained and exported machine learning or deep learning model and mark whether GPU acceleration is needed, which framework and format the model uses, and which server node or how many nodes it should be deployed to, and the system can automatically perform model deployment according to this information.
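The instruction-publish / watcher-pull flow above can be sketched with plain dictionaries standing in for the coordinator and the shared storage; the names and paths (`/deploy/...`, `publish_deploy_instruction`) are illustrative assumptions, not the patent's API:

```python
SHARED_STORAGE = {"fraud_model_v1": b"serialized-model-bytes"}  # visible to all nodes
COORDINATOR = {}   # path -> instruction, watched by every server's listener
LOADED = {}        # host -> {model_name: model_bytes}, the nodes' local memory

def publish_deploy_instruction(model_name, targets):
    # Management end: write the deployment instruction to the coordinator.
    COORDINATOR[f"/deploy/{model_name}"] = {"model": model_name, "targets": targets}

def on_coordinator_change(host):
    # Server side: the listener fires on every coordinator change; a node that
    # finds itself among the targets pulls the model from shared storage.
    for instr in COORDINATOR.values():
        if host in instr["targets"]:
            LOADED.setdefault(host, {})[instr["model"]] = SHARED_STORAGE[instr["model"]]

publish_deploy_instruction("fraud_model_v1", ["10.0.0.1", "10.0.0.2"])
for h in ("10.0.0.1", "10.0.0.2", "10.0.0.3"):
    on_coordinator_change(h)
print(sorted(LOADED))  # → ['10.0.0.1', '10.0.0.2']
```

Node `10.0.0.3` sees the same change notification but ignores it, since it is not listed in the instruction's targets.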
In this embodiment, after deploying the machine learning model to the target server, the method may further include: acquiring deployment strategy modification information; and sending the deployment strategy modification information to the distributed coordinator, so that each process of a server in the server cluster adjusts the resource deployment of the corresponding machine learning model according to the modification information, each server adopting a multi-process mode. As shown in fig. 2, a plurality of machine learning and deep learning models can reuse the computing resources of a server, maximizing server resource utilization. When the deployment strategy of a model is modified, each process monitors the modification information and automatically reloads the latest model deployment strategy, ensuring that every process is up to date. For example, suppose it was specified at upload time that a certain model should be deployed on 5 servers; when the request volume grows too large for those 5 servers to meet the concurrency requirement, the user only needs to change the deployment node count in the model's deployment strategy to 10 on the management interface, and the distributed coordinator will assign 5 additional servers to deploy the model, achieving horizontal resource expansion. Therefore, when the server cluster needs to be expanded, nodes only need to be added horizontally, and in theory there is no computing resource bottleneck. A distributed machine learning model management system with dynamically expandable computing power is thus constructed, allowing users and managers to perform one-click deployment and monitoring of enterprise-level machine learning and deep learning models through ordinary interface operations; bringing a single model online or offline, or its failure, does not interfere with other models, meeting the requirement of high availability.
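The horizontal-expansion step (growing a model from 5 to 10 nodes by editing its strategy) can be sketched as a redistribution function; `rescale` and its arguments are hypothetical names for illustration only:

```python
def rescale(policy, cluster_hosts, current_targets):
    """Recompute target nodes after the deployment strategy is modified.

    Existing deployments are kept; when `node_count` grows, additional
    online nodes are assigned until the requested count is reached.
    """
    wanted = policy["node_count"]
    targets = [h for h in current_targets if h in cluster_hosts][:wanted]
    for host in sorted(cluster_hosts):
        if len(targets) >= wanted:
            break
        if host not in targets:
            targets.append(host)   # coordinator assigns an extra node
    return targets

hosts = [f"10.0.0.{i}" for i in range(1, 11)]   # a 10-node cluster
old = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4", "10.0.0.5"]
new = rescale({"node_count": 10}, hosts, old)
print(len(new))  # → 10
```

The original 5 nodes stay in place, so serving continues uninterrupted while the 5 new nodes load the model.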
In this embodiment, after deploying the machine learning model to the target server, the method may further include: receiving a model calling request through a preset unified calling interface provided by a dynamic gateway route; determining, according to the model calling request, the target machine learning model to be called and the server information corresponding to it; and adjusting the routing table of the dynamic gateway route according to the target machine learning model and the server information so as to call the target machine learning model. It can be understood that, as shown in fig. 3, when a model is deployed on multiple servers, the dynamic gateway route provides a unified calling interface to the caller, so the calling path of the model stays fixed regardless of which nodes the model is deployed on. At the bottom layer of the cluster there are many servers, each with its own IP address; for the convenience of users, the externally exposed IP should be fixed, so all calls must be dynamically routed: the gateway automatically selects the designated server according to the model information and lets that server process the request. The dynamic gateway route caches the correspondence between machine learning models and nodes from the distributed coordination framework; when an external request calls a specified model, the gateway looks up the servers on which the model is deployed, distributes the request to one of them, and that server returns the result.
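A minimal sketch of such a gateway, with a cached routing table and round-robin dispatch across the nodes hosting a model (round-robin is an assumed load-balancing choice; the patent does not specify one):

```python
import itertools

class DynamicGateway:
    """Unified calling interface: one fixed path per model, routed to
    whichever backend nodes currently host it."""
    def __init__(self):
        self._routes = {}     # model name -> list of hosts (cached from coordinator)
        self._cursors = {}

    def update_route(self, model, hosts):
        # Called when the model-to-node mapping changes on the coordinator.
        self._routes[model] = list(hosts)
        self._cursors[model] = itertools.cycle(hosts)

    def dispatch(self, model):
        # The caller always uses the same path; the gateway picks a backend.
        if model not in self._routes:
            raise KeyError(f"model {model!r} not deployed")
        return next(self._cursors[model])

gw = DynamicGateway()
gw.update_route("fraud_model_v1", ["10.0.0.1", "10.0.0.2"])
picks = [gw.dispatch("fraud_model_v1") for _ in range(4)]
print(picks)  # → ['10.0.0.1', '10.0.0.2', '10.0.0.1', '10.0.0.2']
```

The caller never sees backend IPs change: redeploying the model elsewhere only requires another `update_route` call.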
Each server also maintains a heartbeat with the distributed coordination framework. When a server fails, the distributed coordination framework detects that the server's heartbeat has lapsed and considers the server failed; other servers are then found and used to automatically reload the models that had been deployed on the failed server, while the gateway updates its cache and automatically removes the failed server, guaranteeing the availability of the overall service.
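The heartbeat-based failover can be sketched as two steps: detect lapsed heartbeats, then move the failed node's models to spares and purge the gateway cache. The timeout value and function names are illustrative assumptions:

```python
import time

HEARTBEAT_TIMEOUT = 10.0  # seconds without a heartbeat before a node is deemed failed

def detect_failures(last_heartbeat, now):
    """Return hosts whose heartbeat has lapsed."""
    return {h for h, t in last_heartbeat.items() if now - t > HEARTBEAT_TIMEOUT}

def failover(routes, failed, spare_hosts):
    """Reload models from failed nodes onto spares and rebuild the route cache."""
    new_routes = {}
    spares = list(spare_hosts)
    for model, hosts in routes.items():
        alive = [h for h in hosts if h not in failed]   # failed node removed
        while len(alive) < len(hosts) and spares:
            alive.append(spares.pop(0))                 # another node reloads the model
        new_routes[model] = alive
    return new_routes

now = time.time()
beats = {"10.0.0.1": now, "10.0.0.2": now - 60}         # node 2 stopped beating
failed = detect_failures(beats, now)
routes = failover({"fraud_model_v1": ["10.0.0.1", "10.0.0.2"]}, failed, ["10.0.0.3"])
print(routes)  # → {'fraud_model_v1': ['10.0.0.1', '10.0.0.3']}
```

The replica count is preserved, so the caller-facing service keeps its capacity through the failure.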
In this embodiment, after calling the target machine learning model, the method may further include: recording, through an independent log, the call information corresponding to the target machine learning model together with its input parameters and output results. Storing the call logs independently makes it convenient to query, for any moment, each model's call log and information such as its inputs and prediction results, realizing unified model deployment and service publishing, service monitoring, and log management, with each model's call information and health-check status queryable in a unified manner.
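A per-model call log can be as simple as appending one structured record per invocation; the schema below (`model`, `ts`, `inputs`, `outputs`) is a hypothetical example, not the patent's format:

```python
import json
import datetime

def log_call(log_store, model, inputs, outputs):
    """Append one call record to the model's independent log."""
    record = {
        "model": model,
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "inputs": inputs,       # the model's input parameters
        "outputs": outputs,     # the prediction result
    }
    # One log per model, so each model's history can be queried independently.
    log_store.setdefault(model, []).append(json.dumps(record))
    return record

logs = {}
log_call(logs, "fraud_model_v1", {"amount": 120.5}, {"score": 0.91})
print(len(logs["fraud_model_v1"]))  # → 1
```

Keeping one log stream per model is what makes the unified monitoring described above possible: querying a model's inputs and prediction results at any moment reduces to scanning that model's own records.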
As can be seen from the above, in this embodiment, server information of each server in a server cluster is obtained; a machine learning model to be deployed is obtained, and a target server is screened out from all servers according to a deployment strategy corresponding to the machine learning model and the server information; and the machine learning model is deployed to the target server. After the model and its deployment strategy are obtained, a matching server is selected as the target server according to that strategy and the server information of each server, and the target server loads the model. Models exported from different frameworks, as well as custom models, can thus be deployed on servers that meet their requirements, realizing one-click automatic deployment of models, effectively shortening the deployment and release time of machine learning and deep learning models, simplifying operation for large-scale model deployment and management, and saving manpower and management resources to reduce cost.
Correspondingly, the embodiment of the present application further discloses a machine learning model management apparatus, as shown in fig. 4, the apparatus includes:
a server information obtaining module 11, configured to obtain server information of each server in a server cluster;
the target server determining module 12 is configured to obtain a machine learning model to be deployed, and screen out a target server from all the servers according to a deployment policy corresponding to the machine learning model and the server information;
a deployment module 13, configured to deploy the machine learning model to the target server.
With the above apparatus, server information of each server in the server cluster is obtained; a machine learning model to be deployed is obtained, and a target server is screened out from all servers according to a deployment strategy corresponding to the machine learning model and the server information; and the machine learning model is deployed to the target server. After the model and its deployment strategy are obtained, a matching server is selected as the target server according to that strategy and the server information of each server, and the target server loads the model. Models exported from different frameworks, as well as custom models, can thus be deployed on servers that meet their requirements, realizing one-click automatic deployment of models, effectively shortening the deployment and release time of machine learning and deep learning models, simplifying operation for large-scale model deployment and management, and saving manpower and management resources to reduce cost.
In some specific embodiments, the server information obtaining module 11 may be specifically configured to obtain, by using a distributed coordinator corresponding to the server cluster, server information reported by each server; the server information comprises server configuration information, server states and available resource conditions.
In some specific embodiments, the target server determining module 12 may specifically include:
the deployment strategy acquisition unit is used for acquiring a machine learning model to be deployed and a deployment strategy corresponding to the machine learning model through a preset model uploading interface, and storing the machine learning model to a shared storage; the deployment strategy comprises deployment node information, a processor running mode, a framework type of the machine learning model, and a format type of the machine learning model;
and the target server determining unit is used for screening out the target servers from all the servers according to the deployment strategy and the server information.
In some specific embodiments, the deployment module 13 may be specifically configured to generate a deployment instruction based on the server information corresponding to the target server and send the deployment instruction to the distributed coordinator, so that the target server in the server cluster acquires the machine learning model from the shared storage after monitoring the deployment instruction on the distributed coordinator.
In some embodiments, the machine learning model management apparatus may specifically include:
the deployment strategy modification information acquisition unit is used for acquiring deployment strategy modification information;
the resource deployment adjusting unit is used for sending the deployment strategy modification information to the distributed coordinator, so that each process of the server in the server cluster adjusts the resource deployment of the corresponding machine learning model according to the deployment strategy modification information; wherein each server adopts a multi-process mode.
In some embodiments, the machine learning model management apparatus may specifically include:
the model calling request receiving unit is used for receiving a model calling request through a preset unified calling interface provided by dynamic gateway routing;
the target machine learning model determining unit is used for determining a target machine learning model to be called and server information corresponding to the target machine learning model according to the model calling request;
and the target machine learning model calling unit is used for adjusting the routing table of the dynamic gateway routing according to the target machine learning model and the server information, so as to call the target machine learning model.
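The unified-entry idea can be sketched as one call interface backed by a mutable routing table that maps a model name to the server currently hosting it. The table shape and request fields below are assumptions for illustration:

```python
# Dynamic gateway routing sketch: every call enters through one unified
# interface; the routing table decides which server serves which model
# and can be adjusted at runtime. Shapes are illustrative assumptions.
routing_table = {}

def register_route(model, server):
    # Adjusting the dynamic routing table: point a model at its server.
    routing_table[model] = server

def call_model(request):
    model = request["model"]          # target model named in the request
    server = routing_table[model]     # server information for that model
    return f"forwarded {model} call to {server}"

register_route("fraud_v3", "10.0.0.1:8500")
print(call_model({"model": "fraud_v3", "inputs": [1.0, 2.0]}))
```

Because callers only ever see the unified interface, redeploying a model to a different server reduces to one `register_route` update; no client changes.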
In some embodiments, the machine learning model management apparatus may specifically include:
and the recording unit is used for recording the called information corresponding to the target machine learning model, the input parameters of the target machine learning model and the output result through an independent log.
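An independent per-model log can be sketched with the standard `logging` module: one logger per deployed model keeps its call records separate. The logger naming scheme and record fields here are assumptions, not the patent's format:

```python
# Independent call log per model: each model gets its own logger, so
# its called information, input parameters and output results stay in
# a separate log stream. Names and fields are illustrative.
import json
import logging

def get_model_logger(model_name):
    """One logger (hence one independent log) per deployed model."""
    logger = logging.getLogger(f"model.{model_name}")
    logger.setLevel(logging.INFO)
    return logger

def record_call(model_name, inputs, outputs):
    entry = {"model": model_name, "inputs": inputs, "outputs": outputs}
    get_model_logger(model_name).info(json.dumps(entry))
    return entry

rec = record_call("fraud_v3", {"amount": 120.5}, {"score": 0.87})
print(rec["model"])
```

Attaching a per-logger `FileHandler` would put each model's records in its own file, which is what makes later auditing of a single model's inputs and outputs straightforward.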
Further, an embodiment of the present application also discloses an electronic device, shown in fig. 5; the content of the drawing should not be construed as limiting the scope of the application in any way.
Fig. 5 is a schematic structural diagram of an electronic device 20 according to an embodiment of the present disclosure. The electronic device 20 may specifically include: at least one processor 21, at least one memory 22, a power supply 23, a communication interface 24, an input/output interface 25, and a communication bus 26. The memory 22 is used for storing a computer program, which is loaded and executed by the processor 21 to implement the relevant steps of the machine learning model management method disclosed in any of the foregoing embodiments.
In this embodiment, the power supply 23 is configured to provide a working voltage for each hardware device on the electronic device 20; the communication interface 24 can create a data transmission channel between the electronic device 20 and an external device, and the communication protocol it follows may be any protocol applicable to the technical solution of the present application, which is not specifically limited herein; the input/output interface 25 is configured to obtain external input data or to output data to the outside, and its specific interface type may be selected according to application requirements, which is likewise not specifically limited herein.
In addition, the memory 22, as a carrier for storing resources, may be a read-only memory, a random access memory, a magnetic disk, an optical disk, or the like; the resources stored thereon include an operating system 221, a computer program 222, and data 223 including the machine learning models, and the storage may be transient or persistent.
The operating system 221 is configured to manage and control each hardware device and the computer program 222 on the electronic device 20, so that the processor 21 can operate on and process the data 223 in the memory 22; it may be Windows Server, Netware, Unix, Linux, or the like. In addition to the computer program that performs the machine learning model management method disclosed in any of the foregoing embodiments, the computer program 222 may further include computer programs used to perform other specific tasks.
Further, an embodiment of the present application also discloses a computer storage medium, where computer-executable instructions are stored in the computer storage medium, and when the computer-executable instructions are loaded and executed by a processor, the steps of the machine learning model management method disclosed in any of the foregoing embodiments are implemented.
In the present specification, the embodiments are described in a progressive manner, and each embodiment focuses on differences from other embodiments, and the same or similar parts between the embodiments are referred to each other. The device disclosed in the embodiment corresponds to the method disclosed in the embodiment, so that the description is simple, and the relevant points can be referred to the description of the method part.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Finally, it should also be noted that, in this document, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The machine learning model management method, apparatus, device and medium provided by the present invention have been described in detail above. Specific examples are used herein to explain the principles and embodiments of the present invention, and the descriptions of the above embodiments are only intended to help in understanding the method and its core idea. Meanwhile, for those skilled in the art, the specific embodiments and the application scope may be changed according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (10)

1. A machine learning model management method, comprising:
acquiring server information of each server in a server cluster;
acquiring a machine learning model to be deployed, and screening out a target server from all servers according to a deployment strategy corresponding to the machine learning model and the server information;
deploying the machine learning model to the target server.
2. The method for machine learning model management according to claim 1, wherein the obtaining server information of each server in a server cluster comprises:
acquiring server information reported by each server through a distributed coordinator corresponding to the server cluster;
the server information comprises server configuration information, server state, and available resources.
3. The machine learning model management method according to claim 2, wherein the acquiring a machine learning model to be deployed and screening out a target server from all the servers according to a deployment strategy corresponding to the machine learning model and the server information comprises:
acquiring, through a preset model upload interface, a machine learning model to be deployed and a deployment strategy corresponding to the machine learning model, and storing the machine learning model in shared storage; the deployment strategy comprises deployment node information, a processor running mode, the framework type of the machine learning model and the format type of the machine learning model;
and screening out target servers from all the servers according to the deployment strategy and the server information.
4. The machine learning model management method of claim 3, wherein the deploying the machine learning model to the target server comprises:
generating a deployment instruction based on the server information corresponding to the target server, and sending the deployment instruction to the distributed coordinator, so that the target server in the server cluster, after monitoring the deployment instruction from the distributed coordinator, acquires the machine learning model from the shared storage.
5. The method according to claim 2, further comprising, after deploying the machine learning model to the target server:
acquiring deployment strategy modification information;
sending the deployment strategy modification information to the distributed coordinator, so that each process of each server in the server cluster adjusts the resource deployment of its corresponding machine learning model according to the deployment strategy modification information; wherein each server adopts a multi-process mode.
6. The machine learning model management method of any one of claims 1 to 5, further comprising, after the deploying the machine learning model to the target server:
receiving a model calling request through a preset unified calling interface provided by dynamic gateway routing;
determining a target machine learning model to be called and server information corresponding to the target machine learning model according to the model calling request;
and adjusting the routing table of the dynamic gateway routing according to the target machine learning model and the server information, so as to call the target machine learning model.
7. The machine learning model management method of claim 6, further comprising, after said invoking the target machine learning model:
recording, through an independent log, the called information corresponding to the target machine learning model, and the input parameters and output result of the target machine learning model.
8. A machine learning model management apparatus, comprising:
the server information acquisition module is used for acquiring the server information of each server in the server cluster;
the target server determining module is used for acquiring a machine learning model to be deployed and screening out a target server from all the servers according to a deployment strategy corresponding to the machine learning model and the server information;
a deployment module to deploy the machine learning model to the target server.
9. An electronic device, comprising:
a memory for storing a computer program;
a processor for executing the computer program to implement the machine learning model management method of any one of claims 1 to 7.
10. A computer-readable storage medium for storing a computer program; wherein the computer program when executed by the processor implements a machine learning model management method as claimed in any one of claims 1 to 7.
CN202210146941.0A 2022-02-17 2022-02-17 Machine learning model management method, device, equipment and storage medium Pending CN114518887A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210146941.0A CN114518887A (en) 2022-02-17 2022-02-17 Machine learning model management method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114518887A 2022-05-20

Family

ID=81598971

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210146941.0A Pending CN114518887A (en) 2022-02-17 2022-02-17 Machine learning model management method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114518887A (en)

Similar Documents

Publication Publication Date Title
US8200789B2 (en) Method, system and program product for automated topology formation in dynamic distributed environments
CN110677305B (en) Automatic scaling method and system in cloud computing environment
CN111858054B (en) Resource scheduling system and method based on edge computing in heterogeneous environment
CN108449350B (en) Multi-protocol arranging method and device
CN113742031B (en) Node state information acquisition method and device, electronic equipment and readable storage medium
CN110580198B (en) Method and device for adaptively switching OpenStack computing node into control node
CN108270818A (en) A kind of micro services architecture system and its access method
US11886904B2 (en) Virtual network function VNF deployment method and apparatus
CN108427619B (en) Log management method and device, computing equipment and storage medium
CN104243185A (en) Experiential service monitoring system and method
CN111416875A (en) Service directory synchronization method and system oriented to cloud edge coordination
CN111984289A (en) Service updating method, device, equipment and storage medium
CN115248692A (en) Device and method for supporting cloud deployment of multiple deep learning framework models
CN113079098A (en) Method, device, equipment and computer readable medium for updating route
CN114900449B (en) Resource information management method, system and device
CN111522664A (en) Service resource management and control method and device based on distributed service
CN114518887A (en) Machine learning model management method, device, equipment and storage medium
CN112822062A (en) Management method for desktop cloud service platform
CN116032932A (en) Cluster management method, system, equipment and medium for edge server
CN115225645A (en) Service updating method, device, system and storage medium
CN114721827A (en) Data processing method and device
CN116860382A (en) Container-based method and device for achieving micro-service cluster
CN114579298A (en) Resource management method, resource manager, and computer-readable storage medium
CN111338647B (en) Big data cluster management method and device
Chen et al. A Deployment Management of High-Availability Microservices for Edge Computing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination