CN112311605A - Cloud platform and method for providing machine learning service - Google Patents
- Publication number
- CN112311605A (application CN202011226841.6A)
- Authority
- CN
- China
- Prior art keywords
- environment
- request
- machine learning
- training
- training data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Images
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/02—Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]
- H04L67/025—Protocols based on web technology, e.g. hypertext transfer protocol [HTTP] for remote control or remote monitoring of applications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/04—Network management architectures or arrangements
- H04L41/044—Network management architectures or arrangements comprising hierarchical management structures
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/08—Configuration management of networks or network elements
- H04L41/0803—Configuration setting
- H04L41/0823—Configuration setting characterised by the purposes of a change of settings, e.g. optimising configuration for enhancing reliability
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/10—Protocols in which an application is distributed across nodes in the network
- H04L67/1095—Replication or mirroring of data, e.g. scheduling or transport for data synchronisation between network nodes
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/12—Protocols specially adapted for proprietary or special-purpose networking environments, e.g. medical networks, sensor networks, networks in vehicles or remote metering networks
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/50—Network services
- H04L67/60—Scheduling or organising the servicing of application requests, e.g. requests for application data transmissions using the analysis and optimisation of the required network resources
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
A cloud platform and method for providing machine learning services, comprising an IaaS layer, a PaaS layer, and a SaaS layer. The IaaS layer is provided with a system support module; the PaaS layer is provided with Kubernetes and Docker; and the SaaS layer comprises a public library service module, a RESTful microservice module, an application service module, and a management and maintenance module. The public library service module is used for logging, configuration parameters, and mathematical calculation; the RESTful microservice module processes received requests and handles scheduling and lifecycle management of WEB classification tasks; the application service module displays task running states and machine learning results; and the management and maintenance module manages image resources using Harbor. With the scheme of the present application, the environment can be easily migrated, tracking experiments and deploying machine learning become straightforward, and experimental results can be reproduced.
Description
Technical Field
The present application relates to cloud computing technologies, and in particular, to a cloud platform and method for providing machine learning services.
Background
Machine learning is a very popular technology and is widely applied in fields such as security, transportation, medical treatment, finance, and retail.
Although machine learning can produce excellent results, its use is still complex in practice. Beyond the common challenges of software development, machine learning developers face new ones, including experiment management (e.g., tracking which parameters, code, and data produced a result), reproducibility (e.g., being able to execute the same code later in the same operating environment), deploying models to production, and data governance (auditing the models and data used throughout the organization). These workflow challenges surrounding the machine learning lifecycle are typically the biggest hurdles to using machine learning in a production environment and scaling it within an organization.
At present, some cloud platforms support online training services for algorithms. These platforms typically work as follows: the user applies, by clicking, for resources such as GPUs and storage and for the corresponding environment, and then writes their own code for training. The following problems remain when such cloud platforms are used for machine learning:
1. There are countless independent tools: from data preparation to model training, hundreds of software tools cover each phase of the machine learning lifecycle, and machine learning developers need to build production environments around dozens of libraries;
2. Experimental results are difficult to reproduce: model training requires a large amount of data and a purpose-built environment, so when the model is used in a user's actual production environment, the environment must be redeployed and a large amount of data obtained again before approximately the same experimental results can be achieved;
3. Tracking experiments and deploying machine learning is difficult: machine learning algorithms have dozens of configurable parameters, tracking these parameters and their values is very hard, and migrating a trained model to a production environment is very challenging.
Problems existing in the prior art:
at present, there is no cloud platform specifically for machine learning, so a user must repeat a large amount of complex work, such as data preparation and environment deployment, when using a model in different environments.
Disclosure of Invention
The examples of the present application provide a cloud platform and a method for providing machine learning services, so as to solve the above technical problems.
According to a first aspect of the examples of the present application, there is provided a cloud platform for providing machine learning services, comprising an IaaS layer, a PaaS layer, and a SaaS layer, wherein the IaaS layer is provided with a system support module, the PaaS layer is provided with Kubernetes and Docker, and the SaaS layer comprises a public library service module, a RESTful microservice module, an application service module, and a management and maintenance module, wherein,
the public library service module is used for logging, configuration parameters, and mathematical calculation;
the RESTful microservice module is used for processing received requests and for scheduling and lifecycle management of WEB classification tasks;
the application service module is used for displaying task running states and machine learning results;
and the management and maintenance module is used for managing image resources using Harbor.
According to a second aspect of an example of the present application, there is provided a method for machine learning by using the cloud platform for providing a machine learning service, including:
acquiring a training data set;
loading the training data set and a predetermined initial algorithm model into a predetermined GPU cluster, and training to obtain a trained algorithm model;
and using Harbor to save images of the environment and the data, respectively, according to the received environment saving request and data saving request.
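The three steps above can be sketched as a minimal pipeline. This is an illustrative stand-in only: the function names, the in-memory dictionaries playing the roles of the data center and the Harbor registry, and the stub "training" step are all hypothetical, not the platform's actual API.

```python
# Minimal sketch of the method's three steps: acquire data, train, save images.
# All names are illustrative; a real deployment would call MinIO, the GPU
# cluster scheduler, and Harbor instead of these in-memory stand-ins.

def acquire_training_set(dataset_id, data_center):
    """Step 1: fetch the training data set from the data center."""
    return data_center[dataset_id]

def train(model_params, training_set):
    """Step 2: stand-in for loading model and data onto the GPU cluster."""
    # "Training" here just records the parameters and samples seen.
    return {"params": model_params, "samples_seen": len(training_set)}

def save_images(env_name, model, registry):
    """Step 3: save environment and data images to the registry stand-in."""
    registry[f"{env_name}/env"] = {"kind": "environment"}
    registry[f"{env_name}/data"] = {"kind": "data", "model": model}
    return sorted(registry)

data_center = {"faces-v1": ["img_%03d.jpg" % i for i in range(32)]}
registry = {}

training_set = acquire_training_set("faces-v1", data_center)
model = train({"lr": 0.01, "epochs": 5}, training_set)
saved = save_images("face-recognition", model, registry)
print(saved)
```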
The cloud service here is customized for machine learning business. With the cloud platform provided in this example, once a training run is completed, the environment and the data can each be saved as image resources. Machine learning developers no longer need to redeploy the environment in subsequent production environments: the trained model can simply be migrated to production, tracking experiments and deploying machine learning become straightforward, and experimental results can be reproduced.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate example embodiments of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 illustrates a schematic structural diagram of a cloud platform for providing a machine learning service in an example one of the present application;
FIG. 2 is a flow chart illustrating a method for providing machine learning services in example two of the present application;
fig. 3 shows an architecture diagram of a machine learning cloud platform in example four of the present application.
Detailed Description
In order to make the technical solutions and advantages in the examples of the present application more apparent, the following further detailed description of the examples of the present application with reference to the accompanying drawings makes it clear that the described examples are only a part of the examples of the present application, and not an exhaustive list of all examples. It should be noted that the examples and features of the examples in this application may be combined with each other without conflict.
Example one
Fig. 1 shows a schematic structural diagram of a cloud platform for providing a machine learning service in an example one of the present application.
As shown, the cloud platform for providing machine learning services includes an IaaS layer, a PaaS layer, and a SaaS layer, wherein the IaaS layer is provided with a system support module, the PaaS layer is provided with Kubernetes and Docker, and the SaaS layer comprises a public library service module, a RESTful microservice module, an application service module, and a management and maintenance module, wherein,
the public library service module is used for logging, configuration parameters, and mathematical calculation;
the RESTful microservice module is used for processing received requests and for scheduling and lifecycle management of WEB classification tasks;
the application service module is used for displaying task running states and machine learning results;
and the management and maintenance module is used for managing image resources using Harbor.
In a specific implementation, the WEB classification tasks may include machine learning training tasks, environment and data image-saving tasks, and the like.
The cloud service here is customized for machine learning business. With the cloud platform provided in this example, once a training run is completed, the environment and the data can each be saved as image resources. Machine learning developers no longer need to redeploy the environment in subsequent production environments: the trained model can simply be migrated to production, tracking experiments and deploying machine learning become straightforward, and experimental results can be reproduced.
In one embodiment, the RESTful microservice module comprises:
the training data storage unit is used for receiving a training data storage request and storing the training data in the training data storage request;
and the training data extraction unit is used for receiving a training data extraction request and extracting and feeding back the requested training data according to the training data extraction request.
In one embodiment, the training data saving unit is configured to receive a training data saving request and save training data in the training data saving request to a cloud data center.
In one embodiment, the RESTful microservice module comprises:
the environment storage unit is used for receiving an environment storage request and storing an environment mirror image in the environment storage request;
and the environment extraction unit is used for receiving an environment extraction request and extracting and returning the requested environment according to the environment extraction request.
In one embodiment, the environment saving unit is configured to receive an environment saving request and save an environment image in the environment saving request to a local training center.
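A minimal sketch of how the four units above could be separated, with plain dictionaries standing in for the cloud data center and the local training center. The class and method names are hypothetical illustrations, not the platform's actual interface.

```python
# Hypothetical sketch: training data goes to the cloud data center,
# environment images go to the local training center, as described above.

class RestfulMicroservice:
    def __init__(self):
        self.cloud_data_center = {}      # stand-in for MinIO / cloud storage
        self.local_training_center = {}  # stand-in for the local image store

    # training data saving / extraction units
    def save_training_data(self, request):
        self.cloud_data_center[request["name"]] = request["data"]
        return {"status": "saved", "location": "cloud_data_center"}

    def extract_training_data(self, request):
        return self.cloud_data_center[request["name"]]

    # environment saving / extraction units
    def save_environment(self, request):
        self.local_training_center[request["name"]] = request["image"]
        return {"status": "saved", "location": "local_training_center"}

    def extract_environment(self, request):
        return self.local_training_center[request["name"]]

svc = RestfulMicroservice()
svc.save_training_data({"name": "faces-v1", "data": [b"img0", b"img1"]})
svc.save_environment({"name": "tf-train-env", "image": "sha256:abc123"})
print(svc.extract_environment({"name": "tf-train-env"}))
```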
In one embodiment, the administration and maintenance module includes:
the image making unit is used for building Docker images; specifically, this may refer to packaging a training environment or the code blocks required for machine learning into an image file.
The publishing unit is used for publishing Docker images; specifically, the platform centrally manages all built images, and a user can select the image file they need, so that their machine learning training environment can be deployed quickly.
The monitoring unit is used for monitoring Kubernetes, Docker, and microservice resources; specifically, the monitoring unit may monitor Kubernetes, Docker, and microservice resources in real time to ensure efficient operation of the whole cluster. If a machine learning training environment crashes, this embodiment learns of the crash through the monitoring unit and quickly restores the environment, so that the related data are not lost and high service availability is ensured.
And the orchestration unit is used for orchestrating resources. Specifically, machine learning training requires a large amount of computing resources, including CPU, GPU, memory, disk, network, and so on. Orchestration allocates these resources within the cluster so that all users can make better use of the cluster's elastic computing resources.
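The orchestration unit's core decision — place a task on a host whose free resources satisfy its CPU/GPU/memory request — can be sketched as follows. The host names, resource figures, and first-fit policy are illustrative assumptions; a real platform delegates this to the Kubernetes scheduler.

```python
# Hypothetical sketch of the orchestration unit: pick the first host whose
# free resources satisfy the task's request, then reserve those resources.

HOSTS = {
    "node-1": {"cpu": 16, "gpu": 0, "mem_gb": 64},
    "node-2": {"cpu": 32, "gpu": 4, "mem_gb": 128},
}

def schedule(task, hosts):
    """Return the name of the first host with enough free resources, else None."""
    for name, free in sorted(hosts.items()):
        if all(free.get(k, 0) >= v for k, v in task["needs"].items()):
            for k, v in task["needs"].items():
                free[k] -= v  # reserve resources on the chosen host
            return name
    return None

task = {"name": "train-faces", "needs": {"cpu": 8, "gpu": 2, "mem_gb": 32}}
placed_on = schedule(task, HOSTS)
print(placed_on)
```

node-1 is skipped here because it has no GPUs; a real scheduler would also weigh affinity, fragmentation, and fairness rather than using plain first-fit.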
Example two
Based on the same inventive concept, the application example provides a method for machine learning by using the cloud platform for providing the machine learning service as in the first example.
Fig. 2 is a flowchart illustrating a method for providing a machine learning service according to example two of the present application.
As shown, the method for providing a machine learning service includes:
Step 201, acquiring a training data set;
in one embodiment, the training data set is stored on disks of the IaaS layer. This example manages training sets through the SaaS service MinIO and calls the public library service module to operate MinIO, so as to obtain the corresponding training data set.
Step 202, loading the training data set and a predetermined initial algorithm model into a predetermined GPU cluster, and training to obtain a trained algorithm model;
in one implementation, resources such as a GPU, a CPU and a storage are firstly distributed on an IaaS layer, then a mirror image is loaded to the IaaS through a kubernets public service module, and then a user can check and use the algorithm training environment of the user through RESTful service.
And step 203, using Harbor to save images of the environment and the data, respectively, according to the received environment saving request and data saving request.
In one implementation, when the RESTful service receives a user's request to save a training environment, the Docker public service module is called to pack an image; after packing is completed, the Harbor basic service is called to upload the image to the master node of the Harbor service.
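The pack-and-upload step amounts to composing a fully qualified image reference and pushing it to the registry's master node. A minimal sketch, in which the registry host name `harbor.example.com`, the project name, and the in-memory "master node" list are all made-up placeholders:

```python
# Hypothetical sketch: after a training environment is packed as an image,
# build the registry/project/name:tag reference and record the push.

def harbor_reference(registry, project, name, tag):
    """Compose a fully qualified image reference."""
    return f"{registry}/{project}/{name}:{tag}"

def push(reference, store):
    """Stand-in for `docker push`: record the reference at the master node."""
    store.append(reference)
    return reference

master_node = []
ref = harbor_reference("harbor.example.com", "ml-envs", "face-train-env", "v1")
push(ref, master_node)
print(master_node)
```

In a real deployment the same reference format is what `docker tag` and `docker push` operate on, with Harbor's project acting as the middle path segment.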
The cloud service here is customized for machine learning business. With the cloud platform provided in this example, the environment and the data can each be saved after one training run; machine learning developers can simply migrate the trained model to the production environment without redeploying the environment there, tracking experiments and deploying machine learning become straightforward, and experimental results can be reproduced.
In one embodiment, saving images of the environment and the data according to the received environment saving request and data saving request respectively includes:
saving the environment image to a local training center according to the received environment saving request;
and saving the data image to a cloud data center according to the received data saving request.
In one embodiment, the method further comprises:
extracting a requested environment according to the received environment extraction request;
and loading the extracted environment into an algorithm service by utilizing a RESTful micro-service framework, generating an application program interface and providing the application program interface for an application service module.
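The step of loading an extracted environment into an algorithm service and generating an application program interface can be sketched with a tiny route table standing in for the RESTful microservice framework. The route path, handler, and environment handle below are hypothetical illustrations, not the platform's real API.

```python
# Hypothetical sketch: once an environment image is extracted, register an
# API route for it so the application service module can call the algorithm.

routes = {}

def register(path):
    """Minimal stand-in for a Flask-style @app.route decorator."""
    def wrap(fn):
        routes[path] = fn
        return fn
    return wrap

def load_environment(image_ref):
    """Pretend to start the extracted environment and return a handle."""
    return {"image": image_ref, "ready": True}

env = load_environment("harbor.example.com/ml-envs/face-train-env:v1")

@register("/api/v1/predict")
def predict(request):
    # a real service would forward the request into the running environment
    return {"env": env["image"], "label": "face", "input": request["data"]}

response = routes["/api/v1/predict"]({"data": "photo.jpg"})
print(response["label"])
```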
In one embodiment, the context save request is from a first terminal and the context extract request is from a second terminal.
Example three
The machine learning cloud platform provided in this embodiment adopts a cloud computing architecture design. The system supports deployment on cloud computing IaaS-layer services; the PaaS layer adopts the Kubernetes + Docker application mode; and the SaaS layer comprises public library services, a RESTful microservice framework (microservice core load, HTTP API), application services + WEB, management and maintenance, and so on.
In particular, the method comprises the following steps of,
public library service: including basic functions such as logging, configuration, mathematical calculations, etc.;
RESTful microservice framework: based on the Flask framework, it is used to unify the microservice interfaces, decouple them from the business services, and unify the RESTful Application Program Interfaces (APIs). Specifically, it handles received requests and the scheduling and lifecycle management of WEB classification tasks.
Management and maintenance: Harbor manages the image resources, including Docker image building and publishing; monitoring of Kubernetes, Docker, and microservice resources; and resource orchestration.
WEB application: UI display of task running states and machine learning results, as well as resource monitoring displays.
This cloud service is customized for machine learning business and provides functions and services such as AI training, AI online services, training model management, training environment management, and GPU resource management, realizing one-stop hosting of machine learning tasks. It is broadly applicable to common machine learning business scenarios such as image recognition and audio/video processing. In particular, it offers the following advantages:
1. Hybrid cloud resource management: computing resources such as the GPUs of different IDCs are managed precisely.
Existing cloud platforms can only manage resources such as CPU, memory, and disk, and cannot manage GPUs properly. Algorithm training depends heavily on the GPU, and different algorithms place different demands on it. In this embodiment, GPU resources are scheduled uniformly through Kubernetes, so that they are managed precisely.
2. For training, users can scale GPU, storage, and other resources themselves, customize SSD local storage configurations, and expand to various cloud storage types.
In this embodiment, computer resources are allocated and scheduled through Kubernetes, and a bridge between users and resources is built through the public service module and the RESTful service. The user can click and select the needed resource configuration through the front-end service in the browser.
3. The training service encapsulates training algorithms in Docker images. A user can upload a custom algorithm image to the DGnet image center, from which the training service pulls the training image; the image center provides base image templates for AI frameworks such as TensorFlow, MXNet, Keras, and Caffe.
Algorithm training often requires different machine learning environments such as TensorFlow, Caffe, and MXNet, which are not installed on a Linux system by default. To reduce environment deployment time, this embodiment builds some basic algorithm training environments with Docker, which users can use directly. Existing cloud platforms only support installing a Linux system and do not deploy the related algorithm model environments.
4. The training service provides one-stop hosted training and supports both distributed and interactive AI training tasks. The platform implements GPU node scheduling, training data upload and download, task disaster tolerance, and other functions, with high availability.
Traditional cloud computing platforms do not customize for algorithm training; this embodiment customizes for AI algorithm training, including providing AI-related images, GPU node scheduling, and so on.
Example four
In order to facilitate the implementation of the present application, the present application is illustrated by a specific example.
Fig. 3 shows an architecture diagram of a machine learning cloud platform in example four of the present application.
As shown, the entire user workflow is presented. The flow in Fig. 3 is transparent to the user: it is the pipeline invoked by the program and the basic services that must be called throughout the service in this example, executed in the order of the three flows shown in the figure. In the process, basic services of the data center, the training center, and the integrated service center are called to schedule the related resources.
The machine learning cloud platform provided in this embodiment comprises a data center, a training center, and an integrated service center, all supported by the system and deployed on cloud computing IaaS-layer services; the PaaS layer of the cloud platform adopts the Kubernetes + Docker application mode, and its SaaS layer is a microservice machine learning system.
Suppose a face recognition model is to be built. User A uploads a number of face pictures, obtained from the public security department or through other channels, to the cloud platform provided in this example as training data, and the platform stores the training data in the data center.
User A then builds a training environment, including determining the number of GPUs, the storage resources, the algorithm used for training, and so on. The program performs resource scheduling in the training center module and schedules the user's task instance onto a host machine that meets the requirements. Training is performed after the instance starts. After training is completed, a face recognition model is obtained; user A clicks "save environment", and an image of the environment is saved to the training center. Images are saved and deleted through the Docker SDK, and RBAC management of images is handled through the integrated Harbor SDK.
When user B wants to use the face recognition model, a request can be sent to the cloud platform; the platform extracts the image of the face recognition model's environment and provides it to user B's client. The user can select the needed instance configuration and image through the list options of the Web front end. The back-end program packages the user's requirements into a YAML configuration file, and the scheduling system then performs scheduling and configuration according to that file.
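The packaging of user B's front-end selections into a configuration file can be sketched as building a Kubernetes-style pod spec. The field names follow the usual Pod schema (the `nvidia.com/gpu` resource key is the common Kubernetes convention for GPU requests), but the image reference and resource amounts are illustrative; the sketch serializes to JSON only because the standard library has no YAML module.

```python
import json

# Hypothetical sketch: package a user's selections (image, GPU count,
# memory) into a Kubernetes-style pod spec for the scheduling system.

def build_spec(name, image, gpus, mem_gb):
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [{
                "name": "train",
                "image": image,
                "resources": {
                    "limits": {
                        "nvidia.com/gpu": gpus,  # GPU request, K8s convention
                        "memory": f"{mem_gb}Gi",
                    }
                },
            }]
        },
    }

spec = build_spec("face-train", "harbor.example.com/ml-envs/tf:v1", 2, 32)
print(json.dumps(spec, indent=2))  # in practice this would be emitted as YAML
```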
The main functional modules of the system comprise: public library services, RESTful microservice frameworks (microservice core load, HTTP API), application services + WEB, administrative maintenance, etc.
Public library service: including basic functions such as logging, configuration, mathematical calculations, etc.;
RESTful microservice framework: based on the Flask framework, it mainly unifies the microservice interfaces, decouples them from the business services, and unifies the RESTful API.
Management and maintenance: Harbor manages the image resources, mainly including Docker image building and publishing; monitoring of Kubernetes, Docker, and microservice resources; and resource orchestration.
WEB application: if the microservice is regarded as the deep (back-end) server, this part comprises the shallow application server and the WEB client. The server side processes requests and handles scheduling and lifecycle management of WEB classification tasks. The WEB end displays task running states and machine learning results in the UI, and also displays resource monitoring.
As will be appreciated by one skilled in the art, examples of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The solutions in the examples of the present application can be implemented in various computer languages, for example, the object-oriented programming language Java and the interpreted scripting language JavaScript.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to examples of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred examples of the present application have been described, additional variations and modifications in those examples may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the appended claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.
Claims (10)
1. A cloud platform that provides machine learning services, comprising: an IaaS layer, a PaaS layer and a SaaS layer, wherein the IaaS layer is provided with a system support module, the PaaS layer is provided with Kubernetes and Docker, and the SaaS layer comprises a public library service module, a RESTful micro-service module, an application service module and a management and maintenance module, wherein,
the public library service module is used for recording logs, configuration parameters and mathematical calculation;
the RESTful micro-service module is used for processing the received request, scheduling and life cycle management of the WEB classification task;
the application service module is used for displaying the task running state and the machine learning result;
and the management and maintenance module is used for managing mirror image resources by using Harbor.
2. The cloud platform of claim 1, wherein said RESTful microservices module comprises:
the training data storage unit is used for receiving a training data storage request and storing the training data in the training data storage request;
and the training data extraction unit is used for receiving a training data extraction request and extracting and feeding back the requested training data according to the training data extraction request.
3. The cloud platform of claim 2, wherein the training data saving unit is configured to receive a training data saving request and save training data in the training data saving request to a cloud data center.
4. The cloud platform of claim 1, wherein said RESTful microservices module comprises:
the environment storage unit is used for receiving an environment storage request and storing an environment mirror image in the environment storage request;
and the environment extraction unit is used for receiving an environment extraction request and extracting and feeding back the requested environment according to the environment extraction request.
5. The cloud platform of claim 4, wherein the environment saving unit is configured to receive an environment saving request and save an environment image in the environment saving request to a local training center.
6. The cloud platform of claim 1, wherein the administration and maintenance module comprises:
the mirror image manufacturing unit is used for Docker mirror image manufacturing;
the release unit is used for releasing the Docker mirror image;
the monitoring unit is used for monitoring Kubernetes, Docker and micro-service resources;
and the arranging unit is used for arranging the resources.
7. A method for machine learning using the cloud platform for providing machine learning services according to any one of claims 1 to 6, comprising:
acquiring a training data set;
loading the training data set and a predetermined initial algorithm model into a predetermined GPU cluster, and training to obtain a trained algorithm model;
and respectively carrying out mirror image storage on the environment and the data by utilizing a Harbor according to the received environment storage request and the data storage request.
8. The method of claim 7, wherein mirroring the context and the data according to the received context save request and the data save request respectively comprises:
saving the environment mirror image to a local training center according to the received environment saving request;
and storing the data mirror image to a cloud data center according to the received data storage request.
9. The method of claim 7, further comprising:
extracting a requested environment according to the received environment extraction request;
and loading the extracted environment into an algorithm service by utilizing a RESTful micro-service framework, generating an application program interface and providing the application program interface for an application service module.
10. The method of claim 9, wherein the context save request is from a first terminal and the context extract request is from a second terminal.
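For illustration only (not part of the claims), the method flow of claims 7 to 9 — train a model, save the environment image, extract it, and expose the trained model through a generated application program interface — can be sketched as follows. The `ImageRegistry` class and the `train` function below are stand-in stubs assumed for the sketch, not Harbor or a real GPU cluster:

```python
# Sketch of the claimed flow: train, save an environment image to a
# registry stub, extract it, and serve predictions through an API.

class ImageRegistry:
    """Stand-in for a Harbor-style mirror-image registry."""

    def __init__(self):
        self._images = {}

    def save(self, name, image):
        # Handles an environment/data save request.
        self._images[name] = image

    def extract(self, name):
        # Handles an environment extraction request.
        return self._images[name]


def train(dataset, initial_model):
    # Placeholder for training on a GPU cluster: the "model" simply
    # memorises the majority label of the training data set.
    labels = [label for _, label in dataset]
    majority = max(set(labels), key=labels.count)
    return {**initial_model, "predict": lambda _x, m=majority: m}


registry = ImageRegistry()
dataset = [([0.1], "cat"), ([0.9], "dog"), ([0.8], "dog")]
model = train(dataset, {"name": "demo"})
registry.save("env:v1", {"model": model})      # environment save request
served = registry.extract("env:v1")["model"]   # environment extract request


def api(features):
    """Generated application program interface for the application service module."""
    return {"prediction": served["predict"](features)}
```

In the platform of claim 1, `registry.save` would push to Harbor and `api` would be generated by the RESTful micro-service framework; both are simplified here.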
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011226841.6A CN112311605B (en) | 2020-11-06 | 2020-11-06 | Cloud platform and method for providing machine learning service |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112311605A true CN112311605A (en) | 2021-02-02 |
CN112311605B CN112311605B (en) | 2023-12-22 |
Family
ID=74326202
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011226841.6A Active CN112311605B (en) | 2020-11-06 | 2020-11-06 | Cloud platform and method for providing machine learning service |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112311605B (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN112799742A (en) * | 2021-02-09 | 2021-05-14 | 上海海事大学 | Machine learning training system and method based on micro-service
CN112799742B (en) * | 2021-02-09 | 2024-02-13 | 上海海事大学 | Machine learning practical training system and method based on micro-service
CN115167292A (en) * | 2021-04-12 | 2022-10-11 | 清华大学 | Intelligent factory operating system based on industrial Internet architecture
CN115167292B (en) * | 2021-04-12 | 2024-04-09 | 清华大学 | Intelligent factory operating system based on industrial Internet architecture
CN113824790A (en) * | 2021-09-23 | 2021-12-21 | 大连华信计算机技术股份有限公司 | Cloud native PaaS management platform supporting enterprise-level application
CN113824790B (en) * | 2021-09-23 | 2024-04-26 | 信华信技术股份有限公司 | Cloud native PaaS management platform supporting enterprise-level application
Citations (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140075035A1 (en) * | 2012-09-07 | 2014-03-13 | Oracle International Corporation | System and method for providing java cloud services for use with a cloud computing environment |
CN105654066A (en) * | 2016-02-02 | 2016-06-08 | 北京格灵深瞳信息技术有限公司 | Vehicle identification method and device |
US20160371127A1 (en) * | 2015-06-19 | 2016-12-22 | Vmware, Inc. | Resource management for containers in a virtualized environment |
CN107659609A (en) * | 2017-07-26 | 2018-02-02 | 北京天云融创软件技术有限公司 | A kind of deep learning support platform and deep learning training method based on cloud computing |
CN107704252A (en) * | 2017-10-20 | 2018-02-16 | 北京百悟科技有限公司 | A kind of method and system for providing a user artificial intelligence platform |
CN108170520A (en) * | 2018-01-29 | 2018-06-15 | 北京搜狐新媒体信息技术有限公司 | A kind of cloud computing resources management method and device |
CN109144724A (en) * | 2018-07-27 | 2019-01-04 | 众安信息技术服务有限公司 | A kind of micro services resource scheduling system and method |
US20190171966A1 (en) * | 2017-12-01 | 2019-06-06 | Govindarajan Rangasamy | Automated application reliability management using adaptable machine learning models |
CN109885389A (en) * | 2019-02-19 | 2019-06-14 | 山东浪潮云信息技术有限公司 | A kind of parallel deep learning scheduling training method and system based on container |
CN109961151A (en) * | 2017-12-21 | 2019-07-02 | 同方威视科技江苏有限公司 | For the system for calculating service of machine learning and for the method for machine learning |
CN110058922A (en) * | 2019-03-19 | 2019-07-26 | 华为技术有限公司 | A kind of method, apparatus of the metadata of extraction machine learning tasks |
CN110245003A (en) * | 2019-06-06 | 2019-09-17 | 中信银行股份有限公司 | A kind of machine learning uniprocessor algorithm arranging system and method |
CN110413294A (en) * | 2019-08-06 | 2019-11-05 | 中国工商银行股份有限公司 | Service delivery system, method, apparatus and equipment |
CN110795072A (en) * | 2019-10-16 | 2020-02-14 | 北京航空航天大学 | Crowd-sourcing competition platform framework system and method based on crowd intelligence |
US20200089651A1 (en) * | 2018-09-14 | 2020-03-19 | Microsoft Technology Licensing, Llc | Using machine-learning methods to facilitate experimental evaluation of modifications to a computational environment within a distributed system |
US20200097338A1 (en) * | 2018-09-21 | 2020-03-26 | International Business Machines Corporation | Api evolution and adaptation based on cognitive selection and unsupervised feature learning |
CN111026409A (en) * | 2019-10-28 | 2020-04-17 | 烽火通信科技股份有限公司 | Automatic monitoring method, device, terminal equipment and computer storage medium |
US20200133820A1 (en) * | 2018-10-26 | 2020-04-30 | International Business Machines Corporation | Perform preemptive identification and reduction of risk of failure in computational systems by training a machine learning module |
CN111158745A (en) * | 2019-12-30 | 2020-05-15 | 山东浪潮商用系统有限公司 | Data processing platform based on Docker |
US20200250012A1 (en) * | 2019-02-01 | 2020-08-06 | Hewlett Packard Enterprise Development Lp | Recommendation and deployment engine and method for machine learning based processes in hybrid cloud environments |
US20200272859A1 (en) * | 2019-02-22 | 2020-08-27 | Cisco Technology, Inc. | Iot fog as distributed machine learning structure search platform |
CN111625316A (en) * | 2020-05-15 | 2020-09-04 | 苏州浪潮智能科技有限公司 | Environment deployment method and device, electronic equipment and storage medium |
CN111861020A (en) * | 2020-07-27 | 2020-10-30 | 深圳壹账通智能科技有限公司 | Model deployment method, device, equipment and storage medium |
Non-Patent Citations (3)
Title |
---|
徐星; 李银桥; 刘学锋; 毛建华: "Design and Implementation of a Rapid Deployment Scheme for Enterprise Development and Test Environments" (企业开发、测试环境快速部署方案的设计与实现), 工业控制计算机, no. 03 *
罗晟皓: "Design and Implementation of a Deep Learning Container Cloud Platform Based on Docker and Kubernetes" (基于Docker和Kubernetes的深度学习容器云平台的设计与实现), 《中国优秀硕士学位论文全文数据库》, 15 January 2020 (2020-01-15), pages 11-57 *
Also Published As
Publication number | Publication date |
---|---|
CN112311605B (en) | 2023-12-22 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11086661B2 (en) | Container chaining for automated process completion | |
US9661071B2 (en) | Apparatus, systems and methods for deployment and management of distributed computing systems and applications | |
US10635406B2 (en) | Determining the identity of software in software containers | |
US9971593B2 (en) | Interactive content development | |
US10048955B2 (en) | Accelerating software builds | |
US20180088926A1 (en) | Container image management using layer deltas | |
Yang et al. | A profile-based approach to just-in-time scalability for cloud applications | |
JP2019215877A (en) | Visual content development | |
CN111527474B (en) | Dynamic delivery of software functions | |
CN112104723B (en) | Multi-cluster data processing system and method | |
US20180246753A1 (en) | Program execution without the use of bytecode modification or injection | |
CN112311605B (en) | Cloud platform and method for providing machine learning service | |
CN113791765B (en) | Resource arrangement method, device and equipment of cloud service and storage medium | |
US11288232B2 (en) | Database deployment objects and deterministic locking models | |
US9934019B1 (en) | Application function conversion to a service | |
US20150378689A1 (en) | Application instance staging | |
CN110019059B (en) | Timing synchronization method and device | |
CN113326098B (en) | Cloud management platform supporting KVM virtualization and container virtualization | |
US11543945B1 (en) | Accurate local depiction of preview of a program window included in a remote graphical desktop | |
KR101838944B1 (en) | Rendering system and method | |
US9866451B2 (en) | Deployment of enterprise applications | |
US11202130B1 (en) | Offline video presentation | |
US11893403B1 (en) | Automation service | |
US20230176839A1 (en) | Automatic management of applications in a containerized environment | |
US11314718B2 (en) | Shared disk buffer pool update and modification |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| CB02 | Change of applicant information | Address after: 100192 Block B, Building 1, Tiandi Adjacent to Maple Industrial Park, No. 1, North Yongtaizhuang Road, Haidian District, Beijing; Applicant after: Beijing gelingshentong Information Technology Co.,Ltd. Address before: 100192 Block B, Building 1, Tiandi Adjacent to Maple Industrial Park, No. 1, North Yongtaizhuang Road, Haidian District, Beijing; Applicant before: BEIJING DEEPGLINT INFORMATION TECHNOLOGY Co.,Ltd. |
| GR01 | Patent grant | |