CN112311605B - Cloud platform and method for providing machine learning service - Google Patents

Cloud platform and method for providing machine learning service

Info

Publication number
CN112311605B
Authority
CN
China
Prior art keywords
environment
request
machine learning
training
training data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011226841.6A
Other languages
Chinese (zh)
Other versions
CN112311605A (en)
Inventor
马震
王志洋
马慧荣
黄严
张德兵
邓亚峰
赵勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gelingshentong Information Technology Co ltd
Original Assignee
Beijing Gelingshentong Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gelingshentong Information Technology Co ltd filed Critical Beijing Gelingshentong Information Technology Co ltd
Priority to CN202011226841.6A priority Critical patent/CN112311605B/en
Publication of CN112311605A publication Critical patent/CN112311605A/en
Application granted granted Critical
Publication of CN112311605B publication Critical patent/CN112311605B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/02Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]
    • H04L67/025Protocols based on web technology, e.g. hypertext transfer protocol [HTTP] for remote control or remote monitoring of applications
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/04Network management architectures or arrangements
    • H04L41/044Network management architectures or arrangements comprising hierarchical management structures
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/08Configuration management of networks or network elements
    • H04L41/0803Configuration setting
    • H04L41/0823Configuration setting characterised by the purposes of a change of settings, e.g. optimising configuration for enhancing reliability
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • H04L67/1095Replication or mirroring of data, e.g. scheduling or transport for data synchronisation between network nodes
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/12Protocols specially adapted for proprietary or special-purpose networking environments, e.g. medical networks, sensor networks, networks in vehicles or remote metering networks
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/50Network services
    • H04L67/60Scheduling or organising the servicing of application requests, e.g. requests for application data transmissions using the analysis and optimisation of the required network resources
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

A cloud platform and method for providing machine learning services, comprising an IaaS layer, a PaaS layer and a SaaS layer, wherein the IaaS layer is provided with a system support module, the PaaS layer is provided with Kubernetes and Docker, and the SaaS layer comprises a public library service module, a RESTful micro-service module, an application service module and a management and maintenance module. The public library service module is used for logging, configuration parameters and mathematical calculations; the RESTful micro-service module is used for processing received requests and for the scheduling and lifecycle management of WEB-classified tasks; the application service module is used for displaying task running states and machine learning results; and the management and maintenance module is used for managing image resources with Harbor. With the scheme of the present application, environments can be migrated simply, tracking experiments and deploying machine learning become very easy, and experimental results can be reproduced.

Description

Cloud platform and method for providing machine learning service
Technical Field
The present application relates to cloud computing technology, and in particular, to a cloud platform and method for providing machine learning services.
Background
Machine learning is a very popular technology, with applications in fields such as security, transportation, medicine, finance and retail.
Although machine learning can produce excellent results, it is still complex to use in practice. In addition to the common challenges of software development, machine learning developers face new challenges, including experiment management (e.g., tracking which parameters, code and data produced a given result), reproducibility (e.g., ensuring the same code can be executed again later in the same operating environment), deployment of models to production environments, and data governance (auditing the models and data used throughout the organization). These workflow-related challenges around the machine learning lifecycle are typically the biggest hurdle to using machine learning in a production environment and scaling it inside an organization.
Currently, some cloud platforms support online training of algorithms, and these cloud platforms often provide the following functionality: the user selects resources such as GPUs and storage, along with the corresponding environments, by clicking, and then writes their own code for training. Machine learning with these cloud platforms still has the following problems:
1. There are numerous mutually independent tools. From data preparation to model training, hundreds of software tools cover the stages of the machine learning lifecycle, and machine learning developers need to deploy a production environment built around tens of libraries;
2. Experimental results are difficult to reproduce. Training a model requires large amounts of data and a purpose-built environment; when the model is used in the user's actual production environment, the environment must be redeployed and large amounts of data acquired again before nearly identical experimental results can possibly be obtained;
3. Tracking experiments and deploying machine learning is difficult. Machine learning algorithms have tens of configurable parameters, tracking these parameters and their values is very difficult, and migrating trained models to a production environment is very challenging.
Problems in the prior art:
At present, there is no cloud platform dedicated to machine learning, so users must redo a great deal of complex work, such as data preparation and environment deployment, when using a model in different environments.
Disclosure of Invention
The embodiments of the present application provide a cloud platform and method for providing machine learning services, so as to address the above technical problems.
According to a first aspect of the present application, there is provided a cloud platform for providing machine learning services, comprising: an IaaS layer, a PaaS layer and a SaaS layer, wherein the IaaS layer is provided with a system support module, the PaaS layer is provided with Kubernetes and Docker, the SaaS layer comprises a public library service module, a RESTful micro-service module, an application service module and a management maintenance module,
the public library service module is used for logging, configuration parameters and mathematical calculations;
the RESTful micro-service module is used for processing received requests and for the scheduling and lifecycle management of WEB-classified tasks;
the application service module is used for displaying task running states and machine learning results;
and the management and maintenance module is used for managing image resources with Harbor.
According to a second aspect of the present application, there is provided a method for machine learning using a cloud platform for providing machine learning services as described above, comprising:
acquiring a training data set;
loading the training data set and a predetermined initial algorithm model into a predetermined GPU cluster, and training to obtain a trained algorithm model;
and saving the environment and the data as separate images using Harbor, according to a received environment save request and data save request.
With the cloud platform provided by the embodiments of the present application, after a training run finishes, the environment and the data can each be saved as an image resource. Machine learning developers do not need to redeploy the environment in subsequent production environments, the trained model can be migrated to the production environment simply, tracking experiments and deploying machine learning become very easy, and experimental results can be reproduced.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate and explain the application and together with the description serve to explain the application and do not constitute an undue limitation. In the drawings:
fig. 1 shows a schematic structural diagram of a cloud platform for providing a machine learning service in an example one of the present application;
FIG. 2 is a flow chart of a method of providing machine learning services in example two of the present application;
fig. 3 shows an architecture schematic of a machine learning cloud platform in example four of the present application.
Detailed Description
To make the technical solutions and advantages of the examples of the present application clearer, exemplary examples of the present application are described in detail below in conjunction with the accompanying drawings. Obviously, the described examples are only some examples of the present application, not an exhaustive list of all examples. It should be noted that, where no conflict arises, the examples and the features within them may be combined with each other.
Example one
Fig. 1 shows a schematic structural diagram of a cloud platform for providing a machine learning service in an example of the present application.
As shown, the cloud platform for providing machine learning service includes: an IaaS layer, a PaaS layer and a SaaS layer, wherein the IaaS layer is provided with a system support module, the PaaS layer is provided with Kubernetes and Docker, the SaaS layer comprises a public library service module, a RESTful micro-service module, an application service module and a management maintenance module,
the public library service module is used for logging, configuration parameters and mathematical calculations;
the RESTful micro-service module is used for processing received requests and for the scheduling and lifecycle management of WEB-classified tasks;
the application service module is used for displaying task running states and machine learning results;
and the management and maintenance module is used for managing image resources with Harbor.
In a specific implementation, the WEB-classified tasks may include machine learning training tasks, environment and data imaging tasks, and the like.
With the cloud platform provided by the embodiments of the present application, after a training run finishes, the environment and the data can each be saved as an image resource. Machine learning developers do not need to redeploy the environment in subsequent production environments, the trained model can be migrated to the production environment simply, tracking experiments and deploying machine learning become very easy, and experimental results can be reproduced.
In one embodiment, the RESTful microservice module comprises:
the training data storage unit is used for receiving a training data storage request and storing training data in the training data storage request;
and the training data extraction unit is used for receiving a training data extraction request and, according to that request, retrieving and returning the requested training data.
In one embodiment, the training data storage unit is configured to receive a training data storage request and store training data in the training data storage request to a cloud data center.
In one embodiment, the RESTful microservice module comprises:
the environment preservation unit is used for receiving an environment preservation request and saving the environment image carried in that request;
and the environment extraction unit is used for receiving an environment extraction request and, according to that request, retrieving and returning the requested environment.
In one embodiment, the environment preservation unit is configured to receive an environment preservation request and save the environment image in that request to a local training center.
In one embodiment, the management maintenance module includes:
the image building unit is used for building Docker images; in particular, this may refer to packaging the training environments or code required for machine learning into a Docker image.
The publishing unit is used for publishing Docker images. Specifically, the platform uniformly manages all built images, and users can select the image files they need, so that they can quickly deploy their own machine learning training environments.
The monitoring unit is used for monitoring Kubernetes, Docker and micro-service resources. Specifically, the monitoring unit may monitor these resources in real time to ensure efficient operation of the entire cluster. If a machine learning training environment crashes, the embodiment of the application learns of the crash through the monitoring unit and quickly recovers the training environment, ensuring that the relevant data are not lost and that the service remains highly available.
And the orchestration unit is used for orchestrating resources. In practice, machine learning training requires large amounts of computing resources, including CPU, GPU, memory, disk, network and so on. Orchestrating these resources allocates them across the cluster so that all users can make better use of the cluster's elastic computing resources.
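The resource orchestration described above can be illustrated with a small helper that turns a user's resource selection into a Kubernetes-style container resource specification. This is a minimal sketch, not the platform's actual code; the field layout follows the Kubernetes `resources.requests`/`resources.limits` convention, and `nvidia.com/gpu` is the resource name commonly exposed by the NVIDIA device plugin.

```python
def build_resource_spec(cpu: str, memory: str, gpus: int) -> dict:
    """Build a Kubernetes-style container resource spec from a user's selection."""
    requests = {"cpu": cpu, "memory": memory}
    limits = dict(requests)
    if gpus > 0:
        # GPUs are requested via the device-plugin resource name and
        # appear under "limits" (requests are implied equal for GPUs).
        limits["nvidia.com/gpu"] = str(gpus)
    return {"requests": requests, "limits": limits}


spec = build_resource_spec(cpu="4", memory="16Gi", gpus=2)
# e.g. {'requests': {'cpu': '4', 'memory': '16Gi'},
#       'limits': {'cpu': '4', 'memory': '16Gi', 'nvidia.com/gpu': '2'}}
```

A scheduler (Kubernetes, in this platform) can then match such a spec against the free capacity of each host in the cluster.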
Example two
Based on the same inventive concept, the present application example provides a method for machine learning by using the cloud platform for providing machine learning services as described in example one.
Fig. 2 shows a flow chart of a method for providing machine learning services in example two of the present application.
As shown, the method for providing machine learning service includes:
step 201, acquiring a training data set;
In one embodiment, the training data set is stored on a disk of the IaaS layer and managed through the MinIO service at the SaaS layer; the public library service module is invoked to operate MinIO and obtain the corresponding training data set.
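As an illustration of how training data might be addressed in object storage, the sketch below composes a per-user object prefix and notes the corresponding MinIO SDK calls in comments. The bucket name, prefix scheme and endpoint are all hypothetical assumptions, not taken from the patent.

```python
def dataset_object_prefix(user_id: str, dataset_id: str) -> str:
    """Compose the object-storage prefix under which a user's training
    data set is kept. The naming scheme here is illustrative only."""
    return f"datasets/{user_id}/{dataset_id}/"


# With the hypothetical prefix above, fetching a data set through the
# MinIO Python SDK would look roughly like:
#
#   from minio import Minio
#   client = Minio("minio.example.internal", access_key="...", secret_key="...")
#   prefix = dataset_object_prefix("user-a", "faces-v1")
#   for obj in client.list_objects("training-data", prefix=prefix, recursive=True):
#       client.fget_object("training-data", obj.object_name, "/data/" + obj.object_name)
```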
Step 202, loading the training data set and a predetermined initial algorithm model into a predetermined GPU cluster, and training to obtain a trained algorithm model;
In one embodiment, resources such as GPU, CPU and storage are allocated at the IaaS layer, the image is then loaded onto the IaaS resources through the Kubernetes public service module, and the user can then view and use the training environment for their own algorithm through the RESTful service.
Step 203, saving the environment and the data as separate images using Harbor, according to the received environment save request and data save request.
In one embodiment, when the RESTful service receives a user's request to save a training environment, it invokes the Docker public service module to package the image; after packaging is completed, the Harbor basic service is invoked to upload the image to the master node of the Harbor service.
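The save flow just described — package the environment image, then upload it to Harbor — can be sketched as follows. Harbor addresses images as `<registry>/<project>/<repository>:<tag>`; the registry host and project names below are hypothetical.

```python
def harbor_image_ref(registry: str, project: str, name: str, tag: str) -> str:
    """Compose a fully qualified image reference for a Harbor registry.
    Harbor organizes repositories under projects: <registry>/<project>/<repo>:<tag>."""
    return f"{registry}/{project}/{name}:{tag}"


ref = harbor_image_ref("harbor.example.internal", "ml-envs", "face-recognition", "v1")
# In Docker CLI terms, the save flow then amounts to:
#   docker commit <container> <ref>   # snapshot the running training environment
#   docker push <ref>                 # upload the image to the Harbor master node
```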
With the cloud platform provided by the embodiments of the present application, after a training run finishes, the environment and the data can each be saved. Machine learning developers can simply migrate the trained model to the production environment without redeploying the environment there, tracking experiments and deploying machine learning become very easy, and experimental results can be reproduced.
In one embodiment, saving the environment and the data as images according to the received environment save request and data save request respectively includes:
saving the environment image to a local training center according to the received environment save request;
and saving the data image to a cloud data center according to the received data save request.
In one embodiment, the method further comprises:
extracting the requested environment according to the received environment extraction request;
and loading the extracted environment into an algorithm service by using a RESTful micro-service framework, and generating an application program interface to be provided for an application service module.
In one embodiment, the context save request is from a first terminal and the context extract request is from a second terminal.
Example three
The machine learning cloud platform provided by the embodiments of the present application adopts a cloud computing architecture design: system support is deployed on cloud computing IaaS-layer services, the PaaS layer adopts a Kubernetes + Docker application model, and the SaaS layer comprises public library services, a RESTful micro-service framework (micro-service kernel loading, HTTP API), application services + WEB, management and maintenance, and so on.
In particular, the method comprises the steps of,
public library service: including basic functions such as journaling, configuration, mathematical calculations, etc.;
RESTful micro-service framework: based on the Flask framework, it is used to unify the micro-service interfaces, decouple them from the business, and provide a unified RESTful application program interface (API). Specifically, it handles received requests and the scheduling and lifecycle management of WEB-classified tasks.
Management and maintenance: Harbor manages image resources, including Docker image building and publishing, and the monitoring and orchestration of Kubernetes, Docker and micro-service resources.
Application WEB: displays the running state of tasks and the machine learning results in the UI, and also includes monitoring displays for resources.
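The unified RESTful API that the micro-service framework provides is commonly realized with a shared response envelope, so that every service's reply has the same shape. The sketch below is an assumption-based illustration — the field names are not from the patent; in a Flask-based service each view would return this envelope serialized as JSON.

```python
def envelope(data=None, code: int = 0, message: str = "ok") -> dict:
    """Uniform response body shared by all microservice endpoints, so
    that clients parse every service's reply the same way."""
    return {"code": code, "message": message, "data": data}


# Success and error replies share one shape:
ok = envelope({"task_id": "train-42", "state": "Running"})
err = envelope(code=1001, message="GPU quota exceeded")
```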
This cloud service is customized for machine learning workloads, providing functions and services such as AI training, AI online services, training model management, training environment management and GPU resource management, realizing one-stop hosting of machine learning tasks, and it is generally applicable to common machine learning business scenarios such as image recognition and audio/video processing. In particular, it has the following advantages:
1. Hybrid cloud resource management: precise management of computing resources such as GPUs across different IDCs.
Existing cloud platforms can only manage resources such as CPU, memory and disk, and cannot manage GPUs sensibly. Algorithm training depends heavily on GPUs, and different algorithms have different GPU requirements. The present application schedules GPU resources uniformly through Kubernetes, achieving precise management of GPU resources.
2. Users can scale resources such as GPU and storage for their training on their own, configure local SSD storage themselves, and the platform supports expansion to multiple cloud storage types.
The embodiments of the present application allocate and schedule computing resources through Kubernetes and build a bridge between users and resources through the public service module and the RESTful service. Users can click to select the required resource configuration through the front-end service in the browser.
3. The training service packages the training algorithm in a Docker image; users can upload custom algorithm images to the DGnet image center, and the training service can pull the training image. The image center provides basic image templates for AI frameworks such as TensorFlow, MXNet, Keras and Caffe.
Algorithm training often requires different machine learning environments such as TensorFlow, Caffe and MXNet, but Linux systems do not install these environments by default. To reduce the time spent deploying this part of the environment, the present application builds some basic algorithm training environments with Docker, which users can use directly. Existing cloud platforms only support installing a Linux system and do not deploy the related algorithm and model environments.
4. The training service provides one-stop hosted training, and also supports distributed AI training tasks and interactive AI training tasks. The platform implements functions such as GPU node scheduling, training data upload and download, and task disaster recovery, and offers high availability.
Conventional cloud computing platforms do not customize for algorithm training; the embodiments of the present application customize for AI algorithm training, including providing AI-related images, GPU node scheduling and so on.
Example four
For the purposes of facilitating the practice of this application, this application example will be described in terms of a specific example.
Fig. 3 shows an architecture schematic of a machine learning cloud platform in example four of the present application.
As shown, the figure illustrates the entire flow of user usage. The flow in Fig. 3 is transparent to the user; it is the pipeline invoked by the program as a whole and the basic services that the service needs to call. The system makes calls in the order shown in the figure, invoking basic services in the data center, the training center and the integrated service center along the way to schedule the related resources.
The machine learning cloud platform provided by the embodiments of the present application comprises a data center, a training center and an integrated service center, all supported by and deployed on cloud computing IaaS-layer services; the PaaS layer of the cloud platform adopts a Kubernetes + Docker application model, and the SaaS layer of the cloud platform is a micro-service machine learning system.
Suppose a face recognition model is to be built: user A uploads a number of face pictures, obtained from public security departments or through other channels, as training data to the cloud platform provided by this example, and the platform stores the training data in the data center;
user a builds a training environment, including determining the number of GPUs and storage resources, algorithms used for training, and so forth. The program can perform resource scheduling in the training center module, and schedule the task instance of the user to the host machine meeting the requirements according to the requirements of the user. Training is performed after the instance is started. After training is completed, a face recognition model is obtained, a user A clicks a storage environment, the mirror image of the environment is manufactured and stored in a training center, the mirror image storage is performed through a DockerSDK to perform storage deleting operation, and RBAC management of the mirror image is performed through an integrated HarborSDK.
When user B wants to use the face recognition model, a request can be sent to the cloud platform provided by this example, which extracts the image of the face recognition model's environment and provides it to user B's client. The user can select the required instance configuration and image through the list options of the Web front end. The back-end program packages the user's requirements into a Yaml configuration file, and the scheduling system performs scheduling and configuration according to that file.
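The back end's packaging of user B's front-end selections into a Yaml configuration file might look like the following sketch, which builds a minimal Kubernetes-Pod-style mapping ready to be serialized to YAML. All names and the exact structure are illustrative assumptions, not taken from the patent.

```python
def build_pod_spec(task_name, image_ref, gpus):
    """Translate a user's Web-form selection into a minimal
    Kubernetes-Pod-style mapping, ready to be dumped as YAML."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": task_name},
        "spec": {
            "containers": [{
                "name": task_name,
                "image": image_ref,  # e.g. an image pulled from Harbor
                "resources": {"limits": {"nvidia.com/gpu": str(gpus)}},
            }],
            "restartPolicy": "Never",
        },
    }


spec = build_pod_spec("face-rec-infer",
                      "harbor.example.internal/ml-envs/face-recognition:v1", 1)
# `yaml.safe_dump(spec)` (PyYAML) would then yield the configuration
# file that is handed to the scheduling system.
```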
The main functional modules of the system comprise: public library services, RESTful micro-service frameworks (micro-service kernel loading, HTTP API), application services + WEB, management maintenance, etc.
Public library service: including basic functions such as journaling, configuration, mathematical calculations, etc.;
RESTful micro-service framework: based on the Flask framework, it mainly unifies the micro-service interfaces, decouples them from the business, and provides a unified RESTful API.
Management and maintenance: Harbor manages image resources, mainly including Docker image building and publishing, and the monitoring and orchestration of Kubernetes, Docker and micro-service resources.
Application WEB: if the micro-service is regarded as the deep server side, then the shallow server-side application and the WEB client are contained here. The server side handles received requests and the scheduling and lifecycle management of WEB-classified tasks. The WEB side displays the running state of tasks and the machine learning results in the UI, and also includes monitoring displays for resources.
It will be appreciated by those skilled in the art that examples of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The solutions in the examples of the present application may be implemented in various computer languages, for example, the object-oriented programming language Java and the scripting language JavaScript.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to examples of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred examples of the present application have been described, additional variations and modifications to those examples may occur to those skilled in the art once they learn of the basic inventive concept. It is therefore intended that the appended claims be interpreted as including the preferred examples and all such alterations and modifications as fall within the scope of the present application.
It will be apparent to those skilled in the art that various modifications and variations can be made in the present application without departing from the spirit or scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims and the equivalents thereof, the present application is intended to cover such modifications and variations.

Claims (5)

1. A cloud platform for providing machine learning services, comprising an IaaS layer, a PaaS layer, and a SaaS layer, wherein the IaaS layer is provided with a system support module, the PaaS layer is provided with Kubernetes and Docker, and the SaaS layer comprises a public library service module, a RESTful micro-service module, an application service module, and a management maintenance module, wherein:
the public library service module is used for logging, configuration parameters, and mathematical computation;
the RESTful micro-service module is used for processing received requests and for the scheduling and life-cycle management of web classification tasks;
the application service module is used for displaying task running states and machine learning results;
the management maintenance module is used for managing image resources by means of a Harbor registry;
the RESTful microservice module comprises:
the environment preservation unit is used for receiving an environment preservation request and preserving the environment mirror image in the environment preservation request;
the environment extraction unit is used for receiving an environment extraction request and feeding back the requested environment extraction according to the environment extraction request;
the RESTful microservice module comprises:
the training data storage unit is used for receiving a training data storage request and storing training data in the training data storage request;
the training data extraction unit is used for receiving a training data extraction request and extracting and feeding back the requested training data according to the training data extraction request;
the training data storage unit is used for receiving a training data storage request and storing training data in the training data storage request to a cloud data center;
the environment preservation unit is used for receiving an environment preservation request and preserving the environment mirror image in the environment preservation request to a local training center.
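The division of labor recited in claim 1 can be illustrated with a minimal, hypothetical sketch: an environment image is saved to a "local training center" store, and training data to a "cloud data center" store, with matching extraction units. The class and field names are illustrative, not taken from the patent.

```python
# Hypothetical sketch of the RESTful micro-service units of claim 1.
# The two backing stores stand in for the local training center
# (environment images) and the cloud data center (training data).

class RestfulMicroservice:
    def __init__(self):
        self.local_training_center = {}   # environment images
        self.cloud_data_center = {}       # training data

    # environment saving unit: save the environment image in the request
    def save_environment(self, request):
        self.local_training_center[request["name"]] = request["image"]

    # environment extraction unit: feed back the requested environment
    def extract_environment(self, request):
        return self.local_training_center[request["name"]]

    # training data saving unit: save training data to the cloud data center
    def save_training_data(self, request):
        self.cloud_data_center[request["name"]] = request["data"]

    # training data extraction unit: feed back the requested training data
    def extract_training_data(self, request):
        return self.cloud_data_center[request["name"]]
```

In a real deployment each method would sit behind a RESTful endpoint; the in-memory dictionaries merely make the save/extract symmetry of the claim concrete.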
2. The cloud platform of claim 1, wherein the management maintenance module comprises:
an image building unit, used for building Docker images;
a publishing unit, used for publishing the Docker images;
a monitoring unit, used for monitoring Kubernetes, Docker, and micro-service resources; and
an orchestration unit, used for orchestrating resources.
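The image-building and publishing units of claim 2 would in practice wrap `docker build` and `docker push` against a Harbor registry. A sketch of the reference format and the commands involved (host, project, and tag names are made up for illustration):

```python
# Illustrative sketch of the management-maintenance units of claim 2.
# Harbor image references follow the registry/project/name:tag convention.

def harbor_reference(harbor_host, project, name, tag):
    """Compose the image reference used when publishing to Harbor."""
    return f"{harbor_host}/{project}/{name}:{tag}"

def build_and_publish_commands(context_dir, reference):
    """The shell commands the building and publishing units would run."""
    return [
        f"docker build -t {reference} {context_dir}",
        f"docker push {reference}",
    ]
```

A caller would then execute the returned commands (e.g. via `subprocess.run`) after authenticating to the registry with `docker login`.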
3. A method of machine learning using the cloud platform for providing machine learning services according to claim 1 or 2, comprising:
acquiring a training data set;
loading the training data set and a predetermined initial algorithm model into a predetermined GPU cluster, and training to obtain a trained algorithm model; and
saving the environment and the data as images, respectively, by means of Harbor according to a received environment save request and a received data save request;
wherein saving the environment and the data as images according to the received environment save request and data save request comprises:
saving the environment image to a local training center according to the received environment save request; and
saving the data image to a cloud data center according to the received data save request.
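The training step of claim 3 — load a data set and an initial model, train, obtain a trained model — can be sketched without the GPU cluster. The toy model (a single weight fitted by gradient descent) and all names are illustrative stand-ins for the real framework and cluster scheduler:

```python
# Minimal, hypothetical sketch of the training step of claim 3: a training
# data set and a predetermined initial model are loaded and trained.
# The "model" here is a single linear weight fitted by gradient descent.

def train(dataset, model, lr=0.1, epochs=200):
    """Fit y = w * x by gradient descent; stands in for cluster training."""
    w = model["w"]
    for _ in range(epochs):
        # mean gradient of the squared error over the data set
        grad = sum(2 * (w * x - y) * x for x, y in dataset) / len(dataset)
        w -= lr * grad
    return {"w": w}

dataset = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # samples of y = 2x
trained = train(dataset, {"w": 0.0})              # converges toward w = 2
```

On the real platform the trained model and its environment would then be saved as images through the Harbor registry, as the claim recites.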
4. The method of claim 3, further comprising:
extracting the requested environment according to a received environment extraction request; and
loading the extracted environment into an algorithm service by means of a RESTful micro-service framework, and generating an application program interface to be provided to the application service module.
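The serving step of claim 4 — wrap the extracted environment as an algorithm service and generate an API for the application service module — can be sketched as a route table of JSON handlers. The `/predict` route and the echo payload are invented for illustration; a real deployment would mount such handlers in a RESTful framework:

```python
# Hypothetical sketch of claim 4: the extracted environment is loaded into
# an algorithm service and an application program interface is generated.

import json

def make_algorithm_service(environment):
    """Wrap the extracted environment as a callable inference service."""
    def predict(payload):
        # In a real deployment the environment image supplies the model
        # runtime; here the service simply echoes, tagged with the env name.
        return {"environment": environment["name"], "input": payload}
    return predict

def generate_api(service):
    """Generate a minimal API: route -> JSON-in / JSON-out handler."""
    def handler(body):
        return json.dumps(service(json.loads(body)))
    return {"/predict": handler}

api = generate_api(make_algorithm_service({"name": "env1"}))
```

The application service module would consume the generated route, POSTing JSON requests and rendering the JSON responses.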
5. The method of claim 4, wherein the environment save request is from a first terminal and the environment extraction request is from a second terminal.
CN202011226841.6A 2020-11-06 2020-11-06 Cloud platform and method for providing machine learning service Active CN112311605B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011226841.6A CN112311605B (en) 2020-11-06 2020-11-06 Cloud platform and method for providing machine learning service


Publications (2)

Publication Number Publication Date
CN112311605A CN112311605A (en) 2021-02-02
CN112311605B true CN112311605B (en) 2023-12-22

Family

ID=74326202

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011226841.6A Active CN112311605B (en) 2020-11-06 2020-11-06 Cloud platform and method for providing machine learning service

Country Status (1)

Country Link
CN (1) CN112311605B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112799742B (en) * 2021-02-09 2024-02-13 上海海事大学 Machine learning practical training system and method based on micro-service
CN115167292B (en) * 2021-04-12 2024-04-09 清华大学 Intelligent factory operating system based on industrial Internet architecture

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105654066A (en) * 2016-02-02 2016-06-08 北京格灵深瞳信息技术有限公司 Vehicle identification method and device
CN107659609A (en) * 2017-07-26 2018-02-02 Deep learning support platform and deep learning training method based on cloud computing
CN107704252A (en) * 2017-10-20 2018-02-16 Method and system for providing an artificial intelligence platform to a user
CN108170520A (en) * 2018-01-29 2018-06-15 Cloud computing resource management method and device
CN109144724A (en) * 2018-07-27 2019-01-04 Micro-service resource scheduling system and method
CN109885389A (en) * 2019-02-19 2019-06-14 Container-based parallel deep learning scheduling and training method and system
CN109961151A (en) * 2017-12-21 2019-07-02 System for machine learning computing services and method for machine learning
CN110058922A (en) * 2019-03-19 2019-07-26 Method and device for extracting metadata of a machine learning task
CN110245003A (en) * 2019-06-06 2019-09-17 Machine learning single-machine algorithm orchestration system and method
CN110413294A (en) * 2019-08-06 2019-11-05 Service delivery system, method, apparatus and device
CN110795072A (en) * 2019-10-16 2020-02-14 北京航空航天大学 Crowd-sourcing competition platform framework system and method based on crowd intelligence
CN111026409A (en) * 2019-10-28 2020-04-17 烽火通信科技股份有限公司 Automatic monitoring method, device, terminal equipment and computer storage medium
CN111158745A (en) * 2019-12-30 2020-05-15 山东浪潮商用系统有限公司 Data processing platform based on Docker
CN111625316A (en) * 2020-05-15 2020-09-04 苏州浪潮智能科技有限公司 Environment deployment method and device, electronic equipment and storage medium
CN111861020A (en) * 2020-07-27 2020-10-30 深圳壹账通智能科技有限公司 Model deployment method, device, equipment and storage medium

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10225323B2 (en) * 2012-09-07 2019-03-05 Oracle International Corporation System and method for providing java cloud services for use with a cloud computing environment
US9921885B2 (en) * 2015-06-19 2018-03-20 Vmware, Inc. Resource management for containers in a virtualized environment
US11475353B2 (en) * 2017-12-01 2022-10-18 Appranix, Inc. Automated application reliability management using adaptable machine learning models
US11423326B2 (en) * 2018-09-14 2022-08-23 Microsoft Technology Licensing, Llc Using machine-learning methods to facilitate experimental evaluation of modifications to a computational environment within a distributed system
US11048564B2 (en) * 2018-09-21 2021-06-29 International Business Machines Corporation API evolution and adaptation based on cognitive selection and unsupervised feature learning
US11200142B2 (en) * 2018-10-26 2021-12-14 International Business Machines Corporation Perform preemptive identification and reduction of risk of failure in computational systems by training a machine learning module
US11507434B2 (en) * 2019-02-01 2022-11-22 Hewlett Packard Enterprise Development Lp Recommendation and deployment engine and method for machine learning based processes in hybrid cloud environments
US11562176B2 (en) * 2019-02-22 2023-01-24 Cisco Technology, Inc. IoT fog as distributed machine learning structure search platform


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Design and Implementation of a Rapid Deployment Scheme for Enterprise Development and Testing Environments; Xu Xing; Li Yinqiao; Liu Xuefeng; Mao Jianhua; Industrial Control Computer (Issue 03); full text *
Design and Implementation of a Deep Learning Container Cloud Platform Based on Docker and Kubernetes; Luo Shenghao; China Master's Theses Full-text Database; 2020-01-15; main text pp. 11-57 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100192 Block B, Building 1, Tiandi Adjacent to Maple Industrial Park, No. 1, North Yongtaizhuang Road, Haidian District, Beijing

Applicant after: Beijing gelingshentong Information Technology Co.,Ltd.

Address before: 100192 Block B, Building 1, Tiandi Adjacent to Maple Industrial Park, No. 1, North Yongtaizhuang Road, Haidian District, Beijing

Applicant before: BEIJING DEEPGLINT INFORMATION TECHNOLOGY Co.,Ltd.

GR01 Patent grant