WO2021238251A1 - Inference service system based on Kubernetes - Google Patents

Inference service system based on Kubernetes

Info

Publication number
WO2021238251A1
Authority
WO
WIPO (PCT)
Prior art keywords
model
inference service
inference
kubernetes
module
Prior art date
Application number
PCT/CN2021/073345
Other languages
French (fr)
Chinese (zh)
Inventor
王超
吴韶华
陈清山
张荣国
林秀
Original Assignee
苏州浪潮智能科技有限公司 (Suzhou Inspur Intelligent Technology Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.): 2020-05-28 (Chinese application No. 202010470862.6)
Filing date
Publication date
Application filed by 苏州浪潮智能科技有限公司 (Suzhou Inspur Intelligent Technology Co., Ltd.)
Publication of WO2021238251A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/51 Discovery or management thereof, e.g. service location protocol [SLP] or web services
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1095 Replication or mirroring of data, e.g. scheduling or transport for data synchronisation between network nodes
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L69/00 Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
    • H04L69/22 Parsing or analysis of headers

Definitions

  • This application relates to the technical field of inference services, and in particular to an inference service system based on Kubernetes.
  • An online inference service is an important part of a machine learning project: through it, a trained model delivers its value in production. Many Internet companies, and other companies with online business, typically run several or even dozens of online inference services, called tens of millions of times a day. To support such services efficiently and stably, an online serving framework needs to support mainstream deep learning frameworks, run on both CPU and GPU resources, and run multiple models on a single graphics card to improve GPU utilization. Although related technologies provide multi-framework model support, for models trained with non-standard deep learning frameworks and for AI applications at the SaaS layer, the existing technology cannot provide an online deployment function for online inference services.
  • The purpose of this application is to provide a Kubernetes-based inference service system in which the trained model and its runtime environment are packaged as an image and submitted to the inference service platform; the platform then deploys the online inference service through parameter passing, so inference tasks can be performed without converting model types or worrying about model compatibility.
  • The specific scheme is as follows:
  • This application provides a Kubernetes-based inference service system, including: a computing resource cluster and an inference service platform.
  • The inference service platform includes:
  • a multi-framework model module, used to support models exported from multiple frameworks;
  • a custom image module, used to obtain an image file sent by the user, deploy according to the image file, and execute the inference service, where the image file is obtained by the user packaging the trained model and the runtime environment.
  • Optionally, the inference service platform further includes:
  • a test and release module, used to obtain a test model and, based on the test model and the corresponding running model, perform a performance test using A/B testing and the corresponding traffic-splitting information; when the performance of the test model exceeds that of the running model, the test model is released on a rolling basis.
  • Optionally, the test and release module is configured to migrate all users of the running model to the test model during idle time, realizing the release of the test model.
  • Optionally, the test and release module is configured to migrate the users of the running model to the test model in sequence, realizing the release of the test model.
  • Optionally, the inference service platform further includes:
  • a traffic management module, used to split users' request traffic in a preset manner to obtain the traffic-splitting information.
  • Optionally, the multi-framework model module is further used to obtain a modified configuration file of a pre-launch inference service and create an inference service instance.
  • Optionally, the multi-framework model module is further used to obtain parameters added to the configuration file of a pre-launch inference service and create an inference service instance.
  • Optionally, the custom image module is further used to parse the image file to obtain the trained model and the runtime environment, execute the inference service based on them to obtain an inference result, and feed the inference result back to the user.
  • Optionally, the system further includes: a scheduling module, configured to determine the corresponding number of Pods according to the utilization of the computing resources in the computing resource cluster or metrics provided by the user.
  • Optionally, the inference service platform further includes:
  • a monitoring module, used to monitor the computing resource cluster and to issue a service alert when an error occurs in the inference service.
  • This application provides a Kubernetes-based inference service system, including a computing resource cluster and an inference service platform, where the platform includes: a multi-framework model module, used to support models exported from multiple frameworks; and a custom image module, used to obtain an image file sent by the user, deploy according to the image file, and execute the inference service, the image file being obtained by the user packaging the trained model and the runtime environment.
  • It can be seen that this application packages the trained model and runtime environment as an image and submits it to the inference service platform, which deploys the online inference service through parameter passing; inference tasks can be performed without converting model types or worrying about model compatibility, which improves the efficiency of inference service operation.
  • Figure 1 is a schematic structural diagram of a Kubernetes-based inference service system provided by an embodiment of this application;
  • Figure 2 is a schematic structural diagram of another Kubernetes-based inference service system provided by an embodiment of this application;
  • Figure 3 is a schematic diagram of a test and release process provided by an embodiment of this application;
  • Figure 4 is a schematic diagram of a test structure of an inference service platform provided by an embodiment of this application;
  • Figure 5 is a schematic diagram of scheduling provided by an embodiment of this application;
  • Figure 6 is a schematic structural diagram of the operation of a custom image module provided by an embodiment of this application.
  • Please refer to Figure 1, a schematic structural diagram of a Kubernetes-based inference service system provided by an embodiment of this application, including: a computing resource cluster 100 and an inference service platform 200.
  • The inference service platform 200 includes:
  • a multi-framework model module, used to support models exported from multiple frameworks;
  • a custom image module, used to obtain an image file sent by the user, deploy according to the image file, and execute the inference service, where the image file is obtained by the user packaging the trained model and the runtime environment.
  • During implementation, this application was developed in the Python and Go programming languages, with a Linux system as the deployment environment. The scheme is not restricted by language or system environment, however, and can equally be realized in other languages and environments.
  • The number of computing resources in the computing resource cluster 100 can be customized by the user. It is understandable that each computing resource is provided with an accelerator, including but not limited to a GPU (Graphics Processing Unit), a CPU (Central Processing Unit), a Cambricon MLU, or a dedicated neural network processor; the accelerators may be homogeneous or heterogeneous.
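  • As a sketch of how such an accelerator can be requested from the cluster, the hypothetical Pod spec below asks the Kubernetes scheduler for one GPU; the resource name nvidia.com/gpu assumes the NVIDIA device plugin is installed, and an accelerator such as an MLU would expose its own vendor-specific extended resource name.

```yaml
# Hypothetical Pod spec: requesting one GPU from the cluster's accelerator pool.
# Assumes the NVIDIA device plugin advertises the extended resource nvidia.com/gpu.
apiVersion: v1
kind: Pod
metadata:
  name: inference-worker
spec:
  containers:
    - name: model-server
      image: example.com/inference/model-server:latest  # illustrative image
      resources:
        limits:
          cpu: "4"
          memory: 8Gi
          nvidia.com/gpu: 1  # an MLU would use its vendor's resource name instead
```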
  • To further elaborate on the inference service platform 200: the online inference service it provides is not a model, but a service process for deploying a model online and preparing it beforehand. Specifically, operators can provide a reliable computing power guarantee for online inference services through fine-grained computing resource management and scheduling, and the platform 200 provides a multi-framework model module and a custom image module.
  • Further, please refer to Figure 2, a schematic structural diagram of another Kubernetes-based inference service system provided by an embodiment of this application. The system may also include a data processing module, a traffic management module, a test and release module, a user and storage module, a monitoring module, a scheduling module, and a resource module, making the deployment of online inference services more stable and convenient.
  • The data processing module covers data pre/post-processing, model averaging, and model conversion; the multi-framework model module may specifically include TensorFlow Serving, PyTorch, TensorRT Inference Server, and ML Model Serving; the user and storage module may specifically include multi-user policies, an image registry, and model management; and the monitoring module may specifically include logs, cluster monitoring, and service alerts.
  • It is worth noting that the inference service platform 200 or software can provide rapid auto-scaling based on Kubernetes API resource configuration and controller status. This addresses the problems that virtualization-based management and deployment mechanisms face when services must scale out and in quickly: manual creation of resource instances, inability to unify the runtime environment, slow instance deployment, low resource-reclamation efficiency, and poor elasticity. At the same time, it automatically scales the number of Pods in a Replication Controller, Deployment, or ReplicaSet according to the utilization of the computing resources in use or Custom Metrics provided by other applications, making cluster management and operation more efficient and stable while effectively reducing computing resource costs.
  • Specifically, the multi-framework model module is used to support models exported from multiple frameworks, including but not limited to TensorFlow, PyTorch, TensorRT, and SKLearn. The inference service platform 200 or software thus supports models exported from a variety of deep learning/machine learning frameworks and provides data pre/post-processing support for them, while different computing resources (CPUs or GPUs) can be selected according to different data processing requirements. An inference service instance is created by modifying the configuration file (a .yaml file) of the corresponding pre-launch inference service, or by adding parameters to it, which quickly deploys the inference service of the required model to the online environment; a sketch of such a file follows.
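  • A minimal sketch of such a pre-launch configuration file, here assuming a TensorFlow model served with TensorFlow Serving; the service name, model path, and storage claim are illustrative, and the instance is created by applying the file (e.g. kubectl apply -f resnet-inference.yaml):

```yaml
# Hypothetical pre-launch inference service configuration (.yaml):
# a Deployment running TensorFlow Serving for one exported model.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: resnet-inference            # illustrative service name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: resnet-inference
  template:
    metadata:
      labels:
        app: resnet-inference
    spec:
      containers:
        - name: tf-serving
          image: tensorflow/serving:2.4.1
          args:
            - --model_name=resnet               # parameters modified per model
            - --model_base_path=/models/resnet
          ports:
            - containerPort: 8501               # REST inference endpoint
          volumeMounts:
            - name: model-store
              mountPath: /models/resnet
      volumes:
        - name: model-store
          persistentVolumeClaim:
            claimName: model-store-pvc          # hypothetical model storage
```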
  • Specifically, the custom image module is used to obtain the image file sent by the user, deploy according to the image file, and execute the inference service, where the image file is obtained by the user packaging the trained model and the runtime environment. It is understandable that the inference service platform 200 or software supports inference services for models of non-standard release frameworks, including instance creation for optimized or customized variants of frameworks such as TensorFlow, PyTorch, and TensorRT. Based on the runtime environment used to train the model, the user packages the trained model together with that environment (non-standard framework, run scripts, etc.) as an image, obtains an image file, and submits it to the platform 200 or software; the platform then deploys the online inference service through parameter passing, so inference tasks can be performed without changing model types and without worrying about model compatibility.
  • Specifically, the custom image module is also used to parse the image file to obtain the trained model and runtime environment, execute the inference service based on them to obtain the inference result, and feed the inference result back to the user, as sketched below.
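  • A minimal sketch of this parameter-passing deployment for a user-built custom image; the registry path, environment variables, and port are assumptions, the point being that the packaged model and runtime stay inside the image and only parameters change at deployment time:

```yaml
# Hypothetical deployment of a user-packaged custom image: the model and its
# (non-standard) runtime live inside the image; the platform only passes parameters.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: custom-model-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: custom-model-inference
  template:
    metadata:
      labels:
        app: custom-model-inference
    spec:
      containers:
        - name: user-runtime
          image: registry.example.com/user/custom-model:v1  # user-submitted image
          env:
            - name: MODEL_PATH   # illustrative parameters passed by the platform
              value: /opt/model
            - name: SERVE_PORT
              value: "9000"
          ports:
            - containerPort: 9000
```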
  • Please refer to Figure 6, a schematic structural diagram of the operation of a custom image module provided by an embodiment of this application. Specifically, the user packages the trained model and runtime environment, optionally together with configuration files, obtains the image file, and sends it to the Kubernetes-based inference service system. The system's custom image module receives and parses the image file to obtain the trained model, runtime environment, and configuration file, executes the inference service based on the trained model and runtime environment, obtains the inference result, and feeds it back to the user. Of course, the system also includes a storage service for storing image files and the like, and a monitoring service for monitoring the inference services of those images.
  • Further, to enable developers or operators to quickly push a trained model to the online environment and validate it with real traffic, as capability support for subsequent services, in this embodiment the inference service platform 200 further includes:
  • a test and release module, used to obtain a test model and, based on the test model and the corresponding running model, perform a performance test using A/B testing and the corresponding traffic-splitting information; when the performance of the test model exceeds that of the running model, the test model is released on a rolling basis.
  • The inference service platform 200 or software provides an online test function for model services in a production environment: users can verify the inference results and performance of online services, and gray-scale (canary) release of online services through A/B testing is supported. Considering the importance and seriousness of the production environment, a pre-launch model, i.e. a test model, must be tested against real online traffic before it can be fully released, and A/B testing effectively gives the pre-launch model a custom-sized share of online traffic to test against. On the basis of stable traffic and accurate isolation, the release strategy provided by the platform 200 or software can control the model's release in a timed, quantitative way, ensuring that the number of online requests does not impose a load shock on the available computing resources, so that subsequently released models can transition smoothly to full release.
  • Please refer to Figures 3 and 4. Figure 3 is a schematic diagram of a test and release process provided by an embodiment of this application. Specifically, test model 1 and running model 2 are obtained from model and image management; according to the release strategy, model deployment 1 is carried out for test model 1 and model deployment 2 for running model 2. A performance test is run using A/B testing and the pre-processed traffic-splitting information, the inference services are executed, and the corresponding calculation results are obtained; only when the performance of the test model exceeds that of the running model can the test model be rolled out.
  • Figure 4 is a schematic diagram of a test structure of an inference service platform 200 provided by an embodiment of this application. It is understandable that the test model must be tested with real online traffic before it can be fully released. Specifically, the user sends a request to the inference service platform; after receiving it, the platform, based on internal and external cluster load balancing, allocates real traffic to the corresponding test model and running model, with test traffic allocated to the test model and the default traffic to the running model, so that each executes the inference service. The test model and the running model each execute model service 1, model service 2, ..., model service n for their share of the traffic, and the A/B test calculation results of the running model and the test model are then obtained.
  • Further, to prevent the accelerators of the underlying computing resources from being overloaded and delaying the retrieval of user information, in this embodiment the test and release module is used to migrate all users of the running model to the test model during idle time, realizing the release of the test model. Migrating while users are inactive avoids delays during actual use.
  • Further, to avoid the delays caused by overloading the accelerators of the underlying computing resources, in this embodiment the test and release module is used to migrate the users of the running model to the test model in sequence, realizing the release of the test model. The sequence may migrate one, two, or any other number of users at a time, as long as the purpose of this embodiment can be achieved; this embodiment does not limit it. It can be seen that migrating the users of the running model to the test model in sequence avoids overload and, in turn, avoids delays; a Kubernetes-level sketch of such a gradual migration follows.
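  • In Kubernetes terms, one way to realize this batch-by-batch migration is a rolling-update strategy on the serving Deployment; a sketch under the assumption that the validated test model replaces the running model one Pod at a time:

```yaml
# Hypothetical rolling release: replace running-model Pods with test-model Pods
# one at a time, so only a small share of users migrates per step.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-serving
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # bring up one test-model Pod at a time
      maxUnavailable: 0  # never reduce serving capacity during the migration
  selector:
    matchLabels:
      app: model-serving
  template:
    metadata:
      labels:
        app: model-serving
    spec:
      containers:
        - name: server
          image: registry.example.com/models/test-model:v2  # the validated test model
          ports:
            - containerPort: 8501
```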
  • Further, to decouple traffic from infrastructure scaling, in this embodiment the inference service platform 200 further includes: a traffic management module, used to split users' request traffic in a preset manner to obtain the traffic-splitting information.
  • The traffic management model used in this embodiment is Istio's. It decouples traffic from infrastructure scaling, letting operators specify through Pilot which rules traffic should follow, rather than which Pods/VMs should receive it. By decoupling traffic from infrastructure scaling in this way, Istio provides a variety of traffic management functions independent of application code, implemented through deployed Envoy sidecar proxies. Each Pod contains a sidecar proxy which, as part of the Istio mesh, coordinates all inbound and outbound traffic for the Pod. Within the mesh, Pilot converts high-level routing rules into configurations and propagates them to the sidecar proxies, which means that when services communicate with each other, their routing decisions are made on the client side. This traffic control scheme lets a running online service split the request traffic of online users in preset ways (e.g. randomly, or by designated IDs), send the real traffic requests over HTTP (Hypertext Transfer Protocol) to the server to perform inference services based on different model frameworks, and verify the validity of the test model by comparing calculation results.
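  • A sketch of such a preset splitting rule expressed as an Istio VirtualService: requests carrying a designated user ID (here an assumed x-user-id header) go to the test model, while the remaining traffic is split by weight between the running model and the test model; the subsets are assumed to be defined in a matching DestinationRule.

```yaml
# Hypothetical Istio routing rule: designated IDs go to the test model,
# the rest of the traffic is split 90/10 between running and test models.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: model-serving
spec:
  hosts:
    - model-serving              # in-mesh service name (illustrative)
  http:
    - match:
        - headers:
            x-user-id:           # designated-ID routing (assumed header)
              exact: "tester-001"
      route:
        - destination:
            host: model-serving
            subset: test-model
    - route:
        - destination:
            host: model-serving
            subset: running-model
          weight: 90
        - destination:
            host: model-serving
            subset: test-model
          weight: 10             # random 10% share for A/B testing
```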
  • Further, to support online model services for mainstream standard framework models, where a modified computing framework could otherwise not be served effectively, in this embodiment the multi-framework model module is also used to obtain a modified configuration file of the pre-launch inference service and create an inference service instance.
  • Further, for modified or upgraded versions of a computing framework, the multi-framework model module is also used to obtain parameters added to the configuration file of the pre-launch inference service and create an inference service instance.
  • Further, to make cluster management and operation more efficient and stable while effectively reducing computing resource costs, the flexible deployment model also offers operators a range of deployment schemes, such as cloud computing resources and local computing resources, letting users of the inference service use resources and services more efficiently according to the actual situation. The system further includes: a scheduling module, used to determine the corresponding number of Pods according to the utilization of the computing resources in the computing resource cluster 100 or metrics provided by the user.
  • Specifically, the inference service platform 200 or software can provide rapid auto-scaling based on Kubernetes API resource configuration and controller status, addressing the problems of virtualization-based management and deployment mechanisms in responding to rapid service scale-out and scale-in: manual creation of resource instances, inability to unify the runtime environment, slow instance deployment, low resource-reclamation efficiency, and poor elasticity. Further, the scheduling module automatically scales the number of Pods in a Replication Controller, Deployment, or ReplicaSet according to the utilization of the computing resources in use or Custom Metrics provided by other applications (see the sketch below). It can be seen that this embodiment reasonably allocates different computing power for scheduling through a safe and effective resource management and control method.
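  • A sketch of this auto-scaling behavior using the Kubernetes HorizontalPodAutoscaler (autoscaling/v2) API; the CPU target and the custom metric name are illustrative assumptions:

```yaml
# Hypothetical HPA: scale the serving Deployment's Pods on resource
# utilization, or on a Custom Metric exported by another application.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-serving-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-serving
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70               # scale out above 70% CPU
    - type: Pods
      pods:
        metric:
          name: inference_requests_per_second  # assumed custom metric
        target:
          type: AverageValue
          averageValue: "100"
```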
  • Further, the scheduling module may also receive a user request, whereupon the inference service platform 200 executes the inference service, obtains the calculation result, and feeds it back to the user. For details, please refer to Figure 5, a schematic diagram of scheduling provided by an embodiment of this application. Specifically, the inference service platform obtains the image file, deploys the model corresponding to the image file based on the release strategy, and, on the user's request, uses that model to execute the inference service and obtain the corresponding inference result, i.e. the calculation result. For example, when the request is to identify a person in an image, the calculation result may be the image of the person or an indication that no person is present; when the request is to obtain the voiceprint information in a voice recording, the calculation result is that voiceprint information. Of course, other requests are possible, and the user can configure them according to actual needs, as long as the purpose of this embodiment can be achieved.
  • Further, to achieve better resource deployment, in this embodiment the inference service platform 200 further includes: a monitoring module, used to monitor the computing resource cluster 100 and, when an error occurs in the inference service, to issue a service alert. The monitoring module monitors the computing resource cluster 100 and obtains operation and usage information in a timely manner; when an error occurs in the inference service, a service alert is issued so that technicians can perform maintenance.
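  • As a sketch of such a service alert, assuming the cluster is monitored with the Prometheus Operator and the serving Pods export an error counter (the metric name and thresholds are assumptions):

```yaml
# Hypothetical alerting rule (Prometheus Operator CRD): notify technicians
# when an inference service starts returning errors.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: inference-service-alerts
spec:
  groups:
    - name: inference.rules
      rules:
        - alert: InferenceServiceErrors
          expr: rate(inference_request_errors_total[5m]) > 0  # assumed metric
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: "Inference service {{ $labels.service }} is returning errors"
```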
  • In summary, the present invention can provide users with rapid deployment and effective scheduling of AI computing resources on top of locally deployed clusters or cloud servers, reduce the launch, operation, and maintenance costs of local or cloud platforms, and help users with various online inference business needs, as well as algorithm and business teams within enterprises, implement applications or services quickly.
  • This embodiment packages the trained model and the runtime environment as an image and submits it to the inference service platform 200; the platform deploys the online inference service through parameter passing, so inference tasks can be performed without changing model types or worrying about model compatibility, which improves the efficiency of inference service operation.
  • The steps of the method or algorithm described in combination with the embodiments disclosed herein can be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module can reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Security & Cryptography (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • Debugging And Monitoring (AREA)

Abstract

An inference service system based on Kubernetes, comprising a computing resource cluster and an inference service platform. The inference service platform comprises: a multi-framework model module used for supporting models exported from multiple frameworks; and a custom image module used for obtaining an image file sent by a user, performing deployment according to the image file, and executing an inference service, wherein the image file is obtained by the user packaging a trained model and a runtime environment. Thus, in the present application, a trained model and its runtime environment are packaged as an image and submitted to the inference service platform; the platform deploys an online inference service by parameter passing, inference tasks can be carried out without converting model types or considering model compatibility, and inference service efficiency is improved.

Description

一种基于Kubernetes的推理服务系统A reasoning service system based on Kubernetes
本申请要求于2020年05月28日提交中国专利局、申请号为202010470862.6、发明名称为“一种基于Kubernetes的推理服务系统”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application claims the priority of a Chinese patent application filed with the Chinese Patent Office on May 28, 2020, the application number is 202010470862.6, and the invention title is "a Kubernetes-based reasoning service system", the entire content of which is incorporated into this application by reference middle.
技术领域Technical field
本申请涉及推理服务技术领域,特别涉及一种基于Kubernetes的推理服务系统。This application relates to the technical field of reasoning services, and in particular to a reasoning service system based on Kubernetes.
背景技术Background technique
在线推理服务(Online Inference Service),是机器学习工程中的重要的一环,通过在线推理服务,训练出来的模型得以在生产环节中体现其价值。很多互联网或具有线上业务的企业通常存在几个甚至几十个线上推理服务,每天调用次数高达千万级别。为了高效稳定的支撑在线服务,在线服务框架需要能够支持主流深度学习框架,支持运行在CPU和GPU资源上,并且单显卡支持运行多个模型,提升GPU资源利用率。相关技术中虽然采用的提供了多框架模型的支持,但是对于非标准深度学习框架训练得到的模型,及SaaS层的AI应用,现有技术无法提供在线部署功能进行在线推理服务。Online inference service (Online Inference Service) is an important part of the machine learning project. Through online inference service, the trained model can reflect its value in the production process. Many Internet or companies with online business usually have several or even dozens of online reasoning services, which are called tens of millions of times a day. In order to support online services efficiently and stably, the online service framework needs to be able to support mainstream deep learning frameworks, support running on CPU and GPU resources, and support running multiple models on a single graphics card to improve GPU resource utilization. Although the related technologies provide support for multi-frame models, for models trained on non-standard deep learning frameworks and AI applications at the SaaS layer, existing technologies cannot provide online deployment functions for online reasoning services.
因此,如何提供一种解决上述技术问题的方案是本领域技术人员目前需要解决的问题。Therefore, how to provide a solution to the above-mentioned technical problems is a problem that needs to be solved by those skilled in the art at present.
发明内容Summary of the invention
本申请的目的是提供一种基于Kubernetes的推理服务系统,能够将训练完成的模型和运行环境以镜像形式进行封装,提交到推理服务平台,推理服务平台通过参数传递形式进行线上推理服务的部署,不需要转换模型类型,也无需顾虑模型兼容性即可进行推理任务。其具体方案如下:The purpose of this application is to provide a Kubernetes-based reasoning service system that can encapsulate the trained model and operating environment in the form of a mirror image and submit it to the reasoning service platform. The reasoning service platform deploys online reasoning services through parameter transfer. , There is no need to switch model types, and no need to worry about model compatibility to perform inference tasks. The specific plan is as follows:
本申请提供了一种基于Kubernetes的推理服务系统,包括:This application provides a Kubernetes-based reasoning service system, including:
计算资源集群和推理服务平台;Computing resource cluster and reasoning service platform;
其中,所述推理服务平台包括:Wherein, the reasoning service platform includes:
多框架模型模块,用于支持多种框架导出的模型;Multi-frame model module, used to support models exported by multiple frameworks;
自定义镜像模块,用于获取用户发送的镜像文件,根据所述镜像文件进行部署,并执行推理服务,其中,所述镜像文件是用户将完成训练的模型和运行环境进行封装而得到的文件。The custom image module is used to obtain the image file sent by the user, deploy according to the image file, and execute inference services, where the image file is a file obtained by encapsulating the trained model and the operating environment by the user.
可选的,所述推理服务平台还包括:Optionally, the reasoning service platform further includes:
测试与发布模块,用于获取测试模型,并基于所述测试模型、对应的运行模型利用A/B测试和对应的分流信息进行性能测试,当所述测试模型的性能大于所述运行模型的性能时,将所述测试模型滚动发布。The test and release module is used to obtain a test model, and perform a performance test based on the test model and the corresponding running model using A/B testing and corresponding diversion information, when the performance of the test model is greater than the performance of the running model At that time, the test model will be released on a rolling basis.
可选的,所述测试与发布模块,用于在空闲时间,将所有所述运行模型对应的用户迁移到所述测试模型上,实现所述测试模型的发布。Optionally, the test and release module is configured to migrate all users corresponding to the running model to the test model in free time, so as to realize the release of the test model.
可选的,所述测试与发布模块,用于依次将所述运行模型对应的用户迁移到所述测试模型上,实现所述测试模型的发布。Optionally, the testing and publishing module is configured to sequentially migrate users corresponding to the running model to the testing model to implement the publishing of the testing model.
可选的,所述推理服务平台还包括:Optionally, the reasoning service platform further includes:
流量管理模型,用于通过预设方式分流用户的请求流量,得到所述分流信息。The traffic management model is used to divert the requested traffic of the user in a preset manner to obtain the diversion information.
可选的,所述多框架模型模块,还用于获取修改预上线推理服务的配置文件,创建推理服务实例。Optionally, the multi-frame model module is also used to obtain and modify the configuration file of the pre-launched reasoning service, and create a reasoning service instance.
可选的,所述多框架模型模块,还用于获取添加预上线推理服务的配置文件的参数,创建推理服务实例。Optionally, the multi-frame model module is also used to obtain the parameters of the configuration file for adding the pre-online reasoning service to create a reasoning service instance.
可选的,所述自定义镜像模块,还用于对所述镜像文件进行解析,得到所述训练的模型和所述运行环境;基于所述训练的模型和所述运行环境执行所述推理服务,得到推理结果,并将所述推理结果反馈至所述用户。Optionally, the custom image module is also used to parse the image file to obtain the trained model and the operating environment; execute the reasoning service based on the trained model and the operating environment , Obtain the inference result, and feed back the inference result to the user.
可选的,还包括:调度模块,用于根据所述计算资源集群中的计算资源的利用率或者用户提供的度量指标,确定对应的pod的数量。Optionally, it further includes: a scheduling module, configured to determine the number of corresponding pods according to the utilization rate of the computing resources in the computing resource cluster or the metric provided by the user.
可选的,所述推理服务平台还包括:Optionally, the reasoning service platform further includes:
监控模块,用于监控所述计算资源集群;当所述推理服务出现错误时,执行服务预警。The monitoring module is used to monitor the computing resource cluster; when an error occurs in the reasoning service, execute a service warning.
本申请提供一种基于Kubernetes的推理服务系统,包括:计算资源集 群和推理服务平台;其中,所述推理服务平台包括:多框架模型模块,用于支持多种框架导出的模型;自定义镜像模块,用于获取用户发送的镜像文件,根据所述镜像文件进行部署,并执行推理服务,其中,所述镜像文件是用户将完成训练的模型和运行环境进行封装而得到的文件。This application provides a Kubernetes-based reasoning service system, including: a computing resource cluster and a reasoning service platform; wherein the reasoning service platform includes: a multi-frame model module for supporting models derived from multiple frameworks; a custom mirroring module , Used to obtain the image file sent by the user, deploy according to the image file, and execute the inference service, wherein the image file is a file obtained by encapsulating the trained model and the operating environment by the user.
可见,本申请将训练完成的模型和运行环境以镜像形式进行封装,提交到推理服务平台,推理服务平台通过参数传递形式进行线上推理服务的部署,不需要转换模型类型,也无需顾虑模型兼容性即可进行推理任务,提高了推理服务运行的效率。It can be seen that this application encapsulates the trained model and operating environment in the form of a mirror image and submits it to the reasoning service platform. The reasoning service platform deploys online reasoning services in the form of parameter transfer. There is no need to convert the model type, and there is no need to worry about model compatibility. The reasoning task can be carried out based on the nature, which improves the efficiency of the reasoning service operation.
附图说明Description of the drawings
为了更清楚地说明本申请实施例或现有技术中的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据提供的附图获得其他的附图。In order to more clearly describe the technical solutions in the embodiments of the present application or the prior art, the following will briefly introduce the drawings that need to be used in the description of the embodiments or the prior art. Obviously, the drawings in the following description are only These are the embodiments of the present application. For those of ordinary skill in the art, other drawings can be obtained based on the provided drawings without creative work.
图1为本申请实施例提供的一种基于Kubernetes的推理服务系统的结构示意图;Figure 1 is a schematic structural diagram of a Kubernetes-based reasoning service system provided by an embodiment of the application;
图2为本申请实施例提供的另一种基于Kubernetes的推理服务系统的结构示意图;Figure 2 is a schematic structural diagram of another Kubernetes-based reasoning service system provided by an embodiment of the application;
图3为本申请实施例提供的一种测试与发布的流程示意图;FIG. 3 is a schematic diagram of a test and release process provided by an embodiment of the application;
图4为本申请实施例提供的一种推理服务平台的测试结构示意图;FIG. 4 is a schematic diagram of a test structure of a reasoning service platform provided by an embodiment of the application;
图5为本申请实施例提供的一种调度示意图;FIG. 5 is a schematic diagram of a scheduling provided by an embodiment of this application;
图6为本申请实施例提供的一种自定义镜像模块工作的结构示意图。FIG. 6 is a schematic diagram of the working structure of a custom mirroring module provided by an embodiment of the application.
具体实施方式Detailed ways
为使本申请实施例的目的、技术方案和优点更加清楚,下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。In order to make the purpose, technical solutions and advantages of the embodiments of this application clearer, the following will clearly and completely describe the technical solutions in the embodiments of this application with reference to the drawings in the embodiments of this application. Obviously, the described embodiments These are a part of the embodiments of this application, but not all of the embodiments. Based on the embodiments in this application, all other embodiments obtained by a person of ordinary skill in the art without creative work shall fall within the protection scope of this application.
相关技术中虽然采用的提供了多框架模型的支持,但是对于非标准深度学习框架训练得到的模型,及SaaS层的AI(Artificial Intelligence,人工智能)应用,现有技术无法提供在线部署功能进行在线推理服务。未解决上述技术问题,本实施例提供一种基于Kubernetes的推理服务系统,请参考图1,图1为本申请实施例提供的一种基于Kubernetes的推理服务系统的结构示意图,包括:Although the related technologies provide support for multi-frame models, for models trained in non-standard deep learning frameworks and AI (Artificial Intelligence) applications at the SaaS layer, existing technologies cannot provide online deployment functions for online deployment. Reasoning service. The above technical problem is not solved. This embodiment provides a Kubernetes-based reasoning service system. Please refer to Figure 1. Figure 1 is a schematic structural diagram of a Kubernetes-based reasoning service system provided by an embodiment of the application, including:
计算资源集群100和推理服务平台200; Computing resource cluster 100 and reasoning service platform 200;
其中,推理服务平台200包括:Among them, the reasoning service platform 200 includes:
多框架模型模块,用于支持多种框架导出的模型;Multi-frame model module, used to support models exported by multiple frameworks;
自定义镜像模块,用于获取用户发送的镜像文件,根据镜像文件进行部署,并执行推理服务,其中,镜像文件是用户将完成训练的模型和运行环境进行封装而得到的文件。The custom image module is used to obtain the image file sent by the user, deploy it according to the image file, and perform inference services. The image file is a file obtained by encapsulating the trained model and the operating environment by the user.
其中,本申请在实施过程中是以Python和go编程语言开发的,部署环境为Linux系统。但是该方案不受语言和系统环境的限制,在其他语言和环境下也完全可以实现。Among them, this application is developed in Python and go programming languages during the implementation process, and the deployment environment is a Linux system. However, this scheme is not restricted by language and system environment, and can be fully realized in other languages and environments.
其中,计算资源集群100中的计算资源的数量用户可自定义设置,可以理解的是,每个计算资源中设置有加速器,该加速器包括但是不限定于GPU(Graphics Processing Unit,图形处理器)、CPU(central processing unit,中央处理器)、寒武纪MLU和专用神经网络处理器,可以是同构加速器也可以是异构加速器。Among them, the number of computing resources in the computing resource cluster 100 can be customized by users. It is understandable that each computing resource is provided with an accelerator. The accelerator includes but is not limited to GPU (Graphics Processing Unit, graphics processor), CPU (central processing unit, central processing unit), Cambrian MLU, and dedicated neural network processors can be homogeneous accelerators or heterogeneous accelerators.
针对推理服务平台200进行进一步阐述,推理服务平台200对应的在线推理服务不是一种模型,而是一种对于模型进行上线部署及前期准备的服务流程。具体的,操作用户可以通过细粒度的计算资源管理和调度为线上推理服务提供可靠的计算力保证,并且推理服务平台200提供了多框架模型模块、自定义镜像模块,进一步的,请参考图2,图2为本申请实施例提供的另一种基于Kubernetes的推理服务系统的结构示意图,该系统还可以包括:数据处理模块、流量管理模块、测试与发布模块、用户与存储模块,监控模块、调度模块和资源模块,使得线上推理服务的部署更加稳定与便捷。其中,数据处理模块包括数据前/后处理、模型平均和模型转换; 多框架模型模块具体可以包括TensorFlow Serving、Pytorch、TensorRT Inference Server、ML Model Serving;用户与存储模块具体可以包括多用户策略、镜像仓库、模型管理;监控模块具体可以包括日志、集群监控和服务报警。To further elaborate on the reasoning service platform 200, the online reasoning service corresponding to the reasoning service platform 200 is not a model, but a service process for online deployment and preliminary preparation of the model. Specifically, operating users can provide reliable computing power guarantee for online reasoning services through fine-grained computing resource management and scheduling, and the reasoning service platform 200 provides multi-frame model modules and custom mirroring modules. For further details, please refer to the figure. 2. Figure 2 is a schematic structural diagram of another Kubernetes-based reasoning service system provided by an embodiment of the application. The system may also include: a data processing module, a traffic management module, a testing and publishing module, a user and storage module, and a monitoring module , Scheduling module and resource module, making the deployment of online reasoning services more stable and convenient. Among them, the data processing module includes data pre/post processing, model averaging, and model conversion; the multi-frame model module may specifically include TensorFlow Serving, Pytorch, TensorRT Inference Server, and ML Model Serving; the user and storage module may specifically include multi-user strategy, mirroring Warehouse and model management; the monitoring module can specifically include logs, cluster monitoring and service alarms.
值得注意的是,推理服务平台200或软件可以基于Kubernetes API资源配置和控制器状态,提供快速的自动伸缩能力,解决了基于虚拟化技术的管理及部署机制在应对服务快速扩容和缩容需求时存在的手动创建资源实例、无法统一运行环境、实例部署、资源回收效率低和弹性能力差等问题,同时根据所使用的计算资源的利用率或其他应程序提供的度量指标Custom Metrics,自动伸缩Replication Controller、Deployment和Replica Set中的Pod数量,使得集群管理和运行更加高效,稳定性,同时有效地降低计算资源成本。It is worth noting that the reasoning service platform 200 or software can provide rapid auto-scaling capabilities based on Kubernetes API resource configuration and controller status, and solve the management and deployment mechanism based on virtualization technology when responding to the rapid expansion and contraction requirements of services. There are problems such as manual creation of resource instances, inability to unify the operating environment, instance deployment, low resource recovery efficiency, and poor elasticity. At the same time, it automatically scales Replication according to the utilization of computing resources used or CustomMetrics provided by other applications. The number of Pods in Controller, Deployment, and Replica Set makes cluster management and operation more efficient and stable, while effectively reducing the cost of computing resources.
具体的,多框架模型模块,用于支持多种框架导出的模型,其中,多种框架包括但是不限定于:TensorFlow、Pytorch、TensorRT、SKLearn。可见,推理服务平台200或软件支持多种的深度学习/机器学习框架导出的模型,并且为所支持的模型提供数据前/后处理的服务支持,同时可根据不同数据处理需求选择不同的计算资源(CPUs或者GPUs)。通过修改或添加相应预上线推理服务的配置文件(.yaml文件)或配置文件中的参数创建推理服务实例,快速将所需模型的推理服务部署在线上环境。Specifically, the multi-frame model module is used to support models derived from multiple frameworks. Among them, multiple frameworks include but are not limited to: TensorFlow, Pytorch, TensorRT, and SKLearn. It can be seen that the reasoning service platform 200 or software supports a variety of models derived from deep learning/machine learning frameworks, and provides service support for data pre/post processing for the supported models. At the same time, different computing resources can be selected according to different data processing requirements (CPUs or GPUs). Create a reasoning service instance by modifying or adding the configuration file (.yaml file) of the corresponding pre-online reasoning service or the parameters in the configuration file, and quickly deploy the reasoning service of the required model to the online environment.
具体的,自定义镜像模块,用于获取用户发送的镜像文件,根据镜像文件进行部署,并执行推理服务,其中,镜像文件是用户将完成训练的模型和运行环境进行封装而得到的文件。Specifically, the custom mirroring module is used to obtain the mirrored file sent by the user, deploy according to the mirrored file, and perform inference services, where the mirrored file is a file obtained by encapsulating the trained model and the operating environment by the user.
可以理解的是,推理服务平台200或软件支持非标准发布框架模型的推理服务,包括:TensorFlow、Pytorch、TensorRT等经过优化或者自定义框架模型的推理服务实例创建,用户根据训练模型所使用的运行环境,将训练完成的模型和运行环境(非标准框架、运行脚本等)以镜像形式进行封装,得到镜像文件,将镜像文件提交到推理服务平台200或软件,推理服务平台通过参数传递形式进行线上推理服务的部署,不需要转换模型类型,也无需顾虑模型兼容性即可进行推理任务。具体的,自定义镜像模块,还用于对镜像文件进行解析,得到训练的模型和运行环境;基于训练的模 型和运行环境执行推理服务,得到推理结果,并将推理结果反馈至用户。It is understandable that the reasoning service platform 200 or software supports reasoning services for non-standard publishing framework models, including: TensorFlow, Pytorch, TensorRT and other optimized or custom framework models for reasoning service instance creation. Users run according to the training model used. Environment, encapsulate the trained model and operating environment (non-standard framework, running script, etc.) in the form of a mirror image to obtain a mirror file, and submit the mirror file to the reasoning service platform 200 or software, and the reasoning service platform performs online through parameter transfer. The deployment of the above reasoning service does not need to change the model type, and there is no need to worry about model compatibility to perform reasoning tasks. Specifically, the custom image module is also used to parse the image file to obtain the trained model and operating environment; perform inference services based on the trained model and operating environment, obtain the inference results, and feed the inference results back to the user.
请参考图6,图6为本申请实施例提供的一种自定义镜像模块工作的结构示意图。具体的,用户将完成训练的模型和运行环境,还可以包括配置文件进行封装,得到镜像文件,将镜像文件发送至基于Kubernetes的推理服务系统。基于Kubernetes的推理服务系统的自定义镜像模块接收到镜像文件,对镜像文件进行解析,得到训练的模型和运行环境还有配置文件,基于训练的模型和运行环境执行推理服务,得到推理结果,并将推理结果反馈至用户,当然,将还包括了存储服务,用于存储镜像文件等,还包括监控服务,用于监控镜像文件的推理服务。Please refer to FIG. 6, which is a schematic diagram of the working structure of a custom mirroring module provided by an embodiment of this application. Specifically, the user encapsulates the trained model and operating environment, including configuration files, to obtain the image file, and send the image file to the Kubernetes-based reasoning service system. The custom mirror module of the Kubernetes-based reasoning service system receives the mirror file, parses the mirror file, and obtains the trained model, operating environment, and configuration file. The reasoning service is executed based on the trained model and operating environment, and the reasoning result is obtained, and The reasoning result is fed back to the user. Of course, it will also include storage services for storing mirror files, etc., as well as monitoring services, reasoning services for monitoring mirror files.
进一步的,为了使开发或运维人员能够快速将已训练好的模型推送至线上环境,获得真实流量的验证,作为后续服务的能力支持,本实施例中,推理服务平台200还包括:Further, in order to enable development or operation and maintenance personnel to quickly push the trained model to the online environment and obtain verification of real traffic, as a capability support for subsequent services, in this embodiment, the reasoning service platform 200 further includes:
测试与发布模块,用于获取测试模型,并基于测试模型、对应的运行模型利用A/B测试和对应的分流信息进行性能测试,当测试模型的性能大于运行模型的性能时,将测试模型滚动发布。The test and release module is used to obtain the test model, and perform performance testing based on the test model and the corresponding running model using A/B testing and corresponding diversion information. When the performance of the test model is greater than the performance of the running model, the test model is rolled release.
推理服务平台200或软件为生产环境下的服务提供了模型服务在线测试功能,用户可以针对线上服务进行推理结果及性能验证,支持通过A/B测试进行在线服务的灰度发布。考虑到生产环境的重要性和严肃性,预上线模型即测试模型必须经过线上真实流量的测试后才可以进行全量发布,使用A/B测试可以有效地为预上线模型提供自定义规模的线上流量进行测试,在保证了流量稳定和准确隔离的基础上,通过推理服务平台200或软件提供的发布策略,可以定时、定量地对模型进行发布策略上的控制,确保线上请求数量不会对现有可用计算资源造成负载冲击,使后续发布的模型能够平稳过渡为全量模型。请参考图3和图4,图3为本申请实施例提供的一种测试与发布的流程示意图。具体的,从模型与镜像管理中得到测试模型1和运行模型2,根据发布策略基于测试模型1进行模型部署1和运行模型2执行对应的模型部署2,利用A/B测试和预处理后得到的对应的分流信息进行性能测试,实现推理服务,得到对应的计算结果;只有当测试模型的性能大于运行模型的性能时,才能够进行测试模型的滚动发布。图4为本申请实施例提供的一种推理服务平台200的测试的结构示意图。 可以理解的是,测试模型必须经过线上真实流量的测试后才可以进行全量发布。具体的,用户下发请求至推理服务平台,当推理服务平台接收到该请求后,基于内外集群负载平衡,会分配真实流量到对应的测试模型和运行模型,其中分配测试模型测试流量,分配运行模型默认流量,使测试模型和运行模型分别执行推理服务,测试模型和运行模型分别执行对应流量的模型服务1、模型服务2、模型服务n,然后得到运行模型和测试模型的A/B测试的计算结果。The reasoning service platform 200 or software provides online test functions of model services for services in a production environment. Users can perform reasoning results and performance verification for online services, and support gray-scale release of online services through A/B testing. Taking into account the importance and seriousness of the production environment, the pre-launch model, that is, the test model, must be tested in real online traffic before it can be released in full. Using A/B testing can effectively provide the pre-launch model with a custom scale line On the basis of ensuring the stability and accurate isolation of the flow, the release strategy provided by the reasoning service platform 200 or the software can be used to control the release strategy of the model regularly and quantitatively to ensure that the number of online requests will not This causes a load impact on the existing available computing resources, so that subsequent models can be smoothly transitioned to full models. Please refer to FIG. 3 and FIG. 4. FIG. 3 is a schematic diagram of a test and release process provided by an embodiment of the application. Specifically, test model 1 and run model 2 are obtained from model and mirror management, and model deployment 1 and run model 2 execute corresponding model deployment 2 based on test model 1 according to the release strategy, which are obtained after A/B testing and preprocessing Perform performance testing on the corresponding shunt information, implement reasoning services, and obtain corresponding calculation results; only when the performance of the test model is greater than the performance of the running model, the rolling release of the test model can be performed. FIG. 4 is a schematic diagram of a test structure of a reasoning service platform 200 provided by an embodiment of the application. It is understandable that the test model must be tested with real online traffic before it can be released in full. Specifically, the user sends a request to the inference service platform. When the inference service platform receives the request, based on the internal and external cluster load balancing, it will allocate real traffic to the corresponding test model and operation model, where the test model test traffic is allocated, and the operation is allocated The default flow of the model enables the test model and the running model to execute the inference service respectively, and the test model and the running model respectively execute the model service 1, model service 2, model service n of the corresponding traffic, and then get the A/B test of the running model and the test model Calculation results.
进一步的,为了避免底层计算资源的加速器的负荷过大,造成用户信息获取的延时,本实施例中,测试与发布模块,用于在空闲时间,将所有运行模型对应的用户迁移到测试模型上,实现测试模型的发布。通过在空闲时间将所有运行模型对应的用户迁移到测试模型上,在用户不使用的情况下进行迁移,避免了在实际使用时延时现象的发生。Further, in order to avoid the overload of the accelerator of the underlying computing resources being too large, causing delays in obtaining user information, in this embodiment, the test and release module is used to migrate all users corresponding to the running model to the test model during idle time. To achieve the release of the test model. By migrating all users corresponding to the running model to the test model in free time, the migration is carried out when the user is not using it, so as to avoid the occurrence of delay in actual use.
进一步的,为了避免底层计算资源的加速器的负荷过大,造成的延时现象的发生,本实施例中,测试与发布模块,用于依次将运行模型对应的用户迁移到测试模型上,实现测试模型的发布。通过依次将运行模型对应的用户迁移到测试模型上,避免了负荷过大的问题的出现,其中,该依次可以是依次迁移一个、两个或者其他数量的用户,只要是能够实现本实施例的目的即可,本实施例不在进行限定。可见,通过依次将运行模型对应的用户迁移到测试模型上,避免了负荷过大的问题的出现,进而避免了延时现象的发生。Further, in order to avoid the occurrence of time delay caused by the excessive load of the accelerator of the underlying computing resources, in this embodiment, the test and release module is used to sequentially migrate the users corresponding to the running model to the test model to realize the test Release of the model. By sequentially migrating the users corresponding to the running model to the test model, the problem of excessive load can be avoided. Among them, the sequence can be sequentially migrating one, two, or other numbers of users, as long as this embodiment can be implemented. The purpose is sufficient, and this embodiment does not limit it. It can be seen that by sequentially migrating the users corresponding to the running model to the test model, the problem of excessive load is avoided, and the delay phenomenon is avoided.
进一步的,为了将流量与基础设施扩容进行解耦,本实施例中,推理服务平台200还包括:流量管理模型,用于通过预设方式分流用户的请求流量,得到分流信息。Further, in order to decouple traffic from infrastructure expansion, in this embodiment, the reasoning service platform 200 further includes: a traffic management model, which is used to split the requested traffic of the user in a preset manner to obtain the split information.
其中,本实施例中的流量管理模型是使用Istio的流量管理模型,是将流量与基础设施扩容进行解耦,让运维人员可以通过Pilot指定流量遵循什么规则,而不是指定哪些pods/VM应该接收流量。通过将流量从基础设施扩展中解耦,就可以让Istio提供各种独立于应用程序代码之外的流量管理功能。这些功能是通过部署的Envoy sidecar代理来实现的。Pod包含一个sidecar代理,该代理作为Istio网格的一部分,负责协调Pod的所有入站和出站流量。在Istio网格中,Pilot负责将高级路由规则转换为配置并将它们 传播到sidecar代理。这意味着当服务彼此通信时,它们的路由决策是由客户端确定的。推理服务的流量调控方案使线上正在运行的服务可以通过预设方式(如:随机、制定ID等)分流线上用户的请求流量,并将真实流量请求通过HTTP(Hypertext Transfer Protocol,超文本传输协议)方式发送到服务端,进行基于不同模型框架的推理服务,通过计算结果对比验证测试模型的有效性。Among them, the traffic management model in this embodiment uses the Istio traffic management model, which decouples the traffic from the expansion of the infrastructure, so that the operation and maintenance personnel can specify which rules the traffic follows through the Pilot, instead of specifying which pods/VM should be Receive traffic. By decoupling traffic from infrastructure extensions, Istio can provide various traffic management functions independent of application code. These functions are implemented through the deployed Envoy sidecar proxy. The Pod contains a sidecar proxy, which is part of the Istio grid and is responsible for coordinating all inbound and outbound traffic for the Pod. In the Istio grid, Pilot is responsible for converting advanced routing rules into configurations and propagating them to the sidecar agents. This means that when services communicate with each other, their routing decisions are determined by the client. The traffic control scheme of the reasoning service enables the online service to divert the request traffic of online users through preset methods (such as random, designated ID, etc.), and pass the real traffic request through HTTP (Hypertext Transfer Protocol, hypertext) Transmission protocol) is sent to the server to perform reasoning services based on different model frameworks, and the validity of the test model is verified by comparison of calculation results.
进一步的,为了支持主流标准框架模型的线上模型服务,对于有修改的计算框架无法提供有效的服务,本实施例中,多框架模型模块,还用于获取修改预上线推理服务的配置文件,创建推理服务实例。Further, in order to support the online model service of the mainstream standard framework model, the modified computing framework cannot provide effective services. In this embodiment, the multi-frame model module is also used to obtain the configuration file of the modified pre-online reasoning service. Create an instance of the inference service.
进一步的,为了支持主流标准框架模型的线上模型服务,对于有修改或有升级版本的计算框架无法提供有效的服务,本实施例中,多框架模型模块,还用于获取添加预上线推理服务的配置文件的参数,创建推理服务实例。Further, in order to support the online model service of the mainstream standard framework model, it is impossible to provide effective services for the modified or upgraded version of the computing framework. In this embodiment, the multi-frame model module is also used to obtain and add pre-online reasoning services The parameters of the configuration file to create an instance of the inference service.
进一步的,为了使得集群管理和运行更加高效,稳定性,同时有效地降低计算资源成本灵活的部署模式也为运维人员提供了云端计算资源、本地计算资源等一系列的部署方案,让推理服务的使用者能够根据实际情况对资源和服务进行更高效的利用,还包括:调度模块,用于根据计算资源集群100中的计算资源的利用率或者用户提供的度量指标,确定对应的pod的数量。Furthermore, in order to make cluster management and operation more efficient and stable, while effectively reducing the cost of computing resources, the flexible deployment model also provides operation and maintenance personnel with a series of deployment solutions such as cloud computing resources and local computing resources, allowing inference services Users of can make more efficient use of resources and services according to the actual situation, and also include: a scheduling module, used to determine the number of corresponding pods based on the utilization of computing resources in the computing resource cluster 100 or metrics provided by users .
Specifically, the inference service platform 200 or its software can provide rapid autoscaling based on Kubernetes API resource configuration and controller state, solving the problems that virtualization-based management and deployment mechanisms face when responding to rapid service scale-out and scale-in: manual creation of resource instances, non-uniform runtime environments, slow instance deployment, inefficient resource reclamation, and poor elasticity. Further, the scheduling module automatically scales the number of Pods in a Replication Controller, Deployment, or Replica Set according to the utilization of the computing resources in use, or according to Custom Metrics supplied by applications. The flexible deployment model also offers operations staff a range of deployment options, such as cloud and local computing resources, so that users of the inference service can use resources and services more efficiently according to their actual situation, making cluster management and operation more efficient and stable while effectively reducing computing resource cost. This embodiment thus allocates different computing power for scheduling through safe and effective resource control.
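For illustration, a minimal sketch of such autoscaling using a Kubernetes HorizontalPodAutoscaler (autoscaling/v1) targeting CPU utilization follows; the Deployment name inference-svc and the thresholds are hypothetical, and user-provided Custom Metrics would instead use the autoscaling/v2 API.

```python
# Sketch: CPU-based autoscaling for an inference Deployment. The target
# name "inference-svc" and the replica bounds are hypothetical.
from kubernetes import client, config

config.load_kube_config()

hpa = client.V1HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="inference-svc-hpa"),
    spec=client.V1HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V1CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="inference-svc"
        ),
        min_replicas=1,
        max_replicas=10,
        target_cpu_utilization_percentage=70,  # scale out above 70% CPU
    ),
)
client.AutoscalingV1Api().create_namespaced_horizontal_pod_autoscaler(
    "default", hpa
)
```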
Further, the scheduling module may also receive a user request, whereupon the inference service platform 200 executes the inference service, obtains the computed result, and returns it to the user. See FIG. 5, a scheduling diagram provided by an embodiment of this application. Specifically, the inference service platform obtains the image file, deploys the model corresponding to the image file according to the release strategy, and, in response to the user's request, uses the model to execute the inference service and obtain the corresponding inference result, i.e., the computed result. For example, when the request is to identify a person in an image, the computed result is either the person found in the image or an indication that the image contains no person; when the request is to extract voiceprint information from speech, the result is the voiceprint information in that speech. Other requests are of course possible, and the user may configure them according to actual needs, as long as the purpose of this embodiment can be achieved.
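A hypothetical client-side request to such a deployed inference service might look as follows; the endpoint path and the payload and response fields are illustrative placeholders only.

```python
# Sketch of a client request to a deployed inference service. The URL,
# request payload, and response shape are hypothetical.
import requests

resp = requests.post(
    "http://inference-svc.default.svc.cluster.local:8080/v1/predict",
    json={"task": "person_detection",
          "image_url": "http://example.com/photo.jpg"},
    timeout=10,
)
resp.raise_for_status()
result = resp.json()  # e.g. {"persons": [...]} or {"persons": []}
```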
Further, to achieve better resource deployment, in this embodiment the inference service platform 200 also includes a monitoring module, used to monitor the computing resource cluster 100 and to issue a service alert when an error occurs in the inference service.
The monitoring module provided in this embodiment monitors the computing resource cluster 100 and promptly collects its operation and usage information. When an error occurs in the inference service, a service alert is issued so that technicians can perform maintenance.
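The following is a minimal sketch of such a monitoring loop, assuming pods labeled app=inference-svc and a placeholder alert hook; a real deployment would typically rely on a dedicated monitoring stack rather than hand-rolled polling.

```python
# Sketch: poll pod health in the compute cluster and emit a service alert
# when an inference pod fails. The label selector and alert hook are
# hypothetical placeholders.
import time
from kubernetes import client, config

def send_alert(message: str):
    # Placeholder: wire this to email, webhook, or IM in practice.
    print(f"[SERVICE ALERT] {message}")

def watch_inference_pods(namespace: str = "default", interval: int = 30):
    config.load_kube_config()
    core = client.CoreV1Api()
    while True:
        pods = core.list_namespaced_pod(
            namespace, label_selector="app=inference-svc"
        )
        for pod in pods.items:
            for cs in pod.status.container_statuses or []:
                waiting = cs.state.waiting
                if waiting and waiting.reason in (
                    "CrashLoopBackOff", "ImagePullBackOff", "Error"
                ):
                    send_alert(f"pod {pod.metadata.name}: {waiting.reason}")
        time.sleep(interval)
```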
The present invention can therefore provide users with rapid deployment and effective scheduling of AI computing resources deployed on local clusters or cloud servers, reduce the launch, operation, and maintenance costs of local or cloud platforms, and help algorithm and business teams with various online inference needs bring applications or services into production quickly.
Based on the above technical means, this embodiment packages the trained model and its runtime environment as an image and submits it to the inference service platform 200, which deploys the online inference service through parameter passing. Inference tasks can then be performed without converting the model type and without concern for model compatibility, improving the efficiency of the inference service.
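For illustration, packaging and submitting such an image might be sketched with the docker SDK for Python as below; the bundle directory, image tag, and registry are hypothetical placeholders.

```python
# Sketch: package a trained model and its runtime as an image and push it
# to a registry the platform can pull from. Paths and names are made up.
import docker

docker_client = docker.from_env()
image, _build_logs = docker_client.images.build(
    path="./model_bundle",  # directory with Dockerfile, model, dependencies
    tag="registry.example.com/team/resnet50-serving:v1",
)
for line in docker_client.images.push(
    "registry.example.com/team/resnet50-serving",
    tag="v1", stream=True, decode=True,
):
    print(line)
```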
The embodiments in this specification are described progressively; each embodiment focuses on its differences from the others, and the same or similar parts of the embodiments may be cross-referenced. Since the apparatus disclosed in an embodiment corresponds to the method disclosed in that embodiment, its description is relatively brief; refer to the method description where relevant.
Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in the embodiments disclosed herein can be implemented in electronic hardware, in computer software, or in a combination of the two. To illustrate this interchangeability of hardware and software clearly, the composition and steps of each example have been described above in general functional terms. Whether these functions are executed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled practitioners may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of this application.
The steps of the methods or algorithms described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
A Kubernetes-based inference service system provided by this application has been described in detail above. Specific examples are used herein to explain the principles and implementations of this application; the descriptions of the above embodiments are intended only to help in understanding the method of this application and its core idea. It should be noted that those of ordinary skill in the art may make several improvements and modifications to this application without departing from its principles, and such improvements and modifications also fall within the protection scope of the claims of this application.

Claims (10)

  1. A Kubernetes-based inference service system, characterized in that it comprises:
    a computing resource cluster and an inference service platform;
    wherein the inference service platform comprises:
    a multi-framework model module, used to support models exported by multiple frameworks; and
    a custom image module, used to obtain an image file sent by a user, deploy according to the image file, and execute an inference service, wherein the image file is obtained by the user packaging a trained model and its runtime environment.
  2. The Kubernetes-based inference service system according to claim 1, characterized in that the inference service platform further comprises:
    a test and release module, used to obtain a test model and to perform a performance test based on the test model and the corresponding running model using A/B testing and corresponding traffic-splitting information, and, when the performance of the test model exceeds that of the running model, to release the test model on a rolling basis.
  3. The Kubernetes-based inference service system according to claim 2, characterized in that the test and release module is used to migrate, during idle time, all users corresponding to the running model to the test model, thereby releasing the test model.
  4. The Kubernetes-based inference service system according to claim 2, characterized in that the test and release module is used to migrate the users corresponding to the running model to the test model in sequence, thereby releasing the test model.
  5. The Kubernetes-based inference service system according to claim 2, characterized in that the inference service platform further comprises:
    a traffic management model, used to split users' request traffic in a preset manner to obtain the traffic-splitting information.
  6. The Kubernetes-based inference service system according to claim 1, characterized in that the multi-framework model module is further used to obtain a configuration file that modifies the pre-launch inference service and to create an inference service instance.
  7. The Kubernetes-based inference service system according to claim 1, characterized in that the multi-framework model module is further used to obtain parameters added to the configuration file of the pre-launch inference service and to create an inference service instance.
  8. The Kubernetes-based inference service system according to claim 1, characterized in that the custom image module is further used to parse the image file to obtain the trained model and the runtime environment, to execute the inference service based on the trained model and the runtime environment to obtain an inference result, and to feed the inference result back to the user.
  9. The Kubernetes-based inference service system according to claim 1, characterized in that it further comprises a scheduling module, used to determine the corresponding number of pods according to the utilization of computing resources in the computing resource cluster or according to user-provided metrics.
  10. The Kubernetes-based inference service system according to claim 1, characterized in that the inference service platform further comprises:
    a monitoring module, used to monitor the computing resource cluster and to issue a service alert when an error occurs in the inference service.
PCT/CN2021/073345 2020-05-28 2021-01-22 Inference service system based on kubernetes WO2021238251A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010470862.6 2020-05-28
CN202010470862.6A CN111629061B (en) 2020-05-28 2020-05-28 Inference service system based on Kubernetes

Publications (1)

Publication Number Publication Date
WO2021238251A1


Also Published As

Publication number Publication date
CN111629061A (en) 2020-09-04
CN111629061B (en) 2023-01-24

Legal Events

121 Ep: The EPO has been informed by WIPO that EP was designated in this application (ref document number: 21813756; country of ref document: EP; kind code of ref document: A1).

NENP: Non-entry into the national phase (ref country code: DE).

122 Ep: PCT application non-entry in European phase (ref document number: 21813756; country of ref document: EP; kind code of ref document: A1).