CN111414233A

CN111414233A - Online model reasoning system

Info

Publication number: CN111414233A
Application number: CN202010201491.1A
Authority: CN
Inventors: 黄绿君; 高峰斌
Original assignee: JD Digital Technology Holdings Co Ltd
Current assignee: JD Digital Technology Holdings Co Ltd
Priority date: 2020-03-20
Filing date: 2020-03-20
Publication date: 2020-07-14

Abstract

The embodiment of the application relates to an online model reasoning system, which comprises a model warehouse and a container mirror image warehouse, wherein the container mirror image warehouse is used for storing container mirror images required by a reasoning model, when an online reasoning request of a user is received, a model micro-service engine is enabled to call the reasoning model required by the user from the model warehouse according to configuration information of the user, and the container mirror images in the container mirror image warehouse avoid the condition that the container mirror images required by a training model are inconsistent with actual container mirror images, so that the reasoning model can be packaged into a reasoning service capable of being operated in a containerization mode, and the online reasoning service is provided.

Description

Online model reasoning system

Technical Field

The application relates to the technical field of distributed storage, in particular to an online model reasoning system.

Background

With the development of big data technology and artificial intelligence technology, more and more business scenes such as financial wind control, online advertisement, commodity recommendation, intelligent cities and the like adopt a large number of machine learning technologies to improve the service quality and the intelligent decision level. For specific tasks, after a model is obtained through training in a training environment specified by the model, the model needs to be packaged, then the model is deployed as an online reasoning service, and when a user uses an operating environment the same as the training environment, the reasoning service can be used.

However, in the process of implementing the invention, the inventor finds that with the increase of the demand of the reasoning service, the types of the reasoning models needing to be deployed are increased, and the problem of error operation of the online reasoning service after the reasoning model is deployed is caused by the fact that the training environment of the reasoning model is different from the operation environment of a user.

Disclosure of Invention

To solve the technical problem or at least partially solve the technical problem, embodiments of the present application provide an online model inference system.

In a first aspect, an embodiment of the present application provides an online model inference system, where the system includes: the system comprises a model warehouse, a container mirror image warehouse, a service designer and a model micro-service engine;

the model warehouse is used for storing an inference model and metadata of the inference model;

the container mirror image warehouse is used for storing container mirror images required by the operation of the inference model;

the service designer is used for receiving configuration information of an inference model of a user to provide online inference service to the outside;

the model micro-service engine is used for pulling the container mirror image in the container mirror image warehouse and pulling the inference model and the metadata from the model warehouse according to the configuration information; and packaging the inference model, the metadata and the container mirror image to obtain model inference service capable of containerization operation so as to provide online inference service for the outside.

Optionally, the system further comprises: a service status monitoring device;

the service state monitoring module is used for determining the CPU utilization rate, GPU utilization rate, memory utilization rate, response delay and the number of container instances of each container instance for carrying inference service in the model micro-service engine; and calculating an accuracy index of the inference service in the model microservice engine.

Optionally, the system further comprises: a container organizer;

the container orchestrator is used for calculating the number of expected container instances according to the CPU utilization rate, the GPU utilization rate, the memory utilization rate, the response time delay and the inference service number, and increasing/deleting the container instances in the model micro-service engine according to the number of the expected container instances.

Optionally, a formula for calculating the number of expected container instances according to the CPU usage rate, the GPU usage rate, the memory usage rate, the response delay, and the number of inference services is as follows:

wherein, α, β and gamma are weight factors of 4 measurement dimensions of CPU utilization rate, GPU utilization rate, memory utilization rate and response time delay respectively, the value range is [0,1], the sum is 1, and ceil represents rounding-down.

Optionally, the model microservice engine includes: a model screener;

and the model filter is used for determining a filtering strategy according to the configuration information and pulling the inference model which accords with the filtering strategy from the model warehouse according to the filtering strategy.

Optionally, the configuration information includes any one of the following five model screening strategies;

a first screening strategy: determining target data information required by a user according to the configuration information, and determining a target reasoning model according to the target data information;

the second screening strategy: acquiring the accuracy indexes of the plurality of reasoning models from a service state monitoring module, and selecting the reasoning model with the highest accuracy index from the plurality of reasoning models to obtain a target reasoning model;

the third screening strategy: acquiring performance evaluation indexes of a plurality of inference models with the same type and different versions, and selecting the inference model with the highest performance evaluation index from the plurality of inference models to obtain a target inference model;

a fourth screening strategy: acquiring performance evaluation indexes of a plurality of inference models with the same type and different versions, and updating and iterating the inference models with the performance evaluation indexes higher than a threshold value to obtain a target inference model;

a fifth screening strategy: and determining a reasoning model identification designated by a user according to the configuration information, and determining a target reasoning model according to the reasoning model identification.

Optionally, the service designer includes: a pressure test/online service module;

the pressure test/online service module is used for carrying out pressure test on the inference service in the model micro-service engine to generate a test result; and receiving a request for inference service request from the user.

Optionally, the online model inference system further comprises: a load balancer;

the load balancer is used for distributing the inference service application request of the user to the container instance of the model micro-service engine, so that the inference service deployed in the container instance responds to the inference service application request of the user.

Optionally, the designer comprises: a monitoring panel;

the monitoring panel is used for collecting storage data in the service state monitoring module and calculating the storage data according to a preset calculation mode to obtain a monitoring index for a user to check.

Optionally, the system further comprises: a model service release management module;

the model service publishing management module is used for managing the functions of online, offline, registration, discovery, publishing, restarting and management of reasoning services in the model micro-service engine.

Compared with the prior art, the technical scheme provided by the embodiment of the application has the following advantages: according to the embodiment of the application, the model warehouse and the container mirror image warehouse storing the container mirror images required by the inference model are established, when an online inference request of a user is received, the model micro-service engine calls the inference model required by the user from the model warehouse according to the configuration information of the user, and the container mirror images in the container mirror image warehouse avoid the condition that the container mirror images required by the training model are inconsistent with the actual container mirror images, so that the inference model can be packaged into the inference service capable of being operated in a containerization mode, and the online inference service is provided.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise.

FIG. 1 is a schematic structural diagram of an online model inference system according to an embodiment of the present application;

FIG. 2 is a schematic structural diagram of another online model inference system provided in an embodiment of the present application;

fig. 3 is a schematic structural diagram of another online model inference system provided in the embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

At present, with the development of big data technology and artificial intelligence technology and the increase of business scenes, a large number of machine learning technologies are adopted to improve the service quality and the intelligent decision level. Aiming at the machine learning model operating environment, the inventor finds the following problems in the research process: due to the fact that the software environment, the dependence base library and the versions of the machine learning model are various, different models are different, the basic environment needs to be built once when different models are deployed, repeated work exists, and the environment is possibly different from the environment when the models are trained, and abnormal operation is caused.

At present, in the prior art, when a plurality of machine learning model services are deployed directly on a physical machine, although the basic environments of the plurality of services can be isolated in a manner of creating a virtual software environment by using tools such as Conda, resource conflicts exist among the plurality of services, and the stability of the services is affected. In addition, the services are deployed in a single instance, and the high availability of the model service cannot be guaranteed. In summary, as the demand of the inference service required by the user increases, the types of the inference models to be deployed increase, and the training environment of the inference models is different from the operation environment of the user, so that the problem of operation error of the online inference service after the inference models are deployed online is caused. Based on this, the embodiment of the present invention first provides an online model inference system, as shown in fig. 1, where the system includes: the system comprises a model warehouse 01, a container mirror image warehouse 02, a service designer 03 and a model micro-service engine 04;

the model warehouse 01 is used for storing inference models and metadata of the inference models;

in the embodiment of the invention, the model warehouse 01 refers to a management warehouse of model files and model metadata which are trained by model construction personnel aiming at specific machine learning tasks based on different frames, and provides functions of version management of models, preview comparison of model metadata information, multi-dimensional classification, sorting, searching and the like of the models. Any version of the model of model repository 01 may be deployed as a batch offline service or a real-time online service or released to a model trading market.

The inference model refers to a machine learning model, the metadata of the inference model refers to configuration parameters of the inference model, for example, the metadata may include a container mirror specified by the inference model, or include data types output by the inference model and other parameter information related to inference, and the specific content may be determined according to the actual situation.

The container mirror image warehouse 02 is used for storing container mirror images required by the operation of the inference model;

in the embodiment of the present invention, the container mirror image warehouse 02 refers to a database for providing container mirror images and mirror image management of model training, model inference tasks, and the like, and the container mirror image warehouse 02 includes a container mirror image (i.e., an environment required for inference model operation) such as a basic algorithm library and a dependent package required for operation of an algorithm model on different hardware platforms (a CPU/GPU server, and the like).

The service designer 03 is configured to receive configuration information of an inference model of an online inference service to be provided externally by a user;

in the embodiment of the invention, the server designer is used for receiving configuration information input by a user, wherein the configuration information refers to information input by the user when the user selects the inference service of the online service in the system, and the configuration information can directly specify the selected inference model in practical application or the user only inputs data required to be calculated.

The model micro-service engine 04 is used for pulling the container mirror image in the container mirror image warehouse 02 and pulling the inference model and the metadata from the model warehouse 01 according to the configuration information; and packaging the inference model, the metadata and the container mirror image to obtain model inference service capable of containerization operation so as to provide online inference service for the outside.

In the embodiment of the present invention, the model microservice engine 04 may pull the container mirror image from the container mirror image warehouse 02 and pull the inference model and the metadata from the model warehouse 01 through the configuration information input by the user, specifically, if the configuration information includes the identifier of the inference model, the inference model and the metadata may be directly pulled from the model warehouse 01 according to the configuration information, the container mirror image on which the inference model depends is determined according to the metadata, and then the container mirror image is pulled from the container mirror image warehouse 02; if the configuration information only contains data information that needs to be calculated by the user, the inference model capable of outputting the data information is determined through the metadata, then the inference model and the metadata are pulled from the model warehouse 01, the container mirror image on which the inference model depends is determined according to the metadata, and then the container mirror image is pulled from the container mirror image warehouse 02.

In this step, the model micro service engine 04 may further encapsulate the inference model, the metadata, and the container mirror image to obtain a model inference service capable of containerization operation to provide an online inference service externally, in practical applications, one system may often provide a plurality of inference services externally, but the matching problem between the inference model and the container mirror image is not considered in the prior art, so that the inference service often makes a mistake in operation, and then the container mirror image corresponding to the inference model is configured through manual maintenance, which is not only tedious in process, lacking in unified management and tracking, and easy to make a mistake.

However, by setting the model warehouse 01 and the container mirror image warehouse 02, the user can select the corresponding container mirror image when designing the specific model online reasoning task in the service designer 03, and the operation environment does not need to be set up again by operation and maintenance, so that the operation stability of the reasoning service is ensured, the number of the reasoning services deployed in the system is large in time, each reasoning service can be pertinently ensured to be deployed online, and the stable service is provided for the user.

In another embodiment of the present invention, as shown in fig. 2, the system further includes: a service status monitoring device 05;

In the embodiment of the present invention, the service status monitoring device 05 is connected to the model microengine service, and is configured to determine CPU utilization, GPU utilization, memory utilization, response delay, and the number of container instances of a plurality of containers in the model microengine service, so as to reflect inference service performance carried by the container.

On the other hand, in the research process, the inventor finds that in the practical application of the inference system, in order to cope with the situation of a large-scale service request, a higher virtual machine number and a higher hardware specification are often configured, and the invocation volume of most services often fluctuates along with the fluctuation of the service, and phenomena similar to a day high phenomenon and a night low phenomenon often occur, so that the utilization rate of hardware resources is low during the period of the lower invocation volume, and resources are wasted, so the invention further provides an elastic capacity expansion and reduction method capable of realizing the inference service, and specifically, in another embodiment of the invention, the system further includes: a container organizer 06;

the container orchestrator 06 is configured to calculate an expected number of container instances according to the CPU utilization, the GPU utilization, the memory utilization, the response delay, and the number of inference services, and add/delete container instances in the model microservice engine according to the expected number of container instances.

In the embodiment of the invention, the monitoring of the number of the expected container instances is calculated by the CPU utilization rate, the GPU utilization rate, the memory utilization rate, the response time delay and the inference service number of the container in the module micro-service engine, so that the resource use condition is monitored, and then the resource configuration is adjusted according to a pre-estimated calculation mode to achieve the effect of resource optimization.

Further, in another embodiment of the present invention, a specific calculation manner for implementing elastic scaling of inference services is provided: and calculating the expected container instance number according to the CPU utilization rate, the GPU utilization rate, the memory utilization rate, the response time delay and the inference service number by the following formula:

In the embodiment of the invention, the expected number of container instances in the next time window is obtained by adopting the container instance number calculation formula provided by the invention for the container resource utilization rate index in the previous time window, then the number of container instances is automatically adjusted by means of a Horizontal automatic scaling function (HPA for short) in Kubernetes, and then the service state monitoring module continuously monitors the updated indexes of the resource utilization rate and the like of each container instance in real time, so that the dynamic adjustment of the model online reasoning service resources is realized through circulation, the purpose of dynamically adjusting the resource utilization rate real-time monitoring index and the expected resource calculation formula based on the reasoning model service is realized, and the resource demand of the model reasoning service is ensured and the idle waste of resources is reduced.

In addition, the inventor also finds that the accuracy of model reasoning needs to retrain the reasoning model after a certain time due to the drift of data distribution, the online service is updated, the current deployment mode needs to manually select a certain version of model file and upload the model file to replace the online model file, the online service is possibly restarted according to different model service packaging modes, the online service is interrupted, the updating process needs manual operation and cannot be automated, the process is complicated, unified management and tracking are lacked, and errors are easy to occur. Based on this, in another embodiment of the present invention, as shown in fig. 3, the model microservice engine includes: a model screener 07;

In the embodiment of the present invention, the model filter is configured to determine the filtering policy according to the configuration information, where the filtering policy included in the configuration information may be a policy template provided by the system in advance, and after the user selects a target policy from the policy template, the system writes the policy into the configuration information, so that the filtering policy may be determined by the configuration information.

Wherein the configuration information comprises any one of the following five model screening strategies;

in practical application, aiming at the condition that a user does not know which inference model to select, the user only needs to input data required by the user, and the model filter can traverse metadata of each inference model in the model warehouse so as to determine a target inference model required by the user from a plurality of inference models.

The second screening strategy: and acquiring the accuracy indexes of the plurality of reasoning models from a service state monitoring module, and selecting the reasoning model with the highest accuracy index from the plurality of reasoning models to obtain a target reasoning model.

In practical application, because a plurality of iterative versions may exist in the inference model with the same function, the system can calculate the accuracy of the output result of the inference model after the inference model outputs the result for each inference model, and further preferentially select the inference model with the highest accuracy.

The third screening strategy: and acquiring performance evaluation indexes of a plurality of inference models with the same type and different versions, and selecting the inference model with the highest performance evaluation index from the plurality of inference models to obtain a target inference model.

In practical application, since the inference model with the same function may have a plurality of iterative versions, the system may evaluate the performance of each inference model with the same type according to a preset performance evaluation index, wherein a specific accuracy index calculation manner of the performance evaluation index may be determined according to the type of the inference model, any user-defined weighting index may be used, and the present invention is not specifically limited to this.

A fourth screening strategy: and acquiring performance evaluation indexes of a plurality of inference models with the same type and different versions, and updating and iterating the target inference model by utilizing the inference model with the performance evaluation index higher than a threshold value.

In practical application, since a plurality of iteration versions may exist in the inference model with the same function, a new version can be re-iterated by using metadata of the same inference model to be deployed online for a user to use.

In practical application, a user can directly designate the inference model to calculate data required by the user, specifically, the inference model can be displayed to the user through a panel to be selected, and a target inference model can be determined according to the inference model identification selected by the user.

The embodiment of the invention achieves the aim of reducing manual operation by designing an automatic model screening method and a strategy, is suitable for an application scene in which a machine learning model is frequently subjected to iterative updating along with the lapse of time, and better ensures the accuracy of the online reasoning service of the reasoning model.

In yet another embodiment of the present invention, the service designer comprises: a pressure test/online service module;

In the embodiment of the invention, the pressure test/online service module refers to a functional module which performs pressure test on the deployed online model inference service or develops the deployed online model inference service to provide the model inference service for a demand party. In the actual application, a json, a data file and a concurrent request are provided by a pressure test to perform long-time or full-load operation test on the deployed model reasoning service, and a report of performance, reliability and stability of the service is generated; the online service is to expose the API interface, calling mode and feedback state description of the model reasoning service for the internal and external network request service. In addition, the stress testing/presence service module may also receive a server request input by a user for using an online inference service.

In yet another embodiment of the present invention, the online model inference system further comprises: a load balancer;

In the embodiment of the invention, automatic load balancing capability is provided for each model reasoning online service, namely, large-scale request flow generated in a pressure test/online service module is distributed to container instances of each computing node through a load balancing algorithm to achieve the aim of balancing resource configuration, wherein the load balancing algorithm adopts a random method, a polling method, an original address hashing method, a weighted polling method, a minimum connection number method and the like, and specific settings can be determined according to actual conditions.

In yet another embodiment of the present invention, the designer includes: a monitoring panel;

In yet another embodiment of the present invention, the system further comprises: a model service release management module;

the model service publishing management module is used for managing the online, offline, registration, discovery, publishing, restarting and management functions of the reasoning service in the model micro-service engine so that a user can know the online reasoning service and select the online reasoning service.

In another embodiment of the present invention, a complete embodiment of the online model inference system in practical application is further provided, as shown in fig. 3, 1) first, the service designer may be a web UI design interface, which is a platform for providing users with the selection of online inference services, and the users may select their own desired inference models on the platform or input their own desired inference data, and then the system determines the inference models for the users according to the inference data, and the specific situation may be determined according to the actual situation.

On the other hand, after the user designs the inference service required by the user, a pressure test/online service module in a web UI design interface can generate large-scale request traffic to distribute the request to container instances of each computing node through a load balancing algorithm, and further, for balancing the automatic load capacity of each online inference service, the online inference system further comprises a load balancer, which is used for distributing the inference service application request of the user to the container instances of the model micro-service engine, so that the inference service deployed in the container instances responds to the inference service application request of the user.

2) The model micro-service engine pulls the inference model from the model warehouse and the container mirror image warehouse according to the selected design content of the user in the service designer, the inference model in the model warehouse is a management warehouse of model files and model metadata information which are trained by construction personnel aiming at specific machine learning tasks based on different frames, the inference model can provide online inference service for the user after being successfully deployed on a container, and therefore the container mirror image operated by the inference model is also required to be pulled from the container mirror image warehouse, for example: basic algorithm libraries such as Scikit-leern, Tensorflow, PyTorch, Keras, Caffe, MXNet, dependency packages and the like are installed.

However, in practical applications, the accuracy of model inference needs to retrain the inference model after a certain time due to drift of data distribution, update the online service, the current deployment mode needs to manually select a certain version of model file, upload the model file, replace the online model file, and according to different model service packaging modes, the service may need to be restarted, which causes interruption of the online service, and the updating process needs manual operation and cannot be automated, and is tedious, lacks of unified management and tracking, and is prone to errors. Therefore, in order to solve the problems of fast iteration speed of inference models, high-frequency update service, complex manual operation and the like, the inventor adds a model filter in a model micro-service engine for determining a filtering strategy according to configuration information and pulling an inference model conforming to the filtering strategy from a model warehouse according to the filtering strategy, wherein the filtering strategy can be displayed in a web UI (user interface) design interface for a user to select, and a specific embodiment of the invention also provides five templates of the filtering strategy, which comprise: the first screening strategy is used for determining target data information required by a user according to the configuration information and determining a target inference model according to the target data information, so that the user can only input data required by the user aiming at the condition that the user does not know which inference model to select; the second screening strategy: the method comprises the steps that accuracy indexes of a plurality of inference models are obtained from a service state monitoring module, the inference model with the highest accuracy index is selected from the inference models, and a target inference model is obtained, so that the system can calculate and store the accuracy of an output result of the inference model after the inference model outputs the result aiming at the condition that a plurality of iteration versions possibly exist in the inference models with the same function, and the user can select the inference model according to the accuracy; the third screening strategy: the method comprises the steps of obtaining performance evaluation indexes of a plurality of inference models with the same type and different versions, selecting the inference model with the highest performance evaluation index from the inference models to obtain a target inference model, and aiming at the condition that the inference models with the same function possibly have a plurality of iterative versions, evaluating the performance of each inference model with the same type by the system according to preset performance evaluation indexes, wherein the performance evaluation indexes comprise: the conventional performance indexes of the regression model comprise MSE, RMSE, MAE and the like, and the conventional performance indexes of the classification model comprise accuracy, recall rate, F1 value, AUC and the like, and can be determined according to actual conditions; a fourth screening strategy: acquiring performance evaluation indexes of a plurality of inference models with the same type and different versions, updating and iterating a target inference model by using the inference model with the performance evaluation index higher than a threshold value, and iterating a new version by using metadata of the same inference model to deploy online for a user aiming at the condition that the inference model with the same function possibly has a plurality of iteration versions; a fifth screening strategy: and determining the inference model identification appointed by the user according to the configuration information, and determining a target inference model according to the inference model identification, wherein the strategy is an inference model directly selected according to data required by the user. The embodiment of the invention achieves the aim of reducing manual operation by designing the automatic model screening method and strategy, is suitable for an application scene in which the machine learning model is frequently subjected to iterative updating along with the time, and better ensures the accuracy of the online reasoning service of the reasoning model.

3) In practical application, the online model inference system also comprises a Kubernets cluster connected with the model micro-service engine, wherein the Kubernets cluster is used for managing the deployed inference service in the model micro-service engine and the matched resources thereof, and reasonably allocating and scheduling basic resources such as a CPU heterogeneous cluster, a GPU heterogeneous cluster, Ceph/HDFS storage service and the like in infrastructure based on a container technology and a self-research scheduling and arranging technology, so that a high-available runtime environment is provided for the model service. The computing nodes in the CPU/GPU heterogeneous cluster are managed in a label mode, namely different computing nodes are divided into different available areas or groups. In deploying the model online service, a node selector is used to deploy the model online service onto a target computing node with a specified tag. To ensure high availability, the number of target compute nodes per tag combination is greater than 1. Therefore, after one target node is down, the scheduler can find out the computing nodes meeting the conditions and migrate the containers corresponding to the online services on the down node to other computing nodes, and therefore high availability of the model online services is guaranteed.

However, based on the existing kubernets cluster (kubernets is a container arrangement engine of Google open source, which supports automatic deployment, large-scale scalable and application containerization management, and can make the application of containerization deployment simple and efficient, kubernets provides a mechanism for application deployment, planning, updating and maintenance), which cannot solve the problem that the prior art deals with large-scale service request situations, when large-scale service requests occur, higher virtual machine quantity and hardware specification are often required to be configured, while the calling quantity of most services often fluctuates along with the fluctuation of business, and phenomena similar to high daytime and low nighttime are often generated, so that the utilization rate of hardware resources is low during the period of low calling quantity, and resource waste is caused, therefore, based on the invention, an elastic capacity expansion and contraction method capable of reasoning services is also provided, and a container arrangement device is added in the kubernets cluster, the method is used for calculating the number of expected container instances according to the CPU utilization rate, the GPU utilization rate, the memory utilization rate, the response time delay and the inference service number of the containers in the model micro-service engine, and increasing/deleting the container instances in the model micro-service engine according to the number of the expected container instances.

In the above steps, parameter information of CPU usage, GPU usage, memory usage, response delay and inference service quantity of a container in a model micro-service engine is acquired by a service state monitoring device, so a service state monitoring device for acquiring container configuration resources is designed, then a container organizer in a Kubernetes cluster adopts a container instance quantity calculation formula provided in the above embodiments to obtain an expected container instance quantity in a next time window, then the container instance quantity is automatically adjusted by means of a Horizontal automatic capacity expansion function (HPA for short) in Kubernetes, then a service state monitoring module continues to monitor indexes such as resource usage rates of the updated container instances in real time, and by this cycle, dynamic adjustment of model online inference service resources is realized, and the purpose of dynamically adjusting resource usage rate real-time monitoring indexes and expected resource calculation formula based on inference model services is realized, the specific calculation formula for the number of the container instances can refer to the content of the above embodiments, and is not repeated here, and in addition, the online reasoning system further includes a monitoring panel, and the monitoring panel is used for collecting the storage data in the service state monitoring module, and calculating the storage data according to a preset calculation mode to obtain a monitoring index for a user to check.

The embodiment of the invention completely isolates container images of different services through a container technology packaging technology (such as a Docker packaging technology) and a packaging model reasoning task, provides distributed disaster tolerance and resource elastic expansion and contraction functions of the services by means of Kubernetes for service arrangement, and decouples an algorithm model from a service framework by combining functional modules such as a model warehouse, a container image warehouse, a service state monitor, a model service release management module, a monitoring panel and the like, so that deployment online, updating and management processes of a reasoning model become simple, online efficiency is high, risks are low, and stability, flexibility and service capability of a prediction service are improved.

4) The online reasoning system also designs a model service publishing management module for managing online, offline, registration, discovery, publishing, restarting and management functions of the reasoning service in the model micro-service engine in order to improve user experience, and when the reasoning service is updated, the model service publishing management module can push a message to a user at the first time so that the user can know the online reasoning service and select the online reasoning service.

It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

The foregoing are merely exemplary embodiments of the present invention, which enable those skilled in the art to understand or practice the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. An online model reasoning system, characterized in that the system comprises: the system comprises a model warehouse, a container mirror image warehouse, a service designer and a model micro-service engine;

2. The online model reasoning system of claim 1, further comprising: a service status monitoring device;

3. The online model reasoning system of claim 2, further comprising: a container organizer;

4. The online model reasoning system of claim 3, wherein the formula for calculating the number of expected container instances based on the CPU usage rate, GPU usage rate, memory usage rate, response latency, and number of reasoning services is as follows:

5. The online model reasoning system of claim 1, wherein the model microservices engine comprises: a model screener;

6. The online model reasoning system of claim 5, wherein the configuration information comprises any one of the following five model screening strategies;

7. The online model reasoning system of claim 1, wherein the service designer comprises: a pressure test/online service module;

8. The online model inference system of claim 7, wherein the online model inference system further comprises: a load balancer;

9. The online model reasoning system of claim 1, wherein the designer comprises: a monitoring panel;

10. The online model reasoning system of claim 1, further comprising: a model service release management module;