WO2021238251A1 - Inference service system based on Kubernetes - Google Patents

Inference service system based on Kubernetes

Info

Publication number
WO2021238251A1
Authority
WO
WIPO (PCT)
Prior art keywords
model
inference service
inference
kubernetes
module
Prior art date
Application number
PCT/CN2021/073345
Other languages
French (fr)
Chinese (zh)
Inventor
王超
吴韶华
陈清山
张荣国
林秀
Original Assignee
苏州浪潮智能科技有限公司 (Suzhou Inspur Intelligent Technology Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.): 2020-05-28 (Chinese application No. 202010470862.6)
Filing date
Publication date
Application filed by 苏州浪潮智能科技有限公司 (Suzhou Inspur Intelligent Technology Co., Ltd.)
Publication of WO2021238251A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/51 Discovery or management thereof, e.g. service location protocol [SLP] or web services
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1095 Replication or mirroring of data, e.g. scheduling or transport for data synchronisation between network nodes
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L69/00 Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
    • H04L69/22 Parsing or analysis of headers

Definitions

  • This application relates to the technical field of inference services, and in particular to an inference service system based on Kubernetes.
  • An online inference service is an important part of a machine learning project: through it, a trained model delivers its value in production. Many Internet companies, and other companies with online business, typically run several or even dozens of online inference services, called tens of millions of times a day. To support such services efficiently and stably, an online serving framework needs to support mainstream deep learning frameworks, run on both CPU and GPU resources, and run multiple models on a single graphics card to improve GPU utilization. Although related technologies provide multi-framework model support, for models trained with non-standard deep learning frameworks and for AI applications at the SaaS layer, the existing technology cannot provide an online deployment function for online inference services.
  • The purpose of this application is to provide a Kubernetes-based inference service system in which the trained model and its runtime environment are packaged as an image and submitted to the inference service platform; the platform then deploys the online inference service through parameter passing, so inference tasks can be performed without converting model types or worrying about model compatibility.
  • The specific scheme is as follows:
  • This application provides a Kubernetes-based inference service system, including: a computing resource cluster and an inference service platform.
  • The inference service platform includes:
  • a multi-framework model module, used to support models exported from multiple frameworks;
  • a custom image module, used to obtain an image file sent by the user, deploy according to the image file, and execute the inference service, where the image file is obtained by the user packaging the trained model and the runtime environment.
  • Optionally, the inference service platform further includes:
  • a test and release module, used to obtain a test model and, based on the test model and the corresponding running model, perform a performance test using A/B testing and the corresponding traffic-splitting information; when the performance of the test model exceeds that of the running model, the test model is released on a rolling basis.
  • Optionally, the test and release module is configured to migrate all users of the running model to the test model during idle time, realizing the release of the test model.
  • Optionally, the test and release module is configured to migrate the users of the running model to the test model in sequence, realizing the release of the test model.
  • Optionally, the inference service platform further includes:
  • a traffic management module, used to split users' request traffic in a preset manner to obtain the traffic-splitting information.
  • Optionally, the multi-framework model module is further used to obtain a modified configuration file of a pre-launch inference service and create an inference service instance.
  • Optionally, the multi-framework model module is further used to obtain parameters added to the configuration file of a pre-launch inference service and create an inference service instance.
  • Optionally, the custom image module is further used to parse the image file to obtain the trained model and the runtime environment, execute the inference service based on them to obtain an inference result, and feed the inference result back to the user.
  • Optionally, the system further includes: a scheduling module, configured to determine the corresponding number of Pods according to the utilization of the computing resources in the computing resource cluster or metrics provided by the user.
  • Optionally, the inference service platform further includes:
  • a monitoring module, used to monitor the computing resource cluster and to issue a service alert when an error occurs in the inference service.
  • This application provides a Kubernetes-based inference service system, including a computing resource cluster and an inference service platform, where the platform includes: a multi-framework model module, used to support models exported from multiple frameworks; and a custom image module, used to obtain an image file sent by the user, deploy according to the image file, and execute the inference service, the image file being obtained by the user packaging the trained model and the runtime environment.
  • It can be seen that this application packages the trained model and runtime environment as an image and submits it to the inference service platform, which deploys the online inference service through parameter passing; inference tasks can be performed without converting model types or worrying about model compatibility, which improves the efficiency of inference service operation.
  • Figure 1 is a schematic structural diagram of a Kubernetes-based inference service system provided by an embodiment of this application;
  • Figure 2 is a schematic structural diagram of another Kubernetes-based inference service system provided by an embodiment of this application;
  • Figure 3 is a schematic diagram of a test and release process provided by an embodiment of this application;
  • Figure 4 is a schematic diagram of a test structure of an inference service platform provided by an embodiment of this application;
  • Figure 5 is a schematic diagram of scheduling provided by an embodiment of this application;
  • Figure 6 is a schematic structural diagram of the operation of a custom image module provided by an embodiment of this application.
  • Please refer to Figure 1, a schematic structural diagram of a Kubernetes-based inference service system provided by an embodiment of this application, including: a computing resource cluster 100 and an inference service platform 200.
  • The inference service platform 200 includes:
  • a multi-framework model module, used to support models exported from multiple frameworks;
  • a custom image module, used to obtain an image file sent by the user, deploy according to the image file, and execute the inference service, where the image file is obtained by the user packaging the trained model and the runtime environment.
  • During implementation, this application was developed in the Python and Go programming languages, with a Linux system as the deployment environment. The scheme is not restricted by language or system environment, however, and can equally be realized in other languages and environments.
  • The number of computing resources in the computing resource cluster 100 can be customized by the user. It is understandable that each computing resource is provided with an accelerator, including but not limited to a GPU (Graphics Processing Unit), a CPU (Central Processing Unit), a Cambricon MLU, or a dedicated neural network processor; the accelerators may be homogeneous or heterogeneous.
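  • As a sketch of how such an accelerator can be requested from the cluster, the hypothetical Pod spec below asks the Kubernetes scheduler for one GPU; the resource name nvidia.com/gpu assumes the NVIDIA device plugin is installed, and an accelerator such as an MLU would expose its own vendor-specific extended resource name.

```yaml
# Hypothetical Pod spec: requesting one GPU from the cluster's accelerator pool.
# Assumes the NVIDIA device plugin advertises the extended resource nvidia.com/gpu.
apiVersion: v1
kind: Pod
metadata:
  name: inference-worker
spec:
  containers:
    - name: model-server
      image: example.com/inference/model-server:latest  # illustrative image
      resources:
        limits:
          cpu: "4"
          memory: 8Gi
          nvidia.com/gpu: 1  # an MLU would use its vendor's resource name instead
```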
  • To further elaborate on the inference service platform 200: the online inference service it provides is not a model, but a service process for deploying a model online and preparing it beforehand. Specifically, operators can provide a reliable computing power guarantee for online inference services through fine-grained computing resource management and scheduling, and the platform 200 provides a multi-framework model module and a custom image module.
  • Further, please refer to Figure 2, a schematic structural diagram of another Kubernetes-based inference service system provided by an embodiment of this application. The system may also include a data processing module, a traffic management module, a test and release module, a user and storage module, a monitoring module, a scheduling module, and a resource module, making the deployment of online inference services more stable and convenient.
  • The data processing module covers data pre/post-processing, model averaging, and model conversion; the multi-framework model module may specifically include TensorFlow Serving, PyTorch, TensorRT Inference Server, and ML Model Serving; the user and storage module may specifically include multi-user policies, an image registry, and model management; and the monitoring module may specifically include logs, cluster monitoring, and service alerts.
  • It is worth noting that the inference service platform 200 or software can provide rapid auto-scaling based on Kubernetes API resource configuration and controller status. This addresses the problems that virtualization-based management and deployment mechanisms face when services must scale out and in quickly: manual creation of resource instances, inability to unify the runtime environment, slow instance deployment, low resource-reclamation efficiency, and poor elasticity. At the same time, it automatically scales the number of Pods in a Replication Controller, Deployment, or ReplicaSet according to the utilization of the computing resources in use or Custom Metrics provided by other applications, making cluster management and operation more efficient and stable while effectively reducing computing resource costs.
  • Specifically, the multi-framework model module is used to support models exported from multiple frameworks, including but not limited to TensorFlow, PyTorch, TensorRT, and SKLearn. The inference service platform 200 or software thus supports models exported from a variety of deep learning/machine learning frameworks and provides data pre/post-processing support for them, while different computing resources (CPUs or GPUs) can be selected according to different data processing requirements. An inference service instance is created by modifying the configuration file (a .yaml file) of the corresponding pre-launch inference service, or by adding parameters to it, which quickly deploys the inference service of the required model to the online environment; a sketch of such a file follows.
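  • A minimal sketch of such a pre-launch configuration file, here assuming a TensorFlow model served with TensorFlow Serving; the service name, model path, and storage claim are illustrative, and the instance is created by applying the file (e.g. kubectl apply -f resnet-inference.yaml):

```yaml
# Hypothetical pre-launch inference service configuration (.yaml):
# a Deployment running TensorFlow Serving for one exported model.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: resnet-inference            # illustrative service name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: resnet-inference
  template:
    metadata:
      labels:
        app: resnet-inference
    spec:
      containers:
        - name: tf-serving
          image: tensorflow/serving:2.4.1
          args:
            - --model_name=resnet               # parameters modified per model
            - --model_base_path=/models/resnet
          ports:
            - containerPort: 8501               # REST inference endpoint
          volumeMounts:
            - name: model-store
              mountPath: /models/resnet
      volumes:
        - name: model-store
          persistentVolumeClaim:
            claimName: model-store-pvc          # hypothetical model storage
```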
  • Specifically, the custom image module is used to obtain the image file sent by the user, deploy according to the image file, and execute the inference service, where the image file is obtained by the user packaging the trained model and the runtime environment. It is understandable that the inference service platform 200 or software supports inference services for models of non-standard release frameworks, including instance creation for optimized or customized variants of frameworks such as TensorFlow, PyTorch, and TensorRT. Based on the runtime environment used to train the model, the user packages the trained model together with that environment (non-standard framework, run scripts, etc.) as an image, obtains an image file, and submits it to the platform 200 or software; the platform then deploys the online inference service through parameter passing, so inference tasks can be performed without changing model types and without worrying about model compatibility.
  • Specifically, the custom image module is also used to parse the image file to obtain the trained model and runtime environment, execute the inference service based on them to obtain the inference result, and feed the inference result back to the user, as sketched below.
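  • A minimal sketch of this parameter-passing deployment for a user-built custom image; the registry path, environment variables, and port are assumptions, the point being that the packaged model and runtime stay inside the image and only parameters change at deployment time:

```yaml
# Hypothetical deployment of a user-packaged custom image: the model and its
# (non-standard) runtime live inside the image; the platform only passes parameters.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: custom-model-inference
spec:
  replicas: 1
  selector:
    matchLabels:
      app: custom-model-inference
  template:
    metadata:
      labels:
        app: custom-model-inference
    spec:
      containers:
        - name: user-runtime
          image: registry.example.com/user/custom-model:v1  # user-submitted image
          env:
            - name: MODEL_PATH   # illustrative parameters passed by the platform
              value: /opt/model
            - name: SERVE_PORT
              value: "9000"
          ports:
            - containerPort: 9000
```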
  • Please refer to Figure 6, a schematic structural diagram of the operation of a custom image module provided by an embodiment of this application. Specifically, the user packages the trained model and runtime environment, optionally together with configuration files, obtains the image file, and sends it to the Kubernetes-based inference service system. The system's custom image module receives and parses the image file to obtain the trained model, runtime environment, and configuration file, executes the inference service based on the trained model and runtime environment, obtains the inference result, and feeds it back to the user. Of course, the system also includes a storage service for storing image files and the like, and a monitoring service for monitoring the inference services of those images.
  • Further, to enable developers or operators to quickly push a trained model to the online environment and validate it with real traffic, as capability support for subsequent services, in this embodiment the inference service platform 200 further includes:
  • a test and release module, used to obtain a test model and, based on the test model and the corresponding running model, perform a performance test using A/B testing and the corresponding traffic-splitting information; when the performance of the test model exceeds that of the running model, the test model is released on a rolling basis.
  • The inference service platform 200 or software provides an online test function for model services in a production environment: users can verify the inference results and performance of online services, and gray-scale (canary) release of online services through A/B testing is supported. Considering the importance and seriousness of the production environment, a pre-launch model, i.e. a test model, must be tested against real online traffic before it can be fully released, and A/B testing effectively gives the pre-launch model a custom-sized share of online traffic to test against. On the basis of stable traffic and accurate isolation, the release strategy provided by the platform 200 or software can control the model's release in a timed, quantitative way, ensuring that the number of online requests does not impose a load shock on the available computing resources, so that subsequently released models can transition smoothly to full release.
  • Please refer to Figures 3 and 4. Figure 3 is a schematic diagram of a test and release process provided by an embodiment of this application. Specifically, test model 1 and running model 2 are obtained from model and image management; according to the release strategy, model deployment 1 is carried out for test model 1 and model deployment 2 for running model 2. A performance test is run using A/B testing and the pre-processed traffic-splitting information, the inference services are executed, and the corresponding calculation results are obtained; only when the performance of the test model exceeds that of the running model can the test model be rolled out.
  • Figure 4 is a schematic diagram of a test structure of an inference service platform 200 provided by an embodiment of this application. It is understandable that the test model must be tested with real online traffic before it can be fully released. Specifically, the user sends a request to the inference service platform; after receiving it, the platform, based on internal and external cluster load balancing, allocates real traffic to the corresponding test model and running model, with test traffic allocated to the test model and the default traffic to the running model, so that each executes the inference service. The test model and the running model each execute model service 1, model service 2, ..., model service n for their share of the traffic, and the A/B test calculation results of the running model and the test model are then obtained.
  • Further, to prevent the accelerators of the underlying computing resources from being overloaded and delaying the retrieval of user information, in this embodiment the test and release module is used to migrate all users of the running model to the test model during idle time, realizing the release of the test model. Migrating while users are inactive avoids delays during actual use.
  • Further, to avoid the delays caused by overloading the accelerators of the underlying computing resources, in this embodiment the test and release module is used to migrate the users of the running model to the test model in sequence, realizing the release of the test model. The sequence may migrate one, two, or any other number of users at a time, as long as the purpose of this embodiment can be achieved; this embodiment does not limit it. It can be seen that migrating the users of the running model to the test model in sequence avoids overload and, in turn, avoids delays; a Kubernetes-level sketch of such a gradual migration follows.
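  • In Kubernetes terms, one way to realize this batch-by-batch migration is a rolling-update strategy on the serving Deployment; a sketch under the assumption that the validated test model replaces the running model one Pod at a time:

```yaml
# Hypothetical rolling release: replace running-model Pods with test-model Pods
# one at a time, so only a small share of users migrates per step.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-serving
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # bring up one test-model Pod at a time
      maxUnavailable: 0  # never reduce serving capacity during the migration
  selector:
    matchLabels:
      app: model-serving
  template:
    metadata:
      labels:
        app: model-serving
    spec:
      containers:
        - name: server
          image: registry.example.com/models/test-model:v2  # the validated test model
          ports:
            - containerPort: 8501
```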
  • Further, to decouple traffic from infrastructure scaling, in this embodiment the inference service platform 200 further includes: a traffic management module, used to split users' request traffic in a preset manner to obtain the traffic-splitting information.
  • The traffic management model used in this embodiment is Istio's. It decouples traffic from infrastructure scaling, letting operators specify through Pilot which rules traffic should follow, rather than which Pods/VMs should receive it. By decoupling traffic from infrastructure scaling in this way, Istio provides a variety of traffic management functions independent of application code, implemented through deployed Envoy sidecar proxies. Each Pod contains a sidecar proxy which, as part of the Istio mesh, coordinates all inbound and outbound traffic for the Pod. Within the mesh, Pilot converts high-level routing rules into configurations and propagates them to the sidecar proxies, which means that when services communicate with each other, their routing decisions are made on the client side. This traffic control scheme lets a running online service split the request traffic of online users in preset ways (e.g. randomly, or by designated IDs), send the real traffic requests over HTTP (Hypertext Transfer Protocol) to the server to perform inference services based on different model frameworks, and verify the validity of the test model by comparing calculation results.
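  • A sketch of such a preset splitting rule expressed as an Istio VirtualService: requests carrying a designated user ID (here an assumed x-user-id header) go to the test model, while the remaining traffic is split by weight between the running model and the test model; the subsets are assumed to be defined in a matching DestinationRule.

```yaml
# Hypothetical Istio routing rule: designated IDs go to the test model,
# the rest of the traffic is split 90/10 between running and test models.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: model-serving
spec:
  hosts:
    - model-serving              # in-mesh service name (illustrative)
  http:
    - match:
        - headers:
            x-user-id:           # designated-ID routing (assumed header)
              exact: "tester-001"
      route:
        - destination:
            host: model-serving
            subset: test-model
    - route:
        - destination:
            host: model-serving
            subset: running-model
          weight: 90
        - destination:
            host: model-serving
            subset: test-model
          weight: 10             # random 10% share for A/B testing
```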
  • Further, to support online model services for mainstream standard framework models, where a modified computing framework could otherwise not be served effectively, in this embodiment the multi-framework model module is also used to obtain a modified configuration file of the pre-launch inference service and create an inference service instance.
  • Further, for modified or upgraded versions of a computing framework, the multi-framework model module is also used to obtain parameters added to the configuration file of the pre-launch inference service and create an inference service instance.
  • Further, to make cluster management and operation more efficient and stable while effectively reducing computing resource costs, the flexible deployment model also offers operators a range of deployment schemes, such as cloud computing resources and local computing resources, letting users of the inference service use resources and services more efficiently according to the actual situation. The system further includes: a scheduling module, used to determine the corresponding number of Pods according to the utilization of the computing resources in the computing resource cluster 100 or metrics provided by the user.
  • Specifically, the inference service platform 200 or software can provide rapid auto-scaling based on Kubernetes API resource configuration and controller status, addressing the problems of virtualization-based management and deployment mechanisms in responding to rapid service scale-out and scale-in: manual creation of resource instances, inability to unify the runtime environment, slow instance deployment, low resource-reclamation efficiency, and poor elasticity. Further, the scheduling module automatically scales the number of Pods in a Replication Controller, Deployment, or ReplicaSet according to the utilization of the computing resources in use or Custom Metrics provided by other applications (see the sketch below). It can be seen that this embodiment reasonably allocates different computing power for scheduling through a safe and effective resource management and control method.
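  • A sketch of this auto-scaling behavior using the Kubernetes HorizontalPodAutoscaler (autoscaling/v2) API; the CPU target and the custom metric name are illustrative assumptions:

```yaml
# Hypothetical HPA: scale the serving Deployment's Pods on resource
# utilization, or on a Custom Metric exported by another application.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-serving-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-serving
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70               # scale out above 70% CPU
    - type: Pods
      pods:
        metric:
          name: inference_requests_per_second  # assumed custom metric
        target:
          type: AverageValue
          averageValue: "100"
```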
  • Further, the scheduling module may also receive a user request, whereupon the inference service platform 200 executes the inference service, obtains the calculation result, and feeds it back to the user. For details, please refer to Figure 5, a schematic diagram of scheduling provided by an embodiment of this application. Specifically, the inference service platform obtains the image file, deploys the model corresponding to the image file based on the release strategy, and, on the user's request, uses that model to execute the inference service and obtain the corresponding inference result, i.e. the calculation result. For example, when the request is to identify a person in an image, the calculation result may be the image of the person or an indication that no person is present; when the request is to obtain the voiceprint information in a voice recording, the calculation result is that voiceprint information. Of course, other requests are possible, and the user can configure them according to actual needs, as long as the purpose of this embodiment can be achieved.
  • Further, to achieve better resource deployment, in this embodiment the inference service platform 200 further includes: a monitoring module, used to monitor the computing resource cluster 100 and, when an error occurs in the inference service, to issue a service alert. The monitoring module monitors the computing resource cluster 100 and obtains operation and usage information in a timely manner; when an error occurs in the inference service, a service alert is issued so that technicians can perform maintenance.
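  • As a sketch of such a service alert, assuming the cluster is monitored with the Prometheus Operator and the serving Pods export an error counter (the metric name and thresholds are assumptions):

```yaml
# Hypothetical alerting rule (Prometheus Operator CRD): notify technicians
# when an inference service starts returning errors.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: inference-service-alerts
spec:
  groups:
    - name: inference.rules
      rules:
        - alert: InferenceServiceErrors
          expr: rate(inference_request_errors_total[5m]) > 0  # assumed metric
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: "Inference service {{ $labels.service }} is returning errors"
```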
  • In summary, the present invention can provide users with rapid deployment and effective scheduling of AI computing resources on top of locally deployed clusters or cloud servers, reduce the launch, operation, and maintenance costs of local or cloud platforms, and help users with various online inference business needs, as well as algorithm and business teams within enterprises, implement applications or services quickly.
  • This embodiment packages the trained model and the runtime environment as an image and submits it to the inference service platform 200; the platform deploys the online inference service through parameter passing, so inference tasks can be performed without changing model types or worrying about model compatibility, which improves the efficiency of inference service operation.
  • The steps of the method or algorithm described in combination with the embodiments disclosed herein can be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module can reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computer Security & Cryptography (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • Debugging And Monitoring (AREA)

Abstract

An inference service system based on Kubernetes, comprising a computing resource cluster and an inference service platform. The inference service platform comprises: a multi-framework model module used for supporting models exported from multiple frameworks; and a custom image module used for obtaining an image file sent by a user, performing deployment according to the image file, and executing an inference service, wherein the image file is obtained by the user packaging a trained model and a runtime environment. Thus, in the present application, a trained model and its runtime environment are packaged as an image and submitted to the inference service platform; the platform deploys an online inference service by parameter passing, inference tasks can be carried out without converting model types or considering model compatibility, and inference service efficiency is improved.

Description

一种基于Kubernetes的推理服务系统A reasoning service system based on Kubernetes
本申请要求于2020年05月28日提交中国专利局、申请号为202010470862.6、发明名称为“一种基于Kubernetes的推理服务系统”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application claims the priority of a Chinese patent application filed with the Chinese Patent Office on May 28, 2020, the application number is 202010470862.6, and the invention title is "a Kubernetes-based reasoning service system", the entire content of which is incorporated into this application by reference middle.
技术领域Technical field
本申请涉及推理服务技术领域,特别涉及一种基于Kubernetes的推理服务系统。This application relates to the technical field of reasoning services, and in particular to a reasoning service system based on Kubernetes.
背景技术Background technique
在线推理服务(Online Inference Service),是机器学习工程中的重要的一环,通过在线推理服务,训练出来的模型得以在生产环节中体现其价值。很多互联网或具有线上业务的企业通常存在几个甚至几十个线上推理服务,每天调用次数高达千万级别。为了高效稳定的支撑在线服务,在线服务框架需要能够支持主流深度学习框架,支持运行在CPU和GPU资源上,并且单显卡支持运行多个模型,提升GPU资源利用率。相关技术中虽然采用的提供了多框架模型的支持,但是对于非标准深度学习框架训练得到的模型,及SaaS层的AI应用,现有技术无法提供在线部署功能进行在线推理服务。Online inference service (Online Inference Service) is an important part of the machine learning project. Through online inference service, the trained model can reflect its value in the production process. Many Internet or companies with online business usually have several or even dozens of online reasoning services, which are called tens of millions of times a day. In order to support online services efficiently and stably, the online service framework needs to be able to support mainstream deep learning frameworks, support running on CPU and GPU resources, and support running multiple models on a single graphics card to improve GPU resource utilization. Although the related technologies provide support for multi-frame models, for models trained on non-standard deep learning frameworks and AI applications at the SaaS layer, existing technologies cannot provide online deployment functions for online reasoning services.
因此,如何提供一种解决上述技术问题的方案是本领域技术人员目前需要解决的问题。Therefore, how to provide a solution to the above-mentioned technical problems is a problem that needs to be solved by those skilled in the art at present.
发明内容Summary of the invention
本申请的目的是提供一种基于Kubernetes的推理服务系统,能够将训练完成的模型和运行环境以镜像形式进行封装,提交到推理服务平台,推理服务平台通过参数传递形式进行线上推理服务的部署,不需要转换模型类型,也无需顾虑模型兼容性即可进行推理任务。其具体方案如下:The purpose of this application is to provide a Kubernetes-based reasoning service system that can encapsulate the trained model and operating environment in the form of a mirror image and submit it to the reasoning service platform. The reasoning service platform deploys online reasoning services through parameter transfer. , There is no need to switch model types, and no need to worry about model compatibility to perform inference tasks. The specific plan is as follows:
本申请提供了一种基于Kubernetes的推理服务系统,包括:This application provides a Kubernetes-based reasoning service system, including:
计算资源集群和推理服务平台;Computing resource cluster and reasoning service platform;
其中,所述推理服务平台包括:Wherein, the reasoning service platform includes:
多框架模型模块,用于支持多种框架导出的模型;Multi-frame model module, used to support models exported by multiple frameworks;
自定义镜像模块,用于获取用户发送的镜像文件,根据所述镜像文件进行部署,并执行推理服务,其中,所述镜像文件是用户将完成训练的模型和运行环境进行封装而得到的文件。The custom image module is used to obtain the image file sent by the user, deploy according to the image file, and execute inference services, where the image file is a file obtained by encapsulating the trained model and the operating environment by the user.
可选的,所述推理服务平台还包括:Optionally, the reasoning service platform further includes:
测试与发布模块,用于获取测试模型,并基于所述测试模型、对应的运行模型利用A/B测试和对应的分流信息进行性能测试,当所述测试模型的性能大于所述运行模型的性能时,将所述测试模型滚动发布。The test and release module is used to obtain a test model, and perform a performance test based on the test model and the corresponding running model using A/B testing and corresponding diversion information, when the performance of the test model is greater than the performance of the running model At that time, the test model will be released on a rolling basis.
可选的,所述测试与发布模块,用于在空闲时间,将所有所述运行模型对应的用户迁移到所述测试模型上,实现所述测试模型的发布。Optionally, the test and release module is configured to migrate all users corresponding to the running model to the test model in free time, so as to realize the release of the test model.
可选的,所述测试与发布模块,用于依次将所述运行模型对应的用户迁移到所述测试模型上,实现所述测试模型的发布。Optionally, the testing and publishing module is configured to sequentially migrate users corresponding to the running model to the testing model to implement the publishing of the testing model.
可选的,所述推理服务平台还包括:Optionally, the reasoning service platform further includes:
流量管理模型,用于通过预设方式分流用户的请求流量,得到所述分流信息。The traffic management model is used to divert the requested traffic of the user in a preset manner to obtain the diversion information.
可选的,所述多框架模型模块,还用于获取修改预上线推理服务的配置文件,创建推理服务实例。Optionally, the multi-frame model module is also used to obtain and modify the configuration file of the pre-launched reasoning service, and create a reasoning service instance.
可选的,所述多框架模型模块,还用于获取添加预上线推理服务的配置文件的参数,创建推理服务实例。Optionally, the multi-frame model module is also used to obtain the parameters of the configuration file for adding the pre-online reasoning service to create a reasoning service instance.
可选的,所述自定义镜像模块,还用于对所述镜像文件进行解析,得到所述训练的模型和所述运行环境;基于所述训练的模型和所述运行环境执行所述推理服务,得到推理结果,并将所述推理结果反馈至所述用户。Optionally, the custom image module is also used to parse the image file to obtain the trained model and the operating environment; execute the reasoning service based on the trained model and the operating environment , Obtain the inference result, and feed back the inference result to the user.
可选的,还包括:调度模块,用于根据所述计算资源集群中的计算资源的利用率或者用户提供的度量指标,确定对应的pod的数量。Optionally, it further includes: a scheduling module, configured to determine the number of corresponding pods according to the utilization rate of the computing resources in the computing resource cluster or the metric provided by the user.
可选的,所述推理服务平台还包括:Optionally, the reasoning service platform further includes:
监控模块,用于监控所述计算资源集群;当所述推理服务出现错误时,执行服务预警。The monitoring module is used to monitor the computing resource cluster; when an error occurs in the reasoning service, execute a service warning.
本申请提供一种基于Kubernetes的推理服务系统,包括:计算资源集 群和推理服务平台;其中,所述推理服务平台包括:多框架模型模块,用于支持多种框架导出的模型;自定义镜像模块,用于获取用户发送的镜像文件,根据所述镜像文件进行部署,并执行推理服务,其中,所述镜像文件是用户将完成训练的模型和运行环境进行封装而得到的文件。This application provides a Kubernetes-based reasoning service system, including: a computing resource cluster and a reasoning service platform; wherein the reasoning service platform includes: a multi-frame model module for supporting models derived from multiple frameworks; a custom mirroring module , Used to obtain the image file sent by the user, deploy according to the image file, and execute the inference service, wherein the image file is a file obtained by encapsulating the trained model and the operating environment by the user.
可见,本申请将训练完成的模型和运行环境以镜像形式进行封装,提交到推理服务平台,推理服务平台通过参数传递形式进行线上推理服务的部署,不需要转换模型类型,也无需顾虑模型兼容性即可进行推理任务,提高了推理服务运行的效率。It can be seen that this application encapsulates the trained model and operating environment in the form of a mirror image and submits it to the reasoning service platform. The reasoning service platform deploys online reasoning services in the form of parameter transfer. There is no need to convert the model type, and there is no need to worry about model compatibility. The reasoning task can be carried out based on the nature, which improves the efficiency of the reasoning service operation.
附图说明Description of the drawings
为了更清楚地说明本申请实施例或现有技术中的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据提供的附图获得其他的附图。In order to more clearly describe the technical solutions in the embodiments of the present application or the prior art, the following will briefly introduce the drawings that need to be used in the description of the embodiments or the prior art. Obviously, the drawings in the following description are only These are the embodiments of the present application. For those of ordinary skill in the art, other drawings can be obtained based on the provided drawings without creative work.
图1为本申请实施例提供的一种基于Kubernetes的推理服务系统的结构示意图;Figure 1 is a schematic structural diagram of a Kubernetes-based reasoning service system provided by an embodiment of the application;
图2为本申请实施例提供的另一种基于Kubernetes的推理服务系统的结构示意图;Figure 2 is a schematic structural diagram of another Kubernetes-based reasoning service system provided by an embodiment of the application;
图3为本申请实施例提供的一种测试与发布的流程示意图;FIG. 3 is a schematic diagram of a test and release process provided by an embodiment of the application;
图4为本申请实施例提供的一种推理服务平台的测试结构示意图;FIG. 4 is a schematic diagram of a test structure of a reasoning service platform provided by an embodiment of the application;
图5为本申请实施例提供的一种调度示意图;FIG. 5 is a schematic diagram of a scheduling provided by an embodiment of this application;
图6为本申请实施例提供的一种自定义镜像模块工作的结构示意图。FIG. 6 is a schematic diagram of the working structure of a custom mirroring module provided by an embodiment of the application.
具体实施方式Detailed ways
为使本申请实施例的目的、技术方案和优点更加清楚,下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。In order to make the purpose, technical solutions and advantages of the embodiments of this application clearer, the following will clearly and completely describe the technical solutions in the embodiments of this application with reference to the drawings in the embodiments of this application. Obviously, the described embodiments These are a part of the embodiments of this application, but not all of the embodiments. Based on the embodiments in this application, all other embodiments obtained by a person of ordinary skill in the art without creative work shall fall within the protection scope of this application.
相关技术中虽然采用的提供了多框架模型的支持,但是对于非标准深度学习框架训练得到的模型,及SaaS层的AI(Artificial Intelligence,人工智能)应用,现有技术无法提供在线部署功能进行在线推理服务。未解决上述技术问题,本实施例提供一种基于Kubernetes的推理服务系统,请参考图1,图1为本申请实施例提供的一种基于Kubernetes的推理服务系统的结构示意图,包括:Although the related technologies provide support for multi-frame models, for models trained in non-standard deep learning frameworks and AI (Artificial Intelligence) applications at the SaaS layer, existing technologies cannot provide online deployment functions for online deployment. Reasoning service. The above technical problem is not solved. This embodiment provides a Kubernetes-based reasoning service system. Please refer to Figure 1. Figure 1 is a schematic structural diagram of a Kubernetes-based reasoning service system provided by an embodiment of the application, including:
计算资源集群100和推理服务平台200; Computing resource cluster 100 and reasoning service platform 200;
其中,推理服务平台200包括:Among them, the reasoning service platform 200 includes:
多框架模型模块,用于支持多种框架导出的模型;Multi-frame model module, used to support models exported by multiple frameworks;
自定义镜像模块,用于获取用户发送的镜像文件,根据镜像文件进行部署,并执行推理服务,其中,镜像文件是用户将完成训练的模型和运行环境进行封装而得到的文件。The custom image module is used to obtain the image file sent by the user, deploy it according to the image file, and perform inference services. The image file is a file obtained by encapsulating the trained model and the operating environment by the user.
其中,本申请在实施过程中是以Python和go编程语言开发的,部署环境为Linux系统。但是该方案不受语言和系统环境的限制,在其他语言和环境下也完全可以实现。Among them, this application is developed in Python and go programming languages during the implementation process, and the deployment environment is a Linux system. However, this scheme is not restricted by language and system environment, and can be fully realized in other languages and environments.
其中,计算资源集群100中的计算资源的数量用户可自定义设置,可以理解的是,每个计算资源中设置有加速器,该加速器包括但是不限定于GPU(Graphics Processing Unit,图形处理器)、CPU(central processing unit,中央处理器)、寒武纪MLU和专用神经网络处理器,可以是同构加速器也可以是异构加速器。Among them, the number of computing resources in the computing resource cluster 100 can be customized by users. It is understandable that each computing resource is provided with an accelerator. The accelerator includes but is not limited to GPU (Graphics Processing Unit, graphics processor), CPU (central processing unit, central processing unit), Cambrian MLU, and dedicated neural network processors can be homogeneous accelerators or heterogeneous accelerators.
针对推理服务平台200进行进一步阐述,推理服务平台200对应的在线推理服务不是一种模型,而是一种对于模型进行上线部署及前期准备的服务流程。具体的,操作用户可以通过细粒度的计算资源管理和调度为线上推理服务提供可靠的计算力保证,并且推理服务平台200提供了多框架模型模块、自定义镜像模块,进一步的,请参考图2,图2为本申请实施例提供的另一种基于Kubernetes的推理服务系统的结构示意图,该系统还可以包括:数据处理模块、流量管理模块、测试与发布模块、用户与存储模块,监控模块、调度模块和资源模块,使得线上推理服务的部署更加稳定与便捷。其中,数据处理模块包括数据前/后处理、模型平均和模型转换; 多框架模型模块具体可以包括TensorFlow Serving、Pytorch、TensorRT Inference Server、ML Model Serving;用户与存储模块具体可以包括多用户策略、镜像仓库、模型管理;监控模块具体可以包括日志、集群监控和服务报警。To further elaborate on the reasoning service platform 200, the online reasoning service corresponding to the reasoning service platform 200 is not a model, but a service process for online deployment and preliminary preparation of the model. Specifically, operating users can provide reliable computing power guarantee for online reasoning services through fine-grained computing resource management and scheduling, and the reasoning service platform 200 provides multi-frame model modules and custom mirroring modules. For further details, please refer to the figure. 2. Figure 2 is a schematic structural diagram of another Kubernetes-based reasoning service system provided by an embodiment of the application. The system may also include: a data processing module, a traffic management module, a testing and publishing module, a user and storage module, and a monitoring module , Scheduling module and resource module, making the deployment of online reasoning services more stable and convenient. Among them, the data processing module includes data pre/post processing, model averaging, and model conversion; the multi-frame model module may specifically include TensorFlow Serving, Pytorch, TensorRT Inference Server, and ML Model Serving; the user and storage module may specifically include multi-user strategy, mirroring Warehouse and model management; the monitoring module can specifically include logs, cluster monitoring and service alarms.
值得注意的是,推理服务平台200或软件可以基于Kubernetes API资源配置和控制器状态,提供快速的自动伸缩能力,解决了基于虚拟化技术的管理及部署机制在应对服务快速扩容和缩容需求时存在的手动创建资源实例、无法统一运行环境、实例部署、资源回收效率低和弹性能力差等问题,同时根据所使用的计算资源的利用率或其他应程序提供的度量指标Custom Metrics,自动伸缩Replication Controller、Deployment和Replica Set中的Pod数量,使得集群管理和运行更加高效,稳定性,同时有效地降低计算资源成本。It is worth noting that the reasoning service platform 200 or software can provide rapid auto-scaling capabilities based on Kubernetes API resource configuration and controller status, and solve the management and deployment mechanism based on virtualization technology when responding to the rapid expansion and contraction requirements of services. There are problems such as manual creation of resource instances, inability to unify the operating environment, instance deployment, low resource recovery efficiency, and poor elasticity. At the same time, it automatically scales Replication according to the utilization of computing resources used or CustomMetrics provided by other applications. The number of Pods in Controller, Deployment, and Replica Set makes cluster management and operation more efficient and stable, while effectively reducing the cost of computing resources.
具体的,多框架模型模块,用于支持多种框架导出的模型,其中,多种框架包括但是不限定于:TensorFlow、Pytorch、TensorRT、SKLearn。可见,推理服务平台200或软件支持多种的深度学习/机器学习框架导出的模型,并且为所支持的模型提供数据前/后处理的服务支持,同时可根据不同数据处理需求选择不同的计算资源(CPUs或者GPUs)。通过修改或添加相应预上线推理服务的配置文件(.yaml文件)或配置文件中的参数创建推理服务实例,快速将所需模型的推理服务部署在线上环境。Specifically, the multi-frame model module is used to support models derived from multiple frameworks. Among them, multiple frameworks include but are not limited to: TensorFlow, Pytorch, TensorRT, and SKLearn. It can be seen that the reasoning service platform 200 or software supports a variety of models derived from deep learning/machine learning frameworks, and provides service support for data pre/post processing for the supported models. At the same time, different computing resources can be selected according to different data processing requirements (CPUs or GPUs). Create a reasoning service instance by modifying or adding the configuration file (.yaml file) of the corresponding pre-online reasoning service or the parameters in the configuration file, and quickly deploy the reasoning service of the required model to the online environment.
具体的,自定义镜像模块,用于获取用户发送的镜像文件,根据镜像文件进行部署,并执行推理服务,其中,镜像文件是用户将完成训练的模型和运行环境进行封装而得到的文件。Specifically, the custom mirroring module is used to obtain the mirrored file sent by the user, deploy according to the mirrored file, and perform inference services, where the mirrored file is a file obtained by encapsulating the trained model and the operating environment by the user.
可以理解的是,推理服务平台200或软件支持非标准发布框架模型的推理服务,包括:TensorFlow、Pytorch、TensorRT等经过优化或者自定义框架模型的推理服务实例创建,用户根据训练模型所使用的运行环境,将训练完成的模型和运行环境(非标准框架、运行脚本等)以镜像形式进行封装,得到镜像文件,将镜像文件提交到推理服务平台200或软件,推理服务平台通过参数传递形式进行线上推理服务的部署,不需要转换模型类型,也无需顾虑模型兼容性即可进行推理任务。具体的,自定义镜像模块,还用于对镜像文件进行解析,得到训练的模型和运行环境;基于训练的模 型和运行环境执行推理服务,得到推理结果,并将推理结果反馈至用户。It is understandable that the reasoning service platform 200 or software supports reasoning services for non-standard publishing framework models, including: TensorFlow, Pytorch, TensorRT and other optimized or custom framework models for reasoning service instance creation. Users run according to the training model used. Environment, encapsulate the trained model and operating environment (non-standard framework, running script, etc.) in the form of a mirror image to obtain a mirror file, and submit the mirror file to the reasoning service platform 200 or software, and the reasoning service platform performs online through parameter transfer. The deployment of the above reasoning service does not need to change the model type, and there is no need to worry about model compatibility to perform reasoning tasks. Specifically, the custom image module is also used to parse the image file to obtain the trained model and operating environment; perform inference services based on the trained model and operating environment, obtain the inference results, and feed the inference results back to the user.
请参考图6,图6为本申请实施例提供的一种自定义镜像模块工作的结构示意图。具体的,用户将完成训练的模型和运行环境,还可以包括配置文件进行封装,得到镜像文件,将镜像文件发送至基于Kubernetes的推理服务系统。基于Kubernetes的推理服务系统的自定义镜像模块接收到镜像文件,对镜像文件进行解析,得到训练的模型和运行环境还有配置文件,基于训练的模型和运行环境执行推理服务,得到推理结果,并将推理结果反馈至用户,当然,将还包括了存储服务,用于存储镜像文件等,还包括监控服务,用于监控镜像文件的推理服务。Please refer to FIG. 6, which is a schematic diagram of the working structure of a custom mirroring module provided by an embodiment of this application. Specifically, the user encapsulates the trained model and operating environment, including configuration files, to obtain the image file, and send the image file to the Kubernetes-based reasoning service system. The custom mirror module of the Kubernetes-based reasoning service system receives the mirror file, parses the mirror file, and obtains the trained model, operating environment, and configuration file. The reasoning service is executed based on the trained model and operating environment, and the reasoning result is obtained, and The reasoning result is fed back to the user. Of course, it will also include storage services for storing mirror files, etc., as well as monitoring services, reasoning services for monitoring mirror files.
进一步的,为了使开发或运维人员能够快速将已训练好的模型推送至线上环境,获得真实流量的验证,作为后续服务的能力支持,本实施例中,推理服务平台200还包括:Further, in order to enable development or operation and maintenance personnel to quickly push the trained model to the online environment and obtain verification of real traffic, as a capability support for subsequent services, in this embodiment, the reasoning service platform 200 further includes:
测试与发布模块,用于获取测试模型,并基于测试模型、对应的运行模型利用A/B测试和对应的分流信息进行性能测试,当测试模型的性能大于运行模型的性能时,将测试模型滚动发布。The test and release module is used to obtain the test model, and perform performance testing based on the test model and the corresponding running model using A/B testing and corresponding diversion information. When the performance of the test model is greater than the performance of the running model, the test model is rolled release.
推理服务平台200或软件为生产环境下的服务提供了模型服务在线测试功能,用户可以针对线上服务进行推理结果及性能验证,支持通过A/B测试进行在线服务的灰度发布。考虑到生产环境的重要性和严肃性,预上线模型即测试模型必须经过线上真实流量的测试后才可以进行全量发布,使用A/B测试可以有效地为预上线模型提供自定义规模的线上流量进行测试,在保证了流量稳定和准确隔离的基础上,通过推理服务平台200或软件提供的发布策略,可以定时、定量地对模型进行发布策略上的控制,确保线上请求数量不会对现有可用计算资源造成负载冲击,使后续发布的模型能够平稳过渡为全量模型。请参考图3和图4,图3为本申请实施例提供的一种测试与发布的流程示意图。具体的,从模型与镜像管理中得到测试模型1和运行模型2,根据发布策略基于测试模型1进行模型部署1和运行模型2执行对应的模型部署2,利用A/B测试和预处理后得到的对应的分流信息进行性能测试,实现推理服务,得到对应的计算结果;只有当测试模型的性能大于运行模型的性能时,才能够进行测试模型的滚动发布。图4为本申请实施例提供的一种推理服务平台200的测试的结构示意图。 可以理解的是,测试模型必须经过线上真实流量的测试后才可以进行全量发布。具体的,用户下发请求至推理服务平台,当推理服务平台接收到该请求后,基于内外集群负载平衡,会分配真实流量到对应的测试模型和运行模型,其中分配测试模型测试流量,分配运行模型默认流量,使测试模型和运行模型分别执行推理服务,测试模型和运行模型分别执行对应流量的模型服务1、模型服务2、模型服务n,然后得到运行模型和测试模型的A/B测试的计算结果。The reasoning service platform 200 or software provides online test functions of model services for services in a production environment. Users can perform reasoning results and performance verification for online services, and support gray-scale release of online services through A/B testing. Taking into account the importance and seriousness of the production environment, the pre-launch model, that is, the test model, must be tested in real online traffic before it can be released in full. Using A/B testing can effectively provide the pre-launch model with a custom scale line On the basis of ensuring the stability and accurate isolation of the flow, the release strategy provided by the reasoning service platform 200 or the software can be used to control the release strategy of the model regularly and quantitatively to ensure that the number of online requests will not This causes a load impact on the existing available computing resources, so that subsequent models can be smoothly transitioned to full models. Please refer to FIG. 3 and FIG. 4. FIG. 3 is a schematic diagram of a test and release process provided by an embodiment of the application. Specifically, test model 1 and run model 2 are obtained from model and mirror management, and model deployment 1 and run model 2 execute corresponding model deployment 2 based on test model 1 according to the release strategy, which are obtained after A/B testing and preprocessing Perform performance testing on the corresponding shunt information, implement reasoning services, and obtain corresponding calculation results; only when the performance of the test model is greater than the performance of the running model, the rolling release of the test model can be performed. FIG. 4 is a schematic diagram of a test structure of a reasoning service platform 200 provided by an embodiment of the application. It is understandable that the test model must be tested with real online traffic before it can be released in full. Specifically, the user sends a request to the inference service platform. When the inference service platform receives the request, based on the internal and external cluster load balancing, it will allocate real traffic to the corresponding test model and operation model, where the test model test traffic is allocated, and the operation is allocated The default flow of the model enables the test model and the running model to execute the inference service respectively, and the test model and the running model respectively execute the model service 1, model service 2, model service n of the corresponding traffic, and then get the A/B test of the running model and the test model Calculation results.
进一步的,为了避免底层计算资源的加速器的负荷过大,造成用户信息获取的延时,本实施例中,测试与发布模块,用于在空闲时间,将所有运行模型对应的用户迁移到测试模型上,实现测试模型的发布。通过在空闲时间将所有运行模型对应的用户迁移到测试模型上,在用户不使用的情况下进行迁移,避免了在实际使用时延时现象的发生。Further, in order to avoid the overload of the accelerator of the underlying computing resources being too large, causing delays in obtaining user information, in this embodiment, the test and release module is used to migrate all users corresponding to the running model to the test model during idle time. To achieve the release of the test model. By migrating all users corresponding to the running model to the test model in free time, the migration is carried out when the user is not using it, so as to avoid the occurrence of delay in actual use.
进一步的,为了避免底层计算资源的加速器的负荷过大,造成的延时现象的发生,本实施例中,测试与发布模块,用于依次将运行模型对应的用户迁移到测试模型上,实现测试模型的发布。通过依次将运行模型对应的用户迁移到测试模型上,避免了负荷过大的问题的出现,其中,该依次可以是依次迁移一个、两个或者其他数量的用户,只要是能够实现本实施例的目的即可,本实施例不在进行限定。可见,通过依次将运行模型对应的用户迁移到测试模型上,避免了负荷过大的问题的出现,进而避免了延时现象的发生。Further, in order to avoid the occurrence of time delay caused by the excessive load of the accelerator of the underlying computing resources, in this embodiment, the test and release module is used to sequentially migrate the users corresponding to the running model to the test model to realize the test Release of the model. By sequentially migrating the users corresponding to the running model to the test model, the problem of excessive load can be avoided. Among them, the sequence can be sequentially migrating one, two, or other numbers of users, as long as this embodiment can be implemented. The purpose is sufficient, and this embodiment does not limit it. It can be seen that by sequentially migrating the users corresponding to the running model to the test model, the problem of excessive load is avoided, and the delay phenomenon is avoided.
进一步的,为了将流量与基础设施扩容进行解耦,本实施例中,推理服务平台200还包括:流量管理模型,用于通过预设方式分流用户的请求流量,得到分流信息。Further, in order to decouple traffic from infrastructure expansion, in this embodiment, the reasoning service platform 200 further includes: a traffic management model, which is used to split the requested traffic of the user in a preset manner to obtain the split information.
其中,本实施例中的流量管理模型是使用Istio的流量管理模型,是将流量与基础设施扩容进行解耦,让运维人员可以通过Pilot指定流量遵循什么规则,而不是指定哪些pods/VM应该接收流量。通过将流量从基础设施扩展中解耦,就可以让Istio提供各种独立于应用程序代码之外的流量管理功能。这些功能是通过部署的Envoy sidecar代理来实现的。Pod包含一个sidecar代理,该代理作为Istio网格的一部分,负责协调Pod的所有入站和出站流量。在Istio网格中,Pilot负责将高级路由规则转换为配置并将它们 传播到sidecar代理。这意味着当服务彼此通信时,它们的路由决策是由客户端确定的。推理服务的流量调控方案使线上正在运行的服务可以通过预设方式(如:随机、制定ID等)分流线上用户的请求流量,并将真实流量请求通过HTTP(Hypertext Transfer Protocol,超文本传输协议)方式发送到服务端,进行基于不同模型框架的推理服务,通过计算结果对比验证测试模型的有效性。Among them, the traffic management model in this embodiment uses the Istio traffic management model, which decouples the traffic from the expansion of the infrastructure, so that the operation and maintenance personnel can specify which rules the traffic follows through the Pilot, instead of specifying which pods/VM should be Receive traffic. By decoupling traffic from infrastructure extensions, Istio can provide various traffic management functions independent of application code. These functions are implemented through the deployed Envoy sidecar proxy. The Pod contains a sidecar proxy, which is part of the Istio grid and is responsible for coordinating all inbound and outbound traffic for the Pod. In the Istio grid, Pilot is responsible for converting advanced routing rules into configurations and propagating them to the sidecar agents. This means that when services communicate with each other, their routing decisions are determined by the client. The traffic control scheme of the reasoning service enables the online service to divert the request traffic of online users through preset methods (such as random, designated ID, etc.), and pass the real traffic request through HTTP (Hypertext Transfer Protocol, hypertext) Transmission protocol) is sent to the server to perform reasoning services based on different model frameworks, and the validity of the test model is verified by comparison of calculation results.
进一步的,为了支持主流标准框架模型的线上模型服务,对于有修改的计算框架无法提供有效的服务,本实施例中,多框架模型模块,还用于获取修改预上线推理服务的配置文件,创建推理服务实例。Further, in order to support the online model service of the mainstream standard framework model, the modified computing framework cannot provide effective services. In this embodiment, the multi-frame model module is also used to obtain the configuration file of the modified pre-online reasoning service. Create an instance of the inference service.
进一步的,为了支持主流标准框架模型的线上模型服务,对于有修改或有升级版本的计算框架无法提供有效的服务,本实施例中,多框架模型模块,还用于获取添加预上线推理服务的配置文件的参数,创建推理服务实例。Further, in order to support the online model service of the mainstream standard framework model, it is impossible to provide effective services for the modified or upgraded version of the computing framework. In this embodiment, the multi-frame model module is also used to obtain and add pre-online reasoning services The parameters of the configuration file to create an instance of the inference service.
进一步的,为了使得集群管理和运行更加高效,稳定性,同时有效地降低计算资源成本灵活的部署模式也为运维人员提供了云端计算资源、本地计算资源等一系列的部署方案,让推理服务的使用者能够根据实际情况对资源和服务进行更高效的利用,还包括:调度模块,用于根据计算资源集群100中的计算资源的利用率或者用户提供的度量指标,确定对应的pod的数量。Furthermore, in order to make cluster management and operation more efficient and stable, while effectively reducing the cost of computing resources, the flexible deployment model also provides operation and maintenance personnel with a series of deployment solutions such as cloud computing resources and local computing resources, allowing inference services Users of can make more efficient use of resources and services according to the actual situation, and also include: a scheduling module, used to determine the number of corresponding pods based on the utilization of computing resources in the computing resource cluster 100 or metrics provided by users .
Specifically, the inference service platform 200 or its software can provide rapid autoscaling based on Kubernetes API resource configuration and controller state, solving the problems that virtualization-based management and deployment mechanisms face when responding to rapid service scale-out and scale-in: manual creation of resource instances, non-uniform runtime environments, slow instance deployment, inefficient resource reclamation, and poor elasticity. Further, the scheduling module automatically scales the number of Pods in a Replication Controller, Deployment, or Replica Set according to the utilization of the computing resources in use, or according to Custom Metrics supplied by applications. The flexible deployment model also offers operations staff a range of deployment options, such as cloud and local computing resources, so that users of the inference service can use resources and services more efficiently according to their actual situation, making cluster management and operation more efficient and stable while effectively reducing computing resource cost. This embodiment thus allocates different computing power for scheduling through safe and effective resource control.
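For illustration, a minimal sketch of such autoscaling using a Kubernetes HorizontalPodAutoscaler (autoscaling/v1) targeting CPU utilization follows; the Deployment name inference-svc and the thresholds are hypothetical, and user-provided Custom Metrics would instead use the autoscaling/v2 API.

```python
# Sketch: CPU-based autoscaling for an inference Deployment. The target
# name "inference-svc" and the replica bounds are hypothetical.
from kubernetes import client, config

config.load_kube_config()

hpa = client.V1HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="inference-svc-hpa"),
    spec=client.V1HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V1CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="inference-svc"
        ),
        min_replicas=1,
        max_replicas=10,
        target_cpu_utilization_percentage=70,  # scale out above 70% CPU
    ),
)
client.AutoscalingV1Api().create_namespaced_horizontal_pod_autoscaler(
    "default", hpa
)
```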
Further, the scheduling module may also receive a user request, whereupon the inference service platform 200 executes the inference service, obtains the computed result, and returns it to the user. See FIG. 5, a scheduling diagram provided by an embodiment of this application. Specifically, the inference service platform obtains the image file, deploys the model corresponding to the image file according to the release strategy, and, in response to the user's request, uses the model to execute the inference service and obtain the corresponding inference result, i.e., the computed result. For example, when the request is to identify a person in an image, the computed result is either the person found in the image or an indication that the image contains no person; when the request is to extract voiceprint information from speech, the result is the voiceprint information in that speech. Other requests are of course possible, and the user may configure them according to actual needs, as long as the purpose of this embodiment can be achieved.
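A hypothetical client-side request to such a deployed inference service might look as follows; the endpoint path and the payload and response fields are illustrative placeholders only.

```python
# Sketch of a client request to a deployed inference service. The URL,
# request payload, and response shape are hypothetical.
import requests

resp = requests.post(
    "http://inference-svc.default.svc.cluster.local:8080/v1/predict",
    json={"task": "person_detection",
          "image_url": "http://example.com/photo.jpg"},
    timeout=10,
)
resp.raise_for_status()
result = resp.json()  # e.g. {"persons": [...]} or {"persons": []}
```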
Further, to achieve better resource deployment, in this embodiment the inference service platform 200 also includes a monitoring module, used to monitor the computing resource cluster 100 and to issue a service alert when an error occurs in the inference service.
The monitoring module provided in this embodiment monitors the computing resource cluster 100 and promptly collects its operation and usage information. When an error occurs in the inference service, a service alert is issued so that technicians can perform maintenance.
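The following is a minimal sketch of such a monitoring loop, assuming pods labeled app=inference-svc and a placeholder alert hook; a real deployment would typically rely on a dedicated monitoring stack rather than hand-rolled polling.

```python
# Sketch: poll pod health in the compute cluster and emit a service alert
# when an inference pod fails. The label selector and alert hook are
# hypothetical placeholders.
import time
from kubernetes import client, config

def send_alert(message: str):
    # Placeholder: wire this to email, webhook, or IM in practice.
    print(f"[SERVICE ALERT] {message}")

def watch_inference_pods(namespace: str = "default", interval: int = 30):
    config.load_kube_config()
    core = client.CoreV1Api()
    while True:
        pods = core.list_namespaced_pod(
            namespace, label_selector="app=inference-svc"
        )
        for pod in pods.items:
            for cs in pod.status.container_statuses or []:
                waiting = cs.state.waiting
                if waiting and waiting.reason in (
                    "CrashLoopBackOff", "ImagePullBackOff", "Error"
                ):
                    send_alert(f"pod {pod.metadata.name}: {waiting.reason}")
        time.sleep(interval)
```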
The present invention can therefore provide users with rapid deployment and effective scheduling of AI computing resources deployed on local clusters or cloud servers, reduce the launch, operation, and maintenance costs of local or cloud platforms, and help algorithm and business teams with various online inference needs bring applications or services into production quickly.
Based on the above technical means, this embodiment packages the trained model and its runtime environment as an image and submits it to the inference service platform 200, which deploys the online inference service through parameter passing. Inference tasks can then be performed without converting the model type and without concern for model compatibility, improving the efficiency of the inference service.
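For illustration, packaging and submitting such an image might be sketched with the docker SDK for Python as below; the bundle directory, image tag, and registry are hypothetical placeholders.

```python
# Sketch: package a trained model and its runtime as an image and push it
# to a registry the platform can pull from. Paths and names are made up.
import docker

docker_client = docker.from_env()
image, _build_logs = docker_client.images.build(
    path="./model_bundle",  # directory with Dockerfile, model, dependencies
    tag="registry.example.com/team/resnet50-serving:v1",
)
for line in docker_client.images.push(
    "registry.example.com/team/resnet50-serving",
    tag="v1", stream=True, decode=True,
):
    print(line)
```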
The embodiments in this specification are described progressively; each embodiment focuses on its differences from the others, and the same or similar parts of the embodiments may be cross-referenced. Since the apparatus disclosed in an embodiment corresponds to the method disclosed in that embodiment, its description is relatively brief; refer to the method description where relevant.
Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in the embodiments disclosed herein can be implemented in electronic hardware, in computer software, or in a combination of the two. To illustrate this interchangeability of hardware and software clearly, the composition and steps of each example have been described above in general functional terms. Whether these functions are executed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled practitioners may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of this application.
The steps of the methods or algorithms described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
A Kubernetes-based inference service system provided by this application has been described in detail above. Specific examples are used herein to explain the principles and implementations of this application; the descriptions of the above embodiments are intended only to help in understanding the method of this application and its core idea. It should be noted that those of ordinary skill in the art may make several improvements and modifications to this application without departing from its principles, and such improvements and modifications also fall within the protection scope of the claims of this application.

Claims (10)

  1. A Kubernetes-based inference service system, characterized in that it comprises:
    a computing resource cluster and an inference service platform;
    wherein the inference service platform comprises:
    a multi-framework model module, used to support models exported by multiple frameworks; and
    a custom image module, used to obtain an image file sent by a user, deploy according to the image file, and execute an inference service, wherein the image file is obtained by the user packaging a trained model and its runtime environment.
  2. The Kubernetes-based inference service system according to claim 1, characterized in that the inference service platform further comprises:
    a test and release module, used to obtain a test model and to perform a performance test based on the test model and the corresponding running model using A/B testing and corresponding traffic-splitting information, and, when the performance of the test model exceeds that of the running model, to release the test model on a rolling basis.
  3. The Kubernetes-based inference service system according to claim 2, characterized in that the test and release module is used to migrate, during idle time, all users corresponding to the running model to the test model, thereby releasing the test model.
  4. The Kubernetes-based inference service system according to claim 2, characterized in that the test and release module is used to migrate the users corresponding to the running model to the test model in sequence, thereby releasing the test model.
  5. The Kubernetes-based inference service system according to claim 2, characterized in that the inference service platform further comprises:
    a traffic management model, used to split users' request traffic in a preset manner to obtain the traffic-splitting information.
  6. The Kubernetes-based inference service system according to claim 1, characterized in that the multi-framework model module is further used to obtain a configuration file that modifies the pre-launch inference service and to create an inference service instance.
  7. The Kubernetes-based inference service system according to claim 1, characterized in that the multi-framework model module is further used to obtain parameters added to the configuration file of the pre-launch inference service and to create an inference service instance.
  8. The Kubernetes-based inference service system according to claim 1, characterized in that the custom image module is further used to parse the image file to obtain the trained model and the runtime environment, to execute the inference service based on the trained model and the runtime environment to obtain an inference result, and to feed the inference result back to the user.
  9. The Kubernetes-based inference service system according to claim 1, characterized in that it further comprises a scheduling module, used to determine the corresponding number of pods according to the utilization of computing resources in the computing resource cluster or according to user-provided metrics.
  10. The Kubernetes-based inference service system according to claim 1, characterized in that the inference service platform further comprises:
    a monitoring module, used to monitor the computing resource cluster and to issue a service alert when an error occurs in the inference service.
PCT/CN2021/073345 2020-05-28 2021-01-22 Inference service system based on kubernetes WO2021238251A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010470862.6 2020-05-28
CN202010470862.6A CN111629061B (en) 2020-05-28 2020-05-28 Inference service system based on Kubernetes

Publications (1)

Publication Number Publication Date
WO2021238251A1


Also Published As

Publication number Publication date
CN111629061A (en) 2020-09-04
CN111629061B (en) 2023-01-24

Legal Events

121 Ep: The EPO has been informed by WIPO that EP was designated in this application (ref document number: 21813756; country of ref document: EP; kind code of ref document: A1).

NENP: Non-entry into the national phase (ref country code: DE).

122 Ep: PCT application non-entry in European phase (ref document number: 21813756; country of ref document: EP; kind code of ref document: A1).