WO2024120205A1 - Method and apparatus for optimizing application performance, electronic device, and storage medium - Google Patents

Method and apparatus for optimizing application performance, electronic device, and storage medium

Publication number
WO2024120205A1
Authority
WO
WIPO (PCT)
Application number
PCT/CN2023/133455
Inventor
叶可江
张永贺
须成忠
Original Assignee
中国科学院深圳先进技术研究院

Abstract

The present invention relates to the technical field of cloud computing. Disclosed are a method and apparatus for optimizing application performance, an electronic device, and a storage medium. The method is applied to a colocation (mixed) cluster and comprises: obtaining indicator data; detecting, on the basis of first indicator data of the applications, whether the applications are subject to interference; if there is an abnormal application that is subject to interference, allocating a CPU core to the abnormal application from a CPU shared pool on the basis of second indicator data; and updating a control group (Cgroup) file of the abnormal application according to the CPU core allocated to it. According to the present invention, interference experienced by applications is monitored and resolved in real time by means of indicator data of the applications and of the system kernel, and the CPU cores allocated to applications are dynamically adjusted, so that the stability of applications in the cluster is safeguarded, application performance and overall machine utilization are improved, and the problem of application performance being disturbed by CPU core preemption in the cluster is solved.

Description

Application performance optimization method, apparatus, electronic device and storage medium
Technical Field
The present invention belongs to the technical field of cloud computing and, more specifically, relates to an application performance optimization method, apparatus, electronic device and storage medium.
Background Art
At present, applications in colocation clusters are mainly deployed in containers as microservices. Containerized applications are deployed on servers in one of two CPU usage modes. One is the CPU set mode, which binds an application to fixed CPU cores; in this mode the CPU cores available to the application are allocated in advance, are held exclusively by that application, and cannot be preempted by other applications. The other is the CPU share mode, in which applications share all CPU cores in a CPU shared pool; in this mode all applications share the same set of CPU cores that are not held exclusively by CPU set mode applications.
With the rapid development of cloud computing, more and more applications are moving from CPU set mode to CPU share mode. In CPU share mode, however, applications share CPU cores in the same CPU shared pool, so applications inevitably preempt CPU cores from one another. This causes serious interference while applications run and sharply increases scheduling overhead: the operating system spends a large amount of time switching threads in and out on the CPU cores, while few CPU time slices are actually used by applications, so application performance is seriously affected.
It can be seen that the prior art suffers from the problem that application performance in a colocation cluster is disturbed by CPU core preemption.
Technical Problem
In view of the defects of the related art, the present invention provides an application performance optimization method, apparatus, electronic device and storage medium, aiming to solve the problem in the related art that application performance in a colocation cluster is disturbed by CPU core preemption.
Technical Solution
The technical solution is as follows:
According to one aspect of the present application, an application performance optimization method is applied to a colocation cluster, the method comprising: obtaining indicator data, the indicator data comprising first indicator data of each application during its operation in a current time period and second indicator data related to the system kernel; detecting, based on the first indicator data of each application, whether each application is subject to interference; if there is an abnormal application that is subject to interference, allocating CPU cores to the abnormal application from a CPU shared pool based on the second indicator data; and updating a control group (Cgroup) file of the abnormal application according to the CPU cores allocated to the abnormal application.
According to one aspect of the present application, an application performance optimization apparatus is deployed in a colocation cluster, the apparatus comprising: an acquisition module configured to obtain indicator data, the indicator data comprising first indicator data of each application during its operation in a current time period and second indicator data related to the system kernel; an interference detection module configured to detect, based on the first indicator data of each application, whether each application is subject to interference; a resource allocation module configured to, if there is an abnormal application that is subject to interference, allocate CPU cores to the abnormal application from a CPU shared pool based on the second indicator data; and a file update module configured to update a control group (Cgroup) file of the abnormal application according to the CPU cores allocated to the abnormal application.
According to one aspect of the present application, an electronic device comprises at least one processor, at least one memory, and at least one communication bus, wherein a computer program is stored in the memory and the processor reads the computer program from the memory through the communication bus; when executed by the processor, the computer program implements the application performance optimization method described above.
According to one aspect of the present application, a storage medium stores a computer program which, when executed by a processor, implements the application performance optimization method described above.
According to one aspect of the present application, a computer program product comprises a computer program stored in a storage medium; a processor of a computer device reads the computer program from the storage medium and executes it, so that the computer device implements the application performance optimization method described above.
The beneficial effects of the technical solution provided by the present application are as follows:
In the above technical solution, indicator data about each application and about the system kernel on the colocation cluster are obtained, and whether each application is subject to interference is detected based on the indicator data of the applications. When there is an abnormal application that is subject to interference, the CPU cores allocated to that application are dynamically adjusted based on the indicator data related to the system kernel, so as to safeguard application performance. The present invention monitors and resolves interference to applications in real time based on the indicator data of each application, and dynamically adjusts CPU resources based on the indicator data related to the system kernel. This safeguards the stability of applications in the colocation cluster, improves application performance and overall machine utilization, and solves the problem of application performance in the colocation cluster being disturbed by CPU core preemption.
Brief Description of the Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in describing the embodiments are briefly introduced below.
FIG. 1 is a schematic diagram of an implementation environment of an application performance optimization method provided in an embodiment of the present application;
FIG. 2 is a flowchart of an application performance optimization method provided in an embodiment of the present application;
FIG. 3 is a flowchart of one embodiment of step 240 in the embodiment corresponding to FIG. 2;
FIG. 4 is a flowchart of another embodiment of step 240 in the embodiment corresponding to FIG. 2;
FIG. 5 is a schematic diagram of a specific implementation of an application performance optimization method in an application scenario;
FIG. 6 is a block diagram of an application performance optimization apparatus according to an exemplary embodiment;
FIG. 7 is a hardware structure diagram of a server according to an exemplary embodiment;
FIG. 8 is a block diagram of an electronic device according to an exemplary embodiment.
Embodiments of the Present Invention
Embodiments of the present application are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary, are intended only to explain the present application, and shall not be construed as limiting it.
Those skilled in the art will understand that, unless expressly stated otherwise, the singular forms "a", "an", "said" and "the" used herein may also include the plural forms. It should be further understood that the term "comprising" used in the specification of the present application refers to the presence of the stated features, integers, steps, operations, elements and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof. It should be understood that when an element is said to be "connected" or "coupled" to another element, it may be directly connected or coupled to the other element, or intermediate elements may be present. In addition, "connected" or "coupled" as used herein may include a wireless connection or wireless coupling. The term "and/or" as used herein includes any one of and all combinations of one or more of the associated listed items.
Several terms used in the present application are introduced and explained below:
CPU core: generally understood as a logical core allocated to an application. Different logical cores may come from the same physical core or from different physical cores.
Socket: the CPU socket, that is, the receptacle in which a CPU is installed. CPU resources often include CPU cores from multiple sockets. The CPU cores bound to an application should belong to the same socket as far as possible, because running across sockets wastes performance.
Control group (Cgroup): control groups are a feature of the Linux kernel used to limit, control and isolate the resources (such as CPU, memory, and disk input/output) of a group of processes.
CPU share pool: also referred to as the CPU shared pool. All CPU cores in it can be used by all processes.
CPU share mode: a mode in which applications can share all CPU cores in the CPU shared pool.
CPU set mode: a mode in which the CPU cores available to an application are allocated in advance, are held exclusively by that application, and cannot be preempted by other applications. Applications in this mode are mostly online services with relatively high priority.
Colocation cluster: a technique that mixes clusters together and schedules different types of tasks onto the same physical resources. Through control measures such as scheduling and resource isolation, it improves resource utilization and greatly reduces cost while still meeting service level objectives (SLOs). Colocation means that workloads with different business characteristics, priorities and resource usage models run together on the same machine, which is inevitably accompanied by problems such as resource preemption.
As mentioned above, applications in a colocation cluster in the related art often suffer interference caused by resource preemption.
Interference between applications in a colocation cluster is usually addressed from the perspective of optimizing application deployment: scheduling and deployment algorithms are continually refined in order to reduce deployment cost and thereby reduce application interference. Although these methods have, to some extent, made deployment cost easier to control and prevented some performance interference, application scale is growing sharply and application density on servers keeps rising, so improving scheduling and deployment alone can hardly prevent interference between applications. Moreover, every application is different; application characteristics vary greatly, and it is difficult for a single mechanism or algorithm to cover all cases.
It can be seen from the above that the related art is still limited in that CPU core preemption by applications disturbs performance.
To this end, the application performance optimization method provided by the present application can dynamically adjust CPU resources and thereby effectively improve application performance. Correspondingly, the method is suitable for an application performance optimization apparatus, which can be deployed on an electronic device. The electronic device may be a computer device with a von Neumann architecture, for example a desktop computer, a laptop computer, or a server.
Embodiments
In order to make the objectives, technical solutions and advantages of the present application clearer, implementations of the present application are described in further detail below with reference to the accompanying drawings.
Referring to FIG. 1, a schematic diagram of an implementation environment involved in an application performance optimization method provided by the present application is shown. The implementation environment includes a monitoring component 101, a trigger component 102, a CPU scheduling and allocation component 103, a tuning component 104, and a management and control component 105.
The monitoring component 101 collects indicator data from the applications and the system kernel and sends the indicator data to the trigger component 102 and the CPU scheduling and allocation component 103. The indicator data includes first indicator data of each application during its operation in the current time period and second indicator data related to the system kernel.
The trigger component 102 detects, based on the first indicator data of each application, whether each application is subject to interference, and sends the detection result to the tuning component 104.
The first indicator data refers to indicator data related to the operation of an application in the current time period, for example request response latency, end-to-end latency, or task completion time.
The CPU scheduling and allocation component 103 computes a CPU ledger from the second indicator data related to the system kernel collected by the monitoring component 101.
The second indicator data may include how idle each CPU core is, the position of the CPU cores within each socket, which CPU cores are on the same physical core, and so on. The CPU ledger refers to the CPU cores divided into different levels. The division rule may be one or more of the following, without limitation: whether the CPU cores are in the same socket, whether the CPU cores are on the same physical core, whether the physical core on which a CPU core resides contains an exclusively held CPU core, and the degree of idleness of each CPU core.
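As a non-limiting illustration of the CPU ledger described above, the following Python sketch groups logical cores first by socket and then by physical core. The `LogicalCore` record, its field names, and the nested-dictionary layout are assumptions made for this example; the application does not prescribe a data structure.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LogicalCore:
    """Hypothetical record for one logical CPU core (field names are assumptions)."""
    cpu_id: int       # logical CPU number
    socket_id: int    # socket on which the core sits
    physical_id: int  # physical core to which the logical core belongs
    idle_pct: float   # share of the current time window spent idle


def build_ledger(cores):
    """Group logical cores as {socket_id: {physical_id: [cpu_id, ...]}}."""
    ledger = defaultdict(lambda: defaultdict(list))
    for core in cores:
        ledger[core.socket_id][core.physical_id].append(core.cpu_id)
    # Convert to plain dicts with sorted core lists for stable output.
    return {
        socket: {phys: sorted(ids) for phys, ids in by_phys.items()}
        for socket, by_phys in ledger.items()
    }
```

Grouping by socket first reflects the preference, stated in the term definitions, for keeping an application's cores within one socket.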
If the detection result sent by the trigger component 102 indicates that there is an abnormal application that is subject to interference, the tuning component 104 allocates CPU cores to the abnormal application from the CPU shared pool according to the CPU ledger obtained from the CPU scheduling and allocation component 103, obtains a CPU allocation policy, and sends the CPU allocation policy to the management and control component 105.
After receiving the CPU allocation policy sent by the tuning component 104, the management and control component 105 checks whether the CPU allocation policy is correct. If it is, the component iterates over each abnormal application whose CPU resources need adjusting, locates the Cgroup file of that application, and sets the CPU cores bound to the application in the Cgroup file to the target value. This binds the abnormal application to the CPU cores indicated by the CPU allocation policy; that is, the CPU resources of the abnormal application are adjusted, its performance is optimized, and the application returns to normal.
Referring to FIG. 2, an embodiment of the present application provides an application performance optimization method applied to a colocation cluster.
In the following method embodiments, for ease of description, a server in the colocation cluster is taken as the entity that executes each step of the method; this does not constitute a specific limitation.
As shown in FIG. 2, the method may include the following steps:
Step 200: obtain indicator data.
The indicator data includes first indicator data of each application during its operation in the current time period and second indicator data related to the system kernel.
The first indicator data refers to indicator data related to the operation of an application in the current time period, for example request response latency, end-to-end latency, or task completion time.
The second indicator data refers to indicator data related to the system kernel, for example the scheduling delay of an application on a CPU core, the CPI of an application (cycles per instruction, the average number of cycles needed to execute an instruction), the utilization of each CPU core, the position of the CPU cores within each socket, and which CPU cores are on the same physical core.
In one possible implementation, the indicator data is obtained by a monitoring component deployed in the colocation cluster, which monitors and collects data from the running applications and/or the system kernel in real time.
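By way of example only, per-core utilization (one item of the second indicator data) could be derived on Linux from two snapshots of `/proc/stat`. The application does not state how the monitoring component collects this value, so the data source and the function names `parse_proc_stat` and `idle_pct` are assumptions of this sketch.

```python
def parse_proc_stat(text):
    """Return {cpu_id: (idle_ticks, total_ticks)} from the content of /proc/stat."""
    samples = {}
    for line in text.splitlines():
        fields = line.split()
        # Keep per-core rows ("cpu0", "cpu1", ...) and skip the aggregate "cpu" row.
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        # Columns: user nice system idle iowait irq softirq steal
        ticks = [int(value) for value in fields[1:9]]
        idle = ticks[3] + ticks[4]  # idle + iowait
        samples[int(fields[0][3:])] = (idle, sum(ticks))
    return samples


def idle_pct(before, after):
    """Percentage of time each core was idle between two snapshots."""
    result = {}
    for cpu, (idle_after, total_after) in after.items():
        idle_before, total_before = before[cpu]
        elapsed = total_after - total_before
        result[cpu] = 100.0 * (idle_after - idle_before) / elapsed if elapsed else 0.0
    return result
```

In use, the two snapshots would be read from `/proc/stat` a short interval apart; utilization is then 100 minus the idle percentage.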
Step 220: based on the first indicator data of each application, detect whether each application is subject to interference.
In one possible implementation, whether an application is subject to interference is determined by detecting whether its performance fluctuates; that is, if the application exhibits performance fluctuation, it is determined to be subject to interference.
Specifically, step 220 may include the following steps: obtaining historical indicator data of each application during its operation in a historical time period; comparing the first indicator data of each application with that application's historical indicator data to obtain performance fluctuation data for each application; and then determining, from the performance fluctuation data, whether each application is subject to interference, thereby completing the detection.
The historical indicator data refers to indicator data related to the operation of each application in the historical time period, for example request response latency, end-to-end latency, or task completion time.
In one possible implementation, the performance fluctuation data may be obtained by calculating the difference between the first indicator data and a particular item of historical indicator data.
In one possible implementation, the performance fluctuation data may be obtained by comparing the first indicator data with the mean of all historical indicator data within a particular historical time period.
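The mean-based comparison can be sketched as follows. This is a minimal illustration that assumes a latency-like indicator (larger values are worse) and a relative threshold of 20%; neither the threshold nor the function names come from the application.

```python
from statistics import mean


def performance_fluctuation(current, history):
    """Relative deviation of the current indicator value from the historical mean."""
    baseline = mean(history)
    if baseline == 0:
        raise ValueError("historical mean is zero; relative deviation is undefined")
    return (current - baseline) / baseline


def is_interfered(current, history, threshold=0.2):
    """Treat an application as interfered with when its indicator worsens by more
    than `threshold` (a hypothetical tuning parameter)."""
    return performance_fluctuation(current, history) > threshold
```

For example, a current response latency of 150 ms against a historical mean of 100 ms gives a fluctuation of 0.5, which exceeds the assumed threshold.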
Step 240: if there is an abnormal application that is subject to interference, allocate CPU cores to the abnormal application from the CPU shared pool based on the second indicator data.
An application that is subject to interference is regarded as an abnormal application. Because the CPU cores in the CPU shared pool are shared by all applications, CPU core preemption is avoided by first selecting, based on the second indicator data, a specific number of CPU cores at specific positions from the CPU shared pool and then allocating them to the abnormal application, thereby dynamically adjusting the CPU resources of the abnormal application.
For example, based on the second indicator data, the CPU cores with the highest degree of idleness in the CPU shared pool are allocated to the abnormal application.
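The "most idle cores first" example can be sketched in a few lines. The input mapping from logical CPU id to idle percentage and the tie-breaking rule (lower CPU id first) are assumptions of this illustration.

```python
def pick_idlest(idle_pct_by_cpu, count):
    """Return the `count` most idle CPU ids, breaking ties by lower CPU id."""
    ranked = sorted(idle_pct_by_cpu, key=lambda cpu: (-idle_pct_by_cpu[cpu], cpu))
    return ranked[:count]
```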
Step 260: update the control group (Cgroup) file of the abnormal application according to the CPU cores allocated to the abnormal application.
Control groups (Cgroup) are a feature of the Linux kernel used to limit, control and isolate the resources (such as CPU, memory, and disk input/output) of a group of processes. Updating the Cgroup file updates the CPU cores to which the application is bound.
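A minimal sketch of such an update is given below. It assumes that the file in question is the `cpuset.cpus` file of the application's control group, which accepts a comma-separated list of CPU ids and ranges; the application does not name the file, and the directory passed to `bind_cpus` is a placeholder whose real location depends on the cgroup version and container runtime.

```python
from pathlib import Path


def format_cpu_list(cpus):
    """Compress CPU ids into cpuset list syntax, for example [0, 1, 2, 5] -> '0-2,5'."""
    parts = []
    start = prev = None
    for cpu in sorted(set(cpus)):
        if start is None:
            start = prev = cpu
        elif cpu == prev + 1:
            prev = cpu  # extend the current run
        else:
            parts.append(str(start) if start == prev else f"{start}-{prev}")
            start = prev = cpu
    if start is not None:
        parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(parts)


def bind_cpus(cgroup_dir, cpus):
    """Write the CPU list into <cgroup_dir>/cpuset.cpus and return the file path.

    `cgroup_dir` is a hypothetical argument: the caller must supply the
    application's control group directory.
    """
    target = Path(cgroup_dir) / "cpuset.cpus"
    target.write_text(format_cpu_list(cpus))
    return target
```

On a real host the write typically requires elevated privileges, and the kernel rejects CPU ids that are not available to the parent control group.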
In an exemplary embodiment, after step 260, the method may further include the following steps:
Step 261: after the Cgroup file of the abnormal application has been updated, detect whether the abnormal application has returned to normal based on its first indicator data during operation in the current time period.
That is, the first indicator data of the abnormal application continues to be collected while it runs after the Cgroup file update. This first indicator data is compared with the application's historical indicator data from the historical time period to obtain performance fluctuation data for the period after the update, and whether the abnormal application has returned to normal is then determined from the performance fluctuation indicated by that data.
If the abnormal application returns to normal after the Cgroup file update, step 262 is executed. Otherwise, if the abnormal application is still abnormal after the update, the method returns to step 240 and continues to adjust the CPU resources of the abnormal application until it returns to normal.
Step 262: if it is detected that the abnormal application has returned to normal, return the CPU cores allocated to the abnormal application to the CPU shared pool.
By modifying the Cgroup file of the abnormal application again and setting the number and positions of the CPU cores bound to it, the CPU cores that were allocated to the application while it was abnormal can be returned to the CPU shared pool.
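The loop formed by steps 240, 260, 261 and 262 can be sketched as follows. The callables `is_abnormal`, `allocate` and `bind` are hypothetical stand-ins for the detection, allocation and Cgroup update steps, and `max_rounds` is an added safety bound that the application does not mention. Rebinding the application to the whole shared pool is only one possible reading of step 262.

```python
def tune_until_recovered(is_abnormal, allocate, bind, shared_pool, max_rounds=5):
    """Repeat allocation and Cgroup update until the application recovers.

    Returns True if the application recovered within `max_rounds`, else False.
    """
    for _ in range(max_rounds):
        cores = allocate()   # step 240: choose cores from the shared pool
        bind(cores)          # step 260: update the Cgroup file
        if not is_abnormal():  # step 261: re-check the first indicator data
            bind(shared_pool)  # step 262: hand the cores back to the pool
            return True
    return False
```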
In the above process, interference to applications is monitored and resolved in real time through the indicator data of each application, and CPU resources are dynamically adjusted based on the indicator data related to the system kernel. This safeguards the stability of applications in the colocation cluster, improves application performance and overall machine utilization, and solves the problem of application performance in the colocation cluster being disturbed by CPU core preemption.
Referring to FIG. 3, in an exemplary embodiment, step 240 may include the following steps:
Step 241: if the abnormal application supports the CPU share mode, divide the CPU cores in the CPU shared pool into several idle levels according to the second indicator data of the system.
CPU cores in the same idle level have the same allocation priority. Note that the higher the idle level, the lower the allocation priority; that is, CPU cores in a higher idle level are less likely to be allocated.
In one possible implementation, CPU cores in the same socket are placed in the same idle level. To avoid allocating CPU cores to an application across sockets, CPU cores in the same socket are placed in the same idle level as far as possible. This effectively reduces the performance lost to running across sockets and further helps improve application performance.
In one possible implementation, CPU cores on the same physical core are placed in the same idle level.
In one possible implementation, the degree of idleness of each CPU core is determined based on the second indicator data of the system, and CPU cores whose degree of idleness falls within the same preset range are placed in the same idle level. The degree of idleness can be derived from the utilization of each CPU core, which is obtained from the second indicator data related to the system kernel. For example, CPU cores with a degree of idleness between 10% and 20% are placed in one idle level, and CPU cores with a degree of idleness between 20% and 30% are placed in another.
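The range-based grouping in the example above can be sketched as follows. The sketch assumes half-open ranges of equal width (10% by default), so that a core at exactly 20% falls into the 20% to 30% range; the application does not specify how boundaries are handled or which range corresponds to which idle level, so the keys returned here are range indices only.

```python
from collections import defaultdict


def bucket_by_idleness(idle_pct_by_cpu, width=10.0):
    """Group CPU ids by idleness range; key k covers [k * width, (k + 1) * width)."""
    buckets = defaultdict(list)
    for cpu, idle in sorted(idle_pct_by_cpu.items()):
        buckets[int(idle // width)].append(cpu)
    return dict(buckets)
```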
在一个可能的实现方式,若CPU核所在的物理核中存在被独占的CPU资源,则CPU核的空闲层级高于其他CPU核的空闲层级。其中,所述其他CPU核是指所在物理核中不存在被独占的CPU核。此种方式下,可以有效地避免对独占逻辑核的支持CPU set模式的应用产生干扰,从而进一步有利于提升应用性能。In one possible implementation, if there is an exclusive CPU resource in the physical core where the CPU core is located, the idle level of the CPU core is higher than the idle levels of other CPU cores. The other CPU core refers to a CPU core that does not have an exclusive CPU resource in the physical core. In this way, interference with applications that support the CPU set mode of the exclusive logical core can be effectively avoided, which is further beneficial to improving application performance.
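The idle-level division described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the `Core` record, the `idle_levels` function, the 10% bucket width, and the rule that more idle cores receive a lower level number are all assumptions made for the example. The text only states that a higher level means a lower allocation priority and that cores sharing a physical core with an exclusively held core rank higher than the rest.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Core:
    cpu_id: int          # logical CPU id
    socket: int          # socket the logical CPU belongs to
    physical_core: int   # physical core the logical CPU belongs to
    utilization: float   # per-core utilization in [0.0, 1.0] (second indicator data)


def idle_levels(cores, exclusive_ids, bucket_pct=10):
    """Group shared-pool cores into idle levels; lower level = allocated first."""
    # Physical cores that host at least one exclusively held logical CPU.
    busy_physical = {
        (c.socket, c.physical_core) for c in cores if c.cpu_id in exclusive_ids
    }
    penalty = 100 // bucket_pct + 1  # larger than any ordinary level number
    levels = {}
    for c in cores:
        if c.cpu_id in exclusive_ids:
            continue  # exclusively held cores are not in the shared pool
        idle_pct = round((1.0 - c.utilization) * 100)
        # Assumption: less idle => higher level number => lower priority.
        level = (100 - idle_pct) // bucket_pct
        if (c.socket, c.physical_core) in busy_physical:
            level += penalty  # rank behind every core on an unaffected physical core
        levels.setdefault(level, []).append(c.cpu_id)
    return dict(sorted(levels.items()))
```

Grouping by socket, which the text also mentions, could be layered on top by keying the levels on `(socket, level)`; it is left out here to keep the sketch short.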
Step 242: according to the number of CPU cores the abnormal application needs and the idle levels of the CPU cores, CPU cores of the same idle level are selected from the CPU shared pool to obtain a CPU allocation policy.
The CPU allocation policy indicates the CPU cores that can be allocated to the abnormal application.
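As a rough sketch of step 242, the selection can be written as a scan over the idle levels in priority order. The function name `build_allocation_policy`, the `levels` mapping (level number to a list of core ids), and the choice to return `None` when no single level can satisfy the request are assumptions for illustration; the text does not say what happens in that case.

```python
def build_allocation_policy(levels, needed):
    """Return `needed` core ids taken from one idle level, or None if impossible."""
    for level in sorted(levels):  # lower level number = higher allocation priority
        candidates = levels[level]
        if len(candidates) >= needed:
            # All selected cores come from the same idle level.
            return sorted(candidates)[:needed]
    return None  # no single idle level holds enough cores
```

For example, with `levels = {0: [2], 2: [3, 5, 4], 11: [1]}` a request for two cores skips level 0, which holds only one core, and is served from level 2.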
Referring again to FIG. 4, in an exemplary embodiment, step 240 may further include the following steps:
Step 243: based on the CPU cores that the CPU allocation policy indicates can be allocated to the abnormal application, check whether the CPU allocation policy is correct.
If the CPU allocation policy is found to be correct, step 260 is executed.
Otherwise, if the CPU allocation policy is found to be incorrect, for example because the CPU cores that could be allocated to the abnormal application are already exclusively held by other applications that support the CPU set mode, step 244 is executed.
Step 244: if the CPU allocation policy is found to be incorrect, CPU cores in the CPU shared pool are re-allocated to the abnormal application.
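Steps 243 and 244 amount to a check-then-retry loop, sketched below under stated assumptions. `get_exclusive_ids` is a hypothetical callback that returns the ids of cores currently held exclusively by CPU-set-mode applications, and `max_attempts` is an invented bound; the text does not specify how many times re-allocation is attempted.

```python
def policy_is_correct(policy, exclusive_ids):
    """A policy is correct if it is non-empty and uses no exclusively held core."""
    return bool(policy) and not (set(policy) & set(exclusive_ids))


def allocate_with_recheck(shared_pool, needed, get_exclusive_ids, max_attempts=3):
    """Build a policy, verify it against a fresh exclusive-core snapshot, retry if stale."""
    for _ in range(max_attempts):
        exclusive = set(get_exclusive_ids())
        candidates = [c for c in shared_pool if c not in exclusive]
        if len(candidates) < needed:
            return None  # the shared pool cannot satisfy the request
        policy = candidates[:needed]
        # Step 243: cores may have been claimed since the snapshot was taken.
        if policy_is_correct(policy, get_exclusive_ids()):
            return policy
        # Step 244: policy is wrong, loop again to re-allocate.
    return None
```

Re-reading the exclusive set just before committing is what catches a core that a CPU-set-mode application claimed between planning and applying the policy.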
Taken together, the above embodiments divide the CPU cores in the CPU shared pool into several idle levels so that cores can be allocated to abnormal applications conveniently. Placing the CPU cores of one socket in the same idle level during this division reduces the overhead of running across sockets and further helps improve application performance. Checking whether the allocation policy is correct guards against allocation failures caused by CPU cores that are, at that moment, exclusively held by CPU-set-type applications.
FIG. 5 is a schematic diagram of a specific implementation of the application performance optimization method in one application scenario. In this scenario, the CPU cores allocated to an application in its initial state are those of the CPU shared pool. A server in the colocation cluster collects the application's first indicator data during the current time period and then determines whether that data is abnormal. If it is abnormal, the CPU cores allocated to the application are adjusted; otherwise, the server continues to collect the application's indicator data and to check it for abnormality.
After the CPU cores allocated to the application have been adjusted, the server continues to collect the application's first indicator data and determines whether it is normal. If it is normal, the application returns to its initial state, that is, the CPU cores allocated to the application are restored to the CPU shared pool; otherwise, the CPU cores allocated to the application continue to be adjusted.
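The loop in FIG. 5 can be read as a two-state machine per application. The sketch below encodes that reading; the state names `"shared"` and `"adjusted"`, the action strings, and the helper `run` are invented for illustration, and the abnormality check itself is reduced to a boolean input.

```python
def control_step(state, abnormal):
    """One monitoring round: return (next_state, action) for an application."""
    if abnormal:
        # Abnormal data always triggers (re-)adjustment of the allocated cores.
        return "adjusted", "adjust_cores"
    if state == "adjusted":
        # Application recovered: give its cores back to the CPU shared pool.
        return "shared", "restore_shared_pool"
    return "shared", "keep_monitoring"


def run(observations, state="shared"):
    """Replay a sequence of abnormal/normal observations through the loop."""
    actions = []
    for abnormal in observations:
        state, action = control_step(state, abnormal)
        actions.append(action)
    return state, actions
```

Calling `run([False, True, True, False])` walks through monitoring, two adjustment rounds, and the return to the shared pool.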
In this application scenario, the interference each application suffers is monitored and resolved in real time through the applications' indicator data, and the CPU cores allocated to the applications are adjusted dynamically. This largely safeguards the stability of applications in the colocation cluster, improves application performance and overall machine utilization, and solves the problem of application performance being degraded by CPU core preemption in the colocation cluster.
The following are apparatus embodiments of the present application, which can be used to execute the application performance optimization method of the present application. For details not disclosed in the apparatus embodiments, refer to the method embodiments of the application performance optimization method of the present application.
Referring to FIG. 6, an embodiment of the present application provides an application performance optimization apparatus 900 deployed in a colocation cluster. The apparatus 900 includes, but is not limited to, an acquisition module 910, an interference detection module 930, a resource allocation module 950, and a file update module 970.
The acquisition module 910 is configured to acquire indicator data, which includes first indicator data of each application while it runs during the current time period and second indicator data related to the system kernel.
The interference detection module 930 is configured to detect, based on the first indicator data of each application, whether each application is subject to interference.
The resource allocation module 950 is configured to, if there is an abnormal application subject to interference, allocate CPU cores from the CPU shared pool to the abnormal application based on the second indicator data.
The file update module 970 is configured to update the control group (Cgroup) file of the abnormal application according to the CPU cores allocated to it.
It should be noted that the division into functional modules described above is only an example of how the application performance optimization apparatus of the above embodiment performs application performance optimization. In practice, the functions may be assigned to different functional modules as needed; that is, the internal structure of the apparatus may be divided into different functional modules to perform all or part of the functions described above.
In addition, the application performance optimization apparatus of the above embodiment and the embodiments of the application performance optimization method share the same concept. The specific manner in which each module operates has been described in detail in the method embodiments and is not repeated here.
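To make the file update module's task concrete, the sketch below formats a set of core ids in the list syntax that the Linux `cpuset.cpus` control file accepts (comma-separated ids and ranges such as `0-2,5`) and writes it into a cgroup directory. The function names and the idea of passing the cgroup directory in as a parameter are assumptions for illustration; the text only says that the allocated cores are bound to the application in its Cgroup file. Locating the real directory, permissions, and error handling are out of scope here.

```python
from pathlib import Path


def format_cpu_list(cpu_ids):
    """Compress core ids into cpuset list syntax, e.g. [0, 1, 2, 5] -> '0-2,5'."""
    parts = []
    start = prev = None
    for cpu in sorted(set(cpu_ids)):
        if start is None:
            start = prev = cpu
        elif cpu == prev + 1:
            prev = cpu  # extend the current run
        else:
            parts.append(str(start) if start == prev else f"{start}-{prev}")
            start = prev = cpu
    if start is not None:
        parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(parts)


def update_cpuset(cgroup_dir, cpu_ids):
    """Write the allocated cores to `cpuset.cpus` under the given cgroup directory."""
    target = Path(cgroup_dir) / "cpuset.cpus"
    target.write_text(format_cpu_list(cpu_ids))
    return target
```

On a real host `cgroup_dir` would be the application's cgroup directory under the mounted cgroup filesystem; pointing it at a temporary directory is enough to exercise the formatting logic.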
Referring to FIG. 7, a schematic structural diagram of a server according to an exemplary embodiment is shown.
It should be noted that this server is only one example suited to the present application and must not be taken as limiting the scope of use of the present application in any way. Nor should the server be interpreted as depending on, or needing to have, one or more components of the exemplary server 2000 shown in FIG. 7.
The hardware structure of the server 2000 may vary considerably with configuration or performance. As shown in FIG. 7, the server 2000 includes a power supply 210, an interface 230, at least one memory 250, and at least one central processing unit (CPU) 270.
Specifically, the power supply 210 provides the operating voltage for each hardware device on the server 2000.
The interface 230 includes at least one wired or wireless network interface 231 for interacting with external devices.
Of course, in other examples suited to the present application, the interface 230 may further include at least one serial-to-parallel conversion interface 233, at least one input/output interface 235, at least one USB interface 237, and so on, as shown in FIG. 7; no specific limitation is imposed here.
The memory 250, as the carrier for resource storage, may be a read-only memory, a random access memory, a magnetic disk, an optical disc, or the like. The resources stored on it include an operating system 251, application programs 253, data 255, and so on, and the storage may be transient or persistent.
The operating system 251 manages and controls the hardware devices and the application programs 253 on the server 2000, so that the central processing unit 270 can compute on and process the massive data 255 in the memory 250. It may be Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
An application program 253 is a computer program that performs at least one specific task on top of the operating system 251. It may include at least one module (not shown in FIG. 7), and each module may contain a computer program for the server 2000. For example, the application performance optimization apparatus can be regarded as an application program 253 deployed on the server 2000.
The data 255 may be photos, pictures, and the like stored on disk, or may be indicator data and the like, stored in the memory 250.
The central processing unit 270 may include one or more processors and is configured to communicate with the memory 250 through at least one communication bus in order to read the computer programs stored in the memory 250 and thereby compute on and process the massive data 255 in the memory 250. For example, the application performance optimization method is carried out by the central processing unit 270 reading a series of computer programs stored in the memory 250.
In addition, the present application can equally be implemented by hardware circuits or by hardware circuits combined with software. Implementing the present application is therefore not limited to any specific hardware circuit, software, or combination of the two.
Referring to FIG. 8, an embodiment of the present application provides an electronic device 4000, which may include a server in a colocation cluster.
In FIG. 8, the electronic device 4000 includes at least one processor 4001, at least one communication bus 4002, and at least one memory 4003. The processor 4001 and the memory 4003 are connected, for example through the communication bus 4002.
Optionally, the electronic device 4000 may further include a transceiver 4004, which can be used for data interaction between this electronic device and other electronic devices, such as sending and/or receiving data. It should be noted that in practice the transceiver 4004 is not limited to one, and the structure of the electronic device 4000 does not limit the embodiments of the present application.
The processor 4001 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It can implement or execute the various exemplary logic blocks, modules, and circuits described in connection with the disclosure of the present application. The processor 4001 may also be a combination that implements computing functions, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
The communication bus 4002 may include a path for transferring information between the above components. The communication bus 4002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The communication bus 4002 may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is used in FIG. 8, but this does not mean that there is only one bus or one type of bus.
The memory 4003 may be a ROM (Read Only Memory) or another type of static storage device that can store static information and instructions, a RAM (Random Access Memory) or another type of dynamic storage device that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, and the like), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to these.
A computer program is stored in the memory 4003, and the processor 4001 reads the computer program stored in the memory 4003 through the communication bus 4002.
When executed by the processor 4001, the computer program implements the application performance optimization method of the above embodiments.
In addition, an embodiment of the present application provides a storage medium on which a computer program is stored; when executed by a processor, the computer program implements the application performance optimization method of the above embodiments.
An embodiment of the present application provides a computer program product that includes a computer program stored in a storage medium. A processor of a computer device reads the computer program from the storage medium and executes it, causing the computer device to perform the application performance optimization method of the above embodiments.
Compared with the related art, the present invention monitors and resolves in real time the interference suffered by applications, based on the indicator data of each application, and dynamically adjusts CPU resources based on the indicator data related to the system kernel. This largely safeguards the stability of applications in the colocation cluster, improves application performance and overall machine utilization, eliminates the interference caused by applications preempting CPU cores on colocation servers, and solves the problem of application performance being degraded by CPU core preemption in the colocation cluster.
It should be understood that although the steps in the flowcharts of the accompanying drawings are shown in sequence as indicated by the arrows, these steps are not necessarily executed in that order. Unless explicitly stated herein, the execution of these steps is not strictly ordered, and they may be executed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same moment but may be executed at different times, and they are not necessarily executed sequentially but may be executed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
Those skilled in the art will readily understand that the above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (10)

  1. An application performance optimization method, characterized in that it is applied to a colocation cluster, the method comprising:
    acquiring indicator data, the indicator data including first indicator data of each application while it runs during the current time period and second indicator data related to the system kernel;
    detecting, based on the first indicator data of each application, whether each application is subject to interference;
    if there is an abnormal application subject to interference, allocating CPU cores from a CPU shared pool to the abnormal application based on the second indicator data;
    updating a control group (Cgroup) file of the abnormal application according to the CPU cores allocated to the abnormal application.
  2. The method according to claim 1, characterized in that, after updating the control group (Cgroup) file of the abnormal application according to the CPU cores allocated to the abnormal application, the method further comprises:
    after the Cgroup file of the abnormal application has been updated, detecting, based on the first indicator data of the abnormal application while it runs during the current time period, whether the abnormal application has returned to normal;
    if so, restoring the CPU cores allocated to the abnormal application to the CPU shared pool.
  3. The method according to claim 1, characterized in that detecting, based on the first indicator data of each application, whether each application is subject to interference comprises:
    acquiring historical indicator data of each application while it ran during a historical time period;
    calculating performance fluctuation data for each application from its first indicator data and its historical indicator data;
    if the performance fluctuation data of an application indicates that the application exhibits performance fluctuation, detecting that application as an abnormal application subject to interference.
  4. The method according to claim 1, characterized in that allocating CPU cores from the CPU shared pool to the abnormal application based on the second indicator data comprises:
    if the abnormal application supports the CPU share mode, dividing the CPU cores in the CPU shared pool into several idle levels according to the second indicator data of the system, wherein CPU cores in the same idle level have the same allocation priority;
    selecting CPU cores from the CPU shared pool according to the number of CPU cores the abnormal application needs and the idle levels of the CPU cores, to obtain a CPU allocation policy, wherein the CPU allocation policy indicates the CPU cores that can be allocated to the abnormal application.
  5. The method according to claim 4, characterized in that allocating CPU cores from the CPU shared pool to the abnormal application based on the second indicator data further comprises:
    checking, based on the CPU cores that the CPU allocation policy indicates can be allocated to the abnormal application, whether the CPU allocation policy is correct;
    if the CPU cores that can be allocated to the abnormal application are already exclusively held by other applications that support the CPU set mode, determining that the CPU allocation policy is incorrect and re-allocating CPU cores in the CPU shared pool to the abnormal application.
  6. The method according to claim 4, characterized in that dividing the CPU cores in the CPU shared pool into several idle levels according to the second indicator data of the system comprises:
    assigning the CPU cores in the same socket to the same idle level; or
    assigning the CPU cores in the same physical core to the same idle level; or
    determining the idleness of the CPU cores based on the second indicator data of the system, and assigning CPU cores whose determined idleness falls within the same preset range to the same idle level; or
    if the physical core on which a CPU core resides contains an exclusively held CPU core, setting the idle level of that CPU core higher than the idle levels of the other CPU cores, wherein the other CPU cores are CPU cores whose physical core contains no exclusively held CPU core.
  7. The method according to any one of claims 1 to 6, characterized in that updating the Cgroup file of the abnormal application according to the CPU cores allocated to the abnormal application comprises:
    determining the Cgroup location corresponding to the abnormal application, and locating the Cgroup file of the abnormal application according to the determined Cgroup location;
    binding, in the Cgroup file of the abnormal application, the CPU cores allocated to the abnormal application to the abnormal application.
  8. An application performance optimization apparatus, characterized in that it is deployed in a colocation cluster, the apparatus comprising:
    an acquisition module, configured to acquire indicator data, the indicator data including first indicator data of each application while it runs during the current time period and second indicator data related to the system kernel;
    an interference detection module, configured to detect, based on the first indicator data of each application, whether each application is subject to interference;
    a resource allocation module, configured to, if there is an abnormal application subject to interference, allocate CPU cores from a CPU shared pool to the abnormal application based on the second indicator data;
    a file update module, configured to update a control group (Cgroup) file of the abnormal application according to the CPU cores allocated to the abnormal application.
  9. An electronic device, characterized in that it comprises at least one processor, at least one memory, and at least one communication bus, wherein
    a computer program is stored in the memory, and the processor reads the computer program in the memory through the communication bus;
    when executed by the processor, the computer program implements the application performance optimization method according to any one of claims 1 to 7.
  10. A storage medium on which a computer program is stored, characterized in that, when executed by a processor, the computer program implements the application performance optimization method according to any one of claims 1 to 7.
PCT/CN2023/133455 2022-12-05 2023-11-22 Method and apparatus for optimizing application performance, electronic device, and storage medium WO2024120205A1 (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211548824.3 2022-12-05

Publications (1)

Publication Number Publication Date
WO2024120205A1 true WO2024120205A1 (en) 2024-06-13
