CN112540934B - Method and system for ensuring quality of service when multiple latency-critical programs are executed together - Google Patents
Method and system for ensuring quality of service when multiple latency-critical programs are executed together
- Publication number: CN112540934B (application CN202011465046.2A)
- Authority: CN (China)
- Prior art keywords: program, type, stage, cache space, phase
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0806—Multiuser, multiprocessor or multiprocessing cache systems
- G06F12/0811—Multiuser, multiprocessor or multiprocessing cache systems with multilevel cache hierarchies
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Debugging And Monitoring (AREA)
- Memory System Of A Hierarchy Structure (AREA)
Abstract
The invention discloses a method and a system for ensuring quality of service when multiple latency-critical programs are executed together. The method starts the latency-critical programs, pinning each one to its own core, with the programs on all cores sharing the last-level cache (LLC) space. Each latency-critical program is divided into program phases, and each program phase into program intervals. While the latency-critical programs run together, one program interval is sampled in each program phase of each program; first, second and third actual performance data are computed for each phase from the sampled data; the phase type and performance of each program phase are classified from these data; and the cache space occupied by each latency-critical program is adjusted dynamically at runtime according to the phase type and performance type of each of its program phases.
Description
Technical Field
The present application relates to the field of parallel and distributed computing technologies, and in particular to a method and system for ensuring quality of service when multiple latency-critical programs are executed together.
Background
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
Data centers have matured from concept to practice. In a data center, a large number of programs are packed onto as few servers as possible to improve resource utilization, so multiple programs execute together on a single server node. Co-execution raises server utilization, but it also degrades program performance. The degree of degradation depends on program characteristics: some programs slow down only slightly when executed alongside others, while others slow down severely.
Meanwhile, data centers run a large number of latency-critical programs. Customers who execute programs in a data center impose quality-of-service requirements on them, for example that a program's performance must not drop below 90% of its performance when executed alone. When a latency-critical program is executed together with other programs, performance interference can easily cause severe degradation, so the customers' quality-of-service requirements cannot be met. A method is therefore needed that guarantees the quality of service of latency-critical programs while improving system resource utilization as much as possible. This is the problem the present application sets out to solve.
Disclosure of Invention
To overcome the deficiencies of the prior art, the present application provides a method and a system for ensuring quality of service when multiple latency-critical programs are executed together.
In a first aspect, the present application provides a method for ensuring quality of service when multiple latency-critical programs are executed together.
A method for ensuring quality of service when multiple latency-critical programs are executed together comprises:
initializing hardware counters and starting the latency-critical programs; each latency-critical program is pinned to its own core, and the programs on all cores share the last-level cache (LLC) space;
dividing each latency-critical program into program phases, and dividing each program phase into program intervals;
while the latency-critical programs run together, sampling one program interval in each program phase of each program with hardware performance counters; computing first, second and third actual performance data for each phase from the sampled data; classifying the phase type of each phase from the first actual performance data; classifying the performance of each phase from the second and third actual performance data;
and dynamically adjusting, at runtime, the cache space occupied by each latency-critical program according to the phase type and performance type of each of its program phases.
In a second aspect, the present application provides a system for ensuring quality of service when multiple latency-critical programs are executed together.
A system for ensuring quality of service when multiple latency-critical programs are executed together comprises:
an initialization module configured to initialize hardware counters and start the latency-critical programs, each pinned to its own core, with the programs on all cores sharing the last-level cache (LLC) space;
a phase-division module configured to divide each latency-critical program into program phases, and each program phase into program intervals;
a classification module configured to sample, while the latency-critical programs run together, one program interval in each program phase of each program with hardware performance counters; compute first, second and third actual performance data for each phase from the sampled data; classify the phase type of each phase from the first actual performance data; and classify the performance of each phase from the second and third actual performance data;
and a dynamic adjustment module configured to dynamically adjust, at runtime, the cache space occupied by each latency-critical program according to the phase type and performance type of each of its program phases.
In a third aspect, the present application further provides an electronic device comprising one or more processors, one or more memories, and one or more computer programs, wherein a processor is connected to the memory and the one or more computer programs are stored in the memory; when the electronic device runs, the processor executes the one or more computer programs stored in the memory, causing the electronic device to perform the method of the first aspect.
In a fourth aspect, the present application also provides a computer-readable storage medium storing computer instructions which, when executed by a processor, perform the method of the first aspect.
In a fifth aspect, the present application also provides a computer program (product) comprising a computer program which, when run on one or more processors, implements the method of any of the preceding first aspects.
Compared with the prior art, the beneficial effects of the present application are:
By monitoring the performance indicators of latency-critical programs in real time and using CAT to dynamically partition LLC resources among latency-critical programs of different types, the application guarantees the performance of co-executing latency-critical programs while improving LLC resource utilization as much as possible.
Intel's technology for last-level cache (LLC) allocation makes better use of the cache through cache partitioning. The present application uses this technology to guarantee users' performance requirements by preventing latency-critical programs from polluting each other's caches. In addition, the application better meets users' performance requirements by allocating more LLC resources to latency-critical programs whose performance benefits from them, and by reducing or stopping allocation to those that do not benefit.
The invention dynamically adjusts the cache space a program occupies according to the performance indicators of the program's runtime phases. While guaranteeing the programs' quality of service, it increases the number of latency-critical programs that can be hosted and the LLC resource utilization as much as possible.
Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application.
FIG. 1 is a flowchart of a resource partitioning method according to a first embodiment;
FIG. 2 is a flowchart of the program phase performance analysis of the first embodiment.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise, and it should be understood that the terms "comprises" and "comprising", and any variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
Interpretation of terms:
Latency-critical programs: applications with strict requirements on tail latency; tail latency is an important performance indicator for latency-critical programs.
LLC: the Last-Level Cache, i.e., the highest-level cache shared by all functional units on the chip (e.g., CPU cores, the IGP, and the DSP).
CAT: Cache Allocation Technology, whose basic goal is to enable resource allocation based on application priority or class of service (CLOS). The Intel Xeon processor E5 v4 family (and a subset of the Intel Xeon processor E5 v3 family for communications) introduced the ability to configure and use cache allocation technology on the last-level cache.
CLOS: Class of Service. As an abstraction, a CLOS can carry multiple resource-control attributes, thereby reducing software overhead during context switches.
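On Linux, CAT is commonly driven through the resctrl filesystem mounted at /sys/fs/resctrl. As a hedged illustration only (the group name, cache id, and way range below are invented for the example, and actually writing the files requires root plus a CPU and kernel with CAT support), the following sketch builds the contiguous way bitmask and the L3 "schemata" line that resctrl expects:

```python
# Hedged sketch: helpers for resctrl-style CAT configuration. Only the mask
# and schemata-line construction run here; the file writes are illustrative.
def contiguous_way_mask(first_way, num_ways):
    """CAT requires contiguous way masks: num_ways ones starting at first_way."""
    return ((1 << num_ways) - 1) << first_way

def l3_schemata_line(cache_id, mask):
    """Format one L3 allocation line as resctrl expects, e.g. 'L3:0=f0'."""
    return "L3:%d=%x" % (cache_id, mask)

# Illustrative use (not executed): create a CLOS-like group and restrict it
# to ways 4..7 of L3 cache 0.
#   os.makedirs("/sys/fs/resctrl/lc_prog1", exist_ok=True)
#   with open("/sys/fs/resctrl/lc_prog1/schemata", "w") as f:
#       f.write(l3_schemata_line(0, contiguous_way_mask(4, 4)) + "\n")
```

The mask helper encodes the contiguity constraint that the patent relies on when it allocates LLC ways from low to high addresses.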
Example one
The embodiment provides a method for ensuring quality of service when multiple latency-critical programs are executed together.
A method for ensuring quality of service when multiple latency-critical programs are executed together comprises:
S101: initializing hardware counters and starting the latency-critical programs; each latency-critical program is pinned to its own core, and the programs on all cores share the last-level cache (LLC) space;
S102: dividing each latency-critical program into program phases, and dividing each program phase into program intervals;
S103: while the latency-critical programs run together, sampling one program interval in each program phase of each program with hardware performance counters;
computing first, second and third actual performance data for each program phase from the sampled data;
classifying the phase type of each program phase from the first actual performance data;
classifying the performance of each program phase from the second and third actual performance data;
S104: dynamically adjusting, at runtime, the cache space occupied by each latency-critical program according to the phase type and performance type of each of its program phases.
It should be understood that in S101, to ensure the latency-critical programs do not contend for CPU time, the initial state of the system places each latency-critical program on a different core, with the programs on all cores sharing the LLC.
Here, "multiple latency-critical programs" means two or more latency-critical programs.
As one or more embodiments, S101 further comprises:
assuming the LLC has N ways in total, reserving M ways as spare space and distributing the remaining N−M ways evenly among all latency-critical programs, where N and M are positive integers.
Illustratively, each latency-critical program is placed in a different CLOS and isolated by CAT, reducing interference between latency-critical programs.
For example, assuming the system's LLC has N ways, M cache ways are reserved for CLOS#1 as spare space, and the remaining N−M ways are distributed evenly among all latency-critical programs.
Illustratively, with two latency-critical programs, the LLC ways are allocated from low to high addresses, since CAT supports only contiguous partitions of the LLC space. Latency-critical program 1 occupies the space represented by CLOS#0, equal in size to (N−M)/2 ways. The M reserved ways are isolated and defined as CLOS#1, a spare space whose ways begin immediately after the last way of CLOS#0. The remaining LLC ways are allocated to latency-critical program 2, and that space is defined as CLOS#2.
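The initial layout above can be sketched as follows. This is a minimal sketch of the two-program example, assuming an even (N−M)/2-way split per program (the exact per-program way count in the original text was lost to a figure and is reconstructed here from the even-split rule):

```python
# Hedged sketch of the example initial partition: N LLC ways, M reserved as
# spare space (CLOS#1), and the remaining N-M ways split evenly between two
# latency-critical programs (CLOS#0 and CLOS#2), contiguously from way 0 up.
def initial_partition(n_ways, m_spare):
    """Return (first_way, num_ways) for CLOS#0, CLOS#1 and CLOS#2."""
    per_prog = (n_ways - m_spare) // 2          # even split between 2 programs
    clos0 = (0, per_prog)                       # program 1: lowest ways
    clos1 = (per_prog, m_spare)                 # spare space after CLOS#0
    clos2 = (per_prog + m_spare, n_ways - m_spare - per_prog)  # program 2
    return clos0, clos1, clos2
```

For example, with a 20-way LLC and 4 spare ways, each program initially receives 8 contiguous ways.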
As one or more embodiments, S102 divides each latency-critical program into program phases as follows:
a hardware counter counts retired instructions, and a new program phase begins each time a set number of instructions has been executed.
As one or more embodiments, S102 divides each program phase into program intervals as follows:
conditional branch instructions are counted, and an interrupt is triggered after every X conditional branch instructions have executed;
that is, every X conditional branch instructions form one program interval; another hardware counter records the total number of instructions executed during that interval, and X is a positive integer.
Illustratively, each program phase is subdivided into program intervals using different sampling periods.
It should be understood that in S102 each latency-critical program is divided into program phases containing a fixed number of instructions; to capture program performance information better, the present application introduces a two-level phase-detection method.
Phase division: during a program's execution its performance indicators (e.g., IPC) change over time; program segments belonging to the same phase have similar performance indicators, while segments belonging to different phases differ, so a program can be divided into phases according to its performance indicators. The method divides the program into phases of a fixed instruction count and then classifies the runtime phases by their IPC; the fixed instruction count may be, for example, 10 million, 100 million, or 1 billion instructions.
Interval division: to obtain more detailed phase information at runtime, the method subdivides each program phase into intervals. To reduce sampling overhead and information loss, performance data are sampled once every X conditional branch instructions. The sampling period can be chosen to suit the situation, for example 100M or 200M.
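The two-level division can be illustrated with a small software analogue. This is a hedged sketch, not the patent's hardware mechanism: intervals are represented simply by their per-interval instruction counts (one interval per X conditional branches), and a new phase starts once the running instruction total crosses the fixed phase size:

```python
# Hedged sketch: group consecutive sampling intervals into program phases
# of approximately phase_size instructions each.
def split_into_phases(interval_instr_counts, phase_size):
    """Return a list of phases, each a list of per-interval instruction counts."""
    phases, current, total = [], [], 0
    for n in interval_instr_counts:
        current.append(n)
        total += n
        if total >= phase_size:        # phase boundary reached
            phases.append(current)
            current, total = [], 0
    if current:                        # trailing partial phase
        phases.append(current)
    return phases
```

In hardware the same boundaries come from counter-overflow interrupts rather than from post-processing a list, but the grouping logic is the same.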
As one or more embodiments, S103 samples one program interval in each program phase of each latency-critical program with hardware performance counters as follows:
while the latency-critical programs run together, a hardware performance counter samples one program interval in each program phase of each program, obtaining the performance indicators: instructions per cycle (IPC), the number of LLC misses, the number of LLC hits, and the number of LLC references.
The interval indicators MPKI_LLC and HPKI_LLC are then computed from the sampled LLC miss count, LLC hit count, and instruction count:
MPKI_LLC = (Num_Miss / Num_Ins) × 1000
HPKI_LLC = (Num_Hit / Num_Ins) × 1000
where Num_Miss is the number of LLC misses, Num_Ins the number of instructions executed in the interval, and Num_Hit the number of LLC hits.
It should be understood that, given the performance indicators of all intervals in a program phase, their averages are taken as the performance indicators of the phase in which the latency-critical program finds itself, and are used to analyze the phase behavior of the latency-critical program.
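The per-interval indicators and the phase-level averaging can be sketched directly from the formulas above. The counter values in the test are illustrative sample data, not measurements from the patent:

```python
# Hedged sketch of the per-interval indicators and the phase-level averages.
def interval_indicators(instructions, cycles, llc_misses, llc_hits):
    """IPC, MPKI_LLC and HPKI_LLC for one sampled program interval."""
    ipc = instructions / cycles
    mpki = llc_misses / instructions * 1000.0   # misses per kilo-instruction
    hpki = llc_hits / instructions * 1000.0     # hits per kilo-instruction
    return ipc, mpki, hpki

def phase_indicators(samples):
    """Average the indicators of the n intervals making up one phase."""
    n = len(samples)
    ipcs, mpkis, hpkis = zip(*(interval_indicators(*s) for s in samples))
    return sum(ipcs) / n, sum(mpkis) / n, sum(hpkis) / n
```

Each sample is a tuple (instructions, cycles, LLC misses, LLC hits) read from the hardware counters for one interval.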
As one or more embodiments, S103 computes the first, second and third actual performance data of each program phase from the sampled data as follows:
the first actual performance data is computed from the sampled data and refers to the phase's average instructions per cycle (IPC);
the second actual performance data is computed from the sampled data and refers to the phase's average number of LLC misses per thousand instructions, MPKI_LLC;
the third actual performance data is computed from the sampled data and refers to the phase's average number of LLC hits per thousand instructions, HPKI_LLC.
Illustratively, the phase's IPC, MPKI_LLC (misses per thousand instructions at the LLC) and HPKI_LLC (hits per thousand instructions at the LLC) are computed from the data sampled in each interval of the phase. The averages are computed as:
IPC = (IPC_1 + IPC_2 + … + IPC_n) / n
MPKI_LLC = (MPKI_LLC,1 + MPKI_LLC,2 + … + MPKI_LLC,n) / n
HPKI_LLC = (HPKI_LLC,1 + HPKI_LLC,2 + … + HPKI_LLC,n) / n
where IPC_1 is the IPC indicator of the first interval, n the number of intervals in the program phase, MPKI_LLC,1 the MPKI indicator of the first interval, and HPKI_LLC,1 the HPKI indicator of the first interval.
As one or more embodiments, S103 classifies the phase type of each program phase from the first actual performance data; specifically, program phases are divided into three types, A, B and C, according to the phase's average IPC:
where α is a first set threshold and β is a second set threshold.
As one or more embodiments, S103 classifies the performance of each program phase from the second and third actual performance data; specifically, program phases are divided into three performance types, a, b and c, according to MPKI_LLC and HPKI_LLC:
where η is a third set threshold, γ is a fourth set threshold, and η < γ.
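The patent's exact threshold expressions appear only as figures and are not reproduced in this text, so the following sketch is a hypothetical reading, not the published rules: it assumes the A/B/C phase types order phases by increasing IPC, and that the a/b/c performance types order phases by decreasing LLC activity. Every comparison below is an assumption:

```python
# Hypothetical classifier: the actual threshold expressions are figures that
# did not survive extraction, so these comparisons are an assumed
# interpretation built only from the named thresholds.
def classify_phase_type(ipc, alpha, beta):
    """Map a phase's average IPC to type 'A', 'B' or 'C' (assumed ordering)."""
    if ipc < alpha:
        return "A"          # assumed: low-IPC phase
    if ipc < beta:
        return "B"          # assumed: medium-IPC phase
    return "C"              # assumed: high-IPC phase

def classify_performance(mpki, hpki, eta, gamma):
    """Map MPKI_LLC/HPKI_LLC to type 'a', 'b' or 'c' (assumed ordering)."""
    if mpki >= gamma:
        return "a"          # assumed: many LLC misses, may need more cache
    if mpki >= eta or hpki >= eta:
        return "b"          # assumed: moderate LLC activity
    return "c"              # assumed: little LLC activity, cache-insensitive
```

Only the thresholds α, β, η, γ and the type labels come from the patent; the direction of each comparison is a guess for illustration.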
As one or more embodiments, S104 dynamically adjusts, at runtime, the cache space occupied by each latency-critical program according to the phase type and performance type of each of its program phases, as follows:
judge the phase type and performance type of each program phase of each latency-critical program;
if the phase type is A and the performance type is a or b, it is preliminarily judged that the phase needs more cache space; while the cache space is being increased, if the phase type does not change, the increase is stopped immediately and the added cache space is taken back; if the phase type changes to B or C, the modification is kept;
if the phase type is A and the performance type is c, it is preliminarily judged that the phase's cache space should be reduced; if, after one cache way is taken away, the phase type does not change, the reduction continues; otherwise the modification is rolled back;
if the phase type is B and the performance type is b, the cache space occupied by the phase is left unchanged;
if the phase type is B and the performance type is a or c, it is preliminarily judged that the phase's cache space should be reduced; if, after one cache way is taken away, the phase type does not change to A, the reduction continues; otherwise the modification is rolled back;
if the phase type is C and the performance type is a or c, it is judged whether the phase has surplus resources; if, after one cache way is taken away, the phase type does not change to A or B, the reduction continues; otherwise the modification is rolled back;
if the phase type is C and the performance type is b, the cache space occupied by the phase is left unchanged.
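The decision table above can be condensed into a small sketch. This is a hedged illustration, assuming one CAT way is added or removed per step; the rollback condition for the A/c case is partly lost to a figure in the original and is reconstructed by analogy with the B and C cases:

```python
# Hedged sketch of S104's decision table: which action to try for a phase,
# given its (phase_type, perf_type) pair. Rollback checks happen separately,
# after the one-way change, by re-observing the phase type.
def adjust_decision(phase_type, perf_type):
    if phase_type == "A" and perf_type in ("a", "b"):
        return "grow"        # keep growth only if phase type becomes B or C
    if phase_type == "A" and perf_type == "c":
        return "shrink"      # roll back if the phase type changes (assumed)
    if (phase_type, perf_type) in (("B", "b"), ("C", "b")):
        return "keep"        # leave the phase's cache allocation unchanged
    if phase_type == "B":    # performance type a or c
        return "shrink"      # roll back if the phase type degrades to A
    if phase_type == "C":    # performance type a or c: possible surplus
        return "shrink"      # roll back if phase type degrades to A or B
    return "keep"
```

In other words, only low-performing, cache-active phases (A with a or b) are granted more ways; everything else either holds steady or probes downward one way at a time with a rollback guard.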
As one or more embodiments, the method further comprises:
obtaining the LLC's resource usage for dynamic management;
if CLOS#1 has free space, space is taken from CLOS#1; if the cache space in CLOS#1 is fully allocated, it is judged whether the program adjacent in physical address is in a resource-surplus state; if so, the surplus space is redistributed according to the adjacent program's performance status; if there is no resource surplus, the program keeps waiting for free space; and if no free space appears for a long time, data migration is performed.
The phase performance data of each program are recorded in a historical phase & performance table (HPPT), which stores each program's phase information and performance information. The application dynamically adjusts the cache occupied by the current phase according to the running program's phase behavior and its runtime performance information.
The cache space occupied by a latency-critical program is adjusted dynamically according to its phase, MPKI_LLC and HPKI_LLC.
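The patent names the HPPT but does not describe its layout, so the following sketch assumes a simple keyed record store; all field names are invented for illustration:

```python
# Hedged sketch of a historical phase & performance table (HPPT). The patent
# only says it stores each program's phase and performance information, so
# this record layout is an assumption.
class HPPT:
    def __init__(self):
        self.table = {}   # (program_id, phase_id) -> indicator record

    def record(self, program_id, phase_id, ipc, mpki, hpki,
               phase_type, perf_type):
        self.table[(program_id, phase_id)] = {
            "ipc": ipc, "mpki_llc": mpki, "hpki_llc": hpki,
            "phase_type": phase_type, "perf_type": perf_type,
        }

    def lookup(self, program_id, phase_id):
        """Return the stored record for a phase, or None if never seen."""
        return self.table.get((program_id, phase_id))
```

A lookup hit lets the adjuster reuse a previously classified phase instead of re-measuring it from scratch.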
Fig. 1 depicts the resource-partitioning method. For each program to be executed, program performance information is obtained with hardware performance counters.
Fig. 2 depicts the program-phase performance-analysis method.
For each program to be executed, the IPC indicator is used to analyze the program's runtime phase, and the MPKI_LLC and HPKI_LLC indicators are used to analyze the program's performance; the cache space the program occupies is then adjusted dynamically.
Example two
The embodiment provides a system for ensuring quality of service when multiple latency-critical programs are executed together.
A system for ensuring quality of service when multiple latency-critical programs are executed together comprises:
an initialization module configured to initialize hardware counters and start the latency-critical programs, each pinned to its own core, with the programs on all cores sharing the last-level cache (LLC) space;
a phase-division module configured to divide each latency-critical program into program phases, and each program phase into program intervals;
a classification module configured to sample, while the latency-critical programs run together, one program interval in each program phase of each program with hardware performance counters; compute first, second and third actual performance data for each phase from the sampled data; classify the phase type of each phase from the first actual performance data; and classify the performance of each phase from the second and third actual performance data;
a dynamic adjustment module configured to dynamically adjust, at runtime, the cache space occupied by each latency-critical program according to the phase type and performance type of each of its program phases.
It should be noted here that the initialization module, phase-division module, classification module and dynamic adjustment module correspond to steps S101 to S104 of the first embodiment; the modules match the corresponding steps in their implementation examples and application scenarios, but are not limited to the disclosure of the first embodiment. The modules described above, as part of a system, may be implemented in a computer system such as a set of computer-executable instructions.
In the foregoing embodiments, the descriptions of the embodiments have different emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The proposed system can be implemented in other ways. For example, the above-described system embodiments are merely illustrative, and for example, the division of the above-described modules is merely a logical division, and in actual implementation, there may be other divisions, for example, multiple modules may be combined or integrated into another system, or some features may be omitted, or not executed.
EXAMPLE III
The present embodiment also provides an electronic device comprising one or more processors, one or more memories, and one or more computer programs, wherein a processor is connected to the memory and the one or more computer programs are stored in the memory; when the electronic device runs, the processor executes the one or more computer programs stored in the memory, causing the electronic device to perform the method of the first embodiment.
It should be understood that in this embodiment the processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. A general-purpose processor may be a microprocessor or any conventional processor.
The memory may include both read-only memory and random-access memory and provides instructions and data to the processor; a portion of the memory may also include non-volatile random-access memory. For example, the memory may also store device-type information.
In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in the processor or by instructions in the form of software.
The method of the first embodiment may be implemented directly by a hardware processor, or by a combination of hardware and software modules in the processor. The software modules may reside in RAM, flash memory, ROM, PROM, EPROM, registers, or other storage media well known in the art. The storage medium is located in a memory, and the processor reads the information in the memory and completes the steps of the method in combination with its hardware. To avoid repetition, details are not repeated here.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
Example four
The present embodiments also provide a computer-readable storage medium for storing computer instructions, which when executed by a processor, perform the method of the first embodiment.
The above description gives only preferred embodiments of the present application and is not intended to limit the application; various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present application shall fall within its scope of protection.
Claims (8)
1. A method for ensuring quality of service when multiple latency-critical programs are executed together, comprising:
initializing hardware counters and starting the latency-critical programs; each latency-critical program is pinned to its own core, and the programs on all cores share the last-level cache (LLC) space;
dividing each latency-critical program into program phases, and dividing each program phase into program intervals;
sampling a program interval in each program phase of each delay key program by using a hardware performance counter in the process that a plurality of delay key programs are operated together; calculating first, second and third actual performance data for each program phase from the sampled data; classifying the phase types of the corresponding program phases according to the first actual performance data; classifying the performance of the program phase according to the second and third actual performance data; calculating first, second and third actual performance data for each program phase from the sampled data; the method comprises the following specific steps: calculating first actual performance data of each program stage according to the sampling data; the first actual performance data refers to: IPC average of number of instructions per cycle; calculating second actual performance data of each program stage according to the sampling data; second actual performance data, refer to: mean MPKI of missed instruction count per thousand instructions in LLC LLC (ii) a Calculating third actual performance data of each program stage according to the sampling data; the third actual performance data refers to: average HPKI of hit instruction count per thousand instructions on LLC LLC (ii) a Classifying the performance of the program phase according to the second and third actual performance data; the method comprises the following specific steps: according toProgram phase types are divided into 3 classes:
wherein α is the first set threshold and β is the second set threshold;
or, alternatively,
classifying the performance of the program phase according to the second and third actual performance data, specifically [formula omitted in the source]:
wherein η is the third set threshold, γ is the fourth set threshold, and η < γ;
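The two classifications above can be sketched in Python. The threshold formulas appear only as images in the source, so the comparison directions (low IPC yielding type A, and η and γ applying to MPKI_LLC and HPKI_LLC respectively) are assumptions, not the patent's stated rules:

```python
def classify_phase_type(ipc_avg: float, alpha: float, beta: float) -> str:
    """Phase type from average IPC (first actual performance data).

    The source formula is an image; this mapping (low IPC -> "A",
    high IPC -> "C", with alpha < beta) is an assumed reading.
    """
    if ipc_avg < alpha:
        return "A"   # assumed: low IPC, phase may need more cache
    if ipc_avg < beta:
        return "B"   # assumed: intermediate IPC
    return "C"       # assumed: high IPC, possible resource surplus


def classify_performance_type(mpki_llc: float, hpki_llc: float,
                              eta: float, gamma: float) -> str:
    """Performance type from LLC misses/hits per thousand instructions.

    Also an assumed reading: many misses -> "a" (cache-starved),
    many hits with few misses -> "b" (cache well used), low LLC
    activity overall -> "c".  The source only states eta < gamma.
    """
    if mpki_llc >= gamma:
        return "a"
    if hpki_llc >= eta:
        return "b"
    return "c"
```

With thresholds α = 0.8, β = 1.6, η = 2, γ = 10, a phase averaging 1.0 IPC with 12 MPKI_LLC would classify as type B, performance a.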
dynamically adjusting, during execution, the cache space occupied by each latency-critical program according to the phase type and the performance type of each of its program phases, specifically comprising:
determining the phase type and the performance type of each program phase of each latency-critical program;
if the phase type is A and the performance type is a or b, preliminarily determining that the program phase needs more cache space; while increasing the cache space, if the phase type does not change, immediately stopping the increase and returning the added cache space; if the phase type changes to B or C, keeping the modification;
if the phase type is A and the performance type is c, preliminarily determining that the cache space of the program phase can be reduced; if, after reducing one way of cache space, the phase type does not change [condition given as a formula in the source], continuing to reduce the cache space; otherwise, reverting the modification;
if the phase type is B and the performance type is b, keeping the cache space occupied by the program phase unchanged;
if the phase type is B and the performance type is a or c, preliminarily determining that the cache space of the program phase can be reduced; if, after reducing one way of cache space, the phase type does not change to A, continuing to reduce the cache space; otherwise, reverting the modification;
if the phase type is C and the performance type is a or c, determining whether the program phase has surplus resources; if, after reducing one way of cache space, the phase type does not change to A or B, continuing to reduce the cache space; otherwise, reverting the modification;
if the phase type is C and the performance type is b, keeping the cache space occupied by the program phase unchanged;
and dynamically adjusting, during execution, the cache space occupied by each latency-critical program according to the phase type and the performance type of each of its program phases.
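The six adjustment rules above can be condensed into a decision table plus a trial-and-revert step. This is a sketch under our own naming; the `reclassify` callback stands in for re-sampling the hardware counters after a trial change, and the revert condition for the type-A shrink case (a formula image in the source) is left out:

```python
def way_action(phase_type: str, perf_type: str) -> str:
    """Initial decision of claim 1: 'grow', 'shrink' or 'keep' cache ways."""
    if phase_type == "A":
        return "grow" if perf_type in ("a", "b") else "shrink"
    if phase_type == "B":
        return "keep" if perf_type == "b" else "shrink"
    return "keep" if perf_type == "b" else "shrink"   # phase_type == "C"


def try_shrink_one_way(ways: int, phase_type: str, reclassify) -> int:
    """Trial-and-revert step for the B- and C-type shrink rules: drop one
    way, re-run the phase-type classification on the reduced allocation,
    and revert if the phase degraded (B -> A, or C -> A/B)."""
    new_type = reclassify(ways - 1)
    degraded = (phase_type == "B" and new_type == "A") or \
               (phase_type == "C" and new_type in ("A", "B"))
    return ways if degraded else ways - 1
```

In a full implementation the loop would keep calling `try_shrink_one_way` until the revert fires, matching the claim's "continue to reduce ... otherwise revert" wording.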
2. The method of claim 1, wherein initializing the hardware counter and starting the plurality of latency-critical programs further comprises:
assuming the cache space LLC has N ways in total, reserving M ways as spare space and evenly distributing the remaining N-M ways among all latency-critical programs, N and M both being positive integers.
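A minimal sketch of this initialization, assuming (our choice, not stated in the claim) that any remainder of an uneven division stays in the spare pool:

```python
def initial_way_allocation(n_ways: int, m_reserved: int, programs: list):
    """Reserve M of the N LLC ways and split the remaining N - M ways
    evenly among the latency-critical programs (claim 2)."""
    assignable = n_ways - m_reserved
    per_program = assignable // len(programs)
    allocation = {name: per_program for name in programs}
    # ways left over from integer division stay in the spare pool
    spare = n_ways - per_program * len(programs)
    return allocation, spare
```

On real hardware the per-program way counts would be applied as way masks, e.g. through a cache-partitioning facility such as Intel CAT.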
3. The method of claim 1, wherein dividing each latency-critical program into a plurality of program phases specifically comprises:
counting instructions with a counter, and delimiting a program phase each time a set number of instructions has been executed.
4. The method of claim 1, wherein dividing each program phase into a plurality of program intervals specifically comprises:
counting conditional branch instructions and triggering an interrupt after every X conditional branch instructions have been executed;
that is, every X conditional branch instructions form one program interval; another hardware counter records the total number of instructions executed during the interval, X being a positive integer.
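Simulated on an instruction trace (a real implementation would use a hardware counter overflow interrupt rather than software counting), the interval split of claim 4 looks like this; representing the trace as booleans marking conditional branches is our simplification:

```python
def split_into_intervals(instr_stream, x: int):
    """Split an instruction trace into intervals of X conditional branches
    (claim 4).  instr_stream yields True for each conditional branch and
    False otherwise.  Returns the instruction count of each completed
    interval, mirroring the second hardware counter that records the total
    instructions executed per interval."""
    intervals, branches, instrs = [], 0, 0
    for is_cond_branch in instr_stream:
        instrs += 1
        if is_cond_branch:
            branches += 1
            if branches == x:          # the interrupt would fire here
                intervals.append(instrs)
                branches = instrs = 0
    return intervals
```

A trailing partial interval (fewer than X branches at program end) is simply dropped here; the claim does not specify its handling.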
5. The method of claim 1, wherein sampling a program interval in each program phase of each latency-critical program with a hardware performance counter, while the plurality of latency-critical programs run together, specifically comprises:
sampling the program interval of each program phase of each latency-critical program with a hardware performance counter to obtain the performance indicators: instructions per cycle (IPC), number of LLC misses, number of LLC hits, and number of LLC references.
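From those raw counters, the three phase-level metrics of claim 1 follow directly. A sketch with our own field names (the LLC reference count is sampled but not needed for these three metrics):

```python
def phase_metrics(samples):
    """Aggregate per-interval hardware-counter samples into the three
    phase-level metrics of claim 1: average IPC, MPKI_LLC and HPKI_LLC.
    Each sample is a dict of raw counts: instructions, cycles,
    llc_misses, llc_hits (field names are ours)."""
    instr = sum(s["instructions"] for s in samples)
    cycles = sum(s["cycles"] for s in samples)
    misses = sum(s["llc_misses"] for s in samples)
    hits = sum(s["llc_hits"] for s in samples)
    ipc = instr / cycles                 # average instructions per cycle
    mpki = misses * 1000 / instr         # LLC misses per kilo-instruction
    hpki = hits * 1000 / instr           # LLC hits per kilo-instruction
    return ipc, mpki, hpki
```

Summing the raw counts before dividing weights each interval by its instruction count, which is one reasonable reading of "average" in the claim.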
6. A system for ensuring quality of service when a plurality of latency-critical programs are executed together, comprising:
an initialization module configured to: initialize a hardware counter and start a plurality of latency-critical programs, each latency-critical program being pinned to a corresponding core, the latency-critical programs on the cores sharing the last-level cache (LLC) space;
a staging module configured to: divide each latency-critical program into a plurality of program phases, and divide each program phase into a plurality of program intervals;
a classification module configured to: sample a program interval in each program phase of each latency-critical program, using a hardware performance counter, while the plurality of latency-critical programs run together; calculate first, second and third actual performance data for each program phase from the sampled data, the first being the average instructions per cycle (IPC), the second the average number of LLC misses per thousand instructions (MPKI_LLC), and the third the average number of LLC hits per thousand instructions (HPKI_LLC); classify the phase type of the corresponding program phase according to the first actual performance data, dividing program phases into 3 classes [formula omitted in the source], wherein α is the first set threshold and β is the second set threshold;
or, alternatively,
classify the performance of the program phase according to the second and third actual performance data [formula omitted in the source], wherein η is the third set threshold, γ is the fourth set threshold, and η < γ;
dynamically adjust, during execution, the cache space occupied by each latency-critical program according to the phase type and the performance type of each of its program phases, specifically:
determine the phase type and the performance type of each program phase of each latency-critical program;
if the phase type is A and the performance type is a or b, preliminarily determine that the program phase needs more cache space; while increasing the cache space, if the phase type does not change, immediately stop the increase and return the added cache space; if the phase type changes to B or C, keep the modification;
if the phase type is A and the performance type is c, preliminarily determine that the cache space of the program phase can be reduced; if, after reducing one way of cache space, the phase type does not change [condition given as a formula in the source], continue to reduce the cache space; otherwise, revert the modification;
if the phase type is B and the performance type is b, keep the cache space occupied by the program phase unchanged;
if the phase type is B and the performance type is a or c, preliminarily determine that the cache space of the program phase can be reduced; if, after reducing one way of cache space, the phase type does not change to A, continue to reduce the cache space; otherwise, revert the modification;
if the phase type is C and the performance type is a or c, determine whether the program phase has surplus resources; if, after reducing one way of cache space, the phase type does not change to A or B, continue to reduce the cache space; otherwise, revert the modification;
if the phase type is C and the performance type is b, keep the cache space occupied by the program phase unchanged; and
a dynamic adjustment module configured to: dynamically adjust, during execution, the cache space occupied by each latency-critical program according to the phase type and the performance type of each of its program phases.
7. An electronic device, comprising one or more processors, one or more memories, and one or more computer programs, wherein a processor is connected to a memory, the one or more computer programs are stored in the memory, and when the electronic device runs, the processor executes the one or more computer programs stored in the memory to cause the electronic device to perform the method of any one of claims 1-5.
8. A computer-readable storage medium storing computer instructions which, when executed by a processor, perform the method of any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011465046.2A CN112540934B (en) | 2020-12-14 | 2020-12-14 | Method and system for ensuring service quality when multiple delay key programs are executed together |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112540934A CN112540934A (en) | 2021-03-23 |
CN112540934B true CN112540934B (en) | 2022-07-29 |
Family
ID=75018579
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011465046.2A Active CN112540934B (en) | 2020-12-14 | 2020-12-14 | Method and system for ensuring service quality when multiple delay key programs are executed together |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112540934B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113821324B (en) * | 2021-09-17 | 2022-08-09 | 海光信息技术股份有限公司 | Cache system, method, apparatus and computer medium for processor |
Family Cites Families (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9244732B2 (en) * | 2009-08-28 | 2016-01-26 | Vmware, Inc. | Compensating threads for microarchitectural resource contentions by prioritizing scheduling and execution |
CN101916230A (en) * | 2010-08-11 | 2010-12-15 | 中国科学技术大学苏州研究院 | Partitioning and thread-aware based performance optimization method of last level cache (LLC) |
US9401869B1 (en) * | 2012-06-04 | 2016-07-26 | Google Inc. | System and methods for sharing memory subsystem resources among datacenter applications |
US10554505B2 (en) * | 2012-09-28 | 2020-02-04 | Intel Corporation | Managing data center resources to achieve a quality of service |
CN103077128B (en) * | 2012-12-29 | 2015-09-23 | 华中科技大学 | Shared buffer memory method for dynamically partitioning under a kind of multi-core environment |
CN103235764B (en) * | 2013-04-11 | 2016-01-20 | 浙江大学 | Thread aware multinuclear data pre-fetching self-regulated method |
CN104572493A (en) * | 2013-10-23 | 2015-04-29 | 华为技术有限公司 | Memory resource optimization method and device |
US9626295B2 (en) * | 2015-07-23 | 2017-04-18 | Qualcomm Incorporated | Systems and methods for scheduling tasks in a heterogeneous processor cluster architecture using cache demand monitoring |
CN107463510B (en) * | 2017-08-21 | 2020-05-08 | 北京工业大学 | High-performance heterogeneous multi-core shared cache buffer management method |
CN110618872B (en) * | 2019-09-25 | 2022-04-15 | 山东师范大学 | Hybrid memory dynamic scheduling method and system |
CN111258927B (en) * | 2019-11-13 | 2022-05-03 | 北京大学 | Application program CPU last-level cache miss rate curve prediction method based on sampling |
CN112000465B (en) * | 2020-07-21 | 2023-02-03 | 山东师范大学 | Method and system for reducing performance interference of delay sensitive program in data center environment |
- 2020-12-14: CN application CN202011465046.2A filed, granted as patent CN112540934B (en), status Active
Similar Documents
Publication | Publication Date | Title |
---|---|---|
KR102456085B1 (en) | Dynamic memory remapping to reduce row buffer collisions | |
US6662272B2 (en) | Dynamic cache partitioning | |
US7725657B2 (en) | Dynamic quality of service (QoS) for a shared cache | |
US7899994B2 (en) | Providing quality of service (QoS) for cache architectures using priority information | |
US8190795B2 (en) | Memory buffer allocation device and computer readable medium having stored thereon memory buffer allocation program | |
US7103735B2 (en) | Methods and apparatus to process cache allocation requests based on priority | |
US9223712B2 (en) | Data cache method, device, and system in a multi-node system | |
CN108845960B (en) | Memory resource optimization method and device | |
US7185167B2 (en) | Heap allocation | |
US20080235487A1 (en) | Applying quality of service (QoS) to a translation lookaside buffer (TLB) | |
US20050125613A1 (en) | Reconfigurable trace cache | |
KR101356033B1 (en) | Hybrid Main Memory System and Task Scheduling Method therefor | |
JP3727887B2 (en) | Shared register file control method in multi-thread processor | |
US20120079494A1 (en) | System And Method For Maximizing Data Processing Throughput Via Application Load Adaptive Scheduling And Content Switching | |
US10725940B2 (en) | Reallocate memory pending queue based on stall | |
US8769201B2 (en) | Technique for controlling computing resources | |
CN106294192B (en) | Memory allocation method, memory allocation device and server | |
CN112540934B (en) | Method and system for ensuring service quality when multiple delay key programs are executed together | |
US9189279B2 (en) | Assignment method and multi-core processor system | |
Li | Orchestrating thread scheduling and cache management to improve memory system throughput in throughput processors | |
CN112579277B (en) | Central processing unit, method, device and storage medium for simultaneous multithreading | |
Ikeda et al. | Application aware DRAM bank partitioning in CMP | |
US20240061780A1 (en) | Systems and methods for memory bandwidth allocation | |
CN112506660A (en) | Method and device for optimizing memory of audio/video codec and storage medium | |
CN114780249A (en) | Cache management method, system, device and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||