CN113780336B - Lightweight cache dividing method and device based on machine learning - Google Patents
- Publication number
- CN113780336B CN113780336B CN202110851952.4A CN202110851952A CN113780336B CN 113780336 B CN113780336 B CN 113780336B CN 202110851952 A CN202110851952 A CN 202110851952A CN 113780336 B CN113780336 B CN 113780336B
- Authority
- CN
- China
- Prior art keywords
- cache
- program
- programs
- scheme
- dividing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2415—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
- G06F18/24155—Bayesian classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/10—Machine learning using kernel methods, e.g. support vector machines [SVM]
Abstract
The invention discloses a lightweight cache dividing method and device based on machine learning. First, a support vector machine model classifies programs according to the degree to which they interfere with other programs when occupying the cache and their sensitivity to the cache dividing size. Second, a Bayesian optimization algorithm performs LLC resource scheduling among the divided program categories, searching for the LLC division scheme that maximizes system throughput; finally, cache division is performed according to the LLC division scheme that yields the highest performance. The technical scheme of the invention reduces scheduling overhead, can rapidly produce a cache dividing scheme, and improves the throughput of the whole system.
Description
Technical Field
The application belongs to the technical field of server cache partitioning, and particularly relates to a lightweight cache partitioning method and device based on machine learning.
Background
The Last Level Cache (LLC) of currently mainstream desktop-level multi-core processors typically employs a sharing mechanism: the applications running on the processor share the LLC without restriction. Traditional cache replacement policies only consider whether data has been accessed and ignore the relationship between the data and a specific application. This means that program data already stored in the LLC may be evicted because of data accesses by other programs. When the evicted program data is accessed again, a cache miss occurs, creating conflicts among concurrent programs at the LLC. Such conflicts necessarily reduce program performance and affect the overall throughput of the system.
The prior art attempts to partition cache resources among concurrent applications to improve LLC performance. The basic idea of these works is to isolate the LLC resources used by concurrently running applications, thereby mitigating contention and mutual interference among concurrent applications at the cache level. There are two ways to implement LLC resource partitioning: one requires complex hardware support, while the other is purely software-based.
Hardware techniques typically rely on complex hardware designs and are not general-purpose. Page coloring, used by software techniques, requires no hardware support but has its own drawbacks. For example, page coloring is not compatible with the huge-page mechanism, because huge pages require a large number of consecutive base pages in virtual and physical memory, causing all available page colors to be occupied. To address this, Intel proposed a coarse-grained cache partitioning technique, Cache Allocation Technology (CAT). The advantage of CAT is the ability to rapidly and dynamically partition the LLC on most Intel commercial CPUs. However, the smallest partition unit of CAT is a cache way, which is a coarse-grained unit. This may result in a program being allocated too many or too few LLC resources, degrading the performance of the program's working set.
There are various prior-art methods for LLC partitioning using CAT. Some partitioning methods rely on detailed performance data collected by performance collectors. For example, an LLC partitioning scheme may be generated based on the cache miss rate of each application in the workload. These methods require monitoring the change in cache miss rate of each application and implementing fine-grained cache partitioning; such real-time monitoring imposes substantial additional analysis overhead on the CPU. Some researchers have applied heuristic algorithms to LLC partitioning and achieved notable partitioning performance. However, heuristic algorithms require continuous "trial and error", i.e., continuous LLC scheduling, to explore a wide search space, possibly resulting in high scheduling overhead. Therefore, a new low-overhead, high-performance cache dividing method needs to be designed.
Disclosure of Invention
The purpose of the application is to provide a lightweight cache dividing method and device based on machine learning, so as to reduce scheduling overhead, rapidly provide a cache dividing scheme and improve the throughput of the whole system.
In order to achieve the above purpose, the technical scheme of the application is as follows:
a lightweight cache dividing method based on machine learning comprises the following steps:
constructing and training a first support vector machine model for distinguishing a strong interference program from a non-strong interference program, and constructing and training a second support vector machine model for distinguishing a cache sensitive program from a cache insensitive program;
classifying programs in the working set by adopting a trained first support vector machine model and a trained second support vector machine model;
the method comprises the steps of taking the number of cache ways occupied by each of a strong interference program, a cache sensitive program and a cache insensitive program as a cache dividing scheme, taking the average weighted acceleration ratio of all application programs in a working set as a black box function of the relation between the cache dividing scheme and system throughput, and predicting the optimal cache dividing scheme by adopting Bayesian optimization.
Further, the black box function f(x) is expressed as:

f(x) = (1/n) · Σ_{i=1..n} (IPC_shared_i / IPC_alone_i)

wherein IPC_shared_i is the number of instructions per clock cycle when the i-th program runs in parallel with other programs, IPC_alone_i is the IPC of the i-th program when running alone, and n is the number of programs in the working set.
Further, predicting the optimal cache dividing scheme by adopting Bayesian optimization includes:
initializing three cache dividing schemes, calculating corresponding black box function values, and starting iteration;
updating the Gaussian process model according to the existing cache dividing scheme and the corresponding black box function value;
adopting the Gaussian process gain expectation as the acquisition function, generating the next cache dividing scheme to be explored as a sample, and calculating the corresponding black box function value;
and when the preset iteration termination condition is reached, terminating the iteration, outputting a final cache division scheme, and otherwise, returning to continue the iteration.
Further, the initializing three cache dividing schemes includes:
when the number of cache ways is M, for the three classes of programs, namely strong interference programs, cache sensitive programs and cache insensitive programs, one class of programs occupies M-2 cache ways while the other two classes each occupy 1 way, thereby generating three cache dividing schemes.
Further, the gain expectation is expressed as:

EI(x) = (m(x) − f(x⁺) − ξ) · CDF(z) + σ(x) · PDF(z), if σ(x) > 0; EI(x) = 0, if σ(x) = 0

wherein m(x) and σ(x) are the estimated mean and estimated mean square error, respectively, of the Gaussian process with respect to x, f(x⁺) is the current optimum value, and ξ is a constant that trades off between sampled and non-sampled regions; furthermore, z = (m(x) − f(x⁺) − ξ) / σ(x), and CDF(z) and PDF(z) are the standard normal cumulative distribution function and probability density function of z, respectively.
The application also provides a lightweight cache dividing device based on machine learning, which comprises a processor and a memory storing a plurality of computer instructions, wherein the computer instructions realize the steps of the lightweight cache dividing method based on machine learning when being executed by the processor.
The lightweight cache dividing method and device based on machine learning first classify programs using support vector machine models, according to the degree to which a program interferes with other programs when occupying the cache and its sensitivity to the cache dividing size. Second, a Bayesian optimization algorithm performs LLC resource scheduling among the divided program categories, searching for the LLC division scheme that maximizes system throughput; finally, cache division is performed according to the LLC division scheme that yields the highest performance. The technical scheme of the invention reduces scheduling overhead, can rapidly produce a cache dividing scheme, and improves the throughput of the whole system.
Drawings
FIG. 1 is a flow chart of a lightweight cache partitioning method based on machine learning;
fig. 2 is a flowchart of a bayesian optimization algorithm according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
In recent years, machine learning has been widely applied in fields such as computer vision, natural language processing and automation, because it can realize accurate decisions with low running overhead. Meanwhile, there is a complex internal relationship among program operating parameters (memory bandwidth, LLC capacity, etc.), the cache partitioning scheme, and application performance. Machine learning can successfully build this relational model and accurately infer high-performance partitioning schemes from it. Based on the relational model, the optimal cache partitioning scheme can be inferred by collecting the operating parameters only once as the model input. Thus, applying machine learning to cache partitioning helps reduce analysis overhead.
The lightweight cache dividing method based on machine learning provided by the application, as shown in fig. 1, comprises the following steps:
step S1, a first support vector machine model for distinguishing a strong interference program from a non-strong interference program is constructed and trained, and a second support vector machine model for distinguishing a cache sensitive program from a cache insensitive program is constructed and trained.
In order to rapidly produce an LLC partition scheme and improve the throughput of the whole system, the application provides a lightweight cache partition method based on machine learning. The method first uses support vector machine models to classify application programs, according to the degree to which an application interferes with other applications when occupying the cache and its sensitivity to the cache partition size; second, a Bayesian optimization algorithm performs LLC resource scheduling among the divided program categories, searching for the LLC division scheme that maximizes system throughput; finally, division is performed according to the LLC division scheme that yields the highest performance.
The method constructs support vector machine models, performs offline training with the normalized data, and constructs maximum-geometric-margin hyperplanes for program class division. A Gaussian kernel function is used as the kernel of the support vector machine, so that classes that are not linearly separable can still be separated. The present application applies a binary tree structure to the three-way classification, in which 2 binary support vector machine models are trained, named Classifier-A and Classifier-B respectively. Classifier-A distinguishes strong interference programs from non-strong interference programs, and Classifier-B distinguishes cache sensitive programs from cache insensitive programs.
In this embodiment, n SPEC CPU 2006 and SPEC CPU 2017 applications (n = 4, 5, 6) are randomly selected to form working sets, and 5000 such working sets are collected as the training dataset. Then, 9 operating parameters of each application are sampled using cache monitoring technology (Cache Monitoring Technology, CMT), memory bandwidth monitoring (Memory Bandwidth Monitoring, MBM) and native Linux tools; the specific commands include "pqos -m -p all:pid", "top" and "cat /proc/cpuinfo | grep 'cpu MHz'", etc. The sampled content comprises 9 operating parameters of the application: CPU core frequency, IPC, cache misses, LLC occupancy, local memory bandwidth, memory footprint, virtual memory, resident memory, and the actual number of occupied LLC lines; these operating parameters form the input feature vector of the support vector machine.
In this embodiment, if, when a program runs in parallel with omnetpp, astar and xz, the sum of the average performance degradation of omnetpp, astar and xz exceeds 10%, the program is marked as a strong interference program. If the average performance of a non-strong-interference program running alone rises by more than 10% as its LLC allocation grows from 1 way to all ways, the program is marked as a cache sensitive program; otherwise, it is marked as a cache insensitive program.
In this embodiment, the strong interference programs are first marked according to the performance degradation of the background programs when a program runs in parallel with them; second, cache sensitive and cache insensitive programs are marked based on the average performance improvement of a non-strong-interference program, running alone, as its LLC allocation grows from 1 way to all ways. The dataset is then cleaned: because it contains redundant and interfering information, it must be preprocessed, including standardizing the data format and deleting high-frequency repeated data. For example, if a group of data appears repeatedly, only one copy is retained and the rest are deleted; in addition, samples with missing values are discarded, obvious anomalies are removed, and data whose values are far above or below the normal range are not retained. Finally, the dataset is normalized: the maximum and minimum values of each feature are recorded, and the min-max normalization method converts each feature value into a value between 0 and 1.
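As an illustrative sketch (not code from the patent), the cleaning and min-max normalization steps described above can be rendered in pure Python; the helper names are invented for this example:

```python
def deduplicate(rows):
    """Drop exact duplicate samples, keeping the first occurrence."""
    seen, out = set(), []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

def min_max_normalize(rows):
    """Column-wise min-max scaling of feature rows into [0, 1].

    rows: list of equal-length numeric feature vectors. A column with a
    constant value is mapped to 0.0 to avoid division by zero.
    """
    cols = list(zip(*rows))
    mins = [min(c) for c in cols]
    maxs = [max(c) for c in cols]
    out = []
    for row in rows:
        out.append([
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(row, mins, maxs)
        ])
    return out
```

Each of the 9 sampled operating parameters would form one column of `rows` before training.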
The min-max normalization formula is as follows:

x_normalized = (x_original − min) / (max − min)

wherein min and max are the minimum and maximum values of each feature, x_original is the original feature value, and x_normalized is the feature value after normalization.
Offline training is performed using the normalized data, producing 2 binary support vector machine models named Classifier-A and Classifier-B. Classifier-A distinguishes strong interference programs from non-strong interference programs, and Classifier-B distinguishes cache sensitive programs from cache insensitive programs.
In a specific embodiment, the support vector machine of the present application employs a Gaussian kernel function, whose formula is:

K(x, x') = e^(−‖x − x'‖² / (2σ²))

wherein x is the position of a sample in feature space, x' is the kernel center; σ is the width parameter of the function, controlling its radial range of action; and e is the natural constant.
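A minimal sketch of the Gaussian (RBF) kernel as defined here, using only the standard library; `sigma` corresponds to the width parameter σ:

```python
import math

def gaussian_kernel(x, x_center, sigma=1.0):
    """K(x, x') = exp(-||x - x'||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_center))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))
```

The kernel equals 1.0 when the sample coincides with the center and decays toward 0 with distance; smaller `sigma` narrows the radial range of action.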
And S2, classifying the programs in the working set by adopting the trained first support vector machine model and the trained second support vector machine model.
For a working set that needs cache dividing, this embodiment adopts the trained support vector machine models to classify the programs in the working set into the following categories: strong interference programs, cache sensitive programs and cache insensitive programs.
Specifically, the first support vector machine model, Classifier-A, first classifies a program as a strong interference program or a non-strong interference program; the second support vector machine model, Classifier-B, then further classifies the non-strong interference programs as cache sensitive programs or cache insensitive programs.
When classifying the programs in the working set, similar to training the support vector machine model, the 9 operation parameters of the programs need to be sampled to form the input feature vector of the support vector machine, and then normalization processing and the like are needed, which are not described herein.
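The two-stage binary-tree classification can be sketched as follows; `classifier_a` and `classifier_b` stand in for the trained Classifier-A and Classifier-B decision functions and are assumptions of this sketch, not the patent's models:

```python
def classify_workload(programs, classifier_a, classifier_b):
    """Two-stage, binary-tree classification of a working set.

    programs: dict mapping program name -> normalized feature vector.
    classifier_a(features) -> True if the program is strongly interfering.
    classifier_b(features) -> True if a non-strong-interference program
    is cache sensitive. Returns class name -> list of program names.
    """
    classes = {
        "strong_interference": [],
        "cache_sensitive": [],
        "cache_insensitive": [],
    }
    for name, features in programs.items():
        if classifier_a(features):          # stage 1: Classifier-A
            classes["strong_interference"].append(name)
        elif classifier_b(features):        # stage 2: Classifier-B
            classes["cache_sensitive"].append(name)
        else:
            classes["cache_insensitive"].append(name)
    return classes
```

In practice both arguments would wrap the decision functions of the two trained RBF-kernel SVMs.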
And S3, taking the number of cache ways occupied by each of the three programs of the strong interference program, the cache sensitive program and the cache insensitive program as a cache dividing scheme, taking the average weighted acceleration ratio of all the application programs in the working set as a black box function of the relation between the cache dividing scheme and the system throughput, and predicting the optimal cache dividing scheme by adopting Bayesian optimization.
The relation between the cache dividing scheme and the system throughput is a black box function f(x), wherein x represents the number of cache ways occupied by each of the three classes of programs, namely strong interference, cache sensitive and cache insensitive programs; for example, x = <1,3,5> means that 1/3/5 cache ways are allocated to the strong interference, cache sensitive and cache insensitive programs, respectively. The black box function, the average weighted speedup WS over all applications in the concurrent workload, is as follows:

f(x) = WS = (1/n) · Σ_{i=1..n} (IPC_shared_i / IPC_alone_i)

wherein IPC_shared_i is the instructions per clock cycle (Instructions Per Clock, IPC) of the i-th program when running in parallel with other programs, IPC_alone_i is the IPC of the i-th program when running alone, and n is the number of applications in the working set; the IPC of each application is sampled by the CMT performance monitor.
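A direct rendering of the weighted-speedup black box metric, assuming the per-program IPC values have already been sampled (the function names are invented for this sketch):

```python
def weighted_speedup(ipc_shared, ipc_alone):
    """Average weighted speedup WS = (1/n) * sum_i IPC_shared_i / IPC_alone_i.

    ipc_shared: per-program IPC measured while running concurrently.
    ipc_alone:  per-program IPC measured while running alone.
    """
    assert len(ipc_shared) == len(ipc_alone) and ipc_shared
    n = len(ipc_shared)
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone)) / n
```

A value near 1.0 means the workload runs almost as fast shared as alone; lower values indicate cache contention.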
In the Bayesian optimization initialization, the method initializes three cache dividing schemes, including:
when the number of cache ways is M, for the three classes of programs, namely strong interference programs, cache sensitive programs and cache insensitive programs, one class of programs occupies M-2 cache ways while the other two classes each occupy 1 way, thereby generating three cache dividing schemes.
For example, assuming 20 cache ways, samples in three extreme cases are initialized: one class of programs receives an 18-way cache, and the remaining 2 classes each receive a 1-way cache. The three cache dividing schemes are x = <18,1,1>, x = <1,18,1> and x = <1,1,18>, respectively.
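The three initial extreme schemes can be generated mechanically; `num_ways` (the total number of LLC ways) is a hypothetical parameter name for this sketch:

```python
def initial_schemes(num_ways):
    """Three extreme starting points for M cache ways across the three
    program classes (strong-interference, cache-sensitive,
    cache-insensitive): one class gets M-2 ways, the others 1 way each."""
    big = num_ways - 2
    return [(big, 1, 1), (1, big, 1), (1, 1, big)]
```

With `num_ways=20` this reproduces the <18,1,1>, <1,18,1>, <1,1,18> schemes above.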
The present embodiment adopts bayesian optimization to predict an optimal cache partition scheme, as shown in fig. 2, including:
initializing three cache dividing schemes, calculating corresponding black box function values, and starting iteration;
updating the Gaussian process model according to the existing cache dividing scheme and the corresponding black box function value;
adopting the Gaussian process gain expectation as the acquisition function, generating the next cache dividing scheme to be explored as a sample, and calculating the corresponding black box function value;
and when the preset iteration termination condition is reached, terminating the iteration, outputting a final cache division scheme, and otherwise, returning to continue the iteration.
In the iterative process of Bayesian optimization prediction, a Gaussian Process (GP) model is updated according to the existing set of x and f(x); then the Gaussian process gain expectation (Expected Improvement, EI) is taken as the acquisition function, and the next cache dividing scheme x to be explored is generated for sampling according to the mean and covariance estimated by the GP. The EI calculation formula is as follows:

EI(x) = (m(x) − f(x⁺) − ξ) · CDF(z) + σ(x) · PDF(z), if σ(x) > 0; EI(x) = 0, if σ(x) = 0

wherein m(x) and σ(x) are the estimated mean and estimated mean square error of the GP with respect to x, f(x⁺) is the current optimum value, and ξ is a constant that trades off between sampled and non-sampled regions; furthermore, z = (m(x) − f(x⁺) − ξ) / σ(x), and CDF(z) and PDF(z) are the standard normal cumulative distribution function and probability density function of z, respectively. When EI is calculated, the GP is invoked to provide the model estimates; EI takes non-sampled cache dividing schemes into account and generates the x that is likely to have the highest f(x) value.
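For illustration only (not the patent's implementation), the EI formula can be computed with the standard library; `best_f` corresponds to the current optimum f(x⁺), `xi` to the constant ξ, and the σ(x) = 0 case returns 0:

```python
import math

def _norm_cdf(z):
    # Standard normal CDF via the error function (stdlib only).
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def _norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def expected_improvement(mean, std, best_f, xi=0.01):
    """EI for maximization: mean, std are the GP posterior m(x), sigma(x)
    at a candidate scheme x; best_f is the best observed f value."""
    if std <= 0.0:
        return 0.0
    z = (mean - best_f - xi) / std
    return (mean - best_f - xi) * _norm_cdf(z) + std * _norm_pdf(z)
```

The first term rewards candidates whose predicted mean already beats the optimum; the second rewards uncertainty, which drives exploration of non-sampled schemes.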
The LLC division scheme generated by the EI formula is executed, the corresponding f(x) is obtained, and the GP is updated. The loop iterates up to 20 times; if during this period the value of EI falls below 0.001, the loop terminates, and the most recently generated LLC division scheme is executed as the final scheme.
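The overall iteration might be skeletonized as follows; `evaluate`, `fit_gp` and `acquisition` are injected stand-ins (the patent does not fix a particular Gaussian-process implementation), so this is a sketch under stated assumptions rather than the patented procedure:

```python
def bayesian_search(evaluate, fit_gp, acquisition, num_ways,
                    max_iter=20, ei_threshold=0.001):
    """Search over way allocations (a, b, c), a+b+c == num_ways, each >= 1.

    evaluate(x)     -> observed black-box value f(x) for scheme x.
    fit_gp(obs)     -> (mean_fn, std_fn) fitted to observed {x: f(x)}.
    acquisition(m, s, best) -> EI-style score for a candidate.
    """
    candidates = [(a, b, num_ways - a - b)
                  for a in range(1, num_ways - 1)
                  for b in range(1, num_ways - a)]
    # Three extreme initial schemes: one class gets M-2 ways.
    big = num_ways - 2
    observed = {x: evaluate(x)
                for x in [(big, 1, 1), (1, big, 1), (1, 1, big)]}
    for _ in range(max_iter):
        mean_fn, std_fn = fit_gp(observed)
        best = max(observed.values())
        scored = [(acquisition(mean_fn(x), std_fn(x), best), x)
                  for x in candidates if x not in observed]
        if not scored:
            break
        ei, x_next = max(scored)
        if ei < ei_threshold:   # convergence test from the description
            break
        observed[x_next] = evaluate(x_next)
    return max(observed, key=observed.get)
```

The returned tuple is the way allocation with the highest observed f(x), i.e. the scheme finally applied via CAT.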
According to the technical scheme, LLC resource scheduling is carried out among divided program categories by applying a Bayesian optimization algorithm, an LLC division scheme capable of maximizing system throughput is searched, and finally cache division is carried out according to the LLC division scheme capable of generating the highest performance.
In another embodiment, the application further provides a lightweight cache division device based on machine learning, which comprises a processor and a memory storing a plurality of computer instructions, wherein the computer instructions realize the steps of the lightweight cache division method based on machine learning when being executed by the processor.
For specific limitations regarding the machine learning-based lightweight cache division apparatus, reference may be made to the above limitations regarding the machine learning-based lightweight cache division method, and no further description is given here. The lightweight cache division device based on machine learning can be fully or partially implemented by software, hardware and a combination thereof. May be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor invokes the corresponding operations.
The memory and the processor are electrically connected to each other, directly or indirectly, for data transmission or interaction. For example, the components may be electrically connected to each other by one or more communication buses or signal lines. The memory stores a computer program that can be executed on the processor; by executing the computer program stored in the memory, the processor implements the machine-learning-based lightweight cache dividing method in the embodiment of the present invention.
The Memory may be, but is not limited to, random access Memory (Random Access Memory, RAM), read Only Memory (ROM), programmable Read Only Memory (Programmable Read-Only Memory, PROM), erasable Read Only Memory (Erasable Programmable Read-Only Memory, EPROM), electrically erasable Read Only Memory (Electric Erasable Programmable Read-Only Memory, EEPROM), etc. The memory is used for storing a program, and the processor executes the program after receiving an execution instruction.
The processor may be an integrated circuit chip having data processing capabilities. The processor may be a general-purpose processor including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), and the like. The methods, steps and logic blocks disclosed in the embodiments of the present invention may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The above examples merely represent a few embodiments of the present application; although they are described specifically and in detail, they should not be construed as limiting the scope of the invention. It should be noted that various modifications and improvements can be made by those skilled in the art without departing from the spirit of the present application, and these fall within the scope of protection of the present application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.
Claims (4)
1. The lightweight cache dividing method based on machine learning is characterized by comprising the following steps of:
constructing and training a first support vector machine model for distinguishing a strong interference program from a non-strong interference program, and constructing and training a second support vector machine model for distinguishing a cache sensitive program from a cache insensitive program;
classifying programs in the working set by adopting a trained first support vector machine model and a trained second support vector machine model;
taking the number of cache ways occupied by each of the three programs, namely a strong interference program, a cache sensitive program and a cache insensitive program, as a cache dividing scheme, taking the average weighted acceleration ratio of all the application programs in a working set as a black box function of the relation between the cache dividing scheme and the system throughput, and adopting Bayesian optimization to predict the optimal cache dividing scheme;
the method for predicting the optimal cache dividing scheme by adopting Bayesian optimization comprises the following steps:
initializing three cache dividing schemes, calculating corresponding black box function values, and starting iteration;
updating the Gaussian process model according to the existing cache dividing scheme and the corresponding black box function value;
adopting the Gaussian process gain expectation as the acquisition function, generating the next cache dividing scheme to be explored as a sample, and calculating the corresponding black box function value;
terminating iteration when a preset iteration termination condition is reached, outputting a final cache division scheme, otherwise returning to continue iteration;
the gain expectation is expressed as:

EI(x) = (m(x) - f(x+) - ξ) · CDF(z) + σ(x) · PDF(z)  if σ(x) > 0, and EI(x) = 0 if σ(x) = 0,

with z = (m(x) - f(x+) - ξ) / σ(x);

wherein m(x) and σ(x) are, respectively, the estimated mean and estimated mean square error of the Gaussian process at x, f(x+) is the current optimum value, ξ is a constant that trades off exploration of unsampled schemes against exploitation of sampled ones, CDF(z) and PDF(z) are the standard normal cumulative distribution function and probability density function of z, respectively, and x denotes a cache partitioning scheme.
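As an illustration (not part of the patent text), the gain-expectation acquisition function can be sketched as follows, assuming the standard expected-improvement formulation; the function names and the default value of `xi` are illustrative choices:

```python
import math

def norm_pdf(z):
    # standard normal probability density function PDF(z)
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    # standard normal cumulative distribution function CDF(z)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def expected_improvement(m, sigma, f_best, xi=0.01):
    """Gain expectation EI(x) of a candidate cache partitioning scheme x,
    given the Gaussian-process mean m(x), deviation sigma(x), the current
    optimum f(x+), and the exploration/exploitation constant xi."""
    if sigma == 0.0:
        return 0.0  # fully sampled point: no expected gain
    z = (m - f_best - xi) / sigma
    return (m - f_best - xi) * norm_cdf(z) + sigma * norm_pdf(z)
```

In the Bayesian-optimization loop of the claim, this quantity would be evaluated over candidate partitioning schemes, and the scheme maximizing it becomes the next sample to explore.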
2. The machine-learning-based lightweight cache partitioning method as claimed in claim 1, wherein the black-box function f(x) is expressed as:

f(x) = (1/n) · Σ_{i=1}^{n} IPC_shared_i / IPC_alone_i

wherein IPC_shared_i is the number of instructions per clock cycle of the i-th program when it runs in parallel with the other programs, IPC_alone_i is the IPC of the i-th program when it runs alone, n is the number of programs in the working set, and x denotes the numbers of cache ways occupied by the three program classes, namely the strong-interference, cache-sensitive, and cache-insensitive programs.
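A minimal sketch of this black-box function (illustrative, not taken from the patent text):

```python
def weighted_speedup(ipc_shared, ipc_alone):
    """Average weighted speedup f(x) = (1/n) * sum_i IPC_shared_i / IPC_alone_i
    over the n programs of the working set under a partitioning scheme x."""
    n = len(ipc_shared)
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone)) / n
```

Each ratio measures how close a program runs to its solo performance; averaging over the working set yields the throughput metric that Bayesian optimization maximizes.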
3. The machine-learning-based lightweight cache partitioning method as claimed in claim 1, wherein initializing the three cache partitioning schemes comprises:
when the total number of cache ways is M, for the three program classes, namely strong-interference, cache-sensitive, and cache-insensitive programs, allocating M-2 ways to one class and 1 way to each of the other two classes, in turn for each of the three classes, thereby generating three cache partitioning schemes.
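A sketch of this initialization, assuming a scheme is represented as a tuple of way counts for the (strong-interference, cache-sensitive, cache-insensitive) classes; the function name is illustrative:

```python
def initial_schemes(M):
    """Three seed cache partitioning schemes for an M-way cache: each class
    in turn receives M - 2 ways while the other two receive 1 way each."""
    schemes = []
    for favored in range(3):
        ways = [1, 1, 1]
        ways[favored] = M - 2  # favored class takes the bulk of the cache
        schemes.append(tuple(ways))
    return schemes
```

These three extreme points give the Gaussian-process model initial observations spread across the search space before the iteration begins.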
4. A lightweight machine-learning-based cache partitioning device, comprising a processor and a memory storing a number of computer instructions, wherein the computer instructions, when executed by the processor, implement the steps of the method of any one of claims 1 to 3.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110851952.4A CN113780336B (en) | 2021-07-27 | 2021-07-27 | Lightweight cache dividing method and device based on machine learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113780336A CN113780336A (en) | 2021-12-10 |
CN113780336B true CN113780336B (en) | 2024-02-02 |
Family
ID=78836399
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110851952.4A Active CN113780336B (en) | 2021-07-27 | 2021-07-27 | Lightweight cache dividing method and device based on machine learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113780336B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105519030A (en) * | 2014-04-30 | 2016-04-20 | 华为技术有限公司 | Computer program product and apparatus for fast link adaptation in a communication system |
CN106708626A (en) * | 2016-12-20 | 2017-05-24 | 北京工业大学 | Low power consumption-oriented heterogeneous multi-core shared cache partitioning method |
CN112000465A (en) * | 2020-07-21 | 2020-11-27 | 山东师范大学 | Method and system for reducing performance interference of delay sensitive program in data center environment |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11062232B2 (en) * | 2018-08-01 | 2021-07-13 | International Business Machines Corporation | Determining sectors of a track to stage into cache using a machine learning module |
Also Published As
Publication number | Publication date |
---|---|
CN113780336A (en) | 2021-12-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11176487B2 (en) | Gradient-based auto-tuning for machine learning and deep learning models | |
JP7366274B2 (en) | Adaptive search method and device for neural networks | |
Zhu et al. | Semi-supervised streaming learning with emerging new labels | |
Dublish et al. | Poise: Balancing thread-level parallelism and memory system performance in GPUs using machine learning | |
Genender-Feltheimer | Visualizing high dimensional and big data | |
CN117077871B (en) | Method and device for constructing energy demand prediction model based on big data | |
WO2021040584A1 (en) | Entity and method performed therein for handling computational resources | |
CN115495095A (en) | Whole program compiling method, device, equipment, medium and cluster of tensor program | |
CN113780336B (en) | Lightweight cache dividing method and device based on machine learning | |
CN111930484B (en) | Power grid information communication server thread pool performance optimization method and system | |
Sun et al. | Particle swarm algorithm: convergence and applications | |
CN111860818B (en) | SOM neural network algorithm processing method based on intelligent chip | |
Sirotković et al. | Accelerating mean shift image segmentation with IFGT on massively parallel GPU | |
Qiu et al. | Machine-learning-based cache partition method in cloud environment | |
CN113434286A (en) | Energy efficiency optimization method suitable for mobile application processor | |
CN115812199A (en) | Hyperspace-based processing of datasets for Electronic Design Automation (EDA) applications | |
CN114356418B (en) | Intelligent table entry controller and control method | |
WO2023030227A1 (en) | Data processing method, apparatus and system | |
EP4242837A1 (en) | Data processing apparatus and method | |
Huang et al. | Learning to Drive Software-Defined Storage | |
Zhou et al. | IOMeans: Classifying Multi-concurrent I/O Threads Using Spatio-Tempo Mapping | |
Nguyen et al. | High resolution self-organizing maps | |
Yayah et al. | Parallel classification and optimization of telco trouble ticket dataset | |
Mustapha et al. | Research Article Evaluation of Parallel Self-organizing Map Using Heterogeneous System Platform | |
CN118194072A (en) | Ultra-large-scale data-oriented pellet clustering method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||