CN112783803B - Computer CPU-GPU shared cache control method and system - Google Patents
- Publication number
- CN112783803B (application CN202110111509.3A)
- Authority
- CN
- China
- Prior art keywords
- cpu
- gpu
- level cache
- cores
- core
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0806—Multiuser, multiprocessor or multiprocessing cache systems
- G06F12/084—Multiuser, multiprocessor or multiprocessing cache systems with a shared cache
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5011—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
- G06F9/5016—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Memory System Of A Hierarchy Structure (AREA)
Abstract
The invention provides a method and system for controlling the cache shared by a computer's CPU (central processing unit) and GPU (graphics processing unit). The method comprises the following steps: first, obtaining the utilization rate and first-level cache miss rate of each CPU core and each GPU core; calculating the product of each core's utilization rate and its first-level cache miss rate; obtaining the CPU and GPU memory allocation proportion set by the user and sorting the products weighted by that proportion; and adjusting the last-level cache (LLC) shared by the CPU and the GPU according to the sorting result. By combining each core's utilization rate and first-level cache miss rate with the user's settings, the method adjusts the LLC dynamically, overcomes the overly complex adjustment and wasted system resources of prior-art LLC schemes, optimizes the LLC of a computer in a targeted manner, and improves the overall performance of a CPU chip with an integrated GPU.
Description
Technical Field
The application relates to the field of computer chips, and in particular to a method and system for controlling a last-level cache shared by a CPU and a GPU.
Background
A cache is a high-speed memory between the CPU and main memory. When the CPU reads data, it first searches the cache; if the data is found, it is read immediately, otherwise it is fetched from the comparatively slow main memory, so a well-configured cache speeds up the CPU. Before the GPU existed, the CPU handled all of a computer's work; later the GPU, a graphics processor dedicated to graphics processing and floating-point operations, appeared. With the development of large-scale integrated circuits, ever more electronic elements are integrated on a chip, and integrating the CPU and the GPU on one chip has brought further gains in computer performance. A chip that integrates a CPU and a GPU exchanges and shares data through a Last-Level Cache (LLC). However, because GPU cores run many concurrent threads, the GPU can occupy too much of the LLC, and under an LRU replacement policy the CPU's data cached in the LLC may be evicted by the GPU, degrading CPU performance.
To allocate the LLC reasonably, researchers and companies have proposed various solutions, such as static allocation, which divides the LLC into fixed parts, each assigned to a specific CPU or GPU core or thread, and feedback-driven dynamic allocation, which adjusts the LLC assigned to a program or core while the computer is running. However, a static allocation is fixed and cannot be adjusted in time, and although dynamic allocation achieves runtime adjustment, existing dynamic schemes are complex and their adjustment process wastes valuable system resources. How to adjust the LLC shared by the CPU and the GPU economically and effectively is therefore an urgent problem in the field.
Disclosure of Invention
To solve these problems, the invention provides a method and system for controlling a computer's CPU-GPU shared cache that consider both the utilization rate and the cache miss rate of each CPU and GPU core and allocate CPU and GPU memory according to the user's needs; experiments show that the invention is more targeted than existing LLC adjustment techniques.
In one aspect, the present invention provides a computer CPU-GPU shared cache control method, applied in a CPU-GPU fusion architecture, the method comprising the steps of:
acquiring the utilization rate and first-level cache miss rate of each core of the CPU (Central Processing Unit), and the utilization rate and first-level cache miss rate of each core of the GPU (Graphics Processing Unit);
calculating the product of each CPU core's utilization rate and its first-level cache miss rate to obtain C_n, n = 1, …, N; calculating the product of each GPU core's utilization rate and its first-level cache miss rate to obtain G_m, m = 1, …, M; where N is the number of CPU cores and M is the number of GPU cores;
obtaining the CPU and GPU memory allocation proportion set by the user, deriving C'_n and G'_m from the proportion together with C_n and G_m, and sorting C'_n and G'_m;
and adjusting the last level cache shared by the CPU and the GPU according to the sequencing result.
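The four steps above can be sketched in Python. This is an illustrative sketch only, not the patented implementation; the function and variable names (`rank_cores`, `cpu_stats`, and so on) are assumptions made for illustration:

```python
from typing import Dict, List, Tuple

def rank_cores(cpu_stats: Dict[str, Tuple[float, float]],
               gpu_stats: Dict[str, Tuple[float, float]],
               x: float, y: float) -> List[Tuple[str, float]]:
    """Rank cores by weighted (utilization * L1 miss rate).

    cpu_stats / gpu_stats map a core name to (utilization, L1 miss rate),
    both fractions in [0, 1]; x:y is the user's CPU:GPU memory ratio.
    Returns (core, weighted product) pairs sorted in descending order.
    """
    scores: Dict[str, float] = {}
    for name, (util, miss) in cpu_stats.items():
        scores[name] = x * util * miss   # C'_n = x * C_n, with C_n = util * miss
    for name, (util, miss) in gpu_stats.items():
        scores[name] = y * util * miss   # G'_m = y * G_m, with G_m = util * miss
    # sorted() is stable, so cores with equal scores keep their original order
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

The resulting descending queue is what the patent's adjustment step then operates on.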
On the other hand, obtaining the CPU and GPU memory allocation proportion set by the user and deriving C'_n and G'_m from the proportion together with C_n and G_m specifically comprises:
obtaining the memory sizes the user has allocated to the CPU and the GPU, and from them the proportion x:y of memory allocated to the CPU and the GPU;
C'_n = x * C_n, G'_m = y * G_m.
on the other hand, the adjusting the last-level cache shared by the CPU and the GPU according to the sorting result specifically includes:
arranging C'_n and G'_m in descending order to obtain a queue R_s; if R_s is greater than the first threshold and smaller than the second threshold, the last-level cache is not adjusted, otherwise the last-level cache is adjusted; where s = n + m, and the second threshold is greater than the first threshold.
In another aspect, the adjusting the last-level cache specifically includes:
acquiring the number a1 of cores whose value is greater than the second threshold and the number a2 of cores whose value is smaller than the first threshold; if a1 > 0 and a2 > 0, allocating the last-level cache corresponding to the a2 cores at the tail of the queue to the a1 cores at the head of the queue; if only a1 > 0, allocating the last-level cache corresponding to the a1 cores at the tail of the queue to the a1 cores at the head of the queue; and if only a2 > 0, allocating the last-level cache corresponding to the a2 cores at the tail of the queue to the a2 cores at the head of the queue.
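As a hedged sketch, the a1/a2 threshold rule above might look like the following in Python; the names and the return convention (a pair of receiver and donor core lists) are assumptions, since the patent leaves the exact transfer mechanism open:

```python
from typing import List, Tuple

def plan_llc_transfer(queue: List[Tuple[str, float]],
                      t1: float, t2: float) -> Tuple[List[str], List[str]]:
    """queue: (core, weighted score) pairs sorted descending; t1 < t2.

    Returns (receivers, donors): receivers get extra last-level cache,
    donors give part of theirs up, following the a1/a2 rule in the text.
    """
    over = [c for c, s in queue if s > t2]    # the a1 head cores above the 2nd threshold
    under = [c for c, s in queue if s < t1]   # the a2 tail cores below the 1st threshold
    a1, a2 = len(over), len(under)
    if a1 > 0 and a2 > 0:
        return over, under
    if a1 > 0:   # only a1 > 0: donate from the a1 cores at the queue tail
        return over, [c for c, _ in queue[-a1:]]
    if a2 > 0:   # only a2 > 0: give to the a2 cores at the queue head
        return [c for c, _ in queue[:a2]], under
    return [], []   # no core crosses either threshold: leave the LLC unchanged
```

How much LLC each donor yields to each receiver is a separate policy choice, which the description later discusses.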
In another aspect, the CPU-GPU converged architecture refers to a CPU chip that integrates a GPU.
In another aspect, the obtaining of the CPU and GPU memory allocation ratio set by the user specifically includes: and obtaining the memory allocated to the CPU and the memory allocated to the GPU from the BIOS, and calculating according to the memory allocated to the CPU and the memory allocated to the GPU to obtain the memory allocation ratio of the CPU and the GPU.
On the other hand, the invention also provides a computer CPU-GPU shared cache control system which is applied to the CPU-GPU fusion architecture and comprises the following modules:
the first acquisition module is used for acquiring the utilization rate and the first-level cache miss rate of each core of the CPU, and the utilization rate and the first-level cache miss rate of each core of the GPU;
a calculation module for calculating the product of each CPU core's utilization rate and its first-level cache miss rate to obtain C_n, n = 1, …, N, and the product of each GPU core's utilization rate and its first-level cache miss rate to obtain G_m, m = 1, …, M, where N is the number of CPU cores and M is the number of GPU cores;
a second obtaining module for obtaining the CPU and GPU memory allocation proportion set by the user, deriving C'_n and G'_m from the proportion together with C_n and G_m, and sorting C'_n and G'_m;
and the adjusting module is used for adjusting the last-level cache shared by the CPU and the GPU according to the sequencing result.
On the other hand, obtaining the CPU and GPU memory allocation proportion set by the user and deriving C'_n and G'_m from the proportion together with C_n and G_m specifically comprises:
obtaining the memory sizes the user has allocated to the CPU and the GPU, and from them the proportion x:y of memory allocated to the CPU and the GPU;
C'_n = x * C_n, G'_m = y * G_m.
Furthermore, the present invention provides a computer-readable storage medium storing computer program instructions which, when executed by a processor, implement the method described above.
Furthermore, the present invention also provides an electronic device comprising a memory and a processor, wherein the memory is configured to store one or more computer program instructions, wherein the one or more computer program instructions are executed by the processor to implement the method as described above.
The CPU and the GPU each comprise multiple cores. A high utilization rate means a core is busy; if the last-level cache allocated to it is too small, the core must still read data from main memory, which increases its data-access overhead and drives its utilization even higher. Because program data exhibits locality, a low first-level cache miss rate for a core indicates that the program on that core has strong locality, so the core does not need an excessive share of the LLC. In addition, the memory sizes the user allocates to the CPU and the GPU reflect the characteristics of the programs the user runs, so these allocations are also taken into account. The LLC adjustment method provided by the invention effectively improves the overall performance of the chip and avoids contention between the CPU and the GPU for LLC resources.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a schematic diagram of a CPU-GPU sharing LLC in the prior art;
FIG. 2 is a schematic flow chart of the present invention.
Detailed Description
In this document, relational terms such as first and second are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual relationship or order between them. The terms "comprises", "comprising", or any variation thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or apparatus comprising a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to it. Without further limitation, an element introduced by the phrase "comprising a …" does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises it.
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example one
In one embodiment, the invention provides a computer CPU-GPU shared cache control method, which is applied to a CPU-GPU fusion architecture and comprises the following steps:
acquiring the utilization rate and first-level cache miss rate of each core of the CPU (Central Processing Unit), and the utilization rate and first-level cache miss rate of each core of the GPU (Graphics Processing Unit);
calculating the product of each CPU core's utilization rate and its first-level cache miss rate to obtain C_n, n = 1, …, N; calculating the product of each GPU core's utilization rate and its first-level cache miss rate to obtain G_m, m = 1, …, M; where N is the number of CPU cores and M is the number of GPU cores;
obtaining the CPU and GPU memory allocation proportion set by the user, deriving C'_n and G'_m from the proportion together with C_n and G_m, and sorting C'_n and G'_m;
and adjusting the last level cache shared by the CPU and the GPU according to the sequencing result.
In another embodiment, obtaining the CPU and GPU memory allocation proportion set by the user and deriving C'_n and G'_m from the proportion together with C_n and G_m specifically comprises:
obtaining the memory sizes set by the user for the CPU and the GPU, and from them the proportion x:y of memory allocated to the CPU and the GPU;
C'_n = x * C_n, G'_m = y * G_m.
in another embodiment, the adjusting the last-level cache shared by the CPU and the GPU according to the sorting result specifically includes:
arranging C'_n and G'_m in descending order to obtain a queue R_s; if R_s is greater than the first threshold and smaller than the second threshold, the last-level cache is not adjusted, otherwise the last-level cache is adjusted; where s = n + m, and the second threshold is greater than the first threshold.
The utilization rate of each CPU and GPU core is related to the complexity and concurrency of the data being processed, and also to the cache available to the CPU and GPU: the larger the cache, the less likely the CPU or GPU is to read data from main memory and the shorter the data-access time;
moreover, different programs have different characteristics: some have good data locality and can read most of their data from the cache, so their first-level cache miss rate is low, while others, especially programs handling large volumes of data, must frequently read from disk and main memory, so their first-level cache miss rate is high.
In addition, different users use their computers differently. Some mainly play online games or process pictures and video, and typically allocate more memory to the GPU; others only edit documents and browse the web, and allocate more memory to the CPU to improve its performance. Different purposes lead to different CPU and GPU memory allocations. If the user allocates more memory to the GPU, more of the shared cache should be allocated to the GPU to meet the user's demand, and conversely less can be allocated.
In order to facilitate understanding of the present invention, a specific example will be described below.
Assume the user's computer has 8 GB of memory in total, of which the user allocates 5 GB to the CPU and 3 GB to the GPU, so the CPU:GPU memory allocation ratio is 5:3;
the current computer CPU has 4 cores and GPU has 8 stream processing cores, and at a certain moment, the utilization rate of the cores and the first-level cache miss rate are shown in table 1 below, where the weight of the CPU core is 5 and the weight of the GPU core is 3:
TABLE 1
Core | Core utilization | First-level cache miss rate | Product | Weight product |
---|---|---|---|---|
C1 | 80% | 80% | 0.64 | 3.2 |
C2 | 20% | 20% | 0.4 | 2.0 |
C3 | 60% | 30% | 0.18 | 0.9 |
C4 | 30% | 60% | 0.18 | 0.9 |
G1 | 80% | 80% | 0.64 | 1.92 |
G2 | 80% | 20% | 0.16 | 0.48 |
G3 | 90% | 90% | 0.81 | 2.43 |
G4 | 40% | 60% | 0.24 | 0.72 |
G5 | 10% | 10% | 0.01 | 0.03 |
G6 | 50% | 60% | 0.3 | 0.9 |
G7 | 40% | 60% | 0.24 | 0.72 |
G8 | 30% | 60% | 0.18 | 0.54 |
According to the final result, the cores are ordered as: c1, G3, C2, G1, C3, C4, G6, G4, G7, G8, G2, G5.
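This ranking can be reproduced with a few lines of Python, as an illustrative check using the weight-product column of Table 1; Python's stable sort keeps the C1…C4, G1…G8 input order for tied scores, which matches the ordering above:

```python
# Weight products taken from Table 1 (CPU cores weighted by 5, GPU cores by 3)
weighted = {
    "C1": 3.2, "C2": 2.0, "C3": 0.9, "C4": 0.9,
    "G1": 1.92, "G2": 0.48, "G3": 2.43, "G4": 0.72,
    "G5": 0.03, "G6": 0.9, "G7": 0.72, "G8": 0.54,
}
order = [core for core, _ in
         sorted(weighted.items(), key=lambda kv: kv[1], reverse=True)]
print(order)
# → ['C1', 'G3', 'C2', 'G1', 'C3', 'C4', 'G6', 'G4', 'G7', 'G8', 'G2', 'G5']
```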
In another embodiment, the adjusting the last-level cache specifically includes:
acquiring the number a1 of cores whose value is greater than the second threshold and the number a2 of cores whose value is smaller than the first threshold; if a1 > 0 and a2 > 0, allocating the last-level cache corresponding to the a2 cores at the tail of the queue to the a1 cores at the head of the queue; if only a1 > 0, allocating the last-level cache corresponding to the a1 cores at the tail of the queue to the a1 cores at the head of the queue; and if only a2 > 0, allocating the last-level cache corresponding to the a2 cores at the tail of the queue to the a2 cores at the head of the queue.
In another embodiment, the CPU-GPU fusion architecture refers to a CPU chip integrated with a GPU.
In another embodiment, the obtaining of the CPU and GPU memory allocation ratio set by the user specifically includes: and obtaining the memory allocated to the CPU and the memory allocated to the GPU from the BIOS, and calculating according to the memory allocated to the CPU and the memory allocated to the GPU to obtain the memory allocation ratio of the CPU and the GPU.
The first threshold and the second threshold are obtained by analyzing statistical information, such as how long each core stays busy and how much of the LLC it occupies; they may also be specified by the user or the system.
Still taking the data in Table 1 as an example, and assuming the second threshold is 2 and the first threshold is 0.9, the sorted cores are divided into three parts, as shown in Table 2 below.
TABLE 2
C1, G3 | C2, G1, C3, C4, G6, G4 | G7, G8, G2, G5 |
In this case, the LLC portions corresponding to cores G7, G8, G2 and G5 need to be allocated to C1 and G3. There are various ways to do so; for example, 20% of the LLC occupied by, or remaining to, each of G7, G8, G2 and G5 may be distributed evenly to C1 and G3, or the reallocation may follow some other ratio; the invention does not limit this.
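The three-way split behind Table 2 can be sketched as follows. The boundary handling is an assumption: scores strictly above the second threshold mark receivers and scores strictly below the first threshold mark donors, with ties kept in the middle. Note that under a strict `< 0.9` cut, G4 (weight product 0.72) would also land in the tail group, whereas Table 2 lists it in the middle, so the example's exact grouping is not fully determined by the thresholds alone:

```python
# Weight products from Table 1 and the example thresholds
weighted = {
    "C1": 3.2, "C2": 2.0, "C3": 0.9, "C4": 0.9,
    "G1": 1.92, "G2": 0.48, "G3": 2.43, "G4": 0.72,
    "G5": 0.03, "G6": 0.9, "G7": 0.72, "G8": 0.54,
}
t1, t2 = 0.9, 2.0                                   # first and second thresholds
ranked = sorted(weighted.items(), key=lambda kv: kv[1], reverse=True)
head   = [c for c, s in ranked if s > t2]           # receive extra LLC
middle = [c for c, s in ranked if t1 <= s <= t2]    # left unchanged (ties included)
tail   = [c for c, s in ranked if s < t1]           # give up part of their LLC
print(head)
# → ['C1', 'G3']
```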
Example two
In another embodiment, the present invention further provides a computer CPU-GPU shared cache control system, applied in a CPU-GPU fusion architecture, wherein the system includes the following modules:
the first acquisition module is used for acquiring the utilization rate and the first-level cache miss rate of each core of the CPU, and the utilization rate and the first-level cache miss rate of each core of the GPU;
a calculation module for calculating the product of each CPU core's utilization rate and its first-level cache miss rate to obtain C_n, n = 1, …, N, and the product of each GPU core's utilization rate and its first-level cache miss rate to obtain G_m, m = 1, …, M, where N is the number of CPU cores and M is the number of GPU cores;
a second obtaining module for obtaining the CPU and GPU memory allocation proportion set by the user, deriving C'_n and G'_m from the proportion together with C_n and G_m, and sorting C'_n and G'_m;
and the adjusting module is used for adjusting the last-level cache shared by the CPU and the GPU according to the sequencing result.
In another embodiment, obtaining the CPU and GPU memory allocation proportion set by the user and deriving C'_n and G'_m from the proportion together with C_n and G_m specifically comprises:
obtaining the memory sizes the user has allocated to the CPU and the GPU, and from them the proportion x:y of memory allocated to the CPU and the GPU;
C'_n = x * C_n, G'_m = y * G_m.
Other implementation details of the second embodiment are as described in the first embodiment and are not repeated here.
EXAMPLE III
In another embodiment, the present invention further provides a computer-readable storage medium for storing computer program instructions, wherein the computer program instructions, when executed by a processor, implement the method of embodiment one.
Example four
In another embodiment, the present invention further provides an electronic device comprising a memory and a processor, wherein the memory is configured to store one or more computer program instructions, and wherein the one or more computer program instructions are executed by the processor to implement the method of embodiment one.
The embodiments described in the present invention may be combined to form corresponding technical solutions. The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), flash memory, Read-Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Claims (10)
1. A computer CPU-GPU shared cache control method is applied to a CPU-GPU fusion architecture, and is characterized by comprising the following steps:
acquiring the utilization rate and the first-level cache miss rate of each core of a CPU (Central processing Unit), and the utilization rate and the first-level cache miss rate of each core of a GPU (graphics processing Unit);
calculating the product of each CPU core's utilization rate and its first-level cache miss rate to obtain C_n, n = 1, …, N; calculating the product of each GPU core's utilization rate and its first-level cache miss rate to obtain G_m, m = 1, …, M; wherein N is the number of CPU cores and M is the number of GPU cores;
obtaining the CPU and GPU memory allocation proportion set by the user, deriving C'_n and G'_m from the proportion together with C_n and G_m, and sorting C'_n and G'_m;
and adjusting the last level cache shared by the CPU and the GPU according to the sequencing result.
2. The method of claim 1, wherein obtaining the CPU and GPU memory allocation proportion set by the user and deriving C'_n and G'_m from the proportion together with C_n and G_m specifically comprises:
obtaining the memory sizes the user has allocated to the CPU and the GPU, and from them the proportion x:y of memory allocated to the CPU and the GPU;
C'_n = x * C_n, G'_m = y * G_m.
3. the method of claim 1, wherein the adjusting the last level cache shared by the CPU and the GPU according to the sorting result specifically comprises:
arranging C'_n and G'_m in descending order to obtain a queue R_s; if R_s is greater than the first threshold and smaller than the second threshold, the last-level cache is not adjusted, otherwise the last-level cache is adjusted; wherein s = n + m, and the second threshold is greater than the first threshold.
4. The method of claim 3, wherein the adjusting the last level cache comprises:
acquiring the number a1 of cores whose value is greater than the second threshold and the number a2 of cores whose value is smaller than the first threshold; if a1 > 0 and a2 > 0, allocating the last-level cache corresponding to the a2 cores at the tail of the queue to the a1 cores at the head of the queue; if only a1 > 0, allocating the last-level cache corresponding to the a1 cores at the tail of the queue to the a1 cores at the head of the queue; and if only a2 > 0, allocating the last-level cache corresponding to the a2 cores at the tail of the queue to the a2 cores at the head of the queue.
5. The method of claim 1, wherein the CPU-GPU converged architecture refers to a CPU chip that integrates a GPU.
6. The method according to claim 1, wherein the obtaining of the CPU and GPU memory allocation ratio set by the user specifically comprises: and acquiring the memory allocated to the CPU and the memory allocated to the GPU from the BIOS, and calculating the memory allocation proportion of the CPU and the GPU according to the memory allocated to the CPU and the memory allocated to the GPU.
7. A computer CPU-GPU shared cache control system is applied to a CPU-GPU fusion architecture, and is characterized by comprising the following modules:
the first acquisition module is used for acquiring the utilization rate and the first-level cache miss rate of each core of the CPU, the utilization rate and the first-level cache miss rate of each core of the GPU;
a calculation module for calculating the product of each CPU core's utilization rate and its first-level cache miss rate to obtain C_n, n = 1, …, N, and the product of each GPU core's utilization rate and its first-level cache miss rate to obtain G_m, m = 1, …, M, wherein N is the number of CPU cores and M is the number of GPU cores;
a second obtaining module for obtaining the CPU and GPU memory allocation proportion set by the user, deriving C'_n and G'_m from the proportion together with C_n and G_m, and sorting C'_n and G'_m;
and the adjusting module is used for adjusting the last-level cache shared by the CPU and the GPU according to the sequencing result.
8. The system of claim 7, wherein obtaining the CPU and GPU memory allocation proportion set by the user and deriving C'_n and G'_m from the proportion together with C_n and G_m specifically comprises:
obtaining the memory sizes the user has allocated to the CPU and the GPU, and from them the proportion x:y of memory allocated to the CPU and the GPU;
C'_n = x * C_n, G'_m = y * G_m.
9. a computer readable storage medium storing computer program instructions, which when executed by a processor implement the method of any one of claims 1-6.
10. An electronic device comprising a memory and a processor, wherein the memory is configured to store one or more computer program instructions, wherein the one or more computer program instructions are executed by the processor to implement the method of any of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110111509.3A CN112783803B (en) | 2021-01-27 | 2021-01-27 | Computer CPU-GPU shared cache control method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112783803A CN112783803A (en) | 2021-05-11 |
CN112783803B true CN112783803B (en) | 2022-11-18 |
Family
ID=75758092
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110111509.3A Active CN112783803B (en) | 2021-01-27 | 2021-01-27 | Computer CPU-GPU shared cache control method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112783803B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113806247A (en) * | 2021-07-22 | 2021-12-17 | 上海擎昆信息科技有限公司 | Device and method for flexibly using data cache in 5G communication chip |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103927277A (en) * | 2014-04-14 | 2014-07-16 | 中国人民解放军国防科学技术大学 | CPU (central processing unit) and GPU (graphic processing unit) on-chip cache sharing method and device |
WO2017000673A1 (en) * | 2015-06-29 | 2017-01-05 | 深圳市中兴微电子技术有限公司 | Shared cache allocation method and apparatus and computer storage medium |
CN106708626A (en) * | 2016-12-20 | 2017-05-24 | 北京工业大学 | Low power consumption-oriented heterogeneous multi-core shared cache partitioning method |
CN108399145A (en) * | 2018-02-08 | 2018-08-14 | 山东大学 | A kind of CPU-GPU heterogeneous platforms share last level cache management method, framework and device |
CN111190735A (en) * | 2019-12-30 | 2020-05-22 | 湖南大学 | Linux-based on-chip CPU/GPU (Central processing Unit/graphics processing Unit) pipelined computing method and computer system |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9626295B2 (en) * | 2015-07-23 | 2017-04-18 | Qualcomm Incorporated | Systems and methods for scheduling tasks in a heterogeneous processor cluster architecture using cache demand monitoring |
Non-Patent Citations (6)
Title |
---|
Co-Scheduling on Fused CPU-GPU Architectures With Shared Last Level Caches; M. Damschen et al.; IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems; 2018-07-18; Vol. 37, No. 11; full text * |
Cache Performance Analysis and Optimization on Fused CPU-GPU Architectures; Sun Chuanwei et al.; Computer Engineering and Applications; 2015-09-02; Vol. 53, No. 02; full text * |
Heterogeneous Cache Hierarchy Management for Integrated CPU-GPU Architecture; H. Wen et al.; 2019 IEEE High Performance Extreme Computing Conference; 2019-11-28; full text * |
Research on Cache Partitioning and Adaptive Replacement Policy for CPU-GPU Heterogeneous Processors; J. Fang et al.; 2017 16th International Symposium on Distributed Computing and Applications to Business, Engineering and Science; 2018-01-11; full text * |
Research on Cache Optimization Techniques for CPU-GPU Heterogeneous Architectures; Liu Shijian; China Master's Theses Full-text Database, Information Science and Technology Series; 2019-05-15; Vol. 2019, No. 5; full text * |
Research on Shared Cache Management Techniques in Heterogeneous Multi-core Environments; Hao Xiaona; China Master's Theses Full-text Database, Information Science and Technology Series; 2018-07-15; Vol. 2018, No. 7; full text * |
Also Published As
Publication number | Publication date |
---|---|
CN112783803A (en) | 2021-05-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107038069B (en) | Dynamic label matching DLMS scheduling method under Hadoop platform | |
US8190795B2 (en) | Memory buffer allocation device and computer readable medium having stored thereon memory buffer allocation program | |
KR101761301B1 (en) | Memory resource optimization method and apparatus | |
CN103902474B (en) | Mixed storage system and method for supporting solid-state disk cache dynamic distribution | |
US6851030B2 (en) | System and method for dynamically allocating associative resources | |
CN110226157A (en) | Dynamic memory for reducing row buffering conflict remaps | |
US7185167B2 (en) | Heap allocation | |
US7979668B2 (en) | Method and system for automatically distributing real memory between virtual memory page sizes | |
US8060679B2 (en) | Information processing apparatus and access control method capable of high-speed data access | |
WO2023050712A1 (en) | Task scheduling method for deep learning service, and related apparatus | |
US8495302B2 (en) | Selecting a target number of pages for allocation to a partition | |
CN112181613B (en) | Heterogeneous resource distributed computing platform batch task scheduling method and storage medium | |
KR20130068685A (en) | Hybrid main memory system and task scheduling method therefor | |
US20070294448A1 (en) | Information Processing Apparatus and Access Control Method Capable of High-Speed Data Access | |
CN107111557A (en) | Shared cache memory distribution control is provided in shared high-speed buffer storage system | |
CN112783803B (en) | Computer CPU-GPU shared cache control method and system | |
US20180246820A1 (en) | Multiple linked list data structure | |
US20190056872A1 (en) | Reallocate memory pending queue based on stall | |
US7904688B1 (en) | Memory management unit for field programmable gate array boards | |
CN112540934B (en) | Method and system for ensuring service quality when multiple delay key programs are executed together | |
CN113806089A (en) | Cluster load resource scheduling method and device, electronic equipment and readable storage medium | |
JP2020021417A (en) | Database management system and method | |
CN104050189B (en) | The page shares processing method and processing device | |
CN111427887A (en) | Method, device and system for rapidly scanning HBase partition table | |
CN114924848A (en) | IO (input/output) scheduling method, device and equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
TA01 | Transfer of patent application right |
Effective date of registration: 2022-11-03
Address after: Room 502-A, Building 4, Xingong Science Park, No. 100, Luyun Road, Lugu Street, Changsha Hi-tech Development Zone, Hunan Province, 410000
Applicant after: Hunan Zhongke Changxing Technology Co., Ltd.
Address before: 619, Building B, Dashiqiao SOHO Plaza, Jinshui District, Zhengzhou City, Henan Province, 450000
Applicant before: Yu Hui |
GR01 | Patent grant | ||