CN106708626A

CN106708626A - Low power consumption-oriented heterogeneous multi-core shared cache partitioning method

Info

Publication number: CN106708626A
Application number: CN201611187228.1A
Authority: CN
Inventors: 方娟; 刘士建; 程妍瑾; 常泽清
Original assignee: Beijing University of Technology
Current assignee: Beijing University of Technology
Priority date: 2016-12-20
Filing date: 2016-12-20
Publication date: 2017-05-24

Abstract

The invention discloses a lower power consumption-oriented heterogeneous multi-core shared cache partitioning method. The method comprises the following steps of: carrying out static partitioning on a shared last-stage cache L2 cache, distributing a cache space with a constant proportion of 50% to a CPU application, and distributing the residual space to a GPU application; on the basis of mean-proportion partitioning, carrying out optimum static proportion partitioning, and portioning different proportions to the CPU application and the GPU application; and finally carrying out dynamic self-adaptive proportion partitioning mechanism, and dynamically changing the proportions, in a last-stage cache, of the CPU application and the GPU application according to an IPC partitioning index during the operation, so as to achieve the aims of reducing the system power consumption and enhancing the system performance.

Description

A kind of heterogeneous polynuclear cache sharing division methods towards low-power consumption

Technical field

The invention belongs to computer architecture caching system construction applications, and in particular to a kind of heterogeneous polynuclear towards low-power consumption Cache sharing division methods.

Background technology

With continuing to develop for polycaryon processor, traditional multicore architecture is difficult to meet the demand of large-scale calculations, industry Boundary forms heterogeneous multi-nucleus processor framework by integrating different types of processor on same chip.At heterogeneous polynuclear Be fused to for processor with different computing capabilitys by reason device (HMP-Heterogeneous Multi-core Processor) Together, it is widely used in the industries such as Aero-Space, Industry Control, instrument and meter, to meet performance requirements, reduces power consumption And cost.HMP, can be by different types of distribution of computation tasks to different due to being integrated with the processor core of different characteristics and performance Parallel processing on the processor core of type, such as, quick complicated nuclear energy performs serial code part, and better simply core then can Parallel processing digital, so as to for the application of different demands provides more flexible, efficient treatment mechanism, meet various applied environments Requirement to system real time, power consumption, reliability and cost, the focus as current research.

Graphic process unit (Graphics Processing Unit, GPU) has relatively simple control logic, is integrated with A large amount of parallel processing cores, with peak value efficiency (the calculating performance of unit work consumptiom) higher.GPU is from beginning of being born just in floating-point Calculate and surmounted CPU in performance, huge wide gap causes substantial amounts of program between GPU executed in parallel and CPU order execution performances The part of computation-intensive is given GPU and is calculated by developer, and Computer Architecture is also from traditional multi-core CPU epoch Driving crosses the multi-core CPU-GPU isomery epoch.Heterogeneous multi-nucleus processor framework such as AMD Fusion, Intel Sandy The appearance of Bridge and Nvidia Denver indicates the heterogeneous polynuclear framework main flow as current era.

CPU and GPU are integrated in a chip on heterogeneous multi-nucleus processor, using shared last level cache structure, with CPU phases Than GPU number of threads is more, and degree of parallelism is high, and GPU application programs can reach data access speed more more than CPU application program Rate, with certain access delay tolerance.The program locality feature that is shown by CPU and GPU application programs and The inconsistency of memory access behavior, the data of different processes can seize shared last level cache, bring the problem of resource contention. When CPU application programs and GPU application programs are performed jointly, because GPU application program access rates are high, therefore it is most of can profit Last level cache space will be used by GPU application programs so that the space for leaving CPU for is very limited, cause CPU application journeys Ordered pair can be reduced substantially in the access of shared LLC, and for many CPU application programs, application program is delayed once there is high speed Especially last level cache missing is deposited, then needs extraly to go access off-chip main to deposit, cause unnecessary power dissipation overhead.Cause This, in heterogeneous polynuclear framework, the division for sharing LLC is most important for the influence of systematic function and power consumption.Now, a conjunction It is very necessary that efficient shared buffer memory division methods are managed for reducing the power dissipation overhead of system.

At present, existing some researchs are devoted to Cache cache subsystems, and Xie and Loh et al. are proposed by using hardware meter Thread of the number device to operation on the same chip is cached dynamically subregion so that the cache hit rate of cross-thread is maximized. Lee and Kim et al. are by considering the influence of the application and caching miss rate of memory level parallelism in performance, it is proposed that a dynamic Ground cache partitions strategy.Qureshi and Patt et al. propose the shared last level cache division methods based on Buffer Utilization, By using a performance monitor UMON (Utility Monitor), the shared height of data access final stage of different processes is counted The position of hit during speed caching, the execution for instructing next cycle partition strategy.But existing cache management work, It is mainly used in isomorphism multiple nucleus system environment, it is impossible to be adapted to the isomerous environment that CPU is combined with GPU, more cannot be to from CPU's Request and the request from GPU make a distinction, and cause the unjustness of shared last level cache distribution, have a strong impact on the performance of system And power consumption.

The content of the invention

The present invention proposes a kind of heterogeneous polynuclear cache sharing division methods towards low-power consumption.Shared end is carried out first The level L2 grades of static division of Cache of caching, the spatial cache for distributing constant ratio 50% is applied to CPU, and remaining space leaves GPU for Using.Respectively on the basis of ratio cut partition, the division of optimal staticaccelerator scale is carried out, divide inequality proportion and applied to CPU and GPU. Finally, dynamically self adaptation ratio cut partition mechanism is carried out, using the dynamic patitioning algorithm towards low-power consumption, is divided according to IPC and referred to Mark operationally dynamically changes CPU and applies the ratio for occupying last level cache with GPU applications, so as to reach reduction system power dissipation, The purpose of lifting system performance.

To reach above-mentioned purpose, the present invention uses following technical scheme.

A kind of heterogeneous polynuclear cache sharing division methods towards low-power consumption, comprise the following steps：

Step 1, realizes that CPU request asks to distinguish with GPU, tracks access request, and difference is distinguished using flag bit TagID The access request of core；

Step 2, the flag bit TagID according to different core access request determines the L2 Cache ground that access request is mapped to Location；

Step 3, realizes static division, comprises the following steps：

Step 3.1, equivalent is divided equally

In the request buffering queue L1RequestToL2Cache of L2 Cache, TagID message flags position is judged, if The request is the L1 grades of Cache from GPU cores, and the access request is mapped into odd address, if the request is from CPU core L1 grades of Cache of the heart, even address is mapped to by the access request；

Step 3.2, optimal dividing on the basis of dividing equally, is altered in steps the buffer address for distributing to CPU and GPU applications Ratio, instruction number (the Instructions Per performed in statistics shared buffer memory CPU and the respective unit period of GPU program Cycle abbreviation IPC), find best performance division proportion least in power-consuming；

Step 4, realizes that dynamic self-adapting is divided

Step 2, the CPU and GPU last level cache accountings in 3 be application program operation before it is ready-portioned, in operation Mobile state adjustment will not be entered according to the feature of application program.Dynamic is divided operationally collects CPU core and GPU core access requests Feature, realizes that self adaptation is dynamically divided.

Preferably, step 4 is specifically included：

Step 4.1, monitors access request, and the memory access behavior information of CPU application programs and GPU application programs is obtained respectively. The IPC values of the IPC and GPU application programs of CPU application programs are counted respectively；

Step 4.2, the IPC indexs according to GPU calculate the performance gain σ of application program, and Cache is arranged into (Cache way) It is allocated to the maximum application of performance gain；

Step 4.3, periodically execution step 4.1,4.2, the memory access information according to current period GPU, calculate GPU applications IPC yield values, arrange distribution Cache to corresponding application program when next cycle starts.

Preferably, step 4.2 to specifically include specific method as follows：If, Threshold_lowIt is the lower bound of threshold value evaluation and test Value, Threshold_highIt is the upper dividing value of threshold value evaluation and test,

If the 1, the IPC yield values σ of the GPU is less than threshold value Threshold_low, GPU application programs are that caching is insensitive Cache row are allocated to CPU application programs by type.

If the 2, the IPC yield values σ of the GPU is more than or equal to threshold value Threshold_low, less than or equal to threshold value Threshold_high, GPU application programs are Cache Sensitive type, and distributing to GPU application programs can bring bigger income, will Cache row are allocated to GPU application programs.

If the 3, the IPC yield values σ of the GPU is more than or equal to threshold value Threshold_high, GPU application program stage of development Property change, recover caching be divided to original state.

Compared with prior art, the present invention has advantages below：

GPU application programs and CPU Application sharing last level caches, the exclusive good concurrency of GPU application programs and visit Survive slow tolerance, cause GPU application programs to occupy most of LLC spaces, have a strong impact on the memory access hit rate of CPU programs, lead The expense for causing extra access to host, influences the performance and power consumption of system.LLC static divisions and dynamic self-adapting division methods, have Be limited in the addressing space of CPU and GPU on some Cache row by effect ground, it is to avoid the unfair competition of GPU application programs, improves CPU application programs reduce memory access miss rate to the utilization rate of LLC, so as to reach reduction power consumption, improve the purpose of systematic function.

Brief description of the drawings

To make the purpose of the present invention, scheme is more easy-to-understand, and below in conjunction with accompanying drawing, the present invention is further described.

Fig. 1 is CPU+GPU heterogeneous multi-core system Organization Charts, and the heterogeneous multi-core system is by 2 core cpus and 4 GPU cores Composition heterogeneous multi-core architecture, each core includes a privately owned L1 grades of Cache, and all core cpus are shared with GPU cores L2 Cache are the communication network-on-chip between final stage shared buffer memory (LLC) and main memory controller DRAM (MEMORY), all cores NOC is exchanged.

Fig. 2 is SLICC operating mechanism figures；

Fig. 3 schematic diagram of Cache division methods

Fig. 4 is the flow chart of dynamic self-adapting LLC partitioning algorithms.

Specific embodiment

To make the purpose of the present invention, technical scheme and advantage become more apparent, below in conjunction with accompanying drawing to the present invention Embodiment be described in detail.

Involved in the present invention is a kind of heterogeneous polynuclear cache sharing division methods towards low-power consumption, as shown in figure 1, There are two core cpus, four GPU cores with one, each core possesses respective L1 Cache, share a L2 As a example by the heterogeneous processor of Cache.The cpu test program of operation is that SPEC CPU2006, the GPU application programs of single thread are Rodinia.Each workload is made up of a cpu test program and a GPU application program.In simulator, use SLICC (specification language for implementing cache coherence) scripts are described Consistency protocol.As shown in Fig. 2 being SLICC operating mechanism figures.Comprise the following steps that：

Step 1, distinguishes CPU access requests and GPU access requests, and addition flag bit TagID marks the Cache of each to compile Number, distinguish L1 Cache and belong to core cpu or GPU cores.

One workload group of operation (includes 2 benchmark test programs), and the wherein benchmark of CPU is One test program of SPEC2006 operates in a core cpu, and the benchmark of GPU is the rodinia test programs of GPU Another core cpu is operated in, is operated on 4 GPU cores by core cpu guiding GPU test programs rodinia.Have 6 altogether The individual L1 cache message from different core.Increase TagID flag bits newly on each L1 Cache Controller, pass through TagID distinguishes the L1 Cache message from different core.

Step 2, the flag bit TagID according to different core access request determines the L2 Cache ground that access request is mapped to Location.

With the addition of flag bit in step 1 to distinguish the L1 Cache message of different core, as L1 Cache When access request message in Controller reaches L2 Cache Controller, in port L1RequestL2Network_ The judgement of type of message is carried out in, different ground are carried out to the L1 Cache request messages from core cpu or GPU cores Location space reflection.

Step 3, realizes static division.

CPU application programs and GPU Application sharing L2 Cache space address, L2 is accessed by CPU application programs Cache addresses are limited in the Cache of some fixations row (Cache Way), and GPU application programs are accessed into L2 Cache addresses limit Make in remaining Cache for fixing Way, as shown in figure 3, realizing static division.Static division scheme efficiently avoid GPU Parallel multithread takes to L2 grades of the unfair of caching, it is ensured that CPU reduces CPU access for L2 grades of caching utilization ratio The expense that off-chip main is deposited, further reduces the power consumption of system.

Step 3.1, equivalent is divided equally.

Storage comes from CPU or GPU L1 Cache Controller in L1RequestL2Network_in message queues Message, L2 grades is mapped by getCacheEntry (in_msg.addr) functions according to the address of in_msg.addr by message In the corresponding cache_entry of Cache, odd even address mark function F_O (in_msg.addr) is increased newly, if in_msg.addr It is even address, returns to true, otherwise returns to false.For the message from CPU L1 Cache, firstly for from CPU L1 The message of Cache judges whether F_O (in_msg.addr) return value is true, if true, maps directly to L2 grades of Cache L2cache [addr] in, if false, in mapping that to the L2 grades of L2cache of Cache [addr+1] so that come from The message of CPU L1 Cache is mapped in even address.For the message from GPU L1 Cache, as F_O (in_msg.addr) When return value is false, directly mapped, return value is true, and message is mapped in L2cache [addr+1].By L2 Level Cache address spaces are divided into odd address and even address two parts, are respectively allocated to CPU application programs and GPU application programs, So as to reduce the memory access miss rate of CPU applications.

Step 3.2, optimal dividing.

Another mode of static division is exactly that inequality proportion is divided, and the L2 Cache spaces of inequality proportion are allocated to CPU and GPU is applied, and according to the L2 grades of Cache accounting of the current actual products CPU of Intel and AMD and GPU, 1/8 is applied to GPU Program, 7/8 gives CPU application programs, and such spatial cache accounting effect is preferable.Therefore newly-increased address partition function C_G (in_msg.addr), for the message from GPU L1 Cache, if in_msg.addr%8 is equal to 0, by L2cache GPU is distributed in [addr] address, if in_msg.addr%8 is not equal to 0, L2cache [addr/8+8] address is distributed to GPU.For the message from CPU L1 Cache, if in_msg.addr%8 is not equal to 0, by L2cache [addr] Location is distributed directly to CPU, and L2cache [addr+1] address otherwise is distributed into CPU.So divide L2 grades of Cache address is empty Between be divided into 1:7 inequality proportion, GPU occupies 1 part, and CPU occupies 7 parts, although GPU application programs are occupied L2 grades of caching compared with It is few, but due to the concurrency and delay-tolerant of GPU height, L2 grades of less caching be not obvious to GPU performance impacts, CPU applications occupy most of L2 grades of Cache, can be effectively reduced memory access miss rate so that the power dissipation overhead of system is reduced.

Step 4, realizes that dynamic self-adapting is divided.Step 2, CPU and GPU last level cache accountings in 3 are in application program It is ready-portioned before operation, Mobile state adjustment will not be entered according to the feature of application program in operation.Dynamic is divided operationally The feature of CPU core and GPU core access requests is collected, realizes that self adaptation is dynamically divided, flow such as Fig. 4 institutes of self adaptation partitioning algorithm Show.The dynamic of self adaptation divides access request feature that can operationally according to CPU and GPU, dynamically changes and distributes to CPU With L2 grades of ratio of Cache of GPU applications, for the L2 grades of Cache space that the application of different characteristic can distribute different proportion, make Performance benefits are obtained to maximize.

Step 4.1, monitors access request, and the memory access behavior information of CPU application programs and GPU application programs is obtained respectively. The IPC values of the IPC and GPU application programs of CPU application programs are counted respectively.It is average that IPC values refer to CPU or GPU each cycles The instruction strip number of execution, computing formula is as follows：

IPC can be very good to reflect influence of the L2 grades of buffer memory capacity change to CPU and GPU application program capacities, therefore Can reflect that application program capacity is changed by L2 grades of buffer memory capacity and is changed by the monitoring to IPC values.From L2 grades In first sampling period after each division of caching, record the IPC values of each cycle GPU applications.

Step 4.2, the IPC indexs according to GPU calculate the performance gain σ of application program, by Cache being allocated to property of way Can the maximum application of gain.Specific method is as follows：

Assuming that X_iRepresent i-th value of cycle IPC after caching division, Threshold_lowThe floor value of threshold value evaluation and test, Threshold_highIt is the upper dividing value of threshold value evaluation and test.Then the IPC values of current sample period divide the stage with GPU application programs before Average IPC othernesses are that formula is as follows：

(1) if the IPC yield values σ for calculating GPU is less than threshold value Threshold_low, it can be determined that current GPU application journeys Sequence is caching non-sensitive type, and distributing more L2 grades of Cache cachings can't produce influence to GPU application program capacities, on the contrary CPU application programs can be reduced to the L2 grades of access efficiency of Cache, cause the memory access miss rate of CPU application programs to greatly promote.This When should stop distribution L2 grade cache to GPU apply, Cache Way are distributed into CPU application programs, L2 grades can be effectively improved The utilization ratio of Cache.

(2) if the IPC yield values σ for calculating GPU is more than or equal to threshold value Threshold_lowLess than or equal to threshold value Threshold_high, GPU application programs are Cache Sensitive type, and distributing to L2 grades of Cache space of GPU application programs can bring bigger Income, then Cache Way are allocated to GPU application programs.

(3) if the IPC yield values σ for calculating GPU is more than or equal to threshold value Threshold_high, then GPU application programs hair Raw phasic Chang, recovers caching and is divided to original state.

Step 4.3, periodically execution step 4.1,4.2, the memory access information according to current period GPU, calculate GPU applications IPC yield values, corresponding application program is given when next cycle starts by distribution Cache Way.

Heterogeneous polynuclear cache sharing division methods towards low-power consumption of the invention, the Sandy Bridge frameworks of Intel The integration CPU and GPU cores in a chip are realized with the Kaveri frameworks of AMD, heterogeneous multi-nucleus processor framework is formd. Heterogeneous multi-nucleus processor framework simplifies the communication between CPU and GPU, realizes and shared between CPU and GPU afterbody caching (LLC) resource.Because GPU cores possess level of parallelism higher than core cpu, therefore GPU can reach data higher and visit Speed is asked, causing the spatial cache of majority will be taken by GPU application programs, leave behind very limited amount of space and give CPU application journeys Sequence, has a strong impact on the memory access latency of CPU application programs and the expense of power consumption, and GPU application programs have good concurrency, cause GPU has shielding effect to memory access latency, is limited on the influence of GPU application program capacities.So, in order to ensure CPU application journeys Sequence obtains the Fairshare of caching, static division can be carried out to the LLC resources that CPU and GPU shares and the dynamic of self adaptation is divided. For CPU retains constant percentage caching during static partition, remaining leaves GPU for.Self adaptation division methods dynamically are by online Analysis CPU and GPU application programs operationally carry out the dynamic adjustment of partition size to the susceptibility of cache size.Caching point Contention of the CPU and GPU application programs to shared buffer memory can effectively be alleviated in area, reduce CPU application programs and access what off-chip main was deposited Expense, so as to reach the purpose of the power consumption of reduction system.

Claims

1. a kind of heterogeneous polynuclear cache sharing division methods towards low-power consumption, it is characterised in that comprise the following steps：

Step 1, realizes that CPU request asks to distinguish with GPU, tracks access request, and different core is distinguished using flag bit TagID Access request；

Step 2, the flag bit TagID according to different core access request determines the L2Cache addresses that access request is mapped to；

Step 3, realizes static division, comprises the following steps：

Step 3.1, equivalent is divided equally

In the request buffering queue L1RequestToL2Cache of L2Cache, TagID message flags position is judged, if should The L1 from GPU cores grades of Cache of Seeking Truth, odd address is mapped to by the access request, if the request is from core cpu L1 grades of Cache, even address is mapped to by the access request；

Step 3.2, optimal dividing on the basis of dividing equally, is altered in steps the buffer address ratio for distributing to CPU and GPU applications, Instruction number (the Instructions Per Cycle performed in statistics shared buffer memory CPU and the respective unit period of GPU program Abbreviation IPC), find best performance division proportion least in power-consuming；

Step 4, realizes that dynamic self-adapting is divided

Step 2, the CPU and GPU last level cache accountings in 3 are ready-portioned before application program operation, in operation will not Feature according to application program enters Mobile state adjustment.Dynamic divides the spy for operationally collecting CPU core and GPU core access requests Levy, realize that self adaptation is dynamically divided.

2. as claimed in claim 1 towards the heterogeneous polynuclear cache sharing division methods of low-power consumption, it is characterised in that step 4 Specifically include：

Step 4.1, monitors access request, and the memory access behavior information of CPU application programs and GPU application programs is obtained respectively.Respectively Count the IPC values of the IPC and GPU application programs of CPU application programs；

Step 4.2, the IPC indexs according to GPU calculate the performance gain σ of application program, and Cache row (Cache way) is divided To the application that performance gain is maximum；

Step 4.3, periodically performs step 4.1,4.2, the memory access information according to current period GPU, the IPC of calculating GPU applications Yield value, arranges distribution Cache to corresponding application program when next cycle starts.

3. as claimed in claim 1 towards the heterogeneous polynuclear cache sharing division methods of low-power consumption, it is characterised in that step 4.2 to specifically include specific method as follows：If, Threshold_lowIt is the floor value of threshold value evaluation and test, Threshold_highFor threshold value is commented The upper dividing value surveyed,

If the 1, the IPC yield values σ of the GPU is less than threshold value Threshold_low, GPU application programs are caching non-sensitive type, will Cache row are allocated to CPU application programs.

If the 2, the IPC yield values σ of the GPU is more than or equal to threshold value Threshold_low, less than or equal to threshold value Threshold_high, GPU application programs are Cache Sensitive type, and distributing to GPU application programs can bring bigger income, and Cache row are allocated into GPU Application program.

If the 3, the IPC yield values σ of the GPU is more than or equal to threshold value Threshold_high, the change of GPU application programs stage of development Change, recover caching and be divided to original state.