CN102135793A - Mixed dividing method of low-power-consumption multi-core shared cache
- Publication number: CN102135793A
- Authority: CN (China)
- Legal status: Granted
Abstract
The invention relates to a hybrid partitioning method for a low-power multi-core shared cache, belonging to the field of computer architecture. As the number of cores integrated on a chip increases, low-power design becomes a necessity; however, most traditional cache partitioning methods consider throughput or fairness and ignore power consumption. The invention provides a novel low-power partitioning method. Exploiting the locality principle of programs, the partitioning method merges threads whose access behaviour in the second-level cache differs strongly into one partitioning unit and partitions the cache by columns, so that fewer cache columns are used when running the same application and the remaining columns are closed; in this way power consumption is reduced while the performance requirement is still met.
Description
Technical field
The invention belongs to the field of computer architecture, and specifically relates to a low-power-oriented hybrid partitioning method for a multi-core shared Cache.
Background technology
As the number of cores integrated on a chip increases, the surface temperature of the processor keeps rising, growing exponentially. High power consumption not only means large energy consumption; heat accumulation and ever-increasing power density also threaten the stability of the system. Excessive power consumption limits further improvement of processor performance, and further raising the frequency or enlarging the cache capacity only drives the power consumption of the processor higher still, leading into a vicious cycle. Facing this power pressure, low-power design has become a key problem in future microprocessor design.
However, current research on multi-core shared second-level cache partitioning strategies is almost entirely oriented towards throughput or fairness, and research oriented towards low power consumption is rare. The few partitioning methods aimed at low power are realized by sacrificing a certain amount of performance and do not substantially improve the partitioning method itself.
Summary of the invention
The invention exploits the locality principle of program execution and combines the private and shared resource-allocation modes to implement Cache partitioning, so that as few Cache columns as possible are used when running a given application and the remaining columns are closed, thereby reducing system power consumption. To guarantee system performance, the invention adopts the system IPC as the performance metric, where IPC (Instructions Per Cycle) is the number of instructions executed per clock cycle.
The technical scheme provided by the invention is as follows:
The dynamic partitioning strategy of the invention consists of three main steps: initialization, partitioning, and rollback. The algorithm has four basic parameters: the difference-degree threshold R_share, the rollback threshold IPC_[initial], the performance loss threshold PLT, and the time slice t. R_share typically takes a value between 50% and 200%, IPC_[initial] is obtained by testing, PLT lies between 0 and 3%, and t lies between 100,000 and 5,000,000 clock cycles. The main flow of the algorithm is as follows (an illustrative sketch of the overall control flow is also given after step (4) below):
1. A low-power-oriented multi-core shared Cache hybrid partitioning method, characterized by comprising the following steps:
(1) Initialization:
1.1) Treat each thread as an independent partitioning unit and allocate one column of the second-level cache to each partitioning unit;
1.2) After a time slice t, determine whether the program has finished running; if so, jump to step (4), otherwise continue with step 1.3). The time slice t means that the running time of the application is divided into equal segments, each of which is called a time slice t;
1.3) Compute the system performance IPC_[par] according to formula ①, where IPC is the number of instructions executed per clock cycle (Instructions Per Cycle). In formula ①, n denotes the total number of threads contained in the application, and IPC_i (1 ≤ i ≤ n) denotes the number of instructions executed per clock by thread i, characterizing the performance of thread i; it is computed according to formula ②. The coefficients α and β in formula ② are given by formula ③ below, and θ_i(x) is the miss rate of thread i obtained from its stack-distance profile, i.e. the number of misses of thread i within the time slice t when thread i is allocated a second-level cache of size x;
α = CPI_[base] + E_1 + M_1 × E_2, β = E_3  ③
In formula ③, CPI_[base] is the average number of execution cycles per instruction, E_1 is the hit cost of accessing the level-1 Cache, M_1 is the miss count of the level-1 Cache, E_2 is the cost of hitting the second-level cache when the level-1 Cache misses, and E_3 is the cost of accessing main memory when the second-level cache misses. The values of M_1, E_1, E_2, E_3 and CPI_[base] are determined by the computer architecture in use;
1.4) If at this moment the system performance IPC_[par] ≥ (1 − PLT)·IPC_[initial], close the remaining Cache columns and execute step (3); otherwise execute step (2). Here the rollback threshold IPC_[initial] denotes the system performance under a pure column partition of the Cache, and PLT denotes the performance loss threshold;
(2) Partition stage: let all threads T that have not been merged form the thread set TS, TS = {T_i | 1 ≤ i ≤ n};
2.1) If |TS| ≥ 2, execute step 2.2); otherwise jump to step 2.5);
2.2) Compute, according to formula ④, the difference degree R_diff with which any two threads in TS access the Cache columns:
R_diff = |P_i − P_j|, (1 ≤ i ≤ n, 1 ≤ j ≤ n and i ≠ j)  ④
where P is the access bias of a thread, computed by formula ⑤.
The invention divides each Cache column into an upper and a lower part: U_up denotes the number of Cache blocks in the upper half of a column that the thread has accessed, U_down denotes the number of blocks accessed in the lower half, and U_used denotes the total number of Cache blocks accessed in that column;
2.3) Among all thread pairs whose difference degree satisfies R_diff ≥ R_share, merge the two threads <T_i, T_j> with the maximum R_diff into one partitioning unit, and set TS = TS − {T_i, T_j}; here R_share denotes the difference-degree threshold;
2.4) If |TS| ≥ 2, execute step 2.3) again; otherwise execute step 2.5);
2.5) Determine whether there are unallocated Cache columns; if so, execute step 2.6), otherwise execute step (3);
2.6) Compute the IPC increment Plus_IPC_k(x) of every partitioning unit; Plus_IPC_k(x) is the IPC increment of partitioning unit k (1 ≤ k ≤ n) when the number of Cache columns allocated to it is increased from x (x ∈ [0, C_L2)) to x + 1, where C_L2 is the total number of second-level cache columns. Plus_IPC_k(x) is computed according to formula ⑥;
2.7) Allocate one Cache column to the partitioning unit with the maximum Plus_IPC_k(x);
2.8) Compute the partitioned system performance IPC_[par] according to formula ①;
2.9) If at this moment IPC_[par] ≥ (1 − PLT)·IPC_[initial], the partition stage ends; close the remaining Cache columns and go to step (3); otherwise execute step 2.5);
(3) Rollback stage: after the program has run for another time slice t,
3.1) Determine whether the program has finished running; if so, go to step (4), otherwise execute step 3.2);
3.2) Compute the partitioned system performance IPC_[par] according to formula ①;
3.3) If at this moment IPC_[par] ≥ (1 − PLT)·IPC_[initial], go back to step (3); otherwise restore the initial partition and execute step (2);
(4) Output the run result and evaluate the power saved by the system with a power-consumption evaluation tool.
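For readability, the following Python sketch outlines one possible implementation of the control flow of steps (1)–(4). It is a minimal illustration, not the patented formulas: the callbacks measure_ipc, merge_pairs and argmax_plus_ipc stand in for formulas ①, ④/⑤ and ⑥, which are given only in the drawings, and all names introduced here are assumptions.

```python
# Minimal sketch of the three-phase loop of steps (1)-(4).  All helper callbacks and
# names are assumptions introduced for illustration; they stand in for formulas given
# only in the drawings of the patent.
from typing import Callable, List, Tuple

Units = List[List[int]]          # each partitioning unit is a list of thread ids
Cols = List[int]                 # number of L2 columns held by each unit

def partition_loop(
    n_threads: int,
    c_l2: int,                                     # total number of L2 columns
    r_share: float,                                # difference-degree threshold
    ipc_initial: float,                            # rollback threshold IPC_[initial]
    plt_threshold: float,                          # performance loss threshold PLT
    finished: Callable[[], bool],                  # True once the program has completed
    measure_ipc: Callable[[Units, Cols], float],   # system IPC, formula (1)
    merge_pairs: Callable[[Units, Cols, float], Tuple[Units, Cols]],   # formulas (4)/(5)
    argmax_plus_ipc: Callable[[Units, Cols], int], # unit with largest gain, formula (6)
) -> Tuple[Units, Cols]:
    # (1) Initialization: every thread is its own unit and holds one L2 column.
    units: Units = [[t] for t in range(n_threads)]
    cols: Cols = [1] * n_threads

    while not finished():                          # one iteration per time slice t
        if measure_ipc(units, cols) >= (1 - plt_threshold) * ipc_initial:
            continue                               # (3) rollback: performance still acceptable

        # Performance fell below the threshold: restore the initial partition, re-partition.
        units = [[t] for t in range(n_threads)]
        cols = [1] * n_threads

        # (2) Partition stage: merge thread pairs whose access difference exceeds R_share,
        # then hand out columns greedily to the unit with the largest Plus_IPC gain.
        units, cols = merge_pairs(units, cols, r_share)
        while sum(cols) < c_l2:
            k = argmax_plus_ipc(units, cols)
            cols[k] += 1
            if measure_ipc(units, cols) >= (1 - plt_threshold) * ipc_initial:
                break
        # Columns beyond sum(cols) stay powered off to save energy.

    return units, cols                             # (4) final partition; unused columns remain closed
```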
In implementing the above partition strategy, the invention adopts IPC (Instructions Per Cycle, the number of instructions executed per clock) as the measure of system performance, characterizing system performance through the combination of the miss rate and the ideal IPC, as in formula ①. In summary, the essence of the partitioning algorithm is to transform the low-power-oriented multi-core Cache partitioning problem into solving formula ⑧ while satisfying formula ⑦.
Formula ⑦ is the constraint function: it states that the system performance after partitioning must be at least the system performance of the basic column partition, with a loss kept within the threshold. Formula ⑧ is the objective function, in which E_col denotes the average power consumed by a single Cache column, C_L2 denotes the total number of second-level cache columns, n denotes the number of Cache columns allocated at the initial partition, Units denotes the set of all partitioning units, c_k denotes the number of Cache columns allocated to partitioning unit k, the sum of the c_k over all partitioning units denotes the total number of Cache columns allocated in the partition stage, and E_consum then denotes the total power saved by the program run. It follows that the less second-level cache the system uses, the less power the system consumes; therefore the program closes as many second-level cache columns as possible during its run, so as to reduce power consumption to the greatest extent.
To dynamically collect the miss rate of the application under different Cache capacities, each core is augmented with a miss-rate monitor (MRM). In this way, each core can simulate the access behaviour it would exhibit if it used the second-level cache exclusively. From the MRM records, the miss behaviour of Cache columns at different stack distances can be obtained, and from these records the system can compute the miss rate θ(x) for a partitioning unit allocated a second-level cache of any given size.
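As an illustration of how a per-core monitor can yield θ(x) for every candidate column count from one pass over the access stream, the sketch below performs a generic LRU stack-distance simulation. It is an assumption-level stand-in for the MRM hardware, not its actual design; block granularity and full associativity are simplifications.

```python
# Generic LRU stack-distance simulation used only to illustrate how theta(x) can be
# derived for every candidate column count; it is not the hardware MRM of the patent.
from collections import OrderedDict
from typing import Iterable, List

def stack_distance_misses(trace: Iterable[int], max_columns: int, blocks_per_column: int) -> List[int]:
    """Return theta[x]: misses the thread would incur with x L2 columns (fully associative LRU)."""
    lru: "OrderedDict[int, None]" = OrderedDict()      # block address -> None, most recent last
    max_capacity = max_columns * blocks_per_column
    distance_hist = [0] * (max_capacity + 2)           # last bucket collects larger distances
    cold_misses = 0

    for block in trace:
        if block in lru:
            # Stack distance = number of distinct blocks touched since the previous access.
            depth = len(lru) - list(lru).index(block)   # O(n) scan is fine for a sketch
            distance_hist[min(depth, max_capacity + 1)] += 1
            lru.move_to_end(block)
        else:
            cold_misses += 1
            lru[block] = None

    theta = []
    for x in range(max_columns + 1):
        capacity = x * blocks_per_column
        # Accesses whose stack distance exceeds the capacity would miss with x columns.
        theta.append(cold_misses + sum(distance_hist[capacity + 1:]))
    return theta
```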
To record the access difference of each Cache column, the invention also adds two access counters, ACU and ACD, to each core; they record the number of accesses to the upper half and to the lower half of a Cache column, respectively, from which the access difference degree R_diff of the Cache columns is computed.
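A compact sketch of how the ACU/ACD counts could be turned into the access bias P and the difference degree R_diff follows. Formula ⑤ is given only in the drawings, so the form P = (U_up − U_down) / U_used used here is an assumption; formula ④ is taken directly from the text.

```python
# Assumed computation of the access bias P (formula (5) is only in the drawings, so
# P = (U_up - U_down) / U_used is an assumption) and of R_diff from formula (4).
def access_bias(u_up: int, u_down: int) -> float:
    """Signed imbalance of accesses between the upper and lower half of a Cache column."""
    u_used = u_up + u_down          # assumes U_used = U_up + U_down
    return (u_up - u_down) / u_used if u_used else 0.0

def difference_degree(p_i: float, p_j: float) -> float:
    """Difference degree R_diff of formula (4): |P_i - P_j|."""
    return abs(p_i - p_j)
```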
Description of drawings
Fig. 1 is a flow chart of the method of the invention;
Fig. 2 is a flow chart of the hybrid partitioning process of the invention;
Embodiment:
The partitioning method of the invention is described in detail below, taking a chip multi-core processor with a two-level Cache structure as an example.
The configuration is as in Table 1:
Table 1
The four parameters take the following values on this processor: difference-degree threshold R_share = 100%, rollback threshold IPC_[initial] obtained by testing, performance loss threshold PLT = 0, and time slice t = 100,000. CACTI is adopted as the power-consumption evaluation tool, and 4 threads are run. CACTI is one of the power-evaluation tools commonly used in this field. The concrete steps are as follows:
(1) Initialization:
1.1) Treat each thread of the application as an independent partitioning unit and allocate one column of the second-level cache to each partitioning unit;
1.2) After one time slice, determine whether the program has finished running; if so, jump to step (4), otherwise continue with step 1.3);
1.3) According to the current computer architecture, take M_1 = 0.1, E_1 = 3, E_2 = 6, E_3 = 158, CPI_[base] = 0.5 and n = 4, so that α = CPI_[base] + E_1 + M_1 × E_2 = 0.5 + 3 + 0.1 × 6 = 4.1 and β = E_3 = 158. θ_i(x) is determined by the running program and is obtained from the stack-distance profile recorded by the MRM. Compute the system performance IPC_[par];
1.4) If at this moment the system performance IPC_[par] ≥ IPC_[initial], close the remaining Cache columns and execute step (3); otherwise execute step (2);
(2) Partition stage: let all threads T that have not been merged form the thread set TS, TS = {T_i | 1 ≤ i ≤ 4};
2.1) If |TS| ≥ 2, execute step 2.2); otherwise jump to step 2.5);
2.2) Read U_up and U_down of each thread from ACU and ACD and compute the access bias P. Suppose the values computed at this moment are P_1 = 100%, P_2 = −80%, P_3 = 90% and P_4 = −85%. Compute the difference degree R_diff with which any two threads in TS access the Cache columns according to:
R_diff = |P_i − P_j|, (1 ≤ i ≤ 4, 1 ≤ j ≤ 4 and i ≠ j)
The computed R_diff values are listed in the following table (a short sketch reproducing this computation is given after the table):
| R_diff | P_1 | P_2 | P_3 | P_4 |
|---|---|---|---|---|
| P_1 | — | 180% | 10% | 185% |
| P_2 | 180% | — | 170% | 5% |
| P_3 | 10% | 170% | — | 175% |
| P_4 | 185% | 5% | 175% | — |
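The table can be checked directly from the four bias values with the difference-degree definition R_diff = |P_i − P_j| of formula ④; the short sketch below reproduces it.

```python
# Check of the table above from the four measured bias values of step 2.2).
p = {1: 1.00, 2: -0.80, 3: 0.90, 4: -0.85}     # P_1 .. P_4

for i in p:
    for j in p:
        if i < j:
            print(f"R_diff(T{i}, T{j}) = {abs(p[i] - p[j]):.0%}")
# Prints 180%, 10%, 185%, 170%, 5%, 175%, matching the table;
# the maximum is reached by the pair <T1, T4> with 185%.
```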
2.3) At this moment the thread pairs with R_diff ≥ 100% are <T_1, T_2>, <T_1, T_4>, <T_2, T_3> and <T_3, T_4>. The two threads <T_1, T_4> with the maximum R_diff are merged into one partitioning unit, and TS = TS − {T_1, T_4};
2.4) At this moment |TS| ≥ 2, and the pair <T_2, T_3> has difference degree R_diff = 170% ≥ 100%, satisfying the merge condition, so threads <T_2, T_3> are merged as well; now |TS| < 2, so execute step 2.5);
2.5) Determine whether there are unallocated Cache columns; if so, execute step 2.6), otherwise execute step (3);
2.6) After the thread merging, the system now has two partitioning units: partitioning unit 1 <T_1, T_4> and partitioning unit 2 <T_2, T_3>. Compute the IPC increments Plus_IPC_k(x), k = 1, 2, of these two partitioning units according to formula ⑥ (an illustrative sketch of this greedy allocation is given after step (4) below);
2.7) If at this moment Plus_IPC_1(x) ≥ Plus_IPC_2(x), allocate one Cache column to partitioning unit 1, i.e. the unit <T_1, T_4>; otherwise allocate one Cache column to partitioning unit 2, i.e. the unit <T_2, T_3>;
2.8) Compute the partitioned system performance IPC_[par];
2.9) If at this moment IPC_[par] ≥ IPC_[initial], the partition stage ends; close the remaining Cache columns and go to step (3); otherwise execute step 2.5);
(3) Rollback stage: after the program has run for another time slice,
3.1) Determine whether the program has finished running; if so, go to step (4), otherwise execute step 3.2);
3.2) Compute the partitioned system performance IPC_[par];
3.3) If at this moment IPC_[par] ≥ IPC_[initial], this rollback check ends and the method goes back to step (3); otherwise restore the initial partition and execute step (2);
(4) Output the run result and evaluate the power saved by the system with the power-consumption evaluation tool CACTI.
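To illustrate the greedy column allocation of steps 2.5) to 2.9), the sketch below hands out the still-closed second-level cache columns one at a time to whichever partitioning unit promises the largest IPC gain, stopping once the performance constraint is met. The Plus_IPC and system-IPC estimators are passed in as callables because formulas ① and ⑥ appear only in the drawings; every name here is an illustrative assumption.

```python
# Greedy column allocation of steps 2.5)-2.9).  Every name here is an illustrative
# assumption: the Plus_IPC and system-IPC estimators are supplied as callables because
# formulas (1) and (6) are given only in the drawings.
from typing import Callable, Dict

def allocate_columns(
    cols: Dict[int, int],                            # current columns per partitioning unit
    c_l2: int,                                       # total number of L2 columns
    plus_ipc: Callable[[int, int], float],           # plus_ipc(k, x): gain of unit k from x to x+1 columns
    system_ipc: Callable[[Dict[int, int]], float],   # partitioned system performance IPC_[par]
    ipc_initial: float,
    plt_threshold: float,
) -> Dict[int, int]:
    """Hand out the still-closed columns one at a time; unallocated columns stay powered off."""
    while sum(cols.values()) < c_l2:                 # 2.5): columns left to allocate?
        # 2.6)-2.7): give one column to the unit with the largest marginal IPC gain.
        k = max(cols, key=lambda u: plus_ipc(u, cols[u]))
        cols[k] += 1
        # 2.8)-2.9): stop as soon as the performance constraint is met again.
        if system_ipc(cols) >= (1 - plt_threshold) * ipc_initial:
            break
    return cols
```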
Claims (1)
1. A low-power-oriented multi-core shared Cache hybrid partitioning method, characterized by comprising the following steps:
(1) Initialization:
1.1) Treat each thread as an independent partitioning unit and allocate one column of the second-level cache to each partitioning unit;
1.2) After a time slice t, determine whether the program has finished running; if so, jump to step (4), otherwise continue with step 1.3). The time slice t means that the running time of the application is divided into equal segments, each of which is called a time slice t;
1.3) Compute the system performance IPC_[par] according to formula ①, where IPC is the number of instructions executed per clock cycle (Instructions Per Cycle). In formula ①, n denotes the total number of threads contained in the application, and IPC_i (1 ≤ i ≤ n) denotes the number of instructions executed per clock by thread i, characterizing the performance of thread i; it is computed according to formula ②. The coefficients α and β in formula ② are given by formula ③ below, and θ_i(x) is the miss rate of thread i obtained from its stack-distance profile, i.e. the number of misses of thread i within the time slice t when thread i is allocated a second-level cache of size x;
α = CPI_[base] + E_1 + M_1 × E_2, β = E_3  ③
In formula ③, CPI_[base] is the average number of execution cycles per instruction, E_1 is the hit cost of accessing the level-1 Cache, M_1 is the miss count of the level-1 Cache, E_2 is the cost of hitting the second-level cache when the level-1 Cache misses, and E_3 is the cost of accessing main memory when the second-level cache misses. The values of M_1, E_1, E_2, E_3 and CPI_[base] are determined by the computer architecture in use;
1.4) If at this moment the system performance IPC_[par] ≥ (1 − PLT)·IPC_[initial], close the remaining Cache columns and execute step (3); otherwise execute step (2). Here the rollback threshold IPC_[initial] denotes the system performance under a pure column partition of the Cache, and PLT denotes the performance loss threshold;
(2) Partition stage: let all threads T that have not been merged form the thread set TS, TS = {T_i | 1 ≤ i ≤ n};
2.1) If |TS| ≥ 2, execute step 2.2); otherwise jump to step 2.5);
2.2) Compute, according to formula ④, the difference degree R_diff with which any two threads in TS access the Cache columns:
R_diff = |P_i − P_j|, (1 ≤ i ≤ n, 1 ≤ j ≤ n and i ≠ j)  ④
where P is the access bias of a thread, computed by formula ⑤.
Each Cache column is divided into an upper and a lower part: U_up denotes the number of Cache blocks in the upper half of a column that the thread has accessed, U_down denotes the number of blocks accessed in the lower half, and U_used denotes the total number of Cache blocks accessed in that column;
2.3) Among all thread pairs whose difference degree satisfies R_diff ≥ R_share, merge the two threads <T_i, T_j> with the maximum R_diff into one partitioning unit, and set TS = TS − {T_i, T_j}; here R_share denotes the difference-degree threshold;
2.4) If |TS| ≥ 2, execute step 2.3) again; otherwise execute step 2.5);
2.5) Determine whether there are unallocated Cache columns; if so, execute step 2.6), otherwise execute step (3);
2.6) Compute the IPC increment Plus_IPC_k(x) of every partitioning unit; Plus_IPC_k(x) is the IPC increment of partitioning unit k (1 ≤ k ≤ n) when the number of Cache columns allocated to it is increased from x (x ∈ [0, C_L2)) to x + 1, where C_L2 is the total number of second-level cache columns. Plus_IPC_k(x) is computed according to formula ⑥;
2.7) Allocate one Cache column to the partitioning unit with the maximum Plus_IPC_k(x);
2.8) Compute the partitioned system performance IPC_[par] according to formula ①;
2.9) If at this moment IPC_[par] ≥ (1 − PLT)·IPC_[initial], the partition stage ends; close the remaining Cache columns and go to step (3); otherwise execute step 2.5);
(3) Rollback stage: after the program has run for another time slice t,
3.1) Determine whether the program has finished running; if so, go to step (4), otherwise execute step 3.2);
3.2) Compute the partitioned system performance IPC_[par] according to formula ①;
3.3) If at this moment IPC_[par] ≥ (1 − PLT)·IPC_[initial], go back to step (3); otherwise restore the initial partition and execute step (2);
(4) Output the run result and evaluate the power saved by the system with a power-consumption evaluation tool.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2011100767236A CN102135793B (en) | 2011-03-29 | 2011-03-29 | Mixed dividing method of low-power-consumption multi-core shared cache |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102135793A (en) | 2011-07-27 |
CN102135793B CN102135793B (en) | 2012-07-04 |
Family
ID=44295596
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2011100767236A CN102135793B (en), Expired - Fee Related | Mixed dividing method of low-power-consumption multi-core shared cache | 2011-03-29 | 2011-03-29 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102135793B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060018179A1 (en) * | 2002-11-18 | 2006-01-26 | Paul Marchal | Cost-aware design-time/run-time memory management methods and apparatus |
CN101739299A (en) * | 2009-12-18 | 2010-06-16 | 北京工业大学 | Method for dynamically and fairly partitioning shared cache based on chip multiprocessor |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TWI476583B (en) * | 2011-09-23 | 2015-03-11 | Nat Univ Tsing Hua | Power aware computer simulation system and method thereof |
CN103077128A (en) * | 2012-12-29 | 2013-05-01 | 华中科技大学 | Method for dynamically partitioning shared cache in multi-core environment |
CN103077128B (en) * | 2012-12-29 | 2015-09-23 | 华中科技大学 | Shared buffer memory method for dynamically partitioning under a kind of multi-core environment |
CN103150266A (en) * | 2013-02-20 | 2013-06-12 | 北京工业大学 | Improved multi-core shared cache replacing method |
CN103150266B (en) * | 2013-02-20 | 2015-10-28 | 北京工业大学 | A kind of multinuclear cache sharing replacement method of improvement |
CN106200868A (en) * | 2016-06-29 | 2016-12-07 | 联想(北京)有限公司 | Shared variable acquisition methods, device and polycaryon processor in polycaryon processor |
CN106200868B (en) * | 2016-06-29 | 2020-07-24 | 联想(北京)有限公司 | Method and device for acquiring shared variables in multi-core processor and multi-core processor |
Legal Events

Date | Code | Title | Description |
---|---|---|---|
| C06 | Publication | |
| PB01 | Publication | |
| C10 | Entry into substantive examination | |
| SE01 | Entry into force of request for substantive examination | |
| C14 | Grant of patent or utility model | |
| GR01 | Patent grant | |
| CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20120704; Termination date: 20190329 |