CN102135793A - Mixed dividing method of low-power-consumption multi-core shared cache
- Publication number: CN102135793A
- Authority: CN (China)
- Legal status: Granted
Abstract
The invention relates to a hybrid partitioning method for a low-power multi-core shared cache, belonging to the field of computer architecture. As the number of cores integrated on a chip increases, low-power design becomes a necessity; however, most traditional cache partitioning methods consider throughput or fairness and ignore power consumption. The invention provides a novel low-power partitioning method. Exploiting the locality principle of programs, the partitioning method merges threads whose access behaviour in the second-level cache differs strongly into one partitioning unit and partitions the cache by columns, so that fewer cache columns are used when running the same application and the remaining columns are closed; in this way power consumption is reduced while the performance requirement is still met.
Description
Technical field
The invention belongs to the field of computer architecture, and specifically relates to a low-power-oriented hybrid partitioning method for a multi-core shared Cache.
Background technology
As the number of cores integrated on a chip increases, the surface temperature of the processor keeps rising, growing exponentially. High power consumption not only means large energy consumption; heat accumulation and ever-increasing power density also threaten the stability of the system. Excessive power consumption limits further improvement of processor performance, and further raising the frequency or enlarging the cache capacity only drives the power consumption of the processor higher still, leading into a vicious cycle. Facing this power pressure, low-power design has become a key problem in future microprocessor design.
However, current research on multi-core shared second-level cache partitioning strategies is almost entirely oriented towards throughput or fairness, and research oriented towards low power consumption is rare. The few partitioning methods aimed at low power are realized by sacrificing a certain amount of performance and do not substantially improve the partitioning method itself.
Summary of the invention
The invention exploits the locality principle of program execution and combines the private and shared resource-allocation modes to implement Cache partitioning, so that as few Cache columns as possible are used when running a given application and the remaining columns are closed, thereby reducing system power consumption. To guarantee system performance, the invention adopts the system IPC as the performance metric, where IPC (Instructions Per Cycle) is the number of instructions executed per clock cycle.
The technical scheme provided by the invention is as follows:
The dynamic partitioning strategy of the invention consists of three main steps: initialization, partitioning, and rollback. The algorithm has four basic parameters: the difference-degree threshold R_share, the rollback threshold IPC_[initial], the performance loss threshold PLT, and the time slice t. R_share typically takes a value between 50% and 200%, IPC_[initial] is obtained by testing, PLT lies between 0 and 3%, and t lies between 100,000 and 5,000,000 clock cycles. The main flow of the algorithm is as follows (an illustrative sketch of the overall control flow is also given after step (4) below):
1. A low-power-oriented multi-core shared Cache hybrid partitioning method, characterized by comprising the following steps:
(1) Initialization:
1.1) Treat each thread as an independent partitioning unit and allocate one column of the second-level cache to each partitioning unit;
1.2) After a time slice t, determine whether the program has finished running; if so, jump to step (4), otherwise continue with step 1.3). The time slice t means that the running time of the application is divided into equal segments, each of which is called a time slice t;
1.3) Compute the system performance IPC_[par] according to formula ①, where IPC is the number of instructions executed per clock cycle (Instructions Per Cycle). In formula ①, n denotes the total number of threads contained in the application, and IPC_i (1 ≤ i ≤ n) denotes the number of instructions executed per clock by thread i, characterizing the performance of thread i; it is computed according to formula ②. The coefficients α and β in formula ② are given by formula ③ below, and θ_i(x) is the miss rate of thread i obtained from its stack-distance profile, i.e. the number of misses of thread i within the time slice t when thread i is allocated a second-level cache of size x;
α = CPI_[base] + E_1 + M_1 × E_2, β = E_3  ③
In formula ③, CPI_[base] is the average number of execution cycles per instruction, E_1 is the hit cost of accessing the level-1 Cache, M_1 is the miss count of the level-1 Cache, E_2 is the cost of hitting the second-level cache when the level-1 Cache misses, and E_3 is the cost of accessing main memory when the second-level cache misses. The values of M_1, E_1, E_2, E_3 and CPI_[base] are determined by the computer architecture in use;
1.4) If at this moment the system performance IPC_[par] ≥ (1 − PLT)·IPC_[initial], close the remaining Cache columns and execute step (3); otherwise execute step (2). Here the rollback threshold IPC_[initial] denotes the system performance under a pure column partition of the Cache, and PLT denotes the performance loss threshold;
(2) Partition stage: let all threads T that have not been merged form the thread set TS, TS = {T_i | 1 ≤ i ≤ n};
2.1) If |TS| ≥ 2, execute step 2.2); otherwise jump to step 2.5);
2.2) Compute, according to formula ④, the difference degree R_diff with which any two threads in TS access the Cache columns:
R_diff = |P_i − P_j|, (1 ≤ i ≤ n, 1 ≤ j ≤ n and i ≠ j)  ④
where P is the access bias of a thread, computed by formula ⑤.
The invention divides each Cache column into an upper and a lower part: U_up denotes the number of Cache blocks in the upper half of a column that the thread has accessed, U_down denotes the number of blocks accessed in the lower half, and U_used denotes the total number of Cache blocks accessed in that column;
2.3) Among all thread pairs whose difference degree satisfies R_diff ≥ R_share, merge the two threads <T_i, T_j> with the maximum R_diff into one partitioning unit, and set TS = TS − {T_i, T_j}; here R_share denotes the difference-degree threshold;
2.4) If |TS| ≥ 2, execute step 2.3) again; otherwise execute step 2.5);
2.5) Determine whether there are unallocated Cache columns; if so, execute step 2.6), otherwise execute step (3);
2.6) Compute the IPC increment Plus_IPC_k(x) of every partitioning unit; Plus_IPC_k(x) is the IPC increment of partitioning unit k (1 ≤ k ≤ n) when the number of Cache columns allocated to it is increased from x (x ∈ [0, C_L2)) to x + 1, where C_L2 is the total number of second-level cache columns. Plus_IPC_k(x) is computed according to formula ⑥;
2.7) Allocate one Cache column to the partitioning unit with the maximum Plus_IPC_k(x);
2.8) Compute the partitioned system performance IPC_[par] according to formula ①;
2.9) If at this moment IPC_[par] ≥ (1 − PLT)·IPC_[initial], the partition stage ends; close the remaining Cache columns and go to step (3); otherwise execute step 2.5);
(3) Rollback stage: after the program has run for another time slice t,
3.1) Determine whether the program has finished running; if so, go to step (4), otherwise execute step 3.2);
3.2) Compute the partitioned system performance IPC_[par] according to formula ①;
3.3) If at this moment IPC_[par] ≥ (1 − PLT)·IPC_[initial], go back to step (3); otherwise restore the initial partition and execute step (2);
(4) Output the run result and evaluate the power saved by the system with a power-consumption evaluation tool.
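For readability, the following Python sketch outlines one possible implementation of the control flow of steps (1)–(4). It is a minimal illustration, not the patented formulas: the callbacks measure_ipc, merge_pairs and argmax_plus_ipc stand in for formulas ①, ④/⑤ and ⑥, which are given only in the drawings, and all names introduced here are assumptions.

```python
# Minimal sketch of the three-phase loop of steps (1)-(4).  All helper callbacks and
# names are assumptions introduced for illustration; they stand in for formulas given
# only in the drawings of the patent.
from typing import Callable, List, Tuple

Units = List[List[int]]          # each partitioning unit is a list of thread ids
Cols = List[int]                 # number of L2 columns held by each unit

def partition_loop(
    n_threads: int,
    c_l2: int,                                     # total number of L2 columns
    r_share: float,                                # difference-degree threshold
    ipc_initial: float,                            # rollback threshold IPC_[initial]
    plt_threshold: float,                          # performance loss threshold PLT
    finished: Callable[[], bool],                  # True once the program has completed
    measure_ipc: Callable[[Units, Cols], float],   # system IPC, formula (1)
    merge_pairs: Callable[[Units, Cols, float], Tuple[Units, Cols]],   # formulas (4)/(5)
    argmax_plus_ipc: Callable[[Units, Cols], int], # unit with largest gain, formula (6)
) -> Tuple[Units, Cols]:
    # (1) Initialization: every thread is its own unit and holds one L2 column.
    units: Units = [[t] for t in range(n_threads)]
    cols: Cols = [1] * n_threads

    while not finished():                          # one iteration per time slice t
        if measure_ipc(units, cols) >= (1 - plt_threshold) * ipc_initial:
            continue                               # (3) rollback: performance still acceptable

        # Performance fell below the threshold: restore the initial partition, re-partition.
        units = [[t] for t in range(n_threads)]
        cols = [1] * n_threads

        # (2) Partition stage: merge thread pairs whose access difference exceeds R_share,
        # then hand out columns greedily to the unit with the largest Plus_IPC gain.
        units, cols = merge_pairs(units, cols, r_share)
        while sum(cols) < c_l2:
            k = argmax_plus_ipc(units, cols)
            cols[k] += 1
            if measure_ipc(units, cols) >= (1 - plt_threshold) * ipc_initial:
                break
        # Columns beyond sum(cols) stay powered off to save energy.

    return units, cols                             # (4) final partition; unused columns remain closed
```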
In implementing the above partition strategy, the invention adopts IPC (Instructions Per Cycle, the number of instructions executed per clock) as the measure of system performance, characterizing system performance through the combination of the miss rate and the ideal IPC, as in formula ①. In summary, the essence of the partitioning algorithm is to transform the low-power-oriented multi-core Cache partitioning problem into solving formula ⑧ while satisfying formula ⑦.
Formula ⑦ is the constraint function: it states that the system performance after partitioning must be at least the system performance of the basic column partition, with a loss kept within the threshold. Formula ⑧ is the objective function, in which E_col denotes the average power consumed by a single Cache column, C_L2 denotes the total number of second-level cache columns, n denotes the number of Cache columns allocated at the initial partition, Units denotes the set of all partitioning units, c_k denotes the number of Cache columns allocated to partitioning unit k, the sum of the c_k over all partitioning units denotes the total number of Cache columns allocated in the partition stage, and E_consum then denotes the total power saved by the program run. It follows that the less second-level cache the system uses, the less power the system consumes; therefore the program closes as many second-level cache columns as possible during its run, so as to reduce power consumption to the greatest extent.
To dynamically collect the miss rate of the application under different Cache capacities, each core is augmented with a miss-rate monitor (MRM). In this way, each core can simulate the access behaviour it would exhibit if it used the second-level cache exclusively. From the MRM records, the miss behaviour of Cache columns at different stack distances can be obtained, and from these records the system can compute the miss rate θ(x) for a partitioning unit allocated a second-level cache of any given size.
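As an illustration of how a per-core monitor can yield θ(x) for every candidate column count from one pass over the access stream, the sketch below performs a generic LRU stack-distance simulation. It is an assumption-level stand-in for the MRM hardware, not its actual design; block granularity and full associativity are simplifications.

```python
# Generic LRU stack-distance simulation used only to illustrate how theta(x) can be
# derived for every candidate column count; it is not the hardware MRM of the patent.
from collections import OrderedDict
from typing import Iterable, List

def stack_distance_misses(trace: Iterable[int], max_columns: int, blocks_per_column: int) -> List[int]:
    """Return theta[x]: misses the thread would incur with x L2 columns (fully associative LRU)."""
    lru: "OrderedDict[int, None]" = OrderedDict()      # block address -> None, most recent last
    max_capacity = max_columns * blocks_per_column
    distance_hist = [0] * (max_capacity + 2)           # last bucket collects larger distances
    cold_misses = 0

    for block in trace:
        if block in lru:
            # Stack distance = number of distinct blocks touched since the previous access.
            depth = len(lru) - list(lru).index(block)   # O(n) scan is fine for a sketch
            distance_hist[min(depth, max_capacity + 1)] += 1
            lru.move_to_end(block)
        else:
            cold_misses += 1
            lru[block] = None

    theta = []
    for x in range(max_columns + 1):
        capacity = x * blocks_per_column
        # Accesses whose stack distance exceeds the capacity would miss with x columns.
        theta.append(cold_misses + sum(distance_hist[capacity + 1:]))
    return theta
```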
To record the access difference of each Cache column, the invention also adds two access counters, ACU and ACD, to each core; they record the number of accesses to the upper half and to the lower half of a Cache column, respectively, from which the access difference degree R_diff of the Cache columns is computed.
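A compact sketch of how the ACU/ACD counts could be turned into the access bias P and the difference degree R_diff follows. Formula ⑤ is given only in the drawings, so the form P = (U_up − U_down) / U_used used here is an assumption; formula ④ is taken directly from the text.

```python
# Assumed computation of the access bias P (formula (5) is only in the drawings, so
# P = (U_up - U_down) / U_used is an assumption) and of R_diff from formula (4).
def access_bias(u_up: int, u_down: int) -> float:
    """Signed imbalance of accesses between the upper and lower half of a Cache column."""
    u_used = u_up + u_down          # assumes U_used = U_up + U_down
    return (u_up - u_down) / u_used if u_used else 0.0

def difference_degree(p_i: float, p_j: float) -> float:
    """Difference degree R_diff of formula (4): |P_i - P_j|."""
    return abs(p_i - p_j)
```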
Description of drawings
Fig. 1 is a flow chart of the method of the invention;
Fig. 2 is a flow chart of the hybrid partitioning process of the invention;
Embodiment:
The partitioning method of the invention is described in detail below, taking a chip multi-core processor with a two-level Cache structure as an example.
The configuration is as in Table 1:
Table 1
The four parameters take the following values on this processor: difference-degree threshold R_share = 100%, rollback threshold IPC_[initial] obtained by testing, performance loss threshold PLT = 0, and time slice t = 100,000. CACTI is adopted as the power-consumption evaluation tool, and 4 threads are run. CACTI is one of the power-evaluation tools commonly used in this field. The concrete steps are as follows:
(1) Initialization:
1.1) Treat each thread of the application as an independent partitioning unit and allocate one column of the second-level cache to each partitioning unit;
1.2) After one time slice, determine whether the program has finished running; if so, jump to step (4), otherwise continue with step 1.3);
1.3) According to the current computer architecture, take M_1 = 0.1, E_1 = 3, E_2 = 6, E_3 = 158, CPI_[base] = 0.5 and n = 4, so that α = CPI_[base] + E_1 + M_1 × E_2 = 0.5 + 3 + 0.1 × 6 = 4.1 and β = E_3 = 158. θ_i(x) is determined by the running program and is obtained from the stack-distance profile recorded by the MRM. Compute the system performance IPC_[par];
1.4) If at this moment the system performance IPC_[par] ≥ IPC_[initial], close the remaining Cache columns and execute step (3); otherwise execute step (2);
(2) Partition stage: let all threads T that have not been merged form the thread set TS, TS = {T_i | 1 ≤ i ≤ 4};
2.1) If |TS| ≥ 2, execute step 2.2); otherwise jump to step 2.5);
2.2) Read U_up and U_down of each thread from ACU and ACD and compute the access bias P. Suppose the values computed at this moment are P_1 = 100%, P_2 = −80%, P_3 = 90% and P_4 = −85%. Compute the difference degree R_diff with which any two threads in TS access the Cache columns according to:
R_diff = |P_i − P_j|, (1 ≤ i ≤ 4, 1 ≤ j ≤ 4 and i ≠ j)
The computed R_diff values are listed in the following table (a short sketch reproducing this computation is given after the table):
| R_diff | P_1 | P_2 | P_3 | P_4 |
|---|---|---|---|---|
| P_1 | — | 180% | 10% | 185% |
| P_2 | 180% | — | 170% | 5% |
| P_3 | 10% | 170% | — | 175% |
| P_4 | 185% | 5% | 175% | — |
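The table can be checked directly from the four bias values with the difference-degree definition R_diff = |P_i − P_j| of formula ④; the short sketch below reproduces it.

```python
# Check of the table above from the four measured bias values of step 2.2).
p = {1: 1.00, 2: -0.80, 3: 0.90, 4: -0.85}     # P_1 .. P_4

for i in p:
    for j in p:
        if i < j:
            print(f"R_diff(T{i}, T{j}) = {abs(p[i] - p[j]):.0%}")
# Prints 180%, 10%, 185%, 170%, 5%, 175%, matching the table;
# the maximum is reached by the pair <T1, T4> with 185%.
```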
2.3) At this moment the thread pairs with R_diff ≥ 100% are <T_1, T_2>, <T_1, T_4>, <T_2, T_3> and <T_3, T_4>. The two threads <T_1, T_4> with the maximum R_diff are merged into one partitioning unit, and TS = TS − {T_1, T_4};
2.4) At this moment |TS| ≥ 2, and the pair <T_2, T_3> has difference degree R_diff = 170% ≥ 100%, satisfying the merge condition, so threads <T_2, T_3> are merged as well; now |TS| < 2, so execute step 2.5);
2.5) Determine whether there are unallocated Cache columns; if so, execute step 2.6), otherwise execute step (3);
2.6) After the thread merging, the system now has two partitioning units: partitioning unit 1 <T_1, T_4> and partitioning unit 2 <T_2, T_3>. Compute the IPC increments Plus_IPC_k(x), k = 1, 2, of these two partitioning units according to formula ⑥ (an illustrative sketch of this greedy allocation is given after step (4) below);
2.7) If at this moment Plus_IPC_1(x) ≥ Plus_IPC_2(x), allocate one Cache column to partitioning unit 1, i.e. the unit <T_1, T_4>; otherwise allocate one Cache column to partitioning unit 2, i.e. the unit <T_2, T_3>;
2.8) Compute the partitioned system performance IPC_[par];
2.9) If at this moment IPC_[par] ≥ IPC_[initial], the partition stage ends; close the remaining Cache columns and go to step (3); otherwise execute step 2.5);
(3) Rollback stage: after the program has run for another time slice,
3.1) Determine whether the program has finished running; if so, go to step (4), otherwise execute step 3.2);
3.2) Compute the partitioned system performance IPC_[par];
3.3) If at this moment IPC_[par] ≥ IPC_[initial], this rollback check ends and the method goes back to step (3); otherwise restore the initial partition and execute step (2);
(4) Output the run result and evaluate the power saved by the system with the power-consumption evaluation tool CACTI.
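To illustrate the greedy column allocation of steps 2.5) to 2.9), the sketch below hands out the still-closed second-level cache columns one at a time to whichever partitioning unit promises the largest IPC gain, stopping once the performance constraint is met. The Plus_IPC and system-IPC estimators are passed in as callables because formulas ① and ⑥ appear only in the drawings; every name here is an illustrative assumption.

```python
# Greedy column allocation of steps 2.5)-2.9).  Every name here is an illustrative
# assumption: the Plus_IPC and system-IPC estimators are supplied as callables because
# formulas (1) and (6) are given only in the drawings.
from typing import Callable, Dict

def allocate_columns(
    cols: Dict[int, int],                            # current columns per partitioning unit
    c_l2: int,                                       # total number of L2 columns
    plus_ipc: Callable[[int, int], float],           # plus_ipc(k, x): gain of unit k from x to x+1 columns
    system_ipc: Callable[[Dict[int, int]], float],   # partitioned system performance IPC_[par]
    ipc_initial: float,
    plt_threshold: float,
) -> Dict[int, int]:
    """Hand out the still-closed columns one at a time; unallocated columns stay powered off."""
    while sum(cols.values()) < c_l2:                 # 2.5): columns left to allocate?
        # 2.6)-2.7): give one column to the unit with the largest marginal IPC gain.
        k = max(cols, key=lambda u: plus_ipc(u, cols[u]))
        cols[k] += 1
        # 2.8)-2.9): stop as soon as the performance constraint is met again.
        if system_ipc(cols) >= (1 - plt_threshold) * ipc_initial:
            break
    return cols
```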
Claims (1)
1. A low-power-oriented multi-core shared Cache hybrid partitioning method, characterized by comprising the following steps:
(1) Initialization:
1.1) Treat each thread as an independent partitioning unit and allocate one column of the second-level cache to each partitioning unit;
1.2) After a time slice t, determine whether the program has finished running; if so, jump to step (4), otherwise continue with step 1.3). The time slice t means that the running time of the application is divided into equal segments, each of which is called a time slice t;
1.3) Compute the system performance IPC_[par] according to formula ①, where IPC is the number of instructions executed per clock cycle (Instructions Per Cycle). In formula ①, n denotes the total number of threads contained in the application, and IPC_i (1 ≤ i ≤ n) denotes the number of instructions executed per clock by thread i, characterizing the performance of thread i; it is computed according to formula ②. The coefficients α and β in formula ② are given by formula ③ below, and θ_i(x) is the miss rate of thread i obtained from its stack-distance profile, i.e. the number of misses of thread i within the time slice t when thread i is allocated a second-level cache of size x;
α = CPI_[base] + E_1 + M_1 × E_2, β = E_3  ③
In formula ③, CPI_[base] is the average number of execution cycles per instruction, E_1 is the hit cost of accessing the level-1 Cache, M_1 is the miss count of the level-1 Cache, E_2 is the cost of hitting the second-level cache when the level-1 Cache misses, and E_3 is the cost of accessing main memory when the second-level cache misses. The values of M_1, E_1, E_2, E_3 and CPI_[base] are determined by the computer architecture in use;
1.4) If at this moment the system performance IPC_[par] ≥ (1 − PLT)·IPC_[initial], close the remaining Cache columns and execute step (3); otherwise execute step (2). Here the rollback threshold IPC_[initial] denotes the system performance under a pure column partition of the Cache, and PLT denotes the performance loss threshold;
(2) Partition stage: let all threads T that have not been merged form the thread set TS, TS = {T_i | 1 ≤ i ≤ n};
2.1) If |TS| ≥ 2, execute step 2.2); otherwise jump to step 2.5);
2.2) Compute, according to formula ④, the difference degree R_diff with which any two threads in TS access the Cache columns:
R_diff = |P_i − P_j|, (1 ≤ i ≤ n, 1 ≤ j ≤ n and i ≠ j)  ④
where P is the access bias of a thread, computed by formula ⑤.
Each Cache column is divided into an upper and a lower part: U_up denotes the number of Cache blocks in the upper half of a column that the thread has accessed, U_down denotes the number of blocks accessed in the lower half, and U_used denotes the total number of Cache blocks accessed in that column;
2.3) Among all thread pairs whose difference degree satisfies R_diff ≥ R_share, merge the two threads <T_i, T_j> with the maximum R_diff into one partitioning unit, and set TS = TS − {T_i, T_j}; here R_share denotes the difference-degree threshold;
2.4) If |TS| ≥ 2, execute step 2.3) again; otherwise execute step 2.5);
2.5) Determine whether there are unallocated Cache columns; if so, execute step 2.6), otherwise execute step (3);
2.6) Compute the IPC increment Plus_IPC_k(x) of every partitioning unit; Plus_IPC_k(x) is the IPC increment of partitioning unit k (1 ≤ k ≤ n) when the number of Cache columns allocated to it is increased from x (x ∈ [0, C_L2)) to x + 1, where C_L2 is the total number of second-level cache columns. Plus_IPC_k(x) is computed according to formula ⑥;
2.7) Allocate one Cache column to the partitioning unit with the maximum Plus_IPC_k(x);
2.8) Compute the partitioned system performance IPC_[par] according to formula ①;
2.9) If at this moment IPC_[par] ≥ (1 − PLT)·IPC_[initial], the partition stage ends; close the remaining Cache columns and go to step (3); otherwise execute step 2.5);
(3) Rollback stage: after the program has run for another time slice t,
3.1) Determine whether the program has finished running; if so, go to step (4), otherwise execute step 3.2);
3.2) Compute the partitioned system performance IPC_[par] according to formula ①;
3.3) If at this moment IPC_[par] ≥ (1 − PLT)·IPC_[initial], go back to step (3); otherwise restore the initial partition and execute step (2);
(4) Output the run result and evaluate the power saved by the system with a power-consumption evaluation tool.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2011100767236A CN102135793B (en) | 2011-03-29 | 2011-03-29 | Mixed dividing method of low-power-consumption multi-core shared cache |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102135793A (en) | 2011-07-27 |
CN102135793B CN102135793B (en) | 2012-07-04 |
Family
ID=44295596
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2011100767236A CN102135793B (en), Expired - Fee Related | Mixed dividing method of low-power-consumption multi-core shared cache | 2011-03-29 | 2011-03-29 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102135793B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060018179A1 (en) * | 2002-11-18 | 2006-01-26 | Paul Marchal | Cost-aware design-time/run-time memory management methods and apparatus |
CN101739299A (en) * | 2009-12-18 | 2010-06-16 | 北京工业大学 | Method for dynamically and fairly partitioning shared cache based on chip multiprocessor |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TWI476583B (en) * | 2011-09-23 | 2015-03-11 | Nat Univ Tsing Hua | Power aware computer simulation system and method thereof |
CN103077128A (en) * | 2012-12-29 | 2013-05-01 | 华中科技大学 | Method for dynamically partitioning shared cache in multi-core environment |
CN103077128B (en) * | 2012-12-29 | 2015-09-23 | 华中科技大学 | Shared buffer memory method for dynamically partitioning under a kind of multi-core environment |
CN103150266A (en) * | 2013-02-20 | 2013-06-12 | 北京工业大学 | Improved multi-core shared cache replacing method |
CN103150266B (en) * | 2013-02-20 | 2015-10-28 | 北京工业大学 | A kind of multinuclear cache sharing replacement method of improvement |
CN106200868A (en) * | 2016-06-29 | 2016-12-07 | 联想(北京)有限公司 | Shared variable acquisition methods, device and polycaryon processor in polycaryon processor |
CN106200868B (en) * | 2016-06-29 | 2020-07-24 | 联想(北京)有限公司 | Method and device for acquiring shared variables in multi-core processor and multi-core processor |
Legal Events

Date | Code | Title | Description |
---|---|---|---|
| C06 | Publication | |
| PB01 | Publication | |
| C10 | Entry into substantive examination | |
| SE01 | Entry into force of request for substantive examination | |
| C14 | Grant of patent or utility model | |
| GR01 | Patent grant | |
| CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20120704; Termination date: 20190329 |