CN109710563B - Cache partition dividing method for reconfigurable system - Google Patents

Cache partition dividing method for reconfigurable system Download PDF

Info

Publication number
CN109710563B
CN109710563B CN201811377375.4A CN201811377375A CN109710563B CN 109710563 B CN109710563 B CN 109710563B CN 201811377375 A CN201811377375 A CN 201811377375A CN 109710563 B CN109710563 B CN 109710563B
Authority
CN
China
Prior art keywords
partition
data
cache
scaling
formula
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811377375.4A
Other languages
Chinese (zh)
Other versions
CN109710563A (en
Inventor
杨晨
刘童博
王逸洲
侯佳
耿莉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University filed Critical Xian Jiaotong University
Priority to CN201811377375.4A priority Critical patent/CN109710563B/en
Publication of CN109710563A publication Critical patent/CN109710563A/en
Application granted granted Critical
Publication of CN109710563B publication Critical patent/CN109710563B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Landscapes

  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention discloses a cache partition dividing method of a reconfigurable system, which comprises the following steps: 1) when a plurality of coarse-grained reconfigurable arrays CGRA process data, managing and sharing on-chip cache by adopting an invalid scaling FS so as to reduce the bandwidth requirement of an off-chip memory; 2) and introducing new data, namely the data overlapping amount and the iteration number for correction on the basis of invalid scaling FS in the cache partition so as to achieve performance optimization. The invention is oriented to the reconfigurable processor, newly adds statistics of data overlapping amount and iteration times in the original partitioning method, corrects and improves the statistics, avoids adverse effects of data overlapping and unbalanced calculation between CGRAs in a multi-CGRA platform to a certain extent, and improves the performance of caching in the reconfigurable processor.

Description

Cache partition dividing method for reconfigurable system
Technical Field
The invention belongs to the field of cache partition, and particularly relates to a cache partition dividing method of a reconfigurable system.
Background
General purpose computers and ASICs represent two extreme computing approaches, with general purpose computers having the greatest flexibility and low performance and ASICs having the highest performance and the worst flexibility. There are many applications that require both high performance and some flexibility. For example, a multimedia application may include sub-tasks such as data parallel processing, bit processing, irregular computation, high-precision word operation, and operation with real-time requirement, and the processing system is required to flexibly process the sub-tasks and achieve certain performance. Many other applications have similar requirements, such as data encryption, artificial intelligence, etc., which have led to the creation of reconfigurable processors. The reconfigurable processor dynamically changes the function of the operation unit array by configuring the stream during operation, the function change generally consumes a short clock cycle, and then the operation unit array is driven by the data stream to perform calculation, so that the reconfigurable processor is a parallel calculation architecture with high flexibility and high energy efficiency, which is currently a research hotspot in academia and industry, and many remarkable research results and industrial products come into the world.
Caching techniques have emerged to reduce the speed difference between memory and processors. The data in the cache is a subset of the main memory data. Thus, if a block needs to be brought into the cache from main memory, it may happen that the block map is already occupied by other content at the cache location. The LRU replacement policy determines eviction based on the most recently accessed data block. This performs well in high local work concentration. However, in the multi-core system, there is a certain limitation. Shared caches are ubiquitous in Chip Multiprocessors (CMP) to mitigate high latency, high energy and limited bandwidth of main memory. These advantages are costly. When multiple applications share a CMP, they may be disturbed in the shared cache. This can lead to large performance variations, hamper quality of service (QoS) guarantees, and can reduce cache utilization, thereby affecting overall throughput. Therefore, it is critical to efficiently manage shared on-chip caches and reduce off-chip memory bandwidth requirements. Cache partitioning is considered a promising technique to improve the efficiency of shared caching.
Disclosure of Invention
The invention aims to add two new data based on the existing classic cache partition method, modify and improve the two new data, effectively manage the cache on the shared chip and improve the performance of the reconfigurable processor system.
The invention is realized by adopting the following technical scheme:
a cache partition dividing method of a reconfigurable system comprises the following steps:
1) when a plurality of coarse-grained reconfigurable arrays CGRA process data, managing and sharing on-chip cache by adopting an invalid scaling FS so as to reduce the bandwidth requirement of an off-chip memory;
2) and introducing new data, namely the data overlapping amount and the iteration number for correction on the basis of invalid scaling FS in the cache partition so as to achieve performance optimization.
The further improvement of the invention is that the specific implementation method of the step 1) is as follows:
101) controlling the size of each partition by scaling the uselessness of the candidates in each partition by a scaling factor α i;
102) the garbage of the replacement candidate fcard belonging to partition i will have α i scaling and the line with the largest fsacland will be evicted.
The further improvement of the invention is that the step 101) is implemented as follows:
assuming there are R replacement candidates and two partitions, the insertion ratio is I1And I2The target ratio is s1And s2(ii) a Without loss of generality, assume I1<s1Ii ═ Ei, therefore I2>s2Naturally, i.e. without scaling, the size ratio of a partition tends to be proportional to its insertion rate; to reduce the size of partition 2 from I2Decrease to s2The uselessness of cache lines belonging to partition 2 is scaled by a factor of alpha2Zoom in, α 2 > 1, and keep partition 1 un-zoomed, i.e., α1=1;
Figure BDA0001871072320000021
In the formula of alpha2For partition 2 scaling factors, I1And I2Partition 1 and partition 2 insertion ratios, s, respectively1And s2Target ratios for partition 1 and partition 2, respectively, and R is the number of replacement candidates.
The further improvement of the invention is that the step 102) is implemented as follows:
in each eviction, the uselessness of the replacement candidate fcard belonging to partition i will scale by α i, where the line with the largest fscaledcard will be evicted
fscalecand=fcand×αi (1)
Where fcard represents a replacement candidate, α i represents an i-zone scaling factor, and fscaledcard represents a replacement candidate unavailability.
The further improvement of the invention is that the specific implementation method of the step 2) is as follows:
201) correcting the scaling factor through the data overlapping amount between the CGRAs;
202) correcting the scaling factor through the iteration times in the CGRA;
203) and correcting the scaling factor by considering the influence of data overlapping and iteration times.
The further improvement of the invention is that the step 201) is implemented as follows:
suppose that there are two partitions that actually hit the number of numbers given by the following equation:
Figure BDA0001871072320000031
Figure BDA0001871072320000032
in the formula
Figure BDA0001871072320000033
And
Figure BDA0001871072320000034
actual number of hits, cnt, for two partitions respectively12And cnt21Are the data overlap, U1And U2Respectively, the number of hits regardless of data overlap;
the scaling factor is corrected by the amount of the overlapped data, and the correction formula is as follows:
Figure BDA0001871072320000035
Figure BDA0001871072320000036
in the formula
Figure BDA0001871072320000037
And
Figure BDA0001871072320000038
scaling factors, alpha, for partition 1 and partition 2, respectively, after correction of the amount of overlapping data1And alpha2The partition 1 and partition 2 scaling factors before correction, respectively.
The further improvement of the present invention is that, the step 202) is implemented as follows:
the actual number of hits is given by:
U*=U×N (7)
in the formula, N is iteration times;
the correction formula for the scaling factor is thus as follows:
Figure BDA0001871072320000041
Figure BDA0001871072320000042
in the formula N1And N2The number of iterations for partition 1 and partition 2, respectively.
The further improvement of the invention is that the step 203) is implemented as follows:
considering the influence of data overlap and iteration number at the same time, the following formula can be obtained:
Figure BDA0001871072320000043
Figure BDA0001871072320000044
the invention has the following beneficial technical effects:
the invention is oriented to the reconfigurable processor, newly adds statistics of data overlapping amount and iteration times in the original partitioning method, corrects and improves the statistics, avoids adverse effects of data overlapping and unbalanced calculation between CGRAs in a multi-CGRA platform to a certain extent, and improves the performance of caching in the reconfigurable processor.
Invalid scaling (FC) can partition exactly the entire cache while having a large number of partitions still maintaining high associativity. The uselessness of a cache line indicates that the line is not useful for its application performance and may be ordered by various policies. By scaling the garbage of cache lines belonging to different partitions, the FS can adjust the eviction of different partitions, thereby controlling the size of these partitions. By always evicting a cache line with the largest size of useless functions from the complete candidate list, the FS can retain a large number of useful lines, thereby maintaining high associativity.
In the traditional partitioning method, the overlapped data amount is cancelled in summation, the partitioning proportion cannot be changed, and a more effective scaling factor can be obtained by correcting an FS scaling factor by introducing the parameter of the overlapped data amount, so that partitioning is more reasonable.
After the iteration times are introduced, the number of missed hits on the CGRA can be obtained more accurately, if the actual number of hits is larger than the number of hits obtained by the hit counter, the scaling factor corresponding to the partition is reduced, the size of the partition tends to increase, the partition with high performance can be allocated with more cache resources, the size of the partition can be controlled accurately and finely, and the performance can be optimized.
Drawings
FIG. 1 shows the composition of a generic CGRA.
Fig. 2 is a hardware block diagram of a dual CGRA platform.
Figure 3 performance test results.
Detailed Description
The main characteristics are as follows:
1. and reasonably dividing a cache area based on a utilization rate Partitioning method (Utility-based cache Partitioning) and invalid Scaling (Utility Scaling).
2. Two new data are introduced-the amount of data overlap and the number of iterations, and the data can be computed by hardware tracking.
Main advantages
1. By combining the advantages of a utilization-Based Cache Partitioning method (Utility-Based Cache Partitioning) and invalid Scaling (Utility Scaling), the Cache area is reasonably partitioned, appropriate contents in the main memory are placed in appropriate Cache partitions, and the system performance is improved.
2. The original cache division is corrected by using the data overlapping amount and the iteration times to obtain a more reasonable scaling factor alpha and invalidity f, and the size of the partition is accurately and finely controlled, so that the partition with high performance can allocate more cache resources, and the performance is optimized.
The cache partitioning method provided by the invention is mainly oriented to a reconfigurable processor, a general CGRA is composed as shown in figure 1, and a general CGRA is composed of an extensible PE array, a data transmission controller, a configuration information memory and a configuration controller as shown in figure 3-2. The PE array is an integral part of the CGRA, and each PE may be dynamically configured to perform word-level operations, such as arithmetic and logical operations. The operation of the PEs and the interconnections between the PEs may differ in different CGRA designs. The data transfer controller provides operand data to the PE array. The configuration information memory stores configuration information for configuring the PE array. By altering the configuration information, the functionality of the CGRA can be fully dynamically programmed at runtime. The configuration controller is responsible for scheduling the execution order of the PE array configuration information. The cache division hardware block diagram is shown in fig. 2, and the whole hardware structure is composed of CGRA, UMON, CMON, IMON, MMON, FMON, partition algorithm and shared cache.
UMON is a hit rate monitor that monitors the number of hits per CGRA. When the CGRA issues an access request and the request hits, the hit counter is incremented by one.
CMON is a data overlap counter that monitors the amount of data overlap for each CGRA. When the CGRA issues an access request, and the request hits, but core _ id is not equal to cache _ id, the counter is incremented by one.
The IMON is an iteration monitor, and in the CGRA platform, the number of iterations can be derived directly from the configuration information.
MMON is a miss monitor, and the ratio of the number of misses obtained by the two monitors can be used as the insertion ratio of the two partitions to calculate the scaling factor α.
FMON is a monitor of invalidity, with each block having a counter that is zeroed whenever it is accessed, and the counters of the other blocks are incremented by one, and the counter of the block being replaced is zeroed and the counters of the other blocks are incremented by one when the access misses. The value of the counter is the invalidity f of each block.
Each monitor on the CGRA transmits the monitored information to a partition algorithm to obtain partition results, and then the partition algorithm transmits the partition results to a shared cache to divide the shared cache. The cache division design based on the data overlapping amount and the iteration number is as follows:
FS controls the size of each partition primarily by scaling the uselessness of candidates in each partition. The scaling factor for each partition is α i. In each eviction, the uselessness of the replacement candidate (fcard) belonging to partition i will be scaled by α i, where the line with the largest gcaledcard (equation 1) will be evicted.
fscalecand=fcand×αi (1)
Where fcard represents a replacement candidate, α i represents an i-zone scaling factor, and fscaledcard represents a replacement candidate unavailability;
suppose there are R replacement candidates (generally equal to the way number of the cache) and two partitions, with insertion ratios of I1And I2The target ratio is s1And s2. Without loss of generality, assume I1<s1Ii ═ Ei (stable), so I2>s2. Naturally (i.e., without scaling), the size ratio of partitions tends to scale with their insertion rate. To reduce the size of partition 2 from I2Decrease to s2The uselessness of cache lines belonging to partition 2 is scaled by a factor of alpha22> 1) enlargement, and keeping the partition 1 not scaled, i.e. alpha1=1。
Figure BDA0001871072320000071
In the formula of alpha2For partition 2 scaling factors, I1And I2Partition 1 and partition 2 insertion ratios, s, respectively1And s2Target proportions of a partition 1 and a partition 2 are respectively adopted, and R is the number of replacement candidates;
the data overlap amount is cnt, cnt21Representing the amount of data a request issued by CGRA2 hits on the partition to which CGRA1 belongs, and vice versa.
There are two partitions whose actual number of hits is given by equations 3 and 4:
Figure BDA0001871072320000072
Figure BDA0001871072320000073
in the formula
Figure BDA0001871072320000074
And
Figure BDA0001871072320000075
actual number of hits, cnt, for two partitions respectively12And cnt21Are the data overlap, U1And U2Respectively, the number of hits regardless of data overlap;
its scaling factor is corrected by the amount of overlapping data. The correction formula is as follows:
Figure BDA0001871072320000076
Figure BDA0001871072320000077
in the formula
Figure BDA0001871072320000078
And
Figure BDA0001871072320000079
scaling factors, alpha, for partition 1 and partition 2, respectively, after correction of the amount of overlapping data1And alpha2The scaling factors of the partition 1 and the partition 2 before correction are respectively;
the iteration number is marked as N and can be directly given by configuration information in the reconfigurable system. In a pipelined CGRA, each CGRA has its corresponding number of iterations.
The actual number of hits is given by:
U*=U×N (7)
the modification of the scaling factor is thus as in equations 8 and 9.
Figure BDA00018710723200000710
Figure BDA00018710723200000711
In the formula N1And N2Respectively the iteration times of the partition 1 and the partition 2;
considering the influence of data overlap and iteration number at the same time, the following formula can be obtained:
Figure BDA0001871072320000081
Figure BDA0001871072320000082
if the actual number of hits is greater than the number of hits from the hit counter, the scaling factor for the partition is reduced, so that the lines of the partition are less likely to be evicted and the size of the partition tends to increase. In the continuous eviction, the size of the partition is accurately and finely controlled, so that the performance is optimized.
[ Performance test of the present invention ]
The proposed cache partitioning algorithm is to utilize the partitioning to place the proper contents in the main memory into the proper cache partitions, i.e. the partitioning is performed properly. So the core index for evaluating a cache partitioning algorithm is the hit rate or miss rate. For the execution of the replacement and partition algorithm, the CPU is mainly used for continuously sending out access addresses, the main memory stores data corresponding to each access address, the cache is used as a subset of the main memory and also used for storing the data corresponding to the access addresses, and the cache management circuit judges whether the access addresses are hit or not on one hand and then sends out signals to transmit related data according to the result. The invention is realized by using the Testbench of the Verilog language on the basis that a cache simulation platform realized by using RTL-level Verilog codes is used for simulating the LRU replacement strategy.
The present invention uses the rand function in the MATLAB program to generate, for example, 600 random numbers of 0 to 599 and gives a small portion to core 1 and a large portion to core 2. Because the data is random numbers, there is a partial data overlap, i.e., the data overlap amount according to the present invention. By comparing 5 sets of random sequences for non-buffer and buffer partitions, the average performance improvement reaches 25% as shown in the figure.

Claims (3)

1. A cache partition dividing method of a reconfigurable system is characterized by comprising the following steps:
1) when a plurality of coarse-grained reconfigurable arrays CGRA process data, managing and sharing on-chip cache by adopting an invalid scaling FS so as to reduce the bandwidth requirement of an off-chip memory; the specific implementation method comprises the following steps:
101) controlling the size of each partition by scaling the uselessness of the candidates in each partition by a scaling factor α i; the specific implementation method comprises the following steps:
if the shared cache has R replacement candidatesAnd has two partitions with insertion ratios of I1And I2The target ratio is s1And s2(ii) a If I1<s1If Ii is equal to Ei, Ei is the eviction proportion in the partition I, I is equal to 1, 22>s2I.e. without scaling, the size ratio of partitions tends to be proportional to their insertion rate; to reduce the size of partition 2 from I2Decrease to s2The uselessness of cache lines belonging to partition 2 is scaled by a factor of alpha2Zoom in, α 2 > 1, and keep partition 1 un-zoomed, i.e., α1=1;
Figure FDA0002653772490000011
In the formula of alpha2For partition 2 scaling factors, I1And I2Partition 1 and partition 2 insertion ratios, s, respectively1And s2Target proportions of a partition 1 and a partition 2 are respectively adopted, and R is the number of replacement candidates;
102) the uselessness of the replacement candidate fcard belonging to partition i will scale by α i and the line with the largest fsacland will be evicted; the specific implementation method comprises the following steps:
in each eviction, the uselessness of the replacement candidate fcard belonging to partition i will scale by α i, where the line with the largest fscaledcard will be evicted
fscalecand=fcand×αi (1)
Where fcard represents a replacement candidate, α i represents an i-zone scaling factor, and fscaledcard represents a replacement candidate unavailability;
2) introducing new data, namely data overlapping quantity and iteration times for correction on the basis of invalid scaling FS in the cache partition so as to achieve performance optimization; the specific implementation method comprises the following steps:
201) correcting the scaling factor through the data overlapping amount between the CGRAs; the specific implementation method comprises the following steps:
if there are two partitions, the actual number of hits is given by the following formula:
Figure FDA0002653772490000021
Figure FDA0002653772490000022
in the formula
Figure FDA0002653772490000023
And
Figure FDA0002653772490000024
actual number of hits, cnt, for two partitions respectively12And cnt21Are the data overlap, U1And U2Respectively, the number of hits regardless of data overlap;
the scaling factor is corrected by the amount of the overlapped data, and the correction formula is as follows:
Figure FDA0002653772490000025
Figure FDA0002653772490000026
in the formula
Figure FDA0002653772490000027
And
Figure FDA0002653772490000028
scaling factors, alpha, for partition 1 and partition 2, respectively, after correction of the amount of overlapping data1And alpha2The scaling factors of the partition 1 and the partition 2 before correction are respectively;
202) correcting the scaling factor through the iteration times in the CGRA;
203) and correcting the scaling factor by considering the influence of data overlapping and iteration times.
2. The method for partitioning a buffer of a reconfigurable system according to claim 1, wherein step 202) is implemented as follows:
the actual number of hits is given by:
U*=U×N (7)
in the formula of U*The number of actual hits of the two partitions is U, the number of data overlapping hits is not considered, and N is the number of iterations;
the correction formula for the scaling factor is thus as follows:
Figure FDA0002653772490000029
Figure FDA00026537724900000210
in the formula N1And N2The number of iterations for partition 1 and partition 2, respectively.
3. The method for partitioning a cache of a reconfigurable system according to claim 2, wherein the step 203) is implemented as follows:
considering the influence of data overlap and iteration number at the same time, the following formula can be obtained:
Figure FDA0002653772490000031
Figure FDA0002653772490000032
CN201811377375.4A 2018-11-19 2018-11-19 Cache partition dividing method for reconfigurable system Active CN109710563B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811377375.4A CN109710563B (en) 2018-11-19 2018-11-19 Cache partition dividing method for reconfigurable system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811377375.4A CN109710563B (en) 2018-11-19 2018-11-19 Cache partition dividing method for reconfigurable system

Publications (2)

Publication Number Publication Date
CN109710563A CN109710563A (en) 2019-05-03
CN109710563B true CN109710563B (en) 2020-11-10

Family

ID=66254935

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811377375.4A Active CN109710563B (en) 2018-11-19 2018-11-19 Cache partition dividing method for reconfigurable system

Country Status (1)

Country Link
CN (1) CN109710563B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113505087B (en) * 2021-06-29 2023-08-22 中国科学院计算技术研究所 Cache dynamic dividing method and system considering service quality and utilization rate

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104699629A (en) * 2015-03-16 2015-06-10 清华大学 Sharing on-chip cache dividing device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3258382B1 (en) * 2016-06-14 2021-08-11 Arm Ltd A storage controller

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104699629A (en) * 2015-03-16 2015-06-10 清华大学 Sharing on-chip cache dividing device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CIACP: A Correlation- and Iteration- Aware Cache Partitioning Mechanism to Improve Performance of Multiple Coarse-Grained Reconfigurable Arrays;Chen Yang 等;《IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS》;20170131;第28卷;第29-41页第1-5节,图2 *
Futility scaling: High-associativity cache partitioning;R. Wang 等;《IEEE/ACM International Symposium on Microarchitecture》;20141231;第356–367页 *
Utility-based cache partitioning: a low-overhead, high-performance, runtime mechanism to partition shared caches;M. K. Qureshi 等;《IEEE Computer Society》;20061231;第423–432页 *

Also Published As

Publication number Publication date
CN109710563A (en) 2019-05-03

Similar Documents

Publication Publication Date Title
US8904154B2 (en) Execution migration
Dybdahl et al. An adaptive shared/private nuca cache partitioning scheme for chip multiprocessors
Sanchez et al. SCD: A scalable coherence directory with flexible sharer set encoding
Kim et al. An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches
Zhao et al. SPATL: Honey, I shrunk the coherence directory
JP3425158B2 (en) Computer system with multi-buffer data cache
Basu et al. Scavenger: A new last level cache architecture with global block priority
Das et al. SLIP: reducing wire energy in the memory hierarchy
CN109710563B (en) Cache partition dividing method for reconfigurable system
Sankaranarayanan et al. An energy efficient GPGPU memory hierarchy with tiny incoherent caches
CN116820773A (en) GPGPU register cache management system
Yang et al. Partially shared cache and adaptive replacement algorithm for NoC-based many-core systems
CN115203076B (en) Data structure optimized private memory caching
Khan et al. Em2: A scalable shared-memory multicore architecture
Lira et al. Replacement techniques for dynamic NUCA cache designs on CMPs
Das et al. A framework for block placement, migration, and fast searching in tiled-DNUCA architecture
BiTalebi et al. LARA: Locality-aware resource allocation to improve GPU memory-access time
Das et al. Random-LRU: a replacement policy for chip multiprocessors
Flores et al. Memory hierarchies for future hpc architectures
Lira et al. Analysis of non-uniform cache architecture policies for chip-multiprocessors using the parsec benchmark suite
Sahoo et al. CAMO: A novel cache management organization for GPGPUs
Kavi et al. A comparative analysis of performance improvement schemes for cache memories
Mehboob et al. Enhance the Performance of Associative Memory by Using New Methods
Zhang et al. Line-coalescing dram cache
El-Moursy et al. V-set cache design for LLC of multi-core processors

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant