CN109710563B - Cache partition dividing method for reconfigurable system - Google Patents

Cache partition dividing method for reconfigurable system Download PDF

Info

Publication number
CN109710563B
Authority
CN
China
Prior art keywords
partition
data
cache
scaling
formula
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811377375.4A
Other languages
Chinese (zh)
Other versions
CN109710563A (en)
Inventor
杨晨
刘童博
王逸洲
侯佳
耿莉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University filed Critical Xian Jiaotong University
Priority to CN201811377375.4A priority Critical patent/CN109710563B/en
Publication of CN109710563A publication Critical patent/CN109710563A/en
Application granted granted Critical
Publication of CN109710563B publication Critical patent/CN109710563B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Landscapes

  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention discloses a cache partitioning method for a reconfigurable system, which comprises the following steps: 1) when a plurality of coarse-grained reconfigurable arrays (CGRAs) process data, futility scaling (FS) is adopted to manage the shared on-chip cache so as to reduce the off-chip memory bandwidth requirement; 2) two new statistics, namely the data overlap amount and the iteration count, are introduced on the basis of futility scaling FS to correct the cache partitioning and optimize performance. Oriented to reconfigurable processors, the invention adds statistics of the data overlap amount and the iteration count to the original partitioning method and uses them as corrections, which to a certain extent avoids the adverse effects of data overlap and computational imbalance between CGRAs on a multi-CGRA platform and improves the performance of the cache in the reconfigurable processor.

Description

Cache partition dividing method for reconfigurable system
Technical Field
The invention belongs to the field of cache partitioning, and particularly relates to a cache partitioning method for a reconfigurable system.
Background
General-purpose computers and ASICs represent two extremes of computing: general-purpose computers offer the greatest flexibility but low performance, while ASICs offer the highest performance but the worst flexibility. Many applications require both high performance and some flexibility. For example, a multimedia application may include sub-tasks such as data-parallel processing, bit processing, irregular computation, high-precision word operations, and operations with real-time requirements, and the processing system must handle these sub-tasks flexibly while achieving a certain level of performance. Many other applications, such as data encryption and artificial intelligence, have similar requirements, which led to the creation of reconfigurable processors. A reconfigurable processor dynamically changes the function of its operation-unit array through a configuration stream at runtime; the function change generally takes only a few clock cycles, after which the array is driven by the data stream to perform computation. It is therefore a parallel computing architecture with high flexibility and high energy efficiency, currently a research hotspot in academia and industry, from which many notable research results and industrial products have emerged.
Caching techniques emerged to bridge the speed gap between memory and processors. The data in the cache is a subset of the main memory data, so when a block needs to be brought into the cache from main memory, the cache location that the block maps to may already be occupied by other content. The LRU replacement policy selects the eviction victim based on how recently data blocks were accessed. This performs well for workloads with high locality, but has limitations in multi-core systems. Shared caches are ubiquitous in chip multiprocessors (CMPs) to mitigate the high latency, high energy, and limited bandwidth of main memory. These advantages come at a cost: when multiple applications share a CMP, they may interfere with each other in the shared cache. This can lead to large performance variations, hamper quality-of-service (QoS) guarantees, and reduce cache utilization, thereby affecting overall throughput. It is therefore critical to manage the shared on-chip cache efficiently and reduce off-chip memory bandwidth requirements. Cache partitioning is considered a promising technique for improving the efficiency of shared caches.
Disclosure of Invention
The invention aims to add two new statistics to the existing classic cache partitioning method, use them to modify and improve it, effectively manage the shared on-chip cache, and improve the performance of the reconfigurable processor system.
The invention is realized by adopting the following technical scheme:
a cache partition dividing method of a reconfigurable system comprises the following steps:
1) when a plurality of coarse-grained reconfigurable arrays (CGRAs) process data, futility scaling (FS) is adopted to manage the shared on-chip cache so as to reduce the off-chip memory bandwidth requirement;
2) two new statistics, namely the data overlap amount and the iteration count, are introduced on the basis of futility scaling FS to correct the cache partitioning so as to optimize performance.
The further improvement of the invention is that the specific implementation method of the step 1) is as follows:
101) controlling the size of each partition by scaling the futility of the replacement candidates in each partition by a scaling factor α_i;
102) the futility f_cand of a replacement candidate belonging to partition i is scaled by α_i, and the line with the largest scaled futility f_scaled,cand is evicted.
The further improvement of the invention is that the step 101) is implemented as follows:
assuming there are R replacement candidates and two partitions with insertion ratios I_1 and I_2 and target ratios s_1 and s_2; without loss of generality, assume I_1 < s_1 (in the steady state the insertion ratio equals the eviction ratio, I_i = E_i), so I_2 > s_2; naturally, i.e. without scaling, the size ratio of a partition tends to be proportional to its insertion ratio; to reduce the size of partition 2 from I_2 to s_2, the futility of cache lines belonging to partition 2 is scaled up by a factor α_2 (α_2 > 1), while partition 1 is kept unscaled, i.e. α_1 = 1;
α_2 = (s_1 / I_1)^(1/(R × s_2))    (2)
where α_2 is the partition 2 scaling factor, I_1 and I_2 are the partition 1 and partition 2 insertion ratios, s_1 and s_2 are the partition 1 and partition 2 target ratios, respectively, and R is the number of replacement candidates.
The further improvement of the invention is that the step 102) is implemented as follows:
in each eviction, the futility f_cand of a replacement candidate belonging to partition i is scaled by α_i, and the line with the largest scaled futility f_scaled,cand is evicted:
f_scaled,cand = f_cand × α_i    (1)
where f_cand is the futility of a replacement candidate, α_i is the partition i scaling factor, and f_scaled,cand is the scaled futility of the candidate.
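The eviction rule of formula (1) can be sketched in a few lines of Python; this is an illustrative sketch, and the Candidate and select_victim names are assumptions, not identifiers from the patent.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    line_id: int
    partition: int   # partition the line belongs to
    futility: float  # f_cand, e.g. an LRU-stack-style counter value

def select_victim(candidates, alpha):
    """Evict the line with the largest scaled futility (formula 1):
    f_scaled,cand = f_cand * alpha_i."""
    return max(candidates, key=lambda c: c.futility * alpha[c.partition])

# Example: partition 2 is scaled up (alpha > 1), so its lines are more
# likely to be evicted even with a somewhat lower raw futility.
alpha = {1: 1.0, 2: 1.5}
cands = [Candidate(0, 1, 0.8), Candidate(1, 2, 0.6), Candidate(2, 1, 0.7)]
victim = select_victim(cands, alpha)
print(victim.line_id)  # line 1: 0.6 * 1.5 = 0.9 > 0.8
```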
The further improvement of the invention is that the specific implementation method of the step 2) is as follows:
201) correcting the scaling factor through the data overlapping amount between the CGRAs;
202) correcting the scaling factor through the iteration times in the CGRA;
203) and correcting the scaling factor by considering the influence of data overlapping and iteration times.
The further improvement of the invention is that the step 201) is implemented as follows:
suppose that there are two partitions whose actual hit counts are given by the following formulas:
U_1* = U_1 - cnt_12 + cnt_21    (3)
U_2* = U_2 - cnt_21 + cnt_12    (4)
where U_1* and U_2* are the actual hit counts of the two partitions, cnt_12 and cnt_21 are the data overlap amounts, and U_1 and U_2 are the hit counts without considering data overlap;
the scaling factor is corrected by the data overlap amount, and the correction formulas are as follows:
α_1* = α_1 × U_1 / U_1*    (5)
α_2* = α_2 × U_2 / U_2*    (6)
where α_1* and α_2* are the partition 1 and partition 2 scaling factors after correction by the data overlap amount, and α_1 and α_2 are the partition 1 and partition 2 scaling factors before correction, respectively.
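The overlap correction can be sketched in Python. This is a minimal illustrative sketch: the closed forms U_1* = U_1 - cnt_12 + cnt_21 and α* = α × U / U* are reconstructed from the surrounding description (the overlap amounts cancel in the summation, and a partition whose actual hits exceed its counted hits gets a smaller scaling factor), so treat them as assumptions rather than the patent's verbatim formulas.

```python
def corrected_alphas(alpha1, alpha2, U1, U2, cnt12, cnt21):
    # Actual hits of each partition (assumed formulas 3 and 4): remove the
    # hits this CGRA made in the other partition, add the hits the other
    # CGRA made here. Note U1* + U2* = U1 + U2, so the overlap cancels in
    # the summation, as the description says.
    U1_star = U1 - cnt12 + cnt21
    U2_star = U2 - cnt21 + cnt12
    # If a partition's actual hits exceed its counter value, shrink its
    # scaling factor so its lines are evicted less and the partition grows.
    return alpha1 * U1 / U1_star, alpha2 * U2 / U2_star

a1, a2 = corrected_alphas(1.0, 1.5, U1=100, U2=200, cnt12=10, cnt21=30)
print(round(a1, 3), round(a2, 3))  # 0.833 1.667
```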
The further improvement of the present invention is that, the step 202) is implemented as follows:
the actual number of hits is given by:
U* = U × N    (7)
where N is the number of iterations;
the correction formula for the scaling factor is thus as follows:
α_1* = α_1 / N_1    (8)
α_2* = α_2 / N_2    (9)
where N_1 and N_2 are the numbers of iterations of partition 1 and partition 2, respectively.
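The iteration correction can be sketched similarly; assuming U* = U × N (formula 7) and the same correction form α* = α × U / U*, the scaling factor of a partition simply divides by its iteration count, so a partition that iterates more often tends to keep more cache. This reconstructed form is an assumption drawn from the description.

```python
def iteration_corrected(alpha1, alpha2, N1, N2):
    # alpha* = alpha * U / (U * N) = alpha / N: more iterations means a
    # smaller scaling factor, fewer evictions, and a larger partition.
    return alpha1 / N1, alpha2 / N2

a1, a2 = iteration_corrected(1.0, 1.5, N1=4, N2=2)
print(a1, a2)  # 0.25 0.75
```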
The further improvement of the invention is that the step 203) is implemented as follows:
considering the influence of data overlap and iteration number at the same time, the following formula can be obtained:
α_1* = α_1 × U_1 / (N_1 × (U_1 - cnt_12 + cnt_21))    (10)
α_2* = α_2 × U_2 / (N_2 × (U_2 - cnt_21 + cnt_12))    (11)
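Combining both corrections under the same reconstructed forms gives a sketch like the following (illustrative assumptions throughout, not the patent's verified formulas):

```python
def combined_alphas(alpha1, alpha2, U1, U2, cnt12, cnt21, N1, N2):
    # Overlap-adjusted hits, then scaled by the iteration count
    # (assumed formulas 10 and 11).
    U1_star = N1 * (U1 - cnt12 + cnt21)
    U2_star = N2 * (U2 - cnt21 + cnt12)
    return alpha1 * U1 / U1_star, alpha2 * U2 / U2_star

a1, a2 = combined_alphas(1.0, 1.5, 100, 200, 10, 30, 4, 2)
print(round(a1, 4), round(a2, 4))  # 0.2083 0.8333
```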
the invention has the following beneficial technical effects:
the invention is oriented to the reconfigurable processor, newly adds statistics of data overlapping amount and iteration times in the original partitioning method, corrects and improves the statistics, avoids adverse effects of data overlapping and unbalanced calculation between CGRAs in a multi-CGRA platform to a certain extent, and improves the performance of caching in the reconfigurable processor.
Futility scaling (FS) can partition the entire cache exactly while still maintaining high associativity even with a large number of partitions. The futility of a cache line indicates how useless the line is to its application's performance, and lines may be ranked by various policies. By scaling the futility of cache lines belonging to different partitions, FS can adjust the eviction rates of the different partitions, thereby controlling their sizes. By always evicting the cache line with the largest scaled futility from the complete candidate list, FS retains a large number of useful lines and thus maintains high associativity.
In the traditional partitioning method, the overlapping data amounts cancel in the summation and cannot change the partition proportions. By introducing the data overlap amount as a parameter to correct the FS scaling factor, a more effective scaling factor is obtained, making the partitioning more reasonable.
After the iteration count is introduced, the number of hits on each CGRA can be obtained more accurately. If the actual number of hits is larger than the number obtained by the hit counter, the scaling factor of the corresponding partition is reduced and the partition tends to grow, so that partitions with higher performance are allocated more cache resources. The partition sizes can thus be controlled accurately and finely, and performance is optimized.
Drawings
FIG. 1 shows the composition of a generic CGRA.
Fig. 2 is a hardware block diagram of a dual CGRA platform.
Figure 3 performance test results.
Detailed Description
The main characteristics are as follows:
1. The cache is reasonably partitioned based on Utility-Based Cache Partitioning and Futility Scaling (FS).
2. Two new statistics are introduced, the data overlap amount and the iteration count, both of which can be obtained by hardware tracking.
Main advantages
1. By combining the advantages of Utility-Based Cache Partitioning and Futility Scaling (FS), the cache is reasonably partitioned, the appropriate contents of main memory are placed in the appropriate cache partitions, and system performance is improved.
2. The original cache partitioning is corrected using the data overlap amount and the iteration count to obtain a more reasonable scaling factor α and futility f, and the partition sizes are controlled accurately and finely, so that partitions with higher performance are allocated more cache resources and performance is optimized.
The cache partitioning method provided by the invention is mainly oriented to reconfigurable processors. As shown in FIG. 1, a general CGRA is composed of a scalable PE array, a data transfer controller, a configuration information memory, and a configuration controller. The PE array is the core component of the CGRA, and each PE can be dynamically configured to perform word-level operations, such as arithmetic and logical operations. The operations of the PEs and the interconnections between them may differ across CGRA designs. The data transfer controller provides operand data to the PE array. The configuration information memory stores the configuration information used to configure the PE array. By altering the configuration information, the functionality of the CGRA can be fully dynamically programmed at runtime. The configuration controller is responsible for scheduling the execution order of the PE array configuration information. The cache partitioning hardware block diagram is shown in FIG. 2; the whole hardware structure consists of the CGRAs, UMON, CMON, IMON, MMON, FMON, the partitioning algorithm, and the shared cache.
UMON is a hit rate monitor that monitors the number of hits per CGRA. When the CGRA issues an access request and the request hits, the hit counter is incremented by one.
CMON is a data overlap counter that monitors the data overlap amount for each CGRA. When a CGRA issues an access request and the request hits but core_id is not equal to cache_id, the counter is incremented by one.
The IMON is an iteration monitor, and in the CGRA platform, the number of iterations can be derived directly from the configuration information.
MMON is a miss monitor; the ratio of the miss counts obtained by the two miss monitors is used as the insertion ratio of the two partitions to calculate the scaling factor α.
FMON is a futility monitor. Each block has a counter: whenever a block is accessed, its counter is zeroed and the counters of the other blocks are incremented by one; when an access misses, the counter of the replaced block is zeroed and the counters of the other blocks are incremented by one. The counter value is the futility f of each block.
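The FMON counter behavior described above can be sketched as follows; the FutilityMonitor class and method names are illustrative assumptions. Each counter effectively tracks an LRU-stack-like position: recently touched blocks have low futility.

```python
class FutilityMonitor:
    def __init__(self, num_blocks):
        self.f = [0] * num_blocks  # futility counter f per block

    def _touch(self, block):
        # Increment every counter, then zero the touched block's counter.
        for b in range(len(self.f)):
            self.f[b] += 1
        self.f[block] = 0

    def on_hit(self, block):
        self._touch(block)          # accessed block is zeroed

    def on_miss(self, replaced_block):
        self._touch(replaced_block)  # replaced block's counter is zeroed

mon = FutilityMonitor(4)
mon.on_hit(0); mon.on_hit(2); mon.on_miss(1)
print(mon.f)  # [2, 0, 1, 3]
```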
Each monitor on the CGRA transmits the monitored information to a partition algorithm to obtain partition results, and then the partition algorithm transmits the partition results to a shared cache to divide the shared cache. The cache division design based on the data overlapping amount and the iteration number is as follows:
FS controls the size of each partition mainly by scaling the futility of the replacement candidates in each partition. The scaling factor of partition i is α_i. In each eviction, the futility f_cand of a replacement candidate belonging to partition i is scaled by α_i, and the line with the largest scaled futility f_scaled,cand (formula 1) is evicted.
f_scaled,cand = f_cand × α_i    (1)
where f_cand is the futility of a replacement candidate, α_i is the partition i scaling factor, and f_scaled,cand is the scaled futility of the candidate;
suppose there are R replacement candidates (generally equal to the number of ways of the cache) and two partitions with insertion ratios I_1 and I_2 and target ratios s_1 and s_2. Without loss of generality, assume I_1 < s_1 (in the steady state, I_i = E_i, the eviction ratio), so I_2 > s_2. Naturally (i.e., without scaling), the size ratio of the partitions tends to be proportional to their insertion ratios. To reduce the size of partition 2 from I_2 to s_2, the futility of cache lines belonging to partition 2 is scaled up by a factor α_2 (α_2 > 1), while partition 1 is kept unscaled, i.e. α_1 = 1.
α_2 = (s_1 / I_1)^(1/(R × s_2))    (2)
where α_2 is the partition 2 scaling factor, I_1 and I_2 are the partition 1 and partition 2 insertion ratios, s_1 and s_2 are the partition 1 and partition 2 target ratios, respectively, and R is the number of replacement candidates;
the data overlap amount is cnt, cnt21Representing the amount of data a request issued by CGRA2 hits on the partition to which CGRA1 belongs, and vice versa.
There are two partitions whose actual number of hits is given by equations 3 and 4:
U_1* = U_1 - cnt_12 + cnt_21    (3)
U_2* = U_2 - cnt_21 + cnt_12    (4)
where U_1* and U_2* are the actual hit counts of the two partitions, cnt_12 and cnt_21 are the data overlap amounts, and U_1 and U_2 are the hit counts without considering data overlap;
its scaling factor is corrected by the amount of overlapping data. The correction formula is as follows:
α_1* = α_1 × U_1 / U_1*    (5)
α_2* = α_2 × U_2 / U_2*    (6)
where α_1* and α_2* are the partition 1 and partition 2 scaling factors after correction by the data overlap amount, and α_1 and α_2 are the partition 1 and partition 2 scaling factors before correction, respectively;
the iteration number is marked as N and can be directly given by configuration information in the reconfigurable system. In a pipelined CGRA, each CGRA has its corresponding number of iterations.
The actual number of hits is given by:
U* = U × N    (7)
the modification of the scaling factor is thus as in equations 8 and 9.
α_1* = α_1 / N_1    (8)
α_2* = α_2 / N_2    (9)
where N_1 and N_2 are the numbers of iterations of partition 1 and partition 2, respectively;
considering the influence of data overlap and iteration number at the same time, the following formula can be obtained:
α_1* = α_1 × U_1 / (N_1 × (U_1 - cnt_12 + cnt_21))    (10)
α_2* = α_2 × U_2 / (N_2 × (U_2 - cnt_21 + cnt_12))    (11)
if the actual number of hits is greater than the number of hits from the hit counter, the scaling factor for the partition is reduced, so that the lines of the partition are less likely to be evicted and the size of the partition tends to increase. In the continuous eviction, the size of the partition is accurately and finely controlled, so that the performance is optimized.
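As a rough illustration of this effect, the following toy Python simulation (all parameters, the random-futility model, and the insertion pattern are assumptions, not from the patent) shows how repeated scaled-futility evictions shrink an over-sized partition: partition 2's futility is scaled by α_2 > 1, so its lines are evicted preferentially and its share of the cache falls over successive evictions.

```python
import random

random.seed(0)
alpha = {1: 1.0, 2: 2.0}
# Start with partition 2 over-represented: 20 lines of p1, 44 lines of p2.
cache = [1] * 20 + [2] * 44

for _ in range(2000):
    # Pick R = 16 random replacement candidates with random futilities,
    # and evict the one with the largest scaled futility.
    idxs = random.sample(range(len(cache)), 16)
    victim = max(idxs, key=lambda i: random.random() * alpha[cache[i]])
    # New insertions arrive with equal probability from either partition.
    cache[victim] = random.choice([1, 2])

# Partition 1 ends up holding the majority of the cache lines.
print(cache.count(1), cache.count(2))
```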
[ Performance test of the present invention ]
The proposed cache partitioning algorithm uses partitioning to place the appropriate contents of main memory into the appropriate cache partitions, i.e., to partition properly. The core metric for evaluating a cache partitioning algorithm is therefore the hit rate or miss rate. For executing the replacement and partitioning algorithms, the CPU continuously issues access addresses, main memory stores the data corresponding to each access address, the cache, as a subset of main memory, also stores data corresponding to the access addresses, and the cache management circuit judges whether each access address hits and then issues signals to transfer the related data according to the result. The invention is implemented using a Verilog-language Testbench on a cache simulation platform written in RTL-level Verilog code that simulates the LRU replacement policy.
The present invention uses the rand function in MATLAB to generate, for example, 600 random numbers between 0 and 599, a small portion of which is given to core 1 and a larger portion to core 2. Because the data are random numbers, some of them overlap, i.e., exhibit the data overlap described by the invention. Comparing 5 sets of random sequences with and without cache partitioning, the average performance improvement reaches 25%, as shown in FIG. 3.

Claims (3)

1. A cache partition dividing method of a reconfigurable system is characterized by comprising the following steps:
1) when a plurality of coarse-grained reconfigurable arrays (CGRAs) process data, futility scaling (FS) is adopted to manage the shared on-chip cache so as to reduce the off-chip memory bandwidth requirement; the specific implementation method comprises the following steps:
101) controlling the size of each partition by scaling the futility of the replacement candidates in each partition by a scaling factor α_i; the specific implementation method comprises the following steps:
if the shared cache has R replacement candidatesAnd has two partitions with insertion ratios of I1And I2The target ratio is s1And s2(ii) a If I1<s1If Ii is equal to Ei, Ei is the eviction proportion in the partition I, I is equal to 1, 22>s2I.e. without scaling, the size ratio of partitions tends to be proportional to their insertion rate; to reduce the size of partition 2 from I2Decrease to s2The uselessness of cache lines belonging to partition 2 is scaled by a factor of alpha2Zoom in, α 2 > 1, and keep partition 1 un-zoomed, i.e., α1=1;
α_2 = (s_1 / I_1)^(1/(R × s_2))    (2)
where α_2 is the partition 2 scaling factor, I_1 and I_2 are the partition 1 and partition 2 insertion ratios, s_1 and s_2 are the partition 1 and partition 2 target ratios, respectively, and R is the number of replacement candidates;
102) the futility f_cand of a replacement candidate belonging to partition i is scaled by α_i, and the line with the largest scaled futility f_scaled,cand is evicted; the specific implementation method comprises the following steps:
in each eviction, the futility f_cand of a replacement candidate belonging to partition i is scaled by α_i, and the line with the largest scaled futility f_scaled,cand is evicted:
f_scaled,cand = f_cand × α_i    (1)
where f_cand is the futility of a replacement candidate, α_i is the partition i scaling factor, and f_scaled,cand is the scaled futility of the candidate;
2) introducing two new statistics, namely the data overlap amount and the iteration count, for correction on the basis of futility scaling FS in the cache partitioning so as to optimize performance; the specific implementation method comprises the following steps:
201) correcting the scaling factor through the data overlapping amount between the CGRAs; the specific implementation method comprises the following steps:
if there are two partitions, the actual hit counts are given by the following formulas:
U_1* = U_1 - cnt_12 + cnt_21    (3)
U_2* = U_2 - cnt_21 + cnt_12    (4)
where U_1* and U_2* are the actual hit counts of the two partitions, cnt_12 and cnt_21 are the data overlap amounts, and U_1 and U_2 are the hit counts without considering data overlap;
the scaling factor is corrected by the amount of the overlapped data, and the correction formula is as follows:
α_1* = α_1 × U_1 / U_1*    (5)
α_2* = α_2 × U_2 / U_2*    (6)
where α_1* and α_2* are the partition 1 and partition 2 scaling factors after correction by the data overlap amount, and α_1 and α_2 are the partition 1 and partition 2 scaling factors before correction, respectively;
202) correcting the scaling factor through the iteration times in the CGRA;
203) and correcting the scaling factor by considering the influence of data overlapping and iteration times.
2. The method for partitioning a cache of a reconfigurable system according to claim 1, wherein step 202) is implemented as follows:
the actual number of hits is given by:
U* = U × N    (7)
where U* is the actual hit count of the partition, U is the hit count without considering data overlap, and N is the number of iterations;
the correction formula for the scaling factor is thus as follows:
α_1* = α_1 / N_1    (8)
α_2* = α_2 / N_2    (9)
in the formula N1And N2The number of iterations for partition 1 and partition 2, respectively.
3. The method for partitioning a cache of a reconfigurable system according to claim 2, wherein the step 203) is implemented as follows:
considering the influence of data overlap and iteration number at the same time, the following formula can be obtained:
α_1* = α_1 × U_1 / (N_1 × (U_1 - cnt_12 + cnt_21))    (10)
α_2* = α_2 × U_2 / (N_2 × (U_2 - cnt_21 + cnt_12))    (11)
CN201811377375.4A 2018-11-19 2018-11-19 Cache partition dividing method for reconfigurable system Active CN109710563B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811377375.4A CN109710563B (en) 2018-11-19 2018-11-19 Cache partition dividing method for reconfigurable system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811377375.4A CN109710563B (en) 2018-11-19 2018-11-19 Cache partition dividing method for reconfigurable system

Publications (2)

Publication Number Publication Date
CN109710563A CN109710563A (en) 2019-05-03
CN109710563B true CN109710563B (en) 2020-11-10

Family

ID=66254935

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811377375.4A Active CN109710563B (en) 2018-11-19 2018-11-19 Cache partition dividing method for reconfigurable system

Country Status (1)

Country Link
CN (1) CN109710563B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113505087B (en) * 2021-06-29 2023-08-22 中国科学院计算技术研究所 Cache dynamic dividing method and system considering service quality and utilization rate

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104699629A (en) * 2015-03-16 2015-06-10 清华大学 Sharing on-chip cache dividing device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3258382B1 (en) * 2016-06-14 2021-08-11 Arm Ltd A storage controller

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104699629A (en) * 2015-03-16 2015-06-10 清华大学 Sharing on-chip cache dividing device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CIACP: A Correlation- and Iteration-Aware Cache Partitioning Mechanism to Improve Performance of Multiple Coarse-Grained Reconfigurable Arrays; Chen Yang et al.; IEEE Transactions on Parallel and Distributed Systems; 2017-01-31; vol. 28; pages 29-41, sections 1-5, FIG. 2 *
Futility scaling: High-associativity cache partitioning; R. Wang et al.; IEEE/ACM International Symposium on Microarchitecture; 2014-12-31; pages 356-367 *
Utility-based cache partitioning: a low-overhead, high-performance, runtime mechanism to partition shared caches; M. K. Qureshi et al.; IEEE Computer Society; 2006-12-31; pages 423-432 *

Also Published As

Publication number Publication date
CN109710563A (en) 2019-05-03

Similar Documents

Publication Publication Date Title
US8904154B2 (en) Execution migration
Dybdahl et al. An adaptive shared/private nuca cache partitioning scheme for chip multiprocessors
Sanchez et al. SCD: A scalable coherence directory with flexible sharer set encoding
Kim et al. An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches
Zhao et al. SPATL: Honey, I shrunk the coherence directory
JP3425158B2 (en) Computer system with multi-buffer data cache
Basu et al. Scavenger: A new last level cache architecture with global block priority
Das et al. SLIP: reducing wire energy in the memory hierarchy
CN109710563B (en) Cache partition dividing method for reconfigurable system
Sankaranarayanan et al. An energy efficient GPGPU memory hierarchy with tiny incoherent caches
CN116820773A (en) GPGPU register cache management system
Yang et al. Partially shared cache and adaptive replacement algorithm for NoC-based many-core systems
CN115203076B (en) Data structure optimized private memory caching
Khan et al. Em2: A scalable shared-memory multicore architecture
Lira et al. Replacement techniques for dynamic NUCA cache designs on CMPs
Das et al. A framework for block placement, migration, and fast searching in tiled-DNUCA architecture
BiTalebi et al. LARA: Locality-aware resource allocation to improve GPU memory-access time
Das et al. Random-LRU: a replacement policy for chip multiprocessors
Flores et al. Memory hierarchies for future hpc architectures
Lira et al. Analysis of non-uniform cache architecture policies for chip-multiprocessors using the parsec benchmark suite
Sahoo et al. CAMO: A novel cache management organization for GPGPUs
Kavi et al. A comparative analysis of performance improvement schemes for cache memories
Mehboob et al. Enhance the Performance of Associative Memory by Using New Methods
Zhang et al. Line-coalescing dram cache
El-Moursy et al. V-set cache design for LLC of multi-core processors

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant