CN113392604B - Dynamic capacity expansion method and system for a cache under a multi-CPU co-packaging architecture based on advanced packaging technology - Google Patents

Dynamic capacity expansion method and system for a cache under a multi-CPU co-packaging architecture based on advanced packaging technology

Info

Publication number
CN113392604B
CN113392604B
Authority
CN
China
Prior art keywords
cache
data
slave
master
cpu
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110622895.2A
Other languages
Chinese (zh)
Other versions
CN113392604A (en)
Inventor
李晓霖
郝沁汾
叶笑春
范东睿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CN202110622895.2A
Publication of CN113392604A
Application granted
Publication of CN113392604B
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 - Computer-aided design [CAD]
    • G06F30/30 - Circuit design
    • G06F30/32 - Circuit design at the digital level
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 - Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 - Addressing or allocation; Relocation
    • G06F12/08 - Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806 - Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/084 - Multiuser, multiprocessor or multiprocessing cache systems with a shared cache
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 - Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 - Addressing or allocation; Relocation
    • G06F12/08 - Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802 - Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0877 - Cache access modes
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F2115/00 - Details relating to the type of the circuit
    • G06F2115/12 - Printed circuit boards [PCB] or multi-chip modules [MCM]
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a dynamic capacity expansion method and system for a cache under a multi-CPU co-packaging architecture based on advanced packaging technology, aiming to solve the problems of increased CPU tape-out cost and packaging difficulty caused by enlarging the cache. In this structure, by designing an interaction mechanism between the caches of different CPUs and relying on the packaging technology, the cache in a CPU chip can access the cache in another CPU chip of the same type, thereby dynamically expanding the cache capacity available to the CPU chip and realizing cache sharing among multiple CPUs.

Description

Dynamic capacity expansion method and system for a cache under a multi-CPU co-packaging architecture based on advanced packaging technology
Technical Field
The present invention relates to cache structure design in the field of CPU architecture design, and in particular to a method and system for dynamically expanding a cache under a multi-CPU co-packaging architecture based on advanced packaging technology.
Background
In the era of big data and cloud computing, owing to the sheer number of users and the diversity of data sources, more and more CPU workloads exhibit characteristics such as a low compute-to-memory-access ratio and weak data locality. For example, graph computing is a typical data-center application used to process rapidly growing graph data. Because graph workloads are irregular and unstructured, their execution behavior becomes highly irregular. Irregular fine-grained memory accesses lead to an extremely low cache hit rate and poor utilization of cache blocks, so the existing general-purpose CPU architecture requires larger memory-access bandwidth.
Since CPU memory bandwidth grows slowly, a CPU needs to integrate a larger cache to bridge the gap between ever-increasing CPU performance and memory performance. However, the SRAM that makes up the cache occupies a large area, and chip cost is almost proportional to chip area, so integrating a larger cache tends to raise chip cost. At the same time, larger chips pose significant challenges for single-chip packaging. This puts great pressure on CPU design and packaging.
Disclosure of Invention
The invention builds on advanced packaging technology, which makes interaction between multiple chips located in the same package considerably faster.
The invention aims to solve the problems of increased CPU tape-out cost and packaging difficulty caused by enlarging the cache, and provides a novel CPU cache structure whose capacity can be expanded dynamically. In this structure, by designing an interaction mechanism between the caches of different CPUs and relying on the packaging technology, the cache in one CPU chip can access the cache in another CPU chip of the same type, thereby dynamically expanding the cache capacity available to the CPU chip and realizing cache sharing among multiple CPUs.
The invention has the following key points:
1. By accessing the caches in other CPU chips, the cache capacity available to a CPU is expanded while the area of each individual chip stays minimal, which lowers tape-out cost and eases packaging.
2. The accessed CPU chips are of the same type, so only one kind of CPU chip needs to be designed, reducing both CPU design difficulty and tape-out cost.
3. Even if a tape-out yields some defective CPU chips due to yield problems, a defective chip's cache can still be fully utilized as long as its cache circuitry is intact, reducing the loss caused by failed CPU chips.
4. The invention designs an interaction mechanism between caches. Through this mechanism, the cache of one CPU chip can operate on the cache of another CPU chip of the same type.
5. The cache structure designed by the invention can expand its capacity dynamically: a CPU chip can either work independently, unaffected by other CPU chips, or be combined with them into a CPU with a large cache for running programs with heavy memory-access demands, such as graph computing applications. The structure is therefore flexible.
6. With advanced packaging techniques, multiple CPU chips are integrated into a single package while maintaining performance close to monolithic integration. Therefore, in the cache structure designed by the invention, a single package integrating N CPU chips achieves performance close to that of a single CPU chip with N times the cache capacity.
Specifically, to address the defects of the prior art, the invention provides a dynamic capacity expansion method for a cache under a multi-CPU co-packaging architecture based on advanced packaging technology, comprising:
Step 1: set a CPU that meets a preset condition as the master, and, according to the master's memory-access bandwidth requirement and the cache sizes of the remaining CPUs other than the master, select from the remaining CPUs those whose caches can satisfy that bandwidth requirement and set them as slaves of the master;
Step 2: when the master reads cache data, query whether the read request hits in the master's local cache; if so, read the data from the local cache and return it; otherwise send a request to the slaves and query whether the read request hits in a slave; if so, read the hit slave's cache and return the data; otherwise read the data from memory, return it, and write the read data into the local cache or a slave cache;
Step 3: when the master writes cache data, query whether the write request hits in the master's local cache; if so, read the data from the local cache, merge it with the write data, and write the result back to the local cache; otherwise send a request to the slaves and query whether the write request hits in a slave; if so, write the data into the hit slave's cache; otherwise write the data into the master's or a slave's cache after replacing a block in the local cache or a slave cache.
In the above method for dynamically expanding the cache under the multi-CPU co-packaging architecture based on advanced packaging technology, step 2 comprises:
Step 21: the slave queries whether the read request hits in its local cache; if so, it reads the data from the local cache and returns it to the master; otherwise it is determined whether the slave's local data is to be replaced; if so, the replaced block is sent to the master, and the data sent by the master is received and written into the slave's local cache; otherwise the cache of the master or another slave is chosen for replacement and the flow ends.
In the above method, step 3 comprises:
Step 31: the slave queries whether the write request hits in its local cache; if so, it receives the data sent by the master, merges it with the data in the local cache, and writes the result back to the slave's local cache; otherwise it is determined whether the slave's local data is to be replaced; if so, the replaced block is sent to the master, and the data sent by the master is received and written into the slave's local cache; otherwise the flow ends.
In the above method, the preset condition comprises: the memory-access bandwidth requirement of the CPU is greater than a threshold, or the compute-to-memory-access ratio of the CPU is below a threshold.
In the above method, the master and the slaves are located in the same packaged chip.
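As an illustration of step 1 and the preset condition, the following C sketch shows one way master and slave roles could be assigned. It is a minimal sketch under stated assumptions: the names (needs_expansion, assign_roles), the threshold parameters, and the greedy selection of slaves are illustrative choices, not details fixed by the invention.

    #include <stddef.h>
    #include <stdbool.h>

    enum role { ROLE_GENERAL, ROLE_MASTER, ROLE_SLAVE };

    struct cpu {
        double bw_demand;         /* memory-access bandwidth requirement   */
        double compute_mem_ratio; /* compute ops / memory ops when running */
        size_t cache_bytes;       /* capacity of this chip's local cache   */
        enum role role;           /* starts as ROLE_GENERAL                */
    };

    /* Preset condition from the text: bandwidth demand above a threshold,
     * or compute-to-memory-access ratio below a threshold. */
    bool needs_expansion(const struct cpu *c, double bw_thr, double ratio_thr)
    {
        return c->bw_demand > bw_thr || c->compute_mem_ratio < ratio_thr;
    }

    /* Step 1: pick the first qualifying CPU as master, then enlist other
     * CPUs as slaves until their combined cache covers the assumed need. */
    void assign_roles(struct cpu cpus[], size_t n, double bw_thr,
                      double ratio_thr, size_t needed_cache_bytes)
    {
        size_t master = n; /* n means "no master chosen yet" */
        for (size_t i = 0; i < n; i++) {
            if (needs_expansion(&cpus[i], bw_thr, ratio_thr)) {
                cpus[i].role = ROLE_MASTER;
                master = i;
                break;
            }
        }
        if (master == n) return; /* all CPUs stay in general mode */

        size_t gained = 0;
        for (size_t i = 0; i < n && gained < needed_cache_bytes; i++) {
            if (i == master) continue;
            cpus[i].role = ROLE_SLAVE;  /* its cache joins the pool */
            gained += cpus[i].cache_bytes;
        }
    }

For the four-chip package of FIG. 2, such a routine would mark CPU chip 1 as the master and enlist CPU chips 2-4 as slaves once chip 1's demand crosses the threshold.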
The invention also provides a dynamic capacity expansion system for a cache under a multi-CPU co-packaging architecture based on advanced packaging technology, comprising:
module 1, configured to set a CPU that meets a preset condition as the master and, according to the master's memory-access bandwidth requirement and the cache sizes of the remaining CPUs other than the master, select from the remaining CPUs those whose caches can satisfy that bandwidth requirement as slaves of the master;
module 2, configured to, when the master reads cache data, query whether the read request hits in the master's local cache; if so, read the data from the local cache and return it; otherwise send a request to the slaves and query whether the read request hits in a slave; if so, read the hit slave's cache and return the data; otherwise read the data from memory, return it, and write the read data into the local cache or a slave cache; and
module 3, configured to, when the master writes cache data, query whether the write request hits in the master's local cache; if so, read the data from the local cache, merge it with the write data, and write the result back to the local cache; otherwise send a request to the slaves and query whether the write request hits in a slave; if so, write the data into the hit slave's cache; otherwise write the data into the master's or a slave's cache after replacing a block in the local cache or a slave cache.
In the above system, module 2 comprises:
module 21, configured such that the slave queries whether the read request hits in its local cache; if so, it reads the data from the local cache and returns it to the master; otherwise it is determined whether the slave's local data is to be replaced; if so, the replaced block is sent to the master, and the data sent by the master is received and written into the slave's local cache; otherwise the cache of the master or another slave is chosen for replacement and the flow ends.
In the above system, module 3 comprises:
module 31, configured such that the slave queries whether the write request hits in its local cache; if so, it receives the data sent by the master, merges it with the data in the local cache, and writes the result back to the slave's local cache; otherwise it is determined whether the slave's local data is to be replaced; if so, the replaced block is sent to the master, and the data sent by the master is received and written into the slave's local cache; otherwise the flow ends.
In the above system, the preset condition comprises: the memory-access bandwidth requirement of the CPU is greater than a threshold, or the compute-to-memory-access ratio of the CPU is below a threshold.
In the above system, the master and the slaves are located in the same packaged chip.
The advantages of the invention are as follows: in the structure of the invention, by accessing the caches of other CPU chips and relying on advanced packaging technology, the cache capacity of a CPU chip can be expanded dynamically by a multiple at the cost of only a small area increase. On the one hand, overall chip development cost drops, since the added cost of advanced packaging is lower than the added cost of a larger chip area. On the other hand, within a single package integrating multiple CPU chips, each CPU chip can work independently, unaffected by the others, or the chips can be combined into a CPU with a large cache for running programs with heavy memory-access demands, such as graph computing applications. While gaining this flexibility, only one CPU needs to be designed, so CPU design difficulty and tape-out cost are both reduced.
Drawings
FIG. 1 is a structural diagram of a single packaged chip integrating 4 CPU chips, all 4 in general mode, suited to ordinary computation;
FIG. 2 is a structural diagram of a single packaged chip integrating 4 CPU chips, in which CPU chip 1 is in master mode and the remaining CPU chips are in slave mode, suited to large memory-access bandwidth requirements;
FIG. 3 is a flow chart of reading cache data in general mode;
FIG. 4 is a flow chart of writing cache data in general mode;
FIG. 5 is a flow chart of reading cache data in master mode;
FIG. 6 is a flow chart of writing cache data in master mode;
FIG. 7 is a flow chart of reading cache data in slave mode;
FIG. 8 is a flow chart of writing cache data in slave mode.
Detailed Description
In order to make the above features and effects of the present invention clearer, specific embodiments are described below with reference to the accompanying drawings.
In the cache structure design of the invention, the access mode of a CPU chip is one of the following three modes:
1. General mode. In this mode, the memory-access behavior of the CPU chip is identical to that of a CPU chip that does not use the cache structure of the invention: it neither accesses the caches of other CPU chips nor has its own cache accessed by other CPU chips.
2. Master mode. In this mode, the CPU chip may access the caches of other CPU chips, but its own on-chip cache is not accessed by other CPU chips.
3. Slave mode. In this mode, the CPU chip's own on-chip cache may be accessed by other CPU chips, but the chip does not access the caches of other CPU chips. Moreover, in this mode only the cache of the CPU chip keeps running; the remaining parts of the chip stop working.
The access mode of a CPU chip can be configured statically, for example through top-level pins, or dynamically, through registers inside the chip; a sketch of both paths is given below. Static configuration, as adopted by the invention, means the access mode is set through the CPU's top-level pins before the application program runs and does not change while the program runs. Dynamic configuration, as adopted by the invention, means that during execution the CPU chips configure one another according to the running application. For example, in FIG. 2, when running an application with a large memory-access bandwidth requirement, CPU 1 may, through the pins interconnecting the CPUs, configure the registers in CPU 2/3/4 to slave mode and configure itself as the master. If the running application's bandwidth requirement is small, the chips switch back to general mode and the other CPUs are not needed as slaves.
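The following C sketch illustrates the two configuration paths just described. It is only a sketch under stated assumptions: the mode encoding, the register offset MODE_REG_OFFSET, and remote_reg_write are illustrative names, not interfaces defined by the invention.

    #include <stdint.h>

    enum access_mode { MODE_GENERAL = 0, MODE_MASTER = 1, MODE_SLAVE = 2 };

    #define MODE_REG_OFFSET 0x10u  /* assumed offset of the mode register */

    /* Static path: the mode is latched from top-level pins before the
     * application runs and never changes while it runs. */
    enum access_mode mode_from_pins(uint32_t pin_state)
    {
        return (enum access_mode)(pin_state & 0x3u);
    }

    /* Dynamic path: as in FIG. 2, CPU 1 writes the mode register of
     * CPU 2/3/4 to slave and its own to master. remote_reg_write stands
     * in for whatever register-access transaction the inter-CPU pins
     * provide; it is an assumed primitive. */
    extern void remote_reg_write(int cpu_id, uint32_t offset, uint32_t value);

    void enter_big_cache_mode(int master_id, const int *slave_ids, int n_slaves)
    {
        for (int i = 0; i < n_slaves; i++)
            remote_reg_write(slave_ids[i], MODE_REG_OFFSET, MODE_SLAVE);
        remote_reg_write(master_id, MODE_REG_OFFSET, MODE_MASTER);
    }

Switching back to general mode when bandwidth demand drops would be the symmetric sequence of register writes.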
In the accompanying drawings, FIGS. 1 and 2 show 4 CPU chips based on the cache structure of the invention integrated into a single packaged chip. FIG. 1 shows the 4 CPU chips in general mode, working normally and independently without affecting one another. This mode suits ordinary computation, i.e., a small memory-access bandwidth requirement and a high compute-to-memory-access ratio; this ratio is the proportion of compute operations to memory operations while the CPU runs, and the lower it is, the more memory operations there are and the larger the bandwidth requirement. For computation with a large bandwidth requirement, for example when the CPU's memory-access bandwidth requirement exceeds a threshold or its compute-to-memory-access ratio falls below a threshold, one CPU chip (e.g. CPU chip 1) may be configured in master mode and the other CPU chips in slave mode, as shown in FIG. 2; CPU chip 1 can then quadruple its cache capacity by accessing the caches of the other 3 CPU chips, improving performance. The application is not limited to this: because dynamic configuration is adopted, according to actual demand only one other core may be set to slave mode for the master while the remaining two stay in general mode. In that case CPU chip 1 can still double its own cache capacity.
The CPU chips are interconnected using Intel's Advanced Interface Bus (AIB) protocol. Because the AIB protocol supports high data transfer rates and a compact CPU layout is adopted, the interconnect footprint is minimized. With advanced packaging, a single package integrating N CPU chips in the cache structure of the invention achieves performance close to that of a single CPU chip with N times the cache capacity.
Based on the cache structure design of the invention, the access mode of a CPU chip is one of general mode, master mode, and slave mode. The corresponding flows for reading and writing cache data are shown in FIGS. 3-8 and detailed below.
Reading cache data in general mode (FIG. 3):
1. Query whether the read request hits in the local cache. If it hits, go to step 2; otherwise go to step 3.
2. Local cache hit. Read the data from the local cache and return it. End the flow.
3. Local cache miss. Determine whether the replaced block must be written back to memory; if so, write it back.
4. Read the data from memory, return it, and write it into the local cache. End the flow.
Writing cache data in general mode (FIG. 4):
1. Query whether the write request hits in the local cache. If it hits, go to step 2; otherwise go to step 3.
2. Local cache hit. Read the data from the local cache, merge it with the write data, and write the result back to the local cache. End the flow.
3. Local cache miss. Determine whether the replaced block must be written back to memory; if so, write it back.
4. Read the data from memory, merge it with the write data, and write the result back to the local cache. End the flow.
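The general-mode flows of FIGS. 3 and 4 can be condensed into the following C sketch. The cache-model helpers (cache_lookup, cache_victim, merge, and the memory accessors) are assumed primitives, not interfaces defined by the invention.

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t addr_t;
    typedef struct { uint8_t bytes[64]; } line_t;

    extern bool    cache_lookup(addr_t a, line_t *out);  /* hit? copy line */
    extern line_t *cache_victim(addr_t a, addr_t *vaddr, bool *dirty);
    extern void    cache_fill(addr_t a, const line_t *l);
    extern void    mem_read(addr_t a, line_t *l);
    extern void    mem_write(addr_t a, const line_t *l);
    extern void    merge(line_t *dst, const void *wdata, int off, int len);

    /* FIG. 3: read in general mode. */
    line_t read_general(addr_t a)
    {
        line_t l;
        if (cache_lookup(a, &l))          /* steps 1-2: hit, read, return */
            return l;
        addr_t vaddr; bool dirty;
        line_t *victim = cache_victim(a, &vaddr, &dirty);
        if (dirty)                        /* step 3: write victim back    */
            mem_write(vaddr, victim);
        mem_read(a, &l);                  /* step 4: fetch, fill, return  */
        cache_fill(a, &l);
        return l;
    }

    /* FIG. 4: write in general mode. */
    void write_general(addr_t a, const void *wdata, int off, int len)
    {
        line_t l;
        if (!cache_lookup(a, &l)) {       /* steps 3-4 on a miss          */
            addr_t vaddr; bool dirty;
            line_t *victim = cache_victim(a, &vaddr, &dirty);
            if (dirty)
                mem_write(vaddr, victim);
            mem_read(a, &l);
        }
        merge(&l, wdata, off, len);       /* combine with the write data  */
        cache_fill(a, &l);                /* write the merged line back   */
    }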
Reading cache data in master mode (FIG. 5):
1. The master queries whether the read request hits in its local cache. If it hits, go to step 2; otherwise go to step 3.
2. Local cache hit. Read the data from the local cache and return it. End the flow.
3. Local cache miss. Send a request to the slaves, querying whether the read request hits in a slave, and wait for all slaves to answer.
4. All slaves have answered. If a slave hit, go to step 5; otherwise go to step 6.
5. A slave hit: read the data in that slave's reply and return it. End the flow.
6. No slave hit. Choose to replace a block either in the local cache or in a slave cache. If the local cache is chosen, go to step 7; otherwise go to step 9.
7. Replace in the local cache. Determine whether the replaced block must be written back to memory; if so, write it back.
8. Read the data from memory, return it, and write it into the local cache. End the flow.
9. Replace in a slave cache. Determine whether the slave's replaced block must be written back to memory; if so, read that data from the slave and write it back to memory.
10. Read the data from memory, return it, and write it into the slave cache. End the flow.
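A C sketch of the master-mode read flow above (FIG. 5). Slave queries are shown as sequential calls for brevity; in the described hardware, step 3 broadcasts the request and waits for all slave answers. The helper names are assumptions, and write-back of a dirty victim is folded into the *_evict_fill helpers.

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t addr_t;
    typedef struct { uint8_t bytes[64]; } line_t;

    extern int  n_slaves;
    extern bool local_lookup(addr_t a, line_t *out);
    extern bool slave_query_read(int s, addr_t a, line_t *out); /* hit?  */
    extern bool choose_local_victim(addr_t a); /* replace locally or not */
    extern int  pick_slave(addr_t a);          /* which slave to fill    */
    extern void local_evict_fill(addr_t a, const line_t *l);   /* + WB   */
    extern void slave_evict_fill(int s, addr_t a, const line_t *l);
    extern void mem_read(addr_t a, line_t *l);

    line_t master_read(addr_t a)
    {
        line_t l;
        if (local_lookup(a, &l))            /* steps 1-2: local hit        */
            return l;
        for (int s = 0; s < n_slaves; s++)  /* steps 3-5: query slaves     */
            if (slave_query_read(s, a, &l))
                return l;                   /* a slave hit: use its reply  */
        mem_read(a, &l);                    /* steps 6-10: miss everywhere */
        if (choose_local_victim(a))
            local_evict_fill(a, &l);        /* replace in the local cache  */
        else
            slave_evict_fill(pick_slave(a), a, &l); /* or in a slave cache */
        return l;
    }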
Writing cache data in master mode (FIG. 6):
1. The master queries whether the write request hits in its local cache. If it hits, go to step 2; otherwise go to step 3.
2. Local cache hit. Read the data from the local cache, merge it with the write data, and write the result back to the local cache. End the flow.
3. Local cache miss. Send a request to the slaves, querying whether the write request hits in a slave, and wait for all slaves to answer.
4. All slaves have answered. If a slave hit, go to step 5; otherwise go to step 6.
5. A slave hit: write the data into the hit slave's cache. End the flow.
6. No slave hit. Choose to replace a block either in the local cache or in a slave cache. If the local cache is chosen, go to step 7; otherwise go to step 9.
7. Replace in the local cache. Determine whether the replaced block must be written back to memory; if so, write it back.
8. Read the data from memory, merge it with the write data, and write the result back to the local cache. End the flow.
9. Replace in a slave cache. Determine whether the slave's replaced block must be written back to memory; if so, read that data from the slave and write it back to memory.
10. Read the data from memory, merge the read data with the write data, and write the result into the slave cache. End the flow.
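The matching sketch for the master-mode write flow (FIG. 6), with the same assumed helpers plus a merge step. On a slave hit the master forwards the write data and the slave merges it locally, as FIG. 8 describes.

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t addr_t;
    typedef struct { uint8_t bytes[64]; } line_t;

    extern int  n_slaves;
    extern bool local_lookup(addr_t a, line_t *out);
    extern void local_update(addr_t a, const line_t *l);
    extern bool slave_query_write(int s, addr_t a);      /* hit in s?    */
    extern void slave_write(int s, addr_t a,             /* slave merges */
                            const void *wdata, int off, int len);
    extern bool choose_local_victim(addr_t a);
    extern int  pick_slave(addr_t a);
    extern void local_evict_fill(addr_t a, const line_t *l);   /* + WB   */
    extern void slave_evict_fill(int s, addr_t a, const line_t *l);
    extern void mem_read(addr_t a, line_t *l);
    extern void merge(line_t *dst, const void *wdata, int off, int len);

    void master_write(addr_t a, const void *wdata, int off, int len)
    {
        line_t l;
        if (local_lookup(a, &l)) {          /* steps 1-2: local hit       */
            merge(&l, wdata, off, len);
            local_update(a, &l);
            return;
        }
        for (int s = 0; s < n_slaves; s++) {/* steps 3-5: query slaves    */
            if (slave_query_write(s, a)) {
                slave_write(s, a, wdata, off, len);
                return;
            }
        }
        mem_read(a, &l);                    /* steps 6-10: full miss      */
        merge(&l, wdata, off, len);         /* combine with write data    */
        if (choose_local_victim(a))
            local_evict_fill(a, &l);
        else
            slave_evict_fill(pick_slave(a), a, &l);
    }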
Reading cache data in slave mode (FIG. 7):
1. The slave queries whether the read request hits in its local cache. If it hits, go to step 2; otherwise go to step 3.
2. Local cache hit. Read the data from the local cache and return it to the master. End the flow.
3. Both the master and this slave missed. It is determined whether the block fetched from memory will replace a line in this slave's local cache. If the master's cache (or another slave's) is chosen for the replacement instead, end the flow; otherwise go to step 4.
4. Determine whether the replaced block must be written back to memory; if so, send the replaced block to the master.
5. Receive the data sent by the master and write it into the local cache. End the flow.
Writing cache data in slave mode (FIG. 8):
1. The slave queries whether the write request hits in its local cache. If it hits, go to step 2; otherwise go to step 3.
2. Local cache hit. Receive the data sent by the master, merge it with the data in the local cache, and write the result back to the local cache. End the flow.
3. Both the master and this slave missed. It is determined whether a line in this slave's local cache is to be replaced. If the master's cache is chosen for the replacement instead, end the flow; otherwise go to step 4.
4. Determine whether the replaced block must be written back to memory; if so, send the replaced block to the master.
5. Receive the data sent by the master and write it into the local cache. End the flow.
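Finally, a sketch of the slave-side handlers for FIGS. 7 and 8. Here replace_here stands for the master's replacement decision in step 3, fill is the line the master fetched from memory, and send_to_master carries both hit replies and write-back blocks; all of these names are illustrative assumptions.

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t addr_t;
    typedef struct { uint8_t bytes[64]; } line_t;

    extern bool local_lookup(addr_t a, line_t *out);
    extern void local_fill(addr_t a, const line_t *l);
    extern bool victim_dirty(addr_t a, line_t *victim, addr_t *vaddr);
    extern void send_to_master(addr_t a, const line_t *l);
    extern void merge(line_t *dst, const void *wdata, int off, int len);

    /* FIG. 7: read request from the master; returns true on a hit. */
    bool slave_handle_read(addr_t a, bool replace_here, const line_t *fill)
    {
        line_t l;
        if (local_lookup(a, &l)) {       /* steps 1-2: hit, reply to master */
            send_to_master(a, &l);
            return true;
        }
        if (!replace_here)               /* step 3: line placed elsewhere   */
            return false;
        line_t victim; addr_t vaddr;     /* steps 4-5: evict, then install  */
        if (victim_dirty(a, &victim, &vaddr))
            send_to_master(vaddr, &victim); /* master writes it to memory   */
        local_fill(a, fill);
        return false;
    }

    /* FIG. 8: write request from the master. */
    void slave_handle_write(addr_t a, bool replace_here,
                            const void *wdata, int off, int len,
                            const line_t *fill)
    {
        line_t l;
        if (local_lookup(a, &l)) {       /* steps 1-2: merge and write back */
            merge(&l, wdata, off, len);
            local_fill(a, &l);
            return;
        }
        if (!replace_here)               /* step 3 */
            return;
        line_t victim; addr_t vaddr;     /* steps 4-5 */
        if (victim_dirty(a, &victim, &vaddr))
            send_to_master(vaddr, &victim);
        local_fill(a, fill);
    }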
The following is a system embodiment corresponding to the above method embodiment, and the two embodiments can be implemented in cooperation with each other. The related technical details mentioned in the above embodiment remain valid in this embodiment and, to reduce repetition, are not repeated here; correspondingly, the related technical details mentioned in this embodiment also apply to the above embodiment.
The invention also provides a dynamic capacity expansion system for a cache under a multi-CPU co-packaging architecture based on advanced packaging technology, comprising:
module 1, configured to set a CPU that meets a preset condition as the master and, according to the master's memory-access bandwidth requirement and the cache sizes of the remaining CPUs other than the master, select from the remaining CPUs those whose caches can satisfy that bandwidth requirement as slaves of the master;
module 2, configured to, when the master reads cache data, query whether the read request hits in the master's local cache; if so, read the data from the local cache and return it; otherwise send a request to the slaves and query whether the read request hits in a slave; if so, read the hit slave's cache and return the data; otherwise read the data from memory, return it, and write the read data into the local cache or a slave cache; and
module 3, configured to, when the master writes cache data, query whether the write request hits in the master's local cache; if so, read the data from the local cache, merge it with the write data, and write the result back to the local cache; otherwise send a request to the slaves and query whether the write request hits in a slave; if so, write the data into the hit slave's cache; otherwise write the data into the master's or a slave's cache after replacing a block in the local cache or a slave cache.
In the above system, module 2 comprises:
module 21, configured such that the slave queries whether the read request hits in its local cache; if so, it reads the data from the local cache and returns it to the master; otherwise it is determined whether the slave's local data is to be replaced; if so, the replaced block is sent to the master, and the data sent by the master is received and written into the slave's local cache; otherwise the cache of the master or another slave is chosen for replacement and the flow ends.
In the above system, module 3 comprises:
module 31, configured such that the slave queries whether the write request hits in its local cache; if so, it receives the data sent by the master, merges it with the data in the local cache, and writes the result back to the slave's local cache; otherwise it is determined whether the slave's local data is to be replaced; if so, the replaced block is sent to the master, and the data sent by the master is received and written into the slave's local cache; otherwise the flow ends.
In the above system, the preset condition comprises: the memory-access bandwidth requirement of the CPU is greater than a threshold, or the compute-to-memory-access ratio of the CPU is below a threshold.
In the above system, the master and the slaves are located in the same packaged chip.

Claims (6)

1. A dynamic capacity expansion method for a cache under a multi-CPU co-packaging architecture based on advanced packaging technology, characterized by comprising the following steps:
Step 1: set a CPU that meets a preset condition as the master, and, according to the master's memory-access bandwidth requirement and the cache sizes of the remaining CPUs other than the master, select from the remaining CPUs those whose caches can satisfy that bandwidth requirement and set them as slaves of the master; the master and the slaves are located in the same packaged chip; the preset condition is that the memory-access bandwidth requirement of the CPU is greater than a threshold, or the compute-to-memory-access ratio of the CPU is below a threshold;
Step 2: when the master reads cache data, query whether the read request hits in the master's local cache; if so, read the data from the local cache and return it; otherwise send a request to the slaves and query whether the read request hits in a slave; if so, read the hit slave's cache and return the data; otherwise read the data from memory, return it, and write the read data into the local cache or a slave cache;
Step 3: when the master writes cache data, query whether the write request hits in the master's local cache; if so, read the data from the local cache, merge it with the write data, and write the result back to the local cache; otherwise send a request to the slaves and query whether the write request hits in a slave; if so, write the data into the hit slave's cache; otherwise write the data into the master's or a slave's cache after replacing a block in the local cache or a slave cache.
2. The method for dynamically expanding a cache under a multi-CPU co-packaging architecture according to claim 1, wherein step 2 comprises:
Step 21: the slave queries whether the read request hits in its local cache; if so, it reads the data from the local cache and returns it to the master; otherwise it is determined whether the slave's local data is to be replaced; if so, the replaced block is sent to the master, and the data sent by the master is received and written into the slave's local cache; otherwise the cache of the master or another slave is chosen for replacement and the flow ends.
3. The method for dynamically expanding a cache under a multi-CPU co-packaging architecture based on advanced packaging technology according to claim 1, wherein step 3 comprises:
Step 31: the slave queries whether the write request hits in its local cache; if so, it receives the data sent by the master, merges it with the data in the local cache, and writes the result back to the slave's local cache; otherwise it is determined whether the slave's local data is to be replaced; if so, the replaced block is sent to the master, and the data sent by the master is received and written into the slave's local cache; otherwise the flow ends.
4. A dynamic capacity expansion system for a cache under a multi-CPU co-packaging architecture based on advanced packaging technology, comprising:
module 1, configured to set a CPU that meets a preset condition as the master and, according to the master's memory-access bandwidth requirement and the cache sizes of the remaining CPUs other than the master, select from the remaining CPUs those whose caches can satisfy that bandwidth requirement as slaves of the master; the master and the slaves are located in the same packaged chip; the preset condition is that the memory-access bandwidth requirement of the CPU is greater than a threshold, or the compute-to-memory-access ratio of the CPU is below a threshold;
module 2, configured to, when the master reads cache data, query whether the read request hits in the master's local cache; if so, read the data from the local cache and return it; otherwise send a request to the slaves and query whether the read request hits in a slave; if so, read the hit slave's cache and return the data; otherwise read the data from memory, return it, and write the read data into the local cache or a slave cache; and
module 3, configured to, when the master writes cache data, query whether the write request hits in the master's local cache; if so, read the data from the local cache, merge it with the write data, and write the result back to the local cache; otherwise send a request to the slaves and query whether the write request hits in a slave; if so, write the data into the hit slave's cache; otherwise write the data into the master's or a slave's cache after replacing a block in the local cache or a slave cache.
5. The dynamic capacity expansion system for a cache under a multi-CPU co-packaging architecture based on advanced packaging technology according to claim 4, wherein module 2 comprises:
module 21, configured such that the slave queries whether the read request hits in its local cache; if so, it reads the data from the local cache and returns it to the master; otherwise it is determined whether the slave's local data is to be replaced; if so, the replaced block is sent to the master, and the data sent by the master is received and written into the slave's local cache; otherwise the cache of the master or another slave is chosen for replacement and the flow ends.
6. The dynamic capacity expansion system for a cache under a multi-CPU co-packaging architecture based on advanced packaging technology according to claim 4, wherein module 3 comprises:
module 31, configured such that the slave queries whether the write request hits in its local cache; if so, it receives the data sent by the master, merges it with the data in the local cache, and writes the result back to the slave's local cache; otherwise it is determined whether the slave's local data is to be replaced; if so, the replaced block is sent to the master, and the data sent by the master is received and written into the slave's local cache; otherwise the flow ends.
CN202110622895.2A 2021-06-04 2021-06-04 Dynamic capacity expansion method and system for cache under multi-CPU co-packaging architecture based on advanced packaging technology Active CN113392604B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110622895.2A CN113392604B (en) 2021-06-04 2021-06-04 Dynamic capacity expansion method and system for cache under multi-CPU co-packaging architecture based on advanced packaging technology

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110622895.2A CN113392604B (en) 2021-06-04 2021-06-04 Dynamic capacity expansion method and system for cache under multi-CPU co-packaging architecture based on advanced packaging technology

Publications (2)

Publication Number Publication Date
CN113392604A CN113392604A (en) 2021-09-14
CN113392604B (en) 2023-08-01

Family

ID=77618188

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110622895.2A Active CN113392604B (en) Dynamic capacity expansion method and system for cache under multi-CPU co-packaging architecture based on advanced packaging technology

Country Status (1)

Country Link
CN (1) CN113392604B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5608890A (en) * 1992-07-02 1997-03-04 International Business Machines Corporation Data set level cache optimization
CN101706755A (en) * 2009-11-24 2010-05-12 中国科学技术大学苏州研究院 Caching collaboration system of on-chip multi-core processor and cooperative processing method thereof
CN103370696A (en) * 2010-12-09 2013-10-23 国际商业机器公司 Multicore system, and core data reading method
CN107111553A (en) * 2015-01-13 2017-08-29 高通股份有限公司 System and method for providing dynamic caching extension in many cluster heterogeneous processor frameworks

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5608890A (en) * 1992-07-02 1997-03-04 International Business Machines Corporation Data set level cache optimization
CN101706755A (en) * 2009-11-24 2010-05-12 中国科学技术大学苏州研究院 Caching collaboration system of on-chip multi-core processor and cooperative processing method thereof
CN103370696A (en) * 2010-12-09 2013-10-23 国际商业机器公司 Multicore system, and core data reading method
CN107111553A (en) * 2015-01-13 2017-08-29 高通股份有限公司 System and method for providing dynamic caching extension in many cluster heterogeneous processor frameworks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research and Implementation of On-Chip Cache Sharing Strategies for Multiprocessors; Liang Chenghao; China Master's Theses Full-text Database, Information Science and Technology; I137-6 *

Also Published As

Publication number Publication date
CN113392604A (en) 2021-09-14

Similar Documents

Publication Publication Date Title
US7490217B2 (en) Design structure for selecting memory busses according to physical memory organization information stored in virtual address translation tables
Carvalho The gap between processor and memory speeds
US7539842B2 (en) Computer memory system for selecting memory buses according to physical memory organization information stored in virtual address translation tables
US20150261698A1 (en) Memory system, memory module, memory module access method, and computer system
US20080077732A1 (en) Memory module system and method for operating a memory module
WO2020103058A1 (en) Programmable operation and control chip, a design method, and device comprising same
CN103150216A (en) SoC-integrated multi-port DDR2/3 scheduler and scheduling method
US20130191587A1 (en) Memory control device, control method, and information processing apparatus
US20180336034A1 (en) Near memory computing architecture
CN104409099A (en) FPGA (field programmable gate array) based high-speed eMMC (embedded multimedia card) array controller
CN113392604B (en) Dynamic capacity expansion method and system for cache under multi-CPU co-packaging architecture based on advanced packaging technology
Cho et al. A case for cxl-centric server processors
US11281397B2 (en) Stacked memory device performing function-in-memory (FIM) operation and method of operating the same
CN114240731B (en) Distributed storage interconnection structure, video card and memory access method of graphics processor
Chen et al. MIMS: Towards a message interface based memory system
US20230195368A1 (en) Write Request Buffer
CN111258949A (en) Loongson 3A +7A + FPGA-based heterogeneous computer module
Wang et al. Alloy: Parallel-serial memory channel architecture for single-chip heterogeneous processor systems
US8516179B2 (en) Integrated circuit with coupled processing cores
US11928039B1 (en) Data-transfer test mode
CN217588059U (en) Processor system
US20240079036A1 (en) Standalone Mode
CN116627880B (en) PCIe Switch supporting RAID acceleration and RAID acceleration method thereof
US20240004560A1 (en) Efficient memory power control operations
US6260105B1 (en) Memory controller with a plurality of memory address buses

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant