CN113392604A - Dynamic capacity expansion method and system for a cache under a multi-CPU (Central Processing Unit) co-packaged architecture based on advanced packaging technology

Info

Publication number
CN113392604A
Authority
CN
China
Prior art keywords
cache
data
slave
master
cpu
Prior art date
Legal status
Granted
Application number
CN202110622895.2A
Other languages
Chinese (zh)
Other versions
CN113392604B (en)
Inventor
李晓霖
郝沁汾
叶笑春
范东睿
Current Assignee
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS
Priority to CN202110622895.2A
Publication of CN113392604A
Application granted
Publication of CN113392604B
Status: Active

Classifications

    • G06F 30/32 Computer-aided design [CAD]: circuit design at the digital level
    • G06F 12/084 Multiuser, multiprocessor or multiprocessing cache systems with a shared cache
    • G06F 12/0877 Cache access modes
    • G06F 2115/12 Printed circuit boards [PCB] or multi-chip modules [MCM]
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a dynamic capacity expansion method and system for the cache under a multi-CPU co-packaged architecture based on advanced packaging technology. It aims to solve the increased chip cost and packaging difficulty caused by enlarging the cache of a CPU chip, and provides a novel CPU cache structure whose capacity can be expanded dynamically. In this structure, by designing an interaction mechanism between the caches of different CPUs and relying on advanced packaging, the cache in one CPU chip can access the caches in identical CPU chips, thereby dynamically expanding the cache capacity available to the chip and realizing cache sharing among multiple CPUs.

Description

Dynamic capacity expansion method and system for a cache under a multi-CPU (Central Processing Unit) co-packaged architecture based on advanced packaging technology
Technical Field
The invention relates to cache structure design in the field of CPU architecture design, and in particular to a dynamic capacity expansion method and system for the cache under a multi-CPU co-packaged architecture based on advanced packaging technology.
Background
In the era of big data and cloud computing, the diversity of users and data sources means that more and more CPU workloads exhibit a low compute-to-memory-access ratio and weak data locality. Graph computing, for example, is a typical data-center application used to process ever-growing graph data quickly. The irregular, unstructured nature of graph workloads makes their execution behavior highly irregular; their irregular, fine-grained memory accesses give the cache an extremely low hit rate and leave cache blocks poorly utilized, so the existing general-purpose CPU architecture requires larger memory-access bandwidth.
Because CPU memory-access bandwidth grows slowly, the CPU must integrate a larger cache to bridge the widening gap between CPU performance and memory performance. However, the SRAM that makes up a cache occupies a large area, and chip cost is almost proportional to die area, so integrating a larger cache inevitably raises chip cost. A larger die also makes single-chip packaging far more challenging. This poses significant challenges to CPU design and packaging.
Disclosure of Invention
The invention is based on advanced packaging technology, through which the interaction between multiple chips in the same package becomes much faster.
The invention aims to solve the increased chip cost and packaging difficulty caused by enlarging the cache of a CPU chip, and provides a novel CPU cache structure whose capacity can be expanded dynamically. In this structure, by designing an interaction mechanism between the caches of different CPUs and relying on packaging technology, the cache in one CPU chip can access the caches in identical CPU chips, thereby dynamically expanding the cache capacity available to the chip and realizing cache sharing among multiple CPUs.
The invention has the following key points:
1. by accessing the cache in other CPU chips, the cache capacity in the CPU chip is expanded, the extremely small area of the chip is increased, the chip feeding cost is reduced, and the packaging difficulty is reduced.
2. The accessed CPU chips are of the same kind, so only one CPU chip needs to be designed, and the CPU design difficulty and the CPU chip casting cost are reduced.
3. Even if a CPU chip partially fails due to the problem of good product rate in the tape-out, the cache can still be fully utilized as long as the internal cache circuit is correct, thereby reducing the loss caused by the failed CPU chip.
4. The invention designs an interaction mechanism between caches. By this mechanism, the cache of one CPU chip can operate on the cache of another CPU chip of the same type.
5. The cache structure designed by the invention can dynamically expand the capacity, so that the CPU chips can work independently without being influenced by other CPU chips, and can be combined into a CPU chip with a high-capacity cache for running programs with large access and storage requirements, such as graph calculation application, and the cache structure has flexibility.
6. With advanced packaging techniques, multiple CPU chips are integrated into a single package while maintaining performance close to monolithic integration. Therefore, in the cache structure designed by the invention, the single chip package of N CPU chips is integrated, and the performance of the cache structure is close to that of a single CPU chip with N times of cache capacity expansion.
Specifically, to overcome the defects of the prior art, the present invention provides a dynamic capacity expansion method for the cache under a multi-CPU co-packaged architecture based on advanced packaging technology, the method comprising:
Step 1: set a CPU that meets a preset condition as the master end, and, according to the master end's memory-access bandwidth demand and the cache sizes of the other CPUs, select from those CPUs the ones that can satisfy the demand as slave ends of the master end;
Step 2: when the master end reads cache data, query whether the read request hits in the master end's local cache; if so, read the data from the local cache and return it; otherwise send a request to the slave ends and query whether the read request hits in a slave end; if so, read the hitting slave end's cache and return the data; otherwise read the data from memory and return it, while writing the read data into the local cache or a slave end's cache;
Step 3: when the master end writes cache data, query whether the write request hits in the master end's local cache; if so, read the data from the local cache, merge it with the write data, and write it back to the local cache; otherwise send a request to the slave ends and query whether the write request hits in a slave end; if so, write the data into the hitting slave end's cache; otherwise replace a block in the local or a slave end's cache and write the data into the master end's or the slave end's cache.
In the above dynamic capacity expansion method for the cache under the multi-CPU co-packaged architecture based on advanced packaging technology, step 2 comprises:
Step 21: the slave end queries whether the read request hits in its local cache. If it hits, the slave end reads the data from the local cache and returns it to the master end. Otherwise it judges whether its own local data has been selected for replacement: if so, it sends the replaced block to the master end, then receives the data sent by the master end and writes it back into its local cache; if not, the cache of the master end or of another slave end is selected for replacement, and the flow ends.
In the above dynamic capacity expansion method for the cache under the multi-CPU co-packaged architecture based on advanced packaging technology, step 3 comprises:
Step 31: the slave end queries whether the write request hits in its local cache. If it hits, the slave end receives the data sent by the master end, merges it with the data in its local cache, and writes the result back into its local cache. Otherwise it judges whether its own local data has been selected for replacement: if so, it sends the replaced block to the master end, then receives the data sent by the master end and writes it back into its local cache; if not, the flow ends.
In the above dynamic capacity expansion method for the cache under the multi-CPU co-packaged architecture based on advanced packaging technology, the preset condition comprises: the memory-access bandwidth demand of the CPU is greater than a threshold, or the compute-to-memory-access ratio of the CPU is below a threshold.
In the above dynamic capacity expansion method for the cache under the multi-CPU co-packaged architecture based on advanced packaging technology, the master end and the slave ends are located in the same packaged chip.
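As an illustration of step 1, the following minimal C++ sketch shows one way the master end could be chosen and slave ends enlisted. It is a sketch under assumptions, not the invention's implementation: the CpuInfo fields, the threshold test, and the greedy accumulation of slave cache capacity are hypothetical names introduced here for clarity.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical per-CPU descriptor; all field names are illustrative.
struct CpuInfo {
    int         id;
    double      bandwidth_demand;      // measured memory-access bandwidth demand
    double      compute_access_ratio;  // compute ops per memory-access op
    std::size_t cache_bytes;           // size of this CPU's local cache
    bool        assigned = false;      // already enlisted as a slave end
};

// Step 1 condition: a CPU qualifies as master end when its bandwidth demand
// exceeds a threshold or its compute-to-memory-access ratio falls below one.
bool qualifies_as_master(const CpuInfo& c, double bw_thresh, double ratio_thresh) {
    return c.bandwidth_demand > bw_thresh || c.compute_access_ratio < ratio_thresh;
}

// Greedily enlist other CPUs as slave ends until their combined cache size
// meets the master end's demand (a stand-in for the unspecified selection rule).
std::vector<int> select_slaves(std::vector<CpuInfo>& others,
                               std::size_t needed_cache_bytes) {
    std::vector<int> slaves;
    std::size_t gathered = 0;
    for (CpuInfo& c : others) {
        if (gathered >= needed_cache_bytes) break;
        if (c.assigned) continue;
        c.assigned = true;
        gathered += c.cache_bytes;
        slaves.push_back(c.id);
    }
    return slaves;
}
```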
The invention also provides a dynamic capacity expansion system for the cache under a multi-CPU co-packaged architecture based on advanced packaging technology, comprising:
Module 1, configured to set a CPU that meets a preset condition as the master end, and, according to the master end's memory-access bandwidth demand and the cache sizes of the other CPUs, select from those CPUs the ones that can satisfy the demand as slave ends of the master end;
Module 2, configured to query, when the master end reads cache data, whether the read request hits in the master end's local cache; if so, read the data from the local cache and return it; otherwise send a request to the slave ends and query whether the read request hits in a slave end; if so, read the hitting slave end's cache and return the data; otherwise read the data from memory and return it, while writing the read data into the local cache or a slave end's cache;
Module 3, configured to query, when the master end writes cache data, whether the write request hits in the master end's local cache; if so, read the data from the local cache, merge it with the write data, and write it back to the local cache; otherwise send a request to the slave ends and query whether the write request hits in a slave end; if so, write the data into the hitting slave end's cache; otherwise replace a block in the local or a slave end's cache and write the data into the master end's or the slave end's cache.
In the above dynamic capacity expansion system for the cache under the multi-CPU co-packaged architecture based on advanced packaging technology, module 2 comprises:
Module 21, configured so that the slave end queries whether the read request hits in its local cache; if it hits, reads the data from the local cache and returns it to the master end; otherwise judges whether its own local data has been selected for replacement; if so, sends the replaced block to the master end, then receives the data sent by the master end and writes it back into its local cache; if not, selects the cache of the master end or of another slave end for replacement, and the flow ends.
In the above dynamic capacity expansion system for the cache under the multi-CPU co-packaged architecture based on advanced packaging technology, module 3 comprises:
Module 31, configured so that the slave end queries whether the write request hits in its local cache; if it hits, receives the data sent by the master end, merges it with the data in its local cache, and writes the result back into its local cache; otherwise judges whether its own local data has been selected for replacement; if so, sends the replaced block to the master end, then receives the data sent by the master end and writes it back into its local cache; if not, the flow ends.
In the above dynamic capacity expansion system for the cache under the multi-CPU co-packaged architecture based on advanced packaging technology, the preset condition comprises: the memory-access bandwidth demand of the CPU is greater than a threshold, or the compute-to-memory-access ratio of the CPU is below a threshold.
In the above dynamic capacity expansion system for the cache under the multi-CPU co-packaged architecture based on advanced packaging technology, the master end and the slave ends are located in the same packaged chip.
According to the above scheme, the invention has the following advantages. In the structure of the invention, by accessing the caches of other CPU chips and relying on advanced packaging technology, the cache of a CPU chip can expand its capacity severalfold at the cost of only a small area increase. On the one hand, overall chip cost is reduced, since the added cost of advanced packaging is lower than the added cost of taping out a larger die. On the other hand, within a single package integrating multiple CPU chips, each chip can work independently, unaffected by the others, or the chips can be combined into one CPU with a large-capacity cache to run programs with heavy memory-access demands, such as graph computing applications. This flexibility is obtained while only one CPU needs to be designed, reducing CPU design difficulty and tape-out cost.
Drawings
FIG. 1 is a structural diagram of a single packaged chip integrating 4 CPU chips, where all 4 chips are in the general mode and suited to general computation;
FIG. 2 is a structural diagram of a single packaged chip integrating 4 CPU chips, where CPU chip 1 is in the master mode and the other CPU chips are in the slave mode, suited to large memory-access bandwidth demands;
FIG. 3 is a flow chart of reading cache data in the general mode;
FIG. 4 is a flow chart of writing cache data in the general mode;
FIG. 5 is a flow chart of reading cache data in the master mode;
FIG. 6 is a flow chart of writing cache data in the master mode;
FIG. 7 is a flow chart of reading cache data in the slave mode;
FIG. 8 is a flow chart of writing cache data in the slave mode.
Detailed Description
In order to make the aforementioned features and effects of the present invention more comprehensible, embodiments accompanied with figures are described in detail below.
In the cache structure design of the invention, the memory-access mode of a CPU chip is divided into the following three types:
1. General mode. In this mode, the memory-access behavior of the CPU chip is identical to that of a CPU chip not designed with this cache structure: it cannot access the caches of other CPU chips, and its own cache cannot be accessed by other CPU chips.
2. Master mode. In this mode, the CPU chip can access the caches of other CPU chips, but its own cache cannot be accessed by other CPU chips.
3. Slave mode. In this mode, the cache of the CPU chip can be accessed by other CPU chips, but the chip cannot access the caches of other CPU chips. Meanwhile, in this mode only the cache of the CPU chip keeps operating; the remaining parts stop working.
The memory-access mode of a CPU chip can be configured statically, for example through top-level pins, or dynamically, through registers inside the chip. In the static configuration adopted by the invention, the memory-access mode is set through the CPU's top-level pins before the application runs and does not change while the application is running. In the dynamic configuration adopted by the invention, the remaining CPU chips are configured during execution according to the running application. For example, in Fig. 2, when running an application with a large memory-access bandwidth demand, CPU1 can write, through the pins interconnecting the CPUs, the registers inside CPU2/3/4 to set them as slave ends and itself as the master end. If the running application's bandwidth demand is small, the chips switch back to the general mode and no other CPUs are needed as slave ends.
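As an illustration of the dynamic configuration path just described, the sketch below models the mode registers and the reconfiguration performed by CPU1 in the Fig. 2 example. The AccessMode encodings, register layout, and function names are assumptions of this sketch, not definitions from the invention.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// The three memory-access modes described above; encodings are illustrative.
enum class AccessMode : uint8_t { General = 0, Master = 1, Slave = 2 };

// Hypothetical mode register, one per CPU chip, reachable over the
// inter-CPU pins inside the package.
struct ModeRegister { volatile uint8_t mode; };

// Dynamic configuration as in the Fig. 2 example: before a run with a
// large memory-access bandwidth demand, CPU1 makes itself the master end
// and demotes CPU2/3/4 to slave ends.
void configure_for_large_bandwidth(std::array<ModeRegister*, 4>& regs) {
    regs[0]->mode = static_cast<uint8_t>(AccessMode::Master);
    for (std::size_t i = 1; i < regs.size(); ++i)
        regs[i]->mode = static_cast<uint8_t>(AccessMode::Slave);
}

// When the bandwidth demand is small again, all chips return to the
// general mode and work independently.
void configure_for_general_compute(std::array<ModeRegister*, 4>& regs) {
    for (ModeRegister* r : regs)
        r->mode = static_cast<uint8_t>(AccessMode::General);
}
```

The static path would achieve the same effect by strapping the top-level pins before boot, at the price of fixing the mode for the whole run.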
In the description of the drawings, Figs. 1 and 2 show 4 CPU chips designed with the cache structure of the invention integrated in a single packaged chip. Fig. 1 shows the 4 CPU chips in the general mode, operating normally and independently without affecting one another. This mode suits ordinary computation, i.e., low memory-access bandwidth demand and a high compute-to-memory-access ratio; this ratio is the proportion of compute operations to memory-access operations during execution, and the lower it is, the more memory-access operations there are and the greater the bandwidth demand. If a computation with a large memory-access bandwidth demand is required, for example when the CPU's bandwidth demand exceeds a threshold or its compute-to-memory-access ratio falls below a threshold, then as shown in Fig. 2, one CPU chip (for example, CPU chip 1) can be configured in the master mode and the remaining chips in the slave mode; CPU chip 1 can then quadruple its cache capacity by accessing the caches of the other 3 CPU chips, improving the performance of the computation. The application is not limited to this: since dynamic configuration is adopted, according to the CPU's actual demand only one core may be set to the master mode and one other to the slave mode, with the remaining two still in the general mode. In that case, CPU chip 1 can still double its own cache capacity.
The CPU chips are interconnected with Intel's Advanced Interface Bus (AIB) protocol. Because the AIB protocol supports a high data transmission rate and adopts a compact layout, the occupied area is minimized. By means of advanced packaging, a single package integrating N CPU chips designed with this cache structure achieves performance close to a single CPU chip with N times the cache capacity.
Based on the cache structure design of the invention, the memory-access mode of a CPU chip is divided into the general mode, the master mode and the slave mode. The flows for reading and writing cache data in the three modes are shown in FIGS. 3-8 and described below; each flow is followed by a short illustrative C++ sketch, which is a toy model added in this text for clarity, not the invention's implementation.
Reading cache data in the general mode (sketch below):
1. Query whether the read request hits in the local cache. If so, jump to step 2; otherwise jump to step 3.
2. Local cache hit. Read the data from the local cache and return it. The flow ends.
3. Local cache miss. Judge whether the replaced block needs to be written back to memory; if so, write it back to memory.
4. Read the data from memory, return it, and write it into the local cache. The flow ends.
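A minimal sketch of the general-mode read flow, assuming a toy direct-mapped cache with one word per line and a map-backed memory; the sizes and names are illustrative assumptions.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Toy direct-mapped cache standing in for a CPU's local cache.
constexpr std::size_t kSets = 64;
struct CacheLine { bool valid = false; bool dirty = false; uint64_t tag = 0; uint64_t data = 0; };
std::array<CacheLine, kSets> local_cache;
std::unordered_map<uint64_t, uint64_t> memory;   // backing store

std::size_t set_of(uint64_t addr) { return addr % kSets; }

// General-mode read, following steps 1-4 above.
uint64_t read_general(uint64_t addr) {
    CacheLine& line = local_cache[set_of(addr)];
    if (line.valid && line.tag == addr)           // steps 1-2: local hit,
        return line.data;                         //   return the data
    if (line.valid && line.dirty)                 // step 3: miss; write the
        memory[line.tag] = line.data;             //   replaced block back
    line = {true, false, addr, memory[addr]};     // step 4: refill from memory
    return line.data;
}
```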
Writing cache data in the general mode (sketch below):
1. Query whether the write request hits in the local cache. If so, jump to step 2; otherwise jump to step 3.
2. Local cache hit. Read the data from the local cache, merge it with the write data, and write it back to the local cache. The flow ends.
3. Local cache miss. Judge whether the replaced block needs to be written back to memory; if so, write it back to memory.
4. Read the data from memory, merge it with the write data, and write it into the local cache. The flow ends.
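A matching sketch of the general-mode write flow, reusing the toy cache of the read sketch; the wmask parameter is an assumption used to make the merge of old and new data explicit.

```cpp
// General-mode write, following steps 1-4 above; wmask selects the bits
// being written, so the merge of old and new data is explicit.
void write_general(uint64_t addr, uint64_t wdata, uint64_t wmask) {
    CacheLine& line = local_cache[set_of(addr)];
    if (!(line.valid && line.tag == addr)) {        // step 3: local miss;
        if (line.valid && line.dirty)               //   write the replaced
            memory[line.tag] = line.data;           //   block back if dirty
        line = {true, false, addr, memory[addr]};   // step 4: fetch old data
    }
    line.data = (line.data & ~wmask) | (wdata & wmask); // steps 2/4: merge
    line.dirty = true;                              // modified, mark dirty
}
```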
Reading cache data in the master mode (sketch below):
1. Query whether the read request hits in the master end's local cache. If so, jump to step 2; otherwise jump to step 3.
2. Local cache hit. Read the data from the local cache and return it. The flow ends.
3. Local cache miss. Send a request to the slave ends, query whether the read request hits in any slave end, and wait for all slave ends to reply.
4. All slave ends have replied. If a slave end hits, jump to step 5; otherwise jump to step 6.
5. A slave end hits. Read the data from that slave end's reply and return it. The flow ends.
6. No slave end hits. Choose to replace a block in either the local cache or a slave end's cache. If the local cache is chosen, jump to step 7; otherwise jump to step 9.
7. The local cache is replaced. Judge whether the replaced block needs to be written back to memory; if so, write it back to memory.
8. Read the data from memory, return it, and write it into the local cache. The flow ends.
9. A slave end's cache is replaced. Judge whether the slave end needs to write back to memory; if so, read the data the slave end must write back and write it to memory.
10. Read the data from memory, return it, and write it into the slave end's cache. The flow ends.
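A sketch of the master-mode read flow under the same toy model, restated so it stands alone. The slave caches are modelled as further arrays reachable inside the package, the probe-and-wait handshake of steps 3-4 collapses into a loop, and the choice between local and slave replacement in step 6 is reduced to a fixed clean-victim-first rule; all of these are assumptions of the sketch.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct CacheLine { bool valid = false; bool dirty = false; uint64_t tag = 0; uint64_t data = 0; };
constexpr std::size_t kSets = 64;
using Cache = std::array<CacheLine, kSets>;
Cache master_cache;
std::vector<Cache> slave_caches(3);              // three slave-end chips
std::unordered_map<uint64_t, uint64_t> memory;   // backing store

std::size_t set_of(uint64_t addr) { return addr % kSets; }

// Master-mode read, following steps 1-10 above.
uint64_t read_master(uint64_t addr) {
    CacheLine& local = master_cache[set_of(addr)];
    if (local.valid && local.tag == addr)        // steps 1-2: local hit
        return local.data;
    for (Cache& s : slave_caches) {              // steps 3-4: probe all slaves
        CacheLine& line = s[set_of(addr)];
        if (line.valid && line.tag == addr)      // step 5: a slave hit; read
            return line.data;                    //   the slave's reply
    }
    // step 6: no hit anywhere; replace locally if the local victim is
    // clean, otherwise in the first slave (toy stand-in for the policy)
    CacheLine& victim =
        (!local.valid || !local.dirty) ? local : slave_caches[0][set_of(addr)];
    if (victim.valid && victim.dirty)            // steps 7/9: write back (the
        memory[victim.tag] = victim.data;        //   real flow routes a slave's
                                                 //   write-back via the master)
    victim = {true, false, addr, memory[addr]};  // steps 8/10: refill
    return victim.data;
}
```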
Writing cache data in the master mode (sketch below):
1. Query whether the write request hits in the master end's local cache. If so, jump to step 2; otherwise jump to step 3.
2. Local cache hit. Read the data from the local cache, merge it with the write data, and write it back to the local cache. The flow ends.
3. Local cache miss. Send a request to the slave ends, query whether the write request hits in any slave end, and wait for all slave ends to reply.
4. All slave ends have replied. If a slave end hits, jump to step 5; otherwise jump to step 6.
5. A slave end hits. Write the data into that slave end's cache. The flow ends.
6. No slave end hits. Choose to replace a block in either the local cache or a slave end's cache. If the local cache is chosen, jump to step 7; otherwise jump to step 9.
7. The local cache is replaced. Judge whether the replaced block needs to be written back to memory; if so, write it back to memory.
8. Read the data from memory, merge it with the write data, and write it into the local cache. The flow ends.
9. A slave end's cache is replaced. Judge whether the slave end needs to write back to memory; if so, read the data the slave end must write back and write it to memory.
10. Read the data from memory, merge it with the write data, and write it into the slave end's cache. The flow ends.
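A sketch of the master-mode write flow, reusing the model and the clean-victim-first replacement rule assumed in the master-mode read sketch; merge semantics follow the general-mode write sketch.

```cpp
// Master-mode write, following steps 1-10 above.
void write_master(uint64_t addr, uint64_t wdata, uint64_t wmask) {
    CacheLine& local = master_cache[set_of(addr)];
    if (local.valid && local.tag == addr) {      // steps 1-2: local hit,
        local.data = (local.data & ~wmask) | (wdata & wmask); // merge
        local.dirty = true;
        return;
    }
    for (Cache& s : slave_caches) {              // steps 3-4: probe all slaves
        CacheLine& line = s[set_of(addr)];
        if (line.valid && line.tag == addr) {    // step 5: slave hit; write
            line.data = (line.data & ~wmask) | (wdata & wmask);
            line.dirty = true;
            return;
        }
    }
    CacheLine& victim =                          // step 6: choose the victim
        (!local.valid || !local.dirty) ? local : slave_caches[0][set_of(addr)];
    if (victim.valid && victim.dirty)            // steps 7/9: write back
        memory[victim.tag] = victim.data;
    victim = {true, true, addr, memory[addr]};   // steps 8/10: read from
    victim.data = (victim.data & ~wmask) | (wdata & wmask); // memory, merge
}
```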
Reading cache data in the slave mode (sketch below):
1. Query whether the read request hits in the local cache. If so, jump to step 2; otherwise jump to step 3.
2. Local cache hit. Read the data from the local cache and return it to the master end. The flow ends.
3. Both the master end and this slave end miss. Judge whether this slave end's local data is selected for replacement. If the master end's (or another slave end's) cache is selected instead, the flow ends; otherwise jump to step 4.
4. Judge whether the replaced block needs to be written back to memory; if so, send it to the master end.
5. Receive the data sent by the master end and write it into the local cache. The flow ends.
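Seen from the slave end, the same transaction can be sketched as the handler below. The replace_here flag stands in for the master end's replacement decision of step 3, and the two link functions are hypothetical stand-ins for messages on the in-package interconnect.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <optional>

struct CacheLine { bool valid = false; bool dirty = false; uint64_t tag = 0; uint64_t data = 0; };
constexpr std::size_t kSets = 64;
std::array<CacheLine, kSets> slave_cache;

// Placeholder interconnect messages (assumed, not the invention's protocol).
void send_victim_to_master(const CacheLine&) { /* uplink message */ }
uint64_t receive_fill_from_master() { return 0; /* downlink message */ }

// Slave-mode read handler, following steps 1-5 above; returns the data on
// a hit, or std::nullopt when the master end replaces elsewhere.
std::optional<uint64_t> slave_handle_read(uint64_t addr, bool replace_here) {
    CacheLine& line = slave_cache[addr % kSets];
    if (line.valid && line.tag == addr)          // steps 1-2: hit, reply
        return line.data;                        //   with the data
    if (!replace_here)                           // step 3: another cache was
        return std::nullopt;                     //   chosen; flow ends here
    if (line.valid && line.dirty)                // step 4: dirty victim is
        send_victim_to_master(line);             //   sent to the master end
    line = {true, false, addr, receive_fill_from_master()}; // step 5: refill
    return line.data;
}
```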
Writing cache data in the slave mode (sketch below):
1. Query whether the write request hits in the local cache. If so, jump to step 2; otherwise jump to step 3.
2. Local cache hit. Receive the data sent by the master end, merge it with the data in the local cache, and write it back to the local cache. The flow ends.
3. Both the master end and this slave end miss. Judge whether this slave end's local data is selected for replacement. If the master end's (or another slave end's) cache is selected instead, the flow ends; otherwise jump to step 4.
4. Judge whether the replaced block needs to be written back to memory; if so, send it to the master end.
5. Receive the data sent by the master end and write it into the local cache. The flow ends.
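A matching sketch of the slave-mode write handler, reusing the cache model and link stubs of the slave-mode read sketch; the merge semantics follow the other write sketches.

```cpp
// Slave-mode write handler, following steps 1-5 above; returns whether
// this slave end kept the block.
bool slave_handle_write(uint64_t addr, uint64_t wdata, uint64_t wmask,
                        bool replace_here) {
    CacheLine& line = slave_cache[addr % kSets];
    if (line.valid && line.tag == addr) {        // steps 1-2: hit; merge the
        line.data = (line.data & ~wmask) | (wdata & wmask); // master's data
        line.dirty = true;
        return true;
    }
    if (!replace_here) return false;             // step 3: replaced elsewhere
    if (line.valid && line.dirty)                // step 4: dirty victim is
        send_victim_to_master(line);             //   sent to the master end
    line = {true, true, addr, receive_fill_from_master()};  // step 5: refill
    line.data = (line.data & ~wmask) | (wdata & wmask);     //   and merge
    return true;
}
```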
The following is a system embodiment corresponding to the above method embodiment, and the two can be implemented in cooperation with each other. The technical details mentioned in the above embodiment remain valid in this embodiment and, to reduce repetition, are not repeated here; correspondingly, the technical details of this embodiment also apply to the above embodiment.
The invention also provides a dynamic capacity expansion system for the cache under a multi-CPU co-packaged architecture based on advanced packaging technology, comprising:
Module 1, configured to set a CPU that meets a preset condition as the master end, and, according to the master end's memory-access bandwidth demand and the cache sizes of the other CPUs, select from those CPUs the ones that can satisfy the demand as slave ends of the master end;
Module 2, configured to query, when the master end reads cache data, whether the read request hits in the master end's local cache; if so, read the data from the local cache and return it; otherwise send a request to the slave ends and query whether the read request hits in a slave end; if so, read the hitting slave end's cache and return the data; otherwise read the data from memory and return it, while writing the read data into the local cache or a slave end's cache;
Module 3, configured to query, when the master end writes cache data, whether the write request hits in the master end's local cache; if so, read the data from the local cache, merge it with the write data, and write it back to the local cache; otherwise send a request to the slave ends and query whether the write request hits in a slave end; if so, write the data into the hitting slave end's cache; otherwise replace a block in the local or a slave end's cache and write the data into the master end's or the slave end's cache.
In the above dynamic capacity expansion system for the cache under the multi-CPU co-packaged architecture based on advanced packaging technology, module 2 comprises:
Module 21, configured so that the slave end queries whether the read request hits in its local cache; if it hits, reads the data from the local cache and returns it to the master end; otherwise judges whether its own local data has been selected for replacement; if so, sends the replaced block to the master end, then receives the data sent by the master end and writes it back into its local cache; if not, selects the cache of the master end or of another slave end for replacement, and the flow ends.
In the above dynamic capacity expansion system for the cache under the multi-CPU co-packaged architecture based on advanced packaging technology, module 3 comprises:
Module 31, configured so that the slave end queries whether the write request hits in its local cache; if it hits, receives the data sent by the master end, merges it with the data in its local cache, and writes the result back into its local cache; otherwise judges whether its own local data has been selected for replacement; if so, sends the replaced block to the master end, then receives the data sent by the master end and writes it back into its local cache; if not, the flow ends.
In the above dynamic capacity expansion system for the cache under the multi-CPU co-packaged architecture based on advanced packaging technology, the preset condition comprises: the memory-access bandwidth demand of the CPU is greater than a threshold, or the compute-to-memory-access ratio of the CPU is below a threshold.
In the above dynamic capacity expansion system for the cache under the multi-CPU co-packaged architecture based on advanced packaging technology, the master end and the slave ends are located in the same packaged chip.

Claims (10)

1. A dynamic capacity expansion method for a cache under a multi-CPU co-packaged architecture based on advanced packaging technology, characterized by comprising:
step 1: setting a CPU that meets a preset condition as a master end, and, according to the master end's memory-access bandwidth demand and the cache sizes of the other CPUs, selecting from those CPUs the ones that can satisfy the demand as slave ends of the master end;
step 2: when the master end reads cache data, querying whether the read request hits in the master end's local cache; if so, reading the data from the local cache and returning it; otherwise sending a request to the slave ends and querying whether the read request hits in a slave end; if so, reading the hitting slave end's cache and returning the data; otherwise reading the data from memory and returning it, while writing the read data into the local cache or a slave end's cache;
and step 3: when the master end writes cache data, querying whether the write request hits in the master end's local cache; if so, reading the data from the local cache, merging it with the write data, and writing it back to the local cache; otherwise sending a request to the slave ends and querying whether the write request hits in a slave end; if so, writing the data into the hitting slave end's cache; otherwise replacing a block in the local or a slave end's cache and writing the data into the master end's or the slave end's cache.
2. The method as claimed in claim 1, wherein step 2 comprises:
step 21: the slave end querying whether the read request hits in its local cache; if it hits, reading the data from the local cache and returning it to the master end; otherwise judging whether its own local data has been selected for replacement; if so, sending the replaced block to the master end, then receiving the data sent by the master end and writing it back into its local cache; if not, selecting the cache of the master end or of another slave end for replacement, and the flow ends.
3. The method as claimed in claim 1, wherein step 3 comprises:
step 31: the slave end querying whether the write request hits in its local cache; if it hits, receiving the data sent by the master end, merging it with the data in its local cache, and writing the result back into its local cache; otherwise judging whether its own local data has been selected for replacement; if so, sending the replaced block to the master end, then receiving the data sent by the master end and writing it back into its local cache; if not, the flow ends.
4. The method of claim 1, wherein the preset condition comprises: the memory-access bandwidth demand of the CPU is greater than a threshold, or the compute-to-memory-access ratio of the CPU is below a threshold.
5. The method of claim 1, wherein the master end and the slave ends are located in the same packaged chip.
6. A dynamic capacity expansion system for a cache under a multi-CPU co-packaged architecture based on advanced packaging technology, characterized by comprising:
module 1, configured to set a CPU that meets a preset condition as a master end, and, according to the master end's memory-access bandwidth demand and the cache sizes of the other CPUs, select from those CPUs the ones that can satisfy the demand as slave ends of the master end;
module 2, configured to query, when the master end reads cache data, whether the read request hits in the master end's local cache; if so, read the data from the local cache and return it; otherwise send a request to the slave ends and query whether the read request hits in a slave end; if so, read the hitting slave end's cache and return the data; otherwise read the data from memory and return it, while writing the read data into the local cache or a slave end's cache;
and module 3, configured to query, when the master end writes cache data, whether the write request hits in the master end's local cache; if so, read the data from the local cache, merge it with the write data, and write it back to the local cache; otherwise send a request to the slave ends and query whether the write request hits in a slave end; if so, write the data into the hitting slave end's cache; otherwise replace a block in the local or a slave end's cache and write the data into the master end's or the slave end's cache.
7. The system of claim 6, wherein module 2 comprises:
module 21, configured so that the slave end queries whether the read request hits in its local cache; if it hits, reads the data from the local cache and returns it to the master end; otherwise judges whether its own local data has been selected for replacement; if so, sends the replaced block to the master end, then receives the data sent by the master end and writes it back into its local cache; if not, selects the cache of the master end or of another slave end for replacement, and the flow ends.
8. The system of claim 6, wherein module 3 comprises:
module 31, configured so that the slave end queries whether the write request hits in its local cache; if it hits, receives the data sent by the master end, merges it with the data in its local cache, and writes the result back into its local cache; otherwise judges whether its own local data has been selected for replacement; if so, sends the replaced block to the master end, then receives the data sent by the master end and writes it back into its local cache; if not, the flow ends.
9. The system of claim 6, wherein the preset condition comprises: the memory-access bandwidth demand of the CPU is greater than a threshold, or the compute-to-memory-access ratio of the CPU is below a threshold.
10. The system of claim 6, wherein the master end and the slave ends are located in the same packaged chip.
CN202110622895.2A 2021-06-04 2021-06-04 Dynamic capacity expansion method and system for cache under multi-CPU co-encapsulation architecture based on advanced encapsulation technology Active CN113392604B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110622895.2A CN113392604B (en) 2021-06-04 2021-06-04 Dynamic capacity expansion method and system for cache under multi-CPU co-encapsulation architecture based on advanced encapsulation technology

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110622895.2A CN113392604B (en) 2021-06-04 2021-06-04 Dynamic capacity expansion method and system for cache under multi-CPU co-encapsulation architecture based on advanced encapsulation technology

Publications (2)

Publication Number Publication Date
CN113392604A 2021-09-14
CN113392604B CN113392604B (en) 2023-08-01

Family

ID=77618188

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110622895.2A Active CN113392604B (en) 2021-06-04 2021-06-04 Dynamic capacity expansion method and system for cache under multi-CPU co-encapsulation architecture based on advanced encapsulation technology

Country Status (1)

Country Link
CN (1) CN113392604B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5608890A (en) * 1992-07-02 1997-03-04 International Business Machines Corporation Data set level cache optimization
CN101706755A (en) * 2009-11-24 2010-05-12 中国科学技术大学苏州研究院 Caching collaboration system of on-chip multi-core processor and cooperative processing method thereof
CN103370696A (en) * 2010-12-09 2013-10-23 国际商业机器公司 Multicore system, and core data reading method
CN107111553A (en) * 2015-01-13 2017-08-29 高通股份有限公司 System and method for providing dynamic caching extension in many cluster heterogeneous processor frameworks

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LIANG, Chenghao: "Research and Implementation of an On-chip Cache Sharing Strategy for Multiprocessors", China Master's Theses Full-text Database, Information Science and Technology, pages 137-6 *

Also Published As

Publication number Publication date
CN113392604B (en) 2023-08-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant