CN117915670A - Integrated chip structure for memory and calculation - Google Patents



Publication number
CN117915670A
CN117915670A (application number CN202410294958.XA)
Authority
CN
China
Prior art keywords
chip
module
memory
cache
storage
Prior art date
Legal status
Granted
Application number
CN202410294958.XA
Other languages
Chinese (zh)
Other versions
CN117915670B (en)
Inventor
王贻源
朱海杰
周华民
Current Assignee
Shanghai Xinfeng Microelectronics Co ltd
Original Assignee
Shanghai Xinfeng Microelectronics Co ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Xinfeng Microelectronics Co ltd filed Critical Shanghai Xinfeng Microelectronics Co ltd
Priority to CN202410294958.XA priority Critical patent/CN117915670B/en
Priority claimed from CN202410294958.XA external-priority patent/CN117915670B/en
Publication of CN117915670A publication Critical patent/CN117915670A/en
Application granted granted Critical
Publication of CN117915670B publication Critical patent/CN117915670B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Landscapes

  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The application relates to the field of integrated circuits and discloses a memory integrated chip structure. The chip structure comprises: a system-on-chip, an auxiliary chip, and at least one memory chip, stacked in sequence. The at least one memory chip is configured to store data; the system-on-chip is configured to perform computation on the data; and the auxiliary chip is configured to provide an on-chip power supply network and an on-chip data transmission network.

Description

Integrated chip structure for memory and calculation
Technical Field
The application relates to the field of integrated circuits, in particular to a memory integrated chip structure.
Background
High-performance computing chips are specialized chips for performing complex mathematical computations and data processing tasks, such as CPUs (central processing units), GPUs (graphics processing units), AI (artificial intelligence) chips, and DPUs (data processing units). High-performance computing chips are widely applied in fields such as scientific computing, engineering simulation, data analysis, and artificial intelligence.
However, with the dramatic increase in data volume and computation, high-performance computing chips face numerous challenges in bandwidth, power consumption, computing speed, and efficiency. Existing chip structures can hardly support the continued performance upgrades of high-performance computing chips.
Disclosure of Invention
Therefore, embodiments of the present application provide a memory integrated chip structure that can improve the performance of the chip structure.
The technical solution of the embodiments of the present application is realized as follows:
An embodiment of the present application provides a memory integrated chip structure, comprising: a system-on-chip, an auxiliary chip, and at least one memory chip. The system-on-chip, the auxiliary chip, and the at least one memory chip are stacked in sequence. The at least one memory chip is configured to store data; the system-on-chip is configured to perform computation on the data; and the auxiliary chip is configured to provide an on-chip power supply network and an on-chip data transmission network.
In the above solution, each memory chip includes a local storage module, and the system-on-chip includes a compute core module; each local storage module is located directly above the compute core module. The local storage module includes a plurality of local storage sub-modules, and the compute core module includes a plurality of compute core sub-modules; each local storage sub-module is accessed only by the compute core sub-module directly below it.
In the above solution, the system-on-chip further includes a first cache module, and each memory chip further includes a second cache module; each second cache module is located directly above the first cache module. The first cache module and the second cache module are both accessible to all the compute core sub-modules.
In the above solution, the access speed of the first cache module is faster than that of the second cache module, and the storage density of the first cache module is less than that of the second cache module.
In the above solution, each memory chip further includes a full-chip memory module, and the system-on-chip further includes a peripheral logic module; each full-chip memory module is located directly above the peripheral logic module and is accessible to all modules in the system-on-chip.
In the above solution, in each memory chip the full-chip memory module at least partially encloses the local storage module and the second cache module; in the system-on-chip, the peripheral logic module at least partially encloses the compute core module and the first cache module.
In the above solution, the auxiliary chip includes a plurality of nodes arranged in an array, with adjacent nodes electrically connected to each other. Each node includes a data transmission module and a power supply module. Each power supply module supplies power to nearby partial regions of each memory chip and to nearby partial regions of the system-on-chip. Each data transmission module transfers data for nearby partial regions of each memory chip and for nearby partial regions of the system-on-chip; alternatively, each data transmission module transfers data for all regions of each memory chip and of the system-on-chip.
In the above solution, each data transmission module includes a master device interface, a slave device interface, a router, and a memory controller; the master device interface and the slave device interface are each connected to the router, and the slave device interface is further connected to the memory controller.
In the above solution, each node further includes an in-memory computing module; the in-memory computing module in each node serves the memory connected to the memory controller in the same node.
In the above solution, the system-on-chip, the auxiliary chip, and the at least one memory chip are connected by hybrid bonding, and through-silicon vias extending along the stacking direction are formed in the system-on-chip, the auxiliary chip, and the at least one memory chip.
In the above solution, the system-on-chip, the auxiliary chip, and the at least one memory chip are stacked in sequence on a package substrate; the system-on-chip is connected to the package substrate through bumps.
Therefore, in embodiments of the application, a memory integrated 3D chip is realized by stacking different chips three-dimensionally (3D). The distance between the memory chips used for storage and the system-on-chip used for computation is very short, and most data remains inside the chip structure, so that near-memory computing or in-memory computing can be realized, thereby alleviating the "power consumption wall", "bandwidth wall", and "memory wall" problems. Meanwhile, the auxiliary chip also provides an on-chip data transmission network, so that long-distance data movement can be handled within the auxiliary chip.
Drawings
Fig. 1 is a schematic diagram of a chip structure according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a system-on-chip in a chip structure according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a memory chip in a chip structure according to an embodiment of the present application;
fig. 4 is a schematic diagram of an auxiliary chip in a chip structure according to an embodiment of the present application;
fig. 5 is a schematic diagram of a node in an auxiliary chip according to an embodiment of the present application;
fig. 6 is a schematic diagram of a chip structure according to an embodiment of the application.
Detailed Description
The technical solution of the present application will be further elaborated with reference to the accompanying drawings and embodiments. The described embodiments should not be construed as limiting the application; all other embodiments obtained by those skilled in the art without inventive effort fall within the scope of protection of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict.
Where a description such as "first/second" appears in this document, the following applies: the terms "first/second/third" merely distinguish similar objects and do not denote a specific ordering of those objects. Where permitted, the order or precedence of "first/second/third" may be interchanged, so that the embodiments of the application described herein can be implemented in an order other than that illustrated or described.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the application only and is not intended to be limiting of the application.
In one aspect, in existing chip structures the volume of transmitted data is large and the transmission power consumption is high; meanwhile, the distance between the memory chip and the data-processing chip is long, so a PHY (physical layer, i.e., an interface chip) is required for long-distance data transmission. Taking AI chips as an example, with the explosive growth of data volume, the amount of data moved during AI computation keeps increasing. According to related research, at the 7 nm semiconductor process node, data-movement power consumption reaches 35 pJ/bit and accounts for 63.7% of total power consumption. The power lost to data transmission is increasingly severe and limits the development speed and efficiency of chips, leading to the "power consumption wall" problem.
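To put the cited figure in perspective, the following is a minimal, purely illustrative sketch (the 35 pJ/bit value comes from the research cited above; the 1 GB transfer size is an assumed example) of the energy spent merely moving data:

```python
# Illustration of the "power consumption wall": at the cited 7 nm figure,
# moving data costs about 35 pJ per bit, independent of any computation.
E_PER_BIT_PJ = 35.0  # pJ per bit moved (figure cited in the text)

def transfer_energy_joules(num_bytes: int) -> float:
    """Energy spent just moving `num_bytes` of data, in joules."""
    bits = num_bytes * 8
    return bits * E_PER_BIT_PJ * 1e-12  # pJ -> J

# Assumed example: moving 1 GB of weights/activations off-chip.
energy_j = transfer_energy_joules(10**9)
print(f"moving 1 GB costs {energy_j:.2f} J")  # 8e9 bits * 35 pJ = 0.28 J
```

Shortening the physical path between storage and compute, as the stacked structure does, attacks exactly this per-bit movement cost.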
In another aspect, in existing chip structures the volume of transmitted data is large and the interface bandwidth is insufficient, so computation stalls while waiting for data, which degrades overall chip performance. Worldwide, compute capability on computing devices has increased roughly 90,000-fold over the last 20 years. Although memory has evolved from DDR to GDDR6X, which serves graphics cards, gaming terminals, and high-performance computing, and interface standards have been upgraded from PCIe 1.0a to NVLink 3.0, communication bandwidth has grown only about 30-fold. Compared with the growth in compute capability, this increase is very limited, resulting in the "bandwidth wall" problem.
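The gap behind the "bandwidth wall" follows directly from the two growth factors cited above; this one-line calculation is only illustrative arithmetic on those figures:

```python
# Growth factors over ~20 years, as cited in the text.
compute_growth = 90_000    # compute capability growth
bandwidth_growth = 30      # communication bandwidth growth
gap = compute_growth / bandwidth_growth
print(f"compute outgrew bandwidth by {gap:,.0f}x")  # 3,000x
```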
In a further aspect, in existing chip structures the capacity of on-chip memory (cache) is too small. Taking AI chips as an example, AI model parameters are expanding rapidly, while GPU memory has struggled to keep pace. In the era of models before GPT-2, GPU memory could still meet the needs of large AI models. In recent years, however, with the large-scale development and application of Transformer models, model size has grown on average 240-fold every two years. The parameter growth of large models such as GPT-3 has outpaced the growth of GPU memory. Thus, communication inside a chip, between chips, or between AI accelerators becomes a bottleneck for AI training, inevitably running into the "memory wall" problem.
In yet another aspect, in existing chip structures the power supply efficiency of high-performance computing chips is low; at present, the power supply efficiency of a typical high-performance computing chip is below 80%.
In summary, existing chip structures can hardly meet the performance upgrade needs of high-performance computing chips; that is, high-performance computing chips face performance bottlenecks.
Fig. 1 is a schematic diagram of an optional structure of the memory integrated chip according to an embodiment of the present application. As shown in fig. 1, the chip structure 10 includes: a system-on-chip (also known as SoC Die) 20, an auxiliary chip (also known as Auxiliary Die) 30, and at least one memory chip 40. The system-on-chip 20, the auxiliary chip 30, and the at least one memory chip 40 are stacked in sequence.
Wherein the at least one memory chip 40 is configured to store data; the system-on-chip 20 is configured to perform computation on the data; and the auxiliary chip 30 is configured to provide an on-chip power supply network as well as an on-chip data transmission network.
It should be noted that four memory chips 40 are illustrated in fig. 1; this is not a limitation of the embodiments of the present application, and the number of memory chips 40 may take other values without restriction.
In embodiments of the present application, the memory chip 40 may be a dynamic random access memory (also referred to as a DRAM Die). With DRAM, greater on-chip storage capacity and bandwidth can be provided. Meanwhile, a plurality of memory chips 40 may be stacked, so that the storage capacity can be expanded.
In embodiments of the present application, the auxiliary chip 30 may provide an on-chip power supply network for the system-on-chip 20 and the memory chips 40, that is, supply power to the system-on-chip 20 and the memory chips 40. Meanwhile, the auxiliary chip 30 can provide an on-chip data transmission network for the system-on-chip 20 and the memory chips 40, realizing data movement between the system-on-chip 20 and the memory chips 40.
In embodiments of the present application, referring to fig. 1, at least one memory chip 40 is stacked on the system-on-chip 20, so that the at least one memory chip 40 and the system-on-chip 20 can be packaged together, realizing a chip structure that integrates storage and computation.
It can be appreciated that, in embodiments of the application, a memory integrated 3D chip is realized by three-dimensionally stacking different chips. The distance between the memory chips 40 used for storage and the system-on-chip 20 used for computation is very short, and most data remains inside the chip structure 10, so that near-memory computing or in-memory computing can be implemented, thereby alleviating the "power consumption wall", "bandwidth wall", and "memory wall" problems.
Meanwhile, because the auxiliary chip 30 provides an on-chip power supply network, the power supply efficiency problem is alleviated; and because the auxiliary chip 30 also provides an on-chip data transmission network, long-distance data movement can be handled within the auxiliary chip 30.
In some embodiments of the present application, with continued reference to fig. 1, the stacked system-on-chip 20, auxiliary chip 30, and at least one memory chip 40 are connected by hybrid bonding.
In some embodiments of the present application, with continued reference to fig. 1, through-silicon vias (TSVs) extending in the stacking direction (i.e., the vertical direction) are formed in the system-on-chip 20, the auxiliary chip 30, and the at least one memory chip 40. The through-silicon vias are filled with conductive materials such as copper, tungsten, or polysilicon, realizing vertical interconnection between the different chips.
It can be appreciated that, owing to the high thermal conductivity of the hybrid bonds and the through-silicon vias, the overall thermal conductivity of the chip structure 10 is improved and the heat dissipation of each chip is enhanced, thereby improving the stability of the chip structure 10.
Fig. 2 is a schematic diagram of an alternative architecture of the system-on-chip 20. Fig. 3 is a schematic diagram of an alternative structure of the memory chip 40.
In some embodiments of the present application, referring to fig. 2, the system-on-chip 20 includes: a compute core (HPC Core) module 201. Referring to fig. 3, each memory chip 40 includes: a local storage (Local DRAM) module 401.
Wherein each local storage module 401 is located directly above the compute core module 201; that is, the projection of each local storage module 401 in the vertical direction (i.e., the stacking direction of the chips) coincides with the projection of the compute core module 201 in the vertical direction.
It should be noted that "directly above" and "coinciding" in this specification may be achieved within a certain error range; that is, a certain deviation may be allowed in the actual manufacturing process. This will not be repeated below.
With reference to figs. 2 and 3, the local storage module 401 may include a plurality of local storage sub-modules, and the compute core module 201 may include a plurality of compute core sub-modules. Each local storage sub-module is accessed only by the compute core sub-module directly below it. That is, each storage area (i.e., local storage sub-module) in the local storage module 401 is accessed only by the compute core (i.e., compute core sub-module) directly below that area, so that each compute core has a dedicated storage area, which improves computing efficiency. The local storage module 401 may be used to cache intermediate data produced during computation; such data need not be shared and can therefore be used only locally, i.e., only by the corresponding compute core.
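The exclusivity rule described above can be sketched in a few lines of Python. This is a hypothetical software model, not anything from the patent: grid coordinates, the class name, and the dictionary-backed banks are all illustrative.

```python
# Hypothetical model of the access rule: each local storage sub-module is
# private to the compute core sub-module directly below it, so a core may
# only touch the bank at its own (row, col) position.
class LocalMemoryFabric:
    def __init__(self, rows: int, cols: int):
        # one private scratchpad (here a dict) per (row, col) compute core
        self.banks = {(r, c): {} for r in range(rows) for c in range(cols)}

    def access(self, core: tuple, bank: tuple) -> dict:
        # enforce: a core only reaches the bank stacked directly above it
        if core != bank:
            raise PermissionError(f"core {core} cannot access bank {bank}")
        return self.banks[bank]

fabric = LocalMemoryFabric(2, 2)
fabric.access((0, 1), (0, 1))["partial_sum"] = 42  # allowed: same position
# fabric.access((0, 0), (0, 1))  # would raise PermissionError
```

Because no arbitration between cores is ever needed, each access path stays short and contention-free, which is the efficiency argument made in the text.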
In some embodiments of the present application, referring to fig. 2, the system-on-chip 20 further includes: a first cache module 202. Referring to fig. 3, each memory chip 40 further includes: a second cache module 402.
Wherein each second cache module 402 is located directly above the first cache module 202; that is, the projection of each second cache module 402 in the vertical direction (i.e., the stacking direction of the chips) coincides with the projection of the first cache module 202 in the vertical direction.
In an embodiment of the present application, referring to fig. 2 and 3, the first cache module 202 and the second cache module 402 are both accessible to all the computation core sub-modules in the computation core module 201.
In some embodiments of the present application, in conjunction with fig. 2 and 3, the access speed of the first cache module 202 is faster than the access speed of the second cache module 402; the storage density of the first cache module 202 is less than the storage density of the second cache module 402.
In an embodiment of the present application, referring to fig. 2, the first cache module 202 may be a second level cache (L2 cache), such as SRAM (static random access memory). Referring to FIG. 3, the second cache module 402 may be a level three cache (L3 cache).
Wherein the storage density of the first cache module 202 is smaller than that of the second cache module 402, i.e., the storage capacity per unit area of the first cache module 202 is smaller. However, the first cache module 202 is closer to the compute core module 201; therefore, the latency of the compute core module 201 accessing data in the first cache module 202 is lower than the latency of accessing data in the second cache module 402. The second cache module 402 supplements the first cache module 202, which can increase the cache hit rate.
In some embodiments of the present application, referring to FIG. 3, the latency of the second cache module 402 is less than 30ns.
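The benefit of backing the L2-style first cache with the stacked L3-style second cache can be quantified with the standard average-memory-access-time formula. Only the "<30 ns" L3 figure comes from the text; every other number below (latencies, hit rates) is an assumption chosen for illustration.

```python
def amat(l2_hit, l2_lat, l3_hit, l3_lat, dram_lat):
    """Average memory access time for two cache levels in front of DRAM.

    l3_hit is the hit rate among the requests that already missed L2.
    """
    return (l2_hit * l2_lat
            + (1 - l2_hit) * (l3_hit * l3_lat + (1 - l3_hit) * dram_lat))

# Assumed numbers: 5 ns L2, 25 ns L3 (consistent with the <30 ns figure
# above), 100 ns full-chip memory; 80% L2 hit rate, 60% L3 hit rate.
with_l3    = amat(0.80, 5, 0.60, 25, 100)
without_l3 = amat(0.80, 5, 0.00, 25, 100)  # every L2 miss goes to DRAM
print(f"{with_l3:.1f} ns vs {without_l3:.1f} ns")  # 15.0 ns vs 24.0 ns
```

Under these assumed rates the second cache cuts the average access time by more than a third, which is the "supplement the first cache module" effect described above.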
In some embodiments of the present application, referring to fig. 2, the system-on-chip 20 further includes: peripheral logic module 203. Referring to fig. 3, each memory chip 40 further includes: a full chip memory module 403.
Wherein, each full-chip memory module 403 is located right above the peripheral logic module 203; that is, the projection of each full-chip memory module 403 in the vertical direction (i.e., the stacking direction of the chips) coincides with the projection of the peripheral logic module 203 in the vertical direction.
In embodiments of the present application, with reference to figs. 2 and 3, each full-chip memory module 403 is accessible to all modules in the system-on-chip 20. The latency of the full-chip memory module 403 is larger, but it is flexible, so all modules in the system-on-chip 20 can access the data in the full-chip memory module 403.
In embodiments of the present application, referring to fig. 2, the peripheral logic module 203 may include peripheral logic (Peri), high-speed interface logic (PCIe), and other IP blocks.
In some embodiments of the present application, referring to FIG. 3, in each memory chip 40, a full-chip memory module 403 at least partially encloses a local memory module 401 and a second cache module 402. Referring to fig. 2, in the system-on-chip 20, a peripheral logic module 203 at least partially encloses a compute core module 201 and a first cache module 202.
It will be appreciated that each memory chip 40 is partitioned into multiple levels, i.e., each memory chip 40 is partitioned into a local memory module 401, a second cache module 402, and a full-chip memory module 403, while the system-in-chip 20 is correspondingly partitioned, i.e., the system-in-chip 20 is partitioned into a compute core module 201, a first cache module 202, and a peripheral logic module 203. In this way, different functions are organically combined in the same chip structure 10, and thus, the performance of the chip structure 10 in various aspects can be improved.
In some embodiments of the present application, referring to fig. 4, the auxiliary chip 30 includes: a plurality of nodes 301. The plurality of nodes 301 are arranged in an array, and adjacent nodes 301 are electrically connected to each other.
In embodiments of the present application, a network on chip (NoC) is constructed by the plurality of nodes 301 arranged in an array, and includes: an on-chip power supply network and an on-chip data transmission network. The NoC can carry data over long distances within the memory chips. Because the NoC requires logic resources, while long-distance data movement is not latency-critical, the NoC can be implemented in a relatively mature process.
In some embodiments of the present application, referring to fig. 5, each node 301 comprises: a data transmission module 302 and a power supply module 303.
In embodiments of the present application, with reference to figs. 1 and 5, each power supply module 303 supplies power to nearby partial regions of each memory chip 40 and to nearby partial regions of the system-on-chip 20. That is, the power supply module 303 in each node 301 may supply power to the partial region of each memory chip 40 located directly above and near that node 301, and to the partial region of the system-on-chip 20 located directly below and near that node 301.
It can be understood that the auxiliary chip 30 includes a plurality of nodes 301 arranged in an array, and each node 301 supplies power to the nearby partial regions of each memory chip 40 and of the system-on-chip 20, so that every part of the memory chips 40 and the system-on-chip 20 can draw power from nearby, thereby improving power supply efficiency.
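The "draw power from nearby" idea can be sketched as a nearest-node assignment over the node array. The patent does not specify how regions are mapped to nodes; the Manhattan-distance rule and the unit-grid coordinates below are assumptions for illustration only.

```python
# Hypothetical sketch: each small region of the stacked chips is served by
# the closest node 301 in the auxiliary chip's array (Manhattan distance).
def nearest_node(region_xy, nodes):
    """Return the array node closest to a chip region at `region_xy`."""
    x, y = region_xy
    return min(nodes, key=lambda n: abs(n[0] - x) + abs(n[1] - y))

# 2x2 node array on a unit grid (illustrative coordinates)
nodes = [(0, 0), (0, 1), (1, 0), (1, 1)]
assert nearest_node((0.1, 0.2), nodes) == (0, 0)
assert nearest_node((0.9, 0.8), nodes) == (1, 1)
```

Keeping each supply path short in this way is what limits resistive losses and raises the delivery efficiency discussed above.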
In embodiments of the present application, referring to fig. 5, the power supply module 303 may implement direct-current to direct-current (DC-DC) conversion, so as to provide a DC voltage source at a specific voltage.
In some embodiments of the present application, with reference to figs. 1 and 5, each data transmission module 302 transfers data for nearby partial regions of each memory chip 40 and for nearby partial regions of the system-on-chip 20. That is, the data transmission module 302 in each node 301 may transfer data for the partial region of each memory chip 40 located directly above and near that node 301, and for the partial region of the system-on-chip 20 located directly below and near that node 301. In other embodiments of the present application, each data transmission module 302 transfers data for all regions of each memory chip 40 and of the system-on-chip 20.
In some embodiments of the present application, referring to fig. 5, each data transmission module 302 includes: a master device interface MIU, a slave device interface SIU, a router R, and a memory controller HDC. The master device interface MIU and the slave device interface SIU are each connected to the router R; the slave device interface SIU is further connected to the memory controller HDC.
In embodiments of the present application, referring to fig. 5, the master device interface MIU is used to connect a master device, and the slave device interface SIU is used to connect a slave device. The router R controls the direction of data transmission, realizing transfer of data toward a specific direction. The memory controller HDC controls data storage.
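The router R steers traffic hop by hop across the node array. The patent does not name a routing algorithm, so the sketch below uses XY (dimension-ordered) routing, a common and deadlock-free choice for 2D mesh NoCs, purely as an illustration of what "transfer of data toward a specific direction" can look like.

```python
# Hypothetical XY routing on the mesh of nodes 301: resolve the X
# coordinate first, then the Y coordinate, one neighbor hop at a time.
def xy_route(src, dst):
    """Hop-by-hop path from node `src` to node `dst` on a 2D mesh."""
    path = [src]
    x, y = src
    while x != dst[0]:                 # step along X first
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:                 # then step along Y
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

print(xy_route((0, 0), (2, 1)))
# [(0, 0), (1, 0), (2, 0), (2, 1)]
```

Each intermediate tuple corresponds to one router R forwarding the packet to an electrically adjacent node, matching the array interconnection described for the auxiliary chip.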
In some embodiments of the present application, referring to fig. 5, each node 301 further includes an in-memory computing module PIM. The in-memory computing module PIM in each node 301 serves the memory connected to the memory controller HDC in the same node 301.
In some embodiments of the present application, referring to fig. 6, the system-on-chip 20, the auxiliary chip 30, and the at least one memory chip 40 are sequentially stacked on a package substrate 50. The system-on-chip 20 is connected to the package substrate 50 through bumps; for example, C4 bumps may be used. Solder balls are also provided under the package substrate 50.
It can be appreciated that, since the thermal conductivity of the bumps and the solder balls is high, heat conduction between the chips and the package substrate 50 is improved; the thermal resistance in the direction of the package substrate 50 is thus reduced, the heat dissipation of each chip is enhanced, and the stability of the chip structure 10 is improved.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The foregoing embodiment numbers of the present application are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments. The methods disclosed in the method embodiments provided by the application can be arbitrarily combined under the condition of no conflict to obtain a new method embodiment. The features disclosed in the several product embodiments provided by the application can be combined arbitrarily under the condition of no conflict to obtain new product embodiments. The features disclosed in the embodiments of the method or the apparatus provided by the application can be arbitrarily combined without conflict to obtain new embodiments of the method or the apparatus.
The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present application.

Claims (11)

1. A memory integrated chip structure, the chip structure comprising: a system-on-chip, an auxiliary chip, and at least one memory chip;
The system-on-chip, the auxiliary chip and at least one memory chip are stacked in sequence;
the at least one memory chip is configured to store data;
the system-on-chip is configured to calculate the data;
the auxiliary chip is configured to provide an on-chip power supply network, and an on-chip data transmission network.
2. The chip structure of claim 1, wherein,
Each of the memory chips includes: a local storage module; the system-on-chip includes: a computing core module; each local storage module is positioned right above the computing core module;
wherein, the local storage module includes: a plurality of local storage sub-modules; the computing core module includes: a plurality of computing core sub-modules; each local storage sub-module is only accessed by the computing core sub-module directly below the local storage sub-module.
3. The chip structure of claim 2, wherein,
The system-on-chip further includes: a first cache module; each of the memory chips further includes: a second cache module; each second cache module is positioned right above the first cache module;
The first cache module and the second cache module are both accessible to all the computing core sub-modules.
4. The chip structure of claim 3, wherein,
The access speed of the first cache module is faster than that of the second cache module; the storage density of the first cache module is less than the storage density of the second cache module.
5. The chip structure of claim 3, wherein,
Each of the memory chips further includes: a full chip memory module; the system-on-chip further includes: a peripheral logic module; each full-chip memory module is positioned right above the peripheral logic module;
Each full-chip memory module is accessible to all modules in the system-on-chip.
6. The chip structure of claim 5, wherein,
In each of the memory chips, the full-chip memory module at least partially encloses the local memory module and the second cache module;
In the system-on-chip, the peripheral logic module at least partially encloses the compute core module and the first cache module.
7. The chip structure of claim 1, wherein the auxiliary chip comprises: a plurality of nodes; the nodes are arranged in an array mode, and adjacent nodes are electrically connected with each other;
each of the nodes includes: a data transmission module and a power supply module;
each power supply module supplies power to nearby partial regions of each memory chip and to nearby partial regions of the system-on-chip;
each data transmission module transfers data for nearby partial regions of each memory chip and for nearby partial regions of the system-on-chip; or each data transmission module transfers data for all regions of each memory chip and of the system-on-chip.
8. The chip structure of claim 7, wherein each of the data transmission modules comprises a master device interface, a slave device interface, a router, and a memory controller;
the master device interface and the slave device interface are each connected to the router; the slave device interface is further connected to the memory controller.
9. The chip structure of claim 8, wherein each of the nodes further comprises an in-memory operation module; the in-memory operation module in each node is used by the memory connected to the memory controller in the same node.
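The node array of claims 7 through 9 resembles a 2D mesh network-on-chip; a hypothetical sketch of that arrangement follows, with every class name, field name, and grid size invented for illustration rather than taken from the patent:

```python
# Hypothetical model of the auxiliary chip's node array: nodes sit in a
# 2D grid, adjacent nodes are linked, and each node bundles the parts
# named in claims 8 and 9 (router with master/slave interfaces, memory
# controller, in-memory operation module).

class Node:
    def __init__(self, position):
        self.position = position
        self.neighbors = []              # electrically connected adjacent nodes
        self.master_interface = object() # claim 8: master device interface
        self.slave_interface = object()  # claim 8: slave device interface
        self.memory_controller = object()
        self.in_memory_op = object()     # claim 9: used via this node's controller

def build_mesh(rows, cols):
    """Arrange nodes in an array and connect each to its 4-neighbors."""
    nodes = {(r, c): Node((r, c)) for r in range(rows) for c in range(cols)}
    for (r, c), node in nodes.items():
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            if (r + dr, c + dc) in nodes:
                node.neighbors.append(nodes[(r + dr, c + dc)])
    return nodes

def hop_count(src, dst):
    """Minimum router hops between two mesh positions (Manhattan distance)."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

mesh = build_mesh(3, 3)
assert len(mesh[(1, 1)].neighbors) == 4  # interior node: 4 adjacent links
assert len(mesh[(0, 0)].neighbors) == 2  # corner node: 2 adjacent links
assert hop_count((0, 0), (2, 2)) == 4
```

Placing such a mesh on the middle (auxiliary) die lets each node serve the memory region above it and the compute region below it through short vertical connections.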
10. The chip structure of claim 1, wherein,
the system-on-chip, the auxiliary chip, and the at least one memory chip are connected by hybrid bonding;
and through-silicon vias extending along the stacking direction are formed in the system-on-chip, the auxiliary chip, and the at least one memory chip.
11. The chip structure of claim 1, wherein,
the system-on-chip, the auxiliary chip, and the at least one memory chip are sequentially stacked on a package substrate; the system-on-chip is connected to the package substrate through bumps.
CN202410294958.XA 2024-03-14 Integrated chip structure for memory and calculation Active CN117915670B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410294958.XA CN117915670B (en) 2024-03-14 Integrated chip structure for memory and calculation

Publications (2)

Publication Number Publication Date
CN117915670A true CN117915670A (en) 2024-04-19
CN117915670B CN117915670B (en) 2024-07-05

Citations (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050224946A1 (en) * 2004-02-27 2005-10-13 Banpil Photonics, Inc. Stackable optoelectronics chip-to-chip interconnects and method of manufacturing
US20050286286A1 (en) * 2004-06-29 2005-12-29 Nec Corporation Three-dimensional semiconductor device provided with interchip interconnection selection means for electrically isolating interconnections other than selected interchip interconnections
KR20080094147A (en) * 2007-04-19 2008-10-23 삼성전자주식회사 Semiconductor device, semiconductor package and method of stacking memories
KR20090100024A (en) * 2008-03-19 2009-09-23 삼성전자주식회사 Semiconductor memory device comprising memory module of stacking memory chips
US20110149493A1 (en) * 2009-12-17 2011-06-23 Samsung Electronics Co., Ltd. Stacked semiconductor packages, methods of fabricating the same, and/or systems employing the same
KR20180011991A (en) * 2016-07-26 2018-02-05 삼성전자주식회사 Stacked memory device, system including the same and associated method
US20180129606A1 (en) * 2016-11-07 2018-05-10 International Business Machines Corporation Memory access architecture with coherence
WO2018121118A1 (en) * 2016-12-26 2018-07-05 上海寒武纪信息科技有限公司 Calculating apparatus and method
CN109558370A (en) * 2017-09-23 2019-04-02 成都海存艾匹科技有限公司 Three-dimensional computations encapsulation
US20200043530A1 (en) * 2018-07-31 2020-02-06 Micron Technology, Inc. Bank and channel structure of stacked semiconductor device
CN112652335A (en) * 2019-10-11 2021-04-13 爱思开海力士有限公司 Stacked memory device and memory system including the same
US20210225430A1 (en) * 2020-01-16 2021-07-22 Samsung Electronics Co., Ltd. Memory die including local processor and global processor, memory device, and electronic device
CN113451260A (en) * 2021-06-02 2021-09-28 中国科学院计算技术研究所 Three-dimensional chip based on system bus and three-dimensional method thereof
CN113704160A (en) * 2021-08-17 2021-11-26 深圳市安信达存储技术有限公司 Data storage method and system based on Feiteng processor and storage mainboard
CN113745236A (en) * 2020-05-29 2021-12-03 爱思开海力士有限公司 Memory device with vertical structure
CN114721994A (en) * 2022-04-08 2022-07-08 北京灵汐科技有限公司 Many-core processing device, data processing method, data processing equipment and medium
CN114823615A (en) * 2021-01-29 2022-07-29 西安紫光国芯半导体有限公司 Memory chip and 3D memory chip
CN115004363A (en) * 2020-02-06 2022-09-02 阿里巴巴集团控股有限公司 Hybrid bonding based integrated circuit device and method of manufacturing the same
CN115413367A (en) * 2020-02-07 2022-11-29 日升存储公司 High capacity memory circuit with low effective delay
CN115516628A (en) * 2020-05-28 2022-12-23 松下知识产权经营株式会社 AI chip
CN116635936A (en) * 2020-12-14 2023-08-22 美光科技公司 Memory configuration for supporting deep learning accelerators in integrated circuit devices
CN116661579A (en) * 2023-05-31 2023-08-29 上海芯高峰微电子有限公司 Semiconductor device and structure of 3D heterogeneous programmable chip power supply network
CN116860695A (en) * 2022-03-25 2023-10-10 北京思丰可科技有限公司 Computer chip
CN117577614A (en) * 2023-11-29 2024-02-20 无锡芯光互连技术研究院有限公司 Chip packaging structure, method and electronic equipment

Similar Documents

Publication Publication Date Title
US11769534B2 (en) Flexible memory system with a controller and a stack of memory
US8710676B2 (en) Stacked structure and stacked method for three-dimensional chip
KR101109562B1 (en) Extra high bandwidth memory die stack
CN109643704A (en) Method and apparatus for managing the gate of the special power on multi-chip package
WO2018058430A1 (en) Chip having extensible memory
JP7349812B2 (en) memory system
EP4105988A1 (en) Chipset and manufacturing method thereof
Ahmed et al. Increasing interposer utilization: A scalable, energy efficient and high bandwidth multicore-multichip integration solution
Su et al. 3D-MiM (MUST-in-MUST) technology for advanced system integration
CN108241484A (en) Neural computing device and method based on high bandwidth memory
CN117915670B (en) Integrated chip structure for memory and calculation
Lee et al. Heterogeneous System-Level Package Integration—Trends and Challenges
CN117915670A (en) Integrated chip structure for memory and calculation
CN113421879B (en) Cache content addressable memory and memory chip package structure
Ahmed et al. A one-to-many traffic aware wireless network-in-package for multi-chip computing platforms
KR102629195B1 (en) How to layout package structures, devices, board cards, and integrated circuits
Daneshtalab et al. CMIT—A novel cluster-based topology for 3D stacked architectures
CN113923157A (en) Multi-core system and processing method based on network on chip
Shamim et al. Energy-efficient wireless interconnection framework for multichip systems with in-package memory stacks
CN116266463A (en) Three-dimensional storage unit, storage method, three-dimensional storage chip assembly and electronic equipment
Tam et al. Breaking the memory wall for AI chip with a new dimension
CN111459864B (en) Memory device and manufacturing method thereof
Duan et al. Research on double-layer networks-on-chip for inter-chiplet data switching on active interposers
US20240054096A1 (en) Processor
Yee et al. Advanced 3D Wafer Level System Integration Technology

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant