CN117149700B - Data processing chip, manufacturing method thereof and data processing system

Data processing chip, manufacturing method thereof and data processing system

Info

Publication number
CN117149700B
CN117149700B
Authority
CN
China
Prior art keywords
die
buffer
wafer
data processing
coupled
Prior art date
Legal status
Active
Application number
CN202311404153.8A
Other languages
Chinese (zh)
Other versions
CN117149700A (en)
Inventor
吕佳霖
王峰
郭垣翔
张玮君
李岑
Current Assignee
Beijing Suneng Technology Co ltd
Original Assignee
Beijing Suneng Technology Co ltd
Priority date: 2023-10-27
Filing date: 2023-10-27
Publication date: 2024-02-09
Application filed by Beijing Suneng Technology Co ltd
Priority to CN202311404153.8A
Publication of CN117149700A
Application granted
Publication of CN117149700B
Status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/76Architectures of general purpose stored program computers
    • G06F15/78Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F15/781On-chip cache; Off-chip memory

Abstract

To address the technical problem that limited memory bandwidth cannot guarantee high-speed data transmission and therefore cannot meet computing-power demands, the disclosure provides a data processing chip, a manufacturing method thereof, and a data processing system. The data processing chip includes: a first die comprising an operator; and a second die stacked with the first die and comprising a buffer coupled with the operator by bonding; wherein the buffer is configured to cache transmitted data when the operator performs data transmission with a host or with a memory. This provides higher bandwidth for data transmission, achieves high computing power, and reduces the bandwidth requirement of the system.

Description

Data processing chip, manufacturing method thereof and data processing system
Technical Field
The disclosure relates to the technical field of data processing, and in particular relates to a data processing chip, a manufacturing method thereof and a data processing system.
Background
With the development of technology, more and more fields, such as artificial intelligence and secure computing, involve massive data processing (i.e., big data processing). Big data processing results in frequent, large-volume data interactions between the processor and the memory, which requires the memory to provide higher bandwidth to meet computing-power demands.
However, in conventional architecture designs, memory performance improves more slowly than processor performance, and the limited memory bandwidth cannot guarantee high-speed data transmission, making it difficult to meet computing-power requirements.
Disclosure of Invention
The present disclosure provides a data processing chip, a method of manufacturing the same, and a data processing system.
According to a first aspect of the present disclosure, there is provided a data processing chip comprising:
a first die, comprising: an arithmetic unit;
a second die stacked with the first die, comprising: a buffer coupled with the operator by bonding; wherein the buffer is configured to: cache transmitted data when the operator performs data transmission with a host or when the operator performs data transmission with a memory.
In some embodiments, the data processing chip further comprises:
a wiring layer located in a third die between the first die and the second die; or, in the second die and between the first die and the buffer;
a bridge circuit located in the wiring layer, a first port of the bridge circuit being coupled with the operator through a first interface protocol, and a second port of the bridge circuit being coupled with the buffer through a second interface protocol; wherein the second interface protocol and the first interface protocol are different.
In some embodiments, the operator comprises: a plurality of operation units; the buffer comprises: a plurality of cache units;
the data processing chip includes:
a plurality of bridge circuits located in the wiring layer; wherein each of the plurality of operation units is coupled with the plurality of cache units through a respective one of the plurality of bridge circuits.
In some embodiments, the first die further comprises:
a processor coupled to the buffer by bonding; wherein the buffer is further configured to: cache transmitted data when the processor performs data transmission with the memory.
In some embodiments, the orthographic projection of the second die coincides with the orthographic projection of the first die.
In some embodiments, the storage capacity of the buffer is greater than a preset value; wherein the preset value is greater than 0 megabytes and less than 1 gigabyte.
In some embodiments, the buffer comprises: dynamic random access memory, flash memory, phase change memory, or magnetic tunnel junction memory.
According to a second aspect of the present disclosure, there is provided a method of manufacturing a data processing chip, comprising:
Forming a first die, the first die comprising an operator;
forming a second die stacked with the first die, the second die including a buffer coupled to the operator by bonding; wherein the buffer is configured to: cache transmitted data when the operator performs data transmission with a host or when the operator performs data transmission with a memory.
In some embodiments, the method of manufacturing further comprises:
providing a first wafer, wherein the first wafer comprises a plurality of first dies;
providing a second wafer, the second wafer comprising a plurality of the second dies;
the forming a second die disposed in a stack with the first die, comprising:
bonding the first wafer and the second wafer such that the operator is coupled with the buffer;
and performing dicing processing on the bonded first wafer and second wafer.
In some embodiments, prior to bonding the first wafer and the second wafer, the method of manufacturing further comprises:
providing a third wafer comprising a wiring layer and a bridge circuit in the wiring layer;
bonding a first face of the third wafer to the first wafer such that a first port of the bridge circuit is coupled to the operator via a first interface protocol;
bonding a second side of the third wafer to the second wafer such that a second port of the bridge circuit is coupled to the buffer via a second interface protocol; wherein the second interface protocol and the first interface protocol are different; the second surface is opposite to the first surface.
In some embodiments, the providing a second wafer includes:
forming the buffer;
forming a wiring layer on the buffer;
forming a bridge circuit in the wiring layer, wherein a second port of the bridge circuit is coupled with the buffer through a second interface protocol;
the bonding the first wafer and the second wafer includes:
inverting the second wafer so that the wiring layer is located between the first die and the buffer;
bonding the bridge circuit and the operator such that a first port of the bridge circuit is coupled with the operator via a first interface protocol; wherein the second interface protocol and the first interface protocol are different.
In some embodiments, the first wafer employs a first process and the second wafer employs a second process; the characteristic size corresponding to the second process is larger than the characteristic size corresponding to the first process.
According to a third aspect of the present disclosure there is provided a data processing system comprising:
a data processing chip as in any above embodiment;
at least one memory arranged side by side with the first die along a first direction, the first direction being perpendicular to the stacking direction of the second die and the first die.
In some embodiments, the data processing chip and the memory constitute a node chip;
the data processing system includes: a plurality of the node chips; the plurality of node chips are arranged side by side along a second direction and are interconnected, the second direction is perpendicular to the stacking direction of the second die and the first die, and the second direction intersects the first direction.
In the embodiments of the disclosure, in the first aspect, by stacking the first die and the second die, the buffer and the operator can be coupled by bonding, which provides higher bandwidth for data transmission, achieves high computing power, and reduces the bandwidth requirement of the system; in the second aspect, compared with an on-chip cache, the disclosure arranges the second die outside the first die (i.e., off-chip) and uses the buffer in the second die to cache data, which reduces the design complexity of the system on chip and lowers the design and production costs of the data processing chip.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a graph illustrating a theoretical maximum computational performance model that can be achieved in accordance with an exemplary embodiment;
FIG. 2 is a schematic diagram of a data processing chip according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a data processing system shown in accordance with an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of another data processing system shown in accordance with an embodiment of the present disclosure;
FIG. 5 is a flow chart illustrating a method of manufacturing a data processing chip according to an embodiment of the present disclosure;
FIG. 6a is a schematic diagram of a manufacturing process of a data processing chip according to an embodiment of the disclosure;
FIG. 6b is a schematic diagram II of a manufacturing process of a data processing chip according to an embodiment of the disclosure;
fig. 6c is a schematic diagram of a manufacturing process of a data processing chip according to an embodiment of the disclosure.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. Where the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure; rather, they are merely examples of apparatus consistent with some aspects of the disclosure as detailed in the appended claims.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following its general principles and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
In order to illustrate the technical solutions of the present disclosure, a description will be given below with reference to specific embodiments.
As computing power increases greatly, the bandwidth requirement of the system rises accordingly; otherwise, raising computing power alone does not improve overall system performance. FIG. 1 is a graph of the theoretical maximum attainable computational performance (the Roof-line model) according to an exemplary embodiment. Referring to FIG. 1, because of the memory bandwidth bottleneck, the theoretical performance of the system no longer increases with increasing computing power, as shown to the right of the dashed line in FIG. 1.
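For reference, the Roof-line model underlying FIG. 1 is commonly written as follows (a standard formulation added here for clarity; the notation is not drawn from the patent itself). Attainable performance $P$ is capped either by the peak compute rate or by memory bandwidth times arithmetic intensity:

$$P = \min\left(P_{\text{peak}},\; B_{\text{mem}} \cdot I\right), \qquad I = \frac{\text{operations performed}}{\text{bytes transferred}}$$

In the axes of FIG. 1 (performance versus computing power at a fixed memory bandwidth), the region to the right of the dashed line is where the $B_{\text{mem}} \cdot I$ term binds, so adding compute no longer raises performance.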
Memory has evolved through different configurations, such as dynamic random access memory (Dynamic Random Access Memory, DRAM), double data rate memory (Double Data Rate, DDR), low-power memory (Low Power Double Data Rate, LPDDR), graphics memory (Graphics Double Data Rate, GDDR), and high bandwidth memory (High Bandwidth Memory, HBM), to meet the different bandwidth requirements of systems.
However, meeting higher system bandwidth requirements comes at the price of a substantial increase in cost overhead and design complexity. Taking the memories listed above as examples, bandwidth satisfies: HBM > GDDR > LPDDR > DDR; design complexity satisfies: HBM > GDDR > LPDDR = DDR; and cost overhead satisfies: HBM > GDDR > LPDDR > DDR.
Adding an on-chip cache, for example integrating static random-access memory (Static Random-Access Memory, SRAM) into a system on chip (System On Chip, SOC), can reduce the bandwidth the system needs to access off-chip memory. However, on-chip caching is costly and small in capacity (SRAM reaches the MB level at most), so its help in reducing system bandwidth requirements is limited. Therefore, how to reduce the system bandwidth requirement while achieving high computing power is a technical problem to be solved.
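As a rough, illustrative aside (not part of the patent's disclosure), the benefit of any cache on system bandwidth can be sketched in a few lines: only misses must be served off-chip, so the off-chip bandwidth requirement scales with the miss rate. All numbers below are assumptions chosen for illustration:

```python
# Minimal sketch: how a cache's hit rate scales down off-chip bandwidth demand.
# All figures here are illustrative assumptions, not values from the patent.

def offchip_bandwidth(demand_gbps: float, hit_rate: float) -> float:
    """Off-chip bandwidth needed when a fraction `hit_rate` of accesses is
    served by the cache; only misses reach off-chip memory."""
    return demand_gbps * (1.0 - hit_rate)

# An accelerator demanding 1 TB/s from its memory system:
demand = 1000.0  # GB/s
for hit in (0.0, 0.5, 0.9, 0.99):
    print(f"hit rate {hit:4.0%} -> off-chip need {offchip_bandwidth(demand, hit):7.1f} GB/s")

# A small MB-level SRAM rarely sustains very high hit rates on big-data
# workloads, which is why the disclosure turns to a GB-level stacked buffer.
```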
In view of this, the present disclosure provides a data processing chip, a method of manufacturing the same, and a data processing system.
Fig. 2 is a schematic diagram of a data processing chip 1000 according to an embodiment of the disclosure. Referring to fig. 2, the data processing chip 1000 includes:
a first die 1100, comprising: an operator 1110;
a second die 1200 stacked with the first die 1100, comprising: a buffer 1210 coupled with the operator 1110 by bonding; wherein the buffer 1210 is configured to: cache transmitted data when the operator 1110 performs data transmission with the host or with the memory.
The first die 1100 includes, but is not limited to, a system on chip SOC that integrates a plurality of functional circuits. The first die 1100 includes an operator 1110; further, as shown in connection with fig. 3, the first die 1100 further includes a processor 1130, a memory interface circuit 1150, and a node interface circuit 1160, etc., and the processor 1130, the memory interface circuit 1150, and the node interface circuit 1160 will be described below. It should be noted that the functional circuits shown in fig. 2 or fig. 3 are only shown as examples, and the first die 1100 may further include other functional circuits known in the art, for example, the first die 1100 further includes a power supply, a power management circuit, a timing control circuit, and the like.
The operator 1110 is configured to perform specific computations in response to control commands and output computation results. The operator 1110 includes a central processing unit (Central Processing Unit, CPU), a graphics processing unit (Graphics Processing Unit, GPU), a tensor processing unit (Tensor Processing Unit, TPU), a data processing unit (Data Processing Unit, DPU), an image processing unit (Image Processing Unit, IPU), a neural network processing unit (Neural-network Processing Unit, NPU), or the like. The operator 1110 may execute a hash algorithm, a convolution algorithm, a neural network algorithm, or the like.
In a specific example, the operator 1110 is a tensor processing unit TPU that includes a matrix multiplication unit (Matrix multiplication Unit, MXU), also called a systolic array, which can process 65536 8-bit integer multiply-add operations, greatly improving the computing power of the data processing chip.
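For intuition, 65536 multiply-adds correspond to a 256 × 256 grid of multiply-accumulate (MAC) cells. The sketch below is illustrative only; the dimensions, int8 inputs, and int32 accumulation are assumptions, not specifications from the patent. It shows the result such a systolic pass computes:

```python
import numpy as np

# Minimal sketch of what a 256x256 systolic MAC array computes per pass:
# an int8 matrix product with wide (int32) accumulation.

N = 256  # 256 * 256 = 65536 MAC cells

rng = np.random.default_rng(0)
a = rng.integers(-128, 128, size=(N, N), dtype=np.int8)
b = rng.integers(-128, 128, size=(N, N), dtype=np.int8)

# Each cell (i, j) accumulates sum_k a[i, k] * b[k, j] as operands pulse
# through the array; numerically this is just a widened matrix product:
c = a.astype(np.int32) @ b.astype(np.int32)

print(c.shape, c.dtype)  # (256, 256) int32
```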
Stacking between the second die 1200 and the first die 1100 may be achieved by a Wafer on Wafer (WoW) technique, and electrical connection between the second die 1200 and the first die 1100 may be achieved by at least one of a bonding layer (bonding layer), a Through-Silicon-Via (TSV), a wire (wire), or a bump (bump).
The buffer 1210 may cache data to be written into the memory during a write operation; or cache data read from the memory during a read operation; or cache data to be operated on by the operator 1110; or cache data output by the operator 1110 after an operation. The buffer 1210 includes: dynamic random access memory DRAM, flash memory (Flash Memory), phase change memory (Phase Change Memory, PCM), or magnetic tunnel junction memory (Magnetic Tunnel Junction, MTJ), etc. In other embodiments, the buffer 1210 may be any other type of memory known in the art, which the disclosure does not specifically limit.
In a specific example, the buffer 1210 is a dynamic random access memory DRAM, typically of GigaByte (GB) class capacity and with a large number of pins, which can achieve high bandwidth even with low-speed digital circuits. It can be appreciated that, in this example, using a DRAM with larger storage capacity and more pins as the buffer of the data processing chip can greatly reduce the bandwidth requirement of the system; and compared with using costly SRAM as the cache, using DRAM as the cache in this example can greatly reduce cost.
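The "many pins, low speed" trade-off is simple arithmetic: aggregate bandwidth equals interface width times per-pin data rate. A minimal sketch with assumed pin counts and rates (illustrative, not figures from the patent):

```python
# Illustrative only: aggregate bandwidth = data pins * per-pin data rate.
# The pin counts and rates below are assumptions chosen for comparison.

def bandwidth_gbs(data_pins: int, gbps_per_pin: float) -> float:
    """Aggregate bandwidth in GB/s (8 bits per byte)."""
    return data_pins * gbps_per_pin / 8.0

# A narrow, fast off-chip interface vs. a wide, slow bonded interface:
print(bandwidth_gbs(data_pins=32,   gbps_per_pin=6.4))  # 25.6 GB/s
print(bandwidth_gbs(data_pins=8192, gbps_per_pin=0.5))  # 512.0 GB/s

# Die-to-die bonding allows thousands of connections, so each connection
# can run slowly while the aggregate bandwidth stays high.
```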
It should be noted that the first die 1100 is generally manufactured by an advanced process, such as a 5nm, 3nm, or 1nm process, which entails higher design and production costs, so integrating an on-chip cache into the system on chip would certainly waste cost. Based on this, in other embodiments, the buffer 1210 may also be an SRAM manufactured with a process node above 20nm, for example a 22nm, 25nm, or 28nm process. It can be appreciated that, in this example, stacking an SRAM with lower process requirements on the first die 1100 and using it as the buffer can save design and production costs of the data processing chip.
In some embodiments, the first die 1100 is manufactured using a first process and the second die 1200 is manufactured using a second process; the characteristic dimension corresponding to the second process is larger than the characteristic dimension corresponding to the first process. For example, the first die 1100 is an SOC, and the feature size corresponding to the SOC process is 5nm; the second die 1200 is a DRAM, and the feature size corresponding to the DRAM process is 25nm.
It should be emphasized that this is merely an example, and it should be understood that the first die 1100 and the second die 1200 may be of other types known in the art, and the feature size corresponding to the process of the first die 1100 is not limited to 5nm, and the feature size corresponding to the process of the second die 1200 is not limited to 25nm, so long as the feature size corresponding to the second process is ensured to be larger than the feature size corresponding to the first process. Of course, in other embodiments, the feature size corresponding to the second process may be smaller than or equal to the feature size corresponding to the first process.
In the embodiments of the disclosure, in the first aspect, by stacking the first die and the second die, the buffer and the operator can be coupled by bonding, which provides higher bandwidth for data transmission, achieves high computing power, and reduces the bandwidth requirement of the system; in the second aspect, compared with an on-chip cache, the present disclosure arranges the second die outside the first die (i.e., off-chip) and uses the buffer in the second die to cache data, which is beneficial to reducing the design complexity of the system on chip and the design and production costs of the data processing chip.
In some embodiments, referring to FIG. 2, the data processing chip 1000 further includes:
a wiring layer (routing layer) 1310 in the third die 1300 between the first die 1100 and the second die 1200; alternatively, in the second die 1200 and between the first die 1100 and the buffer 1210;
a bridge circuit 1320 disposed in the wiring layer 1310, wherein a first port of the bridge circuit 1320 is coupled to the operator 1110 via a first interface protocol, and a second port of the bridge circuit 1320 is coupled to the buffer 1210 via a second interface protocol; wherein the second interface protocol is different from the first interface protocol.
Fig. 2 shows that the wiring layer 1310 is located in the third die 1300, and the first die 1100, the third die 1300, and the second die 1200 are stacked in order. In another example, the wiring layer 1310 is located in the second die 1200 (not shown in the figure), i.e., the first die 1100 and the second die 1200 are stacked in order.
It should be noted that, since the first die 1100 employs a more advanced process, the integration of the on-chip cache in the system-on-chip results in further increase of design and production costs. In this embodiment, the first die and the second die are stacked, and the wiring layer is disposed in the third die or the second die, so that the design and production cost of the data processing chip are further reduced due to lower process requirements of the third die or the second die. In other embodiments, the routing layer may also be located in the first die and between the operator and the second die.
In some embodiments, the third die 1300 is fabricated using a third process whose corresponding feature size is larger than that of the first process. For example, the feature size corresponding to the third process is 28nm. It should be emphasized that this is merely an example to convey the disclosure to those skilled in the art; the feature size corresponding to the process of the third die 1300 is not limited to 28nm, as long as the feature size corresponding to the third process is larger than that of the first process. Of course, in other embodiments, the feature size corresponding to the third process may be smaller than or equal to that of the first process.
In some embodiments, the feature size corresponding to the third process and the feature size corresponding to the second process may be the same or different.
The wiring layer 1310 includes an insulating layer and wiring (not shown in the figure) located in the insulating layer. The material of the insulating layer includes silicon oxide, silicon nitride, silicon oxynitride, or the like; the material of the wiring includes a conductive material, for example, at least one of copper, aluminum, platinum, titanium, or tin.
The bridge circuit 1320 is configured to perform protocol conversion on data transferred between the operator 1110 and the buffer 1210. The bridge circuit 1320 includes a first port, a conversion unit, and a second port; the first port is used for coupling with the operator 1110, and the second port is used for coupling with the buffer 1210; the conversion unit is configured to convert the first interface protocol complied with between the first port and the operator 1110 into the second interface protocol complied with between the second port and the buffer 1210, so as to ensure data transmission between the operator 1110 and the buffer 1210.
In a specific example, the first port of the bridge circuit 1320 is coupled to the input/output interface of the operator 1110, and the first port of the bridge circuit 1320 and the operator 1110 may comply with an SRAM interface protocol (SRAM protocol); the second port of the bridge circuit 1320 is coupled to the input/output interface of the buffer 1210, and the second port of the bridge circuit 1320 and the buffer 1210 may comply with a DRAM interface protocol (DRAM protocol). Of course, in other embodiments, the conversion unit may be omitted, and both the first port of the bridge circuit 1320 and the operator 1110, and the second port of the bridge circuit 1320 and the buffer 1210, may comply with the DRAM interface protocol.
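Functionally, the bridge circuit acts as an adapter between the two port protocols. The following is a minimal software analogy of that behavior (all class and method names are invented for illustration; the patent does not disclose an implementation):

```python
# Software analogy of the bridge circuit: an adapter exposing an SRAM-style
# flat-address read/write on one side while driving a DRAM-style row/column
# transaction on the other. All names and sizes here are hypothetical.

class DramPort:
    """Toy DRAM-protocol side: activate a row, then read/write a column."""
    def __init__(self, rows: int = 1024, cols: int = 256):
        self.mem = [[0] * cols for _ in range(rows)]
        self.cols = cols
        self.open_row = None

    def activate(self, row: int):
        self.open_row = row

    def read_col(self, col: int) -> int:
        return self.mem[self.open_row][col]

    def write_col(self, col: int, value: int):
        self.mem[self.open_row][col] = value

class SramBridge:
    """SRAM-style flat addressing, translated into DRAM commands."""
    def __init__(self, dram: DramPort):
        self.dram = dram

    def read(self, addr: int) -> int:
        row, col = divmod(addr, self.dram.cols)
        self.dram.activate(row)  # the protocol conversion happens here
        return self.dram.read_col(col)

    def write(self, addr: int, value: int):
        row, col = divmod(addr, self.dram.cols)
        self.dram.activate(row)
        self.dram.write_col(col, value)

bridge = SramBridge(DramPort())
bridge.write(70000, 42)
print(bridge.read(70000))  # 42
```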
In the embodiments of the disclosure, by arranging the wiring layer in the third die or the second die and arranging the bridge circuit in the wiring layer, data transmission between the operator and the buffer is realized while the design and production costs of the data processing chip are further reduced. In addition, providing the bridge circuit enables data transmission between the operator and the buffer while remaining compatible with existing interface protocols.
In some embodiments, the data processing chip 1000 includes: at least two second dies located on two sides of the first die respectively; and at least two wiring layers located on two sides of the first die respectively, each between the first die and a second die or within a second die.
For example, the data processing chip 1000 includes a pair of second dies, denoted as a first second die and a second second die, and a pair of wiring layers, denoted as a first wiring layer and a second wiring layer; the first second die, the first die, and the second second die are stacked in sequence; the first second die is coupled with the first die through the first wiring layer, and the first die is coupled with the second second die through the second wiring layer. The first wiring layer may be located in a third die between the first second die and the first die, or in the first second die; the second wiring layer may be located in a third die between the first die and the second second die, or in the second second die. Here, the number of second dies and wiring layers is not limited to two, and may be three or more.
In some embodiments, referring to fig. 2, the operator 1110 includes: a plurality of operation units 1120; the buffer 1210 includes: a plurality of cache units 1220;
the data processing chip 1000 includes: a plurality of bridge circuits 1320 located in the wiring layer 1310; wherein each of the plurality of operation units 1120 is coupled with the plurality of cache units 1220 through a respective one of the plurality of bridge circuits 1320.
Here, the number of the operation units 1120 may be two or more, the number of the buffer units 1220 may be two or more, and the number of the bridge circuits 1320 may be two or more. At least two of the number of operation units 1120, the number of buffer units 1220, and the number of bridge circuits 1320 may be the same or different.
In some embodiments, the number of operation units 1120, the number of cache units 1220, and the number of bridge circuits 1320 are the same. In one embodiment, the number of operation units 1120, the number of cache units 1220, and the number of bridge circuits 1320 are all 1024. It can be appreciated that, in this embodiment, the plurality of operation units 1120 can be coupled to the plurality of cache units 1220 through the plurality of bridge circuits 1320, and each operation unit 1120 can manage the data cached in its corresponding cache unit 1220, which is beneficial to improving the computing performance of the data processing chip.
In one embodiment, the storage capacity of the operation unit 1120 is 128 B, and the storage capacity of the cache unit 1220 is 8 MB.
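Taken together with the 1024-way figure above, the aggregate capacities follow by simple arithmetic (a derivation from the example numbers, not an additional disclosure):

$$1024 \times 8\ \mathrm{MB} = 8\ \mathrm{GB} \ \text{(total buffer capacity)}, \qquad 1024 \times 128\ \mathrm{B} = 128\ \mathrm{KB} \ \text{(total operation-unit storage)}$$

which is consistent with the GB-class buffer capacity discussed for the DRAM example above.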
In some embodiments, the cache unit 1220 includes a plurality of memory array slices (Memory Array Tile, MAT) 1221. For example, the cache unit 1220 includes 96 MATs. The number of memory array slices included in the cache unit 1220 is not limited thereto.
In some embodiments, the first die further comprises: a processor coupled with the buffer by bonding; wherein the buffer is further configured to: cache transmitted data when the processor performs data transmission with the memory. As shown in connection with fig. 3, the processor 1130 and the operator 1110 are both located in the first die 1100 and arranged side by side in a direction perpendicular to the stacking direction of the first die 1100 and the second die 1200; the processor 1130 is configured to control logical operations of the memory, such as write, read, or erase operations.
As shown in fig. 2 and 3, the buffer 1210 includes a first buffer area and a second buffer area (not shown in the figures); the first buffer area is coupled to the operator 1110, and the second buffer area is coupled to the processor 1130. It will be appreciated that when a DRAM with larger capacity is used as the buffer, the storage space of the DRAM may be partitioned, with one portion used as the cache of the operator 1110 and another portion used as the cache of the processor 1130, thereby improving the utilization of the DRAM storage space. Here, the sizes of the first buffer area and the second buffer area may be the same or different and may be set reasonably by those skilled in the art according to actual needs, which the disclosure does not specifically limit.
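A minimal sketch of such an address-space partition (the total size and the even split are assumptions for illustration, not values from the patent):

```python
# Illustrative partition of a GB-class buffer into two regions: one for
# the operator, one for the processor. Size and split ratio are assumed.

TOTAL_BYTES = 8 * 2**30  # assume an 8 GB stacked DRAM buffer
SPLIT = 0.5              # assume an even split between the two regions

first_region  = (0, int(TOTAL_BYTES * SPLIT))            # operator cache
second_region = (int(TOTAL_BYTES * SPLIT), TOTAL_BYTES)  # processor cache

def region_of(addr: int) -> str:
    """Map a flat buffer address to the region that owns it."""
    lo, hi = first_region
    return "first (operator)" if lo <= addr < hi else "second (processor)"

print(region_of(1024))             # first (operator)
print(region_of(TOTAL_BYTES - 1))  # second (processor)
```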
In some embodiments, the orthographic projection of the second die 1200 coincides with that of the first die 1100, i.e., the second die 1200 and the first die 1100 have the same area, which facilitates dicing and packaging during manufacture and improves the manufacturing yield of the data processing chip.
In some embodiments, the orthographic projection of the first buffer area coincides with that of the operator 1110, and the orthographic projection of the second buffer area coincides with that of the processor 1130; that is, the first buffer area and the operator 1110 have the same area, and the second buffer area and the processor 1130 have the same area.
In some embodiments, the storage capacity of the buffer 1210 is greater than a preset value; wherein the preset value is greater than 0 Megabyte (MB) and less than 1 GigaByte (GB).
It is noted that for cost reasons, the capacity of on-chip SRAM is typically less than 1GB, e.g., the capacity of on-chip SRAM may be on the MB scale. In the embodiment of the disclosure, the storage capacity of the off-chip buffer can be increased and the bandwidth requirement of the system can be reduced by setting the storage capacity of the buffer to be larger than the preset value.
In a specific embodiment, in a system on chip that uses SRAM as the cache, the MB-level storage capacity of the SRAM corresponds to a system bandwidth requirement of terabytes per second, which must be met by a higher-bandwidth memory such as high bandwidth memory HBM, at the cost of complex design and high production cost. By instead using a DRAM of GB-class storage capacity as the off-chip buffer 1210, the present disclosure can reduce the system bandwidth requirement to several gigabytes per second, without providing a higher-bandwidth memory such as HBM, thereby simplifying the design of the data processing chip and reducing cost.
In some embodiments, the data processing chip 1000 is applied in the field of artificial intelligence; the data processing chip 1000 includes, but is not limited to, an artificial intelligence (Artificial Intelligence, AI) chip.
Based on the data processing chip, the embodiments of the disclosure further provide a data processing system, comprising: the data processing chip of any of the above embodiments; and at least one memory arranged side by side with the first die along a first direction, the first direction being perpendicular to the stacking direction of the second die and the first die.
Fig. 3 is a schematic diagram of a data processing system 3000, shown in accordance with an embodiment of the present disclosure. Referring to fig. 3, a first die 1100 is disposed in parallel with a plurality of memories 2000 along a first direction, the first die 1100 being coupled to the memories 2000 through a memory interface circuit 1150. For example, fig. 3 shows 8 memories 2000a, 2000b, 2000c, 2000d, 2000e, 2000f, 2000g, and 2000h, the memories 2000a, 2000b, 2000c, and 2000d being located at one side of the first die 1100, and the memories 2000e, 2000f, 2000g, and 2000h being located at the other side of the first die 1100.
The memory 2000 includes: double data rate memory (Double Data Rate, DDR), low-power memory (Low Power Double Data Rate, LPDDR), graphics memory (Graphics Double Data Rate, GDDR), high bandwidth memory (High Bandwidth Memory, HBM), etc. In one embodiment, the buffer 1210 is a dynamic random access memory DRAM, and the memory 2000 is fifth-generation low-power memory LPDDR5, abbreviated LP5.
In some embodiments, referring to fig. 3, after the buffer is disposed in the second die, the SRAM buffer 1140 (shown by the dashed box in fig. 3) in the first die 1100 can be omitted, which is beneficial to reducing the planar size of the data processing chip. In other embodiments, as shown in fig. 3, after the buffer is disposed in the second die, the SRAM buffer 1140 in the first die 1100 can be retained, and the SRAM and the buffer in the second die are used together as the cache, which is beneficial to further reducing the bandwidth requirement. Those skilled in the art may choose either arrangement according to actual needs; the disclosure is not limited herein.
It can be appreciated that, in the embodiments of the disclosure, by stacking the first die and the second die and using the buffer in the second die as the cache, the bandwidth requirement of the system is reduced; therefore, a low-bandwidth, high-capacity memory can be selected, and multiple computing chip nodes need not be connected together, which is beneficial to reducing the design complexity and production cost of the data processing system and improving integration.
In some embodiments, the first die 1100 and the plurality of memories 2000 are located in a first tier of the data processing system 3000, the third die 1300 is located in a second tier of the data processing system 3000, and the second die 1200 is located in a third tier of the data processing system 3000. That is, the first die 1100 is located in the same level as the plurality of memories 2000, the third die 1300 is located in a different level than the plurality of memories 2000, and the second die 1200 is located in a different level than the plurality of memories 2000.
It should be noted that, as used in this disclosure, the same level indicates that two dies have the same distance from the top surface or the bottom surface of the package substrate (not shown), and different levels indicate that two dies have different distances from the top surface or the bottom surface of the package substrate.
In some embodiments, the data processing chip and the memory constitute a node chip; the data processing system includes: a plurality of the node chips; the plurality of node chips are arranged side by side along a second direction and are interconnected, the second direction is perpendicular to the stacking direction of the second die and the first die, and the second direction intersects the first direction. Here, the interconnection includes a serial arrangement, a mesh network arrangement, a ring arrangement, a one-to-many arrangement, or the like; the embodiments of the disclosure do not specifically limit the connection manner of the plurality of node chips.
The data processing chip 1000 and at least one memory 2000 in fig. 3 may constitute a node chip, and fig. 4 is a schematic diagram of another data processing system 4000 shown in accordance with an embodiment of the present disclosure. Referring to FIG. 4, data processing system 4000 includes a plurality of node chips, for example, FIG. 4 shows 4 node chips 4000a, 4000b, 4000c, and 4000d, which may be connected via node interface circuit 1160. By integrating a plurality of node chips, the computing power of the data processing system can be further improved.
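For illustration, the interconnection modes named above can be sketched as adjacency maps for the 4 node chips of FIG. 4 (a toy model; the patent does not prescribe a specific topology, and a mesh is the 2D analogue of the ring shown here):

```python
# Toy adjacency maps for 4 node chips, indices 0..3. Illustrative only.

def ring(n: int):
    """Each node links to its two ring neighbors."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

def serial(n: int):
    """Chain: each node links only to its immediate neighbors."""
    return {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}

def one_to_many(n: int, hub: int = 0):
    """Hub-and-spoke: one node links to all others."""
    return {i: ([j for j in range(n) if j != i] if i == hub else [hub])
            for i in range(n)}

print(ring(4))         # {0: [3, 1], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
print(serial(4))       # {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(one_to_many(4))  # node 0 is the hub
```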
Based on the data processing chip, the embodiment of the disclosure also provides a manufacturing method of the data processing chip.
Fig. 5 is a flowchart illustrating a method of manufacturing a data processing chip according to an embodiment of the present disclosure. Referring to fig. 5, the manufacturing method includes at least the steps of:
s5100: forming a first die, the first die including an operator;
s5200: forming a second die stacked with the first die, the second die including a buffer coupled to the operator by bonding; wherein the buffer is configured to: and buffering the transmitted data when the arithmetic unit and the host machine or the arithmetic unit and the memory are used for data transmission.
It should be noted that the steps shown in fig. 5 are not exclusive, and other steps may be performed before, after, or between any of the illustrated steps; the order of the steps shown in fig. 5 may also be adjusted according to actual needs.
Fig. 6a to 6c are schematic views illustrating a manufacturing process of a data processing chip according to an embodiment of the present disclosure. The method for manufacturing the data processing chip according to the embodiment of the present disclosure will be described in detail with reference to fig. 5 and fig. 6a to 6 c.
Referring to fig. 6a, a first wafer 6100A is provided, where the first wafer 6100A includes a plurality of first dies 6100; a second wafer 6200A is provided, the second wafer 6200A comprising a plurality of second dies 6200.
The first wafer 6100A and the second wafer 6200A may be fabricated by processes known in the semiconductor art (e.g., thin-film deposition, photolithography, etching, ion implantation, etc.), which will not be described herein. The first wafer 6100A employs a first process, and the second wafer 6200A employs a second process, where the feature size corresponding to the second process is larger than the feature size corresponding to the first process. Here, the feature size indicates the smallest dimension in the first wafer 6100A or the second wafer 6200A. Of course, in other embodiments, the feature size corresponding to the second process may be smaller than or equal to the feature size corresponding to the first process.
The first wafer 6100A includes a plurality of first dies 6100 and first scribe lines between two adjacent first dies 6100, and the second wafer 6200A includes a plurality of second dies 6200 and second scribe lines between two adjacent second dies 6200. In one embodiment, the size of the first die 6100 is the same as the size of the second die 6200, and the size of the first scribe line is the same as the size of the second scribe line. Here, the dimensions include a length and a width.
In some embodiments, step S5200 described above includes: bonding the first wafer and the second wafer so that the operator is coupled with the buffer; dicing the bonded first and second wafers; and then packaging the two dies together.
Referring to fig. 6b, after the first wafer 6100A and the second wafer 6200A are manufactured, the first wafer 6100A may be aligned with the second wafer 6200A. Specifically, the first dies 6100 are aligned with the second dies (not shown), and the first scribe lines are aligned with the second scribe lines; further, the operators in the first dies 6100 are aligned with the buffers in the second dies (refer to the operator 1110 and the buffer 1210 in fig. 2). Here, the second dies 6200 are not shown for ease of illustration. It should be appreciated that the second dies 6200 in fig. 6b are located on the surface of the second wafer 6200A facing the first wafer 6100A.
After the first wafer 6100A and the second wafer 6200A are aligned and bonded, dicing is performed to divide the bonded wafers into a plurality of data processing chips, for example the data processing chip shown in fig. 2. Each data processing chip includes a first die and a second die that are stacked, the first die includes an operator, the second die includes a buffer, and the operator and the buffer are coupled by bonding; the two dies are then packaged together.
In some embodiments, the method of manufacturing further comprises, prior to bonding the first wafer and the second wafer:
providing a third wafer comprising a wiring layer and a bridge circuit in the wiring layer;
bonding the first face of the third wafer and the first wafer so that a first port of the bridge circuit is coupled with the operator through a first interface protocol;
bonding a second face of the third wafer with the second wafer such that a second port of the bridge circuit is coupled with the buffer through a second interface protocol; wherein the second interface protocol is different from the first interface protocol; the second surface is opposite to the first surface.
Referring to fig. 6c, a third wafer 6300A is provided, the third wafer 6300A comprising a plurality of third dies (not shown in the figures). The third wafer 6300A may be fabricated using processes known in the semiconductor art (e.g., thin-film deposition, photolithography, etching, ion implantation, etc.), which are not described herein. The third wafer 6300A employs a third process, where the feature size corresponding to the third process is greater than the feature size corresponding to the first process, and the feature size corresponding to the third process and the feature size corresponding to the second process may be the same or different. Of course, in other embodiments, the feature size corresponding to the third process may be smaller than or equal to the feature size corresponding to the first process.
The third wafer 6300A includes a plurality of third dies and a third scribe line between two adjacent third dies, and in one embodiment, the size of the first die 6100, the size of the second die 6200, and the size of the third die are the same, the size of the first scribe line, the size of the second scribe line, and the size of the third scribe line are the same, and the three dies are co-packaged together.
Still referring to fig. 6c, the first face of the third wafer 6300A is aligned with and bonded to the first wafer 6100A so that the first port of the bridge circuit is coupled to the operator (not shown); the second face of the third wafer 6300A is aligned with and bonded to the second wafer 6200A so that the second port of the bridge circuit is coupled to the buffer (not shown), allowing data transmission between the operator and the buffer to pass through the bridge circuit. It should be noted that after bonding the first face of the third wafer 6300A to the first wafer 6100A, the substrate of the third wafer 6300A may first be removed until the second port of the bridge circuit is exposed, and the second wafer then bonded to the third wafer. For ease of illustration, the third dies are not shown in fig. 6c.
Here, the third wafer may be bonded to the first wafer and then bonded to the second wafer; or, the third wafer is bonded to the second wafer and then bonded to the first wafer, and the bonding sequence is not particularly limited in the embodiments of the present disclosure.
In some embodiments, the providing a second wafer includes: forming a buffer; forming a wiring layer on the buffer; forming a bridge circuit in the wiring layer, wherein a second port of the bridge circuit is coupled with the buffer through a second interface protocol; the bonding the first wafer and the second wafer includes: inverting the second wafer so that the wiring layer is located between the first die and the buffer; bonding the bridge circuit and the operator such that a first port of the bridge circuit is coupled to the operator via a first interface protocol; wherein the second interface protocol is different from the first interface protocol.
After the buffer is formed, an insulating layer covering the buffer is formed, and wiring and a bridge circuit are formed in the insulating layer through photolithography, etching, thin-film deposition, and other processes, the bridge circuit being coupled with the buffer through interconnection contacts and/or interconnection lines; after the first and second wafers are manufactured, the first and second wafers are aligned and bonded. Here, the alignment and bonding of the first wafer and the second wafer may refer to fig. 6b or fig. 6c described above, and for brevity are not repeated here.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (13)

1. A data processing chip, comprising:
a first die, comprising: an operator and a processor;
a second die stacked with the first die, comprising: a buffer coupled with the operator and the processor by bonding; the buffer comprises a first buffer area and a second buffer area, the first buffer area is coupled with the operator, and the second buffer area is coupled with the processor;
the first buffer area is configured to: cache transmitted data when the operator performs data transmission with a host or when the operator performs data transmission with a memory;
the second buffer area is configured to: cache transmitted data when the processor performs data transmission with the memory.
2. The data processing chip of claim 1, wherein the data processing chip further comprises:
a wiring layer located in a third die between the first die and the second die; or, in the second die and between the first die and the buffer;
a bridge circuit located in the wiring layer, a first port of the bridge circuit being coupled with the operator through a first interface protocol, and a second port of the bridge circuit being coupled with the buffer through a second interface protocol; wherein the second interface protocol and the first interface protocol are different.
3. The data processing chip of claim 2, wherein the operator comprises: a plurality of operation units; the buffer comprises: a plurality of cache units;
the data processing chip includes:
a plurality of bridge circuits located in the wiring layer; wherein each of the plurality of operation units is coupled with the plurality of cache units through a respective one of the plurality of bridge circuits.
4. The data processing chip of claim 1, wherein an orthographic projection of the second die coincides with an orthographic projection of the first die.
5. The data processing chip of claim 1, wherein the storage capacity of the buffer is greater than a preset value; wherein the preset value is greater than 0 megabytes and less than 1 gigabyte.
6. The data processing chip of claim 1, wherein the buffer comprises: dynamic random access memory, flash memory, phase change memory, or magnetic tunnel junction memory.
7. A method of manufacturing a data processing chip, comprising:
forming a first die, the first die comprising an operator and a processor;
forming a second die stacked with the first die, the second die including a buffer coupled to the operator and the processor by bonding; the buffer comprises a first buffer area and a second buffer area, the first buffer area is coupled with the operator, and the second buffer area is coupled with the processor; the first buffer area is configured to: cache transmitted data when the operator performs data transmission with a host or when the operator performs data transmission with a memory; the second buffer area is configured to: cache transmitted data when the processor performs data transmission with the memory.
8. The manufacturing method according to claim 7, characterized in that the manufacturing method further comprises:
providing a first wafer, wherein the first wafer comprises a plurality of first dies;
providing a second wafer, the second wafer comprising a plurality of the second dies;
the forming a second die disposed in a stack with the first die, comprising:
bonding the first wafer and the second wafer such that the operator and the processor are coupled with the buffer;
and performing dicing processing on the bonded first wafer and second wafer.
9. The method of manufacturing according to claim 8, wherein prior to bonding the first wafer and the second wafer, the method of manufacturing further comprises:
providing a third wafer comprising a wiring layer and a bridge circuit in the wiring layer;
bonding a first face of the third wafer to the first wafer such that a first port of the bridge circuit is coupled to the operator via a first interface protocol;
bonding a second side of the third wafer to the second wafer such that a second port of the bridge circuit is coupled to the buffer via a second interface protocol; wherein the second interface protocol and the first interface protocol are different; the second surface is opposite to the first surface.
10. The method of manufacturing of claim 8, wherein the providing a second wafer comprises:
forming the buffer;
forming a wiring layer on the buffer;
forming a bridge circuit in the wiring layer, wherein a second port of the bridge circuit is coupled with the buffer through a second interface protocol;
the bonding the first wafer and the second wafer includes:
inverting the second wafer so that the wiring layer is located between the first die and the buffer;
bonding the bridge circuit and the operator such that a first port of the bridge circuit is coupled with the operator via a first interface protocol; wherein the second interface protocol and the first interface protocol are different.
11. The method of claim 8, wherein the first wafer is processed by a first process and the second wafer is processed by a second process; the characteristic size corresponding to the second process is larger than the characteristic size corresponding to the first process.
12. A data processing system, comprising:
the data processing chip of any one of claims 1 to 6;
at least one memory arranged side by side with the first die along a first direction, the first direction being perpendicular to the stacking direction of the second die and the first die.
13. The data processing system of claim 12, wherein the data processing chip and the memory constitute a node chip;
the data processing system includes: a plurality of the node chips; the plurality of node chips are arranged side by side along a second direction and are interconnected, the second direction is perpendicular to the stacking direction of the second die and the first die, and the second direction intersects the first direction.
CN202311404153.8A 2023-10-27 2023-10-27 Data processing chip, manufacturing method thereof and data processing system Active CN117149700B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311404153.8A CN117149700B (en) 2023-10-27 2023-10-27 Data processing chip, manufacturing method thereof and data processing system

Publications (2)

Publication Number Publication Date
CN117149700A (en) 2023-12-01
CN117149700B (en) 2024-02-09

Family

ID=88908378

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113722268A (en) * 2021-09-02 2021-11-30 西安紫光国芯半导体有限公司 Storage and calculation integrated stacking chip
WO2023030053A1 (en) * 2021-09-02 2023-03-09 西安紫光国芯半导体有限公司 Llc chip, cache system and method for reading and writing llc chip
CN116108900A (en) * 2021-11-05 2023-05-12 安徽寒武纪信息科技有限公司 Accelerator structure, method of generating accelerator structure, and apparatus therefor
CN116610630A (en) * 2023-07-14 2023-08-18 上海芯高峰微电子有限公司 Multi-core system and data transmission method based on network-on-chip


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant