WO2023078006A1 - Accelerator structure, method for generating an accelerator structure, and device thereof - Google Patents

Accelerator structure, method for generating an accelerator structure, and device thereof

Info

Publication number
WO2023078006A1
WO2023078006A1 · PCT/CN2022/122375 · CN2022122375W
Authority
WO
WIPO (PCT)
Prior art keywords
layer
die group
die
circuit
cow
Prior art date
Application number
PCT/CN2022/122375
Other languages
English (en)
French (fr)
Inventor
邱志威
陈帅
高崧
庄云良
Original Assignee
Cambricon (Xi'an) Integrated Circuit Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Cambricon (Xi'an) Integrated Circuit Co., Ltd.
Publication of WO2023078006A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00: Digital computers in general; Data processing equipment in general
    • G06F 15/76: Architectures of general purpose stored program computers
    • G06F 15/78: Architectures of general purpose stored program computers comprising a single central processing unit
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00: Digital computers in general; Data processing equipment in general
    • G06F 15/76: Architectures of general purpose stored program computers
    • G06F 15/78: Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F 15/7807: System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • H: ELECTRICITY
    • H01: ELECTRIC ELEMENTS
    • H01L: SEMICONDUCTOR DEVICES NOT COVERED BY CLASS H10
    • H01L 21/00: Processes or apparatus adapted for the manufacture or treatment of semiconductor or solid state devices or of parts thereof
    • H01L 21/02: Manufacture or treatment of semiconductor devices or of parts thereof
    • H01L 21/04: Manufacture or treatment of semiconductor devices or of parts thereof the devices having potential barriers, e.g. a PN junction, depletion layer or carrier concentration layer
    • H01L 21/50: Assembly of semiconductor devices using processes or apparatus not provided for in a single one of the subgroups H01L21/06 - H01L21/326, e.g. sealing of a cap to a base of a container
    • H01L 21/56: Encapsulations, e.g. encapsulation layers, coatings

Definitions

  • the present invention generally relates to the field of semiconductors. More specifically, the present invention relates to accelerator structures and devices thereof, methods for generating accelerator structures, and computer-readable storage media, computer program products, and computer devices.
  • Taiwan Semiconductor Manufacturing Co., Ltd. (TSMC) has developed an ultra-large, compact system solution called Integrated Fan-Out System on Wafer (InFO_SoW), which integrates known-good chip arrays with power and cooling modules for high-performance computing.
  • InFO_SoW reduces the use of substrates and printed wiring boards by acting as the carrier itself.
  • a tightly packed multi-chip array within a compact system enables this solution to reap the benefits of wafer scale, such as low-latency chip-to-chip communication, high bandwidth density, and low power distribution network (PDN) impedance, for higher computing performance and power efficiency.
  • the solution of the present invention provides an accelerator structure and a device thereof, a method for generating the accelerator structure, a computer-readable storage medium, a computer program product, and a computer device.
  • the present invention discloses an accelerator structure, including: a computing layer, a module layer and a circuit layer.
  • the computing layer is provided with a plurality of chip-on-wafer (CoW) units, and each chip-on-wafer unit includes a first die group and a second die group;
  • the module layer is provided with a power module die group and an interface module die group;
  • the circuit layer is arranged between the operation layer and the module layer.
  • the power module die group provides power to the first die group and the second die group through the circuit layer, and the first die group and the second die group output calculation results through the interface module die group through the circuit layer.
  • the present invention discloses an integrated circuit device including the aforementioned accelerator structure, and also discloses a board including the aforementioned integrated circuit device.
  • the present invention discloses a method for generating an accelerator structure, including: generating a circuit layer; generating an operation layer on one side of the circuit layer, the operation layer being provided with a plurality of chip-on-wafer units, each chip-on-wafer unit including a first die group and a second die group; and generating a module layer on the other side of the circuit layer, the module layer being provided with a power module die group and an interface module die group.
  • the power module die group provides power to the first die group and the second die group through the circuit layer, and the first die group and the second die group output calculation results through the interface module die group through the circuit layer.
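As an illustrative aid only (not part of the patent), the three-layer composition and the claimed generation order can be sketched in Python; every class, field, and string name below is hypothetical:

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical model of the claimed three-layer accelerator structure.
# All names are illustrative and do not come from the patent itself.

@dataclass
class CoWUnit:
    first_die_group: str   # e.g. an SoC die group
    second_die_group: str  # e.g. a memory die group

@dataclass
class AcceleratorStructure:
    circuit_layer: str = "redistribution layer (RDL)"  # routes power and I/O
    computing_layer: List[CoWUnit] = field(default_factory=list)
    module_layer: List[str] = field(default_factory=list)

def generate_accelerator(n_units: int) -> AcceleratorStructure:
    """Mirror the claimed order: circuit layer first, then the computing
    layer on one side, then the module layer on the other side."""
    acc = AcceleratorStructure()
    acc.computing_layer = [CoWUnit("SoC", "HBM") for _ in range(n_units)]
    acc.module_layer = ["power module die group", "interface module die group"]
    return acc

acc = generate_accelerator(4)
```

The sketch only captures which die groups live in which layer; the physical routing (bumps, TSVs, solder balls) is described in the embodiments below.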
  • the present invention discloses a computer-readable storage medium on which is stored computer program code for generating an accelerator structure, and when the computer program code is executed by a processing device, the aforesaid method is executed.
  • the present invention discloses a computer program product, including a computer program for generating an accelerator structure, wherein the computer program implements the steps of the aforementioned method when executed by a processor.
  • the present invention discloses a computer device comprising a memory, a processor and a computer program stored on the memory, wherein the processor executes the computer program to implement the steps of the aforementioned method.
  • by integrating CoW units into the InFO_SoW structure, the present invention can significantly improve integration efficiency, meet the requirements of various accelerators for large-scale chip integration, and achieve the technical effect of aggregating very large computing power.
  • FIG. 1 is a cross-sectional view showing InFO_SoW;
  • FIG. 2 is a top view showing an exemplary InFO_SoW;
  • FIG. 3 is a schematic layout diagram showing a CoW unit according to an embodiment of the present invention;
  • FIG. 4 is a schematic layout diagram showing another CoW unit according to an embodiment of the present invention;
  • FIG. 5 is a schematic layout diagram showing another CoW unit according to an embodiment of the present invention;
  • FIG. 6 is a schematic structural diagram showing an exemplary board;
  • FIG. 7 is a structural diagram illustrating an integrated circuit device according to an embodiment of the present invention;
  • FIG. 8 is a cross-sectional view showing an accelerator structure combining CoW with InFO_SoW according to an embodiment of the present invention;
  • FIG. 9 is a cross-sectional view showing an accelerator structure combining CoW with InFO_SoW according to another embodiment of the present invention;
  • FIG. 10 is a schematic diagram illustrating a CoW unit of an embodiment of the present invention;
  • FIG. 11 is a schematic diagram illustrating a CoW unit of another embodiment of the present invention;
  • FIG. 12 is a flowchart illustrating the generation of an accelerator structure according to another embodiment of the present invention;
  • FIG. 13 is a flowchart showing the generation of the first part of the circuit layer according to another embodiment of the present invention;
  • FIG. 14 is a cross-sectional view illustrating the formation of multiple TSVs on a wafer according to another embodiment of the present invention;
  • FIG. 15 is a flowchart showing the generation of the operation layer according to another embodiment of the present invention;
  • FIG. 16 is a cross-sectional view showing a plurality of CoW units mounted on the wafer according to another embodiment of the present invention;
  • FIG. 17 is a cross-sectional view according to another embodiment of the present invention after forming the molding compound;
  • FIG. 18 is a cross-sectional view according to another embodiment of the present invention after chemical mechanical polishing of the molding compound;
  • FIG. 19 is a flowchart illustrating wafer testing according to another embodiment of the present invention;
  • FIG. 20 is a cross-sectional view according to another embodiment of the present invention after flipping the wafer;
  • FIG. 21 is a cross-sectional view according to another embodiment of the present invention after chemical mechanical polishing;
  • FIG. 22 is a cross-sectional view according to another embodiment of the present invention after depositing an insulating layer;
  • FIG. 23 is a cross-sectional view according to another embodiment of the present invention after forming metal pads;
  • FIG. 24 is a schematic diagram showing a 5×5 CoW unit array;
  • FIG. 25 is a cross-sectional view according to another embodiment of the present invention after the CoW dies are attached to the second glass;
  • FIG. 26 is a cross-sectional view according to another embodiment of the present invention after forming the molding compound;
  • FIG. 27 is a cross-sectional view according to another embodiment of the present invention after chemical mechanical polishing;
  • FIG. 28 is a cross-sectional view according to another embodiment of the present invention after completing the entire circuit layer;
  • FIG. 29 is a cross-sectional view according to another embodiment of the present invention after generating the module layer;
  • FIG. 30 is a cross-sectional view according to another embodiment of the present invention after bonding the heat dissipation module;
  • FIG. 31 is a flowchart illustrating the generation of an accelerator structure according to another embodiment of the present invention;
  • FIG. 32 is a cross-sectional view according to another embodiment of the present invention after bonding the heat dissipation module.
  • the term “if” may be interpreted as “when” or “once” or “in response to determining” or “in response to detecting” depending on the context.
  • a wafer is made of pure silicon, generally 6, 8, or 12 inches in diameter, and is the round silicon substrate on which silicon semiconductor integrated circuits are fabricated; various circuit element structures can be processed on the silicon substrate to form integrated circuit products with specific electrical functions. A die is a small, unpackaged integrated circuit body made of semiconductor material; the established functions of the integrated circuit are realized on this small piece of semiconductor.
  • a die is a small square integrated circuit fabricated on the wafer through numerous steps such as photolithography, also known as a bare die; a chip is a die that has been tested, found intact, stable, and functional, then cut from the wafer and packaged into an integrated circuit device with pins that can be electrically connected to other electronic components.
  • InFO_SoW technology is a wafer-level system that integrates integrated fan-out (InFO), power modules, and heat dissipation modules.
  • Figure 1 shows a cross-sectional view of InFO_SoW.
  • InFO_SoW includes a computing layer 11, a circuit layer 12, and a module layer 13.
  • the computing layer 11 is provided with a chip array, and the processing unit 111, the processing unit 112 and the processing unit 113 are exemplarily shown in the figure to realize the system computing function;
  • the circuit layer 12 is a redistribution layer (RDL) for electrically connecting the dies of the computing layer 11 and the module layer 13;
  • the module layer 13 is provided with a power module die group and an interface module die group; the power module die group includes a plurality of power modules 131, which supply power to the chip array of the computing layer 11, and the interface module die group includes a plurality of interface modules 132 serving as input and output interfaces of the chip array of the computing layer 11.
  • the power module die group and the interface module die group are soldered to the InFO wafer using ball grid array (BGA) packaging technology.
  • the other side of the computing layer 11 is assembled with a cooling module 14 to dissipate heat for the chip array of the computing layer 11 .
  • FIG. 2 shows a top view of an exemplary InFO_SoW.
  • the power module die group is a 7×7 array of power modules 131, and the interface module die group includes four interface modules 132, which are respectively located on the sides of the power module array.
  • below the power module die group and the interface module die group is the circuit layer 12, that is, the InFO wafer.
  • the chip array of the computing layer 11 is located under the circuit layer 12 and is hidden by the module layer 13 and the circuit layer 12, so it is not visible.
  • the lowest layer is the cooling module 14 .
  • CoW is an emerging integration technology that packages multiple chips as a single die, achieving a small package volume, low power consumption, and few pins. As CoW technology matures, more and more integrated circuits, especially computation-intensive ones, adopt its manufacturing process.
  • CoW units can be formed by integrating a variety of dies with different functions.
  • the CoW unit includes two types of dies: a first die and a second die. More specifically, the first die is a system on chip (SoC) and the second die is a memory.
  • System on chip refers to the integration of a complete system on a single chip, which is a system or product formed by combining multiple integrated circuits with specific functions on one chip.
  • the memory can be high bandwidth memory (HBM), a high-performance DRAM based on 3D stacking technology, suitable for applications with high memory bandwidth requirements, such as graphics processors and network switching and forwarding equipment (e.g., routers and switches).
  • FIG. 3 shows a schematic diagram of the layout of a CoW unit of this embodiment.
  • this CoW unit includes one system on chip 301 and six memories 302, wherein the system on chip 301 is the aforementioned SoC, arranged at the core of the CoW unit, and the memories 302 are the aforementioned high-bandwidth memories, arranged on both sides of the system on chip 301 with three on each side.
  • FIG. 4 shows a schematic layout diagram of another CoW unit of this embodiment.
  • this CoW unit includes one system on chip 301 and four memories 302, wherein the system on chip 301 is arranged at the core of the CoW unit, and the memories 302 are arranged on both sides of the system on chip 301 with two on each side.
  • FIG. 5 shows a schematic layout diagram of another CoW unit in this embodiment.
  • the CoW unit is formed by arranging two sets of the CoW unit of FIG. 4 side by side.
  • FIG. 6 shows a schematic structural diagram of an exemplary board 60 .
  • the board 60 includes a chip 601, which is the accelerator structure of this embodiment integrated with one or more integrated circuit devices; the integrated circuit device is an artificial intelligence computing unit that supports various deep learning and machine learning algorithms to meet the needs of intelligent processing in complex scenarios in the fields of computer vision, speech, natural language processing, and data mining.
  • deep learning technology is widely used in the field of cloud intelligence.
  • one prominent feature of cloud intelligence applications is the large amount of input data, which places high demands on the storage capacity and computing power of the platform.
  • the board 60 of this embodiment is suitable for cloud intelligence applications, with huge off-chip storage, on-chip storage, and powerful computing capabilities.
  • the chip 601 is connected to an external device 603 through an external interface device 602 .
  • the external device 603 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card or a wifi interface, and the like.
  • the data to be processed can be transmitted to the chip 601 by the external device 603 through the external interface device 602 .
  • the calculation result of the chip 601 can be sent back to the external device 603 via the external interface device 602 .
  • the external interface device 602 may have different interface forms, such as a PCIe interface and the like.
  • the board 60 also includes a storage device 604 for storing data, which includes one or more storage units 605 .
  • the storage device 604 is connected and data transmitted with the control device 606 and the chip 601 through the bus.
  • the control device 606 in the board 60 is configured to regulate the state of the chip 601 .
  • the control device 606 may include a microcontroller (Micro Controller Unit, MCU).
  • FIG. 7 is a configuration diagram showing the integrated circuit device in the chip 601 of this embodiment.
  • the integrated circuit device 70 includes a computing device 701 , an interface device 702 , a processing device 703 and a memory 704 .
  • the computing device 701 is configured to perform user-specified operations and is mainly implemented as a single-core or multi-core intelligent processor for performing deep learning or machine learning calculations; it can interact with the processing device 703 to jointly complete user-specified operations.
  • the interface device 702 is used as an interface for external communication between the computing device 701 and the processing device 703 .
  • the processing device 703 performs basic control including but not limited to data transfer, starting and/or stopping the computing device 701 .
  • the processing device 703 may be one or more types of general-purpose and/or special-purpose processors, such as a central processing unit (CPU), a graphics processing unit (GPU), digital signal processors (DSPs), application specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components, and their number can be determined according to actual needs.
  • the SoC 301 in FIGS. 3 to 5 may be the computing device 701, the processing device 703, or a combination of the computing device 701 and the processing device 703.
  • considering the computing device 701 alone, it can be regarded as having a single-core structure or a homogeneous multi-core structure; when the computing device 701 and the processing device 703 are considered together, the whole is regarded as a heterogeneous multi-core structure.
  • the memory 704 is used to store data to be processed; it is a DDR memory, usually 16 GB or larger, and stores data of the computing device 701 and/or the processing device 703.
  • the memory 704 is the memory 302 , and is used for storing operation data required by the system on chip 301 .
  • FIG. 8 shows a cross-sectional view of the accelerator structure of CoW combined with InFO_SoW of this embodiment.
  • the accelerator structure includes a module layer 801 , a circuit layer 802 , an operation layer 803 and a cooling module 804 .
  • the module layer 801 is provided with a power module die group and an interface module die group.
  • the power module die group includes a plurality of power modules 805 arranged in an array, as shown in FIG. 2.
  • the interface module die group is the interface device 702 , which includes a plurality of interface modules 806 arranged around the power module die group, serving as the input and output interfaces of the CoW unit 807 of the computing layer 803 .
  • the circuit layer 802 is disposed between the operation layer 803 and the module layer 801 , and includes a first redistribution layer 808 , a TSV 809 and a second redistribution layer 810 from bottom to top.
  • the first redistribution layer 808 is electrically connected to each CoW unit 807 through bumps 811; the through-silicon vias 809 are disposed between the first redistribution layer 808 and the second redistribution layer 810 to connect them; the second redistribution layer 810 is located on the TSVs 809 and is electrically connected to the power module die group and the interface module die group in the module layer 801 through solder balls 812.
  • the computing layer 803 is provided with a plurality of CoW units 807, which are also arranged in an array.
  • the CoW unit in this embodiment includes a first die and a second die, wherein the first die is the system on chip 301 and the second die is the memory 302; the SoC 301 and the memory 302 can be arranged in the manner shown in FIGS. 3 to 5 or in other manners.
  • the first redistribution layer 808 is used to electrically connect the system-on-chip 301 and the memory 302 in each CoW unit 807, so the system-on-chip 301 and the memory 302 pass through the first redistribution layer 808, the TSV 809 and the second redistribution layer 810 is electrically connected to the module layer 801 .
  • when the power module die group supplies power to the CoW unit 807, the power signal reaches the SoC 301 and the memory 302 from the power module 805 through the second redistribution layer 810, the TSV 809, and the first redistribution layer 808.
  • the interface module die group in this embodiment is an optical module, specifically an optical fiber module, which converts electrical signals from the system on chip 301 or the memory 302 into optical signals for output.
  • when data is received, it is converted from an optical signal to an electrical signal by the interface module 806 and stored in the memory 302 through the second redistribution layer 810, the through-silicon vias 809, and the first redistribution layer 808.
  • each CoW unit 807 of this embodiment can be electrically connected to an adjacent CoW unit via the first redistribution layer 808, the through-silicon vias 809, and the second redistribution layer 810 to exchange data, so that all the CoW units 807 can be linked and cooperate to form an accelerator with powerful computing power.
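The routing just described reduces to fixed sequences of hops through the circuit layer. A hedged sketch (hop labels are informal strings keyed to FIG. 8's reference numerals, and the unit-to-unit hop list is one plausible reading of the embodiment, not patent wording):

```python
# Power delivery path of FIG. 8, from power module down to the dies.
POWER_PATH = [
    "power module 805",
    "second redistribution layer 810",
    "TSV 809",
    "first redistribution layer 808",
    "SoC 301 / memory 302",
]

# Assumed inter-unit data path: out through the circuit layer and back
# down into the neighboring CoW unit.
UNIT_TO_UNIT = [
    "CoW unit 807 (source)",
    "first redistribution layer 808",
    "TSV 809",
    "second redistribution layer 810",
    "TSV 809",
    "first redistribution layer 808",
    "CoW unit 807 (destination)",
]

def reverse_path(path: list) -> list:
    """Data flowing back traverses the same hops in the opposite order."""
    return list(reversed(path))
```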
  • the heat dissipation module 804 is located under the computing layer 803 and attached to the CoW unit 807 for cooling all the CoW units 807 in the computing layer 803 .
  • the heat dissipation module 804 may be a water-cooled backplane.
  • the backplane has layers of microchannels through which coolant is pumped to remove heat; alternatively, channels may be cut into the silicon underlying the gallium nitride (GaN) layers and widened during the etch process.
  • FIG. 9 shows a cross-sectional view of an accelerator structure of CoW combined with InFO_SoW according to another embodiment of the present invention.
  • the accelerator structure includes a module layer 901, a circuit layer 902, a computing layer 903, and a cooling module 904, wherein the structures of the module layer 901, the computing layer 903, and the cooling module 904 are the same as those in the embodiment of FIG. 8, so they are not repeated.
  • the circuit layer 902 is arranged between the operation layer 903 and the module layer 901, and only includes the first redistribution layer 905 and the second redistribution layer 906, wherein the structure of the first redistribution layer 905 is the same as that of the first redistribution layer 808, The structure of the second redistribution layer 906 is the same as that of the second redistribution layer 810 .
  • the first redistribution layer 905 and the second redistribution layer 906 are directly connected without using TSVs.
  • Such a circuit layer 902 can achieve the same effect as the circuit layer 802, but saves the process of generating TSVs 809.
  • the CoW unit of the present invention is not limited to the single-layer die structure described in the foregoing embodiments; it may also be a multi-layer vertically stacked die group. That is, the CoW unit of the present invention includes a first die group and a second die group, each of which may be either a single-layer die structure or a multi-layer vertically stacked structure.
  • the multi-layer vertically stacked structure is described below.
  • FIG. 10 shows a schematic diagram of the CoW unit of this embodiment. It should be noted that, for convenience of description, this figure is drawn with the circuit layer below the operation layer, rather than above it as shown in FIG. 8 or FIG. 9.
  • the first die group includes a first core layer 1001 and a second core layer 1002.
  • the first core layer 1001 and the second core layer 1002 are vertically stacked together.
  • the first core layer 1001 and the second core layer 1002 in FIG. 10 are drawn visually separated from top to bottom only for convenience of illustration.
  • the CoW unit of this embodiment includes two second die groups, which are single-die memory 1003 , more specifically, high bandwidth memory.
  • the first core layer 1001 includes a first computing region 1011 , a first die-to-die region 1012 and a first TSV 1013 .
  • the first operation area 1011 is formed with a first operation circuit to realize the functions of the computing device 701;
  • the first die-to-die area 1012 is formed with a first transceiver circuit, which serves as the die-to-die interface of the first operation circuit;
  • the first TSVs 1013 are used to realize the electrical interconnection of stacked dies in the three-dimensional integrated circuit.
  • the second core layer 1002 includes a second computing region 1021 , a second die-to-die region 1022 and a second TSV 1023 .
  • the second operation area 1021 is formed with a second operation circuit to realize the functions of the processing device 703;
  • the second die-to-die area 1022 is formed with a second transceiver circuit, which serves as the die-to-die interface of the second operation circuit;
  • the second TSVs 1023 are likewise used to realize the electrical interconnection of stacked dies in the three-dimensional integrated circuit.
  • the first operation area 1011 and the second operation area 1021 are also provided with a memory 1014 and a memory 1024, respectively, for temporarily storing the operation results of the first operation circuit and the second operation circuit.
  • the memory 1014 and the memory 1024 are formed directly in the first operation area 1011 and the second operation area 1021, without conduction through an interposer,
  • the data transmission rate is fast, but the storage space is limited.
  • the first core layer 1001 further includes an input-output area 1015 and a physical area 1016
  • the second core layer 1002 further includes an input-output area 1025 and a physical area 1026 .
  • the input and output area 1015 is formed with input and output circuits, which are used as the interface for the first core layer 1001 to communicate with the outside world.
  • the physical area 1016 has a physical access circuit for the first core layer 1001 to access the off-chip memory
  • the physical area 1026 has a physical access circuit for the second core layer 1002 to access the off-chip memory.
  • the first computing circuit and the second computing circuit perform inter-layer data transmission through the first transceiver circuit and the second transceiver circuit.
  • when the computing device 701 intends to transmit data to the processing device 703, the data reaches the processing device 703 through the following path: the first computing circuit in the first computing area 1011 → the first transceiver circuit in the first die-to-die area 1012 → the first TSV 1013 → the second transceiver circuit in the second die-to-die area 1022 → the second computing circuit in the second computing area 1021; when the processing device 703 intends to transmit data to the computing device 701, the data arrives through the reverse path: the second computing circuit in the second computing area 1021 → the second transceiver circuit in the second die-to-die area 1022 → the first TSV 1013 → the first transceiver circuit in the first die-to-die area 1012 → the first computing circuit in the first computing area 1011.
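The forward and return inter-layer paths above are mirror images of each other; a small self-check sketch (the hop labels are informal strings keyed to FIG. 10's reference numerals, purely illustrative):

```python
# Forward path per FIG. 10: from the computing device 701 (first core
# layer) to the processing device 703 (second core layer).
forward = [
    "first computing circuit (area 1011)",
    "first transceiver circuit (area 1012)",
    "first TSV (1013)",
    "second transceiver circuit (area 1022)",
    "second computing circuit (area 1021)",
]

# The return path traverses the same hops in the opposite order.
reverse = list(reversed(forward))
```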
  • the computing device 701 intends to store data in the memory 1003
  • the calculation result of the computing device 701 is stored in the memory 1003 through the physical area 1016
  • the memory area 1014 transmits the data to the memory 1003 through the physical access circuit.
  • the data reaches the memory 1003 through the following path: the physical access circuit of the physical area 1016 → the first TSV 1013 → the second TSV 1023 → the first redistribution layer 1004 of the circuit layer; when the memory 1003 intends to transmit data to the memory area 1014 for processing by the computing device 701, the data arrives at the memory area 1014 through the reverse of this path.
  • it should be noted that some specific TSVs among the first TSVs 1013 and the second TSVs 1023 are specially designed to conduct the data of the physical access circuits.
  • the processing device 703 intends to store data in the memory 1003
  • the calculation result of the processing device 703 is stored in the memory 1003 through the physical area 1026
  • the memory area 1024 transmits the data to the memory 1003 through the physical access circuit.
  • the data reaches the memory 1003 through the following path: the physical access circuit of the physical area 1026 → the second TSV 1023 → the first redistribution layer 1004 of the circuit layer; when the memory 1003 intends to transmit data to the memory area 1024 for processing by the processing device 703, the data reaches the memory area 1024 through the reverse of this path.
  • the memory area 1014 transmits the data to the first die group of another CoW unit through the input-output circuit. Specifically, the data reaches the other CoW unit through the following path: the input-output circuit of the input-output area 1015 → the first TSV 1013 → the second TSV 1023 → the first redistribution layer 1004 of the circuit layer → the TSV 1005 of the circuit layer → the second redistribution layer 1006 of the circuit layer → the TSV 1005 of the circuit layer → the first redistribution layer 1004 of the circuit layer; when the first die group of another CoW unit intends to transmit data to the memory area 1014, the data arrives at the memory area 1014 through the reverse of this path. It should be noted that some specific TSVs among the first TSVs 1013 and the second TSVs 1023 are specially designed to conduct the data of the input-output circuits.
  • the data in the memory area 1024 reaches the first die group of another CoW unit through the following path: the input-output circuit of the input-output area 1025 → the second TSV 1023 → the first redistribution layer 1004 of the circuit layer → the TSV 1005 of the circuit layer → the second redistribution layer 1006 of the circuit layer → the TSV 1005 of the circuit layer → the first redistribution layer 1004 of the circuit layer; when the first die group of another CoW unit intends to transmit data to the memory area 1024, the data reaches the memory area 1024 through the aforementioned reverse path.
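Each bidirectional path enumerated above has the property that the return route is the forward route traversed in reverse. As a purely illustrative sketch (not part of the patented structure; the hop labels below are informal names for the reference numerals cited above), a path can be modeled as an ordered list of hops whose reversal gives the return path:

```python
# Illustrative model only: hop labels are informal names for the structures
# described above, not identifiers defined by the patent.

FORWARD_PATH = [
    "I/O circuit (area 1025)",
    "second TSV 1023",
    "first redistribution layer 1004",
    "circuit-layer TSV 1005",
    "second redistribution layer 1006",
    "circuit-layer TSV 1005",
    "first redistribution layer 1004",
    "first die group of neighboring CoW unit",
]

def reverse_path(path):
    """Return path for data flowing back from the neighbor to memory area 1024."""
    return list(reversed(path))

back = reverse_path(FORWARD_PATH)
print(" -> ".join(back))
```

The same list-reversal model applies to every forward/reverse path pair described in this embodiment.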
  • the first die group of the computing layer in this embodiment includes a vertically stacked first core layer, second core layer and on-chip memory layer, while the second die group serves as memory.
  • Figure 11 shows a schematic diagram of the CoW unit of this embodiment.
  • the first die group in this embodiment includes a first core layer 1101, a second core layer 1102, and an on-chip memory layer 1103.
  • the first core layer 1101, the second core layer 1102, and the on-chip memory layer 1103 are sequentially arranged from top to bottom.
  • these layers are actually vertically stacked together; they are drawn separated from one another in Fig. 11 only for convenience of illustration.
  • the CoW unit of this embodiment includes two second die groups, each of which is a single-die memory 1104, more specifically a high bandwidth memory.
  • the first core layer 1101 includes a first computing area 1111, which realizes the function of the computing device 701.
  • the first computing area 1111 occupies the logic layer of the first core layer 1101, that is, the top side of the first core layer 1101 in the figure. The first core layer 1101 further includes a first die-to-die region 1112 and first TSVs 1113 in certain regions.
  • the second core layer 1102 includes a second operation area 1121, which realizes the function of the processing device 703.
  • the second operation area 1121 occupies the logic layer of the second core layer 1102, that is, the top side of the second core layer 1102 in the figure. The second core layer 1102 also includes a second die-to-die region 1122 and second TSVs 1123 in certain regions.
  • the first die-to-die region 1112 is vertically opposite to the second die-to-die region 1122 . Its function and effect are the same as those of the foregoing embodiments, so details will not be repeated.
  • the on-chip memory layer 1103 includes a memory area 1131 , a first I/O area 1132 , a second I/O area 1133 , a first physical area 1134 , a second physical area 1135 and a third TSV 1136 .
  • the memory area 1131 is formed with a storage unit for temporarily storing the calculation results of the first operation circuit or the second operation circuit
  • the first input-output area 1132 is formed with a first input-output circuit, which is used as an interface for the first operation circuit to communicate with the outside world
  • the second input-output area 1133 is formed with a second input-output circuit, which is used as an interface for the second operation circuit to communicate with the outside world
  • the first physical area 1134 is formed with a first physical access circuit, which is used to send the calculation results of the first operation circuit stored in the memory area 1131 to the memory 1104
  • the second physical area 1135 is formed with a second physical access circuit, which is used to send the calculation results of the second operation circuit stored in the memory area 1131 to the memory 1104.
  • the third TSVs 1136 extend over the entire on-chip memory layer 1103 , and are only shown on one side for example.
  • the first computing circuit and the second computing circuit perform inter-layer data transmission through the first transceiver circuit and the second transceiver circuit.
  • the data reaches the processing device 703 through the following path: the first computing circuit in the first computing area 1111 → the first transceiver circuit in the first die-to-die area 1112 → the first TSV 1113 → the second transceiver circuit in the second die-to-die area 1122 → the second operation circuit in the second operation area 1121; when the processing device 703 intends to transmit data to the computing device 701, the data reaches the computing device 701 through the aforementioned reverse path.
  • some specific TSVs in the first TSVs 1113 are specially designed to electrically connect the first transceiver circuit and the second transceiver circuit.
  • the memory area 1131 transmits the data to the memory 1104 through the first physical access circuit. Specifically, the data arrives at the memory 1104 through the following path: the first physical access circuit of the first physical area 1134 → the third TSV 1136 → the first redistribution layer 1105 of the circuit layer; when the memory 1104 intends to transfer data to the memory area 1131, the data reaches the memory area 1131 through the aforementioned reverse path.
  • the memory area 1131 transmits the data to the memory 1104 through the second physical access circuit. Specifically, the data arrives at the memory 1104 through the following path: the second physical access circuit of the second physical area 1135 → the third TSV 1136 → the first redistribution layer 1105 of the circuit layer; when the memory 1104 intends to transfer data to the memory area 1131, the data reaches the memory area 1131 through the aforementioned reverse path.
  • some specific TSVs among the third TSVs 1136 are specially designed to electrically conduct the data of the first physical access circuit and the second physical access circuit.
  • the memory area 1131 transmits the data to the first die group of another CoW unit through the first input-output circuit. Specifically, the data reaches the first die group of another CoW unit through the following path: the input-output circuit of the first input-output area 1132 → the third TSV 1136 → the first redistribution layer 1105 of the circuit layer → the TSV 1106 of the circuit layer → the second redistribution layer 1107 of the circuit layer → the TSV 1106 of the circuit layer → the first redistribution layer 1105 of the circuit layer; when the first die group of another CoW unit intends to transmit data to the memory area 1131, the data arrives at the memory area 1131 through the aforementioned reverse path.
  • the memory area 1131 transmits the data to the first die group of another CoW unit through the second input-output circuit. Specifically, the data reaches the first die group of another CoW unit through the following path: the input-output circuit of the second input-output area 1133 → the third TSV 1136 → the first redistribution layer 1105 of the circuit layer → the TSV 1106 of the circuit layer → the second redistribution layer 1107 of the circuit layer → the TSV 1106 of the circuit layer → the first redistribution layer 1105 of the circuit layer; when the first die group of another CoW unit intends to transmit data to the memory area 1131, the data arrives at the memory area 1131 through the aforementioned reverse path.
  • some specific TSVs among the third TSVs 1136 are specially designed to electrically conduct the data of the first and second input-output circuits.
  • the present invention does not limit the number and functions of the vertically stacked dies in the first die group and the second die group; for example, the first die group may also include a first core layer, a first memory layer, a second core layer and a second memory layer stacked from top to bottom, or the first die group may include a first core layer, a first memory layer, a second core layer, a second memory layer, a third memory layer and a fourth memory layer stacked from top to bottom.
  • the system on chip of the present invention can be vertically connected to other systems on chip in the same first die group, and can also be horizontally connected to the systems on chip of the first die groups in other CoW units, so as to build three-dimensionally arranged computing processor cores.
  • the CoW units of the accelerator structure in the above embodiments are arranged in an array, and the technology based on InFO_SoW enables the CoW unit to efficiently cooperate with its surrounding CoW units.
  • a task computed by a neural network model can be handed over to such an accelerator structure for processing
  • the task is divided into multiple subtasks, and each first die group is assigned one subtask
  • during subtask allocation, it can be planned that the CoW units near the center of the array transfer intermediate results to the surrounding CoW units, which accumulate and compute in sequence until the outermost CoW units obtain the calculation result of the entire task, and the calculation result is output directly through the interface module of the interface module die group.
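The center-outward accumulation described above can be sketched as a toy model (assumptions for illustration only: a 5×5 array, sum-reduction as the accumulation step, and rings defined by Chebyshev distance from the center unit):

```python
# Hedged sketch of the subtask scheme: one subtask per first die group (one
# per CoW unit); units near the array center compute first and pass partial
# results outward; the outermost ring yields the final result. Grid size and
# the sum-reduction are assumptions made for this illustration.

N = 5            # e.g., a 5x5 CoW-unit array
center = N // 2

def ring(r, c):
    """Chebyshev distance from the array center: ring 0 is the center unit."""
    return max(abs(r - center), abs(c - center))

def schedule(subtask_results):
    """Accumulate per-unit partial results ring by ring, center outward."""
    total = 0
    for k in range(center + 1):  # rings 0..center
        ring_sum = sum(subtask_results[(r, c)]
                       for r in range(N) for c in range(N)
                       if ring(r, c) == k)
        total += ring_sum  # each outer ring adds its partials to what it received
    return total

results = {(r, c): 1 for r in range(N) for c in range(N)}  # dummy subtask outputs
print(schedule(results))  # 25 units, each contributing 1
```

The point of the sketch is the ordering: inner rings finish before outer rings, so intermediate results only ever flow outward toward the interface module die group.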
  • Another embodiment of the present invention is a method for generating an accelerator structure, more specifically, a method for generating the accelerator structure of the foregoing embodiments.
  • the circuit layer is first generated, and then the operation layer is generated on one side of the circuit layer.
  • the computing layer is provided with a plurality of CoW units, and each CoW unit includes a first die group and a second die group; a module layer is then formed on the other side of the circuit layer, and the module layer is provided with a power module die group and an interface module die group.
  • the power module die group provides power to the first die group and the second die group through the circuit layer, and the first die group and the second die group output calculation results through the circuit layer via the interface module die group.
  • Fig. 12 shows a flowchart of this embodiment.
  • step 1201 the first part of the circuit layer is generated, that is, the first redistribution layer 808 and the through-silicon vias 809 in the circuit layer 802 of FIG. 8 are generated on the InFO wafer. This step is further refined into the flowchart of FIG. 13 .
  • a plurality of TSVs 1402 are formed on a wafer 1401 .
  • Through-silicon via technology is a high-density packaging technology.
  • the vertical electrical interconnection of the through-silicon vias 1402 shortens the interconnect length and reduces signal delay, thereby achieving low-power, high-speed chip-to-chip communication, increased bandwidth and miniaturized device integration.
  • a first redistribution layer 1403 is formed on one side of the plurality of TSVs 1402 .
  • the purpose of the first redistribution layer 1403 is to reroute the contacts of the dies (that is, the output/input terminals of the dies) through a wafer-level metal wiring process and change their positions, so that the dies can be applied to different packaging forms.
  • metal layers and dielectric layers are deposited on the wafer 1401 and corresponding three-dimensional metal wiring patterns are formed, which are used to re-lay out the output/input terminals of the dies for electrical signal conduction, making the die layout more flexible.
  • when designing the first redistribution layer 1403, it is necessary to add vias at the positions where criss-crossing metal wirings with the same electrical characteristics on two adjacent layers overlap, to ensure the electrical connection between the upper and lower layers. The first redistribution layer 1403 therefore realizes the electrical connection among multiple dies through a three-dimensional conductive structure, thereby reducing the layout area.
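The via rule stated above — add a via wherever criss-crossing wirings on adjacent layers with the same electrical characteristics overlap — can be sketched as follows (the wire representation and the net names are assumptions made for this illustration):

```python
# Illustrative sketch: horizontal wires on one metal layer, vertical wires on
# the adjacent layer; a via is placed at each crossing where both wires carry
# the same electrical net. Coordinates and nets are made up for the example.
from typing import NamedTuple

class HWire(NamedTuple):   # horizontal wire on layer i
    y: int; x0: int; x1: int; net: str

class VWire(NamedTuple):   # vertical wire on layer i+1
    x: int; y0: int; y1: int; net: str

def via_sites(hwires, vwires):
    """Return (x, y) crossings where both wires belong to the same net."""
    vias = []
    for h in hwires:
        for v in vwires:
            crosses = h.x0 <= v.x <= h.x1 and v.y0 <= h.y <= v.y1
            if crosses and h.net == v.net:
                vias.append((v.x, h.y))
    return vias

h = [HWire(y=2, x0=0, x1=10, net="VDD"), HWire(y=5, x0=0, x1=10, net="SIG")]
v = [VWire(x=4, y0=0, y1=8, net="VDD"), VWire(x=4, y0=0, y1=8, net="SIG")]
print(via_sites(h, v))  # [(4, 2), (4, 5)]
```

Crossings between wires of different nets get no via, which is exactly why the rule requires "the same electrical characteristics" before connecting the layers.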
  • a plurality of bumps 1404 are generated on the first redistribution layer 1403 .
  • the bumps 1404 are solder balls; commonly used solder-ball processes include evaporation, electroplating, screen printing and needle depositing.
  • the solder balls are not directly connected to the metal lines in the first redistribution layer 1403, but are bridged by under bump metallization (UBM) to improve adhesion, which can be realized by sputtering or electroplating. So far, the first redistribution layer 808 and the TSV 809 in the circuit layer 802 of FIG. 8 have been generated.
  • step 1202 the calculation layer 803 in FIG. 8 is generated on the side of the circuit layer.
  • the computing layer is provided with a plurality of CoW units, and each CoW unit includes a first die group and a second die group. This step is further refined into the flow chart in Figure 15 .
  • a first die group (i.e., a system on chip) and a second die group (i.e., a memory) are provided.
  • the CoW unit of this embodiment includes a first die group and a second die group, wherein the first die group is a SoC 301 , the second die group is a memory 302 , and the memory 302 is a high bandwidth memory.
  • a plurality of CoW units are chip-mounted, wherein the first die group and the second die group electrically contact the plurality of bumps 1404 respectively.
  • the CoW unit 1601 includes a system-on-chip 301 and a memory 302 , the chip is mounted on the first redistribution layer 1403 , and contacts of the system-on-chip 301 and the memory 302 electrically contact the bumps 1404 .
  • the number of die attach CoW units 1601 depends on the size of the wafer 1401 .
  • step 1504 the first die group and the second die group are underfilled.
  • the underfill mainly produces the sealant 1602 through non-contact spray dispensing; the sealant 1602 seals the contacts of the first die group and the second die group and the bumps 1404, preventing electrical interference caused by contact with impurities, so that the contacts and the bumps 1404 have better reliability.
  • step 1505 lamination plastic is generated to cover the plurality of CoW units 1601 .
  • Figure 17 shows the structural diagram after the laminated plastic is produced. As shown in Figure 17, the laminated plastic 1701 covers all the CoW units 1601 to protect the overall structure.
  • step 1506 the lamination plastic 1701 is ground to expose the surface of the plurality of CoW units 1601 .
  • step 1507 the ground surface is chemical mechanical polished (CMP). As shown in FIG. 18, after chemical mechanical polishing of laminated plastic 1701, the surface (top surface) of CoW unit 1601 is exposed to air. At this point, the generation of the operation layer is completed.
  • step 1203 is then performed to perform wafer testing. This step is further refined into the flowchart of FIG. 19 .
  • a first glass is bonded to the surface of the CoW cell 1601 .
  • the wafer 1401 is flipped such that the first glass is located below the wafer 1401 .
  • Fig. 20 shows the structural diagram after flipping. As shown in Fig. 20, the first glass 2001 is attached to the surface of the CoW units 1601 and, after flipping, serves as a base supporting the wafer 1401 and the various semiconductor structures generated on the wafer 1401, including the CoW units 1601, so as to facilitate subsequent processing of the bottom of the wafer 1401 (that is, the top of the wafer 1401 in FIG. 20).
  • step 1903 the wafer 1401 is ground to expose the plurality of TSVs 1402 .
  • step 1904 the lapped wafer is chemically mechanically polished.
  • FIG. 21 shows a cross-sectional view after chemical mechanical polishing. As shown in FIG. 21 , the top surface of the TSV 1402 is exposed outside the wafer 1401 .
  • an insulating layer is deposited on the wafer 1401 and a plurality of TSVs 1402 are exposed.
  • a photomask is used to cover the top surface of the TSV 1402, and then an insulating layer is deposited thereon.
  • the material of the insulating layer may be silicon nitride.
  • Fig. 22 shows the structural diagram after depositing the insulating layer. As shown in Fig. 22, since the photomask covers the top surface of the TSV 1402, after the insulating layer 2201 is deposited, the top surface of the TSV 1402 remains exposed to the air.
  • a plurality of metal points are formed on the insulating layer 2201, and these metal points electrically contact at least one of the plurality of TSVs 1402 to serve as wafer test points for the probes to electrically contact.
  • Fig. 23 shows the structure after the metal points 2301 are generated. As shown in Fig. 23, each TSV 1402 is connected to a metal point 2301, which serves as a wafer test point for the probes of the wafer test to contact.
  • the testability content of the wafer test includes scan test, boundary scan test, memory test, DC/AC test, radio frequency test and other functional tests.
  • the scan test is used to detect the logic functions of the first die group and the second die group;
  • the boundary scan test is used to detect the pin functions of the first die group and the second die group;
  • the memory test is used to test the read/write and storage functions of the various types of memory in the die groups;
  • the DC/AC test includes signal tests of the pins and power pins of the first die group and the second die group, as well as judging whether the DC current and voltage parameters meet the design specifications;
  • the radio frequency test is aimed at a die group in the CoW unit (if that die group is a radio frequency integrated circuit) to detect the logic functions of the radio frequency module; other functional tests are used to detect whether other important or customized functions and performances of the first die group and the second die group meet the design specifications.
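The pass/fail outcomes of the tests listed above can be summarized per CoW unit, as in this hedged sketch (the test names follow the description; the data layout and the all-tests-must-pass binning rule are assumptions made for illustration):

```python
# Illustrative sketch: each CoW unit records a pass/fail result per applicable
# test (scan, boundary scan, memory, DC/AC, RF, functional); a unit is binned
# "P" (pass) only if every test passes, and the overall yield is the fraction
# of passing units. Positions and results below are made-up example data.

def bin_unit(test_results):
    """A CoW unit passes only if all of its tests pass."""
    return all(test_results.values())

def wafer_map(units):
    """Map unit position -> 'P'/'F', plus the overall yield rate."""
    grid = {pos: ("P" if bin_unit(t) else "F") for pos, t in units.items()}
    yield_rate = sum(v == "P" for v in grid.values()) / len(grid)
    return grid, yield_rate

units = {
    (0, 0): {"scan": True, "boundary_scan": True, "memory": True, "dc_ac": True},
    (0, 1): {"scan": True, "boundary_scan": False, "memory": True, "dc_ac": True},
}
grid, rate = wafer_map(units)
print(grid, rate)
```

A map of this kind is what later lets defective CoW units be screened out before the known-good dies are re-assembled on the second glass.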
  • step 1204 is then performed to dice the computing layer and wiring layer in units of CoW units.
  • each diced piece of the computing layer and wiring layer containing one CoW unit is called a CoW die.
  • based on the wafer test results, the CoW dies containing defective CoW units are eliminated, leaving only qualified CoW dies.
  • a plurality of CoW dies are bonded on the second glass.
  • the number and positions of the CoW dies are planned according to the functions and requirements of the accelerator.
  • a 5 ⁇ 5 CoW grain array is set within a range of 300mm ⁇ 300mm, as shown in Figure 24.
  • the CoW dies 2402 are bonded on the second glass 2401 to form a 5×5 CoW unit array.
  • FIG. 25 shows a cross-sectional view of the CoW dies 2402 bonded to the second glass 2401.
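As simple supporting arithmetic for the layout above (die dimensions are not given in the text, so only the cell pitch and cell centers are computed here), a 5×5 array within a 300 mm × 300 mm region implies a 60 mm cell pitch:

```python
# Arithmetic sketch: 300 mm / 5 cells = 60 mm pitch; each die is placed at
# its cell center. This is an illustration of the stated 5x5-in-300mm layout,
# not dimensions taken from the patent figures.

SPAN_MM = 300
N = 5
PITCH = SPAN_MM / N  # 60 mm between cell centers

def cell_centers():
    """(x, y) center of each of the 25 cells, in mm."""
    return [(PITCH * (c + 0.5), PITCH * (r + 0.5))
            for r in range(N) for c in range(N)]

centers = cell_centers()
print(len(centers), centers[0], centers[-1])  # 25 (30.0, 30.0) (270.0, 270.0)
```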
  • step 1206 laminated plastic is generated to cover the CoW dies.
  • Fig. 26 shows the structural diagram after the laminated plastic is produced. As shown in Fig. 26, the laminated plastic 2601 covers all the CoW dies 2402 to protect the overall structure.
  • step 1207 the laminated plastic covering the plurality of CoW dies is ground to expose the surfaces of the plurality of TSVs.
  • the insulating layer 2201 and the metal dots 2301 are removed, so that the surface (top surface) of the TSV 1402 is exposed to the air.
  • step 1208 the ground surface is chemically mechanically polished.
  • Fig. 27 shows a cross-sectional view after chemical mechanical polishing.
  • step 1209 the second part of the circuit layer is generated.
  • a second redistribution layer is formed on the other side of the TSVs to complete the entire circuit layer.
  • FIG. 28 shows a cross-sectional view of the entire wiring layer, and the second redistribution layer 2801 in the figure is the second redistribution layer 810 in FIG. 8 .
  • a module layer is generated on the other side of the circuit layer.
  • solder balls are formed on the second redistribution layer, and then the power module die group and the interface module die group are chip-bonded, with the solder balls electrically connecting the second redistribution layer, the power module die group and the interface module die group.
  • FIG. 29 shows a cross-sectional view after the module layer is generated.
  • the solder balls 2901 (i.e., the solder balls 812 in FIG. 8) electrically connect the second redistribution layer to the power module die group and the interface module die group 806; the power module die group provides power to the first die group and the second die group through the circuit layer, and the first die group and the second die group output calculation results through the circuit layer via the interface module die group.
  • step 1211 the second glass is inverted and removed.
  • step 1212 a heat dissipation module is pasted on the computing layer side.
  • FIG. 30 shows a cross-sectional view of a heat dissipation module 3001 (that is, the heat dissipation module 804 in FIG. 8 ) attached. So far the entire accelerator structure has been completed.
  • step 1213 according to the InFO_SoW technology, the structure in FIG. 30 is packaged to realize a single accelerator chip.
  • FIG. 31 shows a flowchart of this embodiment.
  • the CoW unit of this embodiment also includes a first die group and a second die group, the first die group is the above-mentioned SoC, and the second die group is the above-mentioned memory.
  • a first die group (i.e., a system on chip) and a second die group (i.e., a memory) are provided.
  • a plurality of CoW units are die-attached on the first glass.
  • laminate plastic is generated to cover a plurality of CoW units.
  • the lamination plastic is ground to expose the surface of the plurality of CoW units.
  • the ground surface is chemically mechanically polished.
  • a first redistribution layer is formed on the surface of the CoW unit, wherein the contacts of the first die group and the second die group directly electrically contact the contacts of the first redistribution layer.
  • Wafer testing is then performed.
  • a plurality of metal points are generated on the contacts on the other side of the first redistribution layer, and these metal points electrically contact at least one of the contacts of the first redistribution layer to serve as wafer test points for the probes to contact.
  • step 3109 is then performed to flip the wafer so that the first glass is on top.
  • step 3110 the first glass is removed.
  • step 3111 each CoW die is diced.
  • step 3112 a plurality of qualified CoW dies are bonded on the second glass.
  • step 3113 laminated plastic is generated to cover the CoW dies.
  • step 3114 the laminated plastic covering the plurality of CoW dies is ground to expose the metal points.
  • step 3115 the ground surface is chemically mechanically polished.
  • a second redistribution layer of the circuit layer is generated, and the contacts of the second redistribution layer are electrically connected to metal points to complete the entire circuit layer.
  • step 3117 a module layer is generated on the circuit layer.
  • solder balls are formed on the second redistribution layer, and then the power module die group and the interface module die group are chip-bonded, with the solder balls electrically connecting the second redistribution layer, the power module die group and the interface module die group.
  • the second glass is inverted and removed.
  • the entire accelerator structure is packaged to realize a single accelerator chip.
  • Fig. 32 shows a sectional view of the accelerator structure of this embodiment.
  • the difference from the accelerator structure in FIG. 30 is that, in this embodiment, there are no bumps on the first redistribution layer: the contacts of the first die group and the second die group are directly electrically connected to the contacts of the first redistribution layer, so it is not necessary to underfill the first die group and the second die group with sealant, and laminated plastic is used to cover the CoW units. Furthermore, this embodiment does not generate TSVs in the circuit layer; the first redistribution layer and the second redistribution layer are connected without through-silicon vias, saving the process of generating them.
  • Another embodiment of the present invention is a computer-readable storage medium on which computer program code for generating an accelerator structure is stored; when the computer program code is run by a processing device, the methods described in FIG. 12, FIG. 13, FIG. 15, FIG. 19 and FIG. 31 are performed.
  • Another embodiment of the present invention is a computer program product, including a computer program for generating an accelerator structure, wherein when the computer program is executed by a processor, the steps of the methods described in FIG. 12, FIG. 13, FIG. 15, FIG. 19 and FIG. 31 are realized.
  • Another embodiment of the present invention is a computer device comprising a memory, a processor, and a computer program stored on the memory, wherein the processor executes the computer program to realize the steps of the methods described in FIG. 12, FIG. 13, FIG. 15, FIG. 19 and FIG. 31.
  • this invention integrates CoW technology into InFO_SoW technology to integrate a very large number of chips, and represents a development trend in the chip field, especially in the field of artificial intelligence accelerators.
  • the present invention utilizes the vertical chip-integration capability of CoW technology to stack dies vertically into a die group, and then utilizes SoW technology to spread the die groups in the horizontal direction, so that the processor cores in the die groups (i.e., the aforementioned systems on chip) present a three-dimensional arrangement in this accelerator; each processor core can cooperate with adjacent processors in three dimensions, greatly improving the accelerator's data-processing capability and speed, and achieving the technical effect of integrating super-large computing power.
  • the present invention expresses some methods and their embodiments as a series of actions and combinations thereof, but those skilled in the art can understand that the solution of the present invention is not limited by the order of the described actions . Therefore, according to the disclosure or teaching of the present invention, those skilled in the art can understand that some of the steps can be performed in other order or at the same time. Further, those skilled in the art can understand that the embodiments described in the present invention can be regarded as optional embodiments, that is, the actions or modules involved therein are not necessarily necessary for the realization of one or some solutions of the present invention. In addition, according to different schemes, the description of some embodiments of the present invention also has different emphases. In view of this, those skilled in the art may understand the parts not described in detail in a certain embodiment of the present invention, and may also refer to relevant descriptions of other embodiments.
  • the above-mentioned integrated units may also be implemented in the form of hardware, that is, specific hardware circuits, which may include digital circuits and/or analog circuits.
  • the physical realization of the hardware structure of the circuit may include but not limited to physical devices, and the physical devices may include but not limited to devices such as transistors or memristors.
  • various devices such as computing devices or other processing devices described herein may be implemented by appropriate hardware processors, such as core processors, GPUs, FPGAs, DSPs, and ASICs.
  • the aforementioned storage unit or storage device can be any suitable storage medium (including a magnetic storage medium or magneto-optical storage medium, etc.), for example a resistive random access memory (RRAM), dynamic random access memory (DRAM), static random access memory (SRAM), enhanced dynamic random access memory (EDRAM), high bandwidth memory (HBM), hybrid memory cube (HMC), ROM, RAM, etc.
  • Clause A1 An accelerator structure, comprising: a computing layer provided with a plurality of chip-to-wafer units, each chip-to-wafer unit including a first die group and a second die group; a module layer provided with a power module die group and an interface module die group; and a circuit layer arranged between the computing layer and the module layer; wherein the power module die group provides power to the first die group and the second die group through the circuit layer; and wherein the first die group and the second die group output calculation results through the circuit layer via the interface module die group.
  • Clause A2 The accelerator structure according to Clause A1, further comprising a cooling module, adjacent to the computing layer, configured to dissipate heat from the plurality of chip-to-wafer units.
  • Clause A3 The accelerator structure according to Clause A1, wherein the wiring layer is provided with a first redistribution layer for electrically connecting the first die group and the second die group in each chip-to-wafer unit.
  • Clause A4 The accelerator structure according to Clause A3, wherein the wiring layer is further provided with through-silicon vias and a second redistribution layer, the through-silicon vias being provided between the first redistribution layer and the second redistribution layer, and the first die group and the second die group are electrically connected to the module layer through the first redistribution layer, the through-silicon vias and the second redistribution layer.
  • Clause A5 The accelerator structure of Clause A4, wherein each chip-to-wafer unit is electrically connected to another chip-to-wafer unit via the first redistribution layer, the through-silicon vias and the second redistribution layer.
  • Clause A6 The accelerator structure of Clause A1, wherein said interface module die group converts electrical signals from said first die group or said second die group into optical signals for output.
  • Clause A7 The accelerator structure of Clause A1, wherein the first group of dies is a system on a chip and the second group of dies is a memory.
  • Clause A8 The accelerator structure of Clause A1, wherein said first die group comprises a vertically stacked system-on-chip and on-chip memory, and said second die group is a memory.
  • Clause A9 The accelerator structure of Clause A1, wherein the first group of dies includes a vertically stacked first core layer and a second core layer, the second group of dies being a memory.
  • Clause A10 The accelerator structure of Clause A7, 8 or 9, wherein the memory is a high bandwidth memory.
  • Clause A11 The accelerator structure of Clause A9, wherein the first core layer comprises: a first computing region generated with a first computing circuit; and a first die-group-to-die-group region generated with a first transceiver circuit; the second core layer, including: a second computing area, where a second computing circuit is generated; and a second die group-to-die group area, where a second transceiver circuit is generated; wherein, the first computing circuit And the second computing circuit performs data transmission in the first die group through the first transceiver circuit and the second transceiver circuit.
  • Clause A12 The accelerator structure of Clause A11, wherein the first core layer further comprises a physical area, in which physical access circuits are generated to access the memory.
  • Clause A13 The accelerator structure of Clause A11, wherein the first core layer further comprises an input-output region having input-output circuitry for use as a first die assembly with another chip-to-wafer unit. The interface to connect to.
  • Clause A14 The accelerator structure according to Clause A13, wherein the plurality of chip-to-wafer units are arranged in an array, and a chip-to-wafer unit near the center of the array transfers an intermediate result to surrounding adjacent chip-to-wafer units , for the outermost chip-to-wafer unit to calculate the calculation result, and the calculation result is output through the interface module die group.
  • Clause A17 A method of generating an accelerator structure, comprising: generating a wiring layer; generating a computing layer on one side of the wiring layer, the computing layer being provided with a plurality of CoW units, each CoW unit comprising a first die group and a second die group; and generating a module layer on the other side of the wiring layer, the module layer being provided with a power module die group and an interface die group; wherein the power module die group supplies power to the first die group and the second die group through the wiring layer; and wherein the first die group and the second die group output calculation results via the wiring layer through the interface die group.
  • Clause A18 The method of Clause A17, wherein the step of generating a wiring layer comprises: generating a plurality of through-silicon vias on a wafer; generating a first redistribution layer on one side of the plurality of through-silicon vias; and generating a plurality of bumps on the first redistribution layer.
  • Clause A19 The method of Clause A18, wherein the step of generating a computing layer comprises: die-attaching the plurality of CoW units, wherein the first die group and the second die group respectively make electrical contact with the plurality of bumps.
  • Clause A20 The method of Clause A19, wherein the step of generating a computing layer further comprises: underfilling the first die group and the second die group; and generating molding compound to cover the plurality of CoW units.
  • Clause A21 The method of Clause A20, wherein the step of generating a computing layer further comprises: grinding the molding compound to expose surfaces of the plurality of CoW units; and chemical-mechanical polishing the ground surface.
  • Clause A22 The method of Clause A21, further comprising: performing a wafer test.
  • Clause A23 The method of Clause A22, wherein the step of performing a wafer test comprises: bonding a first glass to the surface; and flipping the wafer.
  • Clause A24 The method of Clause A23, wherein the step of performing a wafer test further comprises: grinding the wafer to expose the plurality of through-silicon vias; and chemical-mechanical polishing the ground wafer.
  • Clause A25 The method of Clause A24, wherein the step of performing a wafer test further comprises: depositing an insulating layer on the wafer while leaving the plurality of through-silicon vias exposed; and generating a plurality of metal points on the insulating layer, the plurality of metal points electrically contacting at least one of the plurality of through-silicon vias to serve as wafer test points.
  • Clause A26 The method of Clause A21, further comprising: dicing the computing layer and the wiring layer in units of the CoW unit to form CoW dies; attaching a plurality of the CoW dies onto a second glass; and generating molding compound to cover the plurality of CoW dies.
  • Clause A27 The method of Clause A26, further comprising: grinding the molding compound covering the plurality of CoW dies to expose surfaces of the plurality of CoW units; and chemical-mechanical polishing the ground surface.
  • Clause A28 The method of Clause A27, wherein the step of generating a wiring layer further comprises: generating a second redistribution layer on the other side of the plurality of through-silicon vias.
  • Clause A29 The method of Clause A28, wherein the step of generating a module layer comprises: forming solder balls on the second redistribution layer; and die-attaching the power module die group and the interface die group; wherein the solder balls electrically connect the second redistribution layer with the power module die group and the interface die group.
  • Clause A30 The method of Clause A29, further comprising: flipping and removing the second glass; and attaching a heat-dissipation module on the computing-layer side.
  • Clause A31 A computer-readable storage medium having stored thereon computer program code for generating an accelerator structure, the computer program code, when run by a processing apparatus, performing the method of any one of Clauses A17 to A30.
  • Clause A32 A computer program product comprising a computer program for generating an accelerator structure, wherein the computer program, when executed by a processor, implements the steps of the method of any one of Clauses A17 to A30.
  • Clause A33 A computer device comprising a memory, a processor, and a computer program stored on the memory, wherein the processor executes the computer program to implement the steps of the method of any one of Clauses A17 to A30.

Abstract

An accelerator structure and device thereof, a method of generating the accelerator structure, and a computer-readable storage medium, computer program product, and computer apparatus therefor. The accelerator structure comprises: a computing layer (803) provided with a plurality of chip-on-wafer units (807), each chip-on-wafer unit (807) comprising a first die group and a second die group; a module layer (801) provided with a power module (805) die group and an interface module (806) die group; and a wiring layer (802) disposed between the computing layer (803) and the module layer (801). The power module (805) die group supplies power to the first die group and the second die group through the wiring layer (802), and the first die group and the second die group output calculation results via the wiring layer (802) through the interface module (806) die group.

Description

Accelerator structure, method of generating an accelerator structure, and device thereof
Cross-reference to related applications
This application claims priority to Chinese patent application No. 202111308266.9, filed on November 5, 2021, entitled "Accelerator structure, method of generating an accelerator structure, and device thereof".
Technical field
The present invention relates generally to the field of semiconductors. More specifically, it relates to an accelerator structure and device thereof, a method of generating the accelerator structure, and a computer-readable storage medium, computer program product, and computer apparatus therefor.
Background
With the rapid development of artificial intelligence, demand for high-performance computing has grown ever stronger. From the recommendation engines used by e-commerce to self-driving cars, daily life has become inseparable from AI solutions, and the rapid expansion of the market has driven exponential growth in computing demand. Statistics show that since 2012 the computing required by deep learning networks has roughly doubled every 3.5 months.
To meet the computing-performance and memory-bandwidth requirements of high-performance computing applications, chiplet-based multi-chip integration schemes have emerged for accelerators of all kinds, from CPUs/GPUs to ASICs. Beyond yield and cost benefits, these new chips require short, dense interconnects to realize chip-to-chip (C2C) IO circuits while keeping power consumption low through advanced packaging technology.
Taiwan Semiconductor Manufacturing Company developed an ultra-large yet compact system solution called Integrated Fan-Out System-on-Wafer (InFO_SoW), which integrates known-good chip arrays with power and heat-dissipation modules for high-performance computing. By serving as the carrier itself, InFO_SoW reduces the use of substrates and printed wiring boards. Densely packing a multi-chip array within one compact system lets the solution obtain wafer-scale benefits such as low-latency chip-to-chip communication, high bandwidth density, and low power-distribution-network (PDN) impedance, and thereby higher computing performance and power efficiency.
However, existing InFO_SoW technology can only integrate multiple discrete chips into a system; such integration efficiency is still insufficient for the massive chip integration that various accelerators demand. A denser chip-integration scheme based on InFO_SoW technology is therefore urgently needed.
Summary of the invention
To at least partially solve the technical problems mentioned in the background, the present invention provides an accelerator structure and device thereof, a method of generating the accelerator structure, and a computer-readable storage medium, computer program product, and computer apparatus therefor.
In one aspect, the present invention discloses an accelerator structure comprising a computing layer, a module layer, and a wiring layer. The computing layer is provided with a plurality of chip-on-wafer (CoW) units, each comprising a first die group and a second die group; the module layer is provided with a power module die group and an interface module die group; and the wiring layer is disposed between the computing layer and the module layer. The power module die group supplies power to the first die group and the second die group through the wiring layer, and the first die group and the second die group output calculation results via the wiring layer through the interface module die group.
In another aspect, the present invention discloses an integrated circuit apparatus comprising the aforementioned accelerator structure, and further discloses a board card comprising the aforementioned integrated circuit apparatus.
In another aspect, the present invention discloses a method of generating an accelerator structure, comprising: generating a wiring layer; generating a computing layer on one side of the wiring layer, the computing layer being provided with a plurality of chip-on-wafer units, each comprising a first die group and a second die group; and generating a module layer on the other side of the wiring layer, the module layer being provided with a power module die group and an interface module die group. The power module die group supplies power to the first die group and the second die group through the wiring layer, and the first die group and the second die group output calculation results via the wiring layer through the interface module die group.
In another aspect, the present invention discloses a computer-readable storage medium storing computer program code for generating an accelerator structure; when the computer program code is run by a processing apparatus, the aforementioned method is performed.
In another aspect, the present invention discloses a computer program product comprising a computer program for generating an accelerator structure, wherein the computer program, when executed by a processor, implements the steps of the aforementioned method.
In another aspect, the present invention discloses a computer apparatus comprising a memory, a processor, and a computer program stored on the memory, wherein the processor executes the computer program to implement the steps of the aforementioned method.
By integrating CoW units into the InFO_SoW structure, the present invention significantly improves integration efficiency, satisfies the massive chip-integration needs of various accelerators, and achieves the technical effect of integrating extremely large computing capability.
Brief description of the drawings
The above and other objects, features, and advantages of exemplary embodiments of the present invention will become readily understood by reading the following detailed description with reference to the accompanying drawings. In the drawings, several embodiments of the present invention are shown by way of example and not limitation, and identical or corresponding reference numerals denote identical or corresponding parts. In the drawings:
Figure 1 is a cross-sectional view of InFO_SoW;
Figure 2 is a top view of an exemplary InFO_SoW;
Figure 3 is a layout diagram of a CoW unit according to an embodiment of the present invention;
Figure 4 is a layout diagram of another CoW unit according to an embodiment of the present invention;
Figure 5 is a layout diagram of another CoW unit according to an embodiment of the present invention;
Figure 6 is a structural diagram of an exemplary board card;
Figure 7 is a structural diagram of an integrated circuit apparatus according to an embodiment of the present invention;
Figure 8 is a cross-sectional view of a CoW-combined-with-InFO_SoW accelerator structure according to an embodiment of the present invention;
Figure 9 is a cross-sectional view of a CoW-combined-with-InFO_SoW accelerator structure according to another embodiment of the present invention;
Figure 10 is a schematic diagram of a CoW unit according to an embodiment of the present invention;
Figure 11 is a schematic diagram of a CoW unit according to another embodiment of the present invention;
Figure 12 is a flowchart of generating an accelerator structure according to another embodiment of the present invention;
Figure 13 is a flowchart of generating the first part of the wiring layer according to another embodiment of the present invention;
Figure 14 is a cross-sectional view after generating a plurality of through-silicon vias on a wafer according to another embodiment of the present invention;
Figure 15 is a flowchart of generating the computing layer according to another embodiment of the present invention;
Figure 16 is a cross-sectional view after die-attaching a plurality of CoW units according to another embodiment of the present invention;
Figure 17 is a cross-sectional view after generating molding compound according to another embodiment of the present invention;
Figure 18 is a cross-sectional view after chemical-mechanical polishing of the molding compound according to another embodiment of the present invention;
Figure 19 is a flowchart of performing a wafer test according to another embodiment of the present invention;
Figure 20 is a cross-sectional view after flipping the wafer according to another embodiment of the present invention;
Figure 21 is a cross-sectional view after chemical-mechanical polishing according to another embodiment of the present invention;
Figure 22 is a cross-sectional view after depositing an insulating layer according to another embodiment of the present invention;
Figure 23 is a cross-sectional view after generating metal points according to another embodiment of the present invention;
Figure 24 is a schematic diagram of a 5×5 CoW unit array;
Figure 25 is a cross-sectional view after attaching CoW dies onto the second glass according to another embodiment of the present invention;
Figure 26 is a cross-sectional view after generating molding compound according to another embodiment of the present invention;
Figure 27 is a cross-sectional view after chemical-mechanical polishing according to another embodiment of the present invention;
Figure 28 is a cross-sectional view after completing the entire wiring layer according to another embodiment of the present invention;
Figure 29 is a cross-sectional view after generating the module layer according to another embodiment of the present invention;
Figure 30 is a cross-sectional view after attaching the heat-dissipation module according to another embodiment of the present invention;
Figure 31 is a flowchart of generating an accelerator structure according to another embodiment of the present invention; and
Figure 32 is a cross-sectional view after attaching the heat-dissipation module according to another embodiment of the present invention.
Detailed description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some rather than all of the embodiments of the invention. All other embodiments obtained by those skilled in the art based on these embodiments without creative effort fall within the protection scope of the invention.
It should be understood that the terms "first", "second", "third", and "fourth" in the claims, description, and drawings of the present invention are used to distinguish different objects rather than to describe a particular order. The terms "comprise" and "include" used in the description and claims indicate the presence of the described features, wholes, steps, operations, elements, and/or components, but do not exclude the presence or addition of one or more other features, wholes, steps, operations, elements, components, and/or collections thereof.
It should also be understood that the terminology used in this description is for the purpose of describing particular embodiments only and is not intended to limit the invention. As used in the description and claims, the singular forms "a", "an", and "the" are intended to include the plural forms unless the context clearly indicates otherwise. It should further be understood that the term "and/or" used in the description and claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in this description and the claims, the term "if" may be interpreted, depending on context, as "when", "once", "in response to determining", or "in response to detecting".
In this description, a wafer refers to the circular silicon substrate, made of pure silicon and commonly supplied in 6-inch, 8-inch, or 12-inch sizes, used for fabricating silicon semiconductor integrated circuits; various circuit element structures can be fabricated on the silicon substrate to form integrated-circuit products with specific electrical functions. A die is a small, unpackaged block of integrated circuit made of semiconductor material, on which the intended functions of that circuit are realized; dies are the small square pieces of integrated circuit fabricated in large batches on a wafer through photolithography and other steps, also called bare dies. A chip is the integrated-circuit device obtained by testing, cutting out the intact, stable, and fully functional dies, and packaging them with pins for electrical connection to other electronic components.
InFO_SoW is a wafer-scale system technology integrating integrated fan-out (InFO) packaging, power modules, and heat-dissipation modules. Figure 1 shows a cross-sectional view of InFO_SoW, which includes a computing layer 11, a wiring layer 12, and a module layer 13. The computing layer 11 is provided with a chip array, exemplified in the figure by processing units 111, 112, and 113, to realize the system's computing functions. The wiring layer 12 is a redistribution layer (RDL) for electrically connecting the dies of the computing layer 11 and the module layer 13. The module layer 13 is provided with a power module die group and an interface module die group: the power module die group comprises a plurality of power modules 131 that supply power to the chip array of the computing layer 11, and the interface module die group comprises a plurality of interface modules 132 that serve as the input-output interfaces of the chip array. The power module die group and the interface module die group are soldered onto the InFO wafer using ball-grid-array (BGA) packaging. A heat-dissipation module 14 is assembled on the other side of the computing layer 11 to dissipate heat from its chip array.
Figure 2 shows a top view of an exemplary InFO_SoW. The power module die group is a 7×7 array of power modules 131, and the interface module die group comprises 4 interface modules 132 located at the sides of the power-module array. Below the power module and interface module die groups is the wiring layer 12, i.e., the InFO wafer. The chip array of the computing layer 11 lies under the wiring layer 12, hidden by the module layer 13 and the wiring layer 12, and hence is not visible. The lowest layer is the heat-dissipation module 14.
CoW is an emerging integrated production technology that packages multiple chips as if they were one die, achieving small package volume, low power consumption, and few pins. As CoW technology matures, more and more integrated circuits, especially those for complex computation, adopt its process.
One embodiment of the present invention is an accelerator structure that integrates CoW units into InFO_SoW. A CoW unit may be composed of dies with a variety of functions; for convenience of explanation, in this embodiment the CoW unit comprises two kinds of dies: a first die and a second die. More specifically, the first die is a system-on-chip (SoC) and the second die is a memory.
A system-on-chip integrates a complete system on a single chip: a system or product formed by combining multiple integrated circuits with specific functions on one chip. System-on-integrated-chips (SoIC) is a multi-chip stacking technology that enables CoW bonding. The memory may be high-bandwidth memory (HBM), a high-performance DRAM built on a 3D stacking process and suited to applications with high memory-bandwidth demand, such as graphics processors and network switching and forwarding equipment (e.g., routers and switches).
Figure 3 shows a layout diagram of one CoW unit of this embodiment, comprising 1 SoC 301 and 6 memories 302, where the SoC 301 is the aforementioned system-on-chip placed at the core of the CoW unit, and the memories 302 are the aforementioned HBM, laid out on both sides of the SoC 301 with 3 memories per side. Figure 4 shows another CoW unit layout of this embodiment, comprising 1 SoC 301 and 4 memories 302, with the SoC 301 at the core and 2 memories per side. Figure 5 shows another CoW unit layout of this embodiment, formed by arranging 2 groups of the Figure 4 CoW unit. SoC and memory layouts are varied; the above are merely examples, and the present invention does not limit the kind, number, or layout of the dies in a CoW unit.
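For illustration only, the three layout variants above can be captured as a tiny data model. This is a hypothetical sketch; the names `CoWLayout` and `hbm_per_side` are invented here and are not from the patent:

```python
# Hypothetical sketch of the CoW-unit layouts of Figures 3 to 5.
from dataclasses import dataclass

@dataclass(frozen=True)
class CoWLayout:
    socs: int           # SoC dies at the core of the unit
    hbm_per_side: int   # HBM dies on each side of every SoC

    @property
    def memories(self) -> int:
        # each SoC has HBM on both of its two sides
        return self.socs * 2 * self.hbm_per_side

fig3 = CoWLayout(socs=1, hbm_per_side=3)   # 1 SoC + 6 HBM
fig4 = CoWLayout(socs=1, hbm_per_side=2)   # 1 SoC + 4 HBM
fig5 = CoWLayout(socs=2, hbm_per_side=2)   # two Figure-4 groups side by side

print(fig3.memories, fig4.memories, fig5.memories)  # 6 4 8
```

The patent itself leaves the die kind, count, and layout open; this model merely makes the three illustrated examples checkable.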
The accelerator structure of this embodiment can be mounted on a board card. Figure 6 shows the structure of an exemplary board card 60. As shown, the board card 60 includes a chip 601, which is the accelerator structure of this embodiment, integrating one or more integrated circuit apparatuses. The integrated circuit apparatus is an artificial-intelligence computing unit supporting various deep-learning and machine-learning algorithms, meeting the intelligent-processing needs of complex scenarios in computer vision, speech, natural-language processing, data mining, and other fields. Deep-learning technology in particular is widely applied in cloud intelligence; one notable characteristic of cloud intelligence applications is the large volume of input data, which places high demands on the platform's storage and computing capability. The board card 60 of this embodiment is suited to cloud intelligence applications, with massive off-chip storage, on-chip storage, and powerful computing capability.
The chip 601 is connected to an external device 603 through an external interface apparatus 602. The external device 603 may be, for example, a server, computer, camera, display, mouse, keyboard, network card, or WiFi interface. Data to be processed can be passed from the external device 603 to the chip 601 through the external interface apparatus 602, and computation results of the chip 601 can be sent back to the external device 603 via the external interface apparatus 602. Depending on the application scenario, the external interface apparatus 602 may take different interface forms, such as a PCIe interface.
The board card 60 also includes a storage device 604 for storing data, comprising one or more storage units 605. The storage device 604 connects to and exchanges data with a control device 606 and the chip 601 over a bus. The control device 606 on the board card 60 is configured to regulate the state of the chip 601; to this end, in one application scenario, the control device 606 may include a micro controller unit (MCU).
Figure 7 is a structural diagram of the integrated circuit apparatus in the chip 601 of this embodiment. As shown in Figure 7, the integrated circuit apparatus 70 includes a computing apparatus 701, an interface apparatus 702, a processing apparatus 703, and a memory 704.
The computing apparatus 701 is configured to perform user-specified operations and is mainly implemented as a single-core or multi-core intelligent processor for deep-learning or machine-learning computation; it can interact with the processing apparatus 703 to jointly complete the user-specified operations.
The interface apparatus 702 serves as the interface through which the computing apparatus 701 and the processing apparatus 703 communicate externally.
The processing apparatus 703, as a general-purpose processing apparatus, performs basic control including but not limited to data transfer and starting and/or stopping the computing apparatus 701. Depending on the implementation, the processing apparatus 703 may be one or more types of processor among central processing units (CPU), graphics processing units (GPU), or other general-purpose and/or special-purpose processors, including but not limited to digital signal processors (DSP), application-specific integrated circuits (ASIC), field-programmable gate arrays (FPGA), other programmable logic devices, discrete gate or transistor logic devices, and discrete hardware components; their number can be determined according to actual needs.
The SoC 301 of Figures 3 to 5 may be the computing apparatus 701, the processing apparatus 703, or the two combined. Taken alone, the computing apparatus 701 can be regarded as having a single-core or homogeneous multi-core structure; when the computing apparatus 701 and the processing apparatus 703 are considered together, the whole is regarded as a heterogeneous multi-core structure.
The memory 704 stores the data to be processed; it is DDR memory, typically 16 GB or larger, used to save the data of the computing apparatus 701 and/or the processing apparatus 703. The memory 704 is the memory 302, which stores the operational data needed by the SoC 301.
Figure 8 shows a cross-sectional view of the CoW-combined-with-InFO_SoW accelerator structure of this embodiment. As shown in Figure 8, this accelerator structure includes a module layer 801, a wiring layer 802, a computing layer 803, and a heat-dissipation module 804.
The module layer 801 is provided with a power module die group and an interface module die group. The power module die group comprises a plurality of power modules 805, arranged in an array as in Figure 2, supplying power to the CoW units of the computing layer 803. The interface module die group is the interface apparatus 702 and comprises a plurality of interface modules 806 arranged around the power module die group, serving as the input-output interface of the CoW units 807 of the computing layer 803.
The wiring layer 802 is disposed between the computing layer 803 and the module layer 801 and comprises, from bottom to top, a first redistribution layer 808, through-silicon vias 809, and a second redistribution layer 810. The first redistribution layer 808 is electrically connected to each CoW unit 807 through bumps 811; the through-silicon vias 809 are disposed between the first redistribution layer 808 and the second redistribution layer 810 to interconnect the two; and the second redistribution layer 810, above the through-silicon vias 809, is electrically connected to the power module die group and the interface module die group of the module layer 801 through solder balls 812.
The computing layer 803 is provided with a plurality of CoW units 807, likewise arranged in an array. As described above, the CoW unit of this embodiment comprises a first die and a second die, the first die being the SoC 301 and the second die the memory 302; the SoC 301 and memories 302 may be arranged as shown in Figures 3 to 5 or otherwise.
The first redistribution layer 808 electrically connects the SoC 301 and memories 302 within each CoW unit 807, so the SoC 301 and memories 302 are electrically connected to the module layer 801 via the first redistribution layer 808, the through-silicon vias 809, and the second redistribution layer 810. When the power module die group powers a CoW unit 807, the power signal travels from the power module 805 through the second redistribution layer 810, the through-silicon vias 809, and the first redistribution layer 808 to the SoC 301 and memory 302. When a CoW unit 807 produces a calculation result to be output, the result travels from the SoC 301 or memory 302 through the first redistribution layer 808, the through-silicon vias 809, and the second redistribution layer 810 to an interface module 806, which then outputs it from the system. Because artificial-intelligence chips exchange enormous volumes of data, the interface module die group of this embodiment is an optical module, specifically an optical-fiber module, converting electrical signals from the SoC 301 or memory 302 into optical signals for output. When a CoW unit 807 needs to load data from outside the system, the data are converted by the interface module 806 from optical to electrical signals and stored in the memory 302 through the second redistribution layer 810, the through-silicon vias 809, and the first redistribution layer 808.
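The power-in and result-out paths just described can be modeled as ordered hops through the Figure 8 stack. The reference numerals follow the text, but the list-based path model below is only an illustrative sketch, not part of the patent:

```python
# Illustrative trace of the Figure-8 signal paths.
power_in = ["power module 805", "RDL 810", "TSV 809", "RDL 808",
            "SoC 301 / HBM 302"]
result_out = ["SoC 301 / HBM 302", "RDL 808", "TSV 809", "RDL 810",
              "interface module 806"]  # electrical signal converted to optical here

# An outbound result crosses the same wiring-layer stack as inbound power,
# only in the opposite order.
assert result_out[1:-1] == power_in[1:-1][::-1]
print(" -> ".join(result_out))
```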
Moreover, each CoW unit 807 of this embodiment can be electrically connected to an adjacent CoW unit via the first redistribution layer 808, the through-silicon vias 809, and the second redistribution layer 810 to exchange data, so that all CoW units 807 can work in concert to form an accelerator with formidable computing power.
The heat-dissipation module 804 lies below the computing layer 803, attached to the CoW units 807, to dissipate heat from all CoW units 807 in the computing layer 803. The heat-dissipation module 804 may be a water-cooled backplane: a plate with micro-channel layers through which a pump circulates coolant to carry heat away. Alternatively, gallium nitride (GaN) may be used to cut into the silicon below, the channels being widened during etching and the original gaps in the GaN layer filled with copper; coolant lines are designed beneath these channels, and the copper helps conduct heat to the coolant.
Figure 9 shows a cross-sectional view of a CoW-combined-with-InFO_SoW accelerator structure of another embodiment of the present invention. As shown in Figure 9, this accelerator structure includes a module layer 901, a wiring layer 902, a computing layer 903, and a heat-dissipation module 904, where the structures of the module layer 901, computing layer 903, and heat-dissipation module 904 are the same as those of the corresponding elements in the Figure 8 embodiment and are not repeated here.
The wiring layer 902 is disposed between the computing layer 903 and the module layer 901 and comprises only a first redistribution layer 905 and a second redistribution layer 906, the first redistribution layer 905 having the same structure as the first redistribution layer 808 and the second redistribution layer 906 the same structure as the second redistribution layer 810. The first redistribution layer 905 and the second redistribution layer 906 are directly connected, without through-silicon vias; such a wiring layer 902 achieves the same effect as the wiring layer 802 while saving the process steps for generating the through-silicon vias 809.
The CoW unit of the present invention is not limited to the single-layer die structure of the preceding embodiments; it may also be a multi-layer, vertically stacked die group. That is, the CoW unit of the present invention comprises a first die group and a second die group, each of which may be a single-layer die structure or a multi-layer, vertically stacked structure. The multi-layer vertically stacked structure is described below.
Another embodiment of the present invention is likewise a CoW-combined-with-InFO_SoW accelerator structure, differing from the preceding embodiments in that the first die group of its CoW unit comprises a vertically stacked first core layer and second core layer, while the second die group is a memory. Figure 10 shows a schematic diagram of the CoW unit of this embodiment. Note in particular that, for ease of explanation, this figure is viewed with the wiring layer below the computing layer rather than above it as in Figures 8 and 9.
The first die group comprises a first core layer 1001 and a second core layer 1002; in reality the first core layer 1001 and second core layer 1002 are vertically stacked together, and their visual separation in Figure 10 is only for ease of explanation. The CoW unit of this embodiment includes 2 second die groups, which are single-die memories 1003, more specifically high-bandwidth memories.
The first core layer 1001 comprises a first computing region 1011, a first die-to-die region 1012, and first through-silicon vias 1013. The first computing region 1011 has a first computing circuit formed in it to realize the functions of the computing apparatus 701; the first die-to-die region 1012 has a first transceiver circuit formed in it, serving as the die-to-die interface of the first computing circuit; and the first through-silicon vias 1013 provide electrical interconnection of stacked dies in a three-dimensional integrated circuit. The second core layer 1002 comprises a second computing region 1021, a second die-to-die region 1022, and second through-silicon vias 1023. The second computing region 1021 has a second computing circuit formed in it to realize the functions of the processing apparatus 703; the second die-to-die region 1022 has a second transceiver circuit formed in it, serving as the die-to-die interface of the second computing circuit; and the second through-silicon vias 1023 likewise provide electrical interconnection of stacked dies.
In this embodiment, the first computing region 1011 and the second computing region 1021 also have a memory 1014 and a memory 1024 formed in them, respectively, to buffer the results of the first and second computing circuits. The memories 1014 and 1024 sit directly inside the computing regions 1011 and 1021 and need no interposer for conduction; their data-transfer rate is fast, but their storage space is limited.
The first core layer 1001 further comprises an input-output region 1015 and a physical region 1016, and the second core layer 1002 further comprises an input-output region 1025 and a physical region 1026. The input-output region 1015 has an input-output circuit formed in it, serving as the external interface of the first core layer 1001, and the input-output region 1025 has an input-output circuit formed in it, serving as the external interface of the second core layer 1002. The physical region 1016 has a physical access circuit formed in it, serving as the interface through which the first core layer 1001 accesses off-chip memory, and the physical region 1026 has a physical access circuit formed in it, serving as the interface through which the second core layer 1002 accesses off-chip memory.
When the computing apparatus 701 and the processing apparatus 703 exchange data, the first computing circuit and the second computing circuit perform inter-layer data transfer through the first and second transceiver circuits. Specifically, data reach the processing apparatus 703 along the path: first computing circuit of the first computing region 1011 → first transceiver circuit of the first die-to-die region 1012 → first through-silicon vias 1013 → second transceiver circuit of the second die-to-die region 1022 → second computing circuit of the second computing region 1021. When the processing apparatus 703 transmits data to the computing apparatus 701, the data travel the reverse path: second computing circuit of the second computing region 1021 → second transceiver circuit of the second die-to-die region 1022 → first through-silicon vias 1013 → first transceiver circuit of the first die-to-die region 1012 → first computing circuit of the first computing region 1011.
When the computing apparatus 701 stores data to the memory 1003, its calculation results are stored to the memory 1003 through the physical region 1016; the memory region 1014 transfers the data to the memory 1003 through the physical access circuit. Specifically, data reach the memory 1003 along the path: physical access circuit of the physical region 1016 → first through-silicon vias 1013 → second through-silicon vias 1023 → first redistribution layer 1004 of the wiring layer. When the memory 1003 transmits data to the memory region 1014 for processing by the computing apparatus 701, the data travel the reverse path to the memory region 1014. Note that certain specific vias among the first through-silicon vias 1013 and second through-silicon vias 1023 are dedicated to electrically conducting the data of the physical access circuit.
When the processing apparatus 703 stores data to the memory 1003, its calculation results are stored to the memory 1003 through the physical region 1026; the memory region 1024 transfers the data to the memory 1003 through the physical access circuit. Specifically, data reach the memory 1003 along the path: physical access circuit of the physical region 1026 → second through-silicon vias 1023 → first redistribution layer 1004 of the wiring layer. When the memory 1003 transmits data to the memory region 1024 for processing by the processing apparatus 703, the data travel the reverse path to the memory region 1024.
When the calculation results of the computing apparatus 701 need to be exchanged with the first die group of another CoW unit in the computing layer, the memory region 1014 transfers the data to that first die group through the input-output circuit. Specifically, data reach the other CoW unit along the path: input-output circuit of the input-output region 1015 → first through-silicon vias 1013 → second through-silicon vias 1023 → first redistribution layer 1004 of the wiring layer → through-silicon vias 1005 of the wiring layer → second redistribution layer 1006 of the wiring layer → through-silicon vias 1005 of the wiring layer → first redistribution layer 1004 of the wiring layer. When the first die group of the other CoW unit transmits data to the memory region 1014, the data travel the reverse path to the memory region 1014. Note that certain specific vias among the first through-silicon vias 1013 and second through-silicon vias 1023 are dedicated to electrically conducting the data of the input-output circuit.
When the calculation results of the processing apparatus 703 need to be exchanged with the first die group of another CoW unit, the data of the memory region 1024 reach that first die group along the path: input-output circuit of the input-output region 1025 → second through-silicon vias 1023 → first redistribution layer 1004 of the wiring layer → through-silicon vias 1005 of the wiring layer → second redistribution layer 1006 of the wiring layer → through-silicon vias 1005 of the wiring layer → first redistribution layer 1004 of the wiring layer. When the first die group of the other CoW unit transmits data to the memory region 1024, the data travel the reverse path to the memory region 1024.
Another embodiment of the present invention is likewise a CoW-combined-with-InFO_SoW accelerator structure, in which the first die group of the computing layer comprises a vertically stacked first core layer, second core layer, and memory layer, while the second die group is a memory. Figure 11 shows a schematic diagram of the CoW unit of this embodiment.
The first die group of this embodiment comprises a first core layer 1101, a second core layer 1102, and an on-chip memory layer 1103; in reality these are vertically stacked, in that order from top to bottom, and the visual separation of the layers in Figure 11 is only for ease of explanation. The CoW unit of this embodiment includes 2 second die groups, which are single-die memories 1104, more specifically high-bandwidth memories.
The first core layer 1101 comprises a first computing region 1111, realizing the functions of the computing apparatus 701; the first computing region 1111 fills the logic layer of the first core layer 1101, i.e., the top side of the first core layer 1101 in the figure, and in specific areas the first core layer 1101 further comprises a first die-to-die region 1112 and first through-silicon vias 1113. The second core layer 1102 comprises a second computing region 1121, realizing the functions of the processing apparatus 703; the second computing region 1121 fills the logic layer of the second core layer 1102, i.e., its top side in the figure, and in specific areas the second core layer 1102 further comprises a second die-to-die region 1122 and second through-silicon vias 1123. The first die-to-die region 1112 and the second die-to-die region 1122 are positioned one directly above the other. Their functions and roles are the same as in the preceding embodiment and are not repeated.
The on-chip memory layer 1103 comprises a memory region 1131, a first input-output region 1132, a second input-output region 1133, a first physical region 1134, a second physical region 1135, and third through-silicon vias 1136. The memory region 1131 has storage cells formed in it to buffer the results of the first or second computing circuit; the first input-output region 1132 has a first input-output circuit formed in it, serving as the external interface of the first computing circuit; the second input-output region 1133 has a second input-output circuit formed in it, serving as the external interface of the second computing circuit; the first physical region 1134 has a first physical access circuit formed in it, to send the first computing circuit's results stored in the memory region 1131 to the memory 1104; and the second physical region 1135 has a second physical access circuit formed in it, to send the second computing circuit's results stored in the memory region 1131 to the memory 1104. The third through-silicon vias 1136 are spread throughout the on-chip memory layer 1103 and are shown on one side only, by way of example.
When the computing apparatus 701 and the processing apparatus 703 exchange data, the first computing circuit and the second computing circuit perform inter-layer data transfer through the first and second transceiver circuits. Specifically, data reach the processing apparatus 703 along the path: first computing circuit of the first computing region 1111 → first transceiver circuit of the first die-to-die region 1112 → first through-silicon vias 1113 → second transceiver circuit of the second die-to-die region 1122 → second computing circuit of the second computing region 1121. When the processing apparatus 703 transmits data to the computing apparatus 701, the data travel the reverse path to the computing apparatus 701. Note that certain specific vias among the first through-silicon vias 1113 are dedicated to electrically connecting the first transceiver circuit and the second transceiver circuit.
When the calculation results of the computing apparatus 701 (buffered in the memory region 1131) need to be stored to the memory 1104, the memory region 1131 transfers the data to the memory 1104 through the first physical access circuit. Specifically, data reach the memory 1104 along the path: first physical access circuit of the first physical region 1134 → third through-silicon vias 1136 → first redistribution layer 1105 of the wiring layer. When the memory 1104 transmits data to the memory region 1131 for processing by the computing apparatus 701, the data travel the reverse path to the memory region 1131.
When the calculation results of the processing apparatus 703 (buffered in the memory region 1131) need to be stored to the memory 1104, the memory region 1131 transfers the data to the memory 1104 through the second physical access circuit. Specifically, data reach the memory 1104 along the path: second physical access circuit of the second physical region 1135 → third through-silicon vias 1136 → first redistribution layer 1105 of the wiring layer. When the memory 1104 transmits data to the memory region 1131 for processing by the processing apparatus 703, the data travel the reverse path to the memory region 1131.
Note that certain specific vias among the third through-silicon vias 1136 are dedicated to electrically conducting the data of the first and second physical access circuits.
When the calculation results of the computing apparatus 701 need to be exchanged with the first die group of another CoW unit, the memory region 1131 transfers the data to that first die group through the first input-output circuit. Specifically, data reach the other CoW unit's first die group along the path: input-output circuit of the first input-output region 1132 → third through-silicon vias 1136 → first redistribution layer 1105 of the wiring layer → through-silicon vias 1106 of the wiring layer → second redistribution layer 1107 of the wiring layer → through-silicon vias 1106 of the wiring layer → first redistribution layer 1105 of the wiring layer. When the first die group of the other CoW unit exchanges data with the computing apparatus 701, the data travel the reverse path to the memory region 1131.
When the calculation results of the processing apparatus 703 need to be exchanged with the first die group of another CoW unit, the memory region 1131 transfers the data to that first die group through the second input-output circuit. Specifically, data reach the other CoW unit's first die group along the path: input-output circuit of the second input-output region 1133 → third through-silicon vias 1136 → first redistribution layer 1105 of the wiring layer → through-silicon vias 1106 of the wiring layer → second redistribution layer 1107 of the wiring layer → through-silicon vias 1106 of the wiring layer → first redistribution layer 1105 of the wiring layer. When the first die group of the other CoW unit exchanges data with the processing apparatus 703, the data travel the reverse path to the memory region 1131.
Note that certain specific vias among the third through-silicon vias 1136 are dedicated to electrically conducting the data of the first and second input-output circuits.
The present invention does not limit the number or function of the vertically stacked dies in the first and second die groups. For example, the first die group may also comprise, stacked from top to bottom, a first core layer, a first memory layer, a second core layer, and a second memory layer, or a first core layer, a first memory layer, a second core layer, a second memory layer, a third memory layer, and a fourth memory layer. Based on the description of the preceding embodiments, those skilled in the art can, without creative effort, understand the electrical relationships of the various combinations of the first and second die groups, which are therefore not elaborated.
From the above description, the system-on-chip of the present invention can communicate vertically with other systems-on-chip within the first die group and horizontally with the systems-on-chip of the first die groups in other CoW units, laying out a three-dimensional array of computing processor cores.
The CoW units of the accelerator structures of the embodiments above are arranged in an array, and the InFO_SoW-based technology lets each CoW unit cooperate efficiently with its neighboring CoW units. In general, one task of a neural-network model computation is handed to one such accelerator structure for processing: the task is first split into multiple sub-tasks, and each first die group is assigned one sub-task. When allocating sub-tasks, the CoW units near the center of the array can be scheduled to pass intermediate results to their neighboring CoW units, accumulating in turn, until the outermost CoW units compute the result of the whole task, which is output directly through the interface modules of the interface module die group. As shown in Figure 2, since the interface modules 132 sit at the outside of the accelerator structure, as intermediate results accumulate outward from the array center, the outermost CoW units finally obtain the task's calculation result and output it directly through the immediately adjacent interface modules 132. Such task scheduling makes the data-transfer paths leaner and more efficient.
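As an illustrative sketch of this center-outward scheduling (the 5×5 grid size and the one-sub-task-per-unit model are assumptions made here, not the patent's algorithm), a toy simulation shows partial results accumulating ring by ring until the outermost units hold the full result:

```python
# Toy sketch: partial sums propagate ring by ring from the array centre
# to the outermost CoW units, which hold the final accumulated result.
N = 5
center = N // 2

def ring(r, c):
    """Chebyshev distance from the array centre, i.e. which ring a unit is on."""
    return max(abs(r - center), abs(c - center))

# Every CoW unit is assigned one sub-task whose partial result is 1.
acc = [[1.0 for _ in range(N)] for _ in range(N)]

# Each ring forwards its accumulated value outward to the adjacent outer ring.
for k in range(center):
    inner_total = sum(acc[r][c] for r in range(N) for c in range(N)
                      if ring(r, c) == k)
    outer = [(r, c) for r in range(N) for c in range(N) if ring(r, c) == k + 1]
    for r, c in outer:               # spread the inner partial result evenly
        acc[r][c] += inner_total / len(outer)

outermost = [(r, c) for r in range(N) for c in range(N) if ring(r, c) == center]
result = sum(acc[r][c] for r, c in outermost)
print(result)   # 25.0, i.e. all 25 sub-task results now sit at the array edge
```

The outermost units, which sit next to the interface modules 132, end up holding the whole task's result, mirroring the short output path the text describes.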
Another embodiment of the present invention is a method of generating an accelerator structure, more specifically, a method of generating the accelerator structures of the preceding embodiments. This embodiment first generates a wiring layer, then generates on one side of the wiring layer a computing layer provided with a plurality of CoW units, each comprising a first die group and a second die group, and generates on the other side of the wiring layer a module layer provided with a power module die group and an interface module die group. The power module die group supplies power to the first die group and the second die group through the wiring layer, and the first die group and the second die group output calculation results via the wiring layer through the interface module die group. Figure 12 shows the flowchart of this embodiment.
In step 1201, the first part of the wiring layer is generated, i.e., the first redistribution layer 808 and through-silicon vias 809 of the wiring layer 802 of Figure 8 are generated on an InFO wafer. This step is further detailed in the flowchart of Figure 13.
In step 1301, referring also to Figure 14, a plurality of through-silicon vias 1402 are generated on a wafer 1401. Through-silicon-via technology is a high-density packaging technology: by filling the vias with conductive material such as copper, tungsten, or polysilicon, the through-silicon vias 1402 realize vertical electrical interconnection, reducing interconnect length, lowering signal delay, enabling low-power high-speed communication between dies, increasing bandwidth, and miniaturizing device integration.
In step 1302, a first redistribution layer 1403 is generated on one side of the plurality of through-silicon vias 1402. The first redistribution layer 1403 reroutes the contacts of the dies (i.e., the dies' input/output terminals) through a wafer-level metal-wiring process and relocates those contacts, so the dies can fit different package forms. In short, metal and dielectric layers are deposited on the wafer 1401 and patterned into corresponding three-dimensional metal wiring that re-lays-out the dies' input/output terminals for electrical signal conduction, making die layout more flexible. When designing the first redistribution layer 1403, vias must be added where criss-crossing metal wires of identical electrical properties in two adjacent layers overlap, to guarantee electrical connection between the upper and lower layers; the first redistribution layer 1403 thus realizes the electrical connections among multiple dies as a three-dimensional conduction structure, reducing layout area.
In step 1303, a plurality of bumps 1404 are generated on the first redistribution layer 1403. In practice, the bumps 1404 are solder balls; common solder-ball processes include evaporation, electroplating, screen printing, and needle depositing. In this embodiment, the solder balls do not connect directly to the metal wires in the first redistribution layer 1403 but are bridged by under-bump metallization (UBM) to improve adhesion; the UBM is usually realized by sputtering or electroplating. At this point, the first redistribution layer 808 and through-silicon vias 809 of the wiring layer 802 of Figure 8 have been generated.
Returning to Figure 12, in step 1202, the computing layer 803 of Figure 8 is generated on one side of the wiring layer. As described in the preceding embodiments, the computing layer is provided with a plurality of CoW units, each comprising a first die group and a second die group. This step is further detailed in the flow of Figure 15.
In step 1501, the first die group (the SoC) is placed at the core position of the CoW unit. In step 1502, the second die groups (the memories) are placed on both sides of the SoC. These two steps realize the CoW unit layout plans shown in Figures 3 to 5. Specifically, the CoW unit of this embodiment comprises a first die group and a second die group, the first die group being the SoC 301 and the second die group the memory 302, which is high-bandwidth memory.
In step 1503, the plurality of CoW units are die-attached, the first die group and second die group respectively making electrical contact with the plurality of bumps 1404. As shown in Figure 16, a CoW unit 1601 comprising an SoC 301 and memories 302 is die-attached onto the first redistribution layer 1403, the contacts of the SoC 301 and memories 302 electrically contacting the bumps 1404. The number of die-attached CoW units 1601 depends on the size of the wafer 1401.
In step 1504, the first die group and the second die group are underfilled. As shown in Figure 16, underfilling produces the encapsulant 1602 mainly by non-contact jet dispensing; the encapsulant 1602 seals the contacts and bumps 1404 of the first and second die groups, preventing the electrical interference that contact with impurities would cause, so the structure has better reliability.
In step 1505, molding compound is generated to cover the plurality of CoW units 1601. Figure 17 shows the structure after the molding compound is generated; as shown, the molding compound 1701 covers all the CoW units 1601 to protect the overall structure.
In step 1506, the molding compound 1701 is ground to expose the surfaces of the plurality of CoW units 1601. In step 1507, the ground surface is chemical-mechanically polished (CMP). As shown in Figure 18, after CMP of the molding compound 1701, the surfaces (top faces) of the CoW units 1601 are exposed to the air. Generation of the computing layer is now complete.
Returning to Figure 12, step 1203 is then executed to perform the wafer test. This step is further detailed in the flowchart of Figure 19.
In step 1901, a first glass is bonded to the surface of the CoW units 1601. In step 1902, the wafer 1401 is flipped so that the first glass lies beneath it. Figure 20 shows the flipped structure: the first glass 2001 adheres to the surface of the CoW units 1601 and, after flipping, serves as a base supporting the wafer 1401 and the various semiconductor structures generated on it, including the CoW units 1601, to facilitate subsequent processing of the bottom of the wafer 1401 (the top of the wafer 1401 in Figure 20).
In step 1903, the wafer 1401 is ground to expose the plurality of through-silicon vias 1402. In step 1904, the ground wafer is chemical-mechanically polished. Figure 21 shows the cross-section after CMP: the top faces of the through-silicon vias 1402 are exposed outside the wafer 1401.
In step 1905, an insulating layer is deposited on the wafer 1401 while leaving the plurality of through-silicon vias 1402 exposed. In this step, a photomask covers the top faces of the through-silicon vias 1402 before the insulating layer, whose material may be silicon nitride, is deposited. Figure 22 shows the structure after deposition: because the photomask covered the top faces of the through-silicon vias 1402, they remain exposed to the air after the insulating layer 2201 is deposited.
In step 1906, a plurality of metal points are generated on the insulating layer 2201, each suitably electrically contacting at least one of the through-silicon vias 1402, to serve as wafer test points for probe contact. Figure 23 shows the structure after the metal points 2301 are generated: each through-silicon via 1402 connects to one metal point 2301, which serves as a wafer test point for the probes of the wafer test.
In this embodiment, the testable content of the wafer test includes scan testing, boundary-scan testing, memory testing, DC/AC testing, RF testing, and other functional testing. Scan testing checks the logic functions of the first and second die groups; boundary-scan testing checks the pin functions of the first and second die groups; memory testing exercises the read/write and storage functions of the various kinds of memory in the die groups; DC/AC testing includes signal tests of the die groups' pins and power pins and judges whether the DC current and voltage parameters meet the design specification; RF testing checks the logic functions of the RF module for a die group in a CoW unit that is an RF integrated circuit; and other functional testing checks whether other important or customized functions and performance of the first and second die groups meet the design specification.
The test results for the whole wafer are compiled into a wafer map file, and the data are consolidated into a datalog. The wafer map records the yield, the test time, the error counts per category, and the positions of the CoW units; the datalog holds the concrete test results. Analyzing these data identifies the number and positions of defective CoW units.
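A minimal sketch of how such a wafer map might be analyzed follows. The 'P'/'F' map encoding and the variable names are assumptions made here; the patent does not specify the wafer-map file format:

```python
# Hypothetical wafer-map analysis: count and locate defective CoW units.
# 'P' = unit passed all test categories, 'F' = failed at least one.
wafer_map = [
    "PPFPP",
    "PPPPP",
    "PFPPP",
    "PPPPP",
    "PPPPF",
]

failures = [(r, c) for r, row in enumerate(wafer_map)
            for c, cell in enumerate(row) if cell == "F"]
total = sum(len(row) for row in wafer_map)
yield_pct = 100.0 * (total - len(failures)) / total

print(f"defective units: {failures}, yield: {yield_pct:.1f}%")
```

Only CoW dies containing passing units would then be kept for the second-glass reconstitution of the next steps; the rest are discarded.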
Returning to Figure 12, step 1204 is then executed to dice the computing layer and wiring layer in units of the CoW unit. Herein, the computing layer and wiring layer taken per CoW unit are called a CoW die. In this step, the CoW dies are cut from the wafer 1401 and, based on the wafer-test results, the CoW dies containing qualified CoW units are kept while those containing defective CoW units are discarded.
In step 1205, a plurality of CoW dies are attached onto a second glass. During attachment, the number and positions of the CoW dies are planned according to the accelerator's functions and requirements; for example, a 5×5 CoW die array is laid out within a 300 mm × 300 mm area. As shown in Figure 24, 25 CoW dies 2402 are attached onto a 300 mm × 300 mm second glass 2401 to form a 5×5 CoW unit array. Figure 25 shows the cross-section after the CoW dies 2402 are attached to the second glass 2401.
In step 1206, molding compound is generated to cover the CoW dies. Figure 26 shows the structure after generation: the molding compound 2601 covers all the CoW dies 2402 to protect the overall structure.
In step 1207, the molding compound covering the plurality of CoW dies is ground to expose the surfaces of the plurality of through-silicon vias. As shown in Figure 26, after grinding the molding compound 2601, the insulating layer 2201 and metal points 2301 are removed, leaving the surfaces (top faces) of the through-silicon vias 1402 exposed to the air.
In step 1208, the ground surface is chemical-mechanically polished. Figure 27 shows the cross-section after CMP.
In step 1209, the second part of the wiring layer is generated. In this step, a second redistribution layer is generated on the other side of the plurality of through-silicon vias, completing the whole wiring layer. Figure 28 shows the cross-section after the entire wiring layer is complete; the second redistribution layer 2801 in the figure is the second redistribution layer 810 of Figure 8.
In step 1210, the module layer is generated on the other side of the wiring layer. Solder balls are first formed on the second redistribution layer; then the power module die group and interface module die group are die-attached, the solder balls electrically connecting the second redistribution layer with the power module die group and interface module die group. Figure 29 shows the cross-section after the module layer is generated: the solder balls 2901 (i.e., the solder balls 812 of Figure 8) electrically connect the second redistribution layer 2801 with the power modules 805 of the power module die group and the interface modules 806 of the interface module die group. The power module die group supplies power to the first and second die groups through the wiring layer, and the first and second die groups output calculation results via the wiring layer through the interface module die group.
In step 1211, the structure is flipped and the second glass is removed. In step 1212, a heat-dissipation module is attached on the computing-layer side. Figure 30 shows the cross-section after the heat-dissipation module 3001 (i.e., the heat-dissipation module 804 of Figure 8) is attached. The entire accelerator structure is now complete.
In step 1213, the structure of Figure 30 is packaged according to InFO_SoW technology, realizing a monolithic accelerator chip.
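For orientation, the fabrication flow of Figure 12 just described can be summarized as an informal, ordered checklist (a summary sketch for the reader, not process-control code):

```python
# Informal ordering sketch of the fabrication flow; step numbers follow Figure 12.
flow = [
    (1201, "wiring layer part 1: TSVs + first RDL + bumps on the InFO wafer"),
    (1202, "computing layer: die-attach CoW units, underfill, mold, grind, CMP"),
    (1203, "wafer test: bond first glass, flip, expose TSVs, insulate, metal points"),
    (1204, "dice into CoW dies; keep only dies with qualified CoW units"),
    (1205, "attach good CoW dies onto the second glass (e.g. a 5x5 array)"),
    (1206, "mold over the CoW dies"),
    (1207, "grind to expose the TSVs"),
    (1208, "CMP the ground surface"),
    (1209, "wiring layer part 2: second RDL"),
    (1210, "module layer: solder balls, attach power + interface die groups"),
    (1211, "flip and remove the second glass"),
    (1212, "attach the heat-dissipation module"),
    (1213, "package per InFO_SoW into a monolithic accelerator chip"),
]

steps = [s for s, _ in flow]
assert steps == sorted(steps)        # strictly ordered process steps
print(len(flow), "steps")
```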
The above description took generating the structure of Figure 8 as an example. To generate the structure of Figure 9, since it differs from the Figure 8 structure only in the wiring layer's through-silicon vias, the flows above merely omit step 1301; executing all remaining steps generates the Figure 9 structure.
Another embodiment of the present invention is likewise a method of generating an accelerator structure; Figure 31 shows the flowchart of this embodiment. The CoW unit of this embodiment likewise comprises a first die group and a second die group, the first die group being the aforementioned system-on-chip and the second die group the aforementioned memory.
In step 3101, the first die group (the SoC) is placed at the core position of the CoW unit. In step 3102, the second die groups (the memories) are placed on both sides of the SoC. In step 3103, the plurality of CoW units are die-attached onto a first glass. In step 3104, molding compound is generated to cover the plurality of CoW units. In step 3105, the molding compound is ground to expose the surfaces of the plurality of CoW units. In step 3106, the ground surface is chemical-mechanically polished. In step 3107, a first redistribution layer is generated on the surface of the CoW units, the contacts of the first and second die groups directly electrically contacting the contacts of the first redistribution layer.
The wafer test is then performed. In step 3108, a plurality of metal points are generated on the contacts on the other side of the first redistribution layer, suitably electrically contacting at least one of the first redistribution layer's contacts, to serve as wafer test points for probe contact.
After the wafer test, step 3109 flips the wafer so the first glass is on top. In step 3110, the first glass is removed. In step 3111, each CoW die is diced. In step 3112, a plurality of qualified CoW dies are attached onto a second glass. In step 3113, molding compound is generated to cover the CoW dies. In step 3114, the molding compound covering the plurality of CoW dies is ground to expose the metal points. In step 3115, the ground surface is chemical-mechanically polished. In step 3116, the second redistribution layer of the wiring layer is generated, its contacts electrically connecting to the metal points, completing the whole wiring layer. In step 3117, the module layer is generated on the wiring layer: solder balls are first formed on the second redistribution layer, then the power module die group and interface module die group are die-attached, the solder balls electrically connecting the second redistribution layer with the power module die group and interface module die group. In step 3118, the structure is flipped and the second glass is removed. In step 3119, a heat-dissipation module is attached on the computing-layer side. In step 3120, the whole accelerator structure is packaged to realize a monolithic accelerator chip.
Figure 32 shows the cross-section of the accelerator structure of this embodiment. It differs from the Figure 30 structure in that no bumps are provided on the first redistribution layer: the contacts of the first and second die groups directly electrically contact the contacts of the first redistribution layer, so no encapsulant needs to be underfilled beneath the first and second die groups, and covering the CoW units with molding compound suffices. This embodiment also generates no through-silicon vias in the wiring layer: the first and second redistribution layers connect directly, without vias, saving the via-generation process steps.
Another embodiment of the present invention is a computer-readable storage medium storing computer program code for generating an accelerator structure; when the computer program code is run by a processing apparatus, the methods described in Figures 12, 13, 15, 19, and 31 are performed. Another embodiment of the present invention is a computer program product comprising a computer program for generating an accelerator structure, wherein the computer program, when executed by a processor, implements the steps of the methods described in Figures 12, 13, 15, 19, and 31. Another embodiment of the present invention is a computer apparatus comprising a memory, a processor, and a computer program stored on the memory, wherein the processor executes the computer program to implement the steps of the methods described in Figures 12, 13, 15, 19, and 31.
Given the rapid development of the chip field, and especially the AI field's demand for enormous accelerator computing power, the present invention's integration of CoW technology into InFO_SoW technology enables massive chip integration, representing the development trend of the chip field and particularly of AI accelerators. Moreover, the invention exploits CoW's vertical chip-integration capability to stack dies into die groups, then uses SoW technology to spread the die groups horizontally, so that the processor cores (the aforementioned systems-on-chip) in the die groups are arranged three-dimensionally within the accelerator; each processor core can cooperate with neighboring processors in three dimensions, greatly improving the accelerator's data-processing capability and speed and achieving the technical effect of integrating extremely large computing capability.
It should be noted that, for brevity, the present invention expresses some methods and their embodiments as a series of actions and combinations thereof, but those skilled in the art will understand that the invention's solutions are not limited by the order of the described actions. Accordingly, based on the disclosure or teaching of the present invention, those skilled in the art will understand that some of the steps may be performed in other orders or simultaneously. Further, those skilled in the art will understand that the described embodiments may be regarded as optional embodiments, i.e., the actions or modules involved are not necessarily required for realizing one or more solutions of the invention. In addition, depending on the solution, the descriptions of some embodiments have different emphases; in view of this, those skilled in the art may refer to the relevant descriptions of other embodiments for parts not detailed in a given embodiment.
In terms of concrete implementation, based on the disclosure and teaching of the present invention, those skilled in the art will understand that several embodiments disclosed herein can also be realized in other ways not disclosed herein. For example, the units in the electronic device or apparatus embodiments described above are split on the basis of logical function, and other ways of splitting are possible in actual implementation. As another example, multiple units or components may be combined or integrated into another system, or some features or functions of a unit or component may be selectively disabled. As for the connections among different units or components, the connections discussed above with reference to the drawings may be direct or indirect couplings between units or components. In some scenarios, the aforementioned direct or indirect couplings involve communication connections using interfaces, where the communication interfaces may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.
In other implementation scenarios, the integrated units described above may also be realized in hardware form, i.e., as concrete hardware circuits, which may include digital circuits and/or analog circuits. The physical realization of the circuits' hardware structure may include but is not limited to physical devices, and the physical devices may include but are not limited to devices such as transistors and memristors. In view of this, the various apparatuses described herein (e.g., computing apparatuses or other processing apparatuses) may be realized by appropriate hardware processors, such as core processors, GPUs, FPGAs, DSPs, and ASICs. Further, the aforementioned storage units or storage apparatuses may be any appropriate storage medium (including magnetic or magneto-optical storage media, etc.), such as resistive random-access memory (RRAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), enhanced dynamic random-access memory (EDRAM), high-bandwidth memory (HBM), hybrid memory cube (HMC), ROM, and RAM.
The foregoing may be better understood in light of the following clauses:
Clause A1. An accelerator structure, comprising: a computing layer provided with a plurality of chip-on-wafer units, each chip-on-wafer unit comprising a first die group and a second die group; a module layer provided with a power module die group and an interface module die group; and a wiring layer disposed between the computing layer and the module layer; wherein the power module die group supplies power to the first die group and the second die group through the wiring layer; and wherein the first die group and the second die group output calculation results via the wiring layer through the interface module die group.
Clause A2. The accelerator structure of Clause A1, further comprising a heat-dissipation module adjacent to the computing layer for dissipating heat from the plurality of chip-on-wafer units.
Clause A3. The accelerator structure of Clause A1, wherein the wiring layer is provided with a first redistribution layer for electrically connecting the first die group and the second die group within each chip-on-wafer unit.
Clause A4. The accelerator structure of Clause A3, wherein the wiring layer is further provided with through-silicon vias and a second redistribution layer, the through-silicon vias being disposed between the first redistribution layer and the second redistribution layer, and the first die group and the second die group being electrically connected to the module layer via the first redistribution layer, the through-silicon vias, and the second redistribution layer.
Clause A5. The accelerator structure of Clause A4, wherein each chip-on-wafer unit is electrically connected to another chip-on-wafer unit via the first redistribution layer, the through-silicon vias, and the second redistribution layer.
Clause A6. The accelerator structure of Clause A1, wherein the interface module die group converts electrical signals from the first die group or the second die group into optical signals for output.
Clause A7. The accelerator structure of Clause A1, wherein the first die group is a system-on-chip and the second die group is a memory.
Clause A8. The accelerator structure of Clause A1, wherein the first die group comprises a vertically stacked system-on-chip and on-chip memory, and the second die group is a memory.
Clause A9. The accelerator structure of Clause A1, wherein the first die group comprises a vertically stacked first core layer and second core layer, and the second die group is a memory.
Clause A10. The accelerator structure of Clause A7, A8, or A9, wherein the memory is a high-bandwidth memory.
Clause A11. The accelerator structure of Clause A9, wherein the first core layer comprises: a first computing region in which a first computing circuit is formed; and a first die-group-to-die-group region in which a first transceiver circuit is formed; and the second core layer comprises: a second computing region in which a second computing circuit is formed; and a second die-group-to-die-group region in which a second transceiver circuit is formed; wherein the first computing circuit and the second computing circuit perform data transfers within the first die group through the first transceiver circuit and the second transceiver circuit.
Clause A12. The accelerator structure of Clause A11, wherein the first core layer further comprises a physical region in which a physical access circuit is formed to access the memory.
Clause A13. The accelerator structure of Clause A11, wherein the first core layer further comprises an input-output region in which an input-output circuit is formed, serving as an interface for electrical connection to the first die group of another chip-on-wafer unit.
Clause A14. The accelerator structure of Clause A13, wherein the plurality of chip-on-wafer units are arranged in an array, chip-on-wafer units near the center of the array pass intermediate results to neighboring chip-on-wafer units, the outermost chip-on-wafer units compute the calculation result, and the calculation result is output through the interface module die group.
Clause A15. An integrated circuit apparatus comprising the accelerator structure of any one of Clauses A1 to A14.
Clause A16. A board card comprising the integrated circuit apparatus of Clause A15.
Clause A17. A method of generating an accelerator structure, comprising: generating a wiring layer; generating a computing layer on one side of the wiring layer, the computing layer being provided with a plurality of CoW units, each CoW unit comprising a first die group and a second die group; and generating a module layer on the other side of the wiring layer, the module layer being provided with a power module die group and an interface die group; wherein the power module die group supplies power to the first die group and the second die group through the wiring layer; and wherein the first die group and the second die group output calculation results via the wiring layer through the interface die group.
Clause A18. The method of Clause A17, wherein the step of generating a wiring layer comprises: generating a plurality of through-silicon vias on a wafer; generating a first redistribution layer on one side of the plurality of through-silicon vias; and generating a plurality of bumps on the first redistribution layer.
Clause A19. The method of Clause A18, wherein the step of generating a computing layer comprises: die-attaching the plurality of CoW units, wherein the first die group and the second die group respectively make electrical contact with the plurality of bumps.
Clause A20. The method of Clause A19, wherein the step of generating a computing layer further comprises: underfilling the first die group and the second die group; and generating molding compound to cover the plurality of CoW units.
Clause A21. The method of Clause A20, wherein the step of generating a computing layer further comprises: grinding the molding compound to expose surfaces of the plurality of CoW units; and chemical-mechanical polishing the ground surface.
Clause A22. The method of Clause A21, further comprising: performing a wafer test.
Clause A23. The method of Clause A22, wherein the step of performing a wafer test comprises: bonding a first glass to the surface; and flipping the wafer.
Clause A24. The method of Clause A23, wherein the step of performing a wafer test further comprises: grinding the wafer to expose the plurality of through-silicon vias; and chemical-mechanical polishing the ground wafer.
Clause A25. The method of Clause A24, wherein the step of performing a wafer test further comprises: depositing an insulating layer on the wafer while leaving the plurality of through-silicon vias exposed; and generating a plurality of metal points on the insulating layer, the plurality of metal points electrically contacting at least one of the plurality of through-silicon vias to serve as wafer test points.
Clause A26. The method of Clause A21, further comprising: dicing the computing layer and the wiring layer in units of the CoW unit to form CoW dies; attaching a plurality of the CoW dies onto a second glass; and generating molding compound to cover the plurality of CoW dies.
Clause A27. The method of Clause A26, further comprising: grinding the molding compound covering the plurality of CoW dies to expose surfaces of the plurality of CoW units; and chemical-mechanical polishing the ground surface.
Clause A28. The method of Clause A27, wherein the step of generating a wiring layer further comprises: generating a second redistribution layer on the other side of the plurality of through-silicon vias.
Clause A29. The method of Clause A28, wherein the step of generating a module layer comprises: forming solder balls on the second redistribution layer; and die-attaching the power module die group and the interface die group; wherein the solder balls electrically connect the second redistribution layer with the power module die group and the interface die group.
Clause A30. The method of Clause A29, further comprising: flipping and removing the second glass; and attaching a heat-dissipation module on the computing-layer side.
Clause A31. A computer-readable storage medium having stored thereon computer program code for generating an accelerator structure, the computer program code, when run by a processing apparatus, performing the method of any one of Clauses A17 to A30.
Clause A32. A computer program product comprising a computer program for generating an accelerator structure, wherein the computer program, when executed by a processor, implements the steps of the method of any one of Clauses A17 to A30.
Clause A33. A computer device comprising a memory, a processor, and a computer program stored on the memory, wherein the processor executes the computer program to implement the steps of the method of any one of Clauses A17 to A30.
The embodiments of the present invention have been described in detail above; specific examples have been used herein to explain the principles and implementations of the invention, and the descriptions of the above embodiments are merely intended to help understand the method of the invention and its core idea. Meanwhile, those of ordinary skill in the art, following the idea of the invention, will make changes in the specific implementations and scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (33)

  1. An accelerator structure, comprising:
    a computing layer provided with a plurality of chip-on-wafer units, each chip-on-wafer unit comprising a first die group and a second die group;
    a module layer provided with a power module die group and an interface module die group; and
    a wiring layer disposed between the computing layer and the module layer;
    wherein the power module die group supplies power to the first die group and the second die group through the wiring layer;
    wherein the first die group and the second die group output calculation results via the wiring layer through the interface module die group.
  2. The accelerator structure of claim 1, further comprising a heat-dissipation module adjacent to the computing layer for dissipating heat from the plurality of chip-on-wafer units.
  3. The accelerator structure of claim 1, wherein the wiring layer is provided with a first redistribution layer for electrically connecting the first die group and the second die group within each chip-on-wafer unit.
  4. The accelerator structure of claim 3, wherein the wiring layer is further provided with through-silicon vias and a second redistribution layer, the through-silicon vias being disposed between the first redistribution layer and the second redistribution layer, and the first die group and the second die group being electrically connected to the module layer via the first redistribution layer, the through-silicon vias, and the second redistribution layer.
  5. The accelerator structure of claim 4, wherein each chip-on-wafer unit is electrically connected to another chip-on-wafer unit via the first redistribution layer, the through-silicon vias, and the second redistribution layer.
  6. The accelerator structure of claim 1, wherein the interface module die group converts electrical signals from the first die group or the second die group into optical signals for output.
  7. The accelerator structure of claim 1, wherein the first die group is a system-on-chip and the second die group is a memory.
  8. The accelerator structure of claim 1, wherein the first die group comprises a vertically stacked system-on-chip and on-chip memory, and the second die group is a memory.
  9. The accelerator structure of claim 1, wherein the first die group comprises a vertically stacked first core layer and second core layer, and the second die group is a memory.
  10. The accelerator structure of claim 7, 8, or 9, wherein the memory is a high-bandwidth memory.
  11. The accelerator structure of claim 9, wherein the first core layer comprises:
    a first computing region in which a first computing circuit is formed; and
    a first die-group-to-die-group region in which a first transceiver circuit is formed;
    and the second core layer comprises:
    a second computing region in which a second computing circuit is formed; and
    a second die-group-to-die-group region in which a second transceiver circuit is formed;
    wherein the first computing circuit and the second computing circuit perform data transfers within the first die group through the first transceiver circuit and the second transceiver circuit.
  12. The accelerator structure of claim 11, wherein the first core layer further comprises a physical region in which a physical access circuit is formed to access the memory.
  13. The accelerator structure of claim 11, wherein the first core layer further comprises an input-output region in which an input-output circuit is formed, serving as an interface for electrical connection to the first die group of another chip-on-wafer unit.
  14. The accelerator structure of claim 13, wherein the plurality of chip-on-wafer units are arranged in an array, chip-on-wafer units near the center of the array pass intermediate results to neighboring chip-on-wafer units, the outermost chip-on-wafer units compute the calculation result, and the calculation result is output through the interface module die group.
  15. An integrated circuit apparatus comprising the accelerator structure of any one of claims 1 to 14.
  16. A board card comprising the integrated circuit apparatus of claim 15.
  17. A method of generating an accelerator structure, comprising:
    generating a wiring layer;
    generating a computing layer on one side of the wiring layer, the computing layer being provided with a plurality of CoW units, each CoW unit comprising a first die group and a second die group; and
    generating a module layer on the other side of the wiring layer, the module layer being provided with a power module die group and an interface die group;
    wherein the power module die group supplies power to the first die group and the second die group through the wiring layer;
    wherein the first die group and the second die group output calculation results via the wiring layer through the interface die group.
  18. The method of claim 17, wherein the step of generating a wiring layer comprises:
    generating a plurality of through-silicon vias on a wafer;
    generating a first redistribution layer on one side of the plurality of through-silicon vias; and
    generating a plurality of bumps on the first redistribution layer.
  19. The method of claim 18, wherein the step of generating a computing layer comprises:
    die-attaching the plurality of CoW units, wherein the first die group and the second die group respectively make electrical contact with the plurality of bumps.
  20. The method of claim 19, wherein the step of generating a computing layer further comprises:
    underfilling the first die group and the second die group; and
    generating molding compound to cover the plurality of CoW units.
  21. The method of claim 20, wherein the step of generating a computing layer further comprises:
    grinding the molding compound to expose surfaces of the plurality of CoW units; and
    chemical-mechanical polishing the ground surface.
  22. The method of claim 21, further comprising:
    performing a wafer test.
  23. The method of claim 22, wherein the step of performing a wafer test comprises:
    bonding a first glass to the surface; and
    flipping the wafer.
  24. The method of claim 23, wherein the step of performing a wafer test further comprises:
    grinding the wafer to expose the plurality of through-silicon vias; and
    chemical-mechanical polishing the ground wafer.
  25. The method of claim 24, wherein the step of performing a wafer test further comprises:
    depositing an insulating layer on the wafer while leaving the plurality of through-silicon vias exposed; and
    generating a plurality of metal points on the insulating layer, the plurality of metal points electrically contacting at least one of the plurality of through-silicon vias to serve as wafer test points.
  26. The method of claim 21, further comprising:
    dicing the computing layer and the wiring layer in units of the CoW unit to form CoW dies;
    attaching a plurality of the CoW dies onto a second glass; and
    generating molding compound to cover the plurality of CoW dies.
  27. The method of claim 26, further comprising:
    grinding the molding compound covering the plurality of CoW dies to expose surfaces of the plurality of CoW units; and
    chemical-mechanical polishing the ground surface.
  28. The method of claim 27, wherein the step of generating a wiring layer further comprises:
    generating a second redistribution layer on the other side of the plurality of through-silicon vias.
  29. The method of claim 28, wherein the step of generating a module layer comprises:
    forming solder balls on the second redistribution layer; and
    die-attaching the power module die group and the interface die group;
    wherein the solder balls electrically connect the second redistribution layer with the power module die group and the interface die group.
  30. The method of claim 29, further comprising:
    flipping and removing the second glass; and
    attaching a heat-dissipation module on the computing-layer side.
  31. A computer-readable storage medium having stored thereon computer program code for generating an accelerator structure, the computer program code, when run by a processing apparatus, performing the method of any one of claims 17 to 30.
  32. A computer program product comprising a computer program for generating an accelerator structure, wherein the computer program, when executed by a processor, implements the steps of the method of any one of claims 17 to 30.
  33. A computer device comprising a memory, a processor, and a computer program stored on the memory, wherein the processor executes the computer program to implement the steps of the method of any one of claims 17 to 30.
PCT/CN2022/122375 2021-11-05 2022-09-29 Accelerator structure, method of generating accelerator structure, and device thereof WO2023078006A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111308266.9 2021-11-05
CN202111308266.9A CN116108900A (zh) 2021-11-05 2021-11-05 Accelerator structure, method of generating accelerator structure, and device thereof

Publications (1)

Publication Number Publication Date
WO2023078006A1 true WO2023078006A1 (zh) 2023-05-11

Family

ID=86240628

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/122375 WO2023078006A1 (zh) 2021-11-05 2022-09-29 Accelerator structure, method of generating an accelerator structure, and device thereof

Country Status (2)

Country Link
CN (1) CN116108900A (zh)
WO (1) WO2023078006A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116828866A (zh) * 2023-06-07 2023-09-29 Alibaba Damo Academy (Hangzhou) Technology Co., Ltd. Integrated circuit assembly, processor, and system-on-chip

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117855176A (zh) * 2022-09-28 2024-04-09 Huawei Technologies Co., Ltd. Chip package structure and electronic device
CN117149700B (zh) * 2023-10-27 2024-02-09 Beijing Sophgo Technology Co., Ltd. Data processing chip, manufacturing method thereof, and data processing system

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102044512A (zh) * 2009-10-09 2011-05-04 Taiwan Semiconductor Manufacturing Co., Ltd. Integrated circuit and three-dimensionally stacked multi-chip module
CN103178050A (zh) * 2011-12-22 2013-06-26 Yu Wan-ling Semiconductor package structure and fabrication method thereof
CN103875072A (zh) * 2011-10-17 2014-06-18 Panasonic Corporation Integrated circuit, multi-core processor device, and method of manufacturing an integrated circuit
CN106843080A (zh) * 2017-03-29 2017-06-13 Jiechuang Intelligent Technology Co., Ltd. FPGA parallel array module and computing method thereof
CN109560068A (zh) * 2017-09-25 2019-04-02 Powertech Technology Inc. Package structure and chip structure
CN110098163A (zh) * 2018-01-31 2019-08-06 Samsung Electronics Co., Ltd. Semiconductor device including through-silicon vias that distribute current
US20200135700A1 (en) * 2019-12-26 2020-04-30 Intel Corporation Multi-chip module having a stacked logic chip and memory stack
TWI703650B (zh) * 2019-08-14 2020-09-01 Powertech Technology Inc. Semiconductor package structure and manufacturing method thereof
CN112117202A (zh) * 2019-06-20 2020-12-22 Xipan Microelectronics (Chongqing) Co., Ltd. Method for fabricating a chip package structure
CN112232523A (zh) * 2020-12-08 2021-01-15 Hunan Aerospace Jiecheng Electronic Equipment Co., Ltd. Domestically produced artificial intelligence computing device
CN113410223A (zh) * 2021-06-15 2021-09-17 Shanghai Biren Intelligent Technology Co., Ltd. Chipset and manufacturing method thereof

Also Published As

Publication number Publication date
CN116108900A (zh) 2023-05-12

Similar Documents

Publication Publication Date Title
WO2023078006A1 (zh) Accelerator structure, method of generating an accelerator structure, and device thereof
US9087765B2 (en) System-in-package with interposer pitch adapter
US8736068B2 (en) Hybrid bonding techniques for multi-layer semiconductor stacks
TWI748291B (zh) Integrated circuit device, interconnect component die, and method of manufacturing an integrated system-on-chip
CN104011851B (zh) 具有窗口插入器的3d集成电路封装
US20200161275A1 (en) Packages with multi-thermal interface materials and methods of fabricating the same
US20220399321A1 (en) Chipset and manufacturing method thereof
US10509752B2 (en) Configuration of multi-die modules with through-silicon vias
WO2022016470A1 (zh) Chip package structure and electronic device
US20230352412A1 (en) Multiple die package using an embedded bridge connecting dies
Su et al. 3D-MiM (MUST-in-MUST) technology for advanced system integration
US11791326B2 (en) Memory and logic chip stack with a translator chip
CN110544673B (zh) Multi-level fused three-dimensional system integration structure
WO2023056876A1 (zh) Vertically stacked chips, integrated circuit device, board card, and manufacturing process therefor
KR102629195B1 (ko) Package structure, device, board card, and method of laying out an integrated circuit
WO2023056875A1 (zh) Multi-core chip, integrated circuit device, board card, and manufacturing process therefor
TWI836843B (zh) Semiconductor device, semiconductor package, and method of manufacturing a semiconductor device
WO2022242333A1 (zh) Chip, wafer, and device having a CoWoS package structure, and method of generating same
CN116092960A (zh) Wafer testing method, storage medium, computer program product, and apparatus
WO2022261812A1 (zh) Three-dimensional stacked package and method of manufacturing same
TW202410331A (zh) Semiconductor package and manufacturing method thereof
Hopsch et al. Low Cost Flip-Chip Stack for Partitioning Processing and Memory
CN115966517A (zh) Back-to-back stacking process method, and medium and computer device therefor
CN117650127A (zh) Semiconductor package structure and preparation method thereof
CN117525005A (zh) Chip assembly with a vacuum-cavity vapor chamber, package structure, and preparation method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22889032

Country of ref document: EP

Kind code of ref document: A1