WO2021168837A1 - Data processing device and method - Google Patents

Data processing device and method

Info

Publication number
WO2021168837A1
Authority
WO
WIPO (PCT)
Prior art keywords
chip
processing unit
dedicated processing
dedicated
data processing
Application number
PCT/CN2020/077290
Other languages
English (en)
Chinese (zh)
Inventor
檀珠峰
李宗岩
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority to CN202080001141.9A (published as CN113574656A)
Priority to PCT/CN2020/077290 (published as WO2021168837A1)
Priority to EP20921522.7A (published as EP4086948A4)
Publication of WO2021168837A1
Priority to US17/894,211 (published as US20220405228A1)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/76 Architectures of general purpose stored program computers
    • G06F15/78 Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807 System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/38 Information transfer, e.g. on bus
    • G06F13/40 Bus structure
    • G06F13/4063 Device-to-bus coupling
    • G06F13/409 Mechanical coupling
    • G06F13/4095 Mechanical coupling in incremental bus architectures, e.g. bus stacks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/38 Information transfer, e.g. on bus
    • G06F13/40 Bus structure
    • G06F13/4063 Device-to-bus coupling
    • G06F13/4068 Electrical coupling
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching
    • G06F9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/485 Task life-cycle, e.g. stopping, restarting, resuming execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request, the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5044 Allocation of resources, e.g. of the central processing unit [CPU] to service a request, the resource being a machine, e.g. CPUs, Servers, Terminals, considering hardware capabilities
    • H ELECTRICITY
    • H01 ELECTRIC ELEMENTS
    • H01L SEMICONDUCTOR DEVICES NOT COVERED BY CLASS H10
    • H01L25/00 Assemblies consisting of a plurality of individual semiconductor or other solid state devices; Multistep manufacturing processes thereof
    • H01L25/18 Assemblies consisting of a plurality of individual semiconductor or other solid state devices, the devices being of types provided for in two or more different subgroups of the same main group of groups H01L27/00 - H01L33/00, or in a single subclass of H10K, H10N
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • This application relates to the field of chip technology, and in particular to a data processing device and method.
  • The system-on-chip (SOC) of the smart terminal keeps evolving with advances in semiconductor technology, from 28 nm to 7 nm or even 5 nm.
  • This allows machine vision (for example, camera algorithms) and neural network algorithms to be integrated into the smart terminal SOC chip without the chip's power consumption exceeding the current battery supply and heat dissipation capabilities.
  • A stacked chip architecture does not significantly increase the volume of the product.
  • However, existing chips only split the analog and input/output devices into independent chips while expanding stacked memory (Stacked Memory); such a split-and-stack architecture cannot meet users' increasingly high computing needs and ever-changing application scenarios, and its power consumption is not sufficiently optimized.
  • The embodiments of the present application provide a data processing device and related methods, which can improve product performance and provide flexible task processing without increasing the volume of the product.
  • In a first aspect, an embodiment of the present application provides a data processing device, which may include a first chip and a second chip that are stacked and packaged.
  • The first chip includes a general-purpose processor, a bus, and at least one first dedicated processing unit (DPU); the general-purpose processor and the at least one first dedicated processing unit are connected to the bus, and the general-purpose processor is used to generate data processing tasks.
  • The second chip includes a second dedicated processing unit that has at least part of the same computing function as one or more first dedicated processing units among the at least one first dedicated processing unit, and at least one of the one or more first dedicated processing units and the second dedicated processing unit can process at least a part of the data processing task based on that computing function. The first chip and the second chip are connected to each other through an inter-chip interconnection line.
  • When the general-purpose processor in the first chip generates a data processing task, the second dedicated processing unit can process at least a part of that task, because it has at least part of the same computing function as one or more of the first dedicated processing units.
  • The data processing device can therefore flexibly allocate data processing tasks to the first dedicated processing unit and the second dedicated processing unit according to the needs of the task, meeting the user's computing needs. For example, when the amount of data to be processed is small, the task can be allocated to the first dedicated processing unit alone, or to the second dedicated processing unit alone; when it is large, the task can be allocated to the first dedicated processing unit and the second dedicated processing unit at the same time.
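The allocation policy just described can be sketched as a small dispatcher: small tasks go to a single dedicated processing unit, large tasks are split across the first and second dedicated processing units. The unit names and the size threshold below are illustrative assumptions, not part of the patent.

```python
# Hypothetical sketch of workload-based task allocation between the first
# dedicated processing unit (on the SOC) and the second dedicated processing
# unit (on the stacked chip). Threshold and names are placeholders.

def allocate_task(task_size, threshold=1000):
    """Return a list of (unit, share) assignments for a data processing task."""
    if task_size <= threshold:
        # Small task: assign entirely to the on-SOC first dedicated unit.
        return [("DPU_first", task_size)]
    # Large task: split between the first and the stacked second dedicated unit.
    half = task_size // 2
    return [("DPU_first", half), ("DPU_second", task_size - half)]
```

A real device would base the split on measured load and available computing power rather than a fixed threshold; the point is only that both single-unit and dual-unit assignments are possible.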
  • Because the chips are stacked, the second dedicated processing unit can assist the first dedicated processing unit with data processing tasks without significantly increasing the overall chip volume, thereby enhancing the computing power of the first dedicated processing unit.
  • The data processing device can thus meet users' increasingly demanding power consumption and computing requirements, improve product performance, and provide flexible task processing without increasing the volume of the product.
  • The second dedicated processing unit may have the same computing function as one or more first dedicated processing units among the at least one first dedicated processing unit.
  • In that case the second dedicated processing unit can process the same data processing tasks as the one or more first dedicated processing units. For example, when the one or more first dedicated processing units cannot complete a data processing task in time, the second dedicated processing unit can assist them in performing it, meeting users' growing computing needs.
  • The general-purpose processor includes a central processing unit (CPU).
  • The general-purpose processor in the first chip can be a central processing unit, serving as the computing and control core of the chip system that generates data processing tasks; it is also the final execution unit for information processing and program operation, satisfying users' basic computing requirements.
  • Each of the one or more first dedicated processing units and the second dedicated processing unit includes at least one of a graphics processing unit (GPU), an image signal processor (ISP), a digital signal processor (DSP), or a neural network processing unit (NPU).
  • The one or more first dedicated processing units and the second dedicated processing unit can thus perform data processing tasks of different data types, allowing smart terminals to adapt to users' increasingly varied data processing needs.
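A minimal illustration of routing tasks of different data types to the unit types listed above (GPU, ISP, DSP, NPU). The mapping itself is an assumption for the sketch; the patent does not prescribe a specific routing table.

```python
# Illustrative data-type-to-unit routing. The keys are hypothetical task
# categories; the values are the dedicated unit types named in the text.

ROUTES = {
    "graphics": "GPU",
    "image_signal": "ISP",
    "audio_signal": "DSP",
    "neural_network": "NPU",
}

def route(data_type):
    """Return the dedicated processing unit type for a given data type."""
    unit = ROUTES.get(data_type)
    if unit is None:
        raise ValueError(f"no dedicated unit for data type {data_type!r}")
    return unit
```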
  • The inter-chip interconnection line includes at least one of a through-silicon via (TSV) interconnection line or a wire bonding interconnection line.
  • A variety of efficient inter-chip interconnection lines can be used, such as TSV interconnection lines and wire bonding interconnection lines.
  • TSV, as a through-hole interconnection technology between chips, offers small apertures, low latency, and flexible configuration of the data bandwidth between chips, which improves the overall computing efficiency of the chip system. TSV technology can also achieve bumpless bonding structures that integrate adjacent chips of different properties.
  • Wire bonding interconnection lines between stacked chips shorten the interconnections between the chips and effectively improve the working efficiency of the device itself.
  • The device may further include a third chip, stacked and packaged with the first chip and the second chip; the third chip is connected to at least one of the first chip or the second chip by the inter-chip interconnection line. The third chip includes at least one of a memory, a power transmission circuit module, an input/output circuit module, or an analog module.
  • The first chip can be stacked and packaged with the third chip.
  • Stacking the memory with the first chip can partially solve the problem of insufficient storage bandwidth while increasing the computing power of the chip. Stacking one or more of the power transmission circuit module, the input/output circuit module, or the analog module with the first chip separates and decouples the analog circuitry from the logic computation of the SOC chip while increasing its computing power, sustaining the evolution of the chip and meeting the growing demands that business scenarios place on it.
  • The inter-chip interconnection line may be connected between the one or more first dedicated processing units and the second dedicated processing unit; the second dedicated processing unit is configured to acquire at least a part of the data processing task from the one or more first dedicated processing units.
  • Because the second chip is connected to one or more first dedicated processing units in the first chip, the second dedicated processing unit can acquire at least a part of the data processing task from them and accept deployment by them, providing flexible task processing.
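The hand-off described above can be sketched as a queue of sub-tasks held by a first dedicated unit, from which the second dedicated unit acquires a part over the inter-chip link. The queue-based model is an illustrative assumption, not the patent's mechanism.

```python
# Minimal sketch of the offload path: a first dedicated unit holds queued
# sub-tasks, and the second dedicated unit (on the stacked chip) acquires
# some of them. Class and method names are hypothetical.

from collections import deque

class FirstDPU:
    def __init__(self, subtasks):
        self.queue = deque(subtasks)  # pending sub-tasks, FIFO order

    def release_subtasks(self, n):
        """Hand up to n queued sub-tasks over to the second dedicated unit."""
        handed = []
        while self.queue and len(handed) < n:
            handed.append(self.queue.popleft())
        return handed
```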
  • Alternatively, the inter-chip interconnection line may be connected between the second dedicated processing unit and the bus, and the second dedicated processing unit is configured to acquire at least a part of the data processing task from the general-purpose processor through the bus.
  • Through this bus connection between the second chip and the first chip, the second dedicated processing unit can accept deployment by the general-purpose processor and execute at least a part of the data processing task alone or together with one or more first dedicated processing units, providing flexible task processing.
  • The general-purpose processor may be configured to send startup information to the second dedicated processing unit through the inter-chip interconnection line; the second dedicated processing unit is configured to switch from a waiting state to a started state in response to the startup information and to process at least a part of the data processing task based on the computing function.
  • The general-purpose processor can thus allocate the computing power of the second dedicated processing unit. For example, when the general-purpose processor in the first chip sends startup information through the inter-chip interconnection line, the second chip switches from the waiting state to the started state to perform data processing tasks; the power consumption of the waiting state is lower than that of the started state. As long as the general-purpose processor has not sent startup information, the second dedicated processing unit remains in the waiting state, so the power consumption of the stacked chips can be effectively controlled.
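The waiting/started behaviour is a two-state machine with asymmetric power draw. The sketch below is illustrative; the power figures are arbitrary placeholders chosen only to preserve the stated relationship (waiting consumes less than started).

```python
# Sketch of the second dedicated unit's state machine: it idles in a
# low-power waiting state until startup information arrives over the
# inter-chip interconnect. Power values are placeholders, not from the patent.

class SecondDPU:
    POWER = {"waiting": 1, "started": 10}  # arbitrary units; waiting < started

    def __init__(self):
        self.state = "waiting"  # default: low-power waiting state

    def on_startup_info(self):
        """React to startup information from the general-purpose processor."""
        self.state = "started"

    def power(self):
        """Current power draw in arbitrary units."""
        return self.POWER[self.state]
```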
  • The general-purpose processor may be configured to send the startup information to the second dedicated processing unit through the inter-chip interconnection line when the computing power of the one or more first dedicated processing units does not meet the demand.
  • In that case, the second chip in the stack receives the startup information sent by the general-purpose processor, switches from the waiting state to the started state accordingly, and assists the one or more first dedicated processing units of the first chip in performing data processing tasks, preventing the chip from failing to complete the data processing task on the target data because of insufficient computing power of the one or more first dedicated processing units.
  • When the computing power of the one or more first dedicated processing units is sufficient, the second chip can remain in the waiting state; there is no need to start it, and only the dedicated processing units in the first chip perform the data processing tasks, reducing the overall power consumption of the chip.
  • Alternatively, the one or more first dedicated processing units may be configured to send startup information to the second dedicated processing unit through the inter-chip interconnection line; the second dedicated processing unit is configured to switch from the waiting state to the started state in response to the startup information and to process at least a part of the data processing task based on the computing function.
  • The one or more first dedicated processing units can thus allocate the computing power of the second dedicated processing unit. For example, when one or more first dedicated processing units in the first chip send startup information through the inter-chip interconnection line, the second chip switches from the waiting state to the started state to perform data processing tasks; the power consumption of the waiting state is lower than that of the started state. As long as no startup information has been sent, the second dedicated processing unit remains in the waiting state, so the power consumption of the stacked chips is effectively controlled.
  • The one or more first dedicated processing units may be configured to send the startup information to the second dedicated processing unit through the inter-chip interconnection line when their own computing power does not meet the demand.
  • In that case, the second chip in the stack receives the startup information sent by the one or more first dedicated processing units, switches from the waiting state to the started state accordingly, and assists the one or more first dedicated processing units of the first chip in performing data processing tasks, enhancing or supplementing their computing power and preventing the chip from failing to complete the data processing task on the target data because of insufficient computing power.
  • When the computing power of the one or more first dedicated processing units is sufficient, the second chip can remain in the waiting state; there is no need to start it, and only the dedicated processing units in the first chip perform the data processing tasks, reducing the overall power consumption of the chip.
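In both variants above, startup information is sent only when the first units' computing power does not meet the demand. That decision rule can be written as a one-line predicate; the demand/capacity model is an assumption for the sketch.

```python
# Illustrative decision rule for sending startup information: the second
# dedicated unit is started only when the first units' computing power
# cannot cover the task demand. Units of "demand" and "capacity" are
# abstract placeholders.

def should_start_second_dpu(task_demand, first_dpu_capacity):
    """Return True if the second dedicated unit must be started to assist."""
    return task_demand > first_dpu_capacity
```

Keeping the second chip in the waiting state whenever this predicate is False is what bounds the stack's overall power consumption.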
  • In a second aspect, an embodiment of the present application provides a data processing method, including: generating a data processing task through a general-purpose processor in a first chip, the first chip including the general-purpose processor, a bus, and at least one first dedicated processing unit (DPU), the general-purpose processor and the at least one first dedicated processing unit being connected to the bus; and processing at least a part of the data processing task through at least one of one or more first dedicated processing units among the at least one first dedicated processing unit and a second dedicated processing unit in a second chip. The second dedicated processing unit has at least part of the same computing function as the one or more first dedicated processing units, and the first chip and the second chip are stacked and packaged and connected to each other by an inter-chip interconnection line.
  • Processing at least a part of the data processing task through at least one of the one or more first dedicated processing units and the second dedicated processing unit may include: sending startup information to the second dedicated processing unit through the general-purpose processor via the inter-chip interconnection line; and, through the second dedicated processing unit, switching from the waiting state to the started state in response to the startup information and processing at least a part of the data processing task based on the computing function.
  • Sending the startup information through the general-purpose processor via the inter-chip interconnection line may include: sending the startup information when the computing power of the one or more first dedicated processing units does not meet the demand.
  • Alternatively, processing at least a part of the data processing task may include: sending startup information to the second dedicated processing unit through the one or more first dedicated processing units via the inter-chip interconnection line; the second dedicated processing unit then switches from the waiting state to the started state in response to the startup information and processes at least a part of the data processing task based on the computing function.
  • Sending the startup information through the one or more first dedicated processing units via the inter-chip interconnection line may include: sending the startup information when the computing power of the one or more first dedicated processing units does not meet the demand.
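The method steps above can be summarized end to end: a task is generated, startup information is sent only when needed, and the task is processed by the first unit alone or by both units together. The function below is an illustrative trace of that flow; step names and numbers are assumptions, not claim language.

```python
# End-to-end sketch of the data processing method: the general-purpose
# processor generates a task, and the second dedicated unit is started only
# when the first units' computing power does not meet the demand.

def process(task_demand, first_capacity):
    """Return the ordered list of steps taken for one data processing task."""
    steps = ["generate_task"]  # by the general-purpose processor
    if task_demand > first_capacity:
        steps.append("send_startup_info")               # over the inter-chip link
        steps.append("second_dpu: waiting -> started")  # state transition
        steps.append("process_on_first_and_second")
    else:
        steps.append("process_on_first_only")           # second chip stays waiting
    return steps
```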
  • Further, the embodiments of the present application provide a chip system that includes any device supporting the data processing of the first aspect described above.
  • The chip system may consist of a chip alone, or may include a chip together with other discrete devices.
  • FIG. 1 is a diagram of a traditional von Neumann chip architecture provided by an embodiment of the present application.
  • FIG. 2A is a schematic diagram of a data processing architecture provided by an embodiment of the present application.
  • FIG. 2B is a schematic diagram of a chip structure of a stacked package provided by an embodiment of the present application.
  • FIG. 2C is a schematic diagram of a chip architecture of a stacked package in practical application provided by an embodiment of the present application.
  • FIG. 2D is a schematic diagram of interaction between the second dedicated processing unit and the first dedicated processing unit in the above-mentioned stacked package chip shown in FIG. 2C according to an embodiment of the present application.
  • FIG. 2E is a schematic diagram of another stacked package chip architecture in practical applications provided by an embodiment of the present application.
  • FIG. 2F is a schematic diagram of interaction between the second dedicated processing unit and the first dedicated processing unit in the stacked package chip shown in FIG. 2E according to an embodiment of the present application.
  • FIG. 2G is a schematic diagram of another stacked package chip architecture provided by an embodiment of the present application.
  • FIG. 3 is a schematic flowchart of a data processing method provided by an embodiment of the present application.
  • The term "component" used in this specification denotes a computer-related entity: hardware, firmware, a combination of hardware and software, software, or software in execution.
  • A component may be, but is not limited to, a process running on a processor, a processor, an object, an executable file, a thread of execution, a program, and/or a computer.
  • Both an application running on a computing device and the computing device itself can be components.
  • One or more components may reside within a process and/or thread of execution, and a component may be located on one computer and/or distributed between two or more computers.
  • These components can execute from various computer-readable media having various data structures stored thereon.
  • A component may communicate through local and/or remote processes, for example based on a signal having one or more data packets (such as data from one component interacting with another component in a local system, in a distributed system, and/or across a network such as the Internet that interacts with other systems by way of the signal).
  • The term "connection" refers to communication, data interaction, energy transmission, and the like between the connected modules, units, or devices. A connection may be direct, or indirect through other devices or modules; for example, it may pass through wires, conductors, media, interfaces, devices, or units, and in a broad sense it can be regarded as an electrical connection or coupling.
  • TSV: Through-Silicon Via
  • IC: Integrated Circuit
  • PCB: Printed Circuit Board
  • SOC: System-on-Chip
  • The SOC is also known as a system-on-chip.
  • An SOC is an integrated circuit built for a dedicated purpose; it contains a complete functional circuit system and includes all of the associated embedded software.
  • The SOC includes a number of different functional components, which are introduced later.
  • A stacked structure is one type of system packaging; system packaging can be divided into three types: adjacent structures, stacked structures, and buried structures.
  • The stacked structure increases packaging density in three dimensions and can be applied at different packaging levels, such as package-on-package (PoP), package-in-package (PiP), chip or die stacking, and chip-on-wafer stacking.
  • Chip-stack packaging is common in a variety of terminal products. Its advantage is that standard chips, wire bonding, and subsequent packaging can be realized with existing equipment and processes; its limitation is that the thickness of the entire package cannot be too large. At present, up to 8 dies can be stacked vertically in a package less than 1.2 mm thick, which requires each die in the stacked package to use a thin wafer, a thin substrate, a low wire-bond loop, a low mold cap height, and so on.
  • A wafer is the silicon slice used in the production of silicon semiconductor integrated circuits; because of its circular shape, it is called a wafer.
  • A bare chip (die) is an integrated-circuit product, processed and cut from the wafer, that contains various circuit element structures and has specific electrical functions. A die can be packaged into a chip.
  • GPU: Graphics Processing Unit
  • The GPU, also known as the display core, visual processor, or display chip, is a microprocessor that specializes in image and graphics computation on personal computers, workstations, game consoles, and some mobile devices (such as tablet computers and smartphones).
  • The GPU makes the device less dependent on the CPU and takes over part of the work originally done by the CPU, especially 3D graphics processing.
  • Core technologies used by the GPU include hardware geometry transform and lighting (T&L), cubic environment texture mapping and vertex blending, texture compression and bump mapping, and a dual-texture four-pixel 256-bit rendering engine; hardware geometry transform and lighting can be regarded as the hallmark of the GPU.
  • DSP: Digital Signal Processor
  • A DSP usually refers to a chip or processor that performs digital signal processing.
  • Digital signal processing technologies convert analog information (such as sound, video, and pictures) into digital information; they may also process this analog information and then output it as analog information again. The field also studies digital signal processing algorithms and their implementations based on digital signal processing theory, hardware technology, and software technology.
  • NPU: the Neural-network Processing Unit is a processor that runs neural network models. It can be regarded as a component (or subsystem) and is sometimes called an NPU coprocessor. It generally adopts a "data-driven parallel computing" architecture and is particularly good at processing massive multimedia data such as video and images.
  • PCB: Printed Circuit Board
  • SRAM: Static Random-Access Memory is a type of random access memory.
  • Here "static" means that the data stored in this kind of memory is retained constantly as long as power is supplied.
  • DRAM: Dynamic Random-Access Memory, in contrast, must be refreshed periodically to retain its data.
  • When power is removed, the data stored in SRAM still disappears (it is volatile memory), unlike ROM or flash memory, which retain data after power-off.
  • I/O: Input/Output
  • I/O usually refers to the input and output of data between internal equipment and external memory or other peripherals; it is the communication between an information processing system (such as a computer) and the outside world (perhaps a human, or another information processing system).
  • Input is the signal or data received by the system, and output is the signal or data sent from it.
  • The term can also describe an action: to "perform I/O" is to carry out an input or output operation.
  • The von Neumann architecture, also known as the Princeton architecture, is a memory organization that combines program instruction memory and data memory.
  • Program instruction addresses and data addresses point to different physical locations within the same memory, so program instructions and data have the same width.
  • The arithmetic unit is the smallest computational element of the arithmetic-logic hardware in the CPU; it is a hardware structure. Today, calculations are completed by logic circuits built from small components such as electronic circuits, which process high and low level signals, that is, binary signals.
  • FIG. 1 is a diagram of a traditional von Neumann chip architecture provided by an embodiment of the present application. As shown in FIG. 1, it includes an arithmetic unit, a controller, a memory, an input device, and an output device.
  • The traditional von Neumann chip architecture can hardly meet current requirements for chip computing power, power consumption, and chip storage bandwidth. Chip stacking technology can therefore alleviate some of the problems caused by limited chip area, such as insufficient chip computing power, excessive power consumption, and insufficient memory bandwidth.
  • taking the SOC chip of a mobile smart terminal as an example of chip stacking technology, the power transmission circuit, I/O, or radio frequency circuits in the SOC chip can each be split onto another chip, which decouples circuits with different functions from one another.
  • the embodiment of the application can also decouple the input/output (I/O) module or a dedicated processing unit from the main chip as an independent chip, so that the main chip no longer carries the decoupled functions; some functions then require the main chip and the independent chip to work at the same time, and the power consumption is not sufficiently optimized.
  • FIG. 2A is a schematic diagram of a data processing architecture provided by an embodiment of the present application.
  • the architecture shown in FIG. 2A mainly uses stacked chips as the main body and is described from the perspective of data processing.
  • the chip may be a packaged chip, or an unpackaged chip, that is, a bare chip.
  • the "first chip”, “second chip”, and “main chip” involved in this application and other chips in the chip structure used for stack packaging can all be understood as chips that are not packaged, that is, die ) Refers to integrated circuit products that contain various circuit element structures on the diced silicon wafer and have specific electrical functions. Therefore, the chips that need to be stacked and packaged in the embodiments of the present application are all bare chips that have not yet been packaged.
  • the chip system stacked in the smart terminal includes a first chip 001 and a second chip 002. The first chip 001 may be a system-on-chip SOC and includes a general-purpose processor 011, a bus 00, and at least one first dedicated processing unit (Domain Processing Unit, DPU) 021; the general-purpose processor 011 and the at least one first dedicated processing unit 021 are connected to the bus, where the at least one first dedicated processing unit 021 may be denoted in turn as DPU_1 to DPU_n. The second chip 002 includes a second dedicated processing unit 012 (which may be denoted as DPU_A). The first chip 001 and the second chip 002 are connected through an inter-chip interconnection line.
  • DPU Domain Processing Unit
  • the first chip 001 is used to process data and generate data processing tasks, and may also send startup information to the second chip 002.
  • the second chip 002 is configured to switch from a waiting state to a starting state when the startup information is received, and execute part or all of the processing tasks of the data processing task through the second dedicated processing unit.
  • the general-purpose processor 011 (for example, central processing unit, CPU) in the first chip 001 serves as the operation and control core of the chip system, and is the final execution unit for information processing and program operation.
  • the general-purpose processor 011 may generally be a reduced instruction set (Advanced RISC Machines, ARM) series processor, and may include one or more processing cores.
  • the general-purpose processor 011 may be used to generate data processing tasks for target data.
  • the general-purpose processor 011 may also, according to the data processing task, select one or more first dedicated processing units 021 from the at least one first dedicated processing unit 021 to process at least a part of the data processing task.
  • the general-purpose processor 011 may also be used to process simple data processing tasks without the assistance of at least one of the first dedicated processing unit 021 and the second dedicated processing unit 012.
  • At least one first dedicated processing unit 021 in the first chip 001 can process at least a part of the data processing task based on the calculation function, for example, the graphics processing unit GPU can perform the data processing task of image recognition.
  • the at least one first dedicated processing unit 021 may include at least one of a graphics processing unit GPU, an image signal processor ISP, a digital signal processor DSP, or a neural network processing unit NPU.
  • all the first dedicated processing units DPU in the at least one first dedicated processing unit 021 may work at the same time, or only one may work.
  • the bus 00 in the first chip 001 is also called an internal bus (Internal Bus), board-level bus (Board-Level Bus), or computer bus (Microcomputer Bus). It can be used to connect the various functional components in the chip into a complete chip system, and also to transmit data signals, control commands, and so on, assisting communication between the functional devices.
  • the general-purpose processor 011 and at least one first special-purpose processing unit 021 can be connected, so that the general-purpose processor 011 can control one or more of the at least one first special-purpose processing unit 021 to perform data processing tasks.
  • the second dedicated processing unit 012 in the second chip 002 has at least partially the same computing function as one or more first dedicated processing units 021 in the at least one first dedicated processing unit 021, and at least one of the one or more first dedicated processing units 021 and the second dedicated processing unit 012 can process at least a part of the data processing task based on that computing function.
  • that the second dedicated processing unit 012 has at least partially the same computing function as the one or more first dedicated processing units 021 may mean that the computing function of the second dedicated processing unit 012 and the computing functions of the one or more first dedicated processing units 021 are at least partially the same. The identical part is the function shared by the second dedicated processing unit 012 and the one or more first dedicated processing units 021, and for this shared function the data processing task may be allocated to the second dedicated processing unit 012 and the one or more first dedicated processing units 021 for processing; for example, the second dedicated processing unit 012 and the one or more first dedicated processing units 021 may share the computing power, which will be introduced in detail later.
  • the calculation functions of the second dedicated processing unit 012 and the one or more first dedicated processing units 021 may also be entirely the same.
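  • The computing-power sharing just described can be illustrated with a minimal scheduling sketch. The split policy and the TOPS figures below are illustrative assumptions, not values from this application; the idea is only that a shared computing function lets one task be divided between the two units in proportion to their computing power.

```python
# Hypothetical sketch of computing-power sharing between a first dedicated
# processing unit (DPU_1) and a second dedicated processing unit (DPU_A).
# The proportional split policy and the TOPS numbers are illustrative
# assumptions, not taken from the application.

def share_workload(total_ops, dpu1_tops, dpua_tops):
    """Split a task of `total_ops` operations in proportion to each
    unit's computing power, so both finish at roughly the same time."""
    total_tops = dpu1_tops + dpua_tops
    dpu1_share = total_ops * dpu1_tops // total_tops
    dpua_share = total_ops - dpu1_share  # remainder goes to DPU_A
    return dpu1_share, dpua_share

if __name__ == "__main__":
    # Assume DPU_1 offers 4 TOPS and DPU_A adds 8 TOPS of the same function.
    a, b = share_workload(1200, 4, 8)
    print(a, b)  # 400 800
```

Any other split policy (e.g. offloading only the overflow beyond DPU_1's capacity) would fit the text equally well; the proportional split is just the simplest balanced choice.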
  • the second dedicated processing unit 012 and the one or more first dedicated processing units 021 may use partly the same operation logic, operation mode, and/or operation purpose to perform the same data processing task.
  • the one or more first dedicated processing units 021 may include convolutional neural network computing capabilities, and the second dedicated processing unit 012 may also include at least part of the convolutional neural network computing capabilities, such as most of the capabilities.
  • when performing neural network operations, the one or more first dedicated processing units 021 can convert a segment of speech into several keywords through a speech recognition method, and the second dedicated processing unit 012 can convert a segment of speech into a string of text that includes those keywords.
  • the one or more first dedicated processing units 021 may include image processing capabilities of ISP to generate captured images.
  • the capabilities may include white balance, noise removal, pixel calibration, image sharpening, and gamma calibration.
  • the second dedicated processing unit 012 may also include most of the image operation processing capabilities of the aforementioned ISP for generating a captured image.
  • the capability may include white balance, noise removal, pixel calibration, or image sharpening.
  • when the second dedicated processing unit 012 in the second chip 002 has the same calculation function as one or more first dedicated processing units 021 in the at least one first dedicated processing unit 021, it can be used to perform all the processing tasks of the acquired data processing task.
  • that the second dedicated processing unit 012 and the one or more first dedicated processing units 021 have the same computing function may include: the second dedicated processing unit 012 and the one or more first dedicated processing units 021 having exactly the same operation logic and/or operation method for performing data processing tasks.
  • the second dedicated processing unit 012 may also include a unit implementing a parallel matrix operation array, and its algorithm types may also be consistent with the algorithm types used by the first dedicated processing unit 021.
  • the inter-chip interconnection line includes any one of TSV interconnection lines and wire bonding interconnection lines.
  • TSV, as a through-hole interconnection technology between chips, has small apertures, low latency, and flexible configuration of inter-chip data bandwidth, which improves the overall computing efficiency of the chip system. TSV through-silicon-via technology can also achieve a bump-less bonding structure that integrates adjacent chips of different properties. For stacked chips, wire-bonding interconnection shortens the interconnect lines between chips and effectively improves the working efficiency of the device itself.
  • the connection signals of the inter-chip interconnection line may include a data signal and a control signal, where the data signal may be used to transmit the target data and the control signal may be used to distribute the data processing tasks of the target data; this is not specifically limited in this application.
  • the data processing architecture in FIG. 2A is only an exemplary implementation in the embodiment of the present application, and the data processing architecture in the embodiment of the present application includes but is not limited to the above data processing architecture.
  • the first chip 001 may also be stacked and packaged with a plurality of second chips 002 to form a chip, and the second dedicated processing units included in the plurality of second chips 002 each have at least partly the same calculation function as at least one first dedicated processing unit 021 included in the first chip 001.
  • the stacked chip system can be configured in different data processing devices, and different data processing devices correspond to different main control forms. The embodiment of the present application does not limit the main control form; examples include servers, notebook computers, smart phones, in-vehicle televisions, and so on.
  • FIG. 2B is a schematic diagram of a stacked package chip architecture provided by an embodiment of the present application. As shown in FIG. 2B, the chips are connected to each other through inter-chip interconnection lines, with the SOC chip serving as the main chip; the architecture is described from the perspective of data processing.
  • the chip architecture shown in FIG. 2B includes: a first chip 001 and a second chip 002. The first chip 001 includes a general-purpose processor 011 (such as a CPU, or optionally a microcontroller), at least one first dedicated processing unit 021 (DPU_1, DPU_2 ... DPU_n), and a bus 00, and also includes: a memory 031, an analog module 041, an input/output module 051, and so on. The second chip 002 is a computing-power stacking chip connected to the first dedicated processing unit DPU_1, and includes a second dedicated processing unit 012 (i.e., DPU_A).
  • the general-purpose processor 011 in the first chip 001 may be a CPU for generating data processing tasks.
  • the general-purpose processor 011 in the first chip 001 is also used to allocate data processing tasks to one or more first dedicated processing units in the at least one first dedicated processing unit 021, and/or to the second dedicated processing unit 012 in the second chip 002.
  • the general-purpose processor 011 in the first chip 001 can flexibly allocate data processing tasks to the first dedicated processing unit and the second dedicated processing unit according to the needs of the data task, so as to meet users' ever-higher computing needs. For example, when the amount of data processing tasks is small, the tasks can be allocated to the first dedicated processing unit alone; when the amount is large, they can be allocated to the first dedicated processing unit and the second dedicated processing unit at the same time, or allocated to the second dedicated processing unit alone, and so on. This provides flexibility in task processing without increasing the volume of the product.
  • in one implementation, the inter-chip interconnection line is connected between the second dedicated processing unit and the bus, and the general-purpose processor 011 in the first chip 001 can also be used to send the startup information to the second dedicated processing unit through the inter-chip interconnection line, so that the second dedicated processing unit changes from the waiting state to the startup state in response to the startup information and processes at least part of the data processing task based on its calculation function.
  • the power consumption of the waiting state is lower than that of the startup state. Therefore, when the general-purpose processor does not send the startup information to the second dedicated processing unit, the second dedicated processing unit remains in the waiting state, so that the power consumption of the stacked chips can be effectively controlled.
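  • The waiting/startup behaviour described above can be modelled as a small state machine. In the sketch below the state names and the power figures (in milliwatts) are illustrative assumptions; the application only requires that waiting-state power be lower than startup-state power.

```python
# Minimal model of the second dedicated processing unit's power states.
# State names and the power figures are illustrative assumptions; the
# application only requires that waiting power < startup power.

WAITING, STARTED = "waiting", "started"
POWER_MW = {WAITING: 5, STARTED: 500}   # assumed figures, milliwatts

class SecondDPU:
    def __init__(self):
        self.state = WAITING            # default: low-power waiting state

    def on_startup_info(self):
        self.state = STARTED            # startup information wakes the unit

    def on_task_done(self):
        self.state = WAITING            # return to waiting to save power

    @property
    def power_mw(self):
        return POWER_MW[self.state]

dpu_a = SecondDPU()
assert dpu_a.power_mw < POWER_MW[STARTED]   # idle power stays low
dpu_a.on_startup_info()
print(dpu_a.state)  # started
```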
  • in another implementation, the inter-chip interconnection line is connected between the second dedicated processing unit and the bus, and the general-purpose processor 011 in the first chip 001 may also be used to send the startup information to the second dedicated processing unit through the inter-chip interconnection line when the computing power of the one or more first dedicated processing units 021 does not meet the demand.
  • the second chip 002 in the stacked chip can receive the startup information sent by the general-purpose processor 011 and, according to the startup information, switch from the waiting state to the startup state and assist the one or more first dedicated processing units 021 of the first chip 001 in performing data processing tasks. This enhances or supplements the computing power of the one or more first dedicated processing units 021, preventing the chip from being unable to complete the data processing task of the target data because that computing power is insufficient.
  • based on the computing power of the one or more first dedicated processing units 021, it is predicted whether they can complete the data processing task within a preset time. If the one or more first dedicated processing units 021 cannot complete the data processing task within the preset time, it is determined that their computing power is insufficient for the data processing task.
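  • The prediction step can be sketched as a simple deadline check. The function name and the units below are hypothetical; the application does not specify how the prediction is computed, only that failing the preset-time check marks the computing power as insufficient.

```python
# Sketch of the preset-time prediction described above. Names and units
# (operations, operations per millisecond, milliseconds) are assumptions;
# a real scheduler would also account for memory bandwidth, data
# transfer time, and so on.

def needs_assist(task_ops, dpu1_ops_per_ms, preset_time_ms):
    """Predict whether DPU_1 can finish `task_ops` operations within the
    preset time; if not, the second dedicated processing unit is started."""
    predicted_ms = task_ops / dpu1_ops_per_ms
    return predicted_ms > preset_time_ms

# 10,000,000 ops at 1000 ops/ms would take 10,000 ms, missing a 16 ms
# deadline, so the task would be offloaded to DPU_A.
print(needs_assist(10_000_000, 1000, 16))  # True
```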
  • the at least one first dedicated processing unit 021 in the first chip 001 is denoted from left to right as DPU_1, DPU_2 ... DPU_n, where one or more of the first dedicated processing units 021 are used to acquire data processing tasks and execute them based on the corresponding calculation functions.
  • the at least one first dedicated processing unit 021 may include one or more of a graphics processing unit GPU, an image signal processor ISP, a digital signal processor DSP, and a neural network processing unit NPU.
  • GPU and ISP can be used to process graphics data in smart terminals
  • DSP can be used to process digital signal data executed in smart terminals
  • NPU can be used to process massive multimedia data such as videos and images in smart terminals. Therefore, the at least one first dedicated processing unit 021 can be used to perform different data processing tasks through different computing functions, so that the smart terminal can adapt to more and more data processing requirements of users.
  • the DPU_1 in the first chip 001 in the embodiment of the present application is used to perform data processing tasks.
  • in one implementation, the inter-chip interconnection line is connected between the one or more first dedicated processing units and the second dedicated processing unit. The one or more first dedicated processing units 021 in the first chip 001 are also configured, when executing the data processing task, to send startup information to the second chip 002 and to allocate part or all of the processing tasks of the data processing task to the second dedicated processing unit 012 of the second chip 002, so that the second dedicated processing unit 012 responds to the startup information, transitions from the waiting state to the startup state, and processes at least part of the data processing task based on the calculation function that is at least partially the same as that of the one or more first dedicated processing units 021.
  • the power consumption of the waiting state is lower than that of the startup state. Therefore, when the one or more first dedicated processing units 021 do not send startup information to the second dedicated processing unit 012, the second dedicated processing unit 012 remains in the waiting state, which effectively controls the power consumption of the stacked chips.
  • in another implementation, the inter-chip interconnection line is connected between the one or more first dedicated processing units and the second dedicated processing unit; the one or more first dedicated processing units 021 in the first chip 001 are configured to send startup information to the second dedicated processing unit 012 through the inter-chip interconnection line when the computing power of the one or more first dedicated processing units 021 does not meet the demand.
  • the second chip 002 in the stacked chip can receive the startup information sent by the one or more first dedicated processing units 021 and, according to the startup information, switch from the waiting state to the startup state to assist the one or more first dedicated processing units 021 of the first chip 001 in performing data processing tasks. This enhances or supplements their computing power, preventing the chip from being unable to complete the data processing task of the target data because of insufficient computing power, and thus meets the user's computing needs.
  • the bus 00 in the first chip 001 is used to connect the general-purpose processor 011 and the at least one first dedicated processing unit 021.
  • the memory 031 in the first chip 001 is used to store target data and data processing tasks corresponding to the target data.
  • the target data type may include graphics data, video data, audio data, text data, and so on.
  • the analog module 041 in the first chip 001 mainly implements analog processing functions, such as the radio frequency front end, the port physical layer (Physical, PHY), and so on.
  • the input/output module 051 in the first chip 001 is a universal interface of the SOC chip to external devices, and is used for data input and output. Generally, it includes a physical layer (Physical, PHY) of a controller and a port, such as a universal serial bus (Universal Serial Bus, USB) interface, a mobile industry processor interface (Mobile Industry Processor Interface, MIPI), and so on.
  • a physical layer Physical, PHY
  • USB Universal Serial Bus
  • MIPI Mobile Industry Processor Interface
  • the second dedicated processing unit 012 in the second chip 002 has at least partly the same calculation function as one or more of the first dedicated processing units 021 in the at least one first dedicated processing unit 021, and can be used to execute part or all of the processing tasks of the data processing tasks allocated by the general-purpose processor 011 or by the one or more first dedicated processing units 021.
  • the core of the NPU is its parallel matrix operation array unit.
  • the second dedicated processing unit 012 is also used to implement parallel matrix operation array units, and its algorithm types are also consistent with the algorithm types used by the NPU, such as Int8, Int16, F16, and so on.
  • the second dedicated processing unit 012 may also differ from the NPU; that is, the second dedicated processing unit 012 and the first dedicated processing unit 021 implement the same calculation function but have different computing power.
  • the second dedicated processing unit 012 in the second chip 002 has the same calculation function as one or more first dedicated processing units 021 in the at least one first dedicated processing unit 021; therefore, the second dedicated processing unit 012 can process the same data processing tasks as the one or more first dedicated processing units 021. For example, when one or more first dedicated processing units 021 perform a data processing task, the second dedicated processing unit 012 can assist them in jointly performing the task, so as to meet users' ever-higher computing needs more efficiently.
  • the second dedicated processing unit in the second chip 002 is configured to respond to the startup information, switch from the waiting state to the startup state, and process at least part of the data processing task based on the calculation function.
  • the second chip 002 in the stacked chip system can receive the startup information sent by the one or more first dedicated processing units 021 and, according to the startup information, switch from the waiting mode to the working mode and assist the one or more first dedicated processing units 021 of the first chip 001 in performing data processing tasks, enhancing or supplementing their computing power and preventing the chip from being unable to complete the data processing task of the target data because of insufficient computing power. Moreover, when no startup information is sent to the second dedicated processing unit, it remains in the waiting state, which effectively controls the power consumption of the chip system.
  • the second dedicated processing unit 012 includes arithmetic units corresponding to those in the one or more first dedicated processing units 021 that have at least partly the same computing function, and these corresponding arithmetic units are respectively used to process the target data through their arithmetic logic.
  • the second dedicated processing unit 012 and the neural network processing unit NPU have at least partly the same calculation function. Because the calculation units of the neural network processing unit NPU include a Matrix Unit, a Vector Unit, and a Scalar Unit, the second dedicated processing unit 012 in the second chip 002 may also include one or more of the Matrix Unit, Vector Unit, and Scalar Unit, so as to perform matrix multiplication, vector operations, scalar operations, and so on for the data processing tasks assigned to the second chip 002.
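  • The three kinds of arithmetic units named above can be illustrated with plain-Python functional stand-ins. The real Matrix, Vector, and Scalar Units are parallel hardware arrays; this sketch only mirrors their mathematical behaviour.

```python
# Functional stand-ins for the NPU-style arithmetic units described above.
# These are behavioural sketches only; the actual units are parallel
# hardware arrays, not sequential Python loops.

def matrix_unit(a, b):
    """Matrix multiplication: C = A x B (lists of rows)."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

def vector_unit(x, y):
    """Element-wise vector addition."""
    return [xi + yi for xi, yi in zip(x, y)]

def scalar_unit(x, s):
    """Scalar operation: multiply every element by the scalar s."""
    return [xi * s for xi in x]

print(matrix_unit([[1, 2]], [[3], [4]]))  # [[11]]
```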
  • the inter-chip interconnection line between the first chip 001 and the second chip 002 is a TSV. Thanks to the TSV process, the aperture of a single silicon via can be as small as about 10 um, so the number of TSV interconnect lines between the first chip 001 and the second chip 002 can be determined according to need without occupying too much area; this is not specifically limited in this application.
  • the first chip 001 in FIG. 2B may include multiple first dedicated processing units 021, and each first dedicated processing unit 021 can receive data processing tasks sent by the general-purpose processor 011 and process the corresponding type of data processing task.
  • the first chip 001 usually includes a graphics processing unit GPU, an image signal processor ISP, a digital signal processor DSP, and a neural network processing unit NPU.
  • the graphics processing unit GPU and image signal processor ISP can be used to process graphics data in smart terminals; digital signal processor DSP can be used to process digital signal data executed in smart terminals; neural network processing unit NPU can be used to process smart terminals Massive multimedia data such as videos and images in the terminal.
  • the second dedicated processing unit 012 in the second chip 002 may include one or more arithmetic units, and the one or more arithmetic units are used to process data through corresponding arithmetic logics, respectively.
  • the second dedicated processing unit 012 and the NPU have at least partly the same computing functions. Because the core computing units of the NPU are the Matrix Unit, the Vector Unit, and the Scalar Unit, the second dedicated processing unit 012 in the second chip 002 may also include one or more of the Matrix Unit, Vector Unit, and Scalar Unit, which are used to perform the data processing tasks allocated to the second chip 002, such as matrix multiplication, vector operations, or scalar operations on data.
  • the second dedicated processing unit has at least partly the same computing functions as one or more first dedicated processing units among the at least one first dedicated processing unit, so the second dedicated processing unit can process at least partly the same data processing tasks as the one or more first dedicated processing units.
  • the data processing device can flexibly allocate data processing tasks to the first dedicated processing unit and the second dedicated processing unit according to the needs of the data task, to meet users' ever-higher computing needs. For example, when the amount of data processing tasks is small, the tasks can be allocated to the first dedicated processing unit alone; when the amount is large, they can be allocated to the first dedicated processing unit and the second dedicated processing unit at the same time, or allocated to the second dedicated processing unit alone, and so on.
  • stacking the second chip on the first chip can meet the volume requirements of the product without significantly increasing the volume of the chip architecture.
  • when only one chip is working, the other chip can be in a low-power-consumption state, so the power consumption of the stacked chips can be effectively controlled while performing data processing tasks. Therefore, the data processing device can meet users' ever-higher power-consumption and computing requirements without increasing the volume of the product, while improving product performance and providing flexibility in task processing.
  • Figure 2C is a schematic diagram of a stacked packaged chip architecture in practical applications provided by an embodiment of the present application
  • Figure 2D is an embodiment of the present application.
  • the stacked package chip structure provided in FIG. 2C is used to support and execute steps 1 to 7 of the following method flow.
  • the second dedicated processing unit DPU_A included in the second chip 002 in the stacked package chip architecture is connected to the first dedicated processing unit DPU_1 through a TSV interconnection line.
  • the second dedicated processing unit DPU_A may include a memory (Buffer) 0121, a task scheduler (Task Scheduler) 0122, and an AI algorithm operation module 0123 whose task sequence and data are directly scheduled by the first dedicated processing unit DPU_1 (the AI algorithm operation module 0123 includes: Matrix Unit 01, Vector Unit 02, and Scalar Unit 03).
  • the first dedicated processing unit DPU_1 may also include: a memory (Buffer) 0211 and a task scheduler (Task Scheduler) 0212; for their functions, refer to the relevant description of the corresponding functional modules of DPU_A.
  • DPU_A is directly scheduled by DPU_1, so the signals with which the two devices directly interconnect through the TSV silicon vias include at least two types, data signals (Data Signals) and control signals (Control Signals), and the bit widths of the two differ.
  • the number of data signal bits is the number of bits required to compute data in parallel, generally at least a multiple of 64 bits, such as 64b, 128b, ... 1024b, and so on.
  • control signals are generally single-bit signals such as enable signals, start-stop control signals, and interrupt signals.
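  • The two TSV signal classes can be sketched as follows. The names below are illustrative; the only constraints taken from the text are that data-signal widths are multiples of 64 bits and that control signals (enable, start-stop, interrupt) are single bits.

```python
# Sketch of the TSV interconnect signal classes described above.
# Field names are illustrative assumptions; the text states only that
# data widths are multiples of 64 bits and control signals are single bits.

DATA_WIDTHS = [64 * k for k in (1, 2, 4, 8, 16)]   # 64b, 128b, ... 1024b

CONTROL_SIGNALS = {          # each control line carries a single bit
    "enable": 0,
    "start_stop": 0,
    "interrupt": 0,
}

def valid_data_width(bits):
    """A data bus width is valid if it is a positive multiple of 64."""
    return bits > 0 and bits % 64 == 0

print(valid_data_width(128), valid_data_width(100))  # True False
```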
  • DPU_1 transfers the data to be processed from the memory 031 to the Buffer 0211 of the DPU_1 unit, and then determines, according to the computing requirements, whether to send it to DPU_A for assisted computation.
  • the data processing method may include:
  • 1. After DPU_1 receives the data processing task issued by the CPU 011 through Task Scheduler 0212, it transfers the target data from the memory to the temporary buffer Buffer 0211 inside DPU_1;
  • 2. DPU_1 uses Task Scheduler 0212 to predict whether its computing power is sufficient for the data processing task; if the computing power of DPU_1 cannot meet the needs of the data processing task, Task Scheduler 0212 sends the startup information to DPU_A through the control signals of the TSV;
  • 3. After receiving the startup information, DPU_A wakes up from the low-power state, enters the startup state, and feeds back a ready signal to DPU_1 through the TSV control signals;
  • 4. Task Scheduler 0212 of DPU_1 delivers the allocated data processing tasks to Task Scheduler 0122 of DPU_A, and transfers the data that needs to be processed by DPU_A to Buffer 0121 of the DPU_A unit;
  • 5. Task Scheduler 0122 of DPU_A starts the AI algorithm operation module 0123 (including Matrix Unit 01, Vector Unit 02, Scalar Unit 03, etc.) according to the data processing task, and the AI algorithm operation module 0123 reads the data in Buffer 0121 and starts to execute the data processing task;
  • 6. The AI algorithm operation module 0123 stores the computed data in Buffer 0121 of DPU_A;
  • 7. Task Scheduler 0122 of DPU_A sends a processing-completion signal back to Task Scheduler 0212 of DPU_1, and the data is written from Buffer 0121 of DPU_A back to Buffer 0211 inside DPU_1.
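  • The steps above can be condensed into a behavioural sketch. Buffers are modelled as plain lists and the AI algorithm operation module is a stand-in that doubles each value; only the order of the hand-offs follows the text, and all method names are illustrative assumptions.

```python
# Behavioural sketch of the DPU_1 -> DPU_A offload flow described above.
# Buffers are plain lists; the "AI algorithm operation" is a stand-in that
# doubles each value. Only the hand-off order mirrors the method flow.

class DPU_A:
    def __init__(self):
        self.state = "waiting"
        self.buffer = []                      # Buffer 0121

    def startup(self):                        # wake from low-power state
        self.state = "started"

    def run_task(self, data):                 # run module, store result
        self.buffer = [2 * x for x in data]   # AI algorithm module stand-in
        return self.buffer                    # written back toward DPU_1

class DPU_1:
    def __init__(self, assist):
        self.buffer = []                      # Buffer 0211
        self.assist = assist                  # the stacked DPU_A

    def handle(self, target_data, enough_power):
        self.buffer = list(target_data)       # memory -> Buffer 0211
        if not enough_power:                  # predicted shortfall
            self.assist.startup()             # startup info over TSV
            self.buffer = self.assist.run_task(self.buffer)
        return self.buffer

dpu_a = DPU_A()
dpu_1 = DPU_1(dpu_a)
print(dpu_1.handle([1, 2, 3], enough_power=False))  # [2, 4, 6]
```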
  • DPU_1 is a dedicated processing unit for AI processing.
  • the core computing part generally includes a parallel Matrix Unit 01, Vector Unit 02, Scalar Unit 03, etc., as well as an internal temporary data storage Buffer 0211 and a Task Scheduler 0212.
  • DPU_A includes Matrix Unit 01, optionally including Vector Unit 02 and Scalar Unit 03, and DPU_A can also include Buffer 0121 and Task Scheduler 0122 as needed.
  • the computing core of DPU_A can include Matrix Unit 01, Vector Unit 02, and Scalar Unit 03, the number of operators or MAC numbers in each unit can be different from that of DPU_1.
  • the DPU_A of the second chip 002 is positioned as a computing power enhancement module, which is directly scheduled and controlled by DPU_1 in the first chip 001.
  • When DPU_1's own computing power is sufficient to meet the demand, the DPU_A of the second chip 002 is in a waiting state to save the overall power consumption of the chip system; when some scenarios, such as video processing, require high AI computing power or assisted processing, DPU_1 can activate DPU_A through the TSV control signals so that they participate in the calculation processing together.
  • An AI-enhanced arithmetic unit is taken as an example in the embodiments of this application.
  • This application is not limited to this scenario: other first dedicated processing units (such as DPU_1...DPU_n) can likewise be connected to second dedicated processing units such as DPU_A.
  • In this way the computing power of the GPU, ISP, etc. can also be enhanced, which is not specifically limited in the embodiments of the present application.
  • FIG. 2E is a schematic diagram of another stacked package chip architecture in actual application provided by an embodiment of the present application
  • FIG. 2F is a schematic diagram of the second dedicated chip in the above-mentioned stacked package shown in FIG. 2E.
  • the stacked package chip architecture provided in FIG. 2E is used to support and execute steps 1 to 6 of the following method flow.
  • The stacked chip system includes a first chip 001 and a second chip 002, and the second chip 002 in the stacked-package chip structure is connected to the bus 00 in the first chip 001 through a TSV interconnection line. That is, the inter-chip interconnection line is connected between the second dedicated processing unit 012 and the bus 00, where the second dedicated processing unit DPU_A may include a memory (Buffer) 0121, a task scheduler (Task Scheduler) 0122, and the AI algorithm operation module 0123, whose task sequences and data can be directly scheduled by the first dedicated processing unit DPU_1 (the AI algorithm operation module 0123 includes: Matrix Unit 01, Vector Unit 02, and Scalar Unit 03).
  • The first dedicated processing unit DPU_1 may also include a memory (Buffer) 0211 and a task scheduler (Task Scheduler) 0212; for the corresponding functions, refer to the relevant descriptions of these functional modules of the first dedicated processing unit DPU_1 above.
  • Since DPU_A is directly controlled and scheduled by the CPU 011 of the first chip 001 and DPU_A is directly connected to the bus 00 of the first chip 001, the TSV interconnection signals between DPU_A and the first chip 001 generally use standard protocol buses: the Advanced eXtensible Interface (AXI) bus and the Advanced Peripheral Bus (APB).
  • The AXI bus is used to read and write data signals;
  • the APB bus is used for control signal configuration.
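The division of labor just described (bulk data over the AXI-style path, configuration over the APB-style path) can be illustrated with a toy software model; the register addresses, class names, and the "enable bit" below are invented for illustration and are not taken from the AXI/APB specifications or from the embodiment.

```python
# Toy model of the bus split described above: single-word configuration
# writes travel over the APB-like path, bulk data over the AXI-like path.

class ApbRegs:
    """Low-speed configuration registers (APB-style single accesses)."""
    def __init__(self):
        self.regs = {}

    def write(self, addr, value):
        self.regs[addr] = value      # one register per access

class AxiMem:
    """High-bandwidth data port (AXI-style burst transfers)."""
    def __init__(self, size):
        self.mem = [0] * size

    def burst_write(self, base, data):
        self.mem[base:base + len(data)] = data

    def burst_read(self, base, length):
        return self.mem[base:base + length]

apb = ApbRegs()
axi = AxiMem(size=16)

apb.write(0x00, 1)                 # e.g. a hypothetical enable bit for DPU_A
axi.burst_write(0, [5, 6, 7, 8])   # target data toward Buffer 0121
print(axi.burst_read(0, 4))        # [5, 6, 7, 8]
```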
  • the data processing method includes:
  • The CPU 011, through the bus 00, sends the startup information for the DPU_A of the second chip 002 to participate in the computation alongside DPU_1, and configures at least a part of the data processing tasks in Task Scheduler 0122 of DPU_A;
  • After DPU_A receives the startup information, it transitions from the waiting state to the startup state, that is, it is awakened from the low-power state; Task Scheduler 0122 of DPU_A receives the data processing task issued by the CPU 011 and transfers data from the memory 031 in the chip, through the bus 00, to Buffer 0121 inside DPU_A;
  • Task Scheduler 0122 of DPU_A starts the AI algorithm operation module 0123 (including Matrix Unit 01, Vector Unit 02, and Scalar Unit 03) according to the data processing task, and the AI algorithm operation module 0123 reads the data in Buffer 0121 and starts to execute the data processing task;
  • The AI algorithm operation module 0123 stores the calculated data in Buffer 0121 of DPU_A;
  • DPU_A writes the processed data from Buffer 0121 to the internal memory 031 of the chip via the bus 00;
  • Task Scheduler 0122 of DPU_A sends a processing completion signal back to the master control CPU 011 of the first chip 001, completing the calculation of this data processing task.
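Under direct CPU control, DPU_A behaves like a small state machine driven by the steps above. The state and command names in this sketch are assumptions for illustration; only the waiting/startup distinction and the wake-run-complete ordering come from the description.

```python
# Minimal state machine for DPU_A under direct CPU control.
# "waiting" models the low-power wait state, "active" the startup state.

TRANSITIONS = {
    ("waiting",   "startup"):  "active",     # CPU sends startup info over TSV
    ("active",    "run"):      "computing",  # scheduler starts the AI module
    ("computing", "done"):     "active",     # results stored, completion signaled
    ("active",    "shutdown"): "waiting",    # return to the low-power state
}

def drive(state, commands):
    """Apply a sequence of commands and return the final state."""
    for cmd in commands:
        state = TRANSITIONS[(state, cmd)]
    return state

# The CPU wakes DPU_A, runs one task, and returns it to the wait state.
final = drive("waiting", ["startup", "run", "done", "shutdown"])
print(final)   # waiting
```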
  • The second chip 002 is connected to the bus in the first chip 001 through the TSV interconnection line; that is, under the control of the CPU 011, the data processing task of the target data can be executed by DPU_1 alone, by DPU_1 and DPU_A in parallel, or by DPU_A alone.
  • the second dedicated processing unit DPU_A of the second chip 002 is a computing power enhancement module, which is directly scheduled and controlled by the CPU 011 in the first chip 001.
  • When the computing power of DPU_1 in the first chip 001 is sufficient to meet the demand, the DPU_A of the second chip 002 can be in the waiting state; alternatively, only the DPU_A of the second chip 002 can be in the activated state.
  • The CPU can activate the DPU_A computing unit through the TSV control line to participate in high-computing-power processing together.
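The three execution modes above (DPU_1 alone, DPU_1 and DPU_A in parallel, DPU_A alone) amount to a dispatch decision. The function name, the capacity threshold, and the `dpu1_busy` flag below are invented for this sketch; the embodiment does not prescribe a particular selection policy.

```python
# Illustrative mode selection for the three execution options described above.

def choose_mode(task_load, dpu1_capacity, dpu1_busy=False):
    """Pick which unit(s) execute the task, in abstract load units."""
    if dpu1_busy:
        return "DPU_A alone"            # DPU_1 occupied: run on DPU_A only
    if task_load <= dpu1_capacity:
        return "DPU_1 alone"            # fits in DPU_1: no need to wake DPU_A
    return "DPU_1 + DPU_A in parallel"  # split the task across both units

print(choose_mode(10, 16))                  # DPU_1 alone
print(choose_mode(32, 16))                  # DPU_1 + DPU_A in parallel
print(choose_mode(10, 16, dpu1_busy=True))  # DPU_A alone
```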
  • The first chip may also be stacked and packaged with a plurality of second chips to form a chip system.
  • The second dedicated processing unit included in each second chip may have the same or partially the same computing function as, and be matched with, one or more first dedicated processing units in the at least one first dedicated processing unit.
  • The first chip may also be stacked and packaged with the second chip and a third chip respectively, where the third chip includes one or more of a memory, a power transmission circuit module, an input/output circuit module, or an analog module. Please refer to FIG. 2G.
  • FIG. 2G is a schematic diagram of another stacked package chip architecture provided by an embodiment of the present application.
  • The stacked memory 003 (Stacked Memory) can be stacked to further bring the storage closer to the computing unit, improving the data processing bandwidth and efficiency of the overall system and alleviating the insufficient storage bandwidth of the stacked chips.
  • The power transmission circuit module, the input/output circuit module, or the analog module can be stacked to achieve separation and decoupling of the analog and logic computing parts of the chip.
  • FIGS. 2C and 2E are only exemplary implementations in the embodiments of the present application, and the stacked-package chip structures in the embodiments of the present application include, but are not limited to, the above stacked-package chip architectures.
  • The stacked and packaged chips can be configured in different data processing devices, and different data processing devices correspond to different main control forms.
  • The embodiments of the present application do not limit the main control form, which may be, for example, a server, a notebook computer, a smart phone, an in-vehicle TV, etc.
  • FIG. 3 is a schematic flowchart of a data processing method provided by an embodiment of the present application.
  • The method can be applied to the data processing device structure described above, where the data processing device may include the stacked-package chip architecture provided in FIG. 2B, and is used to support and execute the method flow steps S301 to S302 shown in FIG. 3.
  • Step S301: Generate a data processing task through the general-purpose processor in the first chip.
  • the data processing device generates data processing tasks through the general-purpose processor.
  • The first chip includes the general-purpose processor, a bus, and at least one first dedicated processing unit DPU; the general-purpose processor and the at least one first dedicated processing unit are connected to the bus, and the first chip and the second chip are stacked and packaged into one chip structure.
  • the general-purpose processor includes a central processing unit CPU.
  • Each of the one or more first dedicated processing units and the second dedicated processing unit includes at least one of a graphics processing unit GPU, an image signal processor ISP, a digital signal processor DSP, or a neural network processing unit NPU.
  • The graphics processing unit GPU can be used to process graphics data in smart terminals;
  • the digital signal processor DSP can be used to process digital signal data in smart terminals;
  • the neural network processing unit NPU can be used to process massive multimedia data, such as video and image data, in smart terminals. Therefore, the at least one first dedicated processing unit DPU can be used to execute data processing tasks of different data types, so that the smart terminal can adapt to more and more data processing requirements of users.
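The per-type routing just described (graphics to the GPU, image signals to the ISP, digital signals to the DSP, multimedia/AI workloads to the NPU) can be sketched as a simple lookup; the mapping keys, the fallback to the CPU, and the function name are simplified assumptions, not an enumeration from the embodiment.

```python
# Illustrative routing of data processing tasks to dedicated processing units.

DPU_FOR_TYPE = {
    "graphics": "GPU",          # graphics data in smart terminals
    "image_signal": "ISP",      # camera / image-signal pipelines
    "digital_signal": "DSP",    # digital signal data
    "neural_network": "NPU",    # massive multimedia / AI data
}

def route(task_type):
    """Return the dedicated unit for a task type, falling back to the CPU."""
    return DPU_FOR_TYPE.get(task_type, "CPU")

print(route("graphics"))        # GPU
print(route("neural_network"))  # NPU
print(route("unknown"))         # CPU
```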
  • Step S302: Process at least a part of the data processing task through at least one of the one or more first dedicated processing units in the at least one first dedicated processing unit and the second dedicated processing unit in the second chip.
  • Specifically, the data processing device processes at least a part of the data processing task through at least one of the one or more first dedicated processing units in the at least one first dedicated processing unit and the second dedicated processing unit in the second chip.
  • The second dedicated processing unit has the same computing function as one or more first dedicated processing units in the at least one first dedicated processing unit, and the first chip and the second chip are stacked and packaged and connected to each other through inter-chip interconnection lines.
  • Processing at least a part of the data processing task through at least one of the one or more first dedicated processing units and the second dedicated processing unit includes: sending, by the general-purpose processor, startup information to the second dedicated processing unit via the inter-chip interconnection line; and, by the second dedicated processing unit in response to the startup information, transitioning from the waiting state to the startup state and processing at least a part of the data processing task based on the computing function.
  • The sending of the startup information to the second dedicated processing unit by the general-purpose processor via the inter-chip interconnection line includes: sending, by the general-purpose processor, the startup information to the second dedicated processing unit via the inter-chip interconnection line when the computing power of the one or more first dedicated processing units does not meet the demand.
  • Processing at least a part of the data processing task through at least one of the one or more first dedicated processing units and the second dedicated processing unit may alternatively include: sending, by the one or more first dedicated processing units, startup information to the second dedicated processing unit via the inter-chip interconnection line, where the second dedicated processing unit is configured to, in response to the startup information, transition from the waiting state to the startup state and process at least a part of the data processing task based on the computing function.
  • The sending of the startup information to the second dedicated processing unit by the one or more first dedicated processing units via the inter-chip interconnection line includes: sending, by the one or more first dedicated processing units, the startup information to the second dedicated processing unit via the inter-chip interconnection line when the computing power of the one or more first dedicated processing units does not meet the demand.
  • For step S301 to step S302 in the embodiments of the present application, reference can also be made to the relevant descriptions of the above-mentioned embodiments of FIG. 2A to FIG. 2G, which will not be repeated here.
  • the stacked chip architecture in the data processing device can flexibly allocate data processing tasks to the first dedicated processing unit and the second dedicated processing unit, respectively.
  • When the first dedicated processing unit is processing data processing tasks alone, the second dedicated processing unit of the stacked second chip will be in a waiting state, and the second dedicated processing unit will switch from the waiting state to the startup state only when it receives the startup information. Since the power consumption of the waiting state is lower than that of the startup state, the overall power consumption control of the stacked chip architecture is more flexible and efficient, and the energy efficiency of the chip system is improved.
  • Without increasing the overall volume, the stacked chip architecture can also have the second dedicated processing unit assist in processing data processing tasks, thereby enhancing the computing power of the first dedicated processing unit to a greater extent. This alleviates the demand that the rapid development of current algorithms places on chip computing power, and prevents the chip from being unable to complete the data processing task of the target data due to insufficient computing power of the first dedicated processing unit. Furthermore, when Moore's Law slows down and the terminal chip area is limited, the stacked chip architecture can continue the evolution of chips through vertical computing power stacking (i.e., stacking the first chip and the second chip), meeting the ever-higher computing power demands of business scenarios.
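The power-saving argument above can be made concrete with a back-of-the-envelope comparison: waking the second dedicated processing unit only on demand costs less energy than keeping it always active. The power figures below are arbitrary illustrative units, not values from the embodiment.

```python
# Rough illustration of the power argument: on-demand wake-up vs. always-on.

P_WAIT, P_ACTIVE = 1, 20   # hypothetical power draw per time slot

def energy(active_slots, total_slots):
    """Total energy when the unit is active for `active_slots` of the time."""
    return active_slots * P_ACTIVE + (total_slots - active_slots) * P_WAIT

on_demand = energy(active_slots=10, total_slots=100)   # woken 10% of the time
always_on = energy(active_slots=100, total_slots=100)
print(on_demand, always_on)   # 290 2000
print(on_demand < always_on)  # True
```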
  • the disclosed device may be implemented in other ways.
  • The device embodiments described above are only illustrative; for example, the division of the above-mentioned units is only a logical function division, and there may be other divisions in actual implementation. For example, multiple units or components can be combined or integrated into another system, or some features can be ignored or not implemented.
  • The displayed or discussed mutual connections or communication connections may be communication connections of devices or units through wires, conductors, and interfaces, and may also be in electrical or other forms.
  • the units described above as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.


Abstract

Disclosed are a data processing device and a related method. The data processing device may comprise a first chip (001) and a second chip (002) that are packaged in a stacked manner; the first chip (001) comprises a general-purpose processor (011), a bus (00), and at least one first dedicated processing unit (DPU) (021); the general-purpose processor (011) and the at least one first DPU (021) are connected to the bus (00), and the general-purpose processor (011) is used to generate a data processing task; the second chip (002) comprises a second DPU (012), and at least one DPU among one or more of the at least one first DPU (021) and the second DPU (012) is capable of processing at least a part of the data processing task on the basis of a computing function; the first chip (001) and the second chip (002) are connected to each other by means of an inter-chip interconnection line. The data processing device can meet users' increasingly high power consumption and computing requirements without increasing the product volume.
PCT/CN2020/077290 2020-02-28 2020-02-28 Data processing device and method WO2021168837A1 (fr)

Priority Applications (4)

Application Number Priority Date Filing Date Title
CN202080001141.9A CN113574656A (zh) 2020-02-28 2020-02-28 一种数据处理装置及方法
PCT/CN2020/077290 WO2021168837A1 (fr) 2020-02-28 2020-02-28 Dispositif et procédé de traitement de données
EP20921522.7A EP4086948A4 (fr) 2020-02-28 2020-02-28 Dispositif et procédé de traitement de données
US17/894,211 US20220405228A1 (en) 2020-02-28 2022-08-24 Data processing apparatus and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/077290 WO2021168837A1 (fr) 2020-02-28 2020-02-28 Dispositif et procédé de traitement de données

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/894,211 Continuation US20220405228A1 (en) 2020-02-28 2022-08-24 Data processing apparatus and method

Publications (1)

Publication Number Publication Date
WO2021168837A1 true WO2021168837A1 (fr) 2021-09-02

Family

ID=77490626

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/077290 WO2021168837A1 (fr) 2020-02-28 2020-02-28 Dispositif et procédé de traitement de données

Country Status (4)

Country Link
US (1) US20220405228A1 (fr)
EP (1) EP4086948A4 (fr)
CN (1) CN113574656A (fr)
WO (1) WO2021168837A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115934632A (zh) * 2023-02-09 2023-04-07 南京芯驰半导体科技有限公司 Data processing method and apparatus, electronic device, and storage medium

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116204456A (zh) * 2021-11-30 2023-06-02 华为技术有限公司 Data access method and computing device
CN114139212A (zh) * 2021-12-22 2022-03-04 珠海一微半导体股份有限公司 Information security protection circuit, chip, robot, and method
EP4414841A1 (fr) * 2021-12-31 2024-08-14 Huawei Technologies Co., Ltd. Computing task processing apparatus and method, and electronic device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106209121A (zh) * 2016-07-15 2016-12-07 中国科学院微电子研究所 Multi-mode multi-core communication baseband SoC chip
US20180284186A1 (en) * 2017-04-03 2018-10-04 Nvidia Corporation Multi-chip package with selection logic and debug ports for testing inter-chip communications
CN108734298A (zh) * 2017-04-17 2018-11-02 英特尔公司 Extending GPU/CPU coherence to multiple GPU cores
CN109950227A (zh) * 2017-12-20 2019-06-28 三星电子株式会社 Semiconductor package

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4439491B2 (ja) * 2006-05-24 2010-03-24 株式会社ソニー・コンピュータエンタテインメント Multi-graphics-processor system, graphics processor, and data transfer method
US8839012B2 (en) * 2009-09-08 2014-09-16 Advanced Micro Devices, Inc. Power management in multi-GPU systems
CN102937945B (zh) * 2012-10-24 2015-10-28 上海新储集成电路有限公司 Method for reducing inter-chip interconnection lines when multiple chips are stacked vertically
US9065722B2 (en) * 2012-12-23 2015-06-23 Advanced Micro Devices, Inc. Die-stacked device with partitioned multi-hop network
US11295506B2 (en) * 2015-09-16 2022-04-05 Tmrw Foundation Ip S. À R.L. Chip with game engine and ray trace engine

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106209121A (zh) * 2016-07-15 2016-12-07 中国科学院微电子研究所 Multi-mode multi-core communication baseband SoC chip
US20180284186A1 (en) * 2017-04-03 2018-10-04 Nvidia Corporation Multi-chip package with selection logic and debug ports for testing inter-chip communications
CN108734298A (zh) * 2017-04-17 2018-11-02 英特尔公司 Extending GPU/CPU coherence to multiple GPU cores
CN109950227A (zh) * 2017-12-20 2019-06-28 三星电子株式会社 Semiconductor package

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4086948A4 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115934632A (zh) * 2023-02-09 2023-04-07 南京芯驰半导体科技有限公司 Data processing method and apparatus, electronic device, and storage medium

Also Published As

Publication number Publication date
US20220405228A1 (en) 2022-12-22
CN113574656A (zh) 2021-10-29
EP4086948A4 (fr) 2023-01-11
EP4086948A1 (fr) 2022-11-09

Similar Documents

Publication Publication Date Title
WO2021168837A1 (fr) Data processing device and method
TWI746878B (zh) High-bandwidth memory system and logic die
US12002793B2 (en) Integrating system in package (SiP) with input/output (IO) board for platform miniaturization
US11769534B2 (en) Flexible memory system with a controller and a stack of memory
JP5970078B2 (ja) Stacked memory allowing changes in device interconnection
JP2006172700A (ja) Low-power multi-chip semiconductor memory device and chip enable method thereof
WO2018121118A1 (fr) Computing apparatus and method
US11353900B2 (en) Integrated cross-domain power transfer voltage regulators
US20220334983A1 (en) Techniques For Sharing Memory Interface Circuits Between Integrated Circuit Dies
CN114121055A (zh) 内存互连架构系统和方法
JP2015176435A (ja) Lsiチップ積層システム
US10331592B2 (en) Communication apparatus with direct control and associated methods
US20190286606A1 (en) Network-on-chip and computer system including the same
TWI732523B (zh) Memory device and manufacturing method thereof
US10013195B2 (en) Memory module including plurality of memory packages with reduced power consumption
CN214225915U (zh) Multimedia chip architecture and multimedia processing system for portable mobile terminals
KR102404059B1 (ko) Interface circuit and interface device
JP2017010605A (ja) Stacked memory allowing changes in device interconnection
US20220318955A1 (en) Tone mapping circuit, image sensing device and operation method thereof
TWI236127B (en) Input/output structure and integrated circuit using the same
US10366646B2 (en) Devices including first and second buffers, and methods of operating devices including first and second buffers
US20230315334A1 (en) Providing fine grain access to package memory
WO2023056875A1 (fr) Multi-core chip, integrated circuit device, smart card, and associated manufacturing method
WO2024109087A1 (fr) Command processing method and apparatus
WO2023231437A1 (fr) Memory, system-on-chip, terminal device, and power supply control method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20921522

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2020921522

Country of ref document: EP

Effective date: 20220802

NENP Non-entry into the national phase

Ref country code: DE