CN109564562B - Big data operation acceleration system and chip - Google Patents

Info

Publication number
CN109564562B
Authority
CN
China
Prior art keywords
data
interfaces
memory
chip
unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201880002364.XA
Other languages
Chinese (zh)
Other versions
CN109564562A (en)
Inventor
桂文明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Suneng Technology Co ltd
Original Assignee
Beijing Suneng Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Suneng Technology Co ltd filed Critical Beijing Suneng Technology Co ltd
Publication of CN109564562A publication Critical patent/CN109564562A/en
Application granted granted Critical
Publication of CN109564562B publication Critical patent/CN109564562B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 Digital computers in general; Data processing equipment in general
    • G06F15/16 Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163 Interprocessor communication
    • G06F15/173 Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
    • G06F15/17306 Intercommunication techniques
    • G06F15/17312 Routing techniques specific to parallel machines, e.g. wormhole, store and forward, shortest path problem congestion
    • G06F15/17356 Indirect interconnection networks
    • G06F15/17368 Indirect interconnection networks non hierarchical topologies
    • G06F15/17375 One dimensional, e.g. linear array, ring

Abstract

The application provides a big data operation acceleration system and a chip. A plurality of cores are arranged in the chip, each core performs operation and storage control functions, and at least one storage unit is connected to each core outside the chip. With this technical solution, each core can read both the storage unit connected to it and the storage units connected to the other cores, so that each core effectively has a large-capacity memory. This reduces the number of times data is moved into or out of memory from external storage space and speeds up data processing. In addition, the cores can operate independently or cooperatively, which further increases the data processing speed.

Description

Big data operation acceleration system and chip
Technical Field
The embodiment of the invention relates to the field of integrated circuits, in particular to a big data operation acceleration system and a big data operation acceleration chip.
Background
An ASIC (Application Specific Integrated Circuit) is an integrated circuit designed and manufactured to meet the requirements of a specific user and a specific electronic system. Because an ASIC targets the requirements of specific users, it offers, compared with a general-purpose integrated circuit, smaller size, lower power consumption, higher reliability, higher performance, better confidentiality, and lower cost in mass production.
With the development of science and technology, more and more fields, such as artificial intelligence and security operations, involve specific calculations with a large amount of computation. For such specific operations, an ASIC chip can deliver fast operation and low power consumption. In these computation-intensive fields, N operation chips generally have to be controlled to work simultaneously in order to increase data processing speed and capacity. As data precision continues to improve, fields such as artificial intelligence and security operations need to operate on larger and larger data, and an ASIC chip generally has to be configured with several storage units for storing data. For example, one ASIC chip may need to be configured with 4 blocks of 2G memory, so that N operation chips working simultaneously need 4N blocks of 2G memory. However, when the multiple operation chips work simultaneously, the amount of data stored does not exceed 2G, which wastes storage units and increases the system cost.
The above background is provided only to aid understanding of the present application and does not constitute an admission or acknowledgment that any of the matter referred to is part of the common general knowledge relative to the present application.
Disclosure of Invention
The embodiments of the invention provide a big data operation acceleration system and a big data operation acceleration chip, in which two or more ASIC operation chips are each connected to two or more storage units through buses and the operation chips exchange data through the storage units. This reduces the number of storage units, reduces the connecting lines among the ASIC operation chips, and simplifies the system structure. Because each ASIC operation chip is connected to a plurality of storage units, the conflicts caused by a shared bus are avoided, and no Cache needs to be provided for each ASIC operation chip.
To achieve the above object, according to a first aspect of the present embodiment, there is provided a big data operation acceleration system, which includes two or more operation chips and two or more storage units, wherein:
the operation chip comprises at least one first data interface (130), two or more second data interfaces (150, 151, 152, 153), at least two cores (110, 111, 112, 113) and a routing unit (120); the at least one first data interface (130) and the two or more second data interfaces (150, 151, 152, 153) are respectively connected with the routing unit, and the routing unit is connected with the at least two cores (110, 111, 112, 113);
the storage unit (20) comprises two or more memories, a routing unit (230) and two or more third data interfaces (250, 251, 252, 253); the two or more third data interfaces (250, 251, 252, 253) are respectively connected with the routing unit through buses, and the routing unit is connected with the two or more memories;
the second data interface (150, 151, 152, 153) of the operation chip is connected with the third data interface (250, 251, 252, 253) of the storage unit through a bus.
According to a second aspect of the present embodiment, there is provided a big data operation acceleration system, including two or more operation chips and two or more storage units, wherein:
the operation chip comprises at least one first data interface (130), two or more second data interfaces (150, 151, 152, 153), at least two cores (110, 111, 112, 113) and a routing unit (120); each second data interface is connected with one core, the at least two cores are connected with the routing unit, and the at least one first data interface (130) is connected with a core (110);
the storage unit (20) comprises two or more memories, a routing unit (230) and two or more third data interfaces (250, 251, 252, 253); the two or more third data interfaces (250, 251, 252, 253) are respectively connected with the routing unit through buses, and the routing unit is connected with the two or more memories;
the second data interface (150, 151, 152, 153) of the operation chip is connected with the third data interface (250, 251, 252, 253) of the storage unit through a bus.
According to a third aspect of the present embodiment, a big data operation chip is provided, wherein the operation chip includes at least one first data interface (130), two or more second data interfaces (150, 151, 152, 153), at least two cores (110, 111, 112, 113), and a routing unit (120); the at least one first data interface (130) and the two or more second data interfaces (150, 151, 152, 153) are respectively connected with the routing unit, and the routing unit is connected with the at least two cores (110, 111, 112, 113); the second data interface and the third data interface are serdes interfaces; the second data interface (150, 151, 152, 153) of the operation chip is connected with the storage unit through a bus.
According to a fourth aspect of the present embodiment, a big data operation chip is provided, wherein the operation chip includes at least one first data interface (130), two or more second data interfaces (150, 151, 152, 153), at least two cores (110, 111, 112, 113), and a routing unit (120); each second data interface is connected with one core, the at least two cores are connected with the routing unit, and the at least one first data interface (130) is connected with a core (110); the second data interface and the third data interface are serdes interfaces; the second data interface (150, 151, 152, 153) of the operation chip is connected with the storage unit through a bus.
The embodiments of the invention connect each of the plurality of operation chips in the big data operation acceleration system to every storage unit, which saves storage units and reduces the system cost, reduces the connecting lines among the ASIC operation chips, and simplifies the system structure. Because each ASIC operation chip is connected to a plurality of storage units, the conflicts caused by a shared bus are avoided, and no Cache needs to be provided for each ASIC operation chip.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description are only exemplary embodiments, and those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic diagram of a big data operation acceleration system with 4 operation chips and 4 storage units according to a first embodiment;
FIG. 2a is a schematic structural diagram of an operation chip having 4 cores according to a first embodiment;
FIG. 2b is a signal flow diagram of an operation chip having 4 cores according to a first embodiment;
FIG. 3a is a schematic structural diagram of an operation chip having 4 cores according to a second embodiment;
FIG. 3b is a signal flow diagram of an operation chip having 4 cores according to a second embodiment;
FIG. 4a is a schematic structural diagram of a storage unit corresponding to an operation chip having 4 cores according to a third embodiment;
FIG. 4b is a signal flow diagram of a storage unit corresponding to an operation chip having 4 cores according to a third embodiment;
FIG. 5 is a schematic diagram of a connection structure of a big data operation acceleration system having 4 operation chips and 4 storage units;
FIG. 6 is a schematic diagram of a data structure according to the present embodiment.
Detailed Description
Exemplary embodiments of the present invention will be described in detail below with reference to the accompanying drawings. It should be understood that these embodiments are given only to enable those skilled in the art to better understand and implement the present invention, and do not limit the scope of the present invention in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
It should be noted that the directions of up, down, left and right in the drawings are merely examples of specific embodiments, and those skilled in the art can change the directions of a part or all of the components shown in the drawings according to actual needs without affecting the functions of the components or the system as a whole, and such a technical solution with changed directions still belongs to the protection scope of the present invention.
A multi-core chip is a multi-processing system embodied on a single large-scale integrated semiconductor chip. Typically, two or more chip cores may be embodied on a multi-core chip, interconnected by a bus (which may also be formed on the same multi-core chip). There may be from two chip cores to many chip cores embodied on the same multi-core chip, the upper limit in the number of chip cores being limited only by manufacturing capability and performance constraints. The multi-core chip may have applications that contain specialized arithmetic and/or logical operations that are performed in multimedia and signal processing algorithms such as video encoding/decoding, 2D/3D graphics, audio and speech processing, image processing, telephony, speech recognition and sound synthesis, encryption processing.
Although only ASICs (application specific integrated circuits) are mentioned in the background, the specific wiring implementation in the embodiments may also be applied to CPUs, GPUs, FPGAs, and other devices having multiple cores. In this embodiment, the plurality of cores may be identical or different.
For convenience of description, the big data operation acceleration system with 4 operation chips and 4 storage units in FIG. 1 is described as an example. Those skilled in the art will understand that 4 operation chips and 4 storage units are chosen here only as an example: the number of operation chips may be N, where N is a positive integer greater than or equal to 2, for example 6, 10, or 12, and the number of storage units may be M, where M is a positive integer greater than or equal to 2, for example 6, 9, or 12. N and M may or may not be equal. In this embodiment, the plurality of operation chips may be identical or different.
FIG. 1 is a schematic diagram of a big data operation acceleration system with 4 operation chips and 4 storage units according to a first embodiment. As shown in FIG. 1, the big data operation acceleration system comprises 4 operation chips (10, 11, 12, 13) and 4 storage units (20, 21, 22, 23). Each operation chip is connected with all the storage units through buses. The operation chips exchange data through the storage units and do not exchange data directly with one another; only control instructions are sent between the operation chips.
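The all-to-all wiring of FIG. 1 can be sketched as follows. This is a minimal illustration, assuming that interface j of each chip attaches to storage unit j and that interface i of each storage unit attaches to chip i. The text only states that every operation chip is connected to every storage unit, so the port numbering and the names `Link` and `build_links` are hypothetical.

```python
from itertools import product
from typing import List, NamedTuple


class Link(NamedTuple):
    chip: int        # index of the operation chip
    chip_port: int   # serdes interface used on the chip (assumed numbering)
    unit: int        # index of the storage unit
    unit_port: int   # serdes interface used on the storage unit (assumed numbering)


def build_links(num_chips: int, num_units: int) -> List[Link]:
    """Return one chip-to-storage link for every (chip, storage unit) pair."""
    if num_chips < 2 or num_units < 2:
        raise ValueError("at least 2 operation chips and 2 storage units are required")
    # No chip-to-chip data links are created: data is exchanged via storage units only.
    return [Link(c, u, u, c) for c, u in product(range(num_chips), range(num_units))]


links = build_links(4, 4)
units_reached_by_chip0 = sorted(link.unit for link in links if link.chip == 0)
```

With N chips and M storage units this yields N x M links, and each chip needs M serdes interfaces, which matches the 4 interfaces per chip in the 4 x 4 example.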
A dedicated storage area and a shared storage area are set in each storage unit. The dedicated storage area stores the temporary operation results of one operation chip, that is, intermediate calculation results that the operation chip itself continues to use and that are not used by other operation chips. The shared storage area stores the data operation results of the operation chips that are used by other operation chips or that need to be fed back to the outside. Of course, for convenience of management, the storage units may also be left undivided.
The storage unit may be a high-speed external memory such as DDR, SDDR, DDR2, DDR3, DDR4, GDDR5, GDDR6, HMC, or HBM. The storage unit is preferably a DDR memory, that is, Double Data Rate SDRAM. DDR uses a synchronous circuit, so that the main steps of transferring the specified address and transferring and outputting data are executed independently while remaining fully synchronized with the CPU. DDR also uses DLL (Delay Locked Loop) technology, which provides a data strobe signal: when data is valid, the memory controller can use this signal to locate the data precisely, once every 16 outputs, and to resynchronize data from different memory modules. The frequency of a DDR memory can be expressed as a working frequency or an equivalent frequency. The working frequency is the actual working frequency of the memory particles, but because a DDR memory transmits data on both the rising edge and the falling edge of the clock pulse, the equivalent data transmission frequency is twice the working frequency. DDR2 (Double Data Rate 2) memory is a new generation of memory technology standard developed by JEDEC (Joint Electron Device Engineering Council); DDR2 memory can read/write data at 4 times the speed of the external bus per clock and can operate at 4 times the speed of the internal control bus.
DDR3, DDR4, GDDR5, GDDR6, HMC, and HBM memory are all prior art and will not be described in detail here.
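The relationship between the working frequency and the equivalent frequency of a DDR memory can be written as a one-line calculation. The function name and the 200 MHz figure are illustrative assumptions, not values taken from the embodiments.

```python
def ddr_equivalent_frequency_mhz(working_frequency_mhz: float) -> float:
    """Data moves on both the rising and the falling clock edge, hence the factor 2."""
    if working_frequency_mhz <= 0:
        raise ValueError("working frequency must be positive")
    return 2 * working_frequency_mhz


# Illustrative value only: a 200 MHz working frequency.
equivalent = ddr_equivalent_frequency_mhz(200.0)
```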
The 4 ASIC operation chips are each connected with the 4 storage units through buses, and the operation chips exchange data through the storage units. This reduces the number of storage units, reduces the connecting lines among the ASIC operation chips, and simplifies the system structure. Because each ASIC operation chip is connected with a plurality of storage units, the conflicts caused by a shared bus are avoided, and no Cache needs to be provided for each ASIC operation chip.
FIG. 2a is a schematic structural diagram of an operation chip having 4 cores according to a first embodiment. As will be understood by those skilled in the art, 4 cores are taken only as an example; the number of cores of the operation chip may be Q, where Q is a positive integer greater than or equal to 2, for example 6, 10, or 12. In this embodiment, the cores of the operation chip may have the same function or different functions.
The 4-core operation chip (10) comprises 4 cores (110, 111, 112, 113), a routing unit (120), a data exchange control unit (130) and 4 serdes interfaces (150, 151, 152, 153). The data exchange control unit and the 4 serdes interfaces are each connected with the routing unit through a bus, and the routing unit is connected with each core. The data exchange control unit may be implemented using various protocols, such as UART, SPI, PCIE, SERDES, or USB; in this embodiment it is a UART (Universal Asynchronous Receiver/Transmitter) control unit (130). A UART is an asynchronous receiver/transmitter that converts data between serial and parallel form for transmission, and it is typically integrated into the connections between various communication interfaces. The UART protocol is only an example, and other protocols may be used. The UART control unit (130) can receive external data or control instructions, send control instructions to other chips, receive control instructions from other chips, and feed back operation results or intermediate data to the outside.
Serdes is short for SERializer/DESerializer. It is a mainstream time division multiplexing (TDM), point-to-point (P2P) serial communication technology: at the transmitting end, multiple low-speed parallel signals are converted into a high-speed serial signal, which passes through a transmission medium (an optical cable or a copper wire) and is converted back into low-speed parallel signals at the receiving end. This point-to-point serial communication technology makes full use of the channel capacity of the transmission medium, reduces the number of transmission channels and device pins required, and increases the signal transmission speed, thereby greatly reducing communication cost. Of course, other communication interfaces, such as SSI or UART, may be used instead of the serdes interfaces. Data and control instructions are transmitted between the chip and the storage unit through the serdes interface and the transmission line.
The main functions of a core are to execute external or internal control instructions, perform data calculation, control the storage of data, and the like.
The routing unit sends data or control instructions to the cores (110, 111, 112, 113), receives the data or control instructions sent by the cores (110, 111, 112, 113), and implements communication among the cores. On receiving an internal or external control instruction, the routing unit writes data to a storage unit, reads data from it, or sends the control instruction to it through a serdes interface. If the internal or external control instruction is intended to control other chips, the routing unit sends the control instruction to the UART control unit (130), and the UART control unit (130) sends it to the other chips. If data needs to be sent to other chips, the routing unit transmits the data to a storage unit through a serdes interface, and the other chips acquire the data from the storage unit. If data needs to be received from other chips, the routing unit acquires the data from a storage unit through a serdes interface. The routing unit and the UART control unit (130) together receive external control instructions and send them to the cores (110, 111, 112, 113); external data is received by the UART control unit (130) and is transmitted to the cores (110, 111, 112, 113) or to a storage unit according to the external data address. Internal data or internal control instructions are those generated by the chip itself, and external data or external control instructions are those generated outside the chip, for example data or control instructions sent by an external host or an external network.
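The parallel-to-serial and serial-to-parallel conversion described above can be illustrated with a toy model. This is only a sketch of the data reordering, assuming 8-bit words sent most significant bit first; a real serdes link also involves line coding and clock recovery, which are not modelled, and the function names are hypothetical.

```python
from typing import List


def serialize(words: List[int], width: int = 8) -> List[int]:
    """Transmit side: flatten parallel words into one bit stream, MSB first."""
    bits: List[int] = []
    for word in words:
        if not 0 <= word < (1 << width):
            raise ValueError(f"word {word} does not fit in {width} bits")
        bits.extend((word >> shift) & 1 for shift in range(width - 1, -1, -1))
    return bits


def deserialize(bits: List[int], width: int = 8) -> List[int]:
    """Receive side: regroup the bit stream into parallel words."""
    if len(bits) % width != 0:
        raise ValueError("bit stream length is not a multiple of the word width")
    words: List[int] = []
    for start in range(0, len(bits), width):
        word = 0
        for bit in bits[start:start + width]:
            word = (word << 1) | bit
        words.append(word)
    return words


payload = [0x12, 0xAB, 0xFF]
recovered = deserialize(serialize(payload))
```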
FIG. 2b is a signal flow diagram of an operation chip having 4 cores according to the first embodiment. The UART control unit (130) acquires data or control instructions from outside the chip, and the routing unit (120) sends the data or control instructions either to a core, according to their address, or through a serdes interface to the storage unit connected with that interface. If the destination address of an external control instruction points to another chip, the routing unit sends the control instruction to the UART control unit (130), and the UART control unit (130) sends it to the other chip. The UART control unit (130) sends operation results to the outside according to an external or internal control instruction; the operation results can be obtained from a core of the operation chip or from a storage unit connected with a serdes interface. The outside may be an external host, an external network, an external platform, or the like. The external host can initialize and configure the parameters of the storage units through the UART control unit and uniformly address a plurality of memory particles.
A core can send the routing unit a control instruction for reading or writing data. The control instruction carries a data address, and the routing unit reads the data from, or writes the data to, the storage unit through a serdes interface according to that address. A core may also send data or control instructions to other cores through the routing unit according to the address, and obtain data or control instructions from other cores through the routing unit. The core performs calculations on the acquired data and stores the calculation results in a storage unit. A dedicated storage area and a shared storage area are set in each storage unit. The dedicated storage area stores the temporary operation results of one operation chip, that is, intermediate calculation results that the operation chip itself continues to use and that are not used by other operation chips. The shared storage area stores the data operation results of the operation chips that are used by other operation chips or that need to be fed back to the outside. If a control instruction generated by a core is intended to control the operation of other chips, the routing unit sends the control instruction to the UART control unit (130), and the UART control unit (130) sends it to the other chips. If a control instruction generated by a core is intended to control a storage unit, the routing unit sends the control instruction to the storage unit through a serdes interface.
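The dispatch rules of the routing unit in this first chip embodiment can be sketched as a small function. This is a hedged illustration, not the patented logic: the `Message` fields, the three address classes, and the rule that data for another chip goes to the storage unit with the matching index are assumptions introduced here, since the text does not define an address format or say which storage unit is chosen.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    kind: str       # "data" or "control"
    dest_type: str  # "core", "storage" or "chip" (hypothetical address classes)
    dest_id: int    # index of the destination within its class


def route(msg: Message, num_cores: int = 4, num_serdes: int = 4) -> str:
    """Return the port the routing unit (120) forwards the message to."""
    if msg.kind not in ("data", "control"):
        raise ValueError("kind must be 'data' or 'control'")
    if msg.dest_type == "core":
        if not 0 <= msg.dest_id < num_cores:
            raise ValueError("unknown core")
        return f"core{msg.dest_id}"
    if msg.dest_type == "storage":
        if not 0 <= msg.dest_id < num_serdes:
            raise ValueError("unknown storage unit")
        return f"serdes{msg.dest_id}"
    if msg.dest_type == "chip":
        if msg.kind == "control":
            # Control instructions for other chips leave through the UART control unit.
            return "uart"
        # Data for other chips is written to a storage unit, never sent chip to chip.
        # Which storage unit is used is an assumption of this sketch.
        return f"serdes{msg.dest_id % num_serdes}"
    raise ValueError("unknown destination type")


control_to_chip = route(Message("control", "chip", 2))
data_to_chip = route(Message("data", "chip", 2))
```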
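The split of a storage unit into a dedicated area and a shared area can be modelled as follows. This is a minimal sketch under stated assumptions: the class and method names are hypothetical, and rejecting reads of another chip's dedicated area is a choice made for this illustration. The text only says that such intermediate results are not used by other chips.

```python
from typing import Any, Dict


class PartitionedStorage:
    """Toy model of one storage unit with per-chip dedicated areas and one shared area."""

    def __init__(self, num_chips: int) -> None:
        self.dedicated: Dict[int, Dict[str, Any]] = {chip: {} for chip in range(num_chips)}
        self.shared: Dict[str, Any] = {}

    def store_intermediate(self, chip: int, key: str, value: Any) -> None:
        # Intermediate results that only the producing chip keeps using.
        self.dedicated[chip][key] = value

    def load_intermediate(self, chip: int, owner: int, key: str) -> Any:
        if chip != owner:
            # Assumption of this sketch: access by other chips is refused.
            raise PermissionError("dedicated area belongs to another chip")
        return self.dedicated[owner][key]

    def publish(self, key: str, value: Any) -> None:
        # Results used by other chips or fed back to the outside.
        self.shared[key] = value

    def fetch(self, key: str) -> Any:
        return self.shared[key]


unit = PartitionedStorage(num_chips=4)
unit.store_intermediate(chip=0, key="partial_sum", value=21)
unit.publish("result", 2 * unit.load_intermediate(chip=0, owner=0, key="partial_sum"))
result_seen_by_chip1 = unit.fetch("result")
```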
FIG. 3a is a schematic structural diagram of an operation chip having 4 cores according to a second embodiment. As shown in FIG. 3a, the 4-core operation chip includes 4 cores (110, 111, 112, 113), a routing unit (120), a UART control unit (130) and 4 serdes interfaces (150, 151, 152, 153). Each serdes interface is connected with one core, the 4 cores are connected with the routing unit, and the UART control unit (130) is connected with the core (110).
FIG. 3b is a signal flow diagram of an operation chip having 4 cores according to the second embodiment. The UART control unit (130) acquires data or control instructions from outside the chip and transmits them to the core (110) connected with it. The core (110) transmits the external data or control instructions to the routing unit (120), and the routing unit sends them to the core (111, 112, 113) corresponding to their address. If the destination address of the data or control instruction is a core of the operation chip, the routing unit sends the data or control instruction to that core (110, 111, 112, 113). If the destination address of the data or control instruction is a storage unit, the core (111, 112, 113) sends the data or control instruction to the corresponding storage unit through its serdes interface (151, 152, 153). The core (110) can also send data or control instructions directly to the corresponding storage unit through the serdes interface (150) connected with it. For this purpose, the routing unit stores the serdes interface corresponding to each storage unit address. If the destination of the data or control instruction is another operation chip, the data is sent to the corresponding storage unit by the core (111, 112, 113) through its serdes interface (151, 152, 153), and the control instruction is sent to the other operation chip through the UART control unit.
When a core feeds back an operation result or intermediate data to the outside according to an external or internal control instruction, the core acquires the operation result or intermediate data from the storage unit through its serdes interface and sends it to the routing unit; the routing unit sends it to the core (110) connected with the UART control unit, and it is finally sent to the outside through the UART control unit. If the operation result or intermediate data is obtained through the serdes interface of the core connected with the UART control unit, it is sent to the outside directly through the UART control unit. The outside may be an external host, an external network, an external platform, or the like. The external host can initialize and configure the parameters of the storage units through the UART control unit and uniformly address a plurality of storage units.
A core can send a control instruction to the routing unit, and the routing unit sends the control instruction to other cores, other chips, or a storage unit according to its address; the receiving core, chip, or storage unit then executes the corresponding operation. When a core sends a control instruction or data to other cores, the routing unit forwards it directly. When a core sends a control instruction to other chips, the control instruction is sent through the UART control unit. When a core sends a control instruction to a storage unit, the routing unit looks up the serdes interface corresponding to the address and sends the control instruction to the core connected with that serdes interface; that core passes the control instruction to the serdes interface, and the serdes interface sends it to the storage unit. When a core sends data to other chips or to a storage unit, the routing unit looks up the serdes interface corresponding to the address and sends the data to the core connected with that serdes interface; that core passes the data to the serdes interface, and the serdes interface sends the data to the storage unit. The other chips then obtain the data from the storage unit.
When a core acquires data from a storage unit, the read control instruction carries a data address. The routing unit looks up the serdes interface corresponding to the address and sends the control instruction to the core connected with that serdes interface; that core passes it to the serdes interface, and the serdes interface sends the read control instruction, which carries a destination address and a source address, to the storage unit. After the serdes interface acquires the data from the storage unit, it sends the data to the core connected with it; that core sends a data packet comprising the source address and the destination address to the routing unit, and the routing unit sends the data packet to the corresponding core according to the destination address. If a core finds that the destination address is its own address, it takes the data for processing. A core can also send data or commands to other cores through the routing unit, and obtain data or commands from other cores through the routing unit. The core performs calculations on the acquired data and stores the calculation results in a storage unit. A dedicated storage area and a shared storage area are set in each storage unit. The dedicated storage area stores the temporary operation results of one operation chip, that is, intermediate calculation results that the operation chip itself continues to use and that are not used by other operation chips. The shared storage area stores the operation results of the operation chips that are used by other operation chips or that need to be fed back to the outside.
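The hop sequence of a read in this second chip embodiment can be traced with a short sketch. It assumes that serdes interface k is attached to core k and leads to storage unit k, and that a core owning the target interface skips the routing unit; the text fixes neither point for reads, and the function name `read_path` is hypothetical.

```python
from typing import List, Tuple


def read_path(requesting_core: int, storage_unit: int,
              num_cores: int = 4) -> Tuple[List[str], List[str]]:
    """Return the hops taken by the read instruction and by the returned data packet."""
    if not (0 <= requesting_core < num_cores and 0 <= storage_unit < num_cores):
        raise ValueError("core or storage unit index out of range")
    owner = storage_unit  # assumed mapping: interface k <-> core k <-> storage unit k
    request = [f"core{requesting_core}"]
    if owner != requesting_core:
        # The routing unit looks up the interface for the address and forwards the
        # instruction to the core that owns that interface.
        request += ["router", f"core{owner}"]
    request += [f"serdes{owner}", f"storage{storage_unit}"]
    # The data packet carries source and destination addresses and retraces the hops.
    reply = request[::-1]
    return request, reply


request, reply = read_path(requesting_core=2, storage_unit=0)
```

In this model a core that reads through another core's interface adds two internal hops in each direction compared with a core that uses its own interface.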
FIG. 4a is a schematic structural diagram of a first embodiment of the storage unit, corresponding to an operation chip having 4 cores. The storage unit (20) includes C memories, where C = 4 in this example; C may be any positive integer greater than or equal to 2, for example 6, 10, or 12. Each memory (240, 241, 242, 243) comprises a memory controller (220, 221, 222, 223) and a memory granule (210, 211, 212, 213). The memory controller writes data to or reads data from the memory granule according to instructions, and the memory granule stores the data. The storage unit (20) further comprises a routing unit (230) and 4 serdes interfaces (250, 251, 252, 253). The 4 serdes interfaces are each connected to the routing unit through a bus, and the routing unit is connected to each memory.
FIG. 4b is a signal flow diagram of the first embodiment of the storage unit, corresponding to an operation chip having 4 cores. The storage unit (20) receives a control instruction through the serdes interfaces (250, 251, 252, 253) and passes it to the routing unit (230). The routing unit sends the control instruction to the corresponding memory (240, 241, 242, 243) according to the address in the instruction, and the memory controller (220, 221, 222, 223) performs the corresponding operation. Examples include configuring memory parameters at initialization so that the memory granules are uniformly addressed, resetting the memory granules in response to a reset instruction, and executing a write command or a read command. A data acquisition command sent by an operation chip is received through a serdes interface (250, 251, 252, 253) and carries the address of the data to be read. The routing unit sends the command to the memory according to that address, the memory controller reads the data from the memory granule, and the data is sent through the serdes interface to the requesting operation chip according to the source address. A write data command and its data sent by an operation chip are likewise received through the serdes interfaces (250, 251, 252, 253); the command carries the address to be written, the routing unit sends the command and data to the memory according to that address, and the memory controller writes the data to the memory granule. The write data command and the data may be transmitted synchronously or asynchronously.
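As an illustration of how a routing unit might select a memory from a uniformly addressed space, consider the sketch below. It assumes, purely for illustration, that all memories have the same size and that the global address is split into a memory index and a local offset by integer division; the patent text does not specify the address mapping, and the class names are hypothetical.

```python
class Memory:
    """One memory: a controller in front of a block of memory granules."""

    def __init__(self, size: int):
        self.cells = bytearray(size)

    def write(self, offset: int, data: bytes) -> None:
        self.cells[offset:offset + len(data)] = data

    def read(self, offset: int, length: int) -> bytes:
        return bytes(self.cells[offset:offset + length])


class StorageUnit:
    """Routing unit plus C memories behind one flat (uniform) address space."""

    def __init__(self, num_memories: int = 4, memory_size: int = 1024):
        self.memory_size = memory_size
        self.memories = [Memory(memory_size) for _ in range(num_memories)]

    def _route(self, addr: int, length: int):
        # Assumed mapping: memory index = addr // size, offset = addr % size.
        index, offset = divmod(addr, self.memory_size)
        if index >= len(self.memories) or offset + length > self.memory_size:
            raise ValueError("access outside a single memory")
        return self.memories[index], offset

    def write(self, addr: int, data: bytes) -> None:
        memory, offset = self._route(addr, len(data))
        memory.write(offset, data)

    def read(self, addr: int, length: int) -> bytes:
        memory, offset = self._route(addr, length)
        return memory.read(offset, length)


if __name__ == "__main__":
    unit = StorageUnit()
    unit.write(1024 + 5, b"hi")   # lands in the second memory at offset 5
    print(unit.read(1024 + 5, 2))
```

For simplicity the sketch rejects accesses that would cross a memory boundary; a real controller could split them instead.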
Each storage unit is provided with a dedicated storage area and a shared storage area. The dedicated storage area stores the temporary operation results of one operation chip, that is, intermediate results that the same chip continues to use and that no other operation chip uses. The shared storage area stores the operation results of an operation chip that are used by other operation chips or that must be fed back to the outside.
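One way to picture the two areas is as a per-chip scratch space next to a common result space. The sketch below is hypothetical throughout: the class name, the method names, and the use of dictionaries keyed by chip id are illustrative choices, and the text does not say how the areas are laid out or whether access is enforced.

```python
class PartitionedStorage:
    """Toy model of one storage unit with dedicated and shared areas."""

    def __init__(self):
        self.dedicated = {}  # chip id -> that chip's intermediate results
        self.shared = {}     # results visible to every chip and the host

    def put_temp(self, chip_id: int, key: str, value) -> None:
        # Intermediate result that only the same chip will use again.
        self.dedicated.setdefault(chip_id, {})[key] = value

    def get_temp(self, chip_id: int, key: str):
        return self.dedicated[chip_id][key]

    def publish(self, key: str, value) -> None:
        # Result that other chips use or that is fed back to the outside.
        self.shared[key] = value

    def fetch(self, key: str):
        return self.shared[key]


if __name__ == "__main__":
    unit = PartitionedStorage()
    unit.put_temp(10, "partial_sum", 3)
    unit.publish("layer1_output", unit.get_temp(10, "partial_sum") + 4)
    print(unit.fetch("layer1_output"))
```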
FIG. 5 is a schematic diagram of the connection structure of a big data operation acceleration system with 4 operation chips and 4 storage units. In FIG. 5, the system has 4 operation chips (10, 11, 12, 13) and 4 storage units (20, 21, 22, 23). The operation chip may have the chip structure disclosed in the first or second embodiment, or an equivalent structure obtained by modifications that those skilled in the art would make to the first and second embodiments; such equivalent chip structures also fall within the protection scope of the present embodiment. The storage unit may have the structure disclosed in the third embodiment, or an equivalent structure obtained by modifications that those skilled in the art would make to the third embodiment; such equivalent storage unit structures also fall within the protection scope of the present embodiment. In the big data operation acceleration system, the UART control unit (130) of the operation chip (10) is connected to an external host, and the UART control units (130) of the chips (10, 11, 12, 13) are connected to one another through a bus. Each serdes interface (150, 151, 152, 153) of a chip (10, 11, 12, 13) is connected to a serdes interface (250, 251, 252, 253) of one storage unit (20, 21, 22, 23), so that every operation chip is connected to all of the storage units through a bus. The operation chips exchange data through the storage units and do not exchange data with one another directly. The internal and external signal flows of the operation chip and the storage unit have been explained in detail in the first, second, and third embodiments and are not repeated here.
When the system is applied in the field of artificial intelligence, the operation chip (10) receives picture or video data sent by the external host through its UART control unit (130) and stores the data in the storage units (20, 21, 22, 23) through the serdes interfaces (150, 151, 152, 153). The operation chips (10, 11, 12, 13) generate the mathematical model of a neural network; alternatively, the external host may store the mathematical model in the storage units (20, 21, 22, 23) through the serdes interfaces (150, 151, 152, 153), from which each operation chip (10, 11, 12, 13) reads it. The first-layer mathematical model of the neural network runs on the operation chip (10): the chip reads data from the storage units (20, 21, 22, 23) through the serdes interfaces, performs the operation, and stores the result in at least one of the storage units (20, 21, 22, 23) through the serdes interfaces. The operation chip (10) then sends a control instruction to the operation chip (11) through the UART control unit (130) to start that chip's operation. The second-layer mathematical model of the neural network runs on the operation chip (11), which reads data from the storage units (20, 21, 22, 23) through the serdes interfaces, performs the operation, and stores the result in at least one of the storage units (20, 21, 22, 23) through the serdes interfaces. Each chip thus executes one layer of the neural network, obtaining its data from the storage units (20, 21, 22, 23) through the serdes interfaces, until the last layer of the neural network has been computed and the final operation result is obtained. The operation chip (10) then obtains the operation result from the storage units (20, 21, 22, 23) through the serdes interfaces and feeds it back to the external host through the UART control unit (130).
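The layer-per-chip flow can be sketched as a loop in which each chip reads its input from shared storage, applies one layer, and writes its output back for the next chip. Everything in the sketch is a simplification made for illustration: a dictionary stands in for the storage units, the "layers" are toy list operations rather than real neural network layers, and the UART start signal between chips is reduced to the order of the loop.

```python
def run_pipeline(layers, input_data):
    """Run one layer per chip, exchanging data only through shared storage."""
    storage = {"stage0": input_data}  # stands in for storage units 20 to 23
    for chip_index, layer in enumerate(layers):
        # The chip reads its input from storage (serdes read in the text).
        data = storage[f"stage{chip_index}"]
        # It runs its layer and stores the result for the next chip.
        storage[f"stage{chip_index + 1}"] = layer(data)
        # In the described system the chip would now start the next chip
        # with a control instruction sent through the UART control unit.
    return storage[f"stage{len(layers)}"]


# Four toy "layers", one per operation chip (hypothetical stand-ins).
LAYERS = [
    lambda xs: [2 * x for x in xs],      # chip 10
    lambda xs: [x + 1 for x in xs],      # chip 11
    lambda xs: [max(x, 0) for x in xs],  # chip 12
    sum,                                 # chip 13
]

if __name__ == "__main__":
    print(run_pipeline(LAYERS, [1, -2, 3]))
```

With the input `[1, -2, 3]` the intermediate stages are `[2, -4, 6]`, `[3, -3, 7]`, and `[3, 0, 7]`, and the final result is 10. No chip ever passes data directly to another chip, which mirrors the statement that chips exchange data only through the storage units.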
When the system is applied in the field of cryptocurrency, the UART control unit (130) of the operation chip (10) stores block information sent by the external host in at least one of the storage units (20, 21, 22, 23). The external host sends a control instruction through the UART control units (130) of the 4 operation chips (10, 11, 12, 13) to start data operation, and the 4 operation chips (10, 11, 12, 13) begin operating. Alternatively, the external host may send the control instruction to the UART control unit (130) of one operation chip (10), which in turn sends control instructions to the other 3 operation chips (11, 12, 13), after which all 4 operation chips (10, 11, 12, 13) begin operating. As a further alternative, the external host may send the control instruction to the UART control unit (130) of the first operation chip (10); the first operation chip (10) sends the control instruction to the second operation chip (11), the second operation chip (11) sends it to the third operation chip (12), and the third operation chip (12) sends it to the fourth operation chip (13), after which all 4 operation chips (10, 11, 12, 13) begin operating. The 4 operation chips (10, 11, 12, 13) read the block information from the storage units through the serdes interfaces and perform the proof-of-work computation simultaneously. The operation chip (10) obtains the operation result from the storage units (20, 21, 22, 23) and feeds it back to the external host through the UART control unit (130).
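A simplified proof-of-work search shows what the 4 chips might compute in parallel. The sketch rests on several assumptions that are not in the text: the nonce space is split into equal contiguous ranges, one per chip; the check is a single SHA-256 over the block bytes plus a 4-byte nonce with a required number of leading zero bits (real cryptocurrencies use their own header formats and hash constructions); and the chips are simulated one after another rather than concurrently.

```python
import hashlib


def meets_target(block: bytes, nonce: int, zero_bits: int) -> bool:
    """True if SHA-256(block + nonce) starts with `zero_bits` zero bits."""
    digest = hashlib.sha256(block + nonce.to_bytes(4, "little")).digest()
    return int.from_bytes(digest, "big") >> (256 - zero_bits) == 0


def chip_search(block: bytes, start: int, stop: int, zero_bits: int):
    """One chip scans its own nonce range and returns the first hit, if any."""
    for nonce in range(start, stop):
        if meets_target(block, nonce, zero_bits):
            return nonce
    return None


def system_search(block: bytes, num_chips: int = 4,
                  space: int = 1 << 16, zero_bits: int = 8):
    """Split the nonce space evenly and collect each chip's result."""
    share = space // num_chips
    return [
        chip_search(block, chip * share, (chip + 1) * share, zero_bits)
        for chip in range(num_chips)
    ]


if __name__ == "__main__":
    block_info = b"example block information"  # hypothetical payload
    for chip, nonce in enumerate(system_search(block_info)):
        print(f"chip {chip}: nonce = {nonce}")
```

In the described system each chip would read `block_info` from a storage unit and write any hit back to the shared storage area, from which the operation chip (10) reports it to the host.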
In the above embodiment, the number of operation chips equals the number of storage units; in that case, the number of second data interfaces on each storage unit and the number of second data interfaces on each operation chip both equal the number of storage units.
However, as those skilled in the art will understand, the number of operation chips and the number of storage units may also differ. In that case, the number of second data interfaces on each storage unit equals the number of operation chips, and the number of second data interfaces on each operation chip equals the number of storage units. For example, with 4 operation chips and 5 storage units, each operation chip has 5 second data interfaces and each storage unit has 4 second data interfaces.
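The interface counts follow directly from the rule that every operation chip connects to every storage unit. A small sketch makes the arithmetic explicit; the function name and the returned field names are hypothetical.

```python
def interface_counts(num_chips: int, num_storage_units: int) -> dict:
    """Interface counts for a full chip-to-storage mesh."""
    # One link per (chip, storage unit) pair.
    links = [(chip, unit)
             for chip in range(num_chips)
             for unit in range(num_storage_units)]
    return {
        "per_chip": num_storage_units,   # each chip reaches every storage unit
        "per_storage_unit": num_chips,   # each storage unit serves every chip
        "total_links": len(links),
    }


if __name__ == "__main__":
    print(interface_counts(4, 4))  # the equal-count embodiment
    print(interface_counts(4, 5))  # the unequal example from the text
```

For the example in the text (4 chips, 5 storage units) this gives 5 interfaces per chip, 4 per storage unit, and 20 point-to-point links in total.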
The bus may use a centralized arbitration bus structure or a loop topology bus structure; both are common in the art and are therefore not described in detail here.
FIG. 6 is a schematic diagram of the data structure according to the present invention. The data referred to here covers various kinds of data, such as command data, numerical data, and character data. The data format consists of a valid bit, a destination address (dst id), a source address (src id), and data. A core determines from the valid bit whether a packet is a command or a value; for example, 0 may represent a value and 1 a command. The core determines the destination address, source address, and data type from this data structure. In terms of instruction timing, the present embodiment adopts a conventional six-stage pipeline consisting of an instruction fetch stage, a decode stage, an execute stage, a memory access stage, an alignment stage, and a write-back stage. In terms of instruction set architecture, a reduced instruction set architecture may be used. Following the usual design approach for reduced instruction set architectures, the instructions can be divided by function into register-register instructions, register-immediate instructions, jump instructions, memory access instructions, control instructions, and inter-core communication instructions.
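The packet layout described for FIG. 6 can be illustrated by packing the four fields into a single integer word. The field widths below (8-bit addresses, 32-bit data) are assumptions chosen for the example, since the text names the fields but not their sizes; the class and constant names are likewise hypothetical, and field values are assumed to fit their widths.

```python
from dataclasses import dataclass

# Hypothetical field widths; the text does not specify them.
DST_BITS = 8
SRC_BITS = 8
DATA_BITS = 32


@dataclass(frozen=True)
class Packet:
    is_command: bool  # valid bit: 1 = command, 0 = value (per the text)
    dst_id: int       # destination address
    src_id: int       # source address
    data: int         # payload

    def encode(self) -> int:
        """Pack as [valid bit | dst id | src id | data], valid bit highest."""
        word = int(self.is_command)
        word = (word << DST_BITS) | self.dst_id
        word = (word << SRC_BITS) | self.src_id
        word = (word << DATA_BITS) | self.data
        return word

    @classmethod
    def decode(cls, word: int) -> "Packet":
        """Unpack the fields in the reverse order of encode()."""
        data = word & ((1 << DATA_BITS) - 1)
        word >>= DATA_BITS
        src_id = word & ((1 << SRC_BITS) - 1)
        word >>= SRC_BITS
        dst_id = word & ((1 << DST_BITS) - 1)
        word >>= DST_BITS
        return cls(bool(word & 1), dst_id, src_id, data)


if __name__ == "__main__":
    packet = Packet(is_command=True, dst_id=0x14, src_id=0x03, data=0xDEADBEEF)
    word = packet.encode()
    print(hex(word), Packet.decode(word))
```

A receiving core would inspect the top bit first to decide whether to treat the payload as a command or as a value, then compare `dst_id` with its own address.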
Using the description provided herein, an embodiment may be implemented as a machine, process, or article of manufacture using standard programming and/or engineering techniques to produce programming software, firmware, hardware or any combination thereof.
Any resulting program(s), having computer-readable program code, may be embodied on one or more computer-usable media such as resident memory devices, smart cards or other removable memory devices, or transmitting devices, thereby making a computer program product and article of manufacture according to an embodiment. As such, the terms "article of manufacture" and "computer program product" as used herein are intended to encompass a computer program that exists permanently or temporarily on any non-transitory medium that a computer can use.
As noted above, memory/storage devices include, but are not limited to, magnetic disks, optical disks, removable storage devices such as smart cards, Subscriber Identity Modules (SIMs), Wireless Identification Modules (WIMs), semiconductor memory such as Random Access Memory (RAM), Read Only Memory (ROM), Programmable Read Only Memory (PROM), and the like. Transmission media includes, but is not limited to, transmissions via wireless communication networks, the internet, intranets, telephone/modem-based network communications, hard-wired/cabled communication network, satellite communications, and other fixed or mobile network systems/communication links.
Although specific example embodiments have been disclosed, those skilled in the art will appreciate that changes can be made to the specific example embodiments without departing from the spirit and scope of the invention.
The present invention has been described above on the basis of the embodiments and with reference to the drawings, but it is not limited to those embodiments. Configurations obtained by appropriately combining or replacing parts of the embodiments and modifications, according to layout requirements and the like, also fall within the scope of the present invention. Further, the combinations and processing order of the embodiments may be rearranged as appropriate, and modifications such as various design changes may be applied to the embodiments, based on the knowledge of those skilled in the art; embodiments to which such modifications are applied may also fall within the scope of the present invention.

Claims (14)

1. A big data operation acceleration system, characterized by comprising more than two operation chips and more than two storage units, wherein:
the operation chip comprises at least one first data interface (130), more than two second data interfaces (150, 151, 152, 153), at least two cores (110, 111, 112, 113) and a routing unit (120); the at least one first data interface (130) and the more than two second data interfaces (150, 151, 152, 153) are respectively connected with the routing unit, and the routing unit is connected with the at least two cores (110, 111, 112, 113);
the storage unit comprises more than two third data interfaces (250, 251, 252, 253); the storage unit (20) comprises more than two memories, a routing unit (230) and more than two third data interfaces (250, 251, 252, 253); the more than two third data interfaces (250, 251, 252 and 253) are respectively connected with the routing unit through buses, and the routing unit is connected with the more than two memories;
the second data interface (150, 151, 152, 153) of the arithmetic chip is connected with the third data interface (250, 251, 252, 253) of the storage unit through a bus;
the second data interface and the third data interface are serdes interfaces, and the first data interface is a UART interface of a UART control unit; the arithmetic chip and the storage unit transmit data and control instructions through serdes interfaces and transmission lines;
the number of the operation chips is equal to that of the storage units, and the number of the third data interfaces of the storage units and the number of the second data interfaces of the operation chips are the number of the storage units; each operation chip is connected with all the storage units through a bus;
wherein the memory (240, 241, 242, 243) comprises a memory controller (220, 221, 222, 223) and a memory granule (210, 211, 212, 213), wherein the memory controller is used for writing or reading data to the memory granule according to instructions, and the memory granule is used for storing data.
2. A big data operation acceleration system, characterized by comprising more than two operation chips and more than two storage units, wherein:
the operation chip comprises at least one first data interface (130), more than two second data interfaces (150, 151, 152, 153), at least two cores (110, 111, 112, 113) and a routing unit (120); each second data interface is connected with a core, the at least two cores are connected with the routing unit, and the at least one first data interface (130) is connected with a core (110);
the storage unit comprises more than two third data interfaces (250, 251, 252, 253); the storage unit (20) comprises more than two memories, a routing unit (230) and more than two third data interfaces (250, 251, 252, 253); the more than two third data interfaces (250, 251, 252 and 253) are respectively connected with the routing unit through buses, and the routing unit is connected with the more than two memories;
the second data interface (150, 151, 152, 153) of the arithmetic chip is connected with the third data interface (250, 251, 252, 253) of the storage unit through a bus;
the second data interface and the third data interface are serdes interfaces, and the first data interface is a UART interface of a UART control unit; the arithmetic chip and the storage unit transmit data and control instructions through serdes interfaces and transmission lines;
the number of the operation chips is equal to that of the storage units, and the number of the third data interfaces of the storage units and the number of the second data interfaces of the operation chips are the number of the storage units; each operation chip is connected with all the storage units through a bus;
wherein the memory (240, 241, 242, 243) comprises a memory controller (220, 221, 222, 223) and a memory granule (210, 211, 212, 213), wherein the memory controller is used for writing or reading data to the memory granule according to instructions, and the memory granule is used for storing data.
3. System according to claim 1 or 2, characterized in that said routing unit sends control instructions to an external chip through said at least one first data interface (130).
4. The system of claim 1 or 2, wherein the chips transmit or receive data to or from each other through the second data interface and the storage unit.
5. The system according to claim 1 or 2, wherein the routing unit accepts external data or control commands via the at least one first data interface (130) and sends the accepted external data or control commands to the core or storage unit.
6. The system according to claim 1 or 2, characterized in that the routing unit of the storage unit accepts control instructions via the two or more third data interfaces (250, 251, 252, 253) and sends the control instructions to the respective memories (240, 241, 242, 243) according to addresses in the control instructions.
7. The system according to claim 1 or 2, characterized in that the routing unit of the storage unit sends the acquired data to an arithmetic chip via the two or more third data interfaces (250, 251, 252, 253).
8. The system according to claim 1 or 2, wherein an exclusive storage area and a shared storage area are provided in the storage unit.
9. The system of claim 1 or 2, wherein the storage particles are HMC memory.
10. The system of claim 1 or 2, wherein the two or more operation chips can perform one or more of encryption operation and convolution calculation.
11. The system according to claim 1 or 2, wherein the two or more operation chips respectively perform independent operations, and each calculation unit respectively calculates a result.
12. The system according to claim 1 or 2, wherein the two or more operation chips can perform cooperative operation, and each calculation unit performs operation according to the calculation results of the other two or more operation chips.
13. A big data arithmetic chip, characterized by comprising at least one first data interface (130), more than two second data interfaces (150, 151, 152, 153), at least two cores (110, 111, 112, 113) and a routing unit (120); the at least one first data interface (130) and the more than two second data interfaces (150, 151, 152, 153) are respectively connected with the routing unit, and the routing unit is connected with the at least two cores (110, 111, 112, 113); the second data interface and the third data interface are serdes interfaces;
the second data interfaces (150, 151, 152, 153) of the arithmetic chip are connected with the storage unit through a bus; wherein the storage unit comprises more than two memories; the storage unit comprises more than two third data interfaces; the second data interface and the third data interface are serdes interfaces, and the first data interface is a UART interface of a UART control unit; the arithmetic chip and the storage unit transmit data and control instructions through serdes interfaces and transmission lines;
the number of the operation chips is equal to that of the storage units, and the number of the third data interfaces of the storage units and the number of the second data interfaces of the operation chips are the number of the storage units; each operation chip is connected with all the storage units through a bus;
the memory comprises a plurality of memory controllers and a plurality of memory particles, wherein the memory controllers are used for writing or reading data to the memory particles according to instructions, and the memory particles are used for storing data.
14. A big data arithmetic chip, characterized by comprising at least one first data interface (130), more than two second data interfaces (150, 151, 152, 153), at least two cores (110, 111, 112, 113) and a routing unit (120); each second data interface is connected with a core, the at least two cores are connected with the routing unit, and the at least one first data interface (130) is connected with a core (110); the second data interface and the third data interface are serdes interfaces;
the second data interfaces (150, 151, 152, 153) of the arithmetic chip are connected with the storage unit through a bus; wherein the storage unit comprises more than two memories; the storage unit comprises more than two third data interfaces; the second data interface and the third data interface are serdes interfaces, and the first data interface is a UART interface of a UART control unit; the arithmetic chip and the storage unit transmit data and control instructions through serdes interfaces and transmission lines;
the number of the operation chips is equal to that of the storage units, and the number of the third data interfaces of the storage units and the number of the second data interfaces of the operation chips are the number of the storage units; each operation chip is connected with all the storage units through a bus;
the memory comprises a plurality of memory controllers and a plurality of memory particles, wherein the memory controllers are used for writing or reading data to the memory particles according to instructions, and the memory particles are used for storing data.
CN201880002364.XA 2018-10-30 2018-10-30 Big data operation acceleration system and chip Active CN109564562B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/112688 WO2020087276A1 (en) 2018-10-30 2018-10-30 Big data operation acceleration system and chip

Publications (2)

Publication Number Publication Date
CN109564562A CN109564562A (en) 2019-04-02
CN109564562B CN109564562B (en) 2022-05-13

Family

ID=65872661

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201880002364.XA Active CN109564562B (en) 2018-10-30 2018-10-30 Big data operation acceleration system and chip

Country Status (2)

Country Link
CN (1) CN109564562B (en)
WO (1) WO2020087276A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112214448B (en) * 2020-10-10 2024-04-09 声龙(新加坡)私人有限公司 Data dynamic reconstruction circuit and method of heterogeneous integrated workload proving operation chip
CN114691591A (en) * 2020-12-31 2022-07-01 中科寒武纪科技股份有限公司 Circuit, method and system for inter-chip communication
CN114003552B (en) * 2021-12-30 2022-03-29 中科声龙科技发展(北京)有限公司 Workload proving operation method, workload proving chip and upper computer

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107451075A (en) * 2017-09-22 2017-12-08 算丰科技(北京)有限公司 Data processing chip and system, data storage forwarding and reading and processing method
CN108536642A (en) * 2018-06-13 2018-09-14 北京比特大陆科技有限公司 Big data operation acceleration system and chip
CN209784995U (en) * 2018-10-30 2019-12-13 北京比特大陆科技有限公司 Big data operation acceleration system and chip

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7593457B2 (en) * 2004-01-30 2009-09-22 Broadcom Corporation Transceiver system and method having a transmit clock signal phase that is phase-locked with a receive clock signal phase
CN102314377B (en) * 2010-06-30 2014-08-06 国际商业机器公司 Accelerator and method thereof for supporting virtual machine migration
CN103634945A (en) * 2013-11-21 2014-03-12 安徽海聚信息科技有限责任公司 SOC-based high-performance cloud terminal
CN105550140B (en) * 2014-11-03 2018-11-09 联想(北京)有限公司 A kind of electronic equipment and data processing method
CN105183683B (en) * 2015-08-31 2018-06-29 浪潮(北京)电子信息产业有限公司 A kind of more fpga chip accelerator cards

Also Published As

Publication number Publication date
WO2020087276A1 (en) 2020-05-07
CN109564562A (en) 2019-04-02

Similar Documents

Publication Publication Date Title
CN109564562B (en) Big data operation acceleration system and chip
CN105740195B (en) Method and apparatus for enhanced data bus inversion encoding of OR chained buses
CN108536642A (en) Big data operation acceleration system and chip
US8738852B2 (en) Memory controller and a dynamic random access memory interface
EP1963978A1 (en) Memory system with both single and consolidated commands
CN103180817A (en) Storage expansion apparatus and server
JP2004062900A (en) Memory controller for increasing bas band width, data transmission method using the same and computer system having the same
CN209784995U (en) Big data operation acceleration system and chip
CN114443521A (en) Sum device for improving transmission rate between CPU and DDR5DIMM
CN209560543U (en) Big data operation chip
CN112740193A (en) Method for accelerating system execution operation of big data operation
CN114385534A (en) Data processing method and device
CN115994115B (en) Chip control method, chip set and electronic equipment
US11082327B2 (en) System and method for computational transport network-on-chip (NoC)
EP2801032B1 (en) Bimodal functionality between coherent link and memory expansion
JP2003050788A (en) Apparatus and method for distribution of signal from high level data link controller to multiple digital signal processor core
CN115129657A (en) Programmable logic resource expansion device and server
US11544009B2 (en) Heterogeneous computation and hierarchical memory image sensing pipeline
WO2020087278A1 (en) Big data computing acceleration system and method
CN209543343U (en) Big data operation acceleration system
CN208298179U (en) Big data operation acceleration system and chip
CN115344393A (en) Service processing method and related equipment
WO2020087239A1 (en) Big data computing acceleration system
CN109643301B (en) Multi-core chip data bus wiring structure and data transmission method
WO2020087243A1 (en) Big data computing chip

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20210813

Address after: 100192 Building No. 25, No. 1 Hospital, Baosheng South Road, Haidian District, Beijing, No. 301

Applicant after: SUANFENG TECHNOLOGY (BEIJING) Co.,Ltd.

Address before: 100192 2nd Floor, Building 25, No. 1 Hospital, Baosheng South Road, Haidian District, Beijing

Applicant before: BITMAIN TECHNOLOGIES Inc.

TA01 Transfer of patent application right

Effective date of registration: 20220225

Address after: Room 501-1, unit 4, floor 5, building 2, yard 9, FengHao East Road, Haidian District, Beijing 100089

Applicant after: Beijing suneng Technology Co.,Ltd.

Address before: 100192 Building No. 25, No. 1 Hospital, Baosheng South Road, Haidian District, Beijing, No. 301

Applicant before: SUANFENG TECHNOLOGY (BEIJING) CO.,LTD.

GR01 Patent grant