WO2020087278A1

WO2020087278A1 - Big data computing acceleration system and method

Info

Publication number: WO2020087278A1
Application number: PCT/CN2018/112693
Authority: WO
Inventors: 桂文明
Original assignee: 北京比特大陆科技有限公司
Priority date: 2018-10-30
Filing date: 2018-10-30
Publication date: 2020-05-07

Abstract

Provided are a big data computing acceleration system and method. By means of setting multiple cores in a chip, each core performs computing and storage control functions, and at least one storage unit is connected to each core outside the chip. By means of the technical solution of the present invention, each core reads a storage unit connected to itself and storage units connected to other cores so as to achieve the technical effect of each core being able to have a large-capacity memory, thereby reducing the number of times data is moved into or out of a memory from an external storage space and accelerating a data processing speed. In addition, multiple cores can respectively perform independent computing or cooperative computing, which also accelerates the data processing speed.

Description

Big data operation acceleration system and method

Technical field

Embodiments of the present invention relate to the field of integrated circuits, and in particular, to a system and method for accelerating big data operations.

Background technique

An application-specific integrated circuit (Application Specific Integrated Circuit, ASIC) is an integrated circuit designed and manufactured to meet the requirements of a specific user or the needs of a specific electronic system. Compared with general-purpose integrated circuits, ASICs offer smaller size, lower power consumption, higher reliability, higher performance, better confidentiality, and lower cost.

With the development of science and technology, more and more fields, such as artificial intelligence and security computing, involve specific calculations with a large amount of computation. For such specific operations, ASIC chips offer advantages such as fast operation and low power consumption. At the same time, in these computation-intensive fields, it is usually necessary to control N operation chips working simultaneously in order to improve data processing speed and processing capacity. As data accuracy continues to improve, more and more data needs to be calculated in fields such as artificial intelligence and security computing. To store this data, an ASIC chip is generally configured with multiple storage units; for example, one ASIC chip may need to be configured with four 2G memories. In this way, when N operation chips work at the same time, 4N 2G memories are needed. However, when multiple operation chips work at the same time, the amount of data stored does not exceed 2G, which wastes storage units and increases system cost.

Summary of the invention

The embodiments of the present invention provide a big data operation acceleration system and method in which two or more ASIC operation chips are connected to two or more storage units through a bus, and the operation chips exchange data through the storage units. This not only reduces the number of storage units but also reduces the connection lines between ASIC operation chips and simplifies the system structure. Because each ASIC operation chip is connected to multiple storage units separately, the conflicts that arise when a bus scheme is used do not occur, and there is no need to provide a cache for each ASIC operation chip.

To achieve the above objective, according to a first aspect of this embodiment, a big data operation acceleration system is provided, including two or more operation chips and two or more storage units. Each operation chip includes at least one first data interface (130) and two or more second data interfaces (150, 151, 152, 153), and each storage unit includes two or more second data interfaces (250, 251, 252, 253). The second data interfaces (150, 151, 152, 153) of the operation chip are connected one-to-one, through a bus, with the second data interfaces (250, 251, 252, 253) of the storage units, and are used for transmitting data or control instructions. The at least one first data interface (130) of each of the two or more operation chips is connected through a bus and is used to transmit control instructions.

According to a second aspect of this embodiment, a method for performing an operation with the above system is provided, in which:

The operation chip accepts external data through the at least one first data interface;

The operation chip stores the external data to at least one storage unit of the two or more storage units through the two or more second data interfaces;

The operation chip accepts external control instructions through the at least one first data interface;

The operation chip obtains data from the storage unit through the two or more second data interfaces;

The operation chip operates on the data to obtain an operation result or intermediate data;

The operation chip stores the operation result or intermediate data to at least one storage unit of the two or more storage units through the two or more second data interfaces;

The operation chip obtains the operation result or intermediate data from the storage unit through the two or more second data interfaces, and feeds the operation result or intermediate data back to the outside through the at least one first data interface.
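The method steps above can be illustrated with a minimal sketch. The class names, the `sum` command, and the choice of which storage unit holds the input and the result are assumptions made only for illustration; the patent does not prescribe them.

```python
from dataclasses import dataclass, field


@dataclass
class StorageUnit:
    """Hypothetical stand-in for one external storage unit."""
    cells: dict = field(default_factory=dict)


@dataclass
class OperationChip:
    """Hypothetical operation chip with one link per storage unit."""
    storage_units: list  # reached through the second data interfaces

    def store(self, unit: int, key: str, value) -> None:
        self.storage_units[unit].cells[key] = value

    def load(self, unit: int, key: str):
        return self.storage_units[unit].cells[key]

    def run(self, external_data: list, command: str):
        # Accept external data (first data interface) and store it.
        self.store(0, "input", external_data)
        # Accept an external control instruction (first data interface).
        if command != "sum":
            raise ValueError(f"unsupported command: {command}")
        # Fetch the data from a storage unit and operate on it.
        result = sum(self.load(0, "input"))
        # Store the operation result in a storage unit.
        self.store(1, "result", result)
        # Read the result back and feed it to the outside.
        return self.load(1, "result")


units = [StorageUnit(), StorageUnit()]
chip = OperationChip(units)
output = chip.run([1, 2, 3, 4], "sum")
```

Because the result is left in a storage unit rather than inside the chip, any other chip connected to the same storage unit could pick it up, which is the data-exchange path the summary describes.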

By connecting the multiple operation chips in the big data operation acceleration system to each storage unit, the embodiments of the present invention reduce the number of storage units required and reduce the connection lines between ASIC operation chips, which simplifies the system structure. Because each ASIC operation chip is connected to multiple storage units separately, the conflicts that arise when a bus scheme is used do not occur, and there is no need to provide a cache for each ASIC operation chip.

Brief description of the drawings

To explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show exemplary embodiments; a person of ordinary skill in the art may obtain other drawings based on these drawings without creative effort.

FIG. 1 is a schematic structural diagram of a big data operation acceleration system provided by a first embodiment of the present invention;

FIG. 2a is a schematic structural diagram of an operation chip provided by a first embodiment of the present invention;

FIG. 2b is a schematic diagram of the signal flow of the operation chip provided by the first embodiment of the present invention;

FIG. 3a is a schematic structural diagram of an operation chip provided by a second embodiment of the present invention;

FIG. 3b is a schematic diagram of the signal flow of the operation chip provided by the second embodiment of the present invention;

FIG. 4a is a schematic structural diagram of a storage unit provided by a third embodiment of the present invention;

FIG. 4b is a schematic diagram of the signal flow of the storage unit provided by the third embodiment of the present invention;

FIG. 5 is a schematic diagram of the connection structure of a big data operation acceleration system provided by a fourth embodiment of the present invention;

FIG. 6 is a schematic diagram of a data structure provided by a fifth embodiment of the present invention.

Detailed description

Exemplary implementations of this embodiment are described in detail below with reference to the drawings. It should be understood that these implementations are given only to enable those skilled in the art to better understand and implement the present invention, and do not limit the scope of the present invention in any way. On the contrary, these embodiments are provided to make the present disclosure more thorough and complete, and to fully convey the scope of the present disclosure to those skilled in the art.

In addition, it should be noted that the up, down, left, and right directions in each drawing are only examples for specific embodiments. Those skilled in the art may, according to actual needs, change the orientation of some or all of the components shown in the drawings without affecting the ability of each component, or of the system as a whole, to perform its function. A technical solution with such a change of orientation still falls within the protection scope of the present invention.

Multi-core chips are multi-processing systems embodied on a single large-scale integrated semiconductor chip. Typically, two or more chip cores are embodied on a multi-core chip and interconnected by a bus (which can also be formed on the same multi-core chip). There can be from two to many chip cores on the same multi-core chip; the upper limit on the number of cores is set only by manufacturing capabilities and performance constraints. Multi-core chips can have applications that involve special arithmetic and/or logical operations performed in multimedia and signal processing algorithms (such as video encoding/decoding, 2D/3D graphics, audio and voice processing, image processing, telephony, voice recognition and voice synthesis, and encryption processing).

Although only ASICs are mentioned in the background section, the specific wiring implementation in the embodiments can also be applied to multi-core CPUs, GPUs, FPGAs, and the like. In this embodiment, the multiple cores may be identical cores or different cores.

For convenience of explanation, the big data operation acceleration system with 4 operation chips and 4 storage units in FIG. 1 is taken as an example below. Those skilled in the art will understand that choosing 4 operation chips and 4 storage units is only an exemplary description. The number of operation chips may be N, where N is a positive integer greater than or equal to 2; for example, N may be 6, 10, 12, and so on. The number of storage units may be M, where M is a positive integer greater than or equal to 2; for example, M may be 6, 9, 12, and so on. In the embodiment, N and M may be equal or unequal. In this embodiment, the multiple operation chips may be identical operation chips or different operation chips.

FIG. 1 is a schematic structural diagram of a big data operation acceleration system provided by a first embodiment of the present invention. In the embodiment shown in FIG. 1, a big data operation acceleration system with four operation chips and four storage units is taken as an example. Referring to FIG. 1, the big data operation acceleration system includes 4 operation chips (10, 11, 12, 13) and 4 storage units (20, 21, 22, 23), and each operation chip is connected to all storage units. Optionally, the operation chips and the storage units may be connected by a bus or by data lines. The operation chips exchange data through the storage units, and data is not exchanged directly between the operation chips; control instructions are sent between the operation chips.

Each storage unit is provided with a dedicated storage area and a shared storage area. The dedicated storage area stores the temporary calculation results of one operation chip, that is, intermediate results that this operation chip will continue to use and that other operation chips will not use. The shared storage area stores the data operation results of an operation chip that are used by other operation chips or that need to be fed back to the outside. Of course, for convenience of management, the storage unit may also be left undivided.

Here, the storage unit may be a high-speed external memory such as double data rate (Double Data Rate, DDR), serial double data rate (Serial Dual Data Rate, SDDR), DDR2, DDR3, DDR4, graphics double data rate (Graphics Double Data Rate, GDDR) GDDR5 or GDDR6, hybrid memory cube (Hybrid Memory Cube, HMC), or high bandwidth memory (High Bandwidth Memory, HBM). The storage unit is preferably DDR-series memory. DDR memory is double data rate synchronous dynamic random access memory. DDR uses a synchronization circuit so that the main steps of specifying the address and of transferring and outputting data are executed independently while remaining fully synchronized with the CPU. DDR uses delay locked loop (Delay Locked Loop, DLL) technology to provide a data filtering signal; when data is valid, the memory controller can use this signal to locate the data accurately, output it once every 16 times, and resynchronize data from different memory modules. The frequency of DDR memory can be expressed in two ways: operating frequency and equivalent frequency. The operating frequency is the actual operating frequency of the memory particles, but because DDR memory transfers data on both the rising and falling edges of the clock pulse, the equivalent data transfer frequency is twice the operating frequency. DDR2 is a memory technology standard developed by the Joint Electron Device Engineering Council (JEDEC). In each clock cycle, DDR2 memory can read/write data at 4 times the speed of the external bus and can run at 4 times the speed of the internal control bus. DDR3, DDR4, GDDR5, GDDR6, HMC, and HBM memories are all existing technologies and are not described in detail here.
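The split of each storage unit into a dedicated storage area and a shared storage area can be sketched as follows. This is only an illustration: the class name, the method names, and in particular the access check are assumptions. The text says only that other operation chips will not use the dedicated results; it does not say that access is enforced.

```python
class PartitionedStorageUnit:
    """Hypothetical storage unit with per-chip dedicated areas and one shared area."""

    def __init__(self, chip_ids):
        # One private key-value area per operation chip.
        self.dedicated = {chip_id: {} for chip_id in chip_ids}
        # One area visible to every operation chip.
        self.shared = {}

    def write_dedicated(self, chip_id, key, value):
        self.dedicated[chip_id][key] = value

    def read_dedicated(self, requester, owner, key):
        # Assumed policy: only the owning chip may read its temporary results.
        if requester != owner:
            raise PermissionError(f"chip {requester} cannot read area of chip {owner}")
        return self.dedicated[owner][key]

    def write_shared(self, key, value):
        self.shared[key] = value

    def read_shared(self, key):
        return self.shared[key]


unit = PartitionedStorageUnit(chip_ids=[10, 11, 12, 13])
unit.write_dedicated(10, "partial", 21)                         # temporary result of chip 10
unit.write_shared("final", unit.read_dedicated(10, 10, "partial") * 2)
seen_by_chip_11 = unit.read_shared("final")                     # chip 11 consumes the result
```

Leaving the unit undivided, as the text also permits, would correspond to using only the shared area.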

The 4 ASIC operation chips are connected to the 4 storage units through a bus, and the operation chips exchange data through the storage units. This not only reduces the number of storage units but also reduces the connection lines between ASIC operation chips and simplifies the system structure. Because each ASIC operation chip is connected to multiple storage units separately, the conflicts that arise when a bus scheme is used do not occur, and there is no need to provide a cache for each ASIC operation chip.

FIG. 2a is a schematic structural diagram of an operation chip provided by a first embodiment of the present invention. In FIG. 2a, an operation chip with 4 cores is taken as an example. Those skilled in the art will understand that choosing 4 cores is only an exemplary description. The number of cores of the operation chip may be Q, where Q is a positive integer greater than or equal to 2, for example 6, 10, 12, and so on. In this embodiment, the cores of the operation chip may have the same function or different functions.

The 4-core operation chip (10) includes 4 cores (110, 111, 112, 113), a routing unit (120), a data exchange control unit (130), and 4 serdes interfaces (150, 151, 152, 153). The data exchange control unit and the four serdes interfaces are each connected to the routing unit through a bus, and the routing unit is connected to each core. The data exchange control unit can be implemented using any of several protocols, for example universal asynchronous receiver/transmitter (Universal Asynchronous Receiver/Transmitter, UART), serial peripheral interface (Serial Peripheral Interface, SPI), the high-speed serial computer expansion bus standard (Peripheral Component Interconnect Express, PCIE), serializer/deserializer (SERializer/DESerializer, SERDES), universal serial bus (Universal Serial Bus, USB), and so on. In this embodiment, the data exchange control unit is a UART control unit (130). A UART is an asynchronous transceiver that converts the data to be transmitted between serial and parallel form. UARTs are usually integrated into the connections of various communication interfaces. The UART protocol is used here only as an example; other protocols can also be used. The UART control unit (130) can receive external data or control instructions, send control instructions to other chips, receive control instructions from other chips, and feed back operation results or intermediate data to the outside.

Serdes is a mainstream time-division multiplexing (Time Division Multiplexing, TDM), point-to-point (Point to Point, P2P) serial communication technology. Multiple low-speed parallel signals are converted into a high-speed serial signal at the transmitting end, sent through the transmission medium (optical cable or copper wire), and converted back into low-speed parallel signals at the receiving end. This point-to-point serial communication technology makes full use of the channel capacity of the transmission medium, reduces the number of transmission channels and device pins required, and increases the signal transmission speed, thereby greatly reducing communication cost. Of course, other communication interfaces can be used instead of the serdes interface, for example a synchronous serial interface (Synchronous Serial Interface, SSI) or UART. Data and control instructions are transmitted between the chip and the storage unit through the serdes interface and the transmission line.
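The parallel-to-serial and serial-to-parallel conversion described above can be illustrated with a small sketch. It models only the bit ordering; a real serdes link also handles clocking, line coding, and signal recovery, none of which is shown. The function names and the 8-bit word width are assumptions for illustration.

```python
def serialize(words, width=8):
    """Convert parallel words into one bit stream, most significant bit first."""
    bits = []
    for word in words:
        if not 0 <= word < (1 << width):
            raise ValueError(f"word {word} does not fit in {width} bits")
        for shift in range(width - 1, -1, -1):
            bits.append((word >> shift) & 1)
    return bits


def deserialize(bits, width=8):
    """Rebuild parallel words from a bit stream produced by serialize()."""
    if len(bits) % width != 0:
        raise ValueError("bit stream length is not a multiple of the word width")
    words = []
    for start in range(0, len(bits), width):
        word = 0
        for bit in bits[start:start + width]:
            word = (word << 1) | bit
        words.append(word)
    return words


stream = serialize([0xA5, 0x3C])   # transmitting end: 2 parallel bytes -> 16 serial bits
restored = deserialize(stream)     # receiving end: 16 serial bits -> 2 parallel bytes
```

Sending the 16 bits over one lane instead of eight parallel wires is the pin-count saving the paragraph refers to.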

The main functions of a core are to execute external or internal control instructions, perform data calculation, and control data storage.

The routing unit sends data or control instructions to the cores (110, 111, 112, 113) and accepts data or control instructions sent by the cores (110, 111, 112, 113), thereby implementing communication between the cores. On receiving internal or external control instructions, it writes data to a storage unit, reads data from a storage unit, or sends control instructions to a storage unit through the serdes interface. If an internal or external control instruction is used to control other chips, the routing unit sends the control instruction to the UART control unit (130), which sends it to the other chips. If data needs to be sent to other chips, the routing unit transmits the data to a storage unit through the serdes interface, and the other chips obtain the data from the storage unit. If data needs to be received from other chips, the routing unit obtains the data from the storage unit through the serdes interface. The routing unit, together with the UART control unit (130), accepts external control instructions and sends them to each core (110, 111, 112, 113); the UART control unit (130) accepts external data and sends it to a core (110, 111, 112, 113) or to a storage unit according to the external data address. Internal data or internal control instructions are data or control instructions generated by the chip itself; external data or external control instructions are data or control instructions generated outside the chip, such as those sent by an external host or an external network.
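The forwarding rules of the routing unit in the first embodiment can be summarized in a short sketch. The enum names and the string labels for output ports are hypothetical; the sketch only encodes which path the text assigns to each combination of payload kind and destination.

```python
from enum import Enum


class Kind(Enum):
    DATA = "data"
    CONTROL = "control"


class Target(Enum):
    CORE = "core"              # a core on the same chip
    STORAGE = "storage"        # an external storage unit
    OTHER_CHIP = "other_chip"  # another operation chip


def route(kind: Kind, target: Target) -> str:
    """Return the (hypothetical) output port the routing unit would select."""
    if target is Target.CORE:
        return "core"
    if target is Target.STORAGE:
        return "serdes"
    # Destination is another chip.
    if kind is Kind.CONTROL:
        # Control instructions for other chips leave through the UART control unit.
        return "uart"
    # Data for other chips is written to a storage unit, where they fetch it.
    return "serdes"


decisions = {(kind, target): route(kind, target) for kind in Kind for target in Target}
```

The only asymmetric case is the last one: data never travels chip to chip directly, so the same `serdes` port serves both storage accesses and inter-chip data exchange.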

FIG. 2b is a schematic diagram of the signal flow of the operation chip provided by the first embodiment of the present invention. In FIG. 2b, an operation chip with 4 cores is taken as an example. Referring to FIG. 2b, the UART control unit (130) obtains data or control instructions from outside the chip, and the routing unit (120) either sends the data or control instructions to a core according to their address or sends them through a serdes interface to the storage unit connected to that serdes interface. If the destination address of an external control instruction points to another chip, the routing unit sends the control instruction to the UART control unit (130), which sends it to the other chip. The UART control unit (130) sends operation results to the outside according to an external or internal control instruction. The operation results can be obtained from a core of the operation chip, or from a storage unit through the serdes interface connected to that storage unit. The outside mentioned here may be an external host, an external network, an external platform, or the like. The external host can initialize and configure the storage unit parameters through the UART control unit and address multiple storage particles uniformly.

A core can send the routing unit a control instruction to read or write data. The control instruction carries the data address, and the routing unit reads data from or writes data to the storage unit through the serdes interface according to that address. A core can also send data or control instructions to other cores through the routing unit according to the address, and obtain data or control instructions from other cores through the routing unit. The core performs calculations on the acquired data and stores the results in the storage unit. Each storage unit is provided with a dedicated storage area and a shared storage area. The dedicated storage area stores the temporary calculation results of one operation chip, that is, intermediate results that this operation chip will continue to use and that other operation chips will not use. The shared storage area stores the data operation results of an operation chip that are used by other operation chips or that need to be fed back to the outside. If a control instruction generated by a core is used to control the operation of other chips, the routing unit sends the control instruction to the UART control unit (130), which sends it to the other chips. If a control instruction generated by a core is used to control a storage unit, the routing unit sends it to the storage unit through the serdes interface.

FIG. 3a is a schematic structural diagram of an operation chip provided by a second embodiment of the present invention. In FIG. 3a, an operation chip with 4 cores is taken as an example. As can be seen from FIG. 3a, the 4-core operation chip includes 4 cores (110, 111, 112, 113), a routing unit (120), a UART control unit (130), and 4 serdes interfaces (150, 151, 152, 153). Each serdes interface is connected to one core, the 4 cores are connected to the routing unit, and the UART control unit (130) is connected to the core (110).

FIG. 3b is a schematic diagram of the signal flow of the operation chip provided by the second embodiment of the present invention. In FIG. 3b, an operation chip with 4 cores is taken as an example. Referring to FIG. 3b, the UART control unit (130) obtains data or control instructions from outside the chip and transmits them to the core (110) connected to the UART control unit. The core (110) transmits the external data or control instructions to the routing unit (120), and the routing unit sends them to the core (111, 112, 113) corresponding to the address of the data or control instructions. If the destination address of the data or control instruction is a core of the operation chip, the routing unit sends the data or control instruction to that core (110, 111, 112, 113). If the destination address is a storage unit, the core (111, 112, 113) sends the data or control instruction to the corresponding storage unit through its serdes interface (151, 152, 153). The core (110) can also send data or control instructions directly to the corresponding storage unit through the serdes interface (150) connected to it. In this case, the routing unit stores, for every storage unit address, the corresponding serdes interface. If the destination is another operation chip, data is sent by the core (111, 112, 113) to the corresponding storage unit through the serdes interface (151, 152, 153), and control instructions are sent to the other operation chips through the UART control unit.

When a core feeds back an operation result or intermediate data to the outside according to an external or internal control instruction, the core obtains the operation result or intermediate data from the storage unit through its serdes interface and sends it to the routing unit. The routing unit sends the operation result or intermediate data to the core (110) connected to the UART control unit, and the operation result or intermediate data is finally sent to the outside through the UART control unit. If the operation result or intermediate data is obtained through the serdes interface of the core connected to the UART control unit, it is sent to the outside directly through the UART control unit. The outside mentioned here may be an external host, an external network, an external platform, or the like. The external host can initialize and configure the storage unit parameters through the UART control unit and address multiple storage units uniformly.

A core can send control instructions to the routing unit. The routing unit sends the control instructions to other cores, other chips, or storage units according to the address of the control instruction, and the receiving core, chip, or storage unit performs the corresponding operation. When a core sends control instructions or data to other cores, they are forwarded directly through the routing unit. A core sends control instructions to other chips through the UART control unit. When a core sends a control instruction to a storage unit, the routing unit looks up the serdes interface corresponding to the address and sends the control instruction to the core connected to that serdes interface; that core passes the instruction to its serdes interface, and the serdes interface sends the control instruction to the storage unit. When a core sends data to other chips or to a storage unit, the routing unit looks up the serdes interface corresponding to the address and sends the control instruction to the core connected to that serdes interface; that core sends the data to its serdes interface, and the serdes interface sends the data to the storage unit. The other chips then acquire the data from the storage unit. When a core obtains data from a storage unit, the control instruction carries the data address; the routing unit looks up the serdes interface corresponding to the address and sends the control instruction to the core connected to that serdes interface; that core passes it to its serdes interface, and the serdes interface sends the storage unit a read control instruction that carries the destination address and the source address. After the serdes interface obtains the data from the storage unit, the data is sent to the core connected to that serdes interface. That core sends a data packet including the source address and the destination address to the routing unit, and the routing unit sends the data packet to the corresponding core according to the destination address. If a core finds that the destination address is its own address, it takes the data for processing. A core can also send data or commands to other cores through the routing unit, and obtain data or commands from other cores through the routing unit. The core performs calculations on the acquired data and stores the results in the storage unit.

Each storage unit is provided with a dedicated storage area and a shared storage area. The dedicated storage area stores the temporary calculation results of one operation chip, that is, intermediate results that this operation chip will continue to use and that other operation chips will not use. The shared storage area stores the data operation results of an operation chip that are used by other operation chips or that need to be fed back to the outside.
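The read path of the second embodiment, in which the routing unit looks up the serdes interface for an address and forwards the request through the core attached to that interface, can be sketched as below. The reference numerals follow FIG. 3a; the address map (one contiguous range of `UNIT_SIZE` addresses per storage unit, assigned to the serdes interfaces in order) is an assumption made only for illustration.

```python
# Hypothetical address map: each storage unit owns one contiguous address range.
UNIT_SIZE = 0x1000
SERDES_FOR_UNIT = {0: 150, 1: 151, 2: 152, 3: 153}
# Each serdes interface is attached to exactly one core.
CORE_FOR_SERDES = {150: 110, 151: 111, 152: 112, 153: 113}


def lookup(address):
    """Routing-unit table lookup: address -> (attached core, serdes interface)."""
    unit = address // UNIT_SIZE
    if unit not in SERDES_FOR_UNIT:
        raise ValueError(f"address {address:#x} is not mapped to a storage unit")
    serdes = SERDES_FOR_UNIT[unit]
    return CORE_FOR_SERDES[serdes], serdes


def read_path(requesting_core, address):
    """List the hops of a read request and of the data packet that answers it."""
    attached_core, serdes = lookup(address)
    request = [requesting_core, "routing unit", attached_core, serdes, "storage unit"]
    reply = ["storage unit", serdes, attached_core, "routing unit", requesting_core]
    return request, reply


# Core 112 reads an address that falls in the second storage unit's range.
request_hops, reply_hops = read_path(112, 0x1040)
```

The reply simply retraces the request in reverse, which is why the read instruction has to carry both the source and the destination address.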

FIG. 4a is a schematic structural diagram of a storage unit provided by a third embodiment of the present invention. In FIG. 4a, a storage unit corresponding to an operation chip with 4 cores is taken as an example; that is, the storage unit in FIG. 4a corresponds to the operation chip shown in the first embodiment. Referring to FIG. 4a, the storage unit (20) includes C memories. C = 4 is taken as an example here; C may be any positive integer greater than or equal to 2, for example 6, 10, or 12. Each memory (240, 241, 242, 243) includes a storage controller (220, 221, 222, 223) and storage particles (210, 211, 212, 213). The storage controller writes data to or reads data from the storage particles according to instructions, and the storage particles store the data. The storage unit (20) further includes a routing unit (230) and four serdes interfaces (250, 251, 252, 253). The four serdes interfaces are connected to the routing unit through a bus, and the routing unit is connected to each memory.

FIG. 4b is a schematic diagram of the signal flow of the storage unit provided by the third embodiment of the present invention. In FIG. 4b, a storage unit corresponding to an operation chip with 4 cores is taken as an example; that is, the storage unit in FIG. 4b corresponds to the operation chip shown in the first embodiment. Referring to FIG. 4b, the storage unit (20) accepts a control instruction through a serdes interface (250, 251, 252, 253) and passes it to the routing unit (230). The routing unit sends the control instruction to the corresponding memory (240, 241, 242, 243) according to the address in the control instruction, and the storage controller (220, 221, 222, 223) performs the related operation according to the control instruction. For example, it configures the memory parameters at initialization and addresses multiple storage particles uniformly, resets the storage particles according to a reset instruction, or executes write or read instructions. When a data acquisition instruction sent by an operation chip is received through a serdes interface (250, 251, 252, 253), the instruction carries the address of the data to be acquired; the routing unit sends the instruction to the memory according to that address, the storage controller obtains the data from the storage particles, and the data is sent through the serdes interface, according to the source address, to the operation chip that requested it. When a write data instruction and data sent by an operation chip are received through a serdes interface (250, 251, 252, 253), the instruction carries the address to which the data is to be written; the routing unit sends the write data instruction and the data to the memory according to that address, and the storage controller writes the data to the storage particles according to the write data instruction. The write data instruction and the data can be transmitted synchronously or asynchronously.

Each storage unit is provided with a dedicated storage area and a shared storage area. The dedicated storage area stores the temporary calculation results of one operation chip, that is, intermediate results that this operation chip will continue to use and that other operation chips will not use. The shared storage area stores the data operation results of an operation chip that are used by other operation chips or that need to be fed back to the outside.
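The way the storage unit's routing unit dispatches read and write instructions to its memories under uniform addressing can be sketched as follows. The class name, the byte-granular interface, and the rule that a global address is split into a memory index and a local offset by simple division are all assumptions for illustration; the patent states only that the storage particles are addressed uniformly.

```python
class StorageUnitModel:
    """Hypothetical storage unit: a routing unit in front of several memories."""

    def __init__(self, memory_count=4, memory_size=1024):
        self.memory_size = memory_size
        # Each bytearray stands in for one memory (controller plus storage particles).
        self.memories = [bytearray(memory_size) for _ in range(memory_count)]

    def _locate(self, address):
        """Map a global address to (memory index, offset inside that memory)."""
        index, offset = divmod(address, self.memory_size)
        if not 0 <= index < len(self.memories):
            raise IndexError(f"address {address} is outside the storage unit")
        return index, offset

    def write(self, address, payload):
        """Handle a write instruction: store payload bytes starting at address."""
        for i, byte in enumerate(payload):
            index, offset = self._locate(address + i)
            self.memories[index][offset] = byte

    def read(self, address, length, source):
        """Handle a read instruction; the reply is tagged with the requester."""
        locations = (self._locate(address + i) for i in range(length))
        data = bytes(self.memories[index][offset] for index, offset in locations)
        return source, data


unit = StorageUnitModel()
unit.write(1022, b"ABCD")                  # spans the boundary between two memories
reply = unit.read(1022, 4, source=10)      # requested by operation chip 10
```

Because the address space is uniform, the write above lands partly in the first memory and partly in the second without the operation chip needing to know where the boundary is.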

FIG. 5 is a schematic diagram of a connection structure of a big data operation acceleration system provided by a fourth embodiment of the present invention. In FIG. 5, a big data operation acceleration system having 4 arithmetic chips and 4 storage units is described as an example. Referring to FIG. 5, the big data operation acceleration system includes 4 arithmetic chips (10, 11, 12, 13) and 4 storage units (20, 21, 22, 23). The structure of the arithmetic chip may be the chip structure disclosed in the first embodiment and the second embodiment. Of course, the arithmetic chip may also be an equivalently modified chip structure made by those skilled in the art on the basis of the first and second embodiments; such a chip structure is also within the scope of protection of this embodiment. The structure of the storage unit may be the structure of the storage unit disclosed in the third embodiment. Of course, the storage unit may also be an equivalently improved storage unit structure made by those skilled in the art on the basis of the third embodiment, which is likewise within the scope of protection of this embodiment. In the big data operation acceleration system, the UART control unit (130) of the arithmetic chip (10) is connected to an external host, and the UART control units (130) of the chips (10, 11, 12, 13) are connected through a bus. Each serdes interface (150, 151, 152, 153) of each chip (10, 11, 12, 13) is connected to a serdes interface (250, 251, 252, 253) of one storage unit (20, 21, 22, 23), so that each arithmetic chip is connected to all storage units through a bus. The arithmetic chips exchange data through the storage units, and data is not exchanged directly between the arithmetic chips. The internal and external signal flows of the arithmetic chip and the storage unit have been described in detail in the first, second, and third embodiments, and will not be described again here.

Optionally, in the embodiment shown in FIG. 5, the UART control unit (130) in any arithmetic chip may be connected to an external host, and the UART control units (130) in other arithmetic chips are connected in sequence. The arithmetic chip connected to the external host can receive the control command of the external host through the UART control unit (130) and send the control command to other arithmetic chips.

For example, the UART control unit (130) in the arithmetic chip 10 may be connected to an external host, the UART control unit (130) in the arithmetic chip 11 is connected to the UART control unit (130) in the arithmetic chip 10, the UART control unit (130) in the arithmetic chip 12 is connected to the UART control unit (130) in the arithmetic chip 11, and the UART control unit (130) in the arithmetic chip 13 is connected to the UART control unit (130) in the arithmetic chip 12.
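The sequential UART connection can be sketched as a simple chain in which each chip records a control command and passes it to the next chip. The `Chip` class, the `build_chain` helper, and the recursive forwarding are hypothetical names and choices for illustration; the embodiment does not specify how a command is relayed.

```python
class Chip:
    """Hypothetical arithmetic chip with a UART link to one downstream chip."""

    def __init__(self, chip_id):
        self.chip_id = chip_id
        self.next_chip = None   # downstream neighbour on the UART chain
        self.log = []           # control commands received so far

    def receive(self, command):
        # Record the command, then relay it to the next chip if there is one.
        self.log.append(command)
        if self.next_chip is not None:
            self.next_chip.receive(command)


def build_chain(chip_ids):
    """Connect the chips in the given order; the first one faces the host."""
    chips = [Chip(chip_id) for chip_id in chip_ids]
    for upstream, downstream in zip(chips, chips[1:]):
        upstream.next_chip = downstream
    return chips


chips = build_chain([10, 11, 12, 13])
chips[0].receive("start")   # the external host talks only to chip 10
```

After the call, every chip in the chain has logged the `"start"` command, although the host sent it only to chip 10.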

Optionally, the UART control unit (130) in each arithmetic chip (10, 11, 12, 13) may be connected to the external host separately.

When the system is applied in the field of artificial intelligence, the UART control unit (130) of the arithmetic chip (10) stores the picture data or video data sent by the external host to the storage units (20, 21, 22, 23) through the serdes interfaces (150, 151, 152, 153). The arithmetic chips (10, 11, 12, 13) generate a mathematical model of the neural network; the mathematical model may also be stored by the external host in the storage units (20, 21, 22, 23) through the serdes interfaces (150, 151, 152, 153) and read by each arithmetic chip (10, 11, 12, 13). The first layer of the mathematical model of the neural network is run on the arithmetic chip (10). The arithmetic chip (10) reads data from the storage units (20, 21, 22, 23) through the serdes interfaces for operation, and stores the operation result through the serdes interfaces to at least one of the storage units (20, 21, 22, 23). The arithmetic chip (10) sends a control instruction to the arithmetic chip (11) through the UART control unit (130) and starts the arithmetic chip (11). The second layer of the mathematical model of the neural network is run on the arithmetic chip (11); the arithmetic chip (11) reads data from the storage units (20, 21, 22, 23) through the serdes interfaces for operation, and stores the operation result through the serdes interfaces to at least one of the storage units (20, 21, 22, 23). Each chip executes one layer of the neural network and obtains data from the storage units (20, 21, 22, 23) through the serdes interfaces for operation; the final operation result is obtained only after the last layer of the neural network has been computed. The arithmetic chip (10) obtains the operation result from the storage units (20, 21, 22, 23) through the serdes interfaces, and feeds it back to the external host through the UART control unit (130).
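The layer-per-chip flow described above can be sketched as follows. This is a simplified illustration under stated assumptions: a plain dictionary stands in for the storage units, each "layer" is an ordinary Python function, and the key names are hypothetical. The only property taken from the text is that chips never pass data to each other directly; every intermediate result goes through storage.

```python
def run_layers(layers, input_data):
    """Run one layer per chip, handing results over only through storage."""
    shared_storage = {"input": input_data}   # stands in for units 20..23
    key = "input"
    for chip_index, layer in enumerate(layers):
        data = shared_storage[key]           # chip reads via serdes
        result = layer(data)                 # chip computes its layer
        key = f"layer{chip_index}"
        shared_storage[key] = result         # chip writes via serdes
    return shared_storage[key], shared_storage


# Two toy "layers", assumed purely for illustration.
def double(values):
    return [2 * v for v in values]


def add_one(values):
    return [v + 1 for v in values]


final, storage = run_layers([double, add_one], [1, 2, 3])
```

Here `final` holds the output of the last layer, and `storage` retains every intermediate result, mirroring how a later chip picks up the previous chip's output from a storage unit.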

When the system is applied to the field of encrypted digital currency, the UART control unit (130) of the arithmetic chip (10) stores the block information sent by the external host to at least one of the storage units (20, 21, 22, 23). The external host sends control instructions to the four arithmetic chips (10, 11, 12, 13) through the UART control units (130) of the arithmetic chips (10, 11, 12, 13), and the four arithmetic chips (10, 11, 12, 13) start operation. Alternatively, the external host can send a control instruction to the UART control unit (130) of one arithmetic chip (10) for data calculation, and the arithmetic chip (10) sends control instructions to the other three arithmetic chips (11, 12, 13) in sequence for data calculation, whereupon the 4 arithmetic chips (10, 11, 12, 13) start operation. As a further alternative, the external host sends a control instruction to the UART control unit (130) of the first arithmetic chip (10) for data calculation, the first arithmetic chip (10) sends a control instruction to the second arithmetic chip (11), the second arithmetic chip (11) sends a control instruction to the third arithmetic chip (12), and the third arithmetic chip (12) sends a control instruction to the fourth arithmetic chip (13), whereupon the 4 arithmetic chips (10, 11, 12, 13) start operation. The 4 arithmetic chips (10, 11, 12, 13) read the block information data from the storage units through the serdes interfaces and simultaneously perform the proof-of-work calculation. The arithmetic chip (10) obtains the operation result from the storage units (20, 21, 22, 23) and feeds it back to the external host through the UART control unit (130).
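One way several chips could work on the same block information at once is to give each chip its own nonce range. The sketch below assumes this partitioning, a double SHA-256 hash, a 4-byte little-endian nonce appended to the header, and a target expressed as a number of leading zero bits; none of these details is stated in the embodiment, and the chips are simulated one after another rather than in parallel.

```python
import hashlib


def meets_target(header, nonce, zero_bits):
    """True if the double SHA-256 of header+nonce has enough leading zero bits."""
    payload = header + nonce.to_bytes(4, "little")
    digest = hashlib.sha256(hashlib.sha256(payload).digest()).digest()
    return int.from_bytes(digest, "big") >> (256 - zero_bits) == 0


def search_range(header, start, stop, zero_bits):
    """Work of one chip: scan its own nonce range, return the first hit or None."""
    for nonce in range(start, stop):
        if meets_target(header, nonce, zero_bits):
            return nonce
    return None


def mine(header, num_chips, span, zero_bits):
    """Assign chip k the nonces [k*span, (k+1)*span) and collect the results."""
    results = {}
    for chip in range(num_chips):
        start = chip * span
        results[chip] = search_range(header, start, start + span, zero_bits)
    return results


# Four simulated chips, 4096 nonces each, 8 leading zero bits required.
results = mine(b"example block information", num_chips=4, span=4096, zero_bits=8)
```

Each entry of `results` is either a nonce inside that chip's range that satisfies the assumed target, or `None` if the range contains no such nonce.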

In the above embodiment, the number of arithmetic chips and the number of storage units are equal, and the number of second data interfaces of each storage unit and the number of second data interfaces of each arithmetic chip are both equal to the number of storage units.

However, those skilled in the art will appreciate that the number of arithmetic chips and the number of storage units may also be unequal. In this case, the number of second data interfaces of each storage unit is equal to the number of arithmetic chips, and the number of second data interfaces of each arithmetic chip is equal to the number of storage units. For example, when there are four arithmetic chips and five storage units, five second data interfaces are provided on each arithmetic chip and four second data interfaces are provided on each storage unit.
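The interface counts follow directly from the requirement that every arithmetic chip is connected to every storage unit. A small sketch, with a hypothetical function name and an added total link count that the text does not mention explicitly:

```python
def interface_counts(num_chips, num_storage_units):
    """Interfaces needed when every chip is linked to every storage unit."""
    if num_chips < 1 or num_storage_units < 1:
        raise ValueError("at least one chip and one storage unit are required")
    return {
        "per_chip": num_storage_units,          # one link per storage unit
        "per_storage_unit": num_chips,          # one link per chip
        "total_links": num_chips * num_storage_units,
    }


counts = interface_counts(4, 5)
# counts["per_chip"] is 5 and counts["per_storage_unit"] is 4,
# matching the four-chip, five-storage-unit example in the text.
```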

The bus may use a centralized arbitration bus structure or a ring topology bus structure. The bus technology is a common technology in the field, so it will not be described in detail here.

FIG. 6 is a schematic diagram of a data structure provided by a fifth embodiment of the present invention. The data mentioned here covers various kinds of data, such as command data, numeric data, and character data. The data format specifically includes a valid bit (valid), a destination address (dst id), a source address (src id), and the data itself (data). The core can determine from the valid bit whether the data packet is a command or a value; here, it can be assumed that 0 represents a value and 1 represents a command. The core determines the destination address, source address, and data type according to the data structure. From the perspective of instruction operation timing, this embodiment adopts the traditional six-stage pipeline structure, whose stages are instruction fetch, decoding, execution, memory access, alignment, and write-back. From the perspective of the instruction set architecture, a reduced instruction set architecture can be adopted. Following the general design method of a reduced instruction set architecture, the instruction set of the present invention can be divided by function into register-register instructions, register-immediate instructions, jump instructions, memory access instructions, control instructions, and inter-core communication instructions.
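The packet layout can be sketched as a pack/unpack pair. The field order follows the text (valid, dst id, src id, data), but the field widths of 1, 8, 8, and 32 bits are assumptions chosen only to make the example concrete; the embodiment does not give them.

```python
from typing import NamedTuple

DST_BITS = 8     # assumed width of the destination address
SRC_BITS = 8     # assumed width of the source address
DATA_BITS = 32   # assumed width of the payload


class Packet(NamedTuple):
    valid: int    # 0 = value, 1 = command (convention from the text)
    dst_id: int
    src_id: int
    data: int


def pack(packet):
    """Pack the fields into one integer, valid bit in the most significant position."""
    limits = (1, DST_BITS, SRC_BITS, DATA_BITS)
    for value, bits in zip(packet, limits):
        if not 0 <= value < (1 << bits):
            raise ValueError(f"field value {value} does not fit in {bits} bits")
    word = packet.valid
    word = (word << DST_BITS) | packet.dst_id
    word = (word << SRC_BITS) | packet.src_id
    word = (word << DATA_BITS) | packet.data
    return word


def unpack(word):
    """Inverse of pack: peel the fields off from the least significant end."""
    data = word & ((1 << DATA_BITS) - 1)
    word >>= DATA_BITS
    src_id = word & ((1 << SRC_BITS) - 1)
    word >>= SRC_BITS
    dst_id = word & ((1 << DST_BITS) - 1)
    word >>= DST_BITS
    return Packet(word & 1, dst_id, src_id, data)


def is_command(packet):
    return packet.valid == 1
```

A receiving core would call `unpack` on the incoming word and branch on `is_command` before interpreting the payload.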

Using the description provided herein, the embodiments can be implemented as a machine, process, or article of manufacture by using standard programming and / or engineering techniques to produce programming software, firmware, hardware, or any combination thereof.

Any generated program(s) (with computer-readable program code) can be embodied on one or more computer-usable media, such as resident storage devices, smart cards or other removable storage devices, or transmission devices, thereby producing computer program products and articles of manufacture according to the embodiments. As such, the terms "article of manufacture" and "computer program product" as used herein are intended to cover computer programs that are permanently or temporarily present on any non-transitory medium that can be used by computers.

As noted above, memory/storage devices include but are not limited to magnetic disks, optical disks, removable storage devices (such as smart cards, subscriber identity modules (SIM), and wireless identification modules (WIM)), semiconductor memories (such as random access memory (RAM), read only memory (ROM), and programmable read only memory (PROM)), etc. Transmission media include, but are not limited to, transmission via wireless communication networks, the Internet, intranets, telephone/modem-based network communications, hard-wired/cable communications networks, satellite communications, and other fixed or mobile network systems/communication links.

Although specific example embodiments have been disclosed, those skilled in the art will understand that changes can be made to the specific example embodiments without departing from the spirit and scope of the invention.

The present invention has been described above based on the embodiments with reference to the drawings. However, the present invention is not limited to the above-mentioned embodiments; schemes in which parts of each embodiment and each modified example are appropriately combined or replaced according to layout requirements are also included within the scope of the invention. In addition, based on the knowledge of those skilled in the art, the combination and processing order of the embodiments may be appropriately reorganized, or various design changes and other modifications may be applied to the embodiments. Embodiments to which such modifications are applied may also be included within the scope of the present invention.

Although the present invention has described various concepts in detail, those skilled in the art can understand that various modifications and substitutions to those concepts can be implemented within the spirit of the overall teaching disclosed in the present invention. A person skilled in the art can implement the invention set forth in the claims without undue experimentation by using ordinary techniques. It can be understood that the specific concepts disclosed are only illustrative and are not intended to limit the scope of the present invention, which is determined by the full scope of the appended claims and their equivalents.

Claims

A big data operation acceleration system is characterized by comprising more than two operation chips and more than two storage units, wherein,

The arithmetic chip includes at least one first data interface and more than two second data interfaces;

The storage unit includes more than two third data interfaces;

The second data interface of the arithmetic chip is connected to the third data interface of the storage unit, and is used to transmit data or control instructions;

At least one first data interface of the arithmetic chip is connected for transmitting control instructions.
The system according to claim 1, wherein the second data interface and the third data interface are serdes interfaces, and the first data interface is a UART interface of a universal asynchronous receiver-transmitter (UART) control unit.
The system according to claim 1 or 2, wherein the number of the arithmetic chips and the number of the storage units are equal, and the number of the third data interfaces of the storage unit and the number of the second data interfaces of the arithmetic chip are both equal to the number of the storage units.
The system according to claim 1, wherein the number of the arithmetic chips and the number of the storage units are not equal, the number of the third data interfaces of the storage unit is equal to the number of the arithmetic chips, and the number of the second data interfaces of the arithmetic chip is equal to the number of the storage units.
The system according to claim 1 or 2, wherein the arithmetic chip further includes at least two cores and a routing unit; the at least one first data interface and the two or more second data interfaces are respectively connected to the routing unit, and the routing unit is connected to the at least two cores.
The system according to claim 5, wherein the routing unit is configured to send data or control instructions to the core, and receive data or control instructions sent by the core.
The system according to claim 5, wherein the routing unit is configured to, according to a received control instruction, write data to the storage unit, read data from the storage unit, or send a control instruction to the storage unit through the third data interface.
The system according to claim 5, wherein the routing unit sends a control instruction to the external chip through the at least one first data interface.
The system according to claim 5, wherein data is sent or received between the arithmetic chips through the second data interface and the storage unit.
The system according to claim 5, wherein the routing unit receives external data or control instructions through the at least one first data interface, and sends the received external data or control instructions to the kernel or the storage unit.
The system according to claim 1 or 2, wherein the arithmetic chip further includes at least two cores and a routing unit; each second data interface is connected to a core, the at least two cores are connected to the routing unit, and the at least one first data interface is connected to a core.
The system according to claim 11, wherein the at least one first data interface is used to acquire external data or control instructions from outside the chip and transmit the external data or control instructions to the core connected to the at least one first data interface; or, the at least one first data interface is used to feed back to the outside an operation result or intermediate data obtained from the core connected to the at least one first data interface; or, the at least one first data interface is used to send control instructions to an external chip.
The system according to claim 11, wherein the core is used to transmit data or control instructions to the routing unit; or to send data or control instructions to the storage unit through the two or more second data interfaces; or to obtain data from the storage unit through the two or more second data interfaces.
The system according to claim 11, wherein the routing unit is configured to send the data or control instruction to the core corresponding to the data address according to the data or control instruction address.
The system according to claim 1 or 2, wherein the storage unit further includes two or more memories and a routing unit; the two or more third data interfaces are respectively connected to the routing unit, and the routing unit further It is connected to the two or more memories.
The system according to claim 15, wherein the memory includes a storage controller and a storage particle, wherein the storage controller is used to write data to or read data from the storage particle according to an instruction, and the storage particle is used to store data.
The system according to claim 15, wherein the routing unit receives the control instruction through the two or more third data interfaces, and sends the control instruction to the corresponding memory (240, 241, 242, 243) according to the address in the control instruction.
The system according to claim 15, wherein the routing unit is configured to send the acquired data to the arithmetic chip through the two or more third data interfaces.
The system according to claim 15, wherein the storage unit is provided with a dedicated storage area and a shared storage area.
The system according to claim 15, wherein the at least one first data interface is used to initialize and configure the two or more storage units according to a received external command, and to perform unified addressing of the storage particles in the two or more storage units.
The system according to claim 15, wherein the storage particles are hybrid memory cube HMC memories.
A big data operation acceleration system, characterized by comprising more than two operation chips and more than two storage units; the operation chip includes N cores, wherein N is a positive integer greater than or equal to 2; each of the two or more operation chips is connected to all of the two or more storage units.
The system according to claim 22, wherein the two or more arithmetic chips exchange data through the storage unit.
A big data operation acceleration system, characterized by including more than two operation chips and more than two storage units; the operation chip includes at least one first data interface and more than two second data interfaces, and the storage unit includes more than two third data interfaces; each second data interface of the operation chip is connected to a third data interface of each storage unit; and the first data interfaces of the two or more operation chips are connected to one another.
The system according to claim 1, 22 or 24, wherein the two or more arithmetic chips perform one or more of encryption operations and convolution calculations.
The system according to claim 1, 22 or 24, wherein the two or more arithmetic chips respectively perform independent operations, and each arithmetic chip calculates a result separately.
The system according to claim 1, 22 or 24, characterized in that the two or more arithmetic chips perform a cooperative operation, and each arithmetic chip performs an operation based on the calculation results of the other two or more arithmetic chips.
A method for performing calculations according to any one of claims 1, 22 or 24, characterized in that it includes:

The arithmetic chip receives external data through the at least one first data interface;

The arithmetic chip stores external data to at least one storage unit of the two or more storage units through the two or more second data interfaces;

The arithmetic chip receives an external control instruction through the at least one first data interface;

The arithmetic chip obtains data from the storage unit through the two or more second data interfaces;

The arithmetic chip operates on the data to obtain the operation result or intermediate data;

The operation chip stores the operation result or intermediate data to at least one storage unit of the two or more storage units through the two or more second data interfaces;

The operation chip obtains operation results or intermediate data from the storage unit through the two or more second data interfaces, and feeds the operation results or intermediate data to the outside through the at least one first data interface.