CN112740192B - Big data operation acceleration system and data transmission method - Google Patents


Info

Publication number
CN112740192B
Authority
CN
China
Prior art keywords: data, chip, judging unit, interface, address judging
Prior art date: 2018-10-30
Legal status: Active
Application number
CN201880097576.0A
Other languages
Chinese (zh)
Other versions
CN112740192A (en)
Inventor
秦强 (Qin Qiang)
Current Assignee: Bitmain Technologies Inc
Original Assignee: Bitmain Technologies Inc
Priority date: 2018-10-30
Filing date: 2018-10-30
Publication date: 2024-04-30
Application filed by Bitmain Technologies Inc
Publication of CN112740192A: 2021-04-30
Application granted
Publication of CN112740192B: 2024-04-30


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00 — Digital computers in general; Data processing equipment in general
    • G06F 15/16 — Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F 15/163 — Interprocessor communication
    • G06F 15/17 — Interprocessor communication using an input/output type connection, e.g. channel, I/O port

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Advance Control (AREA)
  • Multi Processors (AREA)

Abstract

An embodiment of the invention provides a big data operation acceleration system and a data transmission method. The system comprises two or more operation chips, each comprising N kernel cores, N data channels (lanes), and at least one storage unit, where N is a positive integer greater than or equal to 4. Each data channel (lane) comprises a transmitting interface (tx) and a receiving interface (rx); the kernel cores correspond one-to-one with the data channels (lanes), and each kernel core transmits and receives data through its data channel (lane). The two or more operation chips are connected through the transmitting interfaces (tx) and receiving interfaces (rx) to transmit data, and are connected into a ring. The technical scheme of the embodiments improves the data transmission rate among multiple ASIC chips.

Description

Big data operation acceleration system and data transmission method
Technical Field
The present invention relates to the field of integrated circuits, and in particular, to a big data operation acceleration system and a data transmission method.
Background
An ASIC (Application-Specific Integrated Circuit) is an integrated circuit designed and manufactured to meet the needs of a particular user and a particular electronic system. Because an ASIC is oriented to the requirements of a specific user, it offers, in mass production, smaller size, lower power consumption, higher reliability, better performance, stronger confidentiality, and lower cost than a general-purpose integrated circuit.
With the development of technology, more and more fields, such as artificial intelligence and security computing, involve specific computations with heavy workloads. For such computations, ASIC chips offer advantages such as fast operation and low power consumption. In these fields it is usually necessary to control N operation chips working simultaneously in order to increase the processing speed and capacity. As data precision keeps improving, fields such as artificial intelligence and security computing must operate on ever larger data. For example, a photograph is now typically 3-7 MB, but as the precision of digital cameras and video cameras increases it can reach 10 MB or more, while 30 minutes of video may exceed 1 GB. These fields also require high computing speed and low latency, so improving computing speed and response time has always been a goal of chip design. Because the memory paired with an ASIC chip is generally 64 MB or 128 MB, when the data to be processed exceeds 512 MB the ASIC chip must access memory multiple times, repeatedly carrying data into or out of memory from external storage space, which reduces processing speed. Moreover, a plurality of storage units generally has to be configured for each ASIC chip to hold the data; for example, one ASIC chip may need four 2 GB memories, so N operation chips working simultaneously need 4N such 2 GB memories. Yet when multiple operation chips work together the amount of data actually stored often does not exceed 2 GB, so storage units are wasted and system cost increases.
Designs that handle large amounts of correlated data face two challenges in the prior art: first, the requirement to greatly improve performance; second, in a distributed system, the problem of data correlation must also be solved, namely that data processed in one subsystem must be presented to all other subsystems for confirmation and reprocessing. The time consumed by data processing is generally reduced in two ways: speeding up the clock of the data-processing logic, or increasing the number of concurrent blocks processing the data.
Raising the clock rate is limited by the manufacturing process, so increasing concurrency is the more effective way to improve performance. However, higher concurrency raises the required data bandwidth correspondingly. In a typical system where the data bandwidth depends on the bandwidth provided by DDR, DDR bandwidth does not scale linearly: assuming the initial system contains one set of DDR providing 1x bandwidth, a 2x boost can be achieved with two sets, but a boost of 16x or more is impossible because physical size limits prevent simply instantiating 16 sets of DDR in one system.
When multiple ASIC chips must work cooperatively, the data cannot simply be distributed across multiple unconnected systems for processing, because the data are correlated: each piece of data completed in one processing unit must be confirmed and reprocessed in the other processing units. The problem of multi-system interconnection must therefore be solved in order to increase the rate of data transmission among multiple ASIC chips.
Disclosure of Invention
Embodiments of the invention aim to provide a way of connecting distributed storage through high-speed interfaces, enabling multiple homogeneous systems to process a large amount of correlated data simultaneously. An embodiment provides a big data operation acceleration system that eliminates the chip-external memory: storage units are placed inside the ASIC chip, reducing the time the ASIC chip spends reading data from outside and accelerating computation. The storage units are shared by multiple ASIC chips, which reduces the number of storage units, reduces the wiring among ASIC operation chips, simplifies the system structure, and lowers the cost of the ASIC chips. Meanwhile, serdes interface technology is adopted for data transmission among the operation chips, which improves the data transmission rate among the ASIC chips.
In order to achieve the above purpose, the embodiment of the present invention provides the following technical solutions:
According to a first aspect of an embodiment of the present invention, there is provided a big data operation acceleration system comprising two or more operation chips, where each operation chip includes N kernel cores, N data channels (lanes), and at least one storage unit, N being a positive integer greater than or equal to 4. Each data channel (lane) comprises a transmitting interface (tx) and a receiving interface (rx); the kernel cores correspond one-to-one with the data channels (lanes), and each kernel core transmits and receives data through its data channel (lane). The two or more operation chips are connected through the transmitting interfaces (tx) and receiving interfaces (rx) to transmit data, and are connected into a ring.
According to a second aspect of an embodiment of the present invention, there is provided a data transmission method for a big data operation acceleration system comprising two or more operation chips connected into a ring through transmitting interfaces (tx) and receiving interfaces (rx). After a first operation chip, the data source, generates data, it sends the data through its transmitting interface (tx) to a second operation chip adjacent on one side. The second operation chip splits the data into two paths for transmission, sending the first path to its own kernel core and forwarding the other path through a transmitting interface (tx) to a third operation chip adjacent on its other side.
In embodiments of the invention, multiple chips are arranged in the big data operation acceleration system, each chip containing multiple kernel cores that perform computation and storage control, with at least one storage unit connected to every kernel core in the chip. Each kernel can therefore enjoy a large-capacity memory by reading data both from the storage unit connected to its own chip's kernels and from the storage units connected to the kernels of other operation chips, which reduces how often data must be carried into or out of external storage space and accelerates data processing. Meanwhile, since the multiple cores can operate independently or cooperatively, the data processing speed is further increased. The storage units are shared by multiple ASIC chips, reducing the number of storage units, reducing the wiring among ASIC operation chips, simplifying the system structure, and lowering the cost of the ASIC chips.
Drawings
In order to illustrate the embodiments of the invention or the solutions of the prior art more clearly, the drawings used in describing them are briefly introduced below. Obviously, the drawings described below show only exemplary embodiments, and a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 illustrates a schematic diagram of the structure of a big data operation acceleration system with M ASIC chips according to a first embodiment;
FIG. 2 illustrates a schematic diagram of the structure of an operation chip with 4 cores;
FIG. 3 illustrates a schematic diagram of the structure of a data channel (lane);
FIG. 4a illustrates a schematic diagram of a first embodiment of the structure of a storage unit;
FIG. 4b illustrates a schematic diagram of a second embodiment of the structure of a storage unit;
FIG. 5 illustrates a schematic diagram of the data transfer process of the big data operation acceleration system;
FIG. 6 illustrates a signal flow diagram of a first embodiment of the operation chip with 4 cores;
FIG. 7 illustrates a schematic diagram of a data structure according to the present invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the drawings. It should be understood that these embodiments are given only to enable those skilled in the art to better understand and implement the invention, and do not limit its scope in any way. Rather, they are provided so that this disclosure will be thorough and complete and will fully convey its scope to those skilled in the art.
In addition, it should be noted that the directions up, down, left, and right in the drawings are merely examples for the specific embodiments. Those skilled in the art can reorient some or all of the components shown according to actual needs without affecting the functioning of the components or the system as a whole, and such reoriented technical solutions still fall within the protection scope of the invention.
A multi-core chip is a multiprocessing system embodied on a single large-scale integrated semiconductor chip. Typically, two or more chip cores are embodied on the multi-core chip and interconnected by a bus, which may also be formed on the same chip. The number of cores can range from two to many, the upper limit being set only by manufacturing capability and performance constraints. Multi-core chips are applied to workloads containing specialized arithmetic and/or logic operations, such as the multimedia and signal processing algorithms used in video encoding/decoding, 2D/3D graphics, audio and speech processing, image processing, telephony, speech recognition and sound synthesis, and encryption.
Although only ASICs are mentioned in the background, the wiring scheme of the embodiments may also be applied to other chips having multiple cores, such as CPUs, GPUs, and FPGAs. In the embodiments, the multiple cores may be identical cores or different cores.
FIG. 1 illustrates the structure of a big data operation acceleration system with M ASIC chips according to a first embodiment. As shown in FIG. 1, the system includes M ASIC operation chips, where M is a positive integer greater than or equal to 2, for example 6, 10, or 12. Each operation chip comprises several kernel cores (kernel0, kernel1, kernel2, kernel3) and 4 data channels (lane0, lane1, lane2, lane3); each data channel (lane) comprises a transmitting interface (tx) and a receiving interface (rx), and the kernel cores correspond one-to-one with the data channels (lanes). For example, kernel0 of operation chip 10 is provided with data channel lane0, which has a transmitting interface (lane0 tx) and a receiving interface (lane0 rx); the transmitting interface (lane0 tx) is used by kernel0 to send data or control instructions outside operation chip 10, and the receiving interface (lane0 rx) delivers data or control instructions from outside operation chip 10 to kernel0. The M operation chips are thus connected via the transmitting interfaces (tx) and receiving interfaces (rx) for transferring data or control instructions, and form a closed ring. A storage unit is arranged inside each operation chip, and the 4 kernel cores of each chip are all connected to it; the storage units of the M operation chips store data in a distributed manner, so a kernel core can obtain data from the storage unit of its own chip as well as from the storage units of other operation chips. Because all 4 kernel cores of a chip are connected to its storage unit, the storage unit also mediates data interaction among the 4 kernel cores within the chip. Those skilled in the art will appreciate that 4 cores are chosen here as an example; the number of cores may be N, where N is a positive integer greater than or equal to 4, for example 6, 10, or 12. In this embodiment, the multiple cores may be identical cores or different cores.
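As a concrete illustration of this topology, the following minimal Python sketch (not part of the patent; all class and function names are illustrative assumptions) wires M operation chips, each with N cores and N lanes, into a closed ring:

```python
class Lane:
    """One data channel: a transmitting (tx) and a receiving (rx) interface."""
    def __init__(self, core_id):
        self.core_id = core_id   # lanes correspond one-to-one with cores
        self.tx_peer = None      # rx side of the matching lane on the next chip

class Chip:
    """An operation chip with N kernel cores, each owning one lane."""
    def __init__(self, chip_id, n_cores=4):
        self.chip_id = chip_id
        self.lanes = [Lane(i) for i in range(n_cores)]

def build_ring(m_chips=6, n_cores=4):
    """Connect lane k's tx of chip i to lane k's rx of chip (i+1) mod M."""
    chips = [Chip(i, n_cores) for i in range(m_chips)]
    for i, chip in enumerate(chips):
        next_chip = chips[(i + 1) % m_chips]   # last chip wraps back to chip 0
        for lane, peer in zip(chip.lanes, next_chip.lanes):
            lane.tx_peer = peer
    return chips
```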
The transmitting interface (lane tx) and receiving interface (lane rx) of each data channel (lane) are serdes interfaces, and the operation chips communicate with one another through them. "Serdes" is an acronym for SERializer/DESerializer, a mainstream time-division-multiplexed (TDM), point-to-point (P2P) serial communication technology: multiple low-speed parallel signals are converted into high-speed serial signals at the transmitting end, carried over a transmission medium (optical cable or copper wire), and converted back into low-speed parallel signals at the receiving end. This point-to-point serial technology makes full use of the channel capacity of the transmission medium, reduces the number of required transmission channels and device pins, and increases signal transmission speed, thereby greatly reducing communication cost. Of course, other communication interfaces may be used instead of serdes, for example SSI or UART. The chips transmit data and control instructions to one another through the serdes interfaces.
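To make the serializer/deserializer idea concrete, here is a toy bit-level sketch (an illustrative assumption, not the patent's serdes implementation): parallel words are flattened into a serial bit stream at the transmitting end and reassembled at the receiving end.

```python
def serialize(words, width=8):
    """Bit-serialize parallel words MSB-first (toy model of the tx side)."""
    bits = []
    for w in words:
        bits.extend((w >> i) & 1 for i in reversed(range(width)))
    return bits

def deserialize(bits, width=8):
    """Reassemble parallel words from the serial bit stream (rx side)."""
    return [int("".join(map(str, bits[i:i + width])), 2)
            for i in range(0, len(bits), width)]

# Round trip: what goes in parallel comes out parallel.
assert deserialize(serialize([0x5A, 0x3C])) == [0x5A, 0x3C]
```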
FIG. 2 illustrates a first embodiment of the structure of an operation chip with 4 cores. Those skilled in the art will appreciate that 4 cores are chosen as an example; the number of cores in the operation chip may be N, where N is a positive integer greater than or equal to 2, for example 6, 10, or 12. In this embodiment, the cores of the operation chip may have the same function or different functions.
The 4-core operation chip (1) comprises 4 kernel cores (core0, core1, core2, core3), 4 data channels (lane0, lane1, lane2, lane3), at least one storage unit, and a data exchange control unit, which in this embodiment is a UART control unit. Each data channel (lane) comprises a transmitting interface (lane tx) and a receiving interface (lane rx).
Core0 of the operation chip (1) is connected to the transmitting interface (lane0 tx) and receiving interface (lane0 rx) of its data channel; the transmitting interface (lane0 tx) is used by core0 to send data or control instructions to the operation chips connected to operation chip (1), and the receiving interface (lane0 rx) delivers data or control instructions sent by those chips to core0. Similarly, core1 of the operation chip (1) is connected to the transmitting interface (lane1 tx) and receiving interface (lane1 rx) of its data channel; core2 to the transmitting interface (lane2 tx) and receiving interface (lane2 rx); and core3 to the transmitting interface (lane3 tx) and receiving interface (lane3 rx). The transmitting interfaces (lane tx) and receiving interfaces (lane rx) of the data channels (lanes) are serdes interfaces.
The data exchange control unit is connected to the storage unit and the 4 kernel cores (core0, core1, core2, core3) via a bus, which is not shown in FIG. 2. The data exchange control unit may be implemented with various protocols, for example UART, SPI, PCIE, SERDES, or USB; in this embodiment it is a UART (Universal Asynchronous Receiver/Transmitter) control unit. A UART is an asynchronous receiver-transmitter that converts data to be transmitted between serial and parallel form, and is typically integrated with various communication interfaces. The UART protocol is taken here only as an example; other protocols may be used. The UART control unit receives external data and delivers it to the kernel cores (core0, core1, core2, core3) or to the storage unit according to the data address. It may also receive external control instructions and deliver them to the kernel cores or the storage unit; it may further be used by the operation chip to send internal or external control instructions to other operation chips, to receive control instructions from other chips, and to feed operation results or intermediate data back to the outside. Internal data or control instructions are those generated by the chip itself; external data or control instructions are those generated outside the chip, for example by an external host or external network.
The main functions of the kernel cores (core0, core1, core2, core3) are to execute external or internal control instructions, perform data calculation, and control the storage of data. The kernel cores within an operation chip are connected to its storage unit, reading data from it or writing data into it, so that the storage unit mediates data interaction among the several kernel cores of the chip; the cores may also send control instructions to the storage unit of their own chip. According to the instructions they receive, the kernel cores write data into, read data from, or send control instructions to the storage units of other operation chips through the serdes interfaces; they may likewise send data to, read data from, or send control instructions to the kernel cores of other operation chips through the serdes interfaces.
FIG. 3 illustrates a first embodiment of the structure of a data channel (lane). The data channel (lane) comprises a receiving interface, a transmitting interface, a receiving address judging unit, a transmitting address judging unit, and several registers. One end of the receiving address judging unit is connected to the receiving interface, and the other end is connected to the kernel core through a register; one end of the transmitting address judging unit is connected to the transmitting interface (tx), and the other end is connected to the kernel core through a register; the receiving address judging unit and the transmitting address judging unit are connected to each other through a register. The receiving interface receives a data frame or control instruction sent by the operation chip adjacent on one side and passes it to the receiving address judging unit, which sends it to the kernel core and, at the same time, to the transmitting address judging unit; the transmitting address judging unit passes it to the transmitting interface (tx), which sends it on to the operation chip adjacent on the other side. When the kernel core generates a data frame or control instruction, it sends it to the transmitting address judging unit, which passes it to the transmitting interface, which sends it to the receiving interface of the adjacent operation chip. The registers temporarily store data frames and control instructions.
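The lane's forwarding behavior can be sketched as follows: a minimal model, assuming the registers behave as FIFOs (as claim 5 suggests); all names are illustrative.

```python
from collections import deque

class DataLane:
    """rx -> receiving address judging unit -> (core, tx address judging unit) -> tx."""
    def __init__(self):
        self.to_core = deque()  # register between the rx judging unit and the core
        self.to_tx = deque()    # register between the rx and tx judging units

    def on_receive(self, frame):
        # Receiving address judging unit: deliver the frame to the local core
        # and, at the same time, queue it for forwarding around the ring.
        self.to_core.append(frame)
        self.to_tx.append(frame)

    def on_core_send(self, frame):
        # A frame generated by the local core goes straight to the tx judging unit.
        self.to_tx.append(frame)

    def drain_tx(self, send):
        # Transmitting interface: push queued frames to the next chip's rx.
        while self.to_tx:
            send(self.to_tx.popleft())
```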
FIG. 4a illustrates a first embodiment of the structure of a storage unit. Each operation chip contains N kernel cores that need concurrent random access to data; if N is on the order of 64 or more, the required memory bandwidth of the operation chip is very high, and even GDDR is difficult to achieve it with. The embodiments of the invention therefore obtain high bandwidth using SRAM arrays with large MUX routing. The system shown in FIG. 4a is composed of two levels of storage control units to alleviate congestion in implementation. The storage unit (40) comprises 8 memories (410, ..., 417) connected to a storage control unit (420), which controls data reading from and storage into the memories. Each memory (410, ..., 417) includes at least two storage subunits and a storage control subunit; the storage control subunit is connected to the storage control unit through an interface and controls data reading from and storage into the at least two storage subunits. The storage subunits are SRAM memories.
FIG. 4b illustrates a second embodiment of the structure of a storage unit. In FIG. 4b, a plurality of storage control units (420, 421, 422, 423) may be provided: each core is connected to every one of the storage control units, and each storage control unit is connected to every memory (410, ..., 417). The structure of each memory is identical to that of FIG. 4a and is not described again here.
A kernel core sends the data it generates to at least one storage control unit, which passes the data to a storage control subunit, which stores it in a storage subunit. When the kernel core of an operation chip receives a data-acquisition command sent by another operation chip, it judges from the data address whether the data is stored in the storage unit of its own chip. If so, the kernel core sends a data-read command to the at least one storage control unit; the storage control unit sends the read command to the corresponding storage control subunit, which fetches the data from the storage subunit and returns it to the storage control unit, which returns it to the kernel core. The kernel core then sends the data to the transmitting address judging unit, which passes it to the transmitting interface (tx), which sends it to the adjacent operation chip.
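A sketch of this two-level storage hierarchy follows (sizes and the address-interleaving scheme are assumptions; the patent specifies only the two control levels and the SRAM subunits):

```python
class Memory:
    """One memory: a storage control subunit fronting >= 2 SRAM storage subunits."""
    def __init__(self, n_subunits=2, subunit_words=1024):
        self.subunits = [[0] * subunit_words for _ in range(n_subunits)]
        self.subunit_words = subunit_words

    def read(self, addr):
        bank, off = divmod(addr, self.subunit_words)  # pick SRAM bank, then offset
        return self.subunits[bank][off]

    def write(self, addr, value):
        bank, off = divmod(addr, self.subunit_words)
        self.subunits[bank][off] = value

class StorageControlUnit:
    """First level: routes core requests across the 8 memories (410..417)."""
    def __init__(self, n_memories=8, **mem_kw):
        self.memories = [Memory(**mem_kw) for _ in range(n_memories)]
        m = self.memories[0]
        self.mem_words = m.subunit_words * len(m.subunits)

    def read(self, addr):
        mem, off = divmod(addr, self.mem_words)       # pick memory, then offset
        return self.memories[mem].read(off)

    def write(self, addr, value):
        mem, off = divmod(addr, self.mem_words)
        self.memories[mem].write(off, value)
```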
When the big data operation acceleration system is applied in the field of artificial intelligence, the UART control unit of an operation chip stores picture or video data sent by an external host into a storage unit through a kernel core, and the operation chips run a mathematical model of a neural network; the model may be stored into the storage units by the external host through the UART control units and read by each operation chip. The first layer of the neural network model runs on one operation chip: its kernel cores read data from the storage unit of that chip and/or the storage units of other operation chips, perform the computation, and store the result through the serdes interfaces into at least one storage unit, either of another operation chip or of the chip itself. Operation chip (1) then sends a control instruction through its UART control unit or serdes interface to the next operation chip (2), starting its computation. The second layer of the model runs on operation chip (2) in the same way: its kernel cores read data from local and/or remote storage units, compute, and store the result into at least one storage unit through the serdes interfaces. Each chip executes one layer of the neural network, obtaining data for its computation from the storage units of other operation chips or its own through the serdes interfaces, until the last layer of the network produces the final result. An operation chip then obtains the result from a local or remote storage unit and feeds it back to the external host through the UART control unit.
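The layer-per-chip flow can be summarized in a short illustrative sketch (purely an assumption-based outline; `layers` stands for the neural network model the host loaded into the storage units):

```python
def run_model(chips, layers, activations):
    """Chip k runs layer k: read operands from local/remote storage units,
    compute, and store the result where the next chip can reach it."""
    for chip, layer in zip(chips, layers):
        operands = activations          # fetched from local or remote storage
        activations = layer(operands)   # this chip's layer of the model
        # Result is stored into some chip's storage unit over serdes; the next
        # chip is then started by a control instruction over UART or serdes
        # (transport not modeled here).
    return activations                  # final layer's result, fed back to host
```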
When the big data operation acceleration system is applied in the field of encrypted digital currency, the UART control unit of operation chip (1) stores block information sent by an external host into at least one of the storage units of the operation chips. The external host may send control instructions for data operation to the M operation chips through their UART control units (1, ..., M), starting all M chips at once. Alternatively, the external host may send a control instruction to the UART control unit (130) of one operation chip (1), which then sends control instructions to the other M-1 operation chips in turn; or the external host sends the instruction to the UART control unit of operation chip (1), the first operation chip (1) passes it to the second operation chip (2), the second to the third operation chip (3), the third to the fourth operation chip (4), and so on, until all M chips are computing. The M operation chips obtain data from their own or one another's storage units through the serdes interfaces and perform the proof-of-work computation simultaneously; operation chip (1) obtains the result from the storage units and feeds it back to the external host through the UART control unit.
FIG. 5 illustrates a first embodiment of the data transfer process of the big data operation acceleration system. Because the data are correlated, after each operation chip completes its 1/n share of the work it must transmit its result to all the other chips. In FIG. 5, operation chip n-1 is the source of a data frame and sends it through lane1 tx to operation chip 0; inside operation chip 0 the frame is split into two paths, the first delivered to the core of operation chip 0 and the other forwarded through the lane1 tx channel of operation chip 0, sending the frame on to operation chip 1.
Source ID mechanism: each data frame carries the chip ID of the operation chip that originated it. Every time the frame reaches a new operation chip, that chip inspects the chip ID in the frame; if the ID is found to equal the ID of the next operation chip it is connected to, the frame is not forwarded any further, meaning its life cycle terminates there and it occupies no more bandwidth. This inspection of the chip ID in the data frame may be performed in the core or in the receiving address judging unit.
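A minimal sketch of the source-ID rule (under the assumption, not stated in the patent, that chip IDs follow ring order, so a frame from chip s dies at chip s-1 after every other chip has seen it once):

```python
def should_forward(src_chip_id, current_chip_id, ring_size):
    """Stop forwarding when the *next* chip in the ring is the frame's source."""
    next_chip_id = (current_chip_id + 1) % ring_size
    return next_chip_id != src_chip_id

def broadcast(src_chip_id, ring_size):
    """Walk a frame around the ring; every chip except the source sees it once."""
    seen = []
    chip = (src_chip_id + 1) % ring_size
    while True:
        seen.append(chip)                       # delivered to this chip's core
        if not should_forward(src_chip_id, chip, ring_size):
            break                               # frame's life cycle ends here
        chip = (chip + 1) % ring_size
    return seen

assert broadcast(src_chip_id=2, ring_size=4) == [3, 0, 1]
```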
FIG. 6 illustrates a signal flow diagram of the first embodiment of the operation chip with 4 cores. The UART control unit (130) obtains external data or control instructions and delivers them to the kernel core (110) connected to it. According to the data address, the core (110) either stores external data in the storage unit (120) of its own chip, or passes the data through a signal channel (lane) to the core of the chip corresponding to the data address, which stores it in its local storage unit. Likewise, an external control instruction is executed by the core of this operation chip or passed through a signal channel (lane) to the core of the chip corresponding to the instruction address for execution. When a kernel core needs data, it can obtain it from its local storage unit or from the storage units of other operation chips. To obtain data from another chip, the kernel core (110) broadcasts a data-acquisition control instruction to the connected operation chips through the serdes interface (150) connected to it; each connected chip splits the instruction into two paths, one delivered to its kernel core and the other forwarded to the next chip. If a connected operation chip determines that the data is stored in its local storage unit, its kernel core reads the data from the storage unit and sends it through the serdes interface to the operation chip that issued the data-acquisition instruction. Of course, control instructions between operation chips may also be transmitted through the UART control units. When a kernel core must feed an operation result or intermediate data back to the outside in response to an external or internal control instruction, it obtains the result from the storage unit of its own chip or, through a serdes interface, from the storage units of other operation chips, and sends it out through the UART control unit. "Outside" here may refer to an external host, an external network, an external platform, and so on. The external machine can address the several storage units uniformly through the UART control units to initialize and configure storage-unit parameters.
Naturally, the kernel core performs calculations on the data it obtains and stores the results in the storage unit. Each storage unit is divided into a private storage area and a shared storage area. The private area stores the temporary results of one operation chip, namely intermediate results that the chip itself keeps using but that other chips do not need. The shared area stores the chip's final data results, which are used by other operation chips or must be fed back and transmitted to the outside.
In embodiments of the invention, multiple kernel cores are arranged in a chip, each performing computation and storage control, with at least one storage unit connected to every kernel core. Each kernel can thus enjoy a large-capacity memory by reading both the storage units connected to itself and those connected to other kernels, reducing how often data must be carried into or out of memory from external storage space and accelerating data processing; and since the several cores can operate independently or cooperatively, the data processing speed increases further.
FIG. 7 illustrates a schematic diagram of a data structure according to the present invention. The data referred to here may be of various kinds, such as command data, numerical data, or character data. The data format specifically includes a valid bit, a destination address (dst id), a source address (src id), and the data itself. A core determines from the valid bit whether the packet is a command or a value; here it may be assumed that 0 represents a value and 1 a command. From the data structure the core can determine the destination address, the source address, and the data type. For example, in FIG. 1, when core 50 sends a data-read command to core 10, the valid bit is 1, the destination address is the address of core 10, the source address is the address of core 50, and the data field carries the read command, the data type, or the data address. When core 10 returns data to core 50, the valid bit is 0, the destination address is the address of core 50, the source address is the address of core 10, and the data field carries the data that was read. In terms of instruction timing, this embodiment adopts a conventional six-stage pipeline: fetch, decode, execute, memory access, align, and write-back. In terms of instruction set architecture, a reduced instruction set may be employed; following the general design method of reduced instruction set architectures, the instruction set of the invention can be divided by function into register-register instructions, register-immediate instructions, jump instructions, memory-access instructions, control instructions, and inter-core communication instructions.
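The frame layout of FIG. 7 can be sketched as a bit-packing helper (field widths are assumptions: the patent gives only the field order of valid bit, dst id, src id, data):

```python
VALID_COMMAND, VALID_VALUE = 1, 0   # per the text: 1 = command, 0 = value

def pack_frame(valid, dst_id, src_id, payload):
    """Assume 1-bit valid, 8-bit dst id, 8-bit src id; payload is raw bytes."""
    header = (valid << 16) | (dst_id << 8) | src_id
    return header.to_bytes(3, "big") + payload

def unpack_frame(frame):
    header = int.from_bytes(frame[:3], "big")
    return (header >> 16) & 1, (header >> 8) & 0xFF, header & 0xFF, frame[3:]

# Example from the text: core 50 sends a read command to core 10 ...
cmd = pack_frame(VALID_COMMAND, dst_id=10, src_id=50, payload=b"read@0x40")
# ... and core 10 answers with the value read, source and destination swapped.
reply = pack_frame(VALID_VALUE, dst_id=50, src_id=10, payload=b"\x2a")
assert unpack_frame(cmd)[:3] == (1, 10, 50)
```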
Using the description provided herein, an embodiment may be implemented as a machine, process, or article of manufacture by using standard programming and/or engineering techniques to produce programming software, firmware, hardware, or any combination thereof.
Any generated program(s) (having computer readable program code) may be embodied on one or more computer usable media such as resident memory devices, smart cards or other removable memory devices, or transmitting devices, thereby making computer program products and articles of manufacture according to the embodiments. As such, the terms "article of manufacture" and "computer program product" as used herein are intended to encompass a computer program that exists permanently or temporarily on a non-transitory medium which can be used by any computer.
As noted above, memory/storage devices include, but are not limited to, magnetic disks, optical disks, removable memory devices such as smart cards, subscriber Identity Modules (SIMs), wireless Identity Modules (WIMs), semiconductor memories such as Random Access Memories (RAMs), read Only Memories (ROMs), programmable Read Only Memories (PROMs), and the like. Transmission media includes, but is not limited to, transmissions via wireless communication networks, the internet, intranets, telephone/modem-based network communication, hard-wired/cabled communication network, satellite communication, and other stationary or mobile network systems/communication links.
Although specific example embodiments have been disclosed, those having ordinary skill in the art will understand that changes can be made to the specific example embodiments without departing from the spirit and scope of the invention.
The present invention has been described above by way of embodiments with reference to the drawings, but the invention is not limited to those embodiments; combinations or substitutions of the embodiments and modifications, made as appropriate to meet layout requirements, are also included in the scope of the invention. Further, based on the knowledge of those skilled in the art, the combination and processing sequence of the embodiments may be suitably rearranged, and various design changes and other modifications may be applied to the embodiments; embodiments to which such modifications are applied are also included in the scope of the invention.
While the invention has been described in detail with respect to various concepts, those skilled in the art will appreciate that various modifications and alternatives to those concepts could be developed in light of the overall teachings of the disclosure. The invention as set forth in the claims can be practiced by those of ordinary skill in the art without undue experimentation. It is intended that the specification be considered as exemplary only, with a true scope of the invention being indicated by the following claims and their full range of equivalents.

Claims (9)

1. A big data operation acceleration system, comprising two or more operation chips, wherein each operation chip comprises N kernel cores, N data channel lanes, and at least one storage unit, N being a positive integer greater than or equal to 4; each data channel lane comprises a transmitting interface tx and a receiving interface rx, the kernel cores correspond one-to-one with the data channel lanes, and each kernel core transmits and receives data through its data channel lane; the two or more operation chips are connected through the transmitting interfaces tx and receiving interfaces rx to transmit data, and are connected into a ring;
wherein each memory in the storage unit comprises at least two storage subunits and a storage control subunit, the storage control subunit being connected with a storage control unit through an interface and being used for controlling data reading from or storage into the at least two storage subunits;
and wherein the data channel lane further comprises a receiving address judging unit and a transmitting address judging unit; one end of the receiving address judging unit is connected with the receiving interface and the other end with the kernel core; one end of the transmitting address judging unit is connected with the transmitting interface tx and the other end with the kernel core; the receiving address judging unit and the transmitting address judging unit are connected with each other; and the receiving address judging unit is used for sending a data frame from the receiving interface to the kernel core while also sending the data frame to the transmitting address judging unit.
2. The system of claim 1, wherein the transmitting interfaces tx and receiving interfaces rx of the operation chips are serdes interfaces, and the operation chips communicate with one another through the serdes interfaces.
3. The system according to claim 2, wherein the receiving interface rx receives a data frame transmitted by the operation chip adjacent on one side and passes it to the receiving address judging unit; the receiving address judging unit sends the data frame to the kernel core while also sending it to the transmitting address judging unit; and the transmitting address judging unit receives the data frame and sends it to the transmitting interface tx, which sends it to the operation chip adjacent on the other side.
4. The system according to claim 2, wherein the kernel core generates a data frame and sends it to the transmitting address judging unit, which sends it to the transmitting interface tx, which sends it to the operation chip adjacent on one side.
5. The system according to claim 2, wherein the receiving address judging unit and the transmitting address judging unit are connected to each other through a first-in first-out memory.
6. The system according to claim 2, wherein the kernel core of an operation chip receives a data-acquisition command sent by another operation chip and judges from the data address whether the data is stored in the storage unit of its own chip; if so, the kernel core obtains the data from the at least one storage unit and sends it to the transmitting address judging unit, which sends it to the transmitting interface tx, which sends it to the adjacent operation chip.
7. A data transmission method of a big data operation acceleration system, the system comprising two or more operation chips, the two or more operation chips being connected through transmitting interfaces tx, receiving interfaces rx, receiving address judging units, and transmitting address judging units to transmit data, and being connected into a ring; wherein after a first operation chip, the data source, generates data, the data is sent through the transmitting interface tx to a second operation chip adjacent on one side; and the second operation chip splits the data into two paths for transmission, the first path being sent to the kernel core of the second operation chip and the other path being forwarded through a transmitting interface tx to a third operation chip adjacent on the other side of the second operation chip;
wherein a kernel core sends the data it generates to at least one storage control unit, the at least one storage control unit sends the data to a storage control subunit, and the storage control subunit stores the data in a storage subunit;
and wherein one end of the receiving address judging unit is connected with the receiving interface and the other end with the kernel core; one end of the transmitting address judging unit is connected with the transmitting interface tx and the other end with the kernel core; the receiving address judging unit and the transmitting address judging unit are connected with each other; and the receiving address judging unit is used for sending a data frame from the receiving interface to the kernel core while also sending the data frame to the transmitting address judging unit.
8. The method of claim 7, wherein the data carries the identification ID of the operation chip that is the data source.
9. The method of claim 8, wherein after the data is transmitted to an adjacent operation chip, the adjacent operation chip inspects the chip ID carried in the data; if that ID is found to equal the ID of the next operation chip connected to the adjacent chip, the data is not forwarded.
CN201880097576.0A 2018-10-30 2018-10-30 Big data operation acceleration system and data transmission method Active CN112740192B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/112546 WO2020087246A1 (en) 2018-10-30 2018-10-30 Big data operation acceleration system and data transmission method

Publications (2)

Publication Number Publication Date
CN112740192A (en) 2021-04-30
CN112740192B (en) 2024-04-30

Family

ID=70463294

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201880097576.0A Active CN112740192B (en) 2018-10-30 2018-10-30 Big data operation acceleration system and data transmission method

Country Status (2)

Country Link
CN (1) CN112740192B (en)
WO (1) WO2020087246A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101216814A (en) * 2007-12-26 2008-07-09 杭州华三通信技术有限公司 Communication method and system of multi-nuclear multi-operating system
CN103744644A (en) * 2014-01-13 2014-04-23 上海交通大学 Quad-core processor system built in quad-core structure and data switching method thereof
CN104699531A (en) * 2013-12-09 2015-06-10 超威半导体公司 Voltage dip relieving applied to three-dimensional chip system
CN104865938A (en) * 2015-04-03 2015-08-26 深圳市前海安测信息技术有限公司 Node connection chip applied to assess human body injury condition and node network thereof
CN108536642A (en) * 2018-06-13 2018-09-14 北京比特大陆科技有限公司 Big data operation acceleration system and chip
CN209149287U (en) * 2018-10-30 2019-07-23 北京比特大陆科技有限公司 Big data operation acceleration system

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9703697B2 (en) * 2012-12-27 2017-07-11 Intel Corporation Sharing serial peripheral interface flash memory in a multi-node server system on chip platform environment
US9699079B2 (en) * 2013-12-30 2017-07-04 Netspeed Systems Streaming bridge design with host interfaces and network on chip (NoC) layers


Also Published As

Publication number Publication date
CN112740192A (en) 2021-04-30
WO2020087246A1 (en) 2020-05-07


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant