CN112740192A - Big data operation acceleration system and data transmission method - Google Patents

Big data operation acceleration system and data transmission method

Info

Publication number
CN112740192A
CN112740192A (application CN201880097576.0A)
Authority
CN
China
Prior art keywords
data
chip
core
interface
sending
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201880097576.0A
Other languages
Chinese (zh)
Inventor
秦强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Bitmain Technologies Inc
Original Assignee
Bitmain Technologies Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Bitmain Technologies Inc filed Critical Bitmain Technologies Inc
Publication of CN112740192A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00 - Digital computers in general; Data processing equipment in general
    • G06F15/16 - Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163 - Interprocessor communication
    • G06F15/17 - Interprocessor communication using an input/output type connection, e.g. channel, I/O port

Abstract

Embodiments of the invention provide a big data operation acceleration system and a data transmission method. The system comprises two or more operation chips, where each operation chip comprises N cores, N data channels (lanes), and at least one storage unit, N being a positive integer greater than or equal to 4. Each data channel (lane) comprises a sending interface (tx) and a receiving interface (rx); the cores correspond one-to-one with the data channels (lanes), and each core sends and receives data through its lane. The two or more operation chips are connected through the sending interfaces (tx) and receiving interfaces (rx) to transmit data, and are connected into a ring. The technical scheme of the embodiments improves the data transmission rate among multiple ASIC chips.

Description

Big data operation acceleration system and data transmission method
Technical Field
The invention relates to the field of integrated circuits, in particular to a big data operation acceleration system and a data transmission method.
Background
ASIC (Application-Specific Integrated Circuit) refers to an integrated circuit designed and manufactured to the requirements of a specific user and a specific electronic system. An ASIC is tailored to the needs of a specific user; compared with a general-purpose integrated circuit, it offers smaller size, lower power consumption, improved reliability and performance, enhanced confidentiality, and, in volume production, reduced cost.
With the development of science and technology, more and more fields, such as artificial intelligence and security computation, involve specific calculations with a large computational load. For such specific operations, an ASIC chip can exploit its strengths of fast computation and low power consumption. Meanwhile, in these computation-heavy fields, N operation chips usually must be controlled to work simultaneously in order to raise data processing speed and capacity. As data precision keeps improving, fields such as artificial intelligence and security computation must operate on ever larger data. For example, photos are now typically 3-7 MB, but as the precision of digital and video cameras increases, a photo can reach 10 MB or more, while 30 minutes of video can exceed 1 GB of data. Because these fields demand high computation speed and low latency, improving computation speed and response time has always been a goal of chip design. The memory paired with an ASIC chip is generally 64 MB or 128 MB, so when the data to be processed exceeds 512 MB, the ASIC chip must access memory many times and repeatedly move data into or out of memory from external storage, which reduces processing speed. At the same time, an ASIC chip generally needs to be configured with several storage units to hold the data; for example, if one ASIC chip needs four blocks of 2 GB memory, then N operation chips working simultaneously need 4N such blocks. Yet when multiple operation chips work simultaneously, the stored data often does not exceed 2 GB, which wastes storage units and increases system cost.
Designs that handle large amounts of interrelated data face two challenges in the prior art: first, the requirement to greatly improve performance; second, in a distributed system, the problem of data dependency, i.e., data processed in one subsystem must be presented to all other subsystems for confirmation and reprocessing. The time consumed by data processing is generally reduced in two ways: one is to speed up the clock of the data-processing logic; the other is to increase the number of concurrent blocks processing data.
Under process limitations, the clock rate can be increased only marginally, so increasing concurrency is the more effective way to raise performance. However, once concurrency increases, the demand for data bandwidth rises correspondingly. In a typical system where data bandwidth depends on the bandwidth provided by DDR, DDR bandwidth does not scale linearly. Suppose the initial system contains one DDR bank providing a bandwidth of 1x. If a 2x bandwidth boost is needed, two sets of DDR can be implemented; but if a boost of more than 16x is needed, simply instantiating 16 sets of DDR in one system is impossible because of physical size constraints.
If multiple ASIC chips are required to work cooperatively, the data cannot simply be distributed across several unconnected systems for processing, because the data are interrelated: each piece of data completed in one processing unit must be confirmed and reprocessed in the other processing units. Therefore, to increase the data transmission rate among multiple ASIC chips, the problem of interconnecting the multiple systems must be solved.
Disclosure of Invention
The embodiments of the invention aim to provide a way of connecting distributed storage through high-speed interfaces, so that multiple homogeneous systems can concurrently process a large amount of interrelated data. The embodiments provide a big data operation acceleration system in which the external memory of the chip is omitted and the storage unit is placed inside the ASIC chip, reducing the time spent reading data from outside the ASIC chip and accelerating the chip's computation. The storage units are shared among the ASIC chips, which reduces the number of storage units and the wiring between the ASIC operation chips, simplifies the system structure, and lowers the cost of the ASIC chips. Meanwhile, serdes interface technology is used for data transmission among the operation chips, which improves the data transmission rate among multiple ASIC chips.
In order to achieve the above purpose, the embodiments of the present invention provide the following technical solutions:
according to a first aspect of the embodiments of the invention, there is provided a big data operation acceleration system comprising two or more operation chips, where each operation chip comprises N cores, N data channels (lanes), and at least one storage unit, N being a positive integer greater than or equal to 4. Each data channel (lane) comprises a sending interface (tx) and a receiving interface (rx); the cores correspond one-to-one with the data channels (lanes), and each core sends and receives data through its lane. The two or more operation chips are connected through the sending interfaces (tx) and receiving interfaces (rx) to transmit data, and are connected into a ring.
According to a second aspect of the embodiments of the invention, there is provided a data transmission method for a big data operation acceleration system. The system comprises two or more operation chips connected through sending interfaces (tx) and receiving interfaces (rx) to transmit data, the chips being connected into a ring. After the first operation chip, the data source, generates data, it sends the data through its sending interface (tx) to the adjacent second operation chip. The second operation chip divides the data into two paths for transmission: the first path goes to the cores of the second operation chip, and the other path is forwarded through a sending interface (tx) to the third operation chip adjacent to the second operation chip.
In the embodiments of the invention, the big data operation acceleration system contains multiple chips; each chip contains multiple cores, each core performs computation and storage-control functions, and each core is connected to at least one storage unit inside its chip. By reading the data in its own connected storage unit as well as the storage units connected to the cores of the other operation chips, each core effectively has a large-capacity memory, which reduces the number of times data must be moved into or out of memory from external storage and accelerates data processing. Meanwhile, the multiple cores can compute independently or cooperatively, further increasing data processing speed. Because the storage units are shared among the ASIC chips, the number of storage units is reduced, the wiring between the ASIC operation chips is reduced, the system structure is simplified, and the cost of the ASIC chips is lowered.
Drawings
To illustrate the embodiments of the invention or the technical solutions in the prior art more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only exemplary embodiments, and those skilled in the art can derive other drawings from them without inventive effort.
FIG. 1 illustrates a schematic diagram of a first embodiment of a big data operation acceleration system with M ASIC chips;
FIG. 2 illustrates a schematic diagram of an operation chip having 4 cores;
FIG. 3 illustrates a schematic diagram of the structure of a data channel (lane);
FIG. 4a illustrates a schematic structural diagram of a first embodiment of a storage unit;
FIG. 4b illustrates a schematic structural diagram of a second embodiment of a storage unit;
FIG. 5 illustrates a schematic diagram of a data transfer process of a big data operation acceleration system;
FIG. 6 is a signal flow diagram of an operation chip with 4 cores according to a first embodiment;
FIG. 7 illustrates a data structure diagram according to the invention.
Detailed Description
Exemplary embodiments of the present invention will be described in detail below based on the accompanying drawings, and it should be understood that these embodiments are given only for the purpose of enabling those skilled in the art to better understand and implement the present invention, and do not limit the scope of the present invention in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
It should be noted that the directions of up, down, left and right in the drawings are merely examples of specific embodiments, and those skilled in the art can change the directions of a part or all of the components shown in the drawings according to actual needs without affecting the functions of the components or the system as a whole, and such a technical solution with changed directions still belongs to the protection scope of the present invention.
A multi-core chip is a multi-processing system embodied on a single large-scale integrated semiconductor chip. Typically, two or more chip cores are embodied on a multi-core chip and interconnected by a bus (which may also be formed on the same multi-core chip). The number of chip cores on one multi-core chip can range from two to many, the upper limit being set only by manufacturing capability and performance constraints. Applications of multi-core chips include the specialized arithmetic and/or logical operations performed in multimedia and signal-processing algorithms such as video encoding/decoding, 2D/3D graphics, audio and speech processing, image processing, telephony, speech recognition and sound synthesis, and encryption.
Although only ASICs are mentioned in the background, the specific wiring implementation in the embodiments may also be applied to CPUs, GPUs, FPGAs, and other devices having multiple cores. In the embodiments, the multiple cores may be identical cores or different cores.
FIG. 1 illustrates a schematic diagram of a first embodiment of a big data operation acceleration system with M ASIC chips. As shown in FIG. 1, the system includes M ASIC operation chips, where M is a positive integer greater than or equal to 2, for example 6, 10, or 12. Each operation chip comprises a plurality of cores (core0, core1, core2, core3) and 4 data channels (lane0, lane1, lane2, lane3). Each data channel (lane) comprises a sending interface (tx) and a receiving interface (rx), and the cores correspond one-to-one with the lanes. For example, core0 of operation chip 10 has data channel lane0 with a sending interface (lane0tx) and a receiving interface (lane0rx): the sending interface (lane0tx) is used by core0 to send data or control instructions out of operation chip 10, and the receiving interface (lane0rx) delivers data or control instructions arriving at operation chip 10 to core0. The M operation chips are thus connected through the sending interfaces (tx) and receiving interfaces (rx) for transmission of data or control instructions, and the M operation chips form a closed ring. A storage unit is provided inside each operation chip, and the 4 cores of the chip are all connected to it; the storage units of the M operation chips store data in a distributed manner, so a core can obtain data from its own chip's storage unit or from the storage units of the other operation chips. Because all 4 cores in an operation chip are connected to the storage unit, data interaction among the 4 cores is likewise achieved through the storage unit. Those skilled in the art will recognize that 4 cores is only an exemplary illustration; the number of cores may be N, where N is a positive integer greater than or equal to 4, for example 6, 10, or 12. In this embodiment, the multiple cores may be identical cores or different cores.
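For illustration only, the ring connectivity just described can be modeled in ordinary software. The following C sketch builds M chips, each with N cores and one lane per core, and closes them into a ring; all names (chip_t, lane_t, build_ring) and the storage size are assumptions of the sketch, not part of the embodiment.

```c
#define N_CORES 4 /* cores (and lanes) per operation chip in this sketch */

/* One data channel (lane): a sending interface (tx) and a receiving
 * interface (rx), modeled here as the indices of the chips they face. */
typedef struct {
    int tx_to;   /* chip this lane's tx interface sends to      */
    int rx_from; /* chip this lane's rx interface receives from */
} lane_t;

/* One operation chip: N cores in one-to-one correspondence with N lanes,
 * plus a storage unit (a plain buffer stands in for it here). */
typedef struct {
    int id;
    lane_t lane[N_CORES];        /* lane[i] belongs to core i     */
    unsigned char storage[1024]; /* stand-in for the storage unit */
} chip_t;

/* Close M chips into a ring: chip k sends to chip (k+1) % M and receives
 * from chip (k-1+M) % M on every lane. */
void build_ring(chip_t *chips, int m)
{
    for (int k = 0; k < m; k++) {
        chips[k].id = k;
        for (int i = 0; i < N_CORES; i++) {
            chips[k].lane[i].tx_to   = (k + 1) % m;
            chips[k].lane[i].rx_from = (k - 1 + m) % m;
        }
    }
}
```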
The sending interface (lane tx) and receiving interface (lane rx) of each data channel (lane) are serdes interfaces, and the operation chips communicate with one another through these serdes interfaces. serdes is an acronym for SERializer/DESerializer, a mainstream time-division-multiplexed (TDM), point-to-point (P2P) serial communication technology: at the transmitting end, multiple low-speed parallel signals are converted into a high-speed serial signal, which travels over the transmission medium (optical cable or copper wire), and at the receiving end the high-speed serial signal is converted back into low-speed parallel signals. This point-to-point serial technology makes full use of the channel capacity of the transmission medium and reduces the number of required transmission channels and device pins while raising signal transmission speed, thereby greatly reducing communication cost. Of course, other communication interfaces, such as SSI or UART, may be used instead of serdes. Data and control instructions are transmitted between the chips through the serdes interfaces.
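To make the parallel-to-serial idea concrete, the toy C functions below convert one 32-bit parallel word to a stream of single bits and back. This models only the logical conversion a serdes performs, not a physical serdes PHY, and the function names are assumptions.

```c
#include <stdint.h>

/* Transmit side: shift one 32-bit parallel word out MSB-first as bits. */
void serialize(uint32_t word, uint8_t bits[32])
{
    for (int i = 0; i < 32; i++)
        bits[i] = (uint8_t)((word >> (31 - i)) & 1u);
}

/* Receive side: reassemble 32 received bits into one parallel word. */
uint32_t deserialize(const uint8_t bits[32])
{
    uint32_t word = 0;
    for (int i = 0; i < 32; i++)
        word = (word << 1) | (bits[i] & 1u);
    return word;
}
```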
FIG. 2 illustrates a schematic diagram of a first embodiment of an operation chip having 4 cores. As those skilled in the art will understand, 4 cores is only an exemplary illustration; the number of cores of the operation chip may be N, where N is a positive integer greater than or equal to 2, for example 6, 10, or 12. In this embodiment, the cores of the operation chip may have the same function or different functions.
The 4-core operation chip (1) comprises 4 cores, namely core0, core1, core2 and core3, 4 data channels (lane0, lane1, lane2, lane3), at least one storage unit, and a data exchange control unit, which here is a UART control unit. Each data channel (lane) comprises a sending interface (lane tx) and a receiving interface (lane rx).
core0 of operation chip (1) is connected to the sending interface (lane0tx) and receiving interface (lane0rx) of its data channel: the sending interface (lane0tx) is used by core0 to send data or control instructions to the operation chip connected to operation chip (1), and the receiving interface (lane0rx) delivers data or control instructions from that connected operation chip to core0. Similarly, core1 of operation chip (1) is connected to the sending interface (lane1tx) and receiving interface (lane1rx) of its data channel, core2 to the sending interface (lane2tx) and receiving interface (lane2rx), and core3 to the sending interface (lane3tx) and receiving interface (lane3rx). The sending interfaces (lane tx) and receiving interfaces (lane rx) of the data channels (lanes) are serdes interfaces.
A data exchange control unit is connected to the storage unit and the 4 cores (core0, core1, core2, core3) through buses, which are not drawn in FIG. 2. The data exchange control unit may be implemented with various protocols, such as UART, SPI, PCIE, SERDES, or USB; in this embodiment it is a UART (Universal Asynchronous Receiver/Transmitter) control unit. A UART is an asynchronous receiver-transmitter that converts data to be transmitted between serial and parallel forms and is typically integrated in the links between various communication interfaces. The UART protocol is used here only as an example; other protocols may be used. The UART control unit receives external data and delivers it to a core (core0, core1, core2, core3) or to the storage unit according to the data's address. It can also receive external control instructions and deliver them to a core or the storage unit, and it can be used by the operation chip to send internal or external control instructions to other operation chips, receive control instructions from other chips, and feed computation results or intermediate data back to the outside. Internal data or control instructions are those generated by the chip itself; external data or control instructions are those generated outside the chip, for example by an external host or an external network.
The main functions of the cores (core0, core1, core2, core3) are executing external or internal control instructions, performing data computation, and controlling the storage of data. All cores in an operation chip are connected to its storage unit, reading data from or writing data to it, so that interaction of core data within the chip is achieved through the storage unit; the cores may also send control instructions to the chip's storage unit. According to its instructions, a core can write data to, read data from, or send control instructions to the storage units of other operation chips through the serdes interfaces; it can likewise send data to, read data from, or send control instructions to the cores of other operation chips through the serdes interfaces.
FIG. 3 illustrates a first embodiment of the structure of a data channel (lane). The data channel (lane) comprises a receiving interface, a sending interface, a receiving address judgment unit, a sending address judgment unit, and a plurality of registers. One end of the receiving address judgment unit is connected to the receiving interface, and the other end is connected to the core through a register; one end of the sending address judgment unit is connected to the sending interface (tx), and the other end is connected to the core through a register; the receiving address judgment unit and the sending address judgment unit are connected to each other through a register. The receiving interface receives a data frame or control instruction sent by the adjacent operation chip connected to it and passes it to the receiving address judgment unit, which sends it to the core and simultaneously to the sending address judgment unit; the sending address judgment unit passes the data frame or control instruction to the sending interface (tx), which sends it to the adjacent operation chip on the other side. When the core generates a data frame or control instruction, it goes to the sending address judgment unit, then to the sending interface, which sends it to the receiving interface of the adjacent operation chip. The registers temporarily store data frames and control instructions.
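The two propagation paths inside a lane can be summarized with a short sketch. The following C code, with an assumed frame_t layout and stand-in functions for the register hand-offs, shows the receive path delivering each frame to the local core and simultaneously queueing it for the sending interface:

```c
#include <stdio.h>

/* Assumed frame layout for this sketch; compare the FIG. 7 data structure. */
typedef struct { int src_id; int dst_id; unsigned payload; } frame_t;

/* Stand-ins for the register hand-offs inside the lane (assumptions). */
static void to_core(frame_t f) { printf("core consumes payload %u\n", f.payload); }
static void to_tx(frame_t f)   { printf("tx forwards payload %u\n", f.payload); }

/* Receive path of one lane: the receiving address judgment unit passes the
 * incoming frame to the local core and, at the same time, to the sending
 * address judgment unit, which pushes it out through the tx interface to
 * the adjacent chip on the other side. */
void lane_on_rx(frame_t f)
{
    to_core(f); /* path 1: local core              */
    to_tx(f);   /* path 2: forward around the ring */
}
```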
FIG. 4a illustrates a first embodiment of the structure of a storage unit. Each operation chip contains N cores that need concurrent random access to data; if N is on the order of 64 or more, the required memory bandwidth is so high that it is difficult to achieve even with GDDR. Embodiments of the invention therefore use SRAM arrays with large MUX routing to provide high bandwidth. The system shown in FIG. 4a consists of two levels of memory control units to alleviate congestion in the implementation. The storage unit (40) comprises 8 memories (410 … 417) connected to a storage control unit (420), which controls reading data from and storing data into the memories. Each memory (410 … 417) includes at least two storage sub-units and a storage control sub-unit; the storage control sub-unit is connected to the storage control unit through an interface and controls reading from and storing into the at least two storage sub-units. The storage sub-units are SRAM memories.
FIG. 4b illustrates a second embodiment of the structure of a storage unit. In FIG. 4b, a plurality of storage control units (420, 421, 422, 423) are disposed in the storage unit; each core is connected to each of the storage control units (420, 421, 422, 423), and each storage control unit is connected to each memory (410 … 417). The structure of the memories is identical to that of FIG. 4a and is not described again here.
The core sends generated data to at least one storage control unit, which passes the data to a storage control sub-unit, which stores the data into a storage sub-unit. When the core of an operation chip receives a data-acquisition command from another operation chip, it judges from the data address whether the data is stored in its own chip's storage unit; if so, it sends a read command to the at least one storage control unit. The storage control unit sends the read command to the corresponding storage control sub-unit, which fetches the data from the storage sub-unit and returns it to the storage control unit, which returns it to the core. The core then sends the fetched data to the sending address judgment unit, which passes it to the sending interface (tx), which sends it to the adjacent operation chip.
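A minimal sketch of the two-level storage control described above, assuming illustrative sizes (8 memories, 2 SRAM sub-units each, 1024 words per sub-unit): the first-level storage control unit decodes the address to one memory, and that memory's control sub-unit decodes the remainder to a sub-unit and word.

```c
#include <stdint.h>

#define N_MEM     8    /* memories per storage unit, as in FIG. 4a */
#define N_SUB     2    /* storage sub-units per memory (assumed)   */
#define SUB_WORDS 1024 /* words per SRAM sub-unit (assumed)        */

/* Backing SRAM cells: memory -> sub-unit -> word. */
static uint32_t sram[N_MEM][N_SUB][SUB_WORDS];

/* Second level: the storage control sub-unit selects a sub-unit and word. */
static uint32_t subunit_read(int mem, uint32_t off)
{
    return sram[mem][off / SUB_WORDS][off % SUB_WORDS];
}

/* First level: the storage control unit selects one of the 8 memories and
 * delegates the rest of the address to that memory's control sub-unit. */
uint32_t storage_read(uint32_t addr)
{
    const uint32_t words_per_mem = N_SUB * SUB_WORDS;
    return subunit_read((int)(addr / words_per_mem) % N_MEM,
                        addr % words_per_mem);
}
```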
When the big data operation acceleration system is applied in the field of artificial intelligence, picture or video data sent by an external host is stored into a storage unit by the UART control unit of an operation chip via a core. A mathematical model of a neural network is generated by the operation chips, or it may be stored into the storage units by the external host through the UART control unit and read by each operation chip. The first operation chip runs the first-layer mathematical model of the neural network: its cores read data from their own chip's storage unit and/or the storage units of other operation chips, perform the computation, and store the result into at least one storage unit, either of other operation chips through the serdes interfaces or of their own chip. Operation chip (1) then sends a control instruction through the UART control unit or a serdes interface to the next operation chip (2), starting its computation. The next operation chip (2) runs the second-layer mathematical model of the neural network in the same way: its cores read data from their own and/or other chips' storage units, compute, and store the result into at least one storage unit of other operation chips through a serdes interface or of their own chip. Each chip executes one layer of the neural network, obtaining its data through the serdes interfaces from other chips' storage units or its own; only the last layer of the neural network produces the final result. That operation chip obtains the result from its local storage unit or the storage units of other operation chips and feeds it back to the external host through the UART control unit.
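The layer-per-chip flow above can be sketched as follows; read_activations, run_layer, write_activations, and start_next_chip are hypothetical stand-ins for the storage and control plumbing (local or over serdes), and the layer size is arbitrary:

```c
#include <string.h>

enum { LAYER_DIM = 256 }; /* illustrative activation size */

/* Hypothetical plumbing, stubbed out: reads may cross serdes links to a
 * neighboring chip's storage unit, writes may target local storage. */
static void read_activations(int chip, float *buf) { (void)chip; memset(buf, 0, sizeof(float) * LAYER_DIM); }
static void run_layer(int chip, const float *in, float *out) { (void)chip; memcpy(out, in, sizeof(float) * LAYER_DIM); }
static void write_activations(int chip, const float *buf) { (void)chip; (void)buf; }
static void start_next_chip(int chip) { (void)chip; }

/* Chip k runs layer k of the network: fetch the previous layer's output,
 * compute, store the result, then send the start instruction onward. */
void run_my_layer(int k, int n_layers)
{
    float in[LAYER_DIM], out[LAYER_DIM];
    read_activations(k - 1, in);  /* local storage or remote via serdes */
    run_layer(k, in, out);
    write_activations(k, out);
    if (k + 1 < n_layers)
        start_next_chip(k + 1);   /* via UART control unit or serdes    */
}
```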
When the big data operation acceleration system is applied in the field of encrypted digital currency, the UART control unit of operation chip (1) stores block information sent by an external host into at least one of the storage units of the operation chips. The external host sends control instructions for data computation to the M operation chips through the UART control units of operation chips (1 … M), and the M operation chips start computing. Alternatively, the external host may send a control instruction for data computation to the UART control unit (130) of one operation chip (1), which then sends control instructions in turn to the other M-1 operation chips, and the M operation chips start computing. Or the external host may send a control instruction to the UART control unit of operation chip (1); the first operation chip (1) passes the instruction to the second operation chip (2), the second operation chip (2) to the third (3), the third (3) to the fourth (4), and so on, and the M operation chips start computing. The M operation chips obtain data through the serdes interfaces from other chips' storage units or their own and simultaneously perform the proof-of-work computation; operation chip (1) obtains the result from the storage unit and feeds it back to the external host through the UART control unit.
FIG. 5 illustrates a first embodiment of the data transfer process of the big data operation acceleration system. Each operation chip completes 1/n of the work, and once a chip finishes the data it is responsible for, the result must be transmitted to all other chips because of data dependency. Operation chip n-1 is the source chip of a data frame and sends the data to operation chip 0 through lane1tx. Inside operation chip 0, the data frame propagates along two paths: the first goes to the cores of operation chip 0, and the other is forwarded on operation chip 0's lane1tx channel, sending the data frame on to operation chip 1.
The source ID mechanism: each data frame carries the ID of the operation chip it originated from. Each time the frame arrives at a new operation chip, that chip inspects the source chip ID in the frame; if the ID equals the ID of the next operation chip connected to it, the frame is not forwarded any further, meaning its life cycle terminates there and it no longer occupies bandwidth. The inspection of the chip ID in the data frame may be performed in the core or in the receiving address judgment unit.
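A sketch of the source ID check, assuming chips numbered 0 … n-1 around the ring: a frame that originated on chip S is forwarded until it reaches the chip whose successor is S, so it visits every other chip exactly once and never circulates indefinitely.

```c
/* A reduced frame shape carrying only the source chip ID (assumption). */
typedef struct { int src_id; unsigned payload; } frame_t;

/* A frame born on chip src_id circulates the ring and dies on the chip
 * whose successor is src_id, so every other chip sees it exactly once. */
int should_forward(frame_t f, int my_id, int n_chips)
{
    int next = (my_id + 1) % n_chips; /* ID of the next chip in the ring */
    return f.src_id != next;          /* stop before the ring closes     */
}
```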
FIG. 6 is a signal flow diagram of an operation chip with 4 cores according to a first embodiment. The UART control unit (130) acquires data or control instructions from outside the chip and passes them to the core (110) connected to it. The core (110) either stores external data into the chip's storage unit (120) according to the data address, or forwards the data through the signal channel (lane) to the chip's other core corresponding to the data address, which stores it in its local storage unit. Likewise, an external control instruction is executed by the chip's core according to the instruction address, or sent through the signal channel (lane) to the chip's other core corresponding to that address for execution. When a core of the operation chip needs data, it can obtain the data from the local storage unit or from the storage units of other operation chips. To obtain data from another chip's storage unit, the core (110) broadcasts a data-acquisition control instruction to the connected operation chips through its serdes interface (150); a connected chip divides the instruction into two paths, one going to its core and the other forwarded to the next chip. If a connected chip determines that the data is stored in its local storage unit, its core reads the data from the storage unit and sends it through the serdes interface to the operation chip that issued the data-acquisition instruction. Of course, control instructions between operation chips may also be sent through the UART control units. When a core feeds a computation result or intermediate data back to the outside in response to an external or internal control instruction, it obtains the result or data from its own chip's storage unit or, through the serdes interface, from the storage units of other operation chips, and sends it out through the UART control unit. "Outside" may refer to an external host, an external network, an external platform, and so on. The external host can initialize and configure the parameters of the storage units through the UART control unit and address the multiple storage units uniformly.
Naturally, the core computes on the acquired data and stores the result in a storage unit. Each storage unit is divided into a dedicated storage area and a shared storage area. The dedicated storage area stores an operation chip's temporary results, i.e., intermediate results used continuously by that chip but not by other chips; the shared storage area stores the chip's computation results that will be used by other operation chips or must be fed back and transmitted to the outside.
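One plausible realization of the dedicated/shared split is a fixed partition of the storage unit's address space; the 50/50 split and the sizes below are assumptions for illustration.

```c
#include <stdint.h>

/* The storage unit split into a dedicated area (temporary results used
 * only by this chip) and a shared area (results read by other chips or
 * fed back outside). Sizes and the 50/50 split are assumptions. */
#define STORAGE_WORDS (1u << 16)
#define SHARED_BASE   (STORAGE_WORDS / 2)

static uint32_t storage_unit[STORAGE_WORDS];

void store_dedicated(uint32_t off, uint32_t v) { storage_unit[off % SHARED_BASE] = v; }
void store_shared(uint32_t off, uint32_t v)    { storage_unit[SHARED_BASE + (off % SHARED_BASE)] = v; }
```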
In the embodiments of the invention, multiple cores are arranged in a chip, each core performs computation and storage-control functions, and each core in the chip is connected to at least one storage unit. By reading its own connected storage unit as well as the storage units connected to other cores, each core effectively has a large-capacity memory, which reduces the number of times data must be moved into or out of memory from external storage and accelerates data processing. Meanwhile, the multiple cores can compute independently or cooperatively, further increasing data processing speed.
FIG. 7 illustrates a data structure diagram according to the invention. The data referred to here may be of various kinds, such as command data, numeric data, or character data. The data format consists of a valid bit, a destination address (dst id), a source address (src id), and the data itself. A core determines from the valid bit whether a packet is a command or a value; here we assume 0 represents a value and 1 represents a command. From this structure the core determines the destination address, the source address, and the data type. For example, in FIG. 1, when core 50 sends a data-read command to core 10, the valid bit is 1, the destination address is the address of core 10, the source address is the address of core 50, and the data field holds the read command, data type, data address, and so on. When core 10 returns data to core 50, the valid bit is 0, the destination address is the address of core 50, the source address is the address of core 10, and the data field holds the read data. In terms of instruction operation timing, this embodiment adopts a conventional six-stage pipeline: instruction fetch, decode, execute, memory access, alignment, and write-back. From the instruction-set-architecture perspective, a reduced instruction set may be assumed; following the usual design approach for reduced instruction set architectures, the instruction set can be divided by function into register-register instructions, register-immediate instructions, jump instructions, memory-access instructions, control instructions, and inter-core communication instructions.
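As an illustration of the FIG. 7 format, the frame could be laid out as the C bit-field below; the description fixes only the field order (valid bit, dst id, src id, data), so all field widths here are assumptions.

```c
#include <stdint.h>

/* The FIG. 7 frame as a C bit-field. The description fixes the field
 * order only; the widths chosen here (1+8+8+15 = 32 bits) are assumed. */
typedef struct {
    uint32_t valid  : 1;  /* 0 = numeric data, 1 = command   */
    uint32_t dst_id : 8;  /* destination core/chip address   */
    uint32_t src_id : 8;  /* source core/chip address        */
    uint32_t data   : 15; /* command, value, or data address */
} frame_fields_t;
```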
Using the description provided herein, an embodiment may be implemented as a machine, process, or article of manufacture by using standard programming and/or engineering techniques to produce software, firmware, hardware, or any combination thereof.
Any resulting program(s), having computer-readable program code, may be embodied on one or more computer-usable media such as resident memory devices, smart cards or other removable memory devices, or transmitting devices, thereby constituting a computer program product and article of manufacture according to an embodiment. As such, the terms "article of manufacture" and "computer program product" as used herein are intended to encompass a computer program that exists permanently or temporarily on any non-transitory medium usable by a computer.
As noted above, memory/storage devices include, but are not limited to, magnetic disks, optical disks, removable storage devices such as smart cards, Subscriber Identity Modules (SIMs), Wireless Identification Modules (WIMs), and semiconductor memory such as Random Access Memory (RAM), Read Only Memory (ROM), and Programmable Read Only Memory (PROM). Transmission media include, but are not limited to, transmission via wireless communication networks, the Internet, intranets, telephone/modem-based network communication, hard-wired/cabled communication networks, satellite communication, and other fixed or mobile network systems/communication links.
Although specific example embodiments have been disclosed, those skilled in the art will appreciate that changes can be made to the specific example embodiments without departing from the spirit and scope of the invention.
The invention has been described above based on embodiments and with reference to the drawings, but it is not limited to those embodiments: technical solutions obtained by appropriately combining or replacing parts of the embodiments according to layout requirements and the like are also included in the scope of the invention. Further, the combination and processing order of the embodiments may be rearranged based on the knowledge of those skilled in the art, and modifications such as various design changes may be applied to the embodiments; embodiments to which such modifications are applied are also included in the scope of the invention.
While the invention has been described in detail with respect to the various concepts, it will be appreciated by those skilled in the art that various modifications and alternatives to those concepts may be developed in light of the overall teachings of the disclosure. Those of ordinary skill in the art, with the benefit of this disclosure, will be able to implement the invention as set forth in the claims without undue experimentation. It is to be understood that the specific concepts disclosed are merely illustrative and are not intended to limit the scope of the invention, which is defined by the appended claims and their full scope of equivalents.

Claims (10)

  1. A big data operation acceleration system, comprising two or more operation chips, wherein each operation chip comprises N cores, N data channels (lanes) and at least one storage unit, N being a positive integer greater than or equal to 4; each data channel (lane) comprises a sending interface (tx) and a receiving interface (rx), the cores correspond one-to-one with the data channels (lanes), and each core sends and receives data through its data channel (lane); the two or more operation chips are connected through the sending interfaces (tx) and receiving interfaces (rx) to transmit data, and the two or more operation chips are connected into a ring.
  2. The system according to claim 1, wherein the sending interface (tx) and the receiving interface (rx) of the operation chip are serdes interfaces, and the operation chips communicate with each other through the serdes interfaces.
  3. The system according to claim 1 or 2, wherein the data channel (lane) further comprises a receiving address judgment unit and a sending address judgment unit; one end of the receiving address judgment unit is connected to the receiving interface and the other end to the core; one end of the sending address judgment unit is connected to the sending interface (tx) and the other end to the core; and the receiving address judgment unit and the sending address judgment unit are connected to each other.
  4. The system according to claim 3, wherein the receiving interface (rx) receives a data frame sent by the adjacent operation chip on one side and sends it to the receiving address judgment unit, which sends the data frame to the core and simultaneously to the sending address judgment unit; the sending address judgment unit receives the data frame and sends it to the sending interface (tx), which sends it to the adjacent operation chip on the other side.
  5. The system according to claim 3, wherein the core generates a data frame and transmits it to the sending address judgment unit, which transmits it to the sending interface (tx), and the sending interface (tx) transmits the data frame to the adjacent operation chip.
  6. The system according to claim 3, wherein the receiving address judgment unit and the sending address judgment unit are connected to each other through a first-in first-out memory.
  7. The system according to claim 3, wherein when the core of an operation chip receives a data-acquisition command sent by another operation chip, the core judges from the data address whether the data is stored in its own chip's storage unit; if so, the core acquires the data from at least one storage unit and sends it to the sending address judgment unit, which sends it to the sending interface (tx), and the sending interface sends the acquired data to the adjacent operation chip.
  8. A data transmission method for a big data operation acceleration system comprising two or more operation chips connected through sending interfaces (tx) and receiving interfaces (rx) to transmit data, the two or more operation chips being connected into a ring, the method comprising: after the first operation chip, the data source, generates data, sending the data through its sending interface (tx) to the second operation chip adjacent to the first operation chip; and the second operation chip dividing the data into two paths for transmission, the first path being sent to the cores of the second operation chip, and the other path being forwarded through a sending interface (tx) to the third operation chip adjacent to the second operation chip.
  9. The method of claim 8, wherein the data carries an identification (ID) of the source operation chip of the data.
  10. The method of claim 9, wherein after the data is transmitted to an adjacent operation chip, the adjacent operation chip detects the operation chip ID in the data, and if that ID is found to equal the ID of the next operation chip connected to it, the data is not forwarded any further.
CN201880097576.0A 2018-10-30 2018-10-30 Big data operation acceleration system and data transmission method Pending CN112740192A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/112546 WO2020087246A1 (en) 2018-10-30 2018-10-30 Big data operation acceleration system and data transmission method

Publications (1)

Publication Number Publication Date
CN112740192A (en) 2021-04-30

Family

ID=70463294

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201880097576.0A Pending CN112740192A (en) 2018-10-30 2018-10-30 Big data operation acceleration system and data transmission method

Country Status (2)

Country Link
CN (1) CN112740192A (en)
WO (1) WO2020087246A1 (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101216814A (en) * 2007-12-26 2008-07-09 杭州华三通信技术有限公司 Communication method and system of multi-nuclear multi-operating system
CN103744644A (en) * 2014-01-13 2014-04-23 上海交通大学 Quad-core processor system built in quad-core structure and data switching method thereof
CN104699531A (en) * 2013-12-09 2015-06-10 超威半导体公司 Voltage dip relieving applied to three-dimensional chip system
US20150188847A1 (en) * 2013-12-30 2015-07-02 Netspeed Systems STREAMING BRIDGE DESIGN WITH HOST INTERFACES AND NETWORK ON CHIP (NoC) LAYERS
CN104865938A (en) * 2015-04-03 2015-08-26 深圳市前海安测信息技术有限公司 Node connection chip applied to assess human body injury condition and node network thereof
CN108536642A (en) * 2018-06-13 2018-09-14 北京比特大陆科技有限公司 Big data operation acceleration system and chip
CN209149287U (en) * 2018-10-30 2019-07-23 北京比特大陆科技有限公司 Big data operation acceleration system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9703697B2 (en) * 2012-12-27 2017-07-11 Intel Corporation Sharing serial peripheral interface flash memory in a multi-node server system on chip platform environment


Also Published As

Publication number Publication date
WO2020087246A1 (en) 2020-05-07

Similar Documents

Publication Publication Date Title
US7155554B2 (en) Methods and apparatuses for generating a single request for block transactions over a communication fabric
JP5301381B2 (en) Memory access device control in data-driven architecture mesh arrays
US7277975B2 (en) Methods and apparatuses for decoupling a request from one or more solicited responses
JP4368795B2 (en) Improved interprocessor communication system for communicating between processors.
CN108536642A (en) Big data operation acceleration system and chip
CN112035388B (en) High-performance encryption and decryption method based on PCI-e channel
CN109564562B (en) Big data operation acceleration system and chip
CN111581152A (en) Reconfigurable hardware acceleration SOC chip system
CN209560543U (en) Big data operation chip
CN209149287U (en) Big data operation acceleration system
CN209784995U (en) Big data operation acceleration system and chip
CN105608028A (en) EMIF (External Memory Interface) and dual-port RAM (Random Access Memory)-based method for realizing high-speed communication of DSP (Digital Signal Processor) and FPGA (Field Programmable Gate Array)
US11082327B2 (en) System and method for computational transport network-on-chip (NoC)
WO2020087239A1 (en) Big data computing acceleration system
JP2003050788A (en) Apparatus and method for distribution of signal from high level data link controller to multiple digital signal processor core
CN112740192A (en) Big data operation acceleration system and data transmission method
CN111078286B (en) Data communication method, computing system and storage medium
CN112740193A (en) Method for accelerating system execution operation of big data operation
CN115994115A (en) Chip control method, chip set and electronic equipment
CN115129657A (en) Programmable logic resource expansion device and server
WO2020087243A1 (en) Big data computing chip
CN102184150B (en) High-function circular buffer and cache system and control method thereof
CN209543343U (en) Big data operation acceleration system
CN109643301B (en) Multi-core chip data bus wiring structure and data transmission method
CN208298179U (en) Big data operation acceleration system and chip

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination