WO2020087246A1 - Big data operation acceleration system and data transmission method - Google Patents

Big data operation acceleration system and data transmission method Download PDF

Info

Publication number
WO2020087246A1
WO2020087246A1 (PCT/CN2018/112546)
Authority
WO
WIPO (PCT)
Prior art keywords
data
chip
core
arithmetic
interface
Prior art date
Application number
PCT/CN2018/112546
Other languages
French (fr)
Chinese (zh)
Inventor
秦强 (Qin Qiang)
Original Assignee
北京比特大陆科技有限公司 (Beijing Bitmain Technology Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京比特大陆科技有限公司 (Beijing Bitmain Technology Co., Ltd.)
Priority to PCT/CN2018/112546 priority Critical patent/WO2020087246A1/en
Priority to CN201880097576.0A priority patent/CN112740192B/en
Publication of WO2020087246A1 publication Critical patent/WO2020087246A1/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/16Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163Interprocessor communication
    • G06F15/17Interprocessor communication using an input/output type connection, e.g. channel, I/O port

Definitions

  • the invention relates to the field of integrated circuits, in particular to a big data operation acceleration system and a data transmission method.
  • ASIC: Application-Specific Integrated Circuit.
  • The defining characteristic of ASICs is that they are designed to meet the needs of specific users.
  • Compared with general-purpose integrated circuits, ASICs have the advantages of smaller size, lower power consumption, improved reliability, improved performance, enhanced confidentiality, and lower cost.
  • The size of a photo is generally 3-7 MB, but as the precision of digital cameras and video cameras increases, a photo can reach 10 MB or more, and 30 minutes of video may exceed 1 GB of data.
  • Fast calculation and low latency are required, so improving calculation speed and response time has always been a goal of chip design.
  • Because the memory of an ASIC chip is generally 64 MB or 128 MB, when the data to be processed exceeds 512 MB, the ASIC chip must access the data through memory multiple times, moving data into and out of memory from external storage space many times, which reduces processing speed.
  • With the continuous improvement of data accuracy, fields such as artificial intelligence and secure computing need to operate on ever larger data.
  • To store data, it is generally necessary to configure an ASIC chip with multiple storage units, for example four 2 GB memories per chip; when N arithmetic chips work at the same time, 4N 2 GB memories are needed. However, when multiple computing chips work at the same time, the data actually stored will not exceed 2 GB, which wastes storage units and increases system cost.
  • the improvement of the clock rate is very limited.
  • Increasing the number of concurrency is a more effective way to improve performance.
  • it generally increases the data bandwidth requirements accordingly.
  • The bandwidth increase of DDR is not linear. Assume the initial system contains one group of DDR, providing a bandwidth of 1x. If we need 2x bandwidth, we can implement two groups of DDR; but if we need a bandwidth increase of more than 16x, it is impossible to simply implement 16 groups of DDR in one system because of physical size limitations.
  • the purpose of the embodiments of the present invention is to provide a way to use high-speed interfaces to connect to distributed storage, so that multiple homogeneous systems can process a large amount of related data concurrently.
  • the embodiment of the present invention provides a big data operation acceleration system.
  • The external chip memory is eliminated, and the storage unit is provided inside the ASIC chip, which reduces the time the ASIC chip spends reading data from outside and speeds up chip operation.
  • Multiple ASIC chips share storage units, which not only reduces the number of storage units, but also reduces the connection lines between ASIC operation chips, simplifies the system structure, and reduces the cost of ASIC chips.
  • serdes interface technology is used for data transmission between multiple computing chips, which improves the data transmission rate between multiple ASIC chips.
  • a big data operation acceleration system including more than two operation chips, the operation chip including N cores, N data channels (lane), and at least one storage unit, wherein N is a positive integer greater than or equal to 4;
  • the data channel (lane) includes a transmit interface (tx) and a receive interface (rx),
  • the cores correspond to the data channels (lanes) one-to-one, and each core sends and receives data through its data channel (lane);
  • the two or more arithmetic chips are connected to transmit data through the sending interface (tx) and the receiving interface (rx), and the two or more arithmetic chips are connected in a ring.
  • a data transmission method for a big data operation acceleration system: the system includes more than two operation chips connected through sending interfaces (tx) and receiving interfaces (rx) to transmit data, the operation chips being connected into a ring; after the first operation chip at the data source generates data, the data is sent through its sending interface (tx) to the adjacent second operation chip; the adjacent second operation chip splits the data into two paths, the first path being sent to the core of the second operation chip, and the other path being forwarded through the sending interface (tx) to the third operation chip adjacent to the second operation chip.
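  • As a minimal illustrative sketch (all names and structure details invented, not from the patent), the claimed topology, M operation chips in a ring with N cores and one data channel (tx/rx pair) per core, might be wired like this:

```python
# Toy wiring of the claimed system: M chips in a ring, each with N cores
# and one lane (tx/rx) per core. Which lane closes the ring is an
# assumption for the sketch.
def build_ring(m_chips, n_cores=4):
    chips = [{"id": c,
              "cores": list(range(n_cores)),
              "lanes": [{"tx": None, "rx": None} for _ in range(n_cores)]}
             for c in range(m_chips)]
    for c in range(m_chips):  # tx of chip c feeds rx of chip c+1
        chips[c]["lanes"][0]["tx"] = (c + 1) % m_chips
        chips[(c + 1) % m_chips]["lanes"][0]["rx"] = c
    return chips

ring = build_ring(4)
assert ring[3]["lanes"][0]["tx"] == 0   # the last chip closes the ring
```

The modulo arithmetic is what makes the topology a closed loop rather than a chain, matching the "connected in a ring" limitation of the claim.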
  • multiple chips are provided in the big data operation acceleration system.
  • the multiple chips include multiple cores, each core performing operation and storage control functions, and at least one storage unit is connected to each core inside the chip, so that each core can read the data in the storage unit connected to itself as well as the storage units connected to the cores of the other computing chips; each core thus has a large effective memory capacity, reducing the number of times data is moved into or out of memory from external storage space and speeding up data processing; at the same time, because multiple cores can operate independently or cooperatively, data processing is further accelerated.
  • FIG. 1 is a schematic diagram illustrating the structure of a big data operation acceleration system with M ASIC chips according to the first embodiment
  • Figure 2 illustrates a schematic diagram of an arithmetic chip with 4 cores
  • Figure 3 illustrates a schematic diagram of the structure of the data channel lane
  • FIG. 4a illustrates a schematic structural view of a first embodiment of a storage unit
  • FIG. 4b illustrates a schematic structural diagram of a second embodiment of a storage unit
  • FIG. 5 is a schematic diagram illustrating the data transmission process of the big data operation acceleration system
  • FIG. 6 is a schematic diagram illustrating a signal flow of an arithmetic chip with 4 cores in the first embodiment
  • FIG. 7 illustrates a schematic diagram of a data structure according to the present invention.
  • Multi-core chips are multi-processing systems embodied on a single large-scale integrated semiconductor chip.
  • two or more chip cores can be embodied on a multi-core chip, interconnected by a bus (which can also be formed on the same multi-core chip).
  • Multi-core chips have applications in multimedia and signal-processing algorithms (such as video encoding/decoding, 2D/3D graphics, audio and voice processing, image processing, telephony, voice recognition and voice synthesis, and encryption processing) that involve special arithmetic and/or logical operations.
  • Although ASIC application-specific integrated circuits are mentioned in the background art, the specific wiring implementation in the embodiments can also be applied to multi-core CPUs, GPUs, FPGAs, and the like.
  • multiple cores may be the same core or different cores.
  • FIG. 1 is a schematic diagram illustrating the structure of a big data operation acceleration system with M ASIC chips according to the first embodiment.
  • the big data operation acceleration system includes M ASIC operation chips, where M is a positive integer greater than or equal to 2, for example, 6, 10, 12, and so on.
  • the arithmetic chip includes multiple cores (core0, core1, core2, core3) and 4 data channels (lane0, lane1, lane2, lane3); each data channel (lane) includes a transmission interface (tx) and a reception interface (rx),
  • the cores and the data channels (lanes) have a one-to-one correspondence; for example, core0 of computing chip 10 has a data channel (lane0), which has a sending interface (lane0tx) and a receiving interface (lane0rx); the sending interface (lane0tx) is used by core0 to send data or control instructions outside computing chip 10, and the receiving interface (lane0rx) is used to deliver data or control instructions from outside computing chip 10 to core0.
  • the M arithmetic chips are connected through the sending interface (tx) and the receiving interface (rx) to facilitate the transmission of data or control commands.
  • M arithmetic chips form a closed loop.
  • a storage unit is provided in each arithmetic chip, and the four cores in each arithmetic chip are connected to that storage unit; the storage units of the M arithmetic chips are used for distributed storage of data, and a core can acquire data from the storage unit of its own arithmetic chip as well as from the storage units of the other arithmetic chips.
  • the four cores of the arithmetic chip are all connected to the storage unit, and data exchange among the four cores of the arithmetic chip is also achieved through the storage unit.
  • the number of cores may be N, where N is a positive integer greater than or equal to 4, for example, 6, 10, 12, and so on.
  • multiple cores may be the same core or different cores.
  • the sending interface (tx) and receiving interface (rx) of the data channel (lane) are serdes interfaces, and the arithmetic chips communicate through the serdes interfaces.
  • Serdes is an abbreviation of SERializer/DESerializer. It is a mainstream time-division multiplexing (TDM), point-to-point (P2P) serial communication technology: multiple low-speed parallel signals at the transmitting end are converted into high-speed serial signals, passed through the transmission medium (optical cable or copper wire), and finally re-converted into low-speed parallel signals at the receiving end.
  • TDM time division multiplexing
  • P2P point-to-point
  • This point-to-point serial communication technology makes full use of the channel capacity of the transmission medium, reduces the number of transmission channels and device pins required, increases the signal transmission speed, and thus greatly reduces the communication cost.
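  • A toy model of the serialize/deserialize round trip a serdes pair performs (purely illustrative; a real serdes also involves line coding such as 8b/10b and clock recovery, which are omitted here):

```python
def serialize(words, width=8):
    """Flatten low-speed parallel words into one high-speed serial bit stream."""
    bits = []
    for w in words:
        bits.extend((w >> i) & 1 for i in range(width))  # LSB first
    return bits

def deserialize(bits, width=8):
    """Regroup the serial bit stream back into parallel words."""
    return [sum(b << i for i, b in enumerate(bits[k:k + width]))
            for k in range(0, len(bits), width)]

data = [0x12, 0xAB, 0xFF]
assert deserialize(serialize(data)) == data  # lossless round trip
```

The point of the model is only that many parallel pins collapse into one serial stream and are recovered at the far end, which is why the technique reduces the number of transmission channels and device pins.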
  • other communication interfaces can also be used instead of the serdes interface, for example SSI or UART.
  • the chip transmits data and control commands through the serdes interface.
  • FIG. 2 illustrates a first embodiment of a schematic structural diagram of an arithmetic chip with 4 cores.
  • 4 cores are selected here as an example, which is only an exemplary description.
  • the number of cores of the arithmetic chip may be N, where N is a positive integer greater than or equal to 2, such as 6, 10, 12, and so on.
  • the core of the arithmetic chip may be a core with the same function or a core with different functions.
  • the 4-core computing chip (1) includes 4 cores (core0, core1, core2, core3), 4 data channels (lane0, lane1, lane2, lane3), at least one storage unit, and a data exchange control unit; specifically, the data exchange control unit is a UART control unit, and each data channel (lane) includes a transmit interface (lane tx) and a receive interface (lane rx).
  • core0 of computing chip (1) is connected to the transmitting interface (lane0tx) and receiving interface (lane0rx) of its data channel; the sending interface (lane0tx) is used by core0 to send data or control instructions to the computing chip connected to computing chip (1), and the receiving interface (lane0rx) is used to deliver data or control instructions transmitted by the computing chip connected to computing chip (1) to core0.
  • the core1 of the operation chip 1 is connected to the transmission interface (lane1tx) and the reception interface (lane1rx) of the data channel; the core2 of the operation chip 1 is connected to the transmission interface (lane2tx) and the reception interface (lane2rx) of the data channel.
  • the core3 of the chip 1 is connected to the sending interface (lane3tx) and the receiving interface (lane3rx) of the data channel.
  • the sending interface (lane tx) and the receiving interface (lane rx) of the data channel (lane) are serdes interfaces.
  • a data exchange control unit is connected to the storage unit and 4 cores (core0, core1, core2, core3) through the bus.
  • the bus is not drawn in FIG. 2.
  • the data exchange control unit can be implemented using multiple protocols, such as UART, SPI, PCIE, SERDES, USB, etc.
  • the data exchange control unit is a UART (Universal Asynchronous Receiver / Transmitter) control unit.
  • UART: Universal Asynchronous Receiver/Transmitter (universal asynchronous transceiver)
  • UART is usually integrated into the connections of various communication interfaces. The UART protocol is used here only as an example; other protocols can also be used.
  • the UART control unit accepts external data and sends the external data to the core (core0, core1, core2, core3) or storage unit according to the external data address.
  • the UART control unit can also accept external control commands and send them to the cores (core0, core1, core2, core3) or the storage unit; it can also be used to send internal or external control commands from the computing chip to other computing chips, to accept control commands, and to feed operation results or intermediate data back to the outside.
  • internal data or internal control commands refer to data or control commands generated by the chip itself; external data or external control commands refer to data or control commands generated outside the chip, such as data or control instructions sent by an external host or an external network.
  • the main functions of the cores (core0, core1, core2, core3) are to execute external or internal control instructions and to perform data calculation and data storage control.
  • the cores (core0, core1, core2, core3) in the arithmetic chip are all connected to the storage unit; they read data from or write data to the storage unit of the arithmetic chip, exchange data between cores through it, and can also send control commands to the storage unit of the computing chip.
  • the cores (core0, core1, core2, core3) write data to, read data from, or send control commands to the storage units of other computing chips through the serdes interface according to the instructions; the cores (core0, core1, core2, core3) can also send data to, read data from, or send control commands to the cores of other computing chips through the serdes interface according to the instructions.
  • FIG. 3 illustrates a first embodiment of a schematic structural diagram of a data channel lane.
  • the data channel (lane) includes a receiving interface, a sending interface, a receiving address judgment unit, a sending address judgment unit, and a plurality of registers; one end of the receiving address judgment unit is connected to the receiving interface, and the other end is connected to the core through registers; one end of the sending address judgment unit is connected to the sending interface (tx), and the other end is connected to the core through a register; the receiving address judgment unit and the sending address judgment unit are connected to each other through a register.
  • the receiving interface receives the data frame or control instruction sent by the adjacent arithmetic chip connected to it and sends the data frame or control instruction to the receiving address judgment unit; the receiving address judgment unit sends the data frame or control instruction to the core, and at the same time sends it to the sending address judgment unit;
  • the sending address judgment unit receives the data frame or control instruction, and sends the data frame or control instruction to the sending interface (tx),
  • the sending interface sends the data frame or the control instruction to the adjacent arithmetic chip connected to the sending interface.
  • the core generates a data frame or control instruction and sends it to the sending address judgment unit; the sending address judgment unit sends it to the sending interface, and the sending interface sends the data frame or control instruction to the receiving interface of the adjacent arithmetic chip.
  • the purpose of the register is to temporarily store data frames or control instructions.
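  • The receive path just described (receiving interface, receiving address judgment unit delivering to the local core and, through a register, to the sending side) can be sketched as follows; the class and field names are invented for the sketch:

```python
class Lane:
    """Toy data channel: rx -> receiving judgment -> local core, plus a
    register toward the sending address judgment unit / tx interface."""
    def __init__(self, core_id):
        self.core_id = core_id
        self.core_inbox = []   # frames delivered to the local core
        self.tx_register = []  # frames buffered for forwarding via tx

    def on_receive(self, frame):
        # receiving address judgment unit: deliver a copy to the core ...
        self.core_inbox.append(frame)
        # ... and hand the same frame toward the sending side via a register
        self.tx_register.append(frame)

    def on_send(self):
        # sending interface (tx): emit the oldest buffered frame, if any
        return self.tx_register.pop(0) if self.tx_register else None

lane = Lane(core_id=0)
lane.on_receive({"src": 3, "data": 42})
assert lane.core_inbox == [{"src": 3, "data": 42}]
```

The register here is just a FIFO list, standing in for the temporary storage of data frames the text describes.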
  • FIG. 4a illustrates a first embodiment of a schematic structural diagram of a memory cell.
  • Each computing chip contains N cores that need concurrent random access to data. If N reaches the order of 64 or above, the memory bandwidth of the computing chip needs to reach a very high order; even GDDR can hardly achieve such high bandwidth. Therefore, in the embodiment of the present invention, high bandwidth is provided by using an SRAM array and a large MUX route.
  • the system uses a two-level storage control structure to alleviate bottlenecks during implementation.
  • the storage unit (40) includes 8 memories (410 ... 417); the 8 memories (410 ... 417) are connected to a storage control unit, and the storage control unit is used to control reading data from or storing data in the plurality of memories.
  • the memory (410 ... 417) includes at least two storage subunits and a storage control subunit; the storage control subunit is connected to the storage control unit through an interface, and the storage control subunit is used to control reading data from or storing data in the at least two storage subunits.
  • the storage subunit is an SRAM memory.
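  • A hypothetical two-level address decode matching FIG. 4a (the field widths are assumptions, not from the patent): the storage control unit selects one of the 8 memories, and each memory's storage control subunit selects an SRAM storage subunit:

```python
def decode(addr, n_mems=8, n_subunits=2, subunit_words=1024):
    """Two-level decode: address -> (memory, SRAM subunit, word offset)."""
    word = addr % subunit_words
    sub = (addr // subunit_words) % n_subunits            # level 2: subunit
    mem = (addr // (subunit_words * n_subunits)) % n_mems  # level 1: memory
    return mem, sub, word

assert decode(0) == (0, 0, 0)
assert decode(3 * 2048 + 1024 + 7) == (3, 1, 7)  # memory 3, subunit 1, word 7
```

Splitting the decode this way is what lets the storage control unit fan out across 8 memories while each storage control subunit only arbitrates its own small SRAMs.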
  • FIG. 4b illustrates a second embodiment of a schematic structural diagram of a memory cell.
  • Multiple storage control units (420, 421, 422, 423) can be provided in the storage unit in FIG. 4b; each core is connected to each of the multiple storage control units (420, 421, 422, 423), and each storage control unit is connected to each memory (410 ... 417).
  • the structure of the memory is exactly the same as in FIG. 4a, and will not be described here again.
  • the core core sends the generated data to at least one storage control unit, and the at least one storage control unit sends the data to the storage control subunit, and the storage control subunit stores the data in the storage subunit.
  • when a core of the arithmetic chip receives a data acquisition command sent by another arithmetic chip, the core judges according to the data address whether the data is stored in the storage unit of its own arithmetic chip; if it is, the core sends a data read command to at least one storage control unit, the storage control unit sends the read command to the corresponding storage control subunit, and the storage control subunit obtains the data from the storage subunit and returns it to the storage control unit; the storage control unit sends the acquired data to the core, the core sends it to the sending address judgment unit, the sending address judgment unit sends it to the sending interface (tx), and the sending interface sends the acquired data to the adjacent arithmetic chip.
  • the big data operation acceleration system is applied to the field of artificial intelligence.
  • the UART control unit of the operation chip stores the image data or video data sent by the external host to the storage unit through the core core.
  • the operation chip generates a mathematical model of the neural network; the mathematical model can also be stored in the storage unit by the external host through the UART control unit and read by each arithmetic chip. The first-layer mathematical model of the neural network runs on the arithmetic chip.
  • the core of the arithmetic chip reads data from the storage unit of the arithmetic chip and/or the storage units of other arithmetic chips for operation, and stores the operation result through the serdes interface to the storage unit of another arithmetic chip or of this arithmetic chip.
  • the arithmetic chip (1) sends a control instruction to the next arithmetic chip (2) through the UART control unit or serdes interface, and starts the next arithmetic chip (2) performing arithmetic. The second-layer mathematical model of the neural network runs on the next arithmetic chip (2).
  • the core of the next arithmetic chip reads data from the storage unit of its arithmetic chip and/or the storage units of other arithmetic chips for operation, and the operation result is stored through the serdes interface in at least one storage unit, either a storage unit of another arithmetic chip or a storage unit of this arithmetic chip.
  • Each chip executes one layer of the neural network, obtaining data through the serdes interface from the storage units of other operation chips or of its own operation chip to perform calculations, until the last layer of the neural network produces the final result.
  • the operation chip obtains the operation result from the local storage unit or the storage units of other operation chips, and feeds it back to the external host through the UART control unit.
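  • The layer-per-chip flow of this application example can be sketched as a toy model (all names invented; each chip applies one layer, reading its input from the distributed storage and writing its output there for the next chip):

```python
# Toy model of the AI example: chip i runs layer i, reads its input from
# the shared distributed store, writes its result back, and the last
# chip's result is fed back to the host.
def run_pipeline(layers, x, store):
    store[0] = x                        # host loads input via the UART path
    for i, layer in enumerate(layers):
        store[i + 1] = layer(store[i])  # chip i computes and stores output
    return store[len(layers)]           # final result, returned to the host

store = {}
layers = [lambda v: [2 * e for e in v],   # layer 1 on chip 1 (invented op)
          lambda v: [e + 1 for e in v]]   # layer 2 on chip 2 (invented op)
assert run_pipeline(layers, [1, 2, 3], store) == [3, 5, 7]
```

The dict `store` stands in for the distributed storage units: every chip can read what the previous chip wrote, which is exactly what the serdes-connected shared storage enables.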
  • the big data operation acceleration system is applied to the field of encrypted digital currency.
  • the UART control unit of the operation chip (1) stores the block information sent by the external host to at least one storage unit among the plurality of storage units of the plurality of operation chips.
  • the external host sends control instructions to the M arithmetic chips through the arithmetic chip (1 ... M) UART control unit to perform data calculation, and the M arithmetic chips start the arithmetic operation.
  • an external host can also send control instructions to the UART control unit (130) of one arithmetic chip (1) to perform data operations, and the arithmetic chip (1) sequentially sends control instructions to the other M-1 arithmetic chips to perform data operations; the M arithmetic chips then start the arithmetic operation.
  • the external host can also send control instructions to a computing chip (1) UART control unit to perform data operations.
  • the first computing chip (1) sends control instructions to the second computing chip (2) to perform data operations.
  • the second computing chip (2) sends a control instruction to the third arithmetic chip (3) for data calculation, the third arithmetic chip (3) sends a control instruction to the fourth arithmetic chip (4) for data calculation, and the M arithmetic chips start arithmetic operations.
  • the M arithmetic chips acquire data from the storage unit of other arithmetic chips or the storage unit of the arithmetic chip through the serdes interface to perform calculations.
  • the M arithmetic chips simultaneously perform proof-of-work calculation operations.
  • the arithmetic chip (1) obtains the calculation result from the storage unit and feeds it back to the external host through the UART control unit.
  • FIG. 5 illustrates a first embodiment of a schematic diagram of a data transmission process of a big data operation acceleration system.
  • Each arithmetic chip completes 1/n of the work, and after each arithmetic chip finishes its share of the data, because of data correlation, the result of its calculation must be transmitted to all other chips.
  • Operation chip n-1 is the source operation chip of the data frame, and the data is sent to operation chip 0 through lane1tx; in operation chip 0, the data frame is split into two propagation paths: the first path is sent to the core of operation chip 0, and the other path is forwarded through lane1tx of operation chip 0, so the data frame is then sent to operation chip 1.
  • Each data frame carries the ID of the operation chip that was its source. Whenever the data frame arrives at a new operation chip, that chip checks the operation chip ID in the data frame. When it finds that this ID equals the ID of the next operation chip connected to it, the data frame is not forwarded again; the life cycle of the data frame ends there, and it no longer takes up bandwidth.
  • the arithmetic chip detects the source chip ID in the data frame; this detection may be performed in the core or in the receiving address judgment unit.
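  • The forwarding-termination rule can be sketched as follows (invented names; a frame circulates around the ring until the next chip would be the frame's source chip):

```python
def propagate(num_chips, src):
    """Forward a frame around the ring; stop when the next hop is the source."""
    received, forwards = [], []
    chip = (src + 1) % num_chips      # source's tx feeds its neighbor's rx
    while True:
        received.append(chip)          # one copy delivered to the local core
        nxt = (chip + 1) % num_chips
        if nxt == src:                 # next chip ID equals the frame's
            break                      # source ID: the frame's life ends here
        forwards.append((chip, nxt))   # otherwise forward the copy via tx
        chip = nxt
    return received, forwards

received, forwards = propagate(4, src=3)
assert received == [0, 1, 2]  # every other chip sees the frame exactly once
```

With n chips the frame makes exactly n-1 deliveries and n-2 forwards, so the broadcast never wraps past its source and never occupies bandwidth twice on the same link.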
  • FIG. 6 illustrates a signal flow diagram of an arithmetic chip with four cores in the first embodiment.
  • the UART control unit (130) is used to acquire external data or control instructions of the chip, and transmit the external data or control instructions to the core (110) connected to the UART control unit.
  • the core (110) transfers external data to the storage unit (120) of the chip for storage according to the data address, or the core (110) sends the data through the signal channel (lane) to the other chip's core corresponding to the data address, and that chip's core stores the data in its local storage unit.
  • an external control instruction is either executed by the core (110) of the arithmetic chip according to the instruction's address, or sent through the signal channel (lane) to the other chip's core corresponding to the address of the control instruction for execution.
  • when the core of the arithmetic chip needs to acquire data, it can acquire data from the local storage unit or from the storage units of other arithmetic chips.
  • the core (110) broadcasts the data-acquisition control instruction to the connected arithmetic chips through the serdes interface (150); each connected arithmetic chip splits the data-acquisition control instruction into two paths, one sent to its core and the other forwarded to the next chip. If a connected arithmetic chip determines that the data is stored in its local storage unit, its core reads the data from the storage unit and sends it through the serdes interface to the arithmetic chip that issued the data-acquisition instruction.
  • control commands between the arithmetic chips can also be sent through the UART control unit.
  • the core feeds the operation result or intermediate data back to the outside according to the external control instruction or the internal control instruction; the core obtains the operation result or intermediate data from the storage unit of its operation chip, or from the storage units of the other operation chips through the serdes interface, and sends the operation result or intermediate data to the outside through the UART control unit.
  • the external mentioned here may refer to an external host, an external network, an external platform, or the like.
  • the external host can initialize and configure the storage unit parameters through the UART control unit, and address multiple storage units uniformly.
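  • Unified addressing over the distributed storage units might look like the following sketch; the split of a global address into chip ID, unit index, and local offset is an assumption for illustration only:

```python
def route(global_addr, units_per_chip=1, unit_size=2**21):
    """Map a unified global address onto (chip, storage unit, local offset)."""
    chip = global_addr // (units_per_chip * unit_size)
    rem = global_addr % (units_per_chip * unit_size)
    return chip, rem // unit_size, rem % unit_size

# A core compares the chip field with its own ID: equal means a local
# access, otherwise the request travels over the serdes ring to the owner.
assert route(0) == (0, 0, 0)                # local to chip 0
assert route(3 * 2**21 + 100) == (3, 0, 100)  # owned by chip 3, via the ring
```

Uniform addressing of this kind is what lets every core treat the sum of all chips' storage units as one large memory, as the preceding bullet describes.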
  • the kernel core performs calculations based on the acquired data and stores the calculation results in the storage unit.
  • Each storage unit is provided with a dedicated storage area and a shared storage area. The dedicated storage area is used to store temporary calculation results of one arithmetic chip: intermediate results that this arithmetic chip continues to use but other arithmetic chips do not use. The shared storage area is used to store operation results of the arithmetic chip that are used by other arithmetic chips or that need to be transmitted to the outside as feedback.
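  • A toy model of a storage unit split into a dedicated area and a shared area (names invented; in this sketch, remote chips can only read the shared area):

```python
class StorageUnit:
    """Sketch: dedicated area for chip-private intermediates, shared area
    for results other chips consume over the ring."""
    def __init__(self):
        self.dedicated = {}   # visible only to the owning chip
        self.shared = {}      # readable by other chips via the serdes ring

    def write(self, addr, value, shared=False):
        (self.shared if shared else self.dedicated)[addr] = value

    def read_remote(self, addr):
        # a remote chip's request is only served from the shared area
        return self.shared.get(addr)

unit = StorageUnit()
unit.write(0, "temp", shared=False)    # private intermediate result
unit.write(1, "result", shared=True)   # result other chips will use
assert unit.read_remote(0) is None and unit.read_remote(1) == "result"
```

Keeping intermediates out of the shared area avoids spending ring bandwidth on data no other chip will ever read.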
  • each core core performs calculation and storage control functions
  • at least one storage unit is connected to each core core inside the chip.
  • the storage unit is connected to the storage units of other cores, so that each core can have a large-capacity memory, reducing the number of times data is moved into or out of memory from external storage space and speeding up data processing; at the same time, because multiple cores can operate independently or cooperatively, data processing speed is further improved.
  • the data mentioned here is various data such as command data, numeric data, character data, and so on.
  • the data format specifically includes a valid bit (valid), a destination address (dst id), a source address (src id), and data (data).
  • the kernel can determine whether the data packet is a command or a value by the valid bit; here, it can be assumed that 0 represents a value and 1 represents a command. The kernel determines the destination address, source address, and data type according to the data structure.
  • the core 50 sends a data read command to the core 10: the valid bit is 1, the destination address is the address of core 10, the source address is the address of core 50, and the data field carries the read-data command and the data type or data address.
  • the core 10 then returns the data to the core 50: the valid bit is 0, the destination address is the address of core 50, the source address is the address of core 10, and the data field carries the data that was read.
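  • The FIG. 7 frame layout (valid, dst id, src id, data) can be modeled as follows; the field widths chosen here are assumptions for the sketch, not from the patent:

```python
def pack(valid, dst, src, data, id_bits=8, data_bits=32):
    """Pack the FIG. 7 fields into one integer frame (widths assumed)."""
    frame = valid
    frame = (frame << id_bits) | dst
    frame = (frame << id_bits) | src
    frame = (frame << data_bits) | data
    return frame

def unpack(frame, id_bits=8, data_bits=32):
    data = frame & ((1 << data_bits) - 1); frame >>= data_bits
    src = frame & ((1 << id_bits) - 1); frame >>= id_bits
    dst = frame & ((1 << id_bits) - 1); frame >>= id_bits
    return frame, dst, src, data   # valid=1 marks a command, 0 a value

# core 50 asks core 10 for data: valid=1 (command), dst=10, src=50
cmd = pack(1, 10, 50, 0xCAFE)
assert unpack(cmd) == (1, 10, 50, 0xCAFE)
```

The reply in the surrounding text is the mirror image: valid=0 (a value), dst and src swapped, and the data field carrying the read result.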
  • the traditional six-stage pipeline structure is adopted in this embodiment, which are instruction fetch, decoding, execution, memory access, alignment and write-back stage respectively.
  • a reduced instruction set architecture can be adopted.
  • the instruction set of the present invention can be divided by function into register-register instructions, register-immediate instructions, jump instructions, memory access instructions, control instructions, and inter-core communication instructions.
  • the embodiments can be implemented as a machine, process, or article of manufacture by using standard programming and/or engineering techniques to produce software, firmware, hardware, or any combination thereof.
  • any generated program(s) can be embodied on one or more computer-usable media, such as resident storage devices, smart cards or other removable storage devices, or transmission devices, so that computer program products and articles of manufacture are produced according to the embodiments.
  • the terms "article of manufacture" and "computer program product" as used herein are intended to cover a computer program that exists permanently or temporarily on any non-transitory medium usable by a computer.
  • memory/storage devices include, but are not limited to, magnetic disks, optical disks, removable storage devices (such as smart cards, subscriber identity modules (SIM), and wireless identification modules (WIM)), and semiconductor memories (such as random access memory (RAM), read-only memory (ROM), and programmable read-only memory (PROM)).
  • transmission media include, but are not limited to, wireless communication networks, the Internet, intranets, telephone/modem-based network communication, hard-wired/cable communication networks, satellite communication, and other fixed or mobile network systems/communication links.
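As an illustration of the inter-core data frame described in the points above (valid bit, destination address, source address, data field) and of the read exchange between core 50 and core 10, the following sketch models the frame in Python. The field names, widths, and encodings are assumptions for illustration only; the document does not fix them.

```python
# Illustrative model of the inter-core data frame: valid bit, dst id,
# src id, data. Field widths and names are assumed, not specified.
from dataclasses import dataclass

CMD, VALUE = 1, 0  # valid bit: 1 = command frame, 0 = value frame


@dataclass
class Frame:
    valid: int   # 1 = command, 0 = numeric value
    dst_id: int  # destination core address
    src_id: int  # source core address
    data: bytes  # command payload or the read data itself


def make_read_request(requester: int, target: int, addr: bytes) -> Frame:
    """Core `requester` asks core `target` for data at `addr`."""
    return Frame(valid=CMD, dst_id=target, src_id=requester, data=addr)


def make_read_reply(request: Frame, payload: bytes) -> Frame:
    """The target core answers with a value frame, swapping the addresses."""
    return Frame(valid=VALUE, dst_id=request.src_id,
                 src_id=request.dst_id, data=payload)


# Mirrors the text: core 50 reads from core 10, core 10 replies.
req = make_read_request(requester=50, target=10, addr=b"\x00\x10")
rsp = make_read_reply(req, payload=b"\xde\xad")
assert (req.valid, req.dst_id, req.src_id) == (CMD, 10, 50)
assert (rsp.valid, rsp.dst_id, rsp.src_id) == (VALUE, 50, 10)
```

The reply simply swaps the source and destination addresses of the request, which is how a core knows where to return the fetched data.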

Abstract

Embodiments of the present invention provide a big data operation acceleration system and a data transmission method. The big data operation acceleration system comprises two or more operation chips. The operation chip comprises N cores, N data lanes, and at least one storage unit, N being a positive integer greater than or equal to 4. The data lane comprises a transmission interface (tx) and a receiving interface (rx). The cores and the data lanes are in one-to-one correspondence. The core transmits and receives data via the data lane. The two or more operation chips are connected by means of the transmission interface (tx) and the receiving interface (rx) so as to transmit data. The two or more operation chips are connected to form a ring. The technical solution of the embodiments of the present invention improves the speed of data transmission among multiple ASIC chips.

Description

Big data operation acceleration system and data transmission method

Technical Field
The present invention relates to the field of integrated circuits, and in particular to a big data operation acceleration system and a data transmission method.
Background Art
An ASIC (Application Specific Integrated Circuit) is an integrated circuit designed and manufactured to meet the requirements of specific users and the needs of a specific electronic system. ASICs are oriented to the needs of specific users; in volume production, ASICs offer smaller size, lower power consumption, improved reliability, improved performance, enhanced confidentiality, and lower cost compared with general-purpose integrated circuits.
With the development of science and technology, more and more fields, such as artificial intelligence and secure computing, involve specific calculations with large amounts of computation. For such specific operations, ASIC chips can exploit their advantages of fast computation and low power consumption. At the same time, in these computation-intensive fields, it is usually necessary to control N arithmetic chips working simultaneously in order to improve data processing speed and capacity. As data precision keeps improving, fields such as artificial intelligence and secure computing must operate on ever larger data. For example, the size of a photo is currently 3-7 MB in general, but as the precision of digital cameras and video cameras increases, a photo can reach 10 MB or more, and 30 minutes of video may exceed 1 GB of data. Fields such as artificial intelligence and secure computing demand fast computation and low latency, so improving computation speed and response time has always been a goal of chip design. Since the memory paired with an ASIC chip is generally 64 MB or 128 MB, when the data to be processed exceeds 512 MB the ASIC chip must access memory repeatedly and move data into and out of memory from external storage many times, which reduces processing speed.

At the same time, as data precision keeps improving, fields such as artificial intelligence and secure computing need to operate on larger and larger data, and an ASIC chip generally has to be configured with multiple storage units to store it; for example, if one ASIC chip is configured with four 2 GB memories, then when N arithmetic chips work simultaneously, 4N blocks of 2 GB memory are needed. However, when multiple arithmetic chips work at the same time, the amount of data stored does not exceed 2 GB, which wastes storage units and raises system cost.
In designs that process large amounts of related data, the prior art faces two difficulties: first, the need to greatly improve performance; second, in a distributed system, the data-dependency problem must also be solved, that is, data processed in one subsystem needs to be presented to all the other subsystems for confirmation and reprocessing. The time spent on data processing is generally reduced in two ways: one is to speed up the clock of the data-processing logic; the other is to increase the number of concurrent blocks processing data.
Under process constraints, the achievable improvement in clock rate is very limited. Increasing concurrency is a more effective way to improve performance, but once concurrency is raised, the data bandwidth requirement generally rises accordingly. In a typical system the data bandwidth depends on the bandwidth provided by DDR, but DDR bandwidth does not scale linearly. Suppose the initial system contains one group of DDR providing 1x bandwidth. To obtain a 2x bandwidth increase, two groups of DDR can be implemented; but to obtain an increase of 16x or more, it is impossible, because of physical size limits, simply to instantiate 16 groups of DDR in one system.
If multiple ASIC chips need to work together, the data cannot simply be distributed across multiple unconnected systems for processing, because the data are related: every piece of data completed in one processing unit must be confirmed and reprocessed in the other processing units. Therefore, raising the rate of data transmission between multiple ASIC chips also requires solving the problem of multi-system interconnection.
Summary of the Invention
The purpose of the embodiments of the present invention is to provide a way of connecting distributed storage through high-speed interfaces, so that multiple homogeneous systems can process a large amount of related data concurrently. An embodiment of the present invention provides a big data operation acceleration system in which external chip memory is eliminated and the storage units are placed inside the ASIC chips, reducing the time the ASIC chip spends reading data from outside and speeding up chip computation. Multiple ASIC chips share the storage units, which not only reduces the number of storage units but also reduces the connection lines between the ASIC arithmetic chips, simplifying the system structure and lowering the cost of the ASIC chips. At the same time, a serdes interface is used for data transmission between the arithmetic chips, which raises the rate of data transmission between multiple ASIC chips.
To achieve the above objective, the embodiments of the present invention provide the following technical solutions:
According to a first aspect of the embodiments of the present invention, a big data operation acceleration system is provided, comprising two or more arithmetic chips. Each arithmetic chip includes N cores, N data lanes, and at least one storage unit, where N is a positive integer greater than or equal to 4. Each data lane includes a transmitting interface (tx) and a receiving interface (rx); the cores and the data lanes are in one-to-one correspondence, and each core sends and receives data through its data lane. The two or more arithmetic chips are connected through the transmitting interfaces (tx) and receiving interfaces (rx) to transmit data, and are connected to form a ring.
According to a second aspect of the embodiments of the present invention, a data transmission method for a big data operation acceleration system is provided. The system includes two or more arithmetic chips connected through transmitting interfaces (tx) and receiving interfaces (rx) to transmit data, the chips being connected in a ring. After the first arithmetic chip, the data source, generates data, it sends the data through its transmitting interface (tx) to the second arithmetic chip on its adjacent side. The second arithmetic chip splits the data into two paths: one path is delivered to a core of the second arithmetic chip, and the other path is forwarded through a transmitting interface (tx) to the third arithmetic chip on the adjacent side of the second chip.
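The ring transmission method of the second aspect can be sketched as a short simulation: the source chip sends a frame to its neighbour, and every chip in turn hands one copy to its own core and forwards the other copy to the next chip. The assumption that forwarding stops once the frame has travelled the whole ring back to the source is made for the example; the text only describes the two-way split at each hop.

```python
# Illustrative simulation of ring broadcast among arithmetic chips.
# Chips are numbered 0..num_chips-1 and connected in a ring; the frame
# travels one direction until it returns to the source chip (assumed
# stop condition).

def broadcast_on_ring(num_chips: int, source: int) -> list:
    """Return the list of chips whose cores receive the frame."""
    delivered = []
    chip = (source + 1) % num_chips       # tx of source -> rx of neighbour
    while chip != source:                 # circulate once around the ring
        delivered.append(chip)            # copy 1: handed to this chip's core
        chip = (chip + 1) % num_chips     # copy 2: forwarded via tx interface
    return delivered


# With 4 chips and chip 0 as the data source, chips 1, 2 and 3 all
# receive the data after three hops.
assert broadcast_on_ring(4, 0) == [1, 2, 3]
```

Each hop costs one serdes transfer, so data generated on any chip reaches all other chips after at most M-1 hops in an M-chip ring.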
In the embodiments of the present invention, multiple chips are provided in the big data operation acceleration system; each chip includes multiple cores, each core performs computation and storage-control functions, and at least one storage unit is connected to each core inside the chip. Each core can read data both from its own storage unit and from the storage units connected to the cores of other arithmetic chips, so each core effectively has a large-capacity memory; this reduces the number of times data is moved into or out of memory from external storage and speeds up data processing. At the same time, the multiple cores can operate independently or cooperatively, which accelerates processing further. The multiple ASIC chips share storage units, which not only reduces the number of storage units but also reduces the connection lines between the ASIC arithmetic chips, simplifying the system structure and lowering the cost of the ASIC chips.
Brief Description of the Drawings
To explain the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some exemplary embodiments; a person of ordinary skill in the art can obtain other drawings from them without creative work.
Fig. 1 is a schematic diagram of the structure of a big data operation acceleration system with M ASIC chips according to the first embodiment;

Fig. 2 is a schematic structural diagram of an arithmetic chip with 4 cores;

Fig. 3 is a schematic structural diagram of a data lane;

Fig. 4a is a schematic structural diagram of a first embodiment of the storage unit;

Fig. 4b is a schematic structural diagram of a second embodiment of the storage unit;

Fig. 5 is a schematic diagram of the data transmission process of the big data operation acceleration system;

Fig. 6 is a schematic signal-flow diagram of an arithmetic chip with 4 cores according to the first embodiment;

Fig. 7 is a schematic diagram of a data structure according to the present invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the drawings. It should be understood that these embodiments are given only so that those skilled in the art can better understand and implement the present invention, and do not limit the scope of the present invention in any way. On the contrary, these embodiments are provided so that this disclosure will be thorough and complete and will fully convey the scope of the disclosure to those skilled in the art.
In addition, it should be noted that the directions up, down, left, and right in the drawings are merely illustrations of specific embodiments. A person skilled in the art can change the orientation of some or all of the components shown in the drawings according to actual needs without affecting the functions of the individual components or of the system as a whole; such reoriented technical solutions still fall within the protection scope of the present invention.
A multi-core chip is a multiprocessing system embodied on a single large-scale integrated semiconductor chip. Typically, two or more chip cores are embodied on a multi-core chip and interconnected by a bus (which can also be formed on the same multi-core chip). Anywhere from two to many chip cores can be embodied on the same multi-core chip; the upper limit on the number of cores is bounded only by manufacturing capability and performance constraints. Multi-core chips can host applications that perform specialized arithmetic and/or logical operations in multimedia and signal-processing algorithms, such as video encoding/decoding, 2D/3D graphics, audio and speech processing, image processing, telephony, speech recognition and speech synthesis, and encryption.
Although only ASIC application-specific integrated circuits are mentioned in the background, the specific wiring implementations in the embodiments can also be applied to multi-core CPUs, GPUs, FPGAs, and the like. In the embodiments, the multiple cores may be identical cores or different cores.
Fig. 1 is a schematic diagram of the structure of a big data operation acceleration system with M ASIC chips according to the first embodiment. As shown in Fig. 1, the system includes M ASIC arithmetic chips, where M is a positive integer greater than or equal to 2, for example 6, 10, or 12. Each arithmetic chip includes multiple cores (core0, core1, core2, core3) and 4 data lanes (lane0, lane1, lane2, lane3); each data lane includes a transmitting interface (tx) and a receiving interface (rx), and the cores and data lanes are in one-to-one correspondence. For example, core0 of the arithmetic chip 10 has data lane lane0, which has a transmitting interface (lane0 tx) and a receiving interface (lane0 rx): the transmitting interface (lane0 tx) is used by core0 to send data or control instructions outside the arithmetic chip 10, and the receiving interface (lane0 rx) is used to deliver data or control instructions from outside the arithmetic chip 10 to core0.

In this way the M arithmetic chips are connected through the transmitting interfaces (tx) and receiving interfaces (rx) so that data or control instructions can be transmitted, the M chips forming a closed ring. A storage unit is provided in each arithmetic chip, and all 4 cores of the chip are connected to it. The storage units of the M arithmetic chips store data in a distributed manner: a chip's cores can obtain data from the chip's own storage unit and also from the storage units of other arithmetic chips. Because all 4 cores of a chip are connected to the storage unit, the storage unit also serves for data exchange among the 4 cores of the chip. As those skilled in the art will appreciate, choosing 4 cores here is only an example; the number of cores may be N, where N is a positive integer greater than or equal to 4, for example 6, 10, or 12. In this embodiment, the multiple cores may be identical cores or different cores.
The transmitting interface (lane tx) and receiving interface (lane rx) of a data lane are serdes interfaces, and the arithmetic chips communicate with each other through these serdes interfaces. Serdes is short for SERializer/DESerializer. It is a mainstream time-division multiplexing (TDM), point-to-point (P2P) serial communication technology: multiple low-speed parallel signals are converted into a high-speed serial signal at the transmitting end, pass through the transmission medium (optical cable or copper wire), and are finally converted back into low-speed parallel signals at the receiving end. This point-to-point serial communication technology makes full use of the channel capacity of the transmission medium, reduces the number of transmission channels and device pins required, and increases signal transmission speed, thereby greatly reducing communication cost. Of course, other communication interfaces, such as SSI or UART, can be used instead of the serdes interface. The chips transmit data and control instructions over the serdes interfaces.
Fig. 2 illustrates a first embodiment of the structure of an arithmetic chip with 4 cores. As those skilled in the art will appreciate, choosing 4 cores here is only an example; the number of cores of the arithmetic chip may be N, where N is a positive integer greater than or equal to 2, for example 6, 10, or 12. In this embodiment, the cores of the arithmetic chip may be cores with the same function or cores with different functions.
The 4-core arithmetic chip (1) includes 4 cores (core0, core1, core2, core3), 4 data lanes (lane0, lane1, lane2, lane3), at least one storage unit, and a data exchange control unit; in this embodiment the data exchange control unit is a UART control unit. Each data lane includes a transmitting interface (lane tx) and a receiving interface (lane rx).
Core0 of the arithmetic chip (1) is connected to the transmitting interface (lane0 tx) and the receiving interface (lane0 rx) of its data lane: the transmitting interface (lane0 tx) is used by core0 to send data or control instructions to the arithmetic chip connected to the arithmetic chip 1, and the receiving interface (lane0 rx) is used to deliver data or control instructions transmitted by that connected arithmetic chip to core0. Similarly, core1 of the arithmetic chip 1 is connected to the transmitting interface (lane1 tx) and receiving interface (lane1 rx), core2 to the transmitting interface (lane2 tx) and receiving interface (lane2 rx), and core3 to the transmitting interface (lane3 tx) and receiving interface (lane3 rx) of their respective data lanes. The transmitting interfaces (lane tx) and receiving interfaces (lane rx) of the data lanes are serdes interfaces.
A data exchange control unit is connected through a bus to the storage unit and the 4 cores (core0, core1, core2, core3); the bus is not drawn in Fig. 2. The data exchange control unit can be implemented with various protocols, such as UART, SPI, PCIE, SERDES, or USB; in this embodiment it is a UART (Universal Asynchronous Receiver/Transmitter) control unit. A universal asynchronous receiver/transmitter, usually called a UART, converts the data to be transferred between serial and parallel form, and UARTs are commonly integrated into the links of various communication interfaces. The UART protocol is used here only as an example; other protocols can also be adopted. The UART control unit accepts external data and, according to the external data's address, sends it to a core (core0, core1, core2, core3) or to the storage unit. The UART control unit can also accept external control instructions and send control instructions to the cores or the storage unit; it can also be used by the arithmetic chip to send internal or external control instructions to other arithmetic chips, to accept control instructions from other chips, and to feed back operation results or intermediate data to the outside.

Internal data or internal control instructions are data or instructions generated by the chip itself; external data or external control instructions are data or instructions generated outside the chip, for example data or instructions sent by an external host or an external network.
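The routing rule of the data exchange control unit, which forwards incoming traffic to a core or to the storage unit according to its destination address, can be sketched as follows. The address map (core addresses 0-3, storage unit at 0xF) is invented purely for illustration; the document does not specify an address layout.

```python
# Hypothetical sketch of the data exchange (UART) control unit's routing
# rule: external data is dispatched by destination address either to one
# of the cores or to the storage unit. The address assignments below are
# assumptions made for the example.

CORE_ADDRS = {0, 1, 2, 3}   # addresses of core0..core3 (assumed)
STORAGE_ADDR = 0xF          # address of the storage unit (assumed)


def route(dst_addr: int) -> str:
    """Return the destination block for a frame with address `dst_addr`."""
    if dst_addr in CORE_ADDRS:
        return "core{}".format(dst_addr)
    if dst_addr == STORAGE_ADDR:
        return "storage_unit"
    raise ValueError("unknown destination address: {:#x}".format(dst_addr))


assert route(2) == "core2"
assert route(0xF) == "storage_unit"
```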
The main functions of the cores (core0, core1, core2, core3) are to execute external or internal control instructions, to perform data computation, and to control data storage. All of the cores of the arithmetic chip are connected to the storage unit and read data from or write data to it; through the storage unit, data exchange among the chip's multiple cores is realized. The cores can also send control commands to the chip's storage unit. According to instructions, a core can write data to, read data from, or send control instructions to the storage units of other arithmetic chips through the serdes interfaces; likewise, a core can send data to, read data from, or send control instructions to the cores of other arithmetic chips through the serdes interfaces.
Fig. 3 illustrates a first embodiment of the structure of a data lane. The data lane includes a receiving interface, a transmitting interface, a receive-address judgment unit, a transmit-address judgment unit, and several registers. One end of the receive-address judgment unit is connected to the receiving interface, and the other end is connected to the core through a register; one end of the transmit-address judgment unit is connected to the transmitting interface (tx), and the other end is connected to the core through a register; the receive-address judgment unit and the transmit-address judgment unit are connected to each other through a register.

When the receiving interface receives a data frame or control instruction sent by the arithmetic chip on its adjacent side, it passes the frame or instruction to the receive-address judgment unit, which delivers it to the core and at the same time passes it to the transmit-address judgment unit. The transmit-address judgment unit receives the frame or instruction and passes it to the transmitting interface (tx), which sends it on to the arithmetic chip on the other adjacent side. When the core generates a data frame or control instruction, it passes it to the transmit-address judgment unit, which passes it to the transmitting interface, and the transmitting interface sends it to the receiving interface of the arithmetic chip on the adjacent side. The registers serve to buffer data frames or control instructions temporarily.
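The receive path of one lane can be sketched as a small decision function. The text has the receive-address judgment unit pass frames both to the core and onward; the sketch below adds the assumption that the judgment consists of comparing the frame's destination address with the local core's address, consuming matching frames and forwarding the rest.

```python
# Minimal sketch of a data lane's receive path, under the assumption
# that the "address judgment" compares the frame's destination address
# with the local core's address: a matching frame is handed to the core,
# any other frame is latched in a register and forwarded out of the tx
# interface to the next chip in the ring.

def lane_rx(local_core_addr, frame):
    """Return (destination, frame forwarded on tx or None)."""
    if frame["dst_id"] == local_core_addr:
        return ("core", None)   # consumed by the local core
    return ("tx", frame)        # forwarded to the neighbour chip


assert lane_rx(7, {"dst_id": 7, "data": b"x"}) == ("core", None)
dest, fwd = lane_rx(7, {"dst_id": 9, "data": b"x"})
assert dest == "tx" and fwd["dst_id"] == 9
```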
Fig. 4a illustrates a first embodiment of the structure of the storage unit. Each arithmetic chip contains N cores that need concurrent random access to data; if N reaches the order of 64 or more, the memory bandwidth of the arithmetic chip must reach a very high order of magnitude, which even GDDR can hardly provide. Therefore, in the embodiments of the present invention, high bandwidth is provided by using an SRAM array with large MUX routing. As shown in Fig. 4a, the system consists of two levels of storage control to relieve congestion in the implementation. The storage unit (40) includes 8 memories (410 ... 417) connected to a storage control unit (420), which controls data reading and storage for the memories. Each memory (410 ... 417) includes at least two storage subunits and a storage control subunit; the storage control subunit is connected to the storage control unit through an interface and controls data reading and storage for its at least two storage subunits. The storage subunits are SRAM memories.
FIG. 4b illustrates a second embodiment of the structure of the storage unit. In FIG. 4b, multiple storage control units (420, 421, 422, 423) may be provided in the storage unit; each core is connected to every one of the storage control units (420, 421, 422, 423), and every storage control unit is connected to every memory (410 ... 417). The structure of each memory is identical to that in FIG. 4a and is not described again here.
The core sends generated data to at least one storage control unit, the storage control unit sends the data to the storage control subunit, and the storage control subunit stores the data in a storage subunit. When the core of an arithmetic chip receives a data acquisition command sent by another arithmetic chip, it judges from the data address whether the data is stored in the storage unit of its own chip. If so, it sends a data read command to at least one storage control unit; the storage control unit forwards the read command to the corresponding storage control subunit, which fetches the data from the storage subunit and returns it to the storage control unit. The storage control unit sends the fetched data to the core, the core sends it to the sending address judgment unit, the sending address judgment unit sends it to the sending interface (tx), and the sending interface transmits it to the adjacent arithmetic chip.
When the big data operation acceleration system is applied in the field of artificial intelligence, the UART control unit of an arithmetic chip stores the image or video data sent by the external host into the storage unit through the core. The arithmetic chip generates the mathematical model of a neural network; this model may also be stored into the storage unit by the external host through the UART control unit and read by each arithmetic chip. The first layer of the neural network model runs on the first arithmetic chip: its core reads data from the storage unit of its own chip and/or the storage units of other arithmetic chips, performs the computation, and stores the result through the serdes interface into at least one storage unit of another arithmetic chip, or into the storage unit of its own chip. The arithmetic chip (1) sends a control instruction to the next arithmetic chip (2) through the UART control unit or the serdes interface to start it. The second layer of the neural network model then runs on the next arithmetic chip (2): its core likewise reads data from the storage unit of its own chip and/or the storage units of other arithmetic chips, performs the computation, and stores the result through the serdes interface into at least one storage unit of another arithmetic chip, or into the storage unit of its own chip. Each chip executes one layer of the neural network, obtaining data through the serdes interface from the storage units of other arithmetic chips or of its own chip, until the last layer of the neural network produces the final result. An arithmetic chip then obtains the result from the local storage unit or from the storage unit of another arithmetic chip and feeds it back to the external host through the UART control unit.
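The layer-per-chip pipeline described above can be sketched as follows. The function and the `layers` callables are hypothetical; the sketch only mirrors the data flow (each chip reads its input, computes one layer, and writes the output into the next chip's storage unit).

```python
def run_network_on_ring(layers, input_data, chip_memories):
    """Each chip executes one neural-network layer and stores its output into
    the memory of the next chip on the ring, which is then started.
    `layers` is one callable per chip (an illustrative stand-in for the
    layer's mathematical model)."""
    chip_memories[0] = input_data          # host loads data via UART into chip 1
    for i, layer in enumerate(layers):
        activations = chip_memories[i]     # read from this chip's storage unit
        result = layer(activations)        # run layer i on chip i
        nxt = (i + 1) % len(chip_memories)
        chip_memories[nxt] = result        # store via serdes to the next chip's memory
    return result                          # last layer's output, fed back via UART

layers = [lambda x: [v * 2 for v in x], lambda x: [v + 1 for v in x]]
mem = [None, None]
out = run_network_on_ring(layers, [1, 2, 3], mem)  # out == [3, 5, 7]
```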
When the big data operation acceleration system is applied in the field of cryptocurrency, the UART control unit of the arithmetic chip (1) stores the block information sent by the external host into at least one of the storage units of the arithmetic chips. The external host may send control instructions through the UART control units of the arithmetic chips (1 ... M) to all M arithmetic chips, which then start their operations. Alternatively, the external host may send a control instruction to the UART control unit (130) of a single arithmetic chip (1), and the arithmetic chip (1) in turn sends control instructions to the other M-1 arithmetic chips, whereupon the M arithmetic chips start their operations. It is also possible for the external host to send a control instruction to the UART control unit of the arithmetic chip (1); the first arithmetic chip (1) then sends a control instruction to the second arithmetic chip (2), the second arithmetic chip (2) to the third arithmetic chip (3), and the third arithmetic chip (3) to the fourth arithmetic chip (4), until all M arithmetic chips have started. The M arithmetic chips obtain data through the serdes interface from the storage units of other arithmetic chips or of their own chips and perform the proof-of-work computation simultaneously. The arithmetic chip (1) obtains the result from the storage unit and feeds it back to the external host through the UART control unit.
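The simultaneous proof-of-work described above can be sketched by partitioning the nonce search space across the M chips. SHA-256 stands in for the actual hash, the interleaved partitioning is a hypothetical choice, and the difficulty check is simplified to a fixed hex prefix.

```python
import hashlib

def search_slice(block_info, chip_id, m_chips, difficulty_prefix="00", limit=100000):
    """Chip k searches the interleaved nonce slice k, k+M, k+2M, ...
    (a hypothetical work split; the patent does not fix one)."""
    for nonce in range(chip_id, limit, m_chips):
        digest = hashlib.sha256(f"{block_info}{nonce}".encode()).hexdigest()
        if digest.startswith(difficulty_prefix):
            return nonce, digest
    return None

# All M chips work on the same block information held in shared storage:
results = [search_slice("block-header", k, 4) for k in range(4)]
winner = min((r for r in results if r), key=lambda r: r[0])
```

In the system itself each slice would run on a different chip in parallel; the list comprehension above merely runs the M slices in sequence to show the partitioning.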
FIG. 5 illustrates a first embodiment of the data transmission process of the big data operation acceleration system. Each arithmetic chip completes 1/n of the work; because of data dependencies, once a chip finishes its share of the data it must transmit its result to all the other chips. Arithmetic chip n-1 is the source of the data frame, and the data is sent through lane1 tx to arithmetic chip 0. Inside arithmetic chip 0, the data frame is split into two paths: the first path is delivered to the core of arithmetic chip 0, and the other path is forwarded into the lane1 tx channel of arithmetic chip 0, so that the data frame is sent on to arithmetic chip 1.
Source ID mechanism: every data frame carries the ID of the arithmetic chip that originated it. Whenever the frame is sent to a new arithmetic chip, that chip inspects the chip ID carried in the frame. If the ID in the frame equals the ID of the next arithmetic chip connected to this chip, the frame is not forwarded any further; the life cycle of the data frame ends there and it no longer occupies bandwidth. The check of the chip ID in the data frame may be performed either in the core or in the receiving address judgment unit.
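The source-ID termination rule can be simulated directly: a frame travels around the ring, each hop delivers it to the local core, and forwarding stops when the next chip's ID equals the source ID carried in the frame, so every chip except the source receives the frame exactly once.

```python
def broadcast_on_ring(frame_src, n_chips):
    """Simulate the source-ID mechanism on a ring of n_chips chips.
    Returns the list of chips that receive the frame, in hop order."""
    receivers = []
    chip = (frame_src + 1) % n_chips     # first hop from the source chip
    while True:
        receivers.append(chip)           # first path: deliver to this chip's core
        next_chip = (chip + 1) % n_chips
        if next_chip == frame_src:       # the frame's life cycle ends here
            break
        chip = next_chip                 # second path: forward on the lane tx
    return receivers

# A frame originating at chip 2 in a 4-chip ring reaches chips 3, 0, 1:
print(broadcast_on_ring(2, 4))  # -> [3, 0, 1]
```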
FIG. 6 illustrates the signal flow of an arithmetic chip with four cores according to the first embodiment. The UART control unit (130) obtains data or control instructions from outside the chip and transmits them to the core (110) connected to the UART control unit. The core (110) either stores external data, according to its data address, into the storage unit (120) of this chip, or sends the data through a signal channel (lane) to the core of the other chip corresponding to the data address, and that chip's core stores the data in its local storage unit. An external control instruction is either executed by the core of this chip or, according to the address of the instruction, sent through a signal channel (lane) to the core of the corresponding other chip for execution. When the core of an arithmetic chip needs to obtain data, it may obtain the data from the local storage unit or from the storage units of other arithmetic chips. When obtaining data from the storage unit of another arithmetic chip, the core (110) broadcasts a data acquisition instruction through its serdes interface (150) to the connected arithmetic chip; the connected chip splits the instruction into two paths, one delivered to its own core and the other forwarded to the next chip. If a connected arithmetic chip determines that the data is stored in its local storage unit, its core reads the data from the storage unit and sends it through the serdes interface to the chip that issued the data acquisition instruction. Control instructions between arithmetic chips may of course also be sent through the UART control units. When the core feeds a computation result or intermediate data back to the outside in response to an external or internal control instruction, it obtains the result or intermediate data from the storage unit of its own chip, or through the serdes interface from the storage unit of another arithmetic chip, and sends it to the outside through the UART control unit. "Outside" here may refer to an external host, an external network, an external platform, and so on. The external host can initialize and configure the storage unit parameters through the UART control unit and apply a unified addressing scheme to the multiple storage units.
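The broadcast remote read described above can be sketched as follows; `chips` models the unified address map as one dictionary of address-to-value entries per chip, which is an illustrative assumption rather than the patent's actual storage layout.

```python
def remote_read(requester_id, addr, chips):
    """Sketch of the broadcast data acquisition: the requesting core sends a
    read instruction around the ring; the chip whose local storage unit holds
    `addr` reads the data and returns it over the serdes interface."""
    n = len(chips)
    chip = (requester_id + 1) % n
    while chip != requester_id:          # instruction travels around the ring
        if addr in chips[chip]:          # data found in this chip's storage unit
            return chips[chip][addr]     # sent back via serdes to the requester
        chip = (chip + 1) % n            # otherwise forward to the next chip
    return None                          # no chip holds the address

chips = [{}, {0x100: b"weights"}, {}, {}]
assert remote_read(0, 0x100, chips) == b"weights"
```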
The core also performs computations on the data it obtains and stores the results in the storage unit. Each storage unit is divided into a dedicated storage area and a shared storage area. The dedicated storage area stores the temporary results of one arithmetic chip, that is, intermediate results that this chip will continue to use but that no other arithmetic chip needs. The shared storage area stores computation results that will be used by other arithmetic chips or that need to be fed back to the outside.
In the embodiments of the present invention, multiple cores are provided in a chip, each core performing both computation and storage control, and at least one storage unit inside the chip is connected to each core. Because each core can read both the storage unit connected to itself and the storage units connected to the other cores, every core effectively has a large-capacity memory, which reduces the number of times data must be moved into or out of memory from external storage and speeds up data processing. At the same time, since the multiple cores can compute independently or cooperatively, data processing is accelerated further.
FIG. 7 illustrates the data structure according to the present invention. The data referred to here includes command data, numeric data, character data, and other kinds of data. The data format specifically includes a valid bit (valid), a destination address (dst id), a source address (src id), and a data field (data). The core can judge from the valid bit whether the packet is a command or a value; here it may be assumed that 0 represents a value and 1 represents a command. The core determines the destination address, source address, and data type from this structure. For example, in FIG. 1, when core 50 sends a data read command to core 10, the valid bit is 1, the destination address is the address of core 10, the source address is the address of core 50, and the data field carries the read command together with the data type or data address. When core 10 returns data to core 50, the valid bit is 0, the destination address is the address of core 50, the source address is the address of core 10, and the data field carries the data that was read. In terms of instruction timing, this embodiment adopts a traditional six-stage pipeline: instruction fetch, decode, execute, memory access, align, and write-back. In terms of instruction set architecture, a reduced instruction set architecture may be adopted. Following the general design method of a reduced instruction set architecture, the instruction set of the present invention can be divided by function into register-register instructions, register-immediate instructions, jump instructions, memory access instructions, control instructions, and inter-core communication instructions.
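The frame layout of FIG. 7 (valid bit, dst id, src id, data) can be sketched as a packed structure. The field widths chosen below, and the explicit length field, are assumptions for illustration, since the patent does not specify bit widths.

```python
import struct

VALID_VALUE, VALID_COMMAND = 0, 1

def pack_frame(valid, dst_id, src_id, data: bytes) -> bytes:
    """Pack a frame as: 1-byte valid flag, 1-byte dst id, 1-byte src id,
    2-byte big-endian payload length, then the payload (assumed widths)."""
    return struct.pack(">BBBH", valid, dst_id, src_id, len(data)) + data

def unpack_frame(raw: bytes):
    """Inverse of pack_frame: return (valid, dst_id, src_id, payload)."""
    valid, dst_id, src_id, length = struct.unpack_from(">BBBH", raw)
    return valid, dst_id, src_id, raw[5:5 + length]

# Core 50 sends a read command to core 10 (valid bit 1 = command):
frame = pack_frame(VALID_COMMAND, 10, 50, b"READ 0x1000")
assert unpack_frame(frame) == (1, 10, 50, b"READ 0x1000")
```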
Using the description provided herein, the embodiments can be implemented as a machine, process, or article of manufacture by using standard programming and/or engineering techniques to produce programming software, firmware, hardware, or any combination thereof.
Any generated program(s) having computer-readable program code can be embodied on one or more computer-usable media, such as resident storage devices, smart cards or other removable storage devices, or transmission devices, so as to make computer program products and articles of manufacture according to the embodiments. As such, the terms "article of manufacture" and "computer program product" as used herein are intended to cover a computer program that exists permanently or temporarily on any non-transitory medium usable by a computer.
As noted above, memory/storage devices include, but are not limited to, magnetic disks, optical disks, removable storage devices (such as smart cards, subscriber identity modules (SIM), and wireless identification modules (WIM)), and semiconductor memories (such as random access memory (RAM), read-only memory (ROM), and programmable read-only memory (PROM)). Transmission media include, but are not limited to, transmission via wireless communication networks, the Internet, intranets, telephone/modem-based network communication, hard-wired/cable communication networks, satellite communication, and other fixed or mobile network systems/communication links.
Although specific example embodiments have been disclosed, those skilled in the art will understand that changes can be made to the specific example embodiments without departing from the spirit and scope of the invention.
The present invention has been described above based on embodiments with reference to the drawings, but the present invention is not limited to the above embodiments; schemes in which parts of the embodiments and modifications are appropriately combined or replaced according to layout requirements are also included within the scope of the present invention. In addition, the combinations and processing order of the embodiments may be appropriately rearranged based on the knowledge of those skilled in the art, and various design changes and other modifications may be applied to the embodiments; embodiments to which such modifications are applied may also be included within the scope of the present invention.
Although the present invention has described various concepts in detail, those skilled in the art will understand that various modifications and substitutions of those concepts can be implemented within the spirit of the overall teaching disclosed herein. A person skilled in the art can implement the invention set forth in the claims without undue experimentation using ordinary techniques. It will be understood that the specific concepts disclosed are merely illustrative and are not intended to limit the scope of the present invention, which is determined by the full scope of the appended claims and their equivalents.

Claims (10)

  1. A big data operation acceleration system, comprising two or more arithmetic chips, each arithmetic chip comprising N cores, N data channels (lanes), and at least one storage unit, where N is a positive integer greater than or equal to 4; each data channel (lane) comprises a sending interface (tx) and a receiving interface (rx); the cores and the data channels (lanes) correspond one to one, and each core sends and receives data through its data channel (lane); the two or more arithmetic chips are connected through the sending interfaces (tx) and the receiving interfaces (rx) to transmit data, and the two or more arithmetic chips are connected in a ring.
  2. The system according to claim 1, wherein the sending interface (tx) and the receiving interface (rx) of each arithmetic chip are serdes interfaces, and the arithmetic chips communicate with one another through the serdes interfaces.
  3. The system according to claim 1 or 2, wherein the data channel (lane) further comprises a receiving address judgment unit and a sending address judgment unit; one end of the receiving address judgment unit is connected to the receiving interface, and the other end of the receiving address judgment unit is connected to the core; one end of the sending address judgment unit is connected to the sending interface (tx), and the other end of the sending address judgment unit is connected to the core; and the receiving address judgment unit and the sending address judgment unit are connected to each other.
  4. The system according to claim 3, wherein the receiving interface (rx) receives a data frame sent by the arithmetic chip on one adjacent side and sends the data frame to the receiving address judgment unit; the receiving address judgment unit sends the data frame to the core and, at the same time, to the sending address judgment unit; the sending address judgment unit receives the data frame and sends it to the sending interface (tx); and the sending interface sends the data frame to the arithmetic chip on the other adjacent side.
  5. The system according to claim 3, wherein the core generates a data frame and sends the data frame to the sending address judgment unit; the sending address judgment unit sends the data frame to the sending interface (tx); and the sending interface (tx) sends the data frame to the arithmetic chip on the adjacent side.
  6. The system according to claim 3, wherein the receiving address judgment unit and the sending address judgment unit are connected to each other through a first-in first-out (FIFO) memory.
  7. The system according to claim 3, wherein the core of an arithmetic chip obtains a data acquisition command sent by another arithmetic chip and judges from the data address whether the data is stored in the storage unit of its own chip; if so, the core obtains the data from at least one storage unit and sends the obtained data to the sending address judgment unit; the sending address judgment unit sends the obtained data to the sending interface (tx); and the sending interface sends the obtained data to the adjacent arithmetic chip.
  8. A data transmission method for a big data operation acceleration system, the big data operation acceleration system comprising two or more arithmetic chips connected through sending interfaces (tx) and receiving interfaces (rx) to transmit data, the two or more arithmetic chips being connected in a ring; after a first arithmetic chip at the data source generates data, it sends the data through the sending interface (tx) to a second arithmetic chip on the adjacent side of the first arithmetic chip; the second arithmetic chip on the adjacent side splits the data into two paths: the first path is delivered to the core of the second arithmetic chip, and the other path is forwarded through the sending interface (tx) to a third arithmetic chip on the adjacent side of the second arithmetic chip.
  9. The method according to claim 8, wherein the data carries the identification (ID) of the arithmetic chip at the data source.
  10. The method according to claim 9, wherein after the data is transmitted to an adjacent arithmetic chip, the adjacent arithmetic chip detects the chip identification (ID) carried in the data; when the identification (ID) is equal to the identification (ID) of the next arithmetic chip connected to the adjacent arithmetic chip, the data is not forwarded any further.
PCT/CN2018/112546 2018-10-30 2018-10-30 Big data operation acceleration system and data transmission method WO2020087246A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2018/112546 WO2020087246A1 (en) 2018-10-30 2018-10-30 Big data operation acceleration system and data transmission method
CN201880097576.0A CN112740192B (en) 2018-10-30 2018-10-30 Big data operation acceleration system and data transmission method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/112546 WO2020087246A1 (en) 2018-10-30 2018-10-30 Big data operation acceleration system and data transmission method

Publications (1)

Publication Number Publication Date
WO2020087246A1 true WO2020087246A1 (en) 2020-05-07

Family

ID=70463294

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/112546 WO2020087246A1 (en) 2018-10-30 2018-10-30 Big data operation acceleration system and data transmission method

Country Status (2)

Country Link
CN (1) CN112740192B (en)
WO (1) WO2020087246A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140189197A1 (en) * 2012-12-27 2014-07-03 Ramamurthy Krithivas Sharing serial peripheral interface flash memory in a multi-node server system on chip platform environment
CN104699531A (en) * 2013-12-09 2015-06-10 超威半导体公司 Voltage dip relieving applied to three-dimensional chip system
CN104865938A (en) * 2015-04-03 2015-08-26 深圳市前海安测信息技术有限公司 Node connection chip applied to assess human body injury condition and node network thereof

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100573497C (en) * 2007-12-26 2009-12-23 杭州华三通信技术有限公司 Communication means and system between a kind of multinuclear multiple operating system
US9699079B2 (en) * 2013-12-30 2017-07-04 Netspeed Systems Streaming bridge design with host interfaces and network on chip (NoC) layers
CN103744644B (en) * 2014-01-13 2017-03-01 上海交通大学 The four core processor systems built using four nuclear structures and method for interchanging data
CN108536642A (en) * 2018-06-13 2018-09-14 北京比特大陆科技有限公司 Big data operation acceleration system and chip
CN209149287U (en) * 2018-10-30 2019-07-23 北京比特大陆科技有限公司 Big data operation acceleration system


Also Published As

Publication number Publication date
CN112740192A (en) 2021-04-30
CN112740192B (en) 2024-04-30

Similar Documents

Publication Publication Date Title
US7155554B2 (en) Methods and apparatuses for generating a single request for block transactions over a communication fabric
US7277975B2 (en) Methods and apparatuses for decoupling a request from one or more solicited responses
TW201915759A (en) High bandwidth memory systems
CN108536642A (en) Big data operation acceleration system and chip
TW201633171A (en) Enhanced data bus invert encoding for OR chained buses
US10922258B2 (en) Centralized-distributed mixed organization of shared memory for neural network processing
CN209149287U (en) Big data operation acceleration system
CN112052210A (en) Data structure for refined link training
WO2020087276A1 (en) Big data operation acceleration system and chip
US20210004347A1 (en) Approximate data bus inversion technique for latency sensitive applications
TWI720345B (en) Interconnection structure of multi-core system
CN103377170A (en) Inter-heterogeneous-processor SPI (serial peripheral interface) high speed two-way peer-to-peer data communication system
CN209560543U (en) Big data operation chip
WO2020087239A1 (en) Big data computing acceleration system
WO2020087246A1 (en) Big data operation acceleration system and data transmission method
WO2020087243A1 (en) Big data computing chip
WO2020087275A1 (en) Method for big data operation acceleration system carrying out operations
CN209784995U (en) Big data operation acceleration system and chip
CN115129657A (en) Programmable logic resource expansion device and server
CN112805727A (en) Artificial neural network operation acceleration device for distributed processing, artificial neural network acceleration system using same, and method for accelerating artificial neural network
CN208298179U (en) Big data operation acceleration system and chip
WO2020087278A1 (en) Big data computing acceleration system and method
CN209543343U (en) Big data operation acceleration system
CN109643301B (en) Multi-core chip data bus wiring structure and data transmission method
CN116561036B (en) Data access control method, device, equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18938388

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 11.10.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 18938388

Country of ref document: EP

Kind code of ref document: A1