WO2020087243A1

WO2020087243A1 - Big data computing chip

Info

Publication number: WO2020087243A1
Application number: PCT/CN2018/112541
Authority: WO
Inventors: 秦强
Original assignee: 北京比特大陆科技有限公司
Priority date: 2018-10-30
Filing date: 2018-10-30
Publication date: 2020-05-07

Abstract

Embodiments of the present invention provide a big data computing chip. The computing chip comprises N cores, N data channels, and at least one storage unit, wherein N is an integer greater than or equal to 4; each data channel comprises a sending interface and a receiving interface; the cores and the data channels have one-to-one correspondence; the cores send and receive data by means of the data channels; the computing chip performs data transmission with the outside of the chip by means of the sending interfaces and the receiving interfaces; each of the N cores is connected to the at least one storage unit. Multiple ASIC chips share a storage unit, and thus, the number of storage units is reduced, and connection lines between the ASIC computing chips are also reduced, thereby simplifying the system structure, and reducing the costs of ASIC chips.

Description

Big data operation chip

Technical field

The invention relates to the technical field of integrated circuits, in particular to a big data operation chip.

Background art

An ASIC (Application Specific Integrated Circuit) is an integrated circuit designed and manufactured to meet the requirements of a specific user and the needs of a specific electronic system. Compared with general-purpose integrated circuits, ASICs have the advantages of smaller size, lower power consumption, improved reliability, improved performance, enhanced confidentiality, and lower cost.

With the development of science and technology, more and more fields, such as artificial intelligence and security computing, involve specific calculations with a large amount of computation. For such specific operations, ASIC chips offer fast operation and low power consumption. At the same time, in these computation-heavy fields, it is usually necessary to control N operation chips to work simultaneously in order to improve data processing speed and capacity. As data precision continues to improve, more and more data needs to be processed in the fields of artificial intelligence and security computing. For example, the size of a photo is generally 3-7 MB, but as the precision of digital cameras and video cameras increases, a photo can reach 10 MB or more, and 30 minutes of video may exceed 1 GB of data. However, artificial intelligence and security computing require fast calculation and low delay, so improving calculation speed and response time has always been a goal of chip design. As the memory of an ASIC chip is generally 64 MB or 128 MB, when the data to be processed exceeds 512 MB, the ASIC chip has to access the data through the memory multiple times, and the data is moved into or out of the memory from external storage space many times, which reduces the processing speed. Moreover, in order to store this growing volume of data, an ASIC chip is generally configured with multiple storage units, for example four 2 GB memories per ASIC chip; when N operation chips work at the same time, 4N 2 GB memories are needed. However, when multiple operation chips work at the same time, the amount of data stored will not exceed 2 GB, which wastes storage units and increases system cost.

In designs that process a large amount of related data, the existing technology faces two difficulties: 1. Performance must be greatly improved. 2. In a distributed system, the problem of data relevance must also be solved, that is, the data processed in one subsystem needs to be presented to all other subsystems for confirmation and reprocessing. Generally, there are two ways to reduce the time spent on data processing: one is to speed up the clock of the data processing logic; the other is to increase the number of concurrent data processing blocks.

Under process limitations, the achievable increase in clock rate is very limited, so increasing concurrency is the more effective way to improve performance. However, increasing concurrency generally increases the data bandwidth requirement accordingly. In a general system, if the data bandwidth depends on the bandwidth provided by DDR, DDR bandwidth does not scale linearly. Assume that the initial system contains one group of DDR, providing a bandwidth of 1x. To obtain 2x bandwidth, two groups of DDR can be implemented, but to obtain a bandwidth increase of 16x or more, it is impossible to simply implement 16 groups of DDR in one system because of physical size limitations.

If multiple ASIC chips need to work together, the data cannot simply be distributed among multiple unconnected systems for processing, because the data are related: each piece of data completed in one processing unit must be confirmed and reprocessed in the other processing units. Therefore, in addition to increasing the data transmission rate between multiple ASIC chips, the problem of interconnecting multiple systems must also be solved.

Summary of the invention

The purpose of the embodiments of the present invention is to provide a way to use high-speed interfaces to connect to distributed storage, so that multiple homogeneous systems can process a large amount of related data concurrently. The embodiments of the present invention provide a big data operation acceleration system. In this system, the external memory of the chip is eliminated and the storage unit is provided inside the ASIC chip, which reduces the time the ASIC chip spends reading data from outside and increases the chip's operation speed. Multiple ASIC chips share storage units, which not only reduces the number of storage units, but also reduces the connection lines between ASIC operation chips, simplifies the system structure, and reduces the cost of ASIC chips. At the same time, serdes interface technology is used for data transmission between the multiple operation chips, which improves the data transmission rate between the ASIC chips.

To achieve the above objective, the embodiments of the present invention provide the following technical solutions:

According to a first aspect of the embodiments of the present invention, there is provided a big data operation chip. The operation chip includes N cores, N data channels (lanes), and at least one storage unit, where N is an integer greater than or equal to 4. Each data channel (lane) includes a sending interface (tx) and a receiving interface (rx); the cores and the data channels are in one-to-one correspondence, and each core sends and receives data through its data channel; the operation chip performs data transmission with the outside of the chip through the sending interfaces (tx) and the receiving interfaces (rx); each of the N cores is connected to the at least one storage unit.
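The structure described in the first aspect can be pictured with a minimal Python sketch. The class names `Lane`, `Core`, and `Chip`, the use of lists for the tx and rx interfaces, and the dictionary standing in for the storage unit are all assumptions made for illustration; the patent does not prescribe any software model.

```python
from dataclasses import dataclass, field


@dataclass
class Lane:
    """One data channel: a sending interface (tx) and a receiving interface (rx)."""
    tx: list = field(default_factory=list)  # frames leaving the chip
    rx: list = field(default_factory=list)  # frames arriving from outside


@dataclass
class Core:
    core_id: int
    lane: Lane  # one-to-one correspondence between a core and its lane


class Chip:
    def __init__(self, n_cores: int = 4):
        if n_cores < 4:
            raise ValueError("the first aspect requires N >= 4")
        # A dict stands in for the on-chip storage unit (an assumption).
        # Every core of the chip shares this one storage unit.
        self.storage: dict = {}
        self.cores = [Core(i, Lane()) for i in range(n_cores)]

    def send(self, core_id: int, frame) -> None:
        """A core sends a frame off-chip through its own lane's tx interface."""
        self.cores[core_id].lane.tx.append(frame)


chip = Chip(4)
chip.send(0, "hello")
```

In this sketch each core owns exactly one lane, while the storage unit is a single object shared by all cores, mirroring the one-to-one and shared relationships stated above.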

According to a second aspect of the embodiments of the present invention, a big data operation chip is provided. The operation chip includes N cores and at least one built-in storage unit, characterized in that the storage unit includes a plurality of memories and at least one storage control unit, and each memory includes at least two storage subunits and a storage control subunit; each storage control subunit is connected to each of the at least one storage control unit through an interface; the storage control subunit is used to control data reading or storage of the at least two storage subunits; the at least one storage control unit is used to control data reading or storage of the plurality of memories.

In the embodiments of the present invention, multiple cores are provided in the big data operation chip, each core performs operation and storage control functions, and at least one storage unit is connected to each core inside the chip. Each core can thus read data from the storage unit connected to itself and from the storage units connected to the cores of other operation chips, which gives each core a large-capacity memory, reduces the number of times data is moved into or out of the memory from external storage space, and speeds up data processing. At the same time, since multiple cores can operate independently or cooperatively, data processing is further accelerated. Multiple ASIC chips share storage units, which not only reduces the number of storage units, but also reduces the connection lines between ASIC operation chips, simplifies the system structure, and reduces the cost of ASIC chips.

Brief description of the drawings

In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only exemplary embodiments. A person of ordinary skill in the art may obtain other drawings based on these drawings without creative effort.

FIG. 1 is a schematic diagram illustrating the structure of a big data operation acceleration system with M ASIC chips according to the first embodiment;

FIG. 2 illustrates a schematic diagram of an arithmetic chip with 4 cores;

FIG. 3 illustrates a schematic diagram of the structure of the data channel (lane);

FIG. 4 illustrates a schematic structural diagram of a first embodiment of a storage unit;

FIG. 5 illustrates a schematic structural diagram of a second embodiment of a storage unit;

FIG. 6 is a schematic diagram illustrating the data transmission process of the big data operation acceleration system;

FIG. 7 illustrates a schematic diagram of a signal flow of an arithmetic chip with 4 cores in the first embodiment;

FIG. 8 illustrates a schematic diagram of a data structure according to the present invention.

Detailed description

Exemplary embodiments of the present invention are described in detail below with reference to the drawings. It should be understood that these embodiments are given only to enable those skilled in the art to better understand and implement the present invention, and do not limit the scope of the present invention in any way. On the contrary, these embodiments are provided to make the present disclosure more thorough and complete, and to fully convey the scope of the present disclosure to those skilled in the art.

In addition, it should be noted that the up, down, left, and right directions in each drawing are only examples based on the specific embodiments. Those skilled in the art can, according to actual needs, change the direction of some or all of the components shown in the drawings without affecting the function of each component or of the system as a whole. Such a technical solution with changed direction still falls within the protection scope of the present invention.

Multi-core chips are multiprocessing systems embodied on a single large-scale integrated semiconductor chip. Typically, two or more chip cores are embodied on a multi-core chip and interconnected by a bus (which can also be formed on the same multi-core chip). The number of chip cores on one multi-core chip can range from two to many, with the upper limit set only by manufacturing capabilities and performance constraints. Multi-core chips can be applied to the special arithmetic and/or logical operations performed in multimedia and signal processing algorithms (such as video encoding/decoding, 2D/3D graphics, audio and voice processing, image processing, telephony, voice recognition and voice synthesis, and encryption processing).

Although only ASICs are mentioned in the background art, the specific wiring implementation in the embodiments can also be applied to multi-core CPUs, GPUs, FPGAs, and the like. In this embodiment, the multiple cores may be identical cores or different cores.

FIG. 1 is a schematic diagram illustrating the structure of a big data operation acceleration system with M ASIC chips according to the first embodiment. As shown in FIG. 1, the big data operation acceleration system includes M ASIC arithmetic chips, where M is a positive integer greater than or equal to 2, for example, 6, 10, or 12. Each arithmetic chip includes multiple cores (core0, core1, core2, core3) and 4 data channels (lane0, lane1, lane2, lane3); each data channel (lane) includes a sending interface (tx) and a receiving interface (rx), and the cores and the data channels are in one-to-one correspondence. For example, core0 of the arithmetic chip 10 has a data channel (lane0), and the data channel (lane0) has a sending interface (lane0 tx) and a receiving interface (lane0 rx). The sending interface (lane0 tx) is used by core0 to send data or control instructions to the outside of the arithmetic chip 10, and the receiving interface (lane0 rx) is used to deliver data or control instructions from outside the arithmetic chip 10 to core0. In this way, the M arithmetic chips are connected through the sending interfaces (tx) and the receiving interfaces (rx) to transmit data or control instructions, and the M arithmetic chips form a closed loop. A storage unit is provided in each arithmetic chip, and the four cores in the arithmetic chip are connected to the storage unit. The storage units of the M arithmetic chips are used for distributed storage of data: a core of an arithmetic chip can acquire data from the storage unit of its own arithmetic chip, and can also acquire data from the storage units of other arithmetic chips. Because the four cores of the arithmetic chip are all connected to the storage unit, the four cores can also exchange data with one another through the storage unit.
Those skilled in the art will understand that four cores are used here only as an example. The number of cores may be N, where N is a positive integer greater than or equal to 4, for example, 6, 10, or 12. In this embodiment, the multiple cores may be identical cores or different cores.

The sending interface (lane tx) and the receiving interface (lane rx) of the data channel (lane) are serdes interfaces, and the arithmetic chips communicate with one another through the serdes interfaces. Serdes is short for SERializer/DESerializer. It is a mainstream time division multiplexing (TDM), point-to-point (P2P) serial communication technology: multiple low-speed parallel signals are converted into a high-speed serial signal at the transmitting end, carried over the transmission medium (optical cable or copper wire), and converted back into low-speed parallel signals at the receiving end. This point-to-point serial communication technology makes full use of the channel capacity of the transmission medium, reduces the number of transmission channels and device pins required, and increases the signal transmission speed, thereby greatly reducing communication cost. Of course, other communication interfaces can be used instead of the serdes interface, for example SSI or UART. The chip transmits data and control commands through the serdes interface.
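The parallel-to-serial conversion described above can be illustrated with a small software sketch. It is only a conceptual model: the function names `serialize` and `deserialize`, the 8-bit word width, and the MSB-first bit order are assumptions, and the line coding and clock recovery that real serdes hardware performs are omitted.

```python
def serialize(words, width=8):
    """Flatten parallel words into one serial bit stream, MSB first."""
    bits = []
    for w in words:
        for i in reversed(range(width)):
            bits.append((w >> i) & 1)
    return bits


def deserialize(bits, width=8):
    """Regroup a serial bit stream into parallel words of `width` bits."""
    words = []
    for start in range(0, len(bits), width):
        value = 0
        for b in bits[start:start + width]:
            value = (value << 1) | b
        words.append(value)
    return words


payload = [0xA5, 0x3C]          # two parallel 8-bit words at the transmitter
stream = serialize(payload)     # one serial stream on the medium
restored = deserialize(stream)  # parallel words again at the receiver
```

The round trip returns the original words, which is the property the transmitting and receiving ends of a lane rely on.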

FIG. 2 illustrates a schematic structural diagram of a first embodiment of an arithmetic chip with 4 cores. Those skilled in the art will understand that 4 cores are used here only as an example. The number of cores of the arithmetic chip may be N, where N is a positive integer greater than or equal to 2, such as 6, 10, or 12. In this embodiment, the cores of the arithmetic chip may have the same function or different functions.

The 4-core arithmetic chip (1) includes 4 cores (core0, core1, core2, core3), 4 data channels (lane0, lane1, lane2, lane3), at least one storage unit, and a data exchange control unit. Specifically, the data exchange control unit is a UART control unit, and each data channel (lane) includes a sending interface (lane tx) and a receiving interface (lane rx).

Core0 of the arithmetic chip (1) is connected to the sending interface (lane0 tx) and the receiving interface (lane0 rx) of the data channel (lane0). The sending interface (lane0 tx) is used by core0 to send data or control instructions to an arithmetic chip connected to the arithmetic chip (1), and the receiving interface (lane0 rx) is used to deliver data or control instructions sent by an arithmetic chip connected to the arithmetic chip (1) to core0. Similarly, core1 of the arithmetic chip (1) is connected to the sending interface (lane1 tx) and the receiving interface (lane1 rx) of the data channel (lane1); core2 of the arithmetic chip (1) is connected to the sending interface (lane2 tx) and the receiving interface (lane2 rx) of the data channel (lane2); and core3 of the arithmetic chip (1) is connected to the sending interface (lane3 tx) and the receiving interface (lane3 rx) of the data channel (lane3). The sending interface (lane tx) and the receiving interface (lane rx) of each data channel (lane) are serdes interfaces.

A data exchange control unit is connected to the storage unit and the 4 cores (core0, core1, core2, core3) through a bus, which is not drawn in FIG. 2. The data exchange control unit can be implemented using any of several protocols, such as UART, SPI, PCIE, SERDES, or USB. In this embodiment, the data exchange control unit is a UART (Universal Asynchronous Receiver/Transmitter) control unit. A UART is an asynchronous transceiver that converts the data to be transmitted between serial and parallel form, and it is usually integrated into the connections of various communication interfaces. The UART protocol is used here only as an example; other protocols can also be used. The UART control unit accepts external data and sends it to a core (core0, core1, core2, core3) or to the storage unit according to the address of the external data. The UART control unit can also accept external control instructions and send them to a core (core0, core1, core2, core3) or to the storage unit. It can further be used by the arithmetic chip to send internal or external control instructions to other arithmetic chips, to accept control instructions from other arithmetic chips, and to feed back operation results or intermediate data to the outside. Internal data or internal control instructions are data or control instructions generated by the chip itself; external data or external control instructions are data or control instructions generated outside the chip, such as those sent by an external host or an external network.

The main functions of the cores (core0, core1, core2, core3) are to execute external or internal control instructions and to perform data calculation and data storage control. The cores (core0, core1, core2, core3) in the arithmetic chip are all connected to the storage unit; they read data from or write data to the storage unit of the arithmetic chip, which allows the cores to exchange data with one another, and they can also send control instructions to the storage unit of the arithmetic chip. According to instructions, the cores (core0, core1, core2, core3) write data to, read data from, or send control instructions to the storage units of other arithmetic chips through the serdes interface; they can likewise send data to, read data from, or send control instructions to the cores of other arithmetic chips through the serdes interface.

FIG. 3 illustrates a schematic structural diagram of a first embodiment of a data channel (lane). The data channel (lane) includes a receiving interface, a sending interface, a receiving address judgment unit, a sending address judgment unit, and a plurality of registers. One end of the receiving address judgment unit is connected to the receiving interface (rx), and the other end is connected to the core through a register; one end of the sending address judgment unit is connected to the sending interface (tx), and the other end is connected to the core through a register; the receiving address judgment unit and the sending address judgment unit are connected to each other through a register. The receiving interface receives a data frame or control instruction sent by the adjacent arithmetic chip connected to it and passes it to the receiving address judgment unit. The receiving address judgment unit sends the data frame or control instruction to the core and, at the same time, to the sending address judgment unit. The sending address judgment unit receives the data frame or control instruction and passes it to the sending interface (tx), which sends it to the adjacent arithmetic chip connected to the sending interface. When the core generates a data frame or control instruction, it sends it to the sending address judgment unit, which passes it to the sending interface; the sending interface then sends it to the receiving interface of the adjacent arithmetic chip.
The registers temporarily store data frames or control instructions.
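The frame flow through a lane can be sketched as follows. This is a simplified model under stated assumptions: the class name `Lane`, its method names, and the use of `deque` objects for the registers are hypothetical, and any address-based filtering inside the judgment units is not modelled, since the text above only describes the deliver-and-forward behaviour.

```python
from collections import deque


class Lane:
    def __init__(self):
        self.to_core = deque()  # register between the rx judgment unit and the core
        self.to_tx = deque()    # register feeding the sending interface (tx)

    def on_receive(self, frame):
        """Receiving address judgment unit: deliver to the core and pass on."""
        self.to_core.append(frame)
        self.to_tx.append(frame)

    def on_core_send(self, frame):
        """Frames generated by the core go to the sending address judgment unit."""
        self.to_tx.append(frame)

    def transmit(self):
        """Sending interface: emit the oldest pending frame, if any."""
        return self.to_tx.popleft() if self.to_tx else None


lane = Lane()
lane.on_receive("frame from neighbour")
lane.on_core_send("frame from core")
```

Calling `lane.transmit()` twice then yields the forwarded frame followed by the frame generated by the core, in the order they entered the register.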

FIG. 4 illustrates a schematic structural diagram of a first embodiment of a storage unit. Each arithmetic chip contains N cores, which need concurrent random access to data. If N reaches 64 or more, the memory bandwidth required by the arithmetic chip becomes so high that even GDDR can hardly provide it. Therefore, in the embodiment of the present invention, high bandwidth is provided by an SRAM array combined with large MUX routing. As shown in FIG. 4, the system uses two levels of storage control to alleviate congestion during implementation. The storage unit (40) includes 8 memories (410 ... 417), which are connected to a storage control unit (420); the storage control unit is used to control data reading or storage of the plurality of memories. Each memory (410 ... 417) includes at least two storage subunits and a storage control subunit; the storage control subunit is connected to the storage control unit through an interface and is used to control data reading or storage of the at least two storage subunits. The storage subunit is an SRAM memory.
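One way to picture the two-level control is as a two-step address decode, sketched below. The SRAM depth, the choice of exactly two subunits per memory, the contiguous address layout, and the class names are all assumptions for illustration; the patent does not specify how addresses map onto memories and subunits.

```python
NUM_MEMORIES = 8          # from the embodiment in FIG. 4
SUBUNITS_PER_MEMORY = 2   # "at least two"; 2 is chosen for illustration
WORDS_PER_SUBUNIT = 1024  # hypothetical SRAM depth


class Memory:
    """A memory: a storage control subunit in front of several SRAM subunits."""

    def __init__(self):
        self.subunits = [[0] * WORDS_PER_SUBUNIT for _ in range(SUBUNITS_PER_MEMORY)]

    def access(self, local_addr, value=None):
        # Second level: the storage control subunit picks the SRAM subunit.
        sub, offset = divmod(local_addr, WORDS_PER_SUBUNIT)
        if value is not None:
            self.subunits[sub][offset] = value
        return self.subunits[sub][offset]


class StorageControlUnit:
    """First level: selects which memory an address falls into."""

    def __init__(self):
        self.memories = [Memory() for _ in range(NUM_MEMORIES)]
        self.words_per_memory = SUBUNITS_PER_MEMORY * WORDS_PER_SUBUNIT

    def write(self, addr, value):
        mem, local = divmod(addr, self.words_per_memory)
        self.memories[mem].access(local, value)

    def read(self, addr):
        mem, local = divmod(addr, self.words_per_memory)
        return self.memories[mem].access(local)


scu = StorageControlUnit()
scu.write(5000, 42)
scu.write(3100, 7)
```

Because each level only decodes part of the address, each multiplexer stays small, which is one plausible reading of how a two-level arrangement eases congestion.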

FIG. 5 illustrates a schematic structural diagram of a second embodiment of a storage unit. In FIG. 5, multiple storage control units (420, 421, 422, 423) may be provided in the storage unit; each core is connected to each of the multiple storage control units (420, 421, 422, 423), and each storage control unit is connected to each memory (410 ... 417). The structure of the memory is the same as in FIG. 4 and is not described again here.

To store data, the core sends the generated data to at least one storage control unit, the at least one storage control unit sends the data to the storage control subunit, and the storage control subunit stores the data in the storage subunit. To serve a remote read, the core of an arithmetic chip receives a data acquisition command sent by another arithmetic chip and judges from the data address whether the data is stored in the storage unit of its own arithmetic chip. If it is, the core sends a data read command to the at least one storage control unit, which passes the command to the corresponding storage control subunit, and the storage control subunit obtains the data from the storage subunit. The data then returns along the same path: the storage control subunit sends it to the at least one storage control unit, which sends it to the core; the core sends it to the sending address judgment unit, which passes it to the sending interface (tx); and the sending interface sends it to the adjacent arithmetic chip.

When the big data operation acceleration system is applied to the field of artificial intelligence, the UART control unit of the arithmetic chip stores the image data or video data sent by the external host in the storage unit through the core. The arithmetic chip generates the mathematical model of the neural network; the mathematical model can also be stored in the storage unit by the external host through the UART control unit and read by each arithmetic chip. The first layer of the mathematical model of the neural network runs on the arithmetic chip (1): its core reads data from the storage unit of its own arithmetic chip and/or the storage units of other arithmetic chips, performs the operation, and stores the result through the serdes interface in at least one storage unit of the other arithmetic chips, or in the storage unit of its own arithmetic chip. The arithmetic chip (1) then sends a control instruction to the next arithmetic chip (2) through the UART control unit or the serdes interface to start the operation on the next arithmetic chip (2). The second layer of the mathematical model of the neural network runs on the next arithmetic chip (2): its core reads data from the storage unit of its own arithmetic chip and/or the storage units of other arithmetic chips, performs the operation, and stores the result through the serdes interface in at least one storage unit of the other arithmetic chips, or in the storage unit of its own arithmetic chip. Each chip executes one layer of the neural network, obtaining data for its calculation from its own storage unit or, through the serdes interface, from the storage units of other arithmetic chips, until the last layer of the neural network produces the operation result.
The arithmetic chip obtains the operation result from the local storage unit or from the storage unit of another arithmetic chip and feeds it back to the external host through the UART control unit.
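The layer-per-chip flow can be sketched in a few lines of Python. This is a toy model under stated assumptions: the function `run_pipeline`, the key names, the dictionary standing in for the distributed storage units, and the three example layers are all hypothetical, and the chips run one after another here rather than being triggered by control instructions.

```python
def run_pipeline(layers, input_data):
    """Chip k applies layer k; results pass between chips via shared storage."""
    storage = {"layer_input_0": input_data}  # stands in for distributed storage
    for chip_id, layer in enumerate(layers):
        data = storage[f"layer_input_{chip_id}"]             # read the input
        storage[f"layer_input_{chip_id + 1}"] = layer(data)  # store the result
    return storage[f"layer_input_{len(layers)}"]


layers = [
    lambda xs: [2 * x for x in xs],      # hypothetical layer on chip 0
    lambda xs: [x + 1 for x in xs],      # hypothetical layer on chip 1
    lambda xs: [max(x, 0) for x in xs],  # hypothetical layer on chip 2
]
result = run_pipeline(layers, [-3, 0, 4])
```

Each chip only needs to know where its input lives and where to put its output, which is why shared, uniformly addressed storage suits this kind of layered workload.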

When the big data operation acceleration system is applied to the field of encrypted digital currency, the UART control unit of the arithmetic chip (1) stores the block information sent by the external host in at least one of the storage units of the multiple arithmetic chips. The external host sends control instructions through the UART control units of the arithmetic chips (1 ... M) to the M arithmetic chips to perform data operations, and the M arithmetic chips start operating. Alternatively, the external host can send a control instruction to the UART control unit (130) of one arithmetic chip (1), and the arithmetic chip (1) sends control instructions to the other M-1 arithmetic chips in turn, so that the M arithmetic chips start operating. The external host can also send a control instruction to the UART control unit of one arithmetic chip (1) so that the instruction is relayed along the chain: the first arithmetic chip (1) sends a control instruction to the second arithmetic chip (2), the second arithmetic chip (2) sends a control instruction to the third arithmetic chip (3), the third arithmetic chip (3) sends a control instruction to the fourth arithmetic chip (4), and so on, until the M arithmetic chips start operating. The M arithmetic chips obtain data for their calculations from their own storage units or, through the serdes interface, from the storage units of other arithmetic chips. The M arithmetic chips perform the proof-of-work calculation simultaneously. The arithmetic chip (1) obtains the calculation result from the storage unit and feeds it back to the external host through the UART control unit.

FIG. 6 illustrates a schematic diagram of a first embodiment of the data transmission process of the big data operation acceleration system. Each arithmetic chip completes 1/n of the work. Because the data are related, after each arithmetic chip finishes the data it is responsible for, its result must be transmitted to all other arithmetic chips. In the example, arithmetic chip n-1 is the source of a data frame and sends it to arithmetic chip 0 through lane1 tx. In arithmetic chip 0 the data frame is propagated along two paths: one copy is sent to the core of arithmetic chip 0, and the other is forwarded to the lane1 tx channel of arithmetic chip 0, so that the data frame is sent on to arithmetic chip 1.

Source ID mechanism: each data frame carries the ID of the arithmetic chip that is the source of the data frame. Whenever the data frame reaches a new arithmetic chip, that arithmetic chip checks the arithmetic chip ID in the data frame. When the ID in the data frame is found to equal the ID of the next arithmetic chip connected to this arithmetic chip, the data frame is not forwarded again: the life cycle of the data frame ends here, and it no longer occupies bandwidth. The check of the arithmetic chip ID in the data frame may be performed in the core or in the receiving address judgment unit.
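The source ID rule can be checked with a small simulation of the closed loop. The function name `broadcast` and the numbering of chips as 0 to n-1 in ring order are assumptions for illustration, and the sketch assumes at least two chips.

```python
def broadcast(source_id, num_chips):
    """Return the chip IDs that receive a frame sent by `source_id` on the ring."""
    delivered = []
    current = (source_id + 1) % num_chips  # first chip after the source
    while True:
        delivered.append(current)          # one copy goes to this chip's core
        next_chip = (current + 1) % num_chips
        if next_chip == source_id:         # source ID check: stop forwarding
            break
        current = next_chip
    return delivered


# Chip n-1 = 3 is the source in a ring of n = 4 chips.
receivers = broadcast(3, 4)
```

With four chips and chip 3 as the source, the frame reaches chips 0, 1, and 2 exactly once and is dropped before it would return to chip 3, so it consumes no further bandwidth.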

7 is a schematic diagram illustrating the signal flow of an arithmetic chip with four cores in the first embodiment. The UART control unit (130) is used to acquire external data or control instructions of the chip, and transmit the external data or control instructions to the core (110) connected to the UART control unit. The core core (110) transfers external data to the storage unit (120) of the chip for storage according to the data address, or the core core (110) sends data to the other chip core core corresponding to the data address through the signal channel according to the data address, Other chip cores store data in local storage units. The core core (110) is executed by the core core of the arithmetic chip according to the address of the external control instruction or sent to another chip core core corresponding to the address of the control instruction through the signal channel lane for execution. If the core of the arithmetic chip needs to acquire data, the core can acquire data from a local storage unit or data from storage units of other arithmetic chips. When acquiring data from the storage unit of another arithmetic chip, the core core (110) broadcasts the acquisition data control instruction to the connected arithmetic chip through the serdes interface (150) connected to it; the connected arithmetic chip will acquire the data The control instructions are divided into two ways, one way is sent to the core core, and the other way is forwarded to the next chip. If the connected arithmetic chip determines that the data is stored in the local storage unit, the core core reads the data from the storage unit and sends it to the arithmetic unit that issues the data acquisition instruction through the serdes interface. Of course, the control commands between the arithmetic chips can also be sent through the UART control unit. 
When the core feeds back an operation result or intermediate data to the outside according to an external or internal control instruction, the core obtains the operation result or intermediate data from the storage unit of its own operation chip, or from the storage unit of another operation chip through the serdes interface, and sends the operation result or intermediate data to the outside through the UART control unit. The outside mentioned here may refer to an external host, an external network, an external platform, or the like. The external host can initialize and configure the storage unit parameters through the UART control unit and address multiple storage units uniformly.
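Unified addressing of multiple storage units, together with the choice between local access and access over the serdes interface, can be sketched as follows. The original text does not specify the address mapping; this sketch assumes equal-sized storage units mapped to contiguous address ranges, one per chip. `UNIT_SIZE`, `locate`, and `route_read` are hypothetical names.

```python
from typing import Tuple

UNIT_SIZE = 0x1000  # bytes per storage unit; an assumed value for illustration


def locate(global_addr: int, num_chips: int,
           unit_size: int = UNIT_SIZE) -> Tuple[int, int]:
    """Map a unified address to (chip_index, local_offset)."""
    if not 0 <= global_addr < num_chips * unit_size:
        raise ValueError("address outside the unified address space")
    return divmod(global_addr, unit_size)


def route_read(global_addr: int, local_chip: int, num_chips: int,
               unit_size: int = UNIT_SIZE) -> Tuple[str, int, int]:
    """Decide whether a read is served locally or sent to another chip."""
    chip, offset = locate(global_addr, num_chips, unit_size)
    if chip == local_chip:
        return ("local", chip, offset)   # read from the local storage unit
    return ("serdes", chip, offset)      # request goes out over the serdes interface


if __name__ == "__main__":
    print(route_read(0x0010, local_chip=0, num_chips=4))
    print(route_read(0x2000, local_chip=0, num_chips=4))
```

Under this assumed mapping, a core only needs the unified address to decide whether the data is in its own storage unit or must be requested from another operation chip.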

The core performs calculations based on the acquired data and stores the calculation results in the storage unit. Each storage unit is provided with a dedicated storage area and a shared storage area. The dedicated storage area stores the temporary calculation results of one operation chip, that is, intermediate results that this operation chip continues to use and that other operation chips do not use. The shared storage area stores the calculation results of the operation chip that are used by other operation chips or that need to be fed back to the outside.
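The split between the dedicated and shared storage areas can be sketched as below. The class `StorageUnit` and its methods are hypothetical; in particular, refusing reads of the dedicated area by other chips is an assumption of this sketch, since the original text only states which results each area holds.

```python
class StorageUnit:
    """Storage unit with a dedicated area and a shared area (illustrative)."""

    def __init__(self, owner_chip: int):
        self.owner_chip = owner_chip
        self.dedicated = {}  # intermediate results used only by the owner chip
        self.shared = {}     # results used by other chips or fed back outside

    def store(self, key: str, value, *, shared: bool) -> None:
        # The caller decides which area a result belongs to.
        area = self.shared if shared else self.dedicated
        area[key] = value

    def read(self, key: str, requester_chip: int):
        if key in self.shared:
            return self.shared[key]
        # Assumption: only the owning chip may read its dedicated area.
        if requester_chip == self.owner_chip and key in self.dedicated:
            return self.dedicated[key]
        raise KeyError(key)


if __name__ == "__main__":
    unit = StorageUnit(owner_chip=1)
    unit.store("partial_sum", 7, shared=False)
    unit.store("final_result", 42, shared=True)
    print(unit.read("final_result", requester_chip=2))
```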

In the embodiment of the present invention, multiple cores are provided in the chip, each core performs calculation and storage control functions, and at least one storage unit is connected to each core inside the chip. The storage unit is connected to the storage units of other cores, so that each core has access to a large-capacity memory, which reduces the number of times data is moved into or out of the memory from the external storage space and speeds up data processing. At the same time, because the multiple cores can operate independently or cooperatively, data processing is accelerated further.

Figure 8 illustrates a schematic diagram of a data structure according to the present invention. The data mentioned here includes various kinds of data such as command data, numeric data, and character data. The data format specifically includes a valid bit (valid), a destination address (dst id), a source address (src id), and data (data). The core can determine from the valid bit whether the data packet is a command or a value; here it can be assumed that 0 represents a value and 1 represents a command. The core determines the destination address, source address, and data type according to the data structure. For example, in Figure 1, when the core 50 sends a data read command to the core 10, the valid bit is 1, the destination address is the address of the core 10, the source address is the address of the core 50, and the data field carries the read data command and the data type or data address. When the core 10 sends the data back to the core 50, the valid bit is 0, the destination address is the address of the core 50, the source address is the address of the core 10, and the data field carries the data that was read.

From the perspective of instruction timing, this embodiment adopts a traditional six-stage pipeline structure, whose stages are instruction fetch, decoding, execution, memory access, alignment, and write-back. From the perspective of the instruction set architecture, a reduced instruction set architecture can be adopted. Following the general design method of reduced instruction set architectures, the instruction set of the present invention can be divided by function into register-register instructions, register-immediate instructions, jump instructions, memory access instructions, control instructions, and inter-core communication instructions.
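The data format described above (valid bit, dst id, src id, data) can be sketched as a packed word. The field widths are not given in the original text; the 8-bit addresses and 32-bit data field below are assumptions chosen only for illustration, and `Frame`, `pack`, and `unpack` are hypothetical names.

```python
from typing import NamedTuple

# Assumed field widths; the original text does not specify them.
DST_BITS = 8
SRC_BITS = 8
DATA_BITS = 32


class Frame(NamedTuple):
    valid: int   # 1 = command, 0 = value (the convention assumed in the text)
    dst_id: int  # destination address
    src_id: int  # source address
    data: int    # command encoding, data address, or numeric value


def pack(frame: Frame) -> int:
    """Pack a frame into one integer: valid | dst_id | src_id | data."""
    if frame.valid not in (0, 1):
        raise ValueError("valid must be 0 or 1")
    if not 0 <= frame.dst_id < (1 << DST_BITS):
        raise ValueError("dst_id out of range")
    if not 0 <= frame.src_id < (1 << SRC_BITS):
        raise ValueError("src_id out of range")
    if not 0 <= frame.data < (1 << DATA_BITS):
        raise ValueError("data out of range")
    word = frame.valid
    word = (word << DST_BITS) | frame.dst_id
    word = (word << SRC_BITS) | frame.src_id
    word = (word << DATA_BITS) | frame.data
    return word


def unpack(word: int) -> Frame:
    """Inverse of pack: split the word back into its four fields."""
    data = word & ((1 << DATA_BITS) - 1)
    word >>= DATA_BITS
    src_id = word & ((1 << SRC_BITS) - 1)
    word >>= SRC_BITS
    dst_id = word & ((1 << DST_BITS) - 1)
    word >>= DST_BITS
    return Frame(valid=word & 1, dst_id=dst_id, src_id=src_id, data=data)


if __name__ == "__main__":
    # Read command from core 50 to core 10; the data value is a made-up address.
    request = Frame(valid=1, dst_id=10, src_id=50, data=0x1000)
    print(unpack(pack(request)))
```

The reply from the core 10 to the core 50 would use the same layout with the valid bit set to 0 and the destination and source addresses swapped.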

Using the description provided herein, the embodiments can be implemented as a machine, process, or article of manufacture by using standard programming and/or engineering techniques to produce programming software, firmware, hardware, or any combination thereof.

Any generated program(s) (with computer-readable program code) can be embodied on one or more computer-usable media, such as resident storage devices, smart cards or other removable storage devices, or transmission devices, thereby producing computer program products and articles of manufacture according to the embodiments. As such, the terms "article of manufacture" and "computer program product" as used herein are intended to cover computer programs that are permanently or temporarily present on any non-transitory medium that can be used by computers.

As noted above, memory/storage devices include but are not limited to magnetic disks, optical disks, removable storage devices (such as smart cards, subscriber identity modules (SIM), wireless identification modules (WIM)), semiconductor memories (such as random access memory (RAM), read only memory (ROM), programmable read only memory (PROM)), etc. Transmission media include, but are not limited to, transmission via wireless communication networks, the Internet, intranets, telephone/modem-based network communications, hard-wired/cable communications networks, satellite communications, and other fixed or mobile network systems/communication links.

Although specific example embodiments have been disclosed, those skilled in the art will understand that changes can be made to the specific example embodiments without departing from the spirit and scope of the invention.

The present invention has been described above based on the embodiments with reference to the drawings. However, the present invention is not limited to the above-mentioned embodiments, and schemes in which parts of each embodiment and each modified example are appropriately combined or replaced according to layout requirements are also included within the scope of the invention. In addition, based on the knowledge of those skilled in the art, the combination and processing order of the embodiments may be appropriately reorganized, or various design changes and other modifications may be applied to the embodiments. Embodiments to which such modifications are applied may also be included within the scope of the present invention.

Although the present invention has described various concepts in detail, those skilled in the art can understand that various modifications and substitutions to those concepts can be implemented within the spirit of the overall teaching disclosed in the present invention. A person skilled in the art can implement the invention set forth in the claims without undue experimentation by using ordinary techniques. It can be understood that the specific concepts disclosed are only illustrative and are not intended to limit the scope of the present invention, which is determined by the full scope of the appended claims and their equivalents.

Claims

A big data operation chip, characterized in that the operation chip includes N cores, N data channel lanes, and at least one storage unit, where N is an integer greater than or equal to 4;

the data channel lane includes a sending interface (tx) and a receiving interface (rx); the cores correspond one-to-one to the data channel lanes, and each core sends and receives data through its data channel lane; the operation chip performs data transmission with the outside of the chip through the sending interface (tx) and the receiving interface (rx);

each of the N cores is connected to the at least one storage unit.
The chip according to claim 1, wherein the sending interface (tx) and the receiving interface (rx) of the operation chip are serdes interfaces.
The chip according to claim 1 or 2, wherein the data channel lane further comprises a receiving address judgment unit and a sending address judgment unit; one end of the receiving address judgment unit is connected to the receiving interface (rx), and the other end of the receiving address judgment unit is connected to the core; one end of the sending address judgment unit is connected to the sending interface (tx), and the other end of the sending address judgment unit is connected to the core; the receiving address judgment unit and the sending address judgment unit are connected to each other.
The chip according to claim 3, characterized in that the receiving interface (rx) receives the data frame and sends the data frame to the receiving address judgment unit; the receiving address judgment unit sends the data frame to the core and also sends the data frame to the sending address judgment unit; the sending address judgment unit receives the data frame and sends the data frame to the sending interface (tx), and the sending interface (tx) sends the data frame out.
The chip according to claim 3, wherein the core generates a data frame and sends the data frame to the sending address judgment unit, the sending address judgment unit sends the data frame to the sending interface (tx), and the sending interface (tx) sends the data frame out.
The chip according to claim 3, wherein the receiving address judging unit and the sending address judging unit are connected to each other through a first-in first-out memory.
The chip according to claim 1 or 2, wherein the storage unit includes a plurality of memories, and the plurality of memories are connected to at least one storage control unit; the at least one storage control unit is used to control data reading or storage of the plurality of memories.
The chip according to claim 7, wherein the memory includes at least two storage subunits and a storage control subunit; the storage control subunit is connected to each of the at least one storage control unit through an interface, and the storage control subunit is used to control data reading or storage of the at least two storage subunits.
The chip according to claim 8, wherein the storage subunit is an SRAM memory.
The chip according to claim 1 or 2, wherein the operation chip and the external storage unit are in a non-connected state.
The chip according to claim 1 or 2, characterized in that the operation chip further comprises a first data interface (130) connected to an external host for receiving external data or control instructions.
The chip according to claim 11, wherein the first data interface is a UART control unit.
The chip according to claim 6, wherein the N cores are connected to each of the at least one storage control unit, and data is read from and written to the plurality of memories according to the operation commands of the N cores.
The chip according to claim 13, wherein the core sends the generated data to the at least one storage control unit, the at least one storage control unit sends the data to the storage control subunit, and the storage control subunit stores the data in the storage subunit.
The chip according to claim 1 or 2, wherein the operation chip is used to perform one or more of encryption operation and convolution calculation.
The chip according to claim 12, characterized in that the at least one first data interface (130) receives an external instruction to initialize and configure the storage unit of the operation chip and to perform unified addressing of the storage subunits in the storage unit of the operation chip.
The chip according to claim 12, characterized in that the operation chip can transmit the calculation result to the outside through the at least one first data interface (130).
The chip according to claim 1 or 2, wherein the core is used for data calculation and data storage control.
A big data operation chip, the operation chip including N cores and at least one built-in storage unit, where N is an integer greater than or equal to 4; characterized in that the storage unit includes a plurality of memories and at least one storage control unit, and the memory includes at least two storage subunits and a storage control subunit; each of the storage control subunits is connected to each of the at least one storage control unit through an interface; the storage control subunit is used to control data reading or storage of the at least two storage subunits; the at least one storage control unit is used to control data reading or storage of the plurality of memories.
The chip according to claim 19, wherein each of the N cores is connected to the at least one storage unit.
The chip according to claim 20, wherein the storage subunit is an SRAM memory.
The chip according to claim 20, wherein the N cores are connected to each of the at least one storage control unit, and data is read from and written to the plurality of memories according to the operation commands of the N cores.
The chip according to claim 20, wherein the core sends the generated data to the at least one storage control unit, the at least one storage control unit sends the data to the storage control subunit, and the storage control subunit stores the data in the storage subunit.