WO2020087246A1 - Big data operation acceleration system and data transmission method - Google Patents

Big data operation acceleration system and data transmission method Download PDF

Info

Publication number
WO2020087246A1
WO2020087246A1 (PCT/CN2018/112546)
Authority
WO
WIPO (PCT)
Prior art keywords
data
chip
core
arithmetic
interface
Prior art date
Application number
PCT/CN2018/112546
Other languages
French (fr)
Chinese (zh)
Inventor
秦强 (Qin Qiang)
Original Assignee
北京比特大陆科技有限公司 (Beijing Bitmain Technology Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京比特大陆科技有限公司 (Beijing Bitmain Technology Co., Ltd.)
Priority to PCT/CN2018/112546 priority Critical patent/WO2020087246A1/en
Priority to CN201880097576.0A priority patent/CN112740192B/en
Publication of WO2020087246A1 publication Critical patent/WO2020087246A1/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/16Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163Interprocessor communication
    • G06F15/17Interprocessor communication using an input/output type connection, e.g. channel, I/O port

Definitions

  • the invention relates to the field of integrated circuits, in particular to a big data operation acceleration system and a data transmission method.
  • ASIC: Application-Specific Integrated Circuit.
  • The defining characteristic of ASICs is that they are designed to meet the needs of specific users.
  • Compared with general-purpose integrated circuits, ASICs have the advantages of smaller size, lower power consumption, improved reliability, improved performance, enhanced confidentiality, and lower cost.
  • The size of a photo is generally 3-7 MB, but as the precision of digital cameras and video cameras increases, a photo can reach 10 MB or more, and 30 minutes of video may exceed 1 GB of data.
  • Fast calculation and low latency are required, so improving calculation speed and response time has always been a goal of chip design.
  • Because the memory of an ASIC chip is generally 64 MB or 128 MB, when the data to be processed exceeds 512 MB, the ASIC chip must access the data through memory multiple times, moving data into and out of memory from external storage space many times, which reduces processing speed.
  • With the continuous improvement of data accuracy, fields such as artificial intelligence and secure computing need to operate on ever larger data.
  • To store data, it is generally necessary to configure an ASIC chip with multiple storage units, for example four 2 GB memories per chip; when N arithmetic chips work at the same time, 4N 2 GB memories are needed. However, when multiple computing chips work at the same time, the data actually stored will not exceed 2 GB, which wastes storage units and increases system cost.
  • the improvement of the clock rate is very limited.
  • Increasing the number of concurrency is a more effective way to improve performance.
  • it generally increases the data bandwidth requirements accordingly.
  • The bandwidth increase of DDR is not linear. Assume the initial system contains one group of DDR, providing a bandwidth of 1x. If we need 2x bandwidth, we can implement two groups of DDR; but if we need a bandwidth increase of more than 16x, it is impossible to simply implement 16 groups of DDR in one system because of physical size limitations.
  • the purpose of the embodiments of the present invention is to provide a way to use high-speed interfaces to connect to distributed storage, so that multiple homogeneous systems can process a large amount of related data concurrently.
  • the embodiment of the present invention provides a big data operation acceleration system.
  • The external chip memory is eliminated, and the storage unit is provided inside the ASIC chip, which reduces the time the ASIC chip spends reading data from outside and speeds up chip operation.
  • Multiple ASIC chips share storage units, which not only reduces the number of storage units, but also reduces the connection lines between ASIC operation chips, simplifies the system structure, and reduces the cost of ASIC chips.
  • serdes interface technology is used for data transmission between multiple computing chips, which improves the data transmission rate between multiple ASIC chips.
  • a big data operation acceleration system including more than two operation chips, the operation chip including N cores, N data channels (lane), and at least one storage unit, wherein N is a positive integer greater than or equal to 4;
  • the data channel (lane) includes a transmit interface (tx) and a receive interface (rx),
  • the cores correspond to the data channels (lanes) one-to-one, and each core sends and receives data through its data channel (lane);
  • the two or more arithmetic chips are connected to transmit data through the sending interface (tx) and the receiving interface (rx), and the two or more arithmetic chips are connected in a ring.
  • a data transmission method for a big data operation acceleration system: the system includes more than two operation chips connected through sending interfaces (tx) and receiving interfaces (rx) to transmit data, the operation chips being connected into a ring; after the first operation chip at the data source generates data, the data is sent through its sending interface (tx) to the adjacent second operation chip; the adjacent second operation chip splits the data into two paths, the first path being sent to the core of the second operation chip, and the other path being forwarded through the sending interface (tx) to the third operation chip adjacent to the second operation chip.
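  • As a minimal illustrative sketch (all names and structure details invented, not from the patent), the claimed topology, M operation chips in a ring with N cores and one data channel (tx/rx pair) per core, might be wired like this:

```python
# Toy wiring of the claimed system: M chips in a ring, each with N cores
# and one lane (tx/rx) per core. Which lane closes the ring is an
# assumption for the sketch.
def build_ring(m_chips, n_cores=4):
    chips = [{"id": c,
              "cores": list(range(n_cores)),
              "lanes": [{"tx": None, "rx": None} for _ in range(n_cores)]}
             for c in range(m_chips)]
    for c in range(m_chips):  # tx of chip c feeds rx of chip c+1
        chips[c]["lanes"][0]["tx"] = (c + 1) % m_chips
        chips[(c + 1) % m_chips]["lanes"][0]["rx"] = c
    return chips

ring = build_ring(4)
assert ring[3]["lanes"][0]["tx"] == 0   # the last chip closes the ring
```

The modulo arithmetic is what makes the topology a closed loop rather than a chain, matching the "connected in a ring" limitation of the claim.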
  • multiple chips are provided in the big data operation acceleration system.
  • the multiple chips include multiple cores, each core performing operation and storage control functions, and at least one storage unit is connected to each core inside the chip, so that each core can read the data in the storage unit connected to itself as well as the storage units connected to the cores of the other computing chips; each core thus has a large effective memory capacity, reducing the number of times data is moved into or out of memory from external storage space and speeding up data processing; at the same time, because multiple cores can operate independently or cooperatively, data processing is further accelerated.
  • FIG. 1 is a schematic diagram illustrating the structure of a big data operation acceleration system with M ASIC chips according to the first embodiment
  • Figure 2 illustrates a schematic diagram of an arithmetic chip with 4 cores
  • Figure 3 illustrates a schematic diagram of the structure of the data channel lane
  • FIG. 4a illustrates a schematic structural view of a first embodiment of a storage unit
  • FIG. 4b illustrates a schematic structural diagram of a second embodiment of a storage unit
  • FIG. 5 is a schematic diagram illustrating the data transmission process of the big data operation acceleration system
  • FIG. 6 is a schematic diagram illustrating a signal flow of an arithmetic chip with 4 cores in the first embodiment
  • FIG. 7 illustrates a schematic diagram of a data structure according to the present invention.
  • Multi-core chips are multi-processing systems embodied on a single large-scale integrated semiconductor chip.
  • two or more chip cores can be embodied on a multi-core chip, interconnected by a bus (which can also be formed on the same multi-core chip).
  • Multi-core chips have applications in multimedia and signal-processing algorithms (such as video encoding/decoding, 2D/3D graphics, audio and voice processing, image processing, telephony, voice recognition and voice synthesis, and encryption processing) that involve special arithmetic and/or logical operations.
  • Although ASIC application-specific integrated circuits are mentioned in the background art, the specific wiring implementation in the embodiments can also be applied to multi-core CPUs, GPUs, FPGAs, and the like.
  • multiple cores may be the same core or different cores.
  • FIG. 1 is a schematic diagram illustrating the structure of a big data operation acceleration system with M ASIC chips according to the first embodiment.
  • the big data operation acceleration system includes M ASIC operation chips, where M is a positive integer greater than or equal to 2, for example, 6, 10, 12, and so on.
  • the arithmetic chip includes multiple cores (core0, core1, core2, core3) and 4 data channels (lane0, lane1, lane2, lane3); each data channel (lane) includes a transmission interface (tx) and a reception interface (rx),
  • the cores and the data channels (lanes) have a one-to-one correspondence; for example, core0 of computing chip 10 has a data channel (lane0), which has a sending interface (lane0tx) and a receiving interface (lane0rx); the sending interface (lane0tx) is used by core0 to send data or control instructions outside computing chip 10, and the receiving interface (lane0rx) is used to deliver data or control instructions from outside computing chip 10 to core0.
  • the M arithmetic chips are connected through the sending interface (tx) and the receiving interface (rx) to facilitate the transmission of data or control commands.
  • M arithmetic chips form a closed loop.
  • a storage unit is provided in each arithmetic chip, and the four cores in each arithmetic chip are connected to that storage unit; the storage units of the M arithmetic chips are used for distributed storage of data, and a core can acquire data from the storage unit of its own arithmetic chip as well as from the storage units of the other arithmetic chips.
  • the four cores of the arithmetic chip are all connected to the storage unit, and data exchange among the four cores of the arithmetic chip is also achieved through the storage unit.
  • the number of cores may be N, where N is a positive integer greater than or equal to 4, for example, 6, 10, 12, and so on.
  • multiple cores may be the same core or different cores.
  • the sending interface (tx) and receiving interface (rx) of the data channel (lane) are serdes interfaces, and the arithmetic chips communicate through the serdes interfaces.
  • Serdes is an abbreviation of SERializer/DESerializer. It is a mainstream time-division multiplexing (TDM), point-to-point (P2P) serial communication technology: multiple low-speed parallel signals at the transmitting end are converted into high-speed serial signals, passed through the transmission medium (optical cable or copper wire), and finally re-converted into low-speed parallel signals at the receiving end.
  • TDM time division multiplexing
  • P2P point-to-point
  • This point-to-point serial communication technology makes full use of the channel capacity of the transmission medium, reduces the number of transmission channels and device pins required, increases the signal transmission speed, and thus greatly reduces the communication cost.
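  • A toy model of the serialize/deserialize round trip a serdes pair performs (purely illustrative; a real serdes also involves line coding such as 8b/10b and clock recovery, which are omitted here):

```python
def serialize(words, width=8):
    """Flatten low-speed parallel words into one high-speed serial bit stream."""
    bits = []
    for w in words:
        bits.extend((w >> i) & 1 for i in range(width))  # LSB first
    return bits

def deserialize(bits, width=8):
    """Regroup the serial bit stream back into parallel words."""
    return [sum(b << i for i, b in enumerate(bits[k:k + width]))
            for k in range(0, len(bits), width)]

data = [0x12, 0xAB, 0xFF]
assert deserialize(serialize(data)) == data  # lossless round trip
```

The point of the model is only that many parallel pins collapse into one serial stream and are recovered at the far end, which is why the technique reduces the number of transmission channels and device pins.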
  • other communication interfaces can also be used instead of the serdes interface, for example SSI or UART.
  • the chip transmits data and control commands through the serdes interface.
  • FIG. 2 illustrates a first embodiment of a schematic structural diagram of an arithmetic chip with 4 cores.
  • 4 cores are selected here as an example, which is only an exemplary description.
  • the number of cores of the arithmetic chip may be N, where N is a positive integer greater than or equal to 2, such as 6, 10, 12, and so on.
  • the core of the arithmetic chip may be a core with the same function or a core with different functions.
  • the 4-core computing chip (1) includes 4 cores (core0, core1, core2, core3), 4 data channels (lane0, lane1, lane2, lane3), at least one storage unit, and a data exchange control unit; specifically, the data exchange control unit is a UART control unit, and each data channel (lane) includes a transmit interface (lane tx) and a receive interface (lane rx).
  • core0 of computing chip (1) is connected to the transmitting interface (lane0tx) and receiving interface (lane0rx) of its data channel; the sending interface (lane0tx) is used by core0 to send data or control instructions to the computing chip connected to computing chip (1), and the receiving interface (lane0rx) is used to deliver data or control instructions transmitted by the computing chip connected to computing chip (1) to core0.
  • the core1 of the operation chip 1 is connected to the transmission interface (lane1tx) and the reception interface (lane1rx) of the data channel; the core2 of the operation chip 1 is connected to the transmission interface (lane2tx) and the reception interface (lane2rx) of the data channel.
  • the core3 of the chip 1 is connected to the sending interface (lane3tx) and the receiving interface (lane3rx) of the data channel.
  • the sending interface (lane tx) and the receiving interface (lane rx) of the data channel (lane) are serdes interfaces.
  • a data exchange control unit is connected to the storage unit and 4 cores (core0, core1, core2, core3) through the bus.
  • the bus is not drawn in FIG. 2.
  • the data exchange control unit can be implemented using multiple protocols, such as UART, SPI, PCIE, SERDES, USB, etc.
  • the data exchange control unit is a UART (Universal Asynchronous Receiver / Transmitter) control unit.
  • UART: Universal Asynchronous Receiver/Transmitter (universal asynchronous transceiver)
  • UART is usually integrated into the connections of various communication interfaces. The UART protocol is used here only as an example; other protocols can also be used.
  • the UART control unit accepts external data and sends the external data to the core (core0, core1, core2, core3) or storage unit according to the external data address.
  • the UART control unit can also accept external control commands and send them to the cores (core0, core1, core2, core3) or the storage unit; it can also be used to send internal or external control commands from the computing chip to other computing chips, to accept control commands, and to feed operation results or intermediate data back to the outside.
  • internal data or internal control commands refer to data or control commands generated by the chip itself; external data or external control commands refer to data or control commands generated outside the chip, such as data or control instructions sent by an external host or an external network.
  • the main functions of the cores (core0, core1, core2, core3) are to execute external or internal control instructions and to perform data calculation and data storage control.
  • the cores (core0, core1, core2, core3) in the arithmetic chip are all connected to the storage unit; they read data from or write data to the storage unit of the arithmetic chip, exchange data between cores through it, and can also send control commands to the storage unit of the computing chip.
  • the cores (core0, core1, core2, core3) write data to, read data from, or send control commands to the storage units of other computing chips through the serdes interface according to the instructions; the cores (core0, core1, core2, core3) can also send data to, read data from, or send control commands to the cores of other computing chips through the serdes interface according to the instructions.
  • FIG. 3 illustrates a first embodiment of a schematic structural diagram of a data channel lane.
  • the data channel (lane) includes a receiving interface, a sending interface, a receiving address judgment unit, a sending address judgment unit, and a plurality of registers; one end of the receiving address judgment unit is connected to the receiving interface, and the other end is connected to the core through registers; one end of the sending address judgment unit is connected to the sending interface (tx), and the other end is connected to the core through a register; the receiving address judgment unit and the sending address judgment unit are connected to each other through a register.
  • the receiving interface receives the data frame or control instruction sent by the adjacent arithmetic chip connected to it and sends the data frame or control instruction to the receiving address judgment unit; the receiving address judgment unit sends the data frame or control instruction to the core, and at the same time sends it to the sending address judgment unit;
  • the sending address judgment unit receives the data frame or control instruction, and sends the data frame or control instruction to the sending interface (tx),
  • the sending interface sends the data frame or the control instruction to the adjacent arithmetic chip connected to the sending interface.
  • the core generates a data frame or control instruction and sends it to the sending address judgment unit; the sending address judgment unit sends it to the sending interface, and the sending interface sends the data frame or control instruction to the receiving interface of the adjacent arithmetic chip.
  • the purpose of the register is to temporarily store data frames or control instructions.
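  • The receive path just described (receiving interface, receiving address judgment unit delivering to the local core and, through a register, to the sending side) can be sketched as follows; the class and field names are invented for the sketch:

```python
class Lane:
    """Toy data channel: rx -> receiving judgment -> local core, plus a
    register toward the sending address judgment unit / tx interface."""
    def __init__(self, core_id):
        self.core_id = core_id
        self.core_inbox = []   # frames delivered to the local core
        self.tx_register = []  # frames buffered for forwarding via tx

    def on_receive(self, frame):
        # receiving address judgment unit: deliver a copy to the core ...
        self.core_inbox.append(frame)
        # ... and hand the same frame toward the sending side via a register
        self.tx_register.append(frame)

    def on_send(self):
        # sending interface (tx): emit the oldest buffered frame, if any
        return self.tx_register.pop(0) if self.tx_register else None

lane = Lane(core_id=0)
lane.on_receive({"src": 3, "data": 42})
assert lane.core_inbox == [{"src": 3, "data": 42}]
```

The register here is just a FIFO list, standing in for the temporary storage of data frames the text describes.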
  • FIG. 4a illustrates a first embodiment of a schematic structural diagram of a memory cell.
  • Each computing chip contains N cores that need concurrent random access to data. If N reaches the order of 64 or above, the memory bandwidth of the computing chip needs to reach a very high order; even GDDR can hardly achieve such high bandwidth. Therefore, in the embodiment of the present invention, high bandwidth is provided by using an SRAM array and a large MUX route.
  • the system uses a two-level storage control structure to alleviate bottlenecks during implementation.
  • the storage unit (40) includes 8 memories (410 ... 417); the 8 memories (410 ... 417) are connected to a storage control unit, and the storage control unit is used to control reading data from or storing data in the plurality of memories.
  • the memory (410 ... 417) includes at least two storage subunits and a storage control subunit; the storage control subunit is connected to the storage control unit through an interface, and the storage control subunit is used to control reading data from or storing data in the at least two storage subunits.
  • the storage subunit is an SRAM memory.
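  • A hypothetical two-level address decode matching FIG. 4a (the field widths are assumptions, not from the patent): the storage control unit selects one of the 8 memories, and each memory's storage control subunit selects an SRAM storage subunit:

```python
def decode(addr, n_mems=8, n_subunits=2, subunit_words=1024):
    """Two-level decode: address -> (memory, SRAM subunit, word offset)."""
    word = addr % subunit_words
    sub = (addr // subunit_words) % n_subunits            # level 2: subunit
    mem = (addr // (subunit_words * n_subunits)) % n_mems  # level 1: memory
    return mem, sub, word

assert decode(0) == (0, 0, 0)
assert decode(3 * 2048 + 1024 + 7) == (3, 1, 7)  # memory 3, subunit 1, word 7
```

Splitting the decode this way is what lets the storage control unit fan out across 8 memories while each storage control subunit only arbitrates its own small SRAMs.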
  • FIG. 4b illustrates a second embodiment of a schematic structural diagram of a memory cell.
  • Multiple storage control units (420, 421, 422, 423) can be provided in the storage unit in FIG. 4b; each core is connected to each of the multiple storage control units (420, 421, 422, 423), and each storage control unit is connected to each memory (410 ... 417).
  • the structure of the memory is exactly the same as in FIG. 4a, and will not be described here again.
  • the core core sends the generated data to at least one storage control unit, and the at least one storage control unit sends the data to the storage control subunit, and the storage control subunit stores the data in the storage subunit.
  • when a core of the arithmetic chip receives a data acquisition command sent by another arithmetic chip, the core judges according to the data address whether the data is stored in the storage unit of its own arithmetic chip; if it is, the core sends a data read command to at least one storage control unit, the storage control unit sends the read command to the corresponding storage control subunit, and the storage control subunit obtains the data from the storage subunit and returns it to the storage control unit; the storage control unit sends the acquired data to the core, the core sends it to the sending address judgment unit, the sending address judgment unit sends it to the sending interface (tx), and the sending interface sends the acquired data to the adjacent arithmetic chip.
  • the big data operation acceleration system is applied to the field of artificial intelligence.
  • the UART control unit of the operation chip stores the image data or video data sent by the external host to the storage unit through the core core.
  • the operation chip generates a mathematical model of the neural network; the mathematical model can also be stored in the storage unit by the external host through the UART control unit and read by each arithmetic chip. The first-layer mathematical model of the neural network runs on the arithmetic chip.
  • the core of the arithmetic chip reads data from the storage unit of the arithmetic chip and/or the storage units of other arithmetic chips for operation, and stores the operation result through the serdes interface to the storage unit of another arithmetic chip or of this arithmetic chip.
  • the arithmetic chip (1) sends a control instruction to the next arithmetic chip (2) through the UART control unit or serdes interface, and starts the next arithmetic chip (2) performing arithmetic. The second-layer mathematical model of the neural network runs on the next arithmetic chip (2).
  • the core of the next arithmetic chip reads data from the storage unit of its arithmetic chip and/or the storage units of other arithmetic chips for operation, and the operation result is stored through the serdes interface in at least one storage unit, either a storage unit of another arithmetic chip or a storage unit of this arithmetic chip.
  • Each chip executes one layer of the neural network, obtaining data through the serdes interface from the storage units of other operation chips or of its own operation chip to perform calculations, until the last layer of the neural network produces the final result.
  • the operation chip obtains the operation result from the local storage unit or the storage units of other operation chips, and feeds it back to the external host through the UART control unit.
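  • The layer-per-chip flow of this application example can be sketched as a toy model (all names invented; each chip applies one layer, reading its input from the distributed storage and writing its output there for the next chip):

```python
# Toy model of the AI example: chip i runs layer i, reads its input from
# the shared distributed store, writes its result back, and the last
# chip's result is fed back to the host.
def run_pipeline(layers, x, store):
    store[0] = x                        # host loads input via the UART path
    for i, layer in enumerate(layers):
        store[i + 1] = layer(store[i])  # chip i computes and stores output
    return store[len(layers)]           # final result, returned to the host

store = {}
layers = [lambda v: [2 * e for e in v],   # layer 1 on chip 1 (invented op)
          lambda v: [e + 1 for e in v]]   # layer 2 on chip 2 (invented op)
assert run_pipeline(layers, [1, 2, 3], store) == [3, 5, 7]
```

The dict `store` stands in for the distributed storage units: every chip can read what the previous chip wrote, which is exactly what the serdes-connected shared storage enables.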
  • the big data operation acceleration system is applied to the field of encrypted digital currency.
  • the UART control unit of the operation chip (1) stores the block information sent by the external host to at least one storage unit among the plurality of storage units of the plurality of operation chips.
  • the external host sends control instructions to the M arithmetic chips through the arithmetic chip (1 ... M) UART control unit to perform data calculation, and the M arithmetic chips start the arithmetic operation.
  • an external host can also send control instructions to the UART control unit (130) of one arithmetic chip (1) to perform data operations, and the arithmetic chip (1) sequentially sends control instructions to the other M-1 arithmetic chips to perform data operations; the M arithmetic chips then start the arithmetic operation.
  • the external host can also send control instructions to a computing chip (1) UART control unit to perform data operations.
  • the first computing chip (1) sends control instructions to the second computing chip (2) to perform data operations.
  • the second computing chip (2) sends a control instruction to the third arithmetic chip (3) for data calculation, the third arithmetic chip (3) sends a control instruction to the fourth arithmetic chip (4) for data calculation, and the M arithmetic chips start arithmetic operations.
  • the M arithmetic chips acquire data from the storage unit of other arithmetic chips or the storage unit of the arithmetic chip through the serdes interface to perform calculations.
  • the M arithmetic chips simultaneously perform proof-of-work calculation operations.
  • the arithmetic chip (1) obtains the calculation result from the storage unit and feeds it back to the external host through the UART control unit.
  • FIG. 5 illustrates a first embodiment of a schematic diagram of a data transmission process of a big data operation acceleration system.
  • Each arithmetic chip completes 1/n of the work, and after each arithmetic chip finishes its share of the data, because of data correlation, the result of its calculation must be transmitted to all other chips.
  • Operation chip n-1 is the source operation chip of the data frame, and the data is sent to operation chip 0 through lane1tx; in operation chip 0, the data frame is split into two propagation paths: the first path is sent to the core of operation chip 0, and the other path is forwarded through lane1tx of operation chip 0, so the data frame is then sent to operation chip 1.
  • Each data frame carries the ID of the operation chip that was its source. Whenever the data frame arrives at a new operation chip, that chip checks the operation chip ID in the data frame. When it finds that this ID equals the ID of the next operation chip connected to it, the data frame is not forwarded again; the life cycle of the data frame ends there, and it no longer takes up bandwidth.
  • the arithmetic chip detects the source chip ID in the data frame; this detection may be performed in the core or in the receiving address judgment unit.
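  • The forwarding-termination rule can be sketched as follows (invented names; a frame circulates around the ring until the next chip would be the frame's source chip):

```python
def propagate(num_chips, src):
    """Forward a frame around the ring; stop when the next hop is the source."""
    received, forwards = [], []
    chip = (src + 1) % num_chips      # source's tx feeds its neighbor's rx
    while True:
        received.append(chip)          # one copy delivered to the local core
        nxt = (chip + 1) % num_chips
        if nxt == src:                 # next chip ID equals the frame's
            break                      # source ID: the frame's life ends here
        forwards.append((chip, nxt))   # otherwise forward the copy via tx
        chip = nxt
    return received, forwards

received, forwards = propagate(4, src=3)
assert received == [0, 1, 2]  # every other chip sees the frame exactly once
```

With n chips the frame makes exactly n-1 deliveries and n-2 forwards, so the broadcast never wraps past its source and never occupies bandwidth twice on the same link.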
  • FIG. 6 illustrates a signal flow diagram of an arithmetic chip with four cores in the first embodiment.
  • the UART control unit (130) is used to acquire external data or control instructions of the chip, and transmit the external data or control instructions to the core (110) connected to the UART control unit.
  • the core (110) transfers external data to the storage unit (120) of the chip for storage according to the data address, or the core (110) sends the data through the signal channel (lane) to the other chip's core corresponding to the data address, and that chip's core stores the data in its local storage unit.
  • an external control instruction is either executed by the core (110) of the arithmetic chip according to the instruction's address, or sent through the signal channel (lane) to the other chip's core corresponding to the address of the control instruction for execution.
  • when the core of the arithmetic chip needs to acquire data, it can acquire data from the local storage unit or from the storage units of other arithmetic chips.
  • the core (110) broadcasts the data-acquisition control instruction to the connected arithmetic chips through the serdes interface (150); each connected arithmetic chip splits the data-acquisition control instruction into two paths, one sent to its core and the other forwarded to the next chip. If a connected arithmetic chip determines that the data is stored in its local storage unit, its core reads the data from the storage unit and sends it through the serdes interface to the arithmetic chip that issued the data-acquisition instruction.
  • control commands between the arithmetic chips can also be sent through the UART control unit.
  • the core feeds the operation result or intermediate data back to the outside according to the external control instruction or the internal control instruction; the core obtains the operation result or intermediate data from the storage unit of its operation chip, or from the storage units of the other operation chips through the serdes interface, and sends the operation result or intermediate data to the outside through the UART control unit.
  • the external mentioned here may refer to an external host, an external network, an external platform, or the like.
  • the external host can initialize and configure the storage unit parameters through the UART control unit, and address multiple storage units uniformly.
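  • Unified addressing over the distributed storage units might look like the following sketch; the split of a global address into chip ID, unit index, and local offset is an assumption for illustration only:

```python
def route(global_addr, units_per_chip=1, unit_size=2**21):
    """Map a unified global address onto (chip, storage unit, local offset)."""
    chip = global_addr // (units_per_chip * unit_size)
    rem = global_addr % (units_per_chip * unit_size)
    return chip, rem // unit_size, rem % unit_size

# A core compares the chip field with its own ID: equal means a local
# access, otherwise the request travels over the serdes ring to the owner.
assert route(0) == (0, 0, 0)                # local to chip 0
assert route(3 * 2**21 + 100) == (3, 0, 100)  # owned by chip 3, via the ring
```

Uniform addressing of this kind is what lets every core treat the sum of all chips' storage units as one large memory, as the preceding bullet describes.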
  • the kernel core performs calculations based on the acquired data and stores the calculation results in the storage unit.
  • Each storage unit is provided with a dedicated storage area and a shared storage area. The dedicated storage area is used to store temporary calculation results of one arithmetic chip: intermediate results that this arithmetic chip continues to use but other arithmetic chips do not use. The shared storage area is used to store operation results of the arithmetic chip that are used by other arithmetic chips or that need to be transmitted to the outside as feedback.
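  • A toy model of a storage unit split into a dedicated area and a shared area (names invented; in this sketch, remote chips can only read the shared area):

```python
class StorageUnit:
    """Sketch: dedicated area for chip-private intermediates, shared area
    for results other chips consume over the ring."""
    def __init__(self):
        self.dedicated = {}   # visible only to the owning chip
        self.shared = {}      # readable by other chips via the serdes ring

    def write(self, addr, value, shared=False):
        (self.shared if shared else self.dedicated)[addr] = value

    def read_remote(self, addr):
        # a remote chip's request is only served from the shared area
        return self.shared.get(addr)

unit = StorageUnit()
unit.write(0, "temp", shared=False)    # private intermediate result
unit.write(1, "result", shared=True)   # result other chips will use
assert unit.read_remote(0) is None and unit.read_remote(1) == "result"
```

Keeping intermediates out of the shared area avoids spending ring bandwidth on data no other chip will ever read.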
  • each core core performs calculation and storage control functions
  • at least one storage unit is connected to each core core inside the chip.
  • the storage unit is connected to the storage units of other cores, so that each core can have a large-capacity memory, reducing the number of times data is moved into or out of memory from external storage space and speeding up data processing; at the same time, because multiple cores can operate independently or cooperatively, data processing speed is further improved.
  • the data mentioned here is various data such as command data, numeric data, character data, and so on.
  • the data format specifically includes a valid bit (valid), a destination address (dst id), a source address (src id), and data (data).
  • the kernel can determine whether the data packet is a command or a value by the valid bit; here, it can be assumed that 0 represents a value and 1 represents a command. The kernel determines the destination address, source address, and data type according to the data structure.
  • the core 50 sends a data read command to the core 10: the valid bit is 1, the destination address is the address of core 10, the source address is the address of core 50, and the data field carries the read-data command and the data type or data address.
  • the core 10 then returns the data to the core 50: the valid bit is 0, the destination address is the address of core 50, the source address is the address of core 10, and the data field carries the data that was read.
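  • The FIG. 7 frame layout (valid, dst id, src id, data) can be modeled as follows; the field widths chosen here are assumptions for the sketch, not from the patent:

```python
def pack(valid, dst, src, data, id_bits=8, data_bits=32):
    """Pack the FIG. 7 fields into one integer frame (widths assumed)."""
    frame = valid
    frame = (frame << id_bits) | dst
    frame = (frame << id_bits) | src
    frame = (frame << data_bits) | data
    return frame

def unpack(frame, id_bits=8, data_bits=32):
    data = frame & ((1 << data_bits) - 1); frame >>= data_bits
    src = frame & ((1 << id_bits) - 1); frame >>= id_bits
    dst = frame & ((1 << id_bits) - 1); frame >>= id_bits
    return frame, dst, src, data   # valid=1 marks a command, 0 a value

# core 50 asks core 10 for data: valid=1 (command), dst=10, src=50
cmd = pack(1, 10, 50, 0xCAFE)
assert unpack(cmd) == (1, 10, 50, 0xCAFE)
```

The reply in the surrounding text is the mirror image: valid=0 (a value), dst and src swapped, and the data field carrying the read result.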
  • the traditional six-stage pipeline structure is adopted in this embodiment, which are instruction fetch, decoding, execution, memory access, alignment and write-back stage respectively.
  • a reduced instruction set architecture can be adopted.
  • the instruction set of the present invention can be divided by function into register-register instructions, register-immediate instructions, jump instructions, memory access instructions, control instructions, and inter-core communication instructions.
  • the embodiments can be implemented as a machine, process, or article of manufacture by using standard programming and/or engineering techniques to produce software, firmware, hardware, or any combination thereof.
  • any generated program(s) can be embodied on one or more computer-usable media, such as resident storage devices, smart cards or other removable storage devices, or transmission devices, so that computer program products and articles of manufacture are produced according to the embodiments.
  • the terms "article of manufacture" and "computer program product" as used herein are intended to cover a computer program that exists permanently or temporarily on any non-transitory medium usable by a computer.
  • memory/storage devices include, but are not limited to, magnetic disks, optical disks, removable storage devices (such as smart cards, subscriber identity modules (SIM), and wireless identification modules (WIM)), and semiconductor memories (such as random access memory (RAM), read-only memory (ROM), and programmable read-only memory (PROM)).
  • transmission media include, but are not limited to, wireless communication networks, the Internet, intranets, telephone/modem-based network communication, hard-wired/cable communication networks, satellite communication, and other fixed or mobile network systems/communication links.
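As an illustration of the inter-core data frame described in the points above (valid bit, destination address, source address, data field) and of the read exchange between core 50 and core 10, the following sketch models the frame in Python. The field names, widths, and encodings are assumptions for illustration only; the document does not fix them.

```python
# Illustrative model of the inter-core data frame: valid bit, dst id,
# src id, data. Field widths and names are assumed, not specified.
from dataclasses import dataclass

CMD, VALUE = 1, 0  # valid bit: 1 = command frame, 0 = value frame


@dataclass
class Frame:
    valid: int   # 1 = command, 0 = numeric value
    dst_id: int  # destination core address
    src_id: int  # source core address
    data: bytes  # command payload or the read data itself


def make_read_request(requester: int, target: int, addr: bytes) -> Frame:
    """Core `requester` asks core `target` for data at `addr`."""
    return Frame(valid=CMD, dst_id=target, src_id=requester, data=addr)


def make_read_reply(request: Frame, payload: bytes) -> Frame:
    """The target core answers with a value frame, swapping the addresses."""
    return Frame(valid=VALUE, dst_id=request.src_id,
                 src_id=request.dst_id, data=payload)


# Mirrors the text: core 50 reads from core 10, core 10 replies.
req = make_read_request(requester=50, target=10, addr=b"\x00\x10")
rsp = make_read_reply(req, payload=b"\xde\xad")
assert (req.valid, req.dst_id, req.src_id) == (CMD, 10, 50)
assert (rsp.valid, rsp.dst_id, rsp.src_id) == (VALUE, 50, 10)
```

The reply simply swaps the source and destination addresses of the request, which is how a core knows where to return the fetched data.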

Abstract

Embodiments of the present invention provide a big data operation acceleration system and a data transmission method. The big data operation acceleration system comprises two or more operation chips. The operation chip comprises N cores, N data lanes, and at least one storage unit, N being a positive integer greater than or equal to 4. The data lane comprises a transmission interface (tx) and a receiving interface (rx). The cores and the data lanes are in one-to-one correspondence. The core transmits and receives data via the data lane. The two or more operation chips are connected by means of the transmission interface (tx) and the receiving interface (rx) so as to transmit data. The two or more operation chips are connected to form a ring. The technical solution of the embodiments of the present invention improves the speed of data transmission among multiple ASIC chips.

Description

Big data operation acceleration system and data transmission method

Technical Field
The present invention relates to the field of integrated circuits, and in particular to a big data operation acceleration system and a data transmission method.
Background Art
An ASIC (Application Specific Integrated Circuit) is an integrated circuit designed and manufactured to meet the requirements of specific users and the needs of a specific electronic system. ASICs are oriented to the needs of specific users; in volume production, ASICs offer smaller size, lower power consumption, improved reliability, improved performance, enhanced confidentiality, and lower cost compared with general-purpose integrated circuits.
With the development of science and technology, more and more fields, such as artificial intelligence and secure computing, involve specific calculations with large amounts of computation. For such specific operations, ASIC chips can exploit their advantages of fast computation and low power consumption. At the same time, in these computation-intensive fields, it is usually necessary to control N arithmetic chips working simultaneously in order to improve data processing speed and capacity. As data precision keeps improving, fields such as artificial intelligence and secure computing must operate on ever larger data. For example, the size of a photo is currently 3-7 MB in general, but as the precision of digital cameras and video cameras increases, a photo can reach 10 MB or more, and 30 minutes of video may exceed 1 GB of data. Fields such as artificial intelligence and secure computing demand fast computation and low latency, so improving computation speed and response time has always been a goal of chip design. Since the memory paired with an ASIC chip is generally 64 MB or 128 MB, when the data to be processed exceeds 512 MB the ASIC chip must access memory repeatedly and move data into and out of memory from external storage many times, which reduces processing speed.

At the same time, as data precision keeps improving, fields such as artificial intelligence and secure computing need to operate on larger and larger data, and an ASIC chip generally has to be configured with multiple storage units to store it; for example, if one ASIC chip is configured with four 2 GB memories, then when N arithmetic chips work simultaneously, 4N blocks of 2 GB memory are needed. However, when multiple arithmetic chips work at the same time, the amount of data stored does not exceed 2 GB, which wastes storage units and raises system cost.
In designs that process large amounts of related data, the prior art faces two difficulties: first, the need to greatly improve performance; second, in a distributed system, the data-dependency problem must also be solved, that is, data processed in one subsystem needs to be presented to all the other subsystems for confirmation and reprocessing. The time spent on data processing is generally reduced in two ways: one is to speed up the clock of the data-processing logic; the other is to increase the number of concurrent blocks processing data.
Under process constraints, the achievable improvement in clock rate is very limited. Increasing concurrency is a more effective way to improve performance, but once concurrency is raised, the data bandwidth requirement generally rises accordingly. In a typical system the data bandwidth depends on the bandwidth provided by DDR, but DDR bandwidth does not scale linearly. Suppose the initial system contains one group of DDR providing 1x bandwidth. To obtain a 2x bandwidth increase, two groups of DDR can be implemented; but to obtain an increase of 16x or more, it is impossible, because of physical size limits, simply to instantiate 16 groups of DDR in one system.
If multiple ASIC chips need to work together, the data cannot simply be distributed across multiple unconnected systems for processing, because the data are related: every piece of data completed in one processing unit must be confirmed and reprocessed in the other processing units. Therefore, raising the rate of data transmission between multiple ASIC chips also requires solving the problem of multi-system interconnection.
Summary of the Invention
The purpose of the embodiments of the present invention is to provide a way of connecting distributed storage through high-speed interfaces, so that multiple homogeneous systems can process a large amount of related data concurrently. An embodiment of the present invention provides a big data operation acceleration system in which external chip memory is eliminated and the storage units are placed inside the ASIC chips, reducing the time the ASIC chip spends reading data from outside and speeding up chip computation. Multiple ASIC chips share the storage units, which not only reduces the number of storage units but also reduces the connection lines between the ASIC arithmetic chips, simplifying the system structure and lowering the cost of the ASIC chips. At the same time, a serdes interface is used for data transmission between the arithmetic chips, which raises the rate of data transmission between multiple ASIC chips.
To achieve the above objective, the embodiments of the present invention provide the following technical solutions:
According to a first aspect of the embodiments of the present invention, a big data operation acceleration system is provided, comprising two or more arithmetic chips. Each arithmetic chip includes N cores, N data lanes, and at least one storage unit, where N is a positive integer greater than or equal to 4. Each data lane includes a transmitting interface (tx) and a receiving interface (rx); the cores and the data lanes are in one-to-one correspondence, and each core sends and receives data through its data lane. The two or more arithmetic chips are connected through the transmitting interfaces (tx) and receiving interfaces (rx) to transmit data, and are connected to form a ring.
According to a second aspect of the embodiments of the present invention, a data transmission method for a big data operation acceleration system is provided. The system includes two or more arithmetic chips connected through transmitting interfaces (tx) and receiving interfaces (rx) to transmit data, the chips being connected in a ring. After the first arithmetic chip, the data source, generates data, it sends the data through its transmitting interface (tx) to the second arithmetic chip on its adjacent side. The second arithmetic chip splits the data into two paths: one path is delivered to a core of the second arithmetic chip, and the other path is forwarded through a transmitting interface (tx) to the third arithmetic chip on the adjacent side of the second chip.
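The ring transmission method of the second aspect can be sketched as a short simulation: the source chip sends a frame to its neighbour, and every chip in turn hands one copy to its own core and forwards the other copy to the next chip. The assumption that forwarding stops once the frame has travelled the whole ring back to the source is made for the example; the text only describes the two-way split at each hop.

```python
# Illustrative simulation of ring broadcast among arithmetic chips.
# Chips are numbered 0..num_chips-1 and connected in a ring; the frame
# travels one direction until it returns to the source chip (assumed
# stop condition).

def broadcast_on_ring(num_chips: int, source: int) -> list:
    """Return the list of chips whose cores receive the frame."""
    delivered = []
    chip = (source + 1) % num_chips       # tx of source -> rx of neighbour
    while chip != source:                 # circulate once around the ring
        delivered.append(chip)            # copy 1: handed to this chip's core
        chip = (chip + 1) % num_chips     # copy 2: forwarded via tx interface
    return delivered


# With 4 chips and chip 0 as the data source, chips 1, 2 and 3 all
# receive the data after three hops.
assert broadcast_on_ring(4, 0) == [1, 2, 3]
```

Each hop costs one serdes transfer, so data generated on any chip reaches all other chips after at most M-1 hops in an M-chip ring.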
In the embodiments of the present invention, multiple chips are provided in the big data operation acceleration system; each chip includes multiple cores, each core performs computation and storage-control functions, and at least one storage unit is connected to each core inside the chip. Each core can read data both from its own storage unit and from the storage units connected to the cores of other arithmetic chips, so each core effectively has a large-capacity memory; this reduces the number of times data is moved into or out of memory from external storage and speeds up data processing. At the same time, the multiple cores can operate independently or cooperatively, which accelerates processing further. The multiple ASIC chips share storage units, which not only reduces the number of storage units but also reduces the connection lines between the ASIC arithmetic chips, simplifying the system structure and lowering the cost of the ASIC chips.
Brief Description of the Drawings
To explain the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some exemplary embodiments; a person of ordinary skill in the art can obtain other drawings from them without creative work.
Fig. 1 is a schematic diagram of the structure of a big data operation acceleration system with M ASIC chips according to the first embodiment;

Fig. 2 is a schematic structural diagram of an arithmetic chip with 4 cores;

Fig. 3 is a schematic structural diagram of a data lane;

Fig. 4a is a schematic structural diagram of a first embodiment of the storage unit;

Fig. 4b is a schematic structural diagram of a second embodiment of the storage unit;

Fig. 5 is a schematic diagram of the data transmission process of the big data operation acceleration system;

Fig. 6 is a schematic signal-flow diagram of an arithmetic chip with 4 cores according to the first embodiment;

Fig. 7 is a schematic diagram of a data structure according to the present invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the drawings. It should be understood that these embodiments are given only so that those skilled in the art can better understand and implement the present invention, and do not limit the scope of the present invention in any way. On the contrary, these embodiments are provided so that this disclosure will be thorough and complete and will fully convey the scope of the disclosure to those skilled in the art.
In addition, it should be noted that the directions up, down, left, and right in the drawings are merely illustrations of specific embodiments. A person skilled in the art can change the orientation of some or all of the components shown in the drawings according to actual needs without affecting the functions of the individual components or of the system as a whole; such reoriented technical solutions still fall within the protection scope of the present invention.
A multi-core chip is a multiprocessing system embodied on a single large-scale integrated semiconductor chip. Typically, two or more chip cores are embodied on a multi-core chip and interconnected by a bus (which can also be formed on the same multi-core chip). Anywhere from two to many chip cores can be embodied on the same multi-core chip; the upper limit on the number of cores is bounded only by manufacturing capability and performance constraints. Multi-core chips can host applications that perform specialized arithmetic and/or logical operations in multimedia and signal-processing algorithms, such as video encoding/decoding, 2D/3D graphics, audio and speech processing, image processing, telephony, speech recognition and speech synthesis, and encryption.
Although only ASIC application-specific integrated circuits are mentioned in the background, the specific wiring implementations in the embodiments can also be applied to multi-core CPUs, GPUs, FPGAs, and the like. In the embodiments, the multiple cores may be identical cores or different cores.
Fig. 1 is a schematic diagram of the structure of a big data operation acceleration system with M ASIC chips according to the first embodiment. As shown in Fig. 1, the system includes M ASIC arithmetic chips, where M is a positive integer greater than or equal to 2, for example 6, 10, or 12. Each arithmetic chip includes multiple cores (core0, core1, core2, core3) and 4 data lanes (lane0, lane1, lane2, lane3); each data lane includes a transmitting interface (tx) and a receiving interface (rx), and the cores and data lanes are in one-to-one correspondence. For example, core0 of the arithmetic chip 10 has data lane lane0, which has a transmitting interface (lane0 tx) and a receiving interface (lane0 rx): the transmitting interface (lane0 tx) is used by core0 to send data or control instructions outside the arithmetic chip 10, and the receiving interface (lane0 rx) is used to deliver data or control instructions from outside the arithmetic chip 10 to core0.

In this way the M arithmetic chips are connected through the transmitting interfaces (tx) and receiving interfaces (rx) so that data or control instructions can be transmitted, the M chips forming a closed ring. A storage unit is provided in each arithmetic chip, and all 4 cores of the chip are connected to it. The storage units of the M arithmetic chips store data in a distributed manner: a chip's cores can obtain data from the chip's own storage unit and also from the storage units of other arithmetic chips. Because all 4 cores of a chip are connected to the storage unit, the storage unit also serves for data exchange among the 4 cores of the chip. As those skilled in the art will appreciate, choosing 4 cores here is only an example; the number of cores may be N, where N is a positive integer greater than or equal to 4, for example 6, 10, or 12. In this embodiment, the multiple cores may be identical cores or different cores.
The transmitting interface (lane tx) and receiving interface (lane rx) of a data lane are serdes interfaces, and the arithmetic chips communicate with each other through these serdes interfaces. Serdes is short for SERializer/DESerializer. It is a mainstream time-division multiplexing (TDM), point-to-point (P2P) serial communication technology: multiple low-speed parallel signals are converted into a high-speed serial signal at the transmitting end, pass through the transmission medium (optical cable or copper wire), and are finally converted back into low-speed parallel signals at the receiving end. This point-to-point serial communication technology makes full use of the channel capacity of the transmission medium, reduces the number of transmission channels and device pins required, and increases signal transmission speed, thereby greatly reducing communication cost. Of course, other communication interfaces, such as SSI or UART, can be used instead of the serdes interface. The chips transmit data and control instructions over the serdes interfaces.
Fig. 2 illustrates a first embodiment of the structure of an arithmetic chip with 4 cores. As those skilled in the art will appreciate, choosing 4 cores here is only an example; the number of cores of the arithmetic chip may be N, where N is a positive integer greater than or equal to 2, for example 6, 10, or 12. In this embodiment, the cores of the arithmetic chip may be cores with the same function or cores with different functions.
The 4-core arithmetic chip (1) includes 4 cores (core0, core1, core2, core3), 4 data lanes (lane0, lane1, lane2, lane3), at least one storage unit, and a data exchange control unit; in this embodiment the data exchange control unit is a UART control unit. Each data lane includes a transmitting interface (lane tx) and a receiving interface (lane rx).
Core0 of the arithmetic chip (1) is connected to the transmitting interface (lane0 tx) and the receiving interface (lane0 rx) of its data lane: the transmitting interface (lane0 tx) is used by core0 to send data or control instructions to the arithmetic chip connected to the arithmetic chip 1, and the receiving interface (lane0 rx) is used to deliver data or control instructions transmitted by that connected arithmetic chip to core0. Similarly, core1 of the arithmetic chip 1 is connected to the transmitting interface (lane1 tx) and receiving interface (lane1 rx), core2 to the transmitting interface (lane2 tx) and receiving interface (lane2 rx), and core3 to the transmitting interface (lane3 tx) and receiving interface (lane3 rx) of their respective data lanes. The transmitting interfaces (lane tx) and receiving interfaces (lane rx) of the data lanes are serdes interfaces.
A data exchange control unit is connected through a bus to the storage unit and the 4 cores (core0, core1, core2, core3); the bus is not drawn in Fig. 2. The data exchange control unit can be implemented with various protocols, such as UART, SPI, PCIE, SERDES, or USB; in this embodiment it is a UART (Universal Asynchronous Receiver/Transmitter) control unit. A universal asynchronous receiver/transmitter, usually called a UART, converts the data to be transferred between serial and parallel form, and UARTs are commonly integrated into the links of various communication interfaces. The UART protocol is used here only as an example; other protocols can also be adopted. The UART control unit accepts external data and, according to the external data's address, sends it to a core (core0, core1, core2, core3) or to the storage unit. The UART control unit can also accept external control instructions and send control instructions to the cores or the storage unit; it can also be used by the arithmetic chip to send internal or external control instructions to other arithmetic chips, to accept control instructions from other chips, and to feed back operation results or intermediate data to the outside.

Internal data or internal control instructions are data or instructions generated by the chip itself; external data or external control instructions are data or instructions generated outside the chip, for example data or instructions sent by an external host or an external network.
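The routing rule of the data exchange control unit, which forwards incoming traffic to a core or to the storage unit according to its destination address, can be sketched as follows. The address map (core addresses 0-3, storage unit at 0xF) is invented purely for illustration; the document does not specify an address layout.

```python
# Hypothetical sketch of the data exchange (UART) control unit's routing
# rule: external data is dispatched by destination address either to one
# of the cores or to the storage unit. The address assignments below are
# assumptions made for the example.

CORE_ADDRS = {0, 1, 2, 3}   # addresses of core0..core3 (assumed)
STORAGE_ADDR = 0xF          # address of the storage unit (assumed)


def route(dst_addr: int) -> str:
    """Return the destination block for a frame with address `dst_addr`."""
    if dst_addr in CORE_ADDRS:
        return "core{}".format(dst_addr)
    if dst_addr == STORAGE_ADDR:
        return "storage_unit"
    raise ValueError("unknown destination address: {:#x}".format(dst_addr))


assert route(2) == "core2"
assert route(0xF) == "storage_unit"
```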
The main functions of the cores (core0, core1, core2, core3) are to execute external or internal control instructions, to perform data computation, and to control data storage. All of the cores of the arithmetic chip are connected to the storage unit and read data from or write data to it; through the storage unit, data exchange among the chip's multiple cores is realized. The cores can also send control commands to the chip's storage unit. According to instructions, a core can write data to, read data from, or send control instructions to the storage units of other arithmetic chips through the serdes interfaces; likewise, a core can send data to, read data from, or send control instructions to the cores of other arithmetic chips through the serdes interfaces.
Fig. 3 illustrates a first embodiment of the structure of a data lane. The data lane includes a receiving interface, a transmitting interface, a receive-address judgment unit, a transmit-address judgment unit, and several registers. One end of the receive-address judgment unit is connected to the receiving interface, and the other end is connected to the core through a register; one end of the transmit-address judgment unit is connected to the transmitting interface (tx), and the other end is connected to the core through a register; the receive-address judgment unit and the transmit-address judgment unit are connected to each other through a register.

When the receiving interface receives a data frame or control instruction sent by the arithmetic chip on its adjacent side, it passes the frame or instruction to the receive-address judgment unit, which delivers it to the core and at the same time passes it to the transmit-address judgment unit. The transmit-address judgment unit receives the frame or instruction and passes it to the transmitting interface (tx), which sends it on to the arithmetic chip on the other adjacent side. When the core generates a data frame or control instruction, it passes it to the transmit-address judgment unit, which passes it to the transmitting interface, and the transmitting interface sends it to the receiving interface of the arithmetic chip on the adjacent side. The registers serve to buffer data frames or control instructions temporarily.
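The receive path of one lane can be sketched as a small decision function. The text has the receive-address judgment unit pass frames both to the core and onward; the sketch below adds the assumption that the judgment consists of comparing the frame's destination address with the local core's address, consuming matching frames and forwarding the rest.

```python
# Minimal sketch of a data lane's receive path, under the assumption
# that the "address judgment" compares the frame's destination address
# with the local core's address: a matching frame is handed to the core,
# any other frame is latched in a register and forwarded out of the tx
# interface to the next chip in the ring.

def lane_rx(local_core_addr, frame):
    """Return (destination, frame forwarded on tx or None)."""
    if frame["dst_id"] == local_core_addr:
        return ("core", None)   # consumed by the local core
    return ("tx", frame)        # forwarded to the neighbour chip


assert lane_rx(7, {"dst_id": 7, "data": b"x"}) == ("core", None)
dest, fwd = lane_rx(7, {"dst_id": 9, "data": b"x"})
assert dest == "tx" and fwd["dst_id"] == 9
```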
Fig. 4a illustrates a first embodiment of the structure of the storage unit. Each arithmetic chip contains N cores that need concurrent random access to data; if N reaches the order of 64 or more, the memory bandwidth of the arithmetic chip must reach a very high order of magnitude, which even GDDR can hardly provide. Therefore, in the embodiments of the present invention, high bandwidth is provided by using an SRAM array with large MUX routing. As shown in Fig. 4a, the system consists of two levels of storage control to relieve congestion in the implementation. The storage unit (40) includes 8 memories (410 ... 417) connected to a storage control unit (420), which controls data reading and storage for the memories. Each memory (410 ... 417) includes at least two storage subunits and a storage control subunit; the storage control subunit is connected to the storage control unit through an interface and controls data reading and storage for its at least two storage subunits. The storage subunits are SRAM memories.
FIG. 4b illustrates a second embodiment of the structure of the storage unit. In FIG. 4b, multiple storage control units (420, 421, 422, 423) may be provided in the storage unit; each core is connected to every one of the storage control units (420, 421, 422, 423), and every storage control unit is connected to every memory (410 ... 417). The structure of each memory is identical to that in FIG. 4a and is not described again here.
The core sends generated data to at least one storage control unit, the storage control unit sends the data to the storage control subunit, and the storage control subunit stores the data in a storage subunit. When the core of an arithmetic chip receives a data acquisition command sent by another arithmetic chip, it judges from the data address whether the data is stored in the storage unit of its own chip. If so, it sends a data read command to at least one storage control unit; the storage control unit forwards the read command to the corresponding storage control subunit, which fetches the data from the storage subunit and returns it to the storage control unit. The storage control unit sends the fetched data to the core, the core sends it to the sending address judgment unit, the sending address judgment unit sends it to the sending interface (tx), and the sending interface transmits it to the adjacent arithmetic chip.
When the big data operation acceleration system is applied in the field of artificial intelligence, the UART control unit of an arithmetic chip stores the image or video data sent by the external host into the storage unit through the core. The arithmetic chip generates the mathematical model of a neural network; this model may also be stored into the storage unit by the external host through the UART control unit and read by each arithmetic chip. The first layer of the neural network model runs on the first arithmetic chip: its core reads data from the storage unit of its own chip and/or the storage units of other arithmetic chips, performs the computation, and stores the result through the serdes interface into at least one storage unit of another arithmetic chip, or into the storage unit of its own chip. The arithmetic chip (1) sends a control instruction to the next arithmetic chip (2) through the UART control unit or the serdes interface to start it. The second layer of the neural network model then runs on the next arithmetic chip (2): its core likewise reads data from the storage unit of its own chip and/or the storage units of other arithmetic chips, performs the computation, and stores the result through the serdes interface into at least one storage unit of another arithmetic chip, or into the storage unit of its own chip. Each chip executes one layer of the neural network, obtaining data through the serdes interface from the storage units of other arithmetic chips or of its own chip, until the last layer of the neural network produces the final result. An arithmetic chip then obtains the result from the local storage unit or from the storage unit of another arithmetic chip and feeds it back to the external host through the UART control unit.
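The layer-per-chip pipeline described above can be sketched as follows. The function and the `layers` callables are hypothetical; the sketch only mirrors the data flow (each chip reads its input, computes one layer, and writes the output into the next chip's storage unit).

```python
def run_network_on_ring(layers, input_data, chip_memories):
    """Each chip executes one neural-network layer and stores its output into
    the memory of the next chip on the ring, which is then started.
    `layers` is one callable per chip (an illustrative stand-in for the
    layer's mathematical model)."""
    chip_memories[0] = input_data          # host loads data via UART into chip 1
    for i, layer in enumerate(layers):
        activations = chip_memories[i]     # read from this chip's storage unit
        result = layer(activations)        # run layer i on chip i
        nxt = (i + 1) % len(chip_memories)
        chip_memories[nxt] = result        # store via serdes to the next chip's memory
    return result                          # last layer's output, fed back via UART

layers = [lambda x: [v * 2 for v in x], lambda x: [v + 1 for v in x]]
mem = [None, None]
out = run_network_on_ring(layers, [1, 2, 3], mem)  # out == [3, 5, 7]
```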
When the big data operation acceleration system is applied in the field of cryptocurrency, the UART control unit of the arithmetic chip (1) stores the block information sent by the external host into at least one of the storage units of the arithmetic chips. The external host may send control instructions through the UART control units of the arithmetic chips (1 ... M) to all M arithmetic chips, which then start their operations. Alternatively, the external host may send a control instruction to the UART control unit (130) of a single arithmetic chip (1), and the arithmetic chip (1) in turn sends control instructions to the other M-1 arithmetic chips, whereupon the M arithmetic chips start their operations. It is also possible for the external host to send a control instruction to the UART control unit of the arithmetic chip (1); the first arithmetic chip (1) then sends a control instruction to the second arithmetic chip (2), the second arithmetic chip (2) to the third arithmetic chip (3), and the third arithmetic chip (3) to the fourth arithmetic chip (4), until all M arithmetic chips have started. The M arithmetic chips obtain data through the serdes interface from the storage units of other arithmetic chips or of their own chips and perform the proof-of-work computation simultaneously. The arithmetic chip (1) obtains the result from the storage unit and feeds it back to the external host through the UART control unit.
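The simultaneous proof-of-work described above can be sketched by partitioning the nonce search space across the M chips. SHA-256 stands in for the actual hash, the interleaved partitioning is a hypothetical choice, and the difficulty check is simplified to a fixed hex prefix.

```python
import hashlib

def search_slice(block_info, chip_id, m_chips, difficulty_prefix="00", limit=100000):
    """Chip k searches the interleaved nonce slice k, k+M, k+2M, ...
    (a hypothetical work split; the patent does not fix one)."""
    for nonce in range(chip_id, limit, m_chips):
        digest = hashlib.sha256(f"{block_info}{nonce}".encode()).hexdigest()
        if digest.startswith(difficulty_prefix):
            return nonce, digest
    return None

# All M chips work on the same block information held in shared storage:
results = [search_slice("block-header", k, 4) for k in range(4)]
winner = min((r for r in results if r), key=lambda r: r[0])
```

In the system itself each slice would run on a different chip in parallel; the list comprehension above merely runs the M slices in sequence to show the partitioning.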
FIG. 5 illustrates a first embodiment of the data transmission process of the big data operation acceleration system. Each arithmetic chip completes 1/n of the work; because of data dependencies, once a chip finishes its share of the data it must transmit its result to all the other chips. Arithmetic chip n-1 is the source of the data frame, and the data is sent through lane1 tx to arithmetic chip 0. Inside arithmetic chip 0, the data frame is split into two paths: the first path is delivered to the core of arithmetic chip 0, and the other path is forwarded into the lane1 tx channel of arithmetic chip 0, so that the data frame is sent on to arithmetic chip 1.
Source ID mechanism: every data frame carries the ID of the arithmetic chip that originated it. Whenever the frame is sent to a new arithmetic chip, that chip inspects the chip ID carried in the frame. If the ID in the frame equals the ID of the next arithmetic chip connected to this chip, the frame is not forwarded any further; the life cycle of the data frame ends there and it no longer occupies bandwidth. The check of the chip ID in the data frame may be performed either in the core or in the receiving address judgment unit.
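The source-ID termination rule can be simulated directly: a frame travels around the ring, each hop delivers it to the local core, and forwarding stops when the next chip's ID equals the source ID carried in the frame, so every chip except the source receives the frame exactly once.

```python
def broadcast_on_ring(frame_src, n_chips):
    """Simulate the source-ID mechanism on a ring of n_chips chips.
    Returns the list of chips that receive the frame, in hop order."""
    receivers = []
    chip = (frame_src + 1) % n_chips     # first hop from the source chip
    while True:
        receivers.append(chip)           # first path: deliver to this chip's core
        next_chip = (chip + 1) % n_chips
        if next_chip == frame_src:       # the frame's life cycle ends here
            break
        chip = next_chip                 # second path: forward on the lane tx
    return receivers

# A frame originating at chip 2 in a 4-chip ring reaches chips 3, 0, 1:
print(broadcast_on_ring(2, 4))  # -> [3, 0, 1]
```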
FIG. 6 illustrates the signal flow of an arithmetic chip with four cores according to the first embodiment. The UART control unit (130) obtains data or control instructions from outside the chip and transmits them to the core (110) connected to the UART control unit. The core (110) either stores external data, according to its data address, into the storage unit (120) of this chip, or sends the data through a signal channel (lane) to the core of the other chip corresponding to the data address, and that chip's core stores the data in its local storage unit. An external control instruction is either executed by the core of this chip or, according to the address of the instruction, sent through a signal channel (lane) to the core of the corresponding other chip for execution. When the core of an arithmetic chip needs to obtain data, it may obtain the data from the local storage unit or from the storage units of other arithmetic chips. When obtaining data from the storage unit of another arithmetic chip, the core (110) broadcasts a data acquisition instruction through its serdes interface (150) to the connected arithmetic chip; the connected chip splits the instruction into two paths, one delivered to its own core and the other forwarded to the next chip. If a connected arithmetic chip determines that the data is stored in its local storage unit, its core reads the data from the storage unit and sends it through the serdes interface to the chip that issued the data acquisition instruction. Control instructions between arithmetic chips may of course also be sent through the UART control units. When the core feeds a computation result or intermediate data back to the outside in response to an external or internal control instruction, it obtains the result or intermediate data from the storage unit of its own chip, or through the serdes interface from the storage unit of another arithmetic chip, and sends it to the outside through the UART control unit. "Outside" here may refer to an external host, an external network, an external platform, and so on. The external host can initialize and configure the storage unit parameters through the UART control unit and apply a unified addressing scheme to the multiple storage units.
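The broadcast remote read described above can be sketched as follows; `chips` models the unified address map as one dictionary of address-to-value entries per chip, which is an illustrative assumption rather than the patent's actual storage layout.

```python
def remote_read(requester_id, addr, chips):
    """Sketch of the broadcast data acquisition: the requesting core sends a
    read instruction around the ring; the chip whose local storage unit holds
    `addr` reads the data and returns it over the serdes interface."""
    n = len(chips)
    chip = (requester_id + 1) % n
    while chip != requester_id:          # instruction travels around the ring
        if addr in chips[chip]:          # data found in this chip's storage unit
            return chips[chip][addr]     # sent back via serdes to the requester
        chip = (chip + 1) % n            # otherwise forward to the next chip
    return None                          # no chip holds the address

chips = [{}, {0x100: b"weights"}, {}, {}]
assert remote_read(0, 0x100, chips) == b"weights"
```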
The core also performs computations on the data it obtains and stores the results in the storage unit. Each storage unit is divided into a dedicated storage area and a shared storage area. The dedicated storage area stores the temporary results of one arithmetic chip, that is, intermediate results that this chip will continue to use but that no other arithmetic chip needs. The shared storage area stores computation results that will be used by other arithmetic chips or that need to be fed back to the outside.
In the embodiments of the present invention, multiple cores are provided in a chip, each core performing both computation and storage control, and at least one storage unit inside the chip is connected to each core. Because each core can read both the storage unit connected to itself and the storage units connected to the other cores, every core effectively has a large-capacity memory, which reduces the number of times data must be moved into or out of memory from external storage and speeds up data processing. At the same time, since the multiple cores can compute independently or cooperatively, data processing is accelerated further.
FIG. 7 illustrates the data structure according to the present invention. The data referred to here includes command data, numeric data, character data, and other kinds of data. The data format specifically includes a valid bit (valid), a destination address (dst id), a source address (src id), and a data field (data). The core can judge from the valid bit whether the packet is a command or a value; here it may be assumed that 0 represents a value and 1 represents a command. The core determines the destination address, source address, and data type from this structure. For example, in FIG. 1, when core 50 sends a data read command to core 10, the valid bit is 1, the destination address is the address of core 10, the source address is the address of core 50, and the data field carries the read command together with the data type or data address. When core 10 returns data to core 50, the valid bit is 0, the destination address is the address of core 50, the source address is the address of core 10, and the data field carries the data that was read. In terms of instruction timing, this embodiment adopts a traditional six-stage pipeline: instruction fetch, decode, execute, memory access, align, and write-back. In terms of instruction set architecture, a reduced instruction set architecture may be adopted. Following the general design method of a reduced instruction set architecture, the instruction set of the present invention can be divided by function into register-register instructions, register-immediate instructions, jump instructions, memory access instructions, control instructions, and inter-core communication instructions.
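The frame layout of FIG. 7 (valid bit, dst id, src id, data) can be sketched as a packed structure. The field widths chosen below, and the explicit length field, are assumptions for illustration, since the patent does not specify bit widths.

```python
import struct

VALID_VALUE, VALID_COMMAND = 0, 1

def pack_frame(valid, dst_id, src_id, data: bytes) -> bytes:
    """Pack a frame as: 1-byte valid flag, 1-byte dst id, 1-byte src id,
    2-byte big-endian payload length, then the payload (assumed widths)."""
    return struct.pack(">BBBH", valid, dst_id, src_id, len(data)) + data

def unpack_frame(raw: bytes):
    """Inverse of pack_frame: return (valid, dst_id, src_id, payload)."""
    valid, dst_id, src_id, length = struct.unpack_from(">BBBH", raw)
    return valid, dst_id, src_id, raw[5:5 + length]

# Core 50 sends a read command to core 10 (valid bit 1 = command):
frame = pack_frame(VALID_COMMAND, 10, 50, b"READ 0x1000")
assert unpack_frame(frame) == (1, 10, 50, b"READ 0x1000")
```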
Using the description provided herein, the embodiments can be implemented as a machine, process, or article of manufacture by using standard programming and/or engineering techniques to produce programming software, firmware, hardware, or any combination thereof.
Any generated program(s) having computer-readable program code can be embodied on one or more computer-usable media, such as resident storage devices, smart cards or other removable storage devices, or transmission devices, so as to make computer program products and articles of manufacture according to the embodiments. As such, the terms "article of manufacture" and "computer program product" as used herein are intended to cover a computer program that exists permanently or temporarily on any non-transitory medium usable by a computer.
As noted above, memory/storage devices include, but are not limited to, magnetic disks, optical disks, removable storage devices (such as smart cards, subscriber identity modules (SIM), and wireless identification modules (WIM)), and semiconductor memories (such as random access memory (RAM), read-only memory (ROM), and programmable read-only memory (PROM)). Transmission media include, but are not limited to, transmission via wireless communication networks, the Internet, intranets, telephone/modem-based network communication, hard-wired/cable communication networks, satellite communication, and other fixed or mobile network systems/communication links.
Although specific example embodiments have been disclosed, those skilled in the art will understand that changes can be made to the specific example embodiments without departing from the spirit and scope of the invention.
The present invention has been described above based on embodiments with reference to the drawings, but the present invention is not limited to the above embodiments; schemes in which parts of the embodiments and modifications are appropriately combined or replaced according to layout requirements are also included within the scope of the present invention. In addition, the combinations and processing order of the embodiments may be appropriately rearranged based on the knowledge of those skilled in the art, and various design changes and other modifications may be applied to the embodiments; embodiments to which such modifications are applied may also be included within the scope of the present invention.
Although the present invention has described various concepts in detail, those skilled in the art will understand that various modifications and substitutions of those concepts can be implemented within the spirit of the overall teaching disclosed herein. A person skilled in the art can implement the invention set forth in the claims without undue experimentation using ordinary techniques. It will be understood that the specific concepts disclosed are merely illustrative and are not intended to limit the scope of the present invention, which is determined by the full scope of the appended claims and their equivalents.

Claims (10)

  1. A big data operation acceleration system, comprising two or more arithmetic chips, each arithmetic chip comprising N cores, N data channels (lanes), and at least one storage unit, where N is a positive integer greater than or equal to 4; each data channel (lane) comprises a sending interface (tx) and a receiving interface (rx); the cores and the data channels (lanes) correspond one to one, and each core sends and receives data through its data channel (lane); the two or more arithmetic chips are connected through the sending interfaces (tx) and the receiving interfaces (rx) to transmit data, and the two or more arithmetic chips are connected in a ring.
  2. The system according to claim 1, wherein the sending interface (tx) and the receiving interface (rx) of each arithmetic chip are serdes interfaces, and the arithmetic chips communicate with one another through the serdes interfaces.
  3. The system according to claim 1 or 2, wherein the data channel (lane) further comprises a receiving address judgment unit and a sending address judgment unit; one end of the receiving address judgment unit is connected to the receiving interface, and the other end of the receiving address judgment unit is connected to the core; one end of the sending address judgment unit is connected to the sending interface (tx), and the other end of the sending address judgment unit is connected to the core; and the receiving address judgment unit and the sending address judgment unit are connected to each other.
  4. The system according to claim 3, wherein the receiving interface (rx) receives a data frame sent by the arithmetic chip on one adjacent side and sends the data frame to the receiving address judgment unit; the receiving address judgment unit sends the data frame to the core and, at the same time, to the sending address judgment unit; the sending address judgment unit receives the data frame and sends it to the sending interface (tx); and the sending interface sends the data frame to the arithmetic chip on the other adjacent side.
  5. The system according to claim 3, wherein the core generates a data frame and sends the data frame to the sending address judgment unit; the sending address judgment unit sends the data frame to the sending interface (tx); and the sending interface (tx) sends the data frame to the arithmetic chip on the adjacent side.
  6. The system according to claim 3, wherein the receiving address judgment unit and the sending address judgment unit are connected to each other through a first-in first-out (FIFO) memory.
  7. The system according to claim 3, wherein the core of an arithmetic chip obtains a data acquisition command sent by another arithmetic chip and judges from the data address whether the data is stored in the storage unit of its own chip; if so, the core obtains the data from at least one storage unit and sends the obtained data to the sending address judgment unit; the sending address judgment unit sends the obtained data to the sending interface (tx); and the sending interface sends the obtained data to the adjacent arithmetic chip.
  8. A data transmission method for a big data operation acceleration system, the big data operation acceleration system comprising two or more arithmetic chips connected through sending interfaces (tx) and receiving interfaces (rx) to transmit data, the two or more arithmetic chips being connected in a ring; after a first arithmetic chip at the data source generates data, it sends the data through the sending interface (tx) to a second arithmetic chip on the adjacent side of the first arithmetic chip; the second arithmetic chip on the adjacent side splits the data into two paths: the first path is delivered to the core of the second arithmetic chip, and the other path is forwarded through the sending interface (tx) to a third arithmetic chip on the adjacent side of the second arithmetic chip.
  9. The method according to claim 8, wherein the data carries the identification (ID) of the arithmetic chip at the data source.
  10. The method according to claim 9, wherein after the data is transmitted to an adjacent arithmetic chip, the adjacent arithmetic chip detects the chip identification (ID) carried in the data; when the identification (ID) is equal to the identification (ID) of the next arithmetic chip connected to the adjacent arithmetic chip, the data is not forwarded any further.
PCT/CN2018/112546 2018-10-30 2018-10-30 Big data operation acceleration system and data transmission method WO2020087246A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2018/112546 WO2020087246A1 (en) 2018-10-30 2018-10-30 Big data operation acceleration system and data transmission method
CN201880097576.0A CN112740192B (en) 2018-10-30 2018-10-30 Big data operation acceleration system and data transmission method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/112546 WO2020087246A1 (en) 2018-10-30 2018-10-30 Big data operation acceleration system and data transmission method

Publications (1)

Publication Number Publication Date
WO2020087246A1 true WO2020087246A1 (en) 2020-05-07

Family

ID=70463294

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/112546 WO2020087246A1 (en) 2018-10-30 2018-10-30 Big data operation acceleration system and data transmission method

Country Status (2)

Country Link
CN (1) CN112740192B (en)
WO (1) WO2020087246A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140189197A1 (en) * 2012-12-27 2014-07-03 Ramamurthy Krithivas Sharing serial peripheral interface flash memory in a multi-node server system on chip platform environment
CN104699531A (en) * 2013-12-09 2015-06-10 超威半导体公司 Voltage dip relieving applied to three-dimensional chip system
CN104865938A (en) * 2015-04-03 2015-08-26 深圳市前海安测信息技术有限公司 Node connection chip applied to assess human body injury condition and node network thereof

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100573497C (en) * 2007-12-26 2009-12-23 杭州华三通信技术有限公司 Communication means and system between a kind of multinuclear multiple operating system
US9699079B2 (en) * 2013-12-30 2017-07-04 Netspeed Systems Streaming bridge design with host interfaces and network on chip (NoC) layers
CN103744644B (en) * 2014-01-13 2017-03-01 上海交通大学 The four core processor systems built using four nuclear structures and method for interchanging data
CN108536642A (en) * 2018-06-13 2018-09-14 北京比特大陆科技有限公司 Big data operation acceleration system and chip
CN209149287U (en) * 2018-10-30 2019-07-23 北京比特大陆科技有限公司 Big data operation acceleration system


Also Published As

Publication number Publication date
CN112740192A (en) 2021-04-30
CN112740192B (en) 2024-04-30

Similar Documents

Publication Publication Date Title
US7155554B2 (en) Methods and apparatuses for generating a single request for block transactions over a communication fabric
US7277975B2 (en) Methods and apparatuses for decoupling a request from one or more solicited responses
TW201915759A (en) High bandwidth memory systems
CN108536642A (en) Big data operation acceleration system and chip
TW201633171A (en) Enhanced data bus invert encoding for OR chained buses
US10922258B2 (en) Centralized-distributed mixed organization of shared memory for neural network processing
CN209149287U (en) Big data operation acceleration system
CN112052210A (en) Data structure for refined link training
WO2020087276A1 (en) Big data operation acceleration system and chip
US20210004347A1 (en) Approximate data bus inversion technique for latency sensitive applications
TWI720345B (en) Interconnection structure of multi-core system
CN103377170A (en) Inter-heterogeneous-processor SPI (serial peripheral interface) high speed two-way peer-to-peer data communication system
CN209560543U (en) Big data operation chip
WO2020087239A1 (en) Big data computing acceleration system
WO2020087246A1 (en) Big data operation acceleration system and data transmission method
WO2020087243A1 (en) Big data computing chip
WO2020087275A1 (en) Method for big data operation acceleration system carrying out operations
CN209784995U (en) Big data operation acceleration system and chip
CN115129657A (en) Programmable logic resource expansion device and server
CN112805727A (en) Artificial neural network operation acceleration device for distributed processing, artificial neural network acceleration system using same, and method for accelerating artificial neural network
CN208298179U (en) Big data operation acceleration system and chip
WO2020087278A1 (en) Big data computing acceleration system and method
CN209543343U (en) Big data operation acceleration system
CN109643301B (en) Multi-core chip data bus wiring structure and data transmission method
CN116561036B (en) Data access control method, device, equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18938388

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 11.10.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 18938388

Country of ref document: EP

Kind code of ref document: A1