WO2020087276A1 - Big data operation acceleration system and chip - Google Patents

Big data operation acceleration system and chip

Info

Publication number
WO2020087276A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
core
chip
unit
storage
Prior art date
Application number
PCT/CN2018/112688
Other languages
French (fr)
Chinese (zh)
Inventor
桂文明
Original Assignee
北京比特大陆科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京比特大陆科技有限公司 filed Critical 北京比特大陆科技有限公司
Priority to CN201880002364.XA priority Critical patent/CN109564562B/en
Priority to PCT/CN2018/112688 priority patent/WO2020087276A1/en
Publication of WO2020087276A1 publication Critical patent/WO2020087276A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/16Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163Interprocessor communication
    • G06F15/173Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
    • G06F15/17306Intercommunication techniques
    • G06F15/17312Routing techniques specific to parallel machines, e.g. wormhole, store and forward, shortest path problem congestion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/16Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163Interprocessor communication
    • G06F15/173Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
    • G06F15/17356Indirect interconnection networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/16Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163Interprocessor communication
    • G06F15/173Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
    • G06F15/17356Indirect interconnection networks
    • G06F15/17368Indirect interconnection networks non hierarchical topologies
    • G06F15/17375One dimensional, e.g. linear array, ring

Definitions

  • Embodiments of the present invention relate to the field of integrated circuits, and in particular, to a big data operation acceleration system and chip.
  • ASIC Application Specific Integrated Circuits
  • ASICs are designed to meet the needs of specific users.
  • Compared with general-purpose integrated circuits, ASICs have the advantages of smaller size, lower power consumption, improved reliability, improved performance, enhanced confidentiality, and lower cost.
  • The embodiments of the present invention provide a big data operation acceleration system and a chip in which two or more ASIC operation chips are each connected to two or more storage units through a bus, and the operation chips exchange data through the storage units. This not only reduces the number of storage units but also reduces the connection lines between ASIC operation chips and simplifies the system structure. Each ASIC operation chip is connected to multiple storage units separately, so no conflicts arise when the bus mode is used, and no Cache needs to be set for each ASIC operation chip.
  • a big data operation acceleration system including more than two operation chips and more than two storage units, wherein:
  • the arithmetic chip includes at least one first data interface (130), more than two second data interfaces (150, 151, 152, 153), at least two cores (110, 111, 112, 113), and a routing unit (120); the at least one first data interface (130) and the more than two second data interfaces (150, 151, 152, 153) are each connected to the routing unit, and the routing unit is connected to the at least two cores (110, 111, 112, 113);
  • the storage unit (20) includes more than two memories, a routing unit (230) and more than two third data interfaces (250, 251, 252, 253); the more than two third data interfaces (250, 251, 252, 253) are each connected to the routing unit through a bus, and the routing unit is connected to the more than two memories;
  • the second data interface (150, 151, 152, 153) of the arithmetic chip is connected to the third data interface (250, 251, 252, 253) of the storage unit through a bus.
  • a big data operation acceleration system including more than two operation chips and more than two storage units, wherein:
  • the arithmetic chip includes at least one first data interface (130), more than two second data interfaces (150, 151, 152, 153), at least two cores (110, 111, 112, 113), and a routing unit (120); each second data interface is connected to one core, the at least two cores are connected to the routing unit, and the at least one first data interface (130) is connected to one core (110);
  • the storage unit (20) includes more than two memories, a routing unit (230) and more than two third data interfaces (250, 251, 252, 253); the more than two third data interfaces (250, 251, 252, 253) are each connected to the routing unit through a bus, and the routing unit is connected to the more than two memories;
  • the second data interface (150, 151, 152, 153) of the arithmetic chip is connected to the third data interface (250, 251, 252, 253) of the storage unit through a bus.
  • a big data operation chip includes at least one first data interface (130), more than two second data interfaces (150, 151, 152, 153), at least two cores (110, 111, 112, 113), and a routing unit (120); the at least one first data interface (130) and the more than two second data interfaces (150, 151, 152, 153) are each connected to the routing unit, and the routing unit is connected to the at least two cores (110, 111, 112, 113); the second data interface and the third data interface are serdes interfaces; the second data interface (150, 151, 152, 153) of the arithmetic chip is connected to the storage unit through a bus.
  • a big data operation chip includes at least one first data interface (130), more than two second data interfaces (150, 151, 152, 153), at least two cores (110, 111, 112, 113), and a routing unit (120); each second data interface is connected to one core, the at least two cores are connected to the routing unit, and the at least one first data interface (130) is connected to one core (110); the second data interface and the third data interface are serdes interfaces; the second data interface (150, 151, 152, 153) is connected to the storage unit through a bus.
  • By connecting multiple operation chips in the big data operation acceleration system to each storage unit, the embodiments of the present invention reduce the number of storage units required and reduce the connection cost between ASIC operation chips.
  • The system structure is simplified, and each ASIC operation chip is connected to multiple storage units separately, so no conflicts arise when the bus mode is used and no Cache needs to be set for each ASIC operation chip.
  • FIG. 1 illustrates a first embodiment of a schematic structural diagram of a big data operation acceleration system having 4 operation chips and 4 storage units;
  • FIG. 2a illustrates a first embodiment of a schematic structural diagram of an arithmetic chip with 4 cores;
  • FIG. 2b illustrates a schematic diagram of a signal flow of an arithmetic chip with 4 cores in the first embodiment;
  • FIG. 3a illustrates a second embodiment of a schematic structural diagram of an arithmetic chip with 4 cores;
  • FIG. 3b illustrates a schematic diagram of a signal flow of an arithmetic chip with 4 cores in the second embodiment;
  • FIG. 4a illustrates a third embodiment of a schematic structural diagram of a storage unit corresponding to an arithmetic chip having 4 cores;
  • FIG. 4b illustrates a schematic diagram of a signal flow of a storage unit corresponding to an arithmetic chip having 4 cores in the third embodiment;
  • FIG. 5 illustrates a schematic diagram of a connection structure of a big data operation acceleration system with 4 operation chips and 4 storage units;
  • FIG. 6 illustrates a schematic diagram of the data structure according to this embodiment
  • Multi-core chips are multi-processing systems embodied on a single large-scale integrated semiconductor chip.
  • Two or more chip cores may be embodied on one multi-core chip and interconnected by a bus (which may also be formed on the same multi-core chip).
  • Multi-core chips can be applied to perform the special arithmetic and/or logical operations used in multimedia and signal processing algorithms (such as video encoding/decoding, 2D/3D graphics, audio and voice processing, image processing, telephony, voice recognition and voice synthesis, and encryption processing).
  • Although ASICs are mentioned in the background art, the specific wiring implementation in the embodiments can also be applied to multi-core CPUs, GPUs, FPGAs, etc.
  • multiple cores may be the same core or different cores.
  • the number of operation chips may be N, where N is a positive integer greater than or equal to 2, for example, 6, 10, 12, and so on.
  • the number of storage units may be M, where M is a positive integer greater than or equal to 2, for example, 6, 9, 12, etc.
  • N and M may be equal or unequal.
  • a plurality of arithmetic chips may be the same arithmetic chip or different arithmetic chips.
  • FIG. 1 is a first embodiment of a schematic structural diagram of a big data operation acceleration system having 4 operation chips and 4 storage units.
  • The big data operation acceleration system includes 4 arithmetic chips (10, 11, 12, 13) and 4 storage units (20, 21, 22, 23); each arithmetic chip is connected to all storage units through a bus, and the arithmetic chips exchange data through the storage units. The arithmetic chips do not exchange data directly; control instructions are sent between the arithmetic chips.
  • Each storage unit is provided with a dedicated storage area and a shared storage area. The dedicated storage area stores the temporary calculation results of one arithmetic chip, that is, intermediate results that this arithmetic chip will continue to use and that other arithmetic chips will not use. The shared storage area stores the data operation results of the arithmetic chips, which are either used by other arithmetic chips or need to be fed back to the outside.
  • Alternatively, the storage unit may be left undivided.
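  • The split between dedicated and shared storage areas can be sketched as follows. This is a minimal illustration, not the patent's implementation: the class name `PartitionedStorageUnit`, its methods, and the access check on the dedicated area are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class PartitionedStorageUnit:
    """Hypothetical model of one storage unit with dedicated and shared areas."""
    dedicated: dict = field(default_factory=dict)  # chip id -> {key: value}
    shared: dict = field(default_factory=dict)     # key -> value

    def write_temp(self, chip_id, key, value):
        # Temporary (intermediate) results go to the writing chip's own area.
        self.dedicated.setdefault(chip_id, {})[key] = value

    def read_temp(self, chip_id, owner_id, key):
        # Assumption: only the owning chip reads its dedicated area.
        if chip_id != owner_id:
            raise PermissionError("dedicated area belongs to another chip")
        return self.dedicated[owner_id][key]

    def write_result(self, key, value):
        # Results meant for other chips or for the outside go to the shared area.
        self.shared[key] = value

    def read_result(self, key):
        return self.shared[key]


unit = PartitionedStorageUnit()
unit.write_temp(10, "partial_sum", 3)   # chip 10 keeps an intermediate value
unit.write_result("layer_output", 7)    # any chip may read this later
```

    Here chip 10 can read back `partial_sum`, while another chip asking for it would raise `PermissionError`; `layer_output` is readable by every chip.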
  • the storage unit here may be a high-speed external memory such as DDR, SDDR, DDR2, DDR3, DDR4, GDDR5, GDDR6, HMC, HBM, etc.
  • The storage unit is preferably DDR-series memory. DDR (Double Data Rate) memory is double-data-rate synchronous dynamic random-access memory.
  • DDR uses a synchronization circuit so that the main steps of specifying the address and of transmitting and outputting the data are executed independently while remaining fully synchronized with the CPU;
  • DDR uses DLL (Delay-Locked Loop) technology to provide a data filter signal. When the data is valid, the memory controller can use this data filter signal (output once every 16 outputs) to locate the data accurately and to resynchronize data from different memory modules.
  • the frequency of DDR memory can be expressed in two ways: operating frequency and equivalent frequency.
  • The operating frequency is the actual operating frequency of the memory particles; because DDR memory can transmit data on both the rising and falling edges of the clock pulse, the equivalent frequency of the transmitted data is twice the operating frequency.
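  • The relation between the two frequencies is simple arithmetic and can be written as a one-line helper. The function name and the 200 MHz figure are illustrative assumptions, not values from the patent.

```python
def equivalent_frequency_mhz(operating_mhz: float) -> float:
    """Equivalent data frequency of DDR memory.

    Data moves on both the rising and the falling clock edge,
    so the equivalent frequency is twice the operating frequency.
    """
    return 2 * operating_mhz


print(equivalent_frequency_mhz(200.0))  # prints 400.0
```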
  • DDR2 (Double Data Rate 2) memory is a new-generation memory technology standard developed by JEDEC (the Joint Electron Device Engineering Council). DDR2 memory can read/write data at 4 times the speed of the external bus per clock and can run at 4 times the speed of the internal control bus.
  • DDR3, DDR4, GDDR5, GDDR6, HMC, HBM memory are all existing technologies, and will not be described in detail here.
  • The 4 ASIC operation chips are each connected to the 4 storage units through a bus, and the operation chips exchange data through the storage units. This not only reduces the number of storage units but also reduces the connection lines between ASIC operation chips and simplifies the system structure. Each ASIC operation chip is connected to multiple storage units separately, so no conflicts arise when the bus mode is used and no Cache needs to be set for each ASIC operation chip.
  • FIG. 2a illustrates a first embodiment of a schematic structural diagram of an arithmetic chip with 4 cores.
  • The number of cores of the arithmetic chip may be Q, where Q is a positive integer greater than or equal to 2, for example, 6, 10, 12, etc.
  • the core of the arithmetic chip may be a core with the same function or a core with different functions.
  • The 4-core operation chip (10) includes 4 cores (110, 111, 112, 113), a routing unit (120), a data exchange control unit (130) and 4 serdes interfaces (150, 151, 152, 153).
  • The data exchange control unit and the four serdes interfaces are each connected to the routing unit through the bus, and the routing unit is connected to each core.
  • the data exchange control unit can be implemented using multiple protocols, such as UART, SPI, PCIE, SERDES, USB, etc.
  • Here, the data exchange control unit is a UART (Universal Asynchronous Receiver/Transmitter) control unit (130). A universal asynchronous receiver/transmitter, usually called a UART, is an asynchronous transceiver.
  • A UART converts the data to be transmitted between serial form and parallel form.
  • A UART is usually integrated into the connections of various communication interfaces. The UART protocol is used here only as an example; other protocols can also be used.
  • the UART control unit (130) can receive external data or control commands, send control commands to other chips, receive control commands from other chips, and feed back calculation results or intermediate data to the outside.
  • Serdes is short for SERializer/DESerializer. It is a mainstream time-division multiplexing (TDM), point-to-point (P2P) serial communication technology: multiple low-speed parallel signals are converted into a high-speed serial signal at the transmitting end, sent through the transmission medium (optical cable or copper wire), and converted back into low-speed parallel signals at the receiving end.
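  • The parallel-to-serial conversion and its inverse can be illustrated with a small sketch. It models only the bit ordering; a real serdes link also involves line coding and clock recovery, which are not modeled here. The function names, the 8-bit word width, and the MSB-first order are assumptions for this example.

```python
def serialize(words, width=8):
    """Flatten parallel words into one serial bit stream (MSB first)."""
    bits = []
    for word in words:
        for shift in reversed(range(width)):
            bits.append((word >> shift) & 1)
    return bits


def deserialize(bits, width=8):
    """Regroup a serial bit stream into parallel words of `width` bits."""
    words = []
    for start in range(0, len(bits), width):
        word = 0
        for bit in bits[start:start + width]:
            word = (word << 1) | bit
        words.append(word)
    return words


stream = serialize([0xA5, 0x3C])   # transmitting end
received = deserialize(stream)     # receiving end recovers [0xA5, 0x3C]
```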
  • Other communication interfaces, for example SSI or UART, can be used instead of the serdes interface. Data and control commands are transmitted between the chip and the storage unit through the serdes interface and the transmission line.
  • The main functions of a core are to execute external or internal control instructions, perform data calculation, and control data storage.
  • The routing unit sends data or control instructions to the cores (110, 111, 112, 113) and accepts data or control instructions sent by the cores (110, 111, 112, 113), thereby implementing communication between the cores.
  • Together, the routing unit and the UART control unit (130) accept external control instructions and send them to each core (110, 111, 112, 113); the UART control unit (130) accepts external data, which is sent to a core (110, 111, 112, 113) or a storage unit according to the external data address.
  • Internal data or internal control instructions are data or control instructions generated by the chip itself; external data or external control instructions are data or control instructions generated outside the chip, such as those sent by an external host or an external network.
  • FIG. 2b illustrates a schematic diagram of a signal flow of an arithmetic chip with four cores in the first embodiment.
  • The UART control unit (130) obtains data or control instructions from outside the chip. According to the address of the data or control instruction, the routing unit (120) sends it either to a core or, through a serdes interface, to the storage unit connected to that serdes interface. If the destination address of an external control instruction points to another chip, the routing unit sends the control instruction to the UART control unit (130), which forwards it to the other chip.
  • The UART control unit (130) sends the operation result to the outside according to an external or internal control instruction.
  • The operation result can be obtained from a core of the operation chip, or from a storage unit through the serdes interface connected to it.
  • "External" here may refer to an external host, an external network, an external platform, or the like.
  • The external host can initialize and configure the storage unit parameters through the UART control unit and address multiple storage particles uniformly.
  • A core can send the routing unit a control instruction to read or write data.
  • The control instruction carries the data address, and the routing unit reads data from or writes data to the storage unit through the serdes interface according to that address.
  • A core may also send data or control instructions to other cores through the routing unit according to the address, and obtain data or control instructions from other cores through the routing unit.
  • The core performs calculations on the acquired data and stores the calculation result in the storage unit.
  • Each storage unit is provided with a dedicated storage area and a shared storage area. The dedicated storage area stores the temporary calculation results of one arithmetic chip, that is, intermediate results that this arithmetic chip will continue to use and that other arithmetic chips will not use. The shared storage area stores the data operation results of the arithmetic chips, which are either used by other arithmetic chips or need to be fed back to the outside. If a control instruction generated by a core is used to control the operation of other chips, the routing unit sends it to the UART control unit (130), which sends it to the other chips. If a control instruction generated by a core is used to control a storage unit, the routing unit sends it to the storage unit through the serdes interface.
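  • The address-based forwarding performed by the routing unit (120) in the first embodiment can be sketched as a lookup. The address format (a kind plus an identifier), the function name `route`, and the pairing of storage unit 20 with serdes interface 150, 21 with 151, and so on are assumptions for illustration; the text does not specify them.

```python
CORES = (110, 111, 112, 113)
UART_CONTROL_UNIT = 130
# Assumed pairing of storage units with this chip's serdes interfaces.
SERDES_FOR_STORAGE = {20: 150, 21: 151, 22: 152, 23: 153}


def route(dest_kind, dest_id):
    """Return the reference numeral of the port the routing unit forwards to."""
    if dest_kind == "core":
        if dest_id not in CORES:
            raise ValueError(f"unknown core {dest_id}")
        return dest_id
    if dest_kind == "storage":
        # Data and storage control instructions leave through a serdes interface.
        return SERDES_FOR_STORAGE[dest_id]
    if dest_kind == "chip":
        # Control instructions for other chips leave through the UART control unit.
        return UART_CONTROL_UNIT
    raise ValueError(f"unknown destination kind {dest_kind!r}")
```

    For example, `route("storage", 21)` selects serdes interface 151, while `route("chip", 11)` selects the UART control unit (130).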
  • FIG. 3a illustrates a second embodiment of a schematic structural diagram of an arithmetic chip with 4 cores.
  • The 4-core operation chip includes 4 cores (110, 111, 112, 113), a routing unit (120), a UART control unit (130) and 4 serdes interfaces (150, 151, 152, 153). Each serdes interface is connected to one core, the 4 cores are connected to the routing unit, and the UART control unit (130) is connected to the core (110).
  • FIG. 3b illustrates a schematic signal flow diagram of an arithmetic chip with 4 cores in the second embodiment.
  • The UART control unit (130) acquires data or control instructions from outside the chip and transmits them to the core (110) connected to the UART control unit.
  • The core (110) transmits the external data or control instructions to the routing unit (120), and the routing unit sends them to the core (111, 112, 113) corresponding to the address of the data or control instruction. If the destination address of the data or control instruction is a core of this arithmetic chip, the routing unit sends the data or control instruction to that core (110, 111, 112, 113).
  • If the destination is a storage unit, the core (111, 112, 113) sends the data or control instruction to the corresponding storage unit through its serdes interface (151, 152, 153).
  • The core (110) can also send data or control instructions directly to the corresponding storage unit through the serdes interface (150) connected to it.
  • The routing unit stores the serdes interface corresponding to every storage unit address. If the destination of the data or control instruction is another arithmetic chip, the data is sent by the core (111, 112, 113) to the corresponding storage unit through the serdes interface (151, 152, 153), and the control instruction is sent through the UART control unit to the other arithmetic chip.
  • When a core feeds back an operation result or intermediate data to the outside according to an external or internal control instruction, the core obtains the operation result or intermediate data from the storage unit through its serdes interface and sends it to the routing unit; the routing unit sends it to the core (110) connected to the UART control unit, and the UART control unit finally sends it to the outside. If the operation result or intermediate data is obtained through the serdes interface of the core connected to the UART control unit, it is sent to the outside directly through the UART control unit.
  • "External" here may refer to an external host, an external network, an external platform, or the like. The external host can initialize and configure the storage unit parameters through the UART control unit and address multiple storage units uniformly.
  • A core can send control instructions to the routing unit.
  • The routing unit sends the control instructions to other cores, other chips, or storage units according to their addresses. After receiving the control instructions, the other cores, other chips, or storage units perform the corresponding operations.
  • When a core sends control instructions or data to other cores, they are forwarded directly through the routing unit.
  • A core sends control instructions to other chips via the UART control unit.
  • When a core sends a control instruction to a storage unit, the routing unit looks up the serdes interface corresponding to the address and sends the control instruction to the core connected to that serdes interface; that core passes it to the serdes interface, and the serdes interface sends the control instruction to the storage unit.
  • When a core sends data to other chips, the routing unit looks up the serdes interface corresponding to the address and sends the data to the core connected to that serdes interface; that core passes it to the serdes interface, and the serdes interface sends the data to the storage unit. The other chips then acquire the data through the storage unit.
  • When a core obtains data from a storage unit, the data address carried in the control instruction is read. The routing unit looks up the serdes interface corresponding to the address and sends the control instruction to the core connected to that serdes interface; that core passes it to the serdes interface, and the serdes interface sends a read control instruction, carrying the destination address and the source address, to the storage unit. After the serdes interface obtains the data from the storage unit, it sends the data to the core connected to it. That core sends a data packet including the source address and the destination address to the routing unit, and the routing unit sends the data packet to the corresponding core according to the destination address.
  • If the core finds that the destination address is its own address, it takes the data for processing. A core can also send data or commands to other cores through the routing unit and obtain data or commands from other cores through the routing unit. The core performs calculations on the acquired data and stores the calculation result in the storage unit.
  • Each storage unit is provided with a dedicated storage area and a shared storage area. The dedicated storage area stores the temporary calculation results of one arithmetic chip, that is, intermediate results that this arithmetic chip will continue to use and that other arithmetic chips will not use. The shared storage area stores the data operation results of the arithmetic chips, which are either used by other arithmetic chips or need to be fed back to the outside.
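  • The read flow of the second embodiment, in which each serdes interface hangs off one core, can be traced hop by hop. This sketch only lists the reference numerals a request and its reply pass through; the function name `read_path` and the pairing of storage units with serdes interfaces are assumptions for illustration.

```python
ROUTING_UNIT = 120
# Each serdes interface is attached to exactly one core (second embodiment).
SERDES_OF_CORE = {110: 150, 111: 151, 112: 152, 113: 153}
# Assumed pairing of storage units with this chip's serdes interfaces.
SERDES_FOR_STORAGE = {20: 150, 21: 151, 22: 152, 23: 153}


def read_path(requesting_core, storage_unit):
    """Return (request hops, reply hops) for a read issued by one core."""
    serdes = SERDES_FOR_STORAGE[storage_unit]
    owner = next(c for c, s in SERDES_OF_CORE.items() if s == serdes)

    request = [requesting_core]
    if owner != requesting_core:
        # The routing unit forwards the instruction to the core owning the serdes.
        request += [ROUTING_UNIT, owner]
    request += [serdes, storage_unit]

    reply = [storage_unit, serdes, owner]
    if owner != requesting_core:
        # The owning core sends the data packet back through the routing unit.
        reply += [ROUTING_UNIT, requesting_core]
    return request, reply


request, reply = read_path(110, 22)
# request == [110, 120, 112, 152, 22]; reply == [22, 152, 112, 120, 110]
```

    When the requesting core owns the serdes interface itself, as in `read_path(111, 21)`, the routing unit is not involved.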
  • FIG. 4a illustrates a third embodiment of a schematic structural diagram of a storage unit corresponding to an arithmetic chip having 4 cores.
  • the storage unit (20) includes C memories.
  • C is a positive integer greater than or equal to 2, for example, 6, 10, 12, etc.
  • Each memory (240, 241, 242, 243) includes a storage controller (220, 221, 222, 223) and storage particles (210, 211, 212, 213); the storage controller writes data to or reads data from the storage particles according to instructions, and the storage particles store the data.
  • the storage unit (20) further includes a routing unit (230) and four serdes interfaces (250, 251, 252, 253). The four serdes interfaces are connected to the routing unit through the bus, and the routing unit is connected to each memory.
  • FIG. 4b illustrates a schematic diagram of a signal flow of a storage unit corresponding to an arithmetic chip with 4 cores in the third embodiment.
  • The storage unit (20) accepts a control instruction through a serdes interface (250, 251, 252, 253) and sends it to the routing unit (230).
  • The routing unit sends the control instruction to the corresponding memory (240, 241, 242, 243) according to the address in the control instruction, and the storage controller (220, 221, 222, 223) performs the related operation according to the control instruction: for example, addressing multiple storage particles uniformly according to the initially configured memory parameters, resetting the storage particles in response to a reset instruction, or executing write or read instructions.
  • When a serdes interface (250, 251, 252, 253) accepts a data acquisition instruction sent by an arithmetic chip, the instruction carries the address of the data to be acquired. The routing unit sends the data acquisition instruction to the memory according to the address, the storage controller obtains the data from the storage particles, and the data is sent through the serdes interface, according to the source address, to the arithmetic chip that needs it.
  • When a serdes interface (250, 251, 252, 253) receives a write data instruction and data sent by an arithmetic chip, the instruction carries the address of the data to be written. The routing unit sends the write data instruction and the data to the memory according to the address, and the storage controller writes the data to the storage particles according to the write data instruction.
  • Each storage unit is provided with a dedicated storage area and a shared storage area. The dedicated storage area stores the temporary calculation results of one arithmetic chip, that is, intermediate results that this arithmetic chip will continue to use and that other arithmetic chips will not use. The shared storage area stores the data operation results of the arithmetic chips, which are either used by other arithmetic chips or need to be fed back to the outside.
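  • The way the storage unit's routing unit (230) picks a memory from a uniformly assigned address can be sketched as follows. The class name `StorageUnitModel`, the contiguous address layout, and the memory size are assumptions made for this example; the text only states that multiple storage particles are addressed uniformly.

```python
class StorageUnitModel:
    """Hypothetical storage unit: C memories behind one routing unit."""

    def __init__(self, num_memories=4, words_per_memory=1024):
        self.words_per_memory = words_per_memory
        # One dict per memory stands in for its storage particles.
        self.memories = [dict() for _ in range(num_memories)]

    def locate(self, address):
        # Assumed uniform addressing: memories occupy consecutive address ranges.
        index, offset = divmod(address, self.words_per_memory)
        if not 0 <= index < len(self.memories):
            raise IndexError(f"address {address} is outside the storage unit")
        return index, offset

    def write(self, address, value):
        # Write instruction: routed to one memory, executed by its controller.
        index, offset = self.locate(address)
        self.memories[index][offset] = value

    def read(self, address, source):
        # Read instruction: the data is returned towards the source address.
        index, offset = self.locate(address)
        return source, self.memories[index][offset]


storage = StorageUnitModel()
storage.write(1030, 99)              # lands in memory 1 at offset 6
reply = storage.read(1030, source=10)  # (10, 99), returned to chip 10
```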
  • FIG. 5 illustrates a schematic diagram of a connection structure of a big data operation acceleration system with 4 operation chips and 4 storage units.
  • The system has 4 arithmetic chips (10, 11, 12, 13) and 4 storage units (20, 21, 22, 23).
  • the structure of the arithmetic chip may be the chip structure disclosed in the first embodiment and the second embodiment.
  • The arithmetic chip may also be an equivalent modified chip structure made by those skilled in the art on the basis of the first and second embodiments; such a chip structure is also within the scope of protection of this embodiment.
  • the structure of the storage unit may be the structure of the storage unit disclosed in the third embodiment.
  • The storage unit may also be an equivalently improved storage unit structure made by those skilled in the art on the basis of the third embodiment; such a structure is also within the scope of protection of the embodiments.
  • the UART control unit (130) of the operation chip (10) is connected to an external host, and the UART control unit (130) of each chip (10, 11, 12, 13) is connected through a bus.
  • Each serdes interface (150, 151, 152, 153) of each chip (10, 11, 12, 13) is connected to a serdes interface (250, 251, 252, 253) of one storage unit (20, 21, 22, 23), so that each operation chip is connected to all storage units through a bus. The operation chips exchange data through the storage units, and data is not exchanged directly between the operation chips.
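  • The connection structure of FIG. 5 can be enumerated as a list of point-to-point links. The specific port pairing used below (the j-th serdes interface of chip i wired to the i-th serdes interface of storage unit j) is an assumption for illustration; the text only requires that every chip reach every storage unit.

```python
CHIPS = (10, 11, 12, 13)
STORAGE_UNITS = (20, 21, 22, 23)


def build_links(chips=CHIPS, units=STORAGE_UNITS):
    """Return ((chip, chip_port), (unit, unit_port)) for every serdes link."""
    links = []
    for chip_index, chip in enumerate(chips):
        for unit_index, unit in enumerate(units):
            chip_port = 150 + unit_index   # second data interface on the chip
            unit_port = 250 + chip_index   # third data interface on the unit
            links.append(((chip, chip_port), (unit, unit_port)))
    return links


links = build_links()
print(len(links))  # prints 16
```

    With 4 chips and 4 storage units this gives 16 links, every port is used exactly once, and no link joins two chips directly.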
  • the internal and external signal flows of the arithmetic chip and the storage unit have been described in detail in the first, second, and third embodiments, and will not be described again here.
  • the system is applied in the field of artificial intelligence.
  • The UART control unit (130) of the arithmetic chip (10) stores the picture data or video data sent by the external host in the storage units (20, 21, 22, 23) through the serdes interfaces (150, 151, 152, 153).
  • The arithmetic chips (10, 11, 12, 13) generate a mathematical model of the neural network; the model can also be stored in the storage units (20, 21, 22, 23) by the external host through the serdes interfaces (150, 151, 152, 153) and read by each arithmetic chip (10, 11, 12, 13).
  • The first layer of the neural network's mathematical model runs on the arithmetic chip (10): the arithmetic chip (10) reads data from the storage units (20, 21, 22, 23) through the serdes interfaces, performs the operation, and stores the operation result through the serdes interfaces in at least one of the storage units (20, 21, 22, 23).
  • The arithmetic chip (10) sends a control instruction to the arithmetic chip (11) through the UART control unit (130) to start the arithmetic chip (11).
  • The second layer of the neural network's mathematical model runs on the arithmetic chip (11): the arithmetic chip (11) reads data from the storage units (20, 21, 22, 23) through the serdes interfaces, performs the operation, and stores the operation result through the serdes interfaces in at least one of the storage units (20, 21, 22, 23). Each chip executes one layer of the neural network, obtaining data from the storage units (20, 21, 22, 23) through the serdes interfaces for its operation, until the final layer of the neural network produces the operation result.
  • the operation chip (10) obtains the operation result from the storage unit (20, 21, 22, 23) through the serdes interface, and feeds it back to the external host through the UART control unit (130).
  • the system is applied to the field of encrypted digital currency, and the UART control unit (130) of the arithmetic chip (10) stores the block information sent by the external host to at least one storage unit in the storage units (20, 21, 22, 23).
  • the external host sends control instructions to the four arithmetic chips (10, 11, 12, 13) through their UART control units (130), and the four arithmetic chips (10, 11, 12, 13) start operating.
  • the external host may also send a control instruction for data calculation to the UART control unit (130) of one arithmetic chip (10); that arithmetic chip (10) then sends control instructions for data calculation to the other three arithmetic chips (11, 12, 13) in sequence, and the 4 arithmetic chips (10, 11, 12, 13) start operating.
  • the external host may also send a control instruction for data calculation to the UART control unit (130) of the first arithmetic chip (10); the first arithmetic chip (10) sends a control instruction for data calculation to the second arithmetic chip (11), the second arithmetic chip (11) sends one to the third arithmetic chip (12), and the third arithmetic chip (12) sends one to the fourth arithmetic chip (13), so that the 4 arithmetic chips (10, 11, 12, 13) start operating.
  • the 4 arithmetic chips (10, 11, 12, 13) read the block information data from the storage units through the serdes interfaces and perform the proof-of-work calculation simultaneously; the arithmetic chip (10) obtains the operation result from the storage units (20, 21, 22, 23) and feeds it back to the external host through the UART control unit (130).
  • the number of arithmetic chips and the number of storage units are equal, and the number of second data interfaces on each storage unit and the number of second data interfaces on each arithmetic chip both equal the number of storage units.
  • the number of arithmetic chips and the number of storage units may also be unequal.
  • in that case the number of second data interfaces on each storage unit equals the number of arithmetic chips, and the number of second data interfaces on each arithmetic chip equals the number of storage units. For example, with four arithmetic chips and five storage units, five second data interfaces are provided on each arithmetic chip and four second data interfaces are provided on each storage unit.
  • the bus may use a centralized arbitration bus structure or a ring topology bus structure.
  • the bus technology is a common technology in the field, so it will not be described in detail here.
  • the data mentioned here is various data such as command data, numeric data, character data, and so on.
  • the data format specifically includes a valid bit (valid), a destination address (dst id), a source address (src id), and the data itself (data).
  • the core can determine from the valid bit whether the data packet is a command or a value. Here, it can be assumed that 0 represents a value and 1 represents a command.
  • the core determines the destination address, source address, and data type according to the data structure. In terms of instruction timing, this embodiment adopts a traditional six-stage pipeline structure whose stages are instruction fetch, decode, execute, memory access, alignment, and write-back.
  • the instruction set of the present invention can be divided into register-register type instructions, register-immediate instruction, jump instruction, memory access instruction, control instruction and inter-core communication instruction according to functions.
  • the embodiments can be implemented as a machine, process, or article of manufacture by using standard programming and / or engineering techniques to produce programming software, firmware, hardware, or any combination thereof.
  • any generated program(s) can be embodied on one or more computer-usable media, such as resident storage devices, smart cards or other removable storage devices, or transmission devices, so that computer program products and articles of manufacture are produced according to the embodiments.
  • the terms "article of manufacture" and "computer program product" as used herein are intended to cover computer programs that are permanently or temporarily present on any non-transitory medium that can be used by computers.
  • memory / storage devices include but are not limited to magnetic disks, optical disks, removable storage devices (such as smart cards, subscriber identity modules (SIM), wireless identification modules (WIM)), semiconductor memories (such as random access memory (RAM), read only memory (ROM), programmable read only memory (PROM)), etc.
  • Transmission media include, but are not limited to, transmission via wireless communication networks, the Internet, intranets, telephone / modem-based network communications, hard-wired / cable communications networks, satellite communications, and other fixed or mobile network systems / communication links.
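
The data format described in the list above (valid bit, dst id, src id, data) can be sketched as a packed word. This is only an illustrative sketch: the field widths (8, 8, and 32 bits), the field order, and the names `Packet`, `pack`, and `unpack` are assumptions, since the document does not specify them. Only the convention that 0 marks a value and 1 marks a command is taken from the text.

```python
from dataclasses import dataclass

# Field widths are assumptions for illustration; the document does not give them.
DST_BITS = 8
SRC_BITS = 8
DATA_BITS = 32


@dataclass(frozen=True)
class Packet:
    valid: int   # 0 = value, 1 = command (convention assumed in the text)
    dst_id: int  # destination address
    src_id: int  # source address
    data: int    # payload

    def is_command(self) -> bool:
        return self.valid == 1


def pack(p: Packet) -> int:
    """Pack the four fields into one integer: valid | dst | src | data."""
    word = p.valid & 1
    word = (word << DST_BITS) | (p.dst_id & ((1 << DST_BITS) - 1))
    word = (word << SRC_BITS) | (p.src_id & ((1 << SRC_BITS) - 1))
    word = (word << DATA_BITS) | (p.data & ((1 << DATA_BITS) - 1))
    return word


def unpack(word: int) -> Packet:
    """Reverse of pack(): peel the fields off from the low bits upward."""
    data = word & ((1 << DATA_BITS) - 1)
    word >>= DATA_BITS
    src_id = word & ((1 << SRC_BITS) - 1)
    word >>= SRC_BITS
    dst_id = word & ((1 << DST_BITS) - 1)
    word >>= DST_BITS
    return Packet(valid=word & 1, dst_id=dst_id, src_id=src_id, data=data)
```

A receiver would call `unpack` and then branch on `is_command()` to decide whether to execute the payload or treat it as a numeric value.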

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Advance Control (AREA)

Abstract

Provided in the present application are a big data operation acceleration system and a chip. By means of setting a plurality of cores in a chip, each core performs operation and storage control functions, and at least one storage unit is connected to each core at an outer part of the chip. By employing the technical solution of the present invention, each core reads a storage unit connected thereto and storage units connected to other cores, thereby achieving the technical effect in which each core may have a large-capacity memory, reducing the number of times data is moved in from an external storage space or is moved out of the memory, and accelerating the processing speed of the data. Meanwhile, since the plurality of cores may operate independently or cooperatively, said manners also accelerate the processing speed of the data.

Description

Big data operation acceleration system and chip
Technical field
Embodiments of the present invention relate to the field of integrated circuits, and in particular to a big data operation acceleration system and chip.
Background
An ASIC (Application Specific Integrated Circuit) is an integrated circuit designed and manufactured to meet the requirements of a specific user and the needs of a specific electronic system. Because ASICs target the needs of specific users, in mass production they offer smaller size, lower power consumption, higher reliability, higher performance, better confidentiality, and lower cost than general-purpose integrated circuits.
With the development of technology, more and more fields, such as artificial intelligence and secure computing, involve specific calculations with a large computational load. For such specific operations, ASIC chips offer advantages such as fast operation and low power consumption. In these computation-heavy fields, N operation chips usually need to be controlled to work simultaneously in order to improve data processing speed and capacity. As data precision keeps increasing, fields such as artificial intelligence and secure computing must operate on ever larger data, and an ASIC chip generally has to be configured with multiple storage units to store that data; for example, one ASIC chip may be configured with four 2G memories. When N operation chips work simultaneously, 4N 2G memories are then required. However, when multiple operation chips work at the same time, the amount of data stored does not exceed 2G, so storage units are wasted and system cost increases.
The above background content is provided only to aid understanding of this application, and does not constitute an admission or acknowledgement that any of the content mentioned forms part of the common general knowledge relative to this application.
Summary of the invention
Embodiments of the present invention provide a big data operation acceleration system and chip in which two or more ASIC operation chips are each connected to two or more storage units through a bus, and the operation chips exchange data through the storage units. This not only reduces the number of storage units but also reduces the connection lines between ASIC operation chips and simplifies the system structure. Because each ASIC operation chip is connected to multiple storage units separately, no conflicts arise from the use of a bus, and no cache needs to be provided for each ASIC operation chip.
To achieve the above objective, a first aspect of this embodiment provides a big data operation acceleration system comprising two or more operation chips and two or more storage units, wherein:
the operation chip includes at least one first data interface (130), two or more second data interfaces (150, 151, 152, 153), at least two cores (110, 111, 112, 113), and a routing unit (120); the at least one first data interface (130) and the two or more second data interfaces (150, 151, 152, 153) are each connected to the routing unit, and the routing unit is connected to the at least two cores (110, 111, 112, 113);
the storage unit includes two or more third data interfaces (250, 251, 252, 253); the storage unit (20) includes two or more memories, a routing unit (230), and the two or more third data interfaces (250, 251, 252, 253); the two or more third data interfaces (250, 251, 252, 253) are each connected to the routing unit through a bus, and the routing unit is in turn connected to the two or more memories;
the second data interfaces (150, 151, 152, 153) of the operation chip are connected to the third data interfaces (250, 251, 252, 253) of the storage unit through a bus.
A second aspect of this embodiment provides a big data operation acceleration system comprising two or more operation chips and two or more storage units, wherein:
the operation chip includes at least one first data interface (130), two or more second data interfaces (150, 151, 152, 153), at least two cores (110, 111, 112, 113), and a routing unit (120); each second data interface is connected to one core, the at least two cores are connected to the routing unit, and the at least one first data interface (130) is connected to one core (110);
the storage unit includes two or more third data interfaces (250, 251, 252, 253); the storage unit (20) includes two or more memories, a routing unit (230), and the two or more third data interfaces (250, 251, 252, 253); the two or more third data interfaces (250, 251, 252, 253) are each connected to the routing unit through a bus, and the routing unit is in turn connected to the two or more memories;
the second data interfaces (150, 151, 152, 153) of the operation chip are connected to the third data interfaces (250, 251, 252, 253) of the storage unit through a bus.
A third aspect of this embodiment provides a big data operation chip, characterized in that the operation chip includes at least one first data interface (130), two or more second data interfaces (150, 151, 152, 153), at least two cores (110, 111, 112, 113), and a routing unit (120); the at least one first data interface (130) and the two or more second data interfaces (150, 151, 152, 153) are each connected to the routing unit, and the routing unit is connected to the at least two cores (110, 111, 112, 113); the second data interface and the third data interface are serdes interfaces; the second data interfaces (150, 151, 152, 153) of the operation chip are connected to a storage unit through a bus.
A fourth aspect of this embodiment provides a big data operation chip, characterized in that the operation chip includes at least one first data interface (130), two or more second data interfaces (150, 151, 152, 153), at least two cores (110, 111, 112, 113), and a routing unit (120); each second data interface is connected to one core, the at least two cores are connected to the routing unit, and the at least one first data interface (130) is connected to one core (110); the second data interface and the third data interface are serdes interfaces; the second data interfaces (150, 151, 152, 153) of the operation chip are connected to a storage unit through a bus.
By connecting each of the multiple operation chips in the big data operation acceleration system to every memory unit, the embodiments of the present invention save memory units and reduce system cost. They also reduce the connection lines between ASIC operation chips and simplify the system structure. Because each ASIC operation chip is connected to multiple storage units separately, no conflicts arise from the use of a bus, and no cache needs to be provided for each ASIC operation chip.
Brief description of the drawings
To explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely some exemplary embodiments; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic structural diagram of a first embodiment of a big data operation acceleration system having 4 operation chips and 4 storage units;
FIG. 2a is a schematic structural diagram of a first embodiment of an operation chip with 4 cores;
FIG. 2b is a schematic diagram of the signal flow of the operation chip with 4 cores in the first embodiment;
FIG. 3a is a schematic structural diagram of a second embodiment of an operation chip with 4 cores;
FIG. 3b is a schematic diagram of the signal flow of the operation chip with 4 cores in the second embodiment;
FIG. 4a is a schematic structural diagram of a third embodiment, a storage unit corresponding to an operation chip with 4 cores;
FIG. 4b is a schematic diagram of the signal flow of the storage unit corresponding to an operation chip with 4 cores in the third embodiment;
FIG. 5 is a schematic diagram of the connection structure of a big data operation acceleration system with 4 operation chips and 4 storage units;
FIG. 6 is a schematic diagram of the data structure according to this embodiment.
Detailed description
Exemplary implementations of this embodiment are described in detail below with reference to the drawings. These implementations are given only to enable those skilled in the art to better understand and implement the present invention, and do not limit its scope in any way. Rather, they are provided so that this disclosure is thorough and complete and fully conveys its scope to those skilled in the art.
It should also be noted that the up, down, left, and right directions in the drawings are merely illustrations of particular implementations. Those skilled in the art can reorient some or all of the components shown in the drawings as actually needed without affecting the ability of each component, or of the system as a whole, to perform its function; technical solutions with such changed orientations still fall within the protection scope of the present invention.
A multi-core chip is a multiprocessing system embodied on a single large-scale integrated semiconductor chip. Typically, two or more chip cores are embodied on one multi-core chip and interconnected by a bus (which may also be formed on the same multi-core chip). Anywhere from two to many chip cores may be embodied on the same multi-core chip; the upper limit on the number of cores is set only by manufacturing capability and performance constraints. Multi-core chips have applications involving specialized arithmetic and/or logical operations performed in multimedia and signal processing algorithms, such as video encoding/decoding, 2D/3D graphics, audio and speech processing, image processing, telephony, speech recognition and sound synthesis, and encryption processing.
Although only ASICs are mentioned in the background, the specific wiring implementation in the embodiments can also be applied to multi-core CPUs, GPUs, FPGAs, and the like. In this embodiment, the multiple cores may be identical or different.
For convenience, the following description uses the big data operation acceleration system of FIG. 1, with 4 operation chips and 4 storage units, as an example. Those skilled in the art will understand that the choice of 4 operation chips and 4 storage units is only illustrative. The number of operation chips may be N, where N is a positive integer greater than or equal to 2, for example 6, 10, or 12. The number of storage units may be M, where M is a positive integer greater than or equal to 2, for example 6, 9, or 12. In the embodiments, N and M may or may not be equal. In this embodiment, the operation chips may be identical or different.
FIG. 1 is a schematic structural diagram of a first embodiment of a big data operation acceleration system having 4 operation chips and 4 storage units. As shown in FIG. 1, the system includes 4 operation chips (10, 11, 12, 13) and 4 storage units (20, 21, 22, 23). Each operation chip is connected to all storage units through a bus. The operation chips exchange data through the storage units and do not exchange data directly with one another; control instructions are sent between the operation chips.
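The exchange pattern described above (payload data passes through storage units, while chips hand each other only control instructions) can be sketched in software. This is a toy model under stated assumptions, not the hardware behavior: dictionaries stand in for storage units 20 to 23, the function names and the four processing steps are hypothetical, and the returned key stands in for a control instruction.

```python
# Four dicts stand in for storage units 20-23 (a hypothetical model, not the hardware).
storage_units = [{} for _ in range(4)]


def write_result(unit_index, key, value):
    """Store a value in one storage unit (stands in for a serdes write)."""
    storage_units[unit_index][key] = value


def read_data(key):
    """Look the key up in every storage unit (stands in for serdes reads)."""
    for unit in storage_units:
        if key in unit:
            return unit[key]
    raise KeyError(key)


def run_chip(chip_index, step, in_key):
    """One chip: read its input from storage, compute, store the result.

    The returned key stands in for the control instruction that tells the
    next chip to start; no payload data passes between chips directly.
    """
    out_key = f"chip{chip_index}_out"
    write_result(chip_index, out_key, step(read_data(in_key)))
    return out_key


# Hypothetical per-chip processing steps, one per operation chip.
steps = [
    lambda xs: [x + 1 for x in xs],
    lambda xs: [x * 2 for x in xs],
    lambda xs: [x - 3 for x in xs],
    lambda xs: [max(x, 2) for x in xs],
]


def run_pipeline(host_data):
    write_result(0, "input", host_data)   # host data is placed in a storage unit
    key = "input"
    for chip_index, step in enumerate(steps):
        key = run_chip(chip_index, step, key)
    return read_data(key)                 # the final result is read back from storage
```

Calling `run_pipeline([1, 2, 3])` passes the list through the four steps in turn, with every intermediate result living in a storage unit rather than being handed from chip to chip.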
Each storage unit is provided with a dedicated storage area and a shared storage area. The dedicated storage area stores the temporary operation results of one operation chip, that is, intermediate results that this chip will continue to use and that other operation chips will not use. The shared storage area stores operation results that will be used by other operation chips or that need to be fed back to the outside. For ease of management, the storage unit may also be left unpartitioned. The storage unit may be a high-speed external memory such as DDR, SDDR, DDR2, DDR3, DDR4, GDDR5, GDDR6, HMC, or HBM. DDR-series memory is preferred here. DDR (Double Data Rate) memory is double-data-rate synchronous dynamic random access memory. DDR uses a synchronous circuit so that the main steps of address specification, data transfer, and data output are executed independently while remaining fully synchronized with the CPU. DDR also uses DLL (Delay Locked Loop) technology, which provides a data strobe signal; when the data is valid, the memory controller can use this signal to locate the data precisely, output it once every 16 times, and resynchronize data from different memory modules. The frequency of DDR memory can be expressed as an operating frequency or as an equivalent frequency. The operating frequency is the actual operating frequency of the memory chips, but because DDR memory transfers data on both the rising and falling edges of the clock pulse, the equivalent data transfer frequency is twice the operating frequency. DDR2 (Double Data Rate 2) memory is a newer-generation memory technology standard developed by JEDEC (Joint Electron Device Engineering Council); DDR2 memory can read/write data at 4 times the speed of the external bus per clock, and can run at 4 times the speed of the internal control bus. DDR3, DDR4, GDDR5, GDDR6, HMC, and HBM memories are all existing technologies and are not described in detail here.
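The split between dedicated and shared storage areas described above can be illustrated with a small model. This is a sketch only: the class name `StorageUnit`, its methods, and the use of chip reference numerals as identifiers are assumptions made for illustration, and the document does not say how the partition is enforced.

```python
class StorageUnit:
    """Toy model of one storage unit with dedicated and shared areas."""

    def __init__(self):
        self.dedicated = {}  # chip_id -> {key: value}, visible to that chip only
        self.shared = {}     # key -> value, visible to every chip

    def write(self, chip_id, key, value, shared=False):
        """Store a result; intermediate results go to the writer's dedicated area."""
        if shared:
            self.shared[key] = value
        else:
            self.dedicated.setdefault(chip_id, {})[key] = value

    def read(self, chip_id, key):
        """Read from the caller's dedicated area first, then from the shared area."""
        own = self.dedicated.get(chip_id, {})
        if key in own:
            return own[key]
        if key in self.shared:
            return self.shared[key]
        raise KeyError(f"{key!r} is not visible to chip {chip_id}")
```

In this model a chip that writes a temporary result with `shared=False` is the only one able to read it back, while results written with `shared=True` are readable by every chip.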
The 4 ASIC operation chips are each connected to the 4 storage units through a bus, and the operation chips exchange data through the storage units. This not only reduces the number of storage units but also reduces the connection lines between ASIC operation chips and simplifies the system structure. Because each ASIC operation chip is connected to multiple storage units separately, no conflicts arise from the use of a bus, and no cache needs to be provided for each ASIC operation chip.
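The full chip-to-storage connectivity described above can be enumerated as a list of point-to-point links. This is a minimal sketch: the function name `wire` and the pairing rule (chip c uses its u-th interface to reach unit u, and unit u uses its c-th interface to reach chip c) are assumptions, since the document only states that every chip is connected to every storage unit.

```python
def wire(num_chips, num_units):
    """Return links as ((chip, chip_port), (unit, unit_port)) pairs.

    Assumed pairing: chip c reaches unit u through its u-th interface,
    and unit u reaches chip c through its c-th interface.
    """
    links = []
    for chip in range(num_chips):
        for unit in range(num_units):
            links.append(((chip, unit), (unit, chip)))
    return links


def ports_on_chip(links, chip):
    """Set of interface indices that the given chip uses."""
    return {port for (c, port), _ in links if c == chip}


def ports_on_unit(links, unit):
    """Set of interface indices that the given storage unit uses."""
    return {port for _, (u, port) in links if u == unit}
```

With 4 chips and 5 storage units this yields 20 links, 5 interfaces in use on each chip and 4 on each storage unit, which matches the interface counts the document gives for unequal numbers of chips and storage units.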
FIG. 2a is a schematic structural diagram of a first embodiment of an operation chip with 4 cores. Those skilled in the art will understand that the choice of 4 cores is only illustrative; the number of cores in the operation chip may be Q, where Q is a positive integer greater than or equal to 2, for example 6, 10, or 12. In this embodiment, the cores of the operation chip may have the same function or different functions.
The 4-core operation chip (10) includes 4 cores (110, 111, 112, 113), a routing unit (120), a data exchange control unit (130), and 4 serdes interfaces (150, 151, 152, 153). The data exchange control unit and the 4 serdes interfaces are each connected to the routing unit through a bus, and the routing unit is connected to each core. The data exchange control unit can be implemented with various protocols, such as UART, SPI, PCIE, SERDES, or USB; in this embodiment it is a UART (Universal Asynchronous Receiver/Transmitter) control unit (130). A UART is an asynchronous transceiver that converts the data to be transmitted between serial and parallel form, and it is usually integrated into the links of various communication interfaces. The UART protocol is used here only as an example; other protocols may also be used. The UART control unit (130) can receive external data or control instructions, send control instructions to other chips, receive control instructions from other chips, and feed operation results or intermediate data back to the outside.
Serdes is short for SERializer/DESerializer. It is a mainstream time-division multiplexing (TDM), point-to-point (P2P) serial communication technology: at the transmitting end, multiple low-speed parallel signals are converted into a high-speed serial signal, which passes through the transmission medium (optical cable or copper wire) and is converted back into low-speed parallel signals at the receiving end. This point-to-point serial communication technology makes full use of the channel capacity of the transmission medium, reduces the number of transmission channels and device pins required, and increases the signal transmission speed, thereby greatly reducing communication cost. Other communication interfaces, such as SSI or UART, may also be used in place of the serdes interface. Data and control instructions are transmitted between the chip and the storage unit through the serdes interface and the transmission line.
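The parallel-to-serial and serial-to-parallel conversion described above can be shown at the bit level. This is a simplified sketch: the function names, the 8-bit word width, and the most-significant-bit-first order are assumptions, and a real serdes link also involves line coding and clock recovery, which are omitted here.

```python
def serialize(words, width=8):
    """Flatten parallel words into one serial bit stream, MSB first (assumed order)."""
    bits = []
    for word in words:
        for i in reversed(range(width)):
            bits.append((word >> i) & 1)
    return bits


def deserialize(bits, width=8):
    """Regroup a serial bit stream into parallel words of the given width."""
    if len(bits) % width != 0:
        raise ValueError("bit stream length is not a multiple of the word width")
    words = []
    for start in range(0, len(bits), width):
        word = 0
        for bit in bits[start:start + width]:
            word = (word << 1) | bit
        words.append(word)
    return words
```

A round trip through `serialize` and `deserialize` returns the original words, which is the property the transmitting and receiving ends rely on.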
内核core的主要功能是执行外部或者内部控制指令、执行数据计算以及数据的存储控制等功能。The core core's main functions are to execute external or internal control instructions, perform data calculation, and data storage control.
The routing unit is used to send data or control instructions to the cores (110, 111, 112, 113) and to receive data or control instructions sent by the cores (110, 111, 112, 113), thereby implementing communication between the cores. On receiving an internal or external control instruction, it writes data to the storage unit, reads data from the storage unit, or sends a control instruction to the storage unit through the serdes interface. If the internal or external control instruction is intended to control another chip, the routing unit sends the control instruction to the UART control unit (130), which sends it to the other chip. When data needs to be sent to another chip, the routing unit transmits the data to the storage unit through the serdes interface, and the other chip obtains the data from the storage unit; when data needs to be received from another chip, the routing unit obtains the data from the storage unit through the serdes interface. The routing unit also receives external control instructions through the UART control unit (130) and sends them to the individual cores (110, 111, 112, 113); it receives external data through the UART control unit (130) and, according to the external data address, sends the external data to a core (110, 111, 112, 113) or to the storage unit. Internal data or internal control instructions are data or control instructions generated by the chip itself; external data or external control instructions are data or control instructions generated outside the chip, for example those sent by an external host or an external network.
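The routing rules of the first embodiment can be summarized in a short sketch. The `Message` type, its field names, and the string labels for destinations are hypothetical and chosen only for illustration; the reference numerals 110 to 113 and 130 come from the description.

```python
from dataclasses import dataclass

LOCAL_CORES = {110, 111, 112, 113}  # cores of this chip
UART_UNIT = 130                     # UART control unit


@dataclass(frozen=True)
class Message:
    kind: str      # "data" or "control" (assumed labels)
    dst_type: str  # "core", "storage", or "chip" (assumed labels)
    dst_id: int


def route_first_embodiment(msg):
    """Return (port, target) for a message arriving at the routing unit."""
    if msg.dst_type == "core" and msg.dst_id in LOCAL_CORES:
        return ("core", msg.dst_id)
    if msg.dst_type == "storage":
        # Reads, writes and storage control all go through a serdes interface.
        return ("serdes", msg.dst_id)
    if msg.dst_type == "chip":
        if msg.kind == "control":
            # Control instructions for another chip leave through the UART unit.
            return ("uart", UART_UNIT)
        # Data for another chip is written to a storage unit, where the other
        # chip picks it up; which storage unit is left open in this sketch.
        return ("serdes", None)
    raise ValueError(f"unroutable message: {msg}")
```

The point of the sketch is the asymmetry: control instructions between chips travel over UART, while data between chips always passes through a storage unit.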
FIG. 2b is a schematic diagram of the signal flow of the arithmetic chip with four cores according to the first embodiment. The UART interface (130) is used to obtain data or control instructions from outside the chip. The routing unit (120) sends the data or control instructions to a core according to their address, or sends them through a serdes interface to the storage unit connected to that serdes interface. If the destination address of an external control instruction points to another chip, the routing unit sends the control instruction to the UART control unit (130), which sends it to the other chip. The UART interface (130) sends operation results to the outside according to an external or internal control instruction; the operation results may be obtained from a core of the arithmetic chip, or through a serdes interface from the storage unit connected to that serdes interface. The outside referred to here may be an external host, an external network, an external platform, or the like. The external host can initialize and configure the storage unit parameters through the UART control unit and uniformly address the multiple storage particles.
A core can send the routing unit a control instruction to fetch or write data; the control instruction carries the data address, and the routing unit reads data from or writes data to the storage unit through the serdes interface according to that address. A core can also send data or control instructions to other cores through the routing unit according to the address, and obtain data or control instructions from other cores through the routing unit. The core performs calculations on the data obtained and stores the calculation results in the storage unit. Each storage unit is provided with a dedicated storage area and a shared storage area. The dedicated storage area stores the temporary operation results of one arithmetic chip, that is, intermediate results that this chip will continue to use and that other arithmetic chips will not use. The shared storage area stores data operation results of an arithmetic chip that are used by other arithmetic chips or that need to be fed back to the outside. If a control instruction generated by a core is intended to control the operation of another chip, the routing unit sends it to the UART control unit (130), which sends it to the other chip. If a control instruction generated by a core is intended to control the storage unit, the routing unit sends it to the storage unit through the serdes interface.
FIG. 3a is a schematic structural diagram of a second embodiment of the arithmetic chip with four cores. As shown in FIG. 3a, the four-core arithmetic chip includes four cores (110, 111, 112, 113), one routing unit (120), one UART control unit (130), and four serdes interfaces (150, 151, 152, 153). Each serdes interface is connected to one core, the four cores are connected to the routing unit, and the UART control unit (130) is connected to the core (110).
FIG. 3b is a schematic diagram of the signal flow of the arithmetic chip with four cores according to the second embodiment. The UART control unit (130) is used to obtain data or control instructions from outside the chip and to transmit them to the core (110) connected to the UART control unit. The core (110) transmits the external data or control instructions to the routing unit (120), and the routing unit sends them, according to their address, to the corresponding core (111, 112, 113). If the destination address of the data or control instruction is a core of this arithmetic chip, the routing unit sends the data or control instruction to that core (110, 111, 112, 113). If the destination address is a storage unit, the core (111, 112, 113) then sends the data or control instruction to the corresponding storage unit through its serdes interface (151, 152, 153). The core (110) can also send data or control instructions directly to the corresponding storage unit through the serdes interface (150) connected to it. In this case, the routing unit stores which serdes interface corresponds to each storage unit address. If the destination address of the data or control instruction is another arithmetic chip, the data is sent by a core (111, 112, 113) through its serdes interface (151, 152, 153) to the corresponding storage unit, while the control instruction is sent to the other arithmetic chip through the UART control unit. When a core feeds an operation result or intermediate data back to the outside according to an external or internal control instruction, the core obtains the operation result or intermediate data from the storage unit through its serdes interface and sends it to the routing unit; the routing unit sends it to the core (110) connected to the UART control unit, and it is finally sent to the outside through the UART control unit. If the operation result or intermediate data is obtained through the serdes interface of the core connected to the UART control unit, it is sent to the outside directly through the UART control unit. The outside referred to here may be an external host, an external network, an external platform, or the like. The external host can initialize and configure the storage unit parameters through the UART control unit and uniformly address the multiple storage units.
A core can send a control instruction to the routing unit, and the routing unit sends the control instruction to another core, another chip, or a storage unit according to the address of the control instruction; the other core, other chip, or storage unit performs the corresponding operation after receiving it. When a core sends a control instruction or data to another core, the routing unit forwards it directly. A core sends control instructions to other chips through the UART control unit. When a core sends a control instruction to a storage unit, the routing unit looks up the serdes interface corresponding to the address and sends the control instruction to the core corresponding to that serdes interface; that core passes it to the serdes interface, and the serdes interface sends the control instruction to the storage unit. When a core sends data to another chip or to a storage unit, the routing unit looks up the serdes interface corresponding to the address and sends the control instruction to the core corresponding to that serdes interface; that core passes it to the serdes interface, and the serdes interface sends the data to the storage unit. The other chip then obtains the data through the storage unit. When a core fetches data from a storage unit, the read control instruction carries the data address; the routing unit looks up the serdes interface corresponding to the address and sends the control instruction to the core corresponding to that serdes interface, that core passes it to the serdes interface, and the serdes interface sends the read control instruction, which carries a destination address and a source address, to the storage unit. After the serdes interface obtains the data from the storage unit, it sends the data to its corresponding core; that core sends a data packet containing the source address and destination address to the routing unit, and the routing unit sends the data packet to the corresponding core according to the destination address. If a core finds that the destination address is its own address, it takes the data for processing. A core can also send data or commands to other cores through the routing unit and obtain data or commands from other cores through the routing unit. The core performs calculations on the data obtained and stores the calculation results in the storage unit. Each storage unit is provided with a dedicated storage area and a shared storage area. The dedicated storage area stores the temporary operation results of one arithmetic chip, that is, intermediate results that this chip will continue to use and that other arithmetic chips will not use. The shared storage area stores data operation results of an arithmetic chip that are used by other arithmetic chips or that need to be fed back to the outside.
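In the second embodiment the routing unit keeps a lookup from storage unit address to serdes interface, and each serdes interface belongs to one core. The hop sequence this implies can be sketched as follows. The pairing of cores with serdes interfaces (110 with 150, and so on) follows the description; the pairing of storage units 20 to 23 with serdes interfaces 150 to 153 in numerical order is an assumption, as is the function name.

```python
ROUTING_UNIT = 120

# Assumed: storage units map to serdes interfaces in numerical order.
SERDES_OF_STORAGE = {20: 150, 21: 151, 22: 152, 23: 153}
# From the description: each serdes interface is attached to one core.
CORE_OF_SERDES = {150: 110, 151: 111, 152: 112, 153: 113}


def path_to_storage(src_core, storage_unit):
    """List the hops a request takes from a core to a storage unit."""
    serdes = SERDES_OF_STORAGE[storage_unit]
    owner = CORE_OF_SERDES[serdes]
    hops = [src_core]
    if owner != src_core:
        # The request crosses the routing unit to reach the owning core.
        hops += [ROUTING_UNIT, owner]
    hops += [serdes, storage_unit]
    return hops
```

For example, core 110 reaches storage unit 20 directly through its own serdes interface 150, while a request to storage unit 22 is forwarded through the routing unit to core 112 first.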
FIG. 4a is a schematic structural diagram of a first embodiment of the storage unit corresponding to the arithmetic chip with four cores. The storage unit (20) includes C memories; C = 4 is taken as an example here, although C may be any positive integer greater than or equal to 2, for example 6, 10, or 12. Each memory (240, 241, 242, 243) includes a storage controller (220, 221, 222, 223) and a storage particle (210, 211, 212, 213); the storage controller writes data to or reads data from the storage particle according to instructions, and the storage particle stores the data. The storage unit (20) further includes one routing unit (230) and four serdes interfaces (250, 251, 252, 253). The four serdes interfaces are each connected to the routing unit through a bus, and the routing unit is in turn connected to each memory.
FIG. 4b is a schematic diagram of the signal flow of the first embodiment of the storage unit corresponding to the arithmetic chip with four cores. The storage unit (20) receives control instructions through the serdes interfaces (250, 251, 252, 253) and sends them to the routing unit (230); the routing unit sends each control instruction to the corresponding memory (240, 241, 242, 243) according to the address in the instruction, and the storage controller (220, 221, 222, 223) performs the relevant operation according to the control instruction. Examples include uniformly addressing the multiple storage particles according to the initial memory configuration parameters, resetting the storage particles according to a reset instruction, and executing write or read instructions. A data fetch instruction sent by an arithmetic chip is received through a serdes interface (250, 251, 252, 253); the instruction carries the address of the data to be fetched, the routing unit sends the fetch instruction to the memory according to the address, the storage controller fetches the data from the storage particle according to the instruction, and the data is sent through the serdes interface, according to the source address, to the arithmetic chip that requested it. A write instruction and data sent by an arithmetic chip are received through a serdes interface (250, 251, 252, 253); the instruction carries the address to be written, the routing unit sends the write instruction and data to the memory according to the address, and the storage controller writes the data to the storage particle according to the write instruction. The write instruction and the data may be transmitted synchronously or asynchronously. Each storage unit is provided with a dedicated storage area and a shared storage area. The dedicated storage area stores the temporary operation results of one arithmetic chip, that is, intermediate results that this chip will continue to use and that other arithmetic chips will not use. The shared storage area stores data operation results of an arithmetic chip that are used by other arithmetic chips or that need to be fed back to the outside.
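The way the storage unit's routing unit selects a memory from a uniformly addressed space can be sketched as below. The description does not specify the addressing scheme; the linear layout (memory index = address divided by memory size), the class name, and the default sizes are assumptions for illustration only.

```python
class StorageUnit:
    """Toy model of a storage unit with several memories behind one routing unit."""

    def __init__(self, num_memories=4, words_per_memory=1024):
        self.words_per_memory = words_per_memory
        # One list per memory stands in for a storage controller plus its particle.
        self.memories = [[0] * words_per_memory for _ in range(num_memories)]

    def _locate(self, address):
        """Map a unified address to (memory index, offset within that memory)."""
        index, offset = divmod(address, self.words_per_memory)
        if not 0 <= index < len(self.memories):
            raise IndexError(f"address {address} is outside the storage unit")
        return index, offset

    def write(self, address, value):
        index, offset = self._locate(address)
        self.memories[index][offset] = value

    def read(self, address):
        index, offset = self._locate(address)
        return self.memories[index][offset]
```

With these defaults, address 1025 lands in the second memory at offset 1, so a chip sees one contiguous address range even though the data is spread over four memories.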
FIG. 5 is a schematic diagram of the connection structure of a big data operation acceleration system with four arithmetic chips and four storage units. In FIG. 5 the system has four arithmetic chips (10, 11, 12, 13) and four storage units (20, 21, 22, 23). The arithmetic chip may have the chip structure disclosed in the first or second embodiment; it may also have an equivalently modified chip structure made by those skilled in the art on the basis of the first and second embodiments, and such equivalently modified chip structures also fall within the scope of protection of this embodiment. The storage unit may have the storage unit structure disclosed in the third embodiment; it may also have an equivalently modified storage unit structure made by those skilled in the art on the basis of the third embodiment, and such equivalently modified storage unit structures also fall within the scope of protection of this embodiment. In the big data operation acceleration system, the UART control unit (130) of the arithmetic chip (10) is connected to an external host, and the UART control units (130) of the chips (10, 11, 12, 13) are connected to one another through a bus. Each serdes interface (150, 151, 152, 153) of a chip (10, 11, 12, 13) is connected to a serdes interface (250, 251, 252, 253) of one storage unit (20, 21, 22, 23), so that each arithmetic chip is connected to all storage units through a bus. The arithmetic chips exchange data through the storage units and do not exchange data with one another directly. The internal and external signal flows of the arithmetic chip and the storage unit have been described in detail in the first, second, and third embodiments and are not described again here.
When the system is applied in the field of artificial intelligence, the UART control unit (130) of the arithmetic chip (10) stores picture data or video data sent by the external host in the storage units (20, 21, 22, 23) through the serdes interfaces (150, 151, 152, 153). The arithmetic chips (10, 11, 12, 13) generate the mathematical model of the neural network; the mathematical model may also be stored by the external host in the storage units (20, 21, 22, 23) through the serdes interfaces (150, 151, 152, 153) and read by the individual arithmetic chips (10, 11, 12, 13). The first layer of the neural network model runs on the arithmetic chip (10): the arithmetic chip (10) reads data from the storage units (20, 21, 22, 23) through the serdes interfaces, performs the operation, and stores the operation result through a serdes interface in at least one of the storage units (20, 21, 22, 23). The arithmetic chip (10) sends a control instruction to the arithmetic chip (11) through the UART control unit (130) to start the arithmetic chip (11). The second layer of the neural network model runs on the arithmetic chip (11): the arithmetic chip (11) reads data from the storage units (20, 21, 22, 23) through the serdes interfaces, performs the operation, and stores the operation result through a serdes interface in at least one of the storage units (20, 21, 22, 23). Each chip executes one layer of the neural network, obtaining its data from the storage units (20, 21, 22, 23) through the serdes interfaces, until the last layer of the neural network produces the final result. The arithmetic chip (10) obtains the operation result from the storage units (20, 21, 22, 23) through the serdes interfaces and feeds it back to the external host through the UART control unit (130).
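The layer-per-chip flow above, in which each chip reads its input from shared storage and writes its output back for the next chip, can be sketched as follows. The dictionary standing in for the storage units, the key names, and the use of plain Python callables as layers are all assumptions made for illustration.

```python
def run_layer_pipeline(layers, input_data):
    """Run one layer per chip, passing results only through shared storage.

    `layers` is a list of callables, one per arithmetic chip (assumed model).
    """
    storage = {"input": input_data}  # stands in for the storage units
    key = "input"
    for chip_index, layer in enumerate(layers):
        data = storage[key]              # chip reads via its serdes interfaces
        key = f"layer{chip_index}"
        storage[key] = layer(data)       # chip writes its result back
        # In the described system, a UART control instruction would now
        # start the next chip; the loop iteration plays that role here.
    return storage[key]                  # first chip returns this to the host
```

The chips never hand data to one another directly in this sketch, which matches the statement that data exchange happens only through the storage units.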
When the system is applied in the field of encrypted digital currency, the UART control unit (130) of the arithmetic chip (10) stores the block information sent by the external host in at least one of the storage units (20, 21, 22, 23). The external host sends control instructions for data operation to the four arithmetic chips (10, 11, 12, 13) through their UART control units (130), and the four arithmetic chips (10, 11, 12, 13) start operating. Alternatively, the external host may send a control instruction for data operation to the UART control unit (130) of one arithmetic chip (10), and the arithmetic chip (10) then sends control instructions for data operation to the other three arithmetic chips (11, 12, 13) in turn, whereupon the four arithmetic chips (10, 11, 12, 13) start operating. As a further alternative, the external host may send a control instruction for data operation to the UART control unit (130) of one arithmetic chip (10); the first arithmetic chip (10) sends a control instruction to the second arithmetic chip (11), the second arithmetic chip (11) sends a control instruction to the third arithmetic chip (12), and the third arithmetic chip (12) sends a control instruction to the fourth arithmetic chip (13), whereupon the four arithmetic chips (10, 11, 12, 13) start operating. The four arithmetic chips (10, 11, 12, 13) read the block information data from the storage units through the serdes interfaces and perform the proof-of-work operation simultaneously. The arithmetic chip (10) obtains the operation result from the storage units (20, 21, 22, 23) and feeds it back to the external host through the UART control unit (130).
In the above embodiments the number of arithmetic chips equals the number of storage units; in this case, the number of second data interfaces of each storage unit and the number of second data interfaces of each arithmetic chip are both equal to the number of storage units.
However, those skilled in the art will appreciate that the number of arithmetic chips and the number of storage units may also differ. In that case, the number of second data interfaces of each storage unit equals the number of arithmetic chips, and the number of second data interfaces of each arithmetic chip equals the number of storage units. For example, with four arithmetic chips and five storage units, five second data interfaces are provided on each arithmetic chip and four second data interfaces are provided on each storage unit.
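The interface counts follow directly from every arithmetic chip being connected to every storage unit. A minimal sketch of the arithmetic, with a hypothetical function name and dictionary keys:

```python
def interface_counts(num_chips, num_storage_units):
    """Data interfaces needed when every chip links to every storage unit."""
    return {
        "per_chip": num_storage_units,         # one interface per storage unit
        "per_storage_unit": num_chips,         # one interface per chip
        "links": num_chips * num_storage_units,
    }
```

For the example in the text, four chips and five storage units give five interfaces on each chip, four on each storage unit, and twenty chip-to-storage links in total.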
The bus may use a centralized arbitration bus structure or a ring topology bus structure. Bus technology is commonly used in the field and is therefore not described in detail here.
FIG. 6 is a schematic diagram of the data structure according to the present invention. The data referred to here includes command data, numeric data, character data, and other kinds of data. The data format consists of a valid bit (valid), a destination address (dst id), a source address (src id), and the data itself (data). A core can use the valid bit to determine whether a data packet is a command or a value; here it can be assumed that 0 represents a value and 1 represents a command. The core determines the destination address, source address, and data type from the data structure. In terms of instruction timing, this embodiment adopts a conventional six-stage pipeline: instruction fetch, decode, execute, memory access, alignment, and write-back. In terms of instruction set architecture, a reduced instruction set architecture can be adopted. Following the general design approach of reduced instruction set architectures, the instruction set of the present invention can be divided by function into register-register instructions, register-immediate instructions, jump instructions, memory access instructions, control instructions, and inter-core communication instructions.
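The packet format can be illustrated with a small pack/unpack sketch. The description names the fields (valid, dst id, src id, data) but not their widths; the 1/8/8/32-bit layout, the field order within the word, and the function names below are assumptions made for illustration.

```python
ID_BITS = 8     # assumed width of dst id and src id
DATA_BITS = 32  # assumed width of the data field


def pack_packet(valid, dst_id, src_id, data):
    """Pack the four fields into one integer: valid | dst id | src id | data."""
    if valid not in (0, 1):
        raise ValueError("valid must be 0 (value) or 1 (command)")
    if not 0 <= dst_id < (1 << ID_BITS) or not 0 <= src_id < (1 << ID_BITS):
        raise ValueError("address out of range")
    if not 0 <= data < (1 << DATA_BITS):
        raise ValueError("data out of range")
    return (
        (valid << (2 * ID_BITS + DATA_BITS))
        | (dst_id << (ID_BITS + DATA_BITS))
        | (src_id << DATA_BITS)
        | data
    )


def unpack_packet(word):
    """Recover (valid, dst_id, src_id, data) from a packed integer."""
    data = word & ((1 << DATA_BITS) - 1)
    src_id = (word >> DATA_BITS) & ((1 << ID_BITS) - 1)
    dst_id = (word >> (ID_BITS + DATA_BITS)) & ((1 << ID_BITS) - 1)
    valid = (word >> (2 * ID_BITS + DATA_BITS)) & 1
    return valid, dst_id, src_id, data


def is_command(word):
    """Per the description: valid bit 1 means command, 0 means value."""
    return unpack_packet(word)[0] == 1
```

A core receiving such a packet would compare the unpacked destination address with its own address and then use `is_command` to decide whether to execute or to consume the payload.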
Using the description provided herein, the embodiments can be implemented as a machine, process, or article of manufacture by using standard programming and/or engineering techniques to produce programming software, firmware, hardware, or any combination thereof.
Any resulting program or programs (having computer-readable program code) can be embodied on one or more computer-usable media, such as resident storage devices, smart cards or other removable storage devices, or transmitting devices, thereby making a computer program product or article of manufacture according to the embodiments. As such, the terms "article of manufacture" and "computer program product" as used herein are intended to cover a computer program that exists permanently or temporarily on any non-transitory computer-usable medium.
As noted above, memory/storage devices include, but are not limited to, magnetic disks, optical disks, removable storage devices (such as smart cards, subscriber identity modules (SIM), and wireless identification modules (WIM)), and semiconductor memories (such as random access memory (RAM), read-only memory (ROM), and programmable read-only memory (PROM)). Transmission media include, but are not limited to, transmission via wireless communication networks, the Internet, intranets, telephone/modem-based network communication, hard-wired/cable communication networks, satellite communication, and other fixed or mobile network systems/communication links.
Although specific example embodiments have been disclosed, those skilled in the art will understand that changes can be made to the specific example embodiments without departing from the spirit and scope of the invention.
The present invention has been described above on the basis of the embodiments with reference to the drawings. However, the present invention is not limited to the above embodiments; schemes in which parts of the embodiments and modified examples are appropriately combined or replaced according to layout requirements and the like are also included within the scope of the present invention. In addition, the combinations and processing order of the embodiments may be appropriately rearranged on the basis of the knowledge of those skilled in the art, and modifications such as various design changes may be applied to the embodiments; embodiments to which such modifications have been applied may also fall within the scope of the present invention.

Claims (17)

  1. A big data operation acceleration system, characterized by comprising two or more arithmetic chips and two or more storage units, wherein:
    the arithmetic chip comprises at least one first data interface (130), two or more second data interfaces (150, 151, 152, 153), at least two cores (110, 111, 112, 113), and a routing unit (120); the at least one first data interface (130) and the two or more second data interfaces (150, 151, 152, 153) are each connected to the routing unit, and the routing unit is connected to the at least two cores (110, 111, 112, 113);
    the storage unit comprises two or more third data interfaces (250, 251, 252, 253); the storage unit (20) comprises two or more memories, a routing unit (230), and the two or more third data interfaces (250, 251, 252, 253); the two or more third data interfaces (250, 251, 252, 253) are each connected to the routing unit through a bus, and the routing unit is in turn connected to the two or more memories;
    the second data interfaces (150, 151, 152, 153) of the arithmetic chip are connected to the third data interfaces (250, 251, 252, 253) of the storage unit through a bus.
  2. A big data operation acceleration system, characterized by comprising two or more arithmetic chips and two or more storage units, wherein:
    the arithmetic chip comprises at least one first data interface (130), two or more second data interfaces (150, 151, 152, 153), at least two cores (110, 111, 112, 113), and a routing unit (120); each second data interface is connected to one core, the at least two cores are connected to the routing unit, and the at least one first data interface (130) is connected to one core (110);
    the storage unit comprises two or more third data interfaces (250, 251, 252, 253); the storage unit (20) comprises two or more memories, a routing unit (230), and the two or more third data interfaces (250, 251, 252, 253); the two or more third data interfaces (250, 251, 252, 253) are each connected to the routing unit through a bus, and the routing unit is in turn connected to the two or more memories;
    the second data interfaces (150, 151, 152, 153) of the arithmetic chip are connected to the third data interfaces (250, 251, 252, 253) of the storage unit through a bus.
  3. The system according to claim 1 or 2, characterized in that the second data interfaces and the third data interfaces are SerDes interfaces, and the first data interface is a UART interface of a UART control unit.
  4. The system according to claim 1 or 2, characterized in that the number of operation chips equals the number of storage units, and the number of third data interfaces of the storage unit and the number of second data interfaces of the operation chip both equal the number of storage units.
  5. The system according to claim 1 or 2, characterized in that the routing unit sends control instructions to an external chip through the at least one first data interface (130).
  6. The system according to claim 1 or 2, characterized in that data is sent or received between chips through the second data interfaces and the storage units.
  7. The system according to claim 1 or 2, characterized in that the routing unit receives external data or control instructions through the at least one first data interface (130), and sends the received external data or control instructions to a core or a storage unit.
  8. The system according to claim 1 or 2, characterized in that the memory (240, 241, 242, 243) comprises a memory controller (220, 221, 222, 223) and memory devices (210, 211, 212, 213), wherein the memory controller is configured to write data to or read data from the memory devices according to instructions, and the memory devices are configured to store data.
  9. The system according to claim 8, characterized in that the routing unit of the storage unit receives control instructions through the two or more third data interfaces (250, 251, 252, 253), and sends each control instruction to the corresponding memory (240, 241, 242, 243) according to the address in the control instruction.
  10. The system according to claim 8, characterized in that the routing unit of the storage unit sends the acquired data to the operation chip through the two or more third data interfaces (250, 251, 252, 253).
  11. The system according to claim 8, characterized in that a dedicated storage area and a shared storage area are provided in the storage unit.
  12. The system according to claim 8, characterized in that the memory devices are HMC memory.
  13. The system according to claim 1 or 2, characterized in that the two or more operation chips can perform one or more of encryption operations and convolution calculations.
  14. The system according to claim 1 or 2, characterized in that the two or more operation chips each perform independent operations, and each computing unit computes its result separately.
  15. The system according to claim 1 or 2, characterized in that the two or more operation chips can perform cooperative operations, and each computing unit performs its operation based on the computation results of the other two or more operation chips.
  16. A big data operation chip, characterized in that the operation chip comprises at least one first data interface (130), two or more second data interfaces (150, 151, 152, 153), at least two cores (110, 111, 112, 113), and a routing unit (120); the at least one first data interface (130) and the two or more second data interfaces (150, 151, 152, 153) are each connected to the routing unit, and the routing unit is connected to the at least two cores (110, 111, 112, 113); the second data interfaces and the third data interfaces are SerDes interfaces;
    the second data interfaces (150, 151, 152, 153) of the operation chip are connected to a storage unit via a bus.
  17. A big data operation chip, characterized in that the operation chip comprises at least one first data interface (130), two or more second data interfaces (150, 151, 152, 153), at least two cores (110, 111, 112, 113), and a routing unit (120); each second data interface is connected to one core, the at least two cores are connected to the routing unit, and the at least one first data interface (130) is connected to one core (110); the second data interfaces and the third data interfaces are SerDes interfaces;
    the second data interfaces (150, 151, 152, 153) of the operation chip are connected to a storage unit via a bus.
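
As a reading aid only, the address-based dispatch recited in claims 8 and 9 can be sketched in Python. The claims do not specify an address map, an instruction format, or any software interface, so the contiguous block layout, the `memory_size` parameter, and the names `Memory`, `StorageUnit`, `route`, and `handle` below are all hypothetical assumptions rather than the claimed implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Memory:
    """One memory (240-243): a memory controller in front of memory devices."""
    cells: dict = field(default_factory=dict)

    def execute(self, op, offset, value=None):
        # The controller writes to or reads from the backing devices.
        if op == "write":
            self.cells[offset] = value
            return None
        if op == "read":
            return self.cells.get(offset)
        raise ValueError(f"unknown op: {op}")


class StorageUnit:
    """Storage unit (20): third data interfaces -> routing unit -> memories."""

    def __init__(self, num_memories=4, memory_size=1024):
        # memory_size (addresses per memory) is a hypothetical parameter.
        self.memory_size = memory_size
        self.memories = [Memory() for _ in range(num_memories)]

    def route(self, address):
        # Assumed address map: one contiguous block of addresses per memory.
        index, offset = divmod(address, self.memory_size)
        if not 0 <= index < len(self.memories):
            raise IndexError(f"address {address} is outside the storage unit")
        return index, offset

    def handle(self, interface_id, op, address, value=None):
        # interface_id names the third data interface the instruction arrived
        # on; in this sketch any interface can reach any memory.
        index, offset = self.route(address)
        return self.memories[index].execute(op, offset, value)


if __name__ == "__main__":
    unit = StorageUnit()
    unit.handle(0, "write", 1030, "block")   # arrives on one interface
    print(unit.handle(3, "read", 1030))      # read back through another
```

In this sketch the routing decision depends only on the address carried by the instruction, which is why data written through one interface is visible through another; that is the property the shared storage unit relies on for passing data between chips.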
PCT/CN2018/112688 2018-10-30 2018-10-30 Big data operation acceleration system and chip WO2020087276A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201880002364.XA CN109564562B (en) 2018-10-30 2018-10-30 Big data operation acceleration system and chip
PCT/CN2018/112688 WO2020087276A1 (en) 2018-10-30 2018-10-30 Big data operation acceleration system and chip

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/112688 WO2020087276A1 (en) 2018-10-30 2018-10-30 Big data operation acceleration system and chip

Publications (1)

Publication Number Publication Date
WO2020087276A1 true WO2020087276A1 (en) 2020-05-07

Family

ID=65872661

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/112688 WO2020087276A1 (en) 2018-10-30 2018-10-30 Big data operation acceleration system and chip

Country Status (2)

Country Link
CN (1) CN109564562B (en)
WO (1) WO2020087276A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112214448A (en) * 2020-10-10 2021-01-12 中科声龙科技发展(北京)有限公司 Data dynamic reconstruction circuit and method of heterogeneous integrated workload proving operation chip
CN114691591A (en) * 2020-12-31 2022-07-01 中科寒武纪科技股份有限公司 Circuit, method and system for inter-chip communication

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114003552B (en) * 2021-12-30 2022-03-29 中科声龙科技发展(北京)有限公司 Workload proving operation method, workload proving chip and upper computer

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102314377A (en) * 2010-06-30 2012-01-11 国际商业机器公司 Accelerator and method thereof for supporting virtual machine migration
CN103634945A (en) * 2013-11-21 2014-03-12 安徽海聚信息科技有限责任公司 SOC-based high-performance cloud terminal
CN105183683A (en) * 2015-08-31 2015-12-23 浪潮(北京)电子信息产业有限公司 Multi-FPGA chip accelerator card
CN108536642A (en) * 2018-06-13 2018-09-14 北京比特大陆科技有限公司 Big data operation acceleration system and chip

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7593457B2 (en) * 2004-01-30 2009-09-22 Broadcom Corporation Transceiver system and method having a transmit clock signal phase that is phase-locked with a receive clock signal phase
CN105550140B (en) * 2014-11-03 2018-11-09 联想(北京)有限公司 A kind of electronic equipment and data processing method
CN107451075B (en) * 2017-09-22 2023-06-20 北京算能科技有限公司 Data processing chip and system, data storage forwarding and reading processing method
CN209784995U (en) * 2018-10-30 2019-12-13 北京比特大陆科技有限公司 Big data operation acceleration system and chip

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112214448A (en) * 2020-10-10 2021-01-12 中科声龙科技发展(北京)有限公司 Data dynamic reconstruction circuit and method of heterogeneous integrated workload proving operation chip
CN112214448B (en) * 2020-10-10 2024-04-09 声龙(新加坡)私人有限公司 Data dynamic reconstruction circuit and method of heterogeneous integrated workload proving operation chip
CN114691591A (en) * 2020-12-31 2022-07-01 中科寒武纪科技股份有限公司 Circuit, method and system for inter-chip communication

Also Published As

Publication number Publication date
CN109564562A (en) 2019-04-02
CN109564562B (en) 2022-05-13

Similar Documents

Publication Publication Date Title
CN108536642A (en) Big data operation acceleration system and chip
JP4768386B2 (en) System and apparatus having interface device capable of data communication with external device
US7155554B2 (en) Methods and apparatuses for generating a single request for block transactions over a communication fabric
TW201918883A (en) High bandwidth memory system and logic die
KR20040062717A (en) memory module device for use in high frequency operation
US7277975B2 (en) Methods and apparatuses for decoupling a request from one or more solicited responses
EP2985699B1 (en) Memory access method and memory system
CN112817907B (en) Interconnected bare chip expansion micro system and expansion method thereof
CN209149287U (en) Big data operation acceleration system
CN106844263B (en) Configurable multiprocessor-based computer system and implementation method
WO2020087276A1 (en) Big data operation acceleration system and chip
CN209784995U (en) Big data operation acceleration system and chip
WO2020087275A1 (en) Method for big data operation acceleration system carrying out operations
CN209560543U (en) Big data operation chip
WO2020087278A1 (en) Big data computing acceleration system and method
JP2003050788A (en) Apparatus and method for distribution of signal from high level data link controller to multiple digital signal processor core
CN208298179U (en) Big data operation acceleration system and chip
WO2020087239A1 (en) Big data computing acceleration system
CN112805727A (en) Artificial neural network operation acceleration device for distributed processing, artificial neural network acceleration system using same, and method for accelerating artificial neural network
CN209543343U (en) Big data operation acceleration system
WO2021213075A1 (en) Inter-node communication method and device based on multiple processing nodes
EP3907624A1 (en) Memory and storage controller with integrated memory coherency interconnect
US11789884B2 (en) Bus system and method for operating a bus system
WO2021213076A1 (en) Method and device for constructing communication topology structure on basis of multiple processing nodes
Klilou et al. Performance optimization of high-speed Interconnect Serial RapidIO for onboard processing

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18938534

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 11.10.2021)

122 Ep: pct application non-entry in european phase

Ref document number: 18938534

Country of ref document: EP

Kind code of ref document: A1