CN113407479A - Many-core architecture embedded with FPGA and data processing method thereof

Info

Publication number: CN113407479A
Authority: CN (China)
Legal status: Pending
Application number: CN202010183253.2A
Other languages: Chinese (zh)
Inventors: 谷丽, 何伟
Current and original assignee: Beijing Lynxi Technology Co Ltd
Priority and filing date: 2020-03-16
Publication date: 2021-09-17

Classifications

    • G06F15/7817 — System on chip; computer system on a single chip, specially adapted for signal processing, e.g. Harvard architectures
    • G06F15/7864 — Architectures of general purpose stored program computers comprising a single central processing unit, with memory on more than one IC chip
    • G06F9/3867 — Concurrent instruction execution, e.g. pipeline or look ahead, using instruction pipelines

Abstract

The invention discloses a many-core architecture embedded with an FPGA (field-programmable gate array) and a data processing method thereof. The many-core architecture comprises a many-core array, which includes a plurality of computing cores and at least one processing core integrated with an FPGA. Synchronous clocks are provided between the processing core and the computing cores, the processing core communicates with adjacent computing cores through inter-core routes, and the computing data of the current computing core is transmitted through inter-core routes to an FPGA computing unit in at least one processing core for operation. The invention has the following beneficial effects: by integrating an FPGA-bearing processing core in the many-core array, having it communicate with adjacent computing cores through inter-core routing, and transmitting the computing data of the current computing core to the FPGA computing unit in at least one processing core for operation, processing time is saved and operation efficiency is improved.

Description

Many-core architecture embedded with FPGA and data processing method thereof
Technical Field
The invention relates to the technical field of many-core architectures, in particular to a many-core architecture embedded with an FPGA and a data processing method thereof.
Background
When an unsupported algorithm or operation instruction is encountered, an existing many-core architecture generally transmits the data to a CPU outside the chip for processing; after the CPU finishes, the result is returned to a core of the chip, which continues with the next algorithm or operation instruction. This approach spends a significant amount of time on instruction fetching and decoding.
Disclosure of Invention
In order to solve the above problems, an object of the present invention is to provide a many-core architecture embedded with an FPGA and a data processing method thereof, which can save processing time and improve operation efficiency.
The invention provides a many-core architecture embedded with an FPGA (field-programmable gate array), comprising: a many-core array, the many-core array comprising a plurality of computing cores and at least one processing core integrated with an FPGA, wherein synchronous clocks are provided between the processing core and the computing cores, and the processing core communicates with adjacent computing cores through inter-core routes.
As a further improvement of the present invention, each computing core includes an AI computing unit, a storage unit and a router, each processing core includes an FPGA computing unit, a storage unit and a router, and the storage unit and router of the processing core are configured identically to those of the plurality of computing cores.
As a further improvement of the present invention, each computing core includes an AI computing unit, a storage unit and a router, each processing core includes an FPGA computing unit, a storage unit and a router, and the storage capacity of the processing core differs from that of the storage units of the plurality of computing cores.
As a further improvement of the invention, the many-core array is a two-dimensional matrix network, a two-dimensional ring network, a two-dimensional star network or a three-dimensional hierarchical network.
As a further improvement of the present invention, the many-core array is a two-dimensional matrix network, at least one of the processing cores is disposed at a corner of the many-core array, and the processing core communicates with two adjacent computing cores through two inter-core routing paths.
As a further improvement of the present invention, the many-core architecture includes a plurality of processing cores symmetrically disposed at corners of the many-core array.
As a further improvement of the present invention, the many-core array is a two-dimensional matrix network, at least one processing core is disposed inside the many-core array, and the processing core communicates with four adjacent computing cores through four inter-core routing paths.
As a further improvement of the present invention, the many-core architecture includes a plurality of processing cores symmetrically disposed on a diagonal of an interior of the many-core array.
The invention also provides a data processing method for the many-core architecture embedded with the FPGA, which adopts the above many-core architecture embedded with the FPGA and comprises the following step: transmitting computing data that the current computing core cannot process from the current computing core, through inter-core routing, to at least one processing core for operation.
As a further improvement of the present invention, the computation data that cannot be processed by the current computation core is transmitted from the current computation core to the FPGA computation unit in at least one of the processing cores through the inter-core route for performing the computation, and after the computation by the FPGA computation unit is completed, the computation result is transmitted to the next computation core through the inter-core route for continuing the computation.
As a further improvement of the present invention, a single operation task is divided into a plurality of sub-operation tasks, and the plurality of sub-operation tasks are distributed to at least one processing core and the plurality of computing cores for processing;
at time t-1, core a processes its corresponding sub-operation task and transmits the processed data to core b;
at time t, core b receives the data transmitted by core a, continues processing it, and transmits the processed data to core c;
at time t+1, core c receives the data transmitted by core b, continues processing it, and transmits the processed data to other cores for further processing;
in this way, on a time axis, the at least one processing core and the plurality of computing cores pipeline their respective sub-operation tasks, wherein core a, core b, core c and the other cores are each one of the at least one processing core and the plurality of computing cores;
at the same time, the at least one processing core and the plurality of computing cores process their respective sub-operation tasks in parallel.
As a further improvement of the invention, the current computing core transmits, through inter-core routing, the computing data of a corresponding sub-operation task that it cannot process to the processing core for processing.
As a further improvement of the invention, the processing core closest to the current computing core is found through inter-core routing, the computing data of the sub-operation task that the current computing core cannot process is transmitted to the FPGA computing unit in that processing core for operation, and after the FPGA computing unit finishes the operation, the operation result is transmitted through inter-core routing to the next computing core to continue the operation.
As a further improvement of the invention, the several processing cores closest to the current computing core are found through inter-core routing, the computing data of the several sub-operation tasks that the current computing core cannot process are transmitted to the FPGA computing units in the respective processing cores for operation, and after each FPGA computing unit finishes its operation, its operation result is transmitted through inter-core routing to the next computing core to continue the operation.
The invention also provides a many-core chip which comprises the many-core architecture embedded with the FPGA.
As a further improvement of the invention, the many-core chip comprises the many-core architecture embedded with the FPGA, an on-chip processor, a PCIe controller, a DMA, a general-purpose interface and a DDR controller, and these communicate with one another through a bus.
The invention also provides an electronic device which comprises a memory and a processor, wherein the memory is used for storing one or more computer instructions, and the one or more computer instructions are executed by the processor to realize the data processing method of the many-core architecture embedded with the FPGA.
The invention also provides a computer readable storage medium, on which a computer program is stored, wherein the computer program is executed by a processor to realize the data processing method of the many-core architecture embedded with the FPGA.
The invention has the beneficial effects that:
the FPGA is integrated in the AI many-core chip, so that the processing time is saved, and the operation efficiency is improved.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without undue inventive faculty.
FIG. 1 is a functional block diagram of a many-core chip including a many-core architecture with embedded FPGAs according to an exemplary embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a many-core architecture with an embedded FPGA integrated at a corner of the many-core array according to an exemplary embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a many-core architecture with an embedded FPGA integrated within the many-core array, according to an exemplary embodiment of the present disclosure;
FIG. 4 is a schematic diagram of data processing during operation of a many-core architecture with an embedded FPGA according to an exemplary embodiment of the present disclosure;
FIG. 5 is a timing diagram illustrating operation of a many-core architecture with an embedded FPGA according to an exemplary embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.
It should be noted that, if directional indications (such as up, down, left, right, front, back, etc.) are involved in the disclosed embodiments, the directional indications are only used to explain the relative positional relationship, motion and the like of the components in a specific posture (as shown in the drawings); if the specific posture changes, the directional indications change accordingly.
In addition, in the description of the present disclosure, the terms used are for illustrative purposes only and are not intended to limit the scope of the present disclosure. The terms "comprises" and/or "comprising" specify the presence of stated elements, steps, operations and/or components, but do not preclude the presence or addition of one or more other elements, steps, operations and/or components. The terms "first", "second" and the like may be used to describe various elements; they do not necessarily indicate an order and do not limit those elements, being used only to distinguish one element from another. In addition, in the description of the present disclosure, "a plurality" means two or more unless otherwise specified. These and/or other aspects will become apparent from the following drawings, and the description of the embodiments of the disclosure will be more readily understood by those of ordinary skill in the art. The drawings are only for purposes of illustrating the described embodiments of the disclosure. One skilled in the art will readily recognize from the following description that alternative embodiments of the structures and methods illustrated in the present disclosure may be employed without departing from the principles described herein.
An AI chip is a module specialized for the large volume of computation tasks in artificial-intelligence applications; traditional CPUs and GPUs were not designed for AI algorithms and cannot achieve optimal cost-performance. The APU (AI Processing Unit) is a brain-inspired chip with a many-core architecture; it supports the massively parallel computing that deep learning depends on, is faster than traditional processors, and greatly accelerates training and inference. However, the APU has certain limitations when applied to deep-learning algorithms: its hardware structure provides relatively weak support for some primitives and instructions. When an unsupported algorithm or operation instruction is encountered, data must be transmitted to an off-chip CPU for processing; after the CPU finishes, the result is returned to a core of the chip, which continues with the next algorithm or operation instruction. Although the CPU offers great flexibility and coverage, it also spends a lot of time on instruction fetching and decoding.
The many-core architecture of an embedded FPGA in embodiments of the disclosure includes: a many-core array comprising a plurality of computing cores and at least one processing core integrated with an FPGA, wherein synchronous clocks are provided between the processing core and the computing cores, and the processing core communicates with adjacent computing cores through inter-core routes. As shown in fig. 1, in the many-core architecture embedded with an FPGA according to the embodiment of the present disclosure, the FPGA is integrated in the AI computing core array and communicates with the outside through the inter-core routing of the AI computing core array. Control modules such as a CPU/ARM/MCU configure the FPGA only during initialization. The FPGA implements precompiled logic, so when the AI computing core array performs operations, the CPU and ARM do not need to schedule data or instructions for the FPGA; a streaming data processing mode can thus be formed, reducing processing delay and improving processing efficiency.
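To make this configure-once, stream-thereafter behaviour concrete, the following is a minimal Python sketch. It is illustrative only and not the patent's implementation; every name in it (FpgaProcessingCore, ComputeCore, configure, compute, forward) is an assumption made for the example.

```python
# A minimal sketch of the init-once model: the host configures the FPGA
# processing core a single time, after which data moves between cores over
# inter-core routes with no host scheduling. All names are illustrative.

class FpgaProcessingCore:
    def __init__(self):
        self.bitstream = None

    def configure(self, bitstream: bytes):
        """Called by the host (CPU/ARM/MCU) only during initialization."""
        self.bitstream = bitstream

    def compute(self, data):
        assert self.bitstream is not None, "FPGA core must be configured first"
        # Stand-in for the precompiled logic/judgment operation.
        return [x * 2 for x in data]

class ComputeCore:
    def __init__(self, neighbors=None):
        self.neighbors = neighbors or {}  # inter-core routes to adjacent cores

    def forward(self, target, data):
        """Send data to an adjacent core over an inter-core route."""
        return self.neighbors[target].compute(data)

# The host touches the FPGA core once, at initialization:
fpga = FpgaProcessingCore()
fpga.configure(b"precompiled-bitstream")

# At run time the compute core offloads directly, with no host involvement:
core = ComputeCore(neighbors={"fpga": fpga})
print(core.forward("fpga", [1, 2, 3]))  # [2, 4, 6]
```

The point of the sketch is the division of labour: the host appears only in the configuration step, and all run-time traffic stays on the inter-core routes.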
The many-core architecture embedded with the FPGA exploits the FPGA's advantage of precompiling tasks and then processing them at high speed, which resolves the large time cost of instruction fetching, decoding and the like. Compared with an architecture in which the FPGA is integrated outside the chip, the many-core architecture with the embedded FPGA avoids the extra energy consumed by inter-chip communication and the overall performance loss caused when data-flow processing is interrupted to schedule tasks onto an off-chip FPGA.
The many-core architecture embedded with the FPGA thus removes the overall performance degradation caused by interrupting data-flow processing to schedule tasks to an FPGA over a bus, and correspondingly reduces the overhead on the on-chip CPU.
The many-core architecture of the embedded FPGA in the embodiment of the disclosure can support the newly added FPGA-embedded processing core simply by adding a node type to the existing network-on-chip structure. The newly added FPGA-embedded processing core can also serve as an ordinary computing core, so upper-layer software and the applications that already call it need not be modified. When complex logic-control operations are encountered, the FPGA-embedded processing cores are invoked to complete the corresponding logic operations; and when multiple FPGA-embedded processing cores are designed into the many-core array, several small networks can perform logic control simultaneously.
In one implementation, each computing core comprises an AI computing unit, a storage unit and a router, and each processing core comprises an FPGA computing unit, a storage unit and a router; the storage units and routers of the processing core and of the plurality of computing cores are configured identically. For example, the storage capacities of the storage units of the processing core and of the plurality of computing cores may be the same. The computing cores and the processing core are designed as modules of the same form for integration, share the same timing control and communication scheme, and form a structurally uniform two-dimensional-grid array architecture.
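This uniform-module idea can be pictured with a short Python sketch, given purely as an assumption-laden illustration (Router, Core, make_core and the 256 KiB capacity are invented names and values, not disclosed ones): both core types expose the same storage and router configuration, so either kind can occupy any slot of the two-dimensional grid.

```python
# A minimal sketch, under assumed names, of the homogeneous configuration:
# processing and compute cores share the same storage capacity and router
# settings, so both fit the same grid slot.
from dataclasses import dataclass

@dataclass
class Router:
    ports: int = 4           # N/S/E/W inter-core links in a 2D mesh
    flit_bits: int = 64

@dataclass
class Core:
    kind: str                # "AI" or "FPGA"
    storage_kib: int
    router: Router

STORAGE_KIB = 256            # illustrative shared capacity

def make_core(kind: str) -> Core:
    # Identical storage and routing for both core types (the first variant);
    # the second variant would simply pass a different storage_kib here.
    return Core(kind=kind, storage_kib=STORAGE_KIB, router=Router())

grid = [[make_core("AI") for _ in range(4)] for _ in range(4)]
grid[3][3] = make_core("FPGA")  # drop an FPGA processing core into the array
assert grid[3][3].storage_kib == grid[0][0].storage_kib  # identical config
```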
In one implementation, the configurations of the storage units of the processing core and of the plurality of computing cores may differ; for example, their storage capacities may differ. The storage capacities of the individual cores (including the processing core and the plurality of computing cores) may all be different, or may be only partially the same.
In one implementation, the many-core array is a two-dimensional matrix network, a two-dimensional torus network, a two-dimensional star network, or a three-dimensional hierarchical network. The network structure of the many-core array can be selected and designed according to the functions and requirements of the chip. The number of FPGA-embedded processing cores can likewise be adapted to the functions the chip must perform: when the volume of logic operations and judgment instructions to be processed is small, a single FPGA processing core integrated in the chip's many-core array suffices; when that volume is large, the number of FPGA-embedded processing cores can be increased appropriately. In addition, designing several FPGA-integrated processing cores into the many-core array allows several small networks to operate simultaneously, improving operation speed.
In a preferred embodiment, the many-core array is designed as a two-dimensional square network; the symmetrical structure allows data transmission and processing, chip power consumption, heat dissipation and the like to be optimized, improving the overall performance of the chip.
As for the position of the FPGA-embedded processing core in the many-core array, its optimal position can be determined according to the positions of the computing cores designated to receive input computing data, so that the FPGA-embedded processing core closest to a given computing core can be found quickly, improving the overall operation rate of the chip.
In one implementation, as shown in fig. 2, the many-core array is a two-dimensional matrix network, at least one FPGA-embedded processing core is disposed at a corner of the many-core array, and the processing core communicates with two adjacent computing cores through two inter-core routing paths.
In one implementation, the many-core architecture includes a plurality of processing cores, which may be symmetrically disposed at the corners of the many-core array. Symmetric placement of the processing cores suits a two-dimensional matrix network, in particular a two-dimensional square network, and each added processing core correspondingly increases the chip's capacity for logic operations and instruction judgment. For example, 2 processing cores may be placed diagonally across the many-core array: one in the top-left corner and the other in the bottom-right corner, or alternatively in the top-right and bottom-left corners. Designing 2 FPGA-embedded processing cores at diagonal corners of the many-core array increases the processing capacity for logic operations and instruction judgment, and the symmetric design also reduces the time to transfer data from a computing core to a processing core, improving the real-time performance of operations. Alternatively, 4 processing cores may be disposed at the 4 corners of the many-core array, one per corner; this generally targets cases with a large volume of logic operations and judgment instructions, increasing that capacity while reducing data-transfer time from computing cores to processing cores. In a specific design, the number and positions of the processing cores must be chosen by weighing the chip's area, function, energy consumption and the like.
In one implementation, as shown in fig. 3, the many-core array is a two-dimensional matrix network, at least one processing core is disposed inside the many-core array, and the processing core communicates with four adjacent computing cores through four inter-core routing paths.
In one implementation, the many-core architecture includes multiple processing cores that may be symmetrically placed on a diagonal of the interior of the many-core array. Placing processing cores symmetrically on the internal diagonal is particularly well suited to a two-dimensional square network, and each added processing core correspondingly increases the chip's capacity for logic operations and instruction judgment. Designing several FPGA-embedded processing cores on a diagonal inside the many-core array increases the processing capacity for logic operations and instruction judgment, and the symmetric design also reduces the time to transfer data from a computing core to a processing core, improving the real-time performance of operations. For example, for a two-dimensional 4 x 4 network, one FPGA-embedded processing core may be disposed at (row 2, column 2) and another at (row 3, column 3) of the many-core array; or one at (row 2, column 3) and another at (row 3, column 2); or 4 processing cores may occupy the four central positions (row 2, column 2), (row 2, column 3), (row 3, column 2) and (row 3, column 3). The last arrangement generally targets cases with a large volume of logic operations and judgment instructions: placing 4 processing cores inside the many-core array increases that capacity while reducing data-transfer time from computing cores to processing cores. In a specific design, the number and positions of the processing cores must be chosen by weighing the chip's area, function, energy consumption and the like.
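The corner and internal-diagonal placements discussed above can be written down concretely for a 4 x 4 mesh. The following Python sketch uses assumed zero-based coordinates and invented names; it also shows why a corner processing core has two inter-core links while an internal one has four.

```python
# Illustrative placements of FPGA-embedded processing cores in a 4 x 4 mesh,
# rows and columns numbered from 0. Coordinates are assumptions for the example.
N = 4

placements = {
    "two opposite corners":   {(0, 0), (N - 1, N - 1)},
    "all four corners":       {(0, 0), (0, N - 1), (N - 1, 0), (N - 1, N - 1)},
    "internal diagonal (2)":  {(1, 1), (2, 2)},
    "four central positions": {(1, 1), (1, 2), (2, 1), (2, 2)},
}

def neighbor_count(pos):
    """Number of adjacent computing cores reachable over inter-core routes."""
    r, c = pos
    return sum(0 <= r + dr < N and 0 <= c + dc < N
               for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)))

for name, cores in placements.items():
    links = sorted(neighbor_count(p) for p in cores)
    print(f"{name}: {sorted(cores)} -> inter-core links per core {links}")
# Corner cores have 2 links (cf. fig. 2); internal cores have 4 (cf. fig. 3).
```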
According to the data processing method of the many-core architecture embedded with the FPGA, in one implementation, the many-core architecture embedded with the FPGA described above is adopted, and computing data is transmitted from a computing core through inter-core routing to at least one processing core for operation. As shown in fig. 4, when the current computing core encounters an unsupported operation instruction (e.g., a complex logic-control or judgment instruction), the computing data that the current computing core cannot process is transmitted from the current computing core through inter-core routing to the FPGA computing unit in at least one processing core for operation; after the FPGA computing unit finishes the operation, the operation result is transmitted through inter-core routing to the next computing core to continue the operation.
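A minimal Python sketch of this offload flow follows. The set of supported operations, the core functions and the operation name are all assumptions for illustration; in the disclosed architecture the dispatch is done by the inter-core routing hardware rather than by Python calls.

```python
# A minimal sketch (assumed names throughout) of the offload flow of fig. 4:
# an unsupported operation is forwarded over inter-core routes to an FPGA
# processing core, and the result continues to the next compute core.

SUPPORTED_BY_AI = {"matmul", "conv", "relu"}   # illustrative instruction set

def run_on_ai_core(op, data):
    return f"AI({op}:{data})"

def run_on_fpga_core(op, data):
    # Complex logic-control / judgment instructions land here.
    return f"FPGA({op}:{data})"

def process(op, data, next_core):
    if op in SUPPORTED_BY_AI:
        result = run_on_ai_core(op, data)
    else:
        result = run_on_fpga_core(op, data)  # one on-chip inter-core hop
    return next_core(result)                 # forward to the next compute core

print(process("branch_select", "x0", next_core=lambda r: f"next<{r}>"))
```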
In one implementation, as shown in fig. 5, during data processing the many-core architecture embedded with an FPGA divides a single operation task into a plurality of sub-operation tasks, and these sub-operation tasks are distributed among at least one processing core and a plurality of computing cores for processing.
At time t-1, core a processes its corresponding sub-operation task and transmits the processed data to core b; at time t, core b receives the data transmitted by core a, continues processing it, and transmits the processed data to core c; at time t+1, core c receives the data transmitted by core b, continues processing it, and transmits the processed data to other cores for further processing. In this way, on the time axis, the at least one processing core and the plurality of computing cores pipeline their respective sub-operation tasks, where core a, core b, core c and the other cores are each one of the at least one processing core and the plurality of computing cores. Here t is an integer greater than or equal to 1, and time 0 is the moment the operation starts.
For example, as shown in fig. 5, on the overall time axis: APU core 1 (computing core 1) transmits the data it processed during period T1 to APU core 5 (computing core 5) for processing during T2, while APU core 1 continues processing during T2; it then transmits the data processed during T2 to APU core 5 for processing during T3, while continuing to process during T3; the data processed during T3 goes to APU core 5 for processing during T4, and so on.
APU core 5 processes during T2 the data it received from APU core 1's T1 period and transmits its T2 output to FPGA core 8 (processing core) for processing during T3; meanwhile APU core 5 processes during T3 the data from APU core 1's T2 period and transmits its T3 output to FPGA core 8 for processing during T4, while continuing during T4 with the data from APU core 1's T3 period, and so on.
FPGA core 8 processes during T3 the data from APU core 5's T2 period and transmits its T3 output to APU core 3 for processing during T4; meanwhile FPGA core 8 processes during T4 the data from APU core 5's T3 period, and so on.
A pipeline processing mode is thus formed on the time axis.
At the same time, the at least one processing core and the plurality of computing cores process their respective sub-operation tasks in parallel.
For example, as shown in fig. 5, during period T6, APU core 1 (computing core 1), APU core 5 (computing core 5), APU core 3 (computing core 3) and APU core 6 (computing core 6) each process a corresponding computation sub-task, while FPGA core 8 (processing core) processes a corresponding logic-control and judgment-instruction sub-task.
In this pipeline processing mode, data transmission and operations among the cores need no intervention or scheduling by the on-chip CPU, reducing processing delay and improving processing efficiency. The FPGA can handle algorithms that the APU cores cannot process or process inefficiently, as well as logic-control and judgment-instruction processing; all transmission and processing is completed on chip, which greatly saves transmission bandwidth, reduces energy consumption, improves operation efficiency, and accelerates the inference/training process of the neural network.
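The timing behaviour of fig. 5 can be reproduced with a few lines of Python. The stage order APU core 1 -> APU core 5 -> FPGA core 8 -> APU core 3 follows the example above; the packet names and the simulation itself are illustrative assumptions, not the disclosed implementation.

```python
# A small simulation, using assumed stage and packet names, of the fig. 5
# pipeline: once the pipe fills, every core is busy on a different packet
# in the same time period (pipelined on the time axis, parallel within it).
stages = ["APU1", "APU5", "FPGA8", "APU3"]
packets = ["d1", "d2", "d3", "d4", "d5"]

for t in range(len(packets) + len(stages) - 1):  # time periods T1, T2, ...
    active = {}
    for s, stage in enumerate(stages):
        i = t - s                # packet index this stage works on at time t
        if 0 <= i < len(packets):
            active[stage] = packets[i]
    print(f"T{t + 1}: {active}")
# After the fill phase, all four cores process different packets in parallel,
# matching the pipelined + parallel behaviour described above.
```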
In one implementation, when the many-core array includes multiple FPGA-embedded processing cores and the current computing core encounters an unsupported sub-operation task, the processing core closest to the current computing core can be found through inter-core routing; the data of the sub-operation task that the current computing core cannot process is transmitted to the FPGA computing unit in that processing core for operation, and after the FPGA computing unit finishes, the operation result is transmitted through inter-core routing to the next computing core to continue the operation. Quickly finding the closest FPGA-embedded processing core improves the overall operation rate of the chip.
In one implementation, when the many-core array includes multiple FPGA-embedded processing cores and the current computing core encounters several unsupported sub-operation tasks, the several processing cores closest to the current computing core can be found through inter-core routing; the computing data of the sub-operation tasks that the current computing core cannot process are transmitted to the FPGA computing units in the respective processing cores for operation, and after each FPGA computing unit finishes, its operation result is transmitted through inter-core routing to the next computing core to continue the operation. Having several FPGA-embedded processing cores handle sub-operation tasks simultaneously greatly improves the overall operation efficiency of the chip.
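Choosing the closest processing core (or the several closest ones) amounts to minimizing hop count in the mesh. The following Python sketch, with assumed coordinates and function names, illustrates the selection; in hardware this would be carried out by the inter-core routing rather than in software.

```python
# A sketch, under assumed coordinates, of locating the nearest FPGA
# processing core(s) in the mesh by Manhattan (hop) distance.
def hops(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def nearest_processing_cores(current, processing_cores, k=1):
    """Return the k processing cores fewest hops from the current core."""
    return sorted(processing_cores, key=lambda p: hops(current, p))[:k]

processing_cores = [(0, 0), (0, 3), (3, 0), (3, 3)]  # four corner FPGA cores
current = (1, 2)                                     # current computing core

print(nearest_processing_cores(current, processing_cores))       # [(0, 3)]
print(nearest_processing_cores(current, processing_cores, k=2))  # two targets
```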
The disclosure also relates to a many-core chip comprising the many-core architecture embedded with the FPGA. In one implementation, as shown in fig. 1, the many-core chip includes the many-core architecture embedded with the FPGA (e.g., the AI computing core array), an on-chip processor (e.g., CPU/ARM/MCU), a PCIe controller, a DMA, a general-purpose interface (e.g., UART/I2C/SPI/GPIO, etc.) and a DDR controller, all of which communicate via a bus. On one hand, the many-core chip exploits the FPGA's advantage of precompiling and then processing at high speed, resolving the large time cost of instruction fetching, decoding and the like; and since only a new node type is added without changing the existing network-on-chip structure, the chip can support various complex logic-control operations while performing ordinary AI computation. On the other hand, the FPGA is integrated in the many-core architecture (the AI computing core array) and communicates with the outside through the array's inter-core routing, reducing chip energy consumption. The CPU/ARM/MCU and the like configure the FPGA only during initialization. The FPGA implements precompiled logic; when the AI computing core array operates, the CPU and ARM need not schedule data or instructions for the FPGA, so a streaming data processing mode can be formed, reducing processing delay and improving processing efficiency.
The many-core chip of the embodiments of the disclosure can be applied in the field of artificial intelligence: by adding an FPGA-embedded processing core to the AI computing array, it can handle algorithms that otherwise could not be processed or would be processed inefficiently, as well as logic-control and judgment-instruction processing, thereby greatly saving transmission bandwidth, reducing energy consumption, improving operation efficiency, and accelerating the inference/training process of a neural network.
The disclosure also relates to an electronic device comprising a server, a terminal and the like. The electronic device includes: at least one processor; a memory communicatively coupled to the at least one processor; and a communication component communicatively coupled to the storage medium, the communication component receiving and transmitting data under control of the processor; the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to implement the data processing method of the many-core architecture embedded with the FPGA in the above embodiments.
In an alternative embodiment, the memory is used as a non-volatile computer-readable storage medium for storing non-volatile software programs, non-volatile computer-executable programs, and modules. The processor executes various functional applications and data processing of the device by running the nonvolatile software program, the instructions and the modules stored in the memory, namely, the data processing method of the many-core architecture embedded with the FPGA is realized.
The memory may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store a list of options, etc. Further, the memory may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the memory optionally includes memory located remotely from the processor, and such remote memory may be connected to the external device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
One or more modules are stored in the memory and when executed by the one or more processors perform the data processing method of the many-core architecture of the embedded FPGA in any of the method embodiments described above.
The product can execute the data processing method of the many-core architecture embedded with the FPGA provided by the embodiment of the present application, has the corresponding functional modules and beneficial effects of the execution method, and the technical details not described in detail in the embodiment of the present application can be referred to the data processing method of the many-core architecture embedded with the FPGA provided by the embodiment of the present application.
The disclosure also relates to a computer-readable storage medium for storing a computer-readable program, where the computer-readable program is used for causing a computer to execute the above-mentioned data processing method embodiment of the many-core architecture of the embedded FPGA partially or completely.
That is, as can be understood by those skilled in the art, all or part of the steps in the methods of the embodiments described above may be implemented by a program instructing related hardware; the program is stored in a storage medium and includes several instructions to enable a device (which may be a microcontroller, a chip, or the like) or a processor to execute all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage media include various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the disclosure may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Furthermore, those of ordinary skill in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the disclosure and form different embodiments. For example, in the claims, any of the claimed embodiments may be used in any combination.
It will be understood by those skilled in the art that while the present disclosure has been described with reference to exemplary embodiments, various changes may be made and equivalents may be substituted for elements thereof without departing from the scope of the disclosure. In addition, many modifications may be made to adapt a particular situation or material to the teachings of the disclosure without departing from the essential scope thereof. Therefore, it is intended that the disclosure not be limited to the particular embodiment disclosed, but that the disclosure will include all embodiments falling within the scope of the appended claims.

Claims (10)

1. A many-core architecture of an embedded FPGA, comprising: a many-core array, the many-core array comprising a plurality of computing cores and at least one processing core integrated with an FPGA, wherein synchronous clocks are provided between the processing core and the computing cores, and the processing core communicates with adjacent computing cores through inter-core routes.
2. The many-core architecture of an embedded FPGA of claim 1, wherein each compute core comprises an AI compute unit, a storage unit and a router, each processing core comprises an FPGA compute unit, a storage unit and a router, and the storage units and routers of the processing core and the plurality of compute cores are configured identically.
3. The many-core architecture of an embedded FPGA of claim 1, wherein each compute core comprises an AI compute unit, a storage unit and a router, each processing core comprises an FPGA compute unit, a storage unit and a router, and the storage capacity of the processing core differs from that of the storage units of the plurality of compute cores.
4. The many-core architecture of an embedded FPGA of any one of claims 1-3, wherein the many-core array is a two-dimensional matrix network, a two-dimensional ring network, a two-dimensional star network, or a three-dimensional hierarchical network.
5. A data processing method of a many-core architecture embedded with an FPGA, wherein the many-core architecture embedded with an FPGA of any one of claims 1 to 4 is adopted, the data processing method comprising: transmitting computing data that the current computing core cannot process from the current computing core, through inter-core routing, to at least one processing core for operation.
6. The data processing method of the many-core architecture with the embedded FPGA according to claim 5, wherein the computing data that cannot be processed by the current computing core is transmitted from the current computing core to the FPGA computing unit in at least one of the processing cores through inter-core routing for operation, and after the operation of the FPGA computing unit is completed, the operation result is transmitted to the next computing core through inter-core routing for operation.
7. The data processing method of the many-core architecture with the embedded FPGA according to claim 6, wherein a single operation task is divided into a plurality of sub-operation tasks, and the plurality of sub-operation tasks are distributed to at least one processing core and the plurality of computing cores for processing;
at time t-1, core a processes its corresponding sub-operation task and transmits the processed data to core b;
at time t, core b receives the data transmitted by core a, continues processing it, and transmits the processed data to core c;
at time t+1, core c receives the data transmitted by core b, continues processing it, and transmits the processed data to other cores for further processing;
in this way, on a time axis, the at least one processing core and the plurality of computing cores pipeline their respective sub-operation tasks, wherein core a, core b, core c and the other cores are each one of the at least one processing core and the plurality of computing cores;
at the same time, the at least one processing core and the plurality of computing cores process their respective sub-operation tasks in parallel.
8. A many-core chip comprising an FPGA-embedded many-core architecture according to any one of claims 1-4.
9. An electronic device comprising a memory and a processor, wherein the memory is configured to store one or more computer instructions, and the one or more computer instructions are executed by the processor to implement the data processing method of the many-core architecture of an embedded FPGA as recited in any one of claims 5-7.
10. A computer-readable storage medium, on which a computer program is stored, the computer program being executable by a processor to implement a data processing method of a many-core architecture of an embedded FPGA as claimed in any one of claims 5 to 7.
CN202010183253.2A 2020-03-16 2020-03-16 Many-core architecture embedded with FPGA and data processing method thereof Pending CN113407479A (en)

Priority Applications (1)

CN202010183253.2A — Priority date: 2020-03-16 — Filing date: 2020-03-16 — Many-core architecture embedded with FPGA and data processing method thereof

Publications (1)

CN113407479A — Publication date: 2021-09-17

Family ID: 77676644

Country Status (1)

CN: CN113407479A (en)

Legal Events

PB01 — Publication
SE01 — Entry into force of request for substantive examination