CN111090611A - Small heterogeneous distributed computing system based on FPGA - Google Patents

Small heterogeneous distributed computing system based on FPGA

Info

Publication number
CN111090611A
CN111090611A (application CN201811247613.XA)
Authority
CN
China
Prior art keywords
data
module
fpga
calculation
computing system
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811247613.XA
Other languages
Chinese (zh)
Inventor
陈钰文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Xuehu Information Technology Co Ltd
Original Assignee
Shanghai Xuehu Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Xuehu Information Technology Co Ltd filed Critical Shanghai Xuehu Information Technology Co Ltd
Priority to CN201811247613.XA priority Critical patent/CN111090611A/en
Publication of CN111090611A publication Critical patent/CN111090611A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00: Digital computers in general; Data processing equipment in general
    • G06F15/76: Architectures of general purpose stored program computers
    • G06F15/78: Architectures of general purpose stored program computers comprising a single central processing unit
    • G06F15/7807: System on chip, i.e. computer system on a single chip; System in package, i.e. computer system on one or more chips in a single package
    • G06F15/7839: Architectures comprising a single central processing unit with memory
    • G06F15/7842: Architectures with memory on one IC chip (single chip microcontrollers)
    • G06F15/7867: Architectures with reconfigurable architecture
    • G06F15/7871: Reconfiguration support, e.g. configuration loading, configuration switching, or hardware OS
    • G06F15/7878: Reconfiguration support for pipeline reconfiguration

Abstract

The invention discloses a small heterogeneous distributed computing system based on an FPGA (field programmable gate array), belonging to the technical field of computation-intensive hardware design and comprising a data input module, a data calculation module and a data return module. The data input module is used for scattering and recombining data and sending it serially to the data calculation module in a pipelined manner; the data calculation module is used for receiving data from the data input module and transmitting results to the data return module; the data return module is used for regrouping out-of-order returned data according to the sequence of the calculation output results from the preceding stage. The system can exploit the advantages of FPGA streaming computation and high throughput to the greatest extent and is well suited to high-frequency, computation-intensive workloads, and an FPGA cascade configurable strategy is adopted in the distributed core computing unit so that it can be configured according to specific computing requirements.

Description

Small heterogeneous distributed computing system based on FPGA
Technical Field
The invention relates to the technical field of computation-intensive hardware design, and in particular to a small heterogeneous distributed computing system based on an FPGA (field programmable gate array).
Background
Most existing open-source software frameworks run on an operating system, which in turn runs on hardware whose core computing unit is the CPU. Depending on the manufacturer and instruction set, CPUs can be divided into architectures such as x86, MIPS, PowerPC and ARM, but all of these are in essence von Neumann architectures: every operation is reduced to the execution of single instructions, and each instruction passes through the basic stages of fetch, decode, execute, memory access and write-back to complete its life cycle. From a microscopic perspective, therefore, every computation the CPU performs involves a relatively complex and time-consuming instruction-translation process. Moreover, instructions must be executed in order: the next instruction must wait for the previous one to complete before it can proceed, so the microscopically accumulated overhead means that macroscopic real-time, high-density computation cannot be satisfied. Although optimization techniques such as branch prediction, superscalar execution, hyper-threading and frequency scaling have been introduced to address the CPU's lack of computational performance, they are merely optimizations; the fundamental architectural problem is not eliminated.
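To make the cost of strict in-order execution concrete, here is a minimal sketch (not part of the patent; the five-stage model and one-cycle-per-stage figure are illustrative assumptions) comparing fully sequential execution with an ideal pipeline:

```python
# Hypothetical model: a CPU retiring instructions strictly in order pays the
# full stage latency for every instruction, whereas an ideal pipeline fills
# once and then retires one instruction per cycle.
STAGES = ["fetch", "decode", "execute", "memory", "writeback"]

def sequential_cycles(n_instructions, cycles_per_stage=1):
    """Total cycles when each instruction must complete before the next starts."""
    return n_instructions * len(STAGES) * cycles_per_stage

def pipelined_cycles(n_instructions, cycles_per_stage=1):
    """Ideal pipeline: pay the fill latency once, then one instruction per cycle."""
    return (len(STAGES) + n_instructions - 1) * cycles_per_stage

# 1000 instructions: 5000 cycles sequentially vs 1004 cycles ideally pipelined.
```

The roughly five-fold gap is exactly the micro-level accumulation the paragraph describes, and pipeline-style overlap of this kind is what the FPGA design later exploits.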
GPUs are also becoming more widely used as computational load and complexity increase dramatically. Compared with the CPU, the GPU has data-parallel capability that the CPU lacks and can operate on data in parallel blocks, so it achieves a higher data throughput rate and better supports large-volume streaming computation such as multimedia, image, audio and video processing. However, for most applications the GPU still runs under an operating system and must interact with the CPU, so its computation path takes an obvious detour through the CPU-based framework. More critically, the GPU offers only data parallelism; it cannot implement a deeply pipelined computation module. The data entering the GPU must be mutually independent within a single computation pass: once data items are correlated, the GPU must wait for the earlier data to be ready before the next computation can begin. Thus, although data parallelism is nominally available, it cannot be fully exploited, because correlated data can only be computed after the preceding operation completes.
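The dependency stall described above can be sketched abstractly (an illustrative model, not the patent's method): on a purely data-parallel device, each "round" can process only items whose inputs are already available, so a chain of dependent items degenerates to one item per round.

```python
# Hypothetical model of data-parallel scheduling: deps maps each item to the
# item it depends on (None = independent). One round processes every item
# whose dependency has already completed.
def rounds_needed(deps):
    done = set()
    rounds = 0
    while len(done) < len(deps):
        ready = {i for i, d in deps.items()
                 if i not in done and (d is None or d in done)}
        done |= ready
        rounds += 1
    return rounds

# Four independent items finish in 1 round; a four-item dependency chain
# needs 4 rounds despite the hardware's parallel width.
```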
The computing units of existing distributed computing systems adopt CPUs or GPUs of the von Neumann architecture. The CPU is not suitable for intensive data computation and is better suited to task scheduling; the GPU is more efficient but offers only data parallelism, and its instruction-pipeline depth is still limited, so neither is suitable for intensive computation. Existing acceleration-oriented FPGA computing modules combine high-performance FPGA chips into an FPGA computing block cascaded over the PCIe protocol, which imposes great demands on PCB design, cost and the like; moreover, this approach limits the number of FPGAs that can be integrated, and once a single FPGA in the integrated module fails, the whole system is paralyzed. In addition, the compute nodes of such distributed computing systems receive node data through the conventional CPU + NIC path.
Based on the above, the invention provides a small FPGA-based heterogeneous distributed computing system to solve these problems.
Disclosure of Invention
The object of the invention is to provide a small FPGA-based heterogeneous distributed computing system, so as to solve the problems raised in the background art: the computing units of existing distributed computing systems adopt CPUs or GPUs of the von Neumann architecture, where the CPU is not suitable for intensive data computation and is better suited to task scheduling, while the GPU is more efficient but offers only data parallelism with a still-limited pipeline depth, so neither is suitable for intensive computation; existing acceleration-oriented FPGA computing modules combine high-performance FPGA chips into an FPGA computing block cascaded over the PCIe protocol, which imposes great demands on PCB design, cost and the like, limits the number of FPGAs that can be integrated, and paralyzes the whole system once a single FPGA in the integrated module fails; and the compute nodes receive node data through the conventional CPU + NIC path.
In order to achieve the above object, the invention provides the following technical scheme: a small FPGA-based heterogeneous distributed computing system comprises a data input module, a data calculation module and a data return module;
the data input module is used for scattering and recombining data and sending it serially to the data calculation module in a pipelined manner;
the data calculation module is used for receiving data from the data input module and transmitting results to the data return module;
and the data return module is used for regrouping out-of-order returned data according to the arrival sequence of the output results computed from the preceding stage.
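The division of labor among the three modules can be illustrated with a small end-to-end sketch (illustrative only; the squaring step merely stands in for whatever the FPGA compute units actually do):

```python
# Hypothetical sketch: the input module tags and scatters work items, the
# compute units may finish in any order, and the return module regroups
# results back into the original sequence.
import random

def input_module(data):
    # Scatter: tag each item with a sequence number before distribution.
    return [(seq, item) for seq, item in enumerate(data)]

def compute_module(tagged):
    # Completion order is nondeterministic across distributed units.
    shuffled = tagged[:]
    random.shuffle(shuffled)
    return [(seq, item * item) for seq, item in shuffled]  # stand-in computation

def return_module(results):
    # Regroup: restore the original order using the sequence tags.
    return [value for _, value in sorted(results)]

assert return_module(compute_module(input_module([1, 2, 3, 4]))) == [1, 4, 9, 16]
```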
Preferably, the data input module includes, but is not limited to, CPU, FPGA and DDR hardware modules;
the FPGA module is used for receiving data and scattering and recombining the data;
the CPU module is directly connected with the FPGA module at a high speed through a QPI protocol and is used for the CPU module to rapidly and dynamically configure the FPGA module to receive and transmit data.
Preferably, the data input module further includes at least two groups of Ethernet physical interfaces, one group of which is used for receiving data;
and the other group of Ethernet physical interfaces is used for data forwarding.
Preferably, the data input module further comprises a recombination pipeline module, and the Ethernet physical interface for receiving data can deserialize serial input data and transmit the resulting parallel data to the recombination pipeline module.
Preferably, the data calculation module comprises at least one group of data calculation units, and each data calculation unit comprises a single FPGA, DDR and at least two groups of Ethernet physical interfaces.
Preferably, the data return module includes a post-stage processing module, and the post-stage processing module is configured to improve data throughput by deeply pipelining the recombined data.
Compared with the prior art, the invention has the following beneficial effects: the invention can exploit the advantages of FPGA streaming computation and high throughput to the greatest extent and is well suited to high-frequency, computation-intensive workloads; an FPGA cascade configurable strategy is adopted in the distributed core computing unit, allowing configuration according to specific computing requirements; in the data distribution module and the data return module, the FPGA communicates with the CPU over a QPI bus, so the CPU can directly access the FPGA's memory controller and directly instruct the FPGA to read and write data, saving a large amount of time compared with the traditional mode in which the CPU and the FPGA share memory; and the network protocol stack transmits and receives network packets directly in the FPGA, saving the large amount of decoding the CPU would otherwise execute during transmit/receive verification and improving the total transceiving time by an order of magnitude.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. It is obvious that the drawings described below show only some embodiments of the invention, and that those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is an overall framework diagram of a distributed heterogeneous computing system of the present invention;
FIG. 2 is a diagram of a distributed heterogeneous computing system hardware framework of the present invention;
FIG. 3 is a block diagram of the embodiment of FIG. 2;
FIG. 4 is an enlarged view of the left end of FIG. 3 in accordance with the present invention;
FIG. 5 is an enlarged view of the right end connection of FIG. 4 according to the present invention;
FIG. 6 is an enlarged view of the right end connection of FIG. 5 in accordance with the present invention;
FIG. 7 is an enlarged view of the right end connection of FIG. 6 according to the present invention;
FIG. 8 is a block diagram of a data computing unit according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to FIGS. 1-8, the invention provides a technical solution: a small FPGA-based heterogeneous distributed computing system comprises a data input module, a data calculation module and a data return module;
the data input module is used for scattering and recombining data and sending it serially to the data calculation module in a pipelined manner;
the data calculation module is used for receiving data from the data input module and transmitting results to the data return module;
and the data return module is used for regrouping out-of-order returned data according to the arrival sequence of the output results computed from the preceding stage.
It should be noted that the system is composed of three parts: a data input module, a data calculation module and a data return module. The input module is composed of hardware modules such as a CPU, an FPGA and DDR. After input data reach the input module over the network, they are received directly by the FPGA, then scattered and recombined, and forwarded in a pipelined manner. The CPU in the input module is directly connected to the FPGA at high speed through the QPI protocol; its role is to rapidly and dynamically configure the FPGA's transmit/receive strategies without directly participating in data reception, transmission, verification or recombination. The FPGA in the input module internally implements a complete TCP/IP protocol stack and is externally equipped with a group of Ethernet physical interfaces (two in total), one dedicated to receiving data and the other to forwarding it. At the receiving end, serial input data are deserialized into parallel data and passed to the recombination pipeline module; at the output end, before forwarding, the parallel output of the recombination pipeline module is converted back into serial data through local frequency multiplication. The recombined data are then distributed serially to the subsequent computing modules at a rate several times higher than the input rate. The computing module is composed of a group of computing units, each consisting of a single FPGA, DDR and two Ethernet physical interfaces.
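The serial-to-parallel step at the receiving end can be sketched in software (an illustrative shift-register model; the bit width and MSB-first ordering are assumptions, not taken from the patent):

```python
# Hypothetical deserializer model: serial bits arrive one per clock and are
# shifted into a register; a parallel word is emitted whenever it fills.
def deserialize(bits, width=8):
    words, shift_reg = [], []
    for b in bits:
        shift_reg.append(b)
        if len(shift_reg) == width:
            words.append(int("".join(map(str, shift_reg)), 2))  # MSB first
            shift_reg = []
    return words

# 16 serial bits become two parallel bytes: 0b00000011 and 0b11111111.
assert deserialize([0,0,0,0,0,0,1,1, 1,1,1,1,1,1,1,1]) == [3, 255]
```

The inverse step on the output side (parallel back to serial via local frequency multiplication) would simply shift each word's bits out at the multiplied clock rate.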
The data distributed from the input module reach each computing unit through a switch; the computing unit receives the data through an IP core that internally implements the TCP/IP protocol stack, passes them to a dedicated computation IP core, and after computation forwards the results through an Ethernet interface to the post-stage return module. The hardware composition of the data return module is the same as that of the data input module, but its FPGA handles out-of-order returned data; specifically, the data return module groups the data according to the arrival sequence of the output results computed by the preceding-stage computing module.
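The return module's regrouping behaves like a reorder buffer; a minimal software sketch (illustrative, assuming the input module assigns monotonically increasing sequence numbers) is:

```python
# Hypothetical reorder sketch: results carry the sequence number assigned at
# the input stage; early arrivals are buffered, and a contiguous run is
# released as soon as the next expected number lands.
def reorder_stream(arrivals):
    """arrivals: iterable of (seq, result) in completion order.
    Yields results strictly in sequence order."""
    pending, expected = {}, 0
    for seq, result in arrivals:
        pending[seq] = result
        while expected in pending:  # release every contiguous buffered result
            yield pending.pop(expected)
            expected += 1

out_of_order = [(2, "c"), (0, "a"), (3, "d"), (1, "b")]
assert list(reorder_stream(out_of_order)) == ["a", "b", "c", "d"]
```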
In still further embodiments, the data input module includes, but is not limited to, CPU, FPGA and DDR hardware modules;
the FPGA module is used for receiving data and scattering and recombining the data;
the CPU module is directly connected with the FPGA module at a high speed through a QPI protocol and is used for the CPU module to rapidly and dynamically configure the FPGA module to receive and transmit data.
In a further embodiment, the data input module further includes at least two groups of Ethernet physical interfaces, one group of which is used for receiving data;
and the other group of Ethernet physical interfaces is used for data forwarding.
In a further embodiment, the data input module further comprises a recombination pipeline module, and the Ethernet physical interface for receiving data may deserialize serial input data and pass the resulting parallel data to the recombination pipeline module.
In a further embodiment, the data calculation module includes at least one group of data calculation units, and each data calculation unit includes a single FPGA, DDR and at least two groups of Ethernet physical interfaces.
In a further embodiment, the data return module includes a post-stage processing module, and the post-stage processing module is configured to improve the data throughput of the recombined data by means of deep pipelining;
as shown in fig. 2, the hardware framework of the distributed heterogeneous computing system designed by the present invention includes a front-end data distribution module, a data computing unit, and a data returning unit. Fig. 3 shows a specific embodiment of fig. 2. The data distribution module adopts a CPU + FPGA architecture, and the CPU and the FPGA are connected through a PCIE or QPI bus. The front-end network data is input into the data distribution module through a route or a switch, and the FPGA in the data distribution module and the cascaded DDR thereof are used for caching together. If the later-stage computing module does not need to recombine the data at the moment, the FPGA directly distributes the cached data in parallel through the data distribution IP unit integrated inside. If the later-stage FPGA computing unit needs to recombine the data before computing, the data is directly connected to the data recombination module in series behind the FPGA buffer module and then forwarded to the later-stage computing unit. If the data recombination is complex and the recombination strategy needs to be changed dynamically, the operation required by the recombination can be converted into an instruction corresponding to an MIG module inside the FPGA, and the instruction is directly sent to the FPGA through a PCIE or QPI bus directly connected with the CPU and the FPGA, so that the FPGA can change the strategy of the data recombination rapidly while buffering data efficiently. The data calculation unit is completely composed of a plurality of groups of single FPGA, and the total amount of the data calculation unit is dynamically distributed according to the actual calculation amount or the communication task. The data computation unit in the monolithic FPGA is completed by an internal unique IPCore. 
The hardware composition of the data return unit is identical to that of the data distribution module; the differences lie in the MIG instructions the CPU transmits to the FPGA and in the specific design of the data-result recombination module and result return module inside the FPGA.
As shown in FIG. 3, the data distribution module is cascaded with the data computation module through a switch or other network device, and the data computing unit is cascaded with the post-stage data return module through another switch or other network device; two sets of network devices are used in order to fully match the deep pipeline structure of the computing-unit module and thereby guarantee the system's high data throughput.
As shown in FIGS. 4-7, the data distribution and data return modules share the same hardware architecture, with two network physical interfaces (RJ45, or ST and SC) provided on the FPGA's periphery. The data distribution module receives network computing data through one port, performs data recombination in a deeply pipelined manner through an internal dedicated IP core, and forwards the data to the post-stage processing module through the other port; the dual-port, deeply pipelined design greatly improves data throughput. For the data return module, the internal dedicated IP core differs from that of the data-receiving module: its function is to repackage the computation results that arrive out of order back into regular order, attach labels to them, and then transmit them back to the post-stage module. Both the data distribution and data return modules implement the network protocol stack inside the FPGA. As shown in FIG. 8, the data computing unit consists of a single FPGA plus dual network interfaces. According to actual requirements, a single computing unit can be deployed as a single node, or, depending on the complexity of the computing task, units can be locally interconnected into a star or ring network; the resulting local network, together with the other nodes, forms the computing-unit part of the computing system. The computing-unit part is thus structurally configurable to match the needs of the computing task, and inside each computing-unit node a dedicated IP core performs pipelined parallel computation.
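The two local interconnect shapes mentioned above can be written down as link sets (a purely illustrative sketch; node counts and numbering are arbitrary):

```python
# Hypothetical topology builders for the configurable compute-unit network.
def star_links(n):
    """Star: node 0 is the hub; every other node links only to it."""
    return {(0, i) for i in range(1, n)}

def ring_links(n):
    """Ring: each node links to its successor, wrapping around."""
    return {(i, (i + 1) % n) for i in range(n)}

assert star_links(4) == {(0, 1), (0, 2), (0, 3)}
assert ring_links(4) == {(0, 1), (1, 2), (2, 3), (3, 0)}
```

A star minimizes hop count through a hub at the cost of a bottleneck, while a ring spreads traffic but grows the worst-case path length, which is why the choice is left to the complexity of the computing task.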
In the description herein, references to the description of "one embodiment," "an example," "a specific example" or the like are intended to mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The preferred embodiments of the invention disclosed above are intended to be illustrative only. The preferred embodiments are not intended to be exhaustive or to limit the invention to the precise embodiments disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best utilize the invention. The invention is limited only by the claims and their full scope and equivalents.

Claims (6)

1. A small FPGA-based heterogeneous distributed computing system, characterized in that it comprises a data input module, a data calculation module and a data return module;
the data input module is used for scattering and recombining data and sending it serially to the data calculation module in a pipelined manner;
the data calculation module is used for receiving data from the data input module and transmitting results to the data return module;
and the data return module is used for regrouping out-of-order returned data according to the arrival sequence of the output results computed from the preceding stage.
2. The small FPGA-based heterogeneous distributed computing system of claim 1, wherein: the data input module includes, but is not limited to, CPU, FPGA and DDR hardware modules;
the FPGA module is used for receiving data and scattering and recombining the data;
the CPU module is directly connected with the FPGA module at a high speed through a QPI protocol and is used for the CPU module to rapidly and dynamically configure the FPGA module to receive and transmit data.
3. The small FPGA-based heterogeneous distributed computing system of claim 2, wherein: the data input module further comprises at least two groups of Ethernet physical interfaces, one group of which is used for receiving data;
and the other group of Ethernet physical interfaces is used for data forwarding.
4. The small FPGA-based heterogeneous distributed computing system of claim 3, wherein: the data input module further comprises a recombination pipeline module, and the Ethernet physical interface for receiving data can deserialize serial input data and transmit the resulting parallel data to the recombination pipeline module.
5. The small FPGA-based heterogeneous distributed computing system of claim 1, wherein: the data calculation module comprises at least one group of data calculation units, and each data calculation unit comprises a single FPGA, DDR and at least two groups of Ethernet physical interfaces.
6. The small FPGA-based heterogeneous distributed computing system of claim 1, wherein: the data return module comprises a post-stage processing module, and the post-stage processing module is used for improving the data throughput of the recombined data in a deep-pipeline manner.
CN201811247613.XA 2018-10-24 2018-10-24 Small heterogeneous distributed computing system based on FPGA Pending CN111090611A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811247613.XA CN111090611A (en) 2018-10-24 2018-10-24 Small heterogeneous distributed computing system based on FPGA

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811247613.XA CN111090611A (en) 2018-10-24 2018-10-24 Small heterogeneous distributed computing system based on FPGA

Publications (1)

Publication Number Publication Date
CN111090611A (en) 2020-05-01

Family

ID=70392706

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811247613.XA Pending CN111090611A (en) 2018-10-24 2018-10-24 Small heterogeneous distributed computing system based on FPGA

Country Status (1)

Country Link
CN (1) CN111090611A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114531459A (en) * 2020-11-03 2022-05-24 深圳市明微电子股份有限公司 Cascade equipment parameter self-adaptive obtaining method, device, system and storage medium
WO2023093043A1 (en) * 2021-11-26 2023-06-01 浪潮电子信息产业股份有限公司 Data processing method and apparatus, and medium

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090109974A1 (en) * 2007-10-31 2009-04-30 Shetty Suhas A Hardware Based Parallel Processing Cores with Multiple Threads and Multiple Pipeline Stages
US20090177832A1 (en) * 2007-11-12 2009-07-09 Supercomputing Systems Ag Parallel computer system and method for parallel processing of data
CN104657330A (en) * 2015-03-05 2015-05-27 浪潮电子信息产业股份有限公司 High-performance heterogeneous computing platform based on x86 architecture processor and FPGA (Field Programmable Gate Array)
CN106339351A (en) * 2016-08-30 2017-01-18 浪潮(北京)电子信息产业有限公司 SGD (Stochastic Gradient Descent) algorithm optimization system and method
CN107066802A (en) * 2017-01-25 2017-08-18 人和未来生物科技(长沙)有限公司 A kind of heterogeneous platform calculated towards gene data
CN107145467A (en) * 2017-05-13 2017-09-08 贾宏博 A kind of distributed computing hardware system in real time
CN107273331A (en) * 2017-06-30 2017-10-20 山东超越数控电子有限公司 A kind of heterogeneous computing system and method based on CPU+GPU+FPGA frameworks
CN108052839A (en) * 2018-01-25 2018-05-18 知新思明科技(北京)有限公司 Mimicry task processor
CN108268278A (en) * 2016-12-30 2018-07-10 英特尔公司 Processor, method and system with configurable space accelerator
CN108459988A (en) * 2017-02-17 2018-08-28 英特尔公司 Duration direct distance input and output
CN108563808A (en) * 2018-01-05 2018-09-21 中国科学技术大学 The design method of heterogeneous reconfigurable figure computation accelerator system based on FPGA

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CHI ZHANG et al.: "High Throughput Large Scale Sorting on a CPU-FPGA Heterogeneous Platform" *
王慕所: "Research on Component-Oriented Communication Middleware Technology" *
高子航: "Research on Real-Time Packet Reassembly and Multi-Protocol Transmission Technology" *

Similar Documents

Publication Publication Date Title
CN104820657A (en) Inter-core communication method and parallel programming model based on embedded heterogeneous multi-core processor
US10140124B2 (en) Reconfigurable microprocessor hardware architecture
CN110610236A (en) Device for executing neural network operation
US11392740B2 (en) Dataflow function offload to reconfigurable processors
JP7389231B2 (en) synchronous network
CN102135950A (en) On-chip heterogeneous multi-core system based on star type interconnection structure, and communication method thereof
CN111090611A (en) Small heterogeneous distributed computing system based on FPGA
KR20210029725A (en) Data through gateway
WO2021201997A1 (en) Deep neural network accelerator with independent datapaths for simultaneous processing of different classes of operations
He et al. Accl: Fpga-accelerated collectives over 100 gbps tcp-ip
CN113114593A (en) Dual-channel router in network on chip and routing method thereof
Haghi et al. A reconfigurable compute-in-the-network fpga assistant for high-level collective support with distributed matrix multiply case study
CN113407479A (en) Many-core architecture embedded with FPGA and data processing method thereof
CN101707599A (en) DSP based Ethernet communication method in fault recording system
US20210406214A1 (en) In-network parallel prefix scan
CN103916316A (en) Linear speed capturing method of network data packages
US8589584B2 (en) Pipelining protocols in misaligned buffer cases
US10445099B2 (en) Reconfigurable microprocessor hardware architecture
CN114138707B (en) Data transmission system based on FPGA
CN112673351A (en) Streaming engine
Gao et al. Impact of reconfigurable hardware on accelerating mpi_reduce
Zhu et al. BiLink: A high performance NoC router architecture using bi-directional link with double data rate
US20110078410A1 (en) Efficient pipelining of rdma for communications
CN113704169A (en) Embedded configurable many-core processor
Ueno et al. VCSN: Virtual circuit-switching network for flexible and simple-to-operate communication in HPC FPGA cluster

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination