CN209746539U - Acceleration card device for self-adaptive programmable storage calculation - Google Patents

Acceleration card device for self-adaptive programmable storage calculation Download PDF

Info

Publication number
CN209746539U
Authority
CN
China
Prior art keywords
accelerator card
chip
main body
memory
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201920826498.5U
Other languages
Chinese (zh)
Inventor
徐彦飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Changjiang Ruixin Electronic Technology Co Ltd
Original Assignee
Suzhou Changjiang Ruixin Electronic Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Changjiang Ruixin Electronic Technology Co Ltd filed Critical Suzhou Changjiang Ruixin Electronic Technology Co Ltd
Priority to CN201920826498.5U priority Critical patent/CN209746539U/en
Application granted granted Critical
Publication of CN209746539U publication Critical patent/CN209746539U/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Advance Control (AREA)

Abstract

The utility model discloses an accelerator card device for adaptive programmable storage computation, belonging to the technical field of hardware-accelerated computing. The device comprises a main body and further comprises: an assembly frame; a PCIE interface arranged on the main body; an X86 processor connected to the PCIE interface; an FPGA accelerator card detachably mounted in the assembly frame and connected to the main body; an on-chip memory block and an on-chip core logic module arranged on the FPGA accelerator card; an OpenCL architecture module; and an on-chip interconnection module. The device provides one-machine-multi-card expansion capability: different numbers of accelerator cards can be configured on a single host, and computing tasks can be distributed across multiple cards, meeting the acceleration needs of algorithms of different scales and greatly improving server operating efficiency, with low power consumption, high performance and low latency.

Description

Acceleration card device for self-adaptive programmable storage calculation
Technical Field
The utility model relates to the technical field of hardware-accelerated computing, and in particular to an accelerator card device for adaptive programmable storage computation.
Background
In recent years, with the development of Internet big-data technology and the rise of the Internet of Things, data-computation tasks in data centers and related embedded devices have become increasingly important. The traditional CPU serial computing model can no longer meet exponentially growing computing demands, and research on parallel accelerators based on adaptive computing has drawn growing interest in academia and industry. Current adaptive accelerators are mainly built from adaptive computing components such as Application-Specific Integrated Circuits (ASICs), Graphics Processing Units (GPUs), Field-Programmable Gate Arrays (FPGAs) and Digital Signal Processors (DSPs).
A search shows that the utility model patent with application number CN201820354999.3 discloses an FPGA and DSP multi-core adaptive accelerated-computation board card in the field of hardware-accelerated computing. That board card comprises an FPGA field-programmable gate array device and, connected to it respectively, a first DSP digital signal processor chip, a second DSP digital signal processor chip, a first FMC expansion connector, a second FMC expansion connector, a PCIE interface and a CPLD complex programmable logic device.
The FPGA and DSP multi-core adaptive acceleration board card of that patent fully exploits the FPGA's flexibility, reconfigurability, high performance, low power consumption, high precision, high speed and short development cycle; it can evolve with iterations of applications and algorithms and has good customization and reconfigurability. It nevertheless has a shortcoming: like a traditional server computing with a CPU and GPU, it lacks one-machine-multi-card expansion capability and cannot adapt to the acceleration needs of algorithms of different scales.
Summary of the Utility Model
The purpose of the utility model is to solve the problems in the prior art by providing an accelerator card device for adaptive programmable storage computation.
To achieve the above purpose, the utility model adopts the following technical scheme:
a device for adaptively storing and calculating an acceleration card comprises a main body and
An assembly frame for loading the main body;
The heat radiation fan is arranged on the inner side wall of the assembling frame;
a PCIE interface arranged on the main body;
the DDR4 memory controller is arranged on the main body and is used for connecting the DDR4 memory;
The X86 processor is connected with the PCIE interface;
The FPGA accelerator card is detachably connected in the assembly frame and is connected with the main body;
The on-chip storage block is arranged on the FPGA accelerator card;
the on-chip core logic module is arranged on the FPGA accelerator card;
The OpenCL framework module is arranged on the main body and used for distributing computing tasks to the plurality of FPGA accelerator cards;
And the in-chip interconnection module is arranged on the main body.
Preferably, the OpenCL architecture module consists mainly of a Host side, a Kernel side and a compiler; the Host side and the Kernel side are in signal connection with the compiler, and the compiler is in signal connection with the X86 processor and the FPGA accelerator card.
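The Host/Kernel/compiler split described above can be sketched as a minimal Python model. All class and function names here are hypothetical illustrations (the patent specifies no API); a real system would use the OpenCL runtime and an FPGA offline compiler.

```python
# Minimal sketch of the OpenCL-style Host/Kernel/compiler split.
# All names are hypothetical illustrations, not an API from the patent.

class Compiler:
    """Stands in for the offline compiler: takes Kernel-side source and
    produces a bitstream-like artifact for an FPGA accelerator card."""
    def compile(self, kernel_source: str) -> dict:
        return {"artifact": f"bitstream({kernel_source})"}

class FPGACard:
    """One accelerator card; it is programmed with a compiled artifact
    and then runs the offloaded computation."""
    def __init__(self, card_id: int):
        self.card_id = card_id
        self.artifact = None
    def program(self, artifact: dict):
        self.artifact = artifact
    def run(self, data):
        # Pretend the card's core logic doubles each element.
        return [2 * x for x in data]

class Host:
    """Host side: runs on the X86 CPU, owns the compiler and the cards,
    and invokes the accelerator logic for the Kernel-side tasks."""
    def __init__(self, cards):
        self.compiler = Compiler()
        self.cards = cards
    def offload(self, kernel_source: str, data):
        artifact = self.compiler.compile(kernel_source)
        card = self.cards[0]  # pick a card; scheduling is elided here
        card.program(artifact)
        return card.run(data)

host = Host([FPGACard(0)])
result = host.offload("double_elements", [1, 2, 3])
print(result)  # [2, 4, 6]
```

The point of the split is that the same Host-side program can target any number of cards: only `Compiler.compile` and `FPGACard.program` deal with card-specific artifacts.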
Preferably, the on-chip interconnection module consists mainly of a global memory interconnection network and a local memory interconnection network; the global memory interconnection network communicatively connects the on-chip core logic module with the PCIE interface and the DDR4 memory controller, and the local memory interconnection network communicatively connects the on-chip core logic module with the on-chip memory block.
Preferably, the local memory interconnection network adopts an 8-bank high-concurrency array for fast access to local data on the Kernel side.
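An 8-bank local memory achieves high concurrency by interleaving consecutive addresses across the banks, so a burst of sequential accesses touches all banks at once. A minimal sketch (the low-order interleaving rule and conflict model are illustrative assumptions, not taken from the patent):

```python
NUM_BANKS = 8  # the 8-bank array described above

def bank_of(addr: int) -> int:
    """Low-order interleaving: consecutive word addresses land in
    different banks, so 8 sequential accesses can proceed in parallel."""
    return addr % NUM_BANKS

def conflicts(addrs) -> int:
    """Count bank conflicts in one access group: any access that maps
    to a bank already used in the group must serialize behind it."""
    used, n = set(), 0
    for a in addrs:
        b = bank_of(a)
        if b in used:
            n += 1
        used.add(b)
    return n

# A stride-1 burst of 8 words hits all 8 banks: fully concurrent.
print(conflicts(range(8)))          # 0
# A stride-8 burst hits the same bank every time: 7 serializations.
print(conflicts(range(0, 64, 8)))   # 7
```

This is why the Kernel side's local data layout matters: access patterns whose stride is coprime with the bank count keep all 8 banks busy.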
Preferably, the on-chip core logic module is composed of a highly concurrent, deeply pipelined customized computing resource stack.
Preferably, the FPGA accelerator card has customized control logic, external interface logic and internal interconnection logic inside.
Preferably, the hot-spot portion of the Kernel side mapped onto the FPGA accelerator card is connected and adapted to the control logic, external interface logic and internal interconnection logic customized inside the FPGA accelerator card.
Preferably, the FPGA accelerator card is fastened to the assembly frame with screws.
Preferably, the assembly frame is provided with uniformly distributed heat-dissipation holes.
Preferably, a plurality of DMA engines are connected inside the DDR4 memory controller to perform read/write control of the DDR4 memory.
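With multiple DMA engines, one large DDR4 transfer can be split into independent chunks that move in parallel. A rough sketch of such a split (the engine count and contiguous chunking policy are illustrative assumptions; the patent does not specify a scheduling scheme):

```python
def split_transfer(base: int, length: int, num_dma: int):
    """Split one [base, base + length) transfer into per-engine jobs.
    Each DMA engine gets a contiguous slice; the last engine absorbs
    any remainder, so the jobs exactly cover the original range."""
    chunk = length // num_dma
    jobs = []
    for i in range(num_dma):
        start = base + i * chunk
        end = base + length if i == num_dma - 1 else start + chunk
        jobs.append((i, start, end - start))  # (engine, start, size)
    return jobs

# One 1 MiB transfer split across 4 DMA engines.
jobs = split_transfer(base=0x1000, length=1 << 20, num_dma=4)
for engine, start, size in jobs:
    print(f"DMA{engine}: start=0x{start:x} size=0x{size:x}")
```

Because the slices are disjoint, the engines need no coordination beyond a completion notification per job.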
Compared with the prior art, the accelerator card device for adaptive programmable storage computation provided by the utility model has the following beneficial effects:
The Kernel side uses the OpenCL SDK to automatically map the hot-spot portions of an algorithm into core logic on the FPGA accelerator card, which is connected and adapted to the control logic, external interface logic and internal interconnection logic pre-customized in the card, improving operating speed. The core logic module is composed of a highly concurrent, deeply pipelined customized computing resource stack; it is generated by the OpenCL SDK tool-chain mapping and closely matches the computing hot spots of various target algorithms, which helps raise computing speed. The on-chip core logic module is communicatively connected with the PCIE interface and the DDR4 memory controller to form a global memory interconnection network, and with the on-chip memory block to form a local memory interconnection network, which speeds up access to local data in the Kernel and further raises computing speed. By installing multiple FPGA accelerator cards, computing tasks are distributed across them, meeting the acceleration needs of algorithms of different scales and greatly improving server operating efficiency.
Drawings
Fig. 1 is a first structural schematic diagram of an accelerator card device for adaptive programmable storage computation according to the present utility model;
Fig. 2 is a second structural schematic diagram of the accelerator card device;
Fig. 3 is a block diagram of the on-chip logic architecture of the accelerator card device;
Fig. 4 is a block diagram of the OpenCL architecture module of the accelerator card device.
In the figure: 1. a main body; 2. assembling a frame; 3. a heat radiation fan; 4. a PCIE interface; 5. a DDR4 memory controller; 6. an X86 processor; 7. an FPGA accelerator card; 8. an on-chip memory block; 9. an on-chip core logic module; 10. an OpenCL architecture module; 11. a global memory interconnect network; 12. a local memory interconnect network.
Detailed Description
The technical solutions in the embodiments of the present utility model will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present utility model, not all of them; all other embodiments obtained by those skilled in the art without inventive work, based on the embodiments of the present utility model, fall within its scope of protection.
In the description of the present utility model, it should be noted that terms such as "upper", "lower", "inner", "outer" and "top/bottom" indicate orientations or positional relationships based on those shown in the drawings; they are used only for convenience and simplicity of description and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation, and thus should not be construed as limiting the present utility model. Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the description of the present utility model, it should also be noted that, unless otherwise explicitly specified and limited, terms such as "mounted", "provided" and "connected" are to be understood broadly: a connection may be fixed, detachable or integral; mechanical or electrical; direct, or indirect through an intermediate medium, or an internal communication between two components. The specific meanings of the above terms in the present utility model can be understood by those skilled in the art according to the specific circumstances.
Embodiment:
Referring to figs. 1-4, an accelerator card device for adaptive programmable storage computation comprises a main body 1, and further comprises:
an assembly frame 2 for carrying the main body 1;
a heat-dissipation fan 3 arranged on the inner side wall of the assembly frame 2;
a PCIE interface 4 arranged on the main body 1;
a DDR4 memory controller 5 arranged on the main body 1 and used for connecting DDR4 memory;
an X86 processor 6 connected to the PCIE interface 4;
an FPGA accelerator card 7 detachably mounted in the assembly frame 2 and connected to the main body 1;
an on-chip memory block 8 arranged on the FPGA accelerator card 7;
an on-chip core logic module 9 arranged on the FPGA accelerator card 7;
an OpenCL architecture module 10 arranged on the main body 1 and used for distributing computing tasks to a plurality of FPGA accelerator cards 7;
and an on-chip interconnection module arranged on the main body 1.
The OpenCL architecture module 10 consists mainly of a Host side, a Kernel side and a compiler; the Host side and the Kernel side are in signal connection with the compiler, and the compiler is in signal connection with the X86 processor 6 and the FPGA accelerator card 7.
The on-chip interconnection module consists mainly of a global memory interconnection network 11 and a local memory interconnection network 12; the global memory interconnection network 11 communicatively connects the on-chip core logic module 9 with the PCIE interface 4 and the DDR4 memory controller 5, and the local memory interconnection network 12 communicatively connects the on-chip core logic module 9 with the on-chip memory block 8.
The local memory interconnection network 12 adopts an 8-bank high-concurrency array for fast access to local data on the Kernel side.
The on-chip core logic module 9 is composed of a highly concurrent, deeply pipelined customized computing resource stack.
The FPGA accelerator card 7 has customized control logic, external interface logic and internal interconnection logic inside.
The hot-spot portion of the Kernel side mapped onto the FPGA accelerator card 7 is connected and adapted to the control logic, external interface logic and internal interconnection logic customized inside the card.
The FPGA accelerator card 7 is fastened to the assembly frame 2 with screws.
The assembly frame 2 is provided with uniformly distributed heat-dissipation holes.
A plurality of DMA engines are connected inside the DDR4 memory controller 5 to perform read/write control of the DDR4 memory.
The Host side in the OpenCL architecture module 10 uses a standard C/C++ compilation tool chain; after being linked with the FPGA accelerator card 7, it runs on the main CPU and, during operation, invokes the accelerator logic in the FPGA accelerator card 7 to carry out the Kernel side's computing tasks. The Kernel side uses the OpenCL SDK to automatically map the hot-spot portions of an algorithm into core logic on the FPGA accelerator card 7, connected and adapted to the control logic, external interface logic and internal interconnection logic customized in the card, to improve operating speed. The core logic module is composed of a highly concurrent, deeply pipelined customized computing resource stack; it is generated by the OpenCL SDK tool-chain mapping and closely matches the computing hot spots of various target algorithms, improving computing speed. The on-chip core logic module 9 is communicatively connected with the PCIE interface 4 and the DDR4 memory controller 5 to form the global memory interconnection network 11, and with the on-chip memory block 8 to form the local memory interconnection network 12, which speeds up access to local data in the Kernel and further raises computing speed. By installing multiple FPGA accelerator cards 7, computing tasks are distributed across them, meeting the acceleration needs of algorithms of different scales and greatly improving server operating efficiency.
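The hot-spot mapping step described above, deciding which parts of an algorithm are worth turning into FPGA core logic, can be sketched as a simple profile-driven filter. The profile format and the 10% threshold are illustrative assumptions; the patent only says the SDK maps hot-spot portions automatically:

```python
def select_hotspots(profile: dict, threshold: float = 0.10):
    """Given {function_name: fraction_of_total_runtime}, return the
    functions at or above the threshold, most expensive first. These
    are the candidates a tool chain would map to FPGA core logic;
    everything below the threshold stays on the X86 host."""
    hot = [(name, frac) for name, frac in profile.items() if frac >= threshold]
    return [name for name, _ in sorted(hot, key=lambda kv: -kv[1])]

# Hypothetical runtime profile of an application.
profile = {"matmul": 0.62, "fft": 0.25, "io_setup": 0.08, "logging": 0.05}
print(select_hotspots(profile))  # ['matmul', 'fft']
```

Concentrating the card's customized computing resources on the few functions that dominate runtime is what makes the mapped core logic "highly matched" to a target algorithm.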
The above is only a specific implementation of the preferred embodiments of the present utility model, but the scope of protection of the present utility model is not limited thereto; any equivalent replacement or change that a person skilled in the art can make within the technical scope disclosed by the present utility model, according to its technical solution and concept, shall be covered by its scope of protection.

Claims (10)

1. An accelerator card device for adaptive programmable storage computation, comprising a main body (1), characterized in that it further comprises:
an assembly frame (2) for carrying the main body (1);
a heat-dissipation fan (3) arranged on the inner side wall of the assembly frame (2);
a PCIE interface (4) arranged on the main body (1);
a DDR4 memory controller (5) arranged on the main body (1) and used for connecting DDR4 memory;
an X86 processor (6) connected to the PCIE interface (4);
an FPGA accelerator card (7) detachably mounted in the assembly frame (2) and connected to the main body (1);
an on-chip memory block (8) arranged on the FPGA accelerator card (7);
an on-chip core logic module (9) arranged on the FPGA accelerator card (7);
an OpenCL architecture module (10) arranged on the main body (1) and used for distributing computing tasks to a plurality of FPGA accelerator cards (7);
and an on-chip interconnection module arranged on the main body (1).
2. The device of claim 1, characterized in that the OpenCL architecture module (10) consists mainly of a Host side, a Kernel side and a compiler; the Host side and the Kernel side are in signal connection with the compiler, and the compiler is in signal connection with the X86 processor (6) and the FPGA accelerator card (7).
3. The device of claim 2, characterized in that the on-chip interconnection module consists mainly of a global memory interconnection network (11) and a local memory interconnection network (12); the global memory interconnection network (11) communicatively connects the on-chip core logic module (9) with the PCIE interface (4) and the DDR4 memory controller (5), and the local memory interconnection network (12) communicatively connects the on-chip core logic module (9) with the on-chip memory block (8).
4. The device of claim 3, characterized in that the local memory interconnection network (12) adopts an 8-bank high-concurrency array for fast access to local data on the Kernel side.
5. The device of claim 4, characterized in that the on-chip core logic module (9) is composed of a highly concurrent, deeply pipelined customized computing resource stack.
6. The device of claim 5, characterized in that the FPGA accelerator card (7) has customized control logic, external interface logic and internal interconnection logic inside.
7. The device of claim 6, characterized in that the hot-spot portion of the Kernel side mapped onto the FPGA accelerator card (7) is connected and adapted to the control logic, external interface logic and internal interconnection logic customized inside the FPGA accelerator card (7).
8. The device of any one of claims 1-7, characterized in that the FPGA accelerator card (7) is fastened to the assembly frame (2) with screws.
9. The device of any one of claims 1-7, characterized in that the assembly frame (2) is provided with uniformly distributed heat-dissipation holes.
10. The device of any one of claims 1-7, characterized in that a plurality of DMA engines are connected inside the DDR4 memory controller (5) to perform read/write control of the DDR4 memory.
CN201920826498.5U 2019-06-03 2019-06-03 Acceleration card device for self-adaptive programmable storage calculation Active CN209746539U (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201920826498.5U CN209746539U (en) 2019-06-03 2019-06-03 Acceleration card device for self-adaptive programmable storage calculation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201920826498.5U CN209746539U (en) 2019-06-03 2019-06-03 Acceleration card device for self-adaptive programmable storage calculation

Publications (1)

Publication Number Publication Date
CN209746539U true CN209746539U (en) 2019-12-06

Family

ID=68723409

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201920826498.5U Active CN209746539U (en) 2019-06-03 2019-06-03 Acceleration card device for self-adaptive programmable storage calculation

Country Status (1)

Country Link
CN (1) CN209746539U (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110083558A (en) * 2019-06-03 2019-08-02 苏州长江睿芯电子科技有限公司 One kind is calculated for adaptively programmable storage accelerates card device
US11467836B2 (en) 2020-02-07 2022-10-11 Alibaba Group Holding Limited Executing cross-core copy instructions in an accelerator to temporarily store an operand that cannot be accommodated by on-chip memory of a primary core into a secondary core


Similar Documents

Publication Publication Date Title
Jouppi et al. Motivation for and evaluation of the first tensor processing unit
Jouppi et al. Ten lessons from three generations shaped Google's TPUv4i: Industrial product
Singh et al. NERO: A near high-bandwidth memory stencil accelerator for weather prediction modeling
CN102073481B (en) Multi-kernel DSP reconfigurable special integrated circuit system
CN106886177B (en) Radar signal processing system
Alam et al. Early evaluation of IBM BlueGene/P
KR101668899B1 (en) Communication between internal and external processors
CN209746539U (en) Acceleration card device for self-adaptive programmable storage calculation
CN110083558A (en) One kind is calculated for adaptively programmable storage accelerates card device
CN101833441A (en) Parallel vector processing engine structure
CN103020002A (en) Reconfigurable multiprocessor system
Meng et al. Analysis and runtime management of 3D systems with stacked DRAM for boosting energy efficiency
Ahmad et al. Design of an energy aware petaflops class high performance cluster based on power architecture
Torabzadehkashi et al. Accelerating hpc applications using computational storage devices
Gan et al. Solving mesoscale atmospheric dynamics using a reconfigurable dataflow architecture
Chen et al. GCIM: Towards Efficient Processing of Graph Convolutional Networks in 3D-Stacked Memory
US20230065842A1 (en) Prediction and optimization of multi-kernel circuit design performance using a programmable overlay
CN107766286A (en) A kind of Systemon-board implementation method based on FPGA
Radhakrishnan et al. The blackford northbridge chipset for the intel 5000
CN107729284A (en) A kind of calculating card based on multi-chip parallel processing
Morganti et al. Implementing a space-aware stochastic simulator on low-power architectures: a systems biology case study
Chen et al. Reducing virtual-to-physical address translation overhead in distributed shared memory based multi-core network-on-chips according to data property
Xue et al. Softssd: Software-defined ssd development platform for rapid flash firmware prototyping
Salapura et al. Exploiting workload parallelism for performance and power optimization in Blue Gene
Zhang et al. Design of the Main Control RISC-V Processor in Chiplet Applications

Legal Events

Date Code Title Description
GR01 Patent grant