CN209746539U - Acceleration card device for self-adaptive programmable storage calculation - Google Patents

Acceleration card device for self-adaptive programmable storage calculation Download PDF

Info

Publication number
CN209746539U
Authority
CN
China
Prior art keywords
accelerator card
chip
main body
memory
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201920826498.5U
Other languages
Chinese (zh)
Inventor
徐彦飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Changjiang Ruixin Electronic Technology Co Ltd
Original Assignee
Suzhou Changjiang Ruixin Electronic Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Changjiang Ruixin Electronic Technology Co Ltd filed Critical Suzhou Changjiang Ruixin Electronic Technology Co Ltd
Priority to CN201920826498.5U priority Critical patent/CN209746539U/en
Application granted granted Critical
Publication of CN209746539U publication Critical patent/CN209746539U/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Advance Control (AREA)

Abstract

The utility model discloses an accelerator card device for adaptive programmable storage computation, belonging to the technical field of hardware-accelerated computing. The device comprises a main body and further comprises: an assembly frame; a PCIE interface arranged on the main body; an X86 processor connected to the PCIE interface; an FPGA accelerator card detachably mounted in the assembly frame and connected to the main body; an on-chip memory block and an on-chip core logic module arranged on the FPGA accelerator card; an OpenCL architecture module; and an on-chip interconnection module. The device provides one-machine-multi-card expansion capability: different numbers of accelerator cards can be configured on a single host, and computing tasks can be distributed across multiple cards, meeting the acceleration needs of algorithms of different scales and greatly improving server operating efficiency, with low power consumption, high performance and low latency.

Description

Acceleration card device for self-adaptive programmable storage calculation
Technical Field
The utility model relates to the technical field of hardware-accelerated computing, and in particular to an accelerator card device for adaptive programmable storage computation.
Background
In recent years, with the development of Internet big-data technology and the rise of the Internet of Things, data-computation tasks in data centers and related embedded devices have become increasingly important. The traditional CPU serial computing model can no longer meet exponentially growing computing demands, and research on parallel accelerators based on adaptive computing has drawn growing interest in academia and industry. Current adaptive accelerators are mainly built from adaptive computing components such as Application-Specific Integrated Circuits (ASICs), Graphics Processing Units (GPUs), Field-Programmable Gate Arrays (FPGAs) and Digital Signal Processors (DSPs).
A search shows that the utility model patent with application number CN201820354999.3 discloses an FPGA and DSP multi-core adaptive accelerated-computation board card in the field of hardware-accelerated computing. That board card comprises an FPGA field-programmable gate array device and, connected to it respectively, a first DSP digital signal processor chip, a second DSP digital signal processor chip, a first FMC expansion connector, a second FMC expansion connector, a PCIE interface and a CPLD complex programmable logic device.
The FPGA and DSP multi-core adaptive acceleration board card of that patent fully exploits the FPGA's flexibility, reconfigurability, high performance, low power consumption, high precision, high speed and short development cycle; it can evolve with iterations of applications and algorithms and has good customization and reconfigurability. It nevertheless has a shortcoming: like a traditional server computing with a CPU and GPU, it lacks one-machine-multi-card expansion capability and cannot adapt to the acceleration needs of algorithms of different scales.
Summary of the Utility Model
The purpose of the utility model is to solve the problems in the prior art by providing an accelerator card device for adaptive programmable storage computation.
To achieve the above purpose, the utility model adopts the following technical scheme:
a device for adaptively storing and calculating an acceleration card comprises a main body and
An assembly frame for loading the main body;
The heat radiation fan is arranged on the inner side wall of the assembling frame;
a PCIE interface arranged on the main body;
the DDR4 memory controller is arranged on the main body and is used for connecting the DDR4 memory;
The X86 processor is connected with the PCIE interface;
The FPGA accelerator card is detachably connected in the assembly frame and is connected with the main body;
The on-chip storage block is arranged on the FPGA accelerator card;
the on-chip core logic module is arranged on the FPGA accelerator card;
The OpenCL framework module is arranged on the main body and used for distributing computing tasks to the plurality of FPGA accelerator cards;
And the in-chip interconnection module is arranged on the main body.
Preferably, the OpenCL architecture module consists mainly of a Host side, a Kernel side and a compiler; the Host side and the Kernel side are in signal connection with the compiler, and the compiler is in signal connection with the X86 processor and the FPGA accelerator card.
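The Host/Kernel/compiler split described above can be sketched as a minimal Python model. All class and function names here are hypothetical illustrations (the patent specifies no API); a real system would use the OpenCL runtime and an FPGA offline compiler.

```python
# Minimal sketch of the OpenCL-style Host/Kernel/compiler split.
# All names are hypothetical illustrations, not an API from the patent.

class Compiler:
    """Stands in for the offline compiler: takes Kernel-side source and
    produces a bitstream-like artifact for an FPGA accelerator card."""
    def compile(self, kernel_source: str) -> dict:
        return {"artifact": f"bitstream({kernel_source})"}

class FPGACard:
    """One accelerator card; it is programmed with a compiled artifact
    and then runs the offloaded computation."""
    def __init__(self, card_id: int):
        self.card_id = card_id
        self.artifact = None
    def program(self, artifact: dict):
        self.artifact = artifact
    def run(self, data):
        # Pretend the card's core logic doubles each element.
        return [2 * x for x in data]

class Host:
    """Host side: runs on the X86 CPU, owns the compiler and the cards,
    and invokes the accelerator logic for the Kernel-side tasks."""
    def __init__(self, cards):
        self.compiler = Compiler()
        self.cards = cards
    def offload(self, kernel_source: str, data):
        artifact = self.compiler.compile(kernel_source)
        card = self.cards[0]  # pick a card; scheduling is elided here
        card.program(artifact)
        return card.run(data)

host = Host([FPGACard(0)])
result = host.offload("double_elements", [1, 2, 3])
print(result)  # [2, 4, 6]
```

The point of the split is that the same Host-side program can target any number of cards: only `Compiler.compile` and `FPGACard.program` deal with card-specific artifacts.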
Preferably, the on-chip interconnection module consists mainly of a global memory interconnection network and a local memory interconnection network; the global memory interconnection network communicatively connects the on-chip core logic module with the PCIE interface and the DDR4 memory controller, and the local memory interconnection network communicatively connects the on-chip core logic module with the on-chip memory block.
Preferably, the local memory interconnection network adopts an 8-bank high-concurrency array for fast access to local data on the Kernel side.
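An 8-bank local memory achieves high concurrency by interleaving consecutive addresses across the banks, so a burst of sequential accesses touches all banks at once. A minimal sketch (the low-order interleaving rule and conflict model are illustrative assumptions, not taken from the patent):

```python
NUM_BANKS = 8  # the 8-bank array described above

def bank_of(addr: int) -> int:
    """Low-order interleaving: consecutive word addresses land in
    different banks, so 8 sequential accesses can proceed in parallel."""
    return addr % NUM_BANKS

def conflicts(addrs) -> int:
    """Count bank conflicts in one access group: any access that maps
    to a bank already used in the group must serialize behind it."""
    used, n = set(), 0
    for a in addrs:
        b = bank_of(a)
        if b in used:
            n += 1
        used.add(b)
    return n

# A stride-1 burst of 8 words hits all 8 banks: fully concurrent.
print(conflicts(range(8)))          # 0
# A stride-8 burst hits the same bank every time: 7 serializations.
print(conflicts(range(0, 64, 8)))   # 7
```

This is why the Kernel side's local data layout matters: access patterns whose stride is coprime with the bank count keep all 8 banks busy.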
Preferably, the on-chip core logic module is composed of a highly concurrent, deeply pipelined customized computing resource stack.
Preferably, the FPGA accelerator card has customized control logic, external interface logic and internal interconnection logic inside.
Preferably, the hot-spot portion of the Kernel side mapped onto the FPGA accelerator card is connected and adapted to the control logic, external interface logic and internal interconnection logic customized inside the FPGA accelerator card.
Preferably, the FPGA accelerator card is fastened to the assembly frame with screws.
Preferably, the assembly frame is provided with uniformly distributed heat-dissipation holes.
Preferably, a plurality of DMA engines are connected inside the DDR4 memory controller to perform read/write control of the DDR4 memory.
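With multiple DMA engines, one large DDR4 transfer can be split into independent chunks that move in parallel. A rough sketch of such a split (the engine count and contiguous chunking policy are illustrative assumptions; the patent does not specify a scheduling scheme):

```python
def split_transfer(base: int, length: int, num_dma: int):
    """Split one [base, base + length) transfer into per-engine jobs.
    Each DMA engine gets a contiguous slice; the last engine absorbs
    any remainder, so the jobs exactly cover the original range."""
    chunk = length // num_dma
    jobs = []
    for i in range(num_dma):
        start = base + i * chunk
        end = base + length if i == num_dma - 1 else start + chunk
        jobs.append((i, start, end - start))  # (engine, start, size)
    return jobs

# One 1 MiB transfer split across 4 DMA engines.
jobs = split_transfer(base=0x1000, length=1 << 20, num_dma=4)
for engine, start, size in jobs:
    print(f"DMA{engine}: start=0x{start:x} size=0x{size:x}")
```

Because the slices are disjoint, the engines need no coordination beyond a completion notification per job.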
Compared with the prior art, the accelerator card device for adaptive programmable storage computation provided by the utility model has the following beneficial effects:
The Kernel side uses the OpenCL SDK to automatically map the hot-spot portions of an algorithm into core logic on the FPGA accelerator card, which is connected and adapted to the control logic, external interface logic and internal interconnection logic pre-customized in the card, improving operating speed. The core logic module is composed of a highly concurrent, deeply pipelined customized computing resource stack; it is generated by the OpenCL SDK tool-chain mapping and closely matches the computing hot spots of various target algorithms, which helps raise computing speed. The on-chip core logic module is communicatively connected with the PCIE interface and the DDR4 memory controller to form a global memory interconnection network, and with the on-chip memory block to form a local memory interconnection network, which speeds up access to local data in the Kernel and further raises computing speed. By installing multiple FPGA accelerator cards, computing tasks are distributed across them, meeting the acceleration needs of algorithms of different scales and greatly improving server operating efficiency.
Drawings
Fig. 1 is a first structural schematic diagram of an accelerator card device for adaptive programmable storage computation according to the present utility model;
Fig. 2 is a second structural schematic diagram of the accelerator card device;
Fig. 3 is a block diagram of the on-chip logic architecture of the accelerator card device;
Fig. 4 is a block diagram of the OpenCL architecture module of the accelerator card device.
In the figure: 1. a main body; 2. assembling a frame; 3. a heat radiation fan; 4. a PCIE interface; 5. a DDR4 memory controller; 6. an X86 processor; 7. an FPGA accelerator card; 8. an on-chip memory block; 9. an on-chip core logic module; 10. an OpenCL architecture module; 11. a global memory interconnect network; 12. a local memory interconnect network.
Detailed Description
The technical solutions in the embodiments of the present utility model will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present utility model, not all of them; all other embodiments obtained by those skilled in the art without inventive work, based on the embodiments of the present utility model, fall within its scope of protection.
In the description of the present utility model, it should be noted that terms such as "upper", "lower", "inner", "outer" and "top/bottom" indicate orientations or positional relationships based on those shown in the drawings; they are used only for convenience and simplicity of description and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation, and thus should not be construed as limiting the present utility model. Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the description of the present utility model, it should also be noted that, unless otherwise explicitly specified and limited, terms such as "mounted", "provided" and "connected" are to be understood broadly: a connection may be fixed, detachable or integral; mechanical or electrical; direct, or indirect through an intermediate medium, or an internal communication between two components. The specific meanings of the above terms in the present utility model can be understood by those skilled in the art according to the specific circumstances.
Embodiment:
Referring to figs. 1-4, an accelerator card device for adaptive programmable storage computation comprises a main body 1, and further comprises:
an assembly frame 2 for carrying the main body 1;
a heat-dissipation fan 3 arranged on the inner side wall of the assembly frame 2;
a PCIE interface 4 arranged on the main body 1;
a DDR4 memory controller 5 arranged on the main body 1 and used for connecting DDR4 memory;
an X86 processor 6 connected to the PCIE interface 4;
an FPGA accelerator card 7 detachably mounted in the assembly frame 2 and connected to the main body 1;
an on-chip memory block 8 arranged on the FPGA accelerator card 7;
an on-chip core logic module 9 arranged on the FPGA accelerator card 7;
an OpenCL architecture module 10 arranged on the main body 1 and used for distributing computing tasks to a plurality of FPGA accelerator cards 7;
and an on-chip interconnection module arranged on the main body 1.
The OpenCL architecture module 10 consists mainly of a Host side, a Kernel side and a compiler; the Host side and the Kernel side are in signal connection with the compiler, and the compiler is in signal connection with the X86 processor 6 and the FPGA accelerator card 7.
The on-chip interconnection module consists mainly of a global memory interconnection network 11 and a local memory interconnection network 12; the global memory interconnection network 11 communicatively connects the on-chip core logic module 9 with the PCIE interface 4 and the DDR4 memory controller 5, and the local memory interconnection network 12 communicatively connects the on-chip core logic module 9 with the on-chip memory block 8.
The local memory interconnection network 12 adopts an 8-bank high-concurrency array for fast access to local data on the Kernel side.
The on-chip core logic module 9 is composed of a highly concurrent, deeply pipelined customized computing resource stack.
The FPGA accelerator card 7 has customized control logic, external interface logic and internal interconnection logic inside.
The hot-spot portion of the Kernel side mapped onto the FPGA accelerator card 7 is connected and adapted to the control logic, external interface logic and internal interconnection logic customized inside the card.
The FPGA accelerator card 7 is fastened to the assembly frame 2 with screws.
The assembly frame 2 is provided with uniformly distributed heat-dissipation holes.
A plurality of DMA engines are connected inside the DDR4 memory controller 5 to perform read/write control of the DDR4 memory.
The Host side in the OpenCL architecture module 10 uses a standard C/C++ compilation tool chain; after being linked with the FPGA accelerator card 7, it runs on the main CPU and, during operation, invokes the accelerator logic in the FPGA accelerator card 7 to carry out the Kernel side's computing tasks. The Kernel side uses the OpenCL SDK to automatically map the hot-spot portions of an algorithm into core logic on the FPGA accelerator card 7, connected and adapted to the control logic, external interface logic and internal interconnection logic customized in the card, to improve operating speed. The core logic module is composed of a highly concurrent, deeply pipelined customized computing resource stack; it is generated by the OpenCL SDK tool-chain mapping and closely matches the computing hot spots of various target algorithms, improving computing speed. The on-chip core logic module 9 is communicatively connected with the PCIE interface 4 and the DDR4 memory controller 5 to form the global memory interconnection network 11, and with the on-chip memory block 8 to form the local memory interconnection network 12, which speeds up access to local data in the Kernel and further raises computing speed. By installing multiple FPGA accelerator cards 7, computing tasks are distributed across them, meeting the acceleration needs of algorithms of different scales and greatly improving server operating efficiency.
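The hot-spot mapping step described above, deciding which parts of an algorithm are worth turning into FPGA core logic, can be sketched as a simple profile-driven filter. The profile format and the 10% threshold are illustrative assumptions; the patent only says the SDK maps hot-spot portions automatically:

```python
def select_hotspots(profile: dict, threshold: float = 0.10):
    """Given {function_name: fraction_of_total_runtime}, return the
    functions at or above the threshold, most expensive first. These
    are the candidates a tool chain would map to FPGA core logic;
    everything below the threshold stays on the X86 host."""
    hot = [(name, frac) for name, frac in profile.items() if frac >= threshold]
    return [name for name, _ in sorted(hot, key=lambda kv: -kv[1])]

# Hypothetical runtime profile of an application.
profile = {"matmul": 0.62, "fft": 0.25, "io_setup": 0.08, "logging": 0.05}
print(select_hotspots(profile))  # ['matmul', 'fft']
```

Concentrating the card's customized computing resources on the few functions that dominate runtime is what makes the mapped core logic "highly matched" to a target algorithm.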
The above is only a specific implementation of the preferred embodiments of the present utility model, but the scope of protection of the present utility model is not limited thereto; any equivalent replacement or change that a person skilled in the art can make within the technical scope disclosed by the present utility model, according to its technical solution and concept, shall be covered by its scope of protection.

Claims (10)

1. An accelerator card device for adaptive programmable storage computation, comprising a main body (1), characterized in that it further comprises:
an assembly frame (2) for carrying the main body (1);
a heat-dissipation fan (3) arranged on the inner side wall of the assembly frame (2);
a PCIE interface (4) arranged on the main body (1);
a DDR4 memory controller (5) arranged on the main body (1) and used for connecting DDR4 memory;
an X86 processor (6) connected to the PCIE interface (4);
an FPGA accelerator card (7) detachably mounted in the assembly frame (2) and connected to the main body (1);
an on-chip memory block (8) arranged on the FPGA accelerator card (7);
an on-chip core logic module (9) arranged on the FPGA accelerator card (7);
an OpenCL architecture module (10) arranged on the main body (1) and used for distributing computing tasks to a plurality of FPGA accelerator cards (7);
and an on-chip interconnection module arranged on the main body (1).
2. The device of claim 1, characterized in that the OpenCL architecture module (10) consists mainly of a Host side, a Kernel side and a compiler; the Host side and the Kernel side are in signal connection with the compiler, and the compiler is in signal connection with the X86 processor (6) and the FPGA accelerator card (7).
3. The device of claim 2, characterized in that the on-chip interconnection module consists mainly of a global memory interconnection network (11) and a local memory interconnection network (12); the global memory interconnection network (11) communicatively connects the on-chip core logic module (9) with the PCIE interface (4) and the DDR4 memory controller (5), and the local memory interconnection network (12) communicatively connects the on-chip core logic module (9) with the on-chip memory block (8).
4. The device of claim 3, characterized in that the local memory interconnection network (12) adopts an 8-bank high-concurrency array for fast access to local data on the Kernel side.
5. The device of claim 4, characterized in that the on-chip core logic module (9) is composed of a highly concurrent, deeply pipelined customized computing resource stack.
6. The device of claim 5, characterized in that the FPGA accelerator card (7) has customized control logic, external interface logic and internal interconnection logic inside.
7. The device of claim 6, characterized in that the hot-spot portion of the Kernel side mapped onto the FPGA accelerator card (7) is connected and adapted to the control logic, external interface logic and internal interconnection logic customized inside the FPGA accelerator card (7).
8. The device of any one of claims 1-7, characterized in that the FPGA accelerator card (7) is fastened to the assembly frame (2) with screws.
9. The device of any one of claims 1-7, characterized in that the assembly frame (2) is provided with uniformly distributed heat-dissipation holes.
10. The device of any one of claims 1-7, characterized in that a plurality of DMA engines are connected inside the DDR4 memory controller (5) to perform read/write control of the DDR4 memory.
CN201920826498.5U 2019-06-03 2019-06-03 Acceleration card device for self-adaptive programmable storage calculation Active CN209746539U (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201920826498.5U CN209746539U (en) 2019-06-03 2019-06-03 Acceleration card device for self-adaptive programmable storage calculation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201920826498.5U CN209746539U (en) 2019-06-03 2019-06-03 Acceleration card device for self-adaptive programmable storage calculation

Publications (1)

Publication Number Publication Date
CN209746539U true CN209746539U (en) 2019-12-06

Family

ID=68723409

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201920826498.5U Active CN209746539U (en) 2019-06-03 2019-06-03 Acceleration card device for self-adaptive programmable storage calculation

Country Status (1)

Country Link
CN (1) CN209746539U (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110083558A (en) * 2019-06-03 2019-08-02 苏州长江睿芯电子科技有限公司 One kind is calculated for adaptively programmable storage accelerates card device
US11467836B2 (en) 2020-02-07 2022-10-11 Alibaba Group Holding Limited Executing cross-core copy instructions in an accelerator to temporarily store an operand that cannot be accommodated by on-chip memory of a primary core into a secondary core


Similar Documents

Publication Publication Date Title
Jouppi et al. Motivation for and evaluation of the first tensor processing unit
Jouppi et al. Ten lessons from three generations shaped Google's TPUv4i: Industrial product
Singh et al. NERO: A near high-bandwidth memory stencil accelerator for weather prediction modeling
CN102073481B (en) Multi-kernel DSP reconfigurable special integrated circuit system
CN106886177B (en) Radar signal processing system
Alam et al. Early evaluation of IBM BlueGene/P
KR101668899B1 (en) Communication between internal and external processors
CN209746539U (en) Acceleration card device for self-adaptive programmable storage calculation
CN110083558A (en) One kind is calculated for adaptively programmable storage accelerates card device
CN101833441A (en) Parallel vector processing engine structure
CN103020002A (en) Reconfigurable multiprocessor system
Meng et al. Analysis and runtime management of 3D systems with stacked DRAM for boosting energy efficiency
Ahmad et al. Design of an energy aware petaflops class high performance cluster based on power architecture
Torabzadehkashi et al. Accelerating hpc applications using computational storage devices
Gan et al. Solving mesoscale atmospheric dynamics using a reconfigurable dataflow architecture
Chen et al. GCIM: Towards Efficient Processing of Graph Convolutional Networks in 3D-Stacked Memory
US20230065842A1 (en) Prediction and optimization of multi-kernel circuit design performance using a programmable overlay
CN107766286A (en) A kind of Systemon-board implementation method based on FPGA
Radhakrishnan et al. The blackford northbridge chipset for the intel 5000
CN107729284A (en) A kind of calculating card based on multi-chip parallel processing
Morganti et al. Implementing a space-aware stochastic simulator on low-power architectures: a systems biology case study
Chen et al. Reducing virtual-to-physical address translation overhead in distributed shared memory based multi-core network-on-chips according to data property
Xue et al. Softssd: Software-defined ssd development platform for rapid flash firmware prototyping
Salapura et al. Exploiting workload parallelism for performance and power optimization in Blue Gene
Zhang et al. Design of the Main Control RISC-V Processor in Chiplet Applications

Legal Events

Date Code Title Description
GR01 Patent grant