CN109447256A - Design method for accelerating the TensorFlow system based on FPGA - Google Patents

Design method for accelerating the TensorFlow system based on FPGA

Info

Publication number
CN109447256A
CN109447256A
Authority
CN
China
Prior art keywords
fpga
opencl
operator
language
kernel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811061386.1A
Other languages
Chinese (zh)
Inventor
张英杰
郭开城
陈勇彪
刘焰强
戚正伟
管海兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN201811061386.1A priority Critical patent/CN109447256A/en
Publication of CN109447256A publication Critical patent/CN109447256A/en
Pending legal-status Critical Current

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Abstract

A method for accelerating the TensorFlow system based on FPGA, comprising the steps of: using Python as the upper-layer client library; encapsulating the modules implemented in OpenCL as an interface for upper-layer client programs to call; preparing the FPGA operator development environment through OpenCL; developing FPGA operators through OpenCL; compiling kernel operators through OpenCL; and incorporating the FPGA device into the overall system. The present invention reduces development difficulty and gives the FPGA, as a device in the TensorFlow system, a guarantee of stability and practicality.

Description

Design method for accelerating the TensorFlow system based on FPGA
Technical field
The present invention relates to the field of heterogeneous computing development. Specifically, it takes certain operators in the TensorFlow artificial intelligence system that are natively implemented on the CPU and realizes them instead on a field-programmable gate array (hereinafter referred to as FPGA).
Background technique
With the development of artificial intelligence, deep neural networks have become increasingly popular in computer vision, natural language processing and other interdisciplinary research fields. A deep neural network extracts features from the input through multiple stacked layers and makes the final decision with a classifier, which means it contains large matrix or convolution operators. Recent evidence shows that the depth of a neural network is critical to its performance, which greatly increases the demand for computing power. Traditional CPUs cannot meet this growing demand: completing inference for a deep neural network often takes a long time, making CPUs unsuitable for practical application scenarios. A common way to solve this problem is to use a heterogeneous distributed environment with various computing devices, such as DSPs, GPUs and custom hardware accelerators such as FPGAs. However, although FPGAs offer flexibility, high energy efficiency and cost benefits, they have not been incorporated into state-of-the-art deep learning frameworks or systems, for example as computation accelerators in TensorFlow; this is closely related to the slowness of their traditional development workflow. Therefore, building FPGAs into TensorFlow as computing accelerators and improving the development efficiency of FPGA accelerators through high-level synthesis is both far-sighted and novel.
Summary of the invention
To solve the problems that, in the prior art, CPUs cannot adequately support the development and running speed of artificial neural networks, and that the TensorFlow artificial intelligence system currently does not support FPGA devices, the present invention provides a design method for accelerating the TensorFlow system based on FPGA, adding the FPGA to the TensorFlow system as a computation accelerator. The FPGA, a programmable logic device, is incorporated into TensorFlow as a basic computing device so that it completes some of the computation in TensorFlow. At the same time, a set of operation kernels corresponding to the FPGA device is implemented, so that various tensor operations gain computing speed on the FPGA and the acceleration effect is guaranteed.
The technical solution of the invention is as follows:
A method for accelerating the TensorFlow system based on FPGA, characterized in that it comprises the following steps:
Step 1: use Python as the upper-layer client library.
Step 2: provide a suitable C-language interface.
The modules implemented in OpenCL are encapsulated as an interface for the upper-layer client programs to call;
Step 3: prepare the FPGA operator development environment through OpenCL.
Step 4: develop FPGA operators through OpenCL.
The required kernel functions are first written in C; the BSP (board support package) and OpenCL kernel compilation environment provided by the device vendor then automatically generate the corresponding binary stream file, and after this file is programmed into the FPGA, the operator can run on the FPGA;
Step 5: compile kernel operators through OpenCL.
The operator described in C is taken as the kernel and is mapped through the BSP provided by the device vendor;
Step 6: incorporate the FPGA device into the overall system.
A C-language interface layer is added to the OpenCL C code, together with some device-ID registration code, so that upper-layer Python and C++ clients can call it directly.
Said preparing the FPGA operator development environment through OpenCL specifically comprises:
querying the platform and devices to determine the model and number of FPGAs in the system;
creating command queues and buffers to stage the operations to be performed and to store the operation data;
mapping the kernel written in C, through the BSP, into a binary stream file executable by the FPGA;
setting execution parameters, whereupon the OpenCL host side executes the kernels according to actual demand.
The execution parameters include the type and number of kernels to execute.
Said compiling kernel operators through OpenCL specifically comprises:
determining the FPGA model and its corresponding BSP;
implementing the operator to be realized in C;
running the compilation command in the environment provided by the device vendor;
whereby the corresponding netlist file is generated, together with the binary stream file that can be programmed into the FPGA to produce the corresponding hardware.
Compared with the prior art, the present invention has the following advantages: adding the FPGA to the TensorFlow system to accelerate neural network operations improves computing speed over the traditional TensorFlow system, which has no FPGA support. Moreover, the corresponding operators are developed with OpenCL; by exploiting OpenCL's own characteristics, this greatly reduces development difficulty compared with the traditional approach of developing operators in a hardware description language and then developing the data-flow management and communication management step by step, and it gives the FPGA, as a device in the TensorFlow system, a guarantee of stability and practicality.
Detailed description of the invention
Fig. 1 is a flowchart of the FPGA-based TensorFlow system acceleration design method of the present invention;
Fig. 2 is the FPGA operator implementation architecture diagram;
Fig. 3 is the OpenCL kernel implementation flow.
Specific embodiment
The present invention is further explained below with reference to the accompanying drawings and embodiments, but this should not be taken to limit the scope of protection of the invention.
The specific implementation of the present invention mainly includes the following steps:
Step 1: select a suitable upper-layer client library.
Implementing neural networks or other vector operations directly in a machine-level language would inevitably be very time-consuming. TensorFlow therefore provides integrated libraries packaged for high-level languages such as Python and C++. The implementation of the present invention also uses these libraries, because the functions they provide have very good abstractions for both implementation and declaration.
As shown in Fig. 1, no matter which language implements the library, the final concrete operations still pass through a series of processes before being executed on devices at the device level. That is, the high-level languages on the upper layer provide a unified programming model for the convenience of users, and the present invention follows this same set of rules: the accelerating operators implemented on the FPGA are likewise packaged, layer by layer, into functions that upper-layer users can call directly from a high-level language.
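The layered dispatch described above, where a high-level call is routed down to a device-specific kernel, can be sketched as a minimal registry in Python. This is an illustrative mock, not TensorFlow's actual dispatch machinery; all names here are hypothetical:

```python
# Hypothetical sketch: a high-level call is routed to a per-device kernel,
# mirroring how an FPGA-backed operator would slot in beside CPU ones.
KERNELS = {}

def register_kernel(op_name, device, fn):
    """Associate an (operator, device) pair with a concrete kernel."""
    KERNELS[(op_name, device)] = fn

def dispatch(op_name, device, *args):
    """Look up the kernel registered for the requested device and run it."""
    return KERNELS[(op_name, device)](*args)

# A CPU reference kernel and a stand-in for the FPGA-backed one.
register_kernel("vec_add", "CPU", lambda a, b: [x + y for x, y in zip(a, b)])
register_kernel("vec_add", "FPGA", lambda a, b: [x + y for x, y in zip(a, b)])

result = dispatch("vec_add", "FPGA", [1, 2, 3], [4, 5, 6])
print(result)  # [5, 7, 9]
```

Upper-layer code only ever sees `dispatch`; which device executes the operator is decided by the registry, which is the pattern the patent relies on when adding FPGA entries alongside existing CPU ones.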
Step 2: provide a suitable C-language interface.
As shown in Fig. 2, although high-level languages make upper-layer development convenient, the goal of the present invention is to add the FPGA to the whole system, so corresponding modifications must still be made to the C-language interface layer that connects the upper layer with the device layer.
The method used is: refer to the existing implementations for CPU and GPU, retain the parts that can be reused, and then develop with OpenCL, because OpenCL is itself an open standard oriented toward general heterogeneous systems and can use C as its implementation carrier. Newer versions of TensorFlow have also implemented support for OpenCL and OpenCL devices, which substantially reduces development difficulty at the interface and improves the interface's stability and practicality.
Step 3: prepare the FPGA operator development environment through OpenCL.
To accelerate the corresponding operators with an FPGA, the traditional method is to write the operator functions in a hardware description language, but such development takes a very long time. OpenCL instead allows the operator functions to be implemented in C and then realized on the FPGA through the board support package (hereinafter BSP) provided by the device vendor.
As shown in Fig. 3, OpenCL provides a standard development process and a large number of C-language interfaces, which can be reused in the second step. First, query the platform and devices to determine the model and number of FPGAs in the system; then create command queues and buffers to stage the operations to be performed and to store the operation data; then map the kernel written in C, through the BSP, into a binary stream file executable by the FPGA. Finally, set execution parameters such as the type and number of kernels to run; the OpenCL host side will execute the kernels according to actual demand, so that the operator is realized by the FPGA. From this process it can be seen that OpenCL greatly reduces the difficulty of connecting the FPGA to the whole system, because the process itself completes most of the work shown in Fig. 2, from data transfer and migration down to device management.
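The host-side sequence above (query platform and devices, create a command queue and buffers, build the kernel, set execution parameters and run) can be mocked in pure Python to make the control flow concrete. This is a simulation only; the classes are hypothetical stand-ins for the OpenCL host API and no real OpenCL runtime is involved:

```python
# Illustrative pure-Python mock of the OpenCL host-side sequence in Fig. 3.
class Device:
    def __init__(self, model):
        self.model = model

class Platform:
    def __init__(self, devices):
        self.devices = devices
    def get_devices(self, device_type="ACCELERATOR"):
        # A real host program would filter by device type here.
        return self.devices

class CommandQueue:
    def __init__(self, device):
        self.device = device
    def enqueue(self, kernel, buffers):
        # A real queue would DMA the buffers and launch on the FPGA;
        # here we simply run the kernel on host memory.
        return kernel(*buffers)

# 1. Query the platform and devices.
platform = Platform([Device("FPGA-A10")])
fpga = platform.get_devices()[0]
# 2. Create a command queue and the input buffers.
queue = CommandQueue(fpga)
buf_a, buf_b = [1.0, 2.0], [3.0, 4.0]
# 3. "Build" the kernel (stands in for the BSP-produced binary stream).
vec_add = lambda a, b: [x + y for x, y in zip(a, b)]
# 4. Set execution parameters and run.
out = queue.enqueue(vec_add, (buf_a, buf_b))
print(out)  # [4.0, 6.0]
```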
Step 4: develop FPGA operators through OpenCL.
An operator that needs to run on the FPGA can be completed by writing a kernel function in C following the standard provided by OpenCL. From a kernel function written in C, the BSP and OpenCL kernel compilation environment provided by the device vendor can automatically generate an operator that runs on the FPGA hardware.
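As an illustration of the kind of kernel function meant here, the sketch below keeps a minimal OpenCL C vector-add kernel as a source string, the way a host program would hold it before handing it to the vendor toolchain, alongside a pure-Python reference implementation of the same computation. The kernel text is a generic example, not taken from the patent:

```python
# A minimal OpenCL C kernel of the kind written in step 4, stored as a
# source string prior to offline compilation by the vendor toolchain.
VEC_ADD_KERNEL = """
__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float *out) {
    int i = get_global_id(0);
    out[i] = a[i] + b[i];
}
"""

def vec_add_reference(a, b):
    """Pure-Python reference for what the kernel computes per work-item."""
    return [x + y for x, y in zip(a, b)]

print(vec_add_reference([1.0, 2.0], [3.0, 4.0]))  # [4.0, 6.0]
```

Keeping a host-language reference implementation next to the kernel is a common way to check the FPGA result for correctness once the binary stream is programmed.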
Step 5: compile kernel operators through OpenCL.
"Compile kernel" in Fig. 3 is the process of taking the operator described in C as the kernel and mapping it through the BSP provided by the device vendor. The concrete method is: determine the FPGA model and its corresponding BSP, implement the operator to be realized in C, and then run the compilation command in the environment provided by the device vendor, such as Intel's SDK for OpenCL. This generates the corresponding netlist and the binary stream for the actual hardware realization, so that the FPGA can directly realize the corresponding operator.
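A minimal sketch of driving this offline compilation step from Python follows. The `aoc` tool name and flag spelling follow Intel's FPGA SDK for OpenCL offline compiler, but exact flags vary by SDK version and board, so treat the command line as an assumption rather than a definitive invocation:

```python
# Sketch of constructing the offline OpenCL-to-FPGA compile command.
# The "aoc" name and flags are assumptions based on Intel's FPGA SDK for
# OpenCL; consult the vendor BSP documentation for the exact invocation.
def build_compile_cmd(kernel_src, board, out_file):
    return ["aoc", kernel_src, "-o", out_file, f"-board={board}"]

cmd = build_compile_cmd("vec_add.cl", "a10gx", "vec_add.aocx")
print(" ".join(cmd))
# On a machine with the vendor toolchain installed, one would then run:
# subprocess.run(cmd, check=True)
```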
Step 6: incorporate the FPGA device into the overall system.
After the above steps are completed, the kernel implementation and the device-layer part in Fig. 1 are done. Moreover, because OpenCL is used, the host-to-FPGA communication does not need to be developed separately: the data-flow executor part is also already complete. Finally, one only needs to add a C-language interface layer to the OpenCL C code and add some device-ID registration code so that upper-layer Python and C++ clients can call it directly. Through this encapsulation, upper-layer users can implement neural networks in a high-level language within the TensorFlow system, while the FPGA realizes part of the operators to accelerate the network.
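The device-ID registration mentioned in this step can be sketched as a small table that upper-layer clients consult when placing operators. This is a hypothetical mock rather than TensorFlow's real device registry; all names are illustrative:

```python
# Hypothetical sketch of "register a device ID": the FPGA is added to a
# device table so upper-layer clients can target it by name, the same way
# CPU and GPU entries already exist in the system.
DEVICE_TABLE = {}

def register_device(device_type, device_id):
    DEVICE_TABLE[device_type] = device_id

register_device("CPU", 0)
register_device("FPGA", 1)

def place_op(op_name, device_type):
    """Return the (device_id, op) pair the runtime would schedule."""
    return (DEVICE_TABLE[device_type], op_name)

print(place_op("matmul", "FPGA"))  # (1, 'matmul')
```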
The present invention not only extends TensorFlow's support to the new FPGA device; it also innovates in the development model by developing the FPGA through OpenCL, an open, royalty-free programming environment designed for general parallel programming on heterogeneous systems. In this way TensorFlow can make full use of entirely new FPGA operations to explore the FPGA's potential, the environment provided by OpenCL speeds up FPGA development progress, and the many interface problems brought about by connecting the FPGA to TensorFlow are reduced. This improves the running speed of the operators and also adds some new operators, completed by the FPGA, for TensorFlow to call.

Claims (4)

1. A method for accelerating the TensorFlow system based on FPGA, characterized in that it comprises the following steps:
Step 1: use Python as the upper-layer client library.
Step 2: provide a suitable C-language interface.
The modules implemented in OpenCL are encapsulated as an interface for the upper-layer client programs to call;
Step 3: prepare the FPGA operator development environment through OpenCL.
Step 4: develop FPGA operators through OpenCL.
The required kernel functions are first written in C; the BSP and OpenCL kernel compilation environment provided by the device vendor can then automatically generate the corresponding binary stream file, and after this file is programmed into the FPGA, the operator can run on the FPGA;
Step 5: compile kernel operators through OpenCL.
The operator described in C is taken as the kernel and is mapped through the BSP provided by the device vendor;
Step 6: incorporate the FPGA device into the overall system.
A C-language interface layer is added to the OpenCL C code, together with some device-ID registration code, so that upper-layer Python and C++ clients can call it directly.
2. The method for accelerating the TensorFlow system based on FPGA according to claim 1, characterized in that said preparing the FPGA operator development environment through OpenCL specifically comprises:
querying the platform and devices to determine the model and number of FPGAs in the system;
creating command queues and buffers to stage the operations to be performed and to store the operation data;
mapping the kernel written in C, through the BSP, into a binary stream file executable by the FPGA;
setting execution parameters, whereupon the OpenCL host side executes the kernels according to actual demand.
3. The method for accelerating the TensorFlow system based on FPGA according to claim 2, characterized in that said execution parameters include the type and number of kernels to execute.
4. The method for accelerating the TensorFlow system based on FPGA according to claim 1, characterized in that said compiling kernel operators through OpenCL specifically comprises:
determining the FPGA model and its corresponding BSP;
implementing the operator to be realized in C;
running the compilation command in the environment provided by the device vendor;
whereby the corresponding netlist file is generated, together with the binary stream file that can be programmed into the FPGA to produce the corresponding hardware.
CN201811061386.1A 2018-09-12 2018-09-12 Design method for accelerating the TensorFlow system based on FPGA Pending CN109447256A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811061386.1A CN109447256A (en) 2018-09-12 2018-09-12 Design method for accelerating the TensorFlow system based on FPGA

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811061386.1A CN109447256A (en) 2018-09-12 2018-09-12 Design method for accelerating the TensorFlow system based on FPGA

Publications (1)

Publication Number Publication Date
CN109447256A true CN109447256A (en) 2019-03-08

Family

ID=65532812

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811061386.1A Pending CN109447256A (en) 2018-09-12 2018-09-12 Design method for accelerating the TensorFlow system based on FPGA

Country Status (1)

Country Link
CN (1) CN109447256A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110399234A * 2019-07-10 2019-11-01 苏州浪潮智能科技有限公司 Task acceleration processing method, apparatus, device and readable storage medium
CN110928529A (en) * 2019-11-06 2020-03-27 第四范式(北京)技术有限公司 Method and system for assisting operator development
CN111858036A (en) * 2020-06-29 2020-10-30 浪潮电子信息产业股份有限公司 Tensorflow system acceleration method, device and equipment based on FPGA equipment and storage medium
CN112001494A (en) * 2020-08-20 2020-11-27 浪潮电子信息产业股份有限公司 Method for realizing support of FPGA (field programmable Gate array) back-end equipment by nGraph framework
CN113496272A (en) * 2021-05-10 2021-10-12 中国电子科技集团公司第十四研究所 Convolutional neural network operation method based on heterogeneous platform
CN114201154A (en) * 2021-12-10 2022-03-18 北京百度网讯科技有限公司 Operator generation method and device
CN116698411A (en) * 2023-06-29 2023-09-05 重庆邮电大学空间通信研究院 Rolling bearing health state early warning method and device based on convolutional neural network

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107239315A * 2017-04-11 2017-10-10 北京深鉴智能科技有限公司 Programming model for neural network heterogeneous computing platforms
WO2018077295A1 (en) * 2016-10-31 2018-05-03 腾讯科技(深圳)有限公司 Data processing method and apparatus for convolutional neural network
CN107992940A * 2017-12-12 2018-05-04 郑州云海信息技术有限公司 Method and device for implementing a convolutional neural network on FPGA
CN108090560A * 2018-01-05 2018-05-29 中国科学技术大学苏州研究院 Design method of FPGA-based LSTM recurrent neural network hardware accelerators
CN108520300A * 2018-04-09 2018-09-11 郑州云海信息技术有限公司 Method and device for implementing a deep learning network

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018077295A1 * 2016-10-31 2018-05-03 腾讯科技(深圳)有限公司 Data processing method and apparatus for convolutional neural network
CN107239315A * 2017-04-11 2017-10-10 北京深鉴智能科技有限公司 Programming model for neural network heterogeneous computing platforms
CN107992940A * 2017-12-12 2018-05-04 郑州云海信息技术有限公司 Method and device for implementing a convolutional neural network on FPGA
CN108090560A * 2018-01-05 2018-05-29 中国科学技术大学苏州研究院 Design method of FPGA-based LSTM recurrent neural network hardware accelerators
CN108520300A * 2018-04-09 2018-09-11 郑州云海信息技术有限公司 Method and device for implementing a deep learning network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XINYU ZHANG et al.: "A design methodology for efficient implementation of network on an FPGA", Computer Science 2017 *
朱虎明 et al.: "A survey of deep neural network parallelization", Chinese Journal of Computers (《计算机学报》) *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110399234A * 2019-07-10 2019-11-01 苏州浪潮智能科技有限公司 Task acceleration processing method, apparatus, device and readable storage medium
CN110928529A (en) * 2019-11-06 2020-03-27 第四范式(北京)技术有限公司 Method and system for assisting operator development
WO2021088909A1 (en) * 2019-11-06 2021-05-14 第四范式(北京)技术有限公司 Method and system for assisting operator development
CN111858036A (en) * 2020-06-29 2020-10-30 浪潮电子信息产业股份有限公司 Tensorflow system acceleration method, device and equipment based on FPGA equipment and storage medium
CN111858036B (en) * 2020-06-29 2022-06-10 浪潮电子信息产业股份有限公司 Tensorflow system acceleration method, device and equipment based on FPGA equipment and storage medium
CN112001494A (en) * 2020-08-20 2020-11-27 浪潮电子信息产业股份有限公司 Method for realizing support of FPGA (field programmable Gate array) back-end equipment by nGraph framework
US11762721B2 (en) 2020-08-20 2023-09-19 Inspur Electronic Information Industry Co., Ltd. Method for realizing nGraph framework supporting FPGA rear-end device
CN113496272A (en) * 2021-05-10 2021-10-12 中国电子科技集团公司第十四研究所 Convolutional neural network operation method based on heterogeneous platform
CN114201154A (en) * 2021-12-10 2022-03-18 北京百度网讯科技有限公司 Operator generation method and device
JP7403586B2 (en) 2021-12-10 2023-12-22 ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド Operator generation method and device, electronic equipment, storage medium, and computer program
CN116698411A (en) * 2023-06-29 2023-09-05 重庆邮电大学空间通信研究院 Rolling bearing health state early warning method and device based on convolutional neural network
CN116698411B (en) * 2023-06-29 2024-03-08 重庆邮电大学空间通信研究院 Rolling bearing health state early warning method and device based on convolutional neural network

Similar Documents

Publication Publication Date Title
CN109447256A (en) Design method for accelerating the TensorFlow system based on FPGA
US11893386B1 (en) Optimizing source code from binary files
CN103858099B (en) The method and system applied for execution, the circuit with machine instruction
US11561772B2 (en) Low-code development platform
CN110149800B (en) Apparatus for processing abstract syntax tree associated with source code of source program
US20160378438A1 (en) Agile communication operator
CN106687921A (en) Specifying components in graph-based programs
US10282179B2 (en) Nested communication operator
CN103718159B (en) Image processing software development approach, image processing software development device
CN112199086A (en) Automatic programming control system, method, device, electronic device and storage medium
US8615729B2 (en) Extending existing model-to-model transformations
CN104503778A (en) Installation method and installation device for applications
CN107741847A (en) Realize the method and device of domain-driven model
CN106020905A (en) Microcontroller firmware developing and updating method and system
CN106528171A (en) Method, device and system for designing interface between heterogeneous computing platforms
US20130019225A1 (en) Incremental Inferences for Developing Data Models
US8935657B2 (en) Model-to-model transformation by kind
Narihira et al. Neural Network Libraries: A Deep Learning Framework Designed from Engineers' Perspectives
US8914782B2 (en) Optimization of declarative queries
Costa et al. Exploiting different types of parallelism in distributed analysis of remote sensing data
CN116861359A (en) Operator fusion method and system for deep learning reasoning task compiler
CN115983378A (en) Automatic compiling method for kernel of machine learning operating system
US9026985B2 (en) Dynamically configurable model-to-model transformation engine
Panyala et al. On the use of term rewriting for performance optimization of legacy HPC applications
Papenhausen et al. Polyhedral user mapping and assistant visualizer tool for the r-stream auto-parallelizing compiler

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20190308
