CN110929883A - Method and device for supporting FPGA (field programmable gate array) training in TensorFlow - Google Patents

Method and device for supporting FPGA (field programmable gate array) training in TensorFlow

Info

Publication number
CN110929883A
CN110929883A
Authority
CN
China
Prior art keywords
fpga
operator
tensorflow
execution function
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201911156208.1A
Other languages
Chinese (zh)
Inventor
赵谦谦
仝培霖
赵红博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Wave Intelligent Technology Co Ltd
Original Assignee
Suzhou Wave Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Wave Intelligent Technology Co Ltd filed Critical Suzhou Wave Intelligent Technology Co Ltd
Priority to CN201911156208.1A priority Critical patent/CN110929883A/en
Publication of CN110929883A publication Critical patent/CN110929883A/en
Withdrawn legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Advance Control (AREA)

Abstract

The invention provides a method for supporting FPGA training in TensorFlow, which comprises the following steps: adding registration and discovery of an FPGA device in TensorFlow so that the name of the FPGA device appears in the device list of TensorFlow; calling the operator registration interface of TensorFlow to register an operator supporting the FPGA device according to the name of the FPGA device, making the name of the operator the same as the names of the corresponding operators of all devices supported by TensorFlow; and writing the execution function of the operator with OpenCL, wherein the execution function comprises a host-side execution function and an FPGA device-side execution function, so as to carry out data interaction between the CPU on the host and the FPGA. The invention broadens the usage scenarios of the FPGA and suits scenarios that require online training and model updating.

Description

Method and device for supporting FPGA (field programmable gate array) training in TensorFlow
Technical Field
The present invention relates to the field of computers, and more particularly, to a method and apparatus for supporting FPGA training in TensorFlow.
Background
TensorFlow is currently the most widely used framework in the field of deep learning, and many deep learning models are implemented on top of it. Many hardware vendors, including ASIC and FPGA vendors, treat TensorFlow as the primary framework to support.
However, most vendors currently support only inference with TensorFlow models (the model is converted into an intermediate representation supported by the FPGA and run on the FPGA); only the CPU, GPU and TPU support TensorFlow training. Some vendors have implemented TensorFlow inference on FPGAs, but FPGA training remains unsupported. Because the FPGA does not support TensorFlow training and cannot be used to accelerate training, models that are deployed online and require online training cannot use the FPGA, which greatly limits the usage scenarios of the FPGA.
Disclosure of Invention
In view of this, an object of the embodiments of the present invention is to provide a method and an apparatus for supporting FPGA training in TensorFlow, so as to add support for training on FPGA devices using the original operating mechanism and programming interfaces of TensorFlow.
Based on the above purpose, an aspect of the embodiments of the present invention provides a method for supporting FPGA training in TensorFlow, including the following steps:
adding registration and discovery of an FPGA device in TensorFlow so that the name of the FPGA device appears in the device list of TensorFlow;
calling the operator registration interface of TensorFlow to register an operator supporting the FPGA device according to the name of the FPGA device, making the name of the operator the same as the names of the corresponding operators of all devices supported by TensorFlow;
and writing the execution function of the operator with OpenCL, wherein the execution function comprises a host-side execution function and an FPGA device-side execution function, so as to carry out data interaction between the CPU on the host and the FPGA.
In some embodiments, the registered operators that support the FPGA device include a forward operator and a corresponding backward operator.
In some embodiments, the registered operators that support the FPGA device include a two-dimensional convolution operator, a fully connected operator, and a pooling operator.
In some embodiments, the method further comprises: compiling the FPGA device-side execution function on the CPU to generate an FPGA aocx executable file, wherein the FPGA aocx executable file, when run by the FPGA device, receives the data transferred from the host side and performs the specific operator computation.
In some embodiments, the host-side execution function runs in the CPU on the host and performs the following operations:
searching for the FPGA platform supported by OpenCL;
searching for the list of FPGA devices supported by OpenCL;
and creating an OpenCL runtime environment.
In some embodiments, further comprising:
acquiring the FPGA aocx executable file generated by the compilation, and calling the create-program-from-binary interface to burn the FPGA aocx executable file into the FPGA device.
In some embodiments, further comprising:
creating the FPGA parameters and an FPGA buffer through the PCI driver and the application program interface of the FPGA device.
In some embodiments, further comprising:
running the FPGA aocx executable file to compute the operator with the FPGA parameters, storing the computation result in the FPGA buffer, and reading the FPGA buffer to obtain the computation result.
In some embodiments, the method further comprises:
specifying the FPGA as the running device and specifying, at the application layer, the FPGA device to be used.
Another aspect of the embodiments of the present invention provides an apparatus for supporting FPGA training in TensorFlow, including:
at least one processor; and
a memory storing program code executable by the processor, the program code implementing the method of any of the above when executed by the processor.
The invention has the following beneficial technical effects: the method and apparatus for supporting FPGA training in TensorFlow provided by the embodiments of the invention make it possible to specify, in TensorFlow, the operators that run on the FPGA device and to implement computation-heavy operators on the FPGA, thereby accelerating training, broadening the usage scenarios of the FPGA, and suiting scenarios that require online training and model updating.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings can be obtained from these drawings by those skilled in the art without creative effort.
FIG. 1 is a flow diagram of a method of supporting FPGA training in a TensorFlow in accordance with the present invention;
FIG. 2 is an architectural diagram of FPGA operator execution functions according to the present invention;
FIG. 3 is a schematic diagram of the hardware structure of an apparatus for supporting FPGA training in TensorFlow according to the present invention.
Detailed Description
Embodiments of the present invention are described below. However, it is to be understood that the disclosed embodiments are merely examples and that other embodiments may take various and alternative forms. The figures are not necessarily to scale; certain features may be exaggerated or minimized to show details of particular components. Therefore, specific structural and functional details disclosed herein are not to be interpreted as limiting, but merely as a representative basis for teaching one skilled in the art to variously employ the present invention. As one of ordinary skill in the art will appreciate, various features illustrated and described with reference to any one of the figures may be combined with features illustrated in one or more other figures to produce embodiments that are not explicitly illustrated or described. The combination of features shown provides a representative embodiment for a typical application. However, various combinations and modifications of the features consistent with the teachings of the present invention may be desired for certain specific applications or implementations.
In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention are described in further detail with reference to the accompanying drawings.
Based on the above purpose, an aspect of the embodiments of the present invention provides a method for supporting FPGA training in TensorFlow, as shown in FIG. 1, including the following steps:
Step S101: adding registration and discovery of an FPGA device in TensorFlow so that the name of the FPGA device appears in the device list of TensorFlow;
Step S102: calling the operator registration interface of TensorFlow to register an operator supporting the FPGA device according to the name of the FPGA device, making the name of the operator the same as the names of the corresponding operators of all devices supported by TensorFlow;
Step S103: writing the execution function of the operator with OpenCL, wherein the execution function comprises a host-side execution function and an FPGA device-side execution function, so as to carry out data interaction between the CPU on the host and the FPGA.
In some embodiments, first, registration and discovery of the FPGA device are added according to the original mechanism of TensorFlow, so that the FPGA device name tag DEVICE_FPGA appears in the device list of TensorFlow. Then, the FPGA operator is registered according to the operator registration mechanism of TensorFlow.
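For illustration, a minimal sketch of such a device registration is given below, written against TensorFlow's internal C++ DeviceFactory interface. The FPGADevice class body, the memory limit, and the discovery logic are assumptions of this sketch, and the exact factory method signatures vary across TensorFlow versions:

#include <memory>
#include <vector>
#include "tensorflow/core/common_runtime/device_factory.h"
#include "tensorflow/core/common_runtime/local_device.h"
#include "tensorflow/core/framework/allocator.h"
#include "tensorflow/core/public/session_options.h"

namespace tensorflow {

// Device object exposed to the runtime as /device:FPGA:N.
class FPGADevice : public LocalDevice {
 public:
  FPGADevice(const SessionOptions& options, const string& name)
      : LocalDevice(options,
                    Device::BuildDeviceAttributes(name, DeviceType("FPGA"),
                                                  Bytes(256 << 20),
                                                  DeviceLocality())) {}
  Status Sync() override { return Status::OK(); }
  Allocator* GetAllocator(AllocatorAttributes attr) override {
    return cpu_allocator();  // staging in host memory; a real device differs
  }
};

class FPGADeviceFactory : public DeviceFactory {
 public:
  Status CreateDevices(const SessionOptions& options,
                       const string& name_prefix,
                       std::vector<std::unique_ptr<Device>>* devices) override {
    // Discovery: query the vendor runtime for boards, one Device per board.
    devices->push_back(std::unique_ptr<Device>(
        new FPGADevice(options, name_prefix + "/device:FPGA:0")));
    return Status::OK();
  }
};

// Puts the name tag "FPGA" (DEVICE_FPGA) into TensorFlow's device list.
REGISTER_LOCAL_DEVICE_FACTORY("FPGA", FPGADeviceFactory);

}  // namespace tensorflow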
In some embodiments, the registered operators that support the FPGA device include a forward operator and a corresponding backward operator. TensorFlow training involves forward computation and the corresponding backward computation, and therefore requires registering a forward operator and a corresponding backward operator, as shown in FIG. 2.
In some embodiments, the registered operators that support the FPGA device may be a two-dimensional convolution operator, a fully connected operator, and a pooling operator. The corresponding operator algorithm is selected and registered according to the type of computation to be executed. For example, taking the forward two-dimensional convolution operator as an example, the implementation of the FPGA operator includes: calling the TensorFlow operator registration interface to register the FPGA operator, REGISTER_KERNEL_BUILDER(Name("Conv2D").Device(DEVICE_FPGA).TypeConstraint<float>("T"), Conv2DOp<FPGADevice, float>); wherein the Name("xx") keyword is the name of the operator (Conv2D in this example), and, to remain compatible with the CPU version, the name is set to be the same as the name of the CPU operator, so that the two-dimensional convolution models of all devices (CPU, GPU and TPU) supported by the current TensorFlow remain compatible; Device("xx") must be registered as DEVICE_FPGA (the same name as in the device list of TensorFlow), which means the operator is an operator supported by the FPGA, and when an upper-layer application uses the tf.device("/FPGA:N") interface, the operators supported by DEVICE_FPGA are called; Conv2DOp<FPGADevice, float> is the name of the function that actually executes the operator.
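In context, the registration quoted above would look as follows; the Conv2DOp skeleton around it is an assumption added for this sketch (a real kernel derives the output shape from strides and padding and invokes the OpenCL host code shown further below):

#include "tensorflow/core/framework/op_kernel.h"

namespace tensorflow {

const char* const DEVICE_FPGA = "FPGA";  // same tag as in the device list
struct FPGADevice {};                    // tag type selecting the FPGA path

template <typename Device, typename T>
class Conv2DOp : public OpKernel {
 public:
  explicit Conv2DOp(OpKernelConstruction* ctx) : OpKernel(ctx) {}
  void Compute(OpKernelContext* ctx) override {
    const Tensor& input = ctx->input(0);
    const Tensor& filter = ctx->input(1);
    Tensor* output = nullptr;
    // Placeholder shape; a real kernel computes it from strides/padding.
    OP_REQUIRES_OK(ctx, ctx->allocate_output(0, input.shape(), &output));
    // Host-side OpenCL code ships input/filter to the board, runs the
    // conv2d kernel, and copies the result back into `output`.
  }
};

// Name("Conv2D") matches the CPU/GPU operator name, so existing models run
// unchanged; Device(DEVICE_FPGA) routes the node to the FPGA device.
REGISTER_KERNEL_BUILDER(
    Name("Conv2D").Device(DEVICE_FPGA).TypeConstraint<float>("T"),
    Conv2DOp<FPGADevice, float>);

}  // namespace tensorflow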
In some embodiments, the execution function of the operator, the Conv2DOp function, is written as shown in FIG. 2. The function is written with OpenCL; an OpenCL program is divided into a host side and a device side, i.e., the function comprises a host-side execution function and an FPGA device-side execution function.
In some embodiments, the method further comprises: compiling the FPGA device-side execution function on the CPU to generate an FPGA aocx executable file, wherein the FPGA aocx executable file, when run by the FPGA device, receives the data transferred from the host side and performs the specific operator computation.
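As an illustration of the device-side half, a deliberately simplified OpenCL kernel and the offline compilation step are sketched below; the file names and the one-dimensional loop are assumptions (a real two-dimensional convolution is tiled for the FPGA pipeline):

// conv2d.cl — simplified device-side execution function
__kernel void conv2d(__global const float* input,
                     __global const float* filter,
                     __global float* output,
                     const int ksize) {
  int i = get_global_id(0);
  float acc = 0.0f;
  for (int k = 0; k < ksize; ++k)
    acc += input[i + k] * filter[k];  // 1-D slice of the 2-D convolution
  output[i] = acc;
}

// Compiled offline on the CPU with the Intel FPGA SDK for OpenCL compiler,
// e.g.: aoc conv2d.cl -o conv2d.aocx
// The resulting .aocx image is what the host later burns onto the board.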
In some embodiments, the host-side execution function runs in the CPU on the host and performs the following operations: searching for the FPGA platform supported by OpenCL; searching for the list of FPGA devices supported by OpenCL; and creating an OpenCL runtime environment.
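A minimal host-side sketch of these three operations, using the standard OpenCL C API (error handling is trimmed, and treating the FPGA as CL_DEVICE_TYPE_ACCELERATOR is an assumption that holds for common FPGA OpenCL platforms):

#include <CL/cl.h>

// Finds the OpenCL platform, picks the first FPGA device, and creates the
// context and command queue that form the OpenCL runtime environment.
static cl_int setup_fpga(cl_platform_id* platform, cl_device_id* device,
                         cl_context* ctx, cl_command_queue* queue) {
  cl_int err = clGetPlatformIDs(1, platform, NULL);
  if (err != CL_SUCCESS) return err;
  err = clGetDeviceIDs(*platform, CL_DEVICE_TYPE_ACCELERATOR, 1, device, NULL);
  if (err != CL_SUCCESS) return err;
  *ctx = clCreateContext(NULL, 1, device, NULL, NULL, &err);
  if (err != CL_SUCCESS) return err;
  *queue = clCreateCommandQueue(*ctx, *device, 0, &err);
  return err;
}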
In some embodiments, the host-side execution function further performs the following: acquiring the FPGA aocx executable file generated by the compilation, and calling the create-program-from-binary interface (clCreateProgramWithBinary) to burn the FPGA aocx executable file into the FPGA device.
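A sketch of that step: the .aocx image is read from disk and handed to clCreateProgramWithBinary, which programs the board; the file and kernel names are illustrative:

#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

static cl_kernel load_aocx(cl_context ctx, cl_device_id dev,
                           const char* path, const char* kernel_name) {
  FILE* f = fopen(path, "rb");
  if (!f) return NULL;
  fseek(f, 0, SEEK_END);
  size_t size = (size_t)ftell(f);
  rewind(f);
  unsigned char* image = (unsigned char*)malloc(size);
  fread(image, 1, size, f);
  fclose(f);

  cl_int err;
  const unsigned char* images[] = { image };
  // Burns the precompiled aocx image into the FPGA device.
  cl_program prog =
      clCreateProgramWithBinary(ctx, 1, &dev, &size, images, NULL, &err);
  clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);  // finalize for the board
  free(image);
  return clCreateKernel(prog, kernel_name, &err);
}

// e.g. cl_kernel k = load_aocx(ctx, dev, "conv2d.aocx", "conv2d");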
In some embodiments, the host-side execution function further performs the following: creating the FPGA parameters and an FPGA buffer through the PCI driver and the application program interface of the FPGA device. Data interaction between the CPU and the FPGA is completed through the PCI driver and the application program interface of the FPGA, which accomplishes the tensor data interaction required by TensorFlow.
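The buffer-creation half of that interaction might look as follows at the OpenCL level (a vendor PCI driver sits underneath these calls; `n`, the element count of the flattened tensor, is an illustrative parameter):

#include <CL/cl.h>

// Allocates an FPGA buffer and moves tensor data host -> device; the
// blocking write returns once the data is resident on the board.
static cl_mem stage_input(cl_context ctx, cl_command_queue queue,
                          const float* host_data, size_t n) {
  cl_int err;
  cl_mem buf =
      clCreateBuffer(ctx, CL_MEM_READ_ONLY, n * sizeof(float), NULL, &err);
  clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, n * sizeof(float), host_data,
                       0, NULL, NULL);
  return buf;
}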
In some embodiments, the host-side execution function further performs the following: running the FPGA aocx executable file to compute the operator with the FPGA parameters, storing the computation result in the FPGA buffer, and reading the FPGA buffer to obtain the computation result.
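A sketch of this run/store/read sequence, matching the simplified conv2d kernel above (argument order and buffer names are assumptions of this sketch):

#include <CL/cl.h>

// Sets the kernel arguments, launches the operator on the FPGA, and reads
// the computation result back from the FPGA buffer into host memory.
static void run_conv2d(cl_command_queue queue, cl_kernel kernel,
                       cl_mem input, cl_mem filter, cl_mem output,
                       int ksize, float* result, size_t n) {
  clSetKernelArg(kernel, 0, sizeof(cl_mem), &input);
  clSetKernelArg(kernel, 1, sizeof(cl_mem), &filter);
  clSetKernelArg(kernel, 2, sizeof(cl_mem), &output);
  clSetKernelArg(kernel, 3, sizeof(int), &ksize);

  size_t global = n;  // one work-item per output element
  clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
  // Blocking read: the operator returns its tensor only after this completes.
  clEnqueueReadBuffer(queue, output, CL_TRUE, 0, n * sizeof(float), result,
                      0, NULL, NULL);
}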
In some embodiments, the method further comprises: specifying the FPGA as the running device and specifying, at the application layer, the FPGA device to be used. The FPGA device to be used is specified with tf.device("/FPGA:N"); a server may hold multiple FPGA cards, and N is the device number of the FPGA.
Where technically feasible, the technical features listed above for the different embodiments may be combined with each other or changed, added, omitted, etc. to form further embodiments within the scope of the invention.
It can be seen from the foregoing embodiments that the method for supporting FPGA training in TensorFlow provided by the embodiments of the present invention makes it possible to specify the operators that run on the FPGA device in TensorFlow and to implement computation-heavy operators on the FPGA, thereby accelerating training, broadening the usage scenarios of the FPGA, and suiting scenarios that require online training and model updating.
In another aspect of the embodiments of the present invention, an embodiment of an apparatus for supporting FPGA training in TensorFlow is provided.
The apparatus for supporting FPGA training in TensorFlow comprises a memory and at least one processor, wherein the memory stores a computer program that can run on the processor, and the processor performs any of the above methods when executing the computer program.
Fig. 3 is a schematic diagram of the hardware structure of an embodiment of the apparatus for supporting FPGA training in TensorFlow according to the present invention.
Taking the computer device shown in fig. 3 as an example, the computer device includes a processor 301 and a memory 302, and may further include: an input device 303 and an output device 304.
The processor 301, the memory 302, the input device 303 and the output device 304 may be connected by a bus or other means, and fig. 3 illustrates the connection by a bus as an example.
The memory 302 is a non-volatile computer-readable storage medium and can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the method for supporting FPGA training in TensorFlow in the embodiments of the present application. The processor 301 executes the various functional applications and data processing of the server by running the non-volatile software programs, instructions, and modules stored in the memory 302, thereby implementing the method for supporting FPGA training in TensorFlow of the above method embodiments.
The memory 302 may include a storage program area and a storage data area, wherein the storage program area may store an operating system and an application program required for at least one function, and the storage data area may store data created by the method for supporting FPGA training in TensorFlow, and the like. Further, the memory 302 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the memory 302 optionally includes memory located remotely from the processor 301, which may be connected to the local module via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 303 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the computer apparatus implementing the method for supporting FPGA training in TensorFlow. The output device 304 may comprise a display device such as a display screen.
The program instructions/modules corresponding to the method for supporting FPGA training in TensorFlow are stored in the memory 302 and, when executed by the processor 301, perform the method for supporting FPGA training in TensorFlow of any of the above-described method embodiments.
Any embodiment of the computer device executing the method for supporting FPGA training in TensorFlow can achieve the same or similar effects as any of the corresponding preceding method embodiments.
Finally, it should be noted that, as will be understood by those skilled in the art, all or part of the processes in the methods of the above embodiments may be implemented by a computer program, which may be stored in a computer-readable storage medium, and when executed, may include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a Random Access Memory (RAM), or the like.
In addition, the apparatuses, devices and the like disclosed in the embodiments of the present invention may be various electronic terminal devices, such as a mobile phone, a Personal Digital Assistant (PDA), a tablet computer (PAD), a smart television and the like, or may be a large terminal device, such as a server and the like, and therefore the scope of protection disclosed in the embodiments of the present invention should not be limited to a specific type of apparatus, device. The client disclosed in the embodiment of the present invention may be applied to any one of the above electronic terminal devices in the form of electronic hardware, computer software, or a combination of both.
Furthermore, the method disclosed according to an embodiment of the present invention may also be implemented as a computer program executed by a CPU, and the computer program may be stored in a computer-readable storage medium. The computer program, when executed by the CPU, performs the above-described functions defined in the method disclosed in the embodiments of the present invention.
Further, the above method steps and system elements may also be implemented using a controller and a computer readable storage medium for storing a computer program for causing the controller to implement the functions of the above steps or elements.
Further, it should be appreciated that the computer-readable storage media (e.g., memory) described herein can be either volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory. By way of example, and not limitation, nonvolatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM), which can act as external cache memory. By way of example and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), and direct Rambus RAM (DRRAM). The storage devices of the disclosed aspects are intended to comprise, without being limited to, these and other suitable types of memory.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as software or hardware depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the disclosed embodiments of the present invention.
The various illustrative logical blocks, modules, and circuits described in connection with the disclosure herein may be implemented or performed with the following components designed to perform the functions described herein: a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination of these components. A general purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a combination of a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP, and/or any other such configuration.
The steps of a method or algorithm described in connection with the disclosure herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.
In one or more exemplary designs, the functions may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media includes both computer storage media and communication media, including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a general-purpose or special-purpose computer, or a general-purpose or special-purpose processor. Also, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
It should be understood that, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly supports the exception. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items.
The numbers of the embodiments disclosed in the embodiments of the present invention are merely for description, and do not represent the merits of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk, an optical disk, or the like.
The above-described embodiments are possible examples of implementations and are presented merely for a clear understanding of the principles of the invention. Those of ordinary skill in the art will understand that the discussion of any embodiment above is meant to be exemplary only, and is not intended to imply that the scope of the disclosure, including the claims, of embodiments of the invention is limited to these examples; within the idea of the embodiments of the invention, technical features in the above embodiments or in different embodiments may also be combined, and many other variations of the different aspects of the embodiments of the invention exist as described above, which are not provided in detail for the sake of brevity. Therefore, any omissions, modifications, substitutions, improvements, and the like that may be made without departing from the spirit and principles of the embodiments of the present invention are intended to be included within the scope of the embodiments of the present invention.

Claims (10)

1. A method for supporting FPGA training in TensorFlow is characterized by comprising the following steps:
adding registration and discovery of an FPGA device in TensorFlow so that the name of the FPGA device appears in the device list of TensorFlow;
calling the operator registration interface of TensorFlow to register an operator supporting the FPGA device according to the name of the FPGA device, making the name of the operator the same as the names of the corresponding operators of all devices supported by TensorFlow;
and writing the execution function of the operator with OpenCL, wherein the execution function comprises a host-side execution function and an FPGA device-side execution function, so as to carry out data interaction between the CPU on the host and the FPGA.
2. The method of claim 1, wherein the registered operators that support the FPGA device include a forward operator and a corresponding backward operator.
3. The method of claim 1, wherein the registered operators that support the FPGA device include a two-dimensional convolution operator, a fully connected operator, and a pooling operator.
4. The method of claim 1, further comprising: compiling the FPGA device-side execution function on the CPU to generate an FPGA aocx executable file, wherein the FPGA aocx executable file, when run by the FPGA device, receives the data transferred from the host side and performs the specific operator computation.
5. The method of claim 4, wherein the host-side execution function runs in the CPU on the host and performs the following operations:
searching for the FPGA platform supported by OpenCL;
searching for the list of FPGA devices supported by OpenCL;
and creating an OpenCL runtime environment.
6. The method of claim 5, further comprising:
acquiring the FPGA aocx executable file generated by the compilation, and calling the create-program-from-binary interface to burn the FPGA aocx executable file into the FPGA device.
7. The method of claim 6, further comprising:
creating the FPGA parameters and an FPGA buffer through the PCI driver and the application program interface of the FPGA device.
8. The method of claim 7, further comprising:
running the FPGA aocx executable file to compute the operator with the FPGA parameters, storing the computation result in the FPGA buffer, and reading the FPGA buffer to obtain the computation result.
9. The method of claim 1, further comprising:
specifying the FPGA as the running device and specifying, at the application layer, the FPGA device to be used.
10. An apparatus for supporting FPGA training in TensorFlow, comprising:
at least one processor; and
a memory storing program code executable by the processor, the program code implementing the method of any one of claims 1-9 when executed by the processor.
CN201911156208.1A 2019-11-22 2019-11-22 Method and device for supporting FPGA (field programmable gate array) training in TensorFlow Withdrawn CN110929883A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911156208.1A CN110929883A (en) 2019-11-22 2019-11-22 Method and device for supporting FPGA (field programmable gate array) training in TensorFlow

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911156208.1A CN110929883A (en) 2019-11-22 2019-11-22 Method and device for supporting FPGA (field programmable gate array) training in TensorFlow

Publications (1)

Publication Number Publication Date
CN110929883A true CN110929883A (en) 2020-03-27

Family

ID=69850769

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911156208.1A Withdrawn CN110929883A (en) 2019-11-22 2019-11-22 Method and device for supporting FPGA (field programmable gate array) training in TensorFlow

Country Status (1)

Country Link
CN (1) CN110929883A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111858036A (en) * 2020-06-29 2020-10-30 浪潮电子信息产业股份有限公司 Tensorflow system acceleration method, device and equipment based on FPGA equipment and storage medium
CN111858036B (en) * 2020-06-29 2022-06-10 浪潮电子信息产业股份有限公司 Tensorflow system acceleration method, device and equipment based on FPGA equipment and storage medium
CN112001494A (en) * 2020-08-20 2020-11-27 浪潮电子信息产业股份有限公司 Method for realizing support of FPGA (field programmable Gate array) back-end equipment by nGraph framework
US11762721B2 (en) 2020-08-20 2023-09-19 Inspur Electronic Information Industry Co., Ltd. Method for realizing nGraph framework supporting FPGA rear-end device
CN112270399A (en) * 2020-09-29 2021-01-26 北京百度网讯科技有限公司 Operator registration processing method and device based on deep learning and electronic equipment
CN112270399B (en) * 2020-09-29 2022-03-11 北京百度网讯科技有限公司 Operator registration processing method and device based on deep learning and electronic equipment
CN113918351A (en) * 2021-12-08 2022-01-11 之江实验室 Method and device for adapting to distributed training in deep learning framework and AI acceleration card
US11714995B2 (en) 2021-12-08 2023-08-01 Zhejiang Lab Method for distributed type training adaptation and apparatus in deep learning framework and AI accelerator card
CN114201154A (en) * 2021-12-10 2022-03-18 北京百度网讯科技有限公司 Operator generation method and device

Similar Documents

Publication Publication Date Title
CN110929883A (en) Method and device for supporting FPGA (field programmable gate array) training in TensorFlow
US8983935B2 (en) Methods for utilizing a javascript emulator in a web content proxy server and devices thereof
CN106796522A (en) System and method for updating source code file
US20140059421A1 (en) Website blueprint generation and display algorithms to reduce perceived web-page loading time
US20160117154A1 (en) Automated software include graph and build environment analysis and optimization in compiled language
CN110096306B (en) Application version switching method and device, electronic equipment and storage medium
US11630647B2 (en) Method and system for configuring processes of software applications using activity fragments
CN105335132A (en) Method, apparatus and system for user-defined application function
US10387124B2 (en) System and method for creating domain specific language
CN111090438A (en) Method, equipment and medium for FPGA virtualization training based on kubernets
CN107678741B (en) List view implementation method and device, readable storage medium and equipment
JP7409197B2 (en) Elaboration of repair patterns for static analysis violations in software programs
CN111368496B (en) Method and device for changing straight line into broken line with any angle in PCB design
CN109344082B (en) Method and system for automatically testing register
CN110780855A (en) Method, device and system for uniformly managing and controlling interface
US8615744B2 (en) Methods and system for managing assets in programming code translation
CN110502296B (en) Method, equipment and storage medium for displaying firmware upgrading command
CN104156209A (en) Cross-platform application interface modeling method and device
CN114995790A (en) Component development method and device, computer readable medium and electronic equipment
CN114449063A (en) Message processing method, device and equipment
CN110780905A (en) BMC software updating method and device based on USB interface
KR102361534B1 (en) Method and system for obfuscation using compiler
CN111949355B (en) Column menu shielding method and device
CN116974550B (en) Universal three-dimensional earth visualization method
CN113254888B (en) Method for acquiring hardware information, authorization control system and computing equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20200327