CN110866610A - Deep learning model distributed operation method and device - Google Patents


Publication number
CN110866610A
CN110866610A
Authority
CN
China
Prior art keywords
virtual processor
deep learning
learning model
operator
hardware resources
Prior art date
Legal status
Withdrawn
Application number
CN201911140560.6A
Other languages
Chinese (zh)
Inventor
赵谦谦
仝培霖
赵红博
Current Assignee
Suzhou Wave Intelligent Technology Co Ltd
Original Assignee
Suzhou Wave Intelligent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Suzhou Wave Intelligent Technology Co Ltd filed Critical Suzhou Wave Intelligent Technology Co Ltd
Priority to CN201911140560.6A priority Critical patent/CN110866610A/en
Publication of CN110866610A publication Critical patent/CN110866610A/en
Priority to PCT/CN2020/104006 priority patent/WO2021098269A1/en
Withdrawn legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005: Allocation of resources to service a request
    • G06F 9/5027: Allocation of resources to service a request, the resource being a machine, e.g. CPUs, servers, terminals

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Advance Control (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a method and a device for distributed operation of a deep learning model. The method comprises the following steps: registering a virtual processor in a device management list; registering and writing the operators supported by the virtual processor; detecting the hardware resources associated with the virtual processor and determining their respective allocation proportions according to their computing power; configuring a deep learning model based on the operators supported by the virtual processor, and assigning the virtual processor to the operators used in the model; and having the virtual processor distribute the input data of each operator to its associated hardware resources according to the allocation proportions, perform the computation, and merge the computation results of the hardware resources into the output of the operator. By introducing the concept of a virtual processor, designating it as the computing device for the corresponding operators, and distributing the computation across different hardware devices for parallel execution, the invention achieves heterogeneous acceleration of deep learning model operations.

Description

Deep learning model distributed operation method and device
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a method and a device for distributed operation of a deep learning model.
Background
TensorFlow is currently the most widely used framework in the field of deep learning: many deep learning models are implemented on it, and most hardware manufacturers (including ASIC and FPGA manufacturers) treat TensorFlow as the primary deep learning framework to support. At present the common computing units for inference are the GPU, CPU and TPU, while the FPGA is not supported for deep learning training.
TensorFlow executes operations as a dataflow graph and supports hardware on a per-operator basis; in the current implementation, an operator can be dispatched to only one hardware device at execution time. When the model executes serially, the other hardware cannot compute in parallel while waiting for the result of the previous operator.
In addition, most manufacturers currently support only inference of TensorFlow models; in general only the CPU, GPU and TPU support TensorFlow training. Some manufacturers have enabled the FPGA to support TensorFlow inference, but existing schemes in which TensorFlow supports FPGA training are limited to the single-machine scenario.
Moreover, most existing technical solutions are based on the GPU, which has higher power consumption than the FPGA. The existing FPGA scheme supports only single-machine training; training a large TensorFlow model can take more than a month, so the model development cycle is long and cannot meet the ever-growing demand for model training.
Based on the above problems, it is necessary to provide a method that lets TensorFlow accelerate computation on multiple types of hardware simultaneously, adding support for a virtual processor (VPU) on top of TensorFlow's original execution mechanism and programming interface, thereby speeding up the operation of deep learning models.
Disclosure of Invention
In one aspect, in view of the above objective, the present invention provides a method for distributed operation of a deep learning model, wherein the method comprises the following steps:
registering a virtual processor in a device management list;
registering and writing operators supported by the virtual processor;
detecting hardware resources associated with the virtual processor, and determining respective allocation proportions of the hardware resources according to the computing power of the associated hardware resources;
configuring a deep learning model based on an operator supported by a virtual processor, and assigning the virtual processor to the operator used in the deep learning model;
and the virtual processor allocates the input data of the corresponding operator to the hardware resources associated with the virtual processor according to the allocation proportion so as to carry out operation, and combines the operation results of the hardware resources into the output of the corresponding operator.
In an embodiment of the method for deep learning model distributed operation, the hardware resources associated with the virtual processor include one or more of a CPU, a GPU and an FPGA.
In an embodiment of the method, registering and writing the operators supported by the virtual processor further comprises: writing, within the same operator, the operation instructions for the CPU, GPU and FPGA and the corresponding adaptation instructions.
In an embodiment of the method, configuring the deep learning model based on the operators supported by the virtual processor and assigning the virtual processor to the operators used in the deep learning model further comprises: constructing the deep learning model on the TensorFlow framework, and selecting, for each layer of the model, a corresponding operator supported by the virtual processor.
In an embodiment of the method, the operators supported by the virtual processor include a forward operator and the backward operator associated with it.
In another aspect, the present invention further provides an apparatus for distributed operation of a deep learning model, where the apparatus includes:
at least one processor; and
a memory storing processor-executable program instructions that, when executed by the processor, perform the steps of:
registering a virtual processor in a device management list;
registering and writing operators supported by the virtual processor;
detecting hardware resources associated with the virtual processor, and determining respective allocation proportions of the hardware resources according to the computing power of the associated hardware resources;
configuring a deep learning model based on an operator supported by a virtual processor, and assigning the virtual processor to the operator used in the deep learning model;
and the virtual processor allocates the input data of the corresponding operator to the hardware resources associated with the virtual processor according to the allocation proportion so as to carry out operation, and combines the operation results of the hardware resources into the output of the corresponding operator.
In an embodiment of the apparatus for deep learning model distributed operation, the hardware resources associated with the virtual processor include one or more of a CPU, a GPU and an FPGA.
In an embodiment of the apparatus, registering and writing the operators supported by the virtual processor further comprises: writing, within the same operator, the operation instructions for the CPU, GPU and FPGA and the corresponding adaptation instructions.
In an embodiment of the apparatus, configuring the deep learning model based on the operators supported by the virtual processor and assigning the virtual processor to the operators used in the deep learning model further comprises: constructing the deep learning model on the TensorFlow framework, and selecting, for each layer of the model, a corresponding operator supported by the virtual processor.
In an embodiment of the apparatus, the operators supported by the virtual processor include a forward operator and the backward operator associated with it.
By adopting the above technical solution, the invention has at least the following beneficial effects: distributed heterogeneous accelerated computation is supported during the operation of the deep learning model. The concept of a virtual processor is introduced, the virtual processor is designated as the computing device for the corresponding operators, and the computation is distributed to different hardware devices for parallel execution, thereby achieving heterogeneous acceleration of deep learning model operations.
The foregoing presents aspects of embodiments and should not be used to limit the scope of the invention. Other embodiments in accordance with the techniques described herein will be apparent to one of ordinary skill in the art upon study of the following figures and detailed description, and are intended to fall within the scope of the present application.
Embodiments of the invention are explained and described in more detail below with reference to the drawings, but they should not be construed as limiting the invention.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings required for describing the prior art and the embodiments are briefly introduced below. Parts of the drawings are not necessarily drawn to scale; related elements may be omitted, or in some cases the scale may be exaggerated, in order to emphasize and clearly show the novel features described herein. In addition, the structural order may be arranged differently, as is known in the art.
FIG. 1 shows a schematic block diagram of a method of deep learning model distributed computation according to the present invention.
Detailed Description
While the present invention may be embodied in various forms, there is shown in the drawings and will hereinafter be described some exemplary and non-limiting embodiments, with the understanding that the present disclosure is to be considered an exemplification of the invention and is not intended to limit the invention to the specific embodiments illustrated.
FIG. 1 shows a schematic block diagram of a method of deep learning model distributed computation according to the present invention. In the embodiment shown in the figure, the method comprises at least the following steps:
s1: registering a virtual processor in a device management list;
s2: registering and writing operators supported by the virtual processor;
s3: detecting hardware resources associated with the virtual processor, and determining respective allocation proportions of the hardware resources according to the computing power of the associated hardware resources;
s4: configuring a deep learning model based on an operator supported by a virtual processor, and assigning the virtual processor to the operator used in the deep learning model;
s5: and the virtual processor allocates the input data of the corresponding operator to the hardware resources associated with the virtual processor according to the allocation proportion so as to carry out operation, and combines the operation results of the hardware resources into the output of the corresponding operator.
To achieve heterogeneous acceleration, an embodiment of the present invention introduces the concept of a Virtual Processor (VPU), preferably within TensorFlow. First, registration of the virtual processor VPU is added according to TensorFlow's hardware registration mechanism, so that the VPU device appears in TensorFlow's device list. On this basis, step S2 registers and writes the operators supported by the virtual processor. Specifically, the operators supported by the VPU are registered according to TensorFlow's operator registration mechanism. Taking two-dimensional convolution as an example, the operator "Conv2D" is registered in the following format:
REGISTER_KERNEL_BUILDER(Name("Conv2D").Device(DEVICE_VPU).TypeConstraint<float>("T"),Conv2DOp<VPUDevice,float>);
where "Conv2D" is the name of the operator, and the device needs to be registered as "DEVICE_VPU" to indicate that the virtual processor supports this operator. The name must be identical to that of the CPU version of the operator in the original TensorFlow, so that all CPU, GPU and TPU two-dimensional convolution models remain compatible. A corresponding code implementation is then written according to the arithmetic logic the operator requires.
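The REGISTER_KERNEL_BUILDER call above binds an (operator name, device type) pair to a kernel implementation. As a rough, language-neutral illustration of that dispatch idea (a sketch, not TensorFlow's actual internals; all names here are invented), a minimal registry could look like:

```python
# Minimal sketch of an operator-registration mechanism keyed by
# (operator name, device type), loosely modeled on TensorFlow's
# REGISTER_KERNEL_BUILDER macro. Illustrative only.
KERNEL_REGISTRY = {}

def register_kernel(op_name, device, kernel_fn):
    """Bind an (op, device) pair to a concrete kernel implementation."""
    KERNEL_REGISTRY[(op_name, device)] = kernel_fn

def lookup_kernel(op_name, device):
    try:
        return KERNEL_REGISTRY[(op_name, device)]
    except KeyError:
        raise NotImplementedError(
            f"operator {op_name!r} is not registered for device {device!r}")

# Register the same operator name for two device types; keeping the
# name identical across devices is what preserves model compatibility.
register_kernel("Conv2D", "DEVICE_CPU", lambda x: ("cpu-conv", x))
register_kernel("Conv2D", "DEVICE_VPU", lambda x: ("vpu-conv", x))

kernel = lookup_kernel("Conv2D", "DEVICE_VPU")
```

A lookup for an unregistered (operator, device) pair fails loudly, which mirrors how a framework rejects an operator placed on a device that has no kernel for it.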
Before deep learning training, step S3 detects the hardware resources associated with the virtual processor in the current host and determines their respective allocation proportions according to their computing power. In some embodiments of the invention, the hardware resources associated with the virtual processor include one or more of a CPU, GPU and FPGA. For example, if the current host has online an FPGA with 1T of computing power, a GPU with 2T and a CPU with 0.5T, the allocation ratio is 2:4:1. Subsequently, step S4 configures a deep learning model based on the operators supported by the virtual processor and assigns the virtual processor to the operators used in the model. That is, for each layer of the deep learning model, a suitable operator is selected, according to the operational requirements, from among the operators registered and written in step S2. The VPU is then designated at the application layer as the running device, for example using tf.device("/VPU:N") to specify the VPU device N to be used, where N is the device number of the virtual processor. Finally, in step S5 the virtual processor distributes the input data of the corresponding operator to its associated hardware resources according to the allocation proportions for computation, and merges the computation results of the hardware resources into the output of the operator. Taking the hardware resources above as an example, the input data of the operator are distributed to the FPGA, GPU and CPU according to the 2:4:1 ratio, and the distributed input data are processed on the corresponding hardware simultaneously. The computed results are merged to obtain the output of the operator, which is passed as input to the next layer of the deep learning model.
Distributing the data across different hardware resources for parallel computation greatly increases the computation speed and improves the training efficiency of the deep learning model.
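The proportion-based split and merge of steps S3 and S5 can be sketched as follows. The device names and the 1T/2T/0.5T figures follow the example in the text, while the contiguous-slice scheme and all function names are assumptions of this sketch, not the patent's implementation:

```python
# Sketch of steps S3/S5: derive allocation proportions from per-device
# computing power, split an operator's input batch accordingly, and
# merge the per-device results into one output.
def allocation_proportions(power):
    """Normalize raw computing-power figures into fractions summing to 1."""
    total = sum(power.values())
    return {dev: p / total for dev, p in power.items()}

def split_batch(batch, power):
    """Assign contiguous slices of the batch to devices by proportion."""
    props = allocation_proportions(power)
    shares, start = {}, 0
    devices = list(power)
    for i, dev in enumerate(devices):
        if i == len(devices) - 1:      # last device takes the remainder
            end = len(batch)
        else:
            end = start + round(len(batch) * props[dev])
        shares[dev] = batch[start:end]
        start = end
    return shares

def run_operator(batch, power, kernel):
    # In a real system each share would run on its own device in
    # parallel; here the kernel is applied serially and the partial
    # results are concatenated in device order.
    shares = split_batch(batch, power)
    return [y for dev in power for y in kernel(shares[dev])]

# Example from the text: FPGA 1T, GPU 2T, CPU 0.5T -> ratio 2:4:1.
power = {"FPGA": 1.0, "GPU": 2.0, "CPU": 0.5}
out = run_operator(list(range(7)), power, lambda xs: [x * x for x in xs])
```

With a batch of 7 elements, the FPGA receives 2, the GPU 4 and the CPU 1, matching the 2:4:1 ratio; the merged output is identical to running the kernel on the whole batch at once.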
Further embodiments of the present invention are described below. Note that, unless otherwise indicated, the step numbers used therein serve only to refer to the steps unambiguously and do not limit their order.
In several embodiments of the method for deep learning model distributed operation, step S2, registering and writing the operators supported by the virtual processor, further comprises: writing, within the same operator, the operation instructions for the CPU, GPU and FPGA and the corresponding adaptation instructions. Because the virtual processor may be associated with one or more of the CPU, GPU and FPGA, and these devices require different logic to accomplish the same function, the operation instructions and adaptation instructions for the CPU, GPU and FPGA are all written into the same operator when it is implemented.
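One way to picture "writing the CPU, GPU and FPGA instructions in the same operator" is a single operator object that dispatches to a per-device code path at run time. The following sketch is illustrative only; the class name and the placeholder implementations are assumptions, not the patent's code:

```python
# Sketch: one operator holds all three device-specific code paths and
# selects among them by device type at call time.
class Conv2DOperator:
    def __init__(self):
        self.impls = {
            "CPU": self._run_cpu,
            "GPU": self._run_gpu,
            "FPGA": self._run_fpga,
        }

    def _run_cpu(self, x):   # placeholder for the CPU code path
        return [v + 1 for v in x]

    def _run_gpu(self, x):   # placeholder for the GPU code path
        return [v + 1 for v in x]

    def _run_fpga(self, x):  # placeholder for the FPGA code path
        return [v + 1 for v in x]

    def __call__(self, device, x):
        # Dispatch to the implementation adapted to the given device.
        return self.impls[device](x)

op = Conv2DOperator()
```

Because every code path computes the same function, the operator produces identical results regardless of which hardware the virtual processor routes a data share to, which is what makes proportional splitting safe.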
In some embodiments of the method for deep learning model distributed operation, step S4, configuring the deep learning model based on the operators supported by the virtual processor and assigning the virtual processor to the operators used in the model, further comprises: constructing the deep learning model on the TensorFlow framework, and selecting, for each layer of the model, a corresponding operator supported by the virtual processor. TensorFlow is a second-generation artificial intelligence learning system developed by Google on the basis of DistBelief, and its name derives from its operating principle: a tensor is an N-dimensional array, flow denotes computation based on a dataflow graph, and TensorFlow describes the process in which tensors flow from one end of the dataflow graph to the other. TensorFlow is a system that carries complex data structures into an artificial neural network for analysis and processing. Therefore, in embodiments of the present invention, the deep learning model is preferably constructed on the TensorFlow framework, and a corresponding operator supported by the virtual processor is selected for each layer of the model so that operations can subsequently be executed on the virtual processor.
In one or more embodiments of the method for deep learning model distributed operation, the operators supported by the virtual processor include a Forward operator and the Backward operator associated with it. For example, the aforementioned "Conv2D" is a forward operator; when "Conv2D" is registered and written, the backward operator related to it should be registered and written at the same time.
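Pairing each forward operator with its backward operator at registration time might be sketched as follows. The registry layout and the stand-in gradient arithmetic are assumptions of this illustration, not the patent's implementation:

```python
# Sketch: a forward op is registered together with its backward op, so
# training (which needs gradients) works on the virtual processor.
OPS = {}

def register_op(name, forward, backward):
    """Register a forward operator and its associated backward operator."""
    OPS[name] = {"forward": forward, "backward": backward}

def conv2d_forward(x, w):
    return x * w            # scalar stand-in for a real convolution

def conv2d_backward(grad_out, x, w):
    # Gradients w.r.t. the input x and the weight w, by the chain rule.
    return grad_out * w, grad_out * x

register_op("Conv2D", conv2d_forward, conv2d_backward)

y = OPS["Conv2D"]["forward"](3.0, 2.0)             # 6.0
gx, gw = OPS["Conv2D"]["backward"](1.0, 3.0, 2.0)  # (2.0, 3.0)
```

Registering the pair together guarantees that any operator usable in a model's forward pass also has the backward kernel that the training loop will request.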
In another aspect, the present invention further provides an apparatus for distributed operation of a deep learning model, where the apparatus includes: at least one processor; and a memory storing processor-executable program instructions that, when executed by the processor, perform the steps of:
s1: registering a virtual processor in a device management list;
s2: registering and writing operators supported by the virtual processor;
s3: detecting hardware resources associated with the virtual processor, and determining respective allocation proportions of the hardware resources according to the computing power of the associated hardware resources;
s4: configuring a deep learning model based on an operator supported by a virtual processor, and assigning the virtual processor to the operator used in the deep learning model;
s5: and the virtual processor allocates the input data of the corresponding operator to the hardware resources associated with the virtual processor according to the allocation proportion so as to carry out operation, and combines the operation results of the hardware resources into the output of the corresponding operator.
In some embodiments of the apparatus for deep learning model distributed operations of the present invention, the hardware resources associated with the virtual processor include one or more of a CPU, a GPU, and an FPGA.
In several embodiments of the apparatus for deep learning model distributed operations of the present invention, the step S2 of registering and writing operators supported by a virtual processor further comprises: and compiling operation instructions for the CPU, the GPU and the FPGA and corresponding adaptive instructions in the same operator.
In some embodiments of the apparatus for deep learning model distributed operation of the present invention, the step S4 configures the deep learning model based on operators supported by the virtual processor, and the specifying the virtual processor for the operators used in the deep learning model further includes: and constructing a deep learning model based on a TensorFlow framework, and selecting operators supported by corresponding virtual processors for each layer in the deep learning model.
In one or more embodiments of the apparatus for deep learning model distributed operations of the present invention, the virtual processor supported operators include a forward operator and a backward operator associated with the forward operator.
The devices and apparatuses disclosed in the embodiments of the present invention may be various electronic terminal apparatuses, such as a mobile phone, a Personal Digital Assistant (PDA), a tablet computer (PAD), a smart television, and the like, or may be a large terminal apparatus, such as a server, and therefore the scope of protection disclosed in the embodiments of the present invention should not be limited to a specific type of device and apparatus. The client disclosed in the embodiment of the present invention may be applied to any one of the above electronic terminal devices in the form of electronic hardware, computer software, or a combination of both.
The computer-readable storage media (e.g., memory) described herein may be volatile memory or nonvolatile memory, or may include both. By way of example, and not limitation, nonvolatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM), which can act as external cache memory. By way of example and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), and direct Rambus RAM (DRRAM). The storage devices of the disclosed aspects are intended to comprise, without being limited to, these and other suitable types of memory.
By adopting the above technical solution, the invention has at least the following beneficial effects: distributed heterogeneous accelerated computation is supported during the operation of the deep learning model. The concept of a virtual processor is introduced, the virtual processor is designated as the computing device for the corresponding operators, and the computation is distributed to different hardware devices for parallel execution, thereby achieving heterogeneous acceleration of deep learning model operations.
It is to be understood that the features listed above for the different embodiments may be combined with each other to form further embodiments within the scope of the invention, where technically feasible. Furthermore, the specific examples and embodiments described herein are non-limiting, and various modifications of the structure, steps and sequence set forth above may be made without departing from the scope of the invention.
In this application, use of the disjunctive is intended to include the conjunctive. The use of definite or indefinite articles is not intended to indicate cardinality; in particular, a reference to "the" object or "a"/"an" object is intended to denote one of possibly several such objects. However, although elements of the disclosed embodiments may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated. Furthermore, the conjunction "or" may be used to convey features that are present simultaneously rather than mutually exclusive alternatives; in other words, "or" should be understood as "and/or". The term "includes" is inclusive and has the same scope as "comprising".
The above-described embodiments, particularly any "preferred" embodiments, are possible examples of implementations, and are presented merely for a clear understanding of the principles of the invention. Many variations and modifications may be made to the above-described embodiments without departing substantially from the spirit and principles of the technology described herein. All such modifications are intended to be included within the scope of this disclosure.

Claims (10)

1. A method of deep learning model distributed operations, the method comprising the steps of:
registering a virtual processor in a device management list;
registering and writing operators supported by the virtual processor;
detecting hardware resources associated with the virtual processor, and determining respective allocation proportions of the hardware resources according to the computing power of the associated hardware resources;
configuring a deep learning model based on operators supported by the virtual processor, and assigning the virtual processor to the operators used in the deep learning model;
and the virtual processor allocates the input data of the corresponding operator to the hardware resource associated with the virtual processor according to the allocation proportion so as to carry out operation, and combines the operation results of the hardware resources into the output of the corresponding operator.
2. The method of claim 1, wherein the hardware resources associated with the virtual processor comprise one or more of a CPU, a GPU, and an FPGA.
3. The method of claim 2, wherein registering and writing operators supported by the virtual processor further comprises:
and compiling operation instructions for the CPU, the GPU and the FPGA and corresponding adaptive instructions in the same operator.
4. The method of claim 1, wherein configuring a deep learning model based on operators supported by the virtual processor, and wherein assigning a virtual processor to an operator used in the deep learning model further comprises:
and constructing the deep learning model based on a TensorFlow framework, and selecting corresponding operators supported by the virtual processor for each layer in the deep learning model.
5. The method of claim 1, wherein the virtual processor supported operators comprise a forward operator and a backward operator associated with the forward operator.
6. An apparatus for deep learning model distributed operations, the apparatus comprising:
at least one processor; and
a memory storing processor-executable program instructions that, when executed by the processor, perform the steps of:
registering a virtual processor in a device management list;
registering and writing operators supported by the virtual processor;
detecting hardware resources associated with the virtual processor, and determining respective allocation proportions of the hardware resources according to the computing power of the associated hardware resources;
configuring a deep learning model based on operators supported by the virtual processor, and assigning the virtual processor to the operators used in the deep learning model;
and the virtual processor allocates the input data of the corresponding operator to the hardware resource associated with the virtual processor according to the allocation proportion so as to carry out operation, and combines the operation results of the hardware resources into the output of the corresponding operator.
7. The apparatus of claim 6, wherein the hardware resources associated with the virtual processor comprise one or more of a CPU, a GPU, and an FPGA.
8. The apparatus of claim 7, wherein registering and writing operators supported by the virtual processor further comprises:
and compiling operation instructions for the CPU, the GPU and the FPGA and corresponding adaptive instructions in the same operator.
9. The apparatus of claim 6, wherein configuring a deep learning model based on operators supported by the virtual processor, and wherein specifying a virtual processor for operators used in the deep learning model further comprises:
and constructing the deep learning model based on a TensorFlow framework, and selecting corresponding operators supported by the virtual processor for each layer in the deep learning model.
10. The apparatus of claim 6, wherein the virtual processor supported operators comprise a forward operator and a backward operator related to the forward operator.
CN201911140560.6A 2019-11-20 2019-11-20 Deep learning model distributed operation method and device Withdrawn CN110866610A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201911140560.6A CN110866610A (en) 2019-11-20 2019-11-20 Deep learning model distributed operation method and device
PCT/CN2020/104006 WO2021098269A1 (en) 2019-11-20 2020-07-24 Deep learning model distributed operation method and apparatus

Publications (1)

Publication Number Publication Date
CN110866610A true CN110866610A (en) 2020-03-06

Family

ID=69655743

Country Status (2)

Country Link
CN (1) CN110866610A (en)
WO (1) WO2021098269A1 (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111736463A (en) * 2020-05-09 2020-10-02 刘炜 Adaptive deep learning control method based on operation platform
CN111858036A (en) * 2020-06-29 2020-10-30 浪潮电子信息产业股份有限公司 Tensorflow system acceleration method, device and equipment based on FPGA equipment and storage medium
CN112270399A (en) * 2020-09-29 2021-01-26 北京百度网讯科技有限公司 Operator registration processing method and device based on deep learning and electronic equipment
WO2021098269A1 (en) * 2019-11-20 2021-05-27 苏州浪潮智能科技有限公司 Deep learning model distributed operation method and apparatus
CN113469360A (en) * 2020-03-31 2021-10-01 杭州海康威视数字技术股份有限公司 Inference method and device
CN113918351A (en) * 2021-12-08 2022-01-11 之江实验室 Method and device for adapting to distributed training in deep learning framework and AI acceleration card
CN116306856A (en) * 2023-05-17 2023-06-23 之江实验室 Deep learning model deployment method and device based on search

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
US10069681B2 (en) * 2015-12-31 2018-09-04 Amazon Technologies, Inc. FPGA-enabled compute instances
US10523519B2 (en) * 2017-04-14 2019-12-31 Accenture Global Solutions Limited Comparative multi-forecasting analytics service stack for cloud computing resource allocation
US20180322386A1 (en) * 2017-05-05 2018-11-08 Intel Corporation Fine-grain compute communication execution for deep learning frameworks
CN110866610A (en) * 2019-11-20 2020-03-06 苏州浪潮智能科技有限公司 Deep learning model distributed operation method and device

Cited By (12)

Publication number Priority date Publication date Assignee Title
WO2021098269A1 (en) * 2019-11-20 2021-05-27 苏州浪潮智能科技有限公司 Deep learning model distributed operation method and apparatus
CN113469360A (en) * 2020-03-31 2021-10-01 杭州海康威视数字技术股份有限公司 Inference method and device
CN113469360B (en) * 2020-03-31 2023-10-20 杭州海康威视数字技术股份有限公司 Reasoning method and device
CN111736463A (en) * 2020-05-09 2020-10-02 刘炜 Adaptive deep learning control method based on operation platform
CN111736463B (en) * 2020-05-09 2023-03-03 刘炜 Adaptive deep learning control method based on operation platform
CN111858036A (en) * 2020-06-29 2020-10-30 浪潮电子信息产业股份有限公司 Tensorflow system acceleration method, device and equipment based on FPGA equipment and storage medium
CN111858036B (en) * 2020-06-29 2022-06-10 浪潮电子信息产业股份有限公司 Tensorflow system acceleration method, device and equipment based on FPGA equipment and storage medium
CN112270399A (en) * 2020-09-29 2021-01-26 北京百度网讯科技有限公司 Operator registration processing method and device based on deep learning and electronic equipment
CN113918351A (en) * 2021-12-08 2022-01-11 之江实验室 Method and device for adapting to distributed training in deep learning framework and AI acceleration card
US11714995B2 (en) 2021-12-08 2023-08-01 Zhejiang Lab Method for distributed type training adaptation and apparatus in deep learning framework and AI accelerator card
CN116306856A (en) * 2023-05-17 2023-06-23 之江实验室 Deep learning model deployment method and device based on search
CN116306856B (en) * 2023-05-17 2023-09-05 之江实验室 Deep learning model deployment method and device based on search

Also Published As

Publication number Publication date
WO2021098269A1 (en) 2021-05-27

Similar Documents

Publication Publication Date Title
CN110866610A (en) Deep learning model distributed operation method and device
US11227216B2 (en) Batch processing in a neural network processor
US11816559B2 (en) Dilated convolution using systolic array
CN110908667A (en) Method and device for joint compilation of neural network and electronic equipment
EP3502975A1 (en) Methods and apparatus for model parallelism in artificial neural networks
CN110889439B (en) Image feature extraction method and device, electronic equipment and storage medium
US20210158131A1 (en) Hierarchical partitioning of operators
EP3857384B1 (en) Processing sequential inputs using neural network accelerators
CN108470211B (en) Method and device for realizing convolution calculation and computer storage medium
CN114201107A (en) Storage device, method for operating storage device, and electronic device
US11941528B2 (en) Neural network training in a distributed system
US20210326683A1 (en) Hardware circuit for accelerating neural network computations
CN110955390A (en) Data processing method and device and electronic equipment
KR20210023401A (en) Neural network computing method and system including the computing method
CN116431315B (en) Batch processing task processing method and device, electronic equipment and storage medium
Wang et al. SOLAR: Services-oriented learning architectures
CN116755878A (en) Program running method, apparatus, device, medium and program product
WO2023050807A1 (en) Data processing method, apparatus, and system, electronic device, and storage medium
CN114327856A (en) Data processing method and device, electronic equipment and storage medium
CN114021709B (en) Multi-FPGA data processing method and device, server and storage medium
CN111831333A (en) Instruction decomposition method and device for intelligent processor and electronic equipment
CN111026515B (en) State monitoring device, task scheduler and state monitoring method
KR20240063137A (en) Hardware accelerator-optimized group convolution-based neural network model
CN113704687B (en) Tensor calculation operation method, device and operation system
US20220012573A1 (en) Neural network accelerators

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20200306