WO2021098269A1 - Method and apparatus for distributed computation of a deep learning model - Google Patents
Method and apparatus for distributed computation of a deep learning model
- Publication number
- WO2021098269A1 (PCT/CN2020/104006; CN2020104006W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- virtual processor
- deep learning
- learning model
- operator
- supported
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
Definitions
- the invention relates to the field of artificial intelligence technology.
- the invention further relates to a method and device for distributed operation of a deep learning model.
- TensorFlow is currently the most widely used deep learning framework, and many deep learning models are implemented on it. Most hardware vendors, including ASIC (Application-Specific Integrated Circuit) and FPGA (Field-Programmable Gate Array) vendors, treat TensorFlow as the primary framework to support for deep learning. At present, the commonly used inference units are the GPU (Graphics Processing Unit), the CPU (Central Processing Unit), and the TPU (Tensor Processing Unit); FPGAs are not supported for deep learning training.
- TensorFlow executes computations as a data-flow graph built from operators.
- in the current implementation, an operator can usually be assigned to only one type of hardware for execution; when the model runs serially, the other hardware sits idle waiting for the result of the previous operator and cannot compute in parallel.
- based on the above objective, the present invention proposes a method for distributed operation of a deep learning model, wherein the method includes the following steps: registering a virtual processor in a device management list; registering and writing operators supported by the virtual processor; detecting the hardware resources associated with the virtual processor, and determining the respective allocation ratio of each hardware resource according to the computing power of the associated hardware resources; and configuring a deep learning model based on the operators supported by the virtual processor, and specifying the virtual processor for the operators used in the deep learning model;
- the virtual processor allocates the input data of the corresponding operator to the hardware resources associated with the virtual processor according to the allocation ratios for computation, and merges the computation results of each hardware resource into the output of the corresponding operator.
- the hardware resources associated with the virtual processor include one or more of a CPU, a GPU, and an FPGA.
- registering and writing the operators supported by the virtual processor further includes: writing, within the same operator, the operation instructions for the CPU, GPU, and FPGA together with the corresponding adaptation instructions.
- configuring the deep learning model based on the operators supported by the virtual processor and specifying the virtual processor for the operators used in the deep learning model further includes: constructing the deep learning model on the TensorFlow framework, and selecting, for each layer of the deep learning model, the corresponding operators supported by the virtual processor.
- the operators supported by the virtual processor include a forward operator and a backward operator related to the forward operator.
- the present invention also provides a device for distributed operation of a deep learning model, wherein the device includes:
- at least one processor; and
- a memory that stores program instructions executable by the processor, the program instructions performing the following steps when run by the processor: registering a virtual processor in a device management list; registering and writing operators supported by the virtual processor; detecting the hardware resources associated with the virtual processor, and determining the respective allocation ratio of each hardware resource according to the computing power of the associated hardware resources; and configuring a deep learning model based on the operators supported by the virtual processor, and specifying the virtual processor for the operators used in the deep learning model;
- the virtual processor allocates the input data of the corresponding operator to the hardware resources associated with the virtual processor according to the allocation ratios for computation, and merges the computation results of each hardware resource into the output of the corresponding operator.
- the hardware resources associated with the virtual processor include one or more of CPU, GPU, and FPGA.
- registering and writing the operators supported by the virtual processor further includes: writing, within the same operator, the operation instructions for the CPU, GPU, and FPGA together with the corresponding adaptation instructions.
- configuring the deep learning model based on the operators supported by the virtual processor and specifying the virtual processor for the operators used in the deep learning model further includes: constructing the deep learning model on the TensorFlow framework, and selecting, for each layer of the deep learning model, the corresponding operators supported by the virtual processor.
- the operators supported by the virtual processor include a forward operator and a backward operator related to the forward operator.
- the present invention has at least the following beneficial effects: it supports distributed heterogeneous accelerated computation during the operation of the deep learning model, introduces the concept of a virtual processor, and designates the virtual processor as the computing device for the corresponding operators, distributing the computation across different hardware devices for parallel execution, thereby achieving heterogeneous acceleration of deep learning model operations.
- Fig. 1 shows a schematic block diagram of a method for distributed operation of a deep learning model according to the present invention.
- the method at least includes the following steps: S1: registering a virtual processor in a device management list; S2: registering and writing operators supported by the virtual processor; S3: detecting the hardware resources associated with the virtual processor, and determining the respective allocation ratio of each hardware resource according to the computing power of the associated hardware resources; S4: configuring a deep learning model based on the operators supported by the virtual processor, and specifying the virtual processor for the operators used in the deep learning model;
- S5: the virtual processor allocates the input data of the corresponding operator to the hardware resources associated with the virtual processor according to the allocation ratios for computation, and merges the computation results of each hardware resource into the output of the corresponding operator.
- to achieve heterogeneous acceleration, the embodiment of the present invention preferably introduces the concept of a virtual processor in TensorFlow. Therefore, registration of the virtual processor VPU is first added according to TensorFlow's hardware registration mechanism, so that the VPU device appears in TensorFlow's device list. On this basis, step S2 registers and writes the operators supported by the virtual processor. Specifically, the operators supported by the VPU are registered according to TensorFlow's operator registration mechanism. Taking two-dimensional convolution as an example, the operator "Conv2D" is registered in the following format:
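The registration format referred to here is given later in the patent's description; it is reproduced below with its line break repaired:

```cpp
REGISTER_KERNEL_BUILDER(
    Name("Conv2D").Device(DEVICE_VPU).TypeConstraint<float>("T"),
    Conv2DOp<VPUDevice, float>);
```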
- Conv2D is the name of the operator.
- Device needs to be registered as "DEVICE_VPU" to indicate that the virtual processor supports the operator.
- the Name must be identical to that of the CPU version of the operator in stock TensorFlow, so that all existing CPU, GPU, and TPU two-dimensional convolution models remain compatible. The corresponding code instructions are then written according to the operation logic required by the operator.
- before deep learning training is carried out, step S3 detects the hardware resources associated with the virtual processor in the current host, and determines the respective allocation ratio of each hardware resource according to the computing power of the associated hardware resources.
- the hardware resources associated with the virtual processor include one or more of CPU, GPU, and FPGA. For example, if there is an FPGA with 1T computing power, a GPU with 2T computing power, and a CPU with 0.5T computing power online in the current host, the allocation ratio is 2:4:1.
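The patent does not spell out how the ratio is computed from the detected computing power; the following is a minimal sketch of one plausible derivation, normalizing each device against the weakest one (the function and variable names are illustrative, not from the patent):

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Reduce per-device computing power (in TFLOPS) to an integer allocation
// ratio by normalizing against the weakest device: {1.0, 2.0, 0.5} -> 2:4:1.
std::vector<int> AllocationRatio(const std::vector<double>& tflops) {
  double weakest = *std::min_element(tflops.begin(), tflops.end());
  std::vector<int> ratio;
  for (double t : tflops)
    ratio.push_back(static_cast<int>(t / weakest + 0.5));  // round to nearest
  return ratio;
}

int main() {
  // FPGA 1T, GPU 2T, CPU 0.5T -- the example given in the text.
  for (int r : AllocationRatio({1.0, 2.0, 0.5})) std::printf("%d ", r);
  // prints: 2 4 1
}
```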
- step S4 configures a deep learning model based on the operators supported by the virtual processor, and specifies a virtual processor for the operators used in the deep learning model.
- each layer of the deep learning model is matched, according to its computational requirements, to the corresponding operators registered and written in step S2.
- the VPU is then specified as the execution device at the application layer, for example using tf.device("/VPU:N") to select the VPU device N to be used, where N is the device number of the virtual processor VPU.
- the virtual processor allocates the input data of the corresponding operator to the hardware resource associated with the virtual processor according to the allocation ratio for calculation, and merges the calculation result of each hardware resource into the output of the corresponding operator.
- continuing the example above, the input data of the operator is allocated to the FPGA, GPU, and CPU according to the allocation ratio of 2:4:1, and the allocated input data is computed on the corresponding hardware simultaneously.
- the computed results are merged to obtain the output of the operator, which is passed to the next layer of the deep learning model as its input. Since the data is spread across different hardware resources and computed in parallel, the computation is greatly accelerated and the training efficiency of the deep learning model is improved.
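A schematic sketch of this scatter/gather step over a flat buffer; the Device struct and RunOnDevice helper are assumptions made for illustration rather than APIs from the patent, and a real implementation would run the slices concurrently instead of in a sequential loop:

```cpp
#include <cstddef>
#include <vector>

using Tensor = std::vector<float>;

struct Device {
  int share;  // this device's share of the allocation ratio, e.g. 2, 4, or 1
};

// Stand-in for executing the operator on one device's slice of the input;
// a real implementation would dispatch to the FPGA/GPU/CPU kernel.
Tensor RunOnDevice(const Device& /*dev*/, const Tensor& slice) { return slice; }

// Split the operator's input in proportion to each device's share, compute
// each slice on its device, and concatenate the partial results into the
// operator's output (step S5 of the method).
Tensor ScatterGather(const std::vector<Device>& devs, const Tensor& input) {
  int total = 0;
  for (const Device& d : devs) total += d.share;

  Tensor output;
  std::size_t offset = 0;
  for (std::size_t i = 0; i < devs.size(); ++i) {
    std::size_t n = (i + 1 == devs.size())
                        ? input.size() - offset  // remainder to the last device
                        : input.size() * devs[i].share / total;
    Tensor slice(input.begin() + offset, input.begin() + offset + n);
    Tensor part = RunOnDevice(devs[i], slice);
    output.insert(output.end(), part.begin(), part.end());
    offset += n;
  }
  return output;
}
```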
- step S2, registering and writing an operator supported by the virtual processor, further includes: writing, within the same operator, the operation instructions for the CPU, GPU, and FPGA together with the corresponding adaptation instructions. Since the virtual processor can be associated with one or more of the CPU, GPU, and FPGA, and these devices require different logic processes to accomplish the same function, the operation instructions and the corresponding adaptation instructions for the CPU, GPU, and FPGA are written within the same operator when the operator is written.
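One way to picture "one operator, three instruction paths" is a single compute body that branches on the backend; the enum and function below are purely illustrative assumptions, not code from the patent:

```cpp
enum class Backend { kCPU, kGPU, kFPGA };

// Illustrative shape of "one operator, three instruction paths": a single
// operator body carries the operation instructions for each backend, and the
// adaptation logic selects the path for the hardware actually assigned.
void VpuOperatorCompute(Backend backend, const float* in, float* out, int n) {
  switch (backend) {
    case Backend::kCPU:
      for (int i = 0; i < n; ++i) out[i] = in[i];  // placeholder CPU loop
      break;
    case Backend::kGPU:
      // launch the equivalent GPU kernel for this slice (omitted)
      break;
    case Backend::kFPGA:
      // enqueue the FPGA accelerator invocation for this slice (omitted)
      break;
  }
}
```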
- step S4, configuring the deep learning model based on the operators supported by the virtual processor and designating the virtual processor for the operators used in the deep learning model, further includes: constructing the deep learning model on the TensorFlow framework, and selecting, for each layer of the deep learning model, the corresponding operators supported by the virtual processor.
- TensorFlow is Google's second-generation artificial intelligence learning system, developed on the basis of DistBelief; its name derives from its operating principle.
- Tensor means an N-dimensional array.
- Flow means computation based on a data-flow graph.
- TensorFlow describes the process of tensors flowing from one end of the data-flow graph to the other.
- TensorFlow is a system that feeds complex data structures into artificial neural networks for analysis and processing. Therefore, in the embodiment of the present invention, the deep learning model is preferably constructed on the TensorFlow framework, and for each layer of the deep learning model the corresponding operators supported by the virtual processor are selected, so that subsequent computation can be performed on the virtual processor.
- the operators supported by the virtual processor include a forward operator and a backward operator related to the forward operator.
- the aforementioned "Conv2D" is a forward operator; when the "Conv2D" operator is registered and written, the backward operator related to the "Conv2D" operator should also be registered and written.
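In TensorFlow, the backward pass of Conv2D is handled by the ops Conv2DBackpropInput and Conv2DBackpropFilter; a sketch of registering them for the VPU in the same style as the forward operator follows (the kernel class names here are assumed, not taken from the patent):

```cpp
REGISTER_KERNEL_BUILDER(
    Name("Conv2DBackpropInput").Device(DEVICE_VPU).TypeConstraint<float>("T"),
    Conv2DBackpropInputOp<VPUDevice, float>);
REGISTER_KERNEL_BUILDER(
    Name("Conv2DBackpropFilter").Device(DEVICE_VPU).TypeConstraint<float>("T"),
    Conv2DBackpropFilterOp<VPUDevice, float>);
```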
- the present invention also provides an apparatus for distributed operation of a deep learning model, wherein the apparatus includes: at least one processor; and a memory that stores program instructions executable by the processor, the program instructions performing the following steps when run by the processor: registering a virtual processor in a device management list; registering and writing operators supported by the virtual processor; detecting the hardware resources associated with the virtual processor, and determining the respective allocation ratio of each hardware resource according to the computing power of the associated hardware resources; and configuring a deep learning model based on the operators supported by the virtual processor, and specifying the virtual processor for the operators used in the deep learning model;
- the virtual processor allocates the input data of the corresponding operator to the hardware resources associated with the virtual processor according to the allocation ratios for computation, and merges the computation results of each hardware resource into the output of the corresponding operator.
- the hardware resources associated with the virtual processor include one or more of a CPU, a GPU, and an FPGA.
- step S2, registering and writing the operators supported by the virtual processor, further includes: writing, within the same operator, the operation instructions for the CPU, GPU, and FPGA together with the corresponding adaptation instructions.
- step S4, configuring the deep learning model based on the operators supported by the virtual processor and designating the virtual processor for the operators used in the deep learning model, further includes: constructing the deep learning model on the TensorFlow framework, and selecting, for each layer of the deep learning model, the corresponding operators supported by the virtual processor.
- the operators supported by the virtual processor include a forward operator and a backward operator related to the forward operator.
- the apparatuses, devices, and the like disclosed in the embodiments of the present invention may be various electronic terminal devices, such as mobile phones, personal digital assistants (PDA), tablet computers (PAD), and smart TVs, or large-scale terminal devices such as servers; therefore, the scope of protection disclosed in the embodiments of the present invention should not be limited to a specific type of device or equipment.
- the client disclosed in the embodiment of the present invention may be applied to any of the above-mentioned electronic terminal devices in the form of electronic hardware, computer software, or a combination of the two.
- the computer-readable storage medium may be volatile memory or non-volatile memory, or may include both volatile memory and non-volatile memory.
- non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
- Volatile memory can include random access memory (RAM), which can act as external cache memory.
- RAM is available in many forms, such as synchronous RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), and direct Rambus RAM (DRRAM).
- the storage devices of the disclosed aspects are intended to include, but are not limited to, these and other suitable types of memory.
- the present invention has at least the following beneficial effects: it supports distributed heterogeneous accelerated computation during the operation of the deep learning model, introduces the concept of a virtual processor, and designates the virtual processor as the computing device for the corresponding operators, distributing the computation across different hardware devices for parallel execution, thereby achieving heterogeneous acceleration of deep learning model operations.
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Computation (AREA)
- Medical Informatics (AREA)
- Data Mining & Analysis (AREA)
- Computing Systems (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Mathematical Physics (AREA)
- Artificial Intelligence (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Advance Control (AREA)
Abstract
A method and apparatus for distributed computation of a deep learning model, the method comprising: registering a virtual processor in a device management list; registering and writing operators supported by the virtual processor; detecting the hardware resources associated with the virtual processor, and determining the respective allocation ratio of each hardware resource according to its computing power; configuring a deep learning model based on the operators supported by the virtual processor, and specifying the virtual processor for the operators used in the deep learning model; and the virtual processor allocating, according to the allocation ratios, the input data of the corresponding operator to the hardware resources associated with the virtual processor for computation, and merging the computation results of the hardware resources into the output of the corresponding operator. The invention introduces the concept of a virtual processor, designates the virtual processor as the computing device for the corresponding operators, and distributes the computation across different hardware devices for parallel execution, thereby achieving heterogeneous acceleration of deep learning model computation.
Description
This application claims priority to a Chinese patent application filed with the China National Intellectual Property Administration on November 20, 2019, with application number 201911140560.6 and invention title "Method and apparatus for distributed computation of a deep learning model", the entire contents of which are incorporated herein by reference.
The present invention relates to the field of artificial intelligence technology, and more particularly to a method and apparatus for distributed computation of a deep learning model.
TensorFlow is currently the most widely used deep learning framework, and many deep learning models are implemented on it. Most hardware vendors, including ASIC (Application-Specific Integrated Circuit) and FPGA (Field-Programmable Gate Array) vendors, treat TensorFlow as the primary framework to support for deep learning. At present, the commonly used inference units are the GPU (Graphics Processing Unit), the CPU (Central Processing Unit), and the TPU (Tensor Processing Unit); FPGAs are not supported for deep learning training.
TensorFlow executes computations as a data-flow graph built from operators. In current implementations, an operator can usually be assigned to only one type of hardware for execution; when the model runs serially, the other hardware sits idle waiting for the result of the previous operator and cannot compute in parallel.
In addition, most vendors currently support only inference of TensorFlow models, and training is generally supported only on the CPU, GPU, and TPU. Some vendors have implemented FPGA support for TensorFlow inference, but existing schemes for FPGA-based TensorFlow training are limited to single-machine scenarios.
Moreover, the vast majority of existing technical solutions are GPU-based, and GPUs have a lower performance-to-power ratio than FPGAs. Existing FPGA schemes support only single-machine training; training a large TensorFlow model can easily take a month or more, so the model development cycle is long and cannot meet the growing demand for model training.
In view of the above problems, a method is needed that allows multiple kinds of hardware to accelerate computation simultaneously in TensorFlow, implementing support for a virtual processor (VPU) on top of TensorFlow's existing execution mechanism and programming interfaces, thereby speeding up the computation of deep learning models.
Summary of the invention
In one aspect, based on the above objective, the present invention proposes a method for distributed computation of a deep learning model, the method comprising the following steps:
registering a virtual processor in a device management list;
registering and writing operators supported by the virtual processor;
detecting the hardware resources associated with the virtual processor, and determining the respective allocation ratio of each hardware resource according to the computing power of the associated hardware resources;
configuring a deep learning model based on the operators supported by the virtual processor, and specifying the virtual processor for the operators used in the deep learning model;
the virtual processor allocating, according to the allocation ratios, the input data of the corresponding operator to the hardware resources associated with the virtual processor for computation, and merging the computation results of the hardware resources into the output of the corresponding operator.
According to an embodiment of the method for distributed computation of a deep learning model of the present invention, the hardware resources associated with the virtual processor include one or more of a CPU, a GPU, and an FPGA.
According to an embodiment of the method for distributed computation of a deep learning model of the present invention, registering and writing the operators supported by the virtual processor further includes: writing, within the same operator, the operation instructions for the CPU, GPU, and FPGA together with the corresponding adaptation instructions.
According to an embodiment of the method for distributed computation of a deep learning model of the present invention, configuring the deep learning model based on the operators supported by the virtual processor and specifying the virtual processor for the operators used in the deep learning model further includes: constructing the deep learning model on the TensorFlow framework, and selecting, for each layer of the deep learning model, the corresponding operators supported by the virtual processor.
According to an embodiment of the method for distributed computation of a deep learning model of the present invention, the operators supported by the virtual processor include a forward operator and a backward operator related to the forward operator.
In another aspect, the present invention further proposes an apparatus for distributed computation of a deep learning model, the apparatus comprising:
at least one processor; and
a memory storing program instructions executable by the processor, the program instructions performing the following steps when run by the processor:
registering a virtual processor in a device management list;
registering and writing operators supported by the virtual processor;
detecting the hardware resources associated with the virtual processor, and determining the respective allocation ratio of each hardware resource according to the computing power of the associated hardware resources;
configuring a deep learning model based on the operators supported by the virtual processor, and specifying the virtual processor for the operators used in the deep learning model;
the virtual processor allocating, according to the allocation ratios, the input data of the corresponding operator to the hardware resources associated with the virtual processor for computation, and merging the computation results of the hardware resources into the output of the corresponding operator.
According to an embodiment of the apparatus for distributed computation of a deep learning model of the present invention, the hardware resources associated with the virtual processor include one or more of a CPU, a GPU, and an FPGA.
According to an embodiment of the apparatus for distributed computation of a deep learning model of the present invention, registering and writing the operators supported by the virtual processor further includes: writing, within the same operator, the operation instructions for the CPU, GPU, and FPGA together with the corresponding adaptation instructions.
According to an embodiment of the apparatus for distributed computation of a deep learning model of the present invention, configuring the deep learning model based on the operators supported by the virtual processor and specifying the virtual processor for the operators used in the deep learning model further includes: constructing the deep learning model on the TensorFlow framework, and selecting, for each layer of the deep learning model, the corresponding operators supported by the virtual processor.
According to an embodiment of the apparatus for distributed computation of a deep learning model of the present invention, the operators supported by the virtual processor include a forward operator and a backward operator related to the forward operator.
With the above technical solution, the present invention has at least the following beneficial effects: it supports distributed, heterogeneously accelerated computation during the operation of a deep learning model, introduces the concept of a virtual processor, and designates the virtual processor as the computing device for the corresponding operators, distributing the computation across different hardware devices for parallel execution, thereby achieving heterogeneous acceleration of deep learning model computation.
The aspects of the embodiments provided by the present invention should not be used to limit its scope of protection. Other implementations conceivable from the techniques described herein will be apparent to those of ordinary skill in the art upon studying the following drawings and detailed description, and such implementations are intended to be included within the scope of this application.
Embodiments of the present invention are explained and described in more detail below with reference to the accompanying drawings, but they should not be understood as limiting the present invention.
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings required for describing the prior art and the embodiments are briefly introduced below. Components in the drawings are not necessarily drawn to scale, related elements may be omitted, and in some cases proportions may have been exaggerated in order to emphasize and clearly show the novel features described herein. In addition, as is known in the art, the structural order may be arranged differently.
Fig. 1 shows a schematic block diagram of a method for distributed computation of a deep learning model according to the present invention.
Although the present invention may be embodied in various forms, some exemplary and non-limiting embodiments are shown in the drawings and described below; it should be understood that the present disclosure is to be considered as an exemplification of the invention and is not intended to limit the invention to the specific embodiments illustrated.
Fig. 1 shows a schematic block diagram of a method for distributed computation of a deep learning model according to the present invention. In the embodiment shown in the figure, the method includes at least the following steps:
S1: registering a virtual processor in a device management list;
S2: registering and writing operators supported by the virtual processor;
S3: detecting the hardware resources associated with the virtual processor, and determining the respective allocation ratio of each hardware resource according to the computing power of the associated hardware resources;
S4: configuring a deep learning model based on the operators supported by the virtual processor, and specifying the virtual processor for the operators used in the deep learning model;
S5: the virtual processor allocating, according to the allocation ratios, the input data of the corresponding operator to the hardware resources associated with the virtual processor for computation, and merging the computation results of the hardware resources into the output of the corresponding operator.
To achieve heterogeneous acceleration, the embodiment of the present invention preferably introduces the concept of a virtual processor in TensorFlow. Therefore, registration of the virtual processor VPU is first added according to TensorFlow's hardware registration mechanism, so that the VPU device appears in TensorFlow's device list. On this basis, step S2 registers and writes the operators supported by the virtual processor. Specifically, the operators supported by the VPU are registered according to TensorFlow's operator registration mechanism. Taking two-dimensional convolution as an example, the operator "Conv2D" is registered in the following format:
REGISTER_KERNEL_BUILDER(Name("Conv2D").Device(DEVICE_VPU).TypeConstraint<float>("T"), Conv2DOp<VPUDevice, float>);
Here, "Conv2D" is the name of the operator, and Device must be registered as "DEVICE_VPU" to indicate that the virtual processor supports this operator. Moreover, the Name must be identical to that of the CPU version of the operator in stock TensorFlow, so that all existing CPU, GPU, and TPU two-dimensional convolution models remain compatible. The corresponding code instructions are then written according to the operation logic required by the operator.
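For comparison, a simplified sketch of how the same op name is registered for the CPU device in stock TensorFlow (the actual TensorFlow source differs in detail; this is illustrative only):

```cpp
REGISTER_KERNEL_BUILDER(
    Name("Conv2D").Device(DEVICE_CPU).TypeConstraint<float>("T"),
    Conv2DOp<CPUDevice, float>);
```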
Before deep learning training is carried out, step S3 detects the hardware resources associated with the virtual processor in the current host and determines the respective allocation ratio of each hardware resource according to the computing power of the associated hardware resources. In some embodiments of the present invention, the hardware resources associated with the virtual processor include one or more of a CPU, a GPU, and an FPGA. For example, if the current host has online an FPGA with 1T of computing power, a GPU with 2T, and a CPU with 0.5T, the allocation ratio is 2:4:1. Subsequently, step S4 configures a deep learning model based on the operators supported by the virtual processor and specifies the virtual processor for the operators used in the deep learning model; that is, each layer of the deep learning model is matched, according to its computational requirements, to the corresponding operators registered and written in step S2. The VPU is then specified at the application layer as the execution device, for example using tf.device("/VPU:N") to select the VPU device N to be used, where N is the device number of the virtual processor VPU. Finally, in step S5, the virtual processor allocates, according to the allocation ratios, the input data of the corresponding operator to the hardware resources associated with the virtual processor for computation, and merges the computation results of the hardware resources into the output of the corresponding operator. Taking the hardware resources above as an example, the input data of the operator is distributed to the FPGA, GPU, and CPU in the ratio 2:4:1, and the distributed input data is computed on the corresponding hardware simultaneously. The computed results are merged to obtain the output of the operator, which is passed to the next layer of the deep learning model as its input. Since the data is spread across different hardware resources and computed in parallel, the computation speed is greatly increased and the training efficiency of the deep learning model is improved.
Further embodiments of the present invention are described below. It should be noted that, unless otherwise specified, the step numbers mentioned therein are used only to identify the steps conveniently and unambiguously and do not limit the order of the steps.
In several embodiments of the method for distributed computation of a deep learning model of the present invention, step S2, registering and writing the operators supported by the virtual processor, further includes: writing, within the same operator, the operation instructions for the CPU, GPU, and FPGA together with the corresponding adaptation instructions. Since the virtual processor may be associated with one or more of the CPU, GPU, and FPGA, and these devices require different logic processes to accomplish the same function, the operation instructions and corresponding adaptation instructions for the CPU, GPU, and FPGA are written within the same operator when the operator is written.
In some embodiments of the method for distributed computation of a deep learning model of the present invention, step S4, configuring the deep learning model based on the operators supported by the virtual processor and specifying the virtual processor for the operators used in the deep learning model, further includes: constructing the deep learning model on the TensorFlow framework, and selecting, for each layer of the deep learning model, the corresponding operators supported by the virtual processor. TensorFlow is Google's second-generation artificial intelligence learning system, developed on the basis of DistBelief; its name derives from its operating principle: Tensor means an N-dimensional array, Flow means computation based on a data-flow graph, and TensorFlow describes the process of tensors flowing from one end of the data-flow graph to the other. TensorFlow is a system that feeds complex data structures into artificial neural networks for analysis and processing. Therefore, in the embodiment of the present invention, the deep learning model is preferably constructed on the TensorFlow framework, and for each layer of the deep learning model the corresponding operators supported by the virtual processor are selected, so that subsequent computation can be performed on the virtual processor.
In one or more embodiments of the method for distributed computation of a deep learning model of the present invention, the operators supported by the virtual processor include a forward (Forward) operator and a backward (Backward) operator related to the forward operator. For example, the aforementioned "Conv2D" is a forward operator; when the "Conv2D" operator is registered and written, the backward operator related to the "Conv2D" operator should also be registered and written.
In another aspect, the present invention further proposes an apparatus for distributed computation of a deep learning model, the apparatus comprising: at least one processor; and a memory storing program instructions executable by the processor, the program instructions performing the following steps when run by the processor:
S1: registering a virtual processor in a device management list;
S2: registering and writing operators supported by the virtual processor;
S3: detecting the hardware resources associated with the virtual processor, and determining the respective allocation ratio of each hardware resource according to the computing power of the associated hardware resources;
S4: configuring a deep learning model based on the operators supported by the virtual processor, and specifying the virtual processor for the operators used in the deep learning model;
S5: the virtual processor allocating, according to the allocation ratios, the input data of the corresponding operator to the hardware resources associated with the virtual processor for computation, and merging the computation results of the hardware resources into the output of the corresponding operator.
In some embodiments of the apparatus for distributed computation of a deep learning model of the present invention, the hardware resources associated with the virtual processor include one or more of a CPU, a GPU, and an FPGA.
In several embodiments of the apparatus for distributed computation of a deep learning model of the present invention, step S2, registering and writing the operators supported by the virtual processor, further includes: writing, within the same operator, the operation instructions for the CPU, GPU, and FPGA together with the corresponding adaptation instructions.
In some embodiments of the apparatus for distributed computation of a deep learning model of the present invention, step S4, configuring the deep learning model based on the operators supported by the virtual processor and specifying the virtual processor for the operators used in the deep learning model, further includes: constructing the deep learning model on the TensorFlow framework, and selecting, for each layer of the deep learning model, the corresponding operators supported by the virtual processor.
In one or more embodiments of the apparatus for distributed computation of a deep learning model of the present invention, the operators supported by the virtual processor include a forward operator and a backward operator related to the forward operator.
The apparatuses, devices, and the like disclosed in the embodiments of the present invention may be various electronic terminal devices, such as mobile phones, personal digital assistants (PDA), tablet computers (PAD), and smart TVs, or large-scale terminal devices such as servers; therefore, the scope of protection disclosed in the embodiments of the present invention should not be limited to a specific type of apparatus or device. The client disclosed in the embodiments of the present invention may be applied to any of the above electronic terminal devices in the form of electronic hardware, computer software, or a combination of the two.
The computer-readable storage medium (e.g., memory) described herein may be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory. By way of example and not limitation, non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM), which may act as external cache memory. By way of example and not limitation, RAM is available in many forms, such as synchronous RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), and direct Rambus RAM (DRRAM). The storage devices of the disclosed aspects are intended to include, but are not limited to, these and other suitable types of memory.
With the above technical solution, the present invention has at least the following beneficial effects: it supports distributed, heterogeneously accelerated computation during the operation of a deep learning model, introduces the concept of a virtual processor, and designates the virtual processor as the computing device for the corresponding operators, distributing the computation across different hardware devices for parallel execution, thereby achieving heterogeneous acceleration of deep learning model computation.
It should be understood that, where technically feasible, the technical features listed above for different embodiments may be combined with one another to form further embodiments within the scope of the present invention. Furthermore, the specific examples and embodiments described herein are non-limiting, and corresponding modifications may be made to the structures, steps, and order set forth above without departing from the scope of protection of the present invention.
In this application, the use of disjunctive conjunctions is intended to include conjunctive ones. The use of definite or indefinite articles is not intended to indicate cardinality; in particular, a reference to "the" object or to "a"/"an" object is intended to denote one of a possible plurality of such objects. However, although elements disclosed in the embodiments of the present invention may be described or claimed in the singular, they may also be understood as plural unless expressly limited to the singular. Moreover, the conjunction "or" may be used to convey features that are present simultaneously rather than mutually exclusive alternatives; in other words, the conjunction "or" should be understood to include "and/or". The term "includes" is inclusive and has the same scope as "comprising".
The above embodiments, particularly any "preferred" embodiments, are possible examples of implementations and are set forth merely for a clear understanding of the principles of the invention. Many variations and modifications may be made to the above embodiments without departing substantially from the spirit and principles of the techniques described herein. All such modifications are intended to be included within the scope of the present disclosure.
Claims (10)
- A method for distributed computation of a deep learning model, characterized in that the method comprises the following steps: registering a virtual processor in a device management list; registering and writing operators supported by the virtual processor; detecting hardware resources associated with the virtual processor, and determining the respective allocation ratio of each hardware resource according to the computing power of the associated hardware resources; configuring a deep learning model based on the operators supported by the virtual processor, and specifying the virtual processor for the operators used in the deep learning model; and the virtual processor allocating, according to the allocation ratios, the input data of the corresponding operator to the hardware resources associated with the virtual processor for computation, and merging the computation results of the hardware resources into the output of the corresponding operator.
- The method according to claim 1, characterized in that the hardware resources associated with the virtual processor include one or more of a CPU, a GPU, and an FPGA.
- The method according to claim 2, characterized in that registering and writing the operators supported by the virtual processor further comprises: writing, within the same operator, the operation instructions for the CPU, GPU, and FPGA together with the corresponding adaptation instructions.
- The method according to claim 1, characterized in that configuring the deep learning model based on the operators supported by the virtual processor and specifying the virtual processor for the operators used in the deep learning model further comprises: constructing the deep learning model on the TensorFlow framework, and selecting, for each layer of the deep learning model, the corresponding operators supported by the virtual processor.
- The method according to claim 1, characterized in that the operators supported by the virtual processor include a forward operator and a backward operator related to the forward operator.
- An apparatus for distributed computation of a deep learning model, characterized in that the apparatus comprises: at least one processor; and a memory storing program instructions executable by the processor, the program instructions performing the following steps when run by the processor: registering a virtual processor in a device management list; registering and writing operators supported by the virtual processor; detecting hardware resources associated with the virtual processor, and determining the respective allocation ratio of each hardware resource according to the computing power of the associated hardware resources; configuring a deep learning model based on the operators supported by the virtual processor, and specifying the virtual processor for the operators used in the deep learning model; and the virtual processor allocating, according to the allocation ratios, the input data of the corresponding operator to the hardware resources associated with the virtual processor for computation, and merging the computation results of the hardware resources into the output of the corresponding operator.
- The apparatus according to claim 6, characterized in that the hardware resources associated with the virtual processor include one or more of a CPU, a GPU, and an FPGA.
- The apparatus according to claim 7, characterized in that registering and writing the operators supported by the virtual processor further comprises: writing, within the same operator, the operation instructions for the CPU, GPU, and FPGA together with the corresponding adaptation instructions.
- The apparatus according to claim 6, characterized in that configuring the deep learning model based on the operators supported by the virtual processor and specifying the virtual processor for the operators used in the deep learning model further comprises: constructing the deep learning model on the TensorFlow framework, and selecting, for each layer of the deep learning model, the corresponding operators supported by the virtual processor.
- The apparatus according to claim 6, characterized in that the operators supported by the virtual processor include a forward operator and a backward operator related to the forward operator.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911140560.6A CN110866610A (zh) | 2019-11-20 | 2019-11-20 | Method and apparatus for distributed computation of a deep learning model |
CN201911140560.6 | | | |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021098269A1 true WO2021098269A1 (zh) | 2021-05-27 |
Family
ID=69655743
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/104006 WO2021098269A1 (zh) | 2019-11-20 | 2020-07-24 | Method and apparatus for distributed computation of a deep learning model |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110866610A (zh) |
WO (1) | WO2021098269A1 (zh) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110866610A (zh) * | 2019-11-20 | 2020-03-06 | 苏州浪潮智能科技有限公司 | 一种深度学习模型分布式运算的方法及装置 |
CN113469360B (zh) * | 2020-03-31 | 2023-10-20 | 杭州海康威视数字技术股份有限公司 | 推理方法及装置 |
CN111736463B (zh) * | 2020-05-09 | 2023-03-03 | 刘炜 | 一种基于运算平台的自适应深度学习控制方法 |
CN111752716A (zh) * | 2020-06-29 | 2020-10-09 | 北京小米松果电子有限公司 | 模型使用方法、数据处理方法及装置 |
CN111858036B (zh) * | 2020-06-29 | 2022-06-10 | 浪潮电子信息产业股份有限公司 | 基于FPGA设备的TensorFlow系统加速方法、装置、设备及存储介质 |
CN112270399B (zh) * | 2020-09-29 | 2022-03-11 | 北京百度网讯科技有限公司 | 基于深度学习的算子注册处理方法、装置及电子设备 |
CN113918351B (zh) | 2021-12-08 | 2022-03-11 | 之江实验室 | 深度学习框架与ai加速卡片内分布式训练适配方法和装置 |
CN116306856B (zh) * | 2023-05-17 | 2023-09-05 | 之江实验室 | 一种基于搜索的深度学习模型部署方法及装置 |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108369537A (zh) * | 2015-12-31 | 2018-08-03 | Amazon Technologies, Inc. | FPGA-enabled compute instances |
US20180302291A1 (en) * | 2017-04-14 | 2018-10-18 | Accenture Global Solutions Limited | Comparative multi-forecasting analytics service stack for cloud computing resource allocation |
CN108805798A (zh) * | 2017-05-05 | 2018-11-13 | Intel Corporation | Fine-grained compute communication execution for deep learning frameworks |
CN110121747A (zh) * | 2016-10-28 | 2019-08-13 | Illumina, Inc. | Bioinformatics systems, apparatuses, and methods for performing secondary and/or tertiary processing |
CN110866610A (zh) * | 2019-11-20 | 2020-03-06 | Suzhou Inspur Intelligent Technology Co., Ltd. | Method and apparatus for distributed computation of a deep learning model |
- 2019-11-20: CN application CN201911140560.6A filed; published as CN110866610A (status: not active, withdrawn)
- 2020-07-24: PCT application PCT/CN2020/104006 filed; published as WO2021098269A1 (status: active, application filing)
Also Published As
Publication number | Publication date |
---|---|
CN110866610A (zh) | 2020-03-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021098269A1 (zh) | Method and apparatus for distributed computation of a deep learning model | |
WO2021098509A1 (zh) | Neural network joint compilation method and apparatus, and electronic device | |
US10114662B2 (en) | Updating processor topology information for virtual machines | |
US11609792B2 (en) | Maximizing resource utilization of neural network computing system | |
US11900113B2 (en) | Data flow processing method and related device | |
WO2019127838A1 (zh) | Convolutional neural network implementation method and apparatus, terminal, and storage medium | |
US10452538B2 (en) | Determining task scores reflective of memory access statistics in NUMA systems | |
CN109669772B (zh) | Parallel execution method and device for computation graphs |
JP2019204492A (ja) | Neuromorphic accelerator multitasking |
US20140244891A1 (en) | Providing Dynamic Topology Information in Virtualized Computing Environments | |
US11948352B2 (en) | Speculative training using partial gradients update | |
CN111105023B (zh) | Data flow reconstruction method and reconfigurable data flow processor |
US10748060B2 (en) | Pre-synaptic learning using delayed causal updates | |
US20210158131A1 (en) | Hierarchical partitioning of operators | |
US11816061B2 (en) | Dynamic allocation of arithmetic logic units for vectorized operations | |
US20200226458A1 (en) | Optimizing artificial neural network computations based on automatic determination of a batch size | |
CN112418416A (zh) | Neural network computing system, neural network computing method, and computer system |
US12086706B2 (en) | Processing sequential inputs using neural network accelerators | |
US10990525B2 (en) | Caching data in artificial neural network computations | |
US11461662B1 (en) | Compilation time reduction for memory and compute bound neural networks | |
Wei et al. | Deploying and scaling distributed parallel deep neural networks on the Tianhe-3 prototype system | |
US11354130B1 (en) | Efficient race-condition detection | |
WO2020121030A1 (en) | Caching data in artificial neural network computations | |
Du et al. | Breaking the interaction wall: A DLPU-centric deep learning computing system | |
US20230126594A1 (en) | Instruction generating method, arithmetic processing device, and instruction generating device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 20889287; Country of ref document: EP; Kind code of ref document: A1
NENP | Non-entry into the national phase | Ref country code: DE
122 | Ep: pct application non-entry in european phase | Ref document number: 20889287; Country of ref document: EP; Kind code of ref document: A1