WO2021057465A1 - Method and apparatus for parallel processing of a deep learning model - Google Patents

Method and apparatus for parallel processing of a deep learning model

Info

Publication number
WO2021057465A1
WO2021057465A1 PCT/CN2020/113982 CN2020113982W WO2021057465A1 WO 2021057465 A1 WO2021057465 A1 WO 2021057465A1 CN 2020113982 W CN2020113982 W CN 2020113982W WO 2021057465 A1 WO2021057465 A1 WO 2021057465A1
Authority
WO
WIPO (PCT)
Prior art keywords
parallel
relationship
groups
nodes
executed
Prior art date
Application number
PCT/CN2020/113982
Other languages
English (en)
French (fr)
Inventor
栗伟清
Original Assignee
中兴通讯股份有限公司
Priority date
Filing date
Publication date
Application filed by 中兴通讯股份有限公司 (ZTE Corporation)
Publication of WO2021057465A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models

Definitions

  • the present disclosure relates to, but is not limited to, the field of computer technology.
  • the deep learning model has many parameters and is trained on large-scale data, which leads to a large consumption of computing resources.
  • a single training run often takes several days or even months, which is simply intolerable for the staff who tune the parameters. It is therefore very necessary to accelerate model training, and since the improvement of the computing power of a single device is very limited, distributed training is needed.
  • the distributed training of deep learning models mainly includes two methods: data parallelism and model parallelism.
  • Data parallelism means that each node holds a copy of the complete model and takes different data to complete the forward and backward computations separately, obtain the gradients, and then update the parameters.
  • Model parallelism refers to splitting the model across different nodes for training according to certain rules.
  • the embodiments of the present disclosure provide a method for parallel processing of a deep learning model, including: determining the dependency relationships between computing nodes in the model, and dividing relationship groups according to the dependency relationships; clustering the relationship groups according to a predetermined rule to generate parallel-executable sets, wherein the relationship groups in each parallel-executable set can run in parallel; and distributing all the relationship groups in the parallel-executable sets to multiple target devices so that the total parallel operation of all the parallel-executable sets takes the shortest time.
  • the embodiments of the present disclosure provide an apparatus for parallel processing of a deep learning model, including: a memory, a processor, and a program for parallel processing of a deep learning model that is stored in the memory and runs on the processor; when the program for parallel processing of the deep learning model is executed by the processor, the steps of the method for parallel processing of the deep learning model described herein are implemented.
  • an embodiment of the present disclosure provides a computer-readable storage medium, the computer-readable storage medium stores a program for parallel processing of a deep learning model, and when the program for parallel processing of the deep learning model is executed by a processor, the steps of the method for parallel processing of the deep learning model described herein are implemented.
  • FIG. 1 is a flowchart of a method for parallel processing of a deep learning model according to Embodiment 1 of the disclosure
  • FIG. 2 is a schematic diagram of a device for parallel processing of a deep learning model according to Embodiment 2 of the disclosure
  • FIG. 3 is a schematic diagram of the computation graph of the Inception-V3 model in Example 1 of the disclosure;
  • FIG. 4 is a schematic diagram of selecting the most time-consuming relationship groups in Example 1 of the present disclosure;
  • FIG. 5 is a schematic diagram of dividing relationship groups according to name scope fields in Example 2 of the present disclosure.
  • FIG. 6 is a schematic diagram of four kinds of aggregation nodes in Example 2 of the present disclosure.
  • FIG. 7 is a schematic diagram of the serial running branch in Example 2 of the disclosure.
  • a method and device for parallel processing of a deep learning model are provided, which can automatically split the deep learning model and improve the efficiency of distributed training when the deep learning model adopts model parallelism.
  • an embodiment of the present disclosure provides a method for parallel processing of a deep learning model, including:
  • Step S110: determine the dependency relationships between the computing nodes in the model, and divide the relationship groups according to the dependency relationships;
  • Step S120: cluster the relationship groups according to predetermined rules to generate parallel-executable sets; wherein, the relationship groups in each parallel-executable set can run in parallel;
  • Step S130: distribute all the relationship groups in the parallel-executable sets to multiple target devices so that the total parallel operation of all parallel-executable sets takes the shortest time.
  • determining the dependency relationships between the computing nodes in the model, and dividing the relationship groups according to the dependency relationships, includes:
  • Method 1: divide computing nodes with the same attributes into the same relationship group;
  • Method 2: divide a computing node that has only one downstream node and no upstream node into the same relationship group as its downstream node.
  • clustering the relationship groups according to a predetermined rule to generate parallel-executable sets includes:
  • the parallel-executable set satisfies the following conditions: there is no upstream-downstream relationship within n levels between any two relationship groups in the parallel-executable set; and any two relationship groups in the parallel-executable set have a common upstream node or a common downstream node within n levels; where n is a preset value.
  • selecting the most time-consuming relationship groups includes:
  • a is the smallest integer such that the ratio of the total time consumption of the a relationship groups to the total time consumption of all relationship groups is greater than or equal to the predetermined ratio value.
  • determining the dependency relationships between computations in the model, and dividing the relationship groups according to the dependency relationships, includes:
  • the calculation nodes are divided into relationship groups according to the predetermined fields in the names of the calculation nodes, and the calculation nodes with the same predetermined fields belong to the same relationship group;
  • the predetermined field includes: a name scope field; when the name scope field includes a nesting level, the predetermined field is the outermost name scope field.
  • the clustering of the relationship groups according to a predetermined rule to generate a set that can be executed in parallel includes:
  • a serial running branch is a collection of relationship groups having an upstream-downstream relationship between two sink nodes.
  • the simulated annealing algorithm is used to distribute all the relational groups in the set that can be executed in parallel to multiple target devices, so that the total parallel operation time of all sets that can be executed in parallel is the shortest.
  • using the simulated annealing algorithm to distribute all the relationship groups in the parallel-executable sets to multiple target devices so that the total parallel operation time of all parallel-executable sets is the shortest includes:
  • Step 1, initialization: set the initial temperature T0, the termination temperature Tmin, the number of iterations K within each temperature, the cooling rate α, and the perturbation ratio μ used each time the solution is updated; at the initial temperature T0, randomly generate an initial solution X0 and compute the initial time E0; wherein the initial solution X0 refers to randomly assigning all relationship groups to the target devices according to an initial allocation, and the initial time E0 refers to the total time taken to perform the model computation under the initial allocation;
  • Step 4: determine whether the updated current temperature T is less than the termination temperature Tmin; if so, take the current allocation as the final solution and end; otherwise, jump back to Step 2 and continue.
  • distributing all the relationship groups in the parallel-executable sets to multiple target devices so that the total parallel operation time of all parallel-executable sets is the shortest includes:
  • the simulated annealing algorithm is used to distribute all the serial running branches in the set that can be executed in parallel to multiple target devices, so that the total parallel operation time of all sets that can be executed in parallel is the shortest.
  • using the simulated annealing algorithm to distribute all the serial running branches in the parallel-executable sets to multiple target devices so that the total parallel operation time of all parallel-executable sets is the shortest includes:
  • Step 1, initialization: set the initial temperature T0, the termination temperature Tmin, the number of iterations K within each temperature, the cooling rate α, and the perturbation ratio μ used each time the solution is updated; at the initial temperature T0, randomly generate an initial solution X0 and compute the initial time E0; wherein the initial solution X0 refers to randomly assigning all serial running branches to the target devices according to an initial allocation, and the initial time E0 refers to the total time taken to execute the model operations under the initial allocation;
  • Step 4: determine whether the updated current temperature T is less than the termination temperature Tmin; if so, take the current allocation as the final solution and end; otherwise, jump back to Step 2 and continue.
  • an embodiment of the present disclosure provides an apparatus for parallel processing of a deep learning model, including:
  • the relationship group dividing module 201 is configured to determine the dependency relationship between the computing nodes in the model, and divide the relationship group according to the dependency relationship;
  • the set division module 202 is configured to cluster the relationship groups according to predetermined rules to generate a parallel executable set; wherein, the relationship groups in each parallel executable set can run in parallel;
  • the device allocation module 203 is configured to allocate all the relational groups in the set that can be executed in parallel to multiple target devices so that the total parallel operation of all sets that can be executed in parallel takes the shortest time.
  • the relationship group dividing module 201 is configured to determine the dependency relationship between the computing nodes in the model in the following manner, and divide the relationship group according to the dependency relationship:
  • Method 1: divide computing nodes with the same attributes into the same relationship group;
  • Method 2: divide a computing node that has only one downstream node and no upstream node into the same relationship group as its downstream node.
  • the set division module 202 is configured to cluster the relationship groups according to predetermined rules in the following manner to generate sets that can be executed in parallel:
  • the parallel-executable set satisfies the following conditions: there is no upstream-downstream relationship within n levels between any two relationship groups in the parallel-executable set; and any two relationship groups in the parallel-executable set have a common upstream node or a common downstream node within n levels; where n is a preset value.
  • the set division module 202 is configured to select the most time-consuming relationship groups in the following manner:
  • a is the smallest integer such that the ratio of the total time consumption of the a relationship groups to the total time consumption of all relationship groups is greater than or equal to the predetermined ratio value.
  • the relationship group dividing module 201 is configured to determine the dependency relationship between calculations in the model in the following manner, and divide the relationship group according to the dependency relationship:
  • the calculation nodes are divided into relationship groups according to the predetermined fields in the names of the calculation nodes, and the calculation nodes with the same predetermined fields belong to the same relationship group;
  • the predetermined field includes: a name scope field; when the name scope field includes a nesting level, the predetermined field is the outermost name scope field.
  • the set division module 202 is configured to cluster the relationship groups according to predetermined rules in the following manner to generate a set that can be executed in parallel:
  • a serial running branch is a collection of relationship groups having an upstream-downstream relationship between two sink nodes.
  • the device allocation module 203 is configured to distribute, in the following manner, all the relationship groups in the parallel-executable sets to multiple target devices so that the total parallel operation time of all the parallel-executable sets is the shortest:
  • the simulated annealing algorithm is used to distribute all the relational groups in the set that can be executed in parallel to multiple target devices, so that the total parallel operation time of all sets that can be executed in parallel is the shortest.
  • the device allocation module 203 is configured to use the simulated annealing algorithm, in the following manner, to distribute all the relationship groups in the parallel-executable sets to multiple target devices so that the total parallel operation time of all parallel-executable sets is the shortest:
  • Step 1, initialization: set the initial temperature T0, the termination temperature Tmin, the number of iterations K within each temperature, the cooling rate α, and the perturbation ratio μ used each time the solution is updated; at the initial temperature T0, randomly generate an initial solution X0 and compute the initial time E0; wherein the initial solution X0 refers to randomly assigning all relationship groups to the target devices according to an initial allocation, and the initial time E0 refers to the total time taken to perform the model computation under the initial allocation;
  • Step 4: determine whether the updated current temperature T is less than the termination temperature Tmin; if so, take the current allocation as the final solution and end; otherwise, jump back to Step 2 and continue.
  • the device allocation module 203 is configured to distribute, in the following manner, all the relationship groups in the parallel-executable sets to multiple target devices so that the total parallel operation of all parallel-executable sets takes the shortest time:
  • the simulated annealing algorithm is used to distribute all the serial running branches in the set that can be executed in parallel to multiple target devices, so that the total parallel operation time of all sets that can be executed in parallel is the shortest.
  • the device allocation module 203 is configured to use the simulated annealing algorithm, in the following manner, to distribute all the serial running branches in the parallel-executable sets to multiple target devices so that the total parallel operation time of all parallel-executable sets is the shortest:
  • Step 1, initialization: set the initial temperature T0, the termination temperature Tmin, the number of iterations K within each temperature, the cooling rate α, and the perturbation ratio μ used each time the solution is updated; at the initial temperature T0, randomly generate an initial solution X0 and compute the initial time E0; wherein the initial solution X0 refers to randomly assigning all serial running branches to the target devices according to an initial allocation, and the initial time E0 refers to the total time taken to execute the model operations under the initial allocation;
  • Step 4: determine whether the updated current temperature T is less than the termination temperature Tmin; if so, take the current allocation as the final solution and end; otherwise, jump back to Step 2 and continue.
  • the embodiment of the present disclosure provides a device for parallel processing of a deep learning model, including:
  • a memory, a processor, and a program for parallel processing of a deep learning model that is stored in the memory and can run on the processor; when the program for parallel processing of the deep learning model is executed by the processor, the steps of the method for parallel processing of the deep learning model described in Embodiment 1 above are implemented.
  • the embodiment of the present disclosure provides a computer-readable storage medium that stores a program for parallel processing of a deep learning model; when the program for parallel processing of the deep learning model is executed by a processor, the steps of the method for parallel processing of the deep learning model described in Embodiment 1 above are implemented.
  • each deep learning model corresponds to a calculation graph, also called a data flow graph, which uses a directed graph composed of nodes and edges to describe mathematical operations.
  • Each calculation (OP) is a node on the calculation graph, the edges between nodes describe the dependencies between calculations, and data (Tensor) flows along the edges between nodes.
  • the calculation graph of a complex deep learning model often contains thousands of OPs. For example, this example uses the calculation graph of Inception-V3. The calculation graph contains more than 30,000 OPs.
  • model parallelism consists in grouping these OPs and placing them on different devices for training.
  • the method for parallel processing of the deep learning model can include the following steps:
  • the three relationship groups represented by the dotted lines in the figure are the three most time-consuming relationship groups. Among them, relationship groups a and b have an upstream-downstream relationship within two levels, so they cannot be placed in the same parallel-executable set, while the above conditions are fully satisfied between a and c, so they can be placed in the same parallel-executable set.
  • the simulated annealing algorithm is used to distribute all the relational groups in the set that can be executed in parallel to multiple target devices, so that the time-consuming parallel operation is the shortest.
  • each deep learning model corresponds to a calculation graph, also called a data flow graph, which uses a directed graph composed of nodes and edges to describe mathematical operations.
  • Each calculation (OP) is a node on the calculation graph, the edges between nodes describe the dependencies between calculations, and data (Tensor) flows along the edges between nodes.
  • the calculation graph of a complex deep learning model often contains thousands of OPs. For example, this example uses the calculation graph of Inception-V3. The calculation graph contains more than 30,000 OPs.
  • model parallelism consists in grouping these OPs and placing them on different devices for training.
  • the method for parallel processing of the deep learning model can include the following steps:
  • the OPs are divided into relationship groups; OPs with the same name scope belong to the same relationship group, and when the name scope includes nesting levels, the division is made by the outermost name scope. As shown in FIG. 5, there are three OPs whose name scope field is "a" and another four OPs whose name scope field is "b"; they are respectively divided into the first relationship group enclosed by the dashed box on the left and the second relationship group enclosed by the dashed box on the right.
  • Figure 6 shows four types of sink nodes, from left to right: single input single output, single input multiple output, multiple input single output, multiple input multiple output.
  • the serial running branch is a set of relationship groups having an upstream and downstream relationship between two sink nodes
  • the simulated annealing algorithm is used to distribute all the serial running branches in the parallel executable set to multiple target devices so that the time-consuming parallel operation is the shortest.
  • Such software may be distributed on a computer-readable medium
  • the computer-readable medium may include a computer storage medium (or non-transitory medium) and a communication medium (or transitory medium).
  • the term computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storing information (such as computer-readable instructions, data structures, program modules, or other data).
  • Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technologies, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tapes, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer.
  • communication media usually contain computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transmission mechanism, and may include any information delivery media.
  • the embodiments of the present disclosure provide a method and apparatus for parallel processing of a deep learning model: the dependency relationships between computing nodes in the model are determined, and relationship groups are divided according to the dependency relationships; the relationship groups are clustered according to a predetermined rule to generate parallel-executable sets, wherein the relationship groups in each parallel-executable set can run in parallel; and all the relationship groups in the parallel-executable sets are distributed to multiple target devices so that the total parallel operation of all the parallel-executable sets takes the shortest time.
  • the embodiments of the present disclosure can automatically split the deep learning model, and improve the distributed training efficiency when the deep learning model adopts model parallelism.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A method and apparatus for parallel processing of a deep learning model. The method for parallel processing of a deep learning model includes: determining dependency relationships between computing nodes in the model, and dividing relationship groups according to the dependency relationships (S110); clustering the relationship groups according to a predetermined rule to generate parallel-executable sets, where the relationship groups within each parallel-executable set can run in parallel (S120); and distributing the relationship groups of all parallel-executable sets onto multiple target devices such that the total time consumed by the parallel operation of all parallel-executable sets is the shortest (S130).

Description

Method and Apparatus for Parallel Processing of a Deep Learning Model
Technical Field
The present disclosure relates to, but is not limited to, the field of computer technology.
Background
Deep learning models have a large number of parameters and are trained on large-scale data, which consumes a great deal of computing resources; a single training run often takes days or even months, which is simply intolerable for the staff who tune the parameters. It is therefore very necessary to accelerate model training, and since the computing power of a single device can only be improved to a limited extent, distributed training is needed.
At present, distributed training of deep learning models is mainly carried out in two ways: data parallelism and model parallelism. Data parallelism means that each node holds a complete copy of the model, takes different data, independently completes the forward and backward computations to obtain the gradients, and then updates the parameters. Model parallelism means splitting the model across different nodes for training according to certain rules.
In the related art, when model parallelism is used, the splitting of the model is usually done manually. Manual splitting is time-consuming and labor-intensive, and if the split is unreasonable, then once the communication overhead between nodes is added, model parallelism may not provide any acceleration at all.
Summary
In a first aspect, an embodiment of the present disclosure provides a method for parallel processing of a deep learning model, including: determining dependency relationships between computing nodes in the model, and dividing relationship groups according to the dependency relationships; clustering the relationship groups according to a predetermined rule to generate parallel-executable sets, wherein the relationship groups within each parallel-executable set can run in parallel; and distributing the relationship groups of all parallel-executable sets onto multiple target devices such that the total time consumed by the parallel operation of all parallel-executable sets is the shortest.
In a second aspect, an embodiment of the present disclosure provides an apparatus for parallel processing of a deep learning model, including a memory, a processor, and a program for parallel processing of a deep learning model that is stored in the memory and executable on the processor, wherein the program for parallel processing of the deep learning model, when executed by the processor, implements the steps of the method for parallel processing of a deep learning model described herein.
In a third aspect, an embodiment of the present disclosure provides a computer-readable storage medium storing a program for parallel processing of a deep learning model, wherein the program for parallel processing of the deep learning model, when executed by a processor, implements the steps of the method for parallel processing of a deep learning model described herein.
Brief Description of the Drawings
FIG. 1 is a flowchart of a method for parallel processing of a deep learning model according to Embodiment 1 of the present disclosure;
FIG. 2 is a schematic diagram of an apparatus for parallel processing of a deep learning model according to Embodiment 2 of the present disclosure;
FIG. 3 is a schematic diagram of the computation graph of the Inception-V3 model in Example 1 of the present disclosure;
FIG. 4 is a schematic diagram of selecting the most time-consuming relationship groups in Example 1 of the present disclosure;
FIG. 5 is a schematic diagram of dividing relationship groups according to the name scope field in Example 2 of the present disclosure;
FIG. 6 is a schematic diagram of four kinds of sink nodes in Example 2 of the present disclosure;
FIG. 7 is a schematic diagram of serial running branches in Example 2 of the present disclosure.
Detailed Description
To make the objects, features, and advantages of the present disclosure clearer, embodiments of the present disclosure are described in detail below with reference to the accompanying drawings. It should be noted that, provided there is no conflict, the embodiments of the present application and the features in the embodiments may be combined with one another arbitrarily.
The steps shown in the flowcharts of the drawings may be executed in a computer system, for example as a set of computer-executable instructions. Moreover, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in an order different from the order given here.
According to the embodiments of the present disclosure, a method and an apparatus for parallel processing of a deep learning model are provided, which can split a deep learning model automatically and improve the efficiency of distributed training when model parallelism is used for the deep learning model.
Embodiment 1
As shown in FIG. 1, an embodiment of the present disclosure provides a method for parallel processing of a deep learning model, including:
Step S110: determining dependency relationships between computing nodes in the model, and dividing relationship groups according to the dependency relationships;
Step S120: clustering the relationship groups according to a predetermined rule to generate parallel-executable sets, wherein the relationship groups within each parallel-executable set can run in parallel;
Step S130: distributing the relationship groups of all parallel-executable sets onto multiple target devices such that the total time consumed by the parallel operation of all parallel-executable sets is the shortest.
In one implementation, determining the dependency relationships between the computing nodes in the model and dividing the relationship groups according to the dependency relationships includes:
determining the upstream nodes and downstream nodes of each computing node, as well as the attributes of that computing node; and
dividing the computing nodes into relationship groups in at least one of the following ways (a grouping sketch is given after this list):
Method 1: dividing computing nodes with the same attributes into the same relationship group;
Method 2: dividing a computing node that has only one downstream node and no upstream node into the same relationship group as its downstream node.
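By way of illustration, the following is a minimal Python sketch of the grouping step referenced above. It assumes a hypothetical Node record carrying a name, an attribute key (for example a colocation tag), and upstream/downstream name lists; it illustrates the two rules rather than a particular framework's implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Node:
    name: str
    attr: str = ""                      # e.g. a colocation tag; hypothetical field
    upstream: List[str] = field(default_factory=list)
    downstream: List[str] = field(default_factory=list)

def divide_relationship_groups(nodes: Dict[str, Node]) -> Dict[str, int]:
    """Return a mapping from node name to relationship-group id."""
    group_of: Dict[str, int] = {}
    next_id = 0

    # Method 1: nodes sharing the same (non-empty) attribute go into one group.
    attr_groups: Dict[str, int] = {}
    for n in nodes.values():
        if n.attr:
            if n.attr not in attr_groups:
                attr_groups[n.attr] = next_id
                next_id += 1
            group_of[n.name] = attr_groups[n.attr]

    # Remaining nodes each start in their own group.
    for n in nodes.values():
        if n.name not in group_of:
            group_of[n.name] = next_id
            next_id += 1

    # Method 2: a node with exactly one downstream node and no upstream node
    # is merged into the group of its downstream node.
    for n in nodes.values():
        if not n.upstream and len(n.downstream) == 1:
            group_of[n.name] = group_of[n.downstream[0]]

    return group_of
```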
In one implementation, clustering the relationship groups according to the predetermined rule to generate the parallel-executable sets includes:
measuring the time each relationship group takes to run on a single device, and sorting all relationship groups from the most time-consuming to the least;
selecting the most time-consuming relationship groups, and searching among the selected relationship groups for parallel-executable sets;
wherein a parallel-executable set satisfies the following conditions: there is no upstream-downstream relationship within n levels between any two relationship groups in the parallel-executable set; and any two relationship groups in the parallel-executable set have a common upstream node or a common downstream node within n levels, where n is a preset value (a sketch of this compatibility check is given below).
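Below is a minimal sketch of the compatibility check referenced above. As a simplification, it operates at the level of relationship groups and assumes hypothetical adjacency maps group_upstream / group_downstream between groups; two bounded breadth-first searches implement the two n-level conditions.

```python
from collections import deque
from typing import Dict, Set

def _within_n_levels(start: int, adj: Dict[int, Set[int]], n: int) -> Set[int]:
    """All groups reachable from `start` in at most n hops along `adj`."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        g, depth = frontier.popleft()
        if depth == n:
            continue
        for nxt in adj.get(g, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    seen.discard(start)
    return seen

def can_share_parallel_set(a: int, b: int,
                           group_upstream: Dict[int, Set[int]],
                           group_downstream: Dict[int, Set[int]],
                           n: int) -> bool:
    up_a = _within_n_levels(a, group_upstream, n)
    down_a = _within_n_levels(a, group_downstream, n)
    up_b = _within_n_levels(b, group_upstream, n)
    down_b = _within_n_levels(b, group_downstream, n)
    # Condition 1: no upstream-downstream relationship within n levels.
    if b in up_a or b in down_a or a in up_b or a in down_b:
        return False
    # Condition 2: a common upstream or common downstream within n levels.
    return bool((up_a & up_b) or (down_a & down_b))
```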
In one implementation, selecting the most time-consuming relationship groups includes:
selecting the a most time-consuming relationship groups;
wherein a is the smallest integer such that the ratio of the total time consumption of the a relationship groups to the total time consumption of all relationship groups is greater than or equal to a predetermined ratio value (see the short sketch below).
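A minimal sketch of this selection rule, assuming the per-group times have already been measured on a single device; the 0.8 threshold in the usage line is only an illustrative value, not one taken from the disclosure.

```python
from typing import Dict, List

def select_top_groups(time_of: Dict[int, float], ratio: float) -> List[int]:
    """Return the smallest prefix of groups (sorted by time, descending) whose
    cumulative time is at least `ratio` of the total time of all groups."""
    ordered = sorted(time_of, key=time_of.get, reverse=True)
    total = sum(time_of.values())
    picked, acc = [], 0.0
    for g in ordered:
        picked.append(g)
        acc += time_of[g]
        if acc >= ratio * total:
            break
    return picked

# Example: keep the groups covering at least 80% of the measured time.
top = select_top_groups({0: 5.0, 1: 3.0, 2: 1.0, 3: 1.0}, ratio=0.8)
```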
In one implementation, determining the dependency relationships between the computations in the model and dividing the relationship groups according to the dependency relationships includes:
dividing the computing nodes into relationship groups according to a predetermined field in the names of the computing nodes, wherein computing nodes with the same predetermined field belong to the same relationship group;
wherein the predetermined field includes a name scope field, and when the name scope field includes nesting levels, the predetermined field is the outermost name scope field (a sketch of this name-based grouping is given below).
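A minimal sketch of the name-based grouping, under the assumption of TensorFlow-style op names in which name scopes are separated by "/" (for example "a/b/conv1"); only the outermost scope is used as the group key.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

def outermost_scope(op_name: str) -> str:
    """'a/b/conv1' -> 'a'; an unscoped name maps to itself."""
    return op_name.split("/", 1)[0]

def group_by_name_scope(op_names: Iterable[str]) -> Dict[str, List[str]]:
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in op_names:
        groups[outermost_scope(name)].append(name)
    return dict(groups)

# Example: ops under scope "a" and scope "b" fall into two relationship groups.
print(group_by_name_scope(["a/conv1", "a/relu", "b/pool", "b/fc/weights"]))
```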
In one implementation, clustering the relationship groups according to the predetermined rule to generate the parallel-executable sets includes:
traversing all relationship groups, and searching for sink nodes that have multiple inputs or multiple outputs;
starting from a sink node with multiple inputs, traversing upstream through all input nodes of that sink node until another sink node is encountered, and generating one parallel-executable set from all serial running branches between the two sink nodes; or
starting from a sink node with multiple outputs, traversing downstream through all output nodes of that sink node until another sink node is encountered, and generating one parallel-executable set from all serial running branches between the two sink nodes;
wherein a serial running branch is a collection of relationship groups that have an upstream-downstream relationship between two sink nodes (a traversal sketch is given below).
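The following sketch illustrates the downstream variant of this traversal on the relationship-group graph. It assumes hypothetical maps inputs_of / outputs_of giving, for each group, its predecessor and successor groups; each chain followed from a multi-output sink node to the next sink node is collected as one serial running branch, and the branches together form one parallel-executable set.

```python
from typing import Dict, List, Set

def is_sink(g: int, inputs_of: Dict[int, Set[int]], outputs_of: Dict[int, Set[int]]) -> bool:
    """A sink node here is a group with multiple inputs or multiple outputs."""
    return len(inputs_of.get(g, ())) > 1 or len(outputs_of.get(g, ())) > 1

def serial_branches_from(start: int,
                         inputs_of: Dict[int, Set[int]],
                         outputs_of: Dict[int, Set[int]]) -> List[List[int]]:
    """From a multi-output sink node, walk each outgoing chain downstream until
    the next sink node; every chain is one serial running branch."""
    branches = []
    for first in outputs_of.get(start, ()):
        branch, cur = [], first
        # A non-sink group has at most one output by definition, so the chain
        # can be followed along its single successor.
        while not is_sink(cur, inputs_of, outputs_of):
            branch.append(cur)
            nxt = outputs_of.get(cur, set())
            if not nxt:          # reached a graph output without meeting a sink
                break
            cur = next(iter(nxt))
        if branch:
            branches.append(branch)
    return branches
```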
In one implementation, for the relationship-group clustering described above, in which the relationship groups are divided according to Method 1 and/or Method 2 and sorted by their time consumption on a single device, distributing the relationship groups of all parallel-executable sets onto multiple target devices such that the total time consumed by the parallel operation of all parallel-executable sets is the shortest includes:
using a simulated annealing algorithm to distribute the relationship groups of all parallel-executable sets onto multiple target devices such that the total time consumed by the parallel operation of all parallel-executable sets is the shortest.
In one implementation, for the relationship-group clustering described above, in which the relationship groups are divided according to Method 1 and/or Method 2 and sorted by their time consumption on a single device, using the simulated annealing algorithm to distribute the relationship groups of all parallel-executable sets onto multiple target devices such that the total time consumed by the parallel operation of all parallel-executable sets is the shortest includes the following steps (a sketch of this annealing loop is given after Step 4):
Step 1, initialization: setting the initial temperature T0, the termination temperature Tmin, the number of iterations K within each temperature, the cooling rate α, and the perturbation ratio μ used each time the solution is updated; at the initial temperature T0, randomly generating an initial solution X0 and computing the initial time E0, wherein the initial solution X0 refers to randomly assigning all relationship groups to the target devices according to an initial allocation, and the initial time E0 refers to the total time taken to execute the model operations under the initial allocation;
Step 2: at the current temperature T, performing K perturbation-and-acceptance rounds, wherein each round includes: at the current temperature T, randomly selecting relationship groups from the current solution X according to the perturbation ratio μ, randomly reassigning the selected relationship groups to target devices, and computing the total time Enew taken to execute the model operations under the new allocation; if Enew is less than E0, accepting the new allocation; if Enew is greater than or equal to E0, accepting the new allocation with probability p, where p = exp(-(Enew - E0)/T);
Step 3: updating the current temperature T and the perturbation ratio μ: T = αT, μ = αμ;
Step 4: determining whether the updated current temperature T is less than the termination temperature Tmin; if so, taking the current allocation as the final solution and ending; otherwise, jumping back to Step 2 and continuing.
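Below is a minimal, self-contained sketch of the annealing loop described in Steps 1 to 4 above. The cost function is an assumption made for illustration: it approximates the total runtime as the busiest device's summed group times, whereas the disclosure measures the actual total time of a model run under each allocation; all parameter values in the usage line are illustrative only.

```python
import math
import random
from typing import Dict, List

def estimate_cost(assign: Dict[int, int], time_of: Dict[int, float], n_dev: int) -> float:
    """Simplified cost: the busiest device bounds the parallel makespan (assumption)."""
    load = [0.0] * n_dev
    for group, dev in assign.items():
        load[dev] += time_of[group]
    return max(load)

def simulated_annealing(time_of: Dict[int, float], n_dev: int,
                        t0: float = 100.0, t_min: float = 1e-3,
                        k: int = 50, alpha: float = 0.95, mu: float = 0.3,
                        seed: int = 0) -> Dict[int, int]:
    rng = random.Random(seed)
    groups: List[int] = list(time_of)
    # Step 1: random initial solution X0 and its cost E0.
    assign = {g: rng.randrange(n_dev) for g in groups}
    e0 = estimate_cost(assign, time_of, n_dev)
    t = t0
    while t >= t_min:                      # Step 4: stop once T drops below Tmin.
        for _ in range(k):                 # Step 2: K perturbation/acceptance rounds.
            candidate = dict(assign)
            picked = rng.sample(groups, max(1, int(mu * len(groups))))
            for g in picked:
                candidate[g] = rng.randrange(n_dev)
            e_new = estimate_cost(candidate, time_of, n_dev)
            if e_new < e0 or rng.random() < math.exp(-(e_new - e0) / t):
                assign, e0 = candidate, e_new
        t *= alpha                         # Step 3: cool down and shrink the perturbation.
        mu *= alpha
    return assign

# Illustrative usage: place four relationship groups on two devices.
best = simulated_annealing({0: 5.0, 1: 3.0, 2: 2.0, 3: 2.0}, n_dev=2)
```

The same loop applies unchanged to the serial-running-branch variant described later: the keys of time_of then index serial running branches instead of relationship groups.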
In one implementation, for the relationship-group clustering described above, in which the relationship groups are divided according to the predetermined field in the node names and clustered according to sink nodes, distributing the relationship groups of all parallel-executable sets onto multiple target devices such that the total time consumed by the parallel operation of all parallel-executable sets is the shortest includes:
using a simulated annealing algorithm to distribute the serial running branches of all parallel-executable sets onto multiple target devices such that the total time consumed by the parallel operation of all parallel-executable sets is the shortest.
In one implementation, for the relationship-group clustering described above, in which the relationship groups are divided according to the predetermined field in the node names and clustered according to sink nodes, using the simulated annealing algorithm to distribute the serial running branches of all parallel-executable sets onto multiple target devices such that the total time consumed by the parallel operation of all parallel-executable sets is the shortest includes:
Step 1, initialization: setting the initial temperature T0, the termination temperature Tmin, the number of iterations K within each temperature, the cooling rate α, and the perturbation ratio μ used each time the solution is updated; at the initial temperature T0, randomly generating an initial solution X0 and computing the initial time E0, wherein the initial solution X0 refers to randomly assigning all serial running branches to the target devices according to an initial allocation, and the initial time E0 refers to the total time taken to execute the model operations under the initial allocation;
Step 2: at the current temperature T, performing K perturbation-and-acceptance rounds, wherein each round includes: at the current temperature T, randomly selecting serial running branches from the current solution X according to the perturbation ratio μ, randomly reassigning the selected serial running branches to target devices, and computing the total time Enew taken to execute the model operations under the new allocation; if Enew is less than E0, accepting the new allocation; if Enew is greater than or equal to E0, accepting the new allocation with probability p, where p = exp(-(Enew - E0)/T);
Step 3: updating the current temperature T and the perturbation ratio μ: T = αT, μ = αμ;
Step 4: determining whether the updated current temperature T is less than the termination temperature Tmin; if so, taking the current allocation as the final solution and ending; otherwise, jumping back to Step 2 and continuing.
Embodiment 2
As shown in FIG. 2, an embodiment of the present disclosure provides an apparatus for parallel processing of a deep learning model, including:
a relationship group dividing module 201, configured to determine dependency relationships between computing nodes in the model and divide the relationship groups according to the dependency relationships;
a set dividing module 202, configured to cluster the relationship groups according to a predetermined rule to generate parallel-executable sets, wherein the relationship groups within each parallel-executable set can run in parallel; and
a device allocation module 203, configured to distribute the relationship groups of all parallel-executable sets onto multiple target devices such that the total time consumed by the parallel operation of all parallel-executable sets is the shortest.
In one implementation, the relationship group dividing module 201 is configured to determine the dependency relationships between the computing nodes in the model and divide the relationship groups according to the dependency relationships in the following manner:
determining the upstream nodes and downstream nodes of each computing node, as well as the attributes of that computing node; and
dividing the computing nodes into relationship groups in at least one of the following ways:
Method 1: dividing computing nodes with the same attributes into the same relationship group;
Method 2: dividing a computing node that has only one downstream node and no upstream node into the same relationship group as its downstream node.
In one implementation, the set dividing module 202 is configured to cluster the relationship groups according to the predetermined rule to generate the parallel-executable sets in the following manner:
measuring the time each relationship group takes to run on a single device, and sorting all relationship groups from the most time-consuming to the least;
selecting the most time-consuming relationship groups, and searching among the selected relationship groups for parallel-executable sets;
wherein a parallel-executable set satisfies the following conditions: there is no upstream-downstream relationship within n levels between any two relationship groups in the parallel-executable set; and any two relationship groups in the parallel-executable set have a common upstream node or a common downstream node within n levels, where n is a preset value.
In one implementation, the set dividing module 202 is configured to select the most time-consuming relationship groups in the following manner:
selecting the a most time-consuming relationship groups;
wherein a is the smallest integer such that the ratio of the total time consumption of the a relationship groups to the total time consumption of all relationship groups is greater than or equal to a predetermined ratio value.
In one implementation, the relationship group dividing module 201 is configured to determine the dependency relationships between the computations in the model and divide the relationship groups according to the dependency relationships in the following manner:
dividing the computing nodes into relationship groups according to a predetermined field in the names of the computing nodes, wherein computing nodes with the same predetermined field belong to the same relationship group;
wherein the predetermined field includes a name scope field, and when the name scope field includes nesting levels, the predetermined field is the outermost name scope field.
In one implementation, the set dividing module 202 is configured to cluster the relationship groups according to the predetermined rule to generate the parallel-executable sets in the following manner:
traversing all relationship groups, and searching for sink nodes that have multiple inputs or multiple outputs;
starting from a sink node with multiple inputs, traversing upstream through all input nodes of that sink node until another sink node is encountered, and generating one parallel-executable set from all serial running branches between the two sink nodes; or
starting from a sink node with multiple outputs, traversing downstream through all output nodes of that sink node until another sink node is encountered, and generating one parallel-executable set from all serial running branches between the two sink nodes;
wherein a serial running branch is a collection of relationship groups that have an upstream-downstream relationship between two sink nodes.
In one implementation, for the relationship-group clustering described above, in which the relationship groups are divided according to Method 1 and/or Method 2 and sorted by their time consumption on a single device, the device allocation module 203 is configured to distribute, in the following manner, the relationship groups of all parallel-executable sets onto multiple target devices such that the total time consumed by the parallel operation of all parallel-executable sets is the shortest:
using a simulated annealing algorithm to distribute the relationship groups of all parallel-executable sets onto multiple target devices such that the total time consumed by the parallel operation of all parallel-executable sets is the shortest.
In one implementation, for the relationship-group clustering described above, in which the relationship groups are divided according to Method 1 and/or Method 2 and sorted by their time consumption on a single device, the device allocation module 203 is configured to use the simulated annealing algorithm, in the following manner, to distribute the relationship groups of all parallel-executable sets onto multiple target devices such that the total time consumed by the parallel operation of all parallel-executable sets is the shortest:
Step 1, initialization: setting the initial temperature T0, the termination temperature Tmin, the number of iterations K within each temperature, the cooling rate α, and the perturbation ratio μ used each time the solution is updated; at the initial temperature T0, randomly generating an initial solution X0 and computing the initial time E0, wherein the initial solution X0 refers to randomly assigning all relationship groups to the target devices according to an initial allocation, and the initial time E0 refers to the total time taken to execute the model operations under the initial allocation;
Step 2: at the current temperature T, performing K perturbation-and-acceptance rounds, wherein each round includes: at the current temperature T, randomly selecting relationship groups from the current solution X according to the perturbation ratio μ, randomly reassigning the selected relationship groups to target devices, and computing the total time Enew taken to execute the model operations under the new allocation; if Enew is less than E0, accepting the new allocation; if Enew is greater than or equal to E0, accepting the new allocation with probability p, where p = exp(-(Enew - E0)/T);
Step 3: updating the current temperature T and the perturbation ratio μ: T = αT, μ = αμ;
Step 4: determining whether the updated current temperature T is less than the termination temperature Tmin; if so, taking the current allocation as the final solution and ending; otherwise, jumping back to Step 2 and continuing.
In one implementation, for the relationship-group clustering described above, in which the relationship groups are divided according to the predetermined field in the node names and clustered according to sink nodes, the device allocation module 203 is configured to distribute, in the following manner, the relationship groups of all parallel-executable sets onto multiple target devices such that the total time consumed by the parallel operation of all parallel-executable sets is the shortest:
using a simulated annealing algorithm to distribute the serial running branches of all parallel-executable sets onto multiple target devices such that the total time consumed by the parallel operation of all parallel-executable sets is the shortest.
In one implementation, for the relationship-group clustering described above, in which the relationship groups are divided according to the predetermined field in the node names and clustered according to sink nodes, the device allocation module 203 is configured to use the simulated annealing algorithm, in the following manner, to distribute the serial running branches of all parallel-executable sets onto multiple target devices such that the total time consumed by the parallel operation of all parallel-executable sets is the shortest:
Step 1, initialization: setting the initial temperature T0, the termination temperature Tmin, the number of iterations K within each temperature, the cooling rate α, and the perturbation ratio μ used each time the solution is updated; at the initial temperature T0, randomly generating an initial solution X0 and computing the initial time E0, wherein the initial solution X0 refers to randomly assigning all serial running branches to the target devices according to an initial allocation, and the initial time E0 refers to the total time taken to execute the model operations under the initial allocation;
Step 2: at the current temperature T, performing K perturbation-and-acceptance rounds, wherein each round includes: at the current temperature T, randomly selecting serial running branches from the current solution X according to the perturbation ratio μ, randomly reassigning the selected serial running branches to target devices, and computing the total time Enew taken to execute the model operations under the new allocation; if Enew is less than E0, accepting the new allocation; if Enew is greater than or equal to E0, accepting the new allocation with probability p, where p = exp(-(Enew - E0)/T);
Step 3: updating the current temperature T and the perturbation ratio μ: T = αT, μ = αμ;
Step 4: determining whether the updated current temperature T is less than the termination temperature Tmin; if so, taking the current allocation as the final solution and ending; otherwise, jumping back to Step 2 and continuing.
Embodiment 3
An embodiment of the present disclosure provides an apparatus for parallel processing of a deep learning model, including:
a memory, a processor, and a program for parallel processing of a deep learning model that is stored in the memory and executable on the processor, wherein the program for parallel processing of the deep learning model, when executed by the processor, implements the steps of the method for parallel processing of a deep learning model described in Embodiment 1 above.
Embodiment 4
An embodiment of the present disclosure provides a computer-readable storage medium storing a program for parallel processing of a deep learning model, wherein the program for parallel processing of the deep learning model, when executed by a processor, implements the steps of the method for parallel processing of a deep learning model described in Embodiment 1 above.
Example 1
This example provides a method for parallel processing of a deep learning model. In TensorFlow, every deep learning model corresponds to a computation graph, also called a data flow graph, which describes mathematical operations with a directed graph composed of nodes and edges: every computation (OP) is a node on the computation graph, the edges between nodes describe the dependencies between computations, and data (Tensor) flows between nodes along the edges. The computation graph of a complex deep learning model often contains thousands of OPs; for example, this example uses the computation graph of Inception-V3, which contains more than 30,000 OPs. Model parallelism consists in grouping these OPs and then placing them on different devices for training.
In this example, the method for parallel processing of the deep learning model may include the following steps (a sketch of how the inputs for steps 1) to 3) might be collected is given after the list):
1) Determine the upstream OPs and downstream OPs of each OP; FIG. 3 shows part of the model's computation graph.
2) Divide the relationship groups, that is, put closely related OPs into the same relationship group, mainly according to two principles: first, OPs with the same colocation attribute are put into the same relationship group; second, an OP that has only one downstream node and no upstream node is divided into the same relationship group as its downstream node. As shown in FIG. 3, the OPs on the two sides of the bottom layer (drawn with dashed lines) can each be put into a relationship group together with their upstream OPs.
3) Measure the time each relationship group takes to run on a single device, and sort all relationship groups from the most time-consuming to the least.
4) Select the most time-consuming relationship groups and search among them for parallel-executable sets, where a parallel-executable set satisfies the following conditions: there is no upstream-downstream relationship within n levels between any two relationship groups in the set; and any two relationship groups in the set have a common upstream node or a common downstream node within n levels, where n is a preset value.
As shown in FIG. 4, the three relationship groups drawn with dashed lines are the three most time-consuming relationship groups. Relationship groups a and b have an upstream-downstream relationship within 2 levels, so they cannot be placed in the same parallel-executable set, whereas a and c fully satisfy the above conditions and can therefore be placed in the same parallel-executable set.
5) Use the simulated annealing algorithm to distribute the relationship groups of all parallel-executable sets onto multiple target devices so that the time consumed by the parallel operation is the shortest.
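The sketch below shows one way the inputs used in steps 1) to 3) might be collected with the TensorFlow 1.x graph API: upstream OPs from op.inputs, colocation attributes from op.colocation_groups(), and per-OP timings from a traced session run. It is an illustration under those API assumptions, not the implementation used in this example.

```python
import tensorflow.compat.v1 as tf  # TF 1.x-style graph execution, as in this example
tf.disable_eager_execution()

def graph_structure(graph: tf.Graph):
    """Return ({op name: upstream op names}, {op name: colocation tags}) for every OP."""
    upstream, colocation = {}, {}
    for op in graph.get_operations():
        upstream[op.name] = [t.op.name for t in op.inputs]
        colocation[op.name] = [g.decode() for g in op.colocation_groups()]
    return upstream, colocation

def per_op_times_us(graph: tf.Graph, fetches):
    """Run the graph once with full tracing and return {op name: elapsed microseconds}."""
    with graph.as_default():
        init = tf.global_variables_initializer()
    opts = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    meta = tf.RunMetadata()
    with tf.Session(graph=graph) as sess:
        sess.run(init)
        sess.run(fetches, options=opts, run_metadata=meta)
    times = {}
    for dev in meta.step_stats.dev_stats:
        for node in dev.node_stats:
            times[node.node_name] = times.get(node.node_name, 0) + node.all_end_rel_micros
    return times
```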
Example 2
This example provides a method for parallel processing of a deep learning model. In TensorFlow, every deep learning model corresponds to a computation graph, also called a data flow graph, which describes mathematical operations with a directed graph composed of nodes and edges: every computation (OP) is a node on the computation graph, the edges between nodes describe the dependencies between computations, and data (Tensor) flows between nodes along the edges. The computation graph of a complex deep learning model often contains thousands of OPs; for example, this example uses the computation graph of Inception-V3, which contains more than 30,000 OPs. Model parallelism consists in grouping these OPs and then placing them on different devices for training.
In this example, the method for parallel processing of the deep learning model may include the following steps:
1) Divide the OPs into relationship groups according to the name scope field in the OP names; OPs with the same name scope belong to the same relationship group, and when a name scope includes nesting levels, the division is made by the outermost name scope. As shown in FIG. 5, three OPs have the name scope field "a" and the other four OPs have the name scope field "b"; they are divided, respectively, into the first relationship group enclosed by the dashed box on the left and the second relationship group enclosed by the dashed box on the right.
2) Traverse all relationship groups and search for sink nodes with multiple inputs or multiple outputs. FIG. 6 shows four kinds of sink nodes, which are, from left to right: single input single output, single input multiple outputs, multiple inputs single output, and multiple inputs multiple outputs.
3) As shown in FIG. 7, starting from a sink node with multiple outputs, traverse downstream through all output nodes of that sink node until another sink node is encountered, and take all serial running branches between the two sink nodes as one parallel-executable set, where a serial running branch is a collection of relationship groups that have an upstream-downstream relationship between two sink nodes.
4) Use the simulated annealing algorithm to distribute the serial running branches of all parallel-executable sets onto multiple target devices so that the time consumed by the parallel operation is the shortest.
Those of ordinary skill in the art will understand that all or some of the steps of the methods disclosed above, and the functional modules/units of the systems and apparatuses, may be implemented as software, firmware, hardware, or appropriate combinations thereof. In a hardware implementation, the division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, a digital signal processor, or a microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storing information (such as computer-readable instructions, data structures, program modules, or other data). Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technologies, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer. In addition, as is well known to those of ordinary skill in the art, communication media usually contain computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.
It should be noted that the present disclosure may also have various other embodiments. Without departing from the spirit and essence of the present disclosure, those skilled in the art can make various corresponding changes and modifications according to the present disclosure, and all such corresponding changes and modifications shall fall within the protection scope of the claims appended to the present disclosure.
Industrial Applicability
The embodiments of the present disclosure provide a method and an apparatus for parallel processing of a deep learning model: dependency relationships between computing nodes in the model are determined, and relationship groups are divided according to the dependency relationships; the relationship groups are clustered according to a predetermined rule to generate parallel-executable sets, wherein the relationship groups within each parallel-executable set can run in parallel; and the relationship groups of all parallel-executable sets are distributed onto multiple target devices such that the total time consumed by the parallel operation of all parallel-executable sets is the shortest. The embodiments of the present disclosure can split a deep learning model automatically and improve the efficiency of distributed training when model parallelism is used for the deep learning model.

Claims (12)

  1. A method for parallel processing of a deep learning model, comprising:
    determining dependency relationships between computing nodes in the model, and dividing relationship groups according to the dependency relationships;
    clustering the relationship groups according to a predetermined rule to generate parallel-executable sets, wherein the relationship groups within each parallel-executable set can run in parallel; and
    distributing the relationship groups of all parallel-executable sets onto multiple target devices such that the total time consumed by the parallel operation of all parallel-executable sets is the shortest.
  2. The method according to claim 1, wherein
    determining the dependency relationships between the computing nodes in the model and dividing the relationship groups according to the dependency relationships comprises:
    determining the upstream nodes and downstream nodes of each computing node, as well as the attributes of that computing node; and
    dividing the computing nodes into relationship groups in at least one of the following ways:
    Method 1: dividing computing nodes with the same attributes into the same relationship group;
    Method 2: dividing a computing node that has only one downstream node and no upstream node into the same relationship group as its downstream node.
  3. The method according to claim 2, wherein
    clustering the relationship groups according to the predetermined rule to generate the parallel-executable sets comprises:
    measuring the time each relationship group takes to run on a single device, and sorting all relationship groups from the most time-consuming to the least;
    selecting the most time-consuming relationship groups, and searching among the selected relationship groups for parallel-executable sets;
    wherein a parallel-executable set satisfies the following conditions: there is no upstream-downstream relationship within n levels between any two relationship groups in the parallel-executable set; and any two relationship groups in the parallel-executable set have a common upstream node or a common downstream node within n levels, where n is a preset value.
  4. The method according to claim 3, wherein
    selecting the most time-consuming relationship groups comprises:
    selecting the a most time-consuming relationship groups;
    wherein a is the smallest integer such that the ratio of the total time consumption of the a relationship groups to the total time consumption of all relationship groups is greater than or equal to a predetermined ratio value.
  5. The method according to claim 1, wherein
    determining the dependency relationships between the computations in the model and dividing the relationship groups according to the dependency relationships comprises:
    dividing the computing nodes into relationship groups according to a predetermined field in the names of the computing nodes, wherein computing nodes with the same predetermined field belong to the same relationship group;
    wherein the predetermined field comprises a name scope field, and when the name scope field includes nesting levels, the predetermined field is the outermost name scope field.
  6. The method according to claim 5, wherein
    clustering the relationship groups according to the predetermined rule to generate the parallel-executable sets comprises:
    traversing all relationship groups, and searching for sink nodes having multiple inputs or multiple outputs;
    starting from a sink node having multiple inputs, traversing upstream through all input nodes of the sink node until another sink node is encountered, and generating one parallel-executable set from all serial running branches between the two sink nodes; or
    starting from a sink node having multiple outputs, traversing downstream through all output nodes of the sink node until another sink node is encountered, and generating one parallel-executable set from all serial running branches between the two sink nodes;
    wherein a serial running branch is a collection of relationship groups having an upstream-downstream relationship between two sink nodes.
  7. The method according to claim 3, wherein
    distributing the relationship groups of all parallel-executable sets onto multiple target devices such that the total time consumed by the parallel operation of all parallel-executable sets is the shortest comprises:
    using a simulated annealing algorithm to distribute the relationship groups of all parallel-executable sets onto multiple target devices such that the total time consumed by the parallel operation of all parallel-executable sets is the shortest.
  8. The method according to claim 7, wherein
    using the simulated annealing algorithm to distribute the relationship groups of all parallel-executable sets onto multiple target devices such that the total time consumed by the parallel operation of all parallel-executable sets is the shortest comprises:
    Step 1, initialization: setting the initial temperature T0, the termination temperature Tmin, the number of iterations K within each temperature, the cooling rate α, and the perturbation ratio μ used each time the solution is updated; at the initial temperature T0, randomly generating an initial solution X0 and computing the initial time E0, wherein the initial solution X0 refers to randomly assigning all relationship groups to the target devices according to an initial allocation, and the initial time E0 refers to the total time taken to execute the model operations under the initial allocation;
    Step 2: at the current temperature T, performing K perturbation-and-acceptance rounds, wherein each round comprises: at the current temperature T, randomly selecting relationship groups from the current solution X according to the perturbation ratio μ, randomly reassigning the selected relationship groups to target devices, and computing the total time Enew taken to execute the model operations under the new allocation; if Enew is less than E0, accepting the new allocation; if Enew is greater than or equal to E0, accepting the new allocation with probability p, wherein p = exp(-(Enew - E0)/T);
    Step 3: updating the current temperature T and the perturbation ratio μ: T = αT, μ = αμ;
    Step 4: determining whether the updated current temperature T is less than the termination temperature Tmin; if so, taking the current allocation as the final solution and ending; otherwise, jumping back to Step 2 and continuing.
  9. The method according to claim 6, wherein
    distributing the relationship groups of all parallel-executable sets onto multiple target devices such that the total time consumed by the parallel operation of all parallel-executable sets is the shortest comprises:
    using a simulated annealing algorithm to distribute the serial running branches of all parallel-executable sets onto multiple target devices such that the total time consumed by the parallel operation of all parallel-executable sets is the shortest.
  10. The method according to claim 9, wherein
    using the simulated annealing algorithm to distribute the serial running branches of all parallel-executable sets onto multiple target devices such that the total time consumed by the parallel operation of all parallel-executable sets is the shortest comprises:
    Step 1, initialization: setting the initial temperature T0, the termination temperature Tmin, the number of iterations K within each temperature, the cooling rate α, and the perturbation ratio μ used each time the solution is updated; at the initial temperature T0, randomly generating an initial solution X0 and computing the initial time E0, wherein the initial solution X0 refers to randomly assigning all serial running branches to the target devices according to an initial allocation, and the initial time E0 refers to the total time taken to execute the model operations under the initial allocation;
    Step 2: at the current temperature T, performing K perturbation-and-acceptance rounds, wherein each round comprises: at the current temperature T, randomly selecting serial running branches from the current solution X according to the perturbation ratio μ, randomly reassigning the selected serial running branches to target devices, and computing the total time Enew taken to execute the model operations under the new allocation; if Enew is less than E0, accepting the new allocation; if Enew is greater than or equal to E0, accepting the new allocation with probability p, wherein p = exp(-(Enew - E0)/T);
    Step 3: updating the current temperature T and the perturbation ratio μ: T = αT, μ = αμ;
    Step 4: determining whether the updated current temperature T is less than the termination temperature Tmin; if so, taking the current allocation as the final solution and ending; otherwise, jumping back to Step 2 and continuing.
  11. An apparatus for parallel processing of a deep learning model, comprising:
    a memory, a processor, and a program for parallel processing of a deep learning model that is stored in the memory and executable on the processor, wherein the program for parallel processing of the deep learning model, when executed by the processor, implements the steps of the method for parallel processing of a deep learning model according to any one of claims 1 to 10.
  12. A computer-readable storage medium, wherein a program for parallel processing of a deep learning model is stored on the computer-readable storage medium, and the program for parallel processing of the deep learning model, when executed by a processor, implements the steps of the method for parallel processing of a deep learning model according to any one of claims 1 to 10.
PCT/CN2020/113982 2019-09-26 2020-09-08 Method and apparatus for parallel processing of a deep learning model WO2021057465A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910916367.0 2019-09-26
CN201910916367.0A CN112561051A (zh) 2019-09-26 2019-09-26 Method and apparatus for parallel processing of a deep learning model

Publications (1)

Publication Number Publication Date
WO2021057465A1 true WO2021057465A1 (zh) 2021-04-01

Family

ID=75029777

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/113982 WO2021057465A1 (zh) 2019-09-26 2020-09-08 Method and apparatus for parallel processing of a deep learning model

Country Status (2)

Country Link
CN (1) CN112561051A (zh)
WO (1) WO2021057465A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113220457A (zh) * 2021-05-24 2021-08-06 交叉信息核心技术研究院(西安)有限公司 模型部署方法、模型部署装置、终端设备及可读存储介质
CN115277044A (zh) * 2022-05-17 2022-11-01 南京赛宁信息技术有限公司 OpenStack加密链路节点分层方法与系统
CN115983164A (zh) * 2023-01-12 2023-04-18 上海合见工业软件集团有限公司 用于数字电路原理图的游离元器件分层方法、设备和介质

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114691330A (zh) * 2022-03-28 2022-07-01 北京百度网讯科技有限公司 Data processing method and apparatus, electronic device, and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103399626A (zh) * 2013-07-18 2013-11-20 国家电网公司 Power-consumption-aware parallel application scheduling system and method for hybrid computing environments
CN103870340A (zh) * 2014-03-06 2014-06-18 华为技术有限公司 Data processing method, control node and stream computing system in a stream computing system
US20180285766A1 (en) * 2017-03-30 2018-10-04 Intel Corporation Diagnosing slow tasks in distributed computing
US20190005407A1 (en) * 2017-06-30 2019-01-03 Theodore D. Harris Gpu enhanced graph model build and scoring engine
CN109213587A (zh) * 2018-09-12 2019-01-15 中国人民解放军战略支援部队信息工程大学 Multi-stream parallel DAG task mapping strategy on GPU platforms

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WANG HONG-FENG, ZHU HAI: "Trusted Scheduling of Dependent Tasks Using Genetic-annealing Algorithm under Grid Environment", COMPUTER SCIENCE, vol. 42, no. 6, 15 June 2015 (2015-06-15), pages 268-275, XP055793946, ISSN: 1002-137X, DOI: 10.11896/j.issn.1002-137X.2015.6.056 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113220457A (zh) * 2021-05-24 2021-08-06 交叉信息核心技术研究院(西安)有限公司 Model deployment method, model deployment apparatus, terminal device and readable storage medium
CN113220457B (zh) * 2021-05-24 2024-03-22 深圳市智芯华玺信息技术有限公司 Model deployment method, model deployment apparatus, terminal device and readable storage medium
CN115277044A (zh) * 2022-05-17 2022-11-01 南京赛宁信息技术有限公司 OpenStack encrypted link node layering method and system
CN115277044B (zh) * 2022-05-17 2023-06-09 南京赛宁信息技术有限公司 OpenStack encrypted link node layering method and system
CN115983164A (zh) * 2023-01-12 2023-04-18 上海合见工业软件集团有限公司 Layering method, device and medium for free components in digital circuit schematics
CN115983164B (zh) * 2023-01-12 2023-08-15 上海合见工业软件集团有限公司 Layering method, device and medium for free components in digital circuit schematics

Also Published As

Publication number Publication date
CN112561051A (zh) 2021-03-26

Similar Documents

Publication Publication Date Title
WO2021057465A1 (zh) Method and apparatus for parallel processing of a deep learning model
US9684874B2 (en) Parallel decision or regression tree growing
Xiong et al. A simulation-based study of dispatching rules in a dynamic job shop scheduling problem with batch release and extended technical precedence constraints
US9710751B2 (en) Parallel tree based prediction
CN108292241A (zh) Processing computational graphs
CN108388474A (zh) DAG-based intelligent distributed computing management system and method
CN106933534A (zh) Data synchronization method and apparatus
JP6825103B2 (ja) Computational resource allocation
US20150304420A1 (en) Functional programming in distributed computing
Verma et al. Big Data representation for grade analysis through Hadoop framework
EP2901344A1 (en) System and method for flexible distributed massively parallel processing (mpp) database
US20150269161A1 (en) Similarity and ranking of databases based on database metadata
US20210304066A1 (en) Partitioning for an execution pipeline
JP2015169654A (ja) Computer-implemented method for finding k shortest paths
CN108415912A (zh) MapReduce-model-based data processing method and device
Legrand et al. Adapting batch scheduling to workload characteristics: What can we expect from online learning?
WO2022198754A1 (zh) Optimization method for large-scale cloud service processes
Wijayanto et al. Implementation of multi-criteria collaborative filtering on cluster using Apache Spark
CN105210059B (zh) Data processing method and system
Smutnicki et al. Very fast non-dominated sorting
US20200327128A1 (en) Query execution apparatus, method, and system for processing data, query containing a composite primitive
Lin et al. Mining high-utility sequential patterns from big datasets
CN117407921A (zh) Differential privacy histogram publishing method and system based on must-link and cannot-link constraints
US9361588B2 (en) Construction of tree-shaped bayesian network
US20150150011A1 (en) Self-splitting of workload in parallel computation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20867262; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 20867262; Country of ref document: EP; Kind code of ref document: A1)
32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established (Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 200223))
122 Ep: pct application non-entry in european phase (Ref document number: 20867262; Country of ref document: EP; Kind code of ref document: A1)