WO2021057465A1 - Method and device for parallel processing of a deep learning model - Google Patents
Method and device for parallel processing of a deep learning model
- Publication number
- WO2021057465A1, PCT/CN2020/113982, CN2020113982W
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- parallel
- relationship
- groups
- nodes
- executed
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
Definitions
- the present disclosure relates to, but is not limited to, the field of computer technology.
- the deep learning model has many parameters and a large volume of training data, which leads to heavy consumption of computing resources.
- a single training run often takes days or even months, which is hard to tolerate for the staff who tune the parameters. Therefore, it is very necessary to accelerate model training; since the improvement of the computing power of a single device is very limited, distributed training is needed.
- the distributed training of deep learning models mainly includes two methods: data parallelism and model parallelism.
- data parallelism means that each node holds a copy of the complete model and takes different data separately to complete the forward and backward passes, calculate the gradients, and then update the parameters.
- model parallelism refers to splitting the model across different nodes for training according to certain rules.
- the embodiments of the present disclosure provide a method for parallel processing of a deep learning model, including: determining the dependency relationships between computing nodes in the model, and dividing relationship groups according to the dependency relationships; clustering the relationship groups according to predetermined rules to generate parallel executable sets, wherein the relationship groups in each parallel executable set can run in parallel; and distributing all the relationship groups in the parallel executable sets to multiple target devices so that the total parallel operation of all parallel executable sets takes the shortest time.
- the embodiments of the present disclosure provide an apparatus for parallel processing of deep learning models, including: a memory, a processor, and a program for parallel processing of a deep learning model that is stored in the memory and runs on the processor;
- when the program for parallel processing of the deep learning model is executed by the processor, the steps of the method for parallel processing of the deep learning model described herein are realized.
- an embodiment of the present disclosure provides a computer-readable storage medium, the computer-readable storage medium stores a program for parallel processing of a deep learning model, and when the program for parallel processing of the deep learning model is executed by a processor, the steps of the method for parallel processing of the deep learning model described herein are realized.
- FIG. 1 is a flowchart of a method for parallel processing of a deep learning model according to Embodiment 1 of the disclosure
- FIG. 2 is a schematic diagram of a device for parallel processing of a deep learning model according to Embodiment 2 of the disclosure
- FIG. 3 is a schematic diagram of the calculation graph of the Inception-V3 model in Example 1 of the disclosure.
- FIG. 4 is a schematic diagram of selecting the top-ranked time-consuming relationship groups in Example 1 of the present disclosure.
- FIG. 5 is a schematic diagram of dividing relationship groups according to name scope fields in Example 2 of the present disclosure.
- FIG. 6 is a schematic diagram of four kinds of aggregation nodes in Example 2 of the present disclosure.
- FIG. 7 is a schematic diagram of the serial running branch in Example 2 of the disclosure.
- a method and device for parallel processing of a deep learning model are provided, which can automatically split the deep learning model and improve the efficiency of distributed training when the deep learning model adopts model parallelism.
- an embodiment of the present disclosure provides a method for parallel processing of a deep learning model, including:
- Step S110: determine the dependency relationships between the computing nodes in the model, and divide relationship groups according to the dependency relationships;
- Step S120: cluster the relationship groups according to predetermined rules to generate parallel executable sets, wherein the relationship groups in each parallel executable set can run in parallel;
- Step S130: distribute all the relationship groups in the parallel executable sets to multiple target devices so that the total parallel operation of all parallel executable sets takes the shortest time.
- the determining of the dependency relationships between the computing nodes in the model and dividing relationship groups according to the dependency relationships includes:
- Method 1: divide computing nodes with the same attributes into the same relationship group;
- Method 2: divide a computing node that has only one downstream node and no upstream node into the same relationship group as that downstream node; a short illustrative sketch of both rules is given below.
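- the two division rules can be illustrated with a minimal Python sketch; the `Node` structure, its `attr` field, and the helper below are illustrative assumptions of this sketch, not part of the disclosure:

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass(eq=False)  # identity-based equality/hash so nodes can be kept in sets
class Node:
    name: str
    attr: str                       # assumed attribute, e.g. an op type or scope tag
    upstream: list = field(default_factory=list)
    downstream: list = field(default_factory=list)

def divide_relationship_groups(nodes):
    """Method 1: nodes sharing an attribute fall into one relationship group.
    Method 2: a node with exactly one downstream node and no upstream node
    joins the relationship group of that downstream node."""
    groups = defaultdict(list)
    for n in nodes:                              # Method 1
        groups[n.attr].append(n)
    for n in nodes:                              # Method 2
        if not n.upstream and len(n.downstream) == 1:
            target = n.downstream[0]
            if target.attr != n.attr:
                groups[n.attr].remove(n)
                groups[target.attr].append(n)
    return {k: v for k, v in groups.items() if v}
```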
- the clustering of the relationship groups according to a predetermined rule to generate a set that can be executed in parallel includes:
- the parallel executable set satisfies the following conditions: there is no upstream-downstream relationship within n levels between any two relationship groups in the parallel executable set; and any two relationship groups in the parallel executable set have a common upstream node or a common downstream node within n levels; where n is a preset value.
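- one possible reading of these two conditions, reusing the `Node` sketch above; interpreting "within n levels" as reachability in at most n hops is an assumption of this sketch:

```python
def within_n_levels(start, n, direction):
    """Nodes reachable from `start` in at most n hops along `direction`
    ('upstream' or 'downstream'), excluding `start` itself."""
    frontier, seen = {start}, set()
    for _ in range(n):
        frontier = {nxt for node in frontier for nxt in getattr(node, direction)}
        seen |= frontier
    return seen

def can_share_parallel_set(group_a, group_b, n):
    """True when, within n levels, neither relationship group reaches the other,
    yet the two groups share a common upstream or downstream node."""
    up_a = set().union(*(within_n_levels(x, n, "upstream") for x in group_a))
    down_a = set().union(*(within_n_levels(x, n, "downstream") for x in group_a))
    up_b = set().union(*(within_n_levels(x, n, "upstream") for x in group_b))
    down_b = set().union(*(within_n_levels(x, n, "downstream") for x in group_b))
    no_dependency = not (set(group_b) & (up_a | down_a)) and not (set(group_a) & (up_b | down_b))
    common_neighbour = bool((up_a & up_b) | (down_a & down_b))
    return no_dependency and common_neighbour
```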
- the selecting of the multiple top-ranked time-consuming relationship groups includes:
- selecting the a top-ranked time-consuming relationship groups, where a is the smallest integer such that the ratio of the total time consumed by the a relationship groups to the total time consumed by all relationship groups is greater than or equal to the predetermined ratio value.
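- a minimal sketch of this selection rule, assuming the per-group run times have already been measured on a single device and are supplied as a dict (the names and values below are invented):

```python
def pick_top_groups(group_times, ratio):
    """Return the smallest prefix of groups, sorted by run time in descending order,
    whose combined run time reaches `ratio` of the total run time."""
    ordered = sorted(group_times.items(), key=lambda kv: kv[1], reverse=True)
    total = sum(group_times.values())
    picked, running = [], 0.0
    for name, cost in ordered:
        picked.append(name)
        running += cost
        if running >= ratio * total:
            break
    return picked

# example: pick_top_groups({"a": 40.0, "b": 35.0, "c": 15.0, "d": 10.0}, 0.8) -> ["a", "b", "c"]
```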
- the determining of the dependency relationships between calculations in the model and dividing relationship groups according to the dependency relationships includes:
- the calculation nodes are divided into relationship groups according to the predetermined fields in the names of the calculation nodes, and the calculation nodes with the same predetermined fields belong to the same relationship group;
- the predetermined field includes: a name scope field; when the name scope field includes a nesting level, the predetermined field is the outermost name scope field.
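- assuming TensorFlow-style OP names in which nested name scopes are separated by "/" (the Inception-V3 example suggests such a graph, although the delimiter itself is an assumption of this sketch), the outermost name scope field can be read off as follows:

```python
def outermost_name_scope(op_name):
    """The outermost name-scope field is the first path component of the OP name."""
    return op_name.split("/", 1)[0]

# e.g. OPs named "a/conv1" and "a/branch/relu" share the outermost scope "a",
# so they would fall into the same relationship group.
```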
- the clustering of the relationship groups according to a predetermined rule to generate a set that can be executed in parallel includes:
- the serial running branch is a collection of relationship groups having upstream-downstream relationships between two aggregation (sink) nodes.
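- a sketch of how the serial running branches between two sink (aggregation) nodes might be collected, reusing the `Node` structure from the earlier sketch; it assumes each chain between sink nodes has a single predecessor at every step and is an illustration rather than the disclosed procedure:

```python
def is_sink(node):
    """An aggregation (sink) node has multiple inputs or multiple outputs."""
    return len(node.upstream) > 1 or len(node.downstream) > 1

def serial_branches_upstream(sink):
    """Starting from a multi-input sink node, walk each input chain upstream until
    the next sink node; every chain collected this way is one serial running branch,
    and together the branches between the two sink nodes form one parallel
    executable set."""
    branches = []
    for start in sink.upstream:
        branch, node = [], start
        while node is not None and not is_sink(node):
            branch.append(node)
            node = node.upstream[0] if node.upstream else None
        branches.append(branch)
    return branches
```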
- the simulated annealing algorithm is used to distribute all the relational groups in the set that can be executed in parallel to multiple target devices, so that the total parallel operation time of all sets that can be executed in parallel is the shortest.
- the using of the simulated annealing algorithm to distribute all the relationship groups in the parallel executable sets to multiple target devices so that the total parallel operation time of all parallel executable sets is the shortest includes:
- Step 1: initialization: set the initial temperature T0, the termination temperature Tmin, the number of iterations K at each temperature, the cooling rate α, and the perturbation ratio μ used each time the solution is updated; at the initial temperature T0, randomly generate an initial solution X0 and calculate the initial time E0; the initial solution X0 refers to randomly assigning all relationship groups to the target devices according to an initial allocation scheme, and the initial time E0 refers to the total time taken to run the model under the initial allocation scheme;
- Step 4: determine whether the updated current temperature T is less than the termination temperature Tmin; if so, take the current allocation scheme as the final solution and end; otherwise, return to Step 2 and continue.
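- Steps 2 and 3 of the procedure (the perturbation-and-acceptance loop and the cooling update) are spelled out in claims 8 and 10; the sketch below puts the whole schedule together. It is illustrative only: `groups` is a list of group identifiers, `devices` a list of device names, `run_time` a caller-supplied cost function estimating the total model run time for an assignment, and the default parameter values are placeholders, none of which are defined by the disclosure.

```python
import math
import random

def simulated_annealing_placement(groups, devices, run_time,
                                  t0=100.0, t_min=1e-3, k=50,
                                  alpha=0.9, mu=0.2, seed=0):
    """Random initial assignment, K perturbations per temperature, geometric cooling
    of both the temperature and the perturbation ratio, and Metropolis acceptance
    with p = exp(-(E_new - E) / T)."""
    rng = random.Random(seed)
    assign = {g: rng.choice(devices) for g in groups}   # initial solution X0
    energy = run_time(assign)                           # initial total time E0
    t = t0
    while t >= t_min:
        for _ in range(k):
            # perturb: re-place a mu-fraction of the relationship groups at random
            candidate = dict(assign)
            for g in rng.sample(groups, max(1, int(mu * len(groups)))):
                candidate[g] = rng.choice(devices)
            e_new = run_time(candidate)
            # accept improvements outright; accept worse placements with probability p
            if e_new < energy or rng.random() < math.exp(-(e_new - energy) / t):
                assign, energy = candidate, e_new
        t *= alpha    # cool down
        mu *= alpha   # shrink the perturbation ratio alongside the temperature
    return assign, energy
```

- the same routine covers the serial-running-branch variant described next by passing the serial running branches instead of the relationship groups.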
- the distributing of all the relationship groups in the parallel executable sets to multiple target devices so that the total parallel operation time of all parallel executable sets is the shortest includes:
- the simulated annealing algorithm is used to distribute all the serial running branches in the set that can be executed in parallel to multiple target devices, so that the total parallel operation time of all sets that can be executed in parallel is the shortest.
- the using of the simulated annealing algorithm to distribute all the serial running branches in the parallel executable sets to multiple target devices so that the total parallel operation time of all parallel executable sets is the shortest includes:
- Step 1: initialization: set the initial temperature T0, the termination temperature Tmin, the number of iterations K at each temperature, the cooling rate α, and the perturbation ratio μ used each time the solution is updated; at the initial temperature T0, randomly generate an initial solution X0 and calculate the initial time E0; the initial solution X0 refers to randomly assigning all serial running branches to the target devices according to an initial allocation scheme, and the initial time E0 refers to the total time taken to run the model under the initial allocation scheme;
- Step 4: determine whether the updated current temperature T is less than the termination temperature Tmin; if so, take the current allocation scheme as the final solution and end; otherwise, return to Step 2 and continue.
- an embodiment of the present disclosure provides an apparatus for parallel processing of a deep learning model, including:
- the relationship group dividing module 201 is configured to determine the dependency relationship between the computing nodes in the model, and divide the relationship group according to the dependency relationship;
- the set division module 202 is configured to cluster the relationship groups according to predetermined rules to generate a parallel executable set; wherein, the relationship groups in each parallel executable set can run in parallel;
- the device allocation module 203 is configured to allocate all the relational groups in the set that can be executed in parallel to multiple target devices so that the total parallel operation of all sets that can be executed in parallel takes the shortest time.
- the relationship group dividing module 201 is configured to determine the dependency relationship between the computing nodes in the model in the following manner, and divide the relationship group according to the dependency relationship:
- Method 1: divide computing nodes with the same attributes into the same relationship group;
- Method 2: divide a computing node that has only one downstream node and no upstream node into the same relationship group as that downstream node.
- the set division module 202 is configured to cluster the relationship groups according to predetermined rules in the following manner to generate sets that can be executed in parallel:
- the parallel executable set satisfies the following conditions: there is no upstream-downstream relationship within n levels between any two relationship groups in the parallel executable set; and any two relationship groups in the parallel executable set have a common upstream node or a common downstream node within n levels; where n is a preset value.
- the set division module 202 is configured to select the multiple top-ranked time-consuming relationship groups in the following manner:
- selecting the a top-ranked time-consuming relationship groups, where a is the smallest integer such that the ratio of the total time consumed by the a relationship groups to the total time consumed by all relationship groups is greater than or equal to the predetermined ratio value.
- the relationship group dividing module 201 is configured to determine the dependency relationship between calculations in the model in the following manner, and divide the relationship group according to the dependency relationship:
- the calculation nodes are divided into relationship groups according to the predetermined fields in the names of the calculation nodes, and the calculation nodes with the same predetermined fields belong to the same relationship group;
- the predetermined field includes: a name scope field; when the name scope field includes a nesting level, the predetermined field is the outermost name scope field.
- the set division module 202 is configured to cluster the relationship groups according to predetermined rules in the following manner to generate a set that can be executed in parallel:
- the serial running branch is a collection of relationship groups having upstream-downstream relationships between two aggregation (sink) nodes.
- the device allocation module 203 is configured to distribute, in the following manner, all the relationship groups in the parallel executable sets to multiple target devices so that the total parallel operation time of all parallel executable sets is the shortest:
- the simulated annealing algorithm is used to distribute all the relational groups in the set that can be executed in parallel to multiple target devices, so that the total parallel operation time of all sets that can be executed in parallel is the shortest.
- the device allocation module 203 is configured to use the simulated annealing algorithm in the following manner to distribute all the relationship groups in the parallel executable sets to multiple target devices so that the total parallel operation time of all parallel executable sets is the shortest:
- Step 1: initialization: set the initial temperature T0, the termination temperature Tmin, the number of iterations K at each temperature, the cooling rate α, and the perturbation ratio μ used each time the solution is updated; at the initial temperature T0, randomly generate an initial solution X0 and calculate the initial time E0; the initial solution X0 refers to randomly assigning all relationship groups to the target devices according to an initial allocation scheme, and the initial time E0 refers to the total time taken to run the model under the initial allocation scheme;
- Step 4: determine whether the updated current temperature T is less than the termination temperature Tmin; if so, take the current allocation scheme as the final solution and end; otherwise, return to Step 2 and continue.
- the device allocation module 203 is configured to distribute, in the following manner, all the relationship groups in the parallel executable sets to multiple target devices so that the total parallel operation time of all parallel executable sets is the shortest:
- the simulated annealing algorithm is used to distribute all the serial running branches in the set that can be executed in parallel to multiple target devices, so that the total parallel operation time of all sets that can be executed in parallel is the shortest.
- the device allocation module 203 is configured to use the simulated annealing algorithm in the following manner to distribute all the serial running branches in the parallel executable sets to multiple target devices so that the total parallel operation time of all parallel executable sets is the shortest:
- Step 1: initialization: set the initial temperature T0, the termination temperature Tmin, the number of iterations K at each temperature, the cooling rate α, and the perturbation ratio μ used each time the solution is updated; at the initial temperature T0, randomly generate an initial solution X0 and calculate the initial time E0; the initial solution X0 refers to randomly assigning all serial running branches to the target devices according to an initial allocation scheme, and the initial time E0 refers to the total time taken to run the model under the initial allocation scheme;
- Step 4: determine whether the updated current temperature T is less than the termination temperature Tmin; if so, take the current allocation scheme as the final solution and end; otherwise, return to Step 2 and continue.
- the embodiment of the present disclosure provides a device for parallel processing of a deep learning model, including:
- a memory, a processor, and a program for parallel processing of a deep learning model that is stored in the memory and can run on the processor; when the program for parallel processing of the deep learning model is executed by the processor, the steps of the method for parallel processing of the deep learning model described in Embodiment 1 above are implemented.
- the embodiment of the present disclosure provides a computer-readable storage medium that stores a program for parallel processing of a deep learning model; when the program for parallel processing of the deep learning model is executed by a processor, the steps of the method for parallel processing of the deep learning model described in Embodiment 1 above are implemented.
- each deep learning model corresponds to a calculation graph, also called a data flow graph, which uses a directed graph composed of nodes and edges to describe mathematical operations.
- each calculation (OP) is a node of the calculation graph, the edges between nodes describe the dependencies between calculations, and data (Tensor) flows along the edges between the nodes.
- the calculation graph of a complex deep learning model often contains thousands of OPs; for example, this example uses the calculation graph of Inception-V3, which contains more than 30,000 OPs.
- model parallelism groups these OPs and places the groups on different devices for training.
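- as a toy illustration only (the OP names and attributes below are invented), the grouping sketch given earlier can be applied to a four-OP graph:

```python
# toy calculation graph using the Node sketch above: bias_read -> conv -> bn -> relu
conv = Node("conv", attr="branch_a")
bn   = Node("bn",   attr="branch_a")
relu = Node("relu", attr="branch_b")
bias = Node("bias_read", attr="io")
conv.downstream, bn.upstream = [bn], [conv]
bn.downstream, relu.upstream = [relu], [bn]
bias.downstream, conv.upstream = [conv], [bias]

groups = divide_relationship_groups([conv, bn, relu, bias])
# "bias_read" has one downstream node and no upstream node, so Method 2 moves it
# into branch_a's group; the result is {"branch_a": [conv, bn, bias], "branch_b": [relu]}
```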
- the method for parallel processing of the deep learning model can include the following steps:
- the three relationship groups outlined by dashed lines in the figure are the three most time-consuming relationship groups; among them, there is a two-level upstream-downstream relationship between relationship groups a and b, so they cannot be placed in the same parallel executable set, while groups a and c fully satisfy the above conditions, so they can be placed in the same parallel executable set.
- the simulated annealing algorithm is used to distribute all the relationship groups in the parallel executable sets to multiple target devices so that the parallel operation takes the shortest time.
- each deep learning model corresponds to a calculation graph, also called a data flow graph, which uses a directed graph composed of nodes and edges to describe mathematical operations.
- each calculation (OP) is a node of the calculation graph, the edges between nodes describe the dependencies between calculations, and data (Tensor) flows along the edges between the nodes.
- the calculation graph of a complex deep learning model often contains thousands of OPs; for example, this example uses the calculation graph of Inception-V3, which contains more than 30,000 OPs.
- model parallelism groups these OPs and places the groups on different devices for training.
- the method for parallel processing of the deep learning model can include the following steps:
- the OPs are divided into relationship groups: OPs with the same name scope belong to the same relationship group, and when the name scope includes nesting levels, the division uses the outermost name scope. As shown in Figure 5, the three OPs whose name scope field is "a" and the other four OPs whose name scope field is "b" are divided, respectively, into the first relationship group enclosed by the dashed box on the left and the second relationship group enclosed by the dashed box on the right.
- Figure 6 shows four types of sink nodes, from left to right: single input single output, single input multiple output, multiple input single output, multiple input multiple output.
- the serial running branch is a set of relationship groups having an upstream and downstream relationship between two sink nodes
- the simulated annealing algorithm is used to distribute all the serial running branches in the parallel executable sets to multiple target devices so that the parallel operation takes the shortest time.
- Such software may be distributed on a computer-readable medium
- the computer-readable medium may include a computer storage medium (or non-transitory medium) and a communication medium (or transitory medium).
- the term computer storage medium includes volatile and non-volatile media implemented in any method or technology for storing information, such as computer-readable instructions, data structures, program modules, or other data.
- Computer storage media include but are not limited to RAM, ROM, EEPROM, flash memory or other memory technologies, CD-ROM, digital versatile disk (DVD) or other optical disk storage, magnetic cassettes, magnetic tapes, magnetic disk storage or other magnetic storage devices, or Any other medium used to store desired information and that can be accessed by a computer.
- communication media usually contain computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transmission mechanism, and may include any information delivery media.
- the embodiments of the present disclosure provide a method and device for parallel processing of a deep learning model: determining the dependency relationships between computing nodes in the model and dividing relationship groups according to the dependency relationships; clustering the relationship groups according to predetermined rules to generate parallel executable sets, wherein the relationship groups in each parallel executable set can run in parallel; and distributing all the relationship groups in the parallel executable sets to multiple target devices so that the total parallel operation of all parallel executable sets takes the shortest time.
- the embodiments of the present disclosure can automatically split the deep learning model, and improve the distributed training efficiency when the deep learning model adopts model parallelism.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- Software Systems (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Biophysics (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
Description
Claims (12)
- A method for parallel processing of a deep learning model, comprising: determining dependency relationships between computing nodes in the model, and dividing relationship groups according to the dependency relationships; clustering the relationship groups according to predetermined rules to generate parallel executable sets, wherein the relationship groups within each parallel executable set can run in parallel; and distributing all the relationship groups in the parallel executable sets to multiple target devices so that the total parallel operation of all parallel executable sets takes the shortest time.
- The method according to claim 1, wherein the determining of dependency relationships between computing nodes in the model and dividing relationship groups according to the dependency relationships comprises: determining the upstream nodes and downstream nodes of each computing node, and the attributes of the computing node; and dividing relationship groups in at least one of the following ways: Method 1: dividing computing nodes with the same attributes into the same relationship group; Method 2: dividing a computing node that has only one downstream node and no upstream node into the same relationship group as that downstream node.
- The method according to claim 2, wherein the clustering of the relationship groups according to predetermined rules to generate parallel executable sets comprises: counting the time taken by every relationship group to run on a single device, and sorting all relationship groups from most to least time-consuming; and selecting the multiple top-ranked time-consuming relationship groups, and searching the selected relationship groups for parallel executable sets; wherein a parallel executable set satisfies the following conditions: there is no upstream-downstream relationship within n levels between any two relationship groups in the set, and any two relationship groups in the set have a common upstream node or a common downstream node within n levels, where n is a preset value.
- The method according to claim 3, wherein the selecting of the multiple top-ranked time-consuming relationship groups comprises: selecting the a top-ranked time-consuming relationship groups, where a is the smallest integer such that the ratio of the total time consumed by the a relationship groups to the total time consumed by all relationship groups is greater than or equal to a predetermined ratio value.
- The method according to claim 1, wherein the determining of dependency relationships between calculations in the model and dividing relationship groups according to the dependency relationships comprises: dividing the computing nodes into relationship groups according to a predetermined field in the names of the computing nodes, wherein computing nodes with the same predetermined field belong to the same relationship group; wherein the predetermined field comprises a name scope field, and when the name scope field includes nesting levels, the predetermined field is the outermost name scope field.
- The method according to claim 5, wherein the clustering of the relationship groups according to predetermined rules to generate parallel executable sets comprises: traversing all relationship groups and searching for aggregation (sink) nodes that have multiple inputs or multiple outputs; starting from a sink node with multiple inputs, traversing upstream through all of its input nodes until another sink node is met, and generating one parallel executable set from all the serial running branches between the two sink nodes; or starting from a sink node with multiple outputs, traversing downstream through all of its output nodes until another sink node is met, and generating one parallel executable set from all the serial running branches between the two sink nodes; wherein a serial running branch is a collection of relationship groups having upstream-downstream relationships between two sink nodes.
- The method according to claim 3, wherein the distributing of all the relationship groups in the parallel executable sets to multiple target devices so that the total parallel operation of all parallel executable sets takes the shortest time comprises: using a simulated annealing algorithm to distribute all the relationship groups in the parallel executable sets to multiple target devices so that the total parallel operation of all parallel executable sets takes the shortest time.
- The method according to claim 7, wherein the using of a simulated annealing algorithm to distribute all the relationship groups in the parallel executable sets to multiple target devices so that the total parallel operation of all parallel executable sets takes the shortest time comprises: Step 1, initialization: setting an initial temperature T0, a termination temperature Tmin, a number of iterations K at each temperature, a cooling rate α, and a perturbation ratio μ used each time the solution is updated; at the initial temperature T0, randomly generating an initial solution X0 and calculating an initial time E0, wherein the initial solution X0 refers to randomly assigning all relationship groups to the target devices according to an initial allocation scheme, and the initial time E0 refers to the total time taken to run the model under the initial allocation scheme; Step 2: performing K perturbation-and-acceptance processes at the current temperature T, wherein each perturbation-and-acceptance process comprises: at the current temperature T, randomly selecting relationship groups from the current solution X according to the perturbation ratio μ, randomly reassigning the selected relationship groups to target devices, and calculating the total time E_new taken to run the model under the new allocation scheme; if E_new is less than E0, accepting the new allocation scheme; if E_new is greater than or equal to E0, accepting the new allocation scheme with probability p, where p = exp(-(E_new - E0)/T); Step 3: updating the current temperature T and the perturbation ratio μ: T = αT, μ = αμ; Step 4: determining whether the updated current temperature T is less than the termination temperature Tmin; if so, taking the current allocation scheme as the final solution and ending; otherwise, returning to Step 2 and continuing.
- The method according to claim 6, wherein the distributing of all the relationship groups in the parallel executable sets to multiple target devices so that the total parallel operation of all parallel executable sets takes the shortest time comprises: using a simulated annealing algorithm to distribute all the serial running branches in the parallel executable sets to multiple target devices so that the total parallel operation of all parallel executable sets takes the shortest time.
- The method according to claim 9, wherein the using of a simulated annealing algorithm to distribute all the serial running branches in the parallel executable sets to multiple target devices so that the total parallel operation of all parallel executable sets takes the shortest time comprises: Step 1, initialization: setting an initial temperature T0, a termination temperature Tmin, a number of iterations K at each temperature, a cooling rate α, and a perturbation ratio μ used each time the solution is updated; at the initial temperature T0, randomly generating an initial solution X0 and calculating an initial time E0, wherein the initial solution X0 refers to randomly assigning all serial running branches to the target devices according to an initial allocation scheme, and the initial time E0 refers to the total time taken to run the model under the initial allocation scheme; Step 2: performing K perturbation-and-acceptance processes at the current temperature T, wherein each perturbation-and-acceptance process comprises: at the current temperature T, randomly selecting serial running branches from the current solution X according to the perturbation ratio μ, randomly reassigning the selected serial running branches to target devices, and calculating the total time E_new taken to run the model under the new allocation scheme; if E_new is less than E0, accepting the new allocation scheme; if E_new is greater than or equal to E0, accepting the new allocation scheme with probability p, where p = exp(-(E_new - E0)/T); Step 3: updating the current temperature T and the perturbation ratio μ: T = αT, μ = αμ; Step 4: determining whether the updated current temperature T is less than the termination temperature Tmin; if so, taking the current allocation scheme as the final solution and ending; otherwise, returning to Step 2 and continuing.
- An apparatus for parallel processing of a deep learning model, comprising: a memory, a processor, and a program for parallel processing of a deep learning model that is stored in the memory and can run on the processor, wherein when the program for parallel processing of a deep learning model is executed by the processor, the steps of the method for parallel processing of a deep learning model according to any one of claims 1 to 10 are implemented.
- A computer-readable storage medium storing a program for parallel processing of a deep learning model, wherein when the program for parallel processing of a deep learning model is executed by a processor, the steps of the method for parallel processing of a deep learning model according to any one of claims 1 to 10 are implemented.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910916367.0 | 2019-09-26 | ||
CN201910916367.0A CN112561051A (zh) | 2019-09-26 | 2019-09-26 | Method and device for parallel processing of a deep learning model
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021057465A1 true WO2021057465A1 (zh) | 2021-04-01 |
Family
ID=75029777
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/113982 WO2021057465A1 (zh) | 2019-09-26 | 2020-09-08 | Method and device for parallel processing of a deep learning model
Country Status (2)
Country | Link |
---|---|
CN (1) | CN112561051A (zh) |
WO (1) | WO2021057465A1 (zh) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113220457A (zh) * | 2021-05-24 | 2021-08-06 | 交叉信息核心技术研究院(西安)有限公司 | Model deployment method, model deployment apparatus, terminal device, and readable storage medium |
CN115277044A (zh) * | 2022-05-17 | 2022-11-01 | 南京赛宁信息技术有限公司 | OpenStack encrypted link node layering method and system |
CN115983164A (zh) * | 2023-01-12 | 2023-04-18 | 上海合见工业软件集团有限公司 | Floating component layering method, device, and medium for digital circuit schematics |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114691330A (zh) * | 2022-03-28 | 2022-07-01 | 北京百度网讯科技有限公司 | Data processing method and apparatus, electronic device, and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103399626A (zh) * | 2013-07-18 | 2013-11-20 | 国家电网公司 | Power-consumption-aware parallel application scheduling system and method for hybrid computing environments |
CN103870340A (zh) * | 2014-03-06 | 2014-06-18 | 华为技术有限公司 | Data processing method in a stream computing system, control node, and stream computing system |
US20180285766A1 (en) * | 2017-03-30 | 2018-10-04 | Intel Corporation | Diagnosing slow tasks in distributed computing |
US20190005407A1 (en) * | 2017-06-30 | 2019-01-03 | Theodore D. Harris | Gpu enhanced graph model build and scoring engine |
CN109213587A (zh) * | 2018-09-12 | 2019-01-15 | 中国人民解放军战略支援部队信息工程大学 | Multi-stream parallel DAG graph task mapping strategy on GPU platforms |
- 2019
- 2019-09-26 CN CN201910916367.0A patent/CN112561051A/zh active Pending
- 2020
- 2020-09-08 WO PCT/CN2020/113982 patent/WO2021057465A1/zh active Application Filing
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103399626A (zh) * | 2013-07-18 | 2013-11-20 | 国家电网公司 | Power-consumption-aware parallel application scheduling system and method for hybrid computing environments |
CN103870340A (zh) * | 2014-03-06 | 2014-06-18 | 华为技术有限公司 | Data processing method in a stream computing system, control node, and stream computing system |
US20180285766A1 (en) * | 2017-03-30 | 2018-10-04 | Intel Corporation | Diagnosing slow tasks in distributed computing |
US20190005407A1 (en) * | 2017-06-30 | 2019-01-03 | Theodore D. Harris | Gpu enhanced graph model build and scoring engine |
CN109213587A (zh) * | 2018-09-12 | 2019-01-15 | 中国人民解放军战略支援部队信息工程大学 | Multi-stream parallel DAG graph task mapping strategy on GPU platforms |
Non-Patent Citations (1)
Title |
---|
WANG HONG-FENG , ZHU HAI: "Trusted Scheduling of Dependent Tasks Using Genetic-annealing Algorithm under Grid Environment", COMPUTER SCIENCE, vol. 42, no. 6, 15 June 2015 (2015-06-15), pages 268 - 275, XP055793946, ISSN: 1002-137X, DOI: 10.11896/j.issn.1002-137X.2015.6.056 * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113220457A (zh) * | 2021-05-24 | 2021-08-06 | 交叉信息核心技术研究院(西安)有限公司 | Model deployment method, model deployment apparatus, terminal device, and readable storage medium |
CN113220457B (zh) * | 2021-05-24 | 2024-03-22 | 深圳市智芯华玺信息技术有限公司 | Model deployment method, model deployment apparatus, terminal device, and readable storage medium |
CN115277044A (zh) * | 2022-05-17 | 2022-11-01 | 南京赛宁信息技术有限公司 | OpenStack encrypted link node layering method and system |
CN115277044B (zh) * | 2022-05-17 | 2023-06-09 | 南京赛宁信息技术有限公司 | OpenStack encrypted link node layering method and system |
CN115983164A (zh) * | 2023-01-12 | 2023-04-18 | 上海合见工业软件集团有限公司 | Floating component layering method, device, and medium for digital circuit schematics |
CN115983164B (zh) * | 2023-01-12 | 2023-08-15 | 上海合见工业软件集团有限公司 | Floating component layering method, device, and medium for digital circuit schematics |
Also Published As
Publication number | Publication date |
---|---|
CN112561051A (zh) | 2021-03-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021057465A1 (zh) | Method and device for parallel processing of a deep learning model | |
US9684874B2 (en) | Parallel decision or regression tree growing | |
Xiong et al. | A simulation-based study of dispatching rules in a dynamic job shop scheduling problem with batch release and extended technical precedence constraints | |
US9710751B2 (en) | Parallel tree based prediction | |
CN108292241A (zh) | Processing computational graphs | |
CN108388474A (zh) | DAG-based intelligent distributed computing management system and method | |
CN106933534A (zh) | Data synchronization method and apparatus | |
JP6825103B2 (ja) | Computing resource allocation | |
US20150304420A1 (en) | Functional programming in distributed computing | |
Verma et al. | Big Data representation for grade analysis through Hadoop framework | |
EP2901344A1 (en) | System and method for flexible distributed massively parallel processing (mpp) database | |
US20150269161A1 (en) | Similarity and ranking of databases based on database metadata | |
US20210304066A1 (en) | Partitioning for an execution pipeline | |
JP2015169654A (ja) | コンピュータ実装されるk最短経路発見法 | |
CN108415912A (zh) | Data processing method and device based on the MapReduce model | |
Legrand et al. | Adapting batch scheduling to workload characteristics: What can we expect from online learning? | |
WO2022198754A1 (zh) | Optimization method for large-scale cloud service processes | |
Wijayanto et al. | Implementation of multi-criteria collaborative filtering on cluster using Apache Spark | |
CN105210059B (zh) | Data processing method and system | |
Smutnicki et al. | Very fast non-dominated sorting | |
US20200327128A1 (en) | Query execution apparatus, method, and system for processing data, query containing a composite primitive | |
Lin et al. | Mining high-utility sequential patterns from big datasets | |
CN117407921A (zh) | Differentially private histogram publishing method and system based on must-link and cannot-link constraints | |
US9361588B2 (en) | Construction of tree-shaped bayesian network | |
US20150150011A1 (en) | Self-splitting of workload in parallel computation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 20867262; Country of ref document: EP; Kind code of ref document: A1 |
NENP | Non-entry into the national phase | Ref country code: DE |
122 | Ep: pct application non-entry in european phase | Ref document number: 20867262; Country of ref document: EP; Kind code of ref document: A1 |
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 200223) |
122 | Ep: pct application non-entry in european phase | Ref document number: 20867262; Country of ref document: EP; Kind code of ref document: A1 |