CN110399222A - GPU cluster deep learning task parallel method, device and electronic equipment - Google Patents

GPU cluster deep learning task parallel method, device and electronic equipment

Info

Publication number
CN110399222A
Authority
CN
China
Prior art keywords
gpu
deep learning
processed
task
learning task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910675587.9A
Other languages
Chinese (zh)
Other versions
CN110399222B (en)
Inventor
张海涛
耿欣
马华东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN201910675587.9A priority Critical patent/CN110399222B/en
Publication of CN110399222A publication Critical patent/CN110399222A/en
Application granted granted Critical
Publication of CN110399222B publication Critical patent/CN110399222B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/004Artificial life, i.e. computing arrangements simulating life
    • G06N3/006Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Debugging And Monitoring (AREA)
  • Computer And Data Communications (AREA)

Abstract

The GPU cluster deep learning task parallelization method, apparatus and electronic device provided by the embodiments of the present application relate to the field of Internet technology. The similarity between a deep learning task to be processed and each compute node of a GPU cluster is first analyzed to determine the target compute node of the task in the GPU cluster, which reduces the likelihood of resource contention on the compute nodes and thereby improves system resource utilization and the execution efficiency of deep learning tasks. The deep learning task to be processed is then divided into multiple target subtasks according to the number of GPUs it requires, and the interference level and communication cost of each target subtask are analyzed to determine the target GPU of each subtask within the target compute node. This avoids unbalanced resource allocation across the GPUs of a compute node, achieves a high degree of parallelism for deep learning tasks, improves the resource utilization of the GPU cluster, and at the same time improves the execution efficiency of deep learning tasks.

Description

GPU cluster deep learning task parallel method, device and electronic equipment
Technical field
The present application relates to the field of Internet technology, and in particular to a GPU cluster deep learning task parallelization method, apparatus and electronic device.
Background technique
With the continuous deepening of deep learning research, deep learning technology has achieved great success in fields such as computer vision, speech recognition and text processing, bringing great convenience to people's lives. However, complex neural network models and massive amounts of data place higher demands on computing power. A GPU (Graphics Processing Unit) cluster integrates multiple GPU computing resources and provides powerful and efficient parallel computing capability for computation-intensive deep learning tasks, effectively meeting the computing demands of multiple deep learning tasks.
However, when deep learning tasks run on a resource-shared GPU cloud platform, their execution efficiency is affected by the interference caused by resource contention among co-executing tasks. Therefore, for deep learning tasks in a GPU cluster, how to schedule tasks in parallel according to task information and their resource demands, and how to make reasonable use of the nodes of the GPU cluster and the multiple GPU resources on each node, are essential for optimizing the execution time of deep learning tasks, improving the processing performance of the overall computation and raising the resource utilization of the system.
At present, mainstream GPU clusters mainly use traditional schedulers (e.g. Kubernetes, Yarn) to schedule deep learning tasks. By accounting for the overall resource usage, resources are allocated to GPUs in a reasonable manner, and sufficient resources are ensured within the lifetime of a GPU to guarantee its operation.
Although this approach achieves the parallelization of deep learning tasks to a certain extent, it mainly considers resource usage and does not take into account the physical characteristics of the resources or the characteristics of the tasks themselves. It therefore cannot achieve efficient parallelization of deep learning tasks and may reduce the execution efficiency of deep learning workloads. Moreover, it does not support fine-grained multi-task allocation on a GPU and cannot make full use of the GPU resources on a node, which affects the efficient execution of deep learning tasks and lowers the GPU utilization of the node, thereby affecting the resource utilization of the GPU cluster.
Summary of the invention
The purpose of the embodiments of the present application is to provide a GPU cluster deep learning task parallelization method, apparatus, electronic device, storage medium and computer program product containing instructions, so as to achieve a high degree of parallelism for deep learning tasks and to improve the utilization of GPU resources in the GPU cluster and the execution efficiency of deep learning tasks.
The specific technical solutions are as follows:
In a first aspect, an embodiment of the present application provides a GPU cluster deep learning task parallelization method, comprising:
obtaining a deep learning task to be processed and task-set information of the GPU cluster, the task-set information of the GPU cluster including the task information of each task in the waiting queue of each compute node of the GPU cluster and the task information of each subtask being executed by each GPU in each compute node of the GPU cluster;
analyzing the task information of the deep learning task to be processed and the task information of each task in the waiting queue of each compute node of the GPU cluster, to obtain the similarity between the deep learning task to be processed and each compute node of the GPU cluster;
determining, according to the similarity, the target compute node of the deep learning task to be processed in the GPU cluster;
dividing the deep learning task to be processed into multiple target subtasks according to the number of GPUs required by the deep learning task to be processed;
analyzing the task information of each target subtask and the task information of each subtask being executed by each GPU in the target compute node, to obtain the interference level and communication cost of each target subtask;
determining, according to the interference level and communication cost of each target subtask, the execution GPU of each target subtask among the GPUs of the target compute node.
Optionally, the task information of the deep learning task to be processed comprises: the maximum CPU usage of the deep learning task to be processed, the host memory utilization of the deep learning task to be processed, the I/O throughput of the deep learning task to be processed, the GPU utilization of the deep learning task to be processed, the device memory utilization of the deep learning task to be processed, the bandwidth utilization of the deep learning task to be processed, the number of samples analyzed in each step of the deep learning task to be processed, and the size of the data set of the deep learning task to be processed.
Optionally, analyzing the task information of the deep learning task to be processed and the task information of each task in the waiting queue of each compute node of the GPU cluster, to obtain the similarity between the deep learning task to be processed and each compute node of the GPU cluster, comprises:
inputting the task information of the deep learning task to be processed and the task information of each task in the waiting queue of each compute node of the GPU cluster into a preset similarity prediction model, to obtain the feature vector of the deep learning task to be processed and the feature vector of each task in the waiting queue of each compute node of the GPU cluster;
calculating, according to the following formula, the similarity between the deep learning task to be processed and each task in the waiting queue of each compute node of the GPU cluster:

\cos(\theta_{J_i S_{jk}}) = \frac{J_i \cdot S_{jk}}{\lVert J_i \rVert \, \lVert S_{jk} \rVert}

where J_i denotes the feature vector of the i-th deep learning task to be processed, S_j denotes the feature vectors of the tasks in the waiting queue of the j-th compute node of the GPU cluster, S_j contains n tasks, S_jk denotes the feature vector of the k-th task in the waiting queue of the j-th compute node of the GPU cluster, k ∈ {1, …, n}, θ_{J_i S_jk} is the angle between J_i and S_jk, cos(θ_{J_i S_jk}) denotes the similarity between the i-th deep learning task to be processed and the k-th task in the waiting queue of the j-th compute node of the GPU cluster, ‖J_i‖ is the norm of the vector J_i, and ‖S_jk‖ is the norm of the vector S_jk;
obtaining the similarity between the deep learning task to be processed and each compute node of the GPU cluster according to the similarities between the deep learning task to be processed and the tasks in the waiting queue of each compute node of the GPU cluster.
Optionally, analyzing the task information of each target subtask and the task information of each subtask being executed by each GPU in the target compute node, to obtain the interference level and communication cost of each target subtask, comprises:
mapping the multiple target subtasks respectively to the GPUs in the target compute node, to obtain the mapping relation between the target subtasks and the GPUs;
inputting each target subtask and each subtask being executed by each GPU into a preset performance prediction model, to obtain the performance degradation of the target subtask when executed jointly with each subtask being executed by each GPU;
calculating the interference level of each target subtask according to the performance degradations;
calculating the communication cost of each target subtask according to the available bandwidth between the GPUs and the amount of data required for updating the model.
Optionally, determining, according to the interference level and communication cost of each target subtask, the execution GPU of each target subtask among the GPUs of the target compute node, comprises:
determining the objective function of the target GPU, wherein the objective function is:

F(M) = \alpha \, I(M) + \beta \, C(M)

where F(M) is the objective function value, I(M) denotes the interference level when the mapping relation is M, C(M) denotes the communication cost when the mapping relation is M, α is the weight of I(M), β is the weight of C(M), and α + β = 1;
when F(M) is minimal, the GPUs corresponding to the mapping relation M are the target GPUs.
Optionally, determining, according to the interference level and communication cost of each target subtask, the execution GPU of each target subtask among the GPUs of the target compute node, comprises:
determining the target GPU of each target subtask in the target compute node according to a preset optimization algorithm, the interference level and the communication cost, where the preset optimization algorithm is an ant colony algorithm, a genetic algorithm, a simulated annealing algorithm or a particle swarm optimization algorithm.
In a second aspect, an embodiment of the present application provides a GPU cluster deep learning task parallelization apparatus, comprising:
an acquisition module, configured to obtain a deep learning task to be processed and task-set information of the GPU cluster, the task-set information of the GPU cluster including the task information of each task in the waiting queue of each compute node of the GPU cluster and the task information of each subtask being executed by each GPU in each compute node of the GPU cluster;
a first analysis module, configured to analyze the task information of the deep learning task to be processed and the task information of each task in the waiting queue of each compute node of the GPU cluster, to obtain the similarity between the deep learning task to be processed and each compute node of the GPU cluster;
a compute node determination module, configured to determine, according to the similarity, the target compute node of the deep learning task to be processed in the GPU cluster;
a subtask module, configured to divide the deep learning task to be processed into multiple target subtasks according to the number of GPUs required by the deep learning task to be processed;
a second analysis module, configured to analyze the task information of each target subtask and the task information of each subtask being executed by each GPU in the target compute node, to obtain the interference level and communication cost of each target subtask;
a GPU determination module, configured to determine, according to the interference level and communication cost of each target subtask, the execution GPU of each target subtask among the GPUs of the target compute node.
Optionally, the task information of the deep learning task to be processed comprises at least one of: the maximum CPU usage of the deep learning task to be processed, the host memory utilization of the deep learning task to be processed, the I/O throughput of the deep learning task to be processed, the GPU utilization of the deep learning task to be processed, the device memory utilization of the deep learning task to be processed, the bandwidth utilization of the deep learning task to be processed, the number of samples analyzed in each step of the deep learning task to be processed, and the size of the data set of the deep learning task to be processed.
Optionally, the first analysis module is specifically configured to:
input the task information of the deep learning task to be processed and the task information of each task in the waiting queue of each compute node of the GPU cluster into a preset similarity prediction model, to obtain the feature vector of the deep learning task to be processed and the feature vector of each task in the waiting queue of each compute node of the GPU cluster;
calculate, according to the following formula, the similarity between the deep learning task to be processed and each task in the waiting queue of each compute node of the GPU cluster:

\cos(\theta_{J_i S_{jk}}) = \frac{J_i \cdot S_{jk}}{\lVert J_i \rVert \, \lVert S_{jk} \rVert}

where J_i denotes the feature vector of the i-th deep learning task to be processed, S_j denotes the feature vectors of the tasks in the waiting queue of the j-th compute node of the GPU cluster, S_j contains n tasks, S_jk denotes the feature vector of the k-th task in the waiting queue of the j-th compute node of the GPU cluster, k ∈ {1, …, n}, θ_{J_i S_jk} is the angle between J_i and S_jk, cos(θ_{J_i S_jk}) denotes the similarity between the i-th deep learning task to be processed and the k-th task in the waiting queue of the j-th compute node of the GPU cluster, ‖J_i‖ is the norm of the vector J_i, and ‖S_jk‖ is the norm of the vector S_jk;
obtain the similarity between the deep learning task to be processed and each compute node of the GPU cluster according to the similarities between the deep learning task to be processed and the tasks in the waiting queue of each compute node of the GPU cluster.
Optionally, the second analysis module is specifically configured to:
map the multiple target subtasks respectively to the GPUs in the target compute node, to obtain the mapping relation between the target subtasks and the GPUs;
input each target subtask and each subtask being executed by each GPU into a preset performance prediction model, to obtain the performance degradation of the target subtask when executed jointly with each subtask being executed by each GPU;
calculate the interference level of each target subtask according to the performance degradations;
calculate the communication cost of each target subtask according to the available bandwidth between the GPUs and the amount of data required for updating the model.
Optionally, the second analysis module is specifically configured to:
determine the objective function of the target GPU, wherein the objective function is:

F(M) = \alpha \, I(M) + \beta \, C(M)

where F(M) is the objective function value, I(M) denotes the interference level when the mapping relation is M, C(M) denotes the communication cost when the mapping relation is M, α is the weight of I(M), β is the weight of C(M), and α + β = 1;
when F(M) is minimal, the GPUs corresponding to the mapping relation M are the target GPUs.
Optionally, the second analysis module is specifically configured to:
determine the target GPU of each target subtask in the target compute node according to a preset optimization algorithm, the interference level and the communication cost, where the preset optimization algorithm is an ant colony algorithm, a genetic algorithm, a simulated annealing algorithm or a particle swarm optimization algorithm.
In a third aspect, an embodiment of the present application provides an electronic device, comprising a processor, a communication interface, a memory and a communication bus, wherein:
the processor, the communication interface and the memory communicate with each other via the communication bus;
the memory is configured to store a computer program;
the processor is configured to implement, when executing the program stored in the memory, the GPU cluster deep learning task parallelization method of any one of the above first aspect.
In a fourth aspect, an embodiment of the present application provides a storage medium having instructions stored therein which, when run on a computer, cause the computer to execute the GPU cluster deep learning task parallelization method of any one of the above first aspect.
In a fifth aspect, an embodiment of the present application provides a computer program product containing instructions which, when run on a computer, causes the computer to execute the GPU cluster deep learning task parallelization method of any one of the above first aspect.
With the GPU cluster deep learning task parallelization method, apparatus, electronic device, storage medium and computer program product containing instructions provided by the embodiments of the present application, the similarity between a deep learning task to be processed and each compute node of the GPU cluster is first analyzed to determine the target compute node of the task in the GPU cluster. By fully considering the similarity between the deep learning task to be processed and the other tasks, the likelihood of resource contention on the compute nodes is reduced, thereby improving system resource utilization and the execution efficiency of deep learning tasks. The deep learning task to be processed is then divided into multiple target subtasks according to the number of GPUs it requires, and the interference level and communication cost of each target subtask are analyzed to determine the target GPU of each subtask within the target compute node. By considering the interference level and communication cost of the target subtasks, unbalanced resource allocation across the GPUs of a compute node is avoided, a high degree of parallelism for deep learning tasks is achieved, the resource utilization of the GPU cluster is improved, and the execution efficiency of deep learning tasks is improved at the same time. Of course, implementing any product or method of the present application does not necessarily require achieving all of the above advantages at the same time.
Detailed description of the invention
In order to describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings required in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a schematic diagram of a GPU cluster deep learning task parallelization method according to an embodiment of the present application;
Fig. 2 is a schematic diagram of a GPU cluster deep learning task parallelization apparatus according to an embodiment of the present application;
Fig. 3 is a schematic diagram of an electronic device according to an embodiment of the present application.
Specific embodiment
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only a part of the embodiments of the present application, not all of them. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present application.
The embodiments of the present application disclose a GPU cluster deep learning task parallelization method, apparatus, electronic device, storage medium and computer program product containing instructions, which are described separately below.
An embodiment of the present application provides a GPU cluster deep learning task parallelization method. Referring to Fig. 1, which is a schematic diagram of the GPU cluster deep learning task parallelization method according to an embodiment of the present application, the method includes the following steps:
Step 110, obtaining a deep learning task to be processed and task-set information of the GPU cluster, the task-set information of the GPU cluster including the task information of each task in the waiting queue of each compute node of the GPU cluster and the task information of each subtask being executed by each GPU in each compute node of the GPU cluster.
The GPU cluster deep learning task parallelization method of the embodiment of the present application may be implemented by an electronic device; specifically, the electronic device may be a server.
In order to improve GPU computing performance, GPUs can be scaled out into a GPU cluster, i.e. a GPU cluster composed of multiple GPUs on multiple nodes; such a GPU cluster integrates multiple GPUs and can complete complex computing tasks. Therefore, when large-scale deep learning training is performed in the GPU cluster, there may be multiple deep learning tasks in the GPU cluster, and the deep learning task to be processed is any one of them. Assuming that the system has p deep learning tasks to be processed, a deep learning task to be processed is defined as J_i, i ∈ {1, …, p}. The GPU cluster contains a task set, where the task set includes each task in the waiting queue of each compute node of the GPU cluster and each subtask being executed by each GPU in each compute node of the GPU cluster.
Step 120, analyzing the task information of the deep learning task to be processed and the task information of each task in the waiting queue of each compute node of the GPU cluster, to obtain the similarity between the deep learning task to be processed and each compute node of the GPU cluster.
In order to improve the execution efficiency of the GPU cluster and the utilization of computing resources, the similarity between the deep learning task to be processed and each compute node of the GPU cluster needs to be computed, so as to avoid interference between computing resources.
In a possible embodiment, the task information of the deep learning task to be processed comprises: the maximum CPU usage of the deep learning task to be processed, the host memory utilization of the deep learning task to be processed, the I/O throughput of the deep learning task to be processed, the GPU utilization of the deep learning task to be processed, the device memory utilization of the deep learning task to be processed, the bandwidth utilization of the deep learning task to be processed, the number of samples analyzed in each step of the deep learning task to be processed, and the size of the data set of the deep learning task to be processed.
In the embodiment of the present application, the deep learning task to be processed is characterized as J_i = (jU_CPU, jU_hMem, jThP_I/O, jU_GPU, jU_dMem, jThP_PCIe, batch_size, dsize), where jU_CPU denotes the maximum CPU usage of the deep learning task to be processed, jU_hMem denotes the host memory utilization of the deep learning task to be processed, jThP_I/O denotes the I/O throughput of the deep learning task to be processed, jU_GPU denotes the GPU utilization of the deep learning task to be processed, jU_dMem denotes the device memory utilization of the deep learning task to be processed, jThP_PCIe denotes the bandwidth utilization of the deep learning task to be processed, batch_size denotes the number of samples analyzed in each training step of the deep learning task to be processed, and dsize denotes the size of the data set of the deep learning task to be processed.
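As a purely illustrative sketch (the field names are assumptions chosen to mirror the symbols above, not part of the original disclosure), the task characterization can be held in a small data structure and turned into the numeric vector J_i used later for the similarity computation:

```python
from dataclasses import dataclass, astuple
import numpy as np

@dataclass
class TaskCharacterization:
    """Feature vector J_i of a deep learning task to be processed (assumed field names)."""
    cpu_usage: float            # jU_CPU, maximum CPU usage
    host_mem_util: float        # jU_hMem, host memory utilization
    io_throughput: float        # jThP_I/O
    gpu_util: float             # jU_GPU
    device_mem_util: float      # jU_dMem
    pcie_bandwidth_util: float  # jThP_PCIe
    batch_size: float           # samples analyzed per training step
    dsize: float                # size of the training data set

    def to_vector(self) -> np.ndarray:
        """Return J_i as a numeric vector."""
        return np.asarray(astuple(self), dtype=float)
```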
In a possible embodiment, analyzing the task information of the deep learning task to be processed and the task information of each task in the waiting queue of each compute node of the GPU cluster, to obtain the similarity between the deep learning task to be processed and each compute node of the GPU cluster, comprises:
inputting the task information of the deep learning task to be processed and the task information of each task in the waiting queue of each compute node of the GPU cluster into a preset similarity prediction model, to obtain the feature vector of the deep learning task to be processed and the feature vector of each task in the waiting queue of each compute node of the GPU cluster;
calculating, according to the following formula, the similarity between the deep learning task to be processed and each task in the waiting queue of each compute node of the GPU cluster:

\cos(\theta_{J_i S_{jk}}) = \frac{J_i \cdot S_{jk}}{\lVert J_i \rVert \, \lVert S_{jk} \rVert}

where J_i denotes the feature vector of the i-th deep learning task to be processed, S_j denotes the feature vectors of the tasks in the waiting queue of the j-th compute node of the GPU cluster, S_j contains n tasks, S_jk denotes the feature vector of the k-th task in the waiting queue of the j-th compute node of the GPU cluster, k ∈ {1, …, n}, θ_{J_i S_jk} is the angle between J_i and S_jk, cos(θ_{J_i S_jk}) denotes the similarity between the i-th deep learning task to be processed and the k-th task in the waiting queue of the j-th compute node of the GPU cluster, ‖J_i‖ is the norm of the vector J_i, and ‖S_jk‖ is the norm of the vector S_jk;
obtaining the similarity between the deep learning task to be processed and each compute node of the GPU cluster according to the similarities between the deep learning task to be processed and the tasks in the waiting queue of each compute node of the GPU cluster.
The preset similarity prediction model may be an IASP (Interference-Aware Similarity Prediction) model.
For example, a deep-learning-task-metric matrix M_J is defined to characterize the deep learning tasks to be processed, where each row of the matrix M_J represents one deep learning task to be processed and each column represents one performance feature of the deep learning task to be processed, i.e. the matrix is composed of CPU resource utilization and GPU resource utilization metrics.
The task information of the deep learning task to be processed J_i is obtained, standardized and filled into the corresponding row of the matrix M_J. Each deep learning task to be processed J_i in the queue is analyzed with a small amount of data, and two features are arbitrarily selected from the features of the deep learning task to be processed to obtain their feature values. CPU resource utilization is analyzed using the virtual file system, i.e. the proc file system, to obtain the CPU metrics, and a performance profiling tool, for example the NVIDIA Profiler, is used to analyze GPU resource utilization to obtain the GPU metrics. The performance metrics obtained from the analysis are filled into the characterization vector J_i of the deep learning task to be processed, and the vector J_i is inserted into the matrix M_J.
The IASP model is used to predict the feature values missing from the matrix M_J, and the features predicted by the IASP model are filled into the matrix M_J to obtain a complete matrix. The feature vector of the deep learning task to be processed is obtained through the IASP model. Similarly, the task information of the tasks in the waiting queue of each compute node of the GPU cluster is input into the IASP model to obtain the feature vectors of the tasks in the waiting queues of the compute nodes of the GPU cluster.
Specifically, the IASP model is a deep collaborative filtering model, i.e. a DCF (Deep Collaborative Filtering) model, which performs the similarity prediction. Further, stochastic gradient descent (SGD) can be used to optimize the DCF model.
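The text specifies a DCF model optimized with SGD but does not give its architecture. The following is a minimal sketch that substitutes a plain matrix-factorization collaborative filter trained with SGD, simply to illustrate how the missing entries of the task-metric matrix M_J could be completed and task feature vectors obtained; the factorization itself, the hyperparameters and all names are assumptions:

```python
import numpy as np

def complete_task_metric_matrix(M, rank=4, lr=0.01, reg=0.1, epochs=200, seed=0):
    """Fill missing entries (np.nan) of a task-metric matrix M by matrix
    factorization trained with SGD -- a simplified stand-in for the IASP/DCF model."""
    rng = np.random.default_rng(seed)
    n_tasks, n_metrics = M.shape
    P = 0.1 * rng.standard_normal((n_tasks, rank))    # latent task factors
    Q = 0.1 * rng.standard_normal((n_metrics, rank))  # latent metric factors
    observed = [(i, j, M[i, j]) for i in range(n_tasks)
                for j in range(n_metrics) if not np.isnan(M[i, j])]
    for _ in range(epochs):
        for idx in rng.permutation(len(observed)):
            i, j, v = observed[idx]
            err = v - P[i] @ Q[j]
            P[i] += lr * (err * Q[j] - reg * P[i])    # SGD update of the task factor
            Q[j] += lr * (err * P[i] - reg * Q[j])    # SGD update of the metric factor
    completed = P @ Q.T
    mask = ~np.isnan(M)
    completed[mask] = M[mask]          # keep the values that were actually measured
    return completed, P                # rows of P can serve as task feature vectors
```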
The similarity between the deep learning task to be processed and each task in the waiting queue of each compute node of the GPU cluster is then calculated according to the following formula:

\cos(\theta_{J_i S_{jk}}) = \frac{J_i \cdot S_{jk}}{\lVert J_i \rVert \, \lVert S_{jk} \rVert}

where J_i denotes the feature vector of the i-th deep learning task to be processed, S_j denotes the feature vectors of the tasks in the waiting queue of the j-th compute node of the GPU cluster, S_j contains n tasks, S_jk denotes the feature vector of the k-th task in the waiting queue of the j-th compute node of the GPU cluster, k ∈ {1, …, n}, θ_{J_i S_jk} is the angle between J_i and S_jk, cos(θ_{J_i S_jk}) denotes the similarity between the i-th deep learning task to be processed and the k-th task in the waiting queue of the j-th compute node of the GPU cluster, ‖J_i‖ is the norm of the vector J_i, and ‖S_jk‖ is the norm of the vector S_jk.
According to the similarities between the deep learning task to be processed and the tasks in the waiting queue of each compute node of the GPU cluster, the similarity between the deep learning task to be processed and each compute node of the GPU cluster is obtained. For example, the similarity between J_i and S_j is the average of the similarities between J_i and the tasks in S_j; that is, after the similarities between the deep learning task to be processed J_i and the n tasks in S_j of the waiting queue of the j-th compute node of the GPU cluster are computed separately, the average of these similarities is taken to obtain the similarity between the i-th deep learning task to be processed and the waiting queue of the j-th compute node of the GPU cluster.
Because S_j contains n tasks, the similarities between the i-th deep learning task to be processed and each of the n tasks in the waiting queue of the j-th compute node of the GPU cluster need to be computed, and the average of all these similarities is then taken to obtain the similarity between the i-th deep learning task to be processed and the tasks in the waiting queue of the j-th compute node of the GPU cluster.
Step 130, determining, according to the similarity, the target compute node of the deep learning task to be processed in the GPU cluster.
According to the similarity, the compute node of the GPU cluster with the smallest similarity is taken as the target compute node of the deep learning task to be processed in the GPU cluster.
Selecting the compute node with the smallest similarity avoids contention for computing resources and improves the execution efficiency of the GPU cluster and the utilization of computing resources.
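A minimal sketch of this similarity-based node selection, assuming the cosine similarity and the per-node averaging described above (the function names and data layout are illustrative assumptions, not the patented implementation):

```python
import numpy as np

def cosine_similarity(j_i, s_jk):
    """cos(theta) between the pending task's feature vector J_i and a queued task's S_jk."""
    return float(np.dot(j_i, s_jk) / (np.linalg.norm(j_i) * np.linalg.norm(s_jk)))

def node_similarity(j_i, node_queue):
    """Similarity of J_i to one compute node: the average cosine similarity over
    the n tasks S_j1 .. S_jn waiting in that node's queue."""
    return float(np.mean([cosine_similarity(j_i, s_jk) for s_jk in node_queue]))

def pick_target_node(j_i, queues_by_node):
    """Step 130: choose the compute node whose waiting queue is least similar to J_i,
    which is expected to minimise resource contention."""
    sims = {node: node_similarity(j_i, queue) for node, queue in queues_by_node.items()}
    return min(sims, key=sims.get), sims
```

For instance, with queues_by_node mapping node names to lists of queued-task feature vectors, pick_target_node returns the node with the lowest average similarity together with the per-node scores.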
Step 140, dividing the deep learning task to be processed into multiple target subtasks according to the number of GPUs required by the deep learning task to be processed.
After the deep learning task to be processed has been assigned to the target compute node in the GPU cluster, it is divided into multiple target subtasks according to the number of GPUs required by the deep learning task to be processed. For example, for the deep learning task to be processed J_i, the division into multiple target subtasks is defined as J_i = {T_i^1, …, T_i^n}, where n denotes the number of subtasks of the deep learning task to be processed J_i.
A target subtask is defined as T_i^j, j ∈ {1, …, n}, T_i^j ∈ J_i, where n indicates that the deep learning task to be processed is divided into n target subtasks and T_i^j denotes the j-th target subtask.
Step 150, analyzing the task information of each target subtask and the task information of each subtask being executed by each GPU in the target compute node, to obtain the interference level and communication cost of each target subtask.
The set of GPUs in the target compute node is defined as G_k, k ∈ {1, …, m}, where m indicates that there are m GPUs in the target compute node and G_k is the k-th GPU.
A target subtask is characterized as
T_i^j = (tE_SM, tU_L1, tThP_L1, tU_L2, tThP_L2, tU_Tex, tThP_Tex, tU_DRAM, tThP_DRAM, tThP_L, tThP_S, batch_size, sub_dsize), where tE_SM, tU_L1, tThP_L1, tU_L2, tThP_L2, tU_Tex, tThP_Tex, tU_DRAM, tThP_DRAM, tThP_L and tThP_S respectively denote the SM efficiency, the GPU L1 cache utilization, the L1 read throughput, the GPU L2 cache utilization, the L2 read throughput, the GPU texture cache utilization, the texture cache read throughput, the GPU memory utilization, the memory read throughput, the global load throughput and the global store throughput of the target subtask, and sub_dsize denotes the size of the data set of the target subtask.
In a possible embodiment, analyzing the task information of each target subtask and the task information of each subtask being executed by each GPU in the target compute node, to obtain the interference level and communication cost of each target subtask, comprises:
mapping the multiple target subtasks respectively to the GPUs in the target compute node, to obtain the mapping relation between the target subtasks and the GPUs;
inputting each target subtask and each subtask being executed by each GPU into a preset performance prediction model, to obtain the performance degradation of the target subtask when executed jointly with each subtask being executed by each GPU;
calculating the interference level of each target subtask according to the performance degradations;
calculating the communication cost of each target subtask according to the available bandwidth between the GPUs and the amount of data required for updating the model.
Performance degradation is defined as the ratio between the completion time of a subtask when it runs together with other subtasks and its completion time when it runs in isolation.
Each target subtask and each subtask being executed by each GPU are input into a preset performance prediction model; the preset performance prediction model may be an IAPP (Interference-Aware Performance Prediction) model, and specifically the preset performance prediction model may be a DNN (Deep Neural Network) model.
For example, the target subtasks are defined as J_i = {T_i^1, …, T_i^n}, j ∈ {1, …, n}, T_i^j ∈ J_i, where n indicates that the deep learning task to be processed is divided into n target subtasks and T_i^j denotes the j-th target subtask, and the set of GPUs in the target compute node is defined as G = {G_1, …, G_m}, where m indicates that there are m GPUs in the target compute node. The multiple target subtasks J_i = {T_i^1, …, T_i^n} are respectively mapped to the GPUs in the target compute node, and the mapping relation between the target subtasks and the GPUs is obtained. The mapping relation is defined as M(j) = k, j ∈ {1, …, n}, k ∈ {1, …, m}, where n indicates that the deep learning task to be processed is divided into n target subtasks, m indicates that there are m GPUs in the target compute node, and M(j) = k means that the j-th target subtask T_i^j is assigned to the k-th GPU, i.e. G_k.
Specifically, a subtask-metric matrix M_t is defined to characterize the subtasks, where a subtask is either a target subtask or a subtask being executed by one of the GPUs.
For each subtask being executed by each GPU, the corresponding vectors of the matrix M_t are input into the IAPP model, whose output is a vector z_t in which each element represents the performance degradation of the target subtask when it is executed jointly with the other subtasks.
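The text only states that the IAPP model may be a DNN; its architecture is not disclosed. Below is a minimal sketch assuming a small multilayer perceptron in PyTorch that maps the concatenated metric vectors of a target subtask and a co-located subtask to a predicted slowdown; the layer sizes, the input dimension (the 13-element subtask characterization above) and all names are assumptions:

```python
import torch
import torch.nn as nn

class IAPPNet(nn.Module):
    """Assumed MLP: metric vectors of a target subtask and a co-located subtask
    -> predicted performance degradation (a slowdown >= 1.0 means the task runs slower)."""
    def __init__(self, metric_dim: int = 13, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * metric_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, target_metrics: torch.Tensor, colocated_metrics: torch.Tensor) -> torch.Tensor:
        x = torch.cat([target_metrics, colocated_metrics], dim=-1)
        return self.net(x).squeeze(-1)

# Usage sketch: slowdown of one target subtask against every subtask running on a GPU.
# model = IAPPNet()
# z_t = model(t_metrics.expand(len(running_metrics), -1), running_metrics)
```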
The average performance degradation of a target subtask T_i^j when it is assigned to a GPU is calculated using the following formula:

Slowdown_{jk} = \frac{1}{num_k} \sum_{l=1}^{num_k} slowdown_{jl}

where Slowdown_jk denotes the average performance degradation when the target subtask T_i^j is assigned to the k-th GPU in the target compute node, slowdown_jl denotes the predicted performance degradation of T_i^j when it is executed jointly with the l-th subtask on that GPU, and num_k denotes the number of subtasks jointly executing on the k-th GPU in the target compute node.
The interference level is defined as

I(M) = \frac{1}{n} \sum_{j=1}^{n} I(M(j)), \qquad I(M(j)) = Slowdown_{j,M(j)}

where I(M(j)) denotes the interference level of the j-th target subtask T_i^j when its mapping relation is M(j), and n indicates that the deep learning task to be processed is divided into n target subtasks. The communication cost from G_i to G_j is defined as cc_ij, i ∈ {1, …, m}, j ∈ {1, …, m}, where G_i denotes the i-th GPU in the target compute node and G_j denotes the j-th GPU in the target compute node. The communication cost can be calculated from the available bandwidth between the physical GPUs, where high available bandwidth implies low communication cost and low available bandwidth implies high communication cost.
The communication demand between target subtasks T_i^i and T_i^j is defined as cr_ij, and the communication demand is defined as the amount of data required for updating the model. The communication cost of the mapping is obtained according to the following formula:

C(M) = \sum_{i=1}^{n} \sum_{j=1}^{n} cr_{ij} \cdot cc_{M(i)M(j)}

where C(M) denotes the communication cost when the mapping relation is M.
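A minimal sketch of these two quantities, under the assumptions that the interference of a mapping is the mean predicted slowdown of the subtasks on their assigned GPUs and that the communication cost sums the demand cr_ij weighted by the GPU-to-GPU cost cc over subtask pairs (the exact aggregation and all function names are assumptions):

```python
import numpy as np

def interference(slowdown, mapping):
    """I(M): mean of Slowdown[j, M(j)] over the n target subtasks, where
    slowdown[j, k] is the IAPP-predicted average slowdown of subtask j on GPU k."""
    n = len(mapping)
    return float(np.mean([slowdown[j, mapping[j]] for j in range(n)]))

def communication_cost(cr, cc, mapping):
    """C(M): communication demand cr[i, j] between subtasks i and j, weighted by the
    cost cc[M(i), M(j)] of the link between the GPUs they are mapped to."""
    n = len(mapping)
    return float(sum(cr[i, j] * cc[mapping[i], mapping[j]]
                     for i in range(n) for j in range(n) if i != j))

def objective(slowdown, cr, cc, mapping, alpha=0.5, beta=0.5):
    """F(M) = alpha * I(M) + beta * C(M), with alpha + beta = 1; lower is better."""
    return alpha * interference(slowdown, mapping) + beta * communication_cost(cr, cc, mapping)
```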
Step 160, determining, according to the interference level and communication cost of each target subtask, the execution GPU of each target subtask among the GPUs of the target compute node.
By considering the interference level and communication cost of the target subtasks, unbalanced resource allocation across the GPUs of the compute node is avoided, a high degree of parallelism for deep learning tasks is achieved, the resource utilization of the GPU cluster is improved, and the execution efficiency of deep learning tasks is improved at the same time.
In a possible embodiment, determining, according to the interference level and communication cost of each target subtask, the execution GPU of each target subtask among the GPUs of the target compute node comprises:
determining the objective function of the target GPU, wherein the objective function is:

F(M) = \alpha \, I(M) + \beta \, C(M)

where F(M) is the objective function value, I(M) denotes the interference level when the mapping relation is M, C(M) denotes the communication cost when the mapping relation is M, α is the weight of I(M), β is the weight of C(M), and α + β = 1;
when F(M) is minimal, the GPUs corresponding to the mapping relation M are the target GPUs.
By considering the interference level and communication cost of the target subtasks, unbalanced resource allocation across the GPUs of the compute node is avoided, a high degree of parallelism for deep learning tasks is achieved, the resource utilization of the GPU cluster is improved, and the execution efficiency of deep learning tasks is improved at the same time.
In a possible embodiment, determining, according to the interference level and communication cost of each target subtask, the execution GPU of each target subtask among the GPUs of the target compute node comprises:
determining the target GPU of each target subtask in the target compute node according to a preset optimization algorithm, the interference level and the communication cost, where the preset optimization algorithm is an ant colony algorithm, a genetic algorithm, a simulated annealing algorithm or a particle swarm optimization algorithm.
For example, with the particle swarm optimization algorithm, a group of particles is first initialized, where a particle represents the target subtasks to be allocated. The particle positions are then randomly initialized from the GPU set G = {G_1, …, G_m} and the particle velocities are randomly initialized; the value assigned to each dimension of a particle is the index of a GPU, for example k for the k-th GPU, i.e. a particle represents a mapping of multiple subtasks to GPUs.
Each particle is represented by its current position x and current velocity v, and each particle knows both the best position pbest it has found so far and the global best position gbest found so far by the entire swarm. The principle of the algorithm is to move these particles in order to find the optimal solution. The position of each particle is influenced by its own best position pbest and by the global best position gbest, and each particle updates its best position pbest using the fitness value calculated by the fitness function in each generation. In each iteration, each particle updates its velocity and position using the following formulas:

v^{t+1} = \omega v^{t} + c_1 r_1 (pbest - x^{t}) + c_2 r_2 (gbest - x^{t})
x^{t+1} = x^{t} + v^{t+1}

where ω is the inertia weight of the particle, c_1 and c_2 are acceleration coefficients, and r_1 and r_2 are random numbers between 0 and 1.
The fitness of each particle is evaluated using the objective function F(M) = αI(M) + βC(M), and the velocity and position of each particle are then updated according to the two formulas above.
The iterations are performed until the specified number of iterations is reached or the required iteration precision is met, and the optimal GPU mapping is found.
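A minimal discrete sketch of this search (not the patented implementation): particle positions are vectors of GPU indices, the canonical velocity and position updates above are applied and then rounded and clipped back to valid indices, and the fitness is the objective F(M) built, for example, from the interference and communication-cost helpers sketched earlier; all names and parameter values are assumptions:

```python
import numpy as np

def pso_map_subtasks(fitness, n_subtasks, n_gpus, n_particles=30, iters=100,
                     w=0.7, c1=1.5, c2=1.5, seed=0):
    """Search for a subtask -> GPU mapping that minimises fitness(mapping),
    e.g. fitness = lambda M: objective(slowdown, cr, cc, M, alpha, beta)."""
    rng = np.random.default_rng(seed)
    x = rng.integers(0, n_gpus, size=(n_particles, n_subtasks)).astype(float)  # positions
    v = rng.uniform(-1.0, 1.0, size=(n_particles, n_subtasks))                 # velocities
    pbest = x.copy()
    pbest_val = np.array([fitness(p.astype(int)) for p in x])
    gbest = pbest[pbest_val.argmin()].copy()                                   # global best position
    gbest_val = pbest_val.min()
    for _ in range(iters):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)              # velocity update
        x = np.clip(np.rint(x + v), 0, n_gpus - 1)                             # position update, kept discrete
        vals = np.array([fitness(p.astype(int)) for p in x])
        improved = vals < pbest_val
        pbest[improved] = x[improved]
        pbest_val[improved] = vals[improved]
        if vals.min() < gbest_val:
            gbest, gbest_val = x[vals.argmin()].copy(), vals.min()
    return gbest.astype(int), gbest_val
```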
Determining the target GPU of each target subtask in the target compute node according to the preset optimization algorithm, the interference level and the communication cost maximizes the improvement of the resource utilization of the GPU cluster while also maximizing the improvement of the execution efficiency of deep learning tasks.
By first analyzing the similarity between the deep learning task to be processed and each compute node of the GPU cluster, the target compute node of the deep learning task to be processed in the GPU cluster is determined. By fully considering the similarity between the deep learning task to be processed and the other tasks, the likelihood of resource contention on the compute nodes is reduced, thereby improving system resource utilization and the execution efficiency of deep learning tasks. The deep learning task to be processed is then divided into multiple target subtasks according to the number of GPUs it requires, and the interference level and communication cost of each target subtask are analyzed to determine the target GPU of each target subtask within the target compute node. By considering the interference level and communication cost of the target subtasks, unbalanced resource allocation across the GPUs of the compute node is avoided, a high degree of parallelism for deep learning tasks is achieved, the resource utilization of the GPU cluster is improved, and the execution efficiency of deep learning tasks is improved at the same time.
An embodiment of the present application further provides an apparatus. Referring to Fig. 2, which is a schematic diagram of a GPU cluster deep learning task parallelization apparatus according to an embodiment of the present application, the apparatus includes:
an acquisition module 210, configured to obtain a deep learning task to be processed and task-set information of the GPU cluster, the task-set information of the GPU cluster including the task information of each task in the waiting queue of each compute node of the GPU cluster and the task information of each subtask being executed by each GPU in each compute node of the GPU cluster;
a first analysis module 220, configured to analyze the task information of the deep learning task to be processed and the task information of each task in the waiting queue of each compute node of the GPU cluster, to obtain the similarity between the deep learning task to be processed and each compute node of the GPU cluster;
a compute node determination module 230, configured to determine, according to the similarity, the target compute node of the deep learning task to be processed in the GPU cluster;
a subtask module 240, configured to divide the deep learning task to be processed into multiple target subtasks according to the number of GPUs required by the deep learning task to be processed;
a second analysis module 250, configured to analyze the task information of each target subtask and the task information of each subtask being executed by each GPU in the target compute node, to obtain the interference level and communication cost of each target subtask;
a GPU determination module 260, configured to determine, according to the interference level and communication cost of each target subtask, the execution GPU of each target subtask among the GPUs of the target compute node.
In a possible embodiment, the task information of the deep learning task to be processed comprises at least one of: the maximum CPU usage of the deep learning task to be processed, the host memory utilization of the deep learning task to be processed, the I/O throughput of the deep learning task to be processed, the GPU utilization of the deep learning task to be processed, the device memory utilization of the deep learning task to be processed, the bandwidth utilization of the deep learning task to be processed, the number of samples analyzed in each step of the deep learning task to be processed, and the size of the data set of the deep learning task to be processed.
In a possible embodiment, the first analysis module 220 is specifically configured to:
input the task information of the deep learning task to be processed and the task information of each task in the waiting queue of each compute node of the GPU cluster into a preset similarity prediction model, to obtain the feature vector of the deep learning task to be processed and the feature vector of each task in the waiting queue of each compute node of the GPU cluster;
calculate, according to the following formula, the similarity between the deep learning task to be processed and each task in the waiting queue of each compute node of the GPU cluster:

\cos(\theta_{J_i S_{jk}}) = \frac{J_i \cdot S_{jk}}{\lVert J_i \rVert \, \lVert S_{jk} \rVert}

where J_i denotes the feature vector of the i-th deep learning task to be processed, S_j denotes the feature vectors of the tasks in the waiting queue of the j-th compute node of the GPU cluster, S_j contains n tasks, S_jk denotes the feature vector of the k-th task in the waiting queue of the j-th compute node of the GPU cluster, k ∈ {1, …, n}, θ_{J_i S_jk} is the angle between J_i and S_jk, cos(θ_{J_i S_jk}) denotes the similarity between the i-th deep learning task to be processed and the k-th task in the waiting queue of the j-th compute node of the GPU cluster, ‖J_i‖ is the norm of the vector J_i, and ‖S_jk‖ is the norm of the vector S_jk;
obtain the similarity between the deep learning task to be processed and each compute node of the GPU cluster according to the similarities between the deep learning task to be processed and the tasks in the waiting queue of each compute node of the GPU cluster.
In a possible embodiment, the second analysis module 250 is specifically configured to:
map the multiple target subtasks respectively to the GPUs in the target compute node, to obtain the mapping relation between the target subtasks and the GPUs;
input each target subtask and each subtask being executed by each GPU into a preset performance prediction model, to obtain the performance degradation of the target subtask when executed jointly with each subtask being executed by each GPU;
calculate the interference level of each target subtask according to the performance degradations;
calculate the communication cost of each target subtask according to the available bandwidth between the GPUs and the amount of data required for updating the model.
In a possible embodiment, the second analysis module 250 is specifically configured to:
determine the objective function of the target GPU, wherein the objective function is:

F(M) = \alpha \, I(M) + \beta \, C(M)

where F(M) is the objective function value, I(M) denotes the interference level when the mapping relation is M, C(M) denotes the communication cost when the mapping relation is M, α is the weight of I(M), β is the weight of C(M), and α + β = 1;
when F(M) is minimal, the GPUs corresponding to the mapping relation M are the target GPUs.
In a possible embodiment, the second analysis module 250 is specifically configured to:
determine the target GPU of each target subtask in the target compute node according to a preset optimization algorithm, the interference level and the communication cost, where the preset optimization algorithm is an ant colony algorithm, a genetic algorithm, a simulated annealing algorithm or a particle swarm optimization algorithm.
An embodiment of the present application further provides an electronic device. Referring to Fig. 3, the electronic device comprises a processor 310, a communication interface 320, a memory 330 and a communication bus 340, where the processor 310, the communication interface 320 and the memory 330 communicate with each other via the communication bus 340;
the memory 330 is configured to store a computer program;
the processor 310 is configured to implement the following steps when executing the computer program stored in the memory 330:
obtaining a deep learning task to be processed and task-set information of the GPU cluster, the task-set information of the GPU cluster including the task information of each task in the waiting queue of each compute node of the GPU cluster and the task information of each subtask being executed by each GPU in each compute node of the GPU cluster;
analyzing the task information of the deep learning task to be processed and the task information of each task in the waiting queue of each compute node of the GPU cluster, to obtain the similarity between the deep learning task to be processed and each compute node of the GPU cluster;
determining, according to the similarity, the target compute node of the deep learning task to be processed in the GPU cluster;
dividing the deep learning task to be processed into multiple target subtasks according to the number of GPUs required by the deep learning task to be processed;
analyzing the task information of each target subtask and the task information of each subtask being executed by each GPU in the target compute node, to obtain the interference level and communication cost of each target subtask;
determining, according to the interference level and communication cost of each target subtask, the execution GPU of each target subtask among the GPUs of the target compute node.
For example, the processor 310 of electronic equipment includes the GPU cluster that centralized control unit and multiple GPU are formed, wherein GPU cluster includes multiple calculate nodes, and each calculate node is made of multiple GPU, centralized control unit include data collector, Cluster management unit, node management unit, electronic equipment are used for multiple deep learning task parallel processings in GPU cluster.Its In, data collector obtains the task-set information of deep learning task and GPU cluster to be processed, and cluster management unit analysis is above-mentioned The task letter of each task in the mission bit stream of deep learning task to be processed and each calculate node waiting list of above-mentioned GPU cluster Breath, respectively obtains the similitude of above-mentioned deep learning task to be processed and each calculate node of above-mentioned GPU cluster, and according to above-mentioned phase Like property, determine target computing nodes of the above-mentioned deep learning task to be processed in above-mentioned GPU cluster, node management unit according to Above-mentioned deep learning task to be processed is divided into multiple target and appointed by the GPU number that above-mentioned deep learning task to be processed needs Business, then analyzes each son that each GPU in the mission bit stream and above-mentioned target computing nodes of above-mentioned target subtask is carrying out The mission bit stream of task respectively obtains the disturbance level and communication price of each above-mentioned target subtask, finally according to each above-mentioned mesh The disturbance level of mark subtask and above-mentioned communication price, determine each above-mentioned target respectively in the GPU of above-mentioned target computing nodes The execution GPU of subtask.
Optionally, when executing the program stored in the memory 330, the processor 310 may also implement any of the above GPU cluster deep learning task parallelization methods.
The communication bus of the above electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is used in the figure, but this does not mean that there is only one bus or only one type of bus.
The communication interface is used for communication between the above electronic device and other devices.
The memory may include a random access memory (RAM), or may include a non-volatile memory (NVM), for example at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the processor.
The above processor may be a general-purpose processor, including a central processing unit (CPU), a graphics processing unit (GPU), a network processor (NP), etc.; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
An embodiment of the present application further provides a storage medium in which instructions are stored; when the instructions run on a computer, the computer is caused to execute any of the GPU cluster deep learning task parallelization methods in the above embodiments.
An embodiment of the present application further provides a computer-readable storage medium in which instructions are stored; when the instructions run on a computer, the computer is caused to execute any of the above GPU cluster deep learning task parallelization methods in the above embodiments.
It should be noted that, in this document, the technical features of the optional solutions may be combined to form further solutions as long as they do not contradict each other, and all such solutions fall within the scope disclosed in the present application. Relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article or device that includes the element.
Each embodiment in this specification is described in a related manner; the same or similar parts between the embodiments may refer to each other, and each embodiment focuses on its differences from the other embodiments. In particular, the embodiments of the device, the electronic device and the storage medium are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, reference may be made to the description of the method embodiments.
The above are only preferred embodiments of the present application and are not intended to limit its scope of protection. Any modification, equivalent replacement or improvement made within the spirit and principles of the present application shall be included within its scope of protection.

Claims (10)

1. A GPU cluster deep learning task parallelization method, characterized by comprising:
obtaining a deep learning task to be processed and task-set information of a GPU cluster, wherein the task-set information of the GPU cluster includes the task information of each task in the waiting queue of each compute node of the GPU cluster and the task information of each subtask being executed by each GPU in each compute node of the GPU cluster;
analyzing the task information of the deep learning task to be processed and the task information of each task in the waiting queue of each compute node of the GPU cluster, to obtain the similarity between the deep learning task to be processed and each compute node of the GPU cluster;
determining, according to the similarity, a target compute node for the deep learning task to be processed in the GPU cluster;
dividing the deep learning task to be processed into multiple target subtasks according to the number of GPUs required by the deep learning task to be processed;
analyzing the task information of each target subtask and the task information of each subtask being executed by each GPU in the target compute node, to obtain the interference level and the communication cost of each target subtask;
determining, according to the interference level and the communication cost of each target subtask, an execution GPU for each target subtask among the GPUs of the target compute node.
2. The method according to claim 1, characterized in that the task information of the deep learning task to be processed includes: the maximum CPU usage of the deep learning task to be processed, the host memory utilization of the deep learning task to be processed, the I/O throughput of the deep learning task to be processed, the GPU utilization of the deep learning task to be processed, the device memory utilization of the deep learning task to be processed, the bandwidth utilization of the deep learning task to be processed, the number of samples analyzed per step of the deep learning task to be processed, and the size of the data set of the deep learning task to be processed.
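As an illustration of how the task information listed in claim 2 might be organized in practice, the sketch below collects the listed metrics into a single record and flattens it into the feature vector consumed by the similarity computation of claim 3. Field names and units are assumptions, not terms from the patent.

# A possible container for the task information listed in claim 2.

from dataclasses import dataclass, astuple
from typing import List


@dataclass
class TaskInfo:
    max_cpu_usage: float       # peak CPU utilization of the task, 0.0-1.0
    host_mem_usage: float      # host memory utilization
    io_throughput: float       # I/O throughput, e.g. MB/s
    gpu_utilization: float     # GPU utilization
    device_mem_usage: float    # GPU (device) memory utilization
    bandwidth_usage: float     # bandwidth utilization
    samples_per_step: int      # number of samples analyzed per training step
    dataset_size: float        # size of the training data set, e.g. GB

    def as_feature_vector(self) -> List[float]:
        # Flattened form consumed by the similarity computation of claim 3.
        return [float(x) for x in astuple(self)]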
3. The method according to claim 1, characterized in that analyzing the task information of the deep learning task to be processed and the task information of each task in the waiting queue of each compute node of the GPU cluster, to obtain the similarity between the deep learning task to be processed and each compute node of the GPU cluster, comprises:
inputting the task information of the deep learning task to be processed and the task information of each task in the waiting queue of each compute node of the GPU cluster into a preset similarity prediction model, to obtain the feature vector of the deep learning task to be processed and the feature vector of each task in the waiting queue of each compute node of the GPU cluster;
calculating, according to the following formula, the similarity between the deep learning task to be processed and each task in the waiting queue of each compute node of the GPU cluster:
cos(θ_JiSjk) = (Ji · Sjk) / (||Ji|| × ||Sjk||)
wherein Ji represents the feature vector of the i-th deep learning task to be processed; Sj represents the set of tasks in the waiting queue of the j-th compute node of the GPU cluster, and Sj contains n tasks; Sjk represents the feature vector of the k-th task in the waiting queue of the j-th compute node of the GPU cluster, k ∈ {1, ..., n}; θ_JiSjk is the angle between Ji and Sjk; cos(θ_JiSjk) represents the similarity between the i-th deep learning task to be processed and the k-th task in the waiting queue of the j-th compute node of the GPU cluster; ||Ji|| is the norm of the vector Ji, and ||Sjk|| is the norm of the vector Sjk;
obtaining the similarity between the deep learning task to be processed and each compute node of the GPU cluster according to the similarities between the deep learning task to be processed and the tasks in the waiting queue of each compute node of the GPU cluster.
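The cosine-similarity step of claim 3 can be sketched as follows. The feature vectors are assumed to have already been produced by the similarity prediction model, and the mean over queued tasks is only one possible way to aggregate per-task similarities into a per-node score, since claim 3 does not fix the aggregation.

# Cosine similarity between the feature vector of the task to be processed
# and the feature vector of each queued task, then an assumed mean
# aggregation per compute node.

import math
from typing import List, Sequence


def cosine_similarity(ji: Sequence[float], sjk: Sequence[float]) -> float:
    # cos(theta_JiSjk) = (Ji . Sjk) / (||Ji|| * ||Sjk||)
    dot = sum(a * b for a, b in zip(ji, sjk))
    norm_ji = math.sqrt(sum(a * a for a in ji))
    norm_sjk = math.sqrt(sum(b * b for b in sjk))
    if norm_ji == 0.0 or norm_sjk == 0.0:
        return 0.0
    return dot / (norm_ji * norm_sjk)


def node_similarity(ji: Sequence[float], waiting_queue: List[Sequence[float]]) -> float:
    # Per-node similarity derived from the per-task similarities; the mean
    # is an assumed aggregation.
    if not waiting_queue:
        return 0.0
    sims = [cosine_similarity(ji, sjk) for sjk in waiting_queue]
    return sum(sims) / len(sims)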
4. The method according to claim 1, characterized in that analyzing the task information of each target subtask and the task information of each subtask being executed by each GPU in the target compute node, to obtain the interference level and the communication cost of each target subtask, comprises:
mapping the multiple target subtasks to the GPUs in the target compute node respectively, to obtain a mapping relation between the target subtasks and the GPUs;
inputting each target subtask and each subtask being executed by each GPU into a preset performance prediction model, to obtain the performance degradation when the target subtask is executed together with each subtask being executed by each GPU;
calculating the interference level of each target subtask according to each performance degradation;
calculating the communication cost of each target subtask according to the available bandwidth between the GPUs and the data volume required to update the model.
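A hedged sketch of the two estimates in claim 4 is given below for one candidate mapping. The preset performance prediction model is represented by a caller-supplied predict_degradation function; summing degradations per GPU and dividing the model-update volume by pairwise available bandwidth are illustrative choices, and none of the names come from the patent.

# Hedged sketch of the claim 4 estimates for one candidate mapping.

from typing import Callable, Dict, List


def interference_level(mapping: Dict[str, int],
                       running: Dict[int, List[str]],
                       predict_degradation: Callable[[str, str], float]) -> float:
    # Predicted slow-down of each target subtask when co-located with the
    # subtasks already executing on its assigned GPU.
    total = 0.0
    for subtask, gpu_id in mapping.items():
        for other in running.get(gpu_id, []):
            total += predict_degradation(subtask, other)
    return total


def communication_cost(mapping: Dict[str, int],
                       update_volume_mb: float,
                       bandwidth_mb_s: Dict[frozenset, float]) -> float:
    # Time to exchange model updates between every pair of assigned GPUs,
    # given the available bandwidth between them.
    gpus = sorted(set(mapping.values()))
    cost = 0.0
    for i in range(len(gpus)):
        for j in range(i + 1, len(gpus)):
            link = frozenset((gpus[i], gpus[j]))
            cost += update_volume_mb / bandwidth_mb_s.get(link, 1.0)
    return cost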
5. The method according to claim 1, characterized in that determining, according to the interference level and the communication cost of each target subtask, an execution GPU for each target subtask among the GPUs of the target compute node, comprises:
determining the objective function of the target GPU, wherein the objective function is:
objective function value = α·I(M) + β·C(M)
wherein I(M) represents the interference level when the mapping relation is M, C(M) represents the communication cost when the mapping relation is M, α is the weight of I(M), β is the weight of C(M), and α + β = 1;
when the objective function value is at its minimum, the GPUs corresponding to the mapping relation M are the target GPUs.
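Reading claim 5's objective as the weighted sum α·I(M) + β·C(M), it can be evaluated directly once the interference level and the communication cost of each candidate mapping are known. The sketch below assumes both terms have already been normalized to comparable scales; the default weight of 0.5 is illustrative.

# Direct evaluation of the weighted objective from claim 5.

from typing import Dict, Iterable, Tuple


def objective(interference: float, comm_cost: float, alpha: float = 0.5) -> float:
    beta = 1.0 - alpha                      # enforces alpha + beta = 1
    return alpha * interference + beta * comm_cost


def best_mapping(candidates: Iterable[Tuple[Dict[str, int], float, float]]) -> Dict[str, int]:
    # candidates: (mapping M, I(M), C(M)) triples; the minimizer is the
    # mapping whose GPUs become the target GPUs.
    mapping, _, _ = min(candidates, key=lambda c: objective(c[1], c[2]))
    return mapping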
6. The method according to any one of claims 1-5, characterized in that determining, according to the interference level and the communication cost of each target subtask, an execution GPU for each target subtask among the GPUs of the target compute node, comprises:
determining the target GPU of the target subtask in the target compute node according to a preset optimization algorithm, the interference level and the communication cost, wherein the preset optimization algorithm is an ant colony algorithm, a genetic algorithm, a simulated annealing algorithm or a particle swarm optimization algorithm.
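Claim 6 permits any of several metaheuristics; as one example, the sketch below uses simulated annealing over subtask-to-GPU mappings to minimize a caller-supplied cost (for instance the weighted objective above). The linear cooling schedule, the single-subtask neighbour move, and all parameter defaults are textbook choices, not values taken from the patent.

# Simulated annealing over subtask-to-GPU mappings, as one of the
# metaheuristics named in claim 6.

import math
import random
from typing import Callable, Dict, List


def anneal(subtasks: List[str], gpu_ids: List[int],
           cost: Callable[[Dict[str, int]], float],
           steps: int = 2000, t0: float = 1.0, t_min: float = 1e-3) -> Dict[str, int]:
    current = {s: random.choice(gpu_ids) for s in subtasks}   # random start
    cur_cost = cost(current)
    best, best_cost = dict(current), cur_cost
    for step in range(steps):
        t = max(t_min, t0 * (1.0 - step / steps))             # linear cooling
        neighbour = dict(current)
        neighbour[random.choice(subtasks)] = random.choice(gpu_ids)  # move one subtask
        new_cost = cost(neighbour)
        # Always accept improvements; accept worse mappings with Boltzmann probability.
        if new_cost < cur_cost or random.random() < math.exp((cur_cost - new_cost) / t):
            current, cur_cost = neighbour, new_cost
            if cur_cost < best_cost:
                best, best_cost = dict(current), cur_cost
    return best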
7. A GPU cluster deep learning task parallelization device, characterized by comprising:
an acquisition module, configured to obtain a deep learning task to be processed and task-set information of a GPU cluster, wherein the task-set information of the GPU cluster includes the task information of each task in the waiting queue of each compute node of the GPU cluster and the task information of each subtask being executed by each GPU in each compute node of the GPU cluster;
a first analysis module, configured to analyze the task information of the deep learning task to be processed and the task information of each task in the waiting queue of each compute node of the GPU cluster, to obtain the similarity between the deep learning task to be processed and each compute node of the GPU cluster;
a compute node determining module, configured to determine, according to the similarity, a target compute node for the deep learning task to be processed in the GPU cluster;
a subtask module, configured to divide the deep learning task to be processed into multiple target subtasks according to the number of GPUs required by the deep learning task to be processed;
a second analysis module, configured to analyze the task information of each target subtask and the task information of each subtask being executed by each GPU in the target compute node, to obtain the interference level and the communication cost of each target subtask;
a GPU determining module, configured to determine, according to the interference level and the communication cost of each target subtask, an execution GPU for each target subtask among the GPUs of the target compute node.
8. The device according to claim 7, characterized in that the task information of the deep learning task to be processed includes at least one of: the maximum CPU usage of the deep learning task to be processed, the host memory utilization of the deep learning task to be processed, the I/O throughput of the deep learning task to be processed, the GPU utilization of the deep learning task to be processed, the device memory utilization of the deep learning task to be processed, the bandwidth utilization of the deep learning task to be processed, the number of samples analyzed per step of the deep learning task to be processed, and the size of the data set of the deep learning task to be processed.
9. An electronic device, characterized by comprising: a processor, a communication interface, a memory and a communication bus, wherein
the processor, the communication interface and the memory communicate with each other through the communication bus;
the memory is configured to store a computer program;
the processor, when executing the program stored in the memory, implements the GPU cluster deep learning task parallelization method according to any one of claims 1-6.
10. A storage medium, characterized in that a computer program is stored in the storage medium, and when executed by a processor the computer program implements the GPU cluster deep learning task parallelization method according to any one of claims 1-6.
CN201910675587.9A 2019-07-25 2019-07-25 GPU cluster deep learning task parallelization method and device and electronic equipment Active CN110399222B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910675587.9A CN110399222B (en) 2019-07-25 2019-07-25 GPU cluster deep learning task parallelization method and device and electronic equipment

Publications (2)

Publication Number Publication Date
CN110399222A true CN110399222A (en) 2019-11-01
CN110399222B CN110399222B (en) 2022-01-21

Family

ID=68325235

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910675587.9A Active CN110399222B (en) 2019-07-25 2019-07-25 GPU cluster deep learning task parallelization method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN110399222B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050125369A1 (en) * 2003-12-09 2005-06-09 Microsoft Corporation System and method for accelerating and optimizing the processing of machine learning techniques using a graphics processing unit
US20140201741A1 (en) * 2010-07-26 2014-07-17 Microsoft Corporation Workload interference estimation and performance optimization
US20150301862A1 (en) * 2012-09-14 2015-10-22 International Business Machines Corporation Preferential cpu utilization for tasks
CN105900064A (en) * 2014-11-19 2016-08-24 华为技术有限公司 Method and apparatus for scheduling data flow task
CN107045456A (en) * 2016-02-05 2017-08-15 华为技术有限公司 A kind of resource allocation methods and explorer
CN107135257A (en) * 2017-04-28 2017-09-05 东方网力科技股份有限公司 Task is distributed in a kind of node cluster method, node and system
CN107329828A (en) * 2017-06-26 2017-11-07 华中科技大学 A kind of data flow programmed method and system towards CPU/GPU isomeric groups
CN107766148A (en) * 2017-08-31 2018-03-06 北京百度网讯科技有限公司 A kind of isomeric group and task processing method and device
CN109936604A (en) * 2017-12-18 2019-06-25 北京图森未来科技有限公司 A kind of resource regulating method, device and system
CN109101339A (en) * 2018-08-15 2018-12-28 北京邮电大学 Video task parallel method, device and Heterogeneous Cluster Environment in isomeric group

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
GEORGE TEODORO ET.AL: ""Approximate Similarity Search for Online Multimedia Services on Distributed CPU-GPU Platforms"", 《ARXIV》 *
HAITAO ZHANG ET.AL: ""Learning Driven Parallelization for Large-Scale Video Workload in Hybrid CPU-GPU Cluster"", 《ICPP 2018》 *
JEON, M. ET.AL: ""Multi-tenant GPU clusters for deep learning workloads"", 《TECHNICAL REPORT, MSR-TR-2018》 *
WEI QIAO ET.AL: ""DLTAP: A Network-efficient Scheduling Method for Distributed Deep Learning Workload in Containerized Cluster Environment"", 《ITM WEB OF CONFERENCES》 *

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112965809A (en) * 2019-12-12 2021-06-15 深圳市优必选科技股份有限公司 Deep learning task processing system and method
CN111104289A (en) * 2019-12-25 2020-05-05 创新奇智(上海)科技有限公司 System and method for checking efficiency of GPU (graphics processing Unit) cluster
CN111258735A (en) * 2020-01-16 2020-06-09 中国人民解放军国防科技大学 Deep learning task scheduling method supporting QoS (quality of service) perception of user
CN111309479A (en) * 2020-02-14 2020-06-19 北京百度网讯科技有限公司 Method, device, equipment and medium for realizing task parallel processing
US11954522B2 (en) 2020-02-14 2024-04-09 Beijing Baidu Netcom Science And Technology Co., Ltd. Method for processing tasks in parallel, device and storage medium
CN111866187A (en) * 2020-06-30 2020-10-30 中科院计算所西部高等技术研究院 Task scheduling method of distributed deep learning reasoning cloud platform
CN111866187B (en) * 2020-06-30 2022-10-04 中科院计算所西部高等技术研究院 Task scheduling method for distributed deep learning reasoning cloud platform
CN111913799B (en) * 2020-07-14 2024-04-19 北京华夏启信科技有限公司 Video stream online analysis task scheduling method and computer equipment
CN111913799A (en) * 2020-07-14 2020-11-10 北京华夏启信科技有限公司 Video stream online analysis task scheduling method and computer equipment
CN112416585B (en) * 2020-11-20 2024-03-15 南京大学 Deep learning-oriented GPU resource management and intelligent scheduling method
CN112416585A (en) * 2020-11-20 2021-02-26 南京大学 GPU resource management and intelligent scheduling method for deep learning
CN112584143A (en) * 2020-12-02 2021-03-30 浙江大华技术股份有限公司 Video coding method, device and system and computer readable storage medium
CN112584143B (en) * 2020-12-02 2022-09-06 浙江大华技术股份有限公司 Video coding method, device and system and computer readable storage medium
WO2022116142A1 (en) * 2020-12-04 2022-06-09 深圳大学 Resource scheduling method based on graph neural network
CN113194086B (en) * 2021-04-27 2022-05-27 新华三信息安全技术有限公司 Anti-attack method and device
CN113194086A (en) * 2021-04-27 2021-07-30 新华三信息安全技术有限公司 Anti-attack method and device
CN113377520A (en) * 2021-07-07 2021-09-10 北京百度网讯科技有限公司 Resource scheduling method, device, equipment and storage medium
CN113900793A (en) * 2021-07-29 2022-01-07 苏州浪潮智能科技有限公司 Server cluster and deep learning aggregate communication system and method thereof
CN113900793B (en) * 2021-07-29 2023-11-10 苏州浪潮智能科技有限公司 Server cluster and deep learning aggregate communication system and method thereof
CN114285766A (en) * 2021-08-20 2022-04-05 腾讯科技(深圳)有限公司 Network bandwidth detection method and device, electronic equipment and storage medium
CN114285766B (en) * 2021-08-20 2023-06-13 腾讯科技(深圳)有限公司 Network bandwidth detection method and device, electronic equipment and storage medium
CN114116220A (en) * 2021-11-29 2022-03-01 苏州浪潮智能科技有限公司 GPU (graphics processing Unit) sharing control method, GPU sharing control device and storage medium
CN114138449A (en) * 2021-12-14 2022-03-04 河南省儿童医院郑州儿童医院 Rehabilitation training system based on virtual reality
WO2024022046A1 (en) * 2022-07-28 2024-02-01 华为技术有限公司 Deep learning system and method
CN115248728A (en) * 2022-09-21 2022-10-28 之江实验室 Distributed training task scheduling method, system and device for intelligent computing
CN115373861A (en) * 2022-10-26 2022-11-22 小米汽车科技有限公司 GPU resource scheduling method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN110399222B (en) 2022-01-21

Similar Documents

Publication Publication Date Title
CN110399222A (en) GPU cluster deep learning task parallel method, device and electronic equipment
Mapetu et al. Low-time complexity and low-cost binary particle swarm optimization algorithm for task scheduling and load balancing in cloud computing
CN1956457B (en) Method and apparatus for arranging mesh work in mesh computing system
CN110389820B (en) Private cloud task scheduling method for resource prediction based on v-TGRU model
US9239734B2 (en) Scheduling method and system, computing grid, and corresponding computer-program product
CN105373432B (en) A kind of cloud computing resource scheduling method based on virtual resource status predication
Abdel‐Basset et al. IEGA: an improved elitism‐based genetic algorithm for task scheduling problem in fog computing
You et al. Comprehensive workload analysis and modeling of a petascale supercomputer
Chen et al. Scheduling independent tasks in cloud environment based on modified differential evolution
Li et al. An effective scheduling strategy based on hypergraph partition in geographically distributed datacenters
Rani et al. An efficient and scalable hybrid task scheduling approach for cloud environment
CN108427602B (en) Distributed computing task cooperative scheduling method and device
Ding et al. Kubernetes-oriented microservice placement with dynamic resource allocation
CN115168027A (en) Calculation power resource measurement method based on deep reinforcement learning
CN113553160A (en) Task scheduling method and system for edge computing node of artificial intelligence Internet of things
Hu et al. Improved heuristic job scheduling method to enhance throughput for big data analytics
Xilin et al. Resource allocation optimization of equipment development task based on MOPSO algorithm
CN112000460A (en) Service capacity expansion method based on improved Bayesian algorithm and related equipment
CN117349026B (en) Distributed computing power scheduling system for AIGC model training
Ghafari et al. E-AVOA-TS: Enhanced African vultures optimization algorithm-based task scheduling strategy for fog–cloud computing
Li et al. Dynamic data replacement and adaptive scheduling policies in spark
Zhou et al. Stability property of clouds and cooperative scheduling policies on multiple types of resources in cloud computing
CN116225708A (en) GPU resource scheduling method and device
Yassir et al. Graph-based model and algorithm for minimising big data movement in a cloud environment
Yu [Retracted] Research on Optimization Strategy of Task Scheduling Software Based on Genetic Algorithm in Cloud Computing Environment

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant