CN110399222B

CN110399222B - GPU cluster deep learning task parallelization method and device and electronic equipment

Info

Publication number: CN110399222B
Application number: CN201910675587.9A
Authority: CN
Inventors: 张海涛; 耿欣; 马华东
Original assignee: Beijing University of Posts and Telecommunications
Current assignee: Beijing University of Posts and Telecommunications
Priority date: 2019-07-25
Filing date: 2019-07-25
Publication date: 2022-01-21
Anticipated expiration: 2039-07-25
Also published as: CN110399222A

Abstract

The GPU cluster deep learning task parallelization method, the device and the electronic equipment provided by the embodiment of the application relate to the technical field of Internet, the similarity of a deep learning task to be processed and each computing node of a GPU cluster is firstly analyzed, a target computing node of the deep learning task to be processed in the GPU cluster is determined, the possibility of computing node resource contention is reduced, the utilization rate and the execution efficiency of deep learning task system resources are improved, the deep learning task to be processed is divided into a plurality of target subtasks according to the number of GPUs required by the deep learning task to be processed, the interference level and the communication cost of the target subtasks are analyzed, the target GPU of the target subtasks in the target computing node is determined, the resource distribution imbalance on the GPUs in the computing nodes is avoided, the high parallelization of the deep learning task is realized, and the resource utilization rate of the GPU cluster is improved, meanwhile, the execution efficiency of the deep learning task is improved.

Description

GPU cluster deep learning task parallelization method and device and electronic equipment

Technical Field

The application relates to the technical field of internet, in particular to a GPU cluster deep learning task parallelization method and device and electronic equipment.

Background

With the continuous and deep study, the deep learning technology obtains great achievements in the fields of computer vision, speech recognition, text processing and the like, and brings great convenience to the life of people. However, the complex neural network model and the huge amount of data put higher demands on the computing power. The GPU (Graphic Processing Unit) cluster integrates a plurality of GPU computing resources, provides powerful and efficient parallel computing capability for the computation-intensive deep learning task, and effectively meets the computing requirements of a plurality of deep learning tasks.

However, when the deep learning task runs on the resource-shared GPU cloud platform, the execution efficiency of the deep learning task is affected by interference caused by resource competition among other commonly-executed tasks. Therefore, for the deep learning task in the GPU cluster, how to perform task parallelization scheduling according to the information of the task and the requirement of the task on resources is achieved, the nodes in the GPU cluster and a plurality of GPU resources on the nodes are reasonably utilized, and the method is of great importance for optimizing the execution time of the deep learning task, improving the processing performance of the whole computing task and improving the resource utilization rate of the system.

The mainstream GPU cluster at present mainly employs a traditional scheduler (for example, kubernets, Yarn) to schedule the deep learning task. Through counting the use condition of the whole resources, the resources are reasonably allocated to the GPU for use, and enough resources are ensured to run in the life cycle of the GPU.

Although the method realizes parallelization of the deep learning task to a certain extent, the method mainly considers the use condition of resources, does not consider the physical characteristics of the resources and the characteristics of the task, cannot realize efficient parallelization of the deep learning task, and can reduce the execution efficiency of the deep learning workload. Meanwhile, the method does not support fine-grained multitask allocation of the GPU, GPU resources on the nodes cannot be fully utilized, efficient execution of deep learning tasks is affected, GPU utilization rate of the nodes is reduced, and therefore resource utilization rate of a GPU cluster is affected.

Disclosure of Invention

The embodiment of the application aims to provide a parallelization method and device for a deep learning task of a GPU cluster, electronic equipment, a storage medium and a computer program product containing instructions, so that high parallelization of the deep learning task is realized, and the utilization rate of GPU resources in the GPU cluster and the execution efficiency of the deep learning task are improved.

The specific technical scheme is as follows:

in a first aspect, an embodiment of the present application provides a GPU cluster deep learning task parallelization method, including:

acquiring task set information of deep learning tasks to be processed and a GPU cluster, wherein the task set information of the GPU cluster comprises task information of each task in a waiting queue of each computing node of the GPU cluster and task information of each subtask being executed by each GPU in each computing node of the GPU cluster;

analyzing task information of the deep learning task to be processed and task information of each task in a waiting queue of each computing node of the GPU cluster to respectively obtain the similarity of the deep learning task to be processed and each computing node of the GPU cluster;

determining target computing nodes of the deep learning task to be processed in the GPU cluster according to the similarity;

dividing the deep learning task to be processed into a plurality of target subtasks according to the number of GPUs required by the deep learning task to be processed;

analyzing the task information of the target subtask and the task information of each subtask being executed by each GPU in the target computing node to respectively obtain the interference level and the communication cost of each target subtask;

and respectively determining the execution GPU of each target subtask in the GPU of the target computing node according to the interference level and the communication cost of each target subtask.

Optionally, the task information of the deep learning task to be processed includes: the maximum CPU utilization rate of the deep learning task to be processed, the host memory utilization rate of the deep learning task to be processed, the I/O throughput of the deep learning task to be processed, the GPU utilization rate of the deep learning task to be processed, the device memory utilization rate of the deep learning task to be processed, the bandwidth utilization rate of the deep learning task to be processed, the number of samples analyzed in each step of the deep learning task to be processed, and the size of a data set of the deep learning task to be processed.

Optionally, the analyzing the task information of the deep learning task to be processed and the task information of each task in the waiting queue of each computing node of the GPU cluster to obtain the similarity between the deep learning task to be processed and each computing node of the GPU cluster respectively includes:

inputting the deep learning task information to be processed and the task information of each task in each computing node waiting queue of the GPU cluster into a preset similarity prediction model to respectively obtain the feature vector of the deep learning task to be processed and the feature vector of each task in each computing node waiting queue of the GPU cluster;

respectively calculating the similarity between the deep learning task to be processed and each task in each calculation node waiting queue of the GPU cluster according to the following formula:

wherein, J_iFeature vector representing the ith deep learning task to be processed, S_jA feature vector representing a task in a waiting queue of the j-th compute node in the GPU cluster, wherein S_jComprising n tasks, S_jkRepresenting the feature vector of the kth task in a waiting queue of the jth computing node in the GPU cluster, wherein k belongs to { 1., n }, theta_JiSjkIs the same as J_iAnd said S_jkAngle of (a) to degrees, cos (theta)_JiSjk)_JiSjkRepresenting the similarity between the ith deep learning task to be processed and the kth task in the waiting queue of the jth computing node in the GPU cluster, | J_iI is J_iModulo of vector, | S_jkI is S_jkA modulus of the vector;

and obtaining the similarity between the deep learning task to be processed and each computing node of the GPU cluster according to the similarity between the deep learning task to be processed and each task in each computing node waiting queue of the GPU cluster.

Optionally, the analyzing the task information of the target subtask and the task information of each subtask being executed by each GPU in the target computing node to obtain the interference level and the communication cost of each target subtask respectively includes:

mapping the plurality of target subtasks to each GPU in the target computing node respectively to obtain the mapping relation between the target subtasks and each GPU;

respectively inputting the target subtask and each subtask being executed by each GPU into a preset performance prediction model, and respectively obtaining the performance degradation when the target subtask and each subtask being executed by each GPU are executed together;

respectively calculating the interference level of the target subtask according to each performance degradation;

and respectively calculating the communication cost of the target subtask according to the available bandwidth among the GPUs and the data volume required by the updated model.

Optionally, the determining, in the GPU of the target computing node, the execution GPU of each target subtask according to the interference level and the communication cost of each target subtask includes:

determining an objective function of a target GPU, wherein the objective function is as follows:

in the formula

For the objective function value, i (M) represents the interference level when the mapping relationship is M, c (M) represents the communication cost when the mapping relationship is M, α is the weight of i (M), β is the weight of c (M), and follows α + β ═ 1;

when in use

And when the mapping relation is M, the corresponding GPU is the target GPU.

and determining a target GPU of the target subtask in the target computing node according to a preset optimization algorithm, the interference level and the communication cost, wherein the preset optimization algorithm is an ant colony algorithm, a genetic algorithm, a simulated annealing algorithm or a particle swarm optimization algorithm.

In a second aspect, an embodiment of the present application provides a GPU cluster deep learning task parallelization apparatus, including:

the system comprises an acquisition module, a processing module and a processing module, wherein the acquisition module is used for acquiring task set information of deep learning tasks to be processed and GPU clusters, and the task set information of the GPU clusters comprises task information of each task in a waiting queue of each computing node of the GPU clusters and task information of each subtask being executed by each GPU in each computing node of the GPU clusters;

the first analysis module is used for analyzing the task information of the deep learning task to be processed and the task information of each task in each computing node waiting queue of the GPU cluster to respectively obtain the similarity between the deep learning task to be processed and each computing node of the GPU cluster;

the calculation node determination module is used for determining a target calculation node of the deep learning task to be processed in the GPU cluster according to the similarity;

the subtask module is used for dividing the deep learning task to be processed into a plurality of target subtasks according to the number of GPUs required by the deep learning task to be processed;

the second analysis module is used for analyzing the task information of the target subtask and the task information of each subtask being executed by each GPU in the target computing node to respectively obtain the interference level and the communication cost of each target subtask;

and the GPU determining module is used for determining the execution GPU of each target subtask in the GPU of the target computing node according to the interference level and the communication cost of each target subtask.

Optionally, the task information of the deep learning task to be processed includes: the processing method comprises the following steps of obtaining a maximum CPU utilization rate of the deep learning task to be processed, a host memory utilization rate of the deep learning task to be processed, I/O throughput of the deep learning task to be processed, GPU utilization rate of the deep learning task to be processed, device memory utilization rate of the deep learning task to be processed, bandwidth utilization rate of the deep learning task to be processed, the number of samples analyzed in each step of the deep learning task to be processed, and the size of a data set of the deep learning task to be processed.

Optionally, the first analysis module is specifically configured to:

inputting the deep learning task information to be processed and the task information in the waiting queue of each computing node of the GPU cluster into a preset similarity prediction model to respectively obtain a feature vector of the deep learning task to be processed and a feature vector of a task in the waiting queue of each computing node of the GPU cluster;

and obtaining the similarity between the deep learning task to be processed and each computing node of the GPU cluster according to the similarity between the deep learning task to be processed and each task in each computing node waiting queue of the GPU cluster. Optionally, the second analysis module is specifically configured to:

Optionally, the second analysis module is specifically configured to:

in the formula

when in use

And when the mapping relation is M, the corresponding GPU is the target GPU.

Optionally, the second analysis module is specifically configured to:

In a third aspect, an embodiment of the present application provides an electronic device, including: a processor, a communication interface, a memory, and a communication bus, wherein:

the processor, the communication interface and the memory complete mutual communication through a communication bus;

the memory is used for storing a computer program;

the processor is configured to implement the GPU cluster deep learning task parallelization method according to any one of the first aspect, when executing a program stored in a memory.

In a fourth aspect, an embodiment of the present application provides a storage medium, where instructions are stored in the storage medium, and when the instructions are executed on a computer, the instructions cause the computer to execute the GPU cluster deep learning task parallelization method according to any one of the first aspects.

In a fifth aspect, an embodiment of the present application provides a computer program product containing instructions, which when executed on a computer, causes the computer to execute the GPU cluster deep learning task parallelization method according to any one of the first aspect.

The GPU cluster deep learning task parallelization method, the device, the electronic equipment, the storage medium and the computer program product containing instructions provided by the embodiment of the application determine a target computing node of a deep learning task to be processed in a GPU cluster by analyzing the similarity of the deep learning task to be processed and each computing node of the GPU cluster, reduce the possibility of computing node resource contention by fully considering the similarity between the deep learning task to be processed and other tasks, thereby improving the utilization rate and the execution efficiency of deep learning task system resources, divide the deep learning task to be processed into a plurality of target subtasks according to the number of GPUs required by the deep learning task to be processed, analyze the interference level and the communication cost of the target subtasks, thereby determining the target GPU of the target subtasks in the target computing nodes, and by considering the interference level and the communication cost of the target subtasks, resource distribution imbalance on the GPU in the computing nodes is avoided, high parallelization of deep learning tasks is achieved, the resource utilization rate of the GPU cluster is improved, and meanwhile execution efficiency of the deep learning tasks is improved. Of course, not all of the above advantages need be achieved in the practice of any one product or method of the present application.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

Fig. 1 is a schematic diagram of a GPU cluster deep learning task parallelization method according to an embodiment of the present application;

fig. 2 is a schematic diagram of a GPU cluster deep learning task parallelization apparatus according to an embodiment of the present disclosure;

fig. 3 is a schematic diagram of an electronic device according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

The embodiment of the application discloses a GPU cluster deep learning task parallelization method and device, electronic equipment, a storage medium and a computer program product containing instructions, which are respectively described below.

The embodiment of the application provides a parallelization method for a deep learning task of a GPU cluster, and referring to fig. 1, fig. 1 is a schematic diagram of the parallelization method for the deep learning task of the GPU cluster, and the parallelization method comprises the following steps:

and step 110, acquiring task set information of the deep learning task to be processed and the GPU cluster, wherein the task set information of the GPU cluster comprises task information of each task in a waiting queue of each computing node of the GPU cluster and task information of each subtask being executed by each GPU in each computing node of the GPU cluster.

The GPU cluster deep learning task parallelization method can be realized through electronic equipment, and specifically, the electronic equipment can be a server.

In order to improve the computation performance of the GPU, the GPU cluster can be transversely expanded, namely the GPU cluster is formed by a plurality of GPUs on a plurality of nodes, and therefore the GPU cluster integrates the plurality of GPUs to complete complex computation tasks. Therefore, when the GPU cluster is used for large-scale deep learning random training, the number of deep learning tasks in the GPU cluster can be multiple, the deep learning task to be processed is any one of the deep learning tasks, the deep learning task to be processed is defined as J if the system has p deep learning tasks to be processed_iI ∈ {1, …, p }. The GPU cluster comprises a task set, wherein the task set comprises tasks in a waiting queue of each computing node of the GPU cluster and subtasks which are executed by each GPU in each computing node of the GPU cluster.

And step 120, analyzing the task information of the deep learning task to be processed and the task information of each task in the waiting queue of each computing node of the GPU cluster to respectively obtain the similarity between the deep learning task to be processed and each computing node of the GPU cluster.

In order to improve the execution efficiency of the GPU cluster and the utilization rate of the computing resources, the similarity between the deep learning task to be processed and each computing node of the GPU cluster needs to be calculated, so as to avoid interference between the computing resources.

In a possible implementation manner, the task information of the deep learning task to be processed includes: the maximum CPU utilization rate of the deep learning task to be processed, the host memory utilization rate of the deep learning task to be processed, the I/O throughput of the deep learning task to be processed, the GPU utilization rate of the deep learning task to be processed, the device memory utilization rate of the deep learning task to be processed, the bandwidth utilization rate of the deep learning task to be processed, the number of samples analyzed in each step of the deep learning task to be processed, and the size of a data set of the deep learning task to be processed.

In the embodiment of the present application, the above deep learning task to be processed is characterized as J_i＝(jU^CPU，jU^hMem，jThP^I/O，jU^GPU，jU^dMem，jThP^PCIeBatch size, dsize), of which jU^CPUjU representing the maximum CPU usage of the pending deep learning task^hMemRepresenting the host memory utilization rate and jThP of the deep learning task to be processed^I/OjU representing the I/O throughput of the pending deep learning task^GPUjU representing GPU utilization of the deep learning task to be processed^dMemRepresenting the device memory usage rate, jThp, of the deep learning task to be processed^PCIeThe bandwidth utilization rate of the deep learning task to be processed is represented, the batch _ size represents the number of samples to be analyzed in each training step of the deep learning task to be processed, and the dsize represents the size of the data set of the deep learning task to be processed.

In a possible implementation manner, the analyzing task information of the deep learning task to be processed and task information of each task in a waiting queue of each compute node of the GPU cluster to obtain similarity between the deep learning task to be processed and each compute node of the GPU cluster respectively includes:

and respectively calculating the similarity between the deep learning task to be processed and each task in each calculation node waiting queue of the GPU cluster according to the following formula:

wherein, J_iFeature vector representing the ith deep learning task to be processed, S_jA feature vector representing a task in a waiting queue of the j-th compute node in the GPU cluster, wherein S_jComprising n tasks, S_jkRepresenting the feature vector of the kth task in a waiting queue of the jth computing node in the GPU cluster, wherein k belongs to { 1., n }, theta_JiSjkIs the above J_iAnd the above-mentioned S_jkAngle of (a) to degrees, cos (theta)_JiSjk)_JiSjkRepresenting the similarity between the ith deep learning task to be processed and the kth task in the waiting queue of the jth computing node in the GPU cluster, | J_iI is J_iModulo of vector, | S_jkI is S_jkA modulus of the vector;

The preset Similarity Prediction model may be an IASP (Interference-Aware Similarity Prediction) model.

For example, a to-be-processed deep learning task-metric matrix M is defined_JTo characterize the deep learning task, wherein the matrix M_JEach row of the table represents one deep learning task to be processed, and each column represents one performance characteristic of the deep learning task to be processed, namely the performance characteristic is composed of CPU resource utilization rate and GPU resource utilization rate.

Obtaining the deep learning task J to be processed_iStandardizing the task information of the deep learning task to be processed and filling the task information into the matrix M_JFor each to-be-processed deep learning task J in the queue in the corresponding row_iAnd analyzing the data by using a small data volume, and randomly selecting two features from the features of the deep learning task to be processed to obtain the feature values of the two features. Using virtual file systems, i.e. proc textAnalyzing the CPU resource utilization rate by the system to obtain CPU measurement, analyzing the GPU resource utilization rate by using a performance analysis tool such as an NVIDIA Profiler tool to obtain GPU measurement, and filling the performance measurement obtained by analysis into a deep learning task characterization vector J to be processed_iAnd vector J is combined_iInserted into matrix M_JIn (1).

Predicting the above matrix M using IASP model_jThe feature predicted by IASP model to the above matrix M_jAnd filling to obtain a complete matrix. And obtaining the characteristic vector of the deep learning task to be processed through an IASP model. Similarly, the task information of the tasks in the waiting queue of each computing node of the GPU cluster is input into an IASP model, and the feature vectors of the tasks in the waiting queue of each computing node of the GPU cluster are obtained.

Specifically, the IASP model is a depth Collaborative Filtering model, such as DCF (Deep Collaborative Filtering) model, for similarity prediction. Further, the DCF model can be optimized using a stochastic gradient descent method (SGD).

wherein, J_iFeature vector representing the ith deep learning task to be processed, S_jA feature vector representing a task in a waiting queue of the j-th compute node in the GPU cluster, wherein S_jComprising n tasks, S_jkRepresenting the feature vector of the kth task in a waiting queue of the jth computing node in the GPU cluster, wherein k belongs to { 1., n }, theta_JiSjkIs the above J_iAnd the above-mentioned S_jkAngle of (a) coS (theta)_JiSjk)_JiSjkRepresenting the similarity between the ith deep learning task to be processed and the kth task in the waiting queue of the jth computing node in the GPU cluster, | J_iI is J_iModulo of vector, | S_jkI is S_jkA modulus of the vector;

and obtaining the similarity between the deep learning task to be processed and each computing node of the GPU cluster according to the similarity between the deep learning task to be processed and each task in each computing node waiting queue of the GPU cluster. For example, J_iAnd S_jHas a similarity of J_iAnd S_jThe average value of the similarity of each task in the process is calculated, namely the deep learning task J to be processed is calculated respectively_iAnd S in waiting queue of each computing node of the GPU cluster_jAfter the similarity between the n tasks, calculating the average value of the similarities to obtain the similarity between the ith deep learning task to be processed and the task in the jth computing node waiting queue of the GPU cluster.

Because of S_jIf n tasks are included, the similarity between the ith deep learning task to be processed and each task in the jth computing node waiting queue in the GPU cluster needs to be calculated, and the average value of all the similarities is calculated, so as to obtain the similarity between the ith deep learning task to be processed and the task in the jth computing node waiting queue in the GPU cluster.

And step 130, determining target computing nodes of the deep learning task to be processed in the GPU cluster according to the similarity.

And according to the similarity, the calculation node with the minimum similarity in the GPU cluster is the target calculation node of the deep learning task to be processed in the GPU cluster.

And the computing node with the minimum similarity is selected, so that the contention of computing resources can be avoided, and the execution efficiency of the GPU cluster and the utilization rate of the computing resources are improved.

And 140, dividing the deep learning task to be processed into a plurality of target subtasks according to the number of GPUs required by the deep learning task to be processed.

After the deep learning task to be processed is distributed to the target computing nodes in the GPU cluster, the deep learning task to be processed is processed according to the number of GPUs required by the deep learning task to be processedThe learning task is divided into a plurality of target subtasks, such as a to-be-processed deep learning task J_iDefining the deep learning task to be processed into a plurality of target subtasks

Wherein n represents a deep learning task J to be processed_iThe number of subtasks in (1).

Defining a target subtask as

j∈{1，...，n}，

n represents that the deep learning task to be processed is divided into n target subtasks in total, wherein

Representing the jth target subtask.

Step 150, analyzing the task information of the target subtask and the task information of each subtask being executed by each GPU in the target computing node, and obtaining the interference level and the communication cost of each target subtask respectively.

Defining the GPU set in the target computing node as G_kK is formed by { 1.,..,. m }, wherein m GPUs in the target computing node are represented, and G is the total number of the GPU in the target computing node_kThe kth GPU.

Defining a target subtask as

Wherein tE^SM、tU^L1、tThP^L1、tU^L2、tThP^L2、tU^Tex、tThP^Tex、tU^DRAM、tThP^DRAM、tThP^LAnd tThP^SRespectively representing SM efficiency, GPU L1 cache for the above-described target subtasksMemory usage, L1 read throughput, GPU L2 cache usage, L2 read throughput, GPU texture cache usage, texture cache read throughput, GPU memory usage, memory read throughput, global load throughput, global store throughput, sub _ dsize representing the size of the data set of the target subtask.

In one possible implementation, the analyzing the task information of the target subtask and the task information of each subtask being executed by each GPU in the target computing node to obtain the interference level and the communication cost of each target subtask respectively includes:

Defining the ratio of the completion time of a subtask running with other subtasks to the completion time of the individual run represents performance degradation.

The target subtasks and the subtasks being executed by the GPUs are respectively input into a preset Performance Prediction model, where the preset Performance Prediction model may be an IAPP (Interference-Aware Performance Prediction) model, and specifically, the preset Performance Prediction model may be a DNN (Deep Neural Network) model.

For example, define a target subtask as

J∈{1，...，n}，

Representing the jth target subtask, and defining each GPU set in the target computing nodes as G ═ G₁…G_mIndicating that there are, one GPU in the target computing nodes, and converting the plurality of target subtasks into a plurality of target subtasks

Mapping the target sub tasks to the GPUs in the target computing nodes respectively to obtain mapping relations between the target sub tasks and the GPUs, defining the mapping relations to be M (j) ═ k, j ∈ { 1., n }, k ∈ { 1., m }, wherein n represents that the deep learning task to be processed is divided into n target sub tasks in total, m represents that m GPUs are in total in the target computing nodes, and M (j) ═ k represents that the j target sub task is

Is assigned to the kth GPU, G_k。

In particular, a subtask-metric matrix M is defined_tTo characterize the subtasks, which are the subtasks to be executed by the target subtasks or the GPUs.

Each subtask being executed by each GPU is to form a matrix M_tThe output of the vector-in IAPP model is the vector z_tVector z_tEach element in (a) represents a performance degradation of the target subtask when executed in conjunction with other subtasks.

Calculating the objective subtask using the following formula

The average performance allocated to one GPU degrades.

Wherein slow down_jkRepresenting target subtasks

And the average performance degradation distributed to the kth GPU in the target computing node is reduced, and numk represents the number of commonly executed subtasks on the kth GPU in the target computing node.

Defining an interference level as

Wherein I (M (j)) represents the jth target subtask

The mapping relation of (a) is the interference level of M (j), and n represents that the deep learning task to be processed is divided into n target subtasks. Will be from G_iTo G_jIs defined as cc_ij，i∈{1，...，m}，j∈{1，...，m}，G_iRepresents the ith GPU, G in the target computing node_jAnd representing the jth GPU in the target computing node. The communication cost may be calculated from the available bandwidth between the physical GPUs, where high available bandwidth means low communication cost and low available bandwidth means high communication cost.

Sub-task to be targeted

And target subtasks

The communication requirement between is defined as cr_ijThe communication requirement definition is equal to the amount of data required to update the model. Obtaining the target subtask according to the following formula

And target subtasks

Communication cost therebetween:

c (M) represents the communication cost when the mapping relation is M.

And step 160, determining the execution GPU of each target subtask in the GPU of the target computing node according to the interference level of each target subtask and the communication cost.

By considering the interference level and the communication cost of the target subtask, unbalanced resource distribution on the GPU in the computing node is avoided, high parallelization of the deep learning task is realized, the resource utilization rate of the GPU cluster is improved, and the execution efficiency of the deep learning task is improved.

In one possible implementation, the determining, in the GPU of the target computing node, an execution GPU of each target subtask according to the interference level of each target subtask and the communication cost includes:

in the formula

when in use

When the mapping relation is M, the corresponding GPU is the target GPU.

For example, the particle swarm optimization algorithm initializes a population of particles, which represent the target subtasks to be assigned. Then we set G ═ G from GPU₁…G_mRandomly initializing particle positions and randomly initializing particle velocities, the fitness value assigned to each dimension of a particle is the number of the GPU, e.g. k is the kth GPU, i.e. the particle represents a mapping of multiple tasks to the GPU.

Each particle is determined by its current position

And current speed

Indicating and each particle knows the best position found by the particle pbest as well as the global best position gbest in the entire particle population so far. The principle of the algorithm is to move the particles to find the optimal solution. Each particle position is affected by its best position pbest and global best position gbest. Each particle may update its best position pbest using the fitness value calculated by optimizing the fitness function in each generation. During each iteration, each instance may update its velocity and position using the following formulas.

Where ω is the inertial weight for maintaining the particle, c₁And c₂Is the acceleration coefficient, r₁And r₂Is a random number between 0 and 1.

Using an objective function

To perform an evaluation of each particle and then update the velocity and position of each particle according to the following two formulas:

and (5) performing iteration until the specified iteration times are reached or the iteration precision is met, and finding the optimal GPU.

And determining a target GPU of the target subtask in the target computing node according to a preset optimization algorithm, the interference level and the communication cost, so that the resource utilization rate of a GPU cluster can be improved to the maximum extent, and the execution efficiency of a deep learning task can be improved to the maximum extent.

The method comprises the steps of firstly analyzing the similarity of a deep learning task to be processed and each computing node of a GPU cluster, determining a target computing node of the deep learning task to be processed in the GPU cluster, fully considering the similarity between the deep learning task to be processed and other tasks to reduce the possibility of computing node resource contention, thereby improving the utilization rate and the execution efficiency of deep learning task system resources, dividing the deep learning task to be processed into a plurality of target subtasks according to the number of GPUs required by the deep learning task to be processed, analyzing the interference level and the communication cost of the target subtasks, thereby determining the target GPU of the target subtasks in the target computing node, avoiding the resource distribution imbalance on the GPU in the computing nodes by considering the interference level and the communication cost of the target subtasks, realizing the high parallelization of the deep learning task, and improving the resource utilization rate of the GPU cluster, meanwhile, the execution efficiency of the deep learning task is improved.

An embodiment of the present application further provides a device, referring to fig. 2, where fig. 2 is a schematic diagram of a GPU cluster deep learning task parallelization device according to an embodiment of the present application, where the device includes:

the acquisition module 210 is configured to acquire a deep learning task to be processed and task set information of a GPU cluster, where the task set information of the GPU cluster includes task information of each task in a waiting queue of each compute node of the GPU cluster and task information of each subtask being executed by each GPU in each compute node of the GPU cluster;

a first analysis module 220, configured to analyze task information of the deep learning task to be processed and task information of each task in a waiting queue of each computing node of the GPU cluster, and obtain similarities between the deep learning task to be processed and each computing node of the GPU cluster, respectively;

a computing node determining module 230, configured to determine, according to the similarity, a target computing node of the to-be-processed deep learning task in the GPU cluster;

the subtask module 240 is configured to divide the deep learning task to be processed into a plurality of target subtasks according to the number of GPUs required by the deep learning task to be processed;

a second analysis module 250, configured to analyze task information of the target subtask and task information of each subtask being executed by each GPU in the target computing node, and obtain an interference level and a communication cost of each target subtask, respectively;

and a GPU determining module 260, configured to determine, in the GPUs of the target computing nodes, execution GPUs of the target subtasks respectively according to the interference levels of the target subtasks and the communication costs.

In a possible implementation manner, the first analysis module 220 is specifically configured to:

and obtaining the similarity between the deep learning task to be processed and each computing node of the GPU cluster according to the similarity between the deep learning task to be processed and each task in each computing node waiting queue of the GPU cluster. In a possible implementation manner, the second analysis module 250 is specifically configured to:

In a possible implementation manner, the second analysis module 250 is specifically configured to:

in the formula

when in use

When the mapping relation is M, the corresponding GPU is the target GPU.

An embodiment of the present application further provides an electronic device, see fig. 3, including: a processor 310, a communication interface 320, a memory 330 and a communication bus 340, wherein the processor 310, the communication interface 320 and the memory 330 are communicated with each other through the communication bus 340,

the memory 330 is used for storing computer programs;

the processor 310 is configured to implement the following steps when executing the computer program stored in the memory 330:

analyzing the task information of the deep learning task to be processed and the task information of each task in the waiting queue of each computing node of the GPU cluster to respectively obtain the similarity between the deep learning task to be processed and each computing node of the GPU cluster;

and respectively determining the execution GPU of each target subtask in the GPU of the target computing node according to the interference level of each target subtask and the communication cost.

For example, the processor 310 of the electronic device includes a GPU cluster composed of a central control unit and a plurality of GPUs, where the GPU cluster includes a plurality of compute nodes, each compute node is composed of a plurality of GPUs, the central control unit includes a data collector, a cluster management unit, and a node management unit, and the electronic device is configured to process a plurality of deep learning tasks in parallel in the GPU cluster. The data acquisition unit acquires task set information of a deep learning task to be processed and a GPU cluster, the cluster management unit analyzes the task information of the deep learning task to be processed and task information of each task in a waiting queue of each computing node of the GPU cluster to respectively obtain the similarity of the deep learning task to be processed and each computing node of the GPU cluster, and determines a target computing node of the deep learning task to be processed in the GPU cluster according to the similarity, the node management unit divides the deep learning task to be processed into a plurality of target subtasks according to the number of GPUs required by the deep learning task to be processed, then analyzes the task information of the target subtasks and the task information of each subtask being executed by each GPU in the target computing nodes to respectively obtain the interference level and the communication cost of each target subtask, and finally, respectively determining the execution GPU of each target subtask in the GPU of the target computing node according to the interference level of each target subtask and the communication cost.

Optionally, when the processor 310 is configured to execute the program stored in the memory 330, any of the GPU cluster deep learning task parallelization methods described above may also be implemented.

The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.

The communication interface is used for communication between the electronic equipment and other equipment.

The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.

The Processor may be a general-purpose Processor, including a Central Processing Unit (CPU), a Graphic Processing Unit (GPU), a Network Processor (NP), and the like; but also Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components.

In an embodiment of the present application, a storage medium is further provided, where instructions are stored in the storage medium, and when the instructions are executed on a computer, the computer is caused to execute any GPU cluster deep learning task parallelization method in the foregoing embodiments.

In an embodiment of the present application, a computer-readable storage medium is further provided, where instructions are stored in the computer-readable storage medium, and when the instructions are executed on a computer, the computer is caused to execute any of the GPU cluster deep learning task parallelization methods in the foregoing embodiments.

It should be noted that, in this document, the technical features in the various alternatives can be combined to form the scheme as long as the technical features are not contradictory, and the scheme is within the scope of the disclosure of the present application. Relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the same element.

All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the embodiments of the apparatus, the electronic device, and the storage medium, since they are substantially similar to the method embodiments, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiments.

The above description is only a preferred embodiment of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application are included in the protection scope of the present application.

Claims

1. A GPU cluster deep learning task parallelization method is characterized by comprising the following steps:

acquiring task set information of deep learning tasks to be processed and GPU clusters, wherein the task set information of the GPU clusters comprises task information of tasks in waiting queues of computing nodes of the GPU clusters and task information of subtasks being executed by GPUs in the computing nodes of the GPU clusters;

respectively determining the execution GPU of each target subtask in the GPU of the target computing node according to the interference level and the communication cost of each target subtask;

the analyzing the task information of the target subtask and the task information of each subtask being executed by each GPU in the target computing node to obtain the interference level and the communication cost of each target subtask respectively includes:

respectively calculating the communication cost of the target subtask according to the available bandwidth among the GPUs and the data volume required by the updated model;

wherein the target subtask T is calculated using the following formula_i ^jAverage performance degradation assigned to one GPU;

wherein slow down_jkRepresenting the target subtask T_i ^jThe average performance degradation distributed to the kth GPU in the target computing node is reduced, and numk represents the number of commonly executed subtasks on the kth GPU in the target computing node; vector z_tEach element in the set representing a performance degradation of the target subtask when executed in conjunction with other subtasks; the performance is degraded toA ratio of a completion time of the target subtask when run with the other subtask to a completion time of the target subtask when run alone;

defining an interference level as

Wherein I (M (j)) represents the jth target subtask T_i ^jThe mapping relation of (a) is the interference level of M (j), and n represents that the deep learning task to be processed is divided into n target subtasks.

2. The method of claim 1, wherein the task information of the deep learning task to be processed comprises: the maximum CPU utilization rate of the deep learning task to be processed, the host memory utilization rate of the deep learning task to be processed, the I/O throughput of the deep learning task to be processed, the GPU utilization rate of the deep learning task to be processed, the device memory utilization rate of the deep learning task to be processed, the bandwidth utilization rate of the deep learning task to be processed, the number of samples analyzed in each step of the deep learning task to be processed, and the size of a data set of the deep learning task to be processed.

3. The method according to claim 1, wherein the analyzing task information of the deep learning task to be processed and task information of each task in a waiting queue of each compute node of the GPU cluster to obtain similarity between the deep learning task to be processed and each compute node of the GPU cluster respectively comprises:

inputting the deep learning task information to be processed and the task information of each task in each computing node waiting queue of the GPU cluster into a preset similarity prediction model to respectively obtain a feature vector of the deep learning task to be processed and a feature vector of each task in each computing node waiting queue of the GPU cluster;

4. The method of claim 1, wherein the determining, in the GPU of the target compute node, the execution GPU for each of the target subtasks based on the interference level and the communication cost for each of the target subtasks comprises:

in the formula

For the objective function value, I (M) represents the interference level when the mapping relation is M, and C (M) represents the channel when the mapping relation is MA cost, α is the weight of i (m), β is the weight of c (m), and follows α + β ═ 1;

when in use

And when the mapping relation is M, the corresponding GPU is the target GPU.

5. The method according to any one of claims 1 to 4, wherein the determining, in the GPU of the target computing node, the execution GPU of each of the target subtasks according to the interference level and the communication cost of each of the target subtasks respectively comprises:

6. A GPU cluster deep learning task parallelization device is characterized by comprising:

a GPU determining module, configured to determine, in the GPU of the target computing node, a GPU for executing each of the target subtasks, according to the interference level and the communication cost of each of the target subtasks;

the second analysis module is specifically configured to map the multiple target subtasks to each GPU in the target computing node, so as to obtain a mapping relationship between each target subtask and each GPU; respectively inputting the target subtask and each subtask being executed by each GPU into a preset performance prediction model, and respectively obtaining the performance degradation when the target subtask and each subtask being executed by each GPU are executed together; respectively calculating the interference level of the target subtask according to each performance degradation; respectively calculating the communication cost of the target subtask according to the available bandwidth among the GPUs and the data volume required by the updated model;

wherein slow down_jkRepresenting the target subtask T_i ^jThe average performance degradation distributed to the kth GPU in the target computing node is reduced, and numk represents the number of commonly executed subtasks on the kth GPU in the target computing node; vector z_tEach element in the set representing a performance degradation of the target subtask when executed in conjunction with other subtasks; the performance degradation is a ratio of a completion time of the target subtask when run with other subtasks to a completion time of the target subtask when run alone;

defining an interference level as

7. The apparatus of claim 6, wherein the task information of the deep learning task to be processed comprises: the processing method comprises the following steps of obtaining a maximum CPU utilization rate of the deep learning task to be processed, a host memory utilization rate of the deep learning task to be processed, I/O throughput of the deep learning task to be processed, GPU utilization rate of the deep learning task to be processed, device memory utilization rate of the deep learning task to be processed, bandwidth utilization rate of the deep learning task to be processed, the number of samples analyzed in each step of the deep learning task to be processed, and the size of a data set of the deep learning task to be processed.

8. An electronic device, comprising: a processor, a communication interface, a memory, and a communication bus, wherein,

a memory for storing a computer program;

a processor configured to implement the GPU cluster deep learning task parallelization method of any of claims 1-5 when executing a program stored on a memory.

9. A storage medium having stored therein a computer program which, when executed by a processor, implements the GPU cluster deep learning task parallelization method of any of claims 1-5.